When a Solaris application crashes, it usually produces a core file, which is a disk copy of the application's memory at the time of the crash.
One way to generate a traceback is to use a debugger such as dbx with the core file:
% dbx /path/executable core
(dbx) where > traceback.txt
(dbx) quit
In addition, starting with Solaris 8, the pstack(1) utility can also print out a traceback from the core file, just like dbx:
% pstack core > traceback.txt
If you can't use dbx or pstack with a core file for any reason, read on for a way of generating a traceback directly from the application.
Handling application core files
Generally, dealing with application core files is difficult for the following reasons:
- No debugger: Many user sites don't have access to a debugger such as dbx. It costs money, needs maintenance, and in most cases isn't really needed there.
- Big core files: Saving a core file generated by a large application can require a lot of disk space, which may not be available. Also, sending such core files to the software developers may be hard because of the core file's large size.
- Unusable core files: Interpreting the core files on a machine different from where the core file was created is often impossible because no two systems are exactly alike. There are always differences in hardware, OS versions, and patch levels, which can make the debugger refuse to read a core file from a different machine. Therefore, sending the core files to the software developer usually doesn't help.
Generating a traceback from the application
The solutions described below use pstack(1), which is specific to Solaris. However, the same methods can be used with other operating systems where similar functionality is available.
Note that generating a traceback on crash provides only one more piece of the puzzle in determining why the application has crashed and how to fix the problem. Nevertheless, such a traceback may lead to the underlying problem causing the crash, or at the very least to determining whether the problem is in the application or in the system. In any case, getting a traceback is a step in the right direction.
My recent article "Building Library Interposers for Fun and Profit" described how to build library interposers and use them for various debugging and performance tuning tasks. Some of the tools described in this article are additional applications of the library interposition technology.
If the application does not have signal handlers for SIGSEGV and SIGBUS installed, the following simple library interposer produce_traceback.so is all you need to automatically generate a traceback any time the application crashes.
You can determine whether the application has those signal handlers installed or not using the psig(1) command. Here is an example for a dtpad (CDE editor) process:
% psig 10630 | egrep ":|^SEGV|^BUS"
10630: dtpad /tmp/test1.c
BUS default
SEGV default
That tells us that dtpad does not have signal handlers for those two signals. Of course, dtpad is only an example. You can use all techniques described in this article with any application on Solaris.
If the application does have a signal handler installed for the signal causing the crash, see the next section of this article.
produce_traceback.c
produce_traceback.so
Note: If you use Netscape Navigator as your browser, you can download binary files such as produce_traceback.so by clicking the left mouse button while holding a Shift key.
In that interposer, I install signal handlers for SIGSEGV and SIGBUS and then call system(3S), invoking pstack(1) with the current application's process ID. Note the use of the #pragma init construct to specify a routine to be invoked before the application starts. That feature allows the interposer to work with any application.
You can add any other fatal signal to the list specified in produce_traceback.c, although usually SIGSEGV and SIGBUS are enough for the purpose.
Also note that system(3S) calls vfork(2) internally rather than fork(2). That means that the application is unlikely to run out of memory when it calls system(pstack), even with very large applications. The vfork(2) function doesn't copy the parent process' address space and only consumes enough memory to run the command specified in the system(3S) call.
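The approach described above can be condensed into a rough sketch. This is not the contents of produce_traceback.c; it is an illustration assembled from the behavior described in this article. The name handle_crash and the two printed messages follow the sample session in this article, while format_pstack_command and install_handlers are hypothetical names chosen for the sketch. The #pragma init line is specific to the Sun compiler, so other compilers would need their own equivalent.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: build the shell command that runs pstack on a PID. */
static int format_pstack_command(char *buf, size_t size, long pid)
{
    return snprintf(buf, size,
                    "/usr/proc/bin/pstack %ld | /bin/tee /var/tmp/traceback.txt",
                    pid);
}

/* Runs when a fatal signal arrives: print the traceback, then exit. */
static void handle_crash(int sig)
{
    char cmd[128];
    static char clear_preload[] = "LD_PRELOAD=";

    /* printf is used for brevity; it is not async-signal-safe. */
    printf("Processing signal %d\n", sig);
    format_pstack_command(cmd, sizeof cmd, (long)getpid());
    putenv(clear_preload);   /* keep the interposer out of the child processes */
    system(cmd);
    printf("A copy of the traceback is stored in /var/tmp/traceback.txt.\n");
    exit(1);
}

/* Hypothetical name: installs the handlers before the application starts. */
void install_handlers(void)
{
    signal(SIGSEGV, handle_crash);
    signal(SIGBUS, handle_crash);
}

#ifdef __SUNPRO_C
#pragma init(install_handlers)   /* Sun compiler: call this at load time */
#endif
```

Adding another fatal signal to the sketch means adding one more signal() call in install_handlers.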
I built the produce_traceback.so library under Solaris 2.6, so it'll work with that and later Solaris versions. To use the interposer, simply define LD_PRELOAD pointing to the interposer library like this (using the C-shell syntax in this example):
% setenv LD_PRELOAD /full_path/produce_traceback.so
[run the application here]
If you have access to a Sun compiler and want to build the interposer library yourself, you can do it this way:
% cc -o produce_traceback.so -G -Kpic produce_traceback.c
Here is an example of using the interposer. I'm artificially sending a SIGBUS (10) signal to a Netscape process, thus causing a crash.
% setenv LD_PRELOAD ./produce_traceback.so
% /opt/netscape/netscape &
[1] 28966
% kill -10 28966
Processing signal 10
28966: /opt/netscape/netscape
ef2b927c waitid (0, 7129, ef2013e0, 103)
ef2d4254 _libc_waitpid (7129, ef2014c8, 100, 0, ef323180, ef2e8f48) + 54
ef2e8f48 system (ef2016a0, e16c28, ef325b30, ef323180, 0, 0) + 1f4
ef7a07a8 handle_crash (a, 0, ef2017e0, 0, 0, 0) + 60
ef2b8a0c sigacthandler (a, 0, ef2017e0, 20, 0, 200) + 28
ef2ccc74 select (ef201b18, ef3260fc, ef3260fc, ef326100, ef326100, 9) + 280
008f6e00 _OS_SELECT (9, ef203d70, 0, 0, ef203c68, 8f7bbc) + 14
008f7bf8 _PR_PauseForIO (0, 8, ffffffff, ffffffff, 0, 0) + 4a0
008f7e00 _PR_Idle (0, 0, 0, 0, 0, 0) + 20
008f6070 HopToad (8f7de0, 0, 0, e44830, 0, 0) + 14
008f60a8 HopToadNoArgs (1, 0, 0, 0, 0, 0) + 20
00000000 ???????? (0, 0, 0, 0, 0, 0)
A copy of the traceback is stored in /var/tmp/traceback.txt.
[1] Exit 1 /opt/netscape/netscape
Note that the Netscape executable is stripped:
% file /opt/netscape/netscape
/opt/netscape/netscape: ELF 32-bit MSB executable SPARC Version 1,
dynamically linked, stripped
See "Dealing with hidden function names" below for a discussion of stripped executables.
If the application has signal handlers for SIGSEGV and SIGBUS
If the application installs its own signal handler for the signal generating the crash, the produce_traceback.c interposer described above will not generate a traceback. The application's handler will replace the one installed by produce_traceback.c.
There are two ways to circumvent that problem:
- In addition to installing the signal handlers for the fatal signals, interpose on signal(3C), sigset(3C), and sigaction(2). When the application calls one of those functions to install a handler for SIGSEGV or SIGBUS, simply return without installing it. Here is the modified source code and the new version of the interposer library:
produce_traceback2.c
produce_traceback2.so
Clearly, that is a hack, which probably should not be used in production. Ignoring the SIGSEGV or SIGBUS handler installed by the application changes the intent of the application's authors. However, that trick serves our debugging purposes quite well.
- If you can change the application (or have software developers change it), simply add something like the following code to those signal handlers before exiting or crashing the application:
char buf[128];
sprintf( buf,
"/usr/proc/bin/pstack %d | /bin/tee /var/tmp/traceback.txt\n",
(int)getpid() );
system(buf);
Dealing with hidden function names
If the application executable is stripped (using the strip(1) command), both pstack(1) and dbx will automatically use the dynamic symbol table and will print some or even most of the traceback anyway. The example with Netscape shown above demonstrates that.
For the stripped executable's symbols that pstack(1) can't find in the dynamic symbol table (static functions), we need a special solution.
In addition to stripping, some applications hide their function names in another way. They use the linker mapfile (-M) option to turn most of their routines into local symbols. They do it for security and performance reasons, among others, and doing so is often a good idea. (See Solaris Linker and Libraries Guide for details.) Neither pstack(1) nor dbx can print out the function names hidden that way because the information is simply not available at runtime.
When it can't determine the function name, pstack(1) will print a ???????? string in its place; dbx will print such a function's hexadecimal address only. Obviously, those aren't useful for debugging.
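Any tool that post-processes such a traceback first has to spot the frames with missing names. The following is a minimal sketch in C, assuming the pstack line layout seen in the sample output in this article: a hexadecimal address followed by the function name. The name is_blind_frame is hypothetical, and the sketch does not handle dbx output.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: inspect one line of pstack output.
 * Returns 1 if the line is a frame whose name pstack could not
 * determine, storing the frame address in *pc; returns 0 otherwise.
 * *pc is meaningful only when the function returns 1.
 */
static int is_blind_frame(const char *line, unsigned long *pc)
{
    char name[64];

    if (sscanf(line, "%lx %63s", pc, name) != 2)
        return 0;   /* not an "address name ..." line */
    return strcmp(name, "????????") == 0;
}
```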
To solve that problem, I've written a utility in Perl, which recovers the lost function names when you or the software developer have access to the unstripped version of the same executable as the one that crashed.
Here is the Perl source code of that utility.
unstrip_traceback
When the application crashes, you can produce a traceback, either with one of the library interposers described above or from the core file, and send the result to the appropriate software development or support organization. They can use the unstrip_traceback utility with the unstripped version of the same executable to convert the blind spots to the real function names. Here is the syntax:
% unstrip_traceback traceback.txt unstripped_executable
traceback.txt is the name of the file containing the traceback with the blind functions, and unstripped_executable is the path to the unstripped executable.
Internally, the unstrip_traceback utility calls nm(1), sorts function addresses and corresponding names, and then performs a binary search for each blind spot in the original traceback, replacing it with the real function name. It automatically detects whether dbx or pstack was used to produce the blind traceback.
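The lookup step can be illustrated with a short sketch. The real utility is written in Perl; the C below only mirrors the idea described above (sort the symbols by address, then binary-search for the function that contains a given frame address), and struct symbol, compare_symbols, and find_symbol are hypothetical names.

```c
#include <stdlib.h>

/* One entry of the symbol table, as it might be read from nm(1) output. */
struct symbol {
    unsigned long addr;   /* start address of the function */
    const char *name;
};

/* qsort(3C) comparison: ascending by start address. */
static int compare_symbols(const void *a, const void *b)
{
    const struct symbol *sa = a;
    const struct symbol *sb = b;

    return (sa->addr > sb->addr) - (sa->addr < sb->addr);
}

/*
 * Binary search in a table sorted by compare_symbols. Returns the
 * symbol with the greatest start address not above pc, or NULL if
 * pc lies below every symbol. The offset within the function is
 * pc - result->addr.
 */
static const struct symbol *find_symbol(const struct symbol *table,
                                        size_t count, unsigned long pc)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].addr <= pc)
            lo = mid + 1;   /* candidate; look for a closer one above */
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &table[lo - 1];
}
```

The sketch assumes every address between two symbols belongs to the lower one, which is a simplification: it ignores symbol sizes and any gaps between functions.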
Here is an example of unstrip_traceback usage. Program test1.c contains two static functions. To produce a crash and a traceback, I intentionally dereference a NULL pointer to cause a segmentation violation. The example was run under Solaris 8.
test1.c
% cc -o test1 test1.c
% cp test1 test1_unstripped
% strip test1
% test1 | tee traceback.txt
503: test1
ff31a5ac waitid (0, 1f9, ffbeece0, 103)
ff2d4d88 _waitpid (0, ffbeedc8, 100, ffbeedc8, 23208, ff310120) + 60
ff310134 system (ffbeef98, 10a28, 1f7, 0, 0, 0) + 204
0001089c handler (b, 0, ffbef098, ff338000, 0, 0) + 2c
ff319834 sigacthandler (b, 0, ffbef098, 0, 0, 0) + 28
--- called from signal handler with signal 11 (SIGSEGV) ---
000108d4 ???????? (0, b, ffbef3f8, ffbef3d8, 231e0, ff2ccf88)
00010900 ???????? (0, 10870, b, 5, 23664, ff29b68c)
0001093c main (1, ffbef4e4, ffbef4ec, 20800, 0, 0) + 1c
00010848 _start (0, 0, 0, 0, 0, 0) + 108
% unstrip_traceback traceback.txt test1_unstripped
This is a pstack traceback
Running nm for test1_unstripped ...
Searching for the functions missing from traceback ...
503: test1
ff31a5ac waitid (0, 1f9, ffbeece0, 103)
ff2d4d88 _waitpid (0, ffbeedc8, 100, ffbeedc8, 23208, ff310120) + 60
ff310134 system (ffbeef98, 10a28, 1f7, 0, 0, 0) + 204
0001089c handler (b, 0, ffbef098, ff338000, 0, 0) + 2c
ff319834 sigacthandler (b, 0, ffbef098, 0, 0, 0) + 28
--- called from signal handler with signal 11 (SIGSEGV) ---
000108d4 sub2 (0, b, ffbef3f8, ffbef3d8, 231e0, ff2ccf88) + 0xc
00010900 sub (0, 10870, b, 5, 23664, ff29b68c) + 0x8
0001093c main (1, ffbef4e4, ffbef4ec, 20800, 0, 0) + 1c
00010848 _start (0, 0, 0, 0, 0, 0) + 108
%
As you can see, unstrip_traceback has restored the lost names of static functions sub and sub2 and even the hexadecimal offsets within those functions.
Working around possible complications
A slight possibility exists that sprintf(3S) will cause a deadlock in a multithreaded application when called from a signal handler. The sprintf(3S) function is not declared Async-Signal-Safe. To eliminate that slight possibility, you can avoid using sprintf(3S) by converting the process ID to a character string on your own.
Also, strictly speaking, printf(3S) is not supposed to be called from a signal handler because it calls malloc(3C). In practice, however, Solaris printf(3S) (at least, the current version) only calls malloc(3C) when the given output buffer is larger than 1,024 bytes. Therefore, it's usually safe to call printf(3S) from a Solaris signal handler. That is especially true for debugging tools such as our interposers, where the code doesn't have to be nearly as robust as in production.
In any case, it's easy enough to circumvent those potential problems. For example:
/* a file in /var/tmp survives a reboot; a file in /tmp does not */
/* the spaces after "pstack" reserve room for the process ID */
/*          0 2345678 1 2345678 2 2345678 3 */
char buf[]="/usr/proc/bin/pstack           |/bin/tee /var/tmp/traceback.txt\n";
char cdigits[]="0123456789";
int n, i;
n = (int)getpid();
i = 30; /* index of the last reserved character in buf */
while ( n !=0 ) /* keep dividing by 10 */
{
    buf[i--] = cdigits[n%10];
    n /= 10;
}
(void)putenv("LD_PRELOAD=");
system(buf);
/* write(2) is safe in a signal handler, while printf(3S) is not */
/*       0 2345678 1 2345678 2 2345678 3 2345678 4 2345678 5 2345678 6 */
write(1,"A copy of the traceback is stored in /var/tmp/traceback.txt\n",60);
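The fragment above writes the digits at a fixed position in the command string. The same conversion can also be packaged as a small helper that does not depend on column positions. The following is a sketch, not part of the original interposer; format_decimal is a hypothetical name. It calls no library functions, so it raises none of the signal-safety concerns discussed above.

```c
#include <stddef.h>

/*
 * Hypothetical helper: write the decimal form of a non-negative value
 * into out, followed by a terminating null byte. Returns the number
 * of digits written, or 0 if the value is negative or the buffer is
 * too small.
 */
static size_t format_decimal(long value, char *out, size_t size)
{
    char digits[24];   /* enough for a 64-bit long */
    size_t n = 0, i;

    if (value < 0)
        return 0;
    do {               /* collect digits, least significant first */
        digits[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    if (n + 1 > size)
        return 0;
    for (i = 0; i < n; i++)   /* copy them out in reading order */
        out[i] = digits[n - 1 - i];
    out[n] = '\0';
    return n;
}
```

A handler could call format_decimal((long)getpid(), ...) and then assemble the pstack command by copying the pieces into one buffer by hand.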
One more possible problem with the produce_traceback library interposers when dealing with multithreaded applications is the fact that system(3S) is not MT-safe. However, an application is highly unlikely to be in the middle of another system(3S) call when a SIGSEGV or a SIGBUS signal arrives and its handler is invoked.
In the worst case, even if a deadlock occurs, the application will hang. You can still determine the process ID (using ps(1)) and run /usr/proc/bin/pstack pid manually, which will produce the traceback anyway.
Acknowledgments
I'd like to thank the following Sun engineers for their help with the traceback generation issues: Morgan Herrington, Michael Shapiro, Ivan Soleimanipour, and Michael Walker.
Resources