Sunday, August 13, 2006

Don't be afraid of mdb

Many Solaris system admins or developers would know that Solaris has some very good debugging tools. Most sysadmins would know there is a command called mdb. Sadly most would have either never used it, or was scared off when they scanned through the documentation. While using mdb does require a good knowledge of the Solaris internals, and some assembly language skills, there are times where it is probably the only (or best) tool for the job.

Consider the case where you have an application that your company has been using for a long time. Something has changed on the system, and now it crashes when it is run. Since the person who wrote the application now does not work for your company anymore and nobody knows where the source code is, you have a problem. To make things worst, when you do a pstack on the core file, you find that they have “stripped” the binary of its symbol table to save a few bytes. Your options are now really limited to do any useful debugging. Enter 'mdb'....

Now to simulate this I have created a small C program, with a null pointer buried in a couple of functions. I compile the program (not using any optimizations as the compiler will inline all of the functions as they are very small), and then run the strip command on it. During running the program we get not a very useful error message, and a core dump. Argggg!

Running a pstack on the binary, because it was stripped, pstack returns an address with “????????”, as the function name. Ah, it is now turned into a challenge.
doug@bangkok> /bin/pstack core
core 'core' of 101996: ./makecore
080506d5 ???????? (8060898, 80470fc, 805071b, 804720c, 8047124, 805062a)
080506f4 ???????? (804720c, 8047124, 805062a, 1, 8047130, 8047138)
0805071b main (1, 8047130, 8047138) + b
0805062a _start (1, 8047274, 0, 804727f, 804728c, 80472c6) + 7a
You can get the similar output from the “::stack” command, within mdb.
doug@bangkok> mdb core
Loading modules: [ libc.so.1 ld.so.1 ]
> ::stack
0x80506d5(8060898, 80470fc, 805071b, 804720c, 8047124, 805062a)
(804720c, 8047124, 805062a, 1, 8047130, 8047138)
main+0xb(1, 8047130, 8047138)
_start+0x7a(1, 8047274, 0, 804727f, 804728c, 80472c6)
Since there is nothing in human readable form, at this point most people would look elsewhere or through it in the too hard basket. If you know a little assembly language (32bit x86 in this case), you should probably continue on. A good starting point would be the assembly listing of the function where it bombs out. The first address “0x80506d5” is for the instruction where we bombed out. Doing a disassemble backwards from this address is tedious, especially if this instruction is a long way from the beginning. The address on the next line “0x80506f4” is actually more useful. It is the return address of the function, which should be the next instruction after the function call. The function calling code should be immediately before this. Lets attack it with the disassembler built into 'mdb' byte by byte.
> 80506f4::dis
0x80506f4: movl -0x4(%ebp),%eax
> 80506f3::dis
0x80506f3: decl 0xe850fc45(%ebx)
> 80506f2::dis
0x80506f2: ***ERROR--unknown op code***
> 80506f1::dis
0x80506f1: ***ERROR--unknown op code***
> 80506f0::dis
0x80506f0: int $0x3
> 80506ef::dis
0x80506ef: call -0x34 <0x80506c0>
Bingo! We have a winner - 0x80506c0. You will probably notice, the "call" op-code (1 byte) was followed by a 4 byte address, so we could have first tried the addess – 5. In my case the command inside of mdb would have been “80506f4-5::dis”.

Now we have an address, we can now easily list the function from the start.
> 80506c0::dis
0x80506c0: pushl %ebp
0x80506c1: movl %esp,%ebp
0x80506c3: subl $0x4,%esp
0x80506c6: movl $0x0,-0x4(%ebp)
0x80506cd: movl $0x63,%eax
0x80506d2: movl -0x4(%ebp),%edx
0x80506d5: movb %al,0x0(%edx)
0x80506d8: leave
0x80506d9: ret
0x80506da: nop
We can eithen add the function into mdb's user-defined symbol table, so we can now see symbolic names, rather than hex addresses. The rough comments we added by me :)
> 80506c0::nmadd -f -e 80506da badfunc
added badfunc, value=80506c0 size=1a
> badfunc::dis
badfunc: pushl %ebp ; save frame pointer to the stack
badfunc+1: movl %esp,%ebp ; copy stack pointer to frame pointer
badfunc+3: subl $0x4,%esp ; make room for the pointer - char *p
badfunc+6: movl $0x0,-0x4(%ebp) ; initialize pointer to null - p=(char*)NULL;
badfunc+0xd: movl $0x63,%eax ; copy 'c' to %eax register
badfunc+0x12: movl -0x4(%ebp),%edx ; copy pointer to register %edx - now = 0
badfunc+0x15: movb %al,0x0(%edx) ; *p = 'c' - Hmmm copy 'c' to address 0 - BAD!!!
badfunc+0x18: leave ; cleanup function call
badfunc+0x19: ret ; return to calling function
> ::stack
badfunc+0x15(8060898, 80470fc, 805071b, 804720c, 8047124, 805062a)
0x80506f4(804720c, 8047124, 805062a, 1, 8047130, 8047138)
main+0xb(1, 8047130, 8047138)
_start+0x7a(1, 8047274, 0, 804727f, 804728c, 80472c6)
> 80506f4-5::dis
0x80506ef: call -0x34
From a quick look at my disassembled code, it is clear that some idiot created a null pointer , and then tried to copy a byte to there. Not very bright eh! In a real world example you would probably need to run the command in mdb and set the breakpoint to the start of the function. From there you could step through the code to see what is does. It would go something like this -
doug@bangkok> mdb ./makecore
> 80506c0::nmadd -f -e 80506da badfunc ; Add our own symbol from above
added badfunc, value=80506c0 size=1a
> badfunc:b ; Set a breakpoint at the beginning of badfunc
> :r ; run ./makecore in the debugger
mdb: stop at badfunc
mdb: target stopped at:
badfunc: pushl %ebp
> :s ; Step through code, 1 step at a time
mdb: target stopped at:
badfunc+1: movl %esp,%ebp
> :s
mdb: target stopped at:
badfunc+3: subl $0x4,%esp
> :s
mdb: target stopped at:
badfunc+6: movl $0x0,-0x4(%ebp)
> :s
mdb: target stopped at:
badfunc+0xd: movl $0x63,%eax
> ::regs ; Check the registers - Hmmm. %edx = 0
%cs = 0x003b %eax = 0x08060898
%ds = 0x0043 %ebx = 0xfeffa7c0
%ss = 0x0043 %ecx = 0xfefa9768 libc.so.1`_sse_hw
%es = 0x0043 %edx = 0x00000000
%fs = 0x0000 %esi = 0x080470e0
%gs = 0x01c3 %edi = 0x08047204

%eip = 0x080506cd badfunc+0xd
%ebp = 0x080470e4
%kesp = 0x00000000

%eflags = 0x00000202
id=0 vip=0 vif=0 ac=0 vm=0 rf=0 nt=0 iopl=0x0
status=

%esp = 0x080470e0
%trapno = 0x1
%err = 0x0
To find the best reference on the guts of Solaris and how to make the best use mdb and other Solaris tools such as DTrace. Go and purchase the just released 2nd edition of Solaris Internals and it's new companion Solaris Performance and Tools. You can save 30% by buying the through Sun. While you are stacking your bookself, you should also consider Solaris System Programming, and Sun Performance and Tuning. Some light reading :)

1 comment:

Anonymous said...

Yep, I was afraid of mdb. But I was compelled to use it after wasting a day trying to pinpoint an address alignment bug in my code to no avail. Struggled to find a understandable newbie mdb doc. The man page was cryptic. The Sun docs were voluminous. Your blog entry was perfect. After reading it and just using ::stack and "addy"::dis, I zeroed in on the errant line of code in minutes. Thanks, Doug!