How to debug errors

When developing an operating system on top of L4, you do not have the luxury of a source-level debugger such as gdb. However, you do have the L4 kernel debugger. It can be quite useful if you know how to use it, and it is certainly better than printf-style debugging alone.

Kernel debugger

If your operating system causes certain types of faults (such as the unhandled exception below), you are likely to be dropped into the L4 kernel debugger. Once you are in the kernel debugger, the thing you most likely want to do is find out which line of code caused the error. Example output from an exception is shown below.

:Exception occured: 0x07, EPC=0x0000000000a013f0 VA=0x000000001c800030
STATUS=0x040180e0
--- KD# Unhandled Exception ---

-- current ASID is 2, CPU 0 -- 
    

The first thing you want to do is show the thread control block (TCB) of the thread that caused the exception. The command 't' prints the current TCB, e.g.:

> showtcb                                                          
tcb/tid [current]: current
=== TCB: 400000000002a000 === ID: 0000002a00000001 = 000000010000a880/ffffffff80ae6800 === PRIO: 0x64 ========
UIP: 0000000000a013f0   queues: Rswl      wait : 0000000000000000:0000000000000000   space: ffffffff80ab6000  
USP: 0000000000a5ab78   tstate: RUNNING   ready: 400000000002a000:400000000002a000   pdir : 0000000000000000
KSP: 400000000002ad58   sndhd : 0000000000000000  send : 0000000000000000:0000000000000000   pager: 0000002900000001
total quant:            0us, ts length:          10000us, curr ts:        10000us
abs timeout:            0us, rel timeout:            0us                         
sens prio: 100, delay: max=0us, curr=0us                
resources: 0000000000000000 []          
partner: 0000000000000000, saved partner: 0000000000000000, saved state: ABORTED, scheduler: 0000002900000001
utcb_page_area: 0000000100000000 40000
    

NB: Even though showtcb is shown after the prompt, you have to actually hit 't', not type showtcb.

Start by working out which thread is executing. The ID: field on the first line of the output gives the thread id of the faulting thread (0000002a00000001 here). You will probably find it useful to print the thread id of each thread you start in your operating system, at least while you are developing the system.

Once you have determined which thread is executing, you want to work out which code it is executing. The UIP: (user instruction pointer) field helps here. In this case the thread was executing at address 0xa013f0, which matches the EPC value in the exception message.

Now that we know the address of the faulting instruction, we want to find out which line of code contains the fault. The first step is to find which executable file the fault is in. Dite helps us here:

% dite -d rootimage.dite
'rootimage.dite': (2 entries)
No.    Name           Base     Size     Entry    Init   Resource
0   sos              0xa00000 0x5af10  0xa00000   Yes    Yes
1   tty_test         0xa5b000 0x42ee0  0xa5b000   No     No 

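Dite shows that in this case the fault is in the sos object code. 0xa013f0 lies between the sos base, 0xa00000, and the end of its image, 0xa00000 + 0x5af10 = 0xa5af10. (This may be obvious from which thread is faulting, but not always.) We can now run objdump on the appropriate file: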

% mips64-elf-objdump -dl sos.app | less

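You can now use the search facility in less to find the faulting address. Type / followed by the address without the 0x prefix and with a trailing colon (for example /a013f0:) so that you match the instruction line rather than a branch target. If the file was built with debugging information, the -l flag makes objdump label the disassembly with source file names and line numbers, which points you to the offending line. In this case I found the following fragment of output: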

00a013d8 <zsccgetreg>:
  a013d8:   30a500ff    andi  a1,a1,0xff
  a013dc:   50a00004    beqzl a1,a013f0 <zsccgetreg+0x18>
  a013e0:   dc830000    ld v1,0(a0)
  a013e4:   dc820000    ld v0,0(a0)
  a013e8:   a0450000    sb a1,0(v0)
  a013ec:   dc830000    ld v1,0(a0)
  a013f0:   03e00008    jr ra
  a013f4:   90620000    lbu   v0,0(v1)

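In this case I am lucky: the fault is in a small function, so tracking down the bug shouldn't be too hard. One MIPS subtlety applies even here. The instruction at 0xa013f0 is jr ra, which does not access memory. When an exception occurs in a branch delay slot, EPC points at the branch rather than at the faulting instruction, so the real culprit is most likely the lbu v0,0(v1) in the delay slot at 0xa013f4, which loads through a pointer read from 0(a0). In large functions your skill at reading assembly code (one of the recommended skills for this course, remember) comes into play. Of course, this should also encourage you to keep your functions small.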

Rebooting the machine

Often when debugging your code you will want to reboot the machine, but you may be working remotely or otherwise be unable to reboot it physically. The kernel debugger can do this for you. First, drop into the kernel debugger by hitting ESC. Then, from the prompt, hit 6. This should restart the machine and get you back to the PMON prompt.

If for some reason this doesn't work, L4 may have gotten itself into an unexpected state, and you will have to reboot the machine manually.


Last modified: Thu Aug 7 09:37:45 EST 2003