Monday, March 19, 2007

Decoding BlackBerry Server NSD Logs

Even though I've tried to optimize my environment to the best of my ability, I will still occasionally see the dreaded Fault report that one of my BES servers has crashed. If this happens to you, here is how you can decode the NSD log that Domino creates and track it back to what BES was doing at the time. In some situations - not all - this can help you figure out what is going on and resolve it yourself instead of waiting for a log file analysis from RIM.

For example, I just happen to have an NSD log from a crash that happened this morning. These crash files can be found in the "Lotus\Domino\Data\IBM_TECHNICAL_SUPPORT" directory of your data partition. The NSD log file is name something like this:

nsd_W32I_[YOUR SERVER NAME]_2007_03_18@21_15_02.log

In my experience, most of the relevant information can be gleaned from the first page full of information. The first thing we want to look for is the following line right near the top:

Arguments : "c:\lotus\domino\nsd.exe" -dumpandkill -termstatus 5 -crashpid 4592 -crashtid 3252

The most important information is the "-crashtid 3252" portion, which indicates that the thread ID that caused the crash is 3252 decimal. Well if you remember, the BlackBerry log files record the thread ID for every operation, however they are encoded in Hex. So using calc we convert 3252 decimal = CB4 Hex.

With the thread ID known, lets parse the BES MAGT log for the thread ID CB4. I use the line "grep 'xCB4' [logfile name]" The final line of this thread process is this:

[40000] (03/18 20:42:38):{0xCB4} CN=XXXX/OU=XXXX/O=XXXX!!mail\XXXX.nsf, fetching modified documents since 03/18/2007 08:42:37 PM for user XXXX

Of course identifying information has been removed to protect the innocent, however this appears to be the last thing this thread did before the server crashed. I think we have our culprit mailfile.

So now I open the mailfile and look in the All Documents view, and discover there is a Sent Item from guess what time? That's right: 03/18/2007 08:42:37 PM. Could this be our culprit document? It appears to be a reply, initiated from the device, to an email that contains a large bitmap in the body.

To verify, I check the NoteID; it is "1C542". If I now do a search on "1C542" in the NSD file I find the following lines:

@[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,4080000,0,35b4eb34,17)
@[23] 0x6003d3b7 nnotes._NSFNoteOpen@16+55 (71,1c542,0,35b4f7d0)
[24] 0x0058ab01 nBES (6a033c,5f,71,1c542) - SYM FILE OUTDATED!
[25] 0x0058f2d4 nBES (1c542,0,0,d4683b3) - SYM FILE OUTDATED!
# 35b4f310 35b4f7d0 00000000 0001c542 0bd1a0e4 |...5....B.......|
@[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,4080000,0,35b4eb34,17)
# 35b4f354 35b4f380 6003d3b7 00000071 0001c542 |...5...`q...B...|
# 35b4f374 0001c542 35b4f7d0 00000010 35b4f5ac |B......5.......5|
@[23] 0x6003d3b7 nnotes._NSFNoteOpen@16+55 (71,1c542,0,35b4f7d0)
# 35b4f380 35b4f5ac 0058ab01 00000071 0001c542 |...5..X.q...B...|
# 35b4f560 35b4f5c4 0058f993 0001c542 35b4f7c8 |...5..X.B......5|
[24] 0x0058ab01 nBES (6a033c,5f,71,1c542) - SYM FILE OUTDATED!
# 35b4f5bc 00000071 0001c542 00000000 35b4f7d0 |q...B..........5|
# 35b4f61c 00000002 0d44e858 00475182 0001c542 |....X.D..QG.B...|
[25] 0x0058f2d4 nBES (1c542,0,0,d4683b3) - SYM FILE OUTDATED!
# 35b4f620 0d44e858 00475182 0001c542 00000000 |X.D..QG.B.......|
# 35b4f700 0001c542 0d2e7b15 1f7957bc 00096573 |B....{...Wy.se..|
# 35b4f7d0 00003306 0001c542 00000001 00000000 |.3..B...........|
@[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,4080000,0,35b4eb34,17)
@[23] 0x6003d3b7 nnotes._NSFNoteOpen@16+55 (71,1c542,0,35b4f7d0)
# 35b4f7d0 00003306 0001c542 00000001 00000000 |.3..B...........|
[24] 0x0058ab01 nBES (6a033c,5f,71,1c542) - SYM FILE OUTDATED!
[25] 0x0058f2d4 nBES (1c542,0,0,d4683b3) - SYM FILE OUTDATED!


It looks like NBES was trying to open this Note using the NSFNoteOpen and NSFNoteOpenExtended API calls when it crashed the server. So now I am sure this note caused the crash. What now? Delete the note from the user's mailfile so that it now resides in the trash, then restart the BES server. Of course you want to let the user know why their message is in the trash, and of course follow up on RIM on why this happened in the first place. But at least you have the server back up!

1 comment:

Richard said...

Great post and very useful, thanks for taking the time. :-)