Monday, March 19, 2007

Decoding BlackBerry Server NSD Logs

Even though I've tried to optimize my environment to the best of my ability, I will still occasionally see the dreaded Fault report that one of my BES servers has crashed. If this happens to you, here is how you can decode the NSD log that Domino creates and track it back to what BES was doing at the time. In some situations - not all - this can help you figure out what is going on and resolve it yourself instead of waiting for a log file analysis from RIM.

For example, I just happen to have an NSD log from a crash that happened this morning. These crash files can be found in the "Lotus\Domino\Data\IBM_TECHNICAL_SUPPORT" directory of your data partition. The NSD log file is name something like this:

nsd_W32I_[YOUR SERVER NAME]_2007_03_18@21_15_02.log

In my experience, most of the relevant information can be gleaned from the first page full of information. The first thing we want to look for is the following line right near the top:

Arguments : "c:\lotus\domino\nsd.exe" -dumpandkill -termstatus 5 -crashpid 4592 -crashtid 3252

The most important information is the "-crashtid 3252" portion, which indicates that the thread ID that caused the crash is 3252 decimal. Well if you remember, the BlackBerry log files record the thread ID for every operation, however they are encoded in Hex. So using calc we convert 3252 decimal = CB4 Hex.

With the thread ID known, lets parse the BES MAGT log for the thread ID CB4. I use the line "grep 'xCB4' [logfile name]" The final line of this thread process is this:

[40000] (03/18 20:42:38):{0xCB4} CN=XXXX/OU=XXXX/O=XXXX!!mail\XXXX.nsf, fetching modified documents since 03/18/2007 08:42:37 PM for user XXXX

Of course identifying information has been removed to protect the innocent, however this appears to be the last thing this thread did before the server crashed. I think we have our culprit mailfile.

So now I open the mailfile and look in the All Documents view, and discover there is a Sent Item from guess what time? That's right: 03/18/2007 08:42:37 PM. Could this be our culprit document? It appears to be a reply, initiated from the device, to an email that contains a large bitmap in the body.

To verify, I check the NoteID; it is "1C542". If I now do a search on "1C542" in the NSD file I find the following lines:

@[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,4080000,0,35b4eb34,17)
@[23] 0x6003d3b7 nnotes._NSFNoteOpen@16+55 (71,1c542,0,35b4f7d0)
[24] 0x0058ab01 nBES (6a033c,5f,71,1c542) - SYM FILE OUTDATED!
[25] 0x0058f2d4 nBES (1c542,0,0,d4683b3) - SYM FILE OUTDATED!
# 35b4f310 35b4f7d0 00000000 0001c542 0bd1a0e4 |...5....B.......|
@[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,4080000,0,35b4eb34,17)
# 35b4f354 35b4f380 6003d3b7 00000071 0001c542 |...5...`q...B...|
# 35b4f374 0001c542 35b4f7d0 00000010 35b4f5ac |B......5.......5|
@[23] 0x6003d3b7 nnotes._NSFNoteOpen@16+55 (71,1c542,0,35b4f7d0)
# 35b4f380 35b4f5ac 0058ab01 00000071 0001c542 |...5..X.q...B...|
# 35b4f560 35b4f5c4 0058f993 0001c542 35b4f7c8 |...5..X.B......5|
[24] 0x0058ab01 nBES (6a033c,5f,71,1c542) - SYM FILE OUTDATED!
# 35b4f5bc 00000071 0001c542 00000000 35b4f7d0 |q...B..........5|
# 35b4f61c 00000002 0d44e858 00475182 0001c542 |....X.D..QG.B...|
[25] 0x0058f2d4 nBES (1c542,0,0,d4683b3) - SYM FILE OUTDATED!
# 35b4f620 0d44e858 00475182 0001c542 00000000 |X.D..QG.B.......|
# 35b4f700 0001c542 0d2e7b15 1f7957bc 00096573 |B....{...Wy.se..|
# 35b4f7d0 00003306 0001c542 00000001 00000000 |.3..B...........|
@[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,4080000,0,35b4eb34,17)
@[23] 0x6003d3b7 nnotes._NSFNoteOpen@16+55 (71,1c542,0,35b4f7d0)
# 35b4f7d0 00003306 0001c542 00000001 00000000 |.3..B...........|
[24] 0x0058ab01 nBES (6a033c,5f,71,1c542) - SYM FILE OUTDATED!
[25] 0x0058f2d4 nBES (1c542,0,0,d4683b3) - SYM FILE OUTDATED!


It looks like NBES was trying to open this Note using the NSFNoteOpen and NSFNoteOpenExtended API calls when it crashed the server. So now I am sure this note caused the crash. What now? Delete the note from the user's mailfile so that it now resides in the trash, then restart the BES server. Of course you want to let the user know why their message is in the trash, and of course follow up on RIM on why this happened in the first place. But at least you have the server back up!

Friday, March 16, 2007

Moves between Servers - Not 100% foolproof

I moved over 500 accounts from an old BES server to a new BES server today. I set the expectation with our regional managers and BB admins that there would be no impact to end users during this migration. How could I be so naive?

First, a minor amount of users were not able to send or send & receive after the move, and required either pushing the service books, reactivation, or complete account removal and re-add / re-activate to get working again. This is not so bad, it is a small number (<20 probably) of 500 people.

The bigger problem was this: The source (old) BES server's "Blackberry Synchronization Service" would sometimes freak out during bulk moves. I would move maybe 50 people at a time max, but apparently that was too much for the sync service to handle. So it would just quit, and I would have to restart it. I also noticed that the initial moves went really fast, a few per minute, but then they started to bog down and it would take 5 minutes to move one account.

That is no big deal, until I realized by looking through the logs what was happening: the sync service contains the device backup service, and it was going through and deleting the backup data not only for the accounts being moved, but FOR EVERY USER ON THE SERVER: (names removed to protect the innocent)

[46036] (03/14 10:15:41):{0xC50} [SYNC-Gate] Start removing user. [W
[46036] (03/14 10:15:41):{0xC50} [SYNC-Gate] Start removing user. [R
[46036] (03/14 10:15:42):{0xC50} [SYNC-Gate] Start removing user. [Z
[46036] (03/14 10:15:42):{0xC50} [SYNC-Gate] Start removing user. [O
[46036] (03/14 10:15:43):{0xC50} [SYNC-Gate] Start removing user. [S
[46036] (03/14 10:15:43):{0xC50} [SYNC-Gate] Start removing user. [H
[46036] (03/14 10:15:44):{0xC50} [SYNC-Gate] Start removing user. [K
[46036] (03/14 10:15:44):{0xC50} [SYNC-Gate] Start removing user. [D
[46036] (03/14 10:15:45):{0xC50} [SYNC-Gate] Start removing user. [G
[46036] (03/14 10:15:45):{0xC50} [SYNC-Gate] Start removing user. [B
[46036] (03/14 10:15:46):{0xC50} [SYNC-Gate] Start removing user. [D
[46036] (03/14 10:15:46):{0xC50} [SYNC-Gate] Start removing user. [O
[46036] (03/14 10:15:47):{0xC50} [SYNC-Gate] Start removing user. [S
[46036] (03/14 10:15:48):{0xC50} [SYNC-Gate] Start removing user. [R
[46036] (03/14 10:15:48):{0xC50} [SYNC-Gate] Start removing user. [A
[46036] (03/14 10:15:48):{0xC50} [SYNC-Gate] Start removing user. [C
[46036] (03/14 10:15:49):{0xC50} [SYNC-Gate] Start removing user. [F
[46036] (03/14 10:15:49):{0xC50} [SYNC-Gate] Start removing user. [N
[46036] (03/14 10:15:50):{0xC50} [SYNC-Gate] Start removing user. [R
[46036] (03/14 10:15:50):{0xC50} [SYNC-Gate] Start removing user. [W
[46036] (03/14 10:15:51):{0xC50} [SYNC-Gate] Start removing user. [B
[46036] (03/14 10:15:51):{0xC50} [SYNC-Gate] Start removing user. [G
[46036] (03/14 10:15:52):{0xC50} [SYNC-Gate] Start removing user. [A
[46036] (03/14 10:15:53):{0xC50} [SYNC-Gate] Start removing user. [P
[46036] (03/14 10:15:53):{0xC50} [SYNC-Gate] Start removing user. [Z
[46036] (03/14 10:15:54):{0xC50} [SYNC-Gate] Start removing user. [B
[46036] (03/14 10:15:55):{0xC50} [SYNC-Gate] Start removing user. [L
[46036] (03/14 10:15:58):{0xC50} [SYNC-Gate] Start removing user. [M

Once this happened, it kicked off an OTA device backup, which once again sync'ed everything back from each device, which slowed down EVERYTHING.

Now here is the worst part: many of the devices (with accounts that were not even scheduled to be moved that day) received either:

1) An activation complete - OK prompt
2) A continuing activation process
3) A leftover activation icon on their ribbon (home screen).

I started to get calls from admins in other cities, who weren't even scheduled to be moved that day. Bad.

Lesson: Don't move more than 10-20 accounts at a time, and watch the SYNC service.

Oh, and by the way - the brand new server hardware I migrated to? It crashed this morning. :(

Wednesday, March 7, 2007

Gathering Wireless Vendor Info from the SQL DB

My manager asked for a report of the BlackBerry users and their associated wireless provider. Given this is all stored in the BES SQL DB, it shouldn't be a problem to get the info - but there are some things to figure out before it works right:

1. There are two fields in the BES SQL DB that refer to a wireless provider. The first is the VendorID field, which specifies the actual maker of the BlackBerry device itself, regardless of what network it is currently connected to or roaming on.

2. The second is the Home Network field, which actually doesn't make sense because it shows you which wireless network provider the device is connected to right now, so it may not actually be the home network, but a roaming partner.

For example I have a user whose device is a T-Mobile device and their home network would obviously be T-Mobile, however they currently appear as Cingular under Home Network.

To get these values out of the SQL DB we need a query, here is the one I use to pull the data into Excel:

SELECT UserConfig.DisplayName, vHandheldSummaryInfo.VendorID, vHandheldSummaryInfo.HomeNetwork
FROM BesConfig.dbo.UserConfig UserConfig, BesConfig.dbo.vHandheldSummaryInfo vHandheldSummaryInfo
WHERE vHandheldSummaryInfo.PIN = UserConfig.PIN

Note that you need to join these two tables in Microsoft Query using the PIN field, not the default "id" field that it starts with, otherwise your data won't match. Haven't discovered why, perhaps their are two different id fields in the two tables. Anyway, just use PIN because that is a common key that I know is unique.

So once you run this query you get the user's Display Name, Vendor ID, and Home Network. Now to get the device's carrier I need just the Vendor ID field but I like to see what Home Network people are on, it is interesting to see where your people are in the world and what unique names other wireless carriers have, such as "CCT Boatphone" (!). (Apparently from the British Virgin Islands, but is it really for boats? a cruise ship? hmmm...)

One final issue is that the Vendor ID is in decimal format, so I wasn't quite sure which was which. The Home Network let me see which vendor ID matched up with which vendor name, but after doing a bunch of work for that I figured out that the vendor id's are stored in where else... the Vendors.xml file!

While the Vendors.xml file did in fact contain all the matchings between vendor ID and vendor name, the id's in this file were in Hex format! :( After a little hacking in Excel with the HEX2DEC formula, I came up with the following master table below of all vendors RIM supports:


Hex (vendor.xml)Decimal (BES DB)Provider Name
64100T-Mobile US
65101AWS
66102Cingular Wireless
67103Nextel
68104Sprint PCS
69105Verizon
6B107Rogers
6D109BWA (Aliant/Sasktel)
6F111MM02 DE
71113Telfort
72114T-Mobile DE/AU
73115TIM
74116Hutchison
75117Bouygues
76118SFR
77119Orange_France
78120VODAFONE (UK)
79121Telcel
7A122Telstra
7B123T-Mobile (UK)
7C124Vodafone Germany
7D125MM02 UK/IRL/NL
7E126Telus
7F127SMART
80128Starhub
81129Telefonica
82130Swisscom
83131Cable and Wireless
84132Vodafone (IT)
85133Vodafone (ES)
86134T-Mobile (NL)
87135Cincinnati Bell
89137Vodafone Austria
8A138Vodafone Australia
8B139Vodafone Ireland
8C140Vodafone Sweden
8D141CSL
8E142Orange UK
8F143Vodafone New Zealand
90144SingTel
91145Globe
92146Optus
93147Orange Mobistar
95149Bharti
96150KPN
98152Proximus
99153Vodafone_Portugal
9B155BT_Mobile
9C156Earthlink
9E158E-Plus
9F159BASE
A0160Dobson
A1161Vodafone_Egypt
A2162Orange_Switzerland
A4164Triton
A5165Maxis
A6166Vodafone_Denmark_(TDC)
A7167MobileOne_(Voda_Singapore)
A8168Vodacom


So now I can run my SQL query and easily match up the vendor ID's in decimal with the vendor names from this table. Cool.