Friday, November 2, 2007

MDS What?

MDS stands for Mobile Data Service, and it is a service that allows you to access other data sources (mostly Internet / Intranet browsing) via your BlackBerry.

In BES 4.1, however, MDS has turned into a many-headed hydra. What we know of as "BlackBerry Mobile Data Service" in 4.0 is now inexplicably called the "BlackBerry MDS Connection Service" in 4.1. Meanwhile, there are a whole bunch of new MDS related services:

• BlackBerry MDS Application Integration Service
• BlackBerry MDS Data Optimization Service
• BlackBerry MDS Provisioning Service
• BlackBerry MDS Administrative and Management Service
• BlackBerry® MDS Studio Application Repository

But wait, there's more! The new MDS stuff also requires a brand new SQL db running on the same SQL server as the BESMgmt database, which can result in permission issues if the account creating it lacks the proper authority.

All of the above services & database are related to the new 4.1 software deployment environment installed when you choose the upgrade option called:

"BlackBerry Enterprise Server with MDS Services and Components"

But really, this whole new "MDS" software deployment infrastructure is not needed at the initial upgrade from 4.0 to 4.1. To avoid complexity, you can leave this whole new set of services out of the upgrade. Later, you can install the new MDS stuff on a separate server if you like, as it was made to be modular and have one separate MDS instance serve many BES servers.

The confusing part is that if you run the upgrade, you get these two choices:


If I do not know *exactly* what this means, I will by default choose the second option, "BlackBerry Enterprise Server with MDS Services and Components", because I want to keep my MDS service from 4.0, right?

Wrong... choosing the first option will give you the same 4.0 MDS (renamed MDS Connection Service) while avoiding the complexity of installing the new whiz-bang MDS software deployment stuff.

In a nutshell: if you want to greatly simplify your 4.0 -> 4.1 upgrade, opt for the first install method selected in the picture above. You will lose no MDS functionality from the 4.0 perspective, and can add on the new stuff later when you are ready and comfortable with 4.1.

Thursday, November 1, 2007

Malformed Message Crashes BES 4.1 SP4

The idea that a malformed message will crash a BES server is nothing new - service packs have taken care of these issues many times in the past. I apparently discovered another one, as one of my servers crashed twice just after midnight last night.

Fortunately Domino restarted itself and was back up and operational in minutes (thanks transaction logging!). After the second crash, however, it did not crash again. Usually the BES will keep trying to re-read the malformed message and crash over and over until you figure out the message and delete it from the user's mailfile, but not this time.

From the logs I see the attempts to read which resulted in crashes:

[40000] (11/01 00:00:44.858):{0x19B0} {User} [Mailfile], ModifiedByName detected change
[40000] (11/01 00:00:44.905):{0x19B0} {User} [Mailfile], fetching modified documents since 11/01/2007 12:00:43 AM

..[CRASH HERE!]..

[40000] (11/01 00:05:02.452):{0x1830} {User} [Mailfile], ModifiedByName detected change
[40000] (11/01 00:05:02.452):{0x1830} {User} [Mailfile], fetching modified documents since 11/01/2007 12:00:43 AM

..[CRASH HERE!]..


But on the third attempt I see this:

[40000] (11/01 00:08:10.515):{0x1640} {User} [Mailfile], fetching modified documents since 11/01/2007 12:00:43 AM
[20039] (11/01 00:08:10.530):{0x1640} {User} Already attempted to open NID=3DAF2 for user User: Message has been quarantined, skipping now


Nice job RIM! This quarantining feature allowed me to stay peacefully asleep instead of having to get up and hunt down the offending message.
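If you want to find quarantined messages without reading the whole log by eye, a small script can do it. This is only a sketch: the line layout (event ID in brackets, timestamp in parentheses, thread ID in braces) is inferred from the excerpts above, and the names parse_line and quarantined_note_ids are my own, not anything RIM ships.

```python
import re
from typing import NamedTuple, Optional

# Layout inferred from the excerpts: [event id] (MM/DD HH:MM:SS.mmm):{thread id} text
LOG_LINE = re.compile(
    r"^\[(?P<event>\d+)\]\s+"
    r"\((?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)?)\):"
    r"\{(?P<thread>0x[0-9A-Fa-f]+)\}\s+"
    r"(?P<text>.*)$"
)


class LogEntry(NamedTuple):
    event: int
    stamp: str
    thread: str
    text: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into its parts, or return None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return LogEntry(int(m["event"]), m["stamp"], m["thread"], m["text"])


def quarantined_note_ids(lines):
    """Collect the NID values from 'Message has been quarantined' entries."""
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry and "Message has been quarantined" in entry.text:
            nid = re.search(r"NID=([0-9A-Fa-f]+)", entry.text)
            if nid:
                found.append(nid.group(1))
    return found
```

Feed it the lines of the log file and it hands back the note IDs you would otherwise have to hunt down in the user's mailfile.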

BTW, the message in question was a digest from a mailing list which included a BinHex encoded MIME part in the body of the message:

--B_3276667131_13573
Content-type: application/mac-binhex40; name="[Filename].doc"
Content-disposition: attachment;
filename="[Filename].doc"

(This file must be converted with BinHex 4.0)
:(8K[G#p1Eh3J9A"NBA4P)#dJ5R9XH5!R-$FZC'pM!&Fi3Nj08eG%!*!%UJ#3"GN
Jd-m4i+'a'Z%!N"!q!!-!r[m*!!B!N!X"!*!$8!#3#"!!!&)!N!-"!*!$r[q3!`#
3"%m!N!2rN2rrN,(XTF%!Kf%*"!!!q"+r!*!&!4%!!3!"!!B!!28L!!!1!'TLDQ+
`Zl#l!*!5#33@!#3d!!$Df3%!fYN"!28F!*!Hrrm2!*!*rrm2!*!*rrm2!*!4L!#


I am pretty sure this is the part that utterly confused the BES since:

a) it was in the body

and

b) was a binhex part, which I have seen trouble with in prior BES versions even when it was properly encoded.

Looking through the release notes for 4.1 SP4 MR2, I see the following:

*SDR 135729 In BlackBerry Enterprise Server Version 4.1 SP4, if a message contained truncated or incorrectly encoded data, the BlackBerry Enterprise Server might have stopped responding. In BlackBerry Enterprise Server Version 4.1 SP4 MR2 and later, this issue is resolved.

I love to see this - it means I don't have to report the issue to RIM, someone else already has! Also I can tell my manager when he asks about the crashes that it is already fixed in the next maintenance release, which we will now plan on deploying.

Wednesday, October 24, 2007

Decoding the BlackBerry State Database

So we all know that the state database correlates the messages in the mailfile with the messages on the BlackBerry. The way it does this is by creating a separate entry in the state database for each email the BES sees in the mailfile. Each of these entries has exactly the same UNID as the original email in the mailfile.

Now I am trying to decode the MessageState field in the state database entries. Searching through my own state database, I find all of the following various "state codes" under the MessageState field:

0
1
2
3
4
5
7
8
9
10
13
14
17

That's a lot of "states" a message can be in!

I have started testing and have determined the first couple:

2: Email has been queued / sent to wireless network but not received by device yet
3: Email not redirected to handheld (redirection disabled)
4: Email has been delivered to device

It will be tough to figure out the rest, though! I will update this doc as I discover them...
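To keep track of what is decoded so far, the findings above fit in a small lookup table. This is just a sketch for my own bookkeeping: the descriptions are the three I have confirmed by testing, the names are hypothetical, and every other code deliberately stays "unknown" rather than guessed.

```python
# The three MessageState codes confirmed by testing so far.
KNOWN_MESSAGE_STATES = {
    2: "Queued / sent to wireless network, not yet received by device",
    3: "Not redirected to handheld (redirection disabled)",
    4: "Delivered to device",
}

# Every code observed in my own state database.
OBSERVED_CODES = [0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 13, 14, 17]


def describe_state(code: int) -> str:
    """Return the known meaning of a code, or mark it as unknown."""
    return KNOWN_MESSAGE_STATES.get(code, f"Unknown state {code}")


def undecoded_codes():
    """List the observed codes that still need testing."""
    return [c for c in OBSERVED_CODES if c not in KNOWN_MESSAGE_STATES]
```

Calling undecoded_codes() gives the to-do list: ten codes still to figure out.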

Monday, October 8, 2007

Remove IT Policy - New Feature of 4.1 SP4 / OS 4.2.2

We now have an easy way to remove the applied IT Policy from a handheld, which was previously locked to the device even after you wiped it.

If you have 4.1 SP4 and a device with OS 4.2.2 or greater, here is how to remove the IT policy and set the device to a true factory default state:

1) Create a new IT policy (or modify an existing policy) and set the Remote Wipe Reset to Factory Defaults setting in the Security Policy group to True



2) Assign this IT policy to a particular user account
3) Send the command "Erase Data and Disable Handheld"



4) Verify receipt of command by device under "IT Policy Status"

Thursday, September 27, 2007

Update on Comcast's Man in the Middle Attacks on Lotus Notes

For those of you Notes client users suffering from Comcast's "filtering", there is a great post by my colleague Kevin Kanarski who has details of packet captures that result from sending an email with a 6MB attachment from a Notes client using the Internet connection.

These captures, from both ends, clearly show that Comcast is imitating both the client and server in sending RST [reset] packets to the other end of the connection. Neither the client nor the server generated any RST packets, so this is definitely shady behavior by Comcast.

Thursday, September 20, 2007

Tuning the "More" Cache

We all know the pesky little "More" command that lets you download successively more and more of your email, beyond the first 4K delivered to the BlackBerry.

Well on the server side, there is a cache that holds this information, and it is called the "More cache". [Wow these names really need a little obfuscation, they make too much sense to exist in the PC world]

Anyhoo, you can tune the amount of cache that the BES allocates to this function, and by default it gives it 10MB. Well at this level I was filling up the cache in about 6 hours from server startup, and only realizing a maximum of 84% hit rate:

[45009] (09/14 02:35:58.078):{0x13BC} More cache hit rate: 0.0%, requests: 0, adds: 0, size: 0.0/10.0 Mb
...
[45009] (09/14 08:33:32.778):{0x13BC} More cache hit rate: 69.0%, requests: 126, adds: 2270, size: 5.7/10.0 Mb
...
[45009] (09/14 09:48:32.294):{0x13BC} More cache hit rate: 75.7%, requests: 214, adds: 5031, size: 10.0/10.0 Mb
...
[45009] (09/14 23:48:32.111):{0x13BC} More cache hit rate: 84.4%, requests: 784, adds: 29120, size: 10.0/10.0 Mb



If the remaining text is not found in the More cache, then the BES server needs to open the mailfile to pull it in, which is expensive in terms of network I/O and Domino server I/O. [Not hugely, of course, but I like to squeeze as much performance out of my systems as possible] If it can find it in the cache, here is what you will see in your MAGT log:

[40231] (09/14 01:11:01.290):{0x608} {XXXX} Original message (RID=-576019096) retrieved from More cache. Not opening the mail file.

That is a message that I like to see!

So I decided to double the default of 10MB to 20MB by adding the following DWORD registry key, and setting it to "20" decimal:

HKLM\Software\Research In Motion\BlackBerry Enterprise Server\Agents\MoreCacheSize


After a restart of the server (not required) here is what I now see:

[45009] (09/20 13:50:51.888):{0x14E0} More cache hit rate: 89.5%, requests: 3004, adds: 127078, size: 20.0/20.0 Mb

So fully loaded, it is running ~90% cache hit rate now, which is not a huge gain but it is something. RIM considers >90% hit rates to be good, but I haven't decided how much more I want to push this setting as it takes away memory from other processes for this task.
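To watch the hit rate over time rather than eyeballing it, the "More cache hit rate" lines can be pulled out of the log with a short script. A minimal sketch, assuming the message text always looks like the [45009] lines quoted above; the function names are my own.

```python
import re

# Matches the statistics portion of the [45009] log lines.
CACHE_LINE = re.compile(
    r"More cache hit rate: (?P<rate>[\d.]+)%, requests: (?P<requests>\d+), "
    r"adds: (?P<adds>\d+), size: (?P<used>[\d.]+)/(?P<limit>[\d.]+) Mb"
)


def parse_cache_stats(line):
    """Return the cache statistics from one log line, or None if absent."""
    m = CACHE_LINE.search(line)
    if m is None:
        return None
    return {
        "hit_rate": float(m["rate"]),
        "requests": int(m["requests"]),
        "adds": int(m["adds"]),
        "used_mb": float(m["used"]),
        "limit_mb": float(m["limit"]),
    }


def is_full(stats):
    """True once the cache has grown to its configured limit."""
    return stats["used_mb"] >= stats["limit_mb"]
```

The first line where is_full() returns True tells you how long the cache took to fill after startup, which is the number I used to decide the default was too small.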

Wednesday, August 15, 2007

BlackBerry Service Books - BES only

Here is a comprehensive list of the BES related service books available on a BlackBerry device:

Desktop [CMIME]: Send/Receive Email and Wireless Reconciliation
Desktop [SYNC]: Wireless Address Book
Desktop [CICAL]: Wireless Calendar
Desktop [ALP]: Address Lookup
Desktop [IPPP]: MDS Info for web browsing
Desktop [BrowserConfig]: MDS Info for web browser
Desktop [OTASL]: Over the Air Software Loading (4.2 device OS or later)
Provisioning [Provisioning]: Enterprise Activation


Note:

If something is broken, you may be able to resolve it quickly by deleting / undeleting a service book from the BlackBerry device itself. This process regenerates / reconnects with the BES server in a magical way, requiring no server side intervention. Here are two common issues you may be able to resolve this way:

Wireless reconciliation such as read/unread marks, deletions not working, even though it says enabled from both server and device: Kill and Resurrect the Desktop [CMIME]

Address Book is not syncing, even though it says it is enabled from both server and device: Kill and Resurrect the Desktop [SYNC]

Friday, August 10, 2007

Comcast Blocking Lotus Notes Attachments over 2MB

This is not related to Domino BES in any way, but thought I would share. It appears that Comcast (at least in the Chicago area) has recently installed a blocking system in order to combat P2P traffic, which has the inadvertent effect of blocking Lotus Notes attachment uploads over 2MB.

Using Lotus Notes with the internet passthrough, via a home Comcast cable IP connection, and trying to send any message over 2MB results in a burst of TCP reset (RST) packets from the Comcast blocking system, which blocks the message from being sent.

We have received 2 calls in the last week regarding this issue, and 4 other tech people with Comcast cable modems have confirmed that sending attachments over 2MB no longer works. The error received is the following:


"Remote system no longer responding: [Server name] mail.box"


If you try to send a message just under 2MB you might see the following:


"Mail was successfully submitted for delivery but a copy has not yet been saved in your mail file due to server not responding"

Has anyone else received complaints from home / laptop Notes users about this issue?

UPDATE: This appears to be confirmed by a posting at www.dslreports.com which discusses Comcast's use of a filtering product called "Sandvine":

http://www.dslreports.com/forum/r18323368-Comcast-is-using-Sandvine-to-manage-P2P-Connections

Thursday, August 2, 2007

BlackBerry Date/Time Source Explained

You have three options of how to get the Date/Time synchronized automatically on the BlackBerry:

1. Set it to BlackBerry (the default), which gathers the information from the BlackBerry network, i.e. the RIM NOC.

There is some confusion about this, so let's clear it up right now. Setting this to BlackBerry does not mean you have to set the date/time yourself on the BlackBerry. It does not mean that it gets the time from Desktop Manager when you cradle / sync (that is so 2005 anyways, it's all wireless now baby!). It means that you get the date/time directly from RIM over the air, that's it. Here are the debug log lines from the device when you click Update Time:

guid:0x1295B4AADE149AFC time: Thu Aug 02 15:12:08 2007 severity:5 type:2 app:net.rim.timesync data:SynR
guid:0x1295B4AADE149AFC time: Thu Aug 02 15:12:08 2007 severity:0 type:2 app:net.rim.timesync data:Send
guid:0x1295B4AADE149AFC time: Thu Aug 02 15:12:10 2007 severity:0 type:2 app:net.rim.timesync data:Recv

2. Set it to Network, which gathers the information from the wireless carrier network, whether AT&T, T-Mobile, Verizon, Rogers, O2, Vodafone, etc. Here is the debug log line from the device when you do this:

guid:0x1295B4AADE149AFC time: Thu Aug 02 15:15:55 2007 severity:5 type:2 app:net.rim.timesync data:SynR

Interestingly, when you click Update Time with the Network setting, it just copies the date/time recorded under "Network Time" and "Network Date", instead of directly querying the carrier network. I assume this "Network Time" is regularly synced through some background mobile radio process.

3. Set it to Off, which does not gather any info. You set it and maintain it yourself.

Please note that, as I explained in my prior post, none of these settings will update the time zone you are in to the local time zone when you travel. None of them. Stop trying to change from Network to BlackBerry and back and clicking "Update Time" over and over again. It just won't do it.
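If you want to tell the two sources apart from a saved device log, the timesync lines quoted above are easy to parse. This is a sketch only: the field layout is taken from those excerpts, the function names are hypothetical, and reading "Send" and "Recv" as the over-the-air request and reply is my inference from the names.

```python
import re
from datetime import datetime

# Pulls the timestamp, app name and data code out of a device debug log line.
DEBUG_LINE = re.compile(
    r"time: (?P<time>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}) .*"
    r"app:(?P<app>\S+) data:(?P<data>\S+)"
)


def parse_debug_line(line):
    """Return (timestamp, app, data) for one debug line, or None."""
    m = DEBUG_LINE.search(line)
    if m is None:
        return None
    when = datetime.strptime(m["time"], "%a %b %d %H:%M:%S %Y")
    return when, m["app"], m["data"]


def timesync_round_trip(lines):
    """Seconds between the first Send and the following Recv, or None."""
    sent = None
    for line in lines:
        parsed = parse_debug_line(line)
        if parsed is None or parsed[1] != "net.rim.timesync":
            continue
        when, _, data = parsed
        if data == "Send" and sent is None:
            sent = when
        elif data == "Recv" and sent is not None:
            return (when - sent).total_seconds()
    return None
```

A log captured with the BlackBerry setting yields a round trip, while one captured with the Network setting yields None because only the SynR line is present.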

Automatic TimeZone Switching for BlackBerry

The Motorola Q does it, the Treo does it, the Razr does it... and yet NO BlackBerry does it. What am I talking about? The seemingly obvious and basic cellular phone function of automatically switching the time zone for you while you travel.

This does not require internal GPS, as the other phones do not have/need it and yet have gotten their location / time zone directly from the network they attach to for years.

Now I have seen multiple discussions on the forums complaining/questioning about this feature, where "switchers" to BlackBerry just can't understand how to enable this feature on their BlackBerry... it is such an obvious feature that they think they are stupid and don't know how to get it to work, rather than the ridiculous notion that maybe this feature does not exist on the BlackBerry!

Arguments about setting the Date/Time source to Network vs. BlackBerry, the use of GPS, GPS not required, A-GPS vs internal GPS, it is just an endless circular conversation I have seen over and over with little resolution.

There have been justifications floated around that since the BlackBerry is a multifunction device, not just a phone, that the calendar items would be off whenever you traveled to a new time zone, causing mass confusion. For this reason they left it to the user to manually change the time zone.

This makes little sense to me, in that it might cause me mass confusion to have my calendar appointment times off from the local time I am in! In any case, RIM could even leave it to the user by default, but at least put it there as an option to turn on.

My theory on this, using no background information at all, is that RIM simply does not trust the wireless carrier networks to provide an accurate time zone setting from their networks. If one of the carriers passes out bad time zone data and messes up someone's entire calendar then RIM would be blamed. That is just a theory on my part, but that fear does not necessitate leaving out even the *option* of enabling this feature on the device.

So at WES a few months back I came armed with a couple of questions I was going to get answers to... one of them being this issue. The people I talked to - who granted were not the go to handheld development people - did not express any sort of justification for leaving the functionality off. In fact, they expressed the idea that "Yeah, that makes sense, we should look into enabling a feature like that."

There must be a back story I am missing here, there must have been some conversation about this at some conference table at some time in the 5 years since the first BlackBerry phone was released. Can someone enlighten me... please?

Monday, July 23, 2007

Dedicated Attachment Service Gotcha

Early on we decided to separate our attachment service onto its own dedicated server in order to offload that work from the BES servers. So now all 3 production BES servers point to this one attachment server.

One thing I noticed early on is that the BES servers cannot re-attach to the attachment server automatically. That is, if you restart the attachment server/service, attachment decoding will fail for all BlackBerry users until the associated BES task is restarted on all the associated BES servers.

The connection to the remote attachment server is handled within the BES Domino task, and the first time it connects upon startup, the BES task creates a special ID associated with that server's connection. When the attachment server is restarted, that connection ID no longer exists and the attachment server will reject decode requests from the BES server.

It is a simple fix (restarting BES task), but disruptive and can cause headaches during our monthly maintenance window - I have to always remember to restart the attachment server first before any of the BES servers are restarted.

This issue occurred in the 4.0 environment; it would be interesting to see if it still occurs in the 4.1 environment. When I find out I will post an update, but I haven't seen any SDRs addressing this issue as of yet.

One more thing to watch out for in a complex distributed BES environment!

UPDATE 7/24/07: It appears that this issue was addressed in 4.1 SP4, so no more reboot juggling required!

Wednesday, June 20, 2007

Thinking of removing your DST software config? Read this...

Thinking that this was no longer a large issue, I nonchalantly removed the DST software config from ~1300 users last week. A few hours later I got some inquiries from the regional offices to the effect of:

"We had some people call about receiving a 'Permission's restrictions need to be updated, do you want to restart your device now or later' prompt on their handheld. They accepted it and the device restarted, but I had never seen this before, what is it?"

This is the "Oh s***" moment, when you realize that something seemingly innocuous had directly impacted each and every user you modified in bulk.

You could say this was pretty minor, their functionality was fine and all they needed to do was restart the device, but that type of stuff doesn't fly in our environment. Everything that directly impacts the user requires a change control approval process.

Time to move the function "unapply software configuration" from daily administrative task to change-controlled lockdown!

Apparently this happens when you initially followed RIM's recommendation and created a specific software policy for DST which allowed external communications. Removing this software policy reverts it back to some default policy which requires a restart, hence the user prompt.

Oh well, live and learn. (...and share the pain!)

Tuesday, June 19, 2007

Lotus Domino TimeDate Structure

This structure is used in many places within Domino, such as creation / modification times as well as database and replica ID's. Here is a diagram showing just what nuggets of interesting information are stored in this critical Notes data structure:
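For those who prefer code to diagrams, here is a rough decoder. Treat the layout as an assumption on my part, recalled from the Notes C API description of TIMEDATE rather than taken from this post: two 32-bit words, the first holding hundredths of a second since midnight GMT, the second holding a Julian day number in its low 24 bits and time zone / DST flags in its top byte. The function name is made up, and the top byte is returned raw rather than interpreted.

```python
from datetime import date, time

# Difference between a Julian Day Number and Python's date ordinal
# (JDN 2451545 is 2000-01-01, whose ordinal is 730120).
JDN_MINUS_ORDINAL = 1721425


def decode_timedate(innards0: int, innards1: int):
    """Split the two TIMEDATE words into (date, GMT time of day, raw top byte).

    Assumed layout: innards0 = hundredths of a second since midnight GMT,
    innards1 = Julian day in the low 24 bits, zone/DST flags in the top 8 bits.
    """
    julian_day = innards1 & 0x00FFFFFF
    top_byte = (innards1 >> 24) & 0xFF
    day = date.fromordinal(julian_day - JDN_MINUS_ORDINAL)
    hours, rem = divmod(innards0, 360000)
    minutes, rem = divmod(rem, 6000)
    seconds, hundredths = divmod(rem, 100)
    tod = time(hours, minutes, seconds, hundredths * 10000)
    return day, tod, top_byte
```

Check the result against a document whose creation time you already know before trusting it for anything important.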


Tuesday, June 5, 2007

Analysis: "Transaction Error: Failure at Service"

Had this error pop up on a handheld today when trying to send from the device, got the dreaded red X. Receiving messages, however, was fine. Usually this boils down to one of a couple of issues:

1) BES server issue
2) Carrier issue
3) Service Book issue

Since the BES was fine and the carrier seemed to be OK for other users, it must have been a Service Book issue. Tried deleting and then undeleting the Desktop [CMIME] service book, didn't work. Tried deleting the Desktop [CMIME] service book, then pushing service books from the BES. Didn't work. Since lookups were also having a problem, I theorized that the odds of 2 service books (lookups are the Desktop [ALP] service book) getting corrupted were pretty small, so time to move on to the next possible culprit.

To take the carrier out of the picture, I hooked the device up to USB so that the messages would be sent using the BlackBerry Router over USB / LAN, totally bypassing the wireless carrier and thus getting that variable out of the picture.

Well I still got the Red X. Hmmm.... but this time it did not specify "Failure at Service". This time the sub-error was "Decryption Failure". Interesting: so now we are getting down to the root of the problem and I can definitely say it is not the carrier or the service books.

So I check the Dispatcher logs, since it is responsible for encryption/decryption to the device, and voila:

[30368] (06/05 10:41:18):{0xFC0} {User} Packet has been delivered to device, Tag=4758107
[40700] (06/05 10:41:44):{0xD8} {User} Receiving packet from device, size=127, TransactionId=1429352292, Tag=259213, content type=ITADMIN, cmd=0x3
[20209] (06/05 10:41:44):{0xD8} {User} DecryptDecompress() failed, Tag=259213, Error=604
[40275] (06/05 10:41:44):{0xD8} {User} Sending transaction error to device for transaction 1429352292, size=46, TransactionId=-966799068, Tag=4758202
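Scanning a Dispatcher log for these failures by hand gets old fast, so here is a minimal sketch for pulling them out. It assumes the comma-separated key=value style seen in the lines above holds throughout the log, and the function names are my own.

```python
def extract_fields(line):
    """Return the key=value pairs found in the comma-separated parts of a line."""
    fields = {}
    for part in line.split(", "):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def decrypt_failures(lines):
    """Return the fields of every 'DecryptDecompress() failed' entry."""
    return [
        extract_fields(line)
        for line in lines
        if "DecryptDecompress() failed" in line
    ]
```

Each result carries the Tag and Error values, so you can match a failure back to the "Receiving packet from device" line with the same Tag.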

So now I know it is an encryption issue, so let's generate a new key:

[40700] (06/05 11:05:13.616):{0x1468} {User} Receiving packet from device, size=225, TransactionId=1429352316, Tag=7143787, content type=OTAKEYGEN, cmd=0x3
[30222] (06/05 11:05:13.616):{0x1468} {User} MFH: contentType=OTAKEYGEN, sizeOTA=180, sizeOTW=180, TransactionId=1429352316, Tag=7143787
[30308] (06/05 11:05:13.616):{0x1468} [BIPPa] {User} Forwarding data to BES Agent (S45776843), size=218, intTag=1911024, Tag=7143787
[30311] (06/05 11:05:13.616):{0x1438} {User} Forwarding status to relay, intTag=1911024, Tag=7143787, Status=1
[40279] (06/05 11:05:13.616):{0x1438} {User} SubmitToRelaySendQ, Tag=7143787
[30222] (06/05 11:05:13.616):{0x1288} {User} MTH: contentType=OTAKEYGEN, sizeOTA=14, sizeOTW=14, TransactionId=-2070675455, Tag=6675317
[30310] (06/05 11:05:13.616):{0x1288} {User} Forwarding internal data to device, contentType=OTAKEYGEN, routing=S45776843, device=23F7A64F, size=55, cmd=0x3, ack=0, TransactionId=-2070675455, intTag=7486040, Tag=6675317, Submit=1
[40279] (06/05 11:05:13.616):{0x1288} {User} SubmitToRelaySendQ, Tag=6675317

Well it looks like it didn't do much, let's go back to the MAGT log and see what it did during this same time period:

[40076] (06/05 11:05:13.616):{0x920} {User} SendStatusToWirelessNetworkUsingSRP SendQ, TID=1911024
[40400] (06/05 11:05:13.616):{0x920} {User} Received datagram with content type OTAKEYGEN, TID=1911024, for user User
[40386] (06/05 11:05:13.616):{0x920} {User} {User} Sending message to device, Size=49, TID=7486040, TransactionId=-2070675455
[40081] (06/05 11:05:13.616):{0x920} {User} SendToWirelessNetworkUsingSRP - SendQ, TID=7486040
[30591] (06/05 11:05:13.616):{0x920} {User} {User} *** Activation *** transaction aborted.


OK, so it looks like when trying to generate an encryption key from the device, it goes through a mini Activation but then aborts it, I am guessing because it can't establish encryption with the device. So it's a catch-22.

Let's hook it up to the USB cable and provision the device & see what happens. Theoretically this should just do the same things - resend the service books and regenerate the key - but do it over the wire. Let's look at the MAGT logs:

[30615] (06/05 11:09:33.130):{0x920} {User} Received RPC command [ProvisionUser] (id=1666)
[40000] (06/05 11:09:33.130):{0x920} {User} Synchronizing user configuration
[40000] (06/05 11:09:33.130):{0x920} {User} Synchronizing device capabilities
[40000] (06/05 11:09:33.161):{0x920} {User} ProvisionUser: Updated handheld capabilities info
[40000] (06/05 11:09:33.161):{0x920} {User} ProvisionUser: Handheld capabilities indicate AES encryption is supported
[40000] (06/05 11:09:33.161):{0x920} {User} ProvisionUser: Using key type AES
[40000] (06/05 11:09:33.161):{0x920} {User} Synchronizing AES encryption keys
[30000] (06/05 11:09:33.192):{0x920} {User} *** Activation *** User activated; triggering message pre-population
[30616] (06/05 11:09:33.192):{0x920} {User} RPC command [ProvisionUser] successful (id=1666)


Everything looks OK there, and now we can do lookups and send mail just fine.

What is different about the USB provisioning? Ideally I would like to be able to fix these types of problems OTA, in case someone is out of the office. I don't like the idea of having to hook up to the BES Manager and provision via USB to fix. Anybody out there know anything I could have tried OTA before hooking up to USB, without having to go through a full wireless enterprise activation?

Wednesday, May 30, 2007

BES 4.x Out of Office Agent Integration Breaks when Mailfile owner set to Editor in ACL (Update: Fixed in 4.1 SP5!!!)

Do you use the Out of Office functionality that is integrated into the BlackBerry? It's a wonderful feature, unless you have followed IBM's recommendations on security and limited your mailfile access to Editor.

How could this possibly affect the Out of Office agent usage from the BlackBerry you ask? Well let me tell you it took me quite awhile to figure this out. I'll start from the beginning.

The Out of Office agent is an agent that runs in your mailfile, and when enabled responds to people who send you mail while you are out. In versions of Domino prior to R6 "Manager" mailfile access was required in order to enable this agent. A specific field "$AssistFlags" in the agent was set to "E" to signify that the agent was enabled, and the "E" was removed from this field to signify that the agent was disabled.

For Domino R6 and later, however, a mechanism was put into place which allowed an organization to ratchet down user level access to Editor, which tightened up security and also reduced helpdesk calls about users deleting their own mailfiles and other ridiculous things that happen when users have too much power.

However, with Editor access users would no longer be able to enable / disable their out of office agents. A workaround was developed - a complex one which I will try not to go into here in too much detail - which allowed for the AdminP process to assign this access to the user. This involved adding a field called "$AssistFlags2" (creative, no?) to the agent.

The purpose of this flag was to tell R6 clients and above whether the agent was enabled or disabled, regardless of whether the "E" was in the older "$AssistFlags" field. New agent icons were added, so now instead of seeing only a checkmark to signify enabled or a red circle-X to signify disabled, you would now see an additional "check-5" icon.

This "check-5" icon signifies that the agent is enabled for R5 and below, but disabled for R6 and above! Yes I know this is ridiculous but that is how software coding is sometimes.

The fields would be set as follows (you can ignore the small "s" in $AssistFlags, I am not sure what that is for but appears to be unrelated to the agent status):

$AssistFlags = Es
$AssistFlags2 = D





These field settings mean that the agent is disabled for R6 and later, who know to check the $AssistFlags2 field for the real status, but enabled for prior versions that didn't know about the $AssistFlags2 field and ignored it, using the "E" in $AssistFlags.

The problem here is that the BES server acts as if it were an R5 or prior Notes client in this regard. When enabling or disabling the agent, it adds or removes an "E" from the $AssistFlags field, and completely ignores the $AssistFlags2 field.

Assuming the AdminP process has run properly the first time and setup your Editor level OOO agent access, the consequences of this are the following:

1) OOO is currently disabled, the agent icon shows a "check-5", and the fields are set to:

$AssistFlags = Es
$AssistFlags2 = D

2) A user enables the OOO agent from the BlackBerry. The BES server attempts to set the $AssistFlags field to "E", sees that it is already there, and then does nothing. The BES server does not, however, remove the "D" from $AssistFlags2, so we have the same result:

$AssistFlags = Es
$AssistFlags2 = D

The R6 Domino Mail server will use the "D" value and not run the agent, the Lotus Notes R6 client will also see the "D" value and show the agent as disabled, yet the BlackBerry shows that it is enabled. So we have a remote user who thinks the agent is activated, gets back into the office 2 weeks later, and discovers that it never sent a single message out. Bad!
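The mismatch is easier to see as a toy model. This is only a sketch of the behavior as I have described it, not actual Domino or BES code: the function names are made up, and the rule that R6 falls back to the "E" in $AssistFlags when $AssistFlags2 is absent is my assumption.

```python
def r5_view_enabled(assist_flags):
    """Pre-R6 logic: only the 'E' in $AssistFlags matters."""
    return "E" in assist_flags


def r6_view_enabled(assist_flags, assist_flags2):
    """R6+ logic: a 'D' in $AssistFlags2 overrides the older field.

    assist_flags2 is None when the field does not exist; falling back to
    $AssistFlags in that case is an assumption.
    """
    if assist_flags2 is not None and "D" in assist_flags2:
        return False
    return "E" in assist_flags


def bes_enable(assist_flags, assist_flags2):
    """BES enabling the agent: add 'E' if missing, never touch $AssistFlags2."""
    if "E" not in assist_flags:
        assist_flags = "E" + assist_flags
    return assist_flags, assist_flags2


# Starting state after AdminP has set up Editor-level access: agent disabled.
flags, flags2 = bes_enable("Es", "D")
print(flags, flags2)                   # fields are unchanged
print(r5_view_enabled(flags))          # the BES (R5-style) thinks it is enabled
print(r6_view_enabled(flags, flags2))  # the R6 server and client say disabled
```

The two views disagree after the BES "enables" the agent, which is exactly the situation the remote user ends up in.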

Another issue comes into play as well, in the following sequence of events:

1) OOO is currently disabled, the agent icon shows a "check-5", and the fields are set to:

$AssistFlags = Es
$AssistFlags2 = D

2) OOO is enabled by the Lotus Notes R6 client, which leaves the $AssistFlags field alone but removes the $AssistFlags2 field. This change now shows the agent icon as a "check", and the fields are set to:

$AssistFlags = Es
[$AssistFlags2 field deleted]




3) A user disables the OOO agent from the BlackBerry. The BES server removes the "E" from the $AssistFlags field, however it does NOT create/populate the $AssistFlags2 field with a "D" to tell R6 clients that the agent is disabled.

$AssistFlags = s
[$AssistFlags2 field still missing]




4) The next time the user enables the OOO agent using the Lotus Notes R6 client, an interesting thing will happen: it will kick off the AdminP process once again to enable the agent for Editor level access. You see the familiar blue text, and the enabling of the agent requires a delay to implement.

I believe it does this because the BES, when disabling the agent, for some unknown reason also clears the "Run on Behalf of" field, which is needed to allow the agent to run as the user:


In addition to this, when AdminP finally processes the request, grants the access, and enables the agent, the BES server does not pick up this change, for reasons which escape me at this point.

So the result of this sequence is an OOO agent that is enabled on the mail server, shows as enabled on the R6 Notes client, but shows as disabled on the BlackBerry device. This is also bad.


Here is what I believe needs to be done to resolve these issues:

1) Modify the BES server logic such that it acts like an R6 client, that is aware of the $AssistFlags2 field and modifies it instead of (or in addition to) the outdated R5 $AssistFlags field. Of course this might necessitate the minimum Domino mail server level be at R6 or higher, but since we are coming up on R8 and even R6 support is expiring, I don't see this as a huge issue.

2) Ensure the BES server logic does not clear the "Run on Behalf Of" agent field when disabling the OOO agent.

3) Ensure that changes made to the OOO agent by the AdminP process are picked up by the BES server, such that the Domino Mail Server/Notes Client and the BlackBerry device are always in sync as to the status of the OOO agent.


Until these issues are addressed, I see no easy way of programmatically blocking my users from using the OOO functionality on the BlackBerry device, which results in the above issues. My last resort is to send a mass email saying that the BlackBerry OOO feature (which people love) is broken and to always use the Notes client. Unfortunately this will result in higher helpdesk calls, since we always have people leave the office who forget to enable their OOO and end up calling us to enable it for them.

If anyone out there can shed additional light on this issue or possible workarounds, feel free to let me know. I am also actively working on this with RIM, and hope to have it recreated in their environment and logged as an SDR so that it may be addressed in a future service pack.

GPS Reporting to BES Server in 4.1 SP3 IT Policy

Upon upgrading to 4.1 SP3, I noticed that the IT Policy has a new Policy Group named "Location Based Services". Under this new group are the following options:



Disable BlackBerry Maps: (Default: False, Requires: Device code 4.0.0 and higher)

Enable Enterprise Location Tracking: (Default: False, Requires: Device code 4.2.1 and higher)

Enterprise Location Tracking User Prompt: (Default: "Your Location is now being tracked at the server", Requires: Device code 4.0.0 and higher)

Enterprise Location Tracking Interval: (Default: 15 Minutes, Requires: Device code 4.2.1 and higher)

I was very excited to see these built-in options, which allow tracking via GPS without having to purchase a third party solution. I enabled the tracking, leaving the interval set to 15 minutes but changing the apparently mandatory notification to "Please call (312) XXX-XXXX to return this device." This satisfies the requirement for some sort of notification without telling the user that their device location is being tracked. I named this policy "Device Tracker" because I use it only to locate stolen devices; it is not a general purpose policy.

I then applied this policy to a single user who uses a BlackBerry 8800 device at code level 4.2.1. Everything works great... except for the fact that I have no idea where the GPS information is logged on the BES server. I would suspect the SQL server in a device related table, but cannot find anything.

I have asked RIM and am awaiting a response... I guess they are not sure either where this information is stored. Once this tiny issue is sorted out I am going to have some fun with Google Maps mashups, making tracking lost or stolen devices that much easier.

Friday, May 25, 2007

BES Domino 4.0 / 4.1 Interoperability Issues - Updated 6/1/07

I upgraded one of my main BES servers to 4.1 SP3 HF1 last weekend, but left the other two at 4.0 SP6 HF4. Since the DB Schema was upgraded, we needed to upgrade all of our BB Manager workstations to 4.1 as well.

Now, we are seeing an issue where, using the locally installed 4.1 Manager, we cannot provision a new user added to one of the 4.0 servers using the USB cable. The service books get assigned but the rest of activation / synchronization / prepopulation does not kick off. Wireless activation of new users on the 4.0 servers also seems erratic.

BTW, this upgrade was performed after getting the word at WES that 4.0 and 4.1 servers sharing the same SQL database is OK, since they have many larger accounts than ours and they can't expect them to upgrade 10's of servers at the same time.

Well apparently it is OK for existing users but not newly added users! RIM's tech's response now is: Well it can work but it is not recommended, and the "workaround" is to upgrade the remaining servers to 4.1.

Update 6/1/07: Apparently the tech I spoke with on this issue was incorrect. RIM confirms that they fully support a mixed 4.0/4.1 environment, as they told me at WES just a few weeks ago. The issue with USB activation is due to an identified bug in the 4.1 SP3 BlackBerry Manager console, which is fixed in the upcoming 4.1 SP4 and can also be worked around by downgrading to the 4.1 SP2 Manager console. Thanks to RIM for clarifying this for me.

BlackBerry ID Numbers... MOST of them

BlackBerry devices are multifunction, convergence devices. This being the case, they have multiple methods for identifying the different parts that make up the device. Here is a guide to the different serial / ID numbers used by the different components that make up the BlackBerry device:

Model: 4 digits + letter, tied to hardware, identifies hardware type (digits / letter) & carrier (letter)

PIN (Personal Identification Number): 8 Hex characters, tied to hardware, used for routing within the RIM network

IMEI (International Mobile Equipment Identity): 15 digits, tied to hardware, used to identify a GSM device

Phone Number: Variable, tied to SIM card, used for voice calling (duh)

IMSI (International Mobile Subscriber Identity): up to 15 digits, tied to SIM card, used to identify a particular carrier account (the 19 or 20 digit number printed on the SIM card itself is the ICCID, not the IMSI)

FCC ID (Federal Communications Commission ID): 9 alpha digits (starts with L6A for RIM), tied to hardware, used to register wireless device with government agency

BT MAC (Bluetooth Media Access Control): 6 Hex pairs, tied to hardware, used to identify Bluetooth personal area network device
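
When cleaning up an inventory spreadsheet, the formats above can be checked mechanically. Below is a minimal Python sketch for three of them. The function names and sample values are made up for illustration, the colon separator for the Bluetooth MAC is an assumption about how the address was written down, and the Luhn check on the last IMEI digit is my addition rather than something from the list above.

```python
import re

# Formats taken from the list above; sample values further down are invented.
PIN_RE = re.compile(r"[0-9A-Fa-f]{8}")                     # 8 hex characters
IMEI_RE = re.compile(r"[0-9]{15}")                         # 15 digits
BT_MAC_RE = re.compile(r"[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}")  # 6 hex pairs


def looks_like_pin(value):
    return PIN_RE.fullmatch(value) is not None


def looks_like_bt_mac(value):
    return BT_MAC_RE.fullmatch(value) is not None


def luhn_ok(digits):
    """Luhn checksum: double every second digit counting from the right."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_imei(value):
    # Length/charset check first, then the check digit
    return IMEI_RE.fullmatch(value) is not None and luhn_ok(value)


print(looks_like_pin("2100000A"))
print(looks_like_imei("490154203237518"))
print(looks_like_bt_mac("00:1C:CC:12:34:56"))
```

A value that fails one of these checks is most likely a typo or a number pasted into the wrong column.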

Missing Device Phone Number fields in BB Manager

This is a simple question that has devolved into a mini research project (like so many do!).

We are trying to reconcile carrier billing info with active BlackBerry users, and are finding many SIM card accounts we are paying data plans for but we don't know who is using the device or where it is.

I use the BB Manager to find devices by carrier phone number, which is easy most of the time. However, many accounts (229 out of 1905, or about 12%) do not have the "phone number" field populated in BB Manager; these are the accounts I am trying to track down.

Results so far:

1. I have discovered that the "My Number:" field at the top of the Phone app pulls directly from the SIM card's "Phone number" field. If present in the Phone app display, the "My Number:" field then populates the BB Manager account "Phone Number" field in the SQL database.

2. Interestingly, this field can be user edited via Options / Advanced / SIM Card / Edit SIM Phone Number. Also interestingly, it appears to be a display field only and does not change the phone number that the SIM card actually uses.

3. Normally this SIM Card "Phone Number" field is populated by the carrier during the provisioning process, before they ship the SIM Card out. In some cases, however, they forget this step, which results in a missing SIM card field and thus an "Unknown Number" on the Phone app display. The SIM & voice calling still work just fine, since the true number is encoded into the SIM and is unrelated to the "Phone Number" SIM display field.

4. It is easy enough for the end user to edit and add their phone number if nonexistent, or change it if the number associated with the SIM is changed by the carrier.

The issue for me is that I don't want to try to contact >200 users and have them run through a manual process of adding this field with the proper number. I am looking for an automated way of pushing it out. There doesn't appear to be any hack for this, so I will have to contact the wireless carriers and see if they have a solution.
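
The reconciliation itself is easy to script once both lists are exported. Here is a rough Python sketch, assuming the carrier bill has been reduced to a list of phone numbers and the BB Manager data to (user, phone number) pairs. Both input shapes, the function names, and the sample data are hypothetical; I have not looked at what export formats are actually available.

```python
def digits_only(number):
    """Strip punctuation so '(312) 555-0101' and '3125550101' compare equal."""
    return "".join(ch for ch in number if ch.isdigit())


def reconcile(carrier_numbers, bes_accounts):
    """Return (users with no phone number, carrier numbers with no BES match)."""
    known = {digits_only(num) for _user, num in bes_accounts if num}
    no_number = [user for user, num in bes_accounts if not num]
    unmatched = sorted({digits_only(n) for n in carrier_numbers} - known)
    return no_number, unmatched


# Invented sample data, for illustration only
carrier_bill = ["(312) 555-0101", "312-555-0102", "312 555 0103"]
bes_export = [("alice", "3125550101"), ("bob", None), ("carol", "312.555.0103")]

no_number, unmatched = reconcile(carrier_bill, bes_export)
print(no_number)
print(unmatched)

# Share of accounts with the field missing, using the counts quoted above
print(round(229 / 1905 * 100, 1))
```

Note that this naive comparison would treat a number with a leading country code as different from the same number without one, so real data may need a bit more normalisation.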

Monday, April 16, 2007

BES SQL Permissions Explained - Part 3

In this post we will discuss the necessary SQL permissions to allow administrative access to the BES database, as well as the new role based administration features in 4.1.

First, let's create a new SQL login named 'bbadmin' with no server roles or database access or database roles.


Now, what happens when we try to use this login and password from a remote BlackBerry Manager console?


[20000] (04/16 11:50:20.563):{0x150} COM Error 0x80004005 in ConnectionItem::ConnectToDB() - Cannot open database requested in login 'BESMgmt'. Login fails. - Unspecified error (connection string- -Provider=SQLOLEDB;Server=SQLTEST000;Database=BESMgmt;uid=bbadmin;pwd=)

So now let's give Permit access to the BES database.


And once again...


[20000] (04/16 11:54:34.257):{0x4E0} [ODBCRecord::DoGetFirstValue] SQL error: [0x80040E09 SELECT permission denied on object 'ObjectDefn', database 'BESMgmt', owner 'dbo'.] Source: [Microsoft OLE DB Provider for SQL Server] SQL State: [42000] NativeError: [229]

Same error popup, but the logs tell a different story. We get into the database but then have permissions issues when trying to do SQL calls. Per the RIM documentation, administrators also need the 'rim_db_bes_server' database role in order to do their work. So let's follow their advice and add that role to the account:

And voila, we are in!


Ummm, or are we? Everything is empty here. Not much to manage! Interesting that we have 'Unknown Authority' listed at the top. Although the 'rim_db_bes_server' role was sufficient in v4.0 to manage the server, apparently things have changed in 4.1.

New database roles were added in order to avoid giving full access to anyone that needed to do anything on the server. Let's take a look at them:


These are all the roles; the ones that have been added in 4.1 all start with 'rim_db_admin...'. We can also ignore the 'audit' roles, which are read-only and for training purposes only. That leaves us with:

rim_db_admin_jr_helpdesk
rim_db_admin_sr_helpdesk
rim_db_admin_handheld
rim_db_admin_enterprise
rim_db_admin_security

Let's first remove our legacy 'rim_db_bes_server' role, then add the 'rim_db_admin_jr_helpdesk' role and see what happens.


Excellent, we have access now, and it even shows our authority level correctly at the top. Let's move up to the 'rim_db_admin_sr_helpdesk' role and see if there is a difference.


Hmmm, the only obvious change is in the title bar, which reflects our promotion to Sr. Helpdesk. Maybe there are some tasks now available in the menus, but nothing obvious. Let's move up to the 'rim_db_admin_handheld' role.


Now we are getting somewhere, I can see a whole new tab 'Software Configurations' available. Perhaps there are other subtasks available too. Let's go ahead and switch to the 'rim_db_admin_enterprise' role.


The only difference now appears to be the addition of the MDS service under the 'Servers' leaf of the Explorer View tree on the left. Perhaps we are now able to manage MDS with this new authority. How about we now go to the ultimate, 'rim_db_admin_security'?


Wow, now I can see everything plus the Role Administration tab. I guess now I can add people and assign/remove roles without having to use the SQL administration tools. Let's try it out! Why don't we first list the administrators that are in the 'rim_db_admin_security' role? It should just be ourselves, the 'bbadmin' account.

Uh oh. I thought we had the top level of access, what is going on here? I can't even list the members of a role? Something is messed up.

Oh, wait. I seem to recall RIM saying that System Administrator (sysadmin) access on the SQL server is needed to perform role-based administration. Ummm, well that is the highest level of access available on the SQL server, so I am sure the SQL team is going to be pretty uncomfortable with that, but it makes sense that we would need this access in order to manage SQL logins and database roles through the front end of the BlackBerry Manager, which is really what we are doing after all. So let's go ahead and add 'sysadmin' server role to this account and try again.


OK, so we have full access now, at the expense of an angry SQL admin team. I guess I can forgo role based admin via BlackBerry manager, and have the SQL guys remove the 'sysadmin' server role. That just means I will have to work with them to add logins & roles in the future, so I will be sure to get on their good side!
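
If you end up handing this to the SQL team, it may help to give them the exact statements rather than a description of clicks. The Python sketch below just builds a T-SQL script as text, using the standard SQL Server 2000 stored procedures sp_addlogin, sp_grantdbaccess, sp_addrolemember and sp_addsrvrolemember. I did everything above through the GUI tool, so treat this as an untested equivalent; the function name and the sample password are placeholders.

```python
def admin_login_script(login, password, database, db_role, sysadmin=False):
    """Build T-SQL that creates a login and grants one BES database role.

    Assumes a simple database name without ']' characters.
    """
    def quote(text):
        # T-SQL string literal with embedded single quotes doubled
        return "N'" + text.replace("'", "''") + "'"

    lines = [
        "EXEC sp_addlogin @loginame = %s, @passwd = %s" % (quote(login), quote(password)),
        "USE [%s]" % database,
        # Equivalent of ticking 'Permit' on the Database access tab
        "EXEC sp_grantdbaccess @loginame = %s" % quote(login),
        "EXEC sp_addrolemember @rolename = %s, @membername = %s"
        % (quote(db_role), quote(login)),
    ]
    if sysadmin:
        # Only needed for role administration from the BlackBerry Manager
        lines.append(
            "EXEC sp_addsrvrolemember @loginame = %s, @rolename = N'sysadmin'" % quote(login)
        )
    lines.append("GO")
    return "\n".join(lines)


print(admin_login_script("bbadmin", "ChangeMe", "BESMgmt", "rim_db_admin_jr_helpdesk"))
```

The output can be saved to a file and run with osql by someone who has the rights to create logins; leaving sysadmin=False matches the compromise described above.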

Friday, April 13, 2007

BES SQL Permissions Explained - Part 2

In the last post we discussed the SQL Server roles required to create and configure the BES database on the SQL server. In this post we will discuss the database permissions required by the login the BES server uses.

First let's go back to our DbaMgr2k console and pull up the properties of the bbdbowner login, which was the SQL login we used to create the BESMgmt database. Now take a look at the Database access tab:



We can deduce a couple of things from this dialog:

1) Our SQL login 'bbdbowner' has 'Permit' checked for access to the BESMgmt database.

2) Our SQL login 'bbdbowner' is associated with the auto-created 'dbo' user of this database. I still don't understand why you need separate database accounts linked to SQL logins, but hey, I'm a newbie at this.

3) Our SQL login 'bbdbowner' has the 'public' database role assigned. This is a built-in SQL role allowing basic database access.

4) Our SQL login 'bbdbowner' has the 'db_owner' database role assigned. This is a built-in SQL role allowing all operations on the database. I guess we were assigned this role since we created the database.

Now in most installations the same SQL login is used to:

a) Initially create the database
b) Authenticate BES servers
c) Authenticate BES admins via the BlackBerry Manager console

In order to clarify just what roles require what permissions, we will break out these separate functions into separate SQL logins. We already have the 'bbdbowner' login which corresponds to a) above, the initial creation of the database.

Let's now create a login 'bbserver' and see what permissions it needs to allow a BES server to operate. First we will give it no permissions at all:


Then we will use the BlackBerry Server Configuration panel to change the SQL authentication from the 'bbdbowner' account to the new 'bbserver' account:


Note: Using the above utility only changes the SQL login credentials used by the BlackBerry Windows services, which are located in:

HKEY_CURRENT_USER\Software\Research In Motion\BlackBerry Enterprise Server\Database

To change the SQL login used by the Domino BES task we need to also modify the 'Login' and 'Password' values located in:

HKEY_USERS\.DEFAULT\Software\Research In Motion\BlackBerry Enterprise Server\Database

And here is the result when we click the 'Test SQL Server Connection' button:


OK, I guess we need to add some permissions. Let's start with at least adding 'Permit' access to the BESMgmt database...



...and try again:


OK, now that's a little better. Let's try to start some BES services now and see what happens:


Oops, I am getting an error code 5608 when trying to start this service. Interestingly, some other services started without error. They must not need access to the SQL database. Let's take a look:

BlackBerry Alert - Started Successfully
BlackBerry Attachment Service - Started Successfully
BlackBerry Controller - Started Successfully
BlackBerry Dispatcher - Failed with error code 5608
BlackBerry MDS Connection Service - Started, then stopped a few minutes later
BlackBerry Policy Service - Failed with error code 5003
BlackBerry Router - Started Successfully
BlackBerry Synchronization Service - Failed with error code 5203

What happens when we try to start the BES Domino task? Let's see...

> load bes
> Cannot start database notification system
04/13/2007 02:36:21 PM Error initialising BES. Terminating.
04/13/2007 02:36:21 PM Shutting down Performance Monitoring and SNMP
04/13/2007 02:36:21 PM BlackBerry Mailbox Agent for Lotus Domino shutdown complete


If we look in the Event Viewer we can see many entries which refer to 'permission denied', such as this when starting the BES task above:

ReportEnvironmentDB::ReportDBSchemaVersion: COM Error 0x80040E09 - IDispatch error #3081 - Source: "Microsoft OLE DB Provider for SQL Server" - Description "SELECT permission denied on object 'ServerDBVersion', database 'BESMgmt', owner 'dbo'." - Command ""

...so I guess we need to go a little further than just 'Permit' & 'public' role access to the BESMgmt database. If we look further down the database role list, we can see that in addition to the built-in database roles, the BES database installation script created some custom roles as well. One of these custom roles is called 'rim_db_bes_server', so let's assign that role as well, since it just makes sense.

04/13/2007 02:38:40 PM BlackBerry Mailbox Agent for Lotus Domino started

Ah, that's much better. The other Windows services started up just fine too. So we can determine from our little experiment that the following minimum SQL permissions are required in order for a BES server to operate:

- SQL Server roles: None required
- BESMgmt database roles: rim_db_bes_server
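
When chasing these errors it is handy to pull out exactly which objects were refused. A small Python sketch follows, assuming the log or Event Viewer text has been saved to a file and that the message wording matches the entries quoted in this post; the function name is my own.

```python
import re

# Matches the message text seen in the errors quoted in this post
DENIED_RE = re.compile(
    r"(?P<permission>[A-Z]+) permission denied on object '(?P<object>[^']+)', "
    r"database '(?P<database>[^']+)', owner '(?P<owner>[^']+)'"
)


def denied_objects(lines):
    """Return one dict per 'permission denied' message found in the lines."""
    found = []
    for line in lines:
        match = DENIED_RE.search(line)
        if match:
            found.append(match.groupdict())
    return found


sample = [
    "ReportEnvironmentDB::ReportDBSchemaVersion: COM Error 0x80040E09 - "
    "IDispatch error #3081 - Source: \"Microsoft OLE DB Provider for SQL Server\" - "
    "Description \"SELECT permission denied on object 'ServerDBVersion', "
    "database 'BESMgmt', owner 'dbo'.\" - Command \"\"",
    "04/13/2007 02:38:40 PM BlackBerry Mailbox Agent for Lotus Domino started",
]
for hit in denied_objects(sample):
    print(hit["permission"], hit["object"], hit["database"])
```

A list of refused objects is a more useful thing to show a DBA than "it doesn't start".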

Next: Minimum permissions for remote BES administration, and role-based security in 4.1

Thursday, April 12, 2007

BES SQL Permissions Explained - Part 1

In the last post we talked about setting up a test environment to mimic a production BES / SQL environment. In this post we will talk about the actual SQL server permissions required to create or upgrade a BES management database.

If this is an initial install, then the BES management database will most likely be automatically created and configured via the BES installation program. However, there is a pre-requisite: If using SQL authentication like we are, we need to have a SQL login defined on the SQL server with the appropriate permissions.

Using our DbaMgr2k tool, we first:

1. Double click on our SQL Server to open the Connection window. Type in the password for the default 'sa' account and click the checkbox to 'Save password'. We are using SQL authentication so you can leave the 'Trusted NT Connection' blank.



2. If you click on Logins you can see that there are only 2 users setup by default. One is the 'sa' account we used, and the other is the NT Builtin\Administrators group. Although we are using SQL authentication this account should not be removed.



3. Right click on Logins and select New Login. This new user will be the account we use in the BES installation to create and configure the BES database. Be sure to type in a password and click 'Save' before configuring any other settings; a little quirk of our free tool.



4. Click on the 'Server Roles' tab and check the 'serveradmin' role and the 'dbcreator' role. These roles give the login account permission to create new databases and do other server wide tasks. For now we can ignore the 'Database access' tab, as there is no BES database created yet. Click 'Save' and then 'Close' to save and exit.



5. Now from our BES install, we connect to the SQL server using these credentials, and we are prompted to create the database.







Just for fun, let's see what happens when the SQL login does not have the 'serveradmin' role assigned. Of course, we get the dreaded error message:



But what actually fails? Let's check the installer logs to find out:

[30000] (04/12 14:55:19.192):{0x33C} SQL being executed:
EXEC sp_addmessage @msgnum = 60002, @severity = 16,
@msgtext = N'Unable to add new BlackBerry Agent "%s" AgentID=%d Machine Name=%s',
@lang = 'us_english'

/* BESRouter errors */
IF EXISTS (SELECT error FROM sysmessages
WHERE error = 60103)
EXEC sp_dropmessage @msgnum = 60103, @lang = 'us_english'
[20000] (04/12 14:55:19.192):{0x33C} SQL Error Message from CBESDBInstaller::ExecuteSql.executeDirect: SQLSTATE: 42000 Native error: 15247 Message: User does not have permission to perform this action.

Looks like the DB installer script was unable to add custom BES error messages to the SQL server.

How about the other way around, the SQL Login has 'serveradmin' permissions but not 'dbcreator'?

[20000] (04/12 14:49:17.952):{0x12C} SQL Error Message from CBESDBInstaller::ExecuteSql.executeDirect: SQLSTATE: 42000 Native error: 262 Message: CREATE DATABASE permission denied in database 'master'.

In this case the DB installer script could not even create the BES DB and failed immediately. Looks like we need both roles in order to create or upgrade the database.
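
The two failures above can be told apart by the native error number in the installer log. Here is a minimal Python sketch that maps them back to the role that was missing in my test. The mapping only reflects these two experiments; the same SQL error numbers can show up for other reasons, so treat the result as a hint, not a diagnosis. The names are hypothetical.

```python
import re

# Native error -> server role that was missing when I saw it in this test
ROLE_HINTS = {
    15247: "serveradmin",  # custom error message step (sp_addmessage / sp_dropmessage) refused
    262: "dbcreator",      # CREATE DATABASE refused
}

NATIVE_ERROR_RE = re.compile(r"Native error: (\d+)")


def missing_role_hint(log_line):
    """Return the likely missing server role, or None if nothing matches."""
    match = NATIVE_ERROR_RE.search(log_line)
    if match is None:
        return None
    return ROLE_HINTS.get(int(match.group(1)))


line_a = ("SQL Error Message from CBESDBInstaller::ExecuteSql.executeDirect: "
          "SQLSTATE: 42000 Native error: 15247 Message: User does not have "
          "permission to perform this action.")
line_b = ("SQL Error Message from CBESDBInstaller::ExecuteSql.executeDirect: "
          "SQLSTATE: 42000 Native error: 262 Message: CREATE DATABASE permission "
          "denied in database 'master'.")
print(missing_role_hint(line_a))
print(missing_role_hint(line_b))
```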


Now that we have the database created and understand the SQL Server roles required, the next post will examine the database roles required for daily server functioning and administrative tasks.

BES SQL Permissions Explained - Part 0

I work for a large law firm, so each department is separate, even within IT. When we upgraded from 2.2 to 4.0 I needed to work with our SQL folks to setup and host the new SQL DB for the BES servers.

This coordination can be difficult, especially since I know little about the operation of SQL server and had difficulty communicating what I needed. It doesn't help that RIM's documentation about SQL permissions is conflicting in many cases, and dependent upon which version of 4.x you are running, as well as which components are being installed. Ugh!

This post will attempt to decode the mystery of SQL server permissions for a BES Domino environment. Here in part zero (zero b/c I don't actually talk about permissions yet!) I will discuss setting up a test environment for exploring how these parts work together to begin with.

In order to get the skinny on all of this, I had to hack together a test environment that mimicked a production BES / SQL install. I did this by:

1) Installing the free Microsoft Virtual PC software. (Free!)
2) Creating 3 Win2K SP4 virtual machines; two BES 4.0 servers and a SQL server

I used Win2K b/c there are no messy issues with online activation; I reused some old keys I had written on the CDs. Since this is a test environment I don't feel too bad about that.

Not having access to the SQL server code or management tools, however, I had to improvise, which was an exercise in itself. From the BES 4.1 installation media, I installed the MSDE database engine (tools\SQLRun01.msi) on my "SQL" virtual server. I then performed the following to turn it into an almost gen-u-ine SQL server:

1) Change the default 'sa' password from NULL to something else

To do this you need to use the command line OSQL tool which allows you to run SQL scripts from a command prompt. For this I used the following commands:

osql -E
1> exec sp_password @old = null, @new = 'newpassword', @loginame = 'sa'
2> go
Password changed.
1> exit

This runs a Stored Procedure called 'sp_password' on the SQL server itself to change the password. Make sure to spell 'loginame' with only one 'n'!

2) Change the authentication mode from default of "Windows" authentication to "Mixed" authentication.

Note that you can have two authentication modes in MSDE / SQL:

- NT only using AD or local Windows users / groups (Windows)
- NT + SQL where logins are managed with SQL itself (Mixed)

Note that you cannot have just SQL authentication by itself.

What this change does is allow native SQL logins to access the SQL server, so I don't have to worry about ugly NT users / groups / permissions. I find SQL authentication easier and cleaner to manage and make sense of in my head. For this we need to:

- Stop the MSSQLSERVER service
- Change HKLM\Software\Microsoft\MSSqlserver\MSSqlServer\LoginMode from 1 to 2
- Restart the MSSQLSERVER service

3) Enable network access to the "SQL" server

By default MSDE is only useful for applications local to the machine it is installed on. Changing this allows remote machines on the network to access the MSDE SQL server just like its big brother.

- Stop the MSSQLSERVER service
- Open command prompt
- Run 'svrnetcn'
- Click 'TCP/IP' and click 'Enable'
- Click 'Named Pipes' and click 'Enable'
- Click OK
- Restart the MSSQLSERVER service

4) Test login over the network.

- Copy the OSQL.EXE utility to another machine on the network
- From a command prompt, run OSQL -S [hostname or IP] -U sa -P newpassword
- You should get a '1>' prompt, meaning you are connected to the server.
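
If the OSQL login fails, it helps to know whether the TCP/IP listener is reachable at all before blaming the password or the authentication mode. Below is a small Python sketch of such a check. It assumes the server is listening on TCP port 1433, the usual default for a default instance; your MSDE install may be configured differently, and this does not test Named Pipes. The function name is hypothetical.

```python
import socket


def sql_port_open(host, port=1433, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False


# Example use from another machine on the network
# (replace the placeholder with your own hostname or IP):
#   print(sql_port_open("your-sql-server"))
```

A False here points at the network library settings from step 3 or at a firewall, not at SQL permissions.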


OK, so now we have a working, network accessible SQL server. How do we manage it without learning all the arcane OSQL syntax and SQL stored procedures? Fortunately there is a free management tool out there called "DBAMgr2K". Google this name and you should be able to find the website and download it.

It is basically a free, simple version of SQL Enterprise Manager for use with MSDE databases. It is great because it acts and displays very similarly to the real SQL Enterprise Manager, so when I was comfortable operating this tool, I could tell the SQL guys exactly what to do in EM to get the same result in our production environment.

Next up: Permissions details for BES SQL databases

Monday, March 19, 2007

Decoding BlackBerry Server NSD Logs

Even though I've tried to optimize my environment to the best of my ability, I will still occasionally see the dreaded Fault report that one of my BES servers has crashed. If this happens to you, here is how you can decode the NSD log that Domino creates and track it back to what BES was doing at the time. In some situations - not all - this can help you figure out what is going on and resolve it yourself instead of waiting for a log file analysis from RIM.

For example, I just happen to have an NSD log from a crash that happened this morning. These crash files can be found in the "Lotus\Domino\Data\IBM_TECHNICAL_SUPPORT" directory of your data partition. The NSD log file is named something like this:

nsd_W32I_[YOUR SERVER NAME]_2007_03_18@21_15_02.log

In my experience, most of the relevant information can be gleaned from the first page full of information. The first thing we want to look for is the following line right near the top:

Arguments : "c:\lotus\domino\nsd.exe" -dumpandkill -termstatus 5 -crashpid 4592 -crashtid 3252

The most important information is the "-crashtid 3252" portion, which indicates that the thread ID that caused the crash is 3252 decimal. Well if you remember, the BlackBerry log files record the thread ID for every operation, however they are encoded in Hex. So using calc we convert 3252 decimal = CB4 Hex.

With the thread ID known, let's parse the BES MAGT log for the thread ID CB4. I use the line "grep 'xCB4' [logfile name]". The final line of this thread process is this:

[40000] (03/18 20:42:38):{0xCB4} CN=XXXX/OU=XXXX/O=XXXX!!mail\XXXX.nsf, fetching modified documents since 03/18/2007 08:42:37 PM for user XXXX

Of course identifying information has been removed to protect the innocent, however this appears to be the last thing this thread did before the server crashed. I think we have our culprit mailfile.
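
The calc-and-grep steps above are easy to script. Here is a minimal Python sketch that pulls the crashing thread ID out of the NSD "Arguments" line, converts it to hex, and keeps only the BES log lines tagged with that thread. It assumes the log formats look like the excerpts in this post; the function names and the second sample log line are made up for illustration.

```python
import re


def crash_thread_hex(nsd_arguments_line):
    """Extract the -crashtid value and return it as uppercase hex (no 0x)."""
    match = re.search(r"-crashtid (\d+)", nsd_arguments_line)
    if match is None:
        raise ValueError("no -crashtid argument found")
    return format(int(match.group(1)), "X")


def thread_lines(log_lines, thread_hex):
    """Keep only log lines tagged with the given thread, e.g. {0xCB4}."""
    tag = "{0x%s}" % thread_hex.upper()
    return [line for line in log_lines if tag in line]


arguments = ('Arguments : "c:\\lotus\\domino\\nsd.exe" -dumpandkill '
             '-termstatus 5 -crashpid 4592 -crashtid 3252')
magt_log = [
    "[40000] (03/18 20:42:38):{0xCB4} CN=XXXX/OU=XXXX/O=XXXX!!mail\\XXXX.nsf, "
    "fetching modified documents since 03/18/2007 08:42:37 PM for user XXXX",
    "[40000] (03/18 20:42:39):{0xA10} some other thread doing something else",
]

tid = crash_thread_hex(arguments)
print(tid)
print(thread_lines(magt_log, tid)[-1])
```

The last line returned for the crashing thread is the one to look at first.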

So now I open the mailfile and look in the All Documents view, and discover there is a Sent Item from guess what time? That's right: 03/18/2007 08:42:37 PM. Could this be our culprit document? It appears to be a reply, initiated from the device, to an email that contains a large bitmap in the body.

To verify, I check the NoteID; it is "1C542". If I now do a search on "1C542" in the NSD file I find the following lines:

@[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,4080000,0,35b4eb34,17)
@[23] 0x6003d3b7 nnotes._NSFNoteOpen@16+55 (71,1c542,0,35b4f7d0)
[24] 0x0058ab01 nBES (6a033c,5f,71,1c542) - SYM FILE OUTDATED!
[25] 0x0058f2d4 nBES (1c542,0,0,d4683b3) - SYM FILE OUTDATED!
# 35b4f310 35b4f7d0 00000000 0001c542 0bd1a0e4 |...5....B.......|
@[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,4080000,0,35b4eb34,17)
# 35b4f354 35b4f380 6003d3b7 00000071 0001c542 |...5...`q...B...|
# 35b4f374 0001c542 35b4f7d0 00000010 35b4f5ac |B......5.......5|
@[23] 0x6003d3b7 nnotes._NSFNoteOpen@16+55 (71,1c542,0,35b4f7d0)
# 35b4f380 35b4f5ac 0058ab01 00000071 0001c542 |...5..X.q...B...|
# 35b4f560 35b4f5c4 0058f993 0001c542 35b4f7c8 |...5..X.B......5|
[24] 0x0058ab01 nBES (6a033c,5f,71,1c542) - SYM FILE OUTDATED!
# 35b4f5bc 00000071 0001c542 00000000 35b4f7d0 |q...B..........5|
# 35b4f61c 00000002 0d44e858 00475182 0001c542 |....X.D..QG.B...|
[25] 0x0058f2d4 nBES (1c542,0,0,d4683b3) - SYM FILE OUTDATED!
# 35b4f620 0d44e858 00475182 0001c542 00000000 |X.D..QG.B.......|
# 35b4f700 0001c542 0d2e7b15 1f7957bc 00096573 |B....{...Wy.se..|
# 35b4f7d0 00003306 0001c542 00000001 00000000 |.3..B...........|
@[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,4080000,0,35b4eb34,17)
@[23] 0x6003d3b7 nnotes._NSFNoteOpen@16+55 (71,1c542,0,35b4f7d0)
# 35b4f7d0 00003306 0001c542 00000001 00000000 |.3..B...........|
[24] 0x0058ab01 nBES (6a033c,5f,71,1c542) - SYM FILE OUTDATED!
[25] 0x0058f2d4 nBES (1c542,0,0,d4683b3) - SYM FILE OUTDATED!


It looks like nBES was trying to open this Note using the NSFNoteOpen and NSFNoteOpenExtended API calls when it crashed the server. So now I am sure this note caused the crash. What now? Delete the note from the user's mailfile so that it now resides in the trash, then restart the BES server. Of course you want to let the user know why their message is in the trash, and of course follow up with RIM on why this happened in the first place. But at least you have the server back up!
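
Searching the NSD file for the NoteID can be scripted the same way. The Python sketch below lists the distinct stack frames whose argument list contains the NoteID, skipping the raw hex dump lines. The regular expression is based only on the frame lines shown in this post, so other NSD versions may format frames differently; the names are hypothetical.

```python
import re

# Frame lines look like:
#   @[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,...)
#   [24] 0x0058ab01 nBES (6a033c,5f,71,1c542) - SYM FILE OUTDATED!
FRAME_RE = re.compile(r"@?\[(\d+)\]\s+0x[0-9a-fA-F]+\s+(\S+)\s+\(([^)]*)\)")


def frames_with_note(nsd_lines, note_id):
    """Return distinct (frame number, symbol) pairs that were passed note_id."""
    wanted = note_id.lower()
    seen = []
    for line in nsd_lines:
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # hex dump lines and everything else
        frame_no, symbol, args = match.groups()
        if wanted in (arg.strip().lower() for arg in args.split(",")):
            entry = (int(frame_no), symbol)
            if entry not in seen:
                seen.append(entry)
    return seen


nsd_excerpt = [
    "@[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,4080000,0,35b4eb34,17)",
    "@[23] 0x6003d3b7 nnotes._NSFNoteOpen@16+55 (71,1c542,0,35b4f7d0)",
    "[24] 0x0058ab01 nBES (6a033c,5f,71,1c542) - SYM FILE OUTDATED!",
    "[25] 0x0058f2d4 nBES (1c542,0,0,d4683b3) - SYM FILE OUTDATED!",
    "# 35b4f310 35b4f7d0 00000000 0001c542 0bd1a0e4 |...5....B.......|",
    "@[22] 0x6001f146 nnotes._NSFNoteOpenExtended@24+678 (71,1c542,4080000,0,35b4eb34,17)",
]
for frame_no, symbol in frames_with_note(nsd_excerpt, "1C542"):
    print(frame_no, symbol)
```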

Friday, March 16, 2007

Moves between Servers - Not 100% foolproof

I moved over 500 accounts from an old BES server to a new BES server today. I set the expectation with our regional managers and BB admins that there would be no impact to end users during this migration. How could I be so naive?

First, a small number of users were not able to send, or send & receive, after the move, and required either pushing the service books, reactivation, or complete account removal and re-add / re-activate to get working again. This is not so bad; it is a small number (<20 probably) out of 500 people.

The bigger problem was this: The source (old) BES server's "BlackBerry Synchronization Service" would sometimes freak out during bulk moves. I would move maybe 50 people at a time max, but apparently that was too much for the sync service to handle. So it would just quit, and I would have to restart it. I also noticed that the initial moves went really fast, a few per minute, but then they started to bog down and it would take 5 minutes to move one account.

That is no big deal, until I realized by looking through the logs what was happening: the sync service contains the device backup service, and it was going through and deleting the backup data not only for the accounts being moved, but FOR EVERY USER ON THE SERVER: (names removed to protect the innocent)

[46036] (03/14 10:15:41):{0xC50} [SYNC-Gate] Start removing user. [W
[46036] (03/14 10:15:41):{0xC50} [SYNC-Gate] Start removing user. [R
[46036] (03/14 10:15:42):{0xC50} [SYNC-Gate] Start removing user. [Z
[46036] (03/14 10:15:42):{0xC50} [SYNC-Gate] Start removing user. [O
[46036] (03/14 10:15:43):{0xC50} [SYNC-Gate] Start removing user. [S
[46036] (03/14 10:15:43):{0xC50} [SYNC-Gate] Start removing user. [H
[46036] (03/14 10:15:44):{0xC50} [SYNC-Gate] Start removing user. [K
[46036] (03/14 10:15:44):{0xC50} [SYNC-Gate] Start removing user. [D
[46036] (03/14 10:15:45):{0xC50} [SYNC-Gate] Start removing user. [G
[46036] (03/14 10:15:45):{0xC50} [SYNC-Gate] Start removing user. [B
[46036] (03/14 10:15:46):{0xC50} [SYNC-Gate] Start removing user. [D
[46036] (03/14 10:15:46):{0xC50} [SYNC-Gate] Start removing user. [O
[46036] (03/14 10:15:47):{0xC50} [SYNC-Gate] Start removing user. [S
[46036] (03/14 10:15:48):{0xC50} [SYNC-Gate] Start removing user. [R
[46036] (03/14 10:15:48):{0xC50} [SYNC-Gate] Start removing user. [A
[46036] (03/14 10:15:48):{0xC50} [SYNC-Gate] Start removing user. [C
[46036] (03/14 10:15:49):{0xC50} [SYNC-Gate] Start removing user. [F
[46036] (03/14 10:15:49):{0xC50} [SYNC-Gate] Start removing user. [N
[46036] (03/14 10:15:50):{0xC50} [SYNC-Gate] Start removing user. [R
[46036] (03/14 10:15:50):{0xC50} [SYNC-Gate] Start removing user. [W
[46036] (03/14 10:15:51):{0xC50} [SYNC-Gate] Start removing user. [B
[46036] (03/14 10:15:51):{0xC50} [SYNC-Gate] Start removing user. [G
[46036] (03/14 10:15:52):{0xC50} [SYNC-Gate] Start removing user. [A
[46036] (03/14 10:15:53):{0xC50} [SYNC-Gate] Start removing user. [P
[46036] (03/14 10:15:53):{0xC50} [SYNC-Gate] Start removing user. [Z
[46036] (03/14 10:15:54):{0xC50} [SYNC-Gate] Start removing user. [B
[46036] (03/14 10:15:55):{0xC50} [SYNC-Gate] Start removing user. [L
[46036] (03/14 10:15:58):{0xC50} [SYNC-Gate] Start removing user. [M

Once this happened, it kicked off an OTA device backup, which once again sync'ed everything back from each device, which slowed down EVERYTHING.

Now here is the worst part: many of the devices (with accounts that were not even scheduled to be moved that day) received either:

1) An activation complete - OK prompt
2) A continuing activation process
3) A leftover activation icon on their ribbon (home screen).

I started to get calls from admins in other cities, who weren't even scheduled to be moved that day. Bad.

Lesson: Don't move more than 10-20 accounts at a time, and watch the SYNC service.
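
For next time, here is a rough Python sketch of how I would plan the batches and keep an eye on the SYNC log. It assumes the sync service writes one "Start removing user" line per affected user, which is what the excerpt above suggests but which I have not confirmed with RIM. The batch size of 10 and all the names are just illustrative.

```python
def batches(users, size=10):
    """Split the list of users to move into small batches."""
    for start in range(0, len(users), size):
        yield users[start:start + size]


def removals_logged(sync_log_lines):
    """Count the SYNC-Gate removal lines in a chunk of the SYNC log."""
    marker = "[SYNC-Gate] Start removing user."
    return sum(1 for line in sync_log_lines if marker in line)


def unexpected_removals(sync_log_lines, batch_size):
    """How many more removals were logged than users were moved."""
    return max(0, removals_logged(sync_log_lines) - batch_size)


users_to_move = ["user%03d" % n for n in range(25)]   # placeholder names
plan = list(batches(users_to_move, size=10))
print([len(batch) for batch in plan])

# Placeholder log chunk: 14 removal lines logged while moving a batch of 10
log_chunk = ["[46036] (03/14 10:15:41):{0xC50} [SYNC-Gate] Start removing user. [X"] * 14
print(unexpected_removals(log_chunk, batch_size=10))
```

If the second number is ever above zero, stop the moves and check the sync service before continuing.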

Oh, and by the way - the brand new server hardware I migrated to? It crashed this morning. :(