Messages stuck in Q and retry - sequence ID issue?

ashot shahbazian
New member
Username: Animatele

Post Number: 37
Registered: 06-2004
Posted on Saturday, August 22, 2009 - 03:43 pm:   

Hi Bryce,

We've tried to investigate the issue with stuck messages:

In one of our scenarios, the 1st NowSMS server sends messages to the second one; the 2nd performs concatenated message reassembly (the WDP Adaptation setting) and sends them back to the 1st server. We've noticed that the 2nd server often gets many (hundreds of) both short and concat messages stuck in its outbound queue for 1-3 minutes, and then they submit quickly. The routing on the 2nd server is rather simple, but the volume of such looped messages is high.

We've enabled the SMPP debug and tried checking how the 1st server responds to the submit_sm from the 2nd, and realised something very unusual at first: the 1st server would sometimes not send the submit_sm_resp to the 2nd (so the message times out on the 2nd server and resubmits). Sometimes, when the 1st server did send the resp, we couldn't find that resp in the 2nd server's debug log - even though the servers are on the same LAN, both plugged directly into the same switch.

What's more interesting, the problem always gets aggravated right after midnight. And what's even more interesting, we've observed several cases where the sequence_id of a completely unrelated command is identical to the seq id of some component of the troubled message, and that seems to be the case on both servers.

Judging by seq id values after midnight, it looks like you're resetting the counter for seq id-s at midnight. Is this the case?

Also, when tracking the seq id-s, do you differentiate them by the type of command? We've observed that an identical seq id for a deliver_sm belonging to an unrelated message is often paired with that of a troubled (delayed and resubmitted) submit_sm, so it seems you might not be differentiating the seq id-s by command code (such as 0x00000004 for submit_sm, 0x80000004 for submit_sm_resp or 0x00000005 for deliver_sm) as the protocol specifies.
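
For reference, the command code and the sequence_number are separate 32-bit big-endian fields in the 16-byte SMPP header; here is a minimal Python sketch (illustrative only, not NowSMS code) of pulling them apart:

import struct

def parse_pdu_header(data: bytes):
    # SMPP v3.4 header: command_length, command_id, command_status, sequence_number
    command_length, command_id, command_status, sequence_number = struct.unpack(">IIII", data[:16])
    is_response = bool(command_id & 0x80000000)   # response command codes set the high bit
    return command_id, sequence_number, is_response

# Example: a deliver_sm (command code 0x00000005) carrying sequence_number 42
hdr = struct.pack(">IIII", 16, 0x00000005, 0, 42)
print(parse_pdu_header(hdr))   # (5, 42, False)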

Also, it looks like the length of the seq id in NowSMS is limited to 3 digits, which probably makes them reset more often than if you'd allowed them to increase to the maximum allowed value of 0xFFFFFFFF and only then re-set them.

This should not be difficult to recreate. Can you please check and see if that's indeed what's causing the resubmits?

If so, in our opinion, if you:
- did away with resetting the counters at midnight
- increased the max value of the id to the allowed limit
- assigned all seq id-s within one server consecutively, regardless of the command code, unless you're sure that properly tracking commands matched to their codes (which can be VERY tricky) won't impact performance
- on every system stop/restart assigned the 1st seq id to a random value within the allowed range, rather than 0x01 as you seem to be doing now

the chances of trouble due to identical sequence id-s would become vanishingly small.
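
A minimal Python sketch of that allocation scheme (illustrative only, not NowSMS internals; it uses the SMPP v3.4 ceiling of 0x7FFFFFFF):

import secrets

class SequenceCounter:
    MAX = 0x7FFFFFFF                       # highest sequence_number allowed by SMPP v3.4

    def __init__(self, randomize: bool = True):
        # start each session at its own random value rather than at 1
        self._next = secrets.randbelow(self.MAX) + 1 if randomize else 1

    def next(self) -> int:
        value = self._next
        self._next = 1 if value >= self.MAX else value + 1   # wrap to 1, never 0
        return value

c = SequenceCounter()
print(c.next(), c.next())                  # two distinct, non-zero sequence numbers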

For us, such delays have become a serious issue. We hadn't noticed it much in the past since traffic around midnight was relatively small, but this has changed and messages are now being delayed in massive numbers. I'd much appreciate it if you could address it as soon as possible.

Kind regards,
Ashot
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7842
Registered: 10-2002
Posted on Monday, August 24, 2009 - 10:44 pm:   

Hi Ashot,

There is definitely a problem with the sequence number resetting to 0 prematurely.

In fact, technically 0 is not a valid value, so that may be the problem.

From what I can see, after the sequence number hits 0xFF, it then resets to 0.

This definitely should be advancing all the way to 0xFFFFFFFF. And should it reach 0xFFFFFFFF, the next value should be 1, not 0.

So there are two related problems here with the sequence number.

And I could see possibilities for the way this sequencing is handled causing confusion for another server.

We will get this fixed quickly.

I don't see any evidence that the sequence number resets at midnight. But on a busy system it is going to be reset quite frequently as it is limiting itself to a single byte value instead of a 4 byte value.

I also don't see any evidence of confusion if a deliver message is received with a sequence_number that matches a submit response for which we are waiting.

I was hopeful that the sequence number resetting problem was the cause of your problem ... until I read back and realized that it was two NowSMS servers talking to each other. And unfortunately, I can't see it causing a problem in that scenario. The sequence number should be incrementing all the way to 0xFFFFFFFF and when it wraps, it should skip 0. But the fact that it uses 0 and only increments to 0xFF would not be a problem with 2 NowSMS talking to each other, unless for some reason the window size was set to larger than 255.

I'm also puzzled by the midnight issue. The sequence number doesn't reset at midnight. So I'm trying to think of what else happens at midnight. Log file rollovers and creation of new message id database are the things that come to mind.

I'm not sure where to start. There is a sequence numbering problem that should be easily fixed. However, unfortunately, I don't see it having any impact on the problem that you describe.

When WDP Adaptation is used, NowSMS disables async mode. (The re-assembly logic is too complex to work with our async mode SMPP implementation.)

Without async mode, a lost submit_sm_resp will definitely cause a stutter.

In many ways, async mode is a lot simpler, so it's harder to see how a submit_sm_resp could be lost. That's the part that bothers me about this.
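
To put a rough number on it, here is a toy Python simulation (illustrative only, with assumed timings; 120 seconds is just the default CommandTimeout mentioned below):

def drain_queue_sync(message_count, resp_arrives, command_timeout=120.0, round_trip=0.05):
    """Seconds to drain a queue when each submit_sm must wait for its own resp."""
    elapsed = 0.0
    for i in range(message_count):
        elapsed += round_trip                  # submit_sm sent, resp normally arrives
        if not resp_arrives(i):
            elapsed += command_timeout         # lost resp: wait out the timeout, retry later
    return elapsed

print(drain_queue_sync(200, lambda i: True))        # ~10 s: the queue flows freely
print(drain_queue_sync(200, lambda i: i != 50))     # ~130 s: one lost resp stalls everything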

If you have some debug logs where you've noticed a missing submit_sm_resp, I'd be interested in seeing them ... to see if maybe there's some other factor that might come into consideration.

As a temporary solution, you might want to reduce the CommandTimeout setting on the SMPP connection that is submitting with WDP Adaptation. Take it down to 20 seconds (which is still much longer than it needs to be, but I'm being conservative ... the default is a hefty 120 seconds). No async window is being used, so each submit waits for its response, and a lost submit_sm_resp stalls the connection for the full timeout before the message is retried.
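
For example (assuming CommandTimeout belongs in the [SMPP - ...] section of SMSGW.INI for that connection; the host and port here are placeholders):

[SMPP - server:port]
CommandTimeout=20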

Maybe also there is something about the sequence number issue that I'm not seeing, and fixing the premature wrap-around problem will make things better. We need to run some tests to make sure we get that fixed, but hopefully something can be posted soon for that issue at least.

-bn
Des - NowSMS Support
Board Administrator
Username: Desosms

Post Number: 1176
Registered: 08-2008
Posted on Wednesday, August 26, 2009 - 07:42 pm:   

Hi Ashot,

I just wanted to post a status update.

The issue with the sequence_number has been fixed in the update that was posted to http://www.nowsms.com/download/nowsmsupdate.zip.

We don't have any reason to believe that this is contributing to the problem that you're seeing.

We are confused about the submit_sm_resp that the first server is sending, but that the second server never sees. I'm wondering if you're looking at a debug log or a TCP/IP trace.

One thing we noticed was that there was no error handling if NowSMS sends a submit_sm_resp and the TCP/IP buffers are full. We fixed that, but we have no reason to believe it would be a problem ... there's simply not enough data that would need to be buffered. But maybe there is some congestion at the system level.

I just had another thought. You're not dealing with any ultralong messages, like longer than 20 segments? NowSMS does reject SMPP packets larger than 4096 bytes.
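
For a rough sense of scale (assumed figures, not exact NowSMS limits), even a 20-segment message reassembled into a single PDU should stay under that 4096-byte ceiling:

segments = 20
octets_per_segment = 140          # at most 140 user-data octets per SMS segment
overhead = 200                    # generous allowance for mandatory fields and TLVs

pdu_size = segments * octets_per_segment + overhead
print(pdu_size, pdu_size < 4096)  # 3000 True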

--
Des
NowSMS Support
ashot shahbazian
New member
Username: Animatele

Post Number: 38
Registered: 06-2004
Posted on Sunday, August 30, 2009 - 12:01 am:   

Hi Bryce, Des

We were looking at SMPP traces recorded by the NowSMS servers and at the log of a thin application that basically records all traffic on the ports used by it and by NowSMS.

We've not recreated it yet, and will hold off applying this update until we've re-tested and recorded traces again, some time next week.

The issue was not with ultralong or even long messages. The server in question is unlikely to get any ultralong ones, as it handles genuine subscriber traffic only - traffic originated by mobile handsets.

Attached is a transcript of a Skype chat with our SMPP engineer during our first attempt to troubleshoot it. We've excluded other possible reasons before the issue with seq id-s was discovered. Perhaps you would find some clues in it.

application/octet-stream
seqid_skype_chat.doc (65.0 k)


What's apparent is that the issue gets aggravated:
- Right after midnight
- Right after a NowSMS restart. We can clearly see it on a NowSMS server (the 2nd, or .43) connected to another NowSMS server (the 1st, or .36). The 1st server sends messages to the 2nd one and the 2nd resubmits them back to the 1st over uplinks with WDP. So the messages are basically looping through the 2nd server.

Given that:

a) both are NowSMS,
b) the 1st is also receiving the same messages first from an external client, and
c) there are also DLRs looping back and forth for all of these messages, and you don't seem to distinguish the seq id-s for deliver_sm by their command code,

the probability of conflicts due to a seq id match between different commands becomes A LOT higher - to the point that when the 2nd server is stopped momentarily and restarted, we can see hundreds of messages hanging in its queue for minutes before submitting, with little outbound traffic on the uplink or server-wide. The condition lasts for between 40 minutes and an hour, gradually becoming less acute but never going away completely.


We don't have any reason to believe that this is contributing to the problem that you're seeing.


The changes you've made for this update - do they allow the seq id-s to increase to 4-byte values and skip 0x00? Or have you also implemented resetting them to a random value after restart (or after midnight, if they indeed reset at midnight) rather than to 0x01?

Kind regards,
Ashot
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7846
Registered: 10-2002
Posted on Sunday, August 30, 2009 - 02:31 am:   

Hi Ashot,

The sequence id definitely doesn't reset at midnight. I'm wondering if NowSMS is having an internal exception error that is triggering it to restart, coincidentally around midnight. That's the only thing that would cause that counter to reset.

Is there an EXCEPT.LOG file? This log file generates an entry when there is an internal exception.

Also ... another question. I want to make sure that we understand this configuration.

If you think sequence numbers are a problem, do you have to use a transceiver connection between these two NowSMS? Are you even using a transceiver connection? (As I'm not sure how a loopback could route the messages back over a transceiver connection.)

The reason I ask the transceiver connection question is this. We're evaluating SMPP packets on a per connection basis. If we sent a submit_sm with a seq-id value of 1, and then received a deliver_sm with a seq-id value of 1 ... then theoretically I could see where that could cause a problem. However, I don't see any problem with our logic ... and I don't see any problem in my tests.

We've spent a long time looking this over ... and as I read your message, I started thinking ... wait a minute, is he even using a transceiver connection?

If you are using a transceiver connection, switch to separate sender and receiver. That will rule out seq-id even being a potential issue.

From the sound of things, I really think something is crashing. Look for this EXCEPT.LOG file.

And explain to me how these two NowSMS servers are connected to each other.

The way I understand it, server 1 submits to server 2 ... no special configuration.

Server 2 submits back to server 1 with WDP adaptation. (What does server 1 then do with it?)

In order to have this configuration, Server 1 would need to initiate a bind to server 2.

Server 2 would need to initiate a bind to server 1.

submit_sm would be used for all of these message submissions.

The only thing coming back via deliver_sm would be delivery receipts. So I guess if you're using a transceiver connection, those delivery receipts could be coming back in via the same connection. So to rule out seq-id, don't use a transceiver connection.

I'm really starting to suspect some sort of internal exception. It is a very unusual configuration. Help me understand the message flow, and how the two servers are configured to exchange messages with each other.

-bn

P.S. - No change was made to where the sequence id's start, they always start at 1 after a restart (or after a reconnect).
ashot shahbazian
New member
Username: Animatele

Post Number: 39
Registered: 06-2004
Posted on Sunday, August 30, 2009 - 03:26 pm:   

Hi Bryce,

Initially, the uplinks from the 2nd to the 1st server were 2 TRX sessions with a window of 10. That's when we noticed the problem.

We changed them to sync mode (we didn't know at the time that it makes no difference with the WDP setting); it didn't help.

Then we added a 3rd RX session and changed the 2nd session to TX. Didn't help. I'm now changing the 1st session to TX as well, let's see...

The problem's still there. Within about 30 seconds of the service start, 21 (short) messages got stuck in the outbound queue. Most submitted upstream 2 minutes later, while some new ones got stuck.

I've searched for an EXCEPT.LOG file, there isn't one. I've looked in the event log, there aren't records of NowSMS service starting without first being stopped (that's how it looked when the service used to crash, in older versions of the product.)

The only aim of this loop is to have the 2nd server reassemble those messages that happen to be long ones. One in 5 to 7 messages is long, but the problem is observed regardless of whether it's a concat or a short one. We first tried to run this loop on the 1st server, but that was rather hard on its CPU, so we've configured it to loop through the 2nd one. The uplink configuration on the 2nd server is rather simple:

TX:

[SMPP - uplink1_wdp:xxxx]
SMPPVersion=v3.4
UserName=sysid1_wdp
SenderAddressOverride=Yes
Receive=No
ReceiveMMS=No
UseSSL=No
WDPAdaptation=Yes
LongSMSAlt=Yes
AllowedUserOnly=Yes
AllowedUser1=user1_forsat

RX:

[SMPP - uplink1_wdp#3:xxxx]
SMPPVersion=v3.4
UserName=sysid1_wdp
SenderAddressOverride=Yes
Receive=Yes
ReceiveMMS=No
UseSSL=No
RoutePrefOnly=Yes
LongSMSAlt=Yes

The problem is apparent on one of the 3 similarly configured uplinks on this server; the other two each carry about 1/6th of the message volume of the one in question. So the problem on the larger-volume one could be masking it on the other two - or this only happens when the volume is large enough.

OK, it's now been more than an hour since I restarted the service, and I've been trying to find stuck messages. It took me 3-4 minutes of hitting F5, and the one I found was intended for one of the smaller-volume uplinks (which are still configured as 1 TRX + 1 TX + 1 RX). Looks like doing away with the TRX session helped to some extent, although there were many stuck messages for the large-volume connection minutes after service start.

OK, let me reconfigure the other two uplinks which still have the TRX sessions..

Things have improved a lot. Within 5 minutes of the service restart, I could only find one stuck message (intended for the lower-volume connection). So the absence of actively sending TRX uplinks with different sys id-s did it for the high-volume connection too: when I first reconfigured it, leaving the other two with TRX sessions, the stuck messages after restart were intended for the uplink that had no TRX sessions! So it looks like the TRXs in the other 2 uplinks were interfering with this one...

In my opinion, this just confirms the assumption that the trouble is caused by seqid-s. Also, doing away with TRX is by no means a universal solution: the (external) binds from the customer who originally submits these messages are configured as 3 TRX sessions each, and we are seeing an unusually high number of resubmitted submit_sm from them too, to the extent that these packets are being caught by our antispam.

How difficult would it be to reset the seqid counters to random values after a restart or connection loss, with the random values being truly random and different for individual connections (or sessions)? Or to not reset the seqid-s at all and continue from where they stopped? I'm almost positive it would solve the problem altogether.

Kind regards,
Ashot

P.S. We've not yet updated to the version with the latest fix; the one running on the servers is 2009.07.09.
ashot shahbazian
New member
Username: Animatele

Post Number: 40
Registered: 06-2004
Posted on Sunday, August 30, 2009 - 03:35 pm:   

Hi Bryce,

Forgot to confirm that the message flow is exactly how you've described it:


And explain to me how these two NowSMS servers are connected to each other.

The way I understand it, server 1 submits to server 2 ... no special configuration.

Server 2 submits back to server 1 with WDP adaptation. (What does server 1 then do with it?)

In order to have this configuration, Server 1 would need to initiate a bind to server 2.

Server 2 would need to initiate a bind to server 1.

submit_sm would be used for all of these message submissions.


Kind regards,
Ashot
Des - NowSMS Support
Board Administrator
Username: Desosms

Post Number: 1188
Registered: 08-2008
Posted on Monday, August 31, 2009 - 04:59 pm:   

Hi Ashot,

We're going to do a build with a setting to init the sequence_number to a random value. It's simple enough to try, even though we have reservations about this being the problem.

We could use one other piece of clarification though ... where exactly are the messages getting stuck?

Is it with server 1 submitting to server 2, or server 2 submitting to server 1? Or both directions?

If it is server 2 submitting to server 1, then we are wondering whether the issue is related to multiple transmit sessions with WDP adaptation. There may be some file contention occurring because of the way the reassembly occurs.

Stay tuned for the sequence_number update.

--
Des
NowSMS Support
Des - NowSMS Support
Board Administrator
Username: Desosms

Post Number: 1190
Registered: 08-2008
Posted on Monday, August 31, 2009 - 06:08 pm:   

Follow-up ...

An update with the setting for a random init to the sequence number can be found at http://www.nowsms.com/download/nowsms20090831.zip

It defaults to starting at 1, but you can set [SMSGW] SMPPSeqRandom=Yes to start it at a random value.
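
In other words, in SMSGW.INI:

[SMSGW]
SMPPSeqRandom=Yes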

--
Des
NowSMS Support
ashot shahbazian
New member
Username: Animatele

Post Number: 41
Registered: 06-2004
Posted on Monday, August 31, 2009 - 07:32 pm:   

Hi Des,

Thanks for the update!

Two dumb questions, but I need to ask before patching a live system:

- upon restart, would each session's initial seq id be set to ITS OWN random value, or would all sessions' seq id-s be set to the SAME random value?

- what happens when the seq id reaches 0xFFFFFFFF? If the new setting is on, would the next id be 0x01 or a random value?



We could use one other piece of clarification though ... where exactly are the messages getting stuck?


It is server 2 submitting to server 1, haven't noticed it in the other direction. The first server though is often bouncing the same messages as they are originally submitted by our customer. The customer, incidentally, is connected to the 1st server via 9 TRX sessions, 3 per sys id.

There may be some file contention occurring because of the way the reassembly occurs.


A small number of concatenated message fragments get stuck in the 2nd server's queue, perhaps a few dozen per million. They are most often the 2nd or 1st segments of 2-part messages, and cannot be deleted without stopping the service (and if you delete the corresponding .lck files they reappear momentarily). Which is understandable: the server is expecting to receive the missing segment to reassemble the message and send it upstream. But we haven't checked yet what's causing the other segment in the pair to drop; maybe that also has to do with the seqid issue on the 1st server. Once both servers are updated I'll let you know if that's fixed too.

Kind regards,
Ashot
ashot shahbazian
New member
Username: Animatele

Post Number: 42
Registered: 06-2004
Posted on Monday, August 31, 2009 - 08:19 pm:   

Ok, updated and enabled the new setting.

There is no difference between the new version and 2009.07.09.

If the uplinks on the 2nd server are separate TX and RX, the problem is barely noticeable. If we switch the TX ones to TRX, then upon service restart a queue of timing-out messages quickly builds up and then releases.

To dig further we should probably look at SMPP traces again. We'll try that tomorrow.

Kind regards,
Ashot
Des - NowSMS Support
Board Administrator
Username: Desosms

Post Number: 1191
Registered: 08-2008
Posted on Monday, August 31, 2009 - 09:58 pm:   

Hi Ashot,


quote:

- upon restart, would each session's initial seq id set to ITS OWN random value, or all sessions seq id-s would set to the SAME random value?




They'd be different ... but close ... actually could be the same for multiple outbind connections from the same server.

We were thinking what was important was that the ids from the different systems be different. But maybe that's not good enough.

We're using the Windows tick count for milliseconds since the system was last restarted ... so it's not truly random.
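
A small Python illustration of the difference (assumed logic, not the actual DLL code): a millisecond tick count gives two binds opened at the same instant the same starting value, whereas a proper random source makes such a collision effectively impossible.

import secrets

def seed_from_tick(tick_ms: int) -> int:
    return (tick_ms % 0x7FFFFFFE) + 1            # squeeze the tick into 1..0x7FFFFFFF

def seed_random() -> int:
    return secrets.randbelow(0x7FFFFFFF) + 1

tick = 123456789                                  # same tick count for two outbinds
print(seed_from_tick(tick) == seed_from_tick(tick))   # True: identical starting seq ids
print(seed_random() == seed_random())                 # almost certainly False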

We can randomize more, just to safely put the issue of sequence_number to rest.

An updated SMSSMPP.DLL has been posted to http://www.nowsms.com/download/ashot.zip. The only difference in this build is that when the setting is in place to use the random init, it is truly random.

If you want a complete build that includes that updated DLL, I've refreshed http://www.nowsms.com/download/nowsms20090831.zip
... just be sure to clear your cache so you're sure to get the fresh copy.


quote:

- what happens when the seq id reaches 0xFFFFFFFF? If the new setting is on, would the next id be 0x01 or a random value?




After 0x7FFFFFFF, it wraps over to 1. The spec says 1 through 0x7FFFFFFF are valid.


quote:

It is server 2 submitting to server 1, haven't noticed it in the other direction.




I am beginning to get more suspicious that something is happening that is related to the WDP Adaptation setting. But I could be wrong.

Can you confirm that the errors in the SMSOUT log file refer to timeout errors? "ERROR: Timeout waiting for response from server or lost connection"

--
Des
NowSMS Support
ashot shahbazian
New member
Username: Animatele

Post Number: 43
Registered: 06-2004
Posted on Monday, August 31, 2009 - 11:51 pm:   

Great job guys, it's fixed!

Elusive stuff it was.

Applied the latest DLL, changed the binds to TRX again. Upon start, not a single file got stuck in the queue for half an hour, with another two service restarts in between.

It was indeed a seqid issue. This time I was watching the running SMSOUT log and noticed that for 15-20 seconds after service start there were only DLRs in the log, and only then were submits seen. With the seqid resetting to 0 or 1 for all sessions, the seqid-s for the DLRs would conflict with those of the submits that come moments later - and possibly between different sessions of the same bind, if the id-s were reset individually per session.

Now, as you've spread them apart this is no longer happening. The difference is very obvious.

This might also be a fix for a host of not-so-obvious troubles reported by other customers. Unmatched DLR from the Greek customer perhaps?

That server also had about 1 in 7-8 DLRs not routing and staying in the SMS-IN folder, first as .rct files and then changing to .sms ones. Roughly 40K DLRs were accumulating in there every day. I've deleted them and will check tomorrow whether the number of stuck DLR files goes down.

We're also updating the 1st server - the one receiving the customer's messages - and will at the same time monitor whether the number of timeout/resubmit errors decreases on their transceivers.

Can you confirm that the errors in the SMSOUT log file refer to timeout errors? "ERROR: Timeout waiting for response from server or lost connection"

Those stuck files never had "Timeout waiting" written in them, and that error was also never in the outlogs for such messages. On the surface of it, they would just sit quietly in the \q folder for 2 minutes and then submit. The entry in the smsout log would only appear when the message submitted. Only by looking at the traces could we see that one of the servers would not generate a resp, or the other would not see it.

Kind regards,
Ashot
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7847
Registered: 10-2002
Posted on Tuesday, September 01, 2009 - 12:58 am:   

Hi Ashot,

I'm very happy if this is resolved ... but I'm a little troubled, because I don't understand why this resolves it.

Each connection sees only its own transactions; it doesn't see transactions from another connection, so it doesn't care about sequence numbers on another connection, even if that connection is to/from the same server.

And when commands are received, sequence number matching is only applied for responses. If it's a new command, it doesn't matter that its sequence number just happens to match the sequence number for a pending response. They are logically separate.
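
A minimal sketch of that distinction (Python, assumed logic, not NowSMS internals): the table of pending responses is only consulted for PDUs whose command_id has the high response bit set.

pending = {}                                   # sequence_number -> command_id we sent

def on_request_sent(command_id, sequence_number):
    pending[sequence_number] = command_id      # we now expect a matching response

def on_pdu_received(command_id, sequence_number):
    if command_id & 0x80000000:                # a response: match it to our request
        matched = pending.pop(sequence_number, None)
        return "resp for " + (hex(matched) if matched else "nothing (unexpected)")
    return "new incoming command " + hex(command_id)   # e.g. deliver_sm: independent

on_request_sent(0x00000004, 7)                 # our submit_sm, seq 7
print(on_pdu_received(0x00000005, 7))          # deliver_sm with the same seq: no clash
print(on_pdu_received(0x80000004, 7))          # submit_sm_resp: matched to the submit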

So, while I'm happy that it seems to be resolved, this just doesn't sit right with me.

It also doesn't sit right that there's a 2 minute delay, but no timeout errors.

I know that I should probably just leave well enough alone. But I prefer when there's a good explanation for a problem.

Is there any chance that your thin layer could be intercepting/relying on sequence numbers in any way?

-bn
ashot shahbazian
New member
Username: Animatele

Post Number: 44
Registered: 06-2004
Posted on Tuesday, September 01, 2009 - 07:01 pm:   

Hi Bryce,

We don't consider this case closed either. One of the reasons is that the fix was based on our assumption, and we don't know how NowSMS operates internally. There is another worry that some of our customers' SMPP applications might not recognise higher seqid values, which may cause trouble with DLRs submitted from us to external clients.

To address the first problem we'll try disabling the Random seq id setting tonight, collect SMPP traces for stuck messages and post here so that you reach a definite conclusion.

The second potential issue could be addressed by implementing a new setting, MaxSeqIdValue=(HEX), applied independently of the random ID setting. Then, if we encountered a problem with a customer not accepting the higher-value id-s, we could limit the range. Ideally this setting would be applicable independently to individual user accounts and uplinks, but making it system-wide for a start would be just fine.

Our thin layer app does intercept all SMPP traffic; however, it does not manipulate the seqid-s. One unlikely problem could be that, when the 2nd server sends reassembled messages back to the 1st one (through the thin app), the thin app adds the source address of the original message to the beginning of the (reassembled, if it was originally segmented) message payload. Most often, though, those stuck messages didn't look long enough to exceed 160 characters even if 13-16 more characters such as "+16462332323: " were prepended. And the fact that the problem can be fixed by either not using TRX or avoiding low-value seqid-s that could match across different sessions may attest that neither the text manipulation nor the thin app (nor the Cisco CSS server in between) was to blame.

Hopefully I can have the engineer look at the traces tonight, and if so we'll simulate the problem and I'll post some soon.

Kind regards,
Ashot
ashot shahbazian
New member
Username: Animatele

Post Number: 45
Registered: 06-2004
Posted on Thursday, September 17, 2009 - 10:12 pm:   

OK, finally..

I have to apologise for having you guys spend so much time on this. While letting the seqid wrap at 0x7FFFFFFF and reset to a random value was a fix, the underlying cause of the trouble was a bug in our thin app (Spider) that showed up only when processing those WDP-reassembled messages which you encode with a TLV. Here's a report from the engineer:

I've finally found an error in Spider that occurs under certain, relatively rare conditions. The bug is a hundred years old; possibly it was there in the first version. However, it only became apparent after we started changing the message body.

In some cases where NowSMS performs WDP reassembly it encodes the message body in the standard way, but in other cases it uses the alternative encoding (with a TLV). The processing logic in Spider differs between the two scenarios. With the standard encoding everything was okay, but with the TLV encoding Spider, while regenerating the PDU, would wrongly append the message body again as raw bytes, without adding the necessary tags.

Since in the PDU the message body is inserted with an indicator of its length (TLV = Tag-Length-Value), NowSMS was properly decoding the "first" body but treated the second one as a separate packet of data (the PDU header also specifies the length of the PDU, and the "second" body was in excess of that length). Obviously NowSMS would not interpret it as a valid packet and would discard it, responding with an invalid packet error. That's why no errors ever appeared in the terminating SMS, yet it staggered the message flow.

I couldn't identify that second body for a long time, since it's completely identical to the first and because the receiving and sending of a single message occurs many times in fast sequence (because it loops between the servers to perform WDP reassembly). Plus, with dense traffic, packets in the logs often stick to each other and are hard to tell apart.

In the NowSMS SMPP debug, on the other hand, the packets are interleaved in tiny chunks with bits of other packets. Today, when compiling data for samples, I noticed other similar cases: no SMPP errors were returned, but the bind would re-initialise. It took me a while to take it apart manually, arrange the bytes into their respective packets in both the Spider log and the SMPP debug, realise where the error was, and fix it.

Applied the new version at 17:17 and will keep watching for possible issues. I've learned to find the errors easily, but catching potentially dangerous messages (which were becoming corrupted before the change) is tough, since the TLV encoding is inside the PDU and finding it with grep is rather difficult. So it's easier to wait for a few hours; I'll let you know if anything else is caught.
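
To illustrate what the engineer describes, here is a simplified Python sketch (assumed framing; a real submit_sm carries mandatory fields before the optional TLV, omitted here): bytes appended past the declared command_length are not part of that PDU, so the receiver treats them as the start of the next packet and rejects them.

import struct

def make_pdu(command_id, sequence_number, body: bytes) -> bytes:
    # 16-byte header: command_length covers the header plus the body, nothing more
    return struct.pack(">IIII", 16 + len(body), command_id, 0, sequence_number) + body

def split_stream(stream: bytes):
    pdus, offset = [], 0
    while offset + 16 <= len(stream):
        (length,) = struct.unpack(">I", stream[offset:offset + 4])
        if length < 16 or offset + length > len(stream):
            break                              # bytes that don't frame a valid PDU
        pdus.append(stream[offset:offset + length])
        offset += length
    return pdus, stream[offset:]               # second element: leftover "garbage"

body = b"\x04\x24\x00\x05hello"                # message_payload TLV (tag 0x0424, length 5)
good = make_pdu(0x00000004, 1, body)           # well-formed PDU
bad  = good + b"hello"                         # body appended again, raw and untagged

print(split_stream(good))                      # one clean PDU, nothing left over
print(split_stream(bad))                       # one clean PDU plus 5 stray bytes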


I can speculate that the message flow staggering was caused by the re-initialisation of the bind every time the server discarded those duplicate packets. Why the "randomising" seqid patch fixed it is anyone's guess, but it certainly did. Interestingly, this bind re-initialisation was not recorded in the NowSMS event log on either server.

The conclusion is that the trouble was not related to NowSMS at all: it was in fact properly responding to corrupt messages. We'll monitor it for some time though and let you know if we find anything.

You guys do an excellent job supporting and improving the product. I'll note that to our commercial department and suggest purchasing a service contract or a license upgrade for 2010.

Kind regards,
Ashot
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7851
Registered: 10-2002
Posted on Sunday, September 20, 2009 - 02:55 pm:   

Hi Ashot,

Thanks for the follow-up. I'm relieved that you were able to identify the problem.

It was good that you noticed the sequence number issue that started things off. Even if in the end it wasn't related to the problem you were experiencing, we were sending an invalid sequence number of 0 every 256 transactions, and it is very possible that this could cause problems with some SMPP implementations.

Thanks for the follow-up.

-bn