Delivery reports not being handled for a period of time

Nikos Mavrakis
New member
Username: Nmavra

Post Number: 3
Registered: 05-2007
Posted on Thursday, June 11, 2009 - 08:53 am:   

Hello,
We are investigating an issue concerning the delivery reports of the two providers that use our application. It seems that for a period of around 2 hours the reports were not being processed.

It seems that although SMS and delivery reports were received from NowSMS, all reports between around 12.15pm and 2.30pm were not processed in our system. Meanwhile, normal messages were processed as usual.
Our custom application does not show any error.
Do you have any idea as to why this happened and how we can trace it to find the source of the error?
Please find attached the SMSIN log file for the date of the issue, as well as an Excel file listing the messages for which we did not get a delivered report.
Des - NowSMS Support
Board Administrator
Username: Desosms

Post Number: 934
Registered: 08-2008
Posted on Thursday, June 11, 2009 - 02:03 pm:   

Hi Nikos,

Are you running one of the NowSMS 2009 release candidate versions? (I think you are because it resolved another issue.)

All versions between 2008.11.05 and 2009.04.03 have a bug where if you have 2-way commands defined that go to different web servers, sometimes NowSMS would try to send the 2-way command to the wrong web server.

Update to the latest 2009 release candidate version at http://www.nowsms.com/download/nowsms2009rc.zip .... or as a quick fix, edit SMSGW.INI and under the [SMSGW] header add 2WayKeepAlive=No.

--
Des
NowSMS Support
Nikos Mavrakis
New member
Username: Nmavra

Post Number: 4
Registered: 05-2007
Posted on Thursday, June 18, 2009 - 08:24 am:   

Hello Des,
We are running version 2008.11.27, but we don't have different web servers; there is only one. So I don't think this is related to the bug you mention.

We edited SMSGW.INI and included 2WayKeepAlive=No, but that didn't seem to fix anything (so I removed it a couple of hours later).

What we notice is that the reports aren't being processed for a period of time (could be 30 mins, 1 hour, 2 hours, it's not specific) but then after a while it is somehow "fixed" by itself.

So after a while not only are delivery reports handled just fine, but we also see that the previous reports are getting processed ok as well.

The url that processes the delivery reports is a page within the same directory (IIS application) of the page that processes the normal SMS, and this works 100% at all times.
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7793
Registered: 10-2002
Posted on Monday, June 22, 2009 - 02:42 pm:   

Hi Nikos,

You might not be experiencing the same problem that Des described, but I'd still suggest updating.

In particular, I'm suspicious of a change note for 2008.11.27 that mentions a performance enhancement that prevents large numbers of delivery receipts from slowing down the processing of other received messages.

If memory serves me correctly, the first attempt at this particular enhancement slowed down delivery receipt processing too much.

-bn
ashot shahbazian
New member
Username: Animatele

Post Number: 10
Registered: 06-2004
Posted on Saturday, June 27, 2009 - 11:44 am:   

Check if the problem with the lost DLR fixes itself after midnight. If it does, the reason might be hard disk failures (or electric power spikes/failures if disk write caching is enabled.) This can corrupt the database file in NowSMS used for DLR matching. Once the application creates a fresh DB file at midnight, the problem is gone. One indication of that is also a relatively high CPU load, a lot higher on some cores than the others, which is not related to traffic.

The only way to fix the issue was to delete the respective DB file (easy to find as it stops updating the "last changed" time when broken), remove the uplink and create it again. Or wait till midnight. In either case, the DLRs for previously sent but not yet delivered messages are lost; they accumulate in the user queues.

The best way to prevent it is to use a good UPS.
Nikos Mavrakis
New member
Username: Nmavra

Post Number: 5
Registered: 05-2007
Posted on Friday, July 03, 2009 - 10:34 am:   

We updated the Gateway about 2 weeks ago; however, we're still experiencing this. We are looking into rebuilding the server just for this, since it's a serious issue for us and we need to rule out possibilities.

The reports stop being processed for a long time, and then they seem to start being processed again after a random amount of time, and then will stop again.


Ashot, thanks for the suggestion, but I think this is not related to our problem. The problem does not fix itself after midnight, so I doubt it's a corrupt DLR db.

We're still looking for the cause.
Nikos Mavrakis
New member
Username: Nmavra

Post Number: 6
Registered: 05-2007
Posted on Friday, July 03, 2009 - 10:48 am:   

I forgot to mention that the delivery reports stay in the SMSIN folder as xxxxxx.rec files. The gateway shows them in the SMSIN.log file and places them in the SMSIN folder, but from there they are not getting processed...

Any ideas?
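
To see how big the backlog is and how long the oldest receipt has been waiting, a small script can summarise the folder. This is only a rough sketch: the folder path is whatever your installation uses, and `summarize_queue` is a made-up helper name, not anything that ships with NowSMS.

```python
import time
from pathlib import Path


def summarize_queue(folder, extensions=(".rec", ".rct", ".in"), now=None):
    """Count pending files per extension and report the age of the oldest one."""
    now = time.time() if now is None else now
    summary = {}
    for ext in extensions:
        # Compare suffixes case-insensitively (.rec and .REC both count).
        files = [p for p in Path(folder).iterdir()
                 if p.is_file() and p.suffix.lower() == ext]
        oldest = min((p.stat().st_mtime for p in files), default=None)
        summary[ext] = {
            "count": len(files),
            "oldest_age_seconds": None if oldest is None else now - oldest,
        }
    return summary
```

Calling `summarize_queue(...)` with the path of the inbound folder every few minutes and logging the result would show whether the .rec count keeps growing while the .in count stays near zero.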
Nikos Mavrakis
New member
Username: Nmavra

Post Number: 7
Registered: 05-2007
Posted on Friday, July 03, 2009 - 03:33 pm:   

Update:
After some research, we experimented with the 2WaySMSThreadCount=## setting proposed here:

http://blog.nowsms.com/2008/10/2-way-sms-command-speed-and-performance.html

and set 2WaySMSThreadCount=100.

We moved all pending .rec messages out of the SMS-IN folder into a new folder, and then started moving the .rec messages back 1000 at a time.
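
For anyone who wants to repeat this without dragging files by hand, the batch move could be scripted roughly like this. It is a sketch only: both folder paths are placeholders for your own setup, and `move_batch` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def move_batch(holding_dir, queue_dir, batch_size=1000, pattern="*.rec"):
    """Move up to batch_size files from holding_dir into queue_dir.

    Returns the number of files moved. Note that glob patterns are
    case-sensitive on some platforms, so adjust the pattern if the
    files are named *.REC.
    """
    moved = 0
    for path in sorted(Path(holding_dir).glob(pattern)):
        if moved >= batch_size:
            break
        shutil.move(str(path), str(Path(queue_dir) / path.name))
        moved += 1
    return moved
```

Run it once per batch and wait for the inbound folder to empty before moving the next 1000.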

The gateway was then processing everything normally at very good speeds.

We will continue monitoring the situation and get back to you with more information.
Des - NowSMS Support
Board Administrator
Username: Desosms

Post Number: 988
Registered: 08-2008
Posted on Friday, July 03, 2009 - 03:38 pm:   

Hi Nikos,

That is very strange.

There are separate program threads that process the *.REC files so that they do not block *.IN files.

I can't see any scenarios that would cause the *.REC processing to block.

The only thing I could think of would be that maybe it would help to allocate more threads to this processing.

Edit SMSGW.INI, and under the [SMSGW] header, add 2WaySMSThreadCount=5

That will allocate more threads to process these messages.
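
If you manage several servers and would rather script the edit than open the file in an editor, something along these lines should work. Treat it as a sketch: `set_ini_value` is a made-up helper, it works on the file contents as plain text, and it has not been tested against every possible SMSGW.INI layout, so keep a backup copy of the file.

```python
def set_ini_value(text, section, key, value):
    """Return INI text with key=value set under [section].

    An existing key in that section is replaced; otherwise the line is
    added right after the section header. A missing section is appended.
    """
    lines = text.splitlines()
    new_line = f"{key}={value}"
    header_index = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            if header_index is not None:
                break  # left the target section without finding the key
            if stripped[1:-1].lower() == section.lower():
                header_index = i
        elif header_index is not None:
            name = stripped.split("=", 1)[0].strip()
            if name.lower() == key.lower():
                lines[i] = new_line
                return "\n".join(lines) + "\n"
    if header_index is None:
        lines += [f"[{section}]", new_line]
    else:
        lines.insert(header_index + 1, new_line)
    return "\n".join(lines) + "\n"
```

For example, `set_ini_value(ini_text, "SMSGW", "2WaySMSThreadCount", "5")` returns the updated text, which you then write back to SMSGW.INI yourself.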

--
Des
NowSMS Support
ashot shahbazian
New member
Username: Animatele

Post Number: 16
Registered: 06-2004
Posted on Sunday, July 05, 2009 - 08:33 pm:   

I think I know what might be causing it:

Check the following:

- whether the unprocessed files in the sms-in directory contain "ReceiptedMessageID=XXXXXXX" in a separate line.

If they do, check:

- whether the XXXXXXX are in decimal or hex format. Try to match them to the message IDs returned by the SMSC upon submission of the respective outbound messages. If the ones in the files are decimal, they will match the IDs in your smsout log only after you convert the numbers from the files to hex. That's one possible reason why some are skipped by NowSMS: its hex converter needs improvement.

The second possibility is that the files don't contain the "ReceiptedMessageID" field, which means the SMSC upstream doesn't return it. Check again whether the message ID contained in the payload of the deliver_sm (after "id:") is hex or decimal. If it is decimal, that would be the hardest to fix. If it is hex and it matches the IDs in your log, or it is contained in the messageID field, I'm sure Bryce or Des will help you fix it if you post the relevant parts of the SMPPDEBUG log (first activate it, then don't forget to deactivate it, as it's resource-consuming).
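
To make the hex/decimal check above concrete, here is a small sketch of the comparison. `ids_match` is just an illustrative helper, and it only covers the case where the receipt carries a decimal number while the log carries the same number in hex.

```python
def ids_match(receipt_id, logged_id):
    """Compare a receipt's message ID with the ID logged at submission.

    Returns True if they are equal as text (ignoring case and leading
    zeros), or if the receipt ID read as decimal equals the logged ID
    read as hex.
    """
    a = receipt_id.strip().upper()
    b = logged_id.strip().upper()
    if a.lstrip("0") == b.lstrip("0"):
        return True
    if a.isdigit():
        try:
            return int(a, 10) == int(b, 16)
        except ValueError:
            return False  # logged ID is not a valid hex number
    return False
```

If this returns False for a receipt whose outbound message you can find by hand, the ID format is probably not the issue and the SMPPDEBUG log is the next place to look.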

Hope this helps.
Nikos Mavrakis
New member
Username: Nmavra

Post Number: 8
Registered: 05-2007
Posted on Monday, July 06, 2009 - 10:55 am:   

Hello Des,
Before the problem, our 2WaySMSThreadCount was 6. We tested a bit and found that when we raise it to 100 the problem disappears.

Obviously we haven't tested all possible values, but so far 100 seems to work fine and doesn't really harm performance at all. Please advise if you think 100 is too big a value and should be changed.



Ashot, the .rec files in the SMS-IN directory contain "ReceiptedMessageID=XXXXXXX" on a separate line. This is the case right now, while the .rec files are being processed just fine. I wasn't able to check it while we had the problem, but I strongly believe that the format of the .rec files was the same.

Here is an example of our files.

[SMS-IN]
ModemName=CIMD - (SMSC IP)#13:9971
Sender=(msisdn number)
PhoneNumber=(short code number)
Data=id:4A47A77C sub:001 dlvrd:000 submit date:0907061239 done date:0907061239 stat:REJECTD err:003
ReceiptFailed=Yes
ReceiptMessageId=4A47A77C
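
For anyone who wants to pull the fields out of the Data= line programmatically, a minimal sketch follows. The field names are taken only from the sample above (other SMSCs may send more or different fields), and `parse_receipt_text` is a made-up helper name.

```python
import re

# Field names as they appear in the sample receipt; two of them contain a space.
_FIELD = re.compile(r"\b(id|sub|dlvrd|submit date|done date|stat|err):(\S*)")


def parse_receipt_text(data):
    """Split delivery receipt text into a dict of field name -> value."""
    return {name: value for name, value in _FIELD.findall(data)}
```

With the sample line above, `parse_receipt_text(...)["stat"]` gives `"REJECTD"` and `["id"]` gives `"4A47A77C"`.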

Please note that NowSMS does not skip processing them; it just stops processing them and resumes at a random time in the future. This could be anything from 2 hours to 10 hours later.

As it is right now, the rec files are processed very fast, but we will keep monitoring the situation and see how it goes.

I'm concerned about the 2WaySMSThreadCount=100 that we have. Is it perhaps too much, and if so, what would the consequences be?
ashot shahbazian
New member
Username: Animatele

Post Number: 17
Registered: 06-2004
Posted on Monday, July 06, 2009 - 03:02 pm:   

Have you checked that the message ID in the file, 4A47A77C in this case, matches the one in the SMSOUT log line for the respective outbound message? If it does, then I'm not sure what the problem is.

Some other clues. Do you stop/restart the service often? Sometimes that causes some of the DLRs not to be processed. The same can happen if the outbound message was sent long (a few days) before the DLR was received, or if the SMSC upstream returns 2 DLRs, the temporary and the final one, with the same ID. Also, in previous versions of the application, DLRs with "failed" status would get stuck more often than "delivered" ones, but that could be because they're more likely to be old than the delivered ones.

Your uplink seems to be CIMD. We've not had experience with CIMD, and perhaps that's the root of the problem. It's not a widely used protocol. Or are you experiencing the same with SMPP traffic?
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7829
Registered: 10-2002
Posted on Monday, July 06, 2009 - 04:43 pm:   

We're still at a loss to explain this.

We do have one potential theory that we are looking more closely at ... trying some tests to see if we can find any unusual behaviour. But so far we haven't found any problems.

However, I did want to provide some explanation for a few things above.

2WaySMSThreadCount allocates program threads for the processing of inbound messages. In current versions, twice the configured number of threads are actually allocated ... one set for regular inbound SMS, and one set for processing receipts. So you've got 100 of each ... all basically waiting for other program threads to place messages in the SMS-IN directory.

100 is a bit high, and it could degrade performance in other areas.

Note that these threads are doing nothing more than scanning the SMS-IN directory and making callbacks for the 2-way commands. Most of their delay in processing occurs waiting for 2-way commands to complete.
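
As a rough back-of-the-envelope illustration of that last point (a simplified model, not a description of NowSMS internals): if each thread waits for one 2-way command at a time, the ceiling on receipt throughput is roughly the thread count divided by the average callback time. The function names and numbers below are made up for illustration.

```python
def max_receipts_per_second(thread_count, avg_callback_seconds):
    """Upper bound on throughput if each thread handles one callback at a time."""
    if thread_count <= 0 or avg_callback_seconds <= 0:
        raise ValueError("thread_count and avg_callback_seconds must be positive")
    return thread_count / avg_callback_seconds


def seconds_to_drain(backlog, thread_count, avg_callback_seconds, arrival_rate=0.0):
    """Estimated time to clear a backlog, or None if it can never drain."""
    net_rate = max_receipts_per_second(thread_count, avg_callback_seconds) - arrival_rate
    if net_rate <= 0:
        return None  # receipts arrive at least as fast as they are dispatched
    return backlog / net_rate
```

Under this model, 6 threads with callbacks averaging half a second could dispatch about 12 receipts per second, so a slow callback page would show up as a growing backlog rather than a complete stop.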

I would have thought 6 would be plenty high.

I also want to say something about the "ReceiptMessageID" entry, because there have been some changes in recent versions.

*.REC files are receipts that have been successfully resolved. No further processing is required other than to dispatch these messages to the 2-way commands. They get dispatched to the 2-way commands the same way that regular *.IN messages do.

If you have any *.RCT files, these can be problematic ones. These are ones where we are unable to resolve the receipt message id. These are the ones that are more likely to get stuck.

The *.RCT files are processed completely separate from *.REC and *.IN, so that they don't cause other messages or receipts to be delayed in processing.

Maybe we'll find something in testing our one longshot theory.

Can you try 2WayKeepAlive=No again under [SMSGW] in SMSGW.INI to see if that makes a difference? Note that the service needs to be restarted for the setting to take effect. That was ruled out as a solution earlier in this thread, but I'd like to revisit it.

-bn
Des - NowSMS Support
Board Administrator
Username: Desosms

Post Number: 994
Registered: 08-2008
Posted on Monday, July 06, 2009 - 10:47 pm:   

Nikos,

Are all of your 2-way commands HTTP based? (We've been assuming that they are ... but if some are local executables there may be other issues to explore.)

--
Des
NowSMS Support
Nikos Mavrakis
New member
Username: Nmavra

Post Number: 9
Registered: 05-2007
Posted on Wednesday, July 08, 2009 - 10:14 am:   

Yes Des,
we have two 2-way commands and they are HTTP based, on the same server!

Bryce, I did what you proposed and added 2WayKeepAlive=No.

Note, however, that since we raised 2WaySMSThreadCount to 100 we haven't noticed this issue. Since we don't know what caused it in the first place, I'm not 100% sure that 2WaySMSThreadCount=100 is the solution, and since we can't reproduce the error we can only wait and see how it goes. We rarely see .RCT files, perhaps one every 10 days or so.

Ashot, thanks for your participation and help so far. CIMD is in fact not very widely used, but the problem also happens with SMPP. The message ID in the file matches the one in the log file.

One provider sends ENROUTE and DELIVERED receipts, depending on the case. Our application disregards enroute reports and only processes (updates our database for) delivered reports. Finally, the problem happens even for SMS that were sent within the same hour, i.e. we sent an SMS but never got any receipt. This does not happen for a few random SMS, but rather for a whole period of time.