Concat messages problem

Alex Kaiser
New member
Username: Alex_k

Post Number: 17
Registered: 07-2006
Posted on Thursday, June 18, 2009 - 01:52 pm:   

Hello,

We have the following config:
[SMPP - XXX:XXX]
RoutePrefOnly=Yes
Route1=+???????*
AllowedUserOnly=Yes
AllowedUser1=test_user1

[SMPP - YYY:YYY]
RoutePrefOnly=Yes
Route1=+???????*
AllowedUserOnly=Yes
AllowedUser1=test_user1

If test_user1 sends a long SMS, NowSMS from time to time processes the message parts via different connections. This causes a serious problem for several handsets: the messages are not delivered. Is this a bug, or am I doing something wrong?

Regards,
Alex K.
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7795
Registered: 10-2002
Posted on Monday, June 22, 2009 - 03:24 pm:   

Hi Alex,

I could see a problem occurring if "ReRouteReceived" was being used to route messages to one of these routes. But that does not seem to be the case here.

Is test_user1 submitting messages via SMPP? I have an idea about how the use of multiple subdirectories could cause a problem if different parts end up in different subdirectories. By default, after each 10000 messages are received, a new directory gets used. If parts of the message end up in different directories, I could see that causing a problem.
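
To illustrate the idea, here is a minimal sketch in Python. It is not NowSMS code: it assumes that each received message file gets a consecutive sequence number and that the directory is picked by integer division. The names `directory_for`, `is_split` and `split_fraction` are made up for the illustration.

```python
# Hypothetical model of the directory rollover, not the actual NowSMS logic.
MESSAGES_PER_DIR = 10_000  # a new directory is used after this many messages


def directory_for(seq: int, per_dir: int = MESSAGES_PER_DIR) -> int:
    """Index of the directory that receives the file with 0-based sequence number seq."""
    return seq // per_dir


def is_split(first_seq: int, parts: int, per_dir: int = MESSAGES_PER_DIR) -> bool:
    """True if segments first_seq .. first_seq + parts - 1 land in more than one directory."""
    last_seq = first_seq + parts - 1
    return directory_for(first_seq, per_dir) != directory_for(last_seq, per_dir)


def split_fraction(parts: int, per_dir: int = MESSAGES_PER_DIR) -> float:
    """Fraction of start positions at which a message of `parts` segments straddles a rollover."""
    straddling = sum(is_split(start, parts, per_dir) for start in range(per_dir))
    return straddling / per_dir
```

Under this model a two-part message is split only when its first part takes the last slot before a rollover, and longer messages have proportionally more start positions that straddle the boundary.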

We'll have to go back and figure out a solution to this.

Des or I will post a follow-up, but it will probably be about a week or two.

-bn
Alexandre
New member
Username: Alexd

Post Number: 11
Registered: 01-2008
Posted on Monday, June 22, 2009 - 04:01 pm:   

Ok thanks...that was my mistake :-)
Alexandre
New member
Username: Alexd

Post Number: 12
Registered: 01-2008
Posted on Tuesday, June 23, 2009 - 09:08 am:   

Sorry, my previous post was not meant for this topic :-)
Des - NowSMS Support
Board Administrator
Username: Desosms

Post Number: 976
Registered: 08-2008
Posted on Wednesday, July 01, 2009 - 09:09 pm:   

Hi Alex,

We did identify a problem where there was a 1 in 10,000 chance that a concatenated message segment submitted via an SMPP client might get routed to a different SMSC connection.

I can't be certain that this is the problem that you are encountering, but 1 in 10,000 seemed significant enough to us for it to be problematic.

An updated version has been posted to http://www.nowsms.com/download/nowsms2009rc.zip (v2009.06.30), which includes a fix to address this particular issue.

--
Des
NowSMS Support
ashot shahbazian
New member
Username: Animatele

Post Number: 14
Registered: 06-2004
Posted on Friday, July 03, 2009 - 03:04 pm:   

We've noticed a similar problem in 2009 releases with the SeparateUserQueues setting on. Different segments of the same message would be found in different sub-folders, and some would stay there indefinitely. I think this was also related to the corresponding .lck file being in a different directory from the .req file. After disabling SeparateUserQueues the problem has largely gone.

But even then, as reported by my colleague Alex, some stray segments, along with some unsegmented messages, would keep bouncing in the subfolders named ###0001 ###0002 etc., which causes a significant performance lag. Is there a way to do away with these subfolders? The problem is also that the .req files in them are locked by the application and are hard to move elsewhere without stopping the service.
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7824
Registered: 10-2002
Posted on Friday, July 03, 2009 - 04:58 pm:   

Hi Ashot,

This fix will make sure that the different segments of the same message are created in the same directory.

I'd caution about disabling the separate user queues in that if you end up with too many files in a single directory, Windows can grind to a halt. The whole system can become extremely unresponsive.

It is possible that as new features have been added, we have made the assumption that separate user queues is enabled, because not having it enabled can cause these severe system problems.

I will check to see if this is the case. But I would caution that if separate user queues are not enabled, there will be problems with large message queues where the system grinds to a halt. CPU utilisation will appear very low, but anything that involves disk i/o will be very very slow.

It is likely that these locked .REQ files are related.

As I recall, that was one of the problems that would occur if there were too many files in a single directory level. Even stopping the NowSMS services ... making sure that no NowSMS processes were running, these files were still locked by the operating system itself.

Separate user queues was the solution to this problem. And in particular, the counters that create even more subdirectories (###0001, ###0002, etc.) as activity increases, allow Windows to better cope with all of the activity.

-bn
ashot shahbazian
New member
Username: Animatele

Post Number: 15
Registered: 06-2004
Posted on Friday, July 03, 2009 - 07:53 pm:   

Hi Bryce,

Thanks for a quick reply!

Here is what we observe:

- If separate user queues are enabled, the queue files are spread in sub-directories \username and further down in something like \username\47F21A5\

- the \username\ folders may also contain sub-folders named \###0001 etc.

- if the setting is disabled, the \47F21A5\ sub-folders do not seem to be created, but the \###000X\ ones are, under \q\. And it's the \###000X\ ones which are causing grief.
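
For anyone who wants to check the same layout on their own server, a rough Python sketch along these lines can count the .req files per sub-folder and list the `###` sub-folders. It only reads directory listings and does not open the files. The function names are made up, and the queue path in the usage note is an assumption, so adjust it to your installation.

```python
import os
from collections import Counter


def count_req_files(queue_root: str) -> Counter:
    """Count .req files per directory, keyed by path relative to queue_root."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(queue_root):
        n = sum(1 for name in filenames if name.lower().endswith(".req"))
        if n:
            counts[os.path.relpath(dirpath, queue_root)] = n
    return counts


def overflow_dirs(queue_root: str) -> list:
    """List sub-folders (relative paths) whose name starts with '###'."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(queue_root):
        for name in dirnames:
            if name.startswith("###"):
                full = os.path.join(dirpath, name)
                found.append(os.path.relpath(full, queue_root))
    return sorted(found)
```

Example use, with a hypothetical path: `count_req_files(r"C:\NowSMS\q")` returns a counter whose `"."` entry is the number of .req files directly under \q\, and `len(overflow_dirs(r"C:\NowSMS\q"))` gives the number of `###` sub-folders.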

We're not handling Bulk SMS traffic and our uplinks are to very fast SS7 gateways, most in the same LAN. Hence, in our environment the traffic itself is not causing queues. Newer versions of NowSMS are capable of sending at up to licensed speed without creating noticeable CPU load on modern servers - regardless of the separate user queue setting, with a very large number of uplinks and complex routing on them. For queues, we use stripe-sets of Intel X25-E server-grade SSDs on multi-channel SAS controllers, which makes a volume 200-500 times faster in terms of latency and IOPS than any rotating disk. So the disk I/O in our case is not an issue.

It is periodic maintenance procedures (archiving, backing up or defragging the volumes, or restarting the service after configuration changes) that put extra strain on the system, because they cause the customers to dump their queues. When that happens you spill messages to the ###000X folders - and that is when the server might overload. If the server is stopped, the files in the ### folders are manually moved back to \q\ and the server is restarted, the messages are quickly gone and the CPU load returns to normal. With files in the \###000X\ folders it enters a vicious cycle - a 16-core server is sending at 10/sec., new messages are rapidly building up in these folders and it gets even slower.

Can you implement a setting that'd specifically prevent the application from creating these sub-folders?

The three-dimensional directory structure you're using in newer releases should indeed improve performance with the separate queue setting for customers with very large outbound queues because of bulk traffic and slow uplinks. But that does not apply to us; in our case it just seems to add extra overhead.

Kind regards,
Ashot
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7825
Registered: 10-2002
Posted on Saturday, July 04, 2009 - 03:27 pm:   

Hi Ashot,

That is a valid point. If the queues are not that large, this is unnecessary overhead.

We wanted to freeze version 2009.06.30; however, it is a very simple matter to respect the separate user queues flag for these additional traffic-based subdirectories.

So we've gone ahead and updated http://www.nowsms.com/download/nowsms2009rc.zip with a 2009.07.04 version that includes this change.

-bn
ashot shahbazian
New member
Username: Animatele

Post Number: 18
Registered: 06-2004
Posted on Tuesday, July 07, 2009 - 03:46 am:   

Hi Bryce,

Patched, works great now. Thanks for doing it so quickly!

There was definitely something wrong with how the queued messages were handled in those ###XXXX folders. Some .req files would get stuck there - perfectly normal ones, in the absence of any queues whatsoever - and wouldn't send out for hours no matter what, and couldn't be deleted or moved until you stopped the service and moved them manually to the \q\ folder.

Why would the files get locked so viciously in the \###XXX folders, but not so in the \q\ one?

Out of curiosity I thought I'd try the separate user queue setting. It was 5am and traffic was not at its peak. This is the new patch, the setting set as "default". Check this out:

[Attachment: queue.pdf (application/pdf, 34.3 k)]


The CPU load shot up to 80-90% immediately and to 98-99% within less than a minute. The server is a 16-core monster with 128 GB of RAM!

The number of these subdirectories when I stopped the service was 401; three quarters of them were sattermrus000XXXX (although that's in fact the most active account). The total number of .req files in \q\ and all subfolders was just 61.

Now look at the subfolder numbers. What I did find amusing was that the numbering of the folders seems to have switched by itself from hex to decimal - see the ###00BF and the next one, ###0009?!

I've stopped the service, deleted the 401 directories, unticked "SeparateUserQueues" and restarted: CPU load 1-3%, sending at 30-60 SMS/sec.

Things definitely break down when you turn this on. I'm surprised no one's reported it, but you should rather check what's going on.

Kind regards,
Ashot
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7830
Registered: 10-2002
Posted on Tuesday, July 07, 2009 - 10:28 pm:   

Hi Ashot,

The CPU load that you are seeing is troubling.

We have seen some increased overhead from the directories, but not to the extent that you are seeing.

I do have one idea about it, however. Do you have 8.3 filename compatibility disabled? (You probably do, but it's the one factor that I'm wondering about.)

These extra directories were key to resolving some severe problems at installations with large bulk submissions where queues would frequently outpace the outbound message capacity.

But you've given us enough concern that we may need to do a minor rethink.

Assuming the separate user queues are enabled, we're going to use these extra directories only if the size of the outbound message queue exceeds a threshold (by default 10,000). If the queue size is under the threshold, we won't use them as they will only add overhead.
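
As a rough sketch of that rule (Python used purely for illustration; the function and constant names are hypothetical and this is not the actual NowSMS implementation):

```python
# Illustrative only: names are made up, the threshold default is the one stated above.
DEFAULT_QUEUE_THRESHOLD = 10_000


def use_extra_directories(
    separate_user_queues: bool,
    queue_size: int,
    threshold: int = DEFAULT_QUEUE_THRESHOLD,
) -> bool:
    """Use the extra ### directories only when separate user queues are
    enabled and the outbound queue size exceeds the threshold."""
    return separate_user_queues and queue_size > threshold
```

With this rule a small queue, such as the 61 .req files reported earlier in the thread, would never trigger the extra directories, while a large bulk submission still would.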

I'm still concerned about the extent of the overhead that you're seeing. The logic that deletes these directories when they are unused may be part of the problem, as it waits longer than necessary, and we're looking at that.

-bn
ashot shahbazian
New member
Username: Animatele

Post Number: 19
Registered: 06-2004
Posted on Wednesday, July 08, 2009 - 12:35 am:   

Hi Bryce,

The 8.3 filename compatibility is disabled.

I think ours is the scenario just opposite that for bulk sending. The uplinks are a lot faster than we can possibly send, but there are hundreds of user and SMSC links, most with complex routing (including loops for reassembling and breaking apart segmented messages, which seems to be heavy on resources.)

I must admit that v.2009, especially that with the most recent patch, is many many times more stable and a lot faster than the v.2007 we've had on this server before. That one would run at 40-60% CPU load and would tend to stall at every large traffic spike.

We did notice that these folders cannot be deleted for a few minutes even if empty.

Another observation: during a large traffic spike, some messages would show in the log as "retry pending, timeout waiting for response from xxx xxx". This happens even though the SMPP trace of the upstream SMSC shows it returning the submit_sm_resp on time (with NowSMS neither acking nor trying to resubmit the message), and even though CommandTimeout on the NowSMS server is set to 180 seconds (if left at the default this happens a lot more often). What's more interesting is that some of the messages that timed out in this fashion would stay in the \q\ folder as .req files containing a line indicating just one retry attempt, and wouldn't send out for hours unless the service is stopped, the lines indicating retry errors are deleted, the file is saved and the service is restarted. These files, unlike regular .req files, are also viciously locked and impossible to remove at all without stopping the service. Also, if that stuck message happened to be one part of a segmented one while the others are gone, the corresponding .lck file is stuck and locked as well.

Not sure why this is happening, but I hope it gives you some clues.

Kind regards,
Ashot
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7832
Registered: 10-2002
Posted on Thursday, July 09, 2009 - 10:22 pm:   

Hi Ashot,

It looks like you posted more information about the problem in another thread:

http://support.nowsms.com/discus/messages/1/41522.html

Basically, it looks like the stranded .REQ files would happen if an SMSC connection terminated unexpectedly. This bug was introduced some time between 2008.02 and 2008.06.

What is confusing is why the connection is dropping if you're not seeing any indication of a problem in the protocol trace. That may be a problem for another day ...

-bn