Issues with network disk mounts

ashot shahbazian
Frequent Contributor
Username: Animatele

Post Number: 107
Registered: 06-2004
Posted on Wednesday, June 29, 2011 - 12:39 am:   

Hi Des, Bryce

We have just discovered something that’s hard to explain. Tried building a high-availability “cluster” with 2 NowSMS servers and encountered several problems:

If SHAREDVOLUME.INI is pointing to a local (D) disk:
[SharedVolume]
SharedVolume=d:\nowsms\
MessageIDPrefix=aa
LogDirectory=d:\Logs\
DebugLogDirectory=d:\debug\

Debug files keep appearing in default (C:\NowSMS) folder, but logs and other files get properly distributed where specified.

It breaks down completely when a network volume is specified for the \nowsms\ folder, such as SharedVolume=z:\nowsms, where Z is a Windows SMB or an NFS volume (we tried both):

1. In EventLog, only Web server shows starting; SMPP one does not;
2. The SMPP port cannot be telnetted and no binds can be established to the server; existing binds disconnect;
3. If we try to submit messages over the Web interface, it shows “message submitted” and Continue on the next page, but it shows no message ID-s and Dest numbers and no messages are submitted;
4. In Windows Task Manager, all three NowSMS services are present.

Changing z:\nowsms\ to d:\nowsms\ rectifies it all apart from the debug file location. Changing it back breaks it down again.

Even more interesting is the following:
If we don’t use SHAREDVOLUME.INI but simply try to locate the \q\ folder on a network share in SMSGW.ini – the SME would accept a bind, the interface would show that there’s a connection and the SME would ack a message from an outside ESME, but the message would not pass through the server: stats on the user account and the general counters won’t change and message won’t appear in any of the logs. As if the program turns into an SMPP emulator. Changing back the path to any local disk and restarting the service makes everything work.

Web interface in this case also isn’t working and behaves just as with the SHAREDVOLUME.INI.

Nothing at all appears in debug files, and no EXCEPT.LOG file is generated.

My impression is that we’re missing something, as we couldn’t find any references to such trouble on the forum. Is there perhaps some additional configuration option to make the program work with network volumes?

Note that putting smsuser data files, smsgw.ini, or log file on a network mount appears to be working: user accounts and SMSC binds appear in NowSMS interface and an empty SMSOUT log is created.

We could confirm the issue on two of our servers, both running W2k3 Enterprise x86, OS patched, with NowSMS v.20100524 and v.20101104.

If you’re not aware of the problem, can you please try recreating it?

Also, does NowSMS work with Windows Cluster (CCS?)

Thanks!

Kind regards,
Ashot
Des - NowSMS Support
Board Administrator
Username: Desosms

Post Number: 3313
Registered: 08-2008
Posted on Wednesday, June 29, 2011 - 04:24 am:   

Hi Ashot,

Have you tried using UNC paths? \\server\volume\path?

Drive mappings cannot be counted on, as services run under a system account. The account name can be set manually under Windows services if necessary, but I'd still recommend UNC format.
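One way to audit a configuration for this is to flag every directory setting that still uses a drive letter rather than a UNC path. The following is only a minimal Python sketch, assuming the SHAREDVOLUME.INI layout quoted earlier in the thread; the function names are hypothetical and not part of NowSMS.

```python
import configparser
import ntpath


def is_unc(path: str) -> bool:
    """Return True if the path starts with a \\\\server\\share prefix."""
    drive, _ = ntpath.splitdrive(path)
    return drive.startswith("\\\\")


def drive_letter_entries(ini_text: str, section: str = "SharedVolume") -> dict:
    """Return the settings in `section` whose value uses a drive letter.

    Values without any drive part (e.g. MessageIDPrefix=aa) are ignored.
    """
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep the original key capitalisation
    parser.read_string(ini_text)
    flagged = {}
    for key, value in parser[section].items():
        drive, _ = ntpath.splitdrive(value)
        if drive and not is_unc(value):
            flagged[key] = value
    return flagged
```

Any entry this returns (for example SharedVolume=z:\nowsms\) depends on a drive mapping that may not exist in the session the service runs in.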

We have not tested under Windows cluster server.

-bn
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7967
Registered: 10-2002
Posted on Wednesday, June 29, 2011 - 04:50 am:   

Test to see if Des' credentials are cleared from this tablet.
ashot shahbazian
Frequent Contributor
Username: Animatele

Post Number: 108
Registered: 06-2004
Posted on Wednesday, June 29, 2011 - 02:19 pm:   

Hi Bryce,

Thanks for the suggestion! We've tried locating the \q\ folder on a UNC path, as follows:

QDir=\\xxx.xx.xx.xx\Software\nowsms\q\ (with two leading backslashes; the program stops working and almost freezes if we use a single leading backslash)

and are still having the same problem. Changing the IP address to the Windows network host name, as well as setting a host name in the HOSTS file, made no difference.

The SME on this server appears to be acking the message without assigning it a message ID:

16:07:50:625 (00000204) xxx.xx.xx.57 <-: 17 byte packet
16:07:50:625 (00000204) xxx.xx.xx.57 <-: 00 00 00 11 80 00 00 04 00 00 00 00 00 00 00 02
16:07:50:625 (00000204) xxx.xx.xx.57 <-: 00

and the server ignores this message.

As soon as we change the queue path to local:

QDir=E:\Q

The ESME gets a normal response with a message ID:

16:17:09:234 (000001C0) xxx.xx.xx.57 <-: 25 byte packet
16:17:09:234 (000001C0) xxx.xx.xx.57 <-: 00 00 00 19 80 00 00 04 00 00 00 00 00 00 00 02
16:17:09:234 (000001C0) xxx.xx.xx.57 <-: 34 41 43 34 35 33 33 32 00 4AC45332

And the message is being processed normally.
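The difference between the two responses is easier to see once the bytes are decoded. Below is a minimal Python sketch that parses the two hex dumps above as SMPP submit_sm_resp PDUs (a 16-byte header of four big-endian 32-bit fields, followed by a NUL-terminated message_id). The function name is hypothetical; the field layout is standard SMPP.

```python
import struct


def parse_submit_sm_resp(hex_dump: str) -> dict:
    """Decode an SMPP response PDU given as space-separated hex bytes."""
    raw = bytes.fromhex(hex_dump)
    # Header: command_length, command_id, command_status, sequence_number
    length, command_id, status, sequence = struct.unpack(">IIII", raw[:16])
    if length != len(raw):
        raise ValueError(f"command_length {length} != {len(raw)} bytes received")
    # Body of submit_sm_resp: message_id as a NUL-terminated ASCII string
    message_id = raw[16:].split(b"\x00", 1)[0].decode("ascii")
    return {
        "command_length": length,
        "command_id": command_id,
        "command_status": status,
        "sequence_number": sequence,
        "message_id": message_id,
    }


network_q = parse_submit_sm_resp(
    "00 00 00 11 80 00 00 04 00 00 00 00 00 00 00 02 00"
)
local_q = parse_submit_sm_resp(
    "00 00 00 19 80 00 00 04 00 00 00 00 00 00 00 02 "
    "34 41 43 34 35 33 33 32 00"
)
print(repr(network_q["message_id"]), repr(local_q["message_id"]))
```

Both PDUs carry command_id 0x80000004 (submit_sm_resp) with command_status 0, so the ESME sees a success in either case; only the message_id differs, being empty with the network queue path and 4AC45332 with the local one.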

Any thoughts?

One guess is that some of the Windows standard services required for this to properly function are stopped or disabled on our servers. We typically do it for reasons of security or performance. In this case though there's nothing suspicious in Windows services or applications event logs. Unless you can recreate the problem, would you be able to send a list of started services of a server that's not having this issue?

Kind regards,
Ashot
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7971
Registered: 10-2002
Posted on Wednesday, June 29, 2011 - 02:33 pm:   

Hi Ashot,

This discussion board has a problem posting double backslash, I think I need to use 4 to get 2. The format should be \\server\volume\path\

What you are experiencing is definitely an access problem. The message cannot be written to disk.

It may be necessary to go into Windows services and assign an account to be used by the NowSMS service, depending on security setup.

-bn
ashot shahbazian
Frequent Contributor
Username: Animatele

Post Number: 109
Registered: 06-2004
Posted on Wednesday, June 29, 2011 - 02:40 pm:   

Would you recommend tying it to the SYSTEM account?

The only other one we use is Admin. If we tied the application to it, I'm thinking it'd get disrupted when we log off?

Thanks!

Kind regards,
Ashot
Bryce Norwood - NowSMS Support
Board Administrator
Username: Bryce

Post Number: 7972
Registered: 10-2002
Posted on Wednesday, June 29, 2011 - 02:57 pm:   

It depends what credentials are required on the shared disk side. If you can access the share OK (try DIR \\server\volume\path from a command prompt window), then use that same account for the service. It doesn't matter if there are user session logins/logouts, as the service account gets its own session.
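Note that DIR only proves the share can be read. Since the symptom is that messages cannot be written, a write test is more telling. Here is a minimal Python sketch of such a probe; the function name is hypothetical, and the result only reflects the account that runs the script, so it would need to be run under the same account as the service to be meaningful.

```python
import os
import uuid


def can_write(directory: str) -> bool:
    """Try to create and delete a small probe file in `directory`."""
    probe = os.path.join(directory, f"probe-{uuid.uuid4().hex}.tmp")
    try:
        # Mode "x" fails if the file already exists, so nothing is overwritten.
        with open(probe, "x") as handle:
            handle.write("test")
        os.remove(probe)
        return True
    except OSError:
        # Covers a missing path, denied access and unreachable shares.
        return False
```

It could be called with the queue directory, for example can_write(r"\\server\volume\path"), using the same placeholder path as above.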

-bn
ashot shahbazian
Frequent Contributor
Username: Animatele

Post Number: 110
Registered: 06-2004
Posted on Wednesday, June 29, 2011 - 03:11 pm:   

Yes, the volume is accessible:

C:\Documents and Settings\Admin>dir \\Bulkvm\Software\nowsms\Q
Volume in drive \\Bulkvm\Software is SYS
Volume Serial Number is 1822-E9C0

Directory of \\Bulkvm\Software\nowsms\Q

28.06.2011 19:41 <DIR> .
28.06.2011 19:41 <DIR> ..
0 File(s) 0 bytes
2 Dir(s) 37 528 023 040 bytes free

C:\Documents and Settings\Admin>

We're testing the hypothesis of some missing service, as we've tried the network volume setup on another server (Windows XP though) and it worked! We're now comparing the lists of services, and will then try assigning the service to an account as per your suggestion.

Kind regards,
Ashot
ashot shahbazian
Frequent Contributor
Username: Animatele

Post Number: 111
Registered: 06-2004
Posted on Thursday, June 30, 2011 - 12:28 am:   

Hi Bryce,

Here's an update:

We've not yet tried tying NowSMS service to any account, but had success making one of the options work:

Windows SMB to a Linux server emulating the Windows protocol with a btrfs shared volume on fast SSD-s.

It worked after we started (and then stopped) the services that are disabled on our servers but running on a fresh WinXP install:
- error reporting service
- print spooler
- windows firewall
- web client
and a couple of other similarly unrelated ones. All of them were then disabled again and the servers rebooted, but the trouble never appeared again.

Interestingly, we couldn't make the servers work with a volume on a native Windows server. That's strange, since no registry changes or other tweaks were performed with either mount, and both Linux and Windows servers are in the same sub-net and vLAN.

Also, we couldn't hook up the shared volume via NFS, possibly because paths for NFS are specified differently. By your logic, it may work if we tied the Windows NFS client service to an account. We'll try that.

We've successfully hooked up both servers and confirmed test traffic being handled "first-come-first-serve" by two servers. Stopping the service on one of the servers made its messages seamlessly picked up and submitted by the second one.

The performance, however (1 Gbit Ethernet, same LAN, uncongested segment), seems to be about 1/7th of what the same server achieves on its own with a local (much slower SAS RAID) disk, which peaks at about 1000/sec with the NowSMS ESME short-circuited to its SME. With the network disk configuration, the short-circuited peaks were 120-130/sec per server with two servers running and ~270/sec with one of them stopped.

That was not surprising, as the servers were consuming constant 4.4 mbit/sec of network bandwidth each with service started but no messages received or sent. Looping 200 messages injected in the queue at 260/sec across two servers made the network bandwidth shoot to 20 megabit/sec per server, with the CPU load at 20-30%, as opposed to 2-4% at 1000/sec with a slower local disk.
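To put those figures in perspective, the network traffic per message can be estimated from them. This is a rough Python sketch, assuming the 260/sec is split evenly between the two servers (130/sec each) and that the 20 megabit/sec per server includes the 4.4 megabit/sec idle traffic; both are assumptions, and the function name is hypothetical.

```python
def bytes_per_message(mbit_per_sec: float, msgs_per_sec: float) -> float:
    """Average network bytes moved per message at a given link rate."""
    return mbit_per_sec * 1_000_000 / 8 / msgs_per_sec


per_server_rate = 260 / 2  # assumption: load split evenly across two servers

gross = bytes_per_message(20.0, per_server_rate)       # all observed traffic
net = bytes_per_message(20.0 - 4.4, per_server_rate)   # minus idle traffic

print(f"gross: {gross:.0f} bytes/message, net of idle: {net:.0f} bytes/message")
```

Under these assumptions each message costs somewhere in the region of 15 to 19 KB of network traffic, far more than the size of a short message itself, which is consistent with protocol and file-scan overhead dominating.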

Is there a configuration option that'd optimise bandwidth usage? The slowdown is probably caused by the network overhead. Live traffic with real routing and DLR would make this otherwise clever setup hardly usable.

We'll also check if the packet size for Windows SMB file transfers can be decreased to reduce the overhead for small files. Also having hopes about NFS, particularly NFS over SDP. The transport is Infiniband, with much lower latency and network overhead than conventional LAN. It'd take some time to set up and configure parameters though.

Kind regards,
Ashot
ashot shahbazian
Frequent Contributor
Username: Animatele

Post Number: 112
Registered: 06-2004
Posted on Monday, July 11, 2011 - 02:11 am:   

We've managed to make a high-performance redundant configuration:

The problem with network disk mounts, both Windows SMB and NFS, can be resolved: the tools for that are in the Windows SysInternals package.

NFS is practically of no use for the purpose; before version 4, file locks are not passed to the NFS client, and there's no stable release of v.4 for Windows. We've tested it with an older version. The result was that the messages would multiply quite rapidly: in a short-circuited configuration where each server's ESME is connected to its own SME and its peer server's SME, 100 messages injected in the queue turn into many thousands an hour later. With higher license speed caps, it would have happened in a few minutes.

No such problems occur if Windows SMB is used.

The performance limitation of a conventional 1GbE LAN can be largely avoided if Infiniband is used for IP transport. The peak speed with SMB to a Linux target over one 40 Gbps IB link was ~450 SMS/sec per server at ~12% CPU load: 4 times the speed at half the load compared to 1GbE.

If only one such two-server cluster is required, the setup does not need a full-fledged IB fabric and is not at all impractical. The target Linux server needs an HCA with two IB ports ($800); the boards in the NowSMS Windows servers can have just one port ($600 each). Both ports in the target should be configured in the same subnet. Thus, no IB switch is needed (that's where it gets expensive). If you need a redundant fileserver, a Linux cluster, with one 2-port HCA per server and one port on each used for intra-cluster communication, might be built relatively easily.

I'm also sure a Windows server could be used for the shared disks; we've just not tried it. The Linux one, with two weak X5130 CPUs, Debian x64 and btrfs, had its CPU running at just 0 to 1% during peaks. If it were Windows it might have been more demanding.

Should anyone need details on this setup please reply.

Kind regards,
Ashot
ashot shahbazian
Frequent Contributor
Username: Animatele

Post Number: 113
Registered: 06-2004
Posted on Wednesday, July 20, 2011 - 06:48 pm:   

Bryce, Des:

We have tried this clustered configuration in a live environment, with just two users and simple routing.

Unfortunately, it was quite unstable. Large queues of outbound messages accumulated, bringing sending to a halt.

About one in a hundred Delivery Receipts did not match, which left thousands of .sms files with unconverted (decimal) message IDs accumulating in the SMS-IN folder.

When sending stalled, "Timeout waiting for response" entries were recorded in the smsout log for batches of 30-60 messages at a time. The SMSdebug log had corresponding entries indicating failure to send submit and enquire link commands.

This happens regardless of whether both servers are active or the second one has the NowSMS services stopped.

With a large outbound queue, the load on the link to the fileserver (40 Gbps IB) reaches about 50 megabit/sec. It nearly brings down the fileserver with the shared volume, because of tens of thousands of disk I/O requests per second from the NowSMS servers, mostly read, scan and file lock.

When we change the file locations to a local disk and cut-paste all files, including the queues, from the network disk to a local one, the queue is quickly gone without any errors, and no DLRs get stuck.

It is as if packets are dropped on one network interface (1Gb Ethernet) because of a high load on the second one (40Gb IB). Is there a setting for NowSMS to reduce the intensity of file scans? Or perhaps one for using a separate thread for handling files accessed over the network?

Also we have noticed that with a network disk configuration most of the CPU load is at one core, which becomes 100% locked with just a few dozen outbound messages in the queue. With a local disk configuration, the CPU load spreads differently: 4 out of 8 cores are loaded at about 10% each with 2000 messages in the queue and one of the other 4 cores is locked, one at a time and changing from one to the other every minute or so.

The NowSMS version is 2010.11.04. Have any issues related to clustered configurations been addressed in later releases? If not, would you be willing to work on it? We would then get the debugs and diagnostic info off the fileserver.

Kind regards,
Ashot
Des - NowSMS Support
Board Administrator
Username: Desosms

Post Number: 3354
Registered: 08-2008
Posted on Friday, July 22, 2011 - 10:31 pm:   

Hi Ashot,

I'm going to have to defer to Bryce on this one.

Unfortunately, he's left for holiday next week.

I am concerned that there are two issues. One may be the intensity of file scans. (We did experiment with a setting to reduce these, which we will revisit. At the time the effort was abandoned because the change slowed performance on local disks. But it may help with network disks.)

Based upon your description, however, I am more concerned that the DLR tracking is a bigger problem. That is an area that I can see some potential concerns with in extremely high volume situations. And I need to defer to Bryce on this.

--
Des
NowSMS Support