Preventing remote connections from dropping when the user is idle

When working with GUI and terminal applications, a user connected remotely (usually over the Internet) often leaves the computer for about 15 minutes and returns to find the program frozen. Any action produces an error along the lines of “Communication with the server has been lost” or “[WINSOCK] virtual circuit reset by host”. The same thing happens during long-running methods (requests to the server) that show no progress bar and involve no interaction.

This problem is not specific to GUI and terminal solutions built on the Caché and Ensemble DBMS from InterSystems; it affects any client-server interaction over TCP/IP. It is usually solved at the application level by periodically exchanging special empty messages whose only purpose is to signal that the application is “alive”.

Below is how to solve this problem without programming.

Source of the problem

The source of the problem lies in the nature of TCP/IP. Typically, the two ends of a TCP session are on different networks, and the session passes through several routers, at least one of which usually performs network address translation (NAT). Router resources are limited, so some routers purge “dead” sessions from their NAT tables. A session is considered “dead” if no packets have been sent over it for a specified period (let's call it the cleaning interval). A “silent” session can therefore be mistaken for a “dead” one and removed from the NAT table. Neither end is notified of this (the router does not consider it its job), and both remain confident that the session is still “alive”. This is easy to verify by running netstat on the client or on the server after the error appears but before clicking OK. When the user clicks OK on the error message, the client learns that the session is broken; the server process ends when the OS recognizes the session as “dead”.
Experiments showed that the cleaning interval on many routers (at least those with Linux 2.4 / iptables firmware) is about 10 minutes. We will try to make our TCP session keep itself active automatically, even when no data packets are being transmitted.

Proposed Solution

At the OS level, the detection of broken TCP connections is governed by the following kernel parameters of the tcp_keepalive mechanism [1]:
• tcp_keepalive_time - the interval since the last data packet was sent; once it expires, the connection is marked as needing verification. The parameter is not used after testing has started;
• tcp_keepalive_intvl - the interval between test packets (which start being sent once tcp_keepalive_time expires);
• tcp_keepalive_probes - the number of unacknowledged test packets; when this count is exhausted, the connection is considered broken.
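
To see which values are currently in effect on a Linux host, the three parameters can be read from /proc/sys/net/ipv4. The following is a minimal sketch, assuming the standard procfs layout; the function name read_keepalive_settings and its proc_dir argument are illustrative and not part of any library.

```python
from pathlib import Path

# Kernel parameter names as listed above.
KEEPALIVE_PARAMS = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def read_keepalive_settings(proc_dir="/proc/sys/net/ipv4"):
    """Return the keepalive parameters found in proc_dir as a dict of ints.

    Parameters whose files are missing (for example on a non-Linux host)
    are left out of the result.
    """
    settings = {}
    for name in KEEPALIVE_PARAMS:
        path = Path(proc_dir) / name
        if path.is_file():
            settings[name] = int(path.read_text().strip())
    return settings


if __name__ == "__main__":
    for name, value in read_keepalive_settings().items():
        print(f"{name} = {value}")
```

On a host without these files the function returns an empty dictionary rather than raising an error.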

It must be said that the tcp_keepalive mechanism has a dual purpose: it can be used both to keep a connection artificially active and to detect broken (so-called “half-open”) connections. This article deals mainly with the first use; the second may be covered in a later article on this topic.

For the tcp_keepalive mechanism to apply to a TCP connection, two conditions must be met:
• support at the OS level; fortunately, it is available on both Windows and Linux;
• at one end of the connection, the socket must be opened with the SO_KEEPALIVE option. As it turns out, Caché services open their sockets with this option, and the OpenSSH service is easily made to do the same.
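
For readers writing their own socket code, the second condition amounts to a single socket option. The sketch below uses Python's standard socket module; the helper name open_keepalive_socket is hypothetical, and this is not how Caché or OpenSSH are implemented, only an illustration of the option itself.

```python
import socket


def open_keepalive_socket():
    """Create a TCP socket with SO_KEEPALIVE switched on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the OS to check this connection when it goes idle; the timing
    # comes from the system-wide tcp_keepalive settings.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


if __name__ == "__main__":
    with open_keepalive_socket() as s:
        enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
        print("SO_KEEPALIVE enabled:", bool(enabled))
```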

The first parameter (tcp_keepalive_time) is of greatest interest to us, since it determines how often idle connections (those carrying no traffic) are checked. Its default value, on both Windows and Linux, is two hours (7200 s). The typical idle time after which a connection is dropped is about 10 minutes. It is therefore proposed to set the parameter to 5 minutes, which keeps TCP sessions artificially active without loading the network with excessive traffic (5 minutes is not 5 seconds).
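
The reasoning behind the 5-minute value can be checked with a toy model. The sketch below is a deliberate simplification, not a description of any real router: it assumes the NAT entry is dropped whenever the gap between two packets reaches the cleaning interval, and that on a healthy idle connection a test packet goes out every tcp_keepalive_time seconds. Both function names are hypothetical.

```python
def session_survives(packet_times, cleaning_interval):
    """Return True if every gap between consecutive packets is shorter
    than the router's cleaning interval.

    packet_times: sorted times (seconds) at which a packet crossed the router.
    cleaning_interval: idle time after which the router drops the NAT entry.
    """
    return all(
        later - earlier < cleaning_interval
        for earlier, later in zip(packet_times, packet_times[1:])
    )


def packets_during_idle(idle_start, idle_end, keepalive_time):
    """Packet times for an idle period: the last data packet at idle_start,
    keepalive test packets every keepalive_time seconds, and the next data
    packet at idle_end."""
    times = list(range(idle_start, idle_end, keepalive_time))
    times.append(idle_end)
    return times


if __name__ == "__main__":
    cleaning_interval = 600  # about 10 minutes, as observed above
    idle_end = 900           # the user was away for 15 minutes
    for keepalive_time in (7200, 300):
        times = packets_during_idle(0, idle_end, keepalive_time)
        print(keepalive_time, session_survives(times, cleaning_interval))
```

With the default of 7200 seconds no test packet is sent during the 15-minute pause and the model reports the session as lost; with 300 seconds the gaps never reach the cleaning interval.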

Setting tcp_keepalive parameters on a Windows server

You must have administrator rights on the server. In the registry key
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
create a DWORD value named KeepAliveTime with the value 300000 (decimal). The parameter is specified in milliseconds, so the suggested value corresponds to 5 minutes. Then stop Caché and restart the server.

As for the other two tcp_keepalive parameters, their Windows defaults are as follows:

KeepAliveInterval
Key: Tcpip\Parameters
Value Type: REG_DWORD (time in milliseconds)
Valid Range: 0–0xFFFFFFFE
Default: 1000 (1 second)

KeepAliveProbes
There is no such registry value (one that sets the number of unacknowledged test packets). According to [2], Windows 2000 / XP / 2003 use the value of the TcpMaxDataRetransmissions parameter for this purpose (the default is 5), while later versions [3] use a fixed value of 10. Therefore, if only the first parameter is changed (from 7200 to 300 seconds) and the default is kept for the second, a Windows 2008 server will learn of a broken TCP connection in 1 * 10 + 300 = 310 seconds.

Setting tcp_keepalive parameters on a Linux server

You can change the parameter values “on the fly” without restarting the server. Log in as root and run:

echo 300 > /proc/sys/net/ipv4/tcp_keepalive_time

To make the change persist across server reboots, the easiest way is to edit the kernel parameters file /etc/sysctl.conf, adding a line to it (preferably two, the first being a comment):

 # help to prevent disconnects
 net.ipv4.tcp_keepalive_time = 300

Note that, unlike on Windows, the value is specified in seconds.
As for the other two tcp_keepalive parameters, their Linux defaults are:

 net.ipv4.tcp_keepalive_intvl = 75
 net.ipv4.tcp_keepalive_probes = 9

If only the first parameter is changed (from 7200 to 300) and the defaults are kept for the other two, a Linux server will learn of a broken connection only after 75 * 9 + 300 = 975 seconds.
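
The detection times quoted for Windows and Linux follow from the same formula, which can be written down as a small helper. The function name detection_time is illustrative; the numbers are the ones given in the text.

```python
def detection_time(keepalive_time, interval, probes):
    """Seconds from the last data packet until the OS declares the
    connection broken: the idle period plus all unanswered test packets."""
    return keepalive_time + interval * probes


if __name__ == "__main__":
    # Windows 2008: KeepAliveTime 300 s, KeepAliveInterval 1 s, 10 test packets.
    print(detection_time(300, 1, 10))   # 310
    # Linux: tcp_keepalive_time 300 s, tcp_keepalive_intvl 75 s, 9 test packets.
    print(detection_time(300, 75, 9))   # 975
```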

Setting the TCPKeepAlive parameter in the Caché DBMS configuration

Starting with version 2008.2, Caché on Windows and Linux can set tcp_keepalive_time at the socket level, which is convenient because it avoids changing operating system settings. In its “pure form”, however, this feature is mainly of interest to developers of their own socket servers. Fortunately, it has been complemented by the TCPKeepAlive = n configuration parameter in the [SQL] section, which can be edited on the System Management Portal page Configuration > General SQL Settings. The default is 300 seconds (just what the doctor ordered). The parameter applies not only to SQL but, as you might guess, to any connection to Caché handled by the %Service_Bindings service. This includes, in particular, object access via CacheActiveX.Factory, so if your application can use this protocol as a transport, do not miss the opportunity.
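
For developers of their own socket servers, setting the keepalive time per socket looks roughly as follows in Python. This is a sketch under stated assumptions, not Caché's implementation: the helper name set_socket_keepalive is hypothetical, and the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT constants are only present in Python's socket module on platforms that support them (Linux does), so the code skips any that are missing.

```python
import socket


def set_socket_keepalive(sock, idle=300, interval=75, probes=9):
    """Enable keepalive on sock and, where the platform allows it,
    override the system-wide timing for this socket only."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    per_socket_options = (
        ("TCP_KEEPIDLE", idle),       # counterpart of tcp_keepalive_time
        ("TCP_KEEPINTVL", interval),  # counterpart of tcp_keepalive_intvl
        ("TCP_KEEPCNT", probes),      # counterpart of tcp_keepalive_probes
    )
    for name, value in per_socket_options:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        set_socket_keepalive(s, idle=300)
        print(bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
```

The system-wide settings stay untouched; only sockets passed to the helper use the shorter interval.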

Setting KeepAlive Options in OpenSSH Server Configuration

If you use SSH [4] (for command-shell work or as a transport for your GUI application), the kernel configuration described above will most likely be enough, since the OpenSSH service (at least in version 5.x) opens its socket with the SO_KEEPALIVE option by default.

Just in case, it is worth checking the configuration file /etc/ssh/sshd_config. Look for the line

 #TCPKeepAlive yes

If it is there, you do not need to do anything: the configuration file lists the default parameter values in commented-out form.

The SSH v.2 protocol has alternative means of monitoring session activity, for example the OpenSSH service parameters ClientAliveInterval and ClientAliveCountMax.
With these parameters, unlike TCPKeepAlive, the keepalive requests are sent through the encrypted SSH channel and cannot be spoofed. It has to be admitted that these alternative means are more secure than the traditional TCPKeepAlive mechanism, whose keepalive packets can be analyzed and used to mount DoS attacks [5].

 ClientAliveInterval 0

Sets the timeout in seconds after which, if no data has been received from the client, sshd sends it a message through the secure channel requesting a response. The default is 0, which means that no such messages are sent to the client.

 ClientAliveCountMax 3

Sets the number of such messages that sshd may send without receiving a response. When the limit is reached, sshd disconnects the client and ends the session. The default value is 3. If you set ClientAliveInterval to 60 and leave ClientAliveCountMax unchanged, unresponsive ssh clients will be disconnected after approximately 180 seconds. In that case, you should disable the TCP keepalive mechanism by setting

 TCPKeepAlive no
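
The effect of these three options can be worked out from a configuration file with a short script. The sketch below is a simplified reading of sshd_config, not a full parser: it assumes plain numeric values in seconds, ignores Match blocks and included files, and takes the first occurrence of each keyword. The function names are hypothetical, and the defaults are the ones stated above.

```python
# Defaults as described in the text; keys are lower-cased keyword names.
DEFAULTS = {
    "tcpkeepalive": "yes",
    "clientaliveinterval": "0",
    "clientalivecountmax": "3",
}


def effective_alive_settings(config_text):
    """Return the effective values of the three keepalive-related options."""
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and commented-out defaults
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip()
        if key in DEFAULTS and key not in settings:
            settings[key] = value
    return {**DEFAULTS, **settings}


def client_timeout(settings):
    """Seconds until an unresponsive client is disconnected,
    or None if ClientAliveInterval is 0 (checking disabled)."""
    interval = int(settings["clientaliveinterval"])
    if interval == 0:
        return None
    return interval * int(settings["clientalivecountmax"])


if __name__ == "__main__":
    sample = "#TCPKeepAlive yes\nTCPKeepAlive no\nClientAliveInterval 60\n"
    settings = effective_alive_settings(sample)
    print(settings["tcpkeepalive"], client_timeout(settings))  # no 180
```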

Does it always work?

There are categories of network problems in which the described approach may be ineffective.

One of them arises when, because of poor network service quality, connectivity physically disappears for short periods. If a session is idle and connectivity is lost and restored before the client or the server tries to send anything, neither side notices and the TCP session survives. With periodic TCPKeepAlive checks, the server is more likely to contact the client during such a temporary outage, which can force the TCP connection to be dropped. In that situation you can try raising KeepAliveInterval on the server to 60-75 seconds (remember that the Windows default is 1 second) with a maximum of 10 repetitions, in the hope that any temporary network problem will clear up within 10 minutes. However, if the retransmissions go on for too long, so that
KeepAliveTime + (KeepAliveInterval * number of repetitions) > 10 minutes
the TCP session can, despite all efforts, be mistaken for a “dead” one and purged from the NAT table.

Another category of problems is related to insufficient capacity of the routers and/or communication channels in use, when any packets, including keepalive packets, may be lost under congestion. With routers, such problems are sometimes solved by changing the firmware (for example, this helped me with an Acorp ADSL XXXX) or, in the worst case, by replacing the router with a more capable model. With communication channels that are “too narrow”, the only remedy is to widen them.

Conclusion

The proposed approach keeps TCP/IP sessions that currently carry no data artificially active purely at the system level, with no changes to the application code. It has been tested and is used successfully with Caché for UNIX (Red Hat Enterprise Linux 5 for x86-64) 2010.1.4 (Build 803), Caché for Windows (x86-64) 2010.1.4 (Build 803), and later versions.

It should be acknowledged that it works effectively only if the network connection is physically stable and you have no network problems other than idle sessions being dropped.

When deploying an application in a hostile environment (remote access, distributed systems, etc.), consider implementing keepalive not at the TCP level but in a higher-level secure protocol; SSH is a good candidate.

Literature

[1] Fabio Busatto. TCP Keepalive HOWTO
[2] How does it work against Windows Server 2003?
[3] TCP/IP Registry Values for Microsoft Windows Vista and Windows Server 2008
[4] Interactive system view system manuals (man-s)
[5] OpenSSH: installing and configuring sshd

Source: https://habr.com/ru/post/155565/
