Let's start with the initial data.
There is an ISP with 4 NAS servers based on FreeBSD + MPD5 (PPPoE), and the challenge is to even out the load across them.
First, let's take a look at the server load graph before normalization:

The terrible spread is immediately visible (note the brown and blue lines).
These servers have the same configuration (ALMOST the same compared to the other NAS, see UPD_1), the same software and the same settings (and the same network connection). But judging by the graph, one of them is still idle more than the other.
This effort was started to restore justice.
Normalization technique
Technically, we use telnet access to mpd5 and set the max-children parameter: a low value to restrict access to a specific NAS, or max-children = 10,000 to open access to a specific server.
We also read the number of users on each NAS for subsequent analysis by the algorithm.
Algorithm
This is the most controversial section. Here I give my conclusions, which WORK. Nevertheless, I invite you to contribute your own vision of this algorithm. To start, the input is a list of NAS servers in the format:
nases = [
    ('VPN2', ('192.168.X.Y', '5001', 'username', 'password', 600)),
    ('VPN3', ('192.168.X.Y1', '5001', 'username', 'password', 300)),
    ('VPN4', ('192.168.X.Y2', '5001', 'username', 'password', 600)),
    ('VPN5', ('192.168.X.Y3', '5001', 'username', 'password', 200)),
]
where the parameters are, in order:
server name, address, port, mpd administrator name, password, normalization coefficient (in fact, this is the maximum number of parallel sessions the server supports).
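The positional tuples are compact, but they lead to expressions like nas[1][4] later on. As a minimal sketch, assuming Python 3, the same record could be given named fields. The `Nas` type, its field names, and the `to_named` helper are hypothetical and not part of the original script:

```python
from collections import namedtuple

# Hypothetical named view of one entry of `nases`; field names are illustrative.
Nas = namedtuple("Nas", "name host port user password koeff")


def to_named(entry):
    """Convert ('VPN2', (host, port, user, pwd, koeff)) into a Nas record."""
    name, (host, port, user, pwd, koeff) = entry
    return Nas(name, host, port, user, pwd, koeff)


nases = [
    ("VPN2", ("192.168.X.Y", "5001", "username", "password", 600)),
    ("VPN3", ("192.168.X.Y1", "5001", "username", "password", 300)),
]
named = [to_named(entry) for entry in nases]

# Reads better than reduce(lambda total, nas: total + nas[1][4], ...)
total_koeff = sum(nas.koeff for nas in named)
```

With named fields, the coefficient sum no longer depends on remembering which tuple position holds the coefficient.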
The essence of the algorithm is as follows:
- Determine whether all servers are working (if one or more servers are down, redistribute their coefficients proportionally among the others)
- Calculate the load percentage of each server, taking step 1 into account
- Sort the list in ascending order of percentage filled (least loaded first)
- Restrict the N fullest servers and make the rest available.
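The four steps above can be sketched on in-memory data, without any telnet access. This is an illustration under assumptions, not the author's script: the function name `plan_limits`, its parameters, and the user counts in the example are made up, and unlike the original code it skips unreachable servers instead of leaving them in the sorted list.

```python
def plan_limits(nases, users, keep_open=2, down_percent=10, open_limit=10000):
    """Return {name: max_children} for one balancing pass.

    nases: list of (name, koeff) pairs.
    users: {name: current sessions}; 0 or missing is treated as "down".
    """
    total = sum(koeff for _, koeff in nases)
    working = sum(koeff for name, koeff in nases if users.get(name, 0) != 0)
    if working == 0:
        return {}  # no server answered, nothing to balance

    scale = total / working  # step 1: share the capacity of dead servers
    rows = []
    for name, koeff in nases:
        count = users.get(name, 0)
        if count == 0:
            continue  # unreachable, cannot be configured anyway
        percent = 100.0 * count / (koeff * scale)  # step 2: load percentage
        rows.append((percent, name, count))

    rows.sort()  # step 3: least loaded first

    limits = {}
    for index, (percent, name, count) in enumerate(rows):
        if index < keep_open:
            limits[name] = open_limit  # step 4: leave the emptiest servers open
        else:
            # cap slightly below the current session count
            limits[name] = round(count * (100 - down_percent) / 100)
    return limits


# Hypothetical session counts, for illustration only.
limits = plan_limits(
    [("VPN2", 600), ("VPN3", 300), ("VPN4", 600), ("VPN5", 200)],
    {"VPN2": 300, "VPN3": 240, "VPN4": 120, "VPN5": 120},
)
```

In this example VPN4 (20%) and VPN2 (50%) stay open, while VPN5 (60%) and VPN3 (80%) are capped at 90% of their current sessions.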
Implementation, step by step:
1. Determine the number of users:
import telnetlib

def getUserCount(anas):
    (nas, params) = anas
    (host, port, user, pwd, koeff) = params
    try:
        # log in to the mpd5 telnet console
        client = telnetlib.Telnet(host, port)
        client.read_until("Username:")
        client.write(user + '\n')
        client.read_until("Password:")
        client.write(pwd + '\n')
        client.read_until("[]")
        # current number of users
        client.write("sh mem \n")
        res = int(client.read_until("BUND").split('\n')[3].split()[1])
        # read the rest of the output up to the prompt
        dummy = client.read_until("[]")
        # current max-children limit
        client.write("sh global \n")
        limit = int(client.read_until("Global options").split("\n")[10].split()[2])
        client.close()
        return (nas, (res, limit), params)
    except Exception:
        # an unreachable server is reported as having 0 users
        return (nas, (0, 0), params)

ress = map(getUserCount, nases)
2. Compute the sum of the coefficients of all servers:
naskcount = reduce(lambda total,nas: total+nas[1][4],nases,0)
3. Compute the sum of the coefficients of the servers that are up (those that reported a non-zero user count in step 1):
naswcount = reduce(lambda total,nas: total+(((nas[1][0] != 0) and nas[2][4]) or 0),ress,0)
4. Calculate the new coefficient, taking into account the coefficients of the servers that may be down (new_koeff = koeff * (naskcount / naswcount)):
def corKoeff(nas):
    (nas, (klk, limit), params) = nas
    (host, port, user, pwd, koeff) = params
    # a server with 0 users is treated as down and gets coefficient 0
    koeff = koeff * ((klk != 0) and (float(naskcount) / float(naswcount)) or 0)
    params = (host, port, user, pwd, int(round(koeff, 0)))
    return (nas, (klk, limit), params)

ress1 = map(corKoeff, ress)
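As a worked example of this formula, take the coefficients from the nases list and assume that VPN3 is down. The outage is hypothetical and chosen only for illustration; the snippet is written for Python 3:

```python
koeffs = {"VPN2": 600, "VPN3": 300, "VPN4": 600, "VPN5": 200}
down = {"VPN3"}  # hypothetical outage, for illustration only

naskcount = sum(koeffs.values())  # all servers: 1700
naswcount = sum(k for name, k in koeffs.items() if name not in down)  # 1400

# new_koeff = koeff * (naskcount / naswcount), rounded; 0 for dead servers
new_koeffs = {
    name: (0 if name in down else int(round(k * naskcount / naswcount)))
    for name, k in koeffs.items()
}
```

The surviving servers end up with 729, 729 and 243. Because each value is rounded separately, their sum is 1701 rather than exactly 1700, which is harmless for a load percentage.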
5. Calculate the percentage of server occupancy (percent = 100% * klk / koeff) and sort:
def calcMaxUser(nas):
    (nas, (klk, limit), params) = nas
    (host, port, user, pwd, koeff) = params
    return (nas, klk, (koeff != 0 and float(100 * klk) / float(koeff)) or 0, params)

ress2 = map(calcMaxUser, ress1)
# ascending by percentage: least loaded servers first
ress2.sort(lambda x, y: cmp(x[2], y[2]))
Now for a little trick: we replace the percentage field with -1 for all servers except the 2 least loaded ones (I personally don't really like this approach, but I don't know how to make it more elegant):
for i in range(-1, -len(ress2) + 1, -1):
    (nas, klk, koeff, params) = ress2[i]
    koeff = -1
    ress2[i] = (nas, klk, koeff, params)
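One possible answer to the "how to do this better" question is to avoid the -1 marker entirely and split the sorted list by slicing. The sketch below assumes Python 3, where `cmp` and the comparison-function form of `sort` no longer exist, so sorting uses `key=`. The helper name `split_by_load`, the `keep_open` parameter, and the sample numbers are hypothetical:

```python
def split_by_load(rows, keep_open=2):
    """Split (name, users, percent, params) rows into (open, restricted) lists."""
    ordered = sorted(rows, key=lambda row: row[2])  # least loaded first
    return ordered[:keep_open], ordered[keep_open:]


# Made-up rows in the same shape as ress2; params are omitted here.
rows = [
    ("VPN2", 300, 50.0, None),
    ("VPN3", 240, 80.0, None),
    ("VPN4", 120, 20.0, None),
    ("VPN5", 120, 60.0, None),
]
open_rows, restricted_rows = split_by_load(rows)
```

The two lists can then be handled separately, with no need to overwrite a field to carry the decision.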
After that, ress2 contains everything we need to set max-children in mpd, which is what we do next: for servers marked -1, set the maximum to the current number of users minus 10% (the percentage can be picked arbitrarily, the main thing is that the limit is not higher than the current number of connected users), and for all others set it to 10,000:
down_koeff = 10  # percent below the current user count (10% per the text)

def setMax(nas):
    (nas, klk, koeff, params) = nas
    (host, port, user, pwd, kkx) = params
    if koeff != -1:
        # least loaded servers: effectively no limit
        klk = 10000
    else:
        # marked servers: cap below the current number of users
        klk = round(klk * float(100 - down_koeff) / float(100))
    try:
        client = telnetlib.Telnet(host, port)
        client.read_until("Username:")
        client.write(user + '\n')
        client.read_until("Password:")
        client.write(pwd + '\n')
        client.read_until("[]")
        client.write("set global max-children %d\n" % klk)
        client.close()
        return (nas, "max set to %d" % klk)
    except Exception:
        return (nas, "Error setting max-children %d" % klk)
Result
After installing this script in cron, this is the resulting normalization:

Blue and brown are now almost the same, which is what we wanted to achieve.
P.S.
On working with telnet in Python: here
mpd5 parameters (in Russian): here
P.P.S.
When working with telnet, do not forget to read the output through to the end, for example with
dummy = client.read_until("[]")
otherwise there will be unforeseen consequences (for example, commands that silently do not run).
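One way to make this hard to forget is to wrap "send a command, then read up to the next prompt" in a single helper. The sketch below assumes Python 3 and a client object with telnetlib-style `write` and `read_until` methods that take and return bytes. The names `run_command` and `FakeClient` are hypothetical, and the fake replies are invented only to make the example runnable without a server:

```python
def run_command(client, command, prompt=b"[]"):
    """Send one command and read everything up to the next prompt."""
    client.write(command.encode("ascii") + b"\n")
    return client.read_until(prompt)


class FakeClient:
    """Stand-in for a telnet connection, used only to demonstrate the helper."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.sent = []

    def write(self, data):
        self.sent.append(data)

    def read_until(self, expected):
        # a real client would block until `expected` appears in the stream
        return self.replies.pop(0)


# Invented replies; real mpd5 output will look different.
client = FakeClient([b"first reply\r\n[]", b"second reply\r\n[]"])
first = run_command(client, "set global max-children 10000")
second = run_command(client, "sh global")
```

Because every call drains the output of its own command, the next command always starts from a clean prompt.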
UPD: Changed the title (thanks to obramko). It was: "round-robin to normalize the load on the MPD5 VPN server"
UPD_1 from the admin: Some clarifications about the "controversial" NAS:
Brown:
FreeBSD 8.2-RELEASE-p1 #1: Fri May 13 22:55:37 EEST 2011
CPU: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz (3363.89-MHz K8-class CPU)
Origin = "GenuineIntel" Id = 0x106e5 Family = 6 Model = 1e Stepping = 5
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x98e3fd<SSE3,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,POPCNT>
AMD Features=0x28100800 <SYSCALL,NX,RDTSCP,LM>
AMD Features2=0x1<LAHF>
TSC: P-state invariant
real memory = 2147483648 (2048 MB)
avail memory = 2041532416 (1946 MB)
Blue:
FreeBSD 8.2-RELEASE-p1 #1: Fri May 13 22:55:37 EEST 2011
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (2664.79-MHz K8-class CPU)
Origin = "GenuineIntel" Id = 0x106a5 Family = 6 Model = 1a Stepping = 5
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x98e3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,POPCNT>
AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
AMD Features2=0x17<LAHF>
TSC: P-state invariant
real memory = 4294967296 (4096 MB)
avail memory = 4106272768 (3916 MB)
In both cases, lagg over 2 em interfaces is used for communication with clients, and igb is the uplink to the outside world.
The difference is in CPU and RAM.
I apologize to everyone for the misleading claim about the servers being identical. BUT in principle, they hold the same number of users just fine when using the above technique. Therefore, I consider this note useful and I am waiting for your comments.
UPD_2 from the admin: The reasons that led us to develop this method.
It so happens that most clients connect in the morning, but the heavy traffic starts at 6-7 in the evening.
Without normalization, the "blue" NAS takes on most of the connections and copes with them fine until the peak. But after 6-7 p.m. most clients start downloading at 15-20 Mbit/s each, it cannot cope, and it panics about once every 2 weeks. This normalization method spreads clients according to the coefficients we set before hour X arrives, giving clients a more reliable connection and the support staff an extra couple of hours of sleep.