The Internet came to Russia in 1995-96. The average computer then was an AMD 486DX150 or an Intel Pentium 100 with 4-8 MB of RAM and a 100-400 MB hard disk. Windows 95 had just appeared, and it was this new OS that forced hardware upgrades to those values, since on a typical machine of '94 - a 486 SX25 or DX66 with 2-4 MB - Windows 95 barely crawled. Providers' Internet servers and routers in those years were exactly the same kind of machines, or even weaker, because Linux was still quite comfortable in 2 MB (without a GUI), sites were mostly static, and mail spam did not exist yet. At best, one person in a thousand had Internet access - a few hundred people in an average Russian regional center - and they all worked through the provider's individual mail and web servers. That is, one server of that configuration, negligible by today's standards, served roughly as many people as there are Internet users in a fairly large enterprise now. And it coped ...
Are there computers of comparable power today (if the word "power" even applies to such hardware), and how are they used? There are. No, these are not routers and certainly not smartphones - both are noticeably more capable, even if we consider only home routers and the simplest phones. Routers easily push 100 Mbps, phones easily play video and carry hundreds of megabytes of memory - nothing close to that existed in '96. We have to look among weaker processors for something comparable to the Pentium 100, that is, around 100 MHz or up to 200 DMIPS ...
The closest candidate for comparison is an ARM Cortex-M3 or -M4 microcontroller. The former runs slightly below 100 MHz, the latter slightly above, with a declared performance of 125-210 DMIPS (1.25 DMIPS/MHz, 2.19 CoreMark/MHz - about the same as a Pentium of '96). The built-in SRAM is only about 100 KB, but megabytes of external SDRAM can be connected to the controller, running at the same frequency as the first Pentiums. In total we get an approximate equivalent of a motherboard of those years, except that the entire board is smaller than a Pentium socket, and the power consumption is an order of magnitude lower than that of the (only 10-watt!) Pentiums of the time.
How and what will we compare, and most importantly, why. We will measure the performance of the Cortex-M3 on network tasks that are not typical for it. Many Cortex-M3 chips have an Ethernet controller on board - in TI's versions even with a built-in PHY (i.e. no external chips are needed, it connects directly to a MagJack RJ45) - but it is usually assumed that it should be used either in industrial automation, to push control and monitoring commands over the network, or in everyday life: network meters for utilities, smart-home control, and so on. Important tasks, but from an entirely different field. Yet since in expected performance the chip corresponds to servers that in the past served entire cities, why limit ourselves to this traditional niche - does it not make sense to evaluate the controller's suitability for server tasks? Not at Internet scale, but for a local network. At a time when spacecraft roam the universe and IT departments outsource their internal tasks to external clouds, the task of running a real network service on a machine that is hundreds or thousands of times slower than an ordinary office PC may look pointless. But then again, if the controller (suddenly) copes with it, doesn't the hardware "arms race" turn this pointless task into a meaningful one? And isn't that what our engineering work is ultimately about - improving efficiency, including by reducing the cost of production? Either (1) producing more for the same money, or (2) the same amount of work for less money. We will probe the limits of our capabilities along the second path.
Now about the choice of the test problem. There is one IT problem in which the volume of operations is constant over time, because it is limited not by the capabilities of technology but by the performance of the slowest participant in the process - a human. And this weak link cannot, unfortunately (or fortunately), be excluded from the process, because the process is the most human one of all - decision making! - and a person prefers not to entrust it to technology for now; hardware with such capabilities might decide that the person is no longer needed ... The essence of the process: a person analyzes the current situation (based on the information available) and decides what to do next, fixing that decision either in direct physical actions or in a change of internal state ("Now I know what to do" or "I'll think about it tomorrow ..."). Setting aside the small automatic decisions made on autopilot - where to put your foot or which pocket to take the phone from - and counting only conscious decisions that require focused attention, a person's "clock frequency" is no more than 200-300 decisions per day. If a person has to synchronize all their decisions with other people, i.e. solve common tasks jointly and at the same time, individual productivity drops by an order of magnitude: it is hardly possible to hold more than 20 meetings a day, and time is still needed to prepare information and to put decisions into practice (personally or by delegating to someone else, which also requires synchronization) ...

Groupware systems (GWS) are meant to raise the efficiency of collaboration, and they can do this only by turning synchronous events into asynchronous ones, eliminating blocking and waiting between people: having resolved or completed one task, the employee should immediately receive from the GWS all the information needed to make the next decision. The GWS thus works like the task scheduler in an operating system, and people work like supermarket cashiers: each works at the limit of their own capacity (as long as there are customers - i.e. "tasks" - in their queue), dealing only with their own queue and without being distracted by interaction with colleagues (any such interaction, like calling to the neighboring cashier "Lena, can you break a five-thousand note?", instantly stalls two queues at once); colleagues must work asynchronously, keeping up the supply of change, register tape, goods on the shelves, customers in the hall, and so on.
Thus, for a GWS we can estimate the required maximum system performance: 200 events per day times the number of users. In practice, of course, such figures would only be needed by an organization with very well-developed processes. A typical small office or enterprise (up to 100 people) will most likely generate only 200-300 decisions per day in total, especially since far from all events are reflected in the GWS. We will assume that a few thousand per day cover (with a margin) all the current needs of a typical office - for almost any industry and regardless of the technology around it.
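This back-of-envelope arithmetic is easy to write down explicitly (a sketch in Python; the figures are the assumptions from the two paragraphs above, not measurements):

    # Rough GWS load estimate; all constants are the assumptions stated in the text.
    DECISIONS_PER_PERSON_PER_DAY = 200    # upper bound on conscious, recorded decisions
    OFFICE_SIZE = 100                     # a small office / enterprise

    # Theoretical ceiling: every decision of every employee goes through the GWS.
    ceiling = DECISIONS_PER_PERSON_PER_DAY * OFFICE_SIZE   # 20,000 events/day

    # Realistic figures: only a fraction of decisions ends up in the GWS.
    typical = 300                         # 200-300 events/day for the whole office
    with_margin = 3000                    # "a few thousand per day" covers everything

    print(ceiling, typical, with_margin)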
If you are still here, let us translate this into computer terms. What is an event or task in a GWS? An e-mail message, a calendar event, a task in a tracker, an application or order placed on a website, and so on. In most cases such a file or database record is from a few to a few tens of KB in size. I computed the average size of a task in our calendars (ics files in VCALENDAR format with VTODO tasks or VEVENT events inside) over several years - 700 bytes. A forum message - one and a half KB. An internal wiki page - 4 KB. A technical-support e-mail - 12 KB (often with attached files, i.e. the average size of the actual text is much smaller). With these sizes, the volume of new information can be estimated at 1 MB per day for one employee and no more than 50 MB per day for the whole office (given that many GWS items are addressed to several recipients). If the volumes are higher, a significant share of the tasks will simply never be done - there will physically be no time to process them - they will be deleted unread or hang forever (or be auto-closed/archived "past the statute of limitations", which is the same thing). With such volumes of information a GWS server built on a modern PC can cache several months of tasks in RAM, can "ship" all of one employee's tasks for a day in well under a second, can keep all of the office's tasks for several years available online, and so on. Clearly, depending on the database formats, the protocols, and the server and client software used, these parameters can vary by an order of magnitude, but it is equally clear that the power of an ordinary PC is sufficient for these tasks with a huge margin. So we can try to do the same job on weaker hardware.
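The data-volume estimate works out the same way (again a sketch; the item sizes are the measured averages quoted above, the mix is my assumption):

    # Daily data volume in the GWS; sizes are the measured averages quoted above
    # (700 B per calendar item, 1.5 KB per forum message, 4 KB per wiki page,
    # 12 KB per support e-mail).
    items_per_employee_per_day = 200      # ceiling from the previous estimate
    avg_item_size = 5000                  # assumed pessimistic mix of the sizes above, bytes

    per_employee = items_per_employee_per_day * avg_item_size   # ~1 MB/day
    office_total = 50 * 10**6             # <= 50 MB/day for the whole office: many items
                                          # go to several recipients at once, so the total
                                          # is far below 100 employees x 1 MB
    print(per_employee / 10**6, "MB/day per employee;",
          office_total / 10**6, "MB/day per office")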
At such small volumes, and with caching, server performance can be considered unlimited in the case of a PC; for the controller the numbers are as follows: a Cortex-M3 at 80 MHz copies SDRAM memory at 16 MB/s (with the CPU alone, without DMA) and sends TCP/IP data through the integrated 100 Mbit Ethernet controller at 4-5 MB/s (also without DMA, i.e. there are reserves for optimization). It turns out that, ideally, the controller could synchronize all the calendars in minutes, and for an individual user in about a second. In practice, of course, everything is usually very far from ideal, depending on the protocols and the usage pattern.
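To see where "minutes" and "a second" come from, here is the same estimate spelled out (a sketch; the history depth in the last line is my own illustrative assumption):

    # Ideal-case transfer times for the Cortex-M3 "server", ignoring protocol overhead.
    TCP_RATE = 4.5e6                 # bytes/s over the built-in Ethernet, without DMA
    PER_USER_DAY = 1e6               # one employee's daily portion, from the estimate above
    OFFICE_DAY = 50e6                # the whole office's daily volume
    OFFICE_MONTH = OFFICE_DAY * 30   # assumed: a month of office history kept online

    print(PER_USER_DAY / TCP_RATE, "s for one user's day")        # ~0.2 s
    print(OFFICE_DAY / TCP_RATE, "s for the whole office's day")  # ~11 s
    print(OFFICE_MONTH / TCP_RATE / 60, "min for a month")        # ~5-6 min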
Suppose users work with events and tasks through calendar clients that speak WebDAV/CalDAV (Outlook, Evolution, Sunbird, Thunderbird + Lightning, etc. on desktop PCs, iCal on iPhone/iPad, and so on), and the server is, accordingly, any web server supporting these protocols. We will not be testing the client software (it is clear that modern PCs are more than enough for it), nor the performance of specific GWS products (it is clear that they are slow to widely varying degrees), but the relative strength of the controller and the PC in network work on this kind of task. When synchronizing calendars the client software uses a handful of HTTP/WebDAV commands - OPTIONS, PROPFIND, GET, PUT - with most of the time spent in PROPFIND, which selects from the list of events on the server the subset matching the given criteria. The principle is roughly the same as receiving mail: the client finds out "what's new" and downloads it, plus (unlike mail, but like NNTP news, for example) it finds out what is missing on the server and uploads it with PUT. For simplicity, at the first stage we will test only GET, using the ApacheBench (ab) program. We will launch a set of parallel short connections with single GET requests, simulating non-optimal clients fetching one task each; we will run long connections with multiple GETs in Keep-Alive mode, simulating a PROPFIND with a multiget inside; and in each test we will receive several hundred short files, simulating the synchronization sessions of individual users fetching all their tasks for one or several days.
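For reference, the exchange such a client performs looks roughly like the sketch below (Python with the requests library against a hypothetical collection URL; real CalDAV clients usually send a calendar-query/multiget REPORT rather than a bare PROPFIND, but the "list, compare, download, upload" pattern is the same):

    import requests

    BASE = "http://192.168.0.156/calendars/user1/"   # hypothetical calendar collection

    # 1. Find out what the server supports at all.
    requests.request("OPTIONS", BASE)

    # 2. PROPFIND with Depth: 1 lists the collection's items with their ETags,
    #    so the client learns "what's new" without downloading everything.
    propfind = """<?xml version="1.0" encoding="utf-8"?>
    <propfind xmlns="DAV:"><prop><getetag/><getcontenttype/></prop></propfind>"""
    requests.request("PROPFIND", BASE, data=propfind,
                     headers={"Depth": "1", "Content-Type": "application/xml"})

    # 3. GET the items whose ETags changed (a hypothetical event file here) ...
    requests.get(BASE + "event-0001.ics")

    # 4. ... and PUT back items that exist locally but not yet on the server.
    requests.put(BASE + "new-event.ics",
                 data="BEGIN:VCALENDAR\r\n...\r\nEND:VCALENDAR\r\n",
                 headers={"Content-Type": "text/calendar"})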
Hardware. We will compare an office PC with an economical AMD E-350 processor (power consumption close to the first Pentiums or to modern Atoms, performance two orders of magnitude above the first Pentiums and noticeably above the Atoms) and 4 GB of RAM against a 6x6 cm HonixBox box with an ARM Cortex-M3 (TI LM3S9B95) and 8 MB of RAM inside. Both are plugged into the same 100 Mbps switch in the same cabinet and are separated by two more (gigabit) switches from the computer in the same office on which ab is run, playing the role of the clients.
Software. The PC runs Windows 8 beta 64-bit with nginx/1.0.14 for Windows (the current stable build) as the server. HonixBox runs a home-grown web server on a home-grown OS built with a home-grown compiler (with no special optimization for this task: the server was written to report network statistics, HonixBox's main "profession"; about 500 bytes of code are in assembler, everything else is written in Forth without any optimization, since HonixBox's original task was extremely simple - which is exactly why it was implemented on a controller in the first place).
Testing. At first I wanted to request "/" on HonixBox (a small script that outputs a page with the current HonixBox settings and summary statistics) and run phpinfo() on nginx, but PHP for Windows in FastCGI mode under nginx does not survive 500 requests - it crashes. I have not figured out why yet (this was the first time in my life I ran nginx), so I switched to static files, all the more so because WebDAV calendars are sets of static files anyway.
ab -n 500 -c 100 192.168.0.156/favicon.ico (the only file on HonixBox close in size to a VTODO; on nginx we request the even shorter index.html) - we simulate a hundred clients at a time fetching 500 individual tasks.
Nginx result:
Time taken for tests: 1.994990 seconds
Complete requests: 500
Failed requests: 0
Write errors: 0
Total transferred: 181000 bytes
HTML transferred: 75500 bytes
Requests per second: 250.63 [#/sec] (mean)
Time per request: 398.998 [ms] (mean)
Time per request: 3.990 [ms] (mean, across all concurrent requests)
Transfer rate: 88.22 [Kbytes/sec] received
HonixBox result:
Time taken for tests: 10.466127 seconds
Complete requests: 500
Failed requests: 0
Write errors: 0
Total transferred: 211500 bytes
HTML transferred: 159000 bytes
Requests per second: 47.77 [#/sec] (mean)
Time per request: 2093.225 [ms] (mean)
Time per request: 20.932 [ms] (mean, across all concurrent requests)
Transfer rate: 19.68 [Kbytes/sec] received
A fivefold advantage of the PC over the controller. Note the low overall transfer rate in both cases - only 20-90 KB/s. Most likely the speed is limited mainly by the delays of setting up and tearing down many TCP connections (the two sides keep waiting for each other's acknowledgements). But the number of requests per second is quite consistent with the typical speed of calendar synchronization or mail retrieval.
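A quick sanity check over the numbers printed above supports this (a sketch; the "pure wire time" lines assume a 100 Mbit link and ignore protocol headers):

    # Per-request timing derived from the ab output above.
    nginx_ms_per_req = 1.994990 / 500 * 1000    # ~4.0 ms per request
    honix_ms_per_req = 10.466127 / 500 * 1000   # ~20.9 ms per request

    # Pure wire time for the payload itself (151 and 318 bytes per response),
    # assuming a 100 Mbit/s link and ignoring headers.
    wire_ms_nginx = 151 * 8 / 100e6 * 1000      # ~0.01 ms
    wire_ms_honix = 318 * 8 / 100e6 * 1000      # ~0.03 ms

    # Transferring the body accounts for well under 1% of each request's time;
    # the rest goes to TCP connection setup/teardown and request handling.
    print(nginx_ms_per_req, honix_ms_per_req, wire_ms_nginx, wire_ms_honix)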
Now remove -c and add -k (Keep-Alive): nginx accelerates to 259 requests per second, HonixBox to 52 - the single-client variant (one client receives 500 tasks in 2 s from nginx and in 10 s from HonixBox). Add -c 10 (10 clients receiving 50 tasks each simultaneously): nginx accelerates to 424 requests per second, while HonixBox stays at 52, this time losing eight-fold.
The results are worse than I expected - I even began to suspect that something was wrong with ApacheBench under Windows - but the final balance of power between the PC and the controller probably does not depend on ab. And it shows that the limiting factor is not processor speed but the network (even though FTP between these same two machines runs at 11 MB/s, a hundred times the typical speed in the tests above), and even more so the work pattern: many small objects fetched by individual HTTP requests. Most importantly, the difference in "rate of fire" between the PC server and the controller server is only 5-8 times, not the 50-plus times one might expect - and the software on the controller still has untouched room for optimization, while the PC has a huge reserve of unused power: nginx did not have to strain in this test, it spent most of its time waiting for I/O. Yet even in the current version both servers deliver data about as fast as (or faster than) one usually sees when working, for example, in Lightning against a real WebDAV server, so the user is unlikely to notice the difference. All the more interesting it will be to compare the PC and the controller in real Groupware operation. We will take that up next time, if anyone besides me is interested.