
My 5 kopecks about Highload Cup 2017, or the story of 9th place

There have already been several articles about Highload Cup, so I'm not going to describe the competition itself; those who missed it can read "The history of 13th place at Highload Cup 2017".

I will also try not to repeat others and share the solutions that I find interesting. Under the cut:

  1. A little about the data structure
  2. JSON parsing with defines
  3. URI unescaping
  4. UTF-8 decoding
  5. HTTP Server
  6. Network tuning

and a lot of code.

Bicycles


I wrote the first version in Go using net/http and encoding/json. It fell over at 2,000 RPS. After that, net/http was replaced by fasthttp, and encoding/json by easyjson. This configuration let me go to bed in first place, but by morning I was already somewhere around third. Here a dilemma arises: keep optimizing the Go code, or switch to C++ right away to have a more flexible tool closer to the final, when nanoseconds matter.
I chose the second option, and at the same time decided to use only system libraries: to write my own HTTP server, which wastes no time on things unnecessary for this task, and my own JSON parser/serializer. I also initially wanted to play with libnuma and SSE 4.2 instructions, but it never came to that, because, looking ahead, the longest operation was the write to the socket.

All the code below is not production ready; it is written for the specific test cases of the competition. There is no overflow protection in it (or rather, no protection at all), so it is not safe to use in this form!

A little about the data structure


There are only 3 tables: users, locations, and visits.

The ammo for the Yandex.Tank contained a little more than 1,000,000 users, about 800,000 locations, and a little more than 10,000,000 visits.

The service should return items from these tables by id. The first impulse was to put them into hash maps, but fortunately the ids turned out to be almost gapless, so you can allocate contiguous arrays and store the elements there.
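
A minimal sketch of the idea (User, Location, Visit and the MAX_* capacities here are my placeholders, sized with a margin over the ammo, not the exact constants from my code):

 // Plain arrays indexed by id replace hash maps entirely
 User* users = (User*) calloc(MAX_USERS, sizeof(User));
 Location* locations = (Location*) calloc(MAX_LOCATIONS, sizeof(Location));
 Visit* visits = (Visit*) calloc(MAX_VISITS, sizeof(Visit));

 // A lookup is a single indexed read: no hashing, no pointer chasing
 Visit& v = visits[id];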

The service should also be able to work with aggregated information, namely a user's list of visits (with filters) and a location's average rating.

To do this effectively, we need indexes.

For each user, I added a std::set<uint32_t, visitsCmp>, where visitsCmp stores the visit ids sorted by visit date. That is, during output you do not need to copy the visits into a separate array and sort them; you can immediately write them in serialized form to a buffer. It looks like this:

 struct visitsCmp {
     Visit* visits;
     bool operator()(const uint32_t& i, const uint32_t& j) const {
         if (visits[i].VisitedAt == visits[j].VisitedAt) {
             return visits[i].Id < visits[j].Id;
         }
         return visits[i].VisitedAt < visits[j].VisitedAt;
     }
 };
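
Since the comparator carries a pointer to the visits array, the set has to be constructed with an instance of it. A minimal usage sketch (the variable names are mine):

 // The comparator instance is passed to the set's constructor
 std::set<uint32_t, visitsCmp> userVisits(visitsCmp{visits});
 userVisits.insert(visitId);
 // Iteration yields ids already ordered by VisitedAt, ready for output
 for (uint32_t id : userVisits) {
     // serialize visits[id] straight into the response buffer
 }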

In the case of a location's average rating, the order is not important, so for each location I added a field of type std::unordered_set<uint32_t>, which contains the visits of that specific location.

On every addition or change of a visit, you must remember to update the data in the affected indexes. In code it looks like this:

 bool DB::UpdateVisit(Visit& visit, bool add) {
     if (add) {
         memcpy(&visits[visit.Id], &visit, sizeof(Visit));
         // Register the new visit in both indexes
         users[visit.User].visits->insert(visit.Id);
         locations[visit.Location].visits->insert(visit.Id);
         return true;
     }
     // The visit date defines the ordering in the user index,
     // so the id has to be erased and re-inserted after the change
     if (visit.Fields & Visit::FIELD_VISITED_AT) {
         users[visits[visit.Id].User].visits->erase(visit.Id);
         visits[visit.Id].VisitedAt = visit.VisitedAt;
         users[visits[visit.Id].User].visits->insert(visit.Id);
     }
     if (visit.Fields & Visit::FIELD_MARK) {
         visits[visit.Id].Mark = visit.Mark;
     }
     // If the user changed, move the visit from the old user's index to the new one
     if (visit.Fields & Visit::FIELD_USER) {
         users[visits[visit.Id].User].visits->erase(visit.Id);
         users[visit.User].visits->insert(visit.Id);
         visits[visit.Id].User = visit.User;
     }
     // Same for the location index
     if (visit.Fields & Visit::FIELD_LOCATION) {
         locations[visits[visit.Id].Location].visits->erase(visit.Id);
         locations[visit.Location].visits->insert(visit.Id);
         visits[visit.Id].Location = visit.Location;
     }
     return true;
 }

On average, an index holds 10 elements, at most 150. So one could get by with a plain array, which would improve data locality and not slow down modification much.
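
A sketch of that array alternative, reusing the same comparator (my reconstruction, not code from the competition):

 // A sorted vector: with ~10 elements the O(n) insert is practically
 // free, and iteration is cache-friendly
 std::vector<uint32_t> ids;
 auto pos = std::lower_bound(ids.begin(), ids.end(), newId, visitsCmp{visits});
 ids.insert(pos, newId);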

JSON parsing with defines


The JSON parsers I found for C/C++ build a tree while parsing, and that means extra allocations, which is unacceptable under high load. There are also those that write data directly into variables, but then it is impossible to find out which fields were present in the JSON object, and that matters: when a JSON object is updated, it comes not with the full set of fields but only with those that need to change.

The JSON that the service must parse is very simple: a single-level object that contains only known fields, with no escaped quotes inside the strings. The JSON for a user looks like this:

 { "id": 1, "email": "robosen@icloud.com", "first_name": "", "last_name": "", "gender": "m", "birth_date": 345081600 } 

That is, it's pretty easy to write a parser for it in a meta-language:
 bool User::UmnarshalJSON(const char* data, int len) {
     JSON_SKIP_SPACES()
     JSON_START_OBJECT()
     while (true) {
         JSON_SKIP_SPACES()
         // end of the object
         if (data[0] == '}') {
             return true;
         // separator before the next field
         } else if (data[0] == ',') {
             data++;
             continue;
         // field "id"
         } else if (strncmp(data, "\"id\"", 4) == 0) {
             data += 4;
             JSON_SKIP_SPACES()
             JSON_FIELDS_SEPARATOR()
             JSON_SKIP_SPACES()
             // parse the number into Id
             JSON_LONG(Id)
             // remember that Id was present in the JSON
             Fields |= FIELD_ID;
         // field "last_name"
         } else if (strncmp(data, "\"last_name\"", 11) == 0) {
             data += 11;
             JSON_SKIP_SPACES()
             JSON_FIELDS_SEPARATOR()
             JSON_SKIP_SPACES()
             // parse the string into LastName
             JSON_STRING(LastName)
             // remember that LastName was present in the JSON
             Fields |= FIELD_LAST_NAME;
         } else if (strncmp(data, "\"first_name\"", 12) == 0) {
             data += 12;
             JSON_SKIP_SPACES()
             JSON_FIELDS_SEPARATOR()
             JSON_SKIP_SPACES()
             JSON_STRING(FirstName)
             Fields |= FIELD_FIRST_NAME;
         } else if (strncmp(data, "\"email\"", 7) == 0) {
             data += 7;
             JSON_SKIP_SPACES()
             JSON_FIELDS_SEPARATOR()
             JSON_SKIP_SPACES()
             JSON_STRING(EMail)
             Fields |= FIELD_EMAIL;
         } else if (strncmp(data, "\"birth_date\"", 12) == 0) {
             data += 12;
             JSON_SKIP_SPACES()
             JSON_FIELDS_SEPARATOR()
             JSON_SKIP_SPACES()
             JSON_LONG(BirthDate)
             Fields |= FIELD_BIRTH_DATE;
         } else if (strncmp(data, "\"gender\"", 8) == 0) {
             data += 8;
             JSON_SKIP_SPACES()
             JSON_FIELDS_SEPARATOR()
             JSON_SKIP_SPACES()
             JSON_CHAR(Gender)
             Fields |= FIELD_GENDER;
         } else {
             JSON_ERROR(Unknown field)
         }
     }
     return true;
 }

It only remains to define what the meta-language commands expand to, and the parser is ready:

 #define JSON_ERROR(t) fprintf(stderr, "%s (%s:%d)\n", #t, __FILE__, __LINE__); \
     return false;

 #define JSON_SKIP_SPACES() data += strspn(data, " \t\r\n");

 #define JSON_START_OBJECT() if (data[0] != '{') { \
         JSON_ERROR(Need {}) \
     } \
     data++;

 #define JSON_FIELDS_SEPARATOR() if (data[0] != ':') { \
         JSON_ERROR(Need :) \
     } \
     data++;

 #define JSON_LONG(field) char *endptr; \
     field = strtol(data, &endptr, 10); \
     if (data == endptr) { \
         JSON_ERROR(Invalid ## field ## value); \
     } \
     data = endptr;

 #define JSON_STRING(field) if (data[0] != '"') { \
         JSON_ERROR(Need dquote); \
     } \
     auto strend = strchr(data + 1, '"'); \
     if (strend == NULL) { \
         JSON_ERROR(Need dquote); \
     } \
     field = strndup(data + 1, strend - data - 1); \
     data = strend + 1;

 #define JSON_CHAR(field) if (data[0] != '"') { \
         JSON_ERROR(Need dquote); \
     } \
     if (data[2] != '"') { \
         JSON_ERROR(Need dquote); \
     } \
     field = data[1]; \
     data += 3;

URI unescaping


The endpoint that returns the list of places visited by a user has a country filter, which can arrive as a URI-encoded string: /users/1/visits?country=%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D1%8F . For decoding, a wonderful solution was found on StackOverflow, to which I added support for replacing + with a space:

 int percent_decode(char* out, char* in) {
     // Maps an ASCII character to its hex-digit value, -1 for everything else
     static const char tbl[256] = {
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
          0,  1,  2,  3,  4,  5,  6,  7,  8,  9, -1, -1, -1, -1, -1, -1,
         -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
     };
     char c, v1, v2;
     if (in != NULL) {
         while ((c = *in++) != '\0') {
             switch (c) {
             case '%':
                 // A percent sign must be followed by two hex digits
                 if (!(v1 = *in++) || (v1 = tbl[(unsigned char) v1]) < 0 ||
                     !(v2 = *in++) || (v2 = tbl[(unsigned char) v2]) < 0) {
                     return -1;
                 }
                 c = (v1 << 4) | v2;
                 break;
             case '+':
                 // x-www-form-urlencoded: '+' stands for a space
                 c = ' ';
                 break;
             }
             *out++ = c;
         }
     }
     *out = '\0';
     return 0;
 }

UTF-8 decoding


Strings in JSON objects can come in the form "\u0420\u043E\u0441\u0441\u0438\u044F". In general this is not terrible, but we compare against the country, so this one field must be decoded. I took percent_decode as a basis; only in the case of Unicode it is not enough to turn \u0420 into the two bytes 0x04 0x20: the code point must be converted to a UTF-8 character. Fortunately, we only have Cyrillic characters and spaces, and if you look at the Unicode table you can see that the UTF-8 encoding of the Cyrillic block has just one break, between the letters "п" and "р", so an offset can be used for the conversion:

 void utf_decode(char* in) {
     // The same hex-digit table as in percent_decode
     static const char tbl[256] = {
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
          0,  1,  2,  3,  4,  5,  6,  7,  8,  9, -1, -1, -1, -1, -1, -1,
         -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
     };
     char *out = in;
     while (in[0] != 0) {
         if (in[0] == '\\' && in[1] == 'u') {
             // Assemble the code point from the 4 hex digits
             uint16_t u = tbl[(unsigned char) in[2]] << 12 | tbl[(unsigned char) in[3]] << 8 |
                          tbl[(unsigned char) in[4]] << 4  | tbl[(unsigned char) in[5]];
             // ASCII code points are emitted as a single byte
             if (u < 255) {
                 out[0] = u;
                 out++;
             } else {
                 uint16_t w;
                 // Cyrillic up to 'п' encodes as 0xD0xx in UTF-8...
                 if (u >= 0x0410 && u <= 0x043f) {
                     w = u - 0x0410 + 0xd090;
                 // ...'р' and beyond encodes as 0xD1xx
                 } else {
                     w = u - 0x0440 + 0xd180;
                 }
                 out[0] = w >> 8;
                 out[1] = w;
                 out += 2;
             }
             in += 6;
         } else {
             out[0] = in[0];
             in++;
             out++;
         }
     }
     out[0] = 0;
 }
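
A quick sanity check of the decoder (my own example; the function decodes in place):

 char buf[] = "\\u0420\\u043E\\u0441\\u0441\\u0438\\u044F";
 utf_decode(buf);
 printf("%s\n", buf); // prints: Россия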

HTTP Server


Parser


From the HTTP request you need to extract the method (GET/POST), the query (path + parameters) and, in the case of a POST request, the body. Parsing and storing headers makes no sense, except for the Content-Length header of POST requests; but as it turned out later, even that is unnecessary, since every request fits into a single read. The result is a parser like this:

 ...
 auto body = inBuf.Data;
 const char *cendptr;
 char *endptr;
 while (true) {
     switch (state) {
     case METHOD:
         body += strspn(body, " \r\n");
         if (strncmp(body, "GET ", 4) == 0) {
             method = GET;
             body += 4;
         } else if (strncmp(body, "POST ", 5) == 0) {
             method = POST;
             body += 5;
         } else {
             state = DONE;
             WriteBadRequest();
             return;
         }
         body += strspn(body, " ");
         // The path ends at the space before the protocol version
         cendptr = strchr(body, ' ');
         if (cendptr == NULL) {
             WriteBadRequest();
             return;
         }
         strncpy(path.End, body, cendptr - body);
         path.AddLen(cendptr - body);
         cendptr = strchr(cendptr + 1, '\n');
         if (cendptr == NULL) {
             WriteBadRequest();
             return;
         }
         state = HEADER;
         body = (char*) cendptr + 1;
         break;
     case HEADER:
         cendptr = strchr(body, '\n');
         if (cendptr == NULL) {
             WriteBadRequest();
             return;
         }
         // An empty line ("\r\n") terminates the headers
         if (cendptr - body < 2) {
             if (method == GET) {
                 doRequest();
                 return;
             }
             state = BODY;
         }
         body = (char*) cendptr + 1;
         if (state != BODY) {
             break; // keep skipping header lines
         }
         // fall through: the POST body starts right after the empty line
     case BODY:
         requst_body = body;
         doRequest();
         return;
     }
 }
 ...

Routing


There are very few handlers, so routing is just a switch on the method, and inside, the prefix search is a simple comparison:

 ...
 switch (method) {
 case GET:
     if (strncmp(path.Data, "/users", 6) == 0) {
         handlerGetUser();
     } else if (strncmp(path.Data, "/locations", 10) == 0) {
         handlerGetLocation();
     } else if (strncmp(path.Data, "/visits", 7) == 0) {
         handlerGetVisit();
     } else {
         WriteNotFound();
     }
     break;
 case POST:
     if (strncmp(path.Data, "/users", 6) == 0) {
         handlerPostUser();
     } else if (strncmp(path.Data, "/locations", 10) == 0) {
         handlerPostLocation();
     } else if (strncmp(path.Data, "/visits", 7) == 0) {
         handlerPostVisit();
     } else {
         WriteNotFound();
     }
     break;
 default:
     WriteBadRequest();
 }
 ...

Keep-alive


Yandex.Tank pays no attention to the "Connection" header in the ammo, but it does look at this header in the server's response. Therefore the connection should never be closed; the server must always work in keep-alive mode.
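
A sketch of what every response therefore advertises (the body here is a stand-in; the real handlers serialize JSON into a buffer):

 // The server never closes the socket; every reply says keep-alive
 const char resp[] =
     "HTTP/1.1 200 OK\r\n"
     "Connection: keep-alive\r\n"
     "Content-Type: application/json\r\n"
     "Content-Length: 2\r\n"
     "\r\n"
     "{}";
 write(fd, resp, sizeof(resp) - 1); // fd stays open for the next request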

Working with the network


For asynchronous IO, epoll was the natural choice. I know 3 popular ways to use epoll in a multithreaded application:

  1. N threads share one epoll, plus 1 thread that waits in blocking accept and registers client sockets with that epoll.
  2. N threads have N epolls, plus 1 thread that waits in blocking accept and registers client sockets in the epolls, say round-robin.
  3. Each thread has its own epoll, in which the server socket (in non-blocking mode) is registered along with the client sockets that this thread has captured.

I compared options 2 and 3, and in local tests the third option won by a little; it looks like this:

 void Worker::Run() {
     int efd = epoll_create1(0);
     if (efd == -1) {
         FATAL("epoll_create1");
     }
     connPool = new ConnectionsPool(db);
     // Register the listening socket in this thread's epoll
     auto srvConn = new Connection(sfd, defaultDb);
     struct epoll_event event;
     event.data.ptr = srvConn;
     event.events = EPOLLIN;
     if (epoll_ctl(efd, EPOLL_CTL_ADD, sfd, &event) == -1) {
         perror("epoll_ctl");
         abort();
     }
     struct epoll_event *events;
     events = (epoll_event*) calloc(MAXEVENTS, sizeof event);
     while (true) {
         auto n = epoll_wait(efd, events, MAXEVENTS, -1);
         for (auto i = 0; i < n; i++) {
             auto conn = (Connection*) events[i].data.ptr;
             if ((events[i].events & EPOLLERR) ||
                 (events[i].events & EPOLLHUP) ||
                 (!(events[i].events & EPOLLIN))) {
                 /* An error has occurred on this fd, or the socket is not
                    ready for reading (why were we notified then?) */
                 fprintf(stderr, "epoll error\n");
                 close(conn->fd);
                 if (conn != srvConn) {
                     connPool->PutConnection(conn);
                 }
                 continue;
             // An event on the listening socket means one or more
             // incoming connections: accept them without blocking
             } else if (conn == srvConn) {
                 struct sockaddr in_addr;
                 socklen_t in_len = sizeof in_addr;
                 int infd = accept4(sfd, &in_addr, &in_len, SOCK_NONBLOCK);
                 if (infd == -1) {
                     if ((errno == EAGAIN) || (errno == EWOULDBLOCK)) {
                         continue;
                     } else {
                         perror("accept");
                         continue;
                     }
                 }
                 int val = true;
                 if (setsockopt(infd, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)) == -1) {
                     perror("TCP_NODELAY");
                 }
                 event.data.ptr = connPool->GetConnection(infd);
                 event.events = EPOLLIN | EPOLLET;
                 if (epoll_ctl(efd, EPOLL_CTL_ADD, infd, &event) == -1) {
                     perror("epoll_ctl");
                     abort();
                 }
                 continue;
             // Otherwise it is a client socket: read the request and process it
             } else {
                 bool done = false;
                 bool closeFd = false;
                 while (true) {
                     ssize_t count = read(conn->fd, conn->inBuf.Data, conn->inBuf.Capacity);
                     if (count == -1) {
                         /* EAGAIN means we have read all the data;
                            go back to the main loop */
                         if (errno != EAGAIN) {
                             perror("read");
                             done = true;
                         }
                         break;
                     } else if (count == 0) {
                         /* End of file: the remote has closed the connection */
                         done = true;
                         closeFd = true;
                         break;
                     }
                     conn->inBuf.AddLen(count);
                     done = conn->ProcessEvent();
                     break;
                 }
                 if (done) {
                     if (closeFd) {
                         close(conn->fd);
                         connPool->PutConnection(conn);
                     } else {
                         conn->Reset(conn->fd);
                     }
                 }
             }
         }
     }
 }

Already after the submission deadline, I decided to abandon epoll and try the classic prefork model, only with 1,500 threads (Yandex.Tank opened 1,000+ connections). By default, each thread reserves 8MB for its stack, which gives 1,500 * 8MB = 11.7GB, while under the terms of the competition the application gets 4GB of RAM. Fortunately, the stack size can be reduced with the pthread_attr_setstacksize function. The minimum stack size is 16KB; since I don't keep anything big on the stack inside the threads, I chose a stack size of 32KB:

 pthread_attr_t attr;
 pthread_attr_init(&attr);
 if (pthread_attr_setstacksize(&attr, 32 * 1024) != 0) {
     perror("pthread_attr_setstacksize");
 }
 pthread_create(&thr, &attr, &runInThread, (void*) this);

Now the threads occupy 1,500 * 32KB = 47MB.
In local tests, this solution showed slightly worse results than epoll.
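
For illustration, a sketch of what such a prefork worker boils down to (my reconstruction; handleRequest is a placeholder for the parser and router shown above):

 // Each of the 1,500 threads blocks in accept() on the shared listening
 // socket, then serves its client with plain blocking reads
 void* worker(void* arg) {
     int sfd = *(int*) arg;
     char buf[8192];
     while (true) {
         int cfd = accept(sfd, NULL, NULL);
         if (cfd == -1) {
             continue;
         }
         ssize_t n;
         // Keep-alive: keep reading requests until the client disconnects
         while ((n = read(cfd, buf, sizeof(buf))) > 0) {
             handleRequest(cfd, buf, n);
         }
         close(cfd);
     }
     return NULL;
 }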

Network tuning


For profiling I used gperftools, which showed that the longest operation was std::to_string. That was fixed fairly quickly, but now most of the time was spent in epoll_wait, write and writev. At first I paid no attention to this, though that attention might have been what it took to reach the prize winners; eventually I began studying what to do with write, finding speedups for accept along the way.
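
For reference, this is roughly how the profiler is attached (a sketch; RunServer is a placeholder for the serving loop, and gperftools also works without code changes via the CPUPROFILE environment variable, with the result examined by pprof):

 #include <gperftools/profiler.h>

 // Link with -lprofiler; the resulting file is analyzed with pprof
 ProfilerStart("highload.prof"); // start sampling into this file
 RunServer();                    // placeholder for the serving loop
 ProfilerStop();                 // flush the profile to disk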

TCP_NODELAY


By default, to optimize network usage, the kernel glues small pieces of data together into one packet (Nagle's algorithm). In our case this only gets in the way, since the service always sends small packets and wants to send them as quickly as possible. So turn it off:

 int val = 1;
 if (setsockopt(sfd, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)) == -1) {
     perror("TCP_NODELAY");
 }

TCP_DEFER_ACCEPT


This option allows you to respond without waiting for the final ACK of the TCP handshake from the client:

 int val = 1;
 if (setsockopt(sfd, IPPROTO_TCP, TCP_DEFER_ACCEPT, &val, sizeof(val)) == -1) {
     perror("TCP_DEFER_ACCEPT");
 }

TCP_QUICKACK


Just in case, I set this option as well, although I don't fully understand the principle of its operation:

 int val = 1;
 if (setsockopt(sfd, IPPROTO_TCP, TCP_QUICKACK, &val, sizeof(val)) == -1) {
     perror("TCP_QUICKACK");
 }

SO_SNDBUF and SO_RCVBUF


Buffer sizes also affect network transfer rates. The default is about 16KB. Without changing kernel settings they can be increased to about 400KB, although you can ask for any size:

 int sndsize = 2 * 1024 * 1024;
 if (setsockopt(sfd, SOL_SOCKET, SO_SNDBUF, &sndsize, (int) sizeof(sndsize)) == -1) {
     perror("SO_SNDBUF");
 }
 if (setsockopt(sfd, SOL_SOCKET, SO_RCVBUF, &sndsize, (int) sizeof(sndsize)) == -1) {
     perror("SO_RCVBUF");
 }

At that size, broken packets and timeouts began to appear.

accept4


Usually, to get a client socket, the accept function is called plus 2 fcntl calls to set the O_NONBLOCK flag. Instead of 3 system calls you can use accept4, which does the same thing in 1 system call when the SOCK_NONBLOCK flag is passed as the last argument:

  int infd = accept4(sfd, &in_addr, &in_len, SOCK_NONBLOCK); 

aio


One more way to do IO asynchronously: aio provides the lio_listio function, which allows combining several write/read operations into one system call, which should reduce the overhead of switching into kernel space.

The idea was simple: since several events arrive from epoll at once, you can prepare the responses for all of those clients and send them all at the same time. Unfortunately, it brought no improvement and had to be abandoned.
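
For reference, a minimal sketch of that batching (my reconstruction, not the exact code; fds, bufs and lens stand for the ready connections collected from one epoll_wait batch):

 #include <aio.h>

 struct aiocb cbs[MAXEVENTS];
 struct aiocb* list[MAXEVENTS];
 for (int i = 0; i < n; i++) {
     memset(&cbs[i], 0, sizeof(struct aiocb));
     cbs[i].aio_fildes = fds[i];     // client socket
     cbs[i].aio_buf = bufs[i];       // prepared response
     cbs[i].aio_nbytes = lens[i];
     cbs[i].aio_lio_opcode = LIO_WRITE;
     list[i] = &cbs[i];
 }
 // One system call submits all writes; LIO_WAIT blocks until they complete
 if (lio_listio(LIO_WAIT, list, n, NULL) == -1) {
     perror("lio_listio");
 }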

epoll_wait(..., -1) -> epoll_wait(..., 0)


As it turned out, this was the key trick, one that could have cut the penalty time by about 30 seconds, but unfortunately I found out about it too late.
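
The change itself is one argument: with a timeout of 0, epoll_wait never puts the thread to sleep, so events are picked up with minimal latency at the cost of a fully loaded core (sketch; handleEvent is a placeholder for the dispatch shown earlier):

 while (true) {
     // Poll instead of block: returns immediately even with no events
     int n = epoll_wait(efd, events, MAXEVENTS, 0);
     for (int i = 0; i < n; i++) {
         handleEvent(&events[i]);
     }
 }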

Postscriptum


Thanks to the organizers for the competition; even though not everything went smoothly, it was very exciting and informative.

Source: https://habr.com/ru/post/337854/

