Amazon SQS Testing

There are already several performance reviews of this Amazon service online. In this article I did not set out to verify results that have already been obtained; I was interested in some features not covered in other sources, namely:

The documentation says that Amazon tries to preserve the order of messages. How well is it preserved?
How quickly is a message received when using Long Polling?
How much does batch processing speed things up?

Problem statement

The best-supported AWS library for Erlang is erlcloud [1]. To initialize the library, just call the start and configure methods, as described on GitHub. My messages will contain a set of random characters generated by the following function:

    random_string(0) -> [];
    random_string(Length) -> [random_char() | random_string(Length - 1)].

    %% A random printable ASCII character (codes 32..126).
    random_char() -> random:uniform(95) + 31.

To measure the speed, we use a well-known function based on timer:tc, but with some changes:

    test_avg(M, F, A, R, N) when N > 0 ->
        {Ret, L} = test_loop(M, F, A, R, N, []),
        Length = length(L),
        Min = lists:min(L),
        Max = lists:max(L),
        Med = lists:nth(round(Length / 2), lists:sort(L)),
        Avg = round(lists:foldl(fun(X, Sum) -> X + Sum end, 0, L) / Length),
        io:format("Range: ~b - ~b mics~n"
                  "Median: ~b mics~n"
                  "Average: ~b mics~n",
                  [Min, Max, Med, Avg]),
        Ret.

    %% Runs M:F N times; the result of each run becomes the first
    %% argument of the next one. Collects the run times in microseconds.
    test_loop(_M, _F, _A, R, 0, List) ->
        {R, List};
    test_loop(M, F, A, R, N, List) ->
        {T, Result} = timer:tc(M, F, [R | A]),
        test_loop(M, F, A, Result, N - 1, [T | List]).

The changes concern how the function under test is called. In this variant I added the R argument, which passes the value returned by the previous run into the next one. This is needed to generate message numbers and to collect additional information about shuffling when receiving messages. The function that sends a numbered message therefore looks like this:

    send_random(N, Queue) ->
        erlcloud_sqs:send_message(Queue, [N + 1 | random_string(6000 + random:uniform(6000))]),
        N + 1.

And its call, with statistics collection:

 test_avg(?MODULE, send_random, [QueueName], 31, 20)

Here 31 is the starting message number, and it was not chosen at random. Erlang does not distinguish between a list of integers and a string, so the number at the head of the message is sent as the character with that code. Since send_random sends N + 1, the first message starts with character 32 (#x20). Smaller numbers can also be sent to SQS, but the continuous ranges available there are small (the allowed characters are #x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF], see [2] for details), and leaving the allowed range raises an exception. Thus, send_random generates a message that starts with its number and sends it to the queue named Queue; it returns that number, which the next call uses as its starting point. test_avg receives [QueueName], which becomes the second argument of send_random; the first argument is the number carried over from the previous run (31 initially), and the last argument of test_avg, 20, is the number of repetitions.
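The point about numbers and strings is easy to check in isolation. Below is a minimal sketch; the module name and both helper functions are hypothetical and exist only for illustration, they are not part of erlcloud. It encodes the allowed character range quoted above as guards and shows that prepending a number to a string simply yields a longer string.

```erlang
-module(sqs_chars).
-export([valid_xml_char/1, numbered_body/2]).

%% True if the code point falls into the allowed character range
%% quoted in the text (the XML 1.0 Char production).
valid_xml_char(C) when C =:= 16#9; C =:= 16#A; C =:= 16#D -> true;
valid_xml_char(C) when C >= 16#20, C =< 16#D7FF -> true;
valid_xml_char(C) when C >= 16#E000, C =< 16#FFFD -> true;
valid_xml_char(C) when C >= 16#10000, C =< 16#10FFFF -> true;
valid_xml_char(_) -> false.

%% Prepend a message number to a body. The result is still a plain
%% Erlang string, because a string is just a list of integers.
numbered_body(N, Body) when is_integer(N), is_list(Body) ->
    [N | Body].
```

For example, sqs_chars:numbered_body(32, "abc") is the string " abc", while sqs_chars:valid_xml_char(31) is false, which is why the numbering starts at 32.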

The function that will receive messages and check their order will look like this:

    checkorder(N, []) -> N;
    checkorder(N, [H | T]) ->
        %% The first character of the body is the message number.
        [{body, [M | _]} | _] = H,
        K = if
                M > N -> M;
                true ->
                    io:format("Wrong ~b less than ~b~n", [M, N]),
                    N
            end,
        checkorder(K, T).

    receive_checkorder(LastN, Queue) ->
        [{messages, List} | _] = erlcloud_sqs:receive_message(Queue),
        remove_list(Queue, List),
        checkorder(LastN, List).

Deleting messages:

    remove_msg(_, []) -> wrong;
    remove_msg(Q, [{receipt_handle, Handle} | _]) -> erlcloud_sqs:delete_message(Q, Handle);
    remove_msg(Q, [_ | T]) -> remove_msg(Q, T).

    remove_list(_, []) -> ok;
    remove_list(Q, [H | T]) ->
        remove_msg(Q, H),
        remove_list(Q, T).

The list passed for deletion contains a lot of unneeded information (message body, etc.). The delete function finds the receipt_handle required to form the request, or returns wrong if no receipt_handle is found.

Message shuffling

Looking ahead, I can say that even with a small number of messages the shuffling turned out to be quite significant, and an additional task arose: measuring the degree of shuffling. Unfortunately, I could not find a good criterion, so I decided to report the maximum and average deviation from the correct position. Knowing the size of such a window, you can restore the message order on receipt, although processing speed naturally suffers.
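Restoring the order with a window can be sketched as follows. This is only an illustration under assumptions of my own: it works on bare message numbers rather than SQS messages, the module and function names are hypothetical, and the numbers must be unique because a set is used as the buffer. The buffer holds at most Window numbers; as soon as it overflows, the smallest one is released.

```erlang
-module(reorder_window).
-export([reorder/2]).

%% Reorder a list of unique message numbers using a buffer of at most
%% Window elements. The output is fully sorted as long as no message
%% arrives more than Window positions late.
reorder(Numbers, Window) when is_integer(Window), Window >= 0 ->
    reorder(Numbers, Window, gb_sets:new(), []).

reorder([], _Window, Buf, Acc) ->
    %% Input exhausted: flush the buffer in ascending order.
    lists:reverse(Acc) ++ gb_sets:to_list(Buf);
reorder([N | Rest], Window, Buf0, Acc) ->
    Buf1 = gb_sets:add(N, Buf0),
    case gb_sets:size(Buf1) > Window of
        true ->
            {Smallest, Buf2} = gb_sets:take_smallest(Buf1),
            reorder(Rest, Window, Buf2, [Smallest | Acc]);
        false ->
            reorder(Rest, Window, Buf1, Acc)
    end.
```

With a window of 1, reorder_window:reorder([2, 1, 4, 3, 5], 1) returns [1, 2, 3, 4, 5]; with a window of 0 the input is passed through unchanged. The cost is latency: a message is held back until Window later messages have arrived.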

To calculate this difference, it is enough to change only the function of checking the order of messages:

    checkorder(N, []) -> N;
    checkorder({N, Cnt, Sum, Max}, [H | T]) ->
        [{body, [M | _]} | _] = H,
        {N1, Cnt1, Sum1, Max1} =
            if
                %% M arrived after a larger number N: count it as late.
                M < N ->
                    {N, Cnt + 1, Sum + N - M,
                     if Max < N - M -> N - M; true -> Max end};
                true ->
                    {M, Cnt, Sum, Max}
            end,
        checkorder({N1, Cnt1, Sum1, Max1}, T).

The call that runs the series looks as follows:

 {_, Cnt, Sum, Max} = test_avg(?MODULE, receive_checkorder, [QueueName], {0, 0, 0, 0}, Size)

This gives the number of elements that arrived later than they should have, the sum of their distances from the largest element received so far, and the maximum offset. The most interesting value for me is the maximum offset. The other characteristics are debatable and may not be computed very well (for example, if one element is read too early, all elements that should have preceded it are counted as out of order). The results:
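To make these three values concrete, here is the same accounting applied to a plain list of numbers instead of SQS messages. This is a sketch for illustration only; the module and function names are hypothetical.

```erlang
-module(offset_demo).
-export([offsets/1]).

%% Same accounting as checkorder/2, but over a plain list of numbers.
%% Returns {Late, Sum, Max}: how many numbers arrived after a larger
%% one, the sum of their distances from the largest number seen so
%% far, and the largest such distance.
offsets(Numbers) ->
    {_, Late, Sum, Max} = lists:foldl(fun step/2, {0, 0, 0, 0}, Numbers),
    {Late, Sum, Max}.

step(M, {N, Late, Sum, Max}) when M < N ->
    D = N - M,
    {N, Late + 1, Sum + D, max(Max, D)};
step(M, {_N, Late, Sum, Max}) ->
    {M, Late, Sum, Max}.
```

For the sequence [1, 2, 5, 3, 4, 6] the result is {2, 3, 2}: messages 3 and 4 are late, at distances 2 and 1 from message 5. The sequence [5, 1, 2, 3, 4] gives {4, 10, 4}, which shows the weakness mentioned above: a single message read too early makes every message that should have preceded it count as late.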

Size (pcs)	20	50	100	150	200	250	300	400	500	600	700	800	900	1000
Maximum offset (pcs)	11	32	66	93	65	139	184	155	251	241	218	249	359	227
Average offset (pcs)	5.3	10.5	23.9	43	25.6	45.9	48.4	65.6	74.2	74.2	78.3	72.3	110.8	82.8

The first line is the number of messages in the queue, the second is the maximum offset, the third is the average offset.

The results surprised me. The messages are not just shuffled; the shuffling has no apparent bound, that is, as the number of messages grows, the window you have to inspect grows as well. The same data as a graph:

Long polling

As I already wrote, Amazon SQS does not support subscriptions. You can use Amazon SNS for that, but it does not help if you need fast queues with multiple handlers. To avoid calling the receive method over and over, Amazon implemented Long Polling, which lets a receive call block for up to twenty seconds while waiting for messages. Since SQS is billed by the number of API calls, this should significantly reduce the cost of queues. There is a catch, though: with a small number of messages, the queue may (according to the official documentation) return nothing. This behavior is critical for queues in which you need to respond to an event quickly. Generally speaking, if it happens often, Long Polling makes little sense, since it becomes equivalent to periodic polling with the SQS reaction time.

To check this, we create two processes: one sends messages at random times, and the other sits in Long Polling. The moments of sending and receiving are saved for later comparison. To enable this mode, set Receive Message Wait Time = 20 seconds in the queue parameters.

    send_sleep(L, Queue) ->
        timer:sleep(random:uniform(10000)),
        Call = erlang:now(),
        erlcloud_sqs:send_message(Queue, random_string(6000 + random:uniform(6000))),
        [Call | L].

This function sleeps for a random number of milliseconds, then records the current moment and sends a message.

    %% Record the current moment only if something was received.
    remember_moment(L, []) -> L;
    remember_moment(L, [_ | _]) -> [erlang:now() | L].

    receive_polling(L, Queue) ->
        [{messages, List} | _] = erlcloud_sqs:receive_message(Queue),
        remove_list(Queue, List),
        remember_moment(L, List).

These two functions receive messages and record the moments at which that happened. After running both sides simultaneously with spawn, I get two lists; the difference between them shows the reaction time to a message. This does not account for the fact that messages can be shuffled; in general, shuffling simply adds to the measured reaction time.
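Comparing the two lists can be sketched like this. The module and function names are hypothetical, and the sketch assumes that the n-th recorded send corresponds to the n-th recorded receive, which, as noted, ignores shuffling. Both lists are expected in the form built above: erlang:now()-style timestamps, newest first.

```erlang
-module(reaction_demo).
-export([reaction_times/2, summary/1]).

%% Pair send and receive timestamps by position, oldest first, and
%% return the delays in microseconds. Extra elements in the longer
%% list are ignored.
reaction_times(Sent, Received) ->
    pair(lists:reverse(Sent), lists:reverse(Received)).

pair([S | Ss], [R | Rs]) -> [timer:now_diff(R, S) | pair(Ss, Rs)];
pair(_, _) -> [].

%% {Min, Max, Average} of a non-empty list of delays.
summary([_ | _] = Delays) ->
    {lists:min(Delays), lists:max(Delays), lists:sum(Delays) / length(Delays)}.
```

Dividing the values returned by summary/1 by 1000000 gives the seconds shown in the table that follows.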

Let's see what happened:

Sleep interval (ms)	10000	7500	5000	2500
Minimum time (sec)	0.27	0.28	0.27	0.66
Maximum time (sec)	10.25	7.8	5.36	5.53
Average time (sec)	1.87	1.87	1.84	1.88

The first line is the value set as the maximum delay for the sending process, that is, 10 seconds, 7.5 seconds, and so on. The remaining lines are the minimum, maximum and average waiting time for receiving a message.

The same in the form of a graph:

The average time turned out to be the same in all cases: on average, about two seconds pass between sending such a single message and receiving it. That is quite long. The sample in this test was rather small, 20 messages, so the minimum and maximum values are more a matter of luck than evidence of any dependency.

Batch sending

To begin with, let's check how significant the effect of "warming up" the queue is when sending messages:

Number of records	20	50	100	150	200	250	300	400	500	600	700	800	900	1000
Minimum time (sec)	0.1	0.1	0.1	0.09	0.09	0.09	0.09	0.1	0.09	0.1	0.1	0.09	0.09	0.09
Maximum time (sec)	0.19	0.37	0.41	0.41	0.37	0.38	0.37	0.43	0.39	0.66	0.74	0.48	0.53	0.77
Average time (sec)	0.12	0.12	0.12	0.12	0.12	0.12	0.12	0.12	0.12	0.12	0.12	0.12	0.12	0.12

The same in the form of a graph:

We can say that no warm-up is observed: the queue behaves about the same across these data volumes. Only the maximum rises for some reason, while the average and minimum stay where they are.

The same for read and delete:

Number of records	20	50	100	150	200	250	300	400	500	600	700	800	900	1000
Minimum time (sec)	0.001	0.14	0	0.135	0	0.135	0	0	0	0	0	0	0	0
Maximum time (sec)	0.72	0.47	0.65	0.65	0.69	0.51	0.75	0.75	0.76	0.73	0.82	0.79	0.74	0.91
Average time (sec)	0.23	0.21	0.21	0.21	0.21	0.21	0.21	0.21	0.21	0.2	0.2	0.2	0.2	0.21

There is no saturation here either; the average is around 200 ms. Sometimes a read completed instantly (in under 1 ms), but that means no message was received. According to the documentation, the SQS server may do this, and you just need to request the message again.

Let's move on to batch and multi-threaded testing.

Unfortunately, the erlcloud library has no functions for sending messages in batches, but they are not hard to implement on top of the existing ones. In the message sending function, change the request to the following:

 Doc = sqs_xml_request(Config, QueueName, "SendMessageBatch", encode_message_list(Messages, 1)),

and add the query generation function:

    encode_message_list([], _) -> [];
    encode_message_list([H | T], N) ->
        MessageId = string:concat("SendMessageBatchRequestEntry.", integer_to_list(N)),
        [{string:concat(MessageId, ".Id"), integer_to_list(N)},
         {string:concat(MessageId, ".MessageBody"), H}
         | encode_message_list(T, N + 1)].
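To see what this function produces, here is a self-contained copy of it (the module name is hypothetical, and ++ is used in place of string:concat, which is equivalent for plain strings) that can be tried without touching the library:

```erlang
-module(batch_encode_demo).
-export([encode_message_list/2]).

%% Turn a list of message bodies into numbered request parameters,
%% two per message: an entry Id and the message body.
encode_message_list([], _) -> [];
encode_message_list([H | T], N) ->
    Prefix = "SendMessageBatchRequestEntry." ++ integer_to_list(N),
    [{Prefix ++ ".Id", integer_to_list(N)},
     {Prefix ++ ".MessageBody", H}
     | encode_message_list(T, N + 1)].
```

Calling batch_encode_demo:encode_message_list(["first", "second"], 1) returns four pairs: {"SendMessageBatchRequestEntry.1.Id", "1"}, {"SendMessageBatchRequestEntry.1.MessageBody", "first"}, and the same two keys with index 2 for "second".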

You should also change the API version in the library, for example to 2011-10-01; otherwise Amazon will answer your requests with Bad Request.

The testing functions are similar to those used in the other tests:

    gen_messages(0) -> [];
    gen_messages(N) -> [random_string(5000 + random:uniform(1000)) | gen_messages(N - 1)].

    %% Send one batch of 10 messages and print the response.
    send_batch(N, Queue) ->
        erlang:display(erlcloud_sqs:send_message_batch(Queue, gen_messages(10))),
        N + 1.

Here I only had to change the length of the messages so that the whole batch fits into 64 KB; otherwise an exception is raised.
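If message sizes are not under your control, the batches can be formed by size instead. The following is a sketch with hypothetical names; it assumes ASCII message bodies (so the string length equals the size in bytes) and counts only the bodies, not the request overhead, so the byte limit passed in should leave some headroom.

```erlang
-module(batch_split).
-export([split/3]).

%% Split Messages (strings) into batches of at most MaxCount messages
%% and at most MaxBytes characters in total, preserving order.
%% A single message longer than MaxBytes still gets its own batch.
split(Messages, MaxCount, MaxBytes) when MaxCount > 0, MaxBytes > 0 ->
    split(Messages, MaxCount, MaxBytes, [], 0, []).

split([], _MaxCount, _MaxBytes, [], _Size, Done) ->
    lists:reverse(Done);
split([], _MaxCount, _MaxBytes, Cur, _Size, Done) ->
    lists:reverse([lists:reverse(Cur) | Done]);
split([M | Rest], MaxCount, MaxBytes, Cur, Size, Done) ->
    Len = length(M),
    Fits = length(Cur) < MaxCount andalso Size + Len =< MaxBytes,
    case Fits orelse Cur =:= [] of
        true ->
            split(Rest, MaxCount, MaxBytes, [M | Cur], Size + Len, Done);
        false ->
            %% Close the current batch and retry M with an empty one.
            split([M | Rest], MaxCount, MaxBytes, [], 0,
                  [lists:reverse(Cur) | Done])
    end.
```

For example, batch_split:split(["aaa", "bb", "c"], 2, 4) returns [["aaa"], ["bb", "c"]]. For the test above, the call would be something like batch_split:split(Messages, 10, 60000), each resulting batch being passed to send_message_batch.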

The following results were obtained for writing:

Number of threads	0	1	2	4	5	10	20	50	100
Maximum delay (sec)	0.452	0.761	0.858	1.464	1.698	3.14	5.272	11.793	20.215
Average delay (sec)	0.118	0.48	0.436	0.652	0.784	1.524	3.178	9.1	19.889
Time per message (sec)	0.118	0.048	0.022	0.017	0.016	0.016	0.017	0.019	0.02

Here 0 means one message at a time in 1 thread; 1 means batches of 10 in 1 thread, 2 means batches of 10 in 2 threads, 4 means batches of 10 in 4 threads, and so on.

For reading:

Number of threads	0	1	2	4	5	10	20	50	100
Maximum delay (sec)	0.762	2.998	2.511	2.4	2.606	2.751	4.944	11.653	18.517
Average delay (sec)	0.205	1.256	1.528	1.566	1.532	1.87	3.377	7.823	17.786
Time per message (sec)	0.205	0.126	0.077	0.04	0.031	0.02	0.019	0.017	0.019

Graph showing throughput for reading and writing (messages per second):

Blue - write, red - read.

From this data we can conclude that maximum throughput is reached at about 10 threads for writing and about 50 for reading; increasing the number of threads further does not increase the number of messages processed per unit of time.
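The article does not show the multi-threaded harness itself. One way such a run could be organized is sketched below; the names are hypothetical and this is not necessarily how the measurements above were taken. Each worker is a separate Erlang process, and the caller waits for all of them and measures the total wall-clock time.

```erlang
-module(parallel_demo).
-export([run_parallel/2]).

%% Run Fun(I) in Workers separate processes (I = 1..Workers) and
%% return {ElapsedMicros, Results}, with results in worker order.
run_parallel(Fun, Workers) when is_function(Fun, 1), Workers > 0 ->
    Parent = self(),
    Start = erlang:monotonic_time(microsecond),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, Fun(I)} end),
                Ref
            end || I <- lists:seq(1, Workers)],
    %% Selective receive on each reference keeps the results ordered.
    Results = [receive {Ref, Res} -> Res end || Ref <- Refs],
    Elapsed = erlang:monotonic_time(microsecond) - Start,
    {Elapsed, Results}.
```

With a worker function that sends a fixed number of batches, throughput in messages per second is then the total number of messages sent by all workers divided by the elapsed time in seconds.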

Conclusions

It turns out that Amazon SQS changes the order of messages significantly and has neither a very good reaction time nor very good throughput; only its reliability and its low cost (when the number of messages is small) can offset this. In other words, if speed is not critical for you, it does not matter that messages get shuffled, and you do not want to administer a queue server or hire an administrator for one, then this is your choice.