
Multi-threaded file upload to S3

This article shows a practical application of multi-threading in Ruby and the performance gains it brings, using the example of uploading files to Amazon S3 storage with the aws-sdk gem.

Start simple


Implementing a file upload with the official (i.e. Amazon's) aws-sdk gem is quite simple. Leaving aside the preparation of the Amazon Web Services (AWS) authorization parameters, the code takes three lines:
def upload_to_s3(config, src, dst)
  # Create an S3 interface object using the credentials from config
  s3 = AWS::S3.new(
    :access_key_id     => config['access_key_id'],
    :secret_access_key => config['secret_access_key'])
  # The object that will be stored under the key 'dst' in S3 (nothing is uploaded yet)
  s3_file = s3.buckets[config['bucket_name']].objects[dst]
  # Upload the local file src in a single request (blocks until the upload finishes)
  s3_file.write(:file => src, :acl => :public_read)
end
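
A minimal usage sketch (not part of the article's code): the config keys match the ones read above, while the YAML file name, the local file and the S3 key are made-up examples:

require 'aws-sdk'   # aws-sdk v1, which provides the AWS::S3 class used in this article
require 'yaml'

# Hypothetical YAML file containing access_key_id, secret_access_key and bucket_name
config = YAML.load_file('s3.yml')

# Upload a local file and store it in the bucket under the key 'backups/video.mp4'
upload_to_s3(config, 'video.mp4', 'backups/video.mp4')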

The method works and shows an average upload speed of up to 5 Mb/s on files larger than 5 MB (on smaller files the average speed drops).
However, experiments quickly show that the total speed of uploading several files simultaneously is higher than the speed of uploading one file at a time. It seems there is some kind of bandwidth limit per connection. Let's try to parallelize the upload using multi-threading and the multipart_upload method, and see what happens.

Uploading a file in parts


To upload a file in parts, aws-sdk provides the multipart_upload method. Let's look at its typical use:
def multipart_upload_to_s3(src, dst, config)
  s3 = AWS::S3.new(
    :access_key_id     => config['access_key_id'],
    :secret_access_key => config['secret_access_key'])
  s3_file = s3.buckets[config['bucket_name']].objects[dst]
  # Open the source file src for binary reading
  src_io = File.open(src, 'rb')
  # Counters: bytes read so far and number of parts created
  read_size = 0
  parts = 0
  # Total size of the source file
  src_size = File.size(src)
  # Upload the file part by part
  s3_file = s3_file.multipart_upload({:acl => :public_read}) do |upload|
    while read_size < src_size
      # Read the next part, at most config['part_size'] bytes
      buff = src_io.readpartial(config['part_size'])
      # Update the counters
      read_size += buff.size
      part_number = parts += 1
      # Send the part to S3
      upload.add_part :data => buff, :part_number => part_number
    end
  end
  # Close the source file
  src_io.close
  s3_file
end

It looks more complicated than the simple upload. However, the new method has two important properties: each part is sent to S3 in its own, independent request, and parts can be uploaded in any order, since every part carries its own part number.

These two properties allow us to parallelize the upload across multiple threads.

Multi-threaded upload of file parts


There are several possible ways to parallelize the upload: for example, starting a separate thread for every part, or splitting the parts between a fixed number of threads in advance. From a practical point of view, a mixed approach is the most convenient: a fixed pool of threads, where each thread takes the next unread part as soon as it is free. This keeps the number of threads bounded regardless of file size, and keeps memory usage bounded, since each thread holds at most one part in memory at a time.

Here is the code for the multi-threaded file upload:
def threaded_upload_to_s3(src, dst, config)
  s3 = AWS::S3.new(
    :access_key_id     => config['access_key_id'],
    :secret_access_key => config['secret_access_key'])
  s3_file = s3.buckets[config['bucket_name']].objects[dst]
  src_io = File.open(src, 'rb')
  read_size = 0
  parts = 0
  src_size = File.size(src)
  s3_file = s3_file.multipart_upload({:acl => :public_read}) do |upload|
    # Array that will hold the worker threads
    upload_threads = []
    # Binary semaphore (mutex) protecting the shared state from races
    mutex = Mutex.new
    # No point in starting more threads than there are parts
    max_threads = [config['threads_count'], (src_size.to_f / config['part_size']).ceil].min
    # Start the worker threads
    max_threads.times do
      upload_threads << (Thread.new do
        # Each thread keeps taking parts until the whole file has been read
        while true
          # Only one thread at a time may read from the file and update the counters
          mutex.lock
          # Stop if the whole file has already been read
          unless read_size < src_size
            mutex.unlock
            break
          end
          # Read the next part, at most config['part_size'] bytes
          buff = src_io.readpartial(config['part_size'])
          # Update the shared counters
          read_size += buff.size
          part_number = parts += 1
          mutex.unlock
          # Send the part to S3 outside the critical section, so uploads run in parallel
          upload.add_part :data => buff, :part_number => part_number
        end
      end)
    end
    # Wait for all threads to finish
    upload_threads.each { |thread| thread.join }
  end
  src_io.close
  s3_file
end
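
Continuing the hypothetical usage sketch from above, the extra config keys this method needs are 'part_size' and 'threads_count' (the numbers here are examples, not the values used in the tests below):

config = YAML.load_file('s3.yml')             # same hypothetical config file as before
config['part_size']     = 5 * 1024 * 1024     # 5 MB per part (S3 requires at least 5 MB for every part except the last)
config['threads_count'] = 5                   # example value

threaded_upload_to_s3('video.mp4', 'backups/video.mp4', config)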

To create threads, we use the standard Thread class (the basics of multithreading in Ruby are described in the article habrahabr.ru/post/94574). To implement mutual exclusion, we use the simplest binary semaphores (mutexes), provided by the standard Mutex class.
What do we need semaphores for? They mark a section of code (a critical section) that only one thread may execute at a time. The remaining threads have to wait until the thread holding the semaphore leaves the critical section. Semaphores are usually used to ensure correct access to shared resources. In our case, the shared resources are the input object src_io and the variables read_size and parts. The variables buff and part_number are local to each thread (i.e., to the Thread.new do ... end block) and are therefore not shared.
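
As a side note, the same kind of critical section can also be written with Mutex#synchronize, which takes the lock, runs the block and releases the lock even if an exception is raised. A minimal illustration (not the article's code):

require 'thread'

mutex   = Mutex.new
counter = 0

threads = 4.times.map do
  Thread.new do
    1000.times do
      # Only one thread at a time increments the shared counter
      mutex.synchronize { counter += 1 }
    end
  end
end

threads.each(&:join)
puts counter  # => 4000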

For more information about semaphores and multithreading in Ruby, see www.tutorialspoint.com/ruby/ruby_multithreading.htm

Comparison results


We measure the upload speed of the different methods on several test files ranging from 64 KB to 150 MB and tabulate the results; the figures in parentheses show the speed relative to the simple upload:
#  | File size | Parts | Threads | upload, Mb/s | multipart_upload, Mb/s | threaded_upload, Mb/s
1  | 64 KB     | 1     | 1       | 0.78         | 0.29 (37%)             | 0.29 (37%)
2  | 512 KB    | 1     | 1       | 2.88         | 1.84 (64%)             | 1.65 (57%)
3  | 1 MB      | 1     | 1       | 3.39         | 2.38 (70%)             | 2.58 (76%)
4  | 10 MB     | 2     | 2       | 5.06         | 4.50 (89%)             | 7.69 (152%)
5  | 50 MB     | 10    | 5       | 4.48         | 4.41 (98%)             | 9.02 (201%)
6  | 50 MB     | 10    | 10      | 4.33         | 4.44 (103%)            | 8.49 (196%)
7  | 150 MB    | 30    | 5       | 4.34         | 4.43 (102%)            | 9.22 (212%)
8  | 150 MB    | 30    | 10      | 4.48         | 4.52 (101%)            | 8.90 (199%)


Testing was performed by uploading files from a machine in the EU West (Ireland) region to S3 storage in the same region. A series of 10 consecutive runs was made for each file (a rough sketch of timing one run is shown below).
Judging by the spread of the simple-upload results across tests 4–8, the measurement error is about 8%, which is quite acceptable.
Upload in parts (multipart_upload) performed worse than the simple upload on small files and about the same on large ones.
Multi-threaded upload (threaded_upload) was just as efficient as the plain multipart upload on single-part files (which is to be expected), but on large files it gives a significant advantage: up to twice the speed of the simple upload.
Finding the optimal part size and number of threads was not a goal here, but increasing the number of threads from 5 to 10 on large files had no noticeable effect.
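
The article does not show its measurement harness; a minimal sketch of how one run can be timed, assuming the methods defined above and a hypothetical test file and config:

require 'benchmark'
require 'yaml'

config = YAML.load_file('s3.yml')                # same hypothetical config as above,
config['part_size']     ||= 5 * 1024 * 1024      # with example part size
config['threads_count'] ||= 5                    # and thread count

src     = 'test_150mb.bin'                       # hypothetical test file
size_mb = File.size(src) / (1024.0 * 1024.0)

seconds = Benchmark.realtime do
  threaded_upload_to_s3(src, "tests/#{File.basename(src)}", config)
end

puts format('%.2f Mb/s', size_mb / seconds)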

Conclusion


Multi-threaded upload proved more effective than the simple upload on files consisting of more than one part, with a speed increase of up to two times.
Incidentally, it would be convenient to have a method that itself selects the most appropriate upload strategy depending on the file size.
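
Such a method is not part of the article's code; a possible sketch, reusing the methods defined above (the name smart_upload_to_s3 is made up):

# Hypothetical wrapper: files that fit into a single part go through the
# simple upload, larger files through the multi-threaded multipart upload.
def smart_upload_to_s3(src, dst, config)
  if File.size(src) <= config['part_size']
    upload_to_s3(config, src, dst)
  else
    threaded_upload_to_s3(src, dst, config)
  end
end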
The source code provided in the examples is available on Github: github.com/whisk/s3up

Source: https://habr.com/ru/post/140940/

