
Multi-threaded file upload to S3

This article shows a practical application of multi-threading in Ruby and the performance gains it brings, using the example of uploading files to Amazon S3 storage with the aws-sdk gem.

Start simple


Implementing a file upload with the official (i.e. Amazon's) aws-sdk gem is quite simple. Leaving aside the preparation of the Amazon Web Services (AWS) authorization parameters, the code takes three lines:
def upload_to_s3(config, src, dst)
  # Create an S3 interface object using the credentials from config
  s3 = AWS::S3.new(
    :access_key_id     => config['access_key_id'],
    :secret_access_key => config['secret_access_key'])
  # The object that will be stored under the key 'dst' in S3 (nothing is uploaded yet)
  s3_file = s3.buckets[config['bucket_name']].objects[dst]
  # Upload the local file src in a single request (blocks until the upload finishes)
  s3_file.write(:file => src, :acl => :public_read)
end
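
A minimal usage sketch (not part of the article's code): the config keys match the ones read above, while the YAML file name, the local file and the S3 key are made-up examples:

require 'aws-sdk'   # aws-sdk v1, which provides the AWS::S3 class used in this article
require 'yaml'

# Hypothetical YAML file containing access_key_id, secret_access_key and bucket_name
config = YAML.load_file('s3.yml')

# Upload a local file and store it in the bucket under the key 'backups/video.mp4'
upload_to_s3(config, 'video.mp4', 'backups/video.mp4')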

The method works and shows an average upload speed of up to 5 Mb/s on files larger than 5 MB (on smaller files the average speed drops).
However, experiments quickly show that the total speed of uploading several files simultaneously is higher than the speed of uploading one file at a time. It seems there is some kind of bandwidth limit per connection. Let's try to parallelize the upload using multi-threading and the multipart_upload method, and see what happens.

Uploading a file in parts


To upload a file in parts, aws-sdk provides the multipart_upload method. Let's look at its typical use:
def multipart_upload_to_s3(src, dst, config)
  s3 = AWS::S3.new(
    :access_key_id     => config['access_key_id'],
    :secret_access_key => config['secret_access_key'])
  s3_file = s3.buckets[config['bucket_name']].objects[dst]
  # Open the source file src for binary reading
  src_io = File.open(src, 'rb')
  # Counters: bytes read so far and number of parts created
  read_size = 0
  parts = 0
  # Total size of the source file
  src_size = File.size(src)
  # Upload the file part by part
  s3_file = s3_file.multipart_upload({:acl => :public_read}) do |upload|
    while read_size < src_size
      # Read the next part, at most config['part_size'] bytes
      buff = src_io.readpartial(config['part_size'])
      # Update the counters
      read_size += buff.size
      part_number = parts += 1
      # Send the part to S3
      upload.add_part :data => buff, :part_number => part_number
    end
  end
  # Close the source file
  src_io.close
  s3_file
end

It looks more complicated than the simple upload. However, the new method has two important properties: each part is sent to S3 in its own, independent request, and parts can be uploaded in any order, since every part carries its own part number.

These two properties allow us to parallelize the upload across multiple threads.

Multi-threaded upload of file parts


There are several possible ways to parallelize the upload: for example, starting a separate thread for every part, or splitting the parts between a fixed number of threads in advance. From a practical point of view, a mixed approach is the most convenient: a fixed pool of threads, where each thread takes the next unread part as soon as it is free. This keeps the number of threads bounded regardless of file size, and keeps memory usage bounded, since each thread holds at most one part in memory at a time.

Here is the code for the multi-threaded file upload:
def threaded_upload_to_s3(src, dst, config)
  s3 = AWS::S3.new(
    :access_key_id     => config['access_key_id'],
    :secret_access_key => config['secret_access_key'])
  s3_file = s3.buckets[config['bucket_name']].objects[dst]
  src_io = File.open(src, 'rb')
  read_size = 0
  parts = 0
  src_size = File.size(src)
  s3_file = s3_file.multipart_upload({:acl => :public_read}) do |upload|
    # Array that will hold the worker threads
    upload_threads = []
    # Binary semaphore (mutex) protecting the shared state from races
    mutex = Mutex.new
    # No point in starting more threads than there are parts
    max_threads = [config['threads_count'], (src_size.to_f / config['part_size']).ceil].min
    # Start the worker threads
    max_threads.times do
      upload_threads << (Thread.new do
        # Each thread keeps taking parts until the whole file has been read
        while true
          # Only one thread at a time may read from the file and update the counters
          mutex.lock
          # Stop if the whole file has already been read
          unless read_size < src_size
            mutex.unlock
            break
          end
          # Read the next part, at most config['part_size'] bytes
          buff = src_io.readpartial(config['part_size'])
          # Update the shared counters
          read_size += buff.size
          part_number = parts += 1
          mutex.unlock
          # Send the part to S3 outside the critical section, so uploads run in parallel
          upload.add_part :data => buff, :part_number => part_number
        end
      end)
    end
    # Wait for all threads to finish
    upload_threads.each { |thread| thread.join }
  end
  src_io.close
  s3_file
end
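
Continuing the hypothetical usage sketch from above, the extra config keys this method needs are 'part_size' and 'threads_count' (the numbers here are examples, not the values used in the tests below):

config = YAML.load_file('s3.yml')             # same hypothetical config file as before
config['part_size']     = 5 * 1024 * 1024     # 5 MB per part (S3 requires at least 5 MB for every part except the last)
config['threads_count'] = 5                   # example value

threaded_upload_to_s3('video.mp4', 'backups/video.mp4', config)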

To create threads, we use the standard Thread class (the basics of multithreading in Ruby are described in the article habrahabr.ru/post/94574). To implement mutual exclusion, we use the simplest binary semaphores (mutexes), provided by the standard Mutex class.
What do we need semaphores for? They mark a section of code (a critical section) that only one thread may execute at a time. The remaining threads have to wait until the thread holding the semaphore leaves the critical section. Semaphores are usually used to ensure correct access to shared resources. In our case, the shared resources are the input object src_io and the variables read_size and parts. The variables buff and part_number are local to each thread (i.e., to the Thread.new do ... end block) and are therefore not shared.
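
As a side note, the same kind of critical section can also be written with Mutex#synchronize, which takes the lock, runs the block and releases the lock even if an exception is raised. A minimal illustration (not the article's code):

require 'thread'

mutex   = Mutex.new
counter = 0

threads = 4.times.map do
  Thread.new do
    1000.times do
      # Only one thread at a time increments the shared counter
      mutex.synchronize { counter += 1 }
    end
  end
end

threads.each(&:join)
puts counter  # => 4000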

For more information about semaphores and multithreading in Ruby, see www.tutorialspoint.com/ruby/ruby_multithreading.htm

Comparison results


We measure the upload speed of the different methods on several test files ranging from 64 KB to 150 MB and tabulate the results; the figures in parentheses show the speed relative to the simple upload:
#  | File size | Parts | Threads | upload, Mb/s | multipart_upload, Mb/s | threaded_upload, Mb/s
1  | 64 KB     | 1     | 1       | 0.78         | 0.29 (37%)             | 0.29 (37%)
2  | 512 KB    | 1     | 1       | 2.88         | 1.84 (64%)             | 1.65 (57%)
3  | 1 MB      | 1     | 1       | 3.39         | 2.38 (70%)             | 2.58 (76%)
4  | 10 MB     | 2     | 2       | 5.06         | 4.50 (89%)             | 7.69 (152%)
5  | 50 MB     | 10    | 5       | 4.48         | 4.41 (98%)             | 9.02 (201%)
6  | 50 MB     | 10    | 10      | 4.33         | 4.44 (103%)            | 8.49 (196%)
7  | 150 MB    | 30    | 5       | 4.34         | 4.43 (102%)            | 9.22 (212%)
8  | 150 MB    | 30    | 10      | 4.48         | 4.52 (101%)            | 8.90 (199%)


Testing was performed by uploading files from a machine in the EU West (Ireland) region to S3 storage in the same region. A series of 10 consecutive runs was made for each file (a rough sketch of timing one run is shown below).
Judging by the spread of the simple-upload results across tests 4–8, the measurement error is about 8%, which is quite acceptable.
Upload in parts (multipart_upload) performed worse than the simple upload on small files and about the same on large ones.
Multi-threaded upload (threaded_upload) was just as efficient as the plain multipart upload on single-part files (which is to be expected), but on large files it gives a significant advantage: up to twice the speed of the simple upload.
Finding the optimal part size and number of threads was not a goal here, but increasing the number of threads from 5 to 10 on large files had no noticeable effect.
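
The article does not show its measurement harness; a minimal sketch of how one run can be timed, assuming the methods defined above and a hypothetical test file and config:

require 'benchmark'
require 'yaml'

config = YAML.load_file('s3.yml')                # same hypothetical config as above,
config['part_size']     ||= 5 * 1024 * 1024      # with example part size
config['threads_count'] ||= 5                    # and thread count

src     = 'test_150mb.bin'                       # hypothetical test file
size_mb = File.size(src) / (1024.0 * 1024.0)

seconds = Benchmark.realtime do
  threaded_upload_to_s3(src, "tests/#{File.basename(src)}", config)
end

puts format('%.2f Mb/s', size_mb / seconds)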

Conclusion


Multi-threaded upload proved more effective than the simple upload on files consisting of more than one part, with a speed increase of up to two times.
Incidentally, it would be convenient to have a method that itself selects the most appropriate upload strategy depending on the file size.
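
Such a method is not part of the article's code; a possible sketch, reusing the methods defined above (the name smart_upload_to_s3 is made up):

# Hypothetical wrapper: files that fit into a single part go through the
# simple upload, larger files through the multi-threaded multipart upload.
def smart_upload_to_s3(src, dst, config)
  if File.size(src) <= config['part_size']
    upload_to_s3(config, src, dst)
  else
    threaded_upload_to_s3(src, dst, config)
  end
end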
The source code provided in the examples is available on Github: github.com/whisk/s3up

Source: https://habr.com/ru/post/140940/

