
The best way to upload files in Ruby is with Shrine. Part 1

This is the first part of a series of posts about Shrine. The purpose of this series is to show the advantages of Shrine over existing file upload libraries.


More than a year has passed since I started developing Shrine. During this time, Shrine has gained a lot of interesting functionality, the ecosystem has grown significantly, and a good number of developers have started using Shrine in production.

Before delving into the explanation of the benefits, you need to take a step back and consider in detail what motivated the development of Shrine in the first place.

In particular, I want to talk about the limitations of existing file upload libraries. I think it's important to be aware of these limitations so that you can make the choice that best fits your requirements.

Requirements




The requirements were as follows:
  1. Files must be uploaded directly to Amazon S3.
  2. File processing and deletion must be performed in the background.
  3. Processing can be performed either on upload or on-the-fly.
  4. Integration with Sequel.
  5. Ability to use with frameworks other than Rails.

In my opinion, the first two points are very important because they allow you to achieve an optimal user experience when working with forms. But the last two points should not be ignored either:

1. Uploading files directly to Amazon S3 or a similar service optimizes the upload process.
This has a number of advantages: reduced resource consumption on your servers, scaling that is encapsulated in the storage service, and compatibility with Heroku-style cloud platforms, which do not provide persistent disk storage and limit the request execution time.

2. Processing and deleting files in background tasks lets you work with files asynchronously. Regardless of whether you store files on the local file system or on external storage such as Amazon S3, this greatly improves the user experience. Background tasks are also necessary to maintain high throughput in your application, because workers will not be tied up by slow requests.

3. On-the-fly processing works well for small files, especially when several versions of a file are created, for example, different sizes of an image. On the other hand, processing on upload is necessary for large files, such as videos. Therefore, we need a library that can work with any type of file.

4. Support for ORMs other than ActiveRecord is also very important, since more functional and performant ORMs for Ruby have already appeared.

5. Finally, decent alternatives to Rails have appeared in the Ruby community, so the library needs to integrate easily with any web framework.
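
To make the second point more concrete, here is a minimal sketch of moving slow file work off the request path, using only Ruby's standard library. The `BackgroundWorker` class and the jobs are hypothetical illustrations; a real application would hand this work to a dedicated background job library rather than an in-process thread.

```ruby
# A tiny in-process job queue (hypothetical; stands in for a real job library).
class BackgroundWorker
  def initialize
    @queue  = Queue.new
    @thread = Thread.new do
      # Run jobs one by one; pop returns nil once the queue is closed and empty.
      while (job = @queue.pop)
        job.call
      end
    end
  end

  # Called from the request: returns immediately, the work happens later.
  def enqueue(&job)
    @queue << job
  end

  # Stop accepting jobs and wait for the pending ones to finish.
  def shutdown
    @queue.close
    @thread.join
  end
end

worker = BackgroundWorker.new
log    = Queue.new # thread-safe collector, used here only to observe the result

# The "request" only enqueues the slow work (here: deleting each file version).
%w[small medium large].each do |version|
  worker.enqueue { log << "deleted #{version}" }
end

worker.shutdown
deleted = Array.new(log.size) { log.pop }
```

The request that calls `enqueue` does not wait for any storage round trip, which is why the form submission stays fast no matter how many versions have to be processed or deleted.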

Now we will go through the existing libraries and consider their main drawbacks with regard to the requirements.

Paperclip




Simple management of file attachments for ActiveRecord

We can rule out Paperclip right away, because it depends heavily on ActiveRecord. However, since it is a very common library among ActiveRecord users, let's still go over the other requirements.

Direct upload


Paperclip does not support direct uploads. You can use aws-sdk to generate the URL and parameters for a direct upload to S3, and then set the model's attributes the same way Paperclip does when a file is uploaded through it.

However, Paperclip works with only one storage, so all direct uploads have to go straight to the main S3 storage. This is a security problem, since an attacker can upload files without ever attaching them, leaving you with many orphan files. It would be much easier if S3 could clean these files up for you.

Background tasks


For background tasks there is delayed_paperclip. However, delayed_paperclip starts the background task only after the file has been fully uploaded to the storage. This means that if you do not want to or cannot do direct uploads to S3, your users have to wait for the file to be uploaded twice (first to the application, then to the storage) before any background processing takes place. And that is very slow.

In addition, delayed_paperclip does not support deleting files in the background. This is a big drawback, because you have to perform an HTTP request for each version of the file (if you have several versions stored on S3). Do not expect this functionality to be added, as Paperclip also checks whether each version exists before deleting it. Of course, you can disable file deletion, but then you have a problem with orphan files.

Finally, delayed_paperclip is now tied to ActiveJob, which means it can no longer be used directly with background job libraries.

False positives in MIME type spoofing detection


Paperclip can detect whether someone is trying to spoof the MIME type of a file. However, this detection often produces false positives: it can raise a validation error even when the file extension matches the file contents. This is a serious problem, because false positives can be very annoying to users.

Of course, you can disable this functionality, but that makes the application vulnerable to file upload attacks.
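
To illustrate what such a check involves, here is a minimal sketch of spoofing detection that compares the file extension with the type detected from the file's leading bytes. This is not Paperclip's implementation; the tables and the method names `mime_type_from_content` and `spoofed?` are hypothetical, and only a handful of formats are covered.

```ruby
require "stringio"

# Known file signatures ("magic bytes") mapped to MIME types.
# Only a few formats are listed; a real detector knows many more.
MAGIC_BYTES = {
  "\x89PNG\r\n\x1A\n".b => "image/png",
  "\xFF\xD8\xFF".b      => "image/jpeg",
  "GIF8".b              => "image/gif",
  "%PDF-".b             => "application/pdf",
}.freeze

# MIME types expected for a given file extension.
EXTENSION_TYPES = {
  ".png" => "image/png", ".jpg" => "image/jpeg", ".jpeg" => "image/jpeg",
  ".gif" => "image/gif", ".pdf" => "application/pdf",
}.freeze

# Determines the MIME type from the file content, ignoring the filename.
def mime_type_from_content(io)
  header = io.read(8).to_s.b
  io.rewind
  MAGIC_BYTES.each { |magic, type| return type if header.start_with?(magic) }
  nil
end

# True when the content is recognized and contradicts the extension.
# Unrecognized content is not reported as spoofing, to limit false positives.
def spoofed?(io, filename)
  expected = EXTENSION_TYPES[File.extname(filename).downcase]
  actual   = mime_type_from_content(io)
  !expected.nil? && !actual.nil? && expected != actual
end
```

In this sketch, content that cannot be recognized is left to a separate validation, for example an allow-list applied to the detected type, rather than being reported as an attack.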

CarrierWave




Classier solution for file uploads for Rails, Sinatra and other Ruby web frameworks

CarrierWave was a response to Paperclip, which stored the configuration directly in the model. CarrierWave encapsulates it in uploader classes instead.

CarrierWave has a Sequel integration.

Unfortunately, for the carrierwave_backgrounder and carrierwave_direct extensions, a CarrierWave ORM integration is not enough. It takes a lot of additional ActiveRecord-specific code to make it all work.

Direct upload


As mentioned earlier, the CarrierWave ecosystem has a solution for direct uploads to S3: carrierwave_direct. It lets you create a form that uploads directly to S3, and then assign the S3 key of the uploaded file to your uploader.

<!-- Form submits to "https://my-bucket.s3-eu-west-1.amazonaws.com" -->
<%= direct_upload_form_for @photo.image do |f| %>
  <%= f.file_field :image %>
  <%= f.submit %>
<% end %>

However, what if you need multiple uploads directly to S3? The README notes that carrierwave_direct is intended for single uploads only. And what about a JSON API? This is just an ordinary form; all it does is generate the URL and parameters for uploading to S3. So why doesn't carrierwave_direct provide this information in JSON format?

But what if, instead of re-implementing all the logic for generating an S3 request on top of fog-aws, we simply relied on aws-sdk?

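# aws-sdk
require "aws-sdk-s3"   # with aws-sdk v2: require "aws-sdk"
require "securerandom"

s3 = Aws::S3::Resource.new(region: "eu-west-1") # region assumed from the form URL above
bucket = s3.bucket("my-bucket")
object = bucket.object(SecureRandom.hex) # random key for the uploaded file
presign = object.presigned_post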

<!-- HTML version -->
<form action="<%= presign.url %>" method="post" enctype="multipart/form-data">
  <input type="file" name="file">
  <% presign.fields.each do |name, value| %>
    <input type="hidden" name="<%= name %>" value="<%= value %>">
  <% end %>
  <input type="submit" value="Upload">
</form>

# JSON version
{ "url": presign.url, "fields": presign.fields }

This approach has the following advantages: it is not tied to Rails, it works with a JSON API, it supports multiple file uploads (the client can simply request this data for each file), and it is more reliable (since the parameters are now generated by the officially supported gem).
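
As an illustration of the JSON variant, here is a minimal sketch of a Rack-compatible endpoint that serves the upload URL and fields. The `PresignEndpoint` class is hypothetical; it only assumes that the injected callable returns an object responding to `url` and `fields`, as the result of aws-sdk's `presigned_post` does.

```ruby
require "json"

# A Rack-compatible endpoint that returns direct-upload parameters as JSON.
# `presigner` is any callable returning an object that responds to #url and
# #fields, e.g. -> { bucket.object(SecureRandom.hex).presigned_post }.
class PresignEndpoint
  def initialize(presigner)
    @presigner = presigner
  end

  def call(_env)
    presign = @presigner.call # a fresh presign (and key) for every request
    body    = JSON.generate("url" => presign.url, "fields" => presign.fields)
    [200, { "content-type" => "application/json" }, [body]]
  end
end
```

A client that wants to upload several files can call this endpoint once per file and then POST each file to the returned URL together with the returned fields.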

Background tasks


First, it is worth noting that carrierwave_direct provides instructions for setting up background processing. However, setting up background tasks correctly is quite challenging, so it makes sense to rely on a library that does this for you.

Which brings us to carrierwave_backgrounder. This library supports background processing, but in my experience it was unstable (1 and 2). In addition, it does not support deleting files in the background, which is a deal breaker when deleting multiple files.

Even if we overcome all of this, it is impossible to integrate carrierwave_backgrounder with carrierwave_direct. As I mentioned, I want to upload files directly to S3 and process and delete them in background tasks. But it seems that these two libraries are incompatible with each other, which means that I cannot achieve the desired performance with CarrierWave for my use cases.

Closing unresolved issues on GitHub


I understand that people are sometimes ungrateful to the maintainers of popular open-source libraries, and that we should be kinder and more respectful to each other. However, I cannot understand why the CarrierWave developers close unresolved issues.

One such closed issue is that CarrierWave runs processing before validation. This is a serious security hole, since an attacker can pass any file to the image processor: validation of file size, MIME type and dimensions is performed only after processing. This makes your application vulnerable to attacks such as ImageTragick, image bombs, or simply uploads of very large images.
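
The safer ordering can be sketched as follows. This is a hypothetical helper, not CarrierWave code; `validate_then_process`, `InvalidUpload` and the parameter names are made up for illustration, and the MIME type is assumed to have been determined from the file content beforehand.

```ruby
require "stringio"

class InvalidUpload < StandardError; end

# Validates cheap properties first and only then hands the file to the
# (expensive, potentially exploitable) processor.
def validate_then_process(io, mime_type:, max_size:, allowed_types:, processor:)
  raise InvalidUpload, "file is too large" if io.size > max_size
  raise InvalidUpload, "type not allowed" unless allowed_types.include?(mime_type)

  processor.call(io) # reached only for files that passed validation
end
```

With this ordering, an oversized or disallowed file is rejected before it ever reaches the image processor.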

Refile




Ruby file uploads, take 3

Refile was created by Jonas Nicklas, the author of CarrierWave, as a third attempt to improve file uploads in Ruby. Like Dragonfly, Refile was designed for on-the-fly processing. Having suffered from the complexity of CarrierWave, I found the simple and modern design of Refile really promising, so I began contributing to it and was eventually invited to join the core team.

Refile.attachment_url(@photo, :image, :fit, 400, 500) # resize to 400x500
#=> "/attachments/15058dc712/store/fit/400/500/ed3153b9cb"

Some of the new ideas in Refile include temporary and permanent storage as first-class concepts, a clean storage abstraction, an IO abstraction, a clean internal design (no god objects), and direct uploads out of the box. Thanks to Refile's clean design, creating a Sequel integration was pretty simple.

Direct upload


Refile is the first file upload library that comes with native support for direct uploads, allowing you to upload the attached file asynchronously as soon as the user selects it. You can upload the file either to a Rack endpoint or directly to S3, with Refile generating the S3 request parameters. There is even a JavaScript library that does everything for you.

 <%= form.attachment_field :image, presigned: true %> 

There is also a great performance boost. When you upload a file directly to S3, you upload it to a directory of the bucket that is marked as "temporary". Then, once validation has passed and the record is saved, the uploaded file is moved to the permanent storage. If both the temporary and the permanent storage are on S3, then instead of re-uploading the file, Refile simply issues an S3 COPY request.

Needless to say, my requirements for direct uploads were met.
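
The idea of promoting a file from temporary to permanent storage can be sketched with in-memory stand-ins. This is not Refile's code; `MemoryStorage` and `promote` are hypothetical names, and `copy_from` only plays the role that an S3 COPY request would play for two locations in the same service.

```ruby
require "stringio"

# Hypothetical in-memory storage; `service` is a Hash standing in for the
# backing service (e.g. one S3 bucket), `prefix` is the directory inside it.
class MemoryStorage
  attr_reader :service, :prefix, :uploads, :copies

  def initialize(service, prefix)
    @service, @prefix = service, prefix
    @uploads = 0 # full transfers through the application
    @copies  = 0 # server-side copies, analogous to S3 COPY
  end

  def upload(io, id)
    @uploads += 1
    service["#{prefix}/#{id}"] = io.read
  end

  def open(id)
    StringIO.new(service.fetch("#{prefix}/#{id}"))
  end

  # Copies within the same service without moving the bytes through the app.
  def copy_from(other, id)
    @copies += 1
    service["#{prefix}/#{id}"] = other.service.fetch("#{other.prefix}/#{id}")
  end

  def delete(id)
    service.delete("#{prefix}/#{id}")
  end
end

# Moves a validated file from temporary to permanent storage.
def promote(id, from:, to:)
  if from.service.equal?(to.service)
    to.copy_from(from, id)       # cheap: both locations are in the same service
  else
    to.upload(from.open(id), id) # fallback: read the file and re-upload it
  end
  from.delete(id)
end
```

When both storages share a service, the file content never travels through the application again, which is where the saving comes from.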

Background tasks


One of the limitations of Refile is the lack of support for background tasks. You might think that since Refile does its processing on-the-fly and has the S3 COPY optimization, background tasks are not needed here.

However, the S3 COPY request is still an HTTP request and affects the duration of the form submission. In addition, the speed of an S3 COPY request depends on the file size, so the larger the file, the slower the request.

Moreover, Amazon S3 is just one of many cloud storage services. You may want to use another service that suits your needs better but does not have this optimization or does not even support direct uploads.

On-the-fly processing


I think on-the-fly processing is great for images that are stored locally and processed quickly. However, if you store the originals on S3, Refile will serve the initial request for a version much more slowly, since it first has to download the original from S3. In this case, you need to think about adding background tasks that pre-process all versions.

If you upload larger files, such as videos, it is usually best to process them on upload rather than on-the-fly. But Refile currently does not support this.

Dragonfly




A Ruby gem for on-the-fly processing - suitable for image uploading in Rails, Sinatra

Dragonfly is another on-the-fly processing solution. It has been around for much longer than Refile and, in my opinion, has much more advanced and flexible on-the-fly processing options.

Dragonfly does not work with Sequel, as one might expect. I would even be willing to write an adapter, but the general model-related behavior seems to be mixed with behavior specific to ActiveRecord models, so it is not clear how to do this.

There is also no support for background tasks or direct uploads. You can do the latter manually, but this has the same drawbacks as with Paperclip.

There is one more important point. Serving files through an image server (a Dragonfly application for on-the-fly processing) is a completely separate responsibility. I mean that you can use another file upload library that comes with everything (direct uploads, background tasks, various ORMs, etc.) to upload files to the storage, and still use Dragonfly to serve these files.

map "/attachments" do
  run Dragonfly.app # doesn't care how the files were uploaded
end

Attache




Yet another approach to file upload

Attache is a relatively new library that supports on-the-fly processing. The difference from Dragonfly and Refile is that Attache was designed to run as a separate service, so files are uploaded and served through the Attache server.

Attache has an ActiveRecord integration for linking uploaded files to database records and supports direct uploads. However, it still lacks the ability to back up and delete files in background tasks. In addition, Attache is not flexible enough.

Note that, like Dragonfly, Attache does not need to be integrated with the model; you can use Shrine for this. This year I attended RedDotRubyConf in Singapore, where I met the author of Attache. After a very interesting discussion about the problems of file uploads, we agreed that it would be useful to use Shrine for the file attachment logic and simply plug in Attache as a backend.

Thus, Attache can still do what it does best - distribute files, but delegate work with attachments to Shrine.

Finally


Support for direct uploads, managing files in the background, processing on upload or on-the-fly, and the ability to use other ORMs is what I really expect from a file upload library. However, none of the existing libraries met all of these requirements.

Therefore, I decided to create a new library, Shrine, building on what I learned from the existing libraries.

The goal of Shrine is not to be clumsy, but to provide the functionality and flexibility that let you optimize for various file upload scenarios.

This is an ambitious goal, but after a year of active development and research, I feel that I have achieved it. At the very least, Shrine has more features than any other Ruby file upload library. In the rest of this series, I will introduce you to all the cool features you can use with Shrine, so stay tuned!


Original: Better File Uploads with Shrine: Motivation
The rest of the articles from the series in the author's blog:

Source: https://habr.com/ru/post/328558/

