Many Ruby developers ignore threads, even though they are a very useful tool. In this article we will look at creating IO threads in Ruby and see how Ruby copes with threads that perform a lot of computation. We will also try alternative Ruby implementations, and find out what results can be achieved with the DRb module. At the end of the article we will see how these principles are applied in various servers for Ruby on Rails applications.
IO threads in Ruby
Consider a small example:
def call_remote(host)
  sleep 3 # imitate a long network request
end

If we need to access two servers, for example to clear their caches, and we call this function twice in a row:

call_remote 'host1/clear_caches'
call_remote 'host2/clear_caches'

then our program will run for 6 seconds.
We can speed up the execution of the program if we use threads, for example, like this:
threads = []

['host1', 'host2'].each do |host|
  threads << Thread.new do
    call_remote "#{host}/clear_caches"
  end
end

threads.each(&:join)
We created two threads, each of which calls its own server, and the #join calls tell the main program (the main thread) to wait for them to complete. Now our program runs twice as fast, in 3 seconds.
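The overlap is easy to verify with a self-contained sketch. Here sleep stands in for the network call, and the 0.2-second delay is an arbitrary stand-in chosen so the example runs quickly:

```ruby
require 'benchmark'

# sleep stands in for a slow network request
fake_io = ->(host) { sleep 0.2 }

elapsed = Benchmark.realtime do
  threads = ['host1', 'host2'].map do |host|
    Thread.new { fake_io.call(host) }
  end
  threads.each(&:join)
end

# the two sleeps overlap, so elapsed stays close to 0.2s rather than 0.4s
puts format('%.2f seconds', elapsed)
```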
More threads, good and different
Consider a more complex example, in which we will try to fetch all the closed issues and bugs of the Jekyll project from GitHub through the provided API. Since we do not want to mount a DoS attack on GitHub, we need to limit the number of simultaneous threads, schedule them, launch them, and collect the results as they become available.
The standard Ruby library does not provide ready-made tools for solving such problems, so I implemented my own FutureProof library for creating thread pools in Ruby, which I want to tell you more about.
Its principle is simple: you create a new pool, specifying the maximum number of simultaneous threads:

thread_pool = FutureProof::ThreadPool.new(5)

add tasks to it:

thread_pool.submit 2, 5 do |a, b|
  a + b
end

and ask for their values:

thread_pool.values
So, to get the information we need about the
Jekyll project, the following code will suffice:
require 'future_proof'
require 'net/http'

thread_pool = FutureProof::ThreadPool.new(5)

10.times do |i|
  thread_pool.submit i do |page|
    uri = URI.parse(
      "https://api.github.com/repos/mojombo/jekyll/issues?state=close&page=#{page + 1}&per_page=100.json"
    )
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.request(Net::HTTP::Get.new(uri.request_uri)).body
  end
end

thread_pool.perform
puts thread_pool.values[3]
The implementation of the FutureProof library is based on the Queue class, which provides queues that are safe to use from multiple threads: it guarantees that several threads do not write to the queue on top of each other, and do not read the same value at the same time.
The library also handles exceptions: if one is raised during a thread's execution, thread_pool will still be able to return the array of collected values, and will re-raise the exception only when the programmer tries to access that particular element of the array directly.
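To illustrate the design, here is a minimal, hypothetical pool built on Queue in the same spirit. This is only a sketch of the technique, not the actual FutureProof code: workers pull jobs from one thread-safe queue and push indexed results onto another, so the values come back in submission order.

```ruby
# A minimal Queue-based thread pool in the spirit of FutureProof.
# Illustrative sketch only, not the library's actual implementation.
class MiniPool
  def initialize(size)
    @tasks   = Queue.new   # thread-safe: only one worker pops each job
    @results = Queue.new
    @count   = 0
    @workers = Array.new(size) do
      Thread.new do
        # a nil job is a "poison pill" telling the worker to stop
        while (job = @tasks.pop)
          index, block = job
          @results << [index, block.call]
        end
      end
    end
  end

  def submit(&block)
    @tasks << [@count, block]
    @count += 1
  end

  def values
    @workers.size.times { @tasks << nil }  # stop every worker
    @workers.each(&:join)
    out = Array.new(@count)
    # results arrive in completion order; indices restore submission order
    until @results.empty?
      index, value = @results.pop
      out[index] = value
    end
    out
  end
end

pool = MiniPool.new(3)
5.times { |i| pool.submit { i * i } }
p pool.values  # => [0, 1, 4, 9, 16]
```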
The thread-pool implementation is an attempt to bring Ruby's thread facilities closer to those of Java and java.util.concurrent, which partly served as the inspiration.
With the FutureProof library, tasks involving IO threads can be performed much more conveniently and efficiently. The library supports Ruby 1.9.3, 2.0, and Rubinius.
Threads and computational operations
Encouraged by this successful use of threads to improve performance, let's run two benchmarks: in one we compute the factorial of 1000 twice in a row, and in the other we run the two computations in parallel.
require 'benchmark'

factorial = Proc.new { |n| 1.upto(n).inject(1) { |i, n| i * n } }

Benchmark.bm do |x|
  x.report('sequential') do
    10_000.times do
      2.times do
        factorial.call 1000
      end
    end
  end

  x.report('thready') do
    10_000.times do
      threads = []
      2.times do
        threads << Thread.new do
          factorial.call 1000
        end
      end
      threads.each &:join
    end
  end
end
The result (on Ruby 2.0) is rather unexpected: the parallel version ran about a second longer:

                  user     system      total        real
sequential   24.130000   1.510000  25.640000 ( 25.696196)
thready      24.600000   2.420000  27.020000 ( 26.877708)

There are two reasons. First, we complicated the code with thread scheduling; second, at any given moment Ruby used only one core to execute this program. Unfortunately, there is currently no way to force Ruby to use several cores for a single ruby process.
For comparison, here are the results of the same script on jRuby 1.7.4:

                  user     system      total        real
sequential   33.180000   0.690000  33.870000 ( 33.090000)
thready      37.820000   3.830000  41.650000 ( 24.333000)
As you can see, the real time improved. Since the measurement took place on a computer with two cores, and one of the cores was only about 75% utilized, the improvement fell short of a full 2x. It follows that on a machine with more cores we could run even more parallel threads and improve the result further.
jRuby is an alternative implementation of Ruby on the JVM that brings very powerful features to the language itself.
When choosing the number of simultaneous threads, keep in mind: for IO operations we can have many threads on a single core without losing performance, but for computational operations we start losing performance as soon as the number of threads exceeds the number of cores.
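On present-day Rubies this rule of thumb can be expressed directly. Etc.nprocessors appeared in Ruby 2.2, after the versions discussed here, and the multiplier of 4 for IO work is an arbitrary illustration, not a recommendation from the library:

```ruby
require 'etc'

cores = Etc.nprocessors

# IO-bound work: threads mostly wait, so several per core is fine
io_thread_count  = cores * 4

# CPU-bound work: more threads than cores only adds scheduling overhead
cpu_thread_count = cores

puts "#{cores} cores: #{io_thread_count} IO threads, #{cpu_thread_count} CPU threads"
```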
With the original Ruby implementation (MRI), it is recommended to use only one thread for computational operations. True thread-level parallelism can be achieved only with jRuby and Rubinius.
Process-level concurrency
As we now know, Ruby MRI can use the resources of only one core at a time for a single ruby process (on Unix systems). One way around this limitation is to fork processes, like this:
read, write = IO.pipe
result = 5

pid = fork do
  result = result + 5
  Marshal.dump(result, write)
  exit 0
end

write.close
result = read.read
Process.wait(pid)
puts Marshal.load(result)
At the moment of creation, the forked process copies the value of the result variable, equal to 5, but the main process will not see any further changes to the variable inside the fork, which is why we had to set up messaging between the fork and the main process via IO.pipe.
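The same pipe technique extends to several forks at once, which is how MRI can occupy several cores with CPU-bound work. A sketch, assuming a fork-capable Ruby on a Unix system (the factorial inputs are arbitrary examples):

```ruby
# Compute several factorials in parallel, one child process per input.
# Requires fork, so this is Unix-only.
inputs = [10, 12, 15]

pipes = inputs.map do |n|
  read, write = IO.pipe
  fork do
    read.close
    Marshal.dump((1..n).inject(1, :*), write)
    exit! 0               # skip at_exit handlers in the child
  end
  write.close             # parent keeps only the read end
  read
end

Process.waitall
results = pipes.map { |r| Marshal.load(r.read) }
p results  # => [3628800, 479001600, 1307674368000]
```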
This method is effective, but rather cumbersome and inconvenient. More interesting results can be achieved with the DRb module for distributed programming.
Using the DRb module to synchronize processes
The DRb module is part of the standard Ruby library and provides distributed programming features. Its core idea is the ability to give any computer on the network access to a single Ruby object. The results of all manipulations with this object, and its internal state, are visible to all connected computers and are constantly synchronized. The module's capabilities are very broad overall, and deserve a separate article.
I had the idea of using Rinda::TupleSpace tuples together with this DRb feature to create a Pthread module responsible for executing code in separate processes, both on the main program's machine and on other connected machines. Rinda::TupleSpace offers access to tuples by name and, like instances of the Queue class, allows only one thread or process at a time to write or read a tuple.
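Even locally, without starting a DRb service, the tuple-space behaviour is visible: take blocks until a tuple matching the template is written, and removes it atomically, so no two consumers can grab the same tuple. A small sketch (the :factorial tuple name is an arbitrary example):

```ruby
require 'rinda/tuplespace'

ts = Rinda::TupleSpace.new

# a writer thread publishes a named tuple
Thread.new { ts.write([:factorial, 5, 120]) }

# take blocks until a tuple matches the template (nil matches anything)
# and removes it from the space, so exactly one consumer receives it
name, n, value = ts.take([:factorial, nil, nil])
puts "#{name}: #{n}! = #{value}"
```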
This produced a solution that allows Ruby MRI to execute code on several cores:

Pthread::Pthread.new queue: 'fact', code: %{
  1.upto(n).inject(1) { |i, n| i * n }
}, context: { n: 1000 }
As you can see, the code to be executed is passed as a string, because DRb transfers a procedure to another process only as a reference, and executing it consumes the resources of the process that created it. To break free of the main process's context, I pass the code to other processes as a string, and the values of its variables in an additional dictionary. An example of how to connect additional machines to the code execution can be found on the project's home page.
The Pthread library supports MRI 1.9.3 and 2.0.
Parallelism in Ruby on Rails
Servers for Ruby on Rails, and libraries for running background jobs, can be divided into two groups. The first uses forks, that is, additional processes, to handle user requests or perform background jobs. With MRI and these servers and libraries we can thus handle several requests in parallel, and perform several jobs at once.

However, this method has a drawback. Forked processes copy the memory of the process that created them, so a Unicorn server with three workers can occupy 1 GB of memory having barely started. The same applies to background-job libraries such as Resque.
The creators of the Puma server for Ruby on Rails took the features of jRuby and Rubinius into account and released a server aimed primarily at these two implementations. Unlike Unicorn, Puma handles simultaneous requests with threads, which require much less memory. Puma is thus an excellent alternative when used together with jRuby or Rubinius. The Sidekiq library is built on the same principle.
Conclusion
Threads are a very powerful tool that lets you do several things at once, especially when long IO operations or computations are involved. In this article we examined some of the capabilities and limitations of Ruby and its various implementations, and used two third-party libraries to simplify working with threads.

Thus, the author recommends experimenting with Ruby and threads, and, when starting future Rails projects, looking towards the alternative implementations: jRuby and Rubinius.