Elixir in bioinformatics

In this article I will talk about my attempt to use the GenStage library, and in particular the Flow module, to implement one of the bioinformatics algorithms. Over the past two years, I have been developing a comprehensive storage system and searching for the results of metagenomic analysis ( metagenomics ) of hydrocarbons. Probably for many this is a Chinese literacy. In fact, such an analysis means the identification of all types of microorganisms found, for example, in oil deposits. Some of these microorganisms, predominantly bacteria, are capable of corroding steel pipes and create many other adverse effects.

The first stage of the metagenomic analysis is the genome sequencing of several thousand individuals found in hydrocarbon materials. Since, as a result of sequencing, not the whole genome is “read”, but only its individual sections, it is a computationally difficult task to determine which individual this or that area belongs to.

As for the computer implementation, the analysis consists in transferring the initial data, the so-called nitrogenous bases (A, T, U, C, and G), into a chain of processes. One of the well-known programs for solving this problem is Qiime (read “tea”). It consists of many related applications written in Python. At first, I developed my streaming framework for Elixir, but as soon as GenStage appeared, it was interesting for me to evaluate its capabilities in conducting research of this kind.
')

Switch to GenStage

Instead of trying to rewrite Qiime on Elixir (not the simplest task), I decided to start small, namely find out how the simplest bioinformatics algorithms can be implemented on Elixir and how to take advantage of the parallel execution of GenStage (it might work faster). To this end, I completed a wonderful online course on the Coursera website called “Finding messages hidden in DNA (Bioinformatics I)” , which describes bioinformatics problems and their solutions. To implement them, you can choose any programming language.

One of these tasks is the search for repetitive regions of the genome. We define the condition of the problem. A k-mer is a sequence of k nucleotides, for example, CTGGAT is a 6-mer. Given the sequence, which may consist of millions of nucleotides. It is required to find k-measures appearing in it repeatedly. Instead of combing the entire sequence, we pay attention to the k-measures contained in its individual parts. In particular, in the sequence of 1500 nucleotides, we will look for k-measures only in the area of 500 nucleotides.

Example from online course

Given: CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA
Task: To find a 5-mer, which is present at least 4 times at a site of 50 nucleotides.
Solution: CGACA GAAGA

The algorithm iterates over the sequence, extracts a certain section from it, and then iterates through this section, extracting k-measures and counting among them the number of unique ones. In both cases, the search takes place in each nitrogen base separately. As a result, we obtain k-measures that occur a specified number of times.

Why is this algorithm used? The fact is that certain nucleotide sequences in the genome of a living organism are of particular importance. For example, in bacteria, such sequences indicate the beginning and end points of replication of the DNA molecule.

Implementation

And here is my implementation of the above algorithm using the Flow module from GenStage:

defmodule Bio do alias Experimental.Flow def clump(seq, k_mer_len, subseq_len, times) do |> seq |> String.to_charlist |> Stream.chunk(subseq_len, 1) |> Flow.from_enumerable |> Flow.partition |> Flow.map(fn e -> Stream.chunk(e, k_mer_len, 1) end) |> Flow.map( fn e -> Enum.reduce(e, %{}, fn w, acc -> Map.update(acc, w, 1, & &1 + 1) end) end) |> Flow.flat_map( fn e -> Enum.reject(e, fn({_, n}) -> n < times end) end) |> Flow.map(fn({seq, _}) -> seq end) |> Enum.uniq end end

Maybe not perfect, but the code works. Having considered it in more detail, the following actions can be distinguished

Nucleotide sequences are recorded in the Flow data stream.
Further, k-measures are placed there, the number of their repetitions is calculated, and this value is recorded in the Map field.
Those elements (k-measures) that appear less than the required number of times are deleted.
And, finally, the function Enum.uniq is Enum.uniq , which eliminates repeating elements (it is not important how many times a sequence has appeared, but that it has been encountered a certain number of times).

Notice that I use the Stream.chunk/4 function. This feature is implemented in the Enum and Stream modules, but not in Flow. Being confused about whether a separate implementation of the chunk function is needed for the Flow module, I sent this question to the Elixir mailing list. The creator of the language, Jose Valim, kindly answered it, and moreover, provided an improved implementation of the clump function (see below)!

It is important to note that, according to him, it is necessary to use Flow with caution, especially if there is a need to preserve the original sequence of data, because it is not known when exactly one of the parallel processes will end and return the result. As it turned out, the above search algorithm does not require preserving the original sequence of data. Jose also noted that there is no need to call Flow.partition , as in this algorithm, data partitioning does not occur.

José function implementation:

 def clump(seq, k_mer_len, subseq_len, times) do seq |> String.to_charlist |> Stream.chunk(subseq_len, 1) |> Flow.from_enumerable |> Flow.flat_map(&find_sequences(&1, k_mer_len, times)) |> Enum.uniq end def find_sequences(subseq, k_mer_len, times) do subseq |> Stream.chunk(k_mer_len, 1) |> Enum.reduce(%{}, fn w, acc -> Map.update(acc, w, 1, & &1 + 1) end) |> Enum.reject(fn({_, n}) -> n < times end) |> Enum.map(fn({seq, _}) -> seq end) end

Conclusion from the translator

You can find the original article at the link: GenStage and Bioinformatics
Unfortunately, the network still has too little information about Elixir in Russian, so if you liked the article, support it with pluses and reposts, and if you would like to periodically read something new about Elixir, then join Wunsh , the Russian-speaking Elixir community - and subscribe to the newsletter to receive translations of the most interesting articles!

Source: https://habr.com/ru/post/318104/

All Articles

Elixir in bioinformatics

Switch to GenStage

Example from online course

Implementation

Conclusion from the translator

More articles: