
MapReduce for beginners in Erlang

I am continuing my immersion in Erlang. There is already a cunning plan to rewrite one of our monitoring services in Erlang. We are adopting the Windows Azure and Amazon EC2 clouds as a platform for some products and internal QA tasks, so the ability to use many cores and machines without rewriting the code looks promising.

So, for a start, a simple but real example: a project of ~2000 files. We need to build a list of the environment variables it uses. That is, find the occurrences of the strings “getenv(...)” and “GetVariable(...)” (the latter is our wrapper) and pull the parameter out of each.
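To make the extraction step concrete, here is a minimal sketch of pulling the quoted parameter out of one line of source text with the standard `re` module. The module name `extract_demo` and the function `extract/1` are hypothetical and are not part of the program in this post; the pattern also assumes the argument is always a plain string literal.

```erlang
-module(extract_demo).
-export([extract/1]).

% Hypothetical helper: returns every name passed as a string literal
% to getenv("...") or GetVariable("...") in one line of text.
extract(Line) ->
    % \\s in an Erlang string becomes \s (whitespace) in the regexp.
    Pattern = "(?:getenv|GetVariable)\\s*\\(\\s*\"([^\"]+)\"\\s*\\)",
    % 'global' finds all matches in the line; {capture, [1], list}
    % returns only the first group (the name) as an Erlang string.
    case re:run(Line, Pattern, [global, {capture, [1], list}]) of
        {match, Matches} -> [Name || [Name] <- Matches];
        nomatch -> []
    end.
```

For example, `extract_demo:extract("char *p = getenv(\"HOME\");")` returns `["HOME"]`, and a line without either call returns `[]`.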

The task is straightforward and has long been solved by a C++ program, which does not even walk the directories itself but simply calls the Unix “find” to generate a list of files by mask, and then scans the files on that list. On 2000 files it takes a couple of seconds in a single thread.
Now Erlang. Here I wanted to try something fancier than crawling the files sequentially. MapReduce fits nicely: build a list of files, then analyze each file in parallel (Map), accumulating the variable names found, and at the end process all the collected results (Reduce), which in our case just means counting the number of occurrences of each variable.

In fact, my code repeats the example from “Programming Erlang” and uses the phofs module (parallel higher-order functions) from the same book.

-module(find_variables).
-export([main/0, find_variables_in_file/2, process_found_variables/3]).

-define(PATH, "/Projects/interesting_project").
% Loose mask: matches any base name that contains a dot followed,
% somewhere later, by "c" or "cpp".
-define(MASK, "\\..*(cpp|c)").

main() ->
    io:format("Creating list of files...~n", []),
    % Build the list of files recursively. The last argument is the
    % accumulator, which starts out empty.
    Files = filelib:fold_files(?PATH, ?MASK, true,
                               fun(N, A) -> [N | A] end, []),
    io:format("Found ~b file(s)~n", [length(Files)]),
    F1 = fun find_variables_in_file/2,   % Map
    F2 = fun process_found_variables/3,  % Reduce
    % Run MapReduce through benchmark/2 to measure the elapsed time.
    benchmark(fun() ->
                  L = phofs:mapreduce(F1, F2, [], Files),
                  io:format("Found ~b variable(s)~n", [length(L)])
              end, "MapReduce").

benchmark(Worker, Title) ->
    {T, _} = timer:tc(Worker),
    io:format("~s: ~f sec(s)~n", [Title, T / 1000000]).

% Note the doubled backslash: in an Erlang string "\s" is a plain
% space, while "\\s" reaches the regexp engine as \s (any whitespace).
-define(REGEXP, "(getenv|GetVariable)\\s*\\(\\s*\"([^\"]+)\"\\s*\\)").

% Map. Analyzes a single file.
find_variables_in_file(Pid, FileName) ->
    case file:open(FileName, [read]) of
        {ok, File} ->
            % Compile the regular expression once per file.
            {ok, RE} = re:compile(?REGEXP),
            % Called for every variable found; sends the pair
            % {Var, 1} to the process that collects the results.
            CallBack = fun(Var) -> Pid ! {Var, 1} end,
            find_variable_in_file(File, RE, CallBack),
            file:close(File);
        {error, Reason} ->
            io:format("Unable to process '~s', ~p~n", [FileName, Reason]),
            exit(1)
    end.

% Reduce. Called once per key after all Map workers have finished.
% Every Map worker sends {VarName, 1}, so Vals is a list of ones for
% the given VarName, and its length is the number of occurrences.
process_found_variables(Key, Vals, A) ->
    [{Key, length(Vals)} | A].

% Reads the file line by line.
find_variable_in_file(File, RE, CallBack) ->
    case io:get_line(File, "") of
        eof -> void;
        Line ->
            scan_line_in_file(Line, RE, CallBack),
            find_variable_in_file(File, RE, CallBack)
    end.

% Looks for the first match in a line and, if there is one, passes the
% variable name (the second capture group) to CallBack.
scan_line_in_file(Line, RE, CallBack) ->
    case re:run(Line, RE, [{capture, [2], list}]) of
        {match, [Name]} -> CallBack(Name);
        nomatch -> void
    end.

To build the program you need the phofs module. It is generic and does not depend on the specific Map and Reduce functions.
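To show what such a generic module expects from the Map and Reduce functions, here is a simplified stand-in with the same calling convention as the `phofs:mapreduce(F1, F2, [], Files)` call above. This is a sketch under assumptions, not the code from the book: the module name `mini_mapreduce` is hypothetical, and it uses maps, so it needs a reasonably recent Erlang/OTP release.

```erlang
-module(mini_mapreduce).
-export([mapreduce/4, demo/0]).

% Simplified stand-in for a generic mapreduce (an assumption, not the
% book's phofs code). F1(Pid, X) sends {Key, Val} messages to Pid;
% F2(Key, Vals, Acc) folds the values collected for each key.
mapreduce(F1, F2, Acc0, L) ->
    Parent = self(),
    Collector = spawn(fun() -> collect(Parent, F1, F2, Acc0, L) end),
    receive
        {Collector, Result} -> Result
    end.

collect(Parent, F1, F2, Acc0, L) ->
    % Trap exits so that a finished worker shows up as a message.
    process_flag(trap_exit, true),
    Self = self(),
    % One linked worker per input item.
    lists:foreach(fun(X) -> spawn_link(fun() -> F1(Self, X) end) end, L),
    Groups = gather(length(L), #{}),
    Parent ! {Self, maps:fold(F2, Acc0, Groups)}.

% Collects {Key, Val} pairs until all N workers have exited.
gather(0, Groups) ->
    Groups;
gather(N, Groups) ->
    receive
        {'EXIT', _Worker, _Why} ->
            gather(N - 1, Groups);
        {Key, Val} ->
            Add = fun(Vals) -> [Val | Vals] end,
            gather(N, maps:update_with(Key, Add, [Val], Groups))
    end.

% Counts occurrences of names in two made-up "files".
demo() ->
    Map = fun(Pid, Names) ->
              lists:foreach(fun(Name) -> Pid ! {Name, 1} end, Names)
          end,
    Reduce = fun(Key, Vals, Acc) -> [{Key, length(Vals)} | Acc] end,
    lists:sort(mapreduce(Map, Reduce, [], [["HOME", "PATH"], ["HOME"]])).
```

Calling `mini_mapreduce:demo()` returns `[{"HOME",2},{"PATH",1}]`. The collector relies on the fact that a worker's exit signal arrives after the messages that worker sent, so once every worker has exited, all pairs have been gathered.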

And a Makefile just in case:

target = find_variables

all:
	erlc $(target).erl
	erlc phofs.erl
	erl -noshell -s $(target) main -s init stop

clean:
	-rm *.beam *.dump


Benchmarks. As I have already said, the C++ program, including the time for the “find” call, takes 1-2 seconds on my machine. The Erlang version takes ~20 seconds. Bad? That depends on how you look at it. If the analysis of each file took longer (that is, if the program spent most of its time analyzing files rather than traversing directories), it is not at all clear which solution would be more practical as the number of files and the complexity of the analysis grow.

I am new to Erlang, so I would be grateful for any criticism of the code.


Source: https://habr.com/ru/post/133750/

