Functional programming in the shell, using xargs as an example

Abstract: a story about doing list processing in the shell quickly and elegantly, a short manual on xargs, and a lot of padding about the philosophy of programming and system administration.

A little SEO: currying, lambda functions, function composition, map, list filtering, working with sets in the shell.

Example

System administrators often need to take the output of one program and apply another program to each element of that output. Or more than one program. As an amusing (and useless) example, take the following: calculate the total size of all executable files currently running on the system, together with all the dynamic libraries they use.

This is not a real task but a teaching example. While solving it (the solution will be a one-liner), I will tell you about a very unusual and powerful system administration tool: linear functional programming. It is linear because chaining commands with the pipe "|" arranges them in a straight line, and xargs lets you turn a complex program with nested loops into a single functional-style line. The purpose of the article is not to show how to find the size of libraries, nor to retell the options of xargs, but to explain the spirit of the solution and the philosophy behind it.

Lyrical digression

There are several programming styles. One of them goes like this: loop over the list, and for each element, if it is not an empty string, treat it as a file name, and if that file's size is not zero, add it to the counter. Oh yes, first you need to set the counter to zero.

Another goes like this:
Apply to the list a function that, for each element that is a non-empty string naming a file of non-zero size, adds that size to the sum.

Even in words, the second version is shorter.

Given the expressive power of programming languages, such constructs are compact and free of the usual problems of loops: infinite loops, wrong exit conditions, and so on.

In fact, in the context of our problem, we are talking about specifying a function to be applied to each element of a list. That function can in turn be a list handler with its own processing function.
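
To make the contrast concrete, here is a minimal sketch of both styles on a throwaway directory. The file names and the helper names sum_loop and sum_pipe are invented for the illustration, and du -b assumes GNU coreutils.

```shell
# Illustration only: file names and helper names are invented for this sketch.
dir=$(mktemp -d)
printf 'abc'   > "$dir/a"      # 3 bytes
printf 'hello' > "$dir/b"      # 5 bytes
: > "$dir/empty"               # 0 bytes, must be skipped

# Style one: explicit loop, explicit counter, explicit conditions.
sum_loop() {
    sum=0
    for f in "$dir"/*; do
        if [ -n "$f" ] && [ -s "$f" ]; then
            sum=$(( sum + $(du -b "$f" | cut -f1) ))
        fi
    done
    echo "$sum"
}

# Style two: filter the list, map du over it, fold the results with awk.
sum_pipe() {
    find "$dir" -type f ! -empty |
        xargs du -b |
        awk '{sum += $1} END {print sum}'
}

echo "loop: $(sum_loop), pipeline: $(sum_pipe)"
```

Both functions compute the same number; the second simply has no counter to initialise and no loop to get wrong.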

Data for the solution

To solve the problem, we need a list of running processes. This is simpler than it looks: every process has a directory matching /proc/[0-9]+. Next, we need the paths to the binaries. This is also simple: for every process except kernel threads, /proc/PID/exe is a symlink to the executable. Next: we need the list of libraries for each file. This is done by the ldd command, which expects a file path and prints (in a tricky format) a list of libraries. The rest is a matter of technique and of walking symlinks: we need to follow the library symlinks all the way to the end, and then calculate the size of each file.
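
To give an idea of the "tricky format", the sketch below runs the kind of filter used later in this article over a hand-written sample shaped like typical ldd output. The library paths and addresses in the sample are placeholders, not taken from a real system.

```shell
# Hand-written sample in the shape of ldd output; paths and addresses are placeholders.
sample_ldd() {
    printf '\tlinux-vdso.so.1 (0x0000000000000000)\n'
    printf '\tlibm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000000000000000)\n'
    printf '\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000000000000000)\n'
    printf '\t/lib64/ld-linux-x86-64.so.2 (0x0000000000000000)\n'
}

# Keep only lines that mention a path, then take the third field:
#   name => /path (address)
# The vdso line has no "/" and is dropped by grep. The loader line has only
# two fields, so awk prints an empty line for it; xargs ignores blank lines,
# so that is harmless further down a pipeline.
libs=$(sample_ldd | grep "/" | awk '{print $3}')
echo "$libs"
```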

Thus, the high-level description of the task is: take the lists of executable files and libraries, and find the size of each item.

Note that the list of executable files is used twice along the way: once as is, and once to obtain the list of libraries for each file.

Imperative solution

(I exaggerate and omit the details)

 # Deliberately naive: no PID filtering, almost no error handling.
 get_exe_list() {
     for a in /proc/*; do
         readlink -f "$a/exe"
     done
 }

 get_lib() {
     ldd "$1" | awk '{print $3}'
 }

 calc_size() {
     sum=0
     for a in "$@"; do
         # s+0 yields 0 when du prints nothing for an unreadable name
         sum=$(( sum + $(du -b "$a" 2>/dev/null | awk '{s=$1} END {print s+0}') ))
     done
     echo "$sum"
 }

 exe_list=$(get_exe_list | sort -u)
 lib_list=$(for a in $exe_list; do get_lib "$a"; done | sort -u)
 size=$(( $(calc_size $exe_list) + $(calc_size $lib_list) ))
 echo "$size"

Disgusting, right? And that, note, is without a regexp to filter PIDs properly (there is no point trying to read a non-existent /proc/mdstat/exe) and without handling the many possible errors.

Lists

The task is clear. Since our input data is homogeneous (files), we can simply represent it as a list and process every element the same way. I will write the not-yet-written parts of the code in CAPS.

Part One: Double List Processing

We'll cheat a little and use stderr to duplicate the list.

 (EXE_LIST |tee /dev/stderr|LIB_LIST) 2>&1 | CALC

What does this code do? EXE_LIST, not written yet, produces the list of executables in the system. tee takes this list and writes it both to stderr (stream number 2) and to stdout (stream number 1). Stream 1 is passed to LIB_LIST. Then, for all three commands together (hence the parentheses), we merge stderr into stdout, so that a single combined list is passed to CALC.
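
The trick can be tried in isolation before any real stage exists. In this sketch EXE_LIST and LIB_LIST are toy stand-ins (a fixed two-line list and an upper-casing filter), chosen only so that the two copies of the list are easy to tell apart.

```shell
# Toy stand-ins for the real stages (hypothetical, for illustration only).
EXE_LIST() { printf '%s\n' alpha beta; }     # produces the original list
LIB_LIST() { tr 'a-z' 'A-Z'; }               # derives a second list from it

# tee copies the list to stderr and passes it on to LIB_LIST via stdout;
# 2>&1 on the group then folds the stderr copy back into the same stream.
merged=$( (EXE_LIST | tee /dev/stderr | LIB_LIST) 2>&1 | LC_ALL=C sort )
echo "$merged"
```

The two copies arrive in no particular order relative to each other, which does not matter as long as the next stage sorts them anyway.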

Now we need to write EXE_LIST.

Part Two: Filtering Lists

(Note: to see _all_ processes in the system, you need to be root.)

We will take a slightly unusual route and use find instead of ls in a loop. In principle ls would work too, but handling symlinks would be more troublesome.

So: find /proc/ -maxdepth 2 -name "exe" -ls
We need -maxdepth to ignore threads. The output resembles that of ls, and we need to filter it.

So, we improve EXE_LIST:

 find /proc/ -name "exe" -ls 2>/dev/null|awk '{print $13}'

Observation: we use stderr to carry data, so we don't need find flooding it with complaints about kernel threads and the like.
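
Picking column 13 out of -ls output breaks as soon as a path contains a space. A possible alternative, assuming GNU find, is -printf '%l\n', which prints a symlink's target directly. The sketch below uses a fake directory tree in place of /proc so that it behaves the same everywhere; every PID and path in it is made up.

```shell
# A fake /proc-like tree (made-up PIDs and targets), so the sketch is reproducible.
root=$(mktemp -d)
mkdir -p "$root/100" "$root/200" "$root/300/task/301"
ln -s /bin/sh           "$root/100/exe"
ln -s "/opt/my app/run" "$root/200/exe"            # target with a space; need not exist
ln -s /bin/true         "$root/300/task/301/exe"   # a thread entry, one level too deep

# -maxdepth 2 stops at ROOT/PID/exe and never reaches ROOT/PID/task/TID/exe.
# %l prints the link target (GNU find); grep . drops any empty lines.
exe_list=$(find "$root" -maxdepth 2 -name exe -printf '%l\n' 2>/dev/null |
    grep . | LC_ALL=C sort -u)
echo "$exe_list"
```

Downstream xargs calls split on whitespace too, so this removes only one of the weak spots.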

Part Three: LIB_LIST

It's simple: run ldd on each file passed in, then tidy up the output.

 xargs -n 32 -P 4 ldd 2>/dev/null|grep "/"|awk '{print $3}'

What's interesting here? We give ldd at most 32 files at a time and run 4 ldd processes in parallel (yes, this is our homebrew Hadoop). The -P option parallelizes the execution of ldd, which gives some speedup on multi-core machines with good disks (it is pointless in this particular case, but if we were doing something slower than ldd, parallelism could be a lifesaver...).
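
The effect of -n and -P is easiest to see with a harmless command in place of ldd. A small sketch, with echo as the stand-in and invented item names:

```shell
# Seven items, at most three per command invocation: echo runs three times.
batches=$(printf '%s\n' f1 f2 f3 f4 f5 f6 f7 | xargs -n 3 echo)
echo "$batches"

# The same with up to two invocations running at once; output order is no
# longer guaranteed, so sort before comparing or consuming it.
parallel_batches=$(printf '%s\n' f1 f2 f3 f4 f5 f6 f7 | xargs -n 3 -P 2 echo | LC_ALL=C sort)
echo "$parallel_batches"
```

Each output line corresponds to one invocation of the command, which is a convenient way to check how xargs has split the list.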

Part Four: CALC

The input is a list of files; the output must be a single number, the total size of all the files. The size will be determined... But wait. Who said symlinks point to files? Of course, symlinks point to symlinks that point to symlinks or files. And those symlinks... well, you get the idea.

So we add readlink. And it, the pest, wants one argument at a time. But it has the -f option, which saves us a lot of effort: it prints the final file name regardless of whether it was given a symlink or a plain file.

| xargs -n 1 -P 16 readlink -f | sort -u
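
A quick way to convince yourself that this stage collapses a chain of symlinks: the sketch below builds one file with two links on top of it (the names are invented, modelled on how shared libraries are usually versioned) and feeds all three names through the same commands.

```shell
# Made-up file names: one real file reached through a chain of symlinks.
d=$(mktemp -d)
printf 'payload' > "$d/libdemo.so.1.0"
ln -s libdemo.so.1.0 "$d/libdemo.so.1"    # link -> file
ln -s libdemo.so.1   "$d/libdemo.so"      # link -> link -> file

# Three different names go in; after full resolution and sort -u one path remains.
resolved=$(printf '%s\n' "$d/libdemo.so" "$d/libdemo.so.1" "$d/libdemo.so.1.0" |
    xargs -n 1 readlink -f | sort -u)
echo "$resolved"
```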

... so, the size will be determined with du. Note that we could simply use the -c option here, which makes du add up the numbers and print a total, but in a training course we are not looking for easy ways. So, no cheating.

| xargs -n 32 -P 4 du -b | awk '{sum += $1} END {printf "%i\n", sum}'

Why do we need sort -u? Because there will be many repetitions. We need to pick the unique values from the list, that is, turn the list into a set. This is done in a naive way: we sort the list and tell sort to discard duplicate lines as it goes.
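
The difference that sort -u makes to the final number can be checked on two files of known size. In this sketch the file names are invented, du -b assumes GNU coreutils, and -n 1 is used so that every name gets its own du invocation.

```shell
# Two small files of known size; names are invented for the example.
d=$(mktemp -d)
printf 'abc'   > "$d/one"    # 3 bytes
printf 'hello' > "$d/two"    # 5 bytes

# The list mentions "one" twice, as happens when many programs share a library.
list() { printf '%s\n' "$d/one" "$d/two" "$d/one"; }

# Without deduplication the shared file is counted twice.
naive=$(list | xargs -n 1 du -b | awk '{sum += $1} END {print sum}')
# sort -u turns the list into a set first, so every file is counted once.
unique=$(list | sort -u | xargs -n 1 du -b | awk '{sum += $1} END {print sum}')
echo "naive=$naive unique=$unique"
```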

One-liner, terrifying

We write everything together:

(find /proc/ -name exe -ls 2>/dev/null | awk '{print $13}' | tee /dev/stderr | xargs -n 32 -P 4 ldd 2>/dev/null | grep / | awk '{print $3}') 2>&1 | sort -u | xargs -n 1 -P 16 readlink -f | xargs -n 32 -P 4 du -b | awk '{sum += $1} END {print sum}'

(I did not wrap this horror in a preformatted tag, so that the line wraps automatically and does not blow up your RSS reader's feed. Love me for it.)

Of course, THIS is no better than what we started with. Fear and horror, in a word. Although, if you are a level-98 guru grinding toward level 99, such one-liners may well be your usual working style...

However, back to storming level 10.

A decent look

We want to cut it up into readable, debuggable pieces of code. Sane. Commented.

So, back to the initial form: (EXE_LIST | tee /dev/stderr | LIB_LIST) 2>&1 | CALC

Readable? Probably, yes.

It remains to figure out how to write EXE_LIST and the other pieces properly.

Option one: using functions:

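 # List the executables of all running processes.
 EXE_LIST () {
     find /proc/ -name "exe" -ls 2>/dev/null | awk '{print $13}'
 }
 # For each executable read from stdin, list the libraries it uses.
 LIB_LIST () {
     xargs -n 32 -P 4 ldd 2>/dev/null | grep / | awk '{print $3}'
 }
 # Deduplicate, resolve symlinks, and sum the file sizes in bytes.
 CALC () {
     sort -u | xargs -n 1 -P 16 readlink -f | xargs -n 32 -P 4 du -b | awk '{sum+=$1} END {print sum}'
 }
 (EXE_LIST | tee /dev/stderr | LIB_LIST) 2>&1 | CALC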

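And a touch of functional nobility (a function body in parentheses runs in a subshell, so it cannot modify the caller's environment):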

 EXE_LIST () (
     find /proc/ -name "exe" -ls 2>/dev/null | awk '{print $13}'
 )
 LIB_LIST () (
     xargs -n 32 -P 4 ldd 2>/dev/null | grep / | awk '{print $3}'
 )
 CALC () (
     sort -u | xargs -n 1 -P 16 readlink -f | xargs -n 32 -P 4 du -b | awk '{sum+=$1} END {print sum}'
 )
 (EXE_LIST | tee /dev/stderr | LIB_LIST) 2>&1 | CALC
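
The only difference between the two variants is the function body: braces in the first, parentheses in the second. A minimal sketch of what that changes, using two invented functions and one variable:

```shell
# Brace body: runs in the caller's shell, so the assignment leaks out.
leaky() { counter=changed; }

# Parenthesised body: runs in a subshell, so the caller's variable is untouched.
pure() ( counter=changed )

counter=original
pure
after_pure=$counter

leaky
after_leaky=$counter
echo "after pure: $after_pure, after leaky: $after_leaky"
```

For pipeline stages that only read stdin and write stdout, this is a cheap way to rule out side effects on the calling shell.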

Of course, FP purists will say that there is no type inference or checking, no lazy evaluation, and that calling this kind of list processing functional is insanity and pornography.

However, this is code, it works, and it is easy to write. It does not belong in a real production environment, but it can easily be the tool that turns three hours of monotonous work into 5 minutes of interesting programming. Such is the nature of a system administrator's work.

The main thing I wanted to show: a functional approach to processing sequences in the shell gives more readable and less cumbersome code than head-on iteration. And it runs faster, by the way (thanks to xargs parallelism).

Scalability

A further development of this "Hadoop in the shell" is the GNU parallel utility, which can run commands on several servers in parallel.

Source: https://habr.com/ru/post/153785/
