
Linux pipes tips & tricks

What is a pipe?


A pipe is a unidirectional interprocess communication channel. The term was coined by Douglas McIlroy for the Unix command shell and is named by analogy with a physical pipeline. Pipelines are most often used in shell scripts to connect several commands, redirecting the output (stdout) of one command to the input (stdin) of the next with the pipe symbol '|':
cmd1 | cmd2 | .... | cmdN 

For example:
 $ grep -i "error" ./log | wc -l
 43

grep performs a case-insensitive search for the string "error" in the file ./log, but the result is not printed to the screen; instead it is redirected to the input (stdin) of the wc command, which in turn counts the number of matching lines.

Logic


A pipeline provides asynchronous execution of commands through I/O buffering: all the commands in a pipeline run in parallel, each in its own process.

Starting with kernel 2.6.11 the buffer size is 65536 bytes (64 KB); in older kernels it equals one memory page. A process that tries to read from an empty buffer is blocked until data appears. Similarly, a process that attempts to write to a full buffer is blocked until the required space is freed.
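The blocking behavior is easy to observe from the shell. A minimal sketch using a named pipe (FIFO), which shares the same buffer semantics as an anonymous pipe; the temporary path comes from mktemp:

```shell
# A reader blocks on an empty pipe: no writer ever opens this FIFO, so
# cat hangs until `timeout` kills it after one second.
d=$(mktemp -d)
mkfifo "$d/p"
timeout 1 cat "$d/p"
echo "exit code: $?"   # 124 -- cat was still blocked when the timeout fired
rm -rf "$d"
```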
It is important that even though the pipeline operates on file descriptors for its I/O streams, all operations are performed in memory, without any load on the disk.
All the information below applies to the bash-4.2 shell and kernel 3.10.10.

Simple debugging


The strace utility allows you to track system calls during program execution:
 $ strace -f bash -c '/bin/echo foo | grep bar'
 ....
 getpid()                = 13726   <-- PID of the shell
 ...
 pipe([3, 4])                      <-- the pipe is created
 ....
 clone(....)             = 13727   <-- the first child process (echo)
 ...
 [pid 13727] execve("/bin/echo", ["/bin/echo", "foo"], [/* 61 vars */]
 .....
 [pid 13726] clone(....) = 13728   <-- the second child process (grep)
 ...
 [pid 13728] stat("/home/aikikode/bin/grep", ...
It can be seen that the pipe() system call is used to create the pipeline, and that both commands are executed in parallel in separate processes.

What follows is a lot of bash and kernel source code

Source code, level 1: the shell


Since the best documentation is the source code, let's turn to it. Bash uses Yacc to parse its input and calls command_connect() when it encounters the '|' character.
parse.y:
 1242 pipeline:  pipeline '|' newline_list pipeline
 1243            { $$ = command_connect ($1, $4, '|'); }
 1244         |  pipeline BAR_AND newline_list pipeline
 1245            {
 1246              /* Make cmd1 |& cmd2 equivalent to cmd1 2>&1 | cmd2 */
 1247              COMMAND *tc;
 1248              REDIRECTEE rd, sd;
 1249              REDIRECT *r;
 1250
 1251              tc = $1->type == cm_simple ? (COMMAND *)$1->value.Simple : $1;
 1252              sd.dest = 2;
 1253              rd.dest = 1;
 1254              r = make_redirection (sd, r_duplicating_output, rd, 0);
 1255              if (tc->redirects)
 1256                {
 1257                  register REDIRECT *t;
 1258                  for (t = tc->redirects; t->next; t = t->next)
 1259                    ;
 1260                  t->next = r;
 1261                }
 1262              else
 1263                tc->redirects = r;
 1264
 1265              $$ = command_connect ($1, $4, '|');
 1266            }
 1267         |  command
 1268            { $$ = $1; }
 1269         ;
Here we also see the handling of the '|&' token, which is equivalent to redirecting both stdout and stderr into the pipeline. Next let's turn to command_connect() in make_cmd.c:
 194 COMMAND *
 195 command_connect (com1, com2, connector)
 196      COMMAND *com1, *com2;
 197      int connector;
 198 {
 199   CONNECTION *temp;
 200
 201   temp = (CONNECTION *)xmalloc (sizeof (CONNECTION));
 202   temp->connector = connector;
 203   temp->first = com1;
 204   temp->second = com2;
 205   return (make_command (cm_connection, (SIMPLE_COM *)temp));
 206 }
where connector is the '|' character as an int. When a sequence of commands (joined by '&', '|', ';', etc.) is executed, execute_connection() is called, in execute_cmd.c:
 2325       case '|':
 ...
 2331       exec_result = execute_pipeline (command, asynchronous, pipe_in, pipe_out, fds_to_close);

pipe_in and pipe_out are file descriptors describing the input and output streams. They can take the value NO_PIPE, which means that I/O goes to stdin/stdout.
execute_pipeline() is a rather large function whose implementation lives in execute_cmd.c. We will look at the parts most interesting to us.
execute_cmd.c :
 2112   prev = pipe_in;
 2113   cmd = command;
 2114
 2115   while (cmd && cmd->type == cm_connection &&
 2116          cmd->value.Connection && cmd->value.Connection->connector == '|')
 2117     {
 2118       /* create a pipe for this '|' */
 2119       if (pipe (fildes) < 0)
 2120         { /* error handling */ }
 .......
           /* execute the current command, using prev as its input stream
              and fildes[1] -- the write end of the pipe just created by
              pipe() -- as its output stream */
 2178       execute_command_internal (cmd->value.Connection->first, asynchronous,
 2179                                 prev, fildes[1], fd_bitmap);
 2180
 2181       if (prev >= 0)
 2182         close (prev);
 2183
 2184       prev = fildes[0];  /* the read end becomes the next command's input */
 2185       close (fildes[1]);
 .......
 2190       cmd = cmd->value.Connection->second;  /* "advance" to the next command */
 2191     }
Thus, bash handles the pipeline symbol by issuing a pipe() system call for every '|' it encounters, and executes each command in a separate process, attaching the appropriate file descriptors as its input and output streams.
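You can see the separate processes from the shell itself. A small bash-specific sketch: $BASHPID expands to the PID of the process actually executing the command, while $$ always stays the parent shell's PID:

```shell
# $$ is the PID of the parent shell; $BASHPID is the PID of the process
# executing the current command -- inside a pipeline the two differ,
# because each pipeline element runs in its own forked subprocess.
echo "shell PID:    $$"
true | echo "pipeline PID: $BASHPID"
```

The two printed PIDs are different, confirming that even a shell builtin like echo is forked into its own process when it is part of a pipeline.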

Source code, level 2: the kernel


Now let's turn to the kernel code and look at the implementation of the pipe() function. This article discusses the stable kernel version 3.10.10.
fs/pipe.c (code sections irrelevant to this article are omitted):
 /* maximum pipe buffer size; it can be changed via
    /proc/sys/fs/pipe-max-size */
  35 unsigned int pipe_max_size = 1048576;

 /* minimum size is one memory page, which POSIX guarantees for an
    atomic write, i.e. 4 KB */
  40 unsigned int pipe_min_size = PAGE_SIZE;

 869 int create_pipe_files(struct file **res, int flags)
 870 {
 871         int err;
 872         struct inode *inode = get_pipe_inode();
 873         struct file *f;
 874         struct path path;
 875         static struct qstr name = { .name = "" };

         /* allocate a dentry in the dcache */
 881         path.dentry = d_alloc_pseudo(pipe_mnt->mnt_sb, &name);

         /* allocate the write-end file structure: FMODE_WRITE mode with
            the O_WRONLY flag, since data written into one end of the pipe
            can only be read from the other end; the O_NONBLOCK flag is
            passed through */
 889         f = alloc_file(&path, FMODE_WRITE, &pipefifo_fops);
 893         f->f_flags = O_WRONLY | (flags & (O_NONBLOCK | O_DIRECT));

         /* the read-end file structure is allocated the same way
            (FMODE_READ mode with the O_RDONLY flag) */
 896         res[0] = alloc_file(&path, FMODE_READ, &pipefifo_fops);
 902         res[0]->f_flags = O_RDONLY | (flags & O_NONBLOCK);
 903         res[1] = f;
 904         return 0;
 917 }
 918
 919 static int __do_pipe_flags(int *fd, struct file **files, int flags)
 920 {
 921         int error;
 922         int fdw, fdr;

         /* create the file structures for both ends (see above) */
 927         error = create_pipe_files(files, flags);

         /* obtain free file descriptors */
 931         fdr = get_unused_fd_flags(flags);
 936         fdw = get_unused_fd_flags(flags);
 941         audit_fd_pair(fdr, fdw);
 942         fd[0] = fdr;
 943         fd[1] = fdw;
 944         return 0;
 952 }

 /* the system call int pipe2(int pipefd[2], int flags)... */
 969 SYSCALL_DEFINE2(pipe2, int __user *, fildes, int, flags)
 970 {
 971         struct file *files[2];
 972         int fd[2];

         /* create the file structures and file descriptors */
 975         __do_pipe_flags(fd, files, flags);

         /* copy the descriptors from kernel space to user space */
 977         copy_to_user(fildes, fd, sizeof(fd));

         /* bind the descriptors to the file structures */
 984         fd_install(fd[0], files[0]);
 985         fd_install(fd[1], files[1]);
 989 }

 /* ... and int pipe(int pipefd[2]), which simply calls pipe2 with
    zero flags */
 991 SYSCALL_DEFINE1(pipe, int __user *, fildes)
 992 {
 993         return sys_pipe2(fildes, 0);
 994 }
Notice that the code checks for the O_NONBLOCK flag, which can be set with the F_SETFL operation of fcntl(). It switches the pipe's I/O streams into non-blocking mode: instead of blocking, a read or write on the pipe will fail with errno set to EAGAIN.

The maximum size of a data block that is guaranteed to be written to a pipe atomically (PIPE_BUF) equals one memory page (4 KB) on the ARM architecture:
arch/arm/include/asm/limits.h:
  8 #define PIPE_BUF PAGE_SIZE 
For kernels >= 2.6.35, you can change the size of the pipe buffer:
 fcntl(fd, F_SETPIPE_SZ, <size>) 
The maximum allowed buffer size, as we saw above, is specified in the /proc/sys/fs/pipe-max-size file.
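You can inspect the current limit on any Linux system where /proc is mounted (a quick check; the default matching pipe_max_size above is 1048576 bytes, and root can raise it via sysctl):

```shell
# The system-wide upper bound that fcntl(fd, F_SETPIPE_SZ, <size>)
# will accept for an unprivileged process.
cat /proc/sys/fs/pipe-max-size
```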

Tips & tricks


In the examples below, we will run ls on the existing Documents directory and two non-existent files: ./non-existent_file and ./other_non-existent_file.

  1. Redirecting both stdout and stderr into the pipe

     $ ls -d ./Documents ./non-existent_file ./other_non-existent_file 2>&1 | egrep "Doc|other"
     ls: cannot access ./other_non-existent_file: No such file or directory
     ./Documents
    or you can use the '|&' token (you can learn about it from the shell documentation (man bash) or from the sources above, where we examined the Yacc bash parser):
     $ ls -d ./Documents ./non-existent_file ./other_non-existent_file |& egrep "Doc|other"
     ls: cannot access ./other_non-existent_file: No such file or directory
     ./Documents

  2. Redirecting _only_ stderr into the pipe

     $ ls -d ./Documents ./non-existent_file ./other_non-existent_file 2>&1 >/dev/null | egrep "Doc|other"
     ls: cannot access ./other_non-existent_file: No such file or directory
    Shoot yourself in the foot
    It is important to follow the order of the stdout and stderr redirections. For example, the combination '>/dev/null 2>&1' redirects both stdout and stderr to /dev/null.
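    A minimal sketch of the difference, reusing the non-existent file from the examples above (redirections are applied left to right):

```shell
# 2>&1 first: stderr is duplicated onto the pipe, THEN stdout goes to
# /dev/null -- the pipe carries only the error line:
ls -d ./non-existent_file 2>&1 >/dev/null | wc -l   # prints 1

# >/dev/null first: stdout goes to /dev/null, THEN stderr follows it
# there -- the pipe receives nothing:
ls -d ./non-existent_file >/dev/null 2>&1 | wc -l   # prints 0
```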

  3. Getting the correct pipeline exit code

    By default, the exit code of a pipeline is the exit code of its last command. For example, take a command that exits with a non-zero code:
     $ ls -d ./non-existent_file 2>/dev/null; echo $?
     2
    And put it in the pipe:
     $ ls -d ./non-existent_file 2>/dev/null | wc; echo $?
           0       0       0
     0
    Now the pipeline's exit code is the exit code of the wc command, i.e. 0.

    Usually we need to know whether an error occurred anywhere in the pipeline. For this, set the pipefail option, which tells the shell that the pipeline's exit code will equal the first non-zero exit code of any of its commands, or zero if all the commands succeeded:
     $ set -o pipefail
     $ ls -d ./non-existent_file 2>/dev/null | wc; echo $?
           0       0       0
     2
    Shoot yourself in the foot
    Keep in mind "harmless" commands that can return non-zero codes. This applies not only to pipelines. For example, consider grep:
     $ egrep "^foo=[0-9]+" ./config | awk '{print "new_"$0;}'
    Here we print every matching line, prefixing each with 'new_', or print nothing if no line of the required format exists. The problem is that grep exits with code 1 when no matches are found, so if the pipefail option is set in our script, this example will exit with code 1:
     $ set -o pipefail
     $ egrep "^foo=[0-9]+" ./config | awk '{print "new_"$0;}' >/dev/null; echo $?
     1
    In large scripts with complex constructs and long pipelines this is easy to overlook, which can lead to incorrect results.
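    One common workaround (a sketch; ./config and the pattern are the hypothetical ones from the example above) is to declare "no matches" a success by appending `|| true` inside a command group:

```shell
# With pipefail set, grep's exit code 1 ("no matches") is swallowed by
# `|| true`, so the pipeline as a whole exits 0:
set -o pipefail
{ egrep "^foo=[0-9]+" ./config || true; } | awk '{print "new_"$0;}' >/dev/null
echo $?   # 0
```

    Note the trade-off: `|| true` also masks genuine grep errors (such as a missing file, exit code 2), so use it only where "nothing matched" really is an acceptable outcome.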

  4. Assigning values to variables in the pipeline

    First, recall that all the commands in a pipeline are executed in separate processes created by the clone() call. As a rule this causes no problems, except when variable values are changed.
    Consider the following example:
     $ a=aaa
     $ b=bbb
     $ echo "one two" | read a b
    We would now expect the variables a and b to hold "one" and "two", respectively. In fact they remain "aaa" and "bbb": in general, any change made to a variable inside a pipeline leaves the variable unchanged outside of it:
     $ filefound=0
     $ find . -type f -size +100k |
     while true
     do
        read f
        echo "$f is over 100KB"
        filefound=1
        break  # stop after the first file found
     done
     $ echo $filefound
     0
    Even if find finds a file larger than 100Kb, the filefound flag will still have the value 0.
    There are several solutions to this problem:
    • use
       set -- $var 

      This construct sets the positional parameters from the contents of the var variable. For example, as in the first example above:
       $ var="one two"
       $ set -- $var
       $ a=$1  # "one"
       $ b=$2  # "two"
      It should be borne in mind that the script will lose the original positional parameters with which it was called.
    • transfer all the logic of processing the value of a variable to the same subprocess in the pipeline:
       $ echo "one" | (read a; echo $a;)
       one
    • change the logic to avoid assigning variables inside the pipeline.
      For example, change our example with find:
       $ filefound=0
       $ for f in $(find . -type f -size +100k)  # no pipeline, so the loop runs in the current shell
       do
          echo "$f is over 100KB"
          filefound=1
          break
       done
       $ echo $filefound
    • (only for bash-4.2 and newer) use the lastpipe option
      The lastpipe option instructs the shell to execute the last pipeline command in the main process.
       $ (shopt -s lastpipe; a="aaa"; echo "one" | read a; echo $a)
       one
      It is important that, on the command line, the lastpipe option must be set in the same process where the corresponding pipeline will run, hence the parentheses in the example above. In scripts the parentheses are not required.
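    One more bash-specific workaround, not listed above, is process substitution: the producer, rather than read, is the side moved into a subprocess, so the variable assignment happens in the current shell. A minimal sketch:

```shell
# read runs in the current shell; the command inside <(...) runs in a
# subprocess whose output is exposed as a readable file.
a=aaa; b=bbb
read a b < <(echo "one two")
echo "$a $b"   # one two
```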

Additional Information


Source: https://habr.com/ru/post/195152/

