
Dissecting the GNU Coreutils source code: the yes utility




What for?



Everyone around keeps saying: “Want to learn how to write professional programs? Look at how others do it!” So I decided to follow this advice, especially since my university studies are just coming to an end. It is particularly interesting to compare what we were taught with how things are done in the real world. I chose the GNU Coreutils package as the example to follow. It has everything:

  1. Strict portability requirements.
  2. A long life cycle.
  3. A huge development team.
  4. Code of varying complexity: from the trivial echo to the super-sophisticated sed, from the purely applied wc to mkdir, which sits closer to the OS.


GNU Coreutils



GNU Core Utilities is a set of utilities for basic user operations: creating a directory, displaying a file on the screen, and so on. According to the developers, these utilities should be available on any operating system, which is what we see today: Cygwin covers Windows, and for *nix systems it goes without saying. The POSIX standard, which Coreutils tries to comply with, helps keep behavior uniform across systems. Coreutils contains such commonly used utilities as cat, tail, echo, wc, and many others.



To begin with, let's pick the most trivial program, yes. Its simplicity makes it easy to get acquainted with the tools and libraries used in Coreutils.


The yes utility



As the man page says, all the yes utility can do is print "y\n" to stdout endlessly. If yes is given arguments, it prints those arguments, separated by spaces, instead of "y". Surely everyone who has started learning C has written a similar program, so many readers can compare their own approach with how the stern, bearded guys from GNU do it. Wikipedia has a little to say about practical uses of yes.



Source



On to the source code. You can get it either with apt-get source, which fetches the version your system uses by default, or by pulling the latest version from the repositories. We will take the second option: it is more convenient and familiar.

  1. Coreutils: git clone git://git.sv.gnu.org/coreutils
  2. Gnulib (look there a couple of times): git clone git://git.savannah.gnu.org/gnulib.git


The source code of yes fits in a single file, coreutils/src/yes.c, so let's open it.



Coding style



The first thing you notice is the unusual formatting of the code. You can read about it in the relevant chapter of the GNU Coding Standards. For example, in a function definition the return type goes on a separate line, and so does the opening brace:



    int
    main (int argc, char **argv)
    {
      foo ();
      ...
    }


Only spaces are used for indentation and alignment. Each nesting level is indented by 2 more spaces. Braces around compound statements take an especially peculiar form:



    if (x < foo (y, z))
      haha = bar[4] + 5;
    else
      {
        while (z)
          {
            haha += foo (z, z);
            z--;
          }
        return ++x + bar ();
      }


12 lines



yes.c begins with a comment that is mandatory for all GPL programs. It had already become an eyesore for me in other programs, and why it had to be there was a mystery. It turns out that the text of this comment is fixed in the instructions for using the GPL, which say that anyone who wants to release software under the GPL must add these 12 lines of copyright notice to the beginning of every source file.



initialize_main



The first thing the program does is call initialize_main. This function is meant to let a program perform its own specific processing of the arguments. In practice, not a single Coreutils utility uses it for anything useful. Everywhere the stub from coreutils/src/system.h is used:



    #ifndef initialize_main
    # define initialize_main(ac, av)
    #endif
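As a rough illustration of how such a stub can be overridden, here is a minimal sketch. The names platform_init, hook_calls and run_startup are hypothetical and exist only for this example; they are not part of Coreutils. The sketch assumes the hook is called with the addresses of argc and argv.

```c
/* Hypothetical hook: it only counts how many times it ran. */
static int hook_calls = 0;

static void
platform_init (int *ac, char ***av)
{
  (void) ac;
  (void) av;
  hook_calls++;
}

/* A build that needs the hook defines the macro before the stub is seen.
   Remove this line and the call in run_startup expands to nothing. */
#define initialize_main(ac, av) platform_init (ac, av)

/* The stub: takes effect only when nobody has defined the macro. */
#ifndef initialize_main
# define initialize_main(ac, av)
#endif

/* Stand-in for the start of main; returns how many times the hook ran. */
int
run_startup (int argc, char **argv)
{
  initialize_main (&argc, &argv);
  return hook_calls;
}
```

Because the default definition is an empty macro, a program that does not need the hook pays nothing for the call at run time.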


The name of the program



Coreutils utilities distinguish two program names:

  1. The official name, which the user cannot change.
  2. The real name of the executable file.


The official name is used when displaying application version information:



    user@laptop:~$ yes --version
    yes (GNU coreutils) 8.5
    Usage: yes [STRING]...
      or:  yes OPTION


Moreover, this name does not depend on the name of the executable file:



    user@laptop:~$ /usr/bin/yes --version
    yes (GNU coreutils) 8.5
    user@laptop:~$ cp /usr/bin/yes ./foo
    user@laptop:~$ ./foo --version
    yes (GNU coreutils) 8.5


This behavior is provided by the PROGRAM_NAME macro, defined at the beginning of the file:



    /* The official name of this program (e.g., no `g' prefix).  */
    #define PROGRAM_NAME "yes"


The real name is taken from argv[0] without any tricks and is used when displaying errors and usage hints:



    user@laptop:~$ yes --help
    Usage: yes [STRING]...
      or:  yes OPTION
    user@laptop:~$ /usr/bin/yes --help
    Usage: /usr/bin/yes [STRING]...
      or:  /usr/bin/yes OPTION


The value of argv[0] is stored in the global variable program_name by the call to set_program_name on the second line of main:

    set_program_name (argv[0]);


The set_program_name function is provided by the Gnulib library. The corresponding code lives in the gnulib/lib/ directory, in the files progname.h and progname.c. Interestingly, set_program_name not only saves the value of argv[0] in the global variable program_name declared in progname.h, but also performs additional adjustments related to the subtleties of GNU Libtool, a tool for building dynamic libraries.
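The split between the two names can be sketched as follows. This is a simplified illustration, not Gnulib's code: set_program_name_sketch, format_version, format_usage and started_as_official_name are hypothetical names, and the Libtool-related adjustments are deliberately left out.

```c
#include <stdio.h>
#include <string.h>

/* Official name: fixed at compile time, as in yes.c. */
#define PROGRAM_NAME "yes"

/* Real name: set at run time from argv[0]. */
const char *program_name = "";

/* Simplified stand-in for set_program_name: it only stores the pointer. */
void
set_program_name_sketch (const char *argv0)
{
  program_name = argv0;
}

/* First line of --version output: always the official name. */
int
format_version (char *buf, size_t size)
{
  return snprintf (buf, size, "%s (GNU coreutils)", PROGRAM_NAME);
}

/* First line of --help output: the name the program was started with. */
int
format_usage (char *buf, size_t size)
{
  return snprintf (buf, size, "Usage: %s [STRING]...", program_name);
}

/* Nonzero when the program was started under its official name. */
int
started_as_official_name (void)
{
  return strcmp (program_name, PROGRAM_NAME) == 0;
}
```

After set_program_name_sketch ("./foo"), format_version still produces a line starting with "yes", while format_usage produces one starting with "Usage: ./foo", which mirrors the behavior shown in the terminal sessions above.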



Internationalization



Coreutils is used all over the world, so all of its utilities support localization. This comes with minimal effort thanks to the GNU gettext package. Few will be surprised to see gettext here, since the package has spread far beyond the GNU project. For example, internationalization in my favorite web framework, Django, is built on gettext. Using gettext with various languages and frameworks has already been covered on Habr.



A great feature of gettext is that it is used in roughly the same way in every language, and C is no exception. Here, too, we find the standard magic function _, which can be seen in the usage function:



    void
    usage (int status)
    {
      if (status != EXIT_SUCCESS)
        fprintf (stderr, _("Try `%s --help' for more information.\n"),
                 program_name);
      ...
    }


The definition of _ is in the already familiar system.h file:

    #define _(msgid) gettext (msgid)


Initialization of the internationalization mechanism in Coreutils is performed by calling three functions in main :



    setlocale (LC_ALL, "");
    bindtextdomain (PACKAGE, LOCALEDIR);
    textdomain (PACKAGE);
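The three calls and the _ macro can be tried out in isolation with a sketch like the one below. It assumes a system where libintl.h is available (for example, glibc). The values of PACKAGE and LOCALEDIR are made up for the example; in Coreutils they come from the build system. The helpers init_i18n and is_translated are hypothetical.

```c
#include <libintl.h>
#include <locale.h>
#include <string.h>

#define _(msgid) gettext (msgid)

/* Hypothetical values, chosen only for this sketch. */
#define PACKAGE "yes-demo"
#define LOCALEDIR "/usr/share/locale"

/* The same three initialization calls that main performs. */
void
init_i18n (void)
{
  setlocale (LC_ALL, "");               /* adopt the user's locale */
  bindtextdomain (PACKAGE, LOCALEDIR);  /* where message catalogs live */
  textdomain (PACKAGE);                 /* default catalog for gettext */
}

/* Nonzero if a catalog supplied a translation that differs from msgid.
   When no translation is found, gettext returns msgid itself. */
int
is_translated (const char *msgid)
{
  return strcmp (_(msgid), msgid) != 0;
}
```

Since no catalog named yes-demo is expected to be installed, is_translated ("standard output") should report that the message falls back to the original English text.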




Error handling



Moving further through the code of main, we come across the following line:

    atexit (close_stdout);


Intuitively, one would expect close_stdout to close the standard output stream, which prevents data loss when stdout is redirected to a file and buffered output is used. However, I did not manage to find the source code of this function, so I cannot say what actually happens there or whether any additional resource cleanup is performed.
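For what it is worth, Gnulib ships a closeout module (lib/closeout.c) that provides close_stdout. The sketch below is not that code; it only illustrates the general idea the intuition above points at, under the assumption that the handler's job is to detect write errors that buffered output would otherwise hide. The names close_checked and close_stdout_sketch are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Closes a stream and reports whether an earlier write or the final flush
   failed. Returns 0 on success, -1 on failure. */
int
close_checked (FILE *stream)
{
  int had_error = ferror (stream) != 0;     /* an earlier write failed */
  int close_failed = fclose (stream) != 0;  /* flushing the buffer failed */
  return (had_error || close_failed) ? -1 : 0;
}

/* Exit handler in the spirit of close_stdout. It uses _exit because
   calling exit from inside an atexit handler is not allowed. */
void
close_stdout_sketch (void)
{
  if (close_checked (stdout) != 0)
    {
      fputs ("write error on standard output\n", stderr);
      _exit (EXIT_FAILURE);
    }
}
```

It would be registered the same way as in yes.c, with atexit (close_stdout_sketch). The point of the design is that a failed flush at exit (for example, a full disk) turns into a non-zero exit status instead of passing silently.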



Command line arguments



This is the last topic that does not concern the actual work of the program. Here, as with internationalization, a time-tested solution is used, one that has found its way into many projects (for example, into Python): the getopt module. The module is very simple: in essence, the developer only has to call one of the functions getopt or getopt_long in a loop. More about getopt can be found on the Internet, and it has been written about on Habr as well.
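A minimal sketch of such a loop is shown below. It is an illustration rather than the Coreutils code: the parse_result codes and the parse_args function are invented for this example, and the option table contains only --help and --version. The "+" in the option string tells getopt_long to stop at the first non-option argument.

```c
#include <getopt.h>
#include <stddef.h>

/* Result codes invented for this sketch. */
enum parse_result { PARSE_OK, PARSE_HELP, PARSE_VERSION, PARSE_ERROR };

/* Scans the options in argv. On PARSE_OK, *first_operand receives the
   index of the first argument that is not an option. */
enum parse_result
parse_args (int argc, char **argv, int *first_operand)
{
  static const struct option long_options[] = {
    { "help", no_argument, NULL, 'h' },
    { "version", no_argument, NULL, 'v' },
    { NULL, 0, NULL, 0 }
  };
  int c;

  opterr = 0;  /* keep getopt quiet; the caller reports errors */

  while ((c = getopt_long (argc, argv, "+", long_options, NULL)) != -1)
    switch (c)
      {
      case 'h':
        return PARSE_HELP;
      case 'v':
        return PARSE_VERSION;
      default:
        return PARSE_ERROR;  /* unknown option */
      }

  *first_operand = optind;
  return PARSE_OK;
}
```

A real program calls such a function once at startup. With the arguments "hello --help", the sketch returns PARSE_OK with the first operand at index 1, because scanning stops at "hello" and "--help" is then treated as an ordinary string.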



Gnulib has a special function parse_long_options for handling the --version and --help arguments, which any GNU application must support. It is located in the gnulib/lib/long-options.c file and uses getopt_long in its work.



The yes source code is a great example of working with getopt: there is no need to wade through the handling of dozens of options, yet all of the getopt machinery is in use. First, of course, parse_long_options is called. Then the code checks that no further options were passed and that the remaining arguments, if any, are just arbitrary strings:



    parse_long_options (argc, argv, PROGRAM_NAME, PACKAGE_NAME, Version,
                        usage, AUTHORS, (char const *) NULL);
    if (getopt_long (argc, argv, "+", NULL, NULL) != -1)
      usage (EXIT_FAILURE);


In plain words, the following code says: “If the command line contained no arguments apart from the --version and --help options, then we will print "y" to stdout”:



    if (argc <= optind)
      {
        optind = argc;
        argv[argc++] = bad_cast ("y");
      }


Writing to argv[argc] is not an error: the ANSI C standard requires the argv[argc] element to exist and to be a null pointer.
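The trick with the spare slot can be isolated in a small sketch. The function add_default_operand and the variable default_word are hypothetical names; a writable static array is used here so that no cast is needed, whereas yes.c uses bad_cast on a string literal.

```c
/* Ensures there is at least one operand to print. If argv has no operands
   starting at *first, the null pointer in argv[argc] is replaced with "y".
   Returns the new argument count. */
int
add_default_operand (int argc, char **argv, int *first)
{
  static char default_word[] = "y";  /* writable storage for the default */

  if (argc <= *first)
    {
      *first = argc;
      argv[argc++] = default_word;   /* reuse the terminating slot */
    }
  return argc;
}
```

Note that after the replacement the array is no longer terminated by a null pointer, so the code that follows has to rely on the returned count, just as the main loop of yes relies on argc.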



Main loop



Well, we have finally reached the actual functionality of the program. Here it is, in full:



    while (true)
      {
        int i;
        for (i = optind; i < argc; i++)
          if (fputs (argv[i], stdout) == EOF
              || putchar (i == argc - 1 ? '\n' : ' ') == EOF)
            error (EXIT_FAILURE, errno, _("standard output"));
      }
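To experiment with this loop without flooding the terminal, a bounded variant can be used. The sketch below keeps the same structure but writes to an arbitrary stream a fixed number of times and returns an error code instead of calling error. The function repeat_line and its parameters are hypothetical.

```c
#include <stdio.h>

/* Writes the words argv[first] .. argv[argc - 1] to out, separated by
   spaces and terminated by a newline, and repeats that line count times.
   Returns 0 on success, -1 on the first write error. */
int
repeat_line (FILE *out, int argc, char **argv, int first, int count)
{
  int n, i;

  for (n = 0; n < count; n++)
    for (i = first; i < argc; i++)
      if (fputs (argv[i], out) == EOF
          || putc (i == argc - 1 ? '\n' : ' ', out) == EOF)
        return -1;
  return 0;
}
```

As in the original, both output calls sit in the condition: the separator is written only if the word itself was written successfully, because || stops at the first failure.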


Note that all the work here is done inside the if condition rather than in its body. So Kernighan and Ritchie were not lying when they wrote that an experienced C programmer copies strings like this:

    while (*dst++ = *src++) ;
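For readers who want to try the idiom, here it is wrapped in a complete function. The name copy_string is hypothetical, and the explicit comparison with '\0' is added only to keep compilers from warning about an assignment used as a condition.

```c
/* Copies src, including the terminating '\0', to dst and returns dst.
   The caller must make sure dst is large enough. */
char *
copy_string (char *dst, const char *src)
{
  char *start = dst;

  /* Each iteration copies one character and advances both pointers;
     the loop stops right after the '\0' has been copied. */
  while ((*dst++ = *src++) != '\0')
    ;
  return start;
}
```

The loop body is empty: the assignment, the pointer increments and the termination test all live in the condition.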

Source: https://habr.com/ru/post/133408/


