Working around bugs in utilities from the GNU Core Utilities package

The coreutils package is preinstalled on many Linux distributions. It contains standard, familiar utilities such as cat, chmod, date, echo, ls, and many others. But even such a canonical package has bugs that can get in the user's way. I ran into one of them myself, and I want to describe how I worked around it.

The task was to reformat a text file with long lines so that no line was longer than 80 characters. Longer lines had to be broken into several lines of 80 characters or fewer. The file is encoded in UTF-8. A little googling shows that on Unix-like systems this job is done by the fold utility. Great, let's use it. To start, let's run a couple of test commands in the terminal to learn how it behaves. The output below comes from a Debian 7.5 system with coreutils 8.13. An Arch Linux system with coreutils 8.22 produces the same output.

When executing all the test commands, the locale settings are as follows:

 $ locale
 LANG=ru_RU.UTF-8
 LC_CTYPE="ru_RU.UTF-8"
 LC_NUMERIC="ru_RU.UTF-8"
 LC_TIME="ru_RU.UTF-8"
 LC_COLLATE="ru_RU.UTF-8"
 LC_MONETARY="ru_RU.UTF-8"
 LC_MESSAGES="ru_RU.UTF-8"
 LC_PAPER="ru_RU.UTF-8"
 LC_NAME="ru_RU.UTF-8"
 LC_ADDRESS="ru_RU.UTF-8"
 LC_TELEPHONE="ru_RU.UTF-8"
 LC_MEASUREMENT="ru_RU.UTF-8"
 LC_IDENTIFICATION="ru_RU.UTF-8"
 LC_ALL=ru_RU.UTF-8

If your settings differ, run:

 $ export LC_ALL="ru_RU.UTF-8"

Let's have a test command break the string "abcdefghij" into lines of 4 characters each:

 $ echo "abcdefghij" | fold -w 4
 abcd
 efgh
 ij

Great! Now the string "абвгдеёжзи":

 $ echo "абвгдеёжзи" | fold -w 4
 аб
 вг
 де
 ёж
 зи

And here we are in for a surprise: the string "абвгдеёжзи" was broken into lines of two characters. The reason is that a Cyrillic character encoded in UTF-8 occupies two bytes, whereas a Latin character occupies one. The fold utility, treating every character as a single byte, simply split the string (an array of bytes) into pieces of 4 bytes each. As you can see, with UTF-8 input this splitting algorithm is correct only for Latin characters. Meanwhile, the wc utility correctly counts the number of characters in the same string:

 $ echo -n "абвгдеёжзи" | wc -m
 10

This suggests that Unicode support in coreutils is only partially implemented, and that the results of running different utilities on Unicode text can be unpredictable.
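
The byte-versus-character difference can also be checked without relying on the locale. The following is a minimal sketch, assuming a POSIX shell with tr and wc; the variable names are illustrative. It counts the bytes of the string, then counts its characters by deleting UTF-8 continuation bytes (0x80 to 0xBF), which leaves exactly one byte per character:

```shell
# Sample string: 10 Cyrillic letters, each encoded as 2 bytes in UTF-8.
s='абвгдеёжзи'

# wc -c counts bytes regardless of the locale.
bytes=$(printf '%s' "$s" | wc -c)

# Every UTF-8 character has exactly one non-continuation byte.
# Deleting continuation bytes (octal 200-277) in the C locale
# therefore leaves one byte per character for wc -c to count.
chars=$(printf '%s' "$s" | LC_ALL=C tr -d '\200-\277' | wc -c)

# $((...)) strips any padding that some wc implementations add.
echo "bytes=$((bytes)) chars=$((chars))"
```

For this string the script prints bytes=20 chars=10, which is why a width of 4 bytes holds only two Cyrillic letters.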

In fact, this bug has been known for several years. It has been described in more than one place, and the developers have even responded, but unfortunately it is still in the "this is not a bug, it's a feature" state.

The above does not apply to BSD systems, which have their own implementations of the standard utilities. A test on FreeBSD 10 showed that everything is in order with Unicode there.

Now let's talk about how to work around this bug. I know of two coreutils replacements: BusyBox and Heirloom. The first option seemed more up to date and simpler to me, so I'll show how to use it to build a workaround that lets you use fold normally on your system. You can build a workaround for any other standard utility in the same way.

First, install the busybox package. On Debian, the command is:

 # apt-get install busybox

On Arch Linux, the equivalent command is:

 # pacman -S busybox

According to the documentation, BusyBox can be used like this:

 $ busybox ls -l
 $ busybox ps
 $ busybox seq 1 5

That is, you simply pass the utility name as an argument to the busybox executable. You can also rename the executable to one of the commands it supports, and it will automatically behave as that command. We will not rename it; instead, we will create a symbolic link to it named fold:

 # cd $(dirname $(which fold))
 # mv fold fold.orig
 # ln -s $(which busybox) fold
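
A less invasive variant of the same trick leaves the system copy of fold untouched and puts the symbolic link in a per-user directory instead. This is only a sketch under stated assumptions: link_applet is a hypothetical helper name, and the directory in the usage example is assumed to come before the system directories in PATH.

```shell
# Create DIR/APPLET as a symbolic link to a multi-call binary such as busybox.
# Arguments: $1 = path to the multi-call binary
#            $2 = applet name (the command to impersonate)
#            $3 = directory in which to create the link
link_applet() {
    target=$1
    applet=$2
    dir=$3
    # Create the directory if needed, then create or replace the link.
    mkdir -p "$dir" && ln -sf "$target" "$dir/$applet"
}

# Example, to be run manually (assumes busybox is installed and that
# ~/.local/bin precedes the system directories in PATH):
#   link_applet "$(command -v busybox)" fold "$HOME/.local/bin"
#   hash -r    # make the current shell forget the cached location of fold
```

With this variant there is nothing to restore after a package upgrade, and removing the link brings back the original utility.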

After that, fold can be used in the usual way, from a terminal or from a script. Such a patch to the system is acceptable to me, and I would be glad if it helps someone else too. In the meantime, we can only hope that someday coreutils will fully support Unicode.
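
To check the result against the original goal, it may help to measure the longest line in characters rather than in bytes. Below is a rough sketch, assuming valid UTF-8 input and a POSIX shell with tr and awk. The name max_line_chars is hypothetical, and the function counts code points, not display columns, so combining or double-width characters are not accounted for.

```shell
# Print the length, in UTF-8 characters, of the longest line on standard input.
max_line_chars() {
    # Drop UTF-8 continuation bytes so each character is one byte,
    # then let awk (in the C locale) find the longest line in bytes.
    LC_ALL=C tr -d '\200-\277' |
        LC_ALL=C awk '{ n = length($0); if (n > max) max = n }
                      END { print max + 0 }'
}
```

For example, with a hypothetical file input.txt, running fold -w 80 input.txt | max_line_chars should print a number no greater than 80 once fold splits by characters.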

Source: https://habr.com/ru/post/221945/
