As 2019 arrives, it is worth remembering the past and thinking about the future. Let's look back 30 years and reflect on the first scientific papers on fuzzing: "An Empirical Study of the Reliability of UNIX Utilities" and the subsequent 1995 work "Fuzz Revisited" by the same author, Barton Miller.

In this article we will try to find bugs in modern versions of Ubuntu Linux using the same tools as in the original fuzzing papers. The original papers are worth reading not only for context but in their own right: they turned out to be very prophetic about the vulnerabilities and exploits of the decades to come. Attentive readers may notice the publication date of the original paper: 1990. Even more attentive readers will notice the copyright date in the source code comments: 1989.
Quick summary

For those who have not read the papers (though you really should), this section provides a brief summary and some choice quotes.
The fuzz program generates random streams of characters, with options to emit only printable or also non-printable characters. It uses an initial seed value, making its results reproducible, something modern fuzzers often lack. A set of scripts runs the tested programs on the fuzzed input and checks for core dumps. Hangs are detected manually. Adapters supply random input to interactive programs (the 1990 paper), network services (1995), and graphical X applications (1995).
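To make this concrete, here is a minimal sketch in the spirit of the generator described above. It is not Miller's code; the program name, flags, and sizes are made up for illustration, but the idea is the same: a seeded, reproducible stream of random bytes, optionally restricted to printable ASCII.

```c
/* fuzz.c - minimal sketch of a 1989-style fuzz generator (not Miller's tool).
 * Usage: ./fuzz <seed> <nbytes> [-p]   ("-p" = printable characters only)
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <seed> <nbytes> [-p]\n", argv[0]);
        return 1;
    }
    unsigned seed = (unsigned)strtoul(argv[1], NULL, 10); /* fixed seed => reproducible stream */
    long nbytes   = strtol(argv[2], NULL, 10);
    int printable = (argc > 3 && strcmp(argv[3], "-p") == 0);

    srand(seed);
    for (long i = 0; i < nbytes; i++) {
        int c = printable ? ' ' + rand() % 95  /* printable ASCII 0x20..0x7E */
                          : rand() % 256;      /* any byte value */
        putchar(c);
    }
    return 0;
}
```

A harness then pipes this stream into each utility under test, e.g. `./fuzz 42 100000 | some-utility`, and checks the exit status and the working directory for a core dump.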
The 1990 paper tests four processor architectures (i386, CVAX, Sparc, 68020) and five operating systems (4.3 BSD, SunOS, AIX, Xenix, Dynix). The 1995 paper uses a similar selection of platforms. The first paper manages to crash 25-33% of utilities, depending on the platform. In the follow-up, the figures range from 9% to 33%, with GNU (on SunOS) and Linux showing the lowest failure rates.
The 1990 paper concludes that (1) programmers do not check array bounds or error codes, (2) macros make code hard to read and debug, and (3) the C language is very unsafe. The extremely unsafe gets function and C's type system receive special mention. In the course of testing, the authors found format string vulnerabilities years before their mass exploitation. The paper ends with a user survey on how often people fix or report bugs. It turned out that reporting bugs was difficult and that there was no particular interest in fixing them.
The 1995 paper mentions open source software and discusses why it has fewer bugs. Quote:

When we investigated the causes of the failures, an alarming phenomenon emerged: many of the bugs (about 40%) reported in 1990 were still present, in their exact form, in 1995. ...

The techniques used here are simple and mostly automatic. It is hard to understand why developers do not take advantage of this easy and free source of improved reliability.
It would take another 15-20 years before fuzzing became standard practice at large vendors.
I also think this 1990 statement foresees future events:

Often the terseness of the C programming style is taken to the extreme, with form prevailing over correct function. The possibility of overflowing an input buffer is a potential security hole, as the recent Internet worm showed.
Testing Methodology
Fortunately, 30 years later, Dr. Miller still provides the complete source code, scripts, and data needed to reproduce his results: a good example for other researchers to follow. The scripts work without problems, and the fuzz tool required only minor changes to compile and run.

For these tests we used the scripts and input data from the fuzz-1995-basic repository, because it carries the most recent list of tested applications. According to the README, it uses the same random inputs as the original study. The results below for modern Linux were obtained with exactly the same fuzzing code and input data as in the original papers; only the list of utilities under test has changed.
Changes in utilities over 30 years

Obviously, Linux software packages have changed over the past 30 years, although quite a few venerable utilities have continued their lineage for decades. Where possible, we took modern versions of the same programs from the 1995 paper. Some programs are no longer available, and we replaced them. Justification for all the replacements:
- cfe ⇨ cc1: the equivalent of the C compiler front end from the 1995 paper.
- dbx ⇨ gdb: the equivalent of the 1995 debugger.
- ditroff ⇨ groff: ditroff is no longer available.
- dtbl ⇨ gtbl: the GNU Troff equivalent of the old dtbl utility.
- lisp ⇨ clisp: the standard Lisp implementation.
- more ⇨ less: less is more!
- prolog ⇨ swipl: there are two Prolog options, SWI Prolog and GNU Prolog. SWI Prolog is preferable because it is an older and more complete implementation.
- awk ⇨ gawk: the GNU version of awk.
- cc ⇨ gcc: the standard C compiler.
- compress ⇨ gzip: gzip is the ideological heir of the old Unix compress utility.
- lint ⇨ splint: lint rewritten under the GPL license.
- /bin/mail ⇨ /usr/bin/mail: the equivalent utility at a different path.
- f77 ⇨ fort77: there are two options for a Fortran77 compiler, GNU Fortran and Fort77. The former is recommended for Fortran 90, the latter for Fortran77 support. The underlying f2c program is still actively maintained, with a change log dating back to 1989.
Results

The fuzzing technique of 1989 still finds bugs in 2018, but there has been some progress.

To measure progress we need a baseline, and fortunately there is one for Linux utilities. Although Linux did not yet exist at the time of the original 1990 paper, the repeat 1995 test ran the same fuzzing code against utilities from the 1995 Slackware 2.1.0 distribution. The corresponding results appear in Table 3 of the 1995 paper (pp. 7-9). Compared with its commercial rivals, GNU/Linux looked very good:

The failure rate of the utilities on the freely distributed Linux version of UNIX was the second lowest, at 9%.
So let's compare the Linux utilities of 1995 and 2018 using the fuzzing tools of 1989:
| | Ubuntu 18.10 (2018) | Ubuntu 18.04 (2018) | Ubuntu 16.04 (2016) | Ubuntu 14.04 (2014) | Slackware 2.1.0 (1995) |
|---|---|---|---|---|---|
| Crashes | 1 (f77) | 1 (f77) | 2 (f77, ul) | 2 (swipl, f77) | 4 (ul, flex, indent, gdb) |
| Hangs | 1 (spell) | 1 (spell) | 1 (spell) | 2 (spell, units) | 1 (ctags) |
| Total tested | 81 | 81 | 81 | 81 | 55 |
| Crashes/hangs, % | 2% | 2% | 4% | 5% | 9% |
Surprisingly, the number of crashes and hangs on Linux is still greater than zero, even on the latest version of Ubuntu: f77 invokes the f2c program, which crashes with a segmentation fault, and the spell program hangs on two of the test inputs.
What are the bugs?
I was able to manually track down the root cause of some of the bugs. Some results, such as the bug in glibc, were unexpected; others, such as an sprintf into a fixed-size buffer, were predictable.
The ul crash

The error in ul is actually a bug in glibc. Specifically, it was reported here and here (another person found it in ul) in 2016. According to the bug tracker, the bug is still not fixed. Since it cannot be reproduced on Ubuntu 18.04 and newer, it has been patched at the distribution level. Judging by the comments in the bug tracker, the underlying problem may be quite serious.
The f77 crash

The f77 program ships in the fort77 package, where it is itself a wrapper script around f2c, a translator from Fortran77 source code into C. Debugging f2c shows that the crash occurs when the errstr function prints an overly long error message. The f2c source shows that it uses sprintf to write a variable-length string into a fixed-size buffer:
```c
 void
#ifdef KR_headers
errstr(s, t)
	char *s, *t;
#else
errstr(const char *s, const char *t)
#endif
{
	char buff[100];
	sprintf(buff, s, t);  /* variable-length message into a 100-byte stack buffer */
	err(buff);
}
```
This code appears to have been there since the creation of f2c; the program's change log has been maintained since at least 1989. The 1995 re-fuzzing did not test a Fortran77 compiler, otherwise the problem might have been found earlier.
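As a sketch only (this is not the upstream fix, and the err declaration is assumed from context), the usual hardening for this pattern is to bound the write with snprintf, so an overlong message is truncated instead of overflowing the stack buffer:

```c
#include <stdio.h>

extern void err(const char *);  /* f2c's error reporter, assumed declared elsewhere */

void
errstr(const char *s, const char *t)
{
	char buff[100];
	/* snprintf writes at most sizeof(buff) bytes, including the NUL,
	 * so an overlong message is truncated rather than smashing the stack. */
	snprintf(buff, sizeof(buff), s, t);
	err(buff);
}
```

Note that s is still used as a format string here, exactly the pattern the original papers flagged years before format string attacks became widespread.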
The spell hang

A great example of a classic deadlock. The spell program delegates spell checking to the ispell program via a pipe. spell reads text line by line and issues a blocking write, one line at a time, to ispell. ispell, however, reads at most BUFSIZ/2 bytes at a time (4096 bytes on my system) and then issues a blocking write back to make sure the client has received the spelling results processed so far. Two different test inputs caused spell to write a line of more than 4096 characters to ispell, which led to a deadlock: spell waits for ispell to read the entire line, while ispell waits for spell to read the spelling results it has already produced.
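Here is a minimal sketch of the same deadlock pattern (hypothetical buffer sizes; this is not the actual spell or ispell source). Each process issues a blocking write while the other is also blocked in a write, so once both pipe buffers fill up, neither ever reaches its read:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LINE (1 << 20)  /* 1 MiB "line": far larger than the kernel pipe buffer */

int main(void) {
    int to_child[2], to_parent[2];
    char *line = malloc(LINE);
    if (!line || pipe(to_child) < 0 || pipe(to_parent) < 0) {
        perror("setup");
        return 1;
    }
    memset(line, 'a', LINE);

    if (fork() == 0) {                               /* child plays "ispell" */
        char chunk[4096];
        read(to_child[0], chunk, sizeof(chunk));     /* consume only one chunk... */
        write(to_parent[1], line, LINE);             /* ...then block writing results */
        return 0;
    }
    /* parent plays "spell": write the whole long line before reading any replies */
    write(to_child[1], line, LINE);                  /* blocks: the child stopped reading */
    char reply[4096];
    read(to_parent[0], reply, sizeof(reply));        /* never reached */
    puts("no deadlock");
    return 0;
}
```

The standard fixes are to make one side non-blocking, to multiplex both pipe ends with select or poll, or simply to drain the reader before issuing large writes.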
The units hang

At first glance, this looks like an infinite-loop condition. The hang appears to be in libreadline rather than in units, and newer versions of units do not suffer from this bug. The change log indicates that input filtering was added, which may have fixed the problem by accident. A thorough root-cause investigation is beyond the scope of this post, however, so a way to hang libreadline may still remain.
The swipl crash

For completeness, I want to mention the swipl crash, although I have not studied it carefully, since the bug was fixed long ago and it appears to be a fairly high-quality failure. The crash is actually an assertion (that is, a condition that should never occur) triggered during character conversion:
```
[Thread 1] pl-fli.c:2495: codeToAtom: Assertion failed: chrcode >= 0
C-stack trace labeled "crash":
  [0] __assert_fail+0x41
  [1] PL_put_term+0x18e
  [2] PL_unify_text+0x1c4
  …
```
Crashing is always bad, but at least here the program is able to report the error by failing early and loudly.
Conclusion

Over the past 30 years, fuzzing has remained a simple and reliable way to find bugs. Although active research continues in this area, even a 30-year-old fuzzer can successfully find bugs in modern Linux utilities.
The author of the original papers predicted the security problems that C would cause in the coming decades. He argues convincingly that it is too easy to write unsafe code in C and that the language should be avoided where possible. In particular, the papers demonstrate that bugs surface even with the simplest fuzzing, and that such testing should be part of standard software development practice. Unfortunately, this advice has gone unheeded for decades.
I hope you enjoyed this 30-year retrospective. Look out for the next article, Fuzzing 2000, in which we will examine how robust Windows 10 applications are compared to their Windows NT/2000 equivalents when tested with a fuzzer. I think the answer is predictable.