I bring to your attention a translation of a recent post in the electronic journal Queue, by Poul-Henning Kamp.Did Ken, Dennis, and Bryan make a mistake when choosing to use NUL-terminated text strings?
IT stimulates and implements the modern Western economy. Accordingly, we often see headlines about overwhelmingly large sums of money associated with errors in IT. What kind of solution related to IT or CN [computer science] is the most expensive?
Recently, a sufficient number of experts shrugged at the mention of the financial implications for Sony of the problems associated with their PlayStation Network, but this event is insignificant in this context. When I was in school, I had the opportunity to talk with the inspector from the Guinness Book of Records, who explained that something could not just happen to be a real record, there should be a causal relationship that begins with human intent (for example, we drove 26 high school students into Volkswagen Beetle our music teacher and closed the door).
')
Sony (probably) did not want to find out what kind of confusion will happen if you pay little attention to safety, so this and other examples of false savings do not pass the selection. Another choice could be IBM’s choice of Bill Gates, rather than Gary Kildall, as the supplier of the operating system for their personal computer. The damage from this solution is still accumulating at breakneck speed: Stuxnet [approx. per. network worm] and the distortion of the ISO standardization process OOXML vividly demonstrate how far and wide the damage can spread. But this was not a decision related to IT or CN. As is currently known, this was a business decision based on Kildal’s refusal to accept IBM's non-disclosure requirements.
A more suitable example would be a solution to invent your own directory / file name separator for MS-DOS: a backslash (\) was chosen, rather than a forward slash (/) that was used in Unix or a dot that used DEC in its operating systems . Although the actual damage is relatively moderate, this solution is also not suitable as a good example, since it was not a real solution, it was a true preference. IBM chose to use slashes for command flags, ignoring Unix, and the dot was used as a separator for the file name and extension, which made it impossible to follow the DEC example.
The history of space exploration offers a whole set of untwisted and expensive mistakes, but, interestingly, I could not find any suitable candidate. Fortran syntax errors and sync errors of the onboard computer of the shuttle are not suitable due to lack of intent. Maintaining part of the project in the imperial units of measurement, and the other part in metric is a “random act of control” and does not relate to CN or IT in any way.
The best example I could find is the use of NUL-terminated text lines in C / Unix / Posix. It was a very simple choice: should C represent strings as a tuple address + length or just as an address and some magic character (NUL) marking the end of the string? It was this decision that Dynamic Trio Ken Thompson, Dennis Ritchie, and Brian Kernighan took in the early 1970s, and they were free to choose any method. I could not find a single record of the decision, which I recognize as a weak candidate: I have no evidence that this decision was deliberate.
However, as far as I can determine from my research, the address + length format was preferred for most programming languages ​​of that time, while the address + magic marker was used mainly in programs written in assembler. Since language C grew out of assembler, like a more portable and high-level language, it's hard for me to believe that Ken, Dennis and Brian didn't even think about it.
Using the address + length format would cost a single byte over the address + magic marker, and their PDP computer would have limited basic memory. In other words, this example could be an ideally typical and rational decision related to IT or CN, like so many similar decisions we make every day. But it is precisely this decision that has atypical economic consequences.
Hardware development costs
Initially, Unix had little influence on the development of hardware and a set of commands [approx. per. processor]. Processors that provide string manipulation instructions, such as the Z-80 and DEC VAX, implemented this using the much more common address + offset model. As soon as Unix and C received support, NUL-terminated strings became the target for optimizations. CPU developers have begun to add instructions for working with them. An example is the Logical String Assist instructions added by IBM in 1992 to processors based on ES / 9000 520.
Adding instructions to the CPU is not cheap and occurs only when there are tangible and quantifiable monetary reasons.
Performance costs
IBM added instructions for working with NUL-terminated strings because its customers spent expensive CPU cycles on processing such strings. However, this does not tell us that it would take less CPU cycles if the address + length format were used.
Our question is omitted if we think about virtual memory systems. Optimizing the movement of a string of bytes of known length favorably uses the entire width of the memory bus and cache lines, without affecting memory areas that are not part of the source or destination string.
One example is the libc library in FreeBSD, where the bcopy (3) / memcpy (3) implementation moves as much information as possible into fragments over an “unsigned integer”: usually 32 or 64 bits, and then “clears any trailing bytes” using byte-by-copy, like described in the comments.
When using a NUL-terminated string, attempting to work with parts of it that are larger than one byte may result in characters being accessed by the NUL character. If the NUL character is the last byte of the virtual memory page and the next page is not defined, it may crash the process with an error “page not found” [“page not present”].
Of course, you can write code that identifies possible pitfalls before executing the optimized code, but this adds a relatively high fixed cost to all operations of moving lines - an unfavorable exchange in any case.
If we know the length of the string in advance, everything is different.
Compiler development costs
Often the compiler knows one thing about strings - their length, especially when strings are constant. This allows the compiler to make a call to a faster memcpy (3) even if the programmer used strcpy (3) in the source code.
A deeper code inspection allows the compiler to perform more advanced optimizations, some of which are very skillful, but only if someone has implemented this in the compiler. Developing optimizing compilers has never been easy or cheap, but it’s obvious that Apple hopes this will change using LLVM (Low-level Virtual Machine), where optimizers are in bulk.
The down side of deep compiler optimizations are optimizations that provide a holistic review of the source code and reorder it in large volumes. They require the programmer to carefully ensure that the source code accurately describes what is needed. The programmer who worked with the compiler for supercomputers of the Convex C3800 series, designated it as “it is necessary to program as if the compiler is your ex-wife”.
Security costs
Even if your compiler has no hostile intent, the source code must be written in such a way that it can withstand attacks and the NUL-terminated string has a dismal file in this regard. A complete disaster, from the point of view of security, is gets (3), which “believes that the buffer will be large enough” and this problem is “relatively under our control”.
Taking control, however, means that the compiler will additionally complain if gets (3) is called somewhere. Despite 15 years of attention, overflow and underload [under-running] of string buffers are the preferred attack vector for intruders and too often it pays off.
These risks have been reduced at all levels. Long-missing bits that prohibit execution have been added to CPU memory management hardware; operating systems and compilers added address space randomization, often at the expense of performance; Static and dynamic analysis of programs has absorbed countless hours in an attempt to find out whether insidious diagnostics are an error or clever programming.
Still, no one would be surprised if it turned out that Sony’s problems began with a buffer overflow or misinterpretation of the end of the line.
Preventing Slashdot Sensation
We learn from our mistakes and do not come up with “yellow” headlines for this article. Ken, Dennis and Bryan could not have foreseen all the consequences of their choice, made 30 years ago, and then they did not guarantee anything. As far as I know, it took 15 years to realize that this subtle solution is a bad idea. Probably none of my decisions lasted as long.
In other words, Ken, Dennis and Brian did everything right.
But this does not solve the problem.
For many people, C is dead, and $ {lang} is the language of the future, for the ever-changing short term value of $ {lang}. In reality, all other languages ​​directly or indirectly sit on top of the Posix API and NUL-terminated strings of C.
When your Java, Python, Ruby, or Haskell program opens a file, the runtime passes the file name as a NUL-terminated string to open (3); or when it converts queue.acm.org to an IP address, it transmits the host name as a NUL-terminated string to getaddrinfo (3). As long as you continue to do so, you will retain all the benefits of running your program on the PDP / 11, and all the disadvantages when running on something else.
You can write here your API proposal against an imaginary adversary, suggest a way to define functions, operations and error handling strategies, but I’m sure it will be a complete waste of a wonderful evening. Experience shows that such sentences go nowhere, since backward compatibility with the PDP / 11 and the final number of programs written are much more important than the potential to write an infinite number of programs in the future in a more efficient and safe way.
Thus, the costs of Ken, Dennis, and Bryan’s decisions will continue to pile up like the dust that almost buried Ancient Rome for centuries.
Links
1. Computer Business Review. 1992. Partitioning and Escon enhancements for ES / 9000s;
http://www.cbronline.com/news/ibm_announcements_71 .
2. ViewVC. 2007. Contents of /head/lib/libc/string/bcopy.c;
http://svnweb.freebsd.org/base/head/lib/libc/string/bcopy.c?view=markup .
3. Wikipedia. 2011. Lifeboat sketch;
http://en.wikipedia.org/wiki/Lifeboat_sketch .