Fighting a thirty-year bug

The first edition said about a twenty-year bug. In fact, he is 30 years old. Thanks Sidnekin .

Today, reading some data, my program processed 36'916 possible dates. Two of these 36'916 failed the test. I did not attach any importance to this, because these dates were from the data provided by the client, and such data is often surprising. However, looking at the source data, it turned out that the test did not pass on January 1, 2011 and January 1, 2007. There was a bug in the program I wrote a month ago. But it turned out that this bug is 30 years old.

Anyone who doesn’t really understand the software ecosystem will find it odd written below, but it makes sense. Because of the decision taken long ago to bring money to one company, my client spent money to pay me so that I could fix a bug introduced by one company by chance, and another one on purpose. To explain this, I will have to tell you about a third company that added a feature that eventually became a bug, and a few more facts that affected an incomprehensible bug that I fixed today.

In the good old days, Apple computers sometimes dropped the date to January 1, 1904. The reason is simple: at that time , Apple computers used battery-powered system time to keep track of the date and time. What happened when the battery sat down? Apple computers counted their dates as the number of seconds since the beginning of the era . The epoch in this case is just a reference date. And for Apple computers, this date was January 1, 1904. When the battery sat down this number became a new date. But why did this really happen?
')
In those days, Apple used 32 bits to store the number of seconds from the starting date. One bit can contain two values: 0 or 1. Two bits - four values: 00, 01, 10, 11. Three bits - eight values: 000, 001, 010, 011, 100, 101, 110, 111. And so on . How many values are contained in 32 bits? 32 bits contain 2 ³² (or 4'294'967'296) values. For Apple dates, this was about 136 years, so old Apple computers cannot work with dates after 2040, and if the battery in the system clock was down, the date was again 0 seconds after the era and you had to manually set the current number every time you turn on the computer (before buying a new battery).

However, Apple’s decision to store dates as the number of seconds after an epoch meant that it was impossible to store dates before an epoch. As we shall see, this had far-reaching consequences. It was a feature, not a bug, added by Apple. Among other things, this meant that the Macintosh operating system was immune to the Y2K problem (although many programs on the Mac did not have, since they had their own dates to circumvent the limitations of the Macs).

Moving further, we encounter Lotus 1-2-3 , a killer IBM application that helped launch the PC revolution, although in fact, VisiCalc at Apple gave rise to personal computers. It can be said that, if it were not for 1-2-3, then the PCs most likely did not leave their niche and computer technologies developed quite differently. However, Lotus 1-2-3 incorrectly considered the 1900th leap year. When Microsoft released Multiplan, its first spreadsheet program, it failed to conquer the market . So when developing Excel, it was decided not only to copy the column naming rules from Lotus 1-2-3, but also to make the products fully, up to a bug, compatible, including intentional handling from 1900 as a leap year, the problem is still relevant since then So for 1-2-3 it was a bug, but for Excel it was a feature that guaranteed to all 1-2-3 users the ability to import spreadsheets into Excel without differences in dates, even if they were wrong.

Over time, Microsoft decided to release an Excel version for the Apple Macintosh, but there was a problem. As already mentioned, Macintosh did not understand the dates until January 1, 1904, and for Excel, the era was January 1, 1900. So Excel corrected to recognize the epoch and store the date relative to the corresponding epoch. The Microsoft Support article describes this problem quite clearly . And this leads to my bug.

My current client receives spreadsheets from many of his clients. These tables could be made on Windows, or they could be made on Mac. As a result, the epoch in these tables may be January 1, 1900, or January 1, 1904. How to find out which one? The file format in Excel stores such information , but the parser that I use does not provide it and believes that you yourself know what epoch you are dealing with in this file. I probably should have spent a lot of time trying to figure out how to read the Excel binary format and send a patch to the parser developer, but I had other things to do with my $ client and I sketched the heuristics to determine what era this file belongs to. She was simple.

Excel can store, for example, July 5, 1998, but this number can be formatted as “07-05-98” (useless American format), “Jul 5, 98”, “July 5, 1998”, “5-Jul. 98 ”and many other useless options (ironically, the only format that my version of Excel does not offer is ISO 8601). Inside, the unformatted value is either 35981 for the era of 1900, or 34519 for the era of 1904 (these numbers correspond to the number of days that have passed since the era). I used a robust parser to extract the year from the formatted date. and then the Excel parser to extract the year from the unformatted value. If they differed by four, then the dates in the file were counted from 1904.

Why not use formatted dates right away? Because July 5, 1998 can be formatted as "July 1998", losing the day. We get spreadsheets from so many companies and they create them in so many different ways that they expect us (from me in this case) to understand. Excel understands what's what, then I have to!

It was here that 39082 gave me a push. Remember how Lotus 1-2-3 considered the 1900th leap year, and how it was honestly copied to Excel? Since it adds one day to 1900, many date calculating functions can be wrong for one day. This means that 39082 may be January 1, 2011 (on a Mac), or maybe December 31, 2006 (on Windows). This is great, of course, that my parser extracts 2011 from a formatted value. But since the Excel parser does not know from which epoch the dates in this file are calculated, it assumes by default that it is from 1900, returns the year 2006, my program sees that there is a difference of five years, considers that it is a mistake, writes it to the log and returns unformatted value.

To get around this, I came up with the following (pseudocode):

difference = formatted_year - parsed_year if ( 0 == difference ) assume 1900 date system if ( 4 == difference ) assume 1904 date system if ( 5 == difference and parsed month is December and parsed day is 31 ) assume 1900 date system

Now all 36'916 dates are parked correctly.

Note : for fun, if you have a Mac with Excel, you can try entering the date before 1904 and format it in a different format. You can enter it, but you cannot format it, because Excel will consider it to be plain text. At the same time, for Microsoft Excel, all days of the week until March 1, 1900 are incorrect because of a bug in a program released in January 1983.

Update : I was told that Spreadseet :: ParseExcel understands the 1904 flag . Unfortunately, I use Spreadsheet :: ParseExcel :: Stream, which does not understand. Even on huge machines we do not have enough memory when using the standard parser, so we use streaming. My attempts to circumvent this restriction ran into another bug .

Update 2 : It turns out Microsoft first released Excel for Mac .

Update 3 : According to Joel Spolsky , a bug in Lotus 1-2-3 could be a deliberate attempt to simplify the program. I had read hints earlier that Lotus did this intentionally, but since I’m not sure about 100, I didn’t write about it.

Source: https://habr.com/ru/post/198174/

All Articles

Fighting a thirty-year bug

More articles: