
Jeff Landgraf, Ph.D.
Not since the idea of thinking machines first surfaced have computers captured the attention of the mass psyche as now.
If a modern Tarot deck were designed, iconifying the hopes and fears of our society, surely one of the Major Arcana would be the World Wide Web and another would be the Y2K problem. The former represents the awesome potential of our new technologies to perform our work, inform us, teach us, and even to free our minds from the physical constraints of our bodies. The latter represents the other side, our dependence on people and technology we do not understand and do not control and on a vast ever-changing sea of information we can access, but never master. Technology is now our frontier. Those brave enough to venture into its territories reap great rewards (witness the booming IT salaries and soaring internet stock). At the same time, unknown risks lurk like monsters in the shadows and they are all the more frightening because they are so difficult to describe. The Millennium Bug, or Y2K problem, is one of these monsters. It is a widespread problem and unlike other computer bugs it seems easy to understand. It is also timely: we are warned that come January 1, 2000 the monster will descend from the mountains and we had better be prepared. It embodies all of our misgivings about the technology. It provides a focus for all of our fears.
Even without computers, the Millennium is a historical landmark, impossible to ignore. For some of us, it is a formal opportunity to reflect on the passage of time and the events that filled it. For others the event looms, as other round-numbered anniversaries have in the past, as our next appointment with the Apocalypse. Conditioned though our society is to carry on its daily business against a background noise of prophecies of doom, two facts lend the year 2000 alarm bells a unique legitimacy. The first fact is the unprecedented boom in invention that began in the nineteenth century and has been increasing ever since. The last century has seen the development of an immense number of new devices, so useful that they are no longer curiosities but necessities: electric lights, automobiles, telephones, refrigerators, elevators, subway trains, airplanes, televisions, fax machines and computers. Not even the most remote rural homesteader is independent from the technology underlying the web of goods and services. This technology is now supported extensively by computers. The second reason the year 2000 is unique is that there is a bona fide technical problem associated with this date. Some of these devices will indeed fail, leading to inconvenience and expense. The existence of this real technical problem lends credibility to our apprehensions of apocalypse and fuels the now familiar tabloid hysteria.
This second observation, however, obscures part of the point of the first - we are at risk from all computer bugs. We must realize and accept the fact that it is possible for computer errors to have serious, even fatal, consequences. Everyone who has used a computer is familiar with glitches, and yet we still insist that computers don't make mistakes. This misconception revolves around the true fact that functioning computer hardware performs its instructions exactly. However, the complexity of today's computer systems means that programmers often write instructions that, despite performing correctly in the overwhelming majority of circumstances, are faulty. The unbelievable success of computer processing gives us expectations that we do not have of other technology. We go to great lengths to avoid aeronautical failures, but we do not have difficulty accepting the reality that engineering failures can have fatal consequences. We do not view computer technology in the same way. The contradiction between our idea that computers should be perfect and the reality of bugs drives the irrational unease we feel when faced with the possibility of computer malfunction.
The obstacles to removing all bugs from a computer system are formidable. Assuming that we have the source code in hand, along with the development tools required to build all of the programs, the first step is to compile an inventory of all the code modules. These are straightforward to list, but the complex relationships between different code modules are critical. A small change in one module can potentially affect the operations of hundreds of others. Next the code must be analyzed in detail. Even a small system such as a PC can easily contain upwards of 2 million lines of code, so this step is an immense task. When the analysis has concluded, the bugs must be fixed, and because each change can potentially cause additional bugs, the analysis must be repeated.
Clearly it is impossible to follow such a program through, even in ideal circumstances, much less for the real-life situation in which you lack the source code and the development tools of the original developers. Even if you could perform this exercise, it is of dubious utility. The procedure is analogous to trying to prevent a disaster in a skyscraper by going over every inch of the building with a magnifying glass looking for cracks. Even after the painstaking effort, the lack of cracks gives no information about whether the structure is plumb, square, and structurally sound. You can not predict future failures from the present lack of cracks.
Does this mean that computers are fundamentally, dangerously unmanageable? Not if we understand the true goal of intelligent, directed testing. Rather than eliminating bugs from each line of code, the task of the "debugger" is to put the system through its paces in the combinations of circumstances that it will actually face. Although a thorough understanding of the system is necessary in order to generate a set of tests that will cover all such plausible situations, the detailed, line-by-line analysis that would be needed to ensure a 100% clean bill of health is unnecessary. Every test confirms the system's correct response with a directness that can not be matched by inspection of the code. By building tests in such a way that every critical function of the system is covered, software engineers can reduce failure rates to levels as low as, or lower than, those of mechanical systems.
Mention of the Y2K bug is often accompanied by scorn: those programmers were stupid not to use four digit dates! Although better planning would have prevented nearly all Y2K bugs, the reality is that the same programming tricks that are now causing such distress are closely related to the clever programming tricks that allow us to store data on the computer at all. A date, like all other computer data, is stored as a sequence of binary numbers in computer memory. The computer does not "know" that the sequence represents a date. Instead, programmers invent conventions their programs use to interpret the numbers. For instance, a year might be stored as a single byte (a number between 0 and 255). To determine the real year from the value in computer memory the program might add 1900. In this case, the year storage convention would hold any year between 1900 and 2155. One of the most prevalent date conventions is to store dates as a series of characters interpreted as digits. The first two digits might represent the month, followed by two digits for the day, followed by four digits for the year. Such a convention is generally called a data type. There are usually entire libraries of routines to manipulate various data types. For a "date" data type there might be validation routines, which make sure that the month is between 1 and 12, that the day is appropriate for the month and so on. There might be a comparison routine, which determines whether one date is later than another. There might be a "next day" routine that takes a date and returns the next day. There might be addition and subtraction routines giving the number of days between two dates, or giving the date after a certain number of days, months, or years. If the time is embedded in the date data type there may even be conversion routines for converting the date/time to a different time zone.
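As a minimal sketch of the conventions just described, the following Python fragment implements both storage schemes (a one-byte offset from 1900, and an eight-character MMDDYYYY string) along with a few of the library routines mentioned. All of the function names are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
import calendar

def year_from_byte(stored):
    """Decode a year stored as one byte (0..255), counted from 1900."""
    if not 0 <= stored <= 255:
        raise ValueError("a byte holds only 0..255")
    return 1900 + stored

def parse_mmddyyyy(text):
    """Split an 8-character MMDDYYYY string into (year, month, day)."""
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        raise ValueError("expected eight digits in MMDDYYYY order")
    return int(text[4:8]), int(text[0:2]), int(text[2:4])

def is_valid(text):
    """Validation routine: month in 1..12 and day appropriate for the month."""
    try:
        year, month, day = parse_mmddyyyy(text)
    except ValueError:
        return False
    if year < 1 or not 1 <= month <= 12:
        return False
    return 1 <= day <= calendar.monthrange(year, month)[1]

def is_later(a, b):
    """Comparison routine: True when date a falls after date b.
    The strings are parsed first, because MMDDYYYY text does not sort by time."""
    return parse_mmddyyyy(a) > parse_mmddyyyy(b)

def next_day(text):
    """'Next day' routine: return the following date in MMDDYYYY form."""
    year, month, day = parse_mmddyyyy(text)
    if day < calendar.monthrange(year, month)[1]:
        day += 1
    elif month < 12:
        month, day = month + 1, 1
    else:
        year, month, day = year + 1, 1, 1
    return f"{month:02d}{day:02d}{year:04d}"
```

Nothing in the stored bytes says "this is a date"; only routines like these give the digits their meaning, which is why a flaw in any one of them can surface as a date bug.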
The Millennium Bug arises either when the convention used to represent the date fails to uniquely identify the date, or when the utility functions that perform operations on the date fail to work correctly for all the dates. Programmers often work under the assumption that the lifetime of their programs will be short. This is seldom actually true. Old software commonly serves as the starting point for new software. Y2K is not a single bug but consists of all of the various mistakes that arise from the assumption that systems would be replaced before the turn of the century. Many believe that the Y2K problem is that the year is stored as two digits so that the year 2000 can not be distinguished from the year 1900. Though this fact is the root cause of many Y2K problems, it is not the only problem. Systems that use four digits for the year can exhibit Y2K problems, and systems that use only two digits for the year can be Y2K compliant. Y2K compliance depends on how the dates are interpreted by the programs. The following are some of the typical practices that now, in the year 1999, seem so questionable:
Using a value such as "9/9/99", or "0/0/00", or "1/1/00" to indicate "NULL" or "No Date". Such a code may be intended to signify "no expiration date" for licensed software, but at subsequent real dates will be interpreted as a valid expiration, blocking access to the program.
Not calculating leap years correctly. Leap years are years divisible by four, except for years ending in '00', which are leap years only if they are divisible by 400. The exception to the exception is often ignored. A program that ignores it treats 2000 as a common year, when in fact 2000 is a leap year.
Not accounting for the century in the storage of the date resulting in incorrect comparisons of dates before and after 2000.
Basing date utility functions on only the last two digits of the year. Even when the year is stored with four digits, the programs that manipulate dates sometimes examine only the last two digits. For this reason, even systems using four-digit year storage can fail.
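The practices listed above can be illustrated with a short Python sketch. The function names are hypothetical. The "windowing" helper, with its pivot of 50, is only an assumed example of how a two-digit year can still be interpreted unambiguously; it is not a convention prescribed anywhere in this article.

```python
NO_EXPIRY = (9, 9, 99)  # sentinel (month, day, two-digit year) meaning "no date"

def has_expired(expiry, today):
    """Faulty license check: the sentinel is compared like a real date.
    Both arguments are (month, day, two-digit year) tuples."""
    month, day, yy = expiry
    t_month, t_day, t_yy = today
    return (t_yy, t_month, t_day) >= (yy, month, day)

def is_leap_naive(year):
    """Faulty rule: ignores the 'divisible by 400' exception."""
    return year % 4 == 0 and year % 100 != 0

def is_leap(year):
    """Full rule: century years are leap years only if divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def is_later_two_digit(year_a, year_b):
    """Faulty comparison: looks only at the last two digits of each year,
    even when it is handed full four-digit years."""
    return year_a % 100 > year_b % 100

def expand_year(yy, pivot=50):
    """Windowing (assumed example): two-digit years below the pivot
    are read as 20xx, all others as 19xx."""
    return 2000 + yy if yy < pivot else 1900 + yy
```

With these definitions, has_expired(NO_EXPIRY, (9, 9, 99)) reports an expiration on September 9, 1999, is_leap_naive(2000) wrongly returns False, and is_later_two_digit(2000, 1999) wrongly returns False even though both years were supplied with four digits.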
How can we assess the danger that these problems pose? We might start with an analysis of the danger on our own PCs. Here, the same hardware and software you use is duplicated on thousands of desktops throughout the world. Even if the formal testing that hardware and software companies pursue for their own products were to fail, the massive consumer base that these platforms enjoy ensures that these systems are extensively tested. While this does not ensure 100% Y2K compliance, it does ensure that the rare instances of bugs that disrupt critical functions are well known, and have long since been corrected. In fact, the flurry of activity by individuals all over the world, testing the Y2K ramifications of every possible bizarre combination of capabilities, strives mightily towards the impossible task of removing all Y2K-related bugs from these systems. The result is the proliferation of patches for nameless Y2K bugs that can be found on the web site of any major software manufacturer.
If you don't trust these arguments, stop reading right now. Back up your system (you do keep regular backups, don't you?). Then, go to the control panel and change your system date to December 31, 1999 at 11:55pm. Bring up the clock application and watch the new millennium roll over. Bring up your word processor. Print a file. Run your spreadsheet. Log on to the Web. Send yourself an email. Reboot your machine. For your computer, the year 2000 has come.
It is likely that you did not have any problems performing the above tests. Good, you are Y2K compliant. If you did have problems, you now have advance warning. You can now attend to the specific problems you encountered, rather than worry about nebulous monsters lurking in the circuits of your machine.
Enough about PC's. The real horrors of the Y2K problem involve the possibility of jet liners steered into houses by faulty air traffic control systems, world wide electrical outages, and the total destruction of our computer-dependent markets. There are dozens of scenarios detailing how the Y2K problem in one system will set off chain reactions resulting in horrible catastrophes. All the banks mistakenly transfer all their money into cyberspace. The Fed tries to cover all the debts, but fails. The stock market scares. Stocks crash and the world is thrown into a depression. Alternatively, computers controlling electrical distribution misinterpret the date, send all the current to a single station, trip all the breakers through a domino effect and create a tremendous blackout. Most alarmist pundits miss the possibility that, in this case, a power failure that knocks out all the banks' computers could save the economy. Even so, it might be wise to seriously analyze the possibility of such scenarios.
The computers controlling such important functions are often very large systems. Unfortunately, their enormous variability makes it very difficult to analyze the extent to which failures will affect our lives. The real trouble answering this question is often obscured further by ill-founded attempts to define the effort required to fix systems by extrapolation of the total number of lines of code to man-hours without regard for the relative importance of various system functions. Despite the complications involved in determining the precise effects the Y2K problem will have on our lives, there are good reasons for us to doubt the predictions of the more radical chain reaction theories. Science fiction fantasies aside, computers do not react to unforeseen circumstances by turning insane, and dumping the contents of their disk drives into cyberspace. Exception handling routines rollback transactions, write error logs, shut interfaces down, and close programs. Even when error conditions are not checked, they generally lead to secondary errors that are subsequently trapped and logged.
The vast increase in complexity as systems scale up is often cited as evidence that Y2K is impossible to control. In fact, increased size exposes systems to risk from all classes of bugs. Bugs related to issues such as concurrency or process and memory management arise from errors in the logical flow of the program, and as such, are more sensitive to increased complexity than the Y2K problems, whose scope can usually be isolated to specific, date-related operations. Furthermore, the unique, large-scale systems lack the advantage of de-facto testing that commercial software programs, by virtue of their massive consumer bases, enjoy. Bugs are not an exception, but rather a fact of life for the staff of large computer systems. For this reason there are specific measures taken to combat each of the difficulties of scale. Increased system size is offset by the efforts of more system dedicated, trained staff. The disadvantage of large systems' tendency to depend on multiple third party products is offset by the direct access to the developers of the third party software garnered through formal support contracts. The increased costs are offset by larger availability of resources. The increased time it takes to modify large systems is offset by the formal management and planning that is dedicated to them.
Of course, large systems face some difficulties that can not be directly balanced by the simple application of additional resources. Legacies of old, poorly documented code written by long departed staff are the norm, so training often consists of dropping the source code into the laps of new arrivals and allowing them to sink or swim. These conditions foster an internal unpredictability in large-scale computer systems that is addressed by continual monitoring of the system's performance and integrity. Many large systems have formal procedures for testing, system monitoring, backups and recoveries, independent safeguards, and contingency plans detailing the steps necessary to perform critical tasks in the face of total system failure.
Probably the best indication of the preparedness of large systems is the success of those systems that are already handling the Y2K problem correctly. For several years, credit card companies have been successfully processing cards with expiration dates ranging into the new millennium. Banks have long been successfully processing the paperwork for IRAs, bonds, and mortgages that expire well into the millennium. Recently, airline reservation systems have started booking flights into the New Year. Such systems are already successfully handling the transition to the year 2000. They give us good reason to believe the transition can be made smoothly.
So far, we have concentrated on the problems that Y2K is likely to cause. Let's turn our attention to the kinds of failures that are not likely to occur. The specter of system failure is raised so often that one imagines that our computers will turn into marvelous giant pumpkins the second the ball comes down in Times Square. This outcome, total chip failure, has been estimated at rates as high as 5% of all computer chips. This number is an enormous over-estimate, based on rough calculations for the number of lines of COBOL containing dates in typical business applications. For a chip to fail based on the date of the real time clock, a number of exceedingly poor and unlikely design decisions would have had to be made. First, the clock itself would have to encounter an exception when the year was incremented past its largest value. This is not the default operation for most registers, which "wrap" under addition overflow - it would require extra work for the designer of the chip to construct a date-increment failure, and this designer would also have had to choose the maximum date. Second, the clock would have to invasively notify the CPU that the error had occurred. This is a reversal of roles: the CPU usually polls the clock for information, not the other way around. Third, the CPU would have to react to the clock's exception by failing itself, rather than simply continuing to do its work. In other words, total chip failure is theoretically possible, but exceedingly unlikely.
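The default "wrap" behavior can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical model of a two-digit year register and of a CPU that polls it. It does not describe any specific clock chip, and the 1900 base is an assumption standing in for whatever the polling software adds.

```python
class TwoDigitYearRegister:
    """Hypothetical model of a clock register that holds a year as 00..99."""

    def __init__(self, value=0):
        self.value = value % 100

    def increment(self):
        # On overflow the register wraps to 00; no exception is raised
        # and nothing is signaled to the CPU.
        self.value = (self.value + 1) % 100
        return self.value


def poll_year(register, century_base=1900):
    """The CPU asks the clock for the year; the clock never interrupts it."""
    return century_base + register.value
```

A register holding 99 simply becomes 00 after one increment, and a poll then yields 1900. The date is wrong, which may matter to the software that reads it, but neither the clock nor the CPU stops working.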
The most valuable information to computer users is stored in files on their hard disk. These customized files can represent years of hard work. For most systems, there is very little danger of losing data from the hard disk due to a Y2K failure. Even if the CPU running the computer were to be entirely destroyed, the hard disk could be removed and read by another computer. The main potential source of danger for data storage lies in automatic programs that delete old versions of files. Here a faulty date comparison could result in significant loss, so such drive-updating routines should be carefully tested.
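To make the risk concrete, here is a hedged Python sketch of an automatic purge rule. The two-year retention period and both function names are assumptions made purely for illustration. The first function compares two-digit years and misfires once the year rolls over; the second compares full dates.

```python
from datetime import date

def stale_two_digit(file_yy, today_yy, keep_years=2):
    """Faulty test: the cutoff year is computed with two digits, so it wraps."""
    cutoff_yy = (today_yy - keep_years) % 100
    return file_yy < cutoff_yy

def stale_full(file_date, today, keep_years=2):
    """Safer test: compare complete (year, month, day) values."""
    cutoff = (today.year - keep_years, today.month, today.day)
    return (file_date.year, file_date.month, file_date.day) < cutoff
```

In the year 2000 the faulty rule computes a cutoff of 98, so a file created that same year (year value 00) looks older than the cutoff and is marked for deletion, while stale_full(date(2000, 1, 1), date(2000, 1, 2)) correctly keeps it.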
Many might be surprised to know how many computers don't depend on dates at all. These computers include the majority of the embedded systems, which control electronic devices of all kinds. These computers will not fail because they don't access or store dates, or because they don't depend on the information they receive from dates they access for any critical tasks. A good test as to whether there is any significant potential for Y2K failure is whether it is possible to set a date for the device. If not, the only way a Y2K bug could surface is for an unused real time clock to be embedded in the circuit. If you removed the power, the clock would be reset to its original value, so even if the circuit contained a debilitating Y2K error, it would not surface until the device had been powered for the full cycle of its clock. In many applications, like automobiles, this time period will never be exceeded because of the need to replace the battery every 5 years or so.
I've tried to show how unlikely it would be for Y2K to lead to the sorts of apocalyptic problems some are predicting. I do not mean to say that the problem is unimportant. In fact, even if the technical problems did not exist, the huge amounts of money already spent trying to fix it would argue that the problem was important. The real problem lies with our fears. These are fed by the contradiction that humans hold about the computer. We believe that computers are perfect and that they don't make mistakes. We want to believe that their operations can not be interrupted, and when we find out about a bug we want it eradicated. It is not enough for most of us to make sure that the machine works for what it is supposed to do. We want the machine certified clean. This is what will never happen. Not for Y2K and not for any other bug.
Will anyone be killed as the result of the Y2K bug? It is entirely possible, as it would only take a single unlucky failure. I would stress, however, that if such a tragedy does occur the blame will not lie with our preparations for Y2K, or with any fundamental danger computers hold, but rather with specific negligence in addressing the general problems of computer malfunction. Generally, systems that are important enough to be very dangerous are very well tested and very well thought out, and so they are much less likely to contain bugs of any sort than typical software. Still, writing software is an engineering field, just like building bridges or building automobiles. People have died when bridges collapsed. People have died when automobiles failed. The real question to ask here is whether it is more likely for Y2K to have fatal consequences than other bugs. Will more people be killed by computer malfunctions in January 2000 than in January 2010? My answer to this question is: probably not.
Jeff Landgraf has a B.S. degree in Physics from the University of Minnesota and a Ph.D. in Physics from the University of Michigan. He is currently working at Brookhaven National Laboratory on Long Island, New York as a computer analyst.