Failures
Ron Graham
with Kirk Gordon
An engineered system fails when it stops working. This probably means it broke, or broke down, or shut down. A "failure" should not be mistaken for a "malfunction": after a malfunction, the system may work properly the next time you turn it on, unless the malfunction has led to a loss of opportunity.

A contributing factor in most spectacular failures is the lack of checks that might have detected problems before the failure occurred, if not before the system was put in service. This shows us that the causes of failures won't be found unless they're looked for, and looking for them in advance is done relatively seldom.

Important lessons learned include

  • failures propagate along the path of least resistance - once the failure begins, its growth accelerates;
  • what initiates the failure may not be the same thing as what propagates it;
  • scaling systems is a highly nonlinear process and cannot be treated on the basis of geometry alone;
  • failures tend to concentrate in areas of stress concentration, or in areas of close proximity of subsystems that are sensitive to one another;
  • failure prevention is a function of design, analysis, materials, production, operation, maintenance and management;
  • it's possible for an engineered system to remain safe even in the presence of a failure - but you have to design for that in most cases;
  • "previous success is not a reliable indicator of future performance."
Some of these principles are exhibited in this cross-section of a failed rubber coupling. A small crack formed on its outer surface, accelerating into a tear all the way through to its inner surface as its motion became more pronounced and its stresses more concentrated in the neighborhood of the tear.

[Figure: cross-section of the failed coupling]

Once a failure begins to occur, it's easier to propagate than it was to start. This is analogous to most moving systems - it's easier to *keep them moving* than it is to *get them going*. While systems are in service, they are aging - material properties change, loads shift and cycle, fasteners drop off - and we may not be able to observe these subtle changes.

Failures follow the principles of short circuits -- the path of least resistance. A cracked support offers less resistance than a healthy one, so it bears the critical load. When a container begins to crack, the chance of it releasing its load depends *much* more on crack propagation than on strain.

Likewise, when a vibrating or rotating system begins to fail, not only is the critical load felt most near the crack, but the continued motion accelerates the problem and may make the resulting failure at best loud, at worst explosive!

Stored energy situations are among the sneakiest kinds of safety problems, but those can still be made fail-safe in many cases. For example, if you're worried about hydraulic fluid pressure, then you'd need a "dump valve" that does NOT wait for a control to tell it to open. Instead, it stays open by default and closes, to maintain system pressure, only when all appropriate sensor inputs and logical operations are satisfied. In other words, pressure can accumulate only when everything's going right.
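The dump-valve logic can be sketched in a few lines of Python. This is only an illustration, not a real controller: the input names (guard_closed, pressure_sensor_ok, operator_enable) and the function dump_valve_closed are hypothetical, and a lost or invalid signal is modeled as None. The point is that the valve closes only on positive confirmation of every input, so a dead sensor or a broken wire leaves it open and the pressure dumped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interlocks:
    """Hypothetical safety inputs; None models a lost or invalid signal."""
    guard_closed: Optional[bool]
    pressure_sensor_ok: Optional[bool]
    operator_enable: Optional[bool]


def dump_valve_closed(inputs: Interlocks) -> bool:
    """Return True (valve closed, pressure may build) only when every
    input is positively confirmed True. Anything else leaves it open."""
    signals = (
        inputs.guard_closed,
        inputs.pressure_sensor_ok,
        inputs.operator_enable,
    )
    return all(value is True for value in signals)


# Everything is going right: the valve may close.
print(dump_valve_closed(Interlocks(True, True, True)))
# One sensor has gone silent: the valve stays open.
print(dump_valve_closed(Interlocks(True, None, True)))
```

Note that the default state needs no decision at all. The logic has to earn the right to hold pressure, rather than having to notice a problem in order to release it.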

Kinematic devices (e.g. spindles, flywheels) tend to keep on spinning (or falling, or sliding) for a long time if you let them coast to a stop -- and coasting to a stop is the only passive way to let them dump energy. A mechanical brake (normally engaged by a spring, and only released to allow motion when all safety indicators are satisfied) is the usual solution, but it isn't fail-safe. Though brake technology is well-developed and well-understood, and is often the only realistic solution, using a brake to capture and dissipate a lot of energy in a hurry can be dangerous in itself.

Knowing this, we recognize that although a system may offer no signs of wearing out, once the signs become evident, failure is *very* near. This is why "if it ain't broke, don't fix it" is such a dangerous philosophy. From the moment a system is put into service, it begins to wear out. If we wait until it's visibly wearing out, failure may be too near to prevent. A better philosophy would be "an ounce of prevention is worth a pound of cure." Example:

Xerox copiers collect performance data using "remote interactive communication" (RIC). After gathering sensor data, the RIC system predicts breakdowns and sends the predictions and supporting data to a branch office -- so a technician is sent out before the failure occurs!
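How the RIC system actually makes its predictions isn't described here, so the following is only a toy sketch of the general idea: fit a straight-line trend to a wear indicator and estimate how long until it crosses a service threshold. The function name cycles_until_threshold, the readings, and the threshold are all hypothetical.

```python
def cycles_until_threshold(readings, threshold):
    """Fit a least-squares line to equally spaced readings and estimate how
    many more sampling intervals remain before the trend reaches threshold.

    Returns None when there are too few readings or the trend is not rising.
    """
    n = len(readings)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope
    # Count from the most recent reading, never below zero.
    return max(0.0, crossing - (n - 1))


# Hypothetical wear indicator sampled at regular intervals.
readings = [1.0, 1.5, 2.0, 2.5, 3.0]
remaining = cycles_until_threshold(readings, threshold=5.0)
if remaining is not None and remaining < 10:
    print(f"schedule service: about {remaining:.0f} intervals left")
```

A straight line is the crudest model available. Since failure growth accelerates once it begins, a linear estimate like this one will tend to be optimistic, which is one more reason to act on it early.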

Tracy Kidder, in The Soul of a New Machine, gives four types of mistake that are typically seen in hardware development and that may affect many engineers. He also gives reasonable steps for preventing them.

  1. The Big Mistake -- the flaw that's discovered late in the development process, and requires a redesign to correct.
    • testing to success/testing to failure
    • soliciting stakeholder views on functionality
    • documentation
  2. Flakiness -- caused by a misstep in a procedure that wasn't documented and can't be reproduced.
    • documentation of steps
    • test one thing at a time
  3. The Boogeyman -- the dark, nameless fear that when all else is completed, the darn thing still will not work.
    • redundant project oversight through all phases
    • involving focus groups in testing
    • gaining experience (the best way long-term)
  4. The Kludge -- the temporary correction that isn't recorded and thus isn't repaired properly later.
    • documentation of kludge and its purpose
    • team design of kludge (you don't want a single point of contact here)
    • notification of management when kludge is installed
    • finite lifetime built into kludge

Another Example: Intel Pentium Problem

During the production of some 5 million Pentium chips, a look-up table affecting floating-point calculations was left with missing entries, leading to incorrect results for just a few unique sets of numbers. These errors were found in the 12th through 16th bits of the number, or as high as the fourth significant digit to the right of the decimal.

  • 5505001/ 294911 = 18.66665197 (Pentium gave 18.66600093)
  • 4195835/3145727 = 1.33382045 (Pentium gave 1.33373907)
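A few lines of Python make the size of these errors concrete. The sketch below takes the two divisions and the flawed results quoted above, recomputes the quotients, and multiplies the flawed quotient back by the divisor to see how far it lands from the original numerator. The helper name check_division is hypothetical, and the flawed values are only as precise as the eight decimals quoted.

```python
# Division pairs and the flawed Pentium results, as quoted above.
CASES = [
    (5505001, 294911, 18.66600093),
    (4195835, 3145727, 1.33373907),
]


def check_division(numerator, denominator, reported):
    """Compare a reported quotient with one computed locally."""
    expected = numerator / denominator
    relative_error = abs(expected - reported) / abs(expected)
    # A correct quotient, multiplied back, should recover the numerator.
    residual = numerator - reported * denominator
    return expected, relative_error, residual


for n, d, reported in CASES:
    expected, rel, residual = check_division(n, d, reported)
    print(
        f"{n}/{d}: expected {expected:.8f}, reported {reported:.8f}, "
        f"relative error {rel:.1e}, residual {residual:.1f}"
    )
```

The multiply-back residual is a cheap sanity check on any division: a relative error that looks tiny in the quotient turns into a residual in the hundreds once it's scaled by a seven-digit divisor.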

Intel missed these errors despite what looks like a rigorous testing procedure:

  • Intel tested (a.b/c.d)/(e.f/g.h) for {a.b, c.d, e.f, g.h} = {0.1 through 9.9} in Ada and C++, on a Sun SparcStation with SunOS and on Linux, DOS, Windows NT, HP-UX, and OS/2 platforms. They found fewer than one error in 100,000 tries and deemed this of no consequence to business users.

The fallacies in the procedure and assumptions here are

  • Even simple decimals like these mostly can't be represented exactly by finite binary numbers, so when they're used in more complex calculations (e.g. search and optimization, monotonic functions), discontinuities would show up, the same way each time, and screw up the processes.
  • Even if there was no consequence to business users, there was one HECK of a consequence to scientific and engineering users.
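The first point is easy to check. The sketch below, which assumes ordinary IEEE 754 double precision (what Python's float uses), shows the value actually stored for 0.1 and counts how many of the tenths from 0.1 through 9.9 are stored exactly.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(float) shows the exact value the binary float holds.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# Tenths from 0.1 through 9.9, as in the test described above.
tenths = [Fraction(k, 10) for k in range(1, 100)]

# A tenth is stored exactly only if converting it to float loses nothing.
exact = [t for t in tenths if Fraction(float(t)) == t]
print(f"{len(exact)} of {len(tenths)} tenths are stored exactly")
# 19 of 99 tenths are stored exactly
```

Only the whole numbers and the halves survive the conversion. Every other input to the test was already slightly off before any division took place.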

Though the error was discovered in 06.1994, Intel had taken no action by 11.1994, and chips with the error were still being distributed. Intel eventually offered replacement and/or rebate, but not before pissing off the marketplace.

Yet Another Example: Rockets

Homer Hickam, in Rocket Boys, reminds us that when something fails, it's the best source of answers.

We couldn't lose [a rocket]. Like every rocket we launched, it held answers we had to know. [...] With rockets, anytime you changed one thing, a lot of other things changed, too, and it was hard to predict what all they might be.

References

Hickam, H. Rocket Boys. NYC: Doubleday, 1998. ISBN 0-385-33320-X
Intel Inside?
Markoff, J. "Circuit Flaw Causes Pentium Chip to Miscalculate, Intel Admits." New York Times, 11.24.1994.
Coffee, P. "What Does the Pentium Do, and When Does It Do It?" New York Times, 11.16.1994.
Kidder, T. The Soul of a New Machine. Back Bay Books, 2000. ISBN 0-316-49197-7



What You Can Do

  1. Be skeptical of the philosophy "if it ain't broke, don't fix it." History tells us instead that if it ain't broke, it's breaking.
  2. Remember that once something starts to break it'll finish breaking in a hurry.
  3. Remember that most systems cost less to maintain than to replace. You may find that recalls are less costly than product liability litigation as well.
  4. Try to anticipate failure mechanisms before they occur. There is some magic involved here, but you have historical lessons to help you.
