Failures
Ron Graham with Kirk Gordon

An engineered system fails when it stops working. This
usually means it broke, broke down, or shut down.
A "failure" should not be mistaken for a "malfunction":
after a malfunction, the system may work properly the next
time you turn it on - unless the malfunction leads to a
loss of opportunity.
A contributing factor in most spectacular failures is the lack of checks that might have detected problems before the failure occurred, if not before the system was put in service. This shows us that the causes of failures won't be found unless someone looks for them, and that is seldom done in advance. Important lessons learned include:
Once a failure begins, it is easier for it to propagate than it was for it to start. This is analogous to most moving systems - it's easier to *keep them moving* than it is to *get them going*. While systems are in service, they are aging - material properties change, loads shift and cycle, fasteners drop off - and we may not be able to observe these subtle changes.
Failures follow the principle of short circuits -- the path of least resistance. A cracked support offers less resistance than a healthy one, so it bears the critical load. When a container begins to crack, the chance of it releasing its load depends *much* more on crack propagation than on strain. Likewise, when a vibrating or rotating system begins to fail, not only is the critical load felt most strongly near the crack, but the continued motion accelerates the problem and may make the resulting failure at best loud, at worst explosive!
Stored energy situations are among the sneakiest kinds of safety problems, but in many cases they can still be made fail-safe. For example, if you're worried about hydraulic fluid pressure, you'd need a "dump valve" that would *not* open when a control told it to. Instead, it would close, to maintain system pressure, only when all appropriate sensor inputs and logical operations were satisfied. In other words, pressure can accumulate only when everything's going right.
Knowing that systems age in ways we cannot always observe, we recognize that although a system may offer no signs of wearing out, once the signs become evident failure is *very* near. This is why "if it ain't broke don't fix it" is such a dangerous philosophy. From the moment a system is put into service, it begins to wear out. If we wait until it's visibly wearing out, failure may be too near to prevent. A better philosophy would be "an ounce of prevention is worth a pound of cure." Example:
Xerox copiers collect performance data using "remote interactive communication" (RIC). After gathering sensor data, the RIC system predicts breakdowns and sends the predictions and supporting data to a branch office -- so a technician is sent out before the failure occurs!
Tracy Kidder, in The Soul of a New Machine, gives four types of mistake that are typically seen in hardware development and that may affect many engineers. He also gives reasonable steps for preventing them.
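The fail-safe dump valve described above, one that closes only when every check is satisfied and otherwise stays open, can be sketched in a few lines. This is a minimal illustration, not a real control design: the `Interlocks` fields and the function names are hypothetical, and an actual hydraulic system would define its own sensor inputs and logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interlocks:
    """Hypothetical permissive signals; a real system defines its own."""
    pump_healthy: bool
    guard_closed: bool
    pressure_sensor_ok: bool
    operator_enable: bool


def all_permissives_ok(inputs: Interlocks) -> bool:
    """Return True only if every permissive signal is satisfied."""
    return all((
        inputs.pump_healthy,
        inputs.guard_closed,
        inputs.pressure_sensor_ok,
        inputs.operator_enable,
    ))


def valve_state(inputs: Optional[Interlocks]) -> str:
    """Decide the dump valve position.

    The resting state is "open" (pressure dumped). The valve closes,
    allowing pressure to build, only when all checks pass. A missing
    signal source (None) is treated like a failed check.
    """
    if inputs is not None and all_permissives_ok(inputs):
        return "closed"
    return "open"
```

The design choice worth noticing is the direction of the default: nothing has to work correctly for the valve to open, so a dead sensor or a broken wire releases pressure rather than trapping it.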
Another Example: Intel Pentium Problem
During the production of some 5 million Pentium chips, a lookup table used in floating-point division was left with missing entries, leading to incorrect results for just a few unique sets of numbers. These errors appeared in the 12th through 16th bits of the number, or as high as the fourth significant digit to the right of the decimal point.
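A toy sketch may help show why a few missing table entries are easy to overlook. Everything here is hypothetical: the table size, the positions of the missing entries, and the function names are made up for illustration and do not describe the Pentium's actual table or Intel's test procedure. The point is only the contrast between checking every entry and sampling a few lookups.

```python
TABLE_SIZE = 1024                      # hypothetical size, illustration only
MISSING = {17, 300, 511, 640, 999}     # hypothetical unfilled positions


def build_table():
    """Entry i should hold i + 1; missing entries are left at 0."""
    return [0 if i in MISSING else i + 1 for i in range(TABLE_SIZE)]


def exhaustive_check(table):
    """Visit every entry and return the indices that hold a wrong value."""
    return [i for i, value in enumerate(table) if value != i + 1]


def miss_probability(n_tests):
    """Chance that n uniformly random lookups all avoid the bad entries."""
    good_fraction = (TABLE_SIZE - len(MISSING)) / TABLE_SIZE
    return good_fraction ** n_tests


if __name__ == "__main__":
    table = build_table()
    print("bad entries found by exhaustive check:", exhaustive_check(table))
    print("chance 100 random lookups find nothing:", miss_probability(100))
```

In this toy setup, a hundred random lookups still have a better-than-even chance of seeing nothing wrong, while the exhaustive check finds every bad entry by construction.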
Intel missed these errors despite what looks like a rigorous testing procedure:
The fallacies in the procedure and assumptions here are
Though the error was discovered in June 1994, Intel had taken no action by November 1994, and chips with the error were still being distributed. Intel eventually offered replacements and/or rebates, but not before pissing off the marketplace.
Yet Another Example: Rockets
Homer Hickam, in Rocket Boys, reminds us that when something fails, the failed item itself is the best source of answers.
We couldn't lose [a rocket]. Like every rocket we launched, it held answers we had to know. [...] With rockets, anytime you changed one thing, a lot of other things changed, too, and it was hard to predict what all they might be.
References
Hickam, H. Rocket Boys. NYC: Doubleday, 1998. ISBN 0-385-33320-X
What You Can Do