Human Error
Ron Graham
Engineers are often treated to the following explanation of a malfunction or failure: "human error." When I hear this explanation, I immediately want to hear more. Which human? What error? When did it occur? And why? How is that error to be prevented in the future? I WANT THE ANSWERS, AND I WANT 'EM NOW!!!

Customers may want to target blame for failures, as would the general public; the responsible parties, on the other hand, want to diffuse the blame. Neither is satisfied by vague responses. A vague response LOOKS to the observer like a shrug and an "oh well, I’m only human." When the public feels this way about a response, it wants a scapegoat; this is why NASA takes a great deal of heat for celebrated failures like that of the Mars Climate Orbiter (MCO).

Mars Climate Orbiter

Because the MCO failure was attributed to a unit-conversion error, some engineers like to think "I’d never let one like that get by." The more complex the lost system, however, the less likely it is that one person’s actions could have saved it. At the same time, the more complex the system, the less likely it is that one person’s action (or inaction) could CAUSE a failure. In a system that depends on hundreds of thousands of design parameters, where would you even LOOK a priori for the error that brings the system down? (Other engineers might think, "I’m sure glad it wasn’t me." We all know that the next time it could be.)

Engineers focus (hopefully) on identifying such errors and ensuring that they can’t be repeated. We can make plenty of brand new errors without falling back into the same old ones; and sometimes we have to get rid of the old ones just to find out what the new ones ARE.
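One way to keep a unit-conversion error from being repeated is to make every number carry its unit with it, so that mixing units fails loudly instead of silently. What follows is a minimal sketch, not anything taken from the MCO software: the `Impulse` class and its unit labels are hypothetical, and the choice of pound-force seconds versus newton-seconds follows the widely reported account of the failure.

```python
from dataclasses import dataclass

# Exact by definition: 1 lbf = 0.45359237 kg * 9.80665 m/s^2
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse value that always carries its unit (hypothetical type)."""
    value: float
    unit: str  # "N*s" or "lbf*s"; labels chosen for this sketch only

    def to_newton_seconds(self) -> "Impulse":
        """Return the same impulse expressed in newton-seconds."""
        if self.unit == "N*s":
            return self
        if self.unit == "lbf*s":
            return Impulse(self.value * LBF_S_TO_N_S, "N*s")
        raise ValueError(f"unknown impulse unit: {self.unit!r}")

    def __add__(self, other: "Impulse") -> "Impulse":
        # Refuse to mix units silently; the caller must convert explicitly.
        if self.unit != other.unit:
            raise TypeError(f"cannot add {self.unit} to {other.unit}")
        return Impulse(self.value + other.value, self.unit)


# Explicit conversion works; Impulse(10.0, "lbf*s") + Impulse(10.0, "N*s")
# would raise TypeError instead of producing a plausible-looking wrong number.
total = Impulse(10.0, "lbf*s").to_newton_seconds() + Impulse(10.0, "N*s")
```

The point of the design is that the old error now has to be made on purpose. It does nothing, of course, about the brand new errors.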

It is often claimed that 90% of traffic accidents are caused by "human error." (Does this mean error on the part of the drivers?) This assumption leads to the prospect of growth in the telematics industry (automatic information delivery from remote sources).

But we don't yet know the point at which information delivery becomes a distraction in itself. We have cell phones, CD players and GPS; most states have laws against "distracted driving." But can we readily recognize who's distracted, when, and by what?

Audio-visual map prompts have been shown in some studies to be safer than drivers looking at paper maps. (Is this supposed to be a surprise?) Does this mean that holographic GPS-driven maps are next, followed by adaptive cruise control? If we had the "Smart Driver," would we sit in the driver's seat and read the paper? Or take a nap?

Astronauts don't trust information delivery well enough to have a totally-automated space rendezvous. Maybe they think what they're doing is more dangerous than driving a car; maybe they're just more sober and serious than the average driver. But would WE trust a "Smart Driver?"

On-line: errors of commission
  • [verb]-ing the wrong [noun]
  • Incorrectly assuming the next step
  • Incorrectly assuming completion
On-line: errors of omission
  • Forgetting [all or part of] procedure
  • Missing events
  • Ignoring instructions
Off-line: errors of commission
  • Typing the wrong key
  • Factor-of-two errors
  • Sign inversion errors
  • Incorrect choice of mode
Off-line: errors of omission
  • Missing conversion factors
  • Incorrect coordinate systems
  • Insufficient documentation
  • Overlooking details

The Roots of Error

There are at least two reasons that we put up with the hassle of personal-use high-tech devices that we don’t understand:

  1. I think the problem is trivial because it doesn’t affect anyone else.
  2. I think the problem may be ME and not the technology. Maybe I’m just not smart enough.

But certain types of basic errors of operation are exactly the same whether you’re operating a hand-held appliance or a 777. Stanton refers to Norman’s "mode error" here: a device has multiple modes of operation, and a given action has different consequences in different modes. Mode error is an example of design for the convenience of the designer, mass-produced for all. Even the most noble of engineers can’t be expected on their own to design products that work as easily for everybody else as for themselves.
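Mode error is easy to show in a few lines. The sketch below is a made-up two-mode watch (the `Watch` class, its buttons, and its starting hours are all hypothetical, not from Stanton or Norman): the same "plus" button changes the clock in one mode and the alarm in the other.

```python
from enum import Enum


class Mode(Enum):
    CLOCK = "clock"
    ALARM = "alarm"


class Watch:
    """Hypothetical two-mode device: one button, two different effects."""

    def __init__(self) -> None:
        self.mode = Mode.CLOCK
        self.clock_hour = 9
        self.alarm_hour = 7

    def press_mode(self) -> None:
        """Toggle between clock-setting and alarm-setting modes."""
        self.mode = Mode.ALARM if self.mode is Mode.CLOCK else Mode.CLOCK

    def press_plus(self) -> str:
        """Advance one hour in the current mode and say which setting changed."""
        if self.mode is Mode.CLOCK:
            self.clock_hour = (self.clock_hour + 1) % 24
            return f"clock -> {self.clock_hour:02d}:00"
        self.alarm_hour = (self.alarm_hour + 1) % 24
        return f"alarm -> {self.alarm_hour:02d}:00"


watch = Watch()
# The user means to move the alarm from 7 to 8 but forgets to switch modes.
feedback = watch.press_plus()  # the clock moves; the alarm is untouched
```

The returned message is one small mitigation: if the device tells me which setting I just changed, I at least have a chance to catch my own mode error.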

The types of errors we make in doing simple tasks for ourselves, especially in using devices, are the same types of errors we make in complex system design and operation.

And these errors increase in likelihood when the work environment allows for them. Human error results from our own omission or commission, as seen above, but may also result from systemic problems. Eliminating error requires recognizing both individual and systemic causes.

How individuals tend to err

  • Inability to follow chains of events to a conclusion
  • Inability to anticipate correctly, or assuming events will occur the same way each time
  • Assuming expertise or previous success precludes future error
  • Inability to maintain persistent focus
  • Desire to avoid personal responsibility
  • Desire to avoid critical scrutiny
  • Tendency to do "most fun" tasks first, sometimes ahead of both urgent and important tasks
  • Tendency to cut corners "when possible"

How systems tend to allow individuals to err

  • Requiring that urgent tasks be performed before important tasks
  • Making critical tasks monotonous (or vice-versa)
  • Ignoring "non-critical path" tasks (e.g. safety, environmental, QA)
  • Having excessive dependence on individual expertise/computer simulation results
  • Performing insufficient simulation
  • Tendency not to simplify processes
  • Cutting strategy tasks (e.g. planning, checking) first when the money runs low

Dodging Responsibility

We will try to avoid taking responsibility for errors because of Karchmer’s Law of Performance Evaluation: ten "attaboys" equals one "aw, shit." An engineer who hasn’t accumulated a substantial number of "attaboys" by the time the error is committed may be out of a job. Again, the threat of losing a job or being thrown to the crowd when we make a mistake may not be the answer -- it may make us miserable without making us any better. Plus, if an engineer’s error causes a problem that propagates up the management chain, it’s very likely the engineer won’t be sacrificed before the managers!

How different the engineering world is from the business world! When a business fails, the owners will often be able to find investors in their next venture easily -- investors think the owners won’t make the same mistakes the next time. And they’re probably right. Engineers won’t make the same mistakes the next time either, if the mistakes are large enough -- but will they land on their feet?

We learn from watching NASA and its contractors absorb media criticism for failures that:

  • The probability that an error will be repeated is inversely proportional to the magnitude of the error’s consequences.
  • The probability that responsibility for the error will be avoided or diffused is directly proportional to the magnitude of the error’s consequences.

This means that even as a member of the (outraged) public it does me no good to call for a scapegoat. I may not get one, and the product is probably going to improve without one.

Anticipating Error

Given what we know about ourselves; about the types of errors we commit; and about the environments in which errors occur -- is it possible to predict errors? We can’t know what’s inside others’ minds, but we can observe others’ behavior and the systems they work under. If you observe any of these warning signs, a critical error may be around the corner:

Individuals

  • Lack of sleep or other loss of focus (e.g. ADHD)
  • Personal crises (e.g. injury, illness, separation, death, financial loss)
  • Dissatisfaction (e.g. with job itself or work surroundings)
  • Lack of confidence (e.g. in training or procedure)
  • Miscellaneous other distractions (see the Canonical List)

Systems

  • Experience drain (e.g. from illness, retirement, or "use-or-lose")
  • Difficulty in locating proper tools
  • Loss of expected budget
  • Redesigns, especially late in the project
  • Complex processes (i.e. impossible for one person to memorize)

We tend to blame workers for having human characteristics, instead of finding ways to work in spite of them. It is true that workers should be able to focus, especially on critical tasks. But the warning signs above tell us it may be impossible to expect that even of the best workers. Systems must therefore be designed to compensate for as much of the above as possible. In this regard we can learn a lot from the stage. Consider:

Redundancy. Almost all plays have an understudy for each key role. Does a critical worker require an understudy? If so, then train one. If there’s no money in the budget to do this, then borrow another worker from a non-critical task with its own budget. Is this unethical? Maybe -- but is it more ethical to keep all the staff budgets in their individual pots and risk a failure that haunts all your customers? If there are no other available workers to act as an understudy, draft a manager. There must be one in the organization who hasn’t forgotten the technical side altogether.

Simplicity. Plays require the actors to memorize their lines, and enable the memorization by cues. Can this not be done for critical procedures as well? Not if the procedure manual is longer than the screenplay for a major motion picture. Make each critical procedure memorizable, within a single page if possible and written in plain language, and consider ways for the worker to receive a cue for the most critical steps.

Honesty. Actors know their play could be a bomb. Workers must know what they face, including all possible consequences of their work. If they’re to have all the necessary confidence in their actions, they must know that others are confident as well. If they’re to be well-rested, they must have the chance to rest. If they’re to be focused, they must have the chance to collect themselves. If they’re not to panic, they must know that what they really need will be there when they need it.

It seems clear that the stages in which errors are either most likely to occur or can do the most damage are the stages in which resources should be focused. This means that part of the budget -- whatever part is necessary -- must be targeted to ensure that the stage is clean and paid for, the props tagged and securely stored, the understudies recruited and trained, the scripts edited and revised, the costumes comfortable, and the champagne on ice. Even then there’s no guarantee the play will come off without a hitch, but we can say we did our best.

Why Human Error Can't Be Avoided

There's a certain amount of human error that engineered systems are often intended to tolerate. One classic example involves proximity operations of the Space Shuttle, which are under astronaut control despite their complexity and danger. Astronauts are more comfortable with having their fate in their own hands than in a computer's, no matter how powerful the computer.

[Figure: Shuttle approach to ISS]
This image (though flawed in both scale and relative orientation) shows the concept of the approach cone in Shuttle proximity operations.

Two rules of thumb, the "0.1% Rule" (in which the Shuttle's velocity is kept close to 0.1% of its distance from its rendezvous target, given like units) and the "approach cone" (in which the docking points on both Shuttle and target are kept within a few degrees of a straight line from one another) were developed to compensate for the astronauts' perceived need for human control. These rules help the pilot keep a feeling of a steady approach, despite the complexity of relative motion in space -- whenever the Shuttle fires jets to control closure rate, its altitude changes slightly, so more jets must be fired. This is something like the way we crawl toward the edge of a cliff and peek over it -- S-L-O-W-L-Y -- instead of walking right up to it.
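Both rules of thumb reduce to very little arithmetic, which is part of their appeal. Here is a minimal sketch under my own assumptions: the function names are hypothetical, positions are measured relative to the target's docking axis, and the 5-degree default half-angle is only a placeholder for "a few degrees," not a flight value.

```python
import math


def target_closing_rate(range_to_target: float) -> float:
    """0.1% Rule: closing rate per second is 0.1% of range, in like units."""
    return range_to_target / 1000.0


def off_axis_angle_deg(along_axis: float, lateral: float) -> float:
    """Angle between the docking axis and the line of sight to the Shuttle."""
    return math.degrees(math.atan2(abs(lateral), along_axis))


def within_approach_cone(along_axis: float, lateral: float,
                         half_angle_deg: float = 5.0) -> bool:
    """True if the Shuttle sits inside the cone around the docking axis.

    half_angle_deg is an assumed placeholder for "a few degrees".
    """
    return off_axis_angle_deg(along_axis, lateral) <= half_angle_deg


# At 1000 ft the rule asks for about 1 ft/s of closure; at 100 ft, about 0.1 ft/s.
rate_far = target_closing_rate(1000.0)
rate_near = target_closing_rate(100.0)
inside = within_approach_cone(along_axis=100.0, lateral=5.0)
outside = within_approach_cone(along_axis=100.0, lateral=20.0)
```

If the 0.1% Rule were followed exactly, the range would shrink exponentially with a time constant of 1000 seconds: the closer you get, the slower you go, which is the cliff-edge crawl in numbers.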

If we look at any process as involving only controlled inputs plus effort leading to outputs, we miss the contributions of participants, physical constraints, and environment to the process.

[Figure: real processes]

Human error can't be avoided because even if our ability to learn from our mistakes improves nearly to perfection, everything will change around us. The old axiom I call the "Gannon Rule" applies: "whenever you think you know all the answers, they change the questions."

  • Participants -- changes in personnel, management, schedules, and personal influences on individuals previously involved in the process
  • Constraints -- changes in requirements, regulations, and customer habits
  • Environment -- changes in machinery characteristics (due to wear, etc.), computer hardware and software, sensors and actuators, and ambient conditions
  • Inputs -- changes in suppliers or raw materials/components that go into the supply

...each of these forces us to re-learn how to become error-free.

Tracy Kidder's classic, The Soul of a New Machine, reveals four types of human error that affect the development of a new computer, and how the engineers set out to overcome these error types:

  1. The Big Mistake -- the flaw that's discovered late in the development process, and requires a redesign to correct.

    • testing to success/testing to failure
    • soliciting stakeholder views on functionality
    • documentation

  2. Flakiness -- caused by a misstep in a procedure that wasn't documented and can't be reproduced.

    • documenting individual steps (in order)
    • testing one thing at a time

  3. The Boogeyman -- the dark, nameless fear that when all else is completed, the darn thing will still not work in the end.

    • redundant project oversight through all phases
    • involving focus groups in testing
    • gaining experience (the best way long-term)

  4. The Kludge -- the temporary correction that isn't recorded and thus isn't repaired properly later.

    • documentation of kludge and its purpose
    • team design of kludge (you don't want a single point of contact here)
    • notification of management when kludge is installed
    • finite lifetime built into kludge
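In software, the four kludge countermeasures above can be built into the kludge itself. This is a rough sketch of one way to do it, not anything described by Kidder: the `kludge` decorator, `KLUDGE_LOG`, `KludgeExpired`, and the example sensor function are all hypothetical names of my own.

```python
import datetime as dt
import functools
from typing import Callable, Sequence

KLUDGE_LOG = []  # one record per installed kludge, for management to review


class KludgeExpired(RuntimeError):
    """Raised when a temporary fix is used past its agreed lifetime."""


def kludge(purpose: str, owners: Sequence[str], expires: dt.date,
           clock: Callable[[], dt.date] = dt.date.today):
    """Decorator that documents a temporary fix and gives it a finite lifetime."""
    if len(owners) < 2:
        # Team design: no single point of contact.
        raise ValueError("a kludge needs more than one point of contact")

    def decorate(func):
        # Notification: record the kludge and its purpose when it is installed.
        KLUDGE_LOG.append({"name": func.__name__, "purpose": purpose,
                           "owners": tuple(owners), "expires": expires})

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Finite lifetime: refuse to run after the expiry date.
            if clock() > expires:
                raise KludgeExpired(
                    f"{func.__name__} expired on {expires}: {purpose}")
            return func(*args, **kwargs)

        return wrapper

    return decorate


@kludge("clamp noisy sensor readings until the sensor is replaced",
        owners=["engineer_a", "engineer_b"],  # placeholder names
        expires=dt.date(2031, 1, 1))          # placeholder date
def read_sensor(raw: float) -> float:
    return min(raw, 100.0)
```

Failing hard after the expiry date is a deliberate choice here; a gentler version could issue a warning instead. Either way, the kludge can no longer be quietly forgotten.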

References

Neville Stanton’s web site has some interesting notes on Engineering Psychology. Start there. (www.soton.ac.uk/~psyweb/sig/engpsy.html)
Then get the December 1999 issue of IEEE Spectrum and review James Oberg’s article on the MCO Failure.
Norman, D., The Design of Everyday Things. Doubleday, 1990.
Kidder, T. The Soul of a New Machine. Back Bay Books, 2000. ISBN 0-31649-197-7
Graham, R. "Positioning of Space Station Solar Arrays with a Neural Network." Doctoral dissertation, Cleveland State University, 02.1994.

