Statistics
Ron Graham
with Lisa Henn
[Figure: histogram based on simulation data.]

Statistics are used to answer one of two questions regarding data:

  1. Are two data points, or data sets, "different?"
  2. Are two variables "related" (i.e., is Y a function of X)?

You've performed an experiment and gathered data. The answers to these questions depend on the variation in the data - the scatter of the data about some nice, smooth, well-behaved, and understandable continuous function. The function represents what you figure the data should do; the data actually does something that's somewhat different. You perform statistical analysis on this data both because it's doing something different than what you predict, and also because you can't take an infinite number of data points. It's too expensive.

Variation ALWAYS EXISTS in collected data. It is due either to COMMON CAUSES, found within a single population, or data set; or to SPECIAL CAUSES, which discriminate between multiple populations of data; or both. In general, common causes are not (and should not be) correctable - you'd be fighting a losing battle to try. Special causes, on the other hand, are where systems and processes suffer losses of efficiency, or worse. They should be corrected if it's possible. If they are responsible for losses, correcting them is the way to get the big win.

In general, data variation is random to some extent, having the appearance of a normal distribution, or bell curve. Such a distribution has a mean (m, or average), and a standard deviation (s, a measure of the typical distance between an individual data point and the mean, computed as the square root of the average squared difference), and these numbers provide both specific meaning and tools for answering the two questions above.
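As a minimal sketch of how m and s are computed, here is a short Python example. The measurements are invented for illustration, and the variable names are assumptions, not part of the original text. Note that s is built from squared differences; the plain average of the absolute differences is a separate quantity, shown here only for comparison.

```python
import math
import statistics

# Hypothetical measurements, invented for illustration only.
data = [9.8, 10.1, 10.0, 9.9, 10.3, 10.2, 9.7, 10.0]

m = statistics.mean(data)    # the average
s = statistics.stdev(data)   # sample standard deviation (divides by n - 1)

# The same s computed by hand: square root of the averaged squared differences.
s_by_hand = math.sqrt(sum((x - m) ** 2 for x in data) / (len(data) - 1))

# Average absolute difference from the mean: a different measure of scatter.
mean_abs_dev = sum(abs(x - m) for x in data) / len(data)

print(f"m = {m:.3f}, s = {s:.3f}, mean |x - m| = {mean_abs_dev:.3f}")
```

Dividing by n - 1 rather than n is the usual choice when the data is a sample from a larger population; statistics.pstdev divides by n instead.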

To understand your data's variation:

  1. Prepare a histogram. Each data point is grouped into a "cell," a small portion of your data's range. You plot the number of points in each cell against the range of that cell.
  2. Plot the data as a function of time (assuming the data was collected at different times or over some duration). Examine that plot to ascertain, and correct for, any time-dependencies. Such dependencies might include "drifting," which might indicate system wear or temperature elevation; or "cycling," which might indicate an unwelcome alternating dynamic such as unbalanced rotating machinery. In such cases, both the data and the way it's distributed are moving around in time. This means that the process from which you gather the data is statistically "unstable." It has both common cause and special cause variation. The special cause(s) must be rooted out and addressed.

    Recognize, by the way, that data doesn't always fall in a normal distribution. Sometimes it looks roughly like one, and in that case it can usually be treated as one, possibly after a transformation. If it doesn't look like one, you have to treat the data another way.

  3. Once a distribution is established, examine it for skewness, peakedness, and outliers. If the curve is lopsided, flattened or excessively peaked, or bimodal (two humps), there are transformations that can be performed on the data to place it in a symmetric distribution. Outliers (usually just a point or two, well off the mean) can change the way the data is interpreted. Those points should be addressed separately.
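The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed method: the data is simulated, and the helper names (cell_counts, drift_slope, skewness) are hypothetical. It groups the data into equal-width cells, fits a straight line against collection order as a crude check for drifting, and computes skewness as a number for lopsidedness.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable
# Simulated data, invented for illustration: 200 points near 50 with scatter 5.
data = [random.gauss(50.0, 5.0) for _ in range(200)]

def cell_counts(values, n_cells=10):
    """Count how many values fall into each of n_cells equal-width cells."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_cells
    counts = [0] * n_cells
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        i = min(int((v - lo) / width), n_cells - 1)
        counts[i] += 1
    return lo, width, counts

def drift_slope(values):
    """Least-squares slope of value against collection order (0, 1, 2, ...)."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = statistics.fmean(values)
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def skewness(values):
    """Third standardized moment: about 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Text histogram: one row per cell, one '#' per data point in the cell.
lo, width, counts = cell_counts(data)
for i, c in enumerate(counts):
    left = lo + i * width
    print(f"{left:6.1f} to {left + width:6.1f} | {'#' * c}")

print(f"drift per sample: {drift_slope(data):+.4f}")
print(f"skewness: {skewness(data):+.3f}")
```

A slope near zero suggests no steady drift; it will not catch cycling, which needs a look at the time plot itself. A skewness well away from zero is one sign that a transformation may be worth trying.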

    When you have a normal distribution for a data set with at least 50 points or so, you have established for yourself confidence intervals for that data. "I am (some percentage) confident that the next data point of this kind that I acquire will fall within (some number of s) of m." The confidence interval is the way you represent the fact that you can't account for everything. If you had an infinite number of data points, you would know all there is to know. But you don't have the points, so you only know what the points you do have tell you. Standard confidence intervals are:

    • 68% for m plus or minus one s
    • 95% for m plus or minus two s
    • more than 99% for m plus or minus three s

    A normal curve will have more than 99% of its area (about 99.7%, or that share of the points in a large data set) between m - 3 s and m + 3 s. That's why the term "three-sigma" is important in industry - it represents more than 99% of the cases you're likely to encounter.
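    The percentages for one, two, and three s can be checked directly. A minimal sketch, assuming an exactly normal distribution; the function name coverage is hypothetical. The area within k standard deviations of the mean is erf(k / sqrt(2)), and math.erf is in the Python standard library.

```python
import math

def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"m +/- {k} s covers {100 * coverage(k):.2f}%")

# Prints:
# m +/- 1 s covers 68.27%
# m +/- 2 s covers 95.45%
# m +/- 3 s covers 99.73%
```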

    Now, if the data set you're measuring has a very large number of points, even the remaining 1% or less may seem significant. Consider the example of misplaced mail for the post office. One percent of a million pieces of mail is ten thousand. One might easily guess that there would be enough important mail among those ten thousand pieces to generate headaches for the Postal Service. For this reason, some industries are adopting as a goal "six sigma quality." That's a better than 99.99% confidence interval: fewer than ten pieces of misplaced mail out of a million, which may generate no more than a single complaint.
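    The mail arithmetic can be worked out for any number of s. A rough sketch, assuming an exactly normal distribution; the function names are hypothetical. One caveat that is not from the text above: the figure usually quoted for six sigma programs is 3.4 defects per million, which by industry convention is one tail at 4.5 s (six s less an assumed 1.5 s long-term shift in the mean), not the much smaller area outside a strict m plus or minus 6 s.

```python
import math

def outside_per_million(k):
    """Expected points per million falling outside m +/- k s (both tails)."""
    return 1e6 * math.erfc(k / math.sqrt(2))

def one_tail_per_million(k):
    """Expected points per million falling beyond m + k s (one tail only)."""
    return 1e6 * 0.5 * math.erfc(k / math.sqrt(2))

print(round(0.01 * 1_000_000))                  # 1% of a million pieces
print(f"{outside_per_million(3):.0f}")          # outside three s, roughly 2700
print(f"{outside_per_million(6):.4f}")          # outside six s, roughly 0.002
print(f"{one_tail_per_million(4.5):.1f}")       # conventional six sigma, roughly 3.4
```

    Either way the count stays under ten per million, which is the point the mail example is making.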

    If your data set has a small number of points (fewer than 50 in most cases), then the confidence interval has a different mathematical meaning, and what you can claim is necessarily more restricted: for the same confidence, the interval is wider. There are tables (Student's t for the mean, chi-square for the variance) for making confidence interval estimates on small data sets.
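    Here is a minimal sketch of a small-sample estimate at 95% confidence for a sample of ten points. The data is invented for illustration, and the table values are hard-coded for 9 degrees of freedom (n - 1). The chi-square values bound the standard deviation; the Student's t value, from a companion table, bounds the mean.

```python
import math
import statistics

# Hypothetical small sample (n = 10), invented for illustration only.
data = [4.9, 5.1, 5.0, 5.3, 4.8, 5.2, 5.0, 4.7, 5.1, 4.9]
n = len(data)
m = statistics.mean(data)
s = statistics.stdev(data)

# Table values for 95% confidence with n - 1 = 9 degrees of freedom.
T_975 = 2.262                        # Student's t, upper 2.5% point
CHI2_025, CHI2_975 = 2.700, 19.023   # chi-square, lower and upper 2.5% points

# Interval for the true mean: m +/- t * s / sqrt(n)
half_width = T_975 * s / math.sqrt(n)
mean_interval = (m - half_width, m + half_width)

# Interval for the true standard deviation, from (n - 1) * s^2 / chi-square
s_interval = (
    math.sqrt((n - 1) * s**2 / CHI2_975),
    math.sqrt((n - 1) * s**2 / CHI2_025),
)

print(f"m = {m:.3f}, 95% interval {mean_interval[0]:.3f} to {mean_interval[1]:.3f}")
print(f"s = {s:.3f}, 95% interval {s_interval[0]:.3f} to {s_interval[1]:.3f}")
```

    With a large data set the multiplier on s / sqrt(n) would be about 1.96; the 2.262 used here is what makes the small-sample interval wider.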

    And, there are formulas, tests, and tables for using your distributions to answer your questions above. :-)

    References

    Box, G. E. P., Hunter, W. G., and Hunter, J. S., Statistics for Experimenters
    Deming, W. E., Out of the Crisis

    [Figure: a normal distribution.]

