Statistics are used to answer one of two questions
regarding data:
- Are two data points, or data sets, "different"?
- Are two variables "related" (i.e., is Y a function of X)?
You've performed an experiment and gathered data. The
answers to these questions depend on the variation in
the data - the scatter of the data about some nice,
smooth, well-behaved, and understandable continuous
function. The function represents what you figure the
data should do; the data actually does something that's
somewhat different. You perform statistical analysis
on this data both because it's doing something different
than what you predict, and also because you can't take
an infinite number of data points. It's too expensive.
Variation ALWAYS EXISTS in collected data. It is due
either to COMMON CAUSES, found within a single population,
or data set; or to SPECIAL CAUSES, which discriminate
between multiple populations of data; or both. In
general, common causes are not (and should not be)
correctable - you'd be fighting a losing battle to
try. Special causes, on the other hand, are where
systems and processes suffer losses of efficiency,
or worse. They should be corrected if it's possible.
If they are responsible for losses, correcting them is
the way to get the big win.
In general, data variation is random to some extent,
and often has the appearance of a normal distribution, or
bell curve. Such a distribution has a mean (m, or average)
and a standard deviation (s, roughly the typical distance
between an individual data point and the mean; precisely,
the square root of the average squared difference from the
mean), and these numbers provide both specific meaning and
tools for answering the two questions above.
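
As a minimal sketch of how m and s are computed, the Python below uses made-up readings and the sample form of the standard deviation (dividing by n - 1). That denominator is an assumption here; the text does not say which form it intends.

```python
import math

def mean_and_std(data):
    """Return the sample mean m and sample standard deviation s."""
    n = len(data)
    m = sum(data) / n
    # s is the square root of the average squared deviation from m.
    # Dividing by n - 1 (not n) is the usual choice for a sample.
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return m, s

readings = [9.8, 10.1, 10.0, 9.9, 10.2]  # hypothetical measurements
m, s = mean_and_std(readings)
print(f"m = {m:.3f}, s = {s:.3f}")
```

With only five points these numbers are rough estimates; they are shown to make the definitions concrete, not to support a conclusion.
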
To understand your data's variation:
- Prepare a histogram. Each data
point is grouped into a "cell," a small portion
of your data's range. You plot the number of
points in each cell against the range of that cell.
- Plot the data as a function of time
(assuming the data was collected at different times or
over some duration). Examine that plot to ascertain, and
correct for, any time-dependencies. Such dependencies
might include "drifting," which might indicate system
wear or temperature elevation; or "cycling," which
might indicate an unwelcome alternating dynamic such
as unbalanced rotating machinery. In such cases, both
the data and the way it's distributed are moving around
in time. This means that the process from which you
gather the data is statistically "unstable." It has
both common cause and special cause variation. The
special cause(s) must be rooted out and addressed.
Recognize, by the way, that data doesn't always fall
in a normal distribution. Sometimes it only looks somewhat
like one; in that case it can often be treated as normal,
possibly after a transformation. If it doesn't look like
one at all, you have to treat the data another way.
- Once a distribution is established,
examine it for skewness, kurtosis (peakedness), and
outliers. If the curve is lopsided,
flattened or excessively peaked, or bimodal (two
humps), there are often transformations that can be
performed on the data to place it in a symmetric
distribution. Outliers (usually just a point or two,
well off the mean) can change the way the data is
interpreted. Those points should be addressed
separately.
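
The checks in the list above can be sketched in a few lines of Python. Everything here is illustrative: the function names, the readings, the five-cell histogram, and the three-standard-deviation cutoff for outliers are all assumptions, not rules from the text. Kurtosis would follow the same pattern as the skewness function, using the fourth power instead of the third.

```python
import math

def histogram(data, n_cells):
    """Count the points falling in each of n_cells equal-width cells."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_cells  # assumes the data is not all one value
    counts = [0] * n_cells
    for x in data:
        # The largest value would index one past the end, so clamp it
        # into the last cell.
        i = min(int((x - lo) / width), n_cells - 1)
        counts[i] += 1
    return counts

def drift_slope(data):
    """Least-squares slope of the data against its collection order."""
    n = len(data)
    t_mean = (n - 1) / 2
    y_mean = sum(data) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(data))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def skewness(data):
    """Third standardized moment: near 0 for a symmetric distribution."""
    n = len(data)
    m = sum(data) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    return sum(((x - m) / sd) ** 3 for x in data) / n

def outliers(data, k=3.0):
    """Points lying more than k standard deviations from the mean."""
    n = len(data)
    m = sum(data) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    return [x for x in data if abs(x - m) > k * sd]

# Hypothetical readings in collection order, with one suspicious point.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1,
            10.0, 9.9, 10.3, 10.0, 10.1, 14.0]
print("cell counts:", histogram(readings, 5))
print("drift per sample:", round(drift_slope(readings), 3))
print("skewness:", round(skewness(readings), 2))
print("outliers:", outliers(readings))
```

A slope well away from zero hints at drifting. Note that a single extreme point inflates both the slope and the skewness, which is one reason to address outliers separately before interpreting the rest.
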
When you have a normal distribution for a data set with
at least 50 points or so, you have established for
yourself confidence intervals for that data. "I am
(some percentage) confident that the next data point
of this kind that I acquire will fall within (some
number of s) of m." The
confidence interval is the way you represent the fact
that you can't account for everything. If you had an
infinite number of data points, you would know all
there is to know. But you don't have the points, so
you only know what the points you do have tell you.
Standard confidence intervals are
- about 68% for m plus or minus one s
- about 95% for m plus or minus two s
- more than 99% (about 99.7%) for m plus or minus three s
A normal curve will have more than 99% of its area (or,
more than 99% of the points in a large data set) between
m - 3 s and m + 3 s.
That's why the term "three-sigma" is
important in industry - it represents more than 99% of the
cases you're likely to encounter.
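
These percentages can be checked directly. For an ideal normal distribution, the fraction of the area within k standard deviations of the mean is erf(k / sqrt(2)). The short sketch below uses only Python's standard math module; the function name `coverage` is made up for illustration.

```python
import math

def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"m +/- {k} s covers {coverage(k):.2%}")
```

This gives roughly 68.27%, 95.45%, and 99.73% for one, two, and three s. Real data only approximates these figures, and only to the extent that it is actually normal.
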
Now, if the data set you're measuring has a very large
number of points, even the remaining 1% or less may seem
significant. Consider the example of misplaced mail
for the post office. One percent of a million pieces
of mail is ten thousand. One might easily guess that
there would be enough important mail among those ten
thousand pieces to generate headaches for the Postal
Service. For this reason, some industries are adopting
as a goal "six sigma quality." That's a better than 99.99%
confidence interval: fewer than ten pieces of misplaced mail
out of a million (the conventional six sigma figure is about
3.4 defects per million). That may generate no more than a
single complaint.
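
The arithmetic behind these counts can be sketched as follows, assuming an ideal normal distribution. The function names are made up for illustration. The last line reflects an industry convention that the text does not spell out: the commonly quoted six sigma figure allows the process mean to shift by 1.5 s, so it counts only the single tail beyond 4.5 s.

```python
import math

def per_million_outside(k):
    """Expected points per million outside m +/- k s (both tails)."""
    return 1_000_000 * math.erfc(k / math.sqrt(2))

def per_million_one_tail(k):
    """Expected points per million beyond m + k s (one tail only)."""
    return 1_000_000 * 0.5 * math.erfc(k / math.sqrt(2))

print(f"1% of a million:            {0.01 * 1_000_000:,.0f}")
print(f"outside three sigma:        {per_million_outside(3):,.0f}")
print(f"outside six sigma:          {per_million_outside(6):.4f}")
# Assumption: the conventional 1.5 s shift, leaving a 4.5 s one-sided margin.
print(f"six sigma, 1.5 s shift:     {per_million_one_tail(4.5):.1f}")
```

Either way of counting lands well under ten per million, which is the point of the mail example.
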
If your data set has a small number of points (fewer
than 50 in most cases), then the confidence interval
must be computed differently, and is necessarily
wider. There are tables (Student's t for the mean,
"chi-square" for the variance) for making confidence
interval estimates on small data sets.
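
As one hedged illustration of a table-based estimate, the sketch below computes a 95% confidence interval for the mean of a small sample using Student's t. The lookup table holds a few standard two-sided 95% critical values; the function name, the choice of 95%, and the readings are assumptions made for the example. This is an interval for the mean, not for the next individual data point.

```python
import math

# Two-sided 95% critical values of Student's t, keyed by degrees of
# freedom (n - 1). Only a few rows of a standard table are included.
T_95 = {4: 2.776, 9: 2.262, 19: 2.093, 29: 2.045}

def mean_confidence_interval_95(data):
    """Return (low, high) bounds of a 95% confidence interval for the mean."""
    n = len(data)
    df = n - 1
    if df not in T_95:
        raise ValueError(f"no table entry for {df} degrees of freedom")
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    half_width = T_95[df] * s / math.sqrt(n)
    return m - half_width, m + half_width

readings = [9.8, 10.1, 10.0, 9.9, 10.2]  # hypothetical measurements
low, high = mean_confidence_interval_95(readings)
print(f"95% interval for the mean: {low:.3f} to {high:.3f}")
```

The multiplier for five points (2.776) is noticeably larger than the value of about 2 used for large samples, which is how the table makes small-sample intervals wider.
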
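And, there are formulas, tests, and tables for using
your distributions to answer your questions above: for
example, t-tests for whether two data sets are "different,"
and regression and correlation for whether two variables
are "related." :-)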
References
- Box, Hunter, and Hunter, Statistics for Experimenters
- Deming, W. E., Out of the Crisis