Regression Plots
Ron Graham with John Grosh, Wolfgang Hees, Lisa Henn, and Mark Rogers
All graphical representation of data, especially on
overhead transparencies, is only an approximation
of the "real data." When you look at the screen from
20 m away, it's only a *very rough* approximation.
The rhetorical concern in displaying regression plots is whether readers can tell what the data means. There are obviously numerous design considerations:
The Ramage and Bean book has an example of an x-y plot showing dollars of sales for some make-believe company as a function of time. There are about ten data points on the graph, and the trend is upward, monotonically increasing and parabolic. The data is referenced to zero dollars on a linear scale. Ramage and Bean say this is a bad plot, that it forces an incorrect interpretation of the data. They show another plot of the same data, in which the vertical scale goes to about four times the maximum data value, making that parabola look more like a straight line. They say that's a good plot. I absolutely disagree. I have always felt that axes should serve as boundaries for the data that actually exists.
On the other hand, it's possible that historical information suggests that the "straight line" aspect of the data's character is more representative of what's going on than is the "parabolic" aspect. But Ramage and Bean used very few points and no history in their case, so if that's true it's simply a bad example. They could have used a best-fit line instead of an expanded scale.
Scale Abuse
Tufte gives an example of another x-y plot with some dependent variable that shows an almost sinusoidal behavior as a function of time, with pronounced peaks and valleys. He says that plot is bad, then re-plots the data using a y-axis of about one-sixth the length of the original. He also performs some sort of smoothing on the data -- I can't even understand his explanation of how he does it -- and the result is a very narrow graph with little bumps instead of large peaks and valleys. And he says it's better. I'm sorry, but I think it's worse. A LOT worse. And dangerously worse. You don't just present data visually; you present the message the data sends. When you shrink the data's visual span, you obscure the message.
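One way to put numbers on both examples is to compute how much height the data actually occupies on the page under each axis choice. The sketch below is illustrative only: the sales figures are made up, and `plotted_height` is a hypothetical helper, not anything taken from Ramage and Bean or from Tufte.

```python
def plotted_height(values, axis_min, axis_max, axis_length=1.0):
    """Height the data occupies on the page, in the same units as axis_length.

    axis_min and axis_max are the data values at the bottom and top of the
    axis; axis_length is the physical length of the axis (assumed units).
    """
    if axis_max <= axis_min:
        raise ValueError("axis_max must be greater than axis_min")
    fraction = (max(values) - min(values)) / (axis_max - axis_min)
    return fraction * axis_length


# Made-up sales figures: ten points rising along a parabola.
sales = [10 + 0.9 * t * t for t in range(10)]

# Zero-referenced axis whose top is the largest data value.
tight = plotted_height(sales, 0.0, max(sales))

# Same axis length, but the scale runs to four times the largest value.
stretched = plotted_height(sales, 0.0, 4.0 * max(sales))

# Same scale as the tight plot, but the axis is one-sixth as long.
shortened = plotted_height(sales, 0.0, max(sales), axis_length=1.0 / 6.0)

print(f"tight: {tight:.3f}  stretched: {stretched:.3f}  shortened: {shortened:.3f}")
```

Under these assumptions, stretching the scale to four times the maximum leaves the data a quarter of the height it had, and shortening the axis to one-sixth leaves it a sixth. Either way the curvature or the peaks have that much less room to show.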
On the other hand, if the objective is to observe some trend that's obscured by (for instance) an overlying sinusoid, and you want to get it out of the way, the sinusoid can be removed with a "notch filter" (a narrow band-stop filter, though that's not what Tufte actually did). If you want to see some trend that's obscured by a routine, expected parabola, you can always plot the data on a scale other than linear. That's what statisticians do. But to report the message contained in the data, I still think axes have to provide a visible boundary to the data that's actually there.
There are various forms of scale abuse:
The guiding principle in selecting scales should be making the main point of the presentation. If you change scales, smooth curves, and so on, you might divert attention from the "true" message of the data. But what is the truth, in YOUR presentation of what YOU want to say to an audience that can't verify the data? The trick is to make your data fit your argument without lying. It comes back to the usual dilemma of how much of the truth you are going to tell your audience. Too much information, and they either can't follow or concentrate on what's unimportant; too little, and they could accuse you of lying by omitting parts of the truth.
Zero Reference
The idea of requiring plots (and we're obviously talking x-y/regression/scatter plots here) to have a zero origin or zero center is kind of confusing to me. I mean, what if the data never goes near zero? Isn't that a kind of scale abuse? Even if you have scatter about a zero mean, if the scatter isn't the major feature of the data, to concentrate on it can be a kind of scale abuse.
In statistical process control (for instance), you might look at a plot centered at zero to identify "special causes." (Though you might not find them if the data's mean and scatter are both some distance away from zero.) But with obvious special causes identified, you might then want to look at the data with the large trends removed (there are all sorts of statistical tools for that) to see if there are really only "common causes" left.
I once used in class the example of a dynamic step response -- something that comes up in (for instance) control systems, vibrations, and structural dynamics all the time. A non-engineer was able to easily determine from an unlabeled sample plot what the most critical characteristics in the data were. She didn't know the words, but she picked out max overshoot, rise time, and settling time as though she'd been taking a controls class.
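For readers who have not met those three terms, here is a minimal sketch of how they might be read off a sampled step response. Everything specific in it is an assumption made for illustration: the response is a textbook underdamped second-order system with made-up damping and frequency, rise time is taken as the 10% to 90% interval, and settling is judged against a 2% band. Other conventions exist.

```python
import math


def step_response(t, zeta=0.3, omega_n=2.0):
    """Unit step response of an underdamped second-order system (assumed model)."""
    root = math.sqrt(1.0 - zeta ** 2)
    omega_d = omega_n * root  # damped natural frequency
    decay = math.exp(-zeta * omega_n * t)
    return 1.0 - decay * (math.cos(omega_d * t) + (zeta / root) * math.sin(omega_d * t))


def summarize(times, values, final_value=1.0, band=0.02):
    """Return (max overshoot in percent, 10-90% rise time, settling time)."""
    overshoot_pct = 100.0 * (max(values) - final_value) / final_value

    # First samples at which the response reaches 10% and 90% of the final value.
    t10 = next(t for t, y in zip(times, values) if y >= 0.1 * final_value)
    t90 = next(t for t, y in zip(times, values) if y >= 0.9 * final_value)

    # Settling time: the sample just after the last one outside the band.
    outside = [i for i, y in enumerate(values)
               if abs(y - final_value) > band * final_value]
    if outside:
        settle_index = min(outside[-1] + 1, len(times) - 1)
    else:
        settle_index = 0
    return overshoot_pct, t90 - t10, times[settle_index]


# Sample 20 seconds of response at 1 ms spacing.
times = [i * 0.001 for i in range(20001)]
values = [step_response(t) for t in times]
overshoot_pct, rise_time, settling_time = summarize(times, values)

print(f"overshoot: {overshoot_pct:.1f}%  rise: {rise_time:.2f} s  settle: {settling_time:.2f} s")
```

None of the three numbers depends on where the axis origin sits; they are all measured against the final value of the response.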
There's a case where you could summarize a whole plot in three numbers, especially if your readers have seen those types of plots before. And it's also a case where a zero origin doesn't buy you anything -- unless the response is slow, as in this case:
Case Study
One of my auxiliary jobs at one time, before the Excel Era, was to do ad hoc (that makes two of us here) statistical programming for engineering and manufacturing, and I had done some regression work that plotted up real pretty. A buddy brought me some business history for one of our burgeoning product lines and asked me to "do the same thing with this stuff." He said, "And plot me a trend line out to three years."
References
Tufte, E. Visual Explanations. Graphics Press, 1997.
What You Can Do