Regression Plots
Ron Graham
with John Grosh, Wolfgang Hees, Lisa Henn, and Mark Rogers
All graphical representation of data, especially on overhead transparencies, is only an approximation of the "real data." When you look at the screen from 20 m away, it's only a *very rough* approximation.

The rhetorical concern in displaying regression plots is whether readers can tell what the data means. Several design considerations follow from that:

  • trends v. scatter
  • curves that best fit the data
  • scales that bound data v. scales that bound references
  • grids and ticks that locate particular points v. grids and ticks that obscure the data
  • confidence limits and sources of uncertainty that bound the data

The Ramage and Bean book has an example of an x-y plot showing dollars of sales for some make-believe company as a function of time. There are about ten data points on the graph, and the trend is upward, monotonically increasing and parabolic. The data is referenced to zero dollars and plotted on a linear scale.

Ramage and Bean say this is a bad plot, that it forces an incorrect interpretation of data. They show another one of the same data, in which the vertical scale goes to about four times the maximum data value, making that parabola look more like a straight line. They say that's a good plot. I absolutely disagree. I have always felt that axes should serve as boundaries for the data that actually exists.

On the other hand, it's possible that historical information suggests that the "straight line" aspect of the data's character is more representative of what's going on than is the "parabolic" aspect. But Ramage and Bean used very few points and no history in their case, so if that's true it's simply a bad example. They could use a best-fit line instead of an expanded scale.
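The best-fit-line alternative can be sketched in a few lines. This is only an illustration: the ten data points are made up to rise roughly parabolically, and `fit_line` is a hypothetical helper, not anything from Ramage and Bean. The residuals show what the straight line leaves out, which is the information an expanded scale would hide.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Made-up data (an assumption): ten points on an upward parabola
years = list(range(10))
sales = [1.0 + 0.5 * t + 0.2 * t * t for t in years]

intercept, slope = fit_line(years, sales)

# Residuals = data minus line; a systematic pattern here means curvature
residuals = [y - (intercept + slope * t) for t, y in zip(years, sales)]
for t, r in zip(years, residuals):
    print(f"year {t}: residual {r:+.2f}")
```

With data like this, the residuals are positive at both ends and negative in the middle. Drawing the line on axes that still bound the data lets the audience see both the trend and the departure from it.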

Scale Abuse

Tufte gives an example of another x-y plot with some dependent variable that shows an almost sinusoidal behavior as a function of time, with pronounced peaks and valleys. He says that plot is bad, then re-plots the data using a y-axis of about one-sixth the length of the original. He also performs some sort of smoothing on the data -- I can't even understand his explanation of how he does it -- and the result is a very narrow graph with little bumps instead of large peaks and valleys. And he says it's better. I'm sorry, but I think it's worse. A LOT worse. And dangerously worse.

You don't just present data visually; you present the message the data sends. When you shrink the data's visual span, you obscure the message.

On the other hand, if the objective is to observe some trend that's obscured by (for instance) an overlying sinusoid, and you want to get it out of the way, the sinusoid can be addressed by a "notch filter" (that is, a narrow band-stop filter, though that's not what Tufte actually did). If you want to see some trend that's obscured by a routine, expected parabola, you can always plot the data on a scale other than linear. That's what statisticians do. But to report the message contained in the data, I still think axes have to provide a visible boundary to the data that's actually there.

There are various forms of scale abuse:

  • Scales that mute trends or exaggerate scatter (you might consider a note describing such adjustments as filters, rolling averages, or seasonal fluctuations -- or any smoothing used, because some types of smoothing destroy the original signal).

    Here is a case where you want to overlook such harmonic fluctuations: spacecraft solar disturbances have a known component of variation that acts sinusoidally (one cycle per orbit). Sometimes you want to know what's superimposed on top of that. You can find out with

    • a notch filter
    • a highpass filter
    • a least-squares fit that includes the sinusoid
    You may find such an analytical approach to be best taken before you get to the point of presentation. Data reduction is, however, critical to the engineer's argument.
  • Scales enlarged to include outliers, which fog the reader's view of the majority of the data (you might consider removing outliers, perhaps noting that you have done so).
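The third option under the first item above, a least-squares fit that includes the sinusoid, might look something like the following sketch. Everything specific here is an assumption for illustration: the 90-minute orbital period, the synthetic data, and the helper names `solve` and `least_squares`. The model is offset + drift·t + c·cos(ωt) + s·sin(ωt) with ω known, which keeps the fit linear in its coefficients.

```python
import math

def solve(a, b):
    """Solve a x = b for a small square system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def least_squares(rows, ys):
    """Least-squares coefficients via the normal equations (fine for tiny problems)."""
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(ata, aty)

PERIOD = 90.0                      # assumed orbital period, minutes
omega = 2.0 * math.pi / PERIOD     # one cycle per orbit

# Synthetic data (an assumption): slow drift plus a once-per-orbit sinusoid
ts = [float(t) for t in range(0, 360, 5)]
ys = [2.0 + 0.01 * t + 1.5 * math.sin(omega * t + 0.4) for t in ts]

rows = [[1.0, t, math.cos(omega * t), math.sin(omega * t)] for t in ts]
offset, drift, c, s = least_squares(rows, ys)
amplitude = math.hypot(c, s)

# What is superimposed on the orbital sinusoid: data minus the fitted sinusoid
remaining = [y - c * math.cos(omega * t) - s * math.sin(omega * t)
             for t, y in zip(ts, ys)]
print(f"offset={offset:.3f} drift={drift:.4f} amplitude={amplitude:.3f}")
```

If you plot `remaining` instead of the raw data, a note on the graph saying that the once-per-orbit component was fitted and removed tells the audience what they are looking at.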

The guiding principle in selecting scales should be making the main point of the presentation. If you have to change scales, smooth curves, etc., you might divert attention from the "true" message of the data. But what is the truth, in YOUR presentation of what YOU want to say to an audience that can't verify the data?

The trick is to make your data fit your argument without lying. It comes back to the usual dilemma of how much of the truth you are going to tell your audience. Too much information, and they either can't follow or they concentrate on what's unimportant; too little, and they could accuse you of lying by omitting parts of the truth.

Zero Reference

The idea of requiring plots (and we're obviously talking x-y/regression/scatter plots here) to have a zero origin or zero center is kind of confusing to me. I mean, what if the data never goes near zero? Isn't that a kind of scale abuse? Even if you have scatter about a zero mean, if the scatter isn't the major feature of the data, to concentrate on it can be a kind of scale abuse.

In statistical process control (for instance), you might look at a plot centered at zero to identify "special causes." (Though you might not find them if the data's mean and scatter are both some distance away from zero.) But with obvious special causes identified, then you might want to look at the data with the large trends removed (all sorts of statistical tools for that) to see if there are really only "common causes" left.

I once used in class the example of a dynamic step response -- something that comes up in (for instance) control systems, vibrations, and structural dynamics all the time. A non-engineer was able to easily determine from an unlabeled sample plot what were the most critical characteristics in the data. She didn't know the words, but she picked out max overshoot, rise time, and settling time as though she'd been taking a controls class. There's a case where you could summarize a whole plot in three numbers, especially if your readers have seen those types of plots before. And it's also a case where a zero origin doesn't buy you anything -- unless the response is slow, as in this case:

[Figure: step response]
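The three numbers mentioned above (max overshoot, rise time, settling time) can be pulled from sampled data with very little code. The sketch below is an assumption-laden illustration: it generates a textbook underdamped second-order step response rather than using the class example, defines rise time as the 10%-to-90% interval, and uses a 2% settling band. Other definitions are common.

```python
import math

def step_response(t, zeta, wn):
    """Unit-step response of an underdamped second-order system (0 < zeta < 1)."""
    wd = wn * math.sqrt(1.0 - zeta ** 2)          # damped natural frequency
    decay = math.exp(-zeta * wn * t)
    return 1.0 - decay * (math.cos(wd * t) + zeta * wn / wd * math.sin(wd * t))

def step_metrics(ts, ys, final=1.0, band=0.02):
    """Return (max overshoot as a fraction, 10-90% rise time, settling time)."""
    overshoot = (max(ys) - final) / final
    t10 = next(t for t, y in zip(ts, ys) if y >= 0.1 * final)
    t90 = next(t for t, y in zip(ts, ys) if y >= 0.9 * final)
    # Settling time: last sampled instant still outside the +/- band
    outside = [t for t, y in zip(ts, ys) if abs(y - final) > band * abs(final)]
    settling_time = outside[-1] if outside else ts[0]
    return overshoot, t90 - t10, settling_time

# Assumed system parameters, chosen only to give a visible overshoot
ZETA, WN = 0.3, 2.0
ts = [0.01 * i for i in range(1501)]              # 0 to 15 s
ys = [step_response(t, ZETA, WN) for t in ts]

overshoot, rise_time, settling_time = step_metrics(ts, ys)
print(f"overshoot={overshoot:.1%} rise={rise_time:.2f}s settle={settling_time:.2f}s")
```

Annotating those three values directly on the plot is one way to make sure readers who have not seen such plots before pick out the same features.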

Case Study

One of my auxiliary jobs at one time, before the Excel Era, was to do ad hoc (that makes two of us here) statistical programming for engineering and manufacturing, and I had done some regression work that plotted up real pretty. A buddy brought me some business history for one of our burgeoning product lines and asked me to "do the same thing with this stuff." He said, "And plot me a trend line out to three years."

So I did, and I found that the best data fit was a second-order polynomial with a trend curve that indicated a leveling off of business. Then I put on the confidence bands, which showed that extrapolations of more than nine months were even more useless than usual, and handed the guy the beautiful graph along with a paragraph of explanation.
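The way confidence bands widen under extrapolation can be reproduced with a small sketch. None of this is the original analysis: the 24 months of sales figures are invented, the "noise" is a deterministic wiggle so the run is repeatable, and the band is a rough plus-or-minus two standard errors on the fitted curve rather than an exact t-based interval.

```python
import math

def solve(a, b):
    """Solve a x = b for a small square system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

# Invented data (an assumption): sales that level off, plus a wiggle
months = [float(mth) for mth in range(24)]
sales = [10.0 + 3.0 * x - 0.06 * x * x + 1.5 * math.sin(2.3 * x) for x in months]

# Second-order polynomial fit via the normal equations
rows = [[1.0, x, x * x] for x in months]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(rows, sales)) for i in range(3)]
beta = solve(xtx, xty)

# Residual variance with n - 3 degrees of freedom
fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
s2 = sum((y - f) ** 2 for y, f in zip(sales, fitted)) / (len(months) - 3)

def curve_se(x0):
    """Standard error of the fitted curve at x0: sqrt(s2 * v' (X'X)^-1 v)."""
    v = [1.0, x0, x0 * x0]
    w = solve(xtx, v)
    return math.sqrt(s2 * sum(vi * wi for vi, wi in zip(v, w)))

for label, x0 in [("mid-data", 12.0), ("9 months out", 32.0), ("3 years out", 59.0)]:
    yhat = beta[0] + beta[1] * x0 + beta[2] * x0 * x0
    print(f"{label:>13}: {yhat:8.1f} +/- {2.0 * curve_se(x0):.1f}")
```

The half-width of the band grows quickly once `x0` leaves the range of the data, which is the graphical argument against a three-year trend line drawn from two years of history.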

In short order, he came back in and told me the VP (now he tells me!) who needed this graph was making a presentation to the HOME OFFICE and couldn't "carry bullshit" like this in front of THEM. The VP had drawn a nice straight line with a big red marking pen through the data points showing doubling of sales every year. "Fix it!" my buddy told me, "and take off those extra curves."

At that point I told my buddy the story about the accountant who got the job. My buddy didn't think it was funny. After that I tried to avoid VPs and do analysis for engineering experiments where the message was to be found in the data and not where the data was to be adjusted to suit the message.

I guess the moral of this tale is, "It's not enough to know your audience; you have to know where your audience is going to repeat your story."

References

Tufte, E. Visual Explanations. Graphics Press, 1997.
Ramage, Bean, and Johnson. Writing Arguments. Viacom, 1998.


What You Can Do

  1. Integrate the visual presentation of data into an overall communications strategy. If the text suggests that the data fits a pattern, and the graphic matches the pattern only with an adjusted scale, then the graphic is lying, but so is the text.
  2. Your audience should not have to re-do your analysis in order to understand your point.
  3. Your audience may need to come to its own conclusions about what data means. Use smoothing and scale adjustments with caution.
