With the summer gone and the winter semester in full swing, my mind is once again filled with teaching statistics. As I pointed out in a previous post on this blog, I am hardly a full-blown statistician, but as economic theory goes, one does best to specialize in one’s comparative advantage, and for me that is definitely on the quantitative side.

A side effect of teaching (and learning?) statistics is that your mind is slightly more geared towards picking up news items where statistics play a role, and boy-oh-boy have I been spoiled, especially now that new tax plans have been announced in both the US and the Netherlands and there is a bit of a debate about their effects.

That brings me to the title of this post, which is inspired by this recent post on the environmental economics blog.

In statistics we tend to use the average (or the mean) a LOT. There are several reasons for that, among others: i) it is easily understood and communicated; ii) it has a few nice theoretical properties, such as that the mean of a random sample is a good estimate of the mean of the larger group it was drawn from; iii) it is an important parameter in “ye good old” normal distribution; and iv) there are a few simple tests that allow us to compare means between subgroups to investigate whether they differ from one another, which in turn sometimes allows us to draw conclusions on causality.

However, just reporting the mean can be misleading, as the post mentioned above pointed out, especially when the distribution matters. In response to the question of why people do not support a tax plan that raises average income, a tweet by Franklin Leonard asks: if I divide 10 apples among 10 people by giving one person all 10 and the other 9 none, on average everybody gets 1 apple. Then why are 9 people mad at me?

This is where the standard deviation comes in. Skipping the technical details of how to calculate it, it basically tells us how much the individual data points vary around the mean. As a rule of thumb, if the data are roughly normally distributed you can expect about 95% of the individual data points to lie within two standard deviations of the mean, that is, in the interval ranging from two times the standard deviation below the mean to two times the standard deviation above it. Using the apple example above, let me illustrate the effect:

Apples received | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 |
Person 1 | 10 | 8 | 3 | 2 | 1 |
Person 2 | 0 | 1 | 3 | 2 | 1 |
Person 3 | 0 | 1 | 2 | 2 | 1 |
Person 4 | 0 | 0 | 1 | 2 | 1 |
Person 5 | 0 | 0 | 1 | 1 | 1 |
Person 6 | 0 | 0 | 0 | 1 | 1 |
Person 7 | 0 | 0 | 0 | 0 | 1 |
Person 8 | 0 | 0 | 0 | 0 | 1 |
Person 9 | 0 | 0 | 0 | 0 | 1 |
Person 10 | 0 | 0 | 0 | 0 | 1 |
Mean | 1 | 1 | 1 | 1 | 1 |
Standard deviation | 3 | 2.366432 | 1.183216 | 0.894427 | 0 |

In all cases the average person gets 1 apple, but these distributions are far from equal. Clearly, as the distribution becomes more equitable, the standard deviation decreases. That is hardly surprising: as things become more equitable there is less variation around the mean, and that is exactly what a standard deviation is supposed to measure. (Another way to solve the problem here would be to report either the median or the modal number of apples.)
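For readers who want to check the numbers themselves, here is a minimal sketch using Python's standard `statistics` module. The variable names (`distributions`, `summary`) are my own, and I am assuming the population formula for the standard deviation (`pstdev`, dividing by the number of people), since that is the one that reproduces the figures in the table.

```python
from statistics import mean, median, multimode, pstdev

# The five apple distributions from the table, one list per column
# (ten people each, ten apples in total).
distributions = [
    [10, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [8, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [3, 3, 2, 1, 1, 0, 0, 0, 0, 0],
    [2, 2, 2, 2, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
]

summary = []
for apples in distributions:
    summary.append({
        "mean": mean(apples),
        # Population standard deviation: assumes these ten people
        # are the whole group, not a sample from a larger one.
        "sd": round(pstdev(apples), 6),
        "median": median(apples),
        # multimode returns every most-common value, so ties are visible.
        "modes": sorted(multimode(apples)),
    })

for row in summary:
    print(row)
```

The mean is 1 in every column, while the median and mode already tell a different story: in the first three columns the most common outcome is zero apples. If you treated the ten people as a sample instead, `statistics.stdev` (which divides by n - 1) would give slightly larger values, for example about 3.16 rather than 3 for the first column.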

Another, related case where just reporting the mean can be misleading is when we want to use a sample to draw conclusions about a larger group that is itself very variable. The more variation there is in the group, the harder it is to make confident statements about the group as a whole based on a sample alone.
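A small simulation can make this concrete. The sketch below is only an illustration under assumptions of my own: it treats two of the apple distributions as the "larger group", draws samples of five people with replacement, and repeats that 2000 times. The function name `sample_means` and all of these settings are hypothetical choices, not anything prescribed by theory.

```python
import random
from statistics import mean, pstdev

random.seed(42)  # fixed seed so the sketch is repeatable

# Two groups with the same mean (1 apple) but very different spread.
even_group = [2, 2, 2, 2, 1, 1, 0, 0, 0, 0]
uneven_group = [10, 0, 0, 0, 0, 0, 0, 0, 0, 0]

def sample_means(population, sample_size=5, repeats=2000):
    """Draw many random samples (with replacement) and return each sample's mean."""
    return [
        mean(random.choices(population, k=sample_size))
        for _ in range(repeats)
    ]

# How much do the sample means themselves vary from sample to sample?
spread_even = pstdev(sample_means(even_group))
spread_uneven = pstdev(sample_means(uneven_group))

print(f"spread of sample means, even group:   {spread_even:.2f}")
print(f"spread of sample means, uneven group: {spread_uneven:.2f}")
```

Both groups have a true mean of 1, but the sample means from the uneven group bounce around much more, so any single sample is a less trustworthy guide. When sampling with replacement like this, the spread of the sample means is roughly the group's standard deviation divided by the square root of the sample size.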

I realize that it is too much to ask that everyone report a standard deviation whenever they report a mean. Even in scientific articles this is not always done (although it should be). But at least let us be aware that just relying on the average for your conclusions is mean.

Photo credit, featured image:

“I’m 3 points Standard Deviation off the mean” by Dean Myers. https://www.flickr.com/photos/deanmeyers/4438377176/in/photostream/