How to Deal with Statistics

Grabiner, Judith V.

The authority of numbers is often used to mislead. But this isn’t an argument for rejecting all statistics. To tell what’s reliable and what’s not, ask the right questions.

What kind of average?

There are three measures for which the term “average” is used.

  • The mean, or arithmetic average: add up all the values and divide by how many there are. Useful if what you
    need to know is the total of something per individual.
  • The median: half above, half below. Gives a sense of the middle of whatever we’re looking at.
  • The mode: the individual value that occurs for the largest number of individuals. Useful if you need exactly that.

When is the mean a good measure? If it really matters what would happen if you took the whole thing and divided it equally between all the people.

When is it not so good? If there are just a few extreme values, the mean is changed a lot. If Bill Gates joins a group of people, their mean income is increased tremendously. That doesn’t mean they’re all well off.

When is the median a good measure? If you are limited to just one measure, medians give a sense of how something is distributed within a population. That’s why economists use it for incomes.
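To see how differently the mean and the median react to one extreme value, here is a small sketch in Python. The incomes are invented for illustration, including the billion-dollar one; nothing here is real data.

```python
from statistics import mean, median

# Hypothetical annual incomes (in dollars) for a group of five people.
incomes = [30_000, 40_000, 50_000, 60_000, 70_000]

# The same group after one person with a made-up, extremely high income joins.
with_billionaire = incomes + [1_000_000_000]

# Before: the mean and the median agree.
print(mean(incomes), median(incomes))

# After: the mean jumps past 100 million, while the median barely moves.
print(mean(with_billionaire), median(with_billionaire))
```

Nobody in the original group got any richer, yet the mean says they did. The median still describes what a typical member earns.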

When is the median not so good? When you care about all the values, not just the center. “Median survival time after diagnosis is 8 months” says that half the patients live longer than 8 months; it doesn’t say whether that “longer” is typically 9 months or 50 years.

When is the mode a good measure? If you want the largest possible subgroup with something in common. For instance, if we know which brand is preferred by the largest group of buyers, we stock that one.

When is the mode not a good measure? When “most popular” doesn’t mean “more than half” but simply “a very small fraction though a tiny bit more frequent than any other.” Or when you care more about how the values are distributed than what there’s most of.
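A quick way to check whether "most popular" means much is to look at the mode's share of the whole. The sketch below uses an invented survey of 100 buyers and five made-up brands.

```python
from collections import Counter

# Hypothetical survey: each entry is one buyer's preferred brand.
preferences = ["A"] * 22 + ["B"] * 20 + ["C"] * 20 + ["D"] * 19 + ["E"] * 19

counts = Counter(preferences)

# most_common(1) returns a list holding the single most frequent (value, count) pair.
mode_brand, mode_count = counts.most_common(1)[0]
mode_share = mode_count / len(preferences)

print(mode_brand, mode_share)
```

Brand A is the mode here, but with well under half the buyers, and only a hair ahead of the rest.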

Always ask whether the type of average chosen is the most illuminating, or instead is the one which best serves someone’s interest.

What’s a normal distribution? A bell curve?

If you take a large number of measurements of the same thing, the measurements group around the true value in a bell-shaped curve. Many human traits are distributed like this as well: height, for instance. 19th-century statisticians called the person in the middle the “average man” or the “norm,” so the curve is called the “normal curve.” The value in the middle is the median, and for normal distributions, whose graphs are symmetric with the highest point in the middle, the median is the same as the mean and the mode.

Nonetheless, this “average person” does not exist. It’s a statistical abstraction. The “average” is often useful, but it’s the individuals who are real. And the more varied they are, the more important this point is.

How widely the various values are spread around the middle can be as important as the middle itself. A wide bell curve says that many values are far from the mean; a skinny one means that most values are very close to the mean. The usual measure of how the values are spread is called the standard deviation. In a population of men with a median height of 5’9” where almost everybody is very close to 5’9”, the average height describes the population well, but if many men have heights less than 5 feet or more than 6 feet 3 inches, using just the median glosses over lots of variation; you need the spread, whatever you choose to call it.
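As a rough illustration, the two made-up groups below have the same mean and the same median height (69 inches, or 5'9"), but very different spreads. The sketch uses the population standard deviation from Python's standard library; the heights themselves are assumptions chosen to make the arithmetic easy.

```python
from statistics import mean, median, pstdev

# Two hypothetical groups of five men, heights in inches.
tight = [68, 69, 69, 69, 70]   # everybody close to 5'9"
wide = [59, 64, 69, 74, 79]    # from under 5 feet to over 6'3"

for name, heights in [("tight", tight), ("wide", wide)]:
    # pstdev is the population standard deviation: the usual measure of spread.
    print(name, mean(heights), median(heights), round(pstdev(heights), 2))
```

The middle is identical for both groups; only the standard deviation reveals that one of them is far more varied.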

What if we have two distinct distributions for two different groups?

Imagine two bell curves representing, say, the physical strength of women and of men on some set of standard tests.

If they don’t overlap much, the difference between the averages of the two groups matters a lot. But often the variation within one group is far larger than the difference in the medians between the groups. For physical strength, the average man may well be stronger than the average woman. Nevertheless, lots of women are stronger than a good number of men. As Ruth Hubbard has said, “There are enough qualified men and women for any job in our society, except for sperm donor and surrogate mother.”
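To put a number on "lots of overlap," one can model each group as a bell curve and ask how often a randomly chosen member of the lower-scoring group beats a randomly chosen member of the higher-scoring one. The sketch below assumes both groups are normally distributed; the means and standard deviations are invented for illustration and are not measurements of anyone.

```python
from statistics import NormalDist

# Hypothetical strength-test scores; these parameters are made up.
men = NormalDist(mu=105, sigma=15)
women = NormalDist(mu=95, sigma=15)

# The difference (woman's score minus man's score) of two independent
# normal variables is itself normal: means subtract, variances add.
diff = NormalDist(
    mu=women.mean - men.mean,
    sigma=(women.variance + men.variance) ** 0.5,
)

# Probability that a randomly chosen woman outscores a randomly chosen man.
p_woman_stronger = 1 - diff.cdf(0)

# Shared area under the two curves (0 = no overlap, 1 = identical curves).
overlap = men.overlap(women)

print(round(p_woman_stronger, 2), round(overlap, 2))
```

With these assumed numbers the averages clearly differ, yet in nearly a third of random pairings the woman outscores the man.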

Given a number, ask, “Compared to what?” What is the background rate?

There’s a Dilbert cartoon where the boss complains that 40% of the employee sick days are Monday or Friday, so clearly the employees are faking sick days to extend their weekends. Do we think he’s right? Not after we look at the fraction of weekdays that are Monday or Friday.
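The boss's figure only looks suspicious until it is set beside the background rate. A minimal check, assuming a five-day work week and sick days that fall evenly across it:

```python
from fractions import Fraction

weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri"]
next_to_weekend = {"Mon", "Fri"}

# Background rate: the share of working days that touch a weekend.
background_rate = Fraction(
    sum(day in next_to_weekend for day in weekdays), len(weekdays)
)

# The boss's observation: 40% of sick days fall on Monday or Friday.
observed_rate = Fraction(40, 100)

print(background_rate, observed_rate == background_rate)
```

The observed rate matches what you would expect if people got sick on days chosen at random.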

Drivers have more accidents close to home. Why? Are people more careless in familiar surroundings? More likely it’s because no matter where you go, you start and end at home, so that’s where you’re driving most often.

People over 65 are 12% of all drivers, but have only 7% of the accidents. Are they safer drivers than people under 65? To find out, you need to know not their share of all drivers but the number of miles driven by each age group.
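The sketch below shows why the denominator matters. The 7% and 12% come from the paragraph above; the share of miles driven is a purely hypothetical number, picked only to show that the conclusion can flip, and `relative_rate` is an illustrative helper, not a standard measure.

```python
def relative_rate(accident_share, exposure_share):
    """Accidents per unit of exposure, relative to all drivers (1.0 = average)."""
    return accident_share / exposure_share

accident_share = 0.07   # from the text: share of all accidents
driver_share = 0.12     # from the text: share of all drivers
mile_share = 0.05       # HYPOTHETICAL share of all miles driven; not real data

# Measured per driver, the group looks safer than average (below 1.0).
per_driver = relative_rate(accident_share, driver_share)

# Measured per mile, with the assumed mileage, it looks riskier (above 1.0).
per_mile = relative_rate(accident_share, mile_share)

print(round(per_driver, 2), round(per_mile, 2))
```

Whether the real per-mile figure lands above or below 1.0 depends entirely on the actual mileage data, which the 12% and 7% do not supply.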

Is something a cause, or just a correlation?

Am I getting older because my hair is getting gray? Since taller people on average have higher IQ’s, can you increase your IQ by stretching yourself? Or is height increased by good nutrition, which, like IQ, is promoted by affluence?

In such cases, two things occur together, but this doesn’t mean that one causes the other. Correlation does not necessarily mean cause.

Given that smokers have substantially higher rates of lung cancer, does smoking cause lung cancer, or is it just a correlation? To show that a correlation does involve causation, you need a plausible mechanism (many chemicals in tobacco smoke are carcinogens). Also, see what happens when you vary or remove the supposed cause. Dyeing my hair doesn’t make me younger, but stopping smoking does cut the risk of lung cancer, and the longer since you’ve stopped, the more you decrease your risk.

Polls and Surveys: Why use samples? How do you choose samples? What is “sampling error”?

To find out about an entire population, it’s cheaper and more practical to sample than to look at every individual. Sampling and surveys permeate American public life: they are used not just to predict elections, but to find out how many people are employed, have various diseases, or think various things. The census tries to count every single American. But for most things, you can’t examine every individual. After all, if you’re cooking for a crowd, you don’t eat the whole pot of soup to see how it tastes; you sample it.

The sample isn’t the population, but we need it to be representative of the population. We also need it to be big enough to give useful results for the whole population.

How do you choose a sample that is representative of the population?

The standard way is to use a random sample. This means that every individual in the population has an equal likelihood of being chosen. When you sample the soup, you don’t skim it off the top, or pick it up in a strainer and get only solids. You mix it up first!
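In code, "mixing it up first" amounts to letting a random number generator do the choosing. A minimal sketch, assuming you already have a complete list of the population (here a made-up list of 10,000 labels):

```python
import random

# Hypothetical population: in practice this would be a full list of real units.
population = [f"person_{i}" for i in range(10_000)]

# A fixed seed is used only so the example gives the same result every run.
rng = random.Random(0)

# sample() draws without replacement, so each individual is equally likely
# to be chosen and nobody is chosen twice.
sample = rng.sample(population, k=100)

print(len(sample), sample[:3])
```

The hard part in real life is not this step but getting a list that actually covers the whole population.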

Another way is to use a stratified sample. To do this, you build a little model of the population, with the same percentage of men and women, various ethnic and income groups, and so on. For soup, this is like starting with the same proportions as your recipe for 100 people, but using only 1/100 of each ingredient. Lifesavers are stratified. Every roll of fruit Lifesavers looks the same; it’s a little model of the whole Lifesaver population.

Most media public-opinion polls start with a random sample of telephone numbers. This leads them to a sample of people. Then they adjust this sample to make it more representative in terms of gender, age, income, race, etc. For instance, if women are 51% of the population and the interviewees turn out to be 40% women, they would weight the women’s responses to get them up to 51%. This makes the sample more representative by “stratifying” it.
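Here is a minimal sketch of that weighting step, using the 51% and 40% figures from the example above. The sample of 1,000 people and the "yes" counts are invented, and real pollsters weight on several characteristics at once; this only shows the idea for one.

```python
# Known population shares (51% women, as in the example above).
population_share = {"women": 0.51, "men": 0.49}

# Hypothetical sample of 1,000: 40% women, with made-up "yes" counts.
respondents = {
    "women": {"n": 400, "yes": 240},
    "men": {"n": 600, "yes": 240},
}

total = sum(group["n"] for group in respondents.values())

# Weight = population share divided by sample share, so an under-sampled
# group counts for more and an over-sampled group counts for less.
weights = {
    name: population_share[name] / (group["n"] / total)
    for name, group in respondents.items()
}

unweighted_yes = sum(g["yes"] for g in respondents.values()) / total
weighted_yes = sum(
    weights[name] * g["yes"] for name, g in respondents.items()
) / sum(weights[name] * g["n"] for name, g in respondents.items())

print(round(unweighted_yes, 3), round(weighted_yes, 3))
```

With these invented numbers, weighting moves the "yes" share from 48% to about 50%, because women, who said yes more often, were under-represented in the raw sample.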

Telephone polling leaves out people without land-line phones. The percentage of households without telephones, according to the 2000 census, ranges from 0.9% in Massachusetts to 5.7% in New Mexico and 6.5% in Mississippi. About 6% of the population has only a cell phone. Pollsters try to correct for this by stratifying according to demographic group. Does this work? You can compare the poll with the results of other polls. And with pre-election polls, there’s a test: how closely did the poll predict the outcome? They did pretty well in November 2006.

What is “sampling error” or “margin of error”?

Is the sample, even if chosen in an unbiased way, big enough to give representative results? That is, what’s the “margin of error” or “sampling error”? This is really not an error; it’s the uncertainty caused by the fact that you are using a sample instead of the whole population. The New York Times explains “margin of error” in all its polls. For instance, if they interview 2500 adults and report a 2% margin of error, they say, “In theory, in 19 cases out of 20, the results based on such samples will differ by no more than 2 percentage points in either direction from what would have been obtained by interviewing all adult Americans.”

(Probability theory tells us that the results for 19 samples out of every 20 will differ from the results for the real population by no more than plus or minus a fraction, which is approximately 1 divided by the square root of the sample size. If the sample size is 2500, its square root is 50. Then 1/50 gives 2%.)
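The rule of thumb is easy to compute. The sketch below also shows, for comparison, the standard normal-approximation formula for a proportion at 95% confidence in the worst case of a 50/50 split; that second formula is not part of this article, and the function names are just illustrative.

```python
import math

def rule_of_thumb_margin(n):
    """Margin of error as a fraction, using the 1/sqrt(n) rule of thumb."""
    return 1 / math.sqrt(n)

def normal_approx_margin(n, p=0.5, z=1.96):
    """95% margin for a proportion p using the normal approximation."""
    return z * math.sqrt(p * (1 - p) / n)

for n in [100, 400, 1000, 2500, 10_000]:
    print(n, round(rule_of_thumb_margin(n), 4), round(normal_approx_margin(n), 4))
```

For a sample of 2500 the two agree closely (2% versus about 1.96%). Note also that quadrupling the sample size only halves the margin.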

Suppose 44% of the people in a sample of 2500 people say, “I prefer X.” Instead of saying “44% of Americans prefer X,” we should say “the probability is 19/20, or 95%, that 44% plus or minus 2%, that is, between 42% and 46%, support X.”

If 44% say they prefer X and 47% say they prefer Y, with the rest undecided, what’s your headline? “Americans prefer Y”? Or, “X and Y in a statistical dead heat”? (The latter.)
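One rough way to check the headline is to see whether the two ranges overlap. This is a quick sanity check under the 2-point margin used above, not a formal statistical test, and the helper names are made up for the example.

```python
def interval(share, margin):
    """Range of plausible values, in percentage points."""
    return (share - margin, share + margin)

def intervals_overlap(a, b):
    """True if the two (low, high) ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

x = interval(44, 2)   # 42 to 46
y = interval(47, 2)   # 45 to 49

print(x, y, intervals_overlap(x, y))
```

The ranges share the stretch from 45 to 46, so the poll alone cannot tell you which of X and Y is really ahead.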

There’s no guarantee that the true value must lie within the margin of error. Even with perfect random sampling, in 1/20 of the cases the sample results should be farther from the true value than the margin of error. And if your sample is biased, calculating the sampling error is nonsense, no matter how large your sample is. The “margin of error” calculation assumes that the sample is random. Otherwise, literally, all bets are off.
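A simulation can make the "1 case in 20" concrete. The sketch below assumes a true level of support of 44% (an arbitrary choice) and perfectly random samples of 2500, then counts how often a sample misses by more than the 1/sqrt(n) margin. The function name and the number of trials are illustrative.

```python
import math
import random

def share_outside_margin(true_share, n, trials, seed=0):
    """Fraction of simulated random samples that miss by more than 1/sqrt(n)."""
    rng = random.Random(seed)  # fixed seed only to make the run repeatable
    expected_count = true_share * n
    # A margin of 1/sqrt(n) as a fraction is sqrt(n) people as a head count.
    margin_count = math.sqrt(n)
    misses = 0
    for _ in range(trials):
        # Each simulated respondent prefers X with probability true_share.
        count = sum(rng.random() < true_share for _ in range(n))
        if abs(count - expected_count) > margin_count:
            misses += 1
    return misses / trials

outside = share_outside_margin(true_share=0.44, n=2500, trials=1000)
print(outside)
```

The result should land in the neighborhood of 1 in 20, typically a little under, since 1/sqrt(n) is a slightly generous margin. None of this helps with a biased sample: the simulation draws perfectly at random, which is exactly what a biased poll fails to do.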

Assuming the sample is random, and big enough, what else should we know about the survey?

Exactly when the survey was taken, especially on questions involving fast-moving events. How the questions are worded and in what order they’re asked. The survey-maker’s agenda. The actual demographics of the sample, and how they compare with the demographics of the population being sampled – to see if the sample really does represent the population.

Summing up the key questions:

  • Is the “average” a mean, median, or mode?
  • If you’re given an average, are most values very close to the average, or all over the map?
  • If you have data from two populations: what’s the difference in the averages, but also, what is the amount of overlap?
  • If they tell you an absolute number: compared to what? What percentage of the whole is it?
  • If two things are found together: how do you know which is the cause, which is the effect, or if both are caused by some other thing?
  • For a survey: What kind of sample? How big a sample, and what’s the margin of error? Is that error small enough so the results mean something? Does the sample really mirror the population? What questions were asked and in what order? When, by whom, and for what purpose?
  • And three general questions:
    What’s the best estimate, and what degree of confidence can we place in the result?
    Who says so?
    How do they know?

Judith V. Grabiner is Flora Sanborn Pitzer Professor of Mathematics at Pitzer College in Claremont, California. Author of articles and books on the history of mathematics, she won the Mathematical Association of America’s national teaching award, the Haimo prize, in 2003.