Monkey butts, menstrual cycles, sex, and the color pink. The statistical crisis in science


Ed Hagen


June 27, 2015

A couple of years ago, when I was starting a new project that used cross-national data, I picked up a popular statistics textbook by Andrew Gelman and Jennifer Hill because Gelman is known for his analysis of voting patterns across US states, which is conceptually similar to analysis of data across countries. I decided to google Gelman, a prominent statistics professor at Columbia, to see if he had papers on his website that would provide detailed, published examples of these kinds of analyses (’bout ready to head back to Facebook?).

It turned out Gelman is a prolific blogger. At the time, in Slate and on his blog, Gelman was accusing authors of four papers of shoddy stats. One paper was the infamous paper on ESP by Daryl Bem that had already been debunked. But the three other papers were on evolutionary psychology, and two authors of one paper were my former advisors John Tooby and Leda Cosmides.

That got my attention.

My first impression was that Gelman had found a new way to attack ev psych: cherry pick a few EP papers with questionable stats, and use those to tar the field.

Most of the discussion on Gelman’s blog involved the paper by Alec Beall and Jessica Tracy, “Women more likely to wear red or pink at peak fertility,” published in Psychological Science, one of psychology’s top journals. The inspiration for the study was the fact that “females in many closely related species signal their fertile window in an observable manner, often involving red or pink coloration.”

Victorian cartoon

Monkey butts, menstrual cycles, sex, and the color pink. Nothing to mock there. The study, however, is more important than it sounds. If Beall and Tracy were right, human female ovulation is not concealed after all.

In Slate, Gelman accused Beall and Tracy of conducting a fishing expedition — repeatedly making comparisons until they found one that was “statistically significant” and then reporting it as if they had predicted it in advance — a big, though often inadvertent, no-no that he claimed was widespread in science:

There’s a larger statistical point to be made here, which is that as long as studies are conducted as fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant, we will see lots of dramatic claims based on data patterns that don’t represent anything real in the general population. Again, this fishing can be done implicitly, without the researchers even realizing that they are making a series of choices enabling them to over-interpret patterns in their data.

Every scientist knows that fishing is wrong, so Beall and Tracy were understandably upset that Gelman accused them of unethical conduct without talking to them first.

I hadn’t met either author (so far as I know), so I googled them. According to Beall’s CV, he was a grad student and this was his first first-authored paper. What a great introduction to academia! Get published in Psych Science on your first effort, only to be slammed by a prominent statistician, repeatedly. I became a regular reader of Gelman’s blog, and a week rarely goes by that he doesn’t work in some dig at the Beall and Tracy paper. In fact, I was inspired to write this post after a recent dustup over Beall and Tracy’s refusal to share their data with Gelman since Gelman wouldn’t tell them where he was publishing his critique (they have shared their data with others).

Gelman, who presented himself as a disinterested critic who only wanted to improve science, had several criticisms of the paper, but I am going to focus on one: Gelman guessed that Beall and Tracy had originally predicted that, like female chimpanzees who signal estrus with a prominent red swelling, women would tend to choose red shirts around the time of ovulation. Gelman surmised that after collecting their data Beall and Tracy noticed that there was no statistically significant tendency to wear red shirts during the fertile period (red only: p > .05), but if women wearing red shirts and women wearing pink shirts were combined into a single, new category (reddish shirts), there was a statistically significant effect (red + pink: p < .05).

I was a little confused. Researchers often combine conditions and no one blinks an eye (and Beall and Tracy deny they did this). Further, there was no evidence that Beall and Tracy had repeatedly tested various combinations of shirt colors to find one that was “significantly” associated with the fertile period of women’s menstrual cycles, and then, after the fact, had come up with some theory to explain it, which would be a classic fishing expedition.

Gelman began to walk back his accusation that Beall and Tracy went fishing. Instead, in a paper with Eric Loken in American Scientist, and on his blog, Gelman claimed that multiple comparisons can be a problem, even when the research hypothesis was posited ahead of time and researchers only conduct one statistical test.


Gelman’s own stats textbook didn’t mention anything about this. In fact, in the textbook and in a publication, Gelman et al. claimed “we (usually) don’t have to worry about multiple comparisons,” an irony he himself noted. Further, why pick on a grad student? Or evolutionary psychology? The problem, if there was one, would be pervasive throughout all the sciences.

My first impression was right. A reader of Gelman’s wondered why he spent so much time criticizing a study on pink shirts instead of the statistically flawed medical research that actually harms people, and Gelman admitted he had an agenda: surprise, surprise, he doesn’t like evolutionary psychology:

But I do think these social psychology studies make a difference too, in that they feed the idea that people are shallow and capricious, that we make all sorts of decisions based on our animal instincts etc. Sure, some of that is true but not to the extent claimed by those clueless researchers.

To erroneously connect fat arms [the paper co-authored by Tooby and Cosmides] or monthly cycles to political attitudes is to trivialize political attitudes, and I think that’s a mistake, whatever your politics.

Well, Gelman also makes mistakes: singling out a few articles on one side of a debate that supposedly have statistical flaws, but not looking at articles on the other side of the debate, obviously says nothing about who’s right (even assuming such a debate exists; Gelman knows very little about evolutionary psychology).

Still, if Gelman’s general points were correct, I realized I could easily be making the same mistakes in my own research, and so too would a lot of other people. With some chagrin, I had to admit that although I was aware of the many debates surrounding Frequentist vs. Bayesian approaches to statistical inference, I had never thought critically about the foundation of them both: probability.

Following pointers on Gelman’s blog, I started reading. Late to the party as always, I discovered that many statisticians, such as John Ioannidis, and other quantitative types, such as Uri Simonsohn, Joseph Simmons, and Leif Nelson, believed there was a statistical crisis in science, and that it was possible that most research findings were false. Yikes! After poring over many articles and blogs, I concluded that although the tools statisticians have given us are powerful, they are very brittle and easy to break because probability is easy to get wrong.

The world is noisy and we humans have a propensity to see patterns in this noise where none exist. To distinguish signal from noise, most scientists therefore rely on the following expression:

\[ p = P(D \mid H_0) \]

which is the probability that your data (D) would turn out the way they did given some null hypothesis (H0), such as no difference in shirt color during peak fertility. If p is small, you conclude that you’re looking at a signal and not noise. Yay! This is called Null Hypothesis Significance Testing (NHST).

There are many critics of NHST. But the problems that Gelman and others were highlighting actually did not involve NHST per se. Instead, they were making a “garbage in, garbage out” argument. The probabilities spit out by standard software packages — p-values — were, in too many cases, misleadingly small because scientists had inadvertently fed the software the answers they wanted to see.

Probability is a subtle concept (and I’m probably going to screw it up right now!). To illustrate the problem Gelman and others have found, I will use a casino example. Imagine that you have a die you suspect is loaded. You roll it twice, and it comes up 3 each time. Can you reject the null that the die is fair? You might think that because the probability of rolling a three with a fair die is 1/6, and you’ve rolled two threes, the probability under the null is \((1/6)^2\), which is about 0.028, so you can reject the null hypothesis that the die is fair.

If that’s what you thought, you would be wrong, because it’s a trick question. If you had called two threes before rolling the die, that is, you suspected it was loaded to come up threes, and then it came up all threes, you could conclude that the die is (probably) loaded. But I asked you to test the null by computing the probability of two threes based on already having rolled two threes.

Hey, you might respond, it’s odd that the die came up three both times, right? Yes it is. It would also be odd if it came up two ones, or two twos, or two fours, etc., and I gave you no reason to suspect one of these possibilities over the others. The probability that it came up two ones OR two twos OR two threes OR two fours OR two fives OR two sixes is about 0.17, i.e., not odd at all. You can’t test the null by computing a probability as if you hadn’t seen the faces of the die if your choice of probability test is based on having seen those faces. This is the core problem.
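For readers who like to check the arithmetic, here is a minimal sketch in Python that enumerates all 36 equally likely outcomes of two rolls of a fair die and compares the two probabilities. The helper name `probability` and the variable names are mine, chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling a fair die twice.
rolls = list(product(range(1, 7), repeat=2))


def probability(event):
    """Exact probability of an event under the fair-die null."""
    hits = sum(1 for roll in rolls if event(roll))
    return Fraction(hits, len(rolls))


# You called "two threes" BEFORE rolling.
p_called_threes = probability(lambda roll: roll == (3, 3))

# You looked first, then asked "how odd is this double?" Any double would
# have looked equally odd, so the relevant event is "both rolls match".
p_any_double = probability(lambda roll: roll[0] == roll[1])

print(p_called_threes, f"{float(p_called_threes):.3f}")  # 1/36 0.028
print(p_any_double, f"{float(p_any_double):.3f}")        # 1/6 0.167
```

The same two threes are worth 0.028 if you called them in advance and about 0.17 if you did not.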

In science, collecting data is the analog of rolling the dice. That’s what P(D|H0) means. To use this probability to distinguish signal from noise, we scientists must therefore make a very precise prediction before collecting or looking at the data because the p-values our stat programs compute are only accurate if our “call” was not influenced by the data.

But we all have looked at the data. Almost every statistics textbook, including Gelman’s, recommends that we pore over our data, checking distributional assumptions and so forth, before running any test. And many common choices during data analysis, such as controlling for potential confounds, looking at interactions, and combining conditions, can dramatically increase the chance of a false positive.
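To get a feel for how much these choices matter, here is a rough simulation sketch in Python. It is not Beall and Tracy's analysis or anyone's actual data: the sample size, the shirt-color frequencies, and the three candidate definitions of "the effect" (red only, pink only, red + pink) are all hypothetical numbers I made up, and the test is a simple pooled two-proportion z-test using a normal approximation. Shirt color is generated independently of fertility, so every "significant" result is a false positive.

```python
import math
import random


def two_proportion_p(k1, n1, k2, n2):
    """Two-sided p-value, pooled two-proportion z-test (normal approximation)."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (k1 / n1 - k2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_studies=2000, n_per_group=100, p_red=0.1, p_pink=0.1,
             alpha=0.05, seed=1):
    """Return (false positive rate if 'red' was called in advance,
    false positive rate if the best of three definitions is reported)."""
    rng = random.Random(seed)
    colors = ["red", "pink", "other"]
    weights = [p_red, p_pink, 1 - p_red - p_pink]
    definitions = [{"red"}, {"pink"}, {"red", "pink"}]
    prespecified_hits = 0
    flexible_hits = 0
    for _ in range(n_studies):
        # The null is true: both groups draw shirts from the same distribution.
        fertile = rng.choices(colors, weights, k=n_per_group)
        not_fertile = rng.choices(colors, weights, k=n_per_group)
        p_values = []
        for wanted in definitions:
            k1 = sum(color in wanted for color in fertile)
            k2 = sum(color in wanted for color in not_fertile)
            p_values.append(two_proportion_p(k1, n_per_group, k2, n_per_group))
        prespecified_hits += int(p_values[0] < alpha)   # red only, called in advance
        flexible_hits += int(min(p_values) < alpha)     # whichever looks best
    return prespecified_hits / n_studies, flexible_hits / n_studies


prespecified_rate, flexible_rate = simulate()
print(f"red only, called in advance: {prespecified_rate:.3f}")
print(f"best of three definitions:   {flexible_rate:.3f}")
```

Because the pre-specified test is one of the three candidates, the second rate can never be lower than the first. How much higher it comes out depends on the made-up numbers above, so treat the output as an illustration rather than an estimate for any real study.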

Current best practice in the sciences is like first rolling the dice, then meticulously examining the numbers that land face up, but promising to not let anything we learn about those numbers influence any aspect of our call.

Think any casino would play by those rules?

Gelman and others are not criticizing NHST per se (or rather, that’s a separate argument). They are showing how easy it is to inadvertently break NHST.

In another irony, Gelman grounds this crisis in human nature: just like the gambler who has the strong monetary incentive to beat the house, scientists can usually only publish (and thus get credit for) “statistically significant” results, and therefore have an incentive to find some justification for altering their predictions after looking at the data (e.g., red+pink, not just red). That is, many scientists are pursuing their self-interest and taking a benefit they don’t deserve, and Gelman’s cheater detection mechanisms are on full alert. Just the phenomenon that put evolutionary psychology on the map!

Although deliberate or inadvertent cheating is certainly part of the story, I want to offer a different framing: scientists currently face an untenable tradeoff between learning about the world, and confirming what they’ve learned. If Beall and Tracy had discovered, rather than predicted, that women tend to wear pink shirts and red shirts during the fertile phase of their cycle, that would be important. Discovering that you need to control for, e.g., age, would be important. After all, aren’t scientists supposed to discover things? Combining the two colors, and studying “reddish” clothing, would be excellent science. So would controlling for previously unsuspected confounds. Unfortunately, these would break the computation of statistical significance that tells us this is a real effect and not just a fluke. Good science can be bad statistics, and good statistics can be bad science.

Science is screwed.

There is a way out of this mess, as statistician John Tukey realized decades ago. I’ll discuss that in my next post. For now, I will just note that there is a long history of ridiculing approaches to human sexuality that take account of our primate heritage. As targets of such ridicule, Beall and Tracy are in pretty good company:

In the discussion on Sexual Selection in my “Descent of Man,” no case interested and perplexed me so much as the brightly-coloured hinder ends and adjoining parts of certain monkeys. As these parts are more brightly coloured in one sex than the other, and as they become more brilliant during the season of love, I concluded that the colours had been gained as a sexual attraction. I was well aware that I thus laid myself open to ridicule; though in fact it is not more surprising that a monkey should display his bright-red hinder end than that a peacock should display his magnificent tail.

Charles Darwin (1876) Sexual Selection in Relation to Monkeys. Nature, 15, 18-19.