Anatomy of a Statistical Meltdown
“Plastics Chemical Tied to Aggression in Young Girls,” said the headline on ABC News. “The research showed that hyperactive, anxious, aggressive and depressed behavior was more common in 3-year-old girls who were exposed in the womb to bisphenol-A than in boys of the same age.”
“In a new study of Cincinnati-area kids, girls exposed to higher levels of bisphenol A before birth had more behavioral problems and were more anxious and over-active than those only exposed to small amounts of the chemical,” said Reuters Health.
“The results held strong even after the researchers adjusted for other possible contributors to the children’s behavior, such as breast-feeding, mother’s education, depression and marital status, and possible study confounders such as differences in the strength of the urine samples,” said Time’s Healthland.
“Linda Birnbaum, director of the National Institute of Environmental Health Sciences, said the sample size is reasonable and its results support studies that show similar effects in animals,” reported the Washington Post.
The only reported critics of the study, “Impact of Early-Life Bisphenol A Exposure on Behavior and Executive Function in Children,” published in Pediatrics in 2011, were from the chemical industry, who, given their economic interest in the safety of BPA, have little credibility when set against government-funded scientific research defended by the director of a government agency.
And yet… this particular study did nothing to alter the Food and Drug Administration’s position on the safety of the ubiquitous chemical. Why? The answer is neither politics nor industry influence; it is design.
STATS asked Patrick McKnight, a statistician in the Department of Psychology at George Mason University who specializes in the statistics of measurement, to look at the study’s design and data. Here are his observations:
1. They began with a sample of 468 but analyzed only 237 or 239; it is not clear from the results section how many were actually analyzed. Regardless, the analyzed sample is roughly 50 percent of the original. We never learn (from the authors) how much the missing data contributed to the results; only through a method such as multiple imputation could we know that. And yet, they chose to analyze the data as if it were complete (i.e., listwise deletion). The practice of listwise deletion is so common it might not be worth mentioning. The practice of comparing missing to non-missing cases on socio-demographic variables is widespread in medicine and psychology, but that comparison provides little useful information about the problems the missing data present. There are more effective methods, but these comparisons are common today in most scientific fields, so it might be a little harsh to criticize these authors for reporting them.
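A small simulation can make the listwise-deletion concern concrete. The data below are entirely made up (this is not the study's data), but they show why dropping half a sample can bias an estimate whenever dropout is related to the outcome being measured:

```python
# Sketch (simulated data, not the study's): listwise deletion can bias
# results when missingness is related to the outcome.
import numpy as np

rng = np.random.default_rng(0)
n = 468                                # the study's initial sample size
scores = rng.normal(50, 10, n)         # hypothetical behavior scores

# Suppose children with higher scores are likelier to drop out of the study:
p_missing = 1 / (1 + np.exp(-(scores - 55) / 5))
observed = scores[rng.random(n) > p_missing]

full_mean = scores.mean()              # what we would like to estimate
cc_mean = observed.mean()              # complete-case (listwise) estimate

print(len(observed), round(full_mean, 1), round(cc_mean, 1))
```

Because the high scorers disappear, the complete-case mean understates the true mean, and no comparison of socio-demographics between missing and non-missing cases would reveal it; that is why methods like multiple imputation, which model the missingness, are preferred.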
2. I believe the real problem with this study is that the design cannot yield the proper data to assess their primary aims. The authors describe their study sample as a prospective cohort. Unfortunately, Figure 2 represents a less rigorous design that forms the basis of their conclusions—that girls differ from boys with respect to their maturation after exposure to BPA. The authors even state in the discussion that their findings were not based upon statistical tests (i.e., a moderated effect of sex by BPA exposure) but rather a qualitative comparison between the two figures. In fact, the results could have been interpreted in a slightly different way in that BPA helped boys in their maturation. The authors rightfully acknowledged low statistical power and their use of a more qualitative assessment but they still do not gain the advantage of a prospective cohort design in their study. If they had limited their discussion to only dose response then the design would remain intact and their findings defensible.
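The moderation claim McKnight describes, that BPA affects girls and boys differently, corresponds to a specific interaction term that can be tested directly rather than inferred by eyeballing two figures. A hedged sketch with simulated data and hypothetical variable names (`bpa`, `sex`, `score`):

```python
# Sketch: a formal sex-by-exposure interaction test. Data are simulated;
# variable names are hypothetical stand-ins for the study's measures.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 240
df = pd.DataFrame({
    "bpa": rng.lognormal(0, 1, n),           # prenatal BPA (arbitrary units)
    "sex": rng.choice(["girl", "boy"], n),
})
# Simulate an outcome with no true sex-by-BPA interaction:
df["score"] = 50 + 0.5 * np.log(df["bpa"]) + rng.normal(0, 10, n)

# The claim "girls differ from boys in their response to BPA" is the
# bpa:sex interaction coefficient, with its own p-value:
model = smf.ols("score ~ np.log(bpa) * C(sex)", data=df).fit()
print(model.pvalues)
```

If the interaction coefficient is not distinguishable from zero, the data do not support a sex-specific effect, regardless of how different the two figures look side by side.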
3. Another problem in the analysis may be evident in the figures. These figures clearly show cases that exert a huge influence (i.e., leverage) on the parameters, both for girls and boys. The authors ought to have calculated Cook’s distance and conducted a few regression diagnostics to see clearly that only a few observations drove their primary results. Splines, as implemented in their analysis, tend to be overly sensitive to extreme values, especially extreme values at the ends of the x-axis. I see “no effect” in all the figures presented.
4. Taken at face value, BPA had a harmful effect on girls, but the boys actually benefited from their exposure. Could you imagine the outcry if the media had grabbed this study and interpreted the effects as “BPA good for boys but bad for girls,” failing to mention that these effects were trivial and potentially zero?
5. Many people (statisticians, researchers, and certainly the public) fail to consider the standard error of measurement when they are interpreting these effects. The mean differences observed in these measures are likely smaller than the standard error of measurement. In other words, the effects are likely within the band of error typically observed with these measures.
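The standard-error-of-measurement point can be made with one line of arithmetic. The reliability and scale values below are hypothetical, chosen only to show the order of magnitude involved:

```python
# Sketch of the standard error of measurement (SEM). The scale SD and
# reliability here are hypothetical illustrative values, not the study's.
import math

sd = 10.0            # scale standard deviation (hypothetical)
reliability = 0.90   # test-retest reliability (hypothetical)

# SEM = SD * sqrt(1 - reliability)
sem = sd * math.sqrt(1 - reliability)
print(round(sem, 2))   # ≈ 3.16
```

On such a scale, a 95% band around any single observed score spans roughly ±2 SEM, about six points, so a group difference of two or three points sits comfortably inside ordinary measurement noise.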
The most troubling observation is not so much that journalists failed to “see” these problems, or even that the head of the National Institute of Environmental Health Sciences defended the robustness of the study when interviewed by the media (though that is very troubling indeed, and why would journalists, working on tight deadlines, think to question the veracity of her claims?); it is that the study’s authors, in all likelihood, didn’t think they were doing their study badly. “Most of what these folks did is, and continues to be, standard practice in both medicine and psychology,” said McKnight. That speaks to a much wider need for statisticians and scientists to collaborate—and for journalists and statisticians to collaborate to create a new level of insight and accountability.
Editor’s note Nov 24: Thanks to an astute comment by one reader, point 2 has been rewritten.
Please note that this is a forum for statisticians and mathematicians to critically evaluate the design and statistical methods used in studies. The subjects (products, procedures, treatments, etc.) of the studies being evaluated are neither endorsed nor rejected by Sense About Science USA. We encourage readers to use these articles as a starting point to discuss better study design and statistical analysis. While we strive for factual accuracy in these posts, they should not be considered journalistic works, but rather pieces of academic writing.