How Statistics Can Solve the BPA Controversy
Editor’s note: The controversy over whether bisphenol A (BPA), a component of plastics and can linings, is dangerous to humans is now in its 17th year. Hundreds of millions of dollars have been spent on research by the US and other governments around the world, with intense disagreement over the conclusions. Is the risk real and quantifiable, or do the claims about harm fail the test of replication? In the US, the Food and Drug Administration has concluded that there is no risk at current exposure levels, drawing on research conducted in conjunction with the National Toxicology Program and on studies funded and conducted by the Environmental Protection Agency. And yet a constant stream of stories in the news media, focusing largely on work funded by the National Institute of Environmental Health Sciences, has claimed a link between BPA and a wide range of serious health problems. The dispute is sometimes framed as a clash between competing scientific disciplines and worldviews (endocrinology and toxicology) over the concept of endocrine disruption: pick your discipline, pick your research, pick your conclusion. But it also takes place against a much broader concern about a reproducibility crisis in science, one identified by the National Institutes of Health, which speaks to a widespread failure to design studies so that they do, in fact, answer the questions they ostensibly seek to answer. This is where statistics and statisticians come in.
STATS asked Patrick McKnight, a statistician in George Mason University’s Department of Psychology, to look at one of the fundamental mechanistic issues in the BPA controversy: the idea that the chemical can have adverse impacts at very low doses but not at higher doses. If true, this would invalidate the widespread scientific view that the risk of harm from chemicals is generally monotone, that is, the greater the dose, the greater the toxicity. McKnight runs the Measurement Research Methodology Evaluation Statistics (MRES) Lab at Mason, and while his research focuses on many health-related issues, he is not involved in research on BPA. His brief was to look at this issue from a statistical perspective, which is to say: were the studies designed in a way that allows us to conclude that their findings are reliable? His conclusions raise important questions about the way government-funded research is conducted, the centrality of design and statistics to study funding, and the need for journalists to go behind scientific claims and ask informed questions about design, data, and interpretation.
Update, Friday 20: The Inspector General of the Health and Human Services Department is conducting an investigation into the way the National Institute of Environmental Health Sciences has funded BPA research (p56).
The scientific evidence suggesting that bisphenol A (BPA) affects human growth and development remains both controversial and unclear. Evidence dating as far back as the 1950s indicates that environmentally available toxins may cause the increased incidence of many developmental disorders, particularly those that involve the endocrine system. What remains unclear is whether BPA is a primary causal agent and whether the small doses found in typical environmental conditions are sufficient to cause these observed effects.
A methodological critique of the evidence found that previously published reports suffered from both methodological and statistical problems that may interfere with causal inference. Moreover, a 2014 review by the US National Research Council aimed to resolve the controversies but merely pointed out that the evidence needed to fully and clearly address these known concerns does not exist.
The current review provides a concise enumeration of the most pressing points in this debate. Our aim was to make clear the claims on both sides so readers without specific content knowledge may understand what is known, what is inferred, and what remains unknown. The core issue surrounding BPA is its low-dose effects, which are often discussed under the heading of non-monotonic dose-response, or NMDR. Low-dose BPA effects may be difficult to detect but serve as the primary evidence for those who argue that BPA is the causal agent in these observed disorders. Many national organizations, such as the US Food and Drug Administration, now consider BPA a safe compound at present human exposure levels and argue that the lack of evidence for low-dose effects may not be due to NMDRs.
Instead of resolving this debate, we provide readers with the tools to evaluate the evidence by explaining why and how these effects may be detected, both methodologically and statistically. Resolving this problem, however, requires far more research that specifically addresses these problems. We begin with the most pressing issue and then detail additional methodological and statistical issues that arise when trying to test for NMDR.
Low-dose effects and the NMDR
As stated in the summary, the core issue in the debate stems from our collective uncertainty about the dose and response to BPA. A typical dose-response curve is a non-constant but increasing curve where higher doses usually lead to greater responses. Figure 1 provides a rough sketch of the effects of alcohol on human behavior. As a person increases their consumption of alcohol, they tend to respond in fairly consistent ways. Higher doses lead to more dramatic effects. The figure shows a monotonic dose-response relationship.
In contrast to the typical dose-response curve, toxicologists suggest that a more complex relationship exists between the dose and the response. Consider the following figure, which illustrates the differences among different types of dose-response curves. The alcohol dose-response curve in Figure 1 follows roughly the same pattern as the first curve in Figure 2.
Both the first and second curves in Figure 2 show what are referred to as monotonic relationships; they either increase or decrease but not both. The two curves to the right in Figure 2 remain quite controversial and are typically referred to as non-monotonic curves, which both increase and decrease depending upon the dose. Evident from these two right-hand curves are the potential effects of low doses, which contrast greatly with the first curve to the left. Researchers who argue that BPA produces large effects hold that the small environmental doses observed in foods, water, and packaging are sufficient to produce those effects. How? They claim that the relationship between the dose and the response does not follow a typical curve as depicted in both Figure 1 and the two left curves in Figure 2. Paramount to solving the NMDR problem is assessing the low-dose effects (i.e., where the curve is higher than would be expected for a constantly increasing dose-response curve; cf. Figure 1).
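The distinction between the two kinds of curve can be made concrete with a short Python sketch. The function names, curve shapes (a Hill-type sigmoid and an inverted U on a log-dose scale), and all parameter values below are illustrative assumptions chosen for this example; they are not fitted to any BPA data.

```python
import math

def monotonic_response(dose, top=1.0, ed50=1.0, slope=2.0):
    """Hypothetical sigmoidal (Hill-type) curve: response only rises with dose."""
    return top * dose**slope / (ed50**slope + dose**slope)

def inverted_u_response(dose, peak_dose=0.1, width=1.0, height=1.0):
    """Hypothetical inverted-U curve on a log10 dose scale:
    the response peaks at peak_dose and falls off on either side."""
    z = (math.log10(dose) - math.log10(peak_dose)) / width
    return height * math.exp(-0.5 * z * z)

def is_monotonic(values, tol=1e-12):
    """True if the sequence never decreases, or never increases."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d >= -tol for d in diffs) or all(d <= tol for d in diffs)

# Six doses spaced one order of magnitude apart (arbitrary units).
doses = [10 ** e for e in (-3, -2, -1, 0, 1, 2)]
mono = [monotonic_response(d) for d in doses]
nmdr = [inverted_u_response(d) for d in doses]

print(is_monotonic(mono))  # True: the curve only increases
print(is_monotonic(nmdr))  # False: the curve rises, then falls
```

With real data the responses are noisy, so a simple sign check like this is not a test for NMDR; it only shows what the two shapes mean.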
Detecting low-dose effects
The expected NMDR curve would not present much of a problem for scientists if the critical low-dose effects were easily detectable; however, many factors make detection difficult. We review the primary problems with detecting these effects, many enumerated by Haseman et al. (2001), and then return to how these problems continue to plague the BPA toxicology literature. The first point below is probably the most critical indictment of the NMDR research to date.
Low doses appear too varied and many often do not qualify as “low”. Recent evidence from a comprehensive review of the literature showed that the term “low” does not represent a true low dose for humans. Studies from 2002 to 2012 reported doses described as “low” that were as high as eight to 12 times the dose that humans are exposed to in typical environments. These purported low doses distort the dose-response curves and inflate the effects of what the authors present as “low” doses but are, in fact, high doses. Thus, the efforts to detect the low-dose effects may be undermined by poor dose measurement and control. Only after researchers agree upon standard definitions can there be much progress with respect to understanding NMDR.
Power to detect low-dose effects. Should the dose actually be a low one, there are other problems, particularly the power to detect small effects. Existing research on low doses may fail to detect the effects because of either small samples or inappropriate statistics. Small samples produce unreliable results that may be less sensitive to the differences between monotonic and nonmonotonic dose-response curves. A general rule of thumb for researchers is that regression models often used for dose-response curves require no fewer than 50 observations (i.e., mice or humans) for each dose. When the effects of the dose become less clear or more difficult to distinguish from no effect, those requirements increase dramatically. Current toxicology research often provides estimates for the dose-response curves based on well under 50 observations. An adequate sample size may be as high as 200 for these small effects, and few, if any, studies to date provide a sample that large. Thus, the sample size limits how well a researcher can estimate any departures from an otherwise expected monotonic function. Compounding the sample size problem is the use of inappropriate, or rather, less sensitive statistical procedures that mask the effects. Most researchers use a statistical tool (ANCOVA) that requires far greater sample sizes than would a more suitable statistical tool (multiple regression). Combined, the small sample sizes and the less sensitive statistical tool reduce sensitivity to nonmonotonic effects. Investigations into NMDR curves require far more resources than are typically directed at the problem, and our collective understanding of these curves will remain low until more sensitive studies provide estimates for these curves.
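To illustrate how quickly sample-size requirements grow as effects shrink, the following Python sketch applies the standard normal-approximation formula for comparing two group means. The function name n_per_group, the 5 percent two-sided significance level, the 80 percent power target, and the standardized effect sizes are assumptions made for this illustration. The calculation covers a simple two-group comparison, not a full dose-response regression, so the numbers are not meant to reproduce the rule-of-thumb figures cited above.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate observations needed per group to detect a standardized
    mean difference (Cohen's d) with a two-sided test, using the normal
    approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# Conventional "large", "medium", and "small" standardized effects.
for d in (0.8, 0.5, 0.2):
    print(d, n_per_group(d))
# 0.8 25
# 0.5 63
# 0.2 393
```

Under these assumptions, shrinking the effect from medium to small raises the required sample per group roughly sixfold, which is the pattern the paragraph above describes.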
Sample composition. Sampling problems do not end with size but extend to the composition of the sample as well. Researchers often use littermates for mice and rats to increase efficiency. Littermates introduce familial or genetic confounds that are rarely taken into consideration when testing these small effects. Confounded effects make the nonmonotonic effects at low doses more difficult to distinguish from potential group or familial effects.
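One common way to quantify the cost of using littermates is the design effect for clustered samples, 1 + (m - 1) × ICC, where m is the litter size and ICC is the intraclass correlation among littermates. The Python sketch below applies that formula; the function names and the litter counts, litter sizes, and ICC values are hypothetical and are not drawn from any BPA study.

```python
def design_effect(litter_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1 + (litter_size - 1) * icc

def effective_sample_size(n_litters, litter_size, icc):
    """Number of independent observations the clustered sample is worth."""
    total_animals = n_litters * litter_size
    return total_animals / design_effect(litter_size, icc)

# Hypothetical design: 10 litters of 5 pups each (50 animals in total).
for icc in (0.0, 0.2, 0.5):
    print(icc, round(effective_sample_size(10, 5, icc), 1))
```

When littermates are highly similar, 50 animals can carry the information of far fewer independent animals, which is why analyses that treat each pup as independent can overstate their precision.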
Replication. Scientific findings require replication, both within and between laboratories. The current toxicological work with BPA tends to originate within one lab, and the replications tend not to extend beyond partial effects. Thus, the failure to find reliable results may indicate either no NMDR effects or effects that are small and difficult to detect given the aforementioned limitations.
Investigator bias. Science requires human reasoning to generate the questions, apply the methods, obtain the results, and interpret the results given a theoretical framework. That framework, however, often dictates the approach and may lead investigators to unwittingly find results they believed existed. No scientific area today is immune to these biases. The extent to which investigator bias affects the testing of NMDR remains unclear, but most investigators tend to find the effects they believe to be true. Replication between laboratories, especially labs with competing views, offers the best treatment for investigator bias. Unfortunately, very little work on BPA provides this type of peer oversight and peer collaboration.
Unclear rates of NMDR effects. The distinction between monotonic and nonmonotonic dose-response curves requires some expectation of the rates for a given problem. BPA, as an example, may be a toxin where low doses have disproportionately high effects compared to other toxins. Unfortunately, we do not know for sure if that expectation is reasonable. Some researchers (e.g., Vandenberg, 2014) estimate that NMDR effects occur in at least 20 percent of all experiments. That estimate, however, means that up to 80 percent do not produce this nonmonotonic effect; moreover, the estimated or expected rate of nonmonotonic dose-response curves requires a mechanistic explanation, and most of these explanatory mechanisms are likely to be post hoc. Thus, if researchers knew the rates of NMDR across many different areas of toxicology and the mechanisms could be specified a priori, then the effects would be more believable if they were observed at the expected rate. Currently, there are no clear expectations and mechanisms remain largely post hoc.
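If an expected rate were specified in advance, comparing it with what is observed would be a simple binomial calculation. The Python sketch below shows the idea; the function name prob_at_least, the number of experiments, and the observed count are hypothetical, and the 20 percent rate is used only because it is the figure mentioned above.

```python
from math import comb

def prob_at_least(k, n, rate):
    """P(X >= k) when X ~ Binomial(n, rate): the chance of seeing k or more
    nonmonotonic results in n independent experiments at the given rate."""
    return sum(
        comb(n, i) * rate**i * (1 - rate) ** (n - i)
        for i in range(k, n + 1)
    )

# Hypothetical scenario: 20 experiments, 8 of which show an NMDR pattern,
# evaluated against an expected rate of 20 percent.
print(prob_at_least(8, 20, 0.20))
```

A small probability would suggest that the observed count exceeds what the expected rate predicts. The calculation assumes independent experiments, and it is only meaningful if the rate was fixed before the data were seen.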
Methodological and statistical problems limit our understanding of BPA effects. These limitations can be remedied by a concerted effort to estimate and replicate the NMDR effects identified by prior research. These small effects require careful attention to subtle design issues that currently plague the extant literature.
 Haseman, J. K., Bailer, A. J., Kodell, R. L., Morris, R., & Portier, K. (2001). Statistical issues in the analysis of low-dose endocrine disruptor data. Toxicological Sciences, 61(2), 201-210.
 Teeguarden, J. G., & Hanson-Drury, S. (2013). A systematic review of bisphenol A “low dose” studies in the context of human exposure: A case for establishing standards for reporting “low-dose” effects of chemicals. Food and Chemical Toxicology, 62, 935-948.
 VanVoorhis, C. R. W., & Morgan, B. L. (2007). Understanding power and rules of thumb for determining sample sizes. Tutorials in Quantitative Methods for Psychology, 3(2), 43-50.
Carlsen, E., Giwercman, A., Keiding, N., & Skakkebæk, N. E. (1992). Evidence for decreasing quality of semen during past 50 years. BMJ, 305(6854), 609-613.
Colborn, T., vom Saal, F. S., & Soto, A. M. (1993). Developmental effects of endocrine-disrupting chemicals in wildlife and humans. Environmental Health Perspectives, 101(5), 378. Documented the “deleterious effects of endocrine-disrupting chemicals in the environment on the reproductive success of wildlife populations.”
Howdeshell, K. L., Hotchkiss, A. K., Thayer, K. A., Vandenbergh, J. G., & vom Saal, F. S. (1999). Environmental toxins: exposure to bisphenol A advances puberty. Nature, 401(6755), 763-764. Found “prenatal exposure to a dose of bisphenol A comparable to levels found in the environment … altered postnatal growth rate and reproductive function in female mice, although individual differences in endogenous oestradiol resulting from natural variation influenced the responsiveness of the females to bisphenol A.”
Paulozzi, L. J., Erickson, J. D., & Jackson, R. J. (1997). Hypospadias trends in two US surveillance systems. Pediatrics, 100(5), 831-834. Epidemiological evidence suggests a higher rate of hypospadias (male congenital condition related to potential increased environmental oestrogens). This paper cited two other links between the epidemiological findings and potential mechanisms:
- Giwercman A, Carlsen E, Keiding N, Skakkebaek NE. Evidence for increasing incidence of abnormalities of the human testis: a review. Environ Health Perspect Suppl. 1993;101(suppl 2):65–71.
- Sharpe RM, Skakkebaek NE. Are oestrogens involved in falling sperm counts and disorders of the male reproductive tract? Lancet. 1993;341:1392–1395.
vom Saal, F. S., & Sheehan, D. M. (1998). Challenging risk assessment. Forum for Applied Research and Public Policy. University of Tennessee, Energy, Environment and Resources Center. Challenged the traditional dose-response curves and suggested that low-dose effects may be greater than expected.
Toppari, J., Larsen, J. C., Christiansen, P., Giwercman, A., Grandjean, P., Guillette Jr, L. J., et al. (1996). Male reproductive health and environmental xenoestrogens. Environmental Health Perspectives, 104(Suppl 4), 741.
National Research Council. (2014). Review of the Environmental Protection Agency’s State-of-the-Science Evaluation of Nonmonotonic Dose-Response Relationships as They Apply to Endocrine Disruptors.
Please note that this is a forum for statisticians and mathematicians to critically evaluate the design and statistical methods used in studies. The subjects (products, procedures, treatments, etc.) of the studies being evaluated are neither endorsed nor rejected by Sense About Science USA. We encourage readers to use these articles as a starting point to discuss better study design and statistical analysis. While we strive for factual accuracy in these posts, they should not be considered journalistic works, but rather pieces of academic writing.