What do we mean by “reproducibility”?
The call for improving research reliability and transparency is admirable, but the discussion is often confusing. “Reproducibility,” “replication,” and related terms are used in many different ways across, and even within, disciplines. The economist Michael Clemens identified more than 70 different terms used in the social science literature to describe “replication” and related concepts. While researchers often have similar concepts in mind, many introduce new terms, such as “broad” versus “narrow” replication or “pure” versus “statistical” replication, until we are awash in terminology.
In a recent article, the founders of an institute at Stanford (METRICS) that focuses on “meta-research” (research on research) argued that this morass of terminology has “led to confusion…about what kind of confirmation is needed to trust a given scientific result.” They attempt to clarify the range of terms by sorting them into three categories, each reflecting an underlying concept that we care about.
First, we care about transparency in research methods. Have researchers clearly shared the steps they used to conduct their experiment? If they analyzed the data from the experiment, do we know what steps they took to do so? To borrow Ivan Oransky and Adam Marcus’ metaphor: Do we know the recipe that we’d need to follow in order to bake the same cake? The authors of the METRICS article, Steven Goodman and his coauthors, label this kind of transparency methods reproducibility.
But things aren’t quite so simple. To continue the metaphor: Imagine a chef who is writing a cookbook. Obviously, he should include a list of ingredients for the cake, but should he also include the temperature of the ingredients when they are added? Should the eggs and butter be at room temperature? What about the exact speed of mixing? One might think: just include what’s relevant to how well the cake turns out. But deciding which details those are, and whether to assume that the readers of a cookbook will correctly fill in the missing details (how warm is the “room” in “room temperature”?), is challenging. As Goodman et al. pointed out, even where researchers agree that details should be shared, they might differ on how much detail is necessary.
Beyond just having the recipe, we care about the finished product: Has someone actually used the recipe to bake a cake like the original? If so, then we have achieved what Goodman et al. call results reproducibility. This is different from methods reproducibility in that we don’t just have the recipe in our hand. We’ve also successfully used it to bake a cake that’s the same as the original.
But science is often more complicated than cake recipes. For example, say the instructions of an experiment call for a particular kind of cell; there can be plenty of genetic variation, even in the same batch of cells grown in the same lab, which could influence how the experiment turns out. If we don’t arrive at the same results when following the original instructions, it could be hard to know what to infer from that: perhaps there were hard-to-pin-down factors that varied from the original to the repeat study. Another difficulty is that we might be dealing with studies, such as those in psychology, where findings are reported in terms of “statistical significance.” (This is a much-misunderstood concept: a result is called statistically significant when its p-value, the probability that we’d see an effect of this size or greater were the null hypothesis true, falls below a chosen threshold.) Whether a second study “succeeds” or “fails” in replicating the original’s results, given its reported effect size and statistical significance, isn’t black and white; success could be defined in different ways.
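To make the point about statistical significance concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration, not taken from Goodman et al. or from any real study: the two hypothetical studies each observe some number of “successes” in 100 trials, the null hypothesis is a 50% success rate, the threshold is the conventional 0.05, and the function name `one_sided_p_value` is made up for this example.

```python
from math import comb


def one_sided_p_value(successes: int, trials: int, null_rate: float = 0.5) -> float:
    """Probability of at least `successes` out of `trials` if the null rate is true."""
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


ALPHA = 0.05  # conventional significance threshold (an assumption)

# Hypothetical data: an original study and a repeat with a slightly smaller effect.
studies = {"original": (60, 100), "repeat": (57, 100)}

for label, (successes, trials) in studies.items():
    p = one_sided_p_value(successes, trials)
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{label}: {successes}/{trials}, p = {p:.3f} ({verdict})")
```

With these made-up numbers, the original study falls below the threshold and the repeat does not, even though the two observed rates differ by only three percentage points. Judged purely by “significant or not,” the repeat looks like a failure; judged by how similar the effect sizes are, it looks much closer to a success.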
While these two reproducibility concepts are important, Goodman and coauthors invite us to widen our gaze to include a third concept. Beyond whether we can reproduce methods or results of a particular study, we might raise a bigger question. Let’s say two researchers survey their area of research specialty, and they’re aware of exactly the same studies. Will they draw the same conclusions from the evidence that they see, and come to have the same beliefs? Goodman et al. call this inferential reproducibility and say that it might be the most important of the three concepts. After all, what we’re trying to achieve with all of this hustle and bustle is not just endless discussion of a single study, or even of a single study and various attempts to repeat it. We want to form a view about a particular scientific question overall.
In fact, there’s often quite a bit of disagreement among experts in the same field. That’s true even when we can reasonably assume they’re aware of the same studies. One possible explanation that Goodman and coauthors point out is that different experts enter the discussion with different prior beliefs. From a Bayesian perspective, researchers who see exactly the same evidence can still arrive at different conclusions if they started from different priors. For example, researchers differ on how confident to be in the results of studies of varying methodologies (for example, randomized controlled or quasi-experimental trials versus studies using other methods). Those differing prior beliefs could easily influence the conclusions that they draw when they survey the studies in their field.
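The Bayesian point can be illustrated with a small sketch. It assumes a simple beta-binomial model, which is a standard textbook setup and not something Goodman et al. specify; the two researchers, their priors, and the evidence are all hypothetical, as is the function name `posterior_mean`.

```python
def posterior_mean(prior_a: float, prior_b: float, successes: int, failures: int) -> float:
    """Mean of the Beta posterior after updating a Beta(prior_a, prior_b) prior."""
    a = prior_a + successes
    b = prior_b + failures
    return a / (a + b)


# Hypothetical shared evidence: the effect showed up in 12 of 20 studies.
successes, failures = 12, 8

# Hypothetical priors: a skeptic expecting the effect rarely, an optimist expecting it often.
priors = {"skeptic": (2, 8), "optimist": (8, 2)}

for name, (a, b) in priors.items():
    mean = posterior_mean(a, b, successes, failures)
    print(f"{name}: prior mean {a / (a + b):.2f} -> posterior mean {mean:.2f}")
```

Both researchers update on identical evidence, and both move toward it, but under these assumed priors the skeptic ends at about 0.47 and the optimist at about 0.67. The gap narrows without closing, which is one way inferential reproducibility can fail even when methods and results are fully shared.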
The authors lament that reproducibility is often treated as an end in itself, rather than as an “imperfect surrogate for scientific truth.” That sounds right: we’re trying to use evidence to get at how things are, and energy poured into building bodies of evidence is aimed at getting us closer to that end. But who is forgetting that, and in what ways? The authors don’t spell it out fully. Instead they end with a call to greater reflectiveness: “We need to move toward a better understanding of the relationship between reproducibility, cumulative evidence, and the truth of scientific claims.” It’s hard to argue with that, and it also sounds like a call for philosophers, many of whom spend decades on exactly these kinds of conceptual questions, to get involved.
Useful readings on reproducibility:
Baker, M., 1,500 scientists lift the lid on reproducibility, Nature 538 (2016). Results of a survey of researcher attitudes on reproducibility conducted by Nature. Note that Nature also has a great compilation of its articles on reproducibility-related topics.
Begley, C. G. and L. M. Ellis, Drug development: Raise standards for preclinical cancer research, Nature 483, 10.1038/483531a (2012). Article on biomedical reproducibility, discusses the authors’ findings of reproducing 6/53 preclinical studies.
Chang, Andrew C., and Phillip Li. Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say “Usually Not,” Finance and Economics Discussion Series 2015-083. Washington: Board of Governors of the Federal Reserve System, http://dx.doi.org/10.17016/FEDS.2015.083 (2015). Paper on replication in economics – see also another paper on a similar question.
Clemens, M.A., The meaning of failed replications: A review and proposal. J. Econ. Surv. 10.1111/joes.12139 (2015). Discussion of the meanings of terms such as reproducibility and replication.
Ioannidis J.P.A., Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124 (2005). Highly cited article on problems in research, over a decade old but still worth reading. Also see the sequel, How to Make More Research True.
Open Science Collaboration, Estimating the reproducibility of psychological science. Science 349, aac4716 (2015). Results of the Reproducibility Project in Psychology. See also a critical comment and reply to the comment from the original researchers.
Nosek, B., et al., Promoting an open research culture. Science 348, 10.1126/science.aab2374 (2015). Transparency and Openness guidelines developed by a large number of researchers and others at a meeting hosted by the Center for Open Science, and endorsed by 500+ journals and organizations.
Schwalbe, M., Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results: Summary of a Workshop, National Academies Press, 10.17226/21915 (2016). Useful write-up on discussion of reproducibility, includes contributions from key discussants across several disciplines.
Simonsohn, U., Small Telescopes: Detectability and the Evaluation of Replication Results, Psychological Science, 10.1177/0956797614567341 (2015). Proposal for how to judge whether or not a study has been successfully reproduced.
Stephanie Wykstra is a research consultant for AllTrials USA and a freelance writer. She has held positions with Innovations for Poverty Action and GiveWell, and previously worked in academia as an Assistant Professor of Philosophy. Stephanie holds a B.A. from Yale University and earned a Ph.D. in philosophy from Rutgers University, where she specialized in epistemology.