Why We Need a Statistical Revolution
My father told me the most important thing about solving a problem is to formulate it accurately, and one would think that, as statisticians, most of us would agree with that advice. Suppose we were to build a spaceship that can fly to Mars and return safely to Earth. It would be folly indeed to make simplifying assumptions in its construction that science tells us are false, such as assuming that during takeoff the temperature of materials would not exceed a certain critical level, or that friction is constant in the atmosphere. Such assumptions could spell death for the astronauts and failure for their mission. And yet, that is what many statisticians often do, sometimes citing the belief of the great 20th-century English statistician George E.P. Box that “Essentially, all models are wrong, but some are useful” (Empirical Model-Building and Response Surfaces, 1987).
To understand why this insight into statistics is outdated is to understand how we are building the foundations for a revolution in method, one that uses machine learning in ways that could scarcely have been imagined by Box writing three decades ago, let alone the great progenitors of computer algorithms, such as Alan Turing. It is a revolution that has the power to revitalize the connection between scientists and statisticians, and one that will be as central to making sense of Big Data as Big Data is central to the future of statistics and science. But in order to arrive at what I have called “targeted learning,” we need to start with a basic problem in statistical modeling.
A statistical model is a set of assumptions about the way the data generated by an experiment will be probabilistically distributed. When Box talked of statistical models, he was referring to parametric models, where the probability distribution of the data is determined by the choice of the parameters, which means it can be adapted by changing one or more of these. For example, one might model the chance of developing lung cancer as a function of how many cigarettes are smoked. A linear model of this would assume that the more cigarettes someone smokes the higher the probability of developing lung cancer, and that this relationship is linear (meaning it can be graphed as a straight line). The slope of the line is an example of a parameter. Obviously, a parametric model seems to give us an insight into the relationship between increased smoking and cancer, which is useful. But parametric models are too restrictive to capture the true probability distribution in most cases.
The reason is this: In a parametric model, we assume that the data are derived from a probability distribution that is fully described by a finite number of unknown numbers. One way to visualize this would be to imagine flipping all the loose change in your pocket or pocketbook. The probability distribution covers the likelihood of each coin turning up heads (or tails, or perhaps the coin landing sideways). Either way, there are a finite number of outcomes: heads, tails, sideways, and an unknown number of times that each of these outcomes will occur. If you make no assumptions at all, then the model allows for all possible probability distributions; but if you start making restrictive assumptions, there’s the risk that the true probability distribution is no longer captured by the model. For example, one might assume that the coin is fair and always lands on one side or the other. The probability of heads and the probability of tails are then assumed to each be 0.5; but if that assumption is inappropriate for any reason (the coin is slightly bent, perhaps), then the model will not describe what will really happen. We refer to that type of model as a misspecified statistical model.
Such models still dominate statistical analysis, even though, you may be surprised to hear, everybody agrees they are false. The world is full of bent coins, each bent in slightly different ways. For example, we often assume that data are well modeled by a normal curve, commonly called a bell-shaped curve; and yet, in many real-life situations, this may be a poor model, or it may be only useful for broad, sweeping generalizations. Test scores, for example, do not fall exactly along a normal curve, and the model is a poor predictor of the outcome for scores that are far from average; but we continue to talk about test scores as if the normal curve applies at all times. We may talk about the racial or gender makeup of students whose scores are “more than three standard deviations from the mean” and draw conclusions from normal curves. These conclusions may, in fact, be incorrect when it comes to predicting scores on standardized tests.
By using these kinds of models, we assume that our data are generated from a specific type of probability distribution; but if the model is incorrect, it may lead us to incorrect conclusions. To illustrate with another example, imagine that you collected a sample of data consisting of the income of 100 people in a particular city in the United States. You found that the average in your sample was $1,000 less than the average income across the U.S. Would you say that residents of your entire city earn, on average, less than the U.S. population? Or might it just be a fluke of probability? To assess this, you would have to employ a statistical test to compare your average to that of the United States.
If you assume that the U.S. data are normally distributed, you may well find that the income in your town is quite different from what you would expect, and that people living there have a statistically significant difference in salary compared to the general U.S. income. The statistical test in that case assumes that the data are a sample from a normal distribution; however, if you knew that U.S. income is actually right skewed (it has a long tail to the right, with many people earning comparatively low salaries, and few people earning very high salaries) then the average income you measured may not be so unusual at all.
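A small simulation can make the danger concrete. The sketch below is an illustration only, not part of the original argument: it assumes a lognormal income distribution with made-up parameters (`log_median`, `log_sd`), draws many hypothetical cities of 100 people that all share exactly the national distribution, and applies a normal-theory cutoff of 1.96 to each. If the normal assumption were harmless, about 2.5% of cities would be flagged in each direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed "national" income distribution (lognormal).
# These parameter values are assumptions chosen only for illustration.
log_median, log_sd = np.log(40_000), 0.9
national_mean = np.exp(log_median + log_sd**2 / 2)

n_people, n_cities = 100, 20_000

# Every simulated city follows the national distribution exactly,
# so any "significant" difference is a false alarm.
incomes = rng.lognormal(log_median, log_sd, size=(n_cities, n_people))
city_mean = incomes.mean(axis=1)
city_se = incomes.std(axis=1, ddof=1) / np.sqrt(n_people)
z = (city_mean - national_mean) / city_se

# Normal-theory cutoff: nominally 2.5% of cities flagged in each direction.
flagged_lower = float(np.mean(z < -1.96))
flagged_higher = float(np.mean(z > 1.96))

print(f"flagged as earning less: {flagged_lower:.3f} (nominal 0.025)")
print(f"flagged as earning more: {flagged_higher:.3f} (nominal 0.025)")
```

Under these assumptions the two error rates should come out lopsided, with "this city earns less" declared far more often than the nominal rate suggests. The exact figures depend on the assumed skew and sample size.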
The problem is that the world is not pre-packaged for this kind of statistical convenience, and the assumptions driving these models are too restrictive to capture the true distribution of data. Despite this, almost all the statistical software tools available to scientists encourage parametric modeling, and thus the design of experiments based on assumptions about the distribution of data that are wrong. The resulting epidemic of false positives (findings that aren’t true) has been recognized by many, not least John Ioannidis, whose 2005 paper in PLOS Medicine, “Why most published research findings are false,” made a compelling case for reform, and drew the attention of many people beyond the practice of science and statistics to a signal problem in the production of knowledge.
The “Art” of Statistics
In order to see why parametric modeling is actually working against scientific discovery, it’s worth constructing a typical exercise of data analysis behind a typical study claiming a new finding.
Suppose you are a data analyst and you want to establish the causal effect of drug A versus drug B on the occurrence of heart disease within one year after starting the drug. You would be provided with various demographic details and medical data about the individuals, their treatment plan, and whether or not each person had a heart attack. The individual outcome in this study can be coded as a binary outcome, that is, either 0 or 1. For each person in the study, the value 0 would stand for no heart attack in that year, whereas the 1 would stand for the occurrence of a heart attack.
Given the limited set of available tools, you will immediately implement a logistic regression model. This model, like all models, makes some assumptions. In this case, it assumes that the probability of heart attack can be described by a special kind of function, called a logistic function. This function depends on the treatment and characteristics of the patient that are related to heart attack and which also might be predictive of the treatment decision. These characteristics are called confounders because they influence the probability of heart attack, independent of whether the patient takes drug A or drug B, and they will often also influence the probability of taking drug A versus drug B. The choice of confounders will influence the resulting logistic function, and hence the probabilities derived from the model.
In designing this experiment, you will come up with a reasonable set of such confounders. The question then is how to put them into your linear logistic regression model: will you enter age, or the square of age, or some other function of age? How should you account for interactions between age and drug, or between baseline health and drug? Should you include all two-way interactions, all three-way interactions, and so on?
The number of possible ways to add confounders to these models can quickly become overwhelming, and you lack the knowledge to choose among them. So, given this impossible situation, you decide to just keep it simple, probably leaving out interactions with treatment since they would make the output much more complicated, and run your first logistic regression model, focusing on how the probability of heart attack depends on whether the patient was taking drug A or drug B. This dependence is now measured by a coefficient (or number) in front of the treatment indicator. You hope that its p-value will be smaller than 0.05, because if so, the finding will be declared statistically significant, a generally accepted but problematic label for a result that would be unlikely to arise by chance alone if the drug made no difference.
Now suppose your p-value doesn’t show “statistical significance.” You report back to the rest of your research team. Nobody is happy. “Okay,” you say, “let’s not panic. Let’s try a few different logistic regression models.” Eventually, from among a wealth of choices, you produce a logistic regression model that provides results that make sense to the collaborators. Everyone is happy, and the researchers can now publish the paper, although without mentioning that the final logistic regression model was, in fact, selected in order to get these “nice” results. The authors make sure to state that the chosen method adjusted for the listed confounders, even though it did not. As the p-value for the treatment coefficient shows a significant result, there’s something to tell the media, and, bingo, the finding becomes “news.” Fortunes, academic reputations, possibly even lives will be affected by the result.
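The cost of this kind of model shopping can be illustrated with a toy simulation. The sketch below rests on simplifying assumptions that are not in the scenario above: ordinary least squares with a continuous outcome stands in for logistic regression, the six covariates are pure noise, the treatment has no effect at all, and p-values use a normal approximation. The helper `treatment_p_value` and all variable names are hypothetical. The analyst fits every one of the 64 possible adjustment sets and keeps whichever gives the smallest p-value for treatment.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(1)

def treatment_p_value(y, treat, covs):
    """Two-sided p-value (normal approximation) for the treatment
    coefficient in an ordinary least-squares fit of y on treat and covs."""
    n = len(y)
    X = np.column_stack([np.ones(n), treat, covs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    se = math.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return math.erfc(abs(beta[1] / se) / math.sqrt(2))

n, n_covs, n_sims = 100, 6, 400
# All subsets of the covariates; subsets[0] is the empty (unadjusted) model.
subsets = [s for r in range(n_covs + 1)
           for s in itertools.combinations(range(n_covs), r)]

first_model_hits = 0
best_model_hits = 0
for _ in range(n_sims):
    treat = rng.integers(0, 2, size=n).astype(float)
    covs = rng.normal(size=(n, n_covs))
    y = rng.normal(size=n)  # outcome is unrelated to treatment by construction
    p_values = [treatment_p_value(y, treat, covs[:, list(s)]) for s in subsets]
    first_model_hits += p_values[0] < 0.05    # report the first model only
    best_model_hits += min(p_values) < 0.05   # report the "nicest" model

first_rate = first_model_hits / n_sims
shopped_rate = best_model_hits / n_sims
print(f"false positives, first model only: {first_rate:.3f}")
print(f"false positives, best of {len(subsets)} models: {shopped_rate:.3f}")
```

Because the first model is always among the candidates, picking the smallest p-value can only raise the false positive rate, never lower it. How far it rises depends on the sample size and on how different the candidate models are.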
What you have just witnessed is statistics as an art rather than a science: the art of finding an association rather than a science that accurately models the probability distribution of the data. Instead of talking to the scientists who generated the data so that we can understand the underlying data-generating distribution as much as possible in order to determine the appropriate model of statistical analysis, we typically ask a few questions, such as: Is the outcome a survival time? Is it a binary outcome? And then we quickly move on, picking a model based almost entirely on the format of the data we collected rather than the experiment that was conducted to create the data.
We have a hammer in our hand and we have decided that that hammer will have to do the job. As a consequence, the scientists who generated the data with their experiment will have their doubts about the statistical methods used, perhaps only accepting the answers we generate for them if these answers “make sense.” We simply try out as many models as needed until we get an answer that achieves consensus. Our collaborators will view us as technicians they can steer to get their scientific results published, and statistics as a tool of confirmation rather than a scientific exploration.
But if we have turned statistics into an art, the same conclusion applies to the scientific experiments themselves: Their results are unreliable. “Confidence” intervals end up being based on flawed assumptions and may miss any characterization of the valid answer to the scientific question, even when the sample size is very large. Indeed, as the sample sizes increase, the width of the confidence interval becomes small, suggesting high levels of confidence that, in fact, do not exist.
An unfortunate irony emerges from this kind of error: if the model is wrong and the sample size is large so that the reported confidence interval is narrow, the true value will surely lie outside the narrow confidence interval. Yet the data will be taken to mean that we have a great deal of certainty that the valid answer is in the narrow confidence interval. In other words, the larger the sample size in a misspecified model, the more confident one can be that the confidence interval excludes the truth! Because of the same bias in the estimated coefficient, “p-values” will always be “significant” as long as the sample size is large enough.
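This irony is easy to reproduce in a toy setting. The sketch below is an illustration under assumptions of my choosing, not an analysis from any real study: the data are lognormal, the target is the median, and the analyst wrongly assumes a normal model. Under a normal model the median equals the mean, so the model-based estimate of the median is the sample mean, with the usual interval of 1.96 standard errors around it.

```python
import numpy as np

rng = np.random.default_rng(2)

true_median = 1.0  # the median of a lognormal(0, 1) distribution is exp(0)
n_reps = 1000

coverage = {}
mean_width = {}
for n in (50, 500, 5000):
    x = rng.lognormal(mean=0.0, sigma=1.0, size=(n_reps, n))
    # Misspecified normal model: median == mean, so estimate it by the mean.
    estimate = x.mean(axis=1)
    half_width = 1.96 * x.std(axis=1, ddof=1) / np.sqrt(n)
    covered = np.abs(estimate - true_median) <= half_width
    coverage[n] = float(covered.mean())
    mean_width[n] = float(2 * half_width.mean())

for n in coverage:
    print(f"n={n:5d}  average CI width={mean_width[n]:.3f}  "
          f"share of CIs containing the true median={coverage[n]:.3f}")
```

As the sample grows, the intervals shrink around the mean of the distribution, which is not the median, so the share of intervals that contain the true value falls toward zero while the reported precision keeps improving.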
One common way of finessing this problem of model misspecification is to do a sensitivity analysis, meaning the statistician will investigate how much the results change if one adds extra variables into the model. That is, one investigates how much the results change by using slightly less unrealistic parametric models than the parametric model used in the original analysis. This is like building a spaceship that can only do the job under unrealistic assumptions, and then determining if and how it would blow up under slightly less unrealistic assumptions. In the real world, it will still blow up. So how useful is that?
Is this mess we have created really necessary? No! As a start, we need to take the field of statistics—the science of learning from data—seriously.
Statistics as Science
Let’s go back to our hypothetical drug trial. We might know the observed data are the result of a specified number of independent identical experiments; we may know that the medical doctors made the treatment decisions based only on a small subset of the variables measured on the patient; we may be assured that the conditional probability of heart disease is always smaller than 0.04, independent of the characteristics of the patient in our patient population. We may also have some sense of the design of the experiment, for instance that it involved a well-understood two-stage sampling scheme carried out as follows:
- One first samples patients from a target population
- Blood-samples are taken from each patient
- Patients are randomly assigned a treatment
- Medical doctors follow each patient until the outcome of interest is collected
- At the end of this first stage of the study, one chooses a random subsample of all the patients sampled in the first phase
- One then collects extra data on the subsample and carries out the careful and expensive analysis of their previously collected blood-samples.
But to obtain this knowledge and to define an accurate statistical model, we statisticians need to learn as much as possible about how the data were generated. Better yet, we should be involved in defining the data-generating experiment in such a way that the scientific question of interest can be best answered. Many times statisticians appear at the table only after the data have been gathered, and we are supposed to answer the question of interest one way or another, even though the experiment simply does not generate the information needed.
Statisticians can also help find what we call an estimand (the quantity or parameter that we’re trying to figure out based on the data) that best answers the scientific question being researched. Being present at the design stage would allow the statistician to minimize the possible discrepancy between the estimand and the answer to the scientific question. This is a lot of work. It is difficult. It requires a reasonable understanding of statistical theory. But it is a worthy academic enterprise! We will open up a new world to our collaborators by being able to generate questions, such as “What is the optimal personalized treatment rule?”, that our collaborators had no idea they were even allowed to pose.
Targeted Learning and Big Data
At the same time, we have reached a moment in history where technology can help us to transcend the limitations of the parametric model and tackle all these hard estimation problems. Starting in 2006, we developed a general statistical learning approach—targeted maximum likelihood learning (or, more generally, targeted minimum loss-based learning)—that integrates the state of the art in machine learning and data-adaptive estimation with all the incredible advances in causal inference, censored data, efficiency and empirical process theory. This integration is done through what we called “super learning.”
The initial stage of super-learning creates a library of parametric model-based estimators and data adaptive estimators. The latter are automated machine learning algorithms that use the data to decide on the confounders that should enter the model, thereby replacing the art (and arbitrariness) of making these decisions with objective criteria. If the statistical model contains knowledge, such as that the probability of heart attack is bounded by some small number, then each of these candidate algorithms will respect that knowledge. There are a lot of these algorithms, and the body of machine learning algorithms grows every year—in this case, all following different approaches in approximating the true probability distribution of a heart attack as a function of these confounders. The algorithms go through an iterative updating process that aims to balance bias (due to not including enough variables) against variance (by including too many variables).
So why would one of these machine learning algorithms be better than another? You let the super-learning algorithm use the data to decide among all these algorithms. The data set is split into many different “training samples” and “validation samples”; the algorithms are fit on the training samples, and their performance is evaluated on the validation samples. The one that performs best, on average, in predicting heart attack based on patient characteristics is the winner.
Our research showed that for large samples, this super-learner process performs as well as the best-weighted combination of all these algorithms. The lesson is that one should not bet on one algorithm alone, but that one should use them all to build a diverse, powerful library of candidate algorithms—and then to deploy them all competitively on the data. In this manner, targeted learning integrates all advances in learning.
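The competition described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the super learner implementation itself: the data are simulated, the library holds just four simple candidates of my choosing (a constant, a linear fit, a quadratic fit, and a nearest-neighbor average), performance is measured by squared error, and the sketch only picks a single cross-validated winner rather than computing a weighted combination. All names (`library`, `cv_risk`, `make_knn`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data (an assumption for illustration): x is a rescaled patient
# characteristic in [0, 1]; y is a 0/1 outcome whose probability rises
# non-linearly with x.
n = 1000
x = rng.uniform(0.0, 1.0, size=n)
y = rng.binomial(1, 0.05 + 0.9 * x**2).astype(float)

def fit_mean(x_train, y_train):
    m = y_train.mean()
    return lambda x_new: np.full(len(x_new), m)

def make_poly(degree):
    def fit(x_train, y_train):
        coefs = np.polyfit(x_train, y_train, degree)
        # Clip so that predictions are valid probabilities.
        return lambda x_new: np.clip(np.polyval(coefs, x_new), 0.0, 1.0)
    return fit

def make_knn(k):
    def fit(x_train, y_train):
        def predict(x_new):
            dist = np.abs(x_new[:, None] - x_train[None, :])
            nearest = np.argpartition(dist, k, axis=1)[:, :k]
            return y_train[nearest].mean(axis=1)
        return predict
    return fit

library = {
    "mean": fit_mean,
    "linear": make_poly(1),
    "quadratic": make_poly(2),
    "knn_25": make_knn(25),
}

# Five validation samples; each candidate is fit on the remaining data.
folds = np.array_split(rng.permutation(n), 5)

def cv_risk(fit):
    losses = []
    for val_idx in folds:
        train = np.ones(n, dtype=bool)
        train[val_idx] = False
        predict = fit(x[train], y[train])
        losses.append(np.mean((y[val_idx] - predict(x[val_idx])) ** 2))
    return float(np.mean(losses))

cv_risks = {name: cv_risk(fit) for name, fit in library.items()}
winner = min(cv_risks, key=cv_risks.get)

for name, risk in sorted(cv_risks.items(), key=lambda item: item[1]):
    print(f"{name:10s} cross-validated squared error: {risk:.4f}")
print("winner:", winner)
```

The point of the design is that no candidate is trusted in advance: each one is judged only on data it did not see during fitting, so adding more candidates to the library costs little and can only widen the field.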
This field of targeted learning is open for anyone to contribute to, and the truth is that anybody who honestly formulates the estimation problem and cares about learning the answer to the scientific question of interest will end up having to learn about these approaches and can make important contributions to our field.
Solving the estimation problem is also of critical relevance to the field of big data, where it is far from clear that data scientists are doing a better job than statisticians in producing valid knowledge. How serious are data analysts about tackling the statistical estimation problem? How in thrall is the field to parametric modeling?
There is also a very serious concern that the leaders of the big data revolution do not realize that algorithms in data science need to be grounded within a formal statistical foundation so that they actually answer the questions they want to answer with a specified level of uncertainty. Despite some prejudices to the contrary, big data does not obviate the need for statistical theory. Data itself is useless and can only be interpreted in the context of the actual data-generating experiment, and a biased sample is a biased sample even when the sample size is incredibly large. The largest sample size in the world is not going to change the need for targeted learning and formal statistical inference.
Big data also presents new challenges for statistical learning, such as designing large-scale data collection studies at the national and international level, and setting up dynamic self-learning systems—such as a precision health care system in which patient care improves in response to what has been learned from the past. Instead of giving up on statistical theory in this era of big data, we need vigorous statistical theory to tackle these new challenges.
In sum, statistics needs to make itself relevant to data science in order to survive as a credible science, and data science needs statistics in order to do credible science. All scientists need both statistics and big data to advance knowledge into new areas of problem solving, and good science requires these disciplines to work as a team.
We, as statisticians, need to stop working on toy problems, stop talking down important theory, stop being attached to outdated statistical methods, and stop worrying about the politics of our journals and our field. We, as statisticians, data scientists, and scientists, need to define the outstanding problems in research—and solve them as a team by taking into account all the scientific advances in the different disciplines. Science needs big data and statistical targeted learning—but statisticians will have to rise to the challenge if science as a whole is to thrive.
A shorter version of this essay appeared in AmStats, the magazine of the American Statistical Association.