Online Dating and the Statistical Dark Arts
One of the darkest statistical arts lies in choosing the model to use when analyzing your experimental data. A statistical model both represents your understanding of the experiment and enables you to test the strength of evidence supporting your conclusions. You can obtain very different results by choosing different models, and the existence of this choice can lead both scientists and statisticians into temptation: do we choose a model to find the best conclusions to our scientific investigation, or do we engage in sleight of hand, choosing a model to produce the most dramatic results but perhaps leaving out some critical element? Searching through many models to find “significant” results has gained a lot of press recently under the label of “p-hacking” (see pieces in Nature News or Freakonomics), and this is a serious and widespread problem in statistics. This piece is not about that, however. It’s more about the decisions that have to be made in analyzing data, even when the experimenter is trying to do it well, the consequences those decisions have for scientific conclusions, and how to deal with them as a reporter.
In textbook descriptions of experiments, the experimental plan is entirely laid out before anything starts: how the experiment will be set up, what data will be collected, and the statistical analysis that will be used to analyze the results. Well-designed experiments will be set up to isolate the particular effect you want to study, making it relatively easy to pinpoint the consequences of drug treatments or the amount of sunlight a plant receives.
Unfortunately, the realities of scientific practice are rarely so simple: you often have to rely on surveys or other observational data, resulting in a model that includes factors that could explain your data but which are highly correlated among themselves. For example, smoking and reduced exercise are both correlated with colorectal cancer, but people who smoke are also less likely to exercise, making it unclear how much of the cancer risk to attribute to each aggravating factor. Moreover, you often cannot measure effects that might be important, like why people might not participate in a poll. Here I will discuss two examples of missing measurements, model choices that impact the scientific interpretation of the data, and the need to make reasonable judgments; both come from papers on which I was asked to comment, and I give some thoughts on how to deal with each as a science journalist.
First I want to give a neat example of nonresponse bias in surveys. My excellent colleague Regina Nuzzo (also a fellow STATS advisory board member) sometimes writes for Nature News. Regina is a statistical expert in her own right, but isn’t allowed to quote herself as expert opinion, so in 2013 she asked me to provide some statistical commentary. The paper she was writing about examined the success of relationships that began on online dating sites (I think my last name may have motivated her to talk to me on this particular topic). In particular, the authors had undertaken a study of the success and happiness of marriages that started online and offline. The study had been funded by eHarmony, but it was undertaken in a very transparent manner and I don’t think anyone would seriously question its integrity.
The overall results stated that the very best thing you could do was to marry your high-school sweetheart (assuming you had one), but the next best option was meeting online (statistically better than meeting someone in a bar, for example), and this really was the headline. From a statistical point of view, the most obvious critique of the study was that the effect sizes were tiny—an average marital satisfaction of 5.6 (on a scale from 1 to 7) as opposed to 5.5—and these were only significant because the authors had surveyed 19,000 couples. Here, I’m inclined to think that eHarmony was simply pleased that online dating came out as not being worse than other ways of meeting a spouse, and statistical significance was simply icing on the cake.
But when I looked at the study’s methods, the survey methodology was more interesting. The authors had commissioned an online survey company to contact a pool of users who were paid to participate. An initial 190,000 users responded, of which about 60,000 were screened into the survey (they had to have been married at least five years, for example). Where things get more complex is that, of these, only 19,000 actually completed the survey: a two-thirds drop-out rate. This brings up the question of nonresponse bias: could whatever was associated with these users dropping out also affect their marital success?
I came up with a hypothetical that people who were inclined to persist at online surveys might also be more inclined to persist in online dating than your ordinary love-lorn single. So the survey pool might be enriched with people who were “good” at online dating and therefore had more success at it. The impact of the nonresponse rate is hidden from our measurements, as if covered by an invisibility cloak.
Does this invalidate the study conclusions? That depends on how large an effect you think the nonresponse bias might be (remember, we have no way to measure this). Given that the differences were so small, I’m not sure that “online dating is better” is necessarily valid, but I doubt that a different survey methodology would tell you it is a lot worse (and it might also confirm the study findings). Even so, the study does have some justifiable conclusions: if you happen to be the sort of person who fills out internet surveys all the way to the end, and you’re looking for a partner, going online will give you an ever-so-small advantage in making a good match (conditional on making a match at all).
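My hypothetical can be made concrete with a toy simulation. Everything below is invented for illustration (the drop-out mechanism, the size of the persistence effect); none of it comes from the study. In this made-up population there is no overall advantage to meeting online, but the survey completers, who are enriched for persistence, show one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical latent trait: persistence at finishing things online.
persistence = rng.normal(size=n)

# Assumption: persistent people are likelier to complete the survey
# (roughly a two-thirds drop-out rate, as in the study) and get a small
# satisfaction boost only if they met their spouse online.
completed = rng.random(n) < 1 / (1 + np.exp(-(persistence - 1)))
met_online = rng.random(n) < 0.5
satisfaction = 5.5 + 0.1 * persistence * met_online + rng.normal(scale=0.5, size=n)

# True population difference (online minus offline): essentially zero.
pop_diff = satisfaction[met_online].mean() - satisfaction[~met_online].mean()

# Difference among survey completers only: a spurious online advantage.
sample_diff = (satisfaction[completed & met_online].mean()
               - satisfaction[completed & ~met_online].mean())
print(round(pop_diff, 3), round(sample_diff, 3))
```

Nothing here says this is what actually happened; it only shows that a two-thirds drop-out rate gives such a mechanism plenty of room to manufacture differences of roughly the size the study reported.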
Nonresponse bias is a hypothetical issue here—we have no actual measurements that would back it up—and invoking it relies on a plausible explanation of why it might make an impact. The trouble is that you could tell any story you like: you could just as well make up a story about why this bias dramatically suppresses the success of online dating. One of the reasons it was so hard to tie smoking to lung cancer, at least in legal fights, was that without randomized trials you could always argue that some unmeasured factor might be a common cause of both smoking and lung cancer, in which case stopping smoking would not stop you getting cancer. The key is plausibility; eventually, the weight of evidence for smoking causing cancer became so large that explanations of this kind were no longer plausible. I haven’t been able to come up with a reasonable story to suggest why online dating should be even more beneficial than this study suggested.
Notes for journalists!
The first lesson here is about what is often described as “observational” data. Conclusions from survey data always run the risk that the relationships they find (e.g. “Where you meet your spouse is associated with the happiness of your marriage…”) might be changed by some unmeasured variable (“… but only if you obsess about completing internet questionnaires”). This doesn’t make them not useful, but a journalist should ask:
- Did too many people drop out of the survey? Large drop-out rates can imply a poorly designed survey (possibly it took too much time to complete), suggesting that only particular types of people complete the survey.
- What might affect whether someone (or some animal, for that matter) is included in the survey? Could these factors also be associated with the response? If being in the survey at all makes a difference to the outcome you’re measuring (in this case, marital bliss), the scientific conclusions really only apply to those who were surveyed. This applies both to why someone dropped out of a survey as well as who even started the survey in the first place.
- Are the sizes of the effect small relative to the overall variability? We might believe that a large effect is fairly robust to changes in survey design, but an effect as small as we had here could well vanish with only subtle modifications. Just because a result is “statistically significant” doesn’t mean it’s a large or meaningful difference.
- Does the way the questions are asked on the survey impact the way the results might arise? Similarly, do earlier questions on a survey lead respondents to give particular answers?
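The point about small effects and large samples is easy to check. Below, the 0.1-point difference and the roughly 19,000 couples come from the study, but the standard deviation of 1.2 on the 1-to-7 scale is my assumption for illustration, not a figure from the paper:

```python
from scipy import stats

def p_value(diff, sd, n):
    """Two-sided p-value for a difference in means between two groups of size n."""
    se = sd * (2 / n) ** 0.5       # standard error of the difference in means
    t = diff / se
    return 2 * stats.t.sf(abs(t), df=2 * n - 2)

# The same 0.1-point difference on a 1-7 scale, with an assumed SD of 1.2:
p_small = p_value(0.1, 1.2, 100)     # 100 couples per group: not significant
p_large = p_value(0.1, 1.2, 9_500)   # ~19,000 couples in total: tiny p-value
print(p_small, p_large)
```

With a couple of hundred respondents the difference would be statistically invisible; with 19,000 it becomes “highly significant” without getting any larger.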
However, remember that there are infinitely many ways in which you could attribute survey results to unrecorded factors. When casting doubt on such results, it’s important to ask the corresponding questions:
- Is the explanation for why a result is misleading plausible? Does it have a basis in known scientific findings?
- Are the reported effects smaller than the plausible effects of nonresponse bias?
Bear in mind that we are now speculating about data that weren’t recorded, but which might change our minds if they had been. Humans are very good at making up stories about why something might be so, and skeptical stories about a study deserve even more scrutiny than the study itself, because they don’t come with data.
It’s always good practice to note the drawbacks of survey research; it’s also good policy to ask an expert, “Is there anything they should have measured, or that you would have done differently, and could that have made a difference?” But remember that there are infinite possible variations on any experiment and results should be questioned only when there are good reasons based on scientific understanding.
Even without unmeasured factors in the model, the choice of model can have an important impact on scientific conclusions. The second example is from a story about a study on the causes of the very different gender representation across scientific disciplines. The overall patterns of gender representation aren’t simply STEM (Science, Technology, Engineering and Mathematics) versus humanities: many women go into biology but few into physics, and many women go into history but very few into philosophy or music composition.
This particular study proposed that subjects were dominated by men where “innate ability” was believed to be necessary to success—you just have to be naturally good at it. To assess this, they collected statistics on female representation on university faculty by field and they sent out a survey to ask students about attitudes in their field, including about the importance of being naturally gifted, as opposed to working hard. When they conducted an analysis of this data, they found that average ratings of the importance of innate ability dominated the explanation of female representation across fields.
The particular writer who asked me for comment seemed to be looking for some form of controversy, possibly the nonresponse bias discussed above. In fact, from reading the piece I felt that the authors had been fairly careful with their statistics; they had posed several possible alternative explanations, asked questions in the survey to assess these, and conducted reasonable analyses to reject them. In so doing, they showed that their result was robust to many different possible models. There were not a lot of holes to pick. Nonresponse bias could skew these results, but unlike the story above, it’s harder to make a plausible case, especially since the effects were reasonably large; so far so good.
I later downloaded the authors’ data so that I could present their analyses in a statistics class. The analyses came out as the authors had claimed (this is never assured); however, there was one column that wasn’t in their headline model, from a survey question titled “How welcoming is your field to women?” If you include this as a covariate (a variable that could explain the resulting data), the relationship between gender representation and every other effect, including innate ability, becomes statistically indistinguishable from chance.
In other words, by leaving out this covariate, they ended up with a model that led to more interesting conclusions. Had the authors incorporated “How welcoming is your field to women?” as a variable in their model, it would have explained so much of the variation in the fields women eventually choose that opinions on how much innate talent is required would have had no detectable relationship with the remaining variation. It would have been just as valid to conclude that women’s professions are predicted by how welcoming the field is to women, rather than by perceptions of the need for innate talent.
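A small simulation shows how this can happen. The data below are synthetic, generated under an assumed story in which “welcoming to women” drives representation and the innate-talent rating is merely correlated with it; this is not the authors’ data or their model:

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Fit OLS with an intercept; return two-sided p-values for each predictor."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, _, _, _ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    df = len(y) - X1.shape[1]
    sigma2 = resid @ resid / df                 # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
    return 2 * stats.t.sf(np.abs(beta / se), df)[1:]  # drop the intercept

rng = np.random.default_rng(1)
n_fields = 60  # hypothetical number of academic fields

# Assumed story: "welcoming" drives representation; the innate-talent
# rating is correlated with welcoming but has no effect of its own.
welcoming = rng.normal(size=n_fields)
innate = -0.8 * welcoming + 0.6 * rng.normal(size=n_fields)
representation = 0.9 * welcoming + 0.4 * rng.normal(size=n_fields)

p_short = ols_pvalues(innate, representation)[0]  # innate alone: "significant"
p_full = ols_pvalues(np.column_stack([innate, welcoming]),
                     representation)[0]           # innate given welcoming: not
print(p_short, p_full)
```

Leaving the welcoming variable out hands its explanatory power to the correlated innate-talent variable; putting it back in takes that power away again.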
Did I just uncover scientific fraud? I’m not sure. It may not be appropriate to use “How welcoming is your field to women?” as an explanatory variable, because the perception of being welcoming may simply reflect female representation itself. That women have lower representation in fields viewed as less welcoming to them (or possibly the causation runs the other way around) is a pretty trite statement.
In fairness, the authors did partially acknowledge this problem with their interpretation, noting that, in a simple model for how women choose their fields, perceived innate ability is correlated with “welcoming to women,” and 70 percent of the effect seen from perceptions of innate ability is mediated through the “welcoming to women” variable. Nonetheless, I do wish they had acknowledged that including the “welcoming” variable rendered everything else insignificant.
Notes for journalists!
This is a different example of confounding: the decision to include or exclude an explanatory variable in your model can change your conclusions. As is the case in this study, it isn’t immediately obvious that this decision has been made, and as a journalist there are some key questions to ask:
- Are there variables in the data that haven’t been used? If so, is there an explanation as to why they have been left out?
- Statements like “Welcoming to Women mediates the effect of Innate Ability” imply a model that includes “Welcoming to Women,” yet this variable may not have been reported (and its inclusion in the model might reduce the impact of the author’s results).
- Are there factors that haven’t been examined and should be? This study did look at expectations of hours worked on campus as another possible factor. It didn’t appear important, but if it hadn’t been included that should have been queried. We might also have asked about how collegial the field is—do projects tend to be done alone, or in large groups?
As with my list above, we also need to ask whether there are good reasons to exclude some factors. Perhaps adding them would make the model a lot less interesting, and they ought to be treated as an alternative response. In this case, that fields perceived as being unwelcoming to women have poor gender representation isn’t a particularly interesting statement; however, it might be interesting to look at whether fields thought to require innate ability tend to be unwelcoming.
Useful questions to ask experts are not just “What might you have done differently?” but also “What would you do now?” as a means of eliciting whether there are deeper questions to ask, or yet more possible explanations to rule out.
What both of these examples illustrate is the use of judgment in choosing statistical models. There are many statistical traps that produce erroneous conclusions, but there are also many reasonable judgments that rest on our understanding of the science or situation and our interpretation of what the data means.
For effects that we cannot measure, such as nonresponse bias, is there a plausible mechanism that might affect results? When some explanatory variables are so highly correlated with the response that they exclude all others, are they really telling us something new?
Many statistical arguments are also arguments about subject matter and context. Moreover, deciding on what to control for remains a dark art. We know that testing too many variables for significance, without correcting for multiple comparisons, yields unreliable results. But the effect of selecting confounders is both a matter of how your model behaves, statistically, and a decision about what your model means.
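The multiple-comparisons problem in that last point is easy to demonstrate: test enough variables that have nothing to do with the outcome and, at the conventional 5 percent level, a predictable fraction will come out “significant” anyway. The simulation below is pure noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_obs, n_vars = 200, 200

# An outcome and 200 candidate explanatory variables, all unrelated noise.
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_vars))

# Test each variable against the outcome, one at a time, at the 5% level.
pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_vars)])
n_significant = int((pvals < 0.05).sum())
print(n_significant)  # around 10 spurious "findings" out of 200 tests
```

This is exactly why uncorrected significance testing over many candidate models is unreliable: some of those 200 nulls will clear the bar by chance alone.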
Giles Hooker is Associate Professor of Statistics at Cornell University.
Please note that this is a forum for statisticians and mathematicians to critically evaluate the design and statistical methods used in studies. The subjects (products, procedures, treatments, etc.) of the studies being evaluated are neither endorsed nor rejected by Sense About Science USA. We encourage readers to use these articles as a starting point to discuss better study design and statistical analysis. While we strive for factual accuracy in these posts, they should not be considered journalistic works, but rather pieces of academic writing.