Interpreting the data

Data collection, study design, social factors, and the risks of over-interpretation

Rebecca Goldin Ph.D.

May 21, 2019

Be wary of what data say and ask which social features may impact that data

You may think, for example, that a good measure of top schools with high levels of educational attainment is that they have more students taking AP courses, and then write a piece using AP exam scores as a metric of value for “better” schools.
But broader participation in standardized tests generally results in lower test scores. As a school encourages students who don’t typically take AP courses and exams to take these exams, these students will, on average, perform worse than the kind of students who historically have taken these exams—and that will bring the average scores down. Even if two schools are otherwise exactly the same—except for one school encouraging a broader range of kids to try the AP exam—the school with increased participation will have lower AP scores.
This speaks to why it’s important to interpret any statistic, such as the “average AP Calculus AB exam score,” from multiple angles and in its wider context. How many students take AP courses? Has the school with higher participation rates encouraged weaker students to sit the exams? You need to be specific about the particular educational factor you want to write about (students taking AP courses, students passing AP exams, or both) and investigate whether the data are comparable from one school to another.
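The participation effect described above can be checked with simple arithmetic. The sketch below uses invented scores for two hypothetical schools, and it assumes that a score of 3 or higher on the 1-to-5 AP scale counts as passing. None of the numbers come from real schools.

```python
# Hypothetical AP scores, invented purely for illustration.

def mean(scores):
    """Arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)

# The kind of students who have historically taken the exam at either school.
traditional_takers = [5, 4, 4, 3, 4]

# School A tests only those students.
school_a = traditional_takers

# School B tests the same kind of students plus five newly encouraged ones,
# who on average are less prepared (an assumption of this sketch).
newly_encouraged = [3, 2, 3, 2, 1]
school_b = traditional_takers + newly_encouraged

PASSING_SCORE = 3  # assumed pass threshold

avg_a = mean(school_a)
avg_b = mean(school_b)
passing_a = sum(1 for s in school_a if s >= PASSING_SCORE)
passing_b = sum(1 for s in school_b if s >= PASSING_SCORE)

print(f"School A: {len(school_a)} test takers, average {avg_a:.1f}, {passing_a} passing")
print(f"School B: {len(school_b)} test takers, average {avg_b:.1f}, {passing_b} passing")
```

With these made-up numbers, School A averages 4.0 and School B averages 3.1, yet School B has seven passing scores to School A's five. Which school looks "better" depends on whether you report the average score or the number of students passing.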


Don’t draw conclusions from a study that aren’t supported by the study and its design

A recent tweet by the Economist went: “It is better to have big classes taught by excellent teachers than smaller groups taught by mediocre ones.” Yet the study written about by the Economist did not distinguish “excellent” teachers from “mediocre” ones—and it didn’t directly compare big classes to small ones. The article discussed features of the learning environment in Singapore and in the U.S., and then attempted to reason why Singapore scored so much better on the PISA (Programme for International Student Assessment) exam.


Make clear distinctions between correlations found in the literature and causal relationships

The most common misperception in educational research is that an observed relationship is causal. These relationships are often complex, with challenges and differences that come from multiple sources.
Imagine, for example, an observational study that finds students in schools where there is a high percentage of student poverty do worse on exams than students in wealthier schools, even when compared to students with the same family incomes.
The conclusion may not be that “poverty causes poor performance on standardized tests,” as the problem could relate not to individual poverty but to poor communities. Perhaps the problem derives more from parental education or family stressors, which are related to poverty but not identical to it.
It’s important to recognize that there are often competing measurements that can tell conflicting stories.
Suppose a school’s AP scores decline substantially over a period of two years. Does this mean that the school has gotten worse? Superficially, one could point to poorer teaching, or the changing demographics of the students attending the school, or increased violence—or possibly even a new principal who didn’t value the AP courses.
But, as we discussed above, the same data may be observed if the school decided to increase access to AP courses and/or pay for students in AP courses to take the AP exam. Increasing access to AP exams will generally result in decreased exam scores, as students who are less academically prepared (on average) than their peers take the exams; in prior years, comparable students did not sit the exams, so their scores were not included in the data.


Examine data collection methods

Data are never collected in a social vacuum. They are influenced by how questions are asked, when and how the data are collected, and who participates. A good journalist views data as a source, and sources always have a bias. Data need to be backed up by reporting.

Some questions to ask about your data

  • Are there aspects of the data collection that would leave some groups out of the mix? For example, a study measuring progress of students over a three-year period in mathematics will not measure the progress of students who have changed schools. A study conducted in English will not include people who don’t speak English. Is there a reason that the “results” could be impacted by such restrictions?
  • How broadly do the data collected reflect truths about the population about which you want to make an inference? For example, techniques that a study finds effective for engaging families in a low-income community with many generations of parents on public assistance may not have the same impact in a community with a high percentage of immigrants and English language learners, or in one with high levels of lead in the water (and presumably more students with special education needs). Similarly, a study showing that students are more likely to pursue advanced courses in a school offering the International Baccalaureate program may be a misleading guide to the impact of the IB program in a highly educated and wealthy county.
  • How is “error” handled by the researchers? This certainly includes human error, but it also includes other broad categories of error. The mechanism of measurement itself may be flawed: for example, a test may have a misworded question or the machine that reads a Scantron may make a mistake. There could be an error due to an environmental circumstance, like a fire drill that happens in the middle of a test, or a bad cold going around the school when the measurement is made. There could be systematic bias, such as a question using a word whose cognate in Spanish has a different meaning than in English, leading a student to misunderstand the question. There could be human bias, such as an individual psychologist who decides which test to complete based in part on behavioral cues that are irrelevant to the tests.
  • How are participants selected? If people choose to be part of a study, it is worth asking whether they might differ from people not in the study in ways that relate to the question being studied. For example, a study examining the relationship between parental math anxiety and a child’s math anxiety may end up recruiting more parents who stay home with their children than parents who work outside the home. The stay-at-home parents may, in turn, have (on average) fewer job skills and/or more anxiety about math. And they may transmit their anxiety to their children more readily precisely because they are home with them for longer (including homework hours). Without adjusting for these different characteristics, we might end up concluding that parents, generally, have more anxiety about math than they actually do, based on inadvertently oversampling parents with higher anxiety levels.
  • What did the data actually measure? Be sure to look into details. For example, a multiple-choice test cannot distinguish among students who have no clue how to solve a question and students who have a good idea how to solve it but make a lot of careless errors.
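To make the participant-selection question above concrete, here is a minimal sketch of how oversampling one group can inflate an overall estimate, and how reweighting each group to its share of the population corrects it. Every figure is hypothetical: the group shares, the 0-to-10 anxiety scale, and the group averages are invented for illustration, and the sketch assumes the volunteers in each group are typical of that group.

```python
# All values below are hypothetical, chosen only to illustrate the arithmetic.

# Share of each group among all parents (assumed).
population_share = {"stay_at_home": 0.30, "works_outside_home": 0.70}

# Share of each group among the parents who volunteered for the study (assumed).
sample_share = {"stay_at_home": 0.60, "works_outside_home": 0.40}

# Average math-anxiety score measured in each group, on an assumed 0-10 scale.
group_mean_anxiety = {"stay_at_home": 6.0, "works_outside_home": 4.0}

def weighted_mean(shares, means):
    """Average of the group means, weighted by each group's share."""
    return sum(shares[group] * means[group] for group in shares)

# Naive estimate: treat the volunteer sample as if it mirrored all parents.
naive_estimate = weighted_mean(sample_share, group_mean_anxiety)

# Adjusted estimate: weight each group by its share of the population instead.
adjusted_estimate = weighted_mean(population_share, group_mean_anxiety)

print(f"Naive estimate from the volunteer sample: {naive_estimate:.1f}")
print(f"Estimate reweighted to population shares: {adjusted_estimate:.1f}")
```

Under these assumptions the naive estimate is 5.2 and the reweighted estimate is 4.6. The gap comes entirely from who signed up, not from any difference in how anxious parents actually are.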

Study design plays an important role in evaluating study conclusions


Observational studies are designed to observe what happens, without the researchers introducing any changes for the purpose of the study. A comparison of graduation rates in two different schools, for example, says little about what causes these differences to occur, even if there may be reasons conjectured by the study authors.
Experimental studies are designed to answer a research question and compare two groups put into different (experimental) conditions. For example, two teachers sign up their (generally similar) classrooms to take part in a study. One teacher is asked to show inspirational videos before math class and the other teacher is asked to show neutral videos before math class. The final exam scores are then measured and compared. In this circumstance, barring other factors coming into play, such as differences among the students, teacher quality, etc., a difference in test scores can be attributed to the difference in “treatment” between the two groups. If the scores in the inspirational-video group went up more than those in the neutral-video group, we can infer that watching inspirational videos may have caused the improvement in performance on the final exam.
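One way to think about the comparison in the video example is to ask how often a gap this large would appear by chance alone. The sketch below is not taken from any study. It uses invented score gains and a permutation test, a technique swapped in here for illustration: it shuffles the group labels many times and counts how often the shuffled gap is at least as large as the observed one. It assumes students can be treated as if they were individually assigned to a group, which is not strictly true when whole classrooms are assigned.

```python
import random

# Hypothetical score gains (final exam minus an earlier test), invented for illustration.
inspirational_gains = [8, 5, 7, 6, 9, 4, 7, 6]
neutral_gains = [5, 4, 6, 3, 5, 4, 6, 3]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Share of random relabelings whose gap in means is at least the observed gap."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed_gap = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if mean(shuffled_a) - mean(shuffled_b) >= observed_gap:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles

observed_gap = mean(inspirational_gains) - mean(neutral_gains)
p_value = permutation_p_value(inspirational_gains, neutral_gains)

print(f"Observed gap in average gain: {observed_gap:.1f} points")
print(f"Share of shuffles with a gap at least that large: {p_value:.4f}")
```

A small share suggests the gap is unlikely to be a fluke of which students landed in which group. It does not rule out the other factors named above, such as differences in teacher quality between the two classrooms.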