About

“Missingness” is a joint project between the Annals of Internal Medicine and Sense About Science USA. Our goal is to help clinical researchers better understand the importance of statistical concepts to experimental success, and to encourage physicians to collaborate with statisticians on study design and analysis.

But we also had another goal which frames this project: to investigate how these critical statistical concepts might be better explained through graphic design. This is where Accurat came in. We don’t just want “Missingness” to help researchers deal with missing data; we want to trigger an evolution in how statistical concepts are communicated to researchers.

With that in mind, the ascent to understanding requires many basecamps: We hope you will contribute ideas and insights as to how we can proceed to the next level.

Credits

For Sense About Science USA: Trevor Butterworth, Rebecca Goldin PhD

For Annals: Catharine Stack PhD

With thanks to: Michael Lavine PhD, Patrick McKnight PhD, Neda Afsarmanesh, Annals of Internal Medicine senior clinical and statistical editors

Sources

1 First Case Study: G. Bronfort, M.A. Hondras, C.A. Schulz, R.L. Evans, C.R. Long, and R. Grimm. Spinal Manipulation and Home Exercise With Advice for Subacute and Chronic Back-Related Leg Pain. A Trial With Adaptive Allocation. Ann Intern Med. 2014;161:381-391.

2 Second Case Study: G. Tomasson, C. Peloquin, A. Mohammad, T.J. Love, Y. Zhang, H.K. Choi, and P.A. Merkel. Risk for Cardiovascular Disease Early and Late After a Diagnosis of Giant-Cell Arteritis. A Cohort Study. Ann Intern Med. 2014;160:73-80.


National Research Council. (2010). The Prevention and Treatment of Missing Data in Clinical Trials. Panel on Handling Missing Data in Clinical Trials. Committee on National Statistics, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.

Little, RJ, D’Agostino, R, Cohen, ML, et al. The Prevention and Treatment of Missing Data in Clinical Trials. NEJM. 2012; 367:1355-60.

Li T, Hutfless S, Scharfstein DO, et al. Standards should be applied in the prevention and handling of missing data for patient-centered outcomes research: a systematic review and expert consensus. J Clin Epi. 2014; 67:15-32.

Annals of Internal Medicine Information for Authors – General Statistical Guidance, http://annals.org/aim/pages/AuthorInformationStatisticsOnly

Bell ML and Fairclough DL. Practical and statistical issues in missing data for longitudinal patient-reported outcomes. Stat Methods Med Res. 2014; 23:440-59.

Bell ML, Fiero M, Horton NJ, et al. Handling missing data in RCTs; a review of the top medical journals. BMC Med Res Methodology. 2014; 14:118.

Carpenter JR, Kenward MG. Missing data in randomized controlled trials – a practical guide. Publication RM03/JH17/MK. Birmingham, United Kingdom: National Institute for Health Research; 2008.

Dziura JD, Post LA, Zhao Q, et al. Strategies for Dealing with Missing Data in Clinical Trials: From Design to Analysis. Yale Journal of Biology and Medicine. 2013; 86:343-358.

  1. Intro
  2. Recognizing missing data in your study
  3. Identifying the statistical mechanisms behind missing data
  4. Analysis Methods and Sensitivity Analysis
  5. First case study
  6. Second case study
  7. Before/during data collection
“Missing data is a problem that you’re going to have to address in almost all aspects of research.”
Daryl Thornton MD, MPH, physician researcher at Case Western Reserve University and co-director of the Case Western Reserve University Center for Reducing Health Disparities

1.1 Intro

In clinical research, missing data are often inevitable. If you don’t want missing data to undermine your findings, learning how to handle missing data is critical. Here’s the rub: at the root of many biomedical studies is a need for a valid estimate of the relationship between an exposure and outcome. Your goal is to get an estimate that is unbiased (close to the ‘truth’), and one that has the correct variance (meaning confidence bounds that appropriately reflect the extent of uncertainty, and p-values that are valid).

When there are missing data, the choices you make about the analysis can have a big impact on whether or not the final estimate is biased and whether or not its variance is estimated correctly. In cases of dropout, the only approach is to perform a sensitivity analysis to bound the degree of potential bias.

“You’re likely to have dropouts, and because of dropouts you’ll probably have missing data.”
Daryl Thornton MD, MPH, physician researcher at Case Western Reserve University and co-director of the Case Western Reserve University Center for Reducing Health Disparities

1.2 Intro

While the best approach is to design your study to minimize the risk of missing data, our goal here is to show you how to recognize missingness when it occurs, what to do about it — and when to talk to a statistician.

This last point is critical: some of the commonly used techniques to deal with missing data do more harm than good and should be avoided.

[Figure: grid of study participants (rows) by baseline covariates (columns): Age, Gender, Body Weight, Total Cholesterol, Smoking, Diabetes, Statin Use]

2.1 Recognizing missing data in your study

Let’s start by visualizing a study population with baseline data. Each row is a participant and each column is a baseline covariate (age, sex, history of disease, etc.).

Each vertical bar represents observed data items and diagonal red bars represent missing items. In an ideal world, your study would look like this — no missing baseline data. That is, during the baseline assessment you were able to gather data from every participant on every covariate.

2.2 Recognizing missing data in your study

In reality, this will probably be closer to what you see.

Baseline covariate data may be missing due to human or machine error (someone forgot to measure something, or couldn’t measure something), or because patients refused or forgot to answer a questionnaire item, or because the patient missed an appointment.

In studies using more than one data source, the reasons for missing may vary. Similarly, for multi-center studies, some sites or physicians may not collect all the necessary variables, or they may be missing from some patients’ charts.

[Figure: the participant grid extended with columns for visits 1-11; all data complete]

2.3 Recognizing missing data in your study

Now let’s look at what happens over the time course of the study. The additional columns represent the data collected at visits over time.

First, the ideal situation — you have no missing data — every participant provides complete data for all measurements at every follow-up visit.

[Figure: the same grid with intermittent missing data]

2.4 Recognizing missing data in your study

Unfortunately, this is unlikely to happen. Some participants will miss visits altogether, or some data elements will fail to be collected during some visits. When this happens, your data might look like this...

[Figure: the same grid with data missing due to dropout]

2.5 Recognizing missing data in your study

Some patients also drop out completely, perhaps because they are lost to follow-up, have an adverse event, lose interest in the study after their condition worsens or improves, or die. When this happens, data are missing from that point through the end of the study. Now we see the impact of dropout.
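The distinction between intermittent missing data and dropout can be made concrete with a small sketch. It assumes each participant's follow-up is stored as a list with one entry per scheduled visit and `None` for a missing measurement; that layout, the function name, and the labels are illustrative choices, not something taken from the project.

```python
from typing import Optional, Sequence


def classify_pattern(visits: Sequence[Optional[float]]) -> str:
    """Label one participant's follow-up record.

    `visits` holds one value per scheduled visit, with None marking a
    missing measurement (a convention chosen for this sketch).
    """
    observed = [v is not None for v in visits]
    if all(observed):
        return "complete"
    if not any(observed):
        return "no follow-up data"
    # Index of the last visit at which anything was measured.
    last_seen = max(i for i, seen in enumerate(observed) if seen)
    gaps_before_last = not all(observed[:last_seen])
    missing_tail = last_seen < len(observed) - 1
    if gaps_before_last and missing_tail:
        return "intermittent, then dropout"
    if missing_tail:
        return "dropout"
    return "intermittent"


print(classify_pattern([5.1, 4.8, 4.9]))   # every visit observed
print(classify_pattern([5.1, None, 4.9]))  # a gap, but the last visit is observed
print(classify_pattern([5.1, 4.8, None]))  # nothing observed after visit 2
```

A missing tail only suggests dropout. Whether the participant truly left the study, and why, still has to come from the study records.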

[Figure: the three mechanisms: 1. Missing Completely at Random; 2. Missing at Random; 3. Missing Not at Random]

3.1 Identifying the statistical mechanisms behind missing data

Now that we've become familiar with some of the reasons why data become missing, let’s talk about the mechanisms behind missing data.

There are three kinds of mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

[Figure: grid of study participants (rows) by Age, Gender, Total Cholesterol, and Final Outcome (columns)]

3.2 Identifying the statistical mechanisms behind missing data

Missing Completely at Random

This refers to data missing by chance for reasons that are unrelated to any observed or unobserved variables. Examples include data lost from transient equipment failure, a dropped lab sample, or a patient who moved out of state.

The important thing to remember is that this kind of missing data forms a random subset of all the data.
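A short simulation can illustrate the "random subset" idea. Everything in it is hypothetical: the cholesterol values, the 30% loss rate, and the seed are assumptions made for the sketch.

```python
import random
from statistics import mean

rng = random.Random(2014)  # fixed seed so the sketch is reproducible

# Hypothetical total cholesterol values (mg/dL) for 20,000 participants.
cholesterol = [rng.gauss(200, 30) for _ in range(20_000)]

# MCAR: every value has the same 30% chance of being lost,
# regardless of the participant or of the value itself.
observed = [x for x in cholesterol if rng.random() >= 0.30]

full_mean = mean(cholesterol)
observed_mean = mean(observed)
print(f"all participants: {full_mean:.1f}")
print(f"observed only:    {observed_mean:.1f}")
```

Because the loss is unrelated to anything about the participants, the two means should differ only by sampling noise. The cost of MCAR is lost precision rather than bias.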

[Figure: grid of study participants (rows) by Age, Gender, Total Cholesterol, and Final Outcome (columns)]

3.3 Identifying the statistical mechanisms behind missing data

Missing at Random

This refers to missing data that are conditional on (i.e., related to) some observed variable in your study — such as the age or sex of the patient. Perhaps young patients are more likely than older patients to miss visits or drop out, or men may be less likely than women to answer all questions in a questionnaire.

The important thing to remember is that a missing response is not missing due to the answer itself. In other words, in the case of the questionnaire, the answer to the question is not missing for a specific reason related to the response.

[Figure: the same grid (Age, Gender, Total Cholesterol, Final Outcome) split into two smaller grids: Males and Females]

3.4 Identifying the statistical mechanisms behind missing data

Missing at Random

MAR, and the conditioning or stratification it involves, can be visualized by breaking our data grid into smaller grids based upon age and sex (if data are MAR based upon these factors).

In this example, we see that missingness is related to sex. Men are more likely to have missing data than women. But, once you condition on sex (in other words look at a grid stratified by sex), the probability of missing data is the same for each final outcome. Data that are MAR might also be related to responses in previous time periods where responses are observed.
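The effect of conditioning on sex can be sketched with simulated data. The outcome means, the missingness rates, and the seed below are all hypothetical, chosen only so that men both score higher and are missing more often.

```python
import random
from statistics import mean

rng = random.Random(7)

# Hypothetical cohort: the outcome differs by sex, and men are more likely
# to have it missing. Within each sex, missingness ignores the outcome (MAR).
MEAN_OUTCOME = {"male": 6.0, "female": 4.0}
P_MISSING = {"male": 0.50, "female": 0.10}

people = []
for _ in range(20_000):
    sex = rng.choice(["male", "female"])
    outcome = rng.gauss(MEAN_OUTCOME[sex], 1.0)
    missing = rng.random() < P_MISSING[sex]
    people.append((sex, outcome, missing))

true_mean = mean(o for _, o, _ in people)
complete_case_mean = mean(o for _, o, m in people if not m)

# Condition on sex: average the observed outcomes within each stratum, then
# weight each stratum by its share of the full cohort.
stratified_mean = 0.0
for sex in MEAN_OUTCOME:
    stratum = [(o, m) for s, o, m in people if s == sex]
    observed = [o for o, m in stratum if not m]
    stratified_mean += mean(observed) * len(stratum) / len(people)

print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"stratified mean:    {stratified_mean:.2f}")
```

In this setup the unadjusted complete-case mean leans toward the women's values because more of the men are missing, while the stratified estimate should land close to the true mean. That is the practical meaning of "once you condition on sex."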

[Figure: grid of study participants (rows) by Age, Gender, Total Cholesterol, and Final Outcome (columns)]

3.5 Identifying the statistical mechanisms behind missing data

Missing Not at Random

This refers to missing data where the probability that a data item is missing depends on the unobserved value itself. Examples of MNAR are patients dropping out of a study because they are too ill to continue, or patients dropping out of diet studies because they are not losing much weight.

In studies where data are collected by questionnaire, an example of MNAR is if more affluent participants were less likely to answer questions about income than less affluent participants.
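The diet-study example can be sketched as a simulation in which the chance of dropping out depends on the very value that goes unrecorded. The dropout rule, the weight-loss distribution, and the seed are assumptions made for illustration.

```python
import math
import random
from statistics import mean

rng = random.Random(11)


def dropout_probability(weight_loss_kg: float) -> float:
    """Hypothetical MNAR rule: the less weight lost, the likelier the dropout."""
    return 1.0 / (1.0 + math.exp(weight_loss_kg - 3.0))


# Hypothetical weight loss (kg) for 20,000 participants in a diet study.
weight_loss = [rng.gauss(3.0, 2.0) for _ in range(20_000)]

# A participant's value is recorded only if they do not drop out.
observed = [w for w in weight_loss if rng.random() >= dropout_probability(w)]

true_mean = mean(weight_loss)
observed_mean = mean(observed)
print(f"true mean weight loss:     {true_mean:.2f} kg")
print(f"mean among those observed: {observed_mean:.2f} kg")
```

The observed mean overstates the true weight loss. Because the driver of missingness is the unobserved value itself, the observed data alone cannot reveal how large the distortion is.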

“Managing missing data can be a very complex statistical problem. It’s not a simple matter of hitting a button in your favorite stats program to fix it.”
Daryl Thornton MD, MPH, physician researcher at Case Western Reserve University and co-director of the Case Western Reserve University Center for Reducing Health Disparities

4.1 Analysis Methods and Sensitivity Analysis

In practice, you will not know for certain whether the missing data in your study are MCAR, MAR, or MNAR. Different analysis models make different assumptions about the missing data mechanism. If you use an analysis approach that assumes the missing data are MCAR or MAR and that assumption is wrong, results may be biased (estimates too high or too low) or you may draw a conclusion with too much certainty. This is why it is crucial that you work with a statistician to do the following:

  • Assess the possible mechanisms for the missing data
  • Determine the assumptions about the missing data patterns under which a primary analysis should be conducted
  • Determine alternative assumptions under which sensitivity analyses should be conducted

[Figure: grid of study participants (rows) by baseline covariates (columns): Age, Gender, Body Weight, Total Cholesterol, Smoking, Diabetes, Statin Use]

4.2 Analysis Methods and Sensitivity Analysis

One of the simplest methods for handling missing data is to carry out a complete case analysis, where you simply exclude individuals with missing data. Such an analysis assumes the missing data are MCAR. If that is incorrect, results may be biased.

Suppose you are analyzing data from a trial with attrition and you know the sicker patients (greater disease severity at the start of the trial) are more likely to drop out. In such a setting, it would be incorrect to use a complete case analysis because data are not MCAR.

“Sensitivity analyses should be part of the primary reporting of findings from clinical trials. Examining sensitivity to the assumptions about the missing data mechanism should be a mandatory component of reporting.”
The Prevention and Treatment of Missing Data in Clinical Trials. Courtesy of the National Research Council

4.3 Analysis Methods and Sensitivity Analysis

It is also important to consider the extent of missing data. For example, if there are missing outcomes in a small number (<5%) of participants, then the choice of analysis method may not impact results. It's difficult to make a general statement about how much missing data is too much and when a specific data set cannot support definitive conclusions.

For example, having modest amounts of missing data under MNAR can make results more uncertain than having larger amounts of missing data under MCAR. Sensitivity analyses can help you assess how robust conclusions are for a given data set with missing outcomes.

[Figure: grid of study participants (rows) by baseline covariates (columns): Age, Gender, Body Weight, Total Cholesterol, Smoking, Diabetes, Statin Use]

4.4 Analysis Methods and Sensitivity Analysis

In general, the MCAR assumption rarely holds for missing data from clinical studies, and especially not when data are missing due to dropout. More commonly, the missing data are either MAR or MNAR and require analysis methods such as maximum likelihood, multiple imputation, Bayesian methods, and methods based upon generalized estimating equations.

Again, you should work with a statistician to carry out the primary analysis based upon a primary set of assumptions about the missing data and with additional sensitivity analyses under alternate assumptions.

[Figure: outcome data at 12 weeks and 52 weeks for each participant; participants 001-096 in the Spinal Manipulation with Home Exercise group and 097-192 in the Home Exercise Alone group]

5.1 First Case Study 1

What question did the researchers want to answer and how did they design the study?

Researchers wanted to determine whether adding spinal manipulation to home exercise and advice would reduce short- and long-term back-related leg pain. They designed a randomized controlled trial and randomly assigned 192 patients to either spinal manipulation with home exercises or home exercise alone.

The interventions lasted 12 weeks, and the researchers planned to measure pain, disability, medication use, and other outcomes at both 12 weeks (short-term) and 52 weeks (long-term).

5.2 First Case Study 1

What happened that left them with missing data?

One participant in the home exercise group was lost to follow-up and did not provide outcome data at 12 weeks. Thirteen participants (8 in the home exercise group and 5 in the combined therapy group) were lost to follow-up and did not provide outcome data at 52 weeks.

Most of these individuals dropped out between their week 12 and 26 visits. Reasons for dropout were unclear.

5.3 First Case Study 1

What did they do to address the problem?

The researchers analyzed the data using a mixed-effects regression model, a method that utilizes all observed data and does not exclude patients with incomplete follow-up. The model provides valid results as long as missing data follow a MAR framework.

They also conducted a number of sensitivity analyses where they assumed patients with less leg pain were more likely to drop out than those with more pain (i.e., they assumed the missing data were MNAR). In these sensitivity analyses, they used multiple imputation to generate an imputed pain score (primary outcome) and then subtracted a range of small but plausible amounts (0.1, 0.2, 0.3, and 0.4 point) from that score (11-point pain score). They found that, across the spectrum of these assumptions, results were similar to those from the primary analysis, suggesting that the results were robust to a set of MNAR assumptions.
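The subtract-a-small-amount step can be sketched as follows. This is a simplified illustration, not the trial's actual procedure: the scores are invented, the imputed values are stand-ins for draws made under MAR, the pain scale is assumed to run from 0 to 10, and the clipping to that range is a choice made for this sketch. In a real analysis the step would be repeated across every imputed data set and the results pooled.

```python
from statistics import mean


def delta_adjusted_difference(observed_a, imputed_a, observed_b, imputed_b, delta):
    """Difference in mean pain (arm A minus arm B) after shifting imputed scores.

    Each imputed score is lowered by `delta` and clipped to the assumed 0-10
    range, encoding the assumption that dropouts had less pain than a MAR
    imputation predicts.
    """
    def shift(scores):
        return [min(10.0, max(0.0, s - delta)) for s in scores]

    arm_a = list(observed_a) + shift(imputed_a)
    arm_b = list(observed_b) + shift(imputed_b)
    return mean(arm_a) - mean(arm_b)


# Hypothetical 52-week pain scores; these are not the trial's data.
combined_observed = [2.0, 3.0, 1.5, 2.5, 3.5]
combined_imputed = [2.8, 3.1]  # stand-ins for values drawn under MAR
exercise_observed = [3.0, 4.0, 2.5, 3.5, 4.5]
exercise_imputed = [3.6, 3.9, 4.2]

results = {
    delta: delta_adjusted_difference(
        combined_observed, combined_imputed,
        exercise_observed, exercise_imputed, delta,
    )
    for delta in (0.0, 0.1, 0.2, 0.3, 0.4)
}
for delta, diff in results.items():
    print(f"delta={delta:.1f}  difference={diff:+.3f}")
```

If the estimated difference keeps the same sign and a similar size across the range of shifts, the conclusion does not hinge on the MAR assumption, at least for departures of this kind.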

5.4 First Case Study 1

How could they have gone wrong?

A much weaker approach to handling these missing data would have been to exclude the participants with 12-week or 52-week outcomes missing from the analysis. This kind of complete-case analysis assumes that data are MCAR. In this situation, the MCAR assumption is unlikely, because the reasons participants drop out are typically related to their disease characteristics and outcome measurements. Therefore, results from a complete-case analysis are likely to be biased.

[Figure: outcome data at 12 weeks and 52 weeks for each participant; participants 001-096 in the Spinal Manipulation with Home Exercise group and 097-192 in the Home Exercise Alone group]

5.5 First Case Study 1

How else could they have gone wrong?

Another tempting but flawed approach would have been to “fill in” missing 52-week pain scores with a single value (such as a patient’s pain score at baseline or at their last visit before dropping out, or the mean pain scores of patients who did not drop out). These are examples of “single-imputation” approaches. They make the strong assumption that the 52-week pain scores are known with certainty in participants who dropped out. These methods can lead to bias in either direction and typically understate variance, leading to p-values that are too small and confidence intervals that are too narrow.
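A toy calculation shows why filling in a single value understates variance. The pain scores and the number of dropouts are hypothetical, and mean imputation stands in for the single-imputation family as a whole.

```python
from statistics import mean, stdev

# Hypothetical 52-week pain scores for participants who completed the study.
observed = [1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0, 6.0]
n_dropouts = 4  # participants with no 52-week score

# Single (mean) imputation: every dropout is given the completers' mean.
filled = observed + [mean(observed)] * n_dropouts


def standard_error(values):
    """Standard error of the mean, treating every value as a real measurement."""
    return stdev(values) / len(values) ** 0.5


se_observed = standard_error(observed)
se_filled = standard_error(filled)
print(f"completers only: mean={mean(observed):.2f}  SE={se_observed:.3f}")
print(f"after filling:   mean={mean(filled):.2f}  SE={se_filled:.3f}")
```

The mean does not move, but the standard error shrinks even though no new information was collected. The filled-in values add apparent sample size and no spread, which is how the p-values end up too small.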

[Figure: patients with Giant-cell Arteritis and without Giant-cell Arteritis; availability of BMI, Smoking Status, and Total Cholesterol data; patients with complete risk data]

6.1 Second Case Study 2

What question did the researchers want to answer and how did they design the study?

The researchers wanted to determine whether, compared to patients without the disease, patients with giant-cell arteritis (GCA) have an increased risk for heart attacks, strokes, and peripheral artery disease (blockages of arteries providing blood to other parts of the body, such as the legs). GCA is a disease involving inflammation of blood vessels that can lead to blindness if it is not treated. Some studies have suggested that patients with GCA may be at increased risk for other blood vessel diseases (such as heart attacks, strokes, peripheral artery disease).

Information on the cardiovascular risk factors needed for the analysis was obtained from The Health Improvement Network (THIN), an electronic database derived from general practices in the United Kingdom. Data on diagnoses, prescription medications, height, weight, smoking status, vaccinations, and other variables are entered into the THIN database by primary care physicians during clinical visits. For this study, investigators needed data on baseline BMI, smoking status, and total cholesterol level to be able to adjust for these important confounders.

6.2 Second Case Study 2

What happened that left them with missing data?

Unfortunately, about 50% of the study patients in the THIN database had missing data on cholesterol level. Data on smoking were also missing for 16% of the study group. More data were missing for the group of participants who did not have GCA, providing evidence that a MCAR assumption is not plausible. In the end, just 43% of GCA patients and 37% of participants without GCA had complete cardiovascular risk factor data.
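Comparing missingness rates between groups, as described here, amounts to a simple tally. The sketch below uses four invented records and hypothetical field names (`gca`, `bmi`, `smoker`, `cholesterol`). It does not reproduce the THIN data or the study's percentages.

```python
from collections import defaultdict


def percent_missing(records, variables, group_key):
    """Percent of records with a missing (None) value, by group and variable."""
    totals = defaultdict(int)
    missing = defaultdict(lambda: defaultdict(int))
    for record in records:
        group = record[group_key]
        totals[group] += 1
        for var in variables:
            if record.get(var) is None:
                missing[group][var] += 1
    return {
        group: {var: 100.0 * missing[group][var] / totals[group] for var in variables}
        for group in totals
    }


# Hypothetical records; None marks a value that was never entered.
records = [
    {"gca": True, "bmi": 27.1, "smoker": False, "cholesterol": None},
    {"gca": True, "bmi": None, "smoker": True, "cholesterol": 5.2},
    {"gca": False, "bmi": 24.3, "smoker": None, "cholesterol": None},
    {"gca": False, "bmi": 30.0, "smoker": False, "cholesterol": None},
]

summary = percent_missing(records, ["bmi", "smoker", "cholesterol"], "gca")
for group, rates in summary.items():
    print(f"GCA={group}: {rates}")
```

Clearly different rates between groups are evidence against MCAR. Similar rates do not prove MCAR, since missingness could still depend on something that was not tabulated.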

6.3 Second Case Study 2

What did they do to address the problem?

Researchers first determined that assuming the missing data were MCAR was not reasonable, since rates of missing data differed between those with and without GCA. They then addressed the missing data by using multiple imputation under a MAR assumption and included a number of patient characteristics (age, sex, GCA status, smoking status, hypertension, diabetes, BMI) and outcomes in their imputation model.

In other words, they imputed (or estimated) a value for every missing item using all of the available observed data, and they repeated this imputation 50 times, resulting in 50 unique imputed data sets. Next, they performed their analysis separately with each imputed data set and combined the 50 results to make variance estimates that appropriately accounted for the uncertainty of the missing data. (Note: Carrying this process out only once would have been equivalent to using a single imputation approach, which would have underestimated the variance.)

Given the extent of missing data, these researchers were also cautious in their interpretation of the results and mentioned the missing data within the abstract. They could have strengthened their report by including results from additional sensitivity analyses that made alternative (MNAR) assumptions about the missing data.
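The combining step can be sketched with Rubin's rules, the usual way of pooling results across imputed data sets. The text above does not name the rule the researchers used, so treating it as Rubin's rules is an assumption, and the three estimates below are invented (a real analysis here would pool 50).

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-data-set estimates and their squared standard errors.

    Returns the pooled estimate and its total variance:
    within-imputation variance + (1 + 1/m) * between-imputation variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # spread of the estimates across data sets
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates (for example, log hazard ratios) from 3 imputed data sets.
estimates = [1.8, 2.0, 2.2]
variances = [0.04, 0.05, 0.06]

pooled, total = pool_rubin(estimates, variances)
print(f"pooled estimate: {pooled:.2f}")
print(f"total variance:  {total:.4f} (within-imputation average: {mean(variances):.4f})")
```

The between-imputation term is what a single imputation cannot supply. With only one filled-in data set there is no spread across imputations to measure, so the uncertainty due to the missing values drops out of the variance.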

6.4 Second Case Study 2

How could they have gone wrong?

If the researchers had instead used the common, yet flawed, missing-indicator method, results would have been subject to bias. Under the missing-indicator method, a dummy (1/0) or indicator variable is included in the analyses to denote individuals with missing values. When this ad hoc approach is used, individuals with missing values can be included within the analysis, but results can be biased, especially in nonrandomized studies such as this one.

[Figure: patients with Giant-Cell Arteritis and without Giant-Cell Arteritis; availability of BMI, Smoking Status, and Total Cholesterol data]

6.5 Second Case Study 2

How else could they have gone wrong?

Researchers could have also mistakenly used another single-imputation approach, mean imputation. Under mean imputation, missing items would be assumed to be equal to the mean among those participants with observed data. For example, all GCA patients with missing total cholesterol measures would be assumed to have their total cholesterol equal to the mean total cholesterol in GCA patients with observed values. Participants without GCA with missing total cholesterol values would be assumed to have a value equal to the mean among participants without GCA with observed values. Although this approach allows inclusion of all patients within the analysis, it is not optimal because it makes a strong assumption that all missing values equal a single value, which can result in biased estimates and incorrect estimates of variance.

Guidance summary points

7.1 Before/during data collection

  • Aim to design and conduct studies to minimize missing data; keep track of the reasons for missing data while conducting the study.
  • Consult a statistician.
  • Learn to recognize missing data and, if possible, hypothesize and explore missing data mechanisms consistent with the reasons the data might be missing.
  • Assess the amount of missing data, perhaps adding new people to a study or allotting additional resources to follow up, if the amount of missing data could compromise the replicability of the results.


After data collection

  • Summarize missing data; consider the timing of and reasons for missing data; think about plausible mechanisms for the missing data.
  • Consult a statistician.
  • Use an analysis method that makes reasonable/plausible assumptions about the missing data. Generally, this will mean conducting the main analysis under a MAR assumption (e.g., multiple imputation or likelihood-approaches).
  • Provide results from additional sensitivity analyses to explore the impact of deviations from the MAR assumption (i.e., results under MNAR or informative missing assumptions).
  • Avoid using ad hoc analysis methods to handle the missing data, such as single imputation methods or those that “fill in” values (e.g., last observation carried forward, missing indicator), or methods that exclude data from the analysis.
