Best practice to prevent missing data

Einar Martin Aandahl, MD, PhD,

CEO Ledidi

Missing data may undermine the ability to draw valid conclusions from any clinical trial, patient registry or research project. There are numerous reasons why data may be lost, unavailable or not observed. If substantial, missing values can reduce statistical power and introduce bias into a dataset regardless of study design.

Although it may be impossible to avoid missing data completely, it is critical to take steps in the design process to minimize the risk. A pre-specified plan should also be established for how to handle missing data throughout the operational phase and in the analyses.

How to reduce the risk for missing data in study design

It is important to recognize that most registries and clinical studies have some missing data. The critical part is to minimize the amount of missing data and its impact on data quality, and to have sound strategies for handling it both practically and statistically. A few key steps in the design phase help to reduce the number of missing values, or to limit the impact on the quality and integrity of the dataset once missing data occur.

Easily accessible data. In clinical registries and research projects (both experimental and observational), the parameters should be easily accessible and make sense to the clinicians, collaborators and study participants. Data that are part of the routine diagnostic or therapeutic process are much more likely to be entered into a case report form (CRF) than parameters that are unfamiliar to the study collaborators or that are hard to find, e.g. not present in electronic health records or health registries.
Only collect the information that is absolutely essential for your study. It is easy to be expansive in the design phase and include too many parameters. Keep a critical eye on the amount of demographic data and the number of diagnostic, therapeutic and outcome variables. The focus should be on what is needed to maximize the value of the analyses, not on collecting as much data as possible without a clear purpose. The more variables there are, the higher the risk of missing data.
Mandatory fields. Mandatory fields are an efficient tool for avoiding missing values. However, they should only be used for key variables and for data that are available for every entry, because they block the user from proceeding with data registration when the data are not available. If respondents are forced to answer when they do not have the information, the result may be an apparently complete dataset with poor data quality and an end result that is not informative.
Data validation. Validation during registration helps to avoid misspellings and out-of-range values caused by typing errors. This reduces the risk of corrupting the analyses and the need for tedious data cleaning.
Include an option for “not applicable”. Including an option for “not applicable” helps to distinguish between truly missing data and data that have not been entered because they are not relevant. One step further is to also include an option to specify the reason why the parameter is not applicable, which may guide the data analyses later on.
Make the CRF easy to read and understand. Keep all text in the CRF short, simple and unambiguous.
Run a pilot. It is good advice to pilot test the CRF with your collaborators or other parties that are going to participate. This may uncover issues regarding the number of variables, the structure of the forms and data access. Eliminate vague or ambiguous language. Clarity is crucial.
Choose external data sources carefully. Historic data, patient registries and even electronic health records can have incomplete data sets, inconsistent use of nomenclature, non-standardized follow-up schedules and inconsistent use of scoring systems. It is important that the key variables fundamental for primary objectives of the study do not depend on data sources with questionable data quality hampered by missing values or invalid data.
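Several of the design steps above (mandatory fields, data validation and a “not applicable” option) can be combined in a single entry check. The following is a minimal sketch in Python, not a description of any particular CRF system: the names `FieldSpec`, `validate_entry` and `NOT_APPLICABLE`, the example variables and their ranges are all hypothetical. It also assumes that mandatory fields require an actual value, while optional fields may be left blank or marked as not applicable.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Hypothetical code for an explicit "not applicable" choice,
# kept distinct from a field that was simply left blank.
NOT_APPLICABLE = "NA"


@dataclass
class FieldSpec:
    """Hypothetical description of one numeric CRF variable."""
    name: str
    mandatory: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def validate_entry(entry: dict, specs: list) -> list:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    for spec in specs:
        value = entry.get(spec.name)

        # Blank field: only a problem when the field is mandatory.
        if value is None or value == "":
            if spec.mandatory:
                problems.append(f"{spec.name}: mandatory field is empty")
            continue

        # Explicit "not applicable": accepted for optional fields only
        # (an assumption made for this sketch).
        if value == NOT_APPLICABLE:
            if spec.mandatory:
                problems.append(f"{spec.name}: mandatory field cannot be 'not applicable'")
            continue

        # Range validation catches typing errors such as 450 instead of 45.
        try:
            number = float(value)
        except (TypeError, ValueError):
            problems.append(f"{spec.name}: expected a number, got {value!r}")
            continue
        if math.isnan(number):
            problems.append(f"{spec.name}: expected a number, got {value!r}")
            continue
        if spec.min_value is not None and number < spec.min_value:
            problems.append(f"{spec.name}: {number} is below {spec.min_value}")
        if spec.max_value is not None and number > spec.max_value:
            problems.append(f"{spec.name}: {number} is above {spec.max_value}")
    return problems


# Example with made-up variables and ranges.
specs = [
    FieldSpec("age", mandatory=True, min_value=0, max_value=120),
    FieldSpec("hba1c", min_value=2, max_value=20),
]
for problem in validate_entry({"age": "450", "hba1c": NOT_APPLICABLE}, specs):
    print(problem)
```

Keeping the list of mandatory fields short in such a specification reflects the point made above: a field that blocks registration should only be one whose value is available for every entry.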

How to reduce the risk for missing data during the operational phase

A data management plan should include both manual reviews and automated mechanisms.

Ongoing or periodic manual reviews may identify issues in the study design that may be alleviated at an early stage, prompt improvement of the study guidelines or help sites overcome challenges in data collection and use of the CRFs.
Automated mechanisms may include improving the workflow with staging of data entries by assigning statuses, defining roles and responsibilities with user privileges and setting up alerts triggered by deviations in the workflow, missing data or other irregularities.
Regular meetings in the study group to keep everybody well informed and motivated are also a key success factor. If the protocol allows it, include a presentation of interim results.
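One of the automated mechanisms mentioned above, an alert triggered by missing data, could be sketched as a periodic check of the missingness rate per variable. This is only an illustration in Python under stated assumptions: the function names, the `"NA"` code for “not applicable”, the example records and the 10% default threshold are hypothetical and should be replaced by whatever the study's data management plan specifies.

```python
# Hypothetical code for an explicit "not applicable" answer.
NOT_APPLICABLE = "NA"


def missing_rates(records: list, variables: list) -> dict:
    """Share of missing values per variable, ignoring 'not applicable' entries."""
    rates = {}
    for var in variables:
        # Entries marked "not applicable" are not counted as missing.
        applicable = [r for r in records if r.get(var) != NOT_APPLICABLE]
        if not applicable:
            rates[var] = 0.0
            continue
        missing = sum(1 for r in applicable if r.get(var) in (None, ""))
        rates[var] = missing / len(applicable)
    return rates


def missing_data_alerts(records: list, variables: list, threshold: float = 0.10) -> list:
    """Return one alert message per variable whose missing rate exceeds the threshold."""
    rates = missing_rates(records, variables)
    return [
        f"{var}: {rate:.0%} missing"
        for var, rate in rates.items()
        if rate > threshold
    ]


# Example with made-up records.
records = [
    {"age": 50, "bmi": None},
    {"age": 61, "bmi": 24.0},
    {"age": None, "bmi": NOT_APPLICABLE},
    {"age": 47, "bmi": ""},
]
for alert in missing_data_alerts(records, ["age", "bmi"], threshold=0.5):
    print(alert)
```

Running such a check per site as well as per variable can help the manual reviews focus on the forms and sites where data collection is struggling.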