3 of the Most Common and Offensive Healthcare Data Quality Issues

As data scientists and researchers, we rely heavily on accurate and reliable data to derive meaningful insights and make informed decisions. However, actually achieving clean healthcare data is challenging due to the variety of sources (EMRs, registries, sensors, etc.) and the chain of custody as data moves from its origin to the ultimate analyst. As a result, we see many varieties of data quality problems. Sometimes these can be hard to spot in the context of a complex dataset with many tables, fields, and patients. In other cases, the problems are so glaring that it feels offensive that they have not been remedied.

In this series, we will describe three commonly encountered offensive data quality issues and discuss their implications. We use the word offensive in jest, but when you consider the stakes of healthcare data used to conduct research that informs patient outcomes, these issues can be genuinely concerning.

Issue 1: Obvious Data Errors That Stand Out

Our first offensive data quality issue is when you can spot impossible values in a table just by looking at the first few records. These errors signify a lack of data validation and suggest that no one has ever bothered to check the data for accuracy. Such errors may include:

  1. Absurd Values: Encountering entries like a heart rate of 0 or a hemoglobin of -9 immediately raises red flags. These types of errors indicate data entry mistakes or the presence of invalid or irrelevant data.

  2. Swapped Entries: An example is lab result values appearing in the units column and units appearing in the results column, making both columns unsuitable for analysis until the entries are swapped back.

  3. Impossible Dates: Diagnosis dates in 1899 may sound silly but come up very frequently.  Analyzing patient follow-up will yield skewed results unless they are corrected.

Example:

Imagine analyzing a dataset of patient medical records and coming across numerous blood pressure readings of zero. This value is clearly erroneous and indicates data entry mistakes or equipment malfunction. Without proper remediation, using such inaccurate data can lead to incorrect diagnoses or treatment decisions.
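
As a concrete sketch, here is the kind of range check that would catch these readings, written in pandas. The table, column names, and plausibility bounds are assumptions made for illustration, not clinical thresholds:

    import pandas as pd

    # Hypothetical vitals table; the 0 values mimic the impossible
    # readings described above.
    vitals = pd.DataFrame({
        "patient_id": [1, 2, 3, 4],
        "heart_rate": [72, 0, 88, 110],
        "systolic_bp": [120, 135, 0, 142],
    })

    # Illustrative plausibility bounds (assumptions, not clinical thresholds).
    bounds = {"heart_rate": (20, 250), "systolic_bp": (50, 260)}
    for col, (low, high) in bounds.items():
        vitals[f"{col}_flag"] = ~vitals[col].between(low, high)

    # Show only the records that fail at least one check.
    print(vitals[vitals.filter(like="_flag").any(axis=1)])

Even a crude check like this, run once when data is first received, surfaces the most offensive errors before they reach an analysis.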

Issue 2: Misstructured Tables

Misstructured tables are the second offensive data quality issue. They significantly inflate the time it takes to perform even basic analysis. This issue manifests in several ways:

  1. Duplicate Records: Large numbers of duplicate records within a dataset can skew analysis results and misrepresent the true distribution of data. Identifying and removing these duplicates is essential for accurate analysis.

  2. Mixed Information: In some cases, a table column may contain multiple disparate pieces of information, leading to challenges and ambiguity during analysis. For example, a single lab data column may contain test results, units, and normal ranges, making it difficult to perform any analysis without proper data transformation (see the sketch after this list).

  3. Missing Entity Relationships: Tables may lack proper links or keys, making it difficult or impossible to combine them. This issue often arises when different sources of data are integrated without establishing the necessary connections between them.
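
Here is a minimal sketch of untangling one such mixed column, assuming a hypothetical labs table whose single result column packs the value, unit, and normal range together:

    import pandas as pd

    # Hypothetical labs table; the packed-string format is an assumption.
    labs = pd.DataFrame({
        "test": ["hemoglobin", "glucose"],
        "result": ["13.5 g/dL (12-16)", "98 mg/dL (70-100)"],
    })

    # Split the packed string into separate, analyzable columns.
    pattern = r"(?P<value>[\d.]+)\s*(?P<unit>\S+)\s*\((?P<normal_range>[^)]+)\)"
    parsed = labs["result"].str.extract(pattern)
    parsed["value"] = pd.to_numeric(parsed["value"])

    labs = pd.concat([labs.drop(columns="result"), parsed], axis=1)
    print(labs)

In practice the packing is rarely this consistent, which is exactly why mixed columns are so costly: each variant needs its own parsing rule.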

Example:

We’ve seen scenarios where a table has twenty diagnosis fields labeled diagnosis_1 to diagnosis_20, in which nearly all the entries are missing. Reformatting this table into something suitable for analysis is conceptually straightforward, but every time custom code is needed to transform data, it introduces the possibility of new errors and extends the time before analysis can start.
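
A minimal sketch of that reshape, assuming hypothetical patient_id and diagnosis_* columns, might look like this:

    import pandas as pd

    # Hypothetical wide table with sparse diagnosis_1 ... diagnosis_20
    # columns (only two shown here to keep the sketch short).
    wide = pd.DataFrame({
        "patient_id": [1, 2],
        "diagnosis_1": ["E11.9", "I10"],
        "diagnosis_2": ["I10", None],
    })

    # Melt the diagnosis_* columns into one row per non-missing diagnosis.
    diag_cols = [c for c in wide.columns if c.startswith("diagnosis_")]
    long = wide.melt(
        id_vars="patient_id",
        value_vars=diag_cols,
        var_name="slot",
        value_name="diagnosis_code",
    ).dropna(subset=["diagnosis_code"])

    print(long)

The long format drops the mostly empty cells and yields a tidy table that joins and aggregates cleanly.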

Issue 3: Customized Indicators of Missing Data

Organizations sometimes use customized indicators to represent missing data, which can introduce confusion and jeopardize the quality of analysis. Examples of such indicators include strings of repeated characters like “9999” or “XXXXXXXXXXX”. This practice adds an extra layer of complexity and can lead to erroneous or misleading results.

Example:

Suppose a dataset contains nonstandard indicators of missing data. Models built on this data may interpret those indicators as real values, leading to inaccurate results and conclusions, and to models that fail when applied to other data sources that don’t use the same missing data indicators.
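
One common remediation is to map every known sentinel to a true missing value before any modeling. A minimal sketch, assuming a hypothetical table and the sentinel values mentioned above:

    import numpy as np
    import pandas as pd

    # Hypothetical table using the sentinel values described above.
    records = pd.DataFrame({
        "age": [54, 9999, 61],
        "zip_code": ["10027", "XXXXXXXXXXX", "94110"],
    })

    # Map each organization-specific sentinel to a true missing value so
    # downstream models see NaN instead of seemingly real data.
    sentinels = {"age": [9999], "zip_code": ["XXXXXXXXXXX"]}
    for col, values in sentinels.items():
        records[col] = records[col].replace(values, np.nan)

    print(records.isna().sum())

The hard part is discovering the sentinels in the first place, which is why they need to be documented and standardized at the source.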

Conclusion

“Offensive” data quality issues can significantly impact the reliability and integrity of healthcare data. It is crucial for data scientists and researchers to be aware of these issues and take necessary steps to ensure data accuracy, completeness, and consistency. By addressing these problems, we can enhance the effectiveness of healthcare data analysis and make more informed decisions for better patient outcomes.

Data management practices such as thorough data validation, establishing proper entity relationships, and standardizing missing data indicators are essential for mitigating offensive data quality issues. By implementing robust data quality assurance processes, organizations can unlock the full potential of their data and drive meaningful insights and improvements in patient care.

Cornerstone AI is an AI assistant purpose-built to handle these (and many other) types of data quality problems in Real World Data (RWD).
