Detecting contaminated birthdates using generalized additive models

Date: July 16, 2015
Authors: Wei Luo, Marcus Gallagher, Bill Loveday, Susan Ballantyne, Jason P Connor, and Janet Wiles
Background: Erroneous patient birthdates are common in health databases. Detection of these errors usually
involves manual verification, which can be resource intensive and impractical. By identifying a frequent
manifestation of birthdate errors, this paper presents a principled and statistically driven procedure to identify
erroneous patient birthdates.
Results: Generalized additive models (GAM) enabled explicit incorporation of known demographic trends and birth
patterns. With false positive rates controlled, the method identified birthdate contamination with high accuracy.
In the health data set used, of the 58 incorrect birthdates manually identified by the domain expert, the
GAM-based method identified 51, with 8 false positives. This corresponds to a positive predictive value of
86.4% (51/59) and a false negative rate of 12.1% (7/58). These results outperformed linear time-series models.
Conclusions: The GAM-based method identifies systemic birthdate errors, a common data quality issue in both
clinical and administrative databases, with high accuracy.
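
The detection rates quoted in the Results follow directly from the reported counts (58 expert-confirmed incorrect birthdates, 51 of them flagged, 8 false positives). A minimal sketch of that arithmetic is given here; the function name `detection_metrics` and its parameter names are illustrative assumptions, not part of the paper.

```python
def detection_metrics(true_positives, false_positives, false_negatives):
    """Return basic detection rates from confusion-matrix counts.

    Hypothetical helper for illustration; the names are not from the paper.
    """
    flagged = true_positives + false_positives   # all records the method flagged
    actual = true_positives + false_negatives    # all truly incorrect records
    return {
        "ppv": true_positives / flagged,         # positive predictive value
        "fnr": false_negatives / actual,         # false negative rate
        "sensitivity": true_positives / actual,  # equals 1 - fnr
    }


# Counts as reported in the Results: 51 detected out of 58, plus 8 false positives.
metrics = detection_metrics(true_positives=51, false_positives=8,
                            false_negatives=58 - 51)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

The false negative count is derived as 58 - 51 rather than entered by hand, so the three inputs stay consistent with the totals given in the text.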

BMC Bioinformatics 2014, 15:185
Published: 12 June 2014