A comparative study of imputation techniques for missing values in healthcare diagnostic datasets

doi:10.1007/s41060-025-00825-9

Journal article

Open access

2025

DOI: https://doi.org/10.1007/s41060-025-00825-9

Handle: https://hdl.handle.net/10210/514884

Abstract

Missing data imputation

Healthcare datasets

Machine Learning

Missing values are a common feature of real-world datasets, particularly in healthcare data. This can be challenging when applying machine learning algorithms, as most models perform poorly in the presence of incomplete data. The goal of this study is to evaluate the performance of seven imputation techniques on three healthcare datasets: Mean Imputation, Median Imputation, Last Observation Carried Forward (LOCF), K-Nearest Neighbor (KNN) Imputation, Interpolation, MissForest, and Multiple Imputation by Chained Equations (MICE). Four levels of missing data (10%, 15%, 20%, and 25%) were introduced, and the imputation techniques were used to fill in the gaps. The methods were compared using root mean squared error (RMSE) and mean absolute error (MAE). The results indicate that MissForest imputation performed best, followed by MICE. Additionally, we examined whether feature selection should be performed before or after imputation, using recall, precision, F1-score, and accuracy as evaluation metrics. The results suggest that performing imputation before feature selection is better. Since there is limited research on the order of imputation and feature selection, and ongoing debate among researchers, we hope the findings of this study will encourage data scientists and researchers to prioritize imputation before feature selection when working with datasets containing missing values.
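
The evaluation loop described in the abstract (mask known values, impute them, score the imputed values against the originals) can be illustrated with a minimal sketch. This is not the authors' implementation: the data is a made-up synthetic series, the function names (mask_values, impute_constant, impute_locf, rmse, mae) are hypothetical, values are assumed to go missing completely at random, and only three of the seven techniques (mean, median, LOCF) are shown, using the Python standard library alone.

```python
import math
import random
import statistics


def mask_values(values, rate, seed=0):
    """Return a copy with about `rate` of the entries set to None, plus the masked indices."""
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    # Assumption: keep index 0 observed so LOCF always has a value to carry forward.
    idx = sorted(rng.sample(range(1, len(values)), n_missing))
    masked = list(values)
    for i in idx:
        masked[i] = None
    return masked, idx


def impute_constant(masked, stat):
    """Fill every gap with one statistic (e.g. mean or median) of the observed values."""
    fill = stat([v for v in masked if v is not None])
    return [fill if v is None else v for v in masked]


def impute_locf(masked):
    """Last Observation Carried Forward: fill each gap with the latest observed value."""
    out, last = [], None
    for v in masked:
        if v is not None:
            last = v
        out.append(last)
    return out


def rmse(truth, imputed, idx):
    """Root mean squared error over the masked positions only."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in idx) / len(idx))


def mae(truth, imputed, idx):
    """Mean absolute error over the masked positions only."""
    return sum(abs(truth[i] - imputed[i]) for i in idx) / len(idx)


# Synthetic, made-up series standing in for one numeric column of a dataset.
truth = [float(v) for v in range(1, 41)]

for rate in (0.10, 0.15, 0.20, 0.25):
    masked, idx = mask_values(truth, rate)
    candidates = {
        "mean": impute_constant(masked, statistics.mean),
        "median": impute_constant(masked, statistics.median),
        "locf": impute_locf(masked),
    }
    for name, imputed in candidates.items():
        print(f"{rate:.0%} {name}: RMSE={rmse(truth, imputed, idx):.3f} "
              f"MAE={mae(truth, imputed, idx):.3f}")
```

Errors are computed only at the masked positions, since the observed entries are unchanged by imputation and would otherwise dilute both metrics. The numbers printed for this toy series say nothing about the relative ranking of methods reported in the study.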


Details

Title: A comparative study of imputation techniques for missing values in healthcare diagnostic datasets
Contributors - without role: Luke Oluwaseye Joel; Wesley Doorsamy; Babu Sena Paul
Identifiers: 9954303707691
Academic Unit: University of Johannesburg; Faculty of Science
Language: English
Resource Type: Journal article