Abstract
Missing values are a common feature of real-world datasets, particularly in healthcare data. They pose a challenge when
applying machine learning algorithms, as most models perform poorly in the presence of incomplete data. The goal of this
study is to evaluate the performance of seven imputation techniques: Mean Imputation, Median Imputation, Last Observation
Carried Forward (LOCF), K-Nearest Neighbor (KNN) Imputation, Interpolation, MissForest, and Multiple Imputation by
Chained Equations (MICE) on three healthcare datasets. Missing values were introduced at four levels (10%, 15%, 20%, and
25%), and each imputation technique was used to fill in the gaps. The methods were compared using root mean squared error
(RMSE) and mean absolute error (MAE). The results indicate that MissForest performed best, followed by MICE.
Additionally, we examined whether feature selection should be performed before or after imputation, using recall, precision,
F1-score, and accuracy as evaluation metrics. The results suggest that performing imputation before feature selection yields
better results on these metrics.
Since there is limited research on the order of imputation and feature selection, and ongoing debate among researchers, we
hope the findings of this study will encourage data scientists and researchers to prioritize imputation before feature selection
when working with datasets containing missing values.
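
The evaluation protocol summarized above (remove known values at a fixed rate, impute them, and score the imputed values against the originals with RMSE and MAE) can be illustrated with a minimal sketch. It is not the study's implementation: the synthetic data, the completely-at-random masking, the choice to score only the masked entries, and the helper names mask_at_random and mean_impute are all assumptions made for illustration, and only Mean Imputation is shown.

```python
import numpy as np

def mask_at_random(X, rate, rng):
    """Return a copy of X with roughly `rate` of its entries set to NaN,
    plus the boolean mask of removed entries (assumed mechanism: MCAR)."""
    mask = rng.random(X.shape) < rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask

def mean_impute(X_missing):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def mae(a, b):
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(0)
# Synthetic stand-in for a complete numeric dataset (assumption, not study data).
X_true = rng.normal(size=(200, 5))

results = {}
for rate in (0.10, 0.15, 0.20, 0.25):
    X_missing, mask = mask_at_random(X_true, rate, rng)
    X_imputed = mean_impute(X_missing)
    # Score only the entries that were removed and then imputed.
    results[rate] = (rmse(X_true[mask], X_imputed[mask]),
                     mae(X_true[mask], X_imputed[mask]))

for rate, (r, m) in results.items():
    print(f"missing={rate:.0%}  RMSE={r:.3f}  MAE={m:.3f}")
```

Swapping mean_impute for another imputer while keeping the same masks would allow a like-for-like comparison of methods at each missingness level.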