Improving classification performance in missing insurance data

Mlungisi Sizwe Duma

D.Phil. (Electrical and Electronic Engineering)

The ubiquity of missing data and its pervasiveness in large-scale datasets, such as insurance datasets, motivated the research in this thesis to focus on techniques that sustain high accuracy and robustness. There is consensus in research and in practice that missing data reduces data quality and degrades classification accuracy. As the proportion of missing data grows, both the accuracy and the robustness (or resilience) of classifiers suffer, which in turn affects decision making and the calculation of premiums. The goal of the thesis is to present methods that improve the accuracy and/or robustness of classifiers in the presence of missing data in insurance datasets.

The first contribution is a comprehensive comparative study of machine learning techniques (classifiers) under increasing rates of missing data. The study examines their performance and robustness. The classifiers are repeated incremental pruning to produce error reduction (RIPPER), naïve Bayes (NB), k-nearest neighbour (k-NN), logistic discriminant analysis (LgDA) and support vector machines (SVM). The study reveals that the sensitivity of the classifiers decreases as the missing data rate increases. RIPPER shows better overall performance, whilst NB shows better robustness as the quality of the data deteriorates.

The second contribution is a novel automatic relevance determination (ARD) ensemble for effective attribute selection in insurance datasets that have a large number of attributes and contain missing data. The ARD ensemble applies Bayesian neural networks and the evidence framework to find and rank attributes by their relevance to the target outcome. The data is partitioned into numerical and nominal subsets, and an ARD in the ensemble is then constructed for each of the subsets. The combined outcome of the ARDs is assessed using a confidence factor, and the most relevant attributes are selected. Missing data is imputed using mean-mode imputation. The performance of the ARD ensemble is compared with that of principal component analysis (PCA). The results show that classifiers using the ARD ensemble achieve higher accuracy and sustain robustness better than when they use PCA.
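The partitioning into numerical and nominal subsets and the mean-mode imputation mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the encoding of missing entries as None, and the toy records are all assumptions made for illustration. Numerical attributes are filled with the mean of their observed values and nominal attributes with their mode.

```python
from statistics import mean, mode

MISSING = None  # assumption: missing entries are encoded as None


def partition_attributes(rows):
    """Split attribute names into numerical and nominal subsets."""
    numerical, nominal = [], []
    for name in rows[0]:
        observed = [r[name] for r in rows if r[name] is not MISSING]
        is_numeric = bool(observed) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in observed
        )
        (numerical if is_numeric else nominal).append(name)
    return numerical, nominal


def mean_mode_impute(rows):
    """Return a copy of rows with gaps filled by the column mean or mode."""
    numerical, nominal = partition_attributes(rows)
    fill = {}
    for name in numerical + nominal:
        observed = [r[name] for r in rows if r[name] is not MISSING]
        if not observed:
            continue  # no observed values to learn from; leave gaps in place
        fill[name] = mean(observed) if name in numerical else mode(observed)
    return [
        {k: (fill.get(k, v) if v is MISSING else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical toy records, not data from the thesis.
rows = [
    {"age": 30, "region": "urban"},
    {"age": None, "region": "rural"},
    {"age": 50, "region": None},
    {"age": 40, "region": "urban"},
]
imputed = mean_mode_impute(rows)
```

In this sketch the missing age is replaced by the mean of the observed ages and the missing region by the most frequent observed region; the original list of records is left unchanged.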
