Abstract
D.Phil. (Electrical and Electronic Engineering)
The ubiquity of missing data in large-scale datasets (such as insurance datasets) has motivated
the research in this thesis to focus on techniques that sustain high accuracy and robustness.
There is consensus in research and in practice that missing data reduces the quality of data and
negatively affects classification accuracy. As the proportion of missing data increases, the
accuracy and robustness (or resilience) of classifiers deteriorate, which in turn impacts decision
making and the calculation of premiums. The goal of the thesis is to present methods that improve
the accuracy and/or robustness of classifiers in the presence of missing data in insurance
datasets.
The first contribution of this thesis is a comprehensive comparative study of machine learning
techniques (classifiers) in the presence of increasing amounts of missing data. The study explores
and scrutinises their performance and robustness. The classifiers studied are repeated incremental
pruning to produce error reduction (RIPPER), naïve Bayes (NB), k-nearest neighbour (k-NN),
logistic discriminant analysis (LgDA) and support vector machines (SVM). The study reveals that
the sensitivity of the classifiers decreases as the missing data rate increases. RIPPER performs
better overall than the other classifiers, whilst NB is more robust as the quality of the data
deteriorates.
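
As an illustration of what "increasing missing data" can mean in such a comparison, the following
minimal sketch removes a chosen fraction of attribute values from a toy dataset. It assumes values
go missing completely at random, which the abstract does not state; the function names
(inject_missing, missing_fraction) and the toy records are hypothetical and are not taken from the
thesis.

```python
import random


def inject_missing(rows, rate, seed=0):
    """Return a copy of rows with a fraction `rate` of cells replaced by None.

    Assumption: cells are removed completely at random (MCAR).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    degraded = [list(row) for row in rows]  # copy, so the input is left untouched
    cells = [(i, j) for i, row in enumerate(degraded) for j in range(len(row))]
    n_missing = round(rate * len(cells))
    for i, j in rng.sample(cells, n_missing):
        degraded[i][j] = None
    return degraded


def missing_fraction(rows):
    """Fraction of cells that are None."""
    total = sum(len(row) for row in rows)
    missing = sum(value is None for row in rows for value in row)
    return missing / total if total else 0.0


# Hypothetical toy records: (age, vehicle type, region).
records = [
    [34, "sedan", "north"],
    [51, "truck", "south"],
    [27, "sedan", "east"],
    [45, "van", "north"],
]

for rate in (0.0, 0.25, 0.5):
    degraded = inject_missing(records, rate, seed=1)
    print(rate, missing_fraction(degraded))
```

Each degraded copy could then be passed to the classifiers under comparison, so that accuracy can
be tracked as the missing data rate rises.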
A second contribution presented in this thesis is a novel automatic relevance determination (ARD)
ensemble for effective attribute selection in insurance datasets that have a large number of
attributes and contain missing data. The ARD ensemble applies Bayesian neural networks and the
evidence framework to identify and rank attributes by their relevance to the target outcome. The
data is partitioned into numerical and nominal subsets. The ARDs in the ensemble are then
constructed from these subsets. The combined outcome of the ARDs is scrutinised using a confidence
factor, and the most relevant attributes are selected. Missing data is imputed using mean-mode
imputation. The performance of the ARD ensemble is compared to that of principal component
analysis (PCA). The results show that classifiers that use the ARD ensemble achieve higher
accuracy and sustain robustness better than when PCA is used.
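
The mean-mode imputation mentioned above can be illustrated with a short sketch: missing values in
a numerical attribute are replaced by the mean of the observed values, and missing values in a
nominal attribute by the most frequent observed value. This is a generic illustration under
assumed conventions (None marks a missing value, rows are plain lists); the function name
mean_mode_impute and the toy records are hypothetical and do not reproduce the thesis
implementation.

```python
from collections import Counter
from statistics import mean


def is_number(value):
    """True for int/float values (bool is excluded on purpose)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def mean_mode_impute(rows):
    """Fill None cells column by column: mean for numerical, mode for nominal."""
    if not rows:
        return []
    fills = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        if not observed:
            fills.append(None)  # nothing observed, so the column stays missing
        elif all(is_number(v) for v in observed):
            fills.append(mean(observed))
        else:
            fills.append(Counter(observed).most_common(1)[0][0])
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


# Hypothetical toy records: (age, vehicle type).
records = [
    [30, "sedan"],
    [None, "truck"],
    [50, None],
    [40, "sedan"],
]

imputed = mean_mode_impute(records)
print(imputed)
```

In the toy data the missing age is filled with the mean of 30, 50 and 40, and the missing vehicle
type with the most frequent observed category.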