Abstract
Redundancy, correlation, feature irrelevance, and missing samples are just a few of the problems that make software
defect data difficult to analyze. Additionally, it can be challenging to maintain an even distribution of data between
defective and non-defective software: in most experimental settings, data from the non-defective class dominate the
dataset. The objective of this review study is to demonstrate the effectiveness of
combining ensemble learning and feature selection in improving the performance of defect classification. Besides
the successful feature selection approach, a novel variant of the ensemble learning technique is analyzed to address
the challenges of feature redundancy and data imbalance, providing robustness in the classification process. To
overcome these problems and lessen their impact on fault classification performance, the authors carefully integrate
effective feature selection with ensemble learning models. Forward selection demonstrates that much of the area
under the receiver operating characteristic (ROC) curve can be attributed to only a small subset of features. The greedy
forward selection (GFS) technique outperformed Pearson’s correlation method when feature selection techniques were evaluated
on the datasets. Ensemble learners, such as random forests (RF) and the proposed average probability ensemble
(APE), demonstrate greater resistance to the impact of weak features when compared to weighted support vector
machines (W-SVMs) and extreme learning machines (ELM). Furthermore, on the NASA and Java
datasets, the enhanced average probability ensemble model, which combines the greedy forward selection
technique with the average probability ensemble model, achieved an area under the ROC curve
approaching 1.0, indicating exceptional performance. This review emphasizes the importance
of carefully selecting attributes in a software dataset to accurately classify defective components. In addition,
the suggested ensemble learning model successfully addressed the aforementioned problems with software data
and produced outstanding classification performance.
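
As a rough illustration of the two ideas summarized above, the following Python sketch shows greedy forward selection driven by the area under the ROC curve, and an ensemble that averages the defect probabilities of several models. It is not the authors' implementation. All names (auc, average_probability, greedy_forward_selection, toy_score), the toy data, and the scoring shortcut of summing feature values instead of training a classifier are assumptions made for illustration only.

```python
from statistics import mean


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_probability(per_model_probs):
    """Average, sample by sample, the defect probabilities of several models."""
    return [mean(sample) for sample in zip(*per_model_probs)]


def greedy_forward_selection(n_features, score_fn, min_gain=1e-9):
    """Add one feature at a time, keeping the one that raises score_fn the most.

    Stops as soon as no remaining feature improves the score by more than min_gain.
    """
    selected, best = [], float("-inf")
    remaining = list(range(n_features))
    while remaining:
        cand, cand_score = None, float("-inf")
        for f in remaining:
            s = score_fn(selected + [f])
            if s > cand_score:  # strict: earlier features win ties
                cand, cand_score = f, s
        if cand_score - best <= min_gain:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best


# Hypothetical toy data: 1 = defective module, 0 = non-defective module.
LABELS = [0, 0, 0, 1, 1, 1]
FEATURES = [
    [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],  # informative metric
    [0.5, 0.1, 0.9, 0.4, 0.6, 0.2],  # noisy metric
    [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],  # redundant copy of the first metric
]


def toy_score(subset):
    """Stand-in for 'train a model on these features and report its AUC'."""
    scores = [sum(FEATURES[f][i] for f in subset) for i in range(len(LABELS))]
    return auc(scores, LABELS)


selected, best = greedy_forward_selection(len(FEATURES), toy_score)

# Two hypothetical models' predicted defect probabilities for the same six modules.
model_a = [0.2, 0.4, 0.1, 0.9, 0.6, 0.7]
model_b = [0.3, 0.2, 0.6, 0.8, 0.5, 0.9]
ensemble = average_probability([model_a, model_b])
ensemble_auc = auc(ensemble, LABELS)

print("selected features:", selected, "AUC:", best)
print("ensemble AUC:", ensemble_auc)
```

On this toy data the selection stops after the first informative feature, because neither the noisy nor the redundant feature raises the score further. In a real study, toy_score would be replaced by cross-validated training of an actual classifier, and the averaged probabilities would come from fitted models.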