Abstract
Class-imbalanced datasets are a common occurrence in real-world applications. The imbalance between minority and majority classes arises when one class is over-represented relative to another in a dataset. The class imbalance might reflect a system's behaviour over time. However, it causes sub-optimal performance in machine learning models that predict the system's future behaviour. Various techniques are used to reduce the negative impact of class-imbalanced datasets on machine learning models. Data resampling is one of the main approaches, and it is subdivided into oversampling and undersampling. Oversampling techniques have outperformed undersampling techniques in most studies, and most data resampling techniques are derived from oversampling. However, some oversampling techniques are ineffective when used on minority-class datasets that lack within-class variation and have a high class imbalance. In this study, an analysis was performed to understand the changes in within-class variation before and after oversampling for nine datasets. Additionally, classification performance was measured for standard and hybrid oversampled datasets. A novel hybrid oversampling technique that uses k-Means and ADASYN was implemented. The hybrid oversampling techniques generated synthetic examples that changed the within-class variation only marginally, and they achieved the highest F1 scores compared to standard oversampling techniques across the nine datasets.
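To make the idea of comparing within-class variation before and after oversampling concrete, the following is a minimal sketch and not the method used in this study. It assumes that within-class variation can be summarised as the mean squared distance of minority examples to their centroid, and it substitutes plain linear interpolation between random pairs of minority examples for ADASYN and the k-Means hybrid. The function names and the toy data are hypothetical.

```python
import random
from statistics import fmean


def within_class_variation(points):
    """Mean squared Euclidean distance of the points to their centroid.

    This is an assumed summary of within-class variation, chosen for
    illustration only.
    """
    dims = len(points[0])
    centroid = [fmean(p[d] for p in points) for d in range(dims)]
    return fmean(
        sum((p[d] - centroid[d]) ** 2 for d in range(dims)) for p in points
    )


def interpolate_oversample(points, n_new, rng):
    """Create n_new synthetic points on segments between random pairs.

    A simple interpolation-based stand-in for an oversampler; it is
    neither ADASYN nor the k-Means hybrid described in the abstract.
    """
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(points, 2)   # two distinct minority examples
        t = rng.random()               # position along the segment a -> b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical minority class with four two-dimensional examples.
minority = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.1], [1.1, 2.2]]
synthetic = interpolate_oversample(minority, 20, rng)

before = within_class_variation(minority)
after = within_class_variation(minority + synthetic)
print(f"within-class variation before: {before:.6f}")
print(f"within-class variation after:  {after:.6f}")
```

Because every synthetic example lies on a segment between two existing minority examples, this kind of oversampler cannot place points outside the region the minority class already occupies, which is one reason to check how the variation changes after resampling.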