Abstract
Mental disorders represent a complex and multifaceted challenge within healthcare and well-being. These conditions, encompassing a wide spectrum of psychological and emotional disturbances, have extensive implications for individuals, families, and society. Understanding the various dimensions of mental disorders is essential to developing effective strategies for prevention, diagnosis, and treatment, and to creating supportive environments that foster mental well-being. This study explored mental disorders with the aim of offering insights and implications for addressing mental health concerns. Specifically, it rigorously compared various machine learning (ML) techniques to assess their effectiveness in predicting mental disorders. Systematically analysing the performance of these techniques provides insight into their predictive capabilities, inherent strengths, and limitations. The ultimate objective was to understand comprehensively how different ML techniques perform in the specific context of mental disorder prediction.
Unlike other studies that used a single-year dataset, this study used a six-year dataset (2016-2021) from Open Sourcing Mental Illness. To ensure robust analyses, the dataset underwent preprocessing, beginning with data exploration. The exploration revealed the need for enhanced awareness of mental healthcare options and uncovered gender disparities in mental health discussions, with females more active in these discussions. It also showed the significance of family history in mental health diagnoses, emphasising familial factors. Preprocessing further included label encoding, data cleaning, and handling missing values through four distinct imputation techniques: hot deck, K-Nearest Neighbor (K-NN), Multiple Imputation by Chained Equations (MICE), and mode imputation. Of these, MICE and K-NN emerged as the superior choices because of their accuracy and their ability to preserve relationships between variables. Class imbalance was addressed using four techniques: the Synthetic Minority Over-sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), Tomek links, and Near Miss. These techniques affected various metrics, with SMOTE and ADASYN demonstrating a promising enhancement in precision, recall, and balanced accuracy. eXtreme Gradient Boosting (XGBoost) and Adaptive Boosting (AdaBoost) were identified as potent classifiers, offering a trade-off between balance and performance.
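To make the Tomek links idea concrete, the following is a minimal sketch in plain Python and is not the study's implementation. A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbour; undersampling commonly drops the majority-class member of each pair. The function names, the Euclidean distance, and the toy data are assumptions made purely for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""
    def nearest(i):
        # Nearest other sample; ties go to the lowest index.
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_majority_in_links(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y)
            for k in pair if y[k] == majority_label}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Hypothetical toy data: class 0 is the majority (4 vs 3 samples).
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0),
     (5.0, 5.0), (5.2, 5.0), (0.0, 0.2)]
y = [0, 0, 0, 1, 1, 1, 0]

links = tomek_links(X, y)
X_res, y_res = remove_majority_in_links(X, y, majority_label=0)
```

In this toy data, only samples 2 and 3 sit next to each other across the class boundary, so only the majority sample at index 2 is removed. In practice a library implementation would normally be preferred over hand-written code.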
Feature selection methods, including Recursive Feature Elimination with Cross-Validation (RFECV), Ensemble Feature Selection Algorithms (EFSA), and the Random Subspace Method (RSM), were employed to optimise model performance. RFECV was the most effective, reducing the feature set to just three features and significantly enhancing accuracy. EFSA, which selected 11 features, balanced accuracy and inclusivity and proved pivotal in improving predictive accuracy while comprehensively representing the features.
Predictive models for mental disorders incorporating class imbalance techniques, feature selection, and hyperparameter tuning were then constructed. XGBoost consistently excelled in accuracy, precision,
F1 score, and other metrics. Cross-validation was employed to validate model performance, with XGBoost consistently achieving accuracy scores in the 92% to 93% range, indicating stability across different dataset subsets. Ensemble learning methods, including bagging, boosting, and stacking, further enhanced predictive capabilities, with the proposed Tomek Link Boosting Ensemble (TlBE) emerging as the most effective choice and yielding robust and precise mental disorder prediction. TlBE showed a notably high recall score, prioritising the identification of relevant minority class instances. This characteristic is crucial in healthcare and safety-critical domains. The study employed a confusion matrix to assess model performance, where minimising false negatives was particularly important to avoid overlooking individuals who require assistance, underlining the model's effectiveness in mental health disorder prediction.
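The link between the confusion matrix, false negatives, and recall can be shown with a short sketch. It uses hypothetical toy labels, not the study's data or results, and the function names are chosen for illustration only. Recall is the share of true positive cases the model actually identifies, so every false negative lowers it.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1  # a positive case the model missed
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(tp, fn):
    """Fraction of actual positives that were identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of predicted positives that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical labels: 1 = positive (minority) class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
r = recall(tp, fn)
p = precision(tp, fp)
f1 = f1_score(p, r)
```

In this toy example, one of the four positive cases is missed, which gives a recall of 0.75. Reducing false negatives is what pushes recall towards 1.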
In a comparative performance evaluation, the proposed TlBE model demonstrated its reliability and effectiveness in predicting mental disorders. Statistical tests confirmed that the performance differences among the compared models, including models from other studies, were significant, solidifying TlBE as a robust foundation for accurate mental disorder prediction and a valuable contribution to the field.
Keywords: Class Imbalance, Ensemble Learning, Feature Selection, Mental Health Disorders, Machine Learning, Predictive Models, Tomek Link Boosting Ensemble