Abstract
This study investigated the impact of alkaline pretreatment on the biomethane yield of Xyris capensis
experimentally and computationally using machine-learning (ML)-based techniques. Despite extensive
studies on the anaerobic digestion of lignocellulosic biomass, the integration of advanced data
analytics, including ML techniques and explainable AI (XAI) based on SHapley Additive exPlanations
(SHAP), with experimental investigations has not yet been explored. The biomass was
subjected to varying NaOH concentrations and exposure times, then digested anaerobically for 35
days. Data-driven insight was gained through correlation mapping, SHAP-based XAI for feature
ranking, and cluster analysis of the bio-digestion operational dataset using k-means integrated
with Principal Component Analysis (PCA). The hyperparameters of four ML models, namely Artificial
Neural Network (ANN), Random Forest (RF), Support Vector Machine (SVM), and Decision Tree (DT),
were tuned to predict the biomethane yield. NaOH pretreatment
improved biomethane yield by 91–143%, with the highest yield obtained at higher NaOH concentration
and shorter exposure time. SHAP analysis revealed exposure time as the most influential feature,
with a strong negative impact on biomethane yield, whereas retention time and NaOH concentration
were identified as key positive contributors. PCA captured 86% of the total data variance in the
first three principal components (PCs 1–3). K-means cluster analysis revealed three distinct groups,
with cluster 0 exhibiting the optimal NaOH pretreatment conditions associated with the highest
biomethane yield. The RF model gave the best
prediction, with RMSE, MAE, MAD, MAPE, and VAF values of 3.1480, 2.0737, 1.7569, 5.7488, and
99.07, respectively, in the training phase. This research demonstrates the potential of data-driven
approaches as powerful standalone tools and vital complements to experimental investigations of
biomethane yield from lignocellulosic biomass.
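
The five error metrics reported above for the RF model can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not define the metrics, so the definitions below are assumptions: RMSE, MAE, and MAPE follow their usual forms, MAD is taken as the median absolute deviation of the residuals about their median, and VAF (variance accounted for) is taken as 100 × (1 − Var(residuals) / Var(observed)). The function name `regression_metrics` and the sample values are hypothetical and are not data from the study.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAD, MAPE (%), and VAF (%) for paired values.

    All definitions are assumptions; the source abstract does not state them.
    """
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when an observed value is zero")

    residuals = [t - p for t, p in zip(y_true, y_pred)]
    n = len(residuals)

    # Root mean squared error and mean absolute error.
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    mae = sum(abs(r) for r in residuals) / n

    # Assumed MAD: median absolute deviation of residuals about their median.
    centre = statistics.median(residuals)
    mad = statistics.median([abs(r - centre) for r in residuals])

    # Mean absolute percentage error, expressed in percent.
    mape = 100.0 * sum(abs(r / t) for r, t in zip(residuals, y_true)) / n

    # Assumed VAF: share of observed variance not left in the residuals.
    vaf = 100.0 * (1.0 - statistics.pvariance(residuals) / statistics.pvariance(y_true))

    return {"RMSE": rmse, "MAE": mae, "MAD": mad, "MAPE": mape, "VAF": vaf}


if __name__ == "__main__":
    # Hypothetical observed and predicted yields, for illustration only.
    observed = [10.0, 20.0, 30.0, 40.0]
    predicted = [12.0, 18.0, 33.0, 39.0]
    for name, value in regression_metrics(observed, predicted).items():
        print(f"{name}: {value:.4f}")
```

Under these assumed definitions, lower RMSE, MAE, MAD, and MAPE and a VAF closer to 100 indicate a closer fit. A different convention for MAD (for example, mean absolute deviation) would change that value only.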