Prediction of hydrogen and methane yields from gasification of leather waste using machine learning and explainable AI: An original dataset


Cihan P., Alfarra F., Kurtulus Ozcan H. K., CİNER M. N., ÖNGEN A.

JOURNAL OF ENVIRONMENTAL MANAGEMENT, vol.391, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 391
  • Publication Date: 2025
  • DOI: 10.1016/j.jenvman.2025.126521
  • Journal Name: JOURNAL OF ENVIRONMENTAL MANAGEMENT
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, International Bibliography of Social Sciences, PASCAL, Aerospace Database, Agricultural & Environmental Science Database, Aqualine, Aquatic Science & Fisheries Abstracts (ASFA), BIOSIS, CAB Abstracts, Communication Abstracts, Environment Index, Geobase, Greenfile, Index Islamicus, Metadex, Pollution Abstracts, Public Affairs Index, Veterinary Science Database, Civil Engineering Abstracts
  • Keywords: Data generation, Explainable AI, Machine learning, SHAP, Syngas prediction
  • Istanbul University-Cerrahpaşa Affiliated: Yes

Abstract

Accurately predicting syngas composition is essential for optimizing energy production and ensuring environmental sustainability. Despite the growing use of machine learning techniques in this field, publicly available datasets remain limited, and existing datasets contain relatively few samples. To bridge this gap, we generated a comprehensive dataset of 3748 samples under controlled laboratory conditions and publicly shared it on Kaggle (https://www.kaggle.com/datasets/miracnurciner/gasification-dataset). This study aims to identify the most successful machine learning model for predicting H2 and CH4 gas concentrations by evaluating nine models: Random Forest (RF), Linear Regression (LR), Decision Tree (DT), Support Vector Regression (Linear and RBF), K-Nearest Neighbors (KNN), Gradient Boosting Regressor (GBR), XGBoost, CatBoost, and LightGBM. Model performance was assessed using multiple metrics, including the coefficient of determination (R²), root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and explained variance score (EVS). The Friedman test was applied to evaluate the statistical significance of performance differences among the models. The results show that the KNN model achieved the highest predictive performance for both H2 (R² = 0.987, RMSE = 1.253) and CH4 (R² = 0.979, RMSE = 0.920). The Friedman test shows that the performance differences between the models are statistically significant (p < 0.001). Integrating Shapley Additive Explanations (SHAP) into the model clarifies the contribution of each feature to the predictions. SHAP analysis highlights temperature and time as the main features affecting H2 and CH4 concentrations. This study highlights the potential of machine learning techniques for biomass gas prediction and advocates for integrating Explainable AI (XAI) methods, establishing a robust foundation for future research. Furthermore, by providing a large, publicly available dataset, this research significantly advances studies in syngas composition prediction.
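
For readers unfamiliar with the five evaluation metrics named in the abstract, the following is a minimal sketch of their standard definitions in plain Python. It is not the authors' code: the function name `regression_metrics` and the sample values are hypothetical, chosen only for illustration, and the abstract does not say which library or implementation was used.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MAE, MAPE (in %), and EVS for two equal-length sequences.

    Assumes y_true is not constant (otherwise R2 and EVS are undefined)
    and contains no zeros (otherwise MAPE is undefined).
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_err = sum(errors) / n

    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    var_err = sum((e - mean_err) ** 2 for e in errors) / n
    var_true = ss_tot / n

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "evs": 1.0 - var_err / var_true,
    }

# Hypothetical measured vs. predicted gas concentrations (illustrative values only).
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(measured, predicted)
print(metrics)
```

R² and EVS coincide whenever the mean prediction error is zero, as in this toy example; they diverge when a model is systematically biased, which is one reason to report both.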