Quantitative Benchmarking of Machine Learning Models for Risk Prediction: A Comparative Study Using AUC/F1 Metrics and Robustness Testing
DOI: https://doi.org/10.63125/9hd4e011

Keywords: Machine Learning Benchmarking, Enterprise Risk Prediction, Robustness Testing, AUC and F1, Cloud Analytics Pipelines

Abstract
This study addresses a recurring weakness in enterprise risk decision systems: model selection is often justified by baseline accuracy alone, even though cloud and enterprise pipelines face measurement noise, missing values, and distribution shift that can degrade predictive performance. The purpose was to quantitatively benchmark machine learning classifiers for binary risk prediction and determine which approach delivers the best combination of effectiveness and stability. Using a quantitative, cross-sectional, case-based design, the study analyzed N = 4,200 enterprise case records with an adverse-event prevalence of 8.6% (n = 361) and benchmarked Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), Gradient Boosting (XGB), and a Multilayer Perceptron (MLP). The key independent variables were model family and stress condition (baseline versus perturbation); the outcome variables were AUC, F1-score, precision, recall, and robustness degradation (ΔAUC, ΔF1). The analysis plan applied standardized preprocessing and stratified validation, computed baseline AUC and F1, and then performed robustness testing under noise injection, 15% added missingness, and segment-shift distribution perturbation; descriptive statistics summarized performance, correlation and regression tested relationships between robustness and performance, and a 5-point Likert survey (n = 120) captured adoption perceptions. Baseline results showed XGB as the top performer (AUC = 0.872, F1 = 0.532), followed by RF (0.846, 0.498), MLP (0.831, 0.474), SVM (0.804, 0.431), and LR (0.781, 0.402). Under robustness testing, XGB showed the smallest degradation (ΔAUC = -0.027; ΔF1 = -0.032), while MLP was the most unstable (ΔAUC = -0.071; ΔF1 = -0.079), reshaping the composite ranking to XGB, RF, LR, SVM, then MLP. Robustness correlated with baseline performance (AUC: r = 0.62, p = 0.004; F1: r = 0.58, p = 0.008) and significantly predicted the benchmark score (robustness index β = 0.44, p < 0.001), with noise, missingness, and distribution perturbation showing negative effects (β = -0.21, -0.18, -0.27; all p ≤ 0.009). Survey reliability was strong (Cronbach's alpha = 0.86), and respondents rated perceived reliability highly (M = 4.18, SD = 0.54). The findings support a clear deployment implication: enterprise risk teams should adopt multi-metric, stress-tested benchmarking before production rollouts.
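The baseline-versus-perturbation protocol summarized in the abstract can be illustrated with a compact benchmarking loop. The sketch below is a minimal, assumption-laden version using scikit-learn: the synthetic dataset, noise magnitude, imputation strategy, decision threshold, and the use of GradientBoostingClassifier as a stand-in for the XGB model are illustrative choices, not the study's actual data or pipeline.

```python
# Minimal sketch of baseline vs. perturbation benchmarking with AUC/F1.
# Assumptions: synthetic data, sigma = 0.1 noise, 15% missingness with mean
# imputation, 0.5 decision threshold, GradientBoostingClassifier as an XGB proxy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(42)

# Synthetic stand-in for N = 4,200 records with roughly 8.6% positive prevalence.
X, y = make_classification(n_samples=4200, n_features=20, weights=[0.914],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),
    "RF": RandomForestClassifier(n_estimators=300, random_state=42),
    "GB": GradientBoostingClassifier(random_state=42),  # illustrative XGB proxy
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42),
}

def evaluate(model, X_eval, y_eval):
    """Return (AUC, F1) for a fitted model on one evaluation matrix."""
    proba = model.predict_proba(X_eval)[:, 1]
    return roc_auc_score(y_eval, proba), f1_score(y_eval, (proba >= 0.5).astype(int))

def perturb_noise(X_eval, sigma=0.1):
    """Noise injection: add Gaussian noise to the standardized features."""
    return X_eval + rng.normal(0.0, sigma, X_eval.shape)

def perturb_missing(X_eval, rate=0.15):
    """15% added missingness, then mean imputation before scoring."""
    X_m = X_eval.copy()
    X_m[rng.random(X_m.shape) < rate] = np.nan
    return SimpleImputer(strategy="mean").fit_transform(X_m)

for name, model in models.items():
    model.fit(X_tr_s, y_tr)
    auc0, f10 = evaluate(model, X_te_s, y_te)
    auc_n, _ = evaluate(model, perturb_noise(X_te_s), y_te)
    auc_m, _ = evaluate(model, perturb_missing(X_te_s), y_te)
    print(f"{name}: baseline AUC={auc0:.3f} F1={f10:.3f} | "
          f"noise dAUC={auc_n - auc0:+.3f} | missing dAUC={auc_m - auc0:+.3f}")
```

Ranking models jointly on baseline AUC/F1 and on the perturbation deltas, as this loop does, mirrors the composite scoring described in the abstract; a segment-shift condition would follow the same pattern with a resampled or reweighted evaluation set.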
