Machine Learning for Early Non-invasive Diabetes Detection Using Electronic Health Records

Suresh Kumar Arumugam(1*), Jason Patterson(2), Panagiotis Petridis(3), Sara Masoud(4)


(1) Department of Computer Science and Engineering, Graphic Era (Deemed to be University), Dehradun, Uttarakhand 248002, India
(2) Department of Biomedical Informatics, Columbia University Medical Center, New York, NY 10032, USA
(3) Department of Electrical and Computer Engineering, School of Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
(4) Department of Industrial and Systems Engineering, Wayne State University, Detroit, MI 48202, USA
(*) Corresponding Author

Abstract


Early detection of Type 2 Diabetes Mellitus (T2DM) is critical to preventing long-term complications such as cardiovascular disease, nephropathy, and retinopathy. However, conventional diagnostic approaches are often invasive, costly, and unsuitable for population-scale screening. This study proposes a non-invasive machine learning (ML) framework for early T2DM detection using electronic health records (EHRs) from a publicly available Kaggle dataset. Key non-invasive features, including demographics, vital signs, medication history, and temporal health trends, were extracted and used to train six classifiers: random forest (RF), support vector machine (SVM), naïve Bayes (NB), alternating decision tree (ADT), random tree (RT), and k-nearest neighbors (KNN). Class imbalance was addressed using the synthetic minority over-sampling technique (SMOTE) at 0%, 150%, and 300% levels. Experimental results show that RF achieved the highest area under the receiver operating characteristic curve (AUC), 88.45%, at 150% SMOTE, while SVM showed the largest sensitivity gains when temporal features and feature selection were applied. The proposed framework demonstrates the potential of interpretable, EHR-based ML models for scalable, cost-effective diabetes screening and offers a reproducible benchmark for future work on real-world clinical data.
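The SMOTE levels named in the abstract can be illustrated with a minimal sketch. It assumes the percentage is the number of synthetic minority records generated relative to the original minority count (so 150% adds 1.5 times as many), as in the original SMOTE formulation; the abstract does not state which implementation the authors used. The function name `smote`, the parameter `k`, and the toy data are hypothetical.

```python
import math
import random


def smote(minority, percent, k=5, seed=0):
    """Return synthetic minority rows built by SMOTE-style interpolation.

    minority: list of equal-length numeric feature lists (minority class only)
    percent:  oversampling amount; 150 yields 1.5x as many synthetic rows
              as there are minority rows, 0 yields none (assumed convention)
    k:        number of nearest minority neighbours to interpolate towards
    """
    rng = random.Random(seed)
    n_new = int(len(minority) * percent / 100)
    if n_new == 0 or len(minority) < 2:
        return []
    k = min(k, len(minority) - 1)
    synthetic = []
    for i in range(n_new):
        idx = i % len(minority)  # cycle through the minority rows
        base = minority[idx]
        # k nearest minority neighbours of the base row, excluding itself
        neighbours = sorted(
            (row for j, row in enumerate(minority) if j != idx),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class with two features (hypothetical values).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for level in (0, 150, 300):
    print(level, len(smote(toy_minority, level)))
```

Each synthetic row lies on the line segment between a real minority record and one of its nearest minority neighbours, so the oversampled class stays within the region the original records occupy. In an evaluation like the one described, oversampling would normally be applied to the training folds only, so that synthetic records never appear in the test data.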

Keywords


Type 2 Diabetes Mellitus; Machine Learning; Electronic Health Records; Temporal Features; Non-invasive Detection

DOI: https://doi.org/10.26714/jichi.v6i1.17299

____________________________________________________________________________
Journal of Intelligent Computing and Health Informatics (JICHI)
ISSN 2715-6923 (print) | 2721-9186 (online)
Organized by
Department of Informatics
Faculty of Engineering
Universitas Muhammadiyah Semarang

W : https://jurnal.unimus.ac.id/index.php/ICHI
E : jichi.informatika@unimus.ac.id, ahmadilham@unimus.ac.id

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.