Machine Learning–Based Hyperglycemia Prediction: Enhancing Risk Assessment in a Cohort of Undiagnosed Individuals

Abstract

Background: Noncommunicable diseases continue to pose a substantial health challenge globally, with hyperglycemia serving as a prominent indicator of diabetes.

Objective: This study employed machine learning algorithms to predict hyperglycemia in a cohort of individuals who were asymptomatic and unraveled crucial predictors contributing to early risk identification.

Methods: This dataset included an extensive array of clinical and demographic data obtained from 195 adults who were asymptomatic and residing in a suburban community in Nigeria. The study conducted a thorough comparison of multiple machine learning algorithms to ascertain the most effective model for predicting hyperglycemia. Moreover, we explored feature importance to pinpoint correlates of high blood glucose levels within the cohort.

Results: Elevated blood pressure and prehypertension were recorded in 8 (4.1%) and 18 (9.2%) of the 195 participants, respectively. A total of 41 (21%) participants presented with hypertension, of which 34 (83%) were female. However, sex adjustment showed that 34 of 118 (28.8%) female participants and 7 of 77 (9%) male participants had hypertension. Age-based analysis revealed an inverse relationship between normotension and age (r=−0.88; P=.02). Conversely, hypertension increased with age (r=0.53; P=.27), peaking between 50‐59 years. Of the 195 participants, isolated systolic hypertension and isolated diastolic hypertension were recorded in 16 (8.2%) and 15 (7.7%) participants, respectively, with female participants recording a higher prevalence of isolated systolic hypertension (11/16, 69%) and male participants reporting a higher prevalence of isolated diastolic hypertension (11/15, 73%). Following class rebalancing, the random forest classifier gave the best performance (accuracy score 0.89; receiver operating characteristic–area under the curve score 0.89; F1-score 0.89) of the 26 model classifiers. The feature selection model identified uric acid and age as important variables associated with hyperglycemia.

Conclusions: The random forest classifier identified significant clinical correlates associated with hyperglycemia, offering valuable insights for the early detection of diabetes and informing the design and deployment of therapeutic interventions. However, to achieve a more comprehensive understanding of each feature’s contribution to blood glucose levels, modeling additional relevant clinical features in larger datasets could be beneficial.



Introduction
Non-communicable diseases (NCDs) have become a significant public health concern in Africa [1]. Conditions like coronary artery disease, stroke, hypertension, and diabetes, which were once primarily associated with developed nations or affluence, have now become pervasive health challenges in developing countries and across diverse socio-economic strata [1]. The complex nature of NCDs underscores the need for a comprehensive approach to risk assessment, intervention and prevention.
Suburban communities serve as a distinctive microcosm within an evolving landscape of diseases [2,3]. These communities, characterized by the coexistence of traditional and modern lifestyles, grapple with risk factors that necessitate thorough examination [4]. The epidemiological shift from communicable to non-communicable diseases, coupled with limited healthcare resources, especially in suburban parts of developing countries [5,6], stresses the importance of this research. In addition, recent advancements in genetic research have elucidated the underlying mechanisms of various complex NCDs. The identification of individuals at an elevated genetic risk for NCDs has the potential to revolutionize the approach of healthcare stakeholders to disease management. However, the effective implementation of genetic screening for NCD risk analysis relies on a robust understanding of the baseline contributors prevalent in the target population [7,8]. This study provided a comprehensive description of the prevalence and intricate interplay of risk factors associated with NCDs, highlighting hypertension, obesity and diabetes. The specific focus was on undiagnosed asymptomatic individuals to elucidate the complex relationships of these health indicators within this population.
Machine learning encompasses a diverse set of algorithms designed to extract patterns from data and establish associations between these patterns and discrete sample classes within the data. Machine learning proves to be a valuable tool for identifying potential disease risk factors, elucidating etiology and interpreting complex pathological processes in the context of NCDs [9-16]. In this study, multiple machine learning algorithms were developed to predict elevated blood glucose levels in a cohort of undiagnosed asymptomatic individuals. The primary objective was to systematically compare the accuracies of supervised machine learning classifiers to identify the most effective model for predicting hyperglycemia. Leveraging the predictors in the dataset, we meticulously constructed and evaluated these models for the identification of significant features associated with potential diabetes in the population.

Participant recruitment and screening
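This study was carried out as part of a parallel community-based genetic screening of apparently healthy adults living in Ijede Community, Lagos, Nigeria. Ethical approval was obtained from the Institutional Review Board of the Nigerian Institute of Medical Research (IRB/21/074). Following informed consent, participants were recruited and 10 ml of venous blood was collected per individual. Demographic information, body mass index (BMI), knowledge, attitude and practices were obtained from the participants. The study clinician further clerked participants for personal and family medical history as well as their smoking status. Exclusion criteria included pregnancy at the time of recruitment, placement on antihypertensive or antidiabetic chemotherapy, radiotherapy, current or previous hematologic or tumoral diseases and known chronic diseases. Participants underwent electrocardiogram (ECG) screening (SonoHealth, USA) to provide clues on heart defects or other heart-related problems. Hemoglobin electrophoresis was conducted to detect possible hemoglobinopathy in the participants [17]. In addition, random blood glucose concentrations (Guilin Royalze, China) and blood pressure (BP) values (Iston Mediq, USA) were determined to evaluate the presence or absence of prediabetes, diabetes, prehypertension (preHTN) or hypertension (HTN) onset in the participants. Individuals with screening tests outside normal ranges were advised to visit their healthcare specialists for further checks.
Normal BP was described as systolic blood pressure (SBP) <120 mmHg and diastolic blood pressure (DBP) <80 mmHg. Elevated BP was defined as SBP of 120-129 mmHg and DBP <80 mmHg, stage 1 hypertension (preHTN) as SBP of 130-139 mmHg and DBP of 80-89 mmHg, and stage 2 HTN as SBP ≥140 mmHg and DBP ≥90 mmHg [18]. Isolated systolic hypertension (ISH) was described as SBP above 140 mmHg with DBP of less than 90 mmHg [19]. Isolated diastolic hypertension (IDH) is an important subtype of hypertension defined as an SBP of <130 mmHg and a DBP of at least 80 mmHg [20]. Prediabetes was defined as a random blood glucose (RBG) concentration of 140-199 mg/dl or fasting blood glucose of 100-125 mg/dl. Diabetes mellitus was defined as a random blood glucose level of ≥200 mg/dl or fasting blood glucose of ≥126 mg/dl [21]. However, as all the participants reported they were not fasting, random blood glucose values were documented.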
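The random blood glucose cutoffs above can be written as a small helper. This is a minimal sketch rather than the study's code: the function names are hypothetical, and treating readings between 199 and 200 mg/dl as prediabetes is an assumption, since the text only gives the ranges 140-199 and ≥200 mg/dl.

```python
def classify_random_glucose(rbg_mg_dl: float) -> str:
    """Classify a random blood glucose (RBG) reading given in mg/dl.

    Cutoffs follow the text: >=200 diabetes, 140-199 prediabetes.
    Values between 199 and 200 are treated as prediabetes (assumption).
    """
    if rbg_mg_dl >= 200:
        return "diabetes"
    if rbg_mg_dl >= 140:
        return "prediabetes"
    return "normal"


def hyperglycemia_label(rbg_mg_dl: float) -> int:
    """Binary modeling target: 1 if RBG >= 140 mg/dl, otherwise 0."""
    return int(rbg_mg_dl >= 140)


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    for reading in (98.0, 140.0, 199.0, 230.0):
        print(reading, classify_random_glucose(reading), hyperglycemia_label(reading))
```

The binary label mirrors the target variable used for modeling, where prediabetes and diabetes are both coded as 1.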

Correlation analysis
Data cleaning, exploratory analysis and feature engineering were performed in Google Colab (with Python 3.10). The target variable was specified as "blood glucose," where 1 indicated an RBG concentration ≥140 mg/dl and 0 indicated an RBG concentration <140 mg/dl. Independent variables included age (integer), sex (integer), BMI (float), smoking status (integer), ECG (float), hemoglobin (float), cholesterol (float), uric acid (float), systolic blood pressure (integer), diastolic blood pressure (integer), normal BP (integer), elevated BP (integer), preHTN (integer), HTN (integer), isolated systolic hypertension (integer), isolated diastolic hypertension (integer), prediabetes (integer), diabetes (integer), normal glucose (integer), abnormal ECG values (integer) and normal ECG values (integer). The dataset was checked and visualized for missingness using a seaborn heatmap (Additional File 1: Fig. S1). Missing values were replaced with the column mean (for continuous variables) or mode (for categorical variables). Duplicate rows and outliers were dropped before encoding categorical variables and creating dummy variables. Subsequently, we created a heatmap of the correlation of independent variables with the target column in descending order. The cleaned dataset was then scaled for subsequent training of machine learning models. A P-value ≤ .05 was considered statistically significant.
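The imputation and scaling steps can be illustrated with the standard library alone. This is a sketch under stated assumptions: the helper names and sample values are hypothetical, and min-max scaling is assumed only for illustration because the text does not say which scaler was used.

```python
from statistics import mean, mode


def impute_column(values, kind):
    """Fill None entries with the mean (continuous) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "continuous" else mode(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values to [0, 1]; the choice of scaler is an assumption."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Hypothetical columns with missing entries marked as None.
    uric_acid = impute_column([5.0, None, 7.0, 6.0], "continuous")
    smoking = impute_column([0, 0, None, 1], "categorical")
    print(uric_acid, smoking, min_max_scale(uric_acid))
```

In practice the same steps are usually done on a data frame; the plain lists here only make the mean-versus-mode rule explicit.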

Machine learning algorithms and evaluation
The study adopted 26 supervised classification algorithms and compared their accuracies to identify the best performing model for predicting high blood glucose, which was defined in this study as a random blood glucose (RBG) concentration ≥140 mg/dl (Fig. 1). Specifically, after installation and importation of the scikit-learn libraries [22], we carried out data cleaning, exploration and scaling to improve the efficiency of our model (Supplementary Methods). Imbalances in the distribution of hyperglycemia cases and non-cases within the dataset might affect the model's performance. Addressing this imbalance and validating the model on balanced datasets could enhance its robustness. To address class imbalance in the outcome variable (blood glucose level), we adopted the synthetic minority over-sampling technique (SMOTE). SMOTE tackled the underrepresentation of the minority class and rebalanced the class distribution for equitability [23]. After resampling, we split the data into training and test sets at a ratio of 80:20, respectively, using the train_test_split function in scikit-learn. We went further to select and rank the performances of the machine learning algorithms using LazyPredict to obtain the weighted average of the F1 and accuracy scores as well as the receiver operating characteristic-area under the curve (ROC-AUC) score. For hyperparameter optimization, we adopted GridSearchCV (https://github.com/oyebolakolapo/Machine-Learning-Prediction-of-Elevated-Blood-Glucose-in-a-Cohort-of-Apparently-Healthy-Adults). The grid search technique constructs many versions of the model with all possible combinations of the supplied hyperparameter values and returns the best one [24]. Subsequently, we determined feature importance using the best performing model to provide insight into which features are most associated with elevated blood glucose level. To operationalize the best performing model at scale, the trained model was stored as a serialized pickle file. Subsequently, we used FastAPI in Google Colab [25] to make an inference call from the model using the predict() function and generated our API. Pyngrok was used to open secure tunnels from a public uniform resource locator (URL) to the local host.
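The core idea behind SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched in plain Python. This is a simplified illustration and not the implementation used in the study, which the text does not name; the function name, the sample points and the default `k` are assumptions.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE).
    Requires at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


if __name__ == "__main__":
    # Hypothetical two-feature minority samples.
    minority = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
    print(smote_like_oversample(minority, n_new=4, k=2))
    # Rebalancing 32 minority cases against 163 majority cases needs this many:
    print(163 - 32)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class.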

Cohort description
Two hundred participants aged 18-83 years were enrolled into the cohort. However, after hemoglobin electrophoresis screening, five individuals were found to possess the HbSS/HbSC genotypes and were excluded from further analysis. Enlisted individuals consisted of 118 females and 77 males (Fig. 2; Additional File 1: Fig. S2).

Machine learning algorithms and evaluation
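Following data cleaning, transformation (Additional File 1: Fig. S8) and observation of a class imbalance in the target variable (Additional File 1: Fig. S9), whereby the raw dataset demonstrated that 163/195 (83.6%) of the participants had normal blood glucose {0} while 32/195 (16.4%) had high blood glucose level {1}, rebalancing was established with SMOTE to yield an even representation of both categories of blood glucose level (Counter({0: 163, 1: 163})). When the performance of each classifier was tested, the reports showed that the Random Forest Classifier (Figs. 9 and 10) gave the best accuracy (Accuracy Score = 0.89; ROC-AUC score = 0.89; F1 Score = 0.89), followed by the Extra Trees (Accuracy Score = 0.88; ROC-AUC score = 0.88; F1 Score = 0.88) and XGB classifiers (Accuracy Score = 0.86; ROC-AUC score = 0.86; F1 Score = 0.86), respectively (Fig. 9B; Table 1).
To determine the importance of each variable (feature) to the outcome (blood glucose level), we carried out random forest feature analysis. The importance of a feature is calculated based on how much the tree nodes that use that feature reduce impurity across all trees in the forest. The key findings showed that uric acid and age were the most important features associated with elevated blood glucose (Fig. 11), followed by systolic blood pressure and body mass index (BMI).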
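The impurity reduction contributed by a single split can be computed by hand. The sketch below assumes Gini impurity as the impurity measure, which the text does not state explicitly; the function names and label lists are hypothetical. A forest-level importance aggregates such decreases over every node that uses a feature, across all trees.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def impurity_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


if __name__ == "__main__":
    parent = [0, 0, 0, 1, 1, 1]
    # A split that separates the classes perfectly removes all impurity.
    print(impurity_decrease(parent, [0, 0, 0], [1, 1, 1]))
    # A weaker split leaves most of the impurity in place.
    print(impurity_decrease(parent, [0, 0, 1], [0, 1, 1]))
```

A feature that repeatedly produces splits like the first one accumulates a large importance score, while one that only yields splits like the second contributes little.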

Discussion
Noncommunicable diseases, such as cancer, cardiovascular diseases, and diabetes, are progressively becoming the primary causes of mortality in sub-Saharan Africa [26]. This epidemiological shift is primarily attributed to limitations in implementing crucial control measures, such as prevention and early detection [1]. This research focused on exploring key clinical indices of NCDs in asymptomatic individuals. The application of machine learning in disease prediction is now well-established for its immense potential in analyzing complex datasets and uncovering patterns that may elude human detection [27-30]. The investigation employed various machine learning algorithms to predict hyperglycemia to enable early identification of individuals at a particular risk of developing diabetes. The study identified suspected hypertension in 21% of study participants, underscoring the urgency of addressing hypertension as a major health challenge in the country. Furthermore, a notable increase in the prevalence of hypertension with advancing age was observed. However, the investigation into hypertension subtypes revealed a dual phenomenon: a pronounced increase in systolic hypertension with age and a concomitant reduction in diastolic hypertension.
Several factors may contribute to the observed age-related increase in systolic hypertension. Physiological changes, alterations in vascular reactivity, and lifestyle factors could play decisive roles in driving the upward trajectory of systolic blood pressure with advancing age [31,32]. In contrast, the age-related reduction in diastolic hypertension may be associated with changes in arterial compliance, heart rate dynamics, or other physiological adaptations over the aging process [33]. Recognizing these dual dynamics holds significant clinical implications, necessitating tailored screening protocols and interventions to address the unique challenges posed by hypertension in different age groups.
Moreover, a gender disparity was observed, with systolic hypertension being more prevalent in females while diastolic hypertension was more common in males. This gender difference may be linked to heart rate variability or hormonal influences, particularly fluctuations in estrogen levels in females. However, understanding how blood vessels respond to changes in pressure and the potential impact on systolic blood pressure would be crucial in deciphering these gender disparities [34-36]. Therefore, tailoring screening protocols and interventions to address the unique challenges posed by hypertension in different age groups and genders is essential to mitigate the overall burden of this condition.
Electrocardiography is a pivotal tool for assessing cardiac health, and its interpretation can provide valuable insights into cardiovascular conditions. Our investigation revealed a remarkable age-dependent pattern in abnormal ECG values, reaching a peak at 70 years. Advancing age often coincides with a myriad of physiological changes, including alterations in cardiac structure and function [37-39]. A comprehensive exploration of these factors is essential for delineating the intricate relationship between aging and abnormal ECG findings.
The global burden of diabetes is well-documented [40-42], but our investigation into supposedly healthy individuals has unearthed a concerning revelation. Despite outward appearances of health, there existed a relatively high prevalence of suspected prediabetes and diabetes in the cohort. This underscores the importance of probing beyond outward health markers to understand the latent metabolic landscape [43-46]. This prompts a reevaluation of health screening protocols to incorporate metabolic parameters in apparently healthy populations. Early detection and intervention strategies should be tailored to encompass metabolic assessments, providing an opportunity for targeted preventive measures and lifestyle modifications.
In the realm of predictive modeling, selecting the most effective machine learning algorithm is paramount. Our study, aimed at evaluating various algorithms, revealed insightful findings regarding their predictive performances. Upon meticulous evaluation, Random Forest emerged as the top-performing algorithm, consistently delivering the highest accuracy among the tested models. The success of the Random Forest algorithm can be attributed to its ensemble learning nature [47,48], which harnesses the collective power of multiple decision trees. This enables robustness against overfitting, enhanced generalization, and effective handling of complex datasets with diverse features. The observed superiority of Random Forest in our study has profound implications for future applications, suggesting its applicability across diverse datasets and underscoring its potential as a reliable choice for achieving high predictive accuracy.
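To investigate the intricate determinants of hyperglycemia, our study employed a robust feature importance analysis, with compelling results showcasing uric acid and age as the most influential predictors. Uric acid's prominence as a predictor of hyperglycemia adds a unique dimension to our understanding of metabolic health. While uric acid is traditionally associated with conditions like gout, our findings suggest a potential link between hyperuricemia and hyperglycemia, urging further exploration into the underlying physiological mechanisms. The identification of age as a key predictor aligns with existing knowledge regarding the age-associated risk of hyperglycemia [48-50]. Our findings reinforce the significance of age as a robust indicator, reflecting the cumulative impact of aging processes on metabolic health and glucose regulation. The recognition of uric acid and age as pivotal predictors holds significant clinical implications. Healthcare practitioners can leverage these findings to enhance risk assessment strategies for hyperglycemia. Incorporating uric acid measurements and age considerations into routine screenings may facilitate early identification of individuals at heightened risk, enabling proactive interventions. While our study sheds light on the importance of uric acid and age, further research is warranted to unravel the intricate relationships and mechanisms underlying these associations. Longitudinal studies exploring the dynamic interplay between uric acid, age and hyperglycemia can deepen our understanding and inform targeted interventions.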

Limitations and Future Direction
While our study provides valuable insights into predicting hyperglycemia using machine learning in undiagnosed individuals, it is essential to acknowledge certain limitations that may impact interpretation. First, the size of our cohort may limit the generalizability of the results. A larger and more diverse sample could enhance the external validity of the predictive model. Furthermore, the study did not account for potential variations in clinical practice, including differences in diagnostic criteria. For instance, the study did not take into consideration orthostatic hypotension, a fall in SBP of at least 20 mmHg or a DBP fall of at least 10 mmHg within three minutes of standing, especially in older individuals [19]. Although seats were provided to participants, we could not accurately document how long participants had been standing before attending the screening. Besides, phenomena such as postprandial hypotension (a reduction in BP after meals, a common cause of syncope and falls in healthy and hypertensive elderly individuals), circadian BP variability, and white-coat (non-sustained) hypertension, especially in the elderly, were not factored into the analyses [51-53]. As such, incorporating standardized criteria across diverse healthcare settings could enhance our model's clinical applicability.
Moreover, the study did not dissect the influence of ethnicity and genetics on hyperglycemia [54,55]. Future research could explore these aspects to provide a more comprehensive understanding of predictive factors. Since the dataset primarily comprises information from a specific geographic location or demographic group, extrapolating the findings to other populations requires caution, as regional variations in lifestyle, genetics, and healthcare practices may influence the performance of the predictive model. In addition, the cross-sectional nature of our study limits our ability to establish causation or assess changes over time. Therefore, longitudinal studies would be beneficial to understand the dynamic nature of hyperglycemia predictors. The model's performance was evaluated on the same dataset used for training, raising the potential for overfitting. External validation on an independent dataset would be crucial to assess its generalizability and reliability in real-world scenarios. Lastly, the importance of a feature in a Random Forest model does not necessarily imply a causal relationship, and other models might find different results if additional features are introduced. Future approaches are expected to accommodate more features and larger datasets. This will account for the deployment of built and containerized models as publicly accessible web apps. Nevertheless, this present study has expounded the potential of machine learning for early disease detection, risk assessment strategies, proactive interventions and targeted therapeutic design.

Conclusions
This study has made a substantial contribution to the expanding domain of predictive modeling and offers promising implications for enhancing early detection and personalized risk assessment, particularly in the context of hyperglycemia and its potential association with diabetes. The research has not only brought to light the prevalence of undiagnosed hypertension, isolated systolic and diastolic hypertension but has also highlighted factors associated with elevated blood glucose within the population. The findings of this study emphasize the significance of regular screening, effective intervention strategies and targeted therapeutic designs. Collectively, the results contribute to the overarching effort to enhance healthcare outcomes through proactive and tailored approaches.

True negatives are the values which were negative and were predicted negative; here, 28 cases were detected. Overall, the weighted average of the accuracy score was 0.89 and the F1 score was 0.89. Precision is a metric that quantifies the accuracy of a classifier by determining the number of correctly identified members of a class divided by all instances where the model predicted that specific class. In the context of hyperglycemia prediction, precision would be the count of accurate predictions of hyperglycemia divided by the total instances where the classifier predicted "hyperglycemia," regardless of correctness. Recall, on the other hand, measures the effectiveness of a classifier in correctly identifying members of a class by dividing the number of correctly identified instances by the total number of actual members in that class. In the hyperglycemia scenario, recall would represent the number of hyperglycemic individuals correctly identified by the classifier divided by the total number of actual hyperglycemic individuals. The F1 score is a composite metric that combines both precision and recall into a single value (their harmonic mean). It provides a concise evaluation of a classifier's performance. A high F1 score indicates that both precision and recall are high, while a low F1 score suggests that one or both metrics are low. This metric is particularly useful for quickly assessing whether a classifier effectively identifies members of a class or if it resorts to shortcuts, such as indiscriminately classifying everything as a member of a larger class.
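The precision, recall and F1 definitions described in the text can be checked with a few lines of code. This is a minimal sketch: the function name is hypothetical and the confusion-matrix counts are invented for illustration, not taken from the study's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    tp: hyperglycemic cases predicted as hyperglycemic
    fp: non-hyperglycemic cases predicted as hyperglycemic
    fn: hyperglycemic cases predicted as non-hyperglycemic
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p, r, f1 = precision_recall_f1(tp=30, fp=5, fn=3)
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a classifier that labels everything as the larger class scores zero recall on the smaller class and therefore a zero F1 for it, even if its overall accuracy looks acceptable.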
adults living in Ijede Community, Lagos, Nigeria. Ethical approval was obtained from the Institutional Review Board of the Nigerian Institute of Medical Research (IRB/21/074). Following informed consent, participants were recruited and 10 ml of venous blood samples were collected per individual. Demographic information, body mass index (BMI), knowledge, attitude and practices were obtained from the participants. The study clinician further clerked participants for personal and family medical history as well as their smoking status. Exclusion criteria included pregnancy at the time of recruitment, placement on antihypertensive or antidiabetic chemotherapy, radiotherapy, current or previous hematologic or tumoral diseases and known chronic diseases. Participants underwent electrocardiogram (ECG) screening (SonoHealth, USA) to provide clues on heart defects or other heart-related problems. Hemoglobin electrophoresis was conducted to detect possible hemoglobinopathy in the participants [17]. In addition, random blood glucose concentrations (Guilin Royalze, China) and blood pressure (BP) values (Iston Mediq, USA) were determined to evaluate the presence or absence of prediabetes, diabetes, prehypertension (preHTN) or hypertension (HTN) onset in the participants. Individuals with screening tests outside normal ranges were advised to visit their healthcare specialists for further checks. Normal BP was described as systolic blood pressure (SBP) <120 mmHg and diastolic blood pressure (DBP) <80 mmHg. Elevated BP was defined as SBP of 120-129 mmHg and DBP <80 mmHg, stage 1 hypertension (preHTN) as SBP ≥ 130-139 mmHg and DBP 80-89 mmHg and stage 2 HTN as SBP ≥140 and DBP ≥

Fig. S8) and observation of a class imbalance in the target variable (Additional File 1: Fig. S9), whereby the raw dataset demonstrated that 163/195 (83.6%) of the participants had normal blood glucose {0} while 32/195 (16%) had high blood glucose level {1}, rebalancing was established with SMOTE to yield an even representation of both categories of blood glucose level (Counter ({0: 163, 1: 163}). When the performance of each classifier was tested, the reports showed Random Forest Classifier (Figs. 9 and 10) gave the best accuracy (Accuracy Score = 0.89; ROC-AUC score = 0.89; F1 Score = 0.89) followed by Extra Trees (Accuracy Score = 0.88; ROC-AUC score = 0.88; F1 Score = 0.88) and XGB classifiers (Accuracy Score = 0.86; ROC-AUC score = 0.86; F1 Score = 0.86), respectively (Fig. 9B; Table

as pivotal predictors holds significant clinical implications. Healthcare practitioners can leverage these findings to enhance risk assessment strategies for hyperglycemia. Incorporating uric acid measurements and age considerations into routine screenings may facilitate early identification of individuals at heightened risk, enabling proactive interventions. While our study sheds light on the importance of uric acid and age, further research is warranted to unravel the intricate relationships and mechanisms underlying these associations. Longitudinal studies exploring the dynamic interplay between uric acid, age and hyperglycemia can deepen our understanding and inform targeted interventions.

understand the dynamic nature of hyperglycemia predictors. The model's performance was evaluated on the same dataset used for training, raising the potential for overfitting. External validation on an independent dataset would be crucial to assess its generalizability and reliability in real-world scenarios. Lastly, the importance of a feature in a Random Forest model does not necessarily mean a causal relationship and other models might find different results if additional features are introduced. Future approaches are expected to accommodate more features and larger datasets. This will account for the deployment of built and containerized models as publicly accessible web apps. Nevertheless, this present study has expounded the potential of machine learning for early disease detection, risk assessment strategies, proactive interventions and targeted therapeutic design.

Figure 7: Blood glucose levels in the cohort

Figure 8: Correlation matrix of independent variables with the outcome variable

Figure 9: Accuracy scores of machine learning classifiers (A) before class rebalancing with SMOTE and (B) after class rebalancing with SMOTE