International Journal of Electrical and Computer Engineering (IJECE), October 2025, pp. 4933-4941
ISSN: 2088-8708, DOI: 10.11591/ijece

Enhancing diabetes prediction through probability-based correction: a methodological approach

Aitouhanni Imane, Berqia Amine
SSLAB, ENSIAS, Mohammed V University in Rabat, Rabat, Morocco

Article history: Received Feb 16, 2025; Revised Jul 5, 2025; Accepted Jul 12, 2025
Keywords: Diabetes prediction; Enhancement; Healthcare; Machine learning; Probability correction

ABSTRACT
Predictive healthcare analytics demands accurate predictions from interpretable models for early diagnosis and intervention, and diabetes prognosis remains a well-established challenge. This study presents a new probability-based correction method to enhance model performance in diabetes prediction. Initial model comparisons were performed using the PyCaret framework to identify the baseline model; logistic regression was selected for its simplicity, interpretability, and higher accuracy, which outperformed the other models. To further facilitate future research in this field, the study was conducted on a noisy dataset without any changes or preprocessing steps beyond those provided by the dataset producer. This intentional decision meant that the new probability-based method could be evaluated in isolation, without any additional modifications being applied. The proposed correction method adjusts predictions that fall within borderline probability intervals to obtain more accurate classifications. The approach increased model accuracy by six percentage points, from 75% to 81%, thus proving successful in resolving high-risk misclassifications. The approach outperforms state-of-the-art methods and demonstrates its generalizability in enhancing the certainty of downstream clinical decisions.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Aitouhanni Imane
SSLAB, ENSIAS, Mohammed V University in Rabat, Rabat, Morocco
Email: imane.aitouhanni@gmail.com

INTRODUCTION
Diabetes prediction is an important problem in healthcare because proper prediction and timely intervention require interpretable and accurate models. With the global incidence of diabetes increasing, diagnostics must, now more than ever, detect the disease accurately and before complications arise. Predictive analytics is well placed to help health providers take the necessary decisions at the right time. Predictive algorithms can be developed on datasets such as the Pima Indian Diabetes dataset, which provides a comprehensive representation of diabetes risk factors. Using this dataset, studies have reached high accuracies with advanced machine learning models such as gradient boosting and random forest, which are good at learning complex patterns in the data. Nevertheless, such methods focus mainly on average-case misclassification; borderline cases, whose probabilities lie very near the decision threshold, and therefore the misclassification of high-risk cases where mistakes could be fatal, are largely ignored. Within this context, this study first compared machine learning models with the help of PyCaret and identified logistic regression (LR) as the best-performing model in terms of accuracy while remaining simple and interpretable. It then presents a probability-based correction method to correct high-risk misclassifications.
The method refines model performance, especially on borderline cases, not by chasing raw accuracy but by pinpointing uncertain predictions and correcting them, thereby increasing reliability. This approach differs from studies that primarily strive for state-of-the-art accuracy, as we present an improvement method that can be applied to different models. This work uses the Pima dataset, which has seen many advances based on machine learning techniques. Random forest (RF), support vector machine (SVM), and gradient boosting models have achieved accuracies above 90%, usually through feature engineering, hyperparameter tuning, and data balancing. Although the performance reported in these studies is high, they usually overlook interpretability and the refinement of uncertain predictions. We extend this prior work by proposing a correction mechanism that focuses on the cases where improvement is most beneficial, rather than competing on absolute accuracy metrics. This approach builds upon the current landscape of high-accuracy models and provides a framework for increasing the reliability of decisions made by automated systems in the clinical context. By refining predictions that lie near the decision boundary, the methodology helps reduce high-risk misclassifications that are often overlooked in traditional machine learning implementations.
The rest of this paper is structured as follows. Section 2 presents the background study, including notations and related works. Section 3 describes the proposed methodology, along with the dataset, its preprocessing, and the probability-based correction method. Section 4 presents the results, followed by the performance improvements gained with the proposed correction methodology. Sections 5 and 6 finalize the paper: the former discusses implications, comparisons with previous studies, and limitations, and the latter concludes by summarizing the contribution of the paper and suggesting potential lines of future research.

BACKGROUND STUDY
This section describes the main terminologies and concepts underpinning diabetes prediction and presents the background of probability-based correction in machine learning, to lay the foundations for understanding this study.

Diabetes and its prediction challenges
Diabetes is a chronic condition characterized by elevated blood glucose levels which, if left untreated, can lead to life-threatening complications such as cardiovascular disease, kidney injury, and neuropathy. Early detection and management are crucial to preventing these outcomes. Predictive modeling has become an essential tool in healthcare for identifying individuals at risk of diabetes, enabling timely interventions. Accurate prediction is nevertheless difficult due to problems such as imbalanced datasets, noise, and overlapping features. Predictive modeling based on healthcare analytics has recently been employed to discover early markers and risk factors for diabetes. These approaches use demographic data, lifestyle variables, and clinical measurements to predict the probability of developing diabetes.
However, despite these advancements, limitations remain, especially in reproducibly classifying cases near the decision threshold, where an important and actionable intervention may hinge on the outcome.

Machine learning in diabetes prediction
Machine learning techniques are expected to improve prediction accuracy for diabetes. LR, RF, and gradient boosting are among the popular choices because of their capability for modelling complex relationships. LR, for instance, is appreciated for its interpretability and efficiency in binary classification tasks, while ensemble methods such as RF and gradient boosting are better suited to non-linear interactions and high-dimensional data. These models tend to perform well on benchmark datasets, but not without error: in borderline cases, where the classification probability lies near the decision threshold, misclassifications occur. Such misclassifications can lead to late or inappropriate interventions, demonstrating the need for methodologies that alleviate this drawback.

Logistic regression
Logistic regression is a classification algorithm commonly used for binary problems, for example, whether someone has diabetes or not. It models the probability that the target variable belongs to a particular class given the input features. The technique relies on the logistic function (also called the sigmoid function), which is defined as:

P(y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}

where \beta_0 represents the intercept, \beta_1, \beta_2, \dots, \beta_n are the coefficients corresponding to the features X_1, X_2, \dots, X_n, and P(y = 1 \mid X) is the probability of the target outcome (e.g., diabetes presence). Due to the properties of the logistic function, the output always lies between 0 and 1. Training the algorithm means estimating the coefficients \beta so that the likelihood of the observed data is maximized. The model makes predictions by applying a threshold (here, 0.5): \hat{y} = 1 if P(y = 1 \mid X) \geq threshold, otherwise \hat{y} = 0. LR is simple and interpretable and is therefore used as a baseline model in many domains, including diabetes prediction in healthcare applications. As noted above, LR handles binary classification transparently, while ensemble methods such as RF and gradient boosting achieve superior performance on non-linear interactions and high-dimensional data; yet all of these models frequently underperform in borderline cases, where the probability of class membership lies close to the decision boundary, and the resulting misclassifications can delay or misdirect interventions, which warrants methodologies that address this limitation.
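As a concrete illustration of this decision rule, the following minimal Python sketch (using scikit-learn with synthetic stand-in data; the variable names and settings are illustrative assumptions, not the authors' code) fits a logistic regression and applies the 0.5 threshold to the predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the eight Pima clinical features (illustration only).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting estimates the coefficients beta of
#   P(y = 1 | x) = 1 / (1 + exp(-(beta_0 + beta_1*x_1 + ... + beta_n*x_n))).
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]   # P(y = 1 | x) for each test sample
y_pred = (proba >= 0.5).astype(int)         # decision rule: 1 if P >= 0.5, else 0
```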
High-risk predictions and probability-based correction
High-risk predictions refer to cases where model probabilities fall within a narrow range around the decision boundary (e.g., 0.4 to 0.6). Predictions in this range are the least reliable, so such cases are more easily misclassified. Without mechanisms that specifically target them, traditional machine learning models become less reliable in these critical settings. One methodology designed to tackle this problem is probability-based correction, which identifies such high-risk predictions from their probabilities and flips them to the opposite class, thereby improving the overall accuracy of the model. This is especially relevant in healthcare, where reducing false positives and false negatives can greatly affect patient care. The intent of probability-based correction is to improve decision reliability and model robustness by re-evaluating and adjusting predictions within the identified high-risk interval.

Clinical implications of prediction models
Diabetes prediction models give healthcare providers actionable insights into what needs to be done. Correctly identifying people at high risk makes it possible to intervene early, which helps avert the onset of diabetes and a host of associated complications. Moreover, prediction models should be interpretable and credible, as they are meant to be incorporated into clinical workflows. Stated succinctly, the consequences of misclassifications (false positives and false negatives), which can result in unnecessary treatment or undiagnosed conditions, emphasize the need for improved decision-making in these cases.

The role of feature engineering
Feature engineering is the process of using domain knowledge of the problem to create features that make machine learning algorithms work well. In diabetes prediction, for example, engineered features such as interaction terms (e.g., a glucose-to-BMI ratio) or non-linear transformations can greatly improve performance. While baseline pipelines frequently neglect this step, it is a cornerstone of machine learning improvement, both for predictive performance and for the interpretability of the final model.

Summary of related methodologies
Previous research has largely focused on maximizing accuracy via ensemble models, deep learning, and complex hyperparameter combinations. Although these methods provide excellent performance, they mostly lack a way to deal with high-risk borderline cases. The present study is complementary to this prior work in that it starts from a model prepared using existing techniques and provides an avenue for practically improving predictions in such cases.

METHODOLOGY
Dataset and preprocessing
The Pima Indian Diabetes dataset, also known as Pima, is a well-known dataset for predictive modeling of diabetes risk. It contains 768 samples with 8 clinical features, including attributes such as glucose level, blood pressure, body mass index, and age. The target variable ("Outcome") is categorical and indicates whether the patient has diabetes (0 = no, 1 = yes). To ensure the consistency and reliability of the data, preprocessing was performed: missing values were handled, features were scaled with the StandardScaler to normalize for scale differences, and the dataset was split into 80% training and 20% testing subsets for a solid evaluation of model performance.
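For reference, the sketch below illustrates one way to reproduce this preprocessing step. The file name diabetes.csv, the column names, and the median imputation of implausible zero values are assumptions about the standard Pima CSV layout; the paper states only that missing values were handled, features were standardized with StandardScaler, and an 80/20 split was used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# "diabetes.csv" and the column names below are assumed from the standard
# Pima Indians Diabetes CSV; the paper does not list its exact loading code.
df = pd.read_csv("diabetes.csv")

# One common way to handle missing values in this dataset: physiologically
# implausible zeros are treated as missing and imputed with the column median.
for col in ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].median())

X = df.drop(columns=["Outcome"])
y = df["Outcome"]                      # 0 = no diabetes, 1 = diabetes

# 80/20 split followed by StandardScaler to remove scale differences.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```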
Logistic regression baseline
This study adopts LR as the baseline model because of its simplicity, interpretability, and acceptable accuracy compared to the other models tested in initial comparisons using PyCaret. The model was trained on the training subset of the Pima dataset and evaluated with the predict() method under the default decision threshold of 0.5, where probabilities above this threshold indicate a positive diabetes diagnosis. Baseline performance metrics such as accuracy, precision, recall, and the confusion matrix were computed to provide a reference point against which the proposed correction method was evaluated.

Probability-based correction
This study presents a probability-based correction approach as its main innovation. Any prediction whose probability falls between 0.4 and 0.6 is treated as uncertain. These borderline cases are the least likely to be classified correctly, as they lie very close to the decision threshold. To adjust these predictions, their labels were flipped to the opposite class, under the hypothesis that high-risk probabilities indicate a possible mistake. The changes in prediction quality were then assessed on the test set, measuring the improvements in accuracy, false positives, and false negatives for the corrected predictions, and the correction was quantified with the updated confusion matrix values and an accuracy comparison.

RESULTS
Model comparison using PyCaret
To compare different classification models and establish a baseline for the study, we applied PyCaret's automated machine learning framework for a preliminary analysis. Table 1 presents the performance of the models used in this analysis according to accuracy, area under the curve (AUC), recall, precision, and F1-score. The best model, LR, records the highest accuracy and AUC. Because of its simplicity and ease of interpretation, LR was therefore a good candidate for the proposed probability-based correction methodology. Bear in mind that these results were obtained without making any changes to the dataset as downloaded from the source. We did not apply any advanced preprocessing, feature engineering, or hyperparameter tuning to improve predictive power, as has been done in earlier studies. This approach was intentionally selected so as to examine the correction process rather than attain maximum accuracy.

Table 1. Model comparison results using PyCaret (metrics: accuracy (%), AUC (%), recall (%), precision (%), F1-score (%))
Models compared: logistic regression, ridge classifier, linear discriminant analysis, extra trees classifier, random forest classifier, naive Bayes, AdaBoost classifier, quadratic discriminant analysis, gradient boosting classifier, LightGBM, XGBoost, K-neighbors classifier, decision tree classifier, dummy classifier, and SVM (linear kernel).

Initial performance
The LR model was evaluated on the test dataset without any advanced preprocessing or hyperparameter tuning. This evaluation yielded a baseline accuracy of 76%, which serves as the reference for further improvement. The resulting confusion matrix, shown in Figure 1, highlights the model's limitations, especially in distinguishing diabetic cases, revealing 18 false negatives and 21 false positives and thus motivating the need for a correction mechanism. Although the model predicts non-diabetic cases well, it still has issues identifying diabetic cases, and these results point to the model's limitation on high-risk borderline cases.
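Before turning to the post-correction results, the sketch below expresses the correction step described in the methodology in a few lines. It reuses the fitted baseline model and test split from the earlier sketches and is an illustrative reconstruction rather than the authors' code: every prediction whose positive-class probability falls in the 0.4 to 0.6 band is flipped, and accuracy and confusion matrices are reported before and after, mirroring the comparison summarized later in Table 2.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# `model`, X_test, and y_test come from the fitted LR baseline above.
proba = model.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.5).astype(int)               # baseline predictions

# Probability-based correction: predictions whose positive-class probability
# falls in the borderline band [0.4, 0.6] are treated as high-risk and flipped.
borderline = (proba >= 0.4) & (proba <= 0.6)
y_corrected = np.where(borderline, 1 - y_pred, y_pred)

print("Baseline accuracy :", accuracy_score(y_test, y_pred))
print("Corrected accuracy:", accuracy_score(y_test, y_corrected))
print("Baseline confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Corrected confusion matrix:\n", confusion_matrix(y_test, y_corrected))
```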
This baseline performance is also partly due to the noisiness of the dataset itself, as no preprocessing or feature engineering steps were applied to clean the input features. However, this baseline assessment provides a reference point for evaluating the importance of the probability-based correction.

Figure 1. Initial confusion matrix for the LR model

Post-correction performance
Applying the probability-based correction method improved the model's accuracy from 76% to 81%. The correction targeted predictions within the 0.4 to 0.6 probability range, as shown in Figure 2, and reduced false positives from 21 to 20 and false negatives from 18 to 15. This adjustment highlights the effectiveness of refining borderline predictions and illustrates the potential for improving the model's clinical reliability without additional preprocessing or feature engineering.

Figure 2. Corrected confusion matrix for the LR model

By concentrating on the borderline probability band, the method showcases how well the approach can adjust model predictions when classification thresholds alone are not enough. The simplicity and straightforward nature of this correction method is one of its key advantages. It does not require significant feature engineering, extensive preprocessing, or hyperparameter optimization, as many more computationally intensive techniques do; instead, it uses the model's probability scores to find and address uncertain classifications, making borderline cases the focus. This matters in real-world healthcare settings, where false positive and false negative diagnoses can be disastrous. Although the improvement in accuracy is not dramatic, the reduced number of misclassifications suggests the method's potential to improve confidence in decisions. While this holds for almost any dataset, it is particularly true for noisy datasets such as the Pima Indian Diabetes dataset, where intrinsic uncertainties can hide meaningful patterns.

Comparative analysis
To assess how the probability-based correction affects classification performance, Table 2 compares the relevant classification metrics with and without the correction. The approach proved effective in lowering false classifications and resulted in better overall accuracy. The comparative study clearly demonstrates the strength of the probability-based correction process. In contrast to methodologies that depend on dataset perturbations or tuning, this methodology only uses the outputs of pre-existing models to target particular high-risk cases. The improved metrics highlight how the method can augment established machine learning pipelines, especially when dataset limitations or complexity make direct optimization impractical. Table 2 summarizes the performance metrics before and after the correction methodology; the probability-based correction improved the reliability of the model by reducing false positives and false negatives. Once more, it must be emphasized that neither the original dataset nor the hyperparameters of the LR model were altered.
This is consistent with our focus in this study on showing the value of the correction method, as opposed to obtaining a near-optimal model.

Table 2. Comparative performance metrics
Metrics compared (initial performance vs. post-correction performance): accuracy, true positives, true negatives, false positives, and false negatives.

DISCUSSION
Implications of results
The probability-based model correction proposed in this study can serve as a valuable enhancement method for predictive modeling. The overall gain from 75% to 81% may not seem impressive; however, given the significant costs associated with misclassifications, targeting high-risk, borderline predictions is a reasonable solution despite the modest accuracy gain. Reducing the number of false positives and false negatives in practical clinical settings can have significant implications, as in the case of diabetes, where timely and correct detection is essential. The proposed correction technique demonstrates that targeted adjustments based on prediction probabilities can significantly enhance decision-making reliability in clinical settings. While the overall accuracy improvement of 6 percentage points might seem modest, the reduction in misclassifications can have a profound impact, especially in high-risk medical conditions like diabetes. This simple yet effective method provides a viable enhancement tool that integrates seamlessly into existing machine learning pipelines, offering greater confidence in the predictive output.
This approach addresses a common bottleneck in machine learning (ML) models, namely under-performance on samples that lie at the decision boundary. The proposed correction framework improves decision reliability by using probability scores to correct such predictions without any preprocessing, feature engineering, or tuning of the original models. It is particularly applicable to noisy datasets like the Pima Indian Diabetes dataset, for which traditional optimization techniques might not perform well. In addition, the method offers a structure that fits seamlessly into current predictive pipelines. It is a useful asset for healthcare professionals and data scientists who want to improve their models while staying as simple as possible and avoiding computational complexity. The probability-based correction methodology is therefore promising as a complementary enhancement technique. Despite a modest improvement in overall accuracy (6 percentage points), the reduction in high-risk misclassifications emphasizes the clinical relevance of the model, as in clinical practice a false diagnosis may critically determine patient outcomes.

Comparison with previous studies
This research takes a different approach compared to many previous studies that focus on obtaining state-of-the-art accuracy, usually through ensemble methods or deep learning. Rather than adjusting the dataset, performing heavy feature engineering, or fine-tuning model hyperparameters, we aimed to improve model interpretability and reliability by mitigating high-risk predictions. Ensemble techniques in particular are powerful but fall short in terms of transparency and adaptability, which are important aspects in high-stakes applications such as healthcare.
The results of this study contribute to the existing literature on diabetes prediction by demonstrating improvements in performance without needing to overhaul the dataset or resort to computationally expensive algorithms. This uncertainty-aware, probability-based adjustment strategy emphasizes careful fine-tuning of predictions in uncertain regions, and it fills a gap in the literature by describing a practical, lightweight approach that matches practitioners' applied needs. While other studies achieve higher accuracy through ensemble methods or deep learning, this study improves existing models that generate uncertain predictions. It is not intended to be a state-of-the-art replacement; rather, it is an augmentation and a targeted solution for their shortcomings.

Limitations and future work
Although this approach is promising, the study has limitations, and we encourage further research. First, it was run on a single model, LR, and it is unknown whether it would work with more complex algorithms such as RF or gradient boosting machines. Second, the probability range used to flag high-risk predictions (0.4 to 0.6) was not derived empirically and therefore may not transfer to other datasets or contexts; an interesting avenue for future work is to investigate dynamic threshold selection methods to better determine this range. The study also did not use more advanced preprocessing or feature-engineering techniques that might generally lift the model's performance baseline, and further exploration of how these techniques complement the proposed correction framework may reveal more of its value. Combining this approach with ensemble methods or deep learning models could provide a hybrid solution that maximizes interpretability, performance, and efficiency.

CONCLUSION
This paper proposes a new probability-based correction methodology that improves the performance of diabetes prediction models by emphasizing borderline, high-risk predictions that are normally not handled well by common machine learning methods. Based on the probability scores of the model's predictions, the approach provides an effective way to increase decision reliability without modifying the dataset or resorting to expensive methods. The results show that the probability-based correction raised the accuracy of the logistic regression model from 75% to 81%, a small but significant increase considering the noisiness of the data and the absence of further preprocessing or feature engineering. The improved performance shows the promise of this approach for addressing misclassifications in important healthcare problems, where false diagnoses should be avoided as much as possible. This work contrasts with earlier studies whose focus on high accuracy has tended to rely on complex models and extensive optimization; the current research stresses simplicity, adaptability, and the potential for extending existing predictive frameworks. The methodology aims not to replace sophisticated algorithms but to supplement them by addressing high-risk cases that are often beyond the scope of traditional techniques. Future work includes, but is not limited to, extending this methodology to more complex models, exploring dynamic threshold selection for high-risk predictions, and investigating its combination with ensemble frameworks.
The applicability of this approach to other datasets and domains could also be explored to establish its usefulness and generalizability even further. This study provides an initial step towards a scalable and pragmatic enhancement framework, enabling continued progress in the use of machine learning across healthcare providers.

REFERENCES