Journal of Soft Computing Exploration
Homepage: https://shmpublisher.com/index.php/joscex
p-ISSN: 2746-7686, e-ISSN: 2746-0991

Comparison of GridSearchCV and Bayesian hyperparameter optimization in random forest algorithm for diabetes prediction

Rini Muzayanah1*, Dwika Ananda Agustina Pertiwi2, Muazam Ali3, Much Aziz Muslim4
1,4 Department of Computer Science, Universitas Negeri Semarang, Indonesia
2,4 Faculty of Technology Management, Universiti Tun Hussein Onn Malaysia, Johor 86400, Malaysia
3 Faculty of Management Science, HITEC University Taxila, Pakistan

ABSTRACT
Diabetes Mellitus (DM) is a chronic disease whose complications have a significant impact on patients and the wider community. In its early stages, diabetes mellitus usually does not cause significant symptoms, but if it is detected too late and not handled properly, it can cause serious health problems. Early detection of diabetes is one solution to this problem. In this research, diabetes detection was carried out using Random Forest with GridSearchCV and Bayesian hyperparameter optimization. The research was carried out through the stages of literature study, model development using Kaggle Notebook, model testing, and results analysis. This study aims to compare GridSearchCV and Bayesian hyperparameter optimization, then analyze the advantages and disadvantages of each when applied to diabetes prediction using the Random Forest algorithm. The research found that each optimization has its own advantages and disadvantages. GridSearchCV excels in terms of accuracy (0.74), although it takes longer (338.416 seconds). On the other hand, Bayesian hyperparameter optimization has an accuracy 0.01 lower than GridSearchCV (0.73), but takes less time (177.085 seconds).
Article Info
Article history:
Received November 25, 2023
Revised December 19, 2023
Accepted January 24, 2024

Keywords:
Function point analysis
Use case diagrams
Software effort estimation
Adjusted function points
Estimation accuracy

This is an open access article under the CC BY-SA license.

Corresponding Author:
Rini Muzayanah
Department of Computer Science, Universitas Negeri Semarang, Sekaran, Kota Semarang, Jawa Tengah, Indonesia
Email: rinimuzayanah0415@students.
https://doi.org/10.52465/joscex.

A SHM Publisher
muzayanah et al. / J. Soft Comput. Explor., Maret 2024: 81-86

INTRODUCTION
Diabetes Mellitus (DM) is a chronic disease characterized by hyperglycemia due to impaired insulin secretion, impaired insulin action, or both. In Indonesia, 133 million people are reported to be living with diabetes mellitus, and 87.5% of them suffer from uncontrolled glycemia. Cumulative evidence suggests that long-term glycemic control is a major risk factor for the development of micro- and macrovascular complications in diabetic patients. Diabetes and its complications have a significant impact on patients and society at large: diabetes increases costs in the health care system and reduces the life expectancy and quality of life of the population. In its early stages, diabetes mellitus usually does not cause significant symptoms, but if it is detected too late and not handled properly, it can cause serious health problems such as heart attack, blindness, kidney failure, limb amputation, and even death. Early diagnosis of this disease can significantly improve the patient's quality of life.

Research on the prediction of diabetes has previously been carried out with the Random Forest method and hyperparameter tuning. That research produced accuracy, F1-score, precision, recall, and specificity of 88.61%, 75.68%, 100%, 60.87%, and 100% respectively. In that experimental analysis, the accuracy rate of 88.61% was achieved when the 'n_estimator' value was 5, with the parameter range tested between 1 and 50; the same experiment was carried out with the min_sample_leaf value. One approach to improving the outcome of any classifier is to tune its hyperparameters. Hyperparameters are the parameters set by the data analyst before the training process, independent of that process. Hyperparameter optimization aims to select, for a particular model, the hyperparameters that produce the best performance from the model being built. Hyperparameter optimization algorithms can optimize discrete, ordinal, and continuous variables, but must simultaneously choose which variables to optimize. Various approaches are available, for example GridSearchCV and Bayesian optimization.

This study compares the accuracy of diabetes prediction using the random forest algorithm with GridSearchCV and Bayesian hyperparameter optimization. The purpose of this research is to find out the strengths and weaknesses of each optimization when applied to predicting diabetes with the Random Forest algorithm, where a target value of 1 indicates a high probability of developing diabetes and a target value of 0 indicates a low probability.

METHOD
The research was carried out through the stages of literature study, model development using Kaggle Notebook, model testing, and results analysis. The research stages can be seen in Figure 1.

Figure 1. Research stages

Dataset
This study used data from the National Institute of Diabetes and Digestive and Kidney Diseases, accessed through the Kaggle platform. The data describe female patients at least 21 years old and contain 8 predictor variables, namely Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age, plus the target variable Outcome.
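To make the dataset's structure concrete, the sketch below builds a small frame with the same schema. The three rows are illustrative sample values, and the filename in the comment is an assumption; in practice the full dataset would be read from the Kaggle CSV.

```python
import pandas as pd

# The 8 predictor variables plus the Outcome target described above.
columns = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# Illustrative sample rows; in practice the full dataset would be loaded
# with e.g. pd.read_csv("diabetes.csv") (filename assumed).
df = pd.DataFrame([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1,  85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64,  0, 0, 23.3, 0.672, 32, 1],
], columns=columns)

X = df.drop(columns="Outcome")  # features
y = df["Outcome"]               # 1 = high, 0 = low probability of diabetes
print(X.shape, list(y))
```

Note the zero entries in SkinThickness and Insulin: rows with zero values are exactly what the pre-processing stage below checks for.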
Pre-processing Stage
Before the data are used, they go through a pre-processing stage in which rows containing zero values are checked, after which normalization is carried out. The purpose of normalization is to bring variables measured in different units onto a common scale, so that the data become well structured without repetition.

GridSearchCV Implementation
This stage applies GridSearchCV hyperparameter tuning to find the parameters that produce the most optimal performance for the model to be developed. After tuning, the parameters identified as optimal are stored for later use in the model development process. The next stage builds a prediction model with the Random Forest algorithm using the optimal parameters resulting from the tuning process. Grid search is an approach to parameter tuning that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.

Bayesian Implementation
This stage applies Bayesian hyperparameter tuning to find the parameters that produce the most optimal performance for the model to be developed. As with GridSearchCV, the parameters identified as optimal are stored and then used to build a prediction model with the Random Forest algorithm. Bayesian optimization can be costly, especially when the model is learned over a large volume of data.

Modeling with Random Forest
Random forest is an ensemble learning algorithm proposed by Breiman, and it is a widely used machine learning method with high prediction accuracy.
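The normalization, GridSearchCV tuning, and Random Forest modeling stages above can be wired together as in the sketch below. This is a hedged illustration, not the paper's exact setup: the synthetic data, the 80/20 split, and the parameter grid are all assumptions, since the paper does not list its grid.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8-feature diabetes data.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pipeline: the scaler is fit on the training data only and reused to
# transform the test data before prediction.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Illustrative grid; the paper's exact ranges are assumptions here.
param_grid = {
    "rf__n_estimators": [10, 50, 100],
    "rf__min_samples_leaf": [1, 2, 4],
}

# Exhaustively evaluate every combination with 3-fold cross-validation,
# timing the whole search (the analogue of the paper's fit time).
start = time.perf_counter()
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
fit_time = time.perf_counter() - start

print("best parameters:", search.best_params_)
print(f"fit time: {fit_time:.3f} s")
print(classification_report(y_test, search.predict(X_test)))
```

The stored `best_params_` correspond to the "most optimal parameters" the paper reuses when building its final Random Forest model.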
Random forest is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. A random forest is a classifier consisting of a set of tree-structured classifiers in which each tree casts a unit vote for the most popular class for input x. The random forest is considered one of the most successful ensemble algorithms in machine learning; it builds a large number of random trees and then makes predictions based on an average of the resulting predictions. How the random forest algorithm works can be seen in Figure 2.

Figure 2. Flowchart of the random forest algorithm

Hyperparameter Tuning

GridSearchCV
GridSearchCV is part of the scikit-learn module; it validates more than one model, supplying each hyperparameter combination automatically and systematically. Grid search is used to find the parameters that produce the most optimal performance for the model to be developed. Grid search theoretically finds the optimal combination of parameters by exhaustive enumeration, but the amount of computation it requires increases exponentially as the parameter dimension increases.

Bayesian
Bayesian optimization is a very effective algorithm for globally optimizing unknown functions. It builds a probabilistic model of the objective function and uses it to select the hyperparameters at which to evaluate the true objective function.

Results Analysis
The results of diabetes prediction using Random Forest with GridSearchCV hyperparameter optimization are compared with those obtained with Bayesian hyperparameter optimization. In this study, comparisons were made using prediction time and accuracy as benchmarks.
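The Bayesian approach described above can be made concrete with a minimal, self-contained sketch. The paper does not name its Bayesian tuning library, so this hand-rolled loop on synthetic data is purely illustrative: a Gaussian-process surrogate models cross-validated accuracy as a function of n_estimators, and an expected-improvement criterion picks the next value to evaluate.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; only n_estimators is tuned to keep the sketch small.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

def objective(n_estimators):
    """Cross-validated accuracy of a Random Forest with the given tree count."""
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    return cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean()

candidates = np.arange(5, 105, 5).reshape(-1, 1)  # discrete search space
tried = [5, 100]                                   # initial design points
scores = [objective(n) for n in tried]

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(5):
    # Probabilistic surrogate model of the objective over the search space.
    gp.fit(np.array(tried).reshape(-1, 1), scores)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Expected improvement over the best score observed so far.
    best = max(scores)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (mu - best) / sigma
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    ei[sigma == 0.0] = 0.0

    nxt = int(candidates[np.argmax(ei), 0])
    if nxt not in tried:
        tried.append(nxt)
        scores.append(objective(nxt))

best_n = tried[int(np.argmax(scores))]
print("best n_estimators:", best_n, "cv accuracy:", round(max(scores), 3))
```

In practice a ready-made tool such as scikit-optimize's BayesSearchCV plays this role over the full parameter space; the loop above only shows the surrogate-plus-acquisition idea that distinguishes Bayesian search from exhaustive grid search.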
Model performance analysis is carried out using several evaluation metrics, including accuracy, precision, recall, and F1-score. Evaluation analyzes how well the model classifies, so that it can later be used to help predict whether diabetes is detected or not.

RESULTS AND DISCUSSIONS
Before the research was carried out, feature analysis was performed to find out whether there were collinearity problems between columns. The analysis used a heatmap, shown in Figure 3.

Figure 3. Correlation matrix

From the figure, the Pregnancies and Age, SkinThickness and Insulin, and SkinThickness and BMI columns show a significant level of dependency. The diabetes data were divided into training and test data with a test size of 0. A parameter search was then run over the random forest classifier to find the set of hyperparameters that is optimal in terms of classification accuracy. The param_grid dictionary defines the range of hyperparameter values searched for the Random Forest classifier. GridSearchCV is a traditional brute-force method that searches the hyperparameter space for the best-tuned model; a GridSearchCV object is created from a Random Forest Classifier object. The next stage is a pipeline that uses the StandardScaler transformer to standardize the training data before training a model and to standardize the test data before making predictions. Finally, the pipeline's fit function trains each model in the hyperparameter grid and finds the best fit. The Random Forest classification report with GridSearchCV optimization can be seen in Table 1.

Table 1. Random forest classification report with GridSearchCV optimization
(Precision, Recall, F1-Score, and Support for Accuracy, Macro Average, and Weighted Average)

From Table 1, the highest F1-score, precision, and recall were 0.82, 0.75, and 0.92 respectively. From the classification report, the resulting accuracy is 0.74. The time needed to detect diabetes using GridSearchCV optimization is 338.416 seconds.

Bayesian optimization builds a probabilistic model of the objective function and uses it to select the hyperparameters at which to evaluate the true objective function. The Random Forest classification report with Bayesian optimization can be seen in Table 2.

Table 2. Random forest classification report with Bayesian optimization
(Precision, Recall, F1-Score, and Support for Accuracy, Macro Average, and Weighted Average)

From Table 2, the highest F1-score, precision, and recall were 0.82, 0.74, and 0.90 respectively. From the classification report, the resulting accuracy is 0.73. The time needed to detect diabetes using Bayesian optimization is 177.085 seconds. The comparison of diabetes detection results using the Random Forest algorithm with the two optimizations can be seen in Table 3.

Table 3. Comparison results of random forest diabetes detection with GridSearchCV and Bayesian hyperparameter optimization

              GridSearchCV   Bayesian
Fit time (s)  338.416        177.085
Accuracy      0.74           0.73

From the research conducted, GridSearchCV optimization requires more time than Bayesian optimization (338.416 seconds versus 177.085 seconds). However, with this longer time, GridSearchCV optimization produces a higher accuracy than Bayesian optimization (0.74 versus 0.73).

CONCLUSION
From the research conducted, it was found that GridSearchCV and Bayesian hyperparameter optimization each have their own advantages and disadvantages. GridSearchCV excels in terms of accuracy (0.74), although it takes longer (338.416 seconds). On the other hand, Bayesian hyperparameter optimization has an accuracy 0.01 lower than GridSearchCV (0.73), but takes less time than GridSearchCV (177.085 seconds).
REFERENCES