Bulletin of Informatics and Data Science, Vol. 4, May 2025, Pages 10-21
ISSN 2580-8389 (Media Online) | DOI: 10.61944/bids | https://ejurnal.id/index.php/bids/index

Diabetes Classification using Gain Ratio Feature Selection in Support Vector Machine Method

Nabila Al Rasyid, Iis Afrianty*, Elvia Budianita, Siska Kurnia Gusti
Faculty of Science and Technology, Informatics Engineering, Universitas Islam Negeri Sultan Syarif Kasim Riau, Pekanbaru, Indonesia
Email: 12150120329@students.uin-suska.ac.id, iis.afrianty@uin-suska.ac.id, elvia.budianita@uin-suska.ac.id, siskakurniagusti@uin-suska.ac.id
Corresponding Author Email: iis.afrianty@uin-suska.ac.id

Abstract: Diabetes is a major cause of many chronic diseases such as visual impairment, stroke, and kidney failure. Early detection, especially in groups at high risk of developing diabetes, is needed to prevent problems with a wide impact. Indonesia is ranked seventh in the world in the number of people with diabetes, at 10.7 million. This research aims to determine which attributes in the diabetes dataset most affect classification and to apply the Support Vector Machine (SVM) method for diabetes classification. For the attribute-determination process, the Gain Ratio feature selection technique is applied. The dataset used consists of 768 records with 8 attributes. In the classification process, three SVM kernels (Linear, Polynomial, and RBF) are used with three data split ratios: 70:30, 80:20, and 90:10. Before applying feature selection, all 8 attributes were used and the highest accuracy of 94.81% was achieved at a ratio of 80:20 using the RBF kernel with two parameter combinations, namely C = 100, Gamma = 3 and C = 100, Gamma = Scale. The feature selection thresholds used were 0.02, 0.03, and 0.05. After applying feature selection, the configuration that produced the highest accuracy used 6 attributes. The highest accuracy after applying feature selection reached 95.45% at a threshold of 0.02 with a ratio of 80:20 using the RBF kernel with parameters C = 100 and Gamma = Scale. The results show that accuracy increased after applying feature selection.

Keywords: Data Mining, Diabetes, Feature Selection, Gain Ratio, Support Vector Machine

INTRODUCTION

Diabetes Mellitus is a chronic disease caused by elevated glucose levels in the blood due to the failure of the pancreas to produce adequate amounts of the hormone insulin. Diabetes is the main cause of various chronic diseases such as visual impairment, heart disease, stroke, and kidney failure. The United States (31 million), India (77 million), and China (116.4 million) are the three countries with the highest numbers of people with diabetes in the world; Indonesia is ranked seventh, with 10.7 million people with diabetes. The high prevalence of diabetes in Indonesia, a developing country with a large population, makes it difficult for certain groups of people to consult medical personnel for examination. Early detection, especially in groups at high risk of developing diabetes, is needed to prevent problems with a wide impact. Data mining is a method of analyzing patterns and characteristics in large datasets to uncover unexpected knowledge or information not previously held. The results of data mining can be applied to improve the quality of future decision making.
In data mining there are various main functions, such as estimation, prediction, clustering, association, and classification. Classification is a data analysis method for determining the class or category of data samples and finding relationships or patterns between the attributes contained in the data. The classification process has two steps: learning and classification. Learning (the training phase) is the first stage, in which the training data are analyzed by the classification algorithm until the result can be expressed in the form of classification rules. Next is the classification phase, in which test data are used to estimate the accuracy of the classification rules. Applying classification to diseases based on medical history and symptoms can help speed up diagnosis and plan effective treatment.

Support Vector Machine (SVM) is a machine learning algorithm with a high level of accuracy in classification. Research on the Pima Indians Diabetes Dataset using the SVM method with the RBF kernel and a 90:10 data ratio achieved an accuracy of 87%. Further research on the Pima Indians Diabetes Dataset achieved its highest benchmark-model accuracy of 87% using a polynomial kernel with C = 100 and degree = 3, while the highest accuracy of a from-scratch model, using a polynomial kernel with C = 1, gamma = scale, and degree = 3, was 78%. Another study, on skull bone data using SVM, achieved 91.3% accuracy at a ratio of 90:10 using the RBF kernel with C = 2 and gamma = 'auto'. Other research on the Pima Indians Diabetes Dataset applied the SVM method with forward selection, a feature selection technique, and reported a large increase in accuracy, reaching 91% when using 20% test data and 80% training data.

Feature selection is a form of attribute reduction to improve data quality and enhance the performance of classification algorithms. Feature selection can help algorithms process data faster because it selects the most relevant attributes, so that irrelevant attributes are removed. A feature selection approach is needed to select the important features that are useful for the learning process and improve its accuracy. One feature selection technique that has been proven to improve classification algorithms is the gain ratio. Gain Ratio is a feature selection method that determines the level of influence of an attribute on the target variable to be predicted. The selected features are determined by a value limit called the threshold, which can be chosen freely; like information gain, the gain ratio requires determining this minimum limit by repeatedly testing candidate values. One study tested threshold values starting from 0.01. Research using gain ratio feature selection with the Naive Bayes method for heart disease achieved an accuracy of 91.2%, higher than Naive Bayes without feature selection, which reached only about 90%.
Other research, on hypertension complications using gain ratio feature selection with the Naive Bayes method, obtained an increase in accuracy of 20 percentage points, from 75% to 95%. In addition, research on a credit approval dataset obtained higher accuracy after applying gain ratio feature selection, rising from 94.12% with C4.5 alone to above 95%. Research on the SVM method with gain ratio feature selection has also been done for skull bone classification: a threshold of 0.01 resulted in an accuracy of 92.01%, higher than without feature selection. In sentiment analysis, gain ratio feature selection with the SVM method likewise increased accuracy compared to SVM without feature selection: using 1732 attributes with a threshold weight below 0.0001 raised accuracy from 61.63% to about 71%, while using 518 attributes with a threshold weight below 0.002 raised it from 61.63% to about 62%.

Based on previous research, the gain ratio feature selection technique and the SVM method have proven effective in various studies. Therefore, this study applies a combination of gain ratio and Support Vector Machine to diabetes classification. The purpose of this research is to improve the performance of the prediction model so that it better predicts the risk of diabetes.

RESEARCH METHODOLOGY

The research method consists of several processes: problem identification, literature study, diabetes data collection, data preprocessing, data transformation using min-max normalization, feature selection using gain ratio, classification using SVM, evaluation using a confusion matrix, and conclusion. The research flow is shown in Figure 1.

Figure 1. Research method

2.1 Data Collection
The data in this study are secondary data in the form of a dataset taken from the Kaggle platform. The listed license is CC0: Public Domain, and the dataset can be accessed via https://www.kaggle.com/datasets/jamaltariqcheema/pima-indians-diabetes-dataset/data. The Pima Indians Diabetes Dataset contains 768 records with 8 attributes. The data consist of 268 diabetes records and 500 non-diabetes records. The dataset attributes are listed in Table 1.

Table 1. Dataset Attributes

| No | Attribute |
|----|-----------|
| 1 | Pregnancies |
| 2 | Glucose |
| 3 | Blood Pressure |
| 4 | Skin Thickness |
| 5 | Insulin |
| 6 | BMI (Body Mass Index) |
| 7 | Diabetes Pedigree Function |
| 8 | Age |
| 9 | Outcome (class label) |

A sample of the diabetes dataset is shown in Table 2, with columns Patient, Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, Age, and Outcome.

Table 2. Diabetes dataset

2.2 Data Preprocessing
This stage cleans the data of duplication, missing values, and inappropriate entries, rearranging the data to fit the modeling to be done. Cleaning removes errors in the data, such as handling missing values and removing duplicate records. This aims to avoid the bias that missing diabetes data could introduce, and thus improves the performance of the prediction model. Data balancing was not applied because the difference between the two classes was only 232 records.

2.3 Data Transformation
This stage converts data types to conform to the modeling requirements. The data are made simpler without changing their basic content. Data normalization rescales the feature values of the dataset into a specified value range. This research uses the Min-Max Normalization method, which maps data values into the range 0 to 1:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where:
x' = the normalized value
x = the original value to be normalized
min(x) = the minimum value of the attribute
max(x) = the maximum value of the attribute
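To make sections 2.1 through 2.3 concrete, the following is a minimal Python sketch of loading, checking, and normalizing the dataset. The paper does not state its tooling; pandas and scikit-learn, and the file name diabetes.csv, are assumptions made here.

```python
# Minimal sketch of sections 2.1-2.3; "diabetes.csv" is a hypothetical
# local copy of the Kaggle Pima Indians Diabetes Dataset (768 rows).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("diabetes.csv")

# Section 2.2: check for missing values and duplicates before modeling.
print(df.isnull().sum())   # the paper reports no missing values
df = df.drop_duplicates()

# Section 2.3: min-max normalization of the 8 feature columns into [0, 1],
# leaving the Outcome class label untouched.
features = df.columns.drop("Outcome")
df[features] = MinMaxScaler().fit_transform(df[features])
print(df.head())
```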
2.4 Gain Ratio Feature Selection
This stage performs feature selection using the gain ratio. Feature selection is an important process that aims to identify and select the set of attributes with the most influence. The gain ratio is a development of information gain, is one of the best-performing feature selection models, and is widely used by researchers. Feature selection produces a ranking of each attribute in the diabetes data, which supports the learning process and improves its accuracy. The gain ratio thresholds used in this study are 0.02, 0.03, and 0.05. The gain ratio calculation proceeds as follows.

Calculate the entropy value of each attribute:

$$\mathrm{Entropy}(S) = \sum_{i=1}^{n} -p_i \log_2 p_i$$

Calculate the information gain value of each attribute:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\, \mathrm{Entropy}(S_i)$$

Calculate the split information value:

$$\mathrm{SplitInfo}(A) = -\sum_{j=1}^{k} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|}$$

Calculate the gain ratio value:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)}$$

where:
S = the data sample
n = the number of classes
p_i = the proportion of samples belonging to class i
A = the attribute
|S_i| = the number of samples with attribute value i
|S| = the total number of samples in the dataset or subset being processed
|S_j| = the number of samples in the j-th subset after splitting on the attribute
k = the number of subsets resulting from the split
Gain(A) = the information gain value of attribute A
SplitInfo(A) = the split info value of attribute A
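The following sketch, continuing from the previous one, illustrates how the gain ratio ranking of section 2.4 can be computed. The paper does not state how the continuous attributes are discretized before computing entropy; equal-width binning into 10 bins is an assumption made here for illustration, so the resulting scores will not exactly match Figure 3.

```python
# Gain-ratio ranking sketch; reuses df and features from the previous block.
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def gain_ratio(feature: pd.Series, target: pd.Series, bins: int = 10) -> float:
    binned = pd.cut(feature, bins=bins)  # assumed equal-width discretization
    # Information gain: class entropy minus the weighted entropy of the
    # class within each bin of the attribute.
    weighted = sum(
        len(target[binned == b]) / len(target) * entropy(target[binned == b])
        for b in binned.unique()
    )
    gain = entropy(target) - weighted
    # Split info penalizes attributes that fragment the data into many bins.
    split_info = entropy(binned.astype(str))
    return gain / split_info if split_info > 0 else 0.0

scores = {c: gain_ratio(df[c], df["Outcome"]) for c in features}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {score:.4f}")
```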
2.5 SVM Classification Method
This stage conducts training and testing to build the SVM model. The diabetes data were first divided into training and testing data using the ratios 90:10, 80:20, and 70:30. Classification then uses three kernels: linear, RBF, and polynomial. The values of parameter C used for all kernels are 1, 10, and 100. The polynomial kernel additionally takes a degree value of 1, 2, or 3, and the RBF kernel takes a gamma value of 1, 2, 3, or scale. The kernel functions are:

Linear:
$$K(x_i, x_j) = x_i \cdot x_j$$

Polynomial:
$$K(x_i, x_j) = (x_i \cdot x_j + c)^d$$

RBF (Radial Basis Function):
$$K(x_i, x_j) = e^{-\gamma \lVert x_i - x_j \rVert^2}$$

where d is the degree of the polynomial, c is a constant, and γ is the kernel parameter.

2.6 Evaluation of Test Results
This stage is the final stage of the classification process. It checks the suitability of the discovered patterns or information against previously known facts or hypotheses, and it tests accuracy by comparing the performance of the various SVM kernels in diabetes classification. A single data split was performed at each ratio, followed by computing accuracy, recall, and precision from a confusion matrix. The confusion matrix represents the classification results in four terms: TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative).

Accuracy indicates how precisely the model classifies overall:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$

Recall indicates the model's success at retrieving the positive class:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%$$

Precision indicates the agreement between the requested data and the predictions given by the model:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%$$

The F1 score is the weighted harmonic mean of precision and recall:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \times 100\%$$
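A minimal sketch of one training and evaluation run from sections 2.5 and 2.6, reusing df and features from the sketches above. The random seed is an assumption, since the paper does not report one; the parameter values follow the grids stated in the text.

```python
# One 80:20 run with the RBF kernel and the paper's best parameters.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

X, y = df[features], df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80:20 ratio; seed assumed

model = SVC(kernel="rbf", C=100, gamma="scale").fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))
```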
RESULT AND DISCUSSION

The results and discussion cover the research carried out to measure the effectiveness of applying gain ratio feature selection to the SVM method to improve the accuracy of diabetes classification.

3.1 Data Preprocessing
In this step the data are checked before cleaning. After checking, the data contain no missing values, as shown in Figure 2. Since there are no missing values, deleting or imputing missing values is unnecessary, and we can proceed to the data transformation process.

Figure 2. Check for Missing Values

3.2 Data Transformation
This step uses Min-Max Normalization. The first five normalized records are shown in Table 3.

Table 3. Data Normalization

| Pregnancies | Glucose | Blood Pressure | Skin Thickness | Insulin | BMI | Diabetes Pedigree Function | Age |
|---|---|---|---|---|---|---|---|
| 0.353 | 0.671 | 0.490 | 0.304 | 0.187 | 0.315 | 0.234 | 0.483 |
| 0.059 | 0.265 | 0.429 | 0.239 | 0.106 | 0.172 | 0.117 | 0.167 |
| 0.471 | 0.897 | 0.408 | 0.272 | 0.187 | 0.104 | 0.254 | 0.183 |
| 0.059 | 0.290 | 0.429 | 0.174 | 0.096 | 0.202 | 0.038 | 0.000 |
| 0.059 | 0.316 | 0.469 | 0.261 | 0.106 | 0.249 | 0.101 | 0.033 |

3.3 Gain Ratio Feature Selection
The output of gain ratio feature selection is the ranking of all attributes from highest to lowest, together with the attributes selected from that ranking according to the threshold used. Figure 3 shows the results of the gain ratio calculation, and its representation is shown in Figure 4.

Figure 3. Gain Ratio Calculation Results

Figure 4. Gain Ratio Calculation

Table 4 shows the attributes used at the various thresholds. Threshold 0.02 keeps the attributes whose gain ratio is above 0.02, and likewise for thresholds 0.03 and 0.05. Threshold 0.02 uses more attributes than the other thresholds: the smaller the threshold, the more attributes are used, and the larger the threshold, the fewer attributes are used.

Table 4. Attributes at Various Thresholds

| Threshold | Total Attributes | Attributes |
|---|---|---|
| 0.02 | 6 | Insulin, Diabetes Pedigree Function, Skin Thickness, BMI, Glucose, Age |
| 0.03 | 5 | Insulin, Diabetes Pedigree Function, Skin Thickness, BMI, Glucose |
| 0.05 | 3 | Insulin, Diabetes Pedigree Function, Skin Thickness |

3.4 SVM Classification Method
As described in section 2.5, the data were divided into training and testing data at ratios of 90:10, 80:20, and 70:30, and classified with the linear, RBF, and polynomial kernels using C values of 1, 10, and 100 for all kernels, degree values of 1, 2, and 3 for the polynomial kernel, and gamma values of 1, 2, 3, and scale for the RBF kernel. The parameters of each kernel are summarized in Table 5.

Table 5. Kernel Parameters

| Kernel | C | Gamma | Degree |
|---|---|---|---|
| Linear | 1, 10, 100 | - | - |
| RBF | 1, 10, 100 | 1, 2, 3, Scale | - |
| Polynomial | 1, 10, 100 | - | 1, 2, 3 |

3.5 Evaluation of Test Results
This study shows that applying gain ratio feature selection to SVM can increase the accuracy of diabetes classification by 0.64 percentage points, from 94.81% to 95.45%, after applying a threshold of 0.02 at a ratio of 80:20 with the RBF kernel and parameters C = 100, Gamma = Scale. This is also an improvement over earlier research, which produced a highest accuracy of 87% using the SVM method on the same dataset. Table 6 shows the results of SVM testing on all kernels before applying feature selection; the highest accuracy is 94.81% at a ratio of 80:20 with the RBF kernel under two parameter combinations, C = 100, Gamma = 3 and C = 100, Gamma = Scale.

Table 6. Test Results Without Feature Selection

| Ratio | Kernel | Parameter | Accuracy |
|---|---|---|---|
| 70:30 | Linear | C = 1 | 76.62% |
| 70:30 | RBF | C = 100, Gamma = Scale | 93.51% |
| 70:30 | Polynomial | C = 100, Degree = 3 | 90.48% |
| 80:20 | Linear | C = 100 | 79.22% |
| 80:20 | RBF | C = 100, Gamma = 3; C = 100, Gamma = Scale | 94.81% |
| 80:20 | Polynomial | C = 100, Degree = 3 | 91.56% |
| 90:10 | Linear | C = 100 | 77.92% |
| 90:10 | RBF | C = 100, Gamma = 3; C = 100, Gamma = Scale | 92.21% |
| 90:10 | Polynomial | C = 100, Degree = 3 | 89.61% |

All kernels show that the 80:20 ratio produces the best performance. The highest accuracy before applying feature selection at each ratio is represented in Figure 5.

Figure 5. Testing Results Before Applying Feature Selection
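The experiments behind Tables 6 through 9 amount to a grid over feature subsets, split ratios, kernels, and parameter values. Below is a hedged sketch of that grid, reusing df, features, and scores from the sketches above; the paper's random seed and exact tie handling are not stated and are assumed here.

```python
# Experiment grid sketch: every threshold subset x split ratio x kernel,
# keeping the best accuracy and all parameter combinations that tie on it.
from itertools import product
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

subsets = {"no gain ratio": list(features)}
for t in (0.02, 0.03, 0.05):
    # Keep attributes whose gain ratio meets the threshold (cf. Table 4).
    subsets[f"threshold {t}"] = [c for c in features if scores[c] >= t]

grids = {
    "linear": [{"C": c} for c in (1, 10, 100)],
    "rbf":    [{"C": c, "gamma": g}
               for c, g in product((1, 10, 100), (1, 2, 3, "scale"))],
    "poly":   [{"C": c, "degree": d}
               for c, d in product((1, 10, 100), (1, 2, 3))],
}
splits = {"70:30": 0.3, "80:20": 0.2, "90:10": 0.1}

for (name, cols), (ratio, ts) in product(subsets.items(), splits.items()):
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[cols], df["Outcome"], test_size=ts, random_state=42)
    for kernel, grid in grids.items():
        accs = [(accuracy_score(y_te, SVC(kernel=kernel, **p)
                                .fit(X_tr, y_tr).predict(X_te)), p)
                for p in grid]
        best = max(a for a, _ in accs)
        ties = [p for a, p in accs if a == best]  # several combos can tie
        print(f"{name:15s} {ratio} {kernel:6s} acc={best:.2%} params={ties}")
```

Reporting every tying parameter combination mirrors the tables that follow, where several rows list more than one combination for the same accuracy.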
Table 7 shows the results of SVM testing without gain ratio and with gain ratio at each threshold and each ratio, represented by the highest accuracy for the linear kernel.

Table 7. Test Results After Applying Feature Selection on the Linear Kernel

| Threshold (attributes) | Ratio | Parameter | Accuracy |
|---|---|---|---|
| No gain ratio (8) | 70:30 | C = 1 | 76.62% |
| No gain ratio (8) | 80:20 | C = 100 | 79.22% |
| No gain ratio (8) | 90:10 | C = 100 | 77.92% |
| 0.02 (6) | 70:30 | C = 1 | 76.19% |
| 0.02 (6) | 80:20 | C = 1 | 77.92% |
| 0.02 (6) | 90:10 | C = 10 | 76.62% |
| 0.03 (5) | 70:30 | C = 10 | 77.06% |
| 0.03 (5) | 80:20 | C = 1; C = 10 | 81.17% |
| 0.03 (5) | 90:10 | C = 1; C = 10 | 85.71% |
| 0.05 (3) | 70:30 | C = 100 | 82.25% |
| 0.05 (3) | 80:20 | C = 100 | 84.42% |
| 0.05 (3) | 90:10 | C = 100; C = 10 | 85.71% |

Table 8 shows the corresponding results for the RBF kernel.

Table 8. Test Results After Applying Feature Selection on the RBF Kernel

| Threshold (attributes) | Ratio | Parameter | Accuracy |
|---|---|---|---|
| No gain ratio (8) | 70:30 | C = 100, Gamma = Scale | 93.51% |
| No gain ratio (8) | 80:20 | C = 100, Gamma = 3; C = 100, Gamma = Scale | 94.81% |
| No gain ratio (8) | 90:10 | C = 100, Gamma = 3; C = 100, Gamma = Scale | 92.21% |
| 0.02 (6) | 70:30 | C = 100, Gamma = Scale | 94.37% |
| 0.02 (6) | 80:20 | C = 100, Gamma = Scale | 95.45% |
| 0.02 (6) | 90:10 | C = 100, Gamma = Scale | 94.81% |
| 0.03 (5) | 70:30 | C = 100, Gamma = Scale | 90.48% |
| 0.03 (5) | 80:20 | C = 10, Gamma = Scale | 92.21% |
| 0.03 (5) | 90:10 | C = 100, Gamma = 2; C = 100, Gamma = Scale | 93.51% |
| 0.05 (3) | 70:30 | C = 100, Gamma = 2; C = 100, Gamma = Scale | 88.31% |
| 0.05 (3) | 80:20 | C = 100, Gamma = Scale | 88.96% |
| 0.05 (3) | 90:10 | C = 100, Gamma = Scale | 89.61% |

Table 9 shows the corresponding results for the polynomial kernel.

Table 9. Test Results After Applying Feature Selection on the Polynomial Kernel

| Threshold (attributes) | Ratio | Parameter | Accuracy |
|---|---|---|---|
| No gain ratio (8) | 70:30 | C = 100, Degree = 3 | 90.48% |
| No gain ratio (8) | 80:20 | C = 100, Degree = 3 | 91.56% |
| No gain ratio (8) | 90:10 | C = 100, Degree = 3 | 89.61% |
| 0.02 (6) | 70:30 | C = 100, Degree = 3 | 87.01% |
| 0.02 (6) | 80:20 | C = 100, Degree = 3 | 88.96% |
| 0.02 (6) | 90:10 | C = 10, Degree = 3; C = 100, Degree = 3 | 88.31% |
| 0.03 (5) | 70:30 | C = 1, Degree = 2 | 87.88% |
| 0.03 (5) | 80:20 | C = 1, Degree = 3 | 90.91% |
| 0.03 (5) | 90:10 | C = 100, Degree = 3 | 88.31% |
| 0.05 (3) | 70:30 | C = 10, Degree = 1; C = 100, Degree = 1 | 82.25% |
| 0.05 (3) | 80:20 | C = 1, Degree = 1; C = 10, Degree = 1; C = 100, Degree = 1 | 85.71% |
| 0.05 (3) | 90:10 | C = 1, Degree = 1; C = 1, Degree = 2; C = 10, Degree = 1; C = 10, Degree = 2; C = 100, Degree = 1; C = 100, Degree = 2 | 87.01% |

The highest accuracies of the linear kernel before and after applying feature selection, at each threshold and ratio, are represented in Figure 6.

Figure 6. Testing Results After Applying Feature Selection on the Linear Kernel

The highest accuracies of the RBF kernel before and after applying feature selection, at each threshold and ratio, are represented in Figure 7.
Figure 7. Test Results After Applying Feature Selection on the RBF Kernel

The highest accuracies of the polynomial kernel before and after applying feature selection, at each threshold and ratio, are represented in Figure 8.

Figure 8. Test Results After Applying Feature Selection on the Polynomial Kernel

Testing with SVM without feature selection produced the highest accuracy of 94.81% at a ratio of 80:20 on the RBF kernel with the parameter combinations C = 100, Gamma = 3 and C = 100, Gamma = Scale, yielding the confusion matrix shown in Figure 9. Of the 154 test records, the classifier correctly predicted 50 actually diabetic records as diabetic (True Positives), but 5 diabetic records were incorrectly predicted as not diabetic (False Negatives). It also correctly predicted 96 actually non-diabetic records as not diabetic (True Negatives), but 3 non-diabetic records were wrongly predicted as diabetic (False Positives). A False Positive is negative data detected as positive, while a False Negative is positive data detected as negative. The results show that the model is quite good at identifying both diabetic and non-diabetic cases, although it still misclassified 5 diabetic and 3 non-diabetic records.

Figure 9. Confusion Matrix of Best Test Results Without Feature Selection

Testing with SVM after applying feature selection produced the highest accuracy of 95.45% at a ratio of 80:20 on the RBF kernel with parameters C = 100 and Gamma = Scale, yielding the confusion matrix shown in Figure 10. Of the 154 test records, the classifier correctly predicted 52 actually diabetic records as diabetic (True Positives), with 3 diabetic records incorrectly predicted as not diabetic (False Negatives). It also correctly predicted 95 actually non-diabetic records as not diabetic (True Negatives), with 4 non-diabetic records wrongly predicted as diabetic (False Positives). The model remains good at identifying both classes, and the total number of detection errors fell to 7 records: 3 diabetic and 4 non-diabetic.

Figure 10. Confusion Matrix of Best Test Results with Feature Selection

This research shows that applying gain ratio feature selection to SVM increases the accuracy of diabetes classification by 0.64 percentage points, from 94.81% (ratio 80:20, RBF kernel, parameters C = 100, Gamma = 3 and C = 100, Gamma = Scale) to 95.45% (threshold 0.02, ratio 80:20, RBF kernel, parameters C = 100, Gamma = Scale). A ratio of 80:20 can produce optimal performance, consistent with earlier research, and the RBF kernel often leads to higher accuracy than the other kernels, both before and after applying feature selection.
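As a quick arithmetic check, the Figure 10 counts (TP = 52, TN = 95, FP = 4, FN = 3) reproduce the reported accuracy using the formulas in section 2.6; the precision, recall, and F1 values below are derived here from those counts and are not figures reported by the authors:

$$\mathrm{Accuracy} = \frac{52 + 95}{52 + 95 + 4 + 3} = \frac{147}{154} \approx 95.45\%$$

$$\mathrm{Precision} = \frac{52}{52 + 4} \approx 92.86\%, \qquad \mathrm{Recall} = \frac{52}{52 + 3} \approx 94.55\%$$

$$F1 = \frac{2 \times 0.9286 \times 0.9455}{0.9286 + 0.9455} \approx 93.69\%$$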
The polynomial kernel proved to produce higher accuracy than the linear kernel, although the accuracy of the polynomial kernel decreased after applying feature selection. The Scale value of the gamma parameter in the RBF kernel also consistently produced the highest accuracy at every threshold and data ratio.

CONCLUSION

Tests applying gain ratio feature selection at various thresholds show that a threshold of 0.02 produces the highest accuracy. Tests on the various ratios and kernels with various parameters show that the RBF kernel still provides the best performance. The application of feature selection, the data ratio, and the kernel parameters all affect the performance of the model. At a threshold of 0.02, the 80:20 data ratio produces higher accuracy than the other data ratios on the RBF kernel. Threshold 0.02 produces the highest accuracy at a ratio of 80:20 across all kernels and the lowest accuracy at a ratio of 70:30. At a threshold of 0.03 there is a steady increase in accuracy from the 70:30 ratio to the 90:10 ratio on the linear and RBF kernels, while the polynomial kernel produces its highest accuracy at 80:20 and its lowest at 70:30. At a threshold of 0.05 there is a steady increase in accuracy from 70:30 to 90:10 across all kernels. Overall, the test results show that applying gain ratio feature selection can improve model performance on all three kernels. After applying feature selection, the highest accuracy was 95.45%, at a threshold of 0.02 with an 80:20 ratio using the RBF kernel with parameters C = 100 and Gamma = Scale. This research shows that applying gain ratio feature selection to SVM increases the accuracy of diabetes classification by 0.64 percentage points, from 94.81% to 95.45%. A ratio of 80:20 produces optimal performance on the RBF and polynomial kernels, and the RBF kernel often leads to higher accuracy than the other kernels, both before and after applying feature selection. The polynomial kernel produces higher accuracy than the linear kernel, although its accuracy decreases after applying feature selection. The Scale gamma value in the RBF kernel also consistently produces the highest accuracy at every threshold and data ratio. The right combination of threshold, data ratio, and kernel parameters can produce a more reliable model for predicting the risk of diabetes. Future research is suggested to explore other data division schemes beyond simple ratios, such as cross validation. Because of the difference in the number of records between classes 0 and 1, the use of data balancing techniques is also recommended. In addition, gain ratio feature selection can be applied to other algorithms to improve their accuracy, or other feature selection techniques can be applied to the SVM method.

REFERENCES