Journal of Soft Computing Exploration
Homepage: https://shmpublisher.com/index.php/joscex
p-ISSN: 2746-7686, e-ISSN: 2746-0991

Using Genetic Algorithm Feature Selection to Optimize XGBoost Performance in Australian Credit

Dwika Ananda Agustina Pertiwi1*, Kamilah Ahmad2, Shahrul Nizam Salahudin3, Ahmed Mohamed Annegrat4, Much Aziz Muslim5
1,2,3,5 Faculty of Technology Management and Business, Universiti Tun Hussein Onn Malaysia, Johor, Malaysia
4 Faculty of Economics, University of Bani Waleed, Libya
5 Department of Computer Science, Universitas Negeri Semarang, Indonesia

Article history: Received March 30, 2024; Revised April 3, 2024; Accepted April 3, 2024

Keywords: Credit risk; Australian credit; Genetic algorithm; XGBoost

ABSTRACT
To reduce credit risk, credit institutions need to implement credit risk management practices so that they can survive in the long term. Data mining is one of the techniques used for credit risk management, since it can find information patterns in large volumes of data using classification techniques with a measurable level of accuracy. This research aims to increase the accuracy of classification algorithms in predicting credit risk by applying a genetic algorithm as a feature selection method, so that only the most important features are used to extract credit risk information. It applies the XGBoost classifier to the Australian credit dataset and then evaluates the model by measuring accuracy and AUC. The results show an increase in accuracy of 2.24%, with an accuracy of 89.93% after optimization using the genetic algorithm. Genetic algorithm feature selection thus improves the accuracy performance of the XGBoost algorithm on the Australian credit dataset.

This is an open access article under the CC BY-SA license.

Corresponding Author: Dwika Ananda Agustina Pertiwi, Faculty of Technology Management and Business, Universiti Tun Hussein Onn Malaysia, Persiaran Tun Dr. Ismail, 86400 Parit Raja, Johor, Malaysia. Email: dwikapertiwi13@gmail.com
https://doi.org/10.52465/joscex.

INTRODUCTION
Credit problems in banks over the last few decades have required a deepening of existing risk management, both to strengthen the reliability of credit institutions and to improve it significantly; credit management has therefore become an important issue. As competitive, profit-oriented financial institutions, banks provide a variety of financial services, such as credit to individuals and businesses, while managing various types of risk. Since taking risk goes hand in hand with earning profit, banks derive a significant portion of their profits from their lending activities. As a result, they are keenly interested in creating ever more accurate credit risk assessment models in order to optimize the performance of the loans they have granted. Credit risk has been predicted in a variety of ways. Probability of default expresses the likelihood of default at a certain time and is the main parameter in a credit risk evaluation system. Since credit scores determine the probability of default, credit risk evaluation has emerged as a credit risk management tool that identifies "good" or "bad" applicants. By adopting a data mining strategy to examine applicant data, the degree of credit risk may be decreased.
Many studies on risk management for credit risk evaluation have been carried out; this important issue is worth studying with a classification algorithm that reaches an optimal level of accuracy. Supervised machine learning models have been widely applied in credit risk assessment; in particular, they are used in credit scoring models to estimate default probabilities and then predict default, usually as a binary classification. Various classification algorithms have been applied in credit risk prediction studies, such as LightGBM, Logistic Regression (LR), Gradient Boosting, XGBoost, and Neural Networks (NN). This paper applies eXtreme Gradient Boosting (XGBoost), proposed by Chen and Guestrin. Due to its speed and accuracy, XGBoost has attracted interest in significant global big-data contests, including Kaggle and DataCastle. We believe that a robust yet simple solution is a more promising advance in this area than a complicated model for financial institutions to use in practice. This research applies feature selection to improve classification accuracy by removing redundant and unnecessary features from the dataset. The most popular feature selection technique is the Genetic Algorithm (GA), which has proven effective in several areas of computer science, including data mining and industrial applications. Genetic algorithms have been used to obtain optimum values and have demonstrated superiority in increasing the accuracy of classification models. This study proposes a Genetic Algorithm (GA) to improve the accuracy of the XGBoost classifier on the Australian credit dataset. Imbalanced class data are handled using the synthetic minority oversampling technique (SMOTE). At the evaluation stage, we measure accuracy and AUC.

METHOD
In this study, a method was developed consisting of data collection, preprocessing, data splitting, classification, and evaluation using a confusion matrix. The benchmark and proposed models are validated on a real-world credit dataset, the Australian credit dataset. The design of the method is shown in Figure 1.

Figure 1. Stages of the research

The Australian credit dataset is a collection of real-world credit data used in this study to test the effectiveness of the classification model. It was obtained from the UCI Machine Learning Repository and consists of 690 instances and 14 features, of which 8 are numeric and 6 are categorical. The data preprocessing stage is an important stage in modeling, as it makes the data ready for the classification process in data mining. This research normalizes the data and balances the classes using SMOTE oversampling. Categorical features are encoded with the dummy variable method, which transforms a feature with n categories into n-1 binary features taking only the values "0" and "1". The normalization is given in Equation (1):

x' = (x - min(x)) / (max(x) - min(x))    (1)

where x stands for the original feature value, x' represents the feature value after normalization, and max(x) and min(x) stand for the maximum and minimum values of the original feature. After preprocessing, the dataset is divided into a training set and a testing set in a 70:30 proportion, which is widely used in many studies.
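For concreteness, this preprocessing pipeline can be sketched in Python as follows. This is a minimal sketch, not the authors' code: it assumes pandas, scikit-learn, and imbalanced-learn are available; the file name australian.dat, the choice of categorical columns, and the random seeds are assumptions based on the UCI copy of the dataset rather than details reported in the paper. Mirroring the order described above, SMOTE is applied before the 70:30 split.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from imblearn.over_sampling import SMOTE

    # Load the UCI Australian credit data (14 features, class label last).
    cols = [f"A{i}" for i in range(1, 15)] + ["target"]
    df = pd.read_csv("australian.dat", sep=r"\s+", header=None, names=cols)

    # Dummy-encode categorical features: n categories -> n-1 binary columns.
    # The column list is an assumption based on the UCI documentation.
    categorical = ["A4", "A5", "A6", "A12"]
    X = pd.get_dummies(df.drop(columns="target"), columns=categorical, drop_first=True)
    y = df["target"]

    # Min-max normalization, x' = (x - min(x)) / (max(x) - min(x)), as in Eq. (1).
    X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

    # Balance the classes with SMOTE oversampling (307 vs. 383 before balancing).
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)

    # 70:30 split of the balanced data into training and testing sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X_bal, y_bal, test_size=0.3, random_state=42, stratify=y_bal)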
After the data normalization process, feature selection is carried out by applying a genetic algorithm. This method mimics the process that drives biological evolution, solving constrained and unconstrained optimization problems through natural selection. It can be used to select the best features from a dataset and proceeds in five phases: initial population, fitness function, selection, crossover, and mutation. Feature selection with the genetic algorithm produces a data subset containing the best features, which then proceeds to the classification modeling process, which in this study applies the XGBoost algorithm. XGBoost is a decision-tree-based classifier whose optimized implementation can run up to 10 times faster than other GBM implementations. The last stage is the evaluation of the model to determine the accuracy of XGBoost after optimization using GA; this research applies a confusion matrix, shown in Table 1.

Table 1. Confusion matrix
Real \ Predicted | Positive            | Negative
Positive         | True Positive (TP)  | False Negative (FN)
Negative         | False Positive (FP) | True Negative (TN)

In accordance with Table 1, a True Positive (TP) is a correct prediction of an actual value of 1 and a True Negative (TN) a correct prediction of an actual value of 0, while a False Negative (FN) predicts 0 when the actual value is 1 and a False Positive (FP) predicts 1 when the actual value is 0. The accuracy formula is given in Equation (2):

Accuracy = (TP + TN) / (TP + FN + FP + TN)    (2)

RESULTS AND DISCUSSIONS
This study uses the Australian dataset to test the XGBoost classifier model on credit risk prediction. The Australian credit dataset is unbalanced, with 307 good-credit records and 383 default records. The SMOTE technique was therefore applied to deal with this imbalance, producing balanced proportions, as shown in Figure 2.

Figure 2. Results of SMOTE oversampling on the Australian credit dataset

Figure 2 shows that the classes in the dataset become balanced after SMOTE oversampling, with classes 1 and 0 each totaling 383 records. The data is then processed in the feature selection stage using the GA, which yields the 8 best features of the Australian credit dataset out of the original 14. Next, the data subset with the 8 best features and balanced classes is divided into training and testing data and processed in the modeling stage using the XGBoost classifier. The performance of the classification model is evaluated using a confusion matrix, whose values are shown in Table 2.

Table 2. Confusion matrix of XGBoost on the Australian credit dataset
Real \ Predicted | Positive | Negative
Positive         | 225      | 37
Negative         | 29       | 245

The accuracy calculation is as follows:

Accuracy = (TP + TN) / (TP + FN + FP + TN) = (225 + 245) / (225 + 37 + 29 + 245) = 87.69%

As shown in Table 2, the XGBoost classifier predicts credit risk on the Australian dataset with an accuracy of 87.69%.
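The GA feature selection wrapped around XGBoost can be sketched as below, continuing from the preprocessing sketch above. This is an illustrative sketch of the five GA phases, not the authors' implementation: the population size, crossover and mutation rates, and generation count are assumptions, since the paper does not report its GA settings, and a cleaner protocol would compute fitness on a validation split or by cross-validation rather than on the test set.

    import numpy as np
    from xgboost import XGBClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(42)
    n_features = X_train.shape[1]

    def fitness(mask):
        # Fitness of a binary feature mask = accuracy of XGBoost
        # trained and evaluated on the selected columns only.
        if mask.sum() == 0:
            return 0.0
        model = XGBClassifier(eval_metric="logloss", random_state=42)
        model.fit(X_train.iloc[:, mask.astype(bool)], y_train)
        pred = model.predict(X_test.iloc[:, mask.astype(bool)])
        return accuracy_score(y_test, pred)

    # Phase 1: initial population of random feature masks (assumed size 20).
    pop = rng.integers(0, 2, size=(20, n_features))
    for gen in range(15):  # assumed number of generations
        # Phase 2: evaluate the fitness of every individual.
        scores = np.array([fitness(ind) for ind in pop])
        # Phase 3: binary tournament selection of parents.
        parents = pop[[max(rng.choice(len(pop), 2), key=lambda i: scores[i])
                       for _ in range(len(pop))]]
        # Phase 4: single-point crossover between consecutive parents.
        children = parents.copy()
        for i in range(0, len(children) - 1, 2):
            cut = rng.integers(1, n_features)
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        # Phase 5: bit-flip mutation with a small (assumed) probability.
        flip = rng.random(children.shape) < 0.05
        children[flip] = 1 - children[flip]
        pop = children

    # Keep the best mask found; the paper reports 8 selected features.
    best = pop[np.argmax([fitness(ind) for ind in pop])]
    print("selected features:", np.flatnonzero(best))
    print("accuracy with GA-selected features:", fitness(best))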
Then, we examine the performance of the XGBoost classifier optimized using the Genetic Algorithm as a feature selection method; the resulting confusion matrix is presented in Table 3.

Table 3. Confusion matrix of XGBoost with GA feature selection on the Australian credit dataset
Real \ Predicted | Positive | Negative
Positive         | 228      | 34
Negative         | 20       | 254

The accuracy calculation is as follows:

Accuracy = (TP + TN) / (TP + FN + FP + TN) = (228 + 254) / (228 + 34 + 20 + 254) = 89.93%

Table 4 shows the increased accuracy of the XGBoost classifier after optimization using the Genetic Algorithm. The accuracy is 89.93%, up from 87.69% before optimization, an increase of 2.24% from combining XGBoost with Genetic Algorithm feature selection. This shows that selecting the best features can affect the accuracy of credit risk predictions based on the Australian credit dataset; a comparison of model performance is presented in Figure 3.

Table 4. Comparison of model performance on the Australian credit dataset
Model        | Accuracy
XGBoost      | 87.69%
XGBoost GAFS | 89.93%

Figure 3. Results of the model comparison on the Australian credit dataset

CONCLUSION
This research aimed to recommend a credit risk prediction model in which classification, one of the data mining techniques for credit risk management, estimates the probability of default. Applying the genetic algorithm as a feature selection method successfully increased the accuracy of the XGBoost classification model by 2.24% in predicting credit risk based on the Australian credit dataset. As future work, research on credit risk prediction can be continued by using the Genetic Algorithm for hyperparameter tuning and by testing other credit datasets.