JUSIKOM PRIMA (Jurnal Sistem Informasi dan Ilmu Komputer Prima)
Vol. 6 No. August 2022
E-ISSN: 2580-2879

COMPARISON OF CLASSIFICATION ALGORITHM IN CLASSIFYING AIRLINE PASSENGER SATISFACTION

Jacky Suwanto, Daniel Ryan Hamonangan Sitompul, Stiven Hamonangan Sinurat, Andreas Situmorang, Ruben, Dennis Jusuf Ziegel, Evta Indra*
Information Systems Study Program, Faculty of Technology and Computer Science, Universitas Prima Indonesia
Jl. Sampul No. Sei Putih Barat, Medan Petisah
E-mail: *evtaindra@unprimdn.

ABSTRACT- To revive the airline industry, which is being hit by the current recession, it is essential to restore passenger confidence by improving the services that airlines provide. With technology now influencing every industrial field, airlines can use Machine Learning to find the essential points that make passengers feel satisfied with airline services and to classify passenger satisfaction. This study presents the construction of Machine Learning models, starting from Data Acquisition, Data Cleaning, Exploratory Data Analysis, and Preprocessing through to Model Building. It is concluded that Random Forest is the best algorithm in this case study, with an F1 accuracy score of 89%, a ROC-AUC score of 0.90, and a shorter modeling time than the other algorithms used in this study.

Keywords: Machine Learning, Random Forest, AdaBoost, XGBoost, Classification

INTRODUCTION

The airline industry experienced a setback during the pandemic. Based on data from the International Civil Aviation Organization (ICAO), in 2020 the aviation industry suffered a loss of $372 billion and the number of passengers fell by 60%. To revive the airline industry, which is being hit by a recession, it is essential to restore passenger confidence by improving the services provided by airlines.
With technology now influencing every industrial field, airlines can use Machine Learning to find the essential points that make passengers feel satisfied with airline services. The airline can also classify the rating given by a passenger to find out whether the passenger is satisfied with the service provided.

Many previous studies have classified airline passenger satisfaction, for example "Comparison of Feature Selection Optimization in Naïve Bayes for Airline Passenger Satisfaction Classification" and "Predicting Airline Passenger Satisfaction With Classification Algorithms". The latter study performs classification using the K-Nearest Neighbors (KNN), Logistic Regression, Gaussian Naïve Bayes, Decision Tree, and Random Forest algorithms, and compares the algorithms by their accuracy. The former study performs classification using the Naïve Bayes algorithm in three variants: in its default configuration, with Particle Swarm Optimization (PSO), and with a Genetic Algorithm (GA).

Based on the problems mentioned above, to help airlines identify the essential points for providing passengers with the best service and to classify passenger satisfaction, the authors propose an analysis using the Exploratory Data Analysis method together with Machine Learning models that carry out the classification. The features used to build the models include the services that the airline provides to passengers. In this study, several classification algorithms are presented: Random Forest (RF), Adaptive Boosting (AdaBoost), and Extreme Gradient Boosting (XGBoost). The algorithm with the highest accuracy and the shortest model creation time will be considered the best algorithm for this case study.

METHODOLOGY

This research was conducted at the Data Analyst Laboratory of Prima Indonesia University. The models were built in a notebook running on Google Colaboratory.
The workflow of this research is presented in Figure 1. The research begins by retrieving data from the web page of the dataset provider, Kaggle (TJ Klein). After the data is obtained, data cleaning is carried out, which includes checking whether the data is unbalanced (Imbalance) and whether there is empty data (Null Value). Next, the Exploratory Data Analysis process searches for the essential points that can be visualized from the dataset. Once the data is understood, the Pre-processing stage applies Label Encoding and Outlier Removal. The last step is building the models with a predetermined configuration.

Figure 1. Research Methodology

1 Data Acquisition

The data acquisition stage aims to download data from trusted sources before the data is stored, processed, pre-processed, and used for other purposes. This process begins with retrieving relevant information, changing the data as needed, and loading the dataset into the notebook. In this study, the dataset comes from Kaggle, namely "Airline Passenger Satisfaction" (TJ Klein). This dataset contains passenger survey data from an airline, with fields ranging from passenger number to passenger satisfaction. The dataset contains 25 columns and 103904 rows of data. Details of the dataset can be seen in Figure 2.

Figure 2. Dataset Detail

2 Data Cleaning

The data cleaning stage prepares data for analysis by removing irrelevant or inappropriate data, that is, data that would have a negative impact on the model or algorithm to be built. Data cleaning is not only about discarding data; it can also be understood as a step to improve the data. In this study, data cleaning is done by checking for null values. Details of this stage can be seen in Table 1.

Table 1. Data Cleaning Process
Column                                         After Data Cleaning
Arrival_Delay_in_Minutes                       Filled with mean
Gender, Customer_Type, Type_of_Travel, Class   Filled categorical data

3 Exploratory Data Analysis (EDA)

The EDA stage is an essential process for conducting an initial investigation of the data, used to find patterns and anomalies and to test hypotheses with the help of statistical visualization. In this study, EDA was conducted to visualize data such as the number of satisfied or dissatisfied passengers based on the distance traveled and the type of trip. An example of the EDA stage can be seen in Figure 3.

Figure 3. Exploratory Data Analysis

4 Preprocessing

The pre-processing stage transforms or encodes the data so that it can be parsed easily by machine learning algorithms. The main goal of this stage is to make an accurate and predictive model possible. In this study, pre-processing was carried out to encode the data and remove the outliers contained in the dataset. Details of the data encoding process and the outlier removal process can be seen in Figure 4.

Figure 4. Preprocessing

5 Model Building

Random Forest (RF)

The random forest algorithm is an ensemble learning method for classification, regression, and other tasks that builds many decision trees during model training. For classification, the result of a random forest is the class chosen by most trees. In this study, the random forest algorithm was configured with max_depth=16, min_samples_leaf=1, min_samples_split=2, n_estimators=100, and random_state=12345. The process of building the random forest model can be seen in Figure 5.

Figure 5. Random Forest Detail

Adaptive Boosting (AdaBoost)

The adaptive boosting algorithm is an ensemble learning method that creates a model by first assigning an equal weight to each data point, then giving more weight to the incorrectly classified points.
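The reweighting idea described above can be illustrated with a single boosting round. The following is a minimal sketch of the textbook (discrete) AdaBoost weight update, not the authors' code and not necessarily the exact variant their library runs; the function name `adaboost_reweight` and the five-point toy data are hypothetical and introduced only for illustration.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One round of the textbook discrete AdaBoost weight update.

    weights       -- current sample weights, summing to 1
    misclassified -- booleans, True where the weak learner was wrong
    Returns (alpha, new_weights). Hypothetical helper for illustration.
    """
    # Weighted error of the weak learner under the current weights.
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < error < 1.0:
        raise ValueError("weighted error must be strictly between 0 and 1")
    # Vote strength of this learner: larger when its weighted error is smaller.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Up-weight the mistakes, down-weight the correct points, then renormalize.
    raw = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


# Toy data: five points with equal starting weights; the third is misclassified.
start = [1.0 / 5] * 5
alpha, updated = adaboost_reweight(start, [False, False, True, False, False])
print(round(alpha, 4), [round(w, 3) for w in updated])
```

In this toy round the single misclassified point ends up carrying about half of the total weight, while the four correctly classified points share the other half, so the next weak learner is pushed to focus on the earlier mistake.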
In this study, the AdaBoost algorithm was created with n_estimators=500 and random_state=12345. The process of building the AdaBoost model can be seen in Figure 6.

Figure 6. AdaBoost Detail

Extreme Gradient Boosting (XGBoost)

XGBoost is a scalable and easily distributed machine learning library based on Gradient-Boosted Decision Trees (GBDT). It provides parallel tree boosting and is one of the most frequently used machine learning libraries for regression and classification. In this study, the XGBoost algorithm was created with n_estimators=500 and a max_depth setting. The process of building the XGBoost model can be seen in Figure 7.

Figure 7. XGBoost Detail

RESULT AND DISCUSSION

1 Exploratory Data Analysis

The results of the Exploratory Data Analysis stage are divided into three points. The first point compares the number of satisfied and dissatisfied passengers by gender. The second point compares the number of satisfied and dissatisfied passengers based on baggage handling service and terminal location (Gate location). The third point compares the number of satisfied and dissatisfied passengers based on in-flight entertainment (Inflight Entertainment) and in-flight Wi-Fi service (Inflight wi-fi).

On the first point, Figure 8 shows that satisfied and dissatisfied passengers are evenly distributed across genders, and that overall there are more dissatisfied passengers than satisfied passengers.

Figure 8. Comparison of Gender

On the second point, Figure 9 shows that in business class more passengers are dissatisfied when Baggage Handling is rated poorly (a rating at or below a threshold). For Eco Plus and Eco classes, passengers are dissatisfied when the terminal location is poor or far away (a low Gate location rating), even if Baggage Handling is average (ratings in a range starting at 2).

Figure 9. Comparison Based on Baggage Handling

On the third point, Figure 10 shows that Eco Plus passengers are more satisfied with flights without Wi-Fi service and with average in-flight entertainment (ratings in a range starting at 2). For business class passengers, only the best in-flight entertainment can satisfy them. For Eco passengers, good in-flight entertainment (ratings in a range starting at 3) and the best Wi-Fi service can make them satisfied.

Figure 10. Comparison Based on Inflight Media and Wi-Fi

2 Important Features

In this study, the essential features of the dataset were obtained using Permutation Importance. From this method, it is concluded that the five most important features include the type of travel (Type of Travel), in-flight Wi-Fi service (Inflight WiFi Service), online boarding service, and seat comfort. Details of the essential features can be seen in Figure 11.

Figure 11. Important Features

3 Model Result and Comparison

Model Result

This study uses three algorithms to carry out the classification: Random Forest, AdaBoost, and XGBoost. The results of model fitting are divided into three points.

The first point is the Random Forest algorithm, with an accuracy of about 89.4% and a ROC-AUC value of 90%. Its confusion matrix contains 12375 true negatives, 2197 false positives, 553 false negatives, and 10850 true positives. Random Forest also reaches 90% precision in this case. Details of the results can be seen in Figure 12.

Figure 12. Result of Random Forest

The second point is the AdaBoost algorithm, with an accuracy of about 89.5% and a ROC-AUC value of about 90%. Its confusion matrix contains 12620 true negatives, 1953 false positives, 761 false negatives, and 10642 true positives. AdaBoost also scores 90% precision in this case. Details of the results can be seen in Figure 13.

Figure 13. Result of AdaBoost

The third point is the XGBoost algorithm, with an accuracy of about 89.1% and a ROC-AUC value of about 89%. Its confusion matrix contains 12264 true negatives, 2309 false positives, 521 false negatives, and 10882 true positives. XGBoost also scores 89% precision in this case. Details of the results can be seen in Figure 14.

Figure 14. Result of XGBoost

Model Comparison

After modeling is complete and the model fitting results have been obtained, the models are compared based on their ROC-AUC value and their creation time. In this study, the Random Forest and AdaBoost algorithms have the same ROC-AUC value of 90%, while the XGBoost algorithm has a value of 89%. The algorithm with the shortest creation time is Random Forest, and the one with the longest is XGBoost. The comparison of the models is visualized in Figure 15, where the bar chart represents the creation time and the line represents the ROC-AUC value.

Figure 15. Algorithm Comparison

CONCLUSION

With technology now influencing every industrial field, airlines can use Machine Learning to find the essential points that make passengers feel satisfied with airline services. The airline can also classify the rating given by a passenger to find out whether the passenger is satisfied with the service provided. It can be concluded that the best model in this case study is Random Forest, which has a ROC-AUC value of 90% and the shortest model creation time of the algorithms built in this study.

BIBLIOGRAPHY