JUSIKOM PRIMA (Jurnal Sistem Informasi dan Ilmu Komputer Prima)
Vol. 6 No. August 2022 E-ISSN: 2580-2879

Comparison of Classification Algorithm in Predicting Stroke Disease

1Fenna Kemala Hutabarat, 2Daniel Ryan Hamonangan Sitompul, 3Stiven Hamonangan Sinurat, 4Andreas Situmorang, 5Ruben, 6Dennis Jusuf Ziegel, 7Evta Indra*
1,2,3,4,5,6,7 Sistem Informasi, Fakultas Teknologi dan Ilmu Komputer, Universitas Prima Indonesia
Jl. Sampul No. Sei Putih Bar. Medan Petisah, Kota Medan
E-mail: *evtaindra@unprimdn.

ABSTRACT
To prevent stroke, we need a way to predict from medical parameters whether someone is likely to have a stroke. With the influence of technology in the medical world, stroke can be predicted using the Data Science method, which starts with Data Acquisition, followed by Data Cleaning, Exploratory Data Analysis, and Preprocessing, and ends with Model Building. Based on the models that have been built, the algorithm with the best performance in this case is XGBoost, with a precision value of 0.9, a recall value of 0.95, an F1 value of 0. and a ROC-AUC value of 0.978 after five-fold cross-validation. With these results, the model can be used to make predictions in real time.

Keywords: Machine Learning, Logistic Regression, Random Forest, XGBoost, Stroke

INTRODUCTION
A stroke is a condition in which blood flow to the brain is blocked, which can cause cell death. According to data from the World Health Organization (WHO), stroke is the second leading cause of death worldwide, accounting for 11% of total deaths. A stroke is very deadly if it is not treated quickly and appropriately. To prevent stroke, we need a way to predict from medical parameters whether someone is likely to have a stroke.
With the influence of technology in the medical world, stroke can be predicted using Machine Learning methods based on classification algorithms: given predetermined medical parameters, a trained Machine Learning model can predict whether a person has a stroke.

Many previous studies have carried out disease prediction, for example "Prediction of Heart Disease Using Machine Learning" and "Long Short-Term Memory Recurrent Neural Network for Stroke Prediction." The first study presents a Neural Network model with a Multi-Layer Perceptron (MLP) that can predict heart disease; the output of its system is a predictive result in the form of "YES/NO." The second study presents a Recurrent Neural Network model with Long Short-Term Memory (LSTM) that can predict stroke; its output is a detailed evaluation of the model, such as the values of Precision, Recall, F1, and Accuracy.

Based on the problem described above, to enable early detection of stroke, the authors propose analyzing the medical parameters that can cause stroke and building Machine Learning models that can predict stroke. The parameters used in predicting stroke range from age to smoking status. In this research, several classification algorithms are built: Logistic Regression, Random Forest Classifier, and Extreme Gradient Boosting Classifier. The algorithm with the best precision, recall, F1, and ROC-AUC values is considered the best algorithm in this case.

METHODOLOGY
This research was conducted at the Data Analyst Laboratory of Prima Indonesia University. The notebook environment used in this research is Google Colaboratory. The workflow of this research can be seen in Figure 1. This research began with data collection from the website of the dataset provider, namely Kaggle.
After the data is obtained, the Data Cleaning process is carried out to normalize the data. Then the Exploratory Data Analysis (EDA) stage is carried out, in which the dataset is visualized. After the EDA, the preprocessing stage prepares the data for model training. The last stage is Model Building, in which classification models such as Logistic Regression, Random Forest Classifier, and Extreme Gradient Boosting Classifier are built.

Figure 1. Research Methodology

Data Acquisition
The Data Acquisition stage takes data from good sources before the data is processed and used to create a Machine Learning model. This process ends by loading the data into a Data Frame in the notebook. In this study, the dataset comes from Kaggle, from Fedesoriano's "Stroke Prediction Dataset" repository. This dataset contains 5110 rows and 12 columns, which describe parameters/features/factors associated with stroke. Details of the dataset can be seen in Figure 2.

Figure 2. Dataset Detail

Data Cleaning
The Data Cleaning stage is the process of detecting, repairing, or deleting invalid data to prepare the dataset for analysis (EDA). Invalid data can harm modeling and analysis. In this study, the data cleaning process filled in the empty values (NULL values) in the BMI column of the dataset with the column's average value (mean). Details of this stage can be seen in Figure 3.

Figure 3. Data Cleaning Process

Exploratory Data Analysis (EDA)
The Exploratory Data Analysis stage is an approach to analyzing the data and summarizing its characteristics. Graphics and visualizations are commonly used in EDA. The primary purpose of EDA is to examine what the data can reveal outside of formal modeling and hypothesis testing. In this study, the EDA stages range from the distribution of stroke by sex to stroke by age and heart disease.
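The cleaning step and the kind of group-wise summary used in the EDA can be sketched as follows. This is a minimal illustration, not the authors' notebook: the six-row DataFrame is hypothetical sample data, and the column names `gender`, `hypertension`, `bmi`, and `stroke` are assumed to match those of the Kaggle dataset.

```python
import pandas as pd

# Hypothetical miniature sample (assumption); the real dataset has
# 5110 rows and 12 columns.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "hypertension": [0, 1, 0, 1, 0, 0],
    "bmi": [22.0, None, 31.0, 28.0, None, 25.0],
    "stroke": [0, 1, 0, 0, 0, 1],
})

# Data cleaning: fill missing BMI values with the column mean.
# Series.mean() skips missing values by default.
bmi_mean = df["bmi"].mean()
df["bmi"] = df["bmi"].fillna(bmi_mean)

# EDA: because stroke is coded 0/1, the group mean is the share of
# stroke cases within each group.
stroke_rate_by_gender = df.groupby("gender")["stroke"].mean()
stroke_rate_by_hypertension = df.groupby("hypertension")["stroke"].mean()

print(stroke_rate_by_gender)
print(stroke_rate_by_hypertension)
```

The same group-wise means could be passed to a plotting library to produce bar charts of stroke share per group.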
Some examples of the EDA stages can be seen in Figure 4.

Figure 4. Exploratory Data Analysis Process

Preprocessing
The preprocessing stage prepares relevant data to build and train Machine Learning models. This step also means converting the data into a more readable format for the model. Preprocessing also helps improve data quality. In this study, the Preprocessing stage includes Label Encoding. Details of this stage can be seen in Figure 5.

Figure 5. Preprocessing

Model Building
This study uses three classification algorithms to predict stroke from medical parameters: Logistic Regression, Random Forest Classifier (RFC), and Extreme Gradient Boosting Classifier (XGBoost).

Logistic Regression
Logistic Regression is one of the Machine Learning algorithms used to carry out the classification process. This algorithm calculates a probability from a dataset consisting of dependent and independent variables. Since the output is a probability, the dependent variable is limited to values between 0 and 1. In this study, the Logistic Regression algorithm was created with random_state set to 22; all other settings were left at their defaults.

Figure 6. Logistic Regression Model Making

Random Forest (RF)
Random Forest is a decision tree-based Machine Learning algorithm used to perform regression and classification. Random Forest consists of many decision trees that work as a group. Each decision tree in this algorithm produces a class prediction, and the class with the most votes becomes the prediction of the model. In this study, Random Forest was created with the configuration random_state 22 and max_depth 5.

Figure 7. Random Forest Model Making

Extreme Gradient Boosting (XGBoost)
XGBoost is an implementation of the Gradient Boosted Decision Tree. In this algorithm, the decision trees are built sequentially. Weights have an essential role in XGBoost: weights are assigned to all independent variables used to make predictions. In this study, the XGBoost algorithm was created using the configuration random_state 22, max_depth 5, objective "binary:logistic", and eval_metric "logloss".

Figure 8. XGBoost Model Making

RESULT AND DISCUSSION
Exploratory Data Analysis (EDA)
Exploratory Data Analysis in this study is divided into six parts: the distribution of strokes by gender, the ages commonly affected by stroke, the BMI levels commonly affected by stroke, the glucose levels commonly affected by stroke, a comparison of strokes by hypertension, and a comparison of strokes by heart disease.

The first part, the distribution of stroke by gender, can be seen in Figure 9. From the diagram, it can be concluded that the dataset used in this study has 59% female data and 41% male data. Of the 59% of female data, 95% did not have a stroke, and 5% had a stroke. Of the 41% of male data, 95% did not have a stroke, and 5% had a stroke.

Figure 9. Stroke Based on Gender

In the second part, a visualization of the ages commonly affected by stroke can be seen in Figure 10. From the diagram, it can be concluded that people aged 0 to 20 are generally not prone to stroke, and that the ages most susceptible to stroke are 40 to 80. Based on the data in the dataset, ages 75 to 80 are the most prone to stroke.

Figure 10. Stroke Based on Age

In the third part, a visualization of the BMI levels that generally suffer a stroke can be seen in Figure 11. From the diagram, it can be concluded that someone with a BMI of 30 has a higher chance of stroke than people at other BMI levels.

Figure 11. Stroke Based on BMI

In the fourth part, the visualization of glucose levels that are commonly affected by stroke can be seen in Figure 12.
From the visualization, it can be concluded that glucose levels in the range of 50-100 are more common in people who do not have strokes, while glucose levels in the range of 200-250 are the levels most susceptible to stroke.

Figure 12. Stroke Based on Glucose Level

In the fifth part, a visualization of the comparison of strokes based on hypertension can be seen in Figure 13. From this visualization, it can be concluded that people with hypertension are more prone to stroke than people without hypertension. Among people with hypertension, 13.3% have had a stroke, compared with 4% of people without hypertension.

Figure 13. Stroke Based on Hypertension

In the sixth part, a visualization of the comparison of strokes based on heart disease can be seen in Figure 14. From the visualization, it can be concluded that people with heart disease are more prone to stroke than people without heart disease. Among people with heart disease, 17% have had a stroke, compared with 4.2% of people without heart disease.

Figure 14. Stroke Based on Heart Disease

Feature Importance
Feature importances in this study are determined separately for each algorithm. For the Logistic Regression algorithm, the three most important features are age, gender, and type of residence. For the Random Forest algorithm, the three most important features are age, age group, and average glucose level. For the XGBoost algorithm, the three most important features are age, glucose level group, and age group. Details of the results can be seen in Table 1.

Table 1. Feature Importances Detail
Algorithm              Feature Importances
Logistic Regression    age, gender, residence_type
Random Forest          age, age_group, avg_glucose_level
XGBoost                age, glucose_group, age_group

Model Building Result
Logistic Regression
The trained Logistic Regression model has a precision value of 0.79, a recall value of 0.83, an F1 value of 0.83, and a ROC-AUC value of 0.81 after five-fold cross-validation.

Figure 15. Logistic Regression Result Detail

Random Forest Classifier
The trained Random Forest model has a precision value of 0.78, a recall value of 0.9, an F1 value of 0.84, and a ROC-AUC value of 0.921 after five-fold cross-validation.

Figure 17. XGBoost Result Detail

CONCLUSION
With the influence of technology in the medical world, stroke can now be predicted using Machine Learning methods based on classification algorithms: given predetermined medical parameters, a trained Machine Learning model can predict whether a person has a stroke. After building three algorithms that are commonly used for classification, the results are as follows. Logistic Regression achieved a precision value of 0.79, a recall value of 0.83, an F1 value of 0.83, and a ROC-AUC value of 0.81 after five-fold cross-validation. Random Forest achieved a precision value of 0.78, a recall value of 0.9, an F1 value of 0.84, and a ROC-AUC value of 0.921 after five-fold cross-validation. XGBoost, as reported in the abstract, achieved a precision value of 0.9, a recall value of 0.95, and a ROC-AUC value of 0.978 after five-fold cross-validation. It can be concluded that the best-performing algorithm in this case study is XGBoost. This algorithm is suitable for use with pipelines to make real-time predictions.
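As an illustration of the comparison summarized in the conclusion, the following minimal sketch evaluates the classifiers with five-fold cross-validation on precision, recall, F1, and ROC-AUC. It is not the authors' notebook: the data is a synthetic stand-in generated with `make_classification` (an assumption, in place of the preprocessed stroke features), only the hyperparameters stated in the methodology are set, and XGBoost is included only if the `xgboost` package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the preprocessed stroke features (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=22)

# Configurations as stated in the methodology; everything else is default.
models = {
    "Logistic Regression": LogisticRegression(random_state=22),
    "Random Forest": RandomForestClassifier(random_state=22, max_depth=5),
}

# XGBoost is an optional dependency; skip it if it is not installed.
try:
    from xgboost import XGBClassifier

    models["XGBoost"] = XGBClassifier(
        random_state=22,
        max_depth=5,
        objective="binary:logistic",
        eval_metric="logloss",
    )
except ImportError:
    pass

scoring = ["precision", "recall", "f1", "roc_auc"]
results = {}
for name, model in models.items():
    # cross_validate returns one score per fold under "test_<metric>".
    scores = cross_validate(model, X, y, cv=5, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

for name, metrics in results.items():
    print(name, {m: round(float(v), 3) for m, v in metrics.items()})
```

Averaging the per-fold scores gives one value per metric and model, which is the form in which the results above are reported. The numbers printed by this sketch depend on the synthetic data and are not comparable to the values in the paper.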
BIBLIOGRAPHY