JUSIKOM PRIMA (Jurnal Sistem Informasi dan Ilmu Komputer Prima) Vol. 6 No. August 2022 E-ISSN : 2580-2879

THYROID DISEASE CLASSIFICATION ANALYSIS USING XGBOOST MULTICLASS

Haris Samuel Pranada Panjaitan, Agustinus Gulo, Ahmad Haikal Alfi, Okta Jaya Harmaja, Evta Indra*
Program Studi Sistem Informasi, Fakultas Teknologi dan Ilmu Komputer, Universitas Prima Indonesia
Jalan Sampul
E-mail : * oktajaya. h@unprimdn.

ABSTRACT - Sickness is an unusual condition of the body or mind that causes discomfort, malfunction, or suffering to the sick person. One disorder that can arise from a lack of attention to health is thyroid disease. The thyroid is a butterfly-shaped endocrine gland at the base of the neck. Diagnosing thyroid disease is complicated because its symptoms fluctuate with the rise and fall of thyroid hormones, which increase the use of oxygen by the body's cells. A thyroid examination by a doctor and proper interpretation of clinical data are therefore required to identify thyroid disease. However, a doctor's limitations, such as age and time constraints, can lead to incomplete interpretation of patient clinical data. Therefore, a study was conducted on thyroid disease classification to simplify and speed up the diagnosis of thyroid disease using the Xgboost Multiclass method, which is expected to reach an accuracy above 90%.

Keywords: Classification, Thyroid, Xgboost Multiclass, Machine Learning

INTRODUCTION

In 2015 in the United States, there were 62,450 cases of thyroid cancer, with 3 out of 4 cases occurring in women. Based on data from the Cancer Registration Agency of the Indonesian Cancer Foundation in 2005, thyroid cancer ranks 9th out of 10 malignant tumors and is the most common type of endocrine gland malignancy in Indonesia. Sickness is an unusual condition of the body or mind that causes discomfort, malfunction, or suffering to the sick person.
A disorder that can arise from a lack of attention to health is thyroid disease. The thyroid is a butterfly-shaped endocrine gland at the base of the neck. The thyroid gland's job is to make thyroid hormones that help regulate the body's metabolism. Diagnosing thyroid disease is complicated because its symptoms fluctuate with the level of thyroid hormone, which increases the use of oxygen by the body's cells. A thyroid examination by a doctor and proper interpretation of clinical data are therefore required to identify thyroid disease. However, a doctor's limitations, such as age and time constraints, can lead to incomplete interpretation of patient clinical data. Therefore, an analysis of thyroid disease classification is needed to simplify and speed up the process of diagnosing thyroid disease.

Thyroid disease has been analyzed in previous research. The first study, titled "Comparison of the Naive Bayes Data Mining Algorithm and Bayes Network To Identify Thyroid Disease", assessed the performance of the two algorithms by applying the Cross-Validation and Split Percentage testing techniques and using a confusion matrix to measure the accuracy of both methods. The Bayes Network method has greater accuracy with a value 49%, than the Naive Bayes method with a value of 91. The second study, titled "Implementation of the J48 Algorithm for Thyroid Disease Detection", used the J48 algorithm to predict thyroid disease with 7,200 data records, yielding an accuracy value of 87

It is necessary to develop a method for classifying thyroid disease. The researchers are therefore interested in conducting a study entitled "Analysis of Thyroid Disease Classification Using Xgboost Multiclass".

METHOD

1 Method

This study classifies thyroid disease using the Extreme Gradient Boosting (XGBoost) Multiclass method.
XGBoost is an enhanced technique based on gradient-boosted decision trees that can build boosted trees quickly and in parallel. Using the previous "weaker" classifiers as a foundation, the ensemble technique aims to produce a robust classifier. Subsequent predictors correct the errors of the earlier models as models are added on top of each other iteratively, and this process continues until the model fits the training data well. In short, gradient descent is used to update the model. The XGBoost implementation provides several advanced features for model customization. It is capable of performing the three main types of gradient boosting (Gradient Boosting (GB), Stochastic GB, and Regularized GB) and is robust enough to allow fine-tuning and the addition of regularization parameters. In a regression tree, each inner node represents a test on an attribute value, while each leaf node carries a score. The prediction result is the sum of the scores predicted by the K trees, as illustrated in the equation:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}$$

The model is trained by minimizing the objective

$$\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $l$ is the differentiable loss function that measures how well the model fits the training data set and $\Omega$ is the term that determines the complexity of the model. As the complexity of the model increases, the corresponding score decreases.

2 Research Flow

To keep the research on the classification of thyroid disease on track, the researchers drew up a research flowchart, which can be seen in Figure 1.

RESULTS AND DISCUSSION

1 Problem Analysis

The method analysis in this study used the XGBoost multiclass approach to identify patients with different thyroid-related disorders based on their age, sex, and medical information, including findings of thyroid hormone levels in the blood.
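To make the additive-tree idea from the Method section concrete, here is a minimal sketch of plain gradient boosting with squared-error loss and one-split trees (stumps). It is not XGBoost itself: it leaves out the complexity term, the second-order updates, and the multiclass objective. The toy data, the function names, and the values `n_trees=20` and `learning_rate=0.3` are all assumptions chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals.

    Returns (threshold, left_value, right_value) with the lowest squared error.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # split rule: x <= t goes left
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv


def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    preds = [0.0] * len(xs)
    trees, losses = [], []
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        # The ensemble prediction is the (scaled) sum of all tree scores.
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return trees, losses


# Hypothetical toy data, not taken from the thyroid dataset.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]
trees, losses = boost(xs, ys)
print(f"training MSE after 1 tree: {losses[0]:.3f}, after {len(trees)} trees: {losses[-1]:.3f}")
```

Each new stump is fitted to what the current ensemble still gets wrong, so the training error cannot increase from one round to the next; XGBoost adds regularization on top of this loop to keep the trees from simply memorizing the training data.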
2 Data Analysis

The data were obtained from the UCI machine learning repository. The repository includes various text files with different subsets of the data. One of them has information on 9000 different individuals along with medical diagnoses drawn from 20 possible classes. These classes fall into 7 types of diagnosis: negative diagnosis, hyperthyroid condition, hypothyroid condition, binding protein, non-thyroidal, undergoing replacement therapy, and discordant results.

Figure 2. Dataset

Figure 1. Research Flowchart (Start → Data Acquisition: Source EKG Kaggle → Data Normalization from UCI Machine Learning → Attribute Determination → Train and Test → Accuracy Results of the Xgboost Multiclass Method → Finish)

Data Acquisition
Data acquisition is retrieving the data from the UCI machine learning repository and loading it into Google Colab for processing the thyroid disease classification data.

UCI Machine Learning Data Normalization
In this process, normalization is performed on the dataset taken from the UCI machine learning repository, with a total of 7200 instances.

Attribute Determination
At this stage, the attributes that will be used in the study of thyroid disease classification are determined.

Train And Test
At this stage, the processed data is tested to obtain accurate results in classifying thyroid disease.

Accuracy Results
After the classification process is complete, the last step is to obtain an accuracy value, which shows how accurate the Xgboost Multiclass method is in classifying thyroid disease.

3 Data Processing

In data processing, the first stage is to import the libraries and the dataset into Google Colab; the second is to clean the data, for example by replacing null values with median values; the third is data visualization; and the last is splitting the data into training data and test data to obtain the accuracy value for the classification of thyroid disease. The flow of data processing can be seen in Figure 3 below:

Start → Import Library →
Import Dataset → Data Normalization → Data Visualization → Data Correlation → Missing Value Investigation → Split Test and Train Data → Xgboost Multiclass → Results → Finish

Figure 3. Data Processing Flowchart

1 Import Library

At the import library stage, all the libraries that will be used to process the thyroid disease classification data are loaded in Google Colab. The libraries to be imported can be seen in Figure 4 below:

Figure 4. Import Library

2 Import Dataset

At this stage, the dataset that has been downloaded from the UCI machine learning repository is imported into Google Colaboratory. The dataset import process can be seen in Figure 5 below:

Figure 5. Import Dataset

The dataset contains several text files with different data types. One contains information on 9000 medical patient diagnoses consisting of 20 possible classes. These classes have 7 different types of diagnosis, such as Negative Diagnosis, Hyperthyroid Conditions, Binding Protein, NonThyroidal, Undergoing Replacement Therapy, and Discordant Result.

3 Data Normalization

The data normalization, or data cleaning, process is carried out at this stage. The data cleaning stages were removing 4 diagnostic targets, deleting redundant (repeated) columns, and observing patients aged over 100 years.

Patient Age Observation
The first step is to observe the patient's age. After the observation, the ages of patients above 100 are changed to blank because they have a negative diagnostic value. The patient's age to be processed is less than 60 years old.

Figure 7. Age Observation And Change Age
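The cleaning rules of this section (keeping only the negative, hyperthyroid, and hypothyroid targets, blanking ages above 100, and dropping the redundant columns) can be sketched in pandas as follows. This is a minimal sketch on made-up rows, not the authors' notebook code: the `age` and `TSH` column names, the class label strings, and every value are assumptions, while the redundant column names come from the text. The final median fill follows the "replace null values with median values" step mentioned under Data Processing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real dataset.
df = pd.DataFrame({
    "Patient_id": [1, 2, 3, 4, 5],
    "age": [34, 455, 61, 29, 72],
    "TSH": [1.2, 0.02, 9.8, 2.1, 1.7],
    "TSH_Measured": ["t", "t", "t", "t", "t"],
    "target": ["negative", "hyperthyroid", "hypothyroid", "discordant", "negative"],
})

# Stage 1: keep only the three diagnoses retained in the study.
kept_targets = ["negative", "hyperthyroid", "hypothyroid"]
df = df[df["target"].isin(kept_targets)].copy()

# Stage 2: treat implausible ages (above 100) as missing.
df["age"] = df["age"].astype("float64")
df.loc[df["age"] > 100, "age"] = np.nan
n_blank_ages = int(df["age"].isna().sum())

# Stage 3: drop the redundant columns named in the text, if present.
redundant = [
    "TSH_Measured", "T3_Measured", "TT4_Measured", "T4U_Measured",
    "FTI_Measured", "TBG_Measured", "Patient_id",
]
df = df.drop(columns=[c for c in redundant if c in df.columns])

# Stage 4: fill the remaining nulls in numeric columns with the column median.
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```

Blanking the impossible ages first and imputing afterwards keeps a data-entry error such as 455 from distorting the median that is used to fill the gaps.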
The other normalization stages are described below.

Drop Diagnosis
Several inconclusive diagnoses were dropped because they accounted for less than 3% of the total data set. The researchers therefore decided to keep only the observations for patients with a negative diagnosis, hyperthyroidism, or hypothyroidism.

Figure 6. Drop Diagnosis

Drop Column Redundancy
The dataset contains many redundant (repeated) columns, which degrade the quality of the data, so the redundancy is removed by dropping several columns: TSH_Measured, T3_Measured, TT4_Measured, T4U_Measured, FTI_Measured, TBG_Measured, and Patient_id.

Figure 8. Drop Column Redundancy

4 Data Visualization

The researchers carried out Exploratory Data Analysis to see the distribution of hormone levels in the blood for each target class of patients. This process shows how good a predictor each of these attributes is.

Figure 9. Plots Numerical Attributes Vs Target

Figure 9 shows that FTI, T3, and TT4 will be excellent additions to the research model. TSH is also good, but the outliers for the 'hypo' target need to be handled and the attribute distributions analyzed further before generating results. This is all in line with what is known about hormone level tests. The next step is to make pair plots of the numeric variables and see whether clusters form between variables. The pair plots can be seen in Figure 10 below.

Figure 10. Pair Plots Numerical Vs Target

On the diagonal of the pair plot, we can see the distribution of each numeric variable. It is apparent how unbalanced the dataset is, with far more negative 'targets' than hypothyroid or hyperthyroid ones. To handle the imbalance in the target class, the data need to be resampled or another model used, here the Xgboost Multiclass method.
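Class imbalance of this kind is why plain accuracy and balanced accuracy can tell different stories. The sketch below assumes the usual definition of balanced accuracy (the mean of the per-class recalls); the class counts are hypothetical and are not the dataset's real counts.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of all samples that were labeled correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels: 95 negative, 3 hypothyroid, 2 hyperthyroid.
y_true = ["negative"] * 95 + ["hypothyroid"] * 3 + ["hyperthyroid"] * 2
# A useless model that always answers "negative".
y_pred = ["negative"] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy = {acc:.2f}, balanced accuracy = {bal_acc:.2f}")
```

A model that never detects a single thyroid case still scores 95% accuracy here, while its balanced accuracy is only one third, which is why both numbers are worth reporting for a dataset dominated by negative diagnoses.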
5 Data Correlation

To examine the correlations in the data, the researchers use the Dython library as a tool for exploring multivariate correlations. This is a convenient way to explore the relationships between variables of different types in a data set. The correlation data help the researchers get an idea of how the variables correlate as part of the Exploratory Data Analysis (EDA).

Figure 11. Data Correlation

From the EDA data correlation in Figure 11, it is concluded that the hormone tests are the most helpful in predicting the target diagnosis.

6 Investigation Missing Value

First, the researchers did some calculations to determine the severity of the missing data. The source code function in Figure 12 below takes a data frame as input, counts the missing values per column, calculates the percentage of missing values in each column, and summarizes the information in an easy-to-view output data frame.

Figure 12. Investigation Missing Value

7 Split Data Train and Test

At this stage, the dataset is split 80:20, with 80% for training data and 20% for test data, with a random state of 42 in Google Colab. The data splitting process can be seen in Figure 13 below:

Figure 13. Splitting Data

8 Xgboost Multiclass

Xgboost Class 1
For Xgboost Class 1, using default hyperparameter settings such as num_class = 3, missing = 1, early_stopping_rounds = 10, eval_metric, seed = 42, we get 99% accuracy and 94% balanced accuracy.

Figure 14. Xgboost Class 1

Xgboost Class 2
Xgboost Class 2 uses optimized hyperparameter settings such as num_class = 3, missing = 1, gamma = 0, learning_rate = 0., max_depth = 3, reg_lambda = 1, subsample = 1, colsample_bytree = 1, early_stopping_rounds = 10, eval_metric, seed = 42, and gets the same results as Class 1: 99% accuracy and 94% balanced accuracy.
CONCLUSION

In analyzing thyroid disease classification using the multiclass algorithm, the xgboost multiclass version 3 algorithm was found to be the best classifier, with an accuracy score of 99% and balanced accuracy of 97%, followed by xgboost versions 1 and 2 with 99% accuracy and balanced accuracy of 94%. The researchers hope that future work will develop methods that produce more accurate thyroid disease classification results. It is also hoped that the predictive analysis of thyroid disease will be extended with additional attributes.

BIBLIOGRAPHY