Institut Riset dan Publikasi Indonesia (IRPI)
MALCOM: Indonesian Journal of Machine Learning and Computer Science
Journal Homepage: https://journal.id/index.php/malcom
Vol. 3 Iss. 2 October 2023, pp: 188-198
ISSN(P): 2797-2313 | ISSN(E): 2775-8575

Determining the Final Project Topic Based on the Courses Taken by Using Machine Learning Techniques

Vicky Salsadilla1*, Inggih Permana2, Muhammad Jazman3, Afdal4
1,2,3,4 Program Studi Sistem Informasi, Fakultas Sains dan Teknologi, Universitas Islam Negeri Sultan Syarif Kasim Riau, Indonesia
E-mail: 111950321586@students.uin-suska.id, inggihpermana@uin-suska.id, muhammadjazman@uin-suska.id, m.afdal@uin-suska.id

Received Aug 04th 2023, Revised Sept 25th 2023, Accepted Oct 10th 2023
Corresponding Author: Vicky Salsadilla

Abstract

A thesis (final project, TA) is a scientific paper based on a problem, and it must be completed by students who wish to finish their studies. Students often have difficulty determining the TA topic they want to research. To address this, this research tries to determine TA topics using machine learning (ML) techniques based on the elective courses that students have taken; elective courses are one form of academic data that can be considered when choosing a TA topic. The ML algorithms used are KNN, NBC, ANN, SVM, C4.5, Random Forest, and Logistic Regression. The dataset used in this research is imbalanced, so it is balanced using the Random Oversampling (ROS) method and the Random Undersampling (RUS) method. The experiments show that datasets balanced using ROS produce much higher ML performance but tend to overfit due to data duplication in the dataset, while leaving the dataset unbalanced yields very low ML performance. Therefore, for unbalanced data, it is recommended to use the RUS method for data balancing. The highest accuracy results for the algorithms on the dataset balanced using RUS are ANN = 69.7%, RF = 66.7%, SVM = 57.6%, LR = 57.6%, NBC = 42.4%, C4.5 = 42.4%, and KNN = 33.3%.

Keywords: Machine Learning,
Random Oversampling, Random Undersampling, Thesis

INTRODUCTION

As a student, completing a thesis or final assignment (TA) is a crucial step towards finishing one's studies. It is a form of scientific writing that requires students to thoroughly investigate an existing problem or phenomenon and to test its validity using data that has been collected and processed, with the aim of producing reference material that can be used in the future. A TA also reports research results from the field or from literature studies. Through this research, it is hoped that students will be able to solve problems scientifically and develop their insight. Before preparing a TA, students must of course go through the process of determining the topic they want to research. The large amount of discussion and material studied during lectures makes it difficult for students to determine which research topic to take for their thesis. A topic is the idea that underlies a TA, and it is usually the benchmark for the discussion written by an author. Because of this, some students feel they chose the wrong research topic midway through and end up changing their TA topic. Apart from the lecture material that has been studied, topics are usually also chosen according to students' abilities, for example through analysis of academic data in the form of grades obtained from the beginning to the end of their studies; this is expected to help students determine appropriate TA topics. Students also commonly choose TA topics through specialization in elective courses, which supports deciding what they want to research. By choosing the right topic, students can make the most of the TA process and complete their studies on time. Based on the explanation above, this research uses machine learning (ML) to classify TA topics based on the elective courses that have been taken.
It is hoped that this classification can help students determine TA topics. Seven machine learning algorithms are used, namely: K-Nearest Neighbor (KNN), Naive Bayes Classifier (NBC), Artificial Neural Network (ANN), Support Vector Machine (SVM), C4.5, Random Forest (RF), and Logistic Regression (LR). The KNN, NBC, SVM, and C4.5 algorithms were chosen because they are among the most frequently used algorithms. The LR algorithm is used because it can estimate class probabilities, can update a linear model with new data, and provides an interpretable analysis process for target classification; in addition, LR results are not strongly affected by small amounts of noise in the data. ANN is used because it can predict with very high accuracy. KNN, NBC, SVM, and C4.5 are likewise reported to achieve high accuracy, which suits this research. Last but not least, RF is used because it combines several decision trees, each built from a random subset of the data, with each node split chosen from a random subset of the features. The aim of using machine learning in this research is to let the model learn from the data itself; much research has explored how machines can learn without being explicitly programmed. However, in the dataset used here, class imbalance occurs, which is one of the problems that can arise in ML and causes the resulting model to perform poorly. The bias of ML towards majority-class instances can be overcome by balancing the data with data-level techniques, which modify the dataset directly before ML reaches the evaluation stage, so that the unequal class distribution is balanced.
This process is divided into two categories, namely Random Oversampling and Random Undersampling, both of which are applied in this research.

MATERIALS AND METHOD

In general, this research is divided into four phases, namely: (1) the data collection phase, (2) the data preprocessing phase, (3) the data balancing phase, and (4) the machine learning implementation phase. These phases are shown in Figure 1.

Figure 1. Research methodology: Start → Data Collection Phase → Data Preprocessing Phase → Data Balancing Phase → Machine Learning Implementation Phase → Finish

The Data Collection Phase

First, questionnaires were distributed via Google Forms. The respondents in this research were students of the Information Systems study program, class of 2019. The questions asked were: (1) which elective courses the students had taken, (2) which TA topic they had taken, and (3) whether they felt they had chosen the correct TA topic. For more details, see Table 1. The selected TA topics are then used as the classes in the dataset.

Table 1. List of questions
1. What are the elective courses you have taken?
   Answers are multiple choice: Data Mining (DM), code A1; Sistem Informasi Intelijen (SII), code A2; Customer Relation Management (CRM), code A3; Business Intelligence (BI), code A4; Knowledge Management (KM), code A5; E-business, code A6; IT Audit, code A7; ERP M1, code A8; ERP M2, code A9; Geographic Information System (GIS), code A10.
2. What is the topic of your chosen thesis?
   Answers are multiple choice: Analisa Proses Bisnis (APB), Evaluasi SI (ESI), Data Mining (DM), Customer Relation Management (CRM), Rekayasa Perangkat Lunak (RPL), Knowledge Management (KM), Manajemen Risiko (MR).
3. Is your current thesis topic the right one?
   Answers are multiple choice: Yes or No.

Data Pre-processing Phase

In data pre-processing, data selection and data transformation are carried out as follows:
Data Selection

Based on the data collection, a dataset of 70 rows was obtained. The dataset was filtered by keeping only the rows where the answer to question number 3 (see Table 1) was "Yes", leaving 64 rows of data.

Data Transformation

At the data transformation stage, the dataset was reshaped into the form shown in Table 2. In this table, for columns A1 to A10, a value of 1.0 means the student took the elective course corresponding to that column name, while 0.0 means the student did not take that course.

Table 2. Data transformation (rows D1 to D64; columns A1 to A10 hold the binary course indicators, and the column "Topik TA" holds the class, e.g. APB)

Data Balancing Phase

Data balancing is carried out to balance the amount of data in each class. The balancing techniques used are the Random Oversampling (ROS) technique and the Random Undersampling (RUS) technique. The ROS technique enlarges the minority class by randomly duplicating existing data, while the RUS technique shrinks the majority class by randomly removing existing data. The data balancing process is carried out using the Orange Data Mining software. The results of data balancing can be seen in Table 3.

Table 3. The amount of data per TA topic (APB, ESI, CRM, RPL, ...) without balancing, with ROS, and with RUS

Application of Machine Learning

This research uses 7 ML algorithms, namely: KNN, NBC, SVM, ANN, C4.5, RF, and LR. The parameters used for each algorithm can be seen in Table 4. Each combination of experimental parameters was run on three dataset variants: the dataset balanced using ROS, the dataset balanced using RUS, and the dataset without balancing. For performance measurement, this research uses accuracy, precision, and recall. ML was implemented using the Orange Data Mining software.
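The ROS and RUS techniques described in the balancing phase can be sketched in plain Python. This is a minimal illustration, not the Orange Data Mining implementation the paper used; the toy feature rows and TA-topic labels below are invented stand-ins for the questionnaire dataset.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """ROS: raise every class to the size of the largest class
    by duplicating randomly chosen existing rows."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(v) for v in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        resampled = members + [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(resampled)
        out_labels.extend([label] * target)
    return out_rows, out_labels

def random_undersample(rows, labels, seed=42):
    """RUS: cut every class down to the size of the smallest class
    by sampling existing rows without replacement."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(v) for v in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy stand-in for the dataset: binary course indicators plus a TA topic.
X = [[1, 0], [1, 1], [0, 1], [1, 0], [0, 0], [1, 1], [0, 1], [1, 0]]
y = ["APB", "APB", "APB", "APB", "APB", "DM", "DM", "RPL"]

_, y_ros = random_oversample(X, y)
_, y_rus = random_undersample(X, y)
print(Counter(y_ros))  # every class raised to 5 (the APB count)
print(Counter(y_rus))  # every class cut down to 1 (the RPL count)
```

Note that ROS only duplicates rows that already exist, which is exactly the source of the overfitting risk discussed later in the paper.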
Table 4. Algorithm parameters
- KNN: number of neighbors K = 3, 5, 7, 9, 11
- NBC: default parameters
- SVM: kernel = Linear; kernel = Polynomial with gamma = auto and cost/degree combinations [C=1.00, D=1.0], [C=1.00, D=2.0], [C=1.00, D=3.0], [C=2.00, D=1.0], [C=2.00, D=2.0], [C=2.00, D=3.0], [C=3.00, D=1.0], [C=3.00, D=2.0], [C=3.00, D=3.0]; kernel = Radial Basis Function (RBF) with gamma = auto; kernel = Sigmoid with parameter values 1, 2, and 3
- ANN: hidden layer structures composed of layers with 100, 200, and 300 neurons; activation = ReLU; solver = Adam
- C4.5: minimum number of leaves = 2, 3, 5, 7
- RF: number of attributes considered at each split = 3, 5, 7, 9, 11
- LR: regularization type = Lasso (L1) and Ridge (L2)

RESULTS AND DISCUSSION

The overall results of the experiments carried out in this research are summarized in Table 5.

Table 5. Experiment results: accuracy (Acc), precision (Prec), and recall for every algorithm-parameter combination on the ROS-balanced, RUS-balanced, and unbalanced datasets

In Table 5, it can be seen that the best performance for KNN with ROS is at K = 3, with an accuracy of 76.4%, a precision of 82.6%, and a recall of about 73%.
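Accuracy, precision, and recall figures like the ones reported here can be computed directly from predictions. Below is a minimal plain-Python sketch; the true labels and predictions are invented, and the class-frequency ("weighted") averaging is an assumption, since the paper does not state which averaging Orange applied.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus class-frequency-weighted precision and recall,
    a common averaging scheme for multi-class problems."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = 0.0
    for c in labels:
        tp = sum(t == p == c for t, p in zip(y_true, y_pred))
        pred_c = sum(p == c for p in y_pred)
        prec_c = tp / pred_c if pred_c else 0.0   # precision of class c
        rec_c = tp / support[c] if support[c] else 0.0  # recall of class c
        w = support[c] / n                        # weight by class frequency
        precision += w * prec_c
        recall += w * rec_c
    return accuracy, precision, recall

# Invented example: true TA topics vs. a classifier's predictions.
y_true = ["APB", "APB", "DM", "DM", "RPL", "APB"]
y_pred = ["APB", "DM", "DM", "DM", "APB", "APB"]
acc, prec, rec = weighted_metrics(y_true, y_pred)
print(round(acc, 3), round(prec, 3), round(rec, 3))  # 0.667 0.556 0.667
```

With class-frequency weighting, recall algebraically equals accuracy; other averaging choices (macro or per-class) give different values, which may explain why the paper's accuracy and recall figures sometimes differ.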
Meanwhile, the best performance for KNN with RUS is at K = 3, with an accuracy of 33.3%, a precision of 32.9%, and a recall of about 33%. The best performance for KNN without balancing is at K = 5, with an accuracy of 42.2%, a precision of 29.1%, and a recall of about 42%. A comparison of KNN with ROS, with RUS, and without balancing is shown in Figure 2; KNN with ROS outperforms the other two in accuracy, precision, and recall.

Figure 2. KNN performance comparison

For the NBC algorithm (Table 5), NBC with ROS achieves an accuracy of 53.6%, a precision of 50.0%, and a recall of about 53%. NBC with RUS achieves an accuracy of 42.4%, a precision of 56.2%, and a recall of about 42%, while NBC without balancing achieves an accuracy of about 39%, a precision of 42.7%, and a recall of about 39%. As Figure 3 shows, NBC with ROS is better than NBC with RUS and NBC without balancing in accuracy, precision, and recall.

Figure 3. NBC performance comparison

For the SVM algorithm (Table 5), the best performance with ROS is obtained with the RBF kernel: an accuracy of 79.3%, a precision of 80.8%, and a recall of about 79%. The best performance with RUS is obtained with the Linear kernel: an accuracy of 57.6%, a precision of 60.5%, and a recall of about 57%. The best performance without balancing is obtained with the Polynomial kernel (gamma = auto, C = 3.00, D = 2.0),
with an accuracy of 40.6%, a precision of 31.3%, and a recall of about 40%. As Figure 4 shows, SVM with ROS is better than SVM with RUS and SVM without balancing in accuracy, precision, and recall.

Figure 4. SVM performance comparison

In Table 5, the best ANN performance with ROS uses one of the 300-unit hidden layer structures, with an accuracy of 79.3%, a precision of 80.4%, and a recall of about 79%. The best performance for ANN with RUS uses one of the 100-unit hidden layer structures, with an accuracy of 69.7%, a precision of 73.2%, and a recall of about 69%. The best performance for ANN without balancing also uses a 100-unit hidden layer structure, with an accuracy of 40.6%, a precision of 34.0%, and a recall of about 40%. As Figure 5 shows, ANN with ROS is better than ANN with RUS and ANN without balancing in accuracy, precision, and recall.

Figure 5. ANN performance comparison

For the C4.5 algorithm (Table 5), the best performance with ROS is obtained with a minimum number of leaves of 2: an accuracy of 77.1%, a precision of 77.3%, and a recall of about 77%. The best performance of C4.5 with RUS also uses a minimum number of leaves of 2: an accuracy of 42.4%, a precision of 45.9%, and a recall of about 42%. The best performance of C4.5 without balancing uses a minimum number of leaves of 7: an accuracy of 42.2%, a precision of 31.5%, and a recall of about 42%. As Figure 6 shows, C4.5 with ROS is better than C4.5 with RUS and C4.5 without balancing in accuracy, precision, and recall.

Figure 6. C4.5 performance comparison

For the RF algorithm (Table 5), the best performance with ROS is obtained when the number of attributes considered at each split is 11, with an accuracy of 80.7%, a precision of 81.0%, and a recall of about 80%. The best RF performance with RUS uses 5 attributes per split, with an accuracy of 66.7%, a precision of 69.5%, and a recall of about 66%. The best RF performance without balancing uses 11 attributes per split, with an accuracy of 40.6%, a precision of 31.9%, and a recall of about 40%. As Figure 7 shows, RF with ROS is better than RF with RUS and RF without balancing in accuracy, precision, and recall.

Figure 7. RF performance comparison

For the LR algorithm (Table 5), the best performance with ROS uses Ridge (L2) regularization, with an accuracy of 75.7%, a precision of 76.5%, and a recall of about 75%. The best performance for LR with RUS also uses Ridge regularization, with an accuracy of 57.6%, a precision of 57.5%, and a recall of about 57%.
The best performance for LR without balancing also uses Ridge regularization, with an accuracy of 40.6%, a precision of 33.6%, and a recall of about 40%. As Figure 8 shows, LR with ROS is better than LR with RUS and LR without balancing in accuracy, precision, and recall.

Figure 8. LR performance comparison

Figure 9 compares the performance of the ML algorithms on the dataset balanced with ROS. The highest accuracy is obtained by the RF algorithm, namely 80.7%. The algorithm with the highest precision is KNN, at 82.6%, although this is not far from RF's precision of 81.0%; KNN's precision is only about 1.6% higher. Meanwhile, the highest recall is obtained by the RF algorithm. From these accuracy, precision, and recall results, it can be concluded that on the dataset balanced using ROS, the best-performing ML algorithm is RF.

Figure 9. Performance comparison of the ML algorithms with ROS

Figure 10 compares the performance of the ML algorithms on the dataset balanced with RUS. The highest accuracy is obtained by the ANN algorithm, namely 69.7%; the highest precision is also obtained by ANN, at 73.2%. Meanwhile, the highest recall is obtained by the LR algorithm, at about 75%.

Figure 10. Performance comparison of the ML algorithms with RUS
So it can be concluded that on the dataset balanced using RUS, the best performance for accuracy and precision comes from the ANN algorithm, while the best performance for recall comes from the LR algorithm.

Figure 11. Performance comparison of the ML algorithms without balancing

Figure 11 compares the performance of the ML algorithms on the unbalanced dataset. The highest accuracy obtained was only 42.4%, by the KNN and C4.5 algorithms. The highest precision was only 42.7%, by the NBC algorithm. The highest recall was only 42.4%, again by the KNN and C4.5 algorithms. It can be concluded that on an unbalanced dataset, the performance produced by the ML algorithms is very low.

Based on the experiments summarized in Table 5, the dataset balanced using ROS produces much better ML performance in terms of accuracy, precision, and recall. However, it is important to note that a dataset balanced using the ROS method yields training results that tend to overfit, because of the duplicated data it introduces. On the other hand, if the dataset is not balanced at all, ML performance is very low. Therefore, based on these experiments, it is recommended to balance imbalanced data using the RUS method.

CONCLUSION

Based on the experimental results, ML can be used to build a model for determining TA topics if the dataset used is balanced. This is shown by the experiments on datasets with and without balancing: without balancing, performance is very low, whereas with balancing it is much better. The balancing method with the highest performance is ROS, but this method carries a large risk of overfitting.
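The overfitting risk attributed to ROS can be made concrete: when duplication happens before the train/test split, copies of the same row can land on both sides of the split, so the model is scored on rows it has literally seen during training. A small self-contained illustration (all class sizes and counts below are invented):

```python
import random

rng = random.Random(0)

# Invented imbalanced dataset: 5 minority rows, 45 majority rows.
rows = [("minority", i) for i in range(5)] + [("majority", i) for i in range(45)]

# ROS *before* splitting: duplicate random minority rows until classes are even.
minority = [r for r in rows if r[0] == "minority"]
balanced = rows + [rng.choice(minority) for _ in range(40)]

# Ordinary 80/20 train/test split on the already-oversampled data.
rng.shuffle(balanced)
split = int(0.8 * len(balanced))
train, test = balanced[:split], balanced[split:]

# Count test rows that are exact copies of training rows (leakage).
leaked = sum(1 for r in test if r in train)
print(f"{leaked} of {len(test)} test rows also appear in the training set")
```

RUS cannot leak in this way, since it only removes rows rather than copying them, which is consistent with the paper's recommendation to prefer RUS. The leak also disappears if the data is split first and only the training portion is oversampled.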
Therefore, this research suggests using RUS as the data balancing method, even though the resulting performance is not as high as with ROS.

REFERENCES