Inform: Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi, Vol. 9 No. 1, January 2024. P-ISSN: 2502-3470. E-ISSN: 2581-0367

Sentiment Analysis on the Impact of MBKM on Student Organizations Using Supervised Learning with SMOTE to Handle Data Imbalance

Lailatul Cahyaningrum1, Ardytha Luthfiarta2*, Mufida Rahayu3
1,2,3 Informatics Engineering Department, Universitas Dian Nuswantoro, Semarang, Indonesia
1lailacahya22@gmail. luthfiarta@dsn. id(*) 3mufidarahayu2002@gmail.
Received: 2023-12-07. Accepted: 2024-01-05. Published: 2024-01-15

Abstract: Recently, there has been a decline in student interest in joining organizations. One of the causes is the MBKM program, "Merdeka Belajar Kampus Merdeka". With this government program, more and more students are interested in joining it because it is considered more beneficial. Students conveyed their responses to this through questionnaires, Twitter, and YouTube comments, which were collected by questionnaire distribution and crawling. The data obtained amounted to 1,770 records, labeled negative, positive, and neutral, using Sastrawi, Nazief & Adriani, and Arifin Setiono stemming. The labels are imbalanced, so SMOTE is needed to balance the data. The research focuses on modeling with the Naïve Bayes Classifier, Support Vector Machine, and Decision Tree algorithms using the random split method, with the best results obtained by the Support Vector Machine. Of the three algorithms, the highest result was obtained on the data stemmed with Arifin Setiono, using a Support Vector Machine with 91% accuracy on a split of 90% training data and 10% testing data.

Keywords: MBKM, Sentiment Analysis, SMOTE, Stemming, Classification, SVM.

I. INTRODUCTION
Students will be agents of change in the future, especially during the golden age of 2045, when Indonesia reaches its centenary and experiences a demographic bonus.
In addition to being agents of change, students are also the "iron stock", the nation's next generation who will be able to replace government leaders for the better. On the way to becoming agents of change and iron stock, students can gather in student organizations alongside their lectures, because organizations are also a means of learning to develop their intellectual, social, and religious abilities. Organizations are also useful for increasing student awareness of and involvement in society, increasing professionalism, instilling creativity, and strengthening critical thinking, all of which will be useful for post-campus life.

However, the presence of Organisasi Kemahasiswaan (Ormawa) and Unit Kegiatan Mahasiswa (UKM) on Indonesian campuses has declined due to the Merdeka Belajar Kampus Merdeka (MBKM) program, a program of the Ministry of Education, Culture, Research and Technology of the Republic of Indonesia (Kemendikbud Ristek) whose branches are becoming increasingly numerous. In September 2022, a survey at Jenderal Soedirman University revealed that 20.5% of active organization enthusiasts participated. Additionally, the percentage of those enthusiasts who preferred the MBKM program was four times higher than the overall organizational participation rate. Survey results from several campuses in Semarang and Yogyakarta showed that interest in Ormawa and UKM decreased with the program, and that the decline was an obstacle to the regeneration and performance of their management. In the Tourism Education Student Association organization for the 2022/2023 period, out of 20 respondents, 64.7% were not interested in joining the organization's management. They prefer to join the MBKM program because its benefits are considered more valuable for their academic work and future career preparation.
Students who participate in organizations convey opinions about the influence of MBKM on the organization, some of which they convey through Twitter, YouTube, and distributed questionnaires. The opinion data is processed through sentiment analysis, that is, understanding, managing, and computationally extracting the data using text mining techniques to produce positive, neutral, or negative categories. Various algorithms are applicable to sentiment analysis, including the Naïve Bayes Classifier (NBC), K-Means clustering, Decision Tree, and Support Vector Machine (SVM). The Decision Tree algorithm is easy to implement using a recursive algorithm. However, Decision Tree has the disadvantage that, when applied to a large amount of data, it is prone to overfitting, which decreases the algorithm's performance.

Several studies have addressed sentiment analysis of the implementation of MBKM. One study with 475 data points obtained accuracies of 99.22% for Naïve Bayes, 96.90% for K-Nearest Neighbors, and 37.21% for Decision Tree. A study of 849 data points on sentiment related to the Merdeka Belajar Kampus Merdeka (MBKM) program found that Support Vector Machine reached 84.76%, higher than Decision Tree, which only reached an accuracy of 72.86%. An analysis of 591 comments from a Telegram group of supervisors showed an accuracy of 99.30% for Naïve Bayes and 97.20% for K-Nearest Neighbors. From November 20, 2021 to December 19, 2021, sentiment analysis was conducted by crawling data on Twitter; a total of 5,980 data points were collected, giving the SVM algorithm a success rate of 12% and the Naïve Bayes classifier a success rate of 92%. Twitter sentiment analysis of the MBKM program with 1,212 data points resulted in an accuracy of roughly 74% using the Naïve Bayes Classifier algorithm.

It is necessary to conduct a sentiment analysis of the impact of MBKM on the sustainability of organizational life. Several studies related to sentiment analysis of the MBKM program have been carried out. However, those that discuss the impact on student organizations are still very few, or arguably have never been done by previous researchers. For this reason, the researchers take opinion data from questionnaires and from crawling Twitter and YouTube, and then model it with positive, negative, and neutral classes using a lexicon with Sastrawi, Nazief & Adriani, and Arifin Setiono stemming. Due to the potential data imbalance, this research applies the oversampling method with SMOTE.

DOI: https://doi.org/10.25139/inform.

II. RESEARCH METHODOLOGY
This research compares the Naïve Bayes Classifier, Support Vector Machine, and Decision Tree algorithms to find the best accuracy on reviews about the influence of MBKM on organizations. Implementing the comparison requires several stages to get the best accuracy results. Figure 1 describes the research stages.

Figure 1. Workflow of Research Stages

Collecting Data
Three different sources were used to collect the data: crawling Twitter and YouTube, as well as questionnaires. Twitter was selected because the most recent information is always readily available there, particularly considering that approximately 69% of journalists engage with Twitter. YouTube comments give precise results if the selected videos match the problem, and the number of YouTube users has increased quite dramatically, by around 73%. The questionnaires were distributed to several public and private universities, whose students describe in more detail in their questionnaire answers how the impact is felt.

Figure 2. Workflow of Collecting Data

Figure 2 shows that this research obtained 45 data points from crawling Twitter, 215 from questionnaires, and 1,510 from YouTube comments, for a total of 1,770 data points. Twitter was crawled with the keywords "MBKM vs. Ormawa" and "MBKM atau Organisasi", and YouTube was crawled through its API using Python.

Balancing Data
Data imbalance remains a challenge because classes with minority numbers are sometimes the more important ones. This condition can reduce prediction accuracy, because a model that sees less data for one class will tend towards the other class and disadvantage the minority. Data balancing using the Synthetic Minority Oversampling Technique (SMOTE) operator can handle unbalanced data in which the negative, positive, and neutral sentiment labels in the data set are not in proportional numbers. Previous work found that SMOTE can perform better than ADASYN.

Text Preprocessing
Text preprocessing is usually done at an early stage to obtain higher-quality information. It is described in Figure 3 and is done before analyzing sentiment in the data from Twitter, YouTube, and questionnaires, because it can affect classification performance.

Figure 3. Workflow of Text Preprocessing

Cleaning: Cleaning removes usernames (@), hashtags (#), numbers, URLs (http://), delimiters such as commas (,) and periods (.), and other punctuation marks, so that useless content is filtered out of the data.

Case Folding: Case folding converts all characters, such as capital letters, into lowercase letters.

Tokenizing: Tokenizing splits the sentences in the data into separate words.

Filtering: Filtering is the process after tokenization that removes words that do not represent the content of a text document, commonly known as stopwords.
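The cleaning, case folding, tokenizing, and filtering steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the regular expressions, and the very small `STOPWORDS` set are assumptions made for illustration, and whether a mention is dropped entirely or only its "@" symbol is stripped is also an assumption.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the stopword list actually used in the study is not given in the text.
STOPWORDS = {"yang", "dan", "atau", "di", "ke", "lu", "harus", "sama"}


def clean(text):
    """Remove URLs, @usernames, #hashtags, numbers, and punctuation."""
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # usernames and hashtags
    text = re.sub(r"\d+", " ", text)           # numbers
    text = re.sub(r"[^\w\s]", " ", text)       # commas, periods, other punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


def case_fold(text):
    """Convert every character to lowercase."""
    return text.lower()


def tokenize(text):
    """Split a sentence into words on whitespace."""
    return text.split()


def filter_tokens(tokens, stopwords=STOPWORDS):
    """Drop tokens that carry no content (stopwords)."""
    return [t for t in tokens if t not in stopwords]


def preprocess(text):
    """Run the four steps in the order described in the text."""
    return filter_tokens(tokenize(case_fold(clean(text))))


if __name__ == "__main__":
    print(preprocess("Lagian MBKM sebagian dikasih uang saku, beda sama organisasi."))
```

Stemming is left out of the sketch on purpose, because it depends on the chosen stemmer (Sastrawi, Nazief & Adriani, or Arifin Setiono) rather than on simple string rules.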
Stemming: Stemming removes affixes, in the form of prefixes, suffixes, and confixes, from each word so that it becomes a base word, which homogenizes the word forms. This research uses the Sastrawi, Nazief & Adriani, and Arifin Setiono stemmers. As a stemmer package, Sastrawi is not designed to normalize words; for example, Sastrawi will not recognize that "yg" means "yang". The Nazief & Adriani stemming algorithm was developed by Bobby Nazief and Mirna Adriani; its additional rules, such as those for reduplication, prefixes, and suffixes, increase the accuracy for each word. Arifin Setiono stemming follows a process similar to that of Nazief & Adriani.

Labelling Data
After preprocessing, the data contains only cleaned opinions. Positive, negative, and neutral labels are then assigned by first determining the score of each sentence.

TF-IDF Weighting
TF-IDF weighting is a common method for generating sentence vectors from word vectors. TF-IDF converts a group of words or phrases into numerical values by identifying the most important words in a document. Once a word is converted into a numerical value, that value reflects the word's frequency of occurrence. The TF value alone does not indicate how important a word is, because less useful conjunctions or common words are sometimes the ones with the highest TF. Therefore, the Inverse Document Frequency (IDF) technique is required after TF. Equation (1) gives the TF-IDF weighting.

$$ TF.IDF = TF_{ij} \times IDF_{ij} = TF_{ij} \times \log \frac{N}{df} \quad (1) $$

In Equation (1), N is the number of documents in the collection, TF is the term frequency, IDF is the inverse document frequency, and df, following the standard IDF definition, is the number of documents that contain the term.

Model Analysis
The model analysis compares Naïve Bayes, SVM, and Decision Tree as shown in Figure 4, and determines which of the three models produces the best accuracy.

Figure 4. Workflow of Model Analysis

Naïve Bayes Classifier: Naïve Bayes is a classification method based on Bayes' theorem with a strong assumption that each condition or event is independent. Figure 5 is the flowchart of the Naïve Bayes Classifier, and Bayes' theorem is presented in Equation (2).

Figure 5. Flowchart of Naïve Bayes Classifier

$$ P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)} \quad (2) $$

Here c and d are events, P(c|d) is the probability of c given that d is true, P(d|c) is the probability of d given that c is true, and P(c) and P(d) are the independent probabilities of c and d respectively.

Support Vector Machine: SVM classifies data using a hyperplane and emphasizes risk minimization, that is, estimating a function by minimizing the bound on the generalization error. SVM can overcome overfitting and produce a good classification model even when it is trained with relatively little data. Figure 6 is the flowchart of the Support Vector Machine, and the SVM equations are presented in Equations (3) and (4), where W_i are the components of the weight vector (W_0, W_1, W_2, ..., W_n), b is the bias term (W_0), and X are the variables.

Figure 6. Flowchart of Support Vector Machine

$$ y = W_0 + W_1 X_1 + W_2 X_2 + \dots \quad (3) $$

$$ y = W_0 + \sum_{i=1}^{n} W_i X_i = W_0 + W^T X = b + W^T X \quad (4) $$

Decision Tree: The Decision Tree classification algorithm uses a top-down decision tree structure to determine the class of the research data. Measuring the diversity of a data set requires entropy and gain values, which the decision tree obtains using Equations (5) and (6). Figure 7 is the Decision Tree flowchart.

Figure 7. Flowchart of Decision Tree

$$ Entropy(S) = \sum_{i=1}^{n} -p_i \log_2 p_i \quad (5) $$

$$ Gain(S, A) = Entropy(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} \times Entropy(S_i) \quad (6) $$

Here S is the set of all possible outcomes, i indexes the individual outcomes, n is the number of outcomes, p_i is the probability of occurrence of outcome i, A is the specific attribute, |S_i| is the size of subset S_i after splitting, and Entropy(S_i) is the entropy of each subset S_i after splitting.

Evaluation
The evaluation process for the machine learning models tests the classification results by measuring how correct they are. After the Naïve Bayes, SVM, and Decision Tree classification models are obtained, they need to be evaluated on the testing data to determine their levels of accuracy, precision, and recall.

$$ precision = \frac{TP}{TP + FP} \quad (7) $$

$$ recall = \frac{TP}{TP + FN} \quad (8) $$

$$ accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (9) $$

Equations (7), (8), and (9) state that precision is the ratio of TP predictions to all positive predictions, recall is the ratio of TP predictions to all actually positive data, and accuracy is the ratio of correct predictions (positive, neutral, and negative) to the overall data. All of them can be obtained from the confusion matrix. TP means that both the system and the expert give a positive result. TN is the opposite of TP, which means that both the system and the expert give a negative result. FP means that the system gives a positive result, but the expert gives a negative result. FN occurs when the system gives a negative result, but the expert gives a positive result.

III. RESULT AND DISCUSSION
In this study, the data on the influence of MBKM on organizations was processed using three types of stemming: Sastrawi, Nazief & Adriani, and Arifin Setiono. The algorithms used are the Naïve Bayes Classifier, the Support Vector Machine, and the Decision Tree, and the one with the highest accuracy is sought.

Text Preprocessing
The data used in this study, from questionnaires, Twitter, and YouTube, relates to the influence of MBKM on organizations. It has a variety of writing styles, so the text obtained is unstructured data that is quite difficult to process. Before classification or tagging as negative, positive, or neutral, the data needs to be transformed into more structured data. The stages of transforming unstructured data into structured information are called text preprocessing, and the steps are illustrated in Table I.

TABLE I
TEXT PREPROCESSING
Raw Data: @UGM_FESS Lagian mbkm sebagian dikasih uang saku, beda sama beberapa organisasi yang setiap kali mereka bikin event lu sebagai panitia harus ngeluarin duit lumayan atau disuruh jualan sama upload story pp. Secara kebermanfaatan mbkm kadang lebih menarik dibanding organisasi
Cleaning: UGMFESS Lagian mbkm sebagian dikasih uang saku, beda sama beberapa organisasi yang setiap kali mereka bikin event lu sebagai panitia harus ngeluarin duit lumayan atau disuruh jualan sama upload story pp Secara kebermanfaatan mbkm kadang lebih menarik dibanding organisasi
Case Folding: ugmfess lagian mbkm sebagian dikasih uang saku beda sama beberapa organisasi yang setiap kali mereka bikin event lu sebagai panitia harus ngeluarin duit lumayan atau disuruh jualan sama upload story pp secara kebermanfaatan mbkm kadang lebih menarik dibanding organisasi
Tokenizing: ugmfess, lagian, mbkm, sebagian, dikasih, uang, saku, beda, sama, beberapa, organisasi, yang, setiap, kali, mereka, bikin, event, lu, sebagai, panitia, harus, ngeluarin, duit, lumayan, atau, disuruh, jualan, sama, upload, story, pp, secara, kebermanfaatan, mbkm, kadang, lebih, menarik, dibanding, organisasi
Filtering: ugmfess lagian mbkm sebagian dikasih uang saku beda sama beberapa organisasi setiap kali mereka bikin event sebagai panitia ngeluarin duit lumayan disuruh jualan upload story pp secara kebermanfaatan mbkm kadang lebih menarik dibanding organisasi
Stemming: ugmfess lagi mbkm sebagian kasih uang saku beda sama berapa organisasi setiap kali mereka bikin event sebagai panitia keluar duit lumayan suruh jualan upload story pp secara manfaat mbkm kadang lebih tarik banding organisasi

Table I shows the results of text preprocessing. The text preprocessing step uses cleaning, case folding, tokenizing, filtering, and stemming to get the best possible text before the text is labeled.

Collecting Data
Data was collected in three ways. The first was a questionnaire through Google Forms with a total of 9 questions, grouped according to whether the respondent follows both MBKM and Ormawa, only MBKM, only Ormawa, or neither; it obtained 215 student respondents across Indonesia. Twitter crawling with the keywords "MBKM vs. Ormawa" and "MBKM atau Organisasi" yielded 45 data points. Then, crawling YouTube comments using the Python programming language through Google Colaboratory yielded 1,510 data points, so the total data obtained is 1,770.

Labelling Text
Data that has gone through the text preprocessing stage is then labeled using a lexicon word dictionary. Texts with a lexicon score of less than 0 are labeled negative, more than 0 positive, and exactly 0 neutral, as in Table II.

The negative word cloud in Figure 9, built from the Arifin Setiono stemming, shows the frequently occurring words 'MBKM', 'ada', 'karena', 'dan', and 'yang'.
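The lexicon-based labeling rule described above (score below 0 negative, above 0 positive, exactly 0 neutral) can be sketched as follows. The `LEXICON` dictionary and its weights are hypothetical and exist only to make the example runnable; they are not the lexicon used in this study, so the labels produced here are not the study's labels.

```python
# Hypothetical mini-lexicon (word -> polarity weight), for illustration only.
LEXICON = {
    "efisien": 1,
    "fleksibel": 1,
    "keuntungan": 1,
    "susah": -1,
    "kerugian": -1,
    "cape": -1,
}


def polarity_score(tokens, lexicon=LEXICON):
    """Sum the lexicon weights of all tokens; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokens)


def label(score):
    """Map a polarity score to a sentiment class."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def label_text(clean_text, lexicon=LEXICON):
    """Label an already preprocessed text (tokens separated by spaces)."""
    return label(polarity_score(clean_text.split(), lexicon))


if __name__ == "__main__":
    for text in ("karena lebih efisien dan fleksibel", "mbkm vs ormawa"):
        print(text, "->", label_text(text))
```

Because the score is a plain sum, a text that mixes positive and negative words can cancel out to 0 and be labeled neutral, which is one reason the neutral class depends strongly on the lexicon and on the stemmer applied before the lookup.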
TABLE II
LABELLING TEXT
Columns: Clean_Text, Polarity_score, Polarity (polarity values shown: Neutral, Negative)
Clean_Text: karena lebih efisien dan fleksibel
Clean_Text: iya keuntungan nya dapat banyak teman relasi ilmu dan pengetahuan baru kerugian nya lebih cape saja karena menghabiskan banyak waktu
Clean_Text: pingin ikut mbkm tapi masih susah dapat informasi
Clean_Text: setiap ukm pasti memiliki keuntungan dan kerugiannya masing yang saya rasakan ikut ukm itu banyak memiliki keuntungan yaitu bisa menambah relasi menambah wawasan belajar tanggung jawab
Clean_Text: mbkm vs ormawa halo kema unpad saat ini masih banyak perdebatan mengenai relevansi peran organisasi kemahasiswaan dalam pengembangan diri anggotanya belum lagi dengan adanya program merdeka belajar kampus merdeka yang dinilai lebih baik dalam mengembangkan soft skill

Figure 10. Word Cloud Neutral

The neutral word cloud in Figure 10, built from the Nazief & Adriani stemming, shows the frequently occurring words "karena", "skill", "dan", "ormawa", and "ukm". The word clouds visualize texts with negative, positive, and neutral sentiment, each taken from a different stemming result, namely Sastrawi, Nazief & Adriani, and Arifin Setiono.

TF-IDF Weighting
TF-IDF is used to extract features with the Python CountVectorizer library, which converts text features into a vector representation, and the words are then weighted using TF-IDF. After that, the data can be split into training data and test data for testing.

Data Balancing
Data balancing with the Synthetic Minority Oversampling Technique (SMOTE) operator addresses data imbalance, where the negative, positive, and neutral sentiment labels in the datasets are not proportionate. In this research, the labeling process uses three types of stemming algorithms, Sastrawi, Nazief & Adriani, and Arifin Setiono, so the resulting numbers of negative, neutral, and positive labels differ among the three stemming algorithms.
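To make the two steps above concrete, the following is a minimal pure-Python sketch of TF-IDF weighting (raw term count times log(N/df), assuming the natural logarithm) and of the interpolation idea behind SMOTE. It is a simplified illustration, not the library code used in the study: the function names and parameters are assumptions, and library implementations such as scikit-learn's TfidfVectorizer apply a smoothed IDF by default, so their numbers will differ.

```python
import math
import random


def tf_idf(docs):
    """docs: list of token lists. Returns (vocab, matrix) with tf * log(N / df)."""
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # df: number of documents that contain each term
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    matrix = [
        [doc.count(term) * math.log(n_docs / df[term]) for term in vocab]
        for doc in docs
    ]
    return vocab, matrix


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic vectors for a minority class.

    Each synthetic vector lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


if __name__ == "__main__":
    docs = [["mbkm", "lebih", "menarik"], ["ormawa", "lebih", "seru"], ["mbkm", "ormawa"]]
    vocab, weights = tf_idf(docs)
    print(vocab)
    print(smote(weights, n_new=2))
```

With the Sastrawi label counts reported in this study (843 negative, 797 positive, 130 neutral), balancing every class to 843 would correspond to `n_new = 46` for the positive class and `n_new = 713` for the neutral class.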
The class distributions for the three stemming algorithms, before and after SMOTE, are as follows.

Figure 11. Before SMOTE Sastrawi
Figure 12. After SMOTE Sastrawi

Labeling the data with Sastrawi stemming yielded 843 negative, 797 positive, and 130 neutral data points (Figure 11). SMOTE balanced these to 843 negative, 843 positive, and 843 neutral data points (Figure 12).

Figure 13. Before SMOTE Nazief & Adriani
Figure 14. After SMOTE Nazief & Adriani

Labeling the data with Nazief & Adriani stemming yielded 1,029 negative, 523 positive, and 218 neutral data points (Figure 13). SMOTE balanced these to 1,029 negative, 1,029 positive, and 1,029 neutral data points (Figure 14).

Figure 15. Before SMOTE Arifin Setiono
Figure 16. After SMOTE Arifin Setiono

Labeling the data with Arifin Setiono stemming yielded 987 negative, 569 positive, and 214 neutral data points (Figure 15). SMOTE balanced these to 987 negative, 987 positive, and 987 neutral data points (Figure 16).

Figure 8. Word Cloud Positive

The positive word cloud in Figure 8, built from the Sastrawi stemming, shows the frequently occurring words 'Saya', 'yang', 'karena', 'MBKM', and 'pengalaman'.

Figure 9. Word Cloud Negative

Model Analysis
After the data has been processed, labeled, and weighted with TF-IDF, the next step is to calculate accuracy using three algorithms, namely the Naïve Bayes Classifier, Support Vector Machine, and Decision Tree, on the data from the three types of stemming with random splitting.

TABLE III
ACCURACY SASTRAWI
Columns: Random Split (Training, Testing); Accuracy (NBC, SVM)

In Table III, the Sastrawi-stemmed data, totaling 2,529 after SMOTE, obtained its highest result, 83%, with the Support Vector Machine algorithm on the 90:10 random split (90% training data and 10% testing data). Down to the 60:40 random split, the highest accuracy was still obtained with the Support Vector Machine.

TABLE IV
ACCURACY NAZIEF & ADRIANI
Columns: Random Split (Training, Testing); Accuracy (NBC, SVM)

For the data stemmed with Nazief & Adriani in Table IV, totaling 3,087 after SMOTE, the highest result, 88%, was obtained with the Support Vector Machine algorithm on the 90:10 random split (90% training data and 10% testing data). Down to the 60:40 random split, the highest accuracy was still obtained with the Support Vector Machine algorithm.

TABLE V
ACCURACY ARIFIN SETIONO
Columns: Random Split (Training, Testing); Accuracy (NBC, SVM)

TABLE VI
MODEL EVALUATION
Columns: Precision, Recall, F1_Score, Support
Rows: Negative, Neutral, Positive, Accuracy, Macro Avg, Weighted Avg

IV. CONCLUSION
In this study, the data shows many negative classifications, and the word that appears most often is MBKM because, with MBKM, students are reluctant to join organizations. The comparison of the Naïve Bayes Classifier, Support Vector Machine, and Decision Tree algorithms on data balanced with SMOTE, using random splits, gave accuracies of 75% to 82% for the Naïve Bayes Classifier, 61% to 73% for the Decision Tree, and 80% to 91% for the Support Vector Machine. It can therefore be concluded, in line with several researchers cited in the introduction, that the Support Vector Machine is the best of the three algorithms. The best stemming was Arifin Setiono, which achieved 91% accuracy with SVM. However, the data should be balanced; if it is not, the accuracy of the results can go down. Future research can also use stratified sampling, k-fold cross-validation, or hyperparameter tuning to obtain even better accuracy.

REFERENCES