Inform : Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi Vol. 9 No. 1 January 2024, P-ISSN: 2502-3470, E-ISSN: 2581-0367

Comparison of the Effect of Word Normalization on Naïve Bayes Classifier and K-Nearest Neighbor Methods for Sentiment Analysis

Novrido Charibaldi (1), Atania Harfiani (2), Oliver Samuel Simanjuntak (3)
(1,2,3) Informatics Department, Universitas Pembangunan Nasional Veteran Yogyakarta, Indonesia
(1) novrido@upnyk.id, (2) ataniaharfiani2@gmail.com (*), (3) oliver.samuel@upnyk.
Received: 2023-10-12; Accepted: 2023-11-19; Published: 2023-12-03

Abstract: The pre-processing stage of sentiment analysis comprises several essential steps, one of which is word normalization: converting non-standard words into standard words. However, much research on sentiment analysis skips the word normalization stage, which can affect accuracy. This study compares the effect of word normalization on the Naïve Bayes Classifier and K-Nearest Neighbor methods for sentiment analysis of public opinion on the Social Security Administrator for Health (BPJS Kesehatan). The steps involved are gathering the data, labeling it, pre-processing it under two different scenarios, weighting words with TF-IDF, classifying with the Naïve Bayes Classifier and K-Nearest Neighbor, and finally computing accuracy from the confusion matrix. The best results are obtained by the Naïve Bayes Classifier in the 1st scenario, i.e., with word normalization in the pre-processing stage, reaching an accuracy of 87.14%. This research shows that the Naïve Bayes Classifier method with word normalization produces better accuracy, precision, recall, and F1-score.

Keywords: Sentiment Analysis; Word Normalization; Naïve Bayes Classifier; K-Nearest Neighbor; BPJS Kesehatan.
I. INTRODUCTION

The pre-processing stage is an essential step in sentiment analysis: it is responsible for translating raw, unstructured data into a format that is easier to work with and process. Pre-processing comprises several essential steps. One is word normalization, which converts non-standard words into standard words according to the Kamus Besar Bahasa Indonesia (KBBI). However, much research on sentiment analysis skips the word normalization stage, so ambiguity occurs, the system cannot classify sentiment correctly, and accuracy suffers.

Data that has passed the pre-processing stage is then classified using machine learning methods. Sentiment analysis can utilize a range of machine-learning techniques. One study, using the Affective Models method with a lexicon dictionary base (the Russell Circumplex Model), conducted two kinds of experiments: the manual process resulted in an accuracy of 81%, while the Affective Models method achieved 83.4% accuracy. However, that system still lacks a word normalization stage, so ambiguities in sentences can cause misclassification. Another study compared the Naïve Bayes and Naïve Bayes SMOTE AdaBoost methods on Twitter data, obtaining a Naïve Bayes accuracy of 71.68% and a Naïve Bayes AdaBoost SMOTE accuracy of 69%. Nevertheless, that research needs to be enhanced with other techniques, as the classification outcomes remain subpar; additional pre-processing, such as word normalization, should be added so that the analysis results match the class and accuracy improves. Research on sentiment analysis with another machine learning method, using the Naïve Bayes Classifier with TF-IDF weighting, resulted in an accuracy of 62%.
However, that research needs to be optimized by comparing other classification methods, because the accuracy is still low, and by adding other pre-processing such as word normalization to improve the results. Another study compared three methods, namely K-Nearest Neighbor, Naïve Bayes, and C4.5, and combined the three with the Ensemble Vote algorithm. The Naïve Bayes method reached an accuracy of 81%, K-Nearest Neighbor 71.83%, and C4.5 65%. However, that research also needs additional pre-processing, such as word normalization, to classify tweets correctly according to their class.

Because of this, this research focuses on contrasting the K-Nearest Neighbor algorithm with the Naïve Bayes Classifier; both are simple and easy to implement, and prior studies report relatively high accuracy for each. The influence of word normalization is studied because earlier work found that adding the word normalization stage to pre-processing can affect accuracy, increasing it from 67% to 91%. This research aims to compare the effect of word normalization on the Naïve Bayes Classifier and K-Nearest Neighbor methods.

II. RESEARCH METHODOLOGY

This study investigates the effect of word normalization on the accuracy of the Naïve Bayes Classifier and K-Nearest Neighbor algorithms. The study consisted of four different test scenarios. For the sentiment analysis experiments, a total of 1050 pieces of data were extracted from comments made on Instagram and Twitter.
Following that, the data was divided into two parts: eighty percent for training and twenty percent for testing, producing 840 training data points and 210 test data points. The research method consists of seven stages: data collection, labeling, pre-processing scenarios, TF-IDF weighting, Naïve Bayes Classifier sentiment classification, K-Nearest Neighbor sentiment classification, and testing with a confusion matrix to produce accuracy, precision, recall, and F1-score information. The pre-processing step has two distinct variants: the 1st scenario includes case folding, deleting emojis, cleaning data, deleting repetitive words, tokenization, word normalization, stemming, and stopword removal; the 2nd scenario includes the same steps without word normalization, i.e., case folding, deleting emojis, cleaning data, deleting repetitive words, tokenization, stemming, and stopword removal. The research methodology flow is shown in Figure 1; it comprises seven steps: web scraping, labeling, pre-processing, TF-IDF, Naïve Bayes Classifier, K-Nearest Neighbor, and confusion matrix.

Data Collection
The total data from Twitter and Instagram is 1050 items, consisting of positive, negative, and neutral comments, as seen in Table I. Two distinct sets were created: 840 items for training and 210 for testing.

TABLE I
DATA COLLECTION

Category | Quantity
Positive | -
Neutral  | -
Negative | -

Labeling
The labeling was done manually by humans and checked for validity by an expert, an Indonesian language teacher at SMK Negeri 1 Jember, Achmad Zaenul Ulum, S.Pd.,
whether the data belongs to the positive, negative, or neutral class. Labeling by an Indonesian language teacher makes the sentiment classification more valid, because an expert verifies the classes. Table II presents samples of the manual data labeling; each item was noted as positive, neutral, or negative.

TABLE II
LABELING

Comment | Label
"Alhamdulillah bpjs sgt mmbantu. Lahiran berkali pke sc. Anak dirwt pun cover bpjs." | Positive
"Untuk saat ini yg utama, pelayanan kepada pengguna bpjs yg ke rumah sakit." | Neutral
"Ribet, kenapa sudah dapat resep haruske loket BPJS, nya ga berfungsi percuma" | Negative

Pre-processing
Pre-processing is the initial process of text mining: it converts unstructured data into structured data in the required format by exploring, processing, organizing, and analyzing the data, aiming for uniformity and ease of reading. This study's pre-processing stage consists of case folding, cleaning data, word normalization, tokenizing, stopword removal, and stemming (Figure 1, Research Methodology).

Case Folding: Case folding converts all letters in a text document into lowercase to homogenize the characters in the data. Characters other than letters and numbers are removed.

Data Collection
Data collection in this study comes from Twitter and Instagram using automatic scraping with the snscrape library, which is available in Python. The data taken are tweets that use the keyword "BPJS Kesehatan", collected from January 2021.
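As an aside, the 80/20 train/test split described earlier (1050 items into 840 training and 210 test items) can be sketched in Python. The toy data and the random seed are our own assumptions, not taken from the paper:

```python
import random

# Toy stand-in for the 1050 labeled comments scraped from Twitter and Instagram.
data = [(f"comment {i}", random.choice(["positive", "neutral", "negative"]))
        for i in range(1050)]

random.seed(42)                      # assumed seed, for a reproducible shuffle
random.shuffle(data)

split = int(len(data) * 0.8)         # 80% train / 20% test
train, test = data[:split], data[split:]
print(len(train), len(test))         # 840 210
```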
At this time, many people were pro and contra regarding new policies and information about BPJS Kesehatan conveyed by the government, so many people expressed complaints about the performance of BPJS Kesehatan. The data taken amounted to 616 items from Twitter and 434 items from Instagram.

Deleting Emojis: Deleting emojis removes emoticon symbols from text documents. When writing tweets or comments, people sometimes use emoticons inappropriately, for example commenting on something funny with an unrelated emoticon, so emoticons can interfere with the sentiment analysis process; the various forms of emoticons are therefore deleted or ignored.

This study compares two pre-processing scenarios: the 1st scenario with word normalization and the 2nd scenario without word normalization. As shown in Figure 2, the 1st scenario includes case folding, emoji removal, cleaning data, deleting repetitive words, tokenization, word normalization, stemming, and stopword removal. After the data is pre-processed in the first scenario, the results are as in Table III.

Cleaning Data: Cleaning data removes punctuation (period, comma, exclamation mark, question mark, hash), duplicate data, and special symbols such as usernames, the re-tweet symbol (RT), and URLs. Cleaning data makes the processed data cleaner, with less noise that could affect classification results.

Deleting Repetitive Words: Deleting repetitive words corrects words with repeated characters caused by writing errors, which are often found in comments or tweets. Such words are not standardized, so the purpose of this step is to delete the repeated characters in a word, making it easier to find the base word.
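The cleaning steps described above (case folding, removing usernames, RT markers and URLs, stripping punctuation, and collapsing repeated characters) can be sketched as a small function. The regular expressions below are our own illustrative choices, not the paper's exact implementation:

```python
import re

def clean(text: str) -> str:
    """Rough sketch of the cleaning pipeline (illustrative, not the paper's code)."""
    text = text.lower()                                      # case folding
    text = re.sub(r"(rt\s+)?@\w+|https?://\S+", " ", text)   # usernames, RT, URLs
    text = re.sub(r"[^a-z\s]", " ", text)                    # punctuation, digits, emoji
    text = re.sub(r"(.)\1{2,}", r"\1", text)                 # collapse repeated characters
    return re.sub(r"\s+", " ", text).strip()

print(clean("RT @user Ribetttt!!! cek https://bpjs.go.id ya"))   # ribet cek ya
```

In a real pipeline each step would be a separate, configurable stage, but the ordering above mirrors the order described in the text.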
Tokenization: Tokenization separates a string of characters into a list of words by splitting on spaces. Tokenization collects the words that make up the sentences in the document; if two or more tokens are identical, only one is used.

Figure 2. Preprocessing 1st Scenario

TABLE III
PREPROCESSING 1ST SCENARIO

Before | After
@theansm @hynxf Datang sj ke optik … lbh mdh dan dibantu dgn … | datang optik dekat lebih mudah bantu baik

Word Normalization: Word normalization converts non-standard words into standard words or forms according to the Kamus Besar Bahasa Indonesia (KBBI). In this study it converts slang, abbreviations, and spelling errors into standard words. Word normalization is an important stage, considering that Indonesian tweets and Instagram comments contain many non-standard words that the system otherwise cannot recognize during classification. Examples of non-standard words repaired by word normalization are slang (e.g., gue = saya), abbreviations (e.g., bgs = bagus), and spelling errors (e.g., jelsk = jelek).

The 2nd scenario, shown in Figure 3, includes case folding, emoji removal, cleaning data, deleting repetitive words, tokenization, stemming, and stopword removal. After the data is pre-processed in the second scenario, the results are as in Table IV.

Stemming: Stemming removes affixes to obtain the root word of each word in a sentence. Stemming improves the quality of information from the data, for example by relating one word variant to another, and it also reduces index file size. This research uses the Indonesian-language Sastrawi library for stemming. For example, the variants "membuatkan", "dibuatkan", "membuat", and "dibuat" share the single root word "buat".

Figure 3.
Preprocessing 2nd Scenario

TABLE IV
PREPROCESSING 2ND SCENARIO

Before | After
@theansm @hynxf Datang sj ke optik … lbh mdh dan dibantu dgn … | datang optik dekat lbh mdh bantu baik

Stopword Removal: Stopwords are high-frequency words in a document, costly in both time and space complexity, with very low informative value. Examples of Indonesian stopwords are "yang", "di", "untuk", "dan", "ke", and "dari". Stopword removal keeps the important words from the tokenization results and discards the less important ones. In this research, the stopword list comes from the Indonesian-language Sastrawi library.

TF-IDF (Term Frequency-Inverse Document Frequency)
The weight of each word in a text is derived by combining the Term Frequency (TF) and the Inverse Document Frequency (IDF). Term frequency (TF) is the number of times a term occurs in a document; the weight of a term grows with the number of times it is used. The TF-IDF value is computed with Equation (1):

TF-IDF(t, d) = tf_{t,d} × log(N / df_t)     (1)

where tf_{t,d} is the frequency of term t in document d, N is the number of documents, and df_t is the number of documents containing t.

Confusion Matrix
The last stage tests the classification results using the confusion matrix, one of the methods used to measure the performance of a classification method. This research uses a multi-class confusion matrix to calculate accuracy, precision, and recall; the layout is shown in Table V.
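The TF-IDF weighting described above can be sketched as follows. The three toy documents and the choice of a base-10 logarithm are our own assumptions for illustration:

```python
import math
from collections import Counter

# Toy tokenized documents (illustrative, not the paper's corpus)
docs = [["bpjs", "sangat", "membantu"],
        ["pelayanan", "bpjs", "ribet"],
        ["bpjs", "bagus"]]

N = len(docs)
df = Counter()                 # df_t: number of documents containing term t
for d in docs:
    df.update(set(d))

def tfidf(term, doc):
    tf = doc.count(term)       # tf_{t,d}: raw frequency of the term in the document
    return tf * math.log10(N / df[term])

print(round(tfidf("ribet", docs[1]), 3))   # rare term -> non-zero weight
print(tfidf("bpjs", docs[1]))              # term in every document -> 0.0
```

A term occurring in every document gets an IDF of log(N/N) = 0, which is why very common words carry no weight under this scheme.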
Naïve Bayes Classifier
The Naïve Bayes Classifier is an algorithm that uses probabilities in making decisions. It is a simple probabilistic method derived from Bayes' theorem, the theory of conditional probability: the probability of an event m given that an event n has occurred, written P(m|n). The classification problem with the Naïve Bayes Classifier model can generally be expressed with Equation (2):

P(m|n) = P(n|m) × P(m) / P(n)     (2)

TABLE V
CONFUSION MATRIX

Actual \ Predicted | Negative | Neutral | Positive
Negative | TN (True Negative) | FL2 (False Neutral) | FP (False Positive)
Neutral  | FN2 (False Negative) | TL (True Neutral) | FP2 (False Positive)
Positive | FN (False Negative) | FL (False Neutral) | TP (True Positive)

III. RESULT AND DISCUSSION

This section evaluates the results of the Naïve Bayes Classifier and K-Nearest Neighbor algorithms under the two pre-processing scenarios, 1st and 2nd. The 1st scenario consists of case folding, deleting emojis, cleaning data, deleting repetitive words, tokenization, word normalization, stemming, and stopword removal. The 2nd scenario consists of the same steps without word normalization. The data used in this study were taken from Twitter and Instagram and cover topics related to BPJS Kesehatan. After collection, the data were manually labeled into positive, neutral, and negative classes. There are 1050 data points, with a training split of 80% (840 items) and a test split of 20% (210 items). The word weighting process is done with TF-IDF after the data split.
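A minimal sketch of Naïve Bayes classification as described above: class priors P(m) and word likelihoods P(n|m) are estimated from counts, and the evidence term P(n) is dropped since it is the same for every class. The toy training data and the use of Laplace smoothing are our own assumptions:

```python
import math
from collections import Counter, defaultdict

# Toy training data (illustrative, not the paper's corpus)
train = [(["bpjs", "sangat", "membantu"], "positive"),
         (["pelayanan", "ribet", "percuma"], "negative"),
         (["bpjs", "bagus", "membantu"], "positive")]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def predict(tokens):
    best, best_lp = None, -math.inf
    for c in class_counts:
        lp = math.log(class_counts[c] / len(train))      # log prior P(c)
        total = sum(word_counts[c].values())
        for w in tokens:
            # log likelihood P(w|c) with Laplace smoothing; evidence omitted
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(predict(["bpjs", "membantu"]))    # positive
print(predict(["ribet", "percuma"]))    # negative
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.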
The classification process in both scenarios uses the Naïve Bayes Classifier and the K-Nearest Neighbor methods. A sentiment analysis model is constructed for each of the two methods and two scenarios, with a fixed k value for the K-Nearest Neighbor method.

In Equation (2), the posterior is the probability of class m given word n: the product of the likelihood and the prior, divided by the evidence. The likelihood is the probability of word n given class m; the prior is the probability of class m before processing the data; the evidence is the probability of occurrence of word n. Calculating the evidence can be omitted because it is the same for every class and does not change the ranking, and each feature in the dataset is treated as independent.

K-Nearest Neighbor
In the K-Nearest Neighbor method, "k" refers to the number of nearest neighbors considered. It is a supervised learning approach: K-Nearest Neighbor assigns a new query instance to the class held by the majority of its k nearest neighbors, so the classification outcome is the class with the greatest count among them. To classify a new object, the algorithm locates the training examples nearest to the query instance and uses those samples. Whether a distance is close or far can be determined with the Euclidean distance, Equation (3):

d(x, y) = sqrt( Σ_i (x_i - y_i)² )     (3)

Testing the Naïve Bayes Classifier Method, Pre-processing 1st Scenario: The first test uses the Naïve Bayes Classifier method with the 1st pre-processing scenario.
In the first scenario, the pre-processing stage also includes the application of word normalization. The following pre-processing operations are performed: case folding, emoji removal, data cleaning, removal of repeated characters, tokenization, word normalization, stemming, and stopword removal.

TABLE IX
CONFUSION MATRIX MODEL K-NEAREST NEIGHBOR 2ND SCENARIO

Actual \ Predicted | Negative | Neutral | Positive
Negative | TN=48 | FL2=7 | FP=7
Neutral  | FN2=10 | TL=61 | FP2=1
Positive | FN=11 | FL=12 | TP=53

The results of the Naïve Bayes Classifier 1st-scenario confusion matrix model are shown in Table VI.

TABLE VI
CONFUSION MATRIX MODEL NAÏVE BAYES 1ST SCENARIO

Actual \ Predicted | Negative | Neutral | Positive
Negative | TN=50 | FL2=2 | FP=10
Neutral  | FN2=2 | TL=69 | FP2=1
Positive | FN=4 | FL=8 | TP=64

Based on the values in Tables VI, VII, VIII, and IX, calculations are made to determine accuracy, precision, recall, and F1-score. Table X compares the test results of the four models: the Naïve Bayes Classifier in the 1st scenario, the Naïve Bayes Classifier in the 2nd scenario, the K-Nearest Neighbor method in the 1st scenario, and the K-Nearest Neighbor method in the 2nd scenario.

Testing the Naïve Bayes Classifier Method, Pre-processing 2nd Scenario: The second test uses the Naïve Bayes Classifier method with the 2nd pre-processing scenario, i.e., without word normalization. Pre-processing consists of case folding, deleting emojis, cleaning data, deleting repetitive words, tokenization, stemming, and stopword removal. Table VII shows the resulting confusion matrix.
TABLE X
MODEL TESTING RESULTS

Model | Accuracy | Precision | Recall
NBC 1st Scenario | 87.14% | 87.18% | 87.14%
NBC 2nd Scenario | 86.67% | 86.64% | 86.67%
KNN 1st Scenario | 80.48% | 80.93% | 80.47%
KNN 2nd Scenario | 77.14% | 78.12% | 77.14%

TABLE VII
CONFUSION MATRIX MODEL NAÏVE BAYES 2ND SCENARIO

Actual \ Predicted | Negative | Neutral | Positive
Negative | TN=51 | FL2=3 | FP=8
Neutral  | FN2=2 | TL=68 | FP2=2
Positive | FN=6 | FL=7 | TP=63

Based on Table X, the model with the Naïve Bayes Classifier method in the 1st scenario is superior to the other three models, with 87.14% accuracy, 87.18% precision, and 87.14% recall. In comparison, the Naïve Bayes Classifier in the 2nd scenario gets 86.67% accuracy, 86.64% precision, and 86.67% recall. The K-Nearest Neighbor method in the 1st scenario achieves an accuracy of 80.48%, a precision of 80.93%, and a recall of 80.47%. In the 2nd scenario, the K-Nearest Neighbor method reaches an accuracy and recall of 77.14% and a precision of 78.12%. The graph comparing the test results of the models is shown in Figure 4.

Testing the K-Nearest Neighbor Method, Pre-processing 1st Scenario: The third test uses the K-Nearest Neighbor method with the 1st pre-processing scenario, which applies word normalization. Pre-processing consists of case folding, deleting emojis, cleaning data, deleting repetitive words, tokenization, word normalization, stemming, and stopword removal. Table VIII shows the resulting confusion matrix.

TABLE VIII
CONFUSION MATRIX MODEL K-NEAREST NEIGHBOR 1ST SCENARIO

Actual \ Predicted | Negative | Neutral | Positive
Negative | TN=49 | FL2=7 | FP=6
Neutral  | FN2=6 | TL=63 | FP2=3
Positive | FN=7 | FL=12 | TP=57
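The accuracy, precision, and recall figures follow directly from these confusion matrices. As a check, the following sketch recomputes the overall accuracy and the per-class precision and recall from the Table VI values (Naïve Bayes Classifier, 1st scenario):

```python
# Rows = actual class, columns = predicted class; order: negative, neutral, positive.
cm = [[50,  2, 10],
      [ 2, 69,  1],
      [ 4,  8, 64]]

total   = sum(sum(row) for row in cm)                 # 210 test items
correct = sum(cm[i][i] for i in range(3))             # diagonal = correct predictions
accuracy = correct / total

recall    = [cm[i][i] / sum(cm[i]) for i in range(3)]                 # per actual class
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(3)]  # per predicted class

print(f"accuracy = {accuracy:.2%}")                   # accuracy = 87.14%
```

The diagonal sum (50 + 69 + 64 = 183) over 210 test items reproduces the 87.14% accuracy reported for this model in Table X; the per-class precision and recall lists can be averaged (the averaging scheme is not specified in the paper) to obtain the reported aggregate figures.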
Testing the K-Nearest Neighbor Method, Pre-processing 2nd Scenario: The fourth test uses the K-Nearest Neighbor method with the 2nd pre-processing scenario, without word normalization. Pre-processing consists of case folding, deleting emojis, cleaning data, deleting repetitive words, tokenization, stemming, and stopword removal. The resulting confusion matrix is in Table IX.

Figure 4. Graph of Model Testing Results

The sentiment analysis model with the Naïve Bayes Classifier in the 1st scenario achieves an accuracy of 87.14%, precision of 87.18%, and recall of 87.14%, compared with 86.67% accuracy, 86.64% precision, and 86.67% recall for the same method in the 2nd scenario. The K-Nearest Neighbor method in the 1st scenario obtained an accuracy of 80.48%, precision of 80.93%, and recall of 80.47%, which is likewise superior to the K-Nearest Neighbor model in the 2nd scenario, with an accuracy of 77.14%, precision of 78.12%, and recall of 77.14%. Based on this research, the model with the Naïve Bayes Classifier in the 1st scenario has the best accuracy, precision, and recall of the four models. By applying the word normalization stage, i.e., changing non-standard words into standard words during pre-processing, the 1st-scenario model refines the comment representations used by the sentiment analysis system, which increases accuracy, precision, and recall. Within the framework of the Naïve Bayes Classifier method, every word is treated as an independent feature.
Word normalization can affect how the Naïve Bayes Classifier interprets the relationship between words, and therefore the probability calculations the algorithm uses. If normalization changes the meaning or frequency of words, it also changes the word probabilities and hence the classification. Meanwhile, the K-Nearest Neighbor method uses the distance or similarity between data points to classify new data; word normalization affects the distance calculation in the feature space. If normalization makes originally different word forms uniform, it changes the distance measurements and thus how K-Nearest Neighbor classifies new data. The Naïve Bayes Classifier is superior to K-Nearest Neighbor here in accuracy, precision, and recall. This is plausible because the Naïve Bayes Classifier tends to perform well on relatively small datasets, whereas K-Nearest Neighbor is more effective on large datasets.

IV. CONCLUSION

Applying the word normalization stage in the Naïve Bayes Classifier and K-Nearest Neighbor methods produced higher accuracy than omitting it. The Naïve Bayes Classifier method, in both the 1st and 2nd scenarios, has superior accuracy compared to the K-Nearest Neighbor method in both scenarios. In model testing with 1050 data points, the Naïve Bayes Classifier achieved an accuracy of 87.14% in the 1st scenario and 86.67% in the 2nd scenario, while the K-Nearest Neighbor method achieved 80.48% in the 1st scenario and 77.14% in the 2nd scenario.

As a suggestion arising from the study's findings, the number of datasets should be increased.
In addition, it is recommended to compare the results with those obtained from other machine learning or deep learning approaches, and to incorporate additional pre-processing stages to achieve more accurate results.

REFERENCES