Matrik: Jurnal Manajemen, Teknik Informatika, dan Rekayasa Komputer
Vol. No. July 2023, pp. 495-504
ISSN: 2476-9843, accredited by Kemenristekdikti, Decree No: 200/M/KPT/2020
DOI: 10.30812/matrik.
Hate Speech Detection for Banjarese Languages on Instagram Using Machine Learning Methods
Muhammad Alkaff, Muhammad Afrizal Miqdad, Muhammad Fachrurrazi, Muhammad Nur Abdi, Ahmad Zainul Abidin, Raisa Amalia
Universitas Lambung Mangkurat, Banjarmasin, Indonesia
Article Info
Article history: Received May 02, 2023; Revised May 30, 2023; Accepted July 01, 2023
Keywords: Banjarese Language; Dataset; Hate Speech Detection; Instagram; Machine Learning
ABSTRACT
Hate speech refers to verbal expression or communication that aims to provoke or discriminate against others. The Ministry of Communication and Information of Indonesia encountered and dealt with 3,640 cases of hate speech transmitted through digital channels between 2018 and 2021.
Particularly in South Kalimantan, hate speech in the local language, Banjarese, has become increasingly prevalent in recent years.
Surprisingly, there is a lack of research on using machine learning to detect hate speech in the Banjarese language, specifically on Instagram.
Therefore, this study aimed to address this gap by constructing a dataset of Banjarese language hate speech and comparing various feature extraction techniques and machine learning models to detect Banjarese language hate speech effectively.
The feature extraction methods used were Word N-Gram, Term Frequency-Inverse Document Frequency (TF-IDF), a combination of Word N-Gram and TF-IDF, Word2Vec, and GloVe, while the machine learning methods used were Support Vector Machine (SVM), Naïve Bayes, and Decision Tree.
The results of this study revealed that the combination of TF-IDF for feature extraction and SVM as the model achieves the best performance.
The average Recall, Precision, Accuracy, and F1-Score exceeded 90%, demonstrating the model's ability to identify Banjarese hate speech accurately.
Copyright © 2022 The Authors.
This is an open access article under the CC BY-SA license.
Corresponding Author:
Muhammad Alkaff, 6281953632809
Faculty of Engineering and Department of Information Technology, Universitas Lambung Mangkurat, Banjarmasin, Indonesia.
Email: m.alkaff@ulm.
How to Cite:
M. Alkaff, M. A. Miqdad, M. Fachrurrazi, M. N. Abdi, A. Z. Abidin, and R. Amalia, "Hate Speech Detection for Banjarese Languages on Instagram Using Machine Learning Methods," MATRIK: Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer, vol. 3, pp. Jul.
This is an open access article under the CC BY-SA license: https://creativecommons.org/licenses/by-sa/4.
Journal homepage: https://journal.
id/index.
php/matrik ISSN: 2476-9843
INTRODUCTION
Hate speech is an expression, writing, action, or performance intended to provoke violence or discrimination against someone based on the characteristics of the group they represent, such as race, ethnicity, gender, sexual orientation, religion, and other characteristics.
Hate speech is one of the important topics of discussion in social media analysis.
It is mainly associated with the freedom of users to share content and opinions on existing social media platforms.
This freedom of opinion has also led to an increase in hate speech on social media.
Hate speech containing harsh words or phrases accelerates social conflict because such words and phrases trigger emotions.
This problem affects the dynamics and interactions of online social communities.
In Indonesia, the Ministry of Communication and Information Technology of the Republic of Indonesia (KOMINFO) handled 3,640 cases of SARA-based hate speech in the digital space from 2018 to April 26, 2021.
In South Kalimantan, hate speech cases have been rampant in recent years.
According to several news pages, in 2018 a social media account uploaded content that allegedly contained elements of hate speech considered insulting to a cleric from Banjar, South Kalimantan.
In 2020, a State Civil Apparatus (ASN) employee was arrested for spreading hoax news and hate speech against the Indonesian National Police (POLRI) institution.
In January 2021, when a major flood hit South Kalimantan, H. Sahbirin Noor became the target of hate speech from South Kalimantan residents over his handling of the floods.
Most of the hate speech uttered by residents of South Kalimantan uses the Banjarese language.
Among social media platforms, Instagram is where this hate speech is most commonly found.
Hate speech detection has become crucial in social media platforms, including Instagram.
The Banjarese language is one of the languages spoken in Indonesia, and detecting hate speech in this language on Instagram is a relatively new area of research.
This review aims to provide an overview of previous studies that support and strengthen the novelty and contribution of detecting Banjarese language hate speech on Instagram.
Previous research has extensively explored the accuracy of machine learning methods in detecting hate speech on social media.
The effectiveness of these methods depends on the language and dataset used.
For instance, a study focused on the English language employed a dataset comprising 14,509 tweets from Twitter.
The study applied the linear SVM algorithm to classify hate speech, achieving an accuracy rate of 78%.
Furthermore, a study on the Indonesian language involved a dataset of 13,169 tweets from Twitter.
The study used RFDT (Random Forest Decision Tree) and LP (Linear Programming) transformation methods.
Without identifying targets, categories, and levels, the classification process achieved an accuracy rate of 77.
In contrast, the classification with the identification of targets, categories, and levels yielded an accuracy rate of 66.12%.
Salim and Suhartono conducted a systematic literature review of different machine-learning methods for hate speech detection.
Their study can serve as a basis for an experimental approach to detecting hate speech and abusive language.
Zhang et al. observed that extremist violence tends to increase online hate speech, particularly in messages directly advocating violence.
Sinyangwe established that in the fore model, to detect hate speech and offensive language on online social media platforms, the dataset must be categorized and presented in statistical form after running the model.
Ghosal and Jain identified the need for artificial intelligence (AI) in hate speech research.
Awal explored fine-tuning language models (LMs) to perform hate speech detection, and these solutions have yielded significant performance.
Li and Ning researched anti-Asian hate speech detection via data-augmented semantic relation inference.
Boishakhi et al. used a combined approach to detect hate speech in content containing video, audio, and speech by extracting feature images and feature values from audio and text.
They used machine learning, deep learning, and natural language processing to detect hate speech.
In another study, the researchers used Long Short-Term Memory for hate speech and abusive language detection in Indonesian YouTube comment sections.
Deshpande et al. conducted experiments for a binary hate speech classification task in Multilingual-Train Monolingual-Test, Monolingual-Train Monolingual-Test, and Language-Family-Train Monolingual-Test scenarios.
Mozafari et al. investigated the feasibility of applying a meta-learning approach to cross-lingual few-shot hate speech detection by leveraging two meta-learning models based on optimization-based and metric-based (MAML and Proto-MAML) methods.
These findings demonstrate the varying performance of different machine learning approaches in hate speech detection, depending on the language and dataset under consideration.
Therefore, the novelty of this research lies in investigating hate speech detection using machine learning techniques, specifically in the context of the Banjarese language on social media platforms.
In order to address this gap in the literature, this study aims to explore existing methods and identify the most accurate approach for detecting hate speech in the Banjarese language.
The data utilized in this study comprises comments extracted from local Instagram accounts known for frequently containing hate speech.
Three commonly employed models were chosen for text classification purposes: Support Vector Machine (SVM), Naïve Bayes, and Decision Tree.
SVM is commonly employed as a binary classifier in natural language processing (NLP) tasks.
It constructs a separating boundary that maximizes the margin between the classes, thereby minimizing classification errors.
Naïve Bayes, widely recognized for its simple yet effective assumptions and ease of implementation, is extensively used for text classification.
Decision trees have been extensively employed in various machine learning tasks, as they possess a lucid structure that offers insights into the training data and facilitates straightforward implementation.
This study aims to determine the most accurate method for detecting hate speech on social media, particularly Instagram.
Consequently, the findings of this research can serve as a valuable reference when selecting an appropriate machine-learning method to assess the accuracy of hate speech detection in the Banjarese language.
The researchers hope that this study will benefit other scholars, particularly those working on low-resource local languages such as Banjarese.
RESEARCH METHOD
This research aims to create a Banjarese language hate speech dataset and try several combinations of feature extraction and machine learning models to determine which combination has the best accuracy in classifying hate speech.
The method used in this study can be seen in Figure 1.
Figure 1. Research Methods
Data Collection
Because this study focuses on detecting hate speech in the Banjarese language, for which no dataset previously existed, the researchers created a dataset by collecting comments from local Instagram accounts where many comments are written in Banjarese.
Comments were mainly collected from posts that discuss disasters, politics, or other topics that tend to trigger hate speech.
Data Filtering and Annotation
At the data filtering stage, the researchers removed redundant data and translated comments written in languages other than Banjarese into Banjarese.
The translation followed a Banjarese language dictionary and was validated by linguists.
The researchers labeled the dataset manually.
Each comment was marked as "hate speech" with the number 1 or "not hate speech" with the number 0.
Before annotating the data, the researchers prepared guidelines defining what counts as hate speech in this study.
Preprocessing
Before classifying the data, it is necessary to carry out several preprocessing procedures:
Case Folding: changes all words in a text into uniform lowercase letters to facilitate further processing.
Stop Word Removal: a stop word is a common word that often appears in sentences but carries little meaning. Removing stop words can increase the signal-to-noise ratio in unstructured text and thus increase the statistical significance of terms that may be important for a specific task.
Punctuation Removal: punctuation marks, which divide text into sentences, paragraphs, and phrases, affect the result of any text processing approach, especially approaches that depend on the frequency of words and phrases, because punctuation marks occur often in text. Most text and document datasets contain many unnecessary characters, such as punctuation and special characters. These characters are essential for human understanding of documents, but they can harm classification algorithms.
URL Removal: URLs do not correlate with the meaning of a comment, can reduce classification performance, and are not used in the subsequent steps.
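The four preprocessing steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the stop-word list is a hypothetical placeholder (the paper does not publish the list it used), and URLs are removed before punctuation so that the URL pattern still matches intact links.

```python
import re
import string

# Hypothetical stop-word list for illustration only; the paper does not
# state which Banjarese stop words were removed.
STOP_WORDS = {"nang", "ni", "ja", "pang"}

URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(comment: str, stop_words: set[str] = STOP_WORDS) -> list[str]:
    """Return the cleaned tokens of one comment."""
    text = comment.lower()               # case folding
    text = URL_PATTERN.sub(" ", text)    # URL removal (while URLs are intact)
    text = text.translate(PUNCT_TABLE)   # punctuation removal
    tokens = text.split()
    return [t for t in tokens if t not in stop_words]  # stop word removal


tokens = preprocess("Polisi NANG bepandir, bungul!!! https://example.com/x")
```

The function returns a token list so that the same output can feed either an n-gram counter or an embedding lookup.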
Feature Extraction Machine learning algorithms cannot understand classification rules on unprocessed text.
Machine learning algorithms need numeric features to understand classification.
Therefore, feature extraction is one of the main steps in text classification.
This step extracts the main features from the raw text and represents the features extracted in numerical form .
In this research, the feature extraction methods used are Word N-gram, TF-IDF, a combination of Word N-gram and TF-IDF, Word2Vec, and GloVe, as shown in Table 1.
Table 1. Feature Extraction (Key Concepts)

| Concept | Definition |
| --- | --- |
| Word N-gram | A technique that collects sequential word lists of sizes 1, 2, 3, and so on, to list all expressions of size N and calculate their frequency. |
| Term Frequency-Inverse Document Frequency (TF-IDF) | A feature representation technique that represents the importance of a word to a document in a document set. It combines the frequency of a word's appearance in a document with the number of documents containing that word. |
| Word2Vec | A technique to learn vector representations of words, which can then be used to train machine learning models. |
| GloVe | A global log-bilinear regression model that combines the advantages of the two main model families in the literature: global matrix factorization and local context window methods. |
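To make the first two rows of Table 1 concrete, the sketch below builds word n-grams and TF-IDF weights by hand. It assumes the textbook weighting tf × ln(N / df); the paper does not state which TF-IDF variant was used, and library implementations such as scikit-learn apply smoothing and normalization on top of this. The token lists are taken from sample comments in the dataset purely as input text.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=2):
    """List every contiguous word sequence of length n_min..n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


docs = [
    ["negara", "indonesia", "lucu", "tapi", "bungul"],
    ["polisi", "nang", "bepandir", "bungul"],
    ["mehadangi", "habar", "berita", "nang", "hanyar", "admin"],
]
unigram_weights = tfidf(docs)
# "N-Gram & TF-IDF": weight unigrams and bigrams together.
bigram_weights = tfidf([word_ngrams(d, 1, 2) for d in docs])
```

In the first document, "lucu" occurs in one document and "bungul" in two, so "lucu" receives the larger weight even though both appear once.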
Classification
The classifier assigns each comment to one of two classes, hate speech or not hate speech, and each prediction falls into one of four outcome categories: true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP).
Three machine learning algorithms are applied in this research to detect hate speech in the Banjarese language: SVM, Naïve Bayes, and Decision Tree.
These algorithms are implemented using the scikit-learn library.
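A minimal sketch of how the three classifiers might be set up with scikit-learn follows. The specific estimator classes (`LinearSVC`, `MultinomialNB`, `DecisionTreeClassifier`), their default hyperparameters, and the toy training texts are all assumptions for illustration; the paper names only the algorithm families and the library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy placeholder data, NOT from the paper's dataset: 1 = hate speech, 0 = not.
texts = [
    "insult slur insult", "slur threat insult", "threat slur slur",
    "news flood update", "flood help update", "news help admin",
]
labels = [1, 1, 1, 0, 0, 0]


def build_models():
    """One TF-IDF pipeline per classifier family named in the paper."""
    return {
        "SVM": make_pipeline(TfidfVectorizer(), LinearSVC()),
        "Naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
        "Decision Tree": make_pipeline(
            TfidfVectorizer(), DecisionTreeClassifier(random_state=0)
        ),
    }


predictions = {}
for name, model in build_models().items():
    model.fit(texts, labels)
    predictions[name] = model.predict(["slur insult", "flood news"]).tolist()
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on training data only, which avoids leaking test-set terms into the features.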
Evaluation
For evaluation, the researchers apply the F1-measure and Accuracy as performance evaluation metrics in this study.
Accuracy is the ratio of correct predictions to the total number of samples, while F1-measure is the harmonic mean of Precision and Recall.
Classifier Performance is measured by calculating true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP), which will form a confusion matrix.
The confusion matrix table is shown in Table 2.
Table 2. Confusion Matrix

| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

True Positive (TP) is the number of positive instances classified correctly.
False Positive (FP) is the number of negative instances incorrectly classified as hate speech.
False Negative (FN) is the number of incorrect predictions that an instance is negative.
True Negative (TN) is the number of negative instances classified correctly.
Different performance metrics are used to assess the performance of the classifiers.
Models built in this experiment were evaluated by calculating their F1-score.
The performance metrics are discussed briefly below.
The accuracy rate is the number of correctly classified samples (true positives and true negatives) over the total number of samples.
The formula for the accuracy rate is shown in Equation (1).
Recall is the proportion of actual positives that are predicted positive.
The formula for the recall rate is shown in Equation (2).
Precision is the positive predictive value, that is, the proportion of predicted positives that are actually positive; it indicates how accurate each model is when it flags hate speech.
The formula for the precision rate is shown in Equation (3).
F1-measure is the harmonic mean of recall and precision.
The formula for the F1-measure is shown in Equation (4).

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{3} \]
\[ \text{F1-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]
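The four metrics can be computed directly from the confusion-matrix counts. The sketch below is a plain-Python illustration; the function name and the choice to return 0.0 when a denominator is zero are assumptions, and the example labels are made up.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, recall, precision, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": accuracy, "recall": recall,
        "precision": precision, "f1": f1,
    }


# Made-up example: 3 hate speech comments, 7 normal ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Here TP = 2, FN = 1, FP = 1, and TN = 6, so accuracy is 0.8 while recall, precision, and F1 are each 2/3.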
RESULT AND ANALYSIS
Banjarese Hate Speech Dataset
The Banjarese language hate speech dataset comes from comments on local South Kalimantan Instagram accounts that use Banjarese.
The work on this dataset goes through several stages: data collection, data filtering and annotation, preprocessing, feature extraction, classification, and evaluation.
The CSV-formatted dataset consists of 15,481 instances, of which 2,039 are classified as hate speech and 13,442 as not hate speech (see Table 3).
The sample dataset and labels used in this study are shown in Table 4.
Due to the data imbalance, the F1-measure is used as the main performance metric.
F1-measure is a composite metric considering precision and recall.
Precision measures correctly predicted hate speech instances out of all predicted hate speech instances, while recall measures correctly predicted hate speech instances out of all actual hate speech instances.
F1-measure provides a balanced evaluation metric, particularly for imbalanced datasets.
It gives a more reliable picture of model performance than accuracy when the classes are imbalanced.
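A short calculation shows why accuracy alone is misleading on this class distribution. The sketch assumes a hypothetical baseline that labels every comment as not hate speech; only the class counts come from the paper.

```python
# Class counts reported for the dataset.
hate, normal = 2_039, 13_442
total = hate + normal

# Hypothetical baseline: predict "not hate speech" for every comment.
tp, fn = 0, hate      # every hate speech comment is missed
fp, tn = 0, normal    # every normal comment is correct

baseline_accuracy = (tp + tn) / total
baseline_recall = tp / (tp + fn)
# Precision is 0/0 for this baseline; it is treated as 0 here by convention.
baseline_precision = tp / (tp + fp) if (tp + fp) else 0.0
baseline_f1 = (
    2 * baseline_precision * baseline_recall
    / (baseline_precision + baseline_recall)
    if (baseline_precision + baseline_recall)
    else 0.0
)
```

The baseline reaches roughly 86.8% accuracy without detecting a single hate speech comment, while its F1-measure for the hate speech class is 0.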
Table 3. Dataset Distribution

| Number of Sentences | Hate Speech | Normal Speech |
| --- | --- | --- |
| 15,481 | 2,039 | 13,442 |
Table 4. Banjarese Language Hate Speech Dataset
Text Cabut ja Indonesia ni gatuk pang Liwar tahi Handak tetawa tapi ini indonesia Bebanyak begal ni Mehadangi habar berita nang hanyar admin Mirisnya hukum negara kaini Negara Indonesia lucu tapi bungul Polisi nang bepandir bungul Translation Label Just let go stupid This is Indonesia, let's touch Dull So shit I want to laugh but this is Indonesia More and more Waiting for the latest news admin It's sad that state law like this Indonesia is funny but stupid Stupid talking
The Combination of Feature Extraction and Model for Detecting Banjarese Hate Speech
After the dataset is collected, the next step is to apply each feature extraction method and model and then compare the combinations using the Recall, Precision, Accuracy, and F1-Measure metrics to find the most accurate combination of feature extraction and model for detecting hate speech in the Banjarese language.
The dataset was split 8:2 into training and testing sets.
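An 8:2 split can be sketched as follows. The stratification by class and the fixed seed are assumptions made here so that the minority class keeps its share in both sets; the paper states only the 8:2 ratio.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.2, seed=42):
    """Return (train_indices, test_indices), keeping the class ratio in both."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Label vector with the class counts reported for the dataset.
labels = [1] * 2_039 + [0] * 13_442
train_idx, test_idx = stratified_split(labels)
```

With these counts the test set holds 3,096 comments, 408 of them hate speech, and the training set holds the remaining 12,385.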
The results of combining feature extraction methods and models on the dataset can be seen in Table 5.
On the accuracy metric, the combination of feature extraction and model with the highest score on the Banjarese language hate speech dataset is TF-IDF and SVM, with a score of 91%.
On the recall metric, two combinations share the highest score of 91%: TF-IDF with Naïve Bayes and TF-IDF with SVM.
The precision metric differs from the two previous metrics: the combination with the highest score is TF-IDF and Naïve Bayes, with a score of 91%.
On the F1-measure metric, SVM and TF-IDF is the combination with the highest score, at 91%.
Table 5. Performance of Algorithms
Columns: Model, Feature Extraction, Accuracy (%), Recall (%), Precision (%), F1-measure (%)
Models: SVM, Naïve Bayes, Decision Tree
Feature extraction (for each model): N-Gram, TF-IDF, N-Gram & TF-IDF, Word2Vec, GloVe
It can be seen from Table 5 that the Naïve Bayes and SVM models with N-Gram and TF-IDF feature extraction dominate the highest values for the F1-measure, Accuracy, Precision, and Recall metrics.
However, because the data are unbalanced, the F1-measure is used as the deciding metric, so SVM with TF-IDF is the best combination of model and feature extraction found in this research for detecting hate speech in the Banjarese language.
Table 6 shows the comparison of this research with previous research.
The research [?] conducts a comparative analysis of studies focusing on different languages, including Javanese, Sundanese, Madurese, Minangkabau, and Musi.
In contrast, research [?] specifically compares previous research on Sundanese and Javanese languages.
The novelty aspect of each study is emphasized in the corresponding column, and the outcomes of prior investigations are contrasted with the present study's findings.
The results presented in reference [?] demonstrate a positive correlation between dataset size and performance.
The current study employs a Banjarese language dataset comprising 15,481 instances, achieving an F1-measure of 91%.
These results indicate superior performance compared to previous studies conducted on other regional languages.
On the other hand, reference [?] focuses on comparing different algorithms and feature extraction techniques.
The earlier research achieved F1-measures ranging from 80% to 82% using N-Gram feature extraction in combination with algorithms such as SVM, RFDT, and Naïve Bayes for Sundanese and Javanese languages.
However, the present study surpasses these previous findings by employing TF-IDF feature extraction.
By utilizing this approach in conjunction with SVM, the F1-measure for detecting Banjarese hate speech reaches 91%.
The effectiveness of the TF-IDF feature extraction method stems from its ability to assign higher weights to words that offer greater information content within a specific document while considering their rarity across the entire dataset.
This weighting scheme proves instrumental in capturing the discriminative power of words specific to hate speech in the Banjarese language.
Furthermore, TF-IDF effectively mitigates the influence of common words that frequently appear in both hate speech and non-hate speech documents.
By downplaying the significance of these common words, the feature extraction method can focus more on identifying distinctive words and phrases that serve as indicators of hate speech in the Banjarese language.
Thus, the TF-IDF feature extraction method takes into account the distribution of words across the entire dataset to enhance hate speech detection capabilities.
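The weighting described above can be written out explicitly. The formulation below is the common textbook variant and is an assumption here, since the paper does not state which TF-IDF variant or smoothing it used; the document frequencies in the worked example are hypothetical, and only the collection size N = 15,481 comes from the dataset.

```latex
% tf(t, d): relative frequency of term t in comment d
% df(t):    number of comments that contain t
% N:        total number of comments in the collection
\[
  \operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \ln\frac{N}{\operatorname{df}(t)}
\]
% Worked example with N = 15{,}481 and hypothetical document frequencies:
\[
  \operatorname{df}(t_{\text{common}}) = 1000
  \;\Rightarrow\; \ln\frac{15481}{1000} \approx 2.74,
  \qquad
  \operatorname{df}(t_{\text{rare}}) = 10
  \;\Rightarrow\; \ln\frac{15481}{10} \approx 7.34
\]
```

At equal term frequency, the rarer term therefore receives a weight about 2.7 times larger than the common one, which is the down-weighting effect the paragraph describes.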
Table 6. Comparison of Research Results

| References | Novelty | Result (Previous Study) | Result (This Study) |
| --- | --- | --- | --- |
| [?] | Comparison: Based on research using other regional languages, such as the Javanese language with a dataset of 3449, the Sundanese language with a dataset of 2207, the Madurese language with a dataset of 2773, the Minangkabau language with a dataset of 3125, and the Musi language with a dataset of 2564. Novelty: The results show that a larger dataset increases the performance obtained. | Javanese language, dataset 3449, F1-measure 87.; Sundanese language, dataset 2207, F1-measure 79.; Madurese language, dataset 2773, F1-measure 73.; Minangkabau language, dataset 3125, F1-measure 69%; Musi language, dataset 2207, F1-measure 80% | Banjarese language, dataset 15481, F1-measure 91% |
| [?] | Comparison: Results from previous research on Sundanese and Javanese using Naïve Bayes, SVM, and RFDT algorithms and N-Gram feature extraction yielded better results. Novelty: The results of this study on the Banjarese language using SVM, Naïve Bayes, and Decision Tree, as well as TF-IDF feature extraction, resulted in a much better F1-measure. | F1-measure: SVM N-Gram 82%; RFDT N-Gram 82%; Naïve Bayes N-Gram 80% | F1-measure: SVM TF-IDF 91%; DT TF-IDF 89%; Naïve Bayes TF-IDF 89% |

CONCLUSION
This research uses feature extraction and model experiments to investigate hate speech detection in the Banjarese language.
By analyzing a dataset of 15,481 instances, including 2,039 hate speech samples and 13,442 non-hate speech samples, the study finds that the combination of TF-IDF feature extraction and the Support Vector Machine (SVM) model achieves an average score exceeding 90% on each metric.
The research contributes novel insights to the field by addressing the lack of previous studies in hate speech detection for the Banjarese language, and it offers practical implications for future research in refining detection methods and enhancing accuracy.
The findings of this study have significant implications for hate speech detection in the Banjarese language.
The demonstrated effectiveness of the TF-IDF feature extraction method and SVM model underscores their potential as accurate tools for distinguishing Banjarese language hate speech.
The research also provides a valuable dataset for further exploration, enabling researchers to investigate alternative approaches and refine detection methods specific to the Banjarese language.
Overall, this study expands knowledge in hate speech detection and offers valuable insights for future research endeavors in this area.
ACKNOWLEDGEMENTS
The authors acknowledge the research funding provided by the Directorate of Learning and Student Affairs (BELMAWA) under the Student Creativity Program (PKM) grant, number 2489/E2/KM.01/2022.
DECLARATIONS
AUTHOR CONTRIBUTION
All authors contributed equally as main contributors to this paper.
All authors read and approved the final paper.
FUNDING STATEMENT
The Directorate of Learning and Student Affairs (BELMAWA) funded the research under the Student Creativity Program (PKM) grant, number 2489/E2/KM.01/2022.
COMPETING INTEREST
The authors declare no conflict of interest.
REFERENCES