JOIV : Int. Inform. Visualization, 9 - March 2025 824-830
INTERNATIONAL JOURNAL
ON INFORMATICS VISUALIZATION
journal homepage : w.
org/index.
php/joiv
Performance Improvement of Cosine Similarity Algorithm with Bidirectional Encoder Representations from Transformers on Abstract Document Similarity Detection
Musthofa Galih Pradana a,*, Nindy Irzavika a, Nurhuda Maulana a, Jesselyn Mu a, Valtrizt Khalifah Wari a
a Faculty of Computer Science, Universitas Pembangunan Nasional Veteran Jakarta, South Jakarta, DKI Jakarta, Indonesia
Corresponding author: *musthofagalihpradana@upnvj.
Abstract: In thesis courses or final projects, students are required to conduct research in their field of study, find innovations, solve problems, and foster a critical culture and mindset.
However, an issue that is often encountered is plagiarism.
Plagiarism is taking a work, which may be someone else's opinion, and presenting it as one's own.
One way to apply technology is to carry out early detection of the similarity of documents written by students.
In this case, the document to be checked is the abstract that students must submit when proposing a thesis title.
The algorithm used is the cosine similarity algorithm, which is computationally efficient because of its ease of interpretation and compatibility with large-scale data.
This research was carried out using two schemes: cosine similarity with Bidirectional Encoder Representations from Transformers (BERT) and cosine similarity without BERT.
The corpus used in this study consisted of 1450 student thesis abstracts, with 10 documents used as test data to assess the performance of the cosine similarity algorithm in detecting the similarity of abstract documents.
The results showed that the scheme optimized with the BERT approach produced better results, with an average performance improvement of 23.48%.
Keywords: Abstract; cosine similarity; bidirectional encoder representations.
Manuscript received 28 Jun.
revised 17 Sep.
accepted 8 Oct.
Date of publication 31 Mar.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
I.
INTRODUCTION
The Preamble to the 1945 Constitution and Articles 31 and 32 state that one of the roles of the state government is to educate people's lives.
Education is related to the future of a country and should be prioritized.
Moreover, the progress or decline of a country is primarily determined by the quality of its human resources.
One of the critical levels for developing human resources is the university level.
The process of education, research, and service, known as the Tridharma of Higher Education, is an obligation.
Research significantly impacts higher education since it can encourage innovation and knowledge development and solve complex problems in various disciplines.
The research activities in the academic community involve both lecturers, as facilitators in the Tridharma of Higher Education, and their students.
From the students' perspective, completing a thesis or final project can encourage them to participate actively in research.
In a thesis or final project course, students must be able to conduct research in their field of study, find innovations, and solve problems.
Currently, the acceleration and advancement of artificial intelligence-based technology make it easier to access data and information while also facilitating plagiarism.
Plagiarism can be defined as taking a work, including someone's opinion, and presenting it as one's own.
Plagiarism, if left unchecked, will severely erode the noble values of education.
In the context of higher education as a place of production and transfer of knowledge, the government has issued regulations on the Prevention and Control of Plagiarism in Higher Education.
However, regulation alone is not enough.
A preventive measure against plagiarism is needed, one of which is utilizing technology.
Technology is a double-edged sword.
As long as it is used wisely, it will benefit human work.
Technology can be used to detect the similarity of students' manuscript abstracts early, when they submit a thesis title.
Each manuscript is compared with others to decide whether to accept the student's thesis title.
Many algorithms can be applied to detect the similarity of text documents, one of which is the cosine similarity algorithm.
It is computationally efficient because of its ease of interpretation and compatibility with large-scale data.
Test results show that the process of calculating the degree of title similarity can be done quickly.
Similar results were also found in ransomware detection evaluated with a confusion matrix, with 94% accuracy.
Moreover, a previous study used cosine similarity together with the Nazief-Adriani algorithm for stemming.
It suggested that the choice of words treated as keywords in the answer key greatly affected the results of the system assessment, and it obtained a cosine score of 89.5%.
Document similarity detection was also used in the Goods and Price Planning Information System (SiPaGa) application for the codification search process.
That study produced cosine similarity and TF-IDF weighting calculations that are expected to be applied to the SiPaGa application for more accurate results.
The cosine similarity and TF-IDF algorithms are expected to improve the accuracy of product codification searches.
Therefore, OPD can choose the product code as desired.
Moreover, previous research shows that product searches in e-commerce apply TF-IDF and Word2Vec, as well as cosine similarity, to calculate similarities between objects.
From the study above, it can be concluded that Word2Vec takes more time to process data than using TF-IDF, making TF-IDF more efficient in terms of time .
The similarity of news articles among several sites can be measured using the cosine similarity algorithm by translating Hindi news articles into English and then comparing them with English news articles.
Furthermore, cosine similarity, Jaccard similarity, and Euclidean distance were measured to calculate the news similarity score.
The results can be used for effective and efficient identification.
Meanwhile, another study checking similarities on news portals used cosine similarity and TF-IDF on the Microsoft News portal and obtained an accuracy of 80.77%.
Similarity detection has long been used in the information search and machine learning domains for multi-purpose text mining, and one study carried it out in combination with clustering.
That study compared many methods for measuring document similarity and concluded that PDSM, Cosine, and Jaccard were superior to Euclidean, Manhattan, and Kullback-Leibler.
Text classification has received significant attention recently, including a centroid-based approach, the multinomial naive Bayes classifier, support vector machines, and neural networks.
The classification algorithm involves cosine similarity and TF-IDF weighting, with the results being much more optimal.
The research from Hanifi states that using Doc2vec and Cosine Similarity algorithms helps the integration process for the initial analysis phase of the inventive design, which takes and collects essential knowledge from scientific data.
Applying these two algorithms can optimize the data collection time in the initial analysis phase of the inventive design process and can significantly improve the accuracy of the information collected .
On the other hand, cosine similarity can be combined with the Bidirectional Encoder Representations from Transformers (BERT) model, which is built on the Transformer architecture and processes text bidirectionally to better understand the context of words in sentences.
The BERT model in Natural Language Processing can obtain curve, accuracy, sensitivity, and specificity results of 0.96, 0.89, 0.88, and 0.
BERT is also often used to address the problems of computational complexity and large memory requirements because of its ease of interpretation and compatibility with large-scale data.
This is certainly relevant to students' abstracts, whose number will undoubtedly grow over time and call for a machine-learning approach.
This study comprehensively examines the application of the cosine similarity algorithm, combined with word weighting using TF-IDF, to obtain more optimal and accurate detection results using several test scenarios to strengthen the research position.
Using the notion of proximity, the cosine similarity algorithm finds instances of plagiarism or document similarity between thesis and final project papers.
The Bidirectional Encoder Representations from Transformers (BERT) method and the additional scheme of applying cosine similarity will be integrated, and the outcomes will be further examined and compared.
The selection of bidirectional encoder representations from transformers (BERT) and cosine similarity algorithms is predicated on their prowess in determining the degree of document similarity.
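The notion of proximity behind cosine similarity can be illustrated with a minimal sketch. It assumes each document is represented as a sparse vector of term counts; the function name and the two short example documents are hypothetical and are not taken from the study's corpus.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    shared_terms = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical mini-documents, represented by raw term counts.
doc_a = Counter("sistem monitoring reklame kota depok".split())
doc_b = Counter("sistem informasi reklame kota".split())
score = cosine_similarity(doc_a, doc_b)
print(round(score, 4))
```

A score near 1 means the two vectors point in almost the same direction, while a score of 0 means the documents share no terms. In practice the raw counts would be replaced by weighted values such as TF-IDF.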
II.
MATERIAL AND METHOD
State of the Art
This research discusses the detection of the similarity of students' abstracts submitted with the thesis title.
The algorithm used in this detection is cosine similarity with TF-IDF weighting.
To strengthen the position of the research, the following are several previous studies that apply identical algorithms or address similar cases.
Prior research can show approaches for detecting similarity.
Moreover, each approach has its advantages and disadvantages.
For example, the corpus method is more optimal for cross-language, and the semantic approach has good results but needs more resources.
In contrast, the graph structure method needs to rely on learning good graph representation to perform well.
Previous research suggested that the semantic approach produces the most optimal result .
Research on document similarity detection in NLP uses a semantic approach: understanding and extracting meaning from text using computational techniques.
The results of this study suggest that various methods exist to obtain embeddings from text, which are then used to detect document similarity, such as bag-of-words, TF-IDF, Word2vec, GloVe, Word Mover's Distance, and Smooth Inverse Frequency.
Moreover, the novel model outperformed all other models and, therefore, can be used to capture semantic information from input text effectively.
Research on the semantic approach for short text detection is still rare due to its limited application; such semantic detection can be performed with corpus-based, knowledge-based, and DL-based methods.
Measuring word similarity with a semantic approach becomes a solution to finding optimal results.
Previous research succeeded in improving detection results with Discourse Representation Structure, which shows more optimal results .
The results of the experimental evaluation confirmed that the proposed model improved the performance of the textual semantic similarity measure compared to the sentence embedding model, achieving an accuracy of 88.35%.
The research on checking final project documents using the TF-IDF algorithm could calculate the level of similarity of the final project titles submitted by students of the Padang State Polytechnic.
III.
RESULT AND DISCUSSION
Moreover, the reliability of the BERT model can produce good prediction results in determining the score of computer security vulnerability levels.
The BERT model builds on transformer models designed to process data with high capacity and effectiveness.
BERT is also good at making effective use of unlabeled datasets through estimated data and can improve performance in word similarity detection.
In addition, the BERT approach can improve and optimize results in word embeddings.
Applying a transformer-based architecture and proper sampling techniques significantly improves the performance of BERT.
Experimental results with BERT indicate that the proposed model improves the performance of the baselines on 24 NLP tasks.
Deep learning models based on a combination of BERT with Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU) algorithms have also produced good results.
Previous research suggested that the most optimal approach is semantic analysis.
It tends to require more resources than other approaches, but its results are more accurate.
One of the algorithms in the semantic approach is cosine similarity with TF-IDF weighting.
This gets optimal results, so it is suitable for detecting the similarity of students' abstracts.
Another reason is that the algorithm is computationally efficient because of its ease of interpretation and its suitability for large-scale data.
In this study, two test scenarios are carried out on the corpus data, comparing the performance of the cosine similarity algorithm in determining the level of similarity of abstract documents with and without Bidirectional Encoder Representations from Transformers (BERT).
This demonstrates the benefits of BERT, which can raise the accuracy of document similarity detection.
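The scenario that uses BERT can be pictured with the following sketch. It rests on assumptions that the text does not state: each token is taken to have a contextual embedding vector, the token vectors of an abstract are averaged (mean pooling) into one document vector, and cosine similarity is then computed between document vectors. The three-dimensional vectors below are invented stand-ins for real BERT outputs, which would come from a pretrained model and have far more dimensions.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one document vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical token embeddings for two abstracts (not real model output).
abstract_a_tokens = [[0.2, 0.1, 0.7], [0.4, 0.3, 0.5]]
abstract_b_tokens = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.2]]

doc_a = mean_pool(abstract_a_tokens)
doc_b = mean_pool(abstract_b_tokens)
score = cosine_similarity(doc_a, doc_b)
print(round(score, 4))
```

The scenario without BERT would differ only in how the document vectors are built, for example from TF-IDF weights instead of pooled embeddings.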
Data from student abstract documents at a university within the Faculty of Computer Science is utilized.
Abstract data from the last two years will be used for testing and training.
This data was taken from four departments in the Faculty of Computer Science.
The flow of the methodology is explained in detail as follows:
Methodology
This study adopts and modifies the Cross-Industry Standard Process for Data Mining (CRISP-DM) method.
This methodology is used as a non-proprietary standard methodology for data mining.
The methodology flow is shown in Figure 1.
Fig. 1 Methodology
The detailed methodology flow is discussed in the following section.
Data Preparation
The description of the corpus data used is shown in Table I and Table II.
TABLE I
RAW DATA
Corpus Document
Perkembangan ilmu pengetahuan dan teknologi di indonesia mengalami kemajuan yang sangat pesat kemajuan yang paling dirasakan kehidupan masyarakat saat ini salah satunya adalah kemajuan di bidang teknologi informasi dan komunikasi ilmu pengetahuan dan teknologi secara umum memiliki keterkaitan yang erat khususnya dalam bidang pendidikan islam salah satunya pesantren pesantren merupakan lembaga pendidikan berbasis agama islam yang dikembangkan secara pribumi oleh masyarakat website pondok pesantren asshiddiqiyah 2 tangerang mempunyai banyak kekurangan kekurangan tersebut menjadi kendala website website masih bersifat semi online perlu adanya pembaharuan fitur seperti pendaftaran peserta didik baru secara full online tujuan dari penelitian ini adalah untuk mengetahui pengukuran tingkat kapabilitas website pondok pesantren asshiddiqiyah 2 tangerang dengan framework cobit khususnya pada bidang pesantren dengan aspek domain dss dan mea metode penelitian ini adalah metode kuantitatif dengan menyebarkan kuisioner kepada kurang lebih 94 responden pengguna serta nilai capability level dan gap analysis dari framework cobit hasil penelitian ini menunjukkan bahwa uji validitas dengan tingkat signifikansi 0202 dan uji reliabilitas dengan nilai cronbachs alpha sebesar 0905 masing-masing dinyatakan valid dan reliabel kemudian rata-rata hasil capability level pada subdomain dss01 dss03 dss05 mea01 dan mea02 adalah sebesar 14 level ini berada pada level 1 performed process dari level target yang ingin dicapai pada sisi it website pondok pesantren asshiddiqiyah 2 tangerang yaitu level 4 predictable
TABLE II
RAW DATA 2
Corpus Document
saat ini daerah dapat mengatur sendiri rumah tangganya, oleh karena itu daerah diberikan kewenangan untuk menggali potensi sumber penerimaan yang ada dimana salah satunya berasal dari sektor pajak daerah, salah satu yang ingin dioptimalkan adalah pendapatan asli daerah (pad) dari jenis pajak reklame. saat ini di kota depok masih banyak reklame yang tidak memiliki izin maupun tidak diperpanjang izinnya tentu hal ini dapat mengurangi pad kota depok. oleh karena itu, saat ini diperlukan sistem yang dapat memonitoring reklame di kota depok. dalam sistem yang dirancang ini penulis melakukan pembahasan masalah dengan menggunakan metode pieces dan pengembangan sistem menggunakan air terjun yang diharapkan berbasis client server dengan arsitektur 3-tier. harapan penulis proses monitoring reklame dengan menggunakan web yang menerapkan jaringan vps dapat mempermudah para pemohon maupun staf dalam menyelesaikan pekerjaannya.
The data obtained is shown in Table I and Table II. The following stages were carried out:
Tokenization: This process breaks the description, initially a sentence, into individual words.
Breaking the text up into tokens facilitates word identification, as shown in Table III and Table IV.
Filtering: This stage filters out irrelevant words, or stop words.
Its function is to minimize the use of words that have less impact, which affects the overall results of the analysis, as in Table V and Table VI.
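The tokenization and filtering stages can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the regular expression used for splitting and the short stop word set are hypothetical, with the stop words chosen from those that appear in the filtering tables.

```python
import re

# Hypothetical stop word list for illustration only; a real run would use a
# complete Indonesian stop word list.
STOP_WORDS = {"ini", "di", "dan", "yang", "oleh", "itu", "dalam", "saat",
              "adalah", "untuk", "dari"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list, keeping the order."""
    return [token for token in tokens if token not in stop_words]


sentence = "saat ini di kota depok masih banyak reklame yang tidak memiliki izin"
tokens = tokenize(sentence)
filtered = filter_stop_words(tokens)
print(tokens)
print(filtered)
```

Because the pattern keeps only letters and digits, a hyphenated form such as "masing-masing" is split into two tokens, which matches the way it appears in the tokenization table.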
TABLE III
TOKENIZATION
Corpus Document
"perkembangan" "ilmu" "pengetahuan" "dan" "teknologi" "di" "Indonesia" "mengalami" "kemajuan" "yang" "sangat" "pesat" "kemajuan" "yang" "paling" "dirasakan" "kehidupan" "masyarakat" "saat" "ini" "salah" "satunya" "adalah" "kemajuan" "di" "bidang" "teknologi" "informasi" "dan" "komunikasi" "ilmu" "pengetahuan" "dan" "teknologi" "secara" "umum" "memiliki" "keterkaitan" "yang" "erat" "khususnya" "dalam" "bidang" "pendidikan" "islam" "salah" "satunya" "pesantren" "pesantren" "merupakan" "Lembaga" "pendidikan" "berbasis" "agama" "islam" "yang" "dikembangkan" "secara" "pribumi" "oleh" "masyarakat" "website" "pondok" "pesantren" "asshiddiqiyah" "2" "tangerang" "mempunyai" "banyak" "kekurangan" "kekurangan" "tersebut" "menjadi" "kendala" "website" "website" "masih" "bersifat" "semi" "online" "perlu" "adanya" "pembaharuan" "fitur" "seperti" "pendaftaran" "peserta" "didik" "baru" "secara" "full" "online" "tujuan" "dari" "penelitian" "ini" "adalah" "untuk" "mengetahui" "pengukuran" "tingkat" "kapabilitas" "website" "pondok" "pesantren" "asshiddiqiyah" "2" "Tangerang" "dengan" "framework" "cobit" "khususnya" "pada" "bidang" "pesantren" "dengan" "aspek" "domain" "dss" "dan" "mea" "metode" "penelitian" "ini" "adalah" "metode" "kuantitatif" "dengan" "menyebarkan" "kuisioner" "kepada" "kurang" "lebih" "94" "responden" "pengguna" "serta" "nilai" "capability" "level" "dan" "gap" "analysis" "dari" "framework" "cobit" "hasil" "penelitian" "ini" "menunjukkan" "bahwa" "uji" "validitas" "dengan" "tingkat" "signifikansi" "0202" "dan" "uji" "reliabilitas" "dengan" "nilai" "cronbachs" "alpha" "sebesar" "0905" "masing" "masing" "dinyatakan" "valid" "dan" "reliabel" "kemudian" "rata" "rata" "hasil" "capability" "level" "pada" "subdomain" "dss01" "dss03" "dss05" "mea01" "dan" "mea02" "adalah" "sebesar" "14" "level" "ini" "berada" "pada" "level" "1" "performed" "process" "dari" "level" "target" "yang" "ingin" "dicapai" "pada" "sisi" "it" "website" "pondok" "pesantren" "asshiddiqiyah" "2" "Tangerang" "yaitu" "level" "4" "predictable" "process"
TABLE IV
TOKENIZATION 2
Corpus Document
"saat" "ini" "daerah" "dapat" "mengatur" "sendiri" "rumah" "tangganya" "oleh" "karena" "itu" "daerah" "diberikan" "kewenangan" "untuk" "menggali" "potensi" "sumber" "penerimaan" "yang" "ada" "dimana" "salah" "satunya" "berasal" "dari" "sektor" "pajak" "daerah" "salah" "satu" "yang" "ingin" "dioptimalkan" "adalah" "pendapatan" "asli" "daerah" "pad" "dari" "jenis" "pajak" "reklame" "saat" "ini" "di" "kota" "depok" "masih" "banyak" "reklame" "yang" "tidak" "memiliki" "izin" "maupun" "tidak" "diperpanjang" "izinnya" "tentu" "hal" "ini" "dapat" "mengurangi" "pad" "kota" "depok" "oleh" "karena" "itu" "saat" "ini" "diperlukan" "sistem" "yang" "dapat" "memonitoring" "reklame" "di" "kota" "depok" "dalam" "sistem" "yang" "dirancang" "ini" "penulis" "melakukan" "pembahasan" "masalah" "dengan" "menggunakan" "metode" "pieces" "dan" "pengembangan" "sistem" "menggunakan" "air" "terjun" "yang" "diharapkan" "berbasis" "client" "server" "dengan" "arsitektur" "3tier" "harapan" "penulis" "proses" "monitoring" "reklame" "dengan" "menggunakan" "web" "yang" "menerapkan" "jaringan" "vps" "dapat" "mempermudah" "para" "pemohon" "maupun" "staf" "dalam" "menyelesaikan" "pekerjaannya"
TABLE V
FILTERING
Corpus Document
ini, rata, bagi, di, dan, adanya, masih, untuk, dari, kemudian, masing, satunya, paling, dalam, jadi, yaitu, oleh, punya
TABLE VI
FILTERING 2
Corpus Document
ini, oleh, itu, yang, pad, maupun, dan, para, maupun, di, dalam, saat, adalah, dalam
Stemming: The stemming stage transforms words into their basic form.
The main goal is to reduce the variation in a word's representation.
The stem results are shown in Table VII and Table VIII.
TABLE VII
STEMMING
Corpus Document
kembang ilmu tahu teknologi indonesia alami maju pesat maju rasa hidup masyarakat saat satu maju bidang teknologi informasi komunikasi ilmu tahu teknologi cara umum milik kait erat khusus bidang didik islam satu pesantren rupa lembaga didik basis agama islam kembang cara pribumi masyarakat website pondok pesantren asshiddiqiyah 2 tangerang banyak kurang sebut kendala website website masih sifat semi online perlu baharu fitur seperti daftar serta didik baru cara full online tuju teliti adalah untuk tahu ukur tingkat kapabilitas website pondok pesantren asshiddiqiyah 2 tangerang framework cobit khusus bidang pesantren aspek domain dss mea metode teliti metode kuantitatif sebar kuisioner kurang lebih 94 responden guna serta nilai capability level gap analysis framework cobit hasil teliti tunjuk uji validitas tingkat signifikansi 0202 uji reliabilitas nilai cronbachs alpha besar 0905 masing nyata valid reliabel rata hasil capability level subdomain dss01 dss03 dss05 mea01 mea02 besar 14 level level 1 performed process level target capai sisi it website pondok pesantren asshiddiqiyah 2 tangerang yaitu level 4 predictable process
TABLE VIII
STEMMING 2
Corpus Document
saat ini daerah dapat atur sendiri rumah tangga oleh karena itu daerah beri wenang untuk gali potensi sumber terima yang ada mana salah satu asal dari sektor pajak daerah salah satu yang ingin optimal adalah dapat asli daerah pad dari jenis pajak reklame saat ini di kota depok masih banyak reklame yang tidak milik izin maupun tidak panjang izin tentu hal ini dapat kurang pad kota depok oleh karena itu saat ini perlu sistem yang dapat memonitoring reklame di kota depok dalam sistem yang rancang ini tulis laku bahas masalah dengan guna metode pieces dan kembang sistem guna air terjun yang harap bas client server dengan arsitektur 3-tier harap tulis proses monitoring reklame dengan guna web yang terap jaring vps dapat mudah para mohon maupun staf dalam selesai kerja
Indexing and Weighting
The TF-IDF technique is used in text mining and natural language processing to evaluate how important a word is in a document relative to a set of documents.
Some of the results of the TF-IDF calculations are shown in Table IX.
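A minimal sketch of TF-IDF weighting is given below. The text does not say which TF-IDF variant was used, so this example assumes term frequency as the count divided by document length and inverse document frequency as log(N / df); the function name and the three tiny tokenized documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical stemmed and filtered documents.
documents = [
    ["sistem", "reklame", "kota", "depok"],
    ["sistem", "website", "pesantren"],
    ["reklame", "pajak", "daerah", "daerah"],
]
weights = tf_idf(documents)
print(weights[2]["daerah"])
```

Under this variant, a term that occurs in every document receives a weight of 0 because log(1) = 0, so only terms that distinguish documents contribute to the similarity score.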
TABLE IX
TF-IDF
Corpus  Inverse Document Frequency
Model
The model compares cosine similarity alone with cosine similarity optimized using Bidirectional Encoder Representations from Transformers (BERT).
Evaluation
The results of the two scenarios will be compared to find the optimal result, and the best model obtained will be used as the preferred alternative.
The corpus data will be tested with two schemes, namely detection using the cosine similarity algorithm alone and detection using cosine similarity along with BERT, using the test data.
The similarity scores between documents obtained in the two scenarios will then be compared.
Scenario 1: Cosine Similarity: The results of data testing are shown in Table X.
TABLE X
RESULT IN COSINE SIMILARITY
Corpus  Result
Corpus 1
Corpus 2
Corpus 3
Corpus 4
Corpus 5
Corpus 6
Corpus 7
Corpus 8
Corpus 9
Corpus 10
Scenario 2: Cosine Similarity with Bidirectional Encoder Representations from Transformers (BERT): Combining cosine similarity with BERT is optimized by the rich vocabulary embedded in the BERT model.
The working steps in BERT are as follows:
Step 1: Large amounts of training data. BERT is specially designed to work on larger word counts.
So, with the vocabulary of the built-in BERT model combined with the data corpus in the study, the model will have a larger and more varied vocabulary.
Step 2: Masked Language Model. The Masked Language Model (MLM) enables bidirectional learning from text.
Because word analysis is carried out in two directions, the word identification process becomes more profound, and the possibilities and approaches to word meaning can be captured more broadly.
After these 2 steps, fine-tuning is applied to the BERT model with the following steps:
Step 1: Get the dataset. The dataset used in this study is corpus data from abstracts obtained in the repository.
This data will be the reference data, combined with the reliable problem-solving or translation skills of BERT.
Step 2: Start exploring. This stage begins with data labeling, which is carried out by identifying and learning cosine similarity and BERT.
Step 3: Data monitoring. At this stage, BERT requests processing from the Central Processing Unit (CPU) so that it can execute the assigned command.
Step 4: Processing. This process repeats the pre-processing stage.
The model re-runs processes such as tokenizing to ensure that the returned results are more optimal.
Step 5: Design the final input pipeline. The training and testing data are fed into the BERT architecture pipeline to be processed to identify word and sentence similarity.
Step 6: BERT classification model. BERT will identify words and sentences to be represented in the form of numbers similar to the corpus data you already have.
Step 7: Updating and saving. At this stage, BERT stores the model's results for identifying and detecting word similarities.
An overview of the trial scenario using the additional BERT model is shown in Table XI.
TABLE XI
RESULT IN COSINE SIMILARITY AND BERT
Corpus  Result
Corpus 1
Corpus 2
Corpus 3
Corpus 4
Corpus 5
Corpus 6
Corpus 7
Corpus 8
Corpus 9
Corpus 10
Result Comparison
Based on the results of the data testing carried out with 2 schemes, namely cosine similarity and cosine similarity with BERT, the comparison can be visualized in Figure 2.
improving detection results and emphasizing sentence context for a broader range of complex words and sentences.
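The comparison between the two schemes can be sketched as an average gain over the test documents. The text does not state whether the reported average improvement is a difference in percentage points or a relative change, so this example assumes the former; the scores below are invented for illustration and are not the study's results, and the function name is hypothetical.

```python
from statistics import mean


def average_gain(baseline_scores, improved_scores):
    """Mean per-document difference (improved minus baseline), in percentage points."""
    if len(baseline_scores) != len(improved_scores):
        raise ValueError("both schemes must be scored on the same test documents")
    return mean(b - a for a, b in zip(baseline_scores, improved_scores))


# Hypothetical similarity scores in percent, one entry per test document.
cosine_scores = [40.0, 55.0, 30.0]
cosine_bert_scores = [60.0, 70.0, 55.0]

gain = average_gain(cosine_scores, cosine_bert_scores)
print(gain)
```

Pairing the scores document by document keeps the comparison fair, since each test abstract is evaluated under both schemes before the differences are averaged.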
IV.
CONCLUSION
The detailed results of the tests carried out are shown in Table XII.
The results showed that combining the Bidirectional Encoder Representations from Transformers (BERT) method with cosine similarity could increase the similarity detection value of students' thesis abstracts by 23.48% compared to the cosine similarity algorithm alone.
This can be one of the optimization techniques for improving the performance of the cosine similarity algorithm in detecting document similarity.
The Bidirectional Encoder Representations from Transformers (BERT) method has a more complex range of understanding words or sentences.
For future research, it is suggested that vocabulary and sentences be enriched to provide an overview of BERT's ability to detect similarities.
The fact that students must first submit the title and abstract of their final project so that it can be compared with prior documents saved in the repository offers promise for additional innovation in the prevention of plagiarism.
Fig. 2 Comparison Result
TABLE XII
DETAILED COMPARISON RESULT
Corpus  Cosine  Cosine Bert
Corpus 1
Corpus 2
Corpus 3
Corpus 4
Corpus 5
Corpus 6
Corpus 7
Corpus 8
Corpus 9
Corpus 10
ACKNOWLEDGMENT
We thank the Institute for Research and Community Service (LPPM) at Universitas Pembangunan Nasional "Veteran" Jakarta for funding assistance for this research under the RISCOP scheme, and every colleague who helped in this research.
REFERENCES