Prosiding KONSTELASI Vol. 2 No. Juni 2025

Comparison of Bisecting K-Means and K-Medoids Algorithms in Clustering Junior High School Students Based on the Results of the 2023 Bebras Challenge

Elvina Lorenza Phang¹, Paulina H. Prima Rosa²
Information Technology, Sanata Dharma University, Yogyakarta
E-mail: lorenzaelvina9@gmail.com

Abstract. Improving the quality of education, especially Computational Thinking (CT) skills, is an important priority in the digital era. The Bebras Challenge is an international competition that measures Computational Thinking. Its results show that the abilities of junior high school students are diverse, so appropriate mentoring methods are needed. This study compares the Bisecting K-Means and K-Medoids algorithms for grouping students based on the results of the 2023 Bebras Challenge, along with their Indonesian, Mathematics, and Science report scores and the time spent in the competition. The study was conducted through data preprocessing, application of the clustering algorithms, and model evaluation using the Silhouette Score for k = 2 to k = 10 clusters, as well as running time. The results reveal two main groups: a first group of students with high understanding, fast processing time, and effective preparation, and a second group of students with low understanding due to less than optimal preparation. Bisecting K-Means showed the best performance, with a Silhouette Score of 0.623 and an execution time of 0.006 seconds at k = 2. This study provides insights for educators and policy makers to design more effective data-based learning strategies.

Keywords: Computational Thinking; Bisecting K-Means; K-Medoids; Silhouette Score

Introduction

Improving the quality of education is an important priority in preparing the younger generation to face the challenges of the digital era. The 2022 assessment by the Programme for International Student Assessment (PISA) showed score declines in almost all countries. In Indonesia, the reading literacy score fell by 12 points, while the mathematics and scientific literacy scores each fell by 13 points. In 2025, PISA plans to add Computational Thinking (CT) to the list of skills it evaluates.

Computational Thinking (CT) is an analytical ability that includes problem decomposition, system design, and decision making based on computing principles. Indonesia has promoted CT through the Bebras Challenge, an international competition that assesses students' computational thinking. However, the results of this competition show a diversity of student abilities, especially at the junior high school (SMP) level, which calls for appropriate mentoring methods. Research using Bebras Challenge result data to determine the level of CT understanding of elementary school students was carried out by Herlina et al. using K-Means Clustering.

Various studies have compared clustering algorithms such as K-Means, Bisecting K-Means, Fuzzy C-Means, K-Medoids, and Hierarchical clustering. These studies show that Bisecting K-Means performs faster, while K-Medoids is more robust against outliers. However, a comparison of these two algorithms in the context of Bebras Challenge results is still rarely found.

This study aims to cluster the CT data of junior-high-school-level participants of the 2023 Bebras Challenge using the Bisecting K-Means and K-Medoids algorithms, and to compare the two algorithms in grouping junior high school students based on the 2023 Bebras Challenge results.
The data includes the Bebras Challenge results and questionnaires distributed to the 2023 Bebras Challenge participants. The comparison is evaluated using the Silhouette Score, to assess cluster quality, and execution time. The research results are expected to provide insight for teachers and education providers in designing learning strategies to improve student quality.

Research methods

The data used in this study are the results of the 2023 Bebras Challenge obtained from the Bebras Indonesia competition and questionnaires distributed by the Bebras Bureau of Sanata Dharma University to Bebras Challenge participants. Both data sets are Microsoft Excel documents. The data from Bebras Indonesia contains the results of the 2023 Bebras Challenge in the scouting category, namely the junior high school (SMP) level, totaling 7,793 records. It has 23 attributes containing the participant's name, gender, school of origin, bureau code, total score, time used, and score per question category. The questionnaire data totals 1,203 records with 13 attributes, containing school level, school name, student name, frequency of participation in the Bebras Challenge, method of preparation, total time spent studying, Indonesian, Science, and Mathematics report card scores, impressions, interests, and suggestions.

Both data sets are preprocessed through several Data Mining stages, namely data cleaning, data integration, data transformation using the Robust Scaler, and data selection based on correlation analysis. The Data Mining process is then carried out by applying the Bisecting K-Means and K-Medoids clustering methods, and the results are evaluated with the Silhouette Score and running time.

Figure 1. Research Stages

1 Data Mining

Data Mining is a technique for obtaining information from data sets and is part of Knowledge Discovery in Databases (KDD). KDD includes stages of data analysis to find and identify patterns, as shown in Figure 2.

Figure 2. Knowledge Discovery in Databases (KDD)

- Data Cleaning: cleaning data by removing noise and duplicate data, or filling in missing values appropriately.
- Data Integration: merging data from various sources by matching identities such as name or ID.
- Data Selection: selecting data relevant to the study by analyzing attributes with methods such as correlation analysis.
- Data Transformation: converting data into a format suitable for analysis, for example using normalization or scaling.
- Data Mining: searching for patterns or information using methods such as K-Means, Bisecting K-Means, K-Medoids, and others.
- Pattern Evaluation: evaluating the patterns found using metrics such as the Silhouette Score, SSE, or the Davies-Bouldin Index.
- Knowledge Presentation: presenting the results and patterns found to users in the form of visualizations, including clustering results.

2 Robust Scaler

At the transformation stage of KDD, this study uses normalization with the Robust Scaler to overcome differences in value ranges between attributes. This technique is designed to reduce the influence of outliers by utilizing the median and interquartile range (IQR) in its calculation. The Robust Scaler formula is presented in the following equation.

$$\text{Scaled}_X = \frac{x - \text{median}}{IQR}$$

where:
- $\text{Scaled}_X$ = new value resulting from the Robust Scaler
- $x$ = the data value being normalized
- $\text{median}$ = the median value of the data
- $IQR$ = the range between quartile 3 (Q3) and quartile 1 (Q1)
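As an illustration of this step, the following minimal sketch, assuming Python with NumPy and scikit-learn (the sample values are invented), computes the formula by hand and checks it against scikit-learn's RobustScaler:

```python
# A minimal sketch of Robust Scaler normalization, assuming Python with
# NumPy and scikit-learn; the attribute values below are illustrative.
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical attribute values, e.g. total scores with one outlier.
X = np.array([[35.0], [40.0], [42.0], [45.0], [120.0]])

# Manual computation following Scaled_X = (x - median) / IQR.
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

# RobustScaler centers on the median and scales by the IQR by default.
scaled = RobustScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # True: both follow the same formula
```

Because the median and IQR are insensitive to the extreme value 120.0, the scaled values are not dragged toward the outlier, which is the property the paper relies on.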
3 Correlation Analysis

Correlation analysis is a method for determining the direction, strength, and significance of relationships between variables in a dataset. One technique is the Pearson correlation coefficient, which measures the linear relationship between two variables. The Pearson correlation formula is presented in the following equation.

$$r = \frac{n \sum x_i x_j - (\sum x_i)(\sum x_j)}{\sqrt{n \sum x_i^2 - (\sum x_i)^2}\,\sqrt{n \sum x_j^2 - (\sum x_j)^2}}$$

where:
- $r$ = Pearson correlation coefficient between two variables i and j
- $n$ = amount of data
- $\sum x_i x_j$ = total of the products of the values of variables i and j
- $\sum x_i$ = sum of the values of variable i
- $\sum x_j$ = sum of the values of variable j

Commonly used guidelines for assessing and understanding the value of the correlation coefficient are described in Table 1 below.

Table 1. Interpretation of Correlation Analysis Values

Correlation Value | Interpretation
0.80 – 1.00 | Very High Correlation
0.60 – 0.79 | High Correlation
0.40 – 0.59 | Moderate Correlation
0.20 – 0.39 | Weak Correlation
0.00 – 0.19 | Very Weak Correlation or None

4 Clustering

Clustering is the process of grouping data based on similarity, where data in one cluster are similar to each other but different from data in other clusters. The clustering algorithms used in this study are Bisecting K-Means and K-Medoids.

Bisecting K-Means

Bisecting K-Means is a development of K-Means that incorporates the hierarchical principle. The algorithm begins by grouping all data in one cluster, then divides that cluster into two using K-Means until the desired number of clusters is reached. The steps are:

1. Determine the number of clusters k.
2. Use k = 2 to divide a cluster, initializing two cluster centroids first.
3. Calculate the similarity of each data object in the cluster to both centroids using the Euclidean distance equation, then assign each object to the nearest centroid. The Euclidean distance equation is as follows.

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$$

where:
- $d(x_i, x_j)$ is the distance from data i to the center of cluster j
- $x_{ik}$ is the value of data i in the kth variable
- $x_{jk}$ is the value of center j in the kth variable
- $n$ is the number of observed variables

4. Recalculate each cluster center as the cluster average using the equation below, then recalculate the similarity of each data object in the cluster to the two new centroids using the Euclidean distance equation. Steps 2-4 are repeated until objects no longer move between clusters.

$$C = \frac{\sum x_i}{n_{data}}$$

where:
- $C$ : the new centroid (center point) in the next iteration
- $n_{data}$ : the amount of data in a cluster

5. Calculate the distance between the data and the clusters to determine which clusters are still spread out, using the Sum of Squared Errors (SSE). The SSE formula is as follows.

$$SSE = \sum_{k=1}^{n_j} (x_{jk} - c_j)^2$$

where:
- $n_j$ : the number of points in cluster j
- $x_{jk}$ : the kth data point in cluster j
- $c_j$ : the centroid of cluster j

6. Select the cluster with the largest amount of data or the largest SSE, then divide it into two clusters using K-Means clustering.
7. Repeat steps 2-6 until as many clusters are formed as the value of k set at the beginning.
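To make the splitting loop concrete, here is a minimal sketch in Python, assuming NumPy and scikit-learn; the function name bisecting_kmeans and the random placeholder data are illustrative, not taken from the paper.

```python
# Minimal Bisecting K-Means sketch following the steps above, assuming
# NumPy and scikit-learn; names and placeholder data are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def sse(points, centroid):
    """Sum of Squared Errors of a cluster around its centroid."""
    return float(((points - centroid) ** 2).sum())

def bisecting_kmeans(X, k, random_state=0):
    clusters = [X]  # start with all data in a single cluster
    while len(clusters) < k:
        # Step 6: pick the cluster with the largest SSE to split next.
        sses = [sse(c, c.mean(axis=0)) for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        # Steps 2-4: split it into two with ordinary K-Means (k = 2).
        km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(target)
        clusters.append(target[km.labels_ == 0])
        clusters.append(target[km.labels_ == 1])
    return clusters

# Random data standing in for the preprocessed 24-attribute dataset.
X = np.random.default_rng(0).normal(size=(608, 24))
parts = bisecting_kmeans(X, k=2)
print([len(p) for p in parts])
```

Recent scikit-learn releases (1.1 and later) also ship sklearn.cluster.BisectingKMeans, whose default strategy of splitting the cluster with the largest inertia corresponds to the SSE criterion in step 6.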
K-Medoids

K-Medoids, or Partitioning Around Medoids (PAM), is a clustering technique that divides data into several clusters using a medoid as the cluster center rather than the mean. A medoid is an object that represents the cluster center based on the median. The steps are:

1. Determine the number of cluster centers k (the number of clusters).
2. Select k initial medoids randomly from the data.
3. With the Euclidean distance equation, calculate the distance of each object to the nearest medoid, then calculate the total of all distances.
4. Determine new medoids by evaluating candidate medoids that provide the minimum distance within each cluster.
5. Recalculate the distance of each object in each cluster to the new candidate medoid using the Euclidean distance equation, then calculate the total of all distances using the new medoid (the new total distance).
6. Calculate the new total distance minus the old total distance to find the total deviation S. If S < 0, the new medoid is accepted and the set of k medoids is updated. The equation for the total deviation is as follows.

$$S = b - a$$

where:
- $a$ : total shortest distance between the objects and the initial medoids
- $b$ : total shortest distance between the objects and the new medoids

7. Repeat steps 4-6 until the medoids no longer change, producing clusters with their respective members.

5 Silhouette Score

The Silhouette Score is a clustering evaluation method that measures the quality and strength of clusters. It computes the average distance between objects within one cluster and the distance from those objects to other clusters. Values approaching -1 indicate bad grouping, values approaching 1 indicate good grouping, and values approaching 0 indicate overlapping clusters. The steps for calculating the Silhouette Score are as follows:

1. Find the intra-cluster distance a(i), the average distance between the i-th data point and the other data points in the same cluster, using the equation

$$a(i) = \frac{1}{n - 1} \sum_{j \in C,\, j \neq i} d(i, j)$$

where:
- $a(i)$ = mean intra-cluster distance
- $n$ = the total number of data points in the same cluster
- $d(i, j)$ = the Euclidean distance between data i and data j

2. Determine the inter-cluster distance b(i), the average distance of the i-th data point to the data points in the nearest different cluster, using the equation

$$b(i) = \min_{C' \neq C} \frac{1}{n_{C'}} \sum_{j \in C'} d(i, j)$$

where:
- $b(i)$ = mean inter-cluster distance
- $n_{C'}$ = the total number of data points in the different cluster closest to data i
- $d(i, j)$ = the Euclidean distance between data i and data j

3. Calculate the silhouette index from the intra-cluster and inter-cluster distances using

$$SI(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

where $SI(i)$ is the Silhouette Score value of the i-th data point.

4. Obtain the Silhouette Score of all clusters with the equation

$$S = \frac{1}{k} \sum_{i=1}^{k} SI(i)$$

where:
- $S$ is the average Silhouette Score of all data points
- $SI(i)$ is the Silhouette Score value of the i-th data point
- $k$ is the total number of points in the dataset

The range of cluster strength values based on the Silhouette Score evaluation can be seen in Table 2.

Table 2. Silhouette Score Evaluation Value Ranges

Range | Interpretation
0.71 – 1.00 | The clusters formed produce a strong structure.
0.51 – 0.70 | The clusters formed produce a good structure.
0.26 – 0.50 | The clusters formed produce a weak structure.
≤ 0.25 | The clusters formed are unstructured.
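As a worked example of these four steps, the sketch below, assuming NumPy and scikit-learn and using invented toy data, computes a(i), b(i), and SI(i) by hand and compares the average against sklearn.metrics.silhouette_score:

```python
# Sketch of the Silhouette Score computation described above, assuming
# NumPy and scikit-learn; the toy data and labels are invented.
import numpy as np
from sklearn.metrics import silhouette_score, pairwise_distances

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # cluster 0
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])  # cluster 1
labels = np.array([0, 0, 0, 1, 1, 1])

D = pairwise_distances(X)  # Euclidean distances d(i, j)
scores = []
for i in range(len(X)):
    same = (labels == labels[i])
    same[i] = False                            # exclude the point itself
    a = D[i, same].mean()                      # intra-cluster distance a(i)
    b = min(D[i, labels == c].mean()           # nearest other cluster b(i)
            for c in set(labels) if c != labels[i])
    scores.append((b - a) / max(a, b))         # silhouette index SI(i)

print(np.mean(scores))               # manual S over all points
print(silhouette_score(X, labels))   # should match the manual result
```

In practice, an evaluation like the paper's can rely on silhouette_score directly; the manual loop is only there to mirror the equations.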
Results and Analysis

In this section, the research results are explained and a comprehensive discussion is provided. The results are presented as images, graphs, tables, and other aids that make them easier for readers to understand. The discussion is divided into several subchapters. This study used data from the 2023 Bebras Challenge (scouting category) and questionnaire data from the Bebras Bureau of Sanata Dharma University.

1 Preprocessing Stage

- Data cleaning: removing duplicate data, converting values in the "Time taken" attribute to seconds, changing hyphens ("-") to 0, and removing characters, dots, and numbers in the attributes containing student names.
- Data integration: combining both datasets based on the attributes containing student names, resulting in a dataset with 36 attributes.
- Data transformation: changing attribute names, initializing the student name attribute, converting a number of categorical attributes to numeric, and performing Robust Scaler normalization.

Figure 3. Example of Robust Scaler Normalization Results

- Data selection: performing correlation analysis and reducing the data from 36 attributes to 24.

Figure 4. Correlation Analysis

The correlation analysis results show that the correlations between attributes are dominated by weak to moderate values, meaning the attributes do not have strong relationships with one another. This helps the clustering process, because the attributes are more diverse and not strongly tied to each other.

2 Clustering Testing

The best number of clusters k is selected based on the Silhouette Score and running time of each algorithm. Trials were conducted using k = 2 to 10. The first test used Bisecting K-Means clustering. Figure 5 shows the Silhouette Score (y-axis) for every k value (x-axis), while Figure 6 displays the running time (y-axis) for every k value (x-axis).

Figure 5. Comparison Graph of Bisecting K-Means Silhouette Scores
Figure 6. Comparison Graph of Bisecting K-Means Running Times
Table 3. Silhouette Score and running time test results for Bisecting K-Means (columns: Silhouette Score, Running Time)

The second test used K-Medoids clustering. Figure 7 shows the Silhouette Score (y-axis) for every k value (x-axis), while Figure 8 displays the running time (y-axis) for every k value (x-axis).

Figure 7. Comparison Graph of K-Medoids Silhouette Scores
Figure 8. Comparison Graph of K-Medoids Running Times
Table 4. Silhouette Score and running time test results for K-Medoids (columns: Silhouette Score, Running Time)

Based on the figures and tables above, the tests show k = 2 as the best result, with a Silhouette Score of 0.623 and a running time of 0.006 seconds for Bisecting K-Means and 0.041 seconds for K-Medoids. Both algorithms show that increasing the number of clusters k decreases the Silhouette Score, while the running time increases along with the k value. A sketch of this testing loop is given below.
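The following minimal sketch of the k = 2 to 10 evaluation loop assumes scikit-learn 1.1+ for BisectingKMeans and the separate scikit-learn-extra package for KMedoids, and substitutes random placeholder data of the paper's shape (608 participants, 24 attributes); actual scores and timings depend on the real data and hardware.

```python
# Sketch of the k = 2..10 evaluation loop, assuming scikit-learn >= 1.1
# (BisectingKMeans) and scikit-learn-extra (KMedoids); X stands in for
# the preprocessed 24-attribute dataset via random placeholder data.
import time
import numpy as np
from sklearn.cluster import BisectingKMeans
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

X = np.random.default_rng(0).normal(size=(608, 24))  # placeholder data

algorithms = [
    ("Bisecting K-Means", lambda k: BisectingKMeans(n_clusters=k, random_state=0)),
    ("K-Medoids", lambda k: KMedoids(n_clusters=k, random_state=0)),
]

for name, make in algorithms:
    for k in range(2, 11):
        start = time.perf_counter()
        labels = make(k).fit_predict(X)          # cluster with this k
        elapsed = time.perf_counter() - start
        print(f"{name} k={k}: silhouette={silhouette_score(X, labels):.3f}, "
              f"time={elapsed:.3f}s")
```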
1 Cluster Results Using Bisecting K-Means

Based on the results of the cluster analysis produced by Bisecting K-Means clustering, cluster 0 consists of 70 participants with higher total scores and faster processing times (an average of 30 minutes) compared to cluster 1, which consists of 538 participants (an average of 33 minutes). Although cluster 1 has a slightly higher average score for some questions, cluster 0 tends to be filled by participants with higher understanding and faster processing times.

Profiling of several categorical attributes shows that the majority of participants in both clusters have only participated in the Bebras Challenge once. Cluster 0 is dominated by participants who are more experienced and rely more on teacher guidance, while cluster 1 uses more flexible preparation methods, such as self-study. Participants in cluster 0 also tend to allocate more preparation time and have higher report card scores in the 81-90 and >90 categories, especially in Science and Mathematics. Cluster 1 is more dominant in the 71-80 and 81-90 Science score ranges, while cluster 0 excels in the >90 range. Overall, cluster 0 shows participants with better understanding and preparation.

Figure 9. Bisecting K-Means Clustering Results

2 Cluster Results Using K-Medoids

Based on the results of the cluster analysis produced by K-Medoids clustering, cluster 0 consists of 538 participants with lower total scores and longer completion times compared to cluster 1, which contains 70 participants. Although cluster 1 has a higher average score and faster completion times, cluster 0 includes participants with varying experience, including those who have participated in the Bebras Challenge more than 3 times. Cluster 1 is dominated by participants who study more often with teachers, while cluster 0 is more flexible, with various preparation methods.

Profiling of several categorical attributes shows that cluster 0 has more participants who prepare for 1-3 hours, while cluster 1 has more who prepare for more than 5 hours. Based on report card scores, cluster 1 excels with a higher percentage in the 81-90 and >90 categories, while cluster 0 sits mostly in the 81-90 range. For Science, cluster 1 excels in the high score category (>90), while cluster 0 dominates the 81-90 range. In Mathematics, cluster 1 participants have more scores above 90 and slightly more scores below 70. Overall, cluster 1 shows participants with higher scores and longer preparation.

Figure 10. K-Medoids Clustering Results

Conclusion

Based on the research that has been done, it can be concluded that clustering the 2023 Bebras Challenge Computational Thinking data using the Bisecting K-Means and K-Medoids algorithms produces two clusters with different participant characteristics. The first group shows higher understanding, with faster processing times, higher report card scores, and more effective preparation with teachers, while the second group has lower understanding as a result of less effective preparation and less study time. The comparison of the algorithms based on Silhouette Score and running time shows that Bisecting K-Means performs better at clustering because it has a faster execution time, namely 0.006 seconds, compared to K-Medoids, which requires 0.041 seconds. Both algorithms produce the same Silhouette Score evaluation value of 0.623 at k = 2.

As a suggestion for future research, it is recommended to try other clustering algorithms, such as DBSCAN or Agglomerative Clustering, for more optimal results. Using additional evaluation methods such as the Davies-Bouldin Index and the Dunn Index can provide more insight into cluster quality. Distance measures such as Manhattan distance can also be considered for comparison with Euclidean. In addition, this dataset can be used for classification to identify factors that influence participant success in the Bebras Challenge.

References