Bulletin of Information Technology (BIT)
Vol 6, No 4, December 2025, pp. 361-366
ISSN 2722-0524 (media online)
DOI 10.47065/bit.
https://journal. org/index.php/BIT

Clustering Academic Data of Junior High School Student to Identify Learning Groups Using the DBSCAN Algorithm at SMP Muhammadiyah 5 Samarinda

Mini H1*, Siti Lailiyah2, Salmon3
1 Informatics Engineering, STMIK Widya Cipta Dharma, Samarinda, Indonesia
2 Informatics Engineering, STMIK Widya Cipta Dharma, Samarinda, Indonesia
3 Information Systems, STMIK Widya Cipta Dharma, Samarinda, Indonesia
Email: 1*2243074@wicida. id, 2lail. 59a@gmail.com, 3salmon@wicida.
(*: corresponding author: 2243074@wicida.
Abstract- This study aims to identify learning groups based on student academic data at SMP Muhammadiyah 5 Samarinda.
The data used includes exact and non-exact subject scores, exam results, assignment scores, attendance, and parents' educational backgrounds.
The research stages comprise data collection; data preprocessing through cleaning, feature engineering, and transformation; data processing to determine the optimal values of the DBSCAN parameters, namely eps and MinPts (min_samples); and evaluation of the results using the Silhouette Score.
The optimal parameters obtained were eps = 1.3 and min_samples = 3.
The analysis produced three main clusters, namely cluster 0 with 89 students (medium achievement), cluster 1 with 50 students (high achievement), and cluster 2 with 5 students (low achievement), as well as 14 students identified as noise.
A Silhouette Score of 0.217 indicates relatively weak cluster separation, but DBSCAN is able to detect noise that may not be detected by other algorithms.
These findings indicate that even though the quality of the clusters is not yet optimal, the algorithm used is still useful for exploring student learning patterns and can serve as the basis for more targeted learning interventions.
Keywords: Clustering; DBSCAN; Academic Data; Study Group; Silhouette Score
INTRODUCTION
Lower secondary education is a strategic stage in forming the foundation of knowledge and skills, which plays an important role in determining students' readiness to continue to the next level of education.
Students face various academic challenges, which require appropriate learning methods so that their learning potential can develop optimally.
One method commonly applied in schools is the formation of study groups.
This strategy is considered capable of improving students' understanding of the subject matter, motivating them to study harder, and fostering essential social skills in interactions between students.
In addition, group tutoring has also been proven to contribute to the development of good study habits, which are one of the main determinants of student academic success.
The formation of study groups in schools has not been implemented in a structured manner and is not supported by classification procedures based on students' academic abilities or learning styles.
To overcome these obstacles, schools can take advantage of technology and data analysis, which are currently developing rapidly in the world of education.
In educational data analysis, the application of machine learning has significant potential.
One promising approach is the use of data mining to group students based on their similar characteristics.
Data mining is the process of analyzing large amounts of data to discover hidden patterns and important information that can support systematic and objective decision-making.
In the context of education, data mining enables educators to objectively identify student learning patterns, academic potential, and individual learning needs.
One popular method in data mining is clustering, which aims to group data based on certain similarities between attributes.
This technique is highly relevant in the world of education, as it can help form homogeneous learning groups, identify high-achieving students, and design personalized learning strategies.
The most commonly used clustering algorithm is K-Means, which is known for segmenting data quickly and efficiently.
Its effectiveness has been demonstrated in various studies, including grouping high school/MA students' national exam results, analyzing student academic data, forming study groups based on student performance at the junior high school level, and analyzing student learning achievements in religious-based educational institutions such as madrasas.
Other studies show that this method can help teachers determine learning strategies based on student grade groups and can be used to group student academic achievement indices.
Furthermore, at the university level, this algorithm has been implemented to support data-driven curriculum development.
However, the K-Means algorithm has limitations, especially in handling data that is not evenly distributed or contains outliers.
In the field of education, this condition is common in primary education, where student data is heterogeneous and has high variation in characteristics.
To address this issue, the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is used as an alternative that can form clusters based on data density and effectively detect anomalies or outliers.
DBSCAN has the advantage of not requiring a predetermined number of clusters, which is often a weakness of K-Means.
In practice, this algorithm has been used to analyze patterns of student visits to libraries.
Evaluating the quality of clustering results is also an important aspect that cannot be ignored.
Several methods, such as the Davies-Bouldin Index, Elbow Method, and Silhouette Coefficient, are commonly used to assess the effectiveness of clustering models in the context of student academic classification.
In addition, evaluation methods such as the Silhouette Score and Davies-Bouldin Index have been proven effective for measuring the quality of clustering results in both K-Means and DBSCAN algorithms.

Copyright © 2025 Author. Jurnal BIT is licensed under a Creative Commons Attribution 4.0 International License.
Based on the literature review, most previous studies still rely mainly on the K-Means algorithm. However, student data at the junior high school level tend to be more complex and heterogeneous.
In addition, non-academic variables such as attendance rates and parental educational backgrounds are rarely considered as components of analysis.
In fact, several studies explain that students' academic success is not only influenced by subject grades, but also by external factors.
This study aims to apply the DBSCAN algorithm in clustering the academic data of students at SMP Muhammadiyah 5 Samarinda.
By utilizing academic data, student attendance, and parental educational background, this study is expected to produce learning groups that help teachers and schools design more effective and targeted learning strategies.
RESEARCH METHODOLOGY
1 Research Stages
This research was systematically designed to produce academic data groupings that could represent learning patterns.
In the initial stage, data was collected from the academic scores and attendance percentages of students at SMP Muhammadiyah 5 Samarinda, with a total of 158 entries.
The collected data still contained possible duplicates and extreme values, so data cleaning was performed by removing duplicates and detecting outliers based on z-scores.
After cleaning, 157 more representative data entries were obtained.
The research flow is visualized in Figure 1.

Figure 1. Research Stages

Research Stages:
1. Data Collection
   The dataset used contains the academic scores of students at Muhammadiyah 5 Samarinda Junior High School (exact, non-exact, exams, assignments, attendance, parental education). The exact sciences category includes mathematics and natural sciences, while the non-exact category consists of Indonesian language, English language, social sciences, cultural arts, physical education, sports and health, and civics. This grouping provides a more structured picture of students' academic ability trends. The dataset comprises 158 records.
2. Data Preprocessing
   a. Cleaning: duplicates removed; extreme values discarded using the z-score > 3 approach.
   b. Feature Engineering: created new features such as Academic_Score2 (combination of exact scores, non-exact scores, exams, assignments), Consistency_Score (difference between exams and assignments), Attendance_Score (normalized to a 0-based scale), Parent Score (parents' education), and Achievement_Index as the final combined score. In addition, achievement categories (Low, Medium, High) are assigned based on quantiles.
   c. Transformation: all features are standardized using StandardScaler to ensure a uniform value range.
3. Data Processing
   The eps and min_samples parameters are determined using a grid search approach and a k-distance plot to find the optimal values. In the experiment, the combination of eps=1.3 and min_samples=3 produced three clusters with minimal noise.
4. Evaluation of Results
   The results are evaluated using the Silhouette Score to assess cluster compactness and separation. The silhouette value obtained was 0.217, indicating that the clusters were still in the weak-to-moderate category. Nevertheless, these results are still useful for exploring student learning patterns.
5. Cluster Results
   Cluster 0: the majority group with moderate achievement indices.
   Cluster 1: higher achievement group.
   Cluster 2: a small group with specific tendencies.
   Noise: students who are not included in any cluster.
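The preprocessing stage listed above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' notebook: the column names (`exam`, `assignment`, `attendance`), the helper `preprocess`, and the simple row mean used as a stand-in for the achievement index are all assumptions, because the paper does not give the exact feature formulas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df: pd.DataFrame, z_threshold: float = 3.0):
    """Drop duplicates, remove rows with any |z-score| above the threshold, then standardize."""
    cleaned = df.drop_duplicates().reset_index(drop=True)
    # z-score of every value relative to its own column (population std, ddof=0)
    z = (cleaned - cleaned.mean()) / cleaned.std(ddof=0)
    cleaned = cleaned[(z.abs() <= z_threshold).all(axis=1)].reset_index(drop=True)
    # Standardize so that every feature has mean 0 and unit variance
    scaled = StandardScaler().fit_transform(cleaned)
    return cleaned, scaled


# Deterministic synthetic stand-in data (hypothetical columns, not the school's records)
raw = pd.DataFrame({
    "exam": np.linspace(60, 90, 50),
    "assignment": np.linspace(70, 95, 50)[::-1],
    "attendance": 90 + 5 * np.sin(np.arange(50)),
})
raw.loc[50] = [75.0, 80.0, 10.0]   # one extreme attendance value
raw.loc[51] = raw.loc[0]           # one exact duplicate of the first row

cleaned, X = preprocess(raw)

# Quantile-based categories on a stand-in index (simple mean of the three columns)
stand_in_index = cleaned.mean(axis=1)
category = pd.qcut(stand_in_index, q=3, labels=["Low", "Medium", "High"])
```

In this sketch the duplicate and the extreme attendance row are removed before scaling, so the outlier does not distort the mean and variance that StandardScaler estimates.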
2 DBSCAN Algorithm
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is used to cluster data based on point density. Two important parameters are epsilon (ε) and MinPts. The value ε indicates the maximum distance between points for them to be considered neighbors, while MinPts indicates the minimum number of points required to form a cluster. The distance between two points p and q is calculated using the Euclidean distance in equation (1):

\[ dist(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{1} \]

Where:
$dist(p, q)$ : Euclidean distance between data point $p$ and data point $q$
$p_i$ : Value of the i-th feature at data point $p$
$q_i$ : Value of the i-th feature at data point $q$
$n$ : Number of features (data dimensions)

This equation calculates the degree of closeness between two points based on the difference in the values of each feature. The smaller the distance, the closer the two points are in the data space. After the distance is calculated, the epsilon-neighborhood, that is, the set of points within radius ε of the center point p, is defined in equation (2):

\[ N_{\varepsilon}(p) = \{\, q \in D \mid dist(p, q) \le \varepsilon \,\} \tag{2} \]

Where:
$N_{\varepsilon}(p)$ : The set of points that are neighbors of point $p$
$q \in D$ : Point $q$, which is part of the entire dataset $D$
$dist(p, q)$ : The distance between data point $p$ and data point $q$
$\varepsilon$ : Maximum neighborhood radius

This set is used to test the density of points around point p. If the number of neighbors meets the minimum value MinPts, point p is categorized as a core point in cluster formation. A point whose number of neighbors does not meet the minimum requirement is set as a border point if it lies within the ε-neighborhood of a core point, or as noise otherwise.
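The definitions in this subsection can be illustrated with a small NumPy sketch. The function names (`euclidean`, `eps_neighborhood`, `classify_points`) and the five toy points are hypothetical, and the sketch assumes that a point counts as a member of its own neighborhood, which is the convention scikit-learn uses for min_samples. It only labels points as core, border, or noise and does not assemble the clusters themselves.

```python
import numpy as np


def euclidean(p: np.ndarray, q: np.ndarray) -> float:
    """dist(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return float(np.sqrt(np.sum((p - q) ** 2)))


def eps_neighborhood(X: np.ndarray, p_idx: int, eps: float) -> np.ndarray:
    """Indices of all points q in X with dist(p, q) <= eps (p itself included)."""
    dists = np.sqrt(np.sum((X - X[p_idx]) ** 2, axis=1))
    return np.flatnonzero(dists <= eps)


def classify_points(X: np.ndarray, eps: float, min_pts: int) -> list:
    """Label every point as 'core', 'border', or 'noise'."""
    neighborhoods = [eps_neighborhood(X, i, eps) for i in range(len(X))]
    is_core = np.array([len(n) >= min_pts for n in neighborhoods])
    labels = []
    for i, neigh in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif is_core[neigh].any():
            labels.append("border")  # not dense itself, but within eps of a core point
        else:
            labels.append("noise")
    return labels


# Toy data: four nearby points and one isolated point
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
labels = classify_points(X, eps=1.0, min_pts=3)
# labels == ['core', 'border', 'core', 'border', 'noise']
```

With eps = 1.0 and min_pts = 3, the points (0, 0) and (1, 0) each have three points in their neighborhood and become core points, their less dense neighbors become border points, and the isolated point is noise.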
3 Application of Methods
This research method was applied in the following stages:
1. Implementation was carried out in the Python Jupyter Notebook environment, utilizing the pandas, scikit-learn, and matplotlib libraries.
2. The notebook Clustering_DBSCAN_Siswa.ipynb contains the entire pipeline from preprocessing to evaluation.
3. The modified dataset (data_final_mod_clustered_mid) is used as input for DBSCAN after undergoing cleaning, feature engineering, and standardization.
4. The final results are visualized in the form of a 2D PCA scatter plot, which shows the distribution of clusters and the students classified as noise.
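A minimal sketch of the clustering and projection steps with scikit-learn is given below. It is not the authors' notebook: the data are synthetic blobs standing in for the standardized student features, and the five-feature shape is an assumption. Only the parameter values eps = 1.3 and min_samples = 3 come from the paper, and they were tuned for the authors' data, so the cluster counts obtained here are not expected to match.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized student features (five hypothetical features)
X_raw, _ = make_blobs(n_samples=150, centers=3, n_features=5,
                      cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X_raw)

# Parameter values reported in the paper
model = DBSCAN(eps=1.3, min_samples=3)
labels = model.fit_predict(X)  # label -1 marks noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

# Project to two principal components for a 2D scatter plot
coords = PCA(n_components=2).fit_transform(X)
```

The array `coords` can then be passed to a plotting call such as matplotlib's `plt.scatter(coords[:, 0], coords[:, 1], c=labels)` to obtain a figure of the kind described above.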
RESULTS AND DISCUSSION
This section presents the results of applying the DBSCAN algorithm to student academic data, including cluster distribution, grouping quality evaluation, and interpretation of each cluster's characteristics.
1 DBSCAN Clustering Results
The DBSCAN algorithm produced three main clusters and a number of noise data points. The distribution of members in each cluster is shown in Table 1.

Table 1. Distribution of DBSCAN Cluster Results
Cluster   Number of Members   General Description
0         89 students         Majority, average achievement index, stable attendance
1         50 students         High achievement, consistent good grades
2         5 students          Small group, low/typical
Noise     14 students

A total of 144 students (91.1%) were successfully grouped into three clusters, while 14 students (8.9%) were categorized as noise.
This shows that DBSCAN is capable of identifying most patterns in the dataset, although there is still some data that does not meet the cluster density criteria.
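The shares of clustered and noise students follow directly from the member counts in Table 1. The short check below recomputes them; the variable names are illustrative only.

```python
# Member counts as reported in Table 1
cluster_sizes = {"Cluster 0": 89, "Cluster 1": 50, "Cluster 2": 5, "Noise": 14}

total = sum(cluster_sizes.values())          # 158 students in the table
clustered = total - cluster_sizes["Noise"]   # 144 students assigned to a cluster

clustered_pct = round(100 * clustered / total, 1)               # 91.1
noise_pct = round(100 * cluster_sizes["Noise"] / total, 1)      # 8.9
```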
2 Cluster Quality Evaluation
Cluster quality was measured using the Silhouette Score, which yielded a value of 0.217.
This value indicates that the separation between clusters is still relatively weak, because some of the data is located at a similar distance between one cluster and another. Nevertheless, this value still indicates the existence of group structures, so the clustering results can still be used as a basis for exploring student learning patterns.
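The Silhouette-based evaluation, combined with a grid search over eps and min_samples as described in the methodology, might look like the sketch below. Several choices here are assumptions rather than the authors' procedure: the data are synthetic, the parameter grid is arbitrary, the helper `dbscan_silhouette` is hypothetical, and noise points are excluded before scoring (the paper does not state how noise was handled when computing the score).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def dbscan_silhouette(X, eps, min_samples):
    """Run DBSCAN and score the non-noise points; the score is None with fewer than two clusters."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    mask = labels != -1
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2:
        return None, labels
    return silhouette_score(X[mask], labels[mask]), labels


# Synthetic stand-in for the standardized student features
X_raw, _ = make_blobs(n_samples=150, centers=3, n_features=5, random_state=42)
X = StandardScaler().fit_transform(X_raw)

results = []
for eps in np.arange(0.3, 2.01, 0.2):
    for min_samples in (3, 4, 5):
        score, labels = dbscan_silhouette(X, eps, min_samples)
        if score is not None:
            n_noise = int(np.sum(labels == -1))
            results.append((round(float(eps), 1), min_samples, float(score), n_noise))

# Best combination by Silhouette Score (the share of noise is ignored in this sketch)
best = max(results, key=lambda r: r[2]) if results else None
```

In practice the noise count in each result tuple is worth inspecting as well, since a high score obtained by discarding many points as noise is not necessarily a better grouping.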
The visualization of the clustering results is shown in Figure 2 using 2D PCA dimension reduction.
This image shows the distribution of three main clusters and the position of data categorized as noise.
Next, Figure 3 shows the K-distance plot used to determine the optimal eps parameter.
Figure 2. Visualization of DBSCAN Clustering Results with 2D PCA

The figure illustrates the results of data clustering using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm with an epsilon (ε) value of 1.3 and a min_samples parameter of 3.
Prior to clustering, the data were reduced in dimensionality using Principal Component Analysis (PCA) to enable two-dimensional visualization.
The results indicate that DBSCAN successfully identified three main clusters with varying density levels, as well as several data points classified as noise or outliers due to insufficient local density.
These findings demonstrate that DBSCAN is effective for density-based clustering without requiring a predefined number of clusters and is well suited for datasets with irregular distributions and the presence of outliers.
Figure 3. K-distance Plot

The figure presents a K-distance plot with k=2, which is commonly used to determine an appropriate epsilon (ε) value for the DBSCAN algorithm.
The plot shows the distances to the second nearest neighbor for each data point, sorted in ascending order.
A gradual increase in distance is observed for most points, followed by a sharp rise toward the end of the curve, forming an elbow-like shape.
This point of rapid increase indicates a transition from dense regions to sparse regions and is typically selected as the optimal eps value.
Therefore, the K-distance plot serves as an effective tool for identifying a suitable neighborhood radius that enables DBSCAN to distinguish clusters from noise accurately.
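The values behind a K-distance plot can be computed as in the sketch below. It assumes that the k-th nearest neighbor excludes the point itself, which the paper does not specify, and it uses synthetic data. The helper `k_distances` and the crude "largest jump" elbow heuristic are illustrative assumptions; in the paper the elbow is read from the plot.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def k_distances(X: np.ndarray, k: int) -> np.ndarray:
    """Sorted distance from each point to its k-th nearest neighbor (the point itself excluded)."""
    # Querying with the training data returns each point as its own nearest neighbor,
    # so k + 1 neighbors are requested and the last column is kept.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


# Synthetic stand-in for the standardized student features
X_raw, _ = make_blobs(n_samples=150, centers=3, n_features=5, random_state=42)
X = StandardScaler().fit_transform(X_raw)

kdist = k_distances(X, k=2)  # k=2 as in Figure 3

# Crude elbow heuristic: the largest jump between consecutive sorted distances
elbow_idx = int(np.argmax(np.diff(kdist)))
eps_candidate = float(kdist[elbow_idx])
```

Plotting `kdist` against its index reproduces the shape described for Figure 3, and `eps_candidate` gives a starting value that can be refined by inspecting the curve.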
3 Interpretation of Clusters
The interpretation was done by analyzing the average features of each cluster, which included combined academic scores, achievement index, attendance, grade consistency, and parents' educational background.
The interpretation results are shown in Table 2.

Table 2. Interpretation of Characteristics of Each Cluster
Cluster   Key Characteristics                                Interpretation
0         Average academic performance, stable attendance   Majority group, fairly good performance
1         High academic grades                               High-achieving students, potential for acceleration
2         Low/unstable grades, poor attendance               At-risk students, need additional guidance
Noise     Unique/extreme profile                             Outlier, requires special attention

Overall, the clustering results using the DBSCAN algorithm show three main groups and a number of students detected as noise.
Although the Silhouette Score is still relatively low, these findings provide an initial overview of the variations in student learning patterns, which can be used as a basis for consideration in developing learning strategies.
CONCLUSION
This study applied the DBSCAN algorithm to cluster the academic data of students at Muhammadiyah 5 Junior High School in Samarinda.
The clustering results show the formation of three main groups with relatively varying numbers of members, namely 89 students in cluster 0, 50 students in cluster 1, and 5 students in cluster 2, as well as 14 students categorized as noise.
Evaluation with the Silhouette Score produced a value of 0.217, indicating that the cluster separation quality was still weak.
This may be due to the homogeneity of student academic data and the limitations of the variables used. Nevertheless, the results of the study still make an important contribution in the context of exploring student learning patterns. The majority cluster describes a group of students with average performance, the second cluster reflects high-achieving students who could be offered enrichment programs, while the third cluster indicates students at risk of low achievement who need more intensive guidance.
These findings can serve as input for schools to implement more targeted learning interventions.
For further research, it is recommended to add non-academic variables such as learning motivation, family support, and extracurricular activities to improve the quality of clustering and make the interpretation of results more comprehensive.
REFERENCES