METHOMIKA: Jurnal Manajemen Informatika & Komputerisasi Akuntansi, Vol. 8 No. 2 (Oktober 2.), ISSN: 2598-8565 (print media), ISSN: 2620-4339 (online media)

EVALUATING THE QUALITY OF K-MEDOIDS CLUSTERING ON CRIME DATA IN INDONESIA

1 Sujacka Retno, 1 Rozzi Kesuma Dinata, 2 Novia Hasdyna
1 Department of Informatics Engineering, Universitas Malikussaleh, Aceh Utara, Indonesia
2 Department of Informatics, Universitas Islam Kebangsaan Indonesia, Bireuen, Indonesia
Email: sujacka@unimal.
DOI: https://doi.org/10.46880/jmika.Vol8No2.

ABSTRACT
This study evaluates the quality of K-Medoids clustering applied to criminal incident data in Indonesia from 2000 to 2023. The analysis compares clustering performance on the original and normalized datasets using several evaluation metrics: the Davies-Bouldin Index (DBI), Silhouette Score (SS), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and Calinski-Harabasz Index (CH). The findings reveal that the original dataset consistently outperforms the normalized dataset across all metrics. The optimal clustering was achieved in the seventh iteration on the original data, with the lowest DBI and the highest SS, NMI, ARI, and CH values. In contrast, the normalized data exhibited higher DBI values and, in some cases, negative Silhouette Scores, indicating less distinct clusters. These results suggest that, for this dataset, K-Medoids clustering performs more effectively on the original data without normalization, producing more accurate and well-defined clusters of criminal incidents. This insight is valuable for future research and practical applications in crime data analysis, underscoring the importance of preprocessing decisions in clustering.

Keywords: K-Medoids Clustering, Crime Data Analysis, Criminal Incidents, Evaluation Metrics, Data Normalization.
INTRODUCTION
The analysis of criminal incident data has become increasingly important in understanding crime patterns and devising effective strategies for crime prevention. Indonesia, with its diverse and expansive geography, presents unique challenges in monitoring and analyzing criminal activities across its provinces. Effective data analysis techniques are essential for extracting meaningful insights from this data, allowing for informed decision-making by law enforcement agencies and policymakers (Dinata et al., 2.). Previous studies in Indonesia have highlighted the significance of crime data analysis in shaping public safety policies, especially through data clustering (Fauzi et al., 2.).

Clustering techniques, particularly K-Medoids, have gained prominence in crime data analysis due to their ability to partition data into meaningful groups based on similarity. Unlike many other clustering algorithms, K-Medoids is robust against outliers, making it well-suited for datasets such as criminal incident records, where anomalies are common (Budiaji et al., 2.). By identifying central data points (medoids), K-Medoids creates clusters that can be more representative of the underlying patterns in the data, as noted in previous studies (Oktarina et al., 2.; Nakagawa et al., 2.).

In the Indonesian context, clustering methods have been employed to analyze various types of data, including crime, as a means of supporting regional improvement. Normalization is often applied in the clustering process so that results can be compared against those obtained from the original, unnormalized data. Normalization can be performed in various ways, one of which is the StandardScaler (Yanti et al., 2.). The preprocessing of data, especially normalization, often plays a crucial role in determining the quality of clustering results: it brings all features to a common scale, potentially improving the performance of clustering algorithms (Samudi et al., 2.).
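The StandardScaler transformation referred to above standardizes each feature to zero mean and unit variance, z = (x − μ) / σ. A minimal sketch follows; the counts are made up for illustration and are not taken from the study's dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative yearly crime counts for three hypothetical provinces
# (rows = provinces, columns = years; values are invented).
X = np.array([
    [1200.0, 1350.0, 1100.0],
    [400.0,  420.0,  390.0],
    [80.0,   95.0,   70.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean ~0 and unit variance,
# so no single feature dominates the distance computation.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Whether this rescaling helps or hurts clustering quality is exactly the question the study investigates empirically.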
Despite this, the effectiveness of normalization in improving clustering results is not always guaranteed, and its impact may vary depending on the specific characteristics of the dataset. Local studies (Rifa et al., 2.) have shown that preprocessing steps such as normalization can significantly affect the outcome of clustering analyses in various applications, including public health and crime data (Ghufron et al., 2.).

In this study, we evaluate the performance of K-Medoids clustering on both original and normalized criminal incident data from Indonesia, spanning from 2000 to 2023. The study employs several evaluation metrics, including the Davies-Bouldin Index (DBI), Silhouette Score (SS), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and Calinski-Harabasz Index (CH), to assess the quality of the clustering results (Islam et al., 2.). These metrics provide a comprehensive evaluation of clustering quality, ensuring that the analysis captures the nuances of the data and the effectiveness of the K-Medoids algorithm (Mousavi et al., 2.).

RESEARCH METHODS
The dataset used in this research contains the number of criminal incidents reported in all provinces of Indonesia from 2000 to 2023. The data was taken from the records of the Indonesian National Police to ensure it is accurate and complete. Each entry includes the province, the year, and the total number of crimes reported. This large dataset supports the analysis of crime trends in Indonesia. Table 1 below shows the complete dataset.

Table 1. Complete Dataset.
In this study, we tested the clustering using both the original dataset and the dataset normalized with StandardScaler, to see which works better for clustering with the K-Medoids algorithm. The dataset normalized with StandardScaler is shown in Table 2.

Table 2. Normalized Dataset.

We apply the K-Medoids clustering method to both the original crime dataset and the normalized one. The main idea is to cluster both datasets and check the results with various metrics to ensure an accurate analysis (Luchia et al., 2.). The flowchart of the research process is shown in Figure 1.

Figure 1. Research Scheme (input dataset → normalization → K-Medoids clustering → evaluation based on the evaluation metrics → output results).

The framework of the K-Medoids method used in this research is shown in Figure 2 (Herman et al., 2.):
1. Determine the number of clusters K.
2. Randomly select K objects as the initial medoids; each medoid represents a cluster.
3. Calculate the distance between each data point and the medoids, and assign each point to the cluster of its nearest medoid.
4. Calculate candidate medoids in each iteration.
5. Calculate the total deviation S; if S < 0, swap the candidate object in as the new medoid.
6. Recalculate until there is no change in the medoids, then terminate.

Figure 2. K-Medoids Scheme.

Figure 2 shows the K-Medoids scheme used in this study, in which we test both the original dataset and the dataset normalized using StandardScaler.

RESULTS AND DISCUSSION
The clustering was performed on both the original and normalized datasets using the K-Medoids algorithm in Python, and the results were visualized after 10 runs to analyze the consistency and validity of the clustering process. In this study, the number of clusters was set to k = 3.
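The iteration scheme in Figure 2 can be sketched as a minimal PAM-style K-Medoids implementation. This is an illustrative version assuming Euclidean distance and a greedy swap step, not the exact code used in the study; the synthetic data and the `k_medoids` helper are this sketch's own.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Minimal PAM-style K-Medoids: assign points to the nearest medoid,
    then greedily swap in any non-medoid point that lowers the total distance."""
    rng = np.random.default_rng(seed)
    n = len(X)
    medoids = rng.choice(n, size=k, replace=False)          # step 2: random initial medoids
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances

    def total_cost(meds):
        # Step 3: each point contributes its distance to the nearest medoid.
        return dist[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = candidate
                trial_cost = total_cost(trial)
                if trial_cost < cost:      # step 5: S < 0, so swap the object in
                    medoids, cost = trial, trial_cost
                    improved = True
        if not improved:                   # step 6: no medoid changed, terminate
            break
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

# Illustrative 2-D data with three well-separated groups (k = 3, as in the study).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([0, 0], 0.3, (20, 2)),
    rng.normal([5, 5], 0.3, (20, 2)),
    rng.normal([0, 5], 0.3, (20, 2)),
])
medoids, labels = k_medoids(X, k=3)
print(np.bincount(labels))  # cluster sizes
```

Because medoids are actual data points rather than computed means, an outlier province cannot drag a cluster center toward itself, which is the robustness property the introduction attributes to K-Medoids.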
The clustering results are compared in Figure 3 below.

Figure 3. Comparison of K-Medoids Clustering Results.

The comparison of clustering results before and after data normalization, as depicted in Figure 3, demonstrates the significant impact of normalization on clustering behavior. In the original data clustering, the PCA-transformed data exhibits wide variance across the PCA components, leading to clusters with dispersed and uneven distributions. The clusters, although distinguishable, reflect the influence of unbalanced feature scales, as indicated by the large range of values on both axes. In contrast, the normalized data clustering shows a more compact distribution, with data points more evenly spaced across both PCA components. Normalization reduces the variance between features, leading to clusters that are more homogeneously distributed. Both plots reveal three distinct clusters, but normalization results in tighter and more cohesive groupings, suggesting improved feature representation. This highlights the role of normalization in ensuring that all features contribute equally to the clustering.

The results of the clustering process were then evaluated using five metrics: DBI, SS, NMI, ARI, and CH.

Davies-Bouldin Index (DBI): measures the average similarity ratio of each cluster with the one most similar to it. Lower values indicate better clustering. The formula is:

DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{E_i + E_j}{d(c_i, c_j)}

where E_i is the average distance of all points in the i-th cluster to its centroid c_i, and d(c_i, c_j) is the distance between centroids c_i and c_j.

Silhouette Score (SS): assesses the quality of the clusters by measuring cohesion within clusters against separation between clusters.
Scores range from -1 to 1, with higher scores indicating better-defined clusters. The formula is:

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

where a(i) is the average distance from the i-th point to the other points in the same cluster, and b(i) is the average distance from the i-th point to the points in the nearest neighboring cluster.

Normalized Mutual Information (NMI): quantifies the mutual dependence between the clustering results and the ground-truth classification. Scores range from 0 to 1, with higher values indicating greater similarity. The formula is:

NMI(U, V) = \frac{I(U, V)}{\sqrt{H(U)\,H(V)}}

where I(U, V) is the mutual information between the clusterings U and V, and H(U) and H(V) are the entropies of the clusterings.

Adjusted Rand Index (ARI): measures the similarity between the clustering results and a ground-truth classification, adjusted for chance. Scores range from -1 to 1, with higher values indicating better clustering performance. The formula is:

ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}

where RI is the Rand Index and E[RI] is its expected value under random labeling.

Calinski-Harabasz Index (CH): evaluates the ratio of between-cluster dispersion to within-cluster dispersion. Higher values indicate better-defined clusters. The formula is:

CH = \frac{Tr(B_k) / (k - 1)}{Tr(W_k) / (n - k)}

where Tr(B_k) is the trace of the between-cluster dispersion matrix, Tr(W_k) is the trace of the within-cluster dispersion matrix, k is the number of clusters, and n is the number of data points.

The values of these evaluation metrics for the K-Medoids clustering in this study, shown in Tables 3 and 4 below and visualized in Figures 4 to 8, reveal nuanced differences in clustering quality between the original and normalized datasets.

Table 3. Metrics Value for K-Medoids Clustering on the Original Dataset.

Table 4.
Metrics Value for K-Medoids Clustering on the Normalized Dataset.

Firstly, the Davies-Bouldin Index (DBI), which measures cluster separation, shows a slightly lower average value for the normalized dataset than for the original dataset. A lower DBI generally indicates better clustering, suggesting that the normalized dataset offers marginally better cluster separation.

Figure 4. Comparison of DBI Values.

The Silhouette Score (SS), which assesses the cohesion and separation of clusters, is higher for the normalized dataset than for the original dataset. This indicates that the clusters in the normalized dataset are more cohesive and better separated, reflecting improved clustering performance in this regard.

Figure 5. Comparison of SS Values.

However, the Normalized Mutual Information (NMI), which measures the agreement between the clustering result and the true labels, is higher for the original dataset than for the normalized dataset. This suggests that the original dataset may provide clustering results more consistent with the underlying data structure.

Figure 6. Comparison of NMI Values.

The Adjusted Rand Index (ARI), another metric for measuring the similarity between the predicted clustering and the true labels, shows a higher average value for the normalized dataset than for the original dataset. This indicates that the normalized dataset may align better with the ground truth in terms of cluster assignments.

Figure 7. Comparison of ARI Values.

Lastly, the Calinski-Harabasz Index (CH), which evaluates the ratio of between-cluster dispersion to within-cluster dispersion, is higher for the original dataset than for the normalized dataset.

Figure 8. Comparison of CH Values.
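For reference, all five metrics reported in Tables 3 and 4 are available in scikit-learn. The sketch below shows how they can be computed; the data and labels are synthetic stand-ins, not the study's crime dataset or its actual cluster assignments.

```python
import numpy as np
from sklearn.metrics import (
    davies_bouldin_score,
    silhouette_score,
    calinski_harabasz_score,
    normalized_mutual_info_score,
    adjusted_rand_score,
)

# Illustrative data: three synthetic groups standing in for crime-count rows.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (15, 2)) for c in ([0, 0], [4, 4], [0, 4])])
labels = np.repeat([0, 1, 2], 15)   # cluster assignment from the clustering step
truth = np.repeat([0, 1, 2], 15)    # hypothetical ground-truth labels

# Internal metrics need only the data and the assignment.
dbi = davies_bouldin_score(X, labels)    # lower is better
ss = silhouette_score(X, labels)         # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)  # higher is better

# External metrics compare the assignment against ground-truth labels.
nmi = normalized_mutual_info_score(truth, labels)  # in [0, 1]
ari = adjusted_rand_score(truth, labels)           # 1 means perfect agreement

print(round(nmi, 3), round(ari, 3))
```

Note that DBI, SS, and CH can be computed for any clustering, while NMI and ARI require reference labels, which is why the two groups of metrics can disagree, as they do in Tables 3 and 4.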
A higher CH score typically indicates better-defined clusters, suggesting that the original dataset might offer more well-defined clusters.

In summary, while the normalized dataset demonstrates slightly better performance in terms of cluster cohesion (SS) and alignment with the true labels (ARI), the original dataset provides better results in terms of agreement with the underlying data structure (NMI) and cluster definition (CH). The decision on which dataset to prioritize depends on the specific goals of the clustering analysis and the relative importance of these metrics in the context of the study.

CONCLUSION
This study has demonstrated the application of the K-Medoids clustering algorithm to both original and normalized crime datasets, yielding valuable insights into clustering performance across various metrics. The comparison of evaluation metrics such as the Davies-Bouldin Index (DBI), Silhouette Score (SS), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and Calinski-Harabasz Index (CH) reveals that normalization has a mixed impact on clustering quality. Specifically, the normalized dataset showed improvements in cluster cohesion and alignment with the true labels, as evidenced by higher SS and ARI scores. However, the original dataset outperformed the normalized version in capturing the inherent structure of the data, as indicated by superior NMI and CH values. These findings highlight the importance of considering multiple evaluation metrics when assessing clustering outcomes, as different metrics emphasize different aspects of clustering quality. The results suggest that while data normalization can enhance certain aspects of clustering performance, it is not universally beneficial and may even detract from clustering accuracy in some contexts.
Therefore, the choice between using original or normalized data should be guided by the specific objectives of the analysis and the relative importance of each evaluation metric. Future research could explore additional normalization techniques or alternative clustering algorithms to further optimize the clustering process and achieve more robust results.

REFERENCES