International Journal of Management Science and Information Technology IJMSIT
E-ISSN: 2774-5694
P-ISSN: 2776-7388
Volume 6 .
January-June 2026, 297-304 DOI: https://doi.
org/10.
35870/ijmsit.
Optimizing K-means Clustering with Seed Initialization for Osteoporosis Diagnosis Based on Family History
Adiyah Mahiruna 1*, Ngatimin 2, Rachmat Destriana 3
1*,2 Software Engineering Study Program, Faculty of Science and Technology, Institut Teknologi Statistika dan Bisnis Muhammadiyah Semarang, Semarang City, Central Java Province, Indonesia
3 Informatics Engineering Study Program, Faculty of Engineering, Universitas Muhammadiyah Tangerang, Tangerang City, Banten Province, Indonesia
Email: mahirunaadiyah@gmail.com 1*, ngatimin@itesa. id 2, rahmat.destriana@ft-umt.
Article history: Received February 21, 2026; Revised March 27, 2026; Accepted April 1, 2026
Abstract
World Osteoporosis Day (WOD) is observed on October 20 every year to raise global awareness of the prevention, diagnosis, and treatment of osteoporosis.
The issue is urgent in Indonesia, where the number of elderly people is projected to reach 71 million by 2050, which will increase the number of osteoporosis cases.
The evidence-based recommendations in this study therefore aim to assist practitioners in preventing osteoporosis in adults and children.
This study proposes a method for improving K-Means performance through seed initialization.
The performance of the K-Means clustering algorithm depends heavily on the random selection of initial centroids, which can lead to unstable clusters, suboptimal local solutions, and additional iterations, particularly in medical datasets such as osteoporosis diagnosis based on family history.
An optimized centroid initialization strategy is therefore needed that improves clustering accuracy and stability without increasing computational complexity.
The test dataset is a publicly accessible osteoporosis dataset.
The novelty of this study lies in the Modified Average (MA) approach to centroid initialization, which eliminates random seed dependency and improves clustering stability without increasing computational complexity.
Across nine experiments with the benchmarking datasets, the proposed method tends to achieve a higher Rand Index than K-means with random seeds.
Keywords: K-means, Seeds, Clustering, Osteoporosis, Rand index
INTRODUCTION
Osteoporosis is commonly known as brittle bones.
It was initially regarded as a condition that affects only the elderly.
Current knowledge shows that osteoporosis also occurs in children, either as a precursor to osteoporosis or as a predictor of osteoporosis in the elderly population (Kemenkes RI, 2.
In Indonesia, patients diagnosed with osteoporosis have not yet been clustered based on family history.
One of the factors causing osteoporosis is family history: if a family member has osteoporosis, a person's risk of developing it is greater.
Clustering plays an important role in analyzing and grouping osteoporosis sufferers based on risk factors such as family history or genetic factors (Laurenso et al., 2024; Sitinjak et al., 2.
By using clustering techniques, individuals with similar characteristics can be grouped to understand the pattern of association between hereditary factors and the development of osteoporosis (. et al., 2024; Mahmuda et al., 2.
Clustering of family health history is carried out to help prevent an increase in the number of osteoporosis sufferers.
The K-Means method is a clustering algorithm (Daulay & Wandri, 2. that partitions data by performing an iterative process in forming data groups (Faran & Aldisa, 2024; Indra et al., 2. through a series of iterative partitions to reduce the average distance between each data point and its corresponding cluster.
According to the performance assessment of the K-means and Fuzzy C-means segmentation techniques in conjunction with 3 ML, the osteoporosis detection method demonstrating the highest diagnostic performance is K-means segmentation paired with a multilayer perceptron classifier, achieving accuracies of 48%, 90.90%, and 90.00% for specificity and sensitivity, respectively (Widyaningrum et al., 2.
However, the performance of K-means is highly dependent on the selection of the initial center, which is determined randomly (Chen et al., 2020; Erisoglu et al., 2011; Lu & Braunstein, 2., and depends on the determination of the number of clusters (Naldi & Campello, 2.
Suboptimal initial center selection can lead to convergence to less accurate local solutions (Goyal & Kumar, 2014; Tsapanos et al., 2. produce unstable clusters, and increase the number of iterations required to reach the final result (Farissa et al., 2021; Maulani et al., 2.
Therefore, an optimization method is needed for determining the initial K-Means centers so that the clustering process is more effective and produces more accurate data groups (Celebi, 2015; Celebi & Kingravi, 2.
In research by Ahmad Ilham, the findings demonstrated that the suggested approach produces good SSE values, particularly for k=4, which has a lower SSE value than k=3. It was demonstrated that applying DT to enhance Goyal and Kumar's approach (Goyal & Kumar, 2. to the initial centroid improves k-means performance.
In research by S. Sajidha (Sajidha et al., 2. the primary goal of the presented algorithm is to propose initial seed artifacts for the K-modes technique, selecting the seed artifacts from distinct clusters and dense places.
Because the seeds of k-means are important, many researchers have studied how to determine them.
This study focuses on how to determine the optimal initial center to improve the performance of the K-Means algorithm in clustering, especially in the application of osteoporosis diagnosis based on family history factors.
The method used to determine the initial center of the K-means algorithm is the modified average (MA).
RESEARCH METHOD
This study uses public datasets: the Osteoporosis dataset from Kaggle and datasets from the University of California Irvine (UCI).
Figure 1. The Flow of Proposed Method (flowchart stages: Start, Data Preparation, Definition Number of Cluster, Initial Seeds, Clustering Results, Accuracy Evaluation, End)
The proposed method was processed using the RStudio application; conventional k-means and k-means with the proposed seeds were run on the same datasets.
K-means Method
The steps of the K-means method with random seeds are generally as follows:
1. Prepare the datasets.
2. Determine the number of clusters.
3. Select random seeds.
4. Calculate the distance to the seeds (Euclidean distance).
5. Group objects based on minimum distance.
6. Repeat until no object moves to another group.
7. The clusters are created.
Conventional k-means uses random seeds for the initial centroids, whereas the proposed method uses the modified average (MA) to initialize the centroids.
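The steps listed above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' RStudio implementation; the function names `euclidean`, `kmeans`, and `random_seeds` and the toy data are hypothetical. The seeds are passed in as an argument so that only step 3 changes between initialization strategies.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(data, seeds, max_iter=100):
    """Run k-means from the given initial seeds; returns (labels, centroids)."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Steps 4-5: assign each object to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in data
        ]
        # Step 6: stop when no object moved to another group.
        if new_labels == labels:
            break
        labels = new_labels
        # Update each centroid to the mean of its current members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def random_seeds(data, k, rng):
    """Step 3 (conventional): pick k distinct objects at random as seeds."""
    return rng.sample(data, k)


# Hypothetical toy data with three visible groups.
toy = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [9.0, 1.0], [9.1, 1.2]]
labels, centroids = kmeans(toy, random_seeds(toy, 3, random.Random(0)))
```

Because the seeds are drawn at random, different `random.Random` seeds may produce different final clusters on less well-separated data, which is the instability the study sets out to address.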
Proposed Method
The steps of the proposed method are:
1. Prepare the datasets.
2. Determine the number of clusters.
3. Compute the modified average (MA) seeds.
4. Calculate the distance to the seeds (Euclidean distance).
5. Group objects based on minimum distance.
6. Repeat until no object moves to another group.
7. The clusters are created.
The proposed method differs from conventional k-means only in the third step.
The existing study uses the average of each part of the dataset, partitioned based on the number of clusters k. The proposed method instead uses the global average without partitioning the dataset, then divides that average by k.
Figure 2 illustrates the modified average (MA) method proposed in this study.
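One possible reading of the MA seeding step is sketched below in plain Python. The text states only that the global average is computed without partitioning and then divided by k; how k distinct seeds are derived from that value is not spelled out, so the spacing rule used here (the j-th seed is j times MA/k) is an assumption made purely for illustration, and the names `global_average` and `ma_seeds` are hypothetical.

```python
def global_average(data):
    """Per-attribute mean over all n objects (the global average)."""
    n = len(data)
    return [sum(col) / n for col in zip(*data)]


def ma_seeds(data, k):
    """Deterministic seeds derived from the global average.

    ASSUMPTION: the j-th seed is j * (MA / k) for j = 1..k. The source only
    says the global average is divided by k; this spacing rule is hypothetical.
    """
    ma = global_average(data)
    step = [value / k for value in ma]  # global average divided by k
    return [[j * s for s in step] for j in range(1, k + 1)]


# Hypothetical toy data: the global average is [3.0, 4.0].
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
seeds = ma_seeds(toy, 3)
```

Whatever the exact spacing rule, the point of the design is that the seeds depend only on the data and k, so repeated runs start from the same centroids.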
Figure 2. Proposed Method
Figure 2 shows the steps of the proposed method; the differences between the proposed method and the existing method are explained below.
Basic K-Means Equation
Random seeds: the k initial centroids are selected at random.
Euclidean Distance
$$d(x_i, \mu_j) = \sqrt{\sum_{l=1}^{m} (x_{il} - \mu_{jl})^2}$$
For:
m = number of attributes
x_{il} = value of the l-th attribute of the i-th data
\mu_{jl} = value of the l-th attribute of the j-th centroid
Update Centroid
$$\mu_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$$
For:
|C_i| = number of members of cluster i
Proposed Method Equation (Modified Average)
Modified Average for seeds
$$MA = \frac{1}{n} \sum_{i=1}^{n} x_i$$
For:
n = number of data
x_i = i-th data
$$MA_1 = \frac{1}{k} \left( \frac{1}{n} \sum_{i=1}^{n} x_{il} \right)$$
For:
k = number of clusters
l = index of the attribute
Euclidean Distance
$$d(x_i, \mu_j) = \sqrt{\sum_{l=1}^{m} (x_{il} - \mu_{jl})^2}$$
For:
m = number of attributes
x_{il} = value of the l-th attribute of the i-th data
\mu_{jl} = value of the l-th attribute of the j-th centroid
Update Centroid
$$\mu_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$$
For:
|C_i| = number of members of cluster i
Rand Index
$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$
For:
TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
Confidence Interval (CI)
$$CI = [D_{\min}, D_{\max}]$$
For:
CI = Confidence Interval
Dmin = Minimal Difference
Dmax = Maximal Difference
RESULTS AND DISCUSSION
In this study, the initial cluster centers are determined by applying the modified average (MA) method.
The performance of the tested method was measured using the Rand Index.
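As an illustration of the measurement, the Rand Index can be computed by counting pairs of objects. The sketch below is plain Python with hypothetical label vectors; `rand_index` is an illustrative helper, not the routine used in the experiments. A pair counts as TP when both labelings place the two objects together, TN when both separate them, FP when only the clustering joins them, and FN when only the reference classes join them.

```python
from itertools import combinations


def rand_index(true_labels, pred_labels):
    """Rand Index = (TP + TN) / (TP + FP + FN + TN) over all object pairs.

    Requires at least two objects so that at least one pair exists.
    """
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # together in both labelings
        elif not same_true and not same_pred:
            tn += 1  # separated in both labelings
        elif same_pred:
            fp += 1  # clustered together, but different classes
        else:
            fn += 1  # same class, but clustered apart
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels: 6 objects, 3 reference classes.
reference = [0, 0, 1, 1, 2, 2]
clustering = [1, 1, 0, 0, 2, 0]
score = rand_index(reference, clustering)
```

The cluster numbers themselves do not matter, only which objects are grouped together, so a clustering that matches the reference classes under any renaming of the labels scores 1.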
The public datasets are the Osteoporosis dataset from Kaggle and the BreastTissue and Immunotherapy datasets from the University of California Irvine (UCI).
The proposed method is compared with k-means using random seeds.
Table 1. Dataset Used in This Study
Columns: Dataset Name, Data Amount, Number of Attributes, Number of Classes
Rows: Osteoporosis, Breast Tissue, Immunotherapy
In this study, the methods are run in the RStudio application on public datasets to test the proposed method and analyze the results.
Figure 3. R(I) Diagram: K-means and Proposed Method
Figure 3 can be interpreted as follows:
Osteoporosis: the proposed method is slightly higher than K-means.
Breast Tissue: the proposed method is higher, with the most obvious difference.
Immunotherapy: both values are almost the same (the proposed method is slightly lower).
The Rand Index value ranges from 0 to 1.
The greater the Rand Index value, the more similar the attribute data between members in one cluster.
Family history is a significant risk factor, as individuals with relatives affected by osteoporosis are more likely to develop the condition themselves.
Identifying groups of individuals with similar characteristics can aid in early detection and preventive interventions.
Clustering techniques, particularly K-Means, are widely used for this purpose due to their simplicity and efficiency in partitioning data based on centroid distances.
However, the performance of K-Means is highly dependent on the initial selection of centroids and the predetermined number of clusters.
Random initialization often leads to unstable clusters, suboptimal solutions, and increased iterations.
Several studies have explored deterministic or optimized initialization methods to improve clustering outcomes, but there remains a need for approaches that reduce randomness while maintaining computational efficiency.
This study proposes a Modified Average (MA) method for initializing K-Means centroids to address these challenges.
Unlike conventional K-Means, which selects initial centroids randomly, the MA method computes the global average of all attributes and divides this average into k initial centroids.
This approach reduces dependency on random seeds, improves clustering stability, and does not increase computational complexity.
Experiments were conducted using publicly available datasets, including an osteoporosis dataset with 1,958 samples and 10 attributes, the Breast Tissue dataset with 106 samples and 9 attributes, and the Immunotherapy dataset with 90 samples and 7 attributes.
Data preprocessing, including normalization and handling missing values, ensured consistency across experiments.
The performance of the proposed method was evaluated using the Rand Index, which measures the similarity between predicted clusters and actual classes.
Statistical analysis, including confidence intervals calculated via the Wilcoxon signed-rank test, was used to assess differences between the proposed method and conventional K-Means.
Table 2. Results R(I) with Number of Cluster (K) = 3
Columns: Dataset Name, Number of Class, R(I) Proposed, R(I) K-means, Difference (D)
Rows: Osteoporosis, Breast Tissue, Immunotherapy
The proposed method has a higher Rand Index value than conventional k-means on the Osteoporosis and Breast Tissue datasets; on the Immunotherapy dataset, the proposed method is 0.001 lower.
This study uses the Confidence Interval (CI) for statistical analysis.
The confidence interval of the performance difference is shown in Table 3.
Table 3. Confidence Interval Performance Difference (D)
Columns: Dataset Name, Number of Class, R(I) Proposed, R(I) K-means, Difference (D)
Rows: Osteoporosis, Breast Tissue, Immunotherapy
The Wilcoxon signed-rank distribution is used in this study to compare two paired or dependent groups by analyzing the ranks of their differences, which suits non-normally distributed data.
For small n (n = 3), the nonparametric Confidence Interval (CI) is calculated based on the Wilcoxon rank distribution.
With n = 3 and α = 0.05, the 95% confidence interval for the median difference is: CI = [Dmin, Dmax] = [-0.
The performance difference between the Proposed and K-means methods is not statistically significant at the 95% confidence level, but 2 of 3 datasets show an increase and the upper limit of the CI indicates a potential increase of up to 0.
This indicates that practically the Proposed method has a tendency to perform better, although statistically it is not yet strong enough.
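The limited statistical strength noted above can be made concrete with a small sketch. With n = 3 paired differences, the exact two-sided Wilcoxon signed-rank test has only 2^3 = 8 equally likely sign patterns, so its smallest attainable p-value is 2/8 = 0.25 and no outcome can reach α = 0.05. The plain-Python code below enumerates the sign patterns and reports the interval [Dmin, Dmax]; the difference values are hypothetical placeholders, not the paper's results, and `signed_rank_exact_p` is an illustrative helper that assumes no zero differences and no tied absolute values.

```python
from itertools import product


def signed_rank_exact_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value by enumeration.

    Assumes no zero differences and no ties among absolute differences.
    """
    # Rank the differences by absolute value (1 = smallest).
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    patterns = 0
    for signs in product([0, 1], repeat=len(diffs)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        patterns += 1
        if min(w, total - w) <= observed:
            extreme += 1
    return extreme / patterns


# Hypothetical differences D = R(I) proposed - R(I) k-means for 3 datasets.
diffs = [0.02, 0.05, -0.001]
p_value = signed_rank_exact_p(diffs)
interval = (min(diffs), max(diffs))  # CI = [Dmin, Dmax]
best_case_p = signed_rank_exact_p([0.02, 0.05, 0.001])  # all three improve
```

Even in the best case, where all three differences are positive, the p-value is 0.25, which is why more datasets are needed before a significance claim is possible.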
Results indicate that the MA-based method improves clustering performance compared to conventional K-Means on two of the three datasets.
On the Osteoporosis dataset, the Rand Index increased from 0.6541 to 0.6723, while on the Breast Tissue dataset, it increased from 0.4832 to 0.
The Immunotherapy dataset showed negligible difference, with the Rand Index slightly decreasing from 0.5930 to 0.
Although the statistical significance of these improvements is limited due to the small number of datasets, the practical improvement demonstrates the effectiveness of the MA initialization.
The proposed method provides more stable clusters, reduces variability due to random seed selection, and performs consistently across different datasets.
Future work should involve larger and more diverse datasets and include additional clustering metrics such as Silhouette Score, Adjusted Rand Index (ARI), Davies-Bouldin Index, and Normalized Mutual Information (NMI) to provide a comprehensive evaluation.
Integrating this approach with predictive modeling could also enhance early diagnosis and preventive strategies for osteoporosis.
CONCLUSION
Based on the results of the experiments, the proposed method performed better than K-means on two of the three datasets, namely Osteoporosis and Breast Tissue.
On the Immunotherapy dataset, the performance of both methods was relatively equivalent.
The 95% confidence interval for the median performance difference is in the range [-0.0008, 0. , which includes the value of zero.
This confirms that the performance improvement of the Proposed method is not yet statistically significant.
The insignificance of the statistical test results was greatly influenced by the limited number of datasets (n = 3), so the statistical power of the test was relatively low.
Based on the research and analysis carried out, future work should use more datasets with diverse characteristics to increase the power of the statistical tests and obtain more representative conclusions.
In addition to R(I), it is recommended to use other metrics such as Silhouette Score, Adjusted Rand Index (ARI), Davies-Bouldin Index, or Normalized Mutual Information (NMI) to obtain a more comprehensive evaluation.
The Modified Average K-Means method offers a simple yet effective enhancement over conventional K-Means by providing deterministic initial centroids that improve cluster stability and performance.
While improvements are practically meaningful, further studies with more datasets and complementary evaluation metrics are necessary to establish statistical significance and broader applicability.
This approach can serve as a valuable tool for medical data analysis, particularly in identifying high-risk populations for osteoporosis based on family history.
ACKNOWLEDGEMENTS
The author would like to thank the Directorate of Research, Technology, and Community Service (DPPM) for the financial support that has made this research possible.
REFERENCES