JURNAL MANAJEMEN PELAYANAN KESEHATAN
VOLUME 29
Special Edition
20 - 25
PREDICTIVE MODELING OF OBESITY RISK FACTORS USING MACHINE LEARNING
Husnul Khuluq (1,2), Lazuardi Fatahillah H (2), Ayu Nissa Ainni (1), Tri Cahyani W (1)
(1) Pharmacist Professional Study Program (Program Studi Apoteker), Universitas Muhammadiyah Gombong, Kebumen, Indonesia
(2) Informatics Engineering Study Program (Program Studi Teknik Informatika), Universitas Muhammadiyah Gombong, Kebumen, Indonesia
ABSTRACT
Background: Obesity is a major global health concern and a key risk factor for various non-communicable diseases, including diabetes, hypertension, and cardiovascular disorders.
Despite extensive studies, accurately identifying the key contributing factors remains a challenge.
Objective: This study aims to predict the likelihood of obesity using machine learning algorithms, based on questionnaire-derived clinical and behavioral data. Several algorithms, namely logistic regression, naive Bayes, support vector machine (SVM), and random forest, will be employed to build predictive models.
Model performance will be evaluated using accuracy, precision, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC).
Methods: We used an open-access dataset from Kaggle comprising 2,111 samples with anthropometric, demographic, and lifestyle data.
Of these, 972 individuals were categorized as obese and 1,139 as non-obese.
The target variable was categorized into binary labels: "Obesity" and "Non-Obesity."
Preprocessing included one-hot encoding, label encoding, and train-test splitting.
All four ML models were trained and evaluated using accuracy, area under the curve (AUC), precision, sensitivity, and specificity metrics.
Results: The best-performing model, random forest, achieved an accuracy of 98.58%, AUC of 99.96%, sensitivity of 98.99%, specificity of 98.21%, and precision of 98.01%.
The most influential predictors were weight, frequent consumption of high-caloric food, family history of being overweight, physical activity frequency, and daily water intake.
Conclusion: The model demonstrated high performance and identified key lifestyle-related features.
These findings support machine learning's potential for obesity screening and public health strategy development.
Keywords: Obesity, Machine Learning, Random Forest, Risk Factors, Predictive Modeling
ABSTRAK
Background: Obesity is a major global health problem and a key risk factor for various non-communicable diseases, including diabetes, hypertension, and cardiovascular disorders. Although much research has been conducted, accurately identifying the main causal factors of obesity remains a challenge.
Objective: This study aims to predict the likelihood of obesity using machine learning algorithms, based on clinical and behavioral data obtained from questionnaires. Several supervised algorithms, namely logistic regression, naive Bayes, support vector machine (SVM), and random forest, were used to build predictive models. Model performance was evaluated using accuracy, precision, sensitivity, specificity, and the area under the ROC curve (AUC).
Methods: An open-access dataset from Kaggle was used, comprising 2,111 samples covering anthropometric, demographic, and lifestyle data. Of these, 972 individuals were categorized as obese and 1,139 as non-obese. The target variable was reclassified into two labels: "Obesity" and "Non-Obesity." Preprocessing included one-hot encoding, label encoding, and splitting the data into training and test sets. All four ML models were trained and evaluated using accuracy, AUC, precision, sensitivity, and specificity.
Results: The model achieved an accuracy of 98.58%, AUC of 99.96%, sensitivity of 98.99%, specificity of 98.21%, and precision of 98.01%. The most influential predictors were weight, frequency of high-calorie food consumption, family history of being overweight, physical activity frequency, and daily water intake.
Conclusion: The model showed very high performance and successfully identified the lifestyle features that play a role in obesity. These findings support the potential of machine learning in obesity screening and the development of public health strategies.
Keywords (Kata Kunci): Obesity, Machine Learning, Random Forest, Risk Factors, Predictive Modeling
*Corresponding author. Email: husnulkhuluq@unimugo.
K Husnul, et al.: Predictive Modeling of Obesity Risk Factors Using Machine Learning
INTRODUCTION
Obesity is rising globally, affecting both developed and developing countries. In 2022, approximately 2.5 billion adults were overweight, with 890 million classified as obese. The prevalence of obesity is also increasing among children, particularly in Asia. If left unaddressed, the global economic burden of obesity could reach US$18 trillion by 2060 [1, 2].
Obesity is influenced by a range of risk factors, including genetic predisposition, unhealthy dietary patterns, low levels of physical activity, and various social and environmental factors. Although these factors have been widely studied, there remains a lack of comprehensive understanding regarding how they interact and contribute to obesity at the individual level [3, 4].
The selected machine learning methods were Logistic Regression, Naive Bayes, Support Vector Machine, and Random Forest. These methods were chosen because each represents a different modeling philosophy and captures different patterns in the data. Logistic Regression provides an interpretable baseline model and has been widely used in health prediction. Naive Bayes is efficient for high-dimensional categorical datasets and performs well with questionnaire-derived variables [6].
With advancements in technology, machine learning (ML) offers significant opportunities to analyze complex data and develop more accurate predictive models for understanding the factors that contribute to obesity. Therefore, this study aims to identify the risk factors associated with obesity by analyzing data using machine learning techniques to produce a more effective predictive model.
Based on this background, the research addresses the following questions:
1. What are the significant risk factors contributing to obesity based on questionnaire data collected from individuals?
2. How can machine learning models be used to analyze and predict obesity-related risk factors?
3. To what extent can the predictive results be used to support obesity prevention efforts among high-risk populations?
Obesity has become a major focus in health research, with numerous studies aiming to identify the risk factors that contribute to this condition. Previous research has commonly employed conventional statistical approaches, such as linear regression or factor analysis, to determine variables associated with obesity. However, these methods have limitations in handling complex datasets involving numerous variables, often resulting in reduced accuracy in prediction. With advancements in technology, machine learning (ML) has begun to be used to address these limitations. ML algorithms such as random forests, logistic regression, naive Bayes, and support vector machines (SVM) have been applied to analyze more complex medical and social data, as well as to enhance predictive capabilities related to obesity.
Recent studies have demonstrated the potential of machine learning in predicting obesity; however, many of these studies are still limited by small sample sizes or rely on specific types of data (e.g., clinical or genetic data only). Some of the latest models have started to incorporate more holistic data, including environmental factors and social behaviors. Nevertheless, research combining questionnaire-based risk-factor data with machine learning prediction models, particularly in the Indonesian context and for understanding obesity dynamics in broader populations, remains limited.
METHODS
The dataset used was obtained from Kaggle ("Obesity Data Set_raw_and_data_sinthetic.csv") and included 16 input parameters covering anthropometric, behavioral, and demographic aspects. These parameters were: Age, Gender, Height, Weight, CALC (alcohol consumption), FAVC (frequent consumption of high-caloric food), FCVC (frequency of vegetable consumption), NCP (number of main meals), SCC (monitoring of calorie intake), SMOKE, CH2O (daily water intake), family_history_with_overweight, FAF (physical activity frequency), TUE (time using technology devices), CAEC (food consumption between meals), and MTRANS (mode of transportation). The target variable, "NObeyesdad", was recoded into a binary classification: "Obesity" and "Non-Obesity".
The dataset was obtained from Kaggle, but the original source cannot be fully verified, as the dataset does not clearly indicate which survey, institution, or time period the data were derived from.
This lack of traceability limits the ability to understand the population context and reduces the generalisability of the findings.
Data Processing
The "NObeyesdad" variable was recoded into a binary class: "Obesity" vs "Non-Obesity." Categorical variables were transformed using one-hot encoding, while numerical values were retained. The data were split into training and test sets (80/20 split).
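The three preprocessing steps described above (binary recoding, one-hot encoding, and an 80/20 split) can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline, which used scikit-learn; the label names in `LABELS`, the helper names, the random seed, and the ten toy records are all assumptions made for illustration.

```python
import random

# Assumed label names for the "NObeyesdad" column; verify against the actual file.
LABELS = [
    "Insufficient_Weight", "Normal_Weight",
    "Overweight_Level_I", "Overweight_Level_II",
    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III",
]


def to_binary(label):
    """Collapse a multi-class weight label into "Obesity" or "Non-Obesity"."""
    return "Obesity" if label.startswith("Obesity") else "Non-Obesity"


def one_hot(rows, column):
    """Replace one categorical column with 0/1 indicator columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out `test_fraction` of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up records for illustration only (not taken from the dataset).
records = [
    {"Weight": 50.0 + 10 * i, "FAVC": "yes" if i % 2 else "no", "NObeyesdad": LABELS[i % 7]}
    for i in range(10)
]

# 1) binary target, 2) one-hot encode the categorical column, 3) 80/20 split.
prepared = [
    {"Weight": r["Weight"], "FAVC": r["FAVC"], "target": to_binary(r["NObeyesdad"])}
    for r in records
]
encoded = one_hot(prepared, "FAVC")
train, test = split_train_test(encoded)
print(len(train), len(test))  # 8 training rows and 2 test rows
```

Numerical columns such as Weight pass through unchanged, which mirrors the statement that numerical values were retained.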
Model Development
Logistic regression, naive Bayes, support vector machine (SVM), and random forest classifiers were trained on the processed data. Model performance was evaluated using confusion matrix metrics: accuracy, sensitivity (recall), specificity, and precision, together with the area under the ROC curve (AUC). The evaluation was carried out using Python and relevant libraries such as scikit-learn, and these metrics were computed for each model to assess its effectiveness. Additionally, feature importance analysis was used to identify the top five most influential parameters in predicting obesity.
Jurnal Manajemen Pelayanan Kesehatan, Vol. 29, Special Edition, February 2026
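To make the evaluation metrics named above concrete, the following standard-library sketch shows how each is derived from the four confusion-matrix counts, and how AUC can be computed as the share of (obese, non-obese) pairs that the model ranks correctly. The function names and the eight toy predictions are assumptions for illustration; they are not the study's data, and scikit-learn provides equivalent, more robust routines.

```python
def confusion_counts(y_true, y_pred, positive="Obesity"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics (assumes no denominator is zero)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # recall: share of obese cases detected
        "specificity": tn / (tn + fp),   # share of non-obese cases detected
        "precision": tp / (tp + fp),     # share of "Obesity" predictions that are right
    }


def auc(y_true, scores, positive="Obesity"):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: eight made-up cases with predicted probabilities of obesity.
y_true = ["Obesity"] * 4 + ["Non-Obesity"] * 4
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.6]
y_pred = ["Obesity" if s >= 0.5 else "Non-Obesity" for s in scores]

counts = confusion_counts(y_true, y_pred)
print(counts)              # (tp, fp, tn, fn)
print(metrics(*counts))
print(auc(y_true, scores))
```

Unlike the other four metrics, AUC does not depend on the 0.5 decision threshold, which is why it is reported separately from the confusion matrix.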
Ethical Considerations
As this study used open-access data with anonymized synthetic information, no ethical clearance was required.
Expected Outcomes
The study aims to deliver a validated binary classification model for obesity risk and identify actionable predictors.
The results are intended to support evidence-based public health strategies and preventative care initiatives.
Inclusion criteria: Participants were included if they had complete questionnaire data covering anthropometric, demographic, and lifestyle variables required for the analysis.
Exclusion criteria: Participants were excluded if any key variables were missing, inconsistent, or contained unrealistic values (e.g., implausible height or weight), or if their records were incomplete for the outcome classification.
RESULT
Table 1.
Baseline Characteristics of Participants by Obesity Status
Baseline characteristics | All (n = 2,111) | Non-Obese (n = 1,139) | Obese (n = 972)
Age | …9–26.… | …0–24.… | …7–27.…
Height | …6–1.… | …6–1.… | …6–1.…
Weight | …5–107.… | …0–80.… | …9–120.…
Gender | Male (…6%) | Male (…7%) | Male (…5%)
SMOKE | …9% | …1% | …7%
SCC | …5% | …8% | …7%
FAVC | …4% | …2% | …0%
family_history_with_overweight | …8% | …9% | …2%
FCVC | …0–3.… | …0–3.… | …0–3.…
NCP | …7–3.… | …3–3.… | …9–3.…
CH2O | …6–2.… | …5–2.… | …6–2.…
FAF | …1–1.… | …2–2.… | …1–1.…
TUE | …0–1.… | …0–1.… | …1–0.…
CALC | Sometimes (…4%) | Sometimes (…9%) | Sometimes (…0%)
CAEC | Sometimes (…6%) | Sometimes (…2%) | Sometimes (…1%)
MTRANS | Public_Transportation (…8%) | Public_Transportation (…1%) | Public_Transportation (…1%)
Source: Secondary Data
Table 1 summarizes the baseline characteristics of 2,111 participants, comparing demographic, anthropometric, and behavioral variables between obese (n = 972) and non-obese (n = 1,139) groups.
Table 2. Confusion Matrix for Each Model
Rows: Random Forest, Logistic Regression, Naive Bayes, SVM
Source: Secondary Data
Table 3. Evaluation Metrics for Each Model
Columns: Model, Accuracy (%), AUC (%), Sensitivity (%), Precision (%), Specificity (%)
Rows: Random Forest, Logistic Regression, Naive Bayes, SVM
Based on data from the Indonesian Family Life Survey, Wave 5
Table 4. Top 5 Predictive Features Based on Odds Ratio
Columns: Feature, Coefficient, Odds Ratio
Rows: remainder__Weight, cat__FAVC_yes, cat__family_history_with_overweight_yes, remainder__FAF, remainder__CH2O
Source: Secondary Data
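Table 4 pairs each feature's logistic regression coefficient with an odds ratio. The two are linked by OR = exp(coefficient), so a coefficient of 0 corresponds to OR = 1 (no association). The short sketch below shows the conversion in both directions; the function names are illustrative, and the three example odds ratios are the ones quoted in the Discussion for family history, physical activity, and water intake.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a feature: OR = exp(coefficient)."""
    return math.exp(coefficient)


def coefficient_from_or(or_value):
    """Inverse conversion: coefficient = ln(OR)."""
    return math.log(or_value)


# Odds ratios quoted in the Discussion; the coefficients are back-calculated here.
for name, or_value in [("family history", 1.44), ("FAF", 1.12), ("CH2O", 1.07)]:
    beta = coefficient_from_or(or_value)
    print(f"{name}: OR = {or_value:.2f}, coefficient = {beta:.4f}")
```

Because the odds ratio refers to a one-unit change, its size depends on the feature's scale: if a feature was standardized before fitting, the odds ratio is per standard deviation rather than per kilogram or per litre.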
Figure 1. ROC curve for ML models
Figure 2. Top 5 Most Influential Features for Obesity Prediction
DISCUSSION
The findings from this study reveal that the Random Forest (RF) algorithm delivered the best performance among the four evaluated machine learning models.
With an accuracy of 98.58% and an AUC of 99.96%, RF outperformed Logistic Regression, Naive Bayes, and SVM (Table 3, Figure 1). This superior performance can be attributed to several inherent advantages of RF: it is an ensemble method that reduces overfitting, handles both numerical and categorical data well, and can manage complex interactions among features without extensive preprocessing [8, 9]. One of the key strengths of RF lies in its ability to evaluate feature importance [10, 11].
The extremely high accuracy and AUC values reported in the study (exceeding 98%) raise concerns about potential overfitting, especially given the limited validation procedures presented. To enhance the robustness and credibility of the findings, it would be important to incorporate more rigorous evaluation procedures, such as k-fold cross-validation, hyperparameter tuning (e.g., grid search or randomized search), and additional performance metrics including log-loss.
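As an illustration of the k-fold cross-validation mentioned above, the sketch below generates stratified folds using only the standard library: every record is used for testing exactly once, and each fold keeps roughly the same obese/non-obese ratio. The function name, the choice of k = 5, and the toy labels are assumptions; in practice scikit-learn's `StratifiedKFold` serves the same purpose.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs with class proportions roughly preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's records round-robin so every fold gets a similar share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test_indices = sorted(folds[i])
        train_indices = sorted(
            index for f, fold in enumerate(folds) if f != i for index in fold
        )
        yield train_indices, test_indices


# Toy labels: 10 obese and 10 non-obese records (made up for illustration).
toy_labels = ["Obesity"] * 10 + ["Non-Obesity"] * 10
splits = list(stratified_k_fold(toy_labels, k=5))
for train_indices, test_indices in splits:
    obese_in_test = sum(toy_labels[i] == "Obesity" for i in test_indices)
    print(len(train_indices), len(test_indices), obese_in_test)
```

A model would be refitted on each training part and scored on the matching test part; the mean and spread of the k scores give a more reliable estimate than a single 80/20 split.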
These techniques are widely recommended in recent machine learning research to prevent model overfitting and ensure generalizability. Without these validation steps, the performance results may be overly optimistic and should be interpreted with caution [12].
In our study, the most influential predictors of obesity identified through logistic regression were weight, frequent consumption of high-caloric food, family history of overweight, physical activity frequency, and daily water intake.
Among these, weight showed an exceptionally high odds ratio (OR = 30,…), indicating a strong association with obesity. Individuals who frequently consumed high-calorie food had …7 times higher odds of being obese, while those with a family history of overweight had 1.44 times higher odds. Physical activity and water intake were also positively associated, though to a lesser extent, with ORs of 1.12 and 1.07, respectively.
Weight has a direct correlation with obesity, and numerous studies have established that an increase in body weight is one of the strongest indicators of obesity status [13].
The assessment of obesity risk factors in this study is constrained by the exclusive use of odds ratios, which provides only a limited understanding of how predictors contribute to the model. Interpretability techniques such as correlation analysis or model-agnostic feature-importance methods like SHAP would offer a more comprehensive and transparent explanation of how predictors influence the predictions, in line with recommendations in machine learning interpretability. Furthermore, variables such as weight, which are intrinsically tied to the definition of obesity, may introduce circular reasoning and artificially inflate model performance, a concern highlighted in recent discussions on data leakage and deterministic predictors in health-related machine learning models. For this reason, the inclusion of such variables should be reconsidered or clearly justified, and the manuscript would benefit from acknowledging their potential impact on validity and interpretability.
Family history (Yes):
A positive family history indicates genetic predisposition and shared lifestyle factors, both of which are well-known contributors to obesity risk [16, 17, 18].
Physical activity frequency:
Higher total physical activity volume and greater frequency/intensity were significantly associated with reduced obesity risk among Australian women over a 21-year period [19]. A significant association (p = 0.…) was found between daily physical activity levels and central obesity in adults in Gorontalo, Indonesia; those with lower activity levels had a higher risk of obesity [20].
Daily water intake:
Adults aged 19–39 who met adequate daily water intake levels had significantly lower odds of abdominal obesity compared to those with lower intake. However, the association was not significant in older age groups [21]. Overweight and obese women who drank more water daily had significantly lower BMI. Specifically, for every 1 mL of additional water consumed, BMI decreased by 0.… kg/m² [22].
Frequently consumed high-calorie food:
Final-year students with a frequent habit of consuming high-calorie foods had significantly higher odds of being obese. The study found strong associations between high intake of calories, fats, and carbohydrates from these foods and poor nutritional status [23]. Adolescents who frequently consumed junk food (high-calorie food) had a 2.87 times higher risk of obesity compared to those who consumed it rarely. The study also reported positive correlations between junk food intake, BMI, and waist circumference [24].
In addition to the previously noted issues, the manuscript would benefit from a clearer articulation of how the analytical choices, such as the selection of variables, recoding procedures, and model evaluation strategies, shape the interpretation of the findings. Providing the rationale for methodological decisions, including why certain preprocessing steps were chosen and how they may influence the results, would help readers better understand the study's internal logic and analytical approach. Moreover, emphasizing the rationale behind the comparison of different machine learning models and discussing the strengths and limitations of each approach within the study's context would add depth and enhance the overall clarity of the research.
Limitation: While the dataset used in this study is sourced from Kaggle and contains synthetic elements, it is important to clarify that not all components of the dataset are fully artificial.
The dataset represents a mixture of real questionnaire-based patterns that have been expanded or partially generated to increase sample size and variability.
Nevertheless, the presence of synthetic data means that certain distributions, correlations, or behavioral patterns may not accurately reflect real-world populations.
As a result, the generalizability of the findings remains limited, and conclusions regarding obesity risk factors should be interpreted with caution.
Future studies should validate these results using fully real, population-based datasets to ensure more robust external validity.
CONCLUSION
This study demonstrated the effectiveness of machine learning, particularly the Random Forest algorithm, in predicting obesity from questionnaire-based data. Among the four models evaluated, Random Forest achieved the highest performance, with an accuracy of 98.58% and an AUC of 99.96%. The most influential predictors identified were weight, height, age, family history of being overweight, and the number of main meals per day (NCP).
These findings highlight the model's potential practical value for supporting early identification of individuals at higher risk and guiding more targeted public health interventions or clinical decision-making, such as prioritizing lifestyle counseling or screening. At the same time, the noted limitations point to clear directions for future research, including validating the model in different population groups, integrating additional data sources (such as clinical measurements or longitudinal records), and evaluating the model's real-world impact when implemented in community or clinical settings to determine its effectiveness and feasibility.
ACKNOWLEDGEMENT
This work was supported by financial assistance from the Institute for Research and Community Service (LPPM), Universitas Muhammadiyah Gombong.
REFERENCE