Bulletin of Informatics and Data Science, Vol. 3, November 2024, Page 54-62
ISSN 2580-8389 (Media Online), DOI 10.61944/bids, https://ejurnal.id/index.php/bids/index

Hybrid Gradient Boosting and SMOTE-ENN for Toddler Nutritional Status Classification on Imbalanced Data

Alfry Aristo Jansen Sinlae1,*, Moh. Erkamim2, Farid Fitriyadi3, Lilik Suhery4, Rachmat Destriana5
1 Computer Science Study Program, Faculty of Engineering, Universitas Katolik Widya Mandira, Kupang, Indonesia
2 Smart City Information System Study Program, Universitas Tunas Pembangunan Surakarta, Surakarta, Indonesia
3 Informatics Study Program, Faculty of Science, Technology, and Health, Universitas Sahid Surakarta, Surakarta, Indonesia
4 Health Informatics Study Program, Faculty of Health and Science, Universitas Mercubaktijaya, Padang, Indonesia
5 Faculty of Business Management and Information Technology, Universiti Muhammadiyah Malaysia, Padang Besar, Malaysia
Email: 1,*alfry.aj@unwira.id, 2erkamim@lecture.id, 3faridfitriyadi@gmail.com, 4liliksuhery@gmail.com, 5p52400016@student.
Correspondence Author Email: alfry.aj@unwira.id

Abstract
Stunting in toddlers remains a serious global health issue with long-term impacts on children's physical and cognitive development. One of the main challenges in classifying nutritional status is class imbalance, where the number of normal cases significantly exceeds that of minority classes such as stunted and severely stunted. This study aims to develop a hybrid approach by integrating the Gradient Boosting algorithm with the SMOTE-ENN (Synthetic Minority Oversampling Technique with Edited Nearest Neighbors) method to improve classification performance on imbalanced data. The dataset used was obtained from the Kaggle platform, consisting of 121,000 entries with four nutritional status categories. Data preprocessing included label encoding, numerical feature standardization, and stratified data splitting with an 80:20 ratio.
The model was evaluated using accuracy, precision, recall, F1-score, and ROC-AUC. The proposed hybrid model successfully increased the recall for the stunted class from 61.80% to 98.41%, and the F1-score from 71.93% to 83.58%. Overall accuracy improved from 92.39% to 93.35%, while the ROC-AUC score increased from 99.08% to 99.63%. These findings demonstrate that integrating Gradient Boosting with SMOTE-ENN is effective in enhancing sensitivity to minority classes and improving overall multi-class classification performance.

Keywords: Stunting Classification, Imbalanced Data, Gradient Boosting, SMOTE-ENN, Toddler Nutrition

INTRODUCTION
Stunting, a condition of impaired growth in children caused by chronic malnutrition, remains a significant global health problem. According to the latest report by UNICEF, WHO, and the World Bank, approximately 148 million children under the age of five are affected by stunting worldwide, with the highest prevalence reported in South Asia and Sub-Saharan Africa. Stunting affects not only physical development but also cognitive abilities, increases susceptibility to illness, and reduces long-term productivity. Therefore, early detection of nutritional status, particularly in identifying stunting cases, is a crucial step in supporting data-driven health interventions.

Despite its importance, the process of classifying nutritional status presents several challenges. One of the main issues is the imbalance in class distribution, where the number of children with normal nutritional status significantly exceeds those who are classified as stunted or severely stunted. In addition, the quality of survey data in many developing countries is often compromised by the presence of outliers, missing values, and inconsistencies in field data collection. If not properly addressed, these conditions can lead to classification models that are biased toward the majority class and unable to detect important minority cases that require urgent attention.
With the advancement of artificial intelligence and data mining, many studies have explored machine learning approaches to tackle the challenges in stunting classification. For example, a study by Yunus et al. utilized the C4.5 algorithm and achieved an accuracy of 61.82 percent with an AUC of 0.584, but did not incorporate boosting or data balancing methods. Alita et al. employed a Decision Tree algorithm and achieved 83.26 percent accuracy, although they did not explicitly address the class imbalance issue. Another study by Azani et al. compared the performance of Naive Bayes Classifier (NBC) and Support Vector Machine (SVM), finding that SVM outperformed NBC with 91 percent accuracy compared to 80 percent. However, this study only handled binary classification (stunted and not stunted), which does not represent the complexity of multi-class classification problems. In contrast, Ma'muriyah et al. proposed the use of XGBoost combined with imputation and outlier detection techniques and achieved an accuracy of up to 95%. Nevertheless, their primary focus was on data quality enhancement rather than addressing class imbalance.

From the analysis of previous studies, it is evident that there is a gap in the integration of boosting algorithms with data balancing strategies for multi-class classification of toddler nutritional status. This research offers a different approach by combining Gradient Boosting with a data balancing method known as SMOTE-ENN (Synthetic Minority Oversampling Technique combined with Edited Nearest Neighbors) to address this challenge. Gradient Boosting is well regarded for its effectiveness in a wide range of classification tasks due to its capability to build a strong predictive model from a series of weak learners using a sequential learning process.
However, this algorithm often becomes biased toward the majority class when applied to imbalanced datasets because it minimizes overall prediction error without considering the distribution of class labels. To address this limitation, this study employs SMOTE-ENN, a resampling method that applies oversampling to minority classes and removes potentially noisy or ambiguous samples from the majority class using the Edited Nearest Neighbors technique. Unlike standard SMOTE, which only increases the number of minority class samples, SMOTE-ENN improves data distribution by simultaneously filtering out misleading examples, resulting in a cleaner and more balanced dataset. This technique is considered appropriate for stunting classification tasks, which naturally involve imbalanced class distributions and a high risk of misclassification.

Copyright © 2024 Authors. Page 54
This Journal is licensed under a Creative Commons Attribution 4.0 International License

Based on the identified problems, the objective of this study is to develop a hybrid classification model that integrates Gradient Boosting and SMOTE-ENN in order to improve the accuracy and sensitivity of nutritional status classification for toddlers in imbalanced datasets. This research contributes to the field by introducing a combined approach of boosting and resampling methods, which has rarely been implemented together in the context of multi-class nutritional status classification.

RESEARCH METHODOLOGY
1 Research Stages
This research follows a systematic framework designed to guide the researcher through the process of designing, implementing, and evaluating each stage of the study. A structured methodology is required to provide a clear workflow throughout the research process. The sequence of activities is illustrated in Figure 1.
Figure 1. Research Flow (Data Collection → Data Preprocessing → Application of Data Balancing Method → Building a Classification Model → Model Evaluation)

Based on Figure 1, the following provides a detailed explanation of each stage carried out in the study.

Data Collection
The data collection phase marks the initial stage of the study, aimed at gathering data relevant to the analytical needs. In this study, the dataset used is sourced from the Kaggle platform and titled "Stunting Toddler Detection". This dataset adopts a nutritional status classification approach based on z-score methods as recommended by the World Health Organization (WHO). The dataset primarily focuses on identifying stunting cases among children under five years old. In total, it contains 121,000 entries comprising three key features: age, gender, and height, along with a target label indicating the child's nutritional status. The nutritional status is categorized into four classes: stunted, severely stunted, normal, and tall.

Data Preprocessing
Preprocessing was carried out to prepare the dataset in accordance with machine learning model requirements. Categorical features such as Gender and Nutritional Status were converted into numerical format using label encoding, assigning values of 0 for female and 1 for male, and four numeric labels for nutritional status categories. Numerical features such as Age (in months) and Height (in cm) were standardized using the StandardScaler method to ensure a mean of zero and a standard deviation of one, thus preventing large-scale features from dominating the model and expediting the training process. Subsequently, the data was split into training and testing subsets in an 80:20 ratio using stratified sampling to maintain the proportional distribution of class labels in each subset.
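As a rough sketch of this preprocessing stage, the label encoding, standardization, and stratified split could be implemented with scikit-learn. The column names and the tiny stand-in DataFrame below are assumptions for illustration only, not the actual Kaggle file:

```python
# Sketch of the preprocessing stage described above.
# The column names and toy data are assumptions, not the real dataset.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "Gender": ["female", "male", "female", "male", "female", "male"] * 20,
    "Age": [12, 24, 36, 18, 30, 48] * 20,                   # months
    "Height": [70.0, 82.0, 90.0, 76.0, 86.0, 100.0] * 20,   # cm
    "Nutritional Status": ["normal", "stunted", "normal", "tall",
                           "severely stunted", "normal"] * 20,
})

# Label encoding: female -> 0, male -> 1; four numeric class labels.
gender = LabelEncoder().fit_transform(df["Gender"])
y = LabelEncoder().fit_transform(df["Nutritional Status"])

# Standardize the numeric features to mean 0 and standard deviation 1.
num = StandardScaler().fit_transform(df[["Age", "Height"]])
X = np.column_stack([gender, num])

# Stratified 80:20 split preserves the class proportions in each subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```

The `stratify=y` argument is what keeps the per-class proportions identical in the training and testing folds.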
This ratio was chosen as it represents a common practice in machine learning, providing sufficient training data to build a representative model and enough test data for objective performance evaluation. A bar chart visualization was used to depict the initial class distribution, illustrating the imbalance in sample counts across categories before any balancing method was applied.

Application of Data Balancing Method
To address class imbalance, this study applied the SMOTE-ENN (Synthetic Minority Oversampling Technique combined with Edited Nearest Neighbors) method. SMOTE generates synthetic samples for minority classes by interpolating between nearest neighbors, while ENN removes samples from the majority class that are considered noise or located near decision boundaries. SMOTE-ENN is considered more effective than standard SMOTE because it not only increases the number of minority class samples but also reduces potential classification errors caused by ambiguous majority class data. Applying SMOTE-ENN results in a more balanced class distribution, with increased representation of minority classes and reduced dominance of majority classes. This enhances the model's ability to learn fairly across all classes.

Building a Classification Model
The main classification model employed in this study is the Gradient Boosting Classifier, an ensemble algorithm based on decision trees that incrementally builds a strong model from a set of weak learners by iteratively minimizing the loss function. Each new model in the Gradient Boosting sequence is trained to correct the prediction errors made by the previous model, thereby progressively increasing accuracy.
This model was chosen due to its ability to capture nonlinear relationships among features, its flexibility with various data types, and its competitive performance in classification tasks. Furthermore, Gradient Boosting does not assume a particular data distribution and is robust when dealing with large datasets. However, because the algorithm optimizes for overall prediction error, it tends to be biased toward the majority class when applied to imbalanced data. To mitigate this limitation, the model is integrated with SMOTE-ENN as discussed in the previous stage to improve sensitivity to minority classes.

Model Evaluation
Evaluation was conducted to compare model performance on both imbalanced and balanced data (using SMOTE-ENN). The assessment began with the construction of a confusion matrix to display the number of correct and incorrect predictions for each class. From this matrix, key evaluation metrics were calculated, including accuracy, precision, recall, and F1-score per class. Precision measures the correctness of predictions for a specific class, recall measures the model's ability to identify actual instances of that class, and F1-score balances both, which is especially important in scenarios with class imbalance. Additionally, the macro-averaged ROC-AUC score was used to evaluate the overall performance of the model in multi-class classification tasks, calculated as the average area under the ROC curve for each binarized class.

2 SMOTE-ENN Oversampling Technique
SMOTE-ENN (Synthetic Minority Oversampling Technique combined with Edited Nearest Neighbors) is a hybrid data balancing method that combines both oversampling and undersampling strategies to address class imbalance in classification tasks. This technique is designed to not only improve the representation of minority classes but also enhance the quality of data distribution by removing noise from the majority class. It consists of two components: SMOTE and ENN.
SMOTE generates synthetic instances for the minority class by interpolating between an existing sample and one of its nearest neighbors. The synthetic sample x_new is calculated as shown in Equation (1):

x_new = x_i + λ (x_zi - x_i)    (1)

where x_i is a randomly selected minority class sample, x_zi is one of its k-nearest neighbors, and λ is a random number between 0 and 1. The second component, ENN (Edited Nearest Neighbors), serves as a cleaning mechanism. ENN removes any sample whose class label differs from the majority of its k nearest neighbors. This step aims to eliminate noisy or borderline samples, particularly from the majority class. By combining these two steps, SMOTE-ENN results in a cleaner, more balanced, and more stable dataset that enables the classification model to better identify minority classes without being misled by irregularities from the majority class.

3 Gradient Boosting Method
Gradient Boosting is an ensemble learning algorithm that combines several weak learners, typically decision trees, in a sequential manner to produce a strong predictive model. Introduced by Friedman, this method builds upon boosting techniques by applying gradient descent to minimize a loss function iteratively. Each successive model in the sequence is trained to correct the residual errors of the preceding model, thereby improving accuracy at each stage. In general, Gradient Boosting works by fitting the model F(x) to the target y through the minimization of a loss function L(y, F(x)) using gradient descent. At the m-th iteration, the model is updated using Equation (2):

F_m(x) = F_{m-1}(x) + ν_m h_m(x)    (2)

where F_{m-1}(x) is the model from the previous iteration, h_m(x) is the new weak learner (such as a small decision tree) trained on the residuals, and ν_m is the learning rate or weight coefficient derived from minimizing the loss function. The residuals r_im, or gradient directions, are calculated using Equation (3):

r_im = - [ ∂L(y_i, F(x_i)) / ∂F(x_i) ]_{F = F_{m-1}}    (3)

where r_im is the residual value or gradient direction for the i-th sample at the m-th iteration, L(y_i, F(x_i)) is the loss function (for example, log-loss for classification or MSE for regression), y_i is the actual target value, F(x_i) is the model's prediction for input x_i, and F_{m-1}(x) is the output of the model before the m-th iteration. Gradient Boosting is able to capture complex nonlinear relationships in the data and adapt to various types of features without requiring special transformations. Its strengths lie in its flexibility, high generalization ability, and competitive performance across a wide range of classification and regression tasks.

RESULT AND DISCUSSION
The development of a nutritional status classification model for toddlers using the hybrid approach of Gradient Boosting and SMOTE-ENN began with the preparation of a relevant dataset. The dataset used was obtained from the Kaggle platform under the title "Stunting Toddler Detection". In total, the dataset consists of approximately 121,000 entries containing three primary features, namely age, gender, and height, along with one target label representing the nutritional status of toddlers. Nutritional status is categorized into four classes: stunted, severely stunted, normal, and tall. Once the dataset was ready, the next step involved data preprocessing to ensure the dataset met the requirements of machine learning algorithms and to guarantee optimal input quality.
The preprocessing stage started with the application of label encoding to categorical features such as gender and nutritional status, converting them into numerical form. For instance, gender was encoded as 0 for female and 1 for male, while nutritional status was mapped into four numerical classes according to their respective categories. Numerical features such as age (in months) and height (in centimeters) were standardized using the StandardScaler method, which transforms the values to have a mean of zero and a standard deviation of one. This step is crucial to accelerate the training process and prevent dominance of larger-scaled features in model learning. The distribution of numerical features before and after standardization is shown in Figure 2.

Figure 2. Distribution of Numerical Features Before and After Standardization: (a) Age Before, (b) Height Before, (c) Age After, and (d) Height After Standardization Using StandardScaler

The figure illustrates the distribution of age and height before and after standardization. Prior to standardization, age exhibits a uniform distribution due to monthly data recording, whereas height displays a near-normal distribution. After standardization, both features share a common scale while retaining their original distribution patterns. This ensures fair contribution of each feature in the training process. The next step was to analyze the class distribution of the target variable to determine whether the data was balanced. This analysis is essential to understand the proportion of each nutritional status category, as imbalance can influence model bias and overall performance. It also serves as a foundation for selecting appropriate data balancing techniques such as oversampling or hybrid resampling. The class distribution for the target variable is shown in Figure 3.
Figure 3. Distribution of Nutritional Status Classes in the Dataset

Figure 3 shows that the "Normal" class dominates the dataset with 56 percent, while the "Severely Stunted," "Tall," and "Stunted" classes are significantly underrepresented. This imbalance underscores the importance of applying data balancing methods to ensure fair and accurate model training. To illustrate the manual calculation of SMOTE-ENN, a simplified example is used with only two classes, Normal and Stunted, along with two numerical features, age and height. The initial sample is shown in Table 1.

Table 1. Initial Data Sample
Age (months)   Height (cm)   Nutritional Status
…              …             Normal
…              …             Normal
…              …             Normal
…              …             Stunted
…              …             Stunted

Table 1 shows 5 samples with a class distribution of 3 data points labeled as Normal and 2 data points labeled as Stunted. This indicates that the data is imbalanced. To address this case study, the first step is to use SMOTE to generate synthetic data for the Stunted class through interpolation between nearest neighbors, where x_i is a randomly selected Stunted sample, x_zi is one of its nearest Stunted neighbors, and λ = 0.6 (a random value between 0 and 1). Substituting these values into Equation (1), x_new = x_i + 0.6 (x_zi - x_i), yields a new synthetic sample labeled Stunted. After the SMOTE process generates this synthetic sample, the next step is to apply the Edited Nearest Neighbors (ENN) method. ENN is implemented after oversampling to filter out majority class instances that are considered inconsistent based on their local neighborhood. Specifically, ENN removes a sample from the majority class if the majority of its nearest neighbors belong to the minority class. For example, consider sample ID 2, which belongs to the Normal class and has three nearest neighbors: ID 3 (Normal), ID 5 (Stunted), and ID 6 (Stunted, which is a synthetic sample generated by SMOTE).
Since two out of the three neighbors belong to the Stunted class, ID 2 is identified as ambiguous and is removed by the ENN procedure. This process helps to reduce noise around the decision boundary between classes and improves the overall quality of the training dataset. The final result after applying the combined SMOTE and ENN technique is presented in Table 2.

Table 2. Data Samples After Oversampling Using SMOTE-ENN
Age (months)   Height (cm)   Nutritional Status
…              …             Normal
…              …             Normal
…              …             Stunted
…              …             Stunted
…              …             Stunted

The table shows that ENN successfully removed ambiguous majority-class instances while reinforcing the representation of the minority class. This produces a cleaner and more balanced dataset ready for classification model training. The updated class distribution after applying SMOTE-ENN is visualized in Figure 4.

Figure 4. Distribution of Target Data After Oversampling Using SMOTE-ENN

Figure 4 shows that all four nutritional status categories, namely Severely Stunted, Tall, Stunted, and Normal, are now nearly equally represented, with each class comprising approximately 25 percent of the dataset. This demonstrates that SMOTE-ENN was effective in balancing the data by adding synthetic samples to the minority classes and eliminating noisy instances from the majority class located near the decision boundaries. A balanced dataset enables the classification model to learn more equitably without favoring any particular class. The next stage was to build a classification model using the Gradient Boosting algorithm for both training and testing. The dataset was split into training and testing sets using an 80 to 20 ratio with stratified sampling to preserve class proportions in both subsets.
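The ENN removal rule walked through above (drop a sample when the majority of its k nearest neighbors carry a different label) can be expressed as a small didactic sketch. This is an illustrative re-implementation, not imbalanced-learn's EditedNearestNeighbours, and the sample points below are invented:

```python
# Didactic sketch of the ENN editing rule: a sample is removed when the
# majority of its k nearest neighbours have a different class label.
import numpy as np
from collections import Counter

def enn_filter(X, y, k=3):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = []
    for i in range(len(X)):
        # Euclidean distance to every other point; pick the k closest.
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        neighbours = y[np.argsort(d)[:k]]
        majority = Counter(neighbours).most_common(1)[0][0]
        if majority == y[i]:
            keep.append(i)
    return X[keep], y[keep]

# Invented (age, height) points: one "Normal" point sits inside the
# "Stunted" cluster and should be edited out, as in the worked example.
X = [[40, 92], [41, 93], [42, 94], [6, 65], [7, 66], [8, 64], [9, 90]]
y = ["Normal", "Normal", "Normal", "Stunted", "Stunted", "Stunted", "Normal"]
X_clean, y_clean = enn_filter(X, y, k=3)
```

Here the isolated Normal point at (9, 90) has three Stunted nearest neighbors and is removed, while both clusters survive intact.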
The Gradient Boosting model was then trained using the training data to identify patterns between features and target classes. Gradient Boosting is an ensemble learning technique that builds a strong predictive model by combining multiple weak learners, typically shallow decision trees. It operates in a forward stage-wise manner, where each subsequent tree is constructed to minimize the errors made by the previously built model. Specifically, the algorithm fits each new tree to the negative gradient of the loss function, which corresponds to the residual errors from the prior iteration. Through this iterative process, the model progressively improves its predictions by focusing more on the samples that were previously misclassified. The process continues until either the prediction error reaches a minimal value or the maximum number of iterations is completed. Model performance was evaluated using several metrics including the confusion matrix, classification report, and ROC-AUC score to assess its capability in multi-class classification. The confusion matrices and ROC curves for both scenarios (without oversampling and with SMOTE-ENN) are shown in Figure 5.

Figure 5. (a) Confusion Matrix Without Oversampling, (b) ROC Curve Without Oversampling, (c) Confusion Matrix With SMOTE-ENN, and (d) ROC Curve With SMOTE-ENN

Figure 5 illustrates that SMOTE-ENN improved the model's ability to classify minority classes such as Stunted and Severely Stunted. The confusion matrix in Figure 5(a) shows predictions mostly concentrated in the majority classes (Normal and Tall), while in Figure 5(c), the predictions are more evenly distributed. Additionally, the ROC curve in Figure 5(d) shows an improvement in AUC scores, particularly for the Stunted class, whose AUC rose from its baseline of 0.97, reflecting improved model sensitivity. Based on the confusion matrix and ROC results, classification metrics and ROC-AUC scores were calculated for both models. The full comparison is presented in Table 3.

Table 3. Comparison of Performance Metrics Between Gradient Boosting and the Proposed Hybrid Model (Gradient Boosting + SMOTE-ENN)
Model                            Class              Precision   Recall    F1-Score
Gradient Boosting                Normal             …           …         …
                                 Severely Stunted   …           …         …
                                 Stunted            …           61.80%    71.93%
                                 Tall               …           …         …
                                 Accuracy: 92.39%   ROC-AUC Score: 99.08%
Gradient Boosting + SMOTE-ENN    Normal             …           …         …
                                 Severely Stunted   …           …         …
                                 Stunted            …           98.41%    83.58%
                                 Tall               …           …         …
                                 Accuracy: 93.35%   ROC-AUC Score: 99.63%

Table 3 shows that the hybrid model significantly improves performance, especially for the Stunted class, where recall increased from 61.80 percent to 98.41 percent and F1-score from 71.93 percent to 83.58 percent. This demonstrates that SMOTE-ENN enhances the model's sensitivity to underrepresented classes. The main advantage of the hybrid model is its effectiveness in addressing class imbalance. SMOTE-ENN not only increases minority-class representation but also eliminates noisy samples from the majority class, producing a cleaner and more balanced dataset. This contributes to improved overall accuracy from 92.39 percent to 93.35 percent and an increased ROC-AUC score from 99.08 percent to 99.63 percent, indicating better multi-class classification performance. Nevertheless, the hybrid model shows a decline in precision for the Stunted class, falling to around 72 percent. This could be due to the generation of overly similar synthetic samples, which may lead to false positives. To improve this, future research can explore advanced post-oversampling filtering techniques or use alternative boosting algorithms such as LightGBM or CatBoost with class weight adjustments. Intensive hyperparameter tuning may also help improve overall model performance.
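The evaluation protocol behind Table 3, per-class precision, recall, and F1 plus a macro-averaged one-vs-rest ROC-AUC, can be sketched as follows. The data is synthetic and the model settings are illustrative defaults, not the configuration used in this study:

```python
# Sketch of the evaluation protocol: confusion matrix, per-class report,
# and macro one-vs-rest ROC-AUC. Synthetic data; illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=4, n_informative=5,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=1)
clf = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Counts of correct/incorrect predictions per class.
cm = confusion_matrix(y_te, y_pred)

# Per-class precision, recall, and F1, as reported in Table 3.
report = classification_report(y_te, y_pred, output_dict=True)

# Macro-averaged ROC-AUC over the one-vs-rest binarized classes.
auc = roc_auc_score(y_te, clf.predict_proba(X_te),
                    multi_class="ovr", average="macro")
```

The `multi_class="ovr"` option binarizes each class against the rest and averages the resulting AUCs, matching the macro-averaged ROC-AUC described in the evaluation stage.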
CONCLUSION
This study successfully developed a hybrid model by combining Gradient Boosting with the SMOTE-ENN data balancing technique, which proved to be effective in enhancing the classification performance of toddler nutritional status on imbalanced data. The proposed hybrid model significantly improved the recall for the Stunted class from 61.80% to 98.41%, and increased the F1-score from 71.93% to 83.58%, indicating greater sensitivity toward minority classes. Additionally, the overall accuracy of the model increased from 92.39% to 93.35%, accompanied by an improvement in ROC-AUC score from 99.08% to 99.63%. These findings demonstrate that the integration of SMOTE-ENN not only improves data distribution but also enhances the model's ability to classify all categories more fairly. However, the decrease in precision for the minority class suggests a potential increase in false positives due to the introduction of synthetic data. Therefore, future research is encouraged to explore alternative method combinations such as class weighting, adaptive undersampling, or the use of other boosting algorithms like LightGBM or CatBoost with more intensive hyperparameter optimization to achieve more stable and optimal results.

REFERENCES