SINERGI Vol.
No.
October 2025: 779-792 http://publikasi.
id/index.
php/sinergi http://doi.
org/10.
22441/sinergi.
Optimizing intrusion detection with data balancing and feature selection techniques
Zulhipni Reno Saputra Elsi1*, Ahmad Affandi Supli2, Jimmie1, Muhammad Ghozi Al-Faris1, David Agustianto Rapel1
1 Department of Information Technology, Faculty of Engineering, Muhammadiyah University of Palembang, Indonesia
2 Digital Media Technology Department, Xiamen University Malaysia, Malaysia
Abstract
The rapid growth of IoT devices has brought significant security challenges, particularly in detecting various types of attacks within heterogeneous network environments.
This study explores the effectiveness of data balancing techniques, including Random Under Sampling (RUS), Cost-Sensitive Learning (CSL), Synthetic Minority Oversampling Technique (SMOTE), and Randomized Combination Sampling (RCS).
Feature selection methods, namely correlation (threshold 0.8) and mutual information (top 15 features), were employed to optimize feature sets.
The Decision Tree (DT) and Linear Discriminant Analysis (LDA) classifiers were used to evaluate the performance of balanced datasets.
The evaluation metrics included accuracy, precision, recall, F1-score, G-mean, and ROC curves.
The results revealed that SMOTE and RCS outperformed other balancing methods, with SMOTE achieving the highest accuracy .
7%) and RCS demonstrating robust G-mean values across both feature selection techniques.
DT consistently showed better performance compared to LDA across all metrics, while feature selection significantly improved the classification results, particularly under mutual information criteria.
However, the analysis highlighted limitations of LDA in handling imbalanced datasets and high-dimensional features.
This study concludes that a combination of advanced data balancing and effective feature selection significantly enhances the accuracy of intrusion detection in IoT networks.
Future work will focus on integrating real-time detection systems and exploring hybrid models to further improve the detection of complex attacks in dynamic IoT environments.
Keywords: Correlation; Imbalance; IoT; Load Balancing; Mutual information; RT-IOT22.
Article History:
Received: December 6, 2024
Revised: March 24, 2025
Accepted: April 10, 2025
Published: September 5, 2025
Corresponding Author:
Zulhipni Reno Saputra Elsi, Department of Information Technology, Muhammadiyah University of Palembang, Indonesia
Email: zulhipni_renosaputra@umpalembang.
This is an open-access article under the CC BY-SA license.
INTRODUCTION
The development of Internet of Things (IoT) technology has brought great benefits to various aspects of life, including the industrial sector, smart homes, and transportation.
However, the increasing use of IoT devices also expands the potential for cybersecurity attacks.
As attacks on IoT devices increase, a reliable intrusion detection system (IDS) is needed to protect IoT networks.
To meet this need, IDS based on Machine Learning (ML) are increasingly being used in IoT network security.
An ML-based IDS requires proper dataset management, especially in dealing with dataset imbalance, that is, an unbalanced data distribution between the normal class and the attack class.
This imbalance often arises from the dynamic nature of data collection in IoT networks and from the distribution of data in the real world.
Datasets such as IoT-23 and IoTID20 are examples of unbalanced IoT datasets, where the attack class is smaller than the normal class.
This imbalance can lead to bias in ML models, which tend to favor majority class predictions and thus neglect the detection of less frequent attacks.
In addition, the data generated by IoT devices is dynamic and continuous, which further increases the complexity of anomaly detection.
Several previous studies have identified that imbalanced datasets in IoT networks pose a major challenge to the effectiveness of ML-based IDS.
Approaches such as the Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), Random Under Sampling (RUS), ensemble methods, and ML methods have been proposed to address this issue.
Researchers showed that the combination of SMOTE and undersampling techniques successfully improved the accuracy to 81% on the IoT-23 dataset.
Researchers reported excellent results using a combination of deep learning and data balancing techniques on IoTID20, with an AUC reaching 99.
However, the implementation of these techniques also has drawbacks, such as the risk of overfitting on synthetic data or of removing important information in undersampling.
In addition, research emphasizes the importance of handling the dynamic nature of IoT data to improve detection accuracy.
Managing IoT data integration requires solutions that not only improve model accuracy but also consider computational efficiency and resilience to real-time data.
This research aims to address these challenges by exploring various data balancing techniques, namely RUS, SMOTE, Cost-Sensitive Learning (CSL), and Random Combination Sampling (RCS).
This balancing is expected to reduce bias towards the majority class, improve accuracy on the minority class, and produce a more reliable IDS for IoT.
This research also provides an in-depth evaluation using metrics such as accuracy, precision, recall, F1 Score, and G-Mean to assess model performance on the highly imbalanced RT_IOT2022 dataset.
This research contributes to improving IoT network security through the processing of imbalanced datasets in several respects:
1. Data balancing strategy development, using techniques such as RUS, SMOTE, CSL, and RCS.
2. Optimization of the machine learning model for IoT, using mutual information-based feature selection (MIFS) and correlation-based feature selection (CFS), and performing classification with Decision Tree (DT) and Linear Discriminant Analysis (LDA).
3. Evaluation with metrics such as accuracy, precision, recall, F1 Score, and G-Mean, so that the evaluation of model performance is more representative of the needs of attack detection on imbalanced IoT data.
This research proposes an optimized intrusion detection framework for IoT networks by integrating feature selection methods with hybrid sampling techniques and lightweight classifiers, evaluated on protocol-specific datasets to address data imbalance and computational constraints in real-world scenarios.
RELATED WORK
Recent studies have addressed challenges in intrusion detection systems (IDS) for IoT networks and data imbalance in machine learning.
This section reviews key works, focusing on their methods, contributions, and implications for IDS and other ML applications.
Researchers proposed an automated myocardial infarction detection system using CNN and a hybrid CNN-LSTM with a SMOTE-Tomek Link approach to handle imbalanced data. Their study showed that data balancing improved the model accuracy up to 89%, which is relevant for clinical applications.
This underscores the importance of data balancing techniques in healthcare and other domains facing class imbalance issues.
Researchers proposed an IoT-specific IDS using ensemble methods such as Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LGBM) on the imbalanced DS2OS dataset.
Their LGB-IDS model achieved 92% accuracy, excelling in speed and threat detection, showing strong potential for real-world IoT IDS applications.
Researchers analyzed the impact of class imbalance on the performance of machine learning-based IDS using KNN, Gradient Boosting, and SVM algorithms on the BoT-IoT dataset.
By applying SMOTE and random undersampling, they reported an improvement in the F1 score, highlighting the importance of balancing techniques in improving the reliability of IDS in IoT networks.
Researchers addressed the challenge of class imbalance in IDS datasets, which often reduces detection performance for rare attacks.
Karatas used the CSE-CIC-IDS2018 dataset and applied SMOTE with six ML algorithms to improve detection rates.
Researchers evaluated ML models with various resampling strategies using F1-score and G-mean, demonstrating that proper integration enhances IDS robustness and accuracy in identifying minority class intrusions within imbalanced network traffic.
Researchers also developed a Collaborative Intrusion Detection System (CIDS) using a Weighted Ensemble Averaging Deep Neural Network (WEA-DNN).
This system achieves high accuracy and adaptability in detecting coordinated cyberattacks in heterogeneous networks, demonstrating the effectiveness of collaborative approaches in handling complex attack patterns.
Research on handling data imbalance in IDS has been growing rapidly, with various methods proposed to improve the accuracy and reliability of classification models, which have been summarized in Table 1.
Table 1. Summary of Approaches and Research Results on Imbalanced Data
Columns: Ref; Methodology; Dataset; Measurement Indicators; Key Results

Methodology
  Resampling techniques: Random Oversampling, Random Undersampling, SMOTE, and Adaptive Synthetic Sampling.
  A combination of a Deep Neural Network (DNN) with a Bagging Classifier approach. Further experiments using CNN and hybrid CNN-LSTM.
  Cluster-SMOTE K-Means algorithm for preprocessing and Two-Layer CNN for classification.
  Hybrid feature selection (filter). Two-level IDS (normal/attack, then attack type). SMOTE for class imbalance. Algorithms: Decision Tree, Random Forest, GNB, KNN.
  SMOTE, Gaussian Distribution, SVM, RF methods.
  RO, DT, RF, and SVM.
  SMOTE, ADASYN, and XGBoost.
  Federated Learning (FL), SMOTE, ADASYN, and Generative Adversarial Networks (GANs).
  Feature engineering with mRMR, SMOTE, CatBoost classifier, Optuna for hyperparameter optimization. Tested on binary and multi-class classification.
  SMOTE, ADASYN, and BoostedEnML.

Dataset
  KDD99, UNSWNB15, UNSWNB17, UNSWNB18
  NSL-KDD, KDDCUP99, UNSW-NB15, BoT-IoT
  UNSW-NB15, CICIDS2017
  BoT-IoT, TONIoT, CICDDoS2019
  Specific IoT
  IoT dataset
  TON_IoT and DS2OS IoT
  NSL-KDD, UNSW-NB15, CICIDS-2017
  CSE-CICIDS2018 and CIC-IDS2017
  MQTT-IOTIDS2020

Measurement Indicators
  Macro Precision: 98%; Macro Recall: 96%; Macro F1-Score: 97%
  Accuracy: 99.; Precision: 99.; Recall: 99.; F1-Score: 99.
  Accuracy: 98.; Recall: 98.; Precision: 98.; F1-Score: 98.; AUC: 99.
  Accuracy: 99.82-100%; Precision: 98.; Recall: 98.56-100%; F1-Score: 98.; Detection Time: 0.
  Accuracy: 98.; Precision: 96.; Recall: 95.; F1-Score: 96.
  Accuracy: 97.; Precision: 94.; Recall: 92.; F1-Score: 93.; TPR: 92.; FPR: 5.
  Accuracy: 95.; Precision: 93.; Recall: 92.; F1-Score: 92.
  F1 score: up to 0.; Precision: up to 0.; Recall: up to 0.; Accuracy: up to 95%
  Accuracy: 98.; Precision: 97.; Recall: 97.; F1-Score: 97.
  Precision: 1.; Recall: 1.; F1 score: 1.; AUC: 1.

Key Results
  Oversampling improves Macro Precision and Macro Recall, especially on minority classes. Resampling helps detect more minority data but increases training time.
  The DNN model with bagging produces high accuracy with a low False Positive Rate. The combination of CNN-LSTM is more effective on IoT datasets such as BoT-IoT.
  CSK-CNN provides the highest AUC and F1-Score, demonstrating the model's ability to handle imbalanced data with high accuracy and generalization on both datasets.
  Decision Tree achieved the highest accuracy and lowest detection time, outperforming other algorithms and prior works.
  Significant improvement in model performance when using oversampling techniques.
  RO is able to improve model performance with a more balanced data distribution.
  Oversampling techniques have been shown to be helpful in increasing the sensitivity of the model to minority attacks.
  Data augmentation improves performance by up to 22.9% in detecting anomalies compared to the baseline without data augmentation.
  Optimized CatBoost with mRMR and SMOTE consistently outperformed baseline methods across all datasets.
  BoostedEnML with SMOTE/ADASYN achieves 100% accuracy on multiclass classification on IDS datasets with reduced False Positives and False Negatives.
Table 1 summarizes the approaches that have been applied, including machine learning methods, deep learning, and data preprocessing techniques such as oversampling and undersampling.
Each approach is evaluated using various performance metrics on different benchmark datasets, demonstrating its effectiveness in handling imbalanced data.
Although efforts have been made to address data imbalance and optimize model architecture in machine learning-based IDS, there is still a lack of specific approaches for IoT communication protocols such as MQTT.
Most studies still focus on traditional datasets such as NSL-KDD, BoT-IoT, and CICIDS2017, without considering the specific characteristics of IoT. In addition, the balancing methods used are generally limited to oversampling and ensemble methods, while real-time adaptation and federated learning approaches are still rarely explored. Therefore, more comprehensive research is needed to develop more effective and adaptive IDS for the IoT ecosystem.
METHOD
This section describes the methods used in data processing and the process for generating the IDS model.

Raw Dataset
The RT_IOT2022 dataset is obtained from real-time IoT infrastructures, namely ThingSpeak-LED, Wipro-Bulb, and MQTT-Temp, and then processed to extract useful features for attack detection. The data consists of 85 features with 12 classes.
Table 2 presents the attack types included in the RT_IOT2022 dataset along with the number of recorded packets for each type.
The dataset encompasses a variety of attack techniques in IoT environments.

Table 2. Attack types in the RT_IOT2022 dataset
Columns: Attack_type; Packets
  DOS_SYN_Hping
  Thing_Speak
  ARP_poisioning
  MQTT_Publish
  NMAP_UDP_SCAN
  NMAP_XMAS_TREE_SCAN
  NMAP_OS_DETECTION
  NMAP_TCP_scan
  DDOS_Slowloris
  Wipro_bulb
  Metasploit_Brute_Force_SSH
  NMAP_FIN_SCAN

Proposed Model
Machine learning-based Intrusion Detection Systems (IDS) for IoT networks consist of several main stages, namely data ingestion, storage, feature engineering, model training, and retraining.
These stages form a continuous learning cycle to improve the accuracy of threat detection in the IoT ecosystem.
The proposed architecture is divided into several processes, as shown in Figure 1.
Figure 1 illustrates the process flow in applying machine learning techniques for network intrusion detection, which is divided into the following stages:
1. Data Preprocessing: This stage consists of two main sub-stages:
   Preparation: Includes data cleaning, data labeling, and data normalization to prepare the data before it is used in model training.
   Balancing: Uses various data balancing techniques, such as RUS, SMOTE, CSL, and RCS, to address class imbalance in the data.
2. Feature Selection: Relevant features are selected using MIFS and CFS to ensure that only the most informative features are used in the model.
3. Feature Classification: A classification model is applied using algorithms such as DT and LDA to classify data based on the selected features.
4. Evaluation Indicator: The results of the classification model are evaluated using several performance indicators, including Accuracy, Precision, Recall, F1 Score, and G-Mean, to assess the effectiveness of intrusion detection.

Figure 1. Machine learning architecture of our proposed model (Data Preprocessing: Preparation with data cleaning, data labelling, and data normalization; Balancing with Original, Random Under Sampling, Cost Sensitive Learning, Random Combination Sampling, and SMOTE. Feature Selection: Mutual Information, Correlation. Feature Classification: Decision Tree, Linear Discriminant Analysis. Evaluation Indicator: Accuracy, Precision, Recall, F1 Score, G-Mean, ROC.)

Balancing
Balancing in the context of machine learning refers to techniques for dealing with imbalanced datasets, where one class has a much larger number of samples than the other classes.
This imbalance can affect model performance because the algorithm tends to prioritize predictions for the majority class and ignore the minority class, which is often the more important one to analyze.
The degree of balance in the data distribution has a significant impact on model performance, especially in classification problems.
If the dataset is highly imbalanced, the model tends to be biased towards the majority class, which leads to misclassification of the minority class and compromises the performance of standard learning algorithms.
In many cases, an imbalanced dataset occurs when one class is much smaller than the other classes.
This imbalance can result in high accuracy even though the model is not able to detect the minority class well, and the minority class may be the more important one in a particular application.
In addition to data imbalance, the performance of machine learning-based IDS in IoT is also influenced by several other factors. One is real-time processing: an IDS must be able to detect threats instantly without high latency, so Edge Computing and Federated Learning-based approaches can be used to accelerate detection without having to send all data to a central server.
Another factor is scalability: because IoT networks have a very large number of devices, the IDS model must be able to handle growth in the number of devices without a decrease in performance.
In addition, IoT devices often have limited computing power and memory, so an IDS needs to use lightweight models, such as DT or AB-based ensemble learning, to increase efficiency.
This study uses the following data balancing techniques:
1. RUS randomly reduces the number of samples in the majority class so that it is comparable to the minority class.
2. SMOTE is a popular oversampling technique in which synthetic samples of the minority class are created by interpolating between existing samples.
3. CSL does not change the data distribution but adapts the learning algorithm by giving greater weight to prediction errors on the minority class.
4. RCS combines RUS and SMOTE: it balances the dataset by removing some of the majority class samples while adding synthetic samples to the minority class.

Feature Selection
This study uses two feature selection methods, MIFS and CFS.
MIFS selects the 15 features with the highest MI scores, while CFS selects features based on their correlation values. MIFS yields a different set of 15 features for each of the original, RUS, SMOTE, CSL, and RCS data.
CFS yields a different number of features for each: 53 for the original data, 58 for RUS, 64 for CSL, 61 for SMOTE, and 51 for RCS.
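The two selection rules described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: it assumes discrete (or already discretized) feature values for the MI score, and it assumes that CFS means dropping any feature whose absolute Pearson correlation with an already kept feature exceeds the threshold. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log(p(x,y) / (p(x) p(y)))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def top_k_by_mi(columns, labels, k=15):
    """Names of the k features with the highest MI score against the label."""
    scores = {name: mutual_information(col, labels)
              for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def pearson(a, b):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va and vb else 0.0

def drop_correlated(columns, threshold=0.8):
    """Keep a feature only if |r| <= threshold with every feature kept so far
    (assumed reading of the correlation criterion)."""
    kept = []
    for name, col in columns.items():
        if all(abs(pearson(col, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: f2 duplicates the information in f1.
columns = {
    "f1": [0, 0, 1, 1, 0, 1],
    "f2": [0, 0, 2, 2, 0, 2],
    "f3": [1, 0, 1, 0, 1, 0],
}
labels = [0, 0, 1, 1, 0, 1]
print(top_k_by_mi(columns, labels, k=2))
print(drop_correlated(columns, threshold=0.8))
```

A correlation filter of this kind keeps a variable number of features, which is consistent with CFS producing between 51 and 64 features depending on the balanced dataset, whereas a top-k rule always returns exactly k.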
Classification
Classification is an important step in the workflow that aims to build an ML or DL model able to predict the class of data based on the previously selected features.
This study uses two methods, DT and LDA.
DT is one of the most widely used models due to its simplicity and high interpretability.
It works by building a tree from the dataset, where each node represents a feature, each branch represents a feature value, and each leaf represents a class or final result.
LDA is a statistical classification method that seeks a linear projection of the data that maximizes the separation between classes.
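For reference, the projection sought by LDA is commonly written as the maximizer of the Fisher criterion. The notation below is the standard textbook form, not taken from this paper: $S_B$ and $S_W$ are the between-class and within-class scatter matrices, $\mu_c$ and $N_c$ are the mean and size of class $c$, and $\mu$ is the overall mean.

```latex
W^{*} = \arg\max_{W} \frac{\left| W^{\top} S_B W \right|}{\left| W^{\top} S_W W \right|},
\qquad
S_B = \sum_{c} N_c \, (\mu_c - \mu)(\mu_c - \mu)^{\top},
\qquad
S_W = \sum_{c} \sum_{x \in c} (x - \mu_c)(x - \mu_c)^{\top}
```

Because $S_B$ and $S_W$ are sums over samples, classes with many samples dominate both matrices, which is one plausible reason for LDA's sensitivity to class imbalance.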
RESULTS AND DISCUSSION
In this study, we implemented several techniques to handle data imbalance and improve the performance of the attack detection model. We compared the original data and four data balancing techniques, namely RUS, CSL, SMOTE, and RCS.
Each technique was followed by two feature selection methods, namely CFS with a threshold of 0.8 and MIFS to select the best 15 features.
After the features were selected, we used two different classification techniques, namely DT and LDA.
Table 3 shows the distribution of attack data before and after balancing.
Table 3. Distribution of RT_IOT2022 dataset before and after balancing
Columns: Attack_type; Original; RUS; CSL; SMOTE; RCS
  DOS_SYN_Hping
  Thing_Speak
  ARP_poisioning
  MQTT_Publish
  NMAP_UDP_SCAN
  NMAP_XMAS_TREE_SCAN
  NMAP_OS_DETECTION
  NMAP_TCP_scan
  DDOS_Slowloris
  Wipro_bulb
  Metasploit_Brute_Force_SSH
  NMAP_FIN_SCAN

Figure 2. Confusion Matrix MIFS (training and testing panels for DT_Ori, DT_RUS, DT_CSL, DT_Smote, DT_RCS, LDA_Ori, LDA_RUS, LDA_CSL, LDA_Smote, and LDA_RCS)

Figure 3. Confusion Matrix CFS (training and testing panels for DT_Ori, DT_RUS, DT_CSL, DT_Smote, DT_RCS, LDA_Ori, LDA_RUS, LDA_CSL, LDA_Smote, and LDA_RCS)

Table 3 shows the amount of data for each attack type (Attack_type) under each data balancing setting: Original (without balancing), RUS, CSL, SMOTE, and RCS.
SMOTE is best suited to ensure a uniform data distribution, while RCS provides more flexibility in determining the amount of data.
RUS is effective in creating a balanced data distribution, but risks reducing important information.
CSL is a safe choice because it does not modify the original data but only modifies the training approach.
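The trade-offs between the four techniques are easier to see in code. The sketch below is a simplified illustration under stated assumptions, not the authors' implementation: the oversampler interpolates between random same-class pairs (real SMOTE interpolates towards one of the k nearest neighbours), RCS is read as under-sampling followed by oversampling to a common per-class target, and CSL is represented by the common "balanced" weighting n / (k * count). All function names and the `target` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict

def group_by_class(X, y):
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return groups

def random_under_sample(X, y, target, rng):
    """RUS: randomly keep at most `target` rows per class (discards data)."""
    Xb, yb = [], []
    for label, rows in group_by_class(X, y).items():
        kept = rng.sample(rows, min(target, len(rows)))
        Xb += kept
        yb += [label] * len(kept)
    return Xb, yb

def smote_like_over_sample(X, y, target, rng):
    """Simplified SMOTE: add rows interpolated between two random rows of
    the same class until each class has at least `target` rows."""
    Xb, yb = list(X), list(y)
    for label, rows in group_by_class(X, y).items():
        for _ in range(max(0, target - len(rows))):
            a, b = rng.choice(rows), rng.choice(rows)
            t = rng.random()
            Xb.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
            yb.append(label)
    return Xb, yb

def random_combination_sample(X, y, target, rng):
    """RCS (assumed reading): shrink large classes, then grow small ones,
    so every class ends with exactly `target` rows."""
    Xu, yu = random_under_sample(X, y, target, rng)
    return smote_like_over_sample(Xu, yu, target, rng)

def balanced_class_weights(y):
    """CSL: keep the data unchanged and weight errors inversely to frequency."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Hypothetical toy data: 9 majority rows, 3 minority rows.
rng = random.Random(0)
X = [[float(i), float(i % 3)] for i in range(12)]
y = [0] * 9 + [1] * 3
print(Counter(random_under_sample(X, y, 3, rng)[1]))
print(Counter(smote_like_over_sample(X, y, 9, rng)[1]))
print(Counter(random_combination_sample(X, y, 6, rng)[1]))
print(balanced_class_weights(y))
```

The `target` argument is where the flexibility of RCS shows: RUS is bounded by the smallest class and SMOTE by the largest, while RCS can settle on any size in between.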
This study produces confusion matrices that are used to calculate various performance metrics, such as Accuracy, Precision, Recall, F1-Score, and G-Mean, as well as ROC curves, which visualize the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR).
Figure 2 and Figure 3 illustrate the classification results with the DT and LDA classifiers. Figure 2 shows the confusion matrices for the MIFS results, while Figure 3 shows those for the CFS results.
Each confusion matrix reports results on the training data and the testing data for the DT and LDA classifiers.
The Precision, Recall, and F1-Score values derived from the confusion matrices are displayed in Table 4, Table 5, Figure 4, and Figure 5.
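As a reminder of how these values follow from a confusion matrix, the sketch below derives accuracy and per-class precision, recall, and F1 for a multi-class matrix. It is illustrative only: the paper does not state how per-class values are averaged, so the macro (unweighted) average used here is an assumption, and the example matrix is made up.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j.
    Returns (accuracy, [(precision, recall, f1) for each class])."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                         # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(k)) - tp    # predicted c, true class otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    accuracy = sum(cm[c][c] for c in range(k)) / total
    return accuracy, metrics

def macro_average(metrics):
    """Unweighted mean of precision, recall, and F1 over classes (assumed)."""
    k = len(metrics)
    return tuple(sum(m[i] for m in metrics) / k for i in range(3))

# Hypothetical two-class confusion matrix.
cm = [[8, 2],
      [1, 9]]
accuracy, metrics = per_class_metrics(cm)
print(accuracy, macro_average(metrics))
```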
Table 4 compares the performance of the DT and LDA models on data balanced with various methods (Original, RUS, CSL, SMOTE, RCS) using the evaluation metrics Precision, Recall, and F1 Score.
The DT model consistently outperforms LDA across all balancing methods and evaluation metrics, on both training and testing data.
The RUS and CSL balancing methods produce perfect precision and recall on training data for DT, but performance decreases on testing data. SMOTE is the best balancing method on testing data, providing the highest precision, recall, and F1 Score for DT, while RCS also performs well but remains slightly below SMOTE on testing data.
Performance on training data tends to be higher than on testing data.
This indicates that some methods, such as RUS, may cause the model to overfit the training data because the reduced data is overly simple.
Table 5 compares the performance of the DT and LDA models with correlation-based feature selection at a threshold of 0.8.
Overall, the DT model consistently outperforms LDA in terms of precision, recall, and F1 score, on both training and testing data.
LDA shows its best performance with the RUS method compared to the other methods.
RUS provides perfect performance for DT on training data, but its generalization to testing data is poor.
SMOTE is the best method on testing data, producing the highest precision, recall, and F1 score for DT, which indicates better generalization ability, while RCS has results almost comparable to SMOTE but still slightly lower. All methods show a decrease in performance from training to testing data, especially with LDA.
This indicates that LDA is more susceptible to generalization challenges than DT.
Figure 4 compares the accuracy of the DT and LDA classification models for the two feature selection methods, MIFS and CFS.
DT outperforms LDA across all balancing techniques and feature selection approaches, with consistently higher accuracy.
CFS is more effective than MIFS, especially for SMOTE and RCS, producing near-perfect accuracy on testing data.
Among the balancing techniques, SMOTE and RCS provide the best results under both feature selection methods, demonstrating their ability to improve the distribution of the minority class without sacrificing model performance. RUS is less effective, especially on testing data, where accuracy decreases drastically for both models, indicating poor generalization. CSL does not provide a significant improvement over the original data for either DT or LDA.
Table 4. Performance Comparison with MIFS
Columns: Classification; Training (Precision, Recall, F1 Score); Testing (Precision, Recall, F1 Score)
Rows: DT_Ori, LDA_Ori, DT_RUS, LDA_RUS, DT_CSL, LDA_CSL, DT_Smote, LDA_Smote, DT_RCS, LDA_RCS

Table 5. Performance Comparison with CFS
Columns: Classification; Training (Precision, Recall, F1 Score); Testing (Precision, Recall, F1 Score)
Rows: DT_Ori, LDA_Ori, DT_RUS, LDA_RUS, DT_CSL, LDA_CSL, DT_Smote, LDA_Smote, DT_RCS, LDA_RCS

Figure 5 presents the evaluation results of the classification model's performance based on the G-Mean, which reflects the balance between recall and specificity.
G-Mean is particularly important for imbalanced datasets, as it provides an overview of the model's ability to handle both majority and minority classes simultaneously.
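A short sketch makes the contrast with accuracy concrete. The binary form below follows the description above (geometric mean of recall and specificity). For the multi-class case the paper does not give a formula, so the geometric mean of per-class recalls is assumed here, which is a common generalization; the confusion matrix is a made-up example.

```python
import math

def g_mean_binary(tp, fn, tn, fp):
    """Geometric mean of recall (TPR) and specificity (TNR)."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(recall * specificity)

def g_mean_multiclass(cm):
    """Assumed multi-class form: geometric mean of per-class recalls.
    cm[i][j] = number of samples of true class i predicted as class j."""
    recalls = [row[i] / sum(row) if sum(row) else 0.0
               for i, row in enumerate(cm)]
    return math.prod(recalls) ** (1 / len(recalls))

# Hypothetical matrix: the rare third class is never predicted correctly.
cm = [[98, 2, 0],
      [5, 95, 0],
      [10, 0, 0]]
accuracy = sum(cm[i][i] for i in range(3)) / sum(map(sum, cm))
print(accuracy)               # high, driven by the two large classes
print(g_mean_multiclass(cm))  # zero, because one class has zero recall
```

Under this definition a single class with zero recall forces the G-Mean to zero regardless of accuracy, which would explain how a model can pair high accuracy with a very low G-Mean.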
DT is superior to LDA, with a higher G-Mean value across all balancing techniques and feature selection methods. CFS is more effective than MIFS in improving G-Mean, especially for DT with SMOTE and RCS.
The SMOTE and RCS balancing techniques provide the best results for DT, with almost perfect G-Mean, while LDA fails to produce adequate G-Mean values, especially with MIFS, although there is a slight increase with CFS.
Balancing with RUS is ineffective, especially with LDA, where G-Mean remains zero in all scenarios.
Table 6 presents a comparative analysis of accuracy and G-Mean across various classifier methods used in intrusion detection.
The comparison includes previously proposed methods and the newly developed models.
In the proposed model, the use of DT and LDA with various balancing techniques showed mixed results.
Several DT variants, such as DT_Ori_CFS, DT_RUS_MI, DT_RUS_CFS, DT_CSL_CFS, and DT_Smote_CFS, achieved 100% accuracy, indicating that the model is very good at recognizing patterns in the data.
However, despite the high accuracy, the G-Mean of some models, such as DT_Ori_MI, was only 21%, indicating that the model does not classify all classes equally well.
Meanwhile, the LDA method performed much worse, with some variants, such as LDA_RCS_MI and LDA_RCS_CFS, having a G-Mean of 0, meaning the model failed to recognize at least one class at all.
Overall, although some models have high accuracy, a low G-Mean indicates that the model is less effective in handling data imbalance. The best models are those that balance high accuracy and G-Mean, such as DT_RUS_CFS and DT_Smote_CFS, which achieve 100% accuracy and a G-Mean close to 100%.
This shows that the Decision Tree method combined with SMOTE balancing and CFS feature selection is a more reliable choice than other methods, especially for IDS applications in IoT Smart Home, where precision in detecting attacks from various classes is very important.
Figure 4. Comparison of Accuracy values

Figure 5. Comparison of G-Mean values

Table 6. Comparison of Accuracy and G-Mean
Columns: Ref.; Classifier Method; Accuracy; G-Mean
Previous methods: DL ensemble mLSTM, SMOTE, ADASYN, SVM-SMOTE, Borderline1SMOTE, Borderline2SMOTE
Proposed Model: DT_Ori_MI, DT_Ori_CFS, DT_RUS_MI, DT_RUS_CFS, DT_CSL_MI, DT_CSL_CFS, DT_Smote_MI, DT_Smote_CFS, DT_RCS_MI, DT_RCS_CFS, LDA_Ori_MI, LDA_Ori_CFS, LDA_RUS_MI, LDA_RUS_CFS, LDA_CSL_MI, LDA_CSL_CFS, LDA_Smote_MI, LDA_Smote_CFS, LDA_RCS_MI, LDA_RCS_CFS
CONCLUSION AND FUTURE WORKS
This study evaluates the impact of various data balancing techniques (RUS, CSL, SMOTE, RCS), combined with feature selection methods (MIFS, CFS) and classification algorithms (DT, LDA).
This study concludes:
1. DT consistently outperformed LDA across all balancing methods, achieving higher accuracy, G-Mean, and other performance metrics.
2. CFS proved more effective than Mutual Information, especially when combined with SMOTE and RCS balancing. These combinations resulted in nearly perfect G-Mean and accuracy, indicating excellent handling of imbalanced data.
3. Among balancing techniques, SMOTE and RCS showed the best performance, particularly for DT, as they effectively addressed class imbalance while maintaining generalization to testing data.
4. RUS was the least effective balancing method, often leading to poor generalization and significant performance drops, especially with LDA.
5. LDA demonstrated limitations in handling imbalanced datasets, failing to produce meaningful G-Mean and accuracy, even with advanced balancing techniques.
Based on the results of this study, several directions for future work could further improve the effectiveness of intrusion detection systems for IoT networks:
1. Combining advanced balancing techniques with more sophisticated oversampling and undersampling methods, such as Adaptive Synthetic Sampling (ADASYN) or generative adversarial networks (GANs) for synthetic data generation.
2. Performing dimensionality reduction with additional feature selection or extraction methods, such as Principal Component Analysis (PCA) or autoencoders, to improve model performance and reduce computational overhead.
REFERENCES