Science and Technology Indonesia
e-ISSN: 2580-4391, p-ISSN: 2580-4405
Vol. 10, October 2025
Research Paper

Decision Tree Algorithms in Water Quality Classification: A Comparative Study of Random Forest, XGBoost, and C5.0

Dewi Asiah Shofiana1*, Melan Caniadi1, Ridho Sholehurrohman2, Aristoteles1
1 Department of Computer Science, Faculty of Mathematics and Natural Science, Universitas Lampung, Lampung, 35145, Indonesia
2 Department of Mathematics, Faculty of Mathematics and Natural Science, Universitas Lampung, Lampung, 35145, Indonesia
*Corresponding author: dewi.asiah@fmipa.
Abstract
Safe drinking water is more than a convenience; public health officials often call it a cornerstone of survival. The United Nations International Children's Emergency Fund (UNICEF) reported that roughly two billion people still drink water that is neither clean nor tested. Pathogenic bacteria from human feces and livestock waste taint roughly 70% of available sources, creating a silent threat. Water quality is commonly expressed on five levels (poor, marginal, fair, good, and excellent) using the Water Quality Index (WQI) designed by the Canadian Council of Ministers of the Environment (CCME). This research measured the performance of three decision-tree classifiers, Random Forest, XGBoost, and C5.0, in predicting water quality. The preprocessing pipeline involved label encoding, the Synthetic Minority Over-sampling Technique (SMOTE) for balancing imbalanced classes, and an exploratory phase to examine outliers and irregularities within the dataset. According to the findings, Random Forest achieved the best test result with 98% accuracy. XGBoost and C5.0 followed close behind at about 96%, but C5.0 turned out to be the fastest, edging out both XGBoost and Random Forest, which makes it preferable when a time-sensitive or emergency decision is required. In short, this research highlights the importance of modern preprocessing tools combined with machine learning algorithms in monitoring water quality.
Keywords: Water Quality, Decision Tree, Random Forest, Extreme Gradient Boosting, C5.0
Received: 30 January 2025, Accepted: 13 June 2025
https://doi.org/10.26554/sti.
INTRODUCTION
Access to clean water and reliable sewage disposal ranks among the oldest yet most urgent of human rights, especially in densely populated settings (Filho et al.
, 2.
The sixth Sustainable Development Goal (SDG 6)
codifies that expectation by requiring that both services be available and managed in an environmentally sustainable manner (Pradhan et al.
, 2.
In practice, however, the availability of such services falls far short.
United Nations Children's Fund (UNICEF) data lay bare the shortfall:
2 billion people still drink unprocessed water, 4.
billion rely on sanitation facilities that fail basic safety tests, and 3 billion households lack a simple place to wash their hands with soap.
The sheer numbers spell out an urgent maintenance and upgrade task for pipes, latrines, and hygiene stations across continents (UNICEF, 2.
Experts warn that a 70 percent decline in water quality worldwide has been driven chiefly by fecal contamination, with pathogens such as E. coli, Shigella, Vibrio cholerae, and Salmonella in the mix.
Most observers agree that rising population density puts the heaviest strain on these fragile systems, with automatic knock-on effects for public health.
Outbreaks and chronic illness are not accidental; they follow directly from this mismatch between growth and infrastructure (Holcomb and Stewart, 2.
Over the past decade, researchers have increasingly turned to machine learning to tackle environmental challenges such as water management, issues once thought intractable because of data scarcity and analytical limitations.
A machine-learning system can be understood as a type of computer program that executes tasks autonomously after ingesting and processing large amounts of historical data via statistical models.
Typically, the accuracy of its forecasts improves with continued access to new data. At the core of the process, data mining extracts relevant information from a multitude of databases, providing the critical insights for the next actions to be undertaken (Shen.
Yuan et al.
, 2.
Machine learning is appearing in environmental management at almost every stage of fieldwork today.
Recent studies show the method being used for climate modelling (Eyring et al., 2.), for real-time checks of water quality (Zhu et al.), and even for the early spotting of urban pollution (Xu et al.).
By integrating machine learning, organizations can improve their proactive responses to environmental challenges, leading to the development of sustainable solutions and data-driven decision-making practices.
In practice the result usually firms up analytical workflows while steering policy toward more robust and sustainable outcomes (Rolnick et al.
, 2022.
Pansara et al.
, 2.
A number of studies have addressed this problem. In 2022, Nasir et al. classified water quality using multiple algorithms, among which CatBoost performed best with an accuracy above 94%. More recent studies modelled water quality parameters using KNN, Naive Bayes, CatBoost, ID3, and Random Forest. However, their accuracies remain around 90% (Ilić et al., 2022; Yogeshwari et al.; Mutoffar et al., 2.
The current investigation analyzes water quality data using three decision tree algorithms: Random Forest, Extreme Gradient Boosting (XGBoost), and C5.0. Unlike most previous studies, this research incorporates several preprocessing techniques at the dataset's initial stage. These include removing uninformative attributes and creating a new attribute called Quality, a categorical label derived from the overall water quality index value. In addition, EDA is conducted with descriptive statistics and visual tools to examine data structure and patterns and to identify outliers and anomalies. The study also applies SMOTE because of the class imbalance in the dataset. Data splitting is performed with two techniques, the hold-out method and stratified k-fold cross-validation, with the aim of determining the performance of the algorithms under each technique. This approach ensures that sufficient analysis is conducted and enhances the reliability of the classification results.

EXPERIMENTAL SECTION
The research proceeds through a sequenced series of stages, each designed to scrutinize how decision-tree algorithms perform under various conditions.
A preliminary literature survey offers a conceptual grounding and identifies gaps that the present work intends to fill.
Once the theoretical framework is in place, a dataset is gathered from a publicly accessible repository.
Data cleaning and feature engineering then take center stage, as missing values are imputed, outliers trimmed, and categorical variables encoded in order to produce a tidy and coherent dataset.
After preprocessing, the data are partitioned in two distinct ways: a simple hold-out split for baseline checks and a stratified k-fold arrangement that ensures balanced class representation in every fold.
Model building follows, with three decision-tree variants, Random Forest, XGBoost, and C5.0, fitted and tuned on the training data.
Model performance is gauged using the classical confusion matrix and further distilled into single numbers such as accuracy, precision, and recall.
Side-by-side comparisons reveal where each variant excels or falters.
Figure 1 diagrams the entire pipeline from acquisition through evaluation and visually reinforces the step-wise logic of this study.

© 2025 The Authors. Science and Technology Indonesia, 10, 999-1011
Figure 1. Research Workflow

2.1 Data Collection
This research utilized a dataset acquired from Kaggle (https://w. com/datasets/hailla/wqi-parameter-scores-1994-2.), provided in CSV format, consisting of 13 attributes and a total of 971 instances.
The data were collected by the Washington State Department of Ecology's River and Stream Monitoring Program, covering 62 rivers and streams in the United States from 1995 to 2014.
2.2 Data Preprocessing
Before data mining, data must undergo preliminary processing, in which information is extracted and transformed into a comprehensible format suitable for analysis (García et al., 2.
By addressing missing values and discrepancies, this process improves data quality and sets the stage for more precise and thorough data analysis (García et al.

2.2.1 Drop Unnecessary Attributes
Successful data mining depends fundamentally on careful attribute selection.
Removing attributes that do not contribute trims away noise and clutter, which helps prevent overfitting and increases accuracy.
Beyond attribute selection, these actions improve the overall data sanity (quality, reliability, and consistency).
During this phase, the following 6 attributes were dropped: station, station name, year, address, plus code, and location1.
The remaining features are carried forward to the next phase; Table 1 describes each attribute.
2.2.2 Rename Attributes
Certain features were renamed to enhance clarity and eliminate potential ambiguity.
Renaming the columns improves their interpretability, allowing them to be used directly in subsequent stages of the analysis without needing further modification.

2.2.3 New Columns (Labelling)
In this study, the water quality index is based on the Canadian Council of Ministers of the Environment (CCME) Water Quality Index (WQI).
The calculated WQI values were then classified into categories under the "Quality" attribute.
The range of categories for CCME WQI is shown in Table 2 (Gikas et al., 2.
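The labelling step can be illustrated with a short sketch. It maps a WQI score to the CCME categories listed in Table 2. The function name `quality_label` and the treatment of non-integer scores (a score belongs to the highest band whose lower bound it reaches) are assumptions made for illustration, not details taken from the study.

```python
# CCME WQI bands from Table 2, as (lower bound, label), highest band first.
CCME_BANDS = [
    (95, "Excellent"),
    (80, "Good"),
    (65, "Fair"),
    (45, "Marginal"),
    (0, "Poor"),
]


def quality_label(wqi: float) -> str:
    """Return the CCME category for a WQI score in the range 0 to 100.

    Assumption: fractional scores fall into the highest band whose lower
    bound they reach, so 94.5 is labelled "Good".
    """
    if not 0 <= wqi <= 100:
        raise ValueError(f"WQI must lie in [0, 100], got {wqi}")
    for lower, label in CCME_BANDS:
        if wqi >= lower:
            return label
    raise AssertionError("unreachable: the last band starts at 0")


if __name__ == "__main__":
    for score in (97, 80, 72.3, 45, 10):
        print(score, quality_label(score))
```

Keeping the bands in a single list makes it easy to check the thresholds against Table 2 at a glance.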
2.2.4 Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) provides a systematic approach for revealing hidden structures, correlations, and irregularities within a dataset.
Anomalies, such as stray outliers, frequently emerge during this inspection and serve as early warning signals of possible data-quality problems.
EDA also probes the interdependence of variables and sketches the overall shape of the dataset long before any formal modelling takes place. Its toolkit is eclectic, combining summary statistics, histograms, scatter plots, and correlation matrices to visualize how values are distributed and how they vary across columns.
Intuitive graphics serve a second purpose: they nudge researchers toward fresh questions that might merit further study (Majumder et al., 2022; Komorowski et al., 2.
Sample outputs from this phase are collected in Figures 2 and 3.

The data visualizations in Figure 2 permit a preliminary examination of regional water quality and reveal several pronounced fluctuations around the statistical mean.
Dissolved oxygen concentrations cluster around a moderate baseline, implying the habitat remains broadly tenable for resident biota.
pH readings drift toward the alkaline end of the scale, a tendency that hints at chemical inputs capable of neutralizing natural acidity.
Nitrogen and phosphorus readings are predominantly low, yet isolated spikes suggest episodic contributions from runoff linked to row-crop fertilization or light-industrial discharge.
The temperature distribution shows that many of the samples are warmer, which might reflect seasonal changes or thermal pollution.

The correlation matrix in Figure 3 reveals several robust interrelationships among the water-quality variables.
Most striking is the strong inverse association between the composite measure of water quality and key pollutants: fecal matter, sediment, nitrogen, phosphorus, and turbidity.
As concentrations of these contaminants increase, the overall quality rating declines.
Such declines are probably exacerbated by agricultural runoff and episodic soil erosion that release phosphorus and suspended particles simultaneously.
A second, much weaker relationship appears between temperature and dissolved oxygen, implying that elevated temperatures may impair oxygen availability for aquatic organisms.
Collectively, the data point to the necessity of vigilant monitoring and active management of these parameters to restore and sustain acceptable water quality.
2.2.5 Synthetic Minority Over-Sampling Technique (SMOTE)
Rather than creating simple copies of existing examples, SMOTE tackles class imbalance by generating new synthetic examples for the minority class.
Increasing the number of minority-class instances creates a better balance among the classes, which tends to improve precision, recall, and F1 measures for the minority class.
Compared with plain duplication, SMOTE also reduces the risk of overfitting and improves generalization by biasing the classifier towards the minority class, resulting in better performance on new data (Pradipta et al.
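The interpolation idea behind SMOTE can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used in the study: the function name `smote`, the default `k=5`, and the fixed seed are assumptions, and a production pipeline would normally rely on a maintained library instead.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from minority-class feature vectors.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    minority_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote(minority_class, n_new=3, k=2):
        print(point)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies, which is what distinguishes SMOTE from simple duplication.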
2.3 Data Splitting
2.3.1 Hold Out
Splitting a dataset can be accomplished through the hold-out method, which assigns separate portions for training and testing (Ghazvini et al., 2.
For the current investigation, 90% of the total observations were reserved for model development, while the remaining 10% were kept apart for final testing.

2.3.2 Stratified K-Fold Cross Validation (SKCV)
SKCV is a model validation technique that partitions data into training and testing sets.
Like ordinary k-fold cross-validation, it divides the data into folds, but every fold preserves the class proportions of the full dataset, which makes it better suited to imbalanced classes than plain k-fold cross-validation (Prusty et al., 2.
A ten-fold (k = 10) design was used in the present study, with 90% of samples used in every training run and 10% set aside for testing.
Such an arrangement ensures that every subset reflects the overall class distribution, thereby fostering evaluations that are more robust and widely applicable.
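A stratified split of this kind can be sketched as follows. The function name `stratified_k_fold` and the round-robin assignment are illustrative assumptions; the study itself does not state which implementation was used.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs with class-balanced folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, test


if __name__ == "__main__":
    y = ["Good"] * 50 + ["Fair"] * 20 + ["Poor"] * 10
    for train_idx, test_idx in stratified_k_fold(y, k=10):
        print(len(train_idx), len(test_idx))
```

With k = 10, each iteration trains on roughly 90% of the samples and tests on the remaining 10%, matching the proportions described above.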
Table 1. Information of Selected Attributes
Attribute Name | Information
WQI FC | Fecal Index. Fecal coliforms are organisms that originate from the digestive tract and waste of humans and animals.
WQI Oxy | Oxygen Index. The amount of oxygen dissolved in water comes from photosynthesis and oxygen diffusion from the air.
WQI pH | pH Index. pH is the degree of acidity of a solution. A pH value below 7 indicates the solution is acidic, above 7 indicates the solution is alkaline, and a value of 7 is considered neutral.
WQI TSS | Total suspended sediment (TSS) Index. TSS is the total mass of solid particles suspended in the water, without taking into account dissolved or sinking particles. TSS includes dust, soil, mud, organic debris, and other particles suspended in water.
WQI Temp | Temperature Index. It is a measure of the hot or cold intensity of water, affecting the quality and the sustainability of aquatic ecosystems.
WQI TPN | Nitrogen Index. Nitrogen is a compound that comes from agricultural, industrial, and domestic waste.
WQI TP | Phosphorus Index. Phosphorus is a compound that comes from agricultural fertilizer, domestic waste, or surface water flow.
WQI Turb | Turbidity Index. Turbidity is a measure of the extent to which solid particles are dispersed and dissolved in water.
Figure 2. Numerical Feature Distribution of the Dataset

Figure 3. Correlation Matrix

Table 2. CCME WQI Categories
Category Range | Quality
95-100 | Excellent
80-94 | Good
65-79 | Fair
45-64 | Marginal
0-44 | Poor

Table 3. Dataset Comparison After SMOTE
Class | Total Before | Total After

2.4 Classification with Decision Tree
2.4.1 Random Forest
Random Forest is a classification approach based on an ensemble of several decision trees, each built from a sample of the training data. In this method, a random subset of attributes, denoted as F, is chosen to determine how to split each node of a decision tree (Parmar et al., 2. With these features, Random Forest has been shown to outperform other learning methods because it has lower error rates and higher classification performance, and it can work with large volumes of training data and even incomplete information (Primajaya and Sari, 2. Nurdin et al. showed that Random Forest Regression is better than Support Vector Regression in predicting vehicle fuel consumption, while Fitriyana et al. showed that Support Vector Machine is better at classifying glycosylation in lysine protein sequences compared to Random Forest.

Random Forest uses a technique called bootstrap sampling, which creates multiple datasets by randomly selecting samples from the original dataset with replacement. Each decision tree in the forest is trained on a different bootstrap sample. If the original dataset has N samples, a new bootstrap sample $S_b$ of size N is generated by randomly selecting samples from the dataset, with replacement, as formulated in Equation 1, with each $x_i$ drawn from the dataset with replacement. This means some data points might appear multiple times in the same bootstrap sample, while others might not appear at all.

$$ S_b = \{x_1, x_2, \ldots, x_N\} \qquad (1) $$

This algorithm also employs the Gini index, defined in Equation 2, which quantifies the impurity of a split with respect to the class composition in a given branch of the tree. Because of its dependability and consistency with complex datasets, this approach has become prevalent in areas such as medical diagnosis, financial forecasting, and recommendation system development (Dikananda et al., 2. Here $p_i$ is the probability (relative frequency) of the i-th class in the dataset, and C is the number of classes. The Gini index measures the impurity of a node, and a lower value indicates a more "pure" node, meaning it is more likely to contain data points from a single class. Therefore, choosing the feature with the lowest Gini index for the root node (and subsequent nodes) helps create a more reliable decision tree (Pavlov, 2.

$$ \mathrm{Gini}(S) = 1 - \sum_{i=1}^{C} (p_i)^2 \qquad (2) $$

In addition to the Gini index, decision tree construction can also use entropy as a measure of impurity. Entropy quantifies the uncertainty within a dataset and helps determine how well an attribute can separate the data into distinct classes. The entropy value can be calculated using Equation 3, where S represents the dataset.

$$ \mathrm{Entropy}(S) = -\sum_{i=1}^{C} p_i \log_2 (p_i) \qquad (3) $$

In classification tasks, Random Forest applies majority voting to predict the class label. Each tree in the forest predicts a class, and the final prediction is the class that appears most often across all trees, as shown in Equation 4, where T is the number of trees and $\hat{y}_i$ is the prediction of tree i.

$$ \hat{y} = \mathrm{mode}(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T) \qquad (4) $$

Random Forest also offers Out-of-Bag (OOB) error estimation: for each training sample, the model uses only the trees whose bootstrap sample did not include that sample to test the prediction. This gives an unbiased estimate of the model's accuracy without needing a separate validation set. The formula is given in Equation 5, where $N_{OOB}$ is the number of out-of-bag samples and $I(y_i \neq \hat{y}_i)$ is an indicator function that equals 1 if the true value $y_i$ differs from the predicted value $\hat{y}_i$ (Pavlov, 2.

$$ \mathrm{OOB}_{error} = \frac{1}{N_{OOB}} \sum_{i \in OOB} I(y_i \neq \hat{y}_i) \qquad (5) $$
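The building blocks of Random Forest described in Equations 1 to 4 can be illustrated with a small standard-library sketch. The function names are chosen for illustration only, and the snippet does not train any trees; it only shows how the impurity measures, the bootstrap draw, and the vote behave.

```python
import math
import random
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels (Equation 2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits of a list of class labels (Equation 3)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement (Equation 1)."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Most frequent prediction across trees (Equation 4)."""
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    node = ["Good", "Good", "Fair", "Fair"]
    print(gini(node), entropy(node))

    rng = random.Random(42)
    data = list(range(10))
    sample = bootstrap_sample(data, rng)
    out_of_bag = [x for x in data if x not in sample]
    print(sample, out_of_bag)

    print(majority_vote(["Good", "Fair", "Good"]))
```

The `out_of_bag` list in the usage example shows which items a given tree never saw; these are the samples that tree may score when the OOB error in Equation 5 is estimated.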
2.4.2 XGBoost
XGBoost uses an ensemble approach that improves upon the traditional gradient tree boosting algorithm, which increases its effectiveness for large-scale machine learning tasks. XGBoost performs well with complex models because of built-in features that enhance computing speed and minimize overfitting (Chen and Guestrin, 2.

Like most machine learning systems, the first step for XGBoost is to define an objective function. For multi-class classification, the objective function combines a loss function and a regularization term. The training loss measures the prediction error and has to be minimized. To control overfitting, the complexity of the model is penalized by the regularization term. This balance of error minimization and complexity control helps XGBoost achieve robust models that generalize well. The structure of this function is given in Equation 6, where $L(y_i, \hat{y}_i)$ is the loss function (written as $(y_i - \hat{y}_i)^2$ in the squared-error case, with the multi-class form given in Equation 7), $\Omega(f_k)$ is the regularization term that penalizes tree complexity, T is the total number of trees, and N is the number of training samples (He, 2.

$$ O(f) = \sum_{i=1}^{N} L(y_i, \hat{y}_i) + \sum_{k=1}^{T} \Omega(f_k) \qquad (6) $$

The loss function used for multi-class classification is the softmax loss, which generalizes the binary log loss to multiple classes, as formulated in Equation 7, where K is the number of classes, $\mathbb{1}(y_i = k)$ is an indicator function that equals 1 if the true class of sample i is class k and 0 otherwise, and $\hat{p}_{ik} = e^{f_k(x_i)} / \sum_{j=1}^{K} e^{f_j(x_i)}$ is the predicted probability of class k for sample i, given the model output $f_k(x_i)$.

$$ L(y_i, \hat{y}_i) = -\sum_{k=1}^{K} \mathbb{1}(y_i = k) \log \hat{p}_{ik} \qquad (7) $$

The regularization term $\Omega(f_k)$ is designed to penalize overly complex trees. It is computed for each decision tree in the model as in Equation 8, where $\gamma$ controls the number of leaves in the tree and $\lambda$ is the L2 regularization coefficient, which penalizes large weights $w_j$ in the tree's leaf nodes. In this equation, T counts the leaves of tree $f_k$.

$$ \Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \qquad (8) $$

XGBoost refines predictions iteratively, incorporating new information at each step, denoted as $\hat{y}_i^{(t)}$ (XGBoost Developers, 2., as defined in Equation 9. The symbol $\hat{y}_i^{(t)}$ represents the prediction at the t-th iteration for the i-th data point, with $f_k(x_i)$ denoting the predictor function produced by the k-th model at the k-th iteration, and t the total number of iterations.

$$ \hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) \qquad (9) $$

XGBoost also uses gradient boosting to minimize the loss function, using both the first-order gradient and the second-order Hessian to update the model in each iteration. For multi-class classification, the calculations are extended to handle each class. The gradient of the softmax loss with respect to the predicted value $f_k(x_i)$ for class k is given in Equation 10, and the Hessian (second derivative) for class k with respect to $f_k(x_i)$ is given in Equation 11.

$$ \frac{\partial L(y_i, \hat{y}_i)}{\partial f_k(x_i)} = \hat{p}_{ik} - \mathbb{1}(y_i = k) \qquad (10) $$

$$ \frac{\partial^2 L(y_i, \hat{y}_i)}{\partial f_k(x_i)^2} = \hat{p}_{ik} (1 - \hat{p}_{ik}) \qquad (11) $$

At each node of the tree, XGBoost chooses the best feature and split to minimize the objective function. This is done by calculating the gain for each possible split. For a given split, the gain is the reduction in the objective function (the total loss) caused by splitting the data at that node, as calculated in Equation 12, where $G_L$ and $G_R$ are the gradients for the left and right child nodes and $H_L$ and $H_R$ are the Hessians for the left and right child nodes (He, 2.

$$ \mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \qquad (12) $$

After training all the trees, the final predictions for each class are computed using the softmax function. If the final score for class k at iteration t is $f_k^{(t)}(x_i) = \sum_{j=1}^{T} f_{k,j}(x_i)$, where $f_{k,j}(x_i)$ is the score for class k from the j-th tree, the probability of class k is given in Equation 13, where $\hat{p}_{ik}$ is the predicted probability of class k for sample i and the denominator sums the exponentials over all classes to normalize the probabilities.

$$ \hat{p}_{ik} = \frac{e^{f_k^{(t)}(x_i)}}{\sum_{k'=1}^{K} e^{f_{k'}^{(t)}(x_i)}} \qquad (13) $$

The final class prediction for a given sample is the class with the highest predicted probability, denoted as $\hat{y}_i = \arg\max_k \hat{p}_{ik}$.
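To make Equations 10 to 13 concrete, the sketch below evaluates the softmax probabilities, the per-class gradient and Hessian, and the split gain for hand-picked numbers. The function names and the default values `lam=1.0` and `gamma=0.0` are assumptions for illustration; the snippet follows the formulas as written in this section rather than the internals of the XGBoost library.

```python
import math


def softmax(scores):
    """Class probabilities from raw scores (Equation 13)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grad_hess(scores, true_class):
    """Per-class gradient and Hessian of the softmax loss (Equations 10, 11)."""
    probs = softmax(scores)
    grads = [p - (1.0 if k == true_class else 0.0) for k, p in enumerate(probs)]
    hess = [p * (1.0 - p) for p in probs]
    return grads, hess


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a candidate split from summed gradients/Hessians (Equation 12)."""
    def score(g, h):
        return g * g / (h + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


if __name__ == "__main__":
    probabilities = softmax([2.0, 1.0, 0.1])
    predicted_class = max(range(3), key=lambda k: probabilities[k])
    print(probabilities, predicted_class)
    print(grad_hess([0.0, 0.0, 0.0], true_class=0))
    print(split_gain(-2.0, 1.0, 2.0, 1.0))
```

A split is only worth keeping when its gain is positive, so raising `gamma` directly prunes weak splits, while raising `lam` shrinks the contribution of every leaf.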
2.4.3 C5.0
The C5.0 algorithm is a decision tree classification method by Ross Quinlan that enhances his earlier ID3 and C4.5 algorithms with more advanced features (Pandya and Pandya, 2. C4.5 and C5.0 both calculate entropy and information gain when building decision trees, but C5.0 further improves attribute selection by splitting nodes using the gain ratio. This places the most informative attribute at the parent node, thereby increasing the accuracy and efficiency of the decision tree (Myint and Tin, 2. The entropy calculation is the same as in other decision tree algorithms, such as the entropy in Random Forest given in Equation 3.

Information gain measures how much entropy is reduced by splitting the data on a particular feature. The goal is to select the feature that provides the largest reduction in uncertainty. The information gain for an attribute A is calculated as in Equation 14, where $E(D)$ is the entropy of the entire dataset, $Values(A)$ is the set of possible values of attribute A, $D_v$ is the subset of D in which attribute A has value v, and $E(D_v)$ is the entropy of the subset $D_v$.

$$ IG(D, A) = E(D) - \sum_{v \in Values(A)} \frac{|D_v|}{|D|} E(D_v) \qquad (14) $$

Information gain is biased towards attributes with many distinct values. The gain ratio, detailed in Equation 15, corrects this bias, which makes C5.0 a robust and reliable system for various classification tasks. In Equation 15, $IG(D, A)$ is the information gain for attribute A and $E(A)$ is its split information. Split information measures how much information is needed to describe the possible splits of an attribute; it is used to avoid bias toward attributes with many distinct values (which might result in overfitting). The split information for an attribute is the entropy of the split that the attribute creates when it divides the dataset into distinct subsets, as defined in Equation 16 (Myint and Tin, 2.

$$ GR(D, A) = \frac{IG(D, A)}{E(A)} \qquad (15) $$

$$ E(A) = -\sum_{v \in Values(A)} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|} \qquad (16) $$

The decision tree is built recursively by splitting the dataset at each node on the attribute that provides the highest gain ratio. The steps are as follows:
1. Calculate the entropy of the dataset.
2. Compute the information gain and the gain ratio for each attribute.
3. Split the data on the attribute with the highest gain ratio.
4. Repeat the process recursively on the resulting subsets until a stopping criterion is met (e.g., maximum depth or minimum subset size).

After the tree is constructed, C5.0 applies a pruning step to remove branches that do not contribute to improving the model's performance. The pruning process is typically post-pruning, meaning it occurs after the full tree is constructed. During pruning, C5.0 evaluates the error rate at each node, and if pruning a node results in a lower error rate, the node is pruned.

2.5 Evaluation Metrics
This study evaluated the classification model using test data with known actual values through a confusion matrix. A confusion matrix is a tabular description of a classification process that offers detailed insight through comparison of actual and predicted classifications (Joloudari et al., 2. The classical binary confusion matrix (illustrated in the Classical Binary Confusion Matrix table) consists of the following combinations of predicted and actual values: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) (Theissler et al., 2. In this context, TP indicates a correct positive prediction, TN a correct negative prediction, FP an incorrect positive prediction, and FN an incorrect negative prediction. Accuracy, precision, recall, and F1 score, which are key performance indicators, can be derived from these components as shown in Equation 17 to Equation 20. Such metrics are critical for assessing overall model effectiveness, identifying error types, and guiding improvements in model accuracy and reliability.

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (17) $$

$$ \mathrm{Precision} = \frac{TP}{TP + FP} \qquad (18) $$

$$ \mathrm{Recall} = \frac{TP}{TP + FN} \qquad (19) $$

$$ F1\ \mathrm{Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (20) $$
However, for a multi-class problem, the confusion matrix is extended and the metric calculations are modified accordingly. A multi-class confusion matrix is presented in Table 5, and the modified formulas are given in Equation 21 to Equation 24, where MC refers to multi-class and $L_i$ indicates that the calculation is for the specific class $L_i$ (Markoulidakis et al., 2. The denominator of Equation 21 sums every cell $L_{ij}$ of the confusion matrix.

$$ MC\_Accuracy = \frac{\sum_{i=1}^{C} TP_{L_i}}{\sum_{i=1}^{C} \sum_{j=1}^{C} L_{ij}} \qquad (21) $$

$$ MC\_Precision_{L_i} = \frac{TP_{L_i}}{TP_{L_i} + FP_{L_i}} \qquad (22) $$

$$ MC\_Recall_{L_i} = \frac{TP_{L_i}}{TP_{L_i} + FN_{L_i}} \qquad (23) $$

$$ MC\_F1Score_{L_i} = 2 \times \frac{MC\_Precision_{L_i} \times MC\_Recall_{L_i}}{MC\_Precision_{L_i} + MC\_Recall_{L_i}} \qquad (24) $$

Table 4. Classical Binary Confusion Matrix
 | Predicted Positive | Predicted Negative
Actual Positive | TP | FN
Actual Negative | FP | TN

Table 5. Multi-Class Confusion Matrix
Predicted\Actual | Class 1 | Class 2 | ... | Class C
Class 1 | TP_1 | FP_{1,2} | ... | FP_{1,C}
Class 2 | FP_{2,1} | TP_2 | ... | FP_{2,C}
... | ... | ... | ... | ...
Class C | FP_{C,1} | FP_{C,2} | ... | TP_C

Table 6. Evaluation Result of Each Class Using Hold-Out
Class | Accuracy (RF, XGB, C5.0) | Precision (RF, XGB, C5.0) | Recall (RF, XGB, C5.0) | F1-Score (RF, XGB, C5.0)

Table 7. Average Results with Hold Out Method
Metric | Random Forest | XGBoost | C5.0
Accuracy | | |
Precision | | |
Recall | | |
F1-Score | | |
Runtime | 806 ms | 998 ms | 28 ms
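The multi-class formulas in Equations 21 to 24 can be sketched directly from a confusion matrix. The function name `multiclass_metrics` is illustrative, and the orientation is an assumption: here `matrix[i][j]` counts samples of actual class i that were predicted as class j. If the matrix is stored the other way round, precision and recall swap.

```python
def multiclass_metrics(matrix):
    """Overall accuracy and per-class precision, recall, and F1.

    Assumption: matrix[i][j] = samples of actual class i predicted as class j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total  # Equation 21

    report = {}
    for i in range(n):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(n)) - tp  # predicted i, actually other
        fn = sum(matrix[i]) - tp                       # actually i, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0  # Equation 22
        recall = tp / (tp + fn) if tp + fn else 0.0     # Equation 23
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)           # Equation 24
        report[i] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, report


if __name__ == "__main__":
    confusion = [
        [5, 1, 0],
        [0, 4, 1],
        [1, 0, 3],
    ]
    acc, per_class = multiclass_metrics(confusion)
    print(acc)
    for label, scores in per_class.items():
        print(label, scores)
```

The zero-division guards return 0.0 for a class that is never predicted or never present, which is one common convention; other conventions leave such values undefined.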
RESULTS AND DISCUSSION Classification tasks are executed using the Python programming language.
The evaluation is carried out in three phases, each based on the data-splitting approach to ensure a comprehensive assessment of the modelAos performance.
However, the parameters for each algorithm remained consistent across all splitting approaches.
The Random Forest algorithm was configured with n_estim ators = 100, meaning it builds 100 decision trees to make preA 2025 The Authors.
Each tree is trained independently on a random sample of the data, and the final decision is made by combining the outputs of all trees through majority voting.
To prevent the trees from becoming overly complex and overfitting the training data, the maximum depth of each tree was limited to This depth restriction helps the model capture meaningful patterns while maintaining its ability to generalize well to unseen data.
For the XGBoost algorithm, the parameter n_estimators = 100 sets the total number of decision trees that the model will build one after another.
Each tree tries to fix the mistakes made by the previous ones, helping the model improve step by step.
The learning_rate of 0.
1 controls how much each new tree affects the overall prediction.
A smaller learning rate means the model learns more slowly but often ends up generalizing better to new data.
To keep the trees simple and avoid overfitting, max_depth is limited to 2, meaning each tree can only grow to two levels deep.
Additionally, the model uses subsample = 0.
8 and colsample_bytree = 0.
8, which means that in each training round, only 80% of the data and 80% of the features are randomly selected.
This randomness helps the model avoid relying too much on any specific data or feature and makes it more robust.
On the other hand, the C5.
0 algorithm works a bit differently.
Setting subset = True allows it to automatically pick the most relevant subsets of features, making the model focus on what really matters.
By turning off winnow .
etting it to Fals.
, the model keeps all features during training instead of trying to filter out less important ones.
The confidence factor (CF = 0.
controls how aggressively the model prunes the decision tree-a lower value means more pruning to keep the tree simpler and reduce the risk of overfitting.
With minCases = 10, the algorithm only splits a node if it has at least ten data points, which also helps keep the model from getting too complex.
The training process is further improved by earlyStopping=True, which stops training early if the model's performance stops improving.
Table 8. Evaluation Result of Each Class Using SKCV (per-class Accuracy, Precision, Recall, and F1-Score for RF, XGB, and C5.0)
Like in XGBoost, the maximum depth of trees is set to 2 with maxDepth = 2 to keep the trees shallow.
The boosting process runs for 100 iterations (trials = 100) to gradually build a stronger model.
Other parameters like bands = 0, sample = 0, and fuzzyThreshold = False are left at default, meaning no special sampling or fuzzy logic is applied.
Overall, these settings reflect a careful balance: building a model that is accurate but not too complicated.
By limiting tree depth, introducing randomness through sampling, and applying pruning and early stopping, the models are designed to generalize well without overfitting.
Choosing these parameters thoughtfully is key to creating effective and interpretable machine learning models.
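For reference, the settings stated in this section can be gathered into one structure. This is only a bookkeeping sketch: the dictionary layout and the helper name are hypothetical, parameter names are copied from the text, and values the text does not state legibly (the Random Forest depth limit and the C5.0 confidence factor) are left out rather than guessed.

```python
# Hyperparameters as stated in the text; missing values are intentionally omitted.
HYPERPARAMETERS = {
    "Random Forest": {"n_estimators": 100},
    "XGBoost": {
        "n_estimators": 100,
        "learning_rate": 0.1,
        "max_depth": 2,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
    },
    "C5.0": {
        "subset": True,
        "winnow": False,
        "minCases": 10,
        "earlyStopping": True,
        "maxDepth": 2,
        "trials": 100,
        "bands": 0,
        "sample": 0,
        "fuzzyThreshold": False,
    },
}

def describe(model_name):
    """Format one model's settings as a single readable line."""
    params = HYPERPARAMETERS[model_name]
    return model_name + ": " + ", ".join(f"{k}={v}" for k, v in params.items())

for name in HYPERPARAMETERS:
    print(describe(name))
```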
1 Evaluation Using the Hold-Out Method
In the initial experiment, the dataset is split using the hold-out method, with 90% of the data allocated for training and 10% for testing.
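A 90:10 hold-out split can be sketched as below. The function name, the seed, and the use of a plain shuffle are assumptions for illustration; the sample count of 971 is used only as an example size and need not match the number of rows after SMOTE.

```python
import random

def hold_out_split(n_samples, test_fraction=0.10, seed=42):
    """Shuffle sample indices once and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return train_idx, test_idx

train_idx, test_idx = hold_out_split(971)
print(len(train_idx), len(test_idx))
```

Every model is then trained once on the training indices and scored once on the test indices.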
Table 6 juxtaposes three classification methods, Random Forest (RF), XGBoost (XGB), and C5.0, presenting their performance across all classes.
The performance metrics calculated for each class include accuracy, precision, recall, and F1-score.
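The four per-class metrics can be computed in a one-vs-rest fashion, as sketched below. Treating each class as "positive" against all others is an assumption about how the per-class figures were obtained, and the label lists are made up for illustration.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical true and predicted class labels.
y_true = [1, 1, 2, 2, 3, 3, 3, 4]
y_pred = [1, 1, 2, 3, 3, 2, 3, 4]
for cls in (1, 2, 3, 4):
    print(cls, per_class_metrics(y_true, y_pred, cls))
```

A low recall for a class means many of its true members were assigned elsewhere (false negatives), while a low precision means other classes were often mistaken for it.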
Table 6 shows that Random Forest (RF) performs well on every metric and in every class.
For instance, in Class 1, Random Forest achieves an accuracy, precision, recall, and F1 score all equal to 1, outperforming XGBoost and C5.0 for this class.
XGBoost and C5.0 also perform well but show slight variations in their metrics.
Table 7 summarizes average performance outcomes for the three decision-tree variants, each evaluated through the standard 90:10 hold-out scheme.
Random Forest (RF) tops every available metric, yet its mean runtime of 429.806 milliseconds reveals a significant computational burden.
In contrast, the C5.0 algorithm, while lagging slightly behind on the same measurements, completes its run in just 62.28 milliseconds and is thus the fastest of the three.
XGBoost, parked between the extremes of accuracy and speed, surfaces as a reasonable, middle-ground alternative.
2 Evaluation Using the Stratified K-Fold Cross Validation (SKCV) Method
In the next experiment, the dataset is split using SKCV with k = 10.
Table 8 displays the performance outcomes for the three classification methods.
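Stratified k-fold splitting with k = 10 can be sketched as follows. This is a simplified round-robin version written for illustration, not the routine used in the study; the function name, seed, and toy label counts are assumptions.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep the class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class's samples across the folds like a deck of cards.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx

# Toy labels: 50, 30 and 20 samples of three classes.
labels = [1] * 50 + [2] * 30 + [3] * 20
splits = list(stratified_k_fold(labels, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each sample is used for testing exactly once and for training k - 1 times, and the reported metrics are averaged over the k folds.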
Similar to the hold-out results, RF consistently achieves slightly higher metric values than XGBoost and C5.0.
Interestingly, the class with the highest metrics for all algorithms is the one that was initially a minority class before applying SMOTE, whereas Class 3, which had no instances added by SMOTE, scores the lowest.
The recall metric for XGBoost in Class 3 is lower compared to the other algorithms, indicating a higher rate of false negatives.
The lower recall for XGBoost suggests it is less reliable for detecting all positive instances in Class 3, which could be due to the characteristics of the data.
Despite this, both XGBoost and C5.0 also demonstrate competitive results overall.
The average performance results for the three classification methods, Random Forest (RF), XGBoost (XGB), and C5.0, are detailed in Table 9.
In line with the hold-out results, Random Forest demonstrates the highest performance across all metrics.
However, it has the longest runtime at 8584.102 milliseconds, attributed to its nature of creating multiple trees.
XGBoost and C5.0 exhibit very similar performance metrics, with both methods achieving the same accuracy.
Precision and recall are nearly identical for both.
Despite this, XGBoost's runtime is significantly longer, at 4181.01 milliseconds, which is more than twice that of C5.0's 1771.632 milliseconds.
This makes C5.0 preferable over XGBoost in this particular experiment, especially when computational efficiency is a priority.
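The runtime comparison can be checked with simple arithmetic on the figures quoted in the text. The helper name is hypothetical, and only runtimes that appear in the prose are included.

```python
# Mean runtimes in milliseconds, as quoted in the text.
runtimes_ms = {
    "hold-out": {"Random Forest": 429.806, "C5.0": 62.28},
    "SKCV": {"Random Forest": 8584.102, "XGBoost": 4181.01, "C5.0": 1771.632},
}

def slowdown_vs_fastest(times):
    """Express every runtime as a multiple of the fastest one."""
    fastest = min(times.values())
    return {name: t / fastest for name, t in times.items()}

for scheme, times in runtimes_ms.items():
    ratios = slowdown_vs_fastest(times)
    print(scheme, {name: round(r, 2) for name, r in ratios.items()})
```

Under SKCV, XGBoost takes about 2.4 times as long as C5.0 and Random Forest nearly 5 times as long; in the hold-out run, Random Forest takes close to 7 times as long as C5.0.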
3 Comparison of the Methods
Figure 4 shows the comparison of training and validation accuracy for the Random Forest, XGBoost, and C5.0 algorithms using the SKCV method.
The results indicate that the accuracy values for training and validation are very close for all models, which is a good sign.
This means the models have learned well from the training data without overfitting (too closely fitting the training data) or underfitting (not learning enough).
Training accuracy reflects how well the model performs on the data it was trained on, while validation accuracy indicates how well it generalizes to new, unseen data during training.
When these two metrics are both high and similar, it typically means the model is balanced and likely to perform well on new data.
Conversely, a large gap with higher training accuracy can signal overfitting, and low values for both may indicate underfitting.
Overall, the results in Figure 4 suggest that the models are well-tuned and capable of delivering reliable predictions for this dataset.
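The reading of training versus validation accuracy described above can be written as a small rule of thumb. The thresholds (a 0.05 gap and a 0.80 accuracy floor) and the example accuracies are arbitrary assumptions for illustration, not values taken from the study.

```python
def diagnose_fit(train_acc, val_acc, max_gap=0.05, min_acc=0.80):
    """Classify a model's fit from its training and validation accuracy."""
    if train_acc < min_acc and val_acc < min_acc:
        return "possible underfitting"   # both scores are low
    if train_acc - val_acc > max_gap:
        return "possible overfitting"    # training score is well above validation
    return "balanced"

# Hypothetical accuracy pairs.
print(diagnose_fit(0.99, 0.98))
print(diagnose_fit(0.99, 0.80))
print(diagnose_fit(0.60, 0.58))
```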
Table 9. Average Results with SKCV Method (rows: Accuracy, Precision, Recall, F1-Score, Runtime; columns: Random Forest, XGBoost, C5.0). Runtime: 8584.102 ms for Random Forest, 4181.01 ms for XGBoost, and 1771.632 ms for C5.0.
Figure 4. Comparison of Training and Validation Accuracy of the Decision Tree Algorithms
Figure 5 summarizes confusion-matrix results for three popular decision-tree algorithms applied to the test set.
Across both the simple hold-out partition and the more exhaustive stratified k-fold cross-validation, the Random Forest variant regularly records the highest numbers.
Accuracy values for that method hover just under 0.98 in both settings.
In the same comparisons, precision, recall, and the combined F1 score remain reassuringly strong, signaling the model's overall reliability.
C5.0 and XGBoost trail by only a small margin.
A curious reversal appears with SKCV, however, where C5.0 sometimes equals or narrowly exceeds XGBoost on the measurements, hinting at the classifier's sensitivity to how the data is partitioned.
The runtimes reported in Tables 7 and 9 for the hold-out and SKCV experiments are not directly comparable with one another.
The former executes a single 90:10 partition, while the k = 10 SKCV scheme repeats that split ten separate times.
Even so, both splitting strategies conclude the same lesson about Random Forest: its ensemble nature drives up wall time, a liability that shows up when response time is crucial.
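The reason the two runtimes differ in scale can be seen with a small timing harness. This sketch uses a dummy workload as a stand-in for training and scoring a model; the function names and workload size are hypothetical, and the study's own timing procedure is not specified here.

```python
import time

def time_ms(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

def train_once(n=20_000):
    """Hypothetical stand-in for fitting and scoring one model."""
    return sum(i * i for i in range(n))

def train_k_folds(k=10):
    """Repeat the stand-in workload once per fold, as k-fold validation does."""
    return [train_once() for _ in range(k)]

_, hold_out_ms = time_ms(train_once)
_, skcv_ms = time_ms(train_k_folds, 10)
print(f"single split: {hold_out_ms:.3f} ms, 10 folds: {skcv_ms:.3f} ms")
```

Since a 10-fold run performs ten fits instead of one, comparing algorithms within the same splitting scheme is more informative than comparing times across schemes.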
C5.0, by contrast, gives nearly identical accuracy numbers yet finishes the work much faster, making it the obvious pick for high-throughput tasks.
XGBoost lands midway between the two: faster than Random Forest but still slower than C5.0.
Based on these findings, the best model depends on the application scenario as shown in Table 10.
In the context of predicting water quality, RF's accuracy is ideal for some tasks, but it may not be suitable for applications with low latency requirements, for example, systems deployed for real-time water quality monitoring.
For scenarios where speed matters more, such as real-time classification or resource-limited settings like IoT-based water monitoring, C5.0 works efficiently.
XGBoost, which offers good accuracy and efficiency but is slower than C5.0, is suited to medium to large datasets where reasonable accuracy with efficient processing is needed.
If subsequent studies include more features or other water parameters, XGBoost might scale better than Random Forest.
Also, in water quality monitoring, balanced evaluation is necessary because some levels of contamination are not represented adequately in the unprocessed data.
SKCV makes the estimates more applicable to the real world by minimizing the variance observed in the results.
From the evaluation results, it is evident that some water quality classes are more challenging to classify correctly than others.
Notably, Class 3 (Fair quality) consistently shows lower recall and precision across all models compared to other classes, as shown in Table 6 and Table 8.
This difficulty likely arises because the feature distributions for Class 3 significantly overlap with those of Class 2 (Good) and Class 4 (Marginal), making it hard for the models to distinguish between them.
Although SMOTE was applied to address class imbalance, the subtle differences in important parameters such as nitrogen concentration and turbidity still lead to misclassifications among these classes.
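Confusion between neighbouring classes can be inspected by counting (true, predicted) pairs, as in the sketch below. The label lists are invented for illustration and are not the study's predictions; the function names are hypothetical.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true class, predicted class) pair occurs."""
    return Counter(zip(y_true, y_pred))

def most_confused(y_true, y_pred, top=3):
    """Return the most frequent off-diagonal (true, predicted) pairs."""
    errors = Counter({
        pair: n
        for pair, n in confusion_counts(y_true, y_pred).items()
        if pair[0] != pair[1]
    })
    return errors.most_common(top)

# Hypothetical labels in which Class 3 is often mistaken for Class 2.
y_true = [3, 3, 3, 3, 3, 2, 2, 4, 4, 1]
y_pred = [3, 2, 2, 4, 3, 2, 3, 4, 3, 1]
print(most_confused(y_true, y_pred, top=1))
```

Listing the largest off-diagonal cells in this way shows directly which pairs of quality levels a model fails to separate.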
Figure 5. Comparison of Decision Tree Algorithms (RF, XGBoost, and C5.0)
Table 10. Best Decision Tree based on Application Scenario
Scenario | Best Decision Tree | Reason
Highest Accuracy | Random Forest | Best predictor among the three but has high computational cost.
Fastest Execution | C5.0 | Maintains high accuracy with minimal runtime.
Balanced Trade-Off | XGBoost | Good accuracy and efficiency but slower than C5.0.
Real-Time Water Quality Monitoring | C5.0 or XGBoost | Fast response time is critical.
Large-scale Datasets | XGBoost | Scales better than Random Forest in big data.
These misclassifications can have practical implications.
Incorrectly labeling water quality levels could result in inappropriate treatment decisions or insufficient monitoring, potentially impacting environmental management and public health outcomes.
To address this, future research could focus on incorporating additional or more discriminative features that better separate closely related classes.
Applying explainability techniques such as SHapley Additive exPlanations (SHAP) or Local Interpretable Model-agnostic Explanations (LIME) would help uncover which features contribute most to the model's predictions and identify sources of confusion.
Furthermore, exploring alternative machine learning models or ensemble approaches that combine strengths of different classifiers could enhance accuracy.
A hybrid data augmentation strategy that blends oversampling (like SMOTE) and undersampling methods may better balance the dataset.
Lastly, expanding the dataset size and improving data quality, particularly for underrepresented classes, will likely improve model robustness and reduce misclassification rates.
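The oversampling half of such a strategy rests on SMOTE's interpolation step, which can be sketched as follows. This is a simplified illustration rather than a library implementation: the function name, the choice of k, the seed, and the toy minority samples are all assumptions.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to the neighbour
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Toy minority-class samples with two features each.
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_points = smote(minority, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point lies on a segment between two existing minority samples, oversampling alone cannot separate classes whose feature ranges already overlap, which is one motivation for pairing it with undersampling or better features.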
CONCLUSIONS
Through systematic analysis, this research assesses the effectiveness of three decision tree-based models, Random Forest, XGBoost, and C5.0, in forecasting water-quality ratings.
The classification step is preceded by a thorough data analysis and preprocessing step to gain better insight from the data.
The data was processed using the SMOTE technique to address the class imbalance problem.
Although Random Forest achieved the best accuracy of almost 0.98, the model is unsuitable for scenarios where time is of the essence, such as real-time water monitoring systems, due to its longer runtime.
C5.0 is much faster, which makes the model a good fit for cases where speed is essential.
XGBoost performed similarly to C5.0, but was less efficient in terms of runtime.
All experimentation relies on a medium-sized set of 971 rows with 10 feature columns.
For bigger and more complicated datasets, XGBoost might outperform the others, whereas Random Forest will be even slower because of its process of multiple-tree generation.
In future studies, the efficiency of Random Forest and XGBoost models can be improved by using pruning techniques alongside parallel computing or graphics processing unit (GPU) acceleration, which is expected to reduce the models' high runtime.
In addition, the tree-based models can be benchmarked against non-tree classifiers such as SVM, KNN, or even neural networks, and SHAP and LIME explainability techniques can be applied to improve transparency.
Lastly, addressing overfitting, dataset scalability, and generalizability will strengthen the study's relevance to real-world application.
ACKNOWLEDGMENT
The authors sincerely thank the University of Lampung for its financial assistance from the research fund, which has contributed immensely towards completing this work and furthering our research.
REFERENCES