Available online at http://icsejournal.com/index.php/JCSE

Journal of Computer Science and Engineering (JCSE), e-ISSN 2721-0251, Vol. , No. , August 2025

Hyperband-Optimized LightGBM and Ensemble Learning for Web Phishing Detection with SHAP-Based Interpretability

Rizki Wahyudi1*
1Universitas Amikom Purwokerto, Purwokerto, Central Java 53127, Indonesia
* corresponding author

ARTICLE INFO
Article History: Received July 2, 2025; Revised August 1, 2025; Accepted August 6, 2025
Keywords: Phishing Detection; Machine Learning; Hyperparameter Tuning; Stacking Ensemble; SHAP Interpretability
Correspondence e-mail: rizki.key@gmail.com

ABSTRACT
This study evaluates the performance of three tree-based ensemble algorithms, Random Forest (RF), XGBoost (XGB), and LightGBM (LGBM), in detecting phishing websites using a phishing dataset built from HTML, URL, and network features. Two strategies were tested: Hyperband-style hyperparameter search (HalvingRandomSearchCV) and a stacking ensemble combining all three models. The evaluation was conducted on five main metrics: accuracy, precision, recall, F1-score, and AUC-ROC. The results indicate that LightGBM tuned via Hyperband achieved the highest performance (AUC-ROC 0.9702), followed by the tuned ensemble (AUC-ROC 0.9684). SHAP analysis was used to interpret the contribution of key features in predicting phishing. The AUC-ROC improvement of 0.0034 points over the XGBoost baseline (0.9668) confirms the effectiveness of Hyperband tuning and stacking ensembles for phishing detection.

1. Introduction
Phishing websites are a frequent and destructive cybersecurity problem: cybercriminals build fake websites that imitate legitimate businesses in order to steal user passwords, personal information, and financial data.
Because the frequency and sophistication of these attacks are growing globally, automated detection through machine learning has become an essential tool for phishing prevention. During detection, it is common practice to extract information from URLs, HTML structure, and network metadata; these elements are then used as features in classification models. A number of previous studies have evaluated machine learning algorithms for phishing detection, such as Support Vector Machines, Random Forest, and XGBoost. These algorithms have produced promising results, but they continue to face challenges in performance consistency and hyperparameter optimization. Many investigations have used ensemble models such as Random Forest, XGBoost, and LightGBM because of their capacity to handle nonlinear and large datasets. The efficacy of these algorithms depends heavily on the selection and optimization of suitable hyperparameters, yet much prior research still employs grid search or random search, which are laborious and inefficient. The Hyperband method offers an adaptive and efficient strategy for hyperparameter tuning based on successive halving, but it remains relatively underexplored in the field of phishing detection. Beyond raw performance, model interpretability is a key difficulty, particularly for security applications that require transparency in decision-making. Regrettably, most past research has concentrated on accuracy and other predictive metrics without explaining how features actually contribute to the classification outcome. Interpretation methods such as SHAP (SHapley Additive exPlanations)
, which are grounded in game theory and can deliver both local and global explanations for model predictions, have not yet been widely applied in the context of tuning and ensemble combinations for phishing detection. This study closes this gap with a three-part method: (1) evaluating three well-known ensemble models (Random Forest, XGBoost, and LightGBM) as well as a stacking model; (2) optimizing each base model through Hyperband tuning and combining the optimized models in a meta-learner-based stacking ensemble, which leverages a higher-level learner to combine the strengths of multiple base models and frequently yields better generalization and robustness than individual models or simple averaging; and (3) interpreting the best model with SHAP techniques to transparently explain feature contributions. This approach not only increases the accuracy of phishing detection but also improves the interpretability and speed of model training.

2. Method
This section describes the research methodology, which consists of the following steps: preparation and preprocessing of the dataset, evaluation of the baseline models, feature scaling, hyperparameter tuning with Hyperband, construction of the stacking ensemble, ROC curve comparison, and SHAP-based interpretation. The entire research workflow is illustrated in Figure 2.

2.1 Dataset and Preprocessing
The dataset, sourced from the UCI Machine Learning Repository, contains 11,055 examples of phishing and non-phishing URLs, along with 30 statistical features such as the URL's length, the number of "@" or "." characters, the presence of HTTPS, the age of the domain, and so on. The index column and the IP column were removed. Missing numeric features are filled with the mean, and missing categorical features are filled with the mode. The target is encoded in binary (1 = phishing, 0 = legitimate).
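A minimal sketch of this preprocessing, assuming a pandas DataFrame whose label column is named `Result` (an assumption for illustration; the actual column names in the UCI export may differ):

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Preprocessing sketch: mean-fill numeric columns, mode-fill
    categorical columns, binary-encode the target (1=phishing, 0=legitimate)."""
    df = df.copy()
    # Fill gaps in numeric feature columns with the column mean
    num_cols = df.select_dtypes(include="number").columns.drop("Result", errors="ignore")
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
    # Fill gaps in categorical columns with the column mode
    cat_cols = df.select_dtypes(exclude="number").columns
    for c in cat_cols:
        df[c] = df[c].fillna(df[c].mode().iloc[0])
    # Encode the target: the raw dataset labels phishing as 1 and legitimate as -1
    df["Result"] = (df["Result"] == 1).astype(int)
    return df
```

Dropping the index and IP columns, as described above, would be a `df.drop(columns=[...])` call before this step.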
2.1.1 Data Exploration
This section summarizes descriptive statistics for all numerical features, including the mean, standard deviation, minimum, and maximum, to elucidate the properties of the data. The class distribution between the phishing and legitimate categories is examined, and evaluation uses a 5-fold cross-validation approach. Table 1 shows the dataset structure via five randomly selected rows alongside several key attributes and the target label. Table 2 and Figure 1 present a summary of numerical statistics and the class distribution, offering a first insight into the data patterns and class proportions within the dataset.

Table 1. Five Examples of Random Rows, Covering Some Key Features and Target Labels
Sample | having_IP_Address | URL_Length | SSLfinal_State | age_of_domain
…

2.1.2 Summary of Class Statistics and Distribution
To analyze the data's characteristics, Table 2 provides summary statistics, including the mean, standard deviation, minimum, and maximum values for five chosen numerical features.

Table 2. Summary Statistics of Some Numerical Features
Feature | Mean | Std | Min | Max
having_IP_Address | … | … | … | …
URL_Length | … | … | … | …
SSLfinal_State | … | … | … | …
age_of_domain | … | … | … | …
Links_pointing_to_page | … | … | … | …

2.1.3 Class Distribution Visualization
Figure 1 visualizes the distribution of target classes, complementing the summary statistics in Table 2.

Figure 1. Distribution of Target Classes on the Test Dataset

2.2 Baseline Evaluation
The data was split 80/20 (train/test). Four baseline models were trained on the original features:
- Random Forest (default parameters)
- XGBoost (default estimator, eval_metric='logloss')
- LightGBM (default parameters)
- Stacking, default configuration (RF + XGB + LGBM with a LogisticRegression meta-learner)
Evaluation metrics: accuracy, precision, recall, F1-score,
and AUC-ROC.

2.3 Hyperband Tuning
The input features are first scaled with StandardScaler. For each model, the search space was:
- Random Forest: n_estimators, max_depth, max_features, min_samples_split, min_samples_leaf.
- XGBoost: n_estimators, max_depth, learning_rate, subsample, colsample_bytree.
- LightGBM: n_estimators, max_depth, learning_rate, num_leaves, subsample, colsample_bytree.
The search was conducted with HalvingRandomSearchCV (factor=3, cv=5, n_iter=…). The best model configuration was then evaluated on the test set.

2.4 Stacking Ensemble
The three optimized models are integrated via a StackingClassifier with a Logistic Regression meta-learner. The objective is to harness the synergy of the base models.

2.5 Evaluation and Analysis
- Confusion matrix for each variant.
- ROC curve comparing all models.
- Cross-validation variability: 5-fold CV repeated for baseline XGB and tuned LGBM, reporting the mean and standard deviation of accuracy.
- SHAP summary plot for the best LGBM, displaying the top 10 features by mean absolute SHAP value.

Figure 2. Research Methodology

3. Results and Discussion
3.1 Baseline Model Evaluation
The preliminary phase of this experiment assessed the baseline performance of four classification models: Random Forest (RF), XGBoost (XGB), LightGBM (LGBM), and the stacking ensemble. All models were evaluated on raw features, without scaling or tuning, to give an initial overview of each algorithm's effectiveness on the dataset. The assessment used five primary metrics: accuracy, precision, recall, F1-score, and AUC-ROC, to ensure that performance is not only high overall but also balanced across both the minority and majority classes.
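As a concrete sketch, the five metrics and the cross-validation variability check can be computed as follows. The synthetic data and the Random Forest stand-in are illustrative assumptions, not the paper's exact setup:

```python
# Sketch of the five evaluation metrics plus the 5-fold cross-validation
# variability check described above, on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]  # scores for the positive class

metrics = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred),
    "recall": recall_score(y_te, y_pred),
    "f1": f1_score(y_te, y_pred),
    "auc_roc": roc_auc_score(y_te, y_prob),  # AUC needs scores, not labels
}

# Cross-validation variability: mean and standard deviation of accuracy
cv_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
cv_mean, cv_std = cv_acc.mean(), cv_acc.std()
```

Note that AUC-ROC is computed from the predicted probabilities rather than the hard labels, which is why `predict_proba` is needed alongside `predict`.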
According to the baseline results in Table 3, all models demonstrate highly competitive performance, with accuracy exceeding 96%. The stacking model shows the best overall performance, with an accuracy of 96.88%, a recall of 98.17%, and an F1-score of 97.28%. XGBoost follows closely with an accuracy of 96.83%, a precision of …62%, and an F1-score of 97.23%. The Random Forest model also performed impressively, with a recall above 98%; however, its AUC-ROC is marginally lower than that of XGBoost and Stacking. LightGBM recorded the lowest performance among the four models, though the difference is small, with an accuracy of 96.47% and an AUC-ROC in the 0.96 range.

Examining each metric, the stacking model shows the most favorable balance between sensitivity and precision. The elevated recall indicates a strong capacity to identify nearly all occurrences of the positive (phishing) class, while the consistent precision reflects a low rate of false-positive predictions. The AUC-ROC for stacking is 0.9668, matching that of XGBoost, which suggests both models are equally effective at differentiating phishing from legitimate classes. In summary, the baseline results show that while each model has specific advantages (for instance, RF excels in recall, XGB in precision, and Stacking in metric balance), the stacking model emerges as the most promising option at this preliminary stage, prior to any tuning. This establishes a solid basis for further performance enhancement via normalization and hyperparameter optimization. Table 3.
Baseline Model Evaluation using Raw Features
Model | Accuracy | Precision | Recall | F1-score | AUC-ROC
Random Forest (RF) | … | … | … | … | …
XGBoost (XGB) | … | … | … | … | …
LightGBM (LGBM) | … | … | … | … | …
Stacking | … | … | … | … | …

3.2 Feature Scaling and Model Tuning
Feature scaling is performed so that all numerical features share a uniform scale before model training. Standardization with StandardScaler prevents features with large values from dominating and accelerates convergence during training and tuning; note that tree-based learners such as XGBoost and LightGBM are largely insensitive to feature scale, so scaling matters chiefly for scale-sensitive components such as the Logistic Regression meta-learner. To improve classification performance, hyperparameter tuning was conducted with HalvingRandomSearchCV, an efficient staged random search that prunes candidates as cross-validation performance accumulates. Tuning was carried out independently for each model: Random Forest, XGBoost, and LightGBM. The resulting models, with their optimal parameter combinations, were then used as base learners in the stacking ensemble.

After tuning, the three base models improved noticeably. LightGBM recorded the largest improvement, with accuracy rising from 96.47% to 97.24%, recall from 97.69% to 98.65%, and F1-score from 96.92% to 97.60%. The XGBoost model's F1-score moved from 97.23% to 96.84%, and its AUC-ROC shifted from the baseline 0.9668, declining marginally on some metrics. Meanwhile, Random Forest kept consistently high recall but showed a slight decrease in AUC-ROC from 0.9644 to 0.9579, likely due to overfitting with certain parameter combinations. The tuned stacking model, which combines the three optimized models, recorded an accuracy of 96.97%, a precision of 96.92%, a recall of 97.77%, and an F1-score of 97.34%, generally outperforming the individual models and the baseline.
3.3 Performance Comparison Across Phases
A systematic comparison of model performance before and after tuning was performed to evaluate the impact of the optimization process on classification quality. The evaluation used five key metrics: accuracy, precision, recall, F1-score, and AUC-ROC. To support this analysis, results are presented as bar graphs of each model's performance in two phases: baseline and after tuning with HalvingRandomSearchCV.

Accuracy Comparison
The accuracy graph in Figure 3 shows that hyperparameter tuning had a positive impact on the LightGBM and stacking models. LightGBM's accuracy increased from 96.47% to 97.24%, while the stacking model improved from 96.88% to 96.97%. The XGBoost and Random Forest models fluctuated slightly. This graph shows that tuning can help certain models achieve greater stability in classification.

Figure 3. Accuracy Comparison Across Models

F1-score Comparison
The F1-score metric in Figure 4 also improved considerably after tuning, with LightGBM rising from 96.92% to 97.60%, indicating a better balance between precision and recall. After tuning, stacking maintained its consistently strong performance with an F1-score of 97.34%. This improvement confirms that the tuning procedure is effective and helps sustain performance in classification scenarios that demand both sensitivity and precision.

Figure 4. F1-score Comparison Across Models

Precision Comparison
Figure 5 compares precision across models. The pattern mirrors the other metrics: tuning preserved or slightly improved precision for LightGBM and stacking (the tuned stacking model reached 96.92%), reflecting a consistently low false-positive rate.
Figure 5. Precision Comparison Across Models

Recall Comparison
The recall values in Figure 6 show that the stacking and LightGBM models have the strongest ability to detect the positive class. After tuning, stacking's recall was 97.77%, while LightGBM increased from 97.69% to 98.64%, the highest value among all models. This indicates that these models are very effective at minimizing type II errors (false negatives), which is crucial in applications such as phishing detection and security systems.

Figure 6. Recall Comparison Across Models

3.4 ROC Curve Analysis
Figure 7 shows the ROC (Receiver Operating Characteristic) curves, providing a visual comparison of all tested classification models, both at the baseline stage and after hyperparameter tuning with the Hyperband approach. The curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), reflecting the extent to which a model correctly detects the positive class without producing too many misclassifications. The closer a curve is to the upper-left corner, the better the model's combination of high sensitivity and strong specificity. The Area Under the ROC Curve (AUC-ROC) values show that the LightGBM model tuned with Hyperband performed best with an AUC of 0.9702, followed by the tuned stacking model with an AUC of 0.9684. The tuned stacking model also delivered very balanced classification performance, with an accuracy of 96.97%, a precision of 96.92%, a recall of 97.
77%, and an F1-score of 97.34%. These results make the ensemble model the most optimal candidate, offering the best trade-off between a high positive-detection rate and a low misclassification rate. Its curve, very close to the upper-left edge of the ROC graph, indicates that the model maintains very high detection performance even at minimal FPR levels.

Meanwhile, the XGBoost model at the baseline stage, despite a fairly satisfactory AUC value (0.9668), exhibits visually weaker performance on the ROC graph. Its curve is not as sharp as the other models', especially in the early low-FPR segments, indicating that the model struggles to distinguish classes accurately when small error rates matter. This suggests that despite high aggregate metrics such as AUC, the probability distribution produced by the baseline XGBoost is less concentrated, causing its ROC curve to deviate from the ideal shape.

Figure 7. ROC Curve Comparison of All Models

3.5 Model Explainability Using SHAP
Figure 8 shows the SHAP summary plot of each feature's contribution to the prediction output of the best model, LightGBM (LGBM) optimized with Hyperband. SHAP (SHapley Additive exPlanations) is a game-theory-based interpretability method that consistently identifies, both locally and globally, each feature's individual contribution to model predictions. SHAP is important because it provides transparency over black-box models such as ensemble methods (including LGBM) and helps identify, quantitatively and visually, the features that most influence model decisions. In the plot, the horizontal axis shows the SHAP value, which indicates a feature's impact on the classification probability (positive or negative), while the color of each dot indicates the original feature value (blue = low, red = high).
Each dot represents a single observation in the test set. Features at the top of the graph are those that contribute most to model predictions. The results show that the URL_of_Anchor, SSLfinal_State, Prefix_Suffix, and Links_in_tags features contribute most significantly to the model's prediction output. For example, high values of the URL_of_Anchor feature (marked in red) tend to push the model's output in a positive direction (higher risk), while low values (marked in blue) lead to negative predictions (lower risk). A similar pattern is seen for the Abnormal_URL, Request_URL, and SFH features, which also substantially influence the classification. Thus, the SHAP analysis not only strengthens our understanding of how the model makes decisions but also opens opportunities for model simplification, feature selection, and increased user confidence in the classification system, particularly in the context of phishing detection. The identified key features can also form the basis for rule-based cyber-defense systems or serve as a focus for future data collection.

Figure 8. Visualization of the SHAP Summary Plot

3.6 Confusion Matrix Analysis
The results of the analysis are shown through confusion matrices for the various methods: Random Forest (RF), Extreme Gradient Boosting (XGB), Light Gradient Boosting Machine (LGBM), and the stacking technique. At baseline, the RF model showed 907 true negatives and 1230 true positives, XGB had 913 true negatives and 1228 true positives, and LGBM recorded 907 true negatives and 1226 true positives. The stacking method showed similar results, with 910 true negatives and 1232 true positives.
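The cell counts reported here come from 2×2 confusion matrices of the form sketched below. The data is a synthetic stand-in, so the counts will differ from the paper's:

```python
# Sketch of the confusion-matrix evaluation on a held-out test set: the four
# cells decompose the predictions into TN, FP, FN, TP.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4)

clf = RandomForestClassifier(random_state=4).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))
tn, fp, fn, tp = cm.ravel()  # rows = true class, columns = predicted class
```

With the paper's 80/20 split of 11,055 samples, the four cells of each matrix sum to the 2,211-sample test set, which is how accuracy can be cross-checked against the reported TN/TP counts.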
In the Hyperband-tuned setting, the RF model showed 896 true negatives and 1228 true positives, XGB had 907 true negatives and 1224 true positives, and LGBM recorded 912 true negatives with 1238 true positives. The tuned stacking model showed 917 true negatives and 1227 true positives. Despite variations in performance, all methods deliver quite good classification results; this analysis illustrates the strength of the different algorithms in identifying the target categories with relatively high effectiveness.

Figure 9. Confusion Matrices at Baseline and after Hyperband Tuning for All Models: (a) RF baseline, (b) XGB baseline, (c) LGBM baseline, (d) Stacking baseline, (e) RF Hyperband, (f) XGB Hyperband, (g) LGBM Hyperband, (h) Stacking Hyperband.

3.7 Summary of Findings
This experiment yielded several important findings that offer an in-depth assessment of the performance of the various classification models, both at the baseline stage and after hyperparameter optimization with HalvingRandomSearchCV. The stacking ensemble consistently performed best, both at baseline and after tuning. In the baseline phase, stacking achieved an accuracy of 96.88%, a recall of 98.17%, and an F1-score of 97.28%, outperforming the other individual models. Performance improved after tuning, reaching 96.97% accuracy, 97.77% recall, a 97.34% F1-score, and a 96.84% AUC-ROC. Although the increase is relatively small, it demonstrates the stability and reliability of the stacking model in maintaining classification quality. The individual models also responded positively to tuning. LightGBM recorded the most significant improvement, from 96.47% accuracy to 97.24%, and from 96.92% to 97.
60% in the F1-score, indicating that the model is highly responsive to parameter adjustments. In contrast, Random Forest and XGBoost experienced performance fluctuations, with Random Forest showing a slight decrease in accuracy and AUC while maintaining high recall. This suggests that tuning does not always guarantee improved performance once a model is already near-optimal.

Overall, the tuning process had a positive impact on model performance, particularly for LightGBM and the stacking ensemble. The ensembling approach, which combines the strengths of each individual model, yielded more balanced and stable classification results. These results matter most in real-world applications such as phishing detection and network-security classification, where high recall and minimal misclassification are key priorities. With its ability to capture a large proportion of incidents without significantly increasing the false-positive rate, the tuned stacking model is reliable for error-critical classification.

This research has several limitations, despite its high accuracy and interpretability on benchmark data. First, the model was trained and tested on static datasets (UCI and public repositories), which may not accurately represent the ever-changing nature of phishing techniques in real-world circumstances. Furthermore, SHAP-based interpretability, although insightful, adds processing cost during post-analysis, which may not be feasible in real-time detection systems. Because its performance on different phishing datasets or multilingual URLs has not yet been investigated, the generalizability of the proposed model also requires additional examination.
These constraints could be addressed in subsequent work by validating the approach on streaming data, testing it across various platforms, and including real-time model-adaptation strategies.

4. Conclusion
This research evaluated and compared the performance of several classification algorithms, namely Random Forest, XGBoost, LightGBM, and the stacking ensemble, both under baseline conditions and after hyperparameter tuning with HalvingRandomSearchCV. The evaluation showed that the stacking ensemble consistently provided the best classification performance, with an accuracy of 96.97%, a recall of 97.77%, and an AUC-ROC of 96.84% after tuning. This model not only excelled in global accuracy but also demonstrated an optimal balance between precision and sensitivity, making it a strong candidate for real-world applications requiring reliable classification, such as phishing detection systems. The feature scaling and tuning processes had a positive impact, particularly on the LightGBM model, which improved significantly on almost all evaluation metrics. Further analysis using SHAP (SHapley Additive exPlanations) demonstrated that model interpretability can also be strengthened by identifying the features most influential in predictive decisions. Thus, the model is not only precise but also understandable and transparently auditable.

Although the results are very promising, there are several directions for future development. First, data-balancing methods such as SMOTE or ADASYN could be explored to address imbalanced class distributions, which are common in real-world datasets. Second, deep learning architectures, such as LSTMs or Transformers for sequential data, could be an alternative for cases with temporal patterns.
Third, implementing and testing the model in a production environment or real-time pipeline is necessary to practically verify the system's scalability and efficiency. Finally, combining other explainable-AI techniques, such as LIME or counterfactual explanations, can enrich understanding of the model and increase end-user confidence in the developed system.

References