J-Icon : Jurnal Informatika dan Komputer, Vol. 13, October 2025. DOI: 10.35508/jicon
A COMPARATIVE STUDY OF SUPERVISED FEATURE SELECTION METHODS FOR PREDICTING UANG KULIAH TUNGGAL (UKT) GROUPS
Windy Chikita Cornia Putri1*, Wiyli Yustanti2, and Ervin Yohannes3
1,2,3 Faculty of Engineering, Universitas Negeri Surabaya, Surabaya, Indonesia
Email1*: windychikita@unesa. Email2: wiyliyustanti@unesa. Email3: ervinyohannes@unesa.

ABSTRAK
The determination of Uang Kuliah Tunggal (UKT) at public universities has so far relied on manual verification of socio-economic documents, a process that is prone to subjectivity and time-consuming. This study examines the effectiveness of five feature-selection techniques, a filter (Chi-Square), embedded methods (Random Forest Importance, LASSO), a wrapper (Recursive Feature Elimination), and unsupervised reduction (Exploratory Factor Analysis), in improving the performance of five classification algorithms (Decision Tree, Random Forest, SVM-RBF, K-Nearest Neighbour, Naïve Bayes) on the UNESA UKT dataset (9,369 entries, 53 variables). The data were pre-processed with imputation, scaling, encoding, and SMOTE-NC, then evaluated using stratified 5-fold cross-validation and a hold-out test set. The results show that using all 53 features yields a weighted-average accuracy of 0.6244 ± 0.0057. Feature selection with LASSO-13 and Chi-Square-13 significantly raised the mean accuracy to 0.7300 and 0.6775, respectively, while reducing training time by 40–70%. SVM-RBF with LASSO-13 achieved the highest accuracy (0.7939), followed by Random Forest with Chi-Square (0.699) and Decision Tree with LASSO (0.7111). A Friedman test on the distribution of model accuracies across the six conditions confirmed a significant difference (χ² = 15.06, p < 0.05). These findings affirm that feature selection, particularly LASSO and Chi-Square, can reduce data complexity (from 53 to 13 features) without sacrificing, and indeed improving, the predictive performance of UKT models.
Recommendations include integrating the selected methods into automated UKT verification and publishing the feature list for transparency. The novelty of this study lies in comparing five feature-selection methods within a single standardised preprocessing pipeline on real UNESA UKT data, producing a 13-feature subset that aligns with current UKT policy. These findings can be integrated into an automated UKT verification system to improve the accuracy and efficiency of decisions.
Kata Kunci: Uang Kuliah Tunggal, feature selection, UKT classification

ABSTRACT
The manual classification of Uang Kuliah Tunggal (UKT) groups at Indonesian public universities is laborious, subjective, and error-prone, especially given the explosion of socio-economic data captured via online admission portals. In this study, we evaluate five feature-selection techniques, the Chi-Square filter, Random Forest importance, Recursive Feature Elimination, LASSO embedded selection, and Exploratory Factor Analysis, on a dataset of 9,369 applicants described by 53 socio-economic variables. Five classifiers (Decision Tree, Random Forest, SVM-RBF, K-Nearest Neighbour, and Naïve Bayes) were tuned via stratified 5-fold cross-validation within an 80:20 train-test split. Performance was measured by accuracy, macro-F1, and training time, and differences in weighted-average accuracy across feature-selection scenarios were assessed using the Friedman test (χ² = 15.06, p < 0.05). Results show that reducing to 13 features via LASSO (weighted-average accuracy 0.7300) or Chi-Square (0.6775) significantly outperforms both the full-feature baseline (0.6244) and the EFA baseline, while cutting computational costs by over 40%. We conclude that supervised feature selection, particularly LASSO and Chi-Square, enables simpler, faster, and more transparent UKT prediction without sacrificing accuracy.
The novelty of this study lies in comparing five feature-selection methods within a standardized preprocessing pipeline on real UKT data from UNESA, resulting in a 13-feature subset aligned with the current UKT policy. This finding is ready to be integrated into an automated UKT verification system to enhance decision accuracy and efficiency.
Keywords: UKT, feature selection, UKT classification
*) Corresponding Author
Submitted : July 21, 2025
Accepted : August 12, 2025
Published : August 31, 2025
ISSN: 2337-7631 (Print); ISSN: 2654-4091 (Online)

INTRODUCTION
In an effort to ensure equitable access to higher education, the Indonesian Government issued Ministry Regulation No. 22/2015, which mandates a proportional Uang Kuliah Tunggal (UKT) based on a family's economic capacity. This scheme partitions students into eight bands (K1–K8) so that state subsidies can be distributed fairly. However, field implementation still relies heavily on manual verification, paper-based document checks and in-person interviews, which is time-consuming and prone to evaluator bias. The digital transformation of university admission portals now compels applicants to upload socio-economic evidence, ranging from parental payslips to proof of asset ownership. The present study analyses 9,369 student records described by 53 variables covering income, utility expenses, dwelling condition, and household characteristics. The volume and heterogeneity of these data introduce additional challenges. Selecting the most influential variables is critical because handling an excessively large feature set (high dimensionality) escalates computational complexity and complicates the formulation of UKT policies. Moreover, high-dimensional data trigger the curse of dimensionality, wherein pairwise distances become increasingly similar and distance-based algorithms such as K-Nearest Neighbour lose discriminative power.
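The distance-concentration effect mentioned above can be demonstrated numerically. The sketch below (function name and sample sizes are illustrative, not from the paper) draws random points in the unit hypercube and shows that the relative spread of pairwise Euclidean distances shrinks as dimensionality grows, which is exactly what degrades K-Nearest-Neighbour discrimination:

```python
import numpy as np

def relative_distance_spread(n=100, dims=(2, 50, 500), seed=0):
    """For n random points in [0, 1]^d, compute (max - min) / mean of all
    pairwise Euclidean distances. As d grows this ratio shrinks: distances
    'concentrate' and nearest/farthest neighbours become hard to tell apart."""
    rng = np.random.default_rng(seed)
    ratios = []
    for d in dims:
        X = rng.random((n, d))
        diff = X[:, None, :] - X[None, :, :]          # (n, n, d) pairwise differences
        dist = np.sqrt((diff ** 2).sum(axis=-1))      # (n, n) distance matrix
        vals = dist[np.triu_indices(n, k=1)]          # unique pairs only
        ratios.append((vals.max() - vals.min()) / vals.mean())
    return ratios

ratios = relative_distance_spread()
```

With these settings the ratio drops monotonically from 2 dimensions to 500, illustrating why feature reduction helps distance-based learners.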
Likewise, probabilistic models such as Naïve Bayes are hindered by strong inter-feature correlations, while powerful methods like Support Vector Machines and Random Forests demand substantial training time and computational resources. To mitigate these issues, feature-selection techniques become indispensable. Filter approaches, such as the Chi-Square test, can rapidly prune uninformative variables, whereas embedded strategies, such as LASSO and Random-Forest Importance, perform selection concurrently with model training. Wrapper approaches such as Recursive Feature Elimination (RFE) typically yield higher accuracy at the cost of greater computational expense, while Exploratory Factor Analysis (EFA) supplies an unsupervised dimensionality-reduction baseline. A wide spectrum of classifiers has already been explored for UKT modelling, including Decision Tree, Random Forest, radial-basis-function Support Vector Machine, K-Nearest Neighbour, and Naïve Bayes. Each offers distinct advantages: tree models are inherently interpretable, ensembles are resilient to over-fitting, margin-based methods excel in high-dimensional settings, and probabilistic models are computationally frugal on large datasets. At the national level, a previous study evaluated a combination of correlation-based feature selection and SVM for UKT classification, but its scope was confined to a single academic programme and it did not compare alternative selection schemes. Beyond supervised approaches, several studies have investigated unsupervised clustering to assess the suitability of UKT band structures; preliminary evidence indicates that mini-batch K-Means offers the most stable solution when internal and external validity indices are combined. Although prior work has assessed feature-selection effects in medical and financial data, few studies have focused on UKT band assignment with large, heterogeneous socio-economic variables. Moreover, no comprehensive investigation has contrasted five feature-selection techniques (Chi-Square, RF-Importance, RFE,
LASSO, EFA) within a unified pre-processing pipeline, evaluated across five baseline classifiers using accuracy, macro-F1, and computational cost. Accordingly, this study aims to (1) quantify the impact of the five feature-selection methods on UKT model performance, (2) identify a minimal subset (≤ 13 variables) that preserves or improves accuracy, and (3) recommend the most effective classification algorithm for an automated UKT decision system, thereby enabling decisions that are more objective, rapid, and transparent. Unlike previous research, this study is the first to conduct a comprehensive, head-to-head comparison of five feature-selection techniques (Chi-Square, Random-Forest Importance, Recursive Feature Elimination (RFE), LASSO, and Exploratory Factor Analysis (EFA)) within a unified pre-processing and evaluation pipeline. All methods are benchmarked across five commonly used classifiers (Decision Tree, Random Forest, SVM-RBF, K-NN, and Naïve Bayes) using a large real-world UKT dataset. Furthermore, the study not only measures predictive performance (accuracy and macro-F1) but also explicitly incorporates computational cost as a decision criterion. A key practical contribution is the identification of a minimal subset of 13 socio-economic variables, which preserves or even improves classification accuracy compared to the full 53-feature set. This number is particularly significant because it matches the current number of features used by Universitas Negeri Surabaya (UNESA) in its operational UKT determination process. By aligning model outputs with existing institutional workflows, the proposed feature set can be readily integrated into the current decision-making system, enabling a more objective, efficient, and scalable UKT assignment for nationwide adoption.

MATERIAL AND METHODS
Research Framework
The study follows the stages of the Knowledge Discovery in Databases (KDD) process, as illustrated in Figure 1.
The first stage of selection focuses on identifying relevant data sources. In this work, the raw dataset comprises 9,369 student records retrieved from the Admission Integrated System of Universitas Negeri Surabaya, covering applicants admitted through the 2023/2024 national selection schemes: Seleksi Nasional Berbasis Prestasi (SNBP) and Seleksi Nasional Berbasis Tes (SNBT).
Figure 1. Knowledge Discovery in Databases (KDD) Framework

Data Pre-Processing and Transformation
The pre-processing stage began with an initial cleansing step in which 112 duplicate records were removed. Missing values affecting approximately 10–15% of several attributes were imputed using the median for numerical variables and the mode for categorical variables. After cleaning, the working dataset comprised 53 socio-economic features, grouped as follows: Income and Financial Burden, such as father's salary, mother's salary, total instalments, total debt; Assets and Property, such as land area, building area, government property tax value or Nilai Jual Objek Pajak (NJOP), number of cars/motorcycles, jewellery, deposits; Utility Bills, such as electricity, internet, mobile airtime, water charges; Housing Conditions, such as roof material, floor material, wall type, and presence of an indoor …; and Household Characteristics, such as household size, number of siblings, number of school-aged siblings, and home-ownership status. The target label is the UKT tier assigned by the university's finance office, ranging from Group 1 (K1) to Group 8 (K8). The class distribution is notably imbalanced: K5 and K6 account for 26% and 24% of the records, respectively, whereas K1–K3 each represent only 7–12%.
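As an illustration of the cleaning rule above (median for numeric attributes, mode for categorical ones), here is a minimal stdlib-only sketch; the record layout and field names are hypothetical, not the actual UNESA schema:

```python
from statistics import median
from collections import Counter

def impute(rows, numeric_keys, categorical_keys):
    """Median-impute numeric fields and mode-impute categorical fields.
    Missing values are represented as None."""
    medians = {k: median(r[k] for r in rows if r[k] is not None)
               for k in numeric_keys}
    modes = {k: Counter(r[k] for r in rows if r[k] is not None).most_common(1)[0][0]
             for k in categorical_keys}
    out = []
    for r in rows:
        r = dict(r)  # do not mutate the caller's records
        for k in numeric_keys:
            if r[k] is None:
                r[k] = medians[k]
        for k in categorical_keys:
            if r[k] is None:
                r[k] = modes[k]
        out.append(r)
    return out

# Hypothetical toy records mimicking the dataset's structure.
records = [
    {"father_salary": 3.0, "roof_material": "tile"},
    {"father_salary": None, "roof_material": None},
    {"father_salary": 5.0, "roof_material": "tile"},
]
filled = impute(records, ["father_salary"], ["roof_material"])
```

Here the missing salary becomes the median of 3.0 and 5.0 (i.e. 4.0) and the missing roof material becomes the most frequent category, "tile".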
At the transformation stage, categorical variables were encoded as follows: ordinal attributes were converted via Ordinal Encoding, whereas nominal attributes were converted via One-Hot Encoding. In addition, three derived ratio features were engineered to capture key socio-economic relationships, namely income-expenditure balance, bedroom utilisation, and household electricity usage intensity. These variables are defined by Equations 1, 2, and 3:

Debt to Income (DTI) = (Total Instalments + Total Debt) / Total Income (1)

Bedroom Capacity (BC) = Number of People at Home / Number of Bedrooms (2)

Cost to Power (CP) = (Electricity Bill + Internet Bill) / Electrical Power (3)

Outliers were addressed by winsorising observations whose modified Z-score exceeded 3.5, thereby preventing extreme values from dominating the data distribution. Variables exhibiting pronounced right skew, namely total debt, deposit balance, jewellery value, and mobile credit expenditure, were then subjected to a log transformation. Finally, min-max normalisation was applied to rescale all continuous attributes to the [0, 1] interval, ensuring non-negative values and comparability across features.

Feature Selection
Running feature selection within each cross-validation fold has been shown to prevent data leakage and to yield models with superior generalisability. Accordingly, all five approaches listed in Table 1 (Chi-Square, Random Forest Importance, Recursive Feature Elimination, LASSO, and Exploratory Factor Analysis) were executed exclusively on the training portion of every fold. The resulting subset of features was then frozen and applied unchanged to the corresponding validation fold and to the final hold-out test set.
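The outlier rule described in this section (winsorising observations whose modified Z-score exceeds 3.5) can be sketched as below. Clipping offending values to the threshold-equivalent bound is one common winsorisation convention and is an assumption here, since the paper does not spell out the exact clipping rule:

```python
from statistics import median

def winsorise_modified_z(x, threshold=3.5):
    """Clip values whose modified Z-score, 0.6745 * (x - median) / MAD,
    exceeds |threshold|, to the boundary value implied by the threshold."""
    med = median(x)
    mad = median(abs(v - med) for v in x)  # median absolute deviation
    if mad == 0:
        return list(x)                     # no spread: nothing to clip
    # |modified Z| > threshold  <=>  |x - med| > threshold * MAD / 0.6745
    limit = threshold * mad / 0.6745
    return [min(max(v, med - limit), med + limit) for v in x]

y = winsorise_modified_z([1, 2, 2, 3, 2, 100])  # 100 is pulled back toward the bulk
```

The extreme value 100 is clipped to roughly 4.6 (median 2, MAD 0.5), while the ordinary observations pass through unchanged.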
Table 1. Feature-Selection Scenarios for UKT Classification Modelling

Chi-Square (Filter)
Within-fold workflow: Apply the χ² test to categorical features, rank by score, and select the top 13. The number 13 was chosen based on UNESA's current UKT verification policy, and cross-validation tuning showed no significant accuracy gains beyond this point.
Strengths: Extremely fast; model-agnostic; well suited to large datasets.
Limitation: Ignores inter-feature correlation; cannot score continuous socio-economic variables without prior discretisation.

Random Forest Importance (Embedded)
Within-fold workflow: Train a Random Forest model, compute Gini importance for all features, and retain the top 13. The choice of 13 features follows the same policy rationale as above, ensuring both operational relevance and computational efficiency.
Strengths: Captures interaction and non-linearity; stable.
Limitation: Tends to favour high-cardinality features; roughly 4× slower than Chi-Square.

Recursive Feature Elimination, RFE (Wrapper)
Within-fold workflow: Fit an L1-regularised logistic regression model, iteratively remove the least important features, and stop when 13 remain. The stopping point of 13 features was predefined to match UNESA's UKT feature policy and validated through performance-plateau analysis in cross-validation.
Strengths: Considers the joint contribution of features; handles mixed data.
Limitation: Highest computational cost (up to 20× Chi-Square).

LASSO (Embedded)
Within-fold workflow: Train an L1-penalised logistic regression with standardised features and retain the 13 largest non-zero coefficients. The number 13 was fixed based on UNESA's operational policy, with cross-validation indicating an accuracy plateau beyond this point.
Strengths: Simultaneous selection along the regularisation path; mitigates over-fitting.
Limitation: May arbitrarily drop correlated yet relevant features; requires feature standardisation so that coefficients are comparable.
Exploratory Factor Analysis (EFA)–Varimax (Unsupervised)
Within-fold workflow: Standardise features via Z-score, extract 13 latent factors, apply Varimax rotation, and use the factor scores as model inputs. The number of factors (13) was aligned with the policy-based target to ensure comparability across methods.
Strengths: Compresses multicollinearity; label-free.
Limitation: Ignores the target, so it typically yields the lowest accuracy; factor interpretation may be ambiguous.

Classification Algorithms
Selecting a diverse set of classifiers, from interpretable models (Decision Tree) and ensemble learners (Random Forest) to margin-based methods (RBF-kernel SVM), instance-based approaches (K-Nearest Neighbour), and lightweight probabilistic models (Gaussian Naïve Bayes), ensures that multiple learning paradigms are examined. Each estimator is tuned via a stratified five-fold GridSearchCV, using macro-F1 as the optimisation target, a procedure widely regarded as best practice for modern tabular-data benchmarks. The candidate algorithms and the corresponding hyperparameter grids explored in this study are summarised in Table 2.

Table 2. Classification Algorithms and Corresponding Hyperparameter Tuning Settings
1. Decision Tree (DT): max_depth {10, 20, unlimited}; min_samples_ {1, 5, 10}
2. Random Forest (RF): n_estimators {200, 400}; max_depth {20, 40, None}; max_features {sqrt, log2}
3. Support Vector Machine (SVM-RBF): C {1, 10, 100}; gamma {0.001, 0.01, 0.…}
4. K-Nearest Neighbour (K-NN): n_neighbors {3, 5, 7}; weights {uniform, distance}; metric {euclidean}
5. Naïve Bayes (NB): var_smoothing {1 × 10^…}

RESULTS AND DISCUSSION
This section analyses the outcomes of 30 experimental runs, produced by crossing six dataset conditions with five classification algorithms. Table 3 presents the baseline experiment where no feature selection was applied.
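A minimal sketch of this evaluation protocol, with feature selection nested inside the pipeline so that it is refit on each training fold and GridSearchCV optimising macro-F1, is shown below using scikit-learn on synthetic data. The data, k=5, and the reduced parameter grid are illustrative stand-ins for the paper's 53-feature, k=13 setup:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the UKT matrix: non-negative ordinal-ish features
# (chi2 requires non-negative inputs); the label depends on features 0 and 1.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(300, 20)).astype(float)
y = (X[:, 0] + X[:, 1] > 4).astype(int)

# Feature selection sits INSIDE the pipeline, so the chi-square ranking is
# recomputed on each training fold only -- no leakage into validation folds.
pipe = Pipeline([
    ("select", SelectKBest(chi2, k=5)),              # paper: k=13 out of 53
    ("clf", DecisionTreeClassifier(random_state=0)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"clf__max_depth": [10, 20, None]},   # small subset of Table 2's grid
    scoring="f1_macro",                              # macro-F1, as in the paper
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
```

Because the selector is a pipeline step, `grid.best_estimator_` carries the frozen 5-feature mask learned on training data, which can then be applied unchanged to a hold-out test set, mirroring the within-fold procedure described above.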
Using the complete set of 53 input features, the RBF-kernel SVM achieved the highest classification accuracy but required the longest training time. The Decision Tree yielded reasonably good accuracy with a significantly faster training time (under two seconds), while Random Forest imposed higher computational demands without a corresponding gain in accuracy. K-Nearest Neighbour exhibited degraded performance due to the curse of dimensionality, and Gaussian Naïve Bayes failed to generalise effectively. The average accuracy across all models was approximately 0.514, highlighting substantial room for improvement through feature selection, which could enhance both predictive accuracy and computational efficiency.

Table 3. Classification Performance on the Dataset without Feature Selection
Model | Accuracy | F1 Score | Precision | Recall
SVM (RBF) | … | … | … | …
Random Forest | … | … | … | …
Decision Tree | … | … | … | …
K-Nearest Neighbor | … | … | … | …
Naïve Bayes | … | … | … | …
Average | … | … | … | …

Subsequently, Table 4 presents the classification performance after applying EFA-based feature selection. Once the original 53 variables were compressed into 13 latent factors, the predictive performance of all classifiers declined considerably. Although the RBF-kernel SVM remained the top performer, its accuracy dropped markedly. Random Forest and Decision Tree followed with comparable scores of roughly 0.277 and lower. Similarly, macro-level F1 scores and precision decreased, falling to around 0.25. This consistent deterioration across models suggests that the latent factors derived from Exploratory Factor Analysis failed to retain the discriminative characteristics required to distinguish among the eight UKT categories. Therefore, EFA appears to be an inadequate strategy for supervised feature selection in this context.

Table 4. Classification Performance on the Dataset Using EFA Feature Selection
Model | Accuracy | F1 Score | Precision | Recall
SVM (RBF) | … | … | … | …
Random Forest | … | … | … | …
Decision Tree | … | … | … | …
K-Nearest Neighbor | … | … | … | …
Naïve Bayes | … | … | … | …
Average | … | … | … | …

Table 5 reports the results obtained after retaining the top 13 variables ranked by the Chi-Square test. Compared to the EFA scenario, and for most classifiers even the full 53-feature baseline, all models display a consistent improvement in predictive performance. The RBF-kernel SVM again emerged as the best-performing model, achieving an accuracy of 0.738 and a macro-F1 score of 0.690, only marginally below its baseline score, while utilising a substantially reduced input matrix. Random Forest also showed marked gains, with an accuracy of 0.699 and an F1 score of 0.650, while Decision Tree and K-Nearest Neighbour performed reliably within the 0.620–0.650 range. The most dramatic improvement was observed for Gaussian Naïve Bayes, whose accuracy increased markedly from its full-feature baseline, suggesting that the Chi-Square filter successfully removed interdependent features that previously undermined its assumption of independence. Overall, the macro-averaged scores (accuracy 0.631 and F1 0.590) underscore the effectiveness of this simple statistical test in preserving, and often enhancing, model performance while reducing feature dimensionality.

Table 5. Classification Performance on the Dataset Using Chi-Square Feature Selection
Model | Accuracy | F1 Score | Precision | Recall
SVM (RBF) | … | … | … | …
Random Forest | … | … | … | …
Decision Tree | … | … | … | …
K-Nearest Neighbor | … | … | … | …
Naïve Bayes | … | … | … | …
Average | … | … | … | …

Table 6 reports the results obtained after applying LASSO-based (L1-penalised) embedded feature selection. This approach proved highly effective in preserving the most informative signals while discarding redundant attributes. The RBF-kernel SVM achieved the highest overall performance, with an accuracy of 0.7939 and a macro-F1 score of 0.76, slightly surpassing the full 53-feature baseline despite operating on only 13 selected variables. Random Forest and Decision Tree models also recorded improved performances, reaching accuracy levels of 0.7554 and 0.7111, respectively, suggesting that tree-based learners benefit from the systematic elimination of redundant features. K-Nearest Neighbour remained stable at around 0.6492, while Gaussian Naïve Bayes showed a considerable increase in accuracy, an improvement over the baseline, though still the lowest among the classifiers due to its strong independence assumption. On average, across all classifiers, LASSO yielded the highest macro metrics, confirming its value as a balanced method for optimising predictive accuracy, computational cost, and model interpretability.

Table 6. Classification Performance on the Dataset with LASSO Feature Selection
Model | Accuracy | F1 Score | Precision | Recall
SVM (RBF) | … | … | … | …
Random Forest | … | … | … | …
Decision Tree | … | … | … | …
K-Nearest Neighbor | … | … | … | …
Naïve Bayes | … | … | … | …
Average | … | … | … | …

Table 7 presents the classification performance after applying feature selection using Random Forest importance on 13 features. The SVM-RBF model achieved the highest accuracy of 0.7503, followed by Random Forest at 0.7042 and the Decision Tree. K-Nearest Neighbour maintained stable performance at 0.6406, while Gaussian Naïve Bayes lagged significantly at 0.0363 due to its strong conditional independence assumption, which was less suited to the feature interactions in the dataset. The overall macro averages indicate a moderate improvement over the baseline in certain models, confirming that Random Forest-based feature selection can benefit complex learners like SVM and Random Forest, although its impact is less pronounced for distance-based and probabilistic models.

Table 7. Classification Performance on the Dataset with Random Forest Feature Selection
Model | Accuracy | F1 Score | Precision | Recall
SVM (RBF) | … | … | … | …
Random Forest | … | … | … | …
Decision Tree | … | … | … | …
K-Nearest Neighbor | … | … | … | …
Naïve Bayes | … | … | … | …
Average | … | … | … | …

Table 8 illustrates the classification results when Recursive Feature Elimination (RFE) was used for feature selection. In this scenario, overall model performance declined. Although SVM-RBF maintained its position as the most accurate model, both Random Forest and Decision Tree saw noticeable drops in performance relative to the LASSO and Chi-Square scenarios. K-NN remained steady at around 0.6406, while Naïve Bayes experienced a drastic accuracy decline to 0.0363, likely due to the removal of key probabilistic features. The average macro-accuracy was the second lowest across all selection methods, suggesting that the wrapper-based RFE approach, particularly when using L1-regularised logistic regression as the base estimator, may not be well suited for handling complex multicollinearity in socio-economic UKT classification data.

Table 8. Classification Performance on the Dataset with RFE Feature Selection
Model | Accuracy | F1 Score | Precision | Recall
SVM (RBF) | … | … | … | …
Random Forest | … | … | … | …
Decision Tree | … | … | … | …
K-Nearest Neighbor | … | … | … | …
Naïve Bayes | … | … | … | …
Average | … | … | … | …

After obtaining the classification performance results from all experimental scenarios, a statistical test was conducted to verify whether there were significant differences among the experimental scenarios. A non-parametric statistical approach, the Friedman test, was employed to examine differences in mean performance. The null hypothesis states that there is no significant difference in classification performance among the six scenarios, while the alternative hypothesis posits that at least one of the scenarios yields a performance outcome that differs significantly from the others. The Friedman test yielded χ² = 15.06 with a p-value below 0.05.
Since the p-value is less than 0.05, the null hypothesis (H₀) is rejected. This indicates that there is a statistically significant difference in model accuracy across the different experimental scenarios. To determine which scenario differs the most, a weighted-average accuracy can be computed by combining the performance of each model while accounting for its stability; in this approach, more consistent models contribute more to the overall score. Based on this calculation, the comparative mean accuracy across the six scenarios is presented in Figure 2.
Figure 2. Comparison of Weighted Average Accuracy by Feature Selection Method
Based on Figure 2, it can be concluded that UKT classification modelling performs most optimally when feature selection prioritises embedded methods (LASSO) or statistical filters (Chi-Square). These methods effectively balance complexity reduction and preservation of relevant signals, resulting in significantly higher weighted-average accuracy compared to using all features or latent-extraction techniques (EFA).

CONCLUSION AND RECOMMENDATIONS
Feature selection has proven crucial in improving both the accuracy and efficiency of the UKT classification models. The embedded LASSO-13 method achieved the highest weighted-average accuracy (0.7300), followed by Chi-Square-13 (0.6775); both outperform the baseline with all 53 features (ALL-53, 0.6244) and far surpass the latent-factor approach (EFA-13). The Friedman test confirmed a statistically significant difference between experimental conditions (p < 0.05), reinforcing the notion that appropriate feature selection, particularly via LASSO or Chi-Square, can reduce complexity (from 53 to 13 variables) without sacrificing, and indeed enhancing, model performance. In the context of UKT socio-economic data,
LASSO excels because it simultaneously performs variable selection and regularisation, effectively discarding redundant or weakly correlated indicators, such as overlapping expense variables, while retaining the most discriminative socio-economic attributes. Conversely, Chi-Square rapidly ranks categorical variables by the strength of their dependency with the UKT bands, making it effective for pruning non-informative survey responses. These mechanisms are well suited to the high-dimensional and partially redundant nature of UKT applicant datasets. From the classification-algorithm perspective, the Support Vector Machine (SVM) with an RBF kernel consistently outperformed the others, yielding the highest accuracy across all scenarios, including the full 53-feature setup as well as after LASSO (0.7939) and Chi-Square (0.738) selection. In summary, the SVM-RBF with LASSO-13 configuration emerged as the overall best-performing model. However, this study is limited by the fact that the dataset originates solely from Universitas Negeri Surabaya, which may not fully represent socio-economic distributions or application patterns in other Indonesian universities. Furthermore, the model's validity could be affected if national UKT policies or eligibility criteria change in the future, requiring retraining or recalibration. Future research may explore hybrid combinations (filter + embedded), alternative methods such as ElasticNet or Boruta, and the inclusion of qualitative variables to further enhance the predictive accuracy of the UKT system.

REFERENCES