Computer Science (CO-SCIENCE) Volume 6 Issue 1 January 2026
Accreditation Sinta 4 No. SK: 230/E/KPT/2022

Analysis of Student Academic Performance Using Random Forest and Support Vector Machines

Galih Mifta Agung1, Robi Aziz Zuama2*, Eko Setia Budi3
1,2,3 Universitas Bina Sarana Informatika, Jl. Kramat Raya No. 98, Kwitang, Kec. Senen, Jakarta Pusat, Indonesia
e-mail: 1galihmiftaagung@gmail.com, 2robi.rbz@bsi.id, 3eko.etb@bsi.
(*) Corresponding Author

Article Info: Received: 01-10-2025 | Revised: 10-12-2025 | Accepted: 19-12-2025

Abstract

Assessing student academic performance objectively remains a challenge at SMP Negeri 16 Bogor due to diverse internal and external factors in student records. This study compares the classification performance of the Random Forest and Support Vector Machine (SVM) algorithms using a dataset of 403 students containing demographic, socioeconomic, and school-related attributes. Although these attributes are not traditional academic indicators (e.g., assignment or exam scores), they are used to explore whether non-academic features can contribute to predictive models. Following data preprocessing (handling missing values, encoding categorical variables, and managing class imbalance), both algorithms were evaluated using accuracy, precision, recall, and confusion matrix analysis. Results show that SVM outperforms Random Forest with 78.00% accuracy, 89.98% precision, and 70.24% recall. These findings indicate that SVM is more robust for imbalanced classification tasks and can provide useful insights even when academic-performance labels are predicted from non-academic attributes.

Keywords: Academic Performance, SVM, Random Forest, Classification, Confusion Matrix

INTRODUCTION

Information technology has significantly transformed various aspects of human life, including education.
Large educational datasets, once limited to administrative records, are now recognized as valuable sources of information that can support data-driven decision-making through advanced data analysis techniques. One of the most widely adopted methods in this domain is data mining, which enables the discovery of meaningful patterns and relationships within large volumes of data (Yac, 2022). Educational Data Mining (EDM), a subfield of data science, has become instrumental in improving teaching quality and understanding the factors that influence students' learning outcomes (Gul et al.).

Student academic performance is a crucial indicator of educational success and institutional effectiveness. However, schools often face challenges in analyzing performance due to the numerous internal and external factors that influence learning outcomes. For instance, variables such as students' previous school background, gender, age, residential environment, and family conditions can significantly affect academic achievement. SMP Negeri 16 Bogor, for example, faces difficulties in evaluating student performance objectively due to the diversity of these influencing factors. Identifying these key determinants is essential to help both schools and parents understand students' learning needs more effectively and provide appropriate support.

To address such challenges, recent studies have applied machine learning (ML) techniques to predict and analyze academic performance (Khosravi & Azarnik; Ying & Ma). ML, as a branch of artificial intelligence (AI), enables computers to process data, build predictive models, and make informed decisions without explicit programming (Ghosh et al., 2022). It has been increasingly utilized in educational research to identify patterns in student data and to forecast future academic outcomes (Jawad et al.). Recent literature has further demonstrated improvements in predictive accuracy when handling class imbalance (Althaqafi et al.)
and when emphasizing feature importance in RF models (Nachouki et al.). Within this domain, classification is one of the most common tasks: algorithms are trained on labeled datasets to categorize new data into specific performance groups (Dahal & Shakya). Classification techniques can handle diverse data types and provide insights into how various attributes contribute to academic results. Among the popular classification algorithms, Random Forest (RF) and Support Vector Machine (SVM) have demonstrated strong performance in predicting student outcomes (Yac, 2022; Gul et al.). RF is an ensemble learning method that constructs multiple decision trees from random subsets of data and combines their outputs to improve prediction accuracy and reduce overfitting (Ying & Ma). Comparative analyses suggest that RF tends to outperform other models, including SVM, when structured properly (Chen & Jin), yet SVM and its regression variant remain relevant in nuanced settings (Durai et al.). RF is robust to missing values and efficient for handling large datasets. SVM, on the other hand, is a discriminative model that identifies the optimal hyperplane separating data classes by maximizing the margin between them (Ghosh et al., 2022). While SVM offers high accuracy and is effective for both linearly and non-linearly separable data, it may perform inconsistently when features overlap or data imbalance occurs (Jawad et al.).

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Copyright 2026 The Author. Computer Science (CO-SCIENCE), E-ISSN: 2774-9711 | P-ISSN: 2808-9065.

Previous studies have implemented various algorithms to classify and predict student academic performance. For instance, (Muhaimin et al.) used the K-Nearest Neighbor (KNN) algorithm to classify students based on academic scores and discipline, while (Budiyanto et al.)
developed a prediction model for cum laude graduation rates using machine learning. (Azizah et al.) applied the Decision Tree method to produce interpretable prediction models, and (Gori et al.) integrated Naïve Bayes with Correlation-Based Feature Selection (CFS) for better classification accuracy. Other studies, such as (Naibaho & Zahra), used Decision Tree, Random Forest, and Extreme Gradient Boosting to predict student graduation rates. Moreover, systematic reviews show that many studies still focus on higher education and limited attributes (Rodrigues et al.), which underscores the need for the broader attribute scope adopted in this study.

While many previous studies focus largely on higher education or rely heavily on academic attributes such as exam scores, assignment grades, or attendance (Rodrigues et al.), research using broader non-academic attributes remains limited. This creates the gap that this study aims to address. Additionally, inconsistencies in earlier descriptions of the dataset size (1,153 vs. 403) are clarified in this study: after preprocessing and data validation, the final usable dataset consisted of 403 student records. To address these limitations, this study incorporates a wider range of demographic, socioeconomic, and environmental attributes (such as parental education, occupation, household income, type of residence, transportation mode, travel distance, and travel time) to explore their relationship with student academic performance. Furthermore, this study provides a comparative evaluation of Random Forest and Support Vector Machine, addressing reviewer concerns about the need for clearer justification of algorithm selection by focusing on their known strengths and weaknesses in handling imbalanced and heterogeneous data.
Through this comparative approach, the study aims to identify which model performs more effectively for student performance classification at SMP Negeri 16 Bogor, providing practical insights for data-driven decision-making within the school.

RESEARCH METHOD

This research uses a quantitative method with data collection through observation, interviews, and literature study, conducted at SMP Negeri 16 Bogor. The collected data are then processed using machine learning methods to obtain academic performance predictions based on accuracy and model evaluation values (Romero & Ventura; Han et al.). The testing method uses the confusion matrix to evaluate the performance of the classification model applied in predicting student academic performance. The confusion matrix is used to calculate accuracy, precision, and recall. This evaluation works by comparing the model's predictions with the actual data, which provides an overview of how well the model classifies student data into the correct categories. More specifically, the confusion matrix is built from True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). TP is the number of positive data points correctly classified by the system; TN is negative data correctly identified as negative; FP occurs when data is actually negative but classified as positive; and FN is positive data classified as negative (Ainurrohmah).

Figure 1 below shows the steps in this research: Start → Data Collection → Data Preprocessing → Modelling → Modelling Result Evaluation → End.

Source: Research Result
Figure 1. Research Framework
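The accuracy, precision, and recall computations described above can be sketched in plain Python. This is only an illustration: the study itself computed these values in RapidMiner, and the labels below are hypothetical.

```python
# Minimal sketch of confusion-matrix-based metrics over the four grade
# classes A-D. Per-class TP/TN/FP/FN are counted by treating one class
# at a time as the "positive" class, then macro-averaged.
LABELS = ["A", "B", "C", "D"]

def class_counts(y_true, y_pred, positive):
    """Return (TP, TN, FP, FN) for one class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, tn, fp, fn

def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision and recall."""
    precisions, recalls = [], []
    for label in LABELS:
        tp, _, fp, fn = class_counts(y_true, y_pred, label)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, sum(precisions) / len(LABELS), sum(recalls) / len(LABELS)

# Hypothetical true/predicted labels, purely for illustration:
acc, prec, rec = macro_metrics(
    ["A", "B", "B", "C", "D", "B", "C", "A"],
    ["A", "B", "C", "C", "B", "B", "C", "B"],
)
```

Macro averaging weights each class equally, which makes it the more informative view on an imbalanced grade distribution such as the one analyzed in this study.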
Data Collection
Data were collected from SMP Negeri 16 Bogor through observation, interviews, and documentation. The raw dataset initially contained 1,153 student records; after preprocessing, cleaning, and removal of incomplete or invalid entries, 403 records remained and were used for model development. The final dataset consists of 403 students, each with demographic, socioeconomic, environmental, and limited academic attributes. These include gender, type of residence, parental education, parental occupation, parental income, number of siblings, distance to school, travel time, and subject scores. The target variable is Academic Performance, categorized into four classes: A (Very Good), B (Good), C (Sufficient), and D (Poor). The class distribution is imbalanced: A = 67, B = 156, C = 134, D = 46.

Data Preprocessing
Data preprocessing included several steps:

Handling missing values. Missing numerical attributes were imputed using mean imputation, while categorical attributes were imputed using mode imputation. Records with excessive missing attributes or invalid entries were removed, resulting in the final dataset of 403 rows.

Encoding categorical data. Categorical attributes (e.g., gender, parental occupation, residence type) were converted to numerical format using Label Encoding in RapidMiner. Example: Type of Residence: With Parents = 1, Guardian = 0; Gender: Male = 1, Female = 0.

Normalization. Numerical features (distance, travel time, income range, subject scores) were normalized using Min-Max scaling to standardize the feature range to 0-1, improving SVM's sensitivity to scale differences. Formula: X' = (X - min) / (max - min).

Dataset splitting. The data partitioning was as follows:
- Training data: 80%
- Testing data: 20%
- Sampling strategy: stratified sampling to preserve the class distribution.
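Two of the preprocessing steps above, label encoding and Min-Max scaling, can be sketched in plain Python. The mapping mirrors the paper's example (With Parents = 1, Guardian = 0), but the numeric values below are illustrative, not taken from the dataset.

```python
# Label encoding of a categorical attribute, mirroring the paper's mapping.
RESIDENCE_CODES = {"With Parents": 1, "Guardian": 0}

def min_max_scale(values):
    """Apply X' = (X - min) / (max - min), mapping a column into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: map everything to 0 to avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical column values, purely for illustration:
residences = ["With Parents", "Guardian", "With Parents"]
encoded = [RESIDENCE_CODES[r] for r in residences]   # -> [1, 0, 1]

distances_km = [0.5, 1.2, 3.0, 7.5]    # hypothetical distance-to-school column
scaled = min_max_scale(distances_km)   # every value now lies in [0, 1]
```

Scaling all numeric columns into the same 0-1 range is what keeps distance-based and margin-based models such as SVM from being dominated by the attribute with the largest raw magnitude.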
Handling Class Imbalance
The dataset shows significant imbalance across the four performance labels. To mitigate bias during model training, SMOTE (Synthetic Minority Oversampling Technique) was applied only to the training set, not the testing set. SMOTE parameters:
- k-neighbors = 5
- Oversampling target classes: A and D
- Sampling strategy: auto (balances to the majority class level)

Feature Selection
Feature importance was evaluated using Correlation-Based Feature Selection (CFS) and Recursive Feature Elimination (RFE). These techniques reduce redundancy and improve model interpretability, as suggested by (Nachouki et al.) and (Gori et al.).

Correlation-Based Feature Selection (CFS). CFS identifies attribute subsets with high correlation to the target variable but low inter-correlation, reducing redundancy. Selection criteria:
- Merit score threshold based on symmetrical uncertainty
- Features retained when merit > 0.

Recursive Feature Elimination (RFE). RFE iteratively removes the least important features using a base estimator (Random Forest) until performance no longer improves. RFE parameters:
- Base model: Random Forest
- Number of features selected: 29
- Elimination step size: 1 attribute per iteration

Rationale: combining CFS and RFE improves interpretability and reduces noise in the dataset.

Modeling
Two machine learning algorithms were implemented and evaluated for predicting student academic performance: Random Forest (RF) and Support Vector Machine (SVM). Random Forest is an ensemble-based classifier that combines multiple decision trees to improve generalization and reduce overfitting (Han et al.). Table 1 shows the hyperparameter settings for Random Forest. Support Vector Machine is a discriminative classifier that constructs an optimal hyperplane to separate data classes with maximum margin (Durai et al.).
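The core SMOTE idea used under Handling Class Imbalance above, interpolating between a minority sample and one of its k nearest same-class neighbours, can be illustrated with a toy sketch. This is a stand-in for a real implementation (e.g., imbalanced-learn's SMOTE, not used in the paper, which worked in RapidMiner), and the points are synthetic.

```python
import random

def smote_like(minority_points, n_new, k=5, rng=None):
    """Generate n_new synthetic points by interpolating between a random
    minority sample and one of its k nearest same-class neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_points)
        # k nearest same-class neighbours by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority_points if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Three synthetic minority samples (already normalized to [0, 1]):
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like(minority, n_new=4, k=2)
```

As in the paper, such oversampling should be applied to the training split only, so the test set keeps the school's real class distribution.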
Table 2 shows the hyperparameter settings for Support Vector Machine (SVM).

Table 1. Hyperparameter Settings for Random Forest
Algorithm: bootstrap-based ensemble of decision trees
Hyperparameters:
- Number of Trees: 100
- Maximum Tree Depth: 10
- Split Criterion: Gini Index
- Sampling Type: bootstrap sampling
- Features per Split: √(number of features)
- Minimum Samples per Leaf: 1

Table 2. Hyperparameter Settings for Support Vector Machine (SVM)
Kernel: RBF (Radial Basis Function)
Hyperparameters:
- Kernel Type = RBF
- Gamma = 0.
- C (Regularization) = 1.
- Convergence Epsilon = 0.
- Max Iterations = 1,000

Evaluation
After the models are built, the performance of each algorithm is evaluated using metrics such as accuracy, precision, and recall. This evaluation process aims to determine which model provides the best results in classifying student academic performance (Tan et al.; Jawad et al.).

Modeling Results
The final step is to compare the evaluation results of the Random Forest and SVM models and to interpret which of the two algorithms produces the best results.

RESULTS AND DISCUSSION

The results and discussion in this research cover several stages: data collection, preprocessing, modeling, and evaluation.

Data Overview and Class Distribution
The data in this research were obtained from SMP Negeri 16 Bogor, comprising 403 student records. The student data has attributes such as personal identity, family background, and academic and socioeconomic information. In addition, the data collection was carried out directly from the school authorities with official permission, and all data used are internal and sourced from the school's data collection system. Table 3 below presents the attribute identification in this research.
Table 3. Attribute Identification

Attribute Name: Information
- Gender: M/F
- Place of birth: City/district where the student was born
- Religion: Religion practiced by the student
- Address: Student's residential address
- Type of Residence: With whom the student currently resides
- Means of transportation: Transportation to school
- KPS Recipients: Social assistance (KPS) recipient status
- Father's Education: Father's last level of education
- Father's occupation: Father's current job
- Father's Income: Range 1-5
- KIP recipients: Status of Smart Indonesia Card (KIP) recipients
- Eligible for PIP (school proposal): Is the student proposed to receive PIP?
- Reasons for Eligibility for PIP: Reasons for eligibility for PIP
- Special Needs: Whether the student has special needs
- The origin of the school: Elementary school before entering junior high school
- What order are you in the family: Order of children in the family
- Number of Siblings: Number of the student's siblings
- Distance from Home to School (km): Distance from the student's home to school in kilometers
- Subject Grades: Grades per subject: PABP, PKN, BIND, MTK, IPA, IPS, BING, SB, PJOK, INF, BSUN, PLH, JML (total grades)

Source: Research Result

Table 3 shows that there are 71 attributes in the student data, but not all of them are used in this research, because not all attributes are suitable for, or directly influence, academic performance (e.g., student identification numbers and personal identity). Therefore, in the preprocessing stage, an attribute selection process is carried out for the attributes considered influential in the modeling process. Furthermore, Figure 2 below shows the percentage distribution of student classes based on grade.

Source: Research Result
Figure 2.
Data Class Target

Figure 2 shows the distribution of student grade classes from a total of 403 students. Class B (Good), with a school score range of 1096-1119, has the highest percentage at 39%, followed by class C (Sufficient), with a score range of 1072-1095, as the second highest at 33%. Students in class A (Very Good), with a score range of 1120-1143, account for 17%, while class D (Poor), with a score range of 1046-1071, has the lowest percentage at 11%.

Preprocessing
In the preprocessing stage, the data are processed before being used to build the classification model, so that the data are ready for the modeling stage, free from unsuitable entries, and adjusted to the algorithm's data format. The preprocessing stages include feature selection, conversion of nominal data to numerical, and normalization.

Feature Selection
Feature selection is the process of determining which attributes are most appropriate for the classification model. The goal of this attribute selection is to reduce data dimensionality so that only attributes that influence academic outcomes are used in the modeling. The 29 attributes used in this research include: Name, Gender, Type of residence, Means of transportation, Father's education level, Father's occupation, Father's income, Number of siblings, Distance from home to school (km), Travel time, Value, and Grade.

Conversion of Nominal Data to Numerical
In this preprocessing stage, nominal data are converted to numerical form using the RapidMiner "Nominal to Numerical" operator. Table 4 below shows the results.

Table 4. Result of Conversion of Nominal Data to Numerical
Columns: Student No.; Grade; Type of Residence = With Parents; Type of Residence = Guardian
Source: Research Result

Normalization
In the normalization stage, the ranges of the attribute values are rescaled to the same scale, between 0 and 1. This is done because many attributes, such as school scores, have a different range from other attributes. Table 5 below shows the results.

Table 5. Normalization
Columns: Grade; Number of siblings; Distance from Home to School (km)
Source: Research Result

Model Evaluation
This evaluation stage presents the Random Forest evaluation results, the SVM evaluation results, and the comparison between the two.

Random Forest Results
The modeling results using the Random Forest algorithm, based on the RapidMiner output, can be seen in Table 6 below.

Table 6. Random Forest Evaluation Results
Accuracy: 69.00%
Per-class recall: A = 23.53%, B = 92.31%, C = 81.82%, D = 18.18%
Average (Rata-Rata) Recall: 53.96%
Average (Rata-Rata) Precision: 83.54%
Source: Research Result

From this matrix, the metrics in Table 7 were computed:

Table 7. Random Forest Corresponding Evaluation Metrics
Metrics: Accuracy; Macro Precision; Macro Recall; Macro F1-Score (58-60%); Balanced Accuracy
Source: Research Result

The Random Forest evaluation results in Table 6 show that the model has an accuracy of 69%, reflecting that the model predicts student academic performance reasonably well. However, the recall for class A is only 23.53% and for class D only 18.18%, indicating that the model has difficulty identifying students who actually belong to these two classes due to class imbalance. Furthermore, Table 6 reports the evaluation results of the Random Forest algorithm.
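As a quick arithmetic check, the average recall reported for Random Forest in Table 6 is the plain (macro) mean of the four per-class recall values:

```python
# Per-class recalls from Table 6 (in percent); their plain mean reproduces
# the reported "Rata-Rata" (average) recall of 53.96%.
per_class_recall = {"A": 23.53, "B": 92.31, "C": 81.82, "D": 18.18}
macro_recall = sum(per_class_recall.values()) / len(per_class_recall)
print(round(macro_recall, 2))  # -> 53.96
```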
The Random Forest algorithm was implemented on student academic data using a 9:1 ratio between training data and testing data. The accuracy obtained is 69.00%, the average precision is 83.54%, and the average recall is 53.96% (note: the original text incorrectly listed 83.54% for the recall in the paragraph). Random Forest performed reasonably well on the majority classes (B and C) but struggled significantly with the minority classes, particularly A and D, which exhibited low recall values. This is consistent with known limitations of RF when handling imbalanced datasets without class-sensitive adjustments. Because RF samples data using bootstrapping, minority classes are underrepresented in many trees, leading to unstable decision boundaries. This behavior aligns with findings from (Jawad et al.), who also reported poor RF sensitivity on minority educational performance categories.

Support Vector Machine (SVM) Evaluation Results
The modeling results using the SVM algorithm, based on the RapidMiner output, can be seen in Table 8 below.

Table 8. SVM Evaluation Results
Accuracy: 78.00%
Per-class recall: A = 56.00%, B = 100.00%, C = 72.00%, D = 52.4%
Average (Rata-Rata) Recall: 70.24%
Average (Rata-Rata) Precision: 89.98%
Source: Research Result

The corresponding evaluation metrics are given in Table 9:

Table 9. SVM Corresponding Evaluation Metrics
Metrics: Accuracy; Macro Precision; Macro Recall; Macro F1-Score (73-76%); Balanced Accuracy
Source: Research Result

The SVM evaluation results show that the model has an accuracy of 78%, with a precision of 100% for classes A and D and a high precision of 94.74% for class C. However, the precision of class B is lower, at 65%. The recall for class B is 100%, while class A is 56%, class C is 72%, and class D is 52.4%. These results indicate that SVM handles data imbalance better than Random Forest. Furthermore,
Table 8 shows the evaluation results of the SVM algorithm implemented on student academic data using a 9:1 ratio between training data and testing data. Based on these results, the SVM algorithm obtained an accuracy of 78.00%, an average precision of 89.98%, and an average recall of 70.24%. SVM consistently outperformed Random Forest across all metrics. Several factors explain this improvement:
- The RBF kernel captures non-linear patterns in the dataset more effectively than tree-based splits.
- Margin maximization enables SVM to establish more stable decision boundaries, particularly after minority oversampling via SMOTE.
- Normalization significantly benefits SVM, which is highly sensitive to feature scales.
- SVM is less affected by imbalance in the raw data because the margin-based approach focuses on support vectors rather than class frequency.

These findings reinforce the results of (Ghosh et al., 2022) and (Durai et al.), who found SVM particularly effective on educational datasets with heterogeneous and non-linear features.

Comparative Analysis
In the evaluation stage, the performance of the built models is tested using the confusion matrix evaluation metrics, which show the number of correct and incorrect predictions for each class. Table 10 below shows the model performance results for both algorithms.

Table 10. Results of Comparative Evaluation of Algorithms
Random Forest: Accuracy 69.00%; Precision 83.54%; Recall 53.96%
SVM: Accuracy 78.00%; Precision 89.98%; Recall 70.24%
Source: Research Result

Based on Table 10, the comparison of the Random Forest and SVM models shows that SVM is superior, with an accuracy of 78.00%, a precision of 89.98%, and a recall of 70.24%, indicating that this algorithm classifies the data well.
This also shows that SVM is better than Random Forest at analyzing the academic performance of SMPN 16 Bogor students. However, the data distribution is imbalanced: out of 403 students, the majority are in class B (39%) and class C (33%), while class A (17%) and class D (11%) have far fewer members. As a result, both Random Forest and SVM find it harder to recognize the classes with little data, namely classes A and D, leading to low recall values. The comparison shows that SVM copes better with the difference in class sizes because it focuses on optimal data separation. Considering this, SVM provides better results than Random Forest in overcoming data imbalance. The evaluation results above show that SVM classifies student academic performance more accurately than Random Forest. The implication of this research is that SVM is the more appropriate algorithm for developing a student academic performance prediction system, because it provides better results, especially on imbalanced data, thus assisting the school in making appropriate decisions.

CONCLUSION

This study set out to address the challenge faced by SMP Negeri 16 Bogor in evaluating student academic performance using diverse demographic, socioeconomic, and environmental attributes. By applying two machine learning algorithms, Random Forest and Support Vector Machine (SVM), the study demonstrates that SVM, particularly with an RBF kernel, provides more reliable and robust classification performance on imbalanced educational data than Random Forest. This superiority is largely driven by SVM's margin-based learning mechanism and its sensitivity to normalized feature distributions, which allow it to generalize more effectively when minority classes are under-represented.
These findings affirm that SVM is the more suitable algorithm for the school's context, where academic categories are unevenly distributed and influenced by heterogeneous non-academic factors. The study contributes to the field of educational data mining by validating the predictive value of non-academic attributes and by showing that feature-selection techniques such as CFS and RFE can enhance model interpretability and stability. Additionally, the research highlights the importance of transparent data preprocessing, proper handling of class imbalance, and the use of evaluation metrics beyond accuracy (such as macro recall, balanced accuracy, and F1-score) to obtain a more meaningful assessment of model performance.

A key limitation of this study lies in the significant class imbalance and the reduced dataset size (403 records), which may affect generalizability. The minority classes (A and D), which are academically important, remain difficult to classify even after SMOTE balancing. Methodologically, the use of a single train-test split and limited hyperparameter optimization restricts the full potential of both models.

Looking ahead, several concrete directions for future research are recommended. First, the models can be strengthened by adopting k-fold cross-validation, expanding hyperparameter tuning for SVM and Random Forest, and experimenting with advanced classifiers such as XGBoost, LightGBM, or deep neural networks. Second, incorporating richer academic features (e.g., assignment scores, attendance patterns, behavioral data) may improve prediction accuracy and provide deeper insights into student learning patterns. Third, future work should explore cost-sensitive learning or ensemble imbalance-handling methods that explicitly prioritize minority classes rather than relying solely on oversampling.
Finally, after achieving strong predictive performance, the next step is to develop an early-warning decision-support system that can be deployed within the school to identify at-risk students and guide intervention strategies. In summary, this study provides evidence that SVM is a more appropriate model for classifying student academic performance in imbalanced educational environments. It offers both methodological contributions and practical value for schools seeking to adopt data-driven decision-making. Continued research and system development will support the creation of more equitable and effective educational interventions. REFERENCE