International Journal of Computer and Information System (IJCIS)
Peer Reviewed - International Journal
Vol : Vol. Issue 02. June 2025
e-ISSN : 2745-9659
https://ijcis.net/index.php/ijcis/index

Performance Comparison K-Nearest Neighbors and Random Forest on Predicting The Performance New Polimedia Student Admissions

1Dwi Riyono, 2Cholid Mawardi
Politeknik Negeri Media Kreatif, Jakarta, Indonesia
Email : 1dwirion@polimedia.id, 2cholid@polimedia.
Abstract - New student admissions are at the forefront of an institution's operational process, and the success of each college's intake stems from them. Polimedia conducts new student admissions every year using various strategies. Polimedia has 23 study programmes that support the creative industry and can be utilised by the community. In this study, a prediction algorithm is used as a strategy to assess the likely outcomes if it is implemented in the coming year. Using a dataset of 3738 records of admitted new students, an analysis is carried out on prospective students who have or have not re-registered. A classification model with 2 classes is built after a data analysis process using exploratory data analysis (EDA) and data cleansing, so that the data modelling process runs well. The main model is K-Nearest Neighbors, which is compared with other machine learning models such as decision tree and random forest. This research is expected to produce a high accuracy value of 86.90% through a comparison of powerful machine learning models. It is also expected to serve as a reference for other studies that test the performance of machine learning models on various objects.
Keyword: Student, Performance, KNN, Machine Learning

I.
INTRODUCTION
Admission of new students is a priority for public universities, including Politeknik Negeri Media Kreatif (Polimedia).
New student admissions at polytechnics usually follow several stages and registration channels, depending on the rules of each institution. In the 2023 new student admissions, Polimedia recorded a lower admission ratio over capacity; in 2024, Polimedia's new student admissions reached 87.7%, higher than the previous year. A prediction is used to estimate whether the results in the coming year and beyond will be better or worse than before. Artificial intelligence is needed to classify data, that is, to predict a class, in order to address the problems that occur [1]. The prediction method in machine learning can be applied by determining whether or not applicants re-registered in the previous admission period. The factors behind the accuracy of the re-registration prediction can then serve as a reference for the admission of new students in the following year.
Previous articles have discussed the Naive Bayes and K-Nearest Neighbor classification methods for determining poor families; the article by Riza Marsuciati concluded that the best method for classifying low-income families is Naive Bayes [2]. Research conducted by Aisyah et al. analyzed performance (accuracy, precision, recall and F-measure) on an image dataset of malaria-infected and uninfected samples [3]. This previous research shows that KNN or other machine learning models can be used to classify between two classes, given datasets whose records have been labeled by class.
II.
METHOD
In this study, the researchers used two classification methods, Random Forest and k-Nearest Neighbor, and looked for the one with the best performance in determining which prospective new students re-register after being accepted and which do not.
KNN (K-Nearest Neighbor) is one of the machine learning algorithms often used for classification and regression [4].
Although KNN is not a fundamental mathematical construct such as quadratic equations or integrals, it involves some mathematical concepts to calculate distances and determine the nearest neighbors. The main calculation in KNN is the distance between two data points, which is generally done with the Euclidean formula.
Here is the Euclidean distance formula in KNN:
If there are two data points $A = (x_1, y_1)$ and $B = (x_2, y_2)$ in two-dimensional space, then the Euclidean distance $d$ between them is:
$$d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
For higher dimensional spaces (e.g., 3D or beyond), with $A = (a_1, a_2, \dots, a_n)$ and $B = (b_1, b_2, \dots, b_n)$, the formula becomes:
$$d(A, B) = \sqrt{(b_1 - a_1)^2 + (b_2 - a_2)^2 + \dots + (b_n - a_n)^2}$$
Where $n$ is the number of dimensions.
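The distance calculation and the nearest-neighbor vote described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `euclidean_distance` and `knn_predict`, the toy points, and the choice of k = 3 are all hypothetical.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two points with the same number of dimensions."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair each training point's distance to the query with its label, nearest first.
    distances = sorted(
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and return the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two features per applicant, label 1 = re-registered.
points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
labels = [0, 0, 1, 1]
prediction = knn_predict(points, labels, query=(5.5, 5.0), k=3)
```

With k = 3, the query point has two neighbors of class 1 and one of class 0, so the vote returns class 1. An odd k is often preferred in two-class problems because it avoids tied votes.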
Meanwhile, random forest is a machine learning model that belongs to the ensemble learning methods [6].
This algorithm is used for both classification and regression. Random Forest works by building many decision trees during the training process and produces a final decision by voting for classification or averaging for regression. Each tree in Random Forest is trained using a random subset of the data obtained by bootstrap sampling, which means the data is sampled with replacement. Mathematically, from the original dataset $D$ of size $N$, we take random samples $N$ times with replacement to get a new dataset $D'$.
If $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, then every record of $D'$ comes from $D$, but $D'$ may contain some duplication due to sampling with replacement. If there are $T$ trees in a Random Forest, the final prediction result for classification is the majority vote of all trees.
Mathematically, for input $x$, the final prediction $\hat{y}$ is:
$$\hat{y} = \mathrm{mode}\{h_1(x), h_2(x), \dots, h_T(x)\}$$
Where $h_i(x)$ is the prediction of the $i$-th tree, and mode is the class most frequently selected by the trees.
In short, Random Forest is a combination of many decision trees that work together through bagging and random feature selection to improve model accuracy and reduce overfitting.
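The two ingredients described above, sampling with replacement and majority voting, can be illustrated with a short Python sketch. It does not train decision trees; the per-tree predictions in `votes` are hypothetical values, and the helper names `bootstrap_sample` and `majority_vote` are assumptions introduced for illustration only.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) records with replacement (one bootstrap sample)."""
    return [rng.choice(dataset) for _ in range(len(dataset))]


def majority_vote(tree_predictions):
    """Return the class predicted most often across the trees (the mode)."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical dataset of (features, label) records.
dataset = [((1, 0), 0), ((2, 1), 0), ((3, 1), 1), ((4, 0), 1), ((5, 1), 1)]
sample = bootstrap_sample(dataset, rng)

# Hypothetical predictions from 5 trees for a single input.
votes = [1, 0, 1, 1, 0]
final_prediction = majority_vote(votes)
```

The bootstrap sample has the same size as the original dataset and contains only records from it, possibly repeated. Three of the five hypothetical trees vote for class 1, so class 1 is the final prediction.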
Data Preparation
The data to be processed comes from new student admission data at re-registration, with 18 parameters, including no_pendaftaran, nama_lengkap, major_id, golongan_ditetapkan, golongan_ditetapkan_final, program_studi, status_survei, status_prodi, status_data_induk, status_biodata, ulang, status_finalisasi, and verificator_id.
Initial Process
Before the dataset is used in calculations, the data obtained needs to be processed first, because not all of it is used. There are two stages in this preprocessing process, namely the data cleaning stage and the data selection stage.

Data Cleaning
The dataset obtained has 18 parameters, but for the random forest and k-nearest neighbor classification models it is necessary to select parameters so that the results of the two models are maximized.
Data cleaning reduces the 18 parameters to 12 parameters, including id no_pendaftaran, nama_lengkap, major_id, golongan_ditetapkan_final, status_survei, status_prodi, status_data_induk, status_biodata and daftar ulang.
The data from this parameter reduction is used for classification with random forest and k-nearest neighbor (KNN).

Figure 1. Re-Enrollment Classification Of Admission Pathways

Data Selection
In the dataset, several records do not have values for every parameter, as shown in Figure 2.
To avoid errors or reduced performance of the random forest and k-nearest neighbor classification models, the dataset is completed by filling empty parameter values with values already listed in the options [11].
Part of the data pre-processing is to analyze how numerical features relate to whether prospective students re-enroll or not, as well as to other categories, before proceeding to the modeling stage. The numerical features selected are 'skor', 'golongan', and 'jalur'.
The variable numerical_features stores the names of these columns from the df_train dataset.
These are the features to be analyzed in more depth to see their distribution by 'daftar ulang' status.

Figure 2. Violin Plot of The Distribution of New Student Candidate Variables

Evaluation Model
The classification methods used in this research are random forest and k-nearest neighbor.
Training data uses 20 percent of the data obtained, and the remaining 80 percent is used as test data.
Data processing uses Python, with performance validation added to see the accuracy of the two classification methods. Performance testing uses the confusion matrix method, which yields accuracy, precision, and recall [12].
Accuracy is used to test how accurate the classification model is.
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Precision = \frac{TP}{TP + FP}$$
$$Recall\ (Sensitivity) = \frac{TP}{TP + FN}$$
$$F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
Here TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives in the confusion matrix.

III.
RESULT AND DISCUSSION
Before testing in Python, the data is preprocessed: the original data, which had 18 parameters, is cleaned and reduced to 12 parameters.
The parameters kept are adjusted to the needs of the random forest and k-nearest neighbor classification methods.
Data selection is then carried out, replacing empty values based on the options available for each parameter.
The goal is to reduce errors and maximize the results. A training dataset of 20% of the total data is then prepared, and the remaining 80% of the total data is used for testing.
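The preprocessing steps described above (filling empty values, then splitting the data) might look like the following Python sketch. Several points are assumptions rather than details from the study: empty values are filled with the most frequent value of each column, which is only one way of choosing among the listed options; the rows are invented; and the helper names `fill_missing_with_mode` and `split_train_test` are hypothetical. The 20% training fraction mirrors the split stated in the text.

```python
import random
from collections import Counter


def fill_missing_with_mode(records, columns):
    """Replace None values in each column with that column's most frequent value."""
    filled = [dict(row) for row in records]  # copy so the input is left untouched
    for col in columns:
        observed = [row[col] for row in filled if row[col] is not None]
        if not observed:
            continue  # nothing to impute from
        mode = Counter(observed).most_common(1)[0][0]
        for row in filled:
            if row[col] is None:
                row[col] = mode
    return filled


def split_train_test(records, train_fraction, seed=0):
    """Shuffle the records and split them into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical rows; None marks an empty value.
rows = [
    {"jalur": "A", "golongan": 1, "daftar ulang": 1},
    {"jalur": None, "golongan": 2, "daftar ulang": 0},
    {"jalur": "A", "golongan": None, "daftar ulang": 1},
    {"jalur": "B", "golongan": 2, "daftar ulang": 1},
    {"jalur": "A", "golongan": 2, "daftar ulang": 0},
]
clean = fill_missing_with_mode(rows, ["jalur", "golongan"])
train, test = split_train_test(clean, train_fraction=0.2)
```

Keeping `train_fraction` as a parameter makes it straightforward to repeat the experiment with other dataset ratios.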
Once the data is ready, testing is carried out using the random forest and k-nearest neighbor classification methods.
Performance is also measured using the confusion matrix method, which produces accuracy, precision, and recall.
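As a minimal sketch of how these measures follow from a confusion matrix, the Python below counts true and false positives and negatives and derives accuracy, precision, recall and F1-score from them using the standard definitions. The label lists are hypothetical, and class 1 is assumed to mean "re-registered".

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a two-class problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical true labels and model predictions (1 = re-registered).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Precision and recall can diverge when the classes are imbalanced, which is why reporting them alongside accuracy gives a fuller picture of a two-class model.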
Figure 3. Comparison of Classification Testing Models

Table 1. Comparison Results Of Model Testing With Other Machine Learning Model
Columns: Algorithm, Accuracy, Precision, Recall, F1-Score
Rows: KNN, Decision Tree, Random Forest

Table 1 shows the comparison of the three models. Random forest achieves better accuracy than the other two models, while KNN reaches an accuracy of only about 80%. The accuracy of the decision tree differs only slightly from that of random forest.
This test uses the same parameters and data for every machine learning model compared.
From various combinations of dataset ratios and experiments with various K values and other parameters, the highest accuracy, 86.90%, is obtained at a ratio of 80:20. The recall value looks low, however, because the classification leans more towards precision than recall, meaning that the classification misses some of the students who actually re-enroll. The highest accuracy in this study, 86.90%, is found with the 20:80 dataset combination and experiments with various K values, where precision is high, meaning that the new students predicted to re-register mostly do re-register.
Figure 3 above shows the comparison results of the three different models.
As an additional analysis, this research also adds another model as a comparison, namely decision tree.
Figure 4. Error Rate At K-Value KNN

Figure 4 illustrates the error rate against the K value of the K-Nearest Neighbors (KNN) algorithm.
K in KNN is the number of nearest neighbors considered for classification or prediction.
The value of K here ranges from 1 to about 60.
At low values of K (from about 5), the error rate is quite low.
As K approaches 40, there is a very significant spike in the error rate, which indicates that at that K the KNN model makes many errors. After K = 40, the error rate again decreases drastically and stabilizes at about 0.15 to 0.2 when K is in the range of 50-60.
Based on this graph, the optimal K value is likely to be in the range of K = 10 to 30, depending on the balance between error rate and model complexity. Too large or too small a value of K results in poor model performance, as indicated by the drastic increase in error rate around K = 40.
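A curve like the one discussed above is typically produced by evaluating the classifier once per candidate K and recording the fraction of misclassified test records. The Python sketch below shows the idea on invented toy data; it is an illustration under stated assumptions, not the procedure or data used in the study, and the helper names `knn_predict` and `error_rate` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """`train` is a list of (features, label); return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_rate(train, test, k):
    """Fraction of test records whose predicted label differs from the true label."""
    wrong = sum(1 for features, label in test if knn_predict(train, features, k) != label)
    return wrong / len(test)


# Hypothetical, deliberately imbalanced toy data: (features, label).
train = [
    ((1, 1), 0), ((1, 2), 0),
    ((6, 6), 1), ((6, 7), 1), ((7, 6), 1), ((7, 7), 1), ((8, 7), 1),
]
test = [((2, 2), 0), ((6, 5), 1), ((1, 0), 0), ((7, 8), 1)]

# Error rate for every K from 1 up to the size of the training set.
error_by_k = {k: error_rate(train, test, k) for k in range(1, len(train) + 1)}
best_k = min(error_by_k, key=error_by_k.get)
```

In this toy example the error rate rises once K is large enough for the majority class to outvote the nearby minority-class neighbors, which is one common reason an error curve degrades at large K.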
IV.
CONCLUSION
Based on the results of this study, the authors conclude that in the range of K = 10 to 30 the error rate is quite low and stable, at about 0.18 to 0.2, which indicates that the KNN model works quite well with a minimal error rate in that interval.
In terms of overall model evaluation, the results using random forest show better accuracy than decision tree and KNN.
ACKNOWLEDGMENTS
The author would like to thank the Research and Community Service 2021 Funding from the Politeknik Negeri Media Kreatif, Ministry of Education, Culture, Research and Technology of Indonesia.
REFERENCES