Journal of Electrical Engineering and Computer Sciences Vol.
Issue 2.
December 2024
P-ISSN: 2528-0260 E-ISSN: 2579-5392
COMPARISON OF SVM, RANDOM FOREST, AND LOGISTIC REGRESSION PERFORMANCE IN STUDENT MENTAL HEALTH SCREENING
VANNES WIJAYA, 2NUR RACHMAT
Faculty of Informatics Engineering, University Multi Data Palembang, Jl. Rajawali No. 14, 30113 Palembang,
Sumatera Selatan e-mail: 1vanneswijaya04@mhs.
id, 2nur.
rachmat@mdp.
Corresponding author
ABSTRACT
Mental health is an essential aspect for university students, as undetected mental health disorders can have a significant impact on students' academic performance and well-being.
This study contributes by evaluating Synthetic Minority Oversampling Technique (SMOTE)'s role in improving classification models' performance.
Despite the increasing use of machine learning in mental health detection, limited research has addressed the challenges posed by imbalanced datasets, particularly in smaller student populations.
This research aims to develop a mental health early detection system based on student data from Multi Data University Palembang using the Mental Health Scale (SKM)-12 mental health measurement.
The system aims to raise students' awareness of the importance of mental health.
To improve accuracy, this research compares the performance of three models, namely Support Vector Machine, Random Forest, and Logistic Regression, both with and without SMOTE.
The dataset covers 78 students, and SKM-12 groups respondents into four profiles: the optimal mental health profile (+-), the maximum mental illness profile (++), the minimum mental illness profile (--), and the minimal mental health profile (-+).
Logistic Regression with SMOTE obtained better model performance than the other methods, with an accuracy of 89.28%, an average class precision of 89.5%, an average class recall of 89.75%, and an average class F1-score of 88.
This research shows that overcoming class imbalance using SMOTE can significantly improve the performance of mental health classification models.
Keywords: Logistic Regression, Random Forest, Mental Health Scale, SMOTE, Support Vector Machine, Health Screening
INTRODUCTION
Mental health is an important aspect in realizing overall health.
Mental health is a stable psychological and emotional state where a person can utilize their cognitive and emotional abilities to fulfill their daily needs and participate in their community .
Often, adolescents experience stress, especially at certain moments in their lives.
Adolescents are considered vulnerable to mental disorders and thus require more attention as they are the country's assets and the next generation of the nation.
Mental health problems in today's modern era can arise due to various pressures in life.
Students are an adult age group who often experience pressure and confusion about studies, family, and other aspects of life.
As a group in transition from adolescence to adulthood, college students tend to experience stress, especially stress originating from the academic process.
According to several interviews with university resource persons, the factors that cause students to experience mental health disorders include accumulated tasks, family economic factors, the final project or thesis, and personal or family problems.
Mental health is important for freshmen, undergraduates, and graduating students.
Mental health for new college students is very important in order to adapt to the lecture environment.
The environment during school and college is certainly very different.
College students will find different learning methods compared to their school days.
Mental health for college students who are in the middle of their studies is important in order to complete their academic tasks well.
DOI: https://doi.org/10.54732/jeecs.
Available online at: https://ejournal. id/jeecs
For final-year students, the thesis is one of the causes of mental health problems.
Previous research aimed to evaluate various machine learning algorithms in the context of depression prediction, utilizing the growing availability of mental health data.
The study sought to develop predictive models that could significantly contribute to understanding depression risk and implementing more timely interventions.
The methods used in the research included Random Forest, Naive Bayes, and K-Nearest Neighbors (KNN).
The findings revealed that the Random Forest method achieved exceptional performance, with an accuracy of 91%, an F1-score of 91%, and both precision and recall around 91% .
A related study discusses the classification of student mental health data, where mental health data is used as input for a model to develop and apply for classifying test data.
The study employed the SVM and Naive Bayes methods, concluding that the SVM algorithm outperformed the Naive Bayes classifier in classifying students' mental health.
The classification results include "Yes," indicating that students require specialized therapy, and "No," indicating no need for such therapy.
The SVM method achieved an accuracy of 94.
A study on mental health focuses on classifying students into various mental health issue categories, including stress, depression, and anxiety, using machine learning algorithms.
The methods employed in this research include Decision Tree, Neural Network, Support Vector Machine, Naive Bayes, and Logistic Regression.
For the stress model, the Decision Tree method achieved the highest accuracy of 84.
In the depression model, the Support Vector Machine (SVM) method attained the highest accuracy of 88.
Meanwhile, for the anxiety model, Logistic Regression achieved the highest accuracy of 71.85%.
Another study aimed to develop a method for predicting MBTI personality types based on textual data.
The research utilized the SMOTE technique to address data imbalance issues.
Six different machine learning models were individually tested, including Logistic Regression, LSVC (Linear Support Vector Classification), SGD (Stochastic Gradient Descent), Random Forest, XGBoost, and CatBoost.
The findings revealed that Logistic Regression was the best-performing model, achieving an average F1-score of 0.
Additionally, the use of the SMOTE technique successfully improved model performance, increasing the F1-score to 8337 .
Another study examined the impact of emotions and mental health on students' cumulative grade point average (CGPA) using machine learning algorithms.
To address data imbalance, the research employed the Synthetic Minority Oversampling Technique (SMOTE).
The methods used in this study included Logistic Regression, Decision Tree, Random Forest, SVC, XGBoost, KNN, Voting Classifier, and Stacking Classifier.
The results indicated that the Logistic Regression model achieved an accuracy of 86.55%, while the Random Forest model achieved a slightly higher accuracy 62% .
Previous studies have demonstrated that Support Vector Machine (SVM), Random Forest, and Logistic Regression methods can achieve good accuracy in various classification cases.
Additionally, the use of the Synthetic Minority Oversampling Technique (SMOTE) to address data imbalance has proven effective in improving model performance.
Therefore, this study employs Logistic Regression, SVM, and Random Forest, together with SMOTE to address class imbalance, aiming to achieve optimal model performance in predicting students' mental health.
This study uses the SKM-12 mental health measurement tool.
SKM-12 is the result of modifying questions from the Mental Health Inventory measuring instrument.
It measures mental health from positive aspects (positive emotions, love, life satisfaction) and negative aspects (anxiety, depression, and loss of control).
It was then refined by reducing the number of question items.
In each aspect (positive and negative), the 12 question items were reduced to 6, so that 12 question items remain in total.
The positive and negative aspects are referred to as psychological well-being and psychological distress.
Mental health data can be classified based on highs and lows of psychological well-being and psychological distress.
The classified data is put into four separate groups.
First, the optimal mental health profile (+-) indicates high psychological well-being and low psychological distress.
Second, the maximum mental illness profile (++) indicates high psychological well-being and high psychological distress.
Third, the minimum mental illness profile (--) indicates low psychological well-being and low psychological distress.
Finally, the minimal mental health profile (-+) indicates low psychological well-being and high psychological distress.
This study used the SKM-12 mental health measurement tool to evaluate the mental health of university students.
This study aims to compare several machine learning models, namely Support Vector Machine (SVM), Random Forest, and Logistic Regression, both with and without the SMOTE method.
Model performance evaluation was conducted using a confusion matrix to determine the best model in predicting students' mental health.
The results show that the Logistic Regression model with SMOTE provides the best performance in predicting student mental health, with an accuracy of 89.28%, an average class precision of 89.5%, an average class recall of 89.75%, and an average class F1-score of 88.
RESEARCH METHODOLOGY
Before the research was carried out, a theoretical review and literature study of publications and research manuscripts was conducted in order to understand the methods and steps of the research.
1 Research flow
The research flow consists of data collection, data preprocessing, data splitting, applying or not applying the SMOTE method, algorithm implementation (SVM, Random Forest, Logistic Regression), classification evaluation, and results.
At the data collection stage, data is collected using the SKM-12 questionnaire in the Multi Data University Palembang environment.
In the data preprocessing stage, the answers are converted into Likert scale values to facilitate calculation and produce mental health profile classes.
The answer values are very often = 5, often = 4, sometimes = 3, rarely = 2, and never = 1.
The four classes are the optimal mental health profile (+-), the maximum mental illness profile (++), the minimum mental illness profile (--), and the minimal mental health profile (-+).
After preprocessing, the data is split into two parts: 70% for training and 30% for testing.
In the first stage, the three algorithms (SVM, Random Forest, and Logistic Regression) are trained directly, without the SMOTE method.
In the second stage, the SMOTE method is first applied to balance the classes, and the same three algorithms are then trained on the balanced data.
Together the two stages produce 6 models, namely SVM not SMOTE, Random Forest not SMOTE, Logistic Regression not SMOTE, SVM with SMOTE, Random Forest with SMOTE, and Logistic Regression with SMOTE.
The next stage is classification evaluation, where the six models are tested with the testing data to produce TP, TN, FP, and FN, which form a confusion matrix.
At the result stage, the TP, TN, FP, and FN values of the six models are used to calculate accuracy, precision, recall, and F1-score.
The research flow is shown in the research flow chart (Figure 1).
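The six-model design described in this research flow can be enumerated programmatically. The following is a minimal sketch in Python; the names `ALGORITHMS`, `RESAMPLING`, and `experiment_grid` are illustrative and are not taken from the authors' code.

```python
from itertools import product

# Algorithms and resampling settings named in the research flow.
ALGORITHMS = ["SVM", "Random Forest", "Logistic Regression"]
RESAMPLING = ["not SMOTE", "with SMOTE"]


def experiment_grid():
    """Return one label per (resampling setting, algorithm) combination."""
    return [f"{algo} {mode}" for mode, algo in product(RESAMPLING, ALGORITHMS)]


models = experiment_grid()
print(len(models))  # 6
for name in models:
    print(name)
```

Keeping the grid explicit makes it easy to check that every algorithm is evaluated under both settings before the results are compared.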
2 Support Vector Machine (SVM)
One of the statistical methods that can be used for classification is Support Vector Machine (SVM).
SVM is a technique that aims to find a hyperplane that separates two data sets from two different classes.
It works by using an algorithm to find the optimal hyperplane in the input space.
The best hyperplane between two classes is found by measuring the margin of the hyperplane and finding the point where that margin is maximal.
Figure 1. Research flow chart
The main problem in SVM is to find a hyperplane, expressed by the equation $\langle w, x \rangle + b = 0$, that separates the data $x_i$ consisting of two classes, namely $y_i \in \{+1, -1\}$, with a maximum margin.
This margin refers to the distance between the hyperplane and the data from each class.
The hyperplane is then used as the decision function $f(x)$ in solving the two-class classification problem.
The following is the formula for $f(x)$ in equation 1.
$$f(x) = \operatorname{sign}\big(\langle w, x \rangle + b\big) = \operatorname{sign}\Big(\sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b\Big) \tag{1}$$
Description:
$w$ : weight
$x$ : input variable value
$b$ : bias
$\alpha_i$ : multiplier of training sample $x_i$ with label $y_i$, for $n$ training samples
The formula is used to calculate the prediction results.
The hyperplane can be uniquely determined from the values of $w$ and $b$ obtained.
The data $x_i$ that form the subset of the training data lying on the margin are called the support vectors.
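To make the decision rule concrete, the following is a minimal sketch of the linear case, where the prediction is the sign of <w, x> + b. The function name `svm_decision`, the weights, and the bias are hypothetical values chosen only for illustration, and treating a score of exactly 0 as the positive class is an assumption.

```python
def svm_decision(w, b, x):
    """Return +1 or -1 according to the sign of <w, x> + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # ASSUMPTION: a point exactly on the hyperplane is assigned to class +1.
    return 1 if score >= 0 else -1


# Hypothetical weights and bias, not learned from the study's data.
w = [1.0, -1.0]
b = -0.5

print(svm_decision(w, b, [2.0, 0.5]))  # score = 2.0 - 0.5 - 0.5 = 1.0, class +1
print(svm_decision(w, b, [0.0, 1.0]))  # score = 0.0 - 1.0 - 0.5 = -1.5, class -1
```

In practice w and b come from training; only the support vectors contribute to them.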
3 Random Forest
Random Forest applies a straightforward analysis method to select nodes for constructing the root node, internal nodes, and leaf nodes using the same attributes and information, regardless of the criteria applied.
This method achieves high accuracy.
Random Forest works by combining multiple Decision Trees, trained using the bagging method, to achieve stable and accurate predictions.
The method is an evolution of the CART method that uses bootstrap aggregating (bagging) and random feature selection.
The Decision Tree algorithm includes several variants, such as ID3, which utilizes entropy, and CART, which relies on the Gini index.
The following is the impurity value formula in the CART algorithm in equation 2, and the Gini index value is represented in equation 3.
$$Gini(D) = 1 - \sum_{i=1}^{m} p_i^{2} \tag{2}$$
Description:
$p_i$ : the probability that a tuple in $D$ belongs to class $i$
$m$ : number of class labels
$$Gini_A(D) = \frac{|D_1|}{|D|}\, Gini(D_1) + \frac{|D_2|}{|D|}\, Gini(D_2) \tag{3}$$
The Gini index evaluates binary splits for each attribute.
To assess a binary split, it calculates the weighted sum of the impurities for each resulting partition.
For instance, if a binary split divides partition D into D1 and D2, the Gini index value for D based on that split can then be determined .
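The impurity and the weighted Gini index of a binary split can be computed as in the minimal sketch below. The function names `gini` and `gini_split` and the toy partition are illustrative and are not taken from the study.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Weighted Gini index of a binary split of D into D1 (left) and D2 (right)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy partition, not taken from the study's data.
d1 = ["optimal", "optimal", "optimal", "minimal"]
d2 = ["minimal", "minimal"]

print(gini(d1 + d2))       # impurity before the split (two balanced classes)
print(gini_split(d1, d2))  # weighted impurity after the split, about 0.25
```

A split is preferred when its weighted index is lower than the impurity of the unsplit partition, as it is here.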
4 Logistic Regression
Regression analysis is a method aimed at understanding the effect of one variable on another.
Logistic Regression is a supervised machine learning method used to evaluate data and explain the relationship between one or more predictor variables and one response variable.
The value of the Logistic Regression response variable ranges between 0 and 1, with a value cutoff of 0.
The following represents the simple linear regression model as shown in equation 4.
$$Y = \beta_0 + \beta_1 X + \varepsilon \tag{4}$$
Description:
$Y$ : the dependent variable (predicted value)
$X$ : independent variable
$\beta_0$ : constant
$\beta_1$ : regression coefficient (increase or decrease value)
$\varepsilon$ : random error
Because this study uses more than 2 mental health profiles, the softmax function can be used for multi-class logistic regression classification.
The softmax function is used to calculate probabilities from output results, with the highest probability value from the output layer taken as the prediction result.
Softmax computes the probability distribution from a vector of real numbers.
It produces outputs ranging between 0 and 1, with the total probabilities summing to 1.
The following is the softmax equation in equation 5.
$$\operatorname{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)} \tag{5}$$
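As an illustration of how softmax turns scores into a prediction, the following minimal sketch computes the probabilities for four hypothetical class scores and picks the most probable class. The scores are invented for the example, and subtracting the maximum score before exponentiating is a common numerical safeguard that is not described in the paper.

```python
import math


def softmax(scores):
    """Map real-valued scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the four mental health profiles.
scores = [2.0, 1.0, 0.1, -1.0]
probs = softmax(scores)
predicted = probs.index(max(probs))  # index of the most probable class

print(probs)
print(predicted)  # 0, because the first score is the largest
```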
5 Synthetic Minority Over-Sampling Technique (SMOTE)
The Synthetic Minority Over-Sampling Technique (SMOTE) is a popular way to overcome class imbalance in a dataset.
SMOTE balances the dataset by artificially generating new instances for the minority class.
It generates synthetic data for the minority class without simply duplicating existing samples, which helps to address the challenge of overfitting.
The process begins by sequentially selecting each minority class sample as the base for generating additional synthetic samples.
This process is repeated n times.
Finally, linear interpolation is applied between the base sample and the selected neighbors to create n new synthetic samples.
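The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the implementation used in the study: the function name `smote_like`, the neighbor count `k`, the seed, and the toy points are all assumptions made for the example.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for i in range(n_new):
        base = minority[i % len(minority)]  # visit minority samples in sequence
        # Rank the other minority samples by squared Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbor = rng.choice(others[:k])  # one of the k nearest neighbors
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Toy minority-class points, not taken from the study's data.
minority = [[1.0, 2.0], [2.0, 2.5], [1.5, 3.0], [3.0, 1.0]]
new_points = smote_like(minority, n_new=7)
print(len(new_points))  # 7
```

Each synthetic point lies on the line segment between a real minority sample and one of its neighbors, which is why the new data are not exact duplicates.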
6 Mental Health Scale (SKM-12)
The Mental Health Scale (SKM-12) is a mental health measurement tool consisting of 12 questions: 6 questions on psychological well-being and 6 questions on psychological distress.
These two aspects produce 4 group categories, namely profile optimal mental health (+-), profile maximum mental illness (++), profile minimum mental illness (--), and profile minimal mental health (-+).
Table 1 lists the SKM-12 questions for each aspect.
7 Data collection
The first stage of this research is data collection.
SKM-12 contains 12 questions, consisting of 6 psychological well-being questions and 6 psychological distress questions, which were made into a Google Form questionnaire so that it could be accessed easily.
The questionnaire was distributed within the Multi Data Palembang University environment to respondents who could access the Google Form link.
Each question is answered on a 5-point Likert scale: very often, often, sometimes, rarely, and never.
Data collection ran from May to June 2024 and obtained 78 respondents from different semester levels and majors.
Table 2 shows 10 sample data rows in csv format, where questions 1 to 6 are psychological well-being questions and questions 7 to 12 are psychological distress questions.
Table 1. SKM-12 questions
Psychological wellbeing | Psychological distress
Daily life is full of interesting things | Finding yourself as a confused or frustrated
You generally enjoy what you do | Feeling like a tired person or feeling helpless
Feel comfortable communicating with your | Feeling at the lowest point
Feeling valuable because of your friend's | Taking time to enjoy the feeling of despair
Feeling happy in living this life | Feeling a loss of control over thoughts, feelings, and behavior
Enjoying what happens in this life | Feeling like you have nothing to look forward to in the future
Table 2. Sample Data SKM-12 (columns: Variable, Data 1 to Data 10)
8 Preprocessing data
The questionnaire results were then processed to categorize respondents into 4 mental health profile classes using the SKM-12 calculation method.
The answers were converted into Likert scale values for easy calculation and to produce mental health profile classes.
The answer values are very often = 5, often = 4, sometimes = 3, rarely = 2, and never = 1.
The 4 classes consist of the optimal mental health profile (+-), the maximum mental illness profile (++), the minimum mental illness profile (--), and the minimal mental health profile (-+).
The resulting dataset contains 78 rows and 13 columns: 12 question columns and one mental health profile column.
The mental health profiles consist of 23 optimal mental health (+-), 16 maximum mental illness (++), 23 minimum mental illness (--), and 16 minimal mental health (-+).
The dataset is stored in csv format in Google Drive to make it easier to integrate into Google Colaboratory.
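The preprocessing step can be sketched as a small function that maps the 12 Likert answers to one of the four profiles. The paper does not state how SKM-12 separates "high" from "low", so the cutoff below is a hypothetical placeholder (the midpoint of the summed score range), and the names `LIKERT`, `CUTOFF`, `PROFILES`, and `skm12_profile` are illustrative only.

```python
LIKERT = {"very often": 5, "often": 4, "sometimes": 3, "rarely": 2, "never": 1}

# ASSUMPTION: hypothetical cutoff. The source does not give the SKM-12 rule
# for "high" versus "low"; 18 is the midpoint of six summed items (6..30).
CUTOFF = 18

# Keys are (high well-being, high distress).
PROFILES = {
    (True, False): "optimal mental health",
    (True, True): "maximum mental illness",
    (False, False): "minimum mental illness",
    (False, True): "minimal mental health",
}


def skm12_profile(answers):
    """Classify 12 Likert answers: items 1-6 well-being, items 7-12 distress."""
    if len(answers) != 12:
        raise ValueError("SKM-12 expects exactly 12 answers")
    values = [LIKERT[a] for a in answers]
    high_wellbeing = sum(values[:6]) >= CUTOFF
    high_distress = sum(values[6:]) >= CUTOFF
    return PROFILES[(high_wellbeing, high_distress)]


print(skm12_profile(["often"] * 6 + ["rarely"] * 6))  # optimal mental health
```

With a different scoring rule only `CUTOFF` and the two comparisons would need to change; the mapping from the two aspects to the four profiles follows the definitions given in the paper.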
9 Data splitting
This study compares several models built with different machine learning methods to find the best-performing model.
The methods are compared both with and without SMOTE.
The data is divided into 70% training data and 30% testing data.
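A 70/30 split of the 78 respondents can be sketched as follows. The paper does not say whether the split was shuffled or stratified, or how fractional sizes were rounded, so the seed and the choice to round the test size up are assumptions; the function name `split_70_30` is illustrative.

```python
import math
import random


def split_70_30(rows, test_share=0.3, seed=42):
    """Shuffle rows and return (train, test) with roughly a 70/30 ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # ASSUMPTION: shuffled with a fixed seed
    n_test = math.ceil(len(rows) * test_share)  # ASSUMPTION: test size rounded up
    return rows[n_test:], rows[:n_test]


train, test = split_70_30(range(78))
print(len(train), len(test))  # 54 24
```

Because 30% of 78 is not a whole number, the exact test size depends on the rounding convention, which is worth reporting alongside the ratio.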
10 Implementation SMOTE method
This study runs each algorithm with and without the SMOTE method to see whether the algorithm's evaluation performance increases or decreases.
The SMOTE method addresses class imbalance by generating new synthetic examples to increase the number of samples in the minority class.
This helps the machine learning model pay more attention to the minority class instead of ignoring it.
After applying SMOTE, the dataset grows from 78 to 92 data.
The SKM-12 mental health profile data becomes balanced, with 23 data in each of the 4 classes.
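The growth from 78 to 92 data follows directly from the class counts reported in the preprocessing stage. The short sketch below reproduces that arithmetic, assuming that every class is raised to the size of the largest class; the variable names are illustrative.

```python
# Class counts reported for the 78 respondents.
counts = {
    "optimal mental health": 23,
    "maximum mental illness": 16,
    "minimum mental illness": 23,
    "minimal mental health": 16,
}

target = max(counts.values())  # every class is raised to the largest class size
synthetic_needed = {name: target - n for name, n in counts.items()}
total_after = sum(counts.values()) + sum(synthetic_needed.values())

print(synthetic_needed)  # 7 synthetic samples for each 16-sample class
print(total_after)       # 92
```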
11 Implementation algorithm
This study compares the SVM, Random Forest, and Logistic Regression algorithms.
Google Colaboratory is used to simplify running the algorithms and to produce the models.
The resulting models are SVM with SMOTE, SVM without SMOTE, Random Forest with SMOTE, Random Forest without SMOTE, Logistic Regression with SMOTE, and Logistic Regression without SMOTE.
12 Classification evaluation and confusion matrix
After implementing the algorithm, classification evaluation is conducted by testing the model with the test data.
The 4 mental health classes produce TP, TN, FP, and FN, which form a confusion matrix.
From the confusion matrix, the model accuracy and the precision, recall, and F1-score of each class are obtained.
Table 3. Confusion Matrix
 | Predicted Positive | Predicted Negative
Actual Positive | TP | FN
Actual Negative | FP | TN
A confusion matrix summarizes the actual classes against the classes predicted by a classification system.
Accuracy is the degree to which the system performs the classification process correctly.
Precision is the ratio of the number of relevant documents to the total number of documents found by the classification system.
Recall is the ratio of the number of documents recovered by the classification system to the total number of relevant documents.
F-measure is a popular evaluation metric for dealing with class imbalance problems.
Description:
TP : True Positive
TN : True Negative
FP : False Positive
FN : False Negative
The confusion matrix metrics are calculated with equations 6, 7, 8, and 9:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \times 100\% \tag{6}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\% \tag{7}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\% \tag{8}$$
$$\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$
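For a four-class problem, TP, FP, and FN are usually counted per class by treating that class as positive and all others as negative. The sketch below follows that convention, which is an assumption about how the per-class values were obtained, and computes accuracy as the share of correct predictions. The confusion matrix is a made-up example, not the study's results, and the function names are illustrative.

```python
def per_class_metrics(matrix):
    """Return accuracy and a list of (precision, recall, f1) per class.

    matrix[i][j] counts samples whose actual class is i and predicted class is j.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n_classes))
    report = []
    for c in range(n_classes):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n_classes)) - tp  # predicted c, actually other
        fn = sum(matrix[c]) - tp  # actually c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report.append((precision, recall, f1))
    return correct / total, report


def macro_average(report):
    """Average each metric over the classes (the 'average class' values)."""
    n = len(report)
    return tuple(sum(row[k] for row in report) / n for k in range(3))


# Hypothetical 4-class confusion matrix, not the study's results.
toy = [
    [5, 1, 0, 0],
    [0, 6, 0, 0],
    [1, 0, 5, 0],
    [0, 0, 0, 6],
]
accuracy, report = per_class_metrics(toy)
print(round(accuracy * 100, 2))
print([round(v * 100, 2) for v in macro_average(report)])
```

The macro average weights every class equally, which matches the "average class" precision, recall, and F1-score reported in this paper.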
RESULTS AND DISCUSSIONS
In this research, the algorithm implementation stage consists of SVM, Random Forest, and Logistic Regression.
This stage is carried out in 2 parts, namely testing without SMOTE and testing with SMOTE.
These 2 parts produce 6 models.
Each model is then tested with the testing data, which was split off earlier as 30% of the data, resulting in TP, TN, FP, and FN values that are used to evaluate the model with a confusion matrix.
The following are the steps in implementing the algorithm.
1 Result implementation and classification evaluation algorithm not SMOTE
At this stage, after data splitting, the algorithm is trained on the training data, and classification evaluation is then carried out with the testing data to obtain the model performance results.
Table 4 displays the evaluation of the algorithm models without SMOTE.
2 Result implementation and classification evaluation algorithm with SMOTE
At this stage, after data splitting, SMOTE is applied to the training data and testing data to increase the minority class.
After SMOTE is applied, the algorithm is trained on the training data, and classification evaluation is then carried out with the testing data to obtain the model performance results.
Table 5 displays the evaluation of the algorithm models with SMOTE.
Table 4. Classification evaluation algorithm not SMOTE.
Columns: Algorithm, Mental Health Class, Accuracy, Precision, Recall, F1-score. Rows: SVM, Random Forest, and Logistic Regression, each evaluated on profile optimal mental health (+-), profile maximum mental illness (++), profile minimum mental illness (--), and profile minimal mental health (-+).
Table 5. Classification evaluation algorithm with SMOTE.
Columns: Algorithm, Mental Health Class, Accuracy, Precision, Recall, F1-score. Rows: SVM, Random Forest, and Logistic Regression, each evaluated on profile optimal mental health (+-), profile maximum mental illness (++), profile minimum mental illness (--), and profile minimal mental health (-+).
Figure 2. Accuracy comparison chart
3 Discussions
This discussion compares the classification evaluation results in the form of charts.
The following Figure 2 shows the accuracy comparison chart.
The accuracy of Logistic Regression increases with SMOTE, from 83% without SMOTE to 89.28% with SMOTE.
Table 6 compares the class averages for precision, recall, and F1-score with and without SMOTE.
The precision of Logistic Regression increases with SMOTE and has a higher average than the other methods, with a precision of 89.5%.
Figure 3 shows the precision comparison chart.
The recall of Logistic Regression increases with SMOTE and has a higher average than the other methods, with a recall of 89.75%.
Figure 4 shows the recall comparison chart.
The F1-score of Logistic Regression increases with SMOTE and has a higher average than the other methods, with an F1-score of 88.
Figure 5 shows the F1-score comparison chart.
Table 6. Comparison of class averages for precision, recall, and F1-score using SMOTE and without SMOTE
Columns: Algorithm; Precision (Not SMOTE, With SMOTE); Recall (Not SMOTE, With SMOTE); F1-score (Not SMOTE, With SMOTE). Rows: SVM, Random Forest, Logistic Regression.
Figure 3. Precision comparison chart
Figure 4. Recall comparison chart
Figure 5. F1-score comparison chart
CONCLUSION
The results of this study show that the use of SMOTE improves the performance of the model in classification evaluation.
Without SMOTE, the Logistic Regression method performs below the SVM method, but after applying SMOTE its accuracy, precision, recall, and F1-score improve and exceed the performance of the SVM method without SMOTE.
The Logistic Regression method with SMOTE has an accuracy of 89.28%, an average class precision of 89.5%, an average class recall of 89.75%, and an average class F1-score of 88.
In further research, the model can be applied to mental health screening systems, and future work can also increase the amount of data and use different methods to improve the accuracy of the model.
REFERENCES