JRMM - Jurnal Riset Mahasiswa Matematika, Volume 5, Pages 267-277

Research Article

Deep Neural Network-Based Student Performance Prediction with Hessian-Free Optimization

Andy Irawan 1, Zainal Abidin 1, and Mohammad Jamhuri 2

1 Department of Informatics Engineering, Faculty of Science and Technology, Universitas Islam Negeri Maulana Malik Ibrahim, Malang, Indonesia
2 Department of Mathematics, Faculty of Science and Technology, Universitas Islam Negeri Maulana Malik Ibrahim, Malang, Indonesia

Article History: Received December 02, 2025; Revised February 03, 2026; Accepted March 30, 2026; Published April 30, 2026

Abstract. Predicting student graduation predicates is important for academic monitoring and timely intervention in higher education. This study investigates graduation predicate prediction using deep neural networks under three feature-group settings: academic-only, non-academic-only, and combined academic and non-academic features. A multilayer perceptron with three hidden layers was trained using SGD with momentum, RMSProp, Adam, and a damped Hessian-free optimization procedure. Two tasks were considered: a four-class graduation predicate classification task and a binary risk-screening task in which Sufficient was treated as the positive risk class. The results show that the combined feature group achieved the best multiclass performance, with an accuracy of 0.8478 and a weighted F1-score of 0.8274. Hessian-free optimization consistently produced the best results across all feature-group scenarios, with the clearest gain appearing in the non-academic-only setting. In the additional risk-screening analysis, non-academic variables provided meaningful but limited predictive signal, and Major emerged as the strongest individual predictor. These findings show that combining academic and non-academic information improves graduation predicate prediction and that Hessian-free optimization is an effective training strategy for deep neural classification in educational data.

Keywords: deep neural networks, educational data mining, graduation predicate prediction, Hessian-free optimization, risk screening.

Copyright © 2026 by Authors. Published by JRMM Group. This is an open access article under the CC BY-SA License.

Introduction

Predicting student outcomes has become an important topic in educational data mining and learning analytics because it can support early intervention, targeted academic assistance, and institutional decision-making. Within this broader area, graduation predicate prediction is especially relevant because it condenses cumulative academic performance into an interpretable final achievement category that is meaningful for students, study programmes, and institutions. A large body of previous work has relied primarily on academic variables, including prior grades, semester GPA, course performance, and learning-management-system activity. These variables often provide strong predictive signal because they directly reflect student progress. Several studies have shown that early-semester academic performance is among the strongest predictors of final academic outcomes. Their practical limitation, however, is temporal: they become informative only after students have already progressed through part of their studies. This reduces their value for admission-time screening and very early intervention, when support may be most useful. For that reason, non-academic and admission-related variables remain important, particularly when academic records are incomplete or unavailable.
Prior studies have reported that factors such as gender, school background, admission pathway, organizational involvement, and demographic characteristics may contribute to the prediction of student outcomes, although usually less strongly than academic indicators. Admission-based prediction has likewise been shown to be feasible for anticipating later academic performance. In the context of Islamic higher education, this issue is especially interesting because variables such as boarding-school experience and Arabic-language proficiency may capture forms of prior preparation that are rarely examined in the broader educational data mining literature. These variables may not dominate prediction on their own, but they may still provide useful contextual information, especially in early-stage prediction settings.

From the modelling perspective, educational data mining studies have employed a broad range of machine learning methods, including Naive Bayes, Decision Tree, K-Nearest Neighbors, Support Vector Machine, and neural-network-based models. These studies demonstrate that educational data can support useful prediction, yet most of them focus on classifier comparison, feature selection, or predictive accuracy. Comparatively less attention has been given to the optimization procedure used to train neural models, even though the optimizer can materially affect convergence behaviour, stability, and final predictive performance. This issue becomes particularly relevant when the input space combines continuous academic variables with high-dimensional encoded categorical features, because the resulting loss surface may be difficult for purely gradient-based training.

In deep learning, first-order methods such as stochastic gradient descent, RMSProp, and Adam are widely used because of their simplicity and computational efficiency. Nevertheless, second-order methods remain attractive because they exploit curvature information of the objective function and can therefore yield more informative search directions than gradient information alone. Among these methods, Hessian-free optimization is especially appealing because it avoids explicit construction of the Hessian matrix and instead relies on matrix-vector products computed efficiently by automatic differentiation. The resulting inner linear system can then be solved iteratively by the conjugate gradient method, making second-order training practical for neural networks. Although current neural-network optimization research is still dominated by first-order methods, curvature-aware and second-order approaches continue to attract attention because they may provide stronger convergence behaviour and more informative updates in complex learning problems. Related studies have also shown that Gauss-Newton-type and inexact second-order procedures can improve optimization performance in classification settings, including neural and binary classification problems. However, explicit investigation of Hessian-free optimization in educational data mining remains limited.

Against this background, this study addresses three research questions. First, does the combination of academic and non-academic features yield better graduation predicate prediction performance than academic features alone?
Second, can Hessian-free optimization outperform standard first-order optimizers across academic-only, non-academic-only, and combined feature groups? Third, when only non-academic features are available, can they be used to identify students who are potentially at risk of weak final academic achievement?

To address these questions, this study formulates two related prediction tasks. The first is a four-class classification problem using the final graduation predicate categories Sufficient, Satisfactory, Very Satisfactory, and Cum Laude. The second is a binary early risk-screening problem in which Sufficient is treated as the positive risk class and all remaining predicates are treated as the non-risk class. A deep neural network implemented as a multilayer perceptron is trained under four optimization methods, namely SGD with momentum, RMSProp, Adam, and Hessian-free optimization, so that the comparison focuses on feature-group and optimizer effects rather than architectural variation.

The contribution of this study is twofold. First, it integrates feature-group comparison and optimizer comparison within a unified deep neural classification framework for graduation predicate prediction. Second, it extends the analysis beyond multiclass graduation predicate classification by reformulating the problem as an early risk-screening task suitable for newly admitted students, for whom semester-based academic variables are not yet available. In this way, the study contributes methodologically by examining the role of Hessian-free optimization in deep neural classification and practically by identifying which non-academic variables can support early screening.

The remainder of this article is organized as follows. Section 2 presents the dataset representation, preprocessing steps, neural-network formulation, optimization procedures, and evaluation protocol. Section 3 reports the experimental results and discusses their implications. Section 4 concludes the paper.

Methods

This section presents the experimental and computational framework adopted in the study. It begins with the dataset notation, feature-group definitions, and target formulations. It then describes preprocessing and data partitioning before presenting the deep neural network architecture, the objective functions, and the training procedures. Particular attention is given to the damped Hessian-free method and its conjugate-gradient inner solver. The section concludes with the evaluation metrics and the additional subset analysis for non-academic early risk screening.

Notation and experimental setting

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the dataset, where $x_i$ is the predictor vector of the $i$th student and $y_i$ is the corresponding target label. The predictor vector is partitioned into two groups,

$$x_i = \big(x_i^{\mathrm{acad}}, \, x_i^{\mathrm{non}}\big),$$

where $x_i^{\mathrm{acad}}$ denotes academic features and $x_i^{\mathrm{non}}$ denotes non-academic features. The academic feature group is defined as

$$x_i^{\mathrm{acad}} = (\mathrm{GPA}_1, \mathrm{GPA}_2, \mathrm{GPA}_3, \mathrm{GPA}_4),$$

where $\mathrm{GPA}_s$ denotes the semester GPA in semester $s$. The non-academic feature vector is defined by

$$x_i^{\mathrm{non}} = (\text{Gender}, \text{School Type}, \text{Boarding Experience}, \text{Admission Path}, \text{Arabic Proficiency}, \text{English Proficiency}, \text{Computer Proficiency}, \text{Major}).$$

Accordingly, three feature-group scenarios are considered:

$$X_{\mathrm{acad}} = x^{\mathrm{acad}}, \qquad X_{\mathrm{non}} = x^{\mathrm{non}}, \qquad X_{\mathrm{comb}} = \big(x^{\mathrm{acad}}, x^{\mathrm{non}}\big).$$

These definitions fix the three experimental settings used throughout the study.

Two target formulations are used. In the multiclass task, the target space is $\mathcal{Y}_{\mathrm{multi}} = \{1, 2, 3, 4\}$, corresponding to the ordered predicates {Sufficient, Satisfactory, Very Satisfactory, Cum Laude}. In the binary risk-screening task, the target is defined by

$$y_i^{\mathrm{bin}} = \begin{cases} 1, & \text{if student } i \text{ has predicate Sufficient}, \\ 0, & \text{otherwise}. \end{cases}$$

Thus, $y_i^{\mathrm{bin}} = 1$ represents a student who is operationally regarded as academically at risk.
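As a concrete illustration, the sketch below shows how the three feature-group scenarios and the two targets could be assembled from a tabular student dataset. This is not the authors' released code; the dataframe layout and the column names (including `Predicate`) are hypothetical placeholders for the variables defined above.

```python
# Minimal sketch, assuming a pandas dataframe with one row per student.
import pandas as pd

ACADEMIC = ["GPA1", "GPA2", "GPA3", "GPA4"]            # hypothetical names
NON_ACADEMIC = ["Gender", "SchoolType", "BoardingExperience", "AdmissionPath",
                "ArabicProficiency", "EnglishProficiency",
                "ComputerProficiency", "Major"]

def build_inputs(df: pd.DataFrame):
    """Return the three feature-group scenarios X_acad, X_non, X_comb."""
    return df[ACADEMIC], df[NON_ACADEMIC], df[ACADEMIC + NON_ACADEMIC]

def build_targets(df: pd.DataFrame):
    """Multiclass predicate label and the binary risk label defined above."""
    predicates = ["Sufficient", "Satisfactory", "Very Satisfactory", "Cum Laude"]
    y_multi = df["Predicate"].map({p: i for i, p in enumerate(predicates)})
    y_bin = (df["Predicate"] == "Sufficient").astype(int)   # 1 = at risk
    return y_multi, y_bin
```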
Data preprocessing

The preprocessing stage was designed according to the measurement scale of each feature. Empty strings were first converted into missing values. Missing entries in categorical variables were imputed using the mode, whereas missing entries in numerical variables were imputed using the median. Let $x_{ij}$ denote the value of feature $j$ for sample $i$. For numerical variables, z-score standardization was applied:

$$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$ computed from the training set. This standardization ensures that numerical variables are placed on a comparable scale before training. Categorical variables were transformed using one-hot encoding. Hence, the final input vector to the classifier can be written as

$$\tilde{x}_i = T(x_i),$$

where $T(\cdot)$ denotes the preprocessing operator composed of categorical expansion and, when appropriate, numerical standardization. The transformation $T$ was fitted only on the training set and then applied to the validation and test sets to avoid information leakage.

Train-validation-test split

The dataset was divided into three disjoint subsets,

$$\mathcal{D} = \mathcal{D}_{\mathrm{train}} \cup \mathcal{D}_{\mathrm{val}} \cup \mathcal{D}_{\mathrm{test}},$$

with proportions 60%, 20%, and 20%, respectively. The partition was produced through a two-stage stratified split. First, the full dataset was split into temporary training data (80%) and test data (20%). Second, the temporary training data were split into training data (75%) and validation data (25%), which yields the final 60-20-20 ratio. Stratification was performed with respect to the target labels so that class proportions were approximately preserved across subsets.

Deep neural network formulation

All experiments used the same deep neural network so that the comparison isolates the effects of feature groups and optimization methods. Let $\tilde{x} \in \mathbb{R}^p$ denote the preprocessed input vector. The network defines a parametric mapping $f_\theta : \mathbb{R}^p \to \mathbb{R}^K$, where $\theta$ denotes all trainable weights and biases, and $K = 4$ for the multiclass task and $K = 1$ for the binary task. The hidden representation is computed by three affine transformations followed by nonlinear activation:

$$h^{(1)} = \phi\big(W^{(1)} \tilde{x} + b^{(1)}\big), \qquad h^{(2)} = \phi\big(W^{(2)} h^{(1)} + b^{(2)}\big), \qquad h^{(3)} = \phi\big(W^{(3)} h^{(2)} + b^{(3)}\big),$$

where $\phi(u) = \max(u, 0)$ is the rectified linear unit (ReLU). These equations describe the feedforward transformation through the three hidden layers. The hidden-layer sizes were fixed at (…, 12, …). During training, dropout with rate $r = 0.3$ was applied after each hidden layer. If $m^{(\ell)}$ denotes a Bernoulli mask for layer $\ell$, the dropout-transformed hidden state can be written

$$\tilde{h}^{(\ell)} = m^{(\ell)} \odot h^{(\ell)},$$

where $\odot$ denotes elementwise multiplication. This formalizes the regularization mechanism used to reduce overfitting. For the multiclass task, the output layer computes logits $a = W^{(4)} \tilde{h}^{(3)} + b^{(4)}$, and predicted class probabilities are obtained by the softmax

$$\hat{y}_{ik} = \frac{\exp(a_{ik})}{\sum_{j=1}^{4} \exp(a_{ij})}, \qquad k = 1, 2, 3, 4,$$

which maps the logits into class probabilities that sum to one. For the binary risk-screening task, the output is the scalar logit $a_i$, followed by the sigmoid transformation

$$\hat{y}_i = \sigma(a_i) = \frac{1}{1 + \exp(-a_i)}.$$

This gives the estimated probability that student $i$ belongs to the risk class.
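The sketch below renders the preprocessing operator $T$, the two-stage stratified split, and the multilayer perceptron with scikit-learn and Keras. It is a minimal reconstruction under stated assumptions: the column names and the random seed are placeholders, and the first and third hidden-layer sizes (elided in the source) are set to illustrative values; only the middle size (12) and the dropout rate (0.3) come from the paper.

```python
import tensorflow as tf
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["GPA1", "GPA2", "GPA3", "GPA4"]             # hypothetical names
CATEGORICAL = ["Gender", "SchoolType", "BoardingExperience", "AdmissionPath",
               "ArabicProficiency", "EnglishProficiency",
               "ComputerProficiency", "Major"]

# T(.): median imputation + z-score for numeric columns; mode imputation +
# one-hot encoding for categorical columns. Fit on the training set only.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), NUMERIC),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore",
                                               sparse_output=False))]),
     CATEGORICAL),
])

def split_60_20_20(X, y, seed=42):
    """Two-stage stratified split: 80/20, then 75/25 of the 80%."""
    X_tmp, X_te, y_tmp, y_te = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=seed)
    return X_tr, X_va, X_te, y_tr, y_va, y_te

def build_mlp(input_dim, n_classes=4, hidden=(24, 12, 6), dropout=0.3):
    """Shared MLP: three ReLU hidden layers with dropout after each, and a
    softmax head (K=4) or sigmoid head (K=1). Sizes 24 and 6 are guesses."""
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for units in hidden:
        x = tf.keras.layers.Dense(units, activation="relu")(x)
        x = tf.keras.layers.Dropout(dropout)(x)
    if n_classes == 1:
        outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
        loss = "binary_crossentropy"
    else:
        outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
        loss = "sparse_categorical_crossentropy"   # cross-entropy, int labels
    model = tf.keras.Model(inputs, outputs)
    # The optimizer is swappable (SGD+momentum, RMSProp, Adam); the HFO
    # variant would replace this with a custom training loop (see below).
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model
```

A typical usage would be `X_tr, X_va, X_te, y_tr, y_va, y_te = split_60_20_20(X, y)` followed by `preprocess.fit_transform(X_tr)` and `preprocess.transform(X_va)` / `preprocess.transform(X_te)`, which enforces the no-leakage rule stated above.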
Objective functions

For the multiclass task, the categorical cross-entropy is defined as

$$L_{\mathrm{multi}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{4} y_{ik} \log \hat{y}_{ik},$$

where $y_{ik}$ is the one-hot representation of the true class of sample $i$. This loss is minimized when the softmax probabilities align with the true multiclass labels. For the binary risk-screening task, the binary cross-entropy is defined as

$$L_{\mathrm{bin}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \Big[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\Big],$$

where $\hat{y}_i$ is obtained from the sigmoid model above. For first-order optimizers, class weighting was used to reduce the influence of class imbalance. Let $\omega_c$ denote the weight for class $c$. Then the weighted loss can be written in the generic form

$$L_w(\theta) = \frac{1}{N} \sum_{i=1}^{N} \omega_{c_i} \, \ell_i(\theta),$$

where $\ell_i(\theta)$ denotes the sample-wise loss term. This weighted loss was used for the first-order baselines, whereas the Hessian-free implementation was kept unweighted in accordance with the computational design adopted in the experiments.

First-order optimization methods

Three first-order optimizers were used as baselines: SGD with momentum, RMSProp, and Adam. For SGD with momentum, the update rule is

$$v_{t+1} = \mu v_t - \eta \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$

where $\eta$ is the learning rate and $\mu$ is the momentum coefficient. These updates show that SGD with momentum uses only first-order information, augmented by an exponential moving average of past gradients. RMSProp and Adam were also employed as adaptive gradient-based optimizers. These methods likewise rely on first-order information and therefore provide suitable baselines for comparison with Hessian-free optimization.

Hessian-free optimization

Hessian-free optimization is a damped second-order method that uses curvature information without explicitly constructing the Hessian matrix. Let $g_t = \nabla L(\theta_t)$ denote the gradient at iteration $t$. In a local quadratic approximation around $\theta_t$, the objective is approximated by

$$m_t(p) = L(\theta_t) + g_t^{\top} p + \tfrac{1}{2} p^{\top} B_t p,$$

where $p$ is a candidate step direction and $B_t$ is a curvature matrix, implicitly represented through Hessian-vector products. This local model is used to determine a Newton-type search direction. To stabilize the step, a damping term is introduced, and the search direction $p_t$ is obtained by approximately solving

$$A_t p_t = -g_t, \qquad A_t = B_t + \lambda_t I,$$

where $\lambda_t > 0$ is the damping parameter. This is the core linear system of the Hessian-free iteration. The matrix $A_t$ is never formed explicitly. Instead, the method computes products of the form

$$v \mapsto A_t v = (B_t + \lambda_t I)\, v$$

using automatic differentiation. These products are implemented through Hessian-vector products, following the technique introduced by Pearlmutter, which makes second-order information available at a computational cost comparable to gradient evaluation. After an approximate solution $p_t$ is obtained, the parameter update is

$$\theta_{t+1} = \theta_t + \alpha_t p_t,$$

where $\alpha_t$ is determined by backtracking. The quality of the step is then measured by the agreement ratio between predicted and realized reduction,

$$\rho_t = \frac{L(\theta_t) - L(\theta_t + \alpha_t p_t)}{-g_t^{\top} p_t - \tfrac{1}{2} p_t^{\top} A_t p_t}.$$

If $\rho_t$ is large, the damping parameter is reduced; if $\rho_t$ is small, the damping parameter is increased. This rule adapts the trust in the local quadratic model. Such a curvature-aware perspective remains relevant in current neural optimization research, where exact or approximate second-order schemes continue to be studied as viable alternatives to purely first-order training strategies.
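To make the implicit matrix-vector products concrete, the following sketch shows a Pearlmutter-style Hessian-vector product computed with nested TensorFlow gradient tapes. This is an illustrative reconstruction rather than the authors' code: the function names are ours, and in practice $B_t$ may be a Gauss-Newton approximation rather than the full Hessian shown here.

```python
# Sketch: H v via double backpropagation, without ever forming H.
# `params` are the model's tf.Variables; `vec` is a matching list of
# tensors (e.g., a CG direction reshaped to each variable's shape).
import tensorflow as tf

def hessian_vector_product(loss_fn, params, vec):
    with tf.GradientTape() as outer:
        with tf.GradientTape() as inner:
            loss = loss_fn()                        # e.g. cross-entropy batch
        grads = inner.gradient(loss, params)        # g = dL/dtheta
        gv = tf.add_n([tf.reduce_sum(g * v)         # scalar g^T v
                       for g, v in zip(grads, vec)])
    return outer.gradient(gv, params)               # d(g^T v)/dtheta = H v

# The damped operator of the paper is then A_t v = H v + lambda_t * v.
```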
Conjugate gradient as the inner solver

The damped linear system was solved approximately by the conjugate gradient (CG) method, which is well suited for large symmetric positive definite systems and requires only matrix-vector products. Since damping makes $A_t = B_t + \lambda_t I$ more numerically stable, CG can be applied efficiently within each Hessian-free iteration. Let $A = A_t$ and $b = -g_t$. Starting from an initial approximation $x_0 = 0$, the CG iterations are initialized by

$$r_0 = b - A x_0, \qquad p_0 = r_0.$$

Then, for $k = 0, 1, 2, \ldots$, the updates are

$$\alpha_k = \frac{r_k^{\top} r_k}{p_k^{\top} A p_k}, \qquad x_{k+1} = x_k + \alpha_k p_k, \qquad r_{k+1} = r_k - \alpha_k A p_k,$$

$$\beta_{k+1} = \frac{r_{k+1}^{\top} r_{k+1}}{r_k^{\top} r_k}, \qquad p_{k+1} = r_{k+1} + \beta_{k+1} p_k.$$

These recurrences define the inner iterative solver used to approximate the solution of the damped system. The final approximation $x_{k+1}$ is then used as the Hessian-free search direction $p_t$. Algorithm 1 summarizes the complete damped Hessian-free training procedure used in this study, including the CG inner solver, backtracking line search, and damping adaptation.

Algorithm 1: Damped HFO with conjugate gradient

Require: initial parameter $\theta_0$, initial damping $\lambda_0$, maximum CG iterations $M$, backtracking factor $\beta \in (0, 1)$
1: for $t = 0, 1, 2, \ldots$ until convergence do
2:   Compute gradient $g_t = \nabla L(\theta_t)$
3:   Define the linear operator $A_t v = (B_t + \lambda_t I)\, v$
4:   Approximately solve $A_t p_t = -g_t$ using $M$ steps of conjugate gradient
5:   Compute the predicted reduction $\Delta_{\mathrm{pred}} = -g_t^{\top} p_t - \tfrac{1}{2} p_t^{\top} A_t p_t$
6:   Set $\alpha_t \leftarrow 1$
7:   while $L(\theta_t + \alpha_t p_t) \ge L(\theta_t)$ and $\alpha_t$ is sufficiently large do
8:     $\alpha_t \leftarrow \beta \alpha_t$
9:   end while
10:  Update $\theta_{t+1} \leftarrow \theta_t + \alpha_t p_t$
11:  Compute $\rho_t = \big[L(\theta_t) - L(\theta_{t+1})\big] / \Delta_{\mathrm{pred}}$
12:  if $\rho_t > 0.75$ then
13:    Decrease damping parameter $\lambda_t$
14:  else if $\rho_t < 0.25$ then
15:    Increase damping parameter $\lambda_t$
16:  end if
17: end for
18: return final parameter $\theta$
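A compact rendering of the inner CG solve and one outer damped update of Algorithm 1 is sketched below, assuming the parameters have been flattened into a single vector so that `curvature_matvec` wraps a Hessian-vector product such as the one sketched earlier. The 0.75 and 0.25 thresholds follow Algorithm 1; the damping factors (1.5) and the step-size floor are placeholders not stated in the paper.

```python
import numpy as np

def cg_solve(matvec, b, max_iters=50, tol=1e-10):
    """Conjugate gradient for A x = b, using only x -> A x products."""
    x = np.zeros_like(b)
    r = b - matvec(x)                  # r0 = b - A x0, with x0 = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)          # alpha_k = r_k^T r_k / p_k^T A p_k
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p      # beta_{k+1} = rs_new / rs
        rs = rs_new
    return x

def hfo_step(loss, grad, curvature_matvec, theta, lam, cg_iters=50, beta=0.5):
    """One damped Hessian-free update (Algorithm 1, lines 2-16)."""
    g = grad(theta)
    matvec = lambda v: curvature_matvec(theta, v) + lam * v   # A_t v
    p = cg_solve(matvec, -g, max_iters=cg_iters)
    pred = -(g @ p) - 0.5 * (p @ matvec(p))     # predicted reduction
    alpha, L0 = 1.0, loss(theta)
    while loss(theta + alpha * p) >= L0 and alpha > 1e-8:     # backtracking
        alpha *= beta
    theta_new = theta + alpha * p
    rho = (L0 - loss(theta_new)) / pred         # agreement ratio rho_t
    if rho > 0.75:
        lam /= 1.5                              # trust the model more
    elif rho < 0.25:
        lam *= 1.5                              # trust the model less
    return theta_new, lam
```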
Evaluation metrics

Performance was assessed on the held-out test set using accuracy, precision, recall, and F1-score. For the binary task, these are defined by

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

$$\text{F1-score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

These definitions were used to evaluate the binary risk-screening task. For the multiclass task, weighted precision, weighted recall, and weighted F1-score were used. Confusion matrices were also inspected to analyse class-level behaviour. In addition, training and validation loss curves and accuracy curves were recorded to compare convergence across optimizers.

Non-academic subset analysis for early screening

To determine which non-academic variables are most useful for screening newly admitted students, an additional subset analysis was conducted. Let $S \subseteq X_{\mathrm{non}}$ be a candidate subset of non-academic features. For each subset $S$, a binary classifier $f^{(S)} : S \to \{0, 1\}$ was trained using the binary risk definition given earlier. The evaluated subsets included single-feature models, small combinations, readiness-related subsets, admission-background subsets, and the complete non-academic feature set. This procedure was intended to identify which non-academic variables, individually or jointly, provide the strongest predictive signal for early academic risk screening.

Implementation environment

The experiments were implemented in Python using TensorFlow/Keras for deep neural network modelling and scikit-learn for preprocessing, data splitting, class-weight estimation, and evaluation. The same random seed was used throughout the experiments to improve reproducibility. The use of deep neural classification in this study is consistent with recent educational data mining and deep-learning research that applies supervised learning models to student performance and risk-prediction tasks.
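Continuing the earlier sketches, the evaluation protocol could be rendered as follows with scikit-learn, assuming a trained `model` with a softmax or sigmoid head and the transformed test split. Weighted averaging matches the multiclass protocol; for the binary task the paper uses the positive-class definitions above, which `average="binary"` would reproduce instead.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate(model, X_test, y_test):
    """Held-out evaluation: accuracy plus weighted precision/recall/F1,
    and the confusion matrix used for class-level analysis."""
    probs = model.predict(X_test)
    if probs.shape[1] > 1:                          # softmax head (multiclass)
        y_pred = probs.argmax(axis=1)
    else:                                           # sigmoid head (binary)
        y_pred = (probs.ravel() >= 0.5).astype(int)
    acc = accuracy_score(y_test, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted", zero_division=0)
    print(f"accuracy={acc:.4f} precision={prec:.4f} "
          f"recall={rec:.4f} f1={f1:.4f}")
    return confusion_matrix(y_test, y_pred)
```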
Results and Discussion

This section presents the experimental findings for both the multiclass graduation predicate classification task and the binary early risk-screening task. It begins with a comparison of feature-group performance, then evaluates the role of the optimizer, and finally discusses class-level behaviour and admission-time risk screening based on non-academic variables.

Multiclass classification performance across feature groups

The multiclass experiments compared three feature-group scenarios, namely academic-only, non-academic-only, and combined academic and non-academic features. For each scenario, the same multilayer perceptron architecture was trained using SGD with momentum, RMSProp, Adam, and Hessian-free optimization.

Table 1: Best classification performance for each feature-group scenario

Feature group       Best optimizer  Accuracy  Precision  Recall  F1-score
Academic-only       HFO             0.8434    -          -       0.8232
Non-academic-only   HFO             0.6111    -          -       0.5623
Combined            HFO             0.8478    -          -       0.8274

As shown in Table 1, the combined feature group achieved the best overall multiclass classification performance, with an accuracy of 0.8478 and a weighted F1-score of 0.8274. The academic-only scenario followed closely, with an accuracy of 0.8434 and a weighted F1-score of 0.8232. By contrast, the non-academic-only scenario remained substantially weaker, reaching an accuracy of 0.6111 and a weighted F1-score of 0.5623.

Fig. 1: Comparison of weighted F1-scores across academic-only, non-academic-only, and combined feature-group scenarios under the four optimization methods. Hessian-free optimization achieved the strongest F1-score in all scenarios, while the combined feature group produced the best overall result.

Fig. 1 summarizes the multiclass results visually. Two patterns are immediately clear. First, Hessian-free optimization produced the highest weighted F1-score in all three feature-group scenarios. Second, the combined academic and non-academic representation slightly outperformed the academic-only setting, whereas the non-academic-only scenario remained much weaker. These results indicate that combining academic and non-academic variables yields the best predictive representation.

However, the improvement of the combined model over the academic-only model is relatively small. The appropriate interpretation is therefore not that non-academic variables radically transform graduation predicate classification, but rather that they provide complementary information that slightly improves the predictive value of academic indicators. Academic variables remain the dominant source of information for final predicate classification, whereas non-academic variables contribute an additional but modest gain. This pattern is consistent with literature showing that academic indicators often remain the strongest predictors of student outcomes, while demographic, admission, and readiness-related variables can contribute supplementary information, especially in early-stage prediction settings.

At the same time, the non-academic-only scenario still achieved performance that was better than naive prediction. This indicates that non-academic variables contain useful predictive signal even in the absence of semester achievement data. Such evidence is important because it suggests that admission-time information is not irrelevant for academic prediction, even though it is not as strong as academic performance data.

Comparison of optimization methods

To examine the role of the optimizer more directly, each feature-group scenario was evaluated under four training methods: SGD with momentum, RMSProp, Adam, and Hessian-free optimization.

Table 2: Optimizer comparison for the academic-only feature group

Optimizer     Accuracy  Precision  Recall  F1-score
SGD Momentum  -         -          -       -
RMSProp       -         -          -       -
Adam          -         -          -       -
HFO           0.8434    -          -       0.8232

Table 3: Optimizer comparison for the non-academic-only feature group

Optimizer     Accuracy  Precision  Recall  F1-score
SGD Momentum  -         -          -       0.2378
RMSProp       -         -          -       0.4609
Adam          -         -          -       0.3185
HFO           0.6111    -          -       0.5623

Table 4: Optimizer comparison for the combined feature group

Optimizer     Accuracy  Precision  Recall  F1-score
SGD Momentum  -         -          -       -
RMSProp       -         -          -       -
Adam          -         -          -       -
HFO           0.8478    -          -       0.8274

Tables 2-4 show that Hessian-free optimization was the strongest optimizer in all three feature-group scenarios. In the academic-only setting, HFO achieved the highest weighted F1-score of 0.8232, only slightly above the best first-order baseline, namely RMSProp. In the non-academic-only setting, HFO produced a much stronger result than the first-order baselines, with an F1-score of 0.5623 compared with 0.4609 for RMSProp, 0.3185 for Adam, and 0.2378 for SGD with momentum. In the combined-feature scenario, HFO again achieved the best overall performance, with an F1-score of 0.8274, exceeding SGD with momentum, RMSProp, and Adam.

Table 5 makes the advantage of HFO more explicit by comparing it directly with the strongest first-order optimizer in each scenario. The gain in the academic-only setting is small but positive, indicating that HFO remained competitive even when academic predictors already dominated the classification task. The largest gain appears in the non-academic-only scenario, where HFO improved the weighted F1-score from 0.4609 to 0.5623, an absolute gain of 0.1014 and a relative gain of roughly 22%. In the combined-feature scenario, HFO also produced a clear improvement over the best first-order baseline.

The same ranking is also visible in Fig. 2, which compares accuracy across scenarios and optimizers. The consistency between Fig. 1 and Fig. 2 strengthens the conclusion that the superiority of HFO and the combined feature representation is not limited to a single evaluation metric.

Fig. 2: Comparison of classification accuracy across academic-only, non-academic-only, and combined feature-group scenarios under the four optimization methods. The same ranking observed for weighted F1-score is also visible in accuracy, with HFO achieving the strongest performance across all scenarios.

Fig. 3: Comparison between HFO and the strongest first-order optimizer in each multiclass scenario based on weighted F1-score. The most substantial advantage of HFO appears in the non-academic-only feature group.

Fig. 3 highlights the same pattern visually. HFO consistently outperformed the strongest first-order baseline across all feature-group scenarios. However, the magnitude of the gain was not uniform. In the academic-only setting the advantage was marginal, in the combined setting it was moderate, and in the non-academic-only setting it was substantial. This suggests that curvature-informed optimization is especially beneficial when the classifier must learn from a feature space dominated by encoded categorical admission variables rather than from stronger semester-based academic predictors.
Table 5: Performance gain of HFO over the best first-order optimizer in each multiclass scenario

Scenario            Best first-order  Best first-order F1  HFO F1  Absolute gain  Relative gain (%)
Academic-only       RMSProp           -                    0.8232  -              -
Non-academic-only   RMSProp           0.4609               0.5623  0.1014         ~22
Combined            SGD Momentum      -                    0.8274  -              -

Table 6: Mean optimizer performance across the three multiclass scenarios

Optimizer  Mean accuracy  Mean precision  Mean recall  Mean F1-score
SGDM       -              -               -            -
RMSProp    -              -               -            -
Adam       -              -               -            -
HFO        -              -               -            -

Table 6 further strengthens the argument by aggregating performance across scenarios. HFO achieved the highest mean accuracy, mean recall, and mean weighted F1-score, clearly above the three first-order methods. This shows that the superiority of HFO is not limited to a single scenario, but persists across all multiclass feature-group settings.

Fig. 4: Training and validation loss and accuracy curves for the combined academic and non-academic feature scenario. Hessian-free optimization reached the strongest final validation accuracy and the lowest final validation loss among the evaluated optimizers.

Fig. 4 complements the metric-based comparison by revealing the training dynamics in the most informative scenario, namely the combined feature setting. The HFO-based model showed the strongest final validation accuracy and the lowest final validation loss, indicating not only strong final performance but also favourable optimization behaviour. In contrast, the first-order methods converged to weaker validation performance, with SGD with momentum showing the least competitive trajectory. This dynamic behaviour is consistent with the metric-based comparison and supports the argument that HFO is a more effective training strategy for the present classification problem.

Taken together, the results in Tables 2-6 and Figures 3-4 support the usefulness of Hessian-free optimization for deep neural classification in educational datasets. Because HFO incorporates curvature information through Hessian-vector products and solves the corresponding damped Newton-type system using conjugate gradient, it can produce more informative update directions than gradient-only methods. The consistency of its superiority across all feature groups, together with its clear gains over the strongest first-order baselines, constitutes the main computational contribution of this study.

Class-level behaviour of the best multiclass classifier

To examine the behaviour of the best-performing multiclass classifier at the class level, the confusion matrix of the combined-feature HFO model was analysed.

Fig. 5: Confusion matrix of the HFO-based classifier for the combined academic and non-academic feature scenario. The model performed strongly on the dominant predicate classes, but the lowest predicate class (Sufficient) remained difficult to identify.

Fig. 5 shows that the combined HFO model performed strongly on the dominant predicate categories, particularly Very Satisfactory and Cum Laude. However, the weakest class, namely Sufficient, remained difficult to identify correctly. In the confusion matrix, the few samples in this class were still mapped into a higher predicate category. This observation qualifies the interpretation of the strong overall multiclass metrics. The model achieved high accuracy and weighted F1-score largely because it classified the dominant classes effectively. It also helps explain why the subsequent binary risk-screening task is highly challenging.
Since Sufficient is extremely rare in the dataset, the model has limited opportunity to learn a robust boundary for this minority class. Therefore, the multiclass task is substantially easier for the dominant predicate groups than for the weakest academic group.

Fig. 6: Confusion-matrix comparison for the combined feature scenario between the strongest first-order optimizer and HFO. HFO improves the overall distribution of correct predictions in the dominant classes, although minority-class recovery remains limited.

Fig. 6 compares HFO directly with the strongest first-order baseline in the combined-feature scenario. The comparison shows that HFO yields a better overall allocation of predictions in the dominant classes and thereby supports the metric-based evidence that it is the strongest optimizer in this setting. At the same time, both matrices confirm that minority-class recovery remains a fundamental challenge, which is mainly attributable to the extreme scarcity of the Sufficient class rather than to optimization alone.

Binary early risk screening based on non-academic variables

As an additional analysis, the problem was reformulated as a binary early risk-screening task. In this setting, students with the final predicate Sufficient were labelled as positive cases and all remaining students were treated as negative cases. Because the purpose of this analysis was to simulate an admission-time screening setting, only non-academic variables were used.

The risk-screening results must be interpreted carefully. The dummy baseline achieved an accuracy of 0.9964 but a recall of 0.0000 and an F1-score of 0.0000, which indicates that the positive class was extremely rare. Therefore, accuracy is not an informative metric for this screening task. A classifier that simply predicts the majority class can appear highly accurate while failing completely to identify at-risk students. For this reason, recall and F1-score are more informative than raw accuracy in Table 7.

From this perspective, the non-academic models did show useful signal. The best overall subset, namely the full non-academic feature set including gender, achieved a recall of 0.4000 and an F1-score clearly superior to the dummy baseline. Likewise, the best single feature, Major, achieved an F1-score of 0.0242, indicating that it alone contains informative structure for identifying students who may later fall into the Sufficient category.

Nevertheless, the absolute performance of the screening models remained limited. Although the non-academic subsets improved substantially over the dummy baseline in terms of recall and F1-score, the resulting F1-scores were still very low. This means that non-academic variables do contain predictive signal for early screening, but the screening problem remains difficult because of severe class imbalance and the limited information available before academic outcomes emerge.
This interpretation is also consistent with the broader early-warning literature, in which models based on limited early-stage information can still provide useful screening value even when absolute predictive performance remains constrained. Accordingly, the correct interpretation is not that non-academic features can already provide highly accurate risk predictions. Rather, the findings show that they provide a limited but meaningful basis for early warning. In practical terms, they may support first-stage screening, but should not be interpreted as a fully reliable standalone decision tool.

Table 7: Summary of binary early risk-screening performance

Category             Feature set                  Best optimizer     Accuracy  Recall  F1-score
Best overall         all_nonacademic_with_gender  SGD Momentum       -         0.4000  -
Best single feature  major_only                   RMSProp            -         -       0.0242
Best multi-feature   all_nonacademic_with_gender  SGD Momentum       -         0.4000  -
Dummy baseline       risk_dummy_baseline          DummyMostFrequent  0.9964    0.0000  0.0000

Single non-academic features for early screening

To identify the most informative individual non-academic predictors, single-feature models were compared.

Table 8: Best-performing single non-academic features for early risk screening

Feature               Best optimizer  Accuracy  Precision  Recall  F1-score
Major                 RMSProp         -         -          -       0.0242
Gender                RMSProp         -         -          -       -
Arabic Proficiency    RMSProp         -         -          -       -
Admission Path        SGDM            -         -          1.0000  -
School Type           Adam            -         -          -       -
Boarding Experience   HFO             -         -          -       -
English Proficiency   HFO             -         -          -       -
Computer Proficiency  HFO             -         -          -       -

Fig. 7: Ranking of single non-academic features for early risk screening based on F1-score. Major emerged as the strongest individual predictor among the evaluated non-academic variables.

Among individual non-academic predictors, Major emerged as the strongest single feature, with an F1-score of 0.0242. Fig. 7 confirms this ranking visually and shows that Major stands clearly above the other single-variable models. Substantively, this may reflect differences in curriculum demands, preparation profiles, or difficulty patterns across study programmes. Although Gender produced the second-highest F1-score among single features, its overall predictive contribution remained modest, and its use in practical screening may require additional ethical consideration. Other variables such as Admission Path, Arabic Proficiency, English Proficiency, and Computer Proficiency also exhibited some predictive signal, but none matched the individual contribution of Major.

It is also noteworthy that some single-feature models achieved high recall but extremely low precision. For example, Admission Path reached a recall of 1.0000 with very low precision, indicating that the model identified nearly all positive cases at the cost of producing many false positives. Therefore, these variables may still be useful in a high-sensitivity screening context, but not as precise standalone predictors.

Multi-feature non-academic subsets for early screening

The next analysis examined combinations of non-academic variables in order to determine whether a compact subset could provide stronger practical screening performance.

Table 9: Best-performing multi-feature non-academic subsets for early risk screening

Feature subset                  Best optimizer  Accuracy  Precision  Recall  F1-score
all_nonacademic_with_gender     SGDM            -         -          0.4000  -
all_nonacademic_without_gender  Adam            -         -          -       0.0221
core_screening                  SGDM            -         -          -       -
major_admission                 SGDM            -         -          -       -
extended_screening              SGDM            -         -          -       -

Fig. 8: Top non-academic feature subsets for early risk screening ranked by F1-score. The full non-academic set achieved the best overall screening performance, while smaller subsets centered on study-programme and admission-related variables remained competitive.
The full non-academic feature set including gender produced the best overall screening result. However, the version without gender was only slightly worse, with an F1-score of 0.0221 and a higher recall. The difference between these two settings is therefore small. This suggests that, from a practical perspective, the exclusion of gender may be acceptable if institutional policy favours a more cautious and ethically conservative screening design.

Fig. 8 further shows that the highest-ranking subsets are not arbitrary combinations, but are dominated by models that include broad non-academic information or strong study-programme-related structure. Among the reduced subsets, the core_screening combination and the major_admission combination performed competitively, although neither surpassed the full feature set. This is relevant for implementation, because a faculty may prefer a smaller and more interpretable screening instrument when operational simplicity is important.

Overall, the subset analysis shows that the best screening performance is achieved not by a single variable alone, but by combining several non-academic variables. However, the gain from using all non-academic features remains limited in absolute terms, which again reflects the difficulty of the task under extreme class imbalance.

Implications for faculty-level early intervention

The findings of this study have two practical implications. First, once semester achievement variables are available, the best predictive strategy is to combine academic and non-academic information, ideally using Hessian-free optimization as the training method. This provides the most accurate basis for multiclass graduation predicate classification. Second, when only admission-time variables are available, non-academic features can still support early risk screening, even though the resulting performance is limited. In this context, the main value of the screening model is not to deliver highly precise final predictions, but to function as an early-warning filter that identifies a subset of students who may benefit from additional mentoring, preparatory support, or closer academic monitoring. The subset analysis further suggests that Major is the strongest single non-academic predictor, while broader combinations of non-academic variables yield the best overall screening performance. Therefore, if a faculty wishes to construct a simple but informative early screening tool, study-programme information should likely be treated as a core component.

Limitations

Several limitations should be acknowledged. First, the experiments were conducted on a single institutional dataset, so the results may not generalize directly to other universities or study programmes. Second, the evaluation used a fixed train-validation-test split rather than repeated resampling or cross-validation. Third, the risk label was derived operationally from the lowest final predicate category rather than from formal dropout status. Therefore, the binary task should be interpreted as academic risk screening rather than direct dropout prediction.
A further limitation concerns the extreme rarity of the positive class in the binary screening experiments. Because of this imbalance, precision and F1-score remained low even for the best-performing subsets. Future work may therefore benefit from additional evaluation measures such as balanced accuracy, ROC-AUC, or PR-AUC, as well as data-level or algorithm-level imbalance-handling techniques.

Discussion summary

Overall, the experimental results support four main conclusions. First, the combined academic and non-academic representation produced the best multiclass classification performance, although its improvement over the academic-only representation was marginal. Second, Hessian-free optimization was consistently the strongest optimizer across academic-only, non-academic-only, and combined scenarios. Third, the clearest advantage of HFO over first-order optimization appeared in the non-academic-only scenario, where the gain over the strongest first-order baseline was substantial. Fourth, non-academic variables available at admission time do contain meaningful predictive signal for early screening of students who may later fall into the Sufficient category, but this screening performance remains limited because of severe class imbalance and the inherent difficulty of predicting long-term academic outcomes from pre-admission information alone.

Conclusion

This study examined student graduation predicate prediction and early academic risk screening using academic-only, non-academic-only, and combined academic and non-academic feature groups under four deep neural network optimization methods. The results show that the combined feature group achieved the best overall multiclass classification performance, although its advantage over the academic-only group was only marginal. This indicates that academic variables remain the strongest predictors, while non-academic variables contribute complementary information.

Hessian-free optimization consistently achieved the best performance across all feature-group scenarios. Its advantage over the strongest first-order optimizer was marginal in the academic-only setting, moderate in the combined setting, and substantial in the non-academic-only setting. In addition, HFO achieved the highest mean accuracy and mean weighted F1-score across the three multiclass scenarios. These findings support its effectiveness as a training strategy for deep neural classification in educational data.

For the binary early risk-screening task, non-academic variables alone provided meaningful but limited predictive signal for identifying students who may later fall into the Sufficient category. Among individual predictors, Major emerged as the strongest single non-academic feature, while the full non-academic feature set produced the best overall screening performance. These findings suggest that non-academic variables can support first-stage screening of newly admitted students, although they are not sufficient for highly accurate standalone risk prediction. Future work may evaluate the proposed framework on broader institutional datasets, use repeated validation strategies, and incorporate imbalance-aware methods to improve early risk-screening performance.

CRediT Authorship Contribution Statement

Andy Irawan: Conceptualization, Methodology, Software, Data Curation, Formal Analysis, Visualization, Writing - Original Draft. Zainal Abidin: Validation, Investigation, Writing - Review & Editing, Supervision. Mohammad Jamhuri: Conceptualization, Methodology,
Formal Analysis, Supervision, Writing - Review & Editing.

Declaration of Generative AI and AI-Assisted Technologies

Generative AI was used in a limited manner to assist with language refinement, structural editing, and manuscript drafting.

Declaration of Competing Interest

The authors declare no competing interests.

Funding and Acknowledgments

This research received no external funding.

Data Availability

The data supporting the findings of this study are available from the corresponding author upon reasonable request and subject to institutional data-sharing restrictions and confidentiality considerations.