Psikohumaniora: Jurnal Penelitian Psikologi, Vol 10, No 1: 1-20. DOI: 10.21580/pjpp. ISSN 2527-7456, 2502-9363.

Psychometric properties of the 18-item Indonesian Mental Toughness Questionnaire using the Rasch model and Machine Learning

Ananta Yudiarso,1* Ista Wirya Ardhiani,1 Roy Surya,1 Ferry Yohannes Watimena,2 Mami Kanzaki3
1Department of Psychology, Faculty of Psychology, Universitas Surabaya, Surabaya, Indonesia; 2Department of Sports Coaching, Faculty of Sport Science, Universitas Negeri Jakarta, Jakarta, Indonesia; 3Faculty of Education, Kyoto University of Education, Kyoto, Japan

Abstract: The psychometric properties of the Indonesian version of the 18-item Mental Toughness Questionnaire (MTQ-18) remain unclear. This study uses the Rasch model to elucidate these properties. In addition, boosting classification was adopted to assess the predictive validity for athletes' achievement. The sample comprised 400 athletes. According to the Martin-Loef likelihood-ratio test (likelihood ratio = 482, p = 1.0) and factor analysis of the Rasch residuals, the questionnaire tends to meet the unidimensionality assumption. The MADaQ3 = 0.074 shows an overall tendency towards local independence across all items, with the majority of items clustered at moderate to low levels of the measure. Q11, Q15, and Q18 were clearly identified as showing gender bias, with significant effect sizes. According to the boosting classification, the performance for national vs no achievement (F1 = 0., AUC = 0.) and international vs no achievement (F1 = 0., AUC = 0.) was flagged as unsatisfactory for prediction. In conclusion, the abridged questionnaire is not preferable for determining an individual's future performance or achievement. Future studies are needed to develop a better version that is less affected by gender bias, and to resolve the variability of the items.

Keywords: differential item functioning; gradient boosting classification; Mental Toughness Questionnaire; rating scale model;
Wright Map

Copyright © 2025 Psikohumaniora: Jurnal Penelitian Psikologi. This is an open access article under the terms and conditions of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

To cite this article (APA Style): Yudiarso, A., Ardhiani, I. W., Surya, R., Watimena, F. Y., & Kanzaki, M. (2025). Psychometric properties of the 18-item Indonesian Mental Toughness Questionnaire using the Rasch model and Machine Learning. Psikohumaniora: Jurnal Penelitian Psikologi, 10(1), 1-20. https://doi.org/10.21580/pjpp.

Corresponding Author: Ananta Yudiarso (nanta@staff.), Faculty of Psychology, Universitas Surabaya, Jl. Tenggilis Mejoyo, Kali Rungkut, Surabaya, Jawa Timur 60293, Indonesia.

https://journal. id/index. php/Psikohumaniora
Submitted: 7 Mar 2024; Received in revised form: 1 Jul 2024; Accepted: 3 Jul 2024; Published regularly: May 2025

Introduction

The importance of mental toughness (MT) is clearly seen in various contexts such as sports, education, and office workplaces. A probable explanation is that mental toughness is considered an essential resource for achieving optimal mental health and performance (Gerber et al., 2013; Lin et al., 2017; Papageorgiou et al.). In addition, the use of MT in sports has attracted research interest, as shown in the study of Hsieh et al., who used a systematic review approach to explore the implications of MT for athletes' performance. Regarding the variability of the questionnaires, they highlighted the need to use updated definitions of MT and Another study by Nicholls et al. by the quality of coaching and task-involving climate.

The original version of the MTQ was developed using the 4Cs model (challenge, commitment, control, and confidence). The instrument aimed to evaluate an individual's level of mental toughness and consisted of 48 items. Clough et al.
showed that mental toughness is a factor that influences people's friendliness and outgoing nature, helping them be calm and relaxed in competitive situations. According to Kobasa, challenge reflects the degree to which a person sees obstacles and trials as opportunities for personal growth, while commitment characterizes the determination and ability to complete a task successfully. Control shows a person's level of confidence in their ability to influence the course of their life, and confidence reflects self-belief in one's abilities, especially in completing tasks.

In particular, the psychometric properties of the MTQ-48 have given rise to debate regarding the dimensionality of the construct. Based on their second study, Gucciardi et al. proposed a unidimensional view of mental toughness rather than the 4Cs model, given the indication of overlap between the scales when treated as a multidimensional test. In contrast, Perry et al. and Perry et al. suggested a multidimensional model, but noted that MT could also be considered as an umbrella representing the general trait of associated constructs that influence In the development of the original questionnaire also gained some attention for the short version generation, with fewer items. Kawabata et al. created two abridged versions of the MTQ, the short MTQ (S-MTQ) and the very short MTQ (VS-MTQ), with support for the multidimensional model. In addition, Denovan et al. demonstrated that while the MTQ-18 had acceptable psychometric qualities in their Russian sample, it showed a slight problem with the factorial structure based on confirmatory factor analysis. Dagnall et al. recognized the MTQ-18 as an effective test, but preferred the MTQ-10 because it was more concise for practical purposes and tended to be unidimensional rather than multidimensional. Concerning the internal reliability of the MTQ-18, previous results showed Cronbach's alpha values, from highest to lowest, reported by Brand et al.
at α = ., Sabouri et al. at α = ., and Lang et al. at α = .

The findings discussed above relate to the properties of the English version of the MTQ-18 and were obtained with sound methods, but several issues have not been dealt with. This knowledge gap is the motivation for conducting this adaptation study with an Indonesian sample. The first issue is reliability, most commonly explained as the extent to which the items would behave in a similar way if they were administered to another sample from the same population (Schmidt et al., 2.). The use of Cronbach's alpha in previous studies has successfully demonstrated the internal consistency of the MTQ-18, but reliability was not considered from other perspectives, such as the separation of items and individuals. Second, previous studies have not clarified, for practical purposes, the agreement level of items compared to individuals, including their relationship. Constructing an effective questionnaire requires understanding the difficulties of the items and the magnitude of individuals' latent traits. Third, the performance of the MTQ-18 rating scale also remains unknown. This is concerning, because the distance between rating scale categories is critical for the validity of the measurement (Pornel & Saldaña; Wakita et al., 2.).

Moreover, we highlight the inconsistent findings on the ability of the MTQ to predict performance and achievement. A systematic review by Guszkowska and Wójcik revealed that among 18 studies, 16 indicated a strong relationship between mental toughness and athletes' performance. For instance, Meggs et al. discovered that mental toughness is strongly correlated with athletes' subjective performance, as well as being an antecedent of dispositional flow. However, recent research by Stimson et al.
concluded that there was a minimal contribution of mental toughness measured by the MTQ-48 to performance. The aforementioned studies used conventional statistics (linear regression) to predict performance from mental toughness. For this reason, a study is required that investigates the extent to which the MTQ, especially the short form, can predict athletes' performance using another method, such as machine learning (ML).

In machine learning, the algorithms are typically designed to deal with regression or classification problems. One of the key differences between ML and traditional statistics concerns the assumptions made. The traditional approach is top-down, with a predefined assumption or rigid premise together with the use of the p-value, while ML is bottom-up and approaches the data as largely unknown, prioritizing metrics such as accuracy and predictive performance (Orrù et al., 2.). Therefore, the main drawback of the traditional approach is that an inappropriate assumption about the data may lead to misleading results (Ley et al., 2.), for example, when linear regression is used on non-linear data.

Researchers believe that the rationale behind the use of ML in psychological studies, such as validation studies, relies on the assumption that the conventional approach may not be capable, on its own, of comprehensively addressing all the problems in psychological data. This belief is supported by Fokkema et al., who emphasize the use of ML in psychological studies, especially to leverage its ability to predict. In addition, ML is also valuable for assessing the criterion and construct validity of scales (Gonzalez, 2021; Trognon et al., 2.).

As briefly mentioned above, some studies have already delivered convincing results. Methodologically, Dagnall et al. and Denovan et al. used factor analysis and structural equation modeling (SEM) to study the quality of the MTQ-18.
Notwithstanding the merit of their procedures, researchers argue that the maximum likelihood estimation (MLE) type of factor analysis is flawed compared to weighted least squares (WLS) estimators. Marôco indicates that polychoric correlation with diagonal WLS is preferable in normally distributed data compared to Pearson correlation with MLE. Consequently, our study does not use the same method (factor analysis), instead preferring the Rasch model with conditional maximum likelihood (CML) estimation. The rationale is that the logit score promotes a more linear and objective measure, as noted by Boone and by Bond and Fox.

With respect to the methods used in previous research, this study offers three novelties. The first concerns the adaptation of the MTQ-18 for an Indonesian sample, while the second is the combination of internal structure and predictive validity to study the quality of the MTQ-18. The third is the use of the Rasch model and ML (gradient boosting machine) as the main approach. This study is the first to propose a combination of the Rasch model and ML as the main approach to adapting and validating the Indonesian version of the MTQ-18.

Following the standards for psychological testing of AERA, APA, and NCME, this study aims to use the Rasch model to gauge the evidence of internal structure through unidimensionality, local independence, fit statistics, rating scale performance, and the Wright map. For predictive validity, the research uses ML to determine the extent to which the questionnaire scale is able to predict athletes' overall achievement.

Methods

Participants

Ethical clearance was given by the ethical committee of the University of Surabaya, 68/KE/IV/2022. The study involved 400 participants, 194 male (48.5%) and 206 female (51.5%), all Indonesian athletes, with ages varying from 13 to 56 (M = 22., SD = 5.). Their sports fields were swimming .
25%). 5%). diving and underwater hockey . 75%). 25%). %). pencak silat . 5%). 25%). football and futsal . 25%). 75%). 25%) and others . 25%), comprising volleyball, dance sport, e-sport, golf, handball, judo, karate, rock climbing, petanque, sepak takraw, water skiing, taekwondo, tarung derajat, tennis, triathlon, and wushu. The researchers used nonrandom sampling through Google Forms as the data collection tool, after obtaining informed consent.

Mental Toughness Questionnaire - 18

This study focused on the MTQ-18, by Dagnall et al. The Likert scale used was: 1 = strongly disagree, 2 = disagree, 3 = neither agree nor disagree, 4 = agree, 5 = strongly agree. The forward and back translation of the MTQ-18 was performed by an English-Indonesian translator, then reviewed by four independent raters: researchers and postgraduate students with a background in psychological studies. Table 1 shows the final form of the Indonesian version of the MTQ-18. Items Q11, Q6, Q3, Q17, Q16, Q12, Q2, Q8, and Q9 are reverse-scored. The time needed to finish this test is approximately 58 minutes, according to the pre-trial test involving 10 participants, none of whom were confused by the instructions or items.

Rasch Model

The Rasch model uses a logit-based analytical paradigm to examine items and participants' raw data (Linacre, 1.). Equation 1 represents the basic form of the polytomous Rasch model, the rating scale model (RSM). Parameter estimation usually uses joint, marginal, or conditional maximum likelihood. The study employed conditional maximum likelihood (CML). $P(X_{si} = x)$ is the probability that person $s$ selects category $x$ of item $i$ (Andrich, 1.). The summation ensures the calculation of the probabilities of every possible response category, from 0 to the total number of categories ($m_i$). $\theta_s$ is the latent trait of person $s$ and shows the location on the latent continuum, $\beta_i$ is the difficulty parameter of item $i$, and $\tau_k$
is the threshold that explains the transition point between adjacent categories.

Table 1
Mental Toughness Questionnaire 18, Indonesian Version

MTQ-18 (Original Version) | MTQ-18 (Indonesian Version) | Item
I am generally able to react quickly when something unexpected happens | Saya umumnya mampu bereaksi dengan cepat saat sesuatu yang tidak terduga | Q13
I generally cope well with any problems that occur | Saya biasanya mengatasi dengan baik setiap masalah yang terjadi |
I often wish my life was more predictable | Saya berharap hidup saya lebih bisa | Q11*
"I just don't know where to begin" is a feeling I usually have when presented with several things to do at once | "Saya tidak tahu harus mulai dari mana" adalah perasaan yang biasa saya rasakan ketika dihadapkan pada beberapa hal yang harus dilakukan sekaligus | Q6*
I usually find it hard to summon enthusiasm for the tasks I have to do | Saya biasanya sulit membangkitkan semangat terhadap tugas-tugas yang harus dikerjakan | Q3*
I usually find it difficult to make a mental effort when I am tired | Saya biasanya merasa sulit untuk melakukan usaha mental ketika lelah | Q17*
I generally find it hard to relax | Saya biasanya sulit untuk rileks | Q16*
When I am feeling tired I find it difficult to get going | Ketika lelah, saya merasa kesulitan untuk memulai sesuatu | Q12*
Even when under considerable pressure I usually remain calm | Bahkan ketika berada dibawah tekanan yang besar, saya biasanya tetap tenang |
I tend to worry about things well before they actually happen | Saya cenderung mengkhawatirkan segala sesuatunya jauh sebelum hal itu terjadi | Q2*
I generally feel in control | Saya biasanya merasa memegang kendali | Q10
When I make mistakes, I usually let it worry me for days after | Ketika saya membuat kesalahan, saya cenderung merasa khawatir selama beberapa hari | Q8*
In discussions, I tend to back down even when I feel strongly about something | Dalam diskusi, saya sering mundur meskipun merasa yakin tentang sesuatu | Q9*
If I feel somebody is wrong, I am not afraid to argue with them | Jika saya merasa seseorang salah, saya tidak takut untuk berdebat dengan | Q18
I generally feel that I am a worthwhile person | Secara umum, saya merasa bahwa saya adalah orang yang berharga |
I usually speak my mind when I have something to say | Saya selalu berbicara jujur ketika memiliki sesuatu untuk disampaikan |
I generally look on the bright side of life | Saya biasanya melihat sisi positif | Q15
However bad things are, I usually feel they will work out positively in the end | Meskipun keadaannya buruk, saya yakin semuanya akan berakhir baik | Q14

Equation 1
Rasch Model Rating Scale

$$P(X_{si} = x) = \frac{\exp\left[\sum_{k=0}^{x} \left(\theta_s - (\beta_i + \tau_k)\right)\right]}{\sum_{h=0}^{m_i} \exp\left[\sum_{k=0}^{h} \left(\theta_s - (\beta_i + \tau_k)\right)\right]} \quad (1)$$

Several metrics need to be understood; thorough guidelines can be found elsewhere (Bond & Fox, 2015; Linacre & Wright, 2012; Wolins et al., 1.). Infit and outfit statistics of the residuals, both MNSQ (mean-square) and ZSTD (z-standardized), are expected to fall within 0.5 ≤ x ≤ 1.5 (MNSQ) and -2 ≤ x ≤ 2 (ZSTD). The reliability of the study was based on separation, with a desirable value of ≥ . The dimensionality measure of this approach indicated the construct validity of the instrument. The raw variance is not the sole method of estimating dimensionality; it should be followed by factor analysis of the Rasch residuals if there is an indication of eigenvalues higher than 1.5 or 2 in any contrast (Smith, 2.). The likelihood ratio Martin-Loef test was also used to support the unidimensionality assumption (Christensen et al.). For local independence, Q3 statistics were used. Q3 is a non-parametric method to gauge the between-item residual correlation (Debelak & Koller, 2.).

The rating scale analysis focused on the Rasch-Thurstone thresholds. This is a cumulative-probability approach that represents the value at which there is a 50% chance of favoring a certain rating category (Linacre, 1.). Although originally tested in a dichotomous model, this threshold can also be applied to the rating scale model (Linacre, 2.). For differential item functioning (DIF), this study employed Raju's area method to check the difference between groups (Raju, 1988, 1.). To adjust the significance values, the Benjamini-Hochberg (BH) procedure was used, as it is a false discovery rate (FDR) control technique that minimizes false positives and is recommended by Kim and Oshima.

The analysis of the rating scale model and differential item functioning was made using the eRm package (Mair et al., 2.), the WrightMap package (Irribarra & Freund, 2.), and the difR package (Magis et al., 2.) in RStudio, version 2024.

Boosting Classification

As an integral part of the study, the authors adopted machine learning, specifically the gradient boosting machine (GBM) approach, to solve the classification problem of discerning participants' achievement from their MTQ-18 scores. Gradient boosting was previously proposed by Friedman. The rationale for using this technique was its flexibility in dealing with more imbalanced data compared to other methods, such as support vector machines or random forests (Benkendorf et al., 2.). The algorithm (Steps 1-5) for the GBM is shown below (Algorithm 1).

Algorithm 1
Gradient Boosting Classification

Input: data $\{(x_i, y_i)\}_{i=1}^{n}$ and a differentiable loss function $L(y_i, F(x))$
Step 1: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$
For $m = 1$ to $M$:
Step 2: $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$, for $i = 1, \ldots, n$
Step 3: Fit a regression tree to the $r_{im}$ and create terminal regions $R_{jm}$, for $j = 1, \ldots, J_m$
Step 4: $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$
Step 5: $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$
Output: $F_M(x)$

In the input, $x_i$ refers to the features and $y_i$ to the target (classification outcome), while $L(y_i, F(x))$ is a transformation of the negative log-likelihood, the differentiable loss function. As in the regression gradient boosting machine, Step 1 in the classification GBM finds the initial leaf $F_0(x)$, which consists of a constant ($\gamma$ represents the log(odds)), while the summation requires us to sum the loss function over each $y_i$ (observed score); the aim is to find the log(odds), or $\gamma$, that minimizes the sum. Step 2 involves calculating the pseudo residuals. $\partial L(y_i, F(x_i))$ is the derivative of the loss function with respect to the predicted log(odds). Put simply, the pseudo residual is the observed score minus the predicted probability. More importantly, $F(x) = F_{m-1}(x)$ requires us to use the most recently updated predicted log(odds). Step 3 is the construction of a regression tree to predict the residuals ($r_{im}$), while Step 4 is concerned with calculating the output of the new tree, for $j = 1, \ldots, J_m$. The output value of each leaf is the value of gamma ($\gamma_{jm}$), obtained by dividing the residuals by the second derivatives of the loss function or, put simply, the residuals divided by $p(1 - p)$. Step 5 involves creating a new prediction ($F_m(x)$) for each sample, given the information from the earlier predictions $F_0(x)$, $F_1(x)$, $F_2(x)$, and so on. $\nu$ is the learning rate (usually set at a small amount, such as 0.1 or 0.), which is essential to avoid over-optimistic results, whereas $\sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$ is the output value of the tree built in Steps 3 and 4. $F_M(x)$ is the final product.
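Steps 1 to 5 of Algorithm 1 can be illustrated with a minimal sketch for a binary outcome. It assumes a single numeric feature, depth-one regression trees (stumps), and the log-loss; the function names (`gbm_fit`, `gbm_predict_proba`), the default of 20 trees, and the absence of subsampling and cross-validation are illustrative assumptions and do not reproduce the JASP implementation used in the study.

```python
import math


def sigmoid(z):
    """Convert a log(odds) score into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(x, residuals):
    """Step 3 (simplified): find the split threshold minimizing squared error.

    Assumes x contains at least two distinct values.
    """
    best_thr, best_sse = None, float("inf")
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        sse = 0.0
        for side in (True, False):
            part = [r for xi, r in zip(x, residuals) if (xi <= thr) == side]
            mean = sum(part) / len(part)
            sse += sum((r - mean) ** 2 for r in part)
        if sse < best_sse:
            best_thr, best_sse = thr, sse
    return best_thr


def gbm_fit(x, y, n_trees=20, shrinkage=0.1):
    """Fit a boosted ensemble of stumps on binary targets y (0 or 1)."""
    p0 = sum(y) / len(y)
    f0 = math.log(p0 / (1.0 - p0))            # Step 1: initial log(odds)
    scores = [f0] * len(x)
    trees = []
    for _ in range(n_trees):
        probs = [sigmoid(s) for s in scores]
        residuals = [yi - pi for yi, pi in zip(y, probs)]   # Step 2
        thr = fit_stump(x, residuals)                        # Step 3
        gammas = {}
        for side in (True, False):                           # Step 4
            idx = [i for i, xi in enumerate(x) if (xi <= thr) == side]
            numerator = sum(residuals[i] for i in idx)
            denominator = sum(probs[i] * (1.0 - probs[i]) for i in idx)
            gammas[side] = numerator / denominator
        # Step 5: update each sample's log(odds) with the shrunken leaf value
        scores = [s + shrinkage * gammas[xi <= thr] for s, xi in zip(scores, x)]
        trees.append((thr, gammas[True], gammas[False]))
    return {"f0": f0, "shrinkage": shrinkage, "trees": trees}


def gbm_predict_proba(model, xi):
    """Predicted probability of the positive class for one feature value."""
    score = model["f0"]
    for thr, gamma_left, gamma_right in model["trees"]:
        score += model["shrinkage"] * (gamma_left if xi <= thr else gamma_right)
    return sigmoid(score)
```

With a hypothetical person-measure feature such as `x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]` and labels `y = [0, 0, 0, 1, 0, 1, 1, 1]`, the fitted model assigns a higher probability to the highest measure than to the lowest. The leaf value in Step 4 is the sum of residuals divided by the sum of p(1 - p), matching the simplification described above.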
Several metrics need to be focused on for the classification: recall or the true positive rate (TPR), the false positive rate (FPR), the F1 score, the Matthews correlation coefficient (MCC), the Youden index (J), the area under the curve (AUC), and Andrews curves. Complete guidelines on these metrics are available elsewhere (Chicco et al.; Chicco & Jurman, 2023; Hossin & Sulaiman; Vujovic, 2.). TPR and FPR are measures of true and false observations in the data: a positive observation that is correctly recognized as positive (TPR), or a negative observation that is wrongly recognized as positive (FPR). MCC indicates whether the classification is more than a random guess. This score ranges from -1 to 1, with a value approaching 1 indicating perfect predictive performance and a value approaching -1 showing total disagreement. The F1 score measures the accuracy of the model using information from recall and precision. J is a metric for the effectiveness of diagnostic tests.

Andrews curves are a data visualization method to elucidate how well the classification process distinguishes between the classes (binary outcome). The curves are derived from the Fourier series as a projection of high-dimensional data, with the x-axis denoting the argument of the Fourier series, which ranges from -π to π. In a well-performing boosting classification model, Andrews curves for data points from the same class should cluster together, indicating effective class separation (Moustafa, 2.). Our analysis was run using the GBM package in Jeffreys's Amazing Statistics Program (JASP) 0.

Logistic Regression

We also provide the results of logistic regression as a comparison to the ML approach (GBM). This approach is well known as a way to model the relationship between the target (category classes) and the predictor variables (Grömping, 2.). The main idea of this technique is to estimate the probability of the occurrence of an event, given the predictors.
Evaluation of the goodness of fit is made using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), with better models having lower AIC and BIC. In addition, a pseudo R2 was also used, which has a different interpretation from the R2 of classic regression. It is a measure of fit that typically compares the likelihood of the models (McFadden R2) or measures the difference in the mean predicted probabilities between the classes (Tjur R2) (Grömping, 2.). This analysis was run in JASP.

Results

Dimensionality Analysis

The results of the Martin-Loef likelihood ratio test with median split were: likelihood ratio = 482, DF = 1,295, p = 1.0, which indicate no violation of the unidimensionality assumption. In addition, according to the Rasch residuals factor analysis, the eigenvalues for the first four components were 8, 1.5, 1.3, and 1.2 respectively, showing no indication of a significant secondary construct. However, as an additional report, we present the highest loadings of the item residuals on the first component: Q8 = ., Q11 = ., and Q13 = -. For local independence, the MADaQ3 yielded .074, with the majority of item pairs having Q3 ≤ .3, except for Q13-Q14 = .

Reliability Analysis

Cronbach's alpha was .83, while person separation reliability was 0.352, with squared standard deviation (SSD) = .115 and mean squared error (MSE) = . Low separation reliability indicates that the test length is insufficient, or that it would be challenging to separate the item difficulties in the questionnaire into a wide range of levels, because they are clustered at relatively the same level of difficulty.

Item Analysis

The descriptive statistics were 7,200 data points, M = 1,717, SD = 207.8, log-likelihood χ2 = 10,3513, DF = 6,780, global root-mean-square (RMS) residuals = .579, and p = . As shown in Table 2, items Q8 and Q11 were the most challenging to agree on, which indicates that participants were less likely to disagree or choose category 1 (strongly disagree) or 2 (disagree) for these particular items. On the contrary, Q14, Q13, and Q3 were relatively more inclined towards agreement. The point biserial correlation ranged from .24 to 0. All of the items showed acceptable fit indices.

Rating Scale Analysis

As can be seen in Table 3, and subsequently confirmed in Figure 1, the distance between l1 and l2 was shorter than that between l2 and l3 and between l3 and l4, indicating that categories 1, 2, and 3 were more likely to be chosen by the participants for all items, as they are closer to each other. On the contrary, the distances between categories 3, 4, and 5 were greater, indicating that individuals may behave more cautiously in agreeing on most of the items, especially Q11, Q8, and Q12.

Differential Item Functioning

DIF analysis using Raju's area approach, with significance adjusted using Benjamini-Hochberg (BH), was conducted to check for biased items between groups (male and female). The reference group was set as males, with the focal group set as females. As shown in Table 4, several items were flagged as being biased towards males or females. Taking Raju (1.) as the effect size, Q11, Q15, and Q18 showed significantly (p < .) different functioning across the two groups. Q8, however, exhibited a moderate effect size. Negative statistics show that the trend of bias is towards the focal group, while positive statistics indicate that the items appeared to function in a way that was more favorable to the mental toughness trait of the reference group.
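The Benjamini-Hochberg adjustment applied to the DIF p-values can be sketched as follows. This is a minimal illustration of the standard step-up procedure, not the difR code used in the study; the function name `benjamini_hochberg` and any p-values passed to it are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input.

    Each p-value is multiplied by m / rank (rank 1 = smallest p-value),
    and a running minimum taken from the largest rank downward keeps the
    adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An item would then be flagged for DIF when its adjusted p-value falls below the chosen false discovery rate, for example 0.05 (an assumed threshold for illustration).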
Table 2
Item Analysis (columns: Item, Mean, Measure, Infit MNSQ, Outfit MNSQ, Point biserial)

Figure 1
Wright Map

Table 3
Rating Scale Thresholds (by item)

Table 4
Differential Item Functioning (columns: Item, Statistic, Adjusted p, Raju effect size)
Note: A (< 1, negligible effect), B (> 1, moderate effect), C (> 1.5, large effect).

Prediction of Achievement

The objective of the analysis was to evaluate the influence of the MTQ-18 on prediction. We formulated this by classifying the ordinal hierarchy of athletes' achievements based on the level and competitiveness of the tournament. Achievement was assessed on the basis of each athlete's best accomplishment. International tournaments were labeled as 2 and national competitions as 1; if the achievement did not correspond to either national or international level, it was labeled as 0.

The data pre-processing was conducted by eliminating participants who had the highest unexpected responses, and was also based on their fit statistics. These participants were assumed to be outliers that may affect the integrity of the study and the classification process. The final data for this GBM analysis came from 381 participants. Subsequently, the data were separated into two groups. The first comprised the national achievement class (. athletes, 68%) vs the no achievement class (. athletes, 32%), with a total of 316 individuals. The second group related to international achievement (. athletes, 39%) vs no achievement (. athletes, 61%), with a total of 167 individuals. The data split used 20% as a test sample and 5 folds for training-validation.
The minimum number of observations in each node was 10, with 50% of the training data used per tree. As shown in Table 5, the model summary and evaluation metrics of the first model (national vs no achievement), using 253 persons as training-validation data and 63 as test data, were: validation accuracy = ., test accuracy = ., trees = 7, shrinkage = .1, and Youden index (J) = . On the other hand, the evaluation of the second model (international vs no achievement), using 134 persons as training-validation data and 33 persons as test data, was: validation accuracy = ., test accuracy = ., trees = 3, shrinkage = .1, and Youden index (J) = . In particular, the classification was more distinguishable in the national vs no achievement group (Figure 2 - A), with higher AUC for national competitions. According to Figure 2 - C, the GBM model shows relatively inferior performance in predicting international achievement.

For comparison with the boosting classification performance, conventional logistic regression with the enter method was also conducted. Participants were eliminated based on the fit statistics of the Rasch model. Instead of using the raw ordinal scores, the analysis was run using the person logits obtained from the parameter estimation of the Rasch (rating scale) model.

Table 5
Evaluation Metrics (columns: Metrics, Accuracy, Precision, Recall, FPR, F1 Score, MCC, AUC; rows: Achievement, Average/Total)
Note: 0 = No achievement, 1 = National achievement, 2 = International achievement.

Figure 2
Receiver Operating Curves and Relative Influence Plots of Boosting Classification
Note: A - ROC between no achievement (0) and national achievement (1). B - Relative influence between 0 and 1. C - ROC between no achievement (0) and international achievement (2). D - Relative influence between 0 and 2.
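The evaluation metrics listed in Table 5 all derive from the four cells of a binary confusion matrix. The sketch below shows the standard formulas; the function name `classification_metrics` and any counts passed to it are hypothetical and do not correspond to the study's results. It assumes every denominator is non-zero.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute recall (TPR), FPR, precision, F1, MCC, and Youden's J."""
    recall = tp / (tp + fn)                      # true positive rate
    fpr = fp / (fp + tn)                         # false positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    youden_j = recall - fpr                      # sensitivity + specificity - 1
    return {
        "recall": recall,
        "fpr": fpr,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
        "youden_j": youden_j,
    }
```

A classifier that guesses at random yields an MCC and a Youden index close to 0, which is why both are useful alongside accuracy when the classes are imbalanced, as they are in the two groups above.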
The model summary of the classification between national vs no achievement was M0 > M1, with AIC0 = 399., BIC0 = 403., AIC1 = ., BIC1 = 408., DF = 314, p = ., McFadden R2 = .002, and Tjur R2 = . The estimation of participant logit = ., SE = ., and odds ratio = 1. For the evaluation metrics, accuracy = .67 and area under the curve (AUC) = . On the other hand, the classification between international vs no achievement was M0 > M1, with AIC0 = 225., BIC0 = 228., AIC1 = 226., BIC1 = 233., DF = 314, p = .589, McFadden R2 = .001, and Tjur R2 = . The estimation of participant logit = ., SE = ., and the odds ratio = 1. For the evaluation metrics, accuracy = .61 and AUC = 0.

In summary, for the logistic regression, there was no evidence of significant results from the p-value or from the model comparison through the AIC and BIC in either group. M0 > M1 indicates that M1 (participants' logit) did not improve the prediction compared to the null model (M0). However, the model's positive estimate indicated a positive relation between performance and the logit scores: the higher the logit, the higher the estimated achievement.

Andrews curves (Figure 3) are a visualization technique used to interpret high-dimensional data, and can be particularly useful in machine learning, including the boosting approach, when assessing class separation in classification tasks. Each data point is represented by a curve, with similar points (typically from the same class) having similar curve shapes. The curves in Figure 3 - A and B for each class overlapped with each other. This indicated that the model was struggling to distinguish between these classes, possibly due to noise or insufficient features. However, in Figure 3 - B, the distinction between no achievement and international achievement was more noticeable than the distinction in Figure 3 - A.
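The construction of an Andrews curve can be sketched as follows: each observation's feature vector becomes the coefficients of a finite Fourier series evaluated over t from -π to π. This is a minimal sketch of the standard definition, not the JASP plotting routine; the function name `andrews_curve` is hypothetical.

```python
import math


def andrews_curve(features, t):
    """Evaluate the Andrews curve of one observation at angle t.

    f(t) = x1 / sqrt(2) + x2 * sin(t) + x3 * cos(t)
           + x4 * sin(2t) + x5 * cos(2t) + ...
    """
    value = features[0] / math.sqrt(2.0)
    for j, x in enumerate(features[1:], start=1):
        harmonic = (j + 1) // 2          # 1, 1, 2, 2, 3, 3, ...
        if j % 2 == 1:
            value += x * math.sin(harmonic * t)
        else:
            value += x * math.cos(harmonic * t)
    return value
```

Plotting `andrews_curve(row, t)` over a grid of t values for every observation, colored by class, gives one curve per person; heavily overlapping bundles of curves correspond to the weak class separation described above.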
Figure 3
Andrews Curves of Boosting Classification
Note: 0 = no achievement, 1 = national achievement, 2 = international achievement.

Discussion

Based on the Martin-Loef test, local independence, and factor analysis of the residuals in our study sample, the Indonesian version of the MTQ-18 is a well-functioning questionnaire measuring a unidimensional construct. However, Denovan et al. also investigated the MTQ-18 and found some inconsistent factor loadings for the challenge and control dimensions. Their results may arise because, to some extent, mental toughness is better represented as a unidimensional latent trait. The tendency towards unidimensionality in the shorter versions of the MTQ was also highlighted in research by Gerber et al. Moreover, adding more items appears to turn the MTQ into a more multidimensional construct, although Perry et al. recommended a bifactor model, which was shown to be more suitable for the MTQ-48.

Items Q14, Q13, and Q3 were the easiest items to agree with, while Q11 and Q8 were those that participants found more difficult to agree with. According to the mean-square (MNSQ) statistics, judged against the threshold of .5 to 1.5, all the items showed ideal fit indices. Moreover, according to the Wright map, some items were grouped at similar levels of agreement and so can be considered redundant; for instance, the pairs Q13 and Q3, Q15 and Q4, and Q17 and Q5. It is therefore legitimate to argue that the items in each pair measure the same level of difficulty of the latent trait. We also suggest eliminating one item from each pair if an abridged version of the MTQ-18 is needed (e.g., MTQ-10).

The rating scale of the questionnaire performed well, and it is important to ensure a constant range of category scales, regardless of the range of the rating, such as 1-5 or 1-7. Pornel and Saldana
explain that asymmetric verbal anchors may affect the validity of the rating scale used in research. Therefore, based on the threshold measures across all items, we found that the test items shared similar trends. The distance from l1 to l2 was the smallest, indicating that the rating categories "strongly disagree", "disagree", and "neither agree nor disagree" were not well differentiated in meaning for the participants. While this pattern did not affect overall scale performance, it indicates the potential benefit of revising the category labels or providing clearer definitions to enhance differentiation and interpretability.

A previous study of the MTQ-48, which measured the construct consistently across gender and age groups, found that the questionnaire functioned well (Perry et al.). However, our study found some notable items that demonstrated gender bias (towards both males and females), with large effect sizes from the Raju calculations. These items were Q11, Q15, and Q18, with a large effect size (> 1.), and Q8, with a more moderate effect size. This suggests that males and females may interpret or respond to these items differently, which could reflect inherent differences in how mental toughness is experienced or expressed across genders. Therefore, comparisons of the performance of males and females should be made carefully. This finding is also consistent with previous studies, which emphasize that males tend to have higher mental toughness scores (Nicholls et al., 2009; Yarayan et al.).

The use of the Rasch model in psychometric practice is not new. Wright compared factor analysis with the Rasch approach, indicating that the most problematic issue in the use of Likert-type data in factor analysis is the poor reproducibility of the factor sizes and loadings.
Therefore, he believed that logit transformation was an alternative that could overcome this issue. Jamieson also explained that any ordinal data, including Likert-type data, should be treated with nonparametric analysis. However, Sullivan and Artino presented contrasting arguments, claiming that if the normality assumption holds and the sample size is adequate, it is not necessary to treat Likert data as ordinal. Carifio and Perla also contend that although a Likert response (consisting of one item) may behave in an ordinal fashion, a Likert scale (comprising several items) exhibits interval-level measurement, referring to their atom-molecule-scale terminology.

Regarding the comparison of classification approaches, the performance of logistic regression in this study was inferior to that of the boosting classification. The advantage of boosting is reasonable given how the approach handles classification problems, especially with non-linear and complex data. Robustness is achieved by starting the analysis of the training data with a weak learner, which results in inaccurate predictions. The errors from that step then become the new, updated target, to which a weak learner is again applied, producing updated errors (the pseudo-residuals). Repeating this step gradually narrows the residuals in the correct direction and later produces the final prediction (Ferreira & Figueiredo, 2012; Schapire). The advantages of boosting compared to traditional logistic regression have also been discussed in previous studies (Belsti et al., 2023; Zheng et al.).

The performance of the MTQ-18 in predicting athletes' achievement in this study was below a satisfactory level. This is in line with the findings of Stimson et al., who assessed the MTQ-48 and found minimal evidence of the capability of mental toughness to predict performance.
Furthermore, the deficient performance of our model is, to some extent, consistent with the evidence of low separation reliability of individuals and of the item logit measures. This is confirmed by the Wright map, which shows that the MTQ-18 is more sensitive in measuring individuals with moderate to low agreement, so it is not ideal for effectively distinguishing between those with high and low agreement. However, previous studies by Meggs et al. and Cowden did demonstrate the importance of mental toughness in producing athletes with high performance and achievement.

Regarding the overall process of our study, we emphasize an approach to establishing psychometric properties that combines traditional methods with machine learning. Internal structure validity, together with reliability measures, is an important preliminary step towards ensuring the quality of the data before commencing predictive validity analysis using machine learning. These steps matter in light of the GIGO (garbage in, garbage out) notion: the quality of the input data prior to statistical analysis undoubtedly affects the output (Kilkenny & Robinson). Moreover, in terms of human annotation in machine-learning studies, researchers are advised to be fully responsible for ensuring the validity of the training data before commencing prediction (Geiger et al.).

As implicitly stated above, there are several implications of this study. The main argument was that the MTQ-18 displayed poor predictability of athletes' achievement. As already noted, even the full English version of the questionnaire (MTQ-48) did not demonstrate notable performance in predicting achievement. Therefore, researchers may need to draw on a broader set of properties in order to predict achievement accurately.
Additionally, practitioners should treat the male and female norm scores separately, as some items in the questionnaire behave differently for males and females.

The results of the study were limited by the sample size and characteristics, as we focused on only 400 individual athletes. Moreover, the distribution of sports was imbalanced (dominated by swimming), and the data for the boosting classification and logistic regression were also imbalanced. Therefore, it is recommended that balanced data be used for each group, together with tests on other sample characteristics, such as students in education or employees in organizational settings. In addition, the Martin-Loef dimensionality test works better with more participants, such as > 600. Further studies are needed to develop a better version of the MTQ for the Indonesian culture that is less impeded by gender bias. Other studies are also encouraged to resolve the variability of the items, which should cover a wide range of levels rather than clustering in the middle and low levels.

Conclusion

In conclusion, the Indonesian version of the MTQ-18 is a unidimensional questionnaire with positive internal consistency among items. However, the separation level of the items is very poor, and the questionnaire is more appropriate for measuring individuals with moderate to low levels of mental toughness. Performance across gender should be interpreted cautiously, as some items show gender bias. The predictive validity of the questionnaire is unsatisfactory; therefore, this short test is not preferable for predicting individuals' future achievement or performance in a precise manner.

Acknowledgments

The authors would like to thank the English-Indonesian translator and the four raters who assisted in the translation process and evaluation of the item content.

Author Contribution Statement

Ananta Yudiarso: Conceptualization, Formal Analysis, Investigation, Methodology, Project Administration, Resources, Validation, Visualization,
Writing - Original Draft, Writing - Review & Editing.
Ista Wirya Ardhiani: Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Resources, Validation, Writing - Original Draft.
Roy Surya: Formal Analysis, Investigation, Methodology, Validation, Visualization, Writing - Review & Editing.
Ferry Yohannes Watimena: Conceptualization, Investigation, Project Administration, Resources, Writing - Review & Editing.
Mami Kanzaki: Conceptualization, Formal Analysis, Investigation, Writing - Original Draft, Writing - Review & Editing.

References