Infinity Journal of Mathematics Education, Volume 14, No. 4, 2025
p-ISSN 2089-6867, e-ISSN 2460-9285
https://doi.org/10.22460/infinity.
An application of many-facet Rasch measurement to validate the numeracy test for elementary students
Shahibul Ahyan1*, Sri Supiyati1, Fahrurrozi1, Muhammad Nasiru Hassan2
1 Department of Mathematics Education, Universitas Hamzanwadi, West Nusa Tenggara, Indonesia
2 Faculty of Education, Sokoto State University, Sokoto, Nigeria
* Correspondence: shahibulahyan@hamzanwadi.
Received: Dec 28, 2024 | Revised: Jun 24, 2025 | Accepted: Jul 21, 2025 | Published Online: Sep 12, 2025
Abstract
Numeracy skills are essential for students' academic achievement and everyday decision-making; however, appropriate evaluation instruments are lacking.
The main objective of this study was to investigate the psychometric characteristics of a numeracy test consisting of 16 items (12 multiple-choice and four essay), which were evaluated by 12 expert raters.
This study utilized Many-Facet Rasch Measurement (MFRM) to examine item difficulty, rater severity, and participant ability, thus providing an in-depth assessment of the validity and reliability of the test.
The findings showed that all 16 items fit the Rasch model, exhibiting appropriate difficulty levels and ensuring that the test effectively differentiated participants' diverse levels of numeracy ability.
In addition, the study demonstrated a uniform rater performance, thereby increasing the dependability of the evaluation.
This study highlights the need for modern psychometric techniques in educational evaluation to create more effective instruments for assessing numeracy in mathematics education.
This study promotes mathematical assessment and offers a basis for future research to improve educational measurement techniques.
Keywords: Fit statistics, Item measurement, Many-facet Rasch measurement, Numeracy, Rater severity
How to Cite:
Ahyan, S., Supiyati, S., Fahrurrozi, & Hassan, M. N. (2025). An application of many-facet Rasch measurement to validate the numeracy test for elementary students. Infinity Journal, 14(4), 861-876. https://doi.org/10.22460/infinity.
This is an open access article under the CC BY-SA license.
INTRODUCTION
Numeracy is one of the skills that students need in this era.
Numeracy, characterized as the capacity to comprehend and manipulate numerical data, is an essential competency that supports numerous facets of everyday life, education, and professional endeavors (O'Meara et al.
, 2.
The importance of numeracy transcends the fundamentals: it includes the capacity to analyze data, make informed choices, and solve problems in various circumstances (Getenet, 2022;
Steen, 2.
The evaluation of numeracy skills in mathematics education has received increased attention, especially as educators and researchers aim to discover effective strategies for assessing and improving these skills in learners of all ages (Buljan et al.
, 2019.
Purnomo et al.
, 2.
Early intervention in numeracy skills is essential to build a strong foundation in mathematics. Research has shown that competence and affect self-perceptions in math are separate factors, even at the elementary school level, and these perceptions are related to effort and academic achievement (Arens & Hasselhorn, 2.
By assessing numeracy skills early, educators can identify and address any gaps or difficulties that students may have, potentially preventing long-term struggles with mathematics.
Interestingly, studies have demonstrated that elementary school students are capable of developing complex thinking skills, such as systems thinking, when provided with appropriate curricula and learning environments (Assaraf & Orion, 2.
This suggests that numeracy assessments at this level could be designed to evaluate not only basic skills but also higher-order mathematical thinking.
Prior research has emphasized the significance of early numeracy skills as indicators of subsequent mathematical success.
Research has demonstrated that fundamental numeracy skills, including number recognition and counting, are strongly associated with subsequent mathematical performance (Jordan et al.
, 2009.
Krajewski & Schneider, 2.
The significance of home-learning environments and parental engagement in promoting numeracy abilities is well-established, indicating that supportive settings can improve children's arithmetic development (Hart et al.
, 2.
Nonetheless, despite these findings, a significant gap persists in the literature concerning the validation of numeracy assessment instruments, especially those employing sophisticated psychometric techniques, such as many-facet Rasch measurement.
Many-Facet Rasch Measurement (MFRM) is preferred over the regular Rasch Model when evaluating data involving multiple facets such as judges, criteria, and artifacts.
MFRM can account for the complexity of these multifaceted assessments, providing a more accurate and nuanced analysis (Boone et al.
, 2.
In contrast to the regular Rasch Model, MFRM can simultaneously analyze multiple sources of measurement error, such as raters, items, and cases.
This comprehensive approach provides valuable information for quality control and the improvement of assessment processes (Iramaneerat et al.
, 2007.
Primi et al.
, 2.
MFRM is particularly useful in situations in which raters or judges may have varying levels of severity or leniency.
The many-facet Rasch measurement (MFRM) provides a comprehensive framework for assessing the reliability and validity of evaluation tools by considering various dimensions of measurement, such as individual ability, item difficulty, and rater severity (Eckes, 2019.
He et al.
, 2.
This methodology has been effectively utilized across multiple domains, particularly in health numeracy, to validate instruments that evaluate numeracy competencies in medical settings (Alghodaier et al.
, 2017.
Ichikowitz et al.
, 2.
Nonetheless, its utilization in mathematics education, especially in the validation of numeracy assessments, has not been extensively investigated.
The current literature predominantly emphasizes conventional psychometric techniques, which may insufficiently address the intricacies of numeracy evaluations (McNaughton et al.
, 2013.
Weller et al.
The originality of this study lies in its use of MFRM to authenticate a numeracy assessment tailored for mathematics teaching.
Utilizing this sophisticated measurement methodology, we wanted to deliver a thorough assessment of the test's psychometric attributes, encompassing its reliability and construct validity.
The principal objective of this work is to validate a numeracy assessment using a many-facet Rasch measurement, thereby furnishing educators and researchers with a dependable tool for evaluating numeracy competencies in mathematics.
We aimed to investigate the following research questions: What are the psychometric characteristics of the numeracy test as assessed by MFRM? How do various factors, including item difficulty and rater severity, affect assessment outcomes?
METHOD
Research Design
This is a cross-sectional study.
Cross-sectional research design is perhaps the most common design in the social sciences, occurring when researchers collect data from a group of research participants at a single point in time using instruments such as tests, questionnaires, interviews, or observations (Bell & Jones, 2.
Cross-sectional research is used because this study collects data only once or over a short period.
In addition, cross-sectional research helps researchers compare several variables at the same time. The study employed a quantitative approach, specifically a psychometric technique, to assess the validity and reliability of the numeracy test.
This approach is based on Rasch measurement theory, which underscores the necessity of developing accurate assessments that produce invariant measurements across diverse contexts and populations (Boone et al.
Sondergeld & Johnson, 2.
This study sought to provide empirical information concerning the psychometric qualities of the test, encompassing item difficulty, person ability, and rater severity, which are essential for determining the test's overall efficacy in assessing numeracy skills (Bailes & Nandakumar, 2020.
Nam et al.
, 2.
This study employed a three-facet design within the framework of Many-Facet Rasch Measurement (MFRM).
These facets consisted of numeracy test items (the object of measurement), experts, and criteria.
It is important to note that, while the numeracy test serves as the object of measurement, it is also considered a facet within the MFRM.
Research Participants
A panel of 12 raters with qualifications in mathematics education was assembled to evaluate the content validity of the numeracy test items.
The codes for each rater were Rater 1 to Rater 12, and the demographic data of the raters are presented in Table 1.
This expert panel evaluated the relevance and clarity of each item, ensuring that the test accurately represented the constructs of numeracy, as delineated in the literature (Nguyen et al.
, 2.
Table 1. The raters' demographic profiles
Columns: Demographic, Frequency, Percentage (%)
Gender: Male, Female
Age: Below 40 years old, 40-50 years old, Above 50 years old
Status: Lecturer, Teacher
Data Collecting Techniques
Data were collected using a validation sheet provided to the 12 raters. The validation sheet was given to each rater separately, with sufficient time to assess the numeracy test. The validation sheet contained rater information, instructions for completion, and a table of seven columns, in sequence: question number, ability, process, content, context, sentence structure, and rater comments (see Figure). The second to sixth columns were filled in using five rating categories, from strongly irrelevant to strongly relevant. The last column contained the raters' qualitative comments.
The numeracy test was appended to the validation sheet. It consisted of 16 items (codes N1 to N16), namely 12 multiple-choice and 4 essay questions, with 4 questions each on numbers, algebra, geometry, and probability and statistics. The test targets the numeracy of elementary students and was adapted from numeracy questions for elementary school students developed by the Ministry of Education and Culture of the Republic of Indonesia. The numeracy test can be found at https://s.id/numSD (in Bahasa).
Data Analysis Techniques
The results of the study were analyzed using MFRM, an advancement of the Rasch Model Measurement (RMM) designed for multi-assessment evaluations (Kudiya et al.
, with the help of the Facets version 3.
6 application.
MFRM is an analysis model that is a development of the Rasch Model (Eckes, 2.
This analysis was formulated by Linacre .
to rectify induced rating variabilities by employing several raters (Bond & Fox, 2.
MFRM analysis effectively models each rater based on the usefulness of a rating scale without anticipating uniform replies (Linacre, 1.
This approach enables evaluators to deliver various assessments.
Numerous studies have examined rater-related variability and inconsistency across diverse sectors (Parra-López & Oreja-Rodríguez, 2.
The MFRM model can include more than two variables/facets in the analysis, which makes it very suitable for performance assessment that includes several facets, such as examinees, assessors, assessment criteria, and tasks (Eckes, 2.
In this study, there are three facets or variables analyzed using the Facets software: experts, items, and criteria.
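To make the three-facet structure concrete, the sketch below computes rating-category probabilities under a rating scale formulation of MFRM. It is a minimal illustration, not the Facets specification used in the study: the function names, the threshold values, and the sign convention (every facet enters negatively, so that a higher measure means lower expected ratings, matching how the measures are read in this paper) are all assumptions.

```python
import math

def category_probabilities(item_difficulty, rater_severity, criterion_difficulty, thresholds):
    """Probabilities of each rating category under a rating scale MFRM sketch.

    Assumed model: log(P_k / P_{k-1}) = eta - tau_k, where
    eta = -(item_difficulty + rater_severity + criterion_difficulty).
    """
    eta = -(item_difficulty + rater_severity + criterion_difficulty)
    # Log-numerators are cumulative sums of (eta - tau_k); the lowest category is 0.
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + eta - tau)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

def expected_rating(probabilities, lowest_category=1):
    """Model-expected rating, with categories numbered from lowest_category."""
    return sum((lowest_category + k) * p for k, p in enumerate(probabilities))

# Hypothetical thresholds for a 5-point scale (four steps between categories).
THRESHOLDS = [-1.5, -0.5, 0.5, 1.5]

lenient = category_probabilities(0.0, -0.5, 0.0, THRESHOLDS)  # rater severity -0.5 logits
severe = category_probabilities(0.0, 0.5, 0.0, THRESHOLDS)    # rater severity +0.5 logits
print(round(expected_rating(lenient), 2), round(expected_rating(severe), 2))
```

With these hypothetical values, the more severe rater produces the lower expected rating for the same item and criterion, which is the effect the rater facet is meant to capture.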
The indicators used to assess the results of raters using MFRM were as stated by Boone et al.
Outfit mean square (MNSQ) value: 0.5 < MNSQ < 1.
Outfit Z-standard (ZSTD) value: -2 < ZSTD < 2
Point Measure Correlation (Pt Mean Corr) value: 0.4 < Pt Mean Corr < 0.
In this analysis, a total of 16 items, 5 criteria (ability, process, content, context, and sentence structure), and 12 raters were utilized.
This yielded a total of 960 data points (16 items × 5 criteria × 12 raters), with no missing data.
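The count follows from the fully crossed design, in which every rater scores every item on every criterion. A short check of that arithmetic, using the item codes and criterion names given in the text (the variable names are illustrative):

```python
from itertools import product

items = [f"N{i}" for i in range(1, 17)]                 # N1 to N16
criteria = ["ability", "process", "content", "context", "sentence structure"]
raters = [f"Rater {r}" for r in range(1, 13)]           # Rater 1 to Rater 12

# One rating per (item, criterion, rater) combination.
design = list(product(items, criteria, raters))
ratings_per_rater = sum(1 for _, _, rater in design if rater == "Rater 1")

print(len(design), ratings_per_rater)
```

The design has 960 cells in total, and each rater contributes 80 of them (16 items × 5 criteria).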
The analysis included the Wright map, rater, item fit statistics, criteria, unexpected responses, and bias/interaction analyses.
These analyses are essential for validating the reliability of the numeracy test, ensuring that it consistently assesses the intended constructs across various administrations.
RESULTS AND DISCUSSION
Results
This section explains the Wright map, rater, item fit statistics, criteria, unexpected responses, and bias/interaction analyses.
Wright Map Analysis
The Wright map depicts the distribution of item difficulties and participant skills on a unified scale, enabling assessment of the alignment between the numeracy test items and participant ability. This study demonstrated through the Wright map that the 16 numeracy questions had varying levels of difficulty, with certain items markedly easier than the others. Variance in item difficulty is essential for the test's ability to successfully differentiate across varying levels of numeracy skills among participants (Boone & Scantlebury, 2.
The Wright map in Figure 1 illustrates the calibrations of raters, items, task criteria, and the 5-point scale used by raters to evaluate items related to the numeracy test.
Figure 1. Wright map of numeracy test
Based on Figure 1, N4 has the highest measure, whereas N1 and N7 have the lowest. This means that N4 is the most difficult item, and N1 and N7 are the easiest items. In addition, ability had the lowest measure, meaning that the ability criterion received the highest scores. Rater 2 had the highest measure, which means that Rater 2 gave the lowest ratings and was therefore the most severe rater.
The study revealed that the items were evenly dispersed across the ability spectrum, with an adequate quantity aimed at both lower and higher skill levels.
The distribution is crucial for the test's validity, since it guarantees that the evaluation encompasses a broad spectrum of numeracy skills, from fundamental arithmetic to intricate problem-solving tasks (Long et al.
, 2011.
Vaughan et al.
, 2.
The inclusion of suitably hard items for varying ability levels improved the test's ability to yield significant insights into participants' numeracy skills.
Rater Analysis The participation of the 12 raters in the assessment procedure facilitated a comprehensive analysis of the test items.
Each evaluator appraised the items according to established criteria to enhance the comprehension of item performance.
The MFRM study considered rater severity, indicating that certain raters exhibited greater leniency in their assessments than others.
Diversity in rater severity is a significant factor, as it might affect the overall scoring and interpretation of test outcomes (Boone et al.
, 2015.
Purnomo et al.
The rater analysis in Table 2 shows how lenient or severe each rater was in scoring the numeracy test.
Table 2. Rater analysis of numeracy test
Columns: Rater, Severity Measure, Infit MNSQ, Outfit MNSQ, Fair Average, Obs. Average, Number of Ratings (with a Mean row)
Based on Table 2, Rater 3 was the most lenient rater, with a total score of 341. Conversely, Rater 2 was the most severe rater, with a total score of 243. In addition, the average rating across all raters was 3.88 on a 5-point scale, suggesting generally high scoring of the numeracy test.
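Because the design is complete, each rater contributes 16 × 5 = 80 ratings, so the total scores can be converted into mean ratings on the 5-point scale. The sketch below does this for the two totals reported above. The function name is illustrative, and it is an assumption that these means coincide with the observed averages listed in Table 2.

```python
RATINGS_PER_RATER = 16 * 5  # items x criteria, with no missing ratings

def mean_rating(total_score, n_ratings=RATINGS_PER_RATER):
    """Average rating implied by a rater's total score."""
    return total_score / n_ratings

totals = {"Rater 3": 341, "Rater 2": 243}  # totals reported in the text
for rater, total in totals.items():
    print(rater, mean_rating(total))
```

Read this way, the most lenient rater averaged a little above 4 and the most severe rater a little above 3, on either side of the overall mean of 3.88.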
Examination of rater effects revealed that, whereas the majority of raters displayed consistency in their assessments, a minority showed considerable divergence from the mean. This finding highlights the necessity of training and calibrating evaluators to ensure that their assessments conform to the intended measurement framework.
This study improves the reliability of the numeracy exam and reinforces the validity of the findings by mitigating rater variability (Boone et al.
, 2015.
Ichikowitz et al.
, 2.
Item Fit Statistics
The fit statistics for each item were assessed using infit and outfit mean square statistics, which measure the alignment of the observed data with the expectations of the Rasch model.
Items that conform to the model are expected to have infit and outfit values close to 1.0. The fit statistics are presented in Table 3.
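As an illustration of how these statistics are formed, the sketch below computes infit and outfit mean squares from observed ratings, model-expected ratings, and model variances. The function name and the input values are hypothetical; Facets computes these quantities internally from the estimated model.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit_mnsq, outfit_mnsq) for one element, such as a single item.

    observed: ratings actually given
    expected: model-expected ratings for the same responses
    variance: model variance of each rating
    """
    if not observed or not (len(observed) == len(expected) == len(variance)):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals (outlier-sensitive).
    outfit = sum(r / v for r, v in zip(squared_residuals, variance)) / len(observed)
    # Infit: information-weighted mean square (squared residuals over total variance).
    infit = sum(squared_residuals) / sum(variance)
    return infit, outfit

# Hypothetical ratings for one item, with model expectations and variances.
infit, outfit = fit_mean_squares(
    observed=[4, 5, 3, 4],
    expected=[4.2, 4.5, 3.4, 3.8],
    variance=[0.6, 0.5, 0.8, 0.7],
)
print(infit < 1.0, outfit < 1.0)
```

Both values fall below 1.0 here because the hypothetical residuals are smaller than the model variances predict, which corresponds to ratings that are more predictable than expected.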
Table 3. Item fit statistics
Columns: No., Logit Measure, Infit MNSQ, Infit ZSTD, Outfit MNSQ, Outfit ZSTD, Remark
Remark: Acceptable for all 16 items
As shown in Table 3, all items demonstrated acceptable fit to the Rasch model.
Infit MNSQ values ranged from 0.79 to 1.24, and Outfit MNSQ values ranged from 0.80 to 1.
Corresponding standardized fit statistics (Infit ZSTD: -1.5 to 1.
Outfit ZSTD: -1.4 to 1.
further confirmed that no items exhibited statistically significant misfit (|ZSTD| < 2.
These results support the unidimensionality and internal validity of the numeracy scale, indicating that all items function coherently to measure the intended construct without introducing substantial noise or bias.
Criteria Analysis
The criteria were analyzed to offer an understanding of the relative difficulty of the criteria, the accuracy of the difficulty estimates, and the extent to which the criteria collectively contributed to defining a single latent dimension for this test.
Table 4 presents the criteria measurement reports.
Table 4. The criteria measurement report
Columns: Criteria, Logit Measure, Infit MNSQ, Infit ZSTD, Outfit MNSQ, Outfit ZSTD
Rows: Ability, Process, Content, Context, Sentence Structure
Based on Table 4, all the criteria are valid.
In addition, ability has the lowest measure, which means that the ability criterion received the highest scores.
Unexpected Responses
Table 5 shows the raters' unexpected responses.
Table 5. The unexpected responses of raters
Columns: Scale Category, Observed Score, Expected Score, Residual, Std. Residual, Rater, Item, Criteria
Entries shown: Rater: Rater2, Rater2; Item: N16; Criteria: Content, Process
Table 5 reveals that only two responses (about 0.2% of 960 data points) were flagged as unexpected under the MFRM model, an exceptionally low rate that supports the overall coherence and predictability of the measurement system.
Both unexpected responses occurred in the 'Content' and 'Process' scoring criteria and were exclusively attributed to Rater 2, the most severe rater identified in the analysis.
This pattern suggests that while Rater 2's overall severity is accounted for in the model, their application of specific criteria may deviate from expected patterns, possibly due to unique interpretation or inconsistent rubric use.
Although the low incidence of misfit does not threaten overall validity, it highlights the value of MFRM in detecting subtle rater idiosyncrasies.
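A response is typically flagged as unexpected when its standardized residual is large. The sketch below shows the idea with hypothetical records. The cutoff of 3, the labels, and all rating values are assumptions rather than settings or data reported in the study. It also recomputes the share of flagged responses from the counts given above (2 of 960).

```python
import math

def standardized_residual(observed, expected, variance):
    """Residual scaled by the model standard deviation of the rating."""
    return (observed - expected) / math.sqrt(variance)

def flag_unexpected(records, cutoff=3.0):
    """Return the records whose absolute standardized residual reaches the cutoff."""
    flagged = []
    for rec in records:
        z = standardized_residual(rec["observed"], rec["expected"], rec["variance"])
        if abs(z) >= cutoff:
            flagged.append({**rec, "std_residual": z})
    return flagged

# Hypothetical ratings with model expectations and variances (not study data).
records = [
    {"rater": "A", "item": "X1", "criterion": "content",
     "observed": 1, "expected": 4.1, "variance": 0.7},
    {"rater": "B", "item": "X2", "criterion": "process",
     "observed": 4, "expected": 4.2, "variance": 0.6},
]

flagged = flag_unexpected(records)
share = 100 * 2 / 960  # two flagged responses among 960 ratings, in percent
print(len(flagged), f"{share:.2f}%")
```

Only the first hypothetical record is flagged, since a rating of 1 against an expectation above 4 lies several standard deviations from the model's prediction.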
Bias/Interaction Analysis
Bias/interaction analysis is a crucial component for validating the many-facet Rasch measurement model used in this research.
It examined the interactions of raters with particular items beyond the model's predictions.
There were 30 biases (out of 192 rater-item combinations, or 15.6%) between the raters and items.
Item N1 showed the most bias, involving five raters.
Table 6 displays only the item N1 bias for Raters 1, 2, 3, 7, and 11.
Table 6. Rater-item bias/interaction analysis
Columns: Rater, Item, Observed Score, Expected Score, Bias Size, t-Statistic
Based on Table 6, the pairing of Rater 1 with item N1 has a bias size of -2.70 and a significant bias (t-statistic = -2.
This means that Rater 1's observed score on item N1 is 2.70 logits lower than expected.
Thus, Rater 1 consistently scores item N1 as more difficult than the model predicts. This may be because Rater 1 misinterprets or applies stricter criteria to item N1, or because item N1 includes ambiguities that Rater 1 notices but others do not.
This result implies that the content of item N1 and Rater 1's understanding of the rubric should be reviewed.
Therefore, training or clarification is necessary.
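In bias/interaction analysis, the t-statistic is the bias size divided by its standard error, and an absolute value of roughly 2 or more is commonly read as significant. The sketch below illustrates this and recomputes the share of biased rater-item pairs. The standard error is hypothetical because it is not reported in the text, and the cutoff of 2 is an assumption.

```python
def bias_t_statistic(bias_size, standard_error):
    """t-statistic for a bias term: bias size (logits) over its standard error."""
    if standard_error <= 0:
        raise ValueError("standard_error must be positive")
    return bias_size / standard_error

def is_significant(t_value, cutoff=2.0):
    """Flag a bias term whose |t| reaches the (assumed) cutoff."""
    return abs(t_value) >= cutoff

N_RATERS, N_ITEMS = 12, 16
pairs = N_RATERS * N_ITEMS        # 192 rater-item combinations
share_biased = 100 * 30 / pairs   # 30 biased pairs, in percent

t = bias_t_statistic(-2.70, 1.0)  # 1.0 is a hypothetical standard error
print(pairs, f"{share_biased:.1f}%", is_significant(t))
```

The 30 biased pairs out of 192 correspond to the 15.6% reported above. Whether a given bias term counts as significant depends on its standard error, which is why bias size alone is not enough to judge a rater-item pair.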
Discussion The research findings demonstrated that all items within the numeracy test exhibited an adequate fit, indicating their proper functionality within the test framework.
The adequate fit of all items is a positive indicator of the test's internal consistency and validity.
The findings presented across Figure 1 and Tables 2 to 5 collectively affirm the psychometric integrity of the numeracy assessment while offering nuanced insights into its functioning through the lens of Many-Facet Rasch Measurement (MFRM).
The item hierarchy revealed in Figure 1 demonstrates that N4, with the highest logit measure, functions as the most difficult item in the assessment (likely requiring higher-order reasoning or complex problem-solving skills), whereas N1 and N7, with the lowest measures, serve as accessible entry points assessing foundational numeracy competencies such as basic computation or straightforward interpretation.
This deliberate spread of item difficulties across the latent trait continuum is not merely a technical feature but a foundational strength of the instrument: it ensures that the assessment captures the full spectrum of numeracy, from rudimentary arithmetic to sophisticated contextual problem-solving, thereby aligning with contemporary frameworks that emphasize functional numeracy in real-world settings (Long et al.
, 2011.
Vaughan et al.
, 2.
As Bond and Fox note, "a test's validity is enhanced when its items are distributed to match the range of abilities in the target population, maximizing measurement precision across the continuum."
Such balanced targeting enhances the test's diagnostic utility, allowing educators and researchers to pinpoint whether a learner's challenges lie in foundational skills or advanced applications, a critical feature for formative assessment and personalized instruction.
However, the observation that the mean person ability measure is lower than the mean item difficulty warrants careful interpretation.
Contrary to the initial suggestion that this implies "the ability criteria had the highest score," Rasch measurement principles clarify that a lower ability measure reflects lower, not higher, proficiency on the latent trait.
This indicates a potential mismatch between the test's difficulty and the cohort's skill level, suggesting that, on average, participants found the items more challenging than their current ability would predict.
As Boone et al. note, "when person measures fall consistently below item calibrations, measurement precision is compromised at the lower end of the scale, potentially leading to floor effects and reduced sensitivity to growth among struggling learners."
While this does not invalidate the instrument, it does raise considerations for future administrations: to optimize measurement precision and reduce the risk of participant disengagement, the inclusion of additional items calibrated to lower ability levels may be warranted, particularly if the assessment is intended for formative or diagnostic use across diverse populations.
Rater effects further illuminate the human dimension of performance assessment.
Rater 2, with the highest severity measure, consistently assigned the lowest ratings, confirming their role as the most stringent evaluator, a finding corroborated by Table 2, which shows Rater 2's total assigned score (243) as the lowest among raters, while Rater 3, with a total of 341, emerges as the most lenient.
The average observed rating of 3.88 on a 5-point scale suggests an overall tendency toward higher scoring, but this central tendency should not be conflated with rater agreement or consistency.
As Engelhard and Wind note, "rater severity is a systematic source of variance that, if unmodeled, can distort comparisons between examinees and threaten the fairness of scores."
The MFRM framework allows these severity differences to be statistically modeled and adjusted, ensuring that person ability estimates remain comparable regardless of which rater evaluated their work, a critical safeguard for fairness and validity.
Nevertheless, the identification of divergent raters underscores the necessity of ongoing calibration, training, and monitoring to minimize construct-irrelevant variance introduced through subjective judgment, echoing Myford and Wolfe's recommendation that "rater effects should not be ignored or assumed away, but actively measured and managed as part of quality assurance in performance assessment."
The robustness of the instrument is further supported by the fit statistics reported in Table 3.
All items demonstrated acceptable Infit and Outfit MNSQ values .
79Ae1.
, well within the recommended 0.7 to 1.3 range for productive measurement (Wright & Linacre, 1.
, and corresponding ZSTD values .
ll within A1.
confirmed the absence of statistically significant misfit.
These results collectively affirm the unidimensionality and internal validity of the scale, indicating that each item contributes coherently to the measurement of a single underlying construct, numeracy proficiency, without introducing noise or bias.
As Andrich notes, "item fit is not about perfection, but about sufficient conformity to the model to support valid ordering of persons and meaningful interpretation of scores."
This psychometric stability provides confidence that the ordering of persons along the ability continuum is meaningful and that item difficulties are reliably estimated, forming a solid foundation for diagnosis at both the individual and group levels.
Table 4's assertion that "all criteria are valid" requires contextual refinement.
While the fit statistics support the technical adequacy of the items, validity in the Rasch paradigm, and in assessment more broadly, is an interpretive argument built on multiple strands of evidence, including content representation, internal structure, and consequences of use (Kane, 2.
The Rasch-derived measures themselves provide strong evidence for construct validity, particularly given the logical progression of item difficulties and the coherence of the measurement model.
However, claims of validity should be framed as supported by, not synonymous with, model fit.
The persistent misstatement that "ability has the lowest measure, meaning the ability criteria had the highest score" again reflects a conceptual slippage between raw scores and interval-level logits; this misinterpretation should be corrected to preserve the precision and credibility of the analysis.
As Linacre reminds us, "raw scores are ordinal; Rasch measures are interval. Higher raw scores always correspond to higher ability measures, never the reverse."
Finally, Table 5 offers a compelling demonstration of MFRM as a diagnostic tool. With only two unexpected responses out of 960 data points (about 0.2%), the model exhibits exceptional predictive power, indicating that nearly all observed ratings align with expectations based on person ability, item difficulty, and rater severity.
The fact that both anomalies occurred in the 'Content' and 'Process' criteria, and were exclusively attributed to Rater 2, suggests not random error but a patterned deviation, likely rooted in that rater's unique interpretation or inconsistent application of these specific rubric dimensions.
While the negligible frequency of misfit poses no threat to overall validity, it highlights MFRM's capacity to detect subtle, localized inconsistencies that might otherwise go unnoticed.
As Wind and Engelhard note, "unexpected responses serve as early warning signals, not of system failure, but of opportunities for refinement in rater training, rubric clarity, or task design."
This finding reinforces the value of embedding psychometric monitoring into routine assessment practice, transforming scoring from a static judgment into a dynamic, improvable process.
Together, these results portray an assessment instrument that is not only psychometrically sound but also thoughtfully designed to reflect the complexity of numeracy as a real-world competency.
The integration of Rasch measurement principles has enabled the disentanglement of multiple sources of variance (item, person, and rater), producing objective, interval-level measures that are fair, valid, and instructionally meaningful. Future work might explore differential item functioning across demographic subgroups, longitudinal shifts in rater behavior, or the predictive validity of these measures on external numeracy outcomes, all of which would further strengthen the evidentiary basis for the instrument's use in research and practice (Mislevy et al.
, 2003.
OECD, 2.
CONCLUSION
Sixteen numeracy questions, which the experts judged to be valid, were analyzed using MFRM.
The findings of this study offer substantial evidence for the validity and reliability of the mathematics numeracy test evaluated using many-facet Rasch measurement. Wright map analysis, along with assessments of rater effects and item fit statistics, underscores the test's ability to accurately gauge various numeracy skills.
The results highlight the necessity of utilizing sophisticated psychometric techniques in the creation and validation of educational assessments, thereby enhancing measurement procedures in mathematics education.
Gender bias was not analyzed in this study.
Therefore, future research is expected to analyze gender bias to determine whether it exists.
A gender-based analysis could provide valuable insights into the studied phenomenon, potentially revealing significant differences in experience, outcomes, or perceptions.
This could have substantial implications for the interpretation and application of research findings.
Acknowledgments
The authors would like to thank the Directorate of Research, Technology, and Community Service of the Directorate General of Higher Education, Research, and Technology of the Ministry of Education, Culture, Research, and Technology of the Republic of Indonesia for the support and trust provided via national competitive research funds for fundamental research schemes for the fiscal year 2024.
Furthermore, we extend our gratitude to all the experts who evaluated the numeracy test.
Declarations
Author Contribution: SA: Conceptualization, Methodology, Visualization, Writing - original draft, and Writing - review & editing. SS: Investigation, Project administration, and Validation. F: Data curation, and Investigation. MNH: Validation, and Writing - review & editing.
Funding Statement: This research was funded by the Directorate of Research, Technology, and Community Service of the Directorate General of Higher Education, Research, and Technology of the Ministry of Education, Culture, Research, and Technology of the Republic of Indonesia, through national competitive research funds for fundamental research schemes 2927/LL8/AL.04/2024, 040/UH.P3MP/Ktr./2024.
Conflict of Interest: The authors declare no conflict of interest.
Additional Information: Additional information is available for this paper.
REFERENCES