Psychology, Evaluation, and Technology in Educational Research, 2024, 107-118
Available Online: http://petier.org/index.php/PETIER

Analysis of numeracy skills in Islamic Boarding Schools: Gender bias

Rosid Bahar 1 a *, Ahmad Firdaus 2 b
Sekolah Tinggi Agama Islam Al-Andina. Jl. Raya Selakopi, Sukabumi 43155, Indonesia
Sekolah Tinggi Agama Islam Al-Masthuriyah. Jl. Raya Sukaraja-Sukabumi, Sukabumi, 43155, Indonesia
rosidbahar@gmail. b firdaus.ahmad1st@gmail.
* Corresponding Author.
Received: 19 December 2023. Revised: 4 January 2024. Accepted: 12 January 2024

Abstract: Numeracy skills are an important point in the structure of mathematics, and boarding school students are no exception. This study aims to identify gender bias in numeracy assessment in Islamic boarding schools. It is descriptive exploratory research using quantitative methods. The instrument was a numeracy test of 25 questions consisting of matching, multiple-choice, complex multiple-choice, and description questions. The research subjects were 383 students in West Java from 4 pesantren in 4 cities, namely West Bandung Regency, Cirebon Regency, Tasikmalaya Regency, and Tasikmalaya City. Quantitative analysis used Item Response Theory (IRT) followed by Differential Item Functioning (DIF) analysis with the Mantel-Haenszel method. The results showed that the instrument was suitable for use because it met the standards of validity and reliability. The model that passed the fit test was the GPCM, and the DIF analysis showed that one numeracy question, number 21, indicated gender bias. These results indicate that the numeracy test instrument is suitable for use by boarding school students with minimal gender bias.

Keywords: Islamic Boarding School, Numeracy Skill, Differential Item Functioning

How to Cite: Bahar, & Firdaus. Analysis of numeracy skills in Islamic Boarding Schools: Gender bias. Psychology, Evaluation, and Technology in Educational Research, 6, 107-118. https://doi.org/10.33292/petier.

This is an open access article under the CC-BY-SA license.

INTRODUCTION

Islamic Boarding Schools are the oldest educational institutions and are among the institutions that played a role in fighting for the independence of the Republic of Indonesia (Masqon, 2014; Muafiah et al., 2. History also records that Islamic boarding schools were established as a place of study for Muslim students (Santri) to focus on Islamic religious learning so that they would be willing to become clerics of the religion in their area (Isbah, 2020; Wekke & Hamid, 2.

The current development of Islamic boarding schools has motivated parents from various circles to send their children to these institutions. This motivation may come from parents who are alumni of an Islamic boarding school or from parents who choose this kind of school because it also provides general school education (Supriatna, 2. These motivations have made the backgrounds of incoming students in Islamic boarding schools increasingly diverse. One of the most prominent problems arising from this diversity is that many students still pay little attention to general subjects, such as mathematics (Ramdhani et al., 2021; Yusnita, 2.

The current development of Islamic boarding schools has also encouraged the birth of Law No. 18 of 2019 concerning Islamic boarding schools, under which Islamic boarding schools are not only about teaching Islamic religion but have a greater role in upholding the true teachings of Islam, which are reflected in the character of tolerance, balance, and community empowerment (Ministry of Religion, 2. Islamic boarding schools are formal institutions that are on par with other schools, from elementary to higher education.
According to Minister of Religion Regulation (PMA) Number 31 of 2020, the typology of Islamic boarding schools has also changed, including Formal Diniyah Education (PDF), Muadalaah (Muallimin and Salafiyah), and, at the highest level, Ma'had Aly. The levels were changed to Ula (Primary), Wustha, Ulya (Upper), and Ma'had Aly (Higher Education). In terms of technical learning, the provisions in the law and the PMA also require a minimum of 5 general subjects: mathematics, Pancasila and citizenship education (PPKN), Indonesian language, natural sciences, and social sciences (Ministry of Religion, 2. This regulation makes it necessary to begin examining how the Law is being implemented, including in terms of proficiency in general subjects such as mathematics.

Moreover, it is a fact that classrooms in Islamic boarding schools follow rules of separation between male and female students, which leads to gender bias. This kind of culture is commonplace in Islamic boarding school environments. It is strongly attached to them today and will probably continue to be preserved. This is understandable because Islamic boarding schools still maintain and implement fiqh practices following Islamic teachings (Sahri & Hidayah, 2. The question is whether gender bias can occur in teaching and learning activities, especially in assessments.

The initial technique for implementing the Law and the Minister of Religion Regulation (PMA) can start by constructing instruments and analyzing them to obtain an instrument construct that minimizes gender bias. In this study, the mathematics instruments for Islamic boarding schools are narrowed to numeracy instruments. Numeracy skills are an important point in the structure of mathematics. They help develop logical thinking skills, problem-solving skills, and the skills and knowledge important to understanding the world around us (Whiteford, 2.
Numeracy skills are part of literacy skills, including the ability to identify, create, communicate, numerate, and use printed, written, and visual materials (Montoya, 2. Numeracy skills are included in the PISA indicators. They have become a hot topic, as Indonesia is ranked 72nd out of 78 countries (OECD, 2. Research in Indonesia shows that 73% of secondary school students still lack an understanding of mathematical literacy (Ate & Lede, 2. Moreover, there is no more specific research at the Islamic boarding school or Tsanawiyah level. This encourages research related to numeracy at the Islamic boarding school level. This point could be the beginning of discovering numeracy skills in Islamic boarding schools as a basis for implementing the Islamic Boarding School Law.

As the number of Islamic boarding schools and Santri increases yearly, and as they are a form of formal educational institution, Islamic boarding schools should always be active and dynamic in self-improvement through evaluation (Khaerudin & Munadi, 2020; Yasid, 2. This keeps the implementation of learning programs in Islamic boarding schools in synergy with the outcomes that Indonesian education must achieve to meet the demands it faces. Evaluation is used to determine the feasibility of a program (Munthe, 2. It aims to obtain recommendations on whether the learning program is good. A component of evaluation that still receives little attention from Islamic boarding schools is assessment, which can measure students' success (Yasid, 2. In other words, many Islamic boarding schools do not measure the success of their learning programs. In practice, assessments attract a lot of attention, for example because of the poor quality of the assessment instruments in terms of content and constructs, which often leads to gender bias. This certainly weakens the assessment, which should provide much information regarding the implementation of Teaching and Learning Activities (KBM) in Islamic boarding schools.

Copyright © 2024, Psychology, Evaluation, and Technology in Educational Research. ISSN 2622-5506

Apart from assessment, gender bias in education is often found in textbooks, language, and teacher-student interactions (Nadal, 2. The impact of gender bias is quite detrimental to KBM, so this problem needs to be anticipated. The impact varies greatly and can lead to negative effects on students. The possibility of gender bias must therefore be avoided in all elements of learning, including learning in Islamic boarding schools, which have a large number of students.

The assumption of gender bias must also be tested in the mathematics learning process in Islamic boarding schools. Previous research shows that Islamic studies, specifically the method of memorizing the Al-Qur'an, influence numeracy skills at the junior high school level. However, that research did not measure them by gender. Other research shows that single-sex learning has the highest value compared to mixed-sex learning processes (Franklin & Rangel, 2. More specifically in mathematics, results from the Trends in International Mathematics and Science Study (TIMSS) (OECD, 2. show large international differences in the gender gap in mathematics achievement, but not specifically in numeracy. Interestingly, countries with a high proportion of single-sex schools, such as Saudi Arabia, show unexpectedly high gains for girls (Basharat, 2. On the other hand, other research shows no difference between single-sex learning and mixed classes (Clavel & Flannery, 2.
The research can be used as a reference for conducting further research regarding gender bias in mathematics in Islamic boarding schools. In the end, the research is expected to provide knowledge about the factors of gender bias and solutions to minimize it.

METHODS

The research was exploratory, descriptive research with a quantitative approach. The subjects were 386 students in class 3 at the Ulya level of the Muadallah Muallimin Islamic Boarding School, an Islamic boarding school level equivalent to junior high school. The research was conducted in 4 Islamic boarding schools spread across three cities: Tasikmalaya City and Regency, West Bandung Regency, and Cirebon Regency, West Java, Indonesia. The subjects are the students of 21 classes with different class characteristics. Subject selection is based on the cluster sampling technique, which determines the sample when the object or data source is very broad. The sampling is based on the population area and proceeds in two stages. The first stage determines the sample area, and the second determines the individuals in that area (Berndt, 2020; Etikan, 2017; Rahman et al., 2.

The instrument was a numeracy skills test that relied on three main indicators: content, context, and process. The content relates to numbers, algebra, geometry and measurement, and data. Meanwhile, the context is personal, socio-cultural, and scientific. These three indicators were then developed into a guideline containing numeracy content, competency achievement indicators, cognitive level, numeracy context, question indicators, question form, and question number.
Data were obtained from students' responses to a numeracy skills test in mathematics. The test instrument consists of 25 questions: three matching questions, eleven multiple choice questions, five complex multiple choice questions, and six essays. Each form of question is scored differently: matching and multiple choice questions get a score of 1, complex multiple choice questions have a graded score of 1, and essays have a graded score of 1 to 5.

Data analysis used the item response theory (IRT) and Differential Item Functioning (DIF) models (Gomez-Benito et al., 2018; Hambleton et al., 1991; Lee & Joo, 2021; Lee & Kim, 2017; Raju. Retnawati, 2. DIF analysis was carried out to detect gender bias in the questions, using the item bias detection method to identify items that function differently for different groups. In other words, an item is flagged as biased when it does not provide equal opportunities for different groups. SPSS 26.0 was used for validity. Meanwhile, the RStudio program was used for the IRT analysis, covering model suitability testing, assumption testing, DIF, and reliability. The Standard Error of Measurement (SEM) is taken from the test information function in RStudio. Microsoft Excel was used to convert the total score from theta (θ) across different strengths and levels of difficulty.

RESULTS AND DISCUSSION

The analysis begins with testing the validity of the instrument using Exploratory Factor Analysis (EFA) with the help of SPSS software. It starts with the sample adequacy test using the KMO test (Kaiser-Meyer-Olkin Measure of Sampling Adequacy), the Bartlett's Sphericity value, and the Measure of Sampling Adequacy (MSA) value.
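As a rough illustration of the sampling adequacy check just described, the sketch below computes the overall KMO value and the per-item MSA values in plain Python. The study itself ran this step in SPSS; the function names (`correlation_matrix`, `invert`, `kmo`) and the simulated data are assumptions made only for this example, and the standard KMO formula (squared correlations against squared partial correlations) is used.

```python
import math
import random


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length item columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def kmo(columns):
    """Return (overall KMO, list of per-item MSA values)."""
    corr = correlation_matrix(columns)
    inv = invert(corr)
    k = len(corr)
    per_item, total_r2, total_p2 = [], 0.0, 0.0
    for i in range(k):
        r2 = p2 = 0.0
        for j in range(k):
            if i == j:
                continue
            # Partial correlation of items i and j given all other items.
            partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
            r2 += corr[i][j] ** 2
            p2 += partial ** 2
        per_item.append(r2 / (r2 + p2))
        total_r2 += r2
        total_p2 += p2
    return total_r2 / (total_r2 + total_p2), per_item


# Simulated data (assumption): four items driven by one common factor.
rng = random.Random(0)
common = [rng.gauss(0, 1) for _ in range(200)]
items = [[f + rng.gauss(0, 0.5) for f in common] for _ in range(4)]
overall, msa = kmo(items)
print("overall KMO:", round(overall, 3))
print("per-item MSA:", [round(m, 3) for m in msa])
```

Values closer to 1 mean the partial correlations are small relative to the raw correlations, which is the situation in which factor analysis is considered appropriate.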
The analysis shows that the KMO value was 0. The Bartlett's Sphericity value was 0. The MSA value for each item was more than 0. These results show that the sample met validity standards (Hair et al., 2010; Sarstedt & Mooi, 2.

Furthermore, reliability was estimated using IRT. Reliability standards in IRT are judged from the information function value and the Standard Error of Measurement (SEM). The total reliability of this instrument, based on the marginal reliability output of the RStudio program, is 0. The SEM is read from the Total Information Function (TIF) value. The information function value at ability (θ) 0.0 is 46.891, with a measurement error (SEM) of 0. These results indicate that the test produces optimal information when used on students with ability 0. This is reinforced by the condition on the TIF value: if the TIF value is ≥ 10, the test instrument is reliable for measuring students' numeracy skills. This analysis measures the strength of each item, which can explain the respondents' abilities as measured by the test (Myszkowski, 2.

The valid and reliable instrument then proceeds to the next stage, namely gender bias analysis. The gender bias analysis follows several stages based on Item Response Theory (IRT). This is an important step because the theory requires these stages, and they ensure that the gender bias analysis rests on valid and reliable results. The stages are: (1) the model fit test, (2) the IRT assumption test, and (3) the DIF analysis.

The model fit test was carried out to explain the characteristics of the items in the instrument. It is based on the Fit Indexes output of the RStudio program using the irtGUI package (Yildiz, 2. These results are presented in Table 1.

Table 1. Fit Indexes
Model: Graded Response Model, GPCM
Indexes: AIC, BIC

Table 1 shows the irtGUI output, which provides two candidate models: the Graded Response Model and the GPCM. The GPCM was chosen because it has the smallest AIC score. As stated in the irtGUI information, the smaller the fit index of the model, the better the model (Desjardins & Bulut, 2. GPCM is an analysis in which the questions are scored in tiered categories, but the difficulty index of each step is different and unordered. This means that in answering a question, the first step can be more difficult than the next step or vice versa, and GPCM only measures the level of difficulty and the differential power (Retnawati, 2.

Next, the assumption test in IRT consists of tests of unidimensionality, local independence, and parameter invariance. The unidimensionality test was carried out together with construct validity using factor analysis (EFA), especially through the Total Variance Explained table and the scree plot. The first factor, with an eigenvalue of 6.467 and variance of 25.86%, is the most dominant factor. Its eigenvalue is almost three times that of the second factor. Based on these results, the assumption is fulfilled, following Mars's statement that the unidimensionality assumption is fulfilled if the eigenvalue of the first factor is at least twice that of the second factor. In addition, the scree plot shows that, of the seven factors, the first factor has the most dominant steepness. In sum, this instrument measures only one dimension, namely numeracy.

The next assumption test is the local independence test. This assumption means that the response of the test respondent (Santri) to one test question does not affect that student's performance on other test questions. The assumption is fulfilled if the student's response to a test question does not influence the student's response to other test questions (Bahar et al., 2021; Retnawati, 2. The criterion for local independence is that the correlation value for each item stays below the cutoff. In this study, the local independence test was carried out using Yen's Q3 (Chen & Thissen. The results showed the highest correlation value of 0. while the correlation values for the remaining items were lower than 0. Therefore, these findings indicate that the IRT assumption test criteria have been met.

The last assumption is parameter invariance. This assumption means that the characteristics of test questions do not depend on the distribution of the participants' ability parameters, and the parameters that characterize the respondents do not depend on the characteristics of the test questions. The implication is that test participants' skills will not change just because they respond to test questions with different levels of difficulty (Antara, 2. This assumption is verified by estimating the item parameters and the participants' ability parameters. Item parameters are compared between even and odd items, while participants' abilities are compared between odd and even respondents. The RStudio program produces item parameter estimates for differentiating power (a), level of difficulty (b), and pseudo-guessing (c), together with ability parameter estimates. The results show that the item parameters and ability parameters based on parameters a, b, and c do not vary between the odd and even groups. In other words, the parameter invariance assumption is met.

The next analysis is the model suitability test.
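The local independence check described above can be sketched in a few lines. Yen's Q3 is the correlation, across respondents, between the residuals (observed score minus model-expected score) of each pair of items. The sketch below is only an illustration under stated assumptions: it uses a dichotomous Rasch-type expected score as a stand-in for the GPCM fitted in the study, the abilities, difficulties, and responses are made up, and the `cutoff` argument is a placeholder because the threshold used in the study is not legible in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def expected_score(theta, difficulty):
    """Stand-in model (assumption): Rasch probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def q3_matrix(responses, thetas, difficulties):
    """Yen's Q3: correlations between item residuals (observed - expected)."""
    n_items = len(difficulties)
    residuals = [
        [person[i] - expected_score(theta, difficulties[i])
         for person, theta in zip(responses, thetas)]
        for i in range(n_items)
    ]
    return [[pearson(residuals[i], residuals[j]) for j in range(n_items)]
            for i in range(n_items)]


def flag_dependent_pairs(q3, cutoff):
    """Item pairs whose absolute Q3 exceeds the chosen cutoff."""
    k = len(q3)
    return [(i, j) for i in range(k) for j in range(i + 1, k)
            if abs(q3[i][j]) > cutoff]


# Made-up abilities, difficulties, and responses (rows = persons, columns = items).
thetas = [-1.5, -0.5, 0.0, 0.5, 1.0, 1.5]
difficulties = [-0.5, 0.0, 0.5]
responses = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 1],
]
q3 = q3_matrix(responses, thetas, difficulties)
flagged = flag_dependent_pairs(q3, cutoff=0.2)  # 0.2 is a placeholder value
print(q3)
print(flagged)
```

In a real analysis the abilities and expected scores would come from the fitted IRT model rather than being supplied by hand.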
The model suitability test aimed to explain the characteristics of the items in the instrument in more detail. It is based on the Fit Indexes output of the RStudio program using the irtGUI package (Yildiz, 2. The results are presented in Table 2.

Table 2. Fit Indexes
Model: Graded Response Model, GPCM
Indexes: AIC, BIC

Table 2 shows the irtGUI output, which provides two candidate models: the Graded Response Model and the GPCM. The GPCM was chosen because its Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) have the smallest values. As stated in the irtGUI information, the smaller the fit index of the model, the better the model (Desjardins & Bulut, 2. GPCM is an analysis in which the questions are scored in tiered categories, but the difficulty index of each step is different and unordered. This means that in answering a question, the first step can be more difficult than the next step or vice versa, and GPCM only measures the level of difficulty and the differential power (Retnawati.

Next, item fit is examined to verify that the IRT GPCM model is suitable for use, which is the case when the items are declared fit. An item fits when its p.S_X2 value is more than 0.05. If the value is less than 0.05, the item does not fit the model. The results of the analysis are presented in Table 3.

Table 3. Item Fit
Question: V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
S_X2: 54.18 29.43 15.63 71.83 63.18 78.82 30.65 39.39 78.99 32.65 36.96 19.51 36.44 77.03 55.88 43.44 95.36 100.75 63.05 131.08 54.51 159.51 185.87
S_X2: 1.18 0.93 0.89 0.95 1.01 0.49 0.86 1.03 0.34 0.91 0.92 0.77 0.45 0.83 0.74 0.87 1.05 0.42 0.37 0.78 0.97 0.27 0.98 0.92
Description: Fit Fit Fit Fit Fit Fit Unfit Fit Fit Unfit Fit Fit Fit Unfit Fit Fit Fit Fit Unfit Unfit Fit Fit Unfit Fit Fit

The results in Table 3 show that 19 items fit and six do not. GPCM analysis is therefore suitable to use because fit items outnumber unfit items.

Next, the characteristics of each question item in the instrument are analyzed using GPCM. These characteristics are the difficulty level (b) and the differential power (a). The criteria are a differential power value on the logit scale of 0.00 to 00 and a level of difficulty on the logit scale between -4.00 and 4.00 (DeMars, 2. The results of the analysis are presented in Table 4.

Table 4. Characteristics Analysis of Numeracy Question Items
Items: Item_1 Item_2 Item_3 Item_4 Item_5 Item_6 Item_7 Item_8 Item_9 Item_10 Item_11 Item_12 Item_13 Item_14 Item_15 Item_16 Item_17 Item_18 Item_19 Item_20 Item_21 Item_22 Item_23 Item_24 Item_25
Differential Power: -0.182 -0.083 0.546 -0.34 -0.223 0.328 -0.183 0.944 -0.546 0.656 0.123 -0.056 1.348 0.193 -0.555 1.622 3.943 1.346 -0.358 1.957 0.186 1.998 -0.049 -0.04
Category: Poor Poor Good Poor Poor Good Poor Good Poor Good Good Poor Good Good Poor Good Poor Poor Good Poor Good Good Good Poor Poor
Difficulty Level: -0.271 -4.558 3.591 -0.375 -11.131 1.1674 -4.87 3.264 -1.8632 1.654 -20.473 0.297 5.852 -1.5815 0.7078 -0.11 -0.05 0.294 -1.5148 0.225333 3.374 1.1648 -4.6556 -9.0266
Category: Good, Very easy, Very Difficult, Good, Very easy, Good, Very easy, Very Difficult, Good, Good, Good, Very easy, Good, Very Difficult, Good, Good, Good, Good, Good, Good, Good, Very Difficult, Good, Very easy, Very easy

Table 4 shows that 64% of the items in this instrument are in the good category for level of difficulty, and only 48% have good differential power. Overall, this instrument meets the optimal level of difficulty and can differentiate between high- and low-achieving students. However, these results may also be a first indication of the existence or absence of differences between men and women.

The next analysis examines the presence or absence of gender bias in the instrument using DIF with the Mantel-Haenszel method. The main reason for analyzing gender is that it is a factor commonly at play in schools and, in statistical language, a source of construct-irrelevant variance (Amelia et al., 2. In this research, gender refers to identity, behavior, social order, and other aspects of life that can influence one's perspective, either towards oneself or others, especially in Islamic boarding schools (Heidari et al., 2.
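To make the Mantel-Haenszel idea concrete, the sketch below computes the classic Mantel-Haenszel chi-square (with continuity correction) and the common odds ratio for one dichotomous item, stratifying respondents by their matching score. It is a simplified illustration only: the study reports the generalized Mantel-Haenszel statistic computed in R, which also handles polytomous items, and the function name `mantel_haenszel_dif`, the group labels, and the counts below are assumptions made for this example.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(item, total, group, reference="M"):
    """Mantel-Haenszel DIF test for one dichotomous (0/1) item.

    item:  0/1 responses to the studied item
    total: matching score per respondent (e.g. total test score)
    group: group label per respondent; `reference` marks the reference group
    Returns (chi-square, p-value, common odds ratio).
    """
    # Per stratum: [ref correct, ref incorrect, focal correct, focal incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for x, score, g in zip(item, total, group):
        cell = (0 if x == 1 else 1) + (0 if g == reference else 2)
        strata[score][cell] += 1

    sum_a = sum_expected = sum_var = num = den = 0.0
    for a, b, c, d in strata.values():
        t = a + b + c + d
        if t < 2:
            continue  # a single respondent carries no information
        n_ref, n_foc, m_right, m_wrong = a + b, c + d, a + c, b + d
        sum_a += a
        sum_expected += n_ref * m_right / t
        sum_var += n_ref * n_foc * m_right * m_wrong / (t * t * (t - 1))
        num += a * d / t
        den += b * c / t

    if sum_var == 0:
        raise ValueError("no stratum has variation in both group and response")
    chi2 = max(abs(sum_a - sum_expected) - 0.5, 0.0) ** 2 / sum_var
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival, 1 df
    odds_ratio = num / den if den > 0 else float("inf")
    return chi2, p_value, odds_ratio


# Hypothetical counts: (matching score, group, number correct, number incorrect)
toy_counts = [
    (1, "M", 3, 7), (1, "F", 2, 8),
    (2, "M", 6, 4), (2, "F", 3, 7),
    (3, "M", 9, 1), (3, "F", 6, 4),
]
item, total, group = [], [], []
for score, label, n_right, n_wrong in toy_counts:
    item += [1] * n_right + [0] * n_wrong
    total += [score] * (n_right + n_wrong)
    group += [label] * (n_right + n_wrong)

chi2, p_value, odds = mantel_haenszel_dif(item, total, group, reference="M")
print(chi2, p_value, odds)
```

An item is flagged when respondents of equal ability (same stratum) but different groups have clearly different chances of answering it correctly, which shows up as a small p-value and an odds ratio far from 1.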
Item response theory with the DIF method can investigate such gender effects. At the least, this investigation of an instrument is an initial basis for research on the presence or absence of gender bias in an environment (Wetzel et al., 2. The results of the DIF analysis are presented in Table 5. The criterion for determining DIF is that if the p-value is < 0.05, there is evidence of a significant difference between groups in responding to the item. The results in Table 5 contain 1 question that is indicated as biased. This question discusses the connection between the Battle of Badr and the beginning of fasting. This analysis strengthens the findings because the DIF analysis explains the characteristics of items in terms of bias, in this case, gender. For clarity, Figure 1 presents an overview of the DIF analysis results in diagram form.

Table 5. Analysis Result of DIF (Generalized Mantel-Haenszel chi-square statistic)
Columns: Number of Question (Item 1 to Item 25), Stat., P-value

Figure 1. DIF Output Results

Figure 1 also shows that question number 21 indicates bias. This question is therefore deleted and not used in future tests, as biased questions have a negative effect on a certain group and vice versa. Question categories consist of appropriate, less appropriate, and inappropriate. In the focus group discussion (FGD) with Islamic boarding school teachers, question number 21 was deemed inappropriate, so it needs to be deleted or discarded. This follows the opinion of Karkal & Kundapur and Odukoya et al., who state that inappropriate questions should be deleted and not distributed to students.

More specifically, a simple analysis of question number 21 relates it to the understanding of Fiqh among mathematics teachers. The stimulus for this question is the Battle of Badr. One of the readings describes the day of the Battle of Badr, which occurred on Friday, the 17th of Ramadhan. The text of question no. 21 is "Based on the date of the Battle of Badr, what day is the 1st of Ramadhan 2 Hijriah?". Male students find it more difficult to answer. They think more about whether the beginning of the Hijriah month is determined by reckoning or rukyat. In fact, the text clearly states that the 17th of Ramadhan is the day of the Battle of Badr, which means that the 1st of Ramadhan has passed. In contrast, in the next questions, which relate to the 1st of Eid al-Fitr, students understand better because the word "istikmal" is mentioned.

The text and the quantitative analysis show a gender bias for women. This may be related to how female students are taught. The Fiqh, or history of Islamic culture, that is taught to women is still general. That is, it is textual and directed more towards the implementation of Fiqh. Among students in class 3 of Wustha or class IX Madrasah, female students are usually less concerned than male students about phenomena such as the time of Eid. Male students are more social and more critical of such phenomena and tend to blame ignorance. Here, the role of male teachers is also to explain the differences in the time of Eid, to enlighten, and to always respect differences.
These results indicate that mathematics teachers need item bias detection analysis to confirm that the question instrument has an appropriate construct, so that it functions properly when testing two different groups, both men and women (Amelia et al., 2. In the end, mathematics assessment instruments are always related to gender bias (Nathan & Umoinyang, 2022; Samritin, 2. Moreover, other studies also show that gender influences mathematics ability: men are more interested and capable in physics and mathematics, while women are more interested in biology (Steegh et al.

CONCLUSION

Based on the analysis, the instrument for testing students' numeracy abilities meets the validity and reliability values required for assessment use. The test instrument also fits the GPCM model and meets the IRT assumption tests. Differential Item Functioning (DIF) with the Mantel-Haenszel method was used at the gender bias analysis stage, and 1 question was identified as gender biased. Based on these results, this instrument is suitable for meeting assessment standards that minimize gender bias in Islamic boarding schools.

ACKNOWLEDGMENT

The researcher would like to express his gratitude to the Directorate of Islamic Religious Higher Education (Diktis) of the Indonesian Ministry of Religion for fully funding this research through the Litapdimas 2023 program. The researcher would also like to thank all the leaders of Islamic boarding schools for their service and the students, who were always ready and cooperative in taking the numeracy test.

REFERENCES