Prihantoro / Jurnal Arbitrer - Vol. 12 No.
Online version available in: http://arbitrer.
JURNAL ARBITRER | 2339-1162 (Print) | 2550-1011 (Online)

Article

DICO-JALF v.1.0: Diponegoro Corpus of Japanese Learners as a Foreign Language in Indonesia with AI Error Annotation and Human Supervision

Prihantoro1*, Shin'ichiro Ishikawa2, Tanjun Liu3, Zaki Ainul Fadli4, Elizabeth Ika Hesti Aprilia Nindia Rini5, Catur Kepirianto6

1,4,5,6 Faculty of Humanities, Universitas Diponegoro, Semarang, Indonesia
2 IPHE/Graduate School of Intercultural Studies, Kobe University, Kobe, Japan
3 Department of Applied Linguistics, Xi'an Jiaotong-Liverpool University, Suzhou, China

Submission Track
Received: May 11, 2025
Final Revision: August 1, 2025
Accepted: August 14, 2025
Available Online: September 25, 2025

Keywords
Corpus linguistics, error annotation, AI, ChatGPT, Indonesia, JFL

Correspondence
*E-mail: prihantoro@live.

DOI: https://doi.org/10.25077/ar.
Under License of Creative Commons Attribution-NonCommercial 4.0 International.

ABSTRACT

There is a growing body of research on using AI for corrective feedback in foreign language teaching. However, few studies have specifically addressed the accuracy of AI analysis in learner corpus research. This study aims to create an AI-annotated corpus, with human supervision, whose data were obtained from learners of Japanese as a Foreign Language (JFL) in Indonesia. The corpus is branded DICO-JALF v.1.0. A further aim is to measure to what extent ChatGPT accurately annotates errors. A task was first administered to collect the corpus data and metadata needed to build the corpus. The corpus was error-annotated using ChatGPT 4.0. Human annotators then manually checked the accuracy of the AI-generated annotations. Regarding errors committed by learners, incorrect lexical choices and forms dominate the causes of errors, while underuse and overuse are less common. It can be concluded that ChatGPT demonstrated an average accuracy of 70% in correctly identifying errors. Regarding error rate, the verb is the category where errors are most frequent, which may be driven by its conjugation, a feature absent in Indonesian, the L1 of the students. This suggests that Indonesian learners' acquisition of Japanese verbs needs greater emphasis. Compared to other similar studies, this accuracy is relatively low. However, it can be argued that one factor determining the accuracy of ChatGPT annotations, or those of any other LLM-based tool, is the complexity of the annotation scheme they adhere to. The corpus has been made available for download. The annotations are readable by any corpus query system that reads XML tags. This corpus serves as a foundational resource for future research on AI-assisted error analysis in JFL learning contexts in Indonesia.

I. INTRODUCTION

A learner corpus is one of the resources for studying how to tackle language learners' challenges. Learner corpora have the potential to describe learners' linguistic development and errors, as attested in Perez-Paredez and Mark, and Forti, among others. However, the creation of error-annotated learner corpora is usually performed manually because an automatic error tagger is absent. LLMs (Large Language Models), which power various AI (Artificial Intelligence) tools, offer a way to automate the annotation process, which allows for less human intervention and a faster process. However, the accuracy and reliability of AI tools in this area still need to be assessed. This project describes the creation of a learner corpus whose error annotations were produced automatically by an AI tool, together with an assessment of the AI's performance. This background translates to the following objectives and potential contributions.

The first objective was to create DICO-JALF v.1.0, a grammatically error-annotated corpus from Indonesian students learning Japanese as a foreign language at various proficiency levels. This corpus will offer a comprehensive data set of learners' performance. The second objective was to describe how accurately ChatGPT 4.0 (OpenAI) annotated errors in essays written in Japanese. This will allow ChatGPT's performance to be assessed when analysing non-English data.

In Indonesia, in addition to English, Japanese is another popular foreign language to learn1. Among other purposes, Japan attracts many Indonesian migrant workers and students to work, pursue higher education, or a combination thereof, as shown in Umoro and Budianto. Note that Japanese as a foreign language (JFL) is popular not only in Indonesia, as shown in Table 1.

Table 1. Number of Learners of JFL Across the World2
Countries: China | Indonesia | Korea | Australia | Thailand
Number of learners:

1 https://w. org/detail/news/SEAQIL-dan-UPI-teliti-preferensi-bahasa-siswa-di-Asia-Tenggara
2 https://w. jp/e/project/japanese/survey/result/dl/survey2021/All_contents_r2.

Numerous studies have been conducted in the context of Indonesian students learning Japanese, such as Safama and Diner, and Barus and Pujiono, among many others. These studies focus on analysing errors committed by learners of Japanese as a foreign language in Indonesia but, unlike this project, they focus on a particular linguistic unit such as particles or sentence structure.

Some limitations were observed in these studies. First, none of them were corpus-based. Second, even if their data are considered Japanese learner corpora, none of these corpora are publicly accessible. Such a corpus would be helpful for replication studies, cross-validation, and comparison purposes, among many others. Third, the analytic schemes used in the aforementioned studies targeted specific errors, such as passive voice and particles, instead of overall errors at different levels of linguistic structure. Using their analytic schemes for the annotation can leave loopholes, as they do not match the purpose of fully annotating errors in a Japanese learner corpus. Hasibuan and Arfianty conducted a similar study, but their tagset is not comprehensive. For instance, it includes an analysis of 'incorrect word use' without giving further details.

So far, three comprehensive error type tagsets (Koyama et al., 2023; Pavlovic, 2020; Yang and Akahori) have been identified that may suit the purpose. Compared to Pavlovic's tagset, Koyama et al.'s tagset is better documented. For example, Pavlovic's tagset provides references, but no labels or examples. Conversely, Koyama's tagset provides labels, references/descriptions, and examples. In addition, Pavlovic's tagset adopts more technical linguistic terms. The current study aims to use a tagset that allows us to reach a more general audience. Koyama's full tagset, which can be observed in Table 2, was therefore preferred. Yang and Akahori's tagset was not considered: it is a subset of the aforementioned tagsets and is not as comprehensive as the other two.

While Koyama's tagset is comprehensive, there is no tag to identify the absence of a required element. This is a common mistake made by learners, which should be tagged. For example, in (1), the verb しました shimashita 'do (past and polite)' is absent, thus causing an error. The support verb should come immediately after サッカー sakkaa 'soccer', as it must surface at the sentence-final position for the sentence to be grammatically well-formed.

(1) Kodomo-no-toki-mainichi-tomodachi-to-issho-ni-sakkaa-()3
kid-GEN-time-everyday-friend-COM-together-LOC-soccer-()
'When I was a child, every day, together with my friends, we () soccer'

A slight adjustment to Koyama et al.
's tagset was made by incorporating Pavlovic's error cause tagset as the second layer tagset, whose categories are shown in Table 3. Pavlovic's error type tagset consists of labels, without descriptions or examples. To complement this, some examples and labels were added.

3 The empty brackets with a zero element () refer to a missing item.

Table 2. Error type tagset (modified from Koyama)
Tag | Description | Example {incorrect form → correct form}
ADJ | Adjective selection error | Sonna koto wa ichiban {[...]aisetsu → juudain[...]} ketten da to omou deshou. 'You might think that it is a [...] weakness'
ADV | Adverb selection error | Soshite, jibun no kaihatsu to tesuto no shigoto o {[...]oku → shikkar[...]} kansui shimashita. 'And I have [...] completed my development and testing works'
AUX | Auxiliary selection error | Watashi wa nihongo ga daisuki {[...]asu → des[...]}. 'I [...] Japanese very much'
CONJ | Conjunction selection error | {Shikashi → Soshit[...]}, watashi no ichiban no yume wa Thai e itte. Thai go o genchi no hito to kouryuu suru koto desu. '[...] my biggest dream is to go to Thailand and interact with local people in Thai.'
DET | Determiner selection error | {Ano → son[...]} kagaku no kaisha no namae wa Chisso kabushiki gaisha datta. '[...] chemical substance factory is called Chisso incorporated company'
NOUN | Noun selection error | Sono hito wa sofuto wea {[...]enryoku → kenri} o motte iru. 'The person has the {[...]osession → righ[...]} of the software'
PART | Particle selection error | Kyou, kaminari no oto {[...]i → d[...]} [...] 'I woke up [...] of thunder'
PRON | Pronoun selection error | Maa, konkai no hanashi wa {[...]oko → kok[...]} madeni shimasu. 'Alright, we [...] cease our conversation'
PUNCT | Punctuation selection error | Isshoni uta o utaimashita {. → ,} gohan o tabete, kaimono o shimashita. 'We sang together {. → ,} then went shopping'
VERB | Verb selection error | Asa gohan o {[...]omi → tab[...]} [...] 'I [...] breakfast'
ADJ:INFL | Adjective conjugation error | Totsuzen totemo {[...]abishii → sabishik[...]} [...] 'Suddenly, I felt very [...]'
AUX:INFL | Auxiliary conjugation error | Shukudai wa totemo tsumara{[...]ai → naku}te totemo muzukashii desu. 'The homework is extremely [...] and difficult'
VERB:INFL | Verb conjugation error | Shikashi, netto de sagasu to, zensen {[...]itsukare → mitsukar[...]} nai desho! '[...]ut, if you search on the internet, it is completely {[...]ot foun[...]} right?'
SPELL | Spelling error | Kokunai no {[...]edeia → medi[...]} mo kanri sarete kagekina genron wa issai kinshi sarete imasu. 'Domestic [...] has been regulated, and all extreme speech is completely forbidden'
VERB:TENSE | Tense usage error | Shit{[...]a → te ir[...]} ninki webbu ga aru kara, zehi oshirase kudasai. 'If you [...] any popular website, please inform me'
 | Word order error | Nihongo {[...]hikyuu kentei → kentei shikyu[...]} goukaku! 'I just passed {[...]evel 4 of Japanese Proficiency Tes[...]}'
OTHER | Other error | Suubyougo, berandaa wa {[...]ieru youna sakudo o tate → miru miru uchi ni tatt[...]} 'In the next few seconds, the [...] swiftly stood before me'

Table 3. Error cause tagset (modified from Pavlovic; CC-BY)
Tag | Label
Wrong choice | WRONG_CHOICE
Lack of use | LACK_USE
Form error | FORM_ERROR
Overuse | OVERUSE

This is one modification. Second, 'redundancy' is replaced with an 'overuse' category.
It is assumed that overuse can include redundancy, but not vice versa, because the latter is restricted to repeating one item (with another identical one), which causes the sentence to be ill-formed. Compare (2) and (3). In sentence (2), the repetition of toki follows sono. The latter is unnecessary and can therefore be categorised as either redundancy or overuse. While the two concepts are interchangeable in (2), in (3) only 'overuse' fits. This is because the erroneous segment の no (genitive marker) is not repeated from any other segment. Its use is unnecessary because, without it, the sentence is already acceptable.

(2) Koukousei-de-korona-virusu-ga-atta-toki-sono-toki-hitobito-wa-minna-ie-ni-imashita
Student.high-school-CONJ-COVID19-SUBJ.CASE-EXT-time-that-people-TOP-all-house-LOC-EXT
'When I was in high school and there was a coronavirus pandemic, everyone stayed at home'

(3) ()LITTLEMIX-no-ga-ichiban-suki-desu.
()LITTLEMIX-GEN-PS-most-like-COP.
'(I) really like little mix'

To clarify labels 1 and 3: 'Wrong choice' means that the form is correct in isolation, but contextually erroneous. Meanwhile, 'FORM_ERROR' means that the learner attempted to use a correct item, but used the wrong form (wrong inflection, conjugation, incomplete word, among others). Overall, this tagset complements the previous tagset, which mixes form and POS segment errors. The tagset focuses on explaining the cause of errors, independently of the POS segment where the errors are committed.

For the first aim, the literature review can be concluded as follows. First, the need to construct the corpus targeted in this project is justified by the absence of a Japanese essay corpus written by learners of JFL whose L1 is Indonesian. Second, the review shows that the error analysis schemes proposed by other scholars are insufficient, as this study aims to annotate multiple errors instead of certain ones. Koyama et al.'s error type tagset and Pavlovic's error cause tagset were therefore combined into a two-layer error tagset.

Artificial Intelligence (AI) is a generic cover term for systems that can perform tasks requiring human intelligence. A recent trend in AI is the use of Large Language Models (LLMs), language models trained on a large amount of data to interact using human instead of machine language (e.g., programming language). In addition to producing texts (creative writing, translation, proofread text), LLMs can also generate images, voices, and video, among others. While studies on the use of AI in foreign language education are numerous (Karata et al., 2024; Pan et al.), its application in corpus linguistics, particularly in corpus building, is understudied. Among the few that exist, this review first highlights Yu et al., who assessed the potential of LLMs for corpus pragmatic annotation, comparing ChatGPT 3.5 and 4. They argue that ChatGPT 4.0 is better, and that its performance (Precision-Recall and F1 Score) is impressively above 90% for all speech acts. Based on this finding, ChatGPT 4, whose performance is better, is used in this study.

However, Yu et al.'s study is different from this project in at least two respects. First, Yu et al.'s study targeted English instead of Japanese. The size and availability of LLM training data sets for English and Japanese are not equal. Second, this project does not aim to apply AI to conduct pragmatic annotations, as Yu et al. did.

Compared to Yu et al.'s study, Poole and Coss's study is more similar to ours. Poole and Coss studied how ChatGPT applies a writing rubric to essay correction tasks and evaluated its performance. While Poole and Coss argue that ChatGPT can serve as a valuable tool for L2 writing assessment, it fails to reach a desirable level. The objective of their study is similar to that of this study: applying some kind of rubric to L2 writing assessment. The difference is that this project is more specific, in that an annotated error analysis scheme, not a rubric for writing in general, is applied. Another difference is the target language, which is not English but Japanese. In a different language, the analytic scheme may differ due to the characteristics of the language. For instance, 'merge' and 'split' are categories4 of errors present in Arabic, as shown in Alrehli and Alhotahli's AI-assisted error analysis, but not in English.

4 Note that the remaining categories, however, are quite generic, such as semantics, morphology, syntax or punctuation.

The aforementioned critical reviews translate to the novelty of this project. This is the first learner corpus whose data are obtained purely from Indonesian students learning Japanese. The architecture of the corpus and the data collection procedure are transparent. The corpus is free to download. This allows for replicability, reproducibility, and data expansion under the same protocol. Error annotations and existing metadata can lead to corpus-based learner profiling. In terms of studies using AI, many studies have focused on using AI tools, while very few have specifically addressed the accuracy of AI tools in conducting error annotation. The results of this study may be useful for improving automated language assessment tools.

II. METHODS

Corpus data were collected from respondents in the Department of Japanese Language, Universitas Diponegoro (UNDIP), for several reasons. First, UNDIP is one of the few universities whose Japanese language departments are nationally accredited as A-grade. While the data only cover students from one university, it is argued that this is a sensible starting point before improving representativeness in subsequent studies. Starting with relatively small data is one approach to solidifying the data collection and analysis protocol before expanding to a larger data set.

Approximately 300 students in years 1–3 (year 2021-2[...]) were asked to write a short essay of 300–600 characters (Katakana, Hiragana, Kanji, or a combination thereof) on a single topic, 'my hobby' (from June-July 2[...]). While the range of topics is obviously expandable, a single topic is considered sufficient as a starting point. This shall be addressed in future studies by eliciting and collecting essays on various topics. Short essays were chosen because the majority of students in year one started as absolute beginners (minimum proficiency in writing any Japanese characters at all), even though, by the time data were collected, they had already learned Japanese for almost one semester.

The students were asked to write an essay in 75 minutes and were not allowed to use any AI writing assistance software or dictionaries. Upon completing their essays, they were requested to complete a questionnaire. This was aimed at obtaining their metadata (year of study, JLPT score, sex, among others). The results were saved in spreadsheet files, as shown in Figure 1.

Fig 1. Sample results: respondent code, essay, JLPT score, year of study

The essays and questionnaires were carefully checked to ensure that all essays complied with the character limit, all required metadata information was supplied, and the respondents clearly expressed consent for data access and publication. Data that did not meet these requirements (3 essays) were excluded from the sample. The returned essays were randomly sampled, stratified by year of study: 30% from each year (see Table 4). The sample comprised 107 essays (around 33% of the population), which would form the corpus.

Table 4. Quick summary of the sample5
Information: Essay; Average characters; JLPT/No-JLPT; Male/Female; Studied Japanese before college (Y/N)
Year 1: 5/30, 15/20
Year 2: 15/20, 13/22
Year 3: 9/26, 6/29, 2/33, 20/10

5 The slash in each year column corresponds to the information. For instance, 5/30 in line 4, year 1, means that of 35 students sampled in year 1, 5 have JLPT certificates while the other 30 do not have any JLPT certificate.

AI analysis and supervised corpus data creation

The essays were assessed with help from ChatGPT 4.0, which was reported to perform better than its predecessor, ChatGPT 3.5, as shown by Holland and Massey et al. The prompts requested ChatGPT to identify each incorrect segment and assign error type and error cause tags. The prompts were installed on ChatGPT's interface (see Japanese Essay Analyser in Fig. 2) and applied to each essay. Unlike Park, who compared ChatGPT and human work for error analysis, in this study the results of the ChatGPT analysis were reviewed by human reviewers. The human reviewers6 reviewed ChatGPT responses using a 4-parameter metric, namely Correct Segment7 (CS), Incorrect Segment (IS), Error Type (ET), and Error Cause (EC). They validated whether or not the values given for these four parameters were accurate, using Boolean values (T=True, F=False), as shown in Table 5.

Fig 2. ChatGPT's installed prompt [Japanese Essay Analyzer] and its output sample

6 Japanese lecturers/tutors, 2.5 years or more teaching experience, minimum proficiency JLPT N2.
7 Questions asked to reviewers: CS: Is there any error in this segment? IS: Is there any error in this segment? ET: Is the assigned error tag correct? EC: Is the error correction accurate? If yes, write T; if not, write F.

Once the evaluation was concluded, each essay, together with its corresponding metadata information, was converted into an XML document readable in the user's preferred corpus query system. The raw corpus is also provided in case users do not need the error annotation attributes. The corpus is named DICO-JALF8 v.1.0. This procedure helped to fulfil the first objective of this study. For this paper's visualisation and analysis purposes, Sketch Engine was used to index the XML-annotated corpus. Sketch Engine was preferred because it includes a Japanese POS tagger by default.

8 DICO-JALF = Diponegoro Corpus of Japanese Learners as a Foreign Language

The information from the corpus was used to measure the accuracy of ChatGPT's annotation for each parameter in the metric and also the overall accuracy. Accuracy is operationally defined in this project as the proportion of TRUE values compared to all values (TRUE + FALSE). Overall accuracy refers to the average value of all four parameters (Table 5 and Table 7).

Table 5. Sample evaluation metric
Error type | Error cause
VERB_INFL | Form error
VERB | Form Error
VERB | Form Error

Table 6. Quick summary of the corpus
Version | Token | Text (Essay) | Metadata attribute | POS-tagged (MeCab) | GPT 4.0 error-annotated | Human evaluation
FULL | | | Yes | Yes | |
EVAL | | | Yes | Yes | Yes | Yes

In addition, the likelihood of each error type label assigned by ChatGPT being correct was also measured. This technically translates to the proportion of each error type label assigned by ChatGPT that was evaluated as 'TRUE' by human evaluators. The accuracy of each error type was then measured. The annotators assigned the correct error type label for each mistake made by students following Koyama's modified tagset. The proportion of matching error type labels given by ChatGPT was calculated. This helped to fulfil the second objective of this study. Unlike Prihantoro, this project does not use precision-recall as an evaluation measure because the tags are unambiguous, while precision-recall is an evaluative measure typically used in information retrieval systems, beyond this study's coverage. For unambiguous tags, accuracy is a better fit, as commonly used in other projects such as Prihantoro, Pandey, or Thewissen.

III. RESULTS

Objective 1

Architecture

The first research objective, to create DICO-JALF v.1.0, has been fulfilled. Here, two XML versions of DICO-JALF were created: RAW and ANNOTATED. The RAW version includes all essays, while the ANNOTATED version includes a sample9, as shown in Table 6. The latter was enhanced with ChatGPT error annotation and human evaluation. These versions are freely available for download10 and for use in users' preferred corpus query systems. Regardless, all essays with their corresponding metadata were indexed in Sketch Engine (SE). As POS tags and metadata are present, users can perform POS-based searches with metadata restrictions, as shown in Figure 3. It shows adjectives used by female students ranked by frequency.

9 This is around 30% of all essays, a manageable amount considering the time and financial constraints. The commitment is to expand this once more financial and time resources are available.
10 https://drive. com/drive/folders/11QREjUrMjARfKOei5mlPpVubjKhQMIar?usp=drive_link

Fig 3. Frequency of adjectives used by female students in DICO-JALF's SE (generated by Word List tool)

The created corpus fulfils the first aim of this research, which is to create an error-annotated corpus from Indonesian students learning Japanese as a foreign language at various proficiency levels. Unlike Imamura et al.'s experiment, in which pseudo data were used, the data in this study were obtained authentically from human learners of Japanese as a foreign language whose native or dominant language is Indonesian. Compared specifically to other learner corpora in Japanese, such as the International Corpus of Japanese as a Second Language (IJAS), Natsume (Nishina et al.), the learner corpus created for Imamura et al.'s experiment, or NAIST Lang-8 (Kasahara et al.), this corpus is smaller. While this may be considered a handicap, it is argued to be a sensible starting point.

In terms of the contribution in the context of Japanese learners whose L1 is Indonesian, the data were specifically obtained from native speakers of Indonesian, an aspect missing from other learner corpora of Japanese. There is indeed a small subset of IJAS data that came from Indonesian students. However, the number of students for that experiment was just 50. In contrast, for this project, data from 325 students (a sample of which was error-annotated) were collected, far outweighing the IJAS respondents. Also, IJAS collects data from students in various countries. Conversely, the focus of this study is on Indonesian learners. Thus, this corpus can better characterise errors generated by learners of Japanese as a foreign language in Indonesia. In future years, the plan is to expand the corpus size by covering all the universities in Indonesia in which Japanese study programmes are offered.

Corpus search based on error tags

Users can search the corpus based on error type or error cause tags using CQL queries, for example to search for verb category errors or for wrong choice errors. Note that, in aim 2, it is argued that ChatGPT is not entirely accurate. Thus, users can also incorporate human evaluation information. An example is a search for verb errors with a restriction to TRUE in the ER_TYPE Text Types. This restricts the search to verb errors that have been evaluated as TRUE by human evaluators, which increases the likelihood that users will get the desired outcome (see Fig. 4).

Fig 4. Concordance lines with VERB restriction in ER_TYPE Text Types in DICO-JALF's SE (generated by Concordance tool)
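The search logic described above (restricting results to one error type and to annotations that human evaluators marked TRUE) can also be reproduced outside a corpus query system. The following Python sketch is a minimal illustration under stated assumptions: the element name `err`, the attribute names `type`, `cause`, `cs`, `is`, `et`, and `ec`, and the placeholder segments are all hypothetical and do not reproduce the actual DICO-JALF XML schema or any corpus data. The sketch also computes accuracy as the proportion of TRUE values among all TRUE and FALSE values, averaged over the four evaluation parameters, following the operational definition given in the Methods section.

```python
import xml.etree.ElementTree as ET

# Hypothetical, made-up annotation sample. Element and attribute names are
# assumptions for illustration only, not the real DICO-JALF schema.
SAMPLE_XML = """<corpus>
  <text id="demo01" year="1">
    <err type="VERB" cause="FORM_ERROR" cs="T" is="T" et="T" ec="T">segment-a</err>
    <err type="VERB" cause="WRONG_CHOICE" cs="T" is="T" et="T" ec="F">segment-b</err>
    <err type="PART" cause="OVERUSE" cs="F" is="T" et="F" ec="T">segment-c</err>
  </text>
</corpus>"""

# The four human-evaluated parameters: correct segment, incorrect segment,
# error type, error cause (each judged T or F).
PARAMETERS = ("cs", "is", "et", "ec")


def load_annotations(xml_text):
    """Return one dict per <err> element: its attributes plus the segment text."""
    root = ET.fromstring(xml_text)
    return [dict(err.attrib, segment=err.text or "") for err in root.iter("err")]


def filter_annotations(annotations, error_type, require_true="et"):
    """Keep annotations of one error type whose chosen parameter was judged T."""
    return [
        a for a in annotations
        if a["type"] == error_type and a[require_true] == "T"
    ]


def accuracy(annotations, parameter):
    """Proportion of T judgements among all T and F judgements for a parameter."""
    values = [a[parameter] for a in annotations]
    if not values:
        raise ValueError("no annotations to evaluate")
    return values.count("T") / len(values)


def overall_accuracy(annotations):
    """Mean of the per-parameter accuracies across the four parameters."""
    return sum(accuracy(annotations, p) for p in PARAMETERS) / len(PARAMETERS)


if __name__ == "__main__":
    anns = load_annotations(SAMPLE_XML)
    for ann in filter_annotations(anns, "VERB"):
        print(ann["segment"], ann["cause"])
    for param in PARAMETERS:
        print(param, round(accuracy(anns, param), 4))
    print("overall", round(overall_accuracy(anns), 4))
```

Keeping the human judgements as attributes next to each annotation means the same records can serve both purposes: filtering for trustworthy hits and measuring how often the automatic labels were confirmed.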
The figure below shows concordance lines whose nodes are error-tagged with AoVERBAo and evaluated TRUE by human annotators, presented alongside their left and right contexts. In some situations, users might want to perform a more underspecified search, such as showing all erroneous segments, error types, and causes . nd, in some cases, human evaluatio. While, by default, the segment that fits the query is visible in concordances, tags will only be visible based on usersAo actions. Their visibility might be helpful to users wishing to identify types of errors and causes of errors . s well as their evaluatio. Figure 6 shows that users can checkmark the attributes they want to show, while Figure 5 shows how these elements are visually presented in Prihantoro / Jurnal Arbitrer - Vol. 12 No. Fig 5. Setting the visibility of XML attribute-values in DICO-JALFAos SE (Generated by Concordance too. Fig 6. Concordance lines from with visible error type in DICO-JALFAos SE (Generated by Concordance too. Compare this with Figure 4 above, in which no attribute or value is present. Objective 2 Watashi-wa-kaku koto-ga-dekimasu 1-TOP-draw-SUBJ-able AoI can drawAo Note that, in many cases. ChatGPTAos analyses were accurate. For instance, in . , it correctly marked oAsAAo tsuzukimasu Aoto resumeAo as an erroneous segment. It also successfully identified VERB_INFL and FORM_ERROR as a correct error type and with correct error cause labels. The correct form of the verb should be oAEAAE AAo tsuzuite imasu Aoto resumeAo. Hence, human evaluators marked them as TRUE, as all evaluations are accurate. This subsection demonstrates that the second aim of this present study has also been fulfilled: to measure the accuracy of ChatGPT annotation. It is argued that ChatGPT can annotate errors in 31% of the cases. Using a human evaluation procedure, it can be established that ChatGPT is not completely accurate and makes See . 
ChatGPT marked できます dekimasu 'able to' as an erroneous segment, but this is not accurate, as this segment should not be treated as an error.

kono-shumi-wa-kodomo toki-kara-ima-made-
DEM-hobby-TOP-childhood-from-now-until-continue
'(I) resume this childhood hobby up to now.'

Table 7 shows that ChatGPT was 60.44% accurate in error identification. In terms of error types, accuracy was measured at a little over 65%. This means that more than 30% of the error labels were incorrectly assigned. For instance, in the example below, ChatGPT assigned ADJ_INFL. However, the correct label should be SPELL, because the erroneous segment is missing い i at the end.

[Table 7. Accuracy for ChatGPT performance. Columns: No, Segment, Accuracy (%). Rows: Correct segment; Erroneous segment; Error type tags; Error cause tags; Mean value accuracy.]

Ano-manga-wa-hontou ni-kakkoi-desu
DEM-comic-TOP-really-cool-COP
'That comic is really cool'

In terms of error-cause labels, the accuracy is slightly better at 70.33%. This means that slightly less than 30% of the data were misanalysed, as in the example below. セリス serisu 'serial' was categorised by ChatGPT as WRONG_CHOICE in terms of error cause. This is, however, inaccurate because the error lies in its form. セリス should be replaced by シリーズ shiriizu 'serial', as the former is not listed as valid Japanese vocabulary. The correct label is, therefore, FORM_ERROR instead of WRONG_CHOICE.

Watashi-no-shumi-wa-dorama,-anime,-eiga,-serisu-o-miru koto-desu
1-GEN-hobby-TOP-drama-anime-film-serial-OBJ-watch-COP
'My hobby is watching dramas, animes, movies, and series'

It may be concluded that, in this project, ChatGPT performed with 70.31% accuracy, averaging across the four parameters (SD, Mdn, and IQR of just over 9, 68, and 14, respectively). This is low compared to the findings of Yu et al., whose accuracy rate could exceed 90%. One possible reason is that more training data for ChatGPT is available in English than in Japanese. This paucity of Japanese data leads to worse system performance than when errors in English are identified.

As for the likelihood of error-type labels being valid, an operational question would be: 'If ChatGPT says this is a particle (PART) error, how accurate is this analysis?' The error likelihood for PART error-type labels is given in Table 8, and the chance of this label being correct is slightly above 76%. In other words, the lower the error likelihood, the greater the chance that a label has been correctly assigned.

[Table 8. Likelihood of error type labels given by ChatGPT. Columns: Error type label, Error likelihood (%). Labels: PRON, AUX, ADJ, ADV, VERB, CONJ, OTHER, VERB(INFL), NOUN, ADJ(INFL), VERB(TENSE), SPELL, PUNCT, DET, PART.]

As for the accurate annotations, i.e., all ChatGPT error annotations validated as TRUE by human evaluators, Figure 7 shows that two of the top three error tags are related to verbs (VERB and VERB_INFL), with a relatively larger proportion than other errors. This result shows that the acquisition of Japanese verbs by the learners of Japanese in the sample is an area where improvements may be prioritised. Looking closely at the error causes of these categories (all verbs), as shown in Figure 8, the main reasons for the verb errors are the use of wrong forms or incorrect verb choice. Verb overuse accounts for only 21%. This result means that the acquisition of Japanese morphology and semantics for verbs still needs to be improved. These findings align with Hayashishita and Ueyama, Yusuf et al., Putri, and Reina and Lee, who studied verbal errors committed by learners of Japanese as a foreign language.

[Fig 7. Distribution of error tags]

[Fig 8. Distribution of error tags for all verb errors]

In some cases, ChatGPT still assigned tags beyond the required tagset. For instance, it marked the cause of an error as 'PUNCTUATION_ERROR' or 'SPELLING_ERROR'. This happened even though the prompts to ChatGPT clearly instructed it not to include categories beyond the required tagsets. This may be driven by ChatGPT's temperature. As argued by Poole and Coss, the temperature setting may affect ChatGPT's 'creativity'. In this research, the default ChatGPT temperature is preserved in order to test how regular users would use ChatGPT, which does not require specific technical knowledge. Advanced users can usually implement temperature changes by accessing ChatGPT's API.

IV. DISCUSSION

The results of the study presented earlier can be interpreted as follows. First, regarding error rates, it can be observed that 'verb' is the most problematic area. This, whilst an empirical finding, is anticipated, as Japanese has a complex verb conjugation system (Hayashishita and Ueyama), a feature absent in Indonesian, the L1 of all the respondents in this study. This is substantiated by form errors, a case of negative transfer, whose frequency is quite large. Another substantial cause of mistakes is lexical choice, which is arguably caused by differences in verb semantics. For instance, as shown in Aror et al., the verb 'to wear' in Japanese may be realised by different lexical choices depending on the position (as in head, hand, body, feet). Due to the absence of the corresponding lexical choices in Indonesian, the learners resorted to the wrong verb choice.

In terms of ChatGPT accuracy (about 70% correct identification of errors), this may be driven by the complexity of the annotation scheme adhered to in this study. In a similar study discussing ChatGPT's accuracy (Yu et al.), it may be noticed that the annotation scheme used is different from ours. Another possible factor is ChatGPT's 'temperature'
(the extent to which ChatGPT can be creative or deterministic), which can be applied as shown in Poole and Coss.

The findings show how the gap in corpus-based research on error analysis of Japanese as a foreign language in Indonesia has been filled. Previous studies whose subjects share similar characteristics with ours (Aror et al.; Putri, 2020; Barus and Pujiono) are not corpus-based. In terms of error annotation scheme and methodology, this study departs from previous work. Most studies examine particular errors, whereas this study analyses errors overall. Methodologically, previous studies or error analyses of Japanese learners in Indonesia did not use AI tools to identify errors. Instead, errors were directly identified by human teachers. While methodologically different from Sanosi and Heintz et al., the study shares similar findings in that, in terms of accuracy, there are some areas in which AI-based applications can still be improved. Regardless of its shortcomings, as shown by Tyson in the case of ChatGPT, AI is still arguably helpful in providing grammatical feedback. However, it still needs to be overseen by human evaluation to improve its accuracy.

Regarding implications, DICO-JALF (the created corpus) may serve as an empirical database for interlanguage studies. For language pedagogy, the findings may be used as a stepping stone to improve Japanese teaching materials or techniques in, but not limited to, Indonesia. The corpus can also be used for data-driven learning by supplying authentic errors and their corrections. For Natural Language Processing (NLP), the corpus may be used as training data for error correction.

For future studies, it is also possible to compare ChatGPT's performance with that of another LLM-based system, such as DeepSeek or Gemini. In terms of population, the learner group can also be expanded to represent learners of Japanese as a foreign language in Indonesia by recruiting subjects from different universities and regions in Indonesia.

V. CONCLUSION

As shown in the earlier section, DICO-JALF has been created, and it is now publicly available. This is the first error-annotated learner corpus of Japanese obtained from Indonesian students. This corpus can already be used to support open and distributed learning. Students also have the option to use their preferred systems to access it. The representativeness of the corpus may be improved by incorporating more data from diverse universities in Indonesia.

The use of AI to create error annotations in DICO-JALF has been demonstrated. However, the post-annotation evaluation suggests that the overall annotation accuracy (about 70%), i.e. the proportion of errors that ChatGPT annotates correctly, can still be improved. Using this corpus, Japanese language teachers and lecturers can identify errors in different metadata categories and devise data-driven teaching materials, allowing them to target specific areas of weakness.

This research is an essential contribution to AI-based Japanese language assessment tools. In this paper, AI assistance has been implemented in DICO-JALF to annotate errors automatically. In addition, the capability of automated systems to detect and classify common errors in Japanese made by Indonesian students has been demonstrated. This may contribute to the better performance of AI-based language tools.

ACKNOWLEDGEMENT

This research was funded by the Faculty of Humanities, Universitas Diponegoro, through the 2024 Research Grant, International Joint Research Scheme (No: 106/UN7.F6/HK/), with Prihantoro, Ph.D., as the principal investigator, and with Dr. Tanjun Liu (Xi'an Jiaotong-Liverpool University, China) and Prof. Shin'ichiro Ishikawa (Kobe University, Japan) as international research partners. The authors express their sincere gratitude to the reviewers for their insightful feedback and constructive comments, which have greatly contributed to the refinement of this work.
Appreciation is also extended to the editors for their professional support throughout the submission and publication process. The authors further acknowledge the invaluable assistance of the research assistants, Ninik Elika and Haqi Sang Kautsar, for their dedicated support in data collection.

ETHICS STATEMENT

All respondents whose texts are in the corpus have agreed to make their data publicly available. The informed consent forms for these agreements were digitally signed.

CREDIT AUTHOR STATEMENT

Prihantoro: Conceptualisation, literature review, methodology, analysis, proofreading, writing original draft, writing -- review and editing, visualisation, supervision, project administration, funding acquisition.

Shin'ichiro Ishikawa: Original draft review, methodology, analysis, Japanese native speaker consultant.

Tanjun Liu: Software, statistical analysis, methodology, funding acquisition.

Elizabeth Ika Hesti Aprilia Nindia Rini: Resources, investigation, validation, analysis.

Zaki Ainul Fadli: Resources, investigation, validation, analysis.

Catur Kepirianto: Methodology, analysis, data

DECLARATION OF COMPETING INTERESTS

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

REFERENCES