International Journal of Artificial Intelligence p-ISSN: 2407-7275, e-ISSN: 2686-3251 Original Research Paper

Evaluation of Perplexity and Syntactic Handling Capabilities of ClueAI Models on Japanese Medical Texts

Tatsuhiro Haga1*, Keiyo Matsumoto2, Ippei Asahiko2, Shunzo Mizoguchi1
1 School of Engineering, Shibaura Institute of Technology, Saitama, Japan
2 College of Industrial Technology, Nihon University, Tokyo, Japan

Article History Received: Revised: Accepted:
*Corresponding Author: Tatsuhiro Haga Email: haga@gmail.
This is an open access article, licensed under: CC BY-SA

Abstract: This study evaluates the effectiveness of ClueAI, a large Japanese language model tailored to the medical domain, in the task of predicting Japanese medical texts. The study is motivated by the limitations of general language models, including multilingual models such as multilingual BERT, in handling the linguistic complexity and specialized terminology of Japanese medical texts. The methodology involves fine-tuning the ClueAI model on the MedNLP corpus, with a MeCab-based tokenization approach through the Fugashi library. Evaluation uses the perplexity metric to measure the model's generalization ability in predicting texts probabilistically. The results show that the domain-adapted ClueAI achieves lower perplexity than the multilingual BERT baseline and better captures the context and sentence structure of medical texts. MeCab-based tokenization contributes significantly to prediction accuracy through more precise morphological analysis. However, the model still shows weaknesses in handling complex syntactic structures such as passive sentences and nested clauses. The study concludes that domain adaptation improves performance, but limitations in linguistic generalization remain a challenge.
Further research is recommended to explore models that are more sensitive to syntactic structure, to expand the variety of medical corpora, and to apply other Japanese language models to broader medical NLP tasks such as clinical entity extraction and classification.

Keywords: ClueAI, Japanese LLM, MeCab Tokenization, Medical NLP, Multilingual BERT.

2025 | International Journal of Artificial Intelligence | Volume 12 | Issue 1 | 11-23
Henrik Lauritsen, David Hestbjerg, Lone Pinborg, Christensen Pisinger. Evaluation of Perplexity and Syntactic Handling Capabilities of ClueAI Models on Japanese Medical Texts. International Journal of Artificial Intelligence, vol. 12, no. 1, pp. 11-23, June 2025. DOI: 10.36079/lamintang.

Introduction
The integration of artificial intelligence (AI) in healthcare has significantly transformed medical diagnostics, patient care, and administrative workflows. Among AI technologies, Natural Language Processing (NLP) plays a crucial role in analyzing vast amounts of medical text. However, applying NLP to Japanese medical texts poses unique challenges due to the language's complexity, such as the use of kanji, hiragana, and katakana scripts without explicit word boundaries, and the sensitive nature of medical data. Tokenization, a fundamental NLP task, is particularly difficult and requires specialized tools such as MeCab to handle morphological ambiguities effectively. Moreover, clinical narratives contain domain-specific terminology, abbreviations, and context-dependent meanings, complicating the accurate interpretation necessary for disease prediction, treatment recommendations, and patient monitoring. The limited availability of annotated Japanese medical corpora further restricts the development of robust NLP models for this domain. Large Language Models (LLMs) such as GPT and BERT have shown remarkable ability in understanding and generating human-like text and can be fine-tuned for domain-specific applications such as medical text analysis.
Nevertheless, most LLMs are predominantly trained on English corpora, which limits their performance on Japanese medical texts. Recent initiatives, such as ClueAI, have developed LLMs trained on Japanese datasets to bridge this linguistic gap, but their effectiveness in the medical domain remains underexplored. The MedNLP corpus, composed of authentic Japanese medical records, provides a valuable resource for training and evaluating NLP models in realistic clinical scenarios, offering potential to enhance the relevance and accuracy of NLP in Japanese healthcare. Evaluating NLP models requires appropriate metrics; perplexity, which quantifies a model's uncertainty in predicting the next word, is a standard measure where lower values indicate better predictive performance, making it suitable for medical text prediction tasks. Using multilingual BERT as a baseline enables benchmarking the improvements gained from domain-specific adaptation, and comparing fine-tuned ClueAI against this baseline helps assess the benefits of targeted training. This study aims to adapt the Japanese LLM ClueAI for medical text prediction using the MedNLP corpus. By fine-tuning on domain-specific data, we seek to improve predictive accuracy as measured by perplexity and compare it with multilingual BERT. Additionally, we analyze prediction errors and biases to understand model limitations, which is crucial for reliable application in healthcare, where errors can have serious consequences. The significance of this research lies in advancing effective and accurate NLP systems tailored to the Japanese medical context, addressing linguistic challenges, and leveraging domain-specific data to support better clinical decision-making and patient outcomes. Furthermore, the methodology and findings may guide the development of similar NLP applications in other languages and domains, contributing to global healthcare innovation.

Global and Local LLM Studies
Large Language Models (LLMs)
such as GPT-3, LLaMA, and BLOOM have been developing rapidly and have become the backbone of many natural language processing (NLP) applications. These models are trained on large-scale datasets and have shown outstanding performance in domains such as text generation, machine translation, and question answering. GPT-3, developed by OpenAI, is one of the most well-known models in the world; it generates human-like text and is used in applications such as automated writing assistants, chatbots, and code generation. LLaMA (Large Language Model Meta AI), developed by Meta, has attracted attention for its efficient architecture, providing powerful language modeling capabilities at a lower computational cost. LLaMA is an open-source model, making it a popular choice for researchers who want to build custom NLP systems. The model's scalability allows it to be used in a variety of tasks, such as medical text summarization or analysis, which is particularly important in specialized fields such as medicine. BLOOM, developed by a collaboration of AI researchers, is another cutting-edge model that serves as a multilingual alternative to GPT-3. Unlike GPT-3, BLOOM is designed to generate text in multiple languages, making it suitable for applications in diverse linguistic contexts. BLOOM's multilingual capabilities have gained significant attention, especially in non-English regions, as it enables better model performance for languages such as Japanese that have different grammatical structures and writing systems. Localized models such as ClueAI, designed specifically for Japanese, have shown promising results in tasks such as text generation, classification, and sentiment analysis.
These models are optimized to handle the specificities of Japanese, including its complex writing system and rich morphology. While LLMs such as GPT-3 and LLaMA have achieved global success, localized models such as ClueAI show potential for more tailored applications in specific linguistic contexts. Despite the global success of LLMs, challenges remain in applying these models to languages such as Japanese. Differences in sentence structure, word segmentation, and semantic ambiguity between English and Japanese highlight the limitations of using general LLMs without adaptation. This suggests the need for further research on localized LLMs, especially in the medical field, where precise domain knowledge and terminology are critical. Recent research has explored the application of LLMs to various Japanese language tasks. Researchers have fine-tuned models such as BERT for specific domains, including health, to improve performance on tasks such as medical diagnosis prediction and clinical decision making. However, these models still face challenges in handling medical texts because complex medical terminology often requires specific adaptation. The growing interest in LLMs for Japanese encourages further research into the customization of these models. Fine-tuning techniques, such as domain adaptation and data augmentation, have been explored to improve the effectiveness of LLMs on specific tasks, including medical text processing. This is an important step towards bridging the gap between general-purpose LLMs and applications that require high accuracy in the medical context. The continued development of LLMs, both globally and locally, demonstrates their transformative potential in a variety of NLP applications, including medical text prediction. However, further research is needed to fine-tune these models for specific domains, especially in Japanese, to improve their effectiveness in real-world applications.
Medical NLP Research in Japan
The integration of natural language processing (NLP) in medical applications in Japan has gained significant attention in recent years. One of the most widely used tools for processing Japanese text in the medical domain is MeCab, a morphological analyzer that is essential for Japanese sentence segmentation. MeCab breaks Japanese sentences down into words or phrases, which is essential for downstream tasks such as information extraction, text classification, and named entity recognition (NER). This tool is widely used across NLP tasks in Japan, especially for processing medical data that includes complex terminology. MeCab has been integrated with various Python libraries, such as Fugashi, to facilitate tokenization and preprocessing of Japanese text. By applying these tokenization techniques, researchers can prepare medical corpora for tasks such as clinical text classification and disease prediction. These tools are essential for enabling accurate data extraction from unstructured medical records. In addition, MeCab is often used in conjunction with other NLP resources, such as the Unified Medical Language System (UMLS), to bridge the gap between Japanese medical terminology and international standards. The UMLS, developed by the National Library of Medicine, provides a comprehensive set of biomedical vocabularies, including medical terminology, codes, and concepts. In Japan, the integration of UMLS with local medical datasets has led to significant progress in the application of NLP to Japanese medical texts. Researchers have used UMLS-Japan, a localized version of UMLS, to standardize Japanese medical terms and enable interoperability between Japanese and international medical systems. This standardization is important for improving the accuracy and efficiency of automated medical systems in Japan. Studies have shown that applying NLP models to Japanese medical texts can significantly improve clinical decision-making.
By analyzing patient records, medical professionals can gain insights into patterns of diagnosis and treatment outcomes. For example, clinical NLP systems have been used in Japan to extract information from Electronic Health Records (EHRs) for predictive modeling and disease diagnosis. These systems rely on tokenization and classification models that understand the nuances of Japanese medical language. Despite advances in medical NLP, challenges remain in processing Japanese medical text. One major challenge is the variation in medical terminology across hospitals and healthcare systems. The lack of a standardized corpus for Japanese medical text has hampered the development of more effective NLP models for this domain. Researchers have worked to create and standardize medical corpora, such as the MedNLP corpus, to provide a more reliable basis for training medical NLP models. Recent research in Japan has focused on improving the performance of LLMs in the medical field. Researchers have fine-tuned models such as BERT for Japanese medical tasks, including diagnosis prediction and clinical text classification. These models are trained on specialized datasets, such as MedNLP, to deepen their understanding of medical terminology and improve performance on specific tasks. However, these models still require further refinement to handle the diversity and complexity of Japanese medical language. Integrating advanced NLP models such as LLMs with medical applications in Japan has great potential to improve healthcare outcomes, but challenges such as varying terminology and the need for a more standardized medical corpus remain major obstacles that need to be overcome.
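Because Japanese is written without spaces, a tokenizer must decide where each word begins and ends before any downstream task can run. The toy longest-match segmenter below is a deliberately simplified illustration of the kind of decision MeCab makes for every sentence; it is not MeCab itself (which uses a full lattice-and-cost model), and the mini-dictionary is a hypothetical example:

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match segmentation: at each position, take the
    longest dictionary entry that matches. A toy stand-in for the
    lattice-based morphological analysis MeCab actually performs."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:          # unknown character: emit it alone
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

# Hypothetical mini-dictionary of morphemes; the sentence means
# "has a history of diabetes".
vocab = {"糖尿病", "の", "既往", "が", "ある"}
print(longest_match_tokenize("糖尿病の既往がある", vocab))
# → ['糖尿病', 'の', '既往', 'が', 'ある']
```

In practice this segmentation is delegated to MeCab through the fugashi binding, as described in the System Architecture section; the toy version only shows why an explicit dictionary-driven step is unavoidable for Japanese.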
Underexplored Research Gaps
One of the major gaps in medical NLP research in Japan is the limited fine-tuning of large language models (LLMs) to handle Japanese medical texts effectively. While much research has focused on fine-tuning models for English, model adaptation for Japanese is still limited, largely because of the profound differences in sentence structure, morphology, and orthography between English and Japanese. Global LLMs, such as GPT-3 and BERT, cannot always handle the nuances of Japanese without deep adaptation. Japanese medical texts pose additional challenges related to specialized terminology and variation in how medical texts are written. Therefore, fine-tuning large language models for Japanese medical texts is essential to improve the accuracy of models in understanding and generating relevant medical text. Tasks such as automatic diagnosis and medical information extraction require a deep understanding of the Japanese medical domain, including variations in terminology and phrase usage in clinical contexts. Another research gap lies in the lack of large, high-quality medical datasets for training NLP models. Japanese medical datasets, such as MedNLP, are still limited in size and scope. With larger and more diverse datasets, LLMs can be trained to understand broader medical contexts, which in turn can improve their performance on predictive tasks. Therefore, efforts to expand and standardize medical corpora are essential to advancing medical NLP research in Japan. Fine-tuning models for specific medical tasks can also improve the predictive capabilities of automated diagnosis and clinical decision-making. Several studies have begun to explore fine-tuning models for specific diagnoses, but little research has focused on Japanese medical texts. Therefore, further research on adapting LLMs to Japanese in medical contexts is needed.
One challenge that has not been widely explored is the use of localized Japanese models in medical applications. Models such as ClueAI, which are optimized for Japanese, can provide significant benefits when applied to medical text prediction. Fine-tuning these models on Japanese medical datasets can produce more efficient and relevant models for specific medical tasks, such as medical record analysis and disease prediction. Although progress has been made in medical NLP in Japan, there is still much room for further exploration. Fine-tuning LLMs for Japanese medical texts is a promising but challenging area. Further research combining LLMs with standardized Japanese medical terminology would greatly enhance the models' ability to address complex medical challenges.

Methodology
This study adopts a computational approach to fine-tune and evaluate Japanese large language models (LLMs), specifically ClueAI, on the task of predicting medical texts. The study was conducted throughout 2025, focusing on the analysis and processing of Japanese medical texts using the MedNLP corpus, a dataset containing authentic clinical narratives. The primary data source is the MedNLP corpus, a publicly available collection of medical records written in Japanese. This corpus contains various types of clinical documents, including discharge summaries, progress notes, and other patient-related data, which are essential for training and evaluating NLP models. Model performance is evaluated by comparing the fine-tuned ClueAI model with a multilingual BERT baseline.
Both models are evaluated using the perplexity metric, which measures a model's uncertainty in predicting the next word. This research was conducted at Keio University in collaboration with the RIKEN Institute. Computational resources for model training and evaluation were provided by these institutions, which offer access to high-performance servers and GPUs for efficient model processing. This study uses the Japanese language model ClueAI, along with the MedNLP corpus, to assess the applicability of LLMs to medical text prediction. The methodology involves using the MeCab tokenizer for text processing, evaluating model performance using perplexity, and comparing against a multilingual BERT model as a baseline.

System Architecture
The analysis leverages the MeCab tokenizer, which is essential for processing Japanese text, through fugashi, a Python binding for efficient tokenization. MeCab is a morphological analysis tool that breaks Japanese text into meaningful units, such as words and phrases, for further processing by the model. The multilingual BERT model was used as a baseline, without any fine-tuning, while ClueAI underwent domain-specific fine-tuning on the MedNLP corpus. The system architecture involves two main steps: data preprocessing and model training. During data preprocessing, the MedNLP corpus was tokenized using MeCab. The data was then split into training and testing sets, with approximately 80% of the data used for training and 20% for evaluation. ClueAI was fine-tuned on the training set, and the performance of both models was evaluated using the perplexity metric on the testing set.

Analysis Process
The primary analysis focused on comparing the perplexity scores of both models. Lower perplexity indicates that the model is better at predicting the next word, which is crucial in medical text prediction tasks.
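The preprocessing step described in the System Architecture (tokenize the corpus, then hold out roughly 20% of documents for evaluation) can be sketched as follows. The 80/20 ratio comes from the paper; the function name, the toy corpus, and the fixed seed are illustrative assumptions:

```python
import random

def train_test_split(documents, train_fraction=0.8, seed=42):
    """Shuffle the corpus reproducibly and split it ~80/20,
    as done with the MedNLP documents in this study."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

# Toy corpus standing in for tokenized MedNLP records.
corpus = [f"record_{i}" for i in range(100)]
train, test = train_test_split(corpus)
print(len(train), len(test))  # → 80 20
```

Splitting at the document level (rather than the sentence level) keeps all sentences of one clinical record on the same side of the split, which avoids leaking a patient's context from training into evaluation.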
Additionally, error analysis was performed to identify potential biases in the model predictions, especially in the medical domain where precision and accuracy are crucial.

Finding and Discussion
This study evaluates the performance of the fine-tuned Japanese large language model (LLM) ClueAI using the Japanese medical corpus MedNLP. The main evaluation metric is perplexity, which measures the uncertainty of the model in predicting the next word; the lower the perplexity value, the better the model's predictive performance.

Evaluation Using Perplexity
The evaluation was carried out by calculating the log-likelihood loss using a function available in the HuggingFace library. The ClueAI model fine-tuned on MedNLP produced a perplexity of 14.2, while the baseline model, multilingual BERT without fine-tuning, showed a perplexity of 27.8. This difference indicates that domain-specific adaptation in ClueAI significantly improves prediction accuracy on Japanese medical texts.

Table 1. Model Perplexity Comparison

Model | Fine-tuned | Dataset | Perplexity
BERT Multilingual | No | MedNLP Corpus | 27.8
ClueAI | Yes | MedNLP Corpus | 14.2

Table 1 presents a comparison of the perplexity values of the two large language models (LLMs) used in this study, BERT Multilingual and ClueAI. Both models were evaluated using the MedNLP corpus, a dataset of authentic Japanese medical texts containing clinical records such as hospital discharge summaries and patient progress notes. The evaluation was carried out by computing the log-likelihood loss with the evaluation function from the HuggingFace library, and then converting that value into perplexity.
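The conversion just described is simply perplexity = exp(mean cross-entropy loss). A minimal pure-Python sketch of both directions of that relationship follows; the hypothetical token probabilities stand in for a real model's outputs, and the function names are illustrative:

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity is the exponential of the average negative
    log-probability the model assigned to each observed token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def perplexity_from_loss(mean_log_likelihood_loss):
    """A HuggingFace-style evaluation reports the mean cross-entropy
    loss in nats; perplexity is simply its exponential."""
    return math.exp(mean_log_likelihood_loss)

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is as uncertain as a uniform choice among four words.
print(round(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]), 6))  # → 4.0

# Working backwards from the perplexities reported in Table 1, the
# implied mean losses are ln(27.8) ≈ 3.33 and ln(14.2) ≈ 2.65 nats.
print(round(perplexity_from_loss(math.log(14.2)), 1))  # → 14.2
```

This equivalence is why a lower evaluation loss and a lower perplexity always move together: perplexity is a monotonic transform of the loss, rescaled into an intuitive "effective number of equally likely next words".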
Perplexity is a standard metric in language model evaluation that measures how well a model predicts the next word in a sequence of text; the lower the perplexity, the better the model's performance in understanding and predicting text. The evaluation results show that the BERT Multilingual model, used without any medical-domain fine-tuning, obtained a perplexity of 27.8. This value reflects fairly high prediction uncertainty, meaning the model is not very effective at understanding medical contexts in Japanese. Meanwhile, the ClueAI model fine-tuned on the MedNLP corpus obtained a much lower perplexity of 14.2, indicating that it captures the linguistic context and Japanese medical terminology more accurately. The difference of 13.6 points indicates that domain-specific fine-tuning has a substantial impact on model performance. ClueAI, retrained on Japanese medical data, recognized linguistic patterns and technical terms in clinical texts better than Multilingual BERT, which was trained only on general, multilingual data. These findings support the hypothesis that a local large language model like ClueAI, optimized with domain-specific data, can significantly improve prediction accuracy in Japanese medical NLP tasks. This is especially important in the medical context, where misinterpretation can directly impact the quality of clinical decision-making and patient safety.

Comparison with Baseline
The baseline model, multilingual BERT without fine-tuning, showed difficulty understanding Japanese-specific medical terms, presumably because it was trained in a general manner and not specifically on the medical domain or the Japanese language. On the other hand,
ClueAI, which went through a fine-tuning process on MedNLP data, recognizes linguistic patterns and medical terminology more accurately. This difference in performance emphasizes the importance of domain adaptation when LLMs are used in specific contexts such as medical texts.

Table 2. Log-Loss (Log-Likelihood Loss) Evaluation

Model | Training Loss | Validation Loss
BERT Multilingual | ≈3 | ≈4
ClueAI (Fine-tuned) | ≈2 | ≈2

Table 2 shows the training results for the two models, with Training Loss and Validation Loss recorded for each. Training Loss measures how well the model learns from the training data: the lower the training loss, the better the model has learned the patterns in that data. It is calculated from the difference between the model's predictions and the correct labels on the training set. Validation Loss measures the model's performance on data it did not see during training; it assesses the model's ability to generalize, i.e., how well the model can apply its knowledge to data it was not trained on. A lower validation loss indicates a model that generalizes better and makes more accurate predictions on unfamiliar data. Based on Table 2, the per-model interpretation is as follows. Multilingual BERT shows relatively higher training and validation losses (around 3 and 4, respectively) than the fine-tuned ClueAI, suggesting that it has difficulty adapting to the specific medical data for this task and may not generalize very well.
The fine-tuned ClueAI recorded a training loss of around 2, significantly lower than Multilingual BERT's, indicating that it learns patterns from the training data better. Its validation loss, also around 2, suggests that ClueAI generalizes better to previously unseen data. In general, a large discrepancy between training and validation loss, such as a very low training loss alongside a high validation loss, may indicate overfitting, where the model excels on training data but struggles to generalize. In the case of ClueAI, however, the gap between training and validation loss appears balanced, suggesting that the model is stable and less prone to overfitting. Based on Table 2, the fine-tuned ClueAI outperforms Multilingual BERT in both learning from the training data and generalizing to new data, making it the more suitable choice for this task, especially in specialized domains such as medical text.

Error and Bias Analysis
Error analysis shows that multilingual BERT tends to fail at predicting rare clinical terms and complex contexts, especially terms containing advanced medical kanji or abbreviations commonly used by Japanese doctors. ClueAI, although performing better, still shows bias when interpreting passive sentences and long sentence structures, which frequently appear in medical summaries such as discharge summaries. For example, in sentences containing many subordinate clauses or symptom details, ClueAI sometimes predicts terms that do not fit the context.

Error Analysis on Multilingual BERT and ClueAI
- Multilingual BERT: while capable of handling multiple languages, it struggles with uncommon clinical terms and complex contexts, especially those involving advanced medical kanji or medical abbreviations frequently used by Japanese doctors.
o Complex medical kanji often have very specific meanings in medical contexts, and since Multilingual BERT was not specifically trained on Japanese medical texts, it struggles to understand them. Medical abbreviations that are only familiar to medical professionals in Japan can also be challenging, as the model may not recognize them as valid medical o As a result, the modelAos performance drops on predictions involving rare medical terms, which can lead to errors in classification or interpretation. C ClueAI . ine-tuned mode. o ClueAI performs better, but still has some biases and shortcomings in processing passive sentences and long sentence structures often found in medical summaries such as patient summaries after hospitalization. o Passive sentences and sentences with many subordinations . ependent clause. or symptom details can confuse the model. This happens because ClueAI may have difficulty capturing the overall context, especially in sentences that are indirect or have complex relationships between elements. o For example, in sentences containing details of symptoms or causes of a disease. ClueAI sometimes predicts terms that are not in accordance with the context, which can lead to errors in medical interpretation. Henrik Lauritsen. David Hestbjerg. Lone Pinborg. Christensen Pisinger. Evaluation of Perplexity and Syntactic Handling Capabilities of ClueAI Models on Japanese Medical Texts. International Journal of Artificial Intelligence, vol. 12, no. 1, pp. June 2025. DOI: 10. 36079/lamintang. Table 3. ClueAI vs BERT Multilingual: Medical Sentence Predictions (After Improvement. No. Input Sentence (Medical Contex. ClueAI Prediction CAIAnIAA AUACCO (The patient had a history of diabete. OUeUAACAIAoI CeAOA (After surgery, the patient reported a CAiAsCACO CUAAUAA (No abnormal lung sounds were foun. AuA 120/80mmHg AOoaEA (Blood pressure was stable at 120/80mmH. 
CAIAuCeAOA AECU (The patient is complaining of a CAIAeAUa (The patientAos cough CAIasCeAO (The patient complained of nause. EuAACaOAuN AUIao (Emergency treatment is needed due to chest Ground Truth Error Type nI . BERT Multilingual Prediction oAu nI . Incorrect diagnosis oI oI ( Medical term o soun. ACCO . ound presen. AA . o soun. Meaning inversion Oo . sOo . Oo . Sentence u . Accurate prediction e . Accurate prediction as . Accurate prediction Eu . hest pai. Eu . hest pai. Eu . hest pai. Accurate prediction . Case Examples in Table 3 Here is the analysis related to Table 3, focusing on the prediction errors and successes made by ClueAI and BERT Multilingual in the Japanese medical context: CAIAnIaUACCO (The patient had a history of diabete. ClueAI Prediction: nI . BERT Multilingual Prediction: oAu . Ground Truth: nI . Error Type: Incorrect diagnosis Analysis: ClueAI is able to provide a correct prediction by recognizing the word "nI" . as an appropriate medical context. Henrik Lauritsen. David Hestbjerg. Lone Pinborg. Christensen Pisinger. Evaluation of Perplexity and Syntactic Handling Capabilities of ClueAI Models on Japanese Medical Texts. International Journal of Artificial Intelligence, vol. 12, no. 1, pp. June 2025. DOI: 10. 36079/lamintang. Multilingual BERT, despite being able to recognize the word "nI," incorrectly predicted the diagnosis, giving oAu . , which is clearly not a diagnosis that fits the context of the sentence. BERT's error is due to BERT's more general language processing capabilities and its tendency to associate words that appear more frequently in multilingual datasets, while ClueAI is more trained in medical contexts and Japanese, giving more accurate results. OUeUAACAIAoICeAOA (After surgery, the patient reported a feve. ClueAI Prediction: oI . BERT Multilingual Prediction: e . Ground Truth: oI . Error Type: Medical term confusion Analysis: ClueAI successfully predicted oI . , which fits the medical context of after surgery. 
BERT Multilingual preferred e . which is more suggestive of a different symptom, although it can be related to fever. This difference arises because BERT does not understand the Japanese medical context in depth and is more inclined to translate common words from the multilingual model. This error shows that ClueAI, which focuses more on Japanese and medical contexts, produces more accurate results in terms of understanding medical terminology compared to BERT. CAiAsCACOCUAAUAA (No abnormal lung sounds were foun. ClueAI Prediction: AA . o soun. BERT Multilingual Prediction: ACCO . ound presen. Ground Truth: AA . o soun. Error Type: Meaning inversion Analysis: ClueAI can correctly predict that there are no abnormal lung sounds with AA . o soun. BERT Multilingual actually produces the opposite result, namely ACCO . here is a soun. , which is the opposite of the information given in the original sentence. This error occurs because BERT has difficulty understanding the context of negated sentences in Japanese, especially when the medical sentences are more technical and ambiguous. This shows that ClueAI, with its more specific Japanese training, is able to capture the nuances of sentences better than BERT, which tends to have difficulty handling the inversion of meaning that occurs in negated sentences. AuA 120/80mmHg AOoaEA (Blood pressure was stable at 120/80mmH. ClueAI Prediction: Oo . BERT Multilingual Prediction: sOo . Ground Truth: Oo . Error Type: Sentence misinterpretation Analysis: ClueAI correctly predicts Oo . , which corresponds to the blood pressure condition described in the sentence. BERT Multilingual, on the other hand, chooses sOo . , which is clearly incorrect in this context. This shows that BERT does not fully understand the medical context and is more likely to associate words with their frequency of occurrence across languages, which is not always accurate in a specific context. 
ClueAI handles this problem better because it has been trained on Japanese medical terminology, while BERT tends to provide more general interpretations.

5. The patient is complaining of a headache.
ClueAI Prediction: headache. BERT Multilingual Prediction: headache. Ground Truth: headache. Error Type: Accurate prediction.
Analysis: Both ClueAI and BERT Multilingual correctly predicted that the patient was complaining of a headache, a common medical symptom. Both models produced accurate results here, showing that in this case both the Japanese-based and the multilingual model can handle relatively simple cases without difficulty.

6. The patient's cough worsened.
ClueAI Prediction: cough. BERT Multilingual Prediction: cough. Ground Truth: cough. Error Type: Accurate prediction.
Analysis: Here ClueAI and BERT Multilingual give the same result, which fits the context of the sentence. This shows that both models can recognize common symptoms that are easy to interpret in Japanese.

7. The patient complained of nausea.
ClueAI Prediction: nausea. BERT Multilingual Prediction: nausea. Ground Truth: nausea. Error Type: Accurate prediction.
Analysis: Both models gave accurate predictions for the symptom of nausea, showing that both ClueAI and BERT handle common symptoms well, without difficulty.

8. Emergency treatment is needed due to chest pain.
ClueAI Prediction: chest pain. BERT Multilingual Prediction: chest pain. Ground Truth: chest pain. Error Type: Accurate prediction.
Analysis: Both models, ClueAI and BERT, are correct in predicting "chest pain", which matches the intended symptom.
This shows that both models can handle fairly simple cases that occur frequently in medical texts.

Based on the analysis:
- ClueAI tends to be superior in the Japanese medical context because it is trained on Japanese medical data, so it captures more specific meanings and contexts.
- Multilingual BERT, while strong at multilingual processing, sometimes struggles with medical understanding in Japanese and tends to produce errors in more technical or culturally localized contexts.

In short, multilingual BERT tends to struggle with rare or complex medical terms, while ClueAI, although better, still struggles with passive voice, long sentence structures, and complex medical contexts. The models' biases in interpreting certain sentence structures, together with the ambiguity of medical terms, lead to prediction errors. Therefore, although ClueAI outperforms multilingual BERT, it still faces challenges in predicting accurately in some of the more complex medical contexts.

Discussion
The results of this study confirm previous findings that LLMs need to be retrained (fine-tuned) with domain-specific data in order to achieve optimal performance in natural language processing tasks. In this context, ClueAI's adaptation to the MedNLP data has been shown to improve the accuracy of next-word prediction in Japanese medical texts. This is especially important in the clinical setting in Japan, where medical documentation relies heavily on complex sentence structures, domain-specific clinical notation, and formal language. The use of the MeCab tokenizer, accessed via fugashi, has also proven essential.
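The two components at the heart of this setup, morphological segmentation and perplexity scoring, can be sketched in a few lines. The segmenter below is a toy dictionary-based longest-match routine standing in for MeCab/fugashi (which perform full morphological analysis), and the per-token probabilities are hypothetical illustrative values, not actual ClueAI or BERT outputs.

```python
import math

# Toy stand-in for MeCab/fugashi: greedy longest-match segmentation over a
# small hand-built lexicon. Japanese has no whitespace word boundaries, so
# some form of dictionary-driven segmentation is required before scoring.
LEXICON = {"患者", "は", "頭痛", "を", "訴え", "ている"}
MAX_LEN = max(len(w) for w in LEXICON)

def segment(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary entry first, then shorter ones.
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in LEXICON:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit as a single token
            i += 1
    return tokens

def perplexity(token_probs):
    # PPL = exp(-(1/N) * sum_i log p_i); lower means the model is less
    # "surprised" by the observed token sequence.
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# "The patient is complaining of a headache" (Table 3, case 5).
tokens = segment("患者は頭痛を訴えている")

# Hypothetical probabilities that a domain-adapted model and a multilingual
# baseline might assign to each of the six tokens above.
domain_probs = [0.9, 0.8, 0.7, 0.8, 0.6, 0.7]
baseline_probs = [0.6, 0.5, 0.2, 0.5, 0.3, 0.4]

print(tokens)
print(perplexity(domain_probs), perplexity(baseline_probs))
```

Under these made-up numbers the domain-adapted probabilities yield the lower perplexity, which is the direction of the result reported in this study; the real evaluation uses probabilities from the fine-tuned ClueAI model and the multilingual BERT baseline over the MedNLP corpus.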
Japanese, which does not mark word boundaries explicitly, requires robust morphological analysis, and MeCab provides stable token representations that are well suited to model training. Without accurate tokenization, Japanese NLP models struggle with the morphological and syntactic ambiguities common in medical texts. This study also confirms that perplexity, used as a performance metric, gives a strong indication of a model's ability to generalize in understanding context. Although it does not directly capture semantic meaning, perplexity correlates with prediction accuracy in sequential text tasks.

Conclusion
This study evaluated the effectiveness of a large Japanese language model, ClueAI, fine-tuned on the MedNLP corpus, for the task of predicting Japanese medical texts. Compared to the multilingual BERT baseline, the results show that the domain-tuned ClueAI produces lower perplexity values and performs better at understanding the context and structure of medical sentences. MeCab-based tokenization via fugashi proved essential for improving prediction accuracy, as it handles Japanese morphology with greater precision. The perplexity-based evaluation also proved effective for measuring the model's generalization ability in predicting text. In this study, the model still showed weaknesses in handling complex syntactic structures, such as passive sentences and nested clauses, which are typical linguistic challenges in Japanese medical texts. This suggests that although domain adaptation brings improvements, limitations remain in the model's linguistic generalization ability. Further research could be directed at developing models that are more sensitive to syntactic structure, exploring additional and more varied medical corpora, and evaluating other Japanese LLMs on broader medical NLP tasks, such as clinical entity extraction and classification.

References