SCHOLARS: Jurnal Sosial Humaniora dan Pendidikan
ISSN: 3047-8618 (Online) | DOI: 10.31959/js.
Volume 3, No. [...], June 2025
Homepage: https://ejournal-polnam.id/index.php/JS/index

Assessment of Readability of Vietnamese Text
Hien Pham1, Thi-Minh-Huyen Nguyen2, Phuong Le-Hong3*
School of Languages and Tourism, Hanoi University of Industry, Vietnam
Vietnam National University, Hanoi
*Corresponding author email: phuonglh@vnu.
Abstract: This paper presents a study that has systematically considered the impact of linguistic features on readability for Vietnamese texts.
We used a set of selected texts from the primary school, junior high school and high school Vietnamese textbooks as our input data set.
The set of linguistic features accounted for various aspects, such as morphological features, part-of-speech features, syntactic features and discourse features.
Our study aims to develop, both qualitatively and quantitatively, a readability scale for writers who wish to match their texts to the intended readers.
The outcomes of this study can benefit textbook authors working on the new set of K-12 textbooks, as well as students, teachers, and publishers.
Keywords: readability assessment, linguistic features, morphological analysis
INTRODUCTION
Investigations into readability began in the 19th century, and their findings have contributed to many social domains.
In the English-speaking world, readability has been applied in text categorization and evaluation.
Various formulae for English text readability have been developed.
Although it is the 13th most spoken language in the world, Vietnamese is under-represented in the field of readability because of its limited corpora and natural language processing resources.
Therefore, building a formula to gauge Vietnamese text readability is a crucial task for those working in computational linguistics, especially now that Vietnam is reforming its curriculum and textbooks from primary to high school levels.
The current study aims to develop a formula for Vietnamese text readability.
However, the formulas developed to date for English cannot be applied directly to Vietnamese because of the difference in language typology.
Text readability depends on linguistic factors of written texts, which we consider as linguistic-internal factors.
Our study aims to develop, both qualitatively and quantitatively, a readability scale for writers who wish to match their texts to the intended readers.
The outcomes of this study can benefit textbook authors working on the new set of K-12 textbooks, as well as students, teachers, and publishers.
License: Creative Commons Attribution-ShareAlike 4.0 International License.
METHOD
A Brief Survey on Existing Methods
A significant body of research on text readability has been developed over the last decades [Kevyn, 2.
Traditional approaches mainly rely on computing difficulty measures.
These measures are normally computed on two main factors, either on the familiarity of linguistic units such as words and phrases, or on the complexity of syntax.
These factors are often combined to devise readability formulas so as to make their application straightforward.
The readability scores obtained by these formulas help evaluate the readability or difficulty of traditional texts.
The most widely used traditional formula is the Flesch-Kincaid score [Kincaid et al., 1., which is
$$\mathrm{FK\_score} = 0.39 \times \mathrm{AverageWordPerSentence} + 11.8 \times \mathrm{AverageSyllablePerWord} - 15.59$$
This formula is the basis of many similar variants which have been developed over the years.
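As a rough illustration, the Flesch-Kincaid score can be evaluated directly from raw counts. The Python sketch below assumes the counts are already available; the function name `flesch_kincaid_grade` and its arguments are illustrative choices, and syllable counting itself, which is language-specific, is deliberately left out.

```python
def flesch_kincaid_grade(num_words: int, num_sentences: int, num_syllables: int) -> float:
    """Flesch-Kincaid score computed from raw counts of an English text.

    The three counts are assumed to be produced elsewhere (tokenizer,
    sentence splitter, syllable counter); this function only applies
    the formula.
    """
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("num_words and num_sentences must be positive")
    average_words_per_sentence = num_words / num_sentences
    average_syllables_per_word = num_syllables / num_words
    return (0.39 * average_words_per_sentence
            + 11.8 * average_syllables_per_word
            - 15.59)
```

For instance, a text of 100 words, 10 sentences and 150 syllables has 10 words per sentence and 1.5 syllables per word, so the score is 0.39 × 10 + 11.8 × 1.5 − 15.59 = 6.01.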
However, as stated above, these formulas are all specific to English and not readily transferable to different languages, especially languages of different types such as the Vietnamese language.
Recently, with the advance of many machine learning methods and the availability of training data, there has been an increasing interest in applying artificial intelligence (AI) based approaches to readability assessment.
In these learning-based approaches, there are three main steps.
The first step is corpus acquisition, where a gold-standard corpus of individual texts is constructed.
This corpus is representative of the target language, genre or other aspects of the text that need to be evaluated.
The acquired corpus is normally annotated manually by linguistic experts with computer assistance. It is then divided into a training set and a test set.
The training set is used to develop automated machine learning models, which are learned from examples.
The test set, whose examples serve as unseen samples, is used to evaluate the performance of the learned models.
These models can sometimes be tuned on a separate held-out set, which is usually called a validation set or a development set.
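The division of a corpus into training, development and test portions might be sketched as follows. This is a minimal illustration rather than the procedure used in this study: the function name, the 80/10/10 proportions and the fixed random seed are all assumptions made for the example.

```python
import random


def split_corpus(texts, train_fraction=0.8, dev_fraction=0.1, seed=0):
    """Shuffle the texts and split them into train, dev and test lists.

    The fractions and the seed are illustrative defaults; whatever is
    left after the train and dev portions becomes the test set.
    """
    items = list(texts)
    random.Random(seed).shuffle(items)  # reproducible shuffle
    n_train = int(len(items) * train_fraction)
    n_dev = int(len(items) * dev_fraction)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test
```

Keeping the test portion untouched until the final evaluation is what allows its examples to play the role of unseen samples.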
The second step is feature extraction, sometimes called featurization.
This step concerns defining and extracting a set of important features that best represent the text under assessment. The feature sets, which should be salient to the target readability prediction task, are often proposed by experts with deep domain knowledge.
Most of the time, the feature sets are refined through a trial-and-error process.
Note that once a feature set is defined, its feature instances are extracted or computed by automated software.
The third step is model learning.
In this step, a machine learning model is used to learn a mapping from a text to its gold-standard label.
This model relies on the features extracted in the previous step, both in the training process and in the inference process.
The main assumption of model-based learning is that, if the training texts and the unseen texts are drawn from the same statistical distribution, a statistical machine learning model that performs well on the training data will also perform well on unseen data.
That is, it helps predict accurately the readability of an unseen text.
Our Proposed Method
In this work, we adopt the machine learning approach to readability assessment, taking into account specific features of the Vietnamese language.
Feature Extraction
In the first step, we design a set of salient features which are suitable for assessing the readability of a text.
For each text, we compute the following features:
- The average sentence length in characters
- The average sentence length in words (after performing word segmentation)
- The ratio of concrete/abstract nouns, which is the number of concrete/abstract nouns divided by the number of tokens
- The ratio of proper nouns
- The ratio of adjectives
- The ratio of clauses, which is approximated by the ratio of prepositions
- The ratio of Sino-Vietnamese words; a Sino-Vietnamese word is a word or morpheme of the Vietnamese language borrowed from Chinese
- The ratio of pure old Chinese words, which originated from Chinese
- The ratio of pure Vietnamese words
- The ratio of French loanwords, which are borrowed from French
- The ratio of unknown words, which are not in the standard lexicon
- The identity of the most frequent unigrams with a cutoff threshold of 2
- The identity of the most frequent bigrams with a cutoff threshold of 2
Note that the last two items are feature templates, which can generate many feature instances.
For example, if we consider a text of 4 words, "difficulty assessment of text", then the bigram feature template would generate the following features: "difficulty assessment", "assessment of", and "of text".
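The bigram template in this example can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that the cutoff threshold of 2 is applied to frequencies counted over the whole corpus, which the text does not state explicitly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngram_features(documents, n=2, cutoff=2):
    """Keep the n-grams whose frequency is at least `cutoff`.

    `documents` is a list of token lists. Counting over the whole
    corpus is an assumption made for this sketch.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return {gram for gram, count in counts.items() if count >= cutoff}
```

Calling `ngrams("difficulty assessment of text".split(), 2)` yields the three bigrams listed in the example, and setting `n=1` gives the unigram template.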
All n-gram features whose frequency is not less than 2 are retained and fed into the machine learning model.
In order to compute these features automatically, we need to develop some core pre-processing modules, including:
- A sentence segmentation module, which splits a text into multiple sentences
- A word segmentation module, which splits a sentence into lexical units
These modules make use of advanced computer algorithms, as described in scientific publications:
- [Le-Hong and Ho, 2. for automatic sentence segmentation
- [Le-Hong et al., 2. for word segmentation
- [Le-Hong et al., 2. for part-of-speech tagging
- [Le-Hong et al., 2. for clause extraction and tagging
Due to space limitations, we refer the interested reader to the documents above for details.
In summary, the result of this project is built upon many essential works that we have performed over the last ten years.
Machine Learning Model
There are a variety of machine learning models for supervised learning which can be used for readability assessment, ranging from linear models to stronger non-linear ones.
Given the size and nature of the training corpus, we chose to use a linear classification model, namely logistic regression.
This model has proven to be both fast and efficient for the problem addressed in this project.
We briefly present the mathematical formulation of this model as follows.
Let x be an input text and y be its label.
After featurization, the input x is represented by a real-valued vector f(x) of size d, where d is the dimension of the feature space, which can be large and grows with the size of the training data.
The label y is binary, taking a value of either 0 (or easy) or 1 (or difficult).
In this model, we compute the conditional probability distribution of label y given f(x) by using the sigmoid function (also called the logistic function), as follows:
$$P(y = 1 \mid f(x)) := \frac{1}{1 + \exp(-w \cdot f(x))},$$
where the parameter vector w is also of size d, and $w \cdot f(x)$ is the inner product of the two vectors w and f(x).
The right-hand side is called the sigmoid function of $w \cdot f(x)$:
$$\mathrm{sigmoid}(z) = \frac{1}{1 + \exp(-z)}.$$
Once the probability that the label takes value 1 is computed, we can easily compute the probability that it takes value 0:
$$P(y = 0 \mid f(x)) = 1 - \mathrm{sigmoid}(w \cdot f(x)).$$
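The two probabilities above can be computed with a few lines of Python. This is only a sketch: the function names are illustrative, the vectors are plain lists of floats, and the branch on the sign of z is a standard numerical-stability device rather than something taken from this paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe because z < 0
    return e / (1.0 + e)


def prob_difficult(w, f):
    """P(y = 1 | f(x)) for parameter vector w and feature vector f."""
    z = sum(w_j * f_j for w_j, f_j in zip(w, f))  # inner product of w and f(x)
    return sigmoid(z)


def prob_easy(w, f):
    """P(y = 0 | f(x)) = 1 - sigmoid of the inner product."""
    return 1.0 - prob_difficult(w, f)
```

When the inner product is zero the model is undecided and both probabilities equal 0.5.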
Given a training dataset composed of N training samples $\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, we can estimate the parameter vector w by solving an optimization problem derived from the maximum likelihood principle:
$$L(w) := P(y_1 \mid f(x_1)) \, P(y_2 \mid f(x_2)) \cdots P(y_N \mid f(x_N)) \rightarrow \max.$$
Advanced numerical algorithms such as gradient-based methods or Newton-Raphson methods have been shown to solve this optimization problem very efficiently.
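To make the maximum likelihood estimation concrete, the sketch below maximizes the logarithm of the likelihood by plain batch gradient ascent. It is a simplified stand-in for the gradient-based solvers mentioned above, not the implementation used in this work; the learning rate, the number of epochs and the function names are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, f):
    """Inner product of two equal-length lists."""
    return sum(w_j * f_j for w_j, f_j in zip(w, f))


def log_likelihood(w, features, labels):
    """Logarithm of the product of P(y_i | f(x_i)) over the samples."""
    total = 0.0
    for f, y in zip(features, labels):
        p = sigmoid(dot(w, f))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def train_logistic_regression(features, labels, learning_rate=0.1, epochs=500):
    """Estimate w by batch gradient ascent on the log-likelihood.

    The gradient with respect to w_j is the sum over samples of
    (y_i - p_i) * f_j(x_i); it is averaged over the samples here.
    """
    d = len(features[0])
    n = len(features)
    w = [0.0] * d
    for _ in range(epochs):
        gradient = [0.0] * d
        for f, y in zip(features, labels):
            error = y - sigmoid(dot(w, f))
            for j in range(d):
                gradient[j] += error * f[j]
        w = [w_j + learning_rate * g_j / n for w_j, g_j in zip(w, gradient)]
    return w
```

Maximizing the logarithm rather than the product itself gives the same parameter vector, because the logarithm is increasing, and it avoids multiplying many small probabilities together.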
Once the parameter vector w is learnt, it can be used to make a prediction for future data samples in time linear in the number of features, which is very fast.
To assign an unseen text x to label 1 or 0, we just need to evaluate $\mathrm{sigmoid}(w \cdot f(x))$: if this probability is greater than a threshold value, for example 0.5, then y is predicted to be 1; otherwise it is predicted to be 0.
In essence, the parameter vector w encodes the importance of every salient feature in the prediction model.
The greater (in absolute value) a parameter value w_j is, the more important the corresponding feature f_j is.
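Reading feature importance off the parameter vector amounts to sorting the features by the absolute value of their weights. The short Python sketch below illustrates this; the function name and the feature names in the usage note are hypothetical.

```python
def rank_features_by_weight(names, weights):
    """Sort (name, weight) pairs by decreasing absolute weight."""
    if len(names) != len(weights):
        raise ValueError("names and weights must have the same length")
    return sorted(zip(names, weights), key=lambda pair: abs(pair[1]), reverse=True)
```

For example, `rank_features_by_weight(["ratio_adjectives", "avg_sentence_length", "ratio_unknown"], [0.2, -1.5, 0.7])` puts `avg_sentence_length` first. Such a ranking is most meaningful when the features are on comparable scales, since a feature measured in large units needs only a small weight to have a large effect.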
Linguistic Resources Construction
In this section, we introduce the construction of two types of resources.
The first is the lexical resources used for text preprocessing and lexical feature extraction, and the second is the corpora containing the texts whose readability is to be assessed.
As mentioned in Section 2.1, the difficulty measures are computed based on two factors, which are the familiarity of linguistic units and the complexity of syntax.
In Section 2.2, a list of features is consequently proposed for the construction of a model predicting the difficulty level of a text.
For extracting these features, we need the following lexical resources:
- A word list with part-of-speech (POS) information for the tasks of word segmentation and POS tagging.
- From the above list, we can extract a list of concrete/abstract nouns and a list of [...]
- A list of conjunctions.
- A list of Sino-Vietnamese words.
- A list of pure old Chinese words.
- A list of pure Vietnamese words.
- A list of French loanwords.
For the Vietnamese word list with POS information, we make use of the Vietnamese Computational Lexicon (VCL) introduced in [Nguyen et al, 2. and [Vu and Nguyen, 2.
Each sense of a word entry is associated with several linguistic characteristics: morphological information, word category and subcategory, subcategorization frames for verbs, and semantic descriptions (meaning, semantic constraints, definition and usage examples).
Below is an example of the first sense of the word entry "chạy" (to run):
chạy (V) (người, động vật) di chuyển thân thể bằng những bước nhanh, mạnh và liên tiếp
Morphological
  WordType --> simple word
Syntactic
  Category --> V
  Subcategory --> Vi
  FrameSet --> Sub V
  SyntacticFunction --> Sub
  SyntacticConstituent --> NP
  Before --> R: đang
Semantic
  Logical constraint
    CategorialMeaning --> Activity
  Semantic constraint
    Sub --> Agt{Person, Animal}
  Def --> (người, động vật) di chuyển thân thể bằng những bước nhanh, mạnh và liên tiếp
  Exa --> cậu bé đang chạy
The VCL data is encoded in XML format.
This dataset contains about 42,000 entries. From this dataset, we built a tool for extracting all the words found in the studied corpus together with their characteristics.
The word category and subcategory can be extracted directly from the dataset, while the attribute of a noun being abstract or not is reconstructed from the meaning category in the lexicon and the semantic tree of the lexicon guidelines.
However, for several words in VCL, these descriptions are missing.
We have filtered these words out to complete our [...]
The lists of Sino-Vietnamese words, old Chinese words and French loanwords are built from many Sino-Vietnamese dictionaries and research works.
A non-exhaustive list of the works from which the etymological information of the words was drawn is as follows:
- Từ điển yếu tố Hán Việt thông dụng (Dictionary of Sino-Vietnamese everyday usage elements) / edited by Hoàng Văn Hành; compiled by Phan Văn Các, et al. Hanoi: Nhà xuất bản Khoa học xã hội, 1991.
- Yuenanyu shuangyinjie HanYueci tedian yanjiu: yu Hanyu bijiao / Wenqing Luo. Guangzhou: Shi jie tu shu Guangdong chu ban gong si, 2011.
- Từ điển từ Hán Việt / Lại Cao Nguyện (editor) and Phan Văn Các. Nhà xuất bản KHXH.
- Từ điển các từ tiếng Việt gốc Pháp (Dictionnaire des termes vietnamiens d'étymologie française) / Nguyễn Quảng Tuân and Nguyễn Đức Dân, published in 1992.
- Từ gốc Pháp trong tiếng Việt / Vương Toàn. Hanoi: NXB Khoa học xã hội.
Concerning the preparation of the corpora for learning and testing models, we have collected the documents from textbooks, then built a tool to format each document in XML.
An interface has been defined to annotate the difficulty level of each text.
Software Implementation
In this section, we present general information about the software system that we have implemented in this project for the automated readability assessment of texts extracted from textbooks.
In order to build a software system that is capable of assessing the readability level of a text efficiently, we need to build a variety of software modules from scratch.
These modules can be grouped into 5 main categories, as shown in the following table:
Table:
| No. | Category | Description | Modules |
| --- | --- | --- | --- |
| 1 | Core linguistic | This category contains modules for segmentation of a text into sentences, and of sentences into lexical units or words. | Sentence segmentation; Word segmentation |
| 2 | Core linguistic | This category contains category tagging (part-of-speech tagging) at the sentence level, and clause segmentation. | Part-of-speech tagging; Clause segmentation |
| 3 | Feature extraction | This category contains modules for extracting features for readability assessment. There are two types of features, namely discrete features and continuous features. | Summary statistics about lengths (average sentence length in words, average sentence length in characters); Ratio of Sino-Vietnamese words; Ratio of pure Chinese originated words; Ratio of French originated words; Ratio of unknown words; Ratio of common nouns; Ratio of proper nouns; Ratio of adjectives; Ratio of prepositions/clauses; Unigram features; Bigram features; Word embeddings |
| 4 | Model estimation | This category contains modules for automatic assessment of readability. The two classification models are logistic regression and neural network. | Training and prediction with logistic regression; Training and prediction with feed-forward neural network |
| 5 | Web services | This category contains modules for software integration and demo. | Data indexing service; Sentence segmentation service; Word segmentation service; Part-of-speech tagging service; Readability assessment service; Demo website (using Java Enterprise technologies) |
The underlying assessment model is trained on a set of literature texts extracted from the Grade 4 textbook (Level 4).
For example, when a user enters the following text:
Ăng-co Vát là một công trình kiến trúc và điêu khắc tuyệt diệu của nhân dân Cam-pu-chia được xây dựng từ đầu thế kỉ XII.
Khu đền chính gồm ba tầng với những ngọn tháp lớn.
Muốn thăm hết khu đền chính phải đi qua ba tầng hành lang dài gần 1500 mét và vào thăm 398 gian phòng.
Suốt cuộc dạo xem kì thú đó, du khách sẽ cảm thấy như lạc vào thế giới của nghệ thuật chạm khắc và kiến trúc cổ đại.
Đây, những cây tháp lớn được dựng bằng đá ong và bọc ngoài bằng đá nhẵn.
Đây, những bức tường buồng nhẵn bóng như mặt ghế đá, hoàn toàn được ghép bằng những tảng đá lớn đẽo gọt vuông vức và lựa ghép vào nhau kín khít như xây gạch vữa.
The system gives final and intermediate analysis results, which are shown in the following figure.
Figure:
The estimated probability distribution over difficulty levels is presented with two outcomes, easy and difficult, shown in proportion to their probabilities.
The numbers are shown when the user hovers the mouse over the corresponding graphical parts.
Some intermediate analyses are also presented for users to check.
The first one is sentence analysis, which contains the results of the sentence segmentation module. In the example above, this is a list of six sentences:
Ăng-co Vát là một công trình kiến trúc và điêu khắc tuyệt diệu của nhân dân Cam-pu-chia được xây dựng từ đầu thế kỉ XII.
Khu đền chính gồm ba tầng với những ngọn tháp lớn.
Muốn thăm hết khu đền chính phải đi qua ba tầng hành lang dài gần 1500 mét và vào thăm 398 gian phòng.
Suốt cuộc dạo xem kì thú đó, du khách sẽ cảm thấy như lạc vào thế giới của nghệ thuật chạm khắc và kiến trúc cổ đại.
Đây, những cây tháp lớn được dựng bằng đá ong và bọc ngoài bằng đá nhẵn.
Đây, những bức tường buồng nhẵn bóng như mặt ghế đá, hoàn toàn được ghép bằng những tảng đá lớn đẽo gọt vuông vức và lựa ghép vào nhau kín khít như xây gạch vữa.
The second intermediate analysis contains the part-of-speech tagged text, where each sentence is labeled with words and their corresponding word categories in context:
Ăng-co/Np Vát/Np là/V một/M công_trình/N kiến_trúc/N và/CC điêu_khắc/N tuyệt_diệu/N của/E nhân_dân/N Cam-pu-chia/Np được/R xây_dựng/V từ/E đầu/N thế_kỉ/M XII/Np PUNCT/PUNCT
Khu/Nc đền/N chính/Np gồm/V ba/M tầng/N với/E những/L ngọn/N tháp/N lớn/A PUNCT/PUNCT
Muốn/V thăm/V hết/N khu/N đền/N chính/Np phải/V đi/V qua/E ba/M tầng/N hành_lang/N dài/A gần/A 1500/M mét/Nu và/CC vào/E thăm/V 398/M gian/N phòng/N PUNCT/PUNCT
Suốt/A cuộc/N dạo/V xem/V kì_thú/N đó/P PUNCT/PUNCT du_khách/N sẽ/R cảm_thấy/V như/C lạc/V vào/E thế_giới/N của/E nghệ_thuật/N chạm_khắc/V và/CC kiến_trúc/V cổ_đại/N PUNCT/PUNCT
Đây/P PUNCT/PUNCT những/L cây/N tháp/N lớn/A được/R dựng/V bằng/E đá_ong/N và/CC bọc/V ngoài/A bằng/E đá/N nhẵn/N PUNCT/PUNCT
Đây/P PUNCT/PUNCT những/L bức/Nc tường/N buồng/N nhẵn/A bóng/A như/A mặt/N ghế/N đá/N PUNCT/PUNCT hoàn_toàn/A được/V ghép/V bằng/E những/L tảng/N đá/N lớn/A đẽo_gọt/V vuông_vức/N và/CC lựa/V ghép/V vào/E nhau/N kín/A khít/N như/C xây/V gạch/N vữa/N PUNCT/PUNCT
As can be seen in the above result, each token is labeled with a tag designating a part-of-speech category; for example, N is a common noun, Np is a proper noun, A is an adjective, V is a verb, E is a preposition, and so on.
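Tagged output in this word/TAG format is straightforward to turn into the ratio features used by the model. The following Python sketch parses one tagged sentence and computes the share of selected tags; the function names are hypothetical, and excluding PUNCT items from the token count is an assumption, since the text does not say whether punctuation is counted.

```python
from collections import Counter


def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs.

    rpartition is used so that a word containing '/' keeps its slash
    and only the last '/' separates the tag.
    """
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


def tag_ratios(pairs, tags=("Np", "A", "E")):
    """Ratio of each selected tag among the non-punctuation tokens."""
    tokens = [(word, tag) for word, tag in pairs if tag != "PUNCT"]
    counts = Counter(tag for _, tag in tokens)
    total = len(tokens)
    return {tag: (counts[tag] / total if total else 0.0) for tag in tags}
```

Applied to the second sentence of the example, `tag_ratios(parse_tagged(...))` would report one preposition (E) and one adjective (A) among eleven tokens; the preposition ratio is the quantity used above to approximate the ratio of clauses.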
To develop the part-of-speech tagging module, we use a linguistic corpus of more than 10,000 sentences which are manually word segmented and tagged by linguists at the Vietnam Center of Lexicography (Vietlex).
This corpus is a result of the VLSP 2010 project, funded by the state whose objective is to build fundamental resources and tools for processing Vietnamese text and speech.
The third intermediate analysis contains some linguistic features which are essential for readability assessment, as shown in the following figure:
Figure:
Here, some ratios of sentence lengths in tokens and in characters as well as some etymological features and syntactic features are shown.
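The sentence-length statistics can be sketched as follows, assuming the text has already been split into sentences and word-segmented, with multi-syllable words joined by underscores as in the tagged output. The function name is illustrative, and counting the spaces between syllables as characters is an assumption of this sketch.

```python
def average_sentence_lengths(segmented_sentences):
    """Return (average length in words, average length in characters).

    `segmented_sentences` is a list of sentences, each a list of word
    tokens such as ["công_trình", "lớn"]. Underscores are turned back
    into spaces before characters are counted.
    """
    n = len(segmented_sentences)
    if n == 0:
        return 0.0, 0.0
    total_words = sum(len(sentence) for sentence in segmented_sentences)
    total_chars = sum(
        len(" ".join(token.replace("_", " ") for token in sentence))
        for sentence in segmented_sentences
    )
    return total_words / n, total_chars / n
```

Because a Vietnamese word may span several syllables, the length in words depends on the quality of the word segmentation step, whereas the length in characters does not.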
RESULTS
Automatic text readability assessment has been performed for English, German, Swedish, Japanese and Chinese.
In contrast, research on the readability of Vietnamese text is quite limited.
The purpose of this study is to systematically analyze the impact of linguistic features for assessing the readability level of Vietnamese texts for K-12 learners.
More specifically, we designed various features at different levels (morphology, part-of-speech, syntax, and discourse) and applied classification models to predict the reading levels of Vietnamese textbooks for elementary, junior high, and senior high school students.
In the current model, we have tested selected linguistic features (in black) at different levels, as can be seen in Table # below.
We further regressed these features for different readability levels and selected the significant features.
Table: Linguistic features included in the model
| Level | Domain | Feature |
| --- | --- | --- |
| Morphology | Word complexity | The ratio of Sino-Vietnamese words; a Sino-Vietnamese word is a word or morpheme of the Vietnamese language borrowed from Chinese |
| | | The ratio of pure old Chinese words, which originated from Chinese |
| | | The ratio of pure Vietnamese words |
| | | The ratio of French loanwords, which are borrowed from French |
| | | The ratio of unknown words, which are not in the standard lexicon |
| | | The identity of the most frequent unigrams with a cutoff threshold of 2 |
| | | The identity of the most frequent bigrams with a cutoff threshold of 2 |
| | | Average number of syllables per word per document |
| | | Average number of syllables per unique word per document |
| | | Percentage of words of more than two syllables per document |
| POS | | The ratio of common nouns, which is the number of common nouns divided by the number of tokens |
| | | The ratio of proper nouns |
| | | The ratio of adjectives per document |
| | | Percentage of unique functional words per document |
| | | Number of unique functional words per document |
| | | Average number of unique nouns per sentence |
| Syntactic | Sentence | The average sentence length in characters |
| | | The average sentence length in words (after performing word segmentation) |
| | | The ratio of clauses, which is approximated by the ratio of prepositions |
| | | Average number of multi-syllable words per sentence |
| | | Average length of prepositional phrases per sentence |
| Syntactic | Document | Number of syllables per document |
| | | Number of syllables (including punctuation, numerals, and symbols) per document |
| | | Percentage of unique nouns per document |
| Discourse | Entity density | Average number of unique entities per sentence |
| Discourse | Cohesion | Number of unique conjunctions per document |
| | | Percentage of unique conjunctions per document |
| | | Average number of conjunctions per sentence |
Based on pilot data from texts extracted from grade 4 textbooks, we fitted the data into the learning model.
The t-test p-values show that the model achieves high accuracy for level 4.
The sample data for the learning model are 80 texts classified by tertiary students of linguistics and literature on a 7-level Likert scale.
We asked the students to read those texts and classify them into different levels.
The students were also asked to spell out some of the reasons that make a text difficult.
We use the features mentioned above to perform regression on text data of various levels.
We select a subset of them at the 96% confidence level and derive a readability formula as presented above.
In the next stage of the project, we collected and processed a large body of texts in 17 subjects and evaluated our proposed method of difficulty assessment on these texts.
For each subject, we take 80% of the texts for training and 20% for testing.
Each text is classified into either the "easy" or the "difficult" level.
In total, there are 4,930 texts. The statistics of the subjects and their corresponding numbers of lessons are given in the following table.
Table: Number of texts, test accuracy and test F-measure for each subject and grade (L01 to L12). Subjects: Arts, Biology, Chemistry, National Defense, Civic Education, Geography, History, History Geology, Informatics, Literature, Morality, Natural - Social Science, Physics, Science, Technology, Vietnamese.
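The two evaluation measures reported per subject and grade, test accuracy and test F-measure, can be computed from gold and predicted labels as sketched below. The paper does not state which variant of the F-measure is used; this Python sketch assumes the F1 score of the "difficult" class (label 1), and the function name is illustrative.

```python
def accuracy_and_f_measure(gold, predicted, positive=1):
    """Return (accuracy, F1 of the positive class) for two label lists."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0.0:
        return accuracy, 0.0
    # F1 is the harmonic mean of precision and recall.
    return accuracy, 2.0 * precision * recall / (precision + recall)
```

Reporting the F-measure alongside accuracy is useful when one of the two classes is much rarer than the other in a given subject, because accuracy alone can then look high for a model that rarely predicts the rare class.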
CONCLUSION
In this study, we have systematically considered the impact of linguistic features on readability for Vietnamese texts.
We used a set of selected texts from the primary school, junior high school and high school Vietnamese textbooks as our input data set.
The set of linguistic features accounted for various aspects, such as morphological features, POS features, syntactic features and discourse features.
The current pilot study does not yet allow us an insight into the best model.
However, in the next phase of the research, we plan to fit all the annotated texts from all the textbooks from grade 1 to grade 12 into the learning model.
That way, we could achieve a highly predictive and accurate model of readability for the majority of Vietnamese texts used for general educational purposes.
We also expect to extend this research for assessing a wide range of Vietnamese texts in other domains with various implications.
REFERENCES