Indonesian Journal of Science & Technology 4 (2) (2019) 294-311

Indonesian Journal of Science & Technology
Journal homepage: http://ejournal.upi.edu/index.php/ijost/

Question Generator System of Sentence Completion in TOEFL Using NLP and K-Nearest Neighbor

Lala Septem Riza1*, Anita Dyah Pertiwi1, Eka Fitrajaya Rahman1, Munir1, Cep Ubad Abdullah2
1 Department of Computer Science Education, Universitas Pendidikan Indonesia, Bandung, Indonesia
2 Fakultas Pendidikan Ilmu Pengetahuan Sosial, Universitas Pendidikan Indonesia, Bandung, Indonesia
* Correspondence: E-mail: lala.riza@upi.edu

ABSTRACT
Test of English as a Foreign Language (TOEFL) is one of the forms of learning evaluation that requires questions of excellent quality. Preparing TOEFL questions in a conventional way certainly takes a lot of time, and computer technology can be used to solve this problem. Therefore, this research was conducted to solve the problem of making TOEFL questions of the sentence-completion type. The built system consists of several stages: (1) input data collection from foreign news media sites with excellent English grammar quality; (2) pre-processing with Natural Language Processing (NLP); (3) Part of Speech (POS) tagging; (4) question feature extraction; (5) separation and selection of news sentences; (6) determination and value collection of seven features; (7) conversion of categorical data values; (8) classification of the blank-position word target with K-Nearest Neighbor (KNN); (9) heuristic determination of rules from human experts; and (10) selection of options (distractors) based on the heuristic rules. After conducting the experiment on 10 news articles, 20 questions were obtained; the results of the evaluation showed that the generated questions had very good quality, with a percentage of 93% after the assessment by the human experts, and 70% of the questions had the same blank position as the historical data of TOEFL questions. So, it can be concluded that the generated questions have the following characteristics: the quality of the results follows the data training from the historical TOEFL questions, and the quality of the distractors is very good because it is derived from the heuristics of human experts.

© 2019 Tim Pengembang Jurnal UPI

ARTICLE INFO
Article History:
Submitted/Received 25 Mar 2019
First revised 25 May 2019
Accepted 05 Jul 2019
First available online 07 Jul 2019
Publication date 01 Sep 2019

Keywords: TOEFL, Automatic question generation, Natural Language Processing, Machine Learning, K-Nearest Neighbor, Education, Learning

INTRODUCTION
Educational evaluation is a process of describing, obtaining, and presenting useful alternatives in learning (Stufflebeam). One of the forms of evaluation that requires quality questions is the Test of English as a Foreign Language (TOEFL). It is the most well-known test in the field of ELT (English Language Teaching) (Alderson and Hamp-Lyons). There are several sections of the TOEFL, namely (1) listening comprehension, (2) structure and written expression, and (3) reading comprehension. In the structure and written expression section, the parameter to be tested relates to the understanding of grammar in sentences. Two types of questions in this section are known in the TOEFL test: sentence completion and error detection.
In the first type, the task is to fill in the blank in a sentence, while in the second the task is to choose the underlined word that is incorrect in the sentence. Then, because TOEFL questions need to be updated on a regular basis with the latest topics and in large numbers, it would be helpful if qualified TOEFL questions, especially of the sentence-completion type, could be produced automatically. By using existing techniques in Machine Learning, the quality of the generated questions can be kept in accordance with the standards of previous TOEFL questions. Nilsson explained that machine learning is a field of science that makes a machine or computer smart, and machine learning is important here to make the process simpler. Among the machine learning methods, one of the most well-known algorithms is KNN (a machine learning algorithm for classification). Then, as is well known, NLP, one of the techniques for processing data, can help to perform text processing; it is a research and application area that explores how computers can be used to understand and manipulate text (Chowdhury).

This study focused on generating sentence-completion questions, generated from news articles using a combination of the following techniques: NLP, KNN, and heuristics. The proposed system involving these methods takes articles that have good English grammar as input data and then produces selected sentence-completion questions with their answers.

Some related works can be found in the literature. For example, research conducted by Aldabe et al. introduced ArikIturri, an application for generating fill-in-the-blank questions using NLP combined with corpora, considering morphological and syntactic aspects. Text2Test, proposed by Aquino et al., utilized text processing, scoring, and question over-generation to build questions. Araki et al. generated multiple-choice questions in the subject of biology, where questions are generated using question templates in the wh-question format. A learning management system embedding automatically generated examination papers was proposed by Cen et al. Goto et al. proposed a technique, with its evaluation, for generating multiple-choice cloze questions on English grammar and vocabulary.

MATERIALS AND METHODS
The Method for Generating Sentence-Completion Questions
As shown in Figure 1, the computational model for generating questions can be divided into two processes: a learning step and a testing step, which involve different data sets (i.e., data training and data testing). The first is used to generate templates of questions from historical TOEFL data as data training, while the second uses news articles as data testing to provide question candidates. However, both stages consist of the same processes: inputting data, pre-processing with regex, tokenization, POS tagging with Stanford CoreNLP, calculating values according to the defined features, and converting categorical into numerical values. After that, the results from both stages are fed into KNN for determining a word position as the blank. Some heuristics are defined to select reasonable answers for distraction.
After completing these processes, we obtain full questions with optional answers. These processes are explained in detail in the following sections.

Figure 1. Flow model of the question generator system: data gathering, pre-processing and tokenization, POS tagging with Stanford CoreNLP, feature definition and calculation, conversion of categorical into continuous data, blank-position determination with KNN, and determination of answer options for distraction, for both the learning and testing steps.

Data Gathering for Training and Testing
There are two sets of data required in this system: one for training and one for testing. The data training is taken from historical TOEFL questions as a reference in determining the blank position of a question. Data training is required so that the quality of the generated questions can be maintained according to the quality of previous TOEFL questions. In this research experiment, the data training was taken from the following books: TOEFL Exam Success from Learning Express (Chesla); The Official Guide to the TOEFL Test, Fourth Edition (ETS); Peterson's Master TOEFL Vocabulary (Davy and Davy); Longman Complete Course for the TOEFL Test: Preparation for the Computer and Paper Tests (Phillips); TOEFL Practical Strategy for The Best Scores (Pardiyono); Easy TOEIC: Test of English for International Communication (Riyanto); and Easy TOEFL (Riyanto, 2011).

On the other hand, the data testing, taken from news articles on internet sites that are believed to have good grammar quality, is used as the source of question candidates. In other words, all the sentences on the selected news sites can become question candidates. In this research experiment, we picked articles from the following news sites: Ars Technica (http://arstechnica.com), BBC News (http://bbc.com), Bloomberg (http://bloomberg.com), NBA (http://global.nba.com), Forbes (http://forbes.com), People (http://people.com), Reuters (http://reuters.com), The Guardian (http://theguardian.com), The Star (http://thestar.com.my), and VOA News (http://voanews.com).

Data Processing
Pre-processing was done on the two types of datasets (i.e., data training and data testing). The first stage is the removal of punctuation; all punctuation marks other than dots and underscores are removed. A dot is used as a marker or separator between sentences, whereas an underscore is used to mark the blank position during feature extraction. The other pre-processing stage is tokenization, which divides one sentence into individual words. This stage is necessary to simplify the process of part-of-speech tagging and feature extraction in the next stage. For example, given the complete sentence "One of the most popular Indonesian products is Batik, it has been internationally recognized", after these processes we obtain the following sequence: "One | of | the | most | popular | Indonesian | products | is | Batik | it | has | been | internationally | recognized". Thus, the complete sentence is separated word by word. These processes are applied to all data.
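A minimal Python sketch of this pre-processing and tokenization step is given below; it illustrates the idea only and is not the authors' actual code, and the function names are ours.

import re

def preprocess(sentence):
    # Keep word characters, whitespace, dots (sentence separators) and
    # underscores (blank-position markers); drop every other punctuation mark.
    return re.sub(r"[^\w\s._]", "", sentence)

def tokenize(sentence):
    # Split the cleaned sentence into words; a trailing dot is stripped
    # from the last token so that only the word itself remains.
    return [tok.rstrip(".") for tok in preprocess(sentence).split() if tok.rstrip(".")]

example = ("One of the most popular Indonesian products is Batik, "
           "it has been internationally recognized.")
print(tokenize(example))
# ['One', 'of', 'the', 'most', 'popular', 'Indonesian', 'products',
#  'is', 'Batik', 'it', 'has', 'been', 'internationally', 'recognized']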
Part of Speech (POS) Tagging by Stanford CoreNLP
At this stage, every word goes through the part-of-speech (POS) tagging process to get information about its word class, which is then needed for feature extraction. There are many computational linguistic toolkits; the one used in this study is Stanford CoreNLP, which is available at https://stanfordnlp.github.io/CoreNLP/. It is a toolkit created for research purposes in the NLP field (Manning et al.). There are 8 English word classes, namely noun, verb, pronoun, preposition, adverb, conjunction, adjective, and article. Moreover, there is a popular and commonly used tag set, the Penn Treebank tag set (Marcus et al., 1993). The result of POS tagging on the sentence from the pre-processing example is "CD|IN|DT|RBS|JJ|JJ|NNS|VBZ|NNP|PRP|VBZ|VBN|RB|VBN", where CD means cardinal number, while IN, DT, RBS, JJ, NNS, VBZ, NNP, PRP, VBN, and RB mean preposition, determiner, superlative adverb, adjective, plural noun, third-person singular present verb, proper noun, personal pronoun, past-participle verb, and adverb, respectively. Thus, each word has its own part-of-speech label, which facilitates the process of generating questions in the next stages.

Separation and Selection of Sentences from News Articles
Basically, there are two steps in this section: separation and selection. The first is the process of separating the sentences in one long news text. It is done to make question construction easier, since a question usually contains only one sentence. So, one large text containing many sentences is separated from dot (.) to dot (.). The separation is done with a regex command: it searches for the mentioned punctuation and then applies the split function at every punctuation mark found. At the selection stage, sentences are selected to simplify and shorten the classification process. The selection considers two conditions: (1) sentences consist of 10 to 30 words (this requirement was discussed with the expert), and (2) sentences are then chosen randomly from those satisfying the first condition. Based on these two requirements, the sentences from the news that become question candidates are expected to be more suitable as question items; a minimal sketch of this step is given below.
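The following short Python sketch illustrates the splitting and filtering described above; the regex split, the 10-30 word bound, and the random choice follow the stated conditions, while the helper names and the number of candidates drawn are only illustrative assumptions.

import random
import re

def split_sentences(article_text):
    # Separate one long news text into sentences at every dot.
    return [s.strip() for s in re.split(r"\.", article_text) if s.strip()]

def select_candidates(article_text, n_candidates=2, seed=None):
    # Keep only sentences of 10 to 30 words, then pick question
    # candidates at random from the eligible sentences.
    sentences = split_sentences(article_text)
    eligible = [s for s in sentences if 10 <= len(s.split()) <= 30]
    random.seed(seed)
    return random.sample(eligible, min(n_candidates, len(eligible)))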
Determination of Seven Features on Data Sets
This stage is the process of determining the features to be used as word attributes for the blank-position classification. These features are important because they facilitate the classification process. Seven features are used, as proposed in this study following (Hoshino and Nakagawa):
Word: a column containing the words of the sentence that have passed through the tokenization process in the system.
POS: a column filled with the POS tag of the word in the first column of that row; it is auto-filled using the Stanford CoreNLP library.
Prev_POS: a column containing the POS tag of the previous word in the sentence; like the other POS columns, it is auto-filled using the Stanford CoreNLP library.
Next_POS: a column containing the POS tag of the next word in the sentence, also filled automatically using the Stanford CoreNLP library.
Position: a column filled with a number giving the position of the word in the sentence; for example, if a sentence has 10 words, the positions run from 1 to 10, and the position column of the first word contains 1.
Sentence: a column containing the number of words in the sentence; if a sentence contains 10 words, this column contains the number 10 for every word row, from the first word to the last.
Word-Length: a column filled with the number of times a word is repeated in one sentence; for example, if a sentence contains the word "of" twice, this word-length column contains the number 2 in the rows of that word.
Moreover, Target is the output feature marking whether the word is the blank position.

Value Calculation of Seven Features on Data Sets
This is the process of collecting the seven feature values for the data sets. It takes advantage of the processes done before: the POS feature, the POS of the next word, and the POS of the previous word are taken from the tokenization and POS tagging results. In addition, the position feature uses a count function on each sentence to know the order of the word positions in the sentence. The count function is also used to calculate how many words are in a single sentence, and the result becomes the value of the sentence feature. Furthermore, the word-length feature also utilizes the count function, but the words in one sentence must first be compared, so the word length of a word increases if the word is repeated several times in one sentence. Meanwhile, the target determines the classification of the word. The target in the data training is obtained automatically by the system by detecting whether there is an underscore (_) at the end of a word: if there is an underscore, the word is the blank position in the sentence, meaning the target is TRUE. An example of the collected values of these seven features is given in Table 1, for the sentence "One of the most popular Indonesian products is Batik, it has been internationally recognized" from the data training.

Table 1. Example of data training with the seven feature values (columns: Words, POS, Prev_POS, Next_POS, Position, Sentence, Word-Length, and Target; exactly one row, the blank position, has Target = TRUE).
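The collection of these feature values can be sketched as follows in Python. NLTK's pos_tag is used here only as a self-contained stand-in for Stanford CoreNLP, since both produce Penn Treebank tags, and the dictionary keys simply mirror the columns of Table 1; this is an illustration of the procedure, not the authors' implementation.

import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

def extract_features(tokens):
    # tokens: the words of one sentence; in the training data a trailing
    # underscore marks the blank position (e.g. 'recognized_').
    words = [w.rstrip("_") for w in tokens]
    tags = [tag for _, tag in nltk.pos_tag(words)]
    rows = []
    for i, word in enumerate(words):
        rows.append({
            "Word": word,
            "POS": tags[i],
            "Prev_POS": tags[i - 1] if i > 0 else None,
            "Next_POS": tags[i + 1] if i < len(words) - 1 else None,
            "Position": i + 1,                 # order of the word in the sentence
            "Sentence": len(words),            # number of words in the sentence
            "Word_Length": words.count(word),  # how often the word repeats
            "Target": tokens[i].endswith("_"), # TRUE for the blank position
        })
    return rows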
Converting Categorical Data into Continuous Data
KNN is a distance-calculation algorithm, so it necessarily requires numerical data to find the closest distances and determine the target. Therefore, the categorical data must be converted into numerical data. The categorical data among the seven features in this research are the word and the part-of-speech columns; the part-of-speech columns are converted into continuous data using the following equations:

S_i = 100 / (P_i,max - P_i,min)
C_i = -P_i,min
V = (S_1 * x_1 + C_1, S_2 * x_2 + C_2, S_3 * x_3 + C_3)

where:
- S_i is the scaling value used to initialize the i-th categorical feature;
- 100 is the range, which can be changed and determined as needed;
- P_i is the set of classes of the i-th categorical feature, with P_1 = part of speech, P_2 = part of speech of the previous word, and P_3 = part of speech of the next word, and P_i,max and P_i,min are its largest and smallest class indices;
- x_i is the index of the class of the i-th categorical feature;
- V is the categorical data vector after it is converted to numeric values.

POS tags are initialized into numbers for easy calculation. The initialization is based on the proximity between tags: the closer two tags are, the more similar they are, as shown in Table 2.

Table 2. Initialization values of POS tags.
1 = CC, 2 = CD, 3 = DT, 4 = EX, 5 = FW, 6 = IN, 7 = JJ, 8 = JJR, 9 = JJS, 10 = LS, 11 = MD, 12 = NN, 13 = NNS, 14 = NNP, 15 = NNPS, 16 = PDT, 17 = POS, 18 = PRP, 19 = PRP$, 20 = RB, 21 = RBR, 22 = RBS, 23 = RP, 24 = SYM, 25 = TO, 26 = UH, 27 = VB, 28 = VBD, 29 = VBG, 30 = VBN, 31 = VBP, 32 = VBZ, 33 = WDT, 34 = WP, 35 = WP$, 36 = WRB.

Since the three categorical features have the same classes, the scaling values are S_1 = S_2 = S_3 = 100 / (36 - 1) = 2.86 and C_1 = C_2 = C_3 = -1. For example, the conversion of the word "one" from Table 1 (POS = CD with index 2, no previous word, Next_POS = IN with index 6; a missing tag is valued 0) is:

V = (2.86 * 2 - 1, 0, 2.86 * 6 - 1) = (4.72, 0, 16.16)

The same calculation is performed for all the words in one sentence, giving results such as those shown in Table 3, and it is applied to all data, both data training and data testing.

Table 3. The example of value conversion of Table 1, with the POS, Prev_POS, and Next_POS columns replaced by their continuous values.

Table 4. Example of the seven feature values in the data testing, for the sentence beginning "Industry aggregate averages are calculated starting with market-cap weighted average ...".

Determination of Blank Position with KNN
This stage determines the target of each word in the data testing. The targeting is done by comparing every word of a data-testing sentence with the words in the data training. For each word, the distance over its feature values is calculated to determine the target classification, i.e., whether the word becomes the blank position in the sentence. The distance calculation in the KNN stage uses the Euclidean distance formula,

d(p, q) = sqrt( sum_j (p_j - q_j)^2 ),

so the distance between a word from the data training and a word from the data testing can be computed over their numerical feature values. For example, taking Table 4 as the data testing to be compared with the data training of Table 3, the distance between the word "One" (training) and the word "Industry" (testing) is d = 33.99. The calculation is performed for all the data testing against the data training, so that the k nearest distances are obtained for each word. As an example, with k = 3, the majority target among the three closest training words becomes the target of the data-testing word: if two of the three targets are FALSE, the target of the data-testing word is FALSE as well. From these targets, the blank position can be determined: a word whose target is TRUE in a sentence becomes the blank position.
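A compact Python sketch of the conversion and of this KNN step follows; it is illustrative only, the tag index table is abbreviated, and the function names are ours. Each candidate word's numeric feature vector is compared with every training row using the Euclidean distance, and the majority target among the k nearest rows decides whether that word becomes the blank.

import math
from collections import Counter

# Tag indices follow the initialization of Table 2 (1 = CC ... 36 = WRB);
# only a few tags are listed here for brevity.
TAG_INDEX = {"CC": 1, "CD": 2, "DT": 3, "IN": 6, "JJ": 7, "NN": 12,
             "NNS": 13, "NNP": 14, "PRP": 18, "RB": 20, "RBS": 22,
             "VBN": 30, "VBZ": 32}
S = 100 / (36 - 1)          # scaling value: 100 / (P_max - P_min)

def tag_to_number(tag):
    # Convert one POS tag to its continuous value (0 when there is no tag).
    return S * TAG_INDEX[tag] - 1 if tag in TAG_INDEX else 0.0

def to_vector(row):
    # Numeric vector of the input features of one word (Target is the label).
    return [tag_to_number(row["POS"]), tag_to_number(row["Prev_POS"]),
            tag_to_number(row["Next_POS"]), row["Position"],
            row["Sentence"], row["Word_Length"]]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_is_blank(test_row, training_rows, k=3):
    # Majority vote of the k nearest training words decides the target.
    distances = sorted((euclidean(to_vector(test_row), to_vector(r)), r["Target"])
                       for r in training_rows)
    votes = Counter(target for _, target in distances[:k])
    return votes.most_common(1)[0][0]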
Determination of Heuristics and Distractors
This stage is the process of heuristic determination to produce distractors. The heuristics are structured for the purpose of producing qualified distractors. Rules for generating distractors are defined for the following word classes: verb, preposition, pronoun, modal, determiner, conjunction, wh-pronoun, wh-determiner, wh-possessive, and wh-adverb. For example, verbs have several types of tags, namely VB, VBD, VBG, VBN, VBP, and VBZ. Distractors for a verb are taken from an online English dictionary through the Ultralingua Application Programming Interface (API), which generates all possible forms of a verb through a feature called "verb conjugation". After determining the heuristics, the last stage is to generate the incorrect answers. Since a question of the sentence-completion type has four options with one correct answer and three wrong answers, we need to choose three distractors. Therefore, if the POS tag of the true answer is VBZ, the distractors can be the corresponding verb forms with the POS tags VBD, MD VB, and VB. For example, in the question:

The earth spins on its axis and ___ 23 hours, 56 minutes and 4.09 seconds for one complete rotation.
A. Needed
B. Will need
C. Need
D. Needs
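The verb rule can be illustrated with a small Python sketch. The conjugation table below is a hand-made stand-in for the response of the Ultralingua dictionary API, so its content and the function names are purely illustrative.

# Illustrative only: in the actual system the verb forms come from the
# Ultralingua API; here a tiny hand-made table stands in for its response.
CONJUGATIONS = {
    "need": {"VB": "need", "VBZ": "needs", "VBD": "needed",
             "VBG": "needing", "VBN": "needed", "MD VB": "will need"},
}

# Heuristic rule: for a correct answer tagged VBZ, offer the same verb
# with other tags as distractors (compare the example question above).
DISTRACTOR_TAGS = {"VBZ": ["VBD", "MD VB", "VB"]}

def build_options(lemma, answer_tag):
    forms = CONJUGATIONS[lemma]
    correct = forms[answer_tag]
    distractors = [forms[tag] for tag in DISTRACTOR_TAGS[answer_tag]]
    return [correct] + distractors

print(build_options("need", "VBZ"))
# ['needs', 'needed', 'will need', 'need']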
RESULTS AND DISCUSSION
Experimental Design
At the experimental stage, the system implements the model that has been created and produces TOEFL questions of the sentence-completion type. The number of generated questions is 50, consisting of 30 questions regenerated from the data training and 20 questions generated from 10 different news articles. As mentioned earlier, there are 10 news websites with different topics used as data testing, listed in Table 5. After conducting the experiments, three kinds of analyses are done, as follows:

Same blank position analysis: this analysis proves the accuracy of the system in choosing the blank position. For the questions regenerated from the data training, the number of items whose blank position is the same as in the original question is counted in order to get the level of accuracy.

Consistency of the answers analysis: this analysis involves experts answering the generated questions. Two experts answer the questions, and their answers are checked against the provided answer key.

Evaluation and analysis of the quality of the questions by the experts: the generated questions are evaluated by two experts in order to determine their quality. The assessment given by the experts is based on four metrics proposed by Araki et al., as follows:
Grammatical Correctness (GC): it determines whether a question is syntactically well formed. The researchers use a three-point scale based on the number of grammar errors (errors caused by the distractors are not counted): 1 means the question has no grammatical errors, 2 means the question has 1 or 2 grammatical errors, and 3 means the question has 3 or more grammatical errors.
Answer Existence (AE): it identifies whether the answer to a question can be inferred from the related part of the question. The researchers use a two-point scale: 1 means the answer can be inferred from the question, and 2 means the answer cannot be inferred from the question.
Distractor Quality (DQ): it measures how plausible the distractors among the four options are. The researchers use a two-point scale: 1 means a distractor can easily be identified as a wrong answer, and 2 means a distractor is feasible.
Difficulty Index (DI): it assesses how difficult the generated question is, determined from the overall aspects of both the question and the distractors. The researchers use a three-point scale: 1 means the generated question is considered easy, 2 means it is considered sufficient, and 3 means it is considered very difficult.

Experimental Results
After executing the proposed system as explained in the previous section, we obtained 50 generated questions. Table 6 contains some of the questions that have been generated; the correct answer of each item is highlighted.

Table 5. News sites data used in the study (topic, news title, and source).
1. Ars Technica (Technology): "As AI advances rapidly, More Human Than Human says 'Stop, let's think about this'" (https://arstechnica.com)
2. BBC News (News): "Surabaya church attacks: One family responsible, police say" (http://www.bbc.com)
3. Bloomberg (Politics): "Shadow Cast Over Peace Talks as Fighting Flares in South Sudan" (https://www.bloomberg.com)
4. NBA (Sport): "Drummond Has Another Big Performance to push Pistons past Knicks" (http://global.nba.com)
5. Forbes (Technology): "3 Reasons Why Fortnite's Comet Could Actually Destroy Tilted Towers" (https://www.forbes.com)
6. People (Food): "Wedding Registry Items Worth Asking For" (http://people.com)
7. Reuters (Business): "Citigroup profit beats on strength in consumer banking, equity trading" (https://www.reuters.com)
8. The Guardian (Life Style): "Reading rooms: the story of an author's house" (https://www.theguardian.com)
9. The Star (Education): "Education is a right for all" (https://www.thestar.com.my)
10. VOA News (Science): "NASA InSight Mission to Mars" (https://www.voanews.com)

Discussion
In this section, the results obtained are analyzed based on the experimental design described in the previous section, namely the same blank position analysis, the consistency of the answers analysis, and the evaluation and quality analysis of the questions by the experts.

The Same Blank Position Analysis
As explained previously, this analysis checks the accuracy of the blank position of the questions regenerated from the data training against the original questions.
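Both this analysis and the consistency-of-answers analysis below reduce to a simple agreement percentage, which can be computed with a trivial helper such as the following sketch (ours, not the authors' script).

def agreement(values_a, values_b):
    # Percentage of items on which two lists agree, e.g. the blank position
    # of a generated question vs. the original TOEFL question, or an
    # expert's answers vs. the answer key.
    matches = sum(1 for a, b in zip(values_a, values_b) if a == b)
    return 100.0 * matches / len(values_a)

# Example: 21 of 30 regenerated questions keep the original blank position.
generated = [1] * 21 + [0] * 9
original = [1] * 30
print(round(agreement(generated, original)))  # 70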
Table 7 shows the result of the same blank position analysis. The number 1 indicates that the blank position in the question generated by the system is equal to the blank position in the original question. There are 21 out of 30 questions, or 70%, with the same blank position. These results indicate that some questions generated by the system still have different blank positions. The difference may be caused by a smaller distance to selected tags whose majority of targets is TRUE, so that the obtained blank position is not the same as in the original question.

Table 6. Results: examples of the 50 questions and answers generated by the proposed system, such as "Some hangars, buildings used to ___ large aircraft, are so tall that rain occasionally falls from clouds that form along the ceilings", "Dairy farming is ___ leading agricultural activity in the United States", and "Weathering is the action ___ surface rock is disintegrated or decomposed"; in each item the correct answer is marked.

Consistency of the Answers Analysis
As mentioned previously, this analysis checks whether the experts answer the questions in accordance with the answer key generated by the system. The experts answered the 30 questions from the data training and the 20 questions from the data testing. As shown in Table 8, the two experts answered the questions from the data training correctly in different amounts: Expert 1 (E1) answered 25 out of 30 questions correctly, a percentage of 83.3%, while Expert 2 (E2) answered 21 questions correctly, or 70%. Whereas, as shown in Table 9, of the 20 questions generated from the data testing, Expert 1 answered 19 questions according to the answer key, a percentage of 95%, while Expert 2 answered only 16 questions according to the answer key, or 80%. Based on the results of these two experts, it can be concluded that not all questions have a good quality of answer or good distractors. This is shown by the differences between the experts' answers and the answer key. The differences can be caused by ambiguity in the sentence, or by a distractor that produces two correct answers. From this assessment, the average consistency of the answers is 81%.

Table 7. The same blank position analysis (question number and whether its blank position equals that of the original question).

Table 8. The analysis by experts on data training (fitting step): question number, answer key, and the answers of Expert 1 and Expert 2.

Question Evaluation and Quality Analysis by Human Experts
Based on the experimental design, the quality of the questions was evaluated with four metrics: grammatical correctness (GC), answer existence (AE), distractor quality (DQ), and difficulty index (DI). The results of the evaluation were assessed by the experts, with average assessment indices of about 1.0 for grammatical correctness, 1.09 for answer existence, 1.71 for distractor quality, and slightly above 1 for the difficulty index.
Then, the quality of the questions is categorized into five categories: very good (between 80 and 100%), good (between 60 and 80%), enough (between 40 and 60%), less (between 20 and 40%), and very less (less than 20%). The results of these calculations are presented in Table 10.

Table 9. The analysis by experts on data testing (testing step): question number, answer key, and the answers of Expert 1 and Expert 2.

The Comparison with Previous Research
In this section, the model and implementation of this study are compared with previous studies of a similar type. There have been many studies related to question generation; some of them became references for this study in developing the system model, with regard to the algorithm, the problem attributes, and the evaluation of the questions. The comparison is shown in Table 11.

Table 10. The calculation results of each parameter (parameter, ideal value, average score, percentage, category, and explanation) for Grammatical Correctness (GC), Answer Existence (AE), Distractor Quality (DQ), Difficulty Index (DI), and their average; GC and AE obtain very low (good) scores, DQ a very high score, DI an "enough" score, and the overall average a high score.

Table 11. Comparison with other systems (reference, methodology, type of questions, language, and evaluation/analysis strategy).
- (Goto et al.): sentence completion with the blank position determined by a Conditional Random Field; English.
- (Susanti et al.): vocabulary questions with word selection from an article using WordNet; English.
- (Majumder and Saha): multiple-choice sentence completion using Parse Tree Matching; English.
- (Hill and Simha): multiple-choice fill-in-the-blank, determining the blank position using NER and choosing distractors using Google n-grams; English; evaluated by 67 native English-speaking volunteers who gave opinions on the given questions.
- (Pannu et al.): fill-in-the-blank questions using NER; English; evaluated by human experts with three assessment metrics, namely validity, key quality, and sentence quality.
- (Chen et al.): sentence selection using NLP; English; the generated questions are assessed by human experts.
Table 11 (continued). Comparison with other systems.
- (Agarwal et al.): cloze questions with sentence selection using NER; English; human experts determine whether the generated questions are applicable or not.
- (Papasalouros et al.): multiple-choice questions using ontology-based strategies such as class-based strategies; English; evaluated with three assessment metrics, namely pedagogical quality, linguistic correctness, and the number of generated questions, reviewed by two educational experts.
- (Huang and He): wh-questions using semantic and linguistic information; English; the difficulty index of the questions generated by the system is compared with that of questions generated by humans.
- (Hoshino and Nakagawa): fill-in-the-blank questions using KNN and Naive Bayes; English; evaluated by comparing the blank positions generated by KNN and by Naive Bayes.
- (Araki et al.): wh-questions compiled using available templates; English; evaluated with four quality metrics, namely Grammatical Correctness, Answer Existence, Distractor Quality, and Difficulty Index.
- This research: NLP techniques to process sentences, KNN to specify the TOEFL blank position, and heuristics to determine the distractors; sentence completion; English; evaluated using the same blank position analysis, the consistency of the answers, and the four assessment metrics of question quality.

CONCLUSION
After conducting this research, we can draw the following conclusions. (1) This research succeeded in building a computational model to produce TOEFL questions automatically using Natural Language Processing, the k-Nearest Neighbor algorithm, and heuristics. (2) Basically, the system contains two main processes: learning and testing. Both stages consist of inputting data, pre-processing with regex, tokenization, POS tagging with Stanford CoreNLP, calculating values according to the defined features, and converting categorical into numerical values. After that, the results from both stages are fed into KNN for determining the word position to be used as the blank, and some heuristics are defined to choose reasonable dummy answers for distraction. (3) The results of the question evaluation showed that the generated questions have excellent quality, with a percentage of 93% after the analysis by the experts, 81% consistency of the answers, and 70% of the questions having the same blank position. (4) Based on the results and analyses, this study contributes a tool for generating sentence-completion TOEFL questions automatically from news articles.

AUTHORS' NOTE
The authors declare that there is no conflict of interest regarding the publication of this article. The authors confirm that the data and the paper are free of plagiarism.

REFERENCES