Available online at https://icsejournal.com/index.php/JCSE
Journal of Computer Science and Engineering (JCSE), e-ISSN 2721-0251, Vol. No. February 2025, pp.

Algorithms for Question Answering on Factoid Questions

Raihan Pambagyo Fadhila1*, Detty Purnamasari2
1,2 Universitas Gunadarma, Jl. Margonda Raya, Depok 16424, Indonesia
1 raihanpambagyo@gmail. 2 detty@staff.
* corresponding author

ARTICLE INFO
Article History: Received February 6, 2025; Revised April 18, 2025; Accepted April 21, 2025
Keywords: Natural Language Processing, Question Answering, BERT, Sequence to Sequence, GPT
Correspondence e-mail: raihanpambagyo@gmail.

ABSTRACT
The development of transformer-based natural language processing (NLP) has brought significant progress in question answering (QA) systems. This study compares three main models, namely BERT, Sequence-to-Sequence (S2S), and Generative Pretrained Transformer (GPT), in understanding and answering context-based questions using the SQuAD 2.0 dataset translated into Indonesian. The research follows the SEMMA (Sample, Explore, Modify, Model, Assess) method to ensure the analysis process runs systematically and efficiently. The models were tested with exact match (EM), F1-score, and ROUGE evaluation metrics. Results show that BERT excels with an exact match score of 99.57%, an F1-score of 99.57%, a ROUGE-1 of 97%, a ROUGE-2 of 30%, and a ROUGE-L of 97%, outperforming the S2S and GPT models. This study shows that BERT is more effective in understanding and capturing Indonesian context in QA tasks. The research offers guidance for the implementation of Indonesian-language QA and can serve as a reference in the development of more accurate and efficient NLP systems.

Introduction
The advent of Transformer architectures has revolutionized Natural Language Processing (NLP) by introducing self-attention mechanisms that effectively capture both short- and long-range dependencies in text. Efficiency-focused techniques such as Low-Rank Adaptation (LoRA)
and hardware-aware optimization strategies for GPUs have further accelerated large-scale model training and inference. Quantization methods optimize model deployment on edge devices, while comprehensive surveys highlight challenges in complex knowledge-based Question Answering (QA), including domain adaptation and multi-hop reasoning. Bidirectional Encoder Representations from Transformers (BERT) demonstrated state-of-the-art performance across language understanding tasks through bidirectional pre-training, and Generative Pre-trained Transformers (GPT-3) enabled zero- and few-shot learning paradigms in text generation. Sequence-to-Sequence (S2S) models excelled in machine translation and summarization, and multilingual adaptations such as mBART extended these capabilities across languages. T5-based approaches have also shown efficacy in closed-book QA settings. Among benchmarks, the Stanford Question Answering Dataset (SQuAD) 2.0 challenges models to answer extractive factoid questions while identifying unanswerable queries. Complementary benchmarks such as Natural Questions emphasize real-world applicability, and HybridQA integrates tabular and textual data for multi-modal reasoning. Despite these advances, QA research in low-resource languages like Indonesian is limited by the scarcity of large-scale labeled datasets. To address this gap, we translated SQuAD 2.0 into Indonesian using Opus-MT. We then evaluate three Transformer-based approaches (BERT, GPT-3, and S2S) under both full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) with LoRA. Experiments follow the SEMMA (Sample, Explore, Modify, Model, Assess) methodology for systematic data handling and leverage an NVIDIA DGX Station A100 for computational benchmarking. Model performance is assessed using Exact Match and F1-score, sequence-level metrics (ROUGE-1, ROUGE-2,
ROUGE-L), and efficiency indicators (training time, GPU utilization, and energy consumption). In Indonesian NLP, resources such as IndoLEM and IndoBERT have enabled local model evaluation, while NusaX provides multilingual sentiment annotations. Adaptations of BERT for Indonesian QA achieved significant improvements, and hybrid BERT-GPT systems show promise for dialog tasks. Our work extends these contributions by offering a unified benchmark of three architectures, providing insights into accuracy-efficiency trade-offs in a low-resource setting. Our evaluation shows that BERT, when fully fine-tuned on the Indonesian SQuAD 2.0 dataset, achieves an Exact Match of 99.57%, an F1-score of 99.57%, and ROUGE scores of 97% (ROUGE-1), 30% (ROUGE-2), and 97% (ROUGE-L), significantly outperforming both S2S and GPT-3 under identical conditions. The main contributions of this paper are: (1) a publicly available Indonesian translation of SQuAD 2.0 via Opus-MT; (2) a comprehensive comparison of BERT, GPT-3, and S2S models under full fine-tuning and PEFT regimes; (3) empirical insights into accuracy-efficiency trade-offs, supported by QA and sequence-level metrics and by DGX Station A100 resource usage data. By uniting methodological rigor with practical evaluation, this study advances QA research for Indonesian and other low-resource languages.

Method
The research uses the SEMMA (Sample, Explore, Modify, Model, Assess) method, a systematic approach designed for efficient data analysis. It involves five stages: first, the Sample stage, where a representative data sample is selected. Next is the Explore stage, which examines data patterns, relationships, and characteristics. In the Modify stage, the data is processed and adjusted to meet the analysis requirements, including cleaning or transforming variables. The Model stage follows, where modeling techniques are applied to build a model that aligns with the research goals.
Finally, in the Assess stage, the model's performance is evaluated to ensure it provides valid and relevant results. This structured approach enables a reliable and thorough analysis.

Figure 1. SEMMA Methodology

Sample
The Indonesian SQuAD 2.0 dataset is split into training, validation, and test sets following established QA evaluation protocols.

Explore
Exploratory data analysis examines token distributions, answer-length statistics, and translation artifacts introduced by Opus-MT. Visualizations and statistical summaries leverage Transformer tokenizer utilities.

Modify
Preprocessing steps include text normalization (lowercasing, punctuation removal), subword tokenization (WordPiece or SentencePiece), and data augmentation techniques to mitigate answer class imbalance.

Model
Three Transformer-based models are fine-tuned: BERT, with full parameter updating; Sequence-to-Sequence (S2S), using an encoder-decoder setup; and GPT-3, leveraging autoregressive generation. Each model is also adapted using PEFT with LoRA, and we compare parameter count, memory footprint, and convergence speed.

Assess
Performance metrics include Exact Match and F1-score, ROUGE-1/2/L, and computational benchmarks (training time, GPU memory usage, energy consumption) on an NVIDIA DGX Station A100. Statistical significance is evaluated using paired t-tests.

Results and Discussion

Sample
At this stage, data collection begins by downloading the SQuAD 2.0 dataset from its official GitHub repository. The data comprise a total of 142,000 records, structured in the semi-structured JSON format.
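The nested layout just described can be illustrated with a short, dependency-free sketch that flattens a SQuAD-2.0-style record (data, paragraphs, qas, answers) into one row per question. The toy record and the flatten helper are illustrative assumptions, not the project's actual code.

```python
# Minimal SQuAD 2.0-style record; the real file nests thousands of
# articles the same way: data -> paragraphs -> qas -> answers.
squad_like = {
    "data": [{
        "title": "Indonesia",
        "paragraphs": [{
            "context": "Jakarta adalah ibu kota Indonesia.",
            "qas": [{
                "id": "q1",
                "question": "Apa ibu kota Indonesia?",
                "is_impossible": False,
                "answers": [{"text": "Jakarta", "answer_start": 0}],
            }],
        }],
    }],
}

def flatten(squad):
    """Flatten the nested SQuAD JSON into one record per question."""
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answer = (qa["answers"][0] if qa["answers"]
                          else {"text": "", "answer_start": 0})
                rows.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": paragraph["context"],
                    "answer_text": answer["text"],
                    "answer_start": answer["answer_start"],
                    "is_impossible": qa["is_impossible"],
                })
    return rows

rows = flatten(squad_like)
print(rows[0]["answer_text"])  # Jakarta
```

On the real dataset the same loop would run over the file loaded with json.load, and the flattened rows would feed the training/testing split.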
Following the successful download, the data are separated into two distinct sets: data intended for model training and data allocated for model testing.

Explore
This stage is characterized by an in-depth examination of the dataset, encompassing three distinct tasks: data analysis, visualization, and data quality validation. The primary objective of the data analysis task is to understand the structural elements and characteristics of the data that will be employed in the natural language processing (NLP) question answering (QA) model. The SQuAD 2.0 dataset serves as the training and testing foundation for the QA model, comprising a substantial collection of question-answer pairs accompanied by a context or reference text. Each entry in the dataset contains several elements of significance.

Table 1. Data Description
title: Title or topic of the article or text used for context.
paragraphs: A collection of paragraphs containing context and related questions.
qas: A collection of questions and answers related to the context of a particular paragraph.
question: The text of the question asked based on the context.
id: Unique identity for each question (and, at the paragraph level, for each paragraph).
answers: A collection of answers to questions that can be answered.
text: The text of the correct answer to a question.
answer_start: Index of the starting position of the answer text in the context column.
is_impossible: A binary label that indicates whether the question has an answer in the context or not.
context: A paragraph of text that contains relevant information to answer the question.

The dataset has undergone extensive statistical analysis of its primary components, namely context, question, and answer, to ascertain its characteristics. Table 2.
Statistical Analysis Results
Elements    Mean        Median      Max
Context     84 words    99 words    653 words
Questions   78 words    8 words     35 words
Answers     46 words    2 words     223 words

The visualization task gives a picture of the data distribution: the distribution of the contexts, questions, answers, and is_impossible labels is shown in a bar graph. The result of this stage is a data distribution chart.

Figure 2. Data Distribution Chart

In the data quality validation task, various checks are carried out to ensure that the data is of good and consistent quality. Validation is performed by checking for empty or null values and for duplicate values in the data. The output of this stage includes a summary of the total article, paragraph, and question data sizes, together with the results of the null-value and duplicate checks.

Modify
At this stage, data processing is executed to prepare the data for subsequent use. The process involves two parts: data construction, which transforms the data into a format suitable for the next stage, and data integration, which combines the constructed elements. In the data construction stage, the data is processed into a form that suits the needs of the model. The steps taken are raw data extraction, tokenization, combining tokens, conversion to token IDs, attention mask and padding, answer position calculation, and tensor creation.

Raw Data Extraction
This step reads the raw data from the JSON file and converts it into a data structure that is easier for the model to process. It is useful for understanding the original structure of the data, such as the question text and its context, as well as the location of the answer in the context. Table 3.
Raw Data Extraction
Question: Kapan Beyonce mulai populer?
Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say; lahir 4 September 1981) adalah seorang penyanyi, penulis lagu, produser rekaman dan aktris Amerika. Lahir dan dibesarkan di Houston, Texas, ia tampil di berbagai kompetisi menyanyi dan menari sebagai seorang anak, dan menjadi terkenal di akhir 1990-an sebagai penyanyi utama dari girl grup R&B Destiny's Child. Dikelola oleh ayahnya, Mathew Knowles, grup ini menjadi salah satu grup wanita terlaris di dunia sepanjang masa. Hiatus mereka melihat perilisan album debut Beyoncé, Dangerously in Love (2003), yang memantapkannya sebagai artis solo di seluruh dunia, memperoleh lima Grammy Awards dan menampilkan singel nomor satu Billboard Hot 100 "Crazy in Love" dan "Baby Boy".
Answer_text: di akhir 1990-an
Start_char / End_char: (character offsets of the answer span in the context)
Is_impossible: False

Tokenization
In the next step, the extracted question and context are processed into tokens that can then be used by the Transformer-based models. Each question and context is broken down into tokens using a tokenizer.

Table 4. Tokenization
Question: ['ka', '##pan', 'beyonce', 'mu', '##lai', 'pop', '##ule', '##r', '?']
Context: ['beyonce', 'gi', '##selle', 'knowles', '-', 'carter', '(', '/', 'bi', '##:', ...]

Combining Tokens
In the combining tokens step, the tokens from the previous process are merged to form a sequence that matches the input format of the Transformer model. After tokenization, the question tokens and the context tokens are combined into one token sequence. A special [CLS] token marks the beginning of the input, while [SEP] tokens separate the question from the context and mark the end of the input.

Table 5. Combining Tokens
['[CLS]', 'ka', '##pan', 'beyonce', 'mu', '##lai',
'pop', '##ule', '##r', '?', '[SEP]', 'beyonce', 'gi', '##selle', 'knowles', ...]

Convert to Token IDs
The next step converts the merged tokens into a numerical representation using the tokenizer vocabulary. This representation is needed because the model does not understand text directly; it processes data in the form of numbers (IDs) that match its vocabulary.

Table 6. Convert to Token IDs
[101, 10556, 9739, 20773, 14163, 19771, 3769, 9307, 2099, 1029, 102, 20773, 21025, 19358, ...]

Attention Mask and Padding
Following the conversion of the data into a numerical representation, the next step builds the attention mask. During this step, the model is informed which part of the input is relevant for processing and which part is padding. The attention mask is used to focus the model on important tokens, such as the context and questions, and to ignore additional tokens such as padding. This step creates a list of ones for the relevant input tokens and appends zeros up to the maximum required length.

Table 7. Attention Mask and Padding
Attention Mask (initial mask length)
Padding (after padding to the maximum length)

Calculation of Answer Position
The primary task in this step is to calculate the positions of the start token and end token of the answer within the given context. This is essential because the model locates the answer in the context by predicting these precise token positions. If a record is labeled is_impossible = true, the answer position is set to zero for both the start and the end position. Conversely, if is_impossible is false, the token positions of the answer are calculated from its character positions in the context. If the start or end position exceeds the maximum length, both positions are set to zero. Table 8.
Calculation of Answer Position
Start Position    End Position

Tensor Creation
In the tensor creation step, the previously processed data is converted into tensor format. This allows the model to consume the data and to exploit the computational efficiency of GPU processing. The conversion uses the PyTorch library; the elements converted into tensors are input_ids, attention_mask, start_position, and end_position.

Table 9. Tensor Creation
input_ids: tensor([101, 10556, 9739, 20773, 14163, 19771, 3769, 9307, 2099, 1029, 102, 20773, 21025, 19358, ...])
attention_mask: tensor([1, 1, 1, 1, 1, 1, 1, ...])
start_position: tensor(...)
end_position: tensor(...)

The next step is to ensure that the data processed through the previous stages can be combined and structurally integrated so that it is ready for the model training process. The data are integrated into a dictionary-shaped structure containing input_ids, attention_mask, start_position, and end_position.

Model
At this stage, the models are built from the dataset modified in the previous stage. The stage is divided into two parts: building the test scenarios and building the models.

Building Test Scenarios
The objective of constructing the test scenarios is to identify the most effective model configuration during training. In this study, the test scenario is executed five times for the learning_rate parameter, four times for the batch_size parameter, and five times for the epochs parameter. These runs vary the parameters believed to influence model performance: batch_size, learning_rate, and epochs.
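The staged scenarios just described (five learning-rate runs, then four batch-size runs carrying the best learning rate forward, then five epoch runs) can be sketched as follows. The candidate values and the train_and_eval stub are illustrative assumptions; in the real pipeline that function would fine-tune a model and return its F1-score.

```python
# Staged hyperparameter search: tune one parameter at a time, carrying
# the best value forward, mirroring the test scenarios described above.
def train_and_eval(learning_rate, batch_size, epochs):
    # Stub standing in for a full fine-tuning run; it simply peaks at
    # lr=2e-5, batch_size=32, epochs=3 so the sketch is deterministic.
    return (-abs(learning_rate - 2e-5) * 1e4
            - abs(batch_size - 32) / 100
            - abs(epochs - 3) / 10)

def staged_search(lrs, batch_sizes, epoch_counts, base_bs=8, base_epochs=2):
    best_lr = max(lrs, key=lambda lr: train_and_eval(lr, base_bs, base_epochs))
    best_bs = max(batch_sizes, key=lambda bs: train_and_eval(best_lr, bs, base_epochs))
    best_ep = max(epoch_counts, key=lambda ep: train_and_eval(best_lr, best_bs, ep))
    return best_lr, best_bs, best_ep

# Five learning rates, four batch sizes, and five epoch counts, matching
# the 5 / 4 / 5 scenario runs in the study (candidate values assumed).
best = staged_search([1e-5, 2e-5, 3e-5, 5e-5, 1e-4],
                     [8, 16, 32, 64],
                     [1, 2, 3, 4, 5])
print(best)  # (2e-05, 32, 3)
```

Tuning one parameter at a time keeps the number of training runs linear in the candidate counts rather than combinatorial, at the cost of possibly missing interactions between parameters.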
Learning Rate
The learning rate is a critical factor in determining the step size of the optimization algorithm used to update the model weights during training. To ensure the model achieves optimal performance, experiments were conducted with various values of the learning_rate parameter.

Table 10. Learning Rate Testing Scenario
Learning_rate    Batch Size    Epochs

Batch Size
The batch size is the number of samples processed at once before the model performs a parameter update. In the test scenario, to ensure the model achieves optimal performance, tests are conducted using various batch size values.

Table 11. Batch Size Testing Scenario
Batch Size    Learning Rate
              The learning rate with the highest result in the previous test

Epochs
Epochs refer to the number of full passes of the model through the training dataset. Insufficient epochs may result in underfitting, while excessive epochs may lead to overfitting. To ascertain the optimal model in the test scenario, experiments were conducted using various epochs values.

Table 12. Epochs Testing Scenario
Epochs    Learning Rate                                                      Batch Size
          The learning rate with the highest result in the previous test    The batch size with the highest result in the previous test

Infrastructure Preparation
The models were trained on NVIDIA DGX Station A100 hardware.

Table 13. Infrastructure Specifications
Component    Specification
System       NVIDIA DGX Station A100
GPU          1x NVIDIA A100 40GB Tensor Core GPU
CPU          AMD EPYC 7742 (64 cores, 2.25 GHz)
RAM          512 GB DDR4
Storage      7.68 TB NVMe SSD
OS           Ubuntu 20.04 LTS Server

BERT Testing Results
Parameter testing on the BERT model used the test scenarios described above, with a learning rate of 2e-5 and a batch size of 32. Table 14.
Epochs Testing Scenario, BERT
Epochs    Exact Match    F1-Score    Execution Time
                                     74 minutes
                                     07 minutes
                                     95 minutes
                                     22 minutes
                                     49 minutes

Sequence-to-Sequence Testing Results
Parameter testing on the Sequence-to-Sequence model used the test scenarios described above, with a learning rate of 5e-5 and a batch size of 8.

Table 15. Epochs Testing Scenario, Sequence to Sequence
Epochs    Exact Match    F1-Score    Execution Time
                                     80 minutes
                                     86 minutes
                                     50 minutes
                                     63 minutes
                                     76 minutes

GPT Testing Results
Parameter testing on the GPT model used the test scenarios described above, with a learning rate of 3e-5 and a batch size of 8.

Table 16. Epochs Testing Scenario, GPT
Epochs    Exact Match      F1-Score         Execution Time
          Out of Memory    Out of Memory    Out of Memory
          Out of Memory    Out of Memory    Out of Memory
          Out of Memory    Out of Memory    58 minutes
          Out of Memory    Out of Memory    Out of Memory
          Out of Memory

Assess
At this stage, an evaluation of the model built using the SEMMA (Sample, Explore, Modify, Model, Assess) method is conducted. The primary objective is to assess model performance using evaluation metrics appropriate to the research. The metrics employed are exact match (EM), F1-score, and ROUGE score, which gauge the similarity between the model's outputs and the reference answers. Two key tasks are carried out: evaluating the modeling results and reviewing the modeling process. Once model training is complete, testing is performed on the validation and testing datasets. In the ROUGE evaluation, each model is tested using ten factoid questions.

Table 17. ROUGE Evaluation Questions
Context: Indonesia adalah negara kepulauan yang terletak di Asia Tenggara dan Oseania, yang terdiri dari lebih dari 17.000 pulau.
Negara ini berbatasan dengan Malaysia di utara, Papua Nugini di timur, dan Australia di selatan. Indonesia memiliki populasi lebih dari 270 juta jiwa, menjadikannya negara dengan populasi terbesar keempat di dunia. Jakarta adalah ibu kota dan kota terbesar di Indonesia. Bahasa Indonesia adalah bahasa resmi negara ini. Indonesia dikenal dengan keanekaragaman budaya, bahasa, dan etnis yang sangat kaya.

Question                                                     Reference
Apa ibu kota Indonesia?                                      Jakarta
Indonesia terdiri dari berapa pulau?                         lebih dari 17.000
Indonesia memiliki populasi lebih dari berapa jiwa?          lebih dari 270 juta
Australia berbatasan dengan Indonesia di sebelah mana?       selatan
Papua Nugini berbatasan dengan Indonesia di sebelah mana?    timur
Indonesia adalah negara kepulauan yang terletak di mana?     Asia Tenggara dan Oseania
Jakarta adalah ibu kota negara apa?                          Indonesia
Apa nama kota terbesar di Indonesia?                         Jakarta
Malaysia berbatasan dengan Indonesia di sebelah mana?        utara
Apa bahasa resmi negara Indonesia?                           Bahasa Indonesia

The following are the model evaluation results using the ROUGE method.

Table 18. Model Evaluation Results
Model    Exact Match    F1-Score    ROUGE-1    ROUGE-2    ROUGE-L
BERT     99.57%         99.57%      97%        30%        97%
S2S
GPT

Conclusion
Based on the results of the research conducted, a comparative analysis of the BERT, S2S, and GPT algorithms on question answering for factoid questions was carried out using the SQuAD v2.0 dataset translated into Indonesian. From the test results, it can be concluded that the BERT (Bidirectional Encoder Representations from Transformers) model obtained the best results on the factoid question answering task with the translated SQuAD v2.0 dataset, compared with the S2S (Sequence to Sequence) and GPT (Generative Pretrained Transformer) models. This is evidenced by its exact match and F1-score values of 99.57% and by a fairly high ROUGE evaluation, with a ROUGE-1 value of 97%, a ROUGE-2 value of 30%, and
a ROUGE-L value of 97%. This indicates that the BERT model is more accurate in understanding the context and providing answers that match the given questions. This advantage can be attributed to the BERT architecture's bidirectional attention, which allows the model to capture the relationships between words in a sentence better than the other models can. Meanwhile, the Sequence-to-Sequence (S2S) and GPT models show relatively lower performance, especially in maintaining the accuracy of answers to factoid questions. The S2S model, which is commonly used in translation or text sequencing tasks, is less optimal at handling context-specific understanding. The GPT model, although superior at generating more natural text, has a weakness in maintaining answer precision because of its autoregressive nature, which can produce answers that do not align with the facts of the given context. The results of this study reinforce the understanding that the choice of model for a question answering task depends heavily on the type of question being asked. For factoid-based questions, the BERT model proved superior to the other approaches due to its ability to capture the relations between words more accurately.

References