JOIV: Int. J. Inform. Visualization, 8: IT for Global Goals: Building a Sustainable Tomorrow - November 2024, 1713-1719

INTERNATIONAL JOURNAL ON INFORMATICS VISUALIZATION

Enhanced Adverse Drug Event Extraction Using Prefix-Based Multi-Prompt Tuning in Transformer Models

Salisu Modi a,b,1, Khairul Azhar Kasmiran a,2, Nurfadhlina Mohd Sharef a, Mohd Yunus Sharum a
a Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Selangor Darul Ehsan, Malaysia
b Department of Computer Science, Sokoto State University, Sokoto, Nigeria
Corresponding authors: 1 gs63125@student., 2 k_azhar@upm.

Abstract: Extracting mentions of adverse drug events and the relationships between them is crucial for effective pharmacovigilance and drug safety surveillance. Recently, transformer-based models have significantly improved this task through fine-tuning. However, traditional fine-tuning of transformer models, especially those with many parameters, is resource-intensive and memory-inefficient, and it often leaves a gap between pre-training and downstream task-specific objectives. Soft prompting is a lightweight approach that updates a trainable prompt to guide task-specific fine-tuning, and it has shown performance comparable to traditional fine-tuning of large language models on simple tasks. However, its effectiveness on complex tasks such as token-based sequence labeling, which requires multiple predictions for a single input sequence, remains underexplored, particularly in multi-task settings. In addition, a single holistic prompt in a multi-task learning setting may be biased toward some subtasks, and some prompt tokens hurt model prediction. This study proposes a prefix-based multi-prompt soft tuning method with attention-driven prompt token selection for tuning transformer models on multi-task dual sequence labeling for concept and relation extraction.
We experimented with BERT and SciBERT models using frozen and unfrozen parameter strategies. Our approach achieved state-of-the-art performance on the n2c2 2018 and TAC 2017 datasets for adverse drug event extraction, with multi-prompt tuning in unfrozen models surpassing traditional fine-tuning. Moreover, it outperforms the largest clinical natural language processing model, GatorTron, on the n2c2 2018 dataset. This research highlights the potential of soft prompts in efficiently adapting large language models to complex downstream NLP tasks.

Keywords: Adverse drug event; fine-tuning; multi-prompt; multi-task; soft prompt tuning.

Manuscript received 5 Apr.; revised 10 Aug.; accepted 12 Sep. Date of publication 30 Nov. International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

I. INTRODUCTION

Adverse drug events (ADEs) refer to any harmful or unpleasant reactions that occur due to taking a medication. Detecting them at scale calls for automated approaches based on natural language processing (NLP). The dual nature of adverse drug event extraction, involving named entity recognition and relation extraction, makes it a challenging task. The earlier approaches were rule-based, machine-learning, and deep-learning methods. However, the natural language processing paradigm has recently experienced a rapid increase in performance due to the prevalence of large language models (LLMs). The de facto method of adapting LLMs has been model fine-tuning. This approach works similarly to traditional supervised learning, requiring much annotated data to train the model on downstream-specific tasks, and it follows a top-down train-test workflow with all tuned model parameters saved for each downstream task. Thus, it is time-consuming, memory-inefficient, and resource-intensive compared to prompting, especially for models with larger parameter counts.
Consequently, accurate extraction of adverse drug events is vital for pharmacovigilance studies. In addition, it is significant to the information retrieval research paradigm due to the dual nature of named entity recognition and relation extraction.

In the past decade, this task has been handled at different stages of drug usage, notably the pre-marketing and post-marketing stages. At the pre-marketing stage, the popular approach was clinical trials with volunteer patients, which seriously lacked sufficient volunteers. At the post-marketing stage, the spontaneous reporting system (SRS) was the earlier approach to collecting adverse drug event (ADE) cases from affected patients or clinicians, which suffers from underreporting.

In related work on prompt pruning, positive prompt tokens have been designated the winning tickets and negative tokens the losing tickets, in reference to the lottery ticket hypothesis. The importance of soft prompt tokens is defined as the expected sensitivity of model outputs to the mask variables: a larger score implies a token with a significant contribution, while a lower score implies a negative token with little or no contribution to model tuning. However, besides requiring trial training to obtain the optimal tokens, repeatedly pruning the soft tokens at various levels (token and piece level) to get the optimal soft tokens is resource-intensive, especially for large pre-trained language models (PLMs).

While soft prompt-tuning methods have shown promise and can match fine-tuning performance, some challenges remain. These approaches have mainly been tested on natural language generation tasks and large models using a single, holistic prompt, which may not be suitable for multitask learning with various objectives. Moreover, work on the impact of negative prompt tokens is limited, and prefix tuning is constrained by the fixed sequence length of LLMs, resulting in a limited number of trainable parameters.
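To give a sense of how small this trainable budget is, the following back-of-envelope comparison uses BERT-base figures (hidden size 768, roughly 110 million parameters); the soft prompt length of 20 tokens is an illustrative assumption, not a value from this study:

```python
# Back-of-envelope comparison of trainable parameters: full fine-tuning
# vs. soft prompt tuning with a frozen backbone. BERT-base figures;
# the prompt length of 20 tokens is an illustrative assumption.
HIDDEN_SIZE = 768               # BERT-base hidden dimension
BACKBONE_PARAMS = 110_000_000   # approximate BERT-base parameter count
PROMPT_LENGTH = 20              # hypothetical soft prompt length

# Only the prompt embeddings are trained when the backbone is frozen.
prompt_params = PROMPT_LENGTH * HIDDEN_SIZE
print(f"soft prompt parameters: {prompt_params:,}")                  # 15,360
print(f"fraction of backbone:   {prompt_params / BACKBONE_PARAMS:.4%}")
```

With a frozen backbone, the trainable parameters are several orders of magnitude fewer than in full fine-tuning, which is precisely why the fixed-sequence-length cap on prompt tokens matters.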
Deep prompt tuning also has limitations: it requires fixed prefix tokens at each layer and needs significant changes to the internal workings of the transformer layers. In summary, prompt-based learning has yet to be fully explored for multitask and sequence labeling tasks that require multiple predictions for a single input sequence, highlighting the need for further research in this area.

Prompt tuning methods connect LLMs' pre-training objectives with specific downstream task objectives through an additional prompt. Prompting is a technique to adapt LLMs in which additional tokens guide the model on downstream tasks. There are two types of prompts: hard prompts, which use non-trainable tokens, and soft prompts, which use trainable embeddings added to the input sequence. In prompt-based learning, different strategies are employed, including frozen, where LLM parameters are fixed, and unfrozen, where LLM parameters are updated during training. Despite its potential, prompting remains in its early stages. A significant performance gap exists compared to fine-tuning, especially for small models when frozen, and prompting has yet to be fully leveraged for complex natural language understanding tasks, such as sequence labeling tasks that require multiple predictions per input sequence and multitask learning scenarios.

To address these challenges, we propose a novel approach that utilizes multiple soft prompts, one for each task, with an attention-driven prompt token selection to optimize the prompt tokens. This multi-prompt soft prompt tuning method selectively highlights the most contributing prompt tokens, enabling more effective model adaptation to downstream tasks.

Extracting ADEs from the vast amount of unstructured clinical notes is highly significant in real-world settings, as it supports drug discovery and pharmacovigilance studies. Various approaches have been employed to improve this task. These include rule-based, machine learning, and deep learning methods
, as well as adapting large language models (LLMs) through fine-tuning.

Prompt learning has emerged as a preferred method for adapting LLMs due to the limitations of traditional fine-tuning, which can be cumbersome, memory-intensive, and resource-heavy, particularly for larger models. Prompt learning encompasses two approaches: prompting, which uses discrete tokens to query LLMs, as seen in the success of models like GPT, and prompt tuning, a more efficient method that adds trainable tokens (soft prompts) to the input sequence to guide the model's performance on specific tasks.

Research has explored prompt tuning for both fixed and adaptable models. One proposed approach involves inserting trainable tokens into various layers of a pre-trained model, including encoder and decoder layers, while keeping the model's parameters frozen. This technique, known as prefix tuning, was later expanded to deep prompt tuning, demonstrating its versatility across different model sizes and tasks. As an alternative, P-tuning was introduced, which inserts continuous prompts at various points among the input tokens, designed by human experts for specific tasks. In this approach, both the prompts and the initial model parameters are updated. Building on deep prompt tuning, a system has been developed that compares four learning strategies: fine-tuning, hard prompting, soft prompting with a frozen model, and soft prompting with an unfrozen model, using GatorTron clinical LLMs. Recent research proposes hierarchical structured prompt pruning based on the lottery ticket hypothesis to identify the winning tickets and eliminate the losing tickets when collecting trained prompt tokens.

II. MATERIALS AND METHOD

Datasets

The TAC 2017 dataset consists of 200 drug labels in XML format, divided into a training set of 101 labels and a test set of 99. The dataset features five attributes related to Adverse Drug Reactions (ADR): Animal, Drug Class, Factor,
Negation, and Severity. Additionally, the dataset includes three types of relationships: Effect (linking severity to ADR), Hypothetical (linking animal, drug class, or factor mentions to ADR), and Negated (linking negation or factor mentions to ADR). The second dataset, the n2c2 2018 dataset, derived from clinical narratives, was used for the adverse drug event extraction challenge. This dataset contains annotations for nine entities (drug, strength, form, dosage, frequency, route, duration, reason, and ADE entities) linked to a drug entity as their source, with eight possible relations between them. Our model was trained and evaluated using the official dataset splits of 303 training records and 202 testing records.

Multi-prompt-based Multi-task Soft Prompting of Large Language Models

Learning multiple related tasks simultaneously can lead to biased results if a single prompt is used to adapt LLMs. To overcome this limitation, we propose a novel approach that uses multiple prompts tailored to each task to guide the adaptation of LLMs and ensure more balanced and effective multi-task learning. Two task-specific prompt templates are generated, one for each task. The text prompts are converted into embedding vectors that can be fine-tuned. This process involves two steps: first, the text is broken down into subwords using a pre-trained tokenizer; then, the embedding layer of a pre-trained model is used to generate vector representations for both the input text and the soft prompt tokens, as in Equation 1:

Sp = We(T) and Ex = We(X)    (1)

where We is the embedding matrix of the model, Sp is the embedding of the soft prompt tokens T, and Ex is the embedding of the input sequence X. The soft prompt is added to the input embedding as a prefix specific to each task.
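A minimal sketch of Equation 1 and the prefix concatenation, using a toy embedding matrix in place of a real pre-trained tokenizer and embedding layer (the vocabulary size, model dimension, and token ids are all illustrative assumptions):

```python
import numpy as np

# Toy sketch of Equation 1: Sp = We(T), Ex = We(X), followed by prefixing
# the soft prompt to the input embeddings. All sizes and ids are invented.
rng = np.random.default_rng(0)
VOCAB, D = 100, 8                      # toy vocabulary size and model dimension
We = rng.normal(size=(VOCAB, D))       # stand-in for the model's embedding matrix

prompt_ids = np.array([5, 17, 42])     # T: subword ids of a prompt template
input_ids = np.array([3, 9, 23, 61])   # X: subword ids of the input sequence

Sp = We[prompt_ids]                    # soft prompt embeddings (the trainable part)
Ex = We[input_ids]                     # input sequence embeddings
combined = np.concatenate([Sp, Ex], axis=0)   # soft prompt used as a prefix

# The attention mask (and, in the full model, token type ids and labels)
# must be extended to cover the prepended prompt positions.
attention_mask = np.ones(combined.shape[0], dtype=int)
print(combined.shape)   # (7, 8): prompt length + input length, model dimension
```

In the actual method each task gets its own prompt template, so this construction is performed once per task before the selection step described next.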
However, since some prompt tokens can harm LLMs' performance, we use a feature selection method to choose the most important ones based on their attention weights generated by the transformer attention mechanism. This ensures that only beneficial tokens are used to fine-tune the model. The detailed procedure for prompt selection is in the following section.

To allow the model to process the added soft prompt embedding, we extended the model's maximum sequence length to accommodate the combined input sequence. This, in turn, required extending the attention mask, token type IDs, and sequence labels to match the new sequence length.

Transformer-based Attention-driven Prompt Token Selection

The self-attention mechanism is an effective way to determine the contextual relationships between different words within an input sequence regardless of their relative distance. It enables the model to ascertain the importance of each word within the sequence. Because some prompt tokens prepended to the input sequence may negatively impact model adaptation, we apply an attention-based selection approach to select only the tokens most relevant to the input sequence, thereby reducing the use of negative prompt tokens. We start by taking the dot product of the prompt embeddings with the input sequence embeddings and then apply a SoftMax function to obtain the attention weights, as in Equation 2.

Algorithm 1: Procedure for transformer-based attention-driven prompt token selection
Input: S: input embeddings; Semb: soft prompt embeddings; K: top-k features; D: model dimension
Output: Cemb, the input combined with the top-K selected prompt tokens
I ← []                          // initialize attention scores, weights, and token importance
for Iemb ∈ S do:
    for semb ∈ Semb do:
        A ← DotProduct(Iemb, semb)   // compute the dot product
W ← Softmax(A)                  // convert the attention scores to weights
I ← sum(W)                      // sum the attention weights for importance
Indices ← GetIndices(I, K)      // indices of the top-K tokens
Eindices ← Expand(Indices, D)               // expand the indices to the model dimension
TopPromptTokens ← GatherSelected(Semb, Eindices)   // gather the selected prompt tokens
Cemb ← Concat(TopPromptTokens, S)           // prepend the top prompt tokens to the input
return Cemb                                 // return the combined input to the model
end procedure

Finally, we compute the weighted sum for each prompt token to get its importance score, as shown in Equation 3. The overall procedure is depicted in Algorithm 1.

Attention_weights = SoftMax(Ex · Sp^T)    (2)

Importance_j = Σ_i Attention_weights_ij    (3)

where the SoftMax in Equation 2 is taken over the prompt tokens for each input token, and the importance of prompt token j in Equation 3 is the sum of its attention weights over all input tokens i.

Dual Sequence Labelling for Adverse Drug Event Extraction

Figure 1 illustrates the overall architecture, where two tasks, concept identification and attribute relation extraction, are modeled simultaneously. We employed a sequence labeling and multi-task transfer learning approach. (Fig. 1: A multi-prompt model takes two prompt sequences, one for each task.)

We applied a weight decay of 0.05 and a dropout rate of 0.1 to prevent overfitting. We trained the model for 10 epochs on the TAC 2017 dataset and 15 on the n2c2 2018 dataset for soft prompt tuning with unfrozen models. Similarly, we trained the model for 200 epochs for soft prompt tuning with frozen models on both datasets.

The textual prompt tokens are transformed into trainable embedding vectors and undergo the attention-driven token selection procedure to select the top-k (positive) prompt tokens. The positive tokens are then prepended to the input embedding of each task to serve as input to the multi-task learning framework, which produces a shared representation via the transformer model (with frozen or unfrozen parameters). This method converts both tasks into a dual sequence labeling problem, modeled together using a multi-task deep learning framework to generate a shared contextual representation of the input via a transformer-based model. The concatenated input, which includes both the input embeddings and the selected prompt embeddings for the two sub-tasks derived from the proposed multi-prompt tuning procedure detailed above, is fed into the transfer learning framework.
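The attention-driven selection of Algorithm 1 can be sketched as follows. This is a NumPy sketch under our own interpretation of the procedure (softmax taken over the prompt tokens for each input token, as in Equations 2 and 3); the function name and all dimensions are illustrative:

```python
import numpy as np

def select_top_k_prompt_tokens(S, Semb, k):
    """Sketch of Algorithm 1: attention-driven prompt token selection.

    S:    (n, d) input-sequence embeddings.
    Semb: (m, d) soft-prompt embeddings.
    Returns the (k + n, d) combined input with the top-k prompt tokens
    prepended. Names mirror the algorithm but are otherwise our own.
    """
    scores = S @ Semb.T                          # Equation 2: dot products, (n, m)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    W = e / e.sum(axis=1, keepdims=True)         # softmax over the prompt tokens
    importance = W.sum(axis=0)                   # Equation 3: per-prompt-token score
    indices = np.argsort(importance)[::-1][:k]   # indices of the top-k tokens
    top_prompt = Semb[indices]                   # gather the selected prompt tokens
    return np.concatenate([top_prompt, S], axis=0)  # prepend to the input

rng = np.random.default_rng(1)
S = rng.normal(size=(4, 8))       # 4 input tokens, model dimension 8
Semb = rng.normal(size=(6, 8))    # 6 candidate soft prompt tokens
print(select_top_k_prompt_tokens(S, Semb, k=3).shape)   # (7, 8)
```

Only the k highest-scoring prompt tokens survive, which is how the method avoids prepending negative tokens to the input.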
The output is then directed to task-specific layers, where each sub-task's classification head and a SoftMax are applied for the final classification of each token in the sequence.

During the dual sequence labeling stage, the system transforms the tasks into ADR-source mention identification and ADR-mention attribute relation identification. Each dataset contains either ADR or Drug mentions. The ADR-source mention identification task classifies the input sequence into binary classes: positive (source mentions with one or more relations to mention attributes) and negative (source mentions with no relation to mention attributes). Conversely, the ADR-mention attribute relation identification task involves identifying the attributes and relationships of the positive ADR mentions identified in the first sub-task. The system uses an extended beginning-inside-outside (BIO) tagging scheme to handle discontinuous mentions and sub-words from word-piece tokenization during token-based sequence labeling for the two sub-tasks. Two additional tags, DB (discontinuous mention beginning) and DI (discontinuous mention inside), are introduced, and the "X" tag labels subwords generated by the tokenizer.

Results

Tables I and II show the results of our two experimented models, BERT and SciBERT, on TAC 2017 and n2c2 2018 for concept and end-to-end relation extraction. On the n2c2 dataset, Table I shows that for concept extraction the SciBERT model has the better overall performance for both fine-tuning and soft prompt tuning with the unfrozen model. Compared to BERT, the SciBERT model improved by roughly 3%. Similarly, SciBERT outperformed BERT on the TAC 2017 dataset by roughly 1%. For the frozen model, SciBERT outperformed BERT by 0.8% and 44% for concept extraction.

TABLE I
THE RESULTS OF THE TWO EXPERIMENTED MODELS ON THREE TUNING STRATEGIES FOR CLINICAL CONCEPT EXTRACTION
Dataset | Model   | Fine-tuning F1 | Soft prompt (unfrozen) F1 | Soft prompt (frozen) F1
N2C2    | BERT    |                |                           |
N2C2    | SCIBERT |                |                           |
TAC     | BERT    |                |                           |
TAC     | SCIBERT |                |                           |

TABLE II
THE RESULTS OF THE TWO EXPERIMENTED MODELS ON THREE TUNING STRATEGIES FOR CLINICAL END-TO-END RELATION EXTRACTION

Dataset | Model   | Fine-tuning F1 | Soft prompt (unfrozen) F1 | Soft prompt (frozen) F1
N2C2    | BERT    |                |                           |
N2C2    | SCIBERT |                |                           |
TAC     | BERT    |                |                           |
TAC     | SCIBERT |                |                           |

III. RESULTS AND DISCUSSION

Large Language Models and Experimental Settings

The BERT model is trained on vast text data from English Wikipedia and BooksCorpus. Two pre-trained versions of BERT are available, differing in size: BERT-Base and BERT-Large. We experiment with the base model for fine-tuning and soft prompting (with frozen and unfrozen model parameters). The SciBERT model builds upon the BERT architecture and is pre-trained on a large corpus of 1.14 million full-text papers from Semantic Scholar. There are two available versions of SciBERT: sci-vocab and base-vocab. We utilize the sci-vocab version.

We configured our model with a maximum sequence length of 512 and a batch size of 8 and 32 for unfrozen and frozen models, respectively. We used a learning rate of 2e-5, the cross-entropy loss function, and the Adamax optimizer.

In addition, Table II presents the results for end-to-end relation extraction for the experimented models on the TAC 2017 and n2c2 2018 datasets. The SciBERT model outperforms BERT by 5.31% on n2c2 and 2.18% on TAC 2017. For the frozen model, the margins were 1.16% and 1.14%, respectively. These results demonstrate the capability of the SciBERT model over BERT. The observed performance could be attributed to pre-training data drawn from scientific documents, giving the model a better chance of identifying concepts and terminology in clinical text. Figure 2 depicts the results of the experimented models.

(Fig. 2: Summary of the results obtained in F1 score by the models for concept and relation extraction on both the TAC 2017 and n2c2 2018 datasets.)
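The F1 scores reported in the tables are micro-averaged over entity decisions. The following sketch shows the computation over pooled true positives, false positives, and false negatives; the gold and predicted spans here are invented purely for illustration:

```python
# Minimal sketch of the micro-averaged F1 score reported in Tables I-IV.
# Entities are (label, start, end) tuples; the examples are invented.
def micro_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)              # exact matches
    fp = len(pred - gold)              # predicted but not in the gold set
    fn = len(gold - pred)              # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

gold = {("Drug", 0, 2), ("ADE", 5, 7), ("Severity", 9, 10)}
pred = {("Drug", 0, 2), ("ADE", 5, 6)}      # one hit, one boundary error, one miss
print(round(micro_f1(gold, pred), 3))       # 0.4
```

Because counts are pooled before averaging, frequent entity types dominate the score, which is the intended behavior of the official micro-F1 metric.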
The frozen models reached 39% and 40% for concept and relation extraction, respectively, relative to the highest-scoring GatorTron model. Figure 3 displays the comparison of the models.

Discussion

Prompt-based learning is a lightweight approach to adapting LLMs, especially with frozen models. Most existing prompting systems were developed mainly to handle natural language generation problems, and research on natural language understanding problems has mostly explored large models with many parameters. However, this approach has not fully explored complex NLP problems, such as sequence labeling involving multiple predictions for a single input. In addition, prompt-based learning depends on the quality of the prompt tokens to effectively guide the model on downstream tasks and thereby reduce the gap between pre-training and downstream objectives. This study proposed a multi-prompt-based soft prompting method with transformer-based attention-driven prompt token selection to choose the most necessary prompt tokens. We conduct our experiments with two popular small-scale models to investigate the effectiveness of this approach.

To further investigate the potential of our models against the state of the art, we evaluate their performance on the n2c2 2018 dataset, comparing them to the GatorTron system, a clinical natural language processing model. Notably, GatorTron is the largest clinical model in the literature, pre-trained on an extensive corpus of biomedical texts and electronic health records. The model comes in three sizes: GatorTron-base (345 million parameters), GatorTron-medium (3.9 billion parameters), and GatorTron-large (8.9 billion parameters). Tables III and IV show the concept and end-to-end relation extraction comparison results. For concept extraction, SciBERT with unfrozen parameters outperformed GatorTron-base, which has the highest score among the GatorTron variants, by roughly 1%. Similarly, end-to-end relation extraction improved by roughly 5%. However, for a frozen model, the performance dropped by 32.
86% for concept and about 49% for end-to-end relation extraction compared to GatorTron-large. This drastic drop is not surprising, as the SciBERT model has only a small fraction of the parameters of GatorTron-large (110 million versus 8.9 billion). This parameter difference also indicates the capability of our multi-prompt tuning approach in multi-task learning settings.

TABLE III
COMPARISON OF OUR EXPERIMENTED MODELS WITH GATORTRON MODELS ON THE N2C2 2018 DATASET FOR CONCEPT EXTRACTION USING THE OFFICIAL EVALUATION METRIC (MICRO F1 SCORE)

Dataset | Model            | Parameters  | Fine-tuning F1 | Soft prompt (unfrozen) F1 | Soft prompt (frozen) F1
N2C2    | BERT             | 110 million |                |                           |
N2C2    | SCIBERT          | 110 million |                |                           |
N2C2    | GatorTron-base   | 345 million |                |                           |
N2C2    | GatorTron-medium | 3.9 billion |                |                           |
N2C2    | GatorTron-large  | 8.9 billion |                |                           |

TABLE IV
COMPARISON OF OUR EXPERIMENTED MODELS WITH GATORTRON MODELS ON THE N2C2 2018 DATASET FOR END-TO-END RELATION EXTRACTION USING THE OFFICIAL EVALUATION METRIC (MICRO F1 SCORE)

Dataset | Model            | Parameters  | Fine-tuning F1 | Soft prompt (unfrozen) F1 | Soft prompt (frozen) F1
N2C2    | BERT             | 110 million |                |                           |
N2C2    | SCIBERT          | 110 million |                |                           |
N2C2    | GatorTron-base   | 345 million |                |                           |
N2C2    | GatorTron-medium | 3.9 billion |                |                           |
N2C2    | GatorTron-large  | 8.9 billion |                |                           |

This study shows that models with small to medium parameter counts can perform well with frozen parameters. However, to attain the performance of traditional fine-tuning, a model's parameters must be scaled up to billions, as is evident in the GatorTron models. In addition, soft prompt tuning can be successfully applied to complex natural language understanding problems involving multitasking, with remarkable performance.

(Fig. 3: Summary of the compared results in F1 score for the experimented models and the GatorTron models for concept and relation extraction on n2c2 2018.)

IV. CONCLUSION

The prefix-based multi-prompt tuning with attention-based prompt token selection proposed in this study has demonstrated the effectiveness of soft prompt tuning in adapting a large language model to natural language understanding problems involving sequence labeling for multi-task adverse drug event extraction.
Our approach with unfrozen models outperforms traditional fine-tuning and the GatorTron models on these tasks. In future work, we plan to investigate our proposed approach with language models of medium to large size and with decoder-based models like GPT. In addition, we will explore other NLP tasks.

ACKNOWLEDGMENTS

This research is supported by Universiti Putra Malaysia and the Ministry of Higher Education, Malaysia, under the Fundamental Research Grant Scheme (FRGS/1/2023/ICT02/UPM/02/).

REFERENCES