Open Global Scientific Journal 3: 83–93, 2024
E-ISSN: 2961-7952
DOI: 10.70110/ogsj.
Journal homepage: https://openglobalsci.

Performance Analysis of Fine-Tuned ChatGPT-3.5 Chatbot for Alni Accessories E-Commerce Service

Sugiarti¹, Ihwana As'ad¹*, Al Hilaluddin¹
¹Faculty of Computer Science, Universitas Muslim Indonesia, Makassar, Indonesia
*Correspondence e-mail: ihwana.asad@umi.

ARTICLE INFO

Article History:
Received 22 September 2024
Revised 29 October 2024
Accepted 10 November 2024
Published 14 November 2024

Keywords: Chatbot performance, ChatGPT-3.5, E-commerce service, Fine-tuning, Natural language processing

ABSTRACT

Background: The rapid advancement of artificial intelligence (AI) technology, particularly in language modeling, has driven the adoption of chatbots to support e-commerce services. The Alni Accessories store faces challenges related to limited product information and slow customer service, which negatively impact user satisfaction.

Aims: This study aims to analyze the performance of a fine-tuned ChatGPT-3.5-based chatbot to improve interaction quality and transaction efficiency on the e-commerce platform.

Methods & Results: The evaluation process was carried out through a series of experiments, in which the 13th experiment achieved the best results, with a Training Loss of 0.3894 and a Validation Loss of 0.5787. The Training Mean Token Accuracy of 0.8799 and Validation Mean Token Accuracy of 0.7971 indicate the model's ability to effectively learn language patterns. Furthermore, the Full Validation Loss of 0.6242 and Full Validation Mean Token Accuracy of 0.8122 demonstrate the model's strong generalization ability. The independent t-test produced a p-value of 0.0001, confirming a statistically significant difference in transaction response time between the old system and the new system. The findings of this study show that the fine-tuned ChatGPT-3.5 chatbot not only accelerates services and improves product information accuracy but also holds great potential for implementation in other sectors such as education and healthcare.

To cite this article: Sugiarti, As'ad, I., & Hilaluddin, A. (2024). Performance analysis of a fine-tuned ChatGPT-3.5 chatbot for Alni Accessories e-commerce service. Open Global Scientific Journal, 3, 83–93.

This article is under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License. Copyright © 2024 by author(s).

Introduction

The development of e-commerce technology has become a driving force behind the transformation of modern business. With its ease of access and broad market reach, e-commerce enables businesses to conduct sales without geographical constraints, accelerate and improve the efficiency of transaction processes, and simultaneously serve as an effective promotional medium (Hairi et al.). Alni Accessories, located at Jalan Barukang Utara Lorong 8, Cambayya, Makassar, is a small business specializing in titanium-based accessories. Its current sales strategy relies primarily on direct sales and simple online platforms such as WhatsApp and Facebook. However, challenges arise due to incomplete product information and slow customer service response times, ranging from 6 minutes to 5 hours, which delay the ordering process and reduce customer satisfaction. To address these challenges, a mobile-based e-commerce solution is required to deliver a more interactive and efficient shopping experience.
One proposed approach is to integrate a fine-tuned ChatGPT-3.5 Turbo chatbot that can better understand the business context, provide accurate product information, and respond to customers more quickly. Previous studies have shown that fine-tuning large language models (LLMs) on company-specific datasets significantly improves response accuracy in corporate settings, as the model becomes more relevant to the specific business domain (Kishore). In addition, fine-tuning GPT-3.5 Turbo on e-commerce domain datasets has been proven effective in enhancing text-based product recommendation systems (Xu & Hu). A DevOps approach is also applied to ensure the system is developed in a sustainable, efficient, and scalable manner (Septiadi & Isnandar).

Earlier research has highlighted the effectiveness of chatbots in delivering information, for example in new student admission (PMB) services at STT-NF, where a chatbot built with the Telegram API and Python successfully assisted prospective students in obtaining information and was proven effective through Black Box Testing, User Acceptance Testing (UAT), and questionnaire surveys (Herfian & Adriansyah). Moreover, studies on ChatGPT have demonstrated its ability to deliver accurate and fast information, improve time efficiency, and support user creativity, although critical evaluation of its outputs remains necessary (Misnawati). In the e-commerce context, a reliable, responsive, and secure chatbot has been shown to improve customer satisfaction (Kappi & Marlina). The DevOps approach is widely adopted in software research because it accelerates the development and operations cycle without compromising system quality (Wibowo et al.). Fine-tuning language models such as ChatGPT has also been reported to significantly improve performance by adapting them to specific domains, resulting in more relevant outputs, efficient token usage, and reduced request latency (OpenAI).

Based on the challenges faced by Alni Accessories and the findings of previous research, this study aims to develop a mobile e-commerce application integrating a fine-tuned ChatGPT-3.5 Turbo chatbot with a DevOps approach. This combination is expected to enhance service quality, accelerate the ordering process, and increase customer satisfaction, thereby maximizing overall e-commerce performance.

Methods

DevOps Method

The DevOps method is a developer-oriented principle designed to effectively and efficiently coordinate collaboration between development teams and operations teams (Kole & Sugeng). DevOps aims to enhance collaboration between these teams, from planning to the successful delivery of applications or features to end users. This approach helps minimize communication barriers between teams, thereby accelerating and standardizing the software development cycle. Furthermore, DevOps emphasizes the implementation of Continuous Integration (CI) and Continuous Delivery (CD), which enable development, testing, and deployment processes to run automatically and continuously. This not only shortens product release cycles but also improves software quality by allowing potential errors to be detected and resolved more quickly. In addition, the DevOps approach supports organizational agility, enabling companies to be more adaptive to market demands and technological changes (Fitriani et al.). Thus, DevOps is not merely a technical methodology but also a work culture that emphasizes collaboration, transparency, and continuous improvement. The key steps in implementing DevOps are as follows (Figure 1).
Figure 1. DevOps Method Stages

Plan
In the planning stage, system requirements are identified and designed using UML diagrams. Development management is handled through Google Cloud and GitHub for seamless collaboration. The data sources used at this stage include user requirement documents, e-commerce transaction records, and preliminary chatbot interaction datasets, which serve as the foundation for system design.

Develop
At this stage, the development team is divided into three groups: backend, frontend, and DevOps. The backend team is responsible for developing APIs and implementing business logic, while the frontend team focuses on building the user interface. The DevOps team manages cloud infrastructure and automates the deployment process. All development activities are carried out through close team collaboration, with version control and code collaboration managed via GitHub. The data sources include product catalog data, user profile information, and training datasets used to build and refine the chatbot.

Build
This stage integrates the software modules for both the e-commerce platform and the chatbot, with automated builds managed through GitHub Actions within the CI/CD pipeline. Each commit is tested before being merged into the main branch, producing software artifacts ready for testing. The data sources include integrated API endpoints, chatbot model checkpoints, and configuration files necessary for deployment.

Test
Testing is performed using the independent t-test method to compare response times between the manual system and the e-commerce platform (a sketch of how such response-time samples can be collected appears after the Monitor stage below). Google Cloud Monitoring is used to observe performance and response time in real time, while Postman validates APIs and chatbot functionality. If testing results indicate discrepancies, such as suboptimal response times or API errors, the team revisits the Develop stage, followed by Build and re-testing, until the system meets performance requirements. The data sources include response time logs, API call records, and user interaction logs used to evaluate chatbot accuracy and system efficiency.

Deploy
After successful testing, the software is deployed to the Google Cloud infrastructure. Deployment is executed through the CI/CD pipeline using Google Cloud Build to ensure a smooth and consistent release process. The data sources include deployment scripts, infrastructure configuration files, and monitoring dashboards for verifying a successful rollout.

Operate
Operational management of the production application is handled using Google Cloud Operations Suite to manage logs, monitor application health, and apply autoscaling when necessary. Data sources at this stage include system health metrics, error logs, and scaling activity records, which help maintain optimal system performance.

Monitor
Real-time monitoring is performed using Google Cloud Monitoring and Logging to detect issues. Chatbot interaction data is analyzed to continuously improve customer service quality. Data sources include chatbot conversation logs, user feedback surveys, and performance metrics extracted from the production environment.
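As a simple illustration of the response-time measurements used in the Test stage, the following Python sketch collects latency samples from a chatbot HTTP endpoint. The endpoint URL, payload shape, and sample size are hypothetical placeholders, not the project's actual API; the paper itself also gathers transaction times from respondents rather than only from automated probes.

```python
import statistics
import time

import requests  # assumed HTTP client; install with `pip install requests`


def measure_response_times(url: str, payload: dict, n: int = 15) -> list[float]:
    """Collect n response-time samples (in seconds) for a chatbot endpoint.

    Samples like these are the raw material for the independent t-test
    described in the Test stage.
    """
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        response = requests.post(url, json=payload, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    # Placeholder endpoint and payload; the project's real API is not
    # documented in the paper, so these values are illustrative only.
    times = measure_response_times(
        "https://example.com/api/chatbot",
        {"message": "Is the titanium ring available?"},
    )
    print(f"mean={statistics.mean(times):.2f}s stdev={statistics.stdev(times):.2f}s")
```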
Fine-tuning

Fine-tuning a large language model (LLM) for specific tasks allows the model to fully utilize its capacity by adapting to a particular domain (J et al.). The benefits of fine-tuning include efficiency, as a pre-trained model requires less data and time for task-specific training compared to building a model from scratch (Brown et al.). Therefore, this study adopts a fine-tuning approach to enhance model performance in domain-specific tasks. The fine-tuning process in this research follows these steps (Figure 2).

Figure 2. Fine-Tuning Process Flowchart

Data Preparation
Data preparation is a crucial initial step in the fine-tuning process, which includes data collection and preprocessing. Data collection focuses on obtaining relevant question–answer conversations in the context of titanium accessory e-commerce, such as transaction histories from WhatsApp, Messenger, or external sources. Preprocessing is then performed to remove noise, correct typographical errors, and eliminate duplicates, ensuring high-quality data for model training.

Preparing the Dataset
This stage involves prompt engineering, dataset splitting, and format adjustment for fine-tuning. In this study, the dataset consists of 70 samples, split into 80% for the training set and 20% for the validation set. Prompts are designed based on common interaction patterns, such as product inquiries and recommendations. The dataset is then divided into training and validation sets and converted into JSON Lines (JSONL) format to ensure compatibility with the fine-tuning platform.

Upload Dataset
After formatting, the dataset is uploaded to the OpenAI platform for fine-tuning. The dataset files are structured to meet the platform's technical requirements, enabling successful training of the GPT-3.5 Turbo model.

Fine-tuning
This process involves configuring training parameters such as epoch count, learning rate, and batch size to prevent overfitting, as well as validating the JSONL format and ensuring the alignment of prompt–completion pairs before execution. A sketch of these steps appears below.
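To make the workflow concrete, here is a minimal Python sketch of the formatting, upload, and job-creation steps using the OpenAI Python SDK. The file names, the example conversation, and the hyperparameter values are illustrative assumptions, not the configurations reported in Table 1; the chat-format JSONL structure and the fine_tuning.jobs.create call follow OpenAI's public fine-tuning API for gpt-3.5-turbo.

```python
import json

from openai import OpenAI  # OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One training example per JSONL line, in the chat format expected by the
# fine-tuning endpoint. The conversation below is illustrative only.
example = {
    "messages": [
        {"role": "system",
         "content": "You are a customer service assistant for a titanium accessories store."},
        {"role": "user", "content": "Do you sell titanium bracelets?"},
        {"role": "assistant",
         "content": "Yes, we carry titanium bracelets in several sizes and finishes."},
    ]
}
# In practice the 70 samples are split 80/20 across the two files;
# here the same single example stands in for both.
for path in ("train.jsonl", "valid.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Upload the training and validation files, then create the fine-tuning job.
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid_file = client.files.create(file=open("valid.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file=train_file.id,
    validation_file=valid_file.id,
    hyperparameters={  # illustrative values, not the study's Table 1 settings
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 2.0,
    },
)
print(job.id)  # once the job succeeds, the tuned model id ("ft:gpt-3.5-turbo:...")
               # can be used with the regular chat completions endpoint
```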
Results and Discussion

Fine-tuning Results

This study conducted fifteen fine-tuning experiments, the hyperparameter configurations of which are summarized in Table 1.

Table 1. Hyperparameter Configurations (columns: Experiment; Batch Size; Learning Rate; Epochs, for Experiments 1–15; Experiment 1 used Auto for all three parameters)

Fine-Tuning Results of ChatGPT

Table 2 presents the results of the fifteen fine-tuning experiments conducted on the ChatGPT model. Each experiment was performed using a different parameter configuration to evaluate model performance. The results include Training Loss, Training Mean Token Accuracy, Validation Loss, Full Validation Loss, Validation Mean Token Accuracy, and Full Validation Mean Token Accuracy.

Table 2. Fine-Tuning Experiment Results of ChatGPT (columns: Experiment; Training Loss; Training Mean Token Accuracy; Validation Loss; Full Validation Loss; Validation Mean Token Accuracy; Full Validation Mean Token Accuracy, for Experiments 1–15)

Based on the fine-tuning experiments conducted, several metrics were used to evaluate the model's performance in each trial, including Training Loss, Training Mean Token Accuracy, Validation Loss, Full Validation Loss, Validation Mean Token Accuracy, and Full Validation Mean Token Accuracy. The following figure illustrates the results of Experiment 13, providing a comparative view of the metrics used in this evaluation.

Of all the experiments conducted, Experiment 13 demonstrated the most optimal results. The Training Loss for this experiment was 0.3894, which is relatively low, along with a low Validation Loss of 0.5787. This indicates that the model was able to learn effectively from the training data without showing signs of overfitting and was capable of generalizing well to the validation data.

Furthermore, the Training Mean Token Accuracy in Experiment 13 reached 0.8799, one of the highest among all experiments, indicating that the model achieved a very high level of accuracy in predicting tokens on the training data. Similarly, the Validation Mean Token Accuracy reached 0.7971, suggesting that the model maintained strong accuracy on the validation set. In comparison, Experiment 14 exhibited an extremely low Training Loss and a high Training Mean Token Accuracy but a notably higher Validation Loss. This suggests that although that model learned very well on the training data, it struggled to generalize to the validation data, a clear indication of overfitting. In contrast, Experiment 13 successfully maintained a balance between Training Loss and Validation Loss, making it a more stable and efficient model.

In addition, the Full Validation Loss for Experiment 13 was recorded at 0.6242, lower than that of the other experiments, indicating greater efficiency in reducing errors across the entire validation set. The Full Validation Mean Token Accuracy of 0.8122 further demonstrates the model's ability to accurately predict tokens across the full validation data. Based on these results, it can be concluded that Experiment 13 achieved the best performance overall. This model demonstrates an ideal balance between low Training Loss and high Validation Accuracy, as well as optimal generalization capability, making it a more stable and reliable experiment compared to the others.

Transaction Speed Comparison

A transaction speed comparison was conducted by measuring the response times of the old system and the newly developed system. Data were collected from 15 respondents to determine the performance differences between the two systems.

Table 3. Transaction Time Data for Old and New Systems (columns: Respondent; Transaction Time – Old System (minutes); Transaction Time – New System (minutes), for Respondents 1–15)

Average Transaction Time for Each Group
Based on the data presented above, several statistical analyses were performed to compare the performance of the two systems. These analyses included the calculation of the mean, variance, and standard deviation, a normality test, and an independent t-test. The average transaction time for the old system was 144 minutes, while the average transaction time for the new system was 9.67 minutes.

Sample Variance Calculation
The sample variances and standard deviations were calculated as follows: the sample variance of the old system was s₁² = 9429.14 min² (standard deviation ≈ 97.10 minutes), and the sample variance of the new system was s₂² = 5.38 min² (standard deviation ≈ 2.32 minutes).

Normality Test
Before conducting the statistical comparison test, a normality test was performed to ensure that the transaction time data from both systems followed a normal distribution. The Shapiro-Wilk method was employed for this purpose. The results indicated that the p-value for the old system was 0.0642, while the p-value for the new system likewise exceeded the significance threshold. Since both p-values were greater than 0.05, it can be concluded that the data from both systems were normally distributed. Consequently, parametric statistical analysis, such as the independent t-test, could be applied. A computational sketch of these descriptive statistics and the normality check follows below.
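The statistics above can be reproduced from the respondent-level data with a few lines of Python. The function below is a minimal sketch assuming the two lists of 15 recorded transaction times (in minutes) are available; scipy.stats.shapiro is the standard Shapiro-Wilk implementation, and ddof=1 gives the sample variance used in the paper. The respondent-level values themselves are not reproduced here.

```python
import numpy as np
from scipy import stats


def summarize(name: str, times: list[float], alpha: float = 0.05) -> None:
    """Print mean, sample variance/std (ddof=1), and a Shapiro-Wilk check."""
    x = np.asarray(times, dtype=float)
    _, p_value = stats.shapiro(x)  # Shapiro-Wilk normality test
    verdict = "normal" if p_value > alpha else "not normal"
    print(f"{name}: mean={x.mean():.2f} min, s^2={x.var(ddof=1):.2f}, "
          f"s={x.std(ddof=1):.2f}, Shapiro-Wilk p={p_value:.4f} ({verdict})")


# old_times / new_times would hold the 15 transaction times per system
# recorded in Table 3:
# summarize("old system", old_times)
# summarize("new system", new_times)
```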
Outlier Test
In addition to the normality test, the next step was to check for potential outliers that might influence the results. Outlier detection was performed using the z-score method, which compares each data point to the group mean and standard deviation. Data points with z-scores greater than 3 or less than -3 were considered outliers. The analysis confirmed that no significant outliers were present in the dataset. This finding, combined with the normality test results, allowed for the use of parametric statistical methods such as the independent t-test.

F-Test
Following the normality and outlier tests, an F-test was conducted to determine whether the variances of the two groups could be assumed equal or unequal. The F-test compares the variances of the two groups to check for statistically significant differences. If the calculated F-value is greater than the F-table value, the null hypothesis (H0) is rejected, indicating significantly different (unequal) variances. Conversely, if the calculated F-value is less than or equal to the F-table value, H0 is accepted, meaning the variances can be assumed equal.

Independent t-Test
In the program output, the F-test result indicated that the calculated F-value was greater than the F-table value, suggesting that the variances of the two groups were significantly different. Therefore, Welch's t-test was employed, which is a modification of the independent t-test that accounts for unequal variances between groups. After confirming normal distribution, the absence of outliers, and unequal variances between groups, Welch's t-test was performed to determine whether there was a statistically significant difference between the mean transaction times of the old and new systems. The t-value was calculated as follows:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = 5.35622

Determining the Degrees of Freedom (DF)
After obtaining the t-value, the degrees of freedom (DF) were calculated to find the critical value from the t-distribution table, which serves as a reference for determining statistical significance. Using the Welch–Satterthwaite formula,

df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}},

and given s₁² = 9429.14, s₂² = 5.38, n₁ = 15, and n₂ = 15, the resulting degrees of freedom are df ≈ 14.

Determining the p-Value
The next step was to calculate the p-value to test the significance of the difference in mean transaction times between the two systems. Based on the calculations, the p-value obtained was 0.0001, which is lower than 0.05. This result indicates that the difference in response times between the old and new systems is statistically significant. Therefore, the null hypothesis (H0), which states that there is no significant difference between the two systems, can be rejected. This finding confirms that the new system demonstrates a significantly faster response time compared to the old system, indicating an improvement in system performance.
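The reported t-value, degrees of freedom, and p-value can be checked directly from the summary statistics above. The following is a minimal Python sketch using scipy.stats.ttest_ind_from_stats with equal_var=False (Welch's test); the input numbers are those reported in the paper, and the Welch–Satterthwaite degrees of freedom are recomputed explicitly.

```python
import numpy as np
from scipy import stats

# Summary statistics reported above (transaction times in minutes).
mean_old, var_old, n_old = 144.0, 9429.14, 15
mean_new, var_new, n_new = 9.67, 5.38, 15

# Welch's t-test from summary statistics (equal_var=False -> unequal variances).
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=mean_old, std1=np.sqrt(var_old), nobs1=n_old,
    mean2=mean_new, std2=np.sqrt(var_new), nobs2=n_new,
    equal_var=False,
)

# Welch-Satterthwaite degrees of freedom, matching the formula above.
se_old, se_new = var_old / n_old, var_new / n_new
df = (se_old + se_new) ** 2 / (se_old**2 / (n_old - 1) + se_new**2 / (n_new - 1))

print(f"t = {t_stat:.5f}, df = {df:.2f}, p = {p_value:.6f}")
# Expected output: t close to 5.35622 and df close to 14, with p on the
# order of 1e-4, consistent with the reported p-value of 0.0001.
```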
Conclusions

The use of the fine-tuned ChatGPT-3.5 Turbo model successfully provides faster, more relevant, and contextually appropriate responses, thereby enhancing the overall user experience. Based on the statistical analysis, the obtained p-value was 0.0001, which is lower than 0.05. This result confirms that the new system, which integrates the AI-powered chatbot, significantly reduces transaction time compared to the old system, indicating a substantial improvement in overall system performance. The best-performing chatbot model achieved a Training Loss of 0.3894 and a Validation Loss of 0.5787, indicating effective learning without signs of overfitting. The Training Mean Token Accuracy of 0.8799 and Validation Mean Token Accuracy of 0.7971 demonstrate high accuracy in token prediction. Furthermore, the Full Validation Loss of 0.6242 and Full Validation Mean Token Accuracy of 0.8122 confirm the model's efficiency in reducing errors and improving prediction accuracy across the entire validation dataset.

Acknowledgment

The authors would like to express their gratitude to the Alni Accessories Store for providing the data used in the preparation of this article.

Authors' Note

The authors declare that there are no conflicts of interest related to the publication of this article. The manuscript has undergone thorough verification and is confirmed to be free from any form of plagiarism.

References