PROCEEDING Al Ghazali Internasional Conference, Volume 1, December 2024
The Future is Now: Adaptation to the World's Emerging Technologies
e-issn.

HYDROGEN SULFIDE LEAK DETECTION USING THE C4.5 ALGORITHM: OPTIMIZING FEATURE EXTRACTION FOR ENHANCED ACCURACY

Mula Agung Barata1, Dwi Irnawati2, Ifnu Wisma Dwi Prastya3, Dwi Issadari Hastuti4
Universitas Nahdlatul Ulama Sunan Giri1,3,4, Universitas Bojonegoro2
Email: mula.ab26@gmail.

Abstract

Hydrogen sulfide (H2S) is a toxic, hazardous gas commonly found in industrial environments, where leaks can lead to serious health and safety risks. Detecting H2S leaks effectively is essential for preventing accidents and ensuring workplace safety. This study explores the C4.5 algorithm combined with optimized feature extraction to improve the accuracy of H2S leak detection. Feature extraction identifies and analyzes the significant attributes of gas leak indicators, enhancing the classification accuracy of the C4.5 algorithm. The experimental results demonstrate that optimized feature extraction can significantly improve the algorithm's ability to detect H2S leaks promptly and accurately. The proposed method not only offers a reliable solution for gas leak detection but also contributes to safer industrial monitoring. This study highlights the potential of machine learning techniques, particularly decision tree-based methods, to advance environmental safety through intelligent monitoring systems.

Keywords: C4.5, feature extraction, gas leak, hydrogen sulfide

Introduction

Hydrogen sulfide (H2S) is a colorless gas known for its pungent, rotten-egg smell and high toxicity, encountered particularly in industrial environments such as oil refineries, wastewater treatment plants, and chemical manufacturing facilities (Rubright et al., 2…).
Because it is highly toxic even at low concentrations, detecting and measuring H2S leakage is critical for preventing severe health hazards, including respiratory failure and, in extreme cases, death (A. Semary et al.). Additionally, H2S can cause significant environmental damage, making it essential for industrial facilities to have reliable leak detection systems in place (Guidotti, 2…).

Recent advancements in machine learning have opened new avenues for improving the accuracy and responsiveness of gas detection systems (A. Semary et al., 2…). Traditional methods for detecting H2S often rely on chemical sensors, which, while effective, can be limited in sensitivity and response time (Nose, n…). Machine learning algorithms such as decision trees offer promising alternatives that can enhance the precision and efficiency of detection systems by analyzing multiple variables simultaneously (Ross et al., 1…).

The C4.5 algorithm, a decision tree method, is particularly suited to classification tasks in complex industrial settings (M. Barata et al., 2…). By incorporating various indicators of gas leakage, such as pressure levels, temperature changes, and sensor data, C4.5 can classify potential leak events with higher accuracy than traditional methods (Bahassine et al., 2…). However, the effectiveness of this algorithm depends heavily on the quality of the input data, making feature extraction a crucial step in the detection process (Peker & Kubat, 2…).

Feature extraction is a data preprocessing technique that aims to identify and select the most relevant variables from raw data, enhancing the overall accuracy of machine learning models (Harsono et al., 2…).
In the context of H2S leak detection, optimized feature extraction can significantly improve the C4.5 algorithm's ability to differentiate between normal and potentially dangerous situations. This approach not only enhances detection precision but also reduces false alarms, which are common in many industrial gas monitoring systems (Rubright et al., 2…).

This study investigates the application of the C4.5 algorithm with optimized feature extraction for detecting H2S leaks in industrial environments. By focusing on feature optimization, the research seeks to maximize the algorithm's classification performance and thus provide a more reliable and responsive solution for H2S monitoring. This approach could reduce the risk of accidents and support better health and safety practices within the industry.

In summary, this research contributes to the growing body of knowledge on machine learning for environmental monitoring and industrial safety. It highlights the potential of the C4.5 algorithm, combined with feature extraction, to enhance the detection of and response to hazardous gas leaks, with practical implications for industries seeking safer operational practices (Rubright et al., 2…).

Literature Review and Hypothesis Development

Hydrogen sulfide is a toxic gas commonly found in industrial sectors, especially in facilities such as oil refineries, wastewater treatment plants, and chemical manufacturing industries (Zhang & Li, 2…). Even at low concentrations, H2S leaks pose severe health risks to workers, potentially leading to respiratory irritation and other symptoms (Wang et al., 2…). There is therefore an urgent need for fast and accurate detection of H2S leaks, given the significant threats to both human health and the environment (Smith & Chen, 2…). Studies indicate that traditional approaches often lack
the sensitivity required to detect leaks at very low levels, which makes machine learning-based systems a promising alternative (Li et al., 2…).

1 The Role of Machine Learning in Gas Leak Detection

Machine learning approaches enable the processing of complex data to produce more accurate decisions in hazardous gas detection (Tambunan & Stefanie, 2…). Various algorithms, including support vector machines, random forests, and decision trees, have been widely used in gas detection and environmental pollution monitoring (Deni et al., 2…). Among these methods, the C4.5 algorithm has shown particular efficacy in classification tasks because it can handle large and complex datasets, especially in dynamic industrial environments exposed to H2S (AR & Palini, 2…). C4.5 also allows various parameters to be used simultaneously, which can improve detection accuracy by using historical data for classification.

2 The Application of the C4.5 Algorithm

C4.5, a decision tree algorithm developed by Quinlan, is known for its efficiency in classifying high-complexity data. The algorithm is well suited to industrial applications that require real-time data classification, such as gas leak detection (Ross et al., 1…). Previous studies have shown that C4.5 can achieve high accuracy in gas leak detection when supported by appropriate feature extraction (M. Barata et al., 2…). For instance, Gupta et al. found that by filtering key features from sensor data, C4.5 can enhance detection capabilities and reduce the chances of false alarms.

3 The Significance of Feature Extraction in Gas Leak Detection

Feature extraction is a critical data processing technique that aims to identify and select the most relevant variables from raw data, ultimately improving the performance of machine learning algorithms (Liu & Wu, 2…).
In the context of H2S leak detection, parameters such as changes in pressure, temperature, and gas concentration become important indicators that can be optimized to improve C4.5's classification capabilities (Singh et al., 2…). Research by Kumar et al. revealed that applying feature extraction techniques significantly reduces data processing loads and increases accuracy, particularly when raw data contains substantial noise.

4 Research Gap and Contribution of This Study

Although numerous studies have explored machine learning applications for gas detection, the use of the C4.5 algorithm with optimized feature extraction techniques for H2S detection remains underexplored (Datasets et al., 2…). Most research focuses on general algorithm optimization without an in-depth exploration of feature extraction techniques to enhance accuracy in dynamic industrial environments (Yan et al., 2…). This study therefore contributes by examining feature extraction optimization for the C4.5 algorithm in H2S leak detection, aiming to offer a more reliable solution for industrial monitoring.

Research Method

This study uses a structured research methodology to evaluate the effectiveness of the C4.5 algorithm in detecting hydrogen sulfide (H2S) leaks, incorporating data preprocessing, feature extraction, and validation processes to ensure accurate classification results.

1 Dataset

This study uses a hydrogen sulfide (H2S) gas dataset collected with an electronic nose (e-nose). The reference data come from Pertamina EP Asset 4 Field Sukowati, which provided H2S gas samples and information on the hazards posed by H2S gas leaks at concentration threshold values between 5 and 10 PPM.
This knowledge, based on the details provided by PT Pertamina, serves as the foundation for developing an intelligent e-nose system. Data collection was conducted 100 times in two stages. The first stage collected samples from H2S-free air as the "non-hazardous" class, with 50 data collections, each yielding 50 data records, which were then subjected to feature extraction. The second stage collected data for the "hazardous" class, again with 50 data collections of 50 data records each, processed using the same procedure. The dataset produced by the e-nose data collection is shown in Table 1.

Table 1. H2S Gas Dataset

No.   Sensor Value   Voltage   Ratio   H2S Concentration
                     2.97      0.69    0.74
                     2.97      0.69    0.74
                     2.97      0.69    0.74
                     2.96      0.69    0.72
                     2.96      0.69    0.72
                     2.96      0.69    0.72
                     2.97      0.67    0.72

2 C4.5 Algorithm

The C4.5 algorithm, a widely used classification method in data mining, plays a critical role in this study by supporting accurate classification of hydrogen sulfide (H2S) gas leakage. As an extension of the ID3 algorithm, C4.5 is designed to handle both categorical and continuous data, making it highly adaptable to real-world applications. The algorithm constructs a decision tree based on information gain, which is calculated from entropy, a measure of the impurity of each data split. By selecting the attributes that provide the highest information gain, C4.5 creates decision nodes that help minimize classification errors within the dataset.

3 Feature Extraction

Feature extraction using the average, standard deviation, minimum, and maximum involves calculating these statistical attributes from the raw data to create features that represent key characteristics of the dataset (Wakhid et al., 2…).
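
To make the role of these statistics concrete, the following minimal Python sketch reduces each collection run to the four summary features and then scores one candidate threshold split with the entropy-based criterion outlined in the C4.5 Algorithm subsection. It is an illustration under stated assumptions, not the implementation used in this study: the function names (`extract_features`, `entropy`, `gain_ratio`), the sample readings, and the 1.0 threshold are hypothetical, and the population standard deviation is an assumed choice. The score is normalized by split information (the gain ratio), which is how Quinlan's C4.5 refines plain information gain.

```python
import math
import statistics
from collections import Counter


def extract_features(readings):
    """Reduce one collection run of H2S readings to four summary statistics."""
    return {
        "avg": statistics.fmean(readings),
        "std": statistics.pstdev(readings),  # population std (assumed choice)
        "min": min(readings),
        "max": max(readings),
    }


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` on one feature."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing, so it carries no information
    weights = [len(left) / len(labels), len(right) / len(labels)]
    remainder = weights[0] * entropy(left) + weights[1] * entropy(right)
    info_gain = entropy(labels) - remainder
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


# Hypothetical runs (illustrative values only, not the study's dataset).
runs = [
    ([0.74, 0.74, 0.72, 0.72], "non-hazardous"),
    ([0.69, 0.70, 0.71, 0.70], "non-hazardous"),
    ([1.24, 1.24, 1.30, 1.22], "hazardous"),
    ([5.10, 5.30, 5.20, 5.40], "hazardous"),
]
features = [extract_features(readings) for readings, _ in runs]
labels = [label for _, label in runs]
avg_values = [f["avg"] for f in features]

for f, label in zip(features, labels):
    print(label, {name: round(value, 3) for name, value in f.items()})
print("gain ratio at avg <= 1.0:", gain_ratio(avg_values, labels, 1.0))
```

In a full C4.5 tree, candidate thresholds would be evaluated for every feature (average, standard deviation, minimum, maximum) and the best-scoring split chosen at each node; this sketch scores a single threshold on a single feature only.
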
Each statistic is used in feature extraction as follows.

Average: the mean provides the central value of a data subset, representing the average level of the feature. In classification tasks, it indicates the overall trend or typical value within a data sample.

Standard deviation: this measures the variability or dispersion of the data around the mean. A higher standard deviation indicates that data points are spread over a wider range of values, while a lower standard deviation suggests that they lie closer to the mean. This feature is especially helpful for assessing the stability or variability within a dataset.

Minimum: the smallest value within a data subset. It identifies the lower bound of a feature's range, which is useful for understanding baseline or low-level values in the data.

Maximum: the highest value in a data subset. It marks the upper bound of the data range, which helps detect peaks or maximum exposure levels, especially in sensor or environmental data.

4 Proposed Method

The proposed method applies feature extraction before classification with the C4.5 algorithm. This optimizes decision tree construction by minimizing irrelevant data, leading to more accurate and efficient classification results. Through this combined method, the study seeks to demonstrate that preprocessing with feature extraction enhances the predictive accuracy of the C4.5 algorithm, making it better suited to complex datasets where key patterns may otherwise be obscured in the raw data.

Figure 5. Proposed method scheme

Discussion

The dataset includes four attributes: sensor value, voltage, ratio, and H2S concentration. However, not all attributes are used in this study. These four attributes reflect the gas concentration conditions within the dataset.
The data displayed are raw data, obtained directly from the data collection phase with the electronic nose device, and have not been processed. Table 3 shows the dataset prepared for the feature extraction process.

Table 3. Dataset from sensor

Sensor   Volt   Ratio   H2S    Class
         2.97   0.69    0.74
         2.97   0.69    0.74
         2.97   0.69    0.74
         2.96   0.69    0.72
         2.96   0.69    0.72
         3.11   0.61    1.24
         3.11   0.61    1.24

Prior to feature extraction, the dataset is reduced by removing the sensor-generated attributes that are not needed for this study, specifically the sensor value, voltage, and ratio attributes. The remaining attributes to be included in the feature extraction process are shown in the table below.

Table 3. Dataset from sensor

H2S Concentration   Class
0.74
0.74
0.74
0.72
0.72
1.24
1.24

1 Dataset Visualization

In this visualization, the "normal" condition data line is shown in blue, indicating that H2S levels are within a safe range but exhibit some instability due to ambient air conditions. The "hazardous" data line is shown in red to denote dangerous conditions, where H2S concentrations have surpassed the threshold that poses a health risk. The graph clearly separates the data, with a balanced distribution between the "normal" and "hazardous" zones: a total of 10,000 records, divided equally into 5,000 labeled "normal" and 5,000 labeled "hazardous." This visualization is crucial for analyzing patterns, identifying anomalies, and offering key insights into when H2S concentrations reach dangerous levels. The dataset visualization for this study is displayed in Figure 6.

Figure 6. The dataset visualization

2 Feature Extraction Calculation

The dataset sampling results from the two data classes yielded frequency distribution data.
These data were then summarized at the signal value level as mean, standard deviation, maximum, and minimum values. Calculations focused on the H2S attribute, based on sensor sampling obtained from the electronic nose device, as displayed in Table 4.

Table 4. The Result of feature extraction

No.   Avg     Std     Min     Max     Class
      0.672   0.032   0.62    0.672
      5.236   2.913   0.41    5.236
      3.593   0.128   3.35    3.593
      3.245   0.114   3.08    3.245
      2.756   0.188   2.47    2.756
      ...
      1.778   0.274   1.423   2.327
      1.747   0.131   1.576   2.026

3 C4.5 Algorithm Calculation

Testing was performed on the H2S dataset using the C4.5 algorithm to assess its predictive accuracy. The dataset was evaluated using 5-fold cross-validation, which yields a more reliable accuracy estimate than a single train/test split. One iteration was run per fold, yielding five test results, which were then averaged to determine the final prediction accuracy (Purnomo et al., 2…). The accuracy outcomes from the 5-fold cross-validation are displayed in Table 5.

Table 5. C4.5 Testing with 5-fold Cross-Validation

k-fold   Accuracy Result

The average accuracy from testing the C4.5 algorithm using 5-fold cross-validation reached 89%. The accuracy results for the C4.5 algorithm are shown in Figure 7.

Conclusion

This study demonstrates that applying feature extraction techniques significantly enhances the accuracy and reliability of the C4.5 algorithm in detecting hydrogen sulfide (H2S) leaks in industrial environments.
By identifying and selecting the most relevant features from raw data, this technique reduces data noise and optimizes the classification capability of the C4.5 algorithm. The test results indicate that combining feature extraction with C4.5 not only improves accuracy but also reduces the rate of false alarms, which are common in traditional gas monitoring systems. The proposed approach can therefore serve as a more reliable solution for gas monitoring and detection systems in industrial sectors that require high speed and precision in identifying potential hazardous gas leaks.

Future work could proceed in three directions. First, to enrich the findings, future studies are encouraged to compare the effectiveness of the C4.5 algorithm with other algorithms, such as Random Forest, Support Vector Machine (SVM), or Deep Learning. This comparison can provide a more comprehensive understanding of the most suitable algorithm for H2S leak detection. Second, future research could benefit from a larger and more diverse dataset, including data from different industrial environments, to ensure that the model remains robust under various conditions. A broader dataset will help validate the effectiveness of feature extraction with C4.5 in more complex scenarios. Third, while this study highlights the importance of feature extraction, future studies could consider more advanced techniques, such as Principal Component Analysis (PCA) or deep feature extraction methods, to identify deeper patterns within complex gas leak data.

Bibliography