Bulletin of Informatics and Data Science Vol. 4 No. May 2025. Page 1-9
ISSN 2580-8389 (Media Online)
DOI 10.61944/bids.
https://ejurnal. id/index. php/bids/index

Enhancing Support Vector Machine Performance for Heart Attack Prediction using RobustScaler-Based Outlier Handling

M Munawir Lasiyono1,*, Nurhayati2, Teotino Gomes Soares3, Mulyadi4
1 Informatics Management Study Program, Politeknik Mitra Karya Mandiri, Brebes, Indonesia
2 Informatics Engineering Study Program, Faculty of Engineering, Universitas Muhammadiyah Tangerang, Tangerang, Indonesia
3 Computer Science Department, School of Engineering and Science, Dili Institute of Technology, Dili, Timor-Leste
4 Information Systems Study Program, Faculty of Computer Science, Universitas Nurdin Hamzah, Jambi, Indonesia
Email: 1,*mmunawirlasiyono@gmail.com, 2nurhayati09011@ft-umt. id, 3tyosoares@gmail.com, 4mulyadiroesly@gmail.
Correspondence Author Email: mmunawirlasiyono@gmail.com

Abstract
Cardiovascular disease remains the leading cause of death worldwide, with most cases attributed to heart attacks and strokes. Early detection is crucial, yet conventional diagnostic methods are often constrained by time, cost, and uneven distribution of clinical expertise. Consequently, machine learning-based approaches offer a promising alternative for efficiently supporting heart attack prediction. This study employs the Support Vector Machine (SVM) algorithm and focuses on enhancing its performance by using RobustScaler as a preprocessing technique to address the outliers common in medical datasets. The objective is to evaluate the impact of RobustScaler on SVM performance in heart attack classification. The model was developed using a dataset of 303 patient records, consisting of eight numerical features and one binary target label. Experiments were conducted under two preprocessing scenarios: without scaling (baseline) and with RobustScaler. Model performance was assessed using accuracy, precision, recall, F1-score, and ROC-AUC.
The results show that applying RobustScaler substantially improves model performance: accuracy increased from 64.77% to 85.23%, a gain of 20.46 percentage points, and ROC-AUC rose from 73.65% to 93.36%, a 26.78% relative increase in discriminatory ability. Additionally, recall for the negative class improved dramatically from 26.47% to 99. reflecting better identification of non-heart attack cases. These findings demonstrate that proper preprocessing, particularly using RobustScaler, plays a vital role in optimizing SVM performance, especially when handling clinical data with extreme values.

Keywords: Support Vector Machine; RobustScaler; Heart Attack Prediction; Outlier Handling; Medical Data Classification

INTRODUCTION

Cardiovascular disease remains the leading cause of death worldwide. According to the World Health Organization (WHO), approximately 17.9 million deaths occur each year due to cardiovascular conditions, accounting for 32% of global mortality. Of this figure, about 85% are caused by heart attacks and strokes, with a steadily increasing prevalence, particularly in developing countries. Early detection is one of the most critical strategies for reducing this mortality rate. However, traditional diagnostic processes often require significant time, incur high costs, and rely heavily on clinical expertise, which is not always readily accessible. Data-driven prediction methods based on machine learning have therefore emerged as promising alternatives for identifying high-risk patients early, efficiently, and accurately.

In practice, classical machine learning algorithms such as Logistic Regression, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM) are still widely favored due to their interpretability, computational efficiency, and ability to operate effectively on small datasets.
Unlike ensemble or deep learning models, which often require large volumes of data and high computational resources and tend to function as black-box systems, classical models are more transparent and easier to interpret, especially in applications that demand explainability and accountability. Nevertheless, these classical approaches have several limitations, particularly when dealing with problematic data such as outliers and class imbalance, both of which can impair learning and reduce predictive accuracy.

Numerous studies have developed heart disease prediction models using classical machine learning methods. Ibrahima and Yu applied KNN and achieved 72.37% accuracy, but reported an imbalance in recall values, indicating the need for better attention to data distribution. Barus et al. utilized Naive Bayes and achieved 58% accuracy, but showed a significant discrepancy between precision . 67%) and recall . %), suggesting a lack of proper preprocessing. Febriani et al. proposed Fuzzy Logistic Regression, obtaining 80% accuracy but with low specificity and no consideration of outliers. Azis employed Logistic Regression and reported accuracy ranging from 80% to 88%, but without addressing preprocessing techniques or the impact of extreme data values. Akhdan et al. compared Decision Tree and Artificial Neural Network (ANN), reaching 87% accuracy, though low precision and F1-score pointed to the potential influence of outliers and class imbalance.

Most of these prior studies did not explicitly address outliers, which can reduce both accuracy and model generalization, particularly for algorithms such as SVM that are highly sensitive to extreme values. Moreover, there is a lack of comparative studies that directly evaluate the impact of preprocessing techniques such as RobustScaler on SVM performance in the context of heart disease prediction.

SVM is selected in this study because it is a robust and widely used classification algorithm, especially for binary classification problems. SVM constructs an optimal hyperplane that separates classes with a maximum margin and performs well on high-dimensional data. However, SVM is also known to be sensitive to data scaling and outliers, which may affect the optimality of the decision boundary and decrease prediction accuracy. To address this challenge, an appropriate preprocessing method is required to minimize the influence of outliers. RobustScaler is a data scaling technique designed to be resistant to outliers by utilizing the median and interquartile range (IQR) rather than the mean and standard deviation used in StandardScaler. This approach preserves the central distribution of the data while reducing the impact of extreme values, thus improving the model's stability and accuracy.

This study aims to enhance the performance of the SVM algorithm for heart attack prediction through outlier handling using RobustScaler, and to conduct a comprehensive evaluation of model performance using accuracy, precision, recall, F1-score, and ROC-AUC. The main contribution of this study lies in presenting a systematic approach for outlier handling to optimize SVM performance, along with an empirical comparison between models using no scaling and those using RobustScaler.

Copyright © 2025 Authors. This Journal is licensed under a Creative Commons Attribution 4.0 International License.

RESEARCH METHODOLOGY

1 Research Stages

The development of a heart attack prediction model using the Support Vector Machine (SVM) algorithm and the RobustScaler scaling technique was conducted through a series of systematic and integrated stages.
Each step in the research process was methodologically designed to ensure that the proposed approach could be implemented in a structured manner and replicated in similar contexts. The main stages of this research are illustrated in Figure 1.

Figure 1. Research Pipeline (Data Collection and Cleaning -> Data Exploration and Analysis -> Data Preprocessing -> Development of a Classification Model -> Model Performance Evaluation and Analysis)

Figure 1 presents the overall flow of the research. A detailed explanation of each stage is provided below.

Data Collection and Cleaning
The dataset used in this study was obtained from the Kaggle platform, titled "Heart Attack Dataset", which is publicly available. This dataset contains medical records of 303 patients, with a total of 9 attributes (8 features and 1 target label). The features are Age, Heart Rate, Systolic Blood Pressure, Diastolic Blood Pressure, Blood Sugar, CK-MB, Troponin, and Gender. The target label is provided in the Result column, indicating whether or not a patient experienced a heart attack. Prior to modeling, the data underwent a cleaning process, including the removal of missing values, conversion of all features into appropriate numeric types, and binarization of the target label (0 for negative, 1 for positive). This step ensured that the data were of sufficient quality and consistency for preprocessing and model training.

Data Exploration and Analysis
This stage aimed to understand the general characteristics of the dataset. Descriptive analysis was conducted to assess the distribution of each feature, relationships between variables, and the class proportions of the target label (positive and negative). Visualizations such as histograms, heatmaps, and boxplots were used to detect outliers and identify
relevant features. The results of this exploration informed the selection of preprocessing techniques and justified the use of RobustScaler.

Data Preprocessing
Data preprocessing was performed to ensure optimal conditions before training the model. The selected scaling technique was RobustScaler, which transforms features based on the median and interquartile range (IQR), making it more resistant to outliers. Categorical features, such as gender, were numerically encoded. The data were then split into training and testing sets using an 80:20 stratified split to preserve the class proportions in both subsets. This ratio is commonly used in classification modeling: the model learns from 80% of the data and is tested on the remaining 20%. This stage resulted in two datasets ready for training and testing under two scenarios: without scaling (baseline) and with RobustScaler.

Development of a Classification Model
In this stage, the Support Vector Machine (SVM) algorithm was used as the primary classification model. The model was trained using the training set prepared under two scenarios: baseline (without scaling) and with RobustScaler. SVM works by finding the optimal hyperplane that separates two classes with the maximum margin. For non-linear data, SVM utilizes a kernel function to transform the data into a higher-dimensional space where a linear separation becomes possible. This study applied the Radial Basis Function (RBF) kernel due to its effectiveness in capturing non-linear patterns among features. All experiments were conducted with identical parameters so that any observed performance difference could be attributed to the preprocessing technique.

Model Evaluation
Model evaluation was conducted by assessing classification performance on the test set using several metrics. This stage began with the generation of a confusion matrix and ROC (Receiver Operating Characteristic) curve.
The confusion matrix was used to calculate metrics such as accuracy, precision, recall, and F1-score, reflecting the model's correctness, sensitivity, and class-wise balance. The ROC curve illustrates the relationship between the true positive rate (sensitivity) and the false positive rate, and was used to compute the AUC (Area Under the Curve) as an indicator of the model's overall discriminatory capability. The performance of the two models (with and without RobustScaler) was compared to evaluate the extent to which preprocessing affected accuracy and sensitivity, particularly in detecting positive (heart attack) cases.

2 Scaling Techniques Using RobustScaler

RobustScaler is a data normalization technique designed to reduce the impact of outliers. Unlike StandardScaler, which transforms data based on the mean and standard deviation, RobustScaler uses the median and interquartile range (IQR), making it more robust to skewed distributions and extreme values. The transformation is defined by Equation (1):

$$x_{\text{scaled}} = \frac{x - Q_2}{Q_3 - Q_1} \tag{1}$$

where $Q_2$ is the median, and $Q_1$ and $Q_3$ are the first and third quartiles, respectively.

In medical datasets, outliers often arise due to clinical variations, recording errors, or rare conditions. If not properly addressed, outliers can negatively impact model performance, especially for algorithms sensitive to data scale, such as SVM and KNN. RobustScaler is therefore considered a suitable approach for improving the stability of predictive models, and is particularly recommended when datasets contain significant outliers or extreme values. Because it transforms features relative to their interquartile range, it minimizes the distortion caused by extreme values. It is important to note that in this study,
RobustScaler was not used as a separate outlier detection or removal technique. Instead, it served purely as a preprocessing method to mitigate the influence of outliers through scaling, by transforming features relative to their interquartile range. This approach ensured that extreme values did not disproportionately affect the SVM's margin-based decision boundary.

3 Support Vector Machine (SVM) Method

Support Vector Machine (SVM) is a supervised learning algorithm commonly used for classification and regression tasks. SVM works by identifying a hyperplane that optimally separates two classes with the maximum margin. When data are not linearly separable, SVM employs kernel functions to map the data into a higher-dimensional space, enabling linear separation. In this study, the Radial Basis Function (RBF) kernel was chosen due to its ability to effectively capture non-linear relationships among features. Mathematically, SVM solves the optimization problem in Equation (2):

$$\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i\,(w^{T} x_i + b) \ge 1, \;\; \forall i \tag{2}$$

To handle data that are not perfectly separable, slack variables and a penalty parameter C are introduced, allowing a balance between maximizing the margin and minimizing classification error. SVM is known for its strength in handling high-dimensional data and for generalizing well to test data. However, a known limitation of SVM is its sensitivity to feature scaling and outliers, which can shift the hyperplane and degrade model performance. Therefore, selecting an appropriate preprocessing method such as RobustScaler is essential to ensure optimal model behavior when dealing with complex and varied medical datasets.
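
As a minimal sketch of the median-and-IQR transformation used by RobustScaler, the following standard-library Python compares it with mean/standard-deviation scaling on a hypothetical sample that contains one extreme value. The sample values and the helper names `robust_scale` and `standard_scale` are illustrative assumptions, not the study's code. Quartile conventions also differ between tools: the `exclusive` method below reproduces the "median of each half" rule for this sample, while the `inclusive` method corresponds to the linear-interpolation percentiles that NumPy uses by default and that scikit-learn's RobustScaler is understood to rely on.

```python
from statistics import mean, pstdev, quantiles


def robust_scale(values, method="exclusive"):
    """Scale values as (x - Q2) / (Q3 - Q1) using the chosen quartile method."""
    q1, q2, q3 = quantiles(values, n=4, method=method)
    iqr = q3 - q1
    return [(x - q2) / iqr for x in values]


def standard_scale(values):
    """Scale values as (x - mean) / standard deviation (population form)."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


# Hypothetical readings: four typical values and one extreme value (400).
sample = [100, 110, 115, 120, 400]

robust = robust_scale(sample)                      # quartiles 105, 115, 260
robust_inclusive = robust_scale(sample, "inclusive")  # quartiles 110, 115, 120
standard = standard_scale(sample)

for x, r, ri, s in zip(sample, robust, robust_inclusive, standard):
    print(f"{x:>4}  robust={r:+.3f}  robust(inclusive)={ri:+.3f}  standard={s:+.3f}")
```

With robust scaling the median maps to zero under either quartile rule, whereas the mean used by standard scaling is pulled toward the extreme value. The size of the scaled outlier depends strongly on the quartile rule, so hand calculations and library output may legitimately differ on very small samples.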

In this study, the Support Vector Machine model was implemented using the Scikit-learn library in Python. The classifier was instantiated using the SVC class from sklearn.svm, with the kernel set to 'rbf' to support non-linear classification. The regularization parameter C was set to 1.0 and the kernel coefficient gamma was set to 'scale', which is the default configuration in Scikit-learn and has shown good empirical performance on small to medium datasets. These parameters were kept constant across both scenarios (with and without scaling) to isolate the impact of the preprocessing technique on model performance. No cross-validation or hyperparameter optimization was performed, as the main objective was to evaluate the effectiveness of RobustScaler in enhancing model robustness under identical modeling conditions.

RESULT AND DISCUSSION

The development of a heart attack classification model using the Support Vector Machine (SVM) approach began with the preparation of an appropriate dataset. The dataset used in this study was sourced from the public platform Kaggle, titled "Heart Attack Dataset". It contains medical records of 303 patients, comprising nine attributes: eight input features and one target label. The available features are Age, Heart Rate, Systolic Blood Pressure, Diastolic Blood Pressure, Blood Sugar, CK-MB enzyme, Troponin, and Gender. The target label is represented by the Result attribute, which indicates whether or not a patient has experienced a heart attack. Before the modeling process, the dataset underwent a cleaning stage involving the removal of missing values, conversion of all features into appropriate numeric formats, and binarization of the target label into 0 for negative and 1 for positive. These steps were essential to ensure data quality and consistency for the preprocessing and training stages.
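
The setup described so far (numeric features, a binary label, a stratified 80:20 split, and an RBF-kernel SVC with C=1.0 and gamma='scale', trained with and without RobustScaler) can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the data are synthetic stand-ins generated with NumPy, the random seeds are arbitrary, and fitting the scaler on the training split only (via a pipeline) is an assumption, since the text does not state where the scaler was fitted.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)  # synthetic binary label (0 = negative, 1 = positive)

# Synthetic features (NOT the study's data): a small-scale informative feature
# and a large-scale uninformative feature containing a few extreme values.
informative = y + rng.normal(0.0, 0.5, size=n)
noisy = rng.normal(120.0, 30.0, size=n)
noisy[rng.choice(n, size=10, replace=False)] *= 5  # inject outliers
X = np.column_stack([informative, noisy])

# Stratified 80:20 split keeps the class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Same SVC settings in both scenarios; only the preprocessing differs.
models = {
    "no scaling": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "RobustScaler": make_pipeline(
        RobustScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores = model.decision_function(X_test)  # signed distance to the boundary
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "roc_auc": roc_auc_score(y_test, scores),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"ROC-AUC={results[name]['roc_auc']:.3f}")
```

The numbers printed by this sketch depend entirely on the synthetic data and are not comparable to the results reported in this study; the point is only that both scenarios share one classifier configuration and one split.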

The subsequent stage involved exploratory data analysis, which encompassed a descriptive assessment of the distribution of each feature, investigation of inter-variable relationships, and evaluation of the proportion of instances across target classes. This step aimed to gain initial insights into the structure and characteristics of the dataset. As a starting point, the analysis focused on visualizing the distribution of the target variable to assess the degree of class imbalance. Understanding class distribution is essential in binary classification tasks, as imbalanced datasets can significantly influence model performance, particularly through bias toward the majority class and reduced sensitivity in detecting the minority class. The visualization of class distribution is presented in Figure 2.

Figure 2. Visualization of Target Result Class Distribution

Figure 2 illustrates the distribution of the target class, consisting of patients who did not experience a heart attack and those who did. The class distribution shows a moderate imbalance, with 61.4% of the data belonging to the negative class and 38.6% to the positive class. This imbalance is considered tolerable for training purposes, so oversampling and other balancing techniques were not applied. This decision was made to preserve the natural structure of the data, although the possibility of slight bias toward the majority class was considered during model evaluation.

The next exploration step focused on analyzing the relationships between variables. This aimed to identify correlation patterns among numeric features and assess how strongly each feature relates to the target label. Such analysis provides preliminary insights into the strength and direction of these relationships, supporting decisions in feature selection and the use of appropriate predictive models.

A heatmap was used to visualize the correlation values, with color intensity indicating the strength of the relationship. The correlation heatmap is presented in Figure 3.

Figure 3. Heatmap of Correlation Between Features

Figure 3 displays the correlations among the numeric features in the dataset. The results indicate that most features have weak correlations with each other and with the target label. The strongest correlations with the Result label were found in Age, Troponin, and CK-MB. A moderate correlation of 0.59 was found between Systolic and Diastolic Blood Pressure. These findings imply that no individual feature dominates the prediction, which supports the use of classification models based on feature interaction, such as SVM with non-linear kernels.

Further exploration was conducted to observe the characteristics of the numeric features. This step aimed to examine the distribution of values, detect the presence of outliers, and identify potential impacts on model training. Such information is useful for determining suitable preprocessing strategies, including the selection of scaling techniques. Boxplots for the numeric features prior to scaling are shown in Figure 4.

Figure 4. Initial Numerical Features Boxplot

Figure 4 presents boxplots of numeric features that highlight the presence of outliers. Most features, especially Blood Sugar,
CK-MB, and Heart Rate, show extreme values beyond the normal range. This confirms that outliers exist in the data and may disrupt model training, which justifies the use of RobustScaler to reduce their influence.

To better understand how RobustScaler operates, a simple manual calculation is presented using a sample of five Blood Sugar values: (100, 110, 120, 400, 115). The first step is to sort the values and compute the median (Q2):

Sorted values: (100, 110, 115, 120, 400)
Q2 (median) = 115

The second step is to calculate the lower quartile (Q1) and the upper quartile (Q3). Here the quartiles are taken as the medians of the lower and upper halves of the sorted data:

$$Q_1 = \frac{100 + 110}{2} = 105, \qquad Q_3 = \frac{120 + 400}{2} = 260$$

The third step is to find the interquartile range (IQR), a statistical measure of the spread of the middle of a dataset, namely the distance between the third quartile (Q3) and the first quartile (Q1):

$$\mathrm{IQR} = Q_3 - Q_1 = 260 - 105 = 155$$

After obtaining the median and IQR, each value is transformed using the scaling formula. The results are presented in Table 1.

Table 1. Data Transformation Results

| Original Data | Calculation | Scaled Result |
|---|---|---|
| 100 | (100 - 115) / 155 = -15 / 155 | ≈ -0.097 |
| 110 | (110 - 115) / 155 = -5 / 155 | ≈ -0.032 |
| 120 | (120 - 115) / 155 = 5 / 155 | ≈ 0.032 |
| 400 | (400 - 115) / 155 = 285 / 155 | ≈ 1.839 |
| 115 | (115 - 115) / 155 = 0 | 0 |

Table 1 illustrates that the median becomes the center of the distribution with a value of zero, while extreme values such as 400 retain high magnitudes but are no longer dominant. RobustScaler reduces the influence of outliers by transforming data based on the median and interquartile range rather than the mean and standard deviation. The result of applying RobustScaler to the dataset is visualized in Figure 5.

Figure 5. Boxplot of Numeric Features After Scaling Using RobustScaler

Figure 5 illustrates the boxplots of numerical features after transformation using RobustScaler. It can be observed that extreme values (outliers)
have been significantly suppressed, and the distribution of each feature is now more concentrated around the zero median. This indicates that RobustScaler effectively reduces the influence of outliers and prepares the data more appropriately for classification algorithms such as SVM.

The next step involved the construction of the classification model using the Support Vector Machine (SVM) algorithm, which was implemented for both training and testing. The dataset was split into 80% training and 20% testing using stratified sampling to maintain the proportion of target classes in both subsets. The SVM algorithm was implemented using the scikit-learn (sklearn) library, a widely used Python library for machine learning. The model was instantiated using the SVC (Support Vector Classifier) class from the sklearn.svm module. The parameter kernel='rbf' was used, as the Radial Basis Function (RBF) kernel is known for its effectiveness in capturing non-linear relationships between features in high-dimensional space.

Model performance was evaluated using several metrics, including the confusion matrix, classification report, and ROC-AUC score, which collectively measure the model's ability to distinguish between classes in a binary classification task. These metrics provide comprehensive insights into the model's accuracy, sensitivity, and overall performance, particularly in the presence of class imbalance. The confusion matrices and ROC curves for both the SVM model without scaling and the SVM model with RobustScaler preprocessing are presented in Figure 6.

Figure 6. (a) Confusion Matrix for the SVM Model Without Scaling, (b) Confusion Matrix for the SVM Model With RobustScaler, (c) ROC Curve of Both Models

Figure 6 compares the confusion matrices and ROC curves of the two SVM models. The results indicate that the model using RobustScaler achieved a higher AUC score, signifying better classification performance compared to the model without scaling (ROC-AUC of 93.36% versus 73.65%). Based on the confusion matrix and ROC curve results, further evaluation was conducted using classification reports and ROC-AUC scores for both models. A complete comparison of the performance metrics is shown in Table 2.

Table 2. Comparison of SVM Model Performance Without Scaling and With RobustScaler (columns: Method, Class, Precision, Recall, F1-Score, Accuracy, ROC-AUC Score; rows: SVM (No Scaling) and SVM + RobustScaler, each reported for the Negative and Positive classes)

The evaluation results in Table 2 demonstrate that the SVM model with RobustScaler consistently outperformed the model without scaling across all major performance metrics. The accuracy improved from 64.77% to 85.23%, an increase of 20.46 percentage points. This improvement highlights the significant impact of RobustScaler preprocessing on the model's predictive performance.

To further contextualize the results, an accuracy comparison with prior studies was conducted. It is important to note that the datasets used in those studies may differ from the one employed in this research. The comparison is summarized in Table 3.

Table 3. Accuracy Comparison with Prior Studies

| Study | Method | Reported Accuracy |
|---|---|---|
| Ibrahima & Yu | K-Nearest Neighbor (KNN) | 72.37% |
| Barus et al. | Naive Bayes | 58% |
| Febriani et al. | Fuzzy Logistic Regression | 80.00% |
| Azis | Logistic Regression | 80% to 88% |
| Akhdan et al. | Decision Tree, ANN | 87.00% |
| This study | SVM + RobustScaler | 85.23% |

Table 3 summarizes the accuracy achieved in this study compared to previous classical machine learning studies. Among the referenced studies, Akhdan et al. obtained the highest single reported accuracy of 87.00% using a combination of Decision Tree and Artificial Neural Network. Ibrahima and Yu reported 72.37% using K-Nearest Neighbor, Barus et al. 58% with Naive Bayes, and Febriani et al. 80.00% with Fuzzy Logistic Regression. Azis reported an accuracy ranging from 80% to 88% using Logistic Regression, although details regarding preprocessing were not explicitly mentioned. In comparison, the proposed model achieved 85.23% accuracy using a single algorithm, Support Vector Machine, demonstrating competitive performance.

This study focuses specifically on addressing the outlier problem, which was not explicitly handled in previous studies. By applying RobustScaler as a preprocessing strategy, the research aims to demonstrate that outlier handling can significantly enhance model accuracy. The performance improvements shown in Table 2 reinforce the importance of robust preprocessing, particularly when dealing with clinical datasets that often contain extreme values.

Despite the strong results, one limitation was a slight reduction in recall for the positive class, indicating that some heart attack cases remained undetected. This issue requires careful attention in medical contexts, as it may impact clinical decision-making. Future research may focus on SVM hyperparameter optimization, class imbalance handling (such as class weighting), and combining preprocessing with feature selection techniques to further improve model performance.

CONCLUSION

This study demonstrated that applying RobustScaler as a preprocessing technique significantly improved the performance of the Support Vector Machine (SVM) algorithm in predicting heart attack cases. Without preprocessing, the baseline SVM model achieved an accuracy of 64.77% and showed poor sensitivity toward the negative class, with a recall of only 26.47%. After using RobustScaler, the model's accuracy increased to 85.23%, and the ROC-AUC score rose from 73.65% to 93.36%, a 26.78% relative improvement in classification capability. These findings confirm that selecting the appropriate preprocessing strategy, particularly in handling outliers, plays an essential role in enhancing model performance on clinical datasets.

However, this study has several limitations. The dataset used was relatively small, consisting of only 303 patient records, and the results were not validated on external datasets. In addition, the study focused exclusively on RobustScaler without comparing it to other scaling or outlier-handling techniques. The model also showed a slight decline in recall for the positive class, indicating that a number of heart attack cases were still undetected. Future research is encouraged to expand the dataset, perform parameter optimization, evaluate other preprocessing methods, and test the model across different populations or clinical settings to improve robustness and generalizability.

REFERENCES