International Journal of Electrical Engineering and Intelligent Computing (IJEEIC), EISSN 3031-5255, https://ejournal.id/index.php/ijeeic/index

A Multimodal Deep Learning Framework for Early Detection of Congenital Heart Disease in Neonates

Iis Hamsir Ayub Wahab (1), Sri Yati (2)
(1) Department of Electrical Engineering, Faculty of Engineering, Universitas Khairun, email: hamsir@unkhair.
(2) Department of Medicine, Faculty of Medicine and Health Sciences, Universitas Khairun, Address: Kampus 2 Universitas Khairun, Jl. Jusuf Abdulrahman, Kel. Gambesi, Kec. Ternate Selatan, Kota Ternate, email: sriyati@unkhair.

Article history: Received Oct 21, 2024 | Revised Feb 11, 2025 | Accepted May 17, 2025

Abstract -- Congenital heart disease (CHD) is the most common congenital defect and still contributes significantly to neonatal morbidity and mortality. Traditional unimodal methods based on echocardiography or ECG alone are often unable to characterize the complex, multifunctional, and multifactorial cardiac pathologies seen in neonates. This paper presents an explainable multimodal deep learning framework that draws on four diverse sources of clinical data: echocardiogram videos, ECG signals, other physiological signals, and structured electronic health record (EHR) data. We propose a late-fusion Transformer architecture built on self-attention mechanisms. The model is trained and validated on benchmark datasets that are transparently and reproducibly available (EchoNet-Dynamic, MIMIC-IV, PhysioNet Capnobase, and MIT-BIH). The proposed model improves on existing benchmarks, achieving 93% accuracy, 95% sensitivity, and a 0.96 area under the ROC curve. Interpretability modules showed that the features most valuable to the model's decisions correspond to diagnostic indicators that are critically relevant in neonatal care.
Moreover, the model shows consistent performance across several data sources and distribution shifts. The research illustrates the use of explainable deep learning architectures for automated early detection of heart defects in newborns. Future work includes validation through clinical studies and multilingual electronic health record integration.

Keywords: CHD, Multimodal, Data Fusion, AI, Deep Learning

This is an open access article under the CC BY-NC-SA 4.0 license.

INTRODUCTION

Congenital Heart Disease (CHD) is a morphological abnormality of the heart that arises during the embryonic period and is the leading cause of neonatal death from congenital abnormalities. With an estimated global prevalence of around 8 per 1,000 live births and more than 1.35 million new cases per year worldwide, CHD remains a major challenge for health care systems, especially in developing countries. Early detection and diagnosis of CHD are key to reducing morbidity and mortality, because most forms of CHD require medical intervention within a very short time after birth. In current clinical practice, the diagnosis of CHD relies heavily on manual evaluation of various types of data, such as echocardiography results, electrocardiogram (ECG) signals, and the patient's clinical and laboratory data. Unfortunately, this approach not only depends on the subjectivity of clinician interpretation but is also partial, because it often relies on a single type of data. In fact, congenital heart disease is multifactorial, involving structural, functional, and molecular aspects that interact with each other. The limitations of such partial diagnostic approaches can lead to longer times to diagnosis, misclassification of cardiac abnormalities, or even failure to detect the disease.
Over the past ten years, rapid developments in artificial intelligence (AI), especially deep learning, have brought new hope for detecting and classifying congenital heart disease. AI shows great potential for improving diagnostic capability, whether in reading electrocardiogram (ECG) results, analyzing echocardiography images, or processing physiological signals and electronic health record (EHR) data. Combining several types of data through multimodal data fusion is considered capable of capturing the physiological and morphological complexity of the heart in more depth than any single data type. For example, combining ECG and echocardiogram data has been shown to increase diagnostic accuracy by 12-15% compared with single-modality methods. Information from an electrocardiogram (ECG), which depicts the electrical activity of the heart, has long been the primary way to determine whether cardiac arrhythmias or other malformations are present. However, ECGs are less able to detect complex malformations, making them less suitable for use alone. Echocardiography, on the other hand, can provide a direct picture of the heart, but its results depend heavily on the expertise of the examiner and the availability of the equipment. Vital signs such as heart rate, blood oxygen level, and blood pressure can provide important clues about a patient's circulatory status, which often changes in CHD. Finally, EHRs contain longitudinal patient data that are very helpful for decisions based on medical history. Combining these four sources in an AI-based data fusion system could produce more complete and adaptable estimates. Several recent studies have shown that this approach is quite effective: in a study by Khan et al., a multimodal deep learning system reached an accuracy of 94.2% in distinguishing simple from complex CHD. Meanwhile, various technological innovations such as the Transformer architecture, attention-based fusion, and federated learning have begun to address remaining challenges by offering systems that are more manageable, scalable, and secure with respect to data privacy. Multimodal AI has great potential for diagnosis in newborns. For example, a CNN-LSTM model equipped with Grad-CAM has achieved sensitivity above 92%, and the combination of physiological signal data and EHR can further increase the accuracy of predicting structural heart abnormalities. In patients with Tetralogy of Fallot, a multimodal sequential model has proven effective in predicting complications with accuracy up to 93%. An attention-fusion architecture likewise outperformed conventional methods, with an F1 score roughly 7% higher. However, challenges such as interoperability and data synchronization between modalities still limit system performance, and ethical issues and potential bias also need to be addressed systematically. A systematic review concluded that an average AUC of 0.91 can only be obtained if the quality of labeling and the standardization of clinical data are truly maintained. This confirms that the success of a multimodal model depends not only on its technical performance but also on its readiness for implementation in everyday clinical practice.

This study aims to design and rigorously test an artificial intelligence (AI) model that combines multiple types of data. The model draws on four key data sources, namely ECG results, echocardiography, physiological signals, and EHR, to detect CHD in newborns. The study prioritizes not only how accurate the model's predictions are, but also how easy the model is to understand, how relevant it is to clinical practice, and how ready the system is for use in the NICU (Neonatal Intensive Care Unit).

DOI: 10.33387/ijeeic
With this approach, it is hoped that the resulting system will make a significant contribution to reducing morbidity and mortality in newborns caused by late diagnosis of CHD.

METHOD

System Design and Multimodal Data Acquisition

In this study, we combined different types of clinical data with the help of artificial intelligence (AI). There are four main data sources: cardiac electrical recordings (ECG), cardiac images from ultrasound (echocardiography), live physiological data (such as heart rate, breathing rate, and oxygen level), and patient health records from hospitals (EHR). All of these data were taken from trusted open databases that are widely used by researchers around the world for AI research and development. In more detail, we obtained ECG data and physiological data from PhysioNet, including datasets such as the PTB Diagnostic ECG Database, the MIMIC-III Waveform Database, and VitalDB. These databases provide ECG data and physiological parameters in great detail. For cardiac images, we used the CAMUS and EchoNet-Dynamic datasets. Both are widely used databases containing cardiac ultrasound images and videos that have been examined and annotated by cardiologists. Meanwhile, we took patient health record data from the MIMIC-IV Clinical Database, which has complete records of the health history of pediatric patients treated in intensive care units (NICU and PICU). We aligned all of these data in time so that they could be processed simultaneously. Before processing, we first cleaned the data: we scaled the signal data to a common range (z-score), resized the heart images to 224x224 pixels, and converted the health record text into a computer-readable format using the BioBERT tokenizer, which was then converted into tensor embeddings. By using this open data, our research becomes more transparent and easier for other researchers to reproduce.
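The two signal- and image-cleaning steps described above (z-score scaling and resizing to 224x224) can be sketched in a few lines. This is a minimal NumPy sketch with synthetic data; the channel count, signal length, and frame size are illustrative assumptions, not values from the paper.

```python
import numpy as np

def zscore(x, eps=1e-8):
    # Scale each channel to zero mean and unit variance: (x - mu) / sigma
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def resize_nn(img, out_h=224, out_w=224):
    # Nearest-neighbour resize to the 224x224 input expected by the image branch
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

ecg = np.random.randn(4, 5000) * 3.0 + 1.5   # synthetic 4-channel signal
frame = np.random.rand(300, 400)             # synthetic echo frame
print(zscore(ecg).shape, resize_nn(frame).shape)   # (4, 5000) (224, 224)
```

In practice a proper image library (with interpolation and anti-aliasing) would replace `resize_nn`; the point here is only the shape contract between preprocessing and the model branches.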
In addition, the results of this study are expected to be applicable across a variety of clinical conditions. Figure 1 is a visualization of the proposed system. In the design shown in Figure 1, each part of the architecture has a specific role in handling a specific type of information (ECG, echo, physiological data, and EHR). This allows the system to learn a representation for each type of data separately, without mixing it with the others. This approach is much better than combining all the data at the beginning, because it avoids the problems of dimensional mismatch and temporal misalignment. A 1D CNN was chosen for ECG data and physiological signals because of its ability to recognize small patterns and frequency abnormalities. Meanwhile, a 2D CNN (ResNet-50) pre-trained and fine-tuned on echo images ensures effective transfer learning from the field of clinical imaging, with fast convergence and high accuracy. A bidirectional LSTM was chosen to capture the temporal context of the signal in both directions, which is very important for detecting changes in cardiac cycle dynamics. Finally, the combination of BioBERT and a Transformer allows processing of unstructured clinical data from the EHR while maintaining global and local context efficiently.

Figure 1. Multimodal AI Framework Architecture for CHD Diagnosis

2.2 Data Processing

2.2.1 ECG and Physiological Signal Processing with 1D CNN

ECG signals, along with physiological signals such as heart rate (HR), blood pressure (BP), respiratory rate (RR), and oxygen saturation (SpO2), are time-series data that describe the body's continuous electrophysiological activity. These signals exhibit non-linear, oscillatory, and periodic patterns and are highly responsive to structural and functional abnormalities of the heart. Therefore, a one-dimensional Convolutional Neural Network
(1D-CNN) was chosen as the main method to process these signals, due to its ability to recognize local features in the time domain while remaining invariant to time shifts.

1D CNN Structure for ECG and Physiological Signals

The 1D CNN model is built from several convolutional layers arranged sequentially, each followed by an activation function (ReLU), a normalization layer (BatchNorm), and a subsampling layer (MaxPooling). The structure applied is as follows:
- Conv1D layer 1: 64 filters, kernel size = 5, stride = 1
- ReLU, BatchNorm, MaxPool (pool size = 2)
- Conv1D layer 2: 128 filters, kernel size = 3
- ReLU, MaxPool
- Flatten, Dense layer (fixed-size output)

The result of the 1D CNN network is a feature representation vector with a fixed dimension, which is then forwarded to the multimodal fusion branch. 1D convolution works by gradually shifting filters (also called kernels) along the time dimension of the signal. For an input signal x ∈ R^T and a convolution filter of size k, the output h_i^(l) at position i in layer l is calculated as:

    h_i^(l) = φ( Σ_{j=0}^{k-1} w_j^(l) · x_{i+j}^{(l-1)} + b^(l) )

where:
- w_j^(l): weight of the j-th kernel element in layer l
- x_{i+j}^{(l-1)}: input from the previous layer at position i + j
- b^(l): bias for the kernel in layer l
- φ: non-linear activation function (ReLU)

This calculation runs in parallel for all available filters and across all time positions, resulting in an output tensor of dimension (T - k + 1, F), where F is the number of filters used. To ensure the stability of the learning process, batch normalization is applied after each convolutional layer, through the following transformation:

    x̂_i = (x_i - μ_B) / sqrt(σ_B² + ε),    y_i = γ x̂_i + β

where:
- μ_B and σ_B²: mean and variance of the mini-batch
- γ, β: learned parameters
- ε: small constant for numerical stability

Additionally, Dropout
is applied in the final layer of the network to reduce co-adaptation between neurons and avoid overfitting.

Multivariate Data Processing: Parallel Temporal Fusion

Physiological signals consist of multiple channels (heart rate, blood pressure, RR interval, oxygen saturation), so all of these signals are combined into an input tensor of dimension (C, T), where C is the number of channels and T is the time duration. The model applies multi-channel convolution, with kernels operating across all channels:

    h_i^(l) = φ( Σ_{c=1}^{C} Σ_{j=0}^{k-1} w_{c,j}^(l) · x_{c,i+j}^{(l-1)} + b^(l) )

This allows the model to automatically capture relationships between signals, such as synchrony between cardiac and respiratory activity, a crucial aspect in diagnosing hemodynamic disorders in congenital heart disease.

2.2.2 Echocardiogram Image Processing with 2D CNN (ResNet-50)

The echocardiogram is an important modality in diagnosing CHD because it provides a direct visual representation of heart structure, valve movement, and blood flow in real time. However, the main limitations of echocardiography lie in data variability, operator dependence, and spatial-temporal complexity, especially in neonates with small and dynamic heart structures. Therefore, an automatic approach is needed that generalizes across heart structural patterns accurately and robustly. To address this challenge, a 2D Convolutional Neural Network (2D CNN) based on ResNet-50, pretrained on ImageNet and fine-tuned on the public EchoNet-Dynamic and CAMUS datasets, is used. This model transfers learning from the general imaging domain to the medical domain, which speeds up training convergence and improves prediction stability.

ResNet-50 Architecture and Its Modifications

The architecture consists of fifty layers in a bottleneck configuration, with Convolution-BatchNorm-ReLU building blocks, residual connections, and pooling layers.
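The 1D-CNN signal branch described above (Conv1D 64/k5, ReLU, BatchNorm, MaxPool 2, Conv1D 128/k3, ReLU, MaxPool, Flatten, Dense) can be sketched in PyTorch. This is a minimal sketch, not the paper's exact implementation: the 4 input channels and the 256-dimensional output of the dense layer are assumptions, since the paper states only that the output has a fixed dimension.

```python
import torch
import torch.nn as nn

class SignalBranch(nn.Module):
    """Sketch of the 1D-CNN branch for ECG/physiological signals."""
    def __init__(self, in_channels=4, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, stride=1),  # Conv1D layer 1
            nn.ReLU(),
            nn.BatchNorm1d(64),
            nn.MaxPool1d(2),                                      # pool size = 2
            nn.Conv1d(64, 128, kernel_size=3),                    # Conv1D layer 2
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # LazyLinear infers the flattened size from the first forward pass
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))

    def forward(self, x):           # x: (batch, channels, time)
        return self.head(self.features(x))

z = SignalBranch()(torch.randn(2, 4, 1000))
print(z.shape)                      # fixed-size embedding for the fusion branch
```

Because the convolutions operate over all C input channels jointly (as in the multi-channel equation above), cross-signal relationships such as cardio-respiratory synchrony are captured by the same kernels.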
This architecture helps the model overcome the vanishing gradient problem and enables it to learn multi-scale spatial representations of echocardiogram frames. The model was adapted to 224x224-pixel grayscale medical images: the images were replicated to 3 channels (dummy RGB) for compatibility with the ImageNet-pretrained model, the final layer was modified for binary classification (CHD yes/no), and Global Average Pooling was added to reduce overfitting. The 2D convolution between a filter W ∈ R^{k×k} and an input image X ∈ R^{H×W} is computed at spatial position (i, j) as:

    h_{i,j} = φ( Σ_{m=0}^{k-1} Σ_{n=0}^{k-1} W_{m,n} · X_{i+m, j+n} + b )

where φ denotes the ReLU activation function and b is a scalar bias. The result h_{i,j} represents the filter output at spatial position (i, j). After padding and striding, the first convolutional layer transforms the input from (224, 224, 3) to (112, 112, 64) using several filters. Each residual block then increases the feature dimensionality while decreasing spatial resolution through downsampling.

Transfer Learning and Fine-Tuning in the Medical Domain

Transfer learning is applied by initializing the model with pre-trained ImageNet weights. Fine-tuning is performed on the top two residual blocks, based on the assumption that lower CNN layers capture generic features (e.g., edges, textures), while higher layers learn domain-specific representations, such as anatomical structures (valves, septum, etc.). The model is optimized using the binary cross-entropy loss function:

    L = -(1/N) Σ_{i=1}^{N} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ],    ŷ_i = sigmoid(f(x_i))

Here, f(x_i) is the output (logit) of the final layer, passed through a sigmoid activation to yield the probability of the positive class. This setup is well suited for binary classification tasks such as CHD detection.
To support medical interpretability, Gradient-weighted Class Activation Mapping (Grad-CAM) is used to highlight the regions of the image that contribute most to the model's predictions. This helps verify whether the model is focusing on clinically relevant cardiac structures. Mathematically, the Grad-CAM heatmap for class c is given by:

    L^c_{Grad-CAM} = ReLU( Σ_k α_k^c A^k ),    α_k^c = (1/Z) Σ_i Σ_j ∂y^c / ∂A^k_{i,j}

where:
- A^k: the k-th feature map from the convolutional layer
- y^c: output score for class c
- α_k^c: importance weight of feature map k for class c
- Z: normalization factor (total number of spatial locations)

This visualization is crucial in a clinical setting, allowing cardiologists to confirm that the model attends to anatomically significant regions, such as the atria, ventricles, or septum, when predicting congenital heart disease (CHD).

2.2.3 Processing EHR with BioBERT and a Transformer Encoder

EHRs provide critical information for diagnosing neonatal CHD, including maternal medical history, lab results, prenatal data, and unstructured clinical narratives (e.g., symptom notes or auscultation findings). However, EHR data is inherently unstructured, heterogeneous, and often semantically ambiguous, presenting significant challenges for direct computational analysis. To address these challenges, we adopt a Transformer-based Natural Language Processing (NLP) pipeline, which begins with BioBERT for contextual embedding and continues with a stack of Transformer Encoder layers to derive comprehensive clinical representations. BioBERT is a domain-specific extension of the original BERT architecture, pre-trained on biomedical corpora such as PubMed abstracts and PMC articles. It excels at capturing medical terminology and the contextual language of clinical narratives. Each EHR sentence is tokenized with a WordPiece tokenizer, then converted into a dense matrix X ∈ R^{T×d}, where:
- T: token sequence length
- d: embedding dimension
(typically 768 for BioBERT-base).

Transformer Encoder Architecture

The output embeddings are passed through Transformer Encoder blocks, each consisting of Multi-head Self-Attention and a Feedforward Neural Network (FFN).

Multi-head Self-Attention

Self-attention allows the model to learn dependencies between all tokens in a sequence:

    Attention(Q, K, V) = softmax( Q K^T / sqrt(d_k) ) V

with Q = X W^Q, K = X W^K, V = X W^V, where W^Q, W^K, W^V ∈ R^{d×d_k} are learned projection matrices. This mechanism enables the model to capture clinical correlations, e.g., linking "maternal heart condition" with "neonatal cyanosis."

Multi-head Attention Composition

To enrich semantic understanding, multiple attention heads are used:

    MultiHead(X) = Concat(head_1, ..., head_h) W^O

Each head processes information from a different subspace of the embedding.

Feedforward and Residual Layers

Each token then passes through a position-wise feedforward network:

    FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

with residual connections and layer normalization:

    LayerNorm(x + Sublayer(x))

These layers enhance nonlinear feature transformations while stabilizing gradients.

Final Representation and Fusion

The final output is Z ∈ R^{T×d}. The [CLS] token embedding, z_[CLS] ∈ R^d, serves as a summary vector for the EHR and is later fused with the outputs of the other modalities (e.g., imaging) for final classification.

Ethical Preprocessing of EHR

Before model input,
EHR data undergoes strict preprocessing:
- De-identification: patient names, birth dates, and IDs are removed
- Terminology normalization: ICD codes and synonyms are standardized
- Controlled tokenization: only BioBERT's vocabulary is used, to prevent out-of-vocabulary errors

2.5 Model Training and Performance Evaluation

2.5.1 Choosing the Right Loss Function

Since our goal is to classify neonatal CHD as a binary outcome (yes/no), we use binary cross-entropy loss, which measures how close the model's predicted probability ŷ_i ∈ (0, 1) is to the actual label:

    L = -(1/N) Σ_{i=1}^{N} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]

where:
- L: total loss averaged across all samples
- N: number of training examples
- y_i: true label for sample i (0 for normal, 1 for CHD)
- ŷ_i: predicted probability for sample i, where ŷ_i = σ(f(x_i))
- σ: sigmoid activation function
- f(x_i): output (logit) from the neural network before activation

This function penalizes incorrect predictions and encourages confident, correct outputs. It is convex when combined with the sigmoid function, ensuring stable training.

2.5.2 Optimization Strategy Using Adam

To minimize the loss, we use Adam, a robust optimizer that adapts learning rates during training using first and second moments of the gradients. The update rules are:

    m_t = β1 m_{t-1} + (1 - β1) g_t
    v_t = β2 v_{t-1} + (1 - β2) g_t²
    m̂_t = m_t / (1 - β1^t),    v̂_t = v_t / (1 - β2^t)
    θ_{t+1} = θ_t - η · m̂_t / ( sqrt(v̂_t) + ε )

where:
- g_t: gradient of the loss at step t
- η: learning rate
- β1, β2: momentum terms for the 1st and 2nd moments (defaults: 0.9, 0.999)
- ε: small constant to avoid division by zero (e.g., 10^-8)

2.5.4 Explanation of Evaluation Metrics

Each evaluation metric has a clear clinical interpretation.

AUC-ROC

    AUC = ∫ TPR( f^{-1}(FPR) ) d(FPR)

- TPR: True Positive Rate (Sensitivity)
- FPR: False Positive Rate
- f^{-1}: inverse threshold function

Interpretation: measures how well the model separates CHD from non-CHD cases across all thresholds.

Sensitivity (Recall)

    Sensitivity = TP / (TP + FN)

- TP: True Positives (correctly predicted CHD)
- FN: False Negatives (missed CHD cases)

Specificity

    Specificity = TN / (TN + FP)

- TN: True Negatives (correctly predicted healthy)
- FP: False Positives (healthy misclassified as CHD)

Precision

    Precision = TP / (TP + FP)

Shows how many positive predictions are correct.

F1 Score

    F1 = 2 · (Precision · Recall) / (Precision + Recall)

Balances precision and recall, which is especially important with imbalanced datasets.

RESULTS AND DISCUSSION

The proposed multimodal deep learning architecture integrates four key modalities: ECG, echocardiography, physiological signals, and electronic health records (EHR). This approach has much more to offer than a single-modality model: it draws on different types of data and captures all the information relevant to the patient's health by exploiting the unique strengths of each modality. Specifically, the ECG delivers vital information about the electrical activity of the heart, which is crucial for diagnosing conditions such as arrhythmias. Echocardiography offers images of the internal structures of the heart and is therefore relevant for diagnosing valve abnormalities or damage to the heart walls. Physiological signs such as heart rate, blood pressure, and oxygen saturation reflect rapid changes over time in the patient's current situation, providing important information about acute health changes.
EHRs are important too: rich historical data such as medical history, past treatments, medications, and clinical notes offer essential context for how the other data should be interpreted. Multimodal techniques open new possibilities for a more complete understanding of patient health while improving the precision of the diagnosis formed from the assembled data.

3.1 Datasets

In this work, the datasets used are public datasets derived from EchoNet-Dynamic, MIMIC-IV, PhysioNet, and MIT-BIH. These datasets serve to train and validate the model. They were specifically selected because they comprehensively cover cases of neonatal heart disease, spanning modalities such as echocardiography images, ECG signals, other physiological measurements, and EHR files. Representing neonatal CHD with diverse data sources is very helpful for improving model performance and generalization across heterogeneous clinical settings.

EchoNet-Dynamic contains more than 10,000 dynamically annotated echocardiogram video clips. It focuses on echocardiography videos only and contains rich temporal information about the dynamic activity of the heart. For models that operate on visual input, such as deep learning models built on CNN architectures, the EchoNet-Dynamic dataset is essential to fully exploit the training potential it provides. With this dataset, the model is able to monitor and analyze the motion and structure of the heart to diagnose CHD in neonates. Echocardiography is one of the basic diagnostic methods used in primary care for newborns, which makes this dataset very important.

MIMIC-IV: this dataset offers more than 200,000 anonymized records of ICU patients, with critical care demographic information. For this analysis, neonatal records were screened to obtain data related to CHD diagnosis.
It contains a wealth of clinical data, such as vital parameters, laboratory results, medication records, and diagnosis codes, making it an excellent candidate for EHR integration. In addition, MIMIC-IV contains temporal information that is important for assessing how the neonate's condition changes over time, helping the AI to hypothesize signs of acute or chronic CHD.

PhysioNet: this repository collects a variety of physiological signals, including HRV, blood pressure, and respiratory rate. Of interest for expanding the model's physiological feature set are the high-frequency respiratory and cardiac data in the PhysioNet/CapnoBase dataset. These signals capture real-time information about the neonate's vital signs, facilitating the detection of CHD symptoms such as arrhythmias, abnormal heart rates, or oxygen desaturation.

MIT-BIH Arrhythmia Database: this dataset, commonly used in ECG studies, was also used to help train the model to detect ECG abnormalities. It consists of ECG recordings from 48 patients, with over 100,000 annotated beats in total. In this study, it provided useful training data on arrhythmias, which are quite common in CHD cases. Together with the other data types, it allowed the model to improve its detection of heart rhythm abnormalities and its differentiation of normal and abnormal beats, which is essential for accurate CHD diagnosis.

The integration of these four datasets enhances the capabilities of the multimodal AI model by allowing it to use both visual (echocardiography) and temporal signals (ECG, physiological signals, and EHR), creating a more comprehensive framework for neonatal CHD diagnosis. The variability of these datasets improves not only the accuracy of the model but also its generalizability to other clinical settings, which is important for practical use in neonatal intensive care units (NICUs).
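The four sources sample at very different rates (the MIT-BIH ECG recordings are digitized at 360 Hz, while vital-sign trends are far slower), so before fusion the streams must be aligned on a shared timeline, as described in the METHOD section. A minimal NumPy sketch using linear interpolation; the 1 Hz vitals rate, duration, and synthetic heart-rate trace are illustrative assumptions.

```python
import numpy as np

def resample_to(t_src, x_src, t_target):
    # Linearly interpolate a signal onto a target time grid
    return np.interp(t_target, t_src, x_src)

fs_ecg, fs_vitals, seconds = 360, 1, 10          # 360 Hz ECG vs. 1 Hz vitals
t_ecg = np.arange(seconds * fs_ecg) / fs_ecg
t_vitals = np.arange(seconds * fs_vitals) / fs_vitals

hr = 120.0 + 5.0 * np.sin(t_vitals)              # synthetic heart-rate trace
hr_on_ecg_grid = resample_to(t_vitals, hr, t_ecg)
print(hr_on_ecg_grid.shape)                      # (3600,): aligned with the ECG
```

Linear interpolation is only one alignment choice; windowed averaging or timestamp-based joins would serve the same purpose of giving every modality a common clock before the fusion layer.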
Furthermore, the open nature of these datasets ensures that the AI model's results are reproducible and auditable, critical factors in maintaining trust and driving clinical integration.

3.2 Performance Evaluation of the Multimodal Model

The confusion matrix illustrated in Figure 2 shows the performance of the multimodal AI model for diagnosing congenital heart disease (CHD) in newborns. The findings are derived from a set of 200 test samples in which the model predicted the presence or absence of CHD using echocardiography, ECG, physiological signals, and EHR data. As shown in the confusion matrix, the model accurately classified true cases of CHD and non-CHD (True Positives and True Negatives, respectively). The model correctly identified 90 true CHD cases and 40 non-CHD cases, giving a True Positive Rate (TPR) of 94.7%. This is a significant achievement, as the model detected a large proportion of diagnosed CHD cases, which helps improve survival and reduce morbidity associated with undiagnosed CHD in newborns. However, like any clinical system, there are trade-offs, and False Negatives (FN) pose a significant challenge. In this model, there were 5 FNs: cases where the model failed to detect CHD when it was actually present. This equates to a False Negative Rate (FNR) of about 5.3%, which, while low, is still risky, especially in a high-risk medical setting. In the context of CHD, false negatives can lead to delayed surgical intervention, significantly increasing the risk of severe complications. Minimizing false negatives is therefore the most critical task in automation and a very important focus area. The model, as it stands, has a low false-negative rate, but validation in real-world clinical settings offers opportunities to lower it even further.

Relatively speaking, false positives (FPs)
are more bearable in a clinical scenario. For example, a false positive rate (FPR) of 2.5% means that the model occasionally misdiagnoses neonates who do not have CHD as positive cases, leading to follow-up checks such as repeat echocardiography or advanced imaging studies. Such instances are not life-threatening, because these cases are normally followed by further confirmatory tests, ensuring that unnecessary actions are not taken. With a True Negative Rate (TNR) of 91.7%, the model correctly declares non-CHD cases as negative, demonstrating its effectiveness in avoiding unnecessary surgeries or interventions on non-CHD patients, an added benefit. High specificity also contributes to a substantial reduction in over-utilization, leading to lower medical costs and improved resource allocation.

Figure 2. Confusion Matrix of the Proposed Multimodal Model

Table 1. Diagnostic Performance Metrics of the Proposed Multimodal Model

    Modality              | Accuracy | Sensitivity | Specificity | AUC
    ----------------------|----------|-------------|-------------|------
    ECG                   |   83%    |     --      |     --      |  --
    Echocardiography      |   88%    |     --      |     --      |  --
    Physiological Signals |   --     |     --      |     --      |  --
    EHR                   |   --     |     --      |     --      |  --
    Multimodal (Fusion)   |   93%    |     95%     |     91%     | 0.96

Table 1 shows the diagnostic performance metrics of the proposed AI model for CHD detection in neonates; the model clearly outperformed the single modalities (ECG, echocardiography, physiological signals, and EHR). The model utilizing all four data sources achieved 93% accuracy, 95% sensitivity, 91% specificity, and an AUC of 0.96. These results surpass the single-modality baselines, which reached 88% accuracy for echocardiography and 83% for ECG.
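The rates and metrics discussed above follow directly from the confusion-matrix counts defined in the evaluation section. A minimal sketch: TP = 90 and FN = 5 are the counts reported for Figure 2, while the TN/FP split used here is an illustrative assumption chosen to reproduce the quoted 91.7% TNR.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    # Ratios from the evaluation section, computed from raw counts
    sensitivity = tp / (tp + fn)              # TPR: CHD cases caught
    specificity = tn / (tn + fp)              # TNR: healthy correctly cleared
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    fnr = fn / (tp + fn)                      # missed CHD cases
    fpr = fp / (tn + fp)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f1": f1, "fnr": fnr, "fpr": fpr}

# TP = 90, FN = 5 as reported; TN = 44, FP = 4 are illustrative placeholders.
m = diagnostic_metrics(tp=90, fn=5, tn=44, fp=4)
print(round(m["sensitivity"], 3), round(m["fnr"], 3), round(m["specificity"], 3))
# 0.947 0.053 0.917
```

Sensitivity and FNR are complements over the same denominator (TP + FN), which is why a high TPR directly bounds the missed-case rate that matters most in the NICU setting.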
Not only does the addition of physiological signals and EHR data enhance the model's overall classification of positive and negative CHD cases; it also highlights the clear benefit of combining multiple data types. The multimodal model's high sensitivity (95%) ensures that nearly all cases of CHD are captured with very little chance of false negatives, which is extremely important in neonatal care, as undetected CHD poses a severe risk to life due to complications. The same applies to the specificity (91%): the model rarely misidentifies non-CHD cases, thereby protecting healthy neonates from unnecessary procedures or treatments. The AUC of 0.96 further emphasizes the strength of the multimodal model in discriminating between CHD and non-CHD patients, and demonstrates its dependability across different decision thresholds, which is important in practice. These results matter greatly in clinical settings such as Neonatal Intensive Care Units, where CHD diagnostic precision correlates directly with patient outcomes. A multimodal approach is particularly useful because it combines patient data that is both dynamic and static: physiological signals provide vital-sign information, echocardiography supplies structural heart data, the ECG captures cardiac rhythm in real time, and the EHR contributes contextual, chronologically ordered clinical history. This allows the model to continuously refine its predictions as a neonate's condition evolves, rather than relying on static data alone.
Employing all these different data types enables the model to improve sensitivity and specificity at the same time, which is especially important for timely and accurate CHD diagnosis, thereby reducing mortality and morbidity in neonatal care.

Table 2. Performance Comparison with Baseline Studies

Study                     Modalities Used        Accuracy   Sensitivity   Specificity   AUC
This Study                ECG, Echo, Phys, EHR   93%        95%           91%           0.96
Jacquemyn et al.          Echo, ECG              88%        89%           –             0.91
Kabir et al.              ECG, EHR (DL-based)    85%        –             –             0.87
Bizopoulos & Koutsouris   ECG only               81%        79%           –             –

In Table 2, we compare the performance of the multimodal AI model developed for the diagnosis of neonatal CHD against baseline studies that used AI-based approaches for CHD detection. The table recapitulates the major performance indicators (accuracy, sensitivity, specificity, and AUC), which are critical benchmarks for assessing the diagnostic value of a machine learning model. With the integration of ECG, echocardiography, physiological signals, and EHR, the proposed model achieved 93% accuracy, 95% sensitivity, 91% specificity, and an AUC of 0.96, outperforming previous studies. The multimodal fusion approach has a clear advantage over earlier models that used fewer modalities. For example, Jacquemyn et al., who used echocardiography and ECG, reported an accuracy of 88%, a sensitivity of 89%, and an AUC of 0.91, lower than the 93% accuracy and 0.96 AUC achieved by this study. Incorporating EHR data and additional physiological signals allows for better detection of non-structural CHD, which is often missed by imaging modalities alone, and contributed to the appreciably higher AUC (0.96 versus 0.91). This is remarkable because non-structural forms of CHD tend to be functional rather than anatomical and are difficult to detect using echocardiography and ECG in isolation.
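The AUC values compared in Table 2 can be read as a ranking statistic: the probability that a randomly chosen CHD case receives a higher predicted risk score than a randomly chosen non-CHD case (the Mann-Whitney formulation). A minimal sketch with toy scores, not study data:

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (CHD, non-CHD) pairs ranked correctly,
    counting ties as half a win (Mann-Whitney statistic)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy example: one tied pair among nine pairs.
toy_auc = auc_from_scores([0.9, 0.8, 0.7], [0.2, 0.4, 0.7])
```

An AUC of 0.96 thus means that in roughly 96 of 100 such pairs the model scores the CHD neonate higher, regardless of where the decision threshold is placed.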
The use of real-time dynamic physiological signals and EHR data enables this study to achieve greater diagnostic sensitivity than imaging-based approaches alone. Similarly, Kabir et al. achieved an accuracy of 85% and an AUC of 0.87 using ECG and EHR with deep learning models. Although that model benefited from electronic health data, it lacked real-time physiological data and imaging, which are essential for detecting the subtle structural anomalies crucial for diagnosing CHD. The lack of imaging information may have led to lower sensitivity in more complicated cases, where structural anomalies were less apparent and could not be identified through ECG or EHR alone. Unlike these models, the one used in this study incorporates multimodal fusion with attention-based late fusion, in which the importance of each modality can be adjusted for each case, leading to better diagnostic performance. This improves the weighting of each modality and ensures that the data most relevant to CHD is prioritized. Finally, Bizopoulos and Koutsouris, using only ECG, recorded the lowest metrics, with 81% accuracy, 79% sensitivity, and the lowest AUC among the compared models. Although the ECG remains an important diagnostic tool for detecting arrhythmia and other functional abnormalities, it is often insufficient for capturing the complete spectrum of congenital heart disease, particularly structural defects, for which additional imaging modalities are required. Their study captures the shortcomings of unimodal models, which are efficient within a defined scope but do not provide the holistic diagnostic picture needed for thorough CHD screening. The proposed multimodal fusion model incorporates more data than all previous works, addressing both structural and functional components of CHD and markedly enhancing overall diagnostic accuracy, especially in subtle or complex cases.
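The attention-based late fusion described above can be sketched as a per-case softmax weighting over modality embeddings. The function below is a simplified illustration only: the modality names, embedding dimension, and dot-product scoring are assumptions, whereas the actual model uses a transformer architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_late_fusion(embeddings, scorers):
    """Fuse per-modality embeddings with per-case attention weights.

    embeddings: dict modality name -> 1-D feature vector of dimension d
    scorers:    dict modality name -> learned scoring vector of dimension d
    """
    names = list(embeddings)
    # One scalar relevance score per modality for this particular case.
    scores = np.array([embeddings[m] @ scorers[m] for m in names])
    weights = softmax(scores)                  # importance sums to 1
    fused = sum(w * embeddings[m] for w, m in zip(weights, names))
    return fused, dict(zip(names, weights))

# Hypothetical example with random embeddings for the four modalities.
rng = np.random.default_rng(0)
d = 8
mods = ["ecg", "echo", "phys", "ehr"]
emb = {m: rng.standard_normal(d) for m in mods}
sc = {m: rng.standard_normal(d) for m in mods}
fused, attn = attention_late_fusion(emb, sc)
```

Because the weights are recomputed per case, a neonate with a clear structural defect can have echocardiography dominate the fused representation, while a case with functional abnormalities can lean on the ECG and physiological signals.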
Discussion

Results from this study confirm that the multimodal AI model for diagnosing neonatal congenital heart disease (CHD) outperformed the existing unimodal and bimodal diagnostic models by a significant margin. The classifier's diagnostic performance was exceptional, with an accuracy of 93%, sensitivity of 95%, specificity of 91%, and AUC of 0.96, demonstrating that integrating multiple sources of patient data (echocardiograms, ECGs, physiologic signals, and EHR) yields superior outcomes. The results support the working assumption that combining different data types enhances diagnostic accuracy and scope, especially in complex conditions like CHD, which require merging structural and functional health data. Incorporating echocardiograms allows the model to evaluate structural defects, while real-time heart rate and oxygen saturation monitoring through ECG and physiologic signals provide crucial functional indicators of CHD. Additionally, the EHR data further improves the model by adding the patient's medical history, which is critical for evaluating a neonate's health comprehensively. These findings confirm previous studies showing that diagnostic accuracy in cardiovascular disease is improved by multimodal fusion models. This investigation demonstrates improved proficiency in diagnosing CHD relative to earlier studies: the baseline using echocardiography and ECG achieved 88% accuracy and an AUC of 0.91, results that, while promising, remain lower than those of the current multimodal model. Likewise, the study using ECG and EHR reported 85% accuracy and an AUC of 0.87, highlighting the importance of combining imaging with physiological and clinical data. The striking improvement in model performance in this study stems from adding physiological signals and EHR, which improved the identification of functional issues that imaging alone sometimes misses.
This supports the claim that non-structural congenital cardiac disease associated with functional issues, such as arrhythmias and irregular heartbeats, is often missed by purely structural imaging modalities such as echocardiography, just as structural defects are missed by the ECG alone. The addition of real-time physiological data enhances the model's sensitivity and specificity, yielding an overall improved diagnosis. However, the study does have several limitations. One primary constraint is the reliance on publicly available datasets such as MIMIC-IV, EchoNet-Dynamic, PhysioNet, and MIT-BIH, which, while extensive, may not capture all the variations of congenital heart disease found in neonates in real-world clinical settings. Some rare or atypical cases of CHD may be underrepresented, and data quality may vary across sources, influencing model performance. Issues such as missing data or inconsistent annotations may impair the model's ability to generalize across different populations. The multimodal AI model's computational intensity also limits practical use: real-time, multimodal data processing is expensive and not feasible in many under-resourced clinical environments with limited computational capabilities. Optimizing the model to reduce computational cost without degrading performance, particularly in resource-scarce settings, must become a focus of future work. The implications of these observations are significant for clinical practice, especially in NICUs, where CHD needs to be identified as early as possible. This diagnostic tool works alongside physicians and helps them detect CHD early so that interventions can be made in time, which in many cases can save lives. Prompt diagnosis of CHD is imperative, since delays can lead to life-threatening complications such as heart failure and even death. By integrating ECG, echocardiography, physiological signals,
EHR, and machine learning, the model provides an all-encompassing view of the neonate's condition that improves diagnosis and reduces the chance of healthy infants being subjected to unnecessary procedures. Trust in the system could be improved further with the addition of explainable AI (XAI), as clinicians would then understand the reasoning behind the model's outputs. With the decision-making transparency offered by explainable AI, physicians would be better able to accept the system into practice, which would likely enhance clinicians' trust in the decisions it supports. The addition of genomic or biomarker information as supplemental data types could improve the model's diagnostic efficacy, and this should be examined in future studies. Research into the natural history of congenital heart disease over time would also significantly enhance understanding of the model's ability to predict the long-term outcomes of newborn CHD. Additionally, the potential of federated learning, where data remains in its original location while AI models are collaboratively trained, is particularly relevant: this approach may help resolve the privacy concerns associated with the sensitive newborn health information that different clinical institutions hold. Finally, further refinements such as pruning or quantization would reduce the computational burden and increase the model's availability in resource-limited settings, expanding its use in clinical practice.

CONCLUSION

The developed multimodal AI model for diagnosing congenital heart disease (CHD) in neonates surpasses the limitations of earlier single- and dual-modality approaches. Using echocardiography, ECG, and other physiological signals as well as EHR data, the model achieved remarkable diagnostic proficiency, with accuracy, sensitivity, and AUC of 93%, 95%, and 0.96, respectively.
Such integration improves the model's detection of both functional and structural anomalies related to CHD, enhancing the reliability of the diagnoses given by the system. In the context of clinical practice, these results are invaluable, because early and precise diagnosis of CHD significantly improves patient prognosis. The real-time capability of the multimodal model enables clinicians to monitor the welfare of the neonatal heart, assisting them in timely decision-making that can raise survival rates. More work is, however, required to test the model in a clinical setting; it must be verified how well the model performs in different healthcare facilities and in populations with different demographics. Beyond these gaps, the model also faces challenges regarding the interpretability of its results, the quality of the available data, and the computational resources it requires. Solving these issues will make it possible to integrate this technology into clinical practice. Further research should target strengthening the model's robustness with more diverse datasets, including genomic information and longitudinal patient data. Additionally, clinician trust and adoption can be fostered through better model explainability, augmenting clinicians' confidence in the model's workings. Multimodal AI models can enhance the accuracy of CHD diagnosis in neonates, with a large impact on the quality of care at healthcare facilities.

ACKNOWLEDGEMENTS

We would like to thank the Institute for Research and Community Service (LPPM) of Universitas Khairun for the support and funding provided for this research through the Competitive Research Program for Higher Education (PKUPT) 2023.

REFERENCES