International Journal of Electrical and Computer Engineering (IJECE)
Vol. No. August 2025, pp. 3769-3778
ISSN: 2088-8708, DOI: 10.11591/ijece.

Indonesian speech emotion recognition: feature extraction and neural network approaches

Izza Nur Afifah1, Tri Budi Santoso2, Titon Dutono3
1 Department of Informatics and Computer Engineering, Politeknik Elektronika Negeri Surabaya, Surabaya, Indonesia
2 Department of Creative Multimedia Technology, Politeknik Elektronika Negeri Surabaya, Surabaya, Indonesia
3 Department of Electrical Engineering, Politeknik Elektronika Negeri Surabaya, Surabaya, Indonesia

ABSTRACT
This study explored the challenges of emotion recognition in Indonesian speech using deep learning techniques, addressing the complex nuances of emotional expression in spoken language that pose significant difficulties for automatic recognition systems. The research focused on the application of feature extraction methods and the implementation of convolutional neural networks (CNN) and a hybrid convolutional neural network-long short-term memory (CNN-LSTM) model to identify emotional states from speech data. By analyzing key features of speech signals, including mel frequency cepstral coefficients (MFCC), zero crossing rate (ZCR), root mean square energy (RMSE), pitch, and spectral centroid, the study evaluated the models' ability to capture both spatial and temporal patterns in the data. Testing was conducted using an Indonesian dataset comprising 200 samples. The CNN model, utilizing four features (MFCC, ZCR, RMSE, and pitch), and the CNN-LSTM model, which used three features (MFCC, ZCR, and RMSE), both achieved an emotion classification accuracy of approximately 88%. The results showed that the CNN-LSTM model achieved comparable performance with a simpler feature set compared to the CNN model.
This highlighted the significance of choosing the appropriate techniques in feature extraction and classification to enhance the accuracy of identifying emotions from speech data while also managing computational complexity.

Article history:
Received Aug 31, 2024
Revised Mar 26, 2025
Accepted May 24, 2025

Keywords: Cohen's Kappa; Convolutional neural networks; Long short-term memory; Mel-frequency cepstral coefficients; Speech emotion recognition

This is an open access article under the CC BY-SA license.

Corresponding Author:
Tri Budi Santoso
Department of Creative Multimedia Technology, Politeknik Elektronika Negeri Surabaya
Jalan Raya ITS, Keputih, Sukolilo, Surabaya, East Java 60111, Indonesia
Email: tribudi@pens.

INTRODUCTION
Speech communication is the simplest and most effective way people have to convey information. The importance of speech becomes evident when alternative communication methods, such as text messages or emails, are commonly used but can easily be misinterpreted. When we attempt to express emotions in writing, emojis often become necessary aids in text messaging. Thus, speech is the most effective method of communication in human life, as it carries a wealth of information through both linguistic and paralinguistic elements. The advancement of information and communication technology (ICT) has opened up new possibilities for how humans interact with computers. Given that understanding emotional states enhances interpersonal comprehension, there is a need to integrate this concept into computer systems. This idea inspired the establishment of speech emotion recognition (SER), a field focused on identifying and interpreting emotional states conveyed through speech. Many studies have been conducted to explore SER, but the topic still presents significant challenges. SER technology has potential uses across several fields, including healthcare, call centers, and education.
In healthcare, it can help in the diagnosis of psychological problems like depression, autism, and other mental disorders. In call centers, it helps measure customer satisfaction. In education, particularly in distance learning, it can enhance the learning experience. Despite its significant potential, challenges remain, such as the lack of diverse datasets, choosing the right features, and the choice of effective intelligent recognition techniques. The majority of SER research has focused on languages with abundant resources and widespread use, such as English or German. Although these studies have deepened our understanding of detecting emotions in speech, there remains a considerable gap in exploring resource-limited languages like Indonesian. In recent years, research on emotion detection in Indonesian speech has begun to emerge, covering areas such as emotion detection in films, recognition using acoustic and lexical features, and automatic emotion recognition. Despite Indonesian being spoken by over 200 million people, research attention in SER remains limited. The scarcity of corpora and standardized databases hampers the progress of SER research in Indonesian. Cross-lingual emotion recognition experiments have been conducted due to these limitations. In simple terms, SER consists of two primary components: feature extraction and classification. Feature extraction involves identifying characteristics related to emotion within speech signals. The goal is to extract emotional information from spoken language by converting the raw speech signals into relevant feature sets. SER frameworks divide characteristics into four categories: prosodic features, spectral features, voice quality features, and Teager energy operator (TEO)-based features. The challenge lies in choosing the most essential features that are able to differentiate between different emotions.
Mel-frequency cepstral coefficients (MFCC) are effective in capturing important spectral characteristics based on human perception of frequency, making them relevant for detecting spectrum changes associated with emotions. Zero crossing rate (ZCR) measures how frequently the value of the audio signal changes from above to below zero, providing information about the temporal aspects that may change with emotion. Root mean square energy (RMSE) measures the average energy of the speech signal, which can reflect varying sound intensity levels associated with emotions. Pitch measures the fundamental frequency of the speech, where changes in pitch are often linked to emotional variation. Spectral centroid measures the average frequency location within the spectrum, reflecting the brightness of the sound, which may change with different energy distributions due to emotions. Classification is the second crucial step in SER. It involves applying machine learning models to the extracted features to identify the emotions expressed in speech. There are two main approaches to SER classification: conventional classifiers and deep learning classifiers. Recent developments indicate that problems in SER are being addressed with more emphasis on machine learning techniques, especially deep learning approaches. Deep learning methods have demonstrated significant improvements in emotion recognition, offering advantages such as scalability, parameter tuning, and customizable functions. Several researchers have explored various neural network methodologies, including artificial neural networks (ANN), convolutional neural networks (CNN), deep neural networks (DNN), recurrent neural networks (RNN), and long short-term memory (LSTM). CNN and LSTM are increasingly recognized for SER tasks because they effectively capture temporal dependencies and spatial patterns in sequential data.
Based on the challenges faced in SER research for the Indonesian language, this study aims to address the gap by providing a comprehensive comparison of speech emotion recognition systems for determining emotional states. It evaluated the consistency and reliability of emotion labeling using Cohen's kappa, applied several feature extraction approaches including mel frequency cepstral coefficient (MFCC), zero crossing rate (ZCR), root mean square energy (RMSE), pitch, and spectral centroid, and combined these features with classification techniques based on CNN and LSTM. The structure of this paper is as follows: the study chronology, as well as the research design, methodology, dataset collection, feature extraction strategies, and classification algorithms, are covered in section 2. The research results appear in section 3 along with a comprehensive discussion, while section 4 provides the conclusion.

METHOD
In this research, the process of recognizing emotions in speech was organized into three main stages: data collection, feature extraction, and classification. During the data collection stage, a dataset of speech samples representing various emotional states was gathered, and inter-rater reliability analysis was employed to ensure consistent emotion labeling across different evaluators. Once the dataset was prepared, feature extraction was carried out to identify and process key characteristics of a speech signal. These extracted features were then used in the classification stage to accurately categorize the emotional states conveyed in the speech.

Int J Elec & Comp Eng, Vol. No. August 2025: 3769-3778

Data collection
The audio dataset for this research consists of speech recordings in Indonesian. The digital audio data was stored in WAV file format. The dataset includes recordings from 10 male and 10 female participants, aged 20 to 22 years.
Each audio recording lasts between one to three seconds, with each participant contributing four recordings per emotion. Not all recordings, however, were suitable or usable due to factors such as poor audio quality or inconsistency in the emotional expression. A total of 50 audio files were used for each of the four emotional expressions (angry, happy, neutral, and sad), resulting in approximately 200 audio files in total. The dataset contained recordings with sampling rates that varied from 1 to 48 kHz. To ensure consistency for audio analysis, all files were resampled to 48 kHz, preserving high audio quality. To evaluate the consistency and reliability of emotion labeling, Cohen's Kappa analysis was conducted. This statistical method provides deeper insight into the agreement levels between annotators, using a scale from -1 to 1. A value of -1 represents complete disagreement, 0 indicates random agreement, and 1 reflects perfect agreement. Table 1 contains Cohen's Kappa values and associated interpretations.

Table 1. Interpretation of Cohen's Kappa
Cohen's Kappa statistic    Strength of agreement
< 0.00                     Poor
0.00 - 0.20                Slight
0.21 - 0.40                Fair
0.41 - 0.60                Moderate
0.61 - 0.80                Substantial
0.81 - 1.00                Almost perfect

Feature extraction
One of the most important steps in processing speech data for emotion classification is feature extraction. This process involves transforming raw audio signals into relevant features for analysis. The techniques employed in this study were chosen to capture both temporal and spectral characteristics of speech. The techniques used in this study for extracting features from speech include MFCC, ZCR, RMSE, pitch, and spectral centroid. The features were extracted and calculated individually from each audio file.

Mel frequency cepstral coefficient
The first feature extraction method employed was MFCC. MFCC is inspired by the way the human ear processes sound.
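As a concrete illustration of the agreement measure used in the data collection stage, the sketch below computes Cohen's Kappa for two annotators in pure Python; the label lists are hypothetical examples, not the study's annotations.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for six speech clips from two annotators.
ann1 = ["angry", "happy", "sad", "neutral", "angry", "happy"]
ann2 = ["angry", "happy", "sad", "sad", "angry", "neutral"]
print(round(cohens_kappa(ann1, ann2), 3))  # -> 0.556
```

On the Table 1 scale, 0.556 falls in the "moderate" band; for real annotation sets, sklearn.metrics.cohen_kappa_score computes the same statistic.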
These coefficients focus on the most important aspects of sound, such as the shape of vocal formants and other characteristics, which are essential for tasks like emotion recognition and speech analysis. By emphasizing frequencies that are most important for how humans hear, MFCCs provide a clear representation of speech signals. Figure 1 illustrates the MFCC feature extraction process. The extraction of MFCC features from speech data began with pre-emphasis, which boosted the higher frequencies to enhance clarity. Next, the audio signal was segmented into small frames of 25 ms, with a 50% overlap. Each frame was then processed using a Hamming window to minimize edge effects before undergoing the fast Fourier transform (FFT), which transformed the audio data from the time domain into the frequency domain using an NFFT size of 512. After that, a Mel-filter bank was applied, consisting of 40 filters spaced according to the Mel scale to mimic the human ear's frequency response. To manage the wide range of values, log compression was applied to compress the dynamic range of the filter-bank energies. In the final step, the discrete cosine transform (DCT) was applied to the log-compressed signal to derive the MFCCs.

Figure 1. MFCC flowchart

Zero crossing rate
The second feature employed was the ZCR, which was calculated separately. ZCR was derived by assessing the frequency of zero-crossings in the signal across a frame. The process involved counting each instance where the audio signal shifts from above zero to below zero or the reverse within a frame. ZCR is defined as shown in (1).

ZCR = \frac{1}{N-1} \sum_{n=1}^{N-1} \mathbb{1}\{ x(n)\, x(n-1) < 0 \}    (1)

where N represents the total sample count within the frame, and \mathbb{1}\{ x(n)\, x(n-1) < 0 \} is an indicator function that outputs 1 when there is a sign change between x(n) and x(n-1), and 0 otherwise.
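The MFCC pipeline and the ZCR definition above can be sketched with NumPy alone. This is a simplified illustration, not the authors' implementation; frame length, overlap, and filter count follow the values stated in the text, but note that at 48 kHz a 25 ms frame holds 1,200 samples, so this sketch uses an FFT size of 2,048 rather than the stated 512 to avoid truncating the frame.

```python
import numpy as np

def zcr(frame):
    """Fraction of adjacent sample pairs whose sign differs, as in (1)."""
    return np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))

def mfcc(signal, sr=48_000, n_mfcc=13, frame_ms=25, nfft=2048, n_mels=40):
    """Minimal MFCC pipeline: pre-emphasis -> 25 ms frames with 50% overlap
    -> Hamming window -> FFT -> 40-filter mel bank -> log -> DCT."""
    # Pre-emphasis boosts the higher frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Segment into overlapping frames.
    flen = int(sr * frame_ms / 1000)
    hop = flen // 2
    n_frames = 1 + (len(sig) - flen) // hop
    frames = np.stack([sig[i * hop : i * hop + flen] for i in range(n_frames)])
    # Window each frame, then take the magnitude spectrum.
    mag = np.abs(np.fft.rfft(frames * np.hamming(flen), n=nfft))
    # Triangular mel filter bank (n_mels filters on the mel scale).
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((nfft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log compression, then DCT-II to decorrelate the filter-bank energies.
    logmel = np.log(mag @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T  # shape: (n_frames, n_mfcc)

tone = np.sin(2 * np.pi * 440 * np.arange(48_000) / 48_000)  # 1 s, 440 Hz
print(mfcc(tone).shape)  # -> (79, 13)
```

For a one-second clip at 48 kHz, 1,200-sample frames with a 600-sample hop yield 79 frames, so the MFCC matrix is 79 x 13.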
ZCR reflected the rate of change in the signal, which can be indicative of emotional states. Higher ZCR values are associated with more tense or agitated emotional states, such as angry or happy, where rapid changes in pitch or tone occur. Conversely, lower ZCR values indicate calmer emotions, such as neutral or sad, where the speech is more steady and less variable.

Root mean square energy
The third feature employed was the RMSE, or the root mean square value of a signal, which was derived by computing the square root of the mean value of the squared samples. For each sample x(n) in the audio signal x, the square x(n)^2 was calculated. The average of all resulting squared values was then calculated, and the square root of this average was used to determine the RMSE, as shown in (2).

RMSE = \sqrt{ \frac{1}{N} \sum_{n=1}^{N} x(n)^2 }    (2)

where x(n) is the signal value at index n, and N is the total number of signal samples. RMSE reflected the intensity or volume of the speech signal. Emotions such as angry or happy involved higher energy levels due to louder and more forceful speech, resulting in higher RMSE values. Conversely, emotions like sad were expressed with softer, lower-energy speech, leading to lower RMSE values.

Pitch
The fourth feature analyzed was pitch. Pitch estimation from an audio signal involves several key steps to accurately determine the fundamental frequency. This process includes spectral analysis using techniques such as fast Fourier transform (FFT)-based methods like autocorrelation or cepstral analysis. Pitch was calculated as shown in (3).

Pitch = \frac{\text{Sampling rate}}{\text{Index of the fundamental frequency}}    (3)

Emotional states are expressed through variations in the pitch of the voice. Emotions such as angry or happy tended to produce higher pitch variations, where the voice reached elevated frequencies, adding an energetic or intense quality to the speech.
In contrast, sad or neutral involved a lower, more stable pitch, conveying a calmer or more subdued tone and signaling reduced emotional arousal.

Spectral centroid
The fifth feature calculated was the spectral centroid, which represents the center of gravity of the audio signal's frequency spectrum, providing an average frequency weighted by the amplitude of each spectral component. It is commonly used to describe how energy is distributed across the frequency range, offering insight into the brightness or sharpness of a sound. Spectral centroid was calculated using (4).

\text{Spectral Centroid} = \frac{ \sum_{k=0}^{N-1} f(k)\, |X(k)| }{ \sum_{k=0}^{N-1} |X(k)| }    (4)

where f(k) is the frequency at index k, and |X(k)| is the magnitude of the spectrum at index k. Spectral centroid distinguished emotional states in speech by reflecting the brightness or sharpness of the voice. Higher spectral centroid values, linked to angry or happy, indicated energy concentrated in higher frequencies. Meanwhile, lower values, associated with neutral or sad, suggested a softer and more subdued tone. Once each feature was extracted from each audio file, a feature vector was formed to represent the key acoustic characteristics of the sound. This vector captured the most significant characteristics of the audio, which were essential in distinguishing various emotional states. These values were then used as input in the classification process, where they helped the model recognize and distinguish between various types of emotional expressions.

Classification
The extracted features were used as inputs for emotion recognition through various classification models. In this study, both CNN and CNN-LSTM models were applied and compared for emotion classification. These models were assessed to measure their accuracy in emotion classification using the features extracted from the speech data.
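The energy, pitch, and spectral centroid computations of (2)-(4) can be sketched in NumPy as below. This is an illustrative sketch with a synthetic 375 Hz test tone (chosen so the frame holds an exact number of periods), not the study's code; the 60-400 Hz pitch search range is an assumption.

```python
import numpy as np

def rmse(x):
    """Root mean square energy, as in (2)."""
    return np.sqrt(np.mean(x ** 2))

def pitch_autocorr(x, sr=48_000, fmin=60, fmax=400):
    """Fundamental frequency: sampling rate divided by the autocorrelation
    peak lag, searched over an assumed voice range (cf. (3))."""
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

def spectral_centroid(x, sr=48_000):
    """Amplitude-weighted mean frequency of the magnitude spectrum, as in (4)."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))

sr = 48_000
t = np.arange(2048) / sr                    # one 2,048-sample frame
frame = 0.5 * np.sin(2 * np.pi * 375 * t)   # 375 Hz tone: exactly 16 periods
features = [rmse(frame), pitch_autocorr(frame), spectral_centroid(frame)]
print([round(f, 3) for f in features])      # -> [0.354, 375.0, 375.0]
```

Concatenating such per-feature values (here RMSE, pitch, and centroid; the full study also includes MFCC and ZCR) gives the per-clip feature vector passed to the classifiers.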
The experiments were conducted with data divided into 75% for training, 20% for testing, and 5% for validation. The models were implemented and tested in Google Colab.

Convolutional neural networks
The CNN model used was a 1D CNN designed to classify input data with a one-dimensional structure, such as time series or sensor data, where the order or relative position of the data is important. This model processed the input data through multiple convolutional layers, extracting spatial features that helped in emotion classification based on the speech data. Figure 2 depicts the architecture of the one-dimensional CNN model. The 1D CNN architecture began with an input layer matching the feature dimensions, followed by several convolutional layers with filters of 32, 64, and 128, each accompanied by batch normalization, activation functions, and pooling layers to reduce data dimensionality. A dropout layer was then applied to prevent overfitting. The output from the last convolutional layer was flattened and passed to dense layers in order to get the final output, which was used to categorize the four emotions.

Figure 2. CNN1D architecture

Convolutional neural networks-long short-term memory
The CNN model followed by an LSTM network, often referred to as a CNN-LSTM model, is typically used for processing high-dimensional data such as audio or video. In this model, the CNN retrieved spatial characteristics from the input, and the LSTM captured the temporal dependencies among the derived features. Figure 3 depicts the architecture of the CNN-LSTM model.

Figure 3. CNN-LSTM architecture

The CNN-LSTM architecture was similar to the 1D CNN architecture.
In this model, the output from the convolutional layers was sent into an LSTM layer, which captured long-term dependencies in the data. The output from the LSTM layer was flattened and transferred to a dense layer that had four neurons, each of which represented one of the four emotional categories. This final dense layer served as the output, providing the predicted emotion based on the extracted features.

RESULTS AND DISCUSSION
This section provides the essential results of the research and a discussion of their significance. It explores aspects such as inter-rater reliability and compares the feature extraction methods and classification techniques used. These results provide insight into the effectiveness of emotion recognition from speech and highlight important trends observed throughout the analysis.

Inter-rater reliability results
The study found that the annotators had the highest agreement when labeling segments with the emotion "angry". This suggests that "angry" was a distinct and easily recognizable emotion, leading to more consistent labeling among the annotators. The emotions "happy" and "sad" had agreement levels of 0.78 and 0.74, respectively. In contrast, "neutral" had the lowest agreement level. This indicated that the absence of emotion or a neutral state is more subjective and harder to label consistently. The ambiguity and subtlety in neutral expressions likely contributed to this lower agreement level. The overall Cohen's Kappa results are illustrated in Figure 4. The overall agreement across all emotions in the corpus was 0.69, which fell into the "substantial" agreement category. This overall Kappa value highlighted the variability and subjectivity in how the two annotators perceived and labeled emotions. The results of the dataset calculations using IBM SPSS software are summarized in Table 2. The Kappa value of 0.698 indicated a substantial level of agreement between the annotators, suggesting that they often labeled emotions consistently. The T-value of 17.038 and a significance level of less than 0.001 confirmed that the findings were statistically significant, meaning that the observed results were unlikely to have occurred by chance. This indicated the reliability of the measurement process. However, these results also highlighted that while annotators generally agree, certain emotions led to inconsistencies in labeling.

Figure 4. Cohen's Kappa agreement level for each emotion (neutral, sad, happy, and angry)

Table 2. Cohen's Kappa results from IBM SPSS
Value    Asymptotic standard error    Approximate T    Approximate significance
0.698    -                            17.038           <0.001

Feature extraction and classification comparison
Experiments were conducted using the CNN and CNN-LSTM methods with input data derived from feature extraction of speech signals. The features extracted from the speech signals included MFCC, ZCR, RMSE, pitch, and spectral centroid. These features were analyzed to assess their impact on improving emotion classification accuracy.

Convolutional neural networks
In the CNN method, various feature combinations were tested to achieve the best results in emotion classification. These results highlighted the significance of each feature in helping the model to differentiate emotional states. Table 3 presents the accuracy of testing using the CNN with different feature combinations.

Table 3. Accuracy comparison using CNN
Features                                          Accuracy
MFCC                                              81%
MFCC + ZCR                                        -
MFCC + ZCR + RMSE                                 83%
MFCC + ZCR + RMSE + Pitch                         88%
MFCC + ZCR + RMSE + Pitch + Spectral centroid     85%

The use of MFCC alone resulted in an accuracy of 81%. MFCC is effective in capturing important spectral information; however, using MFCC alone may not fully capture the temporal dimensions in speech that correspond to emotional states.
Adding the ZCR to MFCC did not significantly improve accuracy. ZCR determines how frequently the signal crosses the zero-amplitude line within a specific time frame, but its contribution to emotion classification seemed less significant compared to other features. When RMSE was added to the combination of MFCC and ZCR, accuracy increased to 83%. RMSE, which measures the energy of the speech signal, enriches the representation of the temporal and strength aspects of the signal that are relevant for emotion detection. The increase in accuracy suggested that signal energy information played an important role in differentiating emotional expressions. Adding pitch to the combination of MFCC, ZCR, and RMSE led to a significant increase in accuracy, to 88%. Pitch provides information about voice intonation, which is crucial in emotion recognition because variations in intonation can reflect deep emotional changes. The substantial contribution of pitch underscored the importance of intonation in distinguishing emotional states. However, adding spectral centroid to the combination of MFCC, ZCR, RMSE, and pitch slightly decreased accuracy, to 85%. Spectral centroid, which describes the center of mass of the spectral signal, did not seem to provide significant additional value in the context of emotion classification, or it might even have complicated the model without adding informative value.

Convolutional neural networks-long short-term memory
The CNN-LSTM method also demonstrated strong performance in emotion classification. By combining the feature extraction capabilities of the CNN with the sequential modeling power of the LSTM, this approach captured both the relevant features from the speech signal and the sequential dependencies in the data. Table 4 shows the accuracy of testing using CNN-LSTM with various feature combinations.

Table 4. Accuracy comparison using CNN-LSTM
Features                                          Accuracy
MFCC                                              81%
MFCC + ZCR                                        77%
MFCC + ZCR + RMSE                                 88%
MFCC + ZCR + RMSE + Pitch                         83%
MFCC + ZCR + RMSE + Pitch + Spectral centroid     83%

Using MFCC alone resulted in an accuracy of 81%. MFCC is known to be effective in extracting spectral information from audio signals, but this accuracy suggested that the information obtained from MFCC alone still had limitations in fully detecting emotional variations. Adding the ZCR to MFCC actually reduced accuracy to 77%. This reduction might have been due to ZCR introducing noise or less relevant information, thereby disrupting the model's ability to classify emotions accurately. However, when RMSE was added to the combination of MFCC and ZCR, accuracy significantly increased to 88%. RMSE provides additional information about the intensity of the speech signal, which is crucial for distinguishing emotions. This increase in accuracy indicated that RMSE added crucial informative value for emotion detection. Adding pitch to the combination of MFCC, ZCR, and RMSE changed accuracy to 83%. Although pitch should provide additional information about the fundamental frequency of the voice relevant for emotion classification, this combination did not perform as well as the previous one. This might have been because the information provided by pitch did not add enough value to the model or there was redundancy with existing features. Adding spectral centroid to the combination of MFCC, ZCR, RMSE, and pitch did not further change accuracy, which remained at 83%. Spectral centroid, which described the center of mass of the sound spectrum, did not seem to provide sufficiently differentiating information compared to the other features or might have had an excessive overlap of information.
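For reference, the two classifier stacks compared above might be sketched in Keras roughly as follows. This is an illustrative reconstruction from the architecture description, assuming TensorFlow is available; the input length of 40 feature frames, the kernel size, the dropout rate, and the LSTM width of 64 units are assumptions, not values from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_len=40, use_lstm=False, n_classes=4):
    """1D-CNN stack (32/64/128 filters, each with batch norm, activation,
    and pooling), a dropout layer, then either a flatten head (CNN) or an
    LSTM head (CNN-LSTM), ending in a four-neuron softmax output."""
    m = models.Sequential([layers.Input(shape=(input_len, 1))])
    for n_filters in (32, 64, 128):
        m.add(layers.Conv1D(n_filters, kernel_size=3, padding="same"))
        m.add(layers.BatchNormalization())
        m.add(layers.Activation("relu"))
        m.add(layers.MaxPooling1D(pool_size=2))
    m.add(layers.Dropout(0.3))            # dropout rate is an assumption
    if use_lstm:
        m.add(layers.LSTM(64))            # captures long-term dependencies
    else:
        m.add(layers.Flatten())
    m.add(layers.Dense(n_classes, activation="softmax"))
    return m

cnn = build_model()
cnn_lstm = build_model(use_lstm=True)
print(cnn.output_shape, cnn_lstm.output_shape)  # both (None, 4)
```

Both variants end in the same four-way softmax, so they differ only in whether the pooled convolutional features are flattened directly or first summarized by the LSTM.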
Comparison
The testing results for the various feature extraction and classification methods show that RMSE had a significant impact in improving accuracy for both the CNN and CNN-LSTM models. In the CNN model, the combination of four features (MFCC, ZCR, RMSE, and pitch) led to the highest accuracy of 88%. Similarly, the CNN-LSTM model reached the same accuracy with just three features: MFCC, ZCR, and RMSE. These results emphasized the significance of selecting the right features providing the most valuable contributions, while also highlighting an important trade-off with computational efficiency, as fewer features reduce the computational cost of feature extraction.

CONCLUSION
This study explored methods to enhance emotion recognition from Indonesian speech using feature extraction techniques and machine learning classification models. The experiments were conducted on an Indonesian language dataset consisting of 200 samples. To assess inter-rater reliability, Cohen's kappa analysis was conducted, which revealed a substantial agreement level (kappa = 0.69) between annotators, highlighting the consistency of emotion labeling. The classification experiments compared CNN and CNN-LSTM models. Both the CNN model, which used four features (MFCC, ZCR, RMSE, and pitch), and the CNN-LSTM model, which used three features (MFCC, ZCR, and RMSE), achieved an emotion classification accuracy of approximately 88%. The difference in the number of features suggests that while the CNN model involved more computational work due to the additional feature, the CNN-LSTM model managed to achieve similar performance with fewer features, potentially offering a more efficient approach. Overall, the findings demonstrate that incorporating diverse feature extraction techniques can enhance emotion recognition performance, particularly in Indonesian SER.
However, careful consideration is needed to balance computational efficiency and feature complexity, as adding more features can improve accuracy but may also increase computational cost. Future research could explore the use of advanced optimization techniques or feature selection methods to further refine model performance while minimizing computational overhead.

ACKNOWLEDGMENTS
The authors would like to thank PENS management for all support in the form of laboratory facilities and all the equipment provided, so that we could carry out this research well.

FUNDING INFORMATION
This research has been supported by the Ministry of Education, Culture, Research, and Technology of the Republic of Indonesia under the Penelitian Tesis Magister scheme for the fiscal year 2024, with contract number 524/PL14/PT.05/i/2024.

AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration. Authors: Izza Nur Afifah, Tri Budi Santoso, Titon Dutono.
C: Conceptualization; M: Methodology; So: Software; Va: Validation; Fo: Formal analysis; I: Investigation; R: Resources; D: Data curation; O: Writing - Original Draft; E: Writing - Review & Editing; Vi: Visualization; Su: Supervision; P: Project administration; Fu: Funding acquisition

CONFLICT OF INTEREST STATEMENT
Authors state no conflict of interest.

INFORMED CONSENT
We have obtained informed consent from all individuals included in this study.

DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding authors, INA and TBS, upon reasonable request.

REFERENCES