JOIV : Int. J. Inform. Visualization, 9 - March 2025, 789-795

INTERNATIONAL JOURNAL ON INFORMATICS VISUALIZATION
journal homepage: www.joiv.org/index.php/joiv

Indonesian Word Sound Recognition Using Convolutional Neural Network Method

Mandahadi Kusuma a,*, Fayyadh Aunilbarr a

a Informatics, Faculty of Science and Technology, Universitas Islam Negeri Sunan Kalijaga, Yogyakarta, Indonesia

Corresponding author: *mandahadi.kusuma@uin-suka.

Abstract— Access to education, particularly in a university environment, is essential for deaf and hard-of-hearing students as more of them pursue higher education. At UIN Sunan Kalijaga, the current challenges are the limited number of sign language interpreters and the difficulty of translating technical terminology in lectures. Many methods are available for speech recognition, but research on how well these methods perform on Indonesian has not been published, especially for education-oriented recognizers. This experimental study investigates whether Indonesian words can be recognized with Convolutional Neural Networks (CNN) and determines which training:validation:testing data ratio gives the best performance. The study used a dataset of four Indonesian words, each with 50 voice samples from young adults aged 19-23. Audio data is preprocessed into spectrograms, which are the inputs to a CNN model built with TensorFlow. The CNN model reached 90% accuracy with a 60:20:20 ratio between training, validation, and test data; the other ratios (70:15:15 and 80:10:10) yielded accuracies between 80% and 90%. This study shows that CNNs perform well for Indonesian word recognition and that the 60:20:20 data ratio is optimal. The result has practical benefits, such as voice-to-text transcription of lectures to improve the ease of learning in Indonesian education. Further studies should explore different neural network architectures; a denoising approach is also needed to increase accuracy.
Keywords— CNN; Indonesian; TensorFlow.

Manuscript received 19 April 2024; revised 26 Oct.; accepted 23 Nov. Date of publication: 31 Mar. International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

I. INTRODUCTION

Deaf or hard-of-hearing students at UIN Sunan Kalijaga face significant challenges during lectures due to the limited availability of translators and difficulties in translating foreign words. To improve accessibility, a voice-to-text feature is proposed, which could assist these students by converting spoken words into text in real time. This research aims to identify the best model for recognizing Indonesian words, focusing on the application of Convolutional Neural Networks (CNN) due to their robustness and efficiency in pattern recognition tasks.

The study explores the effectiveness of CNNs in accurately identifying specific Indonesian words, thereby enhancing the utility of voice recognition technology in educational settings for the deaf community. Several voice recognition methods can be used: CNN, MFCC, ANN, and RNN. However, a recent study showed that the Madaline model of artificial neural networks is not recommended for voice identification, while a combination of Convolutional and Recurrent Neural Networks can outperform Convolutional Neural Networks alone in accuracy. Previous research utilizing the MFCC and SVM methods on Indonesian words achieved F1 scores ranging from 44% to 100% for word classification. The CNN method is also widely used in related research; an audio event classification method using convolutional neural networks achieved an accuracy of 81.5% for classifying thirty audio events across multiple datasets. In the current research, the CNN method was chosen because it is considered stronger and faster. A previous study aiming to improve early literacy in Indonesian children using a CNN approach for alphabet speech recognition reached a high accuracy of 84%. For Indonesian large-vocabulary speech, prior work has demonstrated the superiority of deep learning technologies, particularly CNNs, over traditional hidden Markov models (HMM), proposing a discriminatively trained CNN for Indonesian large vocabulary continuous speech recognition (LVCSR) and achieving a significant error reduction rate of 7.01%.

Because a person's voice differs in tone, pitch, and volume, it is sufficiently distinctive to be uniquely identifiable; using the CNN process to identify and classify a person based on their voice can achieve a high level of accuracy. The study provides information on how CNNs can be used to recognize voices in the Indonesian language, evaluates the effectiveness of this approach, and supports the advancement of tools to aid the deaf community. Additionally, it emphasizes the benefits of CNNs in making technology more user-friendly.

Several studies have used Convolutional Neural Networks (CNN) for processing voice data. Liang et al. focus on security by identifying spoofed voices, achieving 95% accuracy in detecting fake voices. Franti et al. achieve an average accuracy of 71.33%, aiming to improve human-robot interaction by recognizing emotions. Both studies demonstrate the versatility of CNNs in audio analysis and their potential impact on technology that interacts with or mimics human behavior. Another study suggests that the combination of CNN and RNN outperforms CNN alone, with 66% accuracy for 20 labels. The convergence of this research underscores the broad applicability of CNNs in both safeguarding and humanizing technology.
A study for deaf support applications utilizes deep learning techniques for sound analysis based on the Mel-spectrogram representation of sounds. Real-life sounds in Korean are recorded via an app, identified based on learned data, and associated with predefined alarms and vibrations. The experiments showed promising results, with an average classification rate of 85% for real-life sounds.

Convolutional Neural Networks (CNN) are a type of artificial neural network used in pattern recognition, especially in image processing. CNN architectures have several variations, but in general they consist of convolutional layers and pooling layers, which are grouped into blocks, followed by a fully connected layer, as is standard in artificial neural networks. Convolutional and pooling layers form the internal feature-extraction structure, while fully connected layers generate the class probabilities. A convolutional layer consists of neurons connected to receptive fields of the previous layer, and each neuron in the layer shares the weights of a particular filter. Filters in convolutional layers can be applied to speech recognition, where sound is converted into images and then analyzed using a CNN; because CNNs are effective at recognizing objects in images, they serve as a basis for detecting patterns or features in spectrogram images, with dropout and the confusion matrix applied for regularization and evaluation. Meanwhile, the max-pooling layer is the pooling layer most commonly used in CNNs; it effectively reduces the dimensions of the spatial representation without adding learnable parameters. Dropout is a regularization technique used in artificial neural networks to reduce overfitting by avoiding dependencies between neurons.
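The convolution, max-pooling, and dropout operations described above can be illustrated with a minimal NumPy sketch. This is an illustrative re-implementation, not the paper's TensorFlow code; the filter values and the random input are arbitrary choices.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; halves each spatial dimension for size=2."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size              # trim to a multiple of `size`
    trimmed = feature_map[:h, :w]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def dropout(x, rate, rng):
    """Randomly zero a fraction `rate` of activations (training-time only)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)                 # inverted dropout scaling

rng = np.random.default_rng(0)
patch = rng.random((8, 8))                          # stand-in for a spectrogram patch
edge_kernel = np.array([[1., -1.], [1., -1.]])      # toy vertical-edge filter
features = max_pool(conv2d(patch, edge_kernel))     # (8,8) -> conv (7,7) -> pool (3,3)
regularized = dropout(features, rate=0.5, rng=rng)
print(features.shape)                               # (3, 3)
```

Note how no parameters are added by pooling or dropout; only the convolutional filter weights (shared across all positions) and the final dense layer would be learned.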
Meanwhile, the confusion matrix is used to evaluate the performance of the classification model by presenting prediction results on test data in matrix form, enabling analysis of the recognition quality for each class or word.

This study aims to evaluate the effectiveness of a Convolutional Neural Network (CNN) for Indonesian word recognition and to determine its accuracy on a dataset of four specific words. The research is limited to Indonesian and a four-word dataset: the words used are "inklusif", "pecah", "coba", and "miring". Development uses the Python programming language and Google Colab. The benefits of this research include demonstrating the ability of CNNs for Indonesian voice recognition and evaluating the accuracy on the dataset used, with novelty in the context of Indonesian-language use and different word classes.

II. MATERIALS AND METHOD

The research stages consist of data collection, data preprocessing, CNN implementation, and analysis and evaluation. Each step is explained as follows:

A. Data Collection

Voice data was collected, with permission, from volunteers at the Disabled Services Center (PLD) UIN Sunan Kalijaga between February and March 2021. The dataset consists of the words "inklusif", "pecah", "coba", and "miring", each with 50 recordings from 10 different male and female volunteers (five men and five women) aged 19-23; the four words were chosen to represent different vowels and consonants. The average duration of a recording is 1-2 seconds. Volunteers were asked to say the words with ordinary emotion in a calm, controlled environment, without distraction from other sounds. Recordings were made with the built-in application on each volunteer's mobile device in WAV format.
Sound collection criteria are set following Table I.

TABLE I
SOUND COLLECTION CRITERIA

Categories  | Criteria
Volunteers  | Volunteers from PLD UIN Suka who are not disabled students
Age         | 19-23 years
Words       | "Inklusif", "pecah", "coba", "miring"
Duration    | 1-2 seconds
Application | Own recording app and device
Environment | Quiet, not whispering, without other distractions, uttering with ordinary emotion
Data Format | Waveform Audio File Format (WAV)

Each volunteer recorded each word 5 times, for a total of 20 recordings per volunteer. The collected data is stored on Google Drive and accessed from Google Colab to run the program code. Before preprocessing, the data is labeled by dividing it into folders according to each word label: one parent folder contains four folders labeled "inklusif", "pecah", "coba", and "miring", with 50 recordings in each.

B. Data Preprocessing

Data preprocessing is conducted using TensorFlow, which allows building large-scale neural network models with many layers and can be used for various purposes such as classification, perception, and prediction. The sound data is processed with TensorFlow to convert WAV sound files into waveforms, which are later converted into spectrograms so they can be input to the CNN. A spectrogram is an image representation of a waveform signal that shows its frequency intensity over time. Since spectrograms display only frequency content over time, while waveforms display changes in amplitude over time, the conversion can lose information about amplitude changes in the audio.

C. CNN Implementation

This stage implements the CNN method in program code via Google Colab using TensorFlow, starting with defining the model and then compiling it. In the CNN architecture, the input layer applies an image-resizing step to speed up model training.
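As a rough illustration of the spectrogram input that this architecture consumes, the waveform-to-spectrogram step can be sketched with a short-time Fourier transform in NumPy. This is not the paper's TensorFlow pipeline; the frame sizes and the synthetic 440 Hz test tone are arbitrary stand-ins for a recorded word.

```python
import numpy as np

def spectrogram(waveform, frame_length=256, frame_step=128):
    """Magnitude spectrogram via a short-time Fourier transform (STFT):
    slice the waveform into overlapping windowed frames, then take the
    magnitude of the real FFT of each frame."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(waveform) - frame_length) // frame_step
    frames = np.stack([
        waveform[i * frame_step : i * frame_step + frame_length] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of the real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1))     # shape: (time, frequency)

# 1 second of a 440 Hz tone sampled at 16 kHz as a stand-in waveform
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(wave)
print(spec.shape)   # (n_frames, frame_length // 2 + 1)
```

The resulting 2-D array of time-by-frequency magnitudes is exactly the kind of "image" a CNN input layer can resize and process.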
The hidden layer consists of two convolutional layers, each followed by a max-pooling layer, then a dropout layer and a dense layer for classification. Note that TensorFlow's audio-to-spectrogram conversion is one-way; there is no feature to convert a spectrogram back into audio.

D. Analysis and Evaluation

The confusion matrix serves as an analytical tool, computed with the tf.math.confusion_matrix function in TensorFlow, complementing the CNN's classification. This study compares training:validation:test ratios of 60:20:20, 70:15:15, and 80:10:10 to determine the optimal configuration. The CNN model is analyzed by examining the loss and accuracy curves of the trained model and by checking the percentage accuracy the model achieves on test data. In this study, a ratio that produces at least 80% accuracy is considered an acceptable result.

III. RESULTS AND DISCUSSION

An example of the output of the first processing step can be seen in Figure 1, which shows the sound waveform of a WAV file. The waveform is a graph of sound displacement as a function of time, with the x-axis showing time and the y-axis showing amplitude. Each word's amplitude pattern differs; for instance, the word "inklusif" has a high amplitude around sample 50,000. For the consonants, the words "pecah" and "coba" show a silent portion in the middle of the spoken word.

Fig. 1 Sample of sound data in waveform

A. Data Processing

At this stage, sound files are processed so they can be fed into the Convolutional Neural Network (CNN). The first step converts sound files in WAV format into waveforms, which are then converted into spectrograms so they can be input into the CNN. Figure 2 shows sample conversion results from several sound files. In this research, model training was carried out using 50 epochs, with the line of code shown in Figure 4.

Fig. 4 Train Model using 50 epochs

B. Results and Analysis

1) Loss and Accuracy:
Model analysis is carried out by examining the loss and accuracy curves as well as the level of accuracy on the test data. Figure 5 shows that at the 60:20:20 ratio, there is a significant decrease in loss after around the 30th epoch, with peak accuracy reached between the 20th and 30th epochs.

Fig. 2 Sample conversion result from waveform to spectrogram

Figure 2 demonstrates the conversion of waveforms into spectrograms for the four tested Indonesian words: "inklusif", "pecah", "coba", and "miring". Spectrograms display frequency content over time, with color intensity indicating energy levels: brighter areas represent higher energy concentrations at specific frequencies, while darker areas indicate lower energy levels. This conversion is crucial for CNN processing because it transforms the audio data into a format suitable for image-based analysis.

Fig. 5 Loss and accuracy analysis results with a ratio of 60:20:20

Meanwhile, Figure 6 shows that at the 70:15:15 ratio, there was a significant decrease in loss around the 40th epoch, with peak accuracy occurring between the 20th and 30th epochs.

For the implementation, the model is compiled using the Adam optimizer, with the line of code shown in Figure 3.

Fig. 3 Compile model using Adam optimizer

Model training is then carried out using the prepared dataset. Model training is the process in which a machine learning algorithm learns from an appropriate data set. The data is divided into three groups: training data, validation data, and test data. Training data is used specifically to train the model, while validation data is used to evaluate the model during training.
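The three-way division of the dataset can be sketched as follows. This is a minimal illustration assuming a simple shuffled split; the filenames are hypothetical, not the study's actual files.

```python
import random

def split_dataset(samples, train=0.6, val=0.2, seed=42):
    """Shuffle and split samples into training/validation/test sets.
    With train=0.6 and val=0.2 the remainder (0.2) becomes the test set,
    giving the 60:20:20 ratio that performed best in this study."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)        # deterministic shuffle
    n = len(samples)
    n_train = int(n * train)
    n_val = int(n * val)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# 50 recordings of one word, as in the study's per-word dataset
dataset = [f"inklusif_{i:02d}.wav" for i in range(50)]
train_set, val_set, test_set = split_dataset(dataset)
print(len(train_set), len(val_set), len(test_set))  # 30 10 10
```

The same function reproduces the other configurations, e.g. `split_dataset(dataset, train=0.7, val=0.15)` for the 70:15:15 ratio.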
Test data, on the other hand, is held-out data used to test predictions after the model has been trained. Before training begins, the ratio among the three types of data is determined from several ratio options.

Fig. 6 Loss and accuracy analysis results with a ratio of 70:15:15

Figure 7 shows that at the 80:10:10 ratio, the lowest loss occurs from the 30th epoch onward, while the highest accuracy occurs between the 30th and 40th epochs. Figure 8 shows that with this model, the matrix reports the results for the words "inklusif", "pecah", "coba", and "miring", with "miring" achieving a perfect score in precision, recall, and F1-score.

Fig. 9 Confusion matrix results with a ratio of 70:15:15

Fig. 7 Loss and accuracy analysis results with a ratio of 80:10:10

Table II compares the three ratios of training, validation, and test data. The 60:20:20 ratio achieved the best test accuracy of 90%; the 70:15:15 ratio resulted in 86% accuracy; and the 80:10:10 ratio had the lowest accuracy. Table II suggests that a balanced distribution of data contributes to higher accuracy in the CNN model's predictions. In Figure 9, the matrix highlights the numbers of correct and incorrect predictions made by the model, with the word "miring" once again showing superior performance compared to the other classes.

TABLE II
ACCURACY RESULTS BETWEEN 3 DIFFERENT RATIOS

Ratio    | Training Data | Validation Data | Test Data | Accuracy
60:20:20 | 30 data       | 10 data         | 10 data   | 90%
70:15:15 | 35 data       | 7 data          | 7 data    | 86%
80:10:10 | 40 data       | 5 data          | 5 data    | —

2) The Evaluation Performance of the Confusion Matrix: The confusion matrix of predictions and labels is evaluated using the confusion_matrix function in TensorFlow, as shown in Figure 8, Figure 9, and Figure 10. However, TensorFlow has no built-in display of true positives, true negatives, false positives, false negatives, precision, recall, and F1-score, so these were calculated independently in this study.
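Such independent per-class calculations can be sketched as follows; the confusion-matrix counts below are hypothetical illustrations, not the study's actual results.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a confusion matrix whose
    rows are true labels and columns are predicted labels."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp     # predicted as class c but actually another class
    fn = cm.sum(axis=1) - tp     # actually class c but predicted as another class
    precision = np.divide(tp, tp + fp,
                          out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn,
                       out=np.zeros_like(tp), where=(tp + fn) > 0)
    # F1 is the harmonic mean of precision and recall
    f1 = np.divide(2 * precision * recall, precision + recall,
                   out=np.zeros_like(tp), where=(precision + recall) > 0)
    return precision, recall, f1

# Hypothetical 4-class matrix (rows/cols: inklusif, pecah, coba, miring)
cm = [[8, 1, 1, 0],
      [1, 9, 0, 0],
      [0, 1, 9, 0],
      [0, 0, 0, 10]]
p, r, f1 = per_class_metrics(cm)
```

In this toy matrix the last class is classified perfectly, so its precision, recall, and F1 all equal 1, mirroring the behavior reported for the word "miring".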
Fig. 10 Confusion matrix results with a ratio of 80:10:10

Figure 10 indicates that while the overall accuracy is lower at this ratio, the words "miring" and "coba" both achieve a precision of 100%, and the recall values for "inklusif" and "pecah" reach 100% as well. The F1-score for "pecah" is the highest among the words, at 80%.

The evaluation covers the performance of the voice recognition system for Indonesian words using a Convolutional Neural Network (CNN), focusing on the four words "inklusif", "pecah", "coba", and "miring". Precision, recall, and F1-score are used as metrics: precision measures the accuracy of positive predictions, recall measures the coverage of actual positive cases, and the F1-score is the harmonic mean of precision and recall. The evaluation results are presented in Table III, Table IV, and Table V.

Fig. 8 Confusion matrix results with a ratio of 60:20:20

TABLE III
CONFUSION MATRIX RATIO 60:20:20

Classes | True positive | True negative | False positive | False negative | Precision | Recall | F1-score

The F1-score for "pecah" was the highest at 0.8, suggesting a balance between precision and recall. However, the F1-score for "miring" was only 0.21, indicating a potential imbalance between precision and recall for this word. Some words like "miring" and "coba" performed exceptionally well in precision, while others like "inklusif" excelled in recall. This variation indicates that the model's ability to recognize words depends on the specific characteristics of each word. This study found that the 60:20:20 ratio yielded the best results due to a balanced distribution of data, which is crucial for effective training, validation, and testing, along with several key points.
Table III shows that the word "miring" achieved perfect scores in both precision and recall, and it also has the highest F1-score of 1; since the F1-score is the harmonic mean of precision and recall, this indicates a balanced performance. Compared to the other words, "miring" had the best overall performance, with "inklusif" and "pecah" having lower precision and recall rates. The high scores for "miring" suggest that the CNN model was particularly effective at recognizing this word within the dataset.

Balanced Training: The 60:20:20 ratio gives the model an optimal amount of information from which to learn (60%), validate its learning (20%), and test its performance (20%). Striking this balance contributes to improved accuracy and enhanced generalization.

Loss and Accuracy: Peak accuracy was reached between the 20th and 30th epochs, and the loss dropped significantly after around the 30th epoch. This means that the model learns well and makes reasonable predictions on unseen data.

TABLE IV
CONFUSION MATRIX RATIO 70:15:15

Classes | True positive | True negative | False positive | False negative | Precision | Recall | F1-score

Confusion Matrix: The confusion matrix at the 60:20:20 ratio shows high precision, recall, and F1-scores for each word, with perfect scores for "miring". This means the model makes generally correct predictions across the classes.

Table IV shows that the word "miring" scores 1 in precision, recall, and F1-score, indicating perfect classification for this word. The words "inklusif" and "pecah" have lower precision and recall values compared to "miring", suggesting some misclassifications occurred; their F1-scores are 0.39 and 0.5, respectively.
The high performance for "miring" suggests that the CNN model is particularly effective at recognizing this word within the dataset, while the lower scores for the other words indicate areas where the model's recognition capabilities could be improved. Although this ratio is highly accurate for some words, there is room for improvement in its overall word recognition accuracy, especially for words with lower precision and recall.

Precision and Recall: The high precision and recall for most words indicate that the model is reliable when predicting the correct class and effectively covers actual positive cases.

Achieving 90% accuracy with the 60:20:20 ratio has implications for improving educational access at Indonesian universities. This recognition could be a significant step toward improving lectures for deaf and hard-of-hearing students on campus, especially the deaf students at UIN Sunan Kalijaga. Currently, most of these students encounter significant obstacles with lecture material, for instance through a lack of available sign language interpreters or difficulty translating technical or foreign terminology. The CNN-based system developed here is a good alternative since it provides real-time voice-to-text conversion.

TABLE V
CONFUSION MATRIX RATIO 80:10:10

Classes | True positive | True negative | False positive | False negative | Precision | Recall | F1-score

IV. CONCLUSION

Our analysis of different data ratios shows that the performance of the CNN model varied in crucial ways with its training configuration. The 60:20:20 combination was the most accurate setup, at 90%, and was quite balanced across all test words. Among them, "miring" performed perfectly on all metrics, and "inklusif" also performed well, with a high F1-score. Some interesting patterns appeared with the 70:15:15 ratio: whereas "miring" kept its perfect recognition rate, other words did not. The most noticeable case was "inklusif", whose precision degraded to 0.33 and whose recall dropped as well.
This decline indicates that reducing the validation set impairs the model's fine-tuning ability. Table V shows that the words "miring" and "coba" have a precision of 1 (100%), indicating that when the model predicted these words, it was always correct. The recall for "inklusif" and "pecah" was also 100%, meaning all instances of these words were correctly identified by the model. The most surprising finding, however, is that despite having the most extensive training set, the 80:10:10 ratio provided the least reliable results. Although terms such as "miring" and "coba" had a perfect precision of 1, their F1-scores were low, with "miring" yielding an F1-score of only 0.21. This is an important lesson: more training data is not always better, especially when the validation and test sets become smaller.

There is still much work to be done in the future. We will expand the dataset by adding more words and testing the system in real classroom environments. Noise reduction techniques can be included to enhance performance in noisy settings. Another interesting direction is the study of other neural network architectures, such as RNNs or hybrid models, which may improve the results even further. More importantly, the research shows how artificial intelligence can be used to solve real-world accessibility problems. Since its development focused on practical needs and achieved a high level of accuracy, it can help build an inclusive educational system for all students in Indonesia.

REFERENCES