International Journal of Electrical and Computer Engineering (IJECE)
April 2013, pp. 260-269
ISSN: 2088-8708

Indonesian Vowel Recognition using Artificial Neural Network based on the Wavelet Features

Nadya Amalia, Arfan E. Fahrudi, Amar V. Nasrulloh
Physics Study Program, Universitas Lambung Mangkurat, Banjarbaru, Indonesia

Article history:
Received Feb 5, 2013
Revised Mar 21, 2013
Accepted Mar 29, 2013

Keywords: Artificial neural network, Back propagation, Decomposition, Discrete wavelet transform, Indonesian vowel, Mother wavelet

ABSTRACT
There are six vowels in the Indonesian language, i.e. /a/, /i/, /u/, /e/, /ə/ and /o/. This paper presents Indonesian vowel recognition using an artificial neural network (ANN) based on wavelet features. The wavelet features were the wavelet coefficients of the vowel signal, extracted by using the discrete wavelet transform (DWT). Vowel samples were recorded from native Indonesian speakers, 10 males and 10 females. db4 and sym4 were used as the mother wavelets, and decomposition levels 2, 4 and 6 were implemented for each vowel sample. The minimum, maximum, mean and standard deviation values of the wavelet coefficients were then used as input vectors of an ANN with 2 hidden layers. The backpropagation algorithm was used to train the ANN. From the experimental results, an overall recognition rate of 70.83% could be achieved. For male speakers the highest recognition rate is 90%, and for female speakers the highest recognition rate is 80%.

Copyright © 2013 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Nadya Amalia, Physics Study Program, Universitas Lambung Mangkurat, Banjarbaru, Indonesia
Email: aydan.amalia@gmail.

INTRODUCTION
Indonesia has broad linguistic diversity. There are 726 languages in the country, making it the world's second most linguistically diverse, after Papua New Guinea, which has 823 local languages. Therefore, Indonesia's national language,
Indonesian language, policy has been called a "miraculous success". It has been effective in uniting the nation, creating a strong national identity, and promoting education and literacy throughout the nation. However, the same word is pronounced differently by people of different mother tongues, mainly because of variations in the vowels and the way they are pronounced.
According to Rogers and Stevens, vowels are voiced sounds produced by passing air through the mouth without any major obstruction in the vocal tract. There are six vowels (excluding the diphthongs) in the Indonesian language. They are /a/ (like "a" in "father"), /i/ (like "ee" in "knee"), /u/ (like "oo" in "moon"), /e/ (like "e" in "bed"), /ə/ (like "e" in "lantern"), and /o/ (like "o" in "boss"). The vowel inventory is relatively normal, as shown in Table 1.

Table 1. Indonesian vowels
        Front (unrounded)   Central   Back (rounded)
High    i                             u
Mid     e                   ə         o
Low                         a

Journal homepage: http://iaesjournal.com/online/index.php/IJECE

Vowel recognition has been increasingly developed for many languages with several different approaches. Sadeghi and Yaghmaie introduced Farsi vowel recognition based on visual features of the lips. Prica and Ilić presented Serbian vowel recognition by using formants. In this paper, Indonesian vowel recognition using an artificial neural network (ANN) based on wavelet features of the vowel is presented.
Each vowel has characteristic features, and the process to obtain those features is called feature extraction. Feature extraction is an important step in a recognition system because the recognition rate depends on the feature extraction results. The wavelet features in this paper are extracted from the vowel signal via the discrete wavelet transform (DWT).
DWT is a promising mathematical transformation that provides both time and frequency information of the signal. It is computed by successive low-pass and high-pass filtering to construct a multiresolution time-frequency plane. Therefore, wavelet coefficients can accurately represent the signal components.
An artificial neural network (ANN) with multilayer perceptron architecture is used as the recognition system. Compared to conventional programming, an ANN is capable of solving problems that do not have an algorithmic solution and is therefore suitable for problems that people are good at solving, such as vowel recognition.

MATERIALS AND METHODS

Discrete Wavelet Transform
A vowel signal is essentially a non-stationary signal. Analyzing a non-stationary signal with the Fourier transform (FT) or the short-time Fourier transform (STFT) does not give satisfactory results. Wavelet analysis is able to reveal signal aspects that other analysis techniques miss, such as trends, breakdown points and discontinuities. It performs a multiresolution analysis, which makes it possible to analyze a signal at different frequencies with different resolutions.
The discrete wavelet transform (DWT) uses filter banks to construct the multiresolution time-frequency plane. A filter bank consists of wavelet filters which separate a signal into frequency bands. In the two-channel filter bank shown in Figure 1, the low-pass filter L and the high-pass filter H split the signal into two frequency bands.

Figure 1. Two-channel filter bank with 3-level decomposition

The output signals of the analysis filter bank are called subbands, and the filter bank technique is also called subband decomposition. Decomposition can be expanded to an arbitrary level, depending on the desired resolution. Figure 1 visualizes decomposition level 3.
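The subband decomposition described here can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the 2-tap Haar filter pair as a stand-in for db4 and sym4 (which use longer filters), and the names `analysis_step` and `decompose` are hypothetical. Each stage filters the current low-pass output with L and H and keeps every second sample, so a decomposition of level p yields p + 1 subbands.

```python
import math


def analysis_step(signal):
    """One two-channel stage: low-pass and high-pass filter, then downsample by 2.

    Assumption: the 2-tap Haar pair is used here only for illustration;
    db4 and sym4 would use longer filters with the same structure.
    """
    scale = 1.0 / math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) * scale for a, b in pairs]   # output of L
    high = [(a - b) * scale for a, b in pairs]  # output of H
    return low, high


def decompose(signal, level):
    """Return [low_level, high_level, ..., high_1], i.e. level + 1 subbands."""
    current = list(signal)
    subbands = []
    for _ in range(level):
        if len(current) < 2:
            raise ValueError("signal too short for this decomposition level")
        if len(current) % 2:
            current.append(current[-1])  # repeat the last sample to get an even length
        current, high = analysis_step(current)
        subbands.insert(0, high)
    subbands.insert(0, current)
    return subbands


if __name__ == "__main__":
    x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
    for band in decompose(x, 2):
        print([round(v, 3) for v in band])
```

Only the low-pass branch is split again at each level, which is why the number of subbands grows by one per level rather than doubling.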
The subband that contains the lowest frequencies is called the approximation coefficient cA. The other subbands are called the detail coefficients cD and give the detail information of the signal. For decomposition level p, cA and cD can be obtained by

$$cA_p = \sum_k x[k]\, L[2p - k]$$

$$cD_p = \sum_k x[k]\, H[2p - k]$$

Artificial Neural Network
An artificial neural network (ANN) simulates the neural information processing of the human brain. An ANN processes inputs in parallel with a large number of processing elements called neurons and uses large interconnected networks of simple, non-linear units. Therefore, the function of an ANN is determined largely by the connections between elements.
An ANN with multilayer perceptron architecture, as shown in Figure 2, was used in this work. A signal is transmitted in one direction from the inputs to the outputs, and therefore this architecture is also called feed forward.

Figure 2. ANN with multilayer perceptron architecture

Each neuron may use any differentiable transfer function f to generate its output. The bipolar sigmoid transfer function and the purelin (linear) transfer function were used in this work. If the last layer of a multilayer network has sigmoid neurons, then the outputs of the network are limited to the small range -1 to 1. If linear output neurons are used, the network outputs can take on any value.

Recording
Vowel samples in this work were recorded from 20 native Indonesian speakers aged 19 to 25 years, 10 males and 10 females. This work was restricted to the six Indonesian monophthong vowels, /a/, /i/, /u/, /ə/, /e/ and /o/, leaving the diphthongs out of consideration. Each vowel sound was recorded into a 3-second *.wav file using a laptop and a microphone. The recording used a sampling rate of 44100 Hz with 16-bit resolution, mono.
Each vowel was recorded twice from each speaker. The database was then partitioned into two subsets: one for the training phase (75% of the total) and the other for the testing phase (25% of the total).

Pre-processing

Sampling
The effective frequency of the human voice is 4000 Hz. By Shannon's sampling theorem, the sampling frequency must be at least twice that, i.e. 8000 Hz. The sampling frequency of 44100 Hz used in this work satisfies this requirement.

Pre-emphasis filter
A pre-emphasis filter is used to spectrally flatten the vowel signal. It was implemented by the following equation with a pre-emphasis constant, a, of 0.[...]:

$$y(n) = x(n) - a\,x(n-1)$$

Frame Blocking and Windowing
After being filtered, the signal was divided into frames of 20 ms, so that the vowel signal could be assumed stationary within each frame.

$$(\text{Sampling rate}) \times (\text{Frame length}) = \text{Number of samples in a frame}$$

$$(44100\ \text{samples/s}) \times (0.02\ \text{s}) = 882\ \text{samples}$$

With frames of 20 ms, each frame in this work therefore consists of 882 samples. After frame blocking, each individual frame was windowed to minimize the signal discontinuities at the beginning and end of the frame. A Hamming window was used in this work, calculated as:

$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{881}\right), \quad 0 \le n \le 881$$

DWT decomposition
The selection of a suitable mother wavelet and number of decomposition levels plays an important role in obtaining a good recognition rate in speech recognition. Among the various wavelet bases, db4 and sym4 were used in this work because of their orthogonality property and efficient filter implementation. Decomposition levels 2, 4 and 6 were implemented for each vowel.

Figure 3. Pre-processing flowchart

Processing

Building ANN architecture
This work used an ANN with multilayer perceptron architecture. The architecture consisted of two hidden layers with the bipolar sigmoid activation function and an output layer with the purelin activation function.
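A forward pass through such an architecture can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the bipolar sigmoid is taken in its common textbook form 2 / (1 + e^(-x)) - 1 (the paper's own equation is not reproduced here, and some toolboxes use a steeper variant), the input size of 12 is a placeholder, and the helper names `dense`, `init_layer` and `forward` are hypothetical.

```python
import math
import random


def bipolar_sigmoid(x):
    # Assumed textbook form 2 / (1 + exp(-x)) - 1, written as tanh(x / 2),
    # which is the same function but does not overflow for large |x|.
    return math.tanh(x / 2.0)


def purelin(x):
    # Linear transfer function: the output is unbounded.
    return x


def dense(inputs, weights, biases, transfer):
    """Apply one fully connected layer: transfer(W . inputs + b) per neuron."""
    return [transfer(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_out, n_in, rng):
    """Small random weights and biases (hypothetical initialization range)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases


def forward(features, layers):
    """Two bipolar-sigmoid hidden layers followed by one linear output neuron."""
    h1 = dense(features, layers[0][0], layers[0][1], bipolar_sigmoid)
    h2 = dense(h1, layers[1][0], layers[1][1], bipolar_sigmoid)
    return dense(h2, layers[2][0], layers[2][1], purelin)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # 12 inputs is a placeholder size; 10 and 5 are hidden layer sizes.
    layers = [init_layer(10, 12, rng), init_layer(5, 10, rng), init_layer(1, 5, rng)]
    print(forward([0.1] * 12, layers))
```

Because the output neuron is linear, the network can emit values outside the range -1 to 1, which is what allows a single output to encode several numeric class targets.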
The first hidden layer has 10 neurons, and the second hidden layer has either 5 or 7 neurons.

Training
To perform a particular function, an ANN can be trained by adjusting the values of the connections (weights) between elements. The training process requires a set of examples of proper network behavior, the training set, consisting of network inputs and target outputs. The minimum, maximum, mean and standard deviation values of the wavelet features of the vowel signals from the first subset were used as the input vectors of the ANN. The target values 1, 3, 5, 7, 9 and 11 were assigned to the vowels /a/, /i/, /u/, /ə/, /e/ and /o/, respectively.
The backpropagation algorithm was used to train the ANN. The algorithm took an input vector, compared the network output to the desired output for that vector, and updated each weight by an amount proportional to the derivative of the error with respect to that weight times a learning rate. A learning rate of 0.5, a minimum error of 0.0001 and a minimum gradient of 0.000001 were set.

Figure 4. ANN training phase flowchart

Testing

Figure 5. ANN testing phase flowchart

RESULTS AND DISCUSSION

Vowel Signal
The total number of recorded vowel samples in this work is 120 for male speakers and 120 for female speakers. Figure 6 shows the original Indonesian vowel signals of the first male speaker. Figure 7 shows the spectrally flattened signals, and Figure 8 shows the signals after frame blocking and windowing.

Figure 6. Original signals of Indonesian vowels of male speaker 1 (panels "A", "I", "U", "E", "EE", "O")
Figure 7. Pre-emphasized signals (panels "A", "I", "U", "E", "EE", "O")
Figure 8. Frame blocked and windowed signals (panels "A", "I", "U", "E", "EE", "O")

The vowel database, for male and female speakers alike, was divided into two subsets.
The first subset, 90 vowel samples, was used for the ANN training phase, and the second subset, 30 vowel samples, was used for the ANN testing phase.

Feature Extraction Results
Wavelet coefficients were the results of vowel feature extraction by using DWT in this work. Three sets of wavelet features were obtained from each vowel signal for each mother wavelet:
- Decomposition level 2 obtained 3 wavelet features: 1 approximation coefficient cA and 2 detail coefficients cD.
- Decomposition level 4 obtained 5 wavelet features: 1 approximation coefficient cA and 4 detail coefficients cD.
- Decomposition level 6 obtained 7 wavelet features: 1 approximation coefficient cA and 6 detail coefficients cD.

Training and Testing Results

Training
The minimum, maximum, mean and standard deviation values of the wavelet features of the vowel signals from the first subset were used as the input vectors of the ANN in this work. Multiple passes of training over the entire training set were required (rather like a person learning a new skill) until the minimum error target or the minimum gradient was reached. Each pass is called an iteration or epoch. Tables 2 and 3 show the number of epochs for each training set for male and female speakers. They also show that the ANN that used input vectors from decomposition level 6 needed fewer epochs than the ANN that used input vectors from decomposition level 4, and fewer still than level 2. This is because a higher decomposition level obtains more wavelet features from the vowel signal and therefore gives more input vectors for the ANN. More input vectors mean more knowledge for the ANN to learn from.

Table 2. Epochs in case of male speakers
Columns: Wavelet, Decomposition level, Neurons in 2nd hidden layer, Epoch

Table 3. Epochs in case of female speakers
Columns: Wavelet, Decomposition level,
Neurons in 2nd hidden layer, Epoch

Training the ANN means adapting its connections so that the network exhibits the desired computational behavior for the input patterns. The process involved modifying the weights, which were random initially. The adjusted weights of the ANN were obtained from this phase. Training the ANN with different numbers of neurons in the second hidden layer on the same inputs resulted in different weight matrices. The ANN with 5 neurons in the second hidden layer obtained first hidden layer weights in a 5×10 matrix, second hidden layer weights in a 1×5 matrix and first hidden layer bias weights in a 5×1 matrix. The ANN with 7 neurons in the second hidden layer obtained first hidden layer weights in a 7×10 matrix, second hidden layer weights in a 1×7 matrix and first hidden layer bias weights in a 7×1 matrix.

Testing
The ANN in this phase used the adjusted weights from the training phase. A threshold of 0.8 was used to determine the output decisions. Table 4 shows the classification of the ANN outputs based on the threshold.

Table 4. ANN outputs classification
ANN output               Decision
0.2 ≤ output ≤ 1.8       recognized as /a/
2.2 ≤ output ≤ 3.8       recognized as /i/
4.2 ≤ output ≤ 5.8       recognized as /u/
6.2 ≤ output ≤ 7.8       recognized as /e/
8.2 ≤ output ≤ 9.8       recognized as /ə/
10.2 ≤ output ≤ 11.8     recognized as /o/
other values             could not be recognized

Figure 10. Classification graph of the ANN outputs (panels Input 1 to Input 6, axes Input and Output; legend: training input, test input, * = target, o = test output)
Figure 11. Indonesian vowel recognition testing

The value of the threshold depends on the spacing between the target values. A wider spacing between targets would need a bigger threshold value. With targets spaced 2 apart, a threshold of 0.8 keeps the decision intervals from overlapping.
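The decision rule amounts to checking whether the ANN output lies within the threshold of one of the target values. A minimal Python sketch, assuming the targets 1, 3, 5, 7, 9 and 11 and the threshold 0.8 stated in the text; the vowel order follows the decision column of Table 4, and the names `TARGETS` and `decide` are hypothetical:

```python
# Target value for each vowel; the order is an assumption taken from Table 4.
TARGETS = [(1.0, "a"), (3.0, "i"), (5.0, "u"), (7.0, "e"), (9.0, "ə"), (11.0, "o")]


def decide(output, threshold=0.8):
    """Return the recognized vowel, or None if the output matches no target."""
    for target, vowel in TARGETS:
        if abs(output - target) <= threshold:
            return vowel
    return None


if __name__ == "__main__":
    for value in (1.1, 3.75, 2.0, 10.5, 12.5):
        print(value, decide(value))
```

Outputs that fall in the gaps between intervals, such as 2.0, are reported as unrecognized rather than being forced onto the nearest vowel.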
Figure 10 visualizes the classification graph of the ANN outputs, and Figure 11 visualizes the testing module. Based on the testing results, some vowel samples were recognized wrongly. Some /u/ samples were recognized as /o/, or vice versa. This happened because both vowels share the same articulation: rounded lips and a back tongue position.
The recognition rate shows the effectiveness of the methods in this work:

$$\text{Recognition rate} = \frac{\text{true data}}{\text{total test data}} \times 100\%$$

30 vowel samples were used for the testing phase. The outputs of the ANN in the testing phase can be classified into two groups. The first is true data, the vowel samples that were recognized correctly. The second is false data, the vowel samples that were either recognized wrongly or could not be recognized at all.

Table 5. Recognition rates in case of male speakers
Columns: Wavelet, Decomposition level, Neurons in 2nd hidden layer, Recognized, Not recognized, Recognition rate (%)

Table 6. Recognition rates in case of female speakers
Columns: Wavelet, Decomposition level, Neurons in 2nd hidden layer, Recognized, Not recognized, Recognition rate (%)

Table 7. Overall recognition rates
Columns: Wavelet, Decomposition level, Neurons in 2nd hidden layer, Recognition rate (%), Average

Besides the factors within the ANN, the recognition rate was also influenced by physical factors within the vowel signal. Unwanted noise in the vowel sound could give a signal representation different from the true signal. The common source was background noise, which is always present in any location. Therefore, the obtained approximation coefficients were significantly different. The way the speakers pronounced the vowels also had an effect. In an ideal recording, each vowel is pronounced naturally, neither too softly nor too loudly.

CONCLUSION
Indonesian vowel recognition using an artificial neural network based on wavelet features has been presented in this paper. Different mother wavelets and decomposition levels were implemented to analyze each vowel signal.
It is shown that decomposition level 2 in DWT obtained 3 wavelet features, level 4 obtained 5 wavelet features and level 6 obtained 7 wavelet features. An artificial neural network was used as the recognition system. From the experimental results, it can be concluded that the presented method can recognize Indonesian vowels with an overall recognition rate of 70.83%. For male speakers the highest recognition rate is 90%, and for female speakers the highest recognition rate is 80%. Both were achieved by using sym4 as the mother wavelet, decomposition level 6 and an ANN with 7 neurons in the second hidden layer.
For further work, more vowel samples are needed for the training phase in order to get a better recognition rate. The vowel samples should be recorded in a quiet environment. Other mother wavelets and decomposition levels can also be implemented.

ACKNOWLEDGMENT
The researchers would like to thank the native Indonesian speakers of the Physics Study Program, Universitas Lambung Mangkurat, for participating in this work.

REFERENCES