Indonesian Journal of Electrical Engineering and Informatics (IJEEI), Vol., No., September 2025, pp. 814-829. ISSN: 2089-3272. DOI: 10.52549/ijeei.

The Effect of Noise on Speaker Identification and Finding a Noise that Improves Accuracy

Md Atiqul Islam1, Mohammed Abdul Kader2
1 International Centre for Neuromorphic Systems, The MARCS Institute for Brain, Behaviour, and Development, Western Sydney University, Kingswood, NSW 2751, Australia
2 Department of Electrical and Electronic Engineering, International Islamic University Chittagong, Bangladesh

ABSTRACT
Conventional Speaker Identification (SID) systems accurately identify speakers when their speech is noiseless. However, their classification accuracy drops substantially when speech is corrupted by noise. SID systems would be more practical and widely applicable if they were more noise-robust. We introduce an SID system that classifies speakers accurately even when their speech is corrupted by various types of noise at different noise levels. We investigate the impact of noisy training data on the performance of an SID system, and which noise may enhance that performance. In this paper, we compare two front-end feature extractors: a cochlea model, the Cascade of Asymmetric Resonators with Fast Acting Compression (CAR-FAC), and an FFT-based Gammatone Frequency Cepstral Coefficient (GFCC) extractor. We use the Gaussian Mixture Model with the Universal Background Model (GMM-UBM) and an Extreme Learning Machine (ELM) as classifiers to focus on the influence of the front-ends on performance. We train the GMM-UBM and the neural network with noisy data under various conditions to investigate the impact of noise on the SID performance. Our results suggest that noisy training data make an SID system noise-robust while the performance under clean conditions remains almost the same. More interestingly, training with speech-shaped noise (cocktail party) enhances SID accuracy more than white noise.
Article history: Received September 4, 2024; Revised September 3, 2025; Accepted September 28, 2025.
Keywords: CAR-FAC; GFCC; GMM-UBM; Speaker Identification; Cocktail Party; Noise-robust.
Copyright © 2025 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author: Mohammed Abdul Kader, Department of Electrical and Electronic Engineering, International Islamic University Chittagong, Chittagong, Bangladesh. Email: kader05cuet@gmail.com
Journal homepage: https://section.com/index.php/IJEEI/

INTRODUCTION
The applications of SID systems are growing with the advancement of human-machine interfaces in state-of-the-art technologies. They are used in a variety of online systems, including security systems, call centers in banks, and health services. Moreover, Tesla's autopilot cars, Waymo's driverless cars and semi-trucks, and other automated vehicles are becoming popular and part of our daily lives. Drivers can control these cars through a remote-access voice-control system in the car. Authentication of the driver's voice can secure access to, and the security of, the vehicle. In addition, most smartphones include a voice authentication system to access the device and its data. Moreover, banks such as HSBC and First Direct use SID systems for online and phone account customers.

Conventional SID methods apply the FFT as a frequency analyzer. Mel-frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), and Power Normalized Cepstral Coefficients (PNCC) are examples of FFT-based methods, and they often achieve 100% accuracy when speech is clean, i.e., noiseless. In real-world applications, however, almost all input speech contains some degree of background noise. The accuracy of FFT-based methods drops considerably when speech is corrupted by noise. The FFT spectrum shows not only frequency harmonics but also the energies associated with those frequencies.
The addition of noise to a clean signal adds energy to the spectrum at varying frequencies. This causes a significant mismatch between clean and noisy FFT spectra and leads to poor performance under noisy conditions. In contrast, the human auditory system is famously robust to background noise and competing speech, with the well-known cocktail-party problem as a typical example. Many papers have investigated the effect of cocktail-party-type noise on speech recognition performance. Unfortunately, very few papers show the effect of cocktail-party noise on SID. In this work, we investigate the effect of cocktail-party noise, under both training and testing conditions, on an SID system.

Many cochlear models are now available that emulate human auditory function. One particularly interesting model of the human cochlea is the Cascade of Asymmetric Resonators with Fast Acting Compression (CAR-FAC) introduced by Lyon. It has been shown to fit human physiological data better than six other auditory models in response to relevant stimuli. Our previous study showed that the CAR-FAC and other cochlear methods achieve poor performance in fluctuating noise conditions. Thus, providing noise-robust performance under noisy conditions is also a challenge for cochlear methods. There are two techniques to improve the performance of these methods under noisy conditions. The first alters the front-end feature extraction procedure to achieve noise-robust performance; examples include speech enhancement techniques and feature fusion. The second trains the classifier model so that it learns both the noise and the signal to enhance SID performance. Here we investigate the second technique. The classifier's speaker model also influences SID classification accuracy on noisy speech. Optimization of a classifier's parameters can strongly influence SID classification rates.
State-of-the-art back-ends, such as x-vectors with deep neural networks, can perform very well in SID tasks at the cost of high computational time. The Gaussian Mixture Model with the Universal Background Model (GMM-UBM) remains a popular classifier for SID systems due to its simple and fast implementation and strong performance on noiseless speech. Like x-vectors, the GMM has also been integrated with neural networks to achieve better accuracy. In some cases, the GMM with the UBM outperforms neural networks in a speaker identification task. It offers performance competitive with more recent classifiers, e.g., the i-vector, on clean or noisy speech, at low computational cost. Moreover, the GMM-UBM is a useful classifier for investigating changes in the front-end, as it does not add additional nonlinearities the way neural networks do. Nonlinear neural networks can produce comparatively better performance than the GMM-UBM, but they require larger training datasets, and the contribution of the front-end may not be understood correctly due to the nonlinearities in their mechanism. Like all classifiers, the GMM-UBM's performance degrades as the signal-to-noise ratio (SNR) of the input speech decreases. This degradation is amplified when the model is trained on clean speech but the testing dataset is noisy. In this paper, we use a single-layer neural network, the Extreme Learning Machine (ELM), as used in the previous study, to examine whether the same experimental setup produces results similar to the GMM-UBM and thereby validate the integrity of this study. We did not explore a deep neural network due to the limited amount of data in our datasets. The GMM-UBM exhibits poor performance for nonlinear patterns of input data and under mismatched conditions. The presence of noise in the training data may reduce the discrepancy among speaker models under different noise levels and can play a significant role in producing noise-robust performance.
Unlike the GMM-UBM, a neural network classifier requires large amounts of balanced data for training to achieve good performance. The purpose of such training is to find a speaker's distinguishing features under a large variation of training data. We believe that noise in the training data may help to identify a speaker correctly even with limited training data; this is one of the objectives of this paper. This paper investigates whether we can limit the performance degradation by training the GMM-UBM and ELM classifiers on noisy speech. We use the GFCC as an FFT-based algorithm and the CAR-FAC as a cochlear algorithm to cover both types of auditory features. We investigate whether the proposed technique for training the GMM-UBM and ELM improves performance for both algorithms. Moreover, a comparison of their performance is presented here for the first time, using the YOHO dataset. A recent study has shown (in a mouse) that adding white noise to a signal can enhance the brain's ability to distinguish subtle tones by suppressing cortical tuning curves. In this study, we also investigate whether white-noise-conditioned data in the speaker training is useful for achieving noise-robust SID performance. We apply four different setups for UBM and GMM modeling. Our results show that if we train our classifier with noisy data, we can achieve SID accuracy that is remarkably robust to noise.

METHODOLOGY
Figure 1 illustrates the block diagram of the presented SID system applying the CAR-FAC model. It is divided into training and testing parts (Figure 1). We describe each part in the following sections.

The front-end feature extraction procedure
The CAR-FAC model has been described in detail elsewhere. Briefly, it uses a cascade of second-order asymmetric resonators to generate the Basilar Membrane (BM) response to a transduced traveling wave.
The resonator pole and zero locations control the damping factor, which in turn changes the BM filter gain and bandwidth. They are responsible for the model's level-dependent compressive nonlinearity. The distance between the pole and zero, h, is thus a crucial parameter in the CAR-FAC model.

Figure 1. An illustration of the CAR-FAC SID system using the GMM-UBM, including training and testing. No samples are common to both the training and testing datasets. In some experiments, we added noise to the training samples to generate noisy training data, as we detail later.

We set h = sin(2π f_c / f_s) to keep the pole a half-octave away from the zero location in the CAR-FAC. Here, f_c is the characteristic frequency (CF) of each section of the cascade and f_s is the sampling frequency of the input signal. f_c is determined by the Greenwood function to map 30 channels from 125 Hz to 3 kHz. We set the upper frequency limit at 3 kHz because most SID cues, such as the speaker's fundamental frequency, pitch, and formants (f1 and f2), are below this cap. In our investigation, we found that the CAR-FAC with 70 channels produces performance similar to 30 channels but significantly slows down the back-end processing. Thus, we proceed with the CAR-FAC with 30 channels.

The CAR-FAC algorithm implements nonlinear computations by controlling the pole radius r:

r = r1 + drz × (1 − b) × NLF(v)

where b is the feedback factor from the AGC loop. Here, r1 is the minimum radius that maximally dampens the resonator:

r1 = 1 − ζ × 2π f_c / f_s

where ζ is the damping factor and drz = 0.7 × (1 − r1). In the CAR-FAC model, the damping factor controls the BM response compression. In human hearing research, typical damping factor values range from 1 to 0. We set the damping factor to 0. The level-dependent multi-rate nonlinearity comes from the inner hair cell feedback through the Automatic Gain Control (AGC) loop filter. The instantaneous Non-Linear Function (NLF) interacts with the input waveform's velocity, v, in the CAR section of the CAR-FAC:

NLF(v) = 1 / (1 + (k × v + v_offset)²)

The NLF controls the gain of the CAR velocity and produces combination tones, such as the cubic-distortion tone in the cochlea. The parameters k = 0.1 and v_offset = 0.04 are default values in the CAR-FAC implementation. We apply these nonlinearities to obtain cochlear features. The BM energies are then computed as:

E_i = Σ_{t=j}^{j+L} C_i(t)²

where j is the starting index of each time window, L is the window duration, and C_i contains the output samples of channel i. Next, we apply the cube root and DCT to the nonlinear BM energy features to nonlinearly scale the features and decorrelate their dimensions. Moreover, applying the cube root and DCT to the BM energies produces better results than the BM energies alone, as found in the previous study. Figure 2 illustrates the effect of these transforms on the BM energy in the form of a histogram. Cochlear features are not Gaussian distributed, as shown in Figure 2 (left panel), and the feature data are highly correlated, as one bin covers most of the data range. Applying the cube root to the CAR-FAC features increases the data variance, as shown in Figure 2 (middle panel). The cube root nonlinearly amplifies the data to reduce the noise effect; however, this amplification alone can hinder SID performance, as we observed empirically. Applying the DCT to the cube-root features fits the data closer to a Gaussian distribution (right panel), which should help the GMM-UBM classifier build a better speaker model.

We extract the GFCC following the previous study. We use 64 channels over a frequency range of 125 Hz to 3 kHz. The Gammatone spectrum was downsampled to 100 Hz, and only the magnitude spectrum was considered. Then the cube root and DCT were applied. We omitted the first channel, which contains the maximum energy and increases similarities among speakers. The channels above the 29th contain negligible energy compared to the lower channels, so we omitted those channels as well.
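The cube-root and DCT post-processing applied to the channel energies can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation; the function names are ours, and the DCT-II is written out explicitly so the block is self-contained.

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II along the first (channel) axis, written out with
    an explicit cosine basis so no external DCT routine is needed."""
    n = x.shape[0]
    k = np.arange(n)[:, None]            # output coefficient index
    t = np.arange(n)[None, :]            # input channel index
    basis = np.cos(np.pi * k * (2 * t + 1) / (2 * n))
    basis[0] *= 1 / np.sqrt(2)           # orthonormal scaling of the DC row
    return np.sqrt(2 / n) * basis @ x

def postprocess(bm_energy):
    """Cube root (nonlinear scaling) then DCT (decorrelation) across channels,
    as in the feature pipeline described above."""
    return dct_ii(np.cbrt(bm_energy))
```

The cube root compresses the wide dynamic range of the energies, and the DCT concentrates the correlated channel energies into a few leading coefficients, which is why low-order coefficients can be kept and high-order ones discarded.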
Therefore, the number of channels used in the presented GFCC is 28.

Figure 2. Histograms of the energy output of the CAR-FAC, and the effect of the cube root and DCT on that output.

GMM-UBM speaker modeling
The GMM-UBM speaker modeling system has two parts: UBM development and adaptation of the speaker data with the UBM to create a GMM speaker model. The UBM is a single GMM trained with all pooled data from the training or development dataset using the Expectation-Maximization (EM) algorithm. The EM algorithm iteratively increases the likelihood of the dataset given the GMM parameter values (weights, means, and variances). The estimation of the GMM parameters using the EM process has been detailed elsewhere. In the GMM, each Gaussian component density p_k(x) of mixture component k, for input x, is expressed as a function of a mean vector μ_k and a D × D covariance matrix Σ_k, where D is the feature dimension:

p_k(x) = (2π)^(−D/2) |Σ_k|^(−1/2) exp{ −(1/2) (x − μ_k)^T Σ_k^(−1) (x − μ_k) }

The covariance matrix determines the correlation between adjacent feature dimensions. A diagonal covariance matrix is often used in the GMM instead of a full covariance matrix to restrict the Gaussian ellipse axes to the directions of the coordinate axes. This restriction helps the GMM estimate better parameter values while requiring fewer samples and less computational time. The parameter estimation of a full-covariance GMM can be found in the literature. Note that the diagonal elements of a covariance matrix are the variances (σ²) of the channel features. The UBM, λ_UBM, is defined by the optimized weight W_k, mean μ_k, and variance σ²_k of all M mixture components:

λ_UBM = {W_k, μ_k, σ²_k}, k = 1, 2, 3, …, M

We used M = 256 mixtures in all experiments. The adaptation of the GMM with λ_UBM starts with all training samples x_t (t = 1, 2, 3, …, T) and the posterior probability p(k | x_t)
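With a diagonal covariance, the component density above factorizes into a product of one-dimensional Gaussians, which is cheap to evaluate in the log domain. A minimal sketch (variable names are ours, not from the paper):

```python
import numpy as np

def log_gauss_diag(x, mu, var):
    """log N(x; mu, diag(var)) for a D-dimensional point x with a
    diagonal covariance whose diagonal entries are `var`."""
    d = x.shape[0]
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.sum(np.log(var))          # log |Sigma| for a diagonal Sigma
                   + np.sum((x - mu) ** 2 / var)) # Mahalanobis term
```

Working in log space avoids the underflow that the raw density would cause for high-dimensional features.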
that a sample x_t belongs to a particular mixture k:

p(k | x_t) = W_k p_k(x_t) / Σ_{k=1}^{M} W_k p_k(x_t)

p(k | x_t) and x_t are used to compute the mixture probability counts n_k, mean μ̂_k, and second moment E2_k for each mixture component. This step is the same as the expectation step in UBM training:

n_k = Σ_{t=1}^{T} p(k | x_t)
μ̂_k = (1/n_k) Σ_{t=1}^{T} p(k | x_t) x_t
E2_k = (1/n_k) Σ_{t=1}^{T} p(k | x_t) x_t²

Here, n_k is a vector of length M, and the mean and variance sizes are N × M, where N is the dimension of a feature. Finally, these new estimates from each speaker's training data are used to update the old statistics of λ_UBM to create the adapted parameters of the GMM model for each mixture k:

W_k^new = [α n_k / T + (1 − α) W_k] γ
μ_k^new = α μ̂_k + (1 − α) μ_k
(σ²_k)^new = α E2_k + (1 − α)(σ²_k + μ_k²) − (μ_k^new)²

Here, α is the adaptation coefficient for the weights, means, and variances. This coefficient is essentially a learning rate, defined as a function of the counts and a relevance factor r: α = n_k / (n_k + r). γ is a scaling factor that ensures Σ_{k=1}^{M} W_k^new = 1.

In the testing stage, the log-likelihood of the feature-vector sequence of a testing sample X is computed against each speaker model, and the mean of the frame scores is computed as:

X1 = (1/F) Σ_{l=1}^{F} log(p(X_l | λ_GMM))

Here, X1 is the testing score of a target sample against a speaker model and F is the number of frames. Thus, each testing sample has a score against each speaker model, and the maximum score indicates the target speaker's identity.

Extreme Learning Machine (ELM) speaker modeling
We use the ELM instead of a deep neural network because it requires less training data for a comparatively better result. We use a single Long Short-Term Memory (LSTM) layer, fully connected network with 400 hidden units. This layer takes its input as a sequence, and the output size equals the number of speakers.
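The expectation and MAP-update steps above can be sketched compactly. This is an illustrative sketch (mean adaptation only, diagonal covariances; the relevance factor value and helper names are ours), following the equations in the text:

```python
import numpy as np

def map_adapt(X, weights, means, variances, r=16.0):
    """One MAP pass adapting UBM means toward speaker data X.
    X: (T, D) frames; weights: (M,); means, variances: (M, D)."""
    T = X.shape[0]
    # log of W_k * p_k(x_t) under the diagonal-covariance UBM
    lp = (np.log(weights)[None, :]
          - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
          - 0.5 * np.sum((X[:, None, :] - means[None]) ** 2 / variances[None],
                         axis=2))
    lp -= lp.max(axis=1, keepdims=True)        # stabilize before exponentiating
    post = np.exp(lp)
    post /= post.sum(axis=1, keepdims=True)    # posteriors p(k | x_t), shape (T, M)
    n = post.sum(axis=0)                       # soft counts n_k
    Ex = (post.T @ X) / np.maximum(n[:, None], 1e-10)  # per-mixture mean estimate
    alpha = n / (n + r)                        # data-dependent adaptation coefficient
    return alpha[:, None] * Ex + (1 - alpha[:, None]) * means
```

Mixtures that see little speaker data (small n_k) keep their UBM means, while well-observed mixtures move toward the speaker statistics, which is what makes the adaptation robust with only 18 enrollment samples per speaker.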
This layer decides what information to keep, add, or discard and captures long-term features in the input data. We use the Root Mean Square Propagation (RMSProp) optimization technique to train our network, with an initial learning rate of 0.001 and a regularization rate of 0. The RMSProp optimizer updates the output weights iteratively by modifying the learning rate to speed up convergence; moreover, it makes the training more stable. Interested readers can find details of RMSProp elsewhere. We resized our input features to 64 × 64 to facilitate the ELM training. The maximum number of epochs for training was 22, and a batch size of 28 was used throughout. We used the SoftMax activation in the output layer. We trained the network with all types of noise at all SNR levels. Thus, there are 29 (4 SNRs × 7 types of noise + clean) speaker models. We used clean and noisy speaker models to identify the target speaker under all types of noise conditions. In the testing stage, the testing sample is compared with each speaker network to predict the speaker network to which it belongs. A testing sample produces only one output, related to its class.

Dataset and Experimental Setup
We use the YOHO dataset, which previous SID systems have employed. This dataset is relatively clean, which allows us to add noise according to the demands of our investigations; this is why we have not used more realistic datasets such as VoxCeleb. The YOHO dataset contains 138 speakers with 24 digit-pair samples each. We use 18 samples from each speaker to train the GMM and 1380 samples (10 from each speaker's training samples) to estimate the UBM parameters. The remaining 6 samples from each speaker were used for testing. In many experiments, we apply noisy datasets to train the GMM and UBM. We add white, pink, street, restaurant, train, and car noises to clean signals to create noise-corrupted signals.
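The network described above is an LSTM layer trained with RMSProp, which departs from the classic ELM recipe. For contrast, a classic single-hidden-layer ELM, random fixed input weights and a closed-form least-squares readout, can be sketched as follows; this is a generic illustration of the ELM family under our own assumptions, not the paper's network:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, y, n_hidden=400, n_classes=2):
    """Classic ELM: random, untrained input weights; tanh hidden layer;
    output weights solved in closed form by least squares (no gradient descent)."""
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)                  # hidden-layer activations
    T = np.eye(n_classes)[y]                # one-hot class targets
    beta = np.linalg.pinv(H) @ T            # minimum-norm least-squares readout
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)
```

The closed-form readout is why an ELM needs far less training than a deep network; the paper's LSTM-based variant trades this simplicity for the ability to capture temporal structure in the feature sequence.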
The white noise and pink noise were generated using MATLAB, and the remaining noise types were downloaded from the https://w.freesfx.uk/ website. The SNR ranges from 0 dB to 15 dB in steps of 5 dB. There are four types of training in this work. We use four training conditions:

i. Clean GMM and clean UBM: We use clean samples from each speaker to train the GMM and the UBM.

ii. Clean GMM and noisy UBM: We use the 18 clean samples to train the GMM and noisy data to train the UBM. We use 306 samples from 17 speakers (8 speakers at 5 dB and 9 speakers at 0 dB) for each noise type. Thus, 119 speakers cover the seven different types of noise, and the remaining 19 speakers with clean samples were used to train the UBM. This way, the UBM learns the variability of noise types and levels. Eventually, the developed UBM helps the GMM produce better SID accuracy under noisy and clean conditions.

iii. Noisy GMM and clean UBM: We train the GMM using all types of noise at all SNRs. Thus, there are 28 (4 SNRs × 7 types of noise) noisy GMM speaker models and a single clean GMM speaker model for each speaker; in total, there are 29 speaker models per speaker. In a real environment, the type of noise accompanying the input signal is unknown, and our brain is trained on all possible combinations of noisy data. This is why we use all GMM speaker models to represent all noise patterns at probable SNRs. The test sample was tested against each speaker model, and the maximum matching score indicated the correct SID.

iv. Noisy GMM and noisy UBM: We combine the noisy UBM from condition ii and the noisy GMM from condition iii to investigate whether the presented training is beneficial for achieving a noise-robust SID score.

We use clean and noisy samples to test each developed speaker model. The speaker model that produces the maximum likelihood score for a testing sample is considered the ID of that speaker.
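Mixing noise into a clean signal at a prescribed SNR, as in the training conditions above, amounts to scaling the noise so the power ratio matches the target. A minimal sketch (the function name is ours; the paper's MATLAB tooling is not shown):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the mixture clean + scale*noise has the requested
    SNR (in dB) relative to `clean`. Both inputs are same-length 1-D arrays."""
    p_clean = np.mean(clean ** 2)                     # signal power
    p_noise = np.mean(noise ** 2)                     # raw noise power
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

For example, sweeping snr_db over 0, 5, 10, and 15 reproduces the four noise levels used in the experiments.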
RESULTS
The results are shown for the CAR-FAC and the GFCC methods. Each figure presents the change in results when the GMM, the UBM, or both were trained using clean or noisy data. Each bar presents the average result of six random trials of training and testing samples; the error bars present the minimum and maximum values over the six trials. At the end, we also show results using a single-layer neural network to verify our findings with the GMM-UBM. Similar results with both classifiers would emphasize that noise in the training data can improve SID accuracy.

Cascade of Asymmetric Resonators with Fast Acting Compression (CAR-FAC)
Figure 3 shows the SID results for the CAR-FAC method. The GMM-UBM was trained in clean and noisy conditions and tested under clean and seven different noise conditions. Under clean or noisy training conditions, the CAR-FAC achieves almost 100% correct SID on clean speech, as shown in Figure 3 (top, rightmost panel). However, SID accuracy drops substantially as the noise level increases, as shown in Figure 3 at 0 dB SNR for all types of noise.

Figure 3. SID results for the CAR-FAC method under various types and levels of noise. The GMM and UBM were trained under several data conditions. The legend shows the training conditions of the GMM and UBM; the title of each subplot indicates the noise used in the testing data while scoring a speaker.

To reduce the discrepancy in SID accuracy between low and high SNRs, we train the UBM with noisy data while keeping the GMM clean. Figure 3 shows that the presented training technique improves SID accuracy at all SNRs for all types of noise, while the SID accuracy in clean conditions remains the same. This improvement is substantial; the average SID improvement varies from 3.72% (Airport), as shown in Figure 3.
The UBM estimates the variation of noise and speaker variability. Thus, a noisy UBM tunes the parameters of the GMM toward a better estimate of a speaker. Consequently, the developed GMM speaker model produces a high SID score, as shown in Figure 3. However, the decay of SID accuracy from clean to low-SNR conditions is still almost linear. Next, we train the GMM using noisy data while keeping the UBM clean to investigate whether the SID accuracy is enhanced further. The UBM now learns only the speaker variability while adapting the noisy GMM. Indeed, this technique further improves SID accuracy for all types of noise under noisy conditions, except for cocktail noise. Cocktail noise is a heavily talker-laden background noise that corrupts the speaker's speech and disturbs the GMM modeling; this is why the GMM produces poor SID accuracy under cocktail noise. We also train the GMM with a mixture of noise types and levels to observe which technique is more effective at generating a noise-robust SID. This GMM with mixed noise performs worse than our proposed training technique of individual noisy GMM speaker models. The mixed noise raises the variability among samples within a speaker; hence, the GMM estimates poor parameters for the speaker model. Comparing the noisy GMM and the noisy UBM, the noisy UBM is far more effective than the noisy GMM at producing a noise-robust SID result.

Then, we train both the GMM and the UBM with noisy data to see whether this technique further improves the SID accuracy. Interestingly, this technique improves SID accuracy slightly for some types of noise compared to the previous technique, as shown in Figure 3; for other types of noise, the SID accuracy is poorer. The noisy UBM is meant to learn the speaker variability, and this can be achieved through clean-speech training.
The GMM is already noisy, and the noisy UBM poorly estimates the speaker variability. As a result, the noisy GMM and noisy UBM combination provides a poorer SID score than the previous technique, as shown in Figure 3.

Gammatone Frequency Cepstral Coefficient (GFCC)
Figure 4 presents the results for the GFCC method, using the same experimental setup as for the CAR-FAC. The FFT-based GFCC produces 100% correct SID when the GMM-UBM is trained with clean or noisy data, as shown in Figure 4 (first bar in each subplot). However, this score drops to almost 0% under mismatched conditions, except for street noise. The difference in energy distribution between clean and noisy FFT spectra makes them mismatched and causes a severe reduction of SID accuracy under noisy conditions. This reduction of performance is particularly pronounced when the GMM-UBM is trained with clean data.

Figure 4. SID results for the GFCC method under various types and levels of noise. The GMM and UBM were trained under several data conditions. The legend shows the training conditions of the GMM and UBM; the title of each subplot indicates the noise used in the testing data while scoring a speaker.

We introduce noise to the UBM by training it with noisy data, while the GMM is trained with clean data. Figure 4 (second bar in each subplot) shows that the noisy-UBM technique significantly enhances the SID accuracy. The presented method achieves an average of 90% correct SID at 5 dB, as shown in Figure 4. Note that 5 dB SNR is the noise threshold for an understandable conversation. Comparing the performance of the clean UBM and the noisy UBM, it can be claimed that the enhanced performance under noisy conditions comes from the noisy UBM. This enhancement evidences that the UBM not only learns the speaker variation but also models the noise variability when trained on noisy data.
However, the SID score under noisy conditions still needs to improve before the SID system can be claimed noise-robust. Next, we train the GMM using noisy data and the UBM with clean data to see whether this further improves SID accuracy. Figure 4 (third bar in each subplot) shows the result of this technique: it substantially improves the performance of the GFCC method, except for street noise. This improved result indicates that the noisy-GMM training technique is more beneficial than the noisy UBM for producing noise-robust SID accuracy.

The GFCC method improves SID accuracy further when both the GMM and the UBM are trained with noisy data. Figure 4 (fourth bar in each subplot) shows the results for this technique: the performance is almost consistent down to 5 dB SNR irrespective of noise type, and the average SID score at 0 dB SNR is more than 80%. The results in Figure 4 show that the noisy GMM and noisy UBM can learn both the speaker and noise variabilities and hence produce noise-robust SID accuracy irrespective of the type and level of noise.

Comparing Figure 3 and Figure 4, on average the CAR-FAC method provides relatively better results than the GFCC method at lower SNRs when the GMM-UBM is trained using clean data. This better performance of the cochlear model under noisy conditions is coherent with previous studies and is particularly evident when the GMM-UBM is trained with clean data. The GFCC method gives better results than the CAR-FAC method at higher SNRs. This is expected because the energy spectra of clean and high-SNR data match closely.
The GFCC method shows a larger improvement than the CAR-FAC method when a noisy UBM is adopted with the clean GMM, as seen by comparing Figure 3 and Figure 4. The CAR-FAC method outperforms the GFCC method only under white noise at 0 dB and 5 dB; in contrast, the GFCC method produces significantly better performance than the CAR-FAC for all other types of noise at all SNRs. Figures 3 and 4 also show that the CAR-FAC method produces much better SID accuracy than the GFCC method when the GMM is trained with noisy data (third bar in each subplot). In contrast, the GFCC method performs much better than the CAR-FAC method when a noisy UBM is adopted with the clean or noisy GMM, as observed in Figures 3 and 4 (third and fourth bars in each subplot). The use of noisy data in GMM-UBM training enhances the performance of the FFT-based GFCC method more than that of the cochlear-based CAR-FAC method. Both methods exhibit their poorest performance in the cocktail-party noise scenario, which is challenging because it contains many voices from party participants. However, both methods improve under noisy training of the GMM-UBM. Thus, exposing the classifier to noisy data is necessary to develop a noise-robust SID system.

Noise that improves SID performance
We use the GMM-UBM and a single-layer neural network to investigate which noise in the training of an SID system enhances SID accuracy. In the following sections, we present results for these two classifiers.

Results using the GMM-UBM
The study cited earlier found that white noise in the brain of a mouse improves the brain's ability to separate subtle tone discrepancies by suppressing cortical tuning curves. The authors claimed that white noise is a good noise that enhances auditory perception.
Moreover, we observed in the previous experiments (Figures 3 and 4) that both methods struggle under cocktail noise. Thus, it is interesting to investigate which noise, in training or testing, produces a better SID result. In machine learning, the classifier can be considered the brain that distinguishes speakers based on their speech. The presence of noise at the classifier also enhances SID accuracy, as observed in Figures 3 and 4. Here, we investigate: "Which noise in the training data enhances the performance of a classifier?" The advantage of using the GMM-UBM is that it can focus only on the changes in the front-ends without adding nonlinearities like a neural network. We trained the GMM using data conditioned on each type of noise and tested it under all types of noisy data. Note that the UBM was trained on clean data. We used the CAR-FAC and GFCC methods, with the same experimental setup for both, to investigate which noise-trained classifier produces the best SID accuracy. The results are shown separately in Figure 5 and Figure 6 for the CAR-FAC and GFCC, respectively. Each subplot title shows the noise used to train the classifier; the legend shows the types of noise used to test it. The average results at each SNR are also shown to determine the best result for a specific noise-trained condition of the GMM-UBM classifier.

Figure 5. Performance of the CAR-FAC method trained under different noise conditions and tested under other noise conditions. The results are shown for various SNRs. The average result for each noise-conditioned training is shown with a red bar (rightmost bar in each subplot). The error bars present the minimum and maximum values over six trials.

Figure 6.
The performance of the GFCC method under different noise training and testing under other noise conditions. The results are shown for various SNR conditions. An average result for each noise-conditioned training is shown with a red bar (rightmost bar in each subplot). The error bar presents the minimum and maximum values of six trials.

The presence of an individual noise in the speaker training enhances the performance of the classifier when the same type of noise is added to the testing signal. However, adding noise in the speaker training may enhance or reduce the performance for other types of noise. This is shown in Figures 5 and 6. The results are shown as an average of six trials of random selections of training and testing data. The error bar indicates the minimum and maximum values of the six trials. The CAR-FAC has the best average performance when the exhibition noise-conditioned data are used to train the classifier, as shown in Figure 5. However, this method provides relatively better performance for all testing noise conditions at 0 dB when the cocktail noise is used in the speaker training. The average SID accuracy under the cocktail noise training condition is also similar to that of the exhibition noise training condition. This suggests that the presence of speech-patterned noise, such as the cocktail and exhibition noise, at the classifier is more beneficial than other types of noise, as shown in Figure 5. In contrast, these types of noise produce poor performance when they are present in the testing data. This result indicates that the cocktail party noise is the most challenging testing condition for an SID system seeking noise-robust performance. The performance of the GFCC method is shown in Figure 6. The GFCC method provides similar performance at 15 dB SNR for all types of noise training conditions.
However, the performance is sensitive to the training noise type at low SNR conditions, as shown in Figure 6. Unlike the result shown in Figure 5, the GFCC method produces poor SID accuracy for the cocktail and restaurant noise when other types of noise are used to train the speaker classifier. Interestingly, this method produces a better result when the cocktail or restaurant noise is used to train the speaker classifier. Note that the restaurant noise is also a speech-shaped noise.

Comparing Figure 5 and Figure 6, the GFCC provides a better result than the CAR-FAC method. The minimum and maximum values for each testing condition also vary less for the GFCC method than for the CAR-FAC method. Both methods provide poor performance when speech-shaped noise such as cocktail or restaurant noise is used for testing. This indicates that speech-shaped noise is more challenging than other types of noise when added to testing signals. In contrast, speech-shaped noise is beneficial when used for speaker training and provides better performance for other types of noise-conditioned testing. The distinctive results from the two methods suggest that the front-end processor plays a vital role in processing noisy data and contributes to the performance of a classifier. Ultimately, the performance of the classifier significantly improves in the presence of noise in the training signal. The investigation suggests that white noise may enhance SID accuracy under high SNR conditions. However, the effect of white noise may not be significant at low SNRs. In contrast, speech-shaped noise, such as cocktail noise, in the classifier may be more useful for producing noise-robust performance under adverse conditions. Thus, white-noise-based speaker training may not be good for an SID task, though this noise may enhance speech perception in the brain [?].
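The noise-conditioned training and testing used throughout these experiments depends on mixing each noise type into the speech at a controlled SNR. A minimal sketch of that mixing step follows; the function name and the simple power-based gain are our assumptions, not the paper's code.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + noise has the requested SNR in dB."""
    # Tile or truncate the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the gain so that 10*log10(p_speech / (gain**2 * p_noise)) == snr_db.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Applying this function with each of the seven noise recordings at 0, 5, 10, and 15 dB reproduces the kind of noise-conditioned corpus the experiments above rely on.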
CAR-FAC and neural network-based result
We also used a neural network to investigate whether noise in the training of an SID system can improve SID accuracy. A deep neural network would require a much bigger dataset to properly tune its parameters and achieve a state-of-the-art result. Hence, we used a simple fully connected neural network with one hidden layer. We use the same setup for the neural network as was used for the GMM-UBM, but we resized the input sample to 64 by 64 to facilitate and speed up the training of the network. Thus, we expect our neural network to produce a poorer SID accuracy than the GMM-UBM on our dataset. The generated results are shown in Figure 7. The average results in Figure 7 are lower than those in Figure 5, which supports our assumption. Figure 7 shows a similar pattern of results to Figures 5 and 6. The presence of white noise during training produces the poorest performance during testing, as shown in Figure 7. The highest SID accuracy was observed under airport noise training, which is 77% on average. Other types of noise produce similar average results. Again, training and testing with the same type of noise produces the highest SID accuracy for all types of noise.

Figure 7. The performance of the CAR-FAC with the neural network under different noise training and testing under other noise conditions. The results are shown for various SNR conditions. An average result for each noise-conditioned training is shown with a red bar (rightmost bar in each subplot). The error bar presents the minimum and maximum values of six trials.

Discussion
In this study, we investigate the impact of noise in building a noise-robust SID system. We used the FFT-based GFCC and cochlear-based CAR-FAC methods for our investigation.
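A single-hidden-layer classifier of the kind used for Figure 7 can be sketched as an Extreme Learning Machine (the classifier family named in the abstract): the hidden weights are fixed at random and only the output layer is solved, by least squares. This is an illustrative numpy sketch under assumed sizes and a tanh activation, not the trained network behind the reported results.

```python
import numpy as np

rng = np.random.default_rng(0)

class ELM:
    """Extreme Learning Machine: one hidden layer with fixed random
    weights; only the output weights are solved, by least squares."""

    def __init__(self, n_inputs, n_hidden, n_classes):
        self.w = rng.normal(size=(n_inputs, n_hidden))  # never trained
        self.b = rng.normal(size=n_hidden)
        self.n_classes = n_classes

    def _hidden(self, x):
        return np.tanh(x @ self.w + self.b)

    def fit(self, x, y):
        h = self._hidden(x)
        targets = np.eye(self.n_classes)[y]  # one-hot speaker labels
        # Solve h @ beta ~= targets in the least-squares sense.
        self.beta, *_ = np.linalg.lstsq(h, targets, rcond=None)

    def predict(self, x):
        return np.argmax(self._hidden(x) @ self.beta, axis=1)
```

In the experiments above, each 64-by-64 input would be flattened to a 4096-dimensional vector before being fed to such a network; because only a linear solve is needed, training is fast even without GPU acceleration.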
The cochlear front-end produces much better noise-robust performance than the FFT method when the GMM-UBM is trained with clean data. However, the SID accuracy decreased substantially under low SNR conditions. Utilization of noisy data to train either the GMM or the UBM improves SID accuracies under noisy test conditions. The training of the UBM with noisy data enhances the performance of the FFT method more than the cochlear method. The CAR-FAC channel information is more correlated than the FFT-based GFCC's channel information, which may hinder the UBM from learning speaker variation and yield a weaker universal speaker model for the CAR-FAC than for the GFCC. In contrast, the CAR-FAC method produces better performance than the GFCC method when the GMM is trained with noisy data. This improved result indicates that the CAR-FAC extracts speaker-distinguishing features, which are useful for the GMM to build individual speaker models. When we train the GMM-UBM on noisy data, it can more accurately classify noisy speech, while the accuracy in clean conditions remains intact. Biological systems do not ignore the noise in their environments but instead seem to learn it [?]. Perhaps one reason that humans identify speakers from noisy speech so well is that they learn accurate noise models throughout their lives and apply them in suitable environments. We observed that individual noise in the training samples is very effective for identifying a target speaker under that type of noise condition. This holds for all types of noise and makes the SID system consistent in producing noise-robust performance. We also investigated whether a mixture of all types of noise at different levels can influence the performance of an SID system. We randomly selected four speakers for each SNR (… speakers for each noise) and seventeen speakers for clean conditions.
Interestingly, we found that this combination of noise reduces SID accuracy significantly under both clean and noisy conditions. This degradation suggests that the human auditory system may know the noise types at each SNR level, which makes the auditory system noise-robust and consistent in performance for the SID task. If we consider the human auditory system performing an SID task, then the cochlea is its front-end feature extractor, and the brain is its classifier. We do not fully understand how the brain helps humans disambiguate speech from noise. Our work does not shed light on this issue, because the GMM-UBM classifier is not a biologically inspired system. We applied it because the GMM-UBM is popular [?, 31, ?], simple, computationally fast, and suited to focusing only on front-end changes. Other classifiers, such as neural networks, more closely resemble biological structure and function. We applied the neural network to represent a biological SID system. The neural network produces poorer performance than the GMM-UBM. This could be due to the limited training data in our dataset. However, the neural network produces a similar pattern of results to the GMM-UBM. Our results do indicate that learning noise during training enhances recognition performance and makes the system more robust to noise. This outcome supports the findings of previous studies [?]. Noise-robust performance using a classifier trained on noisy data indicates that noise might be one of the factors influencing brain signals and recognition performance. Thus, we also investigated which type of noise enhances SID performance, using seven different types of noise. The performance of the presented methods fluctuated with the variation of noise types. However, cocktail party noise is the most challenging testing condition for an SID system regardless of front-end processing.
Many studies have been performed on speech recognition in a cocktail party environment, but unfortunately, very few on speaker identification. This work can serve as an example of the effect of a cocktail party environment on an SID system. The classifier trained with white noise data generates a better result for testing data with added stationary or slowly varying (pink or street) noise. Testing data with fluctuating types of noise (restaurant, exhibition, cocktail, or airport) produce poor performance when the classifier is trained with white noise data. This result was observed for both classifiers. In contrast, fluctuating noise in the speaker models enhances the performance and produces improved SID accuracy for all other types of noise. The study in [?] showed that added white noise improves subtle tone discrimination in a mouse model. In contrast, we discovered that fluctuating (restaurant, cocktail, or airport) noise is better than white noise for distinguishing speakers in an SID system. The outcome of this study can be extended to speech recognition, SID for cochlear implant patients, speech intelligibility, and sound localization to develop a noise-robust SID system. A bigger dataset with the cochlear front-end and a deep neural network can be investigated to implement a more biologically plausible SID system. Finally, we need to apply noise when developing a speaker model to implement a noise-robust SID system. This technique can be an alternative to denoising techniques for achieving a noise-robust SID system.

CONCLUSION
In this study, we investigated the impact of noise in building a noise-robust SID system. We used the FFT-based GFCC and cochlear-based CAR-FAC methods for our investigation. The cochlear front-end produces much better noise-robust performance than the FFT method when the GMM-UBM is trained with clean data.
However, the SID accuracy decreased substantially under low SNR conditions. Utilization of noisy data to train either the GMM or the UBM improves SID accuracies under noisy test conditions. The outcome of this study can be extended to speech recognition, SID for cochlear implant patients, speech intelligibility, and sound localization to develop a noise-robust SID system. A bigger dataset with the cochlear front-end and a deep neural network can be investigated to implement a more biologically plausible SID system. Finally, we need to apply noise when developing a speaker model to implement a noise-robust SID system. This technique can be an alternative to denoising techniques for achieving a noise-robust SID system.

REFERENCES