Indonesian Journal of Electrical Engineering and Informatics (IJEEI), September 2025, pp. 668-678. ISSN: 2089-3272. DOI: 10.52549/ijeei.

Supporting Communication for Deaf People with Sign Language Recognition Using Deep Learning Approach

Thien Ho, Quyen Tran, Tra Nguyen, Nha Tran, Huy Tran
Faculty of Information Technology, Ho Chi Minh City University of Education, Vietnam

Article history: Received Jul 28, 2024; Revised Aug 24, 2025; Accepted Sep 12, 2025.

Keywords: Sign language recognition; Deep learning; Long Short-Term Memory (LSTM); Multi-layer Perceptron (MLP)

ABSTRACT: Sign language recognition (SLR) plays a crucial role in improving communication for deaf individuals. This paper investigates the recognition of sign language with deep learning models based on action features, using skeleton data from the Argentinian Sign Language (LSA64) dataset. The models explored include the Multi-layer Perceptron (MLP) Neural Network and Long Short-Term Memory (LSTM). The MLP Neural Network, utilizing multiple layers of perceptrons, reached an accuracy of 96.10%. The LSTM model, excelling at processing sequential data, attained the highest accuracy of 98.60%. These results demonstrate the effectiveness of deep learning models for sign language recognition, with LSTM showing the most promise thanks to its ability to capture temporal dynamics. Consequently, this study opens up prospects for applying sign language recognition technology in practice, contributing to enhancing the quality of life of deaf individuals.

Copyright © 2025 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author: Tran Quang Huy, Faculty of Information Technology, Ho Chi Minh City University of Education, 280 An Duong Vuong Street, Ward 4, District 5, Ho Chi Minh City, Vietnam. Email: huytq@hcmue.

INTRODUCTION
Sign language is one of the most widely used means of communication for deaf people.
It is a combination of hand and arm gesture sequences, body posture, and facial expressions, which helps deaf people communicate effectively. However, sign language belongs specifically to the deaf community and is not widely known among hearing people, which creates many difficulties for deaf individuals when communicating with others. Research on sign language recognition systems promises to grow significantly, judging by the number of publications and the growing interest in the field. Sign language recognition systems still face many difficulties. One of them is choosing features for the recognition models: compared with general action recognition, sign language recognition must pay closer attention to finger movements and facial expressions. Moreover, most signs are performed from the waist up and very few from the waist down, so selecting the right information is necessary to keep the model light. In addition, datasets suitable for training SLR models are limited and sometimes far from real-world conditions. Another difficulty is the differences between the sign languages of different countries, which calls for a multilingual system capable of translating different sign languages into text or speech. SLR can be divided into two main types: isolated SLR and continuous SLR. Isolated SLR classifies the input into separate signs (sign classification), while continuous SLR (also called sign language translation) focuses on translating consecutive signs. Figure 1 illustrates how isolated and continuous SLR work: to understand the sentence "I work hard" from a deaf signer, isolated SLR recognizes and predicts each sign separately, while continuous SLR recognizes the signs continuously and produces the result as a complete sentence.
In this paper, we focus on building an isolated SLR system with the following contributions:
- Extracting frame sequences and skeleton data for machine learning models.
- Proposing deep learning models, namely ResNet, MLP Neural Network, and LSTM, to evaluate the dataset based on RGB image features and skeleton features.
Figure 1. An example comparing isolated and continuous SLR.
The remainder of the paper is organized as follows: related studies are presented in Section 2; details of the model-building method are presented in Section 3; Section 4 presents the experiments and the results obtained; and the conclusion is given in Section 5.
RELATED WORK
In this section, we present an overview of word-level datasets from several countries and some computer vision-based methods for sign language recognition.
Word-level Sign Language Datasets
Many word-level datasets have been built for sign language recognition in different countries. Table 1 lists datasets whose videos capture word-level sign language, with information such as the number of signs, the number of videos, the video type, the number of signers, and the country whose sign language is represented. LSA64 is an Argentinian sign language dataset with controlled arrangements of context and of the clothing worn by the signers. To simplify segmenting the hands in the image, each signer wears fluorescent-colored gloves; this may not be suitable for recognition in real life. INCLUDE is an Indian sign language dataset of 263 signs divided into 15 topics such as places, occupations, and animals. The INCLUDE dataset was shot in natural light, without any effort to regulate the signers' clothes or signing style.
DEVISIGN is a large-scale word-level Chinese sign language dataset, consisting of 2,000 signs and 24,000 samples recorded by 8 people in a controlled laboratory environment. Although the number of labels is large (2,000 labels), each label has only about 12 samples. As for American Sign Language, the ASLLVD dataset is well researched. It contains 2,742 signs with distinct meanings, each performed by 1 to 6 people. However, most labels have only 3 samples, which may affect model training. All videos in this dataset are shot against a uniform background for easy hand and face segmentation. Overall, the datasets above represent different attempts to tackle word-level sign language recognition, but each brings difficulties in use, whether an insufficient number of samples or conditions far from practice.

Table 1. Overview of word-level sign language datasets
Dataset    Signs   Videos   Type        Signers   Sign Language
LSA64      64      3,200    RGB video   10        Argentinian
INCLUDE    263     -        RGB video   -         Indian
DEVISIGN   2,000   24,000   RGB video   8         Chinese
ASLLVD     2,742   -        RGB video   1-6       American

Sign language recognition methods
SLR has made significant progress in recent years. With the emergence of deep learning architectures and advances in computing capability, it has become possible to design and deploy deep learning models using multimodal data. First, it is essential to choose an appropriate input, which could be video data, skeletal data, or a combination of both; this input serves as the foundation for the entire recognition process. Second, the system must extract spatial and temporal features from the input, analyzing the visual and motion characteristics of the gestures and capturing both shape and movement over time.
Finally, the system makes predictions based on the extracted features. All of these steps have been researched from many directions to increase model accuracy. Depending on the approach, the model can learn different features and combine them. Inputs for the model can be static or dynamic: RGB, depth, skeleton, or flow information.
Feature extraction
Among earlier techniques for extracting sign language features, hand-crafted features were widely used, for example features based on the Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT). HOG is a popular method for extracting image features for recognition and classification: it divides the image into cells, computes the gradient within each cell, and builds an orientation histogram vector per cell; the histogram vectors of all cells are then concatenated to form a characteristic representation of the image. SIFT is an important method in image processing and computer vision for extracting and describing features in a scale-independent way. SIFT identifies keypoints in an image and then describes them based on the gradient directions of the pixels around each keypoint, providing a robust, scale-invariant feature representation suitable for recognition, image matching, and localization. Besides these manual methods, newer methods extract a variety of features to increase model performance. Depth-based methods use depth images to capture the three-dimensional structure of sign language gestures, enhancing recognition accuracy. These techniques leverage depth variations in hand movements and shapes, as well as depth data from sensors like Kinect, to train neural networks and convolutional neural networks (CNNs) for improved performance in real-time recognition tasks.
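To make the HOG pipeline described above concrete, here is a minimal sketch of per-cell orientation histograms. The 8x8 cell size and 9 orientation bins are conventional illustrative choices, not parameters taken from the cited works, and real HOG implementations add block normalization that is omitted here.

```python
import numpy as np

def hog_features(image, cell=8, bins=9):
    """Minimal HOG sketch: gradient-weighted orientation histograms
    computed per cell and concatenated into one feature vector."""
    gy, gx = np.gradient(image.astype(float))      # finite-difference gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation, [0, 180)
    h, w = image.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            # Magnitude-weighted histogram of orientations in this cell.
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist)
    return np.concatenate(feats)
```

A 16x16 input with 8x8 cells yields 4 cells, hence a 36-dimensional descriptor.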
Skeleton-based methods extract and analyze the skeletal structure of the hands and arms to recognize sign language gestures. By capturing the signer's pose and movement, these techniques apply models such as Transformers and Long Short-Term Memory (LSTM) networks to achieve high recognition accuracy. Combining skeletal-data extraction tools such as MediaPipe with machine learning models further enhances the ability to recognize sign language gestures accurately.
Sign language recognition method
Earlier, traditional machine learning approaches were used for sign language recognition. One model worth mentioning is the Hidden Markov Model (HMM), a stochastic state machine for analyzing time-varying data with spatial and temporal variability. It achieved certain results in sign language recognition; however, it lacks generalizability. With the development of deep learning, networks such as the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Transformer have gradually attracted great attention and are applied more and more to SLR. While CNNs are capable of extracting information from images, models such as the RNN, LSTM, and Transformer are stronger on sequential information: CNNs can be used as feature extractors over video frames or as classifiers, while LSTMs and Transformers are capable of learning good sequential information from videos. Sundar et al. proposed a recognition model using LSTM and MediaPipe to recognize the 26 English letters with an F1-score reaching 99%; the proposed model has the potential to be highly effective in human-computer interaction (HCI). Bohacek et al. proposed the SPOTER model based on the Transformer architecture; the team recognized the Transformer's potential for relatively low computational cost and outstanding performance in sequence processing tasks.
This makes them an excellent choice for a lightweight computing solution capable of running on modern mobile devices. The model has strong body-pose normalization and augmentations compared to previous models, significantly improving accuracy. SPOTER was tested and compared on two datasets, LSA64 and WLASL. Recent works also use 3D convolutional networks to learn both spatial and temporal features.
DATASET
The LSA64 dataset is an extensive and meticulously curated collection aimed at advancing Argentinian Sign Language recognition. The dataset offers a diverse range of contextual arrangements and includes signers in varied clothing to increase the complexity and applicability of the data. Notably, each signer wears fluorescent-colored gloves, a deliberate choice to simplify segmenting the hands from the background in images. While this aids the technical processing of the data, it might not fully represent real-world conditions, where such gloves are not worn. The primary objective behind the LSA64 dataset is twofold: to develop a comprehensive dictionary for Argentinian Sign Language (LSA) and to train robust automatic sign recognizers. The dataset contains 3,200 video recordings of 10 non-expert subjects performing 5 repetitions of 64 distinct signs, so each class label includes 50 videos. The 64 classes represent signs commonly used in everyday communication, detailed in Table 2. The signs were selected based on their prevalence in everyday communication, ensuring the dataset's practical applicability. Table 2.
Labels of the LSA64 dataset (sign name and performing hand)

Opaque (Right)       Call (Right)         Hungry (Right)        Yogurt (Both)
Red (Right)          Skimmer (Right)      Map (Both)            Accept (Both)
Green (Right)        Bitter (Right)       Coin (Both)           Thanks (Both)
Yellow (Right)       Sweet milk (Right)   Music (Both)          Shut down (Right)
Bright (Right)       Milk (Right)         Ship (Right)          Appear (Both)
Light-blue (Right)   Water (Right)        None (Right)          To land (Both)
Colors (Right)       Food (Right)         Name (Right)          Catch (Both)
Pink (Right)         Argentina (Right)    Patience (Right)      Help (Both)
Women (Right)        Uruguay (Right)      Perfume (Right)       Dance (Both)
Enemy (Right)        Country (Right)      Deaf (Right)          Bathe (Both)
Son (Right)          Last name (Right)    Trap (Both)           Buy (Right)
Man (Right)          Where (Right)        Rice (Both)           Copy (Both)
Away (Right)         Mock (Both)          Barbecue (Both)       Run (Both)
Drawer (Right)       Birthday (Right)     Candy (Right)         Realize (Right)
Born (Right)         Breakfast (Both)     Chewing-gum (Right)   Give (Both)
Learn (Right)        Photo (Both)         Spaghetti (Both)      Find (Right)

Each video captures a signer performing a sign in a controlled environment, ensuring high-quality and consistent visual data. This controlled setting accounts for factors such as lighting conditions and signer variability, which are crucial for developing and evaluating machine learning models for sign language recognition. The inclusion of multiple repetitions by different signers introduces variability that is essential for training models capable of generalizing across individuals and conditions. The LSA64 dataset stands as a vital resource for researchers and developers in computer vision and natural language processing, particularly those focusing on sign language recognition. Its comprehensive and detailed nature makes it suitable for various applications, including the development of real-time sign language recognition systems, enhancing communication accessibility for the deaf and hard-of-hearing community, and contributing to more inclusive human-computer interaction technologies.
By offering a well-rounded set of sign language videos, the LSA64 dataset not only supports the development of advanced machine learning algorithms but also paves the way for significant improvements in the accessibility and usability of sign language technologies in everyday life. Figure 2. Snapshot of six different signs from the LSA64 dataset.
METHOD
In this section, we build a model to recognize sign language for deaf people along two main directions: skeleton-based sign language recognition and spatial-feature-based sign language recognition. For skeleton-based recognition, we extract the skeleton using MediaPipe and then extract skeleton features; to evaluate this approach, we use two models, an MLP Neural Network and an LSTM. For spatial-feature-based recognition, we extract frames from the videos, convert them into spatial features, and evaluate with the ResNet model.
Skeleton Extraction
We use the MediaPipe Holistic library from MediaPipe to extract body-posture data from videos. MediaPipe is built to perform well on many devices, including resource-limited devices such as mobile phones, providing real-time performance with high processing speed and low resource consumption. We use the MediaPipe Hands model to extract points on the hand; the keypoints extracted by MediaPipe are shown in Figure 3. We select keypoints 0 through 20. Figure 3. Hand landmarks extracted from MediaPipe. Each keypoint is represented by a pair of coordinates (x, y) corresponding to its location in the image. All keypoints are organized into a vector of size 42, where each (x, y) pair contributes to the representation of a particular pose in a frame.
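As a sketch of this step, the 21 hand keypoints can be flattened into the 42-dimensional per-frame vector, with zeros standing in for keypoints MediaPipe could not detect. The dict-of-landmarks input is an illustrative stand-in for MediaPipe's landmark objects, not its actual API.

```python
NUM_KEYPOINTS = 21  # MediaPipe hand landmarks, indices 0-20

def frame_vector(landmarks):
    """Flatten detected hand landmarks into a 42-dimensional vector
    of (x, y) pairs. `landmarks` is a dict {index: (x, y)} or None;
    undetected keypoints stay at 0.0."""
    vec = [0.0] * (2 * NUM_KEYPOINTS)
    if landmarks:
        for idx, (x, y) in landmarks.items():
            vec[2 * idx] = x
            vec[2 * idx + 1] = y
    return vec
```

An empty frame thus yields an all-zero vector, while a frame with keypoints 0 and 20 fills positions 0-1 and 40-41.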
If there is no performer in the frame, or MediaPipe cannot identify some keypoints, we fill the unknown coordinates with a value of 0. To minimize overfitting and enhance the learning ability of the model, we apply a random-rotation technique when processing the training data: coordinates are randomly rotated about the image center by an angle θ drawn from 0° to 13°, according to Eq. (1):

(x', y') = ((x - 0.5)·cosθ - (y - 0.5)·sinθ + 0.5, (x - 0.5)·sinθ + (y - 0.5)·cosθ + 0.5)   (1)

In this way, we improve the uniformity and flexibility of the training data, helping the model learn from many different angles and poses and improving its ability to generalize in practice.
MLP Neural Network
The model consists of a sequence of neural network layers, starting with a Dense layer of 256 units with a ReLU activation, which maps the input into a feature space of decreasing dimension. We then add a BatchNormalization layer to normalize the output of the previous layer, which improves learning speed and model stability. Next comes another Dense layer of 128 units with ReLU activation, which further transforms the feature space and helps the model learn more complex features, followed by another BatchNormalization layer to increase the stability of the model. We continue with a Dropout layer with a dropout rate of 0.5, which helps prevent overfitting by randomly dropping half the units during training. Then we add two more Dense layers of 64 units with ReLU activation; both continue to transform the feature space in preparation for the final layer.
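Stepping back to the augmentation step described at the start of this subsection, the rotation of normalized keypoints about the image center (0.5, 0.5) by an angle between 0° and 13° can be sketched as follows; the function names are ours.

```python
import math
import random

def rotate_keypoint(x, y, theta_deg):
    """Rotate a normalized (x, y) keypoint about the image center (0.5, 0.5)."""
    t = math.radians(theta_deg)
    xr = (x - 0.5) * math.cos(t) - (y - 0.5) * math.sin(t) + 0.5
    yr = (x - 0.5) * math.sin(t) + (y - 0.5) * math.cos(t) + 0.5
    return xr, yr

def augment(frame):
    """Apply one random rotation (0 to 13 degrees) to a list of (x, y) pairs,
    using the same angle for every keypoint in the frame."""
    theta = random.uniform(0.0, 13.0)
    return [rotate_keypoint(x, y, theta) for x, y in frame]
```

A rotation of 0° leaves every keypoint unchanged, and the center point (0.5, 0.5) is invariant under any angle.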
Finally, we add a Dense output layer of 64 units with a softmax activation, which converts the output into a probability distribution over the 64 output classes. Figure 4 shows the MLP model architecture. Figure 4. MLP with skeleton features.
LSTM model
We start with an LSTM layer of 512 units whose input is a variable-length sequence in which each sample has 18 features, followed by a BatchNormalization layer to normalize and optimize the previous layer; this improves the convergence speed and stability of the model during training. Next, we add another LSTM layer of 256 units, then a Dropout layer with a dropout rate of 0.5 to reduce overfitting. We continue with another LSTM layer of 128 units and another Dropout layer. Finally, we add a last LSTM layer of 64 units followed by a BatchNormalization layer, and then a Dense (fully connected) output layer of 64 units with a softmax activation, which converts the output into a probability distribution over the 64 classes. Figure 5 shows the LSTM model architecture. Figure 5. LSTM with skeleton features.
ResNet50 Model
Deep neural networks often have a large number of parameters, so the dataset needs to be large enough to train those parameters without overfitting; on small datasets, the parameters will overfit the data. Pre-training can act as a kind of regularization, reducing variance and avoiding overfitting on small datasets. The ResNet network includes two main blocks: the "Basic Block" and the "Bottleneck Block." The Basic Block uses 3x3 convolutional layers, while the Bottleneck Block uses smaller convolutional layers to reduce network complexity.
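The residual idea underlying both block types can be shown schematically. This is a numpy stand-in, not ResNet itself: `f` represents the block's stacked convolution/BatchNorm layers, and input and output shapes are assumed to match.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, f):
    """y = relu(F(x) + x): the skip connection adds the block input
    back onto the transformed output, so information (and gradients)
    can flow directly across the block."""
    return relu(f(x) + x)
```

With F as the zero map the block reduces to relu(x), which is what lets a very deep stack of such blocks fall back to (near-)identity behavior instead of degrading.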
Each block in ResNet consists of multiple convolutional layers stacked together, combined with Batch Normalization layers and ReLU activation functions. Skip connections are added to each block to form residual connections, allowing information to be transmitted directly across the network; this minimizes information loss and increases the model's robustness. The input image is first fed through the initial convolutional layers, then normalized with Batch Normalization and activated with ReLU. The blocks of the ResNet network are then executed sequentially, each containing a series of convolutional layers along with Batch Normalization and ReLU layers. During the forward pass, the residual connections let information pass easily through the blocks, enabling the network to learn more complex data representations. Finally, the output is passed into a fully connected layer to classify the image into the predefined classes. Figure 6 illustrates the ResNet model architecture. Figure 6. ResNet50 with spatial features.
RESULTS AND DISCUSSION
We conduct our experiments in the Google Colab Pro environment, using a T4 GPU with 15 GB of VRAM, 12.7 GB of RAM, and 201.2 GB of disk space. The experiments cover three models: ResNet50, MLP, and LSTM. To evaluate their performance comprehensively, we employ several metrics: accuracy, precision, recall, and F1-score. These metrics provide a well-rounded assessment of the models' effectiveness in recognizing and classifying Argentinian Sign Language gestures. Accuracy measures the overall correctness of the model, precision indicates the proportion
of true positive predictions among all positive predictions, recall assesses the model's ability to identify all relevant instances, and the F1-score balances precision and recall, offering a single metric that reflects both. This thorough evaluation ensures a robust analysis of each model's performance in our experiments.

Accuracy = (TP + TN) / (TP + FN + TN + FP)
Precision (P) = TP / (TP + FP)
Recall (R) = TP / (TP + FN)
F1-score = 2 x (P x R) / (P + R)

TP (True Positive): the count of correctly predicted positive cases. TN (True Negative): the count of correctly predicted negative cases. FP (False Positive): the count of instances labeled negative that are incorrectly predicted as positive. FN (False Negative): the count of instances labeled positive that are incorrectly predicted as negative.

We train ResNet50 from the Keras library with the epochs, batch size, and learning rate shown in Table 3; the MLP and LSTM use the epochs, batch size, and learning rate shown in Tables 4 and 5.

Table 3. Training parameters for ResNet (epochs, batch size, learning rate)
Table 4. Training parameters for MLP (epochs, batch size, learning rate; Adam optimizer)
Table 5. Training parameters for LSTM (epochs, batch size, learning rate; Adam optimizer)

The video dataset, approximately 3.7 gigabytes (GB) in size and containing 3,200 videos, is divided into a training set and a test set for both the skeleton-based and the spatial-feature-based models; more information is available in Table 6.

Table 6. Rates and video counts of the train, validation, and test sets

The results show that the LSTM model outperformed the others, achieving an accuracy of 98.60% and demonstrating its superior ability to capture the temporal dynamics inherent in sign language. The MLP model also performed strongly, with an accuracy of 96.10%. In contrast, the ResNet50 model achieved an accuracy of 82.13%, suggesting that while ResNet50 is effective at handling spatial features, it is less suited to the sequential nature of sign language than models specifically designed for temporal data, such as the LSTM. Detailed performance metrics are presented in Table 7, which compares the accuracy of the MLP, LSTM, and ResNet50 models on the LSA64 dataset.

Our experimental results show that the model achieves very high accuracy across most labels, with many reaching a perfect 100%: Red, Light_blue, Colors, Women, Away, Drawer, Learn, Milk, Food, Uruguay, Country, Where, Mock, Birthday, Photo, Music, Ship, Perfume, Barbecue, Candy, Yogurt, Bathe, Buy, Copy, Realize, Give, and Help all achieved perfect accuracy. This is an encouraging result, indicating that the model classifies these labels without error. Other labels, such as Green, Yellow, Bright, Enemy, Born, Call, Bitter, Sweet_milk, Water, Last_name, Map, Coin, Trap, Rice, Chewing_gum, Spaghetti, Accept, Shut_down, Appear, To_land, Find, and Run, also achieved very high accuracy, from 97% to 99%, indicating high reliability. However, some labels remain below 95%, such as Opaque (93%), Skimmer (92%), Hungry (93%), and Shut_down (93%), as shown in Table 9. These labels might need further work to reach higher accuracy; the reasons could be insufficient diversity in the training data or an inadequate number of samples for these labels, and focusing on them during additional training could improve the overall accuracy of the model.

Table 8 compares different models on the LSA64 dataset. The SPOTER model achieved the highest accuracy at 100%, followed by the HMM model at 97.44% and the 3DGCN model. Although our LSTM model did not reach the accuracy of SPOTER, it still outperformed many other methods, highlighting its strong potential.

Table 7. Accuracy of the models (accuracy, precision, recall, F1-score)
Network         Accuracy (%)
Skeleton MLP    96.10
Skeleton LSTM   98.60
RGB ResNet      82.13

Table 8. Comparison with other models on the LSA64 dataset
Method                 Accuracy (%)
SPOTER                 100
HMM                    97.44
3DGCN                  -
Skeleton MLP (ours)    96.10
Skeleton LSTM (ours)   98.60
RGB ResNet (ours)      82.13

Table 9. Per-label results of the LSTM model (accuracy for each of the 64 signs listed in Table 2; see the discussion above)

Our findings also reveal that the skeleton-feature models not only train faster, the MLP in 30 minutes 17 seconds and the LSTM in 45 minutes 18 seconds, but also achieve better accuracy. The LSTM model, reaching an accuracy of up to 98.60%, demonstrates superiority in capturing the sequential and dynamic aspects of sign language. In contrast, the spatial-feature model (ResNet50) required a longer training time of 2 hours 15 minutes 10 seconds and yielded lower accuracy (82.13%). In summary, our model shows very high and stable performance across most labels, with many achieving perfect accuracy; however, there is still room to improve the accuracy of the lower-performing labels.
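As a quick sanity check on the evaluation metrics used above, they can be computed directly from the confusion counts; the example counts below are illustrative, not taken from our experiments.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For instance, with TP=8, TN=80, FP=2, FN=10, accuracy is 88/100 while F1 is only 4/7, showing how F1 penalizes the many missed positives that accuracy hides.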
These results demonstrate the potential of our model in accurately classifying a diverse and rich dataset, emphasizing the importance of choosing appropriate features for different types of data.
CONCLUSION
This study investigates the enhancement of sign language recognition (SLR) systems for the deaf community through the application of deep learning models. We evaluated three models, ResNet50, MLP, and LSTM, on the LSA64 dataset. The LSTM model outperformed the others, achieving an accuracy of 98.60% and showcasing its superior capability in capturing the temporal dynamics inherent in sign language. The MLP and ResNet50 models also exhibited robust performance, with accuracies of 96.10% and 82.13%, respectively. Our findings indicate that deep learning models, particularly those leveraging skeletal features, hold significant potential for advancing SLR systems. Future research should prioritize expanding the datasets and optimizing these models for real-world applications, ultimately aiming to improve communication accessibility for the deaf community.
REFERENCES