Matrik: Jurnal Manajemen, Teknik Informatika, dan Rekayasa Komputer
Vol. No. March 2026, pp. 287-298
ISSN: 2476-9843, accredited by Kemenristekdikti, Decree No: 10/C/C3/DT. 00/2025
DOI: 10.30812/matrik.
Development of an Attention-Based Convolutional Neural Network Long Short-Term Memory Model for Real-Time Ergonomic Analysis of Sitting Posture
Gusrio Tendra1, Sumijan2, Deny Jollyta1
1 Institut Bisnis dan Teknologi Pelita Indonesia, Pekanbaru, Indonesia
2 Universitas Putra Indonesia "YPTK", Padang, Indonesia
ABSTRACT
The digital era has increased the prevalence of musculoskeletal disorders caused by poor sitting posture, posing a significant global health and productivity challenge. This study introduces an attention-based deep learning model as the analytical engine for a proposed virtual ergonomics monitor, ErgoGuard. The primary objective is to develop a model that accurately performs real-time Movement Quality Assessment of Sitting Posture for computer users, using only a standard webcam to ensure wide accessibility. The research method is a hybrid architecture that combines a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network, enhanced with an attention mechanism and optimized for three-dimensional skeletal data obtained with the BlazePose computer vision approach. This framework merges a One-Dimensional CNN to extract spatial features from static poses with a Bidirectional LSTM network to model temporal postural shifts. An integrated attention mechanism enables the model to dynamically focus on critical ergonomic areas, mimicking an expert's assessment. For validation, a new OfficePosture dataset was created, containing 500 videos of five common office sitting postures. The results indicate that the proposed model achieves 94.2% classification accuracy, substantially outperforming the pure CNN and standard LSTM baselines. Beyond accuracy, the model offers interpretable feedback through visual attention maps. In conclusion, the proposed architecture is an effective solution for monitoring sitting posture and holds considerable promise as an affordable preventive health tool for corporate and educational settings.

Article Info
Article history: Received September 06, 2025; Revised December 16, 2025; Accepted January 27, 2026
Keywords: BlazePose; CNN-LSTM; Ergonomics; Real-Time; Sitting Posture.
Copyright ©2026 The Authors. This is an open access article under the CC BY-SA license.
Corresponding Author:
Gusrio Tendra, 62812-6777-2627
Department of Information Systems, Faculty of Computer Science, Institute of Business and Technology Pelita Indonesia, Pekanbaru, Riau.
Email: gusrio.tendra@lecturer.
How to Cite:
Tendra, Jollyta, and Sumijan, "Development of an Attention-Based Convolutional Neural Network-Long Short-Term Memory Model for Real-Time Ergonomic Analysis of Sitting Posture", MATRIK: Jurnal Manajemen, Teknik Informatika, dan Rekayasa Komputer, Vol. No. 2, pp. March, 2026.
This is an open access article under the CC BY-SA license: https://creativecommons.org/licenses/by-sa/4.
Journal homepage: https://journal. id/index.php/matrik
INTRODUCTION
The digital transformation has fundamentally altered how individuals work, learn, and interact, positioning computers at the core of modern life.
This advancement, however, is accompanied by substantial health repercussions, most notably the rising prevalence of sedentary lifestyles.
Millions of people spend more than eight hours per day in a seated position, often neglecting ergonomic principles. This behavior stands as a primary risk factor for a range of Musculoskeletal Disorders (MSDs).
The research problem is rooted in the high incidence of MSDs among office workers, a fact substantiated by numerous studies.
The Global Burden of Disease (GBD) report consistently ranks lower back pain as the leading cause of disability worldwide.
A majority of these cases are not severe pathological conditions, but are the consequence of cumulative mechanical stress from prolonged poor posture.
Systematic reviews and meta-analyses confirm an exceedingly high prevalence of MSDs among computer users and office employees, with rates varying from 55% to over 90% across different populations.
Pain in the neck, shoulders, and lower back is the most frequently reported complaint.
This problem warrants investigation due to its detrimental dual impact.
From an individual health standpoint, MSDs lead to chronic pain, diminished quality of life, and disability.
From an organizational and economic perspective, the consequences are equally severe.
MSDs are a leading cause of absenteeism, reduced productivity, and escalating healthcare expenditures.
The phenomenon of "presenteeism," where employees are physically present but functionally impaired by pain or discomfort, further compounds productivity losses.
Therefore, developing effective interventions to prevent poor sitting posture is not only a public health imperative but also a sound economic strategy to enhance workplace efficiency and well-being.
In response to this challenge, the field of ergonomic assessment has evolved from subjective and time-intensive manual observation methods (e.g., RULA, REBA) toward automated approaches that leverage advancements in Computer Vision and Machine Learning.
These automated techniques offer the benefits of objectivity, scalability, and the capacity for continuous monitoring without disrupting workflows.
Within this domain, various deep learning architectures have been explored for human posture analysis.
Convolutional Neural Network (CNN) models have proven effective at extracting spatial features from static images or individual video frames, forming the basis for many posture estimation systems.
However, sitting posture is not merely a series of isolated events; it possesses a critical temporal dimension.
Postures sustained for extended durations or slow, non-ergonomic transitions are primary sources of strain.
To capture these temporal dynamics, recurrent network models like Long Short-Term Memory (LSTM) have been widely adopted, particularly in the field of Human Activity Recognition (HAR).
The combination of these two, the hybrid CNN-LSTM architecture, has emerged as a particularly potent approach, capable of capturing spatial and temporal features simultaneously.
The CNN extracts the "what" from each posture, while the LSTM models "how" that posture evolves over time.
More recent research has investigated the use of Graph Convolutional Networks (GCNs), which model the human skeleton as a graph.
This approach explicitly captures the structural relationships between joints and has demonstrated state-of-the-art performance in recognizing complex, dynamic actions.
Furthermore, attention mechanisms have been integrated into architectures such as CNN-LSTM to enable the model to dynamically focus on the most relevant spatial features or time steps, thereby boosting accuracy and improving interpretability.
Table 1 summarizes several relevant prior studies.
Table 1. Summary of Related Research in Automated Posture Assessment

| Reference | Methodology/Model | Key Contribution | Limitations/Context |
|---|---|---|---|
| Yang et al. | Systematic Review | Provides a comprehensive overview of integrating computer vision and machine learning for the assessment of ergonomic posture. | General review, does not propose a specific model. |
| Lin et al. | 3D CNN + LSTM | Selects an ergonomic assessment tool (RULA/REBA/OWAS) based on the detected posture. | Focuses on diverse industrial tasks rather than a static sitting posture. |
| Nguyen-Trong et al. | Graph Convolutional Network (GCN) | Predicts occupational diseases using multidimensional data, including posture. | Focuses on long-term disease prediction rather than real-time feedback. |
| Martins et al. | Inertial Data + Deep Learning | A holistic posture assessment framework using data from inertial sensors. | Requires wearable sensor hardware. |
| Bagga and Yang | MediaPipe + LSTM | Real-time monitoring and risk assessment for manual lifting tasks. | Designed for dynamic lifting tasks, not for sitting posture. |
| Zhao et al. | Pressure Sensors | A comparative study of sitting posture monitoring systems using pressure sensors. | Requires specially modified chairs or mats. |
| Vinaya et al. | CNN + LSTM | The study's CNN model outperformed a related work that used a CNN on the UCI dataset. | Focuses on Human Activity Recognition (HAR) using data collected from sensors such as accelerometers and gyroscopes embedded in wearable devices, including smartphones and smartwatches. |
| Zhu et al. | Deep Learning + Multimodal Data | Combines data from cameras and pressure sensors for sitting posture recognition. | Reliance on multimodal hardware limits accessibility. |
| Zhao et al. | LSTM-1DCNN | A parallel architecture that extracts spatial and temporal features simultaneously before concatenating them for fusion in a fully connected neural network. | Focuses on an algorithm that uses a single triaxial accelerometer to enhance user comfort and reduce deployment costs. |

Based on the literature review, a gap exists for a system that can accurately assess sitting posture quality, operate in real time on standard hardware, be non-invasive, and deliver interpretable feedback.
This research differs from previous work in that, while advanced models like GCNs are highly effective for dynamic and complex action recognition, they may be computationally excessive and suboptimal for the unique challenge of monitoring quasi-static sitting postures.
Quasi-static postures are characterized by long periods of inactivity interspersed with slow, subtle changes, a problem domain distinct from traditional HAR.
Conversely, approaches that rely solely on CNNs often neglect this crucial temporal dimension, a primary cause of cumulative strain injuries.
The novelty of this research is the development and application of a lightweight, attention-based CNN-LSTM architecture specifically optimized for the ergonomic analysis of sitting posture.
This approach uniquely combines the spatial feature-extraction efficiency of a 1D CNN (operating on skeletal keypoint data rather than raw images) with the temporal modeling capabilities of a Bi-LSTM.
The addition of an attention mechanism enables the model to focus its resources on the most informative frames or joints, thereby achieving high accuracy while providing interpretability essential for effective user feedback.
This architecture is engineered to strike an optimal balance between accuracy, computational efficiency, and interpretability for the specific problem domain of sitting posture monitoring.
The purpose of this study is to develop and validate a deep learning model that can accurately and efficiently classify common office sitting postures in real time using only a standard webcam, thereby providing the foundation for an accessible and interpretable ergonomic monitoring system.
While hybrid CNN-LSTM architectures are widely used in general Human Activity Recognition (HAR) for dynamic actions, a critical gap exists in their application to quasi-static ergonomic monitoring.
Unlike dynamic activities defined by large limb movements, sitting postures are characterized by long periods of inactivity interspersed with slow, subtle changes. Standard HAR models often fail in this domain because they prioritize high-frequency motion features over fine-grained skeletal alignment.
This research addresses this gap by creating the OfficePosture dataset and by designing a model specifically tailored to these quasi-static constraints.
The novelty lies in the architecture's efficiency: by using a 1D-CNN on 33 skeletal keypoints rather than processing computationally expensive raw images, the model is optimized for real-time performance on standard hardware, without the privacy concerns associated with storing RGB data.
The objective of this research is to develop and validate a lightweight, attention-based deep learning model that accurately classifies quasi-static office sitting postures in real time on standard hardware.
The contribution of this research to the development of science is the formulation of a specialized 1D-CNN-LSTM architecture that successfully adapts high-complexity activity recognition techniques for low-latency ergonomic monitoring.
This provides a significant benefit by enabling accessible, non-invasive, and continuous preventive health monitoring without the need for expensive wearable sensors or privacy-intrusive raw video storage.
RESEARCH METHOD
To ensure the reliability and validity of our findings, a systematic research methodology was employed.
This comprehensive approach encompassed several key phases: dataset design and acquisition, meticulous data pre-processing, innovative model architecture design, and a series of structured testing scenarios.
Each of these stages was executed sequentially to build on the previous one, and the entire workflow is visually outlined in Figure 1 for clarity.
Figure 1. Schematic of research procedures

Figure 1 presents a schematic workflow of the research conducted.
The process begins with Data Acquisition, where videos of participants simulating work tasks are recorded using a standard webcam.
The next phase is Preprocessing, where video frames are analyzed with BlazePose to extract 33 3D skeletal keypoints from each frame.
This keypoint data is then normalized and segmented into fixed-length sequences.
The third stage is Model Training, where the processed keypoint sequences are used to train the proposed attention-based CNN-LSTM architecture.
Finally, in the Evaluation phase, the trained model's performance is assessed quantitatively (using metrics like accuracy, precision, and recall) and qualitatively (through visualization of attention maps), and compared against baseline models.
Dataset Design and Acquisition
A significant challenge in the study of ergonomic behavior is the lack of a publicly available dataset specifically designed to capture the quasi-static sitting postures common in office environments.
To address this critical gap in available resources, we constructed a new, comprehensive dataset named OfficePosture.
The development of such custom datasets is a standard and often necessary practice in this type of research, enabling focused analysis that general-purpose datasets cannot support.
Experimental Setup and Hardware Specification
To ensure reproducibility, the model training process was conducted in the Kaggle Notebook cloud environment, using the platform's standard GPU acceleration (NVIDIA Tesla P100/T.) to train for 100 epochs.
However, the architecture was specifically designed for deployment on resource-constrained devices.
Unlike deep 2D-CNNs, which require substantial GPU power for inference, the proposed 1D-CNN operates on low-dimensional vector data and therefore runs efficiently on standard consumer CPUs during the real-time application phase.
We recruited 30 participants (12 of them female), aged 20-45, for data collection.
Each participant was instructed to simulate computer work for each predefined posture category.
Recordings were captured using a standard webcam (30 fps) in a controlled office setting with adequate lighting.
The camera was positioned approximately 1.5 meters from the participant to consistently capture a frontal view of the upper body.
The resulting footage was segmented into 500 video clips, each 1-2 minutes in length. The labeling was performed by two trained annotators, achieving high inter-annotator agreement as measured by Cohen's kappa.
The posture categories, validated by an ergonomics expert, are presented in Table 2 and visually illustrated in Figure 2.
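Inter-annotator agreement of this kind is usually summarized with Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The following is a minimal sketch of that calculation, not the authors' labeling tooling; the function name `cohens_kappa` and the toy label lists are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same clips."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of clips given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels for six clips (not taken from the OfficePosture dataset).
annotator_1 = ["ideal", "slouch", "slouch", "fhp", "ideal", "slump"]
annotator_2 = ["ideal", "slouch", "slump", "fhp", "ideal", "slump"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance.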
Table 2. Posture Categories in the OfficePosture Dataset

| Posture Category | Description and Characteristics |
|---|---|
| Ideal Posture | Straight back, relaxed shoulders, feet flat on the ground. |
| Slouching | The spine curves to form a C shape. |
| Forward Head | The neck protrudes forward from the shoulder line. |
| Upper Back Slump | Only the upper back is bent. |
| Crossing the Legs | One leg is crossed over the other, hips are not aligned. |

Figure 2. Illustration of posture categories in the OfficePosture dataset with BlazePose keypoints

Data Preprocessing and Feature Extraction
This stage follows the workflow depicted in Figure 3.
The initial step is data processing, in which each video frame is analyzed using MediaPipe's BlazePose to extract 33 3D keypoints.
These keypoints are then normalized relative to the hip center to ensure the model is invariant to the subject's position or scale within the frame.
Following normalization, the continuous frame data is segmented into overlapping sequences of 60 frames each to serve as input to the model.
A sequence length of 60 frames (equivalent to 2 seconds at 30 fps) was chosen to capture sufficient temporal context without introducing excessive latency for a real-time application.
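The two preprocessing steps described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: the hip center is taken as the midpoint of BlazePose landmarks 23 and 24, scale is removed by dividing by the hip-to-shoulder distance (the paper does not state its scale reference), and the stride of 30 frames is an assumed overlap. The helper names are hypothetical.

```python
import math

LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12  # BlazePose landmark indices
LEFT_HIP, RIGHT_HIP = 23, 24


def midpoint(p, q):
    """Midpoint of two (x, y, z) points."""
    return tuple((a + b) / 2.0 for a, b in zip(p, q))


def normalize_frame(keypoints):
    """Centre one frame of 33 (x, y, z) keypoints on the hip centre.

    Dividing by the hip-to-shoulder distance is an assumed way to obtain
    scale invariance; the source does not specify the scale reference.
    """
    hip = midpoint(keypoints[LEFT_HIP], keypoints[RIGHT_HIP])
    shoulder = midpoint(keypoints[LEFT_SHOULDER], keypoints[RIGHT_SHOULDER])
    torso = math.dist(hip, shoulder) or 1.0  # guard against zero length
    return [tuple((c - h) / torso for c, h in zip(p, hip)) for p in keypoints]


def sliding_windows(frames, length=60, stride=30):
    """Split a list of frames into overlapping windows of `length` frames."""
    return [frames[s:s + length]
            for s in range(0, len(frames) - length + 1, stride)]


# Synthetic clip: 150 identical frames of 33 keypoints (5 s at 30 fps).
frame = [(0.5, 0.1 * i, 0.0) for i in range(33)]
clip = [normalize_frame(frame) for _ in range(150)]
windows = sliding_windows(clip)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

With a stride smaller than the window length, consecutive windows share frames, so a posture change near a window boundary still appears whole in at least one window.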
Model Design and Training
The processed data sequences serve as direct input to the core of our system, namely the developed model architecture.
The model comprises several functional layers, each progressively transforming the data to extract increasingly complex features.
A detailed schematic of this multi-layered structure, illustrating the role of each component, is provided in Figure 3.
1D CNN Layer: Acts as a spatial feature extractor. It learns to recognize patterns in keypoint coordinates within a single frame, thereby representing the postural state at a single point in time.
Bidirectional LSTM (Bi-LSTM) Layer: Receives the CNN's spatial features as a sequence. Its bidirectional nature allows the model to capture temporal dependencies or movement patterns by considering both past and future frames within a sequence.
Attention Mechanism: An additive attention mechanism is implemented after the Bi-LSTM to allow the model to assign "importance" weights to the most informative frames within a sequence. This enables the model to focus on crucial moments of postural change.
Classification Layer: Finally, a fully connected layer with a Softmax activation function is used to predict the class of the motion sequence.
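To make the attention step concrete, the sketch below implements a generic additive (Bahdanau-style) attention pooling over a sequence of hidden vectors in plain Python. It illustrates the general technique only; the parameter names `W`, `b`, and `v`, the layer sizes, and the random inputs standing in for Bi-LSTM outputs are all assumptions, since the paper does not report these details.

```python
import math
import random


def additive_attention(hidden_states, W, b, v):
    """Pool a sequence of hidden vectors into one context vector.

    score_t = v . tanh(W h_t + b); weights = softmax(scores);
    context = sum over t of weights[t] * h_t
    """
    scores = []
    for h in hidden_states:
        proj = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
                for row, bias in zip(W, b)]
        scores.append(sum(vi * pi for vi, pi in zip(v, proj)))
    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(wt * h[d] for wt, h in zip(weights, hidden_states))
               for d in range(dim)]
    return context, weights


# Random stand-ins for Bi-LSTM outputs: 60 frames, 8 features per frame.
rng = random.Random(0)
T, D, A = 60, 8, 4  # sequence length, hidden size, attention size (assumed)
H = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
W = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(A)]
b = [0.0] * A
v = [rng.uniform(-1, 1) for _ in range(A)]
context, weights = additive_attention(H, W, b, v)
print(len(context), round(sum(weights), 6))
```

The per-frame `weights` are what an attention visualization would plot over time, and the `context` vector is what a classification layer would consume.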
Figure 3. Diagram of the developed model architecture

Testing and Evaluation
The model was trained for 100 epochs using the Adam optimizer.
Its performance is evaluated on a held-out 20% of the data not used in training.
The metrics used are Accuracy, Precision, Recall, and F1-Score.
The model's performance is compared with two baseline architectures, a Pure CNN and a Conventional LSTM, to verify the efficacy of the suggested hybrid strategy.
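The four metrics can all be derived from a confusion matrix. The sketch below shows one way to do so; the macro-averaged F1 is an assumption (the paper does not state how per-class scores are averaged), and the 3-class matrix holds illustrative counts rather than the paper's results.

```python
def classification_report(confusion):
    """Metrics from a square confusion matrix (rows = true, cols = predicted)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    per_class = []
    for i in range(k):
        tp = confusion[i][i]
        fp = sum(confusion[r][i] for r in range(k)) - tp  # predicted i, wrong
        fn = sum(confusion[i]) - tp                       # true i, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    macro_f1 = sum(c["f1"] for c in per_class) / k
    return {"accuracy": correct / total, "macro_f1": macro_f1,
            "per_class": per_class}


# Toy 3-class matrix (illustrative counts, not the paper's results).
toy = [[18, 2, 0],
       [3, 15, 2],
       [0, 1, 19]]
report = classification_report(toy)
print(round(report["accuracy"], 3))
```

Reporting per-class values alongside the overall accuracy matters here because two visually similar classes can be confused with each other while the aggregate score stays high.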
RESULT AND ANALYSIS
This section presents a thorough evaluation of the model's performance on the test dataset.
The analysis is structured to follow the methodological flow: it begins with the model's training results to demonstrate strong generalization, proceeds to a quantitative performance comparison with baselines, and concludes with a qualitative analysis highlighting the model's interpretability.
Model Training and Generalization
The initial step in the analysis was to confirm that the model was well-trained and did not exhibit overfitting.
Figure 8 shows the model's learning curves during training.
The curves indicate that the accuracy and loss for both the training and validation sets converge. This is a strong indicator that the model can effectively generalize its learned knowledge to new, unseen data, a critical prerequisite for further performance evaluation.
Quantitative Performance Analysis
The developed model demonstrates clear advantages over the baseline models, achieving an overall accuracy of 94.2% and a balanced F1-score of 0.94.
These results are consistent with previous studies, which have demonstrated that hybrid CNN-LSTM architectures effectively capture spatio-temporal features.
However, our research extends these findings by demonstrating superior performance specifically in the quasi-static domain through the addition of the attention mechanism.
Furthermore, as shown in Table 3, our model's accuracy (94.2%) substantially exceeds that of the Pure CNN and the Standard LSTM, validating the efficacy of the proposed hybrid strategy.
Table 3. Comparison of Overall Model Performance (columns: Model, Accuracy (%), Precision, Recall, F1-Score; rows: Pure CNN, Standard LSTM, Attention-Based CNN-LSTM)

Computational Efficiency Analysis
To substantiate the claim of real-time performance, we analyzed the computational complexity of the proposed model compared to traditional image-based approaches.
Table 4 presents a comparison of input data dimensionality, the primary factor influencing inference speed.

Table 4. Architectural Efficiency Comparison

| Feature | Traditional Image-Based CNN | Proposed 1D-CNN Model |
|---|---|---|
| Input Type | Raw RGB Frames | Skeletal Keypoint Vectors |
| Input Dimension | High (224 x 224 x 3 pixels) | Low (33 points x 3 axes) |
| Processing Unit | Requires a GPU for Real-time | Efficient on Standard CPU |
| Throughput Target | Variable | Optimized for 30 FPS |

As shown in Table 4, the proposed system reduces computational load by processing only 99 data points (33 keypoints × 3 dimensions) per frame, compared with over 150,000 pixel values in a standard image frame.
This large reduction in input complexity keeps model inference latency low, allowing the system to maintain synchronization with the standard 30 fps webcam output without requiring high-end local hardware.
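The size gap can be made explicit with a short worked calculation, assuming a 224 × 224 RGB frame as the image-based reference input (the size consistent with the "over 150,000" count) and BlazePose's 33 three-axis keypoints:

$$
\begin{aligned}
N_{\text{skeleton}} &= 33 \times 3 = 99,\\
N_{\text{image}} &= 224 \times 224 \times 3 = 150{,}528,\\
\frac{N_{\text{image}}}{N_{\text{skeleton}}} &= \frac{150{,}528}{99} \approx 1.5 \times 10^{3}.
\end{aligned}
$$

Each frame therefore presents roughly 1,500 times fewer input values to the classifier. This ratio describes the classifier's input size only, not the cost of the whole pipeline, since keypoint extraction still has to process the image.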
Experimental Validation of Real-Time Performance. To empirically validate the real-time capability proposed in this study, the full ErgoGuard pipeline (BlazePose keypoint extraction followed by CNN-LSTM inference) was tested on a standard consumer laptop equipped with an Intel Core i5-1135G7 processor (2.40 GHz) and 8 GB of RAM, using only the CPU.
During a continuous operational test, the system achieved an average processing speed of 38 FPS (frames per second), which exceeds the standard webcam input rate of 30 FPS.
The total system latency per frame, measured from image capture to posture classification, averaged 26.3 milliseconds.
Specifically, BlazePose extraction took approximately 24 ms, whereas inference on the lightweight 1D-CNN-LSTM model required less than 3 ms.
Furthermore, the CPU utilization remained stable at approximately 45% during operation, confirming that the proposed solution is computationally efficient enough to run as a background process without disrupting other office tasks.
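The relationship between the reported latency and frame rate can be checked with a small helper. This is a sketch for illustration: the function name `realtime_budget` is hypothetical, and the 2.3 ms inference figure is an assumption chosen so that the two stages add up to the reported 26.3 ms average.

```python
def realtime_budget(stage_ms, source_fps=30.0):
    """Summarise a per-frame latency breakdown against a camera frame rate."""
    total_ms = sum(stage_ms.values())
    budget_ms = 1000.0 / source_fps      # time available per incoming frame
    return {
        "total_ms": total_ms,
        "max_fps": 1000.0 / total_ms,    # throughput if stages run serially
        "budget_ms": budget_ms,
        "headroom_ms": budget_ms - total_ms,
        "realtime": total_ms <= budget_ms,
    }


# Stage split assumed from the reported figures: about 24 ms for BlazePose
# and the remainder of the 26.3 ms average for model inference.
stages = {"blazepose": 24.0, "cnn_lstm": 2.3}
summary = realtime_budget(stages)
print(round(summary["max_fps"], 1), round(summary["headroom_ms"], 1))
```

Under these assumptions the pipeline stays inside the roughly 33.3 ms per-frame budget of a 30 fps camera, and keypoint extraction, not the classifier, accounts for most of the latency.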
To move beyond overall performance metrics, a more detailed analysis of the model's classification behavior was conducted.
This analysis used confusion matrices, which provide a granular breakdown of prediction results for each posture class, thereby revealing patterns of misclassification.
The respective confusion matrices for the proposed model and the baseline models are presented for comparison in Figures 4, 5, and 6.
Figure 4. Confusion matrix of the convolutional neural network model classification results

Upon examining the confusion matrix for the Pure CNN model in Figure 4, a significant pattern of misclassification becomes apparent. The model exhibits considerable confusion between the 'Slouching' and 'Upper Back Slump' classes, likely because these two postures share very similar spatial features that are difficult to distinguish from static image analysis alone.
This difficulty in differentiating between such fine-grained postural variations highlights a key limitation of the Pure CNN approach for this nuanced task.

Figure 5. Confusion matrix results of long short-term memory model classification

The confusion matrix for the Standard LSTM model, presented in Figure 5, indicates a noticeable improvement in classification accuracy. By analyzing the temporal sequence of poses, the LSTM model reduces confusion among similar classes.
However, some misclassifications persist.
This suggests that while temporal analysis is crucial, relying on it alone, without a robust spatial feature-extraction component, is insufficient to fully resolve ambiguity among nuanced postural states.
Figure 6. Confusion matrix results of the attention-based convolutional neural network and long short-term memory model

In stark contrast to the baseline models, the confusion matrix for the proposed attention-based CNN-LSTM model, shown in Figure 6, exhibits a clear, dominant diagonal.
This indicates a very low misclassification rate across all posture classes, demonstrating the synergistic effect of its hybrid architecture.
The model's success stems from its ability to simultaneously analyze spatial features with its convolutional layers and temporal patterns with its recurrent layers, while the attention mechanism further refines its focus on the most critical information.
In-Depth Analysis of Misclassification
Although the Attention-Based CNN-LSTM model achieved a superior overall accuracy of 94.2%, a granular analysis of the confusion matrix (Figure 6) highlights a specific pattern of misclassification between the 'Slouching' and 'Upper Back Slump' categories.
As shown in the per-class performance metrics, 'Upper Back Slump' achieved an F1-Score of 0.90, which is slightly lower than the 'Ideal Posture' score of 0.97.
This specific confusion can be attributed to the biomechanical similarities between these two postures as defined in our study.
According to Table 2, 'Slouching' is characterized by a continuous C-shaped curvature of the entire spine, whereas 'Upper Back Slump' involves a localized bend restricted to the thoracic region.
From a computer vision perspective, distinguishing these subtle variations using a single frontal-view camera is challenging.
The depth cues (z-axis) required to differentiate whether the curvature originates from the lumbar region (Slouching) or solely the upper back (Upper Back Slump) are often compressed in 2D video frames.
However, the integrated attention mechanism substantially reduces this error relative to the Pure CNN baseline by focusing on the relative alignment of the neck and shoulders.
To evaluate the model's consistency across different categories, a per-class performance analysis was conducted, with the detailed results shown in Figure 7.
The model demonstrated exceptional capability in recognizing the correct posture, achieving its highest F1-Score of 0.97 for the 'Ideal Posture' class.
Conversely, its performance was slightly lower for 'Upper Back Slump' (F1-Score 0.90), a finding that aligns with the minor class confusions previously observed in the confusion matrix.

Figure 7. Performance metrics comparison chart for each class

To ensure the model is not overfitting to the training data, its learning curve was plotted and analyzed, as shown in Figure 8.
The plot shows that both the accuracy and loss metrics for the training and validation datasets follow a closely parallel trajectory throughout training.
This convergence is a strong indicator of a well-fitted model, demonstrating its ability to generalize effectively and perform reliably on new, unseen data.

Figure 8. Model learning curve (accuracy and loss)

Qualitative Analysis and Model Interpretation
A key advantage of the proposed model is its ability to provide transparent and explainable feedback, moving beyond simple classification. This is achieved through an attention-visualization technique that generates a map highlighting the specific skeletal joints and body regions the model focused on to reach its conclusion.
For instance, Figure 9 clearly shows the model correctly focusing on the head, neck, and upper spine when identifying 'Forward Head Posture'.

Figure 9. Visualization of the attention map on "head forward posture"

Spatial Attention. Figure 9 shows that the model has learned to specifically monitor the alignment between the ears, shoulders, and hips.
When the head moves forward, the model assigns the highest attention weight to the neck and shoulder joints.
Temporal Attention. Figure 9 shows that the model's attention increases over time, indicating that it can detect gradual changes in posture.
This interpretability enables the system to provide highly relevant and actionable feedback, such as "Straighten your neck," a significant advantage over "black box" models.
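One way such feedback could be derived is to aggregate attention by body region and return the message for the dominant region. The sketch below is purely illustrative and is not the ErgoGuard implementation: it assumes per-joint weights can be read off the attention map, uses the ears as a proxy for the neck (BlazePose has no neck landmark), and the `REGION` and `MESSAGE` tables are hypothetical.

```python
# Hypothetical mapping from BlazePose landmark indices to body regions:
# 7/8 = ears (proxy for the neck), 11/12 = shoulders, 23/24 = hips.
REGION = {7: "neck", 8: "neck", 11: "shoulders", 12: "shoulders",
          23: "hips", 24: "hips"}

# Hypothetical feedback messages, for illustration only.
MESSAGE = {"neck": "Straighten your neck.",
           "shoulders": "Relax and square your shoulders.",
           "hips": "Sit back and level your hips."}


def feedback_from_attention(joint_weights):
    """Return the message for the body region with the most attention.

    `joint_weights` maps a BlazePose landmark index to an attention weight.
    """
    region_total = {}
    for idx, weight in joint_weights.items():
        region = REGION.get(idx)
        if region is not None:          # ignore joints outside the table
            region_total[region] = region_total.get(region, 0.0) + weight
    if not region_total:
        return None
    top_region = max(region_total, key=region_total.get)
    return MESSAGE[top_region]


# Made-up weights resembling a forward-head frame.
weights = {7: 0.30, 8: 0.28, 11: 0.15, 12: 0.14, 23: 0.07, 24: 0.06}
print(feedback_from_attention(weights))
```

Summing weights per region rather than picking the single highest joint makes the message less sensitive to left/right asymmetries in the attention map.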
Discussion on Limitations and Robustness
While the experimental results demonstrate high accuracy, it is critical to address potential biases and limitations inherent in the current study to ensure fair interpretation.
First, regarding demographic bias, although the OfficePosture dataset included both male and female participants (12 of the 30 were female), the sample size of 30 participants may not fully capture the substantial variation in anthropometric measurements (e.g., body mass index, height extremes) observed in the global population.
Individuals with markedly different body shapes or those wearing loose-fitting clothing may introduce occlusions that degrade the precision of BlazePose's keypoint extraction.
Second, environmental and ergonomic variables present a challenge.
The training data was collected in a controlled environment with a fixed camera height and standard office chairs.
In real-world deployment, such as dynamic "work-from-home" setups, variations in chair height relative to the desk or non-standard camera angles (e.g., a laptop placed on a low table) could alter the perspective of the skeletal vector.
While the normalization of keypoints relative to the hip center aims to mitigate scale and position issues, the model's robustness to severe viewing-angle distortions requires further validation in future field studies.
Addressing these biases is essential before the system can be considered a universally applicable health intervention tool.
CONCLUSION
This study validated an Attention-Based CNN-LSTM architecture optimized for the real-time assessment of quasi-static sitting behaviors.
On the custom OfficePosture dataset comprising 500 video clips, the model achieved a classification accuracy of 94.2% and an F1 Score of 0.94, substantially outperforming the baseline CNN and Standard LSTM models.
This work differs from existing Human Activity Recognition (HAR) studies by specifically addressing the ergonomic domain of quasi-static behaviors.
Unlike generic deep learning models that prioritize high-dynamic motion detection (e.g., walking, running), our proposed Attention-Based CNN-LSTM architecture is optimized to detect the subtle, low-velocity spinal deviations characteristic of office sitting postures.
By processing skeletal vector data rather than raw images, the model balances high classification accuracy (94.2%), computational efficiency (26.3 ms latency), and biomechanical interpretability, thereby filling a critical gap in accessible preventive health technology.
Future work will focus on addressing current limitations by exploring Graph Convolutional Networks (GCNs) to better model the topological connectivity of spinal joints, thereby resolving the remaining confusion among slouching subtypes.
Additionally, investigating transformer-based architectures could enable superior temporal modeling for long-duration monitoring without the vanishing-gradient limitations of LSTMs.
ACKNOWLEDGEMENTS
The authors express sincere gratitude to all participants who contributed to the data acquisition process for this study.
We would also like to extend our highest appreciation to the ergonomics experts for their valuable input and validation of the posture categories used.
Finally, we would like to thank Institut Bisnis dan Teknologi Pelita Indonesia and Universitas Putra Indonesia "YPTK" Padang.
DECLARATIONS
AI USAGE STATEMENT
Artificial Intelligence tools (specifically Grammarly and Gemini AI) were used during the preparation of this manuscript exclusively for the purposes of grammatical correction, language refinement, and clarity improvement.
No AI tools were used to generate the research data, results, or scientific conclusions.
AUTHOR CONTRIBUTION
Gusrio Tendra: Conceptualization, Methodology, Software, Validation, Writing Original Draft.
Sumijan: Supervision, Resources, Writing Review and Editing.
Deny Jollyta: Supervision, Writing Review and Editing.
FUNDING STATEMENT
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
COMPETING INTEREST
The authors confirm that there are no conflicts of interest, either financial or non-financial, that could influence the research results and interpretation of the data in this article.
REFERENCES