Real-Time Gesture Recognition via Deep Spatiotemporal Modeling for Human–Computer Interaction
Abstract
Accurate, real-time gesture recognition is critical for next-generation human–computer interaction (HCI) systems, particularly in immersive and touchless environments. This study presents a spatiotemporal deep learning framework that combines a lightweight convolutional backbone with a temporal transformer encoder to capture dynamic motion patterns in video sequences. The model is trained on a curated dataset of 25,000 gesture samples spanning 30 predefined gesture classes, collected under varying lighting conditions and user backgrounds. The proposed approach achieves an overall recognition accuracy of 96.2%, surpassing conventional CNN-LSTM baselines (91.5%) and 3D CNN models (93.1%). In real-time testing, the system maintains a latency below 40 ms, ensuring smooth user interaction. Cross-user generalization experiments show a performance drop of only 2.4%, demonstrating strong robustness to individual variability. The model is also resilient to noise and occlusion, maintaining 92.8% accuracy under partial hand visibility. These findings indicate that transformer-based temporal modeling significantly enhances gesture recognition performance for real-world HCI applications.
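To make the described architecture concrete, the following is a minimal PyTorch sketch of a per-frame convolutional backbone feeding a temporal transformer encoder. The abstract does not specify layer sizes, the backbone design, or pooling strategy, so everything below (including the class name `SpatioTemporalGestureNet` and all hyperparameters) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalGestureNet(nn.Module):
    """Illustrative sketch: frame-wise CNN features + temporal transformer.

    All layer sizes are assumptions; the paper does not publish its exact
    architecture.
    """

    def __init__(self, num_classes=30, feat_dim=128, num_heads=4, num_layers=2):
        super().__init__()
        # Lightweight convolutional backbone, applied to each frame independently.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # -> (B*T, 64, 1, 1)
            nn.Flatten(),              # -> (B*T, 64)
            nn.Linear(64, feat_dim),
        )
        # Transformer encoder models dependencies across the frame sequence.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        feats = self.backbone(x.view(b * t, c, h, w)).view(b, t, -1)
        feats = self.temporal(feats)          # (B, T, feat_dim)
        return self.head(feats.mean(dim=1))   # temporal average pooling -> logits

# Quick shape check on a dummy batch of 16-frame RGB clips.
model = SpatioTemporalGestureNet()
logits = model(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 30])
```

Running the backbone per frame and reserving the transformer for the temporal axis keeps the 2D convolutional cost low, which is consistent with the sub-40 ms latency the abstract reports; whether the authors use mean pooling or a class token over time is not stated.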
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.
Mind forge Academia also operates under the Creative Commons licence CC BY 4.0. This allows you to copy and redistribute the material in any medium or format for any purpose, even commercially, provided that you give appropriate attribution.