Adaptive Grasping in Robotic Manipulation through Learning-Driven Multi-Modal Sensor Fusion
Abstract
Robust grasping in unstructured environments remains a fundamental challenge for robotic manipulation systems due to uncertainties in object geometry, occlusion, and sensor noise. This paper proposes a deep learning-based adaptive grasping framework that integrates visual and tactile information through a multi-modal fusion architecture. Specifically, a dual-stream convolutional neural network (CNN) is designed to extract spatial features from RGB-D images and force feedback signals, followed by a cross-attention module for feature alignment and fusion. The model is trained on a dataset of 45,000 grasping trials collected using a 6-DOF robotic arm across 120 object categories. Experimental results demonstrate that the proposed method achieves a grasp success rate of 91.3%, outperforming baseline vision-only models (84.7%) and tactile-only models (79.2%). In cluttered environments with partial occlusion, the success rate remains at 87.5%, indicating strong robustness. Furthermore, real-time deployment shows an average inference latency of 32 ms per frame, enabling practical applicability in industrial settings. Comparative ablation studies confirm that the cross-attention fusion module contributes a +5.8% improvement in grasp accuracy. These results highlight the effectiveness of multi-modal deep learning for enhancing robotic manipulation performance in complex scenarios.
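The abstract does not give implementation details for the cross-attention fusion module; as a minimal illustrative sketch (not the authors' actual architecture), the fusion step can be pictured as visual features attending over tactile features via scaled dot-product attention. All shapes, names, and dimensions below are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(visual, tactile):
    """Align tactile features to visual features via cross-attention.

    visual:  (N_v, d) array of visual patch embeddings (hypothetical)
    tactile: (N_t, d) array of force/tactile embeddings (hypothetical)
    Returns a (N_v, d) fused representation where each visual token
    aggregates the tactile tokens most relevant to it.
    """
    d = visual.shape[-1]
    scores = visual @ tactile.T / np.sqrt(d)   # (N_v, N_t) alignment scores
    weights = softmax(scores, axis=-1)         # rows sum to 1
    return weights @ tactile                   # tactile info projected onto visual tokens

rng = np.random.default_rng(0)
vis = rng.standard_normal((16, 64))   # e.g. 16 visual embeddings from an RGB-D stream
tac = rng.standard_normal((4, 64))    # e.g. 4 embeddings from a force-feedback stream
fused = cross_attention_fuse(vis, tac)
print(fused.shape)  # (16, 64)
```

In a full dual-stream design, `vis` and `tac` would come from the two CNN branches, and `fused` would feed the grasp-prediction head; this sketch only shows the attention-based alignment itself.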
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.
Mind forge Academia also operates under the Creative Commons Licence CC-BY 4.0. This allows you to copy and redistribute the material in any medium or format for any purpose, even commercially, provided that you give appropriate attribution.