Structured Cross-Modal Alignment via Hypergraph-Enhanced Transformers
Abstract
Effective fusion of visual and textual modalities has become essential in applications ranging from autonomous navigation to medical diagnosis and human-computer interaction. However, current models often struggle to generalize across distinct domains, or they fail to capture high-order semantic dependencies in heterogeneous input data. In this paper, we propose a framework that integrates hypergraph-based structural modeling with transformer-based semantic alignment to build a unified cross-modal representation. Specifically, our method constructs a dynamic hypergraph that encodes high-order correlations among image regions and textual tokens; the resulting hypergraph features are then fused with token embeddings in a dual-stream transformer encoder. The model is trained with a contrastive alignment objective across multiple domains (natural scenes, satellite imagery, and clinical imaging) to encourage transferability and robustness. Extensive experiments on four benchmark datasets (Flickr30K, MS-COCO, RSICD, and IU X-Ray) demonstrate that our approach outperforms previous state-of-the-art methods in both zero-shot retrieval and domain adaptation tasks. Our contributions include: (1) a unified hypergraph construction pipeline for vision-language data, (2) a hierarchical transformer architecture that integrates hypergraph features with token embeddings, and (3) empirical insights into domain-generalizable multimodal learning. The proposed framework sets a new state of the art for cross-domain image-text representation and serves as a strong foundation for real-world multimodal AI systems.
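To make two ingredients of the abstract concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not say how hyperedges are formed or which contrastive loss is used, so the sketch assumes a k-nearest-neighbour rule over cosine similarity for the hypergraph and a symmetric InfoNCE loss for the alignment objective. All function names, the value of `k`, and the temperature `0.07` are hypothetical choices for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def knn_hyperedges(nodes, k):
    """Assumed rule: one hyperedge per node, holding it and its k nearest nodes.

    `nodes` mixes image-region and text-token embeddings in a single list,
    so a hyperedge may span both modalities. Returns sorted index lists.
    """
    edges = []
    for i, u in enumerate(nodes):
        others = [j for j in range(len(nodes)) if j != i]
        others.sort(key=lambda j: cosine(u, nodes[j]), reverse=True)
        edges.append(sorted([i] + others[:k]))
    return edges


def incidence_matrix(num_nodes, edges):
    """H[v][e] = 1 if node v belongs to hyperedge e, else 0."""
    H = [[0] * len(edges) for _ in range(num_nodes)]
    for e, members in enumerate(edges):
        for v in members:
            H[v][e] = 1
    return H


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Assumed objective: symmetric InfoNCE over matched image-text pairs.

    Pair i is (image_embs[i], text_embs[i]); every other combination in
    the batch is treated as a negative.
    """
    n = len(image_embs)
    logits = [[cosine(i, t) / temperature for t in text_embs]
              for i in image_embs]

    def cross_entropy_diag(rows):
        # Mean cross-entropy with the diagonal entry as the target class,
        # using the log-sum-exp trick for numerical stability.
        total = 0.0
        for idx, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[idx]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))
```

Under these assumptions, a batch whose image and text embeddings are correctly paired yields a lower loss than the same batch with the pairs swapped, which is the behaviour a contrastive alignment objective is meant to reward. A "dynamic" hypergraph would presumably recompute the hyperedges as the embeddings change during training; the sketch builds them once for a fixed set of embeddings.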
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.
Mind forge Academia also operates under the Creative Commons Licence CC-BY 4.0. This licence allows anyone to copy and redistribute the material in any medium or format for any purpose, even commercially, provided that appropriate citation information is given.