Structured Cross-Modal Alignment via Hypergraph-Enhanced Transformers

Lennart Ainsley
Mireille Tovey
Romilly Vancroft

Abstract

Effective fusion of visual and textual modalities is essential in applications ranging from autonomous navigation to medical diagnosis and human-computer interaction. However, current models often struggle to generalize across distinct domains or lack the ability to capture high-order semantic dependencies in heterogeneous input data. In this paper, we propose a framework that combines hypergraph-based structural modeling with transformer-based semantic alignment to build a unified cross-modal representation. Specifically, our method constructs a dynamic hypergraph that encodes high-order correlations among image regions and textual tokens; the resulting hypergraph features are then fused with token embeddings in a dual-stream transformer encoder. The model is trained with a contrastive alignment objective across multiple domains, including natural scenes, satellite imagery, and clinical imaging, to encourage transferability and robustness. Extensive experiments on four benchmark datasets (Flickr30K, MS-COCO, RSICD, and IU X-Ray) show that our approach outperforms previous state-of-the-art methods in both zero-shot retrieval and domain adaptation tasks. Our contributions are: (1) a unified hypergraph construction pipeline for vision-language data, (2) a hierarchical transformer architecture that integrates hypergraph features with token embeddings, and (3) empirical insights into domain-generalizable multimodal learning. The proposed framework establishes a new state of the art for cross-domain image-text representation and provides a strong foundation for real-world multimodal AI systems.
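
As a rough illustration of the two mechanisms named in the abstract, the sketch below builds a hypergraph incidence matrix over image-region and text-token features and computes a symmetric contrastive loss over paired embeddings. The abstract does not say how hyperedges are formed or which contrastive loss is used, so the k-nearest-neighbour rule, the InfoNCE-style loss, the temperature value, and every function and variable name here are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def knn_incidence(features, k):
    """Return an incidence matrix H of shape (n_nodes, n_nodes).

    Column e is one hyperedge: node e plus its k most similar nodes by
    cosine similarity. The k-NN rule is an assumption for illustration.
    """
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)           # exclude self from the neighbour search
    nbrs = np.argsort(-sim, axis=1)[:, :k]   # k nearest neighbours per node
    n = features.shape[0]
    H = np.zeros((n, n))
    H[np.arange(n), np.arange(n)] = 1.0      # each hyperedge contains its centre node
    for e in range(n):
        H[nbrs[e], e] = 1.0
    return H


def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def symmetric_info_nce(img, txt, temperature=0.07):
    """Symmetric InfoNCE-style loss; row i of img is paired with row i of txt."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


rng = np.random.default_rng(0)
regions = rng.normal(size=(6, 16))  # hypothetical image-region features
tokens = rng.normal(size=(4, 16))   # hypothetical text-token features
H = knn_incidence(np.vstack([regions, tokens]), k=3)

img_emb = rng.normal(size=(8, 16))
txt_emb = img_emb + 0.05 * rng.normal(size=(8, 16))  # near-aligned pairs
loss_matched = symmetric_info_nce(img_emb, txt_emb)
loss_shuffled = symmetric_info_nce(img_emb, np.roll(txt_emb, 1, axis=0))
```

Because regions and tokens share one node set here, a single hyperedge can group nodes from both modalities, which is one plausible reading of "high-order correlations among image regions and textual tokens". The loss is lower for correctly paired embeddings than for shuffled ones, which is the behaviour a contrastive alignment objective relies on.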

How to Cite
Ainsley, L., Tovey, M., & Vancroft, R. (2025). Structured Cross-Modal Alignment via Hypergraph-Enhanced Transformers. Journal of Computer Science and Software Applications, 5(5). https://doi.org/10.5281/zenodo.15381951