Structured Cross-Modal Alignment via Hypergraph-Enhanced Transformers

Lennart Ainsley; Mireille Tovey; Romilly Vancroft

doi:10.5281/zenodo.15381951

Published: May 1, 2025

Abstract

In the era of multi-domain artificial intelligence, effective fusion of visual and textual modalities has become essential in numerous applications ranging from autonomous navigation to medical diagnosis and human-computer interaction. However, current models often struggle to generalize across distinct domains or lack the capability to capture high-order semantic dependencies in heterogeneous input data. In this paper, we propose a novel framework that integrates hypergraph-based structural modeling with transformer-based semantic alignment to construct a unified cross-modal representation paradigm. Specifically, our method constructs a dynamic hypergraph to encode high-order correlations among image regions and textual tokens, which is subsequently fused within a dual-stream transformer encoder. The model is trained under a contrastive alignment objective across multiple domains, including natural scenes, satellite imagery, and clinical imaging, ensuring transferability and robustness. Extensive experiments on four benchmark datasets—Flickr30K, MS-COCO, RSICD, and IU X-Ray—demonstrate that our approach outperforms previous state-of-the-art methods in both zero-shot retrieval and domain adaptation tasks. Our contributions include: (1) a unified hypergraph construction pipeline for vision-language data, (2) a hierarchical transformer architecture that integrates hypergraph features with token embeddings, and (3) empirical insights into domain-generalizable multimodal learning. The proposed framework establishes a new state-of-the-art for cross-domain image-text representation and serves as a strong foundation for real-world multimodal AI systems.
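The abstract does not say how the dynamic hypergraph is built, so the following is only a minimal sketch of one common construction, not the authors' pipeline. It assumes image regions and text tokens already live in a shared embedding space, and it lets every node spawn one hyperedge containing itself and its k most cosine-similar nodes. The names `cosine` and `knn_hypergraph`, the choice of k, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_hypergraph(features, k):
    """Return an incidence matrix H with H[i][j] = 1 if node i is in hyperedge j.

    Assumed construction (not taken from the paper): node j spawns
    hyperedge j, which contains j plus its k nearest neighbours.
    """
    n = len(features)
    H = [[0] * n for _ in range(n)]
    for j in range(n):
        sims = [(cosine(features[j], features[i]), i) for i in range(n) if i != j]
        # Sort by descending similarity, breaking ties by node index.
        sims.sort(key=lambda s: (-s[0], s[1]))
        members = [j] + [i for _, i in sims[:k]]
        for i in members:
            H[i][j] = 1
    return H


# Toy joint embeddings: two image regions followed by two text tokens.
regions = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[0.9, 0.1], [0.1, 0.9]]
H = knn_hypergraph(regions + tokens, k=2)

for row in H:
    print(row)
```

With k = 2 each hyperedge joins three nodes, so a single edge can tie a region to several tokens at once, which is the kind of high-order relation a pairwise graph cannot express directly.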

How to Cite

Ainsley, L., Tovey, M., & Vancroft, R. (2025). Structured Cross-Modal Alignment via Hypergraph-Enhanced Transformers. Journal of Computer Science and Software Applications, 5(5). https://doi.org/10.5281/zenodo.15381951

Issue

Vol. 5 No. 5 (2025)

Section

Articles

This work is licensed under a Creative Commons Attribution 4.0 International License.

Mind forge Academia also operates under the Creative Commons Licence CC-BY 4.0. This licence allows anyone to copy and redistribute the material in any medium or format for any purpose, even commercially, provided that appropriate citation information is given.
