Bridging modalities: An analysis of cross-modal Wasserstein adversarial translation networks and their theoretical foundations

Joseph Tafataona Mtetwa; Kingsley A. Ogudo; Sameerchand Pudaruth

doi:10.3390/math13162545

Journal article

Open access

Peer reviewed

Mathematics (Basel), Vol.13(16), p.2545

08/08/2025

DOI: https://doi.org/10.3390/math13162545

Handle: https://hdl.handle.net/10210/516806

Keywords: cross-modal translation; Wasserstein adversarial training; multi-modal learning

Abstract

What if machines could seamlessly translate between the visual richness of images and the semantic depth of language with mathematical precision? This paper presents a theoretical and empirical analysis of five novel cross-modal Wasserstein adversarial translation networks that challenge conventional approaches to cross-modal understanding. Unlike traditional generative models that rely on stochastic noise, our frameworks learn deterministic translation mappings that preserve semantic fidelity across modalities through rigorous mathematical foundations. We systematically examine: (1) cross-modality consistent dual-critical networks; (2) Wasserstein cycle consistency; (3) multi-scale Wasserstein distance; (4) regularization through modality invariance; and (5) Wasserstein information bottleneck. Each approach employs adversarial training with Wasserstein distances to establish theoretically grounded translation functions between heterogeneous data representations. Through mathematical analysis—including information-theoretic frameworks, differential geometry, and convergence guarantees—we establish the theoretical foundations underlying cross-modal translation. Our empirical evaluation across MS-COCO, Flickr30K, and Conceptual Captions datasets, including comparisons with transformer-based baselines, reveals that our proposed multi-scale Wasserstein cycle consistent (MS-WCC) framework achieves remarkable performance gains—12.1% average improvement in FID scores and 8.0% enhancement in cross-modal translation accuracy—compared to state-of-the-art methods, while maintaining superior computational efficiency. These results demonstrate that principled mathematical approaches to cross-modal translation can significantly advance machine understanding of multimodal data, opening new possibilities for applications requiring seamless communication between visual and textual domains.
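As a small illustration of the quantity the abstract builds on, the sketch below computes the Wasserstein-1 distance between two equal-size one-dimensional empirical samples, where it reduces to the mean absolute difference between the sorted values. This is not the authors' method: the networks described above estimate Wasserstein distances adversarially with critics over high-dimensional image and text representations. The function name `wasserstein1_1d` and the sample values are hypothetical, chosen only for this example.

```python
def wasserstein1_1d(x, y):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    In one dimension the optimal transport plan matches sorted values,
    so the distance is the mean absolute gap between sorted samples.
    """
    xs, ys = sorted(x), sorted(y)
    if len(xs) != len(ys) or not xs:
        raise ValueError("expected two non-empty samples of equal size")
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


# Hypothetical samples: the second is the first shifted by 0.5.
source = [0.0, 1.0, 2.0]
shifted = [0.5, 1.5, 2.5]
print(wasserstein1_1d(source, shifted))  # 0.5
```

A pure shift of a sample by a constant moves every point by that constant, so the distance equals the size of the shift, and reordering a sample leaves the distance unchanged.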

License: CC BY 4.0, Open Access

Published (Version of record) Open

Details

Title: Bridging modalities: An analysis of cross-modal Wasserstein adversarial translation networks and their theoretical foundations
Creators: Joseph Tafataona Mtetwa (University of Johannesburg); Kingsley A. Ogudo; Sameerchand Pudaruth (University of Mauritius)
Publication Details: Mathematics (Basel), Vol.13(16), p.2545
Identifiers: 9956487807691
ISSN: 2227-7390
Academic Unit: University of Johannesburg; Electrical and Electronic Engineering Studies; Faculty of Engineering & the Built Environment
Language: English
Resource Type: Journal article