Knowledge-Guided Multi-Modal Understanding
Abstract
In recent years, multi-modal understanding has emerged as a key frontier in artificial intelligence, enabling machines to interpret and reason across heterogeneous data sources such as vision, language, electromagnetic waves, and neural signals. While significant progress has been made in aligning modalities through deep learning, challenges persist in achieving fine-grained, interpretable, and cognitively aligned understanding—particularly in complex, dynamic, or human-centered environments.
This session centers on knowledge-guided approaches that inject structural, semantic, physical, or cognitive priors into multi-modal learning frameworks to enhance cross-modal alignment, reasoning, and generation. For instance, in brain-to-image reconstruction, neurocognitive priors can help bridge fMRI signals and visual abstractions, while emotion-aware video captioning benefits from hierarchical semantic modeling and temporal-scale reasoning. Similarly, navigation agents in continuous spaces can leverage spatial and linguistic priors to improve grounding and trajectory planning.
Representative works in this session span diverse tasks, including brain-informed image generation, emotionally enriched video description, and waypoint prediction in vision-language navigation—united by a common emphasis on embedding domain-specific knowledge into the modeling of cross-modal interactions. By foregrounding knowledge as a guiding force, this session aims to foster discussion on new architectures, supervision strategies, and representations that support more robust, interpretable, and human-aligned multi-modal systems.
Topics
• Neurocognitive-Inspired Generation: Leveraging brain knowledge and neural recordings to reconstruct or generate visual stimuli, e.g., visual image reconstruction from fMRI or EEG signals.
• Knowledge-Augmented Generation: Enhancing visual or language generation through the injection of structured knowledge—such as emotional semantics, hierarchical scene understanding, or symbolic guidance—especially in ambiguous or weakly supervised settings.
• Physics-Driven Generation: Cross-modal generation guided by physical mechanisms, e.g., electromagnetic imaging based on Maxwell's equations.
• Privacy-Preserving Generation: Secure generative modeling from sensitive data via federated, encrypted, or obfuscated learning paradigms, e.g., federated learning for privacy-preserving generation.
Submission Site
