Cross-Attention Transformer-Based Visual-Language Fusion for Multimodal Image Analysis