Multimodal Document Representation for Image-Text Fusion

Akshata Upadhye

doi:10.21275/SR24430153718

Abstract: This survey discusses advances in multimodal document representation, with a specific focus on the fusion of textual and visual information. It begins with a historical overview of multimodal representation techniques, ranging from early hand-crafted, feature-based approaches to recent advances in deep learning. The paper then explores strategies for fusing multimodal information, such as concatenation, attention mechanisms, and shared layers. It also highlights applications including image captioning, document retrieval, visual question answering, and multimedia analysis, to demonstrate the broad impact of multimodal representation across diverse domains. Despite this progress, challenges such as data heterogeneity, scalability, and interpretability persist and open avenues for future research. Finally, the paper offers insights into current state-of-the-art techniques and identifies opportunities for advancing the field of multimodal document representation.

Keywords: Multimodal Representation, Document Fusion, Image-Text Integration, Deep Learning, Information Retrieval, Semantic Understanding

How to Cite: Akshata Upadhye, "Multimodal Document Representation for Image-Text Fusion", Volume 11 Issue 6, June 2022, International Journal of Science and Research (IJSR), Pages: 1998-2002, https://www.ijsr.net/getabstract.php?paperid=SR24430153718, DOI: https://dx.doi.org/10.21275/SR24430153718
