Research Paper | Computer Science and Engineering | Volume 15 Issue 3, March 2026 | Pages: 1759 - 1769 | India
Corpusio: A Practical Multimodal Extraction and Retrieval Ecosystem for Knowledge Discovery in Resource-Constrained Environments
Abstract: As enterprise organizations accumulate vast amounts of heterogeneous unstructured data spanning PDFs, scanned contracts, and event photography, traditional keyword retrieval systems fail to capture critical multimodal associations. This paper presents Corpusio, a multimodal extraction, indexing, and serving ecosystem designed for resource-constrained production environments. Rather than attempting semantic understanding via computationally prohibitive end-to-end large multimodal models, Corpusio employs an operationally conservative pipeline. Specifically, the system uses a deterministic, two-pass person-image linking strategy that combines layout-first, card-based proximity grouping with a formal linear assignment solver to prevent cross-identity leakage. Visual artifacts are indexed via perceptual hashing to enable efficient deduplication and anchor-based feedback loops. Furthermore, the ecosystem mediates downstream Retrieval-Augmented Generation (RAG) by surfacing stable evidence locators wrapped in strict Role-Based Access Control (RBAC) masks. On a log-verified production corpus of 46 diverse documents, the pipeline ran with a mean latency of 208.54 s/doc and a peak allocated GPU footprint of 3.72 GB on a commodity NVIDIA T4 (16 GB). These results demonstrate that a deterministic, multi-stage retrieval architecture can deliver high-precision, auditable multimodal discovery under bounded compute while strictly enforcing operational bounds and data confidentiality.
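The abstract does not state which perceptual hash Corpusio uses, so the following is only a minimal sketch of how hash-based deduplication can work in principle. It assumes a simple average hash computed over an already downscaled grayscale grid. The function names, the toy 4x4 images, and the `max_distance` threshold are all hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence


def average_hash(pixels: Sequence[Sequence[int]]) -> int:
    """Pack an average hash: one bit per pixel, set when pixel >= grid mean.

    `pixels` is assumed to be a small grayscale grid (e.g. 8x8) produced by
    downscaling the source image beforehand.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(hashes: List[int], max_distance: int = 2) -> List[int]:
    """Greedy near-duplicate removal; returns indices of the images kept.

    An image is kept only if its hash is more than `max_distance` bits away
    from every hash already kept (threshold is an illustrative assumption).
    """
    kept: List[int] = []
    for i, h in enumerate(hashes):
        if all(hamming(h, hashes[k]) > max_distance for k in kept):
            kept.append(i)
    return kept


# Toy 4x4 "images": img_b is a uniformly brightened copy of img_a,
# img_c has a different layout (bright top half, dark bottom half).
img_a = [[10, 10, 200, 200] for _ in range(4)]
img_b = [[15, 15, 205, 205] for _ in range(4)]
img_c = [[200] * 4, [200] * 4, [10] * 4, [10] * 4]

hashes = [average_hash(img) for img in (img_a, img_b, img_c)]
kept_indices = deduplicate(hashes)  # img_b collapses onto img_a
```

Because the hash depends only on each pixel's relation to the grid mean, a uniform brightness shift leaves it unchanged, which is why this kind of index is cheap to compute and tolerant of minor re-encodings. A production system would likely use a more robust variant and a tuned threshold.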
Keywords: access control, face verification, hybrid retrieval, image-text binding, layout analysis, multimodal document understanding, privacy-preserving machine learning, resource-constrained deployment, retrieval-augmented generation, semantic embeddings
How to Cite?: Mahanthi Bharadwaj Phani Datta, Kamaleeswari Kamboji, Kancharala Subhaashini, Kella Kedhareesh, Raja N. Moorthy, "Corpusio: A Practical Multimodal Extraction and Retrieval Ecosystem for Knowledge Discovery in Resource-Constrained Environments", Volume 15 Issue 3, March 2026, International Journal of Science and Research (IJSR), Pages: 1759-1769, https://www.ijsr.net/getabstract.php?paperid=SR26326125321, DOI: https://dx.doi.org/10.21275/SR26326125321