Resource Efficient Semantic Retrieval Pipeline via Generative Captioning and Text-to-Text Transformers for Bridging the Modality Gap
(1) Universitas Nusa Mandiri
(2) Universitas Muhammadiyah Semarang
(3) Universitas Muhammadiyah Semarang
(4) Universitas Nusa Mandiri
(5) International Hellenic University
(*) Corresponding Author
Abstract
The rapid expansion of multimodal digital content necessitates the development of robust information retrieval systems capable of bridging the semantic gap between visual and textual data. However, contemporary cross-modal models, such as CLIP, impose significant computational demands, rendering them impractical for real-time deployment in resource-limited environments. To address this efficiency challenge, this study introduces a novel lightweight retrieval pipeline that reconceptualizes cross-modal retrieval as a text-to-text task through generative transformation. The proposed methodology employs the Bootstrapping Language-Image Pre-training (BLIP) model to distill visual features into rich textual descriptions, which are subsequently encoded into dense semantic vectors using the T5 transformer architecture. Extensive experiments conducted on the MSCOCO and Flickr30K datasets demonstrate that the proposed pipeline achieves a Semantic Average Recall (SAR@5) of 0.561, significantly surpassing traditional lexical (BM25) and dense (SBERT) baselines. Notably, while the computationally intensive CLIP model retains a slight advantage in absolute accuracy, our approach delivers approximately 90% of CLIP’s semantic performance while enhancing inference throughput by 2.1× and reducing GPU memory consumption by 62%. These findings confirm that generative semantic distillation offers a scalable, cost-effective alternative to end-to-end multimodal systems, particularly for latency-sensitive applications requiring high semantic fidelity.
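As a rough illustration of the pipeline described in the abstract, the sketch below chains an off-the-shelf BLIP captioner with a T5 encoder to turn images into dense caption vectors, then ranks those vectors against a query embedding by cosine similarity. The Hugging Face checkpoint names, the mean-pooling strategy, and the top-k retrieval step are illustrative assumptions, not the exact configuration reported in the paper.

import torch
from PIL import Image
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    T5Tokenizer,
    T5EncoderModel,
)

# Step 1: distill each image into a textual description with BLIP
# (illustrative checkpoint; the paper's exact weights may differ).
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(image: Image.Image) -> str:
    inputs = blip_processor(images=image, return_tensors="pt")
    output_ids = blip_model.generate(**inputs, max_new_tokens=30)
    return blip_processor.decode(output_ids[0], skip_special_tokens=True)

# Step 2: encode captions and queries into dense semantic vectors with the
# T5 encoder; masked mean pooling over token states is an assumed choice.
t5_tokenizer = T5Tokenizer.from_pretrained("t5-base")
t5_encoder = T5EncoderModel.from_pretrained("t5-base")

@torch.no_grad()
def embed(texts):
    batch = t5_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = t5_encoder(**batch).last_hidden_state      # (batch, seq_len, dim)
    mask = batch.attention_mask.unsqueeze(-1)           # (batch, seq_len, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)       # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)

# Step 3: retrieval reduces to a text-to-text similarity search between the
# query vector and the precomputed caption vectors.
def retrieve(query, caption_vectors, k=5):
    scores = embed([query]) @ caption_vectors.T          # cosine similarity
    return scores.topk(k, dim=-1).indices.squeeze(0)     # indices of top-k images

In this framing the expensive vision model runs only once per image at indexing time, while query-time cost is a single T5 encoder pass plus a vector search, which is where the reported throughput and memory savings would come from.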
References
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning Transferable Visual Models From Natural Language Supervision. In: International Conference on Machine Learning (ICML). PMLR; 2021. p. 8748-63. Available from: https://proceedings.mlr.press/v139/radford21a.html.
Jia C, Yang Y, Xia Y, Chen YT, Parekh Z, Pham H, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In: International Conference on Machine Learning (ICML). PMLR; 2021. p. 4904-16. Available from: https://proceedings.mlr.press/v139/jia21b.html.
Li J, Li D, Xiong C, Hoi S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In: International Conference on Machine Learning (ICML). PMLR; 2022. p. 12888-900. Available from: https://proceedings.mlr.press/v162/li22n.html.
Goyal K, Gupta U, De A, Chakrabarti S. Deep neural matching models for graph retrieval. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval; 2020. p. 1701-4. Available from: https://doi.org/10.1145/3397271.3401216.
Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2019. p. 3982-92. Available from: https://doi.org/10.48550/arXiv.1908.10084.
Treviso M, Ji T, Lee JU, van Aken B, Martins AF, et al. Efficient Methods for Natural Language Processing: A Survey. Transactions of the Association for Computational Linguistics. 2023;11:826-60. Available from: https://doi.org/10.1162/tacl_a_00577.
Wang Z, Liu R, De Luca M. Cross-Modal Index Alignment: Bridging Vision and Language in Neural Retrieval Architectures. Computer Science Bulletin. 2025;8(01):327-46. Available from: https://doi.org/10.71465/csb165.
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research. 2020;21(140):1-67. Available from: http://jmlr.org/papers/v21/20-074.html.
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common objects in context. In: European Conference on Computer Vision (ECCV). Springer; 2014. p. 740-55. Available from: https://doi.org/10.1007/978-3-319-10602-1_48.
Plummer BA, Wang L, Cervantes CM, Caicedo JC, Hockenmaier J, Lazebnik S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision. 2015;123(1):74-93. Available from: https://openaccess.thecvf.com/content_iccv_2015/html/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.html.
Hubert N, Monnin P, Brun A, Monticolo D. Sem@K: Is my knowledge graph embedding model semantic-aware? arXiv preprint arXiv:2301.05601. 2023. Available from: https://doi.org/10.3233/SW-233508.
Rasiwasia N, Costa Pereira J, Coviello E, Doyle G, Lanckriet G, Levy R, et al. A New Approach to Cross-Modal Multimedia Retrieval. In: Proceedings of the ACM Multimedia 2010 International Conference; 2010. p. 251-60. Available from: https://doi.org/10.1145/1873951.1873987.
Berger A, Lafferty J. Information retrieval as statistical translation. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval; 2017. p.222-9. Available from: https://doi.org/10.1145/3130348.3130371.
Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9(11):2579-605. Available from: https://www.jmlr.org/papers/v9/vandermaaten08a.html.
DOI: https://doi.org/10.26714/jichi.v6i2.19240