Advanced Retrieval Augmented Generation: Multilingual Semantic Retrieval across Document Types by Finetuning Transformer Based Language Models and OCR Integration

Journal Title: Engineering and Technology Journal - Year 2024, Vol 9, Issue 07

Abstract

This study presents a system for multilingual semantic retrieval across diverse document types that integrates Retrieval-Augmented Generation (RAG) with transformer-based language models and Optical Character Recognition (OCR). To build a robust multilingual Question-Answering (QA) system, we developed a custom dataset derived from XQuAD, FQuAD, and MLQA and augmented it with synthetic data generated using OpenAI's GPT-3.5 Turbo to ensure comprehensive, context-rich answers. PaddleOCR provided high-quality text extraction in French, English, and Spanish, though Arabic presented some difficulties. The Multilingual E5 embedding model was fine-tuned with the Multiple Negatives Ranking Loss to optimize retrieval of context-question pairs. We used two models for text generation: mT5, fine-tuned for better contextual understanding and longer answers and suitable for CPU-friendly uses, and Llama 3 8B-Instruct, optimized for advanced language generation and suited to professional and industry applications that can supply extensive GPU resources. Individual components were evaluated with F1, Exact Match (EM), and BLEU scores, and the entire system with the RAGAS framework. mT5 showed promising results and excelled in context precision and relevancy, while the quantized version of Llama 3 led in answer correctness and similarity. This work highlights the effectiveness of our RAG system in multilingual semantic retrieval, providing a robust solution for real-world QA applications and laying the groundwork for future advancements in multilingual document processing.
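The Multiple Negatives Ranking Loss named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' training code: the function names, the toy 2-D vectors, and the `scale` value of 20 are assumptions chosen for illustration. The idea is that each question's paired context is the positive, every other context in the batch acts as a negative, and the loss is the cross-entropy of picking the positive from the scaled cosine similarities.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def multiple_negatives_ranking_loss(question_vecs, context_vecs, scale=20.0):
    """Mean cross-entropy over in-batch similarities.

    question_vecs[i] and context_vecs[i] form a positive pair; every other
    context in the batch serves as a negative for question i.
    `scale` is a temperature-like multiplier (value assumed here).
    """
    losses = []
    for i, q in enumerate(question_vecs):
        logits = [scale * cosine(q, c) for c in context_vecs]
        # Numerically stable log-sum-exp over the row of logits.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching context.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (illustrative values, not taken from the paper).
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each context sits near its own question
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each context sits near the wrong question

loss_aligned = multiple_negatives_ranking_loss(questions, aligned)
loss_swapped = multiple_negatives_ranking_loss(questions, swapped)
print(loss_aligned < loss_swapped)
```

In this toy batch the loss is small when each context embedding lies close to its own question and large when the pairs are swapped, which is the signal that fine-tuning on this objective pushes an embedding model toward.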

Authors and Affiliations

Ismail OUBAH, Dr. Selçuk ŞENER

  • EP ID EP740258
  • DOI 10.47191/etj/v9i07.09

How To Cite

Ismail OUBAH, Dr. Selçuk ŞENER (2024). Advanced Retrieval Augmented Generation: Multilingual Semantic Retrieval across Document Types by Finetuning Transformer Based Language Models and OCR Integration. Engineering and Technology Journal, 9(07), -. https://europub.co.uk/articles/-A-740258