To better understand retrieval-augmented generation (RAG) in the field of generative AI, one can think of a newsroom setting. A seasoned journalist can write many kinds of articles based on their own knowledge of the subject matter.
Yet, when dealing with difficult stories such as investigations or technical subjects, they need researchers to search for and collect pertinent information.
In the same manner, a large language model can formulate its response to any question thrown at it, but it requires another entity to help it gather accurate information before doing so.
Retrieval-augmented generation is an approach in which traditional information retrieval mechanisms, such as databases and search engines, are combined with the text-generation capabilities of LLMs. LLMs learn the patterns through which humans construct sentences, so they can provide fast answers to almost all types of questions. The problem arises when people need a more detailed and up-to-date answer on a particular topic. RAG fills this gap by pairing a retrieval step with a generation step.

In order to explain how information is processed and accessed in a RAG AI model, let's discuss vector databases and the way that RAG accesses data from different databases.
Vector databases are tailored for storing embeddings: numeric vectors that represent real-world objects such as words, images, or videos in a form that machine learning models can use efficiently. RAG searches these embeddings, retrieving information from a vector database by finding the stored vectors most similar to the query vector.
Once a user queries an LLM, the question is sent to a system that converts it into a vector (an embedding) and matches it against the existing vectors in the machine-readable index of the knowledge base. If there are matches, the corresponding data is retrieved, converted into plain text, and delivered to the LLM together with the original question, so that the model can generate its answer from both.
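To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors and their labels are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and a given vector database may use a different similarity measure.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean 'similar'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Hand-written toy "embeddings" (assumed values, not produced by a real model).
dog = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
invoice = [0.1, 0.2, 0.9]

sim_close = cosine_similarity(dog, puppy)   # related concepts
sim_far = cosine_similarity(dog, invoice)   # unrelated concepts
print(sim_close > sim_far)
```

A similarity search in a vector database amounts to finding the stored vectors with the highest such score for a given query vector, usually with an approximate index rather than an exhaustive comparison.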
For example, let us consider how we could apply this type of architecture in developing a customer service assistant with the use of RAG technology. The purpose of the tool is to provide answers to customers' queries related to the products manufactured by the company.
In order for the customer service assistant to function correctly, an embedding model must keep the machine-readable indexes (stored in vector databases) up to date with newly introduced knowledge bases.
In RAG, data is categorized in a manner similar to the way books are organized in a library, which makes the data easier to access through an index. Indexing can rely on exact word matches, on themes, or on metadata such as topic, author, date, and key terms. There are different ways to do this.
In general, RAG systems contain documents that are further split into segments. Each of the segments is embedded in a vector form.
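A minimal sketch of the splitting step, assuming word-based chunks with a small overlap so that a sentence cut at a boundary still appears whole in one chunk. The function name, the chunk size, and the overlap are illustrative choices, not values taken from any particular RAG system.

```python
def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks; each shares `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# A synthetic 100-word document: "w0 w1 ... w99".
document = " ".join(f"w{i}" for i in range(100))
chunks = split_into_chunks(document, chunk_size=40, overlap=10)
print(len(chunks))
```

Each chunk would then be passed to an embedding model and stored alongside its vector.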
At this stage, it is essential to optimize the user's query so that it fits the indexed data more closely. Efficient query processing allows RAG to find relevant data. When working with vector indexes, the optimized query is transformed into an embedding.
With the question clearly defined, RAG then looks into the database it has indexed to find the most relevant content from which it will generate the answer. The search algorithm will depend on the format of the storage. In cases where vectors are used for the search, the distance of the query vector to the chunks of documents will determine the results.
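The vector-based search described above can be sketched as follows. The `embed` function is a toy stand-in (a bag-of-words count vector) used only so the example runs without an external model; a real system would call an embedding model instead, and the sample chunks are invented.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query: str, chunks: list[str]) -> list[tuple[float, str]]:
    """Score every chunk against the query and sort from most to least similar."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

chunks = [
    "the warranty covers battery replacement for two years",
    "shipping takes three to five business days",
    "returns are accepted within thirty days of purchase",
]
results = search("how long is the battery warranty", chunks)
print(results[0][1])
```

Here the warranty chunk ranks first because it shares the most terms with the query; with real embeddings, chunks can rank highly even when they share no exact words with the query.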
This process produces a lot of results. Unfortunately, not all of them can be used by the LLM when generating an answer, so the data has to be filtered and ranked. The ranking process in RAG works similarly to what happens when people search on engines such as Google.
For instance, a Google search will produce a number of pages of results. However, the most important and relevant data will come up on the first page of results. Only the best results will then go on to the next step.
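A minimal sketch of this filter-and-rank step, assuming each retrieved chunk already carries a relevance score. The threshold, the value of `k`, and the sample scores are arbitrary illustrative choices.

```python
def select_top_results(scored, k=3, min_score=0.2):
    """Keep at most k results whose score is at least min_score, best first."""
    relevant = [(score, text) for score, text in scored if score >= min_score]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return relevant[:k]

# Invented (score, chunk) pairs standing in for raw search output.
scored_chunks = [
    (0.12, "office opening hours"),
    (0.81, "battery warranty terms"),
    (0.47, "battery care tips"),
    (0.33, "charger compatibility"),
    (0.29, "screen warranty terms"),
]
top = select_top_results(scored_chunks, k=3, min_score=0.2)
for score, text in top:
    print(f"{score:.2f}  {text}")
```

Production systems often add a separate re-ranking model at this point; the sketch only shows the simplest form of the idea.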
The prompt is enriched by adding the most relevant retrieved information to the initial prompt. In other words, the LLM receives additional context for a deeper understanding of the query, which allows it to provide more specific answers. The answer can therefore draw not only on general information but also on the newest data.
In the final stage, the LLM provides an answer based on the enriched prompt. With access to all relevant data, it creates a response that includes both general knowledge and the specifics related to the query.
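As a rough sketch, prompt enrichment can be as simple as string formatting. The template wording and the function name below are assumptions made for illustration; actual systems use their own templates.

```python
def build_enriched_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved passages and the user's question into one prompt."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

passages = [
    "The warranty covers battery replacement for two years.",
    "Warranty claims require the original receipt.",
]
prompt = build_enriched_prompt("How long is the battery covered?", passages)
print(prompt)
```

The resulting string is what gets sent to the LLM in place of the bare question.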
Very few enterprises build their own artificial intelligence models from scratch. Instead, most customize existing models to their business needs through RAG or fine-tuning. Fine-tuning adjusts the model's internal weights by training it further on task-specific data, producing a model that is highly specialized for the given task. It suits enterprises that need a high degree of specialization. However, fine-tuning is a meticulous process that must be done carefully. It requires gathering training data, which can be daunting, and it runs the risk of degrading the model so that it performs worse after training than before.
The RAG approach, however, does not involve any weight adjustments. The technique relies on gathering information from different databases to make a query more relevant and to generate answers that fit the user's expectations.
Some firms use RAG as a starting point and then fine-tune for specific use cases, while others find that RAG alone is enough to customize their AI system.
A machine learning model must be provided with sufficient context to give meaningful responses, similar to how a human requires relevant information to make decisions or resolve any issues. In the absence of the right context, it becomes difficult to take action.
Current applications of generative AI depend on language models built on transformers. Such a model works within a context window, which is the largest amount of data it can analyze at once. Context windows have historically been relatively small, but improvements in AI technology mean that they are continuously becoming larger.
However, since there is a limit to the size of the context window, machine learning developers have to make choices regarding what information should be included in the prompt. The process of making these selections is referred to as prompt engineering and it ensures that the output provided is highly relevant.
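One way to picture this selection is a simple budget check. The sketch below assumes the passages are already sorted by relevance and uses a word count as a crude stand-in for a token count; real tokenizers count differently, and the function name and budget are hypothetical.

```python
def fit_to_budget(passages: list[str], max_tokens: int) -> list[str]:
    """Greedily keep passages, in ranked order, while they fit in the budget."""
    selected, used = [], 0
    for passage in passages:
        cost = len(passage.split())  # crude proxy: one token per word
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a shorter one may still fit
        selected.append(passage)
        used += cost
    return selected

ranked = [
    "Battery warranty lasts two years",                      # 5 words
    "Returns are accepted within thirty days of purchase",   # 8 words
    "Shipping is free",                                      # 3 words
]
kept = fit_to_budget(ranked, max_tokens=9)
print(kept)
```

With a budget of nine, the second passage is skipped because it would exceed the limit, while the shorter third passage still fits.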
RAG increases the contextual awareness of an AI model by ensuring that an LLM accesses information that is outside its training data set. The inclusion of this information retrieved from different sources enables RAG to improve on the first prompt by helping the AI model provide a more accurate response.
Unlike traditional keyword searches, a machine learning-based semantic search system uses its training data to recognize the relationships between terms. For instance, in a keyword search, "coffee" and "espresso" might be treated as unrelated terms. However, a semantic search system understands that these words are closely linked through their association with beverages and cafés. As a result, a search for "coffee and espresso" might prioritize showing results for popular cafés or coffee-making techniques at the top.
If a RAG system employs a custom database or search engine, semantic search can make the context included in the prompt more relevant, which in turn improves the quality of the AI-generated output.
As we have seen, a RAG system can rely on vector databases and embeddings to extract relevant content. However, it does not have to depend on them alone.
It is possible to employ semantic search within the RAG system for obtaining relevant content from several different sources – be it an embedding-based retrieval system, a traditional database, or a search engine. Afterwards, excerpts from those documents get formatted and incorporated into the model's prompt.
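A minimal sketch of pulling excerpts from several sources and merging them before they are formatted into the prompt. Both retriever functions are hypothetical stand-ins that return canned excerpts; in practice they would wrap a vector store, a database query, or a search engine API.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]

def knowledge_base_search(query: str) -> list[str]:
    # Hypothetical internal source; returns canned excerpts for this sketch.
    return ["Orders can be cancelled within 24 hours.", "Refunds take 5 business days."]

def web_search(query: str) -> list[str]:
    # Hypothetical external source; note that one excerpt duplicates the internal one.
    return ["Refunds take 5 business days.", "Example Carrier reports delays this week."]

def retrieve_from_all(query: str, retrievers: list[Retriever], limit: int = 5) -> list[str]:
    """Query each source in turn and merge the excerpts, dropping exact duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for retrieve in retrievers:
        for excerpt in retrieve(query):
            if excerpt not in seen:
                seen.add(excerpt)
                merged.append(excerpt)
    return merged[:limit]

merged = retrieve_from_all("refund status", [knowledge_base_search, web_search])
print(merged)
```

Treating every source as a function from a query to a list of excerpts keeps the rest of the pipeline unaware of where each excerpt came from.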
RAG methods are well suited to building AI-based semantic search engines. Integrating RAG into natural language search applications yields software that not only finds answers to user questions but also generates new text with generative AI.
A special feature of RAG-based search engines is their ability to work with unstructured content. For example, instead of searching for keyword matches in legal documents, a semantic search engine can answer more complex user requests, such as identifying legal cases in which a particular law was applied under certain conditions.
As can be seen, combining RAG and semantic search enables the engine to return accurate results as well as reveal patterns in data.
RAG algorithms may also extract information from both internal and external search engines. When integrated into the external search engine, RAG algorithms are able to find relevant information from the Internet. At the same time, the integration into the internal search engine provides users with access to corporate resources, including websites.
As an illustration, consider a customer service chatbot for an e-commerce organization that can access not only an external search engine like Google but also an internal search engine that would enable access to information available in the knowledge base of the company.
The origins of RAG date back to the 1970s. The earliest applications used natural language processing to retrieve information, focusing on niche topics. While the main ideas behind text mining have stayed consistent, the technology behind these systems has advanced significantly, making them more effective. By the mid-1990s, services like Ask Jeeves (now Ask.com) popularized question-answering with user-friendly interfaces. IBM's Watson brought further attention to the field in 2011 when it beat human champions on the TV game show Jeopardy!
RAG took a major step forward in 2020, due to research led by Patrick Lewis during his doctoral studies in NLP at University College London and his work at Meta's AI lab. Patrick's team aimed to enhance LLMs by integrating a retrieval index into the model, which would allow it to access and incorporate external data dynamically. Inspired by earlier methods and a paper from Google researchers, they envisioned a system capable of generating accurate, knowledge-based text outputs.
When Lewis integrated a promising retrieval system developed by another Meta team, the results exceeded expectations on the first try, an uncommon feat in AI development.
The study, conducted on a cluster of NVIDIA GPUs with major contributions from Ethan Perez and Douwe Kiela, showed how retrieval could make AI models more precise and reliable. The resulting paper is now referenced by hundreds of other researchers.
Concepts such as RAG have reshaped current LLMs and will continue to do so for question-answering and generative AI applications. By giving models access to external information sources, RAG helps ensure more authoritative responses, paving the way for innovative applications across various other fields.
