Retrieval augmented generation, or RAG, augments a large language model's (LLM's) predictive abilities by grounding the model in current, contextual information drawn from external knowledge sources.
LLMs such as GPT-4 represent a significant advancement in natural language processing, enabling computers to process, understand, and generate human language.
Despite the seemingly human-like capabilities LLMs have brought into our everyday lives, they come with real limitations and risks. LLMs are prone to hallucinating and providing misinformation, and they cannot grow their knowledge beyond their training data. They can also pose serious security and privacy risks.
This is where RAG comes in.
What Is Retrieval Augmented Generation?
RAG is an AI framework that improves the quality of LLM output by introducing an information retrieval system that draws from trusted sources of knowledge. LLMs on their own are limited to their training data, frozen at the point when their training ended.
Even so, their aim is to predict the best next piece of text based on the user prompt and their training, whether it is factually correct or not. LLMs with RAG, however, can access updated, contextual knowledge or documents in response to a query. This is called grounding.
By first retrieving only the information that is relevant to a query and a user, RAG can help produce the most up-to-date, accurate response to a prompt. The concept first rose to prominence through a 2020 research paper by Patrick Lewis and a team at what was then called Facebook (now Meta).
For an enterprise, RAG provides greater control over the quality and context of data the LLM uses to generate its responses. This could mean restricting a model’s answers to pull only from an enterprise’s approved company procedures, policies, or product information. Using this approach, enterprises can provide the LLM with greater context for queries and ensure greater accuracy.
RAG particularly suits tasks that are knowledge-intensive, meaning tasks that most humans would need to turn to an external source of knowledge to complete.
Moving from Naive RAG to Advanced RAG With Coveo
The above description of retrieval augmented generation (RAG) is only a starting point. Coveo takes RAG to the next level by doubling up on the R (retrieval) to create an advanced application of RAG.
First, Coveo converts text into vector embeddings using an embedding model. The embedding model groups passages that discuss the same topic close together, improving the retrieved context for a given input query.
Then, the first stage of retrieval, at the document level, answers the question: what are the most relevant items in the index for this query? The top 100 documents containing information relevant to the user's query are fetched.
Next, the second retrieval stage identifies the most relevant passages within those documents, which are then used in the generated response.
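To make the two retrieval stages concrete, here is a minimal Python sketch of the general pattern, assuming a generic open-source embedding model and an in-memory index. The function names, the sentence-transformers model, and the passage-splitting logic are illustrative assumptions for this example, not Coveo's actual implementation or API.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; any sentence-level embedding model could stand in.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts):
    """Convert text into normalized vector embeddings."""
    return model.encode(texts, normalize_embeddings=True)

def top_k(query_vec, candidate_vecs, k):
    """Return indices of the k candidates most similar to the query (cosine similarity)."""
    scores = candidate_vecs @ query_vec
    return np.argsort(-scores)[:k]

def two_stage_retrieve(query, documents, k_docs=100, k_passages=5):
    query_vec = embed([query])[0]

    # Stage 1: document-level retrieval -- which items in the index best address this query?
    doc_vecs = embed([doc["text"] for doc in documents])
    doc_ids = top_k(query_vec, doc_vecs, k_docs)

    # Stage 2: passage-level retrieval -- which passages within those documents
    # are most relevant? These are the pieces passed to the generative model.
    passages = [p for i in doc_ids for p in documents[i]["text"].split("\n\n") if p.strip()]
    passage_vecs = embed(passages)
    passage_ids = top_k(query_vec, passage_vecs, k_passages)
    return [passages[i] for i in passage_ids]
```

The idea behind the two stages is that the document stage casts a wide net while the passage stage narrows it down, so only the most useful snippets reach the language model.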
Coveo offers RAG in two forms: the full package, with our Relevance Generative Answering as part of our industry-leading AI search platform, or via our Passage Retrieval API, which makes our advanced retrieval available on its own to enterprises.
Why Do Enterprises Need RAG?
Interest in retrieval systems has grown with the need to overcome the limitations of LLMs so they can be applied in real-life scenarios where accuracy and timeliness matter. Pre-trained language models run into the following issues:
- Difficulty in extending knowledge
- Outdated information
- Lack of sources
- Tendency to hallucinate
- Risk of leaking private, sensitive data
RAG attempts to address these challenges when working with language models. Let’s next look at how RAG accomplishes this.
How Does RAG Work?
Typically, a pre-trained language model takes a user prompt — or query — and generates a response based on what the model knows from its training data. The model draws from its parametric memory, which is a representation of information that’s already stored internally in its neural network.
With RAG, a pre-trained model now has access to external knowledge that provides the basis for factual and up-to-date information. The retrieval system first identifies and retrieves from external sources the most relevant pieces of text based on the user’s query.
Techniques such as word embeddings, vector search, and other machine learning models assist in finding the most relevant information for the user's query in the current user's context. In an enterprise setting, external sources may be knowledge bases with documents on specific products or procedures, or an internal website for employees such as an intranet.
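As a rough illustration of that retrieval step, the sketch below embeds a small knowledge base and runs a vector search for the passages closest to a user's query. The embedding model and the FAISS index are assumptions chosen for the example, not a prescribed stack.

```python
# Illustrative only: embed a small knowledge base and search it by vector similarity.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

knowledge_base = [
    "Refunds are processed within 5 business days of approval.",
    "Premium support is available 24/7 for enterprise customers.",
    "Password resets can be requested from the account settings page.",
]

# Build the vector index once; it can be refreshed whenever documents change.
embeddings = model.encode(knowledge_base, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on normalized vectors
index.add(embeddings)

# Retrieve the passages most relevant to the user's query.
query = "How long does a refund take?"
query_vec = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(query_vec, k=2)
relevant_passages = [knowledge_base[i] for i in ids[0]]
print(relevant_passages)
```

Because the index lives outside the model, it can be updated as the underlying documents change, without touching the model itself.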
When a user submits a question, RAG builds on the prompt using relevant text chunks from external sources that contain recent knowledge. The "augmented" prompt can also include instructions, or guardrails, for the model. These guardrails might include: don't make up answers (hallucinate), or limit responses to only the information found in the approved, trusted sources. Adding contextual information to the prompt means the LLM can generate responses that are accurate and relevant to the user.
Next, the model uses the retrieved information to generate the best answer to the user's query in human-like text. In the generated response, the LLM can provide source citations so the user can verify the answer and check it for accuracy, because the LLM is "grounded" in identifiable information retrieved by the retrieval system.
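Putting the augmentation and generation steps together, here is a hedged sketch of how retrieved passages might be packed into the prompt alongside guardrail instructions, with the model asked to cite its sources. The OpenAI client and model name are just one possible choice, and the prompt wording is purely illustrative.

```python
# Illustrative sketch: build an augmented prompt from retrieved passages and
# ask an LLM to answer with citations. The model and API choice are examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_rag(query: str, passages: list[str]) -> str:
    # Number the retrieved passages so the model can cite them as sources.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))

    # Guardrails: stay within the retrieved sources and admit when the
    # answer is not there, rather than making one up.
    augmented_prompt = (
        "Answer the question using ONLY the sources below. "
        "Cite sources by number, e.g. [1]. "
        "If the answer is not in the sources, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": augmented_prompt}],
    )
    return response.choices[0].message.content

# Example usage with a passage retrieved in the previous step.
passages = ["Refunds are processed within 5 business days of approval."]
print(answer_with_rag("How long does a refund take?", passages))
```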
Benefits of Using RAG With LLMs
Retrieval augmented generation brings several benefits to enterprises looking to employ generative models while addressing many of the challenges of large language models. They include:
- Up-to-date information. A retrieval-based approach ensures models have access to the most current, reliable facts.
- Control over knowledge sources. RAG also gives enterprises greater control over the knowledge LLMs use to generate their answers. Rather than relying on the vast general knowledge from their training data, you can have LLMs generate responses from vetted external sources.
- Verification of accuracy. Users have insight into the model’s sources to cross-reference and check for accuracy.
- Lower likelihood of hallucinations and privacy violations. Because the LLM is grounded in factual knowledge the user can access, the model depends less on pulling information from its parameters. As a result, there is a lower chance of hallucinations, leaks of sensitive data, and misleading information.
- Dynamic knowledge updates. RAG eliminates the need to retrain a model on new data and update its parameters. Instead, the external source is kept up to date so that the retrieval system can search and provide the LLM with current and relevant information to generate responses.
- Cost savings. Without the need to depend on retraining parameters, which can be time-intensive and costly, RAG can potentially lower the computational and financial costs of running LLM-powered chatbots in an enterprise setting.
How Is RAG Different From Fine Tuning?
RAG and fine tuning have emerged as key ways to extend an LLM’s capabilities and knowledge beyond its initial training data. They aim to add domain-specific knowledge to a pre-trained language model, but they differ in the way they introduce and implement new knowledge.
Fine tuning takes a pre-trained language model and trains it further on additional data for a particular downstream NLP task. This approach involves adjusting the model's internal parameters for a specialized body of knowledge, such as industry-specific use cases.
In contrast, RAG uses a retrieval mechanism to combine nonparametric data from secondary knowledge sources with the model's existing parametric memory, enhancing the pre-trained language model's generated responses.
Whether an organization chooses to fine tune its model or employ RAG depends on what it’s trying to accomplish – and what it’s trying to avoid.
Fine tuning works best when the main goal is to get the language model to perform a specific task, such as an analysis of customer sentiment on certain social media platforms. In these cases, extensive external knowledge or integration with such a knowledge repository are not necessary for success.
However, fine tuning depends on static data, which is not ideal for use cases that rely on data sources that constantly change or need updating. Retraining an LLM like ChatGPT on new data with each update is not feasible from the standpoint of cost, time, and resources.
RAG is the better choice for tasks that are knowledge-intensive and stand to benefit from retrieving up-to-date information from external sources, such as real-time stock data or customer behavior data. These tasks may involve answering open-ended, ambiguous questions or synthesizing complex, detailed information to generate a summary.
Further, there is a security risk associated with fine tuning. As you feed the model data, you are potentially exposing private or confidential information. Grounding with RAG helps eliminate this risk.
[Note: Coveo Relevance Generative Answering uses its unified search platform to ground responses by retrieving relevant content from trusted sources that a user is allowed to see. Access controls protect proprietary and private information. Only approved documents are fed to the LLM to generate the answer.]
Applications of Retrieval Augmented Generation
RAG has the potential to greatly enhance the quality and usability of LLM technologies in the enterprise space. Some of the ways businesses can use RAG include:
- Search: By combining search with a retrieval-based LLM, the search index first retrieves documents that are relevant to the user's query before a response is generated. With this approach, the generative model can provide a high-quality, up-to-date response with citations, and it should significantly reduce instances of hallucinations.
- Chatbots: Incorporating RAG with chatbots can lead to richer, context-aware conversations that engage customers and employees while satisfying their queries.
- Content generation: RAG can help businesses create content, in areas such as marketing and human resources, that is accurate and helpful to target audiences. Writers can gain assistance in retrieving the most relevant documents, research, and reports.
What’s Next with Coveo, RAG and Generative AI
At Coveo, we believe in the exciting opportunities large language models bring to enterprises. We also see the potential in RAG as a key approach to improving human-computer interactions using NLP. Our Coveo Relevance Generative Answering capability takes LLM technologies and combines them with the Coveo Search Platform, a secure, AI-powered semantic search and retrieval system, to provide what we think is a pretty powerful search experience.
“We anticipate that demand for Generative AI question answering experiences will become ubiquitous in every digital experience,” said Laurent Simoneau, President, CTO and Co-Founder of Coveo. “In the enterprise, we believe that search and generative question-answering need to be integrated, coherent, based on current sources of truth with compliance for security and privacy.”
Learn more by visiting our Coveo Relevance Generative Answering page!