What is Retrieval Augmented Generation? A Visual Guide On RAGs in the Context of LLMs
Without Fine-tuning, Integrate Custom Information and External Data Sources To Give LLMs Relevant Context To Hallucinate Less And Be More Accurate
In the rapidly evolving field of artificial intelligence, large language models (LLMs) such as GPT-4, Gemini, Claude, and Llama have become household names.
Making an LLM perform well on your custom use case can be a costly affair. Pre-training GPT-4 reportedly required tens of terabytes of data, on the order of $100 million, and hundreds of cutting-edge GPUs.
You could always fine-tune a pre-trained model on your data, using LoRA to reduce the number of tuned parameters to less than 1% of the total model parameters!
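To make that fine-tuning route concrete, here is a minimal sketch using Hugging Face's peft library. The base model ("gpt2") and the LoRA hyperparameters are illustrative assumptions, not recommendations; the point is only to show how small the trainable fraction becomes.

```python
# Minimal LoRA fine-tuning setup sketch using Hugging Face transformers + peft.
# The base model and hyperparameters below are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model

lora_config = LoraConfig(
    r=8,                        # low-rank dimension of the adapter matrices
    lora_alpha=16,              # scaling factor for the adapter updates
    target_modules=["c_attn"],  # which modules get adapters (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```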
But what if you do not have access to a high-quality fine-tuning dataset or the necessary compute? Retrieval augmented generation, or RAG, can help you solve this problem.
Without the need to re-train or fine-tune, Retrieval Augmented Generation (RAG) elevates the capabilities of LLMs by integrating them with external knowledge sources such as databases, file systems, or knowledge graphs. This article goes over (the technicalities of) RAG and why you should care about it.
Would you like to read other related pieces?
You can read more in the Transformers series and the LLMs series.
1. The Rise of Retrieval Augmented Generation
RAG emerged as a solution to one of the fundamental challenges faced by LLMs: their dependence on static, pre-existing knowledge. While these models are trained on vast amounts of data, their knowledge is limited to what was available during their training phase.
If you have used ChatGPT or other LLMs, you might have noticed a knowledge cut-off date, i.e., the model is pre-trained on “internet data” only up until that date. So if you query for information on recent events that the model does not know about, you might not get the response you seek.
This is where RAG steps in, enabling models to access and retrieve information from external sources, thus enhancing their ability to provide “more” accurate and up-to-date responses.
2. Why Should You Care About RAG: Overcoming the limitations of LLMs
RAG addresses several limitations of LLMs:
Integrate Sources of Information to Generate Responses
LLMs, despite their impressive capabilities, are limited to the knowledge they were trained on. RAG allows these models to access external databases, documents, and APIs, integrating up-to-date information into their responses. This integration ensures that the responses are not only accurate but also rich in context and detail.
External sources of information provide the LLM more relevant context. This context helps the LLM respond correctly to incoming queries.
Imagine integrating an LLM to answer questions from staff in a medical setting such as a hospital. Out of the box, the LLM has no knowledge of your patients, health records, procedures, etc.
Remedy Out-of-Date Pre-Trained Models
One significant challenge with pre-trained models is that they can become outdated quickly as new information emerges. RAG mitigates this issue by allowing the model to retrieve the latest data during its operation. This capability is particularly valuable in fields where information is constantly evolving, such as technology, medicine, and current events.
Model Grounding and Trustworthy LLM Responses
RAG enhances the grounding of models, ensuring that the generated content is anchored in verifiable and relevant data. This grounding improves the trustworthiness and reliability of the model's outputs, as the information is drawn from credible and up-to-date sources.
Enhanced Contextual Understanding
By retrieving external information, RAG improves the model's understanding of the context, enabling it to generate more relevant and nuanced responses. This contextual enhancement is crucial for complex queries that require detailed and specific knowledge.
Enhancing Accuracy and Relevance
By leveraging external data sources, RAG significantly improves the accuracy and relevance of the generated content. For instance, if a user asks about the latest research in quantum computing, a RAG-based system can retrieve the most recent papers and integrate their findings into the response. This not only ensures the information is current but also enhances the depth and quality of the generated text.
3. An Intuitive Explanation of RAG: Closed-Book vs. Open-Book
RAG can be viewed as a hybrid system that brings together the strengths of retrieval-based models and generative models.
Imagine you are preparing for an important exam. There are two different exam formats you could face: an open-book exam, where you can bring and refer to your textbooks, and a closed-book exam, where you rely solely on what you have memorized.
RAG is like the open-book exam where you have access to the relevant context when responding to questions in the exam.
The Closed-Book Approach: Traditional LLMs
Traditional large language models (LLMs) operate like students taking a closed-book exam. These models are trained on vast amounts of data, learning patterns, facts, and language structures during their training phase. When they generate text, they rely solely on this internalized knowledge. This method allows for fast and fluent text generation, but it has some limitations:
Outdated Information: The knowledge base of the model is fixed at the time of training. Any information emerging after this period is not available to the model.
Knowledge Gaps: No matter how extensive the training data is, there will always be gaps in the information that the model has internalized.
Static Understanding: The model cannot adapt to new information or changes in the real world.
The Open-Book Approach: RAG
On the other hand, the retrieval mechanism in RAG functions like a student in an open-book exam. When faced with a question, the student can look up the latest information in textbooks or online resources, ensuring their answer is accurate and up-to-date. This approach has its own strengths:
Up-to-date Information: By accessing external sources, the model can provide information that reflects the most current knowledge.
Comprehensive Coverage: It can retrieve specific details from vast databases, filling in gaps that the closed-book memory might have.
Domain-Specific Context: External sources can include niche, domain-specific information that the model has likely never seen during pre-training or fine-tuning. With RAG, it can still generate relevant responses thanks to the retrieved context.
4. The Mechanics of RAG
To understand RAG, it’s essential to break down its components and how they work together:
Retrieval Mechanism
The retrieval component is responsible for accessing external knowledge sources. When a query is received, this mechanism searches relevant databases, documents, or APIs to find the most pertinent information. The retrieved data is then fed into the generation mechanism.
Retrieving relevant information is key. Big tech companies like Apple and Netflix have used embeddings to solve similar retrieval problems in their products.
If you don’t have time to read how they use them, here is the gist:
Retrieval is usually done by embedding (1) the external information sources and (2) the incoming query to the LLM. The similarity between the query embedding and the external information embeddings is computed, and the most similar hits are passed on to the LLM as relevant context.
This can be implemented using a vector DB such as Chroma, Pinecone, Qdrant, or other popular options.
If you are not a fan of embeddings, you could use any other mechanism that is able to fetch the “relevant context”. For example, one could be a little more “exotic” and send the query to a search engine, fetch the top 3 web pages from the search, and send them to the LLM as “relevant context”.
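To make the embedding route concrete, here is a minimal sketch of retrieval using the sentence-transformers library and plain cosine similarity in memory. The embedding model, example documents, and top-k value are illustrative assumptions; a real system would typically swap the in-memory search for one of the vector DBs mentioned above.

```python
# Minimal embedding-based retrieval sketch: embed the documents and the query,
# rank documents by cosine similarity, and return the top-k as context.
# The embedding model and example documents are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, general-purpose embedder

documents = [
    "Patient discharge procedure: patients are discharged after a final check-up.",
    "Visiting hours are 9am to 5pm on weekdays.",
    "The cafeteria is located on the ground floor.",
]

doc_embeddings = model.encode(documents, normalize_embeddings=True)

def retrieve_top_k(query: str, k: int = 3) -> list[str]:
    """Return the k documents most similar to the query."""
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    # With normalized embeddings, the dot product equals cosine similarity.
    similarities = doc_embeddings @ query_embedding
    top_indices = np.argsort(similarities)[::-1][:k]
    return [documents[i] for i in top_indices]

print(retrieve_top_k("When can I visit a patient?", k=2))
```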
Generation Mechanism
The generation mechanism, typically an LLM, takes the retrieved information and incorporates it into the generated text. This process involves aligning the retrieved data with the query context to produce coherent and contextually appropriate responses.
The LLM generates text “keeping in mind” the context that was the output of the retrieval mechanism. Because this context is the most relevant to the query made to the LLM, we hope it contains the critical information needed to answer the query correctly.
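As a rough sketch, the generation step can be as simple as placing the retrieved chunks into the prompt and instructing the model to answer only from that context. The OpenAI client and model name below are illustrative assumptions; any chat-style LLM would work the same way.

```python
# Minimal generation sketch: put the retrieved context into the prompt and
# ask the LLM to answer the query using only that context.
# The OpenAI client and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_answer(query: str, context_chunks: list[str]) -> str:
    """Generate a response grounded in the retrieved context."""
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Telling the model to admit when the context is insufficient is a simple prompt-level way to reduce hallucinations.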
Putting Together Retrieval and Generation
Finally, with the retrieval and generation mechanisms ready to go, we need some integration glue to put them together into an end-to-end pipeline that ingests user queries and outputs the LLM's response.
Inside this end-to-end pipeline, the relevant context is retrieved and used to augment the LLM's response generation.
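The glue itself is small. Here is a sketch that wires together the hypothetical retrieve_top_k and generate_answer functions from the two snippets above:

```python
# End-to-end RAG pipeline glue, reusing the hypothetical retrieve_top_k and
# generate_answer functions sketched in the previous two snippets.

def rag_pipeline(query: str, k: int = 3) -> str:
    """Retrieve relevant context for the query, then generate a grounded answer."""
    context_chunks = retrieve_top_k(query, k=k)    # retrieval step
    return generate_answer(query, context_chunks)  # augmented generation step

print(rag_pipeline("When can I visit a patient?"))
```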
5. The Future of RAG
RAG represents a significant leap forward in the capabilities of LLMs, enabling them to provide more accurate, relevant, and timely information. As AI continues to advance, RAG will likely become a standard component in the architecture of next-generation models, enhancing their ability to interact with and leverage vast external knowledge sources.
Expanding Use Cases
With the integration of RAG, the scope of applications for LLMs expands significantly. Fields such as healthcare, finance, and law can benefit from more precise and updated information retrieval, leading to better decision-making and outcomes.
Improved User Experiences
For end-users, RAG translates to more reliable and informative interactions with AI systems. Whether it’s getting advice on complex topics or staying updated with the latest news, RAG ensures that users receive the most relevant and current information available.
Continuous Learning and Adaptation Without Re-training
One of the most promising aspects of RAG is its potential for continuous learning and adaptation. As external data sources evolve, RAG systems can dynamically update their knowledge base, maintaining their relevance and accuracy over time.
6. Outro
By integrating retrieval with generation, RAG unlocks new possibilities, making LLMs more powerful, versatile, and effective in solving complex real-world problems.
With RAG, LLMs can access and utilize real-time information, providing better accuracy and relevance in their responses. To get LLMs to “integrate” custom data when generating responses, you do not need to spend millions on pre-training or fine-tuning. For a large fraction of end-users, setting up a RAG pipeline goes a long way toward generating relevant results.
Further Reading
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks: https://arxiv.org/abs/2005.11401
Retrieval Augmented Generation: Streamlining the creation of intelligent natural language processing models: https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/
I would also recommend the following paper to anyone looking for more info: https://arxiv.org/abs/2312.10997