
Contextual Retrieval: a powerful RAG technique that your wallet will like

With the recent launch of Contextual Retrieval from Anthropic, which has shown terrific results for RAG, it's interesting to explain how it works and why it can be cost-effective if used correctly. To do this, we will explain the basics of RAG, perform the cost calculation, and even discuss its limitations.
As always, we'll try to keep things as simple as possible, so let's get started!

RAG basics: Chunking, Vector Store and Retriever

Chunking

Chunking involves slicing large text data into manageable pieces, or "chunks." Instead of dealing with entire documents, which would dilute search results, chunking ensures that smaller, contextually relevant parts of the text are used. There are multiple chunking techniques; the simplest are chunking by lines or paragraphs and fixed-length chunking.
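
To make this concrete, here is a minimal sketch of fixed-length chunking in Python (character-based, with arbitrary size and overlap values chosen for illustration):

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Slice text into fixed-length, slightly overlapping chunks (measured in characters)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

document = "..."  # your full document text
chunks = chunk_text(document)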

Embeddings

Embeddings convert each chunk into a fixed-length vector, a numerical representation that captures the semantic meaning of the text. Using embedding models (such as text-embedding-3-small from OpenAI or voyage-3 from Voyage AI), embeddings encode the context, relationships, and nuances of the content. To use a metaphor, imagine that we took all the topics in the world and regrouped them into a fixed number of columns (1,536 for text-embedding-3-small, for example). We could then measure any text against these columns, giving a value for how much our text talks about each topic group. The resulting line of values would be an embedding.
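
As an illustration, here is roughly how you could embed the chunks from the previous sketch with text-embedding-3-small (a sketch assuming the official openai Python client and an OPENAI_API_KEY environment variable):

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One API call can embed several chunks at once.
response = openai_client.embeddings.create(model="text-embedding-3-small", input=chunks)
embeddings = [item.embedding for item in response.data]  # one fixed-length vector per chunk
print(len(embeddings[0]))  # 1536 dimensions for text-embedding-3-small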

Vector Store

Once the chunks are transformed into embeddings, we need a way to store these vectors efficiently. This is the role of the vector database: a specialized repository designed to store vectors (and their corresponding text chunks) and to search through them using a technique called similarity search. Similarity search finds the vectors most similar to a given one using a mathematical function called cosine similarity (when you compute the cosine similarity of two vectors, you measure how similar their orientation is).
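
Cosine similarity itself is only a few lines of code; here is a small sketch with NumPy:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means the two vectors point in the same direction, values near 0 mean they are unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))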

Retriever

The retriever is the component tasked with fetching all the necessary chunks for a given input. The input is first embedded, then a similarity search is run between the input and all the chunks in the vector store, and finally the k most similar chunks are returned. All these chunks, along with the question, are then sent to an LLM to generate an answer. This is how RAG works.
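
Putting the pieces together, a toy retriever could look like the sketch below (it reuses the hypothetical openai_client, cosine_similarity, chunks and embeddings names from the previous sketches; a real system would query a vector database instead of looping over a list):

import numpy as np

def retrieve(question: str, chunks: list[str], embeddings: list[list[float]], k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the question's embedding."""
    response = openai_client.embeddings.create(model="text-embedding-3-small", input=[question])
    question_embedding = np.array(response.data[0].embedding)
    scores = [cosine_similarity(question_embedding, np.array(e)) for e in embeddings]
    best = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in best]

top_chunks = retrieve("How much did the forest area decrease?", chunks, embeddings)
# top_chunks, along with the question, are then sent to an LLM to generate the answer.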

The pitfall of RAG

One problem with the approach described above is that chunking a document makes each chunk lose its context. Imagine that we take a document describing any topic. If we chunk this document and hand a random chunk to someone, there is a good chance they will not understand it, because they only see one of the many chunks. This leads to problems such as hallucinations (the LLM inventing things that are not in the text or are completely false) or simply failing to find the most relevant chunks.
There are many techniques that try to offset this problem. One example is a standard RAG system that uses both embeddings and Best Match (BM25) keyword search to retrieve information.

Now let's explain the technique created by Anthropic, which has shown really nice results: contextual retrieval.

Contextual retrieval

The goal of contextual retrieval is to add, for each chunk, an explanation of its content in relation to the whole document. The whole point is to make each chunk as independent as possible from the others, so that when it is retrieved, it carries both the chunk data itself and the full context needed to understand it.

Let’s take an example to better understand it. Let’s imagine that we are parsing a document about deforestation in the Amazon Rainforest.
Here’s the content of the original chunk and the augmented chunk:

original_chunk: """
The forest area in the region decreased by 2% compared to the previous year.
"""

augmented_chunk:
"""
<chunk_context>
This statement is from a 2023 environmental report by the World Wildlife Fund on deforestation in the Amazon rainforest; in 2022, the forest area was estimated to be 5.5 million square kilometers.
</chunk_context>

The forest area in the region decreased by 2% compared to the previous year.
"""

As you can see, it is much easier to understand what is happening with the augmented chunk than with the original one. Questions about the Amazon rainforest would be much more likely to retrieve this chunk.

How to implement Contextual Retrieval

So how do we implement contextual retrieval? Anthropic proposes passing the whole document plus each chunk to an LLM and asking it for a short context that situates the chunk within the document. Here's the example prompt for that:

<document> 
{{WHOLE_DOCUMENT}} 
</document> 
Here is the chunk we want to situate within the whole document 
<chunk> 
{{CHUNK_CONTENT}} 
</chunk> 
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.

As you can see, there are some problems here:

  1. You need an LLM that can accept an input big enough for your whole document. With Claude's 200K-token context window, that covers documents of roughly 500 pages, so most use cases are fine.
  2. The context needs to be relevant enough to be useful. Anthropic's models are pretty good at understanding the whole context they are given, even when it is very big, but if the document is particularly complex or full of specialized vocabulary, this becomes a problem.
  3. It needs to be cost-effective enough to run this query for ALL your chunks. Anthropic proposes using their new feature, Prompt Caching, which can save up to 90% on input costs. Don't hesitate to check my post explaining how Prompt Caching works; it will help you understand how to use it. A minimal implementation sketch follows this list.
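
Here is a minimal sketch of that loop with the anthropic Python SDK and Prompt Caching, assuming an ANTHROPIC_API_KEY environment variable and the document and chunks variables from the earlier sketches (depending on your SDK version, Prompt Caching may still require a beta header):

import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

INSTRUCTION = """Here is the chunk we want to situate within the whole document
<chunk>
{chunk_content}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

def augment_chunk(document: str, chunk: str) -> str:
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        # The document prefix is identical for every chunk, so it is written
                        # to the cache on the first call and read from the cache afterwards.
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        "cache_control": {"type": "ephemeral"},
                    },
                    {"type": "text", "text": INSTRUCTION.format(chunk_content=chunk)},
                ],
            }
        ],
    )
    context = response.content[0].text
    return f"<chunk_context>\n{context}\n</chunk_context>\n\n{chunk}"

augmented_chunks = [augment_chunk(document, c) for c in chunks]

The augmented chunks are then embedded and stored exactly as before; only the ingestion step changes.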

What is the cost of Contextual Retrieval

Let's do the math to estimate how much processing a file costs using this technique.

Reminder (Claude 3 Haiku pricing, per million tokens):
    Input: $0.25
    Cache write: $0.30
    Cache read: $0.03
    Output: $1.25

Suppositions:
    A 500-page PDF of about 200,000 tokens
    Chunks of 1,000 tokens each
    About 100 output tokens of generated context per chunk

Here's the full calculation:

Number of Chunks
    Total tokens: 200,000
    Tokens per chunk: 1,000
    Number of chunks = 200,000 / 1,000 = 200

Initial Query Cost (Cache Write)
    Cache write cost (whole document written to the cache once) = 200,000 * 0.30 / 1,000,000 = $0.06

Cost for Subsequent Queries (Cache Hits)
    Cache hit cost per query = 200,000 * 0.03 / 1,000,000 = $0.006
    199 subsequent queries = 199 * $0.006 = $1.194

Chunk Inputs and Generated Contexts (across all 200 queries)
    Chunk input cost (200 chunks * 1,000 uncached tokens) = 200,000 * 0.25 / 1,000,000 = $0.05
    Output cost (200 chunks * 100 tokens = 20,000) = 20,000 * 1.25 / 1,000,000 = $0.025

Total Cost
    Total cost = $0.06 + $1.194 + $0.05 + $0.025 ≈ $1.33
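
If you want to plug in your own numbers, the same arithmetic fits in a few lines of Python (prices per million tokens, as assumed above):

INPUT, CACHE_WRITE, CACHE_READ, OUTPUT = 0.25, 0.30, 0.03, 1.25  # $/million tokens

doc_tokens = 200_000
chunk_tokens = 1_000
context_tokens = 100                    # generated context per chunk
n_chunks = doc_tokens // chunk_tokens   # 200

cache_write = doc_tokens * CACHE_WRITE / 1e6                  # first query writes the cache
cache_reads = (n_chunks - 1) * doc_tokens * CACHE_READ / 1e6  # 199 queries read it back
chunk_inputs = n_chunks * chunk_tokens * INPUT / 1e6          # each chunk is sent uncached
outputs = n_chunks * context_tokens * OUTPUT / 1e6            # 100 tokens generated per chunk

print(f"${cache_write + cache_reads + chunk_inputs + outputs:.2f}")  # ≈ $1.33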

To summarize, processing a 500-page PDF with contextual retrieval costs only about $1.33. This is really impressive.
Considering the performance the RAG system gains, this is a very low cost. Even better, the solution is simple and easy to integrate into any existing RAG system.
But there are limits, of course. Let's look at them right now!

Limits of Contextual Retrieval

The concept is really impressive, but to use it well, you need to understand its limits:

  1. The whole document has to fit in the LLM's context window, and the generated context is only as good as the model's understanding of the document: very complex or highly specialized documents remain difficult.
  2. You need an LLM that supports Prompt Caching (and a big enough input size); without caching, sending the full document once per chunk quickly becomes too expensive.
  3. Every chunk requires an extra LLM call, so ingesting or updating a document takes more time and money than plain chunking.

Conclusion

In conclusion, Contextual Retrieval is a very powerful technique that makes RAG even more accurate by generating a personalized context for each chunk using LLM calls, making chunks more understandable and easier to find. Cost-wise, by using Prompt Caching, we can drastically lower the total cost, making it completely reasonable even with a lot of files. The technique is very simple to integrate, and the only real requirements are an LLM that supports Prompt Caching and a sufficiently large input size. This is really powerful and impressive!

Afterword

I hope you loved this post. Don't forget to check out my other posts, as I write a lot of cool posts on practical AI topics.
Follow me and please leave a comment.
