With the recent launch of Contextual Retrieval from Anthropic, which has shown terrific results for RAG, it is interesting to explain how it works and why it can make sense money-wise if used correctly.
To do that, we will explain the basics of RAG, do the cost calculation and even discuss the limits!
As always, we will try to keep things as simple as possible, so let's go!
RAG basics: Chunking, Vector Store and Retriever
Chunking
Chunking involves slicing large text data into manageable pieces, or "chunks." Instead of dealing with entire documents, which would dilute search results, chunking ensures that smaller, contextually relevant parts of the text are used. There are multiple chunking techniques, but the simplest are chunking by line or paragraph and fixed-length chunking.
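To make this concrete, here is a minimal sketch of fixed-length chunking in Python. It counts words rather than tokens, which is a simplification (in practice you would usually count tokens with the tokenizer of your embedding model), and the file name is just a placeholder.

def chunk_text(text: str, chunk_size: int = 300) -> list[str]:
    # Split the text into words and group them into fixed-length chunks.
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# "report.txt" is a hypothetical document.
chunks = chunk_text(open("report.txt").read(), chunk_size=300)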
Embeddings
Embeddings convert each chunk into a fixed-length vector, a numerical representation that captures the semantic meaning of the text. Using embedding models (text-embedding-3-small from OpenAI or voyage-3 from Voyage AI), embeddings encode the context, relationships, and nuances of the content. To use a metaphor, imagine that we took all the topics in the world and regrouped them into a fixed number of columns (for example 1536). We could measure any text against these columns, giving a value for how much our text talks about each topic group. The resulting line of values would be an embedding.
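Here is a minimal sketch of how you might embed the chunks from the previous step with OpenAI's Python SDK, assuming the OPENAI_API_KEY environment variable is set and using the model mentioned above.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # One API call can embed a whole batch of chunks at once.
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )
    # The API returns one embedding (a list of floats) per input chunk.
    return [item.embedding for item in response.data]

embeddings = embed_chunks(chunks)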
Vector Store
Once the chunks are transformed into embeddings, we need a way to store these vectors efficiently. This is the role of the vector database, a specialized repository designed to store vectors (and their corresponding text chunks) and to search through them with a technique called similarity search. It works by looking for the vectors most similar to a given one, using a mathematical function called cosine similarity (the cosine similarity of two vectors measures how close their orientations are).
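Cosine similarity itself is simple to compute; here is a small sketch using NumPy:

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (||a|| * ||b||)
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))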
Retriever
The retriever is the component in charge of fetching all the necessary chunks for a given input. The input given to the retriever is first embedded, then a similarity search is performed between the input and all the chunks of the vector store, and finally the k most similar chunks are returned. All these chunks, along with the question, are then sent to an LLM to generate an answer. This is how RAG works.
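Putting the previous sketches together, here is a minimal in-memory retriever. A real system would delegate this to a vector database, but the logic is the same.

def retrieve(question: str, chunks: list[str],
             embeddings: list[list[float]], k: int = 5) -> list[str]:
    # Embed the question, score every stored chunk, and return the k best ones.
    query_embedding = embed_chunks([question])[0]
    scores = [cosine_similarity(query_embedding, emb) for emb in embeddings]
    top_indices = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]

best_chunks = retrieve("How much did the forest area decrease?", chunks, embeddings)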
The pitfall of RAG
Now, one of the problems with what is described above is that when chunking, each chunk loses its context. Imagine that we take a document describing any topic. If we chunk this document and give a random chunk to someone, there is a good chance they will not understand it, because they only see one of the many chunks. This leads to problems like hallucinations (when the LLM invents things that are not in the text or are completely false), or simply failing to find the most relevant chunks.
There are a lot of techniques that try to offset this problem. Here are some examples:
- Overlapping chunks: This technique consists of creating chunks that overlap with each other, so that some context is carried over from one chunk to the next. For example, if the chunk size is 300 tokens, you might overlap by 100 tokens: the first 100 tokens of the second chunk are also the last 100 tokens of the first chunk. This helps avoid losing context that sits between two chunks (see the sketch after this list).
- Semantic chunks: This technique consists of dividing a document into chunks based on the meaning of the content. The goal is to create chunks that represent specific ideas or topics, preserving the semantic context within each segment. One way to implement it is to compute the embedding of each successive sentence and start a new chunk whenever the semantics change too much.
- Exact Match Search: This technique consists of adding a system that indexes the keywords of all the chunks (using TF-IDF or BM25, which find the most relevant words in a text). When retrieving chunks for RAG, the system fetches chunks from the vector store but also from this index (using the most important words of the input), then keeps the most relevant chunks using a fusion step. When both semantic search (using the vector store) and exact match search are combined, the system is called Hybrid Search.
- Hierarchical Summarization: The goal is to first summarize individual sections or chapters, then combine these summaries into a higher-level summary and integrate it into each chunk. This hierarchical approach helps distill contextual information at multiple granularities.
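As a quick illustration of the first technique, here is a sketch of overlapping chunking (again counting words instead of tokens for simplicity):

def chunk_text_with_overlap(text: str, chunk_size: int = 300, overlap: int = 100) -> list[str]:
    # Each new chunk starts (chunk_size - overlap) words after the previous one,
    # so consecutive chunks share `overlap` words of context.
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), step)
    ]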
Now let's explain the technique created by Anthropic, which has really nice results: Contextual Retrieval.
Contextual retrieval
The goal of contextual retrieval is to add, to each chunk, an explanation of its content in relation to the whole document. The whole point is to make each chunk as independent as possible from the others, so that when it is retrieved, it carries both its own data and the context needed to understand it.
Let’s take an example to better understand it. Let’s imagine that we are parsing a document about deforestation in the Amazon Rainforest.
Here’s the content of the original chunk and the augmented chunk:
original_chunk: """
The forest area in the region decreased by 2% compared to the previous year.
"""
augmented_chunk:
"""
<chunk_context>
This statement is from a 2023 environmental report by the World Wildlife Fund on deforestation in the Amazon rainforest; in 2022, the forest area was estimated to be 5.5 million square kilometers.
</chunk_context>
The forest area in the region decreased by 2% compared to the previous year.
"""
As you can see, it is much easier to understand what is happening with the augmented chunk than with the original one. Questions about the Amazon rainforest would be much more likely to retrieve this chunk.
How to implement Contextual Retrieval
So how do we implement contextual retrieval? Here's what Anthropic proposes:
- Use the Anthropic model Claude 3 Haiku (which is lightweight and cheap but can still accept 200k tokens of input)
- Push the whole content of the document into the prompt and ask it to generate the context of the chunk considering the whole document.
Here’s an example prompt for that:
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
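Here is a minimal sketch of that call with the Anthropic Python SDK, marking the document block for prompt caching. The model name and max_tokens value are reasonable choices rather than something prescribed by Anthropic, and depending on your SDK version, prompt caching may require an additional beta header.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONTEXT_PROMPT = (
    "Here is the chunk we want to situate within the whole document\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Please give a short succinct context to situate this chunk within the overall "
    "document for the purposes of improving search retrieval of the chunk. "
    "Answer only with the succinct context and nothing else."
)

def contextualize_chunk(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": [
                {
                    # The full document goes first and is marked as cacheable,
                    # so calls for subsequent chunks reuse it at the cache-read price.
                    "type": "text",
                    "text": f"<document>\n{document}\n</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    # Only this part changes between calls.
                    "type": "text",
                    "text": CONTEXT_PROMPT.format(chunk=chunk),
                },
            ],
        }],
    )
    return response.content[0].text

# `document` is the full text and `chunks` the list produced by the chunking step above.
augmented_chunks = [
    f"<chunk_context>\n{contextualize_chunk(document, c)}\n</chunk_context>\n{c}"
    for c in chunks
]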
As you can see, there are some problems with this approach:
- You need an LLM that can accept an input big enough for your whole document. With a 200k-token context window, that covers documents of roughly 500 pages, so most use cases are fine.
- The context needs to be relevant enough to be useful. Anthropic models are pretty good at understanding the whole given context, even when it is very big, but if the document is particularly complex or uses specialized vocabulary, this can become a problem.
- It needs to be cost-effective enough to perform this query for ALL your chunks. Anthropic proposes to use their new feature, Prompt Caching, which can save up to 90% on input costs. Don't hesitate to check my post explaining how Prompt Caching works; it will help you understand how to use it. Here's the link.
What is the cost of Contextual Retrieval
Let's do the math to estimate how much processing a file costs using this technique.
Reminder:
- The following is only an estimate to give an idea of the cost; depending on your case and your documents, the real price could vary greatly.
- Estimating the number of pages from a number of tokens is at best an approximation, so all of this should be taken with a grain of salt.
Suppositions:
- Input document:
- 500 pages
- 200,000 tokens (just enough to fit entirely in Claude 3 Haiku's context window)
- Only text, no images (simpler to calculate)
- Chunks of 1,000 tokens, giving 200 chunks in total
- Prompt output:
- LLM output: 100 tokens per chunk (just the context of the chunk)
- Processing:
- One query per chunk; we assume the whole document hits the cache in every query after the first.
- The first query writes to the cache, which has a different (higher) price than cache reads.
- Claude 3 Haiku pricing: $0.25 per million input tokens, $0.30 per million cache-write tokens, $0.03 per million cache-read tokens, $1.25 per million output tokens.
Here are the full calculations:
Number of Chunks
Total tokens: 200,000
Tokens per chunk: 1,000
Number of chunks = 200,000 / 1,000 = 200
Initial Query Cost (Cache Write)
Cache write (whole document) = 200,000 * 0.30 / 1,000,000 = $0.06 (the cache-write rate already replaces the base input rate for the cached tokens)
Regular input (the chunk itself) = 1,000 * 0.25 / 1,000,000 = $0.00025
Output (100 tokens of context) = 100 * 1.25 / 1,000,000 = $0.000125
Total initial cost ≈ $0.06
Cost of Each Subsequent Query (Cache Hit)
Cache read (whole document) = 200,000 * 0.03 / 1,000,000 = $0.006
Regular input (the chunk itself) = 1,000 * 0.25 / 1,000,000 = $0.00025
Output (100 tokens of context) = 100 * 1.25 / 1,000,000 = $0.000125
Total per query ≈ $0.0064
Total Cost
Initial query ≈ $0.06
199 subsequent queries ≈ 199 * $0.0064 ≈ $1.27
Total cost ≈ $0.06 + $1.27 ≈ $1.33
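If you want to play with the assumptions (chunk size, document length, pricing), here is a tiny script that reproduces the calculation above:

# Claude 3 Haiku pricing, in dollars per million tokens.
INPUT, CACHE_WRITE, CACHE_READ, OUTPUT = 0.25, 0.30, 0.03, 1.25

def contextual_retrieval_cost(doc_tokens=200_000, chunk_tokens=1_000, context_tokens=100):
    n_chunks = doc_tokens // chunk_tokens
    # Per-query cost that is independent of caching: the chunk itself plus the generated context.
    per_chunk_extra = (chunk_tokens * INPUT + context_tokens * OUTPUT) / 1_000_000
    first_query = doc_tokens * CACHE_WRITE / 1_000_000 + per_chunk_extra
    other_queries = (n_chunks - 1) * (doc_tokens * CACHE_READ / 1_000_000 + per_chunk_extra)
    return first_query + other_queries

print(f"${contextual_retrieval_cost():.2f}")  # ≈ $1.33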
To summarize, processing a 500-page PDF with contextual retrieval costs only around $1.30. This is really impressive.
Considering the performance the RAG system gains, this is a really low cost. Even better, the solution is very simple and easy to integrate into any existing RAG system.
But there are limits, of course. Let's look at them right now!
Limits of Contextual Retrieval
The concept is really impressive, but to use it well, you need to understand its limits:
- Prompt Caching is mandatory. Without prompt caching, the cost could be multiplied by up to 10. Currently, Prompt Caching is available only through the Anthropic, OpenAI and Gemini (Google) APIs, not through other providers like GCP or AWS. Other models and providers don't support it yet.
- Be sure to use Prompt Caching to its full potential. For Anthropic and Google, you need to use a specific tag (see here and here). For OpenAI, it is automatic, but the cost reduction is smaller (50% instead of Anthropic's 90%).
- Have documents that are small enough to fit inside the LLM context. If not, you will need some additional processing.
- Do you even need a RAG system? With gigantic input sizes, cheap costs and the ability to find even small pieces of information in long texts, putting the whole document in the prompt and querying it directly can make sense.
Conclusion
In conclusion, Contextual Retrieval is a very powerful technique that makes RAG even more accurate by giving each chunk a personalized context generated through LLM calls, making chunks more understandable and easier to find. Cost-wise, by using Prompt Caching, we can drastically lower the total cost, making it completely reasonable even with lots of files. The technique is very simple to integrate, and the only real requirements are an LLM that supports Prompt Caching and a sufficient input size. This is really powerful and impressive!
Afterword
I hope you really loved this post. Don't forget to check my other posts, as I write a lot of cool posts on practical AI topics.