
How Prompt Caching Works: A Deep Dive into Optimizing AI Efficiency

With the launch of prompt caching by Anthropic and OpenAI, and the cost savings it promises, it is worth explaining how it works and why it is not just classic caching.
To do that, we will need to dive into how an LLM handles its inputs and where the caching actually happens. We will keep things as simple as possible, so let's get to it!

Prompt caching is not classic caching

Classic Caching is used to store data or computations temporarily for quicker access when the same data is requested again. This approach typically deals with static outputs, such as webpage content, API responses, or database query results.

In classic caching, a unique key-value structure is used where a key (such as a URL or query) maps to a specific result. When the same key is requested later, the system retrieves the stored result instead of recomputing it, reducing the time and resources needed to respond.
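As a point of reference, a classic cache can be as simple as a dictionary keyed by the request. The sketch below is purely illustrative (the URL and response are made up):

```python
# Classic key-value caching in its simplest form: the full result is stored under a
# key and returned verbatim whenever the same key is requested again.
cache = {}

def fetch(url):
    if url in cache:                             # same key as before: return the stored result
        return cache[url]
    result = f"<expensive response for {url}>"   # stand-in for a slow API call or database query
    cache[url] = result
    return result

fetch("https://api.example.com/users")   # computed and stored
fetch("https://api.example.com/users")   # served straight from the cache
```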

Classic caching isn’t suitable for large language models (LLMs) because their outputs are dynamic and context-dependent. LLMs generate responses from both fixed context and variable user input, and even with the same input, outputs can vary. So simply caching an output doesn’t work when parts of the prompt change. Unlike static responses (e.g., API data), the expensive part for an LLM is the intermediate computation it performs on the input (such as the self-attention key-value pairs). Therefore, a more specialized mechanism, prompt caching, is needed to avoid repeating those computations.

So let’s explain how an LLM handles inputs and then see where prompt caching happens.

Tokenization – The Starting Point

Before any kind of caching or self-attention can happen, the input text (prompt) needs to be tokenized. In simple terms, tokenization splits the input text into smaller units called tokens. These tokens are usually words or subwords, depending on the tokenizer being used by the model. For example, the sentence “The cat is on the mat” might be broken down into tokens like ["The", "cat", "is", "on", "the", "mat"].

Each token is then represented as a vector of numbers through a process called embedding. These token embeddings are the inputs to the model, but before the model can process them, it uses a mechanism called self-attention.
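To make this concrete, here is a small tokenization sketch using the open-source tiktoken library. This is only for illustration: each model ships its own tokenizer, so the actual token boundaries and IDs will differ.

```python
# Tokenize a sentence with tiktoken (illustrative; real models use their own tokenizers).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("The cat is on the mat")
print(token_ids)                              # a list of integer token IDs
print([enc.decode([t]) for t in token_ids])   # the text piece behind each ID
```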

Self-Attention mechanism: how tokens communicate

For each token in the input, the model creates three vectors (through linear transformations of the input vector):

  1. Query (Q) – Represents the “question” that the token is asking, i.e., what am I looking for?
  2. Key (K) – Represents the “identity” of the token, i.e., what information do I contain?
  3. Value (V) – Contains the actual information or data that the token holds.

These vectors are computed for every token in the input sequence. For example, in the sentence “The cat is on the mat”, each word (token) will have its own set of Query, Key, and Value vectors.

The self-attention mechanism is a way for each token to “look” at the other tokens in the sequence and decide how much focus (or “attention”) to give them. Here’s how this works step by step:

  1. Each token’s Query vector is compared against the Key vectors of all tokens (via a dot product) to produce attention scores, which are scaled and passed through a softmax so they sum to 1.
  2. These attention weights are then used to compute a weighted average of the Value vectors, producing an updated representation of the token that blends in context from the rest of the sequence.
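The following is a minimal single-head self-attention sketch in NumPy. The dimensions, random weights, and values are assumptions for illustration only; a real model uses learned weights, many layers, and far larger dimensions.

```python
# Single-head scaled dot-product self-attention on a toy 6-token sequence.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 6, 8                        # 6 tokens ("The cat is on the mat"), toy embedding size
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))        # token embeddings

# Learned projections in a real model; random here for illustration
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)            # how much each token attends to the others
weights = softmax(scores)                      # attention weights, each row sums to 1
output = weights @ V                           # attention-weighted mixture of Value vectors
print(output.shape)                            # (6, 8): one updated vector per token
```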

Multi-Head Attention: multiple perspectives

Transformer models (like GPT and Claude) use multi-head attention, where the attention mechanism is run several times in parallel with different sets of weights (so with different results). This allows the model to capture different aspects of relationships between tokens. For instance, one attention head might focus on short-range relationships like “subject-verb,” while another might focus on longer-range relationships like “subject-object.”

Once each attention head computes its attention-weighted values, the results are concatenated and passed through a final linear transformation to produce the updated representation of each token. This output is then passed to the next layer of the model, where the same process repeats.
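Here is a toy multi-head version of the previous sketch. Again, the head count, dimensions, and random weights are assumptions for illustration; real transformers also add masking, residual connections, and layer normalization.

```python
# Toy multi-head attention: run attention per head, concatenate, project.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

seq_len, d_model, n_heads = 6, 8, 2
d_head = d_model // n_heads
rng = np.random.default_rng(1)
X = rng.normal(size=(seq_len, d_model))

heads = []
for _ in range(n_heads):
    # each head gets its own projection weights, so it captures a different "perspective"
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

# concatenate the heads and apply the final linear transformation
W_o = rng.normal(size=(d_model, d_model))
output = np.concatenate(heads, axis=-1) @ W_o
print(output.shape)   # (6, 8)
```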

When does prompt caching happen?

Prompt caching happens inside the self-attention mechanism: the Key and Value vectors generated for the prompt’s tokens are stored so they don’t have to be recomputed. Here’s how it fits into the process:

  1. First Request: When the model processes a prompt for the first time, it performs all the steps mentioned above—tokenization, creating Q-K-V vectors, and calculating attention scores. During this process, the key-value pairs for each token are stored in a cache if the prompt is likely to be reused.
  2. Subsequent Requests: When a prompt that starts with the same prefix is submitted again, the cached key-value pairs are retrieved. This means that the model doesn’t need to redo the Q-K-V calculations and attention steps for the cached portion of the input, saving computation time and resources.

What’s Cached? It’s not the original input text, but the key-value vectors produced during self-attention that are stored. This means that even if the end of the prompt changes, the model can still reuse the cached data for the unchanged prefix (because those tokens’ embeddings, and therefore their keys and values, are still the same).
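Below is a deliberately simplified sketch of the idea. Real inference engines cache K/V tensors per layer and per attention head and match on token prefixes; the single dictionary and stand-in embedding lookup here are assumptions made purely to illustrate the cache-hit/cache-miss flow.

```python
# Toy key-value cache for a reused prompt prefix.
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

embedding_table = {}   # stand-in for the model's embedding lookup
kv_cache = {}          # prompt prefix (tuple of tokens) -> (K, V)

def embed(tokens):
    for t in tokens:
        if t not in embedding_table:
            embedding_table[t] = rng.normal(size=d_model)
    return np.stack([embedding_table[t] for t in tokens])

def keys_values(tokens):
    prefix = tuple(tokens)
    if prefix in kv_cache:            # cache hit: reuse K and V, skip the projections
        return kv_cache[prefix]
    X = embed(tokens)
    K, V = X @ W_k, X @ W_v           # cache miss: compute K and V once...
    kv_cache[prefix] = (K, V)         # ...and store them for later requests
    return K, V

system_prompt = ["You", "are", "a", "helpful", "assistant"]
keys_values(system_prompt)   # first request: K/V computed and written to the cache
keys_values(system_prompt)   # later request with the same prefix: served from the cache
```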

You can find the paper most likely used for this feature here.

How Prompt Caching Works in OpenAI and Anthropic

Now that we understand how the attention mechanism works and where prompt caching fits in, let’s look at how OpenAI and Anthropic implement this feature.

OpenAI’s Prompt Caching:

OpenAI applies caching automatically, with no setup on your side, and gives a 50% discount on cached input tokens. Cached prefixes persist for roughly 5-10 minutes of inactivity and are evicted after an hour at most, which makes it a good default for general use.

Anthropic’s Prompt Caching:

Anthropic asks you to mark the cacheable parts of the prompt explicitly via the API. Cache reads are discounted by 90%, writes carry a 25% surcharge, and cached entries live for 5 minutes, refreshed each time they are read, which makes it best suited to large, mostly static context.

| Feature | OpenAI | Anthropic |
| --- | --- | --- |
| Configuration | Automatic (no user setup needed) | Manual configuration via API |
| Cost Savings | 50% discount on cache hits | 90% discount on cache reads |
| Cache Write Cost | Free | 25% surcharge |
| Cache Duration | 5-10 minutes, evicted after 1 hour | 5 minutes, refreshed on access |
| Best For | General use, low complexity | Large, static data |
OpenAI vs Anthropic Prompt Caching
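
For Anthropic, the sketch below shows the general shape of a call that marks a block as cacheable with the cache_control attribute in the Python SDK. The model name and the contract file are placeholders, and exact parameters, limits, and pricing may have changed since this post; OpenAI, by contrast, needs no code changes because sufficiently long repeated prefixes are cached automatically.

```python
# Sketch of Anthropic's explicit prompt caching via the Python SDK.
# The model name and contract.txt are placeholders; check the current Anthropic
# docs for up-to-date parameters and pricing.
import anthropic

client = anthropic.Anthropic()   # expects ANTHROPIC_API_KEY in the environment

LONG_CONTRACT_TEXT = open("contract.txt").read()   # large, static context worth caching

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a careful legal assistant."},
        {
            "type": "text",
            "text": LONG_CONTRACT_TEXT,
            "cache_control": {"type": "ephemeral"},   # mark this block as cacheable
        },
    ],
    messages=[{"role": "user", "content": "Summarize the termination clause."}],
)
print(response.content[0].text)
```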

Conclusion

By caching and reusing key-value pairs from the attention mechanism, LLMs like GPT and Claude can handle repetitive prompts with greater efficiency. This system cuts down computational load, reduces latency, and saves costs. For developers, prompt caching means faster and cheaper responses when dealing with long or repeated prompts.

Whether you use OpenAI’s simple automatic system or Anthropic’s customizable, cost-efficient solution, prompt caching is a key feature that optimizes LLM costs when used correctly.

Afterword

I hope you really loved this post. Don’t forget to check out my other posts, as I write a lot of cool posts on practical stuff in AI.
Follow me and please leave a comment.
