How Prompt Caching Works: A Deep Dive into Optimizing AI Efficiency


With the launch of prompt caching by Anthropic and OpenAI, and the cost savings it promises, it is worth explaining how it works and why it is not just classic caching.
To do that, we will need to dive into how an LLM handles its inputs and where the caching is done. We will try to keep things as simple as possible, so let's get to it!

Prompt caching is not classic caching

Classic caching is used to store data or computations temporarily for quicker access when the same data is requested again. This approach typically deals with static outputs, such as webpage content, API responses, or database query results.

In classic caching, a unique key-value structure is used where a key (such as a URL or query) maps to a specific result. When the same key is requested later, the system retrieves the stored result instead of recomputing it, reducing the time and resources needed to respond.
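To make the contrast concrete, here is a minimal sketch of classic caching in Python (the fetch_report function and its data are made up for illustration): an exact key maps to a finished output, and a cache hit skips the computation entirely.

```python
import hashlib

# A minimal sketch of classic caching: the full request is the key,
# and the stored value is the final result.
cache = {}

def fetch_report(query: str) -> str:
    """Stand-in for an expensive database query or API call."""
    return f"report for {query}"

def cached_fetch(query: str) -> str:
    key = hashlib.sha256(query.encode()).hexdigest()  # unique key for the exact input
    if key in cache:
        return cache[key]           # cache hit: return the stored output as-is
    result = fetch_report(query)    # cache miss: do the expensive work
    cache[key] = result
    return result

print(cached_fetch("sales 2023"))   # computed
print(cached_fetch("sales 2023"))   # served straight from the cache
```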

Classic caching isn’t suitable for large language models (LLMs) because their outputs are dynamic and context-dependent. LLMs generate responses from both fixed context and variable user input, and even with the same input, outputs can vary. Simply caching an output therefore doesn’t work when parts of the prompt change. Unlike static responses (e.g., API data), an LLM’s answer depends on expensive intermediate computations (like the self-attention key-value pairs). That is why a more specialized system, prompt caching, is needed to avoid repeating those computations.

So let’s explain how an LLM handles inputs and then see where prompt caching happens.

Tokenization – The Starting Point

Before any kind of caching or self-attention can happen, the input text (prompt) needs to be tokenized. In simple terms, tokenization splits the input text into smaller units called tokens. These tokens are usually words or subwords, depending on the tokenizer being used by the model. For example, the sentence “The cat is on the mat” might be broken down into tokens like ["The", "cat", "is", "on", "the", "mat"].
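As a quick illustration, here is what this looks like with OpenAI's open-source tiktoken tokenizer; the exact splits and IDs depend on the model's tokenizer, so treat the output as an example rather than a fixed answer.

```python
# Tokenization turns text into integer token IDs; the pieces depend on the tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # encoding used by several OpenAI models
token_ids = enc.encode("The cat is on the mat")
tokens = [enc.decode([tid]) for tid in token_ids]

print(token_ids)   # a short list of integer token IDs
print(tokens)      # the corresponding text pieces, e.g. ['The', ' cat', ' is', ' on', ' the', ' mat']
```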

Each token is then represented as a vector of numbers through a process called embedding. These token embeddings are the model’s actual inputs, and the first thing the model does with them is run a mechanism called self-attention.
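A toy sketch of that embedding lookup (the vocabulary, sizes, and values are made up; real models learn these tables and use far larger dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table (in a real model these are learned,
# with a vocabulary of ~100k tokens and vectors of thousands of dimensions).
vocab = {"The": 0, "cat": 1, "is": 2, "on": 3, "the": 4, "mat": 5}
d_model = 8
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["The", "cat", "is", "on", "the", "mat"]
token_ids = [vocab[t] for t in tokens]
embeddings = embedding_table[token_ids]   # one d_model-sized vector per token

print(embeddings.shape)  # (6, 8): six tokens, eight numbers each
```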

Self-Attention mechanism: how tokens communicate

For each token in the input, the model creates three vectors (through linear transformations of the input vector):

  1. Query (Q) – Represents the “question” that the token is asking, i.e., what am I looking for?
  2. Key (K) – Represents the “identity” of the token, i.e., what information do I contain?
  3. Value (V) – Contains the actual information or data that the token holds.

These vectors are computed for every token in the input sequence. For example, in the sentence “The cat is on the mat”, each word (token) will have its own set of Query, Key, and Value vectors.
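In code, these three vectors are simply three learned linear projections of the same token embeddings. A minimal numpy sketch with toy sizes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 8, 6                       # toy sizes
X = rng.normal(size=(seq_len, d_model))       # one embedding per token of "The cat is on the mat"

# Learned projection matrices (random here, trained in a real model).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = X @ W_q   # what each token is looking for
K = X @ W_k   # what each token contains
V = X @ W_v   # the information each token carries

print(Q.shape, K.shape, V.shape)  # each is (6, 8): one vector per token
```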

The self-attention mechanism is a way for each token to “look” at the other tokens in the sequence and decide how much focus (or “attention”) to give them. Here’s how this works step by step (a small numerical sketch in code follows the list):

  • Dot Product: For a given token (say, “cat”), the model calculates the dot product between the query vector of “cat” and the key vectors of every token in the sentence (including “cat” itself). Each dot product gives a score, which tells the model how much “cat” should pay attention to words like “is”, “on”, or “mat”.
  • Softmax and Attention Weights: The scores from the dot products are passed through a softmax function to normalize them into attention weights (values between 0 and 1). These weights indicate how much attention the token “cat” should give to each other token in the sentence.
  • Weighted Sum of Values: The attention weights are then used to compute a weighted sum of the value vectors for all tokens. This updated representation allows the model to enrich each token with information from other tokens. For example, the word “cat” might now include information from “mat” or “on”, giving it context.
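Putting the three steps together, here is a minimal numpy sketch of (scaled) dot-product attention with toy sizes and random numbers; the division by the square root of the dimension is a detail from the original Transformer paper that keeps the scores well-behaved:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                        # six tokens, toy dimension
X = rng.normal(size=(seq_len, d_model))        # token embeddings

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)            # step 1: dot products (scaled)
weights = softmax(scores, axis=-1)             # step 2: attention weights, each row sums to 1
output = weights @ V                           # step 3: weighted sum of the value vectors

print(weights.shape, output.shape)             # (6, 6) attention matrix, (6, 8) updated token representations
```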

Multi-Head Attention: multiple perspectives

Transformer models (like GPT and Claude) use multi-head attention, where the attention mechanism is run several times in parallel with different sets of weights (so with different results). This allows the model to capture different aspects of relationships between tokens. For instance, one attention head might focus on short-range relationships like “subject-verb,” while another might focus on longer-range relationships like “subject-object.”

Once each attention head computes its attention-weighted values, the results are merged (concatenated) and passed through a final linear transformation to produce the final, updated representation for each token. This output is then passed on to the next layers of the model, where the rest of the processing happens.
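Sketching the same idea with two heads (again toy sizes and random weights): each head runs the attention computation with its own, smaller projections, and the results are concatenated and mixed by one last linear layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))

head_outputs = []
for _ in range(n_heads):
    # Each head has its own (smaller) projection matrices, so it learns its own "perspective".
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
    head_outputs.append(weights @ V)

concat = np.concatenate(head_outputs, axis=-1)   # merge the heads back together
W_o = rng.normal(size=(d_model, d_model))        # final linear transformation
output = concat @ W_o

print(output.shape)  # (6, 8): one updated vector per token, as before
```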

When does prompt caching happen?

Prompt caching happens inside the self-attention mechanism, just before the attention scores are calculated: the key-value pairs generated for the tokens are what get stored. Here’s how it fits into the process:

  1. First Request: When the model processes a prompt for the first time, it performs all the steps mentioned above—tokenization, creating Q-K-V vectors, and calculating attention scores. During this process, the key-value pairs for each token are stored in a cache if the prompt is likely to be reused.
  2. Subsequent Requests: When a prompt that starts with the same prefix is submitted again, the cached key-value pairs for that prefix are retrieved. The model doesn’t need to redo the Q-K-V calculations and attention steps for the cached part of the input, saving computation time and resources.

What’s Cached? It’s not the original input text, but the key-value vectors produced during self-attention that are stored. This means that even if the end of the prompt changes, the model can still reuse the cached data for the unchanged prefix (the key-value vectors for those tokens are still the same).
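To make this concrete, here is a minimal, single-layer sketch of a key-value prefix cache in numpy. It ignores positional encodings and the deeper layers of a real model, and the embed helper and all sizes are made up, but it shows the essential trick: only the tokens that come after the cached prefix need new K and V projections.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

def embed(token_ids):
    # Stand-in for the embedding lookup: one fake d_model-sized vector per token ID.
    return np.stack([np.full(d_model, float(tid)) for tid in token_ids])

kv_cache = {}   # tuple of prefix token IDs -> (K, V) already computed for that prefix

def keys_values(token_ids):
    """Return K and V for the whole sequence, reusing a cached prefix when possible."""
    token_ids = tuple(token_ids)

    # Find the longest cached prefix that matches the start of this request.
    best = max((p for p in kv_cache if token_ids[:len(p)] == p), key=len, default=())

    if best:
        K, V = kv_cache[best]                      # cache hit: reuse the stored projections
        new_ids = token_ids[len(best):]
    else:
        K = V = np.empty((0, d_model))             # cache miss: start from scratch
        new_ids = token_ids

    if new_ids:                                    # only the uncached suffix is projected
        X_new = embed(new_ids)
        K = np.vstack([K, X_new @ W_k])
        V = np.vstack([V, X_new @ W_v])

    kv_cache[token_ids] = (K, V)
    return K, V

K1, V1 = keys_values([1, 2, 3, 4, 5])       # first request: all five tokens are projected
K2, V2 = keys_values([1, 2, 3, 4, 5, 6])    # same prefix plus one token: only token 6 is projected
print(K1.shape, K2.shape)                   # (5, 8) and (6, 8)
```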

You can find the paper most likely used for this feature here.

How Prompt Caching Works in OpenAI and Anthropic

Now that we understand how the attention mechanism works and where prompt caching fits in, let’s look at how OpenAI and Anthropic implement this feature.

OpenAI’s Prompt Caching:

  • Automatic caching: OpenAI’s models (like GPT-4o) cache parts of a prompt automatically once the input reaches 1,024 tokens. There’s no need for users to manually manage caching.
  • Discount on cache hits: If a cached portion of the prompt is reused, OpenAI applies a 50% discount on input tokens for that part, lowering costs.
  • Short-term retention: Cached data is typically kept for 5-10 minutes of inactivity, and every cache entry is evicted within an hour of its last use.
  • Ease of use: No configuration is required from the developer’s side, making it straightforward to use.
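Since caching is automatic, the main thing to do on the developer side is to place the long, stable part of the prompt first and then check the usage statistics. Here is a sketch with the official openai Python SDK (the cache-related field names reflect the API at the time of writing and may change):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Keep the long, static part of the prompt first so consecutive requests share a prefix.
# Caching only kicks in once the prompt reaches 1,024 tokens.
long_instructions = "You are a support assistant. <several thousand tokens of policies go here>"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_instructions},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)

# On a cache hit, the usage block reports how many prompt tokens were served from the cache.
print(response.usage.prompt_tokens)
print(response.usage.prompt_tokens_details.cached_tokens)
```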

Anthropic’s Prompt Caching:

  • Manual configuration: Anthropic offers more granular control, where users can explicitly define which sections of a prompt to cache using a cache_control parameter. This is useful for cases where specific large static data (like long documents or instructions) is reused.
  • Higher savings: Anthropic provides up to a 90% discount on cached input tokens, but there’s a 25% surcharge for writing new data to the cache.
  • 5-minute cache: Cached data is stored for 5 minutes and refreshed each time it’s accessed. This shorter cache time is designed to ensure that stale data isn’t used.
  • Best for large static data: Anthropic’s system is ideal for prompts where large, static content (e.g., system instructions or knowledge bases) is reused across multiple API calls.
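Here is a sketch of that manual configuration with the official anthropic Python SDK; the model name and document are placeholders, and details such as the minimum cacheable prompt length depend on the model and API version you are using.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

big_document = "<a long, static document or set of instructions goes here>"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": big_document,
            "cache_control": {"type": "ephemeral"},  # cache the prompt up to and including this block
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key points of the document."}],
)

# The usage block reports how many input tokens were written to and read from the cache.
print(response.usage)
```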
Feature          | OpenAI                             | Anthropic
Configuration    | Automatic (no user setup needed)   | Manual configuration via API
Cost Savings     | 50% discount on cache hits         | 90% discount on cache reads
Cache Write Cost | Free                               | 25% surcharge
Cache Duration   | 5-10 minutes, evicted after 1 hour | 5 minutes, refreshed on access
Best For         | General use, low complexity        | Large, static data

OpenAI vs Anthropic Prompt Caching

Conclusion

By caching and reusing key-value pairs from the attention mechanism, LLMs like GPT and Claude can handle repetitive prompts with greater efficiency. This system cuts down computational load, reduces latency, and saves costs. For developers, prompt caching means faster and cheaper responses when dealing with long or repeated prompts.

Whether you use OpenAI’s simple automatic system or Anthropic’s customizable, cost-efficient solution, prompt caching is a key feature that optimizes LLM costs when used correctly.
