Context caching overview

Context caching aims to reduce the cost and the latency of requests to Gemini that contain repeated content.

By default, Google automatically caches inputs for all Gemini models to reduce latency and accelerate responses for subsequent prompts.

For Gemini 2.5 Flash (minimum input token count of 1,024) and Gemini 2.5 Pro (minimum input token count of 2,048) models, the cached input tokens are charged at a 75% discount relative to standard input tokens when a cache hit occurs.
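
As a hypothetical illustration of how the discount plays out for a single request (the unit price and token counts below are invented for the example, not real pricing):

```python
# Hypothetical illustration of the 75% cache-hit discount.
# The unit price and token counts are placeholders, not real pricing.
standard_rate = 1.00   # cost per standard input token, in arbitrary units
cached_rate = 0.25     # 75% discount applied to cache-hit tokens

prompt_tokens = 10_000   # total input tokens in the request
cached_tokens = 8_000    # input tokens served from the cache

cost_with_cache = (prompt_tokens - cached_tokens) * standard_rate + cached_tokens * cached_rate
cost_without_cache = prompt_tokens * standard_rate

print(cost_with_cache)     # 4000.0 units
print(cost_without_cache)  # 10000.0 units
```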

You can view cache hit token information in the response's usage metadata field. To disable default caching, refer to Generative AI and data governance.
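
For example, a minimal sketch using the Google Gen AI SDK for Python (the project ID, region, and prompt are placeholders) might read that field like this:

```python
from google import genai

# Placeholder project ID and region; substitute your own values.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Your prompt text here.",
)

# usage_metadata reports how many of the prompt tokens were served from a cache.
usage = response.usage_metadata
print("Prompt tokens:", usage.prompt_token_count)
print("Cached tokens:", usage.cached_content_token_count)  # unset or 0 when there is no cache hit
```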

Through the Vertex AI API, you can create context caches and exercise more control over them by:

  • Using a context cache to reference its cached content in a prompt request.
  • Updating the expiration time of a context cache.
  • Deleting a context cache when you no longer need it.

You can also use the Vertex AI API to get information about a context cache.
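
A minimal sketch of creating a context cache and then fetching its metadata, assuming the Google Gen AI SDK for Python and placeholder project, bucket, and file names:

```python
from google import genai
from google.genai.types import Content, CreateCachedContentConfig, Part

# Placeholder project ID, region, and Cloud Storage URI; substitute your own.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# The cached content must meet the model's minimum input token count
# (1,024 tokens for Gemini 2.5 Flash, 2,048 for Gemini 2.5 Pro).
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=CreateCachedContentConfig(
        display_name="example-cache",
        system_instruction="You answer questions about the attached report.",
        contents=[
            Content(
                role="user",
                parts=[
                    Part.from_uri(
                        file_uri="gs://your-bucket/large-report.pdf",
                        mime_type="application/pdf",
                    )
                ],
            )
        ],
        ttl="3600s",  # keep the cache for one hour
    ),
)

# Get information about the context cache.
info = client.caches.get(name=cache.name)
print(info.name, info.expire_time)
```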

Note that caching requests made through the Vertex AI API charge input tokens at the same 75% discount relative to standard input tokens and provide assured cost savings. A storage charge also applies, based on how long the cached data is stored.

When to use context caching

Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by subsequent requests.

Cached context items, such as a large amount of text, an audio file, or a video file, can be used in prompt requests to the Gemini API to generate output. Requests that use the same cache in the prompt also include text unique to each prompt. For example, each prompt request in a chat conversation might include the same context cache that references a video, along with the unique text of each turn in the chat.
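
A sketch of this pattern, assuming the Google Gen AI SDK for Python, an existing context cache, and placeholder resource names and questions:

```python
from google import genai
from google.genai.types import GenerateContentConfig

# Placeholder project ID, region, and cache resource name; substitute your own.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")
cache_name = "projects/your-project-id/locations/us-central1/cachedContents/1234567890"

# Each request carries only its own short prompt; the large shared context
# (for example, a cached video or document) is referenced by the cache name.
for question in [
    "What happens in the first five minutes of the video?",
    "List the speakers who appear after the introduction.",
]:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=question,
        config=GenerateContentConfig(cached_content=cache_name),
    )
    print(response.text)
```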

Consider using context caching for use cases such as:

  • Chatbots with extensive system instructions
  • Repetitive analysis of lengthy video files
  • Recurring queries against large document sets
  • Frequent code repository analysis or bug fixing

Cost-efficiency through caching

Context caching is a paid feature designed to reduce overall operational costs. Billing is based on the following factors:

  • Cache token count: The number of input tokens cached, billed at a reduced rate when included in subsequent prompts.
  • Storage duration: The amount of time cached tokens are stored, billed hourly. The cached tokens are deleted when a context cache expires; a sketch of updating a cache's expiration follows this list.
  • Other factors: Other charges apply, such as for non-cached input tokens and output tokens.
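
For example, a sketch of extending a cache's lifetime and later deleting it, assuming the Google Gen AI SDK for Python and a placeholder cache resource name:

```python
from google import genai
from google.genai.types import UpdateCachedContentConfig

# Placeholder project ID, region, and cache resource name; substitute your own.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")
cache_name = "projects/your-project-id/locations/us-central1/cachedContents/1234567890"

# Extend the expiration so the cache lives for two more hours from now;
# storage is billed hourly until the cache expires or is deleted.
client.caches.update(name=cache_name, config=UpdateCachedContentConfig(ttl="7200s"))

# Delete the cache when it is no longer needed to stop the storage charge.
client.caches.delete(name=cache_name)
```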

Context caching doesn't support Provisioned Throughput. Provisioned Throughput requests that use context caching are treated as pay-as-you-go.

Supported models

The following Gemini models support context caching:

  • Gemini 2.5 Flash
  • Gemini 2.5 Pro

For more information, see Available Gemini stable model versions. Note that context caching supports all MIME types for supported models.

Availability

Context caching is available in regions where Generative AI on Vertex AI is available. For more information, see Generative AI on Vertex AI locations.

VPC Service Controls support

Context caching supports VPC Service Controls, meaning your cache cannot be exfiltrated beyond your service perimeter. If you use Cloud Storage to build your cache, include your bucket in your service perimeter as well to protect your cache content.

For more information, see VPC Service Controls with Vertex AI in the Vertex AI documentation.

What's next