Token Caching

Qwen Code supports prompt caching (also called context caching) to significantly reduce costs and latency when working with large, repetitive contexts like codebases, documentation, and long conversations.

Overview

Prompt caching allows LLM providers to store and reuse portions of the prompt context, reducing:

Costs: Cached tokens are billed at a fraction of the normal input price (about 10% on Anthropic; the discount varies by provider)
Latency: Cached content doesn’t need to be reprocessed
Token usage: Fewer prompt tokens are billed at the full input rate (cached tokens still count toward the context window)

Supported Providers

Anthropic (Claude): Full support with ephemeral caching
OpenAI: Automatic prompt caching; hits are reported through the cached_tokens usage field
Google (Gemini): Limited support depending on model

How It Works

Anthropic Prompt Caching

Qwen Code implements Anthropic’s prompt caching by adding cache_control markers to specific parts of the prompt.

System Instruction Caching

From packages/core/src/core/anthropicContentGenerator/converter.ts:61:

// Add cache_control to enable prompt caching (if enabled)
const system = this.enableCacheControl
  ? this.buildSystemWithCacheControl(systemText)
  : systemText;

The system prompt gets cache control markers:

// From converter.ts:529
/**
 * Anthropic prompt caching requires cache_control on system content.
 * This method adds cache_control markers to the last text block
 * in the system array to enable prompt caching.
 */
private buildSystemWithCacheControl(
  systemText: string
): Array<Anthropic.TextBlockParam | Anthropic.ToolUseBlockParam> {
  const blocks: Array<Anthropic.TextBlockParam> = [];
  
  if (systemText && systemText.trim().length > 0) {
    blocks.push({
      type: 'text',
      text: systemText,
      cache_control: { type: 'ephemeral' }
    });
  }
  
  return blocks;
}

Tool Definition Caching

From converter.ts:123:

// Add cache_control to the last tool for prompt caching (if enabled)
if (this.enableCacheControl && tools.length > 0) {
  const lastToolIndex = tools.length - 1;
  tools[lastToolIndex] = {
    ...tools[lastToolIndex],
    cache_control: { type: 'ephemeral' }
  };
}

Message History Caching

From converter.ts:549:

/**
 * Adds cache_control markers to enable prompt caching for conversation history.
 * This enables prompt caching for the conversation context.
 */
private addCacheControlToMessages(
  messages: AnthropicMessageParam[]
): void {
  if (messages.length === 0) return;
  
  // Add cache_control to the last user message
  const lastMessage = messages[messages.length - 1];
  if (lastMessage.role === 'user' && Array.isArray(lastMessage.content)) {
    const content = lastMessage.content;
    if (content.length > 0) {
      const lastContent = content[content.length - 1];
      if (lastContent.type === 'text') {
        lastContent.cache_control = { type: 'ephemeral' };
      }
    }
  }
}
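
The three marker placements above can be combined into one self-contained sketch. The types and the buildCachedRequest helper below are hypothetical stand-ins written for illustration; they are not the converter's actual classes or the Anthropic SDK types, and the sketch returns new objects instead of mutating its inputs.

```typescript
// Hypothetical, simplified shapes; the real converter uses Anthropic SDK types.
type CacheControl = { type: 'ephemeral' };
interface TextBlock { type: 'text'; text: string; cache_control?: CacheControl }
interface ToolDef { name: string; description: string; cache_control?: CacheControl }
interface Msg { role: 'user' | 'assistant'; content: TextBlock[] }
interface CachedRequest { system: TextBlock[]; tools: ToolDef[]; messages: Msg[] }

const EPHEMERAL: CacheControl = { type: 'ephemeral' };

function buildCachedRequest(
  systemText: string,
  tools: ToolDef[],
  messages: Msg[],
): CachedRequest {
  // 1. System prompt: a single text block carrying the marker.
  const system: TextBlock[] =
    systemText.trim().length > 0
      ? [{ type: 'text', text: systemText, cache_control: EPHEMERAL }]
      : [];

  // 2. Tools: only the last definition carries the marker.
  const markedTools: ToolDef[] = tools.map((tool, i) =>
    i === tools.length - 1 ? { ...tool, cache_control: EPHEMERAL } : tool,
  );

  // 3. Messages: mark the last block of the last message, if it is a user turn.
  const markedMessages: Msg[] = messages.map((msg, i) => {
    const isLast = i === messages.length - 1;
    if (!isLast || msg.role !== 'user' || msg.content.length === 0) return msg;
    const content = msg.content.map((block, j) =>
      j === msg.content.length - 1 ? { ...block, cache_control: EPHEMERAL } : block,
    );
    return { ...msg, content };
  });

  return { system, tools: markedTools, messages: markedMessages };
}
```

With two tools and one user message, this produces three markers: one on the system block, one on the second tool, and one on the user message's text block.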

OpenAI Prompt Caching

OpenAI’s prompt caching is tracked via usage metadata. From packages/core/src/core/openaiContentGenerator/converter.ts:356:

prompt_tokens_details?: { cached_tokens?: number };
usageMetadata: {
  cached_tokens: usageMetadata.cachedContentTokenCount,
  // ...
}

From converter.ts:891:

// Support both formats: prompt_tokens_details.cached_tokens (OpenAI standard)
// and top-level cached_tokens (some models return it here)
const cachedTokens = 
  usage.prompt_tokens_details?.cached_tokens ??
  (usage as { cached_tokens?: number }).cached_tokens ??
  0;
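
The fallback chain above can be exercised in isolation. The following is a minimal sketch; UsageLike and extractCachedTokens are hypothetical names chosen for illustration, and only the fields the snippet reads are modeled.

```typescript
// Hypothetical shape: only the fields this sketch reads.
interface UsageLike {
  prompt_tokens?: number;
  prompt_tokens_details?: { cached_tokens?: number };
  cached_tokens?: number; // non-standard top-level location
}

// Prefer the nested OpenAI field, then the top-level field, then 0.
function extractCachedTokens(usage: UsageLike | undefined): number {
  return (
    usage?.prompt_tokens_details?.cached_tokens ??
    usage?.cached_tokens ??
    0
  );
}
```

Because `??` only skips null and undefined, a nested value of 0 wins over a non-zero top-level value; the nested field is treated as authoritative whenever it is present.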

Token Counting

Cached tokens are tracked separately in telemetry. From packages/core/src/telemetry/types.ts:320:

this.cached_content_token_count = usage_data?.cachedContentTokenCount ?? 0;

From packages/core/src/telemetry/uiTelemetry.ts:189:

modelMetrics.tokens.cached += event.cached_content_token_count;

Configuration

Enable Caching

Caching is typically enabled automatically for supported models. The configuration happens during content generator initialization.

Anthropic Configuration

// Caching is enabled via enableCacheControl flag
const converter = new AnthropicConverter(
  /* schemaCompliance */ true,
  /* enableCacheControl */ true  // Enable prompt caching
);

Cache Behavior

From Anthropic’s documentation:

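Cache Duration: 5 minutes by default; each cache hit refreshes the timer
Cache Key: Based on exact prompt content (the prefix up to each cache_control marker must match exactly)
Minimum Size: 1024 tokens for most Claude models; some models (for example, Claude 3.5 Haiku) require 2048
Cost: ~10% of the base input token price for cache hits; cache writes cost about 25% more than base input tokens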

Monitoring Cache Performance

Qwen Code tracks caching effectiveness through telemetry.

Cache Hit Rate

From packages/cli/src/ui/utils/computeStats.ts:31:

return (metrics.tokens.cached / metrics.tokens.prompt) * 100;
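
The expression above divides by the prompt token count, which is zero before any request has completed. A minimal guarded sketch follows; TokenMetrics and cacheHitRatePercent are hypothetical names, and the actual guard in computeStats.ts (if any) is not shown in this document.

```typescript
// Hypothetical shape mirroring metrics.tokens in the snippet above.
interface TokenMetrics {
  prompt: number;
  cached: number;
}

// Returns the cache hit rate as a percentage, or 0 when no prompt tokens exist.
function cacheHitRatePercent(tokens: TokenMetrics): number {
  if (tokens.prompt <= 0) return 0;
  return (tokens.cached / tokens.prompt) * 100;
}
```

For example, 45,000 cached tokens out of 50,000 prompt tokens gives a 90% hit rate.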

Total Cached Tokens

From computeStats.ts:50:

(acc, model) => acc + model.tokens.cached
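
In context, the reducer above sums cached tokens over every model used in the session. A small sketch, assuming metrics are keyed by model name; ModelMetricsLike, totalCachedTokens, and the model names in the usage note are hypothetical.

```typescript
// Hypothetical per-model metrics, keyed by model name.
interface ModelMetricsLike {
  tokens: { prompt: number; cached: number };
}

// Sums cached tokens across all models seen in the session.
function totalCachedTokens(models: Record<string, ModelMetricsLike>): number {
  return Object.values(models).reduce(
    (acc, model) => acc + model.tokens.cached,
    0,
  );
}
```

A session that used 'model-a' with 45,000 cached tokens and 'model-b' with 1,000 would report 46,000 in total.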

Viewing Cache Metrics

# View session stats
qwen stats

# Illustrative output (exact layout may differ):
# Tokens:
#   Prompt: 50,000
#   Cached: 45,000 (90% cache hit rate)
#   Completion: 5,000
#   Total: 55,000
#
# Cost Savings: ~$0.12 (45,000 cached tokens at Claude 3.5 Sonnet rates)

Optimizing for Cache Performance

1. Structure Prompts for Caching

Best Practice: Place stable context first, variable content last.

// Good: Stable system prompt + tools are cached
[
  { role: 'system', content: 'Large stable instructions...' },  // Cached
  { role: 'system', content: 'Tool definitions...' },           // Cached
  { role: 'user', content: 'Variable user input' }              // Not cached
]

// Bad: Variable content breaks cache
[
  { role: 'user', content: 'Timestamp: 2024-...' },  // Changes every request
  { role: 'system', content: 'Instructions...' }      // Never cached
]
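
The ordering rule can be sketched as a stable partition: everything that never changes goes first, and the cacheable prefix ends at the first volatile part. PromptPart, orderForCaching, and cacheablePrefixLength are hypothetical helpers for illustration, not part of Qwen Code.

```typescript
// Hypothetical description of one prompt segment.
interface PromptPart {
  label: string;
  text: string;
  stable: boolean; // true if identical on every request
}

// Moves stable parts ahead of volatile ones, preserving relative order.
function orderForCaching(parts: PromptPart[]): PromptPart[] {
  const stable = parts.filter((p) => p.stable);
  const volatile = parts.filter((p) => !p.stable);
  return [...stable, ...volatile];
}

// Number of leading parts that can be reused from cache.
function cacheablePrefixLength(parts: PromptPart[]): number {
  const firstVolatile = parts.findIndex((p) => !p.stable);
  return firstVolatile === -1 ? parts.length : firstVolatile;
}
```

With a timestamp placed first, the cacheable prefix has length 0; after reordering, the instructions and tool definitions form a prefix of length 2.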

2. Maximize Cache Window

Keep sessions active to maintain cache:

# Long-running session maintains cache
qwen
# Ask multiple questions in same session
# Cache entries expire after 5 minutes of inactivity; each hit refreshes the timer

3. Chat Compression with Caching

From packages/core/src/services/sessionService.ts:584:

/**
 * Builds the model-facing chat history (Content[]) from a reconstructed
 * conversation. This keeps UI history intact while applying chat compression
 * checkpoints for the API history used on resume.
 *
 * Strategy:
 * - Find the latest system/chat_compression record (if any).
 * - Use its compressedHistory snapshot as the base history.
 * - Append all messages after that checkpoint (skipping system records).
 * - If no checkpoint exists, return the linear message list (message field only).
 */

Chat compression creates stable checkpoints that can be cached effectively:

if (compressedHistory && lastCompressionIndex >= 0) {
  const baseHistory: Content[] = structuredClone(compressedHistory);
  // Compressed history becomes cache-eligible

  // Only newer messages are uncached
  for (let i = lastCompressionIndex + 1; i < messages.length; i++) {
    const record = messages[i];
    // Skip system records; they are not part of the model-facing history
    if (record.type === 'system' || !record.message) continue;
    baseHistory.push(structuredClone(record.message));
  }
}
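
The strategy described in the doc comment can be sketched end to end with simplified types. ContentLike, RecordLike, cloneContent, and buildApiHistory are hypothetical stand-ins; in particular, storing compressedHistory directly on the record is an assumption made to keep the example short, and cloneContent stands in for structuredClone.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface ContentLike { role: 'user' | 'model'; parts: Array<{ text: string }> }
interface RecordLike {
  type: 'user' | 'assistant' | 'system';
  subtype?: string;
  message?: ContentLike;
  compressedHistory?: ContentLike[]; // assumed location of the snapshot
}

// Copies a Content-like value so the caller's records stay untouched.
function cloneContent(c: ContentLike): ContentLike {
  return { role: c.role, parts: c.parts.map((p) => ({ ...p })) };
}

function buildApiHistory(records: RecordLike[]): ContentLike[] {
  // Find the latest compression checkpoint, if any.
  let checkpoint = -1;
  let base: ContentLike[] = [];
  records.forEach((r, i) => {
    if (r.type === 'system' && r.subtype === 'chat_compression' && r.compressedHistory) {
      checkpoint = i;
      base = r.compressedHistory;
    }
  });

  // Start from the snapshot (empty when there is no checkpoint).
  const history = base.map(cloneContent);

  // Append every later message, skipping system records.
  for (const r of records.slice(checkpoint + 1)) {
    if (r.type === 'system' || !r.message) continue;
    history.push(cloneContent(r.message));
  }
  return history;
}
```

Because the snapshot is identical on every request after the checkpoint, it forms a stable prefix; only the messages appended after it vary from turn to turn.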

4. Memory System Integration

From packages/core/src/utils/memoryDiscovery.ts, hierarchical memory files are loaded and can benefit from caching:

export async function loadServerHierarchicalMemory(
  currentWorkingDirectory: string,
  includeDirectoriesToReadGemini: readonly string[],
  fileService: FileDiscoveryService,
  extensionContextFilePaths: string[] = [],
  folderTrust: boolean,
  importFormat: 'flat' | 'tree' = 'tree'
): Promise<LoadServerHierarchicalMemoryResponse> {
  // Loads QWEN.md files across directory hierarchy
  // These files become part of cached system context
}

Memory content is stable and benefits from caching:

<!-- ~/.qwen/QWEN.md -->
## Qwen Added Memories
- User prefers TypeScript for new projects
- Project uses pnpm for package management
- API keys stored in 1Password

This content is included in the system prompt, so it is cached with it and reused across requests while the cache entry is still live.

Resume and Token Restoration

When resuming sessions, token counts are restored from checkpoints. From sessionService.ts:670:

const resumePromptTokens = getResumePromptTokenCount(conversation);
if (resumePromptTokens !== undefined) {
  uiTelemetryService.setLastPromptTokenCount(resumePromptTokens);
}

From sessionService.ts:676:

/**
 * Returns the best available prompt token count for resuming telemetry:
 * - If a chat compression checkpoint exists, use its new token count.
 * - Otherwise, use the last assistant usageMetadata input (fallback to total).
 */
function getResumePromptTokenCount(
  conversation: ConversationRecord
): number | undefined {
  // First, check for chat compression checkpoint with token count
  for (let i = conversation.messages.length - 1; i >= 0; i--) {
    const record = conversation.messages[i];
    if (record.type === 'system' && record.subtype === 'chat_compression') {
      const payload = record.systemPayload as ChatCompressionRecordPayload | undefined;
      if (payload?.newTokenCount !== undefined) {
        return payload.newTokenCount;
      }
    }
  }
  
  // Fallback to last usage metadata
  for (let i = conversation.messages.length - 1; i >= 0; i--) {
    const record = conversation.messages[i];
    if (record.role === 'model' && record.usageMetadata) {
      return record.usageMetadata.promptTokenCount ?? record.usageMetadata.totalTokenCount;
    }
  }
  
  return undefined;
}

Cost Analysis

Example Savings

Typical Anthropic pricing (Claude 3.5 Sonnet):

Input tokens:  $3.00 per million
Cached tokens: $0.30 per million (10x cheaper)
Output tokens: $15.00 per million

For a 50,000 token codebase context:

Without caching:
  50,000 prompt tokens × $3.00/M = $0.15 per request
  
With 90% cache hit rate:
  5,000 uncached × $3.00/M = $0.015
  45,000 cached × $0.30/M = $0.0135
  Total = $0.0285 per request
  
Savings: $0.1215 per request (81% reduction)

For 100 requests:
  Without caching: $15.00
  With caching: $2.85
  Savings: $12.15
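
The arithmetic above can be reproduced with a small calculator. This is a sketch under the same simplifying assumptions as the worked example: it ignores output tokens and the surcharge on cache writes. Pricing, SONNET_35_PRICING, and promptCost are hypothetical names.

```typescript
// Prices in USD per million tokens, taken from the table above.
interface Pricing {
  inputPerMTok: number;
  cacheReadPerMTok: number;
}

const SONNET_35_PRICING: Pricing = { inputPerMTok: 3.0, cacheReadPerMTok: 0.3 };

// Cost of the prompt side of one request; cached tokens use the cache-read rate.
function promptCost(promptTokens: number, cachedTokens: number, pricing: Pricing): number {
  const uncached = promptTokens - cachedTokens;
  return (
    (uncached * pricing.inputPerMTok + cachedTokens * pricing.cacheReadPerMTok) / 1e6
  );
}

// Fraction of the uncached cost that caching saves (0 to 1).
function savingsFraction(promptTokens: number, cachedTokens: number, pricing: Pricing): number {
  const baseline = promptCost(promptTokens, 0, pricing);
  if (baseline === 0) return 0;
  return 1 - promptCost(promptTokens, cachedTokens, pricing) / baseline;
}
```

Calling promptCost(50000, 45000, SONNET_35_PRICING) yields about $0.0285, and savingsFraction for the same inputs is about 0.81, matching the figures above.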

JSON Output Format

From packages/cli/src/nonInteractive/io/BaseJsonOutputAdapter.ts:212:

usage.cache_read_input_tokens = metadata.cachedContentTokenCount;

JSON output includes cache metrics:

{
  "usage": {
    "input_tokens": 50000,
    "cache_read_input_tokens": 45000,
    "output_tokens": 5000,
    "total_tokens": 55000
  }
}
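
A consumer of this output can split the input into cached and uncached portions. The sketch below assumes input_tokens includes the cached tokens, which the example totals imply (50,000 + 5,000 = 55,000); UsageJson and splitInputTokens are hypothetical names.

```typescript
// Hypothetical shape of the "usage" object shown above.
interface UsageJson {
  input_tokens: number;
  cache_read_input_tokens?: number;
  output_tokens: number;
  total_tokens?: number;
}

// Parses one JSON result and splits input tokens into cached and uncached.
function splitInputTokens(json: string): { cached: number; uncached: number } {
  const { usage } = JSON.parse(json) as { usage: UsageJson };
  const cached = usage.cache_read_input_tokens ?? 0;
  // Assumes input_tokens already includes cached tokens.
  return { cached, uncached: usage.input_tokens - cached };
}
```

For the example payload above, this returns 45,000 cached and 5,000 uncached input tokens.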

Advanced Topics

Cache Invalidation

Cache is invalidated when:

Content changes: Any modification to cached portions
5-minute timeout: Inactivity exceeds cache duration
Context length: Cache size limits reached
Model changes: Switching to different model
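
Three of these conditions can be expressed as a small decision function. This is a client-side mental model only, not how any provider implements its cache; CachedPrefix, MissReason, and explainMiss are hypothetical names, and cache size limits are not modeled.

```typescript
// Hypothetical record of what was last written to the cache.
interface CachedPrefix {
  model: string;
  prefix: string;
  lastUsedAtMs: number;
}

type MissReason = 'model_changed' | 'expired' | 'content_changed';

const DEFAULT_TTL_MS = 5 * 60 * 1000; // 5 minutes of inactivity

// Returns why a request would miss the cache, or null if a hit is expected.
function explainMiss(
  cached: CachedPrefix,
  model: string,
  prompt: string,
  nowMs: number,
  ttlMs: number = DEFAULT_TTL_MS,
): MissReason | null {
  if (cached.model !== model) return 'model_changed';
  if (nowMs - cached.lastUsedAtMs >= ttlMs) return 'expired';
  // The new prompt must start with the cached prefix, character for character.
  if (!prompt.startsWith(cached.prefix)) return 'content_changed';
  return null;
}
```

A request one minute later with the same model and an extended prompt is an expected hit; editing any character inside the cached prefix turns it into a 'content_changed' miss.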

Multi-Turn Caching Strategy

For long conversations:

// Turn 1: System, tools, and the first user message are written to the cache
[
  { system: 'Instructions...', cache_control: { type: 'ephemeral' } },
  { tools: [...], cache_control: { type: 'ephemeral' } },  // Last tool
  { user: 'First question', cache_control: { type: 'ephemeral' } }
]

// Turn 2: Everything up to the first question is read from the cache
[
  { system: 'Instructions...' },         // Cache hit
  { tools: [...] },                     // Cache hit
  { user: 'First question' },           // Cache hit
  { assistant: 'First answer' },        // New to cache
  { user: 'Second question', cache_control: { type: 'ephemeral' } }
]

// Turn 3: All previous context cached
[
  { system: 'Instructions...' },         // Cache hit
  { tools: [...] },                     // Cache hit  
  { user: 'First question' },           // Cache hit
  { assistant: 'First answer' },        // Cache hit
  { user: 'Second question' },          // Cache hit
  { assistant: 'Second answer' },       // New to cache
  { user: 'Third question', cache_control: { type: 'ephemeral' } }
]
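
The way hits grow from turn to turn can be simulated with a toy prefix cache. This is a deliberately simplified model, assuming the last user message carries a marker on every turn (as addCacheControlToMessages does) and ignoring expiry and minimum size; PrefixCacheSim is a hypothetical class, not a provider or Qwen Code API.

```typescript
// Toy model: the cache remembers every prefix that ended at a breakpoint.
class PrefixCacheSim {
  private readonly stored = new Set<string>();

  // Returns how many leading blocks were served from cache, then records
  // the prefixes ending at this request's breakpoints (0-based indices).
  request(blocks: string[], breakpoints: number[]): number {
    let hit = 0;
    for (let n = blocks.length; n > 0; n--) {
      if (this.stored.has(JSON.stringify(blocks.slice(0, n)))) {
        hit = n;
        break;
      }
    }
    for (const bp of breakpoints) {
      this.stored.add(JSON.stringify(blocks.slice(0, bp + 1)));
    }
    return hit;
  }
}

const sim = new PrefixCacheSim();
const turn1 = ['system', 'tools', 'user: q1'];
const turn2 = [...turn1, 'assistant: a1', 'user: q2'];
const turn3 = [...turn2, 'assistant: a2', 'user: q3'];

// Breakpoints: system (0), last tool (1), and the last user message.
const hits = [
  sim.request(turn1, [0, 1, 2]),
  sim.request(turn2, [0, 1, 4]),
  sim.request(turn3, [0, 1, 6]),
];
```

Under these assumptions, hits is [0, 3, 5]: nothing is cached on the first turn, and each later turn reuses everything up to the previous turn's final user message.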

Breakpoints

Anthropic supports up to 4 cache breakpoints:

End of system instructions
End of tool definitions
End of most recent cached message
Custom location (if needed)
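
A request can be checked against this limit before it is sent. The following sketch counts markers across the system blocks, tools, and message content; the types, MAX_BREAKPOINTS, and both helper functions are hypothetical and simplified relative to the real API types.

```typescript
// Hypothetical, simplified request shapes.
type CacheControl = { type: 'ephemeral' };
interface Block { type: string; text?: string; cache_control?: CacheControl }
interface Tool { name: string; cache_control?: CacheControl }
interface Message { role: 'user' | 'assistant'; content: string | Block[] }
interface RequestLike { system?: string | Block[]; tools?: Tool[]; messages: Message[] }

const MAX_BREAKPOINTS = 4;

// Counts every cache_control marker in the request.
function countCacheBreakpoints(req: RequestLike): number {
  let count = 0;
  if (Array.isArray(req.system)) {
    count += req.system.filter((b) => b.cache_control !== undefined).length;
  }
  count += (req.tools ?? []).filter((t) => t.cache_control !== undefined).length;
  for (const message of req.messages) {
    if (Array.isArray(message.content)) {
      count += message.content.filter((b) => b.cache_control !== undefined).length;
    }
  }
  return count;
}

function withinBreakpointLimit(req: RequestLike): boolean {
  return countCacheBreakpoints(req) <= MAX_BREAKPOINTS;
}
```

The three markers described earlier (system, last tool, last user message) leave one breakpoint free for a custom location.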

Troubleshooting

Low Cache Hit Rate

Problem: Cache hit rate below 50%

Causes:

Prompt structure changes between requests
Short session duration (cache expires)
Dynamic content in system prompt
Context size below 1024 token minimum

Solutions:

// Avoid dynamic timestamps in system prompt
// Bad:
const systemPrompt = `Current time: ${new Date().toISOString()}\n${instructions}`;

// Good:
const systemPrompt = instructions; // Static content
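
If the model genuinely needs the current time, one option is to move it into the user turn so the system prompt stays byte-identical. A minimal sketch; BuiltPrompt and buildPrompt are hypothetical names, and this is one possible arrangement, not Qwen Code's actual prompt assembly.

```typescript
interface BuiltPrompt {
  system: string;      // stable: eligible for caching
  userMessage: string; // volatile: sent after the cached prefix
}

// Keeps the system prompt static and attaches the timestamp to the user turn.
function buildPrompt(instructions: string, userInput: string, now: Date): BuiltPrompt {
  return {
    system: instructions,
    userMessage: `Current time: ${now.toISOString()}\n\n${userInput}`,
  };
}
```

Two requests made at different times then share the same system string, so only the user message falls outside the cached prefix.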

Cache Not Activating

Problem: cached_content_token_count always 0

Causes:

Context size below 1024 tokens
Caching not enabled for model
Provider doesn’t support caching

Check:

# Verify model supports caching
qwen model info claude-3-5-sonnet-20241022
# Look for: "Prompt caching: Supported"

Unexpected Cache Misses

Problem: Cache misses despite identical prompts

Causes:

Whitespace differences
Tool order changes
Hidden formatting differences

Solution: Log exact prompt content to debug:

# Enable debug logging
export DEBUG=true
qwen --debug
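
Once two prompts have been captured from the logs, a small helper can locate the first character at which they diverge. firstDifference is a hypothetical utility written for this purpose; it JSON-encodes the surrounding text so that spaces, tabs, and newlines become visible.

```typescript
interface Difference {
  index: number;    // position of the first differing character
  expected: string; // JSON-encoded context from the first prompt
  actual: string;   // JSON-encoded context from the second prompt
}

// Returns the first point of divergence, or null if the strings are identical.
function firstDifference(a: string, b: string, context = 20): Difference | null {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  if (i === a.length && i === b.length) return null;
  const start = Math.max(0, i - context);
  return {
    index: i,
    expected: JSON.stringify(a.slice(start, i + context)),
    actual: JSON.stringify(b.slice(start, i + context)),
  };
}
```

Comparing 'You are helpful.\n' with 'You are helpful. \n' reports index 16, where a trailing space was introduced before the newline.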

Source Code References

Anthropic caching: packages/core/src/core/anthropicContentGenerator/converter.ts:61,123,529,549
OpenAI caching: packages/core/src/core/openaiContentGenerator/converter.ts:356,891
Telemetry: packages/core/src/telemetry/types.ts:320, uiTelemetry.ts:189
Stats computation: packages/cli/src/ui/utils/computeStats.ts:31,50
Session restore: packages/core/src/services/sessionService.ts:670,676