Token Caching
Qwen Code supports prompt caching (also called context caching) to significantly reduce costs and latency when working with large, repetitive contexts like codebases, documentation, and long conversations.
Overview
Prompt caching allows LLM providers to store and reuse portions of the prompt context, reducing:
- Costs: Cached tokens are billed at a fraction of the normal input price (about 10% for Anthropic cache reads; other providers apply different discounts)
- Latency: Cached content doesn’t need to be reprocessed
- Token usage: Fewer prompt tokens are billed at the full input rate, since the cached portion is counted separately
Supported Providers
- Anthropic (Claude): Full support with ephemeral caching
- OpenAI: Support for prompt caching via the cached_tokens field
- Google (Gemini): Limited support depending on model
How It Works
Anthropic Prompt Caching
Qwen Code implements Anthropic’s prompt caching by adding cache_control markers to specific parts of the prompt.
System Instruction Caching
From packages/core/src/core/anthropicContentGenerator/converter.ts:61:
// Add cache_control to enable prompt caching (if enabled)
const system = this.enableCacheControl
? this.buildSystemWithCacheControl(systemText)
: systemText;
The system prompt gets cache control markers:
// From converter.ts:529
/**
* Anthropic prompt caching requires cache_control on system content.
* This method adds cache_control markers to the last text block
* in the system array to enable prompt caching.
*/
private buildSystemWithCacheControl(
systemText: string
): Array<Anthropic.TextBlockParam | Anthropic.ToolUseBlockParam> {
const blocks: Array<Anthropic.TextBlockParam> = [];
if (systemText && systemText.trim().length > 0) {
blocks.push({
type: 'text',
text: systemText,
cache_control: { type: 'ephemeral' }
});
}
return blocks;
}
Tool definitions are handled the same way: the last tool in the list receives the marker. From converter.ts:123:
// Add cache_control to the last tool for prompt caching (if enabled)
if (this.enableCacheControl && tools.length > 0) {
const lastToolIndex = tools.length - 1;
tools[lastToolIndex] = {
...tools[lastToolIndex],
cache_control: { type: 'ephemeral' }
};
}
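To make the effect of these two markers concrete, here is a minimal, self-contained sketch. The `TextBlock` and `ToolDef` shapes and the `buildSystemBlocks` and `markLastTool` names are simplified stand-ins invented for illustration; they are not the SDK types or Qwen Code's actual methods.

```typescript
// Hypothetical minimal shapes; the real SDK types carry more fields.
type CacheControl = { type: 'ephemeral' };

interface TextBlock {
  type: 'text';
  text: string;
  cache_control?: CacheControl;
}

interface ToolDef {
  name: string;
  description: string;
  cache_control?: CacheControl;
}

// Wraps non-empty system text in a single text block that carries the marker.
function buildSystemBlocks(systemText: string): TextBlock[] {
  if (systemText.trim().length === 0) return [];
  return [{ type: 'text', text: systemText, cache_control: { type: 'ephemeral' } }];
}

// Returns a copy of the tool list with the marker on the final entry only.
function markLastTool(tools: readonly ToolDef[]): ToolDef[] {
  const marked: ToolDef[] = tools.map((tool) => ({ ...tool }));
  const last = marked[marked.length - 1];
  if (last) {
    last.cache_control = { type: 'ephemeral' };
  }
  return marked;
}
```

Marking only the last tool is enough because a breakpoint covers everything that precedes it in the request, so earlier tools do not need their own markers.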
Message History Caching
From converter.ts:549:
/**
* Adds cache_control markers to enable prompt caching for conversation history.
* This enables prompt caching for the conversation context.
*/
private addCacheControlToMessages(
messages: AnthropicMessageParam[]
): void {
if (messages.length === 0) return;
// Add cache_control to the last user message
const lastMessage = messages[messages.length - 1];
if (lastMessage.role === 'user' && Array.isArray(lastMessage.content)) {
const content = lastMessage.content;
if (content.length > 0) {
const lastContent = content[content.length - 1];
if (lastContent.type === 'text') {
lastContent.cache_control = { type: 'ephemeral' };
}
}
}
}
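The guards in the method above mean the marker is only added in one specific case: the final message is a user message, its content is an array, and the last block in that array is text. The sketch below restates those conditions as a standalone predicate so they can be checked in isolation. The `Msg` and `Block` types and the `wouldMarkLastMessage` name are hypothetical simplifications, not part of Qwen Code.

```typescript
// Hypothetical simplified message shapes for illustration.
type Block =
  | { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } }
  | { type: 'tool_result'; tool_use_id: string; content: string };

interface Msg {
  role: 'user' | 'assistant';
  content: string | Block[];
}

// Mirrors the guard conditions shown above: returns true only when the
// last message would receive a cache_control marker.
function wouldMarkLastMessage(messages: readonly Msg[]): boolean {
  const last = messages[messages.length - 1];
  if (!last || last.role !== 'user' || !Array.isArray(last.content)) {
    return false;
  }
  const tail = last.content[last.content.length - 1];
  return tail !== undefined && tail.type === 'text';
}
```

Under these conditions, a final user message whose content is a plain string, or whose last block is a tool result, would not receive a marker.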
OpenAI Prompt Caching
OpenAI’s prompt caching is tracked via usage metadata.
From packages/core/src/core/openaiContentGenerator/converter.ts:356:
prompt_tokens_details?: { cached_tokens?: number };
usageMetadata: {
cached_tokens: usageMetadata.cachedContentTokenCount,
// ...
}
From converter.ts:891:
// Support both formats: prompt_tokens_details.cached_tokens (OpenAI standard)
// and top-level cached_tokens (some models return it here)
const cachedTokens =
usage.prompt_tokens_details?.cached_tokens ??
(usage as { cached_tokens?: number }).cached_tokens ??
0;
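The fallback chain above can be exercised on its own. The following sketch wraps it in a function; `UsageLike` and `readCachedTokens` are names made up for this example, and the top-level `cached_tokens` field is the non-standard variant mentioned in the comment above.

```typescript
// Hypothetical subset of a usage payload; only the fields needed here.
interface UsageLike {
  prompt_tokens?: number;
  prompt_tokens_details?: { cached_tokens?: number };
  cached_tokens?: number; // non-standard top-level variant
}

// Prefers the nested field, then the top-level field, then 0.
function readCachedTokens(usage: UsageLike | undefined): number {
  return (
    usage?.prompt_tokens_details?.cached_tokens ??
    usage?.cached_tokens ??
    0
  );
}
```

Because `??` only falls through on `null` or `undefined`, a nested value of `0` is kept as-is and does not fall back to the top-level field.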
Token Counting
Cached tokens are tracked separately in telemetry.
From packages/core/src/telemetry/types.ts:320:
this.cached_content_token_count = usage_data?.cachedContentTokenCount ?? 0;
From packages/core/src/telemetry/uiTelemetry.ts:189:
modelMetrics.tokens.cached += event.cached_content_token_count;
Configuration
Enable Caching
Caching is typically enabled automatically for supported models. The configuration happens during content generator initialization.
Anthropic Configuration
// Caching is enabled via enableCacheControl flag
const converter = new AnthropicConverter(
/* schemaCompliance */ true,
/* enableCacheControl */ true // Enable prompt caching
);
Cache Behavior
From Anthropic’s documentation:
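- Cache Duration: 5 minutes of inactivity (the timer resets each time the cached content is used)
- Cache Key: Based on the exact content of the prompt up to the breakpoint
- Minimum Size: 1024 tokens for caching to activate (some models require a larger minimum)
- Cost: ~10% of input token cost for cache hits; writing to the cache costs more than a regular input token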
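As a rough mental model, the rules above can be combined into a single check. This is only a sketch: the constants are taken from the list above, `expectCacheHit` is a hypothetical helper, and the provider remains the only authority on whether a request actually hits the cache.

```typescript
// Values from the list above; treat them as defaults, not guarantees.
const CACHE_TTL_MS = 5 * 60 * 1000;
const MIN_CACHEABLE_TOKENS = 1024;

// Rough expectation of a cache hit based on the three documented conditions.
function expectCacheHit(
  prefixTokens: number,
  msSinceLastUse: number,
  prefixUnchanged: boolean,
): boolean {
  return (
    prefixUnchanged &&
    prefixTokens >= MIN_CACHEABLE_TOKENS &&
    msSinceLastUse < CACHE_TTL_MS
  );
}
```

For example, a 50,000-token prefix reused after one minute would be expected to hit, while the same prefix reused after ten minutes would not.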
Qwen Code tracks caching effectiveness through telemetry.
Cache Hit Rate
From packages/cli/src/ui/utils/computeStats.ts:31:
return (metrics.tokens.cached / metrics.tokens.prompt) * 100;
Total Cached Tokens
From computeStats.ts:50:
(acc, model) => acc + model.tokens.cached
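The two expressions above can be tried outside the CLI with a small sketch like the following. The `ModelTokens` shape and both function names are assumptions for this example. The guard against a zero prompt count is an addition of this sketch; whether computeStats.ts handles that case the same way is not shown in the excerpt.

```typescript
// Hypothetical per-model token counters, reduced to the two fields used here.
interface ModelTokens {
  prompt: number;
  cached: number;
}

// Percentage of prompt tokens served from cache; 0 when there is no prompt.
function cacheHitRate(tokens: ModelTokens): number {
  if (tokens.prompt <= 0) return 0;
  return (tokens.cached / tokens.prompt) * 100;
}

// Sums cached tokens across all models used in a session.
function totalCachedTokens(models: Record<string, ModelTokens>): number {
  return Object.values(models).reduce((acc, model) => acc + model.cached, 0);
}
```

Without the guard, a session with no prompt tokens would produce `NaN` from the `0 / 0` division.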
Viewing Cache Metrics
# View session stats
qwen stats
# Output includes:
# Tokens:
# Prompt: 50,000
# Cached: 45,000 (90% cache hit rate)
# Completion: 5,000
# Total: 55,000
#
# Cost Savings: ~$0.12 (90% of prompt tokens cached)
1. Structure Prompts for Caching
Best Practice: Place stable context first, variable content last.
// Good: Stable system prompt + tools are cached
[
{ role: 'system', content: 'Large stable instructions...' }, // Cached
{ role: 'system', content: 'Tool definitions...' }, // Cached
{ role: 'user', content: 'Variable user input' } // Not cached
]
// Bad: Variable content breaks cache
[
{ role: 'user', content: 'Timestamp: 2024-...' }, // Changes every request
{ role: 'system', content: 'Instructions...' } // Never cached
]
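Since caching matches on the leading portion of the prompt, one way to reason about the two layouts above is to count how many leading segments two consecutive requests share. The sketch below does that for prompts represented as arrays of strings; `sharedPrefixCount` is a hypothetical helper, not part of Qwen Code.

```typescript
// Counts how many leading segments are identical between two requests.
// Only this shared prefix can be served from cache on the second request.
function sharedPrefixCount(
  previous: readonly string[],
  current: readonly string[],
): number {
  const limit = Math.min(previous.length, current.length);
  let count = 0;
  while (count < limit && previous[count] === current[count]) {
    count++;
  }
  return count;
}

// Stable content first: the first two segments stay identical.
const goodPrevious = ['Large stable instructions...', 'Tool definitions...', 'Question A'];
const goodCurrent = ['Large stable instructions...', 'Tool definitions...', 'Question B'];

// Variable content first: the very first segment already differs.
const badPrevious = ['Timestamp: 10:00', 'Instructions...'];
const badCurrent = ['Timestamp: 10:01', 'Instructions...'];

const goodShared = sharedPrefixCount(goodPrevious, goodCurrent);
const badShared = sharedPrefixCount(badPrevious, badCurrent);
```

Here `goodShared` covers the instructions and tool definitions, while `badShared` is zero because the changing timestamp comes before everything else.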
2. Maximize Cache Window
Keep sessions active to maintain cache:
# Long-running session maintains cache
qwen
# Ask multiple questions in same session
# Cache remains active for 5 minutes of inactivity
3. Chat Compression with Caching
From packages/core/src/services/sessionService.ts:584:
/**
* Builds the model-facing chat history (Content[]) from a reconstructed
* conversation. This keeps UI history intact while applying chat compression
* checkpoints for the API history used on resume.
*
* Strategy:
* - Find the latest system/chat_compression record (if any).
* - Use its compressedHistory snapshot as the base history.
* - Append all messages after that checkpoint (skipping system records).
* - If no checkpoint exists, return the linear message list (message field only).
*/
Chat compression creates stable checkpoints that can be cached effectively:
if (compressedHistory && lastCompressionIndex >= 0) {
const baseHistory: Content[] = structuredClone(compressedHistory);
// Compressed history becomes cache-eligible
// Only newer messages are uncached
for (let i = lastCompressionIndex + 1; i < messages.length; i++) {
const record = messages[i];
// System records are skipped; they are not part of the model-facing history
if (record.type === 'system') continue;
baseHistory.push(structuredClone(record.message));
}
}
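The strategy described in the comment block above (latest checkpoint as the base, later non-system messages appended, linear history when there is no checkpoint) can be sketched end to end as follows. The `SessionRecord` and `Content` shapes are heavily simplified and hypothetical, and `buildApiHistory` is an illustrative name, not the function in sessionService.ts.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface Content {
  role: 'user' | 'model';
  text: string;
}

type SessionRecord =
  | { type: 'system'; subtype: 'chat_compression'; compressedHistory: Content[] }
  | { type: 'user' | 'assistant'; message: Content };

// A shallow copy is enough here because Content holds only primitives.
const copyContent = (content: Content): Content => ({ ...content });

function buildApiHistory(records: readonly SessionRecord[]): Content[] {
  // Find the latest compression checkpoint, if any.
  let checkpointIndex = -1;
  for (let i = records.length - 1; i >= 0; i--) {
    const record = records[i];
    if (record && record.type === 'system') {
      checkpointIndex = i;
      break;
    }
  }

  // Start from the checkpoint snapshot (or an empty history).
  const history: Content[] = [];
  const checkpoint = checkpointIndex >= 0 ? records[checkpointIndex] : undefined;
  if (checkpoint && checkpoint.type === 'system') {
    history.push(...checkpoint.compressedHistory.map(copyContent));
  }

  // Append every non-system message recorded after the checkpoint.
  for (let i = checkpointIndex + 1; i < records.length; i++) {
    const record = records[i];
    if (record && record.type !== 'system') {
      history.push(copyContent(record.message));
    }
  }
  return history;
}
```

The compressed snapshot stays byte-identical from one request to the next until the following compression, which is what makes it a good cache prefix.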
4. Memory System Integration
From packages/core/src/utils/memoryDiscovery.ts, hierarchical memory files are loaded and can benefit from caching:
export async function loadServerHierarchicalMemory(
currentWorkingDirectory: string,
includeDirectoriesToReadGemini: readonly string[],
fileService: FileDiscoveryService,
extensionContextFilePaths: string[] = [],
folderTrust: boolean,
importFormat: 'flat' | 'tree' = 'tree'
): Promise<LoadServerHierarchicalMemoryResponse> {
// Loads QWEN.md files across directory hierarchy
// These files become part of cached system context
}
Memory content is stable and benefits from caching:
<!-- ~/.qwen/QWEN.md -->
## Qwen Added Memories
- User prefers TypeScript for new projects
- Project uses pnpm for package management
- API keys stored in 1Password
This content is included in the system prompt, so later requests can read it from cache as long as the cache entry has not expired.
Resume and Token Restoration
When resuming sessions, token counts are restored from checkpoints.
From sessionService.ts:670:
const resumePromptTokens = getResumePromptTokenCount(conversation);
if (resumePromptTokens !== undefined) {
uiTelemetryService.setLastPromptTokenCount(resumePromptTokens);
}
From sessionService.ts:676:
/**
* Returns the best available prompt token count for resuming telemetry:
* - If a chat compression checkpoint exists, use its new token count.
* - Otherwise, use the last assistant usageMetadata input (fallback to total).
*/
function getResumePromptTokenCount(
conversation: ConversationRecord
): number | undefined {
// First, check for chat compression checkpoint with token count
for (let i = conversation.messages.length - 1; i >= 0; i--) {
const record = conversation.messages[i];
if (record.type === 'system' && record.subtype === 'chat_compression') {
const payload = record.systemPayload as ChatCompressionRecordPayload | undefined;
if (payload?.newTokenCount !== undefined) {
return payload.newTokenCount;
}
}
}
// Fallback to last usage metadata
for (let i = conversation.messages.length - 1; i >= 0; i--) {
const record = conversation.messages[i];
if (record.role === 'model' && record.usageMetadata) {
return record.usageMetadata.promptTokenCount ?? record.usageMetadata.totalTokenCount;
}
}
return undefined;
}
Cost Analysis
Example Savings
Typical Anthropic pricing (Claude 3.5 Sonnet):
Input tokens: $3.00 per million
Cached tokens (cache reads): $0.30 per million (10x cheaper)
Output tokens: $15.00 per million
For a 50,000 token codebase context:
Without caching:
50,000 prompt tokens × $3.00/M = $0.15 per request
With 90% cache hit rate:
5,000 uncached × $3.00/M = $0.015
45,000 cached × $0.30/M = $0.0135
Total = $0.0285 per request
Savings: $0.1215 per request (81% reduction)
For 100 requests:
Without caching: $15.00
With caching: $2.85
Savings: $12.15
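The arithmetic above can be reproduced with a short sketch. The `Pricing` shape and `promptCost` function are hypothetical names for this example, and the prices are the ones quoted above. As in the figures above, only prompt-side costs are counted, and the extra cost of writing to the cache is ignored.

```typescript
// Prices in USD per million tokens, as quoted above.
interface Pricing {
  inputPerMTok: number;
  cachedPerMTok: number;
}

// Prompt-side cost of one request; cache-write surcharges are not modeled.
function promptCost(
  promptTokens: number,
  cachedTokens: number,
  pricing: Pricing,
): number {
  const uncachedTokens = promptTokens - cachedTokens;
  return (
    (uncachedTokens * pricing.inputPerMTok +
      cachedTokens * pricing.cachedPerMTok) /
    1e6
  );
}

const sonnetPricing: Pricing = { inputPerMTok: 3.0, cachedPerMTok: 0.3 };

const withoutCaching = promptCost(50_000, 0, sonnetPricing);
const withCaching = promptCost(50_000, 45_000, sonnetPricing);
const savingsFraction = (withoutCaching - withCaching) / withoutCaching;
```

With these inputs, `withoutCaching` and `withCaching` correspond to the $0.15 and $0.0285 per-request figures, and `savingsFraction` to the 81% reduction.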
From packages/cli/src/nonInteractive/io/BaseJsonOutputAdapter.ts:212:
usage.cache_read_input_tokens = metadata.cachedContentTokenCount;
JSON output includes cache metrics:
{
"usage": {
"input_tokens": 50000,
"cache_read_input_tokens": 45000,
"output_tokens": 5000,
"total_tokens": 55000
}
}
Advanced Topics
Cache Invalidation
Cache is invalidated when:
- Content changes: Any modification to cached portions
- 5-minute timeout: Inactivity exceeds cache duration
- Context length: Cache size limits reached
- Model changes: Switching to different model
Multi-Turn Caching Strategy
For long conversations:
// Turn 1: System + Tools cached
[
{ system: 'Instructions...', cache_control: { type: 'ephemeral' } },
{ tools: [...], cache_control: { type: 'ephemeral' } }, // Last tool
{ user: 'First question' }
]
// Turn 2: System + Tools + History cached
[
{ system: 'Instructions...' }, // Cache hit
{ tools: [...] }, // Cache hit
{ user: 'First question' }, // New to cache
{ assistant: 'First answer' }, // New to cache
{ user: 'Second question', cache_control: { type: 'ephemeral' } }
]
// Turn 3: All previous context cached
[
{ system: 'Instructions...' }, // Cache hit
{ tools: [...] }, // Cache hit
{ user: 'First question' }, // Cache hit
{ assistant: 'First answer' }, // Cache hit
{ user: 'Second question' }, // Cache hit
{ assistant: 'Second answer' }, // New to cache
{ user: 'Third question', cache_control: { type: 'ephemeral' } }
]
Breakpoints
Anthropic supports up to 4 cache breakpoints per request. Typical placements:
- End of system instructions
- End of tool definitions
- End of most recent cached message
- Custom location (if needed)
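To check a request against the limit above before sending it, the markers can simply be counted. This is a minimal sketch over a simplified request shape; `RequestLike`, `Marked`, `countBreakpoints`, and `withinBreakpointLimit` are hypothetical names, not SDK or Qwen Code APIs.

```typescript
// Anything that may carry a cache_control marker; other fields are ignored.
interface Marked {
  cache_control?: { type: 'ephemeral' };
  [key: string]: unknown;
}

// Hypothetical simplified request shape.
interface RequestLike {
  system?: Marked[];
  tools?: Marked[];
  messages: { content: string | Marked[] }[];
}

const MAX_BREAKPOINTS = 4; // limit stated above

// Counts cache_control markers across system blocks, tools, and messages.
function countBreakpoints(request: RequestLike): number {
  const blocks: Marked[] = [...(request.system ?? []), ...(request.tools ?? [])];
  for (const message of request.messages) {
    if (Array.isArray(message.content)) {
      blocks.push(...message.content);
    }
  }
  return blocks.filter((block) => block.cache_control !== undefined).length;
}

function withinBreakpointLimit(request: RequestLike): boolean {
  return countBreakpoints(request) <= MAX_BREAKPOINTS;
}
```

A request that marks the system block, the last tool, and the last user message uses three of the four available breakpoints.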
Troubleshooting
Low Cache Hit Rate
Problem: Cache hit rate below 50%
Causes:
- Prompt structure changes between requests
- Short session duration (cache expires)
- Dynamic content in system prompt
- Context size below 1024 token minimum
Solutions:
// Avoid dynamic timestamps in system prompt
// Bad:
const systemPrompt = `Current time: ${new Date().toISOString()}\n${instructions}`;
// Good:
const systemPrompt = instructions; // Static content
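If the model does need the current time, one option is to move it out of the system prompt and into the variable part of the request. The sketch below illustrates the idea; `PromptParts` and `buildPromptParts` are hypothetical names, and placing the timestamp in the user message is an assumption of this example, not something Qwen Code is documented to do.

```typescript
interface PromptParts {
  system: string;
  user: string;
}

// Keeps the system prompt static and attaches the timestamp to the
// user message, which changes on every request anyway.
function buildPromptParts(
  instructions: string,
  userInput: string,
  now: Date,
): PromptParts {
  return {
    system: instructions,
    user: `Current time: ${now.toISOString()}\n\n${userInput}`,
  };
}

const first = buildPromptParts('Be concise.', 'List the files.', new Date(0));
const second = buildPromptParts('Be concise.', 'List the files.', new Date(60_000));
const systemIsStable = first.system === second.system;
```

Because the system text is identical across both calls, it remains eligible for caching even though the timestamps differ.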
Cache Not Activating
Problem: cached_content_token_count always 0
Causes:
- Context size below 1024 tokens
- Caching not enabled for model
- Provider doesn’t support caching
Check:
# Verify model supports caching
qwen model info claude-3-5-sonnet-20241022
# Look for: "Prompt caching: Supported"
Unexpected Cache Misses
Problem: Cache misses despite identical prompts
Causes:
- Whitespace differences
- Tool order changes
- Hidden formatting differences
Solution: Log exact prompt content to debug:
# Enable debug logging
export DEBUG=true
qwen --debug
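Once two prompts have been captured from the logs, whitespace and hidden formatting differences are easier to spot with a character-level comparison. The following is a minimal sketch of such a comparison; `firstDifference` is a hypothetical helper, and how the prompts are extracted from the debug output is left open.

```typescript
interface Difference {
  index: number;
  left: string;
  right: string;
}

// Finds the first position where two prompts diverge. The excerpts are
// JSON-encoded so spaces, tabs, and newlines are visible when printed.
function firstDifference(a: string, b: string): Difference | null {
  const limit = Math.min(a.length, b.length);
  let index = 0;
  while (index < limit && a[index] === b[index]) {
    index++;
  }
  if (index === a.length && index === b.length) {
    return null; // prompts are identical
  }
  return {
    index,
    left: JSON.stringify(a.slice(index, index + 20)),
    right: JSON.stringify(b.slice(index, index + 20)),
  };
}
```

A `null` result means the two strings are identical, in which case the miss is more likely caused by expiry or by a difference elsewhere in the request, such as tool order.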
Source Code References
- Anthropic caching: packages/core/src/core/anthropicContentGenerator/converter.ts:61,123,529,549
- OpenAI caching: packages/core/src/core/openaiContentGenerator/converter.ts:356,891
- Telemetry: packages/core/src/telemetry/types.ts:320, uiTelemetry.ts:189
- Stats computation: packages/cli/src/ui/utils/computeStats.ts:31,50
- Session restore: packages/core/src/services/sessionService.ts:670,676