What does an LLM API call really cost? Tokens, caching, and batch explained
Anyone building an application against the OpenAI, Anthropic, or Google API will sooner or later run into the question: What does this actually cost? The official pricing pages are transparent, but they do not reveal how costs evolve in a real conversation. This guide explains the mechanics behind it — from tokens and input/output to caching, batch, and overhead — and concludes with a sample calculation showing why chatbot costs often end up higher than expected.
1. What are tokens and how are they calculated?
A token is the smallest unit into which a language model breaks down text. It is not a word and not a letter, but a subword unit — typically 3 to 5 characters long, often a syllable or a frequently occurring fragment. English texts usually end up at around four characters per token, while German texts tend to be closer to three because of the many compound words and umlauts.
Example: The sentence "Artificial intelligence is changing the world of work." is broken down into 9 to 12 tokens, depending on the model. In the German equivalent of this sentence, the ä in "verändert" can even be counted as a unit of its own by some tokenizers, because it occurs far less frequently in the predominantly English training corpus.
Important: Each provider uses its own tokenizer. OpenAI uses the tiktoken library, while Anthropic and Google use proprietary methods. The same text may come out to 140 tokens with OpenAI and 180 with Claude. Anyone wanting to calculate across models should determine the token count separately for each model.
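If you want to check counts yourself, OpenAI publishes its tokenizer as the open-source tiktoken library, so counts for OpenAI models can be computed locally. A minimal sketch (the model name is only an example; Claude and Gemini counts require the respective provider's own tools):

```python
# pip install tiktoken -- OpenAI's open-source tokenizer library.
# Counts are model-specific and only valid for OpenAI models; Claude and
# Gemini tokenize differently, so treat this as a rough estimate for them.
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Number of tokens `text` occupies for the given OpenAI model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

print(count_tokens("Artificial intelligence is changing the world of work."))
# -> roughly 10 tokens, depending on the model/encoding
```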
[Chart: Cost breakdown (per request)]
2. Input tokens vs. output tokens
Every API request creates two line items: input tokens (everything you send to the AI — system prompt, user question, chat history) and output tokens (the AI’s response). Both are billed separately, and output is typically 4–5x more expensive than input.
The reason: output tokens cost the provider more compute. The input is analyzed in a single parallel pass over the whole text. Output, by contrast, has to be generated token by token: each new token requires another full forward pass through the model, and these sequential passes cannot be parallelized the way input processing can.
Practical consequence: if your application produces long AI responses (for example summaries or creative texts), output costs dominate. For classification or routing tasks (short answers, lots of context), input costs dominate.
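A back-of-the-envelope sketch of how the two line items behave (the prices are placeholders in the same ballpark as the Sonnet-class rates used in the sample calculation later; substitute your provider's current numbers):

```python
# Illustrative prices in $ per 1M tokens -- replace with your provider's rates.
PRICE_INPUT = 3.00
PRICE_OUTPUT = 15.00   # 5x the input price in this example

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for a single API call."""
    return input_tokens / 1e6 * PRICE_INPUT + output_tokens / 1e6 * PRICE_OUTPUT

# Summarization-style call: long answer, output dominates (~91% of the cost).
print(f"${request_cost(1_000, 2_000):.4f}")   # $0.0330
# Classification-style call: lots of context, tiny answer, input dominates (~98%).
print(f"${request_cost(4_000, 20):.4f}")      # $0.0123
```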
3. Conversations: why every request sends the entire history
This is where the biggest cost pitfall lies. An LLM has no memory between API calls. For a chat AI to know what is currently being discussed, every new message must include the entire previous history again — including the system prompt, all earlier user messages, and all earlier AI responses.
Specifically: if a chatbot runs with an 800-token system prompt and 5 message exchanges, the system prompt is billed five times — once per request. Every new user question contributes not only its own text, but the entire stack before it.
Mathematically, the cumulative input tokens grow roughly quadratically with the number of turns: a conversation with 10 rounds does not cost 10x as much as one with a single round, but often 30–50x as much. That is why production chatbots become expensive quickly without optimization.
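The effect is easy to simulate. A small sketch with illustrative message sizes (an 800-token system prompt, 50-token questions, 200-token answers, the same assumptions as the sample calculation below):

```python
# Cumulative input tokens when the full history is resent on every turn.
SYSTEM, USER, ASSISTANT = 800, 50, 200   # illustrative message sizes (tokens)

def total_input_tokens(turns: int) -> int:
    total = 0
    for turn in range(1, turns + 1):
        history = (turn - 1) * (USER + ASSISTANT)   # everything said so far
        total += SYSTEM + history + USER            # resent with this request
    return total

print(total_input_tokens(1))    # 850
print(total_input_tokens(10))   # 19750 -- about 23x the first turn, not 10x
```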
4. Cached input and batch pricing — the two most important levers
Prompt caching: up to 90% discount on repeated inputs
Both OpenAI and Anthropic offer prompt caching. The idea: the provider stores the beginning of a request (typically the system prompt) in a cache for a few minutes, usually 5 to 10. If another request with the same beginning arrives within that time, the cached tokens are not processed again but read from the cache instead and billed at a fraction of the normal price, in the best case about 10% of it (that is, up to a 90% discount on the cached portion); the exact cache price varies by provider and model.
Important: the discount applies only to the cached portion, not to the entire request. In a 2,000-token request with an 800-token system prompt, the system prompt becomes cheaper, but the rest of the request remains at the normal price. The actual cost savings are typically between 20 and 40%, not 90%.
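Here is that 2,000-token example worked through, as a sketch using the Claude Sonnet rates from the sample calculation below ($3 per 1M input tokens, cached input at $0.30):

```python
PRICE_INPUT = 3.00 / 1e6     # $ per input token
PRICE_CACHED = 0.30 / 1e6    # $ per cached input token (~10% of the list price)

total_input, cached_part = 2_000, 800   # request size / cached system prompt

without_cache = total_input * PRICE_INPUT
with_cache = (total_input - cached_part) * PRICE_INPUT + cached_part * PRICE_CACHED

print(f"${without_cache:.4f} vs ${with_cache:.5f}")       # $0.0060 vs $0.00384
print(f"savings: {1 - with_cache / without_cache:.0%}")   # savings: 36%
```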
Batch API: 50% discount for asynchronous workloads
If responses are not needed in real time — for example for nightly data analysis, mass classification, or automatic translation — the Batch API offers a flat 50% discount on input and output. In return, you accept that results are delivered within 24 hours rather than in seconds.
Caching and batch can be combined. With properly set up pipelines, combined savings of 60–70% are realistic.
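A quick sketch of how the two discounts multiply on the input side (assuming, purely for illustration, that 40% of each request is served from cache; output tokens also get the batch discount but are left out here):

```python
PRICE_INPUT = 3.00    # $ per 1M input tokens
CACHE_FACTOR = 0.10   # cached tokens billed at ~10% of the input price
BATCH_FACTOR = 0.50   # batch requests billed at half price

def effective_input_price(cached_share: float, use_batch: bool) -> float:
    """Blended $ per 1M input tokens for a given cached share of the request."""
    blended = PRICE_INPUT * (cached_share * CACHE_FACTOR + (1 - cached_share))
    return blended * (BATCH_FACTOR if use_batch else 1.0)

print(f"{effective_input_price(0.4, use_batch=False):.2f}")   # 1.92 -> 36% below list price
print(f"{effective_input_price(0.4, use_batch=True):.2f}")    # 0.96 -> 68% below list price
```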
5. Context window, max output, and overhead tokens
Three terms that appear in pricing pages and are often confused:
- Context window — the maximum number of tokens a model can process in total in a single request (input + output combined). Current models such as Claude Sonnet 4.6 or GPT-5.4 offer 200K to 1M tokens. Once your conversation exceeds that, the API fails — you have to summarize or shorten the history.
- Max output — the maximum length of a single response, typically 8K to 128K tokens and thus significantly lower than the context window, even when the context window would theoretically still have room.
- Overhead tokens — invisible control tokens that every message receives. The model needs to know where a message begins, what role it has (system, user, assistant), and where it ends. Rule of thumb: 3 tokens per message plus 3 tokens for the request. In a chat with 6 messages, that is 6×3+3 = 21 additional tokens that you do not see anywhere, but still pay for.
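Expressed as a tiny helper (the 3-plus-3 rule of thumb from the list above; the exact figures vary slightly by model and are an approximation here):

```python
# Rule-of-thumb overhead estimate: ~3 control tokens per message plus ~3 per request.
def overhead_tokens(message_count: int) -> int:
    return message_count * 3 + 3

# System prompt + 2 user messages + 2 assistant replies + the new question = 6 messages:
print(overhead_tokens(6))  # 21
```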
6. Sample calculation: customer support chatbot with 5 turns
Enough theory. Let’s look at a realistic use case: a support chatbot based on Claude Sonnet 4.6 ($3 / $15 per 1M tokens, cached input $0.30).
Assumptions:
- System prompt: 800 tokens (persona, instructions, knowledge snippet)
- Per turn: 50 tokens user question + 200 tokens AI response
- 5-turn conversation, cumulative calculation
| Turn | Input tokens | Overhead tokens | Output tokens | Turn total |
|---|---|---|---|---|
| 1 | 850 | 9 | 200 | 1,059 |
| 2 | 1,100 | 15 | 200 | 1,315 |
| 3 | 1,350 | 21 | 200 | 1,571 |
| 4 | 1,600 | 27 | 200 | 1,827 |
| 5 | 1,850 | 33 | 200 | 2,083 |
| Total | 6,750 | 105 | 1,000 | 7,855 |
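For anyone who wants to verify these numbers, the table can be reproduced with a few lines (same assumptions as above; the message sizes are the ones listed, not measured values):

```python
# Reproduces the per-turn table: the full history is resent every turn, plus
# 3 overhead tokens per message and 3 per request (rule of thumb from section 5).
SYSTEM, USER, ASSISTANT = 800, 50, 200

totals = [0, 0, 0, 0]
for turn in range(1, 6):
    history = (turn - 1) * (USER + ASSISTANT)      # earlier Q&A pairs
    input_tokens = SYSTEM + history + USER         # everything sent this turn
    messages = 2 + (turn - 1) * 2                  # system + history + new question
    overhead = messages * 3 + 3
    row = [input_tokens, overhead, ASSISTANT, input_tokens + overhead + ASSISTANT]
    totals = [a + b for a, b in zip(totals, row)]
    print(turn, row)       # e.g. 1 [850, 9, 200, 1059] ... 5 [1850, 33, 200, 2083]
print("total", totals)     # total [6750, 105, 1000, 7855]
```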
Three cost scenarios for this exact conversation:
| Scenario | Per conversation | At 100,000 conv./month |
|---|---|---|
| Standard (no caching, no batch) | $0.0356 | $3,560 |
| With prompt caching (system prompt cached from turn 2 onward) | $0.0269 | $2,692 |
| With caching + Batch API (50% discount) | $0.0135 | $1,346 |
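The scenario figures can be recomputed directly from the token totals above. A small sketch with the same Claude Sonnet rates; small rounding differences against the table are expected:

```python
# Token totals from the per-turn table: 6,855 billable input tokens (incl.
# overhead), 1,000 output tokens, and 4 x 800 system-prompt tokens served
# from cache from turn 2 onward.
IN, OUT, CACHED = 3.00 / 1e6, 15.00 / 1e6, 0.30 / 1e6   # $ per token

input_tokens, output_tokens, cache_hits = 6_855, 1_000, 4 * 800

standard = input_tokens * IN + output_tokens * OUT
with_cache = (input_tokens - cache_hits) * IN + cache_hits * CACHED + output_tokens * OUT
cache_and_batch = with_cache * 0.5   # Batch API: flat 50% off input and output

for name, cost in [("standard", standard),
                   ("with caching", with_cache),
                   ("caching + batch", cache_and_batch)]:
    print(f"{name:<16} ${cost:.4f} per conversation, ${cost * 100_000:,.2f} per 100k")
# standard          $0.0356 per conversation, $3,556.50 per 100k
# with caching      $0.0269 per conversation, $2,692.50 per 100k
# caching + batch   $0.0135 per conversation, $1,346.25 per 100k
```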
Two observations from the table:
- Caching alone brings about 24% savings — not 90%, because only the system-prompt portion benefits. With longer system prompts or more turns, the effect becomes larger.
- Caching + batch together achieve around 62% savings. Anyone who can use both optimizations (that is, does not absolutely need real time) saves on the order of $2,200/month in this example — enough to justify the investment in proper caching logic several times over.
Conclusion
Token costs are not magical; they follow clear rules. Anyone who understands the three mechanisms — tokenization, conversation accumulation, and special pricing — can plan budgets realistically and apply optimizations in a targeted way. The most important takeaways:
- Output is 4–5x more expensive than input — short answers save the most.
- In conversations, costs grow quadratically, not linearly.
- Prompt caching typically reduces costs by 20–40%, batch by 50%.
- Overhead tokens are invisible, but countable — and relevant at scale.
Glossary
- Batch API: Asynchronous processing mode with a 50% discount on input and output. Responses arrive within 24 hours instead of in real time.
- Cached Input: Input tokens that were already processed in an earlier call and temporarily stored by the provider. They are billed at about 10% of the normal price.
- Context Window: Maximum number of tokens a model can process in a single request (input + output combined).
- Input Token: A token sent to the AI — everything that appears in the prompt: system prompt, user message, chat history.
- Max Output: Maximum number of tokens the model can produce in a single response. It is lower than the context window.
- Output Token: A token generated by the AI in its response. Typically 4–5x more expensive than input.
- Overhead Tokens: Invisible control tokens that every message in a chat request receives — typically 3 per message plus 3 for the overall request.
- Prompt Caching: Mechanism by which providers temporarily store recurring prompt parts (typically system prompts) and bill them at the cached price on the next request.
- System Prompt: The instruction at the beginning of an AI request that defines the model's behavior (persona, response style, constraints). It is sent again with every request in a conversation.
- Token: Smallest processing unit of a language model. Subword-based, typically 3–5 characters long. Each provider has its own tokenizer.
- Tokenizer: Algorithm that breaks text into tokens. OpenAI uses tiktoken, other providers use proprietary methods.
Pricing as of: May 2026. Always check the providers’ official pages for current rates.
