If you’re experimenting with AI APIs like OpenAI’s GPT, Anthropic’s Claude, or Google’s Gemini, you’ve probably come across the term tokens. And if you’re paying by the token, or hitting frustrating limits, you might be wondering: “how do I use fewer tokens without breaking everything?”
It’s a very good question, and one that I have been pondering for the last few weeks. Using an AI API has the potential to be unbelievably powerful but also comes with the risk of being eye-wateringly expensive.
This article aims to walk you through the basics of how to reduce token usage in your AI applications, even if you’ve never touched an API before. It’s packed with practical examples, beginner-friendly explanations, and low-effort optimisations that can save you money, speed up response times, and help your apps scale more efficiently.
What is a token?
Think of a token as a chunk of text. Tokens are how language models “see” your input and generate a response. They don’t read sentences or paragraphs the way humans do; they process tokens. A short, common word is often a single token, while longer or unusual words get split into several. For example:
- “hamburger” → 3 tokens
- “cheeseburger” → 5 tokens
Models count both input and output tokens:
- You pay for the text you send and the text you receive.
- If you hit a model’s token limit (e.g., 128,000 for GPT-4o), your prompt may get truncated, or the response cut off.
Top tip: Use OpenAI’s tokeniser or Anthropic’s token visualiser to paste in text and preview token counts.
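If you’d rather check counts in code, here’s a minimal sketch using OpenAI’s tiktoken library (the model name is just an example; swap in whichever model you’re actually using):

import tiktoken

# Look up the tokeniser that a given OpenAI model uses
encoding = tiktoken.encoding_for_model("gpt-4")
text = "How many tokens is this sentence?"
print(len(encoding.encode(text)))  # the number of tokens the model will see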
Why should you care about reducing tokens?
Reducing tokens isn’t just a cost thing (although it absolutely helps):
- Lower costs: APIs charge per token, with pricing usually quoted per 1,000 or per million tokens.
- Faster responses: Smaller prompts = quicker processing.
- More room for content: Token limits are strict. Less prompt = more space for input/output.
- Less risk of hallucination: Long, rambling prompts can confuse the model.
Let’s say you’re summarising reports daily:
- Prompt: 500 tokens.
- Response: 750 tokens.
- That’s 1,250 tokens × 100 requests/day = 125,000 tokens/day.
- At $0.10/1K tokens, that’s $12.50/day, or roughly $375.00/month, for just one function.
Optimising that down to 700 tokens total could save over 40%.
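The same arithmetic as a quick sanity check in Python (the $0.10/1K rate is purely illustrative, not a real price list):

RATE_PER_1K = 0.10                       # illustrative price per 1,000 tokens
tokens_per_request = 500 + 750           # prompt + response
daily_tokens = tokens_per_request * 100  # 100 requests per day
daily_cost = daily_tokens / 1000 * RATE_PER_1K
print(f"{daily_tokens} tokens/day, ${daily_cost:.2f}/day, ${daily_cost * 30:.2f}/month")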
Step 1: Trim the prompt fat
Be specific
Vague:
Tell me everything about marketing.
Specific:
List 3 digital marketing strategies for local businesses.
Cut unnecessary instructions
You don’t need to say things like “Act like a smart AI”; that’s already the default.
Instead of:
You are a wise AI professor with decades of experience…
Try:
Explain this like I’m new to the topic.
Use prompt templates and variables
If you’re calling the API programmatically, structure your prompts with variables:
prompt = f"Summarise this report in {length} words:\n{report_text}"
Avoid repeating boilerplate each time.
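A minimal sketch of what that can look like in practice (the constant and function name here are illustrative, not part of any SDK):

# Boilerplate lives in one place; only the variables change per call
SUMMARY_TEMPLATE = "Summarise this report in {length} words:\n{report_text}"

def build_summary_prompt(report_text, length=100):
    return SUMMARY_TEMPLATE.format(length=length, report_text=report_text)

prompt = build_summary_prompt("Q3 revenue grew 12%...", length=50)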
Step 2: Minimise the context window
When using chat APIs (like OpenAI’s chat/completions), the API itself is stateless: it only “remembers” whatever messages you resend with each request. That means the history, and your token bill, grows every turn unless you manage the context.
Only include what’s relevant
Instead of passing the entire chat history every time:
- Keep a rolling window of the last few exchanges.
- Exclude irrelevant or outdated messages.
Example:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarise this article in 5 bullet points."},
]
If you’re building a chatbot, consider memory pruning, summary caching, or retrieval-augmented generation (RAG) to avoid bloating the context.
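A minimal pruning sketch, assuming the system prompt should always survive and only the last few user/assistant turns are kept (the helper name and window size are illustrative):

def prune_history(messages, max_exchanges=3):
    # Keep the system prompt plus only the most recent exchanges
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_exchanges * 2:]  # one exchange = user turn + assistant turn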
Step 3: Compress input data
AI APIs are often fed long documents or transcripts. Here’s how to shrink them before they hit your token quota.
Pre-summarise
If you’re passing in an article, meeting transcript, or PDF:
- Summarise it using a cheaper model first (e.g. GPT-3.5).
- Then feed that summary into your main prompt.
Clean the input
Strip out:
- Excess whitespace
- Headers/footers
- Duplicate content
- HTML tags
Use tools like BeautifulSoup, regex, or even ChatGPT itself to clean up content.
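A minimal cleaning sketch using BeautifulSoup plus a regex, assuming the input is raw HTML (the helper name is illustrative):

import re
from bs4 import BeautifulSoup

def clean_html(raw_html):
    # Drop scripts, styles, headers and footers, keeping only visible text
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "header", "footer"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()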
Step 4: Control the output length
Set boundaries
Unconstrained prompts create long, expensive responses.
Don’t say:
Tell me everything about climate change.
Say:
Summarise climate change in 3 bullet points, max 100 words.
Use max_tokens
Most APIs let you set this parameter to cap output length:
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    max_tokens=150
)
Tip: Ask the model to think before writing
Think step-by-step. Output only the final answer in 3 short sentences.
Step 5: Make system prompts work harder
System prompts shape behaviour but sit silently in the background, eating up tokens.
Instead of:
You are a globally renowned professor and AI assistant who is kind, witty, and brilliant at delivering insights…
Use:
You are a clear, concise AI assistant.
Save and reuse short, effective system prompts across sessions. Store them as config strings instead of repeating them every call.
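A minimal sketch of storing the system prompt once as a config constant and reusing it (the names are illustrative):

# Defined once, reused on every call instead of being retyped
SYSTEM_PROMPT = "You are a clear, concise AI assistant."

def build_messages(user_content):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]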
Step 6: Cache and deduplicate repeated calls
If you’re sending the same input multiple times, cache the result. Example: Product descriptions, onboarding emails, or glossary lookups.
cache = {}

def get_or_generate(prompt):
    if prompt in cache:
        return cache[prompt]
    result = call_api(prompt)
    cache[prompt] = result
    return result
For fuzzy matches, use embeddings (vector search) to find similar previous prompts and reuse their outputs.
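One possible sketch of that idea, assuming the OpenAI embeddings endpoint and a naive in-memory list rather than a real vector database (the 0.95 similarity threshold is arbitrary, and call_api is the same placeholder as above):

import math
from openai import OpenAI

client = OpenAI()
embedding_cache = []  # list of (embedding_vector, cached_output) pairs

def embed(text):
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def get_or_generate_fuzzy(prompt, threshold=0.95):
    vec = embed(prompt)
    for cached_vec, cached_output in embedding_cache:
        if cosine(vec, cached_vec) >= threshold:
            return cached_output          # close enough: reuse the earlier answer
    result = call_api(prompt)             # same placeholder as in the example above
    embedding_cache.append((vec, result))
    return result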
Step 7: Use smaller or cheaper models where possible
Not every task needs GPT-4:
- Use GPT-3.5 or Claude Instant for simpler outputs.
- Use LLaMA, Mistral, or Gemini Nano for local, lightweight inference.
- Chain models: smaller one for extraction, bigger one for synthesis.
Example:
- GPT-3.5 summarises.
- GPT-4 refines or edits.
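A minimal sketch of that chain with the OpenAI Python SDK; the model names follow the example above, and the prompts are purely illustrative:

from openai import OpenAI

client = OpenAI()

def summarise_then_refine(long_text):
    # Cheap pass: compress the raw text with the smaller model
    draft = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Summarise in 5 bullet points:\n{long_text}"}],
        max_tokens=200,
    ).choices[0].message.content

    # Expensive pass: the bigger model only ever sees the short draft
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Refine this summary for a client report:\n{draft}"}],
        max_tokens=200,
    ).choices[0].message.content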
Bonus: Advanced tips for efficiency nerds
Use stop sequences
Prevent runaway output:
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    stop=["\nUser:"]
)
Monitor usage actively
Use dashboards (like OpenAI’s usage tab) or build your own with logging:
encoding = tiktoken.encoding_for_model("gpt-4")
log_usage(prompt, len(encoding.encode(prompt)))
Modularise prompts
Break large tasks into multiple small ones:
- One call to extract
- One call to summarise
- One call to format
Shorter calls → lower token use → easier debugging.
Quick wins to cut tokens today
High-impact actions:
- Remove filler words and vague instructions.
- Trim context to recent messages only.
- Use max_tokens + stop sequences.
Medium-impact actions:
- Clean up input documents before sending.
- Cache repeated prompts.
- Use smaller models for simple tasks.
TL;DR
- Tokens are the currency of AI APIs.
- Less is more: shorter prompts, tighter inputs, smaller responses.
- You don’t need advanced dev skills to cut usage, just awareness.
- The results: faster, cheaper, more scalable AI.