How to Reduce Token Use in AI (Even If You’re New to APIs)


If you’re experimenting with AI APIs like OpenAI’s GPT, Anthropic’s Claude, or Google’s Gemini, you’ve probably come across the term tokens. And if you’re paying by the token, or hitting frustrating limits, you might be wondering: “How do I use fewer tokens without breaking everything?”

It’s a very good question, and one that I have been pondering for the last few weeks. Using an AI API has the potential to be unbelievably powerful but also comes with the risk of being eye-wateringly expensive.

This article aims to walk you through the basics of how to reduce token usage in your AI applications, even if you’ve never touched an API before. It’s packed with practical examples, beginner-friendly explanations, and low-effort optimisations that can save you money, speed up response times, and help your apps scale more efficiently.

What is a token?

Think of a token as a chunk of text, roughly a word or part of a word. Tokens are how language models “see” your input and generate a response. They don’t read sentences or paragraphs the way humans do; they process tokens. For example:

  • “hamburger” → 3 tokens
  • “cheeseburger” → 5 tokens

Models count both input and output tokens:

  • You pay for the text you send and the text you receive.
  • If you hit a model’s token limit (e.g., 128,000 for GPT-4o), your prompt may get truncated, or the response cut off.

Top tip: Use OpenAI’s tokeniser or Anthropic’s token visualiser to paste in text and preview token counts.
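If you prefer to check counts in code, the open-source tiktoken library does the same job. Here’s a rough sketch (cl100k_base is the encoding used by GPT-4 and GPT-3.5; other models use different encodings, so counts will vary):

# pip install tiktoken
import tiktoken

# Load the encoding used by GPT-4 / GPT-3.5-class models
enc = tiktoken.get_encoding("cl100k_base")

for text in ["hamburger", "cheeseburger", "List 3 digital marketing strategies."]:
    print(f"{text!r} -> {len(enc.encode(text))} tokens")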

Why should you care about reducing tokens?

Reducing tokens isn’t just a cost thing (although it absolutely helps):

  • Lower costs: Most APIs charge per 1,000 tokens.
  • Faster responses: Smaller prompts = quicker processing.
  • More room for content: Token limits are strict. Less prompt = more space for input/output.
  • Less risk of hallucination: Long, rambling prompts can confuse the model.

Let’s say you’re summarising reports daily:

  • Prompt: 500 tokens.
  • Response: 750 tokens.
  • That’s 1,250 tokens × 100 requests/day = 125,000 tokens/day.
  • At $0.10/1K tokens = $375.00/month for just one function.

Optimising that down to 700 tokens total could save over 40%.
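Here’s that arithmetic as a tiny script you can adapt; the $0.10/1K price is the illustrative rate above, not any particular provider’s:

PRICE_PER_1K_TOKENS = 0.10   # illustrative rate -- check your provider's pricing
REQUESTS_PER_DAY = 100

def monthly_cost(tokens_per_request, days=30):
    daily_tokens = tokens_per_request * REQUESTS_PER_DAY
    return daily_tokens * days * PRICE_PER_1K_TOKENS / 1000

print(monthly_cost(1250))  # 375.0 -- current prompt + response
print(monthly_cost(700))   # 210.0 -- optimised, roughly a 44% saving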

Step 1: Trim the prompt fat

Be specific

Vague:

Tell me everything about marketing.

Specific:

List 3 digital marketing strategies for local businesses.

Cut unnecessary instructions

You don’t need to say things like “Act like a smart AI”; that’s the default.

Instead of:

You are a wise AI professor with decades of experience…

Try:

Explain this like I’m new to the topic.

Use prompt templates and variables

If you’re calling the API programmatically, structure your prompts with variables:

prompt = f"Summarise this report in {length} words:\n{report_text}"

Avoid repeating boilerplate each time.
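One way to do that is to keep the fixed instruction in a single template and fill in only the parts that change per call. A quick sketch (the names here are just illustrative):

SUMMARY_TEMPLATE = "Summarise this report in {length} words:\n{report_text}"

def build_summary_prompt(report_text, length=100):
    # Only the variable parts change; the instruction is written once
    return SUMMARY_TEMPLATE.format(length=length, report_text=report_text)

prompt = build_summary_prompt("Q3 revenue grew by 12%...", length=50)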

Step 2: Minimise the context window

When using chat APIs (like OpenAI’s chat/completions), every message you include in the request counts towards your input tokens, so costs grow with the conversation unless you manage the context.

Only include what’s relevant

Instead of passing the entire chat history every time:

  • Keep a rolling window of the last few exchanges.
  • Exclude irrelevant or outdated messages.

Example:

 

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarise this article in 5 bullet points."}
]

If you’re building a chatbot, consider memory pruning, summary caching, or retrieval-augmented generation (RAG) to avoid bloating the context.
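A minimal sketch of that rolling window, assuming you manage the messages list yourself:

MAX_MESSAGES = 6  # roughly the last three user/assistant exchanges

def prune_history(messages):
    # Always keep the system prompt, then only the most recent messages
    system = [m for m in messages if m["role"] == "system"]
    recent = [m for m in messages if m["role"] != "system"][-MAX_MESSAGES:]
    return system + recent

messages = prune_history(messages)  # run this before every API call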

Step 3: Compress input data

AI APIs are often fed long documents or transcripts. Here’s how to shrink them before they hit your token quota.

Pre-summarise

If you’re passing in an article, meeting transcript, or PDF:

  • Summarise it using a cheaper model first (e.g. GPT-3.5).
  • Then feed that summary into your main prompt.

Clean the input

Strip out:

  • Excess whitespace
  • Headers/footers
  • Duplicate content
  • HTML tags

Use tools like BeautifulSoup, regex, or even ChatGPT itself to clean up content.
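A rough sketch of that clean-up using BeautifulSoup and a regex (assumes pip install beautifulsoup4):

import re
from bs4 import BeautifulSoup

def clean_html(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop markup that rarely adds meaning but always adds tokens
    for tag in soup(["script", "style", "header", "footer", "nav"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()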

 

Step 4: Control the output length

Set boundaries

Unconstrained prompts create long, expensive responses.

Don’t say:

Tell me everything about climate change.

Say:

Summarise climate change in 3 bullet points, max 100 words.

Use max_tokens

Most APIs let you set this parameter to cap output length:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    max_tokens=150  # hard cap on the length of the reply
)

Tip: Ask the model to think before writing

Think step-by-step. Output only the final answer in 3 short sentences.

Step 5: Make system prompts work harder

System prompts shape behaviour but sit silently in the background, eating up tokens on every call.

Instead of:

You are a globally renowned professor and AI assistant who is kind, witty, and brilliant at delivering insights…

Use:

You are a clear, concise AI assistant.

Save and reuse short, effective system prompts across sessions. Store them as config strings instead of repeating them every call, as in the sketch below.
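For example (the file and variable names here are my own, not anything the SDK requires):

# config.py -- written once, imported wherever you call the API
SYSTEM_PROMPT = "You are a clear, concise AI assistant."

# elsewhere in your application
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Summarise this article in 5 bullet points."},
]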

Step 6: Cache and deduplicate repeated calls

If you’re sending the same input multiple times, cache the result. Examples: product descriptions, onboarding emails, or glossary lookups.

 

cache = {}  # simple in-memory cache; swap for Redis or a database if you need persistence

def get_or_generate(prompt):
    # Reuse the stored result if we've seen this exact prompt before
    if prompt in cache:
        return cache[prompt]
    result = call_api(prompt)  # call_api stands in for whatever wraps your model call
    cache[prompt] = result
    return result

For fuzzy matches, use embeddings (vector search) to find similar previous prompts and reuse their outputs.
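Here’s a rough sketch of that fuzzy lookup using OpenAI’s embeddings endpoint and cosine similarity; the 0.95 threshold is an arbitrary starting point you’d want to tune:

import numpy as np
from openai import OpenAI

client = OpenAI()
embedding_cache = []  # list of (embedding_vector, cached_result) pairs

def embed(text):
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def fuzzy_lookup(prompt, threshold=0.95):
    # Return a cached result if a previous prompt was similar enough
    query = embed(prompt)
    for vector, result in embedding_cache:
        similarity = np.dot(query, vector) / (np.linalg.norm(query) * np.linalg.norm(vector))
        if similarity >= threshold:
            return result
    return None  # no close match -- call the API and append to embedding_cache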

Step 7: Use smaller or cheaper models where possible

Not every task needs GPT-4:

  • Use GPT-3.5 or Claude Instant for simpler outputs.
  • Use LLaMA, Mistral, or Gemini Nano for local, lightweight inference.
  • Chain models: a smaller one for extraction, a bigger one for synthesis.

Example (sketched below):

  • GPT-3.5 summarises
  • GPT-4 refines or edits
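Here’s a sketch of that chain with the OpenAI Python client; the model names and word limits are just examples:

from openai import OpenAI

client = OpenAI()

def summarise_then_refine(document):
    # Cheap model does the heavy lifting on the long input
    draft = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Summarise in under 150 words:\n{document}"}],
        max_tokens=250,
    ).choices[0].message.content

    # Expensive model only ever sees the short draft
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Polish this summary for clarity:\n{draft}"}],
        max_tokens=250,
    ).choices[0].message.content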

Bonus: Advanced tips for efficiency nerds

Use stop sequences

Prevent runaway output:

 

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    stop=["\nUser:"]  # generation stops as soon as this sequence appears
)

Monitor usage actively

Use dashboards (like OpenAI’s usage tab) or build your own with logging, counting input tokens with tiktoken before each call.
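A minimal version of that helper might look like this; log_usage is a name of my own, not part of any SDK:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5 encoding

def log_usage(prompt):
    token_count = len(encoding.encode(prompt))
    print(f"{token_count} input tokens: {prompt[:60]}...")
    return token_count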

Modularise prompts

Break large tasks into multiple small ones:

  • One call to extract
  • One call to summarise
  • One call to format

Shorter calls → lower token use → easier debugging.

Quick wins to cut tokens today

High-impact actions:

  • Remove filler words and vague instructions.
  • Trim context to recent messages only.
  • Use max_tokens + stop sequences.

Medium-impact actions:

  • Clean up input documents before sending.
  • Cache repeated prompts.
  • Use smaller models for simple tasks.

TL;DR

  • Tokens are the currency of AI APIs.
  • Less is more: shorter prompts, tighter inputs, smaller responses.
  • You don’t need advanced dev skills to cut usage, just awareness.
  • The result: faster, cheaper, more scalable AI.

Authors

James Carr

Senior SEO Specialist

James is our Senior SEO Specialist and has worked in marketing since 2011. With a background spanning web development to content writing (and everything in between), he brings a well-rounded, consistent approach to SEO. He also leads on digital sustainability and accessibility, and is a two-time industry award finalist.
