How to Reduce Token Use in AI (Even If You’re New to APIs)


If you’re experimenting with AI APIs like OpenAI’s GPT, Anthropic’s Claude, or Google’s Gemini, you’ve probably come across the term tokens. And if you’re paying by the token, or hitting frustrating limits, you might be wondering: “How do I use fewer tokens without breaking everything?”

It’s a very good question, and one that I have been pondering for the last few weeks. Using an AI API has the potential to be unbelievably powerful but also comes with the risk of being eye-wateringly expensive.

This article aims to walk you through the basics of how to reduce token usage in your AI applications, even if you’ve never touched an API before. It’s packed with practical examples, beginner-friendly explanations, and low-effort optimisations that can save you money, speed up response times, and help your apps scale more efficiently.

What is a token?

Think of a token as a chunk of text, roughly a word or part of a word. Tokens are how language models “see” your input and generate a response. They don’t read sentences or paragraphs the way humans do; they process tokens. For example:

  • “hamburger” → 3 tokens
  • “cheeseburger” → 5 tokens

Models count both input and output tokens:

  • You pay for the text you send and the text you receive.
  • If you hit a model’s token limit (e.g., 128,000 for GPT-4o), your prompt may get truncated, or the response cut off.

Top tip: Use OpenAI’s tokeniser or Anthropic’s token visualiser to paste in text and preview token counts.
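If you prefer to check counts in code, the open-source tiktoken library does the same job. Here’s a rough sketch (cl100k_base is the encoding used by GPT-4 and GPT-3.5; other models use different encodings, so counts will vary):

# pip install tiktoken
import tiktoken

# Load the encoding used by GPT-4 / GPT-3.5-class models
enc = tiktoken.get_encoding("cl100k_base")

for text in ["hamburger", "cheeseburger", "List 3 digital marketing strategies."]:
    print(f"{text!r} -> {len(enc.encode(text))} tokens")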

Why should you care about reducing tokens?

Reducing tokens isn’t just a cost thing (although it absolutely helps):

  • Lower costs: Most APIs charge per 1,000 tokens.
  • Faster responses: Smaller prompts = quicker processing.
  • More room for content: Token limits are strict. Less prompt = more space for input/output.
  • Less risk of hallucination: Long, rambling prompts can confuse the model.

Let’s say you’re summarising reports daily:

  • Prompt: 500 tokens.
  • Response: 750 tokens.
  • That’s 1,250 tokens × 100 requests/day = 125,000 tokens/day.
  • At $0.10/1K tokens = $375.00/month for just one function.

Optimising that down to 700 tokens total could save over 40%.
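Here’s that arithmetic as a tiny script you can adapt; the $0.10/1K price is the illustrative rate above, not any particular provider’s:

PRICE_PER_1K_TOKENS = 0.10   # illustrative rate -- check your provider's pricing
REQUESTS_PER_DAY = 100

def monthly_cost(tokens_per_request, days=30):
    daily_tokens = tokens_per_request * REQUESTS_PER_DAY
    return daily_tokens * days * PRICE_PER_1K_TOKENS / 1000

print(monthly_cost(1250))  # 375.0 -- current prompt + response
print(monthly_cost(700))   # 210.0 -- optimised, roughly a 44% saving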

Step 1: Trim the prompt fat

Be specific

Vague:

Tell me everything about marketing.

Specific:

List 3 digital marketing strategies for local businesses.

Cut unnecessary instructions

You don’t need to say things like “Act like a smart AI”; that’s the default.

Instead of:

You are a wise AI professor with decades of experience…

Try:

Explain this like I’m new to the topic.

Use prompt templates and variables

If you’re calling the API programmatically, structure your prompts with variables:

prompt = f"Summarise this report in {length} words:\n{report_text}"

Avoid repeating boilerplate each time.
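One way to do that is to keep the fixed instruction in a single template and fill in only the parts that change per call. A quick sketch (the names here are just illustrative):

SUMMARY_TEMPLATE = "Summarise this report in {length} words:\n{report_text}"

def build_summary_prompt(report_text, length=100):
    # Only the variable parts change; the instruction is written once
    return SUMMARY_TEMPLATE.format(length=length, report_text=report_text)

prompt = build_summary_prompt("Q3 revenue grew by 12%...", length=50)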

Step 2: Minimise the context window

When using chat APIs (like OpenAI’s chat/completions), every message you include in the request counts towards your input tokens, so costs grow with the conversation unless you manage the context.

Only include what’s relevant

Instead of passing the entire chat history every time:

  • Keep a rolling window of the last few exchanges.
  • Exclude irrelevant or outdated messages.

Example:

 

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarise this article in 5 bullet points."}
]

If you’re building a chatbot, consider memory pruning, summary caching, or retrieval-augmented generation (RAG) to avoid bloating the context.
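A minimal sketch of that rolling window, assuming you manage the messages list yourself:

MAX_MESSAGES = 6  # roughly the last three user/assistant exchanges

def prune_history(messages):
    # Always keep the system prompt, then only the most recent messages
    system = [m for m in messages if m["role"] == "system"]
    recent = [m for m in messages if m["role"] != "system"][-MAX_MESSAGES:]
    return system + recent

messages = prune_history(messages)  # run this before every API call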

Step 3: Compress input data

AI APIs are often fed long documents or transcripts. Here’s how to shrink them before they hit your token quota.

Pre-summarise

If you’re passing in an article, meeting transcript, or PDF:

  • Summarise it using a cheaper model first (e.g. GPT-3.5).
  • Then feed that summary into your main prompt.

Clean the input

Strip out:

  • Excess whitespace
  • Headers/footers
  • Duplicate content
  • HTML tags

Use tools like BeautifulSoup, regex, or even ChatGPT itself to clean up content.
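A rough sketch of that clean-up using BeautifulSoup and a regex (assumes pip install beautifulsoup4):

import re
from bs4 import BeautifulSoup

def clean_html(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop markup that rarely adds meaning but always adds tokens
    for tag in soup(["script", "style", "header", "footer", "nav"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()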

 

Step 4: Control the output length

Set boundaries

Unconstrained prompts create long, expensive responses.

Don’t say:

Tell me everything about climate change.

Say:

Summarise climate change in 3 bullet points, max 100 words.

Use max_tokens

Most APIs let you set this parameter to cap output length:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    max_tokens=150  # hard cap on the length of the reply
)

Tip: Ask the model to think before writing

Think step-by-step. Output only the final answer in 3 short sentences.

Step 5: Make system prompts work harder

System prompts shape behaviour but sit silently in the background, eating up tokens on every call.

Instead of:

You are a globally renowned professor and AI assistant who is kind, witty, and brilliant at delivering insights…

Use:

You are a clear, concise AI assistant.

Save and reuse short, effective system prompts across sessions. Store them as config strings instead of repeating them every call, as in the sketch below.
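For example (the file and variable names here are my own, not anything the SDK requires):

# config.py -- written once, imported wherever you call the API
SYSTEM_PROMPT = "You are a clear, concise AI assistant."

# elsewhere in your application
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Summarise this article in 5 bullet points."},
]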

Step 6: Cache and deduplicate repeated calls

If you’re sending the same input multiple times, cache the result. Examples: product descriptions, onboarding emails, or glossary lookups.

 

cache = {}  # simple in-memory cache; swap for Redis or a database if you need persistence

def get_or_generate(prompt):
    # Reuse the stored result if we've seen this exact prompt before
    if prompt in cache:
        return cache[prompt]
    result = call_api(prompt)  # call_api stands in for whatever wraps your model call
    cache[prompt] = result
    return result

For fuzzy matches, use embeddings (vector search) to find similar previous prompts and reuse their outputs.
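Here’s a rough sketch of that fuzzy lookup using OpenAI’s embeddings endpoint and cosine similarity; the 0.95 threshold is an arbitrary starting point you’d want to tune:

import numpy as np
from openai import OpenAI

client = OpenAI()
embedding_cache = []  # list of (embedding_vector, cached_result) pairs

def embed(text):
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def fuzzy_lookup(prompt, threshold=0.95):
    # Return a cached result if a previous prompt was similar enough
    query = embed(prompt)
    for vector, result in embedding_cache:
        similarity = np.dot(query, vector) / (np.linalg.norm(query) * np.linalg.norm(vector))
        if similarity >= threshold:
            return result
    return None  # no close match -- call the API and append to embedding_cache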

Step 7: Use smaller or cheaper models where possible

Not every task needs GPT-4:

  • Use GPT-3.5 or Claude Instant for simpler outputs.
  • Use LLaMA, Mistral, or Gemini Nano for local, lightweight inference.
  • Chain models: a smaller one for extraction, a bigger one for synthesis.

Example (sketched below):

  • GPT-3.5 summarises
  • GPT-4 refines or edits
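Here’s a sketch of that chain with the OpenAI Python client; the model names and word limits are just examples:

from openai import OpenAI

client = OpenAI()

def summarise_then_refine(document):
    # Cheap model does the heavy lifting on the long input
    draft = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Summarise in under 150 words:\n{document}"}],
        max_tokens=250,
    ).choices[0].message.content

    # Expensive model only ever sees the short draft
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Polish this summary for clarity:\n{draft}"}],
        max_tokens=250,
    ).choices[0].message.content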

Bonus: Advanced tips for efficiency nerds

Use stop sequences

Prevent runaway output:

 

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    stop=["\nUser:"]  # generation stops as soon as this sequence appears
)

Monitor usage actively

Use dashboards (like OpenAI’s usage tab) or build your own with logging, counting input tokens with tiktoken before each call.
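A minimal version of that helper might look like this; log_usage is a name of my own, not part of any SDK:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5 encoding

def log_usage(prompt):
    token_count = len(encoding.encode(prompt))
    print(f"{token_count} input tokens: {prompt[:60]}...")
    return token_count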

Modularise prompts

Break large tasks into multiple small ones:

  • One call to extract
  • One call to summarise
  • One call to format

Shorter calls → lower token use → easier debugging.

Quick wins to cut tokens today

High-impact actions:

  • Remove filler words and vague instructions.
  • Trim context to recent messages only.
  • Use max_tokens + stop sequences.

Medium-impact actions:

  • Clean up input documents before sending.
  • Cache repeated prompts.
  • Use smaller models for simple tasks.

TL;DR

  • Tokens are the currency of AI APIs.
  • Less is more: shorter prompts, tighter inputs, smaller responses.
  • You don’t need advanced dev skills to cut usage, just awareness.
  • The result: faster, cheaper, more scalable AI.

Authors

James Carr

Senior SEO Specialist

James is our Senior SEO Specialist and has worked in marketing since 2011. With a background spanning web development to content writing (and everything in between), he brings a well-rounded, consistent approach to SEO. He also leads on digital sustainability and accessibility, and is a two-time industry award finalist.
