Tokenization
The process of breaking text into smaller units (tokens) that AI models can process. Tokens might be words, subwords, or characters depending on the tokenizer.
In-Depth Explanation
Tokenization converts raw text into the numerical tokens that language models actually process. Understanding tokenization helps explain model behaviour and costs.
Tokenization approaches:
- Word-level: each word is a token (vocabularies grow huge, and unseen words become unknown tokens)
- Character-level: each character is a token (tiny vocabulary, but very long sequences)
- Subword: a balance of both; common words stay whole while rare words split into pieces (BPE, WordPiece, SentencePiece)
- Byte-level: operates on raw bytes, so any string can be encoded (the GPT-2/3/4 byte-level BPE style)
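The granularities above can be sketched with a toy example. The subword split here is hand-written for illustration, not the output of a real tokenizer:

```python
# Toy illustration of tokenization granularities (not a real tokenizer).
text = "unbelievable results"

# Word-level: split on whitespace; every distinct word needs its own
# vocabulary entry, so the vocabulary grows without bound.
word_tokens = text.split()          # ['unbelievable', 'results']

# Character-level: tiny vocabulary, but sequences get long.
char_tokens = list(text)            # 20 tokens for 20 characters

# Subword (illustrative): the rare word breaks into frequent pieces,
# similar in spirit to what BPE or WordPiece would learn from data.
subword_tokens = ["un", "believ", "able", " results"]

print(len(word_tokens), len(char_tokens), len(subword_tokens))
```

Note how the subword split sits between the two extremes: fewer tokens than characters, but no need for a dedicated vocabulary entry for every rare word.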
Key tokenizers:
- BPE (Byte-Pair Encoding): used by GPT models; learns merge rules from training data
- WordPiece: used by BERT; similar to BPE, but chooses merges by likelihood rather than raw frequency
- SentencePiece: language-agnostic, works on raw text without pre-tokenization; used by many models
- Tiktoken: OpenAI's fast, open-source BPE implementation
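The core of BPE, "learns merge rules from data", fits in a few lines. This is a minimal educational sketch on characters; production tokenizers such as tiktoken work on bytes and are heavily optimised:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Minimal BPE sketch: repeatedly merge the most frequent adjacent
    symbol pair across the corpus. Educational only."""
    # Start with each word as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

merges = learn_bpe_merges(["low", "low", "lower", "lowest"], 2)
print(merges)
```

On this tiny corpus the first two learned merges combine "l"+"o" and then "lo"+"w", so the frequent stem "low" becomes a single token while rarer suffixes stay split.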
Tokenization effects:
- Token count affects API costs (pricing is per token, not per word)
- Context window limits are measured in tokens, not words
- Rare or unusual words split into multiple tokens, so they consume more of the budget
- Different languages tokenize differently; non-English text often needs more tokens per word
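Because counts matter for cost and context limits, a quick estimate is often useful. The sketch below uses two common rules of thumb for English (~4 characters per token, ~0.75 words per token); real counts vary by tokenizer and language, so treat it as a ballpark, not a quote:

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text token estimate from common rules of thumb:
    ~4 characters per token and ~0.75 words per token. Actual counts
    depend on the tokenizer and the language."""
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    # Average the two heuristics to smooth out short/long-word extremes.
    return round((by_chars + by_words) / 2)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))
```

For an exact count you would run the text through the model's own tokenizer (e.g. tiktoken for OpenAI models); the heuristic is only for back-of-envelope planning.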
Business Context
Understanding tokenization helps estimate API costs, work within context limits, and debug unexpected model behaviour with unusual text.
How Clever Ops Uses This
We help Australian businesses understand tokenization for cost estimation and optimisation when deploying LLM solutions.
Example Use Case
"Estimating costs: a 1000-word document is roughly 1300 tokens with GPT-4, so processing 10,000 documents would cost approximately $X at current API rates."
Related Terms
- Tokens: the basic units of text that LLMs process
- Context Window: the maximum amount of text, measured in tokens, that an LLM can process at once
- LLM (Large Language Model): AI models trained on vast amounts of text
