The process of breaking text into smaller units (tokens) that AI models can process. Tokens might be words, subwords, or characters depending on the tokenizer.
Tokenization converts raw text into the sequence of numeric token IDs that a language model actually processes. Understanding tokenization helps explain model behaviour and costs.
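As a concrete illustration, the short sketch below uses OpenAI's tiktoken library to encode a sentence into token IDs and decode them back. The choice of the cl100k_base encoding (the one used by GPT-4-era models) is an assumption for the example; other models use other encodings.

```python
# Minimal tokenization sketch using OpenAI's tiktoken library
# (pip install tiktoken). cl100k_base is assumed here; swap in the
# encoding that matches your model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization breaks text into smaller units."
token_ids = enc.encode(text)                     # list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]    # each ID mapped back to its text piece
                                                 # (longer words may split into subwords)

print(token_ids)
print(pieces)
print(enc.decode(token_ids))  # round-trips to the original string
```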
Tokenization approaches:

- Word-level: splits on whitespace and punctuation; simple, but struggles with rare or unseen words.
- Character-level: never runs out of vocabulary, but produces very long sequences.
- Subword (BPE, WordPiece, SentencePiece/unigram): the dominant approach in modern LLMs, merging frequent character sequences into reusable tokens.
Key tokenizers:

- Byte-Pair Encoding (BPE): used by GPT models, available through OpenAI's tiktoken library.
- WordPiece: used by BERT and related models.
- SentencePiece: used by Llama, T5, and many multilingual models.
Tokenization effects:

- Cost: API pricing is per token, so token counts drive usage costs.
- Context limits: model context windows are measured in tokens, not words or characters.
- Behaviour: unusual text (non-English languages, code, long numbers, rare words) often splits into many more tokens, which increases cost and can degrade output quality.
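To make these effects concrete, the sketch below (again using tiktoken, with cl100k_base as an assumed encoding) counts tokens for a few kinds of input; exact counts will vary by tokenizer, but unusual text generally costs more tokens per character.

```python
# Compare how different kinds of text tokenize under the same encoding.
# Counts are illustrative; the sample strings are arbitrary examples.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "plain English": "The quick brown fox jumps over the lazy dog.",
    "code snippet": "def add(a, b):\n    return a + b",
    "non-English": "Umsatzsteuervoranmeldung",   # long German compound word
    "numbers": "3.14159265358979323846",
}

for label, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{label}: {len(text)} characters -> {n_tokens} tokens")
```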
Understanding tokenization helps estimate API costs, work within context limits, and debug unexpected model behaviour on unusual text.
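For working within context limits, counting tokens before sending a request is often enough. The sketch below assumes a hypothetical 8,000-token budget rather than any specific model's published limit, and reserves some of it for the model's reply.

```python
# Check whether a document fits within an assumed context budget before
# sending it to a model. Both figures below are hypothetical placeholders.
import tiktoken

CONTEXT_LIMIT = 8_000        # hypothetical total token budget
RESERVED_FOR_OUTPUT = 1_000  # leave room for the model's reply

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(document: str) -> bool:
    """Return True if the document leaves room for the reply within the budget."""
    return len(enc.encode(document)) <= CONTEXT_LIMIT - RESERVED_FOR_OUTPUT

doc = "This is a stand-in for a real document. " * 200
print(fits_in_context(doc))
```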
We help Australian businesses understand tokenization for cost estimation and optimisation when deploying LLM solutions.
"Estimating costs: a 1000-word document is roughly 1300 tokens with GPT-4, so processing 10,000 documents would cost approximately $X at current API rates."