Inference
Using a trained model to make predictions or generate outputs on new data. This is the "runtime" phase of AI, as opposed to training.
In-Depth Explanation
Inference is the process of using a trained AI model to produce outputs from new inputs. While training teaches the model, inference puts that learning to work on real-world data.
The inference process:
- Receive input data (text prompt, image, etc.)
- Preprocess and tokenise the input
- Pass through the model's neural network
- Generate output predictions
- Post-process into the final result
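The steps above can be sketched end to end. This is an illustrative toy only: the tokeniser, vocabulary, and "model" below are stand-ins I've invented for the example, not a real neural network or any particular library's API.

```python
# Toy sketch of the inference pipeline: receive input -> tokenise ->
# forward pass -> post-process. All components are illustrative stand-ins.

def tokenise(text: str) -> list[int]:
    # Preprocess: lower-case the input, then map each word to a token id.
    vocab = {"hello": 1, "world": 2, "there": 3}
    return [vocab.get(word, 0) for word in text.lower().split()]

def model_forward(token_ids: list[int]) -> list[int]:
    # Stand-in for the neural network's forward pass: a fixed lookup
    # that "predicts" a continuation token for each input token.
    continuation = {0: 0, 1: 3, 2: 1, 3: 2}
    return [continuation[t] for t in token_ids]

def detokenise(token_ids: list[int]) -> str:
    # Post-process: turn predicted token ids back into text.
    inverse = {0: "<unk>", 1: "hello", 2: "world", 3: "there"}
    return " ".join(inverse[t] for t in token_ids)

def infer(prompt: str) -> str:
    # The full runtime path from raw input to final result.
    return detokenise(model_forward(tokenise(prompt)))

print(infer("Hello world"))  # -> "there hello"
```

Real systems replace the lookup with a trained network, but the shape of the pipeline (preprocess, forward pass, post-process) is the same.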
Key inference considerations:
- Latency: Time from input to output
- Throughput: Requests processed per second
- Cost: Compute resources and API fees
- Accuracy: Quality of outputs
- Reliability: Consistency and uptime
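Latency and throughput are straightforward to measure in practice. A minimal sketch, assuming a stubbed `run_inference` function in place of a real model call:

```python
import time

def run_inference(prompt: str) -> str:
    # Stub standing in for a real model call; sleep simulates compute time.
    time.sleep(0.01)
    return prompt.upper()

def measure(prompts: list[str]) -> dict:
    # Latency: time from input to output, per request.
    # Throughput: requests completed per second, overall.
    latencies = []
    start = time.perf_counter()
    for p in prompts:
        t0 = time.perf_counter()
        run_inference(p)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    return {
        "avg_latency_s": sum(latencies) / len(latencies),
        "throughput_rps": len(prompts) / total,
    }

stats = measure(["first prompt", "second prompt", "third prompt"])
```

Tracking these two numbers over time is usually the first step before attempting any of the optimisations below.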
Inference optimisation techniques:
- Model quantisation (reduce precision)
- Batching multiple requests
- Caching common queries
- Using smaller models where appropriate
- Streaming outputs progressively
- Speculative decoding for speed
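Two of the techniques above, caching and batching, can be combined in a few lines. This is a hedged sketch: `cached_infer` is a hypothetical model call, and the "batch" is a plain loop standing in for a genuinely vectorised forward pass.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_infer(prompt: str) -> str:
    # The expensive model call runs only on a cache miss; repeated
    # identical prompts are served from memory at near-zero cost.
    return f"answer to: {prompt}"

def batched_infer(prompts: list[str]) -> list[str]:
    # Group requests so one pass handles many inputs. In a real
    # system this loop would be a single vectorised model call.
    return [cached_infer(p) for p in prompts]

results = batched_infer(["what is AI?", "what is AI?", "pricing?"])
# The second identical prompt is a cache hit:
print(cached_infer.cache_info().hits)  # -> 1
```

Caching pays off for chatbots and FAQs, where a small set of prompts dominates traffic; batching pays off whenever requests arrive faster than the model can serve them one at a time.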
Business Context
Inference is your ongoing AI expense: training is largely a one-off investment, but every prediction the model serves incurs compute or API fees. Optimising inference speed and efficiency therefore directly impacts both operational costs and user experience.
How Clever Ops Uses This
Example Use Case
"When a customer asks your chatbot a question, the model performs inference to generate a response in real-time."
Related Resources
Latency
The time delay between sending a request and receiving a response from an AI sys...
Tokens
The basic units of text that LLMs process. Roughly 1 token = 4 characters or 0.7...
Batching
Processing multiple requests or data points together in a single operation rathe...
