Latency
The time delay between sending a request to an AI system and receiving the response. Critical for real-time applications.
In-Depth Explanation
Latency in AI systems measures the time from when a request is sent to when a response is fully received. For user-facing applications, latency directly impacts user experience and satisfaction.
Latency components:
- Network latency: Time to reach the API
- Queue time: Waiting for processing capacity
- Processing time: Model inference duration
- Token generation: Time to produce output
- Response transmission: Sending results back
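From the client's side, these components collapse into one number: elapsed wall-clock time around the call. A minimal sketch of measuring that, using a stand-in function in place of a real API client (the `fake_model_call` helper is hypothetical, simulating the round trip with a sleep):

```python
import time

def measure_latency_ms(request_fn, *args, **kwargs):
    """Time a single request end to end; returns (result, latency in ms)."""
    start = time.perf_counter()
    result = request_fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0

# Hypothetical stand-in for a real API call; the sleep simulates
# network latency + queue time + inference + transmission combined.
def fake_model_call(prompt):
    time.sleep(0.05)  # ~50 ms simulated round trip
    return f"echo: {prompt}"

result, ms = measure_latency_ms(fake_model_call, "hello")
```

In production you would wrap your actual client call the same way and log the result, which lets you separate client-observed latency from the server-side numbers your provider reports.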
Factors affecting latency:
- Model size (larger = slower)
- Input/output length
- Server load and capacity
- Geographic distance to API
- Batch size and queuing
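The input/output-length factor can be made concrete with a rough back-of-envelope model: total latency is roughly time-to-first-token plus output length divided by generation speed. The default figures below are illustrative assumptions, not measured values for any particular provider:

```python
def estimate_latency_ms(output_tokens: int,
                        ttft_ms: float = 300.0,
                        tokens_per_sec: float = 50.0) -> float:
    """Rough end-to-end estimate: time to first token plus generation time.
    ttft_ms and tokens_per_sec are assumed example figures."""
    return ttft_ms + (output_tokens / tokens_per_sec) * 1000.0
```

Under these assumptions, a 100-token reply lands around 2.3 s, which is why trimming verbose outputs is one of the cheapest latency wins available.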
Latency benchmarks:
- Excellent: <500ms total
- Good: 500ms-2s
- Acceptable: 2-5s
- Poor: >5s
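These bands are easy to encode as a monitoring helper, for example to tag latency measurements in dashboards or alerts. A minimal sketch following the thresholds above:

```python
def latency_tier(total_ms: float) -> str:
    """Map an end-to-end latency (in ms) to the benchmark bands above."""
    if total_ms < 500:
        return "excellent"
    if total_ms < 2000:
        return "good"
    if total_ms <= 5000:
        return "acceptable"
    return "poor"
```

Feeding each request's measured latency through a function like this makes it trivial to track what fraction of traffic falls outside the acceptable band.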
Optimisation strategies:
- Use streaming for perceived speed
- Choose appropriate model size
- Cache common responses
- Optimise prompts (fewer tokens)
- Use edge deployments
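Of these strategies, caching common responses is often the simplest to adopt. A sketch using Python's standard `functools.lru_cache`, with a hypothetical `call_model` stand-in (the sleep and the call counter are there only to simulate an expensive API request):

```python
from functools import lru_cache
import time

calls = {"count": 0}

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model API request.
    calls["count"] += 1
    time.sleep(0.01)  # simulated network + inference cost
    return prompt.upper()

@lru_cache(maxsize=1024)
def cached_call(prompt: str) -> str:
    return call_model(prompt)

cached_call("hi")  # first call hits the model
cached_call("hi")  # repeat is served from the cache at near-zero latency
```

Note the trade-off: caching only helps for exact-match repeated prompts, and cached answers go stale if the underlying model or data changes, so real deployments usually add a time-to-live.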
Business Context
How Clever Ops Uses This
Example Use Case
"A chatbot with 500ms latency feels instant; 5 seconds feels broken. The difference significantly impacts user satisfaction and adoption."
Related Resources
- Inference: Using a trained model to make predictions or generate outputs on new data. This ...
- Batching: Processing multiple requests or data points together in a single operation rathe...
- Streaming: Sending AI model output incrementally as it's generated rather than waiting for ...