AI models that can process and generate multiple types of data - text, images, audio, and video. GPT-4V and Gemini are examples of multi-modal models.
Multi-modal AI systems can understand and work with multiple types of data in a unified way. Rather than relying on separate models for text, images, and audio, a multi-modal model processes all inputs together and understands the relationships between modalities.
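To illustrate, a single API request can carry both text and an image. Below is a minimal sketch assuming the OpenAI Python SDK and a vision-capable model; the model name and image URL are placeholders, not a prescription:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One request combining two modalities: a text question and an image.
response = client.chat.completions.create(
    model="gpt-4o",  # assumption: substitute any vision-capable model you have access to
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this image show?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```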
Capabilities of multi-modal models:
- Processing and generating text, images, audio, and video within a single model
- Understanding relationships between modalities, such as answering questions about an image
Leading multi-modal models:
- GPT-4V (OpenAI)
- Gemini (Google)
Business applications:
- Processing invoices and other documents that combine text and images (see the sketch after this list)
- Analysing visual content such as product photos
- Building richer user experiences that combine text and images
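For document processing, the same pattern extends to structured extraction. A hedged sketch, again assuming the OpenAI Python SDK; the file name and field list are illustrative only:

```python
import base64
import json

from openai import OpenAI

client = OpenAI()

# Encode a local scanned invoice as a data URL - a common way to send
# images that are not publicly hosted.
with open("invoice.jpg", "rb") as f:  # placeholder file name
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable model
    response_format={"type": "json_object"},  # ask for machine-readable output
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract supplier, invoice_number, date, and total "
                     "from this invoice. Respond as JSON."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
invoice = json.loads(response.choices[0].message.content)
print(invoice)
```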
We leverage multi-modal capabilities for Australian businesses in document processing, visual inspection, and more natural user interactions.
"Uploading a photo of a product defect and asking AI to describe the issue, classify its severity, and suggest remediation."