The process of detecting and correcting errors, inconsistencies, and inaccuracies in datasets to improve data quality.
Data cleaning (or data cleansing) is the process of identifying and correcting errors in data. It's a critical but often underestimated part of any data or AI project.
Common data quality issues:
Cleaning techniques:
Tools for data cleaning:
The 80/20 rule:
"Garbage in, garbage out": an AI model can only be as good as the data it is trained on, so investing in data cleaning pays dividends in model performance.
We emphasise data cleaning in AI projects for Australian businesses because quality data is the foundation of reliable AI systems.
"Cleaning customer database: deduplicating merged accounts, standardising addresses to postal format, filling missing postcodes from suburb lookup."
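The three steps in that example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production routine: the field names, the `SUBURB_POSTCODES` and `STREET_TYPES` tables, and the rule that two records are duplicates when their normalised email matches are all hypothetical. A real project would use an authoritative postcode dataset and a more careful matching rule.

```python
import re

# Hypothetical suburb-to-postcode lookup (illustrative entries only).
SUBURB_POSTCODES = {"PARRAMATTA": "2150", "FITZROY": "3065"}

# Hypothetical abbreviation table for street types.
STREET_TYPES = {"STREET": "ST", "ROAD": "RD", "AVENUE": "AVE"}


def standardise_address(address):
    """Upper-case, collapse whitespace, and abbreviate known street types."""
    words = re.sub(r"\s+", " ", address.strip().upper()).split(" ")
    return " ".join(STREET_TYPES.get(word, word) for word in words)


def clean_customers(records):
    """Return cleaned copies of the records, keeping the first of each duplicate."""
    seen = set()
    cleaned = []
    for record in records:
        row = dict(record)  # copy so the input is not modified
        row["address"] = standardise_address(row["address"])
        row["suburb"] = row["suburb"].strip().upper()
        # Fill a missing postcode only when the suburb is in the lookup;
        # otherwise the value becomes None rather than a guess.
        if not row.get("postcode"):
            row["postcode"] = SUBURB_POSTCODES.get(row["suburb"])
        # Assumed duplicate rule: same email after trimming and lower-casing.
        key = row["email"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


customers = [
    {"email": "Ana@example.com", "address": "12  smith street",
     "suburb": "Parramatta", "postcode": ""},
    {"email": "ana@example.com ", "address": "12 Smith St",
     "suburb": "parramatta", "postcode": "2150"},
    {"email": "bo@example.com", "address": "5 High Road",
     "suburb": "Fitzroy", "postcode": None},
]
result = clean_customers(customers)
```

Here the second record is dropped as a duplicate of the first, and the two remaining records get standardised addresses and postcodes filled from the lookup. Dropping later duplicates is the simplest policy; properly merging accounts would also need rules for reconciling fields that conflict between the copies.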