Clean data – the holy grail of data science


Data scientists around the world spend countless hours transforming messy data into clean data. Of course, you can automate some data cleaning processes, but can you avoid data cleaning entirely?

Data cleaning is the most time-consuming task in a data science project. It is the process of improving data quality, which results in better data modeling and better predictions. No model can make good predictions with bad data. Furthermore, data cleaning is closely related to data collection. In fact, I believe eliminating data collection errors is the key to the holy grail of data science.

Data collection & data cleaning

Before jumping into data cleaning, you should first understand the journey of the data from its collection to the dataset you are working on. If you extracted it with a database query, you should also understand how it got into the database. Was it collected with a digital entry form, or was it filled out on a paper form and later digitized with computer vision? Understanding the collection technique is critical to detecting errors in the data.

Journey of data diagram

User errors

Humans are prone to errors. We can easily make a mistake while entering data or even forget about the data entry entirely. We can also enter the same piece of data more than once. Sometimes it is a mistake, but sometimes we simply cannot tell.

Users can create all of the most common data quality issues. Therefore, it is extremely important to implement sufficient data entry validations that can catch most of them.

User errors usually require a lot of effort to identify and correct.
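
As a rough illustration, here is a minimal sketch of the kind of entry validation I have in mind, assuming a hypothetical form with "email" and "age" fields (the field names and limits are my own assumptions, not a prescription). Even a few simple checks stop missing, malformed, or implausible values before they ever reach the database.

```python
import re

# A minimal sketch of entry-time validation for a hypothetical form
# with "email" and "age" fields (field names and limits are assumptions).
def validate_entry(entry: dict) -> list[str]:
    errors = []

    # Required fields must be present and non-empty.
    for field in ("email", "age"):
        value = entry.get(field)
        if value is None or not str(value).strip():
            errors.append(f"'{field}' is missing")

    # Simple format check for the email address.
    email = str(entry.get("email") or "")
    if email and not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email):
        errors.append("'email' has an invalid format")

    # A range check catches obvious typos such as an extra digit.
    age = str(entry.get("age") or "")
    if age.isdigit() and not 0 < int(age) < 120:
        errors.append("'age' is outside a plausible range")

    return errors


print(validate_entry({"email": "jane@example.com", "age": "290"}))
# -> ["'age' is outside a plausible range"]
```

Catching an error at entry time is far cheaper than tracking it down in the dataset months later.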

Sensor errors

The ongoing digital transformation is introducing different kinds of new technologies that can collect data without human intervention. Sensors can measure temperature or pressure, collect geolocation data, etc. However, like other machines, they can break. Faulty sensors can fail to collect data or even collect incorrect data.

Every sensor output should be monitored to identify missing data, irregular reading timestamps, and data drift. Reading time intervals should be consistent with the plan, and the data should stay within predefined boundaries.

Identifying missing data and inconsistent time intervals is quite easy. However, detecting data drift and the resulting outliers can be much harder.
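
As a sketch of such monitoring, assuming the readings arrive as a pandas DataFrame with a timestamp and a temperature column (the column names, interval, and boundaries are illustrative assumptions), the basic checks could look roughly like this:

```python
import pandas as pd

# Hypothetical sensor readings: a timestamp column and a temperature column.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 00:10",
        "2023-01-01 00:30", "2023-01-01 00:40",
    ]),
    "temperature": [21.5, 21.7, None, 85.0],
})

EXPECTED_INTERVAL = pd.Timedelta(minutes=10)   # planned reading frequency
LOWER, UPPER = -20.0, 60.0                     # plausible physical boundaries

# 1. Missing data: readings that were never recorded or arrived empty.
missing = readings[readings["temperature"].isna()]

# 2. Inconsistent time intervals: gaps larger than the planned frequency.
gaps = readings[readings["timestamp"].diff() > EXPECTED_INTERVAL]

# 3. Out-of-boundary values: a crude proxy for faulty sensors.
out_of_bounds = readings[
    (readings["temperature"] < LOWER) | (readings["temperature"] > UPPER)
]

print(missing, gaps, out_of_bounds, sep="\n\n")
```

Boundary checks like these only catch gross failures; detecting gradual drift usually requires comparing the readings against a reference sensor or a statistical baseline.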

API & synchronization errors

Data errors can also emerge while the data is being transferred to the database. Software bugs can result in multiple POST requests that generate duplicates. On the other hand, if the system uses local databases, synchronization issues are possible. If the local and global databases are not synced correctly, duplicated or missing data is inevitable.

Finding the duplicates is relatively easy since the records are usually exactly the same; missing data, however, is much harder to detect. Thankfully, this kind of error can’t generate outliers.
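
Because the duplicated records are typically byte-for-byte identical, exact-duplicate detection is usually enough. A minimal pandas sketch, with a made-up orders table, could look like this:

```python
import pandas as pd

# Records duplicated by repeated POST requests are usually exact copies,
# so exact-duplicate detection is enough to find them.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount":   [9.99, 24.50, 24.50, 5.00],
    "created":  ["2023-01-01", "2023-01-02", "2023-01-02", "2023-01-03"],
})

# Flag every row that is an exact copy of an earlier one ...
duplicates = orders[orders.duplicated(keep="first")]

# ... and keep a single copy of each record.
deduplicated = orders.drop_duplicates(keep="first")

print(duplicates)
print(deduplicated)
```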

Transformation errors

While working on a specific task, you often need data from different database tables. Sometimes you merge data from multiple tables that share some of their data. Incorrect SQL joins are another source of duplicated or missing data. Missing data can also be the result of bad WHERE conditions.

Transformation errors are the easiest to detect since they usually produce duplicated IDs and/or timestamps.
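
As an illustration, here is a small pandas sketch (with made-up customer and address tables) that surfaces the duplicated IDs an incorrect one-to-one join leaves behind:

```python
import pandas as pd

# Hypothetical tables: we expect exactly one address per customer, but the
# address table accidentally contains a duplicated customer_id.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ana", "Bor", "Cil"],
})
addresses = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["Ljubljana", "Maribor", "Celje", "Koper"],
})

merged = customers.merge(addresses, on="customer_id", how="left")

# Duplicated IDs after a merge that should be one-to-one reveal the bad join.
print(merged[merged["customer_id"].duplicated(keep=False)])

# pandas can also enforce the expected cardinality and fail fast instead:
# customers.merge(addresses, on="customer_id", validate="one_to_one")
# -> raises MergeError because customer_id 2 appears twice in `addresses`.
```

Enforcing the expected join cardinality up front turns a silent data quality issue into a loud error.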

Conclusion

Clean data is a prerequisite for impactful data analysis and data modeling. Unfortunately, real-world data is never clean. I believe it is extremely important to identify the root cause of every data quality issue. This will not only help you with data cleaning but also enable you to improve your collection techniques and prevent future errors.

References:

Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says (forbes.com)

Dirty Data – Outlier
