Clean data is a must for reliable analysis.
Without clean data, your analysis might be erroneous, biased, and lead to no actionable insights.
That is why data cleaning might take up a majority of your workload as data analyst.
While there are multiple tools and techniques to ensure data’s tidiness, junior data analysts often struggle to do a thorough cleaning. You might have some idea of what needs to be cleaned and why it is important. But you still wonder: Where should I start? How can I be sure that data are ready to be analyzed?
For a more complete and systematic cleaning, here is a checklist that could help you rest assured that things are under control.
But first, what is data cleaning?
To clean data is to transform them from raw state into consistent state, so that they fit for analysis. In other words, with this process, you want to ensure that data are: 1) free of technical errors, and 2) consistent with the domain knowledge.
For a breakdown, it means two things:
1. Technically speaking, data must be stored in a comprehendible format with recognizable column names and in the right format. For example, values in date columns must be in the Date type, not the String type.
2. Regarding domain knowledge, data must conform to constraints based on knowledge of the subject. An example of violation might be a row indicating that a child of five years old holds a driver's license, which contradicts the law of driving age. In this phase, you also deal with missing values and outliers.
Checklist for thorough data cleaning
To ensure the technical correctness of data and their consistency with the domain knowledge, you must thoroughly inspect data, develop a plan for cleansing operations and, when needed, justify why you choose to handle an error in a certain way.
This checklist aims to help beginners proceed the cleansing process with more confidence, knowing where to start and when to stop.
1. From raw to technically correct data
Data are technically correct when they are stored in a comprehendible format like table or data frame with recognizable column names and in the right type.
So, here is a list of questions to inspect data’s technical correctness.
1.1 Are data in a readable format like table for data frame?
If the answer is No, you must convert data into understandable forms.
1.2 Are data stored with recognizable column names?
In other words, can you tell what variable the field represents just by looking at the column name? If not, change the column name to a clearer one. Without a clear understanding of data, you will have trouble in analyzing them.
1.3 Are the observations unique?
Put differently, do data contain duplicates? Duplicates lead to faulty analysis. Fortunately, it is easy to identify and remove duplicates in either Excel or R.
1.4 Are data in the right type?
Data can be a number, a string of characters, a Boolean, or a date. Errors occur when values of a certain type are stored in another. For example, miscalculation happens when values in a numeric field are stored as text.
1.5 Are data in a consistent format?
This question refers to the uniformity of data. Are all data in a column stored in the same format?
There are multiple ways of writing dates, numeric numbers, phone numbers, etc. No matter what format is chosen, make sure that it is used consistently.
Here are several checkpoints to ensure that data are in consistent formats:
Date format: Is the date format used uniformly across the data set? While dates in many parts of the world are often written as DD-MM-YYYY, date data from the United States might be written as MM-DD-YYYY. So, be careful.
Case check: Are all values in text columns written in the same case? When one category is written both as annual member and Annual member, expect the error in your calculations. So, choose one format and apply it through the whole data set.
Text format: Do all values in text columns following a specific standard like phone numbers or postal codes have the same formatting? Make sure that all postal codes are stored consistently as XXX-XXX or XXX XXX, instead of both ways.
1.6 Are data free of white spaces?
A neglected white space before or after values can lead to misinterpretation, miscalculations, and other complications when merging data. A tidy data set has no place for white spaces.
2. From technically correct data to analyzable data
Once data are technically correct, you must guarantee their readiness for the analysis by ensuring that:
data conform with domain knowledge, and
missing values, obvious errors and outliers are handled.
You can ask three following questions to inspect the analyzability of the data.
2.1 Does data correspond to real-world knowledge about the subject?
Data must be consistent with constrains based on knowledge of the domain. It will be handy to verify predefined rules for the data set. For example, a rule that the Age column must fall in a certain range helps identify those out of this range.
Here are some checkpoints to inspect data’s consistency with real-world knowledge.
Data range check (for integral data): Do values fall within expected range?
Category check (for categorical data): Do categorical columns have correct and consistent categories? Are different entries used for the same category?
Length check (for text or string data): Do text columns with specific standards have consistent string lengths? The postal code for Canada has six characters. Therefore, any value with under or over six is erroneous.
2.2 Are there any missing values?
Data can be missing randomly or systematically. You can choose to remove them all at once or fill in the blank. No matter what operation you choose, make sure you can justify it with the domain knowledge.
2.3 Are there any outliers in the data set?
Though outliers are not necessarily errors, they can skew your analysis. Hence, a justification based on domain knowledge is needed whether you choose to remove or keep them.
Sources
de Jonge & van der Loo (2013). An Introduction to data cleaning with R. Available at: https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
Data cleaning checklist by DataCamp: https://www.datacamp.com/blog/infographic-data-cleaning-checklist
Vadali (Dec, 2017). Day 7: Data cleaning — All you need to know about it. Available at: https://becominghuman.ai/day-7-data-cleaning-all-that-you-need-to-know-about-it-23b05738abe7


