Top Five Rules for Cleaning Data in a Strategic Project

Michael Watson Ph.D Partner

After receiving a lot of feedback on our video interview on data cleaning, we just published an article on Supply Chain Digest on the top five rules for cleaning data in a strategic project:

#1 Be patient.

Usually by the time the project starts, the management team wants (or has been promised) fast results. It may do much more harm than good to show initial results without first coming up with a clean data set. The person doing the project needs to set expectations that it will take time to develop a clean data set. The management team needs to realize that if they skip the data cleaning step, they may regret it later.

#2 Assume that the data is neither complete nor correct.

No matter what you have been told or hope about the data, you should force the analysis to prove that it is correct. It is like the old saying in journalism, “if your mom says she loves you, check it out.” A rigorous check often reveals innocent mistakes, data that wasn’t entered correctly, or missing data. And, if the data turns out to be clean, a rigorous check doesn’t take very long.
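
As a rough illustration of what a first pass at such a check might look like, here is a minimal Python sketch that counts missing values and duplicated keys in a set of records. The function name, the field names (order_id, sku, qty), and the sample rows are all made up for the example; a real project would add checks specific to its own data.

```python
from collections import Counter

def profile_rows(rows, required, key):
    """Count missing required fields and duplicated key values.

    `rows` is a list of dicts; `required` and `key` are hypothetical
    field names chosen for this example.
    """
    missing = Counter()
    for row in rows:
        for field in required:
            value = row.get(field)
            # Treat None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[field] += 1
    key_counts = Counter(row.get(key) for row in rows)
    duplicates = sorted(k for k, n in key_counts.items()
                        if k is not None and n > 1)
    return {
        "row_count": len(rows),
        "missing": dict(missing),
        "duplicate_keys": duplicates,
    }

# Made-up sample records, only to exercise the function.
shipments = [
    {"order_id": "A1", "sku": "X", "qty": 10},
    {"order_id": "A2", "sku": "", "qty": 5},
    {"order_id": "A2", "sku": "Y", "qty": None},
]
report = profile_rows(shipments, required=["sku", "qty"], key="order_id")
print(report)
```

If the data really is clean, a report like this comes back empty in seconds, which is the point of the rule: the check is cheap when nothing is wrong.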

#3 Cross check with other data sources and be as granular as possible.

When validating and checking data, it is good to cross-reference similar data from different sources. For example, the sales data from the demand planning system should match the shipment data and should match the financial data. Also, make sure you don’t just check the summary-level data. We have found many cases where the summary-level data matched, but there were many problems with the details.
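
To make the summary-versus-detail point concrete, here is a small Python sketch that compares two sources at whatever level of granularity you choose. The helper names, field names (month, sku, units), and numbers are assumptions invented for the example, not data from a real project.

```python
from collections import defaultdict

def totals_by_key(rows, key_fields, value_field):
    """Sum `value_field` for each combination of `key_fields`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[f] for f in key_fields)] += row[value_field]
    return dict(totals)

def compare_sources(a, b, key_fields, value_field, tolerance=1e-9):
    """Return {key: (total_in_a, total_in_b)} for keys whose totals differ."""
    totals_a = totals_by_key(a, key_fields, value_field)
    totals_b = totals_by_key(b, key_fields, value_field)
    mismatches = {}
    for key in sorted(set(totals_a) | set(totals_b)):
        va = totals_a.get(key, 0.0)
        vb = totals_b.get(key, 0.0)
        if abs(va - vb) > tolerance:
            mismatches[key] = (va, vb)
    return mismatches

# Made-up data: monthly totals agree (150 vs 150), SKU-level totals do not.
sales = [
    {"month": "2024-01", "sku": "X", "units": 100},
    {"month": "2024-01", "sku": "Y", "units": 50},
]
shipments = [
    {"month": "2024-01", "sku": "X", "units": 120},
    {"month": "2024-01", "sku": "Y", "units": 30},
]
summary_diff = compare_sources(sales, shipments, ["month"], "units")
detail_diff = compare_sources(sales, shipments, ["month", "sku"], "units")
print(summary_diff)
print(detail_diff)
```

In this made-up case the monthly comparison reports no mismatch, while the month-and-SKU comparison flags both SKUs. That is the pattern the rule warns about: errors that cancel out in the roll-up.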

#4 “Data Sushi Principle: Raw Data is Better” (We saw this title at a talk at Strata this year and thought it was appropriate here too).

At the start of a strategic study, we are often asked for data templates. Whenever you ask for data in a specific format, you are introducing plenty of room for errors: the data gets rolled up in ways you don’t want, the data gets calculated in ways that you don’t want, and data gets left out that shouldn’t be. Instead, it is better to define the types of data needed and have the IT team pull the raw data. And, if you have the raw data, you can later go back and fix issues that unexpectedly pop up.
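
One way to picture the benefit: with raw transaction rows in hand, you decide how to roll them up, and you can change that decision later. The Python sketch below is only an illustration; the roll_up helper, the field names (date, sku, type, units), and the sample rows are hypothetical.

```python
from collections import defaultdict

def roll_up(raw_rows, group_by, value_field, keep=lambda row: True):
    """Aggregate raw rows by `group_by`, optionally filtering with `keep`."""
    totals = defaultdict(float)
    for row in raw_rows:
        if keep(row):
            totals[tuple(row[f] for f in group_by)] += row[value_field]
    return dict(totals)

# Made-up raw transactions, including a return recorded as negative units.
raw = [
    {"date": "2024-01-05", "sku": "X", "type": "sale", "units": 10},
    {"date": "2024-01-09", "sku": "X", "type": "return", "units": -2},
    {"date": "2024-01-12", "sku": "Y", "type": "sale", "units": 4},
]

# Net units per SKU (returns included).
net_by_sku = roll_up(raw, ["sku"], "units")

# Gross units per SKU (sales only), computed from the same raw rows.
gross_by_sku = roll_up(raw, ["sku"], "units",
                       keep=lambda r: r["type"] == "sale")
print(net_by_sku)
print(gross_by_sku)
```

Had the data arrived already rolled up as net units per SKU, the gross figure could not be recovered without going back to the source.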

#5 Know your data architects and work with them.

There are IT/business analysts who understand the metadata better than anyone else in the company. They are neither your business users nor your SQL data-extraction experts. It is good to have these people on the team. They will have good insight into how people are filling in the data and can likely help you anticipate possible problems.