<< Back

Dirty Data Is Not Your Friend

Dirty data – it’s not your friend. No, I’m not talking about things that don’t belong in the workplace! I’m talking about informational data sources that have become riddled with misspellings and extra spaces, variations in cases, and three different ways to write the same address. Dirty data can cause all sorts of inadvertent difficulties in your data files and can affect the outcome of your data analysis, too. 

Evidence-based decision making is impossible without clean data.

So how do you fix it? Don’t worry – it’s not that hard to clean up your data. Here are five things you can do today to improve the quality of the information your business relies on.

  1. Removing duplicates. Duplication crops up in every data set and can skew your results, so you want to make sure that each data record is a unique record and not a duplicate. But duplicate data can be tricky to identify. Do you really have 3 customers whose names are Jenny Smith? What about J Smith, Jen Smith, and Jennifer–are they all the same woman, or are there really that many different Jennifers who have the last name Smith? Cross-checking other unique identifiers such as postal code or phone number can help you determine the true duplicates from the merely similar and ensure your results give you truer values. But be careful – sometimes what appears to be duplication is really a data connection that needs to be better addressed in the way you collect information in the first place. If Joe orders pizza on Friday night and Jane orders pizza delivered to the same address on Saturday, do they represent duplicate data or not? Should you be tracking deliveries by address, by person, or by relationship? The answer to managing these types of data connections very much depends on your business. 
  1. Changing case. When you look at a set of data, it might seem obvious to you that 123 Ridge St is in fact the same as 123 RIDGE ST, but changing case within a data field can cause problems when it comes to analyzing your data. In Canada, postal codes are a prime candidate for case issues, as are any other data fields that might be entered in lower case when they should be upper case, including email addresses, invoice numbers, and anything else where case discrepancies could crop up.  

  1. Fixing number formats. The same problem applies to number formats. Do telephone area codes need surrounding parentheses? Do invoice numbers need a hyphen in them? What about dollar amounts–do you round to 2 decimals or 3 when calculating items with fractional penny amounts? Do even numbers need a trailing zero? If your number formats match, your data analysis will be more effective. And watch out for number fields that were accidentally formatted as text–they’ll be ignored when it comes time to run calculations if you don’t fix them first. 

  1. Fixing dates and times. Dates are a chronic problem, especially when you track them numerically. 10/31/2014 is obviously the 31st of October, 2014; but what about 02/04/08 or 4/16? The first could be February 4, 2008 or April 2, 2008–assuming the last 2 digits are the year. As for 4/16, it could be April 16th of this year, but it could just as easily be April of 2016. Without a standardized date format, there is a huge potential for error. 

  1. Those pesky addresses, cities and provinces. Do you keep address information in a single column, or is each element of the address broken down into its own column? If you track address information in a single line, you are losing a lot of potential information. Breaking down by city and province is common and lets you see what people are doing in Ottawa vs. Halifax. Breaking your data down further with separate items for each of street number, apartment number, street name, etc. can give you a valuable source of local information including things like where, precisely, do customers come from? How powerful is word-of-mouth among neighbours for your business? If you are a landscaping company, you might already suspect that someone might admire the work you did for their neighbours and be more likely to hire you themselves, but does that trend work only for the summer the work was done or does it persist for years afterwards, too? And are they more likely or less likely to put in a pool if they live on the south side of the street? Good, clean location data can help you take advantage of those types of answers.