Susan Walsh, Founder & MD, The Classification Guru

With a decade of experience fixing your dirty data, Susan Walsh is the founder and MD of The Classification Guru Ltd, a specialist data classification, taxonomy customisation and data cleansing consultancy. She is an industry thought leader, TEDx speaker and author of ‘Between the Spreadsheets: Classifying and Fixing Dirty Data’. She’s also the founder of COAT. Susan has developed a methodology that her team use to accurately and efficiently classify, normalise, cleanse and check data for errors, which helps prevent costly mistakes. This could save days, if not weeks, of laborious cleansing and classifying, and can help your business find cost and time savings, driving profitability and supporting better, more informed business decisions.


Not sure? Well, the unfortunate truth is probably, as any data that is incorrect can be classed as dirty data! The interpretation of incorrect data, however, varies from person to person. For example, one person might consider DHL a ‘courier’, while another might log it as ‘logistics’ or ‘warehousing’. So, it’s important to define what counts as dirty data for you, so that you can keep your data clean!

Here are some examples of dirty data you can watch out for:

Misspelt Names

This is an easy one to miss. Have you heard of Typoglycemia? It refers to the ability to read words clearly despite the letters being in the wrong order. Take, for example, ABC Printing: this could be entered as ABC Printign. Or there could be a missing letter, such as T Shoemit instead of T Shoesmith, or something much more subtle, like AT Jones instead of TA Jones, which may not be easily picked up. In this busy world we live in, it’s easy to skim over something and think it reads correctly.
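Because the eye skims over these slips, a simple automated check can help. Here is a minimal sketch, assuming you already hold a trusted list of correct names (the `KNOWN_SUPPLIERS` list and the `suggest_correction` helper are hypothetical, and the 0.8 similarity cutoff is an arbitrary starting point). It uses Python's standard `difflib` module to suggest the closest known name for each entry:

```python
from difflib import get_close_matches

# Hypothetical master list of names you trust to be spelt correctly.
KNOWN_SUPPLIERS = ["ABC Printing", "T Shoesmith", "TA Jones"]


def suggest_correction(name, known=KNOWN_SUPPLIERS, cutoff=0.8):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for entry in ["ABC Printign", "T Shoemit", "AT Jones"]:
    print(entry, "->", suggest_correction(entry))
```

A suggestion is only a prompt for a human to check, not a correction to apply blindly: two genuinely different suppliers can have very similar names.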

With regulations such as GDPR in place, it’s doubly important to ensure personal information is spelt correctly. Not only can a misspelt name insult someone and put them off working with you, but it can also lead them to make a complaint.

Incorrect or Misleading Descriptions

Again, all information is up for interpretation, so it’s important to ensure your descriptions are correct. This issue is quite common in invoice and PO descriptions. It could be something as simple as “services” in the description, with a person’s name as the supplier. This vague description leaves us wondering: who are they? The copywriter, the lawyer, another consultant of some sort? You may think it doesn’t matter, but if the data is classified wrongly, say in ‘Professional Services’ instead of ‘Facilities’, you will add up your spend incorrectly.

The consequence is that you then make decisions based on this data. For example, you might think you need to cut back the budget for ‘Professional Services’, when it’s actually in ‘Facilities’ where the large spend lies!

Missing or Incorrect Codes

This issue occurs most commonly in the manufacturing and supply chain industries. With so many areas of production heavily reliant on the product code, it can be catastrophic if there is no code! So why would a product code be missing? There are several reasons. If it’s an older product, it might never have been assigned a code. Or perhaps the code wasn’t available when the product was set up, and no one followed up to add it once the code had been created.
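Missing codes are easy to find once you look for them. The sketch below assumes each product is a simple record with a `product_code` field (the field name, the sample products and the `find_missing_codes` helper are all made up for illustration) and lists every record where the code is absent or blank:

```python
def find_missing_codes(products, code_field="product_code"):
    """Return the products whose code is absent, None, or only whitespace."""
    missing = []
    for product in products:
        code = product.get(code_field)
        if code is None or not str(code).strip():
            missing.append(product)
    return missing


# Illustrative records only.
products = [
    {"name": "Widget", "product_code": "W-100"},
    {"name": "Legacy bracket", "product_code": ""},
    {"name": "New gasket"},
]

for product in find_missing_codes(products):
    print("No product code:", product["name"])
```

Running a check like this on a schedule helps catch the codes that were never followed up.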

And then there’s the “can’t be bothered” aspect. Unfortunately, many people are overworked and underpaid, so they understandably lack the drive to follow the correct processes.

No Standard Formats for Addresses

This dirty data issue is seen A LOT in both supplier and personal data. It’s a very common problem. We have found that the way addresses are recorded changes to some degree in nearly every data set! Sometimes the address is all in one cell, sometimes it is split over a few columns, and then there are the mix-ups, like cities in the county/state column and postal/zip codes in the city column.

Not to mention abbreviations. What is normally considered a benefit (easier to read) now has the potential to become a nightmare! Terrace could be Terr, Place – Plc, Road – Rd, Street – St, etc… and this could lead to near-duplicates, multiple records, and information split between these multiple records, leading to incorrect information and reporting being used in the business.
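One way to tame abbreviations is to expand them to a single agreed form before comparing records. This is a minimal sketch using only the abbreviations mentioned above; the `ABBREVIATIONS` table and the `normalise_address` helper are hypothetical, and a real table would need to be much longer and reviewed for ambiguous cases (for instance, "St" can also mean "Saint").

```python
import re

# Abbreviation -> agreed full form. Illustrative and far from complete.
ABBREVIATIONS = {
    "terr": "Terrace",
    "plc": "Place",
    "rd": "Road",
    "st": "Street",
}


def normalise_address(address):
    """Expand known abbreviations and collapse repeated whitespace."""

    def expand(match):
        word = match.group(0)
        # Look the word up without its trailing full stop, ignoring case.
        return ABBREVIATIONS.get(word.lower().rstrip("."), word)

    expanded = re.sub(r"\b[A-Za-z]+\b\.?", expand, address)
    return " ".join(expanded.split())


print(normalise_address("12 High St."))
print(normalise_address("4  Rose Terr"))
```

Once every address is in the same form, "12 High St." and "12 High Street" stop looking like two different records.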

No Standard Units of Measure

Here is where the minor things really become an issue, so having an eye for detail is key. Imagine one record has a space between the number and the unit of measure, and another does not. This creates near-duplicates, which can cause a lot of issues when you are trying to analyse or report on a specific product.
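As a rough illustration of how a stray space can be tidied up, the sketch below removes the gap between a number and its unit and lower-cases the unit. The `UNITS` list and the `normalise_measure` helper are assumptions for the example; you would swap in the units and the house format your own business uses.

```python
import re

# Units to recognise. An illustrative list, not a complete one.
UNITS = ("ml", "l", "kg", "g", "mm", "cm", "m")

_MEASURE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(" + "|".join(UNITS) + r")\b",
    re.IGNORECASE,
)


def normalise_measure(description):
    """Rewrite measures such as '500 ML' or '500 ml' as '500ml'."""
    return _MEASURE.sub(lambda m: m.group(1) + m.group(2).lower(), description)


for item in ["Water 500 ml", "Water 500ML", "Water 500ml"]:
    print(normalise_measure(item))
```

All three descriptions come out identical, so they can be counted as one product.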

To avoid multiple versions of the same items, make sure you are being clear and specific with your team, and within your processes.

Currency Issues

With so many different types of currencies out there, it’s no wonder this one is easily done. If you are not aware that the values you are working with are in multiple currencies, then you could spend hours trying to get figures to match up.

This is particularly true when working with something like Swedish Krona versus GBP or USD. The Krona values are significantly higher, so it could end up looking like you’ve spent £500 on a taxi…
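A quick sanity check before adding anything up is to count which currencies actually appear in the data. The sketch below assumes each row carries a `currency` field; the field name, the sample rows and both helper functions are hypothetical.

```python
from collections import Counter


def currencies_in(rows, currency_field="currency"):
    """Count how many rows use each currency; blanks are counted as UNKNOWN."""
    return Counter(row.get(currency_field) or "UNKNOWN" for row in rows)


def has_mixed_currencies(rows, currency_field="currency"):
    """Return True when the rows cannot safely be summed as they stand."""
    return len(currencies_in(rows, currency_field)) > 1


# Made-up rows for illustration.
rows = [
    {"description": "Taxi", "amount": 500, "currency": "SEK"},
    {"description": "Taxi", "amount": 35, "currency": "GBP"},
    {"description": "Taxi", "amount": 40},
]

if has_mixed_currencies(rows):
    print("Convert before totalling:", dict(currencies_in(rows)))
```

If more than one currency shows up, convert everything to a single currency first, and chase down any rows where the currency is missing.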

Incorrect/Partially Classified Spend Data

You may think this is better than not having classified data at all, but when it comes to classifying data, it’s got to be done again from scratch. Why? You wouldn’t be using classification services if there wasn’t a problem somewhere, so it is far easier and more time-efficient to work from a clean slate.

You want your complete data to be set to a certain standard. This ensures that new data coming in goes through the same processes, and you do not repeat your mistakes.

Dreaded Duplicates

They’re everywhere: in your invoicing, in your customer/supplier records, in your orders. Duplicates can, and will, appear anywhere. Duplicates create multiple records, meaning the information is split between them, resulting in you only seeing part of the picture.

Likewise, near duplicates are just as frightful! In business this could be PWC, P.W.C, or with personal information, this could be Robert Smith and Bob Smith.
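Near duplicates like PWC and P.W.C can often be found by comparing a stripped-down version of each name. Here is a minimal sketch, assuming that ignoring case, spaces and punctuation is acceptable for your data (the `match_key` and `find_near_duplicates` helpers are hypothetical). It will not catch Robert Smith versus Bob Smith, which would need something like a nickname lookup on top.

```python
import re
from collections import defaultdict


def match_key(name):
    """Upper-case the name and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def find_near_duplicates(names):
    """Group names that share a match key; return groups with 2+ members."""
    groups = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


print(find_near_duplicates(["PWC", "P.W.C", "Acme Ltd"]))
```

Each group it returns is a candidate for merging, so it is still worth having someone review the groups before any records are combined.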

Avoid the consequences of Dirty Data by keeping your data protected

You might think your dirty data is low-risk, but the reality is that without proper protection, your data and your company will succumb to exposure from the elements. We’re talking financial risk, job loss, poor business performance or even fraud!

So, make sure your data has its COAT on – it’s Consistent, Organised, Accurate and Trustworthy!
