
What is data cleaning? Your complete guide.

August 18, 2022
6 minute read

Data cleaning is an integral part of every savvy digital marketer’s tool belt. Data cleaning techniques (sometimes also referred to as data cleansing) are essential for accurate and reliable analysis. But what is data cleaning, and how can you start using it in your day-to-day work?

If you’re fairly new to working with marketing data, you’ll want to understand the ins and outs of what data cleaning is. But first, let’s start with the end goal: to go from raw data to clean data. 

What is clean data?

Clean data is an aggregated set of data that is up to date, free of duplicates, complete (no missing values), consistent in formatting and labeling, and free of numerical outliers that could skew analysis. Basically, clean data is accurate and ready to analyze.

Working with clean data means that you can trust your reports and dashboards and the insights you derive.

What is data cleaning then?

Data cleaning is the act of sifting through your data to detect and correct inaccuracies or any corrupted elements. It is especially valuable for digital marketers who are combining performance data from multiple sources. 

When dealing with smaller data sets, you may be able to perform some level of data cleaning manually. However, as your organization, campaigns, and datasets grow, you will need to employ more sophisticated tools that can identify errors and clean the data for you. 

Unfortunately, there isn’t a single surefire way to go about cleaning your data. Every organization and digital marketer will need to adapt data cleaning best practices to their own needs. However, there are a few general guidelines you can follow as a starting point.

The data cleaning process

Before we dive head first into the data cleaning process, it is important that you ask yourself the question “What is data cleaning going to do for me?” 

For instance, are you collecting purely marketing data from various ad platforms? Are you integrating your performance data with other areas of the business that are outside of your control?

You will want to identify your specific requirements and needs. For our purposes, we will imagine that we are a digital marketer who is seeking to analyze advertising performance data from across Facebook, Google Ads, and YouTube. 

Our data cleaning process will follow 5 main steps:

  1. Ensure the data is up to date
  2. Remove duplicates
  3. Identify and correct missing values
  4. Check the data fields
  5. Remove any outliers 

 

Ensure the data is up to date


While this one may seem obvious, it’s important to make sure that you are collecting and working with up-to-date information. If you work in an agency setting or a marketing department that uses multiple dashboards for different audiences, it can be quite easy to lose track of which data sets have been updated and which haven’t.

In our experience, Monday tends to be report generation day. If you’re working with multiple clients and/or reports, and trying to send them out as quickly as possible, it’s worth taking a moment to just double check that all of your data has been collected recently. 
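
As a minimal sketch of that double check, the snippet below compares the most recent date in each data source against a cutoff. The source names, the dates, and the one-day threshold are all assumptions made for illustration; use whatever freshness rule suits your reporting schedule.

```python
from datetime import date, timedelta

def find_stale_sources(last_updated, today, max_age_days=1):
    """Return the sources whose newest data is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, latest in last_updated.items() if latest < cutoff)

# Hypothetical "most recent date with data" for each source.
last_updated = {
    "facebook": date(2022, 8, 15),
    "google_ads": date(2022, 8, 14),
    "youtube": date(2022, 8, 10),
}

stale = find_stale_sources(last_updated, today=date(2022, 8, 15))
print(stale)  # sources to refresh before sending the report
```

Anything the check returns is a data set to refresh before the report goes out.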

Remove duplicates

Ahh, duplicated data. A common thorn in the side of digital marketers around the world. Data collection processes that scrape data from multiple sources are ripe for this sort of dirty data. Especially with the deprecation of Google’s Universal Analytics, digital marketers should pay particular attention to removing duplicate data. 

Imagine, for instance, that you are comparing website sessions in Universal Analytics with more recent events in Google Analytics 4. Depending on how you’ve set your dimensions and metrics, you may be picking up some double counting of website traffic. 

It’s important to understand the difference in how the two systems view and record your website traffic, then take steps to remove any double counting. Otherwise, it may look like you have a huge spike — or even a drop off — in your web traffic. 
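
For the simplest case, where the exact same row has been collected twice, a sketch like the one below can help. It keeps the first row for each combination of key fields. The field names and sample rows are hypothetical, and note that this only catches literal repeats. It won’t reconcile differences in how two systems define a session or an event.

```python
def drop_duplicate_rows(rows, key_fields):
    """Keep the first row seen for each combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical rows; the first two are the same record collected twice.
rows = [
    {"date": "2022-08-01", "source": "google_ads", "campaign": "summer_sale", "clicks": 120},
    {"date": "2022-08-01", "source": "google_ads", "campaign": "summer_sale", "clicks": 120},
    {"date": "2022-08-01", "source": "facebook", "campaign": "summer_sale", "clicks": 95},
]

deduped = drop_duplicate_rows(rows, key_fields=("date", "source", "campaign"))
total_clicks = sum(row["clicks"] for row in deduped)
print(len(deduped), total_clicks)
```

Choosing the key fields is the real decision here: too few and you throw away legitimate rows, too many and true duplicates slip through.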

Check out our Funnel Tip that goes through how GA4 records events.

 

Identify and correct missing values

If you haven’t run into this issue before, you probably will at some point in your career. Sometimes, data just seems to disappear like socks in the dryer. There isn’t any explaining where it went, but you’re left to deal with the consequences. 

For instance, if your Google Analytics tag wasn’t firing properly for a few days, you may not have session data for those days. Depending on the severity, you may want to remove that section from your current data analysis, try to input an average value based on historical and current performance, or you may need to include a caveat when sharing your dashboards and performance reports. 

Each of these options has its own drawbacks, so tread lightly. You may need to do some cost-benefit analysis to make an educated decision with your team on how best to move forward.
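
Here is a rough sketch of two of those options: finding the gap, then filling it with an average. The session numbers and dates are made up, and a plain average of the recorded days is only one possible estimate. The function also returns the set of estimated dates so you can add a caveat to your report.

```python
from datetime import date, timedelta
from statistics import mean

def find_missing_days(sessions_by_day, start, end):
    """List every date from start to end (inclusive) with no recorded value."""
    all_days = (start + timedelta(days=i) for i in range((end - start).days + 1))
    return [day for day in all_days if day not in sessions_by_day]

def fill_with_average(sessions_by_day, missing_days):
    """Return a filled copy plus the set of dates that were estimated."""
    estimate = round(mean(sessions_by_day.values()))
    filled = dict(sessions_by_day)
    for day in missing_days:
        filled[day] = estimate
    return filled, set(missing_days)

# Hypothetical daily sessions; August 3 and 4 were never recorded.
sessions = {
    date(2022, 8, 1): 400,
    date(2022, 8, 2): 420,
    date(2022, 8, 5): 380,
}

missing = find_missing_days(sessions, date(2022, 8, 1), date(2022, 8, 5))
filled, estimated = fill_with_average(sessions, missing)
print(missing, filled[date(2022, 8, 3)])
```

Keeping the estimated dates separate from the filled values means nobody mistakes an imputed number for a measured one.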

Check the data fields

Just like ensuring your data is up to date, checking the data fields (or any step in this process) should be standard operating procedure every single time you handle data, particularly when pulling data from multiple marketing platforms.

If we envision our campaign running across Google Ads, Facebook, and YouTube, we will find that even the country data fields are handled differently by each platform. Dig further, and you’ll find even more inconsistencies in how the platforms treat the same exact data field. 

It’s important that you, the digital marketer, identify those differences and make a plan to transform those fields to ensure they are consistent. 
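
To make that concrete, here is a small sketch that maps differently formatted country values onto one two-letter code. The raw values and the alias table are invented for illustration and don’t represent what any particular platform actually exports. Anything the table doesn’t recognize is collected for review rather than guessed.

```python
# Hypothetical alias table mapping raw values to ISO 3166-1 alpha-2 codes.
COUNTRY_ALIASES = {
    "us": "US",
    "usa": "US",
    "united states": "US",
    "se": "SE",
    "sweden": "SE",
}

def normalize_country(raw):
    """Map a raw country value to a two-letter code, or None if unrecognized."""
    return COUNTRY_ALIASES.get(raw.strip().lower())

# Made-up raw values as they might arrive from different sources.
raw_values = ["US", "United States", " usa ", "Sweden", "Atlantis"]

normalized = [normalize_country(value) for value in raw_values]
unmapped = [value for value, code in zip(raw_values, normalized) if code is None]
print(normalized, unmapped)
```

Returning None for unknown values is a deliberate choice: a visible gap is easier to fix than a silently wrong country.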

Be sure to check out our recent Funnel Tip in which Alex explains exactly how to adjust these country fields across multiple platforms to create a geo map. 

 

Remove any outliers 

Outliers can be found in almost every single dataset. They can also be a very tricky thing to handle. On one hand, an outlier is legitimately recorded data. It is part of the broader story of your performance. On the other hand, the outlier may be such a fluke (or odd result) that it detracts from your data story. 

It’s up to you to determine on which side of the hazy, gray line your outliers fall. While you don’t want a blatant outlier to affect what would otherwise be a strong correlation, you also don’t want to skew your data and visualizations to confirm any biases you may have.
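
One common way to spot candidates is the interquartile range (IQR) rule, sketched below with made-up daily spend figures. The 1.5 multiplier is a convention, not a law, and the function only flags values. Whether to remove them is still your call.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return (index, value) pairs lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Hypothetical daily ad spend; the last day looks suspicious.
daily_spend = [100, 102, 98, 105, 97, 101, 99, 500]

flagged = flag_outliers(daily_spend)
print(flagged)
```

Flagging instead of deleting keeps the original data intact, so you can still explain the spike if it turns out to be real.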

Once you’ve completed each of these steps, it’s also helpful to go through a validation or quality assurance process. Even if it’s just checking in with a manager or colleague, it’s an important step that can reduce the risk of bias or mistakes further affecting your analysis. A second pair of eyes might just be what you need to prevent typographical errors or data errors from being in your report.

Why is data cleaning important?

As the old mantra goes: garbage in, garbage out. You can’t hope to have high-quality decisions without reliable insights. You can’t have reliable insights without clean data. And to have clean data, you need to implement data cleaning. 

Additionally, if you’re working from strong, reliable data, you are also reducing the risk of rework to fix errors or adjust your reports. This means an increase in overall productivity. It also means happier clients and colleagues.

You know we love analogies and hypothetical examples, so let’s use another one. 

In an agency setting

Imagine you’re working for a digital marketing agency in their strategy and analysis department. Monday rolls around and you have to spit out 15 different performance reports for 15 different clients. 

If you simply rush through the process without properly implementing your data cleaning steps, your clients will end up reviewing reports that may not be accurate. Those reports may be double counting website visits, or even ad spend. 

In turn, that client may start making improperly informed decisions on their marketing investment, perhaps even giving you more budget. If that further investment doesn’t pan out (because the underlying assumptions about performance were wrong), that will reflect negatively on you and the agency. It could even mean a lost client.

Not good. 

The moral of the story: data cleaning should be an integral part of your process every time you handle data.

For a deeper dive into why maintaining data quality is important, check out our blog article all about the topic. 

 

Data cleaning tools

Just as there are many different approaches and processes you can apply to your data cleaning, there is also an ever-growing list of tools to help you do it. Some data cleaning tools employ sophisticated AI to search through every nook and cranny of your data set and anticipate dirty data. 

Some tools support loads of different languages. Meanwhile, other tools boast open source platforms. 

As a marketing data hub built by marketers for marketers, we think data cleaning should be as easy and quick as possible. We prefer solutions that use point-and-click approaches rather than complex code. We also like it when everything works smoothly. 

That’s why we built data cleaning functionality right into our hub model. That way, it’s one seamless process to collect, clean, transform, and share your data anywhere. 

However, as always, different organizations require different approaches that best meet their needs. It’s best to take stock of what your own needs and capabilities are before diving head first into any data cleaning solution. 

Data cleaning vs. data transformation

By now, you may be wondering how data cleaning differs from data transformation. After all, data transformation is the act of altering data, often in the quest to ensure consistency and better analysis.

We admit that it can all be a bit confusing, but think of data cleaning more as identifying inconsistencies and removing items that don’t belong in your data set. Meanwhile, data transformation is the manipulation of data that does belong in your analysis. 

You may want to think of data cleaning as a stepping stone to data transformation.

Where to get started

Now that you are a burgeoning data cleaning expert, where do you start? 

A great place to begin data cleaning is with your campaign naming. It’s a great way to start organizing all of your campaigns to make the data easier to work with. Check out this article for a step-by-step walkthrough.