What is data cleaning? Your complete guide.

Published Aug 18, 2022. Last updated Apr 10, 2024. 7 minute read.
Contributors
  • Sean Dougherty
    Written by Sean Dougherty

    A copywriter at Funnel, Sean has more than 15 years of experience working in branding and advertising (both agency and client side). He's also a professional voice actor.

Data cleaning is an integral part of every savvy digital marketer’s tool belt. It is often considered the foundational step of the data analytics process, because the quality and reliability of your data directly affect the accuracy and validity of the insights and decisions you’ll reach. But what is data cleaning, and how can you start using it in your day-to-day work?

If you’re fairly new to working with marketing data, you’ll want to understand the ins and outs of data cleaning, even if you don’t have an army of data analysts to help. But first, let’s start with the end goal: to go from raw data to clean data.

What is clean data?

Clean data is an aggregated data set that is up to date, free of duplicates and missing values, consistent in formatting and labeling, and free of numerical outliers that could skew analysis. Basically, clean data is accurate and ready to analyze. That matters because if you start the data analytics process with messy data, you’re going to get messy results.

Working with clean, high-quality data sets means that you can trust your reports, your dashboards, and the insights you derive from them.

What is data cleaning?

Data cleaning is the act of sifting through your data to detect and correct inaccuracies or any corrupted elements – thus improving data quality. It is especially valuable for digital marketers who are combining performance data from multiple sources. It is part of the overall data management process.

So who does data cleaning? Having data analysts and experts to hand is ideal when you’re digging into data sets – but it’s not a strict requirement when it comes to data cleaning. Ultimately, the level of expertise needed for cleaning data depends on the complexity of the dataset and the specific cleaning tasks involved. While a data analyst or data scientist may excel in performing advanced data cleaning tasks, simpler cleaning tasks can often be handled by individuals with basic data manipulation skills using tools like spreadsheets or specialized software packages.

When dealing with smaller data sets, you may be able to clean your data manually. However, as your organization, campaigns, and datasets grow, you will need to employ more sophisticated tools that can identify errors and clean the data for you. 

How it's done depends on the data set

Unfortunately, there isn’t a single surefire way to go about cleaning your data. Every organization and digital marketer will need to adapt data cleaning best practices to their own needs and to the data set they’re working with. However, there are a few general guidelines that you can follow and adapt.

The data cleaning process

Before we dive head first into the data cleaning process, it is important that you ask yourself the question “What is data cleaning going to do for me?” 

For instance, are you collecting purely marketing data from various ad platforms? Are you integrating your performance data with other areas of the business that are outside of your control?

You will want to identify your specific requirements and needs. For our purposes, we will imagine that we are a digital marketer who is seeking to analyze advertising performance data from across Facebook, Google Ads, and YouTube. 

Our data cleaning process will follow 5 main steps:

  1. Ensure the data is up to date
  2. Remove duplicates
  3. Identify and correct missing values
  4. Check the data fields
  5. Remove any outliers

Ensure the data is up to date


While this one may seem obvious, it’s important to make sure that you are working with up-to-date information. Stale data is a form of low-quality data.

An example: If you work in an agency or a marketing department that uses multiple dashboards for different audiences, it can be quite easy to lose track of which data set has been updated, and which hasn’t. 

In our experience, Monday tends to be report generation day. If you’re working with multiple clients and/or reports, and trying to send them out as quickly as possible, it’s worth taking a moment to just double check that all of your data is up to date. Otherwise, you might be reporting on 'old' data.
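A freshness check like this is easy to automate. Here is a minimal Python sketch, assuming you can read the most recent date present in each source’s export. The source names, the dates, and the `stale_sources` helper are all made up for illustration.

```python
from datetime import date, timedelta


def stale_sources(last_updated, today, max_age_days=1):
    """Return the names of sources whose newest data is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, latest in last_updated.items() if latest < cutoff)


# Hypothetical "latest date present" per source, e.g. read from each export.
last_updated = {
    "facebook": date(2024, 4, 8),
    "google_ads": date(2024, 4, 7),
    "youtube": date(2024, 4, 3),
}

stale = stale_sources(last_updated, today=date(2024, 4, 8), max_age_days=1)
print(stale)
```

Anything the function returns is worth refreshing before the reports go out.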

Remove duplicates

Ahh, duplicated data. A common thorn in the side of digital marketers around the world. Data collection processes that scrape data from multiple sources are ripe for this sort of dirty data. Especially with the deprecation of Google’s Universal Analytics, digital marketers should pay particular attention to removing duplicate data. 

Imagine, for instance, that you are comparing website sessions in Universal Analytics with more recent events in Google Analytics 4. Depending on how you’ve set your dimensions and metrics, you may be picking up some double counting of website traffic. 

It’s important to understand the difference in how the two systems view and record your website traffic, then take steps to remove any duplicate records. Otherwise, it may look like you have a huge spike — or even a drop off — in your web traffic. 

Identifying and removing duplicate records is a crucial step in data cleaning and preparation for analysis. 
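For exact duplicates, such as the same row pulled in twice, a small script can do the removal. The sketch below is one possible approach, and it does not address the differences in how Universal Analytics and GA4 count traffic. The rows, field names, and `drop_duplicates` helper are hypothetical.

```python
def drop_duplicates(rows, key_fields):
    """Keep the first row seen for each combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Invented example rows; the second one is an exact repeat of the first.
rows = [
    {"date": "2024-04-01", "source": "google_ads", "campaign": "spring_sale", "clicks": 120},
    {"date": "2024-04-01", "source": "google_ads", "campaign": "spring_sale", "clicks": 120},
    {"date": "2024-04-01", "source": "facebook", "campaign": "spring_sale", "clicks": 95},
]

deduped = drop_duplicates(rows, key_fields=("date", "source", "campaign"))
total_clicks = sum(row["clicks"] for row in deduped)
print(len(deduped), total_clicks)
```

Choosing the key fields is the real decision here: too few and you throw away legitimate rows, too many and true duplicates slip through.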

Check out our Funnel Tip that goes through how GA4 records events.

Identify and correct missing values

If you haven’t run into this issue before, you probably will at some point in your career. Sometimes, data just seems to disappear like socks in the dryer. There isn’t any explaining where your missing data went, but you’re left to deal with the consequences. 

An example: If your Google Analytics tag wasn’t firing properly for a few days, you may not have session data for those days. This means you have missing data. Depending on the severity, you may want to remove that period from your current analysis, replace the missing values with an average based on historical and current performance, or include a caveat when sharing your dashboards and performance reports.

Each of these options has its own drawbacks, so tread lightly. You may need to do some cost-benefit analysis with your team to make an educated decision on how best to move forward when data is missing.
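As a rough illustration of the “replace with an average” option, the Python sketch below fills gaps with the mean of the days that were recorded. It also keeps a list of the dates it filled in, so you can still add a caveat to your report. The session numbers are invented, and a plain mean is only one of several reasonable choices.

```python
from statistics import mean


def fill_missing_with_mean(daily_sessions):
    """Replace None with the mean of observed values.

    Returns the filled data and the list of dates that were imputed.
    """
    observed = [v for v in daily_sessions.values() if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fallback = round(mean(observed))
    filled = {d: (fallback if v is None else v) for d, v in daily_sessions.items()}
    imputed = [d for d, v in daily_sessions.items() if v is None]
    return filled, imputed


# Invented data: the tag did not fire on April 2 and 3.
daily_sessions = {
    "2024-04-01": 1000,
    "2024-04-02": None,
    "2024-04-03": None,
    "2024-04-04": 1200,
    "2024-04-05": 1100,
}

filled, imputed = fill_missing_with_mean(daily_sessions)
print(filled["2024-04-02"], imputed)
```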

Check the data fields

Just like ensuring your data is up to date, checking the data fields (like every step in this process) should be standard operating procedure every single time you handle data, particularly when pulling data from multiple marketing platforms.

If we envision our campaign running across Google Ads, Facebook, and YouTube, we will find that even the country data fields are handled differently by each platform. Dig further, and you’ll find even more inconsistencies in how the platforms treat the same exact data field. 

It’s important that you, the digital marketer, identify those differences and make a plan to transform those fields to ensure they are consistent. That way, you can be sure of the data quality before creating and sharing reports.
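One simple way to make country fields consistent is a lookup table that maps every spelling you encounter to a single code. The sketch below assumes two-letter country codes as the target. The spellings in the table are examples, not the actual values any particular platform returns.

```python
# Hypothetical lookup; extend it with the spellings your platforms actually use.
COUNTRY_CODES = {
    "us": "US",
    "usa": "US",
    "united states": "US",
    "se": "SE",
    "sweden": "SE",
}


def normalize_country(value):
    """Map a platform-specific country label to a two-letter code, or None if unknown."""
    return COUNTRY_CODES.get(value.strip().lower())


raw = ["United States", "US", " usa ", "Sweden", "SE", "Atlantis"]
normalized = [normalize_country(v) for v in raw]
print(normalized)
```

Returning `None` for unknown labels, rather than guessing, makes the values that still need attention easy to spot.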

Be sure to check out our recent Funnel Tip in which Alex explains exactly how to adjust these country fields across multiple platforms to create a geo map.

Remove any outliers 

Outliers can be found in almost every single dataset. They can also be a very tricky thing to handle. On one hand, an outlier is legitimately recorded data. It is part of the broader story of your performance. On the other hand, the outlier may be such a fluke (or odd result) that it detracts from your data story. 

It’s up to you to determine which side of that hazy gray line your outliers fall on. While you don’t want a blatant outlier to distort what would otherwise be a clear trend, you also don’t want to skew your data and visualizations to confirm any biases you may have.
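If you’d rather flag outliers for a closer look than delete them outright, a common rule of thumb is to mark anything more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The sketch below applies that rule to invented daily click counts. The 1.5 multiplier is a convention, not a law, so adjust it to your data.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Invented daily clicks; one day is far above the rest.
daily_clicks = [100, 110, 95, 105, 98, 102, 900]

outliers = flag_outliers(daily_clicks)
print(outliers)
```

The function only flags values. Whether a flagged day is a tracking glitch or a genuinely great day is still your call.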

Once you’ve completed each of these steps, it’s also helpful to go through a validation or quality assurance process. Even if it’s just checking in with a manager or colleague, it’s an important step that can reduce the risk of bias or mistakes further affecting your data analysis. A second pair of eyes might just be what you need to prevent typographical errors or data errors from being in your report.
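Some of that quality assurance can be scripted before a colleague ever sees the report. The following sketch shows one hypothetical set of checks (required fields present, no negative spend). The field names and rules are assumptions you would swap for your own.

```python
def validate_rows(rows, required_fields):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    for i, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")
        spend = row.get("spend")
        if isinstance(spend, (int, float)) and spend < 0:
            problems.append(f"row {i}: negative spend")
    return problems


# Invented rows; the second one has an empty date and a negative spend.
report_rows = [
    {"date": "2024-04-01", "source": "facebook", "spend": 50.0},
    {"date": "", "source": "youtube", "spend": -5.0},
]

problems = validate_rows(report_rows, required_fields=("date", "source", "spend"))
print(problems)
```

A script like this complements the second pair of eyes rather than replacing it, since it only catches the mistakes you thought to check for.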

Why is data cleaning important?

As the old mantra goes: garbage in, garbage out. You can’t hope to have high-quality decisions without reliable insights. You can’t have reliable insights without clean data. And to have clean data, you need to implement data cleaning. 

The benefits of data cleaning

If you’re working from high quality, reliable data, you are also reducing the risk of rework to fix errors or adjust your reports. This means an increase in overall productivity. It also means happier clients and colleagues. 

You know we love analogies and hypothetical examples, so let’s use another one. 

In an agency setting

Imagine you’re working for a digital marketing agency in their strategy and analysis department. Monday rolls around and you have to spit out 15 different performance reports for 15 different clients. 

If you simply rush through the process without properly implementing your data cleaning steps, your clients will end up reviewing reports that may not be accurate. Those reports may be double counting website visits, or even ad spend. 

In turn, that client may start making poorly informed decisions about their marketing investment, perhaps even giving you more budget. If that further investment doesn’t pan out (because the underlying assumptions about performance were wrong), that will reflect negatively on you and the agency. It could even mean a lost client.

Not good. 

The moral of the story: data cleaning should be an integral part of your process every time you handle data.

For a deeper dive into why maintaining data quality is important, check out our blog article on the topic.

Data cleaning tools

Just as there are many different approaches and processes you can apply to your data cleaning, there is also an ever-growing list of tools to help you do it. Some data cleaning tools employ sophisticated AI to search through every nook and cranny of your data set and anticipate dirty data. Data quality analysts can work with these tools to find 'dirty data' and clean it.

Some tools support a wide range of languages, while others are built on open source platforms.

Data cleaning for marketing data

As a marketing data hub built by marketers for marketers, we think data cleaning should be as easy and quick as possible. We prefer robust solutions that use point-and-click approaches rather than complex code. 

That’s why we built data cleaning functionality right into our hub model. That way, it’s one seamless process to collect, clean, transform, and share your data anywhere. 

However, as always, different organizations require different approaches that best meet their needs. It’s best to take stock of what your own needs and capabilities are before diving head first into any data cleaning solution. 

Data cleaning, data cleansing or data scrubbing?

Data cleaning, data cleansing and data scrubbing – they all sound similar, but are they the same? Data cleaning is actually part of a process called data cleansing. As we’ve explored, data cleaning involves finding inconsistencies in data and fixing any errors. Data cleansing, on the other hand, is a broader process that, as well as cleaning, includes standardization, validation, removing duplicate data, and if required, enrichment of data. Meanwhile, the terms data cleaning and data scrubbing are sometimes used interchangeably to describe the same process. So now you know!

Data cleaning vs. data transformation

By now, you may be wondering how data cleaning differs from data transformation. After all, data transformation is the act of altering data, often in the quest to ensure consistency and better analysis.

We admit that it can all be a bit confusing, but think of data cleaning more as identifying inconsistencies and removing items that don’t belong in your data set. Meanwhile, data transformation is the manipulation of data that does belong in your analysis. 

You may want to think of data cleaning as a stepping stone to data transformation.
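To make the distinction concrete, here is a small sketch with made-up campaign rows. The `test_` prefix for campaigns that don’t belong and the `cpc` (cost per click) field are assumptions for the example.

```python
# Invented rows; "test_do_not_use" stands in for data that does not belong.
campaign_rows = [
    {"campaign": "spring_sale", "spend": 200.0, "clicks": 400},
    {"campaign": "test_do_not_use", "spend": 5.0, "clicks": 1},
    {"campaign": "brand_video", "spend": 150.0, "clicks": 100},
]

# Cleaning: remove rows that do not belong in the analysis.
cleaned = [row for row in campaign_rows if not row["campaign"].startswith("test_")]

# Transformation: derive a new field (cost per click) from rows that do belong.
transformed = [{**row, "cpc": row["spend"] / row["clicks"]} for row in cleaned]

for row in transformed:
    print(row["campaign"], row["cpc"])
```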

Where to get started

Now that you are a burgeoning data cleaning expert, where do you start? 

A great place to begin data cleaning is with your campaign naming. It’s a great way to start organizing all of your campaigns to make the data easier to work with. Check out this article for a step-by-step walkthrough. 
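If you want to check names against a convention automatically, a regular expression can do it. The convention below (country, channel, objective, and month separated by underscores) is purely hypothetical. Substitute whatever pattern your team agrees on.

```python
import re

# Hypothetical convention: <country>_<channel>_<objective>_<yyyymm>
PATTERN = re.compile(
    r"^(?P<country>[a-z]{2})_(?P<channel>[a-z]+)_(?P<objective>[a-z]+)_(?P<month>\d{6})$"
)


def parse_campaign_name(name):
    """Return the name's parts as a dict, or None if it breaks the convention."""
    match = PATTERN.match(name.strip().lower())
    return match.groupdict() if match else None


names = [
    "SE_facebook_awareness_202404",
    "us_youtube_conversion_202404",
    "Spring Sale (new!)",
]
parsed = {name: parse_campaign_name(name) for name in names}

for name, parts in parsed.items():
    print(name, parts)
```

Names that come back as `None` are the ones to rename first.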
