There are several symbolic visuals that can represent the American West. However, few of them are as powerful (or as popular) as the cowboy.
Rugged. Strong. On a mission to wrangle cattle across the wide plains on route to their final processing point. They work tirelessly to ensure that their livestock keeps marching on. The cowboys protect the livestock from threats, and ensure they stay tightly grouped without wandering off.
Modern data analysts might be considered a sort of cowboy, too. Seriously. A sort of “digital cowboy” that employs data wrangling to ensure that data is collected from various sources and transported to a final destination. All the while, these data wranglers ensure that any errors or inconsistencies are removed, leading to better final outcomes.
What is data wrangling?
Data wrangling is the process of changing raw data in such a way that it becomes usable for analysis. It involves removing errors, merging different data sets and validating data to make sure it is correct.
Our own digital cowboy, em… data engineer, Ishan explains data wrangling like this:
“The term data wrangling is used by more advanced users of the data, such as a data engineer or data scientist. Where the main focus is changing the shape of data, such as pivoting the data by a specific field to make the data set wider, or un-pivoting the data set to make it longer (also known as spread and gather). The main reason to apply data wrangling is to make it fit to a certain shape of data that is either fit for a model or input to another process.”
Benefits of data wrangling
Why is data wrangling important?
Data wrangling is important in much the same way that cowboys need to wrangle livestock. Without the cowboy’s efforts, the cattle would just go everywhere. The whole operation would be a total mess.
Similarly, raw or unstructured data is often not as insightful as you may think. For example, imagine a dataset that contains information from your e-commerce platform. It contains the following information:
- Sales per day
- Revenue per day
- Product code of the products that were sold
While this gives you information like the number of sales on a given day, you don't know which products are actually sold. Instead, you would need to connect another dataset such as data from your product information management system.
With the right data wrangling tools, you can merge the data sets and find out what the names are of the product codes that you had in the first data set. Amazing, right? And you don’t even need a lasso or 10-gallon hat.
Note! Merging data is just one example of data wrangling. We will look into more complex data wrangling examples below.
What are data wrangling techniques?
While there are several types and styles of data wrangling, let’s review some of the most common.
Data cleaning is the process of removing incorrect or duplicated data from the set.
Lots of data sets are very, very large. That means they require a lot of computing power to work with. (Hence the term big data.) In many situations, you don’t actually need all the information in a specific data set. Instead, you can filter out the data you don't need to make the incoming data easier and lighter to work with.
Renaming fields is often important when working with data, particularly when combining multiple data sets. For example, one source may call your total marketing spend “cost” while another calls the same data “total spend.”
Widening a data set
When your current data is formatted as variables, but you instead need it formatted as metrics, you’ll need to widen the data set. Let’s look at the data below as an example.
In order to use the data in a specific tool, or enhance it, you need to have specific columns for each variable. That means you need to 'widen' the data set, which may look something like this:
You will note that the actual data is the same, but the formatting (or data model) is different.
Making a data set longer
When making the dataset longer, you do the exact opposite of what is done with data widening. Whether you need the long or wide format will depend on the data destination or software you intend to use it in. For example, a data visualization tool might need one format, and MMM software needs the other format.
Filling empty cells (or deleting them)
Imagine you have daily data from Google Analytics, but there were a couple days when the GA tag didn't fire. This might mean you won’t be able to use this data set, since it will show strange numbers when compared to data in a previous period.
In order to fix this, you could take the average of the subsequent weeks, and use them to fill out the gaps. You could also add data from another source if you have other software that tracks the same metrics.
In other cases, missing data will show up as N/A (meaning not available or not applicable), in which case you can either delete these missing values altogether or find the missing values and merge them with your data set.
The data wrangling process, explained
Ok, we need to be honest for a second. There isn’t a single process by which every “digital cowboy” wrangles their data. Your method may be different than ours depending on the quality of your raw data, the tools and software you use, and the destination you send the data to. However, we think this step-by-step process works for many data wrangling situations.
Steps in the data wrangling process:
- Define the final outcome. Make sure you have a clear idea of what you and your data team want to achieve, and work backward from there.
- Gather the data. Collect all of the data you need into a singular system or file.
- Clean the data. Fix or remove any incorrect, mis-formatted, duplicated, or incomplete values within the data set (for instance, remove null values).
- Data validation. Compare the data you’ve collected with the same data in the source. Is everything the same, or did any unintentional changes occur? This step can be performed before or after cleaning.
- Data transformation. This is where things get interesting. You can add additional data to combine two data sets, transform the shape of the data, filter the data, or perform a host of other data transformations.
- Share or export the data. The last step in our process is to share or export the data itself. In most cases, data is shared with data scientists or specialists who can analyze data. The end goal: obtain valuable insights for the business.
Once you’ve made it through all 6 steps, you’ve converted raw data to a state where you can start extracting real value from it. In some cases, data wrangling tools may not provide intuitive user interfaces that help you quickly identify insights. That’s why tools like marketing mix modeling, data warehouse, or visualization tools may be the best destination for you.
You might also like our blogpost about mapping data.
What is the difference between data cleaning and data wrangling?
As explained in this blogpost, data wrangling is the complete process of making data ready for use. Data cleaning is one of the steps in that process.
What is raw data?
In data science, raw data is the data before the process of data wrangling has been performed. It is the pure data as it comes directly from the source.
What is data munging?
Data munging is just another word for data wrangling.
What platforms are most used for data wrangling?
Python, R and SQL are the most commonly used languages for data wrangling.