7 Steps to Mastering Data Cleaning with Python and Pandas

Image by Author

Pandas is the most widely used Python library for data analysis and manipulation. But the data you read from a source often requires a series of cleaning steps before you can analyze it to gain insights, answer business questions, or build machine learning models.

This guide breaks down the process of data cleaning with pandas into 7 practical steps. We’ll spin up a sample dataset and work through the data cleaning steps.

Let’s get started!

Spinning Up a Sample DataFrame

Link to Colab Notebook

Before we get started with the actual data cleaning steps, let's create a pandas dataframe with employee records. We’ll use Faker for synthetic data generation. So install it first:

!pip install Faker

If you’d like, you can follow along with the same example. You can also use a dataset of your choice. Here’s the code to generate 1000 records:

import pandas as pd
from faker import Faker
import random

# Initialize Faker to generate synthetic data
fake = Faker()

# Set seed for reproducibility
Faker.seed(42)

# Generate synthetic data
data = []
for _ in range(1000):
    data.append({
        'Name': fake.name(),
        'Age': random.randint(18, 70),
        'Email': fake.email(),
        'Phone': fake.phone_number(),
        'Address': fake.address(),
        'Salary': random.randint(20000, 150000),
        'Join_Date': fake.date_this_decade(),
        'Employment_Status': random.choice(['Full-Time', 'Part-Time', 'Contract']),
        'Department': random.choice(['IT', 'Engineering', 'Finance', 'HR', 'Marketing'])
    })

Let’s tweak this dataframe a bit to introduce missing values, duplicate records, outliers, and more:

# Let's tweak the records a bit!
# Introduce missing values
for i in random.sample(range(len(data)), 50):
    data[i]['Email'] = None

# Introduce duplicate records
data.extend(random.sample(data, 100))

# Introduce outliers
for i in random.sample(range(len(data)), 20):
    data[i]['Salary'] = random.randint(200000, 500000)

Now let’s create a dataframe with these records:

# Create dataframe
df = pd.DataFrame(data)

Note that we set the seed for Faker but not for the random module, so there'll be some randomness in the numeric fields you generate.
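If you want fully reproducible records, you can also seed the random module before running the generation code above:

# Seed Python's random module as well so the numeric fields are reproducible
random.seed(42)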

Step 1: Understanding the Data

Step 0 is always to understand the business question/problem that you are trying to solve. Once you know that, you can start working with the data you’ve read into your pandas dataframe.

But before you can do anything meaningful on the dataset, it’s important to first get a high-level overview of the dataset. This includes getting some basic information on the different fields and the total number of records, inspecting the head of the dataframe, and the like.

Here we run the info() method on the dataframe:

df.info()
Output >>>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               1100 non-null   object
 1   Age                1100 non-null   int64
 2   Email              1047 non-null   object
 3   Phone              1100 non-null   object
 4   Address            1100 non-null   object
 5   Salary             1100 non-null   int64
 6   Join_Date          1100 non-null   object
 7   Employment_Status  1100 non-null   object
 8   Department         1100 non-null   object
dtypes: int64(2), object(7)
memory usage: 77.5+ KB

And inspect the head of the dataframe:

df.head()
Output of df.head()
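To round out the overview, you can also pull summary statistics for the numeric fields:

# Summary statistics (count, mean, std, quartiles) for the numeric columns
print(df.describe())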

Step 2: Handling Duplicates

Duplicate records are a common problem that skews the results of analysis. So we should identify and remove all duplicate records so that we're working with only the unique data records.

Here’s how we find all the duplicates in the dataframe and then drop all the duplicates in place:

# Check for duplicate rows
duplicates = df.duplicated().sum()
print("Number of duplicate rows:", duplicates)

# Removing duplicate rows
df.drop_duplicates(inplace=True)
Output >>>
Number of duplicate rows: 100

Step 3: Handling Missing Data

Missing data is a common data quality issue in many data science projects. If you take a quick look at the result of the info() method from the previous step, you should see that the non-null counts are not identical across fields: there are missing values in the email column. Let's get the exact count nonetheless.

To get the number of missing values in each column you can run:

# Check for missing values
missing_values = df.isna().sum()
print("Missing Values:")
print(missing_values)
Output >>>
Missing Values:
Name                  0
Age                   0
Email                50
Phone                 0
Address               0
Salary                0
Join_Date             0
Employment_Status     0
Department            0
dtype: int64

If there are missing values in one or more numeric columns, we can apply suitable imputation techniques. But because the missing values here are in the 'Email' field, let's just set the missing emails to a placeholder email like so:

# Handling missing values by filling with a placeholder
df['Email'] = df['Email'].fillna('unknown@example.com')
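Had the missing values been in a numeric column instead, a simple statistical fill is a common starting point. A minimal sketch, purely illustrative here since our numeric columns have no missing values:

# Hypothetical: fill missing numeric values with the column median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())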

Step 4: Transforming Data

When you’re working on the dataset, there may be one or more fields that do not have the expected data type. In our sample dataframe, the 'Join_Date' field has to be cast into a valid datetime object:

# Convert 'Join_Date' to datetime
df['Join_Date'] = pd.to_datetime(df['Join_Date'])
print("Join_Date after conversion:")
print(df['Join_Date'].head())
Output >>>
Join_Date after conversion:
0   2023-07-12
1   2020-12-31
2   2024-05-09
3   2021-01-19
4   2023-10-04
Name: Join_Date, dtype: datetime64[ns]

Because we have the joining date, it's actually more helpful to have a 'Years_Employed' column, as shown:

# Creating a new feature 'Years_Employed' based on 'Join_Date'
df['Years_Employed'] = pd.Timestamp.now().year - df['Join_Date'].dt.year
print("New feature 'Years_Employed':")
print(df[['Join_Date', 'Years_Employed']].head())
Output >>>
New feature 'Years_Employed':
   Join_Date  Years_Employed
0 2023-07-12               1
1 2020-12-31               4
2 2024-05-09               0
3 2021-01-19               3
4 2023-10-04               1

Step 5: Cleaning Text Data

It’s quite common to run into string fields with inconsistent formatting or similar issues. Cleaning text can be as simple as applying a case conversion or as hard as writing a complex regular expression to get the string to the required format.

In the example dataframe that we have, we see that the 'Address' column contains many ‘\n’ characters that hinder readability. So let's replace them with spaces like so:

# Clean address strings
df['Address'] = df['Address'].str.replace('\n', ' ', regex=False)
print("Address after text cleaning:")
print(df['Address'].head())
Output >>>
Address after text cleaning:
0    79402 Peterson Drives Apt. 511 Davisstad, PA 35172
1    55341 Amanda Gardens Apt. 764 Lake Mark, WI 07832
2    710 Eric Estate Carlsonfurt, MS 78605
3    809 Burns Creek Natashaport, IA 08093
4    8713 Caleb Brooks Apt. 930 Lake Crystalbury, CA...
Name: Address, dtype: object
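For harder cases, regular expressions come in handy. As an illustrative sketch (not part of the cleanup our sample data strictly needs), you could lowercase the emails and strip the phone numbers down to their digits:

# Normalize email case
df['Email'] = df['Email'].str.lower()

# Keep only the digits in the phone numbers (drops separators and the 'x' in extensions)
df['Phone'] = df['Phone'].str.replace(r'\D', '', regex=True)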

Step 6: Handling Outliers

If you scroll back up, you’ll see that we set some of the values in the 'Salary' column to be extremely high. Such outliers should also be identified and handled appropriately so that they don’t skew the analysis.

You’ll often want to factor in what makes a data point an outlier: is it an incorrect data entry, or is it an actually valid extreme value? You may then choose how to handle them: drop the records with outliers, or get the subset of rows with outliers and analyze them separately.

Let's use the z-score and find those salary values that are more than three standard deviations away from the mean:

# Detecting outliers using z-score
z_scores = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
outliers = df[abs(z_scores) > 3]
print("Outliers based on Salary:")
print(outliers[['Name', 'Salary']].head())
Output >>>
Outliers based on Salary:
                Name  Salary
16    Michael Powell  414854
131    Holly Jimenez  258727
240  Daniel Williams  371500
328    Walter Bishop  332554
352     Ashley Munoz  278539
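If you decide these extreme salaries are data-entry errors rather than legitimate values, one option is to drop them. A minimal sketch:

# Keep only the rows whose salary z-score is within ±3
df_no_outliers = df[abs(z_scores) <= 3]
print("Rows remaining:", len(df_no_outliers))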

Step 7: Merging Data

In most projects, the data you have may not be all the data you want for analysis. You have to find the most relevant fields to use, and often merge in data from other dataframes to get more useful information for the analysis.

As a quick exercise, create another related dataframe and merge it with the existing dataframe on a common column such that the merge makes sense. Merging in pandas works very similarly to joins in SQL, so I suggest you try that as an exercise!
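To get you started, here's a minimal sketch of such a merge. It assumes a hypothetical dept_info lookup table: the department names match our sample data, but the locations are made up for illustration:

# Hypothetical lookup table: one row per department
dept_info = pd.DataFrame({
    'Department': ['IT', 'Engineering', 'Finance', 'HR', 'Marketing'],
    'Location': ['New York', 'Austin', 'Chicago', 'Boston', 'Seattle']
})

# Left join on the common 'Department' column (analogous to a SQL LEFT JOIN)
df = df.merge(dept_info, on='Department', how='left')
print(df[['Name', 'Department', 'Location']].head())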

Wrapping Up

That's all for this tutorial! We created a sample dataframe with employee records and worked through the various data cleaning steps. Here is an overview of the steps: understanding the data, handling duplicates, handling missing data, transforming data, cleaning text data, handling outliers, and merging data.

If you want to learn all about data wrangling with pandas, check out 7 Steps to Mastering Data Wrangling with Pandas and Python.

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


More On This Topic

  • 7 Steps to Mastering Data Cleaning and Preprocessing Techniques
  • 7 Steps to Mastering Data Wrangling with Pandas and Python
  • Collection of Guides on Mastering SQL, Python, Data Cleaning, Data…
  • Mastering the Art of Data Cleaning in Python
  • Data Cleaning with Pandas
  • 5 Simple Steps to Automate Data Cleaning with Python