
Data Preprocessing

Advanced Techniques for Data Cleaning and Transformation

Gdansk University of Technology
Chongqing Technology and Business University

Abstract

This notebook provides an introduction to data wrangling and preprocessing techniques in Python. It covers strategies for identifying and handling missing data, merging datasets, and reshaping data into tidy formats for analysis and visualization. Practical examples demonstrate how to clean, transform, and prepare real-world datasets, emphasizing best practices for ensuring data quality and usability in data science workflows.

Keywords: data, statistics, data analysis, python

Goals of this lecture

A huge % of science involves data wrangling.

This could be an entire course on its own, but today we’ll focus on:

  • What is data wrangling?
  • What to do about missing values?
  • How to combine datasets?
  • Tidy data!
    • What is it?
    • How do we make our data tidy?

Importing relevant libraries

import seaborn as sns ### importing seaborn
import pandas as pd
import numpy as np
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'

What is data wrangling?

Data wrangling refers to manipulating, reshaping, or transforming a dataset as needed for your goals (e.g., visualization and/or analysis).

  • A huge % of working with data involves “wrangling”.
  • Includes:
    • “Cleaning” data (missing values, recasting variables, etc.).
    • Merging/combining different datasets.
    • Reshaping data as needed.

Data Wrangling with Pandas

CHEAT-SHEET is available here

Types of missing data

Before we start analyzing missing values, it is important to understand the different reasons for missing data. Generally speaking, there are three possible reasons:

1. Missing completely at random (MCAR).

The missing values of a given variable (Y) are not related to other variables in the data set or to the variable (Y) itself. In other words, there is no specific reason for the missing values.

2. Missing at random (MAR).

MAR occurs when the missingness is not completely random but can be fully explained by other variables for which there is complete information.

3. Missing not at random (MNAR).

The missing value depends on unobserved data or on the values of the missing data itself.

MCAR

What does it mean? When missing data points do not follow any particular reason or pattern.

  • Example 1. You have demographic data for your community. But the “Middle Name” variable is missing 50% of the values. This 50% of the data is a perfect example of MCAR data. There is no pattern or reason why the middle name is blank in most of the entries.

  • Example 2. In a paper survey, one page accidentally fell out of several forms during shipping. The answers to the questions on page 3 are therefore missing, regardless of who the respondent was or how they answered earlier.

How to deal with this?

For MCAR data, the following methods can be used:

  1. Row-wise deletion: Delete a record if any of its variables/columns has a missing value. This works best when the amount of missing data is small, e.g., when only 2% of the dataset is missing completely at random.
  2. Pairwise deletion: Remove a case only when one of the variables used in the statistical method under consideration is missing. It works on the same principle as a correlation matrix: for each pair of variables, all cases that are complete for those two variables are used. If the number of such cases is N for one pair, it will generally differ from N for another pair; this is the main difference between row-wise and pairwise deletion. Pairwise deletion has the advantage of minimal data loss, so when missing values are scattered across almost all variables, it is often the wiser choice.
  3. Mean, median and mode imputation: Replace missing values with the mean, median, or mode of the respective variable (a short code sketch of all three strategies follows this list).
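A minimal sketch of these three strategies, using a tiny made-up DataFrame (df and its columns are purely illustrative; note that pandas’ corr() performs pairwise deletion by default):

## Illustrative-only DataFrame with some missing values
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [10.0, 20.0, np.nan, 40.0]})

# 1. Row-wise (listwise) deletion: drop every row that contains any NaN
df_listwise = df.dropna()

# 2. Pairwise deletion: corr() uses all complete pairs for each pair of
#    columns, so different cells of the matrix may rest on different N
pairwise_corr = df.corr()

# 3. Mean / median / mode imputation with fillna
df_mean   = df.fillna(df.mean(numeric_only=True))
df_median = df.fillna(df.median(numeric_only=True))
df_mode   = df.fillna(df.mode().iloc[0])  # mode() returns a DataFrame; take its first row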

MAR

What does it mean? When the missingness follows a pattern that can be explained by other, fully observed variables.

  • Example 1. Take the same example of demographic data for your community. But this time, the salaries of several men over the age of 45 are missing.

  • Example 2. In a health survey, women are more likely than men to not report their weight. Missing data is determined by gender (which is known), but not by the weight value itself.

In this case, the missingness is related to another observed variable. As such, it is a “missing at random” mechanism. MAR is probably the most confusing of the three mechanisms because of its name.

How to deal with it?

Because there is a relationship in this mechanism, the best option would be to use an imputation technique—mean, median, mode, or multiple imputation.
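For instance, here is a minimal sketch of conditional (group-wise) imputation for the health-survey example above; the DataFrame and its weight/gender columns are purely illustrative:

## Illustrative MAR-aware imputation: fill missing weight with the median
## weight observed within the same gender group
df_survey = pd.DataFrame({"gender": ["F", "F", "M", "M", "F"],
                          "weight": [62.0, np.nan, 81.0, np.nan, 58.0]})
df_survey["weight"] = df_survey.groupby("gender")["weight"].transform(
    lambda s: s.fillna(s.median()))
print(df_survey)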

MNAR

What does this mean?

When the missingness depends on the values that are themselves missing (or on other unobserved data), the data follow the MNAR mechanism.

  • Example 1. In the same demographic data of residents of your community, let’s say that the salary of several men is missing whenever the salary exceeds a certain amount (say, one million).

  • Example 2. People with very high incomes do not want to disclose their earnings in the survey. The missing data on income depends directly on its value (i.e. the one that is missing).

In this case, we are dealing with the “Missing Not at Random” mechanism. Usually, when the missing data is not MCAR or MAR, it tends to follow MNAR.

How to deal with it?

Since MNAR means the missingness depends on the missing values themselves, the best options are to gather additional data that explains the missingness or to model the missing-data mechanism explicitly.

The reason we need to analyze the mechanisms of missing values is that we need to understand the best way to impute a missing value so that the accuracy of our ML models remains unaffected.

Dealing with missing values

Loading a dataset with missing values

The titanic dataset contains information about different Titanic passengers and whether they Survived (1 vs. 0).

  • Commonly used as a tutorial for machine learning, regression, and data wrangling.
df_titanic = pd.read_csv("data/wrangling/titanic.csv")
df_titanic.head(3)

Why is missing data a problem?

If you’re unaware of missing data:

  • You might be overestimating the size of your dataset.
  • You might be biasing the results of a visualization or analysis (if missing data are non-randomly distributed).
  • You might be complicating an analysis.
    • By default, many analysis packages will “drop” missing data––so you need to be aware of whether this is happening.

How to deal with missing data

  1. Identify whether and where your data has missing values.
  2. Analyze how these missing values are distributed.
  3. Decide how to handle them.

Not an easy problem––especially step 3!

Step 1: Identifying missing values

The first step is identifying whether and where your data has missing values.

There are several approaches to this:

  • Using .isna
  • Using .info
  • Using .isnull

isna()

  • The isna() function tells us whether a given cell of a DataFrame has a missing value or not (True vs. False).
  • If we call isna().any(), it tells us which columns have missing values.
df_titanic.isna().head(1)
df_titanic.isna().any()
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool
Inspecting columns with nan

Now we can inspect specific columns that have nan values.

df_titanic[df_titanic['Age'].isna()].head(5)
How many nan?

If we call sum on the nan values, we can calculate exactly how many nan values are in each column.

df_titanic.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

info

The info() function gives us various information about the DataFrame, including the number of not-null (i.e., non-missing) values in each column.

df_titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Check-in

How many rows of the DataFrame have missing values for Cabin?

### Your code here

Solution

### How many? (Quite a few!)
df_titanic[df_titanic['Cabin'].isna()].shape
(687, 13)

Visualizing missing values

  • Finally, we can visualize the rate of missing values across columns using seaborn.heatmap.
    • The dark cells are those with not-null values.
    • The light cells have nan values.
sns.heatmap(df_titanic.isna())
<Figure size 432x288 with 2 Axes>

Step 2: Analyze how the data are distributed

Having identified missing data, the next step is determining how those missing data are distributed.

Is variable Y different depending on whether X is nan?

One approach is to ask whether some variable of interest (e.g., Survived) is different depending on whether some other variable is nan.

### Mean survival for people without data about Cabin info
df_titanic[df_titanic['Cabin'].isna()]['Survived'].mean()
0.29985443959243085
### Mean survival for people *with* data about Cabin info
df_titanic[~df_titanic['Cabin'].isna()]['Survived'].mean()
0.6666666666666666

Check-in

What is the mean Survived rate for values with a nan value for Age vs. those with not-null values? How does this compare to the overall Survived rate?

### Your code here

Solution

### Mean survival for people without data about Age
df_titanic[df_titanic['Age'].isna()]['Survived'].mean()
0.2937853107344633
### Mean survival for people *with* data about Age
df_titanic[~df_titanic['Age'].isna()]['Survived'].mean()
0.4061624649859944
### Mean survival for people overall
df_titanic['Survived'].mean()
0.3838383838383838

Package missingno

If you want to dig deeper, you can check out the missingno Python library (which needs to be installed separately).

Finding the reason for missing data using a matrix chart

import missingno as msno
msno.matrix(df_titanic);
<Figure size 2500x1000 with 2 Axes>

Finding the cause of missing data using Heatmap

msno.heatmap(df_titanic);
<Figure size 2000x1200 with 2 Axes>

The heatmap shows that there are no strong correlations between the missingness of different features. This is good; such low nullity correlations suggest that values are not missing together systematically, which is more consistent with MCAR or MAR than with a shared MNAR mechanism.

Finding the Cause of Missing Data Using a Dendrogram

A dendrogram is a tree diagram of missing data. It groups together variables whose missingness is highly correlated.

msno.dendrogram(df_titanic);
<Figure size 2500x1000 with 1 Axes>

Step 3: Determine what to do!

Having identified missing data, you need to determine how to handle it.

There are several approaches you can take.

  • Removing all rows with any missing data.
  • Removing rows with missing data only when that variable is relevant to the analysis or visualization.
  • Imputing (i.e., guessing) what values missing data should have.

Removing all rows with any missing data

  • We can filter our DataFrame using dropna, which will automatically “drop” any rows containing null values.
  • Caution: if you have lots of missing data, this can substantially impact the size of your dataset.
df_filtered = df_titanic.dropna()
df_filtered.shape
(183, 13)

Removing all rows with missing data in specific columns

  • Here, we specify that we only want to dropna for rows that have nan in the Age column specifically.
  • We still have nan values for Cabin, but perhaps that’s fine in our case.
df_filtered = df_titanic.dropna(subset = "Age")
df_filtered.shape
(714, 13)

Imputing missing data

One of the most complex (and controversial) approaches is to impute the values of missing data.

There are (again) multiple ways to do this:

  1. Decide on a constant value and assign it to all nan values.
    • E.g., assign the mean Age to all people with nan in that column.
  2. Try to guess the value based on specific characteristics of the data.
    • E.g., based on other characteristics of this person, what is their likely Age?
Imputing a constant value

We can use fillna to assign all values with nan for Age some other value.

## Assign the mean Age to all people with nan for Age
df_titanic['Age_imputed1'] = df_titanic['Age'].fillna(df_titanic['Age'].mean())
## Now let's look at those rows
df_titanic[df_titanic['Age'].isna()].head(5)
Guessing based on other characteristics
  • You can try to guess what their Age would be, based on other features.
  • The more sophisticated version of this is to use statistical modeling or an imputer from the sklearn library (e.g., SimpleImputer or KNNImputer).
  • For now, simply note that Age correlates with other features (like Pclass).
## Passenger Class is correlated with Age
sns.barplot(data = df_titanic, x = 'Pclass', y = 'Age');
<Figure size 640x480 with 1 Axes>

Check-in

What would happen if you used fillna with the median Age instead of the mean? Why would this matter?

### Your code here

Solution

The median Age is slightly lower.

## Assign the median Age to all people with nan for Age
df_titanic['Age_imputed2'] = df_titanic['Age'].fillna(df_titanic['Age'].median())
## Now let's look at those rows
df_titanic[df_titanic['Age'].isna()].head(5)

Random imputation/hot deck

Hot deck imputation is a method in which each missing value is replaced with a value from a similar record in the dataset, often referred to as the “donor.” The donor record is selected based on matching criteria—such as demographic characteristics or proximity in a multidimensional feature space—to ensure that the imputation preserves the inherent distribution and relationships in the data.

For example, if income data for a survey respondent are missing, the income value of a similar respondent (based on factors such as occupation, geographic location, and education level) is used as a surrogate.

Hot deck imputation is crucial to maintaining the integrity of the dataset. It ensures that:

  • Statistical analyses are not biased by arbitrarily discarding incomplete cases.

  • Natural variability and distributional patterns present in complete cases are preserved.

  • Comparisons between different groups within the data remain feasible and valid.

Hot Deck Random Imputation: In this method, missing values are imputed by randomly selecting a donor from a pool of potential candidates. This preserves the distribution of the data, especially when the missing values are random. The downside is that the randomness can overlook nuances in the data.

Hot Deck Nearest Neighbor Imputation: In this case, a donor is selected based on the similarity of observed features. A commonly used algorithm is the k-nearest neighbors (k-NN) approach, which calculates a distance metric between observations. For a record with a missing value, its k nearest neighbors (in terms of Euclidean or Mahalanobis distance) are determined, and one of their values is used for imputation.

The k-NN algorithm identifies the closest records that minimize the distance:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

where:

  • x and y are two points in n-dimensional space,
  • x_i and y_i are the coordinates of points x and y in dimension i,
  • d(x, y) is the Euclidean distance between points x and y.
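To make the distance formula concrete, here is a small numpy sketch; the two feature vectors are made-up Age/Fare values:

## Euclidean distance between two illustrative passenger feature vectors
x = np.array([22.0, 7.25])    # e.g., Age and Fare of one passenger
y = np.array([38.0, 71.28])   # e.g., Age and Fare of another passenger
d = np.sqrt(np.sum((x - y) ** 2))
print(d)                       # equivalent to np.linalg.norm(x - y)

In practice, features are usually standardized before such distances are computed, so that no single feature dominates the result.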

Hot Deck imputation offers several advantages over other methods:

  1. Mean/Median Imputation: Unlike simply replacing missing values with the grand mean or median, Hot Deck imputation uses information from similar cases. This preserves local variability and takes into account relationships between variables.

  2. Regression Imputation: While regression imputation relies on predictive models to estimate missing values, it can lead to biased estimates if model assumptions are violated. Hot Deck imputation, on the other hand, borrows directly from observed values, often making it more robust under different conditions.

  3. Multiple Imputation: Multiple imputation creates multiple imputed data sets and combines the results for a final analysis, accounting for imputation uncertainty. Although it is often more statistically sophisticated, it can be computationally expensive. Hot Deck imputation remains a practical choice when maintaining ease of interpretation and computational efficiency are priorities.

The main conclusion is that Hot Deck imputation offers an intuitive, practically feasible approach that still respects the statistical properties of the data.

Hot-deck Random Imputation

# Hot Deck Random Imputation Example
def random_hot_deck_imputation(df, column):
    # We filter non-null values
    non_null_values = df[column].dropna().values
    # We replace NaN with a random value from the existing ones
    df[column] = df[column].apply(lambda x: np.random.choice(non_null_values) if pd.isna(x) else x)
    return df

# Imputation for 'Age' column
df_titanic = random_hot_deck_imputation(df_titanic, 'Age')
df_titanic.isna().any()
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

Hot Deck Nearest Neighbor (k-NN) Imputation

from sklearn.impute import KNNImputer

# We create a KNNImputer object
knn_imputer = KNNImputer(n_neighbors=5)  # You can change the number of neighbors

# We select numeric columns for imputation
columns_to_impute = ['Age', 'Fare']  # Columns
df_titanic[columns_to_impute] = knn_imputer.fit_transform(df_titanic[columns_to_impute])

df_titanic.isna().any()
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

Summary

  1. Hot Deck Random Imputation:
  • NaN values in a column are replaced with a random value from existing data in the same column.
  • It is fast but may introduce randomness that does not take into account the relationships between variables.
  2. Hot Deck Nearest Neighbor Imputation (k-NN):
  • The k-NN algorithm uses other features (e.g. Pclass, Fare) to find similar records and impute missing values.
  • It is more advanced and takes into account the relationships between variables.

Dirty Data

Many times we spend hours troubleshooting missing values, logical inconsistencies, or outliers in our datasets. In this tutorial, we will discuss the most popular data cleaning techniques.

We will work with a messy (“dirty”) version of the iris dataset. Originally published in the UCI Machine Learning Repository: Iris Data Set, this small dataset from 1936 is often used to test machine learning algorithms and visualizations. Each row of the table represents an iris flower, including its species and the dimensions of its botanical parts, sepal and petal, in centimeters.

Check out this dataset here:

dirty_iris = pd.read_csv("data/dirty_iris.csv")
dirty_iris.head(3)

Coherent data are technically correct data that are suitable for statistical analysis. They are data in which missing values, special values, (obvious) errors, and outliers have been removed, corrected, or imputed. The data conform to constraints based on actual knowledge about the subject that the data describes.


We have the following basic knowledge:

  • The species should be one of the following values: setosa, versicolor, or virginica.

  • All measured numerical properties of the iris should be positive.

  • The length of an iris petal is at least 2 times its width.

  • The length of an iris sepal cannot exceed 30 cm.

  • The sepals of the iris are longer than its petals.

We will now encode these rules as a set of checks in Python and print whether each rule holds across the whole dataset:

# We define rules as functions:
def check_rules(df):
    rules = {
        "Sepal.Length <= 30": df["Sepal.Length"] <= 30,
        "Species in ['setosa', 'versicolor', 'virginica']": df["Species"].isin(['setosa', 'versicolor', 'virginica']),
        "Sepal.Length > 0": df["Sepal.Length"] > 0,
        "Sepal.Width > 0": df["Sepal.Width"] > 0,
        "Petal.Length > 0": df["Petal.Length"] > 0,
        "Petal.Width > 0": df["Petal.Width"] > 0,
        "Petal.Length >= 2 * Petal.Width": df["Petal.Length"] >= 2 * df["Petal.Width"],
        "Sepal.Length > Petal.Length": df["Sepal.Length"] > df["Petal.Length"]
    }
    return rules

# Data frame rules:
rules = check_rules(dirty_iris)

# Print:
for rule, result in rules.items():
    print(f"{rule}: {result.all()}")
Sepal.Length <= 30: False
Species in ['setosa', 'versicolor', 'virginica']: True
Sepal.Length > 0: False
Sepal.Width > 0: False
Petal.Length > 0: False
Petal.Width > 0: False
Petal.Length >= 2 * Petal.Width: False
Sepal.Length > Petal.Length: False

Now we can determine how often each rule is broken (violations). We can also summarize and plot the result.

# We check for rule violations:
violations = {rule: ~result for rule, result in rules.items()}

# We summarize them:
summary = {rule: result.sum() for rule, result in violations.items()}

# Print:
print("Summary of Violations:")
for rule, count in summary.items():
    print(f"{rule}: {count} violations")
Summary of Violations:
Sepal.Length <= 30: 12 violations
Species in ['setosa', 'versicolor', 'virginica']: 0 violations
Sepal.Length > 0: 11 violations
Sepal.Width > 0: 19 violations
Petal.Length > 0: 20 violations
Petal.Width > 0: 12 violations
Petal.Length >= 2 * Petal.Width: 34 violations
Sepal.Length > Petal.Length: 30 violations

What percentage of the data violates each rule?

import matplotlib.pyplot as plt
# Violation Chart:
violation_counts = pd.Series(summary)
ax = violation_counts.plot(kind='bar', figsize=(10, 6))
plt.title('Summary of errors in data')
plt.xlabel('Rules')
plt.ylabel('Number of errors')

# Add the percentages above the bars:
for p in ax.patches:
    ax.annotate(f'{p.get_height() / len(dirty_iris) * 100:.1f}%', 
                (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='center', xytext=(0, 10), 
                textcoords='offset points')

plt.show()
<Figure size 1000x600 with 1 Axes>

Find which flowers have sepals that are too long using the violation results.

violations = {rule: ~result for rule, result in rules.items()}
violated_df = pd.DataFrame(violations)
violated_rows = dirty_iris[violated_df["Sepal.Length <= 30"]]
print(violated_rows)
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width     Species
14            NaN          3.9          1.70          0.4      setosa
18            NaN          4.0           NaN          0.2      setosa
24            NaN          3.0          5.90          2.1   virginica
27           73.0         29.0         63.00          NaN   virginica
29            NaN          2.8          0.82          1.3  versicolor
57            NaN          2.9          4.50          1.5  versicolor
67            NaN          3.2          5.70          2.3   virginica
113           NaN          3.3          5.70          2.1   virginica
118           NaN          3.0          5.50          2.1   virginica
119           NaN          2.8          4.70          1.2  versicolor
124          49.0         30.0         14.00          2.0      setosa
137           NaN          3.0          4.90          1.8   virginica

Let’s find outliers in sepal length using the boxplot method.

We’ll take the relevant observations and check the remaining values.

Any ideas what might have happened?

We’ll set outliers to NA (or whatever value you think is more appropriate).

plt.figure(figsize=(10, 6))
plt.boxplot(dirty_iris['Sepal.Length'].dropna())
plt.title('Sepal Length Box Plot')
plt.ylabel('Sepal Length')
plt.show()
<Figure size 1000x600 with 1 Axes>
# We will find outliers:
outliers = dirty_iris['Sepal.Length'][np.abs(dirty_iris['Sepal.Length'] - dirty_iris['Sepal.Length'].mean()) > (1.5 * dirty_iris['Sepal.Length'].std())]
outliers_idx = dirty_iris.index[dirty_iris['Sepal.Length'].isin(outliers)]

# We will print them:
print("Outliers:")
print(dirty_iris.loc[outliers_idx])
Outliers:
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
27           73.0         29.0          63.0          NaN  virginica
124          49.0         30.0          14.0          2.0     setosa

They all seem too big... maybe they were measured in mm instead of cm?

# We will correct outliers (assuming they were measured in mm instead of cm).
dirty_iris.loc[outliers_idx, ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']] /= 10
plt.figure(figsize=(10, 6))
sns.boxplot(x='Species', y='Sepal.Length', data=dirty_iris)
plt.title('Boxplot of Sepal Length by Species')
plt.xlabel('Species')
plt.ylabel('Sepal Length')
plt.show()
<Figure size 1000x600 with 1 Axes>

Notice that the simple boxplot shows an additional outlier!

Error Correction

Let’s replace non-positive values from Sepal.Width with NA:

# We define the correction rule:
def correct_sepal_width(df):
    df.loc[(~df['Sepal.Width'].isna()) & (df['Sepal.Width'] <= 0), 'Sepal.Width'] = np.nan
    return df

# We apply correction to the data frame:
mydata_corrected = correct_sepal_width(dirty_iris)

# and let's look at the data:
print(mydata_corrected)
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width     Species
0             6.4          3.2           4.5          1.5  versicolor
1             6.3          3.3           6.0          2.5   virginica
2             6.2          NaN           5.4          2.3   virginica
3             5.0          3.4           1.6          0.4      setosa
4             5.7          2.6           3.5          1.0  versicolor
..            ...          ...           ...          ...         ...
145           6.7          3.1           5.6          2.4   virginica
146           5.6          3.0           4.5          1.5  versicolor
147           5.2          3.5           1.5          0.2      setosa
148           6.4          3.1           NaN          1.8   virginica
149           5.8          2.6           4.0          NaN  versicolor

[150 rows x 5 columns]

Replacing all error values with NA:

# we apply rules for violations (errors):
rules = check_rules(dirty_iris)
violations = {rule: ~result for rule, result in rules.items()}
violated_df = pd.DataFrame(violations)

# we locate errors and change them to NA:
for col in violated_df.columns:
    dirty_iris.loc[violated_df[col], col.split()[0]] = np.nan

It is better to have NA values than erroneous values in the data. One could now attempt to impute the newly created gaps...

Merging datasets

Merging refers to combining different datasets to leverage the power of additional information.

  • In Week 1, we discussed this in the context of data linkage.
  • Can link datasets as a function of:
    • Shared time window.
    • Shared identity.

Why merge?

  • Each dataset contains limited information.
    • E.g., GDP by Year.
  • But merging datasets allows us to see how more variables relate and interact.
  • Much of social science research involves locating datasets and figuring out how to combine them.

How to merge?

In Python, pandas.merge allows us to merge two DataFrames on a common column(s).

pd.merge(df1, df2, on = "shared_column")

merge in practice

For demonstration, we’ll merge two Linguistics datasets:

  • One dataset contains information about the Age of Acquisition of different English words (Kuperman et al., 2014).
  • The other dataset contains information about the Frequency and Concreteness of English words (Brysbaert et al., 2014).

Loading datasets

df_aoa = pd.read_csv("data/wrangling/AoA.csv")
df_aoa.head(1)
df_conc = pd.read_csv("data/wrangling/concreteness.csv")
df_conc.head(1)

Different kinds of merging

As we see, the datasets are not the same size. This leaves us with a decision to make when merging.

  • inner join: Do we preserve only the words in both datasets?
  • left join: Do we preserve all the words in one dataset (the “left” one), regardless of whether they occur in the other?
  • right join: Do we preserve all the words in one dataset (the “right” one), regardless of whether they occur in the other?
  • outer join: Do we preserve all words in both, leaving empty (nan) values where a word only appears in one dataset?
df_aoa.shape
(31124, 2)
df_conc.shape
(28612, 4)

inner join

  • For our purposes, it makes the most sense to use an inner join.
  • This leaves us with fewer words than occur in either dataset.
df_merged = pd.merge(df_aoa, df_conc, on = "Word", how = "inner")
df_merged.head(2)
df_merged.shape
(23569, 5)

Check-in

What happens if you use a different kind of join, e.g., outer or left? What do you notice about the shape of the resulting DataFrame? Do some rows have nan values?

### Your code here

Solution

df_outer_join = pd.merge(df_aoa, df_conc, on = "Word", how = "outer")
df_outer_join.shape
(36167, 5)
df_outer_join.head(4)

Why merge is so useful

Now that we’ve merged our datasets, we can look at how variables across them relate to each other.

sns.scatterplot(data = df_merged, x = 'Concreteness',
                y = 'AoA', alpha = .2 );
<Figure size 640x480 with 1 Axes>

Reshaping data

Reshaping data involves transforming it from one format (e.g., “wide”) to another format (e.g., “long”), to make it more amenable to visualization and analysis.

Often, we need to make our data tidy.

What is tidy data?

Tidy data is a particular way of formatting data, in which:

  1. Each variable forms a column (e.g., GDP).
  2. Each observation forms a row (e.g., a country).
  3. Each type of observational unit forms a table (tabular data!).

Originally developed by Hadley Wickham, creator of the tidyverse in R.

Tidy vs. “untidy” data

Now let’s see some examples of tidy vs. untidy data.

Keep in mind:

  • These datasets all contain the same information, just in different formats.
  • “Untidy” data can be useful for other things, e.g., presenting in a paper.
  • The key goal of tidy data is that each row represents an observation.

Tidy data

Check-in: Why is this data considered tidy?

df_tidy = pd.read_csv("data/wrangling/tidy.csv")
df_tidy

Untidy data 1

Check-in: Why is this data not considered tidy?

df_messy1 = pd.read_csv("data/wrangling/messy1.csv")
df_messy1

Untidy data 2

Check-in: Why is this data not considered tidy?

df_messy2 = pd.read_csv("data/wrangling/messy2.csv")
df_messy2

Making data tidy

Fortunately, pandas makes it possible to turn an “untidy” DataFrame into a tidy one.

The key function here is pandas.melt.

pd.melt(df, ### Dataframe
        id_vars = [...], ### what are the identifying columns?
        var_name = ..., ### name for variable grouping over columns
        value_name = ...) ### name for the value this variable takes on

If this seems abstract, don’t worry––it’ll become clearer with examples!

Using pd.melt

  • Let’s start with our first messy DataFrame.
  • Has columns for each ppt, which contain info about rt.
df_messy1
pd.melt(df_messy1, id_vars = 'condition', ### condition is our ID variable
       var_name = 'ppt', ### new row for each ppt observation
       value_name = 'rt') ### label for the info we have about each ppt

Check-in

Try to use pd.melt to turn df_messy2 into a tidy DataFrame.

Hint: Think about the existing structure of the DataFrame––how is data grouped––and what the id_vars would be.

df_messy2
### Your code here

Solution

pd.melt(df_messy2, id_vars = 'ppt', ### here, ppt is our ID variable
       var_name = 'condition', ### new row for each ppt observation
       value_name = 'rt') ### label for the info we have about each ppt

Hands-on: a real dataset

Now, we’ll turn to a real dataset, which Timothy Lee, creator of Full Stack Economics, compiled and shared with me.

df_work = pd.read_csv("data/viz/missing_work.csv")
df_work.head(5)

Check-in

Is this dataset tidy? How could we make it tidy, if not––i.e., if we wanted each row to be a single observation corresponding to one of the reasons for missing work?

### Your code here

Solution

df_melted = pd.melt(df_work, id_vars = ['Year', 'Month'],
                    var_name = "Reason",
                    value_name = "Days Missed")
df_melted.head(2)

Why tidy data is useful

Finally, let’s use this dataset to recreate a graph from FullStackEconomics.

Original graph


Check-in

As a first-pass approach, what tools from seaborn could you use to recreate this plot?

### Your code here

Solution

This is okay, but not quite what we want: it groups the data by Year only, whereas we want to group by both Year and Month.

# Your code here
# %load ./solutions/solution7.py

Using datetime

  • Let’s make a new column called date, which combines the Month and Year.
  • Then we can use pd.to_datetime to turn that into a custom pandas representation.
## First, let's concatenate each month and year into a single string
df_melted['date'] = df_melted.apply(lambda row: str(row['Month']) + '-' + str(row['Year']), axis = 1)
df_melted.head(2)
## Now, let's create a new "datetime" column using the `pd.to_datetime` function
df_melted['datetime'] = pd.to_datetime(df_melted['date'])
df_melted.head(2)

Plotting again

Much better!

sns.lineplot(data = df_melted, x = "datetime", y = "Days Missed", hue = "Reason");
<Figure size 640x480 with 1 Axes>

Data Transformations

Sometimes we come across a situation where we have problems with skewed distributions or we simply want to transform, recode or perform discretization. Let’s look at some of the most popular transformation methods.

First, standardization (also known as normalization):

  • Z-score approach - a standardization procedure using the formula z = \frac{x-\mu}{\sigma}, where μ = mean and σ = standard deviation. Z-scores are also known as standardized scores; these are scores (or data values) that have been put on a common standard: a mean of zero and a standard deviation of 1.

  • MinMax approach - an alternative to Z-score standardization is so-called MinMax scaling (often also called simply “normalization”, which is a common cause of ambiguity). In this approach, the data is scaled to a fixed range, usually from 0 to 1. The trade-off of this limited range, compared with Z-score standardization, is that we get smaller standard deviations, which can suppress the effect of outliers. To apply MinMax scaling, simply subtract the minimum value and divide by the range: (x - min) / (max - min).
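A minimal sketch of both scalings on an arbitrary example Series (sklearn’s StandardScaler and MinMaxScaler implement the same ideas):

## Z-score and MinMax scaling of an arbitrary example Series
x = pd.Series([2, 5, 7, 10, 50])
z_scores = (x - x.mean()) / x.std()              # mean 0, standard deviation 1
min_max  = (x - x.min()) / (x.max() - x.min())   # rescaled to the [0, 1] range
print(z_scores.round(2).tolist())
print(min_max.round(2).tolist())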

To solve problems with very skewed distributions, we can also use several types of simple transformations:

  • log(x)
  • log(x + 1)
  • sqrt(x)
  • x^2
  • x^3
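A short sketch of these transformations on a synthetic, right-skewed variable:

## Simple transformations of a right-skewed variable (synthetic lognormal data)
skewed = pd.Series(np.random.lognormal(mean=0, sigma=1, size=1000))
log_t    = np.log(skewed)      # log: requires strictly positive values
log1p_t  = np.log1p(skewed)    # log(x + 1): also handles zeros
sqrt_t   = np.sqrt(skewed)     # milder than the log transform
square_t = skewed ** 2         # x^2 and x^3 are typically used for left-skewed data
cube_t   = skewed ** 3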

Exercise: Normalize income and plot the transformed income distribution in a box plot.

income = pd.read_csv("data/models/income.csv")
income.head(3)
plt.figure(figsize=(10, 6))
plt.hist(income['Income'], edgecolor='black')
plt.title('Histogram of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
<Figure size 1000x600 with 1 Axes>
# your code here

Box - Cox Transformations

The Box-Cox transformation is a technique used in data wrangling that helps stabilize the variance and transform the data so that it is closer to a normal distribution—which is desirable in many statistical analysis and modeling methods. This transformation works particularly well for data with a right-skewed distribution. It requires positive values and involves finding the optimal parameter λ (lambda), which is chosen using the maximum likelihood method. In Python, the Box-Cox transformation can be performed using the boxcox function from the scipy.stats module.

Let y = (y_1, y_2, ..., y_n) be the data to which the Box-Cox transformation is to be applied. Box and Cox defined their transformation as:

for \lambda \neq 0:

y_i^{(\lambda)} = \frac{y_i^{\lambda} - 1}{\lambda}

for \lambda = 0:

y_i^{(\lambda)} = \log(y_i)

For the unknown λ, one assumes the linear model

y^{(\lambda)} = X\beta + \epsilon

where y^{(λ)} is the λ-transformed data, X is the design matrix (possible covariates of interest), β is a set of parameters associated with the λ-transformed data, and ε = (ε_1, ε_2, ..., ε_n) is an error term. The goal of the transformation is that

y^{(\lambda)} \sim N(X\beta, \sigma^2 I_n)

and hence ε ~ N(0, σ²). Note that the transformation is only valid for y_i > 0, i = 1, 2, ..., n, and modifications must be made when negative observations are present.

import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import boxcox
import warnings
warnings.filterwarnings('ignore')
plt.figure(figsize = (4, 4))
data = np.random.beta(1, 3, 5000) # we create random non normal data (from beta distribution)
sns.distplot(data)
plt.show()
<Figure size 400x400 with 1 Axes>
data_transformed, lambda_ = boxcox(data) #If lambda is None, the second argument returned is a lambda that maximizes the log-likelihood function.
print('Transformed data to normal distribution', data_transformed)
print(40*'==')
print('Lambda, which maximizes the log-likelihood function for the normal distribution:', lambda_)
Transformed data to normal distribution [-1.18780311 -1.24561546 -0.73856995 ... -1.36663233 -1.20507286
 -0.89269272]
================================================================================
Lambda, which maximizes the log-likelihood function for the normal distribution: 0.40914726473864266
manually_transformed = [(i**lambda_ -1)/lambda_ for i in data] # manually
warnings.filterwarnings('ignore')
plt.figure(figsize = (4, 4))
sns.distplot(data_transformed)
plt.show()
<Figure size 400x400 with 1 Axes>
warnings.filterwarnings('ignore')
plt.figure(figsize = (4, 4))
sns.distplot(manually_transformed) # manually selected lambda
plt.show()
<Figure size 400x400 with 1 Axes>
lambde = [-4, -2, -1, 0.5, 1, 2, 4] # only for lambda close to 0.5 do we get an approximately normal distribution
x = np.linspace(data.min(), data.max(), 5000)
for l in lambde:
    transformed = [(i**l -1)/l for i in data]
    plt.figure(figsize=(4,4))
    plt.hist(data, bins=20, density=True)
    plt.hist(transformed, bins=30, density=True, label = f'lambda: {l}')
    plt.xlim(-2, 2)
    plt.legend()
<Figure size 400x400 with 1 Axes><Figure size 400x400 with 1 Axes><Figure size 400x400 with 1 Axes><Figure size 400x400 with 1 Axes><Figure size 400x400 with 1 Axes><Figure size 400x400 with 1 Axes><Figure size 400x400 with 1 Axes>

Data categorization

Sometimes we would simply like to perform the so-called “binning” procedure to be able to analyze our categorical data, compare several categorical variables, construct statistical models, etc. With the “binning” function, we can transform quantitative variables into categorical ones using several methods:

  1. Quantile Binning
  • Divides data into equal-sized groups based on percentiles.
  • Example: pd.qcut(data, q=4) creates 4 quantile-based bins.
  2. Equal-Width Binning
  • Binning into intervals of a fixed length (e.g. every 100 USD).
  • Formula:
    \text{Bin width} = \frac{\max(\text{data}) - \min(\text{data})}{\text{number of bins}}
  • Example: pd.cut(data, bins=5) creates 5 equal-width bins.
  3. Pretty Binning
  • A compromise between quantile and equal-width binning, creating aesthetically pleasing intervals.
  • Uses rounded cut points based on the data range and bin count.
  4. K-Means Binning
  • Uses the K-Means clustering algorithm to group data into bins.
  • Formula:
    \text{Centroids minimize } \sum_{i=1}^{n} (x_i - \mu_k)^2
  • where \mu_k is the mean of cluster k.
  5. Bag Clustering (BClust)
  • Groups data using bagging techniques to improve clustering stability.
  • Clusters are formed by aggregating results from multiple clustering models.

Each method is suited for different use cases, depending on the data distribution and analysis goals; the two pandas built-ins are illustrated in the sketch below.
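A brief illustration on synthetic, income-like data (the clustering-based methods would additionally require, e.g., sklearn’s KMeans):

## Equal-width vs. quantile binning of a synthetic income-like variable
values = pd.Series(np.random.lognormal(mean=10, sigma=0.5, size=1000))
equal_width = pd.cut(values, bins=5)   # 5 intervals of equal length
quantiles   = pd.qcut(values, q=4)     # quartiles: roughly 250 observations per bin
print(equal_width.value_counts().sort_index())
print(quantiles.value_counts().sort_index())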

Exercise: Using the quantile approach, bin the “Income” variable. Hint: pandas has a ready-to-use function, pd.qcut.

# Your code here

Summary

That was an introduction to data wrangling.

  • As mentioned, data organization is an incredibly important topic - and could be an entire subject in this postgraduate program.

  • But today we focused on:

  • Identifying and removing missing data.

  • Merging data sets.

  • Organizing data.