
Bivariate Statistics

Foundations of Statistical Analysis in Python

Gdansk University of Technology
Chongqing Technology and Business University

Abstract

This notebook explores bivariate relationships through linear correlations, highlighting their strengths and limitations. Practical examples and visualizations are provided to help users understand and apply these statistical concepts effectively.

Keywords: python, pandas, data wrangling, data cleansing, data visualization, data analysis

Goals of this lecture

There are many ways to describe a distribution.

Here we will discuss:

  • Measurement of the relationship between distributions using linear and rank correlations.
  • Measurement of the relationship between qualitative variables using contingency tables.

Importing relevant libraries

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import scipy.stats as ss
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

df_pokemon = pd.read_csv("data/pokemon.csv")

Describing bivariate data with correlations

  • So far, we’ve been focusing on univariate data: a single distribution.
  • What if we want to describe how two distributions relate to each other?
    • For today, we’ll focus on continuous distributions.

Bivariate relationships: height

df_height = pd.read_csv("data/wrangling/height.csv")
df_height.head(2)
Plotting Pearson’s height data
sns.scatterplot(data = df_height, x = "Father", y = "Son", alpha = .5);

Introducing linear correlations

A correlation coefficient is a number in the interval [–1, 1] that describes the relationship between a pair of variables.

Specifically, Pearson’s correlation coefficient (or Pearson’s r) describes a (presumed) linear relationship.

Two key properties:

  • Sign: whether a relationship is positive (+) or negative (–).
  • Magnitude: the strength of the linear relationship.
r = \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 } \sqrt{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } }

Where:

  • r – Pearson correlation coefficient
  • x_i, y_i – values of the variables
  • x̄, ȳ – arithmetic means
  • n – number of observations

Pearson’s correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. Its value ranges from -1 to 1:

  • 1 → perfect positive linear correlation
  • 0 → no linear correlation
  • -1 → perfect negative linear correlation

This coefficient does not capture nonlinear relationships, and it is sensitive to outliers.

Calculating Pearson’s r with scipy

scipy.stats has a function called pearsonr, which will calculate this relationship for you.

Returns two numbers:

  • r: the correlation coefficient.
  • p: the p-value of this correlation coefficient, i.e., whether it’s significantly different from 0.
ss.pearsonr(df_height['Father'], df_height['Son'])
PearsonRResult(statistic=np.float64(0.5011626808075912), pvalue=np.float64(1.2729275743661585e-69))
Check-in

Using scipy.stats.pearsonr (here, ss.pearsonr), calculate Pearson’s r for the relationship between the Attack and Defense of Pokemon.

  • Is this relationship positive or negative?
  • How strong is this relationship?
### Your code here
Solution
ss.pearsonr(df_pokemon['Attack'], df_pokemon['Defense'])
(0.4386870551184888, 5.858479864290367e-39)
Check-in

Pearson’s r measures the linear correlation between two variables. Can anyone think of potential limitations to this approach?

Limitations of Pearson’s rr

  • Pearson’s r presumes a linear relationship and tries to quantify its strength and direction.
  • But many relationships are non-linear!
  • Unless we visualize our data, relying only on Pearson’s r could mislead us.
Non-linear data where r ≈ 0
x = np.arange(1, 40)
y = np.sin(x)
p = sns.lineplot(x = x, y = y)
### r is close to 0, despite there being a clear relationship!
ss.pearsonr(x, y)
(-0.04067793461845844, 0.8057827185936633)
When r is invariant to the real relationship

All four of these datasets (Anscombe’s quartet) have roughly the same correlation coefficient.

df_anscombe = sns.load_dataset("anscombe")
sns.relplot(data = df_anscombe, x = "x", y = "y", col = "dataset");
# Compute correlation matrix
corr = df_pokemon.corr(numeric_only=True)

# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Create a heatmap
sns.heatmap(corr, 
            annot=True,         # Show correlation coefficients
            fmt=".2f",          # Format for coefficients
            cmap="coolwarm",    # Color palette
            vmin=-1, vmax=1,    # Fixed scale
            square=True,        # Make cells square
            linewidths=0.5,     # Line width between cells
            cbar_kws={"shrink": .75})  # Colorbar shrink

# Title and layout
plt.title("Correlation Heatmap", fontsize=16)
plt.tight_layout()

# Show plot
plt.show()

Rank Correlations

Rank correlations are measures of the strength and direction of a monotonic (increasing or decreasing) relationship between two variables. Instead of numerical values, they use ranks, i.e., positions in an ordered set.

They are less sensitive to outliers and do not require linearity (unlike Pearson’s correlation).

Types of Rank Correlations

  1. Spearman’s ρ (rho)
  • Based on the ranks of the data.
  • Value: from –1 to 1.
  • Works well for monotonic but non-linear relationships.
\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}

Where:

  • d_i – differences between the ranks of observations,
  • n – number of observations.
  2. Kendall’s τ (tau)
  • Measures the number of concordant vs. discordant pairs.
  • More conservative than Spearman’s – often yields smaller values.
  • Also ranges from –1 to 1.
\tau = \frac{C - D}{\frac{1}{2} n (n - 1)}

Where:

  • τ – Kendall’s correlation coefficient,
  • C – number of concordant pairs,
  • D – number of discordant pairs,
  • n – number of observations,
  • n(n – 1)/2 – total number of possible pairs of observations.

What are concordant and discordant pairs?

  • Concordant pair: x_i < x_j and y_i < y_j, or x_i > x_j and y_i > y_j.
  • Discordant pair: x_i < x_j and y_i > y_j, or x_i > x_j and y_i < y_j.

When to use rank correlations?

  • When the data are not normally distributed.
  • When you suspect a non-linear but monotonic relationship.
  • When your data are ordinal (ranks), such as grades, rankings, or preference levels.
| Correlation type | Description | When to use |
|---|---|---|
| Spearman’s (ρ) | Monotonic correlation, based on ranks | When data are non-linear or have outliers |
| Kendall’s (τ) | Counts the proportion of concordant vs. discordant pairs | When robustness to ties is important |

Interpretation of correlation values

| Range of values | Correlation interpretation |
|---|---|
| 0.8 – 1.0 | very strong positive |
| 0.6 – 0.8 | strong positive |
| 0.4 – 0.6 | moderate positive |
| 0.2 – 0.4 | weak positive |
| 0.0 – 0.2 | very weak or no correlation |
| < 0 | analogously, negative correlation |
# Compute Kendall rank correlation
corr_kendall = df_pokemon.corr(method='kendall', numeric_only=True)

# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Create a heatmap (note: plot the Kendall matrix, not the Pearson one)
sns.heatmap(corr_kendall, 
            annot=True,         # Show correlation coefficients
            fmt=".2f",          # Format for coefficients
            cmap="coolwarm",    # Color palette
            vmin=-1, vmax=1,    # Fixed scale
            square=True,        # Make cells square
            linewidths=0.5,     # Line width between cells
            cbar_kws={"shrink": .75})  # Colorbar shrink

# Title and layout
plt.title("Kendall Correlation Heatmap", fontsize=16)
plt.tight_layout()

# Show plot
plt.show()

Comparison of Correlation Coefficients

| Property | Pearson (r) | Spearman (ρ) | Kendall (τ) |
|---|---|---|---|
| What it measures | Linear relationship | Monotonic relationship (based on ranks) | Monotonic relationship (based on pairs) |
| Data type | Quantitative, normal distribution | Ranks or ordinal/quantitative data | Ranks or ordinal/quantitative data |
| Sensitivity to outliers | High | Lower | Low |
| Value range | –1 to 1 | –1 to 1 | –1 to 1 |
| Requires linearity | Yes | No | No |
| Robustness to ties | Low | Medium | High |
| Interpretation | Strength and direction of linear relationship | Strength and direction of monotonic relationship | Proportion of concordant vs. discordant pairs |
| Significance test | Yes (scipy.stats.pearsonr) | Yes (spearmanr) | Yes (kendalltau) |

Brief summary:

  • Pearson - best when the data are normal and the relationship is linear.
  • Spearman - works better for non-linear monotonic relationships.
  • Kendall - more conservative, often used in social research, less sensitive to small changes in data.

Your Turn

For the Pokemon dataset, find the pairs of variables that are most appropriate for using one of the quantitative correlation measures. Calculate them, then visualize them.

from scipy.stats import pearsonr, spearmanr, kendalltau

## Your code here

Correlation of Qualitative Variables

A categorical variable is one that takes descriptive values that represent categories—e.g. Pokémon type (Fire, Water, Grass), gender, status (Legendary vs. Normal), etc.

Such variables cannot be analyzed directly using correlation methods for numbers (Pearson, Spearman, Kendall). Other techniques are used instead.

Contingency Table

A contingency table is a special cross-tabulation table that shows the frequency (i.e., the number of cases) for all possible combinations of two categorical variables.

It is a fundamental tool for analyzing relationships between qualitative features.

Chi-Square Test of Independence

The Chi-Square test checks whether there is a statistically significant relationship between two categorical variables.

Concept:

We compare:

  • observed values (from the contingency table),
  • with expected values, assuming the variables are independent.
\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

Where:

  • O_ij – observed count in cell (i, j),
  • E_ij – expected count in cell (i, j), assuming independence.

Example: Calculating Expected Values and Chi-Square Statistic in Python

Here’s how you can calculate the expected values and Chi-Square statistic (χ²) step by step using Python.


Step 1: Create the Observed Contingency Table

We will use the Pokémon example:

| Type 1 | Legendary = False | Legendary = True | Total |
|---|---|---|---|
| Fire | 18 | 5 | 23 |
| Water | 25 | 3 | 28 |
| Grass | 20 | 2 | 22 |
| Total | 63 | 10 | 73 |
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Observed values (contingency table)
observed = np.array([
    [18, 5],  # Fire
    [25, 3],  # Water
    [20, 2]   # Grass
])

# Convert to DataFrame for better visualization
observed_df = pd.DataFrame(
    observed,
    columns=["Legendary = False", "Legendary = True"],
    index=["Fire", "Water", "Grass"]
)
print("Observed Table:")
print(observed_df)
Observed Table:
       Legendary = False  Legendary = True
Fire                  18                 5
Water                 25                 3
Grass                 20                 2

Step 2: Calculate Expected Values

The expected values are calculated using the formula:

E_{ij} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}

You can calculate this manually or use scipy.stats.chi2_contingency, which automatically computes the expected values.

# Perform Chi-Square test
chi2, p, dof, expected = chi2_contingency(observed)

# Convert expected values to DataFrame for better visualization
expected_df = pd.DataFrame(
    expected,
    columns=["Legendary = False", "Legendary = True"],
    index=["Fire", "Water", "Grass"]
)
print("\nExpected Table:")
print(expected_df)

Expected Table:
       Legendary = False  Legendary = True
Fire           19.849315          3.150685
Water          24.164384          3.835616
Grass          18.986301          3.013699

Step 3: Calculate the Chi-Square Statistic

The Chi-Square statistic is calculated using the formula:

\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

This is done automatically by scipy.stats.chi2_contingency, but you can also calculate it manually:

# Manual calculation of Chi-Square statistic
chi2_manual = np.sum((observed - expected) ** 2 / expected)
print(f"\nChi-Square Statistic (manual): {chi2_manual:.4f}")

Chi-Square Statistic (manual): 1.8638

Step 4: Interpret the Results

The chi2_contingency function also returns:

  • p-value: the probability of observing the data if the null hypothesis (independence) is true.
  • Degrees of Freedom (dof): calculated as (rows – 1) × (columns – 1).

print(f"\nChi-Square Statistic: {chi2:.4f}")
print(f"p-value: {p:.4f}")
print(f"Degrees of Freedom: {dof}")

Chi-Square Statistic: 1.8638
p-value: 0.3938
Degrees of Freedom: 2

Interpretation of the Chi-Square Test Result:

| Value | Meaning |
|---|---|
| High χ² value | Large difference between observed and expected values |
| Low p-value | Strong basis to reject the null hypothesis of independence |
| p < 0.05 | Statistically significant relationship between variables |

Qualitative Correlations

Cramér’s V

Cramér’s V is a measure of the strength of association between two categorical variables. It is based on the Chi-Square test but scaled to a range of 0–1, making it easier to interpret the strength of the relationship.

V = \sqrt{ \frac{\chi^2}{n \cdot (k - 1)} }

Where:

  • χ² – Chi-Square test statistic,
  • n – number of observations,
  • k – the smaller number of categories (rows or columns) in the contingency table.

Phi Coefficient (φ)

Application:

  • Both variables must be dichotomous (e.g., Yes/No, 0/1), i.e., the contingency table must be 2×2.
  • Ideal for analyzing relationships like gender vs purchase, type vs legendary.
\phi = \sqrt{ \frac{\chi^2}{n} }

Where:

  • χ² – Chi-Square test statistic for a 2×2 table,
  • n – number of observations.

Tschuprow’s T

Tschuprow’s T is a measure of association similar to Cramér’s V, but with a different normalization, so the two diverge when the numbers of categories in the two variables differ. It applies to contingency tables of any size.

T = \sqrt{ \frac{\chi^2}{n \sqrt{(r - 1)(c - 1)}} }

Where:

  • χ² – Chi-Square test statistic,
  • n – number of observations,
  • r, c – numbers of rows and columns in the contingency table.

Application: Tschuprow’s T equals Cramér’s V for square tables and is attenuated otherwise, so it is most informative when the numbers of row and column categories are similar.


Summary - Qualitative Correlations

| Measure | What it measures | Application | Value Range | Strength Interpretation |
|---|---|---|---|---|
| Cramér’s V | Strength of association between nominal variables | Any categories | 0 to 1 | 0.1 – weak, 0.3 – moderate, >0.5 – strong |
| Phi (φ) | Strength of association in a 2×2 table | Two binary variables | –1 to 1 | Similar to correlation |
| Tschuprow’s T | Strength of association, alternative to Cramér’s V | Tables with similar category counts | 0 to 1 | Less commonly used |
| Chi² (χ²) | Statistical test of independence | All categorical variables | 0 to ∞ | Higher values indicate stronger differences |

Example

Let’s investigate whether the Pokémon’s type (Type 1) is related to whether the Pokémon is legendary.

We’ll use the scipy library.

This library already has built-in functions for calculating various qualitative correlation measures.

from scipy.stats.contingency import association

# Contingency table:
ct = pd.crosstab(df_pokemon["Type 1"], df_pokemon["Legendary"])

# Calculating Cramér's V measure
V = association(ct, method="cramer") # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.association.html#association

print(f"Cramer's V: {V}") # interpret!
Cramer's V: 0.3361928228447545

Your turn

Which visualization would be most appropriate for presenting a quantitative, a rank-based, and a qualitative relationship, respectively?

Think about which pairs of variables in the Pokemon data call for which type of analysis.


## Your code and discussion here

Heatmaps for qualitative correlations

# git clone https://github.com/ayanatherate/dfcorrs.git
# cd dfcorrs 
# pip install -r requirements.txt

from dfcorrs.cramersvcorr import Cramers
cram=Cramers()
# cram.corr(df_pokemon)
cram.corr(df_pokemon, plot_htmp=True)

Your turn!

Load the “sales” dataset and perform the bivariate analysis together with the necessary plots. Remember to run data preprocessing before the analysis.

df_sales = pd.read_excel("data/sales.xlsx")
df_sales.head(5)

Summary

There are many ways to describe our data:

  • Measure its central tendency.

  • Measure its variability, skewness, and kurtosis.

  • Measure the correlations between its variables.

All of these are useful, and all of them are part of exploratory data analysis (EDA).