
Python Fundamentals 2.

Introduction to Python Fundamentals for Data Analysis

Gdansk University of Technology
Chongqing Technology and Business University

Abstract

This notebook provides a comprehensive review of Python fundamentals essential for data analysis. Topics covered include variables, functions, conditional statements, loops, and basic data types. Through examples and exercises, participants will strengthen their understanding of Python syntax and its application in data science workflows.

Keywords: data, statistics, data analysis, python

Review (Working with Data)

Python 4 DS

Goals of this lecture

This lecture will review tools useful for working with data in Python.

  • Focus: tabular data in .csv files.
  • Intro to pandas.
  • Basic manipulation and analysis using pandas.

What is a file?

A file is a set of bytes used to store some kind of data.

The format of this data depends on what you’re using it for, but at some level, it is translated into binary bits (1s and 0s).

The file format is usually specified in the file extension.

  • .csv: comma separated values.
  • .txt: a plain text file.
  • .py: a Python source file.
  • .png: a Portable Network Graphics file (i.e., an image).
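
To make the "bytes and bits" idea concrete, here is a minimal sketch that writes a tiny text file, reads it back as raw bytes, and prints each byte as binary digits. The file name example.txt is a hypothetical placeholder, and the file is created in a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

# Write a short text file into a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example.txt"
    path.write_text("Hi", encoding="utf-8")

    extension = path.suffix        # the part of the name from the last dot onward
    raw_bytes = path.read_bytes()  # the file's contents as raw bytes

# Each byte can be written as 8 binary digits (bits).
bits = [format(byte, "08b") for byte in raw_bytes]

print(extension)  # .txt
print(raw_bytes)  # b'Hi'
print(bits)       # ['01001000', '01101001']
```

Note that the extension is only a naming convention: renaming a file does not change the bytes stored inside it.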

What is tabular data?

Tabular data is data organized in a table with rows and columns.

  • This kind of data is two-dimensional.
  • Typically, each row represents an “observation”.
  • Typically, each column represents an attribute.

Often stored in .csv files.

  • .csv = “comma-separated values”
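
As a rough illustration of what a .csv file contains, the sketch below parses a small piece of comma-separated text with Python's built-in csv module. The names and values are made up for this example; the first line of the text holds the column names, and every later line is one row.

```python
import csv
import io

# Hypothetical .csv contents: a header line followed by one line per row.
csv_text = "name,age\nAda,36\nGrace,45\n"

# DictReader uses the header line as the keys for each row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

n_rows = len(rows)
columns = list(rows[0].keys())

print(n_rows)          # 2
print(columns)         # ['name', 'age']
print(rows[0]["age"])  # 36 (read as the string '36', not a number)
```

The csv module leaves every value as a string, which is one reason a package like pandas is more convenient for analysis.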

Example: Countries

Check-in: What does each row represent? What about each column?

Country   Population (million)   GDP (Trillions)
USA       329.5                  20.94
UK        76.22                  2.7
China     1402                   14.72

Introducing pandas

pandas is a package that enables fluid and efficient storage, manipulation, and analysis of data.

## Import statement: pandas is a "package"
import pandas as pd

pandas.read_csv

Tabular data is often stored in .csv files.

  • pandas.read_csv can be used to read in a .csv file.
  • The result is returned as a pandas.DataFrame.
pd.read_csv("path/to/file.csv") ### replace with actual filepath!
### .csv file with data about different Pokemon
df_pokemon = pd.read_csv("data/pokemon.csv")
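
If you do not have a .csv file at hand, the following sketch shows the same call on in-memory text. The data and the name df_demo are hypothetical stand-ins, not the Pokemon dataset; io.StringIO simply makes a string behave like an open file.

```python
import io

import pandas as pd

# Hypothetical data standing in for a file on disk.
csv_text = "Name,HP,Legendary\nAlpha,45,False\nBeta,60,False\nGamma,106,True\n"

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(type(df_demo).__name__)  # DataFrame
print(list(df_demo.columns))   # ['Name', 'HP', 'Legendary']
print(df_demo.dtypes)          # pandas infers a type for each column
```

Unlike the plain csv module, read_csv infers column types, so HP is read as integers and Legendary as booleans.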

read_csv with a URL

You can also pass a URL that points to a .csv file into read_csv.

  • This is a dataset from Brand et al. (2019), which quantified changes in the positivity and negativity in song lyrics over time.
  • We’ll be working more with that dataset soon!
df_lyrics = pd.read_csv("https://raw.githubusercontent.com/lottybrand/song_lyrics/master/data/billboard_analysis.csv")
df_lyrics.head(2)

Using a DataFrame

  • Now that we have a DataFrame object, we want to be able to use that DataFrame.
  • This includes:
    • Getting basic information about the DataFrame (e.g., its shape).
    • Accessing specific columns.
    • Accessing specific rows.

Using shape

df.shape tells us how many rows and columns are in the DataFrame.

## (#rows, #columns)
df_pokemon.shape
(800, 13)
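
A small sketch of how shape can be used in practice, on a hypothetical DataFrame called df_demo. Note that shape is an attribute rather than a method, so there are no parentheses.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns.
df_demo = pd.DataFrame({"Name": ["Alpha", "Beta", "Gamma"], "HP": [45, 60, 106]})

# shape is a tuple, so it can be unpacked into two variables.
n_rows, n_cols = df_demo.shape

print(df_demo.shape)  # (3, 2)
print(n_rows)         # 3
print(len(df_demo))   # 3 (len counts rows only)
```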

Using head and tail

  • The head(x) method returns the first x rows of the DataFrame.
  • Similarly, tail(x) returns the last x rows.
df_pokemon.head(2)
df_pokemon.tail(2)
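
Here is a minimal sketch of head and tail on a hypothetical ten-row DataFrame, so the results are easy to check by eye. Both methods return a new, smaller DataFrame, and both default to 5 rows when no argument is given.

```python
import pandas as pd

# Hypothetical toy data: one column holding the numbers 0 through 9.
df_demo = pd.DataFrame({"x": range(10)})

first_two = df_demo.head(2)    # first 2 rows
last_two = df_demo.tail(2)     # last 2 rows
default_head = df_demo.head()  # no argument: first 5 rows

print(first_two["x"].tolist())  # [0, 1]
print(last_two["x"].tolist())   # [8, 9]
print(len(default_head))        # 5
```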

Accessing columns

  • A column can be accessed using dataframe_name['column_name'].
### What does this bracket syntax (["column_name"]) remind you of?
df_pokemon['Speed'].head(5)
0    45
1    60
2    80
3    80
4    65
Name: Speed, dtype: int64
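
The sketch below, on a hypothetical df_demo, shows the two common forms of column access. Looking up a column by name works much like looking up a key in a dictionary; a single name gives back a Series, while a list of names gives back a smaller DataFrame.

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame(
    {
        "Name": ["Alpha", "Beta", "Gamma"],
        "HP": [45, 60, 106],
        "Speed": [45, 60, 80],
    }
)

speed = df_demo["Speed"]             # one column name -> Series
subset = df_demo[["Name", "Speed"]]  # list of column names -> DataFrame

print(type(speed).__name__)   # Series
print(type(subset).__name__)  # DataFrame
print(speed.tolist())         # [45, 60, 80]
```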

Useful operations with pandas

DataFrames enable all sorts of useful operations, including:

  • Sorting a DataFrame by a particular column.
  • Calculating descriptive statistics (e.g., mean, median, etc.).
  • Filtering a DataFrame.
  • Aggregating across levels of a variable using groupby.

Note that we’ll also cover these topics in more depth in Week 4!

sort_values

### By default, will sort from lowest to highest
df_pokemon.sort_values("HP").head(2)
### Show the highest HP
df_pokemon.sort_values("HP", ascending = False).head(2)
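
A short sketch of a few sort_values behaviours on hypothetical toy data (the names and numbers are invented). It illustrates that sorting returns a new DataFrame and leaves the original untouched, and that a list of columns can be used to break ties.

```python
import pandas as pd

# Hypothetical toy data; two rows share the same HP to show tie-breaking.
df_demo = pd.DataFrame(
    {
        "Name": ["Alpha", "Beta", "Gamma", "Delta"],
        "HP": [60, 45, 106, 45],
        "Speed": [70, 90, 50, 30],
    }
)

# Highest HP first; the result is a new DataFrame.
by_hp_desc = df_demo.sort_values("HP", ascending=False)

# Sort by HP (low to high), then break ties by Speed (high to low).
by_two = df_demo.sort_values(["HP", "Speed"], ascending=[True, False])

print(by_hp_desc["Name"].iloc[0])  # Gamma
print(by_two["Name"].tolist())     # ['Beta', 'Delta', 'Alpha', 'Gamma']
print(df_demo["Name"].tolist())    # ['Alpha', 'Beta', 'Gamma', 'Delta'] (unchanged)
```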

Check-in

What is the Speed of the Pokemon with the highest Attack?

### Your code here

# %load ./solutions/solution4.py

Descriptive statistics

Columns of a DataFrame can also be summarized:

  • mean: average value (for numeric variables)
  • median: “middle” value (for numeric variables)
  • mode: most frequent value
df_pokemon['Attack'].mean()
79.00125
df_pokemon['Attack'].median()
75.0
df_pokemon['Attack'].mode()
0    100
Name: Attack, dtype: int64
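
To see why mode returns a Series rather than a single number, here is a minimal sketch on hypothetical values small enough to verify by hand. When several values are tied for most frequent, mode returns all of them.

```python
import pandas as pd

# Hypothetical toy values.
attack = pd.Series([50, 60, 60, 80, 100], name="Attack")

print(attack.mean())           # 70.0
print(attack.median())         # 60.0
print(attack.mode().tolist())  # [60]

# With a tie for most frequent value, mode returns every tied value.
tied = pd.Series([1, 1, 2, 2, 3])
print(tied.mode().tolist())    # [1, 2]
```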

Filtering a DataFrame

  • Often, we want to filter a DataFrame so we only see observations that meet certain conditions.
  • Ultimately, this is similar to using a conditional statement, just with different syntax.

Example 1: filtering on a categorical variable

  • The Legendary column is a categorical variable: each Pokemon falls into one of a small set of discrete categories (here, True or False).
## Which pokemon are legendary? (first 5 rows)
df_pokemon[df_pokemon['Legendary']==True].head(5)
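
It can help to look at the filter in two steps. In this sketch (hypothetical toy data), the comparison first produces a Series of True/False values, often called a mask, and the mask then selects the matching rows. Summing the mask is one way to count how many rows match.

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame(
    {
        "Name": ["Alpha", "Beta", "Gamma", "Delta"],
        "Legendary": [False, True, False, True],
    }
)

# Step 1: the comparison gives one True/False value per row.
mask = df_demo["Legendary"] == True

# Step 2: indexing with the mask keeps only the rows where it is True.
legendary_only = df_demo[mask]

print(mask.tolist())                    # [False, True, False, True]
print(legendary_only["Name"].tolist())  # ['Beta', 'Delta']
print(int(mask.sum()))                  # 2 (True counts as 1)
```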

Example 2: filtering on a continuous variable

  • The HP column is a numeric variable that we can treat as continuous.
  • Let’s show only the rows for Pokemon with an HP > 150.
df_pokemon[df_pokemon['HP'] > 150].head(3)
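
Conditions can also be combined. The sketch below uses hypothetical toy data and the & operator ("and") to require two conditions at once; each condition needs its own parentheses, and | can be used in the same way for "or".

```python
import pandas as pd

# Hypothetical toy data.
df_demo = pd.DataFrame(
    {
        "Name": ["Alpha", "Beta", "Gamma", "Delta"],
        "HP": [45, 160, 255, 90],
        "Legendary": [False, False, True, True],
    }
)

# One condition: HP above 150.
high_hp = df_demo[df_demo["HP"] > 150]

# Two conditions joined with &: HP above 150 AND Legendary is True.
high_hp_legendary = df_demo[(df_demo["HP"] > 150) & (df_demo["Legendary"] == True)]

print(high_hp["Name"].tolist())            # ['Beta', 'Gamma']
print(high_hp_legendary["Name"].tolist())  # ['Gamma']
```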

Using groupby

The groupby method allows you to split data into groups (i.e., along the categories of a column) and then apply some function (e.g., mean) to each group.

The syntax is as follows:

df_name.groupby("column_to_group_by").mean() ## or median, etc.

Example: mean Attack by Legendary

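Here, the [[...]] syntax just limits the columns in the DataFrame to the columns we directly care about. It also keeps mean() from being applied to text columns such as Type 1, which raises an error in recent versions of pandas.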

df_pokemon[['Legendary', 'Attack']].groupby("Legendary").mean()
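
The split-then-apply idea can be checked by hand on a small example. This sketch uses hypothetical Team and Score columns (not the Pokemon data) and an alternative spelling in which the column is selected after grouping; agg is shown as one way to compute several summaries at once.

```python
import pandas as pd

# Hypothetical toy data: two teams with a few scores each.
df_demo = pd.DataFrame(
    {
        "Team": ["red", "red", "blue", "blue", "blue"],
        "Score": [10, 20, 30, 40, 50],
    }
)

# Split by Team, then take the mean of Score within each group.
mean_by_team = df_demo.groupby("Team")["Score"].mean()

print(mean_by_team["red"])   # 15.0
print(mean_by_team["blue"])  # 40.0

# Several summaries at once: one row per group, one column per statistic.
summary = df_demo.groupby("Team")["Score"].agg(["mean", "max"])
print(summary)
```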

Check-in:

How would you calculate the median Defense by Legendary status?

### Your code here

# %load ./solutions/solution5.py

Check-in:

How would you calculate the mean HP by Type 1?

### Your code here

# %load ./solutions/solution6.py

Conclusion

This concludes our unit on interacting with data. We covered:

  • Reading in .csv files with pandas.
  • Summarizing and working with tabular data.