How pandas analyze data

Introduction to Pandas

Pandas is a library for the analysis of tabular data. It is one of the fundamental tools for data science, data engineering and the like.

Adapted from https://github.com/janpipek/fbmi-python-course/blob/main/notebooks/130_pandas_intro.ipynb

In [ ]:
# Start using pandas (default import convention)
import pandas as pd
import numpy as np
In [ ]:
# Let pandas speak for themselves
print(pd.__doc__)

Visit the official website for its nicely written documentation: https://pandas.pydata.org

In [ ]:
# Current version (should be 1.5+ in 2023)
print(pd.__version__)

Basic objects

The pandas library has a vast API with many useful functions. However, most of this revolves around two important classes:

  • Series
  • DataFrame

In this introduction, we will focus on them - what each of them does and how they relate to each other and to numpy objects.

Series

Series is a one-dimensional data structure, central to pandas.

For a complete API, visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

In [ ]:
# My first series
series = pd.Series([1, 2, 3])
series

This looks a bit like a Numpy array, does it not?

Actually, in most cases the Series wraps a Numpy array...

In [ ]:
series.values  # The result is a Numpy array

But there is something more. Alongside the values, we see that each item (or "row") has a certain label. The collection of labels is called the index.

In [ ]:
series.index

This index (see below) can be used, as its name suggests, to index items of the series.

In [ ]:
# Return an element from the series
series.loc[1]
In [ ]:
# Or
series[1]
In [ ]:
# Construction from a dictionary
series_ab = pd.Series({"a": 2, "b": 4})
series_ab

Exercise: Create a series with 5 elements.

In [ ]:
result = ...
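
One possible solution (just a sketch; any five values and labels will do):

In [ ]:
# A series with 5 elements and explicit string labels
result = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])
result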

DataFrame

A DataFrame is pandas' answer to Excel sheets - it is a collection of named columns (or, in pandas terms, a collection of Series). Quite often, we read data frames directly from an external source, but it is possible to create them from:

  • a dict of Series, numpy arrays or other array-like objects
  • an iterable of rows, where rows are Series, lists, dictionaries, ... (see the sketch after the examples below)
In [ ]:
# List of lists (no column names)
table = [
    ['a', 1],
    ['b', 3],
    ['c', 5]
]
table_df = pd.DataFrame(table)
table_df
In [ ]:
# Dict of Series (with column names)
df = pd.DataFrame({
    'number': pd.Series([1, 2, 3, 4], dtype=np.int8),
    'letter': pd.Series(['a', 'b', 'c', 'd'])
})
df
In [ ]:
# Numpy array (10x2), specify column names
data = np.random.normal(0, 1, (10, 2))

df = pd.DataFrame(data, columns=['a', 'b'])
df
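
The construction from an iterable of rows mentioned above also works; for instance, a list of dictionaries (a small sketch; the dict keys become column names):

In [ ]:
# List of dicts (one dict per row); missing keys become NaN
rows = [
    {"letter": "a", "number": 1},
    {"letter": "b", "number": 3},
    {"letter": "c"},
]
pd.DataFrame(rows)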
In [ ]:
# A DataFrame also has an index.
df.index
In [ ]:
# ...that is shared by all columns
df.index is df["a"].index
In [ ]:
# The columns also form an index.
df.columns

Exercise: Create a DataFrame whose x column is $0, \frac{1}{4}\pi, \frac{1}{2}\pi, \ldots, 2\pi$, whose y column is $\cos(x)$, and whose index is the fractions 0, 1/4, 1/2, ..., 2.

In [ ]:
import fractions

index = [fractions.Fraction(n, ___) for n in range(___)]
x = np.___([___ for ___ in ___])
y = ___

df = pd.DataFrame(___, index = ___)

# display
df
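
One possible solution (a sketch; other constructions are equally valid):

In [ ]:
import fractions

# Fractions 0, 1/4, 1/2, ..., 2 serve as the index
index = [fractions.Fraction(n, 4) for n in range(9)]
x = np.array([float(f) * np.pi for f in index])
y = np.cos(x)

df = pd.DataFrame({"x": x, "y": y}, index=index)

# display
df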

D(ata) types

Pandas builds upon the numpy data types (mentioned earlier) and adds a couple more.

In [ ]:
typed_df = pd.DataFrame({
  "bool": np.arange(5) % 2 == 0,
  "int": range(5),
  "int[nan]": pd.Series([np.nan, 0, 1, 2, 3], dtype="Int64"),
  "float": np.arange(5) * 3.14,
  "complex": np.array([1 + 2j, 2 + 3j, 3 + 4j, 4 + 5j, 5 + 6j]),
  "object": [None, 1, "2", [3, 4], 5 + 6j],
  "string?": ["a", "b", "c", "d", "e"],
  "string!": pd.Series(["a", "b", "c", "d", "e"], dtype="string"),
  "datetime": pd.date_range('2018-01-01', periods=5, freq='3M'),
  "timedelta": pd.timedelta_range(0, freq="1s", periods=5),
  "category": pd.Series(["animal", "plant", "animal", "animal", "plant"], dtype="category"),
  "period": pd.period_range('2018-01-01', periods=5, freq='M'),
})
typed_df
In [ ]:
typed_df.dtypes

We will see some of these types used in practice later in the analysis.

Indices & indexing

In [ ]:
abc_series = pd.Series(range(3), index=["a", "b", "c"])
abc_series
In [ ]:
abc_series.index
In [ ]:
abc_series.index = ["c", "d", "e"]  # Changes the labels in-place!
abc_series.index.name = "letter"
abc_series
In [ ]:
table = [
    ['a', 1],
    ['b', 3],
    ['c', 5]
]
table_df = pd.DataFrame(
    table,
    index=["first", "second", "third"],
    columns=["alpha", "beta"]
)
table_df
In [ ]:
alpha = table_df["alpha"]  # Simple [] indexing in DataFrame returns Series
alpha
In [ ]:
alpha["second"]             # Simple [] indexing in Series returns scalar values.
In [ ]:
alpha.second   # This also works

But be careful!

In [ ]:
alpha.first   # Oops - "first" clashes with the Series.first method, so we get the method, not a value

Indexing with a ["list", "of", "columns"] yields a DataFrame with those columns.

For example:

In [ ]:
table_df[["beta", "alpha"]]

[["column_name"]] returs a DataFrame as well, not Series:

In [ ]:
table_df[["alpha"]]

There are two proper ways to index rows & cells in a DataFrame:

  • loc for label-based indexing
  • iloc for position-based indexing (it does not use the index labels at all)

Note the square brackets. These attributes are actually not methods but special "indexer" objects. They accept one or two arguments specifying the position along one or both axes.

loc
In [ ]:
first = table_df.loc["first"]
first
In [ ]:
table_df.loc["first", "beta"]            
In [ ]:
table_df.loc["first":"second", "beta"]   # Use ranges (inclusive)
iloc
In [ ]:
table_df.iloc[1]
In [ ]:
table_df.iloc[0:4:2]   # Select every second row
In [ ]:
table_df.at["first", "beta"]
In [ ]:
type(table_df.at)

Modifying DataFrames

Adding a new column is like adding a key/value pair to a dict. Note that this operation, unlike most others, modifies the DataFrame in place.

In [ ]:
from datetime import datetime
table_df["now"] = datetime.now()
table_df

A non-destructive alternative that returns a new DataFrame uses the assign method:

In [ ]:
table_df.assign(delta = [True, False, True])
In [ ]:
# However, the original DataFrame is not changed
table_df

Deleting a column is very easy too.

In [ ]:
del table_df["now"]
table_df

The drop method works with both rows and columns, returning a new data frame instead of modifying the original.

In [ ]:
table_df.drop("beta", axis=1)
In [ ]:
table_df.drop("second", axis=0)

Exercise: Use a combination of reset_index, drop and set_index to transform table_df into pd.DataFrame({'index': table_df.index}, index=table_df["alpha"])

In [ ]:
result = table_df.___.___.___

# display
result
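
One possible solution (a sketch following the hinted order of operations):

In [ ]:
result = table_df.reset_index().drop(columns="beta").set_index("alpha")

# display
result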

Let's get some real data!

I/O in pandas

Pandas can read (and write to) a huge variety of file formats. More details can be found in the official documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Most of the functions for reading data are named pandas.read_XXX, where XXX is the format used. We will look at three commonly used ones.

In [ ]:
# List functions for input in pandas.

print("\n".join(method for method in dir(pd) if method.startswith("read_")))

Read CSV

Nowadays, a lot of data comes in the textual comma-separated values (CSV) format. Although not properly standardized, it is the de facto standard for files that are not huge and that are also meant to be read by human eyes.

Let's read the population of U.S. states that we will need later:

In [ ]:
territories = pd.read_csv("data/us_state_population.csv")
territories.head(9)

The automatic data type parsing converts columns to appropriate types:

In [ ]:
territories.dtypes

Sometimes the CSV input does not work out of the box. Although pandas automatically understands and reads compressed files, it usually does not automatically infer the format variant (separator, quoting, ...) - for details, see the read_csv documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [ ]:
pd.read_csv('data/iris.tsv.gz')

...in this case, the CSV file does not use commas to separate values. Therefore, we need to specify an extra argument:

In [ ]:
pd.read_csv("data/iris.tsv.gz", sep='\t')

See the difference?

Read Excel

Let's read the list of U.S. incidents when lasers interfered with airplanes.

In [ ]:
pd.read_excel("data/laser_incidents_2019.xlsx")

Note: This reads just the first sheet from the file. If you want to extract other sheets, you will need to use the pandas.ExcelFile class (see the relevant part of the documentation; a minimal sketch follows).
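
A minimal sketch of using pandas.ExcelFile, assuming the same file as above:

In [ ]:
# Open the workbook once, list the sheets, and parse one of them
excel_file = pd.ExcelFile("data/laser_incidents_2019.xlsx")
print(excel_file.sheet_names)
excel_file.parse(excel_file.sheet_names[0])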

Read HTML (Optional)

Pandas is able to scrape data from tables embedded in web pages using the read_html function. This might or might not give you good results, and you will probably have to tweak the resulting data frame manually. But it is a good starting point - much better than being forced to parse the HTML ourselves!

In [ ]:
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_laser_types")
type(tables), len(tables)
In [ ]:
tables[1]
In [ ]:
tables[2]

Write CSV

Pandas is able to write to many different formats, and the usage is always similar.

In [ ]:
tables[1].to_csv("gas_lasers.csv", index=False)

Data analysis (very basics)

Let's extend the data of laser incidents to a broader time range and read the data from a summary CSV file:

In [ ]:
laser_incidents_raw = pd.read_csv("data/laser_incidents_2015-2020.csv")

Let's see what we have here...

In [ ]:
laser_incidents_raw.head()
In [ ]:
laser_incidents_raw.tail()

For an unknown, potentially unevenly distributed dataset, looking at the beginning / end is typically not the best idea. We'd rather sample randomly:

In [ ]:
# Show a few examples
laser_incidents_raw.sample(10)
In [ ]:
laser_incidents_raw.dtypes

The topic of data cleaning and pre-processing is very broad. We will limit ourselves to dropping unused columns and converting one to a proper type.

In [ ]:
# The first three are not needed
laser_incidents = laser_incidents_raw.drop(columns=laser_incidents_raw.columns[:3])

# We convert the timestamp
laser_incidents = laser_incidents.assign(
    timestamp = pd.to_datetime(laser_incidents["timestamp"])
)
laser_incidents
In [ ]:
laser_incidents.dtypes

Categorical dtype (Optional)

To analyze Laser Color, we can look at its typical values.

In [ ]:
laser_incidents["Laser Color"].describe()

Not too many different values.

In [ ]:
laser_incidents["Laser Color"].unique()
In [ ]:
laser_incidents["Laser Color"].value_counts(normalize=True)

This column is a very good candidate for conversion to the pandas-specific Categorical data type. (See https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)

In [ ]:
laser_incidents["Laser Color"].memory_usage(deep=True)   # ~60 bytes per item
In [ ]:
color_category = laser_incidents["Laser Color"].astype("category")
color_category.sample(10)
In [ ]:
color_category.memory_usage(deep=True)              # ~1-2 bytes per item

Exercise: Are there any other columns in the dataset that you would suggest for conversion to categorical?

Integer vs. float

Pandas is generally quite good at guessing (inferring) number types. You may wonder why Altitude is float and not int, though. This is a consequence of numpy not having an integer NaN. There have been many discussions about this.

In [ ]:
laser_incidents["Altitude"]
In [ ]:
laser_incidents["Altitude"].astype(int)

Quite recently, pandas introduced nullable data types for working with missing data, for example the nullable integer Int64.

In [ ]:
laser_incidents["Altitude"].astype("Int64")

Filtering

Indexing in pandas Series / DataFrames ([]) also supports boolean (mask) arrays. Such arrays are typically obtained by applying boolean operations to a Series or DataFrame.

You can also use standard comparison operators like <, <=, ==, >=, >, !=.

It is also possible to combine boolean series with logical operations. You need to use the |, &, ^ operators though, not the and, or, not keywords (an example of combining conditions follows below).

As an example, find all California incidents:

In [ ]:
is_california = laser_incidents.State == "California"
is_california.sample(10)

Now we can directly apply the boolean mask. (Note: there is no magic here - you can construct the mask yourself.)

In [ ]:
laser_incidents[is_california].sample(10)

Or maybe we should include the whole West coast?

In [ ]:
# isin takes an array of possible values
west_coast = laser_incidents[laser_incidents.State.isin(["California", "Oregon", "Washington"])]
west_coast.sample(10)

Or low-altitude incidents?

In [ ]:
laser_incidents[laser_incidents.Altitude < 300]
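
Conditions can be combined with the & and | operators mentioned above (a sketch; the parentheses around each comparison are required because of operator precedence):

In [ ]:
# Low-altitude incidents in California only
laser_incidents[
    (laser_incidents.State == "California") & (laser_incidents.Altitude < 300)
]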

Visualization intermezzo

Without much further ado, let's create our first plot.

In [ ]:
# Most frequent states
laser_incidents["State"].value_counts()[:20]
In [ ]:
laser_incidents["State"].value_counts()[:20].plot(kind="bar");

Sorting

In [ ]:
# Display 5 incidents with the highest altitude
laser_incidents.sort_values("Altitude", ascending=False).head(5)
In [ ]:
# Alternative
laser_incidents.nlargest(5, "Altitude")

Exercise: Find the last 3 incidents with blue laser.
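
One possible solution (a sketch; we assume the color is stored as "Blue" - check the unique values above - and use a case-insensitive comparison to be safe):

In [ ]:
# Filter blue-laser incidents and take the 3 most recent ones
blue = laser_incidents[laser_incidents["Laser Color"].str.lower() == "blue"]
blue.sort_values("timestamp").tail(3)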

Arithmetic and string manipulation

Standard arithmetic operators work on numerical columns too, and so do mathematical functions. Note that all such operations are performed element-wise, in a vectorized fashion.

In [ ]:
altitude_meters = laser_incidents["Altitude"] * .3048
altitude_meters.sample(10)
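
Mathematical functions from numpy also operate element-wise on Series (a quick sketch):

In [ ]:
# Element-wise square root; missing values stay missing
np.sqrt(laser_incidents["Altitude"]).sample(10)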

You may mix columns and scalars; string arithmetic also works as expected.

In [ ]:
laser_incidents["City"] + ", " + laser_incidents["State"]

Summary statistics

The describe method shows summary statistics for all the columns:

In [ ]:
laser_incidents.describe()
In [ ]:
laser_incidents.describe(include="all", datetime_is_numeric=True)
In [ ]:
laser_incidents["Altitude"].mean()
In [ ]:
laser_incidents["Altitude"].std()
In [ ]:
laser_incidents["Altitude"].max()

Basic string operations (Optional)

These are typically accessed using the .str "accessor" of the Series like this:

  • series.str.lower
  • series.str.split
  • series.str.startswith
  • series.str.contains
  • ...

See more in the documentation.

In [ ]:
laser_incidents[laser_incidents["City"].str.contains("City")]["City"].unique()
In [ ]:
laser_incidents[laser_incidents["City"].str.contains("City")]["City"].str.strip().unique()

Merging data

It is a common situation that we have two or more datasets with different columns that we need to bring together. This operation is called merging, and the pandas machinery is described in great detail in the documentation.

In our case, we would like to attach the state populations to the dataset.

In [ ]:
population = pd.read_csv("data/us_state_population.csv")
population

We will of course use the state name as the merge key. Before actually doing the merge, we can explore a bit whether all state names from the laser incidents dataset are present in our population table.

In [ ]:
unknown_states = laser_incidents.loc[~laser_incidents["State"].isin(population["Territory"]), "State"]
print(f"There are {unknown_states.count()} rows with unknown states.")
print(f"Unknown state values are: \n{list(unknown_states.unique())}.")

We could certainly clean the data by correcting some of the typos. Since the number of rows with unknown states is not large (compared to the length of the whole dataset), we will deliberately not fix the state names. Instead, we will remove those rows from the merged dataset by using the inner type of merge. All the merge types: left, inner, outer and right are well explained by the schema below:

(Figure: schema of the left, inner, outer and right merge types)

We can use the merge function to add the "Population" values.

In [ ]:
laser_incidents_w_population = pd.merge(
    laser_incidents, population, left_on="State", right_on="Territory", how="inner"
)
In [ ]:
laser_incidents_w_population
In [ ]:
laser_incidents_w_population.describe(include="all", datetime_is_numeric=True)

Grouping & aggregation

A common pattern in data analysis is grouping (or binning) data based on some property and getting some aggregate statistics.

Example: Group the participants of this workshop by nationality and get the cardinality (the size) of each group.

Possibly the simplest grouping and aggregation is the value_counts method, which groups by the column values and yields the count (or normalized frequency) of each unique value in the data.

In [ ]:
laser_incidents_w_population["State"].value_counts(normalize=False)

This is just a primitive grouping and aggregation operation; let's look into more advanced patterns. Say we would like to get some statistics for individual states. We can group the dataset by the "State" column:

In [ ]:
grouped_by_state = laser_incidents_w_population.groupby("State")

What did we get?

In [ ]:
grouped_by_state

What is this DataFrameGroupBy object? Its use case is:

  • Splitting the data into groups based on some criteria.
  • Applying a function to each group independently.
  • Combining the results into a data structure.

Let's try a simple aggregation: the mean altitude for each state:

In [ ]:
grouped_by_state["Altitude"].mean().sort_values()

What if we were to group by year? We don't have a year column but we can just extract the year from the date and use it for groupby.

In [ ]:
grouped_by_year = laser_incidents_w_population.groupby(laser_incidents_w_population["timestamp"].dt.year)

You may have noticed how we extracted the year using the .dt accessor. We will use .dt even more below.
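
The .dt accessor exposes many other datetime components; a quick illustration (not needed for the analysis below):

In [ ]:
# On which days of the week do incidents happen?
laser_incidents_w_population["timestamp"].dt.day_name().value_counts()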

Let's calculate the mean altitude of laser incidents per year. Are the lasers getting more powerful? 🤔

In [ ]:
mean_altitude_per_year = grouped_by_year["Altitude"].mean().sort_index()
mean_altitude_per_year

We can also quickly plot the results; more on plotting in the next lessons.

In [ ]:
mean_altitude_per_year.plot(kind="bar");

Exercise: Calculate the sum of injuries per year. Use the fact that True + True = 2 ;)

We can also create a new Series (if the corresponding column does not exist in the dataframe) and group it by another Series, which in this case is a column from the dataframe. What is important is that the grouped series and the by series share the same index.

In [ ]:
# how many incidents per million inhabitants are there for each state?
incidents_per_million = (1_000_000 / laser_incidents_w_population["Population"]).groupby(laser_incidents_w_population["State"]).sum()
incidents_per_million.sort_values(ascending=False)
In [ ]:
incidents_per_million.sort_values(ascending=False).plot(kind="bar", figsize=(15, 3));

Time series operations (Optional)

We will briefly look at some more specific operations for time series data (data with a natural time axis). Typical operations for time series are resampling or rolling-window transformations such as filtering. Note that pandas is not a general digital signal processing library - there are other (more capable) tools for that purpose.

First, we set the index to "timestamp" to make our dataframe inherently time-indexed. This will make further time operations easier.

In [ ]:
incidents_w_time_index = laser_incidents.set_index("timestamp")
incidents_w_time_index

Now, let's turn the data into a time series of incidents per hour. This can be done by resampling to 1-hour periods and counting the rows in each period (here we count rows that contain at least one non-NA value).

In [ ]:
incidents_hourly = incidents_w_time_index.notna().any(axis="columns").resample("1H").count().rename("incidents per hour")
incidents_hourly

Looking at these data gives us a bit too much detail.

In [ ]:
incidents_hourly.sort_index().plot(kind="line", figsize=(15, 3));

A daily mean, i.e. the result of resampling to 1-day periods and calculating the mean, is already more digestible, though still a bit noisy.

In [ ]:
incidents_daily = incidents_hourly.resample("1D").mean()
incidents_daily.plot.line(figsize=(15, 3));

We can smooth the data with a rolling mean, e.g. with a 28-day window.

In [ ]:
incidents_daily_filtered = incidents_daily.rolling("28D").mean()
incidents_daily.plot.line(figsize=(15, 3));
incidents_daily_filtered.plot.line(figsize=(15, 3));
