Jak pandy analyzují data
Úvod do Pandas¶
Pandas je knihovna pro analýzu tabulkových dat. Je to jeden ze základních nástrojů prod data science, data engineering apod.
Převzato z https://github.com/janpipek/fbmipythoncourse/blob/main/notebooks/130_pandas_intro.ipynb
# Start using pandas (default import convention)
import pandas as pd
import numpy as np
# Let pandas speak for themselves
print(pd.__doc__)
Visit the official website for a nicely written documentation: https://pandas.pydata.org
# Current version (should be 1.5+ in 2023)
print(pd.__version__)
Basic objects¶
The pandas library has a vast API with many useful functions. However, most of this revolves around two important classes:
 Series
 DataFrame
In this introduction, we will focus on them  what each of them does and how they relate to each other and numpy objects.
Series¶
Series is a onedimensional data structure, central to pandas.
For a complete API, visit https://pandas.pydata.org/pandasdocs/stable/reference/api/pandas.Series.html
# My first series
series = pd.Series([1, 2, 3])
series
This looks a bit like a Numpy array, does it not?
Actually, in most cases the Series wraps a Numpy array...
series.values # The result is a Numpy array
But there is something more. Alongside the values, we see that each item (or "row") has a certain label. The collection of labels is called index.
series.index
This index (see below) can be used, as its name suggests, to index items of the series.
# Return an element from the series
series.loc[1]
# Or
series[1]
# Construction from a dictionary
series_ab = pd.Series({"a": 2, "b": 4})
series_ab
Exercise: Create a series with 5 elements.
result = ...
DataFrame¶
A DataFrame is pandas' answer to Excel sheets  it is a collection of named columns (or, in our case, a collection of Series). Quite often, we directly read data frames from an external source, but it is possible to create them from:
 a dict of Series, numpy arrays or other arraylike objects
 from an iterable of rows (where rows are Series, lists, dictionaries, ...)
# List of lists (no column names)
table = [
['a', 1],
['b', 3],
['c', 5]
]
table_df = pd.DataFrame(table)
table_df
# Dict of Series (with column names)
df = pd.DataFrame({
'number': pd.Series([1, 2, 3, 4], dtype=np.int8),
'letter': pd.Series(['a', 'b', 'c', 'd'])
})
df
# Numpy array (10x2), specify column names
data = np.random.normal(0, 1, (10, 2))
df = pd.DataFrame(data, columns=['a', 'b'])
df
# A DataFrame also has an index.
df.index
# ...that is shared by all columns
df.index is df["a"].index
# The columns also form an index.
df.columns
Exercise: Create DataFrame
whose x
column is $0, \frac{1}{4}\pi, \frac{1}{2}\pi, .. 2\pi $, y
column is cos(x)
and index are fractions
0, 1/4, 1/2 ... 2
import fractions
index = [fractions.Fraction(n, ___) for n in range(___)]
x = np.___([___ for ___ in ___])
y = ___
df = pd.DataFrame(___, index = ___)
# display
df
D(ata) types¶
Pandas builds upon the numpy data types (mentioned earlier) and adds a couple of more.
typed_df = pd.DataFrame({
"bool": np.arange(5) % 2 == 0,
"int": range(5),
"int[nan]": pd.Series([np.nan, 0, 1, 2, 3], dtype="Int64"),
"float": np.arange(5) * 3.14,
"complex": np.array([1 + 2j, 2 + 3j, 3 + 4j, 4 + 5j, 5 + 6j]),
"object": [None, 1, "2", [3, 4], 5 + 6j],
"string?": ["a", "b", "c", "d", "e"],
"string!": pd.Series(["a", "b", "c", "d", "e"], dtype="string"),
"datetime": pd.date_range('20180101', periods=5, freq='3M'),
"timedelta": pd.timedelta_range(0, freq="1s", periods=5),
"category": pd.Series(["animal", "plant", "animal", "animal", "plant"], dtype="category"),
"period": pd.period_range('20180101', periods=5, freq='M'),
})
typed_df
typed_df.dtypes
We will see some of the types practically used in further analysis.
Indices & indexing¶
abc_series = pd.Series(range(3), index=["a", "b", "c"])
abc_series
abc_series.index
abc_series.index = ["c", "d", "e"] # Changes the labels inplace!
abc_series.index.name = "letter"
abc_series
table = [
['a', 1],
['b', 3],
['c', 5]
]
table_df = pd.DataFrame(
table,
index=["first", "second", "third"],
columns=["alpha", "beta"]
)
table_df
alpha = table_df["alpha"] # Simple [] indexing in DataFrame returns Series
alpha
alpha["second"] # Simple [] indexing in Series returns scalar values.
alpha.second # This also works
but careful!
alpha.first
A slice with a ["list", "of", "columns"]
yields a DataFrame
with those columns.
For example:
table_df[["beta", "alpha"]]
[["column_name"]]
returs a DataFrame
as well, not Series
:
table_df[["alpha"]]
There are two ways how to properly index rows & cells in the DataFrame:

loc
for labelbased indexing 
iloc
for orderbased indexing (it does not use the index at all)
Note the square brackets. The mentioned attributes actually are not methods but special "indexer" objects. They accept one or two arguments specifying the position along one or both axes.
loc¶
first = table_df.loc["first"]
first
table_df.loc["first", "beta"]
table_df.loc["first":"second", "beta"] # Use ranges (inclusive)
iloc¶
table_df.iloc[1]
table_df.iloc[0:4:2] # Select every second row
table_df.at["first", "beta"]
type(table_df.at)
Modifying DataFrames¶
Adding a new column is like adding a key/value pair to a dict. Note that this operation, unlike most others, does modify the DataFrame.
from datetime import datetime
table_df["now"] = datetime.now()
table_df
Nondestructive version that returns a new DataFrame, uses the assign
method:
table_df.assign(delta = [True, False, True])
# However, the original DataFrame is not changed
table_df
Deleting a column is very easy too.
del table_df["now"]
table_df
The drop method works with both rows and columns (creating a new data frame), returning a new object.
table_df.drop("beta", axis=1)
table_df.drop("second", axis=0)
Exercise: Use a combination of reset_index
, drop
and set_index
to transform table_df
into pd.DataFrame({'index': table_df.index}, index=table_df["alpha"])
results = table_df.___.___.___
# display
result
Let's get some real data!
I/O in pandas¶
Pandas can read (and write to) a huge variety of file formats. More details can be found in the official documentation: http://pandas.pydata.org/pandasdocs/stable/user_guide/io.html
Most of the functions for reading data are named pandas.read_XXX
, where XXX is the format used. We will look at three commonly used ones.
# List functions for input in pandas.
print("\n".join(method for method in dir(pd) if method.startswith("read_")))
Read CSV¶
Nowadays, a lot of data comes in the textual Commaseparated values format (CSV). Although not properly standardized, it is the defacto standard for files that are not huge and are meant to be read by human eyes too.
Let's read the population of U.S. states that we will need later:
territories = pd.read_csv("data/us_state_population.csv")
territories.head(9)
The automatic data type parsing converts columns to appropriate types:
territories.dtypes
Sometimes the CSV input does not work out of the box. Although pandas automatically understands and reads zipped files,
it usually does not automatically infer the file format and its variations  for details, see the read_csv
documentation here:
https://pandas.pydata.org/pandasdocs/stable/reference/api/pandas.read_csv.html
pd.read_csv('data/iris.tsv.gz')
...in this case, the CSV file does not use commas to separate values. Therefore, we need to specify an extra argument:
pd.read_csv("data/iris.tsv.gz", sep='\t')
See the difference?
Read Excel¶
Let's read the list of U.S. incidents when lasers interfered with airplanes.
pd.read_excel("data/laser_incidents_2019.xlsx")
Note: This reads just the first sheet from the file. If you want to extract more sheets, you will need to use the pandas.'ExcelFile
class. See the relevant part of the documentation.
Read HTML (Optional)¶
Pandas is able to scrape data from tables embedded in web pages using the read_html
function.
This might or might not bring you good results and probably you will have to tweak your
data frame manually. But it is a good starting point  much better than being forced to parse
the HTML ourselves!
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_laser_types")
type(tables), len(tables)
tables[1]
tables[2]
Write CSV¶
Pandas is able to write to many various formats but the usage is similar.
tables[1].to_csv("gas_lasers.csv", index=False)
Data analysis (very basics)¶
Let's extend the data of laser incidents to a broader time range and read the data from a summary CSV file:
laser_incidents_raw = pd.read_csv("data/laser_incidents_20152020.csv")
Let's see what we have here...
laser_incidents_raw.head()
laser_incidents_raw.tail()
For an unknown, potentially unevenly distributed dataset, looking at the beginning / end is typically not the best idea. We'd rather sample randomly:
# Show a few examples
laser_incidents_raw.sample(10)
laser_incidents_raw.dtypes
The topic of data cleaning and preprocessing is very broad. We will limit ourselves to dropping unused columns and converting one to a proper type.
# The first three are not needed
laser_incidents = laser_incidents_raw.drop(columns=laser_incidents_raw.columns[:3])
# We convert the timestamp
laser_incidents = laser_incidents.assign(
timestamp = pd.to_datetime(laser_incidents["timestamp"])
)
laser_incidents
laser_incidents.dtypes
Categorical dtype (Optional)¶
To analyze Laser Color, we can look at its typical values.
laser_incidents["Laser Color"].describe()
Not too many different values.
laser_incidents["Laser Color"].unique()
laser_incidents["Laser Color"].value_counts(normalize=True)
This column is a very good candidate to turn into a pandasspecial, Categorical data type. (See https://pandas.pydata.org/pandasdocs/stable/user_guide/categorical.html)
laser_incidents["Laser Color"].memory_usage(deep=True) # ~60 bytes per item
color_category = laser_incidents["Laser Color"].astype("category")
color_category.sample(10)
color_category.memory_usage(deep=True) # ~12 bytes per item
Exercise: Are there any other columns in the dataset that you would suggest for conversion to categorical?
Integer vs. float¶
Pandas is generally quite good at guessing (inferring) number types.
You may wonder why Altitude
is float and not int though.
This is a consequence of not having an integer nan in numpy. There's been many discussions about this.
laser_incidents["Altitude"]
laser_incidents["Altitude"].astype(int)
Quite recently, Pandas introduced nullable types for working with missing data, for example nullable integer.
laser_incidents["Altitude"].astype("Int64")
Filtering¶
Indexing in pandas Series / DataFrames ([]
) support also boolean (masked) arrays. These arrays can be obtained by applying boolean operations on them.
You can also use standard comparison operators like <
, <=
, ==
, >=
, >
, !=
.
It is possible to perform logical operations with boolean series too. You need to use 
, &
, ^
operators though, not and
, or
, not
keywords.
As an example, find all California incidents:
is_california = laser_incidents.State == "California"
is_california.sample(10)
Now we can directly apply the boolean mask. (Note: This is no magic. You can construct the mask yourself)
laser_incidents[is_california].sample(10)
Or maybe we should include the whole West coast?
# isin takes an array of possible values
west_coast = laser_incidents[laser_incidents.State.isin(["California", "Oregon", "Washington"])]
west_coast.sample(10)
Or lowaltitude incidents?
laser_incidents[laser_incidents.Altitude < 300]
Visualization intermezzo¶
Without much further ado, let's create our first plot.
# Most frequent states
laser_incidents["State"].value_counts()[:20]
laser_incidents["State"].value_counts()[:20].plot(kind="bar");
Sorting¶
# Display 5 incidents with the highest altitude
laser_incidents.sort_values("Altitude", ascending=False).head(5)
# Alternative
laser_incidents.nlargest(5, "Altitude")
Exercise: Find the last 3 incidents with blue laser.
Arithmetics and string manipulation¶
Standard arithmetic operators work on numerical columns too. And so do mathematical functions. Note all such operations are performed in a vectorlike fashion.
altitude_meters = laser_incidents["Altitude"] * .3048
altitude_meters.sample(10)
You may mix columns and scalars, the string arithmetics also works as expected.
laser_incidents["City"] + ", " + laser_incidents["State"]
Summary statistics¶
The describe
method shows summary statistics for all the columns:
laser_incidents.describe()
laser_incidents.describe(include="all", datetime_is_numeric=True)
laser_incidents["Altitude"].mean()
laser_incidents["Altitude"].std()
laser_incidents["Altitude"].max()
Basic string operations (Optional)¶
These are typically accessed using the .str
"accessor" of the Series like this:
 series.str.lower
 series.str.split
 series.str.startswith
 series.str.contains
 ...
See more in the documentation.
laser_incidents[laser_incidents["City"].str.contains("City")]["City"].unique()
laser_incidents[laser_incidents["City"].str.contains("City")]["City"].str.strip().unique()
Merging data¶
It is a common situation where we have two or more datasets with different columns that we need to bring together. This operation is called merging and the Pandas apparatus is to a great detail described in the documentation.
In our case, we would like to attach the state populations to the dataset.
population = pd.read_csv("data/us_state_population.csv")
population
We will of course use the state name as the merge key. Before actually doing the merge, we can explore a bit whether all state names from the laser incidents dataset are present in our population table.
unknown_states = laser_incidents.loc[~laser_incidents["State"].isin(population["Territory"]), "State"]
print(f"There are {unknown_states.count()} rows with unknown states.")
print(f"Unknown state values are: \n{list(unknown_states.unique())}.")
We could certainly clean the data by correcting some of the typos. Since the number of the rows with unknown states is not large (compared to the length of the whole dataset), we will deliberetly not fix the state names. Instead, we will remove those rows from the merged dataset by using the inner type of merge. All the merge types: left, inner, outer and right are well explained by the schema below:
We can use the merge
function to add the "Population"
values.
laser_incidents_w_population = pd.merge(
laser_incidents, population, left_on="State", right_on="Territory", how="inner"
)
laser_incidents_w_population
laser_incidents_w_population.describe(include="all", datetime_is_numeric=True)
Grouping & aggregation¶
A common pattern in data analysis is grouping (or binning) data based on some property and getting some aggredate statistics.
Example: Group this workshop participants by nationality a get the cardinality (the size) of each group.
Possibly the simplest group and aggregation is the value_counts
method, which groups by the respective column value
and yields the number (or normalized frequency) of each unique value in the data.
laser_incidents_w_population["State"].value_counts(normalize=False)
This is just a primitive grouping and aggregation operation, we will look into more advanced patterns.
Let us say we would like to get some numbers (statistics) for individual states.
We can groupby
the dataset by the "State"
column:
grouped_by_state = laser_incidents_w_population.groupby("State")
What did we get?
grouped_by_state
What is this DataFrameGroupBy
object? Its use case is:
 Splitting the data into groups based on some criteria.
 Applying a function to each group independently.
 Combining the results into a data structure.
Let's try a simple aggregate: the mean of altitude for each state:
grouped_by_state["Altitude"].mean().sort_values()
What if we were to group by year? We don't have a year column but we can just extract the year from the date and use it for groupby
.
grouped_by_year = laser_incidents_w_population.groupby(laser_incidents_w_population["timestamp"].dt.year)
You may have noticed how we extracted the year using the .dt
accessor.
We will use .dt
even more below.
Let's calculate the mean altitude of laser incidents per year. Are the lasers getting more powerful? 🤔
mean_altitude_per_year = grouped_by_year["Altitude"].mean().sort_index()
mean_altitude_per_year
We can also quickly plot the results, more on plotting in the next lessons.
mean_altitude_per_year.plot(kind="bar");
Exercise: Calculate the sum
of injuries per year. Use the fact that True + True = 2
;)
We can also create a new Series
if the corresponding column does not exist in the dataframe and group it by another Series
(which in this case is a column from the dataframe). Important is that the grouped and the by series have the same index.
# how many incidents per million inhabitants are there for each state?
incidents_per_million = (1_000_000 / laser_incidents_w_population["Population"]).groupby(laser_incidents_w_population["State"]).sum()
incidents_per_million.sort_values(ascending=False)
incidents_per_million.sort_values(ascending=False).plot(kind="bar", figsize=(15, 3));
Time series operations (Optional)¶
We will briefly look at some more specific operation for time series data (data with a natural time axis). Typical operations for time series are resampling or rolling window transformations such as filtering. Note that Pandas is not a general digital signal processing library  there are other (more capable) tools for this purpose.
First, we set the index to "timestamp"
to make our dataframe inherently time indexed. This will make doing further time operations easier.
incidents_w_time_index = laser_incidents.set_index("timestamp")
incidents_w_time_index
First, turn the data into a time series of incidents per hour. This can be done by resampling to 1 hour and using
count
(basically on any column or on any column that has any nonNA value) to count the number of incidents.
incidents_hourly = incidents_w_time_index.notna().any(axis="columns").resample("1H").count().rename("incidents per hour")
incidents_hourly
Looking at those data gives us a bit too detailed information.
incidents_hourly.sort_index().plot(kind="line", figsize=(15, 3));
A daily mean, the result of resampling to 1 day periods and calculating the mean, is already something more digestible. Though still a bit noisy.
incidents_daily = incidents_hourly.resample("1D").mean()
incidents_daily.plot.line(figsize=(15, 3));
We can look at filtered data by rolling mean with, e.g., 28 days window size.
incidents_daily_filtered = incidents_daily.rolling("28D").mean()
incidents_daily.plot.line(figsize=(15, 3));
incidents_daily_filtered.plot.line(figsize=(15, 3));
Komentáře
Comments powered by Disqus