
Pandas, Part 0¶

Stats 507, Fall 2021

James Henderson, PhD
September 16, 2021

Overview¶

About
I/O
The Series class
The DataFrame class
Selecting rows and columns

Pandas¶

Pandas is a Python library that facilitates:
- working with rectangular data frames,
- reading and writing data,
- aggregation by group,
- much else.
Pandas is a core library for data analysts working in Python.

Canonical Import¶

import pandas as pd
In the reading, Wes McKinney also suggests: from pandas import Series, DataFrame
I won't (usually) do this, but you may on problem sets if you prefer.

In [ ]:

import numpy as np
import pandas as pd
pd.__version__

Tidy Rectangular Data¶

Rectangular datasets are a staple of data analysis.
A dataset is "tidy" if each row is an observation and each column is a variable.
The distinction between "observation" and "variable" can depend on context - work to develop your intuition on this front.
Don't store "data" in column names.
pandas cheat sheet
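
As a rough illustration of "data in column names", the sketch below reshapes a small wide table into tidy form with melt(). The table, its column names, and the flight counts are all made up for the example.

```python
import pandas as pd

# Hypothetical "wide" table: the years are data, but they live in column names.
wide = pd.DataFrame({
    'airport': ['LGA', 'JFK'],
    '2019': [10, 20],   # made-up counts
    '2020': [5, 8],     # made-up counts
})

# melt() moves the year labels into a column: one row per (airport, year).
tidy = wide.melt(id_vars='airport', var_name='year', value_name='flights')
print(tidy)
```

In the tidy version, year is a variable you can filter or group on like any other column.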

I/O for Rectangular Data¶

The easiest way to read rectangular data, delimited and otherwise, into Python is using a pandas pd.read_*() function.
pd.read_csv() accepts a local file path or a remote URL.
Write data to file using a pandas object's .to_*() methods.
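
A minimal round-trip sketch of reading and writing, using an in-memory string in place of a real file so that it runs anywhere. The CSV text is a stand-in; a file path or URL would be passed to pd.read_csv() in the same position.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file on disk or at a URL.
csv_text = "airport,city\nLGA,East Elmhurst\nJFK,Jamaica\nEWR,Newark\n"
airports = pd.read_csv(io.StringIO(csv_text))

# With no path given, to_csv() returns the CSV text as a string.
out = airports.to_csv(index=False)

# Reading the written text back recovers the same table.
again = pd.read_csv(io.StringIO(out))
print(again.equals(airports))
```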

Series¶

A pandas Series is a fixed-length, ordered dictionary.
Series are closely related to the DataFrame class.
Series are indexed, with the index (keys) mapping to values.
One way to create a Series is to pass a dict to the pd.Series() constructor.

In [ ]:

nyc_air = pd.Series(
    {'LGA': 'East Elmhurst', 'JFK': 'Jamaica', 'EWR': 'Newark'})
nyc_air.index.name = 'airport'
nyc_air.name = 'city'
nyc_air
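
The "ordered dictionary" description can be checked directly. This sketch rebuilds the Series from the cell above and shows dict-like access by label next to position-based access.

```python
import pandas as pd

nyc_air = pd.Series(
    {'LGA': 'East Elmhurst', 'JFK': 'Jamaica', 'EWR': 'Newark'})

# Dict-like: look up a value by its index label; `in` tests the index (keys).
print(nyc_air['JFK'])
print('EWR' in nyc_air)

# Ordered: position is meaningful too, and the index keeps insertion order.
print(nyc_air.iloc[0])
print(list(nyc_air.index))
```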

DataFrames¶

The pandas DataFrame class is the primary way of representing heterogeneous, rectangular data in Python.
A DataFrame can be thought of as an ordered dictionary of Series (columns) with a shared index (row names).
Rectangular means all the columns (Series) have the same length.
We will use DataFrames heavily in this class going forward.

The DataFrame Constructor¶

A common way to construct a DataFrame directly from data is to pass a dict of equal-length lists, NumPy arrays, or Series to pd.DataFrame().
Use the columns argument to order the columns.

In [ ]:

wiki = pd.Series({
    'LGA': 'https://en.wikipedia.org/wiki/LaGuardia_Airport',
    'EWR': 'https://en.wikipedia.org/wiki/Newark_Liberty_International_Airport',
    'JFK': 'https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport'
})
df = pd.DataFrame({'city': nyc_air, 'wiki': wiki})
df
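
The cell above relies on the dict's own ordering. As a small sketch of the columns argument, the example below mixes a list, a Series, and a NumPy array and asks for a specific column order. The 'order' column holds arbitrary values chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Values may be lists, NumPy arrays, or Series, as long as the lengths match.
data = {
    'airport': ['LGA', 'JFK', 'EWR'],
    'state': pd.Series(['NY', 'NY', 'NJ']),
    'order': np.arange(3),  # arbitrary illustrative values
}

# columns= fixes the left-to-right order of the columns.
airports = pd.DataFrame(data, columns=['state', 'airport', 'order'])
print(airports)
```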

Select Columns ...¶

by name using [] with a string (caution: this returns a Series, not a DataFrame) or a list of strings,
by position using the .iloc[:, 0:2] indexer,
by name using the .loc[:, ["col1", "col2"]] indexer.
Columns with valid Python names can be accessed as attributes, e.g. df.column (but don't).

In [ ]:

city = df['city']
city2 = df['city'].copy()
df_city = df[['city']]
[(city is df['city'], df_city is df[['city']]), (type(city), type(df_city))]
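
The .iloc and .loc forms can be sketched the same way. This example builds its own small DataFrame (the 'state' column is added only for illustration) and compares what each selection returns.

```python
import pandas as pd

df = pd.DataFrame(
    {'city': ['East Elmhurst', 'Jamaica', 'Newark'],
     'state': ['NY', 'NY', 'NJ']},
    index=['LGA', 'JFK', 'EWR'])

by_position = df.iloc[:, 0:1]    # slice of positions -> DataFrame
by_name = df.loc[:, ['city']]    # list of labels -> DataFrame
one_column = df.loc[:, 'city']   # single label -> Series

print(type(by_position).__name__, type(by_name).__name__,
      type(one_column).__name__)
```

A list or slice keeps the result two-dimensional; a single label drops down to a Series.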

In [ ]:

df.loc[["JFK", "LGA"], "city"] = "JFK"  # assignment through .loc modifies df itself
df

Create/Modify Columns¶

Assign to a selected column to modify (or create) it.
To delete a column, use the del keyword or (better) the .drop(columns='col', inplace=True) method.
Style "rule" - prefer (exposed) methods when available.

In [ ]:

dat = pd.DataFrame({'a': range(5), 'b': np.linspace(0, 5, 5)})
dat['c'] = dat['d'] = dat['a'] + dat['b']
del dat['c']
dat.drop(columns='a', inplace=True)
dat
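
For comparison, here is a sketch of the same operations without inplace=True, where .drop() returns a new DataFrame and leaves the original untouched. The variable name smaller is introduced only for this example.

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({'a': range(5), 'b': np.linspace(0, 5, 5)})

# Assigning to a new name creates a column; an existing name is overwritten.
dat['c'] = dat['a'] + dat['b']
dat['a'] = dat['a'] * 10

# Without inplace=True, drop() leaves dat alone and returns a new DataFrame.
smaller = dat.drop(columns='c')
print(list(dat.columns), list(smaller.columns))
```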

Selecting rows¶

Select rows by position using .iloc[0, :] or by index label using .loc["a", :].
More on this topic after discussing the Index class in more detail.

In [ ]:

dat.iloc[0, :] = -1
print(dat.index)
dat.loc[0:5:2, :]  # label-based slice on the RangeIndex; unlike .iloc, the stop label is included
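
With a RangeIndex, positions and labels coincide, which can hide the difference between the two indexers. The sketch below uses a hypothetical string index to separate them.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

first_row = df.iloc[0, :]       # position 0 -> Series
row_a = df.loc['a', :]          # label 'a' -> the same row

# Position slices exclude the stop; label slices include it.
by_pos = df.iloc[0:2, :]        # rows 'a' and 'b'
by_label = df.loc['a':'c', :]   # rows 'a', 'b', and 'c'
print(len(by_pos), len(by_label))
```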

Filtering¶

Observations satisfying some condition can be selected through Boolean indexing or (better) using the .query() method.
The primary argument to .query() is a string containing a Boolean expression involving column names.

In [ ]:

b = dat[dat['b'] > 0]
q = dat.query('b > 0')
[b, q]
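
A slightly larger sketch with two conditions shows where the two styles differ in syntax. The cutoff variable is hypothetical; inside .query(), the @ prefix refers to a Python variable rather than a column.

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({'a': range(5), 'b': np.linspace(0, 5, 5)})
cutoff = 2  # hypothetical threshold

# Boolean indexing: combine conditions with & and parenthesize each one.
mask = dat[(dat['b'] > cutoff) & (dat['a'] < 4)]

# query(): the same filter as a string; @cutoff refers to the variable above.
q = dat.query('b > @cutoff and a < 4')
print(mask.equals(q))
```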

Takeaways¶

Pandas DataFrames are used to represent tidy, rectangular data.
Think of DataFrames as a collection of Series of the same length and sharing an index.
Pay attention to whether you are getting:
- a Series or a (new) DataFrame,
- a view (alias) or a copy.
Prefer methods when available.
I recommend keeping a Pandas cheat sheet close at hand.
More on Pandas and DataFrame methods in the next few lectures.