Pandas, Part 0¶

Stats 507, Fall 2021

James Henderson, PhD
September 16, 2021

Overview¶

  • About
  • I/O
  • The Series class
  • The DataFrame class
  • Selecting rows and columns

Pandas¶

  • Pandas is a Python library that facilitates:
    • working with rectangular data frames,
    • reading and writing data,
    • aggregation by group,
    • much else.
  • Pandas is a core library for data analysts working in Python.

Canonical Import¶

  • import pandas as pd
  • In the reading, Wes McKinney suggests: from pandas import Series, DataFrame
  • I won't (usually) do this, but you may on problem sets if you prefer.
In [ ]:
import numpy as np
import pandas as pd
pd.__version__

Tidy Rectangular Data¶

  • Rectangular datasets are a staple of data analysis.
  • A dataset is "tidy" if each row is an observation and each column is a variable.
  • The distinction between "observation" and "variable" can depend on context - work to develop your intuition on this front.
  • Don't store "data" in column names.
  • pandas cheat sheet
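
As a rough illustration of "don't store data in column names", the sketch below reshapes a small wide table into tidy form with .melt(). The table, its flight counts, and the names wide, tidy, year, and flights are all made up for this example.

```python
import pandas as pd

# Hypothetical "wide" table: the years are data, but they live in the
# column names. The counts are invented for illustration.
wide = pd.DataFrame({
    'airport': ['LGA', 'JFK'],
    '2019': [10, 20],
    '2020': [5, 8],
})

# .melt() moves the year labels out of the column names and into a
# 'year' column, giving one row per (airport, year) observation.
tidy = wide.melt(id_vars='airport', var_name='year', value_name='flights')
print(tidy)
```

Whether a year counts as a variable or as part of an observation depends on the analysis; here it is treated as a variable.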

I/O for Rectangular Data¶

  • The easiest way to read rectangular data, delimited and otherwise, into Python is using a pandas pd.read_*() function.
  • pd.read_csv() accepts a local file path or a remote URL.
  • Write data to file using a pandas object's .to_*() methods.
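
A minimal sketch of a read/write round trip. To keep it self-contained it uses an in-memory buffer in place of a real file or URL; the CSV text and the names csv_text, airports, and buffer are assumptions for this example.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk or at a URL.
csv_text = "airport,city\nLGA,East Elmhurst\nJFK,Jamaica\nEWR,Newark\n"

# pd.read_csv() also accepts file-like objects such as StringIO.
airports = pd.read_csv(io.StringIO(csv_text))

# .to_csv() writes to a path or a buffer; index=False omits the row labels.
buffer = io.StringIO()
airports.to_csv(buffer, index=False)
round_trip = buffer.getvalue()
print(round_trip)
```

With a real file, the same calls would take a path string instead of the buffer.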

Series¶

  • A pandas Series is a fixed-length, ordered dictionary.
  • Series are closely related to the DataFrame class.
  • Series are indexed, with the index (keys) mapping to values.
  • One way to create a Series is to pass a dict to the pd.Series() constructor.
In [ ]:
nyc_air = pd.Series(
    {'LGA': 'East Elmhurst', 'JFK': 'Jamaica', 'EWR': 'Newark'})
nyc_air.index.name = 'airport'
nyc_air.name = 'city'
nyc_air
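
The dictionary analogy can be sketched as follows, reusing the airport Series from the cell above. The variable names on the left-hand side are assumptions for this example.

```python
import pandas as pd

nyc_air = pd.Series(
    {'LGA': 'East Elmhurst', 'JFK': 'Jamaica', 'EWR': 'Newark'})

jfk_city = nyc_air['JFK']        # look up a value by its index label (key)
has_ewr = 'EWR' in nyc_air       # membership tests the index, not the values
labels = list(nyc_air.index)     # the keys keep their insertion order
first_two = nyc_air.iloc[:2]     # unlike a dict, a Series can be sliced by position
print(jfk_city, has_ewr, labels)
```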

DataFrames¶

  • The pandas DataFrame class is the primary way of representing heterogeneous, rectangular data in Python.
  • A DataFrame can be thought of as an ordered dictionary of Series (columns) with a shared index (row names).
  • Rectangular means all the columns (Series) have the same length.
  • We will use DataFrames heavily in this class going forward.

The DataFrame Constructor¶

  • A common way to construct a DataFrame directly from data is to pass a dict of equal-length lists, NumPy arrays, or Series to pd.DataFrame().
  • Use the columns argument to order the columns.
In [ ]:
wiki = pd.Series({
    'LGA': 'https://en.wikipedia.org/wiki/LaGuardia_Airport',
    'EWR': 'https://en.wikipedia.org/wiki/Newark_Liberty_International_Airport',
    'JFK': 'https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport'
})
df = pd.DataFrame({'city': nyc_air, 'wiki': wiki})
df
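
A short sketch of the columns argument, using a made-up dict that mixes a list, a NumPy array, and a Series. The names data, df_default, and df_ordered are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three equal-length columns of different kinds.
data = {
    'x': [1, 2, 3],
    'y': np.array([0.5, 1.5, 2.5]),
    'z': pd.Series(['a', 'b', 'c']),
}

# Without columns=, the columns follow the dict's key order.
df_default = pd.DataFrame(data)

# With columns=, only the named keys are kept, in the order given.
df_ordered = pd.DataFrame(data, columns=['z', 'x'])
print(df_ordered)
```

Note that with a dict input, keys left out of columns are dropped, so the argument both selects and orders.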

Select Columns ...¶

  • by name using [] with a string (caution) or list of strings,
  • by position using the .iloc[:, 0:2] indexer,
  • by name using the .loc[:, ["col1", "col2"]] indexer.
  • Columns with valid Python names can be accessed as attributes, e.g. df.column (but don't).
In [ ]:
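city = df['city']          # a single string returns a Series
city2 = df['city'].copy()  # an explicit, independent copy
df_city = df[['city']]     # a list of strings returns a (new) DataFrame
# Compare object identity and types for the two selection styles.
[(city is df['city'], df_city is df[['city']]), (type(city), type(df_city))]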
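
The selection styles listed above, side by side. This is a sketch on a small stand-in table; the state and runways columns and their values are illustrative additions, not part of the df built earlier.

```python
import pandas as pd

# Stand-in table with illustrative values.
df = pd.DataFrame(
    {'city': ['East Elmhurst', 'Jamaica', 'Newark'],
     'state': ['NY', 'NY', 'NJ'],
     'runways': [2, 4, 3]},
    index=['LGA', 'JFK', 'EWR'])

s = df['city']                            # string -> Series
one_col = df[['city']]                    # list of strings -> DataFrame
by_pos = df.iloc[:, 0:2]                  # first two columns, by position
by_name = df.loc[:, ['city', 'runways']]  # columns by name
print(type(s).__name__, type(one_col).__name__)
```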

In [ ]:
df.loc[["JFK", "LGA"], "city"] = "JFK"  # assigning through .loc modifies df in place
df

Create/Modify Columns¶

  • Assign to a selected column to modify (or create) it.
  • To delete a column, use the del keyword or (better) the .drop(columns='col', inplace=True) method.
  • Style "rule" - prefer (exposed) methods when available.
In [ ]:
dat = pd.DataFrame({'a': range(5), 'b': np.linspace(0, 5, 5)})
dat['c'] = dat['d'] = dat['a'] + dat['b']
del dat['c']
dat.drop(columns='a', inplace=True)
dat
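
One possible method-based variant of the cell above, sketched with .assign() and .drop() without inplace=True. Both return new DataFrames and leave the original untouched. The names dat2 and dat3 are assumptions for this example.

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({'a': range(5), 'b': np.linspace(0, 5, 5)})

# .assign() returns a new DataFrame with the added column; dat is unchanged.
dat2 = dat.assign(c=dat['a'] + dat['b'])

# .drop() without inplace=True also returns a new DataFrame.
dat3 = dat2.drop(columns=['a'])
print(dat3)
```

Returning new objects makes it easier to chain steps and to tell whether you hold a modified original or a fresh copy.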

Selecting rows¶

  • Select rows by position using .iloc[0, :] or by index label using .loc["a", :].
  • More on this topic after discussing the Index class in more detail.
In [ ]:
dat.iloc[0, :] = -1
print(dat.index)
# .loc slices by label (endpoints inclusive); integer labels work here
# because the index is a RangeIndex
dat.loc[0:5:2, :]
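
A sketch contrasting position-based and label-based row selection on a small made-up table with string labels, where the two are easier to tell apart than with a RangeIndex.

```python
import pandas as pd

# Hypothetical table with string row labels.
df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

first = df.iloc[0, :]            # row at position 0, returned as a Series
row_a = df.loc['a', :]           # the same row, selected by label
pos_slice = df.iloc[0:2, :]      # positional slice excludes the end: a, b
lab_slice = df.loc['a':'c', :]   # label slice includes the end: a, b, c
print(lab_slice)
```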

Filtering¶

  • Observations satisfying some condition can be selected through Boolean indexing or (better) using the .query() method.
  • The primary argument to .query() is a string containing a Boolean expression involving column names.
In [ ]:
b = dat[dat['b'] > 0]
q = dat.query('b > 0')
[b, q]
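
A slightly larger sketch of the same comparison with a compound condition. The values and the cutoff variable are assumptions for this example; inside .query(), a leading @ refers to a Python variable in the calling scope.

```python
import pandas as pd

# Hypothetical values for illustration.
dat = pd.DataFrame({
    'b': [-1.0, 1.25, 2.5, 3.75, 5.0],
    'd': [-1.0, 2.25, 4.5, 6.75, 9.0],
})
cutoff = 2.0

# Boolean indexing: each condition needs parentheses, combined with &.
by_mask = dat[(dat['b'] > cutoff) & (dat['d'] < 9)]

# .query(): the same filter written as a string expression.
by_query = dat.query('b > @cutoff and d < 9')
print(by_query)
```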

Takeaways¶

  • Pandas DataFrames are used to represent tidy, rectangular data.
  • Think of DataFrames as a collection of Series of the same length and sharing an index.
  • Pay attention to whether you are getting:
    • a Series or a (new) DataFrame,
    • a view (alias) or a copy.
  • Prefer methods when available.
  • I recommend keeping a Pandas cheat sheet close at hand.
  • More on Pandas and DataFrame methods in the next few lectures.