Intro to pandas

Intro to pandas#

Review and Outline#

Great Work! We have made it this far…we know a few basics in python, e.g., built-in data types and structures (lists, tuples, strings, dictionaries), key operations and one of the most important packages numpy for numerical computing.

Where are we going know…we will get into the key data analysis package: pandas.

What is pandas??? “pan”nel “da”ta “s”tructures. Powerfull, intuitive, data analysis tool. This package convinced me to learn and start to use python as a research tool Developed at AQR (a quantative hedgefund) by Wes Mckinney. They made it open source and quickly expanded developed and became widely used.

This notebook largely follows the discussion in the Book.

Python#

First we need to import the pandas package…very simmilar to when we imported our functions, but this is a MUCH larger. Further more, this is what makes pandas a higher-level add-on to python. That is at a lower level the objects, methods, functions… are already created for us, then when we import pandas they are ready to go.

Then we will learn the key data structures in Pandas and their attributes and methods. Moreover, we will learn how to select data in DataFrame and then do computations afterwards.

Buzzwords. DataFrame, Series

Basics#

First let’s import and rename them as follows like we learned in intro to numpy notebook.

import pandas as pd
import numpy as np

Let’s first get to know the two most important data structures in Pandas.

Series#

The Series is the primary building block of pandas. A Series represents a one-dimensional labeled indexed array based on the NumPy ndarray. It can be created and initialized by passing either a scalar value, a NumPy ndarray, a Python list, or a Python Dict.

# create a series from a dictionary
gdp = {"GDP": [5974.7, 10031.0, 14681.1]} 
# what kind of data structure is this
type(gdp)

dict

It should tell us that it is a dictionary, with keys and values (which are lists). How do we get those?

# 'name' parameter specifies the column name of the series object
pd.Series(gdp, name = 'GDP')

GDP    [5974.7, 10031.0, 14681.1]
Name: GDP, dtype: object

# create a series from a numpy 1 dimension array
cpi = np.array([127.5, 169.3, 217.488])
pd.Series(cpi, name = 'CPI')

# create series from a list
year = [1990, 2000, 2010]
country = ["US", "US", "US"]
pd.Series(country, name = 'country', index = year)

  US
  US
  US
Name: country, dtype: object

From Series to DataFrame#

A DataFrame is essentially just a table of data, or a dictionary of Series while a Series can be thought of as a one columned DataFrame.

Let’s create a DataFrame from the series previously created by pandas concat methods. It concatenates pandas objects, e.g., Series, DataFrame along a particular axis.

I know it might seem a little bit overwhelming right now, e.g., new methods and not understanding what does axis mean here? But let’s see the examples by setting different axis parameters at first. We will discuss this over and over later.

#concatenate series along axis = 1
pd.concat((pd.Series(country), pd.Series(year)), axis = 1)

	0	1
0	US	1990
1	US	2000
2	US	2010

#concatenate series along axis = 0
pd.concat((pd.Series(country), pd.Series(year)), axis = 0)

    US
    US
    US
  1990
  2000
  2010
dtype: object

Besides, we can also convert Dataframe to Series as well via just selecting one column of a DataFrame.

DataFrame#

Now let’s create a DataFrame object from a dictionary.

data = {"GDP": [5974.7, 10031.0, 14681.1],
                   "CPI": [127.5, 169.3, 217.488],
                   "Year": [1990, 2000, 2010],
                   "Country": ["US", "US", "US"]}

Now we are going to convert the type of data to a “DataFrame” this is the key oject within pandas. (If familiar with R this is simmilar to their dataframe)

# create a DataFrame
df = pd.DataFrame(data)

# check the type
type(df)

pandas.core.frame.DataFrame

# examine the frame
df

	GDP	CPI	Year	Country
0	5974.7	127.500	1990	US
1	10031.0	169.300	2000	US
2	14681.1	217.488	2010	US

#general info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   GDP      3 non-null      float64
 1   CPI      3 non-null      float64
 2   Year     3 non-null      int64  
 3   Country  3 non-null      object 
dtypes: float64(2), int64(1), object(1)
memory usage: 228.0+ bytes

Columns in `DataFrame`#

df

	GDP	CPI	Year	Country
0	5974.7	127.500	1990	US
1	10031.0	169.300	2000	US
2	14681.1	217.488	2010	US

This lays out the data in a very intuitive way, columns will be labeled as they are in the excel file.

Rows in `DataFrame`#

Rows are labeled with unique identifiers as well, called the “index.” We’ll learn how to retrieve the index of a DataFrame by df.index. Also, we’ll learn how to reset it by reset_index.

Amazing. You may be thinking… so what, well there is a reason why excel is popular, it is natural for people do think about data in a table like formate, a dataframe is always going to present this in the intuitive, natural way. This is also important because it helps us visualize and then implement calculations and operations on the table. Where as this could be very hard to do in the data variable above.

Therefore, we call the DataFrame as symmetric and indexed versions of spreadsheets.

`DataFrame` Methods and Attributes Intro#

In python remember the DataFrame is an object and with that object comes methods and attributes (we have seen less attributes, but lots of methods)

#shape
df.shape

(3, 4)

#columns
df.columns

Index(['GDP', 'CPI', 'Year', 'Country'], dtype='object')

#index
df.index

RangeIndex(start=0, stop=3, step=1)

#data types
df.dtypes

GDP        float64
CPI        float64
Year         int64
Country     object
dtype: object

So this is interesting, for the numerical values it says that they are flaoting point vlaues, that is great. For the names, strings, it says that they are objects NOT strings? Pandas does this (i) if all the data in a column is a number, then it is recognized as a number (ii) if not, then it is just going to be an object.

Time to practice#

Exercises. Consider the following data below:

pwt_data = {'countrycode': ['CHN', 'CHN', 'CHN', 'FRA', 'FRA', 'FRA'],
        'pop': [1124.8, 1246.8, 1318.2, 58.2, 60.8, 64.7],
        'rgdpe': [2.611, 4.951, 11.106, 1.294, 1.753, 2.032],
        'year': [1990, 2000, 2010, 1990, 2000, 2010]}

pwt = pd.DataFrame(pwt_data)

a) What are the dimensions of pwt?

pwt.shape

(6, 4)

b) What dtypes are the variables? What do they mean?

pwt.dtypes

countrycode     object
pop            float64
rgdpe          float64
year             int64
dtype: object

#countrycode is string, population and rgdpe are decimal valued numbers and year is an integer number

c) What does pwt.columns.tolist() do? How does it compare to list(pwt)?

pwt.columns.tolist(), list(pwt)  #same thing!

(['countrycode', 'pop', 'rgdpe', 'year'],
 ['countrycode', 'pop', 'rgdpe', 'year'])

d) Challenging. What is list(pwt)[0]? Why? What type is it?

list(pwt)[0] #first column

'countrycode'

e) Challenging. What would you say is the natural index? How would you set it?

#looks like the year would be

Understanding DataFrame#

After learning the basics, we will now dive in grabbing columns, rows(slicing) and other operations of a DataFrame.

Play with Columns#

Grab one column#

The most recommend way to use is…

df.columns

Index(['GDP', 'CPI', 'Year', 'Country'], dtype='object')

#select a column
df['CPI']

  127.500
  169.300
  217.488
Name: CPI, dtype: float64

In addition, there are two alternative ways used… The reasons why it is not suggested:

For .: you might get confused CPI as a methods since we only normally use . to call the methods, e.g., .
For iloc: what if you change the sequence of columns, everything will be changed. Hard to debug.

#dot notation
df.CPI

  127.500
  169.300
  217.488
Name: CPI, dtype: float64

# .iloc -- numeric index row, column
df.iloc[:, 1]

  127.500
  169.300
  217.488
Name: CPI, dtype: float64

Since DataFrame is like an excel, think about the first input in the above bracket as index for rows while second for columns. Here, we want to select all the rows in the first column.

Remember python index starts with 0.

Regarding :, it’s similar to we have learned in Python Fundmentals 2 Slicing section.

Grab several columns#

# select multiple columns
df[['GDP', 'CPI']]

	GDP	CPI
0	5974.7	127.500
1	10031.0	169.300
2	14681.1	217.488

# We can also do this with iloc.
df.iloc[:, :2]

	GDP	CPI
0	5974.7	127.500
1	10031.0	169.300
2	14681.1	217.488

But why only two columns selected?

Note: this is different for what we have learned in numpy array.

How about grabbing columns like CPI, GDP, can we do this with iloc at once? No.

Now we might notice that we can almost use iloc and other methods to grab rows interchangeably. However, we might consider by specifying the column name, it is easier for us to debug in the future. Think about, once you change the sequence of columns, everything won’t work.

Reset the column name#

#uppercase the names
[i.upper() for i in df.columns.tolist()]

['GDP', 'CPI', 'YEAR', 'COUNTRY']

Another way to rename specific instances… not that if we did not have the df in front, nothing would fundementally change, it would just copy and print out the new one, but the saved df is the same…

# rename method
df.rename({"gdp": 'GDP', 'cpi': "CPI", "Year": "YEAR", "Country": "COUNTRY"}, axis = 1)

	GDP	CPI	YEAR	COUNTRY
0	5974.7	127.500	1990	US
1	10031.0	169.300	2000	US
2	14681.1	217.488	2010	US

Play with Rows#

The most brute force way to grab rows is via just iloc.

df.iloc[0]

GDP        5974.7
CPI         127.5
Year         1990
Country        US
Name: 0, dtype: object

For example, if we want to grab the second row. Then it is the same for list or numpy data structure indexing system.

# second row
df.iloc[1]

GDP        10031.0
CPI          169.3
Year          2000
Country         US
Name: 1, dtype: object

It can also be used to select several rows

# slice of rows
df.iloc[:2]

	GDP	CPI	Year	Country
0	5974.7	127.5	1990	US
1	10031.0	169.3	2000	US

But the above way is hard to do conditional selection. For example, if we want to select the year at 2000.

# conditional with .loc
df.loc[df['YEAR'] == 2000]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'YEAR'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[35], line 2
      1 # conditional with .loc
----> 2 df.loc[df['YEAR'] == 2000]

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'YEAR'

# This will also work
df[df['YEAR'] == 2000]

	GDP	CPI	YEAR	COUNTRY
1	10031.0	169.3	2000	US

Set Row Index#

In some conditions, maybe the above way to select is easier. However, we can transform the column where the selection criteria is as the new index and then do the job.

Below is an example of setting the index. This is a feature that I’m slowing starting to embrace. The idea essentially, is that by setting the index, then we can use the .loc or location finding command to pull out only specific entries on a row. For example, if we only want year 2000, then we

set the index to be the year
then use .loc to pull out that particular year.

How you would you do this for the country. Same idea, not that we set the index to countrycode, then select the county code that we want.

Two more points about this:

One is that we can multi-index, that is have layers of indexes…why would we want to do this. This relates to the question of the “natural index”…

I would argue the natural index would be on the level of an observation. What does that mean? Think about the data set above, what is an observation look like and what are the variables associated with it. Here an observation is a country time pair. Note an observation is two dimensions a country at a particular time. Then the variables associated with each observation are population, gdp. Back to the natural index, given this argument above, I would actually say that it is a multi index with countries and years.

MTWN: This discussion relates to this concept of “tidy data” which is discusses nicely here.

# set index to year
df.set_index('Year')

	GDP	CPI	Country
Year
1990	5974.7	127.500	US
2000	10031.0	169.300	US
2010	14681.1	217.488	US

# did it change?
df

	GDP	CPI	Year	Country
0	5974.7	127.500	1990	US
1	10031.0	169.300	2000	US
2	14681.1	217.488	2010	US

Why is the index back to the original…well its just like string methods, the original data frame is not fundamentally changed. To change it you need to either (i) assign the modified data frame either to itself or to a new name or (ii) use the inplace = True command where it does not create a new object, but directly creates the new index on the old object.

So if we set the parameter inplace = True…

# try with inplace
df.set_index('Year', inplace = True)

Reset the Row Index#

Right now, we have set the index of pwt DataFrame as year column, can we reset it to numeric index? Yes, pandas provide the method reset_index methods as we’ll show you.

# reset the index
df.reset_index()

	Year	GDP	CPI	Country
0	1990	5974.7	127.500	US
1	2000	10031.0	169.300	US
2	2010	14681.1	217.488	US

If we set the inplace parameter = True, we will have similar results as set_index methods.

# use with .inplace
df.reset_index(inplace = True)

df

	Year	GDP	CPI	Country
0	1990	5974.7	127.500	US
1	2000	10031.0	169.300	US
2	2010	14681.1	217.488	US

Time to practice#

Make sure one or two exercises on .set_index .loc approach Use the pwt_df grab country or year, conditional selection

pwt

	countrycode	pop	rgdpe	year
0	CHN	1124.8	2.611	1990
1	CHN	1246.8	4.951	2000
2	CHN	1318.2	11.106	2010
3	FRA	58.2	1.294	1990
4	FRA	60.8	1.753	2000
5	FRA	64.7	2.032	2010

Exercise. How would you extract all rows for just one country, say “FRA”?

pwt.loc[pwt['countrycode'] == 'FRA']

	countrycode	pop	rgdpe	year
3	FRA	58.2	1.294	1990
4	FRA	60.8	1.753	2000
5	FRA	64.7	2.032	2010

Exercise. How would you extract two rows which have the latest data for “FRA” and “CHN”?

Exercise (Challenging). How would you extract all rows after 1990?

Remove Stuff by Column or Row#

How do we remove stuff, well there is the .drop method. In addition, we come across the axis parameter again. Let’s become familar with it.

# Reset the df DataFrame
df=pd.DataFrame(data)
df

	GDP	CPI	Year	Country
0	5974.7	127.500	1990	US
1	10031.0	169.300	2000	US
2	14681.1	217.488	2010	US

Can you guess what will happen, if…

#df.drop("CPI", axis = 1)

	GDP	Year	Country
0	5974.7	1990	US
1	10031.0	2000	US
2	14681.1	2010	US

df.drop(0, axis = 0)  # the first 0 here means we want drop the first row which is indexed by 0

	GDP	CPI	Year	Country
1	10031.0	169.300	2000	US
2	14681.1	217.488	2010	US

df

	GDP	CPI	Year	Country
0	5974.7	127.500	1990	US
1	10031.0	169.300	2000	US
2	14681.1	217.488	2010	US

Now, we can conclude: if we want to perform operations columnwise, we should set axis = 1 while for row-wise, axis = 0. We will see more examples for the DataFrame calculations part to help us grasp the idea.

Time to practice#

Exercise. For the DataFrame df, create a column diff equal to the difference of ngdp and rgdp. Verify that diff is now in df.

Exercise. How would you extract the variables ngdp and year?

Exercise How would you extract all variables for just one year, say 2000?

Exercise How would you drop the variable diff? If you print your dataframe again, is it gone? If not, why do you think it is still there?

Exercise How would you drop one year from the data set?

Hint: the key thing to recognize is the axis, this is saying drop a column named “CPI” if you did this with out the axis it would give an error, why the defalut is axis = 0 which are rows…and there is no index named “CPI”

Conditional Selection#

More than often, we want to select part of a DataFrame and perform some operations on it to get some statistics.

Here we present a way to select the rows which satisfy certain conditions.

df

	GDP	CPI	Year	Country
0	5974.7	127.500	1990	US
1	10031.0	169.300	2000	US
2	14681.1	217.488	2010	US

# if we just want to select the particular rows
# Here we select rows with CPI larger than 170
df.loc[df['CPI'] > 170]

	GDP	CPI	Year	Country
2	14681.1	217.488	2010	US

# if we want to select the particular rows while selecting the specified columns
df.loc[2, 'GDP']

14681.1

Calculations on a Dataframe#

Below are a bunch of calculations. This is essentially, the “excel” functionality of the data frame.

# Then it is super easy to create a new column based on an operation or existing
# columns, almost excel like...

df['RGDP'] = df['GDP']/df['CPI']

df['GDP_div_1000'] = df['GDP'] / 1000

<class 'pandas.core.series.Series'>
  11949.4
  20062.0
  29362.2
Name: GDP, dtype: float64
  46.860392
  59.249852
  67.503035
dtype: float64
  100.000000
  167.891275
  245.721124
Name: GDP, dtype: float64

        GDP      CPI  Year Country       RGDP  GDP_div_1000
 5974.7  127.500  1990      US  46.860392        5.9747
10031.0  169.300  2000      US  59.249852       10.0310
14681.1  217.488  2010      US  67.503035       14.6811

df

	GDP	CPI	Year	Country	RGDP	GDP_div_1000
0	5974.7	127.500	1990	US	46.860392	5.9747
1	10031.0	169.300	2000	US	59.249852	10.0310
2	14681.1	217.488	2010	US	67.503035	14.6811

Operations across rows/columns#

Here again, we need to set the axis parameters. Rememer, for across row computations, we need to set it as 0 and wise versa. Can you think of the execution results?

df.sum(axis=0)

GDP                30686.8
CPI                514.288
Year                  6000
Country             USUSUS
RGDP            173.613279
GDP_div_1000       30.6868
dtype: object

How about this one? Can even it be executed? Remember we have one column with string data structures.

df.sum(axis=1, numeric_only=True)

   8145.035092
  12269.580852
  16990.772135
dtype: float64

Yes, it can. It just ignores the column with string values. Amazing!

Can you try the following?

df.var(axis=0, numeric_only=True)

GDP             1.897973e+07
CPI             2.027861e+03
Year            1.000000e+02
RGDP            1.079554e+02
GDP_div_1000    1.897973e+01
dtype: float64

df.var(axis=1, numeric_only=True)

  6.601827e+06
  1.863618e+07
  4.047240e+07
dtype: float64

Time to practice#

Exercise. Can you compute the mean of each column of df?

df.mean(numeric_only=True)

Year     2000.000000
GDP     10228.933333
CPI       171.429333
dtype: float64

Exercise. Can you select the year 2010 and compute the row sum of df?

df[df['Year'] == 2010].sum(axis = 1, numeric_only = True)

2    16908.588
dtype: float64

Exercise (Challenging). Can you compute the mean of GDP where it is larger than 6000 of df?

Simple Statistics#

Here are some simple commands that can report basic summary statistics of the data:

# .describe()
df.describe()

	GDP	CPI	Year	RGDP	GDP_div_1000
count	3.000000	3.000000	3.0	3.000000	3.000000
mean	10228.933333	171.429333	2000.0	57.871093	10.228933
std	4356.573590	45.031773	10.0	10.390159	4.356574
min	5974.700000	127.500000	1990.0	46.860392	5.974700
25%	8002.850000	148.400000	1995.0	53.055122	8.002850
50%	10031.000000	169.300000	2000.0	59.249852	10.031000
75%	12356.050000	193.394000	2005.0	63.376443	12.356050
max	14681.100000	217.488000	2010.0	67.503035	14.681100

df

	GDP	CPI	Year	Country	RGDP	GDP_div_1000
0	5974.7	127.500	1990	US	46.860392	5.9747
1	10031.0	169.300	2000	US	59.249852	10.0310
2	14681.1	217.488	2010	US	67.503035	14.6811

df.describe(include = 'all')

	GDP	CPI	Year	Country	RGDP	GDP_div_1000
count	3.000000	3.000000	3.0	3	3.000000	3.000000
unique	NaN	NaN	NaN	1	NaN	NaN
top	NaN	NaN	NaN	US	NaN	NaN
freq	NaN	NaN	NaN	3	NaN	NaN
mean	10228.933333	171.429333	2000.0	NaN	57.871093	10.228933
std	4356.573590	45.031773	10.0	NaN	10.390159	4.356574
min	5974.700000	127.500000	1990.0	NaN	46.860392	5.974700
25%	8002.850000	148.400000	1995.0	NaN	53.055122	8.002850
50%	10031.000000	169.300000	2000.0	NaN	59.249852	10.031000
75%	12356.050000	193.394000	2005.0	NaN	63.376443	12.356050
max	14681.100000	217.488000	2010.0	NaN	67.503035	14.681100

Exercise. Compute the summary statistics (for the pwt data frame). Write these summary stats to an excel sheet. Can you do this only for China?

Output/Save Data#

We can output data in easy way as well with these commands. Note that it creates the file within your working directory unless you specify otherwise…

# to a csv
df.to_csv('gdp.csv')

# to excel
df.to_excel('gdp.xlsx')

<ipython-input-54-e42ea829b4fc>:2: UserWarning: Pandas requires version '1.4.3' or newer of 'xlsxwriter' (version '1.2.9' currently installed).
  df.to_excel('gdp.xlsx')

Summary#

Congratulations! First, it’s amazing that you have made it this far. Reflect on what you knew before working through this notebook, namely what we did in python fundamental notebooks. Now reflect on what you can do…AMAZING!!! Let us summarize some key things that we covered.

Pandas Core Objects: A DataFrame is essentially just a table of data while a Series can be thought of as a one columned DataFrame.
Understanding the DataFrame:
- Become familiar with basic attributes (.columns, .shape) and methods (.sum(), .mean()) in DataFrame data structure.
- Know different methods to grab columns and rows, e.g., their pros and cons, especially for the differences between iloc and loc. They look familiar but the inputs for the two methods are very different. loc gets rows (or columns) with particular labels from the index, while iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
- Learned how to perform basic mathematic/statistical computations in pandas.
Axis Understanding: when setting axis, always think about the operation first, whether it will be done across column or across row. If the former, setting axis = 1. For this course and the majority of dataframe, the axis will always be 0 or 1.

Intro to pandas

Contents

Intro to pandas#

Review and Outline#

Python#

Basics#

Series#

From Series to DataFrame#

DataFrame#

Columns in DataFrame#

Rows in DataFrame#

DataFrame Methods and Attributes Intro#

Time to practice#

Understanding DataFrame#

Play with Columns#

Grab one column#

Grab several columns#

Reset the column name#

Play with Rows#

Set Row Index#

Reset the Row Index#

Time to practice#

Remove Stuff by Column or Row#

Time to practice#

Conditional Selection#

Calculations on a Dataframe#

Operations across rows/columns#

Time to practice#

Simple Statistics#

Output/Save Data#

Summary#

Columns in `DataFrame`#

Rows in `DataFrame`#

`DataFrame` Methods and Attributes Intro#