Intro to pandas#
Review and Outline#
Great Work! We have made it this far…we know a few basics in python, e.g., built-in data types and structures (lists, tuples, strings, dictionaries), key operations.
Where are we going know…we will get into the key data analysis package: pandas.
What is pandas??? “pan”nel “da”ta “s”tructures. Powerfull, intuitive, data analysis
tool. This package convinced me to learn and start to use python as a research tool
Developed at AQR (a quantative hedgefund) by Wes Mckinney. They made it open source and quickly expanded developed and became widely used.
Python#
First we need to import the pandas package…very simmilar to when we imported
our functions, but this is a MUCH larger. Further more, this is what makes pandas
a higher-level add-on to python. That is at a lower level the objects, methods, functions…
are already created for us, then when we import pandas they are ready to go.
Then we will learn the key data structures in Pandas and their attributes and methods. Moreover, we will learn how to select data in DataFrame and then do computations afterwards.
Buzzwords. DataFrame, Series
Basics#
First let’s import and alias Pandas and NumPy.
import pandas as pd
import numpy as np
Let’s first get to know the two most important data structures in Pandas.
Besides, we can also convert Dataframe to Series as well via just selecting one column of a DataFrame.
DataFrame#
Now let’s create a DataFrame object from a dictionary.
data = {"GDP": [5974.7, 10031.0, 14681.1],
"CPI": [127.5, 169.3, 217.488],
"Year": [1990, 2000, 2010],
"Country": ["US", "US", "US"]}
Now we are going to convert the type of data to a “DataFrame” this is the key oject within pandas. (If familiar with R this is simmilar to their dataframe)
# create a DataFrame
df = pd.DataFrame(data)
# check the type
type(df)
pandas.core.frame.DataFrame
# examine the frame
df
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
#general info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GDP 3 non-null float64
1 CPI 3 non-null float64
2 Year 3 non-null int64
3 Country 3 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 224.0+ bytes
Creating a DataFrame#
Head over to the Federal Reserve banks data on unemployment here. Download the unemployment data as a .csv file and store this in a data folder.
#from a local file
#give filepath
#from a url
Columns in DataFrame#
df
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
This lays out the data in a very intuitive way, columns will be labeled as they are in the excel file.
Rows in DataFrame#
Rows are labeled with unique identifiers as well, called the “index.” We’ll learn how to retrieve the index of a DataFrame by df.index. Also, we’ll learn how to reset it by reset_index.
Amazing. You may be thinking… so what, well there is a reason why excel is popular, it is natural for people do think about data in a table like formate, a dataframe is always going to present this in the intuitive, natural way. This is also important because it helps us visualize and then implement calculations and operations on the table. Where as this could be very hard to do in the data variable above.
Therefore, we call the DataFrame as symmetric and indexed versions of spreadsheets.
DataFrame Methods and Attributes Intro#
In python remember the DataFrame is an object and with that object comes methods and attributes (we have seen less attributes, but lots of methods)
#shape
(3, 4)
#columns
Index(['GDP', 'CPI', 'Year', 'Country'], dtype='object')
#index
RangeIndex(start=0, stop=3, step=1)
#data types
GDP float64
CPI float64
Year int64
Country object
dtype: object
So this is interesting, for the numerical values it says that they are flaoting point vlaues, that is great. For the names, strings, it says that they are objects NOT strings? Pandas does this (i) if all the data in a column is a number, then it is recognized as a number (ii) if not, then it is just going to be an object.
Time to practice#
Exercises. Consider the following data below:
pwt_data = {'countrycode': ['CHN', 'CHN', 'CHN', 'FRA', 'FRA', 'FRA'],
'pop': [1124.8, 1246.8, 1318.2, 58.2, 60.8, 64.7],
'rgdpe': [2.611, 4.951, 11.106, 1.294, 1.753, 2.032],
'year': [1990, 2000, 2010, 1990, 2000, 2010]}
pwt = pd.DataFrame(pwt_data)
a) What are the dimensions of pwt?
b) What dtypes are the variables? What do they mean?
Understanding DataFrame#
After learning the basics, we will now dive in grabbing columns, rows(slicing) and other operations of a DataFrame.
Play with Columns#
Grab one column#
The most recommend way to use is…
df.columns
#select a column
0 127.500
1 169.300
2 217.488
Name: CPI, dtype: float64
In addition, there are two alternative ways used… The reasons why it is not suggested:
For
.: you might get confused CPI as a methods since we only normally use . to call the methods, e.g., .For
iloc: what if you change the sequence of columns, everything will be changed. Hard to debug.
#dot notation
0 127.500
1 169.300
2 217.488
Name: CPI, dtype: float64
# .iloc -- numeric index row, column
0 127.500
1 169.300
2 217.488
Name: CPI, dtype: float64
Since DataFrame is like an excel, think about the first input in the above bracket as index for rows while second for columns. Here, we want to select all the rows in the first column.
Remember python index starts with 0.
Regarding :, it’s similar to we have learned in Python Fundmentals 2 Slicing section.
Grab several columns#
# select multiple columns
| GDP | CPI | |
|---|---|---|
| 0 | 5974.7 | 127.500 |
| 1 | 10031.0 | 169.300 |
| 2 | 14681.1 | 217.488 |
# We can also do this with iloc.
| GDP | CPI | |
|---|---|---|
| 0 | 5974.7 | 127.500 |
| 1 | 10031.0 | 169.300 |
| 2 | 14681.1 | 217.488 |
But why only two columns selected?
Note: this is different for what we have learned in numpy array.
How about grabbing columns like CPI, GDP, can we do this with iloc at once? No.
Now we might notice that we can almost use iloc and other methods to grab rows interchangeably. However, we might consider by specifying the column name, it is easier for us to debug in the future. Think about, once you change the sequence of columns, everything won’t work.
Reset the column name#
#uppercase the names
['GDP', 'CPI', 'YEAR', 'COUNTRY']
Another way to rename specific instances… not that if we did not have the df in front, nothing would fundementally change, it would just copy and print out the new one, but the saved df is the same…
# rename method
df.rename({"gdp": 'GDP', 'cpi': "CPI", "Year": "YEAR", "Country": "COUNTRY"}, axis = 1)
| GDP | CPI | YEAR | COUNTRY | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
Play with Rows#
The most brute force way to grab rows is via just iloc.
#first row
GDP 5974.7
CPI 127.5
YEAR 1990
COUNTRY US
Name: 0, dtype: object
For example, if we want to grab the second row. Then it is the same for list or numpy data structure indexing system.
# second row
GDP 10031.0
CPI 169.3
YEAR 2000
COUNTRY US
Name: 1, dtype: object
It can also be used to select several rows
# slice of rows
| GDP | CPI | YEAR | COUNTRY | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.5 | 1990 | US |
| 1 | 10031.0 | 169.3 | 2000 | US |
But the above way is hard to do conditional selection. For example, if we want to select the year at 2000.
# conditional with .loc
| GDP | CPI | YEAR | COUNTRY | |
|---|---|---|---|---|
| 1 | 10031.0 | 169.3 | 2000 | US |
# This will also work
| GDP | CPI | YEAR | COUNTRY | |
|---|---|---|---|---|
| 1 | 10031.0 | 169.3 | 2000 | US |
Set Row Index#
In some conditions, maybe the above way to select is easier. However, we can transform the column where the selection criteria is as the new index and then do the job.
Below is an example of setting the index. This is a feature that I’m slowing starting to embrace. The idea essentially, is that by setting the index, then we can use the .loc or location finding command to pull out only specific entries on a row. For example, if we only want year 2000, then we
set the index to be the year
then use
.locto pull out that particular year.
How you would you do this for the country. Same idea, not that we set the index to countrycode, then select the county code that we want.
Two more points about this:
One is that we can multi-index, that is have layers of indexes…why would we want to do this. This relates to the question of the “natural index”…
I would argue the natural index would be on the level of an observation. What does that mean? Think about the data set above, what is an observation look like and what are the variables associated with it. Here an observation is a country time pair. Note an observation is two dimensions a country at a particular time. Then the variables associated with each observation are population, gdp. Back to the natural index, given this argument above, I would actually say that it is a multi index with countries and years.
MTWN: This discussion relates to this concept of “tidy data” which is discusses nicely here.
# set index to year
| GDP | CPI | Country | |
|---|---|---|---|
| Year | |||
| 1990 | 5974.7 | 127.500 | US |
| 2000 | 10031.0 | 169.300 | US |
| 2010 | 14681.1 | 217.488 | US |
# did it change?
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
Why is the index back to the original…well its just like string methods, the original data frame is not fundamentally changed. To change it you need to either (i) assign the modified data frame either to itself or to a new name or (ii) use the inplace = True command where it does not create a new object, but directly creates the new index on the old object.
So if we set the parameter inplace = True…
# try with inplace
Reset the Row Index#
Right now, we have set the index of pwt DataFrame as year column, can we reset it to numeric index? Yes, pandas provide the method reset_index methods as we’ll show you.
# reset the index
| Year | GDP | CPI | Country | |
|---|---|---|---|---|
| 0 | 1990 | 5974.7 | 127.500 | US |
| 1 | 2000 | 10031.0 | 169.300 | US |
| 2 | 2010 | 14681.1 | 217.488 | US |
If we set the inplace parameter = True, we will have similar results as set_index methods.
# use with .inplace
#take a peek
Time to practice#
Make sure one or two exercises on .set_index .loc approach Use the pwt_df grab country or year, conditional selection
pwt
| countrycode | pop | rgdpe | year | |
|---|---|---|---|---|
| 0 | CHN | 1124.8 | 2.611 | 1990 |
| 1 | CHN | 1246.8 | 4.951 | 2000 |
| 2 | CHN | 1318.2 | 11.106 | 2010 |
| 3 | FRA | 58.2 | 1.294 | 1990 |
| 4 | FRA | 60.8 | 1.753 | 2000 |
| 5 | FRA | 64.7 | 2.032 | 2010 |
Exercise. How would you extract all rows for just one country, say “FRA”?
Exercise. How would you extract two rows which have the latest data for “FRA” and “CHN”?
Exercise (Challenging). How would you extract all rows after 1990?
Remove Stuff by Column or Row#
How do we remove stuff, well there is the .drop method. In addition, we come across the axis parameter again. Let’s become familar with it.
# Reset the df DataFrame
df=pd.DataFrame(data)
df
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
Can you guess what will happen, if…
#df.drop("CPI", axis = 1)
| GDP | Year | Country | |
|---|---|---|---|
| 0 | 5974.7 | 1990 | US |
| 1 | 10031.0 | 2000 | US |
| 2 | 14681.1 | 2010 | US |
df.drop(0, axis = 0) # the first 0 here means we want drop the first row which is indexed by 0
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
df
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
Now, we can conclude: if we want to perform operations columnwise, we should set axis = 1 while for row-wise, axis = 0. We will see more examples for the DataFrame calculations part to help us grasp the idea.
Time to practice#
Exercise. For the DataFrame df, create a column diff equal to the difference of ngdp and rgdp. Verify that diff is now in df.
Exercise. How would you extract the variables ngdp and year?
Exercise How would you drop the variable diff? If you print your dataframe again, is it gone? If not, why do you think it is still there?
Exercise How would you drop one year from the data set?
Hint: the key thing to recognize is the axis, this is saying drop a column named “CPI” if you did this with out the axis it would give an error, why the defalut is axis = 0 which are rows…and there is no index named “CPI”
Conditional Selection#
More than often, we want to select part of a DataFrame and perform some operations on it to get some statistics.
Here we present a way to select the rows which satisfy certain conditions.
df
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
# if we just want to select the particular rows
# Here we select rows with CPI larger than 170
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 2 | 14681.1 | 217.488 | 2010 | US |
# if we want to select the particular rows while selecting the specified columns
14681.1
Calculations on a Dataframe#
Below are a bunch of calculations. This is essentially, the “excel” functionality of the data frame.
# Then it is super easy to create a new column based on an operation or existing
# columns, almost excel like...
df['RGDP'] = df['GDP']/df['CPI']
df['GDP_div_1000'] = df['GDP'] / 1000
<class 'pandas.core.series.Series'>
0 11949.4
1 20062.0
2 29362.2
Name: GDP, dtype: float64
0 46.860392
1 59.249852
2 67.503035
dtype: float64
0 100.000000
1 167.891275
2 245.721124
Name: GDP, dtype: float64
GDP CPI Year Country RGDP GDP_div_1000
0 5974.7 127.500 1990 US 46.860392 5.9747
1 10031.0 169.300 2000 US 59.249852 10.0310
2 14681.1 217.488 2010 US 67.503035 14.6811
df
| GDP | CPI | Year | Country | RGDP | GDP_div_1000 | |
|---|---|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US | 46.860392 | 5.9747 |
| 1 | 10031.0 | 169.300 | 2000 | US | 59.249852 | 10.0310 |
| 2 | 14681.1 | 217.488 | 2010 | US | 67.503035 | 14.6811 |
Operations across rows/columns#
Here again, we need to set the axis parameters. Rememer, for across row computations, we need to set it as 0 and wise versa. Can you think of the execution results?
df.sum(axis=0)
GDP 30686.8
CPI 514.288
Year 6000
Country USUSUS
RGDP 173.613279
GDP_div_1000 30.6868
dtype: object
How about this one? Can even it be executed? Remember we have one column with string data structures.
df.sum(axis=1, numeric_only=True)
0 8145.035092
1 12269.580852
2 16990.772135
dtype: float64
Yes, it can. It just ignores the column with string values. Amazing!
Can you try the following?
df.var(axis=0, numeric_only=True)
GDP 1.897973e+07
CPI 2.027861e+03
Year 1.000000e+02
RGDP 1.079554e+02
GDP_div_1000 1.897973e+01
dtype: float64
df.var(axis=1, numeric_only=True)
0 6.601827e+06
1 1.863618e+07
2 4.047240e+07
dtype: float64
Time to practice#
Exercise. Can you compute the mean of each column of df?
df.mean(numeric_only=True)
Year 2000.000000
GDP 10228.933333
CPI 171.429333
dtype: float64
Exercise. Can you select the year 2010 and compute the row sum of df?
df[df['Year'] == 2010].sum(axis = 1, numeric_only = True)
2 16908.588
dtype: float64
Exercise (Challenging). Can you compute the mean of GDP where it is larger than 6000 of df?
Simple Statistics#
Here are some simple commands that can report basic summary statistics of the data:
# .describe()
df.describe()
| GDP | CPI | Year | RGDP | GDP_div_1000 | |
|---|---|---|---|---|---|
| count | 3.000000 | 3.000000 | 3.0 | 3.000000 | 3.000000 |
| mean | 10228.933333 | 171.429333 | 2000.0 | 57.871093 | 10.228933 |
| std | 4356.573590 | 45.031773 | 10.0 | 10.390159 | 4.356574 |
| min | 5974.700000 | 127.500000 | 1990.0 | 46.860392 | 5.974700 |
| 25% | 8002.850000 | 148.400000 | 1995.0 | 53.055122 | 8.002850 |
| 50% | 10031.000000 | 169.300000 | 2000.0 | 59.249852 | 10.031000 |
| 75% | 12356.050000 | 193.394000 | 2005.0 | 63.376443 | 12.356050 |
| max | 14681.100000 | 217.488000 | 2010.0 | 67.503035 | 14.681100 |
df
| GDP | CPI | Year | Country | RGDP | GDP_div_1000 | |
|---|---|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US | 46.860392 | 5.9747 |
| 1 | 10031.0 | 169.300 | 2000 | US | 59.249852 | 10.0310 |
| 2 | 14681.1 | 217.488 | 2010 | US | 67.503035 | 14.6811 |
df.describe(include = 'all')
| GDP | CPI | Year | Country | RGDP | GDP_div_1000 | |
|---|---|---|---|---|---|---|
| count | 3.000000 | 3.000000 | 3.0 | 3 | 3.000000 | 3.000000 |
| unique | NaN | NaN | NaN | 1 | NaN | NaN |
| top | NaN | NaN | NaN | US | NaN | NaN |
| freq | NaN | NaN | NaN | 3 | NaN | NaN |
| mean | 10228.933333 | 171.429333 | 2000.0 | NaN | 57.871093 | 10.228933 |
| std | 4356.573590 | 45.031773 | 10.0 | NaN | 10.390159 | 4.356574 |
| min | 5974.700000 | 127.500000 | 1990.0 | NaN | 46.860392 | 5.974700 |
| 25% | 8002.850000 | 148.400000 | 1995.0 | NaN | 53.055122 | 8.002850 |
| 50% | 10031.000000 | 169.300000 | 2000.0 | NaN | 59.249852 | 10.031000 |
| 75% | 12356.050000 | 193.394000 | 2005.0 | NaN | 63.376443 | 12.356050 |
| max | 14681.100000 | 217.488000 | 2010.0 | NaN | 67.503035 | 14.681100 |
Exercise. Compute the summary statistics (for the pwt data frame). Write these summary stats to an excel sheet. Can you do this only for China?
Output/Save Data#
We can output data in easy way as well with these commands. Note that it creates the file within your working directory unless you specify otherwise…
# to a csv
df.to_csv('gdp.csv')
# to excel
df.to_excel('gdp.xlsx')
<ipython-input-54-e42ea829b4fc>:2: UserWarning: Pandas requires version '1.4.3' or newer of 'xlsxwriter' (version '1.2.9' currently installed).
df.to_excel('gdp.xlsx')
Summary#
Congratulations! First, it’s amazing that you have made it this far. Reflect on what you knew before working through this notebook, namely what we did in python fundamental notebooks. Now reflect on what you can do…AMAZING!!! Let us summarize some key things that we covered.
Pandas Core Objects: A
DataFrameis essentially just a table of data while aSeriescan be thought of as a one columnedDataFrame.Understanding the
DataFrame:Become familiar with basic attributes (
.columns,.shape) and methods (.sum(),.mean()) inDataFramedata structure.Know different methods to grab columns and rows, e.g., their pros and cons, especially for the differences between
ilocandloc. They look familiar but the inputs for the two methods are very different.locgets rows (or columns) with particular labels from the index, whileilocgets rows (or columns) at particular positions in the index (so it only takes integers).Learned how to perform basic mathematic/statistical computations in
pandas.
Axis Understanding: when setting axis, always think about the operation first, whether it will be done across column or across row. If the former, setting axis = 1. For this course and the majority of dataframe, the axis will always be 0 or 1.