Intro to pandas#
Review and Outline#
Great Work! We have made it this far…we know a few basics in python, e.g., built-in data types and structures (lists, tuples, strings, dictionaries), key operations and one of the most important packages numpy for numerical computing.
Where are we going know…we will get into the key data analysis package: pandas.
What is pandas??? “pan”nel “da”ta “s”tructures. Powerfull, intuitive, data analysis
tool. This package convinced me to learn and start to use python as a research tool
Developed at AQR (a quantative hedgefund) by Wes Mckinney. They made it open source and quickly expanded developed and became widely used.
Python#
First we need to import the pandas package…very simmilar to when we imported
our functions, but this is a MUCH larger. Further more, this is what makes pandas
a higher-level add-on to python. That is at a lower level the objects, methods, functions…
are already created for us, then when we import pandas they are ready to go.
Then we will learn the key data structures in Pandas and their attributes and methods. Moreover, we will learn how to select data in DataFrame and then do computations afterwards.
Buzzwords. DataFrame, Series
Basics#
First let’s import and rename them as follows like we learned in intro to numpy notebook.
import pandas as pd
import numpy as np
Let’s first get to know the two most important data structures in Pandas.
Series#
The Series is the primary building block of pandas. A Series represents a one-dimensional labeled indexed array based on the NumPy ndarray. It can be created and initialized by passing either a scalar value, a NumPy ndarray, a Python list, or a Python Dict.
# create a series from a dictionary
gdp = {"GDP": [5974.7, 10031.0, 14681.1]}
# what kind of data structure is this
type(gdp)
dict
It should tell us that it is a dictionary, with keys and values (which are lists). How do we get those?
# 'name' parameter specifies the column name of the series object
pd.Series(gdp, name = 'GDP')
GDP [5974.7, 10031.0, 14681.1]
Name: GDP, dtype: object
# create a series from a numpy 1 dimension array
cpi = np.array([127.5, 169.3, 217.488])
pd.Series(cpi, name = 'CPI')
# create series from a list
year = [1990, 2000, 2010]
country = ["US", "US", "US"]
pd.Series(country, name = 'country', index = year)
1990 US
2000 US
2010 US
Name: country, dtype: object
From Series to DataFrame#
A DataFrame is essentially just a table of data, or a dictionary of Series while a Series can be thought of as a one columned DataFrame.
Let’s create a DataFrame from the series previously created by pandas concat methods. It concatenates pandas objects, e.g., Series, DataFrame along a particular axis.
I know it might seem a little bit overwhelming right now, e.g., new methods and not understanding what does axis mean here? But let’s see the examples by setting different axis parameters at first. We will discuss this over and over later.
#concatenate series along axis = 1
pd.concat((pd.Series(country), pd.Series(year)), axis = 1)
| 0 | 1 | |
|---|---|---|
| 0 | US | 1990 |
| 1 | US | 2000 |
| 2 | US | 2010 |
#concatenate series along axis = 0
pd.concat((pd.Series(country), pd.Series(year)), axis = 0)
0 US
1 US
2 US
0 1990
1 2000
2 2010
dtype: object
Besides, we can also convert Dataframe to Series as well via just selecting one column of a DataFrame.
DataFrame#
Now let’s create a DataFrame object from a dictionary.
data = {"GDP": [5974.7, 10031.0, 14681.1],
"CPI": [127.5, 169.3, 217.488],
"Year": [1990, 2000, 2010],
"Country": ["US", "US", "US"]}
Now we are going to convert the type of data to a “DataFrame” this is the key oject within pandas. (If familiar with R this is simmilar to their dataframe)
# create a DataFrame
df = pd.DataFrame(data)
# check the type
type(df)
pandas.core.frame.DataFrame
# examine the frame
df
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
#general info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GDP 3 non-null float64
1 CPI 3 non-null float64
2 Year 3 non-null int64
3 Country 3 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 228.0+ bytes
Columns in DataFrame#
df
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
This lays out the data in a very intuitive way, columns will be labeled as they are in the excel file.
Rows in DataFrame#
Rows are labeled with unique identifiers as well, called the “index.” We’ll learn how to retrieve the index of a DataFrame by df.index. Also, we’ll learn how to reset it by reset_index.
Amazing. You may be thinking… so what, well there is a reason why excel is popular, it is natural for people do think about data in a table like formate, a dataframe is always going to present this in the intuitive, natural way. This is also important because it helps us visualize and then implement calculations and operations on the table. Where as this could be very hard to do in the data variable above.
Therefore, we call the DataFrame as symmetric and indexed versions of spreadsheets.
DataFrame Methods and Attributes Intro#
In python remember the DataFrame is an object and with that object comes methods and attributes (we have seen less attributes, but lots of methods)
#shape
df.shape
(3, 4)
#columns
df.columns
Index(['GDP', 'CPI', 'Year', 'Country'], dtype='object')
#index
df.index
RangeIndex(start=0, stop=3, step=1)
#data types
df.dtypes
GDP float64
CPI float64
Year int64
Country object
dtype: object
So this is interesting, for the numerical values it says that they are flaoting point vlaues, that is great. For the names, strings, it says that they are objects NOT strings? Pandas does this (i) if all the data in a column is a number, then it is recognized as a number (ii) if not, then it is just going to be an object.
Time to practice#
Exercises. Consider the following data below:
pwt_data = {'countrycode': ['CHN', 'CHN', 'CHN', 'FRA', 'FRA', 'FRA'],
'pop': [1124.8, 1246.8, 1318.2, 58.2, 60.8, 64.7],
'rgdpe': [2.611, 4.951, 11.106, 1.294, 1.753, 2.032],
'year': [1990, 2000, 2010, 1990, 2000, 2010]}
pwt = pd.DataFrame(pwt_data)
a) What are the dimensions of pwt?
pwt.shape
(6, 4)
b) What dtypes are the variables? What do they mean?
pwt.dtypes
countrycode object
pop float64
rgdpe float64
year int64
dtype: object
#countrycode is string, population and rgdpe are decimal valued numbers and year is an integer number
c) What does pwt.columns.tolist() do? How does it compare to list(pwt)?
pwt.columns.tolist(), list(pwt) #same thing!
(['countrycode', 'pop', 'rgdpe', 'year'],
['countrycode', 'pop', 'rgdpe', 'year'])
d) Challenging. What is list(pwt)[0]? Why? What type is it?
list(pwt)[0] #first column
'countrycode'
e) Challenging. What would you say is the natural index? How would you set it?
#looks like the year would be
Understanding DataFrame#
After learning the basics, we will now dive in grabbing columns, rows(slicing) and other operations of a DataFrame.
Play with Columns#
Grab one column#
The most recommend way to use is…
df.columns
Index(['GDP', 'CPI', 'Year', 'Country'], dtype='object')
#select a column
df['CPI']
0 127.500
1 169.300
2 217.488
Name: CPI, dtype: float64
In addition, there are two alternative ways used… The reasons why it is not suggested:
For
.: you might get confused CPI as a methods since we only normally use . to call the methods, e.g., .For
iloc: what if you change the sequence of columns, everything will be changed. Hard to debug.
#dot notation
df.CPI
0 127.500
1 169.300
2 217.488
Name: CPI, dtype: float64
# .iloc -- numeric index row, column
df.iloc[:, 1]
0 127.500
1 169.300
2 217.488
Name: CPI, dtype: float64
Since DataFrame is like an excel, think about the first input in the above bracket as index for rows while second for columns. Here, we want to select all the rows in the first column.
Remember python index starts with 0.
Regarding :, it’s similar to we have learned in Python Fundmentals 2 Slicing section.
Grab several columns#
# select multiple columns
df[['GDP', 'CPI']]
| GDP | CPI | |
|---|---|---|
| 0 | 5974.7 | 127.500 |
| 1 | 10031.0 | 169.300 |
| 2 | 14681.1 | 217.488 |
# We can also do this with iloc.
df.iloc[:, :2]
| GDP | CPI | |
|---|---|---|
| 0 | 5974.7 | 127.500 |
| 1 | 10031.0 | 169.300 |
| 2 | 14681.1 | 217.488 |
But why only two columns selected?
Note: this is different for what we have learned in numpy array.
How about grabbing columns like CPI, GDP, can we do this with iloc at once? No.
Now we might notice that we can almost use iloc and other methods to grab rows interchangeably. However, we might consider by specifying the column name, it is easier for us to debug in the future. Think about, once you change the sequence of columns, everything won’t work.
Reset the column name#
#uppercase the names
[i.upper() for i in df.columns.tolist()]
['GDP', 'CPI', 'YEAR', 'COUNTRY']
Another way to rename specific instances… not that if we did not have the df in front, nothing would fundementally change, it would just copy and print out the new one, but the saved df is the same…
# rename method
df.rename({"gdp": 'GDP', 'cpi': "CPI", "Year": "YEAR", "Country": "COUNTRY"}, axis = 1)
| GDP | CPI | YEAR | COUNTRY | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
Play with Rows#
The most brute force way to grab rows is via just iloc.
df.iloc[0]
GDP 5974.7
CPI 127.5
Year 1990
Country US
Name: 0, dtype: object
For example, if we want to grab the second row. Then it is the same for list or numpy data structure indexing system.
# second row
df.iloc[1]
GDP 10031.0
CPI 169.3
Year 2000
Country US
Name: 1, dtype: object
It can also be used to select several rows
# slice of rows
df.iloc[:2]
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.5 | 1990 | US |
| 1 | 10031.0 | 169.3 | 2000 | US |
But the above way is hard to do conditional selection. For example, if we want to select the year at 2000.
# conditional with .loc
df.loc[df['YEAR'] == 2000]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'YEAR'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[35], line 2
1 # conditional with .loc
----> 2 df.loc[df['YEAR'] == 2000]
File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
4100 if self.columns.nlevels > 1:
4101 return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
4103 if is_integer(indexer):
4104 indexer = [indexer]
File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
3817 self._check_indexing_error(key)
KeyError: 'YEAR'
# This will also work
df[df['YEAR'] == 2000]
| GDP | CPI | YEAR | COUNTRY | |
|---|---|---|---|---|
| 1 | 10031.0 | 169.3 | 2000 | US |
Set Row Index#
In some conditions, maybe the above way to select is easier. However, we can transform the column where the selection criteria is as the new index and then do the job.
Below is an example of setting the index. This is a feature that I’m slowing starting to embrace. The idea essentially, is that by setting the index, then we can use the .loc or location finding command to pull out only specific entries on a row. For example, if we only want year 2000, then we
set the index to be the year
then use
.locto pull out that particular year.
How you would you do this for the country. Same idea, not that we set the index to countrycode, then select the county code that we want.
Two more points about this:
One is that we can multi-index, that is have layers of indexes…why would we want to do this. This relates to the question of the “natural index”…
I would argue the natural index would be on the level of an observation. What does that mean? Think about the data set above, what is an observation look like and what are the variables associated with it. Here an observation is a country time pair. Note an observation is two dimensions a country at a particular time. Then the variables associated with each observation are population, gdp. Back to the natural index, given this argument above, I would actually say that it is a multi index with countries and years.
MTWN: This discussion relates to this concept of “tidy data” which is discusses nicely here.
# set index to year
df.set_index('Year')
| GDP | CPI | Country | |
|---|---|---|---|
| Year | |||
| 1990 | 5974.7 | 127.500 | US |
| 2000 | 10031.0 | 169.300 | US |
| 2010 | 14681.1 | 217.488 | US |
# did it change?
df
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
Why is the index back to the original…well its just like string methods, the original data frame is not fundamentally changed. To change it you need to either (i) assign the modified data frame either to itself or to a new name or (ii) use the inplace = True command where it does not create a new object, but directly creates the new index on the old object.
So if we set the parameter inplace = True…
# try with inplace
df.set_index('Year', inplace = True)
Reset the Row Index#
Right now, we have set the index of pwt DataFrame as year column, can we reset it to numeric index? Yes, pandas provide the method reset_index methods as we’ll show you.
# reset the index
df.reset_index()
| Year | GDP | CPI | Country | |
|---|---|---|---|---|
| 0 | 1990 | 5974.7 | 127.500 | US |
| 1 | 2000 | 10031.0 | 169.300 | US |
| 2 | 2010 | 14681.1 | 217.488 | US |
If we set the inplace parameter = True, we will have similar results as set_index methods.
# use with .inplace
df.reset_index(inplace = True)
df
| Year | GDP | CPI | Country | |
|---|---|---|---|---|
| 0 | 1990 | 5974.7 | 127.500 | US |
| 1 | 2000 | 10031.0 | 169.300 | US |
| 2 | 2010 | 14681.1 | 217.488 | US |
Time to practice#
Make sure one or two exercises on .set_index .loc approach Use the pwt_df grab country or year, conditional selection
pwt
| countrycode | pop | rgdpe | year | |
|---|---|---|---|---|
| 0 | CHN | 1124.8 | 2.611 | 1990 |
| 1 | CHN | 1246.8 | 4.951 | 2000 |
| 2 | CHN | 1318.2 | 11.106 | 2010 |
| 3 | FRA | 58.2 | 1.294 | 1990 |
| 4 | FRA | 60.8 | 1.753 | 2000 |
| 5 | FRA | 64.7 | 2.032 | 2010 |
Exercise. How would you extract all rows for just one country, say “FRA”?
pwt.loc[pwt['countrycode'] == 'FRA']
| countrycode | pop | rgdpe | year | |
|---|---|---|---|---|
| 3 | FRA | 58.2 | 1.294 | 1990 |
| 4 | FRA | 60.8 | 1.753 | 2000 |
| 5 | FRA | 64.7 | 2.032 | 2010 |
Exercise. How would you extract two rows which have the latest data for “FRA” and “CHN”?
Exercise (Challenging). How would you extract all rows after 1990?
Remove Stuff by Column or Row#
How do we remove stuff, well there is the .drop method. In addition, we come across the axis parameter again. Let’s become familar with it.
# Reset the df DataFrame
df=pd.DataFrame(data)
df
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
Can you guess what will happen, if…
#df.drop("CPI", axis = 1)
| GDP | Year | Country | |
|---|---|---|---|
| 0 | 5974.7 | 1990 | US |
| 1 | 10031.0 | 2000 | US |
| 2 | 14681.1 | 2010 | US |
df.drop(0, axis = 0) # the first 0 here means we want drop the first row which is indexed by 0
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
df
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
Now, we can conclude: if we want to perform operations columnwise, we should set axis = 1 while for row-wise, axis = 0. We will see more examples for the DataFrame calculations part to help us grasp the idea.
Time to practice#
Exercise. For the DataFrame df, create a column diff equal to the difference of ngdp and rgdp. Verify that diff is now in df.
Exercise. How would you extract the variables ngdp and year?
Exercise How would you extract all variables for just one year, say 2000?
Exercise How would you drop the variable diff? If you print your dataframe again, is it gone? If not, why do you think it is still there?
Exercise How would you drop one year from the data set?
Hint: the key thing to recognize is the axis, this is saying drop a column named “CPI” if you did this with out the axis it would give an error, why the defalut is axis = 0 which are rows…and there is no index named “CPI”
Conditional Selection#
More than often, we want to select part of a DataFrame and perform some operations on it to get some statistics.
Here we present a way to select the rows which satisfy certain conditions.
df
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US |
| 1 | 10031.0 | 169.300 | 2000 | US |
| 2 | 14681.1 | 217.488 | 2010 | US |
# if we just want to select the particular rows
# Here we select rows with CPI larger than 170
df.loc[df['CPI'] > 170]
| GDP | CPI | Year | Country | |
|---|---|---|---|---|
| 2 | 14681.1 | 217.488 | 2010 | US |
# if we want to select the particular rows while selecting the specified columns
df.loc[2, 'GDP']
14681.1
Calculations on a Dataframe#
Below are a bunch of calculations. This is essentially, the “excel” functionality of the data frame.
# Then it is super easy to create a new column based on an operation or existing
# columns, almost excel like...
df['RGDP'] = df['GDP']/df['CPI']
df['GDP_div_1000'] = df['GDP'] / 1000
<class 'pandas.core.series.Series'>
0 11949.4
1 20062.0
2 29362.2
Name: GDP, dtype: float64
0 46.860392
1 59.249852
2 67.503035
dtype: float64
0 100.000000
1 167.891275
2 245.721124
Name: GDP, dtype: float64
GDP CPI Year Country RGDP GDP_div_1000
0 5974.7 127.500 1990 US 46.860392 5.9747
1 10031.0 169.300 2000 US 59.249852 10.0310
2 14681.1 217.488 2010 US 67.503035 14.6811
df
| GDP | CPI | Year | Country | RGDP | GDP_div_1000 | |
|---|---|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US | 46.860392 | 5.9747 |
| 1 | 10031.0 | 169.300 | 2000 | US | 59.249852 | 10.0310 |
| 2 | 14681.1 | 217.488 | 2010 | US | 67.503035 | 14.6811 |
Operations across rows/columns#
Here again, we need to set the axis parameters. Rememer, for across row computations, we need to set it as 0 and wise versa. Can you think of the execution results?
df.sum(axis=0)
GDP 30686.8
CPI 514.288
Year 6000
Country USUSUS
RGDP 173.613279
GDP_div_1000 30.6868
dtype: object
How about this one? Can even it be executed? Remember we have one column with string data structures.
df.sum(axis=1, numeric_only=True)
0 8145.035092
1 12269.580852
2 16990.772135
dtype: float64
Yes, it can. It just ignores the column with string values. Amazing!
Can you try the following?
df.var(axis=0, numeric_only=True)
GDP 1.897973e+07
CPI 2.027861e+03
Year 1.000000e+02
RGDP 1.079554e+02
GDP_div_1000 1.897973e+01
dtype: float64
df.var(axis=1, numeric_only=True)
0 6.601827e+06
1 1.863618e+07
2 4.047240e+07
dtype: float64
Time to practice#
Exercise. Can you compute the mean of each column of df?
df.mean(numeric_only=True)
Year 2000.000000
GDP 10228.933333
CPI 171.429333
dtype: float64
Exercise. Can you select the year 2010 and compute the row sum of df?
df[df['Year'] == 2010].sum(axis = 1, numeric_only = True)
2 16908.588
dtype: float64
Exercise (Challenging). Can you compute the mean of GDP where it is larger than 6000 of df?
Simple Statistics#
Here are some simple commands that can report basic summary statistics of the data:
# .describe()
df.describe()
| GDP | CPI | Year | RGDP | GDP_div_1000 | |
|---|---|---|---|---|---|
| count | 3.000000 | 3.000000 | 3.0 | 3.000000 | 3.000000 |
| mean | 10228.933333 | 171.429333 | 2000.0 | 57.871093 | 10.228933 |
| std | 4356.573590 | 45.031773 | 10.0 | 10.390159 | 4.356574 |
| min | 5974.700000 | 127.500000 | 1990.0 | 46.860392 | 5.974700 |
| 25% | 8002.850000 | 148.400000 | 1995.0 | 53.055122 | 8.002850 |
| 50% | 10031.000000 | 169.300000 | 2000.0 | 59.249852 | 10.031000 |
| 75% | 12356.050000 | 193.394000 | 2005.0 | 63.376443 | 12.356050 |
| max | 14681.100000 | 217.488000 | 2010.0 | 67.503035 | 14.681100 |
df
| GDP | CPI | Year | Country | RGDP | GDP_div_1000 | |
|---|---|---|---|---|---|---|
| 0 | 5974.7 | 127.500 | 1990 | US | 46.860392 | 5.9747 |
| 1 | 10031.0 | 169.300 | 2000 | US | 59.249852 | 10.0310 |
| 2 | 14681.1 | 217.488 | 2010 | US | 67.503035 | 14.6811 |
df.describe(include = 'all')
| GDP | CPI | Year | Country | RGDP | GDP_div_1000 | |
|---|---|---|---|---|---|---|
| count | 3.000000 | 3.000000 | 3.0 | 3 | 3.000000 | 3.000000 |
| unique | NaN | NaN | NaN | 1 | NaN | NaN |
| top | NaN | NaN | NaN | US | NaN | NaN |
| freq | NaN | NaN | NaN | 3 | NaN | NaN |
| mean | 10228.933333 | 171.429333 | 2000.0 | NaN | 57.871093 | 10.228933 |
| std | 4356.573590 | 45.031773 | 10.0 | NaN | 10.390159 | 4.356574 |
| min | 5974.700000 | 127.500000 | 1990.0 | NaN | 46.860392 | 5.974700 |
| 25% | 8002.850000 | 148.400000 | 1995.0 | NaN | 53.055122 | 8.002850 |
| 50% | 10031.000000 | 169.300000 | 2000.0 | NaN | 59.249852 | 10.031000 |
| 75% | 12356.050000 | 193.394000 | 2005.0 | NaN | 63.376443 | 12.356050 |
| max | 14681.100000 | 217.488000 | 2010.0 | NaN | 67.503035 | 14.681100 |
Exercise. Compute the summary statistics (for the pwt data frame). Write these summary stats to an excel sheet. Can you do this only for China?
Output/Save Data#
We can output data in easy way as well with these commands. Note that it creates the file within your working directory unless you specify otherwise…
# to a csv
df.to_csv('gdp.csv')
# to excel
df.to_excel('gdp.xlsx')
<ipython-input-54-e42ea829b4fc>:2: UserWarning: Pandas requires version '1.4.3' or newer of 'xlsxwriter' (version '1.2.9' currently installed).
df.to_excel('gdp.xlsx')
Summary#
Congratulations! First, it’s amazing that you have made it this far. Reflect on what you knew before working through this notebook, namely what we did in python fundamental notebooks. Now reflect on what you can do…AMAZING!!! Let us summarize some key things that we covered.
Pandas Core Objects: A
DataFrameis essentially just a table of data while aSeriescan be thought of as a one columnedDataFrame.Understanding the
DataFrame:Become familiar with basic attributes (
.columns,.shape) and methods (.sum(),.mean()) inDataFramedata structure.Know different methods to grab columns and rows, e.g., their pros and cons, especially for the differences between
ilocandloc. They look familiar but the inputs for the two methods are very different.locgets rows (or columns) with particular labels from the index, whileilocgets rows (or columns) at particular positions in the index (so it only takes integers).Learned how to perform basic mathematic/statistical computations in
pandas.
Axis Understanding: when setting axis, always think about the operation first, whether it will be done across column or across row. If the former, setting axis = 1. For this course and the majority of dataframe, the axis will always be 0 or 1.