Introduction to pandas

#import and alias
import pandas as pd

Creating a DataFrame

#create DataFrame from dictionary
data = {'CPI': [10, 11, 9],
        'GDP': [8, 3, 7],
        'Year': [2020, 2021, 2022]}
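The dictionary above can be passed directly to pd.DataFrame. The following is a minimal sketch of that step; the variable name df is our own choice, not something fixed by pandas.

```python
import pandas as pd

# Same dictionary as above: each key becomes a column name,
# and each list supplies that column's values
data = {'CPI': [10, 11, 9],
        'GDP': [8, 3, 7],
        'Year': [2020, 2021, 2022]}

# Build the DataFrame; rows receive a default integer index 0, 1, 2
df = pd.DataFrame(data)
print(df)
```

All lists in the dictionary need the same length, since each one fills a full column.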

Head over to our course repository here and download the .csv file for the Ames Housing data. Upload it to Colab or your local JupyterLab instance.

#create DataFrame from a local .csv file
#look at the first five rows
#check out the shape of the data
#what are the column names?
#what are the row names?
#all above in one place with .info()
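One way the steps above might look is sketched below. So that it runs without the downloaded file, a small in-memory CSV with made-up values stands in for the Ames data; the variable name ames, the sample rows, and the choice of columns are all assumptions for illustration. With the real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file (values are made up).
# With the real data, pass the file's path to pd.read_csv instead.
csv_text = """Id,YearBuilt,GarageCars,SalePrice
1,2004,2,210000
2,1975,2,165000
3,2002,2,225000
4,1920,3,138000
5,2001,3,255000
6,1994,2,147000
"""
ames = pd.read_csv(io.StringIO(csv_text))

print(ames.head())      # first five rows
print(ames.shape)       # (number of rows, number of columns)
print(ames.columns)     # column names
print(ames.index)       # row labels (a default RangeIndex here)
ames.info()             # dtypes, non-null counts, and memory usage together
```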

It turns out we can also pass the URL of a .csv file as the filepath when creating a DataFrame. In our repo, view the “raw” data by selecting the Raw button for the Ames data.

#url to .csv raw data
url = ''
#use read_csv to read url to dataframe
#look at first 10 rows
#look over info
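Since the actual URL is left blank above, here is a sketch that runs offline: it writes a tiny made-up CSV to a temporary file and reads it back through a file:// URL. The file name, its contents, and the variable names are assumptions for illustration. A raw https:// URL from the repo would be passed to pd.read_csv in the same way.

```python
import pathlib
import tempfile
import pandas as pd

# Write a tiny made-up CSV to a temporary file so the sketch runs offline
tmp_dir = tempfile.mkdtemp()
csv_path = pathlib.Path(tmp_dir) / 'sample.csv'
csv_path.write_text('a,b\n1,2\n3,4\n')

# A file:// URL stands in here for the raw .csv URL from the repo
url = csv_path.as_uri()

# read_csv accepts a URL string in place of a local path
df_url = pd.read_csv(url)

print(df_url.head(10))  # first 10 rows (fewer if the data is shorter)
df_url.info()
```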

Selecting Rows and Columns

Both rows and columns can be selected by integer position or by name. For selection by position we use the .iloc indexer, and for selection by name (label) we use the .loc indexer.

#select one column
#select two columns
#select first row
#select first three rows and three columns
#select the Alley column
#select the Alley and BsmtCond columns
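The selections listed above could be written as in the sketch below. It uses a small made-up frame so that it runs on its own; Alley and BsmtCond match the column names mentioned above, while the values and the other two columns are assumptions.

```python
import pandas as pd

# Small made-up frame; values are illustrative only
df = pd.DataFrame({'Alley': ['Grvl', 'Pave', None, 'Grvl'],
                   'BsmtCond': ['TA', 'Gd', 'TA', 'Fa'],
                   'YearBuilt': [1999, 2005, 1978, 2010],
                   'GarageCars': [2, 3, 1, 3]})

one_col = df['Alley']                        # one column, returned as a Series
two_cols = df[['Alley', 'BsmtCond']]         # list of names, returned as a DataFrame
first_row = df.iloc[0]                       # first row by position
block = df.iloc[:3, :3]                      # first three rows and first three columns
by_label = df.loc[:, ['Alley', 'BsmtCond']]  # all rows, two columns by name
```

Note that slices with .iloc exclude the end position, as in ordinary Python slicing, while slices with .loc include the end label.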

Selections based on conditions

Often we are interested in subsets of the data that satisfy specific criteria. We can select them with the dictionary-like bracket syntax, the .loc indexer, or the .query method.

df[conditional statement]
#or
df.loc[conditional statement]
#or
df.query('conditional statement')   #.query takes the condition as a string
#houses with a 3 car garage -- GarageCars feature
#using dictionary syntax
#using .loc
#using .query
#3 car garage and built after the year 2000 (YearBuilt feature)
#average price of houses with 3 car garage built in 2000's?
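The three styles can be compared side by side in the sketch below. The rows are made up, and SalePrice is assumed to be the name of the price column; GarageCars and YearBuilt are the features named above.

```python
import pandas as pd

# Made-up rows; SalePrice is an assumed name for the price column
df = pd.DataFrame({'GarageCars': [3, 2, 3, 1, 3],
                   'YearBuilt': [2005, 1999, 1995, 2010, 2008],
                   'SalePrice': [300000, 180000, 220000, 150000, 340000]})

# Houses with a 3 car garage, three equivalent ways
three_car_brackets = df[df['GarageCars'] == 3]
three_car_loc = df.loc[df['GarageCars'] == 3]
three_car_query = df.query('GarageCars == 3')

# Combine conditions with & and wrap each condition in parentheses
recent = df.loc[(df['GarageCars'] == 3) & (df['YearBuilt'] > 2000)]

# Average price within that subset
avg_price = recent['SalePrice'].mean()
print(avg_price)
```

The parentheses matter because & binds more tightly than comparison operators such as == and >. Python's and keyword does not work element-wise on a Series, which is why & is used here.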

CHALLENGE

Head back to our repository and create a new DataFrame named health from the health insurance charges data.

  1. Did people over 50 have different average costs than those under 50 years of age?

  2. What is the average bmi of all the male observations?

  3. Did people in the southeast spend more or less on average than those in the northeast?

Sort and Summarize

In addition to the .mean method, a variety of helpful sorting, selecting, and aggregating methods are built into the DataFrame. We explore these below.

  • .nlargest

  • .nsmallest

  • .sort_values

  • .value_counts

  • .describe

#sort the data
#find the 10 largest bmi
#how many entries for each region?
#describe the numeric data
#describe all the data
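A sketch of the listed methods on a small made-up frame follows. The column names age, bmi, region, and charges are assumptions about the health insurance data, and with only five rows we ask for the 2 largest bmi values rather than 10.

```python
import pandas as pd

# Made-up rows; column names are assumptions about the health data
health = pd.DataFrame({'age': [19, 52, 33, 61, 45],
                       'bmi': [27.9, 33.8, 22.7, 36.4, 30.1],
                       'region': ['southwest', 'southeast', 'northeast',
                                  'southeast', 'northwest'],
                       'charges': [1700.0, 11000.0, 4500.0, 14000.0, 8000.0]})

by_charges = health.sort_values('charges', ascending=False)  # highest charges first
top_bmi = health.nlargest(2, 'bmi')               # rows with the 2 largest bmi
low_bmi = health.nsmallest(2, 'bmi')              # rows with the 2 smallest bmi
region_counts = health['region'].value_counts()   # entries per region
numeric_summary = health.describe()               # numeric columns only
full_summary = health.describe(include='all')     # every column, including text

print(region_counts)
print(numeric_summary)
```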

CHALLENGE

With your neighbor, find another .csv file (from our repo here or by searching online) and create a DataFrame from it. Check that the data is formatted as you expect and whether there are any datatype or missing-value issues. Use three other pandas methods (feel free to consult the cheat sheet here) to explore your data.

#load in your data
#look at the info
#first few rows of data
#method 1
#method 2
#method 3
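As a starting point for the datatype and missing-value checks, the sketch below shows one possible approach on a made-up in-memory CSV. The column names and values are hypothetical, and your own file will need its own checks.

```python
import io
import pandas as pd

# Hypothetical data with one missing score and one unparseable date
csv_text = """name,score,joined
a,10,2021-01-05
b,,2021-02-11
c,7,not recorded
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)          # datatype pandas inferred for each column

missing = df.isna().sum() # number of missing values per column
print(missing)

# Dates are read as plain text by default; errors='coerce' turns
# entries that cannot be parsed into NaT (a missing timestamp)
df['joined'] = pd.to_datetime(df['joined'], errors='coerce')
print(df['joined'].isna().sum())
```

A missing entry in an otherwise integer column makes pandas read that column as floats, which is one common source of unexpected datatypes.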

CHALLENGE

With your neighbor, identify an industry or topic of interest. Try to find a structured data source for it online. Were you able to find a .csv file or something different? What challenges do you anticipate in turning this data into a DataFrame?