Introduction to pandas#

#import and alias
import pandas as pd

Creating a DataFrame#

#create DataFrame from dictionary
data = {'CPI': [10, 11, 9],
        'GDP': [8, 3, 7],
        'Year': [2020, 2021, 2022]}
df = pd.DataFrame(data)
df
CPI GDP Year
0 10 8 2020
1 11 3 2021
2 9 7 2022

Head over to our course repository here and download the .csv file for the Ames Housing data. Upload this to colab or your local jupyter lab instance.

#create DataFrame from a local .csv file
filepath = 'data/Ames_Housing_Sales.csv'
filepath = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/Ames_Housing_Sales.csv'
ames = pd.read_csv(filepath)
ames
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold SalePrice
0 856.0 854.0 0.0 NaN 3 1Fam TA No 706.0 0.0 ... 0.0 Pave 8 856.0 AllPub 0.0 2003 2003 2008 208500.0
1 1262.0 0.0 0.0 NaN 3 1Fam TA Gd 978.0 0.0 ... 0.0 Pave 6 1262.0 AllPub 298.0 1976 1976 2007 181500.0
2 920.0 866.0 0.0 NaN 3 1Fam TA Mn 486.0 0.0 ... 0.0 Pave 6 920.0 AllPub 0.0 2001 2002 2008 223500.0
3 961.0 756.0 0.0 NaN 3 1Fam Gd No 216.0 0.0 ... 0.0 Pave 7 756.0 AllPub 0.0 1915 1970 2006 140000.0
4 1145.0 1053.0 0.0 NaN 4 1Fam TA Av 655.0 0.0 ... 0.0 Pave 9 1145.0 AllPub 192.0 2000 2000 2008 250000.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1374 953.0 694.0 0.0 NaN 3 1Fam NaN NaN 0.0 0.0 ... 0.0 Pave 7 953.0 AllPub 0.0 1999 2000 2007 175000.0
1375 2073.0 0.0 0.0 NaN 3 1Fam TA No 790.0 163.0 ... 0.0 Pave 7 1542.0 AllPub 349.0 1978 1988 2010 210000.0
1376 1188.0 1152.0 0.0 NaN 4 1Fam Gd No 275.0 0.0 ... 0.0 Pave 9 1152.0 AllPub 0.0 1941 2006 2010 266500.0
1377 1078.0 0.0 0.0 NaN 2 1Fam TA Mn 49.0 1029.0 ... 0.0 Pave 5 1078.0 AllPub 366.0 1950 1996 2010 142125.0
1378 1256.0 0.0 0.0 NaN 3 1Fam TA No 830.0 290.0 ... 0.0 Pave 6 1256.0 AllPub 736.0 1965 1965 2008 147500.0

1379 rows × 80 columns

#look at the first five rows
ames.head()
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold SalePrice
0 856.0 854.0 0.0 NaN 3 1Fam TA No 706.0 0.0 ... 0.0 Pave 8 856.0 AllPub 0.0 2003 2003 2008 208500.0
1 1262.0 0.0 0.0 NaN 3 1Fam TA Gd 978.0 0.0 ... 0.0 Pave 6 1262.0 AllPub 298.0 1976 1976 2007 181500.0
2 920.0 866.0 0.0 NaN 3 1Fam TA Mn 486.0 0.0 ... 0.0 Pave 6 920.0 AllPub 0.0 2001 2002 2008 223500.0
3 961.0 756.0 0.0 NaN 3 1Fam Gd No 216.0 0.0 ... 0.0 Pave 7 756.0 AllPub 0.0 1915 1970 2006 140000.0
4 1145.0 1053.0 0.0 NaN 4 1Fam TA Av 655.0 0.0 ... 0.0 Pave 9 1145.0 AllPub 192.0 2000 2000 2008 250000.0

5 rows × 80 columns

#check out the shape of the data
ames.shape
(1379, 80)
#what are the column names?
ames.columns
Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley', 'BedroomAbvGr',
       'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath',
       'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1', 'Condition2',
       'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual', 'Exterior1st',
       'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces', 'Foundation',
       'FullBath', 'Functional', 'GarageArea', 'GarageCars', 'GarageCond',
       'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt', 'GrLivArea',
       'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle', 'KitchenAbvGr',
       'KitchenQual', 'LandContour', 'LandSlope', 'LotArea', 'LotConfig',
       'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass', 'MSZoning',
       'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal', 'MoSold',
       'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual',
       'PavedDrive', 'PoolArea', 'PoolQC', 'RoofMatl', 'RoofStyle',
       'SaleCondition', 'SaleType', 'ScreenPorch', 'Street', 'TotRmsAbvGrd',
       'TotalBsmtSF', 'Utilities', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd',
       'YrSold', 'SalePrice'],
      dtype='object')
#what are the row names?
ames.index
RangeIndex(start=0, stop=1379, step=1)
#all above in one place with .info()
ames.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1379 entries, 0 to 1378
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1379 non-null   float64
 1   2ndFlrSF       1379 non-null   float64
 2   3SsnPorch      1379 non-null   float64
 3   Alley          82 non-null     object 
 4   BedroomAbvGr   1379 non-null   int64  
 5   BldgType       1379 non-null   object 
 6   BsmtCond       953 non-null    object 
 7   BsmtExposure   953 non-null    object 
 8   BsmtFinSF1     1379 non-null   float64
 9   BsmtFinSF2     1379 non-null   float64
 10  BsmtFinType1   953 non-null    object 
 11  BsmtFinType2   952 non-null    object 
 12  BsmtFullBath   1379 non-null   int64  
 13  BsmtHalfBath   1379 non-null   int64  
 14  BsmtQual       953 non-null    object 
 15  BsmtUnfSF      1379 non-null   float64
 16  CentralAir     1379 non-null   object 
 17  Condition1     1379 non-null   object 
 18  Condition2     1379 non-null   object 
 19  Electrical     1379 non-null   object 
 20  EnclosedPorch  1379 non-null   float64
 21  ExterCond      1379 non-null   object 
 22  ExterQual      1379 non-null   object 
 23  Exterior1st    1379 non-null   object 
 24  Exterior2nd    1379 non-null   object 
 25  Fence          265 non-null    object 
 26  FireplaceQu    761 non-null    object 
 27  Fireplaces     1379 non-null   int64  
 28  Foundation     1379 non-null   object 
 29  FullBath       1379 non-null   int64  
 30  Functional     1379 non-null   object 
 31  GarageArea     1379 non-null   float64
 32  GarageCars     1379 non-null   int64  
 33  GarageCond     1379 non-null   object 
 34  GarageFinish   1379 non-null   object 
 35  GarageQual     1379 non-null   object 
 36  GarageType     1379 non-null   object 
 37  GarageYrBlt    1379 non-null   float64
 38  GrLivArea      1379 non-null   float64
 39  HalfBath       1379 non-null   int64  
 40  Heating        1379 non-null   object 
 41  HeatingQC      1379 non-null   object 
 42  HouseStyle     1379 non-null   object 
 43  KitchenAbvGr   1379 non-null   int64  
 44  KitchenQual    1379 non-null   object 
 45  LandContour    1379 non-null   object 
 46  LandSlope      1379 non-null   object 
 47  LotArea        1379 non-null   float64
 48  LotConfig      1379 non-null   object 
 49  LotFrontage    1379 non-null   float64
 50  LotShape       1379 non-null   object 
 51  LowQualFinSF   1379 non-null   float64
 52  MSSubClass     1379 non-null   int64  
 53  MSZoning       1379 non-null   object 
 54  MasVnrArea     1379 non-null   float64
 55  MasVnrType     582 non-null    object 
 56  MiscFeature    51 non-null     object 
 57  MiscVal        1379 non-null   float64
 58  MoSold         1379 non-null   int64  
 59  Neighborhood   1379 non-null   object 
 60  OpenPorchSF    1379 non-null   float64
 61  OverallCond    1379 non-null   int64  
 62  OverallQual    1379 non-null   int64  
 63  PavedDrive     1379 non-null   object 
 64  PoolArea       1379 non-null   float64
 65  PoolQC         7 non-null      object 
 66  RoofMatl       1379 non-null   object 
 67  RoofStyle      1379 non-null   object 
 68  SaleCondition  1379 non-null   object 
 69  SaleType       1379 non-null   object 
 70  ScreenPorch    1379 non-null   float64
 71  Street         1379 non-null   object 
 72  TotRmsAbvGrd   1379 non-null   int64  
 73  TotalBsmtSF    1379 non-null   float64
 74  Utilities      1379 non-null   object 
 75  WoodDeckSF     1379 non-null   float64
 76  YearBuilt      1379 non-null   int64  
 77  YearRemodAdd   1379 non-null   int64  
 78  YrSold         1379 non-null   int64  
 79  SalePrice      1379 non-null   float64
dtypes: float64(21), int64(16), object(43)
memory usage: 862.0+ KB

Turns out we can even just use the url to a .csv file as the filepath to create a DataFrame with. In our repo, let’s look at the “raw” data by selecting the raw button for the Ames data.

#url to .csv raw data
url = ''
#use read_csv to read url to dataframe
filepath = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/Ames_Housing_Sales.csv'
ames = pd.read_csv(filepath)
#look at first 10 rows
ames.head(10)
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold SalePrice
0 856.0 854.0 0.0 NaN 3 1Fam TA No 706.0 0.0 ... 0.0 Pave 8 856.0 AllPub 0.0 2003 2003 2008 208500.0
1 1262.0 0.0 0.0 NaN 3 1Fam TA Gd 978.0 0.0 ... 0.0 Pave 6 1262.0 AllPub 298.0 1976 1976 2007 181500.0
2 920.0 866.0 0.0 NaN 3 1Fam TA Mn 486.0 0.0 ... 0.0 Pave 6 920.0 AllPub 0.0 2001 2002 2008 223500.0
3 961.0 756.0 0.0 NaN 3 1Fam Gd No 216.0 0.0 ... 0.0 Pave 7 756.0 AllPub 0.0 1915 1970 2006 140000.0
4 1145.0 1053.0 0.0 NaN 4 1Fam TA Av 655.0 0.0 ... 0.0 Pave 9 1145.0 AllPub 192.0 2000 2000 2008 250000.0
5 796.0 566.0 320.0 NaN 1 1Fam TA No 732.0 0.0 ... 0.0 Pave 5 796.0 AllPub 40.0 1993 1995 2009 143000.0
6 1694.0 0.0 0.0 NaN 3 1Fam TA Av 1369.0 0.0 ... 0.0 Pave 7 1686.0 AllPub 255.0 2004 2005 2007 307000.0
7 1107.0 983.0 0.0 NaN 3 1Fam TA Mn 859.0 32.0 ... 0.0 Pave 7 1107.0 AllPub 235.0 1973 1973 2009 200000.0
8 1022.0 752.0 0.0 NaN 2 1Fam NaN NaN 0.0 0.0 ... 0.0 Pave 8 952.0 AllPub 90.0 1931 1950 2008 129900.0
9 1077.0 0.0 0.0 NaN 2 2fmCon TA No 851.0 0.0 ... 0.0 Pave 5 991.0 AllPub 0.0 1939 1950 2008 118000.0

10 rows × 80 columns

#look over info
ames.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1379 entries, 0 to 1378
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1379 non-null   float64
 1   2ndFlrSF       1379 non-null   float64
 2   3SsnPorch      1379 non-null   float64
 3   Alley          82 non-null     object 
 4   BedroomAbvGr   1379 non-null   int64  
 5   BldgType       1379 non-null   object 
 6   BsmtCond       953 non-null    object 
 7   BsmtExposure   953 non-null    object 
 8   BsmtFinSF1     1379 non-null   float64
 9   BsmtFinSF2     1379 non-null   float64
 10  BsmtFinType1   953 non-null    object 
 11  BsmtFinType2   952 non-null    object 
 12  BsmtFullBath   1379 non-null   int64  
 13  BsmtHalfBath   1379 non-null   int64  
 14  BsmtQual       953 non-null    object 
 15  BsmtUnfSF      1379 non-null   float64
 16  CentralAir     1379 non-null   object 
 17  Condition1     1379 non-null   object 
 18  Condition2     1379 non-null   object 
 19  Electrical     1379 non-null   object 
 20  EnclosedPorch  1379 non-null   float64
 21  ExterCond      1379 non-null   object 
 22  ExterQual      1379 non-null   object 
 23  Exterior1st    1379 non-null   object 
 24  Exterior2nd    1379 non-null   object 
 25  Fence          265 non-null    object 
 26  FireplaceQu    761 non-null    object 
 27  Fireplaces     1379 non-null   int64  
 28  Foundation     1379 non-null   object 
 29  FullBath       1379 non-null   int64  
 30  Functional     1379 non-null   object 
 31  GarageArea     1379 non-null   float64
 32  GarageCars     1379 non-null   int64  
 33  GarageCond     1379 non-null   object 
 34  GarageFinish   1379 non-null   object 
 35  GarageQual     1379 non-null   object 
 36  GarageType     1379 non-null   object 
 37  GarageYrBlt    1379 non-null   float64
 38  GrLivArea      1379 non-null   float64
 39  HalfBath       1379 non-null   int64  
 40  Heating        1379 non-null   object 
 41  HeatingQC      1379 non-null   object 
 42  HouseStyle     1379 non-null   object 
 43  KitchenAbvGr   1379 non-null   int64  
 44  KitchenQual    1379 non-null   object 
 45  LandContour    1379 non-null   object 
 46  LandSlope      1379 non-null   object 
 47  LotArea        1379 non-null   float64
 48  LotConfig      1379 non-null   object 
 49  LotFrontage    1379 non-null   float64
 50  LotShape       1379 non-null   object 
 51  LowQualFinSF   1379 non-null   float64
 52  MSSubClass     1379 non-null   int64  
 53  MSZoning       1379 non-null   object 
 54  MasVnrArea     1379 non-null   float64
 55  MasVnrType     582 non-null    object 
 56  MiscFeature    51 non-null     object 
 57  MiscVal        1379 non-null   float64
 58  MoSold         1379 non-null   int64  
 59  Neighborhood   1379 non-null   object 
 60  OpenPorchSF    1379 non-null   float64
 61  OverallCond    1379 non-null   int64  
 62  OverallQual    1379 non-null   int64  
 63  PavedDrive     1379 non-null   object 
 64  PoolArea       1379 non-null   float64
 65  PoolQC         7 non-null      object 
 66  RoofMatl       1379 non-null   object 
 67  RoofStyle      1379 non-null   object 
 68  SaleCondition  1379 non-null   object 
 69  SaleType       1379 non-null   object 
 70  ScreenPorch    1379 non-null   float64
 71  Street         1379 non-null   object 
 72  TotRmsAbvGrd   1379 non-null   int64  
 73  TotalBsmtSF    1379 non-null   float64
 74  Utilities      1379 non-null   object 
 75  WoodDeckSF     1379 non-null   float64
 76  YearBuilt      1379 non-null   int64  
 77  YearRemodAdd   1379 non-null   int64  
 78  YrSold         1379 non-null   int64  
 79  SalePrice      1379 non-null   float64
dtypes: float64(21), int64(16), object(43)
memory usage: 862.0+ KB

Selecting Rows and Columns#

Both the rows and columns can be referred to based on their index or their names. For index selection, we will use the .iloc function and for names we use the .loc method.

#select one column
ames['Alley']
0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
       ... 
1374    NaN
1375    NaN
1376    NaN
1377    NaN
1378    NaN
Name: Alley, Length: 1379, dtype: object
#select two columns
ames[['Alley', 'SaleType']]
Alley SaleType
0 NaN WD
1 NaN WD
2 NaN WD
3 NaN WD
4 NaN WD
... ... ...
1374 NaN WD
1375 NaN WD
1376 NaN WD
1377 NaN WD
1378 NaN WD

1379 rows × 2 columns

#select first row
ames.iloc[0]
1stFlrSF           856.0
2ndFlrSF           854.0
3SsnPorch            0.0
Alley                NaN
BedroomAbvGr           3
                  ...   
WoodDeckSF           0.0
YearBuilt           2003
YearRemodAdd        2003
YrSold              2008
SalePrice       208500.0
Name: 0, Length: 80, dtype: object
#select first three rows and three columns
ames.iloc[:3]
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold SalePrice
0 856.0 854.0 0.0 NaN 3 1Fam TA No 706.0 0.0 ... 0.0 Pave 8 856.0 AllPub 0.0 2003 2003 2008 208500.0
1 1262.0 0.0 0.0 NaN 3 1Fam TA Gd 978.0 0.0 ... 0.0 Pave 6 1262.0 AllPub 298.0 1976 1976 2007 181500.0
2 920.0 866.0 0.0 NaN 3 1Fam TA Mn 486.0 0.0 ... 0.0 Pave 6 920.0 AllPub 0.0 2001 2002 2008 223500.0

3 rows × 80 columns

#select the Alley column
ames.loc[:, 'Alley']
0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
       ... 
1374    NaN
1375    NaN
1376    NaN
1377    NaN
1378    NaN
Name: Alley, Length: 1379, dtype: object
#select the Alley and BsmtCond columns
ames.loc[:, ['Alley', 'BsmtCond']]
Alley BsmtCond
0 NaN TA
1 NaN TA
2 NaN TA
3 NaN Gd
4 NaN TA
... ... ...
1374 NaN NaN
1375 NaN TA
1376 NaN Gd
1377 NaN TA
1378 NaN TA

1379 rows × 2 columns

Selections based on conditions#

Often we are interested in subsets of the data that satisfy specific criteria. We can use the .loc method, the more dictionary like syntax, as well as the new .query method in pandas.

df[conditional statement]
#or
df.loc[conditional statement]
#or
df.query(conditional statement)
#houses with a 3 car garage -- GarageCars feature
ames.loc[ames['GarageCars'] == 3]
ames.query('GarageCars == 3')
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold SalePrice
3 961.0 756.0 0.0 NaN 3 1Fam Gd No 216.0 0.0 ... 0.0 Pave 7 756.0 AllPub 0.0 1915 1970 2006 140000.0
4 1145.0 1053.0 0.0 NaN 4 1Fam TA Av 655.0 0.0 ... 0.0 Pave 9 1145.0 AllPub 192.0 2000 2000 2008 250000.0
11 1182.0 1142.0 0.0 NaN 4 1Fam TA No 998.0 0.0 ... 0.0 Pave 11 1175.0 AllPub 147.0 2005 2006 2006 345000.0
13 1494.0 0.0 0.0 NaN 3 1Fam NaN NaN 0.0 0.0 ... 0.0 Pave 7 1494.0 AllPub 160.0 2006 2007 2007 279500.0
20 1158.0 1218.0 0.0 NaN 4 1Fam NaN NaN 0.0 0.0 ... 0.0 Pave 9 1158.0 AllPub 240.0 2005 2006 2006 325300.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1337 1569.0 0.0 0.0 NaN 1 TwnhsE TA No 988.0 0.0 ... 0.0 Pave 7 1386.0 AllPub 143.0 2007 2007 2009 175900.0
1339 1310.0 1140.0 0.0 NaN 3 1Fam TA Mn 781.0 0.0 ... 0.0 Pave 7 1284.0 AllPub 0.0 1998 1998 2009 340000.0
1359 1932.0 0.0 304.0 NaN 2 1Fam TA Av 1336.0 0.0 ... 0.0 Pave 7 1932.0 AllPub 0.0 2008 2008 2008 394617.0
1364 1026.0 981.0 0.0 NaN 3 1Fam TA Av 765.0 0.0 ... 0.0 Pave 10 1017.0 AllPub 168.0 2008 2008 2009 310000.0
1371 1578.0 0.0 0.0 NaN 3 1Fam NaN NaN 0.0 0.0 ... 0.0 Pave 7 1573.0 AllPub 0.0 2008 2009 2009 287090.0

181 rows × 80 columns

#using dictionary syntax
#using .loc
#using .query
#3 car garage and built after the year 2000 (YearBuilt feature)
conditions = (ames['GarageCars'] == 3) & (ames['YearBuilt'] > 2000)
ames.loc[conditions]
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold SalePrice
11 1182.0 1142.0 0.0 NaN 4 1Fam TA No 998.0 0.0 ... 0.0 Pave 11 1175.0 AllPub 147.0 2005 2006 2006 345000.0
13 1494.0 0.0 0.0 NaN 3 1Fam NaN NaN 0.0 0.0 ... 0.0 Pave 7 1494.0 AllPub 160.0 2006 2007 2007 279500.0
20 1158.0 1218.0 0.0 NaN 4 1Fam NaN NaN 0.0 0.0 ... 0.0 Pave 9 1158.0 AllPub 240.0 2005 2006 2006 325300.0
25 1600.0 0.0 0.0 NaN 3 1Fam NaN NaN 0.0 0.0 ... 0.0 Pave 7 1566.0 AllPub 0.0 2007 2007 2009 256300.0
27 1704.0 0.0 0.0 NaN 3 1Fam TA No 1218.0 0.0 ... 0.0 Pave 7 1704.0 AllPub 0.0 2007 2008 2010 306000.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1326 1464.0 0.0 0.0 NaN 3 1Fam TA Gd 929.0 0.0 ... 0.0 Pave 6 1485.0 AllPub 168.0 2007 2007 2007 282922.0
1337 1569.0 0.0 0.0 NaN 1 TwnhsE TA No 988.0 0.0 ... 0.0 Pave 7 1386.0 AllPub 143.0 2007 2007 2009 175900.0
1359 1932.0 0.0 304.0 NaN 2 1Fam TA Av 1336.0 0.0 ... 0.0 Pave 7 1932.0 AllPub 0.0 2008 2008 2008 394617.0
1364 1026.0 981.0 0.0 NaN 3 1Fam TA Av 765.0 0.0 ... 0.0 Pave 10 1017.0 AllPub 168.0 2008 2008 2009 310000.0
1371 1578.0 0.0 0.0 NaN 3 1Fam NaN NaN 0.0 0.0 ... 0.0 Pave 7 1573.0 AllPub 0.0 2008 2009 2009 287090.0

130 rows × 80 columns

#average price of houses with 3 car garage built in 2000's?
print(ames.loc[conditions]['SalePrice'].mean())
317667.8461538461

CHALLENGE

Head back to our repository and create a new DataFrame named health from the health insurance charges data.

  1. Did people over 50 have different average costs than those under 50 years of age?

health = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/health_insurance_charges.csv', index_col=0)
health.head()
age sex bmi children smoker region charges
1 18.0 male 33.770 1.0 no southeast 1725.55230
2 28.0 NaN 33.000 3.0 no southeast 4449.46200
3 33.0 male 22.705 0.0 no northwest 21984.47061
5 31.0 female NaN 0.0 no southeast 3756.62160
6 46.0 female 33.440 1.0 no southeast 8240.58960
over_50 = health.loc[health['age'] > 50]
under_50 = health.loc[health['age'] < 50]
over_50 #just a DataFrame --> select the column and find the average
age sex bmi children smoker region charges
9 60.0 female 25.84 0.0 no northwest 28923.13692
11 62.0 female 26.29 0.0 NaN southeast 27808.72510
13 56.0 female 39.82 0.0 no southeast 11090.71780
16 52.0 female 30.78 1.0 no northeast 10797.33620
18 56.0 male 40.30 0.0 NaN NaN 10602.38500
... ... ... ... ... ... ... ...
1327 51.0 male 30.03 1.0 no southeast 9377.90470
1329 52.0 male NaN 2.0 no southwest 10325.20600
1330 57.0 female 25.74 2.0 NaN southeast 12629.16560
1332 52.0 NaN 44.70 3.0 no southwest 11411.68500
1337 61.0 female 29.07 0.0 yes northwest 29141.36030

293 rows × 7 columns

#over 50 charges
over_50['charges'].mean()
18590.903296723547
#under 50 charges
under_50['charges'].mean()
11266.310656624202
  1. What is the average bmi of all the male observations?

condition = health['sex'] == 'male'
health.loc[condition, 'bmi'].mean()
31.150359281437126
  1. Did people in the southeast spend more or less on average than those in the northeast?

#southeast average charges
health.loc[(health['region'] == 'southeast'), 'charges'].mean()
14444.922324256757
#northeast average charges
condition = (health['region'] == 'northeast')
column = 'charges'
health.loc[condition, column].mean()
12977.084407629629

Sort and Summarize#

In addition to the .mean method, there are a variety of helpful sorting, selecting, and aggregating functions built in to the DataFrame. We explore these below.

  • .nlargest

  • .nsmallest

  • .sort_values

  • .value_counts

  • .describe

#sort the data
#find the 10 largest bmi
#how many entries for each region?
#describe the numeric data
#describe all the data

CHALLENGE

With your neighbor, find another .csv file (from our repo here or from googling around) and create a DataFrame from this. Check that the data is formatted as you expect and if there are any datatype or missing value issues. Use three other pandas methods (feel free to consult the cheat sheet here) to explore your data.

#load in your data
#look at the info
#first few rows of data
#method 1
#method 2
#method 3

CHALLENGE

With your neighbor, identify an industry or topic of interest. Try to find a structured data source for this online. Were you able to find a .csv file format or something different? What are some challenges you feel exist about turning this data into a DataFrame?