Introduction to pandas#
#import and alias
import pandas as pd
Creating a DataFrame#
#create DataFrame from dictionary
data = {'CPI': [10, 11, 9],
'GDP': [8, 3, 7],
'Year': [2020, 2021, 2022]}
df = pd.DataFrame(data)
df
| CPI | GDP | Year | |
|---|---|---|---|
| 0 | 10 | 8 | 2020 |
| 1 | 11 | 3 | 2021 |
| 2 | 9 | 7 | 2022 |
Head over to our course repository here and download the .csv file for the Ames Housing data. Upload this to colab or your local jupyter lab instance.
#create DataFrame from a local .csv file
filepath = 'data/Ames_Housing_Sales.csv'
filepath = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/Ames_Housing_Sales.csv'
ames = pd.read_csv(filepath)
ames
| 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856.0 | 854.0 | 0.0 | NaN | 3 | 1Fam | TA | No | 706.0 | 0.0 | ... | 0.0 | Pave | 8 | 856.0 | AllPub | 0.0 | 2003 | 2003 | 2008 | 208500.0 |
| 1 | 1262.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | TA | Gd | 978.0 | 0.0 | ... | 0.0 | Pave | 6 | 1262.0 | AllPub | 298.0 | 1976 | 1976 | 2007 | 181500.0 |
| 2 | 920.0 | 866.0 | 0.0 | NaN | 3 | 1Fam | TA | Mn | 486.0 | 0.0 | ... | 0.0 | Pave | 6 | 920.0 | AllPub | 0.0 | 2001 | 2002 | 2008 | 223500.0 |
| 3 | 961.0 | 756.0 | 0.0 | NaN | 3 | 1Fam | Gd | No | 216.0 | 0.0 | ... | 0.0 | Pave | 7 | 756.0 | AllPub | 0.0 | 1915 | 1970 | 2006 | 140000.0 |
| 4 | 1145.0 | 1053.0 | 0.0 | NaN | 4 | 1Fam | TA | Av | 655.0 | 0.0 | ... | 0.0 | Pave | 9 | 1145.0 | AllPub | 192.0 | 2000 | 2000 | 2008 | 250000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1374 | 953.0 | 694.0 | 0.0 | NaN | 3 | 1Fam | NaN | NaN | 0.0 | 0.0 | ... | 0.0 | Pave | 7 | 953.0 | AllPub | 0.0 | 1999 | 2000 | 2007 | 175000.0 |
| 1375 | 2073.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | TA | No | 790.0 | 163.0 | ... | 0.0 | Pave | 7 | 1542.0 | AllPub | 349.0 | 1978 | 1988 | 2010 | 210000.0 |
| 1376 | 1188.0 | 1152.0 | 0.0 | NaN | 4 | 1Fam | Gd | No | 275.0 | 0.0 | ... | 0.0 | Pave | 9 | 1152.0 | AllPub | 0.0 | 1941 | 2006 | 2010 | 266500.0 |
| 1377 | 1078.0 | 0.0 | 0.0 | NaN | 2 | 1Fam | TA | Mn | 49.0 | 1029.0 | ... | 0.0 | Pave | 5 | 1078.0 | AllPub | 366.0 | 1950 | 1996 | 2010 | 142125.0 |
| 1378 | 1256.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | TA | No | 830.0 | 290.0 | ... | 0.0 | Pave | 6 | 1256.0 | AllPub | 736.0 | 1965 | 1965 | 2008 | 147500.0 |
1379 rows × 80 columns
#look at the first five rows
ames.head()
| 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856.0 | 854.0 | 0.0 | NaN | 3 | 1Fam | TA | No | 706.0 | 0.0 | ... | 0.0 | Pave | 8 | 856.0 | AllPub | 0.0 | 2003 | 2003 | 2008 | 208500.0 |
| 1 | 1262.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | TA | Gd | 978.0 | 0.0 | ... | 0.0 | Pave | 6 | 1262.0 | AllPub | 298.0 | 1976 | 1976 | 2007 | 181500.0 |
| 2 | 920.0 | 866.0 | 0.0 | NaN | 3 | 1Fam | TA | Mn | 486.0 | 0.0 | ... | 0.0 | Pave | 6 | 920.0 | AllPub | 0.0 | 2001 | 2002 | 2008 | 223500.0 |
| 3 | 961.0 | 756.0 | 0.0 | NaN | 3 | 1Fam | Gd | No | 216.0 | 0.0 | ... | 0.0 | Pave | 7 | 756.0 | AllPub | 0.0 | 1915 | 1970 | 2006 | 140000.0 |
| 4 | 1145.0 | 1053.0 | 0.0 | NaN | 4 | 1Fam | TA | Av | 655.0 | 0.0 | ... | 0.0 | Pave | 9 | 1145.0 | AllPub | 192.0 | 2000 | 2000 | 2008 | 250000.0 |
5 rows × 80 columns
#check out the shape of the data
ames.shape
(1379, 80)
#what are the column names?
ames.columns
Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley', 'BedroomAbvGr',
'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2',
'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath',
'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1', 'Condition2',
'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual', 'Exterior1st',
'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces', 'Foundation',
'FullBath', 'Functional', 'GarageArea', 'GarageCars', 'GarageCond',
'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt', 'GrLivArea',
'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle', 'KitchenAbvGr',
'KitchenQual', 'LandContour', 'LandSlope', 'LotArea', 'LotConfig',
'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass', 'MSZoning',
'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal', 'MoSold',
'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual',
'PavedDrive', 'PoolArea', 'PoolQC', 'RoofMatl', 'RoofStyle',
'SaleCondition', 'SaleType', 'ScreenPorch', 'Street', 'TotRmsAbvGrd',
'TotalBsmtSF', 'Utilities', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd',
'YrSold', 'SalePrice'],
dtype='object')
#what are the row names?
ames.index
RangeIndex(start=0, stop=1379, step=1)
#all above in one place with .info()
ames.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1379 entries, 0 to 1378
Data columns (total 80 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1stFlrSF 1379 non-null float64
1 2ndFlrSF 1379 non-null float64
2 3SsnPorch 1379 non-null float64
3 Alley 82 non-null object
4 BedroomAbvGr 1379 non-null int64
5 BldgType 1379 non-null object
6 BsmtCond 953 non-null object
7 BsmtExposure 953 non-null object
8 BsmtFinSF1 1379 non-null float64
9 BsmtFinSF2 1379 non-null float64
10 BsmtFinType1 953 non-null object
11 BsmtFinType2 952 non-null object
12 BsmtFullBath 1379 non-null int64
13 BsmtHalfBath 1379 non-null int64
14 BsmtQual 953 non-null object
15 BsmtUnfSF 1379 non-null float64
16 CentralAir 1379 non-null object
17 Condition1 1379 non-null object
18 Condition2 1379 non-null object
19 Electrical 1379 non-null object
20 EnclosedPorch 1379 non-null float64
21 ExterCond 1379 non-null object
22 ExterQual 1379 non-null object
23 Exterior1st 1379 non-null object
24 Exterior2nd 1379 non-null object
25 Fence 265 non-null object
26 FireplaceQu 761 non-null object
27 Fireplaces 1379 non-null int64
28 Foundation 1379 non-null object
29 FullBath 1379 non-null int64
30 Functional 1379 non-null object
31 GarageArea 1379 non-null float64
32 GarageCars 1379 non-null int64
33 GarageCond 1379 non-null object
34 GarageFinish 1379 non-null object
35 GarageQual 1379 non-null object
36 GarageType 1379 non-null object
37 GarageYrBlt 1379 non-null float64
38 GrLivArea 1379 non-null float64
39 HalfBath 1379 non-null int64
40 Heating 1379 non-null object
41 HeatingQC 1379 non-null object
42 HouseStyle 1379 non-null object
43 KitchenAbvGr 1379 non-null int64
44 KitchenQual 1379 non-null object
45 LandContour 1379 non-null object
46 LandSlope 1379 non-null object
47 LotArea 1379 non-null float64
48 LotConfig 1379 non-null object
49 LotFrontage 1379 non-null float64
50 LotShape 1379 non-null object
51 LowQualFinSF 1379 non-null float64
52 MSSubClass 1379 non-null int64
53 MSZoning 1379 non-null object
54 MasVnrArea 1379 non-null float64
55 MasVnrType 582 non-null object
56 MiscFeature 51 non-null object
57 MiscVal 1379 non-null float64
58 MoSold 1379 non-null int64
59 Neighborhood 1379 non-null object
60 OpenPorchSF 1379 non-null float64
61 OverallCond 1379 non-null int64
62 OverallQual 1379 non-null int64
63 PavedDrive 1379 non-null object
64 PoolArea 1379 non-null float64
65 PoolQC 7 non-null object
66 RoofMatl 1379 non-null object
67 RoofStyle 1379 non-null object
68 SaleCondition 1379 non-null object
69 SaleType 1379 non-null object
70 ScreenPorch 1379 non-null float64
71 Street 1379 non-null object
72 TotRmsAbvGrd 1379 non-null int64
73 TotalBsmtSF 1379 non-null float64
74 Utilities 1379 non-null object
75 WoodDeckSF 1379 non-null float64
76 YearBuilt 1379 non-null int64
77 YearRemodAdd 1379 non-null int64
78 YrSold 1379 non-null int64
79 SalePrice 1379 non-null float64
dtypes: float64(21), int64(16), object(43)
memory usage: 862.0+ KB
Turns out we can even just use the url to a .csv file as the filepath to create a DataFrame with. In our repo, let’s look at the “raw” data by selecting the raw button for the Ames data.
#url to .csv raw data
url = ''
#use read_csv to read url to dataframe
filepath = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/Ames_Housing_Sales.csv'
ames = pd.read_csv(filepath)
#look at first 10 rows
ames.head(10)
| 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856.0 | 854.0 | 0.0 | NaN | 3 | 1Fam | TA | No | 706.0 | 0.0 | ... | 0.0 | Pave | 8 | 856.0 | AllPub | 0.0 | 2003 | 2003 | 2008 | 208500.0 |
| 1 | 1262.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | TA | Gd | 978.0 | 0.0 | ... | 0.0 | Pave | 6 | 1262.0 | AllPub | 298.0 | 1976 | 1976 | 2007 | 181500.0 |
| 2 | 920.0 | 866.0 | 0.0 | NaN | 3 | 1Fam | TA | Mn | 486.0 | 0.0 | ... | 0.0 | Pave | 6 | 920.0 | AllPub | 0.0 | 2001 | 2002 | 2008 | 223500.0 |
| 3 | 961.0 | 756.0 | 0.0 | NaN | 3 | 1Fam | Gd | No | 216.0 | 0.0 | ... | 0.0 | Pave | 7 | 756.0 | AllPub | 0.0 | 1915 | 1970 | 2006 | 140000.0 |
| 4 | 1145.0 | 1053.0 | 0.0 | NaN | 4 | 1Fam | TA | Av | 655.0 | 0.0 | ... | 0.0 | Pave | 9 | 1145.0 | AllPub | 192.0 | 2000 | 2000 | 2008 | 250000.0 |
| 5 | 796.0 | 566.0 | 320.0 | NaN | 1 | 1Fam | TA | No | 732.0 | 0.0 | ... | 0.0 | Pave | 5 | 796.0 | AllPub | 40.0 | 1993 | 1995 | 2009 | 143000.0 |
| 6 | 1694.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | TA | Av | 1369.0 | 0.0 | ... | 0.0 | Pave | 7 | 1686.0 | AllPub | 255.0 | 2004 | 2005 | 2007 | 307000.0 |
| 7 | 1107.0 | 983.0 | 0.0 | NaN | 3 | 1Fam | TA | Mn | 859.0 | 32.0 | ... | 0.0 | Pave | 7 | 1107.0 | AllPub | 235.0 | 1973 | 1973 | 2009 | 200000.0 |
| 8 | 1022.0 | 752.0 | 0.0 | NaN | 2 | 1Fam | NaN | NaN | 0.0 | 0.0 | ... | 0.0 | Pave | 8 | 952.0 | AllPub | 90.0 | 1931 | 1950 | 2008 | 129900.0 |
| 9 | 1077.0 | 0.0 | 0.0 | NaN | 2 | 2fmCon | TA | No | 851.0 | 0.0 | ... | 0.0 | Pave | 5 | 991.0 | AllPub | 0.0 | 1939 | 1950 | 2008 | 118000.0 |
10 rows × 80 columns
#look over info
ames.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1379 entries, 0 to 1378
Data columns (total 80 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1stFlrSF 1379 non-null float64
1 2ndFlrSF 1379 non-null float64
2 3SsnPorch 1379 non-null float64
3 Alley 82 non-null object
4 BedroomAbvGr 1379 non-null int64
5 BldgType 1379 non-null object
6 BsmtCond 953 non-null object
7 BsmtExposure 953 non-null object
8 BsmtFinSF1 1379 non-null float64
9 BsmtFinSF2 1379 non-null float64
10 BsmtFinType1 953 non-null object
11 BsmtFinType2 952 non-null object
12 BsmtFullBath 1379 non-null int64
13 BsmtHalfBath 1379 non-null int64
14 BsmtQual 953 non-null object
15 BsmtUnfSF 1379 non-null float64
16 CentralAir 1379 non-null object
17 Condition1 1379 non-null object
18 Condition2 1379 non-null object
19 Electrical 1379 non-null object
20 EnclosedPorch 1379 non-null float64
21 ExterCond 1379 non-null object
22 ExterQual 1379 non-null object
23 Exterior1st 1379 non-null object
24 Exterior2nd 1379 non-null object
25 Fence 265 non-null object
26 FireplaceQu 761 non-null object
27 Fireplaces 1379 non-null int64
28 Foundation 1379 non-null object
29 FullBath 1379 non-null int64
30 Functional 1379 non-null object
31 GarageArea 1379 non-null float64
32 GarageCars 1379 non-null int64
33 GarageCond 1379 non-null object
34 GarageFinish 1379 non-null object
35 GarageQual 1379 non-null object
36 GarageType 1379 non-null object
37 GarageYrBlt 1379 non-null float64
38 GrLivArea 1379 non-null float64
39 HalfBath 1379 non-null int64
40 Heating 1379 non-null object
41 HeatingQC 1379 non-null object
42 HouseStyle 1379 non-null object
43 KitchenAbvGr 1379 non-null int64
44 KitchenQual 1379 non-null object
45 LandContour 1379 non-null object
46 LandSlope 1379 non-null object
47 LotArea 1379 non-null float64
48 LotConfig 1379 non-null object
49 LotFrontage 1379 non-null float64
50 LotShape 1379 non-null object
51 LowQualFinSF 1379 non-null float64
52 MSSubClass 1379 non-null int64
53 MSZoning 1379 non-null object
54 MasVnrArea 1379 non-null float64
55 MasVnrType 582 non-null object
56 MiscFeature 51 non-null object
57 MiscVal 1379 non-null float64
58 MoSold 1379 non-null int64
59 Neighborhood 1379 non-null object
60 OpenPorchSF 1379 non-null float64
61 OverallCond 1379 non-null int64
62 OverallQual 1379 non-null int64
63 PavedDrive 1379 non-null object
64 PoolArea 1379 non-null float64
65 PoolQC 7 non-null object
66 RoofMatl 1379 non-null object
67 RoofStyle 1379 non-null object
68 SaleCondition 1379 non-null object
69 SaleType 1379 non-null object
70 ScreenPorch 1379 non-null float64
71 Street 1379 non-null object
72 TotRmsAbvGrd 1379 non-null int64
73 TotalBsmtSF 1379 non-null float64
74 Utilities 1379 non-null object
75 WoodDeckSF 1379 non-null float64
76 YearBuilt 1379 non-null int64
77 YearRemodAdd 1379 non-null int64
78 YrSold 1379 non-null int64
79 SalePrice 1379 non-null float64
dtypes: float64(21), int64(16), object(43)
memory usage: 862.0+ KB
Selecting Rows and Columns#
Both the rows and columns can be referred to based on their index or their names. For index selection, we will use the .iloc function and for names we use the .loc method.
#select one column
ames['Alley']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
1374 NaN
1375 NaN
1376 NaN
1377 NaN
1378 NaN
Name: Alley, Length: 1379, dtype: object
#select two columns
ames[['Alley', 'SaleType']]
| Alley | SaleType | |
|---|---|---|
| 0 | NaN | WD |
| 1 | NaN | WD |
| 2 | NaN | WD |
| 3 | NaN | WD |
| 4 | NaN | WD |
| ... | ... | ... |
| 1374 | NaN | WD |
| 1375 | NaN | WD |
| 1376 | NaN | WD |
| 1377 | NaN | WD |
| 1378 | NaN | WD |
1379 rows × 2 columns
#select first row
ames.iloc[0]
1stFlrSF 856.0
2ndFlrSF 854.0
3SsnPorch 0.0
Alley NaN
BedroomAbvGr 3
...
WoodDeckSF 0.0
YearBuilt 2003
YearRemodAdd 2003
YrSold 2008
SalePrice 208500.0
Name: 0, Length: 80, dtype: object
#select first three rows and three columns
ames.iloc[:3]
| 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856.0 | 854.0 | 0.0 | NaN | 3 | 1Fam | TA | No | 706.0 | 0.0 | ... | 0.0 | Pave | 8 | 856.0 | AllPub | 0.0 | 2003 | 2003 | 2008 | 208500.0 |
| 1 | 1262.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | TA | Gd | 978.0 | 0.0 | ... | 0.0 | Pave | 6 | 1262.0 | AllPub | 298.0 | 1976 | 1976 | 2007 | 181500.0 |
| 2 | 920.0 | 866.0 | 0.0 | NaN | 3 | 1Fam | TA | Mn | 486.0 | 0.0 | ... | 0.0 | Pave | 6 | 920.0 | AllPub | 0.0 | 2001 | 2002 | 2008 | 223500.0 |
3 rows × 80 columns
#select the Alley column
ames.loc[:, 'Alley']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
1374 NaN
1375 NaN
1376 NaN
1377 NaN
1378 NaN
Name: Alley, Length: 1379, dtype: object
#select the Alley and BsmtCond columns
ames.loc[:, ['Alley', 'BsmtCond']]
| Alley | BsmtCond | |
|---|---|---|
| 0 | NaN | TA |
| 1 | NaN | TA |
| 2 | NaN | TA |
| 3 | NaN | Gd |
| 4 | NaN | TA |
| ... | ... | ... |
| 1374 | NaN | NaN |
| 1375 | NaN | TA |
| 1376 | NaN | Gd |
| 1377 | NaN | TA |
| 1378 | NaN | TA |
1379 rows × 2 columns
Selections based on conditions#
Often we are interested in subsets of the data that satisfy specific criteria. We can use the .loc method, the more dictionary like syntax, as well as the new .query method in pandas.
df[conditional statement]
#or
df.loc[conditional statement]
#or
df.query(conditional statement)
#houses with a 3 car garage -- GarageCars feature
ames.loc[ames['GarageCars'] == 3]
ames.query('GarageCars == 3')
| 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 961.0 | 756.0 | 0.0 | NaN | 3 | 1Fam | Gd | No | 216.0 | 0.0 | ... | 0.0 | Pave | 7 | 756.0 | AllPub | 0.0 | 1915 | 1970 | 2006 | 140000.0 |
| 4 | 1145.0 | 1053.0 | 0.0 | NaN | 4 | 1Fam | TA | Av | 655.0 | 0.0 | ... | 0.0 | Pave | 9 | 1145.0 | AllPub | 192.0 | 2000 | 2000 | 2008 | 250000.0 |
| 11 | 1182.0 | 1142.0 | 0.0 | NaN | 4 | 1Fam | TA | No | 998.0 | 0.0 | ... | 0.0 | Pave | 11 | 1175.0 | AllPub | 147.0 | 2005 | 2006 | 2006 | 345000.0 |
| 13 | 1494.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | NaN | NaN | 0.0 | 0.0 | ... | 0.0 | Pave | 7 | 1494.0 | AllPub | 160.0 | 2006 | 2007 | 2007 | 279500.0 |
| 20 | 1158.0 | 1218.0 | 0.0 | NaN | 4 | 1Fam | NaN | NaN | 0.0 | 0.0 | ... | 0.0 | Pave | 9 | 1158.0 | AllPub | 240.0 | 2005 | 2006 | 2006 | 325300.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1337 | 1569.0 | 0.0 | 0.0 | NaN | 1 | TwnhsE | TA | No | 988.0 | 0.0 | ... | 0.0 | Pave | 7 | 1386.0 | AllPub | 143.0 | 2007 | 2007 | 2009 | 175900.0 |
| 1339 | 1310.0 | 1140.0 | 0.0 | NaN | 3 | 1Fam | TA | Mn | 781.0 | 0.0 | ... | 0.0 | Pave | 7 | 1284.0 | AllPub | 0.0 | 1998 | 1998 | 2009 | 340000.0 |
| 1359 | 1932.0 | 0.0 | 304.0 | NaN | 2 | 1Fam | TA | Av | 1336.0 | 0.0 | ... | 0.0 | Pave | 7 | 1932.0 | AllPub | 0.0 | 2008 | 2008 | 2008 | 394617.0 |
| 1364 | 1026.0 | 981.0 | 0.0 | NaN | 3 | 1Fam | TA | Av | 765.0 | 0.0 | ... | 0.0 | Pave | 10 | 1017.0 | AllPub | 168.0 | 2008 | 2008 | 2009 | 310000.0 |
| 1371 | 1578.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | NaN | NaN | 0.0 | 0.0 | ... | 0.0 | Pave | 7 | 1573.0 | AllPub | 0.0 | 2008 | 2009 | 2009 | 287090.0 |
181 rows × 80 columns
#using dictionary syntax
#using .loc
#using .query
#3 car garage and built after the year 2000 (YearBuilt feature)
conditions = (ames['GarageCars'] == 3) & (ames['YearBuilt'] > 2000)
ames.loc[conditions]
| 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 1182.0 | 1142.0 | 0.0 | NaN | 4 | 1Fam | TA | No | 998.0 | 0.0 | ... | 0.0 | Pave | 11 | 1175.0 | AllPub | 147.0 | 2005 | 2006 | 2006 | 345000.0 |
| 13 | 1494.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | NaN | NaN | 0.0 | 0.0 | ... | 0.0 | Pave | 7 | 1494.0 | AllPub | 160.0 | 2006 | 2007 | 2007 | 279500.0 |
| 20 | 1158.0 | 1218.0 | 0.0 | NaN | 4 | 1Fam | NaN | NaN | 0.0 | 0.0 | ... | 0.0 | Pave | 9 | 1158.0 | AllPub | 240.0 | 2005 | 2006 | 2006 | 325300.0 |
| 25 | 1600.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | NaN | NaN | 0.0 | 0.0 | ... | 0.0 | Pave | 7 | 1566.0 | AllPub | 0.0 | 2007 | 2007 | 2009 | 256300.0 |
| 27 | 1704.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | TA | No | 1218.0 | 0.0 | ... | 0.0 | Pave | 7 | 1704.0 | AllPub | 0.0 | 2007 | 2008 | 2010 | 306000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1326 | 1464.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | TA | Gd | 929.0 | 0.0 | ... | 0.0 | Pave | 6 | 1485.0 | AllPub | 168.0 | 2007 | 2007 | 2007 | 282922.0 |
| 1337 | 1569.0 | 0.0 | 0.0 | NaN | 1 | TwnhsE | TA | No | 988.0 | 0.0 | ... | 0.0 | Pave | 7 | 1386.0 | AllPub | 143.0 | 2007 | 2007 | 2009 | 175900.0 |
| 1359 | 1932.0 | 0.0 | 304.0 | NaN | 2 | 1Fam | TA | Av | 1336.0 | 0.0 | ... | 0.0 | Pave | 7 | 1932.0 | AllPub | 0.0 | 2008 | 2008 | 2008 | 394617.0 |
| 1364 | 1026.0 | 981.0 | 0.0 | NaN | 3 | 1Fam | TA | Av | 765.0 | 0.0 | ... | 0.0 | Pave | 10 | 1017.0 | AllPub | 168.0 | 2008 | 2008 | 2009 | 310000.0 |
| 1371 | 1578.0 | 0.0 | 0.0 | NaN | 3 | 1Fam | NaN | NaN | 0.0 | 0.0 | ... | 0.0 | Pave | 7 | 1573.0 | AllPub | 0.0 | 2008 | 2009 | 2009 | 287090.0 |
130 rows × 80 columns
#average price of houses with 3 car garage built in 2000's?
print(ames.loc[conditions]['SalePrice'].mean())
317667.8461538461
CHALLENGE
Head back to our repository and create a new DataFrame named health from the health insurance charges data.
Did people over 50 have different average costs than those under 50 years of age?
health = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/health_insurance_charges.csv', index_col=0)
health.head()
| age | sex | bmi | children | smoker | region | charges | |
|---|---|---|---|---|---|---|---|
| 1 | 18.0 | male | 33.770 | 1.0 | no | southeast | 1725.55230 |
| 2 | 28.0 | NaN | 33.000 | 3.0 | no | southeast | 4449.46200 |
| 3 | 33.0 | male | 22.705 | 0.0 | no | northwest | 21984.47061 |
| 5 | 31.0 | female | NaN | 0.0 | no | southeast | 3756.62160 |
| 6 | 46.0 | female | 33.440 | 1.0 | no | southeast | 8240.58960 |
over_50 = health.loc[health['age'] > 50]
under_50 = health.loc[health['age'] < 50]
over_50 #just a DataFrame --> select the column and find the average
| age | sex | bmi | children | smoker | region | charges | |
|---|---|---|---|---|---|---|---|
| 9 | 60.0 | female | 25.84 | 0.0 | no | northwest | 28923.13692 |
| 11 | 62.0 | female | 26.29 | 0.0 | NaN | southeast | 27808.72510 |
| 13 | 56.0 | female | 39.82 | 0.0 | no | southeast | 11090.71780 |
| 16 | 52.0 | female | 30.78 | 1.0 | no | northeast | 10797.33620 |
| 18 | 56.0 | male | 40.30 | 0.0 | NaN | NaN | 10602.38500 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1327 | 51.0 | male | 30.03 | 1.0 | no | southeast | 9377.90470 |
| 1329 | 52.0 | male | NaN | 2.0 | no | southwest | 10325.20600 |
| 1330 | 57.0 | female | 25.74 | 2.0 | NaN | southeast | 12629.16560 |
| 1332 | 52.0 | NaN | 44.70 | 3.0 | no | southwest | 11411.68500 |
| 1337 | 61.0 | female | 29.07 | 0.0 | yes | northwest | 29141.36030 |
293 rows × 7 columns
#over 50 charges
over_50['charges'].mean()
18590.903296723547
#under 50 charges
under_50['charges'].mean()
11266.310656624202
What is the average
bmiof all the male observations?
condition = health['sex'] == 'male'
health.loc[condition, 'bmi'].mean()
31.150359281437126
Did people in the southeast spend more or less on average than those in the northeast?
#southeast average charges
health.loc[(health['region'] == 'southeast'), 'charges'].mean()
14444.922324256757
#northeast average charges
condition = (health['region'] == 'northeast')
column = 'charges'
health.loc[condition, column].mean()
12977.084407629629
Sort and Summarize#
In addition to the .mean method, there are a variety of helpful sorting, selecting, and aggregating functions built in to the DataFrame. We explore these below.
.nlargest.nsmallest.sort_values.value_counts.describe
#sort the data
#find the 10 largest bmi
#how many entries for each region?
#describe the numeric data
#describe all the data
CHALLENGE
With your neighbor, find another .csv file (from our repo here or from googling around) and create a DataFrame from this. Check that the data is formatted as you expect and if there are any datatype or missing value issues. Use three other pandas methods (feel free to consult the cheat sheet here) to explore your data.
#load in your data
#look at the info
#first few rows of data
#method 1
#method 2
#method 3
CHALLENGE
With your neighbor, identify an industry or topic of interest. Try to find a structured data source for this online. Were you able to find a .csv file format or something different? What are some challenges you feel exist about turning this data into a DataFrame?