Missing Values and Split-Apply-Combine#
This lesson reviews our pandas basics and extends them to more advanced munging and cleaning. Specifically, we will discuss how to load data files, handle missing values, use split-apply-combine, and work with string and datetime objects. By the end of the lesson you should feel confident doing basic exploratory data analysis with pandas.
OBJECTIVES

- Read local files in as DataFrame objects
- Drop missing values
- Replace missing values
- Impute missing values
- Use .groupby and split-apply-combine
- Explore basic plotting from a DataFrame
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Missing Values#
Missing values are a common problem in data, whether because the values are truly absent or because of a mismatch between how the data were encoded and how you read them in.
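A common culprit is a file that marks missing entries with a sentinel such as ? or -999; if you do not tell read_csv about it, those cells arrive as ordinary strings. A minimal sketch, assuming a hypothetical file name and sentinels:

# hypothetical file in which '?' and -999 mark missing entries
# na_values tells read_csv to convert those cells to NaN on load
df = pd.read_csv('survey.csv', na_values=['?', '-999'])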
ufo_url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/ufo.csv'
#create ufo dataframe
ufo = pd.read_csv(ufo_url)
ufo.head()
| | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 |
# examine ufo info
ufo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80543 entries, 0 to 80542
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 City 80492 non-null object
1 Colors Reported 17034 non-null object
2 Shape Reported 72141 non-null object
3 State 80543 non-null object
4 Time 80543 non-null object
dtypes: object(5)
memory usage: 3.1+ MB
# one-liner to count missing values
ufo.isna().sum()
City 51
Colors Reported 63509
Shape Reported 8402
State 0
Time 0
dtype: int64
# drop rows with a missing City value
ufo.dropna(subset='City')
| | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 |
| ... | ... | ... | ... | ... | ... |
| 80538 | Neligh | NaN | CIRCLE | NE | 9/4/2014 23:20 |
| 80539 | Uhrichsville | NaN | LIGHT | OH | 9/5/2014 1:14 |
| 80540 | Tucson | RED BLUE | NaN | AZ | 9/5/2014 2:40 |
| 80541 | Orland park | RED | LIGHT | IL | 9/5/2014 3:43 |
| 80542 | Loughman | NaN | LIGHT | FL | 9/5/2014 5:30 |
80492 rows × 5 columns
# dropna returns a copy; assign the result to keep it
ufo_nona = ufo.dropna()
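Note that dropna returns a new DataFrame rather than modifying ufo in place; comparing shapes confirms the original is untouched:

# the original frame keeps all rows; the copy drops any row with a missing value
print(ufo.shape)
print(ufo_nona.shape)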
# fill missing values
ufo.fillna('missing')
| | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | missing | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | missing | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | missing | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | missing | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair | missing | LIGHT | NY | 4/18/1933 19:00 |
| ... | ... | ... | ... | ... | ... |
| 80538 | Neligh | missing | CIRCLE | NE | 9/4/2014 23:20 |
| 80539 | Uhrichsville | missing | LIGHT | OH | 9/5/2014 1:14 |
| 80540 | Tucson | RED BLUE | missing | AZ | 9/5/2014 2:40 |
| 80541 | Orland park | RED | LIGHT | IL | 9/5/2014 3:43 |
| 80542 | Loughman | missing | LIGHT | FL | 9/5/2014 5:30 |
80543 rows × 5 columns
# most common value in each column
ufo.mode()
| | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Seattle | ORANGE | LIGHT | CA | 7/4/2014 22:00 |
# take the first row of modes and convert it to a dictionary
ufo.mode().iloc[0].to_dict()
{'City': 'Seattle',
'Colors Reported': 'ORANGE',
'Shape Reported': 'LIGHT',
'State': 'CA',
'Time': '7/4/2014 22:00'}
# replace missing values with most common value
ufo.fillna(ufo.mode().iloc[0].to_dict())
| | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | ORANGE | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | ORANGE | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | ORANGE | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | ORANGE | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair | ORANGE | LIGHT | NY | 4/18/1933 19:00 |
| ... | ... | ... | ... | ... | ... |
| 80538 | Neligh | ORANGE | CIRCLE | NE | 9/4/2014 23:20 |
| 80539 | Uhrichsville | ORANGE | LIGHT | OH | 9/5/2014 1:14 |
| 80540 | Tucson | RED BLUE | LIGHT | AZ | 9/5/2014 2:40 |
| 80541 | Orland park | RED | LIGHT | IL | 9/5/2014 3:43 |
| 80542 | Loughman | ORANGE | LIGHT | FL | 9/5/2014 5:30 |
80543 rows × 5 columns
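To verify the imputation worked, we can assign the result and re-count the missing values, which should all be zero:

# assign the imputed frame and confirm no missing values remain
ufo_filled = ufo.fillna(ufo.mode().iloc[0].to_dict())
ufo_filled.isna().sum()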
Problem#
Read in the dataset churn_missing.csv from our repo as a DataFrame using the URL below, and assign it to the variable churn_df.
churn_url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/churn_missing.csv'
churn_df = pd.read_csv(churn_url)
churn_df.head(2)
| | state | account_length | area_code | intl_plan | vmail_plan | vmail_message | day_mins | day_calls | day_charge | eve_mins | eve_calls | eve_charge | night_mins | night_calls | night_charge | intl_mins | intl_calls | intl_charge | custserv_calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | no | yes | 25.0 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | no | yes | 26.0 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
# column-specific summaries via a dictionary of column: function pairs
churn_df.agg({'account_length': 'mean', 'area_code': 'median'})
account_length 101.064806
area_code 415.000000
dtype: float64
Are there any missing values? What columns are they in and how many are there?
# number of rows
churn_df.shape[0]
3333
# inspect the two voicemail columns
churn_df[['vmail_plan', 'vmail_message']]
| | vmail_plan | vmail_message |
|---|---|---|
| 0 | yes | 25.0 |
| 1 | yes | 26.0 |
| 2 | no | 0.0 |
| 3 | no | 0.0 |
| 4 | no | 0.0 |
| ... | ... | ... |
| 3328 | yes | 36.0 |
| 3329 | no | 0.0 |
| 3330 | no | 0.0 |
| 3331 | no | 0.0 |
| 3332 | yes | 25.0 |
3333 rows × 2 columns
# fraction of missing values in each column
churn_df.isna().sum()/churn_df.shape[0]
state 0.000000
account_length 0.000000
area_code 0.000000
intl_plan 0.000000
vmail_plan 0.120012
vmail_message 0.120012
day_mins 0.000000
day_calls 0.000000
day_charge 0.000000
eve_mins 0.000000
eve_calls 0.000000
eve_charge 0.000000
night_mins 0.000000
night_calls 0.000000
night_charge 0.000000
intl_mins 0.000000
intl_calls 0.000000
intl_charge 0.000000
custserv_calls 0.000000
churn 0.000000
dtype: float64
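The two voicemail columns are missing at exactly the same rate, which suggests the gaps occur in the same rows. A quick check that the two missing-value masks line up row for row:

# True if a row is missing vmail_plan exactly when it is missing vmail_message
(churn_df['vmail_plan'].isna() == churn_df['vmail_message'].isna()).all()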
What do you think we should do about these? Drop, replace, impute?
# impute: 'no' for the plan, the column mean for the message counts
churn_df.fillna({'vmail_plan': 'no',
                 'vmail_message': churn_df['vmail_message'].mean()})
| | state | account_length | area_code | intl_plan | vmail_plan | vmail_message | day_mins | day_calls | day_charge | eve_mins | eve_calls | eve_charge | night_mins | night_calls | night_charge | intl_mins | intl_calls | intl_charge | custserv_calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | no | yes | 25.0 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
| 1 | OH | 107 | 415 | no | yes | 26.0 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
| 2 | NJ | 137 | 415 | no | no | 0.0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | yes | no | 0.0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | yes | no | 0.0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3328 | AZ | 192 | 415 | no | yes | 36.0 | 156.2 | 77 | 26.55 | 215.5 | 126 | 18.32 | 279.1 | 83 | 12.56 | 9.9 | 6 | 2.67 | 2 | False |
| 3329 | WV | 68 | 415 | no | no | 0.0 | 231.1 | 57 | 39.29 | 153.4 | 55 | 13.04 | 191.3 | 123 | 8.61 | 9.6 | 4 | 2.59 | 3 | False |
| 3330 | RI | 28 | 510 | no | no | 0.0 | 180.8 | 109 | 30.74 | 288.8 | 58 | 24.55 | 191.9 | 91 | 8.64 | 14.1 | 6 | 3.81 | 2 | False |
| 3331 | CT | 184 | 510 | yes | no | 0.0 | 213.8 | 105 | 36.35 | 159.6 | 84 | 13.57 | 139.2 | 137 | 6.26 | 5.0 | 10 | 1.35 | 2 | False |
| 3332 | TN | 74 | 415 | no | yes | 25.0 | 234.4 | 113 | 39.85 | 265.9 | 82 | 22.60 | 241.4 | 77 | 10.86 | 13.7 | 4 | 3.70 | 0 | False |
3333 rows × 20 columns
groupby#
Often you want summaries of your data within groups defined by some condition. The simplest condition is sharing a value in a single column. Using .groupby you can split your data into these groups and then summarize each one.
NOTE: After splitting you need to summarize!

# sample data
titanic = sns.load_dataset('titanic')
titanic.head()
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
# survival rate of each sex?
titanic.groupby('sex')['survived'].mean()
sex
female 0.742038
male 0.188908
Name: survived, dtype: float64
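Remember that titanic.groupby('sex') by itself only splits the data; nothing is computed until an aggregation is applied. With .agg you can apply several summaries in one pass, for example the mean alongside the group size:

# survival rate and number of passengers in each group
titanic.groupby('sex')['survived'].agg(['mean', 'count'])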
# survival rate of each class?
titanic.groupby('class')['survived'].mean()
class
First 0.629630
Second 0.472826
Third 0.242363
Name: survived, dtype: float64
# each class and sex survival rate
titanic.groupby(['class', 'sex'])[['survived']].mean()
| class | sex | survived |
|---|---|---|
| First | female | 0.968085 |
| First | male | 0.368852 |
| Second | female | 0.921053 |
| Second | male | 0.157407 |
| Third | female | 0.500000 |
| Third | male | 0.135447 |
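A multi-index result like this can be reshaped with unstack, which pivots the inner index level into the columns:

# rows are classes, columns are sexes, values are survival rates
titanic.groupby(['class', 'sex'])['survived'].mean().unstack()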
# working with multi-index -- changing form of results
titanic.groupby(['class', 'sex'], as_index = False)[['survived']].mean()
| | class | sex | survived |
|---|---|---|---|
| 0 | First | female | 0.968085 |
| 1 | First | male | 0.368852 |
| 2 | Second | female | 0.921053 |
| 3 | Second | male | 0.157407 |
| 4 | Third | female | 0.500000 |
| 5 | Third | male | 0.135447 |
# boolean mask: is the passenger younger than 40?
titanic['age'] < 40
0 True
1 True
2 True
3 True
4 True
...
886 True
887 True
888 False
889 True
890 True
Name: age, Length: 891, dtype: bool
less_than_40 = titanic['age'] < 40
# group on the boolean mask itself
titanic.groupby(less_than_40)['survived'].mean()
age
False 0.332353
True 0.415608
Name: survived, dtype: float64
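Grouping on a condition generalizes to grouping on bins. pd.cut turns a numeric column into intervals, and the resulting categories can be passed straight to groupby; a sketch with hypothetical age cutoffs:

# survival rate within hypothetical age brackets
age_bins = pd.cut(titanic['age'], bins=[0, 18, 40, 65, 100])
titanic.groupby(age_bins, observed=True)['survived'].mean()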
Problems#
tips = sns.load_dataset('tips')
tips.head(2)
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
Average tip for smokers vs. non-smokers.
tips.groupby('smoker')['tip'].mean()
smoker
Yes 3.008710
No 2.991854
Name: tip, dtype: float64
Average bill by day and time.
tips.groupby(['day', 'time'])['total_bill'].mean()
day time
Thur Lunch 17.664754
Dinner 18.780000
Fri Lunch 12.845714
Dinner 19.663333
Sat Lunch NaN
Dinner 20.441379
Sun Lunch NaN
Dinner 21.410000
Name: total_bill, dtype: float64
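The NaN rows appear because day and time are stored as categorical columns, and by default groupby keeps every combination of categories even when no rows fall into it. Passing observed=True keeps only the combinations that actually occur:

# drop category combinations with no observations (e.g. Saturday lunch)
tips.groupby(['day', 'time'], observed=True)['total_bill'].mean()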
What is another question groupby can help us answer here?
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips.groupby('sex')['tip_pct'].mean()
sex
Male 0.157651
Female 0.166491
Name: tip_pct, dtype: float64
Plotting from a DataFrame#
Next class we will introduce two plotting libraries, matplotlib and seaborn. It turns out that a DataFrame also exposes a good bit of matplotlib functionality, so plots can be created directly from a DataFrame.
url = 'https://raw.githubusercontent.com/evorition/astsadata/refs/heads/main/astsadata/data/UnempRate.csv'
unemp = pd.read_csv(url)
unemp.head()
| | index | value |
|---|---|---|
| 0 | 1948 Jan | 4.0 |
| 1 | 1948 Feb | 4.7 |
| 2 | 1948 Mar | 4.5 |
| 3 | 1948 Apr | 4.0 |
| 4 | 1948 May | 3.4 |
#default plot is line
unemp.plot()
<Axes: >
unemp.head()
| | index | value |
|---|---|---|
| 0 | 1948 Jan | 4.0 |
| 1 | 1948 Feb | 4.7 |
| 2 | 1948 Mar | 4.5 |
| 3 | 1948 Apr | 4.0 |
| 4 | 1948 May | 3.4 |
unemp = pd.read_csv(url, index_col = 0)
unemp.head()
| index | value |
|---|---|
| 1948 Jan | 4.0 |
| 1948 Feb | 4.7 |
| 1948 Mar | 4.5 |
| 1948 Apr | 4.0 |
| 1948 May | 3.4 |
unemp.info()
<class 'pandas.core.frame.DataFrame'>
Index: 827 entries, 1948 Jan to 2016 Nov
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 value 827 non-null float64
dtypes: float64(1)
memory usage: 12.9+ KB
unemp.plot()
<Axes: xlabel='index'>
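The index here is still a column of strings like 1948 Jan, so matplotlib treats the x-axis as ordinary labels. Converting it to a DatetimeIndex gives the plot a real time axis; a sketch, assuming the format string matches the 'YYYY Mon' pattern in this file:

# parse 'YYYY Mon' strings into timestamps so the x-axis is a true time axis
unemp_dt = unemp.copy()
unemp_dt.index = pd.to_datetime(unemp_dt.index, format='%Y %b')
unemp_dt.plot()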
unemp.hist()
array([[<Axes: title={'center': 'value'}>]], dtype=object)
unemp.boxplot()
<Axes: >
#create a new column of shifted (lagged) measurements
unemp['shifted'] = unemp['value'].shift()
unemp.plot()
unemp.plot(x = 'value', y = 'shifted', kind = 'scatter')
unemp.plot(x = 'value', y = 'shifted', kind = 'scatter', title = 'Unemployment Data', grid = True);
More with pandas and plotting here.
datetime#
A special type of data for pandas are entities that can be considered as dates. We can create a special datatype for these using pd.to_datetime, and access the functions of the datetime module as a result.
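As a small illustration, pd.to_datetime parses date strings into timestamps, and the .dt accessor then exposes components like the year and month; a sketch using two dates from the ufo data:

# parse a small series of date strings and pull out a component
dates = pd.to_datetime(pd.Series(['6/1/1930 22:00', '9/5/2014 5:30']))
dates.dt.year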
#ufo date column
ufo['Time'].head()
#make it a datetime and assign as time column
ufo['Time'] = pd.to_datetime(ufo['Time'])
#investigate datatypes
ufo.dtypes
#set as index
ufo = ufo.set_index('Time')
#sort the values
ufo = ufo.sort_index()
#groupby month and count sightings (no numeric column to average)
ufo.groupby(ufo.index.month).size()
Data Resources#
NYU has a number of resources for acquiring data with applications to economics and finance here.
See you Thursday!