`datetime` and `matplotlib` intro

This lesson rounds out the introductory pandas work and introduces our basic plotting library, matplotlib.

OBJECTIVES

Understand and use datetime objects in pandas DataFrames
Use matplotlib to produce basic plots from data
Understand when to use histograms, boxplots, line plots, and scatterplots with data

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

`datetime`

A special kind of data in pandas is anything that can be interpreted as a date. We can convert such a column to a datetime dtype with pd.to_datetime, after which the .dt accessor exposes date components such as the month and the day.

# read in the AAPL data
url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/AAPL.csv'

#read_csv
aapl = pd.read_csv(url)
aapl.head(3)

	Date	Open	High	Low	Close	Adj Close	Volume
0	2005-04-25	5.212857	5.288571	5.158571	5.282857	3.522625	186615100
1	2005-04-26	5.254286	5.358572	5.160000	5.170000	3.447372	202626900
2	2005-04-27	5.127143	5.194286	5.072857	5.135714	3.424510	153472200

aapl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3523 entries, 0 to 3522
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       3523 non-null   object 
 1   Open       3523 non-null   float64
 2   High       3523 non-null   float64
 3   Low        3523 non-null   float64
 4   Close      3523 non-null   float64
 5   Adj Close  3523 non-null   float64
 6   Volume     3523 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 192.8+ KB

# convert to datetime
aapl['Date'] = pd.to_datetime(aapl['Date'])
aapl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3523 entries, 0 to 3522
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       3523 non-null   datetime64[ns]
 1   Open       3523 non-null   float64       
 2   High       3523 non-null   float64       
 3   Low        3523 non-null   float64       
 4   Close      3523 non-null   float64       
 5   Adj Close  3523 non-null   float64       
 6   Volume     3523 non-null   int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 192.8 KB

# extract the month
aapl['Date'].dt.month

0       4
1       4
2       4
3       4
4       4
       ..
3518    4
3519    4
3520    4
3521    4
3522    4
Name: Date, Length: 3523, dtype: int64

# extract the day
aapl['Date'].dt.day

0       25
1       26
2       27
3       28
4       29
        ..
3518    16
3519    17
3520    18
3521    22
3522    23
Name: Date, Length: 3523, dtype: int64
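The .dt accessor offers more than the month and day. As a minimal sketch, assuming a small hand-made Series of dates (the three dates below are picked only for illustration), we can also pull out the year, the weekday name, and the quarter:

```python
import pandas as pd

# Three illustrative dates; any datetime Series works the same way
dates = pd.Series(pd.to_datetime(['2005-04-25', '2012-12-31', '2019-04-23']))

years = dates.dt.year            # calendar year of each date
weekdays = dates.dt.day_name()   # weekday name as a string
quarters = dates.dt.quarter      # quarter of the year, 1 through 4

summary = pd.DataFrame({'date': dates, 'year': years,
                        'weekday': weekdays, 'quarter': quarters})
print(summary)
```

Each of these returns a new Series aligned with the original, so the results can be assigned straight back to the DataFrame as new columns.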

# set date to be index of data
aapl.set_index('Date', inplace = True)

# sort the index
aapl.sort_index(inplace = True)

aapl.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3523 entries, 2005-04-25 to 2019-04-23
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       3523 non-null   float64
 1   High       3523 non-null   float64
 2   Low        3523 non-null   float64
 3   Close      3523 non-null   float64
 4   Adj Close  3523 non-null   float64
 5   Volume     3523 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 321.7 KB

# select 2018 through 2019
aapl.loc['2018':'2019']

	Open	High	Low	Close	Adj Close	Volume
Date
2018-01-02	170.160004	172.300003	169.259995	172.259995	168.987320	25555900
2018-01-03	172.529999	174.550003	171.960007	172.229996	168.957886	29517900
2018-01-04	172.539993	173.470001	172.080002	173.029999	169.742706	22434600
2018-01-05	173.440002	175.369995	173.050003	175.000000	171.675278	23660000
2018-01-08	174.350006	175.610001	173.929993	174.350006	171.037628	20567800
...	...	...	...	...	...	...
2019-04-16	199.460007	201.369995	198.559998	199.250000	199.250000	25696400
2019-04-17	199.539993	203.380005	198.610001	203.130005	203.130005	28906800
2019-04-18	203.119995	204.149994	202.520004	203.860001	203.860001	24195800
2019-04-22	202.830002	204.940002	202.339996	204.529999	204.529999	19439500
2019-04-23	204.429993	207.750000	203.899994	207.479996	207.479996	23309000

328 rows × 6 columns

# read back in using parse_dates = True and index_col = 0
aapl = pd.read_csv(url, parse_dates = True, index_col = 0)
aapl.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3523 entries, 2005-04-25 to 2019-04-23
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       3523 non-null   float64
 1   High       3523 non-null   float64
 2   Low        3523 non-null   float64
 3   Close      3523 non-null   float64
 4   Adj Close  3523 non-null   float64
 5   Volume     3523 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 192.7 KB
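To try the parse_dates and index_col arguments without downloading anything, here is a small sketch that reads a CSV from an in-memory string. The three rows are rounded copies of the first AAPL rows shown earlier, deliberately shuffled, and the name prices is just a placeholder:

```python
import io
import pandas as pd

# A tiny CSV held in a string, standing in for a file or URL
csv_text = """Date,Close,Volume
2005-04-27,5.14,153472200
2005-04-25,5.28,186615100
2005-04-26,5.17,202626900
"""

# Parse the Date column while reading and use it as the index
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

# Put the rows in chronological order
prices = prices.sort_index()
print(prices.index)
```

Naming the column in parse_dates is a bit more explicit than parse_dates = True, which asks pandas to try parsing the index.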

from datetime import datetime

# what time is it?
then = datetime.now()
then

datetime.datetime(2024, 9, 26, 15, 51, 22, 322906)

# how much time has passed?
datetime.now() - then

datetime.timedelta(seconds=110, microseconds=154544)

More with timestamps

Date times: A specific date and time with timezone support. Similar to datetime.datetime from the standard library.
Time deltas: An absolute time duration. Similar to datetime.timedelta from the standard library.

# create a pd.Timedelta
delta = pd.Timedelta('1W')

# shift a date by one week
datetime.now() + delta

datetime.datetime(2024, 10, 3, 15, 55, 26, 749137)
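A Timedelta is a fixed duration, which suits shifts measured in days or weeks. Months vary in length, so a calendar-aware shift is usually written with pd.DateOffset instead. The sketch below compares the two on a fixed starting date; pd.DateOffset is not used elsewhere in this lesson and is shown here only as one possible approach:

```python
import pandas as pd

start = pd.Timestamp('2024-09-26')

# Fixed duration: exactly seven days later
one_week_later = start + pd.Timedelta('1W')

# Calendar-aware shift: same day of the month, three months later
three_months_later = start + pd.DateOffset(months=3)

print(one_week_later)       # 2024-10-03 00:00:00
print(three_months_later)   # 2024-12-26 00:00:00
```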

Problems

ufo_url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/ufo.csv'

Return to the ufo data and convert the Time column to a datetime object.

ufo = pd.read_csv(ufo_url)

Set the Time column as the index column of the data.

ufo.set_index('Time', inplace = True)

Sort the index.

ufo.sort_index(inplace = True)
ufo.info()

<class 'pandas.core.frame.DataFrame'>
Index: 80543 entries, 1/1/1944 10:00 to 9/9/2013 9:51
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   City             80496 non-null  object
 1   Colors Reported  17034 non-null  object
 2   Shape Reported   72141 non-null  object
 3   State            80543 non-null  object
dtypes: object(4)
memory usage: 3.1+ MB

Create a new dataframe with ufo sightings since January 1, 1999

ufo.info()

<class 'pandas.core.frame.DataFrame'>
Index: 80543 entries, 1/1/1944 10:00 to 9/9/2013 9:51
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   City             80496 non-null  object
 1   Colors Reported  17034 non-null  object
 2   Shape Reported   72141 non-null  object
 3   State            80543 non-null  object
dtypes: object(4)
memory usage: 5.1+ MB

ufo.index = pd.to_datetime(ufo.index)

ufo.loc['1999':]

<ipython-input-47-4de6a5ec51ed>:1: FutureWarning: Value based partial slicing on non-monotonic DatetimeIndexes with non-existing keys is deprecated and will raise a KeyError in a future Version.
  ufo.loc['1999':]

	City	Colors Reported	Shape Reported	State
Time
1999-01-01 14:00:00	Florence	NaN	CYLINDER	SC
1999-01-01 15:00:00	Lake Henshaw	NaN	CIGAR	CA
1999-01-01 17:15:00	Wilmington Island	NaN	LIGHT	GA
1999-01-01 18:00:00	DeWitt	NaN	LIGHT	AR
1999-01-01 19:12:00	Bainbridge Island	NaN	NaN	WA
...	...	...	...	...
2013-09-09 23:00:00	Starr	RED	DIAMOND	SC
2013-09-09 23:00:00	Edmond	RED	CIGAR	OK
2013-09-09 23:30:00	Ft. Lauderdale	RED	OVAL	FL
2013-09-09 03:00:00	Struthers	NaN	NaN	OH
2013-09-09 09:51:00	San Diego	NaN	LIGHT	CA

67711 rows × 4 columns
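The warning above appears because the index was sorted while it still held strings, so the rows ended up in alphabetical rather than chronological order (note the 03:00 and 09:51 rows at the very end). One way to avoid this is to convert before setting and sorting the index. Here is a minimal sketch on a few made-up rows that mimic the format of the Time column; the frame sightings and its values are hypothetical:

```python
import pandas as pd

# Hypothetical rows in the same month/day/year format as the ufo data
sightings = pd.DataFrame({
    'Time': ['9/9/2013 9:51', '1/1/1944 10:00', '12/31/1998 23:59', '1/1/1999 14:00'],
    'State': ['CA', 'WA', 'TX', 'SC'],
})

# Convert first, so that sorting is chronological
sightings['Time'] = pd.to_datetime(sightings['Time'], format='%m/%d/%Y %H:%M')
sightings = sightings.set_index('Time').sort_index()

# Partial string slicing now works on a sorted DatetimeIndex
recent = sightings.loc['1999':]
print(recent)
```

With the index sorted chronologically, the slice returns only the 1999 and 2013 rows, in date order.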

Grouping with Dates

When a DataFrame's index is a datetime, the resample method provides an operation similar to groupby. Here the groups are time periods such as a week, month, quarter, or year.

dow = sns.load_dataset('dowjones')

#check the info
dow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    649 non-null    datetime64[ns]
 1   Price   649 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 10.3 KB

#handle the index
dow.set_index('Date', inplace = True)

#check that things changed
dow.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 649 entries, 1914-12-01 to 1968-12-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Price   649 non-null    float64
dtypes: float64(1)
memory usage: 10.1 KB

dow.head()

	Price
Date
1914-12-01	55.00
1915-01-01	56.55
1915-02-01	56.00
1915-03-01	58.30
1915-04-01	66.45

#average monthly price
dow.resample('M').mean()

	Price
Date
1914-12-31	55.00
1915-01-31	56.55
1915-02-28	56.00
1915-03-31	58.30
1915-04-30	66.45
...	...
1968-08-31	883.72
1968-09-30	922.80
1968-10-31	955.47
1968-11-30	964.12
1968-12-31	965.39

649 rows × 1 columns

#quarterly maximum price
dow.resample('Q').max()

	Price
Date
1914-12-31	55.00
1915-03-31	58.30
1915-06-30	68.40
1915-09-30	85.50
1915-12-31	97.00
...	...
1967-12-31	907.54
1968-03-31	884.77
1968-06-30	906.82
1968-09-30	922.80
1968-12-31	965.39

217 rows × 1 columns
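To see how resample relates to groupby, the sketch below builds a synthetic daily series (illustrative numbers only, not the Dow Jones data) and computes a yearly mean both ways. It uses the 'YS' (year start) alias; note that recent pandas releases (2.2 and later) deprecate the 'M' and 'Q' aliases in favor of 'ME' and 'QE', so the alias you need may depend on your installed version.

```python
import numpy as np
import pandas as pd

# Synthetic daily values covering 2020 and 2021
idx = pd.date_range('2020-01-01', '2021-12-31', freq='D')
daily = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Yearly mean with resample; each group is labeled by the start of its year
yearly = daily.resample('YS').mean()

# The same grouping written as a groupby on the year of the index
by_year = daily.groupby(daily.index.year).mean()

print(yearly)
print(by_year)
```

The values agree; the difference is that resample keeps a datetime index, which is convenient for plotting against time.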

Introduction to `matplotlib`

Now, let us turn our attention to plotting data. We begin with basic plots, and later explore some customization and additional plots. For these exercises, we will use the stock price data and a dataset about Antarctic penguins from the seaborn library.

import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset('penguins')

Line Plots with Matplotlib

To begin, select the bill_length_mm column of the data.

penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB

### bill length
bill_length = penguins['bill_length_mm']

### plt.plot
plt.plot(bill_length)

[<matplotlib.lines.Line2D at 0x7f8a8e1f11f0>]

### use the series
bill_length.plot()

<AxesSubplot: >

#plot dow jones Price with matplotlib
plt.plot(dow)

[<matplotlib.lines.Line2D at 0x7f8a8d1856a0>]

#plot dow jones data from the DataFrame
dow.plot()

<AxesSubplot: xlabel='Date'>

Choosing A Plot

Below, plots are shown first for a single quantitative variable, then for a single categorical variable. Next come plots for two variables: two continuous variables, one continuous vs. one categorical, and any mix of continuous and categorical.

Histogram

A histogram is an approximate representation of the distribution of numerical data. This is a plot we use for any single continuous feature to better understand the shape of the data.

### bill length histogram
plt.hist(bill_length)

(array([ 9., 40., 57., 48., 49., 55., 61., 16.,  5.,  2.]),
 array([32.1 , 34.85, 37.6 , 40.35, 43.1 , 45.85, 48.6 , 51.35, 54.1 ,
        56.85, 59.6 ]),
 <BarContainer object of 10 artists>)
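The tuple printed above is the return value of the histogram call: the count in each bin, the bin edges, and the drawn bars. As a small sketch on synthetic data (random numbers standing in for the bill lengths, not the real measurements), we can unpack the three pieces and check how they fit together:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for a numeric column with 342 observations
rng = np.random.default_rng(0)
values = rng.normal(loc=44, scale=5, size=342)

fig, ax = plt.subplots()
counts, edges, patches = ax.hist(values, bins=10)

print(counts.sum())   # every observation lands in exactly one bin
print(len(edges))     # one more edge than there are bins
plt.close(fig)
```

Ending a plotting line with a semicolon in a notebook, as some cells in this lesson do, simply suppresses this printed return value.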

### as a method with the series
bill_length.hist()

<AxesSubplot: >

### adjusting the bin number
plt.hist(bill_length, bins = 100);

### adding a title, labels, edgecolor, and alpha
### (the title and labels must be in the same cell as the plot)
plt.hist(bill_length, 
         edgecolor = 'black', 
         color = 'red', 
         alpha = 0.3)
plt.xlabel('Length (mm)')
plt.ylabel('Count')
plt.title('Bill Length (mm)');

penguins.hist();

Boxplot

Similar to a histogram, a boxplot can be used on a single quantitative feature.

### boxplot of bill length
plt.boxplot(bill_length);

### WHOOPS -- let's try this without null values
plt.boxplot(bill_length.dropna());

### Make a horizontal version of the plot
plt.boxplot(bill_length.dropna(), vert = False);
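To see what a boxplot actually draws, we can ask matplotlib for the underlying numbers. The sketch below uses matplotlib.cbook.boxplot_stats on a small made-up array that includes one missing value and one extreme value; the data are hypothetical and chosen so the arithmetic is easy to follow:

```python
import numpy as np
from matplotlib import cbook

# Made-up data with a missing value and one extreme point
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100, np.nan])

# Drop missing values first, mirroring the dropna() call used with the Series
clean = data[~np.isnan(data)]

# One dictionary of statistics per dataset; we have a single dataset
stats = cbook.boxplot_stats(clean)[0]

print(stats['q1'], stats['med'], stats['q3'])   # box edges and median line
print(stats['whislo'], stats['whishi'])         # ends of the whiskers
print(stats['fliers'])                          # points drawn individually
```

With the default setting, the whiskers extend to the most extreme points within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an individual point.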

Bar Plot

A bar plot can be used to summarize a single categorical variable, for example to show the count of each unique category in a categorical feature.

### counts of species
penguins['species'].value_counts()

Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64

### barplot of counts
penguins['species'].value_counts().plot(kind = 'bar')

<AxesSubplot: >
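The same bar plot can be drawn with matplotlib directly, which makes it easier to add labels. A minimal sketch, assuming a short hand-made Series of species names rather than the full penguins data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# A few illustrative category labels
species = pd.Series(['Adelie', 'Gentoo', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo'])

# Count each category; the result is sorted from most to least frequent
counts = species.value_counts()

fig, ax = plt.subplots()
bars = ax.bar(counts.index, counts.values)
ax.set_xlabel('Species')
ax.set_ylabel('Count')
ax.set_title('Observations per species')
```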

Two Variable Plots

penguins.head(2)

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female

Scatterplot

Two continuous features can be compared using scatterplots. Typically, one is interested in whether a relationship between the features exists and, if so, in the strength and direction of that relationship.

### bill length vs. bill depth
x = bill_length
y = penguins['bill_depth_mm']

### scatterplot of x vs. y
plt.scatter(x, y)

<matplotlib.collections.PathCollection at 0x7f8a8cf8b2e0>
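A scatterplot shows strength and direction visually; the correlation coefficient puts a number on the same idea. The sketch below uses synthetic data (not the penguin measurements) to contrast a strong positive relationship with a weaker negative one, using the Series corr method:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = pd.Series(rng.uniform(0, 10, size=200))

# Strong positive relationship: little noise around an upward line
y_up = 2 * x + rng.normal(0, 1, size=200)

# Weaker negative relationship: more noise around a downward line
y_down = -x + rng.normal(0, 5, size=200)

r_up = x.corr(y_up)       # close to +1
r_down = x.corr(y_down)   # negative, and smaller in magnitude
print(round(r_up, 2), round(r_down, 2))
```

The sign of the coefficient gives the direction and its magnitude the strength, though it only measures linear association.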

`pandas.plotting`

There is no quick, easy plot in matplotlib for comparing all numeric features in a dataset. Instead, pandas.plotting has a scatter_matrix function that serves this purpose.

from pandas.plotting import scatter_matrix

### scatter matrix of penguin data
scatter_matrix(penguins);

### adding arguments and changing size
scatter_matrix(penguins, diagonal = 'kde', figsize = (10, 10));

PROBLEMS

iris = sns.load_dataset('iris')

iris.head(2)

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa

Problem 1: Histogram of petal_length

iris['petal_length'].hist()

<AxesSubplot: >

Problem 2: Scatter plot of sepal_length vs. sepal_width.

plt.scatter(iris.sepal_length, iris.sepal_width)

<matplotlib.collections.PathCollection at 0x7f8a92051a30>

Problem 3: New column where

setosa -> blue
virginica -> green
versicolor -> orange

iris['colors'] = iris['species'].replace({'setosa': 'blue', 'virginica': 'green', 'versicolor': 'orange'})

Problem 4: Scatterplot of sepal_length vs petal_length colored by species.

plt.scatter(iris.sepal_length, iris.petal_length, c = iris.colors)

<matplotlib.collections.PathCollection at 0x7f8a92b2c460>
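Passing a column of color names works, but the plot has no legend telling us which color is which species. One alternative is to draw each group separately with a label. This is a sketch on a handful of rows in the shape of the iris data; the frame flowers and the dictionary palette are names made up for this example:

```python
import pandas as pd
import matplotlib.pyplot as plt

# A few rows shaped like the iris data
flowers = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica'],
})
palette = {'setosa': 'blue', 'virginica': 'green', 'versicolor': 'orange'}

fig, ax = plt.subplots()
# One scatter call per species, so each gets its own legend entry
for name, group in flowers.groupby('species'):
    ax.scatter(group['sepal_length'], group['petal_length'],
               color=palette[name], label=name)
ax.set_xlabel('sepal_length')
ax.set_ylabel('petal_length')
ax.legend()
```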

Subplots and Axes

### create a 1 row 2 column plot
fig, ax = plt.subplots(1, 2)

### add a plot to each axis
fig, ax = plt.subplots(1, 2)
ax[0].hist(bill_length)
ax[1].boxplot(penguins['flipper_length_mm'].dropna())
ax[1].set_title('Flipper Length')

Text(0.5, 1.0, 'Flipper Length')

### create a 2 x 2 grid of plots
### add histogram to bottom right plot
fig, ax = plt.subplots(2, 2, figsize = (10, 8))
ax[1, 1].hist(bill_length)
fig.suptitle('2 x 2 grid of subplots')
plt.savefig('subplottin.png')
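With a grid of subplots, the axes come back as a 2D array, so individual panels are indexed by row and column. When every panel should get the same kind of plot, looping over the flattened array is often simpler. A minimal sketch, assuming four synthetic columns with placeholder names:

```python
import numpy as np
import matplotlib.pyplot as plt

# Four synthetic columns with placeholder names
rng = np.random.default_rng(2)
columns = {name: rng.normal(size=100) for name in ['a', 'b', 'c', 'd']}

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
print(axes.shape)   # (2, 2)

# axes.flat walks the grid row by row: top-left, top-right, bottom-left, bottom-right
for ax, (name, values) in zip(axes.flat, columns.items()):
    ax.hist(values)
    ax.set_title(name)

fig.tight_layout()  # reduce overlap between panel titles and tick labels
```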

Summary

Great job! We will get practice plotting in this week's homework and examine some other libraries and approaches during class next week. For now, make sure you are familiar with the basic plots above – histogram, boxplot, bar plot, scatterplot – and when to use each.