`datetime`, and `matplotlib` intro#

This lesson rounds out the introductory pandas work and introduces matplotlib, our basic plotting library.

OBJECTIVES

Understand and use datetime objects in pandas DataFrames
Use matplotlib to produce basic plots from data
Understand when to use histograms, boxplots, line plots, and scatterplots with data

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

`datetime`#

pandas has a dedicated data type for values that represent dates and times. We can convert a column to this type with pd.to_datetime, which then gives access to date-specific attributes and methods through the .dt accessor.

# read in the AAPL data
url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/AAPL.csv'

#read_csv
aapl = pd.read_csv(url)
aapl.head()

	Date	Open	High	Low	Close	Adj Close	Volume
0	2005-04-25	5.212857	5.288571	5.158571	5.282857	3.522625	186615100
1	2005-04-26	5.254286	5.358572	5.160000	5.170000	3.447372	202626900
2	2005-04-27	5.127143	5.194286	5.072857	5.135714	3.424510	153472200
3	2005-04-28	5.184286	5.191429	5.034286	5.077143	3.385454	143776500
4	2005-04-29	5.164286	5.175714	5.031428	5.151429	3.434988	167907600

#examine info
aapl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3523 entries, 0 to 3522
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       3523 non-null   object 
 1   Open       3523 non-null   float64
 2   High       3523 non-null   float64
 3   Low        3523 non-null   float64
 4   Close      3523 non-null   float64
 5   Adj Close  3523 non-null   float64
 6   Volume     3523 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 192.8+ KB

# convert to datetime
aapl['Date'] = pd.to_datetime(aapl['Date'])
aapl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3523 entries, 0 to 3522
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       3523 non-null   datetime64[ns]
 1   Open       3523 non-null   float64       
 2   High       3523 non-null   float64       
 3   Low        3523 non-null   float64       
 4   Close      3523 non-null   float64       
 5   Adj Close  3523 non-null   float64       
 6   Volume     3523 non-null   int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 192.8 KB

# extract the month
aapl['Date'].dt.month

     4
     4
     4
     4
     4
       ..
  4
  4
  4
  4
  4
Name: Date, Length: 3523, dtype: int32

# extract the day
aapl['Date'].dt.day

     25
     26
     27
     28
     29
        ..
  16
  17
  18
  22
  23
Name: Date, Length: 3523, dtype: int32
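The .dt accessor exposes more than month and day. As a minimal sketch, using a small hand-made Series (the three dates below are illustrative and not read from the AAPL file), here are a few other commonly used attributes:

```python
import pandas as pd

# toy dates for illustration only (not read from the AAPL file)
dates = pd.Series(pd.to_datetime(['2005-04-25', '2005-04-26', '2019-04-23']))

# year as an integer
print(dates.dt.year.tolist())        # [2005, 2005, 2019]

# name of the weekday
print(dates.dt.day_name().tolist())  # ['Monday', 'Tuesday', 'Tuesday']

# weekday as a number, where Monday is 0
print(dates.dt.dayofweek.tolist())   # [0, 1, 1]
```

Note that day_name is a method and needs parentheses, while year and dayofweek are plain attributes.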

# set date to be index of data
# (run this cell only once: after it runs, 'Date' is the index and no longer a column,
#  so a second run raises KeyError: "None of ['Date'] are in the columns")
aapl.set_index('Date', inplace = True)

# sort the index
aapl.sort_index(inplace = True)

#see if things have changed
aapl.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3523 entries, 2005-04-25 to 2019-04-23
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       3523 non-null   float64
 1   High       3523 non-null   float64
 2   Low        3523 non-null   float64
 3   Close      3523 non-null   float64
 4   Adj Close  3523 non-null   float64
 5   Volume     3523 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 192.7 KB

# select 2015 through 2019
aapl.loc['2015':'2019']

	Open	High	Low	Close	Adj Close	Volume
Date
2015-01-02	111.389999	111.440002	107.349998	109.330002	101.528191	53204600
2015-01-05	108.290001	108.650002	105.410004	106.250000	98.667984	64285500
2015-01-06	106.540001	107.430000	104.629997	106.260002	98.677261	65797100
2015-01-07	107.199997	108.199997	106.699997	107.750000	100.060936	40105900
2015-01-08	109.230003	112.150002	108.699997	111.889999	103.905510	59364500
...	...	...	...	...	...	...
2019-04-16	199.460007	201.369995	198.559998	199.250000	199.250000	25696400
2019-04-17	199.539993	203.380005	198.610001	203.130005	203.130005	28906800
2019-04-18	203.119995	204.149994	202.520004	203.860001	203.860001	24195800
2019-04-22	202.830002	204.940002	202.339996	204.529999	204.529999	19439500
2019-04-23	204.429993	207.750000	203.899994	207.479996	207.479996	23309000

1083 rows × 6 columns
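Selecting by year works because a sorted DatetimeIndex accepts partial date strings in .loc. The sketch below uses a made-up daily DataFrame (the name prices and its values are hypothetical) to show a single year, a single month, and a date range:

```python
import numpy as np
import pandas as pd

# hypothetical daily data covering 2018 and 2019
idx = pd.date_range('2018-01-01', '2019-12-31', freq='D')
prices = pd.DataFrame({'Close': np.arange(len(idx), dtype=float)}, index=idx)

only_2019 = prices.loc['2019']                  # every row in 2019
march_2019 = prices.loc['2019-03']              # every row in March 2019
span = prices.loc['2018-12-15':'2019-01-15']    # both endpoints are included

print(len(only_2019), len(march_2019), len(span))  # 365 31 32
```

Unlike ordinary Python slices, label slices with .loc include the end point, which is why the last example covers 32 days.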

# read back in using parse_dates = True and index_col = 0
aapl = pd.read_csv(url, parse_dates = True, index_col = 0)
aapl.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3523 entries, 2005-04-25 to 2019-04-23
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       3523 non-null   float64
 1   High       3523 non-null   float64
 2   Low        3523 non-null   float64
 3   Close      3523 non-null   float64
 4   Adj Close  3523 non-null   float64
 5   Volume     3523 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 192.7 KB

from datetime import datetime

# what time is it?
then = datetime.now()
then

datetime.datetime(2025, 9, 18, 15, 53, 50, 929023)

# how much time has passed?
datetime.now() - then

datetime.timedelta(seconds=35, microseconds=562058)

More with timestamps#

Date times: A specific date and time with timezone support. Similar to datetime.datetime from the standard library.
Time deltas: An absolute time duration. Similar to datetime.timedelta from the standard library.

# create a pd.Timedelta
delta = pd.Timedelta('1W')

# shift a date by one week
datetime.now() + delta

datetime.datetime(2025, 9, 25, 15, 55, 29, 94830)
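A pd.Timedelta is a fixed duration, so it suits units like days or weeks. Calendar units such as months vary in length, and one way to shift by them is pd.DateOffset. A minimal sketch with a fixed, illustrative starting date:

```python
import pandas as pd

# fixed starting point so the result is reproducible (illustrative date)
start = pd.Timestamp('2025-09-18')

# fixed-length shift: exactly 7 days
one_week_later = start + pd.Timedelta(weeks=1)

# calendar-aware shift: same day of the month, three months on
three_months_later = start + pd.DateOffset(months=3)

print(one_week_later.date())      # 2025-09-25
print(three_months_later.date())  # 2025-12-18
```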

Problems#

ufo_url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/ufo.csv'

Return to the ufo data and convert the Time column to a datetime object.

ufo_df = pd.read_csv(ufo_url)
ufo_df.head()

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00
3	Abilene	NaN	DISK	KS	6/1/1931 13:00
4	New York Worlds Fair	NaN	LIGHT	NY	4/18/1933 19:00

ufo_df['Time'] = pd.to_datetime(ufo_df['Time'])
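When the date strings share one layout, passing an explicit format to pd.to_datetime removes ambiguity (month first versus day first) and is often faster. The sketch below assumes every value follows the month/day/year hour:minute pattern seen in the first rows above; that assumption is not verified against the full file.

```python
import pandas as pd

# three strings copied from the first rows of the ufo data shown above
raw = pd.Series(['6/1/1930 22:00', '6/30/1930 20:00', '2/15/1931 14:00'])

# assumption: the whole column uses month/day/year hour:minute
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')

print(parsed.dt.year.tolist())  # [1930, 1930, 1931]
print(parsed.dt.hour.tolist())  # [22, 20, 14]
```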

Set the Time column as the index column of the data.

ufo_df.set_index('Time', inplace = True)

Sort the index.

ufo_df.sort_index(inplace = True)

Create a new dataframe with ufo sightings since January 1, 1999

ufo_df.loc['1999':]

	City	Colors Reported	Shape Reported	State
Time
1999-01-01 02:30:00	Loma Rica	NaN	LIGHT	CA
1999-01-01 03:00:00	Bauxite	NaN	NaN	AR
1999-01-01 14:00:00	Florence	NaN	CYLINDER	SC
1999-01-01 15:00:00	Lake Henshaw	NaN	CIGAR	CA
1999-01-01 17:15:00	Wilmington Island	NaN	LIGHT	GA
...	...	...	...	...
2014-09-04 23:20:00	Neligh	NaN	CIRCLE	NE
2014-09-05 01:14:00	Uhrichsville	NaN	LIGHT	OH
2014-09-05 02:40:00	Tucson	RED BLUE	NaN	AZ
2014-09-05 03:43:00	Orland park	RED	LIGHT	IL
2014-09-05 05:30:00	Loughman	NaN	LIGHT	FL

67711 rows × 4 columns

Grouping with Dates#

The resample method works much like groupby, but for DataFrames whose index is a DatetimeIndex. Instead of grouping by the values of a column, it groups rows into time periods such as weeks, months, quarters, or years, and an aggregation like mean or max is then applied to each period.

dow = sns.load_dataset('dowjones')

#check the info
dow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    649 non-null    datetime64[ns]
 1   Price   649 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 10.3 KB

#handle the index
dow.set_index('Date', inplace = True)

#check that things changed
dow.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 649 entries, 1914-12-01 to 1968-12-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Price   649 non-null    float64
dtypes: float64(1)
memory usage: 10.1 KB

dow.head()

	Price
Date
1914-12-01	55.00
1915-01-01	56.55
1915-02-01	56.00
1915-03-01	58.30
1915-04-01	66.45

#average yearly price ('YE' = year end; the older alias 'Y' is deprecated)
dow.resample('YE').mean()

	Price
Date
1914-12-31	55.000000
1915-12-31	74.329167
1916-12-31	94.791667
1917-12-31	87.729167
1918-12-31	81.066667
1919-12-31	99.770833
1920-12-31	90.116667
1921-12-31	73.375000
1922-12-31	92.962500
1923-12-31	94.575000
1924-12-31	99.858333
1925-12-31	134.225000
1926-12-31	152.933333
1927-12-31	175.570833
1928-12-31	226.454167
1929-12-31	307.570833
1930-12-31	236.008333
1931-12-31	137.950000
1932-12-31	64.229167
1933-12-31	83.462500
1934-12-31	97.904167
1935-12-31	120.158333
1936-12-31	161.470833
1937-12-31	165.641667
1938-12-31	132.000000
1939-12-31	141.516667
1940-12-31	134.837500
1941-12-31	121.704167
1942-12-31	107.308333
1943-12-31	134.940000
1944-12-31	143.153333
1945-12-31	169.766667
1946-12-31	190.692500
1947-12-31	177.541667
1948-12-31	180.337500
1949-12-31	179.050833
1950-12-31	216.305833
1951-12-31	257.635000
1952-12-31	270.763333
1953-12-31	275.965000
1954-12-31	333.960833
1955-12-31	442.717500
1956-12-31	493.010000
1957-12-31	475.707500
1958-12-31	491.659167
1959-12-31	632.117500
1960-12-31	618.875000
1961-12-31	691.554167
1962-12-31	639.759167
1963-12-31	714.808333
1964-12-31	834.053333
1965-12-31	910.882500
1966-12-31	873.601667
1967-12-31	879.120000
1968-12-31	905.746667

#quarterly maximum price ('QE' = quarter end; the older alias 'Q' is deprecated)
dow.resample('Q' + 'E').max() if False else dow.resample('QE').max()

	Price
Date
1914-12-31	55.00
1915-03-31	58.30
1915-06-30	68.40
1915-09-30	85.50
1915-12-31	97.00
...	...
1967-12-31	907.54
1968-03-31	884.77
1968-06-30	906.82
1968-09-30	922.80
1968-12-31	965.39

217 rows × 1 columns
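To make the link between resample and groupby concrete, here is a small sketch on made-up monthly data (the frame toy is hypothetical). It uses the 'YS' (year start) alias so that each group is labelled by the first day of the year; grouping on the index's year gives the same averages.

```python
import pandas as pd

# hypothetical monthly prices: 24 months starting January 2020
idx = pd.date_range('2020-01-01', periods=24, freq='MS')
toy = pd.DataFrame({'Price': range(24)}, index=idx)

# time-based grouping with resample
by_resample = toy.resample('YS').mean()

# the same grouping written as a groupby on the year of the index
by_groupby = toy.groupby(toy.index.year).mean()

print(by_resample['Price'].tolist())  # [5.5, 17.5]
print(by_groupby['Price'].tolist())   # [5.5, 17.5]
```

The values match; the difference is the index. resample keeps timestamps as labels, while the groupby version is labelled by the integers 2020 and 2021.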

Introduction to `matplotlib`#

Now, let us turn our attention to plotting data. We begin with basic plots, and later explore some customization and additional plots. For these exercises, we will use the stock price data and a dataset about Antarctic penguins from the seaborn library.

import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset('penguins')

Line Plots with Matplotlib#

To begin, select the bill_length_mm column of the data.

penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB

### bill length
bill_length = penguins['bill_length_mm']

### plt.plot
plt.plot(bill_length)

[<matplotlib.lines.Line2D at 0x135e471d0>]

../_images/34c4ca0ed3708a73f0f2dce9ce0cb0bb0b9716634b20781c4c202d024ebd7acc.png

### use the series
bill_length.plot()

<Axes: >

#plot dow jones Price with matplotlib
plt.plot(dow['Price'])

[<matplotlib.lines.Line2D at 0x136016f30>]

../_images/3a6fb4b8b4eabe3d950f1dd454ac71454388752b892c4825ca019c121074b74e.png

#plot dow jones data from series
dow.plot(figsize = (10, 4))
plt.grid();

../_images/76ad3f78f273f9e2bf242ac7c9e889545ef547cb8e61152696fcd564c990fe38.png

Choosing A Plot#

Below, plots are shown first for single quantitative variables, then for single categorical variables. After that come plots for two continuous variables, for one continuous and one categorical variable, and for any mix of continuous and categorical variables.

penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female

Histogram#

A histogram is an approximate representation of the distribution of numerical data. This is a plot we use for any single continuous feature to better understand the shape of the data.

### bill length histogram
plt.hist(bill_length)

(array([ 9., 40., 57., 48., 49., 55., 61., 16.,  5.,  2.]),
 array([32.1 , 34.85, 37.6 , 40.35, 43.1 , 45.85, 48.6 , 51.35, 54.1 ,
        56.85, 59.6 ]),
 <BarContainer object of 10 artists>)

../_images/b06e2e7f6b2f312fab3029e771805c5ba52aefbb274bea4a6e1b8d4051b94792.png

### as a method with the series
bill_length.hist()
plt.title('Bill Length (mm)');

../_images/4dd2138dfcd5fa378c3f402e22e8660bf117223600db9fdbea37d36dcbf0e3cd.png

### adjusting the bin number
plt.hist(bill_length, bins = 100);

../_images/1f9c4692709c62ba836e477a990a62e515ac094897412dbde0db037f60e30ec7.png
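The counts and bin edges that plt.hist returns (as in the output of the first histogram above) can also be computed without drawing anything, using np.histogram. The sketch below uses a small made-up array to show how the bin count changes the result:

```python
import numpy as np

# made-up values for illustration
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 10])

# three equal-width bins between the minimum (1) and maximum (10)
counts, edges = np.histogram(values, bins=3)
print(counts)  # [6 3 1]
print(edges)   # [ 1.  4.  7. 10.]

# more bins give a finer, but noisier, picture of the same data
fine_counts, fine_edges = np.histogram(values, bins=9)
print(fine_counts.sum())  # 10
```

Every bin is closed on the left and open on the right, except the last one, which also includes the maximum value.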

### adding a title, labels, edgecolor, and alpha
plt.hist(bill_length, 
         edgecolor = 'black', 
         color = 'red', 
         alpha = 0.3)
plt.title('Bill Length (mm)')
plt.xlabel('Bill length (mm)')
plt.ylabel('Count');

../_images/84bf5d50d3b0b7b8cf3e4044b5c5c91cebb0c1d5485adcf3a2dd0ddadd5f1fde.png

penguins.hist();

Boxplot#

Similar to a histogram, a boxplot can be used on a single quantitative feature.

### boxplot of bill length
plt.boxplot(bill_length);

../_images/45a5a04c64839df310b3f212d59922d615e7581d5a6827e3aff653ef65ea2ae1.png

### WHOOPS -- let's try this without null values
plt.boxplot(bill_length.dropna());

../_images/82582df96e7255d8882c5776bc8d36b8f0c2456ed6ece18a3106735138435955.png

### Make a horizontal version of the plot
plt.boxplot(bill_length.dropna(), vert = False);

../_images/55d5aa8e921bcbd09b3762a255e9132eceacec57c0e3f5a4477a406fe9541dec.png
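A boxplot is a picture of a few summary numbers: the box runs from the first quartile to the third, the line inside marks the median, and with matplotlib's default settings points farther than 1.5 times the interquartile range from the box are drawn individually as outliers. A sketch of that arithmetic on made-up values:

```python
import numpy as np

# made-up values with one unusually large observation
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# quartiles and interquartile range
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# fences used to flag outliers (1.5 is matplotlib's default whisker factor)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)  # 3.0 5.0 7.0
print(outliers)        # [30.]
```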

Bar Plot#

A bar plot can be used to summarize a single categorical variable, for example by showing the count of each unique category in a categorical feature.

### counts of species
penguins['species'].value_counts()

species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64

### horizontal barplot of counts
penguins['species'].value_counts().plot(kind = 'barh')

<Axes: ylabel='species'>

../_images/119dba755c4db77f1352759c2db91d4806f5c977b728b6701d8573cab6573c16.png
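The same chart can be built directly with matplotlib by passing the categories and their counts to a bar call. A minimal sketch, using a short made-up Series in place of the penguin data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# made-up category labels for illustration
species = pd.Series(['Adelie', 'Gentoo', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo'])

# value_counts sorts from most to least frequent
counts = species.value_counts()

# one vertical bar per category
fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
ax.set_ylabel('count')

print(counts.to_dict())  # {'Adelie': 3, 'Gentoo': 2, 'Chinstrap': 1}
```

Swapping ax.bar for ax.barh gives the horizontal layout used above.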

Two Variable Plots#

penguins.head(2)

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female

Scatterplot#

Two continuous features can be compared using scatterplots. Typically, one is interested in whether a relationship between the features exists and, if so, in its strength and direction.

### bill length vs. bill depth
x = bill_length
y = penguins['bill_depth_mm']

### scatterplot of x vs. y
plt.scatter(x, y)

<matplotlib.collections.PathCollection at 0x1362c5670>

../_images/b74e8e282d63df82f8fc70ada2f8b2b1b8ff3bee7b15c419084410ff1858eb14.png

`pandas.plotting`#

There is no single quick plot in matplotlib for comparing all numeric features in a dataset at once. Instead, pandas.plotting provides a scatter_matrix function that serves this purpose.

from pandas.plotting import scatter_matrix

### scatter matrix of penguin data
scatter_matrix(penguins);

### adding arguments and changing size
scatter_matrix(penguins, diagonal = 'kde', figsize = (10, 10));
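scatter_matrix returns a grid of Axes with one row and one column per numeric feature. As a rough sketch on made-up data (the frame toy and its columns are hypothetical), selecting the numeric columns explicitly makes it clear which features end up in the grid:

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# made-up data: two independent columns, one dependent column, one text column
rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(size=50),
                    'b': rng.normal(size=50),
                    'label': ['x', 'y'] * 25})
toy['c'] = toy['a'] * 2 + rng.normal(scale=0.1, size=50)

# keep only the numeric columns, then draw the grid
numeric = toy.select_dtypes('number')
axes = scatter_matrix(numeric, figsize=(6, 6))

print(axes.shape)  # (3, 3)
```

The diagonal shows a histogram of each feature by default, and each off-diagonal panel is a scatterplot of one pair.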

PROBLEMS

iris = sns.load_dataset('iris')

iris.head(2)

Problem 1: Histogram of petal_length

Problem 2: Scatter plot of sepal_length vs. sepal_width.

Problem 3: New column where

setosa -> blue
virginica -> green
versicolor -> orange

iris['colors'] = iris['species'].replace({'setosa': 'blue', 'virginica': 'green', 'versicolor': 'orange'})

Problem 4: Scatterplot of sepal_length vs petal_length colored by species.

Subplots and Axes#

### create a 1 row 2 column plot

### add a plot to each axis
fig, ax = plt.subplots(1, 2)

### create a 2 x 2 grid of plots
### add histogram to bottom right plot
fig, ax = plt.subplots(2, 2, figsize = (10, 8))
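One possible way to fill in a grid like this is sketched below, using made-up random data. With more than one row and column, plt.subplots returns a 2D array of Axes, so each panel is addressed by its row and column position.

```python
import matplotlib.pyplot as plt
import numpy as np

# made-up data for illustration
rng = np.random.default_rng(42)
data = rng.normal(size=200)

# 2 x 2 grid; ax is a 2D array indexed as ax[row, column]
fig, ax = plt.subplots(2, 2, figsize=(10, 8))
ax[0, 0].plot(data)                      # top left: line plot
ax[0, 1].boxplot(data)                   # top right: boxplot
ax[1, 0].scatter(data[:-1], data[1:])    # bottom left: each value vs. the next
ax[1, 1].hist(data, edgecolor='black')   # bottom right: histogram
ax[1, 1].set_title('Histogram')
fig.tight_layout()

print(ax.shape)  # (2, 2)
```

For a single row, as in plt.subplots(1, 2), the returned array is one-dimensional, so the panels are ax[0] and ax[1].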

Summary#

Great job! We will get practice plotting in this week's homework and examine some other libraries and approaches during class next week. For now, make sure you are familiar with the basic plots above (histogram, boxplot, bar plot, and scatterplot) and when to use each.