# `datetime` and `matplotlib` intro

This lesson rounds out the introductory pandas work and introduces our basic plotting library `matplotlib`.  

**OBJECTIVES**

- Understand and use `datetime` objects in pandas DataFrames
- Use `matplotlib` to produce basic plots from data
- Understand when to use histograms, boxplots, line plots, and scatterplots with data


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## `datetime`

pandas has a dedicated data type for values that represent dates and times.  We can convert a column to this type using `pd.to_datetime`, and then access date attributes such as the month and the day through the `.dt` accessor.
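As a minimal sketch of the idea, the cell below uses a tiny made-up DataFrame (the `date` and `price` columns and their values are invented for illustration and are not taken from the AAPL file). It converts a column of strings with `pd.to_datetime`, extracts date parts, and selects one year from a sorted datetime index.

```python
import pandas as pd

# made-up data for illustration only
df = pd.DataFrame({
    "date": ["2019-01-02", "2019-02-15", "2020-03-31"],
    "price": [10.0, 12.5, 11.0],
})

# strings -> datetime64 values
df["date"] = pd.to_datetime(df["date"])

# the .dt accessor exposes datetime attributes on a Series
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day

# with a sorted DatetimeIndex, a year string selects every row in that year
df = df.set_index("date").sort_index()
rows_2019 = df.loc["2019"]
```

The same steps should carry over to any column of date-like strings, though unusual date formats may need the `format` argument of `pd.to_datetime`.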

In [None]:
# read in the AAPL data
url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/AAPL.csv'

#read_csv


In [None]:
#examine info

In [None]:
# convert to datetime


In [None]:
# extract the month


In [None]:
# extract the day


In [None]:
# set date to be index of data


In [None]:
# sort the index


In [None]:
#see if things have changed
aapl.info()

In [None]:
# select 2019


In [None]:
# read back in using parse_dates = True and index_col = 0


In [None]:
from datetime import datetime

In [None]:
# what time is it?
then = datetime.now()
then

In [None]:
# how much time has passed?
datetime.now() - then

### More with timestamps

- Date times: A specific date and time with timezone support. Similar to `datetime.datetime` from the standard library.

- Time deltas: An absolute time duration. Similar to `datetime.timedelta` from the standard library.
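A short sketch of both ideas, using an arbitrary example date. It also shows `pd.DateOffset`, which is one way to shift by calendar months; it is brought in here as an extra because a fixed-length `Timedelta` cannot express "three months".

```python
import pandas as pd

start = pd.Timestamp("2024-01-31")      # arbitrary example date

one_week = pd.Timedelta("1W")           # fixed duration: exactly 7 days
later = start + one_week

# months vary in length, so a calendar shift uses pd.DateOffset instead
three_months_on = start + pd.DateOffset(months=3)

# subtracting two timestamps gives back a Timedelta
elapsed = later - start
```

Note that the month shift lands on April 30, since April has no 31st day.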


In [None]:
# create a pd.Timedelta
delta = pd.Timedelta('1W')

In [None]:
# shift a date by one week
datetime.now() + delta

#### Problems

In [None]:
ufo_url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/ufo.csv'

1. Return to the ufo data and convert the Time column to a datetime object.

2. Set the Time column as the index column of the data.

3. Sort the data by its index.

4. Create a new dataframe with ufo sightings since January 1, 1999

### Grouping with Dates

DataFrames whose index is a datetime index support an operation similar to `groupby`.  This is the `resample` method, and its groups are time periods such as a week, month, or year.
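To make the comparison with `groupby` concrete, here is a small sketch on a made-up daily series (the `Price` values are just a counter, not real prices). It uses the `'MS'` (month start) alias, chosen here because it behaves the same across recent pandas versions.

```python
import numpy as np
import pandas as pd

# made-up daily series covering January through March 2024
idx = pd.date_range("2024-01-01", "2024-03-31", freq="D")
prices = pd.DataFrame({"Price": np.arange(len(idx), dtype=float)}, index=idx)

# one row per calendar month, labeled by the first day of the month
monthly_mean = prices.resample("MS").mean()

# the same numbers via groupby on the month number
by_month = prices.groupby(prices.index.month)["Price"].mean()
```

Both results hold the same three averages; `resample` keeps a datetime index, while the `groupby` version is indexed by month number.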

In [None]:
dow = sns.load_dataset('dowjones')

In [None]:
#check the info
dow.info()

In [None]:
#handle the index
dow.set_index('Date', inplace = True)

In [None]:
#check that things changed
dow.info()

In [None]:
dow.head()

In [None]:
#average yearly price


In [None]:
#quarterly maximum price ('Q' is deprecated in pandas 2.2+, which prefers the alias 'QE')
dow.resample('Q').max()

## Introduction to `matplotlib`

Now, let us turn our attention to plotting data.  We begin with basic plots, and later explore some customization and additional plots.  For these exercises, we will use the stock price data and a dataset about Antarctic penguins from the `seaborn` library.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
penguins = sns.load_dataset('penguins')

### Line Plots with Matplotlib

To begin, select the `bill_length_mm` column of the data.  
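As a rough sketch of what a line plot of a single column does, the cell below uses a short made-up Series in place of the penguin column (the values and the name `length_mm` are placeholders). When `plt.plot` receives one Series, the index goes on the x-axis and the values on the y-axis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# made-up measurements standing in for a numeric column
values = pd.Series([39.1, 39.5, 40.3, np.nan, 36.7, 39.3], name="length_mm")

# x defaults to the index, y to the values; the NaN leaves a gap in the line
lines = plt.plot(values)
plt.xlabel("row index")
plt.ylabel(values.name)

# the Series method draws the same kind of line plot: values.plot()
```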

In [None]:
penguins.info()

In [None]:
### bill length
bill_length = penguins['bill_length_mm']

In [None]:
### plt.plot


In [None]:
### use the series


In [None]:
#plot dow jones Price with matplotlib


In [None]:
#plot dow jones data from series


#### Choosing A Plot

Below, plots are shown first for single quantitative variables, then for single categorical variables.  Next come plots for two continuous variables, one continuous vs. one categorical, and any mix of continuous and categorical.

#### Histogram

A histogram *is an approximate representation of the distribution of numerical data*.  This is a plot we use for any single continuous feature to better understand the shape of the data.  
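One way to see what the histogram is doing is to look at what `plt.hist` returns. The sketch below uses a made-up random sample (not the penguin data): each bar height is simply the number of values that fall inside that bin.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up sample: 500 draws from a normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=44, scale=5, size=500)

# plt.hist returns the bar heights, the bin edges, and the drawn patches
counts, edges, patches = plt.hist(sample, bins=20, edgecolor="black")
plt.xlabel("value")
plt.ylabel("count")
```

With 20 bins there are 21 bin edges, and the bar heights add up to the sample size.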

In [None]:
### bill length histogram
plt.hist(bill_length)

In [None]:
### as a method with the series
bill_length.hist()

In [None]:
### adjusting the bin number
plt.hist(bill_length, bins = 100);

In [None]:
### adding a title, labels, edgecolor, and alpha
plt.hist(bill_length, 
         edgecolor = 'black', 
         color = 'red', 
         alpha = 0.3)
plt.title('Bill Length (mm)');

In [None]:
penguins.hist();

#### Boxplot

Similar to a histogram, a boxplot can be used on a single quantitative feature.  It summarizes the distribution with the median, the quartiles, and any outliers.
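The sketch below, on a handful of made-up numbers, computes the quantities a boxplot draws: the quartiles for the box, and the upper fence that matplotlib uses by default to decide which points are shown as outliers. Missing values are dropped first.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# made-up values with one missing entry and one unusually large value
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan, 30])
clean = data.dropna()

# the box spans q1 to q3, with a line at the median
q1, median, q3 = clean.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr     # default whisker rule: 1.5 times the IQR

result = plt.boxplot(clean)

# points beyond the whiskers are drawn individually as "fliers"
outliers = result["fliers"][0].get_ydata()
```

Here the value 30 sits above the upper fence, so it appears as a separate point rather than inside the whisker.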

In [None]:
### boxplot of bill length
plt.boxplot(bill_length);

In [None]:
### WHOOPS -- let's try this without null values
plt.boxplot(bill_length.dropna());

In [None]:
### Make a horizontal version of the plot
plt.boxplot(bill_length.dropna(), vert = False);

#### Bar Plot

A bar plot can be used to summarize a single categorical variable.  For example, if you want the counts of each unique category in a categorical feature. 
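A small sketch of the counting step followed by the plot, using a made-up categorical column. It calls `plt.bar` directly, which is an alternative to the `.plot(kind = 'bar')` Series method.

```python
import pandas as pd
import matplotlib.pyplot as plt

# made-up categorical column
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# one count per unique category, sorted from most to least frequent
counts = colors.value_counts()

# category labels on the x-axis, counts as bar heights
bars = plt.bar(counts.index, counts.values)
plt.ylabel("count")
```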

In [None]:
### counts of species
penguins['species'].value_counts()

In [None]:
### barplot of counts
penguins['species'].value_counts().plot(kind = 'bar')

#### Two Variable Plots

In [None]:
penguins.head(2)

#### Scatterplot

Two continuous features can be compared using scatterplots.  Typically, one is interested in whether a relationship between the features exists and, if so, in its strength and direction.
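Strength and direction can also be put into a number. The sketch below builds made-up data in which `y` rises with `x`, draws the scatterplot, and computes the Pearson correlation with `np.corrcoef`. Using correlation here is an illustrative choice, and it only captures linear relationships.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up data: y rises with x, plus random noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 2 * x + rng.normal(0, 2, size=200)

plt.scatter(x, y, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")

# Pearson correlation: the sign gives direction, the magnitude gives strength
r = np.corrcoef(x, y)[0, 1]
```

A value of `r` near 1 goes with a tight upward-sloping cloud of points; a value near 0 goes with no visible linear trend.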

In [None]:
### bill length vs. bill depth
x = bill_length
y = penguins['bill_depth_mm']

In [None]:
### scatterplot of x vs. y
plt.scatter(x, y)

#### `pandas.plotting`

There is no quick, easy plot in `matplotlib` to compare all numeric features in a dataset.  Instead, `pandas.plotting` has a `scatter_matrix` function that serves this purpose.

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
### scatter matrix of penguin data
scatter_matrix(penguins);

In [None]:
### adding arguments and changing size
scatter_matrix(penguins, diagonal = 'kde', figsize = (10, 10));

**PROBLEMS**

In [None]:
iris = sns.load_dataset('iris')

In [None]:
iris.head(2)

**Problem 1**: Histogram of `petal_length`

**Problem 2**: Scatter plot of `sepal_length` vs. `sepal_width`.

**Problem 3**: New column where 

```
setosa -> blue
virginica -> green
versicolor -> orange
```

In [None]:
iris['colors'] = iris['species'].replace({'setosa': 'blue', 'virginica': 'green', 'versicolor': 'orange'})

**Problem 4**: Scatterplot of `sepal_length` vs `petal_length` colored by species.

#### Subplots and Axes

![](https://matplotlib.org/stable/_images/users-explain-axes-index-1.2x.png)
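A minimal sketch of working with a grid of axes, using made-up random data. `plt.subplots` returns the figure and the axes together; with a 2 x 2 grid the axes come back as a 2 x 2 array indexed by row and column, while a 1 x 2 grid gives a one-dimensional array indexed as `ax[0]` and `ax[1]`.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
values = rng.normal(size=300)           # made-up data

# with 2 rows and 2 columns, ax is a 2 x 2 array of Axes objects
fig, ax = plt.subplots(2, 2, figsize=(10, 8))

ax[0, 0].plot(values)                   # top left
ax[1, 1].hist(values, bins=20)          # bottom right
ax[1, 1].set_title("bottom right")

fig.tight_layout()                      # reduce overlap between panels
```

Each `Axes` object has its own plotting and labeling methods (`plot`, `hist`, `set_title`, and so on), which is what lets every panel be customized separately.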

In [None]:
### create a 1 row 2 column plot


In [None]:
### add a plot to each axis
fig, ax = plt.subplots(1, 2)


In [None]:
### create a 2 x 2 grid of plots
### add histogram to bottom right plot
fig, ax = plt.subplots(2, 2, figsize = (10, 8))


#### Summary

Great job!  We will get practice plotting in this week's homework and examine some other libraries and approaches during class next week.  For now, make sure you are familiar with the basic plots above -- histogram, boxplot, bar plot, scatterplot -- and when to use each.