More plotting with matplotlib and seaborn

More plotting with `matplotlib` and `seaborn`#

Today we continue to work with matplotlib, focusing on customization and using subplots. Also, the seaborn library will be introduced as a second visualization library with additional functionality for plotting data.

#!pip install -U seaborn

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Subplots and Axes#

### create a 1 row 2 column plot

### add a plot to each axis
fig, ax = plt.subplots(1, 2)

### create a 2 x 2 grid of plots
### add histogram to bottom right plot
fig, ax = plt.subplots(2, 2, figsize = (10, 8))

Exploratory Data Analysis#

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. – IBM

Introduction to `seaborn`#

The seaborn library is built on top of matplotlib and offers high level visualization tools for plotting data. Typically a call to the seaborn library looks like:

sns.plottype(data = DataFrame, x = x, y = y, additional arguments...)

### load a sample dataset on tips
tips = sns.load_dataset('tips')
tips.head(2)

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3

### boxplot of tips
sns.boxplot(data = tips, x = 'tip')

<Axes: xlabel='tip'>

_images/2d04d2f4bf827e5cd0ba7966d5d6c40877d23d9a4ba0e81567843118a26640d6.png

### boxplot of tips by day
sns.boxplot(data = tips, x = 'day', y = 'tip')
plt.title('Tips by Day');

_images/d92f9c291a6e59752958c56fe7d530d6baa311beec3a5ef4ede12f9e4956aa52.png

`hue`#

The hue argument works like a grouping helper with seaborn. Plots that have this argument will break the data into groups from the passed column and add an appropriate legend.

### boxplot of tips by day by smoker
sns.boxplot(data = tips, x = 'day', y = 'tip', hue = 'smoker')

<Axes: xlabel='day', ylabel='tip'>

_images/048fb472cdc98cfd3de7394534a1da780bd770cad3259eb7f289efd663eb297c.png

`displot`#

For visualizing one dimensional distributions of data.

### histogram of tips
sns.displot(data = tips, x = 'tip')

<seaborn.axisgrid.FacetGrid at 0x13c888080>

_images/e89a69c2a48d1b71e46fb4a08f03746ed7e986400ccb8f51fb2289690314c203.png

### kde plot
sns.displot(data = tips, x = 'tip', kind = 'kde')

<seaborn.axisgrid.FacetGrid at 0x13c829220>

_images/b0b3fc1ab8c2c56779bc2dd360f7b365076ca237e895b63e51c1966b03a41014.png

### empirical cumulative distribution plot of tips by smoker
sns.displot(data = tips, x = 'tip', kind = 'ecdf', hue = 'smoker')

<seaborn.axisgrid.FacetGrid at 0x13c6e2cf0>

_images/bf31c57b247d118f63ac220c3639adab0b69940b970ab702514c5492d1dce7ce.png

### using the col argument
sns.displot(data = tips, x = 'tip', col = 'smoker')

<seaborn.axisgrid.FacetGrid at 0x13c96c500>

_images/b7a040fc0cbf7015bd838ed5cd37a6a1a6cc57861aa98f28c6a5cde7edde84d9.png

#draw a histogram and a boxplot using seaborn on two axes
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
sns.histplot(data = tips, x = 'tip', ax = ax[0])
sns.boxplot(data = tips, x = 'day', y = 'tip', ax = ax[1])
ax[1].set_title('Boxplots')
fig.suptitle('This is a title for everything');

_images/31729df779f50669ca64969d90a606a0a727942580d00ebb1cc3c402d42e64a7.png

`relplot`#

For visualizing relationships.

### relplot of bill vs. tip
sns.relplot(data = tips, x = 'total_bill', y = 'tip')

<seaborn.axisgrid.FacetGrid at 0x13ec84a40>

_images/3af51126fb7d12d130b2cefeae0f7ff5fd8702561e7a471231e6545eccb20ebc.png

### regression plot
sns.regplot(data = tips, x ='total_bill', y = 'tip', lowess = True )

<Axes: xlabel='total_bill', ylabel='tip'>

_images/700bb5836e0278dd19085e89f28ce33e80aee1eab6280cad22959240abf7e60b.png

### swarm
sns.swarmplot(data = tips, x = 'smoker', y = 'tip')

<Axes: xlabel='smoker', ylabel='tip'>

_images/cf9849dc8afec68593ff2b04004dfe65e430645ad582a0bb987a00ee6e31d1dc.png

### violin plot
sns.violinplot(data = tips, x = 'smoker', y = 'tip')

<Axes: xlabel='smoker', ylabel='tip'>

_images/b873f1607951eb3d765bdb834e589f7d6776089f48aa9362799aee95a306103a.png

### countplot
sns.countplot(data = tips, x = 'smoker');

_images/db03ba76d33eab5ee17838125d905f5753847694aaccb975366896f4cf388658.png

Additional Plots#

pairplot
heatmap

penguins = sns.load_dataset('penguins').dropna()

### pairplot of penguins colored by species
sns.pairplot(data = penguins, hue = 'species', diag_kind = 'kde')

<seaborn.axisgrid.PairGrid at 0x1473f30b0>

_images/856490453c1eb20d8e0b5eb150f057d435ab137eb5def7afd66e9074d160aa2d.png

### housing data
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame = True).frame
housing.head()

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23	4.526
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22	3.585
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24	3.521
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25	3.413
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25	3.422

Plotting Correlations#

Correlation captures the strength of a linear relationship between features. Often, this is easier to look at than a scatterplot of the data to establish relationships, however recall that this is only a detector for linear relationships!

### correlation in data
housing.corr(numeric_only = True)

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
MedInc	1.000000	-0.119034	0.326895	-0.062040	0.004834	0.018766	-0.079809	-0.015176	0.688075
HouseAge	-0.119034	1.000000	-0.153277	-0.077747	-0.296244	0.013191	0.011173	-0.108197	0.105623
AveRooms	0.326895	-0.153277	1.000000	0.847621	-0.072213	-0.004852	0.106389	-0.027540	0.151948
AveBedrms	-0.062040	-0.077747	0.847621	1.000000	-0.066197	-0.006181	0.069721	0.013344	-0.046701
Population	0.004834	-0.296244	-0.072213	-0.066197	1.000000	0.069863	-0.108785	0.099773	-0.024650
AveOccup	0.018766	0.013191	-0.004852	-0.006181	0.069863	1.000000	0.002366	0.002476	-0.023737
Latitude	-0.079809	0.011173	0.106389	0.069721	-0.108785	0.002366	1.000000	-0.924664	-0.144160
Longitude	-0.015176	-0.108197	-0.027540	0.013344	0.099773	0.002476	-0.924664	1.000000	-0.045967
MedHouseVal	0.688075	0.105623	0.151948	-0.046701	-0.024650	-0.023737	-0.144160	-0.045967	1.000000

### heatmap of correlations
plt.figure(figsize = (15, 5))
sns.heatmap(housing.corr(numeric_only=True)[['MedHouseVal']].sort_values(by = 'MedHouseVal', ascending = False), annot = True)

<Axes: >

_images/ce1113ceaf329adce06aceba1d905a2e751d3e9340bc970514f395ae8d899033.png

Problems#

Use the diabetes data below loaded from OpenML (docs).

from sklearn.datasets import fetch_openml

diabetes = fetch_openml(data_id = 37).frame

diabetes.head()

	preg	plas	pres	skin	insu	mass	pedi	age	class
0	6	148	72	35	0	33.6	0.627	50	tested_positive
1	1	85	66	29	0	26.6	0.351	31	tested_negative
2	8	183	64	0	0	23.3	0.672	32	tested_positive
3	1	89	66	23	94	28.1	0.167	21	tested_negative
4	0	137	40	35	168	43.1	2.288	33	tested_positive

Plot distribution of ages separated by class.

Use the plots below to determine which feature has the most distinct difference between classes?

fig, ax = plt.subplots(2, 4, figsize = (20, 10))
colnum = 0
for row in range(2):
    for col in range(4):
        sns.histplot(data = diabetes, x = diabetes.iloc[:,colnum], hue = 'class', ax =  ax[row, col])
        ax[row,col].set_title(diabetes.columns.tolist()[colnum])
        colnum += 1

_images/8462f44f03510f9ace115e0f7b73546b906b19f6a95c09edb0f4b6494b786564.png

Head over the the seaborn documentation here. Find a different plot type or function and implement it using the diabetes data.

Partner Exercise#

What time of year is most popular for bike rentals?
What’s the most popular day of the week for bike rentals?
What’s the frequency of use for the average user?
What are the most and least congested bike stations?

bikeshare_hour = pd.read_csv('data/hour.csv')

bikeshare_hour.head()

	instant	dteday	season	mnth	hr	weekday	weathersit	temp	atemp	hum	casual	registered	cnt
0	1	2011-01-01	1	1	0	6	1	0.24	0.2879	0.81	3	13	16
1	2	2011-01-01	1	1	1	6	1	0.22	0.2727	0.80	8	32	40
2	3	2011-01-01	1	1	2	6	1	0.22	0.2727	0.80	5	27	32
3	4	2011-01-01	1	1	3	6	1	0.24	0.2879	0.75	3	10	13
4	5	2011-01-01	1	1	4	6	1	0.24	0.2879	0.75	0	1	1

More plotting with matplotlib and seaborn

Contents

More plotting with matplotlib and seaborn#