More plotting with matplotlib and seaborn

More plotting with `matplotlib` and `seaborn`#

Today we continue working with matplotlib, focusing on customization and subplots. We also introduce the seaborn library, a second visualization library with additional functionality for plotting data.

#!pip install -U seaborn

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Subplots and Axes#

### create a 1 row 2 column plot
plt.subplots(nrows = 1, ncols = 2)

(<Figure size 640x480 with 2 Axes>, array([<Axes: >, <Axes: >], dtype=object))

../_images/a491e9097aa850e50be40c874faaf8018fdc5506b4ff8e1cdeefdb81ccf96c24.png

### add a plot to each axis
fig, ax = plt.subplots(1, 2, figsize = (10, 5))
ax[0].plot([1, 2, 3])
ax[0].set_title('Some Line')
ax[1].hist(np.random.random(100))
fig.suptitle('A Super Title');

../_images/ad194a4e8e9194ed2afafa6206152859e9af0331a23c5d253031b3d4dc12f044.png

### create a 2 x 2 grid of plots
### add histogram to bottom right plot
fig, ax = plt.subplots(2, 2, figsize = (10, 8))
ax[1, 1].hist(np.random.random(100))

(array([11., 12.,  5.,  8., 19., 11., 10.,  9.,  6.,  9.]),
 array([0.00821536, 0.10686283, 0.2055103 , 0.30415776, 0.40280523,
        0.5014527 , 0.60010017, 0.69874764, 0.7973951 , 0.89604257,
        0.99469004]),
 <BarContainer object of 10 artists>)

../_images/d3e06892cd6a4de95d97d9012554c0d82cb2364abc2f1f177b6c4a22316b8ff9.png
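A minimal sketch of how the array returned by `plt.subplots` can be handled, assuming only `numpy` and `matplotlib`. With one row the axes come back as a 1-D array; with a 2 x 2 grid they come back as a 2-D array, and `ax.flat` iterates over either shape. The panel titles and labels are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

# one row, two columns: a 1-D array of Axes, indexed as ax1[0], ax1[1]
fig1, ax1 = plt.subplots(1, 2, figsize=(10, 5))

# two rows, two columns: a 2-D array of Axes, indexed as ax2[row, col]
fig2, ax2 = plt.subplots(2, 2, figsize=(10, 8), sharey=True)

# ax.flat visits every Axes in row-major order, whatever the grid shape
rng = np.random.default_rng(0)
for i, axis in enumerate(ax2.flat):
    axis.hist(rng.random(100))
    axis.set_title(f'Panel {i}')   # placeholder title
    axis.set_xlabel('value')       # placeholder label

fig2.suptitle('Four histograms')
fig2.tight_layout()   # keep titles and labels from overlapping
```

Passing `sharey=True` is optional; it gives every panel the same y-axis limits, which makes the histograms easier to compare.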

Exploratory Data Analysis#

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. – IBM

Introduction to `seaborn`#

The seaborn library is built on top of matplotlib and offers high-level visualization tools for plotting data. Typically, a call to the seaborn library looks like:

sns.plottype(data = DataFrame, x = x, y = y, additional arguments...)

### load a sample dataset on tips
tips = sns.load_dataset('tips')
tips.head(2)

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3

### boxplot of tips
sns.boxplot(data = tips, x = 'tip')

<Axes: xlabel='tip'>

../_images/2d04d2f4bf827e5cd0ba7966d5d6c40877d23d9a4ba0e81567843118a26640d6.png

### boxplot of tips by day
sns.boxplot(data = tips, x = 'day', y = 'tip')
plt.title('Tips by Day');

../_images/d92f9c291a6e59752958c56fe7d530d6baa311beec3a5ef4ede12f9e4956aa52.png

`hue`#

The hue argument works like a grouping helper with seaborn. Plots that have this argument will break the data into groups from the passed column and add an appropriate legend.

### boxplot of tips by day by sex
sns.boxplot(data = tips, x = 'day', y = 'tip', hue = 'sex')
plt.title('Tips by Day and Sex');

../_images/1912a8fc84f9f9e2c8cf8c77a19f4c601e4a69600aaafc0da1c7447acf125e71.png
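The grouping that `hue` applies can be mirrored with a pandas `groupby` over both columns. This is a rough sketch on a made-up toy frame (the values are placeholders, not the real tips data); each median below corresponds to the center line of one box in a grouped boxplot.

```python
import pandas as pd

# made-up placeholder rows shaped like the tips data
toy = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Thur', 'Thur', 'Fri', 'Fri', 'Fri', 'Fri'],
    'sex': ['Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'tip': [2.0, 3.0, 1.5, 2.5, 4.0, 2.0, 3.0, 3.5],
})

# one group per (x category, hue category) pair, like the boxes in the plot
medians = toy.groupby(['day', 'sex'])['tip'].median()
print(medians)
```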

`displot`#

For visualizing one-dimensional distributions of data.

### histogram of tips
sns.displot(data = tips, x = 'tip')

<seaborn.axisgrid.FacetGrid at 0x133b4e870>

../_images/e89a69c2a48d1b71e46fb4a08f03746ed7e986400ccb8f51fb2289690314c203.png

### kde plot
sns.displot(data = tips, x = 'tip', kind = 'kde')

<seaborn.axisgrid.FacetGrid at 0x132fbdd60>

../_images/b0b3fc1ab8c2c56779bc2dd360f7b365076ca237e895b63e51c1966b03a41014.png

### empirical cumulative distribution plot of tips by smoker
sns.displot(data = tips, x = 'tip', kind = 'ecdf', hue = 'smoker')

<seaborn.axisgrid.FacetGrid at 0x133b874a0>

../_images/bf31c57b247d118f63ac220c3639adab0b69940b970ab702514c5492d1dce7ce.png
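To make the ECDF concrete, here is a small sketch of the computation using only `numpy`: the height of the curve at a value is the share of observations less than or equal to it. The helper name `ecdf` and the input values are illustrative placeholders.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the cumulative proportion at each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# placeholder values, not taken from the tips data
x, y = ecdf([3.0, 1.0, 2.0, 2.0])
print(x)   # sorted observations
print(y)   # proportion of observations at or below each sorted value
```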

### using the col argument
sns.displot(data = tips, x = 'tip', col = 'smoker')

<seaborn.axisgrid.FacetGrid at 0x1331d4ef0>

../_images/b7a040fc0cbf7015bd838ed5cd37a6a1a6cc57861aa98f28c6a5cde7edde84d9.png

### draw a histogram and a boxplot using seaborn on two axes
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
sns.histplot(data = tips, x = 'tip', ax = ax[0])
sns.boxplot(data = tips, x = 'day', y = 'tip', ax = ax[1])
ax[1].set_title('Boxplots')
fig.suptitle('This is a title for everything');

../_images/31729df779f50669ca64969d90a606a0a727942580d00ebb1cc3c402d42e64a7.png

`relplot`#

For visualizing relationships.

### relplot of bill vs. tip
sns.relplot(data = tips, x = 'total_bill', y = 'tip')

<seaborn.axisgrid.FacetGrid at 0x133fe1c40>

../_images/3af51126fb7d12d130b2cefeae0f7ff5fd8702561e7a471231e6545eccb20ebc.png

### regression plot with a lowess smoother (lowess = True requires statsmodels)
sns.regplot(data = tips, x = 'total_bill', y = 'tip', lowess = True)

<Axes: xlabel='total_bill', ylabel='tip'>

../_images/700bb5836e0278dd19085e89f28ce33e80aee1eab6280cad22959240abf7e60b.png

### swarm
sns.swarmplot(data = tips, x = 'smoker', y = 'tip')

<Axes: xlabel='smoker', ylabel='tip'>

../_images/cf9849dc8afec68593ff2b04004dfe65e430645ad582a0bb987a00ee6e31d1dc.png

### violin plot
sns.violinplot(data = tips, x = 'smoker', y = 'tip')
sns.swarmplot(data = tips, x = 'smoker', y = 'tip')

<Axes: xlabel='smoker', ylabel='tip'>

../_images/127795105ddeaf49e9cc22ebc31b4092fef6985175e335944019921dfe33d878.png

### countplot
sns.countplot(data = tips, x = 'smoker', hue = 'sex');

../_images/42d3b431ce52b8dde1f9892d0223cd7f0c531a1808abfa10cd1abcf2f64712fe.png

Additional Plots#

pairplot
heatmap

penguins = sns.load_dataset('penguins').dropna()

### pairplot of penguins colored by species
sns.pairplot(data = penguins, hue = 'species', diag_kind = 'kde')

<seaborn.axisgrid.PairGrid at 0x133f0aa20>

../_images/856490453c1eb20d8e0b5eb150f057d435ab137eb5def7afd66e9074d160aa2d.png

### housing data
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame = True).frame
housing.head()

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23	4.526
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22	3.585
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24	3.521
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25	3.413
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25	3.422

Plotting Correlations#

Correlation captures the strength of a linear relationship between features. A table of correlations is often easier to scan than a set of scatterplots when looking for relationships. Recall, however, that correlation only detects linear relationships!
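A quick illustration of that caveat, assuming only `numpy`: a variable that depends exactly on `x` through a symmetric curve has a correlation of essentially zero with it, while an exact linear dependence gives a correlation of one.

```python
import numpy as np

x = np.linspace(-1, 1, 201)
y_linear = 2 * x + 1   # exact linear relationship
y_curved = x ** 2      # exact relationship, but non-linear and symmetric about 0

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]
print(r_linear, r_curved)
```

A near-zero correlation therefore does not rule out a relationship; a scatterplot of the two features is still worth a look.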

### correlation in data
housing.corr(numeric_only = True)

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
MedInc	1.000000	-0.119034	0.326895	-0.062040	0.004834	0.018766	-0.079809	-0.015176	0.688075
HouseAge	-0.119034	1.000000	-0.153277	-0.077747	-0.296244	0.013191	0.011173	-0.108197	0.105623
AveRooms	0.326895	-0.153277	1.000000	0.847621	-0.072213	-0.004852	0.106389	-0.027540	0.151948
AveBedrms	-0.062040	-0.077747	0.847621	1.000000	-0.066197	-0.006181	0.069721	0.013344	-0.046701
Population	0.004834	-0.296244	-0.072213	-0.066197	1.000000	0.069863	-0.108785	0.099773	-0.024650
AveOccup	0.018766	0.013191	-0.004852	-0.006181	0.069863	1.000000	0.002366	0.002476	-0.023737
Latitude	-0.079809	0.011173	0.106389	0.069721	-0.108785	0.002366	1.000000	-0.924664	-0.144160
Longitude	-0.015176	-0.108197	-0.027540	0.013344	0.099773	0.002476	-0.924664	1.000000	-0.045967
MedHouseVal	0.688075	0.105623	0.151948	-0.046701	-0.024650	-0.023737	-0.144160	-0.045967	1.000000

### heatmap of correlations
plt.figure(figsize = (15, 5))
sns.heatmap(housing.corr(numeric_only=True), annot = True)

<Axes: >

../_images/e1efa62cd4bdc80e02e14609130dd8e17e2fed0a356a921974ba6a82b530ea18.png
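One possible refinement of the heatmap, sketched on random placeholder data rather than the housing frame: since a correlation matrix is symmetric, the upper triangle can be masked, and fixing the color scale to the range -1 to 1 with a diverging colormap makes positive and negative correlations easier to tell apart. The choice of `coolwarm` is only a suggestion.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# random placeholder data; any numeric DataFrame works the same way
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=list('abcd'))
corr = toy.corr()

# True above the diagonal: those cells are hidden because they mirror the lower half
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
            vmin=-1, vmax=1, cmap='coolwarm', ax=ax)
ax.set_title('Correlations (lower triangle)')
```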

Problems#

Use the diabetes data below loaded from OpenML (docs).

from sklearn.datasets import fetch_openml

diabetes = fetch_openml(data_id = 37).frame

diabetes.head()

	preg	plas	pres	skin	insu	mass	pedi	age	class
0	6	148	72	35	0	33.6	0.627	50	tested_positive
1	1	85	66	29	0	26.6	0.351	31	tested_negative
2	8	183	64	0	0	23.3	0.672	32	tested_positive
3	1	89	66	23	94	28.1	0.167	21	tested_negative
4	0	137	40	35	168	43.1	2.288	33	tested_positive

Plot distribution of ages separated by class.

sns.displot(data = diabetes, x = 'age', hue = 'class')

<seaborn.axisgrid.FacetGrid at 0x133ac3860>

../_images/e53e630fd4e2beca6f06d9a72ca6d9d4aec8150586bb879a1e002b196d4c0884.png

Use the plots below to determine which feature shows the most distinct difference between classes.

fig, ax = plt.subplots(2, 4, figsize = (20, 10))
### one histogram per feature (the first 8 columns), split by class
for axis, column in zip(ax.flat, diabetes.columns[:8]):
    sns.histplot(data = diabetes, x = column, hue = 'class', ax = axis)
    axis.set_title(column)

../_images/8462f44f03510f9ace115e0f7b73546b906b19f6a95c09edb0f4b6494b786564.png

#Looks like plasma glucose concentration (plas) shows the biggest difference
#between the group that tested positive and the group that tested negative.
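A numeric cross-check can complement the visual comparison. The sketch below uses only the five rows printed by `diabetes.head()` above, which is far too few to draw conclusions from; the same lines could be run on the full `diabetes` frame. Ranking by the gap between class means in units of each feature's overall standard deviation is one possible measure, not the only one.

```python
import pandas as pd

# the five rows shown by diabetes.head() above
head = pd.DataFrame({
    'preg': [6, 1, 8, 1, 0],
    'plas': [148, 85, 183, 89, 137],
    'pres': [72, 66, 64, 66, 40],
    'skin': [35, 29, 0, 23, 35],
    'insu': [0, 0, 0, 94, 168],
    'mass': [33.6, 26.6, 23.3, 28.1, 43.1],
    'pedi': [0.627, 0.351, 0.672, 0.167, 2.288],
    'age': [50, 31, 32, 21, 33],
    'class': ['tested_positive', 'tested_negative', 'tested_positive',
              'tested_negative', 'tested_positive'],
})

# mean of every feature within each class
means = head.groupby('class').mean()

# gap between class means, scaled by each feature's overall standard deviation
spread = head.drop(columns='class').std()
gap = (means.loc['tested_positive'] - means.loc['tested_negative']) / spread
ranked = gap.abs().sort_values(ascending=False)
print(ranked)
```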

Head over to the seaborn documentation here. Find a different plot type or function and implement it using the diabetes data.

sns.relplot(data = diabetes, x = 'age', y = 'mass', kind = 'line', hue = 'class')

<seaborn.axisgrid.FacetGrid at 0x1380226c0>

../_images/fdcb34491fb7665752ec2fbcd280aa0732150f400a83fe44072bb32da0c7af57.png

Partner Exercise#

What time of year is most popular for bike rentals?
What’s the most popular day of the week for bike rentals?
What’s the frequency of use for the average user?
What are the most and least congested bike stations?

bikeshare_hour = pd.read_csv('data/hour.csv')

bikeshare_hour.head()
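As a possible starting point for the day-of-week question, here is a sketch on a made-up toy frame. The column names `dteday` (date) and `cnt` (rental count) are assumptions about `data/hour.csv` that the text above does not confirm, and the values are placeholders; adjust both to match the real file.

```python
import pandas as pd

# made-up placeholder rows; `dteday` and `cnt` are assumed column names
toy = pd.DataFrame({
    'dteday': ['2011-01-01', '2011-01-01', '2011-01-02', '2011-01-03'],
    'cnt': [16, 40, 32, 13],
})

# parse the dates, then label each row with its weekday name
toy['dteday'] = pd.to_datetime(toy['dteday'])
toy['day_name'] = toy['dteday'].dt.day_name()

# total rentals per weekday, largest first
rentals_by_day = toy.groupby('day_name')['cnt'].sum().sort_values(ascending=False)
print(rentals_by_day)
```

The same pattern with `dt.month` in place of `dt.day_name()` would group rentals by time of year.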
