More plotting with matplotlib and seaborn#

Today we continue working with matplotlib, focusing on customization and subplots. We also introduce seaborn, a second visualization library that adds higher-level functions for plotting data.

#!pip install -U seaborn
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Subplots and Axes#

### create a 1 row 2 column plot
plt.subplots(nrows = 1, ncols = 2)
(<Figure size 640x480 with 2 Axes>, array([<Axes: >, <Axes: >], dtype=object))
### add a plot to each axis
fig, ax = plt.subplots(1, 2, figsize = (10, 5))
ax[0].plot([1, 2, 3])
ax[0].set_title('Some Line')
ax[1].hist(np.random.random(100))
fig.suptitle('A Super Title');
### create a 2 x 2 grid of plots
### add histogram to bottom right plot
fig, ax = plt.subplots(2, 2, figsize = (10, 8))
ax[1, 1].hist(np.random.random(100))
(array([11., 12.,  5.,  8., 19., 11., 10.,  9.,  6.,  9.]),
 array([0.00821536, 0.10686283, 0.2055103 , 0.30415776, 0.40280523,
        0.5014527 , 0.60010017, 0.69874764, 0.7973951 , 0.89604257,
        0.99469004]),
 <BarContainer object of 10 artists>)
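When a grid has more than one row, `plt.subplots` returns a two-dimensional array of axes, so each panel is addressed as `ax[row, col]`. The sketch below fills every panel by looping over `ax.flat`. It is only an illustration: the data are randomly generated and the panel titles are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# seeded generator so the figure is reproducible
rng = np.random.default_rng(0)

fig, ax = plt.subplots(2, 2, figsize = (10, 8))
print(ax.shape)  # (2, 2)

# ax.flat walks the grid row by row: top-left, top-right, bottom-left, bottom-right
for i, axis in enumerate(ax.flat):
    axis.hist(rng.random(100))
    axis.set_title(f'Panel {i}')

fig.suptitle('Four Histograms')
fig.tight_layout()
```

Looping over `ax.flat` avoids writing one line per panel, which becomes more useful as the grid grows.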

Exploratory Data Analysis#

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. (Source: IBM)

Introduction to seaborn#

The seaborn library is built on top of matplotlib and offers high-level visualization tools for plotting data. A typical seaborn call looks like:

sns.plottype(data = DataFrame, x = x, y = y, additional arguments...)
### load a sample dataset on tips
tips = sns.load_dataset('tips')
tips.head(2)
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
### boxplot of tips
sns.boxplot(data = tips, x = 'tip')
<Axes: xlabel='tip'>
### boxplot of tips by day
sns.boxplot(data = tips, x = 'day', y = 'tip')
plt.title('Tips by Day');

hue#

The hue argument is seaborn's grouping helper. Plots that accept it split the data into groups according to the values in the given column, color each group differently, and add a matching legend.

### boxplot of tips by day by sex
sns.boxplot(data = tips, x = 'day', y = 'tip', hue = 'sex')
plt.title('Tips by Day and Sex');
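To see what the grouping amounts to, it can help to compare the plot with the equivalent pandas `groupby`. The sketch below is illustrative only and uses a small made-up DataFrame (not the tips data): it draws a boxplot with `hue`, reads the legend entries back from the axes, and computes the per-group medians that the boxes summarize.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# small made-up dataset, for illustration only
toy = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sat', 'Sat', 'Sun', 'Sun', 'Sun', 'Sun'],
    'sex': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'tip': [2.0, 3.0, 2.5, 3.5, 1.5, 4.0, 2.0, 5.0],
})

# hue = 'sex' draws one box per (day, sex) pair
fig, ax = plt.subplots()
sns.boxplot(data = toy, x = 'day', y = 'tip', hue = 'sex', ax = ax)

# the legend has one entry per group in the hue column
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)

# the same grouping, written out with pandas
print(toy.groupby(['day', 'sex'])['tip'].median())
```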

displot#

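displot is for visualizing one-dimensional distributions of data. Its kind argument selects a histogram (the default), a kernel density estimate ('kde'), or an empirical cumulative distribution ('ecdf').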

### histogram of tips
sns.displot(data = tips, x = 'tip')
<seaborn.axisgrid.FacetGrid at 0x133b4e870>
### kde plot
sns.displot(data = tips, x = 'tip', kind = 'kde')
<seaborn.axisgrid.FacetGrid at 0x132fbdd60>
### empirical cumulative distribution plot of tips by smoker
sns.displot(data = tips, x = 'tip', kind = 'ecdf', hue = 'smoker')
<seaborn.axisgrid.FacetGrid at 0x133b874a0>
### using the col argument
sns.displot(data = tips, x = 'tip', col = 'smoker')
<seaborn.axisgrid.FacetGrid at 0x1331d4ef0>
### draw a histogram and a boxplot using seaborn on two axes
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
sns.histplot(data = tips, x = 'tip', ax = ax[0])
sns.boxplot(data = tips, x = 'day', y = 'tip', ax = ax[1])
ax[1].set_title('Boxplots')
fig.suptitle('This is a title for everything');

relplot#

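relplot is for visualizing the relationship between two variables, as a scatterplot by default or as a line plot with kind = 'line'.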

### relplot of bill vs. tip
sns.relplot(data = tips, x = 'total_bill', y = 'tip')
<seaborn.axisgrid.FacetGrid at 0x133fe1c40>
### regression plot; lowess = True fits a locally weighted smoother instead of a straight line (requires statsmodels)
sns.regplot(data = tips, x ='total_bill', y = 'tip', lowess = True )
<Axes: xlabel='total_bill', ylabel='tip'>
### swarm
sns.swarmplot(data = tips, x = 'smoker', y = 'tip')
<Axes: xlabel='smoker', ylabel='tip'>
### violin plot with the individual points overlaid as a swarm
sns.violinplot(data = tips, x = 'smoker', y = 'tip')
sns.swarmplot(data = tips, x = 'smoker', y = 'tip')
<Axes: xlabel='smoker', ylabel='tip'>
### countplot
sns.countplot(data = tips, x = 'smoker', hue = 'sex');

Additional Plots#

  • pairplot

  • heatmap

penguins = sns.load_dataset('penguins').dropna()
### pairplot of penguins colored by species
sns.pairplot(data = penguins, hue = 'species', diag_kind = 'kde')
<seaborn.axisgrid.PairGrid at 0x133f0aa20>
### housing data
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame = True).frame
housing.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422

Plotting Correlations#

Correlation measures the strength of the linear relationship between two features. A table or heatmap of correlations is often easier to scan than a grid of scatterplots, but remember that correlation only detects linear relationships!
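A small synthetic example makes the "linear only" caveat concrete. In the sketch below (made-up data, not the housing data), y_quadratic is completely determined by x, yet its correlation with x is essentially zero because the relationship is not linear.

```python
import numpy as np

# x is symmetric around zero
x = np.linspace(-1, 1, 201)

y_linear = 2 * x + 1   # exact linear relationship
y_quadratic = x ** 2   # exact, but nonlinear, relationship

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(round(r_linear, 3))          # 1.0
print(round(abs(r_quadratic), 3))  # 0.0
```

A correlation near zero therefore does not rule out a relationship; a scatterplot is still worth a look.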

### correlation in data
housing.corr(numeric_only = True)
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
MedInc 1.000000 -0.119034 0.326895 -0.062040 0.004834 0.018766 -0.079809 -0.015176 0.688075
HouseAge -0.119034 1.000000 -0.153277 -0.077747 -0.296244 0.013191 0.011173 -0.108197 0.105623
AveRooms 0.326895 -0.153277 1.000000 0.847621 -0.072213 -0.004852 0.106389 -0.027540 0.151948
AveBedrms -0.062040 -0.077747 0.847621 1.000000 -0.066197 -0.006181 0.069721 0.013344 -0.046701
Population 0.004834 -0.296244 -0.072213 -0.066197 1.000000 0.069863 -0.108785 0.099773 -0.024650
AveOccup 0.018766 0.013191 -0.004852 -0.006181 0.069863 1.000000 0.002366 0.002476 -0.023737
Latitude -0.079809 0.011173 0.106389 0.069721 -0.108785 0.002366 1.000000 -0.924664 -0.144160
Longitude -0.015176 -0.108197 -0.027540 0.013344 0.099773 0.002476 -0.924664 1.000000 -0.045967
MedHouseVal 0.688075 0.105623 0.151948 -0.046701 -0.024650 -0.023737 -0.144160 -0.045967 1.000000
### heatmap of correlations
plt.figure(figsize = (15, 5))
sns.heatmap(housing.corr(numeric_only=True), annot = True)
<Axes: >
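Because a correlation matrix is symmetric and bounded between -1 and 1, two optional tweaks often make the heatmap easier to read: fixing the color scale with `vmin`, `vmax`, and a diverging `cmap`, and hiding the redundant upper triangle with `mask`. The sketch below is one way to do this, shown on randomly generated data with made-up column names rather than the housing data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# randomly generated data with made-up column names, for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size = (200, 4)), columns = ['a', 'b', 'c', 'd'])
toy['d'] = toy['a'] + 0.1 * toy['d']  # make d strongly correlated with a

corr = toy.corr()

# True marks cells to hide: everything strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype = bool), k = 1)

fig, ax = plt.subplots(figsize = (6, 5))
sns.heatmap(corr, mask = mask, annot = True, vmin = -1, vmax = 1, cmap = 'coolwarm', ax = ax)
ax.set_title('Correlations (lower triangle)')
```

Pinning the scale to [-1, 1] keeps colors comparable across different datasets, at the cost of less contrast when all correlations are weak.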

Problems#

Use the diabetes data below loaded from OpenML (docs).

from sklearn.datasets import fetch_openml
diabetes = fetch_openml(data_id = 37).frame
diabetes.head()
preg plas pres skin insu mass pedi age class
0 6 148 72 35 0 33.6 0.627 50 tested_positive
1 1 85 66 29 0 26.6 0.351 31 tested_negative
2 8 183 64 0 0 23.3 0.672 32 tested_positive
3 1 89 66 23 94 28.1 0.167 21 tested_negative
4 0 137 40 35 168 43.1 2.288 33 tested_positive
  1. Plot the distribution of ages separated by class.

sns.displot(data = diabetes, x = 'age', hue = 'class')
<seaborn.axisgrid.FacetGrid at 0x133ac3860>
  2. Use the plots below to determine which feature shows the most distinct difference between classes.

fig, ax = plt.subplots(2, 4, figsize = (20, 10))
### the first eight columns are features; the last column is the class label
features = diabetes.columns[:8]
### pair each feature with one panel of the 2 x 4 grid
for axis, feature in zip(ax.flat, features):
    sns.histplot(data = diabetes, x = feature, hue = 'class', ax = axis)
    axis.set_title(feature)
#Looks like plas (plasma glucose concentration) shows the biggest difference between the group that
#tested positive and the group that tested negative.
  3. Head over to the seaborn documentation and find a different plot type or function, then implement it using the diabetes data.

sns.relplot(data = diabetes, x = 'age', y = 'mass', kind = 'line', hue = 'class')
<seaborn.axisgrid.FacetGrid at 0x1380226c0>

Partner Exercise#

  • What time of year is most popular for bike rentals?

  • What’s the most popular day of the week for bike rentals?

  • What’s the frequency of use for the average user?

  • What are the most and least congested bike stations?

bikeshare_hour = pd.read_csv('data/hour.csv')
bikeshare_hour.head()