More plotting with matplotlib and seaborn

More plotting with `matplotlib` and `seaborn`#

Today we continue working with matplotlib, focusing on customization and subplots. We also introduce seaborn, a second visualization library built for plotting data with less code.

#!pip install -U seaborn

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

`3D Plotting`#

Additional projections are available, including polar and three-dimensional projections. They are selected with the `projection` argument of axes-creating functions such as `plt.axes`.

3d plotting

def f(x, y):
    return x**2 - y**2
x = np.linspace(-3, 3, 20)
y = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, y)

ax = plt.axes(projection = '3d')
ax.plot_wireframe(X, Y, f(X, Y))
ax.set_title('Using 3d projection');

_images/d71d3c7ce9bd45849ce1a875139acbbf260f7c6f72cc2937d4b9421988b3e3c0.png

r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r
ax = plt.axes(projection = 'polar')
ax.plot(theta, r)
ax.set_title('Basic polar coordinate plot');

_images/156a749a9967b95275f9f256f4a1818cc8ab29d4fffb2f02adcb8f68c9a229e6.png
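The `projection` argument can also be set per subplot, so different projections can share one figure. The following is a minimal sketch that reuses the saddle surface and spiral from the cells above; the figure size and the variable names `ax_3d` and `ax_polar` are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# same saddle surface and spiral as in the cells above
x = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, x)
Z = X**2 - Y**2
r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r

fig = plt.figure(figsize=(10, 4))  # figure size is an arbitrary choice

# each call to add_subplot can request its own projection
ax_3d = fig.add_subplot(1, 2, 1, projection='3d')
ax_3d.plot_wireframe(X, Y, Z)
ax_3d.set_title('3d projection')

ax_polar = fig.add_subplot(1, 2, 2, projection='polar')
ax_polar.plot(theta, r)
ax_polar.set_title('polar projection');
```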

`Gridspec`#

If you want to change the layout of the subplots, the `GridSpec` object lets you specify the width and height ratios of the subplots as well as the spacing around and between them. The examples below are taken from the matplotlib documentation.

from matplotlib.gridspec import GridSpec

#helper for annotating
def annotate_axes(fig):
    for i, ax in enumerate(fig.axes):
        ax.text(0.5, 0.5, "ax%d" % (i+1), va="center", ha="center")
        ax.tick_params(labelbottom=False, labelleft=False)

fig = plt.figure()
fig.suptitle("Controlling subplot sizes with width_ratios and height_ratios")

gs = GridSpec(2, 2, width_ratios=[1, 2], height_ratios=[4, 1])
ax1 = fig.add_subplot(gs[0])
ax2 = fig.add_subplot(gs[1])
ax3 = fig.add_subplot(gs[2])
ax4 = fig.add_subplot(gs[3])

annotate_axes(fig)

_images/e0fc65215467500a5f7257ce811c2b8b4dae256247e0cdee5dee6eb99aff8098.png

fig = plt.figure()
fig.suptitle("Controlling spacing around and between subplots")

gs1 = GridSpec(3, 3, left=0.05, right=0.48, wspace=0.05)
ax1 = fig.add_subplot(gs1[:-1, :])
ax2 = fig.add_subplot(gs1[-1, :-1])
ax3 = fig.add_subplot(gs1[-1, -1])

gs2 = GridSpec(3, 3, left=0.55, right=0.98, hspace=0.05)
ax4 = fig.add_subplot(gs2[:, :-1])
ax5 = fig.add_subplot(gs2[:-1, -1])
ax6 = fig.add_subplot(gs2[-1, -1])

annotate_axes(fig)

plt.show()

_images/35b706131a79c2b3fc39b6f8e88ef2af288c403087d38b332062b91a576221da.png

Exercise#

Use GridSpec to write a function that takes a column from a DataFrame (a Series object) and returns a figure with 2 rows and 1 column, where the bottom plot is a histogram and the top plot is a boxplot, similar to the image below.
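One possible approach is sketched below; it is not the only solution. The function name `box_and_hist`, the 1:4 height ratio, and the random demo Series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

def box_and_hist(series):
    """Draw a boxplot on top and a histogram below, sharing the x axis."""
    values = series.dropna().to_numpy()
    fig = plt.figure()
    # the 1:4 height ratio is an arbitrary choice
    gs = GridSpec(2, 1, height_ratios=[1, 4], hspace=0.05)
    ax_box = fig.add_subplot(gs[0])
    ax_hist = fig.add_subplot(gs[1], sharex=ax_box)
    ax_box.boxplot(values, vert=False)  # horizontal box to match the histogram
    ax_box.tick_params(labelbottom=False, labelleft=False)
    ax_hist.hist(values)
    ax_hist.set_xlabel(series.name)
    return fig

# hypothetical data, only for demonstration
rng = np.random.default_rng(0)
demo = pd.Series(rng.normal(size=200), name='demo')
fig = box_and_hist(demo)
```

With the tips data loaded later in this notebook, the call would be `box_and_hist(tips['tip'])`.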

Introduction to `seaborn`#

The seaborn library is built on top of matplotlib and offers high-level visualization tools for plotting data. A typical call to seaborn looks like:

sns.plottype(data = DataFrame, x = x, y = y, additional arguments...)

### load a sample dataset on tips
tips = sns.load_dataset('tips')
tips.head(2)

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3

### boxplot of tips
sns.boxplot(data = tips, x = 'tip')

<AxesSubplot: xlabel='tip'>

_images/2b9dd48b318dc839bd60a3f49cc97423486ed40a55649dcfbee6606ee5ca6036.png

### boxplot of tips by day
sns.boxplot(data = tips, x = 'day', y = 'tip')
plt.title('Tips by Day');

_images/1b12e6371adc97dec4cec7c39363837897ce3c7decbf7507e0dcd37a1d932ca8.png

`hue`#

The hue argument works like a grouping helper with seaborn. Plots that have this argument will break the data into groups from the passed column and add an appropriate legend.

### boxplot of tips by day by smoker
sns.boxplot(data = tips, x = 'day', y = 'tip', hue = 'smoker')

<AxesSubplot: xlabel='day', ylabel='tip'>

_images/4489320ddc303ddbc7874d8ad581adfbdc64001e72b86ad93da20206b3738450.png

`displot`#

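`displot` visualizes the distribution of a variable. The `kind` argument switches between a histogram (the default), a kernel density estimate (`'kde'`), and an empirical cumulative distribution (`'ecdf'`).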

### histogram of tips
sns.displot(data = tips, x = 'tip')

<seaborn.axisgrid.FacetGrid at 0x7faa09939fa0>

_images/25e075aeca2fad339dc8f1de6eff1b2de24a7d5c5223b6188e2d650c92b562a2.png

### kde plot
sns.displot(data = tips, x = 'tip', kind = 'kde')

<seaborn.axisgrid.FacetGrid at 0x7faa09bb3220>

_images/e4c0af5172864f72180bc110a92c86969d4976d37327a8fcfd864344cc2e85ea.png

### empirical cumulative distribution plot of tips by smoker
sns.displot(data = tips, x = 'tip', kind = 'ecdf', hue = 'smoker')

<seaborn.axisgrid.FacetGrid at 0x7faa0a07ae20>

_images/f6feff9076557d14916664cfc07e8c4e0568b6ee324a26b27780a95e192c21b0.png

### using the col argument
sns.displot(data = tips, x = 'tip', col = 'smoker')

<seaborn.axisgrid.FacetGrid at 0x7faa09e07340>

_images/5a7c3c16b0d946db518697d39d1f2dd2a31209841ef6f77a5c6b34a83e032f6f.png

#draw a histogram and a boxplot using seaborn on two axes
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
sns.histplot(data = tips, x = 'tip', ax = ax[0])
sns.boxplot(data = tips, x = 'day', y = 'tip', ax = ax[1])
ax[1].set_title('Boxplots')
fig.suptitle('This is a title for everything');

_images/54056598903e71385d58b508a3bfc4103a97a1cffca9deb505506ea34eaca626.png

`relplot`#

`relplot` visualizes the relationship between two variables; by default it draws a scatterplot.

### relplot of bill vs. tip
sns.relplot(data = tips, x = 'total_bill', y = 'tip')

<seaborn.axisgrid.FacetGrid at 0x7faa0a92a700>

_images/3ad85b2b13b5891eef785c490208b960846bd3df9586000562f987ad391d9aaa.png

### regression plot with a lowess smoother (requires the statsmodels package)
sns.regplot(data = tips, x = 'total_bill', y = 'tip', lowess = True)

<AxesSubplot: xlabel='total_bill', ylabel='tip'>

_images/7a910581db0ba5b7b1ad7b25ac0f2bb83185c1011527e402c7f75e07ae0807f8.png

### swarm
sns.swarmplot(data = tips, x = 'smoker', y = 'tip')

<AxesSubplot: xlabel='smoker', ylabel='tip'>

_images/c82461709f983aed033b0ebd55d5687016ae0ddab369abe1bd0b11904a009acc.png

### violin plot
sns.violinplot(data = tips, x = 'smoker', y = 'tip')

<AxesSubplot: xlabel='smoker', ylabel='tip'>

_images/f41032bc33229724104cbcedacf4425b9d6fe71e28f4ad7fb5f08e6e8fcd1ff7.png

### countplot
sns.countplot(data = tips, x = 'smoker');

_images/a15d02e04117ef4b3093add1806711278165c019453e66a6b8ad59ff32a08350.png

Create a histogram of flipper length by species.

penguins = sns.load_dataset('penguins')
penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female

sns.displot(data = penguins, x = 'flipper_length_mm', col = 'species')

<seaborn.axisgrid.FacetGrid at 0x7faa0c1d0880>

_images/4ff89a5ace22f4b0864e65ec0d3a47f111e31d6167d7bdfd734370268ca4b6ab.png

sns.displot(data = penguins, x = 'flipper_length_mm', hue = 'species')

<seaborn.axisgrid.FacetGrid at 0x7faa0c85bb80>

_images/27d562e83f3999b9ffda77269de3a83446dc73bf28372f03a6ce9420bd8ba218.png

Create a scatterplot of bill length vs. flipper length.

sns.scatterplot(data = penguins, x = 'bill_length_mm', y = 'flipper_length_mm', hue = 'species')

<AxesSubplot: xlabel='bill_length_mm', ylabel='flipper_length_mm'>

_images/ca83a8feb98ba56fddb41a1bf0882ed4829e37a6ee84cddec7c39f7f52d71b4f.png

Create a violin plot of body mass for each species, split by sex.

sns.violinplot(data = penguins, x = 'species', y = 'body_mass_g', hue = 'sex', split = True)

<AxesSubplot: xlabel='species', ylabel='body_mass_g'>

_images/028c678566e1f92b7d033b7cce5ff6004b25dba8fdbf6f3e0f06b69bd00c1347.png

Additional Plots#

pairplot
heatmap

penguins = sns.load_dataset('penguins').dropna()

### pairplot of penguins colored by species
sns.pairplot(data = penguins, hue = 'species')

<seaborn.axisgrid.PairGrid at 0x7fa9ed579f40>

_images/3ceb41040dd70189f5c0e1cd01ee9016b2eb82fd138d8083ed89011d10fd9e98.png

### housing data
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame = True).frame
housing.head()

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23	4.526
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22	3.585
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24	3.521
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25	3.413
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25	3.422

Plotting Correlations#

Correlation measures the strength of the linear relationship between two features. A table or heatmap of correlations is often easier to scan than a grid of scatterplots, but remember that correlation only detects linear relationships: two features can be strongly related and still have a correlation near zero.
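To make that caveat concrete, here is a minimal sketch with hypothetical data (the column names `linear` and `quadratic` are illustrative and not part of the housing data). Both columns are exact functions of `x`, yet only the linear one shows up in the correlation table; the correlation between `x` and `quadratic` comes out at essentially zero because `x` is symmetric around 0.

```python
import numpy as np
import pandas as pd

# hypothetical data: both columns depend on x exactly, one of them nonlinearly
x = np.linspace(-3, 3, 101)
demo = pd.DataFrame({'x': x, 'linear': 2 * x + 1, 'quadratic': x**2})

# Pearson correlation is the default method of DataFrame.corr
corr = demo.corr()
print(corr.round(3))
```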

### correlation in data
housing.corr(numeric_only = True)

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
MedInc	1.000000	-0.119034	0.326895	-0.062040	0.004834	0.018766	-0.079809	-0.015176	0.688075
HouseAge	-0.119034	1.000000	-0.153277	-0.077747	-0.296244	0.013191	0.011173	-0.108197	0.105623
AveRooms	0.326895	-0.153277	1.000000	0.847621	-0.072213	-0.004852	0.106389	-0.027540	0.151948
AveBedrms	-0.062040	-0.077747	0.847621	1.000000	-0.066197	-0.006181	0.069721	0.013344	-0.046701
Population	0.004834	-0.296244	-0.072213	-0.066197	1.000000	0.069863	-0.108785	0.099773	-0.024650
AveOccup	0.018766	0.013191	-0.004852	-0.006181	0.069863	1.000000	0.002366	0.002476	-0.023737
Latitude	-0.079809	0.011173	0.106389	0.069721	-0.108785	0.002366	1.000000	-0.924664	-0.144160
Longitude	-0.015176	-0.108197	-0.027540	0.013344	0.099773	0.002476	-0.924664	1.000000	-0.045967
MedHouseVal	0.688075	0.105623	0.151948	-0.046701	-0.024650	-0.023737	-0.144160	-0.045967	1.000000

### heatmap of correlations
plt.figure(figsize = (15, 5))
sns.heatmap(housing.corr()[['MedHouseVal']].sort_values(by = 'MedHouseVal', ascending = False), annot = True)

<AxesSubplot: >

_images/93227949ff9d373d36a49fc6bdcaf6c3845aca2d1d94f450a012b9d12dd00320.png

Problems#

Use the diabetes data below, loaded from OpenML.

from sklearn.datasets import fetch_openml

diabetes = fetch_openml(data_id = 37).frame

diabetes.head()

	preg	plas	pres	skin	insu	mass	pedi	age	class
0	6.0	148.0	72.0	35.0	0.0	33.6	0.627	50.0	tested_positive
1	1.0	85.0	66.0	29.0	0.0	26.6	0.351	31.0	tested_negative
2	8.0	183.0	64.0	0.0	0.0	23.3	0.672	32.0	tested_positive
3	1.0	89.0	66.0	23.0	94.0	28.1	0.167	21.0	tested_negative
4	0.0	137.0	40.0	35.0	168.0	43.1	2.288	33.0	tested_positive

Distribution of ages separated by class.

sns.displot(diabetes, x = 'age', hue = 'class')

<seaborn.axisgrid.FacetGrid at 0x7fa9f0ae6b80>

_images/9f8361b011f9be566884de31e8e1147809f112a7757f0f9b35b3131a15a6fb98.png

Heatmap of features. Any strong correlations?

plt.figure(figsize = (20, 5))
sns.heatmap(diabetes.corr(numeric_only = True), annot = True, cmap = 'BuPu')

<AxesSubplot: >

_images/0c0f1320ff3987c4916e9f134bbaab152641b4d7390bd8a1d12ae07b7bbe45f8.png

CHALLENGE: Create a grid of 2 rows and 4 columns of histograms, one per feature, each separated by the class column. Which feature shows the most distinct difference between classes?

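features = diabetes.columns.drop('class')  # the eight numeric features
fig, ax = plt.subplots(2, 4, figsize = (20, 10))
for axis, feature in zip(ax.flat, features):
    sns.histplot(data = diabetes, x = feature, hue = 'class', ax = axis)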

_images/1dcd01240ef8f777bac9685d77600bf0bccd515637a3578aad400eefa791846f.png

Review#

data = {'Food': ['French Fries', 'Potato Chips', 'Bacon', 'Pizza', 'Chili Dog'],
        'Calories per 100g':  [607, 542, 533, 296, 260]}
cals = pd.DataFrame(data)

EXERCISE

Set ‘Food’ as the index of cals.
Create a bar chart with calories.
Add a title.
Change the color of the bars.
Add the argument alpha=0.5. What does it do?
Change your chart to a horizontal bar chart. Which do you prefer?
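One possible solution is sketched below, using the pandas plotting interface; the color `'tomato'` and the title text are arbitrary choices. The `alpha` argument controls opacity, so `alpha=0.5` makes the bars half transparent.

```python
import pandas as pd
import matplotlib.pyplot as plt

data = {'Food': ['French Fries', 'Potato Chips', 'Bacon', 'Pizza', 'Chili Dog'],
        'Calories per 100g':  [607, 542, 533, 296, 260]}

# use the food names as the index so they label the bars
cals = pd.DataFrame(data).set_index('Food')

# kind='barh' draws horizontal bars; use kind='bar' for vertical ones
ax = cals.plot(kind='barh', color='tomato', alpha=0.5, legend=False,
               title='Calories per 100g by food')
ax.set_xlabel('Calories per 100g');
```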
