Homework 3: Advanced Pandas and Introductory Plotting


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Problem 1: Loading a data file.

Load the data from the spotify.csv file and assign it to a variable named spotify.
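As a reminder of how `pd.read_csv` behaves, here is a minimal sketch that reads a small in-memory CSV instead of a real file. The text, the column names, and the variable `demo` are made up for illustration; the same call also accepts a file path such as `'spotify.csv'`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the columns are assumptions.
csv_text = """artist,song_title,tempo
Artist A,Song 1,120.0
Artist B,Song 2,150.0
Artist A,Song 3,98.5
"""

# read_csv accepts a file path or any file-like object.
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)   # (number of rows, number of columns)
print(demo.dtypes)  # confirm that numeric columns were parsed as numbers
```

Checking `.shape` and `.dtypes` right after loading is a quick way to catch a wrong delimiter or a numeric column that was read in as text.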

Problem 2: Who is the most frequently occurring artist in the data?
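One way to approach a "most frequent value" question is `Series.value_counts`. The sketch below uses a toy Series with made-up labels, not the Spotify data, and the name `artist` is an assumption about the column.

```python
import pandas as pd

# Toy data: six rows with three distinct (made-up) artist labels.
artists = pd.Series(["A", "B", "A", "C", "A", "B"], name="artist")

# value_counts tallies each distinct value, sorted from most to least frequent.
counts = artists.value_counts()

# idxmax returns the index label (here, the artist) with the largest count.
top_artist = counts.idxmax()

print(counts)
print(top_artist)
```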

Problem 3: Create a histogram for the tempo column.
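A histogram takes a single numeric column and a bin count. The following is a rough sketch on randomly generated values, assuming a column called `tempo`; the bin count of 20 is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data: 200 random values standing in for a numeric column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"tempo": rng.normal(120, 25, size=200)})

fig, ax = plt.subplots()
ax.hist(toy["tempo"], bins=20, edgecolor="black")
ax.set_xlabel("Tempo (BPM)")
ax.set_ylabel("Number of songs")
ax.set_title("Distribution of tempo (toy data)")
```

Trying a few different values of `bins` is worthwhile, since the apparent shape of a distribution can change with the bin width.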

Problem 4: Create a scatterplot of tempo vs. danceability. Do these features seem related?

### tempo vs. danceability
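A scatter plot pairs naturally with a correlation coefficient when judging whether two features are related. This sketch uses independent random numbers, so it should show no clear pattern; the column names are assumptions and the data is not from spotify.csv.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data: two unrelated random columns.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "tempo": rng.uniform(60, 200, size=100),
    "danceability": rng.uniform(0, 1, size=100),
})

fig, ax = plt.subplots()
ax.scatter(toy["tempo"], toy["danceability"], alpha=0.5)
ax.set_xlabel("Tempo")
ax.set_ylabel("Danceability")
ax.set_title("Tempo vs. danceability (toy data)")

# Pearson correlation: values near 0 suggest little linear relationship.
r = toy["tempo"].corr(toy["danceability"])
print(round(r, 3))
```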

Problem 5: Load the cell_phone_churn.csv data and assign it to a variable named churn.

This dataset contains customer information from a telecommunications company about customer churn. A customer is churned if they leave the provider.

Problem 6: What percentage of customers were churned?

Problem 7: How do customers who have a voicemail plan compare with those who do not in terms of percent churned?
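Because the mean of a boolean column is the fraction of `True` values, percentages like these can be computed with `.mean()`, overall or per group. The sketch below uses a six-row toy table; the column names `vmail_plan` and `churn` are assumptions and may differ from the names in cell_phone_churn.csv.

```python
import pandas as pd

# Toy data with assumed column names.
toy = pd.DataFrame({
    "vmail_plan": ["yes", "no", "no", "yes", "no", "no"],
    "churn": [False, True, False, False, True, False],
})

# Mean of a boolean column = share of True values.
overall_pct = toy["churn"].mean() * 100

# The same calculation within each group.
by_plan_pct = toy.groupby("vmail_plan")["churn"].mean() * 100

print(round(overall_pct, 2))
print(by_plan_pct)
```

If the churn column is stored as text rather than booleans, it would need to be converted first, for example by comparing it to the label that marks a churned customer.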

Problem 8: Draw a barplot to represent the number of customers by the number of customer service calls these customers made.
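For a bar plot of counts, one option is to tally the values first and then plot the tallies. This is a sketch on a toy Series; the name `custserv_calls` is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: number of service calls for nine hypothetical customers.
calls = pd.Series([0, 1, 1, 2, 1, 3, 0, 2, 1], name="custserv_calls")

# Count customers per number of calls, ordered by number of calls.
counts = calls.value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
ax.set_xticks(counts.index)
ax.set_xlabel("Customer service calls")
ax.set_ylabel("Number of customers")
ax.set_title("Customers by number of service calls (toy data)")
```

Sorting by index keeps the bars in numeric order along the x-axis rather than in order of frequency.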

Problem 9: Draw boxplots for international minutes by customers who were churned and those that were not. Are there any differences between these groups?
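Side-by-side boxplots can be drawn by passing one array per group to `Axes.boxplot`. The following sketch uses a small toy table with assumed column names `churn` and `intl_mins`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data with assumed column names.
toy = pd.DataFrame({
    "churn": [False, False, False, True, True, True, True],
    "intl_mins": [8.0, 10.0, 12.0, 9.0, 11.0, 14.0, 13.0],
})

# One array of values per group, in a fixed order.
groups = [
    toy.loc[toy["churn"] == flag, "intl_mins"].to_numpy()
    for flag in (False, True)
]

fig, ax = plt.subplots()
box = ax.boxplot(groups)
ax.set_xticks([1, 2])
ax.set_xticklabels(["Stayed", "Churned"])
ax.set_ylabel("International minutes")
ax.set_title("International minutes by churn status (toy data)")

# Group medians give a numeric check on what the plot shows.
medians = toy.groupby("churn")["intl_mins"].median()
print(medians)
```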

Income by College Major

The dataset below, on college majors and income, comes from Nate Silver’s FiveThirtyEight (538) blog.

url1 = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/'
url2 = 'college-majors/recent-grads.csv'
url = url1 + url2
df538 = pd.read_csv(url)
df538.head(2)

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
0	1	2419	PETROLEUM ENGINEERING	2339.0	2057.0	282.0	Engineering	0.120564	36	1976	...	270	1207	37	0.018381	110000	95000	125000	1534	364	193
1	2	2416	MINING AND MINERAL ENGINEERING	756.0	679.0	77.0	Engineering	0.101852	7	640	...	170	388	85	0.117241	75000	55000	90000	350	257	50

2 rows × 21 columns

Problem 10: Assign the columns of the data as a list below.

Problem 11: Set the index of df538 as Major.

Problem 12: Create a horizontal bar chart of the median salary by major.
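The three steps above (listing columns, setting an index, and drawing horizontal bars) can be sketched on a toy table. The `Major` and `Median` column names match the output shown earlier; the rows themselves are invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data reusing two column names from the table above.
toy = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Median": [40000, 65000, 52000],
})

# Column labels as a plain Python list.
column_list = toy.columns.tolist()

# set_index returns a new DataFrame unless the result is assigned back.
by_major = toy.set_index("Major")

# Sorting first makes the bars easier to compare.
ordered = by_major["Median"].sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ordered.plot(kind="barh", ax=ax)
ax.set_xlabel("Median salary")
ax.set_title("Median salary by major (toy data)")
```

With many majors on the y-axis, a taller `figsize` is likely needed to keep the labels readable.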

Problem 13: Load the gapminder.csv file and assign it to a variable named gapminder_df. This data comes from the Gapminder organization and contains information on countries’ GDP and life expectancy.

Problem 14: What is the average GDP for each continent?
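Per-group averages follow the split-apply-combine pattern of `groupby`. A minimal sketch on toy numbers, assuming columns named `continent` and `gdpPercap` (the real file may use different names):

```python
import pandas as pd

# Toy data with assumed column names.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "gdpPercap": [1000.0, 3000.0, 20000.0, 30000.0],
})

# Split by continent, select one column, then average within each group.
avg_gdp = toy.groupby("continent")["gdpPercap"].mean()
print(avg_gdp)
```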

Problem 15: Create a scatter plot for GDP vs. Life Expectancy for the data from 2002. Include a title and x and y labels.
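Plotting a single year means filtering rows with a boolean mask before plotting. The sketch below does this on a toy table; the column names `year`, `gdpPercap`, and `lifeExp` are assumptions about the file.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data with assumed column names.
toy = pd.DataFrame({
    "year": [1997, 2002, 2002, 2002],
    "gdpPercap": [900.0, 1200.0, 15000.0, 32000.0],
    "lifeExp": [55.0, 58.0, 72.0, 79.0],
})

# Boolean mask: keep only the rows for 2002.
year_2002 = toy[toy["year"] == 2002]

fig, ax = plt.subplots()
ax.scatter(year_2002["gdpPercap"], year_2002["lifeExp"])
ax.set_xlabel("GDP per capita")
ax.set_ylabel("Life expectancy (years)")
ax.set_title("GDP vs. life expectancy, 2002 (toy data)")
```

If GDP values span several orders of magnitude, `ax.set_xscale("log")` may make the relationship easier to see.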

Exploring your own data

Now, head over to the website kaggle.com and locate a dataset that interests you. Download it and load it into your notebook below. Be careful not to select a dataset that is too large (>10 GB), and don’t spend too much time trying to find the perfect dataset.

The goal here is to use our techniques from pandas and matplotlib to explore the data. Once the data is loaded, use summaries and plots to explore it. Create three plots of your data that contain important insights. Be sure to label your axes and add appropriate titles to these plots.

BONUS: Styling `pandas`

pandas can also style tables by adding color and formatting beyond the default settings. The pandas documentation on table styling gives some examples of ways to adjust the styling of a DataFrame. These are handy if you are summarizing data in a table and want to highlight specific values for the reader.

Select some styling tips that you like from the documentation. Create a markdown cell and write a brief summary of the technique, followed by a demonstration using the titanic DataFrame. Make sure that your summaries actually say something about the data!

import seaborn as sns

titanic = sns.load_dataset('titanic')
titanic.head(5)

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

BONUS: Installing and using libraries

sktime is a nice library for time series analysis and data; see its documentation for details. Install the library, then create three plots of different datasets from the library, arranged using subplots.
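As a reminder of the subplot mechanics, here is a sketch that stacks three time series on a shared x-axis. It deliberately uses synthetic series built with NumPy rather than sktime datasets, so nothing here reflects the sktime API; the series names are made up.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for three different datasets.
rng = np.random.default_rng(42)
index = pd.date_range("2020-01-01", periods=36, freq="MS")
steps = np.arange(36)
series = {
    "Trend": 0.5 * steps,
    "Seasonal": np.sin(2 * np.pi * steps / 12),
    "Noise": rng.normal(0, 1, size=36),
}

# One row per series; sharex aligns the date axis across panels.
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(8, 6), sharex=True)
for ax, (name, values) in zip(axes, series.items()):
    ax.plot(index, values)
    ax.set_title(name)
    ax.set_ylabel("Value")

axes[-1].set_xlabel("Date")
fig.tight_layout()
```

`plt.subplots` returns an array of axes, so looping over it with `zip` keeps the per-panel code in one place.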
