Homework 3: Advanced Pandas and Introductory Plotting

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Problem 1: Loading a data file.

Load the data from the spotify.csv file and assign it to a variable named spotify below.
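As a reminder of how loading works, here is a minimal sketch of pd.read_csv. So that it runs without any file on disk, it reads from an in-memory string instead of spotify.csv; the column names and values are made up for illustration and may not match the real file.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk (an assumption for illustration).
csv_text = """artist,tempo,danceability
Artist A,120.0,0.71
Artist B,98.5,0.55
Artist A,140.2,0.63
"""

# read_csv accepts a file path or any file-like object.
spotify = pd.read_csv(io.StringIO(csv_text))

# A quick look at the size and column types is a good first check.
print(spotify.shape)
print(spotify.dtypes)
```

With the real file, the call would take the path to the file as its argument instead of the StringIO object.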

Problem 2: Who is the most frequently occurring artist in the data?
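One way to approach a question like this is value_counts, which tallies how often each value appears in a column. The sketch below uses a tiny made-up DataFrame and assumes the artist names live in a column called artist; the real spotify.csv may name its columns differently.

```python
import pandas as pd

# Toy stand-in for the spotify data; the column name "artist" is an assumption.
toy = pd.DataFrame({
    "artist": ["Artist A", "Artist B", "Artist A", "Artist C", "Artist A", "Artist B"],
    "tempo": [120.0, 98.5, 140.2, 110.0, 95.3, 128.7],
})

# value_counts sorts by frequency, most common first.
artist_counts = toy["artist"].value_counts()

# idxmax returns the label with the highest count.
top_artist = artist_counts.idxmax()
print(top_artist, artist_counts.max())
```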

Problem 3: Create a histogram for the tempo column.

Problem 4: Create a scatterplot of tempo vs. danceability. Do these features seem related?

### tempo vs. danceability
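The two plots in Problems 3 and 4 could be sketched roughly as follows. The data here is randomly generated, not the spotify data, so the plots say nothing about the real relationship; only the column names tempo and danceability come from the problem statements.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data (an assumption for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tempo": rng.normal(120, 25, size=200),
    "danceability": rng.uniform(0, 1, size=200),
})

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a single numeric column.
ax_hist.hist(toy["tempo"], bins=20)
ax_hist.set_xlabel("tempo")
ax_hist.set_ylabel("count")
ax_hist.set_title("Distribution of tempo")

# Scatter plot: one numeric column against another.
ax_scatter.scatter(toy["tempo"], toy["danceability"], alpha=0.5)
ax_scatter.set_xlabel("tempo")
ax_scatter.set_ylabel("danceability")
ax_scatter.set_title("tempo vs. danceability")

fig.tight_layout()

# A correlation coefficient can back up what the scatter plot suggests.
r = toy["tempo"].corr(toy["danceability"])
print(r)
```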

Problem 5: Load the cell_phone_churn.csv data and assign it to churn below.

This dataset contains customer information from a telecommunications company about customer churn. A customer is churned if they leave the provider.

Problem 6: What percentage of customers were churned?

Problem 7: How do customers who have a voicemail plan compare with those who do not in terms of percent churned?
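For percentage questions like Problems 6 and 7, a useful fact is that the mean of a boolean column equals the fraction of True values. The sketch below uses a made-up DataFrame; the column names vmail_plan and churn are assumptions and may differ in cell_phone_churn.csv.

```python
import pandas as pd

# Toy stand-in for the churn data; column names and values are assumptions.
toy = pd.DataFrame({
    "vmail_plan": ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "churn": [False, True, False, False, True, False, True, False],
})

# Mean of a boolean column = fraction of True values; scale to a percentage.
overall_pct = toy["churn"].mean() * 100

# The same idea within each group gives percent churned per plan type.
pct_by_plan = toy.groupby("vmail_plan")["churn"].mean() * 100

print(overall_pct)
print(pct_by_plan)
```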

Problem 8: Draw a bar plot showing how many customers made each number of customer service calls.
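A bar plot of counts usually starts from value_counts. The following is a minimal sketch on made-up data; the column name custserv_calls is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in; the column name "custserv_calls" is an assumption.
toy = pd.DataFrame({"custserv_calls": [0, 1, 1, 2, 1, 3, 0, 2, 1, 4]})

# Count customers per number of calls, then order the bars by number of calls.
calls = toy["custserv_calls"].value_counts().sort_index()

fig, ax = plt.subplots()
calls.plot(kind="bar", ax=ax)
ax.set_xlabel("customer service calls")
ax.set_ylabel("number of customers")
ax.set_title("Customers by number of service calls")
```

Sorting by index keeps the x axis in numeric order; without it, the bars would be ordered by frequency.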

Problem 9: Draw boxplots for international minutes by customers who were churned and those that were not. Are there any differences between these groups?
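Grouped boxplots can be drawn directly from a DataFrame with the by argument of DataFrame.boxplot. The sketch below uses randomly generated data and the hypothetical column names intl_mins and churn, so any difference between the groups here is noise.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data; column names are assumptions.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "intl_mins": rng.normal(10, 3, size=100),
    "churn": rng.choice([True, False], size=100),
})

fig, ax = plt.subplots()

# One box per value of the "churn" column.
toy.boxplot(column="intl_mins", by="churn", ax=ax)
ax.set_xlabel("churned")
ax.set_ylabel("international minutes")
ax.set_title("International minutes by churn status")
fig.suptitle("")  # clear the automatic "Boxplot grouped by ..." header

# Group medians give a numeric check on what the boxes show.
medians = toy.groupby("churn")["intl_mins"].median()
print(medians)
```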

Income by College Major

Below, a dataset on college majors and income from Nate Silver’s FiveThirtyEight blog is loaded directly from the site's GitHub repository.

url1 = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/'
url2 = 'college-majors/recent-grads.csv'
url = url1 + url2
df538 = pd.read_csv(url)
df538.head(2)
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
| 1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |

2 rows × 21 columns

Problem 10: Assign the column names of the data to a list below.

Problem 11: Set the index of df538 as Major.

Problem 12: Create a horizontal bar chart of the median salary by major.
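The steps in Problems 10 through 12 can be sketched on a small sample. The two rows below are copied from the df538.head(2) output above and trimmed to three columns, so that the example runs without a network connection; the variable names are illustrative only.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Two rows copied from the head(2) output, trimmed to three columns.
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING"],
    "Major_category": ["Engineering", "Engineering"],
    "Median": [110000, 75000],
})

# Column labels as a plain Python list.
column_names = sample.columns.tolist()

# set_index returns a new DataFrame; the original is left unchanged.
by_major = sample.set_index("Major")

# Horizontal bars read well when the category labels are long.
fig, ax = plt.subplots(figsize=(8, 3))
by_major["Median"].sort_values().plot(kind="barh", ax=ax)
ax.set_xlabel("Median salary")
ax.set_title("Median salary by major")
fig.tight_layout()
```

With the full dataset there are many more majors, so a taller figsize would likely be needed to keep the labels readable.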

Problem 13: Load the gapminder.csv file and assign it to gapminder_df below. This data comes from the Gapminder organization and contains information on countries' GDP and life expectancy.

Problem 14: What is the average GDP for each continent?

Problem 15: Create a scatter plot for GDP vs. Life Expectancy for the data from 2002. Include a title and x and y labels.
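Problems 14 and 15 combine a groupby summary with a filtered scatter plot. The sketch below shows the pattern on made-up numbers; the column names continent, year, gdpPercap, and lifeExp are assumptions and may differ in gapminder.csv.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration; column names are assumptions.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "continent": ["Africa", "Africa", "Asia", "Asia"] * 2,
    "year": [1997] * 4 + [2002] * 4,
    "gdpPercap": [1000.0, 3000.0, 5000.0, 9000.0, 1200.0, 3200.0, 6000.0, 10000.0],
    "lifeExp": [50.0, 55.0, 65.0, 70.0, 52.0, 57.0, 67.0, 72.0],
})

# Average GDP per continent across all rows.
avg_gdp = toy.groupby("continent")["gdpPercap"].mean()
print(avg_gdp)

# Boolean mask keeps only the rows for a single year.
data_2002 = toy[toy["year"] == 2002]

fig, ax = plt.subplots()
ax.scatter(data_2002["gdpPercap"], data_2002["lifeExp"])
ax.set_xlabel("GDP per capita")
ax.set_ylabel("Life expectancy (years)")
ax.set_title("GDP vs. life expectancy, 2002")
```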

Exploring your own data

Now, head over to kaggle.com and locate a dataset of interest to you in the datasets section. Download it and load it into your notebook below. Be careful not to select a dataset that is too large (>10 GB), and don’t spend too much time trying to find the perfect dataset.

The goal here is to use our techniques from pandas and matplotlib to explore the data. Once you have the data loaded, use summaries and plots to explore it. Create three plots of your data that contain important insights. Be sure to label your axes and add appropriate titles to these plots.

BONUS: Styling pandas

pandas has additional capabilities for styling tables by adding color and formatting beyond the default settings. The pandas styling documentation gives some examples of ways that you can adjust the styling of a DataFrame. These are handy if you are summarizing data in a table and want to highlight specific values for the reader.

Select some styling tips that you like from the documentation. Create a markdown cell and write a brief summary of the technique, followed by a demonstration using the titanic DataFrame. Make sure that your summaries actually say something about the data!

import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head(5)
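| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |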

BONUS: Installing and using libraries

The sktime library is a nice tool for time series analysis, and it also ships with example datasets. Consult its documentation, install the library, and create three plots of different datasets from the library, using subplots.
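The subplot layout itself can be sketched independently of sktime. The example below deliberately does not use sktime's loaders; it builds three synthetic monthly series (an assumption for illustration) and draws each on its own axis. The same loop would work with any three pandas Series that have a date-like index.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Three synthetic monthly series standing in for real datasets.
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
rng = np.random.default_rng(42)
series = {
    "trend": pd.Series(np.linspace(0, 10, 36), index=idx),
    "seasonal": pd.Series(np.sin(2 * np.pi * np.arange(36) / 12), index=idx),
    "noise": pd.Series(rng.normal(size=36), index=idx),
}

# One row per series; sharex aligns the time axes.
fig, axes = plt.subplots(3, 1, figsize=(8, 8), sharex=True)

for ax, (name, s) in zip(axes, series.items()):
    ax.plot(s.index, s.values)
    ax.set_title(name)
    ax.set_ylabel("value")

axes[-1].set_xlabel("date")
fig.tight_layout()
```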