Homework 3: Advanced Pandas and Introductory Plotting
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
Problem 1: Loading a data file.
Load the data from the spotify.csv file and assign it to a variable named spotify below.
Problem 2: Who is the most frequently occurring artist in the data?
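One common way to find the most frequent value in a column is `value_counts`. The sketch below runs on a small made-up DataFrame; the column name `artist` and its values are assumptions for illustration and may not match what is in spotify.csv.

```python
import pandas as pd

# Toy data for illustration only; not taken from spotify.csv.
toy = pd.DataFrame({"artist": ["A", "B", "A", "C", "A", "B"]})

# value_counts tallies each distinct value, most common first.
counts = toy["artist"].value_counts()

# idxmax returns the label that has the highest count.
most_common = counts.idxmax()
print(most_common, counts.max())
```

Keeping the full `counts` Series around is handy when you also want to see the runners-up, for example with `counts.head()`.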
Problem 3: Create a histogram for the tempo column.
Problem 4: Create a scatterplot of tempo vs. danceability. Do these features seem related?
### tempo vs. danceability
Problem 5: Load in the cell_phone_churn.csv data and assign as churn below.
This dataset contains customer information from a telecommunications company about customer churn. A customer is churned if they leave the provider.
Problem 6: What percentage of customers were churned?
Problem 7: How do customers who have a voicemail plan and those who did not compare in terms of percent churned?
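Percentages like the ones asked for in Problems 6 and 7 can often be computed from the mean of a boolean column. The sketch below uses made-up data; the column names `churn` and `vmail_plan` are hypothetical and may differ from those in cell_phone_churn.csv.

```python
import pandas as pd

# Toy data; the column names and values are assumptions for illustration.
toy = pd.DataFrame({
    "vmail_plan": ["yes", "yes", "no", "no", "no", "yes"],
    "churn": [False, True, False, True, True, False],
})

# The mean of a boolean column is the fraction of True values.
pct_churned = toy["churn"].mean() * 100

# The same calculation within each group of vmail_plan.
pct_by_plan = toy.groupby("vmail_plan")["churn"].mean() * 100

print(pct_churned)
print(pct_by_plan)
```

If the churn column in the real file is stored as text rather than booleans, it would need to be converted first, for example by comparing it to the value that marks a churned customer.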
Problem 8: Draw a barplot to represent the number of customers by the number of customer service calls these customers made.
Problem 9: Draw boxplots for international minutes by customers who were churned and those that were not. Are there any differences between these groups?
Income by College Major
Below is a dataset on college majors and income from Nate Silver’s FiveThirtyEight blog.
url1 = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/'
url2 = 'college-majors/recent-grads.csv'
url = url1 + url2
df538 = pd.read_csv(url)
df538.head(2)
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
| 1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 rows × 21 columns
Problem 10: Assign the columns of the data as a list below.
Problem 11: Set the index of df538 as Major.
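As a rough illustration of the two operations in Problems 10 and 11, the sketch below works on a small stand-in frame built from three of the columns and the two rows shown in the `df538.head(2)` output above. The variable names `small`, `column_names`, and `indexed` are made up for this example.

```python
import pandas as pd

# Stand-in for df538: three columns and two rows copied from the head(2) output.
small = pd.DataFrame({
    "Rank": [1, 2],
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING"],
    "Median": [110000, 75000],
})

# .columns is an Index object; tolist() converts it to a plain Python list.
column_names = small.columns.tolist()

# set_index returns a new DataFrame and leaves the original unchanged.
indexed = small.set_index("Major")

print(column_names)
print(indexed.loc["PETROLEUM ENGINEERING", "Median"])
```

Because `set_index` returns a copy by default, the result has to be assigned back to a variable if the change should persist.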
Problem 12: Create a horizontal bar chart of the median salary by major.
Problem 13: Load in the gapminder.csv file and assign as gapminder_df below. This data comes from the Gapminder organization and contains information on countries' GDP and life expectancy.
Problem 14: What is the average GDP for each continent?
Problem 15: Create a scatter plot for GDP vs. Life Expectancy for the data from 2002. Include a title and x and y labels.
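A labeled scatter plot for a single year usually takes two steps: filter the rows with a boolean mask, then plot and label. The sketch below uses made-up numbers, and the column names `year`, `gdp`, and `life_exp` are assumptions that may not match gapminder.csv. The `Agg` backend line is only there so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; values and column names are made up for illustration.
toy = pd.DataFrame({
    "year": [1997, 2002, 2002, 2002],
    "gdp": [1000.0, 1500.0, 9000.0, 30000.0],
    "life_exp": [50.0, 55.0, 70.0, 79.0],
})

# Boolean mask keeps only the rows from 2002.
subset = toy[toy["year"] == 2002]

fig, ax = plt.subplots()
ax.scatter(subset["gdp"], subset["life_exp"])
ax.set_title("GDP vs. Life Expectancy (2002)")
ax.set_xlabel("GDP")
ax.set_ylabel("Life Expectancy")
```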
Exploring your own data
Now, head over to the website kaggle.com and locate a dataset of interest to you from the datasets. Download a dataset and load it into your notebook below. Be careful that you don’t select a dataset that is too large (>10 GB), and don’t spend too much time trying to find the perfect dataset.
The goal here is to use our techniques from pandas and matplotlib to explore the data. Once you have the data loaded, you are to use summaries and plots to explore the data. Create three plots of your data that contain important insights. Be sure to label your axes and add appropriate titles to these plots.
BONUS: Styling pandas
There are additional capabilities in pandas to style tables by adding color and formatting outside of default settings. The pandas documentation on table styling gives some examples of ways that you can adjust styling on a DataFrame. These are handy if you are summarizing data in a table and want to highlight specific values for the reader.
Select some styling tips that you like from the documentation. Create a markdown cell and write a brief summary of the technique, followed by a demonstration using the titanic DataFrame. Make sure that your summaries actually say something about the data!
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head(5)
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
BONUS: Installing and using libraries
A nice library for time series analysis and data is the sktime library; see its documentation for installation instructions. Install the library, and create three plots, using subplots, of different datasets from the library.
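The subplot layout itself does not depend on sktime. The sketch below arranges three series in one figure with `plt.subplots`, using synthetic monthly data as a stand-in for datasets loaded from the library; no sktime functions are called, and the series names are made up. The `Agg` backend line is only there so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for datasets from sktime.
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
steps = np.arange(24, dtype=float)
series = {
    "trend": pd.Series(steps, index=idx),
    "seasonal": pd.Series(np.sin(steps * 2 * np.pi / 12), index=idx),
    "trend + seasonal": pd.Series(steps + 5 * np.sin(steps * 2 * np.pi / 12), index=idx),
}

# One row per series; sharex aligns the time axes across panels.
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(8, 6), sharex=True)
for ax, (name, s) in zip(axes, series.items()):
    ax.plot(s.index, s.values)
    ax.set_title(name)
fig.tight_layout()
```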