Data Structures and Introduction to Pandas#

OBJECTIVES

  • Understand the similarities and differences of list, tuple, and dict in Python.

  • Use methods on collections

  • Iterate over collections and control flow with conditional statements

  • Connect data structures to NumPy arrays

  • Build a basic DataFrame

  • Read in a csv file as a DataFrame

Data Structures in Python#

A data structure can be thought of as a way to represent data in Python. We will explore three basic forms, and then introduce the external libraries NumPy and Pandas, which simplify and extend the base library's capabilities.

Data and Lists#

   some stock
0     102.824
1     202.062
2     297.471
3     401.106
4     503.444
5     607.566
6     708.208
7     809.986
8     908.093
9     1007.35

stock = [102.82449238,  202.06238561,  297.4710266 ,  401.10612769,
        503.44394471,  607.56592588,  708.20812766,  809.98616016,
        908.09339539, 1007.34582461]
type(stock)
list
len(stock)
10
#first day
#second day
#last day
#every other day
#third through sixth day
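The placeholder comments above can be filled in with index and slice notation. The following is a minimal sketch, assuming the ten-element stock list exactly as defined; the variable names (first_day, every_other_day, and so on) are illustrative and not part of the original notebook.

```python
# The ten daily prices from the table above
stock = [102.82449238, 202.06238561, 297.4710266, 401.10612769,
         503.44394471, 607.56592588, 708.20812766, 809.98616016,
         908.09339539, 1007.34582461]

first_day = stock[0]           # indexing starts at 0
second_day = stock[1]
last_day = stock[-1]           # negative indices count from the end
every_other_day = stock[::2]   # start:stop:step with a step of 2
third_to_sixth = stock[2:6]    # the stop index is excluded: items 2, 3, 4, 5

print(first_day, second_day, last_day)
print(every_other_day)
print(third_to_sixth)
```

Note that a slice always returns a new list, while a single index returns the element itself.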

list methods#

A method is a function attached to a particular type of object in Python, so a list has different methods than a string or a dict. In general, we access these methods with the syntax:

object.method()

See docs for full list of methods.

#see all methods -- .tab locally; just wait a second in colab
stock.
fruits = ['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']
fruits.count?
Signature: fruits.count(value, /)
Docstring: Return number of occurrences of value.
Type:      builtin_function_or_method
#using the count method
fruits.count('apple')
2
fruits.count('tangerine')
0

CHALLENGE

  1. How does the .index() method for a list work? Demonstrate its use with the list fruits.

  2. How does the .append() method for a list work? Demonstrate its use by appending a new stock price that is 112% of the last price in the list.

  3. What happens when you try to add two lists? What happens when you multiply a list by a constant (e.g. stock * 2)?

tuple#

Tuples are similar to lists in that they are ordered collections that can hold mixed datatypes. The primary difference is that a list is mutable (it can be changed in place) whereas a tuple is not.

stock_tuple = (102.82449238,  202.06238561,  297.4710266 ,  401.10612769,
        503.44394471,  607.56592588,  708.20812766,  809.98616016,
        908.09339539, 1007.34582461)
type(stock_tuple)
tuple
stock_tuple[0]
102.82449238
stock_tuple[1:7]
(202.06238561,
 297.4710266,
 401.10612769,
 503.44394471,
 607.56592588,
 708.20812766)

tuple methods#

A tuple has far fewer methods than a list: because it cannot be changed, only count and index are available.

stock_tuple.
stock[:4]
[102.82449238, 202.06238561, 297.4710266, 401.10612769]
#delete first element of list
del(stock[0])
print(stock)
[202.06238561, 297.4710266, 401.10612769, 503.44394471, 607.56592588, 708.20812766, 809.98616016, 908.09339539, 1007.34582461]
print(len(stock))
9
del(stock_tuple[0])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[36], line 1
----> 1 del(stock_tuple[0])

TypeError: 'tuple' object doesn't support item deletion
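Because a tuple cannot be modified in place, the usual workaround is to build a new tuple from the pieces you want to keep. The sketch below is illustrative only: it uses a shortened three-element tuple, and the names message and without_first are assumptions made for the example.

```python
# A shortened version of stock_tuple, for illustration
stock_tuple = (102.82449238, 202.06238561, 297.4710266)

# Attempting to delete an element raises a TypeError, which we catch here
try:
    del stock_tuple[0]
except TypeError as err:
    message = str(err)
    print(f'TypeError: {message}')

# Slicing creates a brand-new tuple; the original is left untouched
without_first = stock_tuple[1:]
print(without_first)
print(len(stock_tuple))
```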

dict#

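Dictionaries differ from lists and tuples in that their elements are not accessed by position. Instead, they contain key and value pairs, and each key is used to look up its value. (Since Python 3.7 a dict does remember the order in which keys were inserted, but you still cannot index it with 0, 1, 2, and so on.)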

   Year        Country  Spending_USD  Life_Expectancy
0  1970        Germany       252.311             70.6
1  1970         France       192.143             72.2
2  1970  Great Britain       123.993             71.9
3  1970          Japan       150.437             72.0
4  1970            USA       326.961             70.9

germany = {'year': 1970, 'country': 'Germany', 'Spending': 252.311, 'Life_Expectancy': 70.6}
type(germany)
dict
#access the year
#access the life expectancy
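The two placeholder comments above can be filled in with square-bracket key lookup. This is a minimal sketch using the germany dictionary as defined; the variable names year and life_expectancy are illustrative.

```python
germany = {'year': 1970, 'country': 'Germany', 'Spending': 252.311, 'Life_Expectancy': 70.6}

# Look up a value by placing its key in square brackets
year = germany['year']
life_expectancy = germany['Life_Expectancy']
print(year, life_expectancy)

# Keys are case-sensitive: 'Year' is not a key in this dictionary, but 'year' is
print('Year' in germany)
print('year' in germany)
```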

dict methods#

Often, we use a dictionary to represent multiple data points for multiple columns or features of a dataset.

health_exp = {'Year': [1970, 1970, 1970, 1970, 1970],
 'Country': ['Germany','France','Great Britain','Japan','USA'],
 'Spending_USD': [252.311,192.143,123.993,150.437,326.961],
 'Life_Expectancy': [70.6,72.2,71.9,72.0,70.9]}
#using get to access values
health_exp.get('Year')
[1970, 1970, 1970, 1970, 1970]
#print the key/value pairs
health_exp.items()
dict_items([('Year', [1970, 1970, 1970, 1970, 1970]), ('Country', ['Germany', 'France', 'Great Britain', 'Japan', 'USA']), ('Spending_USD', [252.311, 192.143, 123.993, 150.437, 326.961]), ('Life_Expectancy', [70.6, 72.2, 71.9, 72.0, 70.9])])
#the keys only
health_exp.keys()
dict_keys(['Year', 'Country', 'Spending_USD', 'Life_Expectancy'])
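One reason to prefer .get() over square brackets is how each handles a key that does not exist. The sketch below assumes a hypothetical 'Population' key that is not in the dataset; the names missing and fallback are illustrative.

```python
health_exp = {'Year': [1970, 1970, 1970, 1970, 1970],
              'Country': ['Germany', 'France', 'Great Britain', 'Japan', 'USA'],
              'Spending_USD': [252.311, 192.143, 123.993, 150.437, 326.961],
              'Life_Expectancy': [70.6, 72.2, 71.9, 72.0, 70.9]}

# .get() returns None for a missing key instead of raising an error
missing = health_exp.get('Population')
print(missing)

# A second argument to .get() supplies a default value to return instead
fallback = health_exp.get('Population', [])
print(fallback)

# Square brackets raise a KeyError for a missing key
try:
    health_exp['Population']
except KeyError as err:
    print(f'KeyError: {err}')
```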

QUESTION

What are some questions you might ask about this dataset? How would you find the mean spending and life expectancy?

conditional statements#

Comparison operators are built in for both numeric and string datatypes. Each comparison evaluates to either True or False.

#less than?
4 < 8
True
#greater than or equal to?
5 >= 3
True
mean_life_exp = sum(health_exp['Life_Expectancy'])/len(health_exp['Life_Expectancy'])
print(f'The mean life expectancy is {mean_life_exp:.2f} years')
The mean life expectancy is 71.52 years
#suppose we want to compare the individual datapoints to the mean
health_exp['Life_Expectancy']
[70.6, 72.2, 71.9, 72.0, 70.9]
#which countries life expectancy is less than average
health_exp['Life_Expectancy'] < mean_life_exp
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[70], line 2
      1 #which countries life expectancy is less than average
----> 2 health_exp['Life_Expectancy'] < mean_life_exp

TypeError: '<' not supported between instances of 'list' and 'float'
#alternatively we can compare each value one at a time
print(health_exp['Life_Expectancy'][0] < mean_life_exp)
print(health_exp['Life_Expectancy'][1] < mean_life_exp)
print(health_exp['Life_Expectancy'][2] < mean_life_exp)
True
False
False

Iterating over collections#

The for loop allows you to automate stepping through items in any collection. In general it works as:

for item in collection:
    print(item) #or whatever operation you want to perform on each
for country_life_exp in health_exp['Life_Expectancy']:
    print(country_life_exp)
70.6
72.2
71.9
72.0
70.9
#what if you iterate over a dictionary?
for entry in health_exp:
    print(entry)
Year
Country
Spending_USD
Life_Expectancy
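A loop and a conditional can be combined to answer the earlier question of which countries fall below the mean life expectancy. This is a sketch rather than the notebook's own solution: it assumes the built-in zip function (not introduced above) to walk through two lists in parallel, and the list name below_average is illustrative.

```python
health_exp = {'Country': ['Germany', 'France', 'Great Britain', 'Japan', 'USA'],
              'Life_Expectancy': [70.6, 72.2, 71.9, 72.0, 70.9]}

life_exp = health_exp['Life_Expectancy']
mean_life_exp = sum(life_exp) / len(life_exp)

below_average = []
# zip pairs the first country with the first value, the second with the second, ...
for country, value in zip(health_exp['Country'], life_exp):
    if value < mean_life_exp:
        below_average.append(country)
        print(f'{country}: {value} is below the mean')
    else:
        print(f'{country}: {value} is at or above the mean')

print(below_average)
```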

CHALLENGE

Adjust the loop above to print the values of the dictionary health_exp instead of the keys.

Functions#

Functions can be thought of in the same way as in high school mathematics: an object that takes an input, performs an operation on the input, and returns the result.

#defining a function
def f(x):
    return x**2
#using the function
f(3)
9

CHALLENGE

What does the function below do? Can you use it on the health_exp dataset?

def averager(list_of_numbers):
    '''
    This function takes in a list and returns
    the arithmetic mean of the quantities.
    ---------
    Keyword arguments:
    list_of_numbers: list (List of values to average)
    --------
    returns:
    average: float (mean of list)
    --------
    Example:
    x = [1, 2, 3, 4, 5]
    averager(x) --> 3.0
    '''
    return sum(list_of_numbers)/len(list_of_numbers)
    

Shortcomings of basic data structures#

Should finding the average of a dataset take so much effort? Suppose you wanted to find the standard deviation of a column. Wouldn’t it be nice if we had a data structure that had all this built in for us?
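To make the effort concrete, here is a sketch of the sample standard deviation computed with only base Python. It assumes the sample (n - 1) form of the formula, which is also the default used by the pandas std method, and checks the result against the standard library's statistics.stdev. The variable names are illustrative.

```python
import math
import statistics

life_exp = [70.6, 72.2, 71.9, 72.0, 70.9]

# Step 1: the mean
mean = sum(life_exp) / len(life_exp)

# Step 2: the squared distance of each value from the mean
squared_diffs = []
for value in life_exp:
    squared_diffs.append((value - mean) ** 2)

# Step 3: divide by n - 1 (sample variance), then take the square root
sample_variance = sum(squared_diffs) / (len(life_exp) - 1)
sample_std = math.sqrt(sample_variance)
print(sample_std)

# The standard library gives the same answer in one call
print(math.isclose(sample_std, statistics.stdev(life_exp)))
```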

Python Packages#

As a general-purpose programming language, Python is designed to be used in many ways. You can build web sites or industrial robots or a game for your friends to play, and much more, all using the same core technology.

Python’s flexibility is why the first step in every Python project must be to think about the project’s audience and the corresponding environment where the project will run. It might seem strange to think about packaging before writing code, but this process does wonders for avoiding future headaches. – source

Typically, you will install a package and consult its documentation to learn how to use it. When we downloaded Anaconda, it came loaded with many of the important libraries for data-oriented tasks.

pandas#

The pandas library will be our workhorse for a data structure. It is already installed in both Anaconda and Google Colab notebooks. Before we can use the library, we first import it and alias it.

import pandas as pd
  • The documentation for pandas is excellent, and found here

  • They offer a basic cheat sheet that can be a quick reference as you learn here

  • In the documentation there are getting started tutorials that could be helpful to explore here

DataFrame as a data structure#

#creating a DataFrame from a dictionary
df = pd.DataFrame(health_exp)
#what kind of thing is this?
type(df)
pandas.core.frame.DataFrame
#take a look at the DataFrame
df
   Year        Country  Spending_USD  Life_Expectancy
0  1970        Germany       252.311             70.6
1  1970         France       192.143             72.2
2  1970  Great Britain       123.993             71.9
3  1970          Japan       150.437             72.0
4  1970            USA       326.961             70.9

DataFrame methods and attributes#

Just like the base library data structures, the DataFrame has its own methods and attributes.

#see all DataFrame methods and attributes -- type the line below and press tab locally; just wait a second in colab
df.
#finding the mean of all numeric columns
df.mean(numeric_only=True)
Year               1970.000
Spending_USD        209.169
Life_Expectancy      71.520
dtype: float64
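Earlier, comparing a whole list to the mean raised a TypeError. A DataFrame column supports that comparison directly. The sketch below assumes the same health_exp dictionary; the name below_mean is illustrative.

```python
import pandas as pd

health_exp = {'Year': [1970, 1970, 1970, 1970, 1970],
              'Country': ['Germany', 'France', 'Great Britain', 'Japan', 'USA'],
              'Spending_USD': [252.311, 192.143, 123.993, 150.437, 326.961],
              'Life_Expectancy': [70.6, 72.2, 71.9, 72.0, 70.9]}
df = pd.DataFrame(health_exp)

# Selecting one column with square brackets returns a Series
life_exp = df['Life_Expectancy']

# Comparing a Series to a number compares every element at once
below_mean = life_exp < life_exp.mean()
print(below_mean)

# The True/False results can be used to select matching rows
print(df.loc[below_mean, 'Country'].tolist())
```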

CHALLENGE

  1. Determine the standard deviation of the numeric columns.

Creating DataFrames from files#

The pandas function pd.read_csv allows us to read in an external .csv file and create a DataFrame.

#create a csv file
df.to_csv('health_exp.csv')
#read the csv file back in as a new object
health_df = pd.read_csv('health_exp.csv')
health_df
   Unnamed: 0  Year        Country  Spending_USD  Life_Expectancy
0           0  1970        Germany       252.311             70.6
1           1  1970         France       192.143             72.2
2           2  1970  Great Britain       123.993             71.9
3           3  1970          Japan       150.437             72.0
4           4  1970            USA       326.961             70.9
#the index was included as a column in the csv file
#we can specify this column as the index instead of creating
#a new index column
health_df = pd.read_csv('health_exp.csv', index_col=0)
health_df
   Year        Country  Spending_USD  Life_Expectancy
0  1970        Germany       252.311             70.6
1  1970         France       192.143             72.2
2  1970  Great Britain       123.993             71.9
3  1970          Japan       150.437             72.0
4  1970            USA       326.961             70.9
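Another option is to leave the index out when writing the file, so there is no extra column to deal with when reading it back. The sketch below uses the index=False argument of to_csv. As an assumption made to keep the example self-contained, it writes to an in-memory io.StringIO buffer rather than a file on disk; a filename such as 'health_exp.csv' works the same way.

```python
import io
import pandas as pd

health_exp = {'Year': [1970, 1970, 1970, 1970, 1970],
              'Country': ['Germany', 'France', 'Great Britain', 'Japan', 'USA'],
              'Spending_USD': [252.311, 192.143, 123.993, 150.437, 326.961],
              'Life_Expectancy': [70.6, 72.2, 71.9, 72.0, 70.9]}
df = pd.DataFrame(health_exp)

# An in-memory text buffer stands in for a file on disk
buffer = io.StringIO()

# index=False leaves the row labels out of the csv
df.to_csv(buffer, index=False)

# Rewind the buffer to the beginning, then read it back
buffer.seek(0)
health_df = pd.read_csv(buffer)

print(health_df.columns.tolist())
print(health_df.shape)
```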

CHALLENGE

  1. Find a dataset in .csv format with a quick Google search. Download the file and read it in as a DataFrame. Show your DataFrame to your neighbor and rejoice.