Data Structures and Introduction to Pandas

Data Structures and Introduction to Pandas#

OBJECTIVES

Understand the similarities and differences of list, tuple, and dict in Python.
Use methods on collections
Iterate over collections and control flow with conditional statements
Connect data structures to NumPy arrays
Build a basic DataFrame
Read in a csv file as a DataFrame

Data Structures in Python#

A data structure can be thought of as a way to represent data in Python. We will explore three basic forms, and then introduce the external libraries NumPy and Pandas, which simplify and extend the base library's capabilities.

Data and Lists#

	some stock
0	102.824
1	202.062
2	297.471
3	401.106
4	503.444
5	607.566
6	708.208
7	809.986
8	908.093
9	1007.35

stock = [102.82449238,  202.06238561,  297.4710266 ,  401.10612769,
        503.44394471,  607.56592588,  708.20812766,  809.98616016,
        908.09339539, 1007.34582461]

type(stock)

list

len(stock)

#first day

#second day

#last day

#every other day

#third through sixth day
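One way to fill in the cells above is with square-bracket indexing and slicing. The sketch below repeats the `stock` list so that it runs on its own; the variable names on the left are only illustrative.

```python
stock = [102.82449238, 202.06238561, 297.4710266, 401.10612769,
         503.44394471, 607.56592588, 708.20812766, 809.98616016,
         908.09339539, 1007.34582461]

first_day = stock[0]              # indexing starts at 0
second_day = stock[1]
last_day = stock[-1]              # negative indices count from the end
every_other_day = stock[::2]      # start:stop:step with a step of 2
third_through_sixth = stock[2:6]  # the stop index is excluded

print(first_day, second_day, last_day)
print(every_other_day)
print(third_through_sixth)
```

Note that a slice returns a new list, while a single index returns the element itself.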

`list` methods#

A method is a function attached to a particular “type” of object in Python; a list has one set of methods, a dict another. In general, we access these methods with the syntax:

object.method()

See docs for full list of methods.

#see all methods -- .tab locally; just wait a second in colab
stock.

fruits = ['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']

fruits.count?

Signature: fruits.count(value, /)
Docstring: Return number of occurrences of value.
Type:      builtin_function_or_method

#using the count method
fruits.count('apple')

fruits.count('tangerine')

CHALLENGE

How does the .index() method for a list work? Demonstrate its use with the list fruits.

How does the .append() method for a list work? Demonstrate its use by appending a new stock price that is 112% of the last price in the list.

What happens when you try to add two lists? What happens when you multiply a list by a constant (e.g. stock * 2)?

`tuple`#

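A tuple is similar to a list in that both are ordered collections that can hold mixed datatypes. The primary difference is that a list is mutable, whereas a tuple is not: once created, a tuple's elements cannot be changed, added, or removed.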

stock_tuple = (102.82449238,  202.06238561,  297.4710266 ,  401.10612769,
        503.44394471,  607.56592588,  708.20812766,  809.98616016,
        908.09339539, 1007.34582461)

type(stock_tuple)

tuple

stock_tuple[0]

102.82449238

stock_tuple[1:7]

(202.06238561,
 297.4710266,
 401.10612769,
 503.44394471,
 607.56592588,
 708.20812766)

`tuple` methods#

Much more limited than the list.

stock_tuple.
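If tab completion is not available, the built-in `dir` function gives a rough way to compare the two types. This is a sketch that keeps only the names without a leading underscore; the shortened `stock_tuple` is just for illustration.

```python
stock_tuple = (102.82449238, 202.06238561, 297.4710266)

# public (non-underscore) method names of each type
tuple_methods = [name for name in dir(tuple) if not name.startswith('_')]
list_methods = [name for name in dir(list) if not name.startswith('_')]

print(tuple_methods)
print(list_methods)

# both tuple methods only read from the tuple; neither changes it
position = stock_tuple.index(202.06238561)
occurrences = stock_tuple.count(297.4710266)
print(position, occurrences)
```

The list has methods such as `append` and `remove` that modify it in place, which is why they have no tuple counterpart.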

stock[:4]

[102.82449238, 202.06238561, 297.4710266, 401.10612769]

#delete first element of list
del(stock[0])

print(stock)

[202.06238561, 297.4710266, 401.10612769, 503.44394471, 607.56592588, 708.20812766, 809.98616016, 908.09339539, 1007.34582461]

print(len(stock))

del(stock_tuple[0])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[36], line 1
----> 1 del(stock_tuple[0])

TypeError: 'tuple' object doesn't support item deletion

`dict`#

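Dictionaries differ from lists and tuples in that their elements are looked up by key rather than by position. Each element is a key and value pair, and the key is used to identify that element within the collection. (Since Python 3.7 a dict does preserve insertion order, but it still cannot be indexed by position.)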

	Year	Country	Spending_USD	Life_Expectancy
0	1970	Germany	252.311	70.6
1	1970	France	192.143	72.2
2	1970	Great Britain	123.993	71.9
3	1970	Japan	150.437	72
4	1970	USA	326.961	70.9

germany = {'year': 1970, 'country': 'Germany', 'Spending': 252.311, 'Life_Expectancy': 70.6}

type(germany)

dict

#access the year

#access the life expectancy
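One way to fill in the two cells above is to put the key in square brackets. This sketch repeats the `germany` dictionary so it runs on its own; note that keys are case-sensitive, so `'year'` and `'Year'` are different keys.

```python
germany = {'year': 1970, 'country': 'Germany',
           'Spending': 252.311, 'Life_Expectancy': 70.6}

year = germany['year']                        # look up a value by its key
life_expectancy = germany['Life_Expectancy']
print(year, life_expectancy)

# the `in` operator checks whether a key exists before you ask for it
has_capitalized_year = 'Year' in germany
print(has_capitalized_year)
```

Asking for a key that does not exist with square brackets, such as `germany['Year']`, raises a `KeyError`.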

`dict` methods#

Often, we use a dictionary to represent multiple data points for multiple columns or features of a dataset.

health_exp = {'Year': [1970, 1970, 1970, 1970, 1970],
 'Country': ['Germany','France','Great Britain','Japan','USA'],
 'Spending_USD': [252.311,192.143,123.993,150.437,326.961],
 'Life_Expectancy': [70.6,72.2,71.9,72.0,70.9]}

#using get to access values
health_exp.get('Year')

[1970, 1970, 1970, 1970, 1970]

#print the key/value pairs
health_exp.items()

dict_items([('Year', [1970, 1970, 1970, 1970, 1970]), ('Country', ['Germany', 'France', 'Great Britain', 'Japan', 'USA']), ('Spending_USD', [252.311, 192.143, 123.993, 150.437, 326.961]), ('Life_Expectancy', [70.6, 72.2, 71.9, 72.0, 70.9])])

#the keys only
health_exp.keys()

dict_keys(['Year', 'Country', 'Spending_USD', 'Life_Expectancy'])

QUESTION

What are some questions you might ask about this dataset? How would you find the mean spending and life expectancy?

conditional statements#

Comparison operators are built into Python for both numeric and string datatypes. Each comparison evaluates to a boolean, either True or False.

#less than?
4 < 8

True

#greater than or equal to?
5 >= 3

True
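The same operators work on strings. The short sketch below uses country names from the dataset plus one made-up lowercase word; strings are compared character by character, and uppercase letters sort before lowercase ones.

```python
# equality is case-sensitive
same_name = 'Germany' == 'germany'

# ordering is lexicographic: 'F' comes before 'G'
france_first = 'France' < 'Germany'

# every uppercase letter sorts before every lowercase letter
upper_before_lower = 'USA' < 'apple'

print(same_name, france_first, upper_before_lower)
```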

mean_life_exp = sum(health_exp['Life_Expectancy'])/len(health_exp['Life_Expectancy'])
print(f'The mean life expectancy is {mean_life_exp:.2f} years')

The mean life expectancy is 71.52 years

#suppose we want to compare the individual datapoints to the mean
health_exp['Life_Expectancy']

[70.6, 72.2, 71.9, 72.0, 70.9]

#which countries' life expectancy is less than average
health_exp['Life_Expectancy'] < mean_life_exp

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[70], line 2
      1 #which countries' life expectancy is less than average
----> 2 health_exp['Life_Expectancy'] < mean_life_exp

TypeError: '<' not supported between instances of 'list' and 'float'

#alternatively we can compare one value at a time
print(health_exp['Life_Expectancy'][0] < mean_life_exp)
print(health_exp['Life_Expectancy'][1] < mean_life_exp)
print(health_exp['Life_Expectancy'][2] < mean_life_exp)

True
False
False

Iterating over collections#

The for loop allows you to automate stepping through items in any collection. In general it works as:

for item in collection:
    print(item) #or whatever operation you want to perform on each

for country_life_exp in health_exp['Life_Expectancy']:
    print(country_life_exp)
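The loop above can be combined with a comparison to answer the earlier question about which countries fall below the mean. This is a minimal sketch that repeats the relevant `health_exp` values so it runs on its own. The `if`/`else` statement and the built-in `zip`, which pairs up items from two lists, are not used elsewhere in these notes and appear here only for illustration.

```python
health_exp = {'Country': ['Germany', 'France', 'Great Britain', 'Japan', 'USA'],
              'Life_Expectancy': [70.6, 72.2, 71.9, 72.0, 70.9]}

mean_life_exp = sum(health_exp['Life_Expectancy']) / len(health_exp['Life_Expectancy'])

below_average = []
# zip steps through both lists together, one (country, value) pair at a time
for country, life_exp in zip(health_exp['Country'], health_exp['Life_Expectancy']):
    if life_exp < mean_life_exp:
        below_average.append(country)
        print(f'{country}: {life_exp} is below the mean')
    else:
        print(f'{country}: {life_exp} is at or above the mean')

print(below_average)
```

The indented block under `if` runs only when the comparison is True; otherwise the block under `else` runs.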

#what if you iterate over a dictionary?
for entry in health_exp:
    print(entry)

Year
Country
Spending_USD
Life_Expectancy

CHALLENGE

Adjust the loop above to print the values of the dictionary health_exp instead of the keys.

Functions#

A function can be thought of the way it is usually described in high school mathematics: an object that takes an input, performs an operation on the input, and returns the result.

#defining a function
def f(x):
    return x**2

#using the function
f(3)

CHALLENGE

What does the function below do? Can you use it on the health_exp dataset?

def averager(list_of_numbers):
    '''
    This function takes in a list and returns
    the arithmetic mean of the quantities.
    ---------
    Keyword arguments:
    list_of_numbers: list (List of values to average)
    --------
    returns:
    average: float (mean of list)
    --------
    Example:
    x = [1, 2, 3, 4, 5]
    averager(x) --> 3.0
    '''
    return sum(list_of_numbers)/len(list_of_numbers)
    

Shortcomings of basic data structures#

Should finding the average of a dataset take so much effort? Suppose you wanted to find the standard deviation of a column. Wouldn’t it be nice if we had a data structure that had all of this built in for us?
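To make the effort concrete, here is a rough sketch of a sample standard deviation computed with base Python only. The life expectancy values are repeated from `health_exp`; the list comprehension syntax and the standard library's `statistics.stdev`, used as a cross-check, are additions not covered in these notes.

```python
import math
import statistics

life_expectancy = [70.6, 72.2, 71.9, 72.0, 70.9]

# step 1: the mean
mean = sum(life_expectancy) / len(life_expectancy)

# step 2: squared distance of each value from the mean
squared_deviations = [(x - mean) ** 2 for x in life_expectancy]

# step 3: the sample standard deviation divides by n - 1, then takes the root
sample_std = math.sqrt(sum(squared_deviations) / (len(life_expectancy) - 1))

print(round(sample_std, 3))
print(round(statistics.stdev(life_expectancy), 3))  # cross-check
```

That is three steps and two imports for one number from one column, which is the kind of bookkeeping a dedicated data structure can take over.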

Python Packages#

As a general-purpose programming language, Python is designed to be used in many ways. You can build web sites or industrial robots or a game for your friends to play, and much more, all using the same core technology.

Python’s flexibility is why the first step in every Python project must be to think about the project’s audience and the corresponding environment where the project will run. It might seem strange to think about packaging before writing code, but this process does wonders for avoiding future headaches. – source

Typically, you will install a package and consult its documentation to learn how to use it. When we downloaded Anaconda, it came loaded with many of the important libraries for data-oriented tasks.

`pandas`#

The pandas library will be our workhorse for a data structure. It is already installed in both Anaconda and Google Colab notebooks. Before we can use the library, we first import it and alias it.

import pandas as pd

The documentation for pandas is excellent, and found here
They offer a basic cheat sheet that can be a quick reference as you learn here
In the documentation there are getting started tutorials that could be helpful to explore here

`DataFrame` as a data structure#

#creating a DataFrame from a dictionary
df = pd.DataFrame(health_exp)

#what kind of thing is this?
type(df)

pandas.core.frame.DataFrame

#take a look at the DataFrame
df

	Year	Country	Spending_USD	Life_Expectancy
0	1970	Germany	252.311	70.6
1	1970	France	192.143	72.2
2	1970	Great Britain	123.993	71.9
3	1970	Japan	150.437	72.0
4	1970	USA	326.961	70.9

`DataFrame` methods and attributes#

Just like the base library data structures, the DataFrame has its own methods and attributes.

#DataFrame methods and properties -- .tab locally; just wait a second in colab
df.

#finding the mean of all numeric columns
df.mean(numeric_only=True)

Year               1970.000
Spending_USD        209.169
Life_Expectancy      71.520
dtype: float64
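Attributes are read without parentheses, while methods are called with them. The sketch below rebuilds `df` from the `health_exp` dictionary so it runs on its own, then looks at a few common attributes and calls a method on a single column.

```python
import pandas as pd

health_exp = {'Year': [1970, 1970, 1970, 1970, 1970],
              'Country': ['Germany', 'France', 'Great Britain', 'Japan', 'USA'],
              'Spending_USD': [252.311, 192.143, 123.993, 150.437, 326.961],
              'Life_Expectancy': [70.6, 72.2, 71.9, 72.0, 70.9]}
df = pd.DataFrame(health_exp)

# attributes: no parentheses
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the column names
print(df.dtypes)         # the datatype of each column

# selecting one column with square brackets, then calling a method on it
life_exp_mean = df['Life_Expectancy'].mean()
print(life_exp_mean)
```

Selecting a column uses the same square-bracket syntax as looking up a key in the dictionary the DataFrame was built from.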

CHALLENGE

Determine the standard deviation of the numeric columns.

Creating `DataFrame`s from files#

The pandas function pd.read_csv allows us to read in an external .csv file and create a DataFrame.

#create a csv file
df.to_csv('health_exp.csv')

#read the csv file back in as a new object
health_df = pd.read_csv('health_exp.csv')
health_df

	Unnamed: 0	Year	Country	Spending_USD	Life_Expectancy
0	0	1970	Germany	252.311	70.6
1	1	1970	France	192.143	72.2
2	2	1970	Great Britain	123.993	71.9
3	3	1970	Japan	150.437	72.0
4	4	1970	USA	326.961	70.9

#the index was included as a column in the csv file
#we can specify this column as the index instead of creating
#a new index column
health_df = pd.read_csv('health_exp.csv', index_col=0)
health_df

	Year	Country	Spending_USD	Life_Expectancy
0	1970	Germany	252.311	70.6
1	1970	France	192.143	72.2
2	1970	Great Britain	123.993	71.9
3	1970	Japan	150.437	72.0
4	1970	USA	326.961	70.9
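Another option is to leave the index out when writing the file, so there is no extra column to deal with when reading it back. This is a sketch using the `index=False` argument of `to_csv`. The two-row DataFrame is a shortened stand-in for `df`, and an in-memory `io.StringIO` buffer stands in for a file on disk so that the example leaves nothing behind.

```python
import io
import pandas as pd

df = pd.DataFrame({'Country': ['Germany', 'France'],
                   'Life_Expectancy': [70.6, 72.2]})

buffer = io.StringIO()          # in-memory stand-in for a csv file
df.to_csv(buffer, index=False)  # write the columns only, not the index
buffer.seek(0)                  # rewind to the start before reading

round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
print(round_trip)
```

With a real file, the same idea would be `df.to_csv('health_exp.csv', index=False)` followed by a plain `pd.read_csv('health_exp.csv')`.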

CHALLENGE

Find a dataset in .csv format with a quick Google search. Download the file and read it in as a DataFrame. Show your DataFrame to your neighbor and rejoice.
