Data Structures and Introduction to Pandas#
OBJECTIVES
Understand the similarities and differences of list, tuple, and dict in Python.
Use methods on collections
Iterate over collections and control flow with conditional statements
Connect data structures to NumPy arrays
Build a basic DataFrame
Read in a csv file as a DataFrame
Data Structures in Python#
A data structure can be thought of as a way to represent data in Python. We will explore three basic forms, and then introduce the external libraries NumPy and Pandas, which simplify and extend the base library's capabilities.
Data and Lists#
| | some stock |
|---|---|
| 0 | 102.824 |
| 1 | 202.062 |
| 2 | 297.471 |
| 3 | 401.106 |
| 4 | 503.444 |
| 5 | 607.566 |
| 6 | 708.208 |
| 7 | 809.986 |
| 8 | 908.093 |
| 9 | 1007.35 |
stock = [102.82449238, 202.06238561, 297.4710266 , 401.10612769,
503.44394471, 607.56592588, 708.20812766, 809.98616016,
908.09339539, 1007.34582461]
type(stock)
list
len(stock)
10
#first day
stock[0]
102.82449238
#second day
stock[1]
202.06238561
#last day
stock[-1]
1007.34582461
#every other day -- [start idx: stop idx: count by]
stock[2:7:2]
[297.4710266, 503.44394471, 708.20812766]
#third through fifth day (the stop index 5 is excluded)
stock[2:5]
[297.4710266, 401.10612769, 503.44394471]
stock[5:8]
[607.56592588, 708.20812766, 809.98616016]
stock
[102.82449238,
202.06238561,
297.4710266,
401.10612769,
503.44394471,
607.56592588,
708.20812766,
809.98616016,
908.09339539,
1007.34582461]
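The slices above follow the pattern start:stop:step, where the element at the stop index is left out. A minimal sketch of the common variations, using a short made-up list called prices (an assumption for illustration, not part of the stock data) so the results are easy to check by eye:

```python
# hypothetical list, chosen so the values are easy to follow
prices = [10, 20, 30, 40, 50, 60]

first_three = prices[:3]        # start defaults to 0
middle = prices[2:5]            # indices 2, 3 and 4; index 5 is excluded
every_other = prices[::2]       # step of 2 over the whole list
last_two = prices[-2:]          # negative indices count from the end
reversed_prices = prices[::-1]  # a step of -1 walks backwards

print(first_three, middle, every_other, last_two, reversed_prices)
```

Slicing always returns a new list, so prices itself is unchanged by any of these lines.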
list methods#
A method is a function attached to a particular type of object in Python; each type (list, tuple, dict, and so on) has its own set. In general, we access these methods with the syntax:
object.method()
See the docs for the full list of methods.
#see all methods -- .tab locally; just wait a second in colab
stock
[102.82449238,
202.06238561,
297.4710266,
401.10612769,
503.44394471,
607.56592588,
708.20812766,
809.98616016,
908.09339539,
1007.34582461]
fruits = ['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']
fruits.count?
Signature: fruits.count(value, /)
Docstring: Return number of occurrences of value.
Type: builtin_function_or_method
#using the count method
fruits.count('apple')
2
fruits.count('tangerine')
0
string_1 = 'Lenny'
string_2 = 'Hardy'
#join places string_1 between each character of string_2
string_1.join(string_2)
'HLennyaLennyrLennydLennyy'
CHALLENGE
How does the .index() method for a list work? Demonstrate its use with the list fruits.
fruits.index('apple', 2)
5
fruits[1]
'apple'
fruits
['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']
How does the .append() method for a list work? Demonstrate its use by appending a new stock price that is 112% of the last price in the list.
stock.append(stock[-1]*1.12)
print(stock)
[102.82449238, 202.06238561, 297.4710266, 401.10612769, 503.44394471, 607.56592588, 708.20812766, 809.98616016, 908.09339539, 1007.34582461, 1128.2273235632001]
What happens when you try to add two lists? What about multiplying a list by a constant (e.g. stock * 2)?
stock + stock + stock
[102.82449238,
202.06238561,
297.4710266,
401.10612769,
503.44394471,
607.56592588,
708.20812766,
809.98616016,
908.09339539,
1007.34582461,
1128.2273235632001,
102.82449238,
202.06238561,
297.4710266,
401.10612769,
503.44394471,
607.56592588,
708.20812766,
809.98616016,
908.09339539,
1007.34582461,
1128.2273235632001,
102.82449238,
202.06238561,
297.4710266,
401.10612769,
503.44394471,
607.56592588,
708.20812766,
809.98616016,
908.09339539,
1007.34582461,
1128.2273235632001]
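As the output above suggests, + joins lists end to end rather than adding their values. The sketch below shows the same behavior for * and contrasts it with element-wise arithmetic, which needs a loop or a list comprehension. The list small is a made-up example, not part of the stock data.

```python
# hypothetical short list so the output stays readable
small = [1.0, 2.0, 3.0]

combined = small + small   # concatenation: a new list of length 6
repeated = small * 2       # repetition: same result as small + small

# arithmetic on each element needs a loop or a list comprehension
doubled = [value * 2 for value in small]

print(combined)
print(repeated)
print(doubled)
```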
tuple#
Tuples are similar to lists in that they are ordered collections that can hold mixed datatypes. The primary difference is that a list is mutable (it can be changed in place) whereas a tuple is not.
stock_tuple = (102.82449238, 202.06238561, 297.4710266 , 401.10612769,
503.44394471, 607.56592588, 708.20812766, 809.98616016,
908.09339539, 1007.34582461)
type(stock_tuple)
tuple
stock_tuple[0]
102.82449238
stock_tuple[1:7]
(202.06238561,
297.4710266,
401.10612769,
503.44394471,
607.56592588,
708.20812766)
tuple methods#
Tuple methods are much more limited than list methods: a tuple offers only count and index.
#position of the first occurrence of a value
stock_tuple.index(297.4710266)
2
stock[:4]
[102.82449238, 202.06238561, 297.4710266, 401.10612769]
#delete first element of list
del(stock[0])
print(stock)
[202.06238561, 297.4710266, 401.10612769, 503.44394471, 607.56592588, 708.20812766, 809.98616016, 908.09339539, 1007.34582461, 1128.2273235632001]
print(len(stock))
10
del(stock_tuple[0])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[84], line 1
----> 1 del(stock_tuple[0])
TypeError: 'tuple' object doesn't support item deletion
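The error above is what immutability looks like in practice. A small sketch, using a made-up tuple called point, of how the same restriction applies to assignment and how you can build a new tuple instead of changing an existing one:

```python
# hypothetical tuple of three prices
point = (102.8, 202.1, 297.5)

try:
    point[0] = 0.0             # item assignment is not allowed on a tuple
except TypeError as err:
    message = str(err)
    print('TypeError:', message)

# to "change" a tuple, build a new one from pieces of the old one
without_first = point[1:]
with_extra = point + (401.1,)  # the trailing comma makes a one-element tuple

print(without_first)
print(with_extra)
print(point)                   # the original tuple is unchanged
```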
dict#
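Dictionaries differ from lists and tuples in that their elements are not accessed by position. Instead they contain key and value pairs, and the key is used to identify each element of the collection. (Since Python 3.7 a dict remembers the order in which keys were inserted, but you still look values up by key, not by index.)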
| | Year | Country | Spending_USD | Life_Expectancy |
|---|---|---|---|---|
| 0 | 1970 | Germany | 252.311 | 70.6 |
| 1 | 1970 | France | 192.143 | 72.2 |
| 2 | 1970 | Great Britain | 123.993 | 71.9 |
| 3 | 1970 | Japan | 150.437 | 72 |
| 4 | 1970 | USA | 326.961 | 70.9 |
germany = {'year': 1970, 'country': 'Germany', 'Spending': 252.311, 'Life_Expectancy': 70.6}
type(germany)
dict
#access the year
germany['year']
1970
#access the life expectancy
germany['Life_Expectancy']
70.6
dict methods#
Often, we use a dictionary to represent multiple data points for multiple columns or features of a dataset.
health_exp = {'Year': [1970, 1970, 1970, 1970, 1970],
'Country': ['Germany','France','Great Britain','Japan','USA'],
'Spending_USD': [252.311,192.143,123.993,150.437,326.961],
'Life_Expectancy': [70.6,72.2,71.9,72.0,70.9]}
#see all dict methods -- type health_exp. and press tab locally
#using get to access values
health_exp.get('Year')
[1970, 1970, 1970, 1970, 1970]
#print the key/value pairs
health_exp.items()
dict_items([('Year', [1970, 1970, 1970, 1970, 1970]), ('Country', ['Germany', 'France', 'Great Britain', 'Japan', 'USA']), ('Spending_USD', [252.311, 192.143, 123.993, 150.437, 326.961]), ('Life_Expectancy', [70.6, 72.2, 71.9, 72.0, 70.9])])
#the keys only
health_exp.keys()
dict_keys(['Year', 'Country', 'Spending_USD', 'Life_Expectancy'])
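A few more dictionary operations that come up often, sketched with the germany dictionary from earlier. The 'population' key is deliberately one that does not exist, to show how get behaves with a missing key.

```python
germany = {'year': 1970, 'country': 'Germany',
           'Spending': 252.311, 'Life_Expectancy': 70.6}

# square brackets raise a KeyError for a missing key; get returns None
# (or a default that you supply) instead
population = germany.get('population')
population_or_default = germany.get('population', 'unknown')

# membership tests check the keys, not the values
has_year = 'year' in germany

# items() yields (key, value) pairs that can be unpacked in a loop
for key, value in germany.items():
    print(key, '->', value)
```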
QUESTION
What are some questions you might ask about this dataset? How would you find the mean spending and life expectancy?
health_exp
{'Year': [1970, 1970, 1970, 1970, 1970],
'Country': ['Germany', 'France', 'Great Britain', 'Japan', 'USA'],
'Spending_USD': [252.311, 192.143, 123.993, 150.437, 326.961],
'Life_Expectancy': [70.6, 72.2, 71.9, 72.0, 70.9]}
health_exp['date'] = '09-09-2025'
health_exp
{'Year': [1970, 1970, 1970, 1970, 1970],
'Country': ['Germany', 'France', 'Great Britain', 'Japan', 'USA'],
'Spending_USD': [252.311, 192.143, 123.993, 150.437, 326.961],
'Life_Expectancy': [70.6, 72.2, 71.9, 72.0, 70.9],
'date': '09-09-2025'}
sum(health_exp['Spending_USD'])/len(health_exp['Spending_USD'])
209.169
conditional statements#
Python has built-in comparison operators for both numeric and string datatypes. Each comparison evaluates to True or False.
#less than?
4 < 8
True
#greater than or equal to?
5 >= 3
True
mean_life_exp = sum(health_exp['Life_Expectancy'])/len(health_exp['Life_Expectancy'])
print(f'The mean life expectancy is {mean_life_exp:.2f} years')
The mean life expectancy is 71.52 years
#suppose we want to compare the individual datapoints to the mean
health_exp['Life_Expectancy']
[70.6, 72.2, 71.9, 72.0, 70.9]
#which countries' life expectancy is less than average?
health_exp['Life_Expectancy'] < mean_life_exp
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[102], line 2
1 #which countries' life expectancy is less than average?
----> 2 health_exp['Life_Expectancy'] < mean_life_exp
TypeError: '<' not supported between instances of 'list' and 'float'
#alternatively we can compare one value at a time
print(health_exp['Life_Expectancy'][0] < mean_life_exp)
print(health_exp['Life_Expectancy'][1] < mean_life_exp)
print(health_exp['Life_Expectancy'][2] < mean_life_exp)
True
False
False
Iterating over collections#
The for loop allows you to automate stepping through items in any collection. In general it works as:
for item in collection:
    print(item) #or whatever operation you want to perform on each
for country_life_exp in health_exp['Life_Expectancy']:
    print(country_life_exp)
70.6
72.2
71.9
72.0
70.9
#what if you iterate over a dictionary?
for entry in health_exp:
    print(entry)
Year
Country
Spending_USD
Life_Expectancy
CHALLENGE
Adjust the loop above to print the values of the dictionary health_exp instead of the keys.
for entry in health_exp:
    print(health_exp[entry])
[1970, 1970, 1970, 1970, 1970]
['Germany', 'France', 'Great Britain', 'Japan', 'USA']
[252.311, 192.143, 123.993, 150.437, 326.961]
[70.6, 72.2, 71.9, 72.0, 70.9]
for v in health_exp.values():
    print(v)
[1970, 1970, 1970, 1970, 1970]
['Germany', 'France', 'Great Britain', 'Japan', 'USA']
[252.311, 192.143, 123.993, 150.437, 326.961]
[70.6, 72.2, 71.9, 72.0, 70.9]
for entry in health_exp:
    print(health_exp.get(entry))
[1970, 1970, 1970, 1970, 1970]
['Germany', 'France', 'Great Britain', 'Japan', 'USA']
[252.311, 192.143, 123.993, 150.437, 326.961]
[70.6, 72.2, 71.9, 72.0, 70.9]
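Combining a for loop with an if statement answers the earlier question of which countries fall below the mean life expectancy, without comparing one value at a time. This is a sketch, not the only way to do it: the two lists are copied from health_exp, and the built-in zip (not used elsewhere in this lesson) pairs each country with its value.

```python
countries = ['Germany', 'France', 'Great Britain', 'Japan', 'USA']
life_expectancy = [70.6, 72.2, 71.9, 72.0, 70.9]

mean_life_exp = sum(life_expectancy) / len(life_expectancy)

below_average = []
# zip pairs each country with its life expectancy
for country, years in zip(countries, life_expectancy):
    if years < mean_life_exp:
        below_average.append(country)
        print(f'{country}: {years} is below the mean')
    else:
        print(f'{country}: {years} is at or above the mean')

print(below_average)
```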
Functions#
Functions can be thought of in the same way as the familiar description from high school mathematics: an object that takes an input, performs an operation on the input, and returns the result.
#defining a function
def f(x):
    return x**2
#using the function
f(3)
CHALLENGE
What does the function below do? Can you use it on the health_exp dataset?
def averager(list_of_numbers):
    '''
    This function takes in a list and returns
    the arithmetic mean of the quantities.
    ---------
    Keyword arguments:
    list_of_numbers: list (List of values to average)
    --------
    returns:
    average: float (mean of list)
    --------
    Example:
    x = [1, 2, 3, 4, 5]
    averager(x) --> 3.0
    '''
    return sum(list_of_numbers)/len(list_of_numbers)
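One possible way to use averager on the dataset is to loop over the numeric columns only, since the function would fail on a list of strings such as 'Country'. The sketch below restates the function with a shortened docstring and assumes the original four keys of health_exp; the names numeric_columns and averages are introduced here for illustration.

```python
def averager(list_of_numbers):
    '''Return the arithmetic mean of a list of numbers.'''
    return sum(list_of_numbers) / len(list_of_numbers)

health_exp = {'Year': [1970, 1970, 1970, 1970, 1970],
              'Country': ['Germany', 'France', 'Great Britain', 'Japan', 'USA'],
              'Spending_USD': [252.311, 192.143, 123.993, 150.437, 326.961],
              'Life_Expectancy': [70.6, 72.2, 71.9, 72.0, 70.9]}

# only these columns hold numbers that make sense to average
numeric_columns = ['Spending_USD', 'Life_Expectancy']

averages = {}
for column in numeric_columns:
    averages[column] = averager(health_exp[column])

print(averages)
```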
Shortcomings of basic data structures#
Should finding the average of a dataset take so much effort? Suppose you wanted to find the standard deviation of a column. Wouldn’t it be nice if we had a data structure that had all of this built in for us?
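To make the effort concrete, here is a rough sketch of a sample standard deviation written with only base Python, checked against statistics.stdev from the standard library. The function name sample_std is an assumption made for this example; it divides by n - 1, the sample convention.

```python
import math
import statistics

spending = [252.311, 192.143, 123.993, 150.437, 326.961]

def sample_std(values):
    '''Sample standard deviation (divides by n - 1).'''
    mean = sum(values) / len(values)
    squared_diffs = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_diffs) / (len(values) - 1))

by_hand = sample_std(spending)
from_library = statistics.stdev(spending)

print(by_hand)
print(from_library)
```

Several lines of bookkeeping for one statistic on one column is the kind of work a dedicated data structure can take off our hands.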
Python Packages#
As a general-purpose programming language, Python is designed to be used in many ways. You can build web sites or industrial robots or a game for your friends to play, and much more, all using the same core technology.
Python’s flexibility is why the first step in every Python project must be to think about the project’s audience and the corresponding environment where the project will run. It might seem strange to think about packaging before writing code, but this process does wonders for avoiding future headaches. – source
Typically, you will install a package and consult its documentation for how to use it. When we downloaded Anaconda, it came loaded with many of the important libraries for data-oriented tasks.
pandas#
The pandas library will be our workhorse for a data structure. It is already installed in both Anaconda and Google Colab notebooks. Before we can use the library, we first import it and alias it.
import pandas as pd
The documentation for pandas is excellent. It includes a basic cheat sheet that can be a quick reference as you learn, and getting started tutorials that could be helpful to explore.
DataFrame as a data structure#
#creating a DataFrame from a dictionary
df = pd.DataFrame(health_exp)
#what kind of thing is this?
type(df)
pandas.core.frame.DataFrame
#take a look at the DataFrame
df
| | Year | Country | Spending_USD | Life_Expectancy | date |
|---|---|---|---|---|---|
| 0 | 1970 | Germany | 252.311 | 70.6 | 09-09-2025 |
| 1 | 1970 | France | 192.143 | 72.2 | 09-09-2025 |
| 2 | 1970 | Great Britain | 123.993 | 71.9 | 09-09-2025 |
| 3 | 1970 | Japan | 150.437 | 72.0 | 09-09-2025 |
| 4 | 1970 | USA | 326.961 | 70.9 | 09-09-2025 |
DataFrame methods and attributes#
Just like the base library data structures, the DataFrame has its own methods and attributes.
#DataFrame methods and properties -- type df. and press tab to list them
#finding the mean of all numeric columns
df.mean(numeric_only=True)
Year 1970.000
Spending_USD 209.169
Life_Expectancy 71.520
dtype: float64
CHALLENGE
Determine the standard deviation of the numeric columns.
df.std(numeric_only=True)
Year 0.000000
Spending_USD 81.747280
Life_Expectancy 0.719027
dtype: float64
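The element-wise comparison that raised a TypeError on a plain list works directly on a DataFrame column. A minimal sketch, assuming the original four columns of health_exp; the names life_mean, below, and below_countries are introduced here for illustration.

```python
import pandas as pd

health_exp = {'Year': [1970, 1970, 1970, 1970, 1970],
              'Country': ['Germany', 'France', 'Great Britain', 'Japan', 'USA'],
              'Spending_USD': [252.311, 192.143, 123.993, 150.437, 326.961],
              'Life_Expectancy': [70.6, 72.2, 71.9, 72.0, 70.9]}
df = pd.DataFrame(health_exp)

# square brackets select a single column (a pandas Series)
life_mean = df['Life_Expectancy'].mean()

# comparing a column to a number gives one True/False per row
below = df['Life_Expectancy'] < life_mean

# use the True/False values to keep only the matching rows
below_countries = df.loc[below, 'Country'].tolist()
print(below_countries)
```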
Creating DataFrames from files#
The pandas function pd.read_csv allows us to read in an external .csv file and create a DataFrame.
#create a csv file
df.to_csv('health_exp.csv')
#read the csv file back in as a new object
health_df = pd.read_csv('health_exp.csv')
health_df
| | Unnamed: 0 | Year | Country | Spending_USD | Life_Expectancy | date |
|---|---|---|---|---|---|---|
| 0 | 0 | 1970 | Germany | 252.311 | 70.6 | 09-09-2025 |
| 1 | 1 | 1970 | France | 192.143 | 72.2 | 09-09-2025 |
| 2 | 2 | 1970 | Great Britain | 123.993 | 71.9 | 09-09-2025 |
| 3 | 3 | 1970 | Japan | 150.437 | 72.0 | 09-09-2025 |
| 4 | 4 | 1970 | USA | 326.961 | 70.9 | 09-09-2025 |
#the index was included as a column in the csv file
#we can specify this column as the index instead of creating
#a new index column
health_df = pd.read_csv('health_exp.csv', index_col=0)
health_df
| | Year | Country | Spending_USD | Life_Expectancy | date |
|---|---|---|---|---|---|
| 0 | 1970 | Germany | 252.311 | 70.6 | 09-09-2025 |
| 1 | 1970 | France | 192.143 | 72.2 | 09-09-2025 |
| 2 | 1970 | Great Britain | 123.993 | 71.9 | 09-09-2025 |
| 3 | 1970 | Japan | 150.437 | 72.0 | 09-09-2025 |
| 4 | 1970 | USA | 326.961 | 70.9 | 09-09-2025 |
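Another option, if you do not need the index saved at all, is to leave it out when writing the file by passing index=False to to_csv. The sketch below uses a small two-row DataFrame and a temporary folder so it does not overwrite any of your files; the file name health_exp_no_index.csv is made up for this example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Country': ['Germany', 'France'],
                   'Life_Expectancy': [70.6, 72.2]})

with tempfile.TemporaryDirectory() as folder:
    # hypothetical file name inside a throwaway folder
    path = os.path.join(folder, 'health_exp_no_index.csv')
    df.to_csv(path, index=False)    # do not write the index as a column
    round_trip = pd.read_csv(path)  # no 'Unnamed: 0' column appears

print(round_trip.columns.tolist())
```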
CHALLENGE
Find a dataset in .csv format with a quick Google search. Download the file and read it in as a DataFrame. Show your DataFrame to your neighbor and rejoice.