Climb the Ladder!#

Our class moves quickly! Sometimes, it feels like we make leaps in logic that are a bit too big. In this ladder challenge, we’ll learn some core math concepts, some linear algebra, and the numpy library. Problems in this notebook start out easy and progressively get harder, so that the next rung of the Python ladder is always within reach.

Additionally, since not all of the topics discussed in this ladder challenge are explicitly taught in our course, these problems come with many more hints, tips, suggestions, and sometimes even a mini-lesson. You are encouraged to Google frequently as you work through them. This ladder is meant to be both a challenge and a lesson in its own right.

One Rule: NO LOOPS#

numpy takes advantage of vectorized calculations, in which a single operation is applied to every element of a vector at once. For example, given a vector v, if we want a vector w whose elements are the elements of v plus one:

# GOOD! This is vectorized!
w = v + 1

# BAD! This is hard to read and extremely inefficient!
w = np.zeros_like(v)
for i in range(len(v)):
    w[i] = v[i] + 1

None of the exercises in this notebook require a loop. If you use a loop to solve any of these problems, you are solving the problem incorrectly.

Section I - Vectors and some vector math#

First, import the numpy library, aliasing it to np:

import numpy as np
  1. Make the following vector, defined as v: \(\mathbf{v} = (3, -1, 5, 2, -7)\)

  2. Make the following vector, defined as w, using np.arange(): \(\mathbf{w} = (2, 3, 4, 5, 6)\)

  3. Find \(\mathbf{v} + \mathbf{w}\).

  4. What’s the element-wise product of \(\mathbf{v}\) and \(\mathbf{w}\)?

  5. Triple every element of v. (Do not reassign.)

  6. Double every element of w and then subtract 2 from every element. (Do not reassign.)

  7. Index v to show me the number -1.

  8. Index v to show me everything except the 0th element.

  9. Index v to show me the elements at index 4, 2, and 3 (in that order).

  10. Reassign the element of v at index 2 to be 6.

Mini-Lesson: Filtering#

Inequalities are also vectorized in numpy. For example:

v > 2
==> array([ True, False,  True, False, False])

You can also filter a vector using booleans, as follows:

v[[True, False,  True, False, False]]
==> array([3, 6])

Filtering keeps the elements where the boolean is True and drops those where it is False. It follows that you can filter a vector using a condition on the vector itself, like this:

v[v > 2]
==> array([3, 6])
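
One more tip: to combine two boolean conditions on numpy arrays, use `&` (and) or `|` (or) with parentheses around each comparison — Python’s `and`/`or` keywords raise an error on arrays. A small sketch on a throwaway vector `x` (invented here for illustration, not the v from the problems):

```python
import numpy as np

x = np.array([3, -1, 6, 2, -7])

# Parentheses are required: & binds more tightly than > in Python
mask = (x > 0) & (x % 2 == 0)
print(x[mask])  # elements that are both positive and even -> [6 2]
```
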
  11. Show me all of the positive elements of v.

  12. How many elements of v are positive? (Use numpy to answer this question; do not count manually.)
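
Hint: numpy treats True as 1 and False as 0, so `.sum()` on a boolean vector counts the Trues, and `.mean()` gives the proportion of Trues. For example, on a throwaway vector `x` (invented for illustration):

```python
import numpy as np

# x is an example vector, not the v from the problems
x = np.array([5, -2, 0, 8, -1])

n_pos = (x > 0).sum()      # True counts as 1, so this counts the positives
prop_pos = (x > 0).mean()  # ...and the mean is the proportion of positives

print(n_pos)     # -> 2
print(prop_pos)  # -> 0.4
```
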

  13. Show me all of the even elements of w.

  14. Show me all of the positive even elements of v.

  15. Redefine all the negative elements of v to be zero. This is a common operation on real-world data!

Hint: Using np.where is one of two easy ways to do this. The other is to reassign the vector while filtering. Try both ways if you can!
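
A sketch of both approaches on a throwaway vector `x` (invented here so as not to spoil the answer for v):

```python
import numpy as np

x = np.array([4, -3, 0, -8, 2])

# Way 1: np.where(condition, value_if_true, value_if_false)
x_where = np.where(x < 0, 0, x)

# Way 2: copy, then reassign the filtered elements in place
x_filter = x.copy()
x_filter[x_filter < 0] = 0

print(x_where)   # [4 0 0 0 2]
print(x_filter)  # [4 0 0 0 2]
```
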

  16. Create a new vector that’s "EVEN" if the corresponding element of w is even, and "ODD" otherwise.

Section II - Statistics and vector operations#

Emails#

For the next few problems, consider the following vector, which represents the number of emails my inbox received over a span of 20 days.

emails = np.array([25, 2, 45, 6, -2, 4, 4, 10, 6, -3, 16, 39, 19, 0, 1, -11, 25, 2, 7, 17])
  17. A few data points were accidentally recorded as negative when they should be positive. Reassign emails to fix these records.

  18. What is the mean number of emails I get per day?

  19. Among days where I get more than 15 emails, what is the mean number of emails I get?

  20. On what proportion of days do I receive more than 20 emails?

Probabilities#

For the next few problems, consider the following vector of probabilities:

p = np.array([0.69, 0.38, 0.68, 0.23, 0.26, 0.59, 0.94, 0.77, 0.85, 0.89])
  21. The odds of an event occurring are defined as the probability of the event occurring divided by the probability that it doesn’t. That is:

\[\text{odds} = \frac{p}{1 - p}\]

Compute the odds of all of the probabilities in p. Remember, no loops!

  22. Later in the course, we’ll need to compute the log odds of probabilities. That is,

\[\log\frac{p}{1 - p}\]

where \(\log\) is the natural logarithm. Compute the log odds for all the probabilities in p.
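
numpy’s np.log is the natural logarithm, and it is vectorized like everything else. A quick sketch on a made-up vector of probabilities `q` (not the p from the problems):

```python
import numpy as np

q = np.array([0.5, 0.75])

odds = q / (1 - q)       # odds of 0.5 -> 1, odds of 0.75 -> 3
log_odds = np.log(odds)  # natural log, applied elementwise

print(odds)      # [1. 3.]
print(log_odds)  # roughly [0, 1.099]
```
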

  23. Create a variable predictions that is 1 if the value of p is greater than 0.5 and 0 otherwise.

  24. What proportion of our predictions are 1?

  25. Below, the vector y represents the true outcomes of the event occurring. In this case, p was actually our predicted probability of the event occurring. What proportion of the time were our predictions correct? More simply put, what proportion of the time do our predictions and y match up? Later in the course, this will be known as our prediction accuracy.

y = np.random.binomial(1, .3, size = len(p))
y
array([0, 1, 0, 0, 0, 0, 1, 0, 1, 1])
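
Hint: `==` between two arrays is vectorized too, and since True counts as 1, the mean of the matches is the proportion of agreement. On two made-up vectors `a` and `b` (invented for illustration):

```python
import numpy as np

a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 1, 0])

matches = a == b        # elementwise comparison
print(matches)          # [ True False  True False]
print(matches.mean())   # 0.5
```
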
  26. Sometimes, we’ll want the log loss in a problem like this. While we’ll normally let it be calculated for us automatically, it’s good to compute it by hand once. Here’s the formula:

\[\text{LogLoss} = \sum\left[-y_i \log p_i - (1 - y_i) \log (1 - p_i)\right]\]

Mini-Lesson: How to read formulas like this!#

We will encounter intimidating formulas like this countless times throughout our course. Their bark is worse than their bite, I promise. Let’s dissect how to read them:

  • Subscripts - The symbol \(y_i\) represents the \(i\)th element of the vector \(y\). In this case, that is represented by our numpy array y. Similarly, \(p_i\) is the \(i\)th element of \(p\), which we have as p.

  • Calculation - For each \(i\), we will calculate \(-y_i \log p_i - (1 - y_i) \log (1 - p_i)\). So, for our 10-length vectors, we will have a resulting 10-length vector.

  • Summation - The symbol \(\Sigma\) is the Greek letter “sigma” - which is the Greek version of “s”. In this case, “s” stands for “sum”. It implies we’ll be summing together all of our 10 calculations.

Let’s solve problem 26 in two easy steps:

26a) Compute \(-y_i \log p_i - (1 - y_i) \log (1 - p_i)\) for each \(i\). Call this result loglik_i.

26b) Sum the previous result to get our answer!
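
For instance, the two steps look like this on a made-up pair of tiny vectors (`y_demo` and `p_demo` are invented for illustration; use the notebook’s y and p for the actual problem):

```python
import numpy as np

y_demo = np.array([1, 0])
p_demo = np.array([0.9, 0.2])

# 26a) elementwise: -y_i * log(p_i) - (1 - y_i) * log(1 - p_i)
loglik_i = -y_demo * np.log(p_demo) - (1 - y_demo) * np.log(1 - p_demo)

# 26b) sum the elementwise results to get the log loss
log_loss = loglik_i.sum()
print(log_loss)  # roughly 0.33
```
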