import matplotlib.pyplot as plt
import numpy as np
import sympy as sy
import pandas as pd
from scipy.optimize import minimize
Introduction to Linear Regression and Linear Programming#
OBJECTIVES
Derive ordinary least squares models for data
Evaluate regression models using mean squared error
Examine errors and assumptions in least squares models
Use scikit-learn to fit regression models to data
Set up linear programming problems with constraints using scipy.optimize
Calculus Refresher#
An important idea is that of finding a maximum or minimum of a function. From calculus, we have the tools required. Specifically, a maximum or minimum value of a function \(f\) can only occur where \(f'(x) = 0\) or where \(f'(x)\) is undefined. Consider the function:
def f(x): return x**2
x = np.linspace(-2, 2, 100)
plt.plot(x, f(x))
plt.grid()
plt.title(r'$f(x) = x^2$');
Here, using our power rule for derivatives of polynomials we have:

\[f'(x) = 2x\]

and are left to solve:

\[2x = 0\]

or

\[x = 0\]
PROBLEM: Use minimize to determine where the function \(f(x) = (5 - 2x)^2\) has a minimum.
def f(x): return (5 - 2*x)**2
x = np.linspace(-2, 5, 100)
plt.plot(x, f(x))
plt.grid();
# minimize?
x0 = 0
# results = minimize()
# print(results.x)
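A minimal sketch of one possible solution, using the `minimize` import from the top of the notebook:

```python
# numerically minimize f starting from the initial guess x0
results = minimize(f, x0)
print(results.x)  # expect a value near 2.5, where 5 - 2x = 0
```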
Using the chain rule#
For \(f(x) = (5 - 2x)^2\), the chain rule gives \(f'(x) = 2(5 - 2x) \cdot (-2) = -4(5 - 2x)\), which equals zero at \(x = \frac{5}{2}\), matching the numerical result.
Example 1: Line of best fit
import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()
sns.scatterplot(data = tips, x = 'total_bill', y = 'tip');
def y1(x): return .19*x
def y2(x): return .12*x
x = tips['total_bill']
plt.plot(x, y1(x), label = 'y1')
plt.plot(x, y2(x), label = 'y2')
plt.legend()
sns.scatterplot(data = tips, x = 'total_bill', y = 'tip')
plt.title('Which line is a better fit?')
plt.grid();
To decide among all possible lines, we quantify the error of each model and select the line that minimizes this error.
plt.plot(x, y2(x))
for i, yhat in enumerate(y2(x)):
    plt.vlines(x = tips['total_bill'].iloc[i], ymin = yhat, ymax = tips['tip'].iloc[i], color = 'red', linestyle = '--')
sns.scatterplot(data = tips, x = 'total_bill', y = 'tip')
plt.title('Error in Model')
plt.grid();
Mean Squared Error#
OBJECTIVE: Minimize mean squared error
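For the through-the-origin model \(\hat{y} = \beta x\), the mean squared error over \(n\) data points is

\[\text{MSE}(\beta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \beta x_i\right)^2\]

which the function below computes directly.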
def mse(beta):
    return np.mean((y - beta*x)**2)
x = tips['total_bill']
y = tips['tip']
mse(.19)
mse(.12)
PROBLEM
Use the array of slopes below to loop over each possible slope and determine which gives the lowest Mean Squared Error.
slopes = np.linspace(.1, .2, 11)
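A minimal sketch of one possible solution, looping over the slopes defined above:

```python
# evaluate the MSE at each candidate slope and report the best one
errors = [mse(slope) for slope in slopes]
best_slope = slopes[np.argmin(errors)]
print(best_slope, min(errors))
```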
betas = np.linspace(0, .4, 100)
plt.plot(betas, [mse(beta) for beta in betas])
plt.xlabel(r'$\beta$')
plt.ylabel('MSE')
plt.title('Mean Squared Error for Different Slopes')
plt.grid();
Using scipy#
To find the minimum of our objective function, the minimize function can again be used. It requires a function to be minimized and an initial guess.
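For example, a quick sketch with an arbitrary starting point:

```python
# minimize the MSE over the slope, starting from an initial guess of 0.1
res = minimize(mse, x0=0.1)
print(res.x)
```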
Solving the Problem Exactly#
From calculus we know that the minimum value of a quadratic occurs where its first derivative equals zero. Thus, to determine the equation of the line of best fit, we minimize the MSE function with respect to \(\beta\):

\[\frac{d}{d\beta}\,\text{MSE}(\beta) = -\frac{2}{n}\sum_i x_i \left(y_i - \beta x_i\right) = 0 \quad\Rightarrow\quad \beta = \frac{\sum_i x_i y_i}{\sum_i x_i^2}\]
np.sum(tips['total_bill']*tips['tip'])/np.sum(tips['total_bill']**2)
Adding an intercept#
Consider the model:

\[y = \beta_0 + \beta_1 x + \epsilon\]

where \(\epsilon \sim N(0, 1)\).
Now the objective function becomes a function of two inputs, the intercept and the slope, with mean squared error as the output, so its graph is a surface in three dimensions.
def mse(betas):
    return np.mean((y - (betas[0] + betas[1]*x))**2)
mse([4, 3])
slopes = np.linspace(4, 5, 10)
intercepts = np.linspace(2, 3, 10)
for slope in slopes:
    for intercept in intercepts:
        # mse expects [intercept, slope], since betas[0] is the intercept
        print(f'Slope {slope: .2f}, Intercept: {intercept: .2f}, MSE: {mse([intercept, slope])}')
minimize(mse, [5, 3])
betas = minimize(mse, [0, 0]).x
def lobf(x): return betas[0] + betas[1]*x
plt.scatter(x, y)
plt.plot(x, lobf(x), color = 'red')
plt.grid()
plt.title('Line of best fit with slope and intercept');
Using scikit-learn#
The scikit-learn library provides many predictive models and modeling tools, and it is widely used in industry for machine learning tasks. docs
# !pip install -U scikit-learn
credit = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/Credit.csv', index_col=0)
credit.head()
from sklearn.linear_model import LinearRegression
PROBLEM: Which feature is the strongest predictor of Balance in the data?
sns.heatmap(credit.corr(numeric_only = True), annot = True)
#define X as the feature most correlated with Balance
#and y as Balance
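A sketch of one possible answer; the right feature depends on the heatmap above (for this dataset, Limit and Rating are typically the most correlated with Balance, so Limit is assumed here):

```python
# define X as the feature most correlated with Balance, and y as Balance
X = credit[['Limit']]
y = credit['Balance']
```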
PROBLEM: Build a LinearRegression model, determine the Root Mean Squared Error and interpret the slope and intercept.
Instantiate
Fit
Predict/score
#instantiate
#fit
#look at coefficient and intercept
#examine the distribution of errors -- want normal centered at 0
from sklearn.metrics import mean_squared_error
#make predictions
#error between true and predicted
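A minimal sketch of the full instantiate/fit/predict workflow, assuming `X` and `y` from the cell above:

```python
lr = LinearRegression()           # instantiate
lr.fit(X, y)                      # fit
print(lr.coef_, lr.intercept_)    # slope and intercept

preds = lr.predict(X)             # make predictions
rmse = mean_squared_error(y, preds) ** 0.5  # root mean squared error
print(rmse)

# examine the distribution of errors -- want roughly normal, centered at 0
plt.hist(y - preds, bins=30);
```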
Problem: Determine \(\alpha\) and \(\beta\) of an investment#
Consider an investment with the return of the market \(R_M\) (say the S&P) and the return of a stock \(R\) over the same time period. A simple linear model relating the two is \(R = \alpha + \beta R_M + \epsilon\): the slope \(\beta\) measures the stock's sensitivity to market moves, and the intercept \(\alpha\) its excess return.
import yfinance as yf
data = yf.download(["NVDA", "SPY"], period="5y")
data.head()
data.info()
close = data.iloc[:, [0, 1]]  # first two columns: the Close prices for NVDA and SPY
close.pct_change().plot()
returns = close.pct_change()
sns.regplot(x = returns.iloc[:, 1], y = returns.iloc[:, 0], scatter_kws={'alpha': 0.3})
plt.xlabel('SPY Close Pct Change')
plt.ylabel('NVDA Close Pct Change')
plt.grid();
#instantiate
#fit
#examine our alpha and beta
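A sketch of one possible solution, regressing NVDA's daily returns on SPY's:

```python
# drop the first row of NaNs produced by pct_change
rets = returns.dropna()

capm = LinearRegression()                     # instantiate
capm.fit(rets.iloc[:, [1]], rets.iloc[:, 0])  # fit: SPY returns -> NVDA returns
print('alpha:', capm.intercept_)              # intercept = alpha
print('beta:', capm.coef_[0])                 # slope = beta
```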
Problem#
Load the California housing data from the sample_data folder in Colab. Use this data to build a regression model with (see the sketch after this list):
The most correlated feature to housing prices as input and price as target
The five features you believe would be most important
Which model has a lower MSE?
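A sketch under the assumption that Colab's bundled `sample_data/california_housing_train.csv` file is used; the single feature and the five-feature set below are illustrative guesses, not the required answer:

```python
housing = pd.read_csv('sample_data/california_housing_train.csv')
y = housing['median_house_value']

# model 1: the feature most correlated with price (often median_income)
X1 = housing[['median_income']]
# model 2: five features you believe matter most (one plausible choice)
X5 = housing[['median_income', 'housing_median_age', 'total_rooms', 'latitude', 'longitude']]

for X in (X1, X5):
    model = LinearRegression().fit(X, y)
    print(X.shape[1], 'features, MSE:', mean_squared_error(y, model.predict(X)))
```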
Linear Programming: An example#
A dog daycare feeds the dogs a diet that requires a minimum of 15 grams of protein and 30 grams of fiber. Each pound of chick peas contains 1 gram of protein, 5 grams of fiber, and 25 grams of fat. Each pound of cottage cheese contains 2 grams of protein, 1 gram of fiber, and 14 grams of fat. Suppose that chick peas are priced at 8 USD per pound and cottage cheese at 6 USD per pound. The nutritionist claims the dogs will not eat more than 285 grams of fat. Your goal is to obtain the most economical diet given these constraints.
This is summarized in the table below:
| ingredient | protein | fiber | fat | cost |
|---|---|---|---|---|
| chick peas | 1 g | 5 g | 25 g | $8 |
| cottage cheese | 2 g | 1 g | 14 g | $6 |
| minimum | 15 g | 30 g | | |
| maximum | | | 285 g | |
Mathematically, the objective function is:

\[\text{cost} = 6x + 8y\]

where \(x\) = pounds of cottage cheese and \(y\) = pounds of chick peas (matching the order of the cost coefficients below). This is the total cost of the combination of cottage cheese and chick peas.
We will save the coefficients as a list for use later.
c = [6, 8]
Because each pound of cottage cheese contains 2 grams of protein and each pound of chick peas 1 gram, and these need to combine to at least 15 grams, the protein constraint is:

\[2x + 5y \geq 15\]

Correcting for the actual coefficients, the protein constraint is \(2x + y \geq 15\).

As for fiber, cottage cheese contributes 1 gram per pound and chick peas 5 grams, so:

\[x + 5y \geq 30\]

As for the fat content, which cannot exceed 285 grams, you have:

\[14x + 25y \leq 285\]

Further, the amounts \(x\) and \(y\) must not be negative: \(x \geq 0\), \(y \geq 0\).
Graphical representation#
To visualize the solution, you can imagine each of the constraints as a region. First, functions for these are defined and plotted together.
def cp(x):  # protein constraint boundary: 2x + y = 15
    return -2*x + 15
def cc(x):  # fiber constraint boundary: x + 5y = 30
    return -1/5*x + 30/5
def fat(x):  # fat constraint boundary: 14x + 25y = 285
    return -14/25*x + 285/25
d = np.linspace(0, 30, 1000)
x, y = np.meshgrid(d, d)
plt.imshow(((y > cp(x)) & (y < fat(x)) & (y > cc(x))).astype('int'),
           origin = 'lower',
           cmap = 'Greys',
           extent=(x.min(), x.max(), y.min(), y.max()),
           alpha = 0.4)
plt.plot(d, cc(d), '-', color = 'black', label = 'fiber')
plt.plot(d, cp(d), '-', color = 'blue', label = 'protein')
plt.plot(d, fat(d), '-', color = 'red', label = 'fat')
plt.ylim(0, 15)
plt.xlim(0, 20)
plt.title('Feasible region for dog food problem')
plt.grid();
The form of all solutions can be found by rewriting the objective function \(C = 6x + 8y\) as:

\[y = -\frac{6}{8}x + \frac{C}{8}\]

Below, different values of \(C\) are plotted against the original region. Note that the objective function obtains its maximum and minimum values at the vertices of the feasible polygon. Hence, to solve the problem, you need only check the vertices. These are highlighted in the final plot below.
for i in range(10, 1000, 10):
    plt.plot(d, -6/8*d + i/8, '--', color = 'black')
# one off-screen line just to create a legend entry for the cost lines
plt.plot(d, -6/8*d + 990, '--', color = 'black', label = 'cost lines: y = -6/8 x + C/8')
d = np.linspace(0, 30, 1000)
x, y = np.meshgrid(d, d)
plt.imshow(((y > cp(x)) & (y < fat(x)) & (y > cc(x))).astype('int'),
           origin = 'lower',
           cmap = 'Greys',
           extent=(x.min(), x.max(), y.min(), y.max()),
           alpha = 0.4)
plt.plot(d, cc(d), '-', color = 'black', label = 'fiber')
plt.plot(d, cp(d), '-', color = 'blue', label = 'protein')
plt.plot(d, fat(d), '-', color = 'red', label = 'fat')
plt.ylim(0, 15)
plt.xlim(0, 20)
plt.title('Feasible region for dog food problem\nwith possible cost lines')
plt.grid()
plt.legend();
d = np.linspace(0, 30, 1000)
x, y = np.meshgrid(d, d)
plt.imshow(((y > cp(x)) & (y < fat(x)) & (y > cc(x))).astype('int'),
           origin = 'lower',
           cmap = 'Greys',
           extent=(x.min(), x.max(), y.min(), y.max()),
           alpha = 0.4)
plt.plot(d, cc(d), '-', color = 'black', label = 'fiber')
plt.plot(15, 3, 'ro', label = 'vertices (possible solutions)')
plt.plot(d, cp(d), '-', color = 'blue', label = 'protein')
plt.plot(d, fat(d), '-', color = 'red', label = 'fat')
plt.ylim(0, 15)
plt.xlim(0, 20)
plt.title('Feasible region for dog food problem')
plt.grid()
plt.plot(5, 5, 'ro')
plt.plot(5/2, 10, 'ro')
plt.legend();
Using linprog#
To solve the problem in Python, scipy.optimize provides the linprog function. The one wrinkle is that linprog expects less-than-or-equal constraints, so each greater-than constraint must be converted by multiplying both sides by \(-1\); for example, \(2x + y \geq 15\) becomes \(-2x - y \leq -15\).
from scipy.optimize import linprog

# cost coefficients
c = [6, 8]
# coefficients of the less-than-or-equal inequalities
A = [[-2, -1], [-1, -5], [14, 25]]
# right-hand sides of the inequalities
b = [-15, -30, 285]
res = linprog(c, A_ub=A, b_ub=b)
res
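The optimizer should land on the vertex \((5, 5)\): five pounds of cottage cheese and five pounds of chick peas, for a minimum cost of \(6(5) + 8(5) = 70\) USD.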
EXAMPLE
Suppose we have the opportunity to invest in four projects, with cash flows, net present value (NPV), and profitability index (\(\frac{NPV}{\text{investment}}\)) shown below.
| PROJECT | \(C_0\) | \(C_1\) | \(C_2\) | NPV at 10% | Profitability Index |
|---|---|---|---|---|---|
| A | -10 | -30 | 5 | 21 | 2.1 |
| B | -5 | 5 | 2 | 16 | 3.2 |
| C | -5 | 5 | 15 | 12 | 2.4 |
| D | 0 | -40 | 60 | 13 | 0.4 |
CONSTRAINTS
Total outflow in period 0 must not be greater than 10 million.
Total outflow in period 1 must not be greater than 10 million.
Finally, the amount invested in each project must be non-negative, and we cannot take more than one of each, hence the \((0, 1)\) bounds below.
from scipy.optimize import linprog

# linprog minimizes, so negate the NPVs to maximize total NPV
coefs = [-21, -16, -12, -13]
# period-0 and period-1 cash-flow coefficients for each project
constr = [[10, 5, 5, 0],
          [-30, -5, -5, 40]]
right_side = [10, 10]
# each project can be taken fractionally, between 0 and 1 times
bounds = [(0, 1), (0, 1), (0, 1), (0, 1)]
linprog(coefs, A_ub=constr, b_ub=right_side, bounds=bounds)
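Because the NPVs were negated, the maximum total NPV is the negative of the reported objective value. Note also that this is a plain linear program, so linprog is free to take fractional amounts of a project; forcing all-or-nothing choices would require an integer program instead.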