import matplotlib.pyplot as plt
import numpy as np
import sympy as sy
import pandas as pd
from scipy.optimize import minimize
Introduction to Linear Regression and Linear Programming#
OBJECTIVES
Derive ordinary least squares models for data
Evaluate regression models using mean squared error
Examine errors and assumptions in least squares models
Use scikit-learn to fit regression models to data
Set up linear programming problems with constraints using scipy.optimize
Calculus Refresher#
An important idea is that of finding a maximum or minimum of a function. From calculus, we have the tools required. Specifically, a maximum or minimum value of a function \(f\) can only occur where \(f'(x) = 0\) or where \(f'(x)\) is undefined. Consider the function:
def f(x): return x**2
x = np.linspace(-2, 2, 100)
plt.plot(x, f(x))
plt.grid()
plt.title(r'$f(x) = x^2$');
Here, using our power rule for derivatives of polynomials we have:

\[f'(x) = 2x\]

and are left to solve:

\[2x = 0\]

or

\[x = 0\]
PROBLEM: Use minimize to determine where the function \(f(x) = (5 - 2x)^2\) has a minimum.
def f(x): return (5 - 2*x)**2
x = np.linspace(-2, 5, 100)
plt.plot(x, f(x))
plt.grid();
# minimize?
x0 = 0
# results = minimize()
# print(results.x)
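A minimal sketch of one possible solution, using the `minimize` import from the top of the notebook:

```python
# numerically minimize f starting from the initial guess x0
results = minimize(f, x0)
print(results.x)  # expect a value near 2.5, where 5 - 2x = 0
```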
Using the chain rule#
For \(f(x) = (5 - 2x)^2\), the chain rule gives \(f'(x) = 2(5 - 2x) \cdot (-2) = -4(5 - 2x)\), which equals zero at \(x = \frac{5}{2}\), matching the numerical result.
Example 1: Line of best fit
import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()
sns.scatterplot(data = tips, x = 'total_bill', y = 'tip');
def y1(x): return .19*x
def y2(x): return .12*x
x = tips['total_bill']
plt.plot(x, y1(x), label = 'y1')
plt.plot(x, y2(x), label = 'y2')
plt.legend()
sns.scatterplot(data = tips, x = 'total_bill', y = 'tip')
plt.title('Which line is a better fit?')
plt.grid();
To decide among all possible lines, we quantify the error of each model and select the line that minimizes this error.
plt.plot(x, y2(x))
for i, yhat in enumerate(y2(x)):
    plt.vlines(x = tips['total_bill'].iloc[i], ymin = yhat, ymax = tips['tip'].iloc[i], color = 'red', linestyle = '--')
sns.scatterplot(data = tips, x = 'total_bill', y = 'tip')
plt.title('Error in Model')
plt.grid();
Mean Squared Error#
OBJECTIVE: Minimize mean squared error
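For the through-the-origin model \(\hat{y} = \beta x\), the mean squared error over \(n\) data points is

\[\text{MSE}(\beta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \beta x_i\right)^2\]

which the function below computes directly.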
def mse(beta):
    return np.mean((y - beta*x)**2)
x = tips['total_bill']
y = tips['tip']
mse(.19)
mse(.12)
PROBLEM
Use the array of slopes below to loop over each possible slope and determine which gives the lowest Mean Squared Error.
slopes = np.linspace(.1, .2, 11)
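A minimal sketch of one possible solution, looping over the slopes defined above:

```python
# evaluate the MSE at each candidate slope and report the best one
errors = [mse(slope) for slope in slopes]
best_slope = slopes[np.argmin(errors)]
print(best_slope, min(errors))
```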
betas = np.linspace(0, .4, 100)
plt.plot(betas, [mse(beta) for beta in betas])
plt.xlabel(r'$\beta$')
plt.ylabel('MSE')
plt.title('Mean Squared Error for Different Slopes')
plt.grid();
Using scipy#
To find the minimum of our objective function, the minimize function can again be used. It requires a function to be minimized and an initial guess.
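For example, a quick sketch with an arbitrary starting point:

```python
# minimize the MSE over the slope, starting from an initial guess of 0.1
res = minimize(mse, x0=0.1)
print(res.x)
```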
Solving the Problem Exactly#
From calculus we know that the minimum value of a quadratic occurs where its first derivative equals zero. Thus, to determine the equation of the line of best fit, we minimize the MSE function with respect to \(\beta\):

\[\frac{d}{d\beta}\,\text{MSE}(\beta) = -\frac{2}{n}\sum_i x_i \left(y_i - \beta x_i\right) = 0 \quad\Rightarrow\quad \beta = \frac{\sum_i x_i y_i}{\sum_i x_i^2}\]
np.sum(tips['total_bill']*tips['tip'])/np.sum(tips['total_bill']**2)
Adding an intercept#
Consider the model:

\[y = \beta_0 + \beta_1 x + \epsilon\]

where \(\epsilon \sim N(0, 1)\).
Now the objective function becomes a function of two inputs, the intercept and the slope, with mean squared error as the output, so its graph is a surface in three dimensions.
def mse(betas):
    return np.mean((y - (betas[0] + betas[1]*x))**2)
mse([4, 3])
slopes = np.linspace(4, 5, 10)
intercepts = np.linspace(2, 3, 10)
for slope in slopes:
    for intercept in intercepts:
        # mse expects [intercept, slope], since betas[0] is the intercept
        print(f'Slope {slope: .2f}, Intercept: {intercept: .2f}, MSE: {mse([intercept, slope])}')
minimize(mse, [5, 3])
betas = minimize(mse, [0, 0]).x
def lobf(x): return betas[0] + betas[1]*x
plt.scatter(x, y)
plt.plot(x, lobf(x), color = 'red')
plt.grid()
plt.title('Line of best fit with slope and intercept');
Using scikit-learn#
The scikit-learn library provides many predictive models and modeling tools, and it is widely used in industry for machine learning tasks. docs
# !pip install -U scikit-learn
credit = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/Credit.csv', index_col=0)
credit.head()
from sklearn.linear_model import LinearRegression
PROBLEM: Which feature is the strongest predictor of Balance in the data?
sns.heatmap(credit.corr(numeric_only = True), annot = True)
#define X as the feature most correlated with Balance
#and y as Balance
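A sketch of one possible answer; the right feature depends on the heatmap above (for this dataset, Limit and Rating are typically the most correlated with Balance, so Limit is assumed here):

```python
# define X as the feature most correlated with Balance, and y as Balance
X = credit[['Limit']]
y = credit['Balance']
```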
PROBLEM: Build a LinearRegression model, determine the Root Mean Squared Error and interpret the slope and intercept.
Instantiate
Fit
Predict/score
#instantiate
#fit
#look at coefficient and intercept
#examine the distribution of errors -- want normal centered at 0
from sklearn.metrics import mean_squared_error
#make predictions
#error between true and predicted
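A minimal sketch of the full instantiate/fit/predict workflow, assuming `X` and `y` from the cell above:

```python
lr = LinearRegression()           # instantiate
lr.fit(X, y)                      # fit
print(lr.coef_, lr.intercept_)    # slope and intercept

preds = lr.predict(X)             # make predictions
rmse = mean_squared_error(y, preds) ** 0.5  # root mean squared error
print(rmse)

# examine the distribution of errors -- want roughly normal, centered at 0
plt.hist(y - preds, bins=30);
```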
Problem: Determine \(\alpha\) and \(\beta\) of an investment#
Consider an investment with the return of the market \(R_M\) (say the S&P) and the return of a stock \(R\) over the same time period. A simple linear model relating the two is \(R = \alpha + \beta R_M + \epsilon\): the slope \(\beta\) measures the stock's sensitivity to market moves, and the intercept \(\alpha\) its excess return.
import yfinance as yf
data = yf.download(["NVDA", "SPY"], period="5y")
data.head()
data.info()
close = data.iloc[:, [0, 1]]  # first two columns: the Close prices for NVDA and SPY
close.pct_change().plot()
returns = close.pct_change()
sns.regplot(x = returns.iloc[:, 1], y = returns.iloc[:, 0], scatter_kws={'alpha': 0.3})
plt.xlabel('SPY Close Pct Change')
plt.ylabel('NVDA Close Pct Change')
plt.grid();
#instantiate
#fit
#examine our alpha and beta
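A sketch of one possible solution, regressing NVDA's daily returns on SPY's:

```python
# drop the first row of NaNs produced by pct_change
rets = returns.dropna()

capm = LinearRegression()                     # instantiate
capm.fit(rets.iloc[:, [1]], rets.iloc[:, 0])  # fit: SPY returns -> NVDA returns
print('alpha:', capm.intercept_)              # intercept = alpha
print('beta:', capm.coef_[0])                 # slope = beta
```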
Problem#
Load the California housing data from the sample_data folder in Colab. Use this data to build a regression model with (see the sketch after this list):
The most correlated feature to housing prices as input and price as target
The five features you believe would be most important
Which model has a lower MSE?
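A sketch under the assumption that Colab's bundled `sample_data/california_housing_train.csv` file is used; the single feature and the five-feature set below are illustrative guesses, not the required answer:

```python
housing = pd.read_csv('sample_data/california_housing_train.csv')
y = housing['median_house_value']

# model 1: the feature most correlated with price (often median_income)
X1 = housing[['median_income']]
# model 2: five features you believe matter most (one plausible choice)
X5 = housing[['median_income', 'housing_median_age', 'total_rooms', 'latitude', 'longitude']]

for X in (X1, X5):
    model = LinearRegression().fit(X, y)
    print(X.shape[1], 'features, MSE:', mean_squared_error(y, model.predict(X)))
```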
Linear Programming: An example#
A dog daycare feeds the dogs a diet that requires a minimum of 15 grams of protein and 30 grams of fiber. Each pound of chick peas contains 1 gram of protein, 5 grams of fiber, and 25 grams of fat. Each pound of cottage cheese contains 2 grams of protein, 1 gram of fiber, and 14 grams of fat. Suppose that chick peas are priced at 8 USD per pound and cottage cheese at 6 USD per pound. The nutritionist claims the dogs will not eat more than 285 grams of fat. Your goal is to obtain the most economical diet given these constraints.
This is summarized in the table below:
| ingredient | protein | fiber | fat | cost |
|---|---|---|---|---|
| chick peas | 1 g | 5 g | 25 g | $8 |
| cottage cheese | 2 g | 1 g | 14 g | $6 |
| minimum | 15 g | 30 g | | |
| maximum | | | 285 g | |
Mathematically, the objective function is:

\[\text{cost} = 6x + 8y\]

where \(x\) = pounds of cottage cheese and \(y\) = pounds of chick peas (matching the order of the cost coefficients below). This is the total cost of the combination of cottage cheese and chick peas.
We will save the coefficients as a list for use later.
c = [6, 8]
Because each pound of cottage cheese contains 2 grams of protein and each pound of chick peas 1 gram, and these need to combine to at least 15 grams, the protein constraint is:

\[2x + 5y \geq 15\]

Correcting for the actual coefficients, the protein constraint is \(2x + y \geq 15\).

As for fiber, cottage cheese contributes 1 gram per pound and chick peas 5 grams, so:

\[x + 5y \geq 30\]

As for the fat content, which cannot exceed 285 grams, you have:

\[14x + 25y \leq 285\]

Further, the amounts \(x\) and \(y\) must not be negative: \(x \geq 0\), \(y \geq 0\).
Graphical representation#
To visualize the solution, you can imagine each of the constraints as a region. First, functions for these are defined and plotted together.
def cp(x):  # protein constraint boundary: 2x + y = 15
    return -2*x + 15
def cc(x):  # fiber constraint boundary: x + 5y = 30
    return -1/5*x + 30/5
def fat(x):  # fat constraint boundary: 14x + 25y = 285
    return -14/25*x + 285/25
d = np.linspace(0, 30, 1000)
x, y = np.meshgrid(d, d)
plt.imshow(((y > cp(x)) & (y < fat(x)) & (y > cc(x))).astype('int'),
           origin = 'lower',
           cmap = 'Greys',
           extent=(x.min(), x.max(), y.min(), y.max()),
           alpha = 0.4)
plt.plot(d, cc(d), '-', color = 'black', label = 'fiber')
plt.plot(d, cp(d), '-', color = 'blue', label = 'protein')
plt.plot(d, fat(d), '-', color = 'red', label = 'fat')
plt.ylim(0, 15)
plt.xlim(0, 20)
plt.title('Feasible region for dog food problem')
plt.grid();
The form of all solutions can be found by rewriting the objective function \(C = 6x + 8y\) as:

\[y = -\frac{6}{8}x + \frac{C}{8}\]

Below, different values of \(C\) are plotted against the original region. Note that the objective function obtains its maximum and minimum values at the vertices of the feasible polygon. Hence, to solve the problem, you need only check the vertices. These are highlighted in the final plot below.
for i in range(10, 1000, 10):
    plt.plot(d, -6/8*d + i/8, '--', color = 'black')
# one off-screen line just to create a legend entry for the cost lines
plt.plot(d, -6/8*d + 990, '--', color = 'black', label = 'cost lines: y = -6/8 x + C/8')
d = np.linspace(0, 30, 1000)
x, y = np.meshgrid(d, d)
plt.imshow(((y > cp(x)) & (y < fat(x)) & (y > cc(x))).astype('int'),
           origin = 'lower',
           cmap = 'Greys',
           extent=(x.min(), x.max(), y.min(), y.max()),
           alpha = 0.4)
plt.plot(d, cc(d), '-', color = 'black', label = 'fiber')
plt.plot(d, cp(d), '-', color = 'blue', label = 'protein')
plt.plot(d, fat(d), '-', color = 'red', label = 'fat')
plt.ylim(0, 15)
plt.xlim(0, 20)
plt.title('Feasible region for dog food problem\nwith possible cost lines')
plt.grid()
plt.legend();
d = np.linspace(0, 30, 1000)
x, y = np.meshgrid(d, d)
plt.imshow(((y > cp(x)) & (y < fat(x)) & (y > cc(x))).astype('int'),
           origin = 'lower',
           cmap = 'Greys',
           extent=(x.min(), x.max(), y.min(), y.max()),
           alpha = 0.4)
plt.plot(d, cc(d), '-', color = 'black', label = 'fiber')
plt.plot(15, 3, 'ro', label = 'vertices (possible solutions)')
plt.plot(d, cp(d), '-', color = 'blue', label = 'protein')
plt.plot(d, fat(d), '-', color = 'red', label = 'fat')
plt.ylim(0, 15)
plt.xlim(0, 20)
plt.title('Feasible region for dog food problem')
plt.grid()
plt.plot(5, 5, 'ro')
plt.plot(5/2, 10, 'ro')
plt.legend();
Using linprog#
To solve the problem in Python, scipy.optimize provides the linprog function. The one wrinkle is that linprog expects less-than-or-equal constraints, so each greater-than constraint must be converted by multiplying both sides by \(-1\); for example, \(2x + y \geq 15\) becomes \(-2x - y \leq -15\).
from scipy.optimize import linprog

# cost coefficients
c = [6, 8]
# coefficients of the less-than-or-equal inequalities
A = [[-2, -1], [-1, -5], [14, 25]]
# right-hand sides of the inequalities
b = [-15, -30, 285]
res = linprog(c, A_ub=A, b_ub=b)
res
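The optimizer should land on the vertex \((5, 5)\): five pounds of cottage cheese and five pounds of chick peas, for a minimum cost of \(6(5) + 8(5) = 70\) USD.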
EXAMPLE
Suppose we have the opportunity to invest in four projects, with cash flows, net present value (NPV), and profitability index (\(\frac{NPV}{\text{investment}}\)) shown below.
| PROJECT | \(C_0\) | \(C_1\) | \(C_2\) | NPV at 10% | Profitability Index |
|---|---|---|---|---|---|
| A | -10 | -30 | 5 | 21 | 2.1 |
| B | -5 | 5 | 2 | 16 | 3.2 |
| C | -5 | 5 | 15 | 12 | 2.4 |
| D | 0 | -40 | 60 | 13 | 0.4 |
CONSTRAINTS
Total outflow in period 0 must not be greater than 10 million.
Total outflow in period 1 must not be greater than 10 million.
Finally, the amount invested in each project must be non-negative, and we cannot take more than one of each, hence the \((0, 1)\) bounds below.
from scipy.optimize import linprog

# linprog minimizes, so negate the NPVs to maximize total NPV
coefs = [-21, -16, -12, -13]
# period-0 and period-1 cash-flow coefficients for each project
constr = [[10, 5, 5, 0],
          [-30, -5, -5, 40]]
right_side = [10, 10]
# each project can be taken fractionally, between 0 and 1 times
bounds = [(0, 1), (0, 1), (0, 1), (0, 1)]
linprog(coefs, A_ub=constr, b_ub=right_side, bounds=bounds)
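Because the NPVs were negated, the maximum total NPV is the negative of the reported objective value. Note also that this is a plain linear program, so linprog is free to take fractional amounts of a project; forcing all-or-nothing choices would require an integer program instead.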