Introduction to Linear Regression

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 3
      1 import matplotlib.pyplot as plt
      2 import numpy as np
----> 3 import sympy as sy
      4 import pandas as pd

ModuleNotFoundError: No module named 'sympy'

Introduction to Linear Regression#

OBJECTIVES

Derive ordinary least squares models for data
Evaluate regression models using mean squared error
Examine errors and assumptions in least squares models
Use scikit-learn to fit regression models to data

Calculus Refresher#

An important idea is that of finding a maximum or minimum of a function. From calculus, we have the tools required. Specifically, a maximum or minimum value of a function \(f\) occurs wherever \(f'(x) = 0\) or is underfined. Consider the function:

\[f(x) = x^2\]

def f(x): return x**2
x = np.linspace(-2, 2, 100)
plt.plot(x, f(x))
plt.grid()
plt.title(r'$f(x) = x^2$');

_images/01c259d8d7e9d8fdc39f8919d868434490eceb43cda8bd9955906f3606fc1d8c.png

Here, using our power rule for derivatives of polynomials we have:

\[f'(x) = 2x\]

and are left to solve:

\[0 = 2x\]

or

\[x = 0\]

PROBLEM: Determine where the function \(f(x) = (5 - 2x)^2\) has a minimum.

def f(x): return (5 - 2*x)**2
x = np.linspace(-2, 5, 100)
plt.plot(x, f(x))
plt.grid();

_images/c375b0bc1b5afa2a8aeaee4abc841219d1882906bdc578bf5d4d7f242d0ff590.png

Using the chain rule#

Example 1: Line of best fit

import seaborn as sns

tips = sns.load_dataset('tips')

tips.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

sns.scatterplot(data = tips, x = 'total_bill', y = 'tip');

_images/06d433b731f88dadb0e9bf4dcc2116459f639144805ffb50db00257034d9df7a.png

def y1(x): return .19*x

def y2(x): return .12*x

x = tips['total_bill']

plt.plot(x, y1(x), label = 'y1')
plt.plot(x, y2(x), label = 'y2')
plt.legend()
sns.scatterplot(data = tips, x = 'total_bill', y = 'tip')
plt.title('How to determine the line of best fit?')
plt.grid();

_images/2963730ced030c7737144801760e77b84b50e227ffe9754975d592e501eace21.png

To decide between all possible lines we will examine the error in all these models and select the one that minimizes this error.

plt.plot(x, y2(x))
for i, yhat in enumerate(y2(x)):
    plt.vlines(x = tips['total_bill'].iloc[i], ymin = yhat, ymax = tips['tip'].iloc[i], color = 'red', linestyle = '--')
sns.scatterplot(data = tips, x = 'total_bill', y = 'tip')
plt.title('Error in Model')
plt.grid();

_images/56be4c6060eb34ff69b065706aafb65f7b9530cb9613ad399dd0f90036927774.png

Mean Squared Error#

\[\text{MSE}(\beta_0) = \frac{1}{n}\sum_{i = 1}^n (y_i - \beta_0x)^2\]

OBJECTIVE: Minimize mean squared error

def mse(beta):
    return np.mean((y - beta*x)**2)

x = tips['total_bill']
y = tips['tip']

mse(.17)

1.502706427336066

for pct in np.linspace(.1, .2, 11):
    print(f'The MSE for slope {pct: .3f} is {mse(pct): .3f}')

The MSE for slope  0.100 is  2.078
The MSE for slope  0.110 is  1.713
The MSE for slope  0.120 is  1.443
The MSE for slope  0.130 is  1.267
The MSE for slope  0.140 is  1.185
The MSE for slope  0.150 is  1.197
The MSE for slope  0.160 is  1.303
The MSE for slope  0.170 is  1.503
The MSE for slope  0.180 is  1.797
The MSE for slope  0.190 is  2.185
The MSE for slope  0.200 is  2.667

betas = np.linspace(0, .4, 100)
plt.plot(betas, [mse(beta) for beta in betas])
plt.xlabel(r'$\beta$')
plt.ylabel('MSE')
plt.title('Mean Squared Error for Different Slopes')
plt.grid();

_images/c1c986f29ecd3e81a339dea39c74b96a23a1b9468f8dc0943bbb98000554dc01.png

Using `scipy`#

To find the minimum of our objective function, the minimize function from scipy.optimize is useful. This relies on a variety of different optimization algorithms to find the minimum of a function.

from scipy.optimize import minimize

minimize(mse, .1 )

  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 1.1781161154513358
        x: [ 1.437e-01]
      nit: 1
      jac: [ 1.103e-06]
 hess_inv: [[1]]
     nfev: 6
     njev: 3

Solving the Problem Exactly#

From calculus we know that the minimum value for a quadratic will occur where the first derivative equals zero. Thus, to determine the equations for the line of best fit, we minimize the MSE function with respect to \(\beta\).

\[f(\beta) = \frac{1}{n}\sum_{i = 1}^n (y - \beta x)^2\]

\[f'(\beta) = \frac{-2}{n}\sum_{i = 1}^n(y - \beta x) x\]

\[ 0 = \frac{-2}{n}\sum_{i = 1}^n(y - \beta x) x\]

\[0 = \sum_{i = 1}^n(y - \beta x) x\]

\[0 = \sum yx - \sum \beta x^2 \]

\[\sum \beta x^2 = \sum y x\]

\[\beta \sum x^2 = \sum y x\]

\[\beta = \frac{\sum y x}{\sum x^2}\]

np.sum(tips['total_bill']*tips['tip'])/np.sum(tips['total_bill']**2)

0.14373189527721666

Adding an intercept#

Consider the model:

\[y = \beta_0 + \beta_1 x + \epsilon\]

where \(\epsilon = N(0, 1)\).

np.random.seed(42)
x = np.linspace(0, 3, 40)
y = 3*x + 4 + np.random.normal(size = len(x))
plt.scatter(x, y)

<matplotlib.collections.PathCollection at 0x7f8b45d14250>

_images/00754c7da4f4dac73ae9344ac0ac5b5566df442875cf522a7404078dccef0124.png

Now, the objective function changes to be a function in 3-Dimensions where the slope and intercept terms are input and mean squared error is the output.

\[\text{MSE}(\beta_0, \beta_1) = \frac{1}{n}\sum_{i = 1}^n (y - (\beta_0 + \beta_1 x)^2)\]

def mse(betas):
    return np.mean((y - (betas[0] + betas[1]*x))**2)

mse([4, 3])

0.9329508980248764

minimize(mse, [0, 0])

  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 0.829630503231984
        x: [ 4.179e+00  2.735e+00]
      nit: 7
      jac: [ 9.686e-08 -2.235e-08]
 hess_inv: [[ 1.921e+00 -9.483e-01]
            [-9.483e-01  6.327e-01]]
     nfev: 24
     njev: 8

betas = minimize(mse, [0, 0]).x

def lobf(x): return betas[0] + betas[1]*x

plt.scatter(x, y)
plt.plot(x, lobf(x), color = 'red')
plt.grid()
plt.title('Line of Best fit with slope and intercept');

_images/01464aeaa195e97d9e40d6a59a1bb27561c8b611a046d46d921c83567c1df596.png

Exercise#

Use the minimize function together with your mse function to complete the class below. After calling the fit method assign the slope of the line of best fit to the .coef_ attribute and the \(y\)-intercept to the .intercept_ attribute.

Test your model on the tips data below.

class LinearReg:
    '''
    This class fits an ordinary lease squares model
    of the form beta_0 + beta_1 * x.
    '''
    def __init__(self):
        self.coef_ = None
        self.intercept_ = None
        
    def mse(self, betas):
        return np.mean((y - (betas[0] + betas[1]*x))**2)
        
    def fit(self, x, y):
        #betas that minimize
        betas = minimize(self.mse, [0, 0]).x
        #set coef_
        self.coef_ = betas[1]
        #set intercept_
        self.intercept_ = betas[0]
        
    
    def predict(self, x):
        return self.intercept_ + self.coef_*x
    
    

x = tips['total_bill']
y = tips['tip']

#instantiate the model
model = LinearReg()

#fit the model on data
model.fit(x, y)

#make predictions on all data
model.predict(x)

    2.704636
    2.006223
    3.126835
    3.407250
    3.502822
         ...   
  3.969131
  3.774836
  3.301175
  2.791807
  2.892630
Name: total_bill, Length: 244, dtype: float64

model.coef_

0.10502447914641161

model.intercept_

0.9202703450693733

A second example#

import statsmodels.api as sm
duncan_prestige = sm.datasets.get_rdataset("Duncan", "carData")
print(duncan_prestige.__doc__)

.. container::

   .. container::

      ====== ===============
      Duncan R Documentation
      ====== ===============

      .. rubric:: Duncan's Occupational Prestige Data
         :name: duncans-occupational-prestige-data

      .. rubric:: Description
         :name: description

      The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
      prestige and other characteristics of 45 U. S. occupations in
      1950.

      .. rubric:: Usage
         :name: usage

      .. code:: R

         Duncan

      .. rubric:: Format
         :name: format

      This data frame contains the following columns:

      type
         Type of occupation. A factor with the following levels:
         ``prof``, professional and managerial; ``wc``, white-collar;
         ``bc``, blue-collar.

      income
         Percentage of occupational incumbents in the 1950 US Census who
         earned $3,500 or more per year (about $36,000 in 2017 US
         dollars).

      education
         Percentage of occupational incumbents in 1950 who were high
         school graduates (which, were we cynical, we would say is
         roughly equivalent to a PhD in 2017)

      prestige
         Percentage of respondents in a social survey who rated the
         occupation as “good” or better in prestige

      .. rubric:: Source
         :name: source

      Duncan, O. D. (1961) A socioeconomic index for all occupations. In
      Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free
      Press [Table VI-1].

      .. rubric:: References
         :name: references

      Fox, J. (2016) *Applied Regression Analysis and Generalized Linear
      Models*, Third Edition. Sage.

      Fox, J. and Weisberg, S. (2019) *An R Companion to Applied
      Regression*, Third Edition, Sage.

sns.lmplot(duncan_prestige.data, x = 'education', y = 'prestige')
plt.title('Education vs. Prestige');

_images/6b853a5ef8c3923f791c5a135551cb25f0ee0429735a7d3fa8e86bd4e043b0a1.png

Fit a model with an intercept to the data.
What is the slope of the line and what does this mean in terms of education and income?
What is the intercept of the model and what does this mean in terms of education and income?

x = duncan_prestige.data[['education']]
y = duncan_prestige.data['prestige']

x.shape

(45, 1)

# !pip install -U scikit-learn

from sklearn.linear_model import LinearRegression

model = LinearRegression() #instantiate -- step 1

model.fit(x, y)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

predictions = model.predict(x)
predictions

array([77.85563665, 68.83567885, 83.26761134, 81.46361978, 77.85563665,
       76.05164509, 84.16960712, 90.48357758, 78.75763243, 77.85563665,
       67.03168729, 88.67958602, 87.77759024, 76.05164509, 82.36561556,
       30.95185608, 40.87380966, 50.79576324, 39.97181388, 74.24765353,
       65.22769573, 49.89376746, 64.32569995, 45.38378856, 21.02990249,
       35.46183498, 25.53988139, 29.14786452, 20.12790671, 22.83389405,
       26.44187717,  6.59797001, 23.73588983, 17.42191937, 13.81393625,
       18.32391515, 23.73588983, 25.53988139, 15.61792781, 20.12790671,
       27.34387295, 22.83389405, 18.32391515, 42.67780122, 29.14786452])

model.coef_, model.intercept_

(array([0.90199578]), 0.2839995438216505)

model.coef_[0]

0.9019957803501166

Examining Errors in the Model#

Once we have a model, it is important to examine the properties of the residuals. Specifically, we aim to examine the residuals for patterns in error and the overall distribution of these errors.

fig, ax = plt.subplots(1, 2, figsize = (20, 10))
sns.residplot(x = x, y = y, ax = ax[0], lowess=True)
ax[0].set_title('Scatterplot of Residuals')
sns.kdeplot((y - (model.intercept_ + model.coef_[0]*x.values[:, 0])), ax = ax[1], fill = True);
ax[1].grid()
ax[1].set_title('Distribution of residuals');

_images/cfa853287ab06a419ee95d380e0b0209aa412b55ce2d6f4a85fcf25e89f28132.png

Using `scikit-learn`#

The scikit-learn library has many predictive models and modeling tools. It is a popular library in industry for Machine Learning tasks. docs

# !pip install -U scikit-learn

credit = pd.read_csv('data/Credit.csv', index_col=0)
credit.head()

	Income	Limit	Rating	Cards	Age	Education	Gender	Student	Married	Ethnicity	Balance
1	14.891	3606	283	2	34	11	Male	No	Yes	Caucasian	333
2	106.025	6645	483	3	82	15	Female	Yes	Yes	Asian	903
3	104.593	7075	514	4	71	11	Male	No	No	Asian	580
4	148.924	9504	681	3	36	11	Female	No	No	Asian	964
5	55.882	4897	357	2	68	16	Male	No	Yes	Caucasian	331

from sklearn.linear_model import LinearRegression

PROBLEM: Which feature is the strongest predictor of Balance in the data?

sns.heatmap(credit.corr(numeric_only = True), annot = True)

<AxesSubplot: >

_images/a371f7a143809d2fc2cada2fa0bbcec404fd96799ac9a15995ef4d018eccd5bb.png

X = credit[['Rating']]
y = credit['Balance']

PROBLEM: Build a LinearRegression model, determine the Root Mean Squared Error and interpret the slope and intercept.

credit_model = LinearRegression()

credit_model.fit(X, y)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

print(f'The model is y = {credit_model.coef_[0]}x + {credit_model.intercept_}')

The model is y = 2.5662403273433196x + -390.84634178723786

PROBLEM: Examine the residual plot and histogram of residuals. Do they look as they should?

sns.residplot(data = credit, x = X, y = y)

<AxesSubplot: xlabel='Rating', ylabel='Balance'>

_images/b4ea9be08447fa5725b71866828957434c022eab83d853638464be06f0406e3a.png

plt.hist(y - credit_model.predict(X))

(array([  3.,  11.,  31.,  91., 116.,  86.,  41.,   9.,   6.,   6.]),
 array([-712.2825064 , -558.1506416 , -404.01877679, -249.88691199,
         -95.75504718,   58.37681762,  212.50868243,  366.64054724,
         520.77241204,  674.90427685,  829.03614165]),
 <BarContainer object of 10 artists>)

_images/53342d16c773d66317995dc71c4f0ed8bb78acb01514bf10b24f0bb26ff528d6.png

Other Models#

If your goal is more around statistical inference and you want information about things like hypothesis tests on coefficients, the statsmodels library is a more classical statistics interface that also contains a variety of regression models. Below, the OLS model is instantiated, fit, and the results summarized with the .summary() method.

import statsmodels.api as sm

#instantiate the model

#fit it

#summary

#create the intercept term

#fit again

#summary

Typically, we will use the scikitlearn models on our data. Next class will focus on regression models with more features as well as how we can build higher degree polynomial models for our data.