Model Complexity and Evaluation#
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Polynomial Models#
Thus far, our regression models have taken the form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$$

where all of the inputs to the model were linear.

In this notebook, we consider models of the form:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + \varepsilon$$

These are commonly referred to as Polynomial Regression Models; however, we are still performing Linear Regression, because the model is linear in the unknown quantities \(\beta\).
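To see why, note that if we define new features \(z_k = x^k\), the polynomial model above is just a multiple linear regression in those new features:

$$y = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \dots + \beta_n z_n + \varepsilon$$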
#load in the cars data
cars = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/main/data/mtcars.csv')
cars.head(2)
#x = mpg and y = hp
X = cars[['mpg']]
y = cars['hp']
#scatter plot of x vs. y
plt.scatter(X, y)
Reminder on Least Squares#
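As a quick refresher, least squares chooses the coefficients \(\hat{\beta}\) that minimize the sum of squared residuals:

$$\hat{\beta} = \underset{\beta}{\operatorname{arg\,min}} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip}\right)^2$$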
Assumptions of the model
The relationship between the features and the target is linear in nature
The features are independent of one another
The errors are normally distributed
The residuals have constant variance across all feature values
#fit model
lr = LinearRegression()
lr.fit(X, y)
preds = lr.predict(X)
#plot residuals
resids = (y - preds)
fig, ax = plt.subplots(1, 3, figsize = (20, 3))
ax[0].plot(resids, 'o')
ax[0].axhline(color = 'black')
ax[1].hist(resids)
ax[2].scatter(X['mpg'], y)
ax[2].plot(X['mpg'], lr.predict(X))
#Any assumptions violated? Why?
Reminder: Quadratics#
In plain language, we add a new feature to represent the quadratic term and fit a linear regression to these columns, which is essentially what we have already done with multiple regression.
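Concretely, for this dataset the quadratic model is:

$$\text{hp} = \beta_0 + \beta_1 \cdot \text{mpg} + \beta_2 \cdot \text{mpg}^2 + \varepsilon$$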
#examine X
X.head()
#add new feature
X['mpg^2'] = X['mpg']**2
# check X again
X.head(2)
#fit model and look at coefficients
model1 = LinearRegression().fit(X, y)
model1.coef_
# intercept
model1.intercept_
plt.scatter(X['mpg'], y)
plt.scatter(X['mpg'], model1.predict(X))
# mean squared error?
mean_squared_error(y, model1.predict(X))
QUESTION: Which is better – the first or second degree model?
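One way to ground the answer is to compare the in-sample mean squared error of the two fits. The sketch below assumes the cells above have been run in order, so lr is the degree-1 model fit on mpg alone and model1 is the quadratic model.

#in-sample MSE of the degree-1 model (fit on mpg only)
linear_mse = mean_squared_error(y, lr.predict(X[['mpg']]))
#in-sample MSE of the degree-2 model (fit on mpg and mpg^2)
quadratic_mse = mean_squared_error(y, model1.predict(X))
print(f'Linear MSE:    {linear_mse:.2f}')
print(f'Quadratic MSE: {quadratic_mse:.2f}')

Keep in mind that a lower training error alone does not settle the question; that is exactly what the train/test split later in this notebook addresses.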
Problem#
Add a cubic term to the data.
Fit a new model to the cubic data.
Determine the mean_squared_error of the linear, quadratic, and cubic models. How do they compare?
Would a quartic polynomial (4th degree) be better or worse in terms of mean_squared_error?
A starter sketch for the first two steps is shown below.
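A minimal starter, assuming X still holds the mpg and mpg^2 columns from above (the name model2 is just for illustration); the quartic comparison follows the same pattern:

#add a cubic term and fit a third model
X['mpg^3'] = X['mpg']**3
model2 = LinearRegression().fit(X, y)
mean_squared_error(y, model2.predict(X))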
Experimenting with complexity#
Below, a synthetic dataset is created, and the slider controls the degree of the polynomial regression model fit to it. Which degree seems best? After you settle on an ideal model, toggle the new data and reconsider your choice.
from ipywidgets import interact
import ipywidgets as widgets
x = np.linspace(0, 12, 30)
y = 3*x + 15 + 4*np.sin(x) + np.random.normal(scale = 3.0, size = len(x))
# Don't Peek!
# x = np.random.choice(x, 50, replace = False)
def model_maker(n, newdata = False):
    #fit a degree-n polynomial to the known data and evaluate it
    coefs = np.polyfit(x, y, n)
    preds = np.polyval(coefs, x)
    x_, y_, p_ = zip(*sorted(zip(x, y, preds)))
    plt.scatter(x_, y_, label = 'Known Data')
    plt.xlim(0, 6)
    if newdata:
        np.random.seed(42)
        x2 = np.random.choice(np.linspace(0, 12, 1000), 35)
        y2 = 3*x2 + 15 + 4*np.sin(x2) + np.random.normal(scale = 3.0, size = len(x2))
        plt.scatter(x2, y2, label = 'New Data')
    plt.plot(x_, p_, color = 'red')
    plt.title(f'Degree {n}')
    plt.legend();
interact(model_maker, n = widgets.IntSlider(value = 1, min = 1, max = len(y), step = 1));
#CHOOSE OPTIMAL COMPLEXITY
PolynomialFeatures#
Scikit-learn has a transformer that will do the work of adding polynomial terms to our dataset. For more information, see the documentation here.
from sklearn.preprocessing import PolynomialFeatures
#create a little dataset (3, 2)
toy_x = np.random.normal(size = (3, 2))
toy_x
#instantiate and transform
poly_feats = PolynomialFeatures(include_bias = False)
np.set_printoptions(precision=2, suppress=True)
poly_feats.fit_transform(toy_x)
#look at the feature names
poly_feats.get_feature_names_out()
#create a dataframe from results
pd.DataFrame(poly_feats.fit_transform(toy_x), columns = poly_feats.get_feature_names_out())
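Two other parameters worth knowing about are degree and interaction_only; here is a quick sketch on the same toy data:

#degree 3 adds the cubic terms: x0, x1, x0^2, x0 x1, x1^2, x0^3, x0^2 x1, x0 x1^2, x1^3
poly3 = PolynomialFeatures(degree = 3, include_bias = False)
poly3.fit_transform(toy_x).shape

#interaction_only keeps only products of distinct features (here x0, x1, and x0 x1)
poly_int = PolynomialFeatures(degree = 2, interaction_only = True, include_bias = False)
poly_int.fit_transform(toy_x)
poly_int.get_feature_names_out()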
Now, let’s use PolynomialFeatures to solve the earlier problem predicting hp using mpg.
#instantiate polynomial features
pfeats = PolynomialFeatures(include_bias=False)
#transform X
XT = pfeats.fit_transform(X[['mpg']])
XT.shape
#examine feature names
pfeats.get_feature_names_out()
#instantiate and fit the model
lr5 = LinearRegression().fit(XT, cars['hp'])
#predict and score with mean squared error
mean_squared_error(cars['hp'], lr5.predict(XT))
train_test_split#
To this point, we have evaluated our models using the data they were built with. If our goal is to use these models for future predictions, it would be better to understand the performance on data the model has not seen in the past. To mimic this notion of unseen data, we create a small holdout set of data to use in evaluation.
Train Data: Data to build our model with.
Test Data: Data to evaluate the model with (usually a smaller dataset than train)
Scikit-learn has a train_test_split function that will create these datasets for us. Below, we load it from the model_selection module and explore its functionality; see the User Guide for more detail.
# import function
from sklearn.model_selection import train_test_split
# create a train and test split
y = cars['hp']
X_train, X_test, y_train, y_test = train_test_split(XT, y, random_state=42)
# explore train data
X_train.shape
# explore test data
X_test.shape
# build model with train
lr = LinearRegression()
lr.fit(X_train, y_train)
# evaluate train mse
train_preds = lr.predict(X_train)
print(mean_squared_error(y_train, train_preds))
# evaluate test mse
test_preds = lr.predict(X_test)
print(mean_squared_error(y_test, test_preds))
Using the test data to determine model complexity#
Now, you can use the test set to measure the performance of models with varied complexity – choosing the “best” based on the scores on the test data.
# create polynomial features for train and test
for i in range(1, 10):
    poly_feats = PolynomialFeatures(degree = i, include_bias = False)
    #X_train and X_test came from XT, so column 0 is the raw mpg feature
    X_train_poly = poly_feats.fit_transform(X_train[:, [0]])
    X_test_poly = poly_feats.transform(X_test[:, [0]])
    lr = LinearRegression()
    lr.fit(X_train_poly, y_train)
    train_preds = lr.predict(X_train_poly)
    test_preds = lr.predict(X_test_poly)
    print(f'Degree {i}')
    print(f'Train MSE: {mean_squared_error(y_train, train_preds)}')
    print(f'Test MSE: {mean_squared_error(y_test, test_preds)}')
    print('--------------------------------')
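To make the choice explicit, here is a small sketch that stores the scores from the loop above and reports the degree with the lowest test MSE:

#collect the errors for each degree and pick the degree with the lowest test MSE
train_mses, test_mses = [], []
degrees = range(1, 10)
for i in degrees:
    poly_feats = PolynomialFeatures(degree = i, include_bias = False)
    X_train_poly = poly_feats.fit_transform(X_train[:, [0]])
    X_test_poly = poly_feats.transform(X_test[:, [0]])
    lr = LinearRegression().fit(X_train_poly, y_train)
    train_mses.append(mean_squared_error(y_train, lr.predict(X_train_poly)))
    test_mses.append(mean_squared_error(y_test, lr.predict(X_test_poly)))

best_degree = degrees[int(np.argmin(test_mses))]
print(f'Best degree by test MSE: {best_degree}')

#plot train and test error against degree
plt.plot(degrees, train_mses, label = 'Train MSE')
plt.plot(degrees, test_mses, label = 'Test MSE')
plt.xlabel('Polynomial degree')
plt.legend();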
Part II#
Another Example#
Returning to the credit dataset from earlier, we walk through a basic model building exercise. Along the way we will explore the OneHotEncoder and make_column_transformer to help with preparing the data for modeling. Our workflow is as follows:
Convert categorical columns to dummy-encoded features
Add polynomial features
Build a LinearRegression model on the train data
Evaluate on the test data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
# load data
url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/main/data/Credit.csv'
credit = pd.read_csv(url, index_col = 0)
# examine the data
credit.head(2)
# instantiate OneHotEncoder (sparse_output replaces the older sparse argument)
ohe = OneHotEncoder(drop = 'first', sparse_output = False)
# train/test split
X_train, X_test, y_train, y_test = train_test_split(credit[['Ethnicity', 'Limit', 'Student']], credit['Balance'])
# fit and transform train data
XT = ohe.fit_transform(X_train)
# print(XT)
# instead we specify the categorical columns with make_column_transformer
selector = make_column_transformer((OneHotEncoder(drop = 'first'), ['Ethnicity', 'Student']),
remainder = 'passthrough')
# transform train and test
XTR = selector.fit_transform(X_train)
XTS = selector.transform(X_test)
XTR
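To keep track of which column is which in the transformed array, the fitted transformer can report its output feature names (a quick check; get_feature_names_out is available on ColumnTransformer in recent scikit-learn versions):

# names of the encoded columns plus the passed-through Limit column
selector.get_feature_names_out()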
# add polynomial features
pfeats = PolynomialFeatures()
XTRP = pfeats.fit_transform(XTR)
XTRS = pfeats.transform(XTS)
# fit regression model
lr6 = LinearRegression().fit(XTRP, y_train)
# score on train
mean_squared_error(y_train, lr6.predict(XTRP))
# score on test
mean_squared_error(y_test, lr6.predict(XTRS))
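As a final sanity check, this sketch compares against a model fit on the same encoded features without the polynomial expansion, to see whether the added complexity actually helps on the test data:

# baseline: encoded features only, no polynomial terms
lr_base = LinearRegression().fit(XTR, y_train)
print('Baseline test MSE:  ', mean_squared_error(y_test, lr_base.predict(XTS)))
print('Polynomial test MSE:', mean_squared_error(y_test, lr6.predict(XTRS)))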