A Regression Model for Wages
This homework assignment works through creating a regression model to predict an individual's wage given some basic demographic information. The dataset is from the OpenML data repository and was culled from Census data; see the dataset's page on OpenML for more information.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.metrics import mean_squared_error, root_mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/wage_hw.csv')
df.head()
|   | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | WAGE | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | no | female | 21 | not_member | 5.10 | 35 | Hispanic | Other | Manufacturing | Married |
| 1 | 9 | no | female | 42 | not_member | 4.95 | 57 | White | Other | Manufacturing | Married |
| 2 | 12 | no | male | 1 | not_member | 6.67 | 19 | White | Other | Manufacturing | Unmarried |
| 3 | 12 | no | male | 4 | not_member | 4.00 | 22 | White | Other | Other | Unmarried |
| 4 | 12 | no | male | 17 | not_member | 7.50 | 35 | White | Other | Other | Married |
PROBLEM 1: Splitting the Data
Use the train_test_split function to create train and test datasets from all the features and the target column WAGE. Your test set should comprise 20% of the total data.
# 80/20 split of features and target; random_state=42 is an arbitrary seed chosen for reproducibility
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='WAGE'), df['WAGE'], test_size=0.2, random_state=42)
PROBLEM 2: Checking Assumptions
One assumption of the Linear Regression model is that the target is roughly normally distributed (strictly speaking the assumption concerns the residuals, but a heavily skewed target is usually a warning sign). Is this assumption met? If so, move on; if not, consider transforming the target using np.log and compare the distribution of the logarithm of wages. If the logarithm looks more "normal", use it as your target.
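One quick way to check, assuming the split from Problem 1 and the numpy import at the top, is to plot the raw and log-transformed target side by side:

# compare the distribution of wages to the distribution of log-wages
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(y_train, ax=axes[0])
axes[0].set_title('WAGE')
sns.histplot(np.log(y_train), ax=axes[1])
axes[1].set_title('log(WAGE)')
plt.tight_layout()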
PROBLEM 3: Preparing the Data
For the categorical features, use the OneHotEncoder to encode the different categorical variables, and eliminate any redundant information using the drop = 'if_binary' argument.
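As a sketch of one approach (the categorical columns are inferred from their dtypes here, which is an assumption rather than part of the assignment):

# encode only the object-typed columns; drop='if_binary' keeps a single column for yes/no features
cat_cols = X_train.select_dtypes(include='object').columns
ohe = OneHotEncoder(drop='if_binary', sparse_output=False)
wage_dummies = pd.DataFrame(ohe.fit_transform(X_train[cat_cols]),
                            columns=ohe.get_feature_names_out(),
                            index=X_train.index)
wage_dummies.head()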
PROBLEM 4: Using make_column_transformer
Rather than selecting just the categorical features, transforming them, and merging the dummied data back with the numeric features, make_column_transformer will accomplish all of this for us. Look over the user guide here and use it to transform the categorical features with OneHotEncoder while leaving the remaining features as they are. Be sure to fit the transformer on the training data only, transform both datasets, and assign the results as X_train_encoded and X_test_encoded below.
X_train_encoded = ''
X_test_encoded = ''
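One possible way to fill in the placeholders above, using make_column_selector to pick out the object columns (an assumption about the intended approach, not the only acceptable answer):

from sklearn.compose import make_column_transformer, make_column_selector

# one-hot encode the categorical columns, pass the numeric columns through untouched
transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary', sparse_output=False), make_column_selector(dtype_include='object')),
    remainder='passthrough')

# fit on the training data only, then apply the fitted transformer to the test data
X_train_encoded = transformer.fit_transform(X_train)
X_test_encoded = transformer.transform(X_test)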
PROBLEM 5: Building the Model
Now that your data is prepared, build a regression model with the appropriate input and target values.
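A minimal fit, assuming the encoded data from Problem 4 and whichever target you settled on in Problem 2:

# fit an ordinary least squares model on the encoded training data
lr = LinearRegression()
lr.fit(X_train_encoded, y_train)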
PROBLEM 6: Scoring the Model
Now, evaluate the Mean Squared Error of your model on both the train and test data. Compare these with the Mean Squared Error of a baseline prediction, for example always predicting the mean of the training target. Did your model perform better than the baseline?
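A sketch of the comparison; treating "always predict the training mean" as the baseline is an assumption about what baseline means here:

# MSE on train and test, plus a constant mean-prediction baseline
train_mse = mean_squared_error(y_train, lr.predict(X_train_encoded))
test_mse = mean_squared_error(y_test, lr.predict(X_test_encoded))
baseline_mse = mean_squared_error(y_test, np.full(len(y_test), y_train.mean()))
print(f'train: {train_mse:.3f}, test: {test_mse:.3f}, baseline: {baseline_mse:.3f}')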
PROBLEM 7: Interpreting Coefficients
Examine the coefficients of your model. Using complete sentences, explain which features seem to lead to increases in wages and which seem to lead to decreases.
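To pair each coefficient with its feature name (get_feature_names_out here assumes the fitted transformer from Problem 4):

# a Series of coefficients indexed by feature name, sorted for easier reading
coefs = pd.Series(lr.coef_, index=transformer.get_feature_names_out())
coefs.sort_values()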
PROBLEM 8: Polynomial Features
After building a basic model using all the features, compare it to a model using quadratic polynomial features. Use PolynomialFeatures to create the features and score the train and test data as before. Did this model perform better than the baseline or the linear model?
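One way to set this up, chaining the expansion and the model in a pipeline (the pipeline itself is a convenience choice, not something the prompt requires):

from sklearn.pipeline import make_pipeline

# expand the encoded features to degree 2, then fit a linear model on the result
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           LinearRegression())
poly_model.fit(X_train_encoded, y_train)
print(mean_squared_error(y_train, poly_model.predict(X_train_encoded)))
print(mean_squared_error(y_test, poly_model.predict(X_test_encoded)))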
PROBLEM 9: Dimensionality Reduction
One of the downsides to polynomial features is how many new features they introduce into the model. To limit this, you can use Principal Component Analysis to reduce the dimensionality of the data once the polynomial terms have been generated. Explore the PCA module here and use it to reduce the polynomial features to 15 principal components. Is the resulting model better?
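A sketch, extending the pipeline from Problem 8; standardizing before PCA is a judgment call on my part rather than something the prompt asks for:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# polynomial expansion -> scaling -> 15 principal components -> linear model
pca_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                          StandardScaler(),
                          PCA(n_components=15),
                          LinearRegression())
pca_model.fit(X_train_encoded, y_train)
print(mean_squared_error(y_test, pca_model.predict(X_test_encoded)))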
PROBLEM 10: Interpreting Coefficients
Build a simple regression model and encode all the categorical features. Fit the model on the training data. Draw a horizontal bar plot of the coefficients, and interpret the feature importance based on these coefficients.
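Reusing the coefs Series from Problem 7 (assuming the same fitted model is the one being interpreted), the plot itself can be as simple as:

# horizontal bar plot of the raw coefficients, sorted by value
coefs.sort_values().plot(kind='barh', figsize=(8, 10))
plt.xlabel('coefficient')
plt.tight_layout()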
PROBLEM 11: Revisiting Interpretation
Read through the article on Common Pitfalls in the interpretation of coefficients of linear models. What does the author suggest is a better way to use the coefficients of a linear model to determine the "importance" of a feature? Draw a horizontal bar plot of the updated coefficients and interpret the results.
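If your takeaway from the article is that coefficients must first be put on a comparable scale, one sketch is to multiply each coefficient by its feature's standard deviation on the training set (the column names again assume the Problem 4 transformer):

# scale each coefficient by the standard deviation of its feature so the bars are comparable
feature_std = pd.DataFrame(X_train_encoded,
                           columns=transformer.get_feature_names_out()).std()
scaled_coefs = coefs * feature_std
scaled_coefs.sort_values().plot(kind='barh', figsize=(8, 10))
plt.xlabel('coefficient x feature standard deviation')
plt.tight_layout()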