A Regression Model for Wages
This homework assignment works through creating a regression model to predict an individual's wage given some basic demographic information. The dataset is from the OpenML data repository and was culled from Census data; see the dataset's page on OpenML for more information.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.metrics import mean_squared_error, root_mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa25/refs/heads/main/data/wage_hw.csv')
df.head()
|   | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | WAGE | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | no | female | 21 | not_member | 5.10 | 35 | Hispanic | Other | Manufacturing | Married |
| 1 | 9 | no | female | 42 | not_member | 4.95 | 57 | White | Other | Manufacturing | Married |
| 2 | 12 | no | male | 1 | not_member | 6.67 | 19 | White | Other | Manufacturing | Unmarried |
| 3 | 12 | no | male | 4 | not_member | 4.00 | 22 | White | Other | Other | Unmarried |
| 4 | 12 | no | male | 17 | not_member | 7.50 | 35 | White | Other | Other | Married |
PROBLEM 1: Splitting the Data
Use the train_test_split function to create train and test datasets from all the features and the target column WAGE. Your test set should comprise 20% of the total data.
# 80/20 split of features and target; random_state=42 is an arbitrary seed chosen for reproducibility
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='WAGE'), df['WAGE'], test_size=0.2, random_state=42)
PROBLEM 2: Checking Assumptions
One assumption of the Linear Regression model is that the target is roughly normally distributed (strictly speaking the assumption concerns the residuals, but a heavily skewed target is usually a warning sign). Is this assumption met? If so, move on; if not, consider transforming the target using np.log and compare the distribution of the logarithm of wages. If the logarithm looks more "normal", use it as your target.
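One quick way to check, assuming the split from Problem 1 and the numpy import at the top, is to plot the raw and log-transformed target side by side:

# compare the distribution of wages to the distribution of log-wages
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(y_train, ax=axes[0])
axes[0].set_title('WAGE')
sns.histplot(np.log(y_train), ax=axes[1])
axes[1].set_title('log(WAGE)')
plt.tight_layout()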
PROBLEM 3: Preparing the Data
For the categorical features, use the OneHotEncoder to encode the different categorical variables, and eliminate any redundant information using the drop = 'if_binary' argument.
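As a sketch of one approach (the categorical columns are inferred from their dtypes here, which is an assumption rather than part of the assignment):

# encode only the object-typed columns; drop='if_binary' keeps a single column for yes/no features
cat_cols = X_train.select_dtypes(include='object').columns
ohe = OneHotEncoder(drop='if_binary', sparse_output=False)
wage_dummies = pd.DataFrame(ohe.fit_transform(X_train[cat_cols]),
                            columns=ohe.get_feature_names_out(),
                            index=X_train.index)
wage_dummies.head()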
PROBLEM 4: Using make_column_transformer
Rather than selecting just the categorical features, transforming them, and merging the dummied data back with the numeric features, make_column_transformer will accomplish all of this for us. Look over the user guide here and use it to transform the categorical features with OneHotEncoder while leaving the remaining features as they are. Be sure to fit the transformer on the training data only, transform both datasets, and assign the results as X_train_encoded and X_test_encoded below.
X_train_encoded = ''
X_test_encoded = ''
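One possible way to fill in the placeholders above, using make_column_selector to pick out the object columns (an assumption about the intended approach, not the only acceptable answer):

from sklearn.compose import make_column_transformer, make_column_selector

# one-hot encode the categorical columns, pass the numeric columns through untouched
transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary', sparse_output=False), make_column_selector(dtype_include='object')),
    remainder='passthrough')

# fit on the training data only, then apply the fitted transformer to the test data
X_train_encoded = transformer.fit_transform(X_train)
X_test_encoded = transformer.transform(X_test)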
PROBLEM 5: Building the Model
Now that your data is prepared, build a regression model with the appropriate input and target values.
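A minimal fit, assuming the encoded data from Problem 4 and whichever target you settled on in Problem 2:

# fit an ordinary least squares model on the encoded training data
lr = LinearRegression()
lr.fit(X_train_encoded, y_train)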
PROBLEM 6: Scoring the Model
Now, evaluate the Mean Squared Error of your model on both the train and test data. Compare these with the Mean Squared Error of a baseline prediction, for example always predicting the mean of the training target. Did your model perform better than the baseline?
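A sketch of the comparison; treating "always predict the training mean" as the baseline is an assumption about what baseline means here:

# MSE on train and test, plus a constant mean-prediction baseline
train_mse = mean_squared_error(y_train, lr.predict(X_train_encoded))
test_mse = mean_squared_error(y_test, lr.predict(X_test_encoded))
baseline_mse = mean_squared_error(y_test, np.full(len(y_test), y_train.mean()))
print(f'train: {train_mse:.3f}, test: {test_mse:.3f}, baseline: {baseline_mse:.3f}')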
PROBLEM 7: Interpreting Coefficients
Examine the coefficients of your model. Using complete sentences, explain which features seem to lead to increases in wages and which seem to lead to decreases.
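To pair each coefficient with its feature name (get_feature_names_out here assumes the fitted transformer from Problem 4):

# a Series of coefficients indexed by feature name, sorted for easier reading
coefs = pd.Series(lr.coef_, index=transformer.get_feature_names_out())
coefs.sort_values()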
PROBLEM 8: Polynomial Features
After building a basic model using all the features, compare it to a model using quadratic polynomial features. Use PolynomialFeatures to create the features and score the train and test data as before. Did this model perform better than the baseline or the linear model?
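One way to set this up, chaining the expansion and the model in a pipeline (the pipeline itself is a convenience choice, not something the prompt requires):

from sklearn.pipeline import make_pipeline

# expand the encoded features to degree 2, then fit a linear model on the result
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           LinearRegression())
poly_model.fit(X_train_encoded, y_train)
print(mean_squared_error(y_train, poly_model.predict(X_train_encoded)))
print(mean_squared_error(y_test, poly_model.predict(X_test_encoded)))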
PROBLEM 9: Dimensionality Reduction
One of the downsides to polynomial features is how many new features they introduce into the model. To limit this, you can use Principal Component Analysis to reduce the dimensionality of the data once the polynomial terms have been generated. Explore the PCA module here and use it to reduce the polynomial features to 15 principal components. Is the resulting model better?
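A sketch, extending the pipeline from Problem 8; standardizing before PCA is a judgment call on my part rather than something the prompt asks for:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# polynomial expansion -> scaling -> 15 principal components -> linear model
pca_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                          StandardScaler(),
                          PCA(n_components=15),
                          LinearRegression())
pca_model.fit(X_train_encoded, y_train)
print(mean_squared_error(y_test, pca_model.predict(X_test_encoded)))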
PROBLEM 10: Interpreting Coefficients
Build a simple regression model and encode all the categorical features. Fit the model on the training data. Draw a horizontal bar plot of the coefficients, and interpret the feature importance based on these coefficients.
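Reusing the coefs Series from Problem 7 (assuming the same fitted model is the one being interpreted), the plot itself can be as simple as:

# horizontal bar plot of the raw coefficients, sorted by value
coefs.sort_values().plot(kind='barh', figsize=(8, 10))
plt.xlabel('coefficient')
plt.tight_layout()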
PROBLEM 11: Revisiting Interpretation
Read through the article on Common Pitfalls in the interpretation of coefficients of linear models. What does the author suggest is a better way to use the coefficients of a linear model to determine the "importance" of a feature? Draw a horizontal bar plot of the updated coefficients and interpret the results.
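If your takeaway from the article is that coefficients must first be put on a comparable scale, one sketch is to multiply each coefficient by its feature's standard deviation on the training set (the column names again assume the Problem 4 transformer):

# scale each coefficient by the standard deviation of its feature so the bars are comparable
feature_std = pd.DataFrame(X_train_encoded,
                           columns=transformer.get_feature_names_out()).std()
scaled_coefs = coefs * feature_std
scaled_coefs.sort_values().plot(kind='barh', figsize=(8, 10))
plt.xlabel('coefficient x feature standard deviation')
plt.tight_layout()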