Problem 1: Difference in Groups

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier, XGBRegressor
import scipy.stats as stats
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

Below, data from an experimental curricular intervention are given for the treatment and control groups. Explain how you can use stats.ttest_ind to determine whether the intervention made a difference. Test the hypothesis that the groups' average scores on the Directed Reading Protocol assessment (the drp column) are different.

drp = pd.read_csv('data/DRP.csv', index_col=0)

drp.head()

	group	g	drp
id
1	Treat	0	24
2	Treat	0	56
3	Treat	0	43
4	Treat	0	59
5	Treat	0	58
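As a starting point, here is a minimal sketch of a two-sample t-test with stats.ttest_ind. It runs on synthetic stand-in data rather than data/DRP.csv, so the numbers are made up; the "Control" label is also an assumption, since only "Treat" rows appear in the preview above. The null hypothesis is that the two group means are equal, and the alternative is two-sided.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic stand-in for data/DRP.csv (values are invented for illustration).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "group": ["Treat"] * 21 + ["Control"] * 23,  # "Control" label is assumed
    "drp": np.concatenate([rng.normal(51, 11, 21), rng.normal(42, 17, 23)]),
})

# Split the scores by group.
treat = demo.loc[demo["group"] == "Treat", "drp"]
control = demo.loc[demo["group"] == "Control", "drp"]

# equal_var=False requests Welch's t-test, which does not assume that the
# two groups share a common variance (the default, equal_var=True, does).
result = stats.ttest_ind(treat, control, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A small p-value (for example, below 0.05) would be evidence against equal means. Whether to use the pooled or the Welch version is a judgment call that depends on how similar the group variances look.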

Problem 2: Effect Size

While the hypothesis test determines whether or not the groups' results are different, it doesn’t say how big the difference is. For this, we turn to effect size. Here is an article discussing why \(p\)-values might not be enough when determining the difference between groups [link]. Can you use any of the ideas discussed to determine the effect size or power of the intervention? (Feel free to use any library, including statsmodels.)

import statsmodels.stats.power as smp
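One possible approach, sketched here on synthetic data, is to compute Cohen's d (the mean difference divided by the pooled standard deviation) and pass it to TTestIndPower from statsmodels. The helper cohens_d and the sample values are hypothetical; the effect size of 0.5 in the last step is an arbitrary illustrative choice, not a value from the assignment.

```python
import numpy as np
import statsmodels.stats.power as smp

# Synthetic stand-ins for the two groups (values invented for illustration).
rng = np.random.default_rng(1)
treat = rng.normal(51, 11, 21)
control = rng.normal(42, 17, 23)

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * np.var(a, ddof=1)
                  + (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

d = cohens_d(treat, control)

analysis = smp.TTestIndPower()

# Leaving power unspecified asks solve_power to solve for it.
# ratio is the size of group 2 relative to group 1.
power = analysis.solve_power(
    effect_size=d,
    nobs1=len(treat),
    alpha=0.05,
    ratio=len(control) / len(treat),
    alternative="two-sided",
)

# Sample size per group needed to detect a hypothetical d = 0.5 with 80% power.
n_needed = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05, ratio=1.0)

print(f"Cohen's d = {d:.2f}, power = {power:.2f}, n per group = {n_needed:.1f}")
```

Power computed from the observed effect size should be read with caution, since it is a direct function of the same data that produced the p-value. The sample-size calculation at the end is usually the more informative use of these tools.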

Problem 3: Regression and Interpreting Coefficients

Below, our wage dataset from an earlier assignment is loaded and displayed. Consider using XGBRegressor to build a model predicting wages. Tune the model so that its performance is as consistent as you can make it, and use sklearn.inspection to explore the most important features driving wages.

from sklearn.datasets import fetch_openml

wages = fetch_openml(data_id=534, as_frame=True).frame

wages.head()

	EDUCATION	SOUTH	SEX	EXPERIENCE	UNION	WAGE	AGE	RACE	OCCUPATION	SECTOR	MARR
0	8	no	female	21	not_member	5.10	35	Hispanic	Other	Manufacturing	Married
1	9	no	female	42	not_member	4.95	57	White	Other	Manufacturing	Married
2	12	no	male	1	not_member	6.67	19	White	Other	Manufacturing	Unmarried
3	12	no	male	4	not_member	4.00	22	White	Other	Other	Unmarried
4	12	no	male	17	not_member	7.50	35	White	Other	Other	Married

wages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534 entries, 0 to 533
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   EDUCATION   534 non-null    int64   
 1   SOUTH       534 non-null    category
 2   SEX         534 non-null    category
 3   EXPERIENCE  534 non-null    int64   
 4   UNION       534 non-null    category
 5   WAGE        534 non-null    float64 
 6   AGE         534 non-null    int64   
 7   RACE        534 non-null    category
 8   OCCUPATION  534 non-null    category
 9   SECTOR      534 non-null    category
 10  MARR        534 non-null    category
dtypes: category(7), float64(1), int64(3)
memory usage: 21.4 KB

Problem 4: Classical Regression Inference

Repeat the above problem, but this time build your model using a statsmodels regression model (docs). After fitting, explore the summary, the hypothesis tests for each coefficient, and the confidence intervals. Do you find results similar to those from the inspection module? Compare and contrast these approaches to understanding your model's performance.

import statsmodels.api as sm

sm.OLS?

Init signature: sm.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)
Docstring:     
Ordinary Least Squares

Parameters
----------
endog : array_like
    A 1-d endogenous response variable. The dependent variable.
exog : array_like
    A nobs x k array where `nobs` is the number of observations and `k`
    is the number of regressors. An intercept is not included by default
    and should be added by the user. See
    :func:`statsmodels.tools.add_constant`.
missing : str
    Available options are 'none', 'drop', and 'raise'. If 'none', no nan
    checking is done. If 'drop', any observations with nans are dropped.
    If 'raise', an error is raised. Default is 'none'.
hasconst : None or bool
    Indicates whether the RHS includes a user-supplied constant. If True,
    a constant is not checked for and k_constant is set to 1 and all
    result statistics are calculated as if a constant is present. If
    False, a constant is not checked for and k_constant is set to 0.
**kwargs
    Extra arguments that are used to set model properties when using the
    formula interface.

Attributes
----------
weights : scalar
    Has an attribute weights = array(1.0) due to inheritance from WLS.

See Also
--------
WLS : Fit a linear model using Weighted Least Squares.
GLS : Fit a linear model using Generalized Least Squares.

Notes
-----
No constant is added by the model unless you are using formulas.

Examples
--------
>>> import statsmodels.api as sm
>>> import numpy as np
>>> duncan_prestige = sm.datasets.get_rdataset("Duncan", "carData")
>>> Y = duncan_prestige.data['income']
>>> X = duncan_prestige.data['education']
>>> X = sm.add_constant(X)
>>> model = sm.OLS(Y,X)
>>> results = model.fit()
>>> results.params
const        10.603498
education     0.594859
dtype: float64

>>> results.tvalues
const        2.039813
education    6.892802
dtype: float64

>>> print(results.t_test([1, 0]))
                             Test for Constraints
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0            10.6035      5.198      2.040      0.048       0.120      21.087
==============================================================================

>>> print(results.f_test(np.identity(2)))
<F test: F=array([[159.63031026]]), p=1.2607168903696672e-20,
 df_denom=43, df_num=2>
File:           /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/statsmodels/regression/linear_model.py
Type:           type
Subclasses:     
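As the docstring notes, sm.OLS does not add an intercept unless you use formulas. The sketch below uses the formula interface (statsmodels.formula.api, an alternative entry point to the same OLS model) on synthetic data that borrows three column names from the wages frame; the values are invented, and the choice of regressors is only an illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the wages data (values invented for illustration).
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    "EDUCATION": rng.integers(8, 19, n),
    "EXPERIENCE": rng.integers(0, 40, n),
    "UNION": rng.choice(["member", "not_member"], n),
})
demo["WAGE"] = (0.8 * demo["EDUCATION"] + 0.1 * demo["EXPERIENCE"]
                + 1.5 * (demo["UNION"] == "member") + rng.normal(0, 1, n))

# The formula interface adds the intercept and dummy-codes C(UNION) for us.
model = smf.ols("WAGE ~ EDUCATION + EXPERIENCE + C(UNION)", data=demo).fit()

print(model.summary())                # coefficients, t-tests, R-squared, etc.
conf_int = model.conf_int(alpha=0.05) # 95% confidence interval per coefficient
pvalues = model.pvalues               # p-value for H0: coefficient equals zero
print(conf_int)
```

Each row of the summary tests whether one coefficient is zero while holding the other regressors fixed, which is a different question from how much shuffling a feature hurts predictive accuracy.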

Problem 5: Causal ML

Watch Hajime Takeda’s talk from SciPy 2024 on Causal ML. What is the big idea, and can you explain a use case for Causal ML?

from IPython.display import YouTubeVideo

YouTubeVideo(id = 'xcv4FH-KnvA')

Problem 6: `EconML`

Microsoft has put together a very nice library with popular Causal ML algorithms and analysis tools. Head over to the EconML website and read through the Trip Advisor Case Study. Your goal is to use these ideas to determine a targeting strategy for the hillstrom data loaded below. (Explanation of problem here) This is a fairly open-ended task, and you have flexibility in deciding exactly how you want to approach it. I will give you time on Tuesday to brainstorm ideas with peers in class, and together you should produce a brief summary of your strategy and its efficacy.

from sklift.datasets import fetch_hillstrom

dataset = fetch_hillstrom(target_col='conversion')
data, target, treatment = dataset.data, dataset.target, dataset.treatment
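To make the idea of a targeting strategy concrete, here is a minimal sketch of a two-model ("T-learner") uplift estimate built with scikit-learn. This is not EconML and not the method from the case study; it is a simpler stand-in technique, run on synthetic data with a binary treatment, and every name and value in it is hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: three features, a random binary treatment, and a conversion
# outcome whose treatment effect exists only when the first feature is positive.
rng = np.random.default_rng(7)
n = 4000
X = rng.normal(size=(n, 3))
t = rng.integers(0, 2, n)
uplift_true = np.where(X[:, 0] > 0, 0.15, 0.0)
y = rng.binomial(1, 0.10 + t * uplift_true)

# T-learner: fit one outcome model on treated rows and one on control rows.
model_t = GradientBoostingClassifier(random_state=0).fit(X[t == 1], y[t == 1])
model_c = GradientBoostingClassifier(random_state=0).fit(X[t == 0], y[t == 0])

# Estimated uplift is the difference in predicted conversion probability.
uplift_hat = model_t.predict_proba(X)[:, 1] - model_c.predict_proba(X)[:, 1]

# Targeting rule: treat only the 30% of customers with the highest estimate.
cutoff = np.quantile(uplift_hat, 0.7)
targeted = uplift_hat >= cutoff

print(f"mean true uplift, everyone: {uplift_true.mean():.3f}")
print(f"mean true uplift, targeted: {uplift_true[targeted].mean():.3f}")
```

The true uplift is known here only because the data are simulated. With the hillstrom data, efficacy would have to be judged on a held-out split, for example by comparing conversion rates between treated and untreated customers inside the targeted segment.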