import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier, XGBRegressor
import scipy.stats as stats
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
Problem 1: Difference in Groups
Below, data from an experimental curricular intervention is given for the treatment and control groups. Explain how you can use stats.ttest_ind to determine whether the intervention made a difference. Test the hypothesis that the groups' average scores on the Directed Reading Protocol assessment (drp column) are different.
drp = pd.read_csv('data/DRP.csv', index_col=0)
drp.head()
| id | group | g | drp |
|---|---|---|---|
| 1 | Treat | 0 | 24 |
| 2 | Treat | 0 | 56 |
| 3 | Treat | 0 | 43 |
| 4 | Treat | 0 | 59 |
| 5 | Treat | 0 | 58 |
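As a minimal sketch of how stats.ttest_ind could be applied, the block below runs a two-sided test on two groups of scores. The scores are synthetic placeholders, not the DRP data, and the names treat_scores and control_scores are illustrative. With the real frame you would select the drp column for each value of group; only the Treat label appears in the preview above, so the control label should be checked first.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in scores (NOT the real DRP data), for illustration only.
rng = np.random.default_rng(0)
treat_scores = rng.normal(loc=52, scale=11, size=21)
control_scores = rng.normal(loc=42, scale=17, size=23)

# H0: the two group means are equal. H1: they differ (two-sided test).
# equal_var=False requests Welch's t-test, which does not assume equal variances.
result = stats.ttest_ind(treat_scores, control_scores, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0 at alpha={alpha}: {reject_null}")
```

A small p-value says the observed difference in means would be unlikely if the groups shared the same mean. It does not say how large the difference is.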
Problem 2: Effect Size
While the hypothesis test determines whether or not the groups' results are different, it doesn't say how large that difference is. For this, we turn to effect size. Here is an article discussing why \(p\)-values might not be enough when determining the difference between groups. [link ] Can you use any of the ideas discussed to determine the effect size or power of the intervention? (Feel free to use any library, including statsmodels.)
import statsmodels.stats.power as smp
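One possible approach, sketched under assumptions: compute Cohen's d (the difference in means divided by the pooled standard deviation), then ask statsmodels for the power of a two-sample t-test at that effect size. The helper cohens_d is hypothetical (it is not a library function), and the scores are synthetic placeholders rather than the DRP data.

```python
import numpy as np
import statsmodels.stats.power as smp

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic stand-in scores (NOT the real DRP data), for illustration only.
rng = np.random.default_rng(0)
treat_scores = rng.normal(loc=52, scale=11, size=21)
control_scores = rng.normal(loc=42, scale=17, size=23)

d = cohens_d(treat_scores, control_scores)

# Power of a two-sided test at alpha = 0.05 for this effect size and these group sizes.
# ratio is the size of the second group relative to the first.
power = smp.TTestIndPower().solve_power(
    effect_size=abs(d),
    nobs1=len(treat_scores),
    alpha=0.05,
    ratio=len(control_scores) / len(treat_scores),
    alternative="two-sided",
)
print(f"Cohen's d = {d:.3f}, power = {power:.3f}")
```

By common convention, d near 0.2 is read as small, 0.5 as medium, and 0.8 as large, though those cutoffs are rules of thumb rather than thresholds.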
Problem 3: Regression and Interpreting Coefficients
Below, our wage dataset from an earlier assignment is loaded and displayed. Consider using the XGBRegressor to build a model predicting wages. Tune the model so its performance is as consistent as you can get, and use sklearn.inspection to explore the most important features driving wages.
from sklearn.datasets import fetch_openml
wages = fetch_openml(data_id=534, as_frame=True).frame
wages.head()
| | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | WAGE | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | no | female | 21 | not_member | 5.10 | 35 | Hispanic | Other | Manufacturing | Married |
| 1 | 9 | no | female | 42 | not_member | 4.95 | 57 | White | Other | Manufacturing | Married |
| 2 | 12 | no | male | 1 | not_member | 6.67 | 19 | White | Other | Manufacturing | Unmarried |
| 3 | 12 | no | male | 4 | not_member | 4.00 | 22 | White | Other | Other | Unmarried |
| 4 | 12 | no | male | 17 | not_member | 7.50 | 35 | White | Other | Other | Married |
wages.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534 entries, 0 to 533
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 EDUCATION 534 non-null int64
1 SOUTH 534 non-null category
2 SEX 534 non-null category
3 EXPERIENCE 534 non-null int64
4 UNION 534 non-null category
5 WAGE 534 non-null float64
6 AGE 534 non-null int64
7 RACE 534 non-null category
8 OCCUPATION 534 non-null category
9 SECTOR 534 non-null category
10 MARR 534 non-null category
dtypes: category(7), float64(1), int64(3)
memory usage: 21.4 KB
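The tune-then-inspect workflow could look roughly like the sketch below. Two assumptions are worth flagging: the data is synthetic (not the wages dataset), and scikit-learn's GradientBoostingRegressor is used as a stand-in so the block runs without xgboost. XGBRegressor follows the same scikit-learn estimator interface and accepts the same three grid parameters, so it should slot into the same place. For the real data, the category columns would need encoding first.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (NOT the wages dataset): the target depends strongly
# on x_strong, weakly on x_weak, and not at all on x_noise.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "x_strong": rng.normal(size=400),
    "x_weak": rng.normal(size=400),
    "x_noise": rng.normal(size=400),
})
y = 3.0 * X["x_strong"] + 0.5 * X["x_weak"] + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Small illustrative grid; 5-fold cross-validation picks the best mean R^2.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the drop in score.
perm = permutation_importance(
    search.best_estimator_, X_test, y_test, n_repeats=10, random_state=42
)
importances = pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False)
print(search.best_params_)
print(importances)
```

Computing the importances on the test split rather than the training split keeps them tied to out-of-sample performance. The spread of search.cv_results_["std_test_score"] across folds is one way to judge how consistent the tuned model is.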
Problem 4: Classical Regression Inference
Repeat the above problem, but this time build your model using a statsmodels regression model (docs). After fitting, explore the summary, the hypothesis tests for each coefficient, and the confidence intervals. Do you find results similar to those from the inspection module? Compare and contrast these two approaches to understanding your model's behavior.
import statsmodels.api as sm
sm.OLS?
Init signature: sm.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)
Docstring:
Ordinary Least Squares
Parameters
----------
endog : array_like
A 1-d endogenous response variable. The dependent variable.
exog : array_like
A nobs x k array where `nobs` is the number of observations and `k`
is the number of regressors. An intercept is not included by default
and should be added by the user. See
:func:`statsmodels.tools.add_constant`.
missing : str
Available options are 'none', 'drop', and 'raise'. If 'none', no nan
checking is done. If 'drop', any observations with nans are dropped.
If 'raise', an error is raised. Default is 'none'.
hasconst : None or bool
Indicates whether the RHS includes a user-supplied constant. If True,
a constant is not checked for and k_constant is set to 1 and all
result statistics are calculated as if a constant is present. If
False, a constant is not checked for and k_constant is set to 0.
**kwargs
Extra arguments that are used to set model properties when using the
formula interface.
Attributes
----------
weights : scalar
Has an attribute weights = array(1.0) due to inheritance from WLS.
See Also
--------
WLS : Fit a linear model using Weighted Least Squares.
GLS : Fit a linear model using Generalized Least Squares.
Notes
-----
No constant is added by the model unless you are using formulas.
Examples
--------
>>> import statsmodels.api as sm
>>> import numpy as np
>>> duncan_prestige = sm.datasets.get_rdataset("Duncan", "carData")
>>> Y = duncan_prestige.data['income']
>>> X = duncan_prestige.data['education']
>>> X = sm.add_constant(X)
>>> model = sm.OLS(Y,X)
>>> results = model.fit()
>>> results.params
const 10.603498
education 0.594859
dtype: float64
>>> results.tvalues
const 2.039813
education 6.892802
dtype: float64
>>> print(results.t_test([1, 0]))
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 10.6035 5.198 2.040 0.048 0.120 21.087
==============================================================================
>>> print(results.f_test(np.identity(2)))
<F test: F=array([[159.63031026]]), p=1.2607168903696672e-20,
df_denom=43, df_num=2>
File: /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/statsmodels/regression/linear_model.py
Type: type
Subclasses:
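A minimal sketch of the classical-inference workflow, assuming the formula interface (statsmodels.formula.api), which adds the intercept and dummy-codes categorical columns through C(). The data frame and its column names are synthetic placeholders, not the wages data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the wages dataset), for illustration only.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "x_strong": rng.normal(size=n),
    "x_noise": rng.normal(size=n),
    "group": rng.choice(["a", "b"], size=n),
})
df["y"] = (
    2.0
    + 3.0 * df["x_strong"]
    + 1.0 * (df["group"] == "b")
    + rng.normal(scale=1.0, size=n)
)

# The formula adds an intercept automatically; C(group) creates a dummy for level "b".
results = smf.ols("y ~ x_strong + x_noise + C(group)", data=df).fit()

# One row per coefficient: estimate, p-value for H0 (coefficient = 0), and 95% interval.
ci = results.conf_int(alpha=0.05)
table = pd.DataFrame({
    "coef": results.params,
    "p_value": results.pvalues,
    "ci_low": ci[0],
    "ci_high": ci[1],
})
print(table.round(3))
```

results.summary() prints the same quantities in the familiar regression table. Unlike permutation importance, these tests describe each coefficient holding the other regressors fixed, and they rely on the linear model's assumptions.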
Problem 5: Causal ML
Watch Hajime Takeda’s talk from SciPy 2024 on Causal ML. What is the big idea, and can you explain a use case for Causal ML?
from IPython.display import YouTubeVideo
YouTubeVideo(id = 'xcv4FH-KnvA')
Problem 6: EconML
Microsoft has put together a very nice library with popular Causal ML algorithms and analysis tools. Head over to the EconML website and read through the Trip Advisor Case Study. Your goal is to use these ideas to determine a targeting strategy for the hillstrom data loaded below. (Explanation of problem here) This is a fairly open-ended task, and you have flexibility in deciding exactly how to approach it. I will give you time on Tuesday to brainstorm ideas with peers in class, and together you should produce a brief summary of your strategy and its efficacy.
from sklift.datasets import fetch_hillstrom
dataset = fetch_hillstrom(target_col='conversion')
data, target, treatment = dataset.data, dataset.target, dataset.treatment
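To make the idea of a targeting strategy concrete, here is a rough sketch of a two-model ("T-learner") uplift estimate built with plain scikit-learn. This is a simplified stand-in for the estimators EconML provides, not EconML's API. The data is synthetic with a binary treatment, and names such as send_email are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the hillstrom data): randomized 0/1 treatment,
# 10% baseline conversion, and a 15-point lift only for customers with X[:, 0] > 0.
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 2))
treatment = rng.integers(0, 2, size=n)
p_convert = 0.10 + 0.15 * treatment * (X[:, 0] > 0)
y = rng.binomial(1, p_convert)

X_tr, X_te, t_tr, t_te, y_tr, y_te = train_test_split(
    X, treatment, y, test_size=0.5, random_state=0
)

# Fit one outcome model per arm on the training half.
model_treated = GradientBoostingClassifier(random_state=0).fit(X_tr[t_tr == 1], y_tr[t_tr == 1])
model_control = GradientBoostingClassifier(random_state=0).fit(X_tr[t_tr == 0], y_tr[t_tr == 0])

# Estimated uplift = predicted conversion if treated minus predicted conversion if not.
uplift = model_treated.predict_proba(X_te)[:, 1] - model_control.predict_proba(X_te)[:, 1]

# Hypothetical policy: contact the 30% of held-out customers with the highest estimated uplift.
send_email = uplift >= np.quantile(uplift, 0.70)
print(f"targeted share: {send_email.mean():.2f}")
print(f"mean estimated uplift among targeted: {uplift[send_email].mean():.3f}")
```

Because treatment was randomized in this toy setup, a simple efficacy check is to compare observed conversion between treated and untreated customers inside the targeted group on the held-out half.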