ml-econometrics
>
Install
mkdir -p .claude/skills/ml-econometrics && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16267" && unzip -o skill.zip -d .claude/skills/ml-econometrics && rm skill.zipInstalls to .claude/skills/ml-econometrics
Activation
This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.
Use this Skill for machine learning in econometrics: Post-LASSO for high-dimensional controls, Double/Debiased ML (DML) for ATE/ATT, cross-fitting, and valid post-selection inference.About this skill
Machine Learning Econometrics
TL;DR — When controls are high-dimensional (many covariates relative to sample size), use Post-LASSO for variable selection and Double/Debiased ML (DML) for valid causal inference on treatment effects. Use EconML CausalForest for heterogeneous treatment effects (CATE).
When to Use
| Situation | Recommended Method |
|---|---|
| Many controls, single treatment, want ATE | Post-LASSO + OLS or DML-PLR |
| Need to partial out high-dimensional X from D and Y | Robinson (1988) / DML |
| Heterogeneous treatment effects across subgroups | CausalForest (EconML) |
| Program evaluation with rich covariate set | AIPW estimator |
| Valid inference after LASSO selection | Post-LASSO (Belloni et al. 2012) |
Background
Problem: Regularization Bias in LASSO
When estimating β_D in Y = β_D D + X β_X + ε with high-dimensional X, directly using LASSO on the full model introduces regularization bias into β̂_D. This happens because LASSO shrinks all coefficients, including the one on the treatment D.
Naive LASSO estimate: Biased, no valid standard errors.
Post-LASSO (Belloni-Chernozhukov-Hansen 2014)
- Run LASSO of Y on (D, X) to select a set S of controls.
- Run OLS of Y on D and the selected subset X_S only.
- Standard OLS inference is valid on the Post-LASSO estimator.
This two-step procedure recovers valid inference when the true model is approximately sparse (few truly important controls).
Frisch-Waugh-Lovell and Robinson's Estimator
By the FWL theorem, β_D in Y = β_D D + X β_X + ε equals the coefficient from regressing M_X Y on M_X D, where M_X is the residual maker.
Robinson (1988) extended this to semiparametric partially linear models: Y = θ D + g(X) + ε D = m(X) + v
The estimator: θ̂ = [Σ (D_i - m̂(X_i)) (Y_i - ĝ(X_i))] / [Σ (D_i - m̂(X_i))²]
where ĝ and m̂ can be any ML estimators.
Double/Debiased ML (Chernozhukov et al. 2018)
DML corrects for regularization bias via cross-fitting:
- Split data into K folds.
- For each fold k:
- Fit ĝ_{-k}(X) = E[Y|X] and m̂_{-k}(X) = E[D|X] on all folds except k.
- Compute residuals for fold k: ỹ_i = Y_i - ĝ_{-k}(X_i), d̃_i = D_i - m̂_{-k}(X_i).
- Regress ỹ on d̃ using all pooled residuals.
Cross-fitting ensures the nuisance estimators have negligible bias on the held-out sample, restoring √n-consistency and valid Gaussian inference.
CATE with CausalForest
For heterogeneous effects τ(x) = E[Y(1) - Y(0) | X = x]:
- Fit a GRF (Generalized Random Forest) that solves a local version of the Robinson estimator at each point x.
- Valid confidence intervals via the infinitesimal jackknife.
- Feature importance identifies which covariates drive heterogeneity.
Environment Setup
conda create -n mlecono python=3.11 -y
conda activate mlecono
pip install doubleml>=0.6 econml>=0.15 scikit-learn>=1.2 numpy>=1.23 pandas>=1.5 matplotlib>=3.6
# Verify
python -c "import doubleml; print('doubleml', doubleml.__version__)"
python -c "import econml; print('econml', econml.__version__)"
Core Workflow
Step 1 — Post-LASSO Variable Selection and OLS
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LassoCV, Lasso
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
np.random.seed(42)
def generate_highdim_data(
n: int = 500,
p_total: int = 200,
p_relevant: int = 10,
ate: float = 2.0,
) -> pd.DataFrame:
"""
Generate high-dimensional dataset for ML econometrics.
Y = ate * D + X_relevant @ beta + epsilon
D = X_relevant @ gamma + v (endogenous through X)
Args:
n: Number of observations.
p_total: Total number of control variables.
p_relevant: Number of truly relevant controls.
ate: True average treatment effect.
Returns:
DataFrame with columns: Y, D, X1, ..., Xp_total.
"""
X = np.random.randn(n, p_total)
X_rel = X[:, :p_relevant]
# True coefficients
beta = np.random.uniform(0.5, 2.0, p_relevant) * np.random.choice([-1, 1], p_relevant)
gamma = np.random.uniform(0.3, 1.0, p_relevant) * np.random.choice([-1, 1], p_relevant)
# Treatment (binary after probit)
D_star = X_rel @ gamma + np.random.randn(n)
D = (D_star > np.median(D_star)).astype(float)
# Outcome
Y = ate * D + X_rel @ beta + np.random.randn(n)
df = pd.DataFrame(X, columns=[f"X{i+1}" for i in range(p_total)])
df.insert(0, "D", D)
df.insert(0, "Y", Y)
return df
def post_lasso_ols(
df: pd.DataFrame,
outcome: str = "Y",
treatment: str = "D",
covariates: list = None,
cv_folds: int = 5,
) -> dict:
"""
Post-LASSO: select controls via LassoCV, then run OLS on selected subset.
This avoids regularization bias on the treatment coefficient β_D.
Args:
df: DataFrame with outcome, treatment, and covariates.
outcome: Name of outcome variable.
treatment: Name of treatment variable.
covariates: List of control variable names. Defaults to all columns
except outcome and treatment.
cv_folds: Cross-validation folds for LassoCV penalty selection.
Returns:
Dictionary with keys: ate_estimate, se, tstat, pvalue, selected_controls, ols_result.
"""
if covariates is None:
covariates = [c for c in df.columns if c not in [outcome, treatment]]
Y = df[outcome].values
D = df[treatment].values
X = df[covariates].values
# Standardize X for LASSO
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Stack D with X for selection (Belloni: select controls for Y equation)
XD = np.hstack([D.reshape(-1, 1), X_scaled])
# LassoCV: select lambda by cross-validation
lasso_cv = LassoCV(cv=cv_folds, max_iter=5000, random_state=0, n_jobs=-1)
lasso_cv.fit(XD, Y)
alpha_opt = lasso_cv.alpha_
# Fit Lasso at optimal alpha to identify selected controls
lasso = Lasso(alpha=alpha_opt, max_iter=5000)
lasso.fit(XD, Y)
coef_x = lasso.coef_[1:] # skip D coefficient (first column)
selected_idx = np.where(np.abs(coef_x) > 1e-8)[0]
selected_controls = [covariates[i] for i in selected_idx]
n_selected = len(selected_controls)
print(f"Post-LASSO: selected {n_selected} / {len(covariates)} controls at lambda={alpha_opt:.6f}")
# OLS on treatment + selected controls
X_selected = df[selected_controls].values if n_selected > 0 else np.zeros((len(Y), 0))
regressors = np.hstack([D.reshape(-1, 1), X_selected])
regressors_with_const = sm.add_constant(regressors)
ols_result = sm.OLS(Y, regressors_with_const).fit(cov_type="HC3")
# Treatment coefficient is the second element (after constant)
ate_hat = float(ols_result.params[1])
se_hat = float(ols_result.bse[1])
tstat = float(ols_result.tvalues[1])
pvalue = float(ols_result.pvalues[1])
print(f"Post-LASSO ATE: {ate_hat:.4f} SE: {se_hat:.4f} t: {tstat:.3f} p: {pvalue:.4f}")
return {
"ate_estimate": ate_hat,
"se": se_hat,
"tstat": tstat,
"pvalue": pvalue,
"selected_controls": selected_controls,
"ols_result": ols_result,
}
Step 2 — Double/Debiased ML (PLR Model)
import doubleml as dml
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.linear_model import LassoCV, LogisticRegressionCV
def fit_doubleml_plr(
df: pd.DataFrame,
outcome: str = "Y",
treatment: str = "D",
covariates: list = None,
ml_type: str = "lasso",
n_folds: int = 5,
n_rep: int = 3,
score: str = "partialling out",
) -> dict:
"""
Fit a Partially Linear Regression (PLR) model using DoubleML.
Args:
df: DataFrame with outcome, treatment, and controls.
outcome: Outcome variable name.
treatment: Treatment variable name (binary or continuous).
covariates: Control variable names. Defaults to all other columns.
ml_type: Nuisance learner: 'lasso' or 'gbm' (gradient boosting).
n_folds: Number of cross-fitting folds.
n_rep: Number of repeated cross-fitting splits.
score: Scoring method: 'partialling out' or 'IV-type'.
Returns:
Dictionary with keys: ate, se, pvalue, ci_lower, ci_upper, dml_obj.
"""
if covariates is None:
covariates = [c for c in df.columns if c not in [outcome, treatment]]
# DoubleML data object
data = dml.DoubleMLData(df, y_col=outcome, d_cols=treatment, x_cols=covariates)
if ml_type == "lasso":
ml_l = LassoCV(cv=5, max_iter=5000) # E[Y|X] learner
ml_m = LassoCV(cv=5, max_iter=5000) # E[D|X] learner
elif ml_type == "gbm":
ml_l = GradientBoostingRegressor(n_estimators=200, max_depth=3)
ml_m = GradientBoostingClassifier(n_estimators=200, max_depth=3)
else:
raise ValueError(f"Unknown ml_type '{ml_type}'. Choose 'lasso' or 'gbm'.")
# PLR model
dml_plr = dml.DoubleMLPLR(
obj_dml_data=data,
ml_l=ml_l,
ml_m=ml_m,
n_folds=n_folds,
n_rep=n_rep,
score=score,
)
dml_plr.fit(store_predictions=True, store_models=True)
summary = dml_plr.summary
ate = float(summary["coef"].values[0])
se = float(summary["std err"].values[0])
pvalue = float(summary["P>|z|"].values[0])
ci_lower = float(dml_plr.confint().iloc[0, 0])
ci_upper = float(dml_plr.confint().iloc[0, 1])
print(f"\nDML PLR ATE: {ate:.4f} SE: {se:.4f} p: {pvalue:.4f}")
print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
return {
"ate": ate,
"se": se,
"pvalue": pvalue,
"ci_lower": ci_lower,
"ci_upper": ci_upper,
"dml_obj": dml_plr,
}
Step 3 — Causa
Content truncated.