ml-econometrics

Name: ml-econometrics
Author: xjtulyc

Install

mkdir -p .claude/skills/ml-econometrics && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16267" && unzip -o skill.zip -d .claude/skills/ml-econometrics && rm skill.zip

Installs to .claude/skills/ml-econometrics

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Use this Skill for machine learning in econometrics: Post-LASSO for high-dimensional controls, Double/Debiased ML (DML) for ATE/ATT, cross-fitting, and valid post-selection inference.

183 chars✓ has a “when” trigger

About this skill

Machine Learning Econometrics

TL;DR — When controls are high-dimensional (many covariates relative to sample size), use Post-LASSO for variable selection and Double/Debiased ML (DML) for valid causal inference on treatment effects. Use EconML CausalForest for heterogeneous treatment effects (CATE).

When to Use

Situation	Recommended Method
Many controls, single treatment, want ATE	Post-LASSO + OLS or DML-PLR
Need to partial out high-dimensional X from D and Y	Robinson (1988) / DML
Heterogeneous treatment effects across subgroups	CausalForest (EconML)
Program evaluation with rich covariate set	AIPW estimator
Valid inference after LASSO selection	Post-LASSO (Belloni et al. 2012)

Background

Problem: Regularization Bias in LASSO

When estimating β_D in Y = β_D D + X β_X + ε with high-dimensional X, directly using LASSO on the full model introduces regularization bias into β̂_D. This happens because LASSO shrinks all coefficients, including the one on the treatment D.

Naive LASSO estimate: Biased, no valid standard errors.

Post-LASSO (Belloni-Chernozhukov-Hansen 2014)

Run LASSO of Y on (D, X) to select a set S of controls.
Run OLS of Y on D and the selected subset X_S only.
Standard OLS inference is valid on the Post-LASSO estimator.

This two-step procedure recovers valid inference when the true model is approximately sparse (few truly important controls).

Frisch-Waugh-Lovell and Robinson's Estimator

By the FWL theorem, β_D in Y = β_D D + X β_X + ε equals the coefficient from regressing M_X Y on M_X D, where M_X is the residual maker.

Robinson (1988) extended this to semiparametric partially linear models: Y = θ D + g(X) + ε D = m(X) + v

The estimator: θ̂ = [Σ (D_i - m̂(X_i)) (Y_i - ĝ(X_i))] / [Σ (D_i - m̂(X_i))²]

where ĝ and m̂ can be any ML estimators.

Double/Debiased ML (Chernozhukov et al. 2018)

DML corrects for regularization bias via cross-fitting:

Split data into K folds.
For each fold k:
- Fit ĝ_{-k}(X) = E[Y|X] and m̂_{-k}(X) = E[D|X] on all folds except k.
- Compute residuals for fold k: ỹ_i = Y_i - ĝ_{-k}(X_i), d̃_i = D_i - m̂_{-k}(X_i).
Regress ỹ on d̃ using all pooled residuals.

Cross-fitting ensures the nuisance estimators have negligible bias on the held-out sample, restoring √n-consistency and valid Gaussian inference.

CATE with CausalForest

For heterogeneous effects τ(x) = E[Y(1) - Y(0) | X = x]:

Fit a GRF (Generalized Random Forest) that solves a local version of the Robinson estimator at each point x.
Valid confidence intervals via the infinitesimal jackknife.
Feature importance identifies which covariates drive heterogeneity.

Environment Setup

conda create -n mlecono python=3.11 -y
conda activate mlecono
pip install doubleml>=0.6 econml>=0.15 scikit-learn>=1.2 numpy>=1.23 pandas>=1.5 matplotlib>=3.6

# Verify
python -c "import doubleml; print('doubleml', doubleml.__version__)"
python -c "import econml; print('econml', econml.__version__)"

Core Workflow

Step 1 — Post-LASSO Variable Selection and OLS

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LassoCV, Lasso
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

np.random.seed(42)


def generate_highdim_data(
    n: int = 500,
    p_total: int = 200,
    p_relevant: int = 10,
    ate: float = 2.0,
) -> pd.DataFrame:
    """
    Generate high-dimensional dataset for ML econometrics.

    Y = ate * D + X_relevant @ beta + epsilon
    D = X_relevant @ gamma + v (endogenous through X)

    Args:
        n:           Number of observations.
        p_total:     Total number of control variables.
        p_relevant:  Number of truly relevant controls.
        ate:         True average treatment effect.

    Returns:
        DataFrame with columns: Y, D, X1, ..., Xp_total.
    """
    X = np.random.randn(n, p_total)
    X_rel = X[:, :p_relevant]

    # True coefficients
    beta = np.random.uniform(0.5, 2.0, p_relevant) * np.random.choice([-1, 1], p_relevant)
    gamma = np.random.uniform(0.3, 1.0, p_relevant) * np.random.choice([-1, 1], p_relevant)

    # Treatment (binary after probit)
    D_star = X_rel @ gamma + np.random.randn(n)
    D = (D_star > np.median(D_star)).astype(float)

    # Outcome
    Y = ate * D + X_rel @ beta + np.random.randn(n)

    df = pd.DataFrame(X, columns=[f"X{i+1}" for i in range(p_total)])
    df.insert(0, "D", D)
    df.insert(0, "Y", Y)
    return df


def post_lasso_ols(
    df: pd.DataFrame,
    outcome: str = "Y",
    treatment: str = "D",
    covariates: list = None,
    cv_folds: int = 5,
) -> dict:
    """
    Post-LASSO: select controls via LassoCV, then run OLS on selected subset.

    This avoids regularization bias on the treatment coefficient β_D.

    Args:
        df:          DataFrame with outcome, treatment, and covariates.
        outcome:     Name of outcome variable.
        treatment:   Name of treatment variable.
        covariates:  List of control variable names. Defaults to all columns
                     except outcome and treatment.
        cv_folds:    Cross-validation folds for LassoCV penalty selection.

    Returns:
        Dictionary with keys: ate_estimate, se, tstat, pvalue, selected_controls, ols_result.
    """
    if covariates is None:
        covariates = [c for c in df.columns if c not in [outcome, treatment]]

    Y = df[outcome].values
    D = df[treatment].values
    X = df[covariates].values

    # Standardize X for LASSO
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Stack D with X for selection (Belloni: select controls for Y equation)
    XD = np.hstack([D.reshape(-1, 1), X_scaled])

    # LassoCV: select lambda by cross-validation
    lasso_cv = LassoCV(cv=cv_folds, max_iter=5000, random_state=0, n_jobs=-1)
    lasso_cv.fit(XD, Y)
    alpha_opt = lasso_cv.alpha_

    # Fit Lasso at optimal alpha to identify selected controls
    lasso = Lasso(alpha=alpha_opt, max_iter=5000)
    lasso.fit(XD, Y)
    coef_x = lasso.coef_[1:]       # skip D coefficient (first column)
    selected_idx = np.where(np.abs(coef_x) > 1e-8)[0]
    selected_controls = [covariates[i] for i in selected_idx]

    n_selected = len(selected_controls)
    print(f"Post-LASSO: selected {n_selected} / {len(covariates)} controls at lambda={alpha_opt:.6f}")

    # OLS on treatment + selected controls
    X_selected = df[selected_controls].values if n_selected > 0 else np.zeros((len(Y), 0))
    regressors = np.hstack([D.reshape(-1, 1), X_selected])
    regressors_with_const = sm.add_constant(regressors)
    ols_result = sm.OLS(Y, regressors_with_const).fit(cov_type="HC3")

    # Treatment coefficient is the second element (after constant)
    ate_hat = float(ols_result.params[1])
    se_hat = float(ols_result.bse[1])
    tstat = float(ols_result.tvalues[1])
    pvalue = float(ols_result.pvalues[1])

    print(f"Post-LASSO ATE: {ate_hat:.4f}  SE: {se_hat:.4f}  t: {tstat:.3f}  p: {pvalue:.4f}")

    return {
        "ate_estimate": ate_hat,
        "se": se_hat,
        "tstat": tstat,
        "pvalue": pvalue,
        "selected_controls": selected_controls,
        "ols_result": ols_result,
    }

Step 2 — Double/Debiased ML (PLR Model)

import doubleml as dml
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.linear_model import LassoCV, LogisticRegressionCV


def fit_doubleml_plr(
    df: pd.DataFrame,
    outcome: str = "Y",
    treatment: str = "D",
    covariates: list = None,
    ml_type: str = "lasso",
    n_folds: int = 5,
    n_rep: int = 3,
    score: str = "partialling out",
) -> dict:
    """
    Fit a Partially Linear Regression (PLR) model using DoubleML.

    Args:
        df:          DataFrame with outcome, treatment, and controls.
        outcome:     Outcome variable name.
        treatment:   Treatment variable name (binary or continuous).
        covariates:  Control variable names. Defaults to all other columns.
        ml_type:     Nuisance learner: 'lasso' or 'gbm' (gradient boosting).
        n_folds:     Number of cross-fitting folds.
        n_rep:       Number of repeated cross-fitting splits.
        score:       Scoring method: 'partialling out' or 'IV-type'.

    Returns:
        Dictionary with keys: ate, se, pvalue, ci_lower, ci_upper, dml_obj.
    """
    if covariates is None:
        covariates = [c for c in df.columns if c not in [outcome, treatment]]

    # DoubleML data object
    data = dml.DoubleMLData(df, y_col=outcome, d_cols=treatment, x_cols=covariates)

    if ml_type == "lasso":
        ml_l = LassoCV(cv=5, max_iter=5000)       # E[Y|X] learner
        ml_m = LassoCV(cv=5, max_iter=5000)       # E[D|X] learner
    elif ml_type == "gbm":
        ml_l = GradientBoostingRegressor(n_estimators=200, max_depth=3)
        ml_m = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    else:
        raise ValueError(f"Unknown ml_type '{ml_type}'. Choose 'lasso' or 'gbm'.")

    # PLR model
    dml_plr = dml.DoubleMLPLR(
        obj_dml_data=data,
        ml_l=ml_l,
        ml_m=ml_m,
        n_folds=n_folds,
        n_rep=n_rep,
        score=score,
    )
    dml_plr.fit(store_predictions=True, store_models=True)

    summary = dml_plr.summary
    ate = float(summary["coef"].values[0])
    se = float(summary["std err"].values[0])
    pvalue = float(summary["P>|z|"].values[0])
    ci_lower = float(dml_plr.confint().iloc[0, 0])
    ci_upper = float(dml_plr.confint().iloc[0, 1])

    print(f"\nDML PLR ATE: {ate:.4f}  SE: {se:.4f}  p: {pvalue:.4f}")
    print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")

    return {
        "ate": ate,
        "se": se,
        "pvalue": pvalue,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "dml_obj": dml_plr,
    }

Step 3 — Causa

Content truncated.

More by xjtulyc

View all by xjtulyc →

meta-analysis-psych

xjtulyc

Install

mkdir -p .claude/skills/ml-econometrics && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16267" && unzip -o skill.zip -d .claude/skills/ml-econometrics && rm skill.zip

Installs to .claude/skills/ml-econometrics

Safety

Review before install

Runs shell / code

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated

3mo ago

License

MIT

Repo stars

Loads

~4,619 tokens

Stars are for the whole repository, not this skill alone.

Stats

Views

Installs

Author

xjtulyc

2 skills published

Links

Source code

ml-econometrics

Install

Activation

About this skill

Machine Learning Econometrics

When to Use

Background

Problem: Regularization Bias in LASSO

Post-LASSO (Belloni-Chernozhukov-Hansen 2014)

Frisch-Waugh-Lovell and Robinson's Estimator

Double/Debiased ML (Chernozhukov et al. 2018)

CATE with CausalForest

Environment Setup

Core Workflow

Step 1 — Post-LASSO Variable Selection and OLS

Step 2 — Double/Debiased ML (PLR Model)

Step 3 — Causa

More by xjtulyc

meta-analysis-psych

Search skills