agentskills.codes
ML

ml-econometrics

>

Install

mkdir -p .claude/skills/ml-econometrics && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16267" && unzip -o skill.zip -d .claude/skills/ml-econometrics && rm skill.zip

Installs to .claude/skills/ml-econometrics

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Use this Skill for machine learning in econometrics: Post-LASSO for high-dimensional controls, Double/Debiased ML (DML) for ATE/ATT, cross-fitting, and valid post-selection inference.
183 chars✓ has a “when” trigger

About this skill

Machine Learning Econometrics

TL;DR — When controls are high-dimensional (many covariates relative to sample size), use Post-LASSO for variable selection and Double/Debiased ML (DML) for valid causal inference on treatment effects. Use EconML CausalForest for heterogeneous treatment effects (CATE).


When to Use

SituationRecommended Method
Many controls, single treatment, want ATEPost-LASSO + OLS or DML-PLR
Need to partial out high-dimensional X from D and YRobinson (1988) / DML
Heterogeneous treatment effects across subgroupsCausalForest (EconML)
Program evaluation with rich covariate setAIPW estimator
Valid inference after LASSO selectionPost-LASSO (Belloni et al. 2012)

Background

Problem: Regularization Bias in LASSO

When estimating β_D in Y = β_D D + X β_X + ε with high-dimensional X, directly using LASSO on the full model introduces regularization bias into β̂_D. This happens because LASSO shrinks all coefficients, including the one on the treatment D.

Naive LASSO estimate: Biased, no valid standard errors.

Post-LASSO (Belloni-Chernozhukov-Hansen 2014)

  1. Run LASSO of Y on (D, X) to select a set S of controls.
  2. Run OLS of Y on D and the selected subset X_S only.
  3. Standard OLS inference is valid on the Post-LASSO estimator.

This two-step procedure recovers valid inference when the true model is approximately sparse (few truly important controls).

Frisch-Waugh-Lovell and Robinson's Estimator

By the FWL theorem, β_D in Y = β_D D + X β_X + ε equals the coefficient from regressing M_X Y on M_X D, where M_X is the residual maker.

Robinson (1988) extended this to semiparametric partially linear models: Y = θ D + g(X) + ε D = m(X) + v

The estimator: θ̂ = [Σ (D_i - m̂(X_i)) (Y_i - ĝ(X_i))] / [Σ (D_i - m̂(X_i))²]

where ĝ and m̂ can be any ML estimators.

Double/Debiased ML (Chernozhukov et al. 2018)

DML corrects for regularization bias via cross-fitting:

  1. Split data into K folds.
  2. For each fold k:
    • Fit ĝ_{-k}(X) = E[Y|X] and m̂_{-k}(X) = E[D|X] on all folds except k.
    • Compute residuals for fold k: ỹ_i = Y_i - ĝ_{-k}(X_i), d̃_i = D_i - m̂_{-k}(X_i).
  3. Regress ỹ on d̃ using all pooled residuals.

Cross-fitting ensures the nuisance estimators have negligible bias on the held-out sample, restoring √n-consistency and valid Gaussian inference.

CATE with CausalForest

For heterogeneous effects τ(x) = E[Y(1) - Y(0) | X = x]:

  • Fit a GRF (Generalized Random Forest) that solves a local version of the Robinson estimator at each point x.
  • Valid confidence intervals via the infinitesimal jackknife.
  • Feature importance identifies which covariates drive heterogeneity.

Environment Setup

conda create -n mlecono python=3.11 -y
conda activate mlecono
pip install doubleml>=0.6 econml>=0.15 scikit-learn>=1.2 numpy>=1.23 pandas>=1.5 matplotlib>=3.6

# Verify
python -c "import doubleml; print('doubleml', doubleml.__version__)"
python -c "import econml; print('econml', econml.__version__)"

Core Workflow

Step 1 — Post-LASSO Variable Selection and OLS

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LassoCV, Lasso
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

np.random.seed(42)


def generate_highdim_data(
    n: int = 500,
    p_total: int = 200,
    p_relevant: int = 10,
    ate: float = 2.0,
) -> pd.DataFrame:
    """
    Generate high-dimensional dataset for ML econometrics.

    Y = ate * D + X_relevant @ beta + epsilon
    D = X_relevant @ gamma + v (endogenous through X)

    Args:
        n:           Number of observations.
        p_total:     Total number of control variables.
        p_relevant:  Number of truly relevant controls.
        ate:         True average treatment effect.

    Returns:
        DataFrame with columns: Y, D, X1, ..., Xp_total.
    """
    X = np.random.randn(n, p_total)
    X_rel = X[:, :p_relevant]

    # True coefficients
    beta = np.random.uniform(0.5, 2.0, p_relevant) * np.random.choice([-1, 1], p_relevant)
    gamma = np.random.uniform(0.3, 1.0, p_relevant) * np.random.choice([-1, 1], p_relevant)

    # Treatment (binary after probit)
    D_star = X_rel @ gamma + np.random.randn(n)
    D = (D_star > np.median(D_star)).astype(float)

    # Outcome
    Y = ate * D + X_rel @ beta + np.random.randn(n)

    df = pd.DataFrame(X, columns=[f"X{i+1}" for i in range(p_total)])
    df.insert(0, "D", D)
    df.insert(0, "Y", Y)
    return df


def post_lasso_ols(
    df: pd.DataFrame,
    outcome: str = "Y",
    treatment: str = "D",
    covariates: list = None,
    cv_folds: int = 5,
) -> dict:
    """
    Post-LASSO: select controls via LassoCV, then run OLS on selected subset.

    This avoids regularization bias on the treatment coefficient β_D.

    Args:
        df:          DataFrame with outcome, treatment, and covariates.
        outcome:     Name of outcome variable.
        treatment:   Name of treatment variable.
        covariates:  List of control variable names. Defaults to all columns
                     except outcome and treatment.
        cv_folds:    Cross-validation folds for LassoCV penalty selection.

    Returns:
        Dictionary with keys: ate_estimate, se, tstat, pvalue, selected_controls, ols_result.
    """
    if covariates is None:
        covariates = [c for c in df.columns if c not in [outcome, treatment]]

    Y = df[outcome].values
    D = df[treatment].values
    X = df[covariates].values

    # Standardize X for LASSO
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Stack D with X for selection (Belloni: select controls for Y equation)
    XD = np.hstack([D.reshape(-1, 1), X_scaled])

    # LassoCV: select lambda by cross-validation
    lasso_cv = LassoCV(cv=cv_folds, max_iter=5000, random_state=0, n_jobs=-1)
    lasso_cv.fit(XD, Y)
    alpha_opt = lasso_cv.alpha_

    # Fit Lasso at optimal alpha to identify selected controls
    lasso = Lasso(alpha=alpha_opt, max_iter=5000)
    lasso.fit(XD, Y)
    coef_x = lasso.coef_[1:]       # skip D coefficient (first column)
    selected_idx = np.where(np.abs(coef_x) > 1e-8)[0]
    selected_controls = [covariates[i] for i in selected_idx]

    n_selected = len(selected_controls)
    print(f"Post-LASSO: selected {n_selected} / {len(covariates)} controls at lambda={alpha_opt:.6f}")

    # OLS on treatment + selected controls
    X_selected = df[selected_controls].values if n_selected > 0 else np.zeros((len(Y), 0))
    regressors = np.hstack([D.reshape(-1, 1), X_selected])
    regressors_with_const = sm.add_constant(regressors)
    ols_result = sm.OLS(Y, regressors_with_const).fit(cov_type="HC3")

    # Treatment coefficient is the second element (after constant)
    ate_hat = float(ols_result.params[1])
    se_hat = float(ols_result.bse[1])
    tstat = float(ols_result.tvalues[1])
    pvalue = float(ols_result.pvalues[1])

    print(f"Post-LASSO ATE: {ate_hat:.4f}  SE: {se_hat:.4f}  t: {tstat:.3f}  p: {pvalue:.4f}")

    return {
        "ate_estimate": ate_hat,
        "se": se_hat,
        "tstat": tstat,
        "pvalue": pvalue,
        "selected_controls": selected_controls,
        "ols_result": ols_result,
    }

Step 2 — Double/Debiased ML (PLR Model)

import doubleml as dml
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.linear_model import LassoCV, LogisticRegressionCV


def fit_doubleml_plr(
    df: pd.DataFrame,
    outcome: str = "Y",
    treatment: str = "D",
    covariates: list = None,
    ml_type: str = "lasso",
    n_folds: int = 5,
    n_rep: int = 3,
    score: str = "partialling out",
) -> dict:
    """
    Fit a Partially Linear Regression (PLR) model using DoubleML.

    Args:
        df:          DataFrame with outcome, treatment, and controls.
        outcome:     Outcome variable name.
        treatment:   Treatment variable name (binary or continuous).
        covariates:  Control variable names. Defaults to all other columns.
        ml_type:     Nuisance learner: 'lasso' or 'gbm' (gradient boosting).
        n_folds:     Number of cross-fitting folds.
        n_rep:       Number of repeated cross-fitting splits.
        score:       Scoring method: 'partialling out' or 'IV-type'.

    Returns:
        Dictionary with keys: ate, se, pvalue, ci_lower, ci_upper, dml_obj.
    """
    if covariates is None:
        covariates = [c for c in df.columns if c not in [outcome, treatment]]

    # DoubleML data object
    data = dml.DoubleMLData(df, y_col=outcome, d_cols=treatment, x_cols=covariates)

    if ml_type == "lasso":
        ml_l = LassoCV(cv=5, max_iter=5000)       # E[Y|X] learner
        ml_m = LassoCV(cv=5, max_iter=5000)       # E[D|X] learner
    elif ml_type == "gbm":
        ml_l = GradientBoostingRegressor(n_estimators=200, max_depth=3)
        ml_m = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    else:
        raise ValueError(f"Unknown ml_type '{ml_type}'. Choose 'lasso' or 'gbm'.")

    # PLR model
    dml_plr = dml.DoubleMLPLR(
        obj_dml_data=data,
        ml_l=ml_l,
        ml_m=ml_m,
        n_folds=n_folds,
        n_rep=n_rep,
        score=score,
    )
    dml_plr.fit(store_predictions=True, store_models=True)

    summary = dml_plr.summary
    ate = float(summary["coef"].values[0])
    se = float(summary["std err"].values[0])
    pvalue = float(summary["P>|z|"].values[0])
    ci_lower = float(dml_plr.confint().iloc[0, 0])
    ci_upper = float(dml_plr.confint().iloc[0, 1])

    print(f"\nDML PLR ATE: {ate:.4f}  SE: {se:.4f}  p: {pvalue:.4f}")
    print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")

    return {
        "ate": ate,
        "se": se,
        "pvalue": pvalue,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "dml_obj": dml_plr,
    }

Step 3 — Causa


Content truncated.

Search skills

Search the agent skills registry