

Residualization as a scikit-learn pre-processing step

Credit: E Duchesnay

Residualizer as a pre-processing step for supervised prediction:

Input: X = age + site + e, target = age

Preprocessing:
- Residualize X for “site” adjusted for “age”
- Learn to predict age on the residualized data

Since age is used in the residualization, the residualizer MUST be fitted on the training data only.
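
Before turning to the scikit-learn machinery, the block below is a minimal NumPy sketch of what residualizing X “for site adjusted for age” means: fit an ordinary least-squares model of X on the full design (intercept + site + age), then subtract only the site contribution, so that age-related variance is preserved. This is an illustration of the idea, not the mulm implementation; all names below are local to the sketch.

import numpy as np

rng = np.random.default_rng(0)
n = 100
site = np.repeat([-1.0, 1.0], 50)              # confound to remove
age = rng.uniform(10, 40, size=n) + 5 * site   # target, correlated with site
X = rng.normal(size=(n, 2))
X[:, 0] = -0.1 * age + site + rng.normal(size=n)

Z_full = np.column_stack([np.ones(n), site, age])  # intercept + site + age ("formula_full")
B, *_ = np.linalg.lstsq(Z_full, X, rcond=None)     # betas estimated in the presence of age
X_res = X - np.outer(site, B[1, :])                # remove only the site effect ("formula_res")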

import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn import metrics

from mulm.residualizer import Residualizer

Dataset, where:
- X is the input data of the predictive model and y is the target
- Z is the design matrix used to residualize X

site = np.array([-1] * 50 + [1] * 50)  # site confound, coded -1/+1
age = np.random.uniform(10, 40, size=100) + 5 * site  # target, correlated with site
X = np.random.randn(100, 5)
X[:, 0] = -0.1 * age + site + np.random.normal(size=100)  # features carrying age and site effects
X[:, 1] = -0.1 * age + site + np.random.normal(size=100)
demographic_df = pd.DataFrame(dict(age=age, site=site.astype(object)))  # used to build the design matrix Z
y = age

Predictive model cross-validation

lr = linear_model.Ridge(alpha=1)
scaler = StandardScaler()
cv = KFold(n_splits=5)  # no shuffling: random_state would have no effect here

Usage 1: Manual slicing of train/test data: use Residualizer

residualizer = Residualizer(data=demographic_df, formula_res='site',
                            formula_full='site + age')
Z = residualizer.get_design_mat(data=demographic_df)
scores = np.zeros((5, 2))
for i, (tr_idx, te_idx) in enumerate(cv.split(X, y)):
    X_tr, X_te = X[tr_idx, :], X[te_idx, :]
    Z_tr, Z_te = Z[tr_idx, :], Z[te_idx, :]
    y_tr, y_te = y[tr_idx], y[te_idx]

    # 1) Fit residualizer
    residualizer.fit(X_tr, Z_tr)

    # 2) Residualize
    X_res_tr = residualizer.transform(X_tr, Z_tr)
    X_res_te = residualizer.transform(X_te, Z_te)

    X_res_tr = scaler.fit_transform(X_res_tr)
    X_res_te = scaler.transform(X_res_te)

    # 3) Fit predictor on train residualized data
    lr.fit(X_res_tr, y_tr)

    # 4) Predict on test residualized data
    y_test_pred = lr.predict(X_res_te)

    # 5) Compute metrics
    scores[i, 0] = metrics.r2_score(y_te, y_test_pred)
    scores[i, 1] = metrics.mean_absolute_error(y_te, y_test_pred)

scores = pd.DataFrame(scores, columns=['r2', 'mae'])

print("Mean scores")
print(scores.mean(axis=0))

Out:

Mean scores
r2     0.608784
mae    3.753313
dtype: float64

Usage 2: With sklearn pipeline and cross_validate: use ResidualizerEstimator

from mulm.residualizer import ResidualizerEstimator

residualizer = Residualizer(data=demographic_df, formula_res='site',
                            formula_full='site + age')
# Extract design matrix and pack it with X so that cross_validate slices Z and X together
Z = residualizer.get_design_mat(data=demographic_df)

# Wrap the residualizer
residualizer_wrapper = ResidualizerEstimator(residualizer)
ZX = residualizer_wrapper.pack(Z, X)

pipeline = make_pipeline(residualizer_wrapper, StandardScaler(), lr)
cv_res = cross_validate(estimator=pipeline, X=ZX, y=y, cv=cv, n_jobs=5,
                        scoring=['r2', 'neg_mean_absolute_error'])

r2 = cv_res['test_r2'].mean()
mae = np.mean(-cv_res['test_neg_mean_absolute_error'])
print("CV R2:%.4f, MAE:%.4f" % (r2, mae))
assert np.allclose(scores.mean(axis=0).values, np.array([r2, mae]))

Out:

CV R2:0.6088, MAE:3.7533
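
For intuition, here is a rough, hypothetical sketch of how Z can travel with X through a sklearn pipeline: the two are packed side by side into a single array so that cross_validate slices them together, and the transformer splits them apart again at fit/transform time. This is an assumption about the mechanism, not the actual mulm.residualizer.ResidualizerEstimator implementation; the class name and the n_z_cols parameter are invented for illustration.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PackedResidualizer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: carry the design matrix Z inside X so that
    cross-validation slices Z and X consistently (not the mulm code)."""

    def __init__(self, residualizer, n_z_cols):
        self.residualizer = residualizer
        self.n_z_cols = n_z_cols  # number of leading columns that hold Z

    @staticmethod
    def pack(Z, X):
        # Concatenate Z and X column-wise into a single array
        return np.hstack([Z, X])

    def _unpack(self, ZX):
        return ZX[:, :self.n_z_cols], ZX[:, self.n_z_cols:]

    def fit(self, ZX, y=None):
        Z, X = self._unpack(ZX)
        self.residualizer.fit(X, Z)   # same call signature as in Usage 1
        return self

    def transform(self, ZX):
        Z, X = self._unpack(ZX)
        return self.residualizer.transform(X, Z)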

Total running time of the script: ( 0 minutes 32.398 seconds)
