Massive univariate linear model.

Residualizer

Credit: E Duchesnay

Residualization of data Y on some variables, possibly adjusted for other variables.

Suppose we have three variables:

- site: contains a site effect
- age: some uniform value plus some site effect
- y = -0.1 * age + site + eps

The goal is to remove the site effect while preserving the age effect.
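To see why site matters, here is a short worked derivation of the slope that a pooled regression of y on age would recover in expectation. It assumes the simulation used in this example: two balanced sites coded -1 and 1, age = u + 5 * site with u uniform on [10, 40], and noise eps independent of both. The symbol u is introduced here for illustration only.

```latex
\begin{aligned}
\operatorname{Var}(\mathrm{site}) &= 1, \qquad
\operatorname{Var}(u) = \frac{(40 - 10)^2}{12} = 75, \\
\operatorname{Var}(\mathrm{age}) &= \operatorname{Var}(u) + 5^2 \operatorname{Var}(\mathrm{site}) = 75 + 25 = 100, \\
\operatorname{Cov}(\mathrm{age}, \mathrm{site}) &= 5 \operatorname{Var}(\mathrm{site}) = 5, \\
\operatorname{Cov}(y, \mathrm{age}) &= -0.1 \operatorname{Var}(\mathrm{age}) + \operatorname{Cov}(\mathrm{site}, \mathrm{age}) = -10 + 5 = -5, \\
\text{pooled slope} &= \frac{\operatorname{Cov}(y, \mathrm{age})}{\operatorname{Var}(\mathrm{age})} = \frac{-5}{100} = -0.05 .
\end{aligned}
```

Under these assumptions the pooled slope is about half the true age coefficient of -0.1, because the site effect on y is positively correlated with age.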

Import

import numpy as np
import pandas as pd
import scipy.stats as stats
from mulm.residualizer import Residualizer
import seaborn as sns

np.random.seed(1)

Dataset with a site effect on age. Before residualization on site, the association between y and age is affected by site.

site = np.array([-1] * 50 + [1] * 50)
age = np.random.uniform(10, 40, size=100) + 5 * site
y = -0.1 * age + site + np.random.normal(size=100)
data = pd.DataFrame(dict(y=y, age=age, site=site.astype(object)))

sns.lmplot(x="age", y="y", hue="site", data=data)

Simple residualization on site. This is better, but removing the site effect also removes part of the age effect, because age itself depends on site.

res_spl = Residualizer(data=data, formula_res="site")
X = res_spl.get_design_mat(data)
print("Design mat contains intercept and site:")
print(X[:5, :])
yres = res_spl.fit_transform(y[:, None], X)
data["yres"] = yres
sns.lmplot(x="age", y="yres", hue="site", data=data)

Out:

Design mat contains intercept and site:
[[1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]]
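As a rough illustration of what simple residualization amounts to, the sketch below regresses y on an intercept and a dummy-coded site column with plain NumPy and keeps the residuals. It is not the Residualizer implementation; the names X_site, beta, yres_np and slope_res are illustrative, and the dummy coding (0 for site -1, 1 for site 1) is an assumption taken from the design matrix printed above.

```python
import numpy as np

# Same simulated data as in the example (same seed and draw order).
np.random.seed(1)
site = np.array([-1] * 50 + [1] * 50)
age = np.random.uniform(10, 40, size=100) + 5 * site
y = -0.1 * age + site + np.random.normal(size=100)

# Design matrix: intercept plus a dummy-coded site column
# (assumed coding: 0 for site -1, 1 for site 1).
X_site = np.column_stack([np.ones(100), (site == 1).astype(float)])

# Ordinary least squares fit of y on intercept + site.
beta, *_ = np.linalg.lstsq(X_site, y, rcond=None)

# Residuals: y minus the fitted per-site means.
yres_np = y - X_site @ beta

# Slope of the residualized y on age (np.polyfit returns slope first).
slope_res = np.polyfit(age, yres_np, 1)[0]
print("slope of residualized y on age:", slope_res)
```

Under the simulation model the expected slope here is about -0.075 rather than -0.1: centering y within each site keeps only the uniform part of age (variance 75), while the regressor still carries the full variance of age (100).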

Residualization on site adjusted for age yields a stronger correlation (in absolute value) and a lower standard error than simple residualization.

res_adj = Residualizer(data, formula_res="site", formula_full="age + site")
X = res_adj.get_design_mat(data)
print("Design mat contains intercept, site and age")
print(X[:5, :])
print("Residualisation contrast (intercept, site):")
print(res_adj.contrast_res)

yadj = res_adj.fit_transform(y[:, None], X)

lm_res = stats.linregress(age, yres.ravel())
lm_adj = stats.linregress(age, yadj.ravel())

np.allclose((lm_res.slope, lm_res.rvalue, lm_res.stderr),
            (-0.079187578, -0.623733003, 0.0100242219))

np.allclose((lm_adj.slope, lm_adj.rvalue, lm_adj.stderr),
            (-0.110779913, -0.7909219758, 0.00865778640))

data["yadj"] = yadj
sns.lmplot(x="age", y="yadj", hue="site", data=data)

Out:

Design mat contains intercept, site and age
[[ 1.          0.         17.51066014]
 [ 1.          0.         26.6097348 ]
 [ 1.          0.          5.00343124]
 [ 1.          0.         14.06997718]
 [ 1.          0.          9.40267672]]
Residualisation contrast (intercept, site):
[ True  True False]
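The printed contrast suggests that the adjusted residualization fits the full model (intercept, site, age) and then subtracts only the intercept and site contributions. The NumPy sketch below follows that reading; it is an assumption about the procedure, not the library's code, and the names X_full, beta_full, yadj_np and slope_adj are illustrative.

```python
import numpy as np

# Same simulated data as in the example (same seed and draw order).
np.random.seed(1)
site = np.array([-1] * 50 + [1] * 50)
age = np.random.uniform(10, 40, size=100) + 5 * site
y = -0.1 * age + site + np.random.normal(size=100)

# Full design matrix: intercept, dummy-coded site, age
# (column order assumed from the printed design matrix).
X_full = np.column_stack([np.ones(100), (site == 1).astype(float), age])

# Columns to residualize on: intercept and site, but not age.
contrast_res = np.array([True, True, False])

# Fit the full model so the site coefficient is estimated jointly with age.
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Remove only the intercept and site contributions; the age part stays in y.
yadj_np = y - X_full[:, contrast_res] @ beta_full[contrast_res]

# Slope of the adjusted y on age (np.polyfit returns slope first).
slope_adj = np.polyfit(age, yadj_np, 1)[0]
print("age coefficient in full model:", beta_full[2])
print("slope of adjusted y on age:  ", slope_adj)
```

In this sketch the slope of the adjusted y on age coincides with the age coefficient of the full model, since the full-model residuals are orthogonal to both the intercept and age. This is why adjusting for age preserves the age effect while still removing the site effect.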


© 2021, pylearn-mulm developers.