5.9.1.2. statsmodels.sandbox.multilinear.multiOLS

statsmodels.sandbox.multilinear.multiOLS(model, dataframe, column_list=None, method='fdr_bh', alpha=0.05, subset=None, model_type=<class 'statsmodels.regression.linear_model.OLS'>, **kwargs)

Apply a linear model to several endogenous variables of a dataframe.

Take a linear model definition via formula and a dataframe that will be the environment of the model, and apply the linear model to a subset (or all) of the columns of the dataframe. It will return a dataframe with part of the information from the linear model summary.

Parameters:

model : string

formula description of the model

dataframe : pandas.DataFrame

dataframe where the model will be evaluated

column_list : list of strings, optional

Names of the columns to analyze with the model. If None (default), the function is applied to all eligible columns (numerical type and not in the model definition).

model_type : model class, optional

The type of model to be used. The default is the linear model. Can be any linear model (OLS, WLS, GLS, etc.).

method : string, optional

The method used to perform the p-value correction for multiple testing. The default is Benjamini/Hochberg; the available methods are:

  • bonferroni : one-step correction
  • sidak : one-step correction
  • holm-sidak :
  • holm :
  • simes-hochberg :
  • hommel :
  • fdr_bh : Benjamini/Hochberg
  • fdr_by : Benjamini/Yekutieli

alpha : float, optional

The significance level used for the p-value correction (default 0.05).

subset : boolean array, optional

The selected rows to be used in the regression. If None (default), all rows are used.

All other parameters are passed on to the model constructor.
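To make the default `fdr_bh` option more concrete, the following is a minimal pure-Python sketch of the Benjamini/Hochberg adjustment of a list of p-values. The helper name `benjamini_hochberg` is hypothetical and is not part of statsmodels; the library's own implementation lives in statsmodels.stats.multitest and should be preferred in practice.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini/Hochberg adjusted p-values, in the input order.

    Illustrative helper only; not the statsmodels implementation.
    """
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # enforcing monotonicity with a running minimum (capped at 1).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

An adjusted p-value below `alpha` then corresponds to a rejection at that false discovery rate.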

Returns:

summary : pandas.DataFrame

A dataframe containing an extract from the summary of the model obtained for each column. It gives the overall F-test result and p-value of each model, and the estimated value and standard deviation for each of the regressors. The DataFrame has a hierarchical column structure, divided as:

  • params: contains the parameters resulting from the models. Has an additional column named _f_test containing the result of the F test.
  • pval: the p-value results of the models. Has the _f_test column for the significance of the whole test.
  • adj_pval: the corrected p-values via the multitest function.
  • std: uncertainties of the model parameters.
  • statistics: contains the R-squared statistic and the adjusted R-squared.
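As a sketch of how a hierarchical column structure like this can be navigated, the snippet below builds a small stand-in frame by hand and selects from it. The numbers are made-up placeholders, not real multiOLS output, and the top-level labels simply follow the list above; the exact labels in a real result may differ between versions, so it is worth inspecting `summary.columns` first.

```python
import pandas as pd

# Hand-built stand-in for a result frame: two top-level groups, each with
# one regressor column and the extra _f_test column. Values are placeholders.
columns = pd.MultiIndex.from_tuples([
    ("params", "GNP"), ("params", "_f_test"),
    ("pval", "GNP"), ("pval", "_f_test"),
])
summary = pd.DataFrame(
    [[0.5, 120.0, 0.001, 0.001],
     [0.1, 3.0, 0.2, 0.2]],
    index=["TOTEMP", "POP"],
    columns=columns,
)

# List the top-level groups actually present in the frame.
top_level = list(summary.columns.get_level_values(0).unique())

# A (group, column) tuple selects a single column as a Series.
f_pvalues = summary[("pval", "_f_test")]

# A top-level label alone selects the whole group as a DataFrame.
params = summary["params"]
```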

See also

statsmodels.stats.multitest
contains several functions to perform the multiple p-value correction

Notes

The main application of this function is in systems biology, where a linear model is tested against a large number of different parameters, such as the expression levels of several genes.

Examples

Using the Longley data as an example dataframe:

>>> import statsmodels.api as sm
>>> from statsmodels.sandbox.multilinear import multiOLS
>>> data = sm.datasets.longley.load_pandas()
>>> df = data.exog
>>> df['TOTEMP'] = data.endog

This performs the specified linear model on all the other columns of the dataframe:

>>> multiOLS('GNP + 1', df)

This selects only a certain subset of the columns:

>>> multiOLS('GNP + 0', df, ['GNPDEFL', 'TOTEMP', 'POP'])

It is also possible to specify a transformation on the target column, conforming to the patsy formula specification:

>>> multiOLS('GNP + 0', df, ['I(GNPDEFL**2)', 'center(TOTEMP)'])

It is possible to specify the subset of the dataframe on which to perform the analysis:

>>> multiOLS('GNP + 1', df, subset=df.GNPDEFL > 90)

Even a single column name can be given without enclosing it in a list:

>>> multiOLS('GNP + 0', df, 'GNPDEFL')
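When column_list is left as None, the eligible columns are described as those of numerical type that do not appear in the model definition. A rough sketch of that rule is given below; `eligible_columns` is a hypothetical helper written for illustration, and the plain substring check against the formula is a simplifying assumption rather than a statement of how the library parses it.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def eligible_columns(formula, frame):
    """Return the numeric columns whose names do not occur in the formula.

    Illustrative helper only; the name check is a plain substring test.
    """
    return [name for name in frame.columns
            if is_numeric_dtype(frame[name]) and name not in formula]


# Toy frame: one regressor, one numeric target, one non-numeric column.
frame = pd.DataFrame({
    "GNP": [1.0, 2.0, 3.0],
    "POP": [10, 11, 12],
    "REGION": ["a", "b", "c"],
})
selected = eligible_columns("GNP + 1", frame)
```

With a substring test, a column whose name happens to be contained in the formula text would also be skipped, so passing column_list explicitly is the safer choice when names overlap.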