5.9.1.2. statsmodels.sandbox.multilinear.multiOLS
statsmodels.sandbox.multilinear.multiOLS(model, dataframe, column_list=None, method='fdr_bh', alpha=0.05, subset=None, model_type=<class 'statsmodels.regression.linear_model.OLS'>, **kwargs)
Apply a linear model to several endogenous variables on a dataframe.
Takes a linear model definition given as a formula and a dataframe that serves as the environment of the model, and fits the model to a subset (or all) of the columns of the dataframe, using each column in turn as the endogenous variable. Returns a dataframe containing part of the information from each linear model summary.
Parameters: model : string
formula description of the model. Only the regressor side is given (for example 'GNP + 1'); each analyzed column supplies the endogenous variable.
dataframe : pandas.DataFrame
dataframe where the model will be evaluated
column_list : list of strings, optional
Names of the columns to analyze with the model. If None (default), the model is applied to all eligible columns (numerical type and not in the model definition).
model_type : model class, optional
The type of model to be used. The default is the linear model. Can be any linear model (OLS, WLS, GLS, etc.).
method : string, optional
the method used to perform the p-value correction for multiple testing. The default is Benjamini/Hochberg. Available methods are:
- bonferroni : one-step correction
- sidak : one-step correction
- holm-sidak
- holm
- simes-hochberg
- hommel
- fdr_bh : Benjamini/Hochberg
- fdr_by : Benjamini/Yekutieli
alpha : float, optional
the significance level used for the p-value correction (default 0.05)
subset : boolean array, optional
the selected rows to be used in the regression
All other keyword arguments are passed on to the model creation.
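The effect of the method and alpha parameters can be illustrated by calling the correction routine from statsmodels.stats.multitest directly. The following is a minimal sketch, not the internals of multiOLS: the raw p-values are made up for illustration, and the assumption is that multiOLS applies the same kind of correction to the p-values it collects.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values, one per tested column (made up for illustration).
raw_pvals = np.array([0.001, 0.012, 0.040, 0.200, 0.700])

# Benjamini/Hochberg correction, the default `method` of multiOLS.
reject_bh, adj_bh, _, _ = multipletests(raw_pvals, alpha=0.05, method='fdr_bh')

# Bonferroni correction for comparison: each p-value is multiplied by the
# number of tests and clipped at 1.
reject_bonf, adj_bonf, _, _ = multipletests(raw_pvals, alpha=0.05,
                                            method='bonferroni')

print(adj_bh)
print(adj_bonf)
```

Bonferroni controls the family-wise error rate and is the more conservative choice, while the Benjamini/Hochberg adjustment controls the false discovery rate and never yields larger adjusted p-values than Bonferroni.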
Returns: summary : pandas.DataFrame
a dataframe containing an extract from the summary of the model obtained for each column. It gives the overall F test result and p-value of the model, and the regression value and standard deviation for each of the regressors. The DataFrame has a hierarchical column structure, divided as:
- params: contains the parameters resulting from the models. Has an additional column named _f_test containing the result of the F test.
- pval: the p-value results of the models. Has the _f_test column for the significance of the whole test.
- adj_pval: the corrected p-values via the multitest function.
- std: uncertainties of the model parameters
- statistics: contains the r squared statistics and the adjusted r squared.
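A hierarchical column structure of this kind can be navigated with ordinary pandas indexing. The sketch below builds a small stand-in table by hand rather than calling multiOLS: the row labels (col_a, col_b), all numeric values, and the second-level names under statistics (r2, adj_r2) are hypothetical placeholders, so only the indexing pattern should be taken from it.

```python
import pandas as pd

# Hypothetical two-level column index mimicking the groups described above.
columns = pd.MultiIndex.from_tuples([
    ('params', 'GNP'), ('params', '_f_test'),
    ('pval', 'GNP'), ('pval', '_f_test'),
    ('adj_pval', 'GNP'), ('adj_pval', '_f_test'),
    ('std', 'GNP'),
    ('statistics', 'r2'), ('statistics', 'adj_r2'),  # placeholder names
])

# Made-up values: one row per analyzed column of the input dataframe.
values = [
    [1.5, 30.0, 0.001, 0.001, 0.002, 0.002, 0.3, 0.70, 0.68],
    [0.1, 0.5, 0.600, 0.600, 0.600, 0.600, 0.4, 0.02, -0.05],
]
summary = pd.DataFrame(values, index=['col_a', 'col_b'], columns=columns)

# Selecting a top-level group returns a sub-dataframe with its columns.
adjusted = summary['adj_pval']

# A (group, name) tuple selects a single column as a Series.
f_test_adj = summary[('adj_pval', '_f_test')]

# Keep only the rows whose overall test stays significant after correction.
significant = summary[f_test_adj < 0.05]
print(significant.index.tolist())
```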
See also
statsmodels.stats.multitest
- contains several functions to perform the multiple p-value correction
Notes
The main application of this function is in systems biology, where the same linear model is tested against a large number of different variables, such as the expression levels of several genes.
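The repeated fit described here can be sketched by hand with the formula interface. This is an illustration of the idea, not the implementation of multiOLS: the data are synthetic, and the column names (treatment, gene_a, gene_b) and the keys of the result table are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: one regressor and two hypothetical response columns.
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({'treatment': rng.normal(size=n)})
df['gene_a'] = 2.0 * df['treatment'] + rng.normal(scale=0.5, size=n)
df['gene_b'] = rng.normal(size=n)  # unrelated to the regressor

# Fit the same right-hand side against each response column in turn
# and keep a few entries of every fit.
rows = {}
for col in ['gene_a', 'gene_b']:
    fit = smf.ols(f'{col} ~ treatment + 1', data=df).fit()
    rows[col] = {
        'coef': fit.params['treatment'],
        'pval': fit.pvalues['treatment'],
        'f_pvalue': fit.f_pvalue,
        'r2': fit.rsquared,
    }

# One row per response column, one column per collected quantity.
table = pd.DataFrame(rows).T
print(table)
```

With many response columns, the p-values collected this way are the ones that would need a multiple-testing correction before being interpreted.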
Examples
Using the longley data as dataframe example:
>>> import statsmodels.api as sm
>>> from statsmodels.sandbox.multilinear import multiOLS
>>> data = sm.datasets.longley.load_pandas()
>>> df = data.exog
>>> df['TOTEMP'] = data.endog
This fits the specified linear model to all the other columns of the dataframe:
>>> multiOLS('GNP + 1', df)
This selects only a certain subset of the columns:
>>> multiOLS('GNP + 0', df, ['GNPDEFL', 'TOTEMP', 'POP'])
It is also possible to specify a transformation on the target column, conforming to the patsy formula specification:
>>> multiOLS('GNP + 0', df, ['I(GNPDEFL**2)', 'center(TOTEMP)'])
It is possible to specify the subset of the dataframe on which to perform the analysis:
>>> multiOLS('GNP + 1', df, subset=df.GNPDEFL > 90)
Even a single column name can be given without enclosing it in a list:
>>> multiOLS('GNP + 0', df, 'GNPDEFL')