1.1.3. patsy.build_design_matrices¶
-
patsy.
build_design_matrices
(design_infos, data, NA_action='drop', return_type='matrix', dtype=dtype('float64'))[source]¶ Construct several design matrices from
DesignMatrixBuilder
objects.This is one of Patsy’s fundamental functions. This function and
design_matrix_builders()
together form the API to the core formula interpretation machinery.Parameters: - design_infos – A list of
DesignInfo
objects describing the design matrices to be built. - data – A dict-like object which will be used to look up data.
- NA_action – What to do with rows that contain missing values. You can
"drop"
them,"raise"
an error, or for customization, pass anNAAction
object. SeeNAAction
for details on what values count as ‘missing’ (and how to alter this). - return_type – Either
"matrix"
or"dataframe"
. See below. - dtype – The dtype of the returned matrix. Useful if you want to use single-precision or extended-precision.
This function returns either a list of
DesignMatrix
objects (forreturn_type="matrix"
) or a list ofpandas.DataFrame
objects (forreturn_type="dataframe"
). In both cases, all returned design matrices will have.design_info
attributes containing the appropriateDesignInfo
objects.Note that unlike
design_matrix_builders()
, this function takes only a simple data argument, not any kind of iterator. That’s because this function doesn’t need a global view of the data – everything that depends on the whole data set is already encapsulated in thedesign_infos
. If you are incrementally processing a large data set, simply call this function for each chunk.Index handling: This function always checks for indexes in the following places:
- If
data
is apandas.DataFrame
, its.index
attribute. - If any factors evaluate to a
pandas.Series
orpandas.DataFrame
, then their.index
attributes.
If multiple indexes are found, they must be identical (same values in the same order). If no indexes are found, then a default index is generated using
np.arange(num_rows)
. One way or another, we end up with a single index for all the data. Ifreturn_type="dataframe"
, then this index is used as the index of the returned DataFrame objects. Examining this index makes it possible to determine which rows were removed due to NAs.Determining the number of rows in design matrices: This is not as obvious as it might seem, because it’s possible to have a formula like “~ 1” that doesn’t depend on the data (it has no factors). For this formula, it’s obvious what every row in the design matrix should look like (just the value
1
); but, how many rows like this should there be? To determine the number of rows in a design matrix, this function always checks in the following places:- If
data
is apandas.DataFrame
, then its number of rows. - The number of entries in any factors present in any of the design
- matrices being built.
All these values much match. In particular, if this function is called to generate multiple design matrices at once, then they must all have the same number of rows.
New in version 0.2.0: The
NA_action
argument.- design_infos – A list of