1.1.3. patsy.build_design_matrices

patsy.build_design_matrices(design_infos, data, NA_action='drop', return_type='matrix', dtype=dtype('float64'))

Construct several design matrices from DesignInfo objects.

This is one of Patsy’s fundamental functions. This function and design_matrix_builders() together form the API to the core formula interpretation machinery.

Parameters:
  • design_infos – A list of DesignInfo objects describing the design matrices to be built.
  • data – A dict-like object which will be used to look up data.
  • NA_action – What to do with rows that contain missing values. You can "drop" them, "raise" an error, or for customization, pass an NAAction object. See NAAction for details on what values count as ‘missing’ (and how to alter this).
  • return_type – Either "matrix" or "dataframe". See below.
  • dtype – The dtype of the returned matrix. Useful if you want to use single-precision or extended-precision.
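As a rough illustration of the NA_action and dtype parameters, the sketch below builds a design matrix from data containing one NaN. The formula "x", the sample values, and the variable names are assumptions made for this example; the DesignInfo is obtained from dmatrix() purely for convenience.

```python
import numpy as np
from patsy import PatsyError, build_design_matrices, dmatrix

# Obtain a DesignInfo from a small, made-up training set.
design_info = dmatrix("x", {"x": np.array([1.0, 2.0, 3.0])}).design_info

# New data with one missing value (illustrative only).
new_data = {"x": np.array([10.0, np.nan, 30.0])}

# Default NA_action="drop": the row holding NaN is removed.
(dropped,) = build_design_matrices([design_info], new_data)
print(dropped.shape)

# NA_action="raise": the same data triggers a PatsyError instead.
try:
    build_design_matrices([design_info], new_data, NA_action="raise")
    raised = False
except PatsyError:
    raised = True
print(raised)

# dtype selects the floating-point precision of the returned matrix.
(single,) = build_design_matrices([design_info], new_data, dtype=np.float32)
print(single.dtype)
```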

This function returns either a list of DesignMatrix objects (for return_type="matrix") or a list of pandas.DataFrame objects (for return_type="dataframe"). In both cases, all returned design matrices will have .design_info attributes containing the appropriate DesignInfo objects.
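The two return types can be compared with a short sketch. It assumes pandas is installed (needed for return_type="dataframe") and uses a made-up formula "y ~ x" with invented data; the DesignInfo objects are taken from the matrices returned by dmatrices().

```python
import numpy as np
from patsy import build_design_matrices, dmatrices

# Made-up training data, used only to obtain DesignInfo objects.
train = {
    "y": np.array([1.0, 2.0, 3.0, 4.0]),
    "x": np.array([0.5, 1.5, 2.5, 3.5]),
}
y_train, X_train = dmatrices("y ~ x", train)
design_infos = [y_train.design_info, X_train.design_info]

# New data to be turned into matrices with the same design.
new = {"y": np.array([5.0, 6.0]), "x": np.array([4.5, 5.5])}

# return_type="matrix" (the default): a list of DesignMatrix objects.
y_new, X_new = build_design_matrices(design_infos, new)
print(type(X_new).__name__)
print(X_new.design_info.column_names)

# return_type="dataframe": a list of pandas.DataFrame objects.
y_df, X_df = build_design_matrices(design_infos, new, return_type="dataframe")
print(type(X_df).__name__)
print(list(X_df.columns))
```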

Note that unlike design_matrix_builders(), this function takes only a simple data argument, not any kind of iterator. That’s because this function doesn’t need a global view of the data – everything that depends on the whole data set is already encapsulated in the design_infos. If you are incrementally processing a large data set, simply call this function for each chunk.
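The chunk-by-chunk pattern might look like the sketch below. It uses incr_dbuilder(), the high-level wrapper around design_matrix_builders(), to make the global pass over the data; that choice, the formula "center(x)", and the in-memory list of chunks are assumptions for illustration. In practice each chunk would typically be read from disk.

```python
import numpy as np
from patsy import build_design_matrices, incr_dbuilder

# Two made-up chunks standing in for a data set too large to load at once.
chunks = [
    {"x": np.array([1.0, 2.0, 3.0])},
    {"x": np.array([4.0, 5.0, 6.0])},
]

def data_iter_maker():
    # Patsy may need several passes, so return a fresh iterator each time.
    return iter(chunks)

# Global pass: center() memorizes the mean of x over all chunks.
design_info = incr_dbuilder("center(x)", data_iter_maker)

# Per-chunk pass: no iterator needed, just one call per chunk.
mats = []
for chunk in chunks:
    (mat,) = build_design_matrices([design_info], chunk)
    mats.append(np.asarray(mat))

# The centering uses the overall mean (3.5), not each chunk's own mean.
print(mats[0][:, 1])
```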

Index handling: This function always checks for indexes in the following places:

  • If data is a pandas.DataFrame, its .index attribute.
  • If any factors evaluate to a pandas.Series or pandas.DataFrame, then their .index attributes.

If multiple indexes are found, they must be identical (same values in the same order). If no indexes are found, then a default index is generated using np.arange(num_rows). One way or another, we end up with a single index for all the data. If return_type="dataframe", then this index is used as the index of the returned DataFrame objects. Examining this index makes it possible to determine which rows were removed due to NAs.
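A small sketch of how the index reveals dropped rows, assuming pandas is installed. The formula "x", the string index labels, and the data values are invented for this example.

```python
import numpy as np
import pandas as pd
from patsy import build_design_matrices, dmatrix

# Obtain a DesignInfo from made-up training data.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
design_info = dmatrix("x", train).design_info

# New data with a labelled index and one missing value.
new = pd.DataFrame({"x": [10.0, np.nan, 30.0]}, index=["a", "b", "c"])

(df,) = build_design_matrices([design_info], new, return_type="dataframe")

# The surviving index labels show which rows were kept.
kept = list(df.index)
removed = [label for label in new.index if label not in kept]
print(kept)
print(removed)
```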

Determining the number of rows in design matrices: This is not as obvious as it might seem, because it’s possible to have a formula like “~ 1” that doesn’t depend on the data (it has no factors). For this formula, it’s obvious what every row in the design matrix should look like (just the value 1); but, how many rows like this should there be? To determine the number of rows in a design matrix, this function always checks in the following places:

  • If data is a pandas.DataFrame, then its number of rows.
  • The number of entries in any factors present in any of the design matrices being built.

All these values must match. In particular, if this function is called to generate multiple design matrices at once, then they must all have the same number of rows.
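The row-count rule for a factor-free formula can be seen in a brief sketch, assuming pandas is installed. The data frames are made up. The second half reflects my understanding that a plain dict gives Patsy no way to infer the row count for "~ 1", so the call is expected to fail.

```python
import pandas as pd
from patsy import PatsyError, build_design_matrices, dmatrix

# An intercept-only design: the formula "1" has no factors.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
intercept_info = dmatrix("1", train).design_info

# With a DataFrame, the number of rows comes from the data itself.
new = pd.DataFrame({"x": [4.0, 5.0, 6.0, 7.0, 8.0]})
(mat,) = build_design_matrices([intercept_info], new)
print(mat.shape)

# With a plain dict and no factors, there is nothing to count rows from.
try:
    build_design_matrices([intercept_info], {"x": [4.0, 5.0]})
    could_not_count = False
except PatsyError:
    could_not_count = True
print(could_not_count)
```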

New in version 0.2.0: The NA_action argument.