1.1.10. patsy.dmatrix

patsy.dmatrix(formula_like, data={}, eval_env=0, NA_action='drop', return_type='matrix')

Construct a single design matrix given a formula_like and data.

Parameters:
  • formula_like – An object that can be used to construct a design matrix. See below.
  • data – A dict-like object that can be used to look up variables referenced in formula_like.
  • eval_env – Either an EvalEnvironment which will be used to look up any variables referenced in formula_like that cannot be found in data, or else a depth represented as an integer which will be passed to EvalEnvironment.capture(). eval_env=0 means to use the context of the function calling dmatrix() for lookups. If calling this function from a library, you probably want eval_env=1, which means that variables should be resolved in your caller’s namespace.
  • NA_action – What to do with rows that contain missing values. You can "drop" them, "raise" an error, or for customization, pass an NAAction object. See NAAction for details on what values count as ‘missing’ (and how to alter this).
  • return_type – Either "matrix" or "dataframe". See below.
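
A minimal sketch of how NA_action and eval_env behave, assuming patsy and numpy are installed. The names build_design and caller are hypothetical helpers introduced only for this illustration; they are not part of patsy.

```python
import numpy as np
from patsy import PatsyError, dmatrix

data = {"x": [1.0, 2.0, np.nan, 4.0]}

# Default NA_action="drop": the row containing NaN is removed.
dropped = dmatrix("x", data)
print(dropped.shape)  # (3, 2): an Intercept column and an x column

# NA_action="raise": the same missing value produces an error instead.
try:
    dmatrix("x", data, NA_action="raise")
    raised = False
except PatsyError:
    raised = True


def build_design(formula, data=None):
    # Hypothetical library-style wrapper. eval_env=1 asks patsy to resolve
    # names in the namespace of whoever called build_design(), not in
    # build_design() itself.
    return dmatrix(formula, {} if data is None else data, eval_env=1)


def caller():
    z = [1.0, 2.0, 3.0]  # local variable, not present in any data dict
    return build_design("z")


from_env = caller()
print(from_env.design_info.column_names)  # ['Intercept', 'z']
```

With eval_env=0 inside build_design, the lookup of z would be made in build_design's own scope, where z is not defined.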

The formula_like can take a variety of forms. You can use any of the following:

  • (The most common option) A formula string like "x1 + x2" (for dmatrix()) or "y ~ x1 + x2" (for dmatrices()). For details see formulas.
  • A ModelDesc, which is a Python object representation of a formula. See formulas and expert-model-specification for details.
  • A DesignInfo.
  • An object that has a method called __patsy_get_model_desc__(). For details see expert-model-specification.
  • A numpy array_like (for dmatrix()) or a tuple (array_like, array_like) (for dmatrices()). These will have metadata added, representation normalized, and then be returned directly. In this case data and eval_env are ignored. There is special handling for two cases:
    • DesignMatrix objects will have their DesignInfo preserved. This allows you to set up custom column names and term information even if you aren’t using the rest of the patsy machinery.
    • pandas.DataFrame or pandas.Series objects will have their (row) indexes checked. If two are passed in, their indexes must be aligned. If return_type="dataframe", then their indexes will be preserved on the output.
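
The following sketch illustrates three of the forms listed above: a formula string, a DesignInfo taken from an earlier result, and a plain numpy array. It assumes patsy and numpy are installed; the variable names and sample values are made up for the example.

```python
import numpy as np
from patsy import dmatrix

data = {"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]}

# Formula string: patsy adds an intercept and looks up x1 and x2 in data.
from_formula = dmatrix("x1 + x2", data)
print(from_formula.design_info.column_names)  # ['Intercept', 'x1', 'x2']

# DesignInfo: reuse the design from the earlier call on new data.
new_data = {"x1": [10.0], "x2": [20.0]}
from_info = dmatrix(from_formula.design_info, new_data)

# Array input: returned as a DesignMatrix with metadata attached;
# data and eval_env are ignored and no intercept column is added.
raw = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])
from_array = dmatrix(raw)
print(from_array.shape)  # (3, 2)
```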

Regardless of the input, the return type is always either:

  • A DesignMatrix, if return_type="matrix" (the default)
  • A pandas.DataFrame, if return_type="dataframe".

The actual contents of the design matrix are identical in both cases, and in both cases a DesignInfo object will be available in a .design_info attribute on the return value. However, for return_type="dataframe", any pandas indexes on the input (either in data or directly passed through formula_like) will be preserved, which may be useful for e.g. time-series models.
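
As a rough illustration of the two return types, the sketch below builds the same design from a date-indexed DataFrame. It assumes pandas, numpy, and patsy are installed; the frame and its dates are invented for the example.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0]},
    index=pd.date_range("2020-01-01", periods=3),
)

# Default return type: a DesignMatrix (an ndarray subclass, no pandas index).
as_matrix = dmatrix("x", df)

# return_type="dataframe": same values, but the input index is kept.
as_frame = dmatrix("x", df, return_type="dataframe")

print(list(as_frame.columns))           # ['Intercept', 'x']
print(as_frame.index.equals(df.index))  # True

# Both results carry the same design metadata.
same_columns = (
    as_matrix.design_info.column_names == as_frame.design_info.column_names
)
```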

New in version 0.2.0: The NA_action argument.