7.19. The Datasets Package

statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.

7.19.1. Using Datasets from Stata

webuse(data[, baseurl, as_df])

Parameters:

7.19.2. Using Datasets from R

The Rdatasets project gives access to the datasets available in R’s core datasets package and in many other common R packages. All of these datasets are available to statsmodels through the get_rdataset() function. The data itself is accessible through the data attribute of the returned object. For example:

In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")

In [3]: print(duncan_prestige.__doc__)
+----------+-------------------+
| Duncan   | R Documentation   |
+----------+-------------------+

Duncan's Occupational Prestige Data
-----------------------------------

Description
~~~~~~~~~~~

The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
prestige and other characteristics of 45 U. S. occupations in 1950.

Usage
~~~~~

::

    Duncan

Format
~~~~~~

This data frame contains the following columns:

type
    Type of occupation. A factor with the following levels: ``prof``,
    professional and managerial; ``wc``, white-collar; ``bc``,
    blue-collar.

income
    Percent of males in occupation earning $3500 or more in 1950.

education
    Percent of males in occupation in 1950 who were high-school
    graduates.

prestige
    Percent of raters in NORC study rating occupation as excellent or
    good in prestige.

Source
~~~~~~

Duncan, O. D. (1961) A socioeconomic index for all occupations. In
Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free Press
[Table VI-1].

References
~~~~~~~~~~

Fox, J. (2008) *Applied Regression Analysis and Generalized Linear
Models*, Second Edition. Sage.

Fox, J. and Weisberg, S. (2011) *An R Companion to Applied Regression*,
Second Edition, Sage.


In [4]: duncan_prestige.data.head(5)
Out[4]: 
            type  income  education  prestige
accountant  prof      62         86        82
pilot       prof      72         76        83
architect   prof      75         92        90
author      prof      55         90        76
chemist     prof      64         86        90

7.19.3. R Datasets Function Reference

`get_rdataset`(dataname[, package, cache])	Download and return an R dataset.
`get_data_home`([data_home])	Return the path of the statsmodels data dir.
`clear_data_home`([data_home])	Delete all the content of the data home cache.

7.19.4. Available Datasets

7.19.5. Usage

Load a dataset:

In [5]: import statsmodels.api as sm

In [6]: data = sm.datasets.longley.load()

The Dataset object follows the bunch pattern explained in the datasets proposal. The full dataset is available in the data attribute:

In [7]: data.data
Out[7]: 
rec.array([(60323.0, 83.0, 234289.0, 2356.0, 1590.0, 107608.0, 1947.0),
 (61122.0, 88.5, 259426.0, 2325.0, 1456.0, 108632.0, 1948.0),
 (60171.0, 88.2, 258054.0, 3682.0, 1616.0, 109773.0, 1949.0),
 (61187.0, 89.5, 284599.0, 3351.0, 1650.0, 110929.0, 1950.0),
 (63221.0, 96.2, 328975.0, 2099.0, 3099.0, 112075.0, 1951.0),
 (63639.0, 98.1, 346999.0, 1932.0, 3594.0, 113270.0, 1952.0),
 (64989.0, 99.0, 365385.0, 1870.0, 3547.0, 115094.0, 1953.0),
 (63761.0, 100.0, 363112.0, 3578.0, 3350.0, 116219.0, 1954.0),
 (66019.0, 101.2, 397469.0, 2904.0, 3048.0, 117388.0, 1955.0),
 (67857.0, 104.6, 419180.0, 2822.0, 2857.0, 118734.0, 1956.0),
 (68169.0, 108.4, 442769.0, 2936.0, 2798.0, 120445.0, 1957.0),
 (66513.0, 110.8, 444546.0, 4681.0, 2637.0, 121950.0, 1958.0),
 (68655.0, 112.6, 482704.0, 3813.0, 2552.0, 123366.0, 1959.0),
 (69564.0, 114.2, 502601.0, 3931.0, 2514.0, 125368.0, 1960.0),
 (69331.0, 115.7, 518173.0, 4806.0, 2572.0, 127852.0, 1961.0),
 (70551.0, 116.9, 554894.0, 4007.0, 2827.0, 130081.0, 1962.0)], 
          dtype=[('TOTEMP', '<f8'), ('GNPDEFL', '<f8'), ('GNP', '<f8'), ('UNEMP', '<f8'), ('ARMED', '<f8'), ('POP', '<f8'), ('YEAR', '<f8')])
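
The bunch pattern mentioned above is commonly written as a dictionary whose entries are also reachable as attributes. The class below is a minimal sketch of that general recipe, not the statsmodels implementation; the name Bunch and the toy values are chosen only for illustration.

```python
class Bunch(dict):
    """Dictionary whose entries can also be read and set as attributes."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Share one namespace between the dict and the instance, so
        # bunch.key and bunch["key"] always refer to the same value.
        self.__dict__ = self


# Toy values only; a real Dataset is built by the dataset module itself.
toy = Bunch(
    data=[(60323.0, 83.0), (61122.0, 88.5)],
    names=["TOTEMP", "GNPDEFL"],
)
toy.endog_name = "TOTEMP"  # attribute assignment also adds a dict key

print(toy.names)
print(toy["endog_name"])
print(sorted(toy.keys()))
```

The appeal of the pattern is that new fields can be attached to a dataset without changing the class, while attribute access keeps interactive use short.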

Most datasets hold convenient representations of the data in the attributes endog and exog:

In [8]: data.endog[:5]
Out[8]: array([ 60323.,  61122.,  60171.,  61187.,  63221.])

In [9]: data.exog[:5,:]
Out[9]: 
array([[     83. ,  234289. ,    2356. ,    1590. ,  107608. ,    1947. ],
       [     88.5,  259426. ,    2325. ,    1456. ,  108632. ,    1948. ],
       [     88.2,  258054. ,    3682. ,    1616. ,  109773. ,    1949. ],
       [     89.5,  284599. ,    3351. ,    1650. ,  110929. ,    1950. ],
       [     96.2,  328975. ,    2099. ,    3099. ,  112075. ,    1951. ]])

Univariate datasets, however, do not have an exog attribute.

Variable names can be obtained by typing:

In [10]: data.endog_name
Out[10]: 'TOTEMP'

In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.

In [12]: type(data.data)
Out[12]: numpy.recarray

In [13]: type(data.raw_data)
Out[13]: numpy.recarray

In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
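
To make the difference between a record array and a plain array with separate column names concrete, here is a small NumPy-only sketch. It uses the first three Longley rows and three of the columns shown above; the variable names are illustrative and nothing here calls into statsmodels.

```python
import numpy as np

# First three Longley rows, restricted to three columns for brevity.
names = ["TOTEMP", "GNPDEFL", "YEAR"]
rows = [
    (60323.0, 83.0, 1947.0),
    (61122.0, 88.5, 1948.0),
    (60171.0, 88.2, 1949.0),
]

# A record array stores the column names in its dtype.
rec = np.rec.fromrecords(rows, names=names)
totemp = rec.TOTEMP   # attribute access to a column
years = rec["YEAR"]   # dictionary-style access also works

# A plain 2-D array drops the names, so they have to be kept separately.
plain = np.column_stack([rec[name] for name in rec.dtype.names])

print(rec.dtype.names)
print(plain.shape)
```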

7.19.5.1. Loading data as pandas objects

Many users may prefer to get the datasets as pandas DataFrame or Series objects. Each of the dataset modules provides a load_pandas function, which returns a Dataset instance with the data readily available as pandas objects:

In [15]: data = sm.datasets.longley.load_pandas()

In [16]: data.exog
Out[16]: 
    GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0      83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1      88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2      88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3      89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4      96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5      98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6      99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7     100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8     101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9     104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

In [17]: data.endog
Out[17]: 
0     60323.0
1     61122.0
2     60171.0
3     61187.0
4     63221.0
5     63639.0
6     64989.0
7     63761.0
8     66019.0
9     67857.0
10    68169.0
11    66513.0
12    68655.0
13    69564.0
14    69331.0
15    70551.0
Name: TOTEMP, dtype: float64

The full DataFrame is available in the data attribute of the Dataset object:

In [18]: data.data
Out[18]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

With pandas integration in the estimation classes, the metadata (such as variable names) is attached to model results.

7.19.5.2. Extra Information

If you want to know more about the dataset itself, you can access the following module-level attributes, again using the Longley dataset as an example:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']

7.19.6. Additional information

The idea for a datasets package was originally proposed by David Cournapeau, with later updates by Skipper Seabold.
To add datasets, see the notes on adding a dataset.