7.19. The Datasets Package

statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.

7.19.1. Using Datasets from Stata

webuse(data[, baseurl, as_df]) Download and return an example dataset from Stata.

7.19.2. Using Datasets from R

The Rdatasets project gives access to the datasets available in R’s core datasets package and in many other common R packages. All of these datasets are available to statsmodels through the get_rdataset() function. The returned object exposes the actual data through its data attribute. For example:

In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")

In [3]: print(duncan_prestige.__doc__)
+----------+-------------------+
| Duncan   | R Documentation   |
+----------+-------------------+

Duncan's Occupational Prestige Data
-----------------------------------

Description
~~~~~~~~~~~

The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
prestige and other characteristics of 45 U. S. occupations in 1950.

Usage
~~~~~

::

    Duncan

Format
~~~~~~

This data frame contains the following columns:

type
    Type of occupation. A factor with the following levels: ``prof``,
    professional and managerial; ``wc``, white-collar; ``bc``,
    blue-collar.

income
    Percent of males in occupation earning $3500 or more in 1950.

education
    Percent of males in occupation in 1950 who were high-school
    graduates.

prestige
    Percent of raters in NORC study rating occupation as excellent or
    good in prestige.

Source
~~~~~~

Duncan, O. D. (1961) A socioeconomic index for all occupations. In
Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free Press
[Table VI-1].

References
~~~~~~~~~~

Fox, J. (2008) *Applied Regression Analysis and Generalized Linear
Models*, Second Edition. Sage.

Fox, J. and Weisberg, S. (2011) *An R Companion to Applied Regression*,
Second Edition, Sage.


In [4]: duncan_prestige.data.head(5)
Out[4]: 
            type  income  education  prestige
accountant  prof      62         86        82
pilot       prof      72         76        83
architect   prof      75         92        90
author      prof      55         90        76
chemist     prof      64         86        90

7.19.3. R Datasets Function Reference

get_rdataset(dataname[, package, cache]) Download and return an R dataset.
get_data_home([data_home]) Return the path of the statsmodels data dir.
clear_data_home([data_home]) Delete all the content of the data home cache.

7.19.4. Available Datasets

7.19.5. Usage

Load a dataset:

In [5]: import statsmodels.api as sm

In [6]: data = sm.datasets.longley.load()

The Dataset object follows the bunch pattern explained in the datasets proposal. The full dataset is available in the data attribute.

In [7]: data.data
Out[7]: 
rec.array([(60323.0, 83.0, 234289.0, 2356.0, 1590.0, 107608.0, 1947.0),
 (61122.0, 88.5, 259426.0, 2325.0, 1456.0, 108632.0, 1948.0),
 (60171.0, 88.2, 258054.0, 3682.0, 1616.0, 109773.0, 1949.0),
 (61187.0, 89.5, 284599.0, 3351.0, 1650.0, 110929.0, 1950.0),
 (63221.0, 96.2, 328975.0, 2099.0, 3099.0, 112075.0, 1951.0),
 (63639.0, 98.1, 346999.0, 1932.0, 3594.0, 113270.0, 1952.0),
 (64989.0, 99.0, 365385.0, 1870.0, 3547.0, 115094.0, 1953.0),
 (63761.0, 100.0, 363112.0, 3578.0, 3350.0, 116219.0, 1954.0),
 (66019.0, 101.2, 397469.0, 2904.0, 3048.0, 117388.0, 1955.0),
 (67857.0, 104.6, 419180.0, 2822.0, 2857.0, 118734.0, 1956.0),
 (68169.0, 108.4, 442769.0, 2936.0, 2798.0, 120445.0, 1957.0),
 (66513.0, 110.8, 444546.0, 4681.0, 2637.0, 121950.0, 1958.0),
 (68655.0, 112.6, 482704.0, 3813.0, 2552.0, 123366.0, 1959.0),
 (69564.0, 114.2, 502601.0, 3931.0, 2514.0, 125368.0, 1960.0),
 (69331.0, 115.7, 518173.0, 4806.0, 2572.0, 127852.0, 1961.0),
 (70551.0, 116.9, 554894.0, 4007.0, 2827.0, 130081.0, 1962.0)], 
          dtype=[('TOTEMP', '<f8'), ('GNPDEFL', '<f8'), ('GNP', '<f8'), ('UNEMP', '<f8'), ('ARMED', '<f8'), ('POP', '<f8'), ('YEAR', '<f8')])

Most datasets hold convenient representations of the data in the attributes endog and exog:

In [8]: data.endog[:5]
Out[8]: array([ 60323.,  61122.,  60171.,  61187.,  63221.])

In [9]: data.exog[:5,:]
Out[9]: 
array([[     83. ,  234289. ,    2356. ,    1590. ,  107608. ,    1947. ],
       [     88.5,  259426. ,    2325. ,    1456. ,  108632. ,    1948. ],
       [     88.2,  258054. ,    3682. ,    1616. ,  109773. ,    1949. ],
       [     89.5,  284599. ,    3351. ,    1650. ,  110929. ,    1950. ],
       [     96.2,  328975. ,    2099. ,    3099. ,  112075. ,    1951. ]])

Univariate datasets, however, do not have an exog attribute.

Variable names can be obtained by typing:

In [10]: data.endog_name
Out[10]: 'TOTEMP'

In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

If a dataset has no clear interpretation of what should be endog and exog, you can always access the data or raw_data attributes instead. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset built around a specific example. The data attribute contains a record array of the full dataset, and the raw_data attribute contains an ndarray whose column names are given by the names attribute.

In [12]: type(data.data)
Out[12]: numpy.recarray

In [13]: type(data.raw_data)
Out[13]: numpy.recarray

In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

7.19.5.1. Loading data as pandas objects

For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules provides a load_pandas function, which returns a Dataset instance with the data readily available as pandas objects:

In [15]: data = sm.datasets.longley.load_pandas()

In [16]: data.exog
Out[16]: 
    GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0      83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1      88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2      88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3      89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4      96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5      98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6      99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7     100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8     101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9     104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

In [17]: data.endog
Out[17]: 
0     60323.0
1     61122.0
2     60171.0
3     61187.0
4     63221.0
5     63639.0
6     64989.0
7     63761.0
8     66019.0
9     67857.0
10    68169.0
11    66513.0
12    68655.0
13    69564.0
14    69331.0
15    70551.0
Name: TOTEMP, dtype: float64

The full DataFrame is available in the data attribute of the Dataset object:

In [18]: data.data
Out[18]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

With pandas integration in the estimation classes, the metadata (such as variable names) is attached to model results.

7.19.5.2. Extra Information

If you want to know more about the dataset itself, you can access the following, again using the Longley dataset as an example:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']

7.19.6. Additional information

  • The idea for a datasets package was originally proposed by David Cournapeau, with updates by Skipper Seabold.
  • To add datasets, see the notes on adding a dataset.