10.12 Stata Format
New in version 0.12.0.
10.12.1 Writing to Stata format
The method to_stata()
will write a DataFrame
into a .dta file. The format version of this file is always 115 (Stata 12).
In [1]: df = pd.DataFrame(randn(10, 2), columns=list('AB'))
In [2]: df.to_stata('stata.dta')
Stata data files have limited data type support; only strings with
244 or fewer characters, int8
, int16
, int32
, float32
and float64
can be stored in .dta
files. Additionally,
Stata reserves certain values to represent missing data. Exporting a
non-missing value that is outside of the permitted range in Stata for
a particular data type will retype the variable to the next larger
size. For example, int8
values are restricted to lie between -127
and 100 in Stata, and so variables with values above 100 will trigger
a conversion to int16
. nan
values in floating points data
types are stored as the basic missing data type (.
in Stata).
Note
It is not possible to export missing data values for integer data types.
The Stata writer gracefully handles other data types including int64
,
bool
, uint8
, uint16
, uint32
by casting to
the smallest supported type that can represent the data. For example, data
with a type of uint8
will be cast to int8
if all values are less than
100 (the upper bound for non-missing int8
data in Stata), or, if values are
outside of this range, the variable is cast to int16
.
Warning
Conversion from int64
to float64
may result in a loss of precision
if int64
values are larger than 2**53.
Warning
StataWriter
and
to_stata()
only support fixed width
strings containing up to 244 characters, a limitation imposed by the version
115 dta file format. Attempting to write Stata dta files with strings
longer than 244 characters raises a ValueError
.
10.12.2 Reading from Stata format
The top-level function read_stata
will read a dta file and return
either a DataFrame or a StataReader
that can
be used to read the file incrementally.
In [3]: pd.read_stata('stata.dta')
Out[3]:
index A B
0 0 0.4691 -0.2829
1 1 -1.5091 -1.1356
2 2 1.2121 -0.1732
3 3 0.1192 -1.0442
4 4 -0.8618 -2.1046
5 5 -0.4949 1.0718
6 6 0.7216 -0.7068
7 7 -1.0396 0.2719
8 8 -0.4250 0.5670
9 9 0.2762 -1.0874
New in version 0.16.0.
Specifying a chunksize
yields a
StataReader
instance that can be used to
read chunksize
lines from the file at a time. The StataReader
object can be used as an iterator.
In [4]: reader = pd.read_stata('stata.dta', chunksize=3)
In [5]: for df in reader:
...: print(df.shape)
...:
(3, 3)
(3, 3)
(3, 3)
(1, 3)
For more fine-grained control, use iterator=True
and specify
chunksize
with each call to
read()
.
In [6]: reader = pd.read_stata('stata.dta', iterator=True)
In [7]: chunk1 = reader.read(5)
In [8]: chunk2 = reader.read(5)
Currently the index
is retrieved as a column.
The parameter convert_categoricals
indicates whether value labels should be
read and used to create a Categorical
variable from them. Value labels can
also be retrieved by the function value_labels
, which requires read()
to be called before use.
The parameter convert_missing
indicates whether missing value
representations in Stata should be preserved. If False
(the default),
missing values are represented as np.nan
. If True
, missing values are
represented using StataMissingValue
objects, and columns containing missing
values will have object
data type.
Note
read_stata()
and
StataReader
support .dta formats 113-115
(Stata 10-12), 117 (Stata 13), and 118 (Stata 14).
Note
Setting preserve_dtypes=False
will upcast to the standard pandas data types:
int64
for all integer types and float64
for floating point data. By default,
the Stata data types are preserved when importing.
10.12.2.1 Categorical Data
New in version 0.15.2.
Categorical
data can be exported to Stata data files as value labeled data.
The exported data consists of the underlying category codes as integer data values
and the categories as value labels. Stata does not have an explicit equivalent
to a Categorical
and information about whether the variable is ordered
is lost when exporting.
Warning
Stata only supports string value labels, and so str
is called on the
categories when exporting data. Exporting Categorical
variables with
non-string categories produces a warning, and can result a loss of
information if the str
representations of the categories are not unique.
Labeled data can similarly be imported from Stata data files as Categorical
variables using the keyword argument convert_categoricals
(True
by default).
The keyword argument order_categoricals
(True
by default) determines
whether imported Categorical
variables are ordered.
Note
When importing categorical data, the values of the variables in the Stata
data file are not preserved since Categorical
variables always
use integer data types between -1
and n-1
where n
is the number
of categories. If the original values in the Stata data file are required,
these can be imported by setting convert_categoricals=False
, which will
import original data (but not the variable labels). The original values can
be matched to the imported categorical data since there is a simple mapping
between the original Stata data values and the category codes of imported
Categorical variables: missing values are assigned code -1
, and the
smallest original value is assigned 0
, the second smallest is assigned
1
and so on until the largest original value is assigned the code n-1
.
Note
Stata supports partially labeled series. These series have value labels for
some but not all data values. Importing a partially labeled series will produce
a Categorical
with string categories for the values that are labeled and
numeric categories for values with no label.