10.3 JSON
Read and write JSON format files and strings.
10.3.1 Writing JSON
A Series or DataFrame can be converted to a valid JSON string. Use to_json with optional parameters:
- path_or_buf : the pathname or buffer to write the output. This can be None, in which case a JSON string is returned.
- orient :
  - Series:
    - default is index
    - allowed values are {split, records, index}
  - DataFrame:
    - default is columns
    - allowed values are {split, records, index, columns, values}
  The format of the JSON string:
  - split : dict like {index -> [index], columns -> [columns], data -> [values]}
  - records : list like [{column -> value}, ... , {column -> value}]
  - index : dict like {index -> {column -> value}}
  - columns : dict like {column -> {index -> value}}
  - values : just the values array
- date_format : string, type of date conversion; ‘epoch’ for timestamp, ‘iso’ for ISO8601.
- double_precision : the number of decimal places to use when encoding floating point values, default 10.
- force_ascii : force encoded string to be ASCII, default True.
- date_unit : the time unit to encode to; governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’ or ‘ns’ for seconds, milliseconds, microseconds and nanoseconds respectively. Default ‘ms’.
- default_handler : the handler to call if an object cannot otherwise be converted to a suitable format for JSON. Takes a single argument (the object to convert) and returns a serializable object.
- lines : if using the records orient, write each record on its own line as JSON.
Note
NaN's, NaT's and None will be converted to null, and datetime objects will be converted based on the date_format and date_unit parameters.
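For instance, a minimal sketch of the null conversion (the expected output is shown as a comment):
# NaN and None both encode as JSON null
pd.Series([1.0, None, float('nan')]).to_json()
# -> '{"0":1.0,"1":null,"2":null}'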
In [1]: dfj = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
In [2]: json = dfj.to_json()
In [3]: json
Out[3]: '{"A":{"0":0.4691122999,"1":-1.5090585032,"2":1.212112025,"3":0.1192087113,"4":-0.8618489633},"B":{"0":-0.2828633443,"1":-1.135632371,"2":-0.1732146491,"3":-1.0442359663,"4":-2.1045692189}}'
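The encoding parameters above combine with this call in the obvious way; for example, a hedged sketch of double_precision:
# Round floats to 3 decimal places when encoding (a minimal sketch)
pd.Series([1.23456789]).to_json(double_precision=3)
# -> '{"0":1.235}'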
10.3.1.1 Orient Options
There are a number of different options for the format of the resulting JSON file / string. Consider the following DataFrame and Series:
In [4]: dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
...: columns=list('ABC'), index=list('xyz'))
...:
In [5]: dfjo
Out[5]:
A B C
x 1 4 7
y 2 5 8
z 3 6 9
In [6]: sjo = pd.Series(dict(x=15, y=16, z=17), name='D')
In [7]: sjo
Out[7]:
x 15
y 16
z 17
Name: D, dtype: int64
Column oriented (the default for DataFrame) serializes the data as nested JSON objects with column labels acting as the primary index:
In [8]: dfjo.to_json(orient="columns")
Out[8]: '{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'
# Not available for Series
Index oriented (the default for Series) is similar to column oriented, but the index labels are now primary:
In [9]: dfjo.to_json(orient="index")
Out[9]: '{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'
In [10]: sjo.to_json(orient="index")
Out[10]: '{"x":15,"y":16,"z":17}'
Record oriented serializes the data to a JSON array of column -> value records; index labels are not included. This is useful for passing DataFrame data to plotting libraries, for example the JavaScript library d3.js:
In [11]: dfjo.to_json(orient="records")
Out[11]: '[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'
In [12]: sjo.to_json(orient="records")
Out[12]: '[15,16,17]'
Value oriented is a bare-bones option which serializes to nested JSON arrays of values only; column and index labels are not included:
In [13]: dfjo.to_json(orient="values")
Out[13]: '[[1,4,7],[2,5,8],[3,6,9]]'
# Not available for Series
Split oriented serializes to a JSON object containing separate entries for values, index and columns. Name is also included for Series:
In [14]: dfjo.to_json(orient="split")
Out[14]: '{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,9]]}'
In [15]: sjo.to_json(orient="split")
Out[15]: '{"name":"D","index":["x","y","z"],"data":[15,16,17]}'
Note
Any orient option that encodes to a JSON object will not preserve the ordering of index and column labels during round-trip serialization. If you wish to preserve label ordering use the split option as it uses ordered containers.
10.3.1.2 Date Handling
Writing in ISO date format
In [16]: dfd = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
In [17]: dfd['date'] = pd.Timestamp('20130101')
In [18]: dfd = dfd.sort_index(axis=1, ascending=False)
In [19]: json = dfd.to_json(date_format='iso')
In [20]: json
Out[20]: '{"date":{"0":"2013-01-01T00:00:00.000Z","1":"2013-01-01T00:00:00.000Z","2":"2013-01-01T00:00:00.000Z","3":"2013-01-01T00:00:00.000Z","4":"2013-01-01T00:00:00.000Z"},"B":{"0":1.071803807,"1":-0.7067711336,"2":0.2718598855,"3":0.5670203498,"4":-1.0874006913},"A":{"0":-0.4949292741,"1":0.7215551622,"2":-1.0395749851,"3":-0.4249723298,"4":0.2762320193}}'
Writing in ISO date format, with microseconds
In [21]: json = dfd.to_json(date_format='iso', date_unit='us')
In [22]: json
Out[22]: '{"date":{"0":"2013-01-01T00:00:00.000000Z","1":"2013-01-01T00:00:00.000000Z","2":"2013-01-01T00:00:00.000000Z","3":"2013-01-01T00:00:00.000000Z","4":"2013-01-01T00:00:00.000000Z"},"B":{"0":1.071803807,"1":-0.7067711336,"2":0.2718598855,"3":0.5670203498,"4":-1.0874006913},"A":{"0":-0.4949292741,"1":0.7215551622,"2":-1.0395749851,"3":-0.4249723298,"4":0.2762320193}}'
Epoch timestamps, in seconds
In [23]: json = dfd.to_json(date_format='epoch', date_unit='s')
In [24]: json
Out[24]: '{"date":{"0":1356998400,"1":1356998400,"2":1356998400,"3":1356998400,"4":1356998400},"B":{"0":1.071803807,"1":-0.7067711336,"2":0.2718598855,"3":0.5670203498,"4":-1.0874006913},"A":{"0":-0.4949292741,"1":0.7215551622,"2":-1.0395749851,"3":-0.4249723298,"4":0.2762320193}}'
Writing to a file, with a date index and a date column
In [25]: dfj2 = dfj.copy()
In [26]: dfj2['date'] = pd.Timestamp('20130101')
In [27]: dfj2['ints'] = list(range(5))
In [28]: dfj2['bools'] = True
In [29]: dfj2.index = pd.date_range('20130101', periods=5)
In [30]: dfj2.to_json('test.json')
In [31]: open('test.json').read()
Out[31]: '{"A":{"1356998400000":0.4691122999,"1357084800000":-1.5090585032,"1357171200000":1.212112025,"1357257600000":0.1192087113,"1357344000000":-0.8618489633},"B":{"1356998400000":-0.2828633443,"1357084800000":-1.135632371,"1357171200000":-0.1732146491,"1357257600000":-1.0442359663,"1357344000000":-2.1045692189},"date":{"1356998400000":1356998400000,"1357084800000":1356998400000,"1357171200000":1356998400000,"1357257600000":1356998400000,"1357344000000":1356998400000},"ints":{"1356998400000":0,"1357084800000":1,"1357171200000":2,"1357257600000":3,"1357344000000":4},"bools":{"1356998400000":true,"1357084800000":true,"1357171200000":true,"1357257600000":true,"1357344000000":true}}'
10.3.1.3 Fallback Behavior
If the JSON serializer cannot handle the container contents directly, it will fall back in the following manner:
- if the dtype is unsupported (e.g. np.complex) then the default_handler, if provided, will be called for each value, otherwise an exception is raised.
- if an object is unsupported it will attempt the following:
  - check if the object has defined a toDict method and call it. A toDict method should return a dict which will then be JSON serialized.
  - invoke the default_handler if one was provided.
  - convert the object to a dict by traversing its contents. However this will often fail with an OverflowError or give unexpected results.
In general the best approach for unsupported objects or dtypes is to provide a default_handler.
For example:
pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json()  # raises
RuntimeError: Unhandled numpy dtype 15
can be dealt with by specifying a simple default_handler:
In [32]: pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str)
Out[32]: '{"0":{"0":"(1+0j)","1":"(2+0j)","2":"(1+2j)"}}'
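The toDict hook described above can be sketched similarly; the Point class here is hypothetical:
# Hypothetical class: the serializer is expected to call toDict and encode the result
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def toDict(self):
        return {'x': self.x, 'y': self.y}

pd.Series([Point(1, 2)]).to_json()
# -> '{"0":{"x":1,"y":2}}', assuming the toDict fallback fires as described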
10.3.2 Reading JSON
Reading a JSON string to a pandas object can take a number of parameters.
The parser will try to parse a DataFrame if typ is not supplied or is None. To explicitly force Series parsing, pass typ=series.
- filepath_or_buffer : a VALID JSON string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, S3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.json
- typ : type of object to recover (series or frame), default ‘frame’
- orient :
  - Series:
    - default is index
    - allowed values are {split, records, index}
  - DataFrame:
    - default is columns
    - allowed values are {split, records, index, columns, values}
  The format of the JSON string:
  - split : dict like {index -> [index], columns -> [columns], data -> [values]}
  - records : list like [{column -> value}, ... , {column -> value}]
  - index : dict like {index -> {column -> value}}
  - columns : dict like {column -> {index -> value}}
  - values : just the values array
- dtype : if True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all. Default is True; applies only to the data.
- convert_axes : boolean, try to convert the axes to the proper dtypes, default is True.
- convert_dates : a list of columns to parse for dates; if True, then try to parse date-like columns. Default is True.
- keep_default_dates : boolean, default True. If parsing dates, then parse the default date-like columns.
- numpy : direct decoding to numpy arrays, default is False. Supports numeric data only, although labels may be non-numeric. Also note that the JSON ordering MUST be the same for each term if numpy=True.
- precise_float : boolean, default False. Set to enable usage of the higher precision (strtod) function when decoding string to double values. The default (False) is to use fast but less precise builtin functionality.
- date_unit : string, the timestamp unit to detect if converting dates. Default None. By default the timestamp precision will be detected; if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force timestamp precision to seconds, milliseconds, microseconds or nanoseconds respectively.
- lines : reads the file as one JSON object per line.
- encoding : the encoding to use to decode py3 bytes.
The parser will raise one of ValueError/TypeError/AssertionError
if the JSON is not parseable.
If a non-default orient was used when encoding to JSON, be sure to pass the same option here so that decoding produces sensible results; see Orient Options for an overview.
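For example, a minimal round-trip sketch with the split orient, reusing dfjo from above:
# Encode and decode with the same non-default orient
json = dfjo.to_json(orient='split')
pd.read_json(json, orient='split')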
10.3.2.1 Data Conversion
The default of convert_axes=True, dtype=True, and convert_dates=True will try to parse the axes, and all of the data, into appropriate types, including dates. If you need to override specific dtypes, pass a dict to dtype. convert_axes should only be set to False if you need to preserve string-like numbers (e.g. ‘1’, ‘2’) on an axis.
Note
Large integer values may be converted to dates if convert_dates=True and the data and / or column labels appear ‘date-like’. The exact threshold depends on the date_unit specified. ‘date-like’ means that the column label meets one of the following criteria (a sketch follows the list):
- it ends with '_at'
- it ends with '_time'
- it begins with 'timestamp'
- it is 'modified'
- it is 'date'
Warning
When reading JSON data, automatic coercing into dtypes has some quirks:
- an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed to be the same as before serialization
- a column that was float data will be converted to integer if it can be done safely, e.g. a column of 1.
- bool columns will be converted to integer on reconstruction
Thus there are times when you may want to specify specific dtypes via the dtype keyword argument.
Reading from a JSON string:
In [33]: pd.read_json(json)
Out[33]:
A B date
0 -0.4949 1.0718 2013-01-01
1 0.7216 -0.7068 2013-01-01
2 -1.0396 0.2719 2013-01-01
3 -0.4250 0.5670 2013-01-01
4 0.2762 -1.0874 2013-01-01
Reading from a file:
In [34]: pd.read_json('test.json')
Out[34]:
A B bools date ints
2013-01-01 0.4691 -0.2829 True 2013-01-01 0
2013-01-02 -1.5091 -1.1356 True 2013-01-01 1
2013-01-03 1.2121 -0.1732 True 2013-01-01 2
2013-01-04 0.1192 -1.0442 True 2013-01-01 3
2013-01-05 -0.8618 -2.1046 True 2013-01-01 4
Don’t convert any data (but still convert axes and dates):
In [35]: pd.read_json('test.json', dtype=object).dtypes
Out[35]:
A object
B object
bools object
date object
ints object
dtype: object
Specify dtypes for conversion:
In [36]: pd.read_json('test.json', dtype={'A' : 'float32', 'bools' : 'int8'}).dtypes
Out[36]:
A float32
B float64
bools int8
date datetime64[ns]
ints int64
dtype: object
Preserve string indices:
In [37]: si = pd.DataFrame(np.zeros((4, 4)),
....: columns=list(range(4)),
....: index=[str(i) for i in range(4)])
....:
In [38]: si
Out[38]:
0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
In [39]: si.index
Out[39]: Index([u'0', u'1', u'2', u'3'], dtype='object')
In [40]: si.columns
Out[40]: Int64Index([0, 1, 2, 3], dtype='int64')
In [41]: json = si.to_json()
In [42]: sij = pd.read_json(json, convert_axes=False)
In [43]: sij
Out[43]:
0 1 2 3
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
In [44]: sij.index
Out[44]: Index([u'0', u'1', u'2', u'3'], dtype='object')
In [45]: sij.columns
Out[45]: Index([u'0', u'1', u'2', u'3'], dtype='object')
Dates written in nanoseconds need to be read back in nanoseconds:
In [46]: json = dfj2.to_json(date_unit='ns')
# Try to parse timestamps as milliseconds -> Won't Work
In [47]: dfju = pd.read_json(json, date_unit='ms')
In [48]: dfju
Out[48]:
A B bools date ints
1356998400000000000 0.4691 -0.2829 True 1356998400000000000 0
1357084800000000000 -1.5091 -1.1356 True 1356998400000000000 1
1357171200000000000 1.2121 -0.1732 True 1356998400000000000 2
1357257600000000000 0.1192 -1.0442 True 1356998400000000000 3
1357344000000000000 -0.8618 -2.1046 True 1356998400000000000 4
# Let pandas detect the correct precision
In [49]: dfju = pd.read_json(json)
In [50]: dfju
Out[50]:
A B bools date ints
2013-01-01 0.4691 -0.2829 True 2013-01-01 0
2013-01-02 -1.5091 -1.1356 True 2013-01-01 1
2013-01-03 1.2121 -0.1732 True 2013-01-01 2
2013-01-04 0.1192 -1.0442 True 2013-01-01 3
2013-01-05 -0.8618 -2.1046 True 2013-01-01 4
# Or specify that all timestamps are in nanoseconds
In [51]: dfju = pd.read_json(json, date_unit='ns')
In [52]: dfju
Out[52]:
A B bools date ints
2013-01-01 0.4691 -0.2829 True 2013-01-01 0
2013-01-02 -1.5091 -1.1356 True 2013-01-01 1
2013-01-03 1.2121 -0.1732 True 2013-01-01 2
2013-01-04 0.1192 -1.0442 True 2013-01-01 3
2013-01-05 -0.8618 -2.1046 True 2013-01-01 4
10.3.2.2 The Numpy Parameter
Note
This supports numeric data only. Index and column labels may be non-numeric, e.g. strings, dates etc.
If numpy=True is passed to read_json, an attempt will be made to sniff an appropriate dtype during deserialization and to subsequently decode directly to numpy arrays, bypassing the need for intermediate Python objects.
This can provide speedups if you are deserializing a large amount of numeric data:
In [53]: randfloats = np.random.uniform(-100, 1000, 10000)
In [54]: randfloats.shape = (1000, 10)
In [55]: dffloats = pd.DataFrame(randfloats, columns=list('ABCDEFGHIJ'))
In [56]: jsonfloats = dffloats.to_json()
In [57]: timeit pd.read_json(jsonfloats)
100 loops, best of 3: 7.1 ms per loop
In [58]: timeit pd.read_json(jsonfloats, numpy=True)
100 loops, best of 3: 4.22 ms per loop
The speedup is less noticeable for smaller datasets:
In [59]: jsonfloats = dffloats.head(100).to_json()
In [60]: timeit pd.read_json(jsonfloats)
100 loops, best of 3: 3.56 ms per loop
In [61]: timeit pd.read_json(jsonfloats, numpy=True)
100 loops, best of 3: 2.8 ms per loop
Warning
Direct numpy decoding makes a number of assumptions and may fail or produce unexpected output if these assumptions are not satisfied:
- data is numeric.
- data is uniform. The dtype is sniffed from the first value decoded. A ValueError may be raised, or incorrect output may be produced, if this condition is not satisfied.
- labels are ordered. Labels are only read from the first container; it is assumed that each subsequent row / column has been encoded in the same order. This should be satisfied if the data was encoded using to_json, but may not be the case if the JSON is from another source.
10.3.3 Normalization
New in version 0.13.0.
pandas provides a utility function to take a dict or list of dicts and normalize this semi-structured data into a flat table.
In [62]: from pandas.io.json import json_normalize
In [63]: data = [{'state': 'Florida',
....: 'shortname': 'FL',
....: 'info': {
....: 'governor': 'Rick Scott'
....: },
....: 'counties': [{'name': 'Dade', 'population': 12345},
....: {'name': 'Broward', 'population': 40000},
....: {'name': 'Palm Beach', 'population': 60000}]},
....: {'state': 'Ohio',
....: 'shortname': 'OH',
....: 'info': {
....: 'governor': 'John Kasich'
....: },
....: 'counties': [{'name': 'Summit', 'population': 1234},
....: {'name': 'Cuyahoga', 'population': 1337}]}]
....:
In [64]: json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
Out[64]:
name population info.governor state shortname
0 Dade 12345 Rick Scott Florida FL
1 Broward 40000 Rick Scott Florida FL
2 Palm Beach 60000 Rick Scott Florida FL
3 Summit 1234 John Kasich Ohio OH
4 Cuyahoga 1337 John Kasich Ohio OH
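A smaller, hedged sketch of the same idea: with no record path given, nested keys are simply flattened into dotted column names:
# Nested dict keys become dotted column names
json_normalize({'a': 1, 'b': {'c': 2, 'd': 3}})
#    a  b.c  b.d
# 0  1    2    3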
10.3.4 Line Delimited JSON
New in version 0.19.0.
pandas is able to read and write line-delimited JSON files that are common in data processing pipelines using Hadoop or Spark.
In [65]: jsonl = '''
....: {"a":1,"b":2}
....: {"a":3,"b":4}
....: '''
....:
In [66]: df = pd.read_json(jsonl, lines=True)
In [67]: df
Out[67]:
a b
0 1 2
1 3 4
In [68]: df.to_json(orient='records', lines=True)
Out[68]: '{"a":1,"b":2}\n{"a":3,"b":4}'
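Since to_json also accepts a path, line-delimited output can be written straight to a file and read back; a minimal sketch (the filename is hypothetical):
# Write one record per line to a file, then read it back
df.to_json('records.jsonl', orient='records', lines=True)
pd.read_json('records.jsonl', lines=True)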