10.3 JSON

Read and write JSON format files and strings.

10.3.1 Writing JSON

A Series or DataFrame can be converted to a valid JSON string. Use to_json with optional parameters:

  • path_or_buf : the pathname or buffer to write the output. This can be None, in which case a JSON string is returned.

  • orient :

    Series:
    • default is index
    • allowed values are {split, records, index}

    DataFrame:
    • default is columns
    • allowed values are {split, records, index, columns, values}

    The format of the JSON string:

    split : dict like {index -> [index], columns -> [columns], data -> [values]}
    records : list like [{column -> value}, ... , {column -> value}]
    index : dict like {index -> {column -> value}}
    columns : dict like {column -> {index -> value}}
    values : just the values array
  • date_format : string, type of date conversion, ‘epoch’ for timestamp, ‘iso’ for ISO8601.

  • double_precision : The number of decimal places to use when encoding floating point values, default 10.

  • force_ascii : force encoded string to be ASCII, default True.

  • date_unit : The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’ or ‘ns’ for seconds, milliseconds, microseconds and nanoseconds respectively. Default ‘ms’.

  • default_handler : The handler to call if an object cannot otherwise be converted to a suitable format for JSON. Takes a single argument, which is the object to convert, and returns a serializable object.

  • lines : if orient is records, write each record on its own line as JSON.

Note

NaN, NaT and None will be converted to null, and datetime objects will be converted based on the date_format and date_unit parameters.
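For example, missing values become JSON null (a minimal sketch, assuming the usual import numpy as np / import pandas as pd; the output in the comment is indicative):

s = pd.Series([1.0, np.nan, None])
s.to_json()  # -> '{"0":1.0,"1":null,"2":null}'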

In [1]: dfj = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [2]: json = dfj.to_json()

In [3]: json
Out[3]: '{"A":{"0":0.4691122999,"1":-1.5090585032,"2":1.212112025,"3":0.1192087113,"4":-0.8618489633},"B":{"0":-0.2828633443,"1":-1.135632371,"2":-0.1732146491,"3":-1.0442359663,"4":-2.1045692189}}'
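double_precision and force_ascii can be seen in action with a small sketch (the outputs described in the comments are indicative):

pd.Series([np.pi]).to_json(double_precision=3)  # -> '{"0":3.142}'

s = pd.Series(['späm'])
s.to_json()                   # 'ä' escaped as \u00e4
s.to_json(force_ascii=False)  # 'ä' kept as-is in the output string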

10.3.1.1 Orient Options

There are a number of different options for the format of the resulting JSON file / string. Consider the following DataFrame and Series:

In [4]: dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
   ...:                     columns=list('ABC'), index=list('xyz'))
   ...: 

In [5]: dfjo
Out[5]: 
   A  B  C
x  1  4  7
y  2  5  8
z  3  6  9

In [6]: sjo = pd.Series(dict(x=15, y=16, z=17), name='D')

In [7]: sjo
Out[7]: 
x    15
y    16
z    17
Name: D, dtype: int64

Column oriented (the default for DataFrame) serializes the data as nested JSON objects with column labels acting as the primary index:

In [8]: dfjo.to_json(orient="columns")
Out[8]: '{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'

# Not available for Series

Index oriented (the default for Series) is similar to column oriented, but the index labels are now primary:

In [9]: dfjo.to_json(orient="index")
Out[9]: '{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'

In [10]: sjo.to_json(orient="index")
Out[10]: '{"x":15,"y":16,"z":17}'

Record oriented serializes the data to a JSON array of column -> value records; index labels are not included. This is useful for passing DataFrame data to plotting libraries, for example the JavaScript library d3.js:

In [11]: dfjo.to_json(orient="records")
Out[11]: '[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'

In [12]: sjo.to_json(orient="records")
Out[12]: '[15,16,17]'

Value oriented is a bare-bones option which serializes to nested JSON arrays of values only; column and index labels are not included:

In [13]: dfjo.to_json(orient="values")
Out[13]: '[[1,4,7],[2,5,8],[3,6,9]]'

# Not available for Series

Split oriented serializes to a JSON object containing separate entries for values, index and columns. The name is also included for a Series:

In [14]: dfjo.to_json(orient="split")
Out[14]: '{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,9]]}'

In [15]: sjo.to_json(orient="split")
Out[15]: '{"name":"D","index":["x","y","z"],"data":[15,16,17]}'

Note

Any orient option that encodes to a JSON object will not preserve the ordering of index and column labels during round-trip serialization. If you wish to preserve label ordering use the split option as it uses ordered containers.
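For example, a split round-trip reproduces dfjo exactly, labels included (a minimal sketch; the comment is indicative):

roundtrip = pd.read_json(dfjo.to_json(orient='split'), orient='split')
roundtrip.equals(dfjo)  # -> True: values, index and columns all preserved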

10.3.1.2 Date Handling

Writing in ISO date format

In [16]: dfd = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [17]: dfd['date'] = pd.Timestamp('20130101')

In [18]: dfd = dfd.sort_index(axis=1, ascending=False)

In [19]: json = dfd.to_json(date_format='iso')

In [20]: json
Out[20]: '{"date":{"0":"2013-01-01T00:00:00.000Z","1":"2013-01-01T00:00:00.000Z","2":"2013-01-01T00:00:00.000Z","3":"2013-01-01T00:00:00.000Z","4":"2013-01-01T00:00:00.000Z"},"B":{"0":1.071803807,"1":-0.7067711336,"2":0.2718598855,"3":0.5670203498,"4":-1.0874006913},"A":{"0":-0.4949292741,"1":0.7215551622,"2":-1.0395749851,"3":-0.4249723298,"4":0.2762320193}}'

Writing in ISO date format, with microseconds

In [21]: json = dfd.to_json(date_format='iso', date_unit='us')

In [22]: json
Out[22]: '{"date":{"0":"2013-01-01T00:00:00.000000Z","1":"2013-01-01T00:00:00.000000Z","2":"2013-01-01T00:00:00.000000Z","3":"2013-01-01T00:00:00.000000Z","4":"2013-01-01T00:00:00.000000Z"},"B":{"0":1.071803807,"1":-0.7067711336,"2":0.2718598855,"3":0.5670203498,"4":-1.0874006913},"A":{"0":-0.4949292741,"1":0.7215551622,"2":-1.0395749851,"3":-0.4249723298,"4":0.2762320193}}'

Epoch timestamps, in seconds

In [23]: json = dfd.to_json(date_format='epoch', date_unit='s')

In [24]: json
Out[24]: '{"date":{"0":1356998400,"1":1356998400,"2":1356998400,"3":1356998400,"4":1356998400},"B":{"0":1.071803807,"1":-0.7067711336,"2":0.2718598855,"3":0.5670203498,"4":-1.0874006913},"A":{"0":-0.4949292741,"1":0.7215551622,"2":-1.0395749851,"3":-0.4249723298,"4":0.2762320193}}'

Writing to a file, with a date index and a date column

In [25]: dfj2 = dfj.copy()

In [26]: dfj2['date'] = pd.Timestamp('20130101')

In [27]: dfj2['ints'] = list(range(5))

In [28]: dfj2['bools'] = True

In [29]: dfj2.index = pd.date_range('20130101', periods=5)

In [30]: dfj2.to_json('test.json')

In [31]: open('test.json').read()
Out[31]: '{"A":{"1356998400000":0.4691122999,"1357084800000":-1.5090585032,"1357171200000":1.212112025,"1357257600000":0.1192087113,"1357344000000":-0.8618489633},"B":{"1356998400000":-0.2828633443,"1357084800000":-1.135632371,"1357171200000":-0.1732146491,"1357257600000":-1.0442359663,"1357344000000":-2.1045692189},"date":{"1356998400000":1356998400000,"1357084800000":1356998400000,"1357171200000":1356998400000,"1357257600000":1356998400000,"1357344000000":1356998400000},"ints":{"1356998400000":0,"1357084800000":1,"1357171200000":2,"1357257600000":3,"1357344000000":4},"bools":{"1356998400000":true,"1357084800000":true,"1357171200000":true,"1357257600000":true,"1357344000000":true}}'

10.3.1.3 Fallback Behavior

If the JSON serializer cannot handle the container contents directly it will fall back in the following manner:

  • if the dtype is unsupported (e.g. np.complex) then the default_handler, if provided, will be called for each value, otherwise an exception is raised.
  • if an object is unsupported it will attempt the following:
    • check if the object has defined a toDict method and call it. A toDict method should return a dict, which will then be JSON serialized (see the sketch after this list).
    • invoke the default_handler if one was provided.
    • convert the object to a dict by traversing its contents. However this will often fail with an OverflowError or give unexpected results.
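A minimal sketch of the toDict hook (the Point class is purely illustrative; the output in the comment is indicative):

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def toDict(self):
        # called by the serializer when it cannot handle the object directly
        return {'x': self.x, 'y': self.y}

pd.Series([Point(1, 2)]).to_json()  # -> '{"0":{"x":1,"y":2}}'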

In general the best approach for unsupported objects or dtypes is to provide a default_handler. For example:

pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json()  # raises

RuntimeError: Unhandled numpy dtype 15

can be dealt with by specifying a simple default_handler:

In [32]: pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str)
Out[32]: '{"0":{"0":"(1+0j)","1":"(2+0j)","2":"(1+2j)"}}'
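The handler does not have to stringify; any JSON-serializable return value works. A hypothetical handler that preserves the numeric information of complex values:

def encode_complex(obj):
    # hypothetical handler: represent complex numbers as [real, imag] pairs
    if isinstance(obj, complex):
        return [obj.real, obj.imag]
    raise TypeError('cannot serialize %r' % obj)

pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=encode_complex)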

10.3.2 Reading JSON

Reading a JSON string to a pandas object can take a number of parameters. The parser will try to parse a DataFrame if typ is not supplied or is None. To explicitly force Series parsing, pass typ='series'.

  • filepath_or_buffer : a VALID JSON string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, S3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.json

  • typ : type of object to recover (series or frame), default ‘frame’

  • orient :

    Series:
    • default is index
    • allowed values are {split, records, index}

    DataFrame:
    • default is columns
    • allowed values are {split, records, index, columns, values}

    The format of the JSON string:

    split : dict like {index -> [index], columns -> [columns], data -> [values]}
    records : list like [{column -> value}, ... , {column -> value}]
    index : dict like {index -> {column -> value}}
    columns : dict like {column -> {index -> value}}
    values : just the values array
  • dtype : if True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all. Default is True. Applies only to the data.

  • convert_axes : boolean, try to convert the axes to the proper dtypes, default is True

  • convert_dates : a list of columns to parse for dates; if True, then try to parse date-like columns. Default is True.

  • keep_default_dates : boolean, default True. If parsing dates, then parse the default date-like columns

  • numpy : direct decoding to numpy arrays. Default is False. Supports numeric data only, although labels may be non-numeric. Also note that the JSON ordering MUST be the same for each term if numpy=True.

  • precise_float : boolean, default False. Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (False) is to use fast but less precise builtin functionality

  • date_unit : string, the timestamp unit to detect if converting dates. Default None. By default the timestamp precision will be detected, if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force timestamp precision to seconds, milliseconds, microseconds or nanoseconds respectively.

  • lines : reads file as one json object per line.

  • encoding : the encoding to use to decode Python 3 bytes.

The parser will raise one of ValueError/TypeError/AssertionError if the JSON is not parseable.

If a non-default orient was used when encoding to JSON, be sure to pass the same option here so that decoding produces sensible results; see Orient Options for an overview.
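For example, reading back the Series written with orient='split' above requires both typ and orient (a minimal sketch using sjo):

pd.read_json(sjo.to_json(orient='split'), typ='series', orient='split')
# recovers the original Series, name ('D') included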

10.3.2.1 Data Conversion

The defaults of convert_axes=True, dtype=True, and convert_dates=True will try to parse the axes and all of the data into appropriate types, including dates. If you need to override specific dtypes, pass a dict to dtype. convert_axes should only be set to False if you need to preserve string-like numbers (e.g. '1', '2') in an axis.

Note

Large integer values may be converted to dates if convert_dates=True and the data and / or column labels appear ‘date-like’. The exact threshold depends on the date_unit specified. ‘date-like’ means that the column label meets one of the following criteria:

  • it ends with '_at'
  • it ends with '_time'
  • it begins with 'timestamp'
  • it is 'modified'
  • it is 'date'
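A minimal sketch of the effect (the column names are illustrative; the dtypes in the comments are indicative):

s = '{"date":{"0":1356998400000},"value":{"0":1356998400000}}'
pd.read_json(s).dtypes                       # date: datetime64[ns], value: int64
pd.read_json(s, convert_dates=False).dtypes  # both columns stay int64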

Warning

When reading JSON data, automatic coercing into dtypes has some quirks:

  • an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed to be the same as before serialization
  • a column that was float data will be converted to integer if it can be done safely, e.g. a column of 1.0 values will be read back with an integer dtype
  • bool columns will be converted to integer on reconstruction

Thus there are times when you may want to specify specific dtypes via the dtype keyword argument.
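For example, the safe float-to-integer conversion can be undone with an explicit dtype (a minimal sketch; the dtypes in the comments are indicative):

dfq = pd.DataFrame({'a': [1., 2.]})          # a: float64
pd.read_json(dfq.to_json()).dtypes           # a: int64 -- converted because it is safe
pd.read_json(dfq.to_json(), dtype={'a': 'float64'}).dtypes  # a: float64 again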

Reading from a JSON string:

In [33]: pd.read_json(json)
Out[33]: 
        A       B       date
0 -0.4949  1.0718 2013-01-01
1  0.7216 -0.7068 2013-01-01
2 -1.0396  0.2719 2013-01-01
3 -0.4250  0.5670 2013-01-01
4  0.2762 -1.0874 2013-01-01

Reading from a file:

In [34]: pd.read_json('test.json')
Out[34]: 
                 A       B bools       date  ints
2013-01-01  0.4691 -0.2829  True 2013-01-01     0
2013-01-02 -1.5091 -1.1356  True 2013-01-01     1
2013-01-03  1.2121 -0.1732  True 2013-01-01     2
2013-01-04  0.1192 -1.0442  True 2013-01-01     3
2013-01-05 -0.8618 -2.1046  True 2013-01-01     4

Don’t convert any data (but still convert axes and dates):

In [35]: pd.read_json('test.json', dtype=object).dtypes
Out[35]: 
A        object
B        object
bools    object
date     object
ints     object
dtype: object

Specify dtypes for conversion:

In [36]: pd.read_json('test.json', dtype={'A' : 'float32', 'bools' : 'int8'}).dtypes
Out[36]: 
A               float32
B               float64
bools              int8
date     datetime64[ns]
ints              int64
dtype: object

Preserve string indices:

In [37]: si = pd.DataFrame(np.zeros((4, 4)),
   ....:          columns=list(range(4)),
   ....:          index=[str(i) for i in range(4)])
   ....: 

In [38]: si
Out[38]: 
     0    1    2    3
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0

In [39]: si.index
Out[39]: Index([u'0', u'1', u'2', u'3'], dtype='object')

In [40]: si.columns
Out[40]: Int64Index([0, 1, 2, 3], dtype='int64')

In [41]: json = si.to_json()

In [42]: sij = pd.read_json(json, convert_axes=False)

In [43]: sij
Out[43]: 
   0  1  2  3
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0
3  0  0  0  0

In [44]: sij.index
Out[44]: Index([u'0', u'1', u'2', u'3'], dtype='object')

In [45]: sij.columns
Out[45]: Index([u'0', u'1', u'2', u'3'], dtype='object')

Dates written in nanoseconds need to be read back in nanoseconds:

In [46]: json = dfj2.to_json(date_unit='ns')

# Try to parse timestamps as milliseconds -> Won't Work
In [47]: dfju = pd.read_json(json, date_unit='ms')

In [48]: dfju
Out[48]: 
                          A       B bools                 date  ints
1356998400000000000  0.4691 -0.2829  True  1356998400000000000     0
1357084800000000000 -1.5091 -1.1356  True  1356998400000000000     1
1357171200000000000  1.2121 -0.1732  True  1356998400000000000     2
1357257600000000000  0.1192 -1.0442  True  1356998400000000000     3
1357344000000000000 -0.8618 -2.1046  True  1356998400000000000     4

# Let pandas detect the correct precision
In [49]: dfju = pd.read_json(json)

In [50]: dfju
Out[50]: 
                 A       B bools       date  ints
2013-01-01  0.4691 -0.2829  True 2013-01-01     0
2013-01-02 -1.5091 -1.1356  True 2013-01-01     1
2013-01-03  1.2121 -0.1732  True 2013-01-01     2
2013-01-04  0.1192 -1.0442  True 2013-01-01     3
2013-01-05 -0.8618 -2.1046  True 2013-01-01     4

# Or specify that all timestamps are in nanoseconds
In [51]: dfju = pd.read_json(json, date_unit='ns')

In [52]: dfju
Out[52]: 
                 A       B bools       date  ints
2013-01-01  0.4691 -0.2829  True 2013-01-01     0
2013-01-02 -1.5091 -1.1356  True 2013-01-01     1
2013-01-03  1.2121 -0.1732  True 2013-01-01     2
2013-01-04  0.1192 -1.0442  True 2013-01-01     3
2013-01-05 -0.8618 -2.1046  True 2013-01-01     4

10.3.2.2 The Numpy Parameter

Note

This supports numeric data only. Index and column labels may be non-numeric, e.g. strings, dates etc.

If numpy=True is passed to read_json an attempt will be made to sniff an appropriate dtype during deserialization and to subsequently decode directly to numpy arrays, bypassing the need for intermediate Python objects.

This can provide speedups if you are deserializing a large amount of numeric data:

In [53]: randfloats = np.random.uniform(-100, 1000, 10000)

In [54]: randfloats.shape = (1000, 10)

In [55]: dffloats = pd.DataFrame(randfloats, columns=list('ABCDEFGHIJ'))

In [56]: jsonfloats = dffloats.to_json()
In [57]: timeit pd.read_json(jsonfloats)
100 loops, best of 3: 7.1 ms per loop
In [58]: timeit pd.read_json(jsonfloats, numpy=True)
100 loops, best of 3: 4.22 ms per loop

The speedup is less noticeable for smaller datasets:

In [59]: jsonfloats = dffloats.head(100).to_json()
In [60]: timeit pd.read_json(jsonfloats)
100 loops, best of 3: 3.56 ms per loop
In [61]: timeit pd.read_json(jsonfloats, numpy=True)
100 loops, best of 3: 2.8 ms per loop

Warning

Direct numpy decoding makes a number of assumptions and may fail or produce unexpected output if these assumptions are not satisfied:

  • data is numeric.
  • data is uniform. The dtype is sniffed from the first value decoded. A ValueError may be raised, or incorrect output may be produced if this condition is not satisfied.
  • labels are ordered. Labels are only read from the first container; it is assumed that each subsequent row / column has been encoded in the same order. This should be satisfied if the data was encoded using to_json, but may not be the case if the JSON is from another source.

10.3.3 Normalization

New in version 0.13.0.

pandas provides a utility function to take a dict or list of dicts and normalize this semi-structured data into a flat table.

In [62]: from pandas.io.json import json_normalize

In [63]: data = [{'state': 'Florida',
   ....:           'shortname': 'FL',
   ....:           'info': {
   ....:                'governor': 'Rick Scott'
   ....:           },
   ....:           'counties': [{'name': 'Dade', 'population': 12345},
   ....:                       {'name': 'Broward', 'population': 40000},
   ....:                       {'name': 'Palm Beach', 'population': 60000}]},
   ....:          {'state': 'Ohio',
   ....:           'shortname': 'OH',
   ....:           'info': {
   ....:                'governor': 'John Kasich'
   ....:           },
   ....:           'counties': [{'name': 'Summit', 'population': 1234},
   ....:                        {'name': 'Cuyahoga', 'population': 1337}]}]
   ....: 

In [64]: json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
Out[64]: 
         name  population info.governor    state shortname
0        Dade       12345    Rick Scott  Florida        FL
1     Broward       40000    Rick Scott  Florida        FL
2  Palm Beach       60000    Rick Scott  Florida        FL
3      Summit        1234   John Kasich     Ohio        OH
4    Cuyahoga        1337   John Kasich     Ohio        OH

10.3.4 Line-delimited JSON

New in version 0.19.0.

pandas is able to read and write line-delimited JSON files that are common in data processing pipelines using Hadoop or Spark.

In [65]: jsonl = '''
   ....:     {"a":1,"b":2}
   ....:     {"a":3,"b":4}
   ....: '''
   ....: 

In [66]: df = pd.read_json(jsonl, lines=True)

In [67]: df
Out[67]: 
   a  b
0  1  2
1  3  4

In [68]: df.to_json(orient='records', lines=True)
Out[68]: '{"a":1,"b":2}\n{"a":3,"b":4}'
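In this version read_json parses the whole input at once even with lines=True; for files too large for that, a hand-rolled chunking helper built on the same call is one option (a hypothetical sketch, read_jsonl_in_chunks is not a pandas function):

import pandas as pd

def read_jsonl_in_chunks(path, chunk_lines=100000):
    # hypothetical helper: parse a line-delimited JSON file in batches
    chunks, buf = [], []
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            buf.append(line)
            if len(buf) == chunk_lines:
                chunks.append(pd.read_json(''.join(buf), lines=True))
                buf = []
    if buf:
        chunks.append(pd.read_json(''.join(buf), lines=True))
    return pd.concat(chunks, ignore_index=True)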