.. currentmodule:: pandas

.. ipython:: python
   :suppress:

   import os
   import csv
   from pandas.compat import StringIO, BytesIO
   import pandas as pd
   ExcelWriter = pd.ExcelWriter

   import numpy as np
   np.random.seed(123456)
   randn = np.random.randn
   np.set_printoptions(precision=4, suppress=True)

   import matplotlib.pyplot as plt
   plt.close('all')

   import pandas.util.testing as tm
   pd.options.display.max_rows = 15
   clipdf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': ['p', 'q', 'r']},
                         index=['x', 'y', 'z'])

.. _io.json:

JSON
----

Read and write ``JSON`` format files and strings.

.. _io.json_writer:

Writing JSON
''''''''''''

A ``Series`` or ``DataFrame`` can be converted to a valid JSON string. Use
``to_json`` with optional parameters:

- ``path_or_buf`` : the pathname or buffer to write the output.
  This can be ``None``, in which case a JSON string is returned.
- ``orient`` :

  Series :
      - default is ``index``
      - allowed values are {``split``, ``records``, ``index``}

  DataFrame :
      - default is ``columns``
      - allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``}

  The format of the JSON string

  .. csv-table::
     :widths: 20, 150
     :delim: ;

     ``split``; dict like {index -> [index], columns -> [columns], data -> [values]}
     ``records``; list like [{column -> value}, ... , {column -> value}]
     ``index``; dict like {index -> {column -> value}}
     ``columns``; dict like {column -> {index -> value}}
     ``values``; just the values array

- ``date_format`` : string, type of date conversion, 'epoch' for timestamp, 'iso' for ISO8601.
- ``double_precision`` : the number of decimal places to use when encoding floating point values, default 10.
- ``force_ascii`` : force encoded string to be ASCII, default True.
- ``date_unit`` : the time unit to encode to, governs timestamp and ISO8601 precision. One of 's', 'ms', 'us' or 'ns' for seconds, milliseconds, microseconds and nanoseconds respectively. Default 'ms'.
- ``default_handler`` : the handler to call if an object cannot otherwise be converted to a suitable format for JSON. Takes a single argument, which is the object to convert, and returns a serializable object.
- ``lines`` : if ``records`` orient, then write each record on its own line as JSON.

Note that ``NaN``'s, ``NaT``'s and ``None`` will be converted to ``null`` and
``datetime`` objects will be converted based on the ``date_format`` and
``date_unit`` parameters.

.. ipython:: python

   dfj = pd.DataFrame(randn(5, 2), columns=list('AB'))
   json = dfj.to_json()
   json

Orient Options
++++++++++++++

There are a number of different options for the format of the resulting JSON
file / string. Consider the following ``DataFrame`` and ``Series``:

.. ipython:: python

   dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
                       columns=list('ABC'), index=list('xyz'))
   dfjo
   sjo = pd.Series(dict(x=15, y=16, z=17), name='D')
   sjo

**Column oriented** (the default for ``DataFrame``) serializes the data as
nested JSON objects with column labels acting as the primary index:

.. ipython:: python

   dfjo.to_json(orient="columns")
   # Not available for Series

**Index oriented** (the default for ``Series``) is similar to column oriented
but the index labels are now primary:

.. ipython:: python

   dfjo.to_json(orient="index")
   sjo.to_json(orient="index")
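For illustration, any of these encodings can be round-tripped by passing the
same ``orient`` back to ``read_json`` (covered in `Reading JSON`_ below); a
minimal sketch, not executed here:

.. code-block:: python

   # decode with the same orient that was used to encode
   pd.read_json(dfjo.to_json(orient="index"), orient="index")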
**Record oriented** serializes the data to a JSON array of column -> value
records, index labels are not included. This is useful for passing
``DataFrame`` data to plotting libraries, for example the JavaScript library
d3.js:

.. ipython:: python

   dfjo.to_json(orient="records")
   sjo.to_json(orient="records")

**Value oriented** is a bare-bones option which serializes to nested JSON
arrays of values only, column and index labels are not included:

.. ipython:: python

   dfjo.to_json(orient="values")
   # Not available for Series

**Split oriented** serializes to a JSON object containing separate entries for
values, index and columns. Name is also included for ``Series``:

.. ipython:: python

   dfjo.to_json(orient="split")
   sjo.to_json(orient="split")

.. note::

   Any orient option that encodes to a JSON object will not preserve the
   ordering of index and column labels during round-trip serialization. If you
   wish to preserve label ordering use the ``split`` option as it uses ordered
   containers.

Date Handling
+++++++++++++

Writing in ISO date format:

.. ipython:: python

   dfd = pd.DataFrame(randn(5, 2), columns=list('AB'))
   dfd['date'] = pd.Timestamp('20130101')
   dfd = dfd.sort_index(axis=1, ascending=False)
   json = dfd.to_json(date_format='iso')
   json

Writing in ISO date format, with microseconds:

.. ipython:: python

   json = dfd.to_json(date_format='iso', date_unit='us')
   json

Epoch timestamps, in seconds:

.. ipython:: python

   json = dfd.to_json(date_format='epoch', date_unit='s')
   json

Writing to a file, with a date index and a date column:

.. ipython:: python

   dfj2 = dfj.copy()
   dfj2['date'] = pd.Timestamp('20130101')
   dfj2['ints'] = list(range(5))
   dfj2['bools'] = True
   dfj2.index = pd.date_range('20130101', periods=5)
   dfj2.to_json('test.json')
   open('test.json').read()

Fallback Behavior
+++++++++++++++++

If the JSON serializer cannot handle the container contents directly it will
fall back in the following manner:

- if the dtype is unsupported (e.g. ``np.complex``) then the
  ``default_handler``, if provided, will be called for each value, otherwise
  an exception is raised.

- if an object is unsupported it will attempt the following:

  * check if the object has defined a ``toDict`` method and call it.
    A ``toDict`` method should return a ``dict`` which will then be JSON
    serialized.

  * invoke the ``default_handler`` if one was provided.

  * convert the object to a ``dict`` by traversing its contents. However this
    will often fail with an ``OverflowError`` or give unexpected results.

In general the best approach for unsupported objects or dtypes is to provide a
``default_handler``. For example:

.. code-block:: python

   >>> pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json()  # raises
   RuntimeError: Unhandled numpy dtype 15

can be dealt with by specifying a simple ``default_handler``:

.. ipython:: python

   pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str)
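Beyond stringifying, a ``default_handler`` can return any JSON-serializable
structure. The sketch below is illustrative only (``encode_complex`` is a
made-up helper, not part of pandas); it represents complex values as
two-element arrays:

.. code-block:: python

   def encode_complex(obj):
       # hypothetical fallback: encode a complex value as [real, imag]
       if isinstance(obj, complex):
           return [obj.real, obj.imag]
       # anything else is genuinely unhandled
       raise TypeError('cannot serialize {0!r}'.format(obj))

   pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=encode_complex)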
.. _io.json_reader:

Reading JSON
''''''''''''

Reading a JSON string to a pandas object can take a number of parameters. The
parser will try to parse a ``DataFrame`` if ``typ`` is not supplied or is
``None``. To explicitly force ``Series`` parsing, pass ``typ=series``.

- ``filepath_or_buffer`` : a **VALID** JSON string or file handle / StringIO.
  The string could be a URL. Valid URL schemes include http, ftp, S3, and
  file. For file URLs, a host is expected. For instance, a local file could be
  file://localhost/path/to/table.json
- ``typ`` : type of object to recover (series or frame), default 'frame'
- ``orient`` :

  Series :
      - default is ``index``
      - allowed values are {``split``, ``records``, ``index``}

  DataFrame :
      - default is ``columns``
      - allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``}

  The format of the JSON string

  .. csv-table::
     :widths: 20, 150
     :delim: ;

     ``split``; dict like {index -> [index], columns -> [columns], data -> [values]}
     ``records``; list like [{column -> value}, ... , {column -> value}]
     ``index``; dict like {index -> {column -> value}}
     ``columns``; dict like {column -> {index -> value}}
     ``values``; just the values array

- ``dtype`` : if True, infer dtypes, if a dict of column to dtype, then use
  those, if False, then don't infer dtypes at all, default is True; applies
  only to the data
- ``convert_axes`` : boolean, try to convert the axes to the proper dtypes,
  default is True
- ``convert_dates`` : a list of columns to parse for dates; if True, then try
  to parse date-like columns, default is True
- ``keep_default_dates`` : boolean, default True. If parsing dates, then parse
  the default date-like columns
- ``numpy`` : direct decoding to numpy arrays, default is False; supports
  numeric data only, although labels may be non-numeric. Also note that the
  JSON ordering **MUST** be the same for each term if ``numpy=True``
- ``precise_float`` : boolean, default ``False``. Set to enable usage of a
  higher precision (strtod) function when decoding string to double values.
  Default (``False``) is to use fast but less precise builtin functionality
- ``date_unit`` : string, the timestamp unit to detect if converting dates.
  Default None. By default the timestamp precision will be detected; if this
  is not desired then pass one of 's', 'ms', 'us' or 'ns' to force timestamp
  precision to seconds, milliseconds, microseconds or nanoseconds
  respectively.
- ``lines`` : reads the file as one JSON object per line.
- ``encoding`` : the encoding to use to decode py3 bytes.

The parser will raise one of ``ValueError/TypeError/AssertionError`` if the
JSON is not parseable.

If a non-default ``orient`` was used when encoding to JSON be sure to pass the
same option here so that decoding produces sensible results, see `Orient
Options`_ for an overview.

Data Conversion
+++++++++++++++

The default of ``convert_axes=True``, ``dtype=True``, and
``convert_dates=True`` will try to parse the axes, and all of the data, into
appropriate types, including dates. If you need to override specific dtypes,
pass a dict to ``dtype``. ``convert_axes`` should only be set to ``False`` if
you need to preserve string-like numbers (e.g. '1', '2') in an axes.

.. note::

   Large integer values may be converted to dates if ``convert_dates=True``
   and the data and / or column labels appear 'date-like'. The exact threshold
   depends on the ``date_unit`` specified. 'date-like' means that the column
   label meets one of the following criteria:

   * it ends with ``'_at'``
   * it ends with ``'_time'``
   * it begins with ``'timestamp'``
   * it is ``'modified'``
   * it is ``'date'``

.. warning::

   When reading JSON data, automatic coercing into dtypes has some quirks:

   * an index can be reconstructed in a different order from serialization,
     that is, the returned order is not guaranteed to be the same as before
     serialization
   * a column that was ``float`` data will be converted to ``integer`` if it
     can be done safely, e.g. a column of ``1.``
   * bool columns will be converted to ``integer`` on reconstruction

   Thus there are times where you may want to specify specific dtypes via the
   ``dtype`` keyword argument.

Reading from a JSON string:

.. ipython:: python

   pd.read_json(json)

Reading from a file:

.. ipython:: python

   pd.read_json('test.json')
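For illustration, a column whose name does not match the date-like patterns
above can still be parsed as dates by listing it in ``convert_dates``; a
minimal sketch, not executed here (the column name ``when`` is made up):

.. code-block:: python

   dfdates = pd.DataFrame({'A': [1, 2],
                           'when': [pd.Timestamp('20130101'),
                                    pd.Timestamp('20130102')]})
   # 'when' is not a default date-like name, so request it explicitly
   pd.read_json(dfdates.to_json(), convert_dates=['when']).dtypes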
Don't convert any data (but still convert axes and dates):

.. ipython:: python

   pd.read_json('test.json', dtype=object).dtypes

Specify dtypes for conversion:

.. ipython:: python

   pd.read_json('test.json', dtype={'A': 'float32', 'bools': 'int8'}).dtypes

Preserve string indices:

.. ipython:: python

   si = pd.DataFrame(np.zeros((4, 4)),
                     columns=list(range(4)),
                     index=[str(i) for i in range(4)])
   si
   si.index
   si.columns
   json = si.to_json()

   sij = pd.read_json(json, convert_axes=False)
   sij
   sij.index
   sij.columns

Dates written in nanoseconds need to be read back in nanoseconds:

.. ipython:: python

   json = dfj2.to_json(date_unit='ns')

   # Try to parse timestamps as milliseconds -> Won't Work
   dfju = pd.read_json(json, date_unit='ms')
   dfju

   # Let pandas detect the correct precision
   dfju = pd.read_json(json)
   dfju

   # Or specify that all timestamps are in nanoseconds
   dfju = pd.read_json(json, date_unit='ns')
   dfju

The Numpy Parameter
+++++++++++++++++++

.. note::

   This supports numeric data only. Index and column labels may be
   non-numeric, e.g. strings, dates etc.

If ``numpy=True`` is passed to ``read_json`` an attempt will be made to sniff
an appropriate dtype during deserialization and to subsequently decode
directly to numpy arrays, bypassing the need for intermediate Python objects.

This can provide speedups if you are deserializing a large amount of numeric
data:

.. ipython:: python

   randfloats = np.random.uniform(-100, 1000, 10000)
   randfloats.shape = (1000, 10)
   dffloats = pd.DataFrame(randfloats, columns=list('ABCDEFGHIJ'))

   jsonfloats = dffloats.to_json()

.. ipython:: python

   timeit pd.read_json(jsonfloats)

.. ipython:: python

   timeit pd.read_json(jsonfloats, numpy=True)

The speedup is less noticeable for smaller datasets:

.. ipython:: python

   jsonfloats = dffloats.head(100).to_json()

.. ipython:: python

   timeit pd.read_json(jsonfloats)

.. ipython:: python

   timeit pd.read_json(jsonfloats, numpy=True)

.. warning::

   Direct numpy decoding makes a number of assumptions and may fail or produce
   unexpected output if these assumptions are not satisfied:

   - data is numeric.

   - data is uniform. The dtype is sniffed from the first value decoded. A
     ``ValueError`` may be raised, or incorrect output may be produced if this
     condition is not satisfied.

   - labels are ordered. Labels are only read from the first container, it is
     assumed that each subsequent row / column has been encoded in the same
     order. This should be satisfied if the data was encoded using ``to_json``
     but may not be the case if the JSON is from another source.

.. ipython:: python
   :suppress:

   os.remove('test.json')

.. _io.json_normalize:

Normalization
'''''''''''''

.. versionadded:: 0.13.0

pandas provides a utility function to take a dict or list of dicts and
*normalize* this semi-structured data into a flat table.

.. ipython:: python

   from pandas.io.json import json_normalize
   data = [{'state': 'Florida',
            'shortname': 'FL',
            'info': {'governor': 'Rick Scott'},
            'counties': [{'name': 'Dade', 'population': 12345},
                         {'name': 'Broward', 'population': 40000},
                         {'name': 'Palm Beach', 'population': 60000}]},
           {'state': 'Ohio',
            'shortname': 'OH',
            'info': {'governor': 'John Kasich'},
            'counties': [{'name': 'Summit', 'population': 1234},
                         {'name': 'Cuyahoga', 'population': 1337}]}]

   json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
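When there is no record path to expand, ``json_normalize`` also flattens
nested mappings directly, joining nested keys with a ``.``; a minimal sketch
(the data below is made up for illustration):

.. code-block:: python

   json_normalize({'id': 1,
                   'location': {'city': 'Miami', 'state': 'FL'}})
   # produces columns: id, location.city, location.state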
.. _io.jsonl:

Line delimited json
'''''''''''''''''''

.. versionadded:: 0.19.0

pandas is able to read and write line-delimited json files that are common in
data processing pipelines using Hadoop or Spark.

.. ipython:: python

   jsonl = '''
       {"a":1,"b":2}
       {"a":3,"b":4}
   '''
   df = pd.read_json(jsonl, lines=True)
   df
   df.to_json(orient='records', lines=True)
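The same round-trip can go through a file on disk; a minimal sketch, not
executed here (the filename is arbitrary):

.. code-block:: python

   # write one JSON object per line, then read the file back
   df.to_json('records.jsonl', orient='records', lines=True)
   pd.read_json('records.jsonl', lines=True)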