10.11 Google BigQuery (Experimental)

New in version 0.13.0.

The pandas.io.gbq module provides a wrapper for Google’s BigQuery analytics web service to simplify retrieving results from BigQuery tables using SQL-like queries. Result sets are parsed into a pandas DataFrame with a shape and data types derived from the source table. Additionally, DataFrames can be inserted into new BigQuery tables or appended to existing tables.

You will need to install some additional dependencies beyond a standard pandas installation to use this module.

Warning

To use this module, you will need a valid BigQuery account. Refer to the BigQuery Documentation for details on the service itself.

The key functions are:

read_gbq(query[, project_id, index_col, ...]) Load data from Google BigQuery.
to_gbq(dataframe, destination_table, project_id) Write a DataFrame to a Google BigQuery table.

10.11.3 Authentication

New in version 0.18.0.

Authentication to the Google BigQuery service is via OAuth 2.0. It is possible to authenticate with either user account credentials or service account credentials.

Authenticating with user account credentials is as simple as following the prompts in a browser window, which will be opened for you automatically. You will be authenticated to the specified BigQuery account using the product name pandas GBQ. This flow only works on the local host; remote authentication using user account credentials is not currently supported in pandas. Additional information on the authentication mechanism can be found here.

Authentication with service account credentials is possible via the ‘private_key’ parameter. This method is particularly useful when working on remote servers (e.g. a Jupyter/IPython notebook running on a remote host). Additional information on service accounts can be found here.

You will need to install an additional dependency: oauth2client.

Authentication via application default credentials is also possible, but only if the private_key parameter is not provided. This method requires that the credentials can be fetched from the environment the code is running in; otherwise, OAuth2 client-side authentication is used. Additional information on application default credentials can be found here.

New in version 0.19.0.
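As an illustrative sketch of "fetched from the environment": one common discovery mechanism for application default credentials is the GOOGLE_APPLICATION_CREDENTIALS environment variable pointing at a service account key file. The check below is only a rough illustration of that discovery step, not how the library performs it internally.

```python
import os

# Application default credentials are discovered from the environment;
# one common mechanism is the GOOGLE_APPLICATION_CREDENTIALS variable
# pointing at a service account key file (illustrative check only).
creds_path = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')
default_credentials_available = creds_path is not None
```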

Note

The ‘private_key’ parameter can be set to either the file path of the service account key in JSON format, or the contents of the service account key in JSON format.
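To make the two accepted forms concrete, the sketch below writes a fake key file (the key contents and file name are placeholders, not a real credential) and shows both values you could hand to the ‘private_key’ parameter; the read_gbq calls are left commented since they require a live account.

```python
import json

# Hypothetical service account key used only to illustrate the two forms;
# a real key comes from the Google developers console.
sample_key = {"type": "service_account", "private_key_id": "abc123"}
with open("service_account.json", "w") as f:
    json.dump(sample_key, f)

# Form 1: the file path of the JSON key file.
key_path = "service_account.json"

# Form 2: the JSON contents of the key file, as a string.
with open(key_path) as f:
    key_contents = f.read()

# Either value can then be passed to read_gbq/to_gbq, e.g.:
# pd.read_gbq(query, project_id=projectid, private_key=key_path)
# pd.read_gbq(query, project_id=projectid, private_key=key_contents)
```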

Note

A private key can be obtained from the Google developers console by clicking here. Use JSON key type.

10.11.4 Querying

Suppose you want to load all data from an existing BigQuery table, test_dataset.test_table, into a DataFrame using the read_gbq() function.

# Insert your BigQuery Project ID Here
# Can be found in the Google web console
projectid = "xxxxxxxx"

data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', projectid)

You can define which column from BigQuery to use as an index in the destination DataFrame as well as a preferred column order as follows:

data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                         project_id=projectid,
                         index_col='index_column_name',
                         col_order=['col1', 'col2', 'col3'])
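To make the effect of these two parameters concrete, the snippet below emulates locally what index_col and col_order do to a result set; the stand-in DataFrame and its column names mirror the hypothetical example above.

```python
import pandas as pd

# A local stand-in for a BigQuery result set.
raw = pd.DataFrame({'index_column_name': [10, 20, 30],
                    'col3': ['x', 'y', 'z'],
                    'col1': [1.0, 2.0, 3.0],
                    'col2': [True, False, True]})

# index_col='index_column_name' promotes that column to the index;
# col_order=['col1', 'col2', 'col3'] fixes the column order.
result = raw.set_index('index_column_name')[['col1', 'col2', 'col3']]
```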

Note

You can find your project id in the Google developers console.

Note

You can toggle the verbose output via the verbose flag which defaults to True.

Note

The dialect argument can be used to indicate whether to use BigQuery’s 'legacy' SQL or BigQuery’s 'standard' SQL (beta). The default value is 'legacy'. For more information on BigQuery’s standard SQL, see the BigQuery SQL Reference.
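For illustration (the project and table names below are placeholders), the two dialects reference tables differently: legacy SQL brackets the name as [project:dataset.table], while standard SQL uses backticks around project.dataset.table. The dialect flag tells read_gbq which syntax the query uses.

```python
# Placeholder identifiers for illustration only.
legacy_query = 'SELECT * FROM [my-project:test_dataset.test_table] LIMIT 10'
standard_query = 'SELECT * FROM `my-project.test_dataset.test_table` LIMIT 10'

# The dialect flag must match the query's syntax, e.g.:
# pd.read_gbq(legacy_query, projectid)                        # default: 'legacy'
# pd.read_gbq(standard_query, projectid, dialect='standard')
```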

10.11.5 Writing DataFrames

Assume we want to write a DataFrame df into a BigQuery table using to_gbq().

In [1]: df = pd.DataFrame({'my_string': list('abc'),
   ...:                    'my_int64': list(range(1, 4)),
   ...:                    'my_float64': np.arange(4.0, 7.0),
   ...:                    'my_bool1': [True, False, True],
   ...:                    'my_bool2': [False, True, False],
   ...:                    'my_dates': pd.date_range('now', periods=3)})
   ...: 

In [2]: df
Out[2]: 
  my_bool1 my_bool2                   my_dates  my_float64  my_int64 my_string
0     True    False 2016-09-30 13:51:25.040464         4.0         1         a
1    False     True 2016-10-01 13:51:25.040464         5.0         2         b
2     True    False 2016-10-02 13:51:25.040464         6.0         3         c

In [3]: df.dtypes
Out[3]: 
my_bool1                bool
my_bool2                bool
my_dates      datetime64[ns]
my_float64           float64
my_int64               int64
my_string             object
dtype: object

df.to_gbq('my_dataset.my_table', projectid)

Note

The destination table and destination dataset will automatically be created if they do not already exist.

The if_exists argument can be used to dictate whether to 'fail', 'replace' or 'append' if the destination table already exists. The default value is 'fail'.

For example, assume that if_exists is set to 'fail'. The following snippet will raise a TableCreationError if the destination table already exists.

df.to_gbq('my_dataset.my_table', projectid, if_exists='fail')
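As a rough sketch of the decision logic, the helper below is hypothetical — it does not talk to BigQuery and is not part of the gbq module — but it summarizes how the three if_exists modes behave when the destination table does or does not already exist.

```python
def resolve_if_exists(table_exists, if_exists='fail'):
    """Hypothetical illustration of to_gbq's if_exists semantics."""
    if if_exists not in ('fail', 'replace', 'append'):
        raise ValueError("if_exists must be 'fail', 'replace' or 'append'")
    if not table_exists:
        # The table (and dataset) are created automatically when missing.
        return 'create table, then write'
    if if_exists == 'fail':
        # Mirrors the TableCreationError raised by to_gbq.
        raise RuntimeError('TableCreationError: table already exists')
    if if_exists == 'replace':
        return 'drop and re-create table, then write'
    return 'append rows to the existing table'
```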

Note

If the if_exists argument is set to 'append', the destination DataFrame will be written to the table using the defined table schema and column types. The DataFrame must match the destination table in column order, structure, and data types. If the if_exists argument is set to 'replace', and the existing table has a different schema, a delay of 2 minutes will be forced to ensure that the new schema has propagated in the Google environment. See Google BigQuery issue 191.

Writing large DataFrames can result in errors due to size limitations being exceeded. This can be avoided by setting the chunksize argument when calling to_gbq(). For example, the following writes df to a BigQuery table in batches of 10000 rows at a time:

df.to_gbq('my_dataset.my_table', projectid, chunksize=10000)
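The batching itself is straightforward; this sketch shows how a 25,000-row frame splits under chunksize=10000 (the upload step is omitted, and df_large is a throwaway example frame).

```python
import pandas as pd

df_large = pd.DataFrame({'x': range(25000)})
chunksize = 10000

# Split the frame into consecutive batches of at most `chunksize` rows,
# bounding the size of each individual streaming insert.
batches = [df_large.iloc[i:i + chunksize]
           for i in range(0, len(df_large), chunksize)]
```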

You can also monitor the progress of your upload via the verbose flag, which defaults to True. For example:

In [8]: df.to_gbq('my_dataset.my_table', projectid, chunksize=10000, verbose=True)

        Streaming Insert is 10% Complete
        Streaming Insert is 20% Complete
        Streaming Insert is 30% Complete
        Streaming Insert is 40% Complete
        Streaming Insert is 50% Complete
        Streaming Insert is 60% Complete
        Streaming Insert is 70% Complete
        Streaming Insert is 80% Complete
        Streaming Insert is 90% Complete
        Streaming Insert is 100% Complete

Note

If an error occurs while streaming data to BigQuery, see Troubleshooting BigQuery Errors.

Note

The BigQuery SQL query language has some oddities, see the BigQuery Query Reference Documentation.

Note

While BigQuery uses SQL-like syntax, it has some important differences from traditional databases in functionality, in API limitations (on the size and quantity of queries or uploads), and in how Google charges for use of the service. You should refer to the Google BigQuery documentation often, as the service is continually changing and evolving. BigQuery is best for analyzing large sets of data quickly, but it is not a direct replacement for a transactional database.

10.11.6 Creating BigQuery Tables

Warning

As of 0.17, the function generate_bq_schema() has been deprecated and will be removed in a future version.

As of 0.15.2, the gbq module has a function generate_bq_schema(), which produces the dictionary representation of the schema for the specified pandas DataFrame.

In [10]: gbq.generate_bq_schema(df, default_type='STRING')

Out[10]: {'fields': [{'name': 'my_bool1', 'type': 'BOOLEAN'},
         {'name': 'my_bool2', 'type': 'BOOLEAN'},
         {'name': 'my_dates', 'type': 'TIMESTAMP'},
         {'name': 'my_float64', 'type': 'FLOAT'},
         {'name': 'my_int64', 'type': 'INTEGER'},
         {'name': 'my_string', 'type': 'STRING'}]}
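A rough re-implementation of that mapping, keyed on NumPy dtype kinds, is sketched below. This is an illustration only, not the deprecated function itself; anything unrecognized (such as object columns) falls back to default_type.

```python
import pandas as pd

def sketch_bq_schema(df, default_type='STRING'):
    # Map NumPy dtype kinds to BigQuery types; unrecognized kinds
    # (e.g. 'O' for object columns) fall back to default_type.
    kinds = {'i': 'INTEGER', 'u': 'INTEGER', 'b': 'BOOLEAN',
             'f': 'FLOAT', 'M': 'TIMESTAMP'}
    return {'fields': [{'name': col,
                        'type': kinds.get(df[col].dtype.kind, default_type)}
                       for col in df.columns]}
```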

Note

If you delete and re-create a BigQuery table with the same name, but different table schema, you must wait 2 minutes before streaming data into the table. As a workaround, consider creating the new table with a different name. Refer to Google BigQuery issue 191.