10.10 SQL Queries
The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific APIs. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python’s standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.
New in version 0.14.0.
If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility, but this is deprecated and will be removed in a future version). This mode requires a Python database adapter which respects the Python DB-API.
See also some cookbook examples for advanced strategies.
The key functions are:
read_sql_table(table_name, con[, schema, ...])    Read SQL database table into a DataFrame.
read_sql_query(sql, con[, index_col, ...])        Read SQL query into a DataFrame.
read_sql(sql, con[, index_col, ...])              Read SQL query or database table into a DataFrame.
DataFrame.to_sql(name, con[, flavor, ...])        Write records stored in a DataFrame to a SQL database.
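To see how these fit together, here is a minimal round-trip sketch using an in-memory SQLite engine (the toy frame and the table name 'demo' are illustrative assumptions, not from the examples below):

from sqlalchemy import create_engine
import pandas as pd

# Illustrative engine and frame; any DataFrame and database would do.
engine = create_engine('sqlite:///:memory:')
df = pd.DataFrame({'a': [1, 2, 3]})

df.to_sql('demo', engine)                         # write the frame to table 'demo'
pd.read_sql_table('demo', engine)                 # read it back by table name
pd.read_sql_query('SELECT a FROM demo', engine)   # read it back via a raw query
pd.read_sql('demo', engine)                       # delegates to read_sql_table here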
Note
The function read_sql() is a convenience wrapper around read_sql_table() and read_sql_query() (and is kept for backward compatibility). It will delegate to the specific function depending on the provided input: a database table name or a SQL query. Table names do not need to be quoted even if they contain special characters.
In the following example, we use the SQLite SQL database engine. You can use a temporary SQLite database where data are stored in “memory”.
To connect with SQLAlchemy you use the create_engine() function to create an engine object from a database URI. You only need to create the engine once per database you are connecting to. For more information on create_engine() and the URI formatting, see the examples below and the SQLAlchemy documentation.
In [1]: from sqlalchemy import create_engine
# Create your engine.
In [2]: engine = create_engine('sqlite:///:memory:')
If you want to manage your own connections you can pass one of those instead:
with engine.connect() as conn, conn.begin():
    data = pd.read_sql_table('data', conn)
10.10.5 Writing DataFrames
Assuming the following data is in a DataFrame data, we can insert it into the database using to_sql().
id   Date        Col_1  Col_2  Col_3
26   2010-10-18  X      27.5   True
42   2010-10-19  Y      -12.5  False
63   2010-10-20  Z      5.73   True
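For reference, a minimal sketch of constructing this DataFrame (the column dtypes are assumptions inferred from the values shown):

import pandas as pd

# Reconstruction of the example frame shown above.
data = pd.DataFrame({
    'id': [26, 42, 63],
    'Date': pd.to_datetime(['2010-10-18', '2010-10-19', '2010-10-20']),
    'Col_1': ['X', 'Y', 'Z'],
    'Col_2': [27.5, -12.5, 5.73],
    'Col_3': [True, False, True],
})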
In [3]: data.to_sql('data', engine)
With some databases, writing large DataFrames can result in errors due to packet size limitations being exceeded. This can be avoided by setting the chunksize parameter when calling to_sql. For example, the following writes data to the database in batches of 1000 rows at a time:
In [4]: data.to_sql('data_chunked', engine, chunksize=1000)
10.10.5.1 SQL data types
to_sql() will try to map your data to an appropriate SQL data type based on the dtype of the data. When you have columns of dtype object, pandas will try to infer the data type.
You can always override the default type by specifying the desired SQL type of any of the columns by using the dtype argument. This argument needs a dictionary mapping column names to SQLAlchemy types (or strings for the sqlite3 fallback mode).
For example, specifying to use the sqlalchemy String type instead of the default Text type for string columns:
In [5]: from sqlalchemy.types import String
In [6]: data.to_sql('data_dtype', engine, dtype={'Col_1': String})
Note
Due to the limited support for timedeltas in the different database flavors, columns with type timedelta64 will be written as integer values (nanoseconds) to the database, and a warning will be raised.
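A small sketch of the behavior this note describes (the table name td_demo is hypothetical; engine as above):

# timedelta64 columns are written as integer nanoseconds; a warning is raised.
df = pd.DataFrame({'td': pd.to_timedelta(['1 days', '2 days'])})
df.to_sql('td_demo', engine, index=False)
pd.read_sql_table('td_demo', engine)  # 'td' comes back as integers, not timedeltas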
Note
Columns of category dtype will be converted to the dense representation as you would get with np.asarray(categorical) (e.g. for string categories this gives an array of strings). Because of this, reading the database table back in does not generate a categorical.
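For illustration, a minimal sketch (the table name cat_demo is hypothetical; engine as above):

# Categoricals are stored as their dense values and read back as plain columns.
df = pd.DataFrame({'cat': pd.Categorical(['a', 'b', 'a'])})
df.to_sql('cat_demo', engine, index=False)
pd.read_sql_table('cat_demo', engine)['cat'].dtype  # object, not category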
10.10.6 Reading Tables
read_sql_table() will read a database table given the table name and optionally a subset of columns to read.
Note
In order to use read_sql_table(), you must have the SQLAlchemy optional dependency installed.
In [7]: pd.read_sql_table('data', engine)
Out[7]:
index id Date Col_1 Col_2 Col_3
0 0 26 2010-10-18 X 27.50 True
1 1 42 2010-10-19 Y -12.50 False
2 2 63 2010-10-20 Z 5.73 True
You can also specify the name of the column as the DataFrame index, and specify a subset of columns to be read.
In [8]: pd.read_sql_table('data', engine, index_col='id')
Out[8]:
index Date Col_1 Col_2 Col_3
id
26 0 2010-10-18 X 27.50 True
42 1 2010-10-19 Y -12.50 False
63 2 2010-10-20 Z 5.73 True
In [9]: pd.read_sql_table('data', engine, columns=['Col_1', 'Col_2'])
Out[9]:
Col_1 Col_2
0 X 27.50
1 Y -12.50
2 Z 5.73
And you can explicitly force columns to be parsed as dates:
In [10]: pd.read_sql_table('data', engine, parse_dates=['Date'])
Out[10]:
index id Date Col_1 Col_2 Col_3
0 0 26 2010-10-18 X 27.50 True
1 1 42 2010-10-19 Y -12.50 False
2 2 63 2010-10-20 Z 5.73 True
If needed you can explicitly specify a format string, or a dict of arguments to pass to pandas.to_datetime():
pd.read_sql_table('data', engine, parse_dates={'Date': '%Y-%m-%d'})
pd.read_sql_table('data', engine, parse_dates={'Date': {'format': '%Y-%m-%d %H:%M:%S'}})
You can check if a table exists using has_table().
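For example (has_table() lives in pandas.io.sql; the table name follows the examples above):

from pandas.io import sql

sql.has_table('data', engine)  # True if table 'data' exists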
10.10.7 Schema support
New in version 0.15.0.
Reading from and writing to different schemas is supported through the schema keyword in the read_sql_table() and to_sql() functions. Note however that this depends on the database flavor (sqlite does not have schemas). For example:
df.to_sql('table', engine, schema='other_schema')
pd.read_sql_table('table', engine, schema='other_schema')
10.10.8 Querying
You can query using raw SQL in the read_sql_query() function. In this case you must use the SQL variant appropriate for your database. When using SQLAlchemy, you can also pass SQLAlchemy Expression language constructs, which are database-agnostic.
In [11]: pd.read_sql_query('SELECT * FROM data', engine)
Out[11]:
index id Date Col_1 Col_2 Col_3
0 0 26 2010-10-18 00:00:00.000000 X 27.50 1
1 1 42 2010-10-19 00:00:00.000000 Y -12.50 0
2 2 63 2010-10-20 00:00:00.000000 Z 5.73 1
Of course, you can specify a more “complex” query.
In [12]: pd.read_sql_query("SELECT id, Col_1, Col_2 FROM data WHERE id = 42;", engine)
Out[12]:
id Col_1 Col_2
0 42 Y -12.5
The read_sql_query() function supports a chunksize argument. Specifying this will return an iterator through chunks of the query result:
In [13]: df = pd.DataFrame(np.random.randn(20, 3), columns=list('abc'))
In [14]: df.to_sql('data_chunks', engine, index=False)
In [15]: for chunk in pd.read_sql_query("SELECT * FROM data_chunks", engine, chunksize=5):
....: print(chunk)
....:
a b c
0 0.4691 -0.2829 -1.5091
1 -1.1356 1.2121 -0.1732
2 0.1192 -1.0442 -0.8618
3 -2.1046 -0.4949 1.0718
4 0.7216 -0.7068 -1.0396
a b c
0 0.2719 -0.4250 0.5670
1 0.2762 -1.0874 -0.6737
2 0.1136 -1.4784 0.5250
3 0.4047 0.5770 -1.7150
4 -1.0393 -0.3706 -1.1579
a b c
0 -1.3443 0.8449 1.0758
1 -0.1090 1.6436 -1.4694
2 0.3570 -0.6746 -1.7769
3 -0.9689 -1.2945 0.4137
4 0.2767 -0.4720 -0.0140
a b c
0 -0.3625 -0.0062 -0.9231
1 0.8957 0.8052 -1.2064
2 2.5656 1.4313 1.3403
3 -1.1703 -0.2262 0.4108
4 0.8139 0.1320 -0.8273
You can also run a plain query without creating a DataFrame with execute(). This is useful for queries that don’t return values, such as INSERT. This is functionally equivalent to calling execute on the SQLAlchemy engine or db connection object. Again, you must use the SQL syntax variant appropriate for your database.
from pandas.io import sql
sql.execute('SELECT * FROM table_name', engine)
sql.execute('INSERT INTO table_name VALUES(?, ?, ?, ?)', engine, params=[('id', 1, 12.2, True)])
10.10.9 Engine connection examples
As described above, you use the create_engine() function to create an engine object from a database URI, once per database you are connecting to. Some examples:
from sqlalchemy import create_engine
engine = create_engine('postgresql://scott:tiger@localhost:5432/mydatabase')
engine = create_engine('mysql+mysqldb://scott:tiger@localhost/foo')
engine = create_engine('oracle://scott:tiger@127.0.0.1:1521/sidname')
engine = create_engine('mssql+pyodbc://mydsn')
# sqlite://<nohostname>/<path>
# where <path> is relative:
engine = create_engine('sqlite:///foo.db')
# or absolute, starting with a slash:
engine = create_engine('sqlite:////absolute/path/to/foo.db')
For more information see the examples in the SQLAlchemy documentation.
10.10.10 Advanced SQLAlchemy queries
You can use SQLAlchemy constructs to describe your query.
Use sqlalchemy.text() to specify query parameters in a backend-neutral way:
In [16]: import sqlalchemy as sa
In [17]: pd.read_sql(sa.text('SELECT * FROM data where Col_1=:col1'), engine, params={'col1': 'X'})
Out[17]:
index id Date Col_1 Col_2 Col_3
0 0 26 2010-10-18 00:00:00.000000 X 27.5 1
If you have an SQLAlchemy description of your database you can express WHERE conditions using SQLAlchemy expressions:
In [18]: metadata = sa.MetaData()
In [19]: data_table = sa.Table('data', metadata,
....: sa.Column('index', sa.Integer),
....: sa.Column('Date', sa.DateTime),
....: sa.Column('Col_1', sa.String),
....: sa.Column('Col_2', sa.Float),
....: sa.Column('Col_3', sa.Boolean),
....: )
....:
In [20]: pd.read_sql(sa.select([data_table]).where(data_table.c.Col_3 == True), engine)
Out[20]:
index Date Col_1 Col_2 Col_3
0 0 2010-10-18 X 27.50 True
1 2 2010-10-20 Z 5.73 True
You can combine SQLAlchemy expressions with parameters passed to read_sql() using sqlalchemy.bindparam():
In [21]: import datetime as dt
In [22]: expr = sa.select([data_table]).where(data_table.c.Date > sa.bindparam('date'))
In [23]: pd.read_sql(expr, engine, params={'date': dt.datetime(2010, 10, 18)})
Out[23]:
index Date Col_1 Col_2 Col_3
0 1 2010-10-19 Y -12.50 False
1 2 2010-10-20 Z 5.73 True
10.10.11 SQLite fallback
The use of SQLite is supported without using SQLAlchemy. This mode requires a Python database adapter which respects the Python DB-API.
You can create connections like so:
import sqlite3
con = sqlite3.connect(':memory:')
And then issue the following queries:
data.to_sql('data', con)
pd.read_sql_query("SELECT * FROM data", con)