10.9 HDF5 (PyTables)
HDFStore is a dict-like object which reads and writes pandas objects using the high performance HDF5 format via the excellent PyTables library. See the cookbook for some advanced strategies.
Warning
As of version 0.15.0, pandas requires PyTables >= 3.0.0. Stores written with prior versions of pandas / PyTables >= 2.3 are fully compatible (this was the previous minimum required PyTables version).
Warning
There is a PyTables indexing bug which may appear when querying stores using an index. If you see a subset of results being returned, upgrade to PyTables >= 3.2. Stores created previously will need to be rewritten using the updated version.
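If you do need to rewrite such a store after upgrading, a minimal sketch (the file names here are illustrative):
import pandas as pd

# read each object out of the affected store, then write all of them to a fresh file
with pd.HDFStore('store.h5') as old_store:
    data = {key: old_store[key] for key in old_store.keys()}
with pd.HDFStore('store_rewritten.h5', mode='w') as new_store:
    for key, obj in data.items():
        new_store.put(key, obj)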
Warning
As of version 0.17.0, HDFStore will not drop rows that have all missing values by default. Previously, if all values (except the index) were missing, HDFStore would not write those rows to disk.
In [1]: store = pd.HDFStore('store.h5')
In [2]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Empty
Objects can be written to the file just like adding key-value pairs to a dict:
In [3]: np.random.seed(1234)
In [4]: index = pd.date_range('1/1/2000', periods=8)
In [5]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [6]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
...: columns=['A', 'B', 'C'])
...:
In [7]: wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
...: major_axis=pd.date_range('1/1/2000', periods=5),
...: minor_axis=['A', 'B', 'C', 'D'])
...:
# store.put('s', s) is an equivalent method
In [8]: store['s'] = s
In [9]: store['df'] = df
In [10]: store['wp'] = wp
# the type of stored data
In [11]: store.root.wp._v_attrs.pandas_type
Out[11]: 'wide'
In [12]: store
Out[12]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame (shape->[8,3])
/s series (shape->[5])
/wp wide (shape->[2,5,4])
In a current or later Python session, you can retrieve stored objects:
# store.get('df') is an equivalent method
In [13]: store['df']
Out[13]:
A B C
2000-01-01 0.8872 0.8596 -0.6365
2000-01-02 0.0157 -2.2427 1.1500
2000-01-03 0.9919 0.9533 -2.0213
2000-01-04 -0.3341 0.0021 0.4055
2000-01-05 0.2891 1.3212 -1.5469
2000-01-06 -0.2026 -0.6560 0.1934
2000-01-07 0.5534 1.3182 -0.4693
2000-01-08 0.6756 -1.8170 -0.1831
# dotted (attribute) access provides get as well
In [14]: store.df
Out[14]:
A B C
2000-01-01 0.8872 0.8596 -0.6365
2000-01-02 0.0157 -2.2427 1.1500
2000-01-03 0.9919 0.9533 -2.0213
2000-01-04 -0.3341 0.0021 0.4055
2000-01-05 0.2891 1.3212 -1.5469
2000-01-06 -0.2026 -0.6560 0.1934
2000-01-07 0.5534 1.3182 -0.4693
2000-01-08 0.6756 -1.8170 -0.1831
Deletion of the object specified by the key:
# store.remove('wp') is an equivalent method
In [15]: del store['wp']
In [16]: store
Out[16]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame (shape->[8,3])
/s series (shape->[5])
Closing a Store, Context Manager
In [17]: store.close()
In [18]: store
Out[18]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
File is CLOSED
In [19]: store.is_open
Out[19]: False
# Working with, and automatically closing the store with the context
# manager
In [20]: with pd.HDFStore('store.h5') as store:
....: store.keys()
....:
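Retrieval works the same way inside the context manager; a minimal sketch:
# the store is closed automatically when the block exits
with pd.HDFStore('store.h5') as store:
    df_read = store['df']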
10.9.1 Read/Write API
HDFStore supports a top-level API using read_hdf for reading and to_hdf for writing, similar to how read_csv and to_csv work. (new in 0.11.0)
In [21]: df_tl = pd.DataFrame(dict(A=list(range(5)), B=list(range(5))))
In [22]: df_tl.to_hdf('store_tl.h5','table',append=True)
In [23]: pd.read_hdf('store_tl.h5', 'table', where = ['index>2'])
Out[23]:
A B
3 3 3
4 4 4
As of version 0.17.0, HDFStore will no longer drop rows that are all missing by default. This behavior can be enabled by setting dropna=True.
In [24]: df_with_missing = pd.DataFrame({'col1':[0, np.nan, 2],
....: 'col2':[1, np.nan, np.nan]})
....:
In [25]: df_with_missing
Out[25]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN
In [26]: df_with_missing.to_hdf('file.h5', 'df_with_missing',
....: format = 'table', mode='w')
....:
In [27]: pd.read_hdf('file.h5', 'df_with_missing')
Out[27]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN
In [28]: df_with_missing.to_hdf('file.h5', 'df_with_missing',
....: format = 'table', mode='w', dropna=True)
....:
In [29]: pd.read_hdf('file.h5', 'df_with_missing')
Out[29]:
col1 col2
0 0.0 1.0
2 2.0 NaN
This is also true for the major axis of a Panel:
In [30]: matrix = [[[np.nan, np.nan, np.nan],[1,np.nan,np.nan]],
....: [[np.nan, np.nan, np.nan], [np.nan,5,6]],
....: [[np.nan, np.nan, np.nan],[np.nan,3,np.nan]]]
....:
In [31]: panel_with_major_axis_all_missing = pd.Panel(matrix,
....: items=['Item1', 'Item2','Item3'],
....: major_axis=[1,2],
....: minor_axis=['A', 'B', 'C'])
....:
In [32]: panel_with_major_axis_all_missing
Out[32]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 2 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item3
Major_axis axis: 1 to 2
Minor_axis axis: A to C
In [33]: panel_with_major_axis_all_missing.to_hdf('file.h5', 'panel',
....: dropna = True,
....: format='table',
....: mode='w')
....:
In [34]: reloaded = pd.read_hdf('file.h5', 'panel')
In [35]: reloaded
Out[35]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 1 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item3
Major_axis axis: 2 to 2
Minor_axis axis: A to C
10.9.2 Fixed Format
Note
Prior to 0.13.0 this was known as the Storer format.
The examples above show storing using put, which writes the HDF5 to PyTables in a fixed array format, called the fixed format. These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores. This format is specified by default when using put or to_hdf, or by format='fixed' or format='f'.
Warning
A fixed format store will raise a TypeError if you try to retrieve using a where:
pd.DataFrame(np.random.randn(10, 2)).to_hdf('test_fixed.h5', 'df')
pd.read_hdf('test_fixed.h5', 'df', where='index>5')
TypeError: cannot pass a where specification when reading a fixed format.
this store must be selected in its entirety
10.9.3 Table Format
HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is specified by format='table' or format='t' to append or put or to_hdf.
New in version 0.13.
This format can also be set as an option, pd.set_option('io.hdf.default_format', 'table'), to make put/append/to_hdf store in the table format by default.
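For example, a short sketch of setting this option (the file name is illustrative; the option applies globally until changed back):
# make 'table' the default format for put/append/to_hdf
pd.set_option('io.hdf.default_format', 'table')

# this write now creates a table format store without an explicit format=
df.to_hdf('store_default.h5', 'df')

# resetting to None restores the built-in default (fixed)
pd.set_option('io.hdf.default_format', None)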
In [36]: store = pd.HDFStore('store.h5')
In [37]: df1 = df[0:4]
In [38]: df2 = df[4:]
# append data (creates a table automatically)
In [39]: store.append('df', df1)
In [40]: store.append('df', df2)
In [41]: store
Out[41]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])
# select the entire object
In [42]: store.select('df')
Out[42]:
A B C
2000-01-01 0.8872 0.8596 -0.6365
2000-01-02 0.0157 -2.2427 1.1500
2000-01-03 0.9919 0.9533 -2.0213
2000-01-04 -0.3341 0.0021 0.4055
2000-01-05 0.2891 1.3212 -1.5469
2000-01-06 -0.2026 -0.6560 0.1934
2000-01-07 0.5534 1.3182 -0.4693
2000-01-08 0.6756 -1.8170 -0.1831
# the type of stored data
In [43]: store.root.df._v_attrs.pandas_type
Out[43]: 'frame_table'
Note
You can also create a table by passing format='table' or format='t' to a put operation.
10.9.4 Hierarchical Keys
Keys to a store can be specified as a string. These can be in a hierarchical path-name like format (e.g. foo/bar/bah), which will generate a hierarchy of sub-stores (or Groups in PyTables parlance). Keys can be specified without the leading '/' and are ALWAYS absolute (e.g. 'foo' refers to '/foo'). Removal operations can remove everything in the sub-store and BELOW, so be careful.
In [44]: store.put('foo/bar/bah', df)
In [45]: store.append('food/orange', df)
In [46]: store.append('food/apple', df)
In [47]: store
Out[47]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])
/foo/bar/bah frame (shape->[8,3])
/food/apple frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])
/food/orange frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])
# a list of keys are returned
In [48]: store.keys()
Out[48]: ['/df', '/food/apple', '/food/orange', '/foo/bar/bah']
# remove all nodes under this level
In [49]: store.remove('food')
In [50]: store
Out[50]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])
/foo/bar/bah frame (shape->[8,3])
Warning
Hierarchical keys cannot be retrieved as dotted (attribute) access as described above for items stored under the root node.
In [8]: store.foo.bar.bah
AttributeError: 'HDFStore' object has no attribute 'foo'
# you can directly access the actual PyTables node by using the root node
In [9]: store.root.foo.bar.bah
Out[9]:
/foo/bar/bah (Group) ''
children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array), 'axis1' (Array)]
Instead, use explicit string-based keys:
In [51]: store['foo/bar/bah']
Out[51]:
A B C
2000-01-01 0.8872 0.8596 -0.6365
2000-01-02 0.0157 -2.2427 1.1500
2000-01-03 0.9919 0.9533 -2.0213
2000-01-04 -0.3341 0.0021 0.4055
2000-01-05 0.2891 1.3212 -1.5469
2000-01-06 -0.2026 -0.6560 0.1934
2000-01-07 0.5534 1.3182 -0.4693
2000-01-08 0.6756 -1.8170 -0.1831
10.9.5 Storing Types
10.9.5.1 Storing Mixed Types in a Table
Storing mixed-dtype data is supported. Strings are stored with a fixed width using the maximum size of the appended column. Subsequent attempts at appending longer strings will raise a ValueError.
Passing min_itemsize={'values': size} as a parameter to append will set a larger minimum for the string columns. Storing floats, strings, ints, bools, and datetime64 is currently supported. For string columns, passing nan_rep='nan' to append will change the default NaN representation on disk (which converts to/from np.nan); this defaults to nan.
In [52]: df_mixed = pd.DataFrame({'A': np.random.randn(8),
....: 'B': np.random.randn(8),
....: 'C': np.array(np.random.randn(8), dtype='float32'),
....: 'string': 'string',
....: 'int': 1,
....: 'bool': True,
....: 'datetime64': pd.Timestamp('20010102')},
....: index=list(range(8)))
....:
In [53]: df_mixed.ix[3:5,['A', 'B', 'string', 'datetime64']] = np.nan
In [54]: store.append('df_mixed', df_mixed, min_itemsize = {'values': 50})
In [55]: df_mixed1 = store.select('df_mixed')
In [56]: df_mixed1
Out[56]:
A B C bool datetime64 int string
0 0.7047 -1.1527 -0.4301 True 2001-01-02 1 string
1 -0.7854 0.6320 0.7674 True 2001-01-02 1 string
2 0.4621 0.0395 0.9849 True 2001-01-02 1 string
3 NaN NaN 0.2708 True NaT 1 NaN
4 NaN NaN 1.3920 True NaT 1 NaN
5 NaN NaN 0.0798 True NaT 1 NaN
6 2.0078 0.1526 -0.4000 True 2001-01-02 1 string
7 0.2270 0.1645 -1.0279 True 2001-01-02 1 string
In [57]: df_mixed1.get_dtype_counts()
Out[57]:
bool 1
datetime64[ns] 1
float32 1
float64 2
int64 1
object 1
dtype: int64
# we have provided a minimum string column size
In [58]: store.root.df_mixed.table
Out[58]:
/df_mixed/table (Table(8,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
"values_block_1": Float32Col(shape=(1,), dflt=0.0, pos=2),
"values_block_2": Int64Col(shape=(1,), dflt=0, pos=3),
"values_block_3": Int64Col(shape=(1,), dflt=0, pos=4),
"values_block_4": BoolCol(shape=(1,), dflt=False, pos=5),
"values_block_5": StringCol(itemsize=50, shape=(1,), dflt='', pos=6)}
byteorder := 'little'
chunkshape := (689,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
10.9.5.2 Storing Multi-Index DataFrames
Storing multi-index dataframes as tables is very similar to storing/selecting from homogeneous index DataFrames.
In [59]: index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
....: ['one', 'two', 'three']],
....: labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
....: [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
....: names=['foo', 'bar'])
....:
In [60]: df_mi = pd.DataFrame(np.random.randn(10, 3), index=index,
....: columns=['A', 'B', 'C'])
....:
In [61]: df_mi
Out[61]:
A B C
foo bar
foo one -0.5847 0.8166 -0.0819
two -0.3448 0.5283 -1.0690
three -0.5119 0.2912 0.5665
bar one 0.5036 0.2853 0.4843
two 1.3635 -0.7811 -0.4680
baz two 1.2246 -1.2811 0.8755
three -1.7107 -0.4508 0.7492
qux one -0.2039 -0.1822 0.6807
two -1.8185 0.0471 0.3948
three -0.2484 -0.6177 -0.6829
In [62]: store.append('df_mi',df_mi)
In [63]: store.select('df_mi')
Out[63]:
A B C
foo bar
foo one -0.5847 0.8166 -0.0819
two -0.3448 0.5283 -1.0690
three -0.5119 0.2912 0.5665
bar one 0.5036 0.2853 0.4843
two 1.3635 -0.7811 -0.4680
baz two 1.2246 -1.2811 0.8755
three -1.7107 -0.4508 0.7492
qux one -0.2039 -0.1822 0.6807
two -1.8185 0.0471 0.3948
three -0.2484 -0.6177 -0.6829
# the levels are automatically included as data columns
In [64]: store.select('df_mi', 'foo=bar')
Out[64]:
A B C
foo bar
bar one 0.5036 0.2853 0.4843
two 1.3635 -0.7811 -0.4680
10.9.6 Querying
10.9.6.1 Querying a Table
Warning
The query capabilities have changed substantially starting in 0.13.0. Queries from prior versions are accepted (with a DeprecationWarning printed) if they are not string-like.
select and delete operations have an optional criterion that can be specified to select/delete only a subset of the data. This allows one to have a very large on-disk table and retrieve only a portion of the data.
A query is specified using the Term class under the hood, as a boolean expression.
- index and columns are supported indexers of a DataFrame
- major_axis, minor_axis, and items are supported indexers of the Panel
- if data_columns are specified, these can be used as additional indexers
Valid comparison operators are:
=, ==, !=, >, >=, <, <=
Valid boolean expressions are combined with:
- | : or
- & : and
- ( and ) : for grouping
These rules are similar to how boolean expressions are used in pandas for indexing.
Note
- = will be automatically expanded to the comparison operator ==
- ~ is the not operator, but can only be used in very limited circumstances
- If a list/tuple of expressions is passed they will be combined via &
The following are valid expressions:
'index>=date'
"columns=['A', 'D']"
"columns in ['A', 'D']"
'columns=A'
'columns==A'
"~(columns=['A','B'])"
'index>df.index[3] & string="bar"'
'(index>df.index[3] & index<=df.index[6]) | string="bar"'
"ts>=Timestamp('2012-02-01')"
"major_axis>=20130101"
The indexers are on the left-hand side of the sub-expression:
columns, major_axis, ts
The right-hand side of the sub-expression (after a comparison operator) can be:
- functions that will be evaluated, e.g. Timestamp('2012-02-01')
- strings, e.g. "bar"
- date-like, e.g. 20130101, or "20130101"
- lists, e.g. "['A','B']"
- variables that are defined in the local namespace, e.g. date (see the sketch below)
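For instance, a small sketch of the last case, using a local variable on the right-hand side (continuing the session; /df is the table created above):
# 'date' is an ordinary local variable referenced in the expression
date = pd.Timestamp('2000-01-03')
store.select('df', 'index >= date')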
Note
Passing a string to a query by interpolating it into the query expression is not recommended. Simply assign the string of interest to a variable and use that variable in an expression. For example, do this
string = "HolyMoly'"
store.select('df', 'index == string')
instead of this
string = "HolyMoly'"
store.select('df', 'index == %s' % string)
The latter will not work and will raise a SyntaxError. Note that there's a single quote followed by a double quote in the string variable.
If you must interpolate, use the '%r' format specifier
store.select('df', 'index == %r' % string)
which will quote string.
Here are some examples:
In [65]: dfq = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'), index=pd.date_range('20130101', periods=10))
In [66]: store.append('dfq',dfq,format='table',data_columns=True)
Use boolean expressions, with in-line function evaluation.
In [67]: store.select('dfq',"index>pd.Timestamp('20130104') & columns=['A', 'B']")
Out[67]:
A B
2013-01-05 1.2104 0.7974
2013-01-06 -0.8503 1.1768
2013-01-07 0.9842 -0.1217
2013-01-08 0.7966 -0.4740
2013-01-09 -0.8048 -2.1236
2013-01-10 0.3342 0.5368
Use an inline column reference.
In [68]: store.select('dfq',where="A>0 or C>0")
Out[68]:
A B C D
2013-01-01 0.4363 -1.7030 0.3937 -0.4793
2013-01-02 -0.2990 0.6941 0.6786 0.2396
2013-01-03 0.1512 0.8161 1.8935 0.6396
2013-01-04 -0.9620 -2.0853 1.9302 -1.7353
2013-01-05 1.2104 0.7974 -0.3798 0.7026
2013-01-07 0.9842 -0.1217 2.3658 0.4961
2013-01-08 0.7966 -0.4740 -0.0567 1.3578
2013-01-10 0.3342 0.5368 -0.7438 -0.3202
Works with a Panel as well.
In [69]: store.append('wp',wp)
In [70]: store
Out[70]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])
/df_mi frame_table (typ->appendable_multi,nrows->10,ncols->5,indexers->[index],dc->[bar,foo])
/df_mixed frame_table (typ->appendable,nrows->8,ncols->7,indexers->[index])
/dfq frame_table (typ->appendable,nrows->10,ncols->4,indexers->[index],dc->[A,B,C,D])
/foo/bar/bah frame (shape->[8,3])
/wp wide_table (typ->appendable,nrows->20,ncols->2,indexers->[major_axis,minor_axis])
In [71]: store.select('wp', "major_axis>pd.Timestamp('20000102') & minor_axis=['A', 'B']")
Out[71]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to B
The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing a 'columns=list_of_columns_to_filter':
In [72]: store.select('df', "columns=['A', 'B']")
Out[72]:
A B
2000-01-01 0.8872 0.8596
2000-01-02 0.0157 -2.2427
2000-01-03 0.9919 0.9533
2000-01-04 -0.3341 0.0021
2000-01-05 0.2891 1.3212
2000-01-06 -0.2026 -0.6560
2000-01-07 0.5534 1.3182
2000-01-08 0.6756 -1.8170
start and stop parameters can be specified to limit the total search space. These are in terms of the total number of rows in a table.
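For a table format DataFrame this looks like the following sketch (continuing the session):
# scan only the first 5 stored rows, then apply any where criteria
store.select('df', start=0, stop=5)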
# this is effectively what the storage of a Panel looks like
In [73]: wp.to_frame()
Out[73]:
Item1 Item2
major minor
2000-01-01 A 1.0590 0.2153
B -0.3978 0.8410
C 0.3374 -1.4458
D 1.0476 -1.4020
2000-01-02 A 1.0459 -0.1009
B 0.8637 -0.5482
C -0.1221 -0.1446
... ... ...
2000-01-04 B 0.0361 0.3080
C -2.0750 -0.2085
D 0.2478 1.0338
2000-01-05 A -0.8972 -2.4005
B -0.1368 2.0306
C 0.0183 -1.1426
D 0.7554 0.2119
[20 rows x 2 columns]
# limiting the search
In [74]: store.select('wp',"major_axis>20000102 & minor_axis=['A','B']",
....: start=0, stop=10)
....:
Out[74]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 1 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-03 00:00:00
Minor_axis axis: A to B
Note
select will raise a ValueError if the query expression has an unknown variable reference. Usually this means that you are trying to select on a column that is not a data_column.
select will raise a SyntaxError if the query expression is not valid.
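For example, a sketch of the first failure mode (continuing the session; the columns of /df were not made data columns):
# raises ValueError, since 'A' is neither an indexable nor a data column of /df
store.select('df', 'A > 0')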
10.9.6.2 Using timedelta64[ns]
New in version 0.13.
Beginning in 0.13.0, you can store and query using the timedelta64[ns] type. Terms can be specified in the format <float>(<unit>), where float may be signed (and fractional), and unit can be D, s, ms, us, or ns for the timedelta. Here's an example:
In [75]: from datetime import timedelta
In [76]: dftd = pd.DataFrame(dict(A = pd.Timestamp('20130101'), B = [ pd.Timestamp('20130101') + timedelta(days=i,seconds=10) for i in range(10) ]))
In [77]: dftd['C'] = dftd['A']-dftd['B']
In [78]: dftd
Out[78]:
A B C
0 2013-01-01 2013-01-01 00:00:10 -1 days +23:59:50
1 2013-01-01 2013-01-02 00:00:10 -2 days +23:59:50
2 2013-01-01 2013-01-03 00:00:10 -3 days +23:59:50
3 2013-01-01 2013-01-04 00:00:10 -4 days +23:59:50
4 2013-01-01 2013-01-05 00:00:10 -5 days +23:59:50
5 2013-01-01 2013-01-06 00:00:10 -6 days +23:59:50
6 2013-01-01 2013-01-07 00:00:10 -7 days +23:59:50
7 2013-01-01 2013-01-08 00:00:10 -8 days +23:59:50
8 2013-01-01 2013-01-09 00:00:10 -9 days +23:59:50
9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
In [79]: store.append('dftd',dftd,data_columns=True)
In [80]: store.select('dftd',"C<'-3.5D'")
Out[80]:
A B C
4 2013-01-01 2013-01-05 00:00:10 -5 days +23:59:50
5 2013-01-01 2013-01-06 00:00:10 -6 days +23:59:50
6 2013-01-01 2013-01-07 00:00:10 -7 days +23:59:50
7 2013-01-01 2013-01-08 00:00:10 -8 days +23:59:50
8 2013-01-01 2013-01-09 00:00:10 -9 days +23:59:50
9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
10.9.6.3 Indexing
You can create/modify an index for a table with create_table_index
after data is already in the table (after and append/put
operation). Creating a table index is highly encouraged. This will
speed your queries a great deal when you use a select
with the
indexed dimension as the where
.
Note
Indexes are automagically created (starting 0.10.1) on the indexables and any data columns you specify. This behavior can be turned off by passing index=False to append.
# we have automagically already created an index (in the first section)
In [81]: i = store.root.df.table.cols.index.index
In [82]: i.optlevel, i.kind
Out[82]: (6, 'medium')
# change an index by passing new parameters
In [83]: store.create_table_index('df', optlevel=9, kind='full')
In [84]: i = store.root.df.table.cols.index.index
In [85]: i.optlevel, i.kind
Out[85]: (9, 'full')
Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then recreate at the end.
In [86]: df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
In [87]: df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
In [88]: st = pd.HDFStore('appends.h5',mode='w')
In [89]: st.append('df', df_1, data_columns=['B'], index=False)
In [90]: st.append('df', df_2, data_columns=['B'], index=False)
In [91]: st.get_storer('df').table
Out[91]:
/df/table (Table(20,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
Then create the index when finished appending.
In [92]: st.create_table_index('df', columns=['B'], optlevel=9, kind='full')
In [93]: st.get_storer('df').table
Out[93]:
/df/table (Table(20,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoindex := True
colindexes := {
"B": Index(9, full, shuffle, zlib(1)).is_csi=True}
In [94]: st.close()
See here for how to create a completely-sorted-index (CSI) on an existing store.
10.9.6.4 Query via Data Columns
You can designate (and index) certain columns that you want to be able to perform queries on (other than the indexable columns, which you can always query). For instance, say you want to perform this common operation, on-disk, and return just the frame that matches this query. You can specify data_columns=True to force all columns to be data_columns.
In [95]: df_dc = df.copy()
In [96]: df_dc['string'] = 'foo'
In [97]: df_dc.ix[4:6,'string'] = np.nan
In [98]: df_dc.ix[7:9,'string'] = 'bar'
In [99]: df_dc['string2'] = 'cool'
In [100]: df_dc.ix[1:3,['B','C']] = 1.0
In [101]: df_dc
Out[101]:
A B C string string2
2000-01-01 0.8872 0.8596 -0.6365 foo cool
2000-01-02 0.0157 1.0000 1.0000 foo cool
2000-01-03 0.9919 1.0000 1.0000 foo cool
2000-01-04 -0.3341 0.0021 0.4055 foo cool
2000-01-05 0.2891 1.3212 -1.5469 NaN cool
2000-01-06 -0.2026 -0.6560 0.1934 NaN cool
2000-01-07 0.5534 1.3182 -0.4693 foo cool
2000-01-08 0.6756 -1.8170 -0.1831 bar cool
# on-disk operations
In [102]: store.append('df_dc', df_dc, data_columns = ['B', 'C', 'string', 'string2'])
In [103]: store.select('df_dc', [ pd.Term('B>0') ])
Out[103]:
A B C string string2
2000-01-01 0.8872 0.8596 -0.6365 foo cool
2000-01-02 0.0157 1.0000 1.0000 foo cool
2000-01-03 0.9919 1.0000 1.0000 foo cool
2000-01-04 -0.3341 0.0021 0.4055 foo cool
2000-01-05 0.2891 1.3212 -1.5469 NaN cool
2000-01-07 0.5534 1.3182 -0.4693 foo cool
# getting creative
In [104]: store.select('df_dc', 'B > 0 & C > 0 & string == foo')
Out[104]:
A B C string string2
2000-01-02 0.0157 1.0000 1.0000 foo cool
2000-01-03 0.9919 1.0000 1.0000 foo cool
2000-01-04 -0.3341 0.0021 0.4055 foo cool
# this is in-memory version of this type of selection
In [105]: df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == 'foo')]
Out[105]:
A B C string string2
2000-01-02 0.0157 1.0000 1.0000 foo cool
2000-01-03 0.9919 1.0000 1.0000 foo cool
2000-01-04 -0.3341 0.0021 0.4055 foo cool
# we have automagically created this index and the B/C/string/string2
# columns are stored separately as ``PyTables`` columns
In [106]: store.root.df_dc.table
Out[106]:
/df_dc/table (Table(8,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2),
"C": Float64Col(shape=(), dflt=0.0, pos=3),
"string": StringCol(itemsize=3, shape=(), dflt='', pos=4),
"string2": StringCol(itemsize=4, shape=(), dflt='', pos=5)}
byteorder := 'little'
chunkshape := (1680,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"C": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"B": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"string2": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"string": Index(6, medium, shuffle, zlib(1)).is_csi=False}
There is some performance degradation by making lots of columns into data columns, so it is up to the user to designate these. In addition, you cannot change data columns (nor indexables) after the first append/put operation (of course you can simply read in the data and create a new table!).
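A sketch of the data_columns=True shortcut mentioned above (the key name is illustrative):
# index every column as a data column, trading away some write performance
store.append('df_dc_all', df_dc, data_columns=True)
store.select('df_dc_all', 'B > 0 & string2 == cool')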
10.9.6.5 Iterator
Starting in 0.11.0, you can pass iterator=True or chunksize=number_in_a_chunk to select and select_as_multiple to return an iterator on the results. The default is 50,000 rows returned in a chunk.
In [107]: for df in store.select('df', chunksize=3):
.....: print(df)
.....:
A B C
2000-01-01 0.8872 0.8596 -0.6365
2000-01-02 0.0157 -2.2427 1.1500
2000-01-03 0.9919 0.9533 -2.0213
A B C
2000-01-04 -0.3341 0.0021 0.4055
2000-01-05 0.2891 1.3212 -1.5469
2000-01-06 -0.2026 -0.6560 0.1934
A B C
2000-01-07 0.5534 1.3182 -0.4693
2000-01-08 0.6756 -1.8170 -0.1831
Note
New in version 0.12.0.
You can also use the iterator with read_hdf, which will open, then automatically close the store when finished iterating.
for df in pd.read_hdf('store.h5','df', chunksize=3):
print(df)
Note that the chunksize keyword applies to the source rows. So if you are doing a query, the chunksize will subdivide the total rows in the table with the query applied, returning an iterator of potentially unequal sized chunks.
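A short sketch of that interaction (continuing the session; the printed chunk lengths may differ):
# the query is applied within each chunk of source rows,
# so the yielded chunks can have unequal lengths
for chunk in store.select('dfq', 'A > 0', chunksize=4):
    print(len(chunk))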
Here is a recipe for generating a query and using it to create equal sized return chunks.
In [108]: dfeq = pd.DataFrame({'number': np.arange(1,11)})
In [109]: dfeq
Out[109]:
number
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
In [110]: store.append('dfeq', dfeq, data_columns=['number'])
In [111]: def chunks(l, n):
.....: return [l[i:i+n] for i in range(0, len(l), n)]
.....:
In [112]: evens = [2,4,6,8,10]
In [113]: coordinates = store.select_as_coordinates('dfeq','number=evens')
In [114]: for c in chunks(coordinates, 2):
.....: print(store.select('dfeq', where=c))
.....:
number
1 2
3 4
number
5 6
7 8
number
9 10
10.9.6.6 Advanced Queries
10.9.6.6.1 Select a Single Column
To retrieve a single indexable or data column, use the method select_column. This will, for example, enable you to get the index very quickly. These return a Series of the result, indexed by the row number. These do not currently accept the where selector.
In [115]: store.select_column('df_dc', 'index')
Out[115]:
0 2000-01-01
1 2000-01-02
2 2000-01-03
3 2000-01-04
4 2000-01-05
5 2000-01-06
6 2000-01-07
7 2000-01-08
Name: index, dtype: datetime64[ns]
In [116]: store.select_column('df_dc', 'string')
Out[116]:
0 foo
1 foo
2 foo
3 foo
4 NaN
5 NaN
6 foo
7 bar
Name: string, dtype: object
10.9.6.6.2 Selecting coordinates
Sometimes you want to get the coordinates (a.k.a. the index locations) of your query. This returns an Int64Index of the resulting locations. These coordinates can also be passed to subsequent where operations.
In [117]: df_coord = pd.DataFrame(np.random.randn(1000,2),index=pd.date_range('20000101',periods=1000))
In [118]: store.append('df_coord',df_coord)
In [119]: c = store.select_as_coordinates('df_coord','index>20020101')
In [120]: c.summary()
Out[120]: u'Int64Index: 268 entries, 732 to 999'
In [121]: store.select('df_coord',where=c)
Out[121]:
0 1
2002-01-02 -0.1783 -0.0646
2002-01-03 -1.2050 -3.8809
2002-01-04 0.9745 0.4152
2002-01-05 1.7520 0.4850
2002-01-06 -0.1709 0.7489
2002-01-07 0.6298 0.8111
2002-01-08 2.1338 0.2385
... ... ...
2002-09-20 -0.1814 0.6124
2002-09-21 -0.7633 -0.3550
2002-09-22 -0.2618 0.8121
2002-09-23 0.4826 -0.8865
2002-09-24 -0.0378 -0.5630
2002-09-25 0.8977 0.3832
2002-09-26 -1.3248 1.1393
[268 rows x 2 columns]
10.9.6.6.3 Selecting using a where mask
Sometimes your query can involve creating a list of rows to select. Usually this mask would be a resulting index from an indexing operation. This example selects the rows of a DatetimeIndex whose month is 5.
In [122]: df_mask = pd.DataFrame(np.random.randn(1000,2),index=pd.date_range('20000101',periods=1000))
In [123]: store.append('df_mask',df_mask)
In [124]: c = store.select_column('df_mask','index')
In [125]: where = c[pd.DatetimeIndex(c).month==5].index
In [126]: store.select('df_mask',where=where)
Out[126]:
0 1
2000-05-01 -1.0062 -0.6168
2000-05-02 0.2189 0.7178
2000-05-03 0.0133 1.3481
2000-05-04 0.6622 -1.0506
2000-05-05 -1.0349 -0.2432
2000-05-06 -0.7534 -1.4543
2000-05-07 -1.0229 -0.4770
... ... ...
2002-05-25 -0.5091 -0.3894
2002-05-26 0.1507 1.1643
2002-05-27 -0.3329 0.1152
2002-05-28 -1.0481 -0.6057
2002-05-29 1.4188 -0.4428
2002-05-30 -0.4332 0.8350
2002-05-31 -1.0413 1.4018
[93 rows x 2 columns]
10.9.6.6.4 Storer Object
If you want to inspect the stored object, retrieve it via get_storer. You could use this programmatically to, say, get the number of rows in an object.
In [127]: store.get_storer('df_dc').nrows
Out[127]: 8
10.9.6.7 Multiple Table Queries
New in 0.10.1 are the methods append_to_multiple and select_as_multiple, which can perform appending/selecting from multiple tables at once. The idea is to have one table (call it the selector table) on which you index most/all of the columns, and perform your queries. The other table(s) are data tables with an index matching the selector table's index. You can then perform a very fast query on the selector table, yet get lots of data back. This method is similar to having a very wide table, but enables more efficient queries.
The append_to_multiple method splits a given single DataFrame into multiple tables according to d, a dictionary that maps the table names to a list of 'columns' you want in that table. If None is used in place of a list, that table will have the remaining unspecified columns of the given DataFrame. The argument selector defines which table is the selector table (which you can make queries from).
The argument dropna will drop rows from the input DataFrame to ensure tables are synchronized. This means that if a row for one of the tables being written to is entirely np.nan, that row will be dropped from all tables.
If dropna is False, THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES. Remember that entirely np.nan rows are not written to the HDFStore, so if you choose to call dropna=False, some tables may have more rows than others, and therefore select_as_multiple may not work or it may return unexpected results.
In [128]: df_mt = pd.DataFrame(np.random.randn(8, 6), index=pd.date_range('1/1/2000', periods=8),
.....: columns=['A', 'B', 'C', 'D', 'E', 'F'])
.....:
In [129]: df_mt['foo'] = 'bar'
In [130]: df_mt.ix[1, ('A', 'B')] = np.nan
# you can also create the tables individually
In [131]: store.append_to_multiple({'df1_mt': ['A', 'B'], 'df2_mt': None },
.....: df_mt, selector='df1_mt')
.....:
In [132]: store
Out[132]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])
/df1_mt frame_table (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A,B])
/df2_mt frame_table (typ->appendable,nrows->8,ncols->5,indexers->[index])
/df_coord frame_table (typ->appendable,nrows->1000,ncols->2,indexers->[index])
/df_dc frame_table (typ->appendable,nrows->8,ncols->5,indexers->[index],dc->[B,C,string,string2])
/df_mask frame_table (typ->appendable,nrows->1000,ncols->2,indexers->[index])
/df_mi frame_table (typ->appendable_multi,nrows->10,ncols->5,indexers->[index],dc->[bar,foo])
/df_mixed frame_table (typ->appendable,nrows->8,ncols->7,indexers->[index])
/dfeq frame_table (typ->appendable,nrows->10,ncols->1,indexers->[index],dc->[number])
/dfq frame_table (typ->appendable,nrows->10,ncols->4,indexers->[index],dc->[A,B,C,D])
/dftd frame_table (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[A,B,C])
/foo/bar/bah frame (shape->[8,3])
/wp wide_table (typ->appendable,nrows->20,ncols->2,indexers->[major_axis,minor_axis])
# individual tables were created
In [133]: store.select('df1_mt')
Out[133]:
A B
2000-01-01 0.7147 0.3182
2000-01-02 NaN NaN
2000-01-03 -0.0869 0.4169
2000-01-04 0.4891 -0.2533
2000-01-05 -0.3830 -0.3974
2000-01-06 0.5381 0.2264
2000-01-07 -2.0735 -0.1159
2000-01-08 -0.6954 0.4025
In [134]: store.select('df2_mt')
Out[134]:
C D E F foo
2000-01-01 0.6075 0.7909 0.8522 0.0967 bar
2000-01-02 0.8110 -0.3568 1.0471 0.6647 bar
2000-01-03 -0.7644 -0.2872 -0.0894 -1.0351 bar
2000-01-04 -1.9481 -0.1166 0.8006 -0.7962 bar
2000-01-05 -0.7176 0.1570 -0.3447 -0.1712 bar
2000-01-06 1.5417 0.2053 1.9981 0.9536 bar
2000-01-07 1.3911 0.3030 1.0933 -0.1010 bar
2000-01-08 -1.5076 0.0896 0.6588 -1.0376 bar
# as a multiple
In [135]: store.select_as_multiple(['df1_mt', 'df2_mt'], where=['A>0', 'B>0'],
.....: selector = 'df1_mt')
.....:
Out[135]:
A B C D E F foo
2000-01-01 0.7147 0.3182 0.6075 0.7909 0.8522 0.0967 bar
2000-01-06 0.5381 0.2264 1.5417 0.2053 1.9981 0.9536 bar
10.9.7 Delete from a Table
You can delete from a table selectively by specifying a where. In deleting rows, it is important to understand that PyTables deletes rows by erasing the rows, then moving the following data. Thus deleting can potentially be a very expensive operation depending on the orientation of your data. This is especially true in higher dimensional objects (Panel and Panel4D). To get optimal performance, it's worthwhile to have the dimension you are deleting be the first of the indexables.
Data is ordered (on the disk) in terms of the indexables. Here's a simple use case. You store panel-type data, with dates in the major_axis and ids in the minor_axis. The data is then interleaved like this:
- date_1
  - id_1
  - id_2
  - .
  - id_n
- date_2
  - id_1
  - .
  - id_n
It should be clear that a delete operation on the major_axis will be fairly quick, as one chunk is removed, then the following data moved. On the other hand a delete operation on the minor_axis will be very expensive. In this case it would almost certainly be faster to rewrite the table using a where that selects all but the missing data.
# returns the number of rows deleted
In [136]: store.remove('wp', 'major_axis>20000102' )
Out[136]: 12
In [137]: store.select('wp')
Out[137]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-02 00:00:00
Minor_axis axis: A to D
Warning
Please note that HDF5 DOES NOT RECLAIM SPACE in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again WILL TEND TO INCREASE THE FILE SIZE. To repack and clean the file, use ptrepack.
10.9.8 Notes & Caveats
10.9.8.1 Compression
PyTables allows the stored data to be compressed. This applies to all kinds of stores, not just tables.
- Pass complevel=int for a compression level (1-9, with 0 being no compression, and the default)
- Pass complib=lib where lib is any of zlib, bzip2, lzo, blosc for whichever compression library you prefer
HDFStore will use the file based compression scheme if no overriding complib or complevel options are provided. blosc offers very fast compression, and is my most used. Note that lzo and bzip2 may not be installed (by Python) by default.
Compression for all objects within the file:
store_compressed = pd.HDFStore('store_compressed.h5', complevel=9, complib='blosc')
Or on-the-fly compression (this only applies to tables). You can turn off file compression for a specific table by passing complevel=0:
store.append('df', df, complib='zlib', complevel=5)
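And, as a sketch, turning compression off for a single table inside an otherwise-compressed file (the key name is illustrative):
# written uncompressed even though the file default is complevel=9/blosc
store_compressed.append('df_uncompressed', df, complevel=0)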
10.9.8.2 ptrepack
PyTables offers better write performance when tables are compressed after they are written, as opposed to turning on compression at the very beginning. You can use the supplied PyTables utility ptrepack. In addition, ptrepack can change compression levels after the fact.
ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5
Furthermore ptrepack in.h5 out.h5 will repack the file to allow you to reuse previously deleted space. Alternatively, one can simply remove the file and write again, or use the copy method.
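A sketch of the copy alternative (the target file name is illustrative):
# copy to a new file; space from previously deleted nodes is not carried over
new_store = store.copy('store_repacked.h5')
new_store.close()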
10.9.8.3 Caveats
Warning
HDFStore is not thread-safe for writing. The underlying PyTables only supports concurrent reads (via threading or processes). If you need reading and writing at the same time, you need to serialize these operations in a single thread in a single process. You will corrupt your data otherwise. See GH2397 for more information.
- If you use locks to manage write access between multiple processes, you may want to use fsync() before releasing write locks. For convenience you can use store.flush(fsync=True) to do this for you.
- Once a table is created its items (Panel) / columns (DataFrame) are fixed; only exactly the same columns can be appended
- Be aware that timezones (e.g., pytz.timezone('US/Eastern')) are not necessarily equal across timezone versions. So if data is localized to a specific timezone in the HDFStore using one version of a timezone library and that data is updated with another version, the data will be converted to UTC since these timezones are not considered equal. Either use the same version of timezone library or use tz_convert with the updated timezone definition.
Warning
PyTables will show a NaturalNameWarning if a column name cannot be used as an attribute selector. Natural identifiers contain only letters, numbers, and underscores, and may not begin with a number. Other identifiers cannot be used in a where clause and are generally a bad idea.
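As an illustration, a hedged sketch (the frame and key are hypothetical, and the exact warning behavior may vary by version):
# 'my column' contains a space, so it is not a natural identifier;
# PyTables warns on write, and it cannot be referenced in a where clause
df_bad = pd.DataFrame({'my column': [1, 2, 3]})
store.append('df_bad', df_bad, data_columns=True)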
10.9.9 DataTypes
HDFStore will map an object dtype to the PyTables underlying dtype. This means the following types are known to work:

Type | Represents missing values
---|---
floating : float64, float32, float16 | np.nan
integer : int64, int32, int8, uint64, uint32, uint8 |
boolean |
datetime64[ns] | NaT
timedelta64[ns] | NaT
categorical : see the section below |
object : strings | np.nan

unicode columns are not supported, and WILL FAIL.
10.9.9.1 Categorical Data
New in version 0.15.2.
Writing data to a HDFStore that contains a category dtype was implemented in 0.15.2. Queries work the same as if it were an object array. However, the category dtyped data is stored in a more efficient manner.
In [138]: dfcat = pd.DataFrame({ 'A' : pd.Series(list('aabbcdba')).astype('category'),
.....: 'B' : np.random.randn(8) })
.....:
In [139]: dfcat
Out[139]:
A B
0 a 0.6033
1 a 0.2626
2 b -0.9796
3 b 2.1324
4 c 0.8925
5 d 1.9965
6 b 0.2314
7 a 0.9801
In [140]: dfcat.dtypes
Out[140]:
A category
B float64
dtype: object
In [141]: cstore = pd.HDFStore('cats.h5', mode='w')
In [142]: cstore.append('dfcat', dfcat, format='table', data_columns=['A'])
In [143]: result = cstore.select('dfcat', where="A in ['b','c']")
In [144]: result
Out[144]:
A B
2 b -0.9796
3 b 2.1324
4 c 0.8925
6 b 0.2314
In [145]: result.dtypes
Out[145]:
A category
B float64
dtype: object
Warning
The format of the Categorical is readable by prior versions of pandas (< 0.15.2), but will retrieve the data as an integer based column (e.g. the codes). However, the categories can be retrieved, but require the user to select them manually using the explicit meta path.
The data is stored like so:
In [146]: cstore
Out[146]:
<class 'pandas.io.pytables.HDFStore'>
File path: cats.h5
/dfcat frame_table (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A])
/dfcat/meta/A/meta series_table (typ->appendable,nrows->4,ncols->1,indexers->[index],dc->[values])
# to get the categories
In [147]: cstore.select('dfcat/meta/A/meta')
Out[147]:
0 a
1 b
2 c
3 d
dtype: object
10.9.9.2 String Columns
min_itemsize
The underlying implementation of HDFStore uses a fixed column width (itemsize) for string columns. A string column itemsize is calculated as the maximum length of the data (for that column) that is passed to the HDFStore in the first append. Subsequent appends may introduce a string for a column larger than the column can hold; in that case an Exception will be raised (otherwise you could have a silent truncation of these columns, leading to loss of information). In the future we may relax this and allow a user-specified truncation to occur.
Pass min_itemsize on the first table creation to specify a priori the minimum length of a particular string column. min_itemsize can be an integer, or a dict mapping a column name to an integer. You can pass values as a key to allow all indexables or data_columns to have this min_itemsize.
Starting in 0.11.0, passing a min_itemsize dict will cause all passed columns to be created as data_columns automatically.
Note
If you are not passing any data_columns, then the min_itemsize will be the maximum of the length of any string passed.
In [148]: dfs = pd.DataFrame(dict(A = 'foo', B = 'bar'),index=list(range(5)))
In [149]: dfs
Out[149]:
A B
0 foo bar
1 foo bar
2 foo bar
3 foo bar
4 foo bar
# A and B have a size of 30
In [150]: store.append('dfs', dfs, min_itemsize = 30)
In [151]: store.get_storer('dfs').table
Out[151]:
/dfs/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=30, shape=(2,), dflt='', pos=1)}
byteorder := 'little'
chunkshape := (963,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
# A is created as a data_column with a size of 30
# B's size is calculated
In [152]: store.append('dfs2', dfs, min_itemsize = { 'A' : 30 })
In [153]: store.get_storer('dfs2').table
Out[153]:
/dfs2/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=3, shape=(1,), dflt='', pos=1),
"A": StringCol(itemsize=30, shape=(), dflt='', pos=2)}
byteorder := 'little'
chunkshape := (1598,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
nan_rep
String columns will serialize a np.nan (a missing value) with the nan_rep string representation. This defaults to the string value nan. You could inadvertently turn an actual nan value into a missing value.
In [154]: dfss = pd.DataFrame(dict(A = ['foo','bar','nan']))
In [155]: dfss
Out[155]:
A
0 foo
1 bar
2 nan
In [156]: store.append('dfss', dfss)
In [157]: store.select('dfss')
Out[157]:
A
0 foo
1 bar
2 NaN
# here you need to specify a different nan rep
In [158]: store.append('dfss2', dfss, nan_rep='_nan_')
In [159]: store.select('dfss2')
Out[159]:
A
0 foo
1 bar
2 nan
10.9.10 External Compatibility
HDFStore writes table format objects in specific formats suitable for producing loss-less round trips to pandas objects. For external compatibility, HDFStore can read native PyTables format tables.
It is possible to write an HDFStore object that can easily be imported into R using the rhdf5 library (Package website). Create a table format store like this:
In [160]: np.random.seed(1)
In [161]: df_for_r = pd.DataFrame({"first": np.random.rand(100),
.....: "second": np.random.rand(100),
.....: "class": np.random.randint(0, 2, (100,))},
.....: index=range(100))
.....:
In [162]: df_for_r.head()
Out[162]:
class first second
0 0 0.4170 0.3266
1 0 0.7203 0.5271
2 1 0.0001 0.8859
3 1 0.3023 0.3573
4 1 0.1468 0.9085
In [163]: store_export = pd.HDFStore('export.h5')
In [164]: store_export.append('df_for_r', df_for_r, data_columns=df_for_r.columns)
In [165]: store_export
Out[165]:
<class 'pandas.io.pytables.HDFStore'>
File path: export.h5
/df_for_r frame_table (typ->appendable,nrows->100,ncols->3,indexers->[index])
In R this file can be read into a data.frame object using the rhdf5 library. The following example function reads the corresponding column names and data values from the file and assembles them into a data.frame:
# Load values and column names for all datasets from corresponding nodes and
# insert them into one data.frame object.
library(rhdf5)
loadhdf5data <- function(h5File) {
listing <- h5ls(h5File)
# Find all data nodes, values are stored in *_values and corresponding column
# titles in *_items
data_nodes <- grep("_values", listing$name)
name_nodes <- grep("_items", listing$name)
data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
columns = list()
for (idx in seq(data_paths)) {
# NOTE: matrices returned by h5read have to be transposed to obtain
# required Fortran order!
data <- data.frame(t(h5read(h5File, data_paths[idx])))
names <- t(h5read(h5File, name_paths[idx]))
entry <- data.frame(data)
colnames(entry) <- names
columns <- append(columns, entry)
}
data <- data.frame(columns)
return(data)
}
Now you can import the DataFrame into R:
> data = loadhdf5data("export.h5")
> head(data)
first second class
1 0.4170220047 0.3266449 0
2 0.7203244934 0.5270581 0
3 0.0001143748 0.8859421 1
4 0.3023325726 0.3572698 1
5 0.1467558908 0.9085352 1
6 0.0923385948 0.6233601 1
Note
The R function lists the entire HDF5 file's contents and assembles the data.frame object from all matching nodes, so use this only as a starting point if you have stored multiple DataFrame objects to a single HDF5 file.
10.9.11 Backwards Compatibility
Version 0.10.1 of HDFStore can read tables created in a prior version of pandas, however query terms using the prior (undocumented) methodology are unsupported. HDFStore will issue a warning if you try to use a legacy-format file. You must read in the entire file and write it out using the new format, using the method copy to take advantage of the updates. The group attribute pandas_version contains the version information. copy takes a number of options; please see the docstring.
In [166]: import os
In [167]: legacy_file_path = os.path.abspath('source/_static/legacy_0.10.h5')
# a legacy store
In [168]: legacy_store = pd.HDFStore(legacy_file_path,'r')
In [169]: legacy_store
Out[169]:
<class 'pandas.io.pytables.HDFStore'>
File path: /home/takanori/git_local/pandas/source/_static/legacy_0.10.h5
/a series (shape->[30])
/b frame (shape->[30,4])
/df1_mixed frame_table [0.10.0] (typ->appendable,nrows->30,ncols->11,indexers->[index])
/foo/bar wide (shape->[3,30,4])
/p1_mixed wide_table [0.10.0] (typ->appendable,nrows->120,ncols->9,indexers->[major_axis,minor_axis])
/p4d_mixed ndim_table [0.10.0] (typ->appendable,nrows->360,ncols->9,indexers->[items,major_axis,minor_axis])
# copy (and return the new handle)
In [170]: new_store = legacy_store.copy('store_new.h5')
In [171]: new_store
Out[171]:
<class 'pandas.io.pytables.HDFStore'>
File path: store_new.h5
/a series (shape->[30])
/b frame (shape->[30,4])
/df1_mixed frame_table (typ->appendable,nrows->30,ncols->11,indexers->[index])
/foo/bar wide (shape->[3,30,4])
/p1_mixed wide_table (typ->appendable,nrows->120,ncols->9,indexers->[major_axis,minor_axis])
/p4d_mixed wide_table (typ->appendable,nrows->360,ncols->9,indexers->[items,major_axis,minor_axis])
In [172]: new_store.close()
10.9.12 Performance
- The tables format comes with a writing performance penalty as compared to fixed stores. The benefit is the ability to append/delete and query (potentially very large amounts of data). Write times are generally longer as compared with regular stores. Query times can be quite fast, especially on an indexed axis.
- You can pass chunksize=<int> to append, specifying the write chunksize (default is 50000). This will significantly lower your memory usage on writing.
- You can pass expectedrows=<int> to the first append, to set the TOTAL number of rows that PyTables will expect. This will optimize read/write performance (see the sketch after this list).
- Duplicate rows can be written to tables, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs)
- A PerformanceWarning will be raised if you are attempting to store types that will be pickled by PyTables (rather than stored as endemic types). See Here for more information and some solutions.
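A sketch of the expectedrows hint referenced above (the row count and key are illustrative):
# tell PyTables up front roughly how many rows this table will eventually hold
store.append('df_big', dfq, expectedrows=1000000)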
10.9.13 Experimental
HDFStore supports Panel4D storage.
In [173]: p4d = pd.Panel4D({ 'l1' : wp })
In [174]: p4d
Out[174]:
<class 'pandas.core.panelnd.Panel4D'>
Dimensions: 1 (labels) x 2 (items) x 5 (major_axis) x 4 (minor_axis)
Labels axis: l1 to l1
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
In [175]: store.append('p4d', p4d)
In [176]: store
Out[176]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])
/df1_mt frame_table (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A,B])
/df2_mt frame_table (typ->appendable,nrows->8,ncols->5,indexers->[index])
/df_coord frame_table (typ->appendable,nrows->1000,ncols->2,indexers->[index])
/df_dc frame_table (typ->appendable,nrows->8,ncols->5,indexers->[index],dc->[B,C,string,string2])
/df_mask frame_table (typ->appendable,nrows->1000,ncols->2,indexers->[index])
/df_mi frame_table (typ->appendable_multi,nrows->10,ncols->5,indexers->[index],dc->[bar,foo])
/df_mixed frame_table (typ->appendable,nrows->8,ncols->7,indexers->[index])
/dfeq frame_table (typ->appendable,nrows->10,ncols->1,indexers->[index],dc->[number])
/dfq frame_table (typ->appendable,nrows->10,ncols->4,indexers->[index],dc->[A,B,C,D])
/dfs frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index])
/dfs2 frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A])
/dfss frame_table (typ->appendable,nrows->3,ncols->1,indexers->[index])
/dfss2 frame_table (typ->appendable,nrows->3,ncols->1,indexers->[index])
/dftd frame_table (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[A,B,C])
/foo/bar/bah frame (shape->[8,3])
/p4d wide_table (typ->appendable,nrows->40,ncols->1,indexers->[items,major_axis,minor_axis])
/wp wide_table (typ->appendable,nrows->8,ncols->2,indexers->[major_axis,minor_axis])
These, by default, index the three axes items, major_axis, minor_axis. On an AppendableTable it is possible to set up, with the first append, a different indexing scheme, depending on how you want to store your data. Pass the axes keyword with a list of dimensions (currently must be exactly 1 less than the total dimensions of the object). This cannot be changed after table creation.
In [177]: store.append('p4d2', p4d, axes=['labels', 'major_axis', 'minor_axis'])
In [178]: store
Out[178]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])
/df1_mt frame_table (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A,B])
/df2_mt frame_table (typ->appendable,nrows->8,ncols->5,indexers->[index])
/df_coord frame_table (typ->appendable,nrows->1000,ncols->2,indexers->[index])
/df_dc frame_table (typ->appendable,nrows->8,ncols->5,indexers->[index],dc->[B,C,string,string2])
/df_mask frame_table (typ->appendable,nrows->1000,ncols->2,indexers->[index])
/df_mi frame_table (typ->appendable_multi,nrows->10,ncols->5,indexers->[index],dc->[bar,foo])
/df_mixed frame_table (typ->appendable,nrows->8,ncols->7,indexers->[index])
/dfeq frame_table (typ->appendable,nrows->10,ncols->1,indexers->[index],dc->[number])
/dfq frame_table (typ->appendable,nrows->10,ncols->4,indexers->[index],dc->[A,B,C,D])
/dfs frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index])
/dfs2 frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A])
/dfss frame_table (typ->appendable,nrows->3,ncols->1,indexers->[index])
/dfss2 frame_table (typ->appendable,nrows->3,ncols->1,indexers->[index])
/dftd frame_table (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[A,B,C])
/foo/bar/bah frame (shape->[8,3])
/p4d wide_table (typ->appendable,nrows->40,ncols->1,indexers->[items,major_axis,minor_axis])
/p4d2 wide_table (typ->appendable,nrows->20,ncols->2,indexers->[labels,major_axis,minor_axis])
/wp wide_table (typ->appendable,nrows->8,ncols->2,indexers->[major_axis,minor_axis])
In [179]: store.select('p4d2', [ pd.Term('labels=l1'), pd.Term('items=Item1'), pd.Term('minor_axis=A_big_strings') ])
Out[179]:
<class 'pandas.core.panelnd.Panel4D'>
Dimensions: 0 (labels) x 1 (items) x 0 (major_axis) x 0 (minor_axis)
Labels axis: None
Items axis: Item1 to Item1
Major_axis axis: None
Minor_axis axis: None