3.5 Index Types
We have discussed MultiIndex extensively in the previous sections. DatetimeIndex, PeriodIndex and TimedeltaIndex are covered elsewhere in the documentation.
In the following sub-sections we will highlight some other index types.
3.5.1 CategoricalIndex
New in version 0.16.1.
We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert this to a regular object-based Index.
In [1]: df = pd.DataFrame({'A': np.arange(6),
...: 'B': list('aabbca')})
...:
In [2]: df['B'] = df['B'].astype('category', categories=list('cab'))
In [3]: df
Out[3]:
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
In [4]: df.dtypes
Out[4]:
A int64
B category
dtype: object
In [5]: df.B.cat.categories
Out[5]: Index([u'c', u'a', u'b'], dtype='object')
Setting the index will create a CategoricalIndex.
In [6]: df2 = df.set_index('B')
In [7]: df2.index
Out[7]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')
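The same result can be sketched without the astype call used above, which later pandas releases may not accept in that form. The version below builds the column with pd.Categorical and an explicit categories list; treating that as an equivalent spelling is an assumption of this sketch, not something stated in this section.

```python
import numpy as np
import pandas as pd

# Build the categorical column directly, declaring the category order.
df = pd.DataFrame({
    "A": np.arange(6),
    "B": pd.Categorical(list("aabbca"), categories=list("cab")),
})

# Moving the categorical column into the index keeps the categorical dtype.
df2 = df.set_index("B")

print(type(df2.index).__name__)
print(list(df2.index.categories))
```

The category order ('c', 'a', 'b') is carried over to the index unchanged, which is what the later sorting example relies on.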
Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be among the categories or the operation will raise.
In [8]: df2.loc['a']
Out[8]:
A
B
a 0
a 1
a 5
These selections PRESERVE the CategoricalIndex.
In [9]: df2.loc['a'].index
Out[9]: CategoricalIndex([u'a', u'a', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')
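As a minimal sketch of the rule that indexers must be among the categories, the snippet below selects an existing label and then a label that is not a category. It rebuilds df2 directly from a CategoricalIndex so that it is self-contained; the label 'z' is a made-up example, and catching KeyError specifically is an assumption about which exception is raised.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {"A": np.arange(6)},
    index=pd.CategoricalIndex(list("aabbca"), categories=list("cab"), name="B"),
)

# A label that is one of the categories selects every matching row.
subset = df2.loc["a"]
print(subset)

# A label outside the categories is rejected rather than returning no rows.
try:
    df2.loc["z"]
    missing_raised = False
except KeyError:
    missing_raised = True

print(missing_raised)
```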
Sorting the index orders the rows by the order of the categories, not lexically.
In [10]: df2.sort_index()
Out[10]:
A
B
c 4
a 0
a 1
a 5
b 2
b 3
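To make the difference from lexical sorting explicit, the following sketch compares the sorted categorical index with a plain Python sort of the same labels. It reuses the category order ('c', 'a', 'b') from the example above.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {"A": np.arange(6)},
    index=pd.CategoricalIndex(list("aabbca"), categories=list("cab"), name="B"),
)

# sort_index follows the declared category order: c first, then a, then b.
by_category = list(df2.sort_index().index)

# A plain sort of the same labels is alphabetical instead.
lexical = sorted(df2.index)

print(by_category)
print(lexical)
```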
Groupby operations on the index will preserve the index nature as well
In [11]: df2.groupby(level=0).sum()
Out[11]:
A
B
c 4
a 6
b 5
In [12]: df2.groupby(level=0).sum().index
Out[12]: CategoricalIndex([u'c', u'a', u'b'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')
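The groupby behaviour can be checked programmatically with a short sketch. The observed=False argument is an assumption: it is passed here only to pin down how unused categories are treated in releases that support the parameter, and it does not appear in the example above.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {"A": np.arange(6)},
    index=pd.CategoricalIndex(list("aabbca"), categories=list("cab"), name="B"),
)

# Group on the (categorical) index level and sum column A per category.
# observed=False is an assumption about the installed pandas version.
totals = df2.groupby(level=0, observed=False).sum()

print(totals)
print(type(totals.index).__name__)
```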
Reindexing operations return a resulting index based on the type of the passed indexer: passing a list returns a plain-old Index, while indexing with a Categorical returns a CategoricalIndex, indexed according to the categories of the PASSED Categorical dtype. This allows one to arbitrarily index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.
In [13]: df2.reindex(['a','e'])
Out[13]:
A
B
a 0.0
a 1.0
a 5.0
e NaN
In [14]: df2.reindex(['a','e']).index
Out[14]: Index([u'a', u'a', u'a', u'e'], dtype='object', name=u'B')
In [15]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
Out[15]:
A
B
a 0.0
a 1.0
a 5.0
e NaN
In [16]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
Out[16]: CategoricalIndex([u'a', u'a', u'a', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=False, name=u'B', dtype='category')
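The type rule for reindexing can be sketched as follows. This version uses a frame with a unique CategoricalIndex (the hypothetical unique_df), on the assumption that some pandas releases refuse to reindex an axis with duplicate labels; the rule about the resulting index type is the same one described above.

```python
import numpy as np
import pandas as pd

unique_df = pd.DataFrame(
    {"A": np.arange(3)},
    index=pd.CategoricalIndex(list("abc"), name="B"),
)

# Passing a plain list: the result carries a regular (non-categorical) Index.
by_list = unique_df.reindex(["a", "e"])

# Passing a Categorical: the result carries a CategoricalIndex whose
# categories come from the passed Categorical, not from unique_df.
target = pd.Categorical(["a", "e"], categories=list("abcde"))
by_categorical = unique_df.reindex(target)

print(type(by_list.index).__name__)
print(type(by_categorical.index).__name__)
print(list(by_categorical.index.categories))
```

In both cases the row for 'e' is filled with NaN, since 'e' is not present in the original index.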
Warning
Reshaping and comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.
In [9]: df3 = pd.DataFrame({'A' : np.arange(6),
'B' : pd.Series(list('aabbca')).astype('category')})
In [11]: df3 = df3.set_index('B')
In [11]: df3.index
Out[11]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'a', u'b', u'c'], ordered=False, name=u'B', dtype='category')
In [12]: pd.concat([df2, df3])
TypeError: categories must match existing categories when appending
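A minimal sketch of the comparison case, using two small hypothetical indexes (left and right) whose category sets differ. Only comparison is shown, because how concat treats mismatched categories may differ between pandas releases. Aligning the categories with set_categories before comparing is one possible workaround and is not prescribed by this section.

```python
import pandas as pd

left = pd.CategoricalIndex(["a", "b"], categories=["a", "b"])
right = pd.CategoricalIndex(["a", "b"], categories=["a", "b", "c"])

# Comparing indexes with different category sets is rejected.
try:
    left == right
    mismatch_raised = False
except TypeError:
    mismatch_raised = True

# After giving both sides the same categories, the comparison is allowed.
aligned = right.set_categories(left.categories)
all_equal = bool((left == aligned).all())

print(mismatch_raised, all_equal)
```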
3.5.2 Int64Index and RangeIndex
Warning
Indexing on an integer-based Index with floats has been clarified in 0.18.0. For a summary of the changes, see the 0.18.0 release notes.
Int64Index is a fundamental basic index in pandas. This is an immutable array implementing an ordered, sliceable set. Prior to 0.18.0, the Int64Index would provide the default index for all NDFrame objects.
RangeIndex is a sub-class of Int64Index added in version 0.18.0, now providing the default index for all NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are analogous to Python range types.
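The analogy with Python's range can be illustrated with a short sketch. It only checks that a Series created without an explicit index gets a RangeIndex and that an explicit RangeIndex holds the same values as the matching range; it makes no claim about the class hierarchy in any particular release.

```python
import pandas as pd

# A Series created without an index gets the default index.
s = pd.Series([10, 20, 30])
default_index = s.index

# A RangeIndex is described by start, stop and step, like range().
explicit = pd.RangeIndex(start=0, stop=10, step=2)

print(type(default_index).__name__)
print(list(explicit), list(range(0, 10, 2)))
```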
3.5.3 Float64Index
Note
As of 0.14.0, Float64Index is backed by a native float64 dtype array. Prior to 0.14.0, Float64Index was backed by an object dtype array. Using a float64 dtype in the backend speeds up arithmetic operations by about 30x, and boolean indexing operations on the Float64Index itself are about 2x as fast.
New in version 0.13.0.
By default a Float64Index will be automatically created when passing floating, or mixed-integer-floating, values in index creation. This enables a pure label-based slicing paradigm that makes [], .ix and .loc work exactly the same for scalar indexing and slicing.
In [17]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])
In [18]: indexf
Out[18]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')
In [19]: sf = pd.Series(range(5), index=indexf)
In [20]: sf
Out[20]:
1.5 0
2.0 1
3.0 2
4.5 3
5.0 4
dtype: int64
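A small sketch of the automatic float index creation. Because the class name shown in the repr may differ between pandas releases, the check below looks at the dtype of the index rather than at its class name; that choice is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Mixed integer and float labels are stored together as floats.
indexf = pd.Index([1.5, 2, 3, 4.5, 5])
sf = pd.Series(range(5), index=indexf)

print(indexf.dtype)
print(list(sf.index))
```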
Scalar selection for [], .ix and .loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0).
In [21]: sf[3]
Out[21]: 2
In [22]: sf[3.0]
Out[22]: 2
In [23]: sf.ix[3]
Out[23]: 2
In [24]: sf.ix[3.0]
Out[24]: 2
In [25]: sf.loc[3]
Out[25]: 2
In [26]: sf.loc[3.0]
Out[26]: 2
The only positional indexing is via .iloc.
In [27]: sf.iloc[3]
Out[27]: 3
A scalar label that is not found in the index will raise a KeyError.
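The scalar rules above can be sketched in one place: label lookup with an integer or an equal float, positional lookup with .iloc, and a missing label. Only .loc and .iloc are used here; the label 2.5 is a made-up value chosen because it is not in the index.

```python
import pandas as pd

sf = pd.Series(range(5), index=pd.Index([1.5, 2, 3, 4.5, 5]))

# Label-based: the integer 3 and the float 3.0 find the same entry.
by_int_label = sf.loc[3]
by_float_label = sf.loc[3.0]

# Position-based: the fourth element, whose label is 4.5.
by_position = sf.iloc[3]

# A label that is absent from the index raises KeyError.
try:
    sf.loc[2.5]
    missing_raised = False
except KeyError:
    missing_raised = True

print(by_int_label, by_float_label, by_position, missing_raised)
```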
Slicing is ALWAYS on the values of the index for [], .ix and .loc, and ALWAYS positional with .iloc.
In [28]: sf[2:4]
Out[28]:
2.0 1
3.0 2
dtype: int64
In [29]: sf.ix[2:4]
Out[29]:
2.0 1
3.0 2
dtype: int64
In [30]: sf.loc[2:4]
Out[30]:
2.0 1
3.0 2
dtype: int64
In [31]: sf.iloc[2:4]
Out[31]:
3.0 2
4.5 3
dtype: int64
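The contrast between label-based and positional slicing can be sketched with .loc and .iloc alone. The [] and .ix forms are left out of this sketch on the assumption that their availability or slice semantics may vary between pandas releases.

```python
import pandas as pd

sf = pd.Series(range(5), index=pd.Index([1.5, 2, 3, 4.5, 5]))

# Label-based: keeps index values between 2 and 4, both bounds included.
by_label = sf.loc[2:4]

# Position-based: keeps the elements at positions 2 and 3 (end excluded).
by_position = sf.iloc[2:4]

print(list(by_label.index))
print(list(by_position.index))
```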
In float indexes, slicing using floats is allowed
In [32]: sf[2.1:4.6]
Out[32]:
3.0 2
4.5 3
dtype: int64
In [33]: sf.loc[2.1:4.6]
Out[33]:
3.0 2
4.5 3
dtype: int64
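A short sketch of float slice bounds with .loc. The bounds do not need to be present in the index, and when a bound is present it is included in the result.

```python
import pandas as pd

sf = pd.Series(range(5), index=pd.Index([1.5, 2, 3, 4.5, 5]))

# Neither 2.1 nor 4.6 is a label; everything in between is selected.
between = sf.loc[2.1:4.6]

# Both 2.0 and 4.5 are labels; label slices include both endpoints.
inclusive = sf.loc[2.0:4.5]

print(list(between.index))
print(list(inclusive.index))
```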
In non-float indexes, slicing using floats will raise a TypeError
In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)
In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
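The slice case can be checked with a try/except, sketched below. Only the slice form is exercised; the exception raised for a scalar float such as [3.5] may differ between pandas releases, so it is not asserted here.

```python
import pandas as pd

s = pd.Series(range(5))  # default integer-based index

# Slicing an integer-based index with float bounds is rejected.
try:
    s[3.5:4.5]
    float_slice_raised = False
except TypeError:
    float_slice_raised = True

# Integer bounds are accepted.
int_slice = s[3:5]

print(float_slice_raised, int_slice.tolist())
```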
Warning
Using a scalar float indexer for .iloc has been removed in 0.18.0, so the following will raise a TypeError.
In [3]: pd.Series(range(5)).iloc[3.0]
TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>
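A minimal sketch that catches the error instead of letting it propagate, alongside the integer form that is accepted.

```python
import pandas as pd

s = pd.Series(range(5))

# A float position is rejected, even when it is integer-valued.
try:
    s.iloc[3.0]
    float_position_raised = False
except TypeError:
    float_position_raised = True

# The integer position works as usual.
value = s.iloc[3]

print(float_position_raised, value)
```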
Furthermore, .ix with a float indexer on a non-float index is label based, and thus coerces the index.
In [34]: s2 = pd.Series([1, 2, 3], index=list('abc'))
In [35]: s2
Out[35]:
a 1
b 2
c 3
dtype: int64
In [36]: s2.ix[1.0] = 10
In [37]: s2
Out[37]:
a 1
b 2
c 3
1.0 10
dtype: int64
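Since .ix is not available in later pandas releases, the sketch below uses .loc as a stand-in for the label-based assignment. That substitution is an assumption of this sketch; the point it illustrates is the same, namely that assigning to a float label that is absent from a non-float index appends a new entry.

```python
import pandas as pd

s2 = pd.Series([1, 2, 3], index=list("abc"))

# 1.0 is treated as a label, not as a position. Because the label is
# missing, the assignment enlarges the Series instead of overwriting 'b'.
s2.loc[1.0] = 10

print(s2)
```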
Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could for example be millisecond offsets.
In [38]: dfir = pd.concat([pd.DataFrame(np.random.randn(5,2),
....: index=np.arange(5) * 250.0,
....: columns=list('AB')),
....: pd.DataFrame(np.random.randn(6,2),
....: index=np.arange(4,10) * 250.1,
....: columns=list('AB'))])
....:
In [39]: dfir
Out[39]:
A B
0.0 0.997289 -1.693316
250.0 -0.179129 -1.598062
500.0 0.936914 0.912560
750.0 -1.003401 1.632781
... ... ...
1500.6 -2.281374 0.760010
1750.7 -0.742532 1.533318
2000.8 2.495362 -0.432771
2250.9 -0.068954 0.043520
[11 rows x 2 columns]
Selection operations then will always work on a value basis, for all selection operators.
In [40]: dfir[0:1000.4]
Out[40]:
A B
0.0 0.997289 -1.693316
250.0 -0.179129 -1.598062
500.0 0.936914 0.912560
750.0 -1.003401 1.632781
1000.0 -0.724626 0.178219
1000.4 0.310610 -0.108002
In [41]: dfir.loc[0:1001,'A']
Out[41]:
0.0 0.997289
250.0 -0.179129
500.0 0.936914
750.0 -1.003401
1000.0 -0.724626
1000.4 0.310610
Name: A, dtype: float64
In [42]: dfir.loc[1000.4]
Out[42]:
A 0.310610
B -0.108002
Name: 1000.4, dtype: float64
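The value-based selection can be reproduced with a seeded variant of the frame above. The RandomState seed is an arbitrary choice made for reproducibility, so the numbers will not match the output shown, and .loc is used for the slice on the assumption that it is the form with the most stable semantics across releases.

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(0)  # arbitrary seed, only for reproducibility

dfir = pd.concat([
    pd.DataFrame(rs.randn(5, 2), index=np.arange(5) * 250.0, columns=list("AB")),
    pd.DataFrame(rs.randn(6, 2), index=np.arange(4, 10) * 250.1, columns=list("AB")),
])

# Select by value: every row whose offset lies between 0 and 1001 ms.
# The bound 1001 is not itself a label in the index.
window = dfir.loc[0:1001, "A"]

print(len(dfir), len(window))
```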
You could then easily pick out the first second (1000 ms) of data.
In [43]: dfir[0:1000]
Out[43]:
A B
0.0 0.997289 -1.693316
250.0 -0.179129 -1.598062
500.0 0.936914 0.912560
750.0 -1.003401 1.632781
1000.0 -0.724626 0.178219
Of course, if you need integer-based selection, use iloc.
In [44]: dfir.iloc[0:5]
Out[44]:
A B
0.0 0.997289 -1.693316
250.0 -0.179129 -1.598062
500.0 0.936914 0.912560
750.0 -1.003401 1.632781
1000.0 -0.724626 0.178219
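To show that the two selections agree for this data, the sketch below builds a frame with the same index and deterministic placeholder values (the column contents are made up) and compares a label slice with a positional slice.

```python
import numpy as np
import pandas as pd

# Same irregular millisecond index as above, with placeholder data.
index = np.concatenate([np.arange(5) * 250.0, np.arange(4, 10) * 250.1])
dfir = pd.DataFrame({"A": np.arange(11), "B": np.arange(11) * 10}, index=index)

# Label-based: rows with offsets from 0 to 1000 ms, end label included.
first_second = dfir.loc[0:1000]

# Position-based: the first five rows, end position excluded.
first_five = dfir.iloc[0:5]

print(first_second.equals(first_five))
```

The two agree only because exactly five labels fall within the first 1000 ms; with a different sampling pattern the label slice and the positional slice would select different rows.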