3.5 Index Types

We have discussed MultiIndex extensively in the previous sections. DatetimeIndex and PeriodIndex are covered in the time series documentation, and TimedeltaIndex in the time deltas documentation.

In the following sub-sections we will highlight some other index types.

3.5.1 CategoricalIndex

New in version 0.16.1.

We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert this to a regular object-based Index.
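As a rough illustration of the storage claim, the following sketch compares the memory footprint of a plain Index with that of a CategoricalIndex holding the same heavily duplicated labels. The variable names and the number of labels are arbitrary choices made for this example, and the exact byte counts will vary with the pandas version and platform.

```python
import pandas as pd

# 90,000 labels drawn from only three distinct values (illustrative data)
labels = list('abc') * 30000

plain = pd.Index(labels)             # one entry stored per label
cat = pd.CategoricalIndex(labels)    # small integer codes plus 3 categories

# deep=True asks pandas to include the size of the stored strings
plain_bytes = plain.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)

print(plain_bytes, cat_bytes)
```

The categorical version should come out considerably smaller, because each label is stored as a code into the short list of categories rather than as a separate string.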

In [1]: df = pd.DataFrame({'A': np.arange(6),
   ...:                    'B': list('aabbca')})
   ...: 

In [2]: df['B'] = df['B'].astype('category', categories=list('cab'))

In [3]: df
Out[3]: 
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [4]: df.dtypes
Out[4]: 
A       int64
B    category
dtype: object

In [5]: df.B.cat.categories
Out[5]: Index([u'c', u'a', u'b'], dtype='object')

Setting the index will create a CategoricalIndex

In [6]: df2 = df.set_index('B')

In [7]: df2.index
Out[7]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')

Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be in the category or the operation will raise.

In [8]: df2.loc['a']
Out[8]: 
   A
B   
a  0
a  1
a  5

These PRESERVE the CategoricalIndex

In [9]: df2.loc['a'].index
Out[9]: CategoricalIndex([u'a', u'a', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')
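The requirement that indexers belong to the categories can be checked with a small script. This is a minimal sketch: it builds the frame with the pd.Categorical constructor rather than the astype call used above, and it assumes that the error raised for a label outside the categories is a KeyError.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(6),
                   'B': pd.Categorical(list('aabbca'),
                                       categories=list('cab'))})
df2 = df.set_index('B')

# 'a' is one of the categories, so the lookup succeeds
hits = df2.loc['a']

# 'e' is not one of the categories, so the lookup raises
# (assumed here to be a KeyError)
try:
    df2.loc['e']
    raised = False
except KeyError:
    raised = True

print(hits)
print(raised)
```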

Sorting will order by the order of the categories

In [10]: df2.sort_index()
Out[10]: 
   A
B   
c  4
a  0
a  1
a  5
b  2
b  3

Groupby operations on the index will preserve the index nature as well

In [11]: df2.groupby(level=0).sum()
Out[11]: 
   A
B   
c  4
a  6
b  5

In [12]: df2.groupby(level=0).sum().index
Out[12]: CategoricalIndex([u'c', u'a', u'b'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')

Reindexing operations will return a resulting index based on the type of the passed indexer: passing a list will return a plain-old-Index, while indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the PASSED Categorical dtype. This allows one to arbitrarily index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.

In [13]: df2.reindex(['a','e'])
Out[13]: 
     A
B     
a  0.0
a  1.0
a  5.0
e  NaN

In [14]: df2.reindex(['a','e']).index
Out[14]: Index([u'a', u'a', u'a', u'e'], dtype='object', name=u'B')

In [15]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
Out[15]: 
     A
B     
a  0.0
a  1.0
a  5.0
e  NaN

In [16]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
Out[16]: CategoricalIndex([u'a', u'a', u'a', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=False, name=u'B', dtype='category')
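The same rule can be verified programmatically. The sketch below uses a separate frame with unique labels (the name dfu is ours), on the assumption that newer pandas releases reject reindexing an axis that contains duplicate labels; it only inspects the type of the resulting index.

```python
import numpy as np
import pandas as pd

# A frame whose CategoricalIndex has unique labels (illustrative data)
dfu = pd.DataFrame({'A': np.arange(3),
                    'B': pd.Categorical(list('abc'))}).set_index('B')

# A list indexer gives back a plain Index
by_list = dfu.reindex(['a', 'e'])

# A Categorical indexer gives back a CategoricalIndex that carries the
# categories of the passed Categorical
by_cat = dfu.reindex(pd.Categorical(['a', 'e'], categories=list('abe')))

print(type(by_list.index).__name__)
print(type(by_cat.index).__name__, list(by_cat.index.categories))
```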

Warning

Reshaping and Comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.

In [9]: df3 = pd.DataFrame({'A' : np.arange(6),
                            'B' : pd.Series(list('aabbca')).astype('category')})

In [11]: df3 = df3.set_index('B')

In [11]: df3.index
Out[11]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'a', u'b', u'c'], ordered=False, name=u'B', dtype='category')

In [12]: pd.concat([df2, df3])
TypeError: categories must match existing categories when appending

3.5.2 Int64Index and RangeIndex

Warning

Indexing on an integer-based Index with floats has been clarified in 0.18.0. For a summary of the changes, see the 0.18.0 release notes.

Int64Index is a fundamental basic index in pandas. This is an immutable array implementing an ordered, sliceable set. Prior to 0.18.0, the Int64Index would provide the default index for all NDFrame objects.

RangeIndex is a sub-class of Int64Index added in version 0.18.0, now providing the default index for all NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are analogous to Python range types.
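A short sketch of the analogy with range follows. It relies only on the pd.RangeIndex constructor taking start, stop and step; the variable names are illustrative.

```python
import pandas as pd

# A Series created without an explicit index gets a RangeIndex
s = pd.Series(list('abcde'))
default_index = s.index

# Like range, a RangeIndex is described by start, stop and step
evens = pd.RangeIndex(0, 10, 2)

print(type(default_index).__name__)
print(list(evens))
```

Because only the start, stop and step need to be kept, such an index does not have to materialize every label up front.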

3.5.3 Float64Index

Note

As of 0.14.0, Float64Index is backed by a native float64 dtype array. Prior to 0.14.0, Float64Index was backed by an object dtype array. Using a float64 dtype in the backend speeds up arithmetic operations by about 30x and boolean indexing operations on the Float64Index itself are about 2x as fast.

New in version 0.13.0.

By default a Float64Index will be created automatically when floating-point or mixed integer and floating-point values are passed during index creation. This enables a pure label-based slicing paradigm that makes [], .ix and .loc work exactly the same for scalar indexing and slicing.

In [17]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])

In [18]: indexf
Out[18]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')

In [19]: sf = pd.Series(range(5), index=indexf)

In [20]: sf
Out[20]: 
1.5    0
2.0    1
3.0    2
4.5    3
5.0    4
dtype: int64

Scalar selection for [], .ix and .loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0).

In [21]: sf[3]
Out[21]: 2

In [22]: sf[3.0]
Out[22]: 2

In [23]: sf.ix[3]
Out[23]: 2

In [24]: sf.ix[3.0]
Out[24]: 2

In [25]: sf.loc[3]
Out[25]: 2

In [26]: sf.loc[3.0]
Out[26]: 2

The only positional indexing is via iloc

In [27]: sf.iloc[3]
Out[27]: 3

A scalar index that is not found will raise KeyError
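For instance, the following sketch rebuilds the sf series from above and looks up a label that is absent from the index. It uses .loc only, since .ix is not available in later pandas releases.

```python
import pandas as pd

sf = pd.Series(range(5), index=pd.Index([1.5, 2, 3, 4.5, 5]))

# 3 matches the label 3.0
found = sf.loc[3]

# There is no label equal to 3.5, so the lookup raises KeyError
try:
    sf.loc[3.5]
    missing_raises = False
except KeyError:
    missing_raises = True

print(found, missing_raises)
```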

Slicing is ALWAYS on the values of the index for [], .ix and .loc, and ALWAYS positional with .iloc

In [28]: sf[2:4]
Out[28]: 
2.0    1
3.0    2
dtype: int64

In [29]: sf.ix[2:4]
Out[29]: 
2.0    1
3.0    2
dtype: int64

In [30]: sf.loc[2:4]
Out[30]: 
2.0    1
3.0    2
dtype: int64

In [31]: sf.iloc[2:4]
Out[31]: 
3.0    2
4.5    3
dtype: int64

In float indexes, slicing using floats is allowed

In [32]: sf[2.1:4.6]
Out[32]: 
3.0    2
4.5    3
dtype: int64

In [33]: sf.loc[2.1:4.6]
Out[33]: 
3.0    2
4.5    3
dtype: int64

In non-float indexes, indexing or slicing using floats will raise a TypeError

In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)

Warning

Using a scalar float indexer for .iloc has been removed in 0.18.0, so the following will raise a TypeError

In [3]: pd.Series(range(5)).iloc[3.0]
TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>

Further, the treatment of .ix with a float indexer on a non-float index will be label based, and will thus coerce the index.

In [34]: s2 = pd.Series([1, 2, 3], index=list('abc'))

In [35]: s2
Out[35]: 
a    1
b    2
c    3
dtype: int64

In [36]: s2.ix[1.0] = 10

In [37]: s2
Out[37]: 
a       1
b       2
c       3
1.0    10
dtype: int64

Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could for example be millisecond offsets.

In [38]: dfir = pd.concat([pd.DataFrame(np.random.randn(5,2),
   ....:                                index=np.arange(5) * 250.0,
   ....:                                columns=list('AB')),
   ....:                   pd.DataFrame(np.random.randn(6,2),
   ....:                                index=np.arange(4,10) * 250.1,
   ....:                                columns=list('AB'))])
   ....: 

In [39]: dfir
Out[39]: 
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
...          ...       ...
1500.6 -2.281374  0.760010
1750.7 -0.742532  1.533318
2000.8  2.495362 -0.432771
2250.9 -0.068954  0.043520

[11 rows x 2 columns]

Selection operations then will always work on a value basis, for all selection operators.

In [40]: dfir[0:1000.4]
Out[40]: 
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
1000.0 -0.724626  0.178219
1000.4  0.310610 -0.108002

In [41]: dfir.loc[0:1001,'A']
Out[41]: 
0.0       0.997289
250.0    -0.179129
500.0     0.936914
750.0    -1.003401
1000.0   -0.724626
1000.4    0.310610
Name: A, dtype: float64

In [42]: dfir.loc[1000.4]
Out[42]: 
A    0.310610
B   -0.108002
Name: 1000.4, dtype: float64

You could then easily pick out the first 1 second (1000 ms) of data.

In [43]: dfir[0:1000]
Out[43]: 
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
1000.0 -0.724626  0.178219

Of course, if you need integer-based selection, use iloc

In [44]: dfir.iloc[0:5]
Out[44]: 
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
1000.0 -0.724626  0.178219