14 Selecting columns based on dtype

New in version 0.14.1.

The select_dtypes() method implements subsetting of columns based on their dtype.

First, let’s create a DataFrame with a slew of different dtypes:

In [1]: df = pd.DataFrame({'string': list('abc'),
   ...:                    'int64': list(range(1, 4)),
   ...:                    'uint8': np.arange(3, 6).astype('u1'),
   ...:                    'float64': np.arange(4.0, 7.0),
   ...:                    'bool1': [True, False, True],
   ...:                    'bool2': [False, True, False],
   ...:                    'dates': pd.date_range('now', periods=3).values,
   ...:                    'category': pd.Series(list("ABC")).astype('category')})
   ...: 

In [2]: df['tdeltas'] = df.dates.diff()

In [3]: df['uint64'] = np.arange(3, 6).astype('u8')

In [4]: df['other_dates'] = pd.date_range('20130101', periods=3).values

In [5]: df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')

In [6]: df
Out[6]: 
   bool1  bool2 category                      dates  float64  int64 string  \
0   True  False        A 2016-09-30 13:50:03.127828      4.0      1      a   
1  False   True        B 2016-10-01 13:50:03.127828      5.0      2      b   
2   True  False        C 2016-10-02 13:50:03.127828      6.0      3      c   

   uint8  tdeltas  uint64 other_dates            tz_aware_dates  
0      3      NaT       3  2013-01-01 2013-01-01 00:00:00-05:00  
1      4   1 days       4  2013-01-02 2013-01-02 00:00:00-05:00  
2      5   1 days       5  2013-01-03 2013-01-03 00:00:00-05:00  

And the dtypes:

In [7]: df.dtypes
Out[7]: 
bool1                                   bool
bool2                                   bool
category                            category
dates                         datetime64[ns]
                             ...            
tdeltas                      timedelta64[ns]
uint64                                uint64
other_dates                   datetime64[ns]
tz_aware_dates    datetime64[ns, US/Eastern]
dtype: object

select_dtypes() has two parameters, include and exclude, that allow you to say “give me the columns WITH these dtypes” (include) and/or “give me the columns WITHOUT these dtypes” (exclude).

For example, to select bool columns:

In [8]: df.select_dtypes(include=[bool])
Out[8]: 
   bool1  bool2
0   True  False
1  False   True
2   True  False

You can also pass the name of a dtype in the numpy dtype hierarchy:

In [9]: df.select_dtypes(include=['bool'])
Out[9]: 
   bool1  bool2
0   True  False
1  False   True
2   True  False
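The examples above use include only. As a minimal sketch of the other half of the interface, the snippet below uses exclude on its own and shows the two argument errors the method reports: supplying neither parameter, and listing the same dtype in both. The small frame and its column names (flag, count, ratio) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the column names are assumptions, not from the text above.
df = pd.DataFrame({
    "flag": [True, False, True],
    "count": np.array([1, 2, 3], dtype="int64"),
    "ratio": np.array([0.5, 1.5, 2.5], dtype="float64"),
})

# exclude alone: keep every column whose dtype is NOT float64.
no_floats = df.select_dtypes(exclude=["float64"])

# At least one of include/exclude must be given.
try:
    df.select_dtypes()
    empty_call_failed = False
except ValueError:
    empty_call_failed = True

# The same dtype may not appear in both include and exclude.
try:
    df.select_dtypes(include=["int64"], exclude=["int64"])
    overlap_failed = False
except ValueError:
    overlap_failed = True
```

Column order in the result follows the order of the original frame, so no_floats holds flag and count in that order.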

select_dtypes() also works with generic dtypes.

For example, to select all numeric and boolean columns while excluding unsigned integers:

In [10]: df.select_dtypes(include=['number', 'bool'], exclude=['unsignedinteger'])
Out[10]: 
   bool1  bool2  float64  int64  tdeltas
0   True  False      4.0      1      NaT
1  False   True      5.0      2   1 days
2   True  False      6.0      3   1 days
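Generic dtypes can be given either by name or as the corresponding numpy class. The following sketch, on a small made-up frame (the column names i8, u1, f8 and flag are assumptions), suggests that both spellings select the same columns.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one column per dtype family.
df = pd.DataFrame({
    "i8": np.array([1, 2, 3], dtype="int64"),
    "u1": np.array([3, 4, 5], dtype="uint8"),
    "f8": np.array([4.0, 5.0, 6.0], dtype="float64"),
    "flag": [True, False, True],
})

# Generic dtypes given as strings.
by_name = df.select_dtypes(include=["number", "bool"],
                           exclude=["unsignedinteger"])

# The same request expressed with numpy scalar types.
by_class = df.select_dtypes(include=[np.number, np.bool_],
                            exclude=[np.unsignedinteger])

same_selection = by_name.equals(by_class)
```

The uint8 column is dropped because numpy.uint8 sits under numpy.unsignedinteger, while the signed integer, float, and bool columns remain.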

To select string columns you must use the object dtype:

In [11]: df.select_dtypes(include=['object'])
Out[11]: 
  string
0      a
1      b
2      c
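Because object is a catch-all dtype, selecting it returns every object column, whether or not it holds only strings. The sketch below forces dtype=object explicitly so the behaviour does not depend on how a given pandas version infers string columns; the column names are made up for illustration.

```python
import pandas as pd

# Illustrative frame; dtype=object is set explicitly on purpose.
df = pd.DataFrame({
    "name": pd.Series(["a", "b", "c"], dtype=object),
    "mixed": pd.Series([1, "x", None], dtype=object),
    "value": [1, 2, 3],
})

# Both object columns come back, including the one with mixed contents.
object_cols = df.select_dtypes(include=["object"])
```

If only genuine string columns are wanted, the values in each object column still need to be inspected separately.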

To see all the child dtypes of a generic dtype like numpy.number, you can define a function that returns a tree of child dtypes:

In [12]: def subdtypes(dtype):
   ....:     subs = dtype.__subclasses__()
   ....:     if not subs:
   ....:         return dtype
   ....:     return [dtype, [subdtypes(dt) for dt in subs]]
   ....: 

All numpy dtypes are subclasses of numpy.generic:

In [13]: subdtypes(np.generic)
Out[13]: 
[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.int32,
        numpy.int64,
        numpy.int64,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8,
        numpy.uint16,
        numpy.uint32,
        numpy.uint64,
        numpy.uint64]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.string_, numpy.unicode_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]
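When only a single relationship in this tree matters, numpy.issubdtype gives a quick check without walking the whole hierarchy. A small sketch:

```python
import numpy as np

# Each call asks: does the first dtype sit under the second in the hierarchy?
uint8_is_unsigned = np.issubdtype(np.uint8, np.unsignedinteger)
float64_is_number = np.issubdtype(np.float64, np.number)
float64_is_integer = np.issubdtype(np.float64, np.integer)
bool_is_number = np.issubdtype(np.bool_, np.number)
```

The last check is False, which is consistent with 'bool' having to be listed alongside 'number' when both kinds of column are wanted.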

Note

Pandas also defines the types category and datetime64[ns, tz], which are not integrated into the normal numpy hierarchy and won’t show up with the above function.

Note

The include and exclude parameters must be non-string sequences.
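One way to stay within this rule is to wrap a single dtype in a list before passing it. The helper below, as_dtype_list, is hypothetical (it is not part of pandas) and is only a sketch; other pandas versions may be more lenient about bare strings.

```python
import numpy as np
import pandas as pd

def as_dtype_list(spec):
    """Hypothetical helper: return spec as a list of dtype specifiers."""
    if spec is None:
        return None
    # A lone string, type, or numpy dtype becomes a one-element list.
    if isinstance(spec, (str, type, np.dtype)):
        return [spec]
    # Any other iterable (tuple, set, list) is copied into a list.
    return list(spec)

df = pd.DataFrame({"flag": [True, False], "n": [1, 2]})

# A single dtype name is wrapped before being handed to select_dtypes.
bools = df.select_dtypes(include=as_dtype_list("bool"))
```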