14 Selecting columns based on dtype
New in version 0.14.1.
The select_dtypes() method implements subsetting of columns based on their dtype.
First, let’s create a DataFrame with a slew of different dtypes:
In [1]: df = pd.DataFrame({'string': list('abc'),
   ...:                    'int64': list(range(1, 4)),
   ...:                    'uint8': np.arange(3, 6).astype('u1'),
   ...:                    'float64': np.arange(4.0, 7.0),
   ...:                    'bool1': [True, False, True],
   ...:                    'bool2': [False, True, False],
   ...:                    'dates': pd.date_range('now', periods=3).values,
   ...:                    'category': pd.Series(list("ABC")).astype('category')})
   ...:
In [2]: df['tdeltas'] = df.dates.diff()
In [3]: df['uint64'] = np.arange(3, 6).astype('u8')
In [4]: df['other_dates'] = pd.date_range('20130101', periods=3).values
In [5]: df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')
In [6]: df
Out[6]:
   bool1  bool2 category                      dates  float64  int64 string  \
0   True  False        A 2016-09-30 13:50:03.127828      4.0      1      a
1  False   True        B 2016-10-01 13:50:03.127828      5.0      2      b
2   True  False        C 2016-10-02 13:50:03.127828      6.0      3      c

   uint8 tdeltas  uint64 other_dates            tz_aware_dates
0      3     NaT       3  2013-01-01 2013-01-01 00:00:00-05:00
1      4  1 days       4  2013-01-02 2013-01-02 00:00:00-05:00
2      5  1 days       5  2013-01-03 2013-01-03 00:00:00-05:00
And the dtypes:
In [7]: df.dtypes
Out[7]:
bool1                                   bool
bool2                                   bool
category                            category
dates                         datetime64[ns]
                           ...
tdeltas                      timedelta64[ns]
uint64                                uint64
other_dates                   datetime64[ns]
tz_aware_dates    datetime64[ns, US/Eastern]
dtype: object
select_dtypes() has two parameters, include and exclude, that allow you to say “give me the columns WITH these dtypes” (include) and/or “give me the columns WITHOUT these dtypes” (exclude).
For example, to select bool columns:
In [8]: df.select_dtypes(include=[bool])
Out[8]:
   bool1  bool2
0   True  False
1  False   True
2   True  False
You can also pass the name of a dtype in the numpy dtype hierarchy:
In [9]: df.select_dtypes(include=['bool'])
Out[9]:
   bool1  bool2
0   True  False
1  False   True
2   True  False
select_dtypes() also works with generic dtypes.
For example, to select all numeric and boolean columns while excluding unsigned integers:
In [10]: df.select_dtypes(include=['number', 'bool'], exclude=['unsignedinteger'])
Out[10]:
   bool1  bool2  float64  int64 tdeltas
0   True  False      4.0      1     NaT
1  False   True      5.0      2  1 days
2   True  False      6.0      3  1 days
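None of the examples above uses exclude on its own. As a minimal sketch, assuming a small hypothetical frame named small (it is not the df built earlier), passing only exclude keeps every column whose dtype does not fall under the listed dtypes:

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration.
small = pd.DataFrame({
    "flag": [True, False],
    "count": np.array([1, 2], dtype="int64"),
    "ratio": np.array([0.5, 1.5], dtype="float64"),
})

# Only exclude is given: drop the floating-point columns, keep the rest.
no_floats = small.select_dtypes(exclude=["floating"])
print(list(no_floats.columns))
```

The boolean and integer columns should remain, in their original column order.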
To select string columns you must use the object dtype:
In [11]: df.select_dtypes(include=['object'])
Out[11]:
  string
0      a
1      b
2      c
To see all the child dtypes of a generic dtype like numpy.number you can define a function that returns a tree of child dtypes:
In [12]: def subdtypes(dtype):
   ....:     subs = dtype.__subclasses__()
   ....:     if not subs:
   ....:         return dtype
   ....:     return [dtype, [subdtypes(dt) for dt in subs]]
   ....:
All numpy dtypes are subclasses of numpy.generic:
In [13]: subdtypes(np.generic)
Out[13]:
[numpy.generic,
[[numpy.number,
[[numpy.integer,
[[numpy.signedinteger,
[numpy.int8,
numpy.int16,
numpy.int32,
numpy.int64,
numpy.int64,
numpy.timedelta64]],
[numpy.unsignedinteger,
[numpy.uint8,
numpy.uint16,
numpy.uint32,
numpy.uint64,
numpy.uint64]]]],
[numpy.inexact,
[[numpy.floating,
[numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
[numpy.complexfloating,
[numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
[numpy.flexible,
[[numpy.character, [numpy.string_, numpy.unicode_]],
[numpy.void, [numpy.record]]]],
numpy.bool_,
numpy.datetime64,
numpy.object_]]
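When only one question matters ("does this dtype fall under that generic dtype?"), printing the whole tree is more than is needed. A minimal sketch using NumPy's np.issubdtype, which checks a single parent/child relationship in the same hierarchy (the variable names are illustrative only):

```python
import numpy as np

# Each call asks whether the first dtype sits under the second in the hierarchy.
int_is_number = np.issubdtype(np.int64, np.number)
uint_is_unsigned = np.issubdtype(np.uint8, np.unsignedinteger)
bool_is_number = np.issubdtype(np.bool_, np.number)
datetime_is_number = np.issubdtype(np.datetime64, np.number)

print(int_is_number, uint_is_unsigned, bool_is_number, datetime_is_number)

# timedelta64 is listed under numpy.signedinteger in the tree above, which is
# why the tdeltas column was picked up by include=['number'] earlier.
print(np.issubdtype(np.timedelta64, np.number))
```

In the tree above, numpy.bool_ and numpy.datetime64 sit directly under numpy.generic rather than under numpy.number, so a 'number' selection does not match them.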
Note
Pandas also defines the types category and datetime64[ns, tz], which are not integrated into the normal numpy hierarchy and won’t show up with the above function.
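These pandas-specific dtypes can still be selected by name. The sketch below assumes a hypothetical frame named frame and, for the timezone-aware column, assumes a pandas release that accepts the 'datetimetz' selector (it was added after the version this section describes):

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
frame = pd.DataFrame({
    "label": pd.Series(list("ABC")).astype("category"),
    "when": pd.date_range("20130101", periods=3, tz="UTC"),
    "value": [1.0, 2.0, 3.0],
})

# Categorical columns are selected with the name 'category'.
cats = frame.select_dtypes(include=["category"])

# Timezone-aware datetime columns: 'datetimetz' is assumed to be available
# in the installed pandas version.
tz_cols = frame.select_dtypes(include=["datetimetz"])

print(list(cats.columns), list(tz_cols.columns))
```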
Note
The include and exclude parameters must be non-string sequences.