13 dtypes
The main types stored in pandas objects are float, int, bool, datetime64[ns] and datetime64[ns, tz] (in >= 0.17.0), timedelta64[ns], category (in >= 0.15.0), and object. In addition these dtypes have item sizes, e.g. int64 and int32. See Series with TZ for more detail on datetime64[ns, tz] dtypes.
A convenient dtypes attribute for DataFrames returns a Series with the data type of each column.
In [1]: dft = pd.DataFrame(dict(A = np.random.rand(3),
...: B = 1,
...: C = 'foo',
...: D = pd.Timestamp('20010102'),
...: E = pd.Series([1.0]*3).astype('float32'),
...: F = False,
...: G = pd.Series([1]*3,dtype='int8')))
...:
In [2]: dft
Out[2]:
A B C D E F G
0 0.869171 1 foo 2001-01-02 1.0 False 1
1 0.373614 1 foo 2001-01-02 1.0 False 1
2 0.305503 1 foo 2001-01-02 1.0 False 1
In [3]: dft.dtypes
Out[3]:
A float64
B int64
C object
D datetime64[ns]
E float32
F bool
G int8
dtype: object
On a Series use the dtype attribute.
In [4]: dft['A'].dtype
Out[4]: dtype('float64')
If a pandas object contains data with multiple dtypes IN A SINGLE COLUMN, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).
# these ints are coerced to floats
In [5]: pd.Series([1, 2, 3, 4, 5, 6.])
Out[5]:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
dtype: float64
# string data forces an ``object`` dtype
In [6]: pd.Series([1, 2, 3, 6., 'foo'])
Out[6]:
0 1
1 2
2 3
3 6
4 foo
dtype: object
The method get_dtype_counts() will return the number of columns of each type in a DataFrame:
In [7]: dft.get_dtype_counts()
Out[7]:
bool 1
datetime64[ns] 1
float32 1
float64 1
int64 1
int8 1
object 1
dtype: int64
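In later pandas releases get_dtype_counts() was deprecated (0.25) and removed (1.0); counting via the dtypes attribute itself gives the same information and works across versions. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': np.random.rand(3),  # float64
    'B': 1,                  # int64 (scalar broadcast over the index)
    'C': 'foo',              # object
    'F': False,              # bool
})

# dtypes is a Series of dtype objects, so value_counts() tallies
# how many columns there are of each dtype.
counts = df.dtypes.value_counts()
```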
Numeric dtypes will propagate and can coexist in DataFrames (starting in v0.11.0). If a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste.
In [8]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')
In [9]: df1
Out[9]:
A
0 0.024277
1 0.813977
2 0.558304
3 1.860748
4 -0.373496
5 -0.270848
6 1.037638
7 0.601461
In [10]: df1.dtypes
Out[10]:
A float32
dtype: object
In [11]: df2 = pd.DataFrame(dict( A = pd.Series(np.random.randn(8), dtype='float16'),
....: B = pd.Series(np.random.randn(8)),
....: C = pd.Series(np.array(np.random.randn(8), dtype='uint8')) ))
....:
In [12]: df2
Out[12]:
A B C
0 -0.027527 0.923133 0
1 0.846191 0.382896 0
2 0.688965 -0.917059 2
3 0.299072 0.437895 0
4 0.241333 -1.285750 255
5 0.754883 -2.981244 0
6 -0.879395 1.486285 0
7 1.580078 -2.010360 1
In [13]: df2.dtypes
Out[13]:
A float16
B float64
C uint8
dtype: object
13.1 defaults
By default integer types are int64 and float types are float64, REGARDLESS of platform (32-bit or 64-bit). The following will all result in int64 dtypes.
In [14]: pd.DataFrame([1, 2], columns=['a']).dtypes
Out[14]:
a int64
dtype: object
In [15]: pd.DataFrame({'a': [1, 2]}).dtypes
Out[15]:
a int64
dtype: object
In [16]: pd.DataFrame({'a': 1 }, index=list(range(2))).dtypes
Out[16]:
a int64
dtype: object
NumPy, however, will choose platform-dependent types when creating arrays. The following WILL result in int32 on a 32-bit platform.
In [17]: frame = pd.DataFrame(np.array([1, 2]))
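If platform-independent behavior matters, one way to sidestep this (a sketch) is to pass an explicit dtype when constructing the NumPy array:

```python
import numpy as np
import pandas as pd

# NumPy picks a platform-dependent integer type for plain Python ints,
# so this may be int32 on a 32-bit platform:
frame = pd.DataFrame(np.array([1, 2]))

# Passing an explicit dtype pins the result regardless of platform:
frame64 = pd.DataFrame(np.array([1, 2], dtype='int64'))
```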
13.2 upcasting
Types can potentially be upcasted when combined with other types, meaning they are promoted from the current type (say int to float).
In [18]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
In [19]: df3
Out[19]:
A B C
0 -0.003250 0.923133 0.0
1 1.660169 0.382896 0.0
2 1.247269 -0.917059 2.0
3 2.159821 0.437895 0.0
4 -0.132163 -1.285750 255.0
5 0.484035 -2.981244 0.0
6 0.158243 1.486285 0.0
7 2.181539 -2.010360 1.0
In [20]: df3.dtypes
Out[20]:
A float32
B float64
C float64
dtype: object
The values attribute on a DataFrame returns the lowest-common-denominator of the dtypes, meaning the dtype that can accommodate ALL of the types in the resulting homogeneously dtyped NumPy array. This can force some upcasting.
In [21]: df3.values.dtype
Out[21]: dtype('float64')
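To make the lowest-common-denominator rule concrete, a small sketch: mixed numeric columns upcast to the widest numeric dtype, while a string column drags the whole array up to object.

```python
import numpy as np
import pandas as pd

# int64 + float64 columns: float64 can hold both, so .values is float64
df_num = pd.DataFrame({'x': [1, 2], 'y': [0.5, 1.5]})
homog = df_num.values

# adding a string column forces the most general dtype: object
df_mixed = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
mixed = df_mixed.values
```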
13.3 astype
You can use the astype() method to explicitly convert dtypes from one to another. These will by default return a copy, even if the dtype is unchanged (pass copy=False to change this behavior). In addition, they will raise an exception if the astype operation is invalid.
Upcasting is always done according to the NumPy rules: if two different dtypes are involved in an operation, the more general one will be used as the result.
In [22]: df3
Out[22]:
A B C
0 -0.003250 0.923133 0.0
1 1.660169 0.382896 0.0
2 1.247269 -0.917059 2.0
3 2.159821 0.437895 0.0
4 -0.132163 -1.285750 255.0
5 0.484035 -2.981244 0.0
6 0.158243 1.486285 0.0
7 2.181539 -2.010360 1.0
In [23]: df3.dtypes
Out[23]:
A float32
B float64
C float64
dtype: object
# conversion of dtypes
In [24]: df3.astype('float32').dtypes
Out[24]:
A float32
B float32
C float32
dtype: object
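To illustrate the "raise an exception if the astype operation is invalid" behavior mentioned above, a minimal sketch:

```python
import pandas as pd

s = pd.Series(['1.0', '2', 'apple'])
try:
    # 'apple' cannot be parsed as a float, so astype raises ValueError
    s.astype('float64')
    raised = False
except ValueError:
    raised = True
```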
Convert a subset of columns to a specified type using astype():
In [25]: dft = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7, 8, 9]})
In [26]: dft[['a','b']] = dft[['a','b']].astype(np.uint8)
In [27]: dft
Out[27]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [28]: dft.dtypes
Out[28]:
a uint8
b uint8
c int64
dtype: object
Note
When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting occurs. loc() tries to fit what we are assigning into the current dtypes, while [] will overwrite them, taking the dtype from the right-hand side. Therefore the following piece of code produces an unintended result.
In [29]: dft = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7, 8, 9]})
In [30]: dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes
Out[30]:
a uint8
b uint8
dtype: object
In [31]: dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)
In [32]: dft.dtypes
Out[32]:
a int64
b int64
c int64
dtype: object
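As the note implies, assigning through [] keeps the dtype from the right-hand side. A minimal sketch restating that workaround (this mirrors the earlier In [26] example):

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# [] overwrites the columns wholesale, so the uint8 dtype survives,
# unlike the .loc assignment above, which casts back to int64.
dft[['a', 'b']] = dft[['a', 'b']].astype(np.uint8)
```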
13.4 object conversion
pandas offers various functions to try to force conversion of types from the object dtype to other types. The following functions are available for one-dimensional object arrays or scalars:
to_numeric() (conversion to numeric dtypes)
In [33]: m = ['1.1', 2, 3]
In [34]: pd.to_numeric(m)
Out[34]: array([ 1.1, 2. , 3. ])
to_datetime() (conversion to datetime objects)
In [35]: import datetime
In [36]: m = ['2016-07-09', datetime.datetime(2016, 3, 2)]
In [37]: pd.to_datetime(m)
Out[37]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)
to_timedelta() (conversion to timedelta objects)
In [38]: m = ['5us', pd.Timedelta('1day')]
In [39]: pd.to_timedelta(m)
Out[39]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)
To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with elements that cannot be converted to the desired dtype or object. By default, errors='raise', meaning that any errors encountered will be raised during the conversion process. However, if errors='coerce', these errors will be ignored and pandas will convert problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric). This might be useful if you are reading in data which is mostly of the desired dtype (e.g. numeric, datetime), but occasionally has non-conforming elements intermixed that you want to represent as missing:
In [40]: import datetime
In [41]: m = ['apple', datetime.datetime(2016, 3, 2)]
In [42]: pd.to_datetime(m, errors='coerce')
Out[42]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)
In [43]: m = ['apple', 2, 3]
In [44]: pd.to_numeric(m, errors='coerce')
Out[44]: array([ nan, 2., 3.])
In [45]: m = ['apple', pd.Timedelta('1day')]
In [46]: pd.to_timedelta(m, errors='coerce')
Out[46]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
The errors parameter has a third option, errors='ignore', which will simply return the passed-in data if it encounters any errors during the conversion to the desired data type:
In [47]: import datetime
In [48]: m = ['apple', datetime.datetime(2016, 3, 2)]
In [49]: pd.to_datetime(m, errors='ignore')
Out[49]: array(['apple', datetime.datetime(2016, 3, 2, 0, 0)], dtype=object)
In [50]: m = ['apple', 2, 3]
In [51]: pd.to_numeric(m, errors='ignore')
Out[51]: array(['apple', 2, 3], dtype=object)
In [52]: m = ['apple', pd.Timedelta('1day')]
#pd.to_timedelta(m, errors='ignore') # <- raises ValueError
In addition to object conversion, to_numeric() provides another argument, downcast, which gives the option of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:
In [53]: m = ['1', 2, 3]
In [54]: pd.to_numeric(m, downcast='integer') # smallest signed int dtype
Out[54]: array([1, 2, 3], dtype=int8)
In [55]: pd.to_numeric(m, downcast='signed') # same as 'integer'
Out[55]: array([1, 2, 3], dtype=int8)
In [56]: pd.to_numeric(m, downcast='unsigned') # smallest unsigned int dtype
Out[56]: array([1, 2, 3], dtype=uint8)
In [57]: pd.to_numeric(m, downcast='float') # smallest float dtype
Out[57]: array([ 1., 2., 3.], dtype=float32)
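Two edge cases worth knowing (a sketch): downcast picks the smallest dtype that can actually hold the data without loss, and leaves the dtype alone when no smaller candidate fits.

```python
import numpy as np
import pandas as pd

# 300 does not fit in int8 (max 127), so the smallest viable dtype is int16
small = pd.to_numeric(['1', 2, 300], downcast='integer')

# a negative value fits in no unsigned dtype, so the int64 result is kept
neg = pd.to_numeric(['-1', 2], downcast='unsigned')
```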
As these methods apply only to one-dimensional arrays, lists, or scalars, they cannot be used directly on multi-dimensional objects such as DataFrames. However, with apply(), we can “apply” the function over each column efficiently:
In [58]: import datetime
In [59]: df = pd.DataFrame([['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')
In [60]: df
Out[60]:
0 1
0 2016-07-09 2016-03-02 00:00:00
1 2016-07-09 2016-03-02 00:00:00
In [61]: df.apply(pd.to_datetime)
Out[61]:
0 1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02
In [62]: df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')
In [63]: df
Out[63]:
0 1 2
0 1.1 2 3
1 1.1 2 3
In [64]: df.apply(pd.to_numeric)
Out[64]:
0 1 2
0 1.1 2 3
1 1.1 2 3
In [65]: df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')
In [66]: df
Out[66]:
0 1
0 5us 1 days 00:00:00
1 5us 1 days 00:00:00
In [67]: df.apply(pd.to_timedelta)
Out[67]:
0 1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days
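apply() forwards extra keyword arguments to the applied function, so the errors option from above combines naturally with column-wise conversion. A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['1.1', 'apple', 3]] * 2, dtype='O')

# errors='coerce' is passed through to pd.to_numeric for each column,
# turning unparseable strings into NaN instead of raising
converted = df.apply(pd.to_numeric, errors='coerce')
```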
13.5 gotchas
Performing selection operations on integer type data can easily upcast the data to floating point. The dtype of the input data will be preserved in cases where nans are not introduced (starting in 0.11.0). See also integer na gotchas.
In [68]: dfi = df3.astype('int32')
In [69]: dfi['E'] = 1
In [70]: dfi
Out[70]:
A B C E
0 0 0 0 1
1 1 0 0 1
2 1 0 2 1
3 2 0 0 1
4 0 -1 255 1
5 0 -2 0 1
6 0 1 0 1
7 2 -2 1 1
In [71]: dfi.dtypes
Out[71]:
A int32
B int32
C int32
E int64
dtype: object
In [72]: casted = dfi[dfi>0]
In [73]: casted
Out[73]:
A B C E
0 NaN NaN NaN 1
1 1.0 NaN NaN 1
2 1.0 NaN 2.0 1
3 2.0 NaN NaN 1
4 NaN NaN 255.0 1
5 NaN NaN NaN 1
6 NaN 1.0 NaN 1
7 2.0 NaN 1.0 1
In [74]: casted.dtypes
Out[74]:
A float64
B float64
C float64
E int64
dtype: object
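If the integer dtype is needed after such a selection, one workaround (a sketch, assuming a fill value of 0 is acceptable for the masked-out entries) is to fill the NaNs and cast back down:

```python
import numpy as np
import pandas as pd

dfi = pd.DataFrame({'A': [1, -2, 3]}, dtype='int32')

# boolean selection introduces NaN, which upcasts the column to float64
casted = dfi[dfi > 0]

# fill the missing entries and cast back to recover the integer dtype
recovered = casted.fillna(0).astype('int32')
```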
Float dtypes, by contrast, are unchanged:
In [75]: dfa = df3.copy()
In [76]: dfa['A'] = dfa['A'].astype('float32')
In [77]: dfa.dtypes
Out[77]:
A float32
B float64
C float64
dtype: object
In [78]: casted = dfa[df2>0]
In [79]: casted
Out[79]:
A B C
0 NaN 0.923133 NaN
1 1.660169 0.382896 NaN
2 1.247269 NaN 2.0
3 2.159821 0.437895 NaN
4 -0.132163 NaN 255.0
5 0.484035 NaN NaN
6 NaN 1.486285 NaN
7 2.181539 NaN 1.0
In [80]: casted.dtypes
Out[80]:
A float32
B float64
C float64
dtype: object