13 dtypes

The main types stored in pandas objects are float, int, bool, datetime64[ns] and datetime64[ns, tz] (in >= 0.17.0), timedelta64[ns], category (in >= 0.15.0), and object. In addition, these dtypes have item sizes, e.g. int64 and int32. See Series with TZ for more detail on datetime64[ns, tz] dtypes.

A convenient dtypes attribute for DataFrames returns a Series with the data type of each column.

In [1]: dft = pd.DataFrame(dict(A = np.random.rand(3),
   ...:                         B = 1,
   ...:                         C = 'foo',
   ...:                         D = pd.Timestamp('20010102'),
   ...:                         E = pd.Series([1.0]*3).astype('float32'),
   ...:                         F = False,
   ...:                         G = pd.Series([1]*3,dtype='int8')))
   ...: 

In [2]: dft
Out[2]: 
          A  B    C          D    E      F  G
0  0.869171  1  foo 2001-01-02  1.0  False  1
1  0.373614  1  foo 2001-01-02  1.0  False  1
2  0.305503  1  foo 2001-01-02  1.0  False  1

In [3]: dft.dtypes
Out[3]: 
A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

On a Series, use the dtype attribute.

In [4]: dft['A'].dtype
Out[4]: dtype('float64')

If a pandas object contains data with multiple dtypes IN A SINGLE COLUMN, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).

# these ints are coerced to floats
In [5]: pd.Series([1, 2, 3, 4, 5, 6.])
Out[5]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

# string data forces an ``object`` dtype
In [6]: pd.Series([1, 2, 3, 6., 'foo'])
Out[6]: 
0      1
1      2
2      3
3      6
4    foo
dtype: object

The method get_dtype_counts() will return the number of columns of each type in a DataFrame:

In [7]: dft.get_dtype_counts()
Out[7]: 
bool              1
datetime64[ns]    1
float32           1
float64           1
int64             1
int8              1
object            1
dtype: int64
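get_dtype_counts() belongs to older pandas releases; it was later deprecated and removed. As a sketch of an equivalent that relies only on the dtypes attribute described above, the columns per dtype can be tallied by counting the values of the dtypes Series. The frame named example and its columns are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: two float64 columns, one int64 column, one bool column.
example = pd.DataFrame({'a': [1.0, 2.0],
                        'b': [3.0, 4.0],
                        'c': [1, 2],
                        'd': [True, False]})

# dtypes is a Series of dtype objects, so value_counts() counts columns per dtype.
counts = example.dtypes.value_counts()
print(counts)
```

The result is indexed by dtype, much like the output of get_dtype_counts(), although the row order may differ.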

Numeric dtypes will propagate and can coexist in DataFrames (starting in v0.11.0). If a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste.

In [8]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

In [9]: df1
Out[9]: 
          A
0  0.024277
1  0.813977
2  0.558304
3  1.860748
4 -0.373496
5 -0.270848
6  1.037638
7  0.601461

In [10]: df1.dtypes
Out[10]: 
A    float32
dtype: object

In [11]: df2 = pd.DataFrame(dict( A = pd.Series(np.random.randn(8), dtype='float16'),
   ....:                         B = pd.Series(np.random.randn(8)),
   ....:                         C = pd.Series(np.array(np.random.randn(8), dtype='uint8')) ))
   ....: 

In [12]: df2
Out[12]: 
          A         B    C
0 -0.027527  0.923133    0
1  0.846191  0.382896    0
2  0.688965 -0.917059    2
3  0.299072  0.437895    0
4  0.241333 -1.285750  255
5  0.754883 -2.981244    0
6 -0.879395  1.486285    0
7  1.580078 -2.010360    1

In [13]: df2.dtypes
Out[13]: 
A    float16
B    float64
C      uint8
dtype: object

13.1 defaults

By default integer types are int64 and float types are float64, REGARDLESS of platform (32-bit or 64-bit). The following will all result in int64 dtypes.

In [14]: pd.DataFrame([1, 2], columns=['a']).dtypes
Out[14]: 
a    int64
dtype: object

In [15]: pd.DataFrame({'a': [1, 2]}).dtypes
Out[15]: 
a    int64
dtype: object

In [16]: pd.DataFrame({'a': 1 }, index=list(range(2))).dtypes
Out[16]: 
a    int64
dtype: object

NumPy, however, will choose platform-dependent types when creating arrays. The following WILL result in int32 on a 32-bit platform.

In [17]: frame = pd.DataFrame(np.array([1, 2]))
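If the platform dependence is unwanted, one option is to name the dtype when building the array. The sketch below assumes nothing beyond NumPy's dtype argument; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Without a dtype, NumPy picks its platform default integer (possibly int32).
platform_default = pd.DataFrame(np.array([1, 2]))

# Naming the dtype pins the width, so the result is the same on every platform.
pinned = pd.DataFrame(np.array([1, 2], dtype=np.int64))

print(platform_default.dtypes.iloc[0])  # depends on the platform
print(pinned.dtypes.iloc[0])            # int64
```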

13.2 upcasting

Types can potentially be upcast when combined with other types, meaning they are promoted from the current type (say, int to float).

In [18]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [19]: df3
Out[19]: 
          A         B      C
0 -0.003250  0.923133    0.0
1  1.660169  0.382896    0.0
2  1.247269 -0.917059    2.0
3  2.159821  0.437895    0.0
4 -0.132163 -1.285750  255.0
5  0.484035 -2.981244    0.0
6  0.158243  1.486285    0.0
7  2.181539 -2.010360    1.0

In [20]: df3.dtypes
Out[20]: 
A    float32
B    float64
C    float64
dtype: object
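The same promotion can be seen in isolation on small Series. This is a minimal sketch with made-up values; each result dtype is the one NumPy's promotion rules give for the pair of input dtypes.

```python
import pandas as pd

ints32 = pd.Series([1, 2, 3], dtype='int32')
ints64 = pd.Series([1, 2, 3], dtype='int64')
floats32 = pd.Series([0.5, 1.5, 2.5], dtype='float32')
floats64 = pd.Series([0.5, 1.5, 2.5], dtype='float64')

print((ints32 + ints64).dtype)      # int64: the wider integer wins
print((floats32 + floats64).dtype)  # float64: the wider float wins
print((ints64 + floats64).dtype)    # float64: int is promoted to float
```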

The values attribute on a DataFrame returns the lowest common denominator of the dtypes, meaning the dtype that can accommodate ALL of the types in the resulting homogeneously typed numpy array. This can force some upcasting.

In [21]: df3.values.dtype
Out[21]: dtype('float64')
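As a rough illustration of how far this upcasting can go, the sketch below compares an all-numeric frame with one that also holds strings. Both frames are hypothetical.

```python
import pandas as pd

numeric = pd.DataFrame({'i': [1, 2], 'f': [0.5, 1.5]})
mixed = pd.DataFrame({'i': [1, 2], 's': ['a', 'b']})

# int64 and float64 columns share a single float64 array.
print(numeric.values.dtype)  # float64

# Numbers and strings have no common numeric type, so object is used.
print(mixed.values.dtype)    # object
```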

13.3 astype

You can use the astype() method to explicitly convert dtypes from one to another. By default it returns a copy, even if the dtype is unchanged (pass copy=False to change this behavior). In addition, it will raise an exception if the astype operation is invalid.

Upcasting is always according to the numpy rules. If two different dtypes are involved in an operation, then the more general one will be used as the result of the operation.

In [22]: df3
Out[22]: 
          A         B      C
0 -0.003250  0.923133    0.0
1  1.660169  0.382896    0.0
2  1.247269 -0.917059    2.0
3  2.159821  0.437895    0.0
4 -0.132163 -1.285750  255.0
5  0.484035 -2.981244    0.0
6  0.158243  1.486285    0.0
7  2.181539 -2.010360    1.0

In [23]: df3.dtypes
Out[23]: 
A    float32
B    float64
C    float64
dtype: object

# conversion of dtypes
In [24]: df3.astype('float32').dtypes
Out[24]: 
A    float32
B    float32
C    float32
dtype: object
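The exception raised for an invalid conversion can be caught like any other. The sketch below uses a made-up object Series in which one entry is not a valid integer literal.

```python
import pandas as pd

raw = pd.Series(['1', '2', 'three'], dtype=object)

try:
    raw.astype('int64')
    failed = False
except ValueError as exc:
    # 'three' cannot be parsed as an integer, so the whole conversion fails.
    failed = True
    print('astype failed:', exc)

# Restricting the conversion to the valid entries succeeds.
clean = raw.iloc[:2].astype('int64')
print(clean.dtype)  # int64
```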

Convert a subset of columns to a specified type using astype():

In [25]: dft = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7, 8, 9]})

In [26]: dft[['a','b']] = dft[['a','b']].astype(np.uint8)

In [27]: dft
Out[27]: 
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9

In [28]: dft.dtypes
Out[28]: 
a    uint8
b    uint8
c    int64
dtype: object

Note

When trying to convert a subset of columns to a specified type using astype() and the .loc indexer, upcasting occurs.

.loc tries to fit what we are assigning into the current dtypes, while [] will overwrite them, taking the dtype from the right-hand side. Therefore the following piece of code produces an unintended result.

In [29]: dft = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7, 8, 9]})

In [30]: dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes
Out[30]: 
a    uint8
b    uint8
dtype: object

In [31]: dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)

In [32]: dft.dtypes
Out[32]: 
a    int64
b    int64
c    int64
dtype: object
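Besides the [] assignment shown before the note, another way to sidestep the problem is to pass a column-to-dtype mapping to astype(). This sketch assumes a pandas version whose DataFrame.astype() accepts a dict; the frame is the same toy data used in the note.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Only the named columns are converted; a new frame is returned, so no
# indexer-based assignment (and none of its dtype fitting) is involved.
converted = frame.astype({'a': np.uint8, 'b': np.uint8})
print(converted.dtypes)
```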

13.4 object conversion

pandas offers various functions to try to force conversion of types from the object dtype to other types. The following functions are available for one dimensional object arrays or scalars:

  • to_numeric() (conversion to numeric dtypes)

    In [33]: m = ['1.1', 2, 3]
    
    In [34]: pd.to_numeric(m)
    Out[34]: array([ 1.1,  2. ,  3. ])
    
  • to_datetime() (conversion to datetime objects)

    In [35]: import datetime
    
    In [36]: m = ['2016-07-09', datetime.datetime(2016, 3, 2)]
    
    In [37]: pd.to_datetime(m)
    Out[37]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)
    
  • to_timedelta() (conversion to timedelta objects)

    In [38]: m = ['5us', pd.Timedelta('1day')]
    
    In [39]: pd.to_timedelta(m)
    Out[39]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)
    

To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with elements that cannot be converted to the desired dtype or object. By default, errors='raise', meaning that any errors encountered will be raised during the conversion process. However, if errors='coerce', these errors will be ignored and pandas will convert problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric). This might be useful if you are reading in data which is mostly of the desired dtype (e.g. numeric, datetime), but occasionally has non-conforming elements intermixed that you want to represent as missing:

In [40]: import datetime

In [41]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [42]: pd.to_datetime(m, errors='coerce')
Out[42]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)

In [43]: m = ['apple', 2, 3]

In [44]: pd.to_numeric(m, errors='coerce')
Out[44]: array([ nan,   2.,   3.])

In [45]: m = ['apple', pd.Timedelta('1day')]

In [46]: pd.to_timedelta(m, errors='coerce')
Out[46]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
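Because coerced elements become missing values, the result can also be used to locate the entries that failed to convert. A small sketch with made-up data:

```python
import pandas as pd

raw = pd.Series(['1.5', 'apple', '3'], dtype=object)
numbers = pd.to_numeric(raw, errors='coerce')

# Entries that could not be parsed are NaN in the converted Series.
is_bad = numbers.isnull()
n_bad = int(is_bad.sum())
bad_values = raw[is_bad].tolist()

print(numbers.dtype)  # float64
print(n_bad)          # 1
print(bad_values)     # ['apple']
```

This only works as a check when the input had no missing values to begin with, since those would be indistinguishable from coerced entries.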

The errors parameter has a third option, errors='ignore', which will simply return the passed-in data if it encounters any errors with the conversion to a desired data type:

In [47]: import datetime

In [48]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [49]: pd.to_datetime(m, errors='ignore')
Out[49]: array(['apple', datetime.datetime(2016, 3, 2, 0, 0)], dtype=object)

In [50]: m = ['apple', 2, 3]

In [51]: pd.to_numeric(m, errors='ignore')
Out[51]: array(['apple', 2, 3], dtype=object)

In [52]: m = ['apple', pd.Timedelta('1day')]

#pd.to_timedelta(m, errors='ignore') # <- raises ValueError

In addition to object conversion, to_numeric() provides another argument, downcast, which gives the option of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:

In [53]: m = ['1', 2, 3]

In [54]: pd.to_numeric(m, downcast='integer')   # smallest signed int dtype
Out[54]: array([1, 2, 3], dtype=int8)

In [55]: pd.to_numeric(m, downcast='signed')    # same as 'integer'
Out[55]: array([1, 2, 3], dtype=int8)

In [56]: pd.to_numeric(m, downcast='unsigned')  # smallest unsigned int dtype
Out[56]: array([1, 2, 3], dtype=uint8)

In [57]: pd.to_numeric(m, downcast='float')     # smallest float dtype
Out[57]: array([ 1.,  2.,  3.], dtype=float32)
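To make the memory saving concrete, the sketch below downcasts a Series of small integers and compares the size of the underlying arrays. The data is hypothetical; the values 0 to 999 are too large for int8 but fit in int16.

```python
import pandas as pd

wide = pd.Series(range(1000))                     # int64 by default
narrow = pd.to_numeric(wide, downcast='integer')  # smallest signed int that fits

print(wide.dtype, wide.values.nbytes)      # int64 8000
print(narrow.dtype, narrow.values.nbytes)  # int16 2000
```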

As these methods apply only to one-dimensional arrays, lists or scalars, they cannot be used directly on multi-dimensional objects such as DataFrames. However, with apply(), we can “apply” the function over each column efficiently:

In [58]: import datetime

In [59]: df = pd.DataFrame([['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')

In [60]: df
Out[60]: 
            0                    1
0  2016-07-09  2016-03-02 00:00:00
1  2016-07-09  2016-03-02 00:00:00

In [61]: df.apply(pd.to_datetime)
Out[61]: 
           0          1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02

In [62]: df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')

In [63]: df
Out[63]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3

In [64]: df.apply(pd.to_numeric)
Out[64]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3

In [65]: df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')

In [66]: df
Out[66]: 
     0                1
0  5us  1 days 00:00:00
1  5us  1 days 00:00:00

In [67]: df.apply(pd.to_timedelta)
Out[67]: 
                0      1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days
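apply() forwards extra keyword arguments to the function it calls, so the errors argument described earlier can be combined with the column-wise pattern. A minimal sketch, using a made-up frame with one unparseable entry:

```python
import pandas as pd

df = pd.DataFrame({'x': ['1', '2', 'oops'],
                   'y': ['4.5', '5.5', '6.5']}, dtype='O')

# errors='coerce' is passed through to pd.to_numeric for every column.
converted = df.apply(pd.to_numeric, errors='coerce')
print(converted)
print(converted.dtypes)
```

Column x ends up as float64 because the coerced entry is stored as NaN.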

13.5 gotchas

Performing selection operations on integer type data can easily upcast the data to floating point. The dtype of the input data will be preserved in cases where NaNs are not introduced (starting in 0.11.0). See also integer na gotchas.

In [68]: dfi = df3.astype('int32')

In [69]: dfi['E'] = 1

In [70]: dfi
Out[70]: 
   A  B    C  E
0  0  0    0  1
1  1  0    0  1
2  1  0    2  1
3  2  0    0  1
4  0 -1  255  1
5  0 -2    0  1
6  0  1    0  1
7  2 -2    1  1

In [71]: dfi.dtypes
Out[71]: 
A    int32
B    int32
C    int32
E    int64
dtype: object

In [72]: casted = dfi[dfi>0]

In [73]: casted
Out[73]: 
     A    B      C  E
0  NaN  NaN    NaN  1
1  1.0  NaN    NaN  1
2  1.0  NaN    2.0  1
3  2.0  NaN    NaN  1
4  NaN  NaN  255.0  1
5  NaN  NaN    NaN  1
6  NaN  1.0    NaN  1
7  2.0  NaN    1.0  1

In [74]: casted.dtypes
Out[74]: 
A    float64
B    float64
C    float64
E      int64
dtype: object

Float dtypes, by contrast, are unchanged by the same kind of selection.

In [75]: dfa = df3.copy()

In [76]: dfa['A'] = dfa['A'].astype('float32')

In [77]: dfa.dtypes
Out[77]: 
A    float32
B    float64
C    float64
dtype: object

In [78]: casted = dfa[df2>0]

In [79]: casted
Out[79]: 
          A         B      C
0       NaN  0.923133    NaN
1  1.660169  0.382896    NaN
2  1.247269       NaN    2.0
3  2.159821  0.437895    NaN
4 -0.132163       NaN  255.0
5  0.484035       NaN    NaN
6       NaN  1.486285    NaN
7  2.181539       NaN    1.0

In [80]: casted.dtypes
Out[80]: 
A    float32
B    float64
C    float64
dtype: object
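If the integer dtype matters after such a selection, one possible approach is to fill the introduced NaNs and cast back. The sketch below uses a made-up frame, and the fill value 0 is an arbitrary choice for illustration; whether a fill value is acceptable depends on the data.

```python
import pandas as pd

ints = pd.DataFrame({'A': [1, -2, 3], 'B': [-4, 5, 6]}, dtype='int32')

# Entries failing the condition become NaN, which forces float64.
masked = ints[ints > 0]
print(masked.dtypes)

# Filling the NaNs makes the cast back to an integer dtype valid again.
restored = masked.fillna(0).astype('int32')
print(restored.dtypes)
```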