8 Iteration
The behavior of basic iteration over pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame and Panel, follow the dict-like convention of iterating over the “keys” of the objects.
In short, basic iteration (for i in object) produces:
- Series: values
- DataFrame: column labels
- Panel: item labels
Thus, for example, iterating over a DataFrame gives you the column names:
In [1]: df = pd.DataFrame({'col1' : np.random.randn(3), 'col2' : np.random.randn(3)},
   ...:                   index=['a', 'b', 'c'])
   ...:
In [2]: for col in df:
   ...:     print(col)
   ...:
col1
col2
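As a complement to the DataFrame case, the following minimal sketch contrasts basic iteration over a Series with basic iteration over a DataFrame. The names s, frame, series_items and frame_items are illustrative only and do not appear in the original text.

```python
import pandas as pd

# Illustrative data; the labels and values are arbitrary.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
frame = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]},
                     index=['a', 'b', 'c'])

# Iterating over a Series yields its values, not its index labels.
series_items = [value for value in s]

# Iterating over a DataFrame yields its column labels, like dict keys.
frame_items = [label for label in frame]

print(series_items)
print(frame_items)
```

To get the index labels of a Series instead, iterate over s.index.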
Pandas objects also have the dict-like iteritems() method to iterate over the (key, value) pairs.
To iterate over the rows of a DataFrame, you can use the following methods:
- iterrows(): Iterate over the rows of a DataFrame as (index, Series) pairs. This converts the rows to Series objects, which can change the dtypes and has some performance implications.
- itertuples(): Iterate over the rows of a DataFrame as namedtuples of the values. This is a lot faster than iterrows(), and is in most cases the preferable way to iterate over the values of a DataFrame.
Warning
Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches:
- Look for a vectorized solution: many operations can be performed using built-in methods or numpy functions, (boolean) indexing, ...
- When you have a function that cannot work on the full DataFrame/Series at once, it is better to use apply() instead of iterating over the values. See the docs on function application.
- If you need to do iterative manipulations on the values but performance is important, consider writing the inner loop using e.g. cython or numba. See the enhancing performance section for some examples of this approach.
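The first two approaches above can be sketched side by side. This is a minimal illustration rather than a benchmark: the column names and the helper row_product are hypothetical, and all three computations are arranged to give the same result.

```python
import pandas as pd

# Illustrative data; column names 'a' and 'b' are arbitrary.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

# Manual iteration: works, but runs a Python-level loop over every row.
looped = [row.a * row.b for row in df.itertuples()]

# Vectorized: one operation on whole columns, usually the fastest option.
vectorized = df['a'] * df['b']


def row_product(row):
    """Hypothetical function that only knows how to handle a single row."""
    return row['a'] * row['b']


# apply(): a fallback when the function cannot take the full DataFrame.
applied = df.apply(row_product, axis=1)

print(vectorized.tolist())
```

For real workloads, timing the alternatives on representative data (for example with %timeit in IPython) is the only reliable way to decide between them.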
Warning
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!
For example, in the following case setting the value has no effect:
In [3]: df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
In [4]: for index, row in df.iterrows():
   ...:     row['a'] = 10
   ...:
In [5]: df
Out[5]:
   a  b
0  1  a
1  2  b
2  3  c
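If the goal is to change values, assigning through the DataFrame itself avoids the problem. The sketch below repeats the failing loop and then shows two assignments that do take effect. The variable names unchanged, after_loc and after_column are illustrative, and the choice of .loc with a boolean mask is just one reasonable option.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

# Writing to the row yielded by iterrows() does not reach df here,
# because each row of this mixed-dtype frame is a separate copy.
for index, row in df.iterrows():
    row['a'] = 10
unchanged = df['a'].tolist()

# Assigning through .loc with a boolean mask modifies df directly.
df.loc[df['b'] == 'b', 'a'] = 20
after_loc = df['a'].tolist()

# Assigning to a whole column also modifies df directly.
df['a'] = 10
after_column = df['a'].tolist()

print(unchanged, after_loc, after_column)
```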
8.1 iteritems
Consistent with the dict-like interface, iteritems() iterates through key-value pairs:
- Series: (index, scalar value) pairs
- DataFrame: (column, Series) pairs
- Panel: (item, DataFrame) pairs
For example:
In [6]: for item, frame in wp.iteritems():
   ...:     print(item)
   ...:     print(frame)
   ...:
Item1
A B C D
2000-01-01 -0.514968 -0.922744 1.719198 0.354214
2000-01-02 -0.964852 1.149227 0.085127 -0.666126
2000-01-03 -0.937352 -0.236178 -0.065276 0.966529
2000-01-04 0.275865 0.952374 0.453077 0.105015
2000-01-05 -1.080907 2.059111 -0.569357 0.227393
Item2
A B C D
2000-01-01 -0.678641 0.754388 -0.863078 0.450325
2000-01-02 0.074156 -0.070482 0.065135 -0.353930
2000-01-03 1.450099 -0.388589 -0.291465 -0.273057
2000-01-04 -1.777603 -0.383081 0.868747 0.498215
2000-01-05 -0.703629 1.366700 0.140995 -2.331324
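For the Series and DataFrame cases in the list above, a smaller sketch follows. Note the assumption: it calls items() rather than iteritems(). Both spellings yield the same pairs where both exist, but newer pandas releases keep only items(), so substitute iteritems() if that is what your version provides. The data and variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1.5, 2.5], index=['x', 'y'])
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Series: (index label, scalar value) pairs.
series_pairs = list(s.items())

# DataFrame: (column label, Series) pairs; each column is converted
# to a plain list here only to make the result easy to print.
column_pairs = [(name, column.tolist()) for name, column in df.items()]

print(series_pairs)
print(column_pairs)
```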
8.2 iterrows
iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding each index value along with a Series containing the data in each row:
In [7]: for row_index, row in df.iterrows():
   ...:     print('%s\n%s' % (row_index, row))
   ...:
0
a 1
b a
Name: 0, dtype: object
1
a 2
b b
Name: 1, dtype: object
2
a 3
b c
Name: 2, dtype: object
Note
Because iterrows() returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,
In [8]: df_orig = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
In [9]: df_orig.dtypes
Out[9]:
int int64
float float64
dtype: object
In [10]: row = next(df_orig.iterrows())[1]
In [11]: row
Out[11]:
int 1.0
float 1.5
Name: 0, dtype: float64
All values in row, returned as a Series, are now upcast to floats, including the original integer value in column int:
In [12]: row['int'].dtype
Out[12]: dtype('float64')
In [13]: df_orig['int'].dtype
Out[13]: dtype('int64')
To preserve dtypes while iterating over the rows, it is better to use itertuples(), which returns namedtuples of the values and is generally much faster than iterrows().
For instance, a contrived way to transpose the DataFrame would be:
In [14]: df2 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
In [15]: print(df2)
x y
0 1 4
1 2 5
2 3 6
In [16]: print(df2.T)
0 1 2
x 1 2 3
y 4 5 6
In [17]: df2_t = pd.DataFrame(dict((idx,values) for idx, values in df2.iterrows()))
In [18]: print(df2_t)
0 1 2
x 1 2 3
y 4 5 6
8.3 itertuples
The itertuples() method will return an iterator yielding a namedtuple for each row in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.
For instance,
In [19]: for row in df.itertuples():
   ....:     print(row)
   ....:
Pandas(Index=0, a=1, b='a')
Pandas(Index=1, a=2, b='b')
Pandas(Index=2, a=3, b='c')
This method does not convert the row to a Series object but just returns the values inside a namedtuple. Therefore, itertuples() preserves the data type of the values and is generally faster than iterrows().
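The dtype difference can be checked directly. The sketch below builds a one-row frame with an integer and a float column (the column names qty and price are hypothetical) and compares what itertuples() and iterrows() hand back for the integer column.

```python
import pandas as pd

# One integer column and one float column; names are illustrative.
df_mixed = pd.DataFrame([[1, 1.5]], columns=['qty', 'price'])

# itertuples(): values are taken column by column, so the integer
# stays an integer and is reachable as an attribute.
row_tuple = next(df_mixed.itertuples())
qty_from_tuple = row_tuple.qty

# iterrows(): the row becomes a single Series, so both values are
# upcast to a common float dtype.
_, row_series = next(df_mixed.iterrows())
qty_from_series = row_series['qty']

print(type(qty_from_tuple).__name__, type(qty_from_series).__name__)
```

Attribute access such as row_tuple.qty only works when the column label is a valid Python identifier; positional access (row_tuple[1]) always works.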
Note
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
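The renaming rule can be seen by inspecting the field names of a yielded tuple. In this sketch the column labels are made up for illustration: one is a valid identifier, one contains a space, and one starts with an underscore.

```python
import pandas as pd

# Hypothetical column labels chosen to trigger the renaming.
df_names = pd.DataFrame([[1, 2, 3]],
                        columns=['valid_name', 'has space', '_private'])

row = next(df_names.itertuples())

# _fields lists the namedtuple field names in positional order;
# position 0 is the index.
fields = row._fields

print(fields)
```

The two problematic labels are replaced by an underscore followed by their position in the tuple, so their values are read as row._2 and row._3, or positionally as row[2] and row[3].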