In [1]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

2.13 The where() Method and Masking

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the same shape as the original data, you can use the where method in Series and DataFrame.

To return only the selected rows

In [2]: s
Out[2]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [3]: s[s > 0]
Out[3]: 
3    1
2    2
1    3
0    4
dtype: int64

To return a Series of the same shape as the original

In [4]: s.where(s > 0)
Out[4]: 
4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

Selecting values from a DataFrame with a boolean criterion now also preserves input data shape. where is used under the hood as the implementation. Equivalent is df.where(df < 0)

In [5]: dates = pd.date_range('1/1/2000', periods=8)

In [6]: df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

In [7]: df
Out[7]: 
                 A       B       C       D
2000-01-01  0.4691 -0.2829 -1.5091 -1.1356
2000-01-02  1.2121 -0.1732  0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949  1.0718
2000-01-04  0.7216 -0.7068 -1.0396  0.2719
2000-01-05 -0.4250  0.5670  0.2762 -1.0874
2000-01-06 -0.6737  0.1136 -1.4784  0.5250
2000-01-07  0.4047  0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443  0.8449
In [8]: df[df < 0]
Out[8]: 
                 A       B       C       D
2000-01-01     NaN -0.2829 -1.5091 -1.1356
2000-01-02     NaN -0.1732     NaN -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949     NaN
2000-01-04     NaN -0.7068 -1.0396     NaN
2000-01-05 -0.4250     NaN     NaN -1.0874
2000-01-06 -0.6737     NaN -1.4784     NaN
2000-01-07     NaN     NaN -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443     NaN

In addition, where takes an optional other argument for replacement of values where the condition is False, in the returned copy.

In [9]: df.where(df < 0, -df)
Out[9]: 
                 A       B       C       D
2000-01-01 -0.4691 -0.2829 -1.5091 -1.1356
2000-01-02 -1.2121 -0.1732 -0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949 -1.0718
2000-01-04 -0.7216 -0.7068 -1.0396 -0.2719
2000-01-05 -0.4250 -0.5670 -0.2762 -1.0874
2000-01-06 -0.6737 -0.1136 -1.4784 -0.5250
2000-01-07 -0.4047 -0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443 -0.8449

You may wish to set values based on some boolean criteria. This can be done intuitively like so:

In [10]: s2 = s.copy()

In [11]: s2
Out[11]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [12]: s2[s2 < 0] = 0

In [13]: s2
Out[13]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [14]: df2 = df.copy()

In [15]: df2
Out[15]: 
                 A       B       C       D
2000-01-01  0.4691 -0.2829 -1.5091 -1.1356
2000-01-02  1.2121 -0.1732  0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949  1.0718
2000-01-04  0.7216 -0.7068 -1.0396  0.2719
2000-01-05 -0.4250  0.5670  0.2762 -1.0874
2000-01-06 -0.6737  0.1136 -1.4784  0.5250
2000-01-07  0.4047  0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443  0.8449

In [16]: df2[df2 < 0] = 0

In [17]: df2
Out[17]: 
                 A       B       C       D
2000-01-01  0.4691  0.0000  0.0000  0.0000
2000-01-02  1.2121  0.0000  0.1192  0.0000
2000-01-03  0.0000  0.0000  0.0000  1.0718
2000-01-04  0.7216  0.0000  0.0000  0.2719
2000-01-05  0.0000  0.5670  0.2762  0.0000
2000-01-06  0.0000  0.1136  0.0000  0.5250
2000-01-07  0.4047  0.5770  0.0000  0.0000
2000-01-08  0.0000  0.0000  0.0000  0.8449

By default, where returns a modified copy of the data. There is an optional parameter inplace so that the original data can be modified without creating a copy:

In [18]: df_orig = df.copy()

In [19]: df_orig.where(df > 0, -df, inplace=True);

In [20]: df_orig
Out[20]: 
                 A       B       C       D
2000-01-01  0.4691  0.2829  1.5091  1.1356
2000-01-02  1.2121  0.1732  0.1192  1.0442
2000-01-03  0.8618  2.1046  0.4949  1.0718
2000-01-04  0.7216  0.7068  1.0396  0.2719
2000-01-05  0.4250  0.5670  0.2762  1.0874
2000-01-06  0.6737  0.1136  1.4784  0.5250
2000-01-07  0.4047  0.5770  1.7150  1.0393
2000-01-08  0.3706  1.1579  1.3443  0.8449

Note

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

In [21]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
Out[21]: 
               A     B     C     D
2000-01-01  True  True  True  True
2000-01-02  True  True  True  True
2000-01-03  True  True  True  True
2000-01-04  True  True  True  True
2000-01-05  True  True  True  True
2000-01-06  True  True  True  True
2000-01-07  True  True  True  True
2000-01-08  True  True  True  True

alignment

Furthermore, where aligns the input boolean condition (ndarray or DataFrame), such that partial selection with setting is possible. This is analogous to partial setting via .ix (but on the contents rather than the axis labels)

In [22]: df2 = df.copy()

In [23]: df2
Out[23]: 
                 A       B       C       D
2000-01-01  0.4691 -0.2829 -1.5091 -1.1356
2000-01-02  1.2121 -0.1732  0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949  1.0718
2000-01-04  0.7216 -0.7068 -1.0396  0.2719
2000-01-05 -0.4250  0.5670  0.2762 -1.0874
2000-01-06 -0.6737  0.1136 -1.4784  0.5250
2000-01-07  0.4047  0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443  0.8449

In [24]: df2[ df2[1:4] > 0 ] = 3

In [25]: df2
Out[25]: 
                 A       B       C       D
2000-01-01  0.4691 -0.2829 -1.5091 -1.1356
2000-01-02  3.0000 -0.1732  3.0000 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949  3.0000
2000-01-04  3.0000 -0.7068 -1.0396  3.0000
2000-01-05 -0.4250  0.5670  0.2762 -1.0874
2000-01-06 -0.6737  0.1136 -1.4784  0.5250
2000-01-07  0.4047  0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443  0.8449

New in version 0.13.

Where can also accept axis and level parameters to align the input when performing the where.

In [26]: df2 = df.copy()

In [27]: df2
Out[27]: 
                 A       B       C       D
2000-01-01  0.4691 -0.2829 -1.5091 -1.1356
2000-01-02  1.2121 -0.1732  0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949  1.0718
2000-01-04  0.7216 -0.7068 -1.0396  0.2719
2000-01-05 -0.4250  0.5670  0.2762 -1.0874
2000-01-06 -0.6737  0.1136 -1.4784  0.5250
2000-01-07  0.4047  0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443  0.8449

In [28]: df2.where(df2>0,df2['A'],axis='index')
Out[28]: 
                 A       B       C       D
2000-01-01  0.4691  0.4691  0.4691  0.4691
2000-01-02  1.2121  1.2121  0.1192  1.2121
2000-01-03 -0.8618 -0.8618 -0.8618  1.0718
2000-01-04  0.7216  0.7216  0.7216  0.2719
2000-01-05 -0.4250  0.5670  0.2762 -0.4250
2000-01-06 -0.6737  0.1136 -0.6737  0.5250
2000-01-07  0.4047  0.5770  0.4047  0.4047
2000-01-08 -0.3706 -0.3706 -0.3706  0.8449

This is equivalent (but faster than) the following.

In [29]: df2 = df.copy()

In [30]: df.apply(lambda x, y: x.where(x>0,y), y=df['A'])
Out[30]: 
                 A       B       C       D
2000-01-01  0.4691  0.4691  0.4691  0.4691
2000-01-02  1.2121  1.2121  0.1192  1.2121
2000-01-03 -0.8618 -0.8618 -0.8618  1.0718
2000-01-04  0.7216  0.7216  0.7216  0.2719
2000-01-05 -0.4250  0.5670  0.2762 -0.4250
2000-01-06 -0.6737  0.1136 -0.6737  0.5250
2000-01-07  0.4047  0.5770  0.4047  0.4047
2000-01-08 -0.3706 -0.3706 -0.3706  0.8449

New in version 0.18.1.

Where can accept a callable as condition and other arguments. The function must be with one argument (the calling Series or DataFrame) and that returns valid output as condition and other argument.

In [31]: df3 = pd.DataFrame({'A': [1, 2, 3],
   ....:                     'B': [4, 5, 6],
   ....:                     'C': [7, 8, 9]})
   ....: 

In [32]: df3
Out[32]: 
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

In [33]: df3.where(lambda x: x > 4, lambda x: x + 10)
Out[33]: 
    A   B  C
0  11  14  7
1  12   5  8
2  13   6  9

mask

mask is the inverse boolean operation of where.

In [34]: s
Out[34]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [35]: s.mask(s >= 0)
Out[35]: 
4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64

In [36]: df
Out[36]: 
                 A       B       C       D
2000-01-01  0.4691 -0.2829 -1.5091 -1.1356
2000-01-02  1.2121 -0.1732  0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949  1.0718
2000-01-04  0.7216 -0.7068 -1.0396  0.2719
2000-01-05 -0.4250  0.5670  0.2762 -1.0874
2000-01-06 -0.6737  0.1136 -1.4784  0.5250
2000-01-07  0.4047  0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443  0.8449

In [37]: df.mask(df >= 0)
Out[37]: 
                 A       B       C       D
2000-01-01     NaN -0.2829 -1.5091 -1.1356
2000-01-02     NaN -0.1732     NaN -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949     NaN
2000-01-04     NaN -0.7068 -1.0396     NaN
2000-01-05 -0.4250     NaN     NaN -1.0874
2000-01-06 -0.6737     NaN -1.4784     NaN
2000-01-07     NaN     NaN -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443     NaN