In [1]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
2.13 The where()
Method and Masking
Selecting values from a Series with a boolean vector generally returns a
subset of the data. To guarantee that selection output has the same shape as
the original data, you can use the where
method in Series
and DataFrame
.
To return only the selected rows
In [2]: s
Out[2]:
4 0
3 1
2 2
1 3
0 4
dtype: int64
In [3]: s[s > 0]
Out[3]:
3 1
2 2
1 3
0 4
dtype: int64
To return a Series of the same shape as the original
In [4]: s.where(s > 0)
Out[4]:
4 NaN
3 1.0
2 2.0
1 3.0
0 4.0
dtype: float64
Selecting values from a DataFrame with a boolean criterion now also preserves
input data shape. where
is used under the hood as the implementation.
Equivalent is df.where(df < 0)
In [5]: dates = pd.date_range('1/1/2000', periods=8)
In [6]: df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
In [7]: df
Out[7]:
A B C D
2000-01-01 0.4691 -0.2829 -1.5091 -1.1356
2000-01-02 1.2121 -0.1732 0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949 1.0718
2000-01-04 0.7216 -0.7068 -1.0396 0.2719
2000-01-05 -0.4250 0.5670 0.2762 -1.0874
2000-01-06 -0.6737 0.1136 -1.4784 0.5250
2000-01-07 0.4047 0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443 0.8449
In [8]: df[df < 0]
Out[8]:
A B C D
2000-01-01 NaN -0.2829 -1.5091 -1.1356
2000-01-02 NaN -0.1732 NaN -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949 NaN
2000-01-04 NaN -0.7068 -1.0396 NaN
2000-01-05 -0.4250 NaN NaN -1.0874
2000-01-06 -0.6737 NaN -1.4784 NaN
2000-01-07 NaN NaN -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443 NaN
In addition, where
takes an optional other
argument for replacement of
values where the condition is False, in the returned copy.
In [9]: df.where(df < 0, -df)
Out[9]:
A B C D
2000-01-01 -0.4691 -0.2829 -1.5091 -1.1356
2000-01-02 -1.2121 -0.1732 -0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949 -1.0718
2000-01-04 -0.7216 -0.7068 -1.0396 -0.2719
2000-01-05 -0.4250 -0.5670 -0.2762 -1.0874
2000-01-06 -0.6737 -0.1136 -1.4784 -0.5250
2000-01-07 -0.4047 -0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443 -0.8449
You may wish to set values based on some boolean criteria. This can be done intuitively like so:
In [10]: s2 = s.copy()
In [11]: s2
Out[11]:
4 0
3 1
2 2
1 3
0 4
dtype: int64
In [12]: s2[s2 < 0] = 0
In [13]: s2
Out[13]:
4 0
3 1
2 2
1 3
0 4
dtype: int64
In [14]: df2 = df.copy()
In [15]: df2
Out[15]:
A B C D
2000-01-01 0.4691 -0.2829 -1.5091 -1.1356
2000-01-02 1.2121 -0.1732 0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949 1.0718
2000-01-04 0.7216 -0.7068 -1.0396 0.2719
2000-01-05 -0.4250 0.5670 0.2762 -1.0874
2000-01-06 -0.6737 0.1136 -1.4784 0.5250
2000-01-07 0.4047 0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443 0.8449
In [16]: df2[df2 < 0] = 0
In [17]: df2
Out[17]:
A B C D
2000-01-01 0.4691 0.0000 0.0000 0.0000
2000-01-02 1.2121 0.0000 0.1192 0.0000
2000-01-03 0.0000 0.0000 0.0000 1.0718
2000-01-04 0.7216 0.0000 0.0000 0.2719
2000-01-05 0.0000 0.5670 0.2762 0.0000
2000-01-06 0.0000 0.1136 0.0000 0.5250
2000-01-07 0.4047 0.5770 0.0000 0.0000
2000-01-08 0.0000 0.0000 0.0000 0.8449
By default, where
returns a modified copy of the data. There is an
optional parameter inplace
so that the original data can be modified
without creating a copy:
In [18]: df_orig = df.copy()
In [19]: df_orig.where(df > 0, -df, inplace=True);
In [20]: df_orig
Out[20]:
A B C D
2000-01-01 0.4691 0.2829 1.5091 1.1356
2000-01-02 1.2121 0.1732 0.1192 1.0442
2000-01-03 0.8618 2.1046 0.4949 1.0718
2000-01-04 0.7216 0.7068 1.0396 0.2719
2000-01-05 0.4250 0.5670 0.2762 1.0874
2000-01-06 0.6737 0.1136 1.4784 0.5250
2000-01-07 0.4047 0.5770 1.7150 1.0393
2000-01-08 0.3706 1.1579 1.3443 0.8449
Note
The signature for DataFrame.where()
differs from numpy.where()
.
Roughly df1.where(m, df2)
is equivalent to np.where(m, df1, df2)
.
In [21]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
Out[21]:
A B C D
2000-01-01 True True True True
2000-01-02 True True True True
2000-01-03 True True True True
2000-01-04 True True True True
2000-01-05 True True True True
2000-01-06 True True True True
2000-01-07 True True True True
2000-01-08 True True True True
alignment
Furthermore, where
aligns the input boolean condition (ndarray or DataFrame),
such that partial selection with setting is possible. This is analogous to
partial setting via .ix
(but on the contents rather than the axis labels)
In [22]: df2 = df.copy()
In [23]: df2
Out[23]:
A B C D
2000-01-01 0.4691 -0.2829 -1.5091 -1.1356
2000-01-02 1.2121 -0.1732 0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949 1.0718
2000-01-04 0.7216 -0.7068 -1.0396 0.2719
2000-01-05 -0.4250 0.5670 0.2762 -1.0874
2000-01-06 -0.6737 0.1136 -1.4784 0.5250
2000-01-07 0.4047 0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443 0.8449
In [24]: df2[ df2[1:4] > 0 ] = 3
In [25]: df2
Out[25]:
A B C D
2000-01-01 0.4691 -0.2829 -1.5091 -1.1356
2000-01-02 3.0000 -0.1732 3.0000 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949 3.0000
2000-01-04 3.0000 -0.7068 -1.0396 3.0000
2000-01-05 -0.4250 0.5670 0.2762 -1.0874
2000-01-06 -0.6737 0.1136 -1.4784 0.5250
2000-01-07 0.4047 0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443 0.8449
New in version 0.13.
Where can also accept axis
and level
parameters to align the input when
performing the where
.
In [26]: df2 = df.copy()
In [27]: df2
Out[27]:
A B C D
2000-01-01 0.4691 -0.2829 -1.5091 -1.1356
2000-01-02 1.2121 -0.1732 0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949 1.0718
2000-01-04 0.7216 -0.7068 -1.0396 0.2719
2000-01-05 -0.4250 0.5670 0.2762 -1.0874
2000-01-06 -0.6737 0.1136 -1.4784 0.5250
2000-01-07 0.4047 0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443 0.8449
In [28]: df2.where(df2>0,df2['A'],axis='index')
Out[28]:
A B C D
2000-01-01 0.4691 0.4691 0.4691 0.4691
2000-01-02 1.2121 1.2121 0.1192 1.2121
2000-01-03 -0.8618 -0.8618 -0.8618 1.0718
2000-01-04 0.7216 0.7216 0.7216 0.2719
2000-01-05 -0.4250 0.5670 0.2762 -0.4250
2000-01-06 -0.6737 0.1136 -0.6737 0.5250
2000-01-07 0.4047 0.5770 0.4047 0.4047
2000-01-08 -0.3706 -0.3706 -0.3706 0.8449
This is equivalent (but faster than) the following.
In [29]: df2 = df.copy()
In [30]: df.apply(lambda x, y: x.where(x>0,y), y=df['A'])
Out[30]:
A B C D
2000-01-01 0.4691 0.4691 0.4691 0.4691
2000-01-02 1.2121 1.2121 0.1192 1.2121
2000-01-03 -0.8618 -0.8618 -0.8618 1.0718
2000-01-04 0.7216 0.7216 0.7216 0.2719
2000-01-05 -0.4250 0.5670 0.2762 -0.4250
2000-01-06 -0.6737 0.1136 -0.6737 0.5250
2000-01-07 0.4047 0.5770 0.4047 0.4047
2000-01-08 -0.3706 -0.3706 -0.3706 0.8449
New in version 0.18.1.
Where can accept a callable as condition and other
arguments. The function must
be with one argument (the calling Series or DataFrame) and that returns valid output
as condition and other
argument.
In [31]: df3 = pd.DataFrame({'A': [1, 2, 3],
....: 'B': [4, 5, 6],
....: 'C': [7, 8, 9]})
....:
In [32]: df3
Out[32]:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
In [33]: df3.where(lambda x: x > 4, lambda x: x + 10)
Out[33]:
A B C
0 11 14 7
1 12 5 8
2 13 6 9
mask
mask
is the inverse boolean operation of where
.
In [34]: s
Out[34]:
4 0
3 1
2 2
1 3
0 4
dtype: int64
In [35]: s.mask(s >= 0)
Out[35]:
4 NaN
3 NaN
2 NaN
1 NaN
0 NaN
dtype: float64
In [36]: df
Out[36]:
A B C D
2000-01-01 0.4691 -0.2829 -1.5091 -1.1356
2000-01-02 1.2121 -0.1732 0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949 1.0718
2000-01-04 0.7216 -0.7068 -1.0396 0.2719
2000-01-05 -0.4250 0.5670 0.2762 -1.0874
2000-01-06 -0.6737 0.1136 -1.4784 0.5250
2000-01-07 0.4047 0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443 0.8449
In [37]: df.mask(df >= 0)
Out[37]:
A B C D
2000-01-01 NaN -0.2829 -1.5091 -1.1356
2000-01-02 NaN -0.1732 NaN -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949 NaN
2000-01-04 NaN -0.7068 -1.0396 NaN
2000-01-05 -0.4250 NaN NaN -1.0874
2000-01-06 -0.6737 NaN -1.4784 NaN
2000-01-07 NaN NaN -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443 NaN