2.21 Returning a view versus a copy
When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.
In [1]: dfmi = pd.DataFrame([list('abcd'),
...: list('efgh'),
...: list('ijkl'),
...: list('mnop')],
...: columns=pd.MultiIndex.from_product([['one','two'],
...: ['first','second']]))
...:
In [2]: dfmi
Out[2]:
one two
first second first second
0 a b c d
1 e f g h
2 i j k l
3 m n o p
Compare these two access methods:
In [3]: dfmi['one']['second']
Out[3]:
0 b
1 f
2 j
3 n
Name: second, dtype: object
In [4]: dfmi.loc[:,('one','second')]
Out[4]:
0 b
1 f
2 j
3 n
Name: (one, second), dtype: object
These both yield the same results, so which should you use? It is instructive to understand the order
of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).
dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed.
Then another Python operation, dfmi_with_one['second'], selects the series indexed by 'second'.
The variable name dfmi_with_one is used here to indicate that pandas sees these operations as separate events,
e.g. separate calls to __getitem__, so it has to treat them as linear operations that happen one after another.
Contrast this with dfmi.loc[:,('one','second')], which passes a nested tuple of (slice(None),('one','second'))
to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore, this order of operations can be significantly
faster, and allows one to index both axes if so desired.
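The two access paths can be compared directly. The following is a minimal sketch that rebuilds the dfmi frame from the example above; the variable names chained, single, and explicit are illustrative only and are not part of pandas.

```python
import pandas as pd

dfmi = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=pd.MultiIndex.from_product([["one", "two"], ["first", "second"]]),
)

# Method 1: two separate __getitem__ calls, one after another.
chained = dfmi["one"]["second"]

# Method 2: one __getitem__ call on dfmi.loc with a nested tuple key.
single = dfmi.loc[:, ("one", "second")]

# The same call written out explicitly, with the key as a single object.
explicit = dfmi.loc.__getitem__((slice(None), ("one", "second")))

# Both methods select the same values; only the Series name differs.
same_values = chained.tolist() == single.tolist()
```

Here the values agree, but chained is named 'second' while single is named ('one', 'second'), which reflects that the second selection in method 1 only ever saw the intermediate frame.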
2.21.1 Why does assignment fail when using chained indexing?
The problem in the previous section is just a performance issue. What’s up with
the SettingWithCopy warning? We don’t usually throw warnings around when
you do something that might cost a few extra milliseconds!
But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code:
dfmi.loc[:,('one','second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)
But this code is handled differently:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
See that __getitem__ in there? Outside of simple cases, it’s very hard to
predict whether it will return a view or a copy (it depends on the memory layout
of the array, about which pandas makes no guarantees), and therefore whether
the __setitem__ will modify dfmi or a temporary object that gets thrown
out immediately afterward. That’s what SettingWithCopy is warning you about!
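As a minimal sketch of the predictable case, the single-call form can be checked by reading the column back after the assignment. The replacement value 'z' and the variable name updated are made up for illustration.

```python
import pandas as pd

dfmi = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=pd.MultiIndex.from_product([["one", "two"], ["first", "second"]]),
)

# A single __setitem__ call on dfmi.loc: the target is dfmi itself.
dfmi.loc[:, ("one", "second")] = "z"

# Read the column back from dfmi to confirm that it was modified.
updated = dfmi[("one", "second")].tolist()

# The chained form, dfmi['one']['second'] = 'z', is deliberately not run here:
# whether it would modify dfmi or a temporary object is not guaranteed.
```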
Note
You may be wondering whether we should be concerned about the loc
property in the first example. But dfmi.loc is guaranteed to be dfmi
itself with modified indexing behavior, so dfmi.loc.__getitem__ / dfmi.loc.__setitem__
operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx)
may be a view or a copy of dfmi.
Sometimes a SettingWithCopy warning will arise when there’s no
obvious chained indexing going on. These are the bugs that
SettingWithCopy is designed to catch! Pandas is probably trying to warn you
that you’ve done this:
def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    foo['quux'] = value  # We don't know whether this will modify df or not!
    return foo
Yikes!
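One way to remove the ambiguity, sketched here under the assumption that the function is meant to return an independent frame and leave its input alone, is to take an explicit copy before assigning. The value parameter and the sample frame are hypothetical and exist only for illustration.

```python
import pandas as pd


def do_something(df, value):
    # An explicit copy: foo is now known to be independent of df.
    foo = df[["bar", "baz"]].copy()
    # This assignment can only affect foo, never df.
    foo["quux"] = value
    return foo


# Hypothetical input frame for the sketch.
df = pd.DataFrame({"bar": [1, 2], "baz": [3, 4], "other": [5, 6]})
result = do_something(df, 0)
```

If the intent were instead to modify df itself, the assignment would go through df directly rather than through foo.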
2.21.2 Evaluation order matters
Furthermore, in chained expressions, the order may determine whether a copy is returned or not.
If an expression will set values on a copy of a slice, then a SettingWithCopy
warning is shown or an exception is raised (this raise/warn behavior is new starting in 0.13.0).
You can control the action of a chained assignment via the option mode.chained_assignment,
which can take the values ['raise','warn',None], where showing a warning is the default.
In [5]: dfb = pd.DataFrame({'a' : ['one', 'one', 'two',
...: 'three', 'two', 'one', 'six'],
...: 'c' : np.arange(7)})
...:
# This will show the SettingWithCopyWarning
# but the frame values will be set
In [6]: dfb['c'][dfb.a.str.startswith('o')] = 42
This, however, is operating on a copy and will not work.
>>> pd.set_option('mode.chained_assignment','warn')
>>> dfb[dfb.a.str.startswith('o')]['c'] = 42
Traceback (most recent call last):
...
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
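Following the hint in the message, the same update can be written as a single .loc call that takes a boolean row mask and a column label. This is a minimal sketch that rebuilds the dfb frame from above; the mask variable is illustrative.

```python
import numpy as np
import pandas as pd

dfb = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "c": np.arange(7),
})

# Boolean row selector: True where column 'a' starts with 'o'.
mask = dfb["a"].str.startswith("o")

# Rows and column are given together, so this is one __setitem__ call on dfb.loc.
dfb.loc[mask, "c"] = 42

result = dfb["c"].tolist()
```

Only the rows where a is 'one' are changed; the remaining values of c keep their original numbers.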
A chained assignment can also crop up in setting in a mixed dtype frame.
Note
These setting rules apply to all of .loc/.iloc/.ix.
This is the correct access method:
In [7]: dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})
In [8]: dfc.loc[0,'A'] = 11
In [9]: dfc
Out[9]:
A B
0 11 1
1 bbb 2
2 ccc 3
This can work at times, but is not guaranteed, and so should be avoided:
In [10]: dfc = dfc.copy()
In [11]: dfc['A'][0] = 111
In [12]: dfc
Out[12]:
A B
0 111 1
1 bbb 2
2 ccc 3
This will not work at all, and so should be avoided:
>>> pd.set_option('mode.chained_assignment','raise')
>>> dfc.loc[0]['A'] = 1111
Traceback (most recent call last):
...
SettingWithCopyException:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
Warning
The chained assignment warnings / exceptions aim to inform the user of a possibly invalid assignment. There may be false positives: situations where a chained assignment is inadvertently reported.