9.2 Splitting and Replacing Strings
Methods like split return a Series of lists:
In [1]: s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
In [2]: s2.str.split('_')
Out[2]:
0 [a, b, c]
1 [c, d, e]
2 NaN
3 [f, g, h]
dtype: object
Elements in the split lists can be accessed using get or [] notation:
In [3]: s2.str.split('_').str.get(1)
Out[3]:
0 b
1 d
2 NaN
3 g
dtype: object
In [4]: s2.str.split('_').str[1]
Out[4]:
0 b
1 d
2 NaN
3 g
dtype: object
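The lookups above can also be tried on lists of unequal length. The following is a minimal sketch, assuming a hypothetical Series named ragged whose last entry has only two pieces; a position that does not exist in a given list yields NaN rather than raising an error, and negative positions count from the end.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last string splits into only two pieces.
ragged = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g'])
parts = ragged.str.split('_')

# Position 2 is missing for the NaN entry and for the two-element list,
# so both of those rows come back as NaN.
third = parts.str.get(2)

# Negative positions count from the end of each list.
last = parts.str[-1]

print(third)
print(last)
```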
It is easy to expand this to return a DataFrame using expand=True:
In [5]: s2.str.split('_', expand=True)
Out[5]:
0 1 2
0 a b c
1 c d e
2 NaN None None
3 f g h
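The expanded result is an ordinary DataFrame with integer column labels, so it can be relabeled like any other. A small sketch, where the column names first, second, and third are purely illustrative:

```python
import numpy as np
import pandas as pd

s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])

# expand=True spreads the pieces across columns labeled 0, 1, 2.
parts = s2.str.split('_', expand=True)

# Hypothetical labels chosen for this example only.
parts.columns = ['first', 'second', 'third']

print(parts)
```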
It is also possible to limit the number of splits:
In [6]: s2.str.split('_', expand=True, n=1)
Out[6]:
0 1
0 a b_c
1 c d_e
2 NaN None
3 f g_h
rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning of the string:
In [7]: s2.str.rsplit('_', expand=True, n=1)
Out[7]:
0 1
0 a_b c
1 c_d e
2 NaN None
3 f_g h
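One place where splitting from the right is convenient is separating a trailing suffix from the rest of a string. The sketch below assumes a hypothetical Series of file names; with n=1 only the last dot is used as the split point, so earlier dots stay in the first column.

```python
import numpy as np
import pandas as pd

# Hypothetical file names, including one with more than one dot.
files = pd.Series(['report.final.csv', 'notes.txt', np.nan])

# Split once, starting from the right-hand end of each string.
name_ext = files.str.rsplit('.', n=1, expand=True)
name_ext.columns = ['stem', 'extension']  # illustrative labels

print(name_ext)
```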
Methods like replace and findall take regular expressions, too:
In [8]: s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca',
...: '', np.nan, 'CABA', 'dog', 'cat'])
...:
In [9]: s3
Out[9]:
0 A
1 B
2 C
3 Aaba
...
6 NaN
7 CABA
8 dog
9 cat
dtype: object
In [10]: s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)
Out[10]:
0 A
1 B
2 C
3 XX-XX ba
...
6 NaN
7 XX-XX BA
8 XX-XX
9 XX-XX t
dtype: object
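findall is mentioned above without an example. As a minimal sketch, assuming a hypothetical Series named codes, the call below collects every run of digits in each string; entries with no match give an empty list and missing values stay NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data mixing letters and digits.
codes = pd.Series(['a1b22', 'c333', np.nan, 'none'])

# Each element becomes the list of all non-overlapping matches of the pattern.
digits = codes.str.findall(r'\d+')

print(digits)
```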
Take care when a pattern contains characters that have a special meaning in regular expressions. For example, the following code will cause trouble because $ matches the end of a string in a regular expression:
# Consider the following badly formatted financial data
In [11]: dollars = pd.Series(['12', '-$10', '$10,000'])
# This does what you'd naively expect:
In [12]: dollars.str.replace('$', '')
Out[12]:
0 12
1 -10
2 10,000
dtype: object
# But this doesn't:
In [13]: dollars.str.replace('-$', '-', regex=True)
Out[13]:
0 12
1 -$10
2 $10,000
dtype: object
# We need to escape the special character so it is matched literally
In [14]: dollars.str.replace(r'-\$', '-', regex=True)
Out[14]:
0 12
1 -10
2 $10,000
dtype: object
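Two other ways to get a literal match are sketched below, assuming a pandas version in which str.replace accepts the regex keyword: pass regex=False so the pattern is never interpreted as a regular expression, or let the standard library's re.escape add the backslashes.

```python
import re

import pandas as pd

dollars = pd.Series(['12', '-$10', '$10,000'])

# Option 1: treat the pattern as a plain string.
literal = dollars.str.replace('-$', '-', regex=False)

# Option 2: keep regex matching but escape the special characters first.
escaped = dollars.str.replace(re.escape('-$'), '-', regex=True)

print(literal)
print(escaped)
```

Both calls leave '12' and '$10,000' untouched and turn '-$10' into '-10'. Stating regex explicitly also keeps the behavior the same across pandas versions whose defaults differ.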