9.2 Splitting and Replacing Strings

Methods like split return a Series of lists:

In [1]: s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])

In [2]: s2.str.split('_')
Out[2]: 
0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

Elements in the split lists can be accessed using get or [] notation:

In [3]: s2.str.split('_').str.get(1)
Out[3]: 
0      b
1      d
2    NaN
3      g
dtype: object

In [4]: s2.str.split('_').str[1]
Out[4]: 
0      b
1      d
2    NaN
3      g
dtype: object
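
As a further illustration, the sketch below uses a hypothetical Series (not part of the example above) whose rows split into different numbers of pieces. It suggests that both get and [] return a missing value, rather than raising, when the requested position does not exist:

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows split into different numbers of pieces,
# and the last row is missing.
s = pd.Series(['a_b_c', 'd_e', np.nan])
parts = s.str.split('_')

# Position 2 exists only in the first row; the other rows come back missing.
third_get = parts.str.get(2)
third_idx = parts.str[2]

print(third_get.tolist())
print(third_idx.tolist())
```

This makes the two accessors convenient on ragged data, since no length check is needed before indexing.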

It is easy to expand this to return a DataFrame using expand:

In [5]: s2.str.split('_', expand=True)
Out[5]: 
     0     1     2
0    a     b     c
1    c     d     e
2  NaN  None  None
3    f     g     h

It is also possible to limit the number of splits:

In [6]: s2.str.split('_', expand=True, n=1)
Out[6]: 
     0     1
0    a   b_c
1    c   d_e
2  NaN  None
3    f   g_h

rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning of the string:

In [7]: s2.str.rsplit('_', expand=True, n=1)
Out[7]: 
     0     1
0  a_b     c
1  c_d     e
2  NaN  None
3  f_g     h
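
To make the difference in direction concrete, here is a minimal sketch on hypothetical labels (the names and values are illustrative only). With n=1, split cuts at the first separator and rsplit cuts at the last one:

```python
import pandas as pd

# Hypothetical labels of the form <region>_<measure>_<year>.
labels = pd.Series(['north_sales_2019', 'south_sales_2020'])

# One split from the left: the first field is separated from the rest.
first_split = labels.str.split('_', n=1, expand=True)

# One split from the right: the last field is separated from the rest.
last_split = labels.str.rsplit('_', n=1, expand=True)

print(first_split)
print(last_split)
```

rsplit with n=1 is a natural choice when only the trailing field is needed and the leading part may itself contain the separator.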

Methods like replace and findall take regular expressions, too:

In [8]: s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca',
   ...:                '', np.nan, 'CABA', 'dog', 'cat'])
   ...: 

In [9]: s3
Out[9]: 
0       A
1       B
2       C
3    Aaba
     ... 
6     NaN
7    CABA
8     dog
9     cat
dtype: object

In [10]: s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)
Out[10]: 
0           A
1           B
2           C
3    XX-XX ba
       ...   
6         NaN
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: object
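
findall is mentioned above but not shown. A minimal sketch, assuming a shortened, hypothetical variant of s3 and using the flags argument for case-insensitive matching, might look like this:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical, shortened variant of the s3 example.
s = pd.Series(['Aaba', 'Baca', '', np.nan, 'CABA', 'dog'])

# Collect every match of the pattern in each element, ignoring case.
# Elements without a match give an empty list; missing values stay missing.
matches = s.str.findall('a', flags=re.IGNORECASE)

print(matches)
```

As with split, the result is a Series of lists, so individual matches can be reached through str.get or [] notation.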

Because these patterns are interpreted as regular expressions, special characters need care. For example, the following code causes trouble because $ anchors the end of the string in a regular expression:

# Consider the following badly formatted financial data
In [11]: dollars = pd.Series(['12', '-$10', '$10,000'])

# This does what you'd naively expect:
In [12]: dollars.str.replace('$', '')
Out[12]: 
0        12
1       -10
2    10,000
dtype: object

# But this doesn't:
In [13]: dollars.str.replace('-$', '-', regex=True)
Out[13]: 
0         12
1       -$10
2    $10,000
dtype: object

# Escape the special character (needed for patterns longer than one character)
In [14]: dollars.str.replace(r'-\$', '-', regex=True)
Out[14]: 
0         12
1        -10
2    $10,000
dtype: object
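
Two alternatives to escaping by hand are sketched below. Both assume a pandas version in which str.replace accepts the regex keyword. The first builds the escaped pattern with re.escape from the standard library; the second turns off regular-expression matching so the pattern is taken literally:

```python
import re

import pandas as pd

dollars = pd.Series(['12', '-$10', '$10,000'])

# Option 1: let re.escape add the backslashes, then match as a regex.
escaped = dollars.str.replace(re.escape('-$'), '-', regex=True)

# Option 2: disable regex matching so '-$' is treated as plain text.
literal = dollars.str.replace('-$', '-', regex=False)

print(escaped.tolist())
print(literal.tolist())
```

Passing regex explicitly also documents the intent of the call, which may help when the default interpretation of the pattern is not obvious to the reader.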