9.1 Introduction
Series and Index are equipped with a set of string processing methods
that make it easy to operate on each element of the array. Perhaps most
importantly, these methods exclude missing/NA values automatically. These are
accessed via the str attribute and generally have names matching
the equivalent (scalar) built-in string methods:
In [1]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [2]: s.str.lower()
Out[2]:
0 a
1 b
2 c
3 aaba
...
5 NaN
6 caba
7 dog
8 cat
dtype: object
In [3]: s.str.upper()
Out[3]:
0 A
1 B
2 C
3 AABA
...
5 NaN
6 CABA
7 DOG
8 CAT
dtype: object
In [4]: s.str.len()
Out[4]:
0 1.0
1 1.0
2 1.0
3 4.0
...
5 NaN
6 4.0
7 3.0
8 3.0
dtype: float64
In [5]: idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
In [6]: idx.str.strip()
Out[6]: Index([u'jack', u'jill', u'jesse', u'frank'], dtype='object')
In [7]: idx.str.lstrip()
Out[7]: Index([u'jack', u'jill ', u'jesse ', u'frank'], dtype='object')
In [8]: idx.str.rstrip()
Out[8]: Index([u' jack', u'jill', u' jesse', u'frank'], dtype='object')
The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace:
In [9]: df = pd.DataFrame(randn(3, 2), columns=[' Column A ', ' Column B '],
...: index=range(3))
...:
In [10]: df
Out[10]:
Column A Column B
0 0.017428 0.039049
1 -2.240248 0.847859
2 -1.342107 0.368828
Since df.columns is an Index object, we can use the .str accessor
In [11]: df.columns.str.strip()
Out[11]: Index([u'Column A', u'Column B'], dtype='object')
In [12]: df.columns.str.lower()
Out[12]: Index([u' column a ', u' column b '], dtype='object')
These string methods can then be used to clean up the columns as needed. Here we are removing leading and trailing whitespaces, lowercasing all names, and replacing any remaining whitespaces with underscores:
In [13]: df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
In [14]: df
Out[14]:
column_a column_b
0 0.017428 0.039049
1 -2.240248 0.847859
2 -1.342107 0.368828
Note
If you have a Series where lots of elements are repeated
(i.e. the number of unique elements in the Series is a lot smaller than the length of the
Series), it can be faster to convert the original Series to one of type
category and then use .str.<method> or .dt.<property> on that.
The performance difference comes from the fact that, for Series of type category, the
string operations are done on the .categories and not on each element of the
Series.
Please note that a Series of type category with string .categories has
some limitations in comparison of Series of type string (e.g. you can’t add strings to
each other: s + " " + s won’t work if s is a Series of type category). Also,
.str methods which operate on elements of type list are not available on such a
Series.