9.4 Extracting Substrings
9.4.1 Extract first match in each subject (extract)
New in version 0.13.0.
Warning
In version 0.18.0, extract gained the expand argument. When expand=False it returns a Series, Index, or DataFrame, depending on the subject and regular expression pattern (same behavior as pre-0.18.0). When expand=True it always returns a DataFrame, which is more consistent and less confusing from the perspective of a user.
The extract method accepts a regular expression with at least one capture group.
Extracting a regular expression with more than one group returns a DataFrame with one column per group.
In [1]: pd.Series(['a1', 'b2', 'c3']).str.extract(r'([ab])(\d)', expand=False)
Out[1]:
     0    1
0    a    1
1    b    2
2  NaN  NaN
Elements that do not match return a row filled with NaN. Thus, a Series of messy strings can be “converted” into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects. The dtype of the result is always object, even if no match is found and the result only contains NaN.
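As a minimal sketch of the behavior described above, the following extracts two groups and checks that the non-matching element becomes a row of missing values while the input index is preserved. The names `messy` and `parts` are illustrative and do not come from the original text.

```python
import pandas as pd

# Illustrative data: the last element does not match the pattern.
messy = pd.Series(["a1", "b2", "c3"])

# Two capture groups give one column per group.
parts = messy.str.extract(r"([ab])(\d)", expand=True)

# The result is indexed like the input Series.
same_index = parts.index.equals(messy.index)

# Row 2 did not match, so every column in that row is missing.
unmatched_row_is_missing = bool(parts.loc[2].isna().all())

print(parts)
print(same_index, unmatched_row_is_missing)
```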
Named groups like
In [2]: pd.Series(['a1', 'b2', 'c3']).str.extract(r'(?P<letter>[ab])(?P<digit>\d)', expand=False)
Out[2]:
  letter digit
0      a     1
1      b     2
2    NaN   NaN
and optional groups like
In [3]: pd.Series(['a1', 'b2', '3']).str.extract(r'([ab])?(\d)', expand=False)
Out[3]:
     0  1
0    a  1
1    b  2
2  NaN  3
can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.
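The column-naming rule can be checked directly. The sketch below is illustrative only (the variable names are assumptions) and compares the column labels produced by named, unnamed, and mixed capture groups.

```python
import pandas as pd

subjects = pd.Series(["a1", "b2", "3"])

# Named groups become column labels.
named = subjects.str.extract(r"(?P<letter>[ab])?(?P<digit>\d)", expand=True)

# Unnamed groups are labeled by position, starting at 0.
numbered = subjects.str.extract(r"([ab])?(\d)", expand=True)

# Mixing named and unnamed groups gives a mix of names and positions.
mixed = subjects.str.extract(r"(?P<letter>[ab])?(\d)", expand=True)

print(list(named.columns))
print(list(numbered.columns))
print(list(mixed.columns))
```

Because the first group is optional, the subject '3' still matches: its letter column is missing while its digit column is filled.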
Extracting a regular expression with one group returns a DataFrame with one column if expand=True.
In [4]: pd.Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)', expand=True)
Out[4]:
     0
0    1
1    2
2  NaN
It returns a Series if expand=False.
In [5]: pd.Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)', expand=False)
Out[5]:
0      1
1      2
2    NaN
dtype: object
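A short sketch contrasting the two settings for a single-group pattern may help here. The names `as_frame` and `as_series` are illustrative; the point is that the extracted values are the same and only the container differs.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])
pattern = r"[ab](\d)"

as_frame = s.str.extract(pattern, expand=True)    # DataFrame with one column
as_series = s.str.extract(pattern, expand=False)  # Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)

# The single column of the DataFrame holds the same matches as the Series.
print(as_frame[0].dropna().tolist())
print(as_series.dropna().tolist())
```

Passing expand explicitly keeps the return type predictable when the pattern is built at run time and the number of groups is not known in advance.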
Calling on an Index with a regex with exactly one capture group returns a DataFrame with one column if expand=True.
In [6]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
In [7]: s
Out[7]:
A11    a1
B22    b2
C33    c3
dtype: object
In [8]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[8]:
  letter
0      A
1      B
2      C
It returns an Index if expand=False.
In [9]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[9]: Index([u'A', u'B', u'C'], dtype='object', name=u'letter')
Calling on an Index with a regex with more than one capture group returns a DataFrame if expand=True.
In [10]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[10]:
  letter   1
0      A  11
1      B  22
2      C  33
It raises ValueError if expand=False.
>>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
ValueError: only one regex group is supported with Index
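One way to handle this case in code is sketched below, assuming the installed pandas version still raises ValueError as described here. The variable names are illustrative. The sketch catches the error and falls back to expand=True, which returns a DataFrame.

```python
import pandas as pd

idx = pd.Index(["A11", "B22", "C33"])
pattern = r"(?P<letter>[a-zA-Z])([0-9]+)"

try:
    idx.str.extract(pattern, expand=False)
    raised = False
except ValueError:
    # An Index cannot hold two columns, so expand=False is rejected.
    raised = True

# Fallback: request a DataFrame instead.
frame = idx.str.extract(pattern, expand=True)

print(raised)
print(frame)
```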
The table below summarizes the behavior of extract(expand=False) (input subject in first column, number of groups in regex in first row).
       | 1 group | >1 group   |
Index  | Index   | ValueError |
Series | Series  | DataFrame  |
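The four cells of the table can be reproduced with a small script. This is a sketch, not part of the library: `result_kind` is a hypothetical helper that reports the type name of the result, or the name of the exception if one is raised.

```python
import pandas as pd

one_group = r"([a-z])"
two_groups = r"([a-z])([0-9])"
subjects = {
    "Index": pd.Index(["a1", "b2"]),
    "Series": pd.Series(["a1", "b2"]),
}


def result_kind(subject, pattern):
    """Return the type name of extract(expand=False), or the error name."""
    try:
        return type(subject.str.extract(pattern, expand=False)).__name__
    except ValueError as exc:
        return type(exc).__name__


# Map each subject kind to (result for 1 group, result for >1 group).
summary = {
    name: (result_kind(subj, one_group), result_kind(subj, two_groups))
    for name, subj in subjects.items()
}
print(summary)
```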
9.4.2 Extract all matches in each subject (extractall)
New in version 0.18.0.
Unlike extract (which returns only the first match),
In [11]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
In [12]: s
Out[12]:
A    a1a2
B      b1
C      c1
dtype: object
In [13]: two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
In [14]: s.str.extract(two_groups, expand=True)
Out[14]:
  letter digit
A      a     1
B      b     1
C      c     1
the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.
In [15]: s.str.extractall(two_groups)
Out[15]:
        letter digit
  match
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1
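The MultiIndex makes it straightforward to work with matches per subject. The sketch below uses illustrative variable names. It inspects the level names, counts matches per original label, and selects one specific match by its (label, match number) pair.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

all_matches = s.str.extractall(two_groups)

# The row index has two levels: the original label and the match number.
level_names = list(all_matches.index.names)

# Counting rows per original label gives the number of matches per subject.
matches_per_subject = all_matches.groupby(level=0).size()

# A single match can be selected by (label, match number).
second_match_in_A = all_matches.loc[("A", 1)]

print(level_names)
print(matches_per_subject)
print(second_match_in_A)
```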
When each subject string in the Series has exactly one match,
In [16]: s = pd.Series(['a3', 'b3', 'c2'])
In [17]: s
Out[17]:
0    a3
1    b3
2    c2
dtype: object
then extractall(pat).xs(0, level='match') gives the same result as extract(pat).
In [18]: extract_result = s.str.extract(two_groups, expand=True)
In [19]: extract_result
Out[19]:
  letter digit
0      a     3
1      b     3
2      c     2
In [20]: extractall_result = s.str.extractall(two_groups)
In [21]: extractall_result
Out[21]:
        letter digit
  match
0 0          a     3
1 0          b     3
2 0          c     2
In [22]: extractall_result.xs(0, level="match")
Out[22]:
  letter digit
0      a     3
1      b     3
2      c     2
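Since the equivalence only holds when every subject has exactly one match, it can be useful to check that precondition first. The sketch below is one possible approach, with illustrative variable names. It uses str.count to count matches per subject and then compares labels and values of the two results explicitly.

```python
import pandas as pd

s = pd.Series(["a3", "b3", "c2"])
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

# Precondition: every subject contains exactly one match.
exactly_one = bool((s.str.count(two_groups) == 1).all())

first_only = s.str.extract(two_groups, expand=True)
first_of_all = s.str.extractall(two_groups).xs(0, level="match")

# Compare labels and values rather than relying on exact index types.
same_columns = list(first_only.columns) == list(first_of_all.columns)
same_index = list(first_only.index) == list(first_of_all.index)
same_values = first_only.values.tolist() == first_of_all.values.tolist()

print(exactly_one, same_columns, same_index, same_values)
```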
Index also supports .str.extractall. It returns a DataFrame with the same contents as Series.str.extractall called on a Series with a default index (starting from 0).
New in version 0.19.0.
In [23]: pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups)
Out[23]:
        letter digit
  match
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1
In [24]: pd.Series(["a1a2", "b1", "c1"]).str.extractall(two_groups)
Out[24]:
        letter digit
  match
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1
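The two outputs above can also be compared programmatically. The following sketch (variable names are illustrative) builds both results from the same list of strings and checks that their row labels and values agree.

```python
import pandas as pd

two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"
values = ["a1a2", "b1", "c1"]

from_index = pd.Index(values).str.extractall(two_groups)
from_series = pd.Series(values).str.extractall(two_groups)

# Row labels are (position, match number) pairs in both cases.
same_labels = list(from_index.index) == list(from_series.index)
same_values = from_index.values.tolist() == from_series.values.tolist()

print(list(from_index.index))
print(same_labels, same_values)
```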