2.2 Selection

2.2.1 DataFrames

The indexing docs.

Using both row labels and value conditionals

In [1]: df = pd.DataFrame(
   ...:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
   ...: 
Out[1]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [2]: df[(df.AAA <= 6) & (df.index.isin([0,2,4]))]
Out[2]: 
   AAA  BBB  CCC
0    4   10  100
2    6   30  -30

Use loc for label-oriented slicing and iloc positional slicing

In [3]: data = {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}

In [4]: df = pd.DataFrame(data=data,index=['foo','bar','boo','kar']); df
Out[4]: 
     AAA  BBB  CCC
foo    4   10  100
bar    5   20   50
boo    6   30  -30
kar    7   40  -50

There are 2 explicit slicing methods, with a third general case

  1. Positional-oriented (Python slicing style : exclusive of end)
  2. Label-oriented (Non-Python slicing style : inclusive of end)
  3. General (Either slicing style : depends on if the slice contains labels or positions)
In [5]: df.loc['bar':'kar'] #Label
Out[5]: 
     AAA  BBB  CCC
bar    5   20   50
boo    6   30  -30
kar    7   40  -50

#Generic
In [6]: df.ix[0:3] #Same as .iloc[0:3]
Out[6]: 
     AAA  BBB  CCC
foo    4   10  100
bar    5   20   50
boo    6   30  -30

In [7]: df.ix['bar':'kar'] #Same as .loc['bar':'kar']
Out[7]: 
     AAA  BBB  CCC
bar    5   20   50
boo    6   30  -30
kar    7   40  -50

Ambiguity arises when an index consists of integers with a non-zero start or non-unit increment.

In [8]: df2 = pd.DataFrame(data=data,index=[1,2,3,4]); #Note index starts at 1.

In [9]: df2.iloc[1:3] #Position-oriented
Out[9]: 
   AAA  BBB  CCC
2    5   20   50
3    6   30  -30

In [10]: df2.loc[1:3] #Label-oriented
Out[10]: 
   AAA  BBB  CCC
1    4   10  100
2    5   20   50
3    6   30  -30

In [11]: df2.ix[1:3] #General, will mimic loc (label-oriented)
Out[11]: 
   AAA  BBB  CCC
1    4   10  100
2    5   20   50
3    6   30  -30

In [12]: df2.ix[0:3] #General, will mimic iloc (position-oriented), as loc[0:3] would raise a KeyError
Out[12]: 
   AAA  BBB  CCC
1    4   10  100
2    5   20   50
3    6   30  -30

Using inverse operator (~) to take the complement of a mask

In [13]: df = pd.DataFrame(
   ....:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40], 'CCC' : [100,50,-30,-50]}); df
   ....: 
Out[13]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [14]: df[~((df.AAA <= 6) & (df.index.isin([0,2,4])))]
Out[14]: 
   AAA  BBB  CCC
1    5   20   50
3    7   40  -50

2.2.2 Panels

Extend a panel frame by transposing, adding a new dimension, and transposing back to the original dimensions

In [15]: rng = pd.date_range('1/1/2013',periods=100,freq='D')

In [16]: data = np.random.randn(100, 4)

In [17]: cols = ['A','B','C','D']

In [18]: df1, df2, df3 = pd.DataFrame(data, rng, cols), pd.DataFrame(data, rng, cols), pd.DataFrame(data, rng, cols)

In [19]: pf = pd.Panel({'df1':df1,'df2':df2,'df3':df3});pf
Out[19]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 100 (major_axis) x 4 (minor_axis)
Items axis: df1 to df3
Major_axis axis: 2013-01-01 00:00:00 to 2013-04-10 00:00:00
Minor_axis axis: A to D

#Assignment using Transpose  (pandas < 0.15)
In [20]: pf = pf.transpose(2,0,1)

In [21]: pf['E'] = pd.DataFrame(data, rng, cols)

In [22]: pf = pf.transpose(1,2,0);pf
Out[22]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 100 (major_axis) x 5 (minor_axis)
Items axis: df1 to df3
Major_axis axis: 2013-01-01 00:00:00 to 2013-04-10 00:00:00
Minor_axis axis: A to E

#Direct assignment (pandas > 0.15)
In [23]: pf.loc[:,:,'F'] = pd.DataFrame(data, rng, cols);pf
Out[23]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 100 (major_axis) x 6 (minor_axis)
Items axis: df1 to df3
Major_axis axis: 2013-01-01 00:00:00 to 2013-04-10 00:00:00
Minor_axis axis: A to F

Mask a panel by using np.where and then reconstructing the panel with the new masked values

2.2.3 New Columns

Efficiently and dynamically creating new columns using applymap

In [24]: df = pd.DataFrame(
   ....:      {'AAA' : [1,2,1,3], 'BBB' : [1,1,2,2], 'CCC' : [2,1,3,1]}); df
   ....: 
Out[24]: 
   AAA  BBB  CCC
0    1    1    2
1    2    1    1
2    1    2    3
3    3    2    1

In [25]: source_cols = df.columns # or some subset would work too.

In [26]: new_cols = [str(x) + "_cat" for x in source_cols]

In [27]: categories = {1 : 'Alpha', 2 : 'Beta', 3 : 'Charlie' }

In [28]: df[new_cols] = df[source_cols].applymap(categories.get);df
Out[28]: 
   AAA  BBB  CCC  AAA_cat BBB_cat  CCC_cat
0    1    1    2    Alpha   Alpha     Beta
1    2    1    1     Beta   Alpha    Alpha
2    1    2    3    Alpha    Beta  Charlie
3    3    2    1  Charlie    Beta    Alpha

Keep other columns when using min() with groupby

In [29]: df = pd.DataFrame(
   ....:      {'AAA' : [1,1,1,2,2,2,3,3], 'BBB' : [2,1,3,4,5,1,2,3]}); df
   ....: 
Out[29]: 
   AAA  BBB
0    1    2
1    1    1
2    1    3
3    2    4
4    2    5
5    2    1
6    3    2
7    3    3

Method 1 : idxmin() to get the index of the mins

In [30]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
Out[30]: 
   AAA  BBB
1    1    1
5    2    1
6    3    2

Method 2 : sort then take first of each

In [31]: df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
Out[31]: 
   AAA  BBB
0    1    1
1    2    1
2    3    2

Notice the same results, with the exception of the index.