5.11 Examples

5.11.1 Regrouping by factor

Regroup the columns of a DataFrame according to their sums, then sum the columns within each group.

In [1]: df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]})

In [2]: df
Out[2]: 
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4

In [3]: df.groupby(df.sum(), axis=1).sum()
Out[3]: 
   1  9
0  2  2
1  1  3
2  0  4
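Newer pandas releases deprecate the axis argument of groupby, so the call above may emit a warning or fail depending on the installed version. A minimal sketch of an equivalent formulation, assuming the same df, transposes the frame, groups its rows, and transposes back. The names key and regrouped are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]})

# Column sums act as the grouping key: a, b and c each sum to 1, d sums to 9.
key = df.sum()

# Transpose so the columns become rows, group those rows by the key,
# sum within each group, then transpose back to the original orientation.
regrouped = df.T.groupby(key).sum().T
print(regrouped)
```

The result has one column per distinct column sum (1 and 9), matching the output shown above.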

5.11.2 Groupby by Indexer to ‘resample’ data

Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.

In order for resample to work on indices that are not datetime-like, the following procedure can be used.

In the following examples, df.index // 5 returns an array of integer bin labels (here only 0 and 1, because there are ten rows), which determines which rows are grouped together by the groupby operation.

Note

The example below shows how to downsample by consolidating samples into fewer samples. Using df.index // 5 aggregates the samples into bins of five consecutive rows. Applying the std() function then reduces each bin to a single value per column, its standard deviation, thereby reducing the number of samples.

In [4]: df = pd.DataFrame(np.random.randn(10, 2))

In [5]: df
Out[5]: 
        0       1
0  0.4691 -0.2829
1 -1.5091 -1.1356
2  1.2121 -0.1732
3  0.1192 -1.0442
4 -0.8618 -2.1046
5 -0.4949  1.0718
6  0.7216 -0.7068
7 -1.0396  0.2719
8 -0.4250  0.5670
9  0.2762 -1.0874

In [6]: df.index // 5
Out[6]: Int64Index([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype='int64')

In [7]: df.groupby(df.index // 5).std()
Out[7]: 
        0       1
0  1.0792  0.7786
1  0.6925  0.8977
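The integer-division trick relies on the index being 0, 1, 2, and so on. As a hedged sketch of the same idea for a frame whose index is not numeric, the bin labels can be built from row positions instead. The string index, the column name value, and the choice of mean() are assumptions made only to keep the example deterministic.

```python
import numpy as np
import pandas as pd

# Ten rows with a non-numeric index, so df.index // 5 would not work here.
df = pd.DataFrame({'value': np.arange(10.0)}, index=list('abcdefghij'))

# Positional bin labels: rows 0-4 fall in bin 0, rows 5-9 in bin 1.
bins = np.arange(len(df)) // 5

# Aggregate each bin to a single row, here with the mean.
downsampled = df.groupby(bins).mean()
print(downsampled)
```

Any other aggregation, such as std() in the example above, can be substituted for mean().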

5.11.3 Returning a Series to propagate names

Group DataFrame columns, compute a set of metrics, and return a named Series. The Series name is used as the name of the column index. This is especially useful in conjunction with reshaping operations such as stacking, in which the column index name is used as the name of the inserted column:

In [8]: df = pd.DataFrame({
   ...:          'a':  [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
   ...:          'b':  [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
   ...:          'c':  [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
   ...:          'd':  [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
   ...:          })
   ...: 

In [9]: def compute_metrics(x):
   ...:     result = {'b_sum': x['b'].sum(), 'c_mean': x['c'].mean()}
   ...:     return pd.Series(result, name='metrics')
   ...: 

In [10]: result = df.groupby('a').apply(compute_metrics)

In [11]: result
Out[11]: 
metrics  b_sum  c_mean
a                     
0          2.0     0.5
1          2.0     0.5
2          2.0     0.5

In [12]: result.stack()
Out[12]: 
a  metrics
0  b_sum      2.0
   c_mean     0.5
1  b_sum      2.0
   c_mean     0.5
2  b_sum      2.0
   c_mean     0.5
dtype: float64
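To make the propagated name visible as an ordinary column, the stacked result can be flattened with reset_index. This is a sketch under assumptions: the column label value and the variable tidy are hypothetical, and the columns b and c are selected explicitly before apply so that the grouping column is not passed to compute_metrics.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    'b': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    'c': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

def compute_metrics(x):
    # The Series name becomes the name of the resulting column index.
    result = {'b_sum': x['b'].sum(), 'c_mean': x['c'].mean()}
    return pd.Series(result, name='metrics')

result = df.groupby('a')[['b', 'c']].apply(compute_metrics)

# Stack the metric columns into rows, then turn both index levels into columns;
# the level that came from the column index is labeled 'metrics'.
tidy = result.stack().reset_index(name='value')
print(tidy)
```

Each row of tidy holds one group label, one metric name, and the corresponding value.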