pandas_groupby

tasks
preserve the highest value entries in each group
preserve the highest odd value in each group
level
filter elements from groups that dont satisfy a criterion
extract groupby object by key
groupby slicing
apply a function on each group

tasks

Show artists with multiple albums

preserve the highest value entries in each group

tags | filter by value

Given

    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      7
8  MM4  S2   uyi      7

get all the rows with highest 'count' for each ['Sp', 'Mt'] combination. That is we want

    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
7  MM4  S2    cb      7
8  MM4  S2   uyi      7

Solution

In [1]:
import pandas as pd
df = pd.DataFrame({'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
                   'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
                   'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
                   'count': [3, 2, 5, 8, 10, 1, 2, 7, 7]})
df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      7
8  MM4  S2   uyi      7

In [2]:
df.groupby(['Sp', 'Mt'])['count'].transform(max)
Out[2]:
0     3
1     3
2     5
3     8
4    10
5    10
6     7
7     7
8     7
Name: count, dtype: int64

In [3]:
idx = df.groupby(['Sp', 'Mt'])['count'].transform(max) == df['count']
df[idx]
Out[3]:
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
7  MM4  S2    cb      7
8  MM4  S2   uyi      7

Ref:- https://stackoverflow.com/questions/15705630/get-the-rows-which-have-the-max-value-in-groups-using-groupby

preserve the highest odd value in each group

tags | pandas groupby transform maximum odd number, maxodd

Given

     Sp  Mt Value  count
0   MM1  S1     a      1
1   MM1  S1     n      2
2   MM1  S1    cb      3
3   MM2  S2    mk      1
4   MM2  S2    bg      2
5   MM3  S3   dgd      2
6   MM3  S3    rd      3
7   MM4  S4    cb      1
8   MM4  S4   uyi      3
9   MM5  S5     w      1
10  MM6  S6    ea      2
11  MM7  S7     t      3

We want

     Sp  Mt Value  count
2   MM1  S1    cb      3
3   MM2  S2    mk      1
6   MM3  S3    rd      3
8   MM4  S4   uyi      3
9   MM5  S5     w      1
11  MM7  S7     t      3

That is get all the rows with highest odd 'count' for each ['Sp', 'Mt'] combination. If there is a group with only even 'count' values, discard it.

Solution

In [1]:
import pandas as pd
df = pd.DataFrame({'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM3', 'MM3',
                          'MM4', 'MM4', 'MM5', 'MM6', 'MM7'],
                   'Mt': ['S1', 'S1', 'S1', 'S2', 'S2', 'S3', 'S3',
                          'S4', 'S4', 'S5', 'S6', 'S7'],
                   'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd',
                             'cb', 'uyi', 'w', 'ea', 't'],
                   'count': [1, 2, 3, 1, 2, 2, 3, 1, 3, 1, 2, 3]})
df
Out[1]:
     Sp  Mt Value  count
0   MM1  S1     a      1
1   MM1  S1     n      2
2   MM1  S1    cb      3
3   MM2  S2    mk      1
4   MM2  S2    bg      2
5   MM3  S3   dgd      2
6   MM3  S3    rd      3
7   MM4  S4    cb      1
8   MM4  S4   uyi      3
9   MM5  S5     w      1
10  MM6  S6    ea      2
11  MM7  S7     t      3

In [2]:
def max_odd(s):
    value = s.loc[s % 2 == 1].max()
    return value


In [3]:
idx = df.groupby(['Sp', 'Mt'])['count'].transform(max_odd) == df['count']
df[idx]
Out[3]:
     Sp  Mt Value  count
2   MM1  S1    cb      3
3   MM2  S2    mk      1
6   MM3  S3    rd      3
8   MM4  S4   uyi      3
9   MM5  S5     w      1
11  MM7  S7     t      3

Breakdown of how it works:

In [4]:
df.groupby(['Sp', 'Mt'])['count'].transform(max_odd)
Out[4]:
0     3.0
1     3.0
2     3.0
3     1.0
4     1.0
5     3.0
6     3.0
7     3.0
8     3.0
9     1.0
10    NaN
11    3.0
Name: count, dtype: float64

In [5]:
idx = df.groupby(['Sp', 'Mt'])['count'].transform(max_odd) == df['count']
idx
Out[5]:
0     False
1     False
2      True
3      True
4     False
5     False
6      True
7     False
8      True
9      True
10    False
11     True
Name: count, dtype: bool

level

If a dataframe has multiple indices but you need to groupby on only of them, use level. So, level=0 groups it on the first index, level=1 on the second index, level=-1 on the last index etc.,

tags | pandas groupby level=0

$ ipython
Python 3.6.12 |Anaconda, Inc.| (default, Sep  9 2020, 00:29:25) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
import pandas as pd
df = pd.DataFrame([{'year':2000,'team':'A','game':4,'goals':2},
                   {'year':2000,'team':'B','game':5,'goals':4},
                   {'year':2001,'team':'B','game':6,'goals':3}
                   ]).set_index(['year', 'team'])
df
Out[1]:
           game  goals
year team
2000 A        4      2
     B        5      4
2001 B        6      3

In [2]:
df.groupby(level=0).size()
Out[2]:
year
2000    2
2001    1
dtype: int64

In [3]:
df.groupby(level=1).size()
Out[3]:
team
A    1
B    2
dtype: int64

Ref: https://stackoverflow.com/questions/49859182/understanding-level-0-and-group-keys

filter elements from groups that dont satisfy a criterion

tags | pandas groupby filter groups

In [2]:
import pandas as pd
df = pd.DataFrame({
    'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B' : [1, 2, 3, 4, 5, 6],
    'C' : [2.0, 5., 8., 1., 2., 9.]})
df
Out[2]:
     A  B    C
0  foo  1  2.0
1  bar  2  5.0
2  foo  3  8.0
3  bar  4  1.0
4  foo  5  2.0
5  bar  6  9.0

In [3]:
grouped = df.groupby('A')

In [4]:
grouped.filter(lambda x: x['B'].mean() > 3.)
Out[4]:
     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0

Ref:- https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html

extract groupby object by key

tags | pandas groupby filter a group

groups.get_group(key_value) if grouping on a single column
groups.get_group(key_value_tuple) if grouping on multiple columns.

In [1]:
import pandas as pd
df = pd.DataFrame({'a': ['foo', 'bar'] * 4 + ['foo'],
                   'b': [1, 1, 1, 0, 0, 0, 0, 2, 2],
                   'c': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df
Out[1]:
     a  b  c
0  foo  1  1
1  bar  1  2
2  foo  1  3
3  bar  0  4
4  foo  0  5
5  bar  0  6
6  foo  0  7
7  bar  2  8
8  foo  2  9

In [2]:
grp1 = df.groupby(['b'])

In [3]:
grp1.get_group(0)
Out[3]:
     a  b  c
3  bar  0  4
4  foo  0  5
5  bar  0  6
6  foo  0  7

In [4]:
grp1.get_group(1)
Out[4]:
     a  b  c
0  foo  1  1
1  bar  1  2
2  foo  1  3

In [5]:
grp2 = df.groupby(['a'])

In [6]:
grp2.get_group('foo')
Out[6]:
     a  b  c
0  foo  1  1
2  foo  1  3
4  foo  0  5
6  foo  0  7
8  foo  2  9

In [7]:
grp2.get_group('bar')
Out[7]:
     a  b  c
1  bar  1  2
3  bar  0  4
5  bar  0  6
7  bar  2  8

In [8]:
grp3 = df.groupby(['a', 'b'])

In [9]:
grp3.get_group(('foo',1))
Out[9]:
     a  b  c
0  foo  1  1
2  foo  1  3

In [10]:
grp3.get_group(('bar',0))
Out[10]:
     a  b  c
3  bar  0  4
5  bar  0  6

groupby slicing

Consider

In [1]: 
import pandas as pd
import numpy as np
rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': rand.randn(6),
                   'C': rand.randint(0, 20, 6)})

In [2]: 
df
Out[2]: 
     A         B   C
0  foo  1.624345   5
1  bar -0.611756  18
2  foo -0.528172  11
3  bar -1.072969  10
4  foo  0.865408  14
5  bar -2.301539  18

Group by on column 'A'

In [3]: 
gb = df.groupby(['A'])

You can use get_group() to get a single group

In [4]: 
gb.get_group('foo')
Out[4]: 
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

You can select different columns using the groupby slicing:

In [5]: 
gb[['A', 'B']].get_group('foo')
Out[5]: 
     A         B
0  foo  1.624345
2  foo -0.528172
4  foo  0.865408

In [6]: 
gb[['C']].get_group('foo')
Out[6]: 
    C
0   5
2  11
4  14

Ref:

https://stackoverflow.com/questions/14734533/how-to-access-subdataframes-of-pandas-groupby-by-key

apply a function on each group

Note:- think of a better function for black_ops. The current function does not really demand a groupby. We can simply do those column operations on the top level dataframe

In [1]:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
    'a': ['foo', 'foo', 'bar', 'bar', 'baz', 'baz', 'qux', 'qux'],
    'b': np.random.randn(8),
    'c': np.random.randn(8),
    'd': np.random.rand(8)})
df
Out[1]:
     a         b         c         d
0  foo  0.496714 -0.469474  0.611853
1  foo -0.138264  0.542560  0.139494
2  bar  0.647689 -0.463418  0.292145
3  bar  1.523030 -0.465730  0.366362
4  baz -0.234153  0.241962  0.456070
5  baz -0.234137 -1.913280  0.785176
6  qux  1.579213 -1.724918  0.199674
7  qux  0.767435 -0.562288  0.514234

In [2]:
def black_ops(df):
    return df['b'] + df['c'] + df['d']


In [3]:
df.groupby('a').apply(black_ops).reset_index(name='blackops').drop('level_1', axis=1)
Out[3]:
     a    blackops
0  bar  0.476415
1  bar  1.423662
2  baz  0.463879
3  baz -1.362241
4  foo  0.639093
5  foo  0.543790
6  qux  0.053969
7  qux  0.719382

tags | reset_index remove level_1 column, apply function to multiple columns and rename result, groupby apply name the result, groupby apply remove level_1

Table of Contents