tags | filter by value
Given
Sp Mt Value count 0 MM1 S1 a 3 1 MM1 S1 n 2 2 MM1 S3 cb 5 3 MM2 S3 mk 8 4 MM2 S4 bg 10 5 MM2 S4 dgd 1 6 MM4 S2 rd 2 7 MM4 S2 cb 7 8 MM4 S2 uyi 7
get all the rows with highest 'count' for each ['Sp', 'Mt'] combination. That is we want
Sp Mt Value count 0 MM1 S1 a 3 2 MM1 S3 cb 5 3 MM2 S3 mk 8 4 MM2 S4 bg 10 7 MM4 S2 cb 7 8 MM4 S2 uyi 7
Solution
In [1]: import pandas as pd df = pd.DataFrame({'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'], 'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'], 'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'], 'count': [3, 2, 5, 8, 10, 1, 2, 7, 7]}) df Out[1]: Sp Mt Value count 0 MM1 S1 a 3 1 MM1 S1 n 2 2 MM1 S3 cb 5 3 MM2 S3 mk 8 4 MM2 S4 bg 10 5 MM2 S4 dgd 1 6 MM4 S2 rd 2 7 MM4 S2 cb 7 8 MM4 S2 uyi 7 In [2]: df.groupby(['Sp', 'Mt'])['count'].transform(max) Out[2]: 0 3 1 3 2 5 3 8 4 10 5 10 6 7 7 7 8 7 Name: count, dtype: int64 In [3]: idx = df.groupby(['Sp', 'Mt'])['count'].transform(max) == df['count'] df[idx] Out[3]: Sp Mt Value count 0 MM1 S1 a 3 2 MM1 S3 cb 5 3 MM2 S3 mk 8 4 MM2 S4 bg 10 7 MM4 S2 cb 7 8 MM4 S2 uyi 7
tags | pandas groupby transform maximum odd number, maxodd
Given
Sp Mt Value count 0 MM1 S1 a 1 1 MM1 S1 n 2 2 MM1 S1 cb 3 3 MM2 S2 mk 1 4 MM2 S2 bg 2 5 MM3 S3 dgd 2 6 MM3 S3 rd 3 7 MM4 S4 cb 1 8 MM4 S4 uyi 3 9 MM5 S5 w 1 10 MM6 S6 ea 2 11 MM7 S7 t 3
We want
Sp Mt Value count 2 MM1 S1 cb 3 3 MM2 S2 mk 1 6 MM3 S3 rd 3 8 MM4 S4 uyi 3 9 MM5 S5 w 1 11 MM7 S7 t 3
That is get all the rows with highest odd 'count' for each ['Sp', 'Mt'] combination. If there is a group with only even 'count' values, discard it.
Solution
In [1]: import pandas as pd df = pd.DataFrame({'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM3', 'MM3', 'MM4', 'MM4', 'MM5', 'MM6', 'MM7'], 'Mt': ['S1', 'S1', 'S1', 'S2', 'S2', 'S3', 'S3', 'S4', 'S4', 'S5', 'S6', 'S7'], 'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi', 'w', 'ea', 't'], 'count': [1, 2, 3, 1, 2, 2, 3, 1, 3, 1, 2, 3]}) df Out[1]: Sp Mt Value count 0 MM1 S1 a 1 1 MM1 S1 n 2 2 MM1 S1 cb 3 3 MM2 S2 mk 1 4 MM2 S2 bg 2 5 MM3 S3 dgd 2 6 MM3 S3 rd 3 7 MM4 S4 cb 1 8 MM4 S4 uyi 3 9 MM5 S5 w 1 10 MM6 S6 ea 2 11 MM7 S7 t 3 In [2]: def max_odd(s): value = s.loc[s % 2 == 1].max() return value In [3]: idx = df.groupby(['Sp', 'Mt'])['count'].transform(max_odd) == df['count'] df[idx] Out[3]: Sp Mt Value count 2 MM1 S1 cb 3 3 MM2 S2 mk 1 6 MM3 S3 rd 3 8 MM4 S4 uyi 3 9 MM5 S5 w 1 11 MM7 S7 t 3
Breakdown of how it works:
In [4]: df.groupby(['Sp', 'Mt'])['count'].transform(max_odd) Out[4]: 0 3.0 1 3.0 2 3.0 3 1.0 4 1.0 5 3.0 6 3.0 7 3.0 8 3.0 9 1.0 10 NaN 11 3.0 Name: count, dtype: float64 In [5]: idx = df.groupby(['Sp', 'Mt'])['count'].transform(max_odd) == df['count'] idx Out[5]: 0 False 1 False 2 True 3 True 4 False 5 False 6 True 7 False 8 True 9 True 10 False 11 True Name: count, dtype: bool
If a dataframe has multiple indices but you need to groupby on only of them, use level. So, level=0 groups it on the first index, level=1 on the second index, level=-1 on the last index etc.,
tags | pandas groupby level=0
$ ipython Python 3.6.12 |Anaconda, Inc.| (default, Sep 9 2020, 00:29:25) [MSC v.1916 64 bit (AMD64)] Type 'copyright', 'credits' or 'license' for more information IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help. In [1]: import pandas as pd df = pd.DataFrame([{'year':2000,'team':'A','game':4,'goals':2}, {'year':2000,'team':'B','game':5,'goals':4}, {'year':2001,'team':'B','game':6,'goals':3} ]).set_index(['year', 'team']) df Out[1]: game goals year team 2000 A 4 2 B 5 4 2001 B 6 3 In [2]: df.groupby(level=0).size() Out[2]: year 2000 2 2001 1 dtype: int64 In [3]: df.groupby(level=1).size() Out[3]: team A 1 B 2 dtype: int64
Ref: https://stackoverflow.com/questions/49859182/understanding-level-0-and-group-keys
tags | pandas groupby filter groups
In [2]: import pandas as pd df = pd.DataFrame({ 'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'], 'B' : [1, 2, 3, 4, 5, 6], 'C' : [2.0, 5., 8., 1., 2., 9.]}) df Out[2]: A B C 0 foo 1 2.0 1 bar 2 5.0 2 foo 3 8.0 3 bar 4 1.0 4 foo 5 2.0 5 bar 6 9.0 In [3]: grouped = df.groupby('A') In [4]: grouped.filter(lambda x: x['B'].mean() > 3.) Out[4]: A B C 1 bar 2 5.0 3 bar 4 1.0 5 bar 6 9.0
Ref:- https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html
tags | pandas groupby filter a group
In [1]: import pandas as pd df = pd.DataFrame({'a': ['foo', 'bar'] * 4 + ['foo'], 'b': [1, 1, 1, 0, 0, 0, 0, 2, 2], 'c': [1, 2, 3, 4, 5, 6, 7, 8, 9]}) df Out[1]: a b c 0 foo 1 1 1 bar 1 2 2 foo 1 3 3 bar 0 4 4 foo 0 5 5 bar 0 6 6 foo 0 7 7 bar 2 8 8 foo 2 9 In [2]: grp1 = df.groupby(['b']) In [3]: grp1.get_group(0) Out[3]: a b c 3 bar 0 4 4 foo 0 5 5 bar 0 6 6 foo 0 7 In [4]: grp1.get_group(1) Out[4]: a b c 0 foo 1 1 1 bar 1 2 2 foo 1 3 In [5]: grp2 = df.groupby(['a']) In [6]: grp2.get_group('foo') Out[6]: a b c 0 foo 1 1 2 foo 1 3 4 foo 0 5 6 foo 0 7 8 foo 2 9 In [7]: grp2.get_group('bar') Out[7]: a b c 1 bar 1 2 3 bar 0 4 5 bar 0 6 7 bar 2 8 In [8]: grp3 = df.groupby(['a', 'b']) In [9]: grp3.get_group(('foo',1)) Out[9]: a b c 0 foo 1 1 2 foo 1 3 In [10]: grp3.get_group(('bar',0)) Out[10]: a b c 3 bar 0 4 5 bar 0 6
Consider
In [1]: import pandas as pd import numpy as np rand = np.random.RandomState(1) df = pd.DataFrame({'A': ['foo', 'bar'] * 3, 'B': rand.randn(6), 'C': rand.randint(0, 20, 6)}) In [2]: df Out[2]: A B C 0 foo 1.624345 5 1 bar -0.611756 18 2 foo -0.528172 11 3 bar -1.072969 10 4 foo 0.865408 14 5 bar -2.301539 18
Group by on column 'A'
In [3]: gb = df.groupby(['A'])
You can use get_group() to get a single group
In [4]: gb.get_group('foo') Out[4]: A B C 0 foo 1.624345 5 2 foo -0.528172 11 4 foo 0.865408 14
You can select different columns using the groupby slicing:
In [5]: gb[['A', 'B']].get_group('foo') Out[5]: A B 0 foo 1.624345 2 foo -0.528172 4 foo 0.865408 In [6]: gb[['C']].get_group('foo') Out[6]: C 0 5 2 11 4 14
Ref:
Note:- think of a better function for black_ops. The current function does not really demand a groupby. We can simply do those column operations on the top level dataframe
In [1]: import pandas as pd import numpy as np np.random.seed(42) df = pd.DataFrame({ 'a': ['foo', 'foo', 'bar', 'bar', 'baz', 'baz', 'qux', 'qux'], 'b': np.random.randn(8), 'c': np.random.randn(8), 'd': np.random.rand(8)}) df Out[1]: a b c d 0 foo 0.496714 -0.469474 0.611853 1 foo -0.138264 0.542560 0.139494 2 bar 0.647689 -0.463418 0.292145 3 bar 1.523030 -0.465730 0.366362 4 baz -0.234153 0.241962 0.456070 5 baz -0.234137 -1.913280 0.785176 6 qux 1.579213 -1.724918 0.199674 7 qux 0.767435 -0.562288 0.514234 In [2]: def black_ops(df): return df['b'] + df['c'] + df['d'] In [3]: df.groupby('a').apply(black_ops).reset_index(name='blackops').drop('level_1', axis=1) Out[3]: a blackops 0 bar 0.476415 1 bar 1.423662 2 baz 0.463879 3 baz -1.362241 4 foo 0.639093 5 foo 0.543790 6 qux 0.053969 7 qux 0.719382
tags | reset_index remove level_1 column, apply function to multiple columns and rename result, groupby apply name the result, groupby apply remove level_1