show_artists_multiple_albums []

Show artists with multiple albums

{tags: show duplicates in a particular column but print whole line, show duplicates in a particular column but print other columns, show groups with more than 1 count, uniq on a single column but print duplicate lines
solution uses: do not display index, only read certain columns of a csv file}

Task

Given a csv file such as

$ cat album.csv
Artist,AlbumId,TrackId
A,201,1022
A,201,3472
A,451,9866
B,390,6078
B,390,2634
C,272,3411
C,272,8673
C,698,2543
C,698,5837
D,235,9874
E,312,1089

1) show artists that have multiple albums along with their album ids. That is print

Artist,AlbumId
A,201
A,451
C,272
C,698

2) show artists that have multiple albums along with their album ids and track ids. That is print

Artist,AlbumId,TrackId
A,201,1022
A,201,3472
A,451,9866
C,272,3411
C,272,8673
C,698,2543
C,698,5837

Solution

In [6]:
import pandas as pd
coi = ['Artist', 'AlbumId']
df = pd.read_csv('album.csv', usecols=coi).drop_duplicates()
duplicates = df.groupby(['Artist']).filter(lambda grp: grp['AlbumId'].nunique() > 1).sort_values('Artist')
print(duplicates.to_csv(index=False))
Artist,AlbumId
A,201
A,451
C,272
C,698

To print other columns, change the coi (columns of interest) variable

In [7]:
import pandas as pd
coi = ['Artist', 'AlbumId', 'TrackId']
df = pd.read_csv('album.csv', usecols=coi).drop_duplicates()
duplicates = df.groupby(['Artist']).filter(lambda grp: grp['AlbumId'].nunique() > 1).sort_values('Artist')
print(duplicates.to_csv(index=False))
Artist,AlbumId,TrackId
A,201,1022
A,201,3472
A,451,9866
C,272,3411
C,272,8673
C,698,2543
C,698,5837

How it works

$ ipython
Python 3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)]
IPython 8.20.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
import pandas as pd
df = pd.read_csv('album.csv', usecols=['Artist', 'AlbumId']).drop_duplicates()
df
Out[1]:
   Artist  AlbumId
0       A      201
2       A      451
3       B      390
5       C      272
7       C      698
9       D      235
10      E      312

In [2]:
duplicates = df.groupby(['Artist']).filter(lambda grp: grp['AlbumId'].nunique() > 1).sort_values('Artist')
duplicates
Out[2]:
  Artist  AlbumId
0      A      201
2      A      451
5      C      272
7      C      698

In [3]:
print(duplicates.to_csv(index=False))
Artist,AlbumId
A,201
A,451
C,272
C,698

Ref

https://stackoverflow.com/questions/70236139/pandas-groupby-select-groups-that-have-more-than-one-unique-values-in-a-column - this is where I found the sample data and the initial solution. I just modified it to fit my needs.
https://stackoverflow.com/a/24646906 - shows how to display pandas dataframe without index
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
https://stackoverflow.com/questions/53979178/pandas-groupby-show-only-groups-with-more-than-one-unique-feature-value/ - similar question. My answer is in https://stackoverflow.com/a/78809363/6305733

Table of Contents

Show artists with multiple albums

Task

Solution

How it works

Ref