What I am looking for is a function that works exactly like pandas.DataFrame.drop_duplicates() but lets me keep not only the first occurrence but the first 'x' occurrences (say, 10). Does anything like that exist? Thanks for your help!
1 Answer
IIUC, one way to do this is to combine groupby and head, which selects the first x occurrences within each group. As noted in the docs, head:
Returns first n rows of each group.
Sample code:
x = 10
df.groupby('col').head(x)
Here, col is the column you want to check for duplicates, and x is the number of occurrences you want to keep for each value in col. The rows that are kept retain their original order and index labels.
For instance:
In [81]: df.head()
Out[81]:
   a         b
0  3  0.912355
1  3  2.091888
2  3 -0.422637
3  1 -0.293578
4  2 -0.817454
....
# keep the first 3 instances of each value in column a:
x = 3
df.groupby('a').head(x)
Out[82]:
   a         b
0  3  0.912355
1  3  2.091888
2  3 -0.422637
3  1 -0.293578
4  2 -0.817454
5  1  1.476599
6  1  0.898684
8  2 -0.824963
9  2 -0.290499
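For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same idea. The helper name keep_first_n and the sample data are made up for illustration (they are not part of pandas or of the session above); only groupby, head, and drop_duplicates are real pandas calls.

```python
import pandas as pd


def keep_first_n(df, subset, n):
    """Hypothetical helper: keep at most n rows per distinct value of `subset`.

    groupby(...).head(n) returns the first n rows of each group while
    preserving the original row order and index labels.
    """
    return df.groupby(subset, sort=False).head(n)


# Made-up sample data: the value 1 appears four times in column "a".
df = pd.DataFrame({
    "a": [3, 3, 3, 1, 2, 1, 1, 1, 2, 2],
    "b": list(range(10)),
})

# Keep the first 3 occurrences of each value in "a".
result = keep_first_n(df, "a", 3)
print(result)

# With n=1 this should select the same rows as drop_duplicates on that column.
same_as_drop = keep_first_n(df, "a", 1).equals(df.drop_duplicates(subset="a"))
print(same_as_drop)
```

In this sample, only the fourth row with a == 1 (index 7) is dropped, which mirrors the gap at index 7 in the output above. Passing a list of column names as subset should work the same way, since groupby accepts a list of keys.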
sacuL
Yes, that's exactly what I was looking for. It perfectly solves the problem. Thanks! – Fulvio Feb 19 '19 at 02:40