I have a dataframe with duplicate entries coming from two sources, all the values should be unique, but one column was not formatted the same, hence I should remove duplicate with different names in one column, but only if the names are within a list.
Technically, I would like to remove a row in a pandas dataframe if there exist another row with the same A and B values, but only if this row’s Z value is 'bar' and the other’s 'Z' is 'foo'.
An example might be clearer:
I have the given dataframe df
A B Z
'a' 'a' 'foo'
'a' 'a' 'bar'
'b' 'a' 'bar'
'c' 'c' 'foo'
'd' 'd' 'blb'
And I would like to get
A B Z
'a' 'a' 'foo'
'b' 'a' 'bar'
'c' 'c' 'foo'
'd' 'd' 'blb'
Note that:
- The rows with other values than
'foo'and'bar'in theZcolumn should not be touched. - It’s not important if
'foo'and'bar'stay the same because they will get changed to the same value afterwards. - It would be great to generalize the duo
'foo'and'bar'as a list.
Attempts so far: Here is my best guess, it doesn’t work though… I don’t understand much what groupby returns. Also I’m sure there is some magical pandas one-liner I just can’t find.
new_df = []
for row in df.groupby('A'):
if rowloc['Z'].isin('foo'):
if not row['Z'].isin('bar'):
new_df.append(row)
Thanks !