I have 2 dataframes left_df and right-df, which both have 20 columns with identical names and dtypes. right_df also has 2 additional columns with unique values on every row.
I want to update rows in right_df with ALL the values from left_df where the values in ALL columns in a list of a subset of columns, matching_cols = ['col_1', 'col_3', 'col_10', 'col_12'] are identical in both dataframes. The values in the additional 2 unique columns in right_df should be preserved.
Ideally, I want to also drop those rows from left_df in the same command, or as the next command if this isn't possible. I need to do this process more than once, matching on several different lists of columns, with the left_df dropping matched rows each loop, until eventually no further matches are found.
An acceptable alternative would be any method to create a new dataframe new_df containing the set of rows where all specified columns in the list matching_cols match, with values from left_df in the first 20 columns and values from right_df in the remaining 2 columns.
I don't care about preserving the indices at any point in either dataframe, I am importing them to SQL after this and will reindex them on one of the 2 right_df values at the end.
New to Pandas and can't determine what method to use, have tried variations of .merge, .join, .update, etc, but can't seem to specify to only update when my desired column values all match, or how to drop those rows/export them to a new df.
Update: Added pseudocode below:
For a left_df as:
left_df = pd.DataFrame({
'col_0': ['0', '1', '2', '3', '4', '5'],
'col_1': ['A', 'B', 'C', 'D', 'E', 'F'],
'col_2': ['new', 'new', 'new', 'new', 'new', 'new'],
'col_3': ['new', 'new', 'new', 'new', 'new', 'new'],
'col_4': ['new', 'new', 'new', 'new', 'new', 'new'],
'col_5': ['new', 'new', 'new', 'new', 'new', 'new'],
'col_6': ['new', 'new', 'new', 'new', 'new', 'new'],
'col_7': ['new', 'new', 'new', 'new', 'new', 'new'],
})
and a right_df as:
right_df = pd.DataFrame({
'col_0': ['0', '1', '2', '3', '4', '5'],
'col_1': ['A', 'B', 'C', 'X', 'E', 'F'],
'col_2': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_3': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_4': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_5': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_6': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_7': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_8': ['uid_0', 'uid_1', 'uid_2', 'uid_3', 'uid_4', 'uid_5'],
'col_9': ['uid_a', 'uid_b', 'uid_c', 'uid_d', 'uid_e', 'uid_f'],
})
Where matching_cols = ['col_0', 'col_1']
I want to get the following result either as a new dataframe or in-place on right_df (note that col_1 doesn't match on row 3, so is not changed)
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0 A new new new new new new uid_0 uid_a
1 1 B new new new new new new uid_1 uid_b
2 2 C new new new new new new uid_2 uid_c
3 3 X old old old old old old uid_3 uid_d
4 4 E new new new new new new uid_4 uid_e
5 5 F new new new new new new uid_5 uid_f