Given the following two dataframes that represent ranges:
df1 =
start end
0 200 300
1 600 900
2 950 1050
df2 =
start end
0 350 550
1 650 800
2 900 1100
They can be represented as such:
df1 [200 300] [600 900] [950 1050]
df2 [350 550] [650 800] [900 1100]
I'm tasked with identifying four different types of relationships between df1 and df2 ranges:
df2subset ofdf1df2 [650 800]subset ofdf1 [600 900]
df2superset ofdf1df2 [900 1100]superset ofdf1 [950 1050]
df2afterdf1(nearest neighbor, excluding subset/superset)df2 [350 550]afterdf1 [200 300]df2 [900 1100]afterdf1 [600 900]
df2beforedf1(nearest neighbor, excluding subset/superset)df2 [350 550]beforedf1 [600 900]df2 [650 800]beforedf1 [950 1050]
I'm trying to use merge_asof() that I learned from this answer, but it's not working because of the complication added by the superset/subset relationship, e.g.:
# Create "before" condition
df_before = pd.merge_asof(
df2.rename(columns={col:f'before_{col}' for col in df2.columns}).sort_values('before_end'),
df1.assign(before_end=lambda x: x['end']).sort_values('before_end'),
on='before_end',
direction='forward'
).query('end > before_end')
print(df_before)
Output:
before_start before_end start end
0 350 550 600.0 900.0
1 650 800 600.0 900.0
Target output:
before_start before_end start end
0 350 550 600.0 900.0
1 650 800 950.0 1050.0
The problem is that
pd.merge_asof(
df2.rename(columns={col:f'before_{col}' for col in df2.columns}).sort_values('before_end'),
df1.assign(before_end=lambda x: x['end']).sort_values('before_end'),
on='before_end',
direction='forward'
)
finds the closest df1.end after 800 in df2 [650 800], which is df1 [600 900]:
before_start before_end start end
0 350 550 600.0 900.0
1 650 800 600.0 900.0
2 900 1100 NaN NaN
Is it possible to do a merge_asof() to find the nearest value based on a certain condition, such as "find nearest df1.end only if df1.start in that range is larger than 800 (950 in this case)"? With this level of complexity, maybe there is another function better suited to this task?
Notes:
- Ranges in
df1can overlap each other, but are never identical. - Ranges in
df2can overlap each other, but are never identical. df1anddf2have over 200k rows each.df1anddf2have different number of rows.- Relationships are relative to
df1, so only one match is needed for each row indf1, with up to four relationships in each row. Given the data provided above, the final output would look like this after merging back intodf1:
df1 =
start end subset_start subset_end superset_start superset_end before_start before_end after_start after_end
0 200 300 NaN NaN NaN NaN NaN NaN 350.0 550.0
1 600 900 650.0 800.0 NaN NaN 350.0 550.0 900.0 1100.0
2 950 1050 NaN NaN 900.0 1100.0 650.0 800.0 NaN NaN