I have two DB2 tables, expenses and environments, which I need to join.
expenses
ID Expense Job
1 10 AAAAAA
2 5 BBBBBB
3 3 AAAAAAC
4 7 AAAAAA01
environments
Job Environment
AAAAAA01 PROD
BBBBBB INT
AAAAAAC PROD
I now want to join the environments table to the expenses table so that I know in which environments the expenses occurred. The problem with the tables is that the Job column for both tables is slightly different. Hence, I can't just join ON ENVIRONMENTS.JOB = EXPENSES.JOB The values are between 6 and 8 characters long, and the values in the expenses table are often shorter than in the environments table.
The logic for comparing the values of the Job columns would be something like this:
- Check for an exact match of the
Jobif it is 8 characters long. - If no match is found, remove one character and try again.
- Repeat until the
Jobis six characters long.
In each step, if multiple values are found, select the first match.
I tried the following query
SELECT E.expense, ENV.environment
FROM EXPENSES E
LEFT OUTER JOIN ENVIRONMENTS ENV
ON LEFT(ENV.JOB, 6) = LEFT(E.JOB, 6)
GROUP BY E.expense, ENV.environment
The problem is that I receive duplicate values in the result. Hence, I get more expenses when grouping them by environment than I originally had.
Actual output:
ID Expense Job Job Environment
1 10 AAAAAA AAAAAA01 PROD
1 10 AAAAAA AAAAAAC PROD
2 5 BBBBBB BBBBBB INT
3 3 AAAAAAC AAAAAA01 PROD
3 3 AAAAAAC AAAAAAC PROD
4 7 AAAAAA01 AAAAAA01 PROD
4 7 AAAAAA01 AAAAAAC PROD
Desired output:
ID Expense Job Job Environment
1 10 AAAAAA AAAAAA01 PROD
2 5 BBBBBB BBBBBB INT
3 3 AAAAAAC AAAAAA01 PROD
4 7 AAAAAA01 AAAAAA01 PROD
The value in the second Job column could be either AAAAAA01 or AAAAAAC. This is completely irrelevant.
This is due to the `ON` condition. Let's consider the following example. The row with `ID=1` from the `expenses` table matches two rows in the `environments` table. Hence, the row with `ID=1` occurs twice in the result. Importantly, the matching rows from the `environments` table all have the same environment. Consequently, it is negligible which row I select.
How can I avoid duplicate values in the given scenario?
I have already tried the following solutions: