DB2 | Left outer join | Avoid duplicates

Question

I have two DB2 tables, expenses and environments, which I need to join.

expenses

ID    Expense  Job
1     10       AAAAAA
2     5        BBBBBB 
3     3        AAAAAAC
4     7        AAAAAA01

environments

Job        Environment
AAAAAA01   PROD
BBBBBB     INT
AAAAAAC    PROD

I now want to join the environments table to the expenses table so that I know in which environments the expenses occurred. The problem with the tables is that the Job column for both tables is slightly different. Hence, I can't just join ON ENVIRONMENTS.JOB = EXPENSES.JOB The values are between 6 and 8 characters long, and the values in the expenses table are often shorter than in the environments table.
The logic for comparing the values of the Job columns would be something like this:

Check for an exact match of the Job if it is 8 characters long.
If no match is found, remove one character and try again.
Repeat until the Job is six characters long.

In each step, if multiple values are found, select the first match.

I tried the following query

SELECT E.expense, ENV.environment
FROM EXPENSES E
LEFT OUTER JOIN ENVIRONMENTS ENV
ON LEFT(ENV.JOB, 6) = LEFT(E.JOB, 6)
GROUP BY E.expense, ENV.environment

The problem is that I receive duplicate values in the result. Hence, I get more expenses when grouping them by environment than I originally had.

Actual output:

ID    Expense  Job        Job         Environment
1     10       AAAAAA     AAAAAA01    PROD
1     10       AAAAAA     AAAAAAC     PROD
2     5        BBBBBB     BBBBBB      INT
3     3        AAAAAAC    AAAAAA01    PROD
3     3        AAAAAAC    AAAAAAC     PROD
4     7        AAAAAA01   AAAAAA01    PROD
4     7        AAAAAA01   AAAAAAC     PROD

Desired output:

ID    Expense  Job        Job         Environment
1     10       AAAAAA     AAAAAA01    PROD
2     5        BBBBBB     BBBBBB      INT
3     3        AAAAAAC    AAAAAA01    PROD
4     7        AAAAAA01   AAAAAA01    PROD

The value in the second Job column could be either AAAAAA01 or AAAAAAC. This is completely irrelevant.

This is due to the `ON` condition. Let's consider the following example. The row with `ID=1` from the `expenses` table matches two rows in the `environments` table. Hence, the row with `ID=1` occurs twice in the result. Importantly, the matching rows from the `environments` table all have the same environment. Consequently, it is negligible which row I select.

How can I avoid duplicate values in the given scenario?

I have already tried the following solutions:

Show us the current result and the wanted result. Also, `SELECT DISTINCT` is great to get distinct rows. — jarlh, Sep 01 '22 at 11:53
@jarlh I updated the question with the actual result and the desired result. Thanks for the input. — Xantipus, Sep 01 '22 at 12:01
Keep GROUP BY. Use MIN() or MAX() to pick the second Job column value. — jarlh, Sep 01 '22 at 12:03
[mre] PS [Fetch the row which has the Max value for a column](https://stackoverflow.com/q/121387/3404097) PS It is not helpful to list links you saw, you need to say why the post contents were relevant but didn't help. Also 3 questions is very few. — philipxy, Sep 05 '22 at 11:32
Re (re)search success: Please before considering posting: Pin down code issues via [mre]. Read the manual/reference & google any error message & many clear, concise & precise phrasings of your question/problem/goal, with & without your particular names/strings/numbers, 'site:stackoverflow.com' & tags; read many answers. Reflect research in posts. SO/SE search is poor & literal & unusual, read the help. Google re googling/searching, including Q&A at [meta] & [meta.se]. [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/3404097) [ask] [Help] — philipxy, Sep 05 '22 at 11:33
@philipxy I'm really sorry, but I don't understand your point and what exactly your problem is with my post. I spent a lot of time trying to solve the issue independently and searching for similar questions. Obviously, my problem description has been concise enough to give enough context for others to answer my question. In my opinion, there are posts more suitable for your feedback than this question. Please read the question once more and think about your feedback. — Xantipus, Sep 05 '22 at 11:50

Mark Barinstein · Accepted Answer · 2022-09-05T11:35:06.153

Try this:

WITH 
  expenses (ID, Expense, Job) AS
  (
    VALUES
      (1, 10, 'AAAAAA')
    , (2,  5, 'BBBBBB')
    , (3,  3, 'AAAAAAC')
    , (4,  7, 'AAAAAA01')

    , (5,  0, 'XXX')
  )
, environments (Job, Environment) AS 
  (
    VALUES
      ('AAAAAA01', 'PROD')
    , ('BBBBBB'  , 'INT')
    , ('AAAAAAC' , 'PROD')
  )

-- The 1-st solution
SELECT
  ID, Expense, Job_E
, Job_ENV, Environment
FROM
(
  SELECT 
    E.ID, E.Expense, E.Job AS Job_E
  , ENV.Job AS Job_ENV, ENV.Environment
  , ROW_NUMBER () OVER (PARTITION BY E.Job ORDER BY LENGTH (ENV.Job)) AS RN_
  FROM expenses E
  LEFT JOIN environments ENV ON ENV.JOB LIKE E.JOB || '%'
) T
WHERE RN_ = 1
ORDER BY ID

/*
  -- Alternate solution
  SELECT
  E.ID, E.Expense, E.Job AS Job_E
, ENV.Job AS Job_ENV, ENV.Environment
FROM expenses E
LEFT JOIN TABLE
(
  SELECT ENV.Job, ENV.Environment
  FROM environments ENV 
  WHERE ENV.JOB LIKE E.JOB || '%'
  ORDER BY LENGTH (ENV.Job)
  FETCH FIRST 1 ROW ONLY
) ENV ON 1 = 1
ORDER BY E.ID
*/

ID	EXPENSE	JOB_E	JOB_ENV	ENVIRONMENT
1	10	AAAAAA	AAAAAAC	PROD
2	5	BBBBBB	BBBBBB	INT
3	3	AAAAAAC	AAAAAAC	PROD
4	7	AAAAAA01	AAAAAA01	PROD
5	0	XXX

Great! And what if we assume that there are expenses that cannot be assigned to a particular environment? — Xantipus, Sep 05 '22 at 10:40

score 0 · Answer 2 · answered Sep 06 '22 at 07:52

In addition to the accepted answer, which works perfectly, I provide what I ended up with.

SELECT E.expense,
       ENV.environment
FROM   expenses E
       LEFT JOIN (SELECT DISTINCT LEFT(job, 6) AS job,
                                  environment
                  FROM   environments) ENV
              ON LEFT(ENV.job, 6) = LEFT(E.job, 6)

This query can then also be used to group the expenses by date (e.g., year and month) and the environment. The trick is to select only the distinct values from the environments table.

DB2 | Left outer join | Avoid duplicates

2 Answers2