How to remove rest of the rows with the same ID starting from the first duplicate?

Question

I have the following structure for the table DataTable: every column is of the datatype int, RowID is an identity column and the primary key. LinkID is a foreign key and links to rows of an other table.

RowID   LinkID   Order  Data    DataSpecifier
1       120      1      1       1
2       120      2      1       3
3       120      3      1       10
4       120      4      1       13
5       120      5      1       10
6       120      6      1       13
7       371      1      6       2
8       371      2      3       5
9       371      3      8       1
10      371      4      10      1
11      371      5      7       2
12      371      6      3       3
13      371      7      7       2
14      371      8      17      4
.................................
.................................

I'm trying to do a query which alters every LinkID batch in the following way:

Take every row with same LinkID (e.g. the first batch is the first 6 rows here)
Order them by the Order column
Look at Data and DataSpecifier columns as one compare unit (They can be thought as one column, called dataunit):
- Keep as many rows from Order 1 onwards, until a duplicate dataunit comes by
- Delete every row from that first duplicate onwards for that LinkID

So for the LinkID 120:

Sort the batch (already sorted here, but should still do it)
Start looking from the top (So Order=1 here), go as long as you don't see a duplicate.
Stop at the first duplicate Order = 5 (dataunit 1 10 was already seen).
Delete everything which has the LinkID=120 AND Order>=5

After similar process for LinkID 371 (and every other LinkID in the table), the processed table will look like this:

RowID   LinkID   Order  Data    DataSpecifier
1       120      1      1       1
2       120      2      1       3
3       120      3      1       10
4       120      4      1       13
7       371      1      6       2
8       371      2      3       5
9       371      3      8       1
10      371      4      10      1
11      371      5      7       2
12      371      6      3       3
.................................
.................................

I've done quite a lot of SQL queries, but never something this complicated. I know I need to use a query which is something like this:

DELETE FROM DataTable  
WHERE RowID IN (SELECT RowID
                FROM DataTable
                WHERE -- ?
                GROUP BY LinkID
                HAVING COUNT(*) > 1 -- ?
                ORDER BY [Order]);

But I just can't seem to wrap my head around this and get the query right. I would preferably do this in pure SQL, with one executable (and reusable) query.

Possible duplicate of [How to delete duplicate rows in sql server?](https://stackoverflow.com/questions/18390574/how-to-delete-duplicate-rows-in-sql-server/38938704)? — Thom A, May 08 '19 at 13:27
So change `ROW_NUMBER` to `COUNT` @ruohola. Then, instead of deleting the duplicates, you delete everything where the count is more than 1. The theory is identical. — Thom A, May 08 '19 at 13:31
`ROW_NUMBER() OVER (PARTITION BY...` -> `COUNT({non NULL column}) OVER (PARTITION BY...`. — Thom A, May 08 '19 at 13:32

Tim Biegeleisen · Answer 1 · 2019-05-08T13:54:32.013

We can try using a CTE here to make things easier:

WITH cte AS (
    SELECT *,
        COUNT(*) OVER (PARTITION BY LinkID, Data, DataSpecifier ORDER BY [Order]) - 1 cnt
    FROM DataTable
),
cte2 AS (
    SELECT *,
        SUM(cnt) OVER (PARTITION BY LinkID ORDER BY [Order]) num
    FROM cte
)

DELETE
FROM cte
WHERE num > 0;

The logic here is to use COUNT as an analytic function to identify the duplicate records. We use a partition of LinkID along with Data and DataSpecifier. Any record with an Order value greater than or equal to the first record with a non zero count is then targeted for deletion.

Here is a demo showing that the logic of the CTE is correct:

Demo

Daniel Gimenez · Accepted Answer · 2019-05-08T14:27:23.927

You can use the ROW_NUMBER() window function to identify any rows that come after the original. After that you can delete and rows with a matching LinkID and a greater than or equal to any encountered Order with a row number greater than one.

(I originally used a second CTE to get the MIN order, but I realized that it wasn't necessary as long as the join to order was greater than equal to any order where there was a second instance of the DataUnitId. By removing the MIN the query plan became quite simple and efficient.)

WITH DataUnitInstances AS (
  SELECT *
    , ROW_NUMBER() OVER
      (PARTITION BY LinkID, [Data], [DataSpecifier] ORDER BY [Order]) DataUnitInstanceId
  FROM DataTable
)
DELETE FROM DataTable
FROM DataTable dt
INNER JOIN DataUnitInstances dup ON dup.LinkID = dt.LinkID 
  AND dup.[Order] <= dt.[Order]
  AND dup.DataUnitInstanceId > 1

Here is the output from your sample data which matches your desired result:

+-------+--------+-------+------+---------------+
| RowID | LinkID | Order | Data | DataSpecifier |
+-------+--------+-------+------+---------------+
| 1     | 120    | 1     | 1    | 1             |
| 2     | 120    | 2     | 1    | 3             |
| 3     | 120    | 3     | 1    | 10            |
| 4     | 120    | 4     | 1    | 13            |
| 7     | 371    | 1     | 6    | 2             |
| 8     | 371    | 2     | 3    | 5             |
| 9     | 371    | 3     | 8    | 1             |
| 10    | 371    | 4     | 10   | 1             |
| 11    | 371    | 5     | 7    | 2             |
| 12    | 371    | 6     | 3    | 3             |
+-------+--------+-------+------+---------------+

Thank you for the nice answer :) This was extremely easy to modify working with my actual database table and column names and it works perfectly! — ruohola, May 09 '19 at 15:23

EzLo · Answer 3 · 2019-05-08T13:39:19.297

This solution uses an APPLY to find the minimum order by each Link.

Set up:

IF OBJECT_ID('tempdb..#YourData') IS NOT NULL
    DROP TABLE #YourData

CREATE TABLE #YourData (
    RowID INT,
    LinkID INT,
    [Order] INT,
    Data INT,
    DataSpecifier INT)

INSERT INTO #YourData (
    RowID,
    LinkID,
    [Order],
    Data,
    DataSpecifier)
VALUES
    ('1', ' 120', '1', '1', ' 1'), 
    ('2', ' 120', '2', '1', ' 3'), 
    ('3', ' 120', '3', '1', ' 10'), 
    ('4', ' 120', '4', '1', ' 13'), 
    ('5', ' 120', '5', '1', ' 10'), 
    ('6', ' 120', '6', '1', ' 13'), 
    ('7', ' 371', '1', '6', ' 2'), 
    ('8', ' 371', '2', '3', ' 5'), 
    ('9', ' 371', '3', '8', ' 1'), 
    ('10', '371', '4', '10', '1'), 
    ('11', '371', '5', '7', ' 2'), 
    ('12', '371', '6', '3', ' 3'), 
    ('13', '371', '7', '7', ' 2'), 
    ('14', '371', '8', '17', '4')

Solution:

;WITH MinOrderToDeleteByLinkID AS
(
    SELECT
        T.LinkID,
        MinOrder = MIN(C.[Order])
    FROM
        #YourData AS T
        OUTER APPLY (
            SELECT TOP 1
                C.*
            FROM
                #YourData AS C
            WHERE
                C.LinkID = T.LinkID AND
                C.Data = T.Data AND
                C.DataSpecifier = T.DataSpecifier AND
                C.[Order] > T.[Order]
            ORDER BY
                T.[Order]) AS C
    GROUP BY
        T.LinkID
)
DELETE Y FROM
-- SELECT Y.* FROM
    #YourData AS Y
    INNER JOIN MinOrderToDeleteByLinkID AS M ON
        Y.LinkID = M.LinkID AND
        Y.[Order] >= M.MinOrder

The rows to be deleted from this are the following:

RowID   LinkID  Order   Data    DataSpecifier
5       120     5       1       10
6       120     6       1       13
13      371     7       7       2
14      371     8       17      4

... which correspond to the point where the tuple Data-DataSpecified start to repeat for a particular LinkID.

How to remove rest of the rows with the same ID starting from the first duplicate?

3 Answers3

Demo

Linked