Deleting duplicate records in SQL Server
Sep. 24th, 2013 10:40 am![[personal profile]](https://www.dreamwidth.org/img/silk/identity/user.png)
Of techy interest only - I'll cut-tag for the rest of you.
A problem I recently encountered on an old table and some old code - it was looking something up and expecting a unique answer, because the record in the lookup table had been accidentally entered twice.
At this point every blog post I've ever met wastes a lot of space on telling you that the table shouldn't have been allowed to get into that state, and telling us about primary and unique keys as if they were a new idea to us. Well, duh. Mind going back in time and telling whoever it was who made the error? Then we have to waste even more time on "maybe the records aren't really duplicates, maybe there's a numerical ID you haven't looked at?" Again, duh. No, the problem I've trying to solve, the interesting one, is where all data in all columns is identical, there is no simple WHERE clause that will pick out one but not the other.
How one handles this varies from one database server to another. Oracle had a rather nice pseudo-column long before anyone else. SQL Server? I found two methods, one using the command syntax to say "just delete the first one you find", another creating a pseudo-column.
This page gives the most detailed explanation I've found so far
Command syntax
Yes, as simple as that. Tell it to only process one record. Do your deletion. Set the ROWCOUNT back to normal (no limit). I've seen this recommended a lot.
However, as the MS documentation says, this won't keep working in up-to-date SQL Server. " Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server." It recommends looking at TOP instead.
A little thing called "row_number()". It's got much wider uses than this
and when you're happy that's returning what you want,
A problem I recently encountered on an old table and some old code - it was looking something up and expecting a unique answer, because the record in the lookup table had been accidentally entered twice.
At this point every blog post I've ever met wastes a lot of space on telling you that the table shouldn't have been allowed to get into that state, and telling us about primary and unique keys as if they were a new idea to us. Well, duh. Mind going back in time and telling whoever it was who made the error? Then we have to waste even more time on "maybe the records aren't really duplicates, maybe there's a numerical ID you haven't looked at?" Again, duh. No, the problem I've trying to solve, the interesting one, is where all data in all columns is identical, there is no simple WHERE clause that will pick out one but not the other.
How one handles this varies from one database server to another. Oracle had a rather nice pseudo-column long before anyone else. SQL Server? I found two methods, one using the command syntax to say "just delete the first one you find", another creating a pseudo-column.
This page gives the most detailed explanation I've found so far
Command syntax
SET ROWCOUNT 1
DELETE FROM mytable WHERE nonuniqueid = 'duplicate row'
SET ROWCOUNT 0
Yes, as simple as that. Tell it to only process one record. Do your deletion. Set the ROWCOUNT back to normal (no limit). I've seen this recommended a lot.
However, as the MS documentation says, this won't keep working in up-to-date SQL Server. " Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server." It recommends looking at TOP instead.
Pseudocolumn
DELETE TOP (1)
FROM mytable
WHERE nonuniqueid = 'duplicate row'
A little thing called "row_number()". It's got much wider uses than this
That gives you a nice sequential row number for the query. The order by is rather more use in other contexts, but this can now be expanded to
select nonuniqueid, datacols, row_number() OVER(partition by nonuniqueid order by nonuniqueid) AS nr
FROM mytable
WHERE nonuniqueid = 'duplicate row'
WITH numbered AS (
select nonuniqueid, datacols, row_number() OVER(partition by nonuniqueid order by nonuniqueid) AS nr
FROM mytable
WHERE nonuniqueid = 'duplicate row'
)
select * from numbered
where nr > 1
and when you're happy that's returning what you want,
WITH numbered AS (
select nonuniqueid, datacols, row_number() OVER(partition by nonuniqueid order by nonuniqueid) AS nr
FROM mytable
WHERE nonuniqueid = 'duplicate row'
)
delete from numbered
where nr > 1