If you need to delete duplicate records from a table (e.g. in order to copy the data to a table with a unique index or to add a unique index on this table), you can use one of the following methods.
First it’s important to know what kind of duplication we’re talking about:
- There is a unique key on this table (e.g. an identity column). The duplicates are entries where the other columns of the rows are identical but the unique key is different.
- The duplicates are 100% identical rows.
In both cases, you need to delete all rows but one, with the same values for X columns. The difference is just that in the first case you can reference each column with a key (i.e. some kind of row ID) and in the second one you can’t.
In the first case, identifying the row to keep is quite easy: you need to get the minimum (or maximum) value of your key (let’s call it row_id from now on) and delete all rows with the same values for the other columns but a higher (or lower) row_id. Here an example keeping the lowest row_id:
DELETE FROM tablename t1 WHERE t1.row_id > (SELECT MIN(t2.row_id) FROM tablename t2 WHERE t1.xxx=t2.xxx AND t1.yyy=t2.yyy)
xxx and yyy being the columns you want to use to identify duplicates.
Note that I assume that we’re talking about non-null values. If you also want to handle null values you need to replace:
t1.xxx=t2.xxx AND t1.yyy=t2.yyy
(t1.xxx=t2.xxx OR (t1.xxx is null AND t2.xxx is null)) AND (t1.yyy=t2.yyy OR (t1.yyy is null AND t2.yyy is null))
If you do not have something like a row_id, it’s getting a little bit more difficult since you cannot directly reference a row. There are basically two methods to clean up the duplicates:
If you have few duplicate entries, you can use ROWCOUNT to make sure that for each set of duplicate rows you delete all of them except one.
First you need to find out which sets of duplicate entries you have and how many entries you have in each set:
SELECT xxx, yyy, COUNT(*) FROM tablename GROUP BY xxx, yyy HAVING COUNT(*) > 1
This will for example return:
xxx_value_1 yyy_value_1 5
So you know you have 5 rows with these two values for xxx and yyy. So you need to delete 4:
SET ROWCOUNT 4 DELETE FROM tablename WHERE xxx='xxx_value_1' AND yyy='yyy_value_1'
And you can repeat it for all values returned by the SELECT statement. Of course this is kind of a lot of manual work and you only want to do it for a few combinations. If you have thousands or millions of duplicates, this is definitely not a good solution.
In this case, you have to create a new table containing only distinct entries and then…
either copy it back in the original table after truncating it:
SELECT DISTINCT * INTO #mytemptable FROM tablename TRUNCATE TABLE tablename INSERT INTO tablename SELECT * FROM #mytemptable DROP TABLE #mytemptable
or rename the original table and select the distinct entries directly as the original table:
sp_rename 'tablename', 'tablename_old' SELECT DISTINCT * INTO tablename FROM tablename_old DROP TABLE tablename_old
The advantage of this second method is that you save one step. The disadvantage is that you don’t have any index/constraints/triggers/… on tablename anymore. You can also do it the other way around, first a SELECT INTO then drop the original table and then sp_rename on the new table (but you still miss a few things like the indexes).
Of course another possibility is to add an identity column so that you can use the first method of this post. After removing the duplicates, you can then remove the identity column.