Home > SQL Tips > Identify & Delete Duplicate records from a table

Identify & Delete Duplicate records from a table


I see a lot of questions/posts about dealing with duplicate records in many SQL Server forums. Many of these questions are asked in “SQL Server Interviews” and many developers starting their carrier in database programming find it challenging to deal with. Here we will see how we can deal with such records.

Duplicate data is one of the biggest pain points in our IT industry, which various projects¬†have to deal with. Whatever state-of-art technology and best practices followed, the big ERP, CRM, SCM and other inventory based database management projects ends up in having duplicate & redundant data. Duplicate data keeps on increasing by manual entries and automated data loads. Various data leads getting pumped into system’s databases without proper deduping & data cleansing leads to redundant data and thus duplicated record-sets.

Data cleansing requires regular exercise of identifying duplicates, validating and removing them. To minimize these type of scenarios various checks and filters should also be applied before loading new leads into the system.

Lets check this by a simple exercise how we can identify & remove duplicate data from a table:

USE [AdventureWorks]
GO

-- Insert some sample records from Person.Contact table of [AdventureWorks] database:
SELECT TOP 10 ContactID, FirstName, LastName, EmailAddress, Phone
INTO DupContacts
FROM Person.Contact

SELECT * FROM DupContacts


-- Insert some duplicate records from the same list inserted above.
INSERT INTO DupContacts
SELECT TOP 50 PERCENT FirstName, LastName, EmailAddress, Phone
from DupContacts

SELECT * FROM DupContacts

-- Insert some more duplicate records from the same list inserted above.
INSERT INTO DupContacts
SELECT TOP 20 PERCENT FirstName, LastName, EmailAddress, Phone
from DupContacts

SELECT * FROM DupContacts

--// Identify Duplicate records & delete them:-

-- Method #1: by using ROW_NUMBER() function:
;WITH dup as (
SELECT ContactID, FirstName, LastName, EmailAddress, Phone,
ROW_NUMBER() OVER(PARTITION BY FirstName, LastName ORDER BY ContactID) AS NumOfDups
FROM DupContacts)
SELECT * FROM dup
WHERE NumOfDups > 1
ORDER BY ContactID


-- Remove/Delete duplicate records:
;WITH dup as (
SELECT ContactID, FirstName, LastName, EmailAddress, Phone,
ROW_NUMBER() OVER(PARTITION BY FirstName, LastName ORDER BY ContactID) AS NumOfDups
FROM DupContacts)
DELETE FROM dup
WHERE NumOfDups > 1

SELECT * FROM DupContacts


-- Method #2: by using SELF-JOIN:
SELECT DISTINCT a.ContactID, a.FirstName, a.LastName, a.EmailAddress, a.Phone
FROM DupContacts a
JOIN DupContacts b
ON a.FirstName = b.FirstName
AND a.LastName = b.LastName
AND a.ContactID > b.ContactID


-- Remove/Delete duplicate records:
DELETE a
FROM DupContacts a
JOIN DupContacts b
ON a.FirstName = b.FirstName
AND a.LastName = b.LastName
AND a.ContactID > b.ContactID

SELECT * FROM DupContacts


-- Method #3: by using AGGREGATES & Sub-QUERY:
SELECT * FROM DupContacts
WHERE ContactID NOT IN (SELECT MIN(ContactID)
FROM DupContacts
GROUP BY FirstName, LastName)


-- Remove/Delete duplicate records:
DELETE FROM DupContacts
WHERE ContactID NOT IN (SELECT MIN(ContactID)
FROM DupContacts
GROUP BY FirstName, LastName)

SELECT * FROM DupContacts


--// Final Cleanup
DROP TABLE DupContacts
About these ads
  1. Surinder
    October 14, 2011 at 5:49 pm

    Very well explained . Thanks for sharing !!

  2. Ken
    March 14, 2012 at 2:31 pm

    It’s not the ROW_NUMBER, it’s the cte combined with that, that deletes the duplicates and with this method, you don’t need to list the table fields to remove the duplicates and you don’t need a unique field in the table either:

    USE [AdventureWorks]
    SELECT TOP 10 FirstName, LastName, EmailAddress, Phone
    INTO DupContacts
    FROM Person.Contact
    INSERT INTO DupContacts
    SELECT TOP 50 PERCENT FirstName, LastName, EmailAddress, Phone
    from DupContacts
    INSERT INTO DupContacts
    SELECT TOP 20 PERCENT FirstName, LastName, EmailAddress, Phone
    FROM DupContacts
    SELECT * FROM DupContacts
    –dup is the cte that will remove duplicate records
    ;WITH dup AS (
    SELECT ROW_NUMBER() OVER(PARTITION BY FirstName, LastName ORDER BY Phone) AS NumOfDups
    FROM DupContacts)
    DELETE FROM dup
    WHERE NumOfDups > 1
    SELECT * FROM DupContacts
    DROP TABLE DupContacts

    • March 14, 2012 at 2:38 pm

      Hi Ken,

      I didn’t mentioned CTE because CTE is not necessary here. You can also do with derived query or sub-query, but CTEs are more desirable than sub-queries.

      Thanks for your comments :)

  3. June 5, 2012 at 12:44 pm

    It really awesome post with great examples

    thanks,
    bhaskar

    http://csharpektroncmssql.blogspot.com

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 395 other followers

%d bloggers like this: