SELF JOIN | SQL with Manoj

SQL Trivia – Identify & Delete Duplicate records from a table

October 14, 2011 6 comments

I see a lot of questions/posts about dealing with duplicate records in many SQL Server forums. Many of these questions are asked in SQL Server Interviews and many developers starting their carrier in database programming find it challenging to deal with. Here we will see how we can deal with such records.

Duplicate data is one of the biggest pain points in our IT industry, which various projects have to deal with. Whatever state-of-art technology and best practices followed, the big ERP, CRM, SCM and other inventory based database management projects ends up in having duplicate & redundant data. Duplicate data keeps on increasing by manual entries and automated data loads. Various data leads getting pumped into system’s databases without proper deduping & data cleansing leads to redundant data and thus duplicated record-sets.

Data cleansing requires regular exercise of identifying duplicates, validating and removing them. To minimize these type of scenarios various checks and filters should also be applied before loading new leads into the system.

–> Lets check this by a simple exercise how we can identify & remove duplicate data from a table:

1. Insert some sample records from Person.Contact table of [AdventureWorks] database:

USE [AdventureWorks]
GO

SELECT TOP 10 ContactID, FirstName, LastName, EmailAddress, Phone
INTO DupContacts
FROM Person.Contact

SELECT * FROM DupContacts

2. Insert some duplicate records from the same list inserted above:

INSERT INTO DupContacts
SELECT TOP 50 PERCENT FirstName, LastName, EmailAddress, Phone
from DupContacts

SELECT * FROM DupContacts

3. Insert some more duplicate records from the same list inserted above.

INSERT INTO DupContacts
SELECT TOP 20 PERCENT FirstName, LastName, EmailAddress, Phone
from DupContacts

SELECT * FROM DupContacts

–> Identify Duplicate records & delete them

Method #1: by using ROW_NUMBER() function:

;WITH dup as (
SELECT ContactID, FirstName, LastName, EmailAddress, Phone,
ROW_NUMBER() OVER(PARTITION BY FirstName, LastName ORDER BY ContactID) AS NumOfDups
FROM DupContacts)
SELECT * FROM dup
WHERE NumOfDups > 1
ORDER BY ContactID

-- Remove/Delete duplicate records:
;WITH dup as (
SELECT ContactID, FirstName, LastName, EmailAddress, Phone,
ROW_NUMBER() OVER(PARTITION BY FirstName, LastName ORDER BY ContactID) AS NumOfDups
FROM DupContacts)
DELETE FROM dup
WHERE NumOfDups > 1

SELECT * FROM DupContacts

Method #2: by using SELF-JOIN:

SELECT DISTINCT a.ContactID, a.FirstName, a.LastName, a.EmailAddress, a.Phone
FROM DupContacts a
JOIN DupContacts b
ON a.FirstName = b.FirstName
AND a.LastName = b.LastName
AND a.ContactID > b.ContactID

-- Remove/Delete duplicate records:
DELETE a
FROM DupContacts a
JOIN DupContacts b
ON a.FirstName = b.FirstName
AND a.LastName = b.LastName
AND a.ContactID > b.ContactID

SELECT * FROM DupContacts

Method #3: by using AGGREGATES & Sub-QUERY:

SELECT * FROM DupContacts
WHERE ContactID NOT IN (SELECT MIN(ContactID)
FROM DupContacts
GROUP BY FirstName, LastName)

-- Remove/Delete duplicate records:
DELETE FROM DupContacts
WHERE ContactID NOT IN (SELECT MIN(ContactID)
FROM DupContacts
GROUP BY FirstName, LastName)

SELECT * FROM DupContacts

–> Final Cleanup

DROP TABLE DupContacts

Check the same demo here:

Categories: SQL Tips, SQL Trivia Tags: Delete Duplicate, ROW_NUMBER, SELF JOIN

Convert multiple Rows into a single column

August 17, 2010 2 comments

Just replied an answer on a SQL forum (http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=148860), so thought to post this as well for future reference.

How will you output the following record set:
123456788  City     399.99
123456788  County   499.99
123456788  Flood    299.99
123456788  Hazzard  199.99
123456789  City     333.99
123456789  County   444.99
123456789  Flood    222.99
123456789  Hazzard  111.99

... into following pattern:

id            Hazzard     Flood      City       County
123456788     199.99      299.99     399.99     499.99
123456789     111.99      222.99     333.99     444.99

The following query will do it and convert the rows into columns:

select 123456788 as id,  'Hazzard' as des, 199.99 as val
into #tempTable
union
select 123456788, 'Flood', 299.99 union
select 123456788, 'City', 399.99 union
select 123456788, 'County', 499.99 union
select 123456789, 'Hazzard', 111.99 union
select 123456789, 'Flood', 222.99 union
select 123456789, 'City', 333.99 union
select 123456789, 'County', 444.99
select * from #tempTable

select a.id as id, a.val as Hazzard, b.val as Flood, c.val as City, d.val as County
from #tempTable a,#tempTable b,#tempTable c, #tempTable d
where a.id=b.id and b.id=c.id and c.id=d.id
and a.des='Hazzard' and b.des='Flood' and c.des='City' and d.des='County'

drop table #tempTable

The above select code is with a simple old style where clause with multiple tables joining. The code could be also converted to self-joins.

Categories: JOINS, SQL Tips Tags: Convert Rows to Columns, SELF JOIN