ROW_NUMBER | SQL with Manoj

SQL Trivia – Identify & Delete Duplicate records from a table

October 14, 2011 6 comments

I see a lot of questions/posts about dealing with duplicate records in many SQL Server forums. Many of these questions are asked in SQL Server Interviews and many developers starting their carrier in database programming find it challenging to deal with. Here we will see how we can deal with such records.

Duplicate data is one of the biggest pain points in our IT industry, which various projects have to deal with. Whatever state-of-art technology and best practices followed, the big ERP, CRM, SCM and other inventory based database management projects ends up in having duplicate & redundant data. Duplicate data keeps on increasing by manual entries and automated data loads. Various data leads getting pumped into system’s databases without proper deduping & data cleansing leads to redundant data and thus duplicated record-sets.

Data cleansing requires regular exercise of identifying duplicates, validating and removing them. To minimize these type of scenarios various checks and filters should also be applied before loading new leads into the system.

–> Lets check this by a simple exercise how we can identify & remove duplicate data from a table:

1. Insert some sample records from Person.Contact table of [AdventureWorks] database:

USE [AdventureWorks]
GO

SELECT TOP 10 ContactID, FirstName, LastName, EmailAddress, Phone
INTO DupContacts
FROM Person.Contact

SELECT * FROM DupContacts

2. Insert some duplicate records from the same list inserted above:

INSERT INTO DupContacts
SELECT TOP 50 PERCENT FirstName, LastName, EmailAddress, Phone
from DupContacts

SELECT * FROM DupContacts

3. Insert some more duplicate records from the same list inserted above.

INSERT INTO DupContacts
SELECT TOP 20 PERCENT FirstName, LastName, EmailAddress, Phone
from DupContacts

SELECT * FROM DupContacts

–> Identify Duplicate records & delete them

Method #1: by using ROW_NUMBER() function:

;WITH dup as (
SELECT ContactID, FirstName, LastName, EmailAddress, Phone,
ROW_NUMBER() OVER(PARTITION BY FirstName, LastName ORDER BY ContactID) AS NumOfDups
FROM DupContacts)
SELECT * FROM dup
WHERE NumOfDups > 1
ORDER BY ContactID

-- Remove/Delete duplicate records:
;WITH dup as (
SELECT ContactID, FirstName, LastName, EmailAddress, Phone,
ROW_NUMBER() OVER(PARTITION BY FirstName, LastName ORDER BY ContactID) AS NumOfDups
FROM DupContacts)
DELETE FROM dup
WHERE NumOfDups > 1

SELECT * FROM DupContacts

Method #2: by using SELF-JOIN:

SELECT DISTINCT a.ContactID, a.FirstName, a.LastName, a.EmailAddress, a.Phone
FROM DupContacts a
JOIN DupContacts b
ON a.FirstName = b.FirstName
AND a.LastName = b.LastName
AND a.ContactID > b.ContactID

-- Remove/Delete duplicate records:
DELETE a
FROM DupContacts a
JOIN DupContacts b
ON a.FirstName = b.FirstName
AND a.LastName = b.LastName
AND a.ContactID > b.ContactID

SELECT * FROM DupContacts

Method #3: by using AGGREGATES & Sub-QUERY:

SELECT * FROM DupContacts
WHERE ContactID NOT IN (SELECT MIN(ContactID)
FROM DupContacts
GROUP BY FirstName, LastName)

-- Remove/Delete duplicate records:
DELETE FROM DupContacts
WHERE ContactID NOT IN (SELECT MIN(ContactID)
FROM DupContacts
GROUP BY FirstName, LastName)

SELECT * FROM DupContacts

–> Final Cleanup

DROP TABLE DupContacts

Check the same demo here:

Categories: SQL Tips, SQL Trivia Tags: Delete Duplicate, ROW_NUMBER, SELF JOIN

Ranking Functions | ROW_NUMBER, RANK, DENSE_RANK, NTILE

August 9, 2010 9 comments

SQL Server 2005 provides functionality for using Ranking Functions with your result set. One can select a number of Ranking algorithms which are applied to a column of your table that you want to classify in a scope of your executing query. This feature is Dynamic and upon change of data (addition or removal of rows) it gives desired results the next time query is run.

–> Its 4 gems are:

1. ROW_NUMBER: Returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.
Syntax:

ROW_NUMBER() OVER ( [ < partition_by_clause > ] < order_by_clause > )

2. RANK: Returns the rank of each row within the partition of a result set. The rank of a row is one plus the number of ranks that come before the row in question.
Syntax:

RANK() OVER ( [ < partition_by_clause > ] < order_by_clause > )

3. DENSE_RANK: Returns the rank of rows within the partition of a result set, without any gaps in the ranking. The rank of a row is one plus the number of distinct ranks that come before the row in question.
Syntax:

DENSE_RANK() OVER( [ ] < order_by_clause > )

4. NTILE: Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs.
Syntax:

NTILE(integer_expression) OVER( [ ] < order_by_clause > )

–> Now lets take an example, simple one of a class of students, their marks & class:

select 'A' [class], 80 [marks], 'manoj' stuName
into #tempTable
UNION
select 'A', 70 [marks],'harish' stuName
UNION
select 'A', 80 [marks],'kanchan' stuName
UNION
select 'A', 90 [marks],'pooja' stuName
UNION
select 'A', 90 [marks],'saurabh' stuName
UNION
select 'A', 50 [marks],'anita' stuName
UNION
select 'B', 60 [marks],'nitin' stuName
UNION
select 'B', 50 [marks],'kamar' stuName
UNION
select 'B', 80 [marks],'dinesh' stuName
UNION
select 'B', 90 [marks],'paras' stuName
UNION
select 'B', 50 [marks],'lalit' stuName
UNION
select 'B', 70 [marks],'hema' stuName

select * from #tempTable

Now on selection this gives you:
class 	marks   name
A 	50 	anita
A 	70 	harish
A 	80 	kanchan
A 	80 	manoj
A 	90 	pooja
A 	90 	saurabh
B 	50 	kamar
B 	50 	lalit
B 	60 	nitin
B 	70 	hema
B 	80 	dinesh
B 	90 	paras

–> The following query shows you how each function works:

select marks, stuName,
    ROW_NUMBER() over(order by marks desc) as [RowNum],
    RANK() over(order by marks desc) as [Rank],
    DENSE_RANK() over(order by marks desc) as [DenseRank],
    NTILE(3) over(order by marks desc) as [nTile]
from #tempTable

Result:
marks stuName  RowNum  Rank DenseRank  nTile
90    pooja    1       1    1          1
90    saurabh  2       1    1          1
90    paras    3       1    1          1
80    dinesh   4       4    2          1
80    kanchan  5       4    2          2
80    manoj    6       4    2          2
70    harish   7       7    3          2
70    hema     8       7    3          2
60    nitin    9       9    4          3
50    anita    10     10    5          3
50    kamar    11     10    5          3
50    lalit    12     10    5          3

–> Here:
– RowNum column lists unique ID’s of students, like Roll Numbers.
– Rank lists student rank with equal ranks those secured equal marks, thus there is no 2nd or 3rd.
– DenseRank lists student ranks with no gaps, so here 3 students came 1st &2nd and only 2 3rd.
– nTile listed students in different but equal groups, can be thought of as different sections.

–> Now, lets use the PARTITION BY option, its same as group by clause. Lets group/partition the students group by their classes A&B:

select class, marks, stuName,
    ROW_NUMBER() over(partition by class order by marks desc) as [RowNum],
    RANK() over(partition by class order by marks desc) as [Rank],
    DENSE_RANK() over(partition by class order by marks desc) as [DenseRank],
    NTILE(3) over(partition by class order by marks desc) as [nTile]
from #tempTable

Result:
class marks stuName  RowNum  Rank DenseRank  nTile
A     90    pooja    1       1    1          1
A     90    saurabh  2       1    1          1
A     80    kanchan  3       3    2          2
A     80    manoj    4       3    2          2
A     70    harish   5       5    3          3
A     50    anita    6       6    4          3
B     90    paras    1       1    1          1
B     80    dinesh   2       2    2          1
B     70    hema     3       3    3          2
B     60    nitin    4       4    4          2
B     50    kamar    5       5    5          3
B     50    lalit    6       5    5          3

You can clearly see that our query has grouped students in 2 Partitions (classes) and then Ranked them.

In my [next post] check how to use OVER Clause & Partition By option with Aggregate functions like, SUM, AVG, MIN, MAX, etc.

–> Check the same demo on YouTube: