Archive
Clustered Index will not always guarantee Sorted Rows
In my previous post I discussed about the how Clustered Index’s Data-Pages and Rows are allocated in memory (Disk). I tried to prove that that Clustered Indexes do not guarantee Physical Ordering of Rows. But instead they are Logically Ordered and Sorted.
As they are Logically Sorted and when you query a table without an “ORDER BY” clause you get Sorted Rows, but this is not what will happen always. You can also get rows in Unsorted Order, so to get Sorted rows always apply an “ORDER BY” clause. Here in this post we will see under what circumstances a table will not return Sorted rows:
–> Let’s create a simple table with a Clustered Index on it and add some records:
-- Create table with 2 columns, first beign a PK and an IDENTITY column: CREATE TABLE test2 ( i INT IDENTITY(1,1) PRIMARY KEY NOT NULL, j INT ) -- Let's insert some records in random order on second column: INSERT INTO test2 (j) SELECT 500 UNION ALL SELECT 300 UNION ALL SELECT 900 UNION ALL SELECT 100 UNION ALL SELECT 600 UNION ALL SELECT 200 -- Now we will query the table without using ORDER BY clause: SELECT * FROM test2
–> Output:
You get sorted rows, as the Execution plan shows that Clustered Index was used to fetch the records.
–> Now, what if this table also has a Non Clustered Index, let’s see:
-- We will create a Non Clustered Index on the same table on second column: CREATE INDEX NCI_test ON test2 (j) -- Again query the table without using ORDER BY clause: SELECT * FROM test2
–> Following is the Output of the second SELECT query above:
This time the query returned rows in un-ordered fashion. As you can see in the Execution plan Query optimizer preferred to do a Non Clustered Scan. Thus the records were returned in un-ordered fashion.
–> Now, If you add an ORDER BY Clause to your SELECT Query how does it return rows:
-- Added an ORDER BY Clause: SELECT * FROM test2 ORDER BY i
–> Following is the Output of the third SELECT query above:
After adding an “ORDER BY” clause the query optimizer preferd to use the Clustered Index and returns rows in Ordered fashion again.
-- Final Cleanup DROP TABLE test2
So, it is necessary to provide an “ORDER BY” clause to your Queries when you expect to get sorted results, even if the table has Clustered Index on it.
Clustered Index do “NOT” guarantee Physically Ordering or Sorting of Rows
Myth: Lot of SQL Developers I’ve worked with and interviewed have this misconception, that Clustered Indexes are Physically Sorted and Non-Clustered Index are Logically Sorted. When I ask them reasons, all come with their own different versions.
I discussed about this in my previous post long time back, where I mentioned as per the MSDN that “Clustered Indexes do not guarantee Physical Ordering of rows in a Table”. And today in this post I’m going to provide an example to prove it.
Clustered Index is the table itself as its Leaf-Nodes contains the Data-Pages of the table. The Data-Pages are nothing but Doubly Linked-List Data-Structures linked to their leading & preceding Data-Pages. The Data-Pages and rows in them are Ordered by the Clustering Key. Thus there can be only one Clustered Index on a table.
When a new Clustered Index is created on a table, it reorganizes the table’s Data-Pages Physically so that the rows are Logically sorted. But when a tables goes through updates, like INSERT/UPDATE/DELETE operations then the index gets fragmented and thus the Physical Ordering of Data-Pages gets lost. But still the Data-Pages of the Clustered Index or the Table are chained together by Doubly Linked-List in Logical Order.
We will see this behavior in the following example:
USE AdventureWorks2012 GO CREATE TABLE test (i INT PRIMARY KEY not null, j VARCHAR(2000)) INSERT INTO test (i,j) SELECT 1, REPLICATE('a',100) UNION ALL SELECT 2, REPLICATE('a',100) UNION ALL SELECT 3, REPLICATE('a',100) UNION ALL SELECT 5, REPLICATE('a',100) UNION ALL SELECT 6, REPLICATE('a',100) SELECT * FROM test DBCC IND(AdventureWorks2012, 'test', -1); DBCC TRACEON(3604); DBCC PAGE (AdventureWorks2012, 1, 22517, 1);
As you can see above we inserted 5 records in a table having PK as Clustering Key with following values: 1,2,3,5,6. We skipped the value 4.
–> Here is the Output of DBCC PAGE command execute in the end:
The DBCC PAGE provides very descriptive Output, but here I’ve just extracted the useful information we need.
OFFSET TABLE: Row - Offset 4 (0x4) - 556 (0x22c) --> 1=6 3 (0x3) - 441 (0x1b9) --> 1=5 2 (0x2) - 326 (0x146) --> 1=3 1 (0x1) - 211 (0xd3) --> 1=2 0 (0x0) - 96 (0x60) --> 1=1
It show the 5 rows with their memory allocated address in both decimal and hexadecimal formats, which is contiguous. Thus all the rows are Physically Ordered.
Now, let’s insert the value “4” that we skipped initially:
INSERT INTO test (i,j) SELECT 4, REPLICATE('a',100) SELECT * FROM test DBCC IND(AdventureWorks2012, 'test', -1); DBCC TRACEON(3604); DBCC PAGE (AdventureWorks2012, 1, 22517, 1);
–> Here is the new Output of DBCC PAGE command execute in the end:
OFFSET TABLE: Row - Offset 5 (0x5) - 556 (0x22c) --> 1=6 4 (0x4) - 441 (0x1b9) --> 1=5 3 (0x3) - 671 (0x29f) --> 1=4 <-- new row added here 2 (0x2) - 326 (0x146) --> 1=3 1 (0x1) - 211 (0xd3) --> 1=2 0 (0x0) - 96 (0x60) --> 1=1
As you can see very clearly in the output above that the memory allocation address for the new row with PK value = “4” is 671, which is not in between and greater than the address of the last row i.e. 556. The memory address of the last 2 rows is unchanged (441 & 556) and the new row is not accommodated in between as per the sort order but at the end. There is no Physical Ordering and the new row is Logically Chained with other rows and thus is Logically and Ordered, similar to the image below:
Myth Busted: Hope the above exercise clears that Clustered Indexes do not guarantee the Physical Ordering of Rows & Data Pages.
-- Final Cleanup DROP TABLE test
Columnstore Indexes in SQL Server 2012
This time the new version of SQL Server 2012 a.k.a Denali has introduced a new kind of index i.e. ColumnStore Index, which is very different from the traditional indexes. This new index differs in the way it is created, stores its table contents in specific format and provides fast retrieval of data from the new storage.
–> Before talking about ColumnStore Index, let’s first check and understand what is a ColumnStore?
ColumnStore is a data storage method that uses xVelocity technology based upon Vertipaq engine, which uses a new Columnar storage technique to store data that is highly Compressed and is capable of In-memory Caching and highly parallel data scanning with Aggregation algorithms.
Traditionally, on the other side a RowStore is the traditional and by-default way to store data for each row and then joins all the rows and store them in Data Pages, and is still the same storage mechanism for Heap and Clustered Indexes.
The ColumnStore or Columnar data format does not store data in traditional RowStore fashion, instead the data is grouped and stored as one column at a time in Column Segments.
–> Here is what happens when you try to create a ColumnStore Index on a table:
1. Existing table rows are divided into multiple RowGroups, a Row-Group can contain upto 1 million rows.
2. Each column of a RowGroup is stored in its own Segment and is compressed.
3. The individual compressed Column Segments are added to the ColumnStore.
4. When new rows are inserted or existing ones are updated (in small batches, except BulkLoad) they are added to a separate Delta Store, upto a threshold of 1 million rows.
5. When a Delta-Store reaches its threshold of 1 million rows a separate process Tuple-mover invokes and closes the delta-store, compresses it & stores it into the ColumnStore index.
–> Thus, Columnstore indexes can produce faster results by doing less I/O operations by following:
1. Reading only the required columns, thus less data is read from disk to memory.
2. Heavy Column compression, which reduces the number of bytes that must be read and moved.
3. Advanced query execution technology by processing chunks of columns called batches (1000 rows) in a streamlined manner, further reducing CPU usage.
4. Stored as ColumnStore Object Pool in RAM to cache ColumnStore Index, instead of SQL Buffer Pool (for Pages)
–> Please Note: In SQL Server 2012 ColumnStore indexes has some limitations:
1. A Table (Heap or BTree) can have only one NonClustered ColumnStore Index.
2. Cannot be a Clustered Index.
3. A Table with NonClustered ColumnStore Index becomes readonly and cannot be updated.
4. Check MSDN BoL for more limitations with SQL Server 2012 version, link.
Reorganize Index vs Rebuild Index in SQL Server
Well-designed indexes on tables/views improves the performance of queries run against a database by reducing disk I/O operations and consume fewer system resources. Indexes can be helpful for a variety of queries that contain SELECT, UPDATE, DELETE, or MERGE statements.
SQL Server Database Engine automatically maintains indexes whenever INSERT, UPDATE, or DELETE operations are made to the underlying data. Over time these modifications can cause the information in the index to become scattered/fragmented in the database. Fragmentation exists when indexes have pages in which the logical ordering, based on the key value, does not match the physical ordering inside the data file. Heavily fragmented indexes can degrade query performance and cause your application to respond slowly.
SQL Server has provided ways to reduce/remedy fragmentation by Reorganizing or Rebuilding an Index.
1. Reorganize Index: uses minimal system resources. It defragments the leaf level of clustered and nonclustered indexes on tables and views by physically reordering the leaf-level pages to match the logical, left to right, order of the leaf nodes. Reorganizing also compacts the index pages. Compaction is based on the existing fill factor value.
2. Rebuild Index: drops and re-creates the index. This removes fragmentation, reclaims disk space by compacting the pages based on the specified or existing fill factor setting, and reorders the index rows in contiguous pages. When ALL is specified, all indexes on the table are dropped and rebuilt in a single transaction.
So, which approach to go with, Reorganize or Rebuild?
First of all we’ll check the fragmentation status of a particular table or an index, by using sys.dm_db_index_physical_stats function.
Here we will check the status of all indexes in [HumanResources].[Employee] table of [AdventureWorks2012] database.
select x.name as Index_Name, s.database_id, s.object_id, s.index_id, s.index_type_desc, -- General info columns s.avg_fragmentation_in_percent, s.fragment_count, s.avg_fragment_size_in_pages -- stats we need from sys.indexes x cross apply sys.dm_db_index_physical_stats ( DB_ID(N'AdventureWorks2012'), -- database_id x.object_id, -- object_id x.index_id, -- index_id NULL, -- partition_number NULL) as s -- mode where s.object_id = object_id('HumanResources.Employee')
Output:
We will use the following criteria setup by Microsoft to detirmine the best method to correct the fragmentation:
avg_fragmentation_in_percent value | Corrective statement > 5% and <= 30% ALTER INDEX REORGANIZE > 30% ALTER INDEX REBUILD WITH (ONLINE = ON)*
The above results shows that the Indexes [AK_Employee_LoginID] and [AK_Employee_NationalIDNumber] requires Rebuild and rest of them are good.
–> TO REBUILD:
--// To Rebuild [AK_Employee_LoginID] Index, run the following query: USE AdventureWorks2012; GO ALTER INDEX AK_Employee_LoginID ON HumanResources.Employee REBUILD; GO --// To Rebuild All indexes, use following query: USE AdventureWorks2012; GO ALTER INDEX ALL ON HumanResources.Employee REBUILD WITH (FILLFACTOR = 80, SORT_IN_TEMPDB = ON, STATISTICS_NORECOMPUTE = ON); GO
–> TO REORGANIZE:
--// To Reorganize [AK_Employee_NationalIDNumber] Index, run the following query: USE AdventureWorks2012; GO ALTER INDEX AK_Employee_NationalIDNumber ON HumanResources.Employee REORGANIZE; GO --// To Reorganize All indexes on [HumanResources].[Employee], use following query: USE AdventureWorks2012; GO ALTER INDEX ALL ON HumanResources.Employee REORGANIZE; GO
So, check the fragmentation status by using the DM function sys.dm_db_index_physical_stats and then decide to do either REORGANIZE or a REBUILD on an Index.
Clustered vs NonClustered Indexes… and data sorting in SQL Server
This post on difference between Clustered Index & Non Clustered Index is a prequel to my blog post on requirement & use of Clustered index & NonClustered index, [link].
As per MS BOL (MSDN) in SQL Server, indexes are organized as B-trees. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom level of nodes in the index is called the leaf nodes. Any index levels between the root and the leaf nodes are collectively known as intermediate levels.
In Clustered index: (MSDN)
1. The Leaf nodes contain the Data pages of the underlying table.
2. The Root and Intermediate level nodes contain Index pages holding Index rows.
3. Each Index row contains a Key value and a pointer to either an Intermediate level page in the B-tree, or a Data row in the Leaf level of the Index. The Pages in each level of the Index are linked in a Doubly-linked list.
4. The Pages in the Data chain and the rows in them are ordered on the value of the Clustered Index key.
5. There can be only one Clustered Index on a table.
6. Does not support Included column, because they already contain all the columns which are not in the index as Included columns.
In NonClustered Index: (MSDN)
1. The Leaf layer of a NonClustered index is made up of Index pages instead of Data pages.
2. Each Index row in the NonClustered index contains the NonClustered Key value and a row locator. This locator points to the Data row in the Clustered index or Heap having the Key value.
2.a. If the table is a Heap, which means it does not have a Clustered index, the row locator is a pointer to the row.
2.b. If the table has a Clustered index, or the index is on an Indexed view, the row locator is the Clustered index Key for the row. SQL Server retrieves the data row by searching the Clustered index using the Clustered index Key stored in the Leaf row of the NonClustered index.
3. The Data rows of the underlying table are not sorted and stored in order based on their nonclustered keys.
4. Each table can have up to 249 & 999 nonclustered indexes on SQL Server 2005 & 2008 respectively.
Indexing & data Sorting: (MSDN)
As per MS BOL, a Clustered Index only reorganizes the data pages so that the rows are logically sorted in Clustered Index order. The pages are not guaranteed to be ordered physically. SQL Server doesn’t necessarily store the data physically on the disk in clustered-index order, but while creating an index, SQL Server attempts to physically order the data as close to the logical order as possible. Each page in an index’s leaf level has a pointer to the page that logically precedes the current page and to the page that logically follows the current page, thereby creating a doubly linked list. The sysindexes table contains the address of the first leaf-level page. Because the data is guaranteed to be logically in clustered-index order, SQL Server can just start at the first page and follow the index pointers from one page to the next to retrieve the data in order.
So its not guaranteed about the physical ordering of records/rows if a table has Clustered Index on it. It is a common misconsecption among people that Clustered Index sorts data physically & Non Clustered Index sorts data logically.
Also discussed this topic on MSDN’s TSQL forum and got several expert comments & answers:Â http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/1c98a7ee-7e60-4730-a38c-f0e3f0deddba
More Info on ordering: https://sqlwithmanoj.com/2013/06/02/clustered-index-do-not-guarantee-physically-ordering-or-sorting-of-rows/