A Couple of Data Compression Tricks in SQL Server 2008

As I have said several times before, I am a big fan of data compression in SQL Server 2008 and above. Data compression is an Enterprise Edition only feature that was added to SQL Server 2008, and enhanced with Unicode compression in SQL Server 2008 R2.

It can be very effective in allowing you to trade some CPU utilization for a large reduction in I/O pressure and in memory pressure. Particularly if you have indexes that are relatively static, with data that compresses well, on large tables, with a weak or overstressed storage subsystem. Powerful CPUs are usually a lot less expensive than adding extra I/O capacity.

If you have large existing indexes that are good candidates for compression, you need to consider how to get those indexes compressed with the least performance impact on your system. Compressing an index means that it is rebuilt. The entire index must be read off of the storage subsystem, compressed, and then written back out to the storage subsystem. This can cause I/O pressure, memory pressure, and CPU pressure on your system.

If you use the Data Compression Wizard that is built into the SSMS GUI, the T-SQL script that it will generate can cause big concurrency issues for an active database that must be available 24 x 7 x 365.  This is because it will rebuild the index in OFFLINE mode, and it will not restrict the number of CPU cores that are used for the index rebuild (by using the MAXDOP = xx option). This will rebuild and compress the index as fast as your I/O subsystem and processors will allow, but it will end up locking your table and very likely pushing your CPU utilization to 100% during the operation.

A safer, but more time consuming alternative is to add the ONLINE = ON and MAXDOP = xx (where xx is 25-50% of your logical CPU cores) to the command, like you see in Figure 1 below.

-- Page compress clustered index on BigAddress table
ALTER INDEX [PK_BigAddress] 
ON [dbo].[BigAddress] REBUILD PARTITION = ALL 
WITH ( PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, ALLOW_ROW_LOCKS  = ON, 
ALLOW_PAGE_LOCKS  = ON, ONLINE = ON, MAXDOP = 2, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Figure 1: Example of modified ALTER INDEX command

Adding ONLINE = ON will add about 50% more time to the ALTER INDEX operation, but will keep the table available during the command. Adding the MAXDOP option will restrict the number of logical CPU cores used for the operation (if you have the instance level MAXDOP set to zero, which is the default), thereby putting a ceiling on CPU utilization. This will also slow down the operation. You might be thinking “Why would I want to slow down the index rebuild?”, but believe me, running it ONLINE, with CPU usage throttled is much safer for concurrency purposes!

Once you have the command running in ONLINE mode, you might be curious about how long it will take to complete. Well, I just happen to have a query that will help you estimate that, shown in Figure 2.

-- Get Table names, row counts, and compression status for clustered index or heap
SELECT OBJECT_NAME(object_id) AS [ObjectName], 
SUM(Rows) AS [RowCount], data_compression_desc AS [CompressionType]
FROM sys.partitions 
WHERE index_id < 2 --ignore the partitions from the non-clustered index if any
AND OBJECT_NAME(object_id) NOT LIKE 'sys%'
AND OBJECT_NAME(object_id) NOT LIKE 'queue_%' 
AND OBJECT_NAME(object_id) NOT LIKE 'filestream_tombstone%' 
AND OBJECT_NAME(object_id) NOT LIKE 'fulltext%'
AND OBJECT_NAME(object_id) NOT LIKE 'ifts_comp_fragment%'
GROUP BY object_id, data_compression_desc
ORDER BY SUM(Rows) DESC OPTION (RECOMPILE);

Figure 2: Getting row counts and compression status for clustered indexes

It turns out that if you use the ONLINE = ON flag, after SQL Server starts writing out the PAGE compressed version of the clustered index, it will show up when you query sys.partitions, like you see in Figure 3, where you can see that it was about 90% done when I ran this query.

ObjectName               RowCount         CompressionType
BigAddress                    330391354          NONE
BigAddress                    306682536          PAGE

Figure 3: Duplicate entries in sys.partitions while index is being compressed

If the ALTER INDEX command has been running for 60 minutes, and you see that it is 90% done writing out the new index, that means that it will probably finish in about 3-4 minutes, since it took some time to read the index in off of the disk subsystem.

This entry was posted in SQL Server 2008, SQL Server 2008 R2 and tagged . Bookmark the permalink.

8 Responses to A Couple of Data Compression Tricks in SQL Server 2008

  1. Andrew says:

    Glenn –

    I’m looking at compression as an option in our environment but I’m wondering how much of a performance gain on average that you can expect? i understand ever system is different and every load is different, I’m just looking for a ball park figure of what could be expected? Also, how much overhead would it add to the CPU (again, understanding all systems/loads are different)?

    • Glenn Berry says:

      If you have have indexes on large, relatively static tables, that are highly compressible (which depends on the data), and if you are I/O bound in your regular workload, then data compression can be a huge win. For example, if you had an index that was 60GB in size before compression, and it was compressed down to 20GB with data compression, you would see a very large reduction (up to 2/3rds) in read I/O for scans of that index. You would also see less memory pressure, since the index will stay compressed in the buffer pool, as long as the data does not change. In this scenario, there will be no extra CPU activity either.

      Where it gets dicey is if you have an index on a very volatile table, such that the extra CPU activity from decompressing/compressing the data as it changes outweighs the reduced I/O and memory pressure from data compression.

  2. Jack Vamvas says:

    Glenn, I’ve had varying results with precision based data types. Is this something you can comment on?

  3. Pingback: Something for the Weekend – SQL Server Links 26/08/11

  4. Pingback: Link Resource # 20 : Aug 26 – Aug 28 « Dactylonomy of Web Resource

  5. Eric Kohnen says:

    Glenn, I’m wondering if data compression will improve performance (speed) on a system that is fully contained in memory such as our production 70 GB database on a system with 128 GB of RAM. On another forum you state, “Data stays compressed in memory unless you have to update it”, which leaves room for performance improvement in a system fully contained in memory. Others have stated that the data stays compressed until you access it. This would indicate that there would not be a performance improvement. Please comment. Thanks.

  6. Glenn Berry says:

    Eric,

    It might for certain types of operations. For example, if you do a clustered index scan of a large table (even if it is in the buffer pool, so you get logical reads) you would be reading much less data out of memory to satisfy the query, so query might be faster. Using data compression will mean that your indexes will take up less space in the buffer pool, so even more indexes will fit in the same space.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s