As I have said several times before, I am a big fan of data compression in SQL Server 2008 and above. Data compression is an Enterprise Edition only feature that was added to SQL Server 2008, and enhanced with Unicode compression in SQL Server 2008 R2.
It can be very effective in allowing you to trade some CPU utilization for a large reduction in I/O pressure and memory pressure, particularly if you have relatively static indexes, with data that compresses well, on large tables, sitting on a weak or overstressed storage subsystem. Powerful CPUs are usually much less expensive than adding extra I/O capacity.
If you have large existing indexes that are good candidates for compression, you need to consider how to get those indexes compressed with the least performance impact on your system. Compressing an index means that it is rebuilt. The entire index must be read off of the storage subsystem, compressed, and then written back out to the storage subsystem. This can cause I/O pressure, memory pressure, and CPU pressure on your system.
If you use the Data Compression Wizard that is built into the SSMS GUI, the T-SQL script that it will generate can cause big concurrency issues for an active database that must be available 24 x 7 x 365. This is because it will rebuild the index in OFFLINE mode, and it will not restrict the number of CPU cores that are used for the index rebuild (by using the MAXDOP = xx option). This will rebuild and compress the index as fast as your I/O subsystem and processors will allow, but it will end up locking your table and very likely pushing your CPU utilization to 100% during the operation.
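For reference, the wizard tends to generate something along these lines (a simplified sketch, reusing the PK_BigAddress index from Figure 1); notice that there is no ONLINE or MAXDOP option, so the rebuild locks the table and can use every logical CPU core on the instance.

-- Sketch of a wizard-style rebuild (simplified for illustration)
-- No ONLINE = ON and no MAXDOP, so the table is locked and the rebuild
-- can use all available logical CPU cores
ALTER INDEX [PK_BigAddress] ON [dbo].[BigAddress] REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);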
A safer, but more time-consuming, alternative is to add the ONLINE = ON and MAXDOP = xx options (where xx is 25-50% of your logical CPU cores) to the command, as you can see in Figure 1 below.
-- Page compress clustered index on BigAddress table
ALTER INDEX [PK_BigAddress] ON [dbo].[BigAddress] REBUILD PARTITION = ALL
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, ALLOW_ROW_LOCKS = ON,
      ALLOW_PAGE_LOCKS = ON, ONLINE = ON, MAXDOP = 2, SORT_IN_TEMPDB = ON,
      DATA_COMPRESSION = PAGE);
Figure 1: Example of modified ALTER INDEX command
Adding ONLINE = ON will add about 50% more time to the ALTER INDEX operation, but it will keep the table available during the command. Adding the MAXDOP option will restrict the number of logical CPU cores used for the operation (if you have the instance-level MAXDOP set to zero, which is the default), thereby putting a ceiling on CPU utilization. This will also slow down the operation. You might be thinking, “Why would I want to slow down the index rebuild?”, but believe me, running it ONLINE, with CPU usage throttled, is much safer for concurrency purposes!
Once you have the command running in ONLINE mode, you might be curious about how long it will take to complete. Well, I just happen to have a query that will help you estimate that, shown in Figure 2.
-- Get Table names, row counts, and compression status for clustered index or heap
SELECT OBJECT_NAME(object_id) AS [ObjectName],
       SUM(Rows) AS [RowCount],
       data_compression_desc AS [CompressionType]
FROM sys.partitions
WHERE index_id < 2 -- ignore the partitions from non-clustered indexes, if any
AND OBJECT_NAME(object_id) NOT LIKE 'sys%'
AND OBJECT_NAME(object_id) NOT LIKE 'queue_%'
AND OBJECT_NAME(object_id) NOT LIKE 'filestream_tombstone%'
AND OBJECT_NAME(object_id) NOT LIKE 'fulltext%'
AND OBJECT_NAME(object_id) NOT LIKE 'ifts_comp_fragment%'
GROUP BY object_id, data_compression_desc
ORDER BY SUM(Rows) DESC
OPTION (RECOMPILE);
Figure 2: Getting row counts and compression status for clustered indexes
It turns out that if you use the ONLINE = ON option, then once SQL Server starts writing out the PAGE compressed version of the clustered index, the new rowset will show up when you query sys.partitions, as you can see in Figure 3, where the rebuild was about 90% done when I ran this query.
ObjectName RowCount CompressionType
BigAddress 330391354 NONE
BigAddress 306682536 PAGE
Figure 3: Duplicate entries in sys.partitions while index is being compressed
If the ALTER INDEX command has been running for 60 minutes, and you see that it is 90% done writing out the new index, it will probably finish in about 3-4 minutes, since some of that elapsed time was spent reading the existing index off of the disk subsystem.
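If you would rather get a single number than eyeball the two row counts, a quick sketch like the one below (assuming the same BigAddress table, and that both the NONE and PAGE rowsets are currently visible in sys.partitions, as in Figure 3) divides the rows already written into the new compressed rowset by the rows in the original one.

-- Rough progress estimate for an ONLINE page compression rebuild
-- (a sketch; assumes both the NONE and PAGE rowsets are visible, as in Figure 3)
SELECT CAST(100.0 *
       SUM(CASE WHEN data_compression_desc = 'PAGE' THEN Rows ELSE 0 END) /
       SUM(CASE WHEN data_compression_desc = 'NONE' THEN Rows ELSE 0 END)
       AS DECIMAL(5,1)) AS [PercentWritten]
FROM sys.partitions
WHERE object_id = OBJECT_ID(N'dbo.BigAddress') -- table being rebuilt
AND index_id < 2
OPTION (RECOMPILE);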
Glenn –
I’m looking at compression as an option in our environment, but I’m wondering how much of a performance gain on average you can expect. I understand every system is different and every load is different; I’m just looking for a ballpark figure of what could be expected. Also, how much overhead would it add to the CPU (again, understanding all systems/loads are different)?
If you have indexes on large, relatively static tables that are highly compressible (which depends on the data), and if you are I/O bound in your regular workload, then data compression can be a huge win. For example, if you had an index that was 60GB in size before compression, and it was compressed down to 20GB with data compression, you would see a very large reduction (up to two-thirds) in read I/O for scans of that index. You would also see less memory pressure, since the index stays compressed in the buffer pool, as long as the data does not change. In this scenario, there would be no extra CPU activity either.
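If you want a ballpark figure for your own data before committing to a rebuild, SQL Server can sample the index and estimate the space savings for you. Here is a minimal sketch using sp_estimate_data_compression_savings against the BigAddress table from the earlier examples:

-- Estimate how much space PAGE compression would save on the clustered index
-- of dbo.BigAddress, without actually rebuilding it (samples the data in tempdb)
EXEC sp_estimate_data_compression_savings
    @schema_name = N'dbo',
    @object_name = N'BigAddress',
    @index_id = 1,            -- clustered index; pass NULL to check all indexes
    @partition_number = NULL,
    @data_compression = N'PAGE';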
Where it gets dicey is if you have an index on a very volatile table, such that the extra CPU activity from decompressing/compressing the data as it changes outweighs the reduced I/O and memory pressure from data compression.
Glenn, I’ve had varying results with precision based data types. Is this something you can comment on?
Which type of compression have you tried, PAGE or ROW? What do you mean by “varying” results? Different compression ratios, different performance results?
Glenn, I’m wondering if data compression will improve performance (speed) on a system that is fully contained in memory such as our production 70 GB database on a system with 128 GB of RAM. On another forum you state, “Data stays compressed in memory unless you have to update it”, which leaves room for performance improvement in a system fully contained in memory. Others have stated that the data stays compressed until you access it. This would indicate that there would not be a performance improvement. Please comment. Thanks.
Eric,
It might for certain types of operations. For example, if you do a clustered index scan of a large table (even if it is in the buffer pool, so you get logical reads), you would be reading much less data out of memory to satisfy the query, so the query might be faster. Using data compression also means that your indexes take up less space in the buffer pool, so even more indexes will fit in the same amount of memory.
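One way to see this effect in your own environment is to check how much buffer pool space each object is using before and after you compress an index. The sketch below uses the standard buffer descriptor DMVs; treat the exact join shape as an assumption and adjust it for your own database.

-- Buffer pool space used per object in the current database (each buffer is 8 KB)
-- Run before and after compressing an index to see the difference
SELECT OBJECT_NAME(p.object_id) AS [ObjectName],
       COUNT(*) * 8 / 1024 AS [BufferPoolMB]
FROM sys.dm_os_buffer_descriptors AS bd
INNER JOIN sys.allocation_units AS au
    ON bd.allocation_unit_id = au.allocation_unit_id
    AND au.type IN (1, 3) -- in-row and row-overflow data
INNER JOIN sys.partitions AS p
    ON au.container_id = p.hobt_id
WHERE bd.database_id = DB_ID()
GROUP BY p.object_id
ORDER BY [BufferPoolMB] DESC
OPTION (RECOMPILE);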