原因:
之前已经写过一篇关于列存储索引的简介http://www.cnblogs.com/wenBlog/p/4970493.html,很粗糙但是基本阐明了列存储索引的好处。为了更好的理解列存储索引,接下来我们一起通过列存储索引与传统的行存储索引地对比2014中的列存储索引带来了哪些改善。由于已经很多介绍列存储,因此这里我仅就性能的改进进行重点说明。
测试场景
我创建了5个测试,尽量保证测试环境避免来自外界的重负载进而影响到结果。测试结果基于两个独立的表,分别是:
- FactTransaction_ColumnStore - 这个表仅有一个聚集列存储索引,由于列存储索引的限制,该表不再有其他索引。
- FactTransaction_RowStore - 该表将包含一个聚集索引和一个非聚集列存储索引和一个非聚集行存储索引。
首先我用脚本文件创建表和索引,然后用30m行数据填充到三个表中。由于所有的测试我都制定了最大并行度的hint ,因此可以指定内核数量来查询。
测试1-填充表
为了更好地测试,一个表由列存储索引构成,而另一个表仅有行存储索引构成。填充数据来自于另一个表'FactTransaction'。
IO 和时间统计
Table 'FactTransaction_ColumnStore'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'FactTransaction'. Scan count 1, logical reads 73462, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. (30000000 row(s) affected) SQL Server Execution Times: CPU time = 98204 ms, elapsed time = 109927 ms. Table ' FactTransaction_RowStore '. Scan count 0, logical reads 98566047, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'FactTransaction'. Scan count 1, logical reads 73462, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. (30000000 row(s) affected) SQL Server Execution Times: CPU time = 111375 ms, elapsed time = 129609 ms.
观察测试
表名 | 填充时间 | 逻辑读 |
FacTransaction_ColumnStore | 1.49 mins | 0 |
FacTransaction_RowStore | 2.09 mins | 98566047 |
测试2-比较搜索
注意这里在行存储索引上我指定表的hint,迫使表通过索引查找。
-- Comparing Seek.... SET Statistics IO,TIME ON Select CustomerFK From [dbo].FactTransaction_RowStore WITH(FORCESEEK) Where transactionSK = 4000000 OPTION (MAXDOP 1) Select CustomerFK From [dbo].FactTransaction_ColumnStore Where transactionSK = 4000000 OPTION (MAXDOP 1) SET Statistics IO,TIME OFF
IO 和时间统计
Table 'FactTransaction_RowStore'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms. Table 'FactTransaction_ColumnStore'. Scan count 1, logical reads 714, physical reads 0, read-ahead reads 2510, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 83 ms.
执行计划
观察测试2
正如上图所示,行存储索引表的索引查找远比列存储索引表查询快的多。这主要归因于2014的sqlserver不支持聚集列存储索引的索引查找。执行计划对比图中一个是索引扫描导致更多的逻辑读,因此导致了性能的下降。
表名 | 索引类型 | 逻辑读 | 运行时间 |
FacTransaction_ColumnStore | Column | 714 | 83 ms |
FacTransaction_RowStore | Row | 3 | 0 ms |
Test 3 - Comparing SCAN
注意这次我指定的hint都是索引扫描,当然列存储索引上优化器默认为索引扫描。
-- Comparing Scan.... SET Statistics IO,TIME ON Select CustomerFK From [dbo].FactTransaction_RowStore WITH(FORCESCAN) Where transactionSK = 4000000 OPTION (MAXDOP 1) Select CustomerFK From [dbo].FactTransaction_ColumnStore WITH(FORCESCAN) Where transactionSK = 4000000 OPTION (MAXDOP 1) SET Statistics IO,TIME OFF
IO 和时间统计
Table 'FactTransaction_RowStore'. Scan count 1, logical reads 12704, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 32 ms, elapsed time = 22 ms. Table 'FactTransaction_ColumnStore'. Scan count 1, logical reads 714, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 2 ms.
执行计划
观察测试3
正如之前提到的,索引扫描列存储要比行存储快,俩个逻辑读和运行时间表明列存储索引在大表扫描上是更优的方式,因此更适合于数据仓库的表。
表名 | 索引类型 | 逻辑读 | 运行时间 |
FacTransaction_ColumnStore | Column | 714 | 2 ms |
FacTransaction_RowStore | Row | 12704 | 22 ms |
测试4-聚合查询
测试行存储表使用基于聚集索引。
SET Statistics IO,TIME ON Select CustomerFK,BrandFK, Count(*) From [dbo].[FactTransaction_RowStore] WITH(INDEX=RowStore_FactTransaction) Group by CustomerFK,BrandFK OPTION (MAXDOP 4)
测试行存储表,使用CustomerFK 和BrandFK的索引。(覆盖索引)
Select CustomerFK,BrandFK, Count(*) From [dbo].[FactTransaction_RowStore] WITH(INDEX=RowStore_CustomerFK_BrandFK) Group by CustomerFK,BrandFK OPTION (MAXDOP 4)
测试行存储索引使用CustomerFK 和BrandFK的列存储索引(覆盖索引)
Select CustomerFK,BrandFK, Count(*) From [dbo].[FactTransaction_RowStore] WITH(INDEX=ColumnStore_CustomerFK_BrandFK) Group by CustomerFK,BrandFK OPTION (MAXDOP 4) Test on the columnstore table using the Clustered Index. Select CustomerFK,BrandFK, Count(*) From [dbo].[FactTransaction_ColumnStore] Group by CustomerFK,BrandFK OPTION (MAXDOP 4) SET Statistics IO,TIME OFF
IO 和时间统计
使用基于聚集索引查询行存储的表。
Table 'FactTransaction_RowStore'. Scan count 5, logical reads 45977, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 9516 ms, elapsed time = 2645 ms.
使用行存储的非聚集索引测试行存储表。(覆盖索引)
Table 'FactTransaction_RowStore'. Scan count 5, logical reads 71204, physical reads 0, read-ahead reads 2160, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 5343 ms, elapsed time = 1833 ms.
使用非聚集列存储索引测试行存储表。(覆盖索引)
Table 'FactTransaction_RowStore'. Scan count 4, logical reads 785, physical reads 7, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 141 ms, elapsed time = 63 ms.
使用聚集索引测试列存储表。
Table 'FactTransaction_ColumnStore'. Scan count 4, logical reads 723, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 203 ms, elapsed time = 118 ms.
执行计划
观察测试4
这里才是列存储索引开始“闪耀”的地方。两个列存储索引的表查询要比传统的航索引在逻辑读和运行时间上性能好得多。
表名 | 索引使用 | 索引类型 | 逻辑读 | 运行时间 |
FacTransaction_ColumnStore | ClusteredColumnStore | Column | 717 | 118 |
FacTransaction_RowStore | RowStore_FactTransaction | Row | 45957 | 2645 |
FacTransaction_RowStore | RowStore_CustomerFK_BrandFK | Row | 71220 | 1833 |
FacTransaction_RowStore | ColumnStore_CustomerFK_BrandFK | Column | 782 | 63 |
测试5-比较更新(数据子集)
这个测试中,我将更新少于100m行数据,占总数据的30分之一。
SET Statistics IO,TIME ON Update [dbo].[FactTransaction_ColumnStore] Set TransactionAmount = 100 Where CustomerFK = 112 OPTION (MAXDOP 1) Update [dbo].[FactTransaction_RowStore] Set TransactionAmount = 100 Where CustomerFK = 112 OPTION (MAXDOP 1) SET Statistics IO,TIME OFF
IO 和时间统计
Table 'FactTransaction_ColumnStore'. Scan count 2, logical reads 2020, physical reads 0, read-ahead reads 2598, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. (913712 row(s) affected) SQL Server Execution Times: CPU time = 27688 ms, elapsed time = 37638 ms. Table 'FactTransaction_RowStore'. Scan count 1, logical reads 2800296, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. (913712 row(s) affected) SQL Server Execution Times: CPU time = 6812 ms, elapsed time = 6819 ms.
执行计划
观察测试5
在这种情况下 ,列存储索引的表要比行存储的更新慢的多。
表名 | 索引类型 | 逻辑读 | 运行时间 |
FacTransaction_ColumnStore | Column | 2020 | 37638 ms |
FacTransaction_RowStore | Row | 2800296 | 6819 ms |
注意对于行存储表逻辑读还是要比行存储的要多很多。这是归因于列存储索引的压缩比率更高,因此占用更少的内存。
总结
列存储索引(包含聚集和非聚集)提供了大量的优势。但是在数据仓库上使用还是要做好准备工作。一种合适地使用情况是非聚集索引不能被更新且禁用对底层表的更新。如果是巨大且没有分区的表,可能存在一个问题,整个表的索引每次都会被重建,因此如果表是巨大的则禁止使用列存储索引。因此必须要有好的分区策略来支持这种索引。
有几个应用列存储索引的地方:事实表的聚合、Fast Track Data Warehouse Servers、恰当环境SSAS的Cube…