SSIS Data Flows



    There has been much debate in the developer community about which provider to choose when connecting to a SQL Server database.

    Performance is often cited as a discriminator, with ADO.NET deemed less performant than ODBC and OLE DB because it is a managed facade. But the differences are often glossed over, since most applications deal only with CRUD operations, for which they are negligible.

    However, as BI developers we typically work with large datasets. We therefore cannot gloss over the performance differences, particularly when implementing SSIS data flow tasks.

    A quick search online reveals very little in the way of empirical data on which is faster. Maybe I’ve been searching on the wrong terms, but the only comprehensive test data I’ve come up with is detailed by Gilbert Quevauvilliers in the following article –

    https://gqbi.wordpress.com/2013/06/19/ssis-fastest-data-flow-task-item-for-transferring-data-over-the-network/

    He compares various combinations of ADO.NET and OLE DB source and destination components for a relatively large data set – 10,000,000 rows – and found a combination of OLE DB source and destination components to be over 8x faster than ADO.NET under his testing protocol.

    Given the lack of evidence, I decided to carry out some tests myself.

    Create Source and Destination Tables

    I used AdventureWorks2014 as my source database.

    I created a [Sales].[SalesOrderDetailBIG] table with the same schema as [Sales].[SalesOrderDetail], and then populated it with 10,000,000 records, totalling just under 700MB. This is the maximum number of records that I will be testing with, and should be more than enough to cover most incremental loading scenarios.

    USE AdventureWorks2014
    GO
    CREATE TABLE [Sales].[SalesOrderDetailBIG](
            [SalesOrderID] [int] NOT NULL,
            [SalesOrderDetailID] [int] IDENTITY(1,1) NOT NULL,
            [CarrierTrackingNumber] [nvarchar](25) NULL,
            [OrderQty] [smallint] NOT NULL,
            [ProductID] [int] NOT NULL,
            [SpecialOfferID] [int] NOT NULL,
            [UnitPrice] [money] NOT NULL,
            [UnitPriceDiscount] [money] NOT NULL,
            [LineTotal] numeric(38,6) NOT NULL,
            [rowguid] [uniqueidentifier] NOT NULL,
            [ModifiedDate] [datetime] NOT NULL
    ) ON [PRIMARY]
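
    For completeness, a minimal population script along these lines would do the job – a sketch only, as any method of multiplying up [Sales].[SalesOrderDetail] (which holds roughly 121,000 rows, so about 83 passes gives just over 10,000,000) works equally well –

    USE AdventureWorks2014
    GO
    -- Sketch: multiply up the rows from [Sales].[SalesOrderDetail].
    -- [SalesOrderDetailID] is an IDENTITY column on the BIG table, so it is
    -- excluded from the column list and regenerated on insert.
    INSERT INTO [Sales].[SalesOrderDetailBIG]
            ([SalesOrderID], [CarrierTrackingNumber], [OrderQty], [ProductID],
             [SpecialOfferID], [UnitPrice], [UnitPriceDiscount], [LineTotal],
             [rowguid], [ModifiedDate])
    SELECT  [SalesOrderID], [CarrierTrackingNumber], [OrderQty], [ProductID],
            [SpecialOfferID], [UnitPrice], [UnitPriceDiscount], [LineTotal],
            [rowguid], [ModifiedDate]
    FROM    [Sales].[SalesOrderDetail]
    -- GO n repeats the preceding batch n times (an SSMS/sqlcmd feature).
    GO 83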
    

    My destination is a SQL Server 2014 database on the same server – my underpowered 8GB i5 quad-core development laptop – so there will be no network latency between the source and destination databases. This is not representative of most scenarios, but I’ve only got access to one server at the moment so needs must.

    The destination table is as follows –

    USE RWDW_Staging 
    GO
    CREATE TABLE [dbo].[SalesOrderDetail_Staging](
            [SalesOrderID] [int] NOT NULL,
            [SalesOrderDetailID] [int] NOT NULL,
            [CarrierTrackingNumber] [nvarchar](25) NOT NULL,
            [OrderQty] [smallint] NOT NULL,
            [ProductID] [int] NOT NULL,
            [SpecialOfferID] [int] NOT NULL,
            [UnitPrice] [money] NOT NULL,
            [UnitPriceDiscount] [money] NOT NULL,
            [ModifiedDate] [datetime] NOT NULL,
     CONSTRAINT [PK_SalesOrderDetail_Staging] PRIMARY KEY CLUSTERED (
            [SalesOrderID] ASC,
            [SalesOrderDetailID] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO
    
    CREATE NONCLUSTERED INDEX [IX_SalesOrderDetail_Staging] ON [dbo].[SalesOrderDetail_Staging]
    (
            [ModifiedDate] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO
    

    Please note that the destination table does not include the [LineTotal] and [rowguid] columns. It also has a composite PK on [SalesOrderID] and [SalesOrderDetailID], and a non-clustered index on [ModifiedDate]. More about these features later.

    Create Connection Managers

    I created 6 connection managers, one each of ADO.NET, OLE DB, and ODBC for the source and destination databases –

    Connection Managers

    For ADO.NET and OLE DB I used the ".Net Providers\SqlClient Data Provider" and "Native OLE DB\SQL Server Native Client 11.0" data providers.

    For ODBC I configured System DSNs for the source and destination, using the "ODBC Driver 11 for SQL Server" driver. According to the SQL Server 2014 developer guide, this is the successor to the "SQL Server Native Client 11.0" driver, and should be used in its place –

    http://msdn.microsoft.com/en-us/library/cc280510.aspx

    I kept the default settings for all the connection managers.

    Comparison of ADO.NET, OLE DB, and ODBC Source Components

    I wanted to initially compare the source data flow components in isolation from the destination components, so I added 3 data flow tasks to my package, and configured them as follows –

    ADO.NET Source Data Flow Task

    ADONET Source

    By adding a RowCount transformation, I do not need to add a destination. Hence I have isolated the source from the destination. The RowCount transformation adds very little overhead and should not skew the results.

    ADONET Source Config

    I have selected the LocalHost AdventureWorks2014 ADO NET connection manager, and used a SQL command to select only the columns that I need. It is obviously important to select only the required columns rather than all of them – it can make a significant difference to performance, yet is easily overlooked.
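
    The screenshot doesn’t show the full command, but given the columns required it will be along these lines –

    SELECT  [SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber],
            [OrderQty], [ProductID], [SpecialOfferID], [UnitPrice],
            [UnitPriceDiscount], [ModifiedDate]
    FROM    [Sales].[SalesOrderDetailBIG]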

    OLE DB Source Data Flow Task

    The OLE DB Source test data flow follows the same pattern –

    OLEDB Source

    The OLE DB Source component allows you to choose the subset of columns to utilise – in this case I have deselected [LineTotal] and [rowguid] –

    OLEDB Source Config

    You might think, then, that rather than having to write a SQL command for the source, you could combine the "Table or View" data access mode with the selection of only the columns that you need.

    But all is not as it seems. A SQL Server Profiler trace reveals the following –

    Profiler

    SSIS is still emitting a “select *” statement and so is retrieving all columns.
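
    In other words, despite the deselected columns, the statement sent to the server is of the form –

    select * from [Sales].[SalesOrderDetailBIG]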

    The answer, as per the ADO.NET source, is to use a SQL Command to select only the columns that are needed. I have done this for the test.

    ODBC Source Data Flow Task

    ODBC Source

    The ODBC Source component has a BatchSize property. The Profiler trace reveals the following –

    Profiler 5

    An sp_cursorpreexec system stored procedure is executed to prepare a cursor and populate it with the full data set. The sp_cursorfetch stored procedure is then repeatedly executed to fetch the data 1,000 rows at a time. The final parameter is the batch size specified in the BatchSize property.

    I tried a range of BatchSize values, and found negligible difference in performance. I did however hit a problem when attempting to retrieve the data in a single batch by setting the BatchSize to 10,000,000. Performance was significantly degraded, as the SSIS internal buffer size was smaller than the batch size, and a big overhead was incurred in allocating more memory to the buffer –

    Information: 0x4004800C at ODBC Source, SSIS.Pipeline: The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 0 buffers were considered and 0 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked.
    
    Information: 0x4004800F at ODBC Source: Buffer manager allocated 2 megabyte(s) in 1 physical buffer(s).
    
    Information: 0x40048010 at ODBC Source: Component "ODBC Source" (2) owns 2 megabyte(s) physical buffer.
    

    Source Data Flow Component Results

    I wanted to see if the number of records affected the results. I therefore modified the SQL commands using a "TOP n" clause, and ran the tests for 10,000,000, 1,000,000, 100,000, and 10,000 records respectively for each data source. I also ran each test 3 times over in order to account for chatter on the server. The only other processes running during the testing were background OS and SQL Server processes, but sometimes these can skew the results.
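
    For example, the 1,000,000-row variant of the source command looks like this –

    SELECT TOP (1000000)
            [SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber],
            [OrderQty], [ProductID], [SpecialOfferID], [UnitPrice],
            [UnitPriceDiscount], [ModifiedDate]
    FROM    [Sales].[SalesOrderDetailBIG]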

    Results are as follows –

    Source    Records      Data (MB)  Elapsed Time  MB/s
    ADO.NET   10,000,000   639.727    37.737        16.952
    ADO.NET   1,000,000    63.973     4.025         15.894
    ADO.NET   100,000      6.397      0.483
    ADO.NET   10,000       0.640      0.109
    OLE DB    10,000,000   639.727    15.693        40.765
    OLE DB    1,000,000    63.973     1.716         37.280
    OLE DB    100,000      6.397      0.297
    OLE DB    10,000       0.640      0.078
    ODBC      10,000,000   639.727    1:13.461      8.708
    ODBC      1,000,000    63.973     6.614         9.672
    ODBC      100,000      6.397      0.687
    ODBC      10,000       0.640      0.234

    Gilbert found some benefit from optimizing the source connection manager by changing the packet size property to the maximum possible value, 32,767, and by adding the SQL hints "WITH (NOLOCK) OPTION (FAST 10000)". Neither of these were applicable to my testing protocol: both my databases are on the same server, so the data is not being sent over a network; and I am not competing with other processes for access to the data, so I do not have to wait for other processes to release locks on the source table.
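
    For reference, applying Gilbert’s hints to my source query would look like this (not used in my tests, for the reasons above) –

    SELECT  [SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber],
            [OrderQty], [ProductID], [SpecialOfferID], [UnitPrice],
            [UnitPriceDiscount], [ModifiedDate]
    FROM    [Sales].[SalesOrderDetailBIG] WITH (NOLOCK)
    OPTION  (FAST 10000)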

    But as with all SQL hints, you are overriding the optimizer, and so there may or may not be an improvement – Gilbert, to be fair, does say they "sometimes" improve performance. You should still consider all of the above suggestions when attempting to optimize your own processes.

    There’s a good article by Rob Farley on why you should consider using the FAST 10000 hint –

    http://sqlblog.com/blogs/rob_farley/archive/2011/02/17/the-ssis-tuning-tip-that-everyone-misses.aspx

    Under my testing protocol, the OLE DB source in isolation is approximately 2.5x quicker than the ADO.NET source, and a whopping 3.8x quicker than the ODBC source, for relatively large data sets. For small data sets the difference is much smaller, but I suspect only because the overhead of initializing the classes and establishing the connection obscures the real throughput. This is why, in the table, I have left the MB/s field blank for the 100,000 and 10,000 row tests.

    Comparison of ADO.NET, OLE DB, and ODBC Destination Components

    I will now test the ADO.NET, OLE DB, and ODBC destination components. To keep things simple, I will only be using the OLE DB source to do this.

    ADO.NET Destination Data Flow Task

    ADONET Destination

    ADONET Destination Config

    On the ADO.NET Destination component, I have checked the "Use Bulk Insert when possible" option. We obviously prefer bulk inserts to single-row inserts. This option was introduced with SQL Server 2008 R2; if you are using 2008 or earlier then the ADO.NET destination will use single-row inserts and so will be dreadfully slow. Apparently the "Use Bulk Insert when possible" option was added to enable the transfer of data to SQL Azure, given that connecting to SQL Azure using OLE DB is not supported by Microsoft (although it does work).

    I ran the data flow task 3 times for 10,000,000 rows. The results were – 2:55.220; 2:54.690; and 2:54.784.

    The Profiler trace reveals that the component is performing a bulk insert, but is inserting in batches of 9,557 rows –

    Profiler 2

    SSIS buffers the data during a data flow. The data flow task has a couple of properties for controlling the buffer size – DefaultBufferMaxRows and DefaultBufferSize. DefaultBufferMaxRows applies an upper limit to the number of rows in the buffer; I set this to 10,000,000, as this is the number of rows that I am transferring. DefaultBufferSize is the size of the buffer in bytes; I set this to the maximum possible, which is 100 MB.

    Task Props

    In addition, I increased the command timeout property of the ADO.NET destination component from 30 to 300 seconds to avoid timeouts, and I kept the BatchSize property at 0, which instructs SSIS to use the same batch size as the internal buffer.

    ADONET Destination Props

    The net effect is that the bulk inserts were now being performed in batches of 728,177 rows, constrained by the 100 MB DefaultBufferSize (100 MB ÷ 728,177 rows ≈ 144 bytes per pipeline row) –

    Profiler 3

    But contrary to expectation, increasing the buffer and batch size actually had a detrimental effect on performance, with the task now taking over 4 minutes to complete. I tried a range of DefaultBufferMaxRows values, and found 20,000 to produce the best results, as follows – 2:47.483; 2:51.055; and 2:53.297.

    However, this is only marginally faster than the starting point of 10,000. I would expect the optimal value to depend very much on context, so it is worth trying different values – but do not expect it to greatly improve performance with the ADO.NET destination.

    As mentioned above, the staging table has a PK constraint, and hence is a B-tree rather than a heap. It also has a non-clustered index, which of course consumes resources to maintain.

    Deleting the non-clustered index results in the following – 2:11.950; 2:18.155; and 2:19.824 – an average improvement of 33 seconds. But this must be offset slightly by the 12 seconds that it takes to recreate the index once all records have been inserted.

    Additionally dropping the PK constraint results in the following – 1:51.344; 1:51.853; and 1:59.183. But it took over 2 minutes to recreate the constraint, so on balance dropping the PK was detrimental.
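
    For reference, the drop/recreate pattern for the index and the constraint is straightforward – a sketch using the definitions from earlier –

    -- Before the data flow:
    DROP INDEX [IX_SalesOrderDetail_Staging] ON [dbo].[SalesOrderDetail_Staging];
    ALTER TABLE [dbo].[SalesOrderDetail_Staging]
        DROP CONSTRAINT [PK_SalesOrderDetail_Staging];
    
    -- After the data flow:
    ALTER TABLE [dbo].[SalesOrderDetail_Staging]
        ADD CONSTRAINT [PK_SalesOrderDetail_Staging]
        PRIMARY KEY CLUSTERED ([SalesOrderID] ASC, [SalesOrderDetailID] ASC);
    
    CREATE NONCLUSTERED INDEX [IX_SalesOrderDetail_Staging]
        ON [dbo].[SalesOrderDetail_Staging] ([ModifiedDate] ASC);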

    OLE DB Destination Data Flow Component

    OLEDB Destination

    I configured the OLE DB Source component to restrict the columns to the required subset using a SQL command –

    OLEDB Destination Config

    And I configured the OLE DB Destination component to use the “Table or view – fast load” data access mode, and hence enable bulk rather than row-by-row inserts –

    OLEDB Destination Config 2

    The staging table had both the PK constraint and the non-clustered index enabled.

    My initial results with 10,000,000 rows were as follows – 2:09.981; 2:05.238; and 2:00.589. This is not far off the best results achieved with the ADO.NET Destination. But we’ve only just started.

    The destination database is configured to use the Bulk-logged recovery model, which aims to minimally log bulk operations (a sketch for setting and verifying this follows the list below). This should be significantly more performant than the Full recovery model, assuming the bulk insert meets the criteria required for minimal logging. The criteria for the target table are as follows –

    • The target table is not being replicated.
    • Table locking is specified (using TABLOCK).
    • If the table has no indexes, data pages are minimally logged.
    • If the table does not have a clustered index but has one or more non-clustered indexes, data pages are always minimally logged. How index pages are logged, however, depends on whether the table is empty –
      • If the table is empty, index pages are minimally logged.
      • If table is non-empty, index pages are fully logged.
    • If the table has a clustered index and is empty, both data and index pages are minimally logged. In contrast, if a table has a clustered index and is non-empty, data pages and index pages are both fully logged regardless of the recovery model.
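
    For reference, switching the recovery model is a one-liner, and the undocumented fn_dblog function gives a rough way to sanity-check how much is actually being logged – a sketch, assuming suitable permissions, and for use on a test system only –

    -- Use the Bulk-logged recovery model on the destination database.
    ALTER DATABASE [RWDW_Staging] SET RECOVERY BULK_LOGGED;
    
    -- Rough check on logging volume: count the log records touching the
    -- staging table before and after a load.
    SELECT COUNT(*) AS [LogRecordCount]
    FROM   fn_dblog(NULL, NULL)
    WHERE  [AllocUnitName] LIKE '%SalesOrderDetail_Staging%';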

    My destination table is not being replicated, and the Profiler trace reveals that the TABLOCK hint is being applied with the "insert bulk" statement –

    insert bulk [dbo].[SalesOrderDetail_Staging]([SalesOrderID] int,[SalesOrderDetailID] int,[CarrierTrackingNumber] nvarchar(25) collate Latin1_General_CI_AS,[OrderQty] smallint,[ProductID] int,[SpecialOfferID] int,[UnitPrice] money,[UnitPriceDiscount] money,[ModifiedDate] datetime)with(TABLOCK,CHECK_CONSTRAINTS)
    

    My destination table has a clustered index (the PK) and also a non-clustered index. However, I am truncating the table before executing the data flow task. But there is a caveat – the data is being inserted in batches, and the table is only empty for the 1st batch. As such, only the 1st batch is being minimally logged.

    As with the ADO.NET Destination test, I dropped the non-clustered index, but kept the PK constraint. I also increased the DefaultBufferMaxRows and DefaultBufferSize settings –

    Task Props 2

    And then set the OLE DB Destination component’s "Maximum insert commit size" property to 0, which tells SSIS to commit the rows as a single batch at the end of the data flow –

    OLEDB Destination Config 3

    If I were to run the data flow now, performance would still be relatively slow, as the rows would be written to TempDB and sorted by the PK fields before being written to the destination table. To ensure that the rows are written directly to the destination table, I added an ORDER BY clause to the OLE DB Source SQL command –

    OLEDB Destination Config 4

    And finally, I added an ORDER hint to the OLE DB Destination component’s FastLoadOptions property, specifying the PK fields –

    OLEDB Destination Config 5
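
    Pulling those two settings together – a sketch of the source command, plus the FastLoadOptions value –

    -- OLE DB Source SQL command, sorted on the PK fields:
    SELECT  [SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber],
            [OrderQty], [ProductID], [SpecialOfferID], [UnitPrice],
            [UnitPriceDiscount], [ModifiedDate]
    FROM    [Sales].[SalesOrderDetailBIG]
    ORDER BY [SalesOrderID], [SalesOrderDetailID];
    
    -- OLE DB Destination FastLoadOptions (a component property, not T-SQL):
    -- TABLOCK,ORDER(SalesOrderID ASC, SalesOrderDetailID ASC)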

    My final results for the OLE DB Destination are as follows – 0:43.197; 0:43.493; and 0:43.493.

    ODBC Destination Component

    I configured the Data Flow task and OLE DB Source component as per the ADO.NET and OLE DB tests. I then configured the ODBC Destination component to use the “Table Name – Batch” data access mode as follows –

    ODBC Destination Config

    In the SQL Server 2014 documentation Microsoft describes the batch load option as follows –

    Batch: In this mode the ODBC destination attempts to use the most efficient insertion method based on the perceived ODBC provider capabilities. For most modern ODBC providers, this would mean preparing an INSERT statement with parameters and then using a row-wise array parameter binding (where the array size is controlled by the BatchSize property). If you select Batch and the provider does not support this method, the ODBC destination automatically switches to the Row-by-row mode.

    http://msdn.microsoft.com/en-us/library/hh758691.aspx

    I executed the task and looked at the Profiler trace, and pretty soon realised that the data was being inserted row-by-row –

    Profiler 6
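
    Each row arrives as its own parameterised INSERT, of the general form below (illustrative – not a verbatim trace line) –

    INSERT INTO [dbo].[SalesOrderDetail_Staging]
            ([SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber],
             [OrderQty], [ProductID], [SpecialOfferID], [UnitPrice],
             [UnitPriceDiscount], [ModifiedDate])
    VALUES  (?, ?, ?, ?, ?, ?, ?, ?, ?)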

    I stopped the task as it was taking too long to complete.

    I cannot see any obvious reason for the data being inserted row-by-row rather than in a batch. I need to investigate this further at a later date, and will return and complete the tests when I have a resolution. Unfortunately, for now, I am not including the ODBC destination component in the results.

    SQL Server Destination Component

    Another option for connecting to a destination SQL Server database is to use the SQL Server Destination component. I have excluded this from the test, as it is limited to local databases only, and so restricts you to running the package on the destination server. This may fit with your current deployment plan, but it potentially limits extensibility and scalability. There would have to be good performance benefits for me to consider accepting this restriction, but most commentators say that the performance is very similar to the latest incarnations of the OLE DB Destination. To back this up, the description of the component in the BIDS toolbox includes the following note –

    Note: to optimize performance, we recommend using the OLE DB destination instead.

    Destination Data Flow Component Results

    As per the Source data flow component tests, I modified the SQL commands using a "TOP n" clause, and ran the tests for 10,000,000, 1,000,000, 100,000, and 10,000 records respectively for each destination. And I ran each test 3 times over in order to account for chatter on the server.

    Source  Destination  Records     Data (MB)  Elapsed Time  MB/s
    OLE DB  ADO.NET      10,000,000  639.727    2:16.543      4.685
    OLE DB  ADO.NET      1,000,000   63.973     13.004        4.919
    OLE DB  ADO.NET      100,000     6.397      1.347
    OLE DB  ADO.NET      10,000      0.640      0.265
    OLE DB  OLE DB       10,000,000  639.727    43.394        14.742
    OLE DB  OLE DB       1,000,000   63.973     4.649         13.760
    OLE DB  OLE DB       100,000     6.397      0.577
    OLE DB  OLE DB       10,000      0.640      0.156

    Why is the OLE DB Destination so much faster than the ADO.NET Destination?

    Although not statistically significant, my results suggest that the OLE DB Destination is approximately 3 times faster than the ADO.NET Destination when transferring reasonably large data sets.

    I can think of 2 reasons that may explain this –

    • The OLE DB Destination component gives us more control over the bulk load. In particular, it allows us to apply hints via the FastLoadOptions property to ensure that the data is inserted directly into the destination table, rather than being written to TempDB and sorted to accommodate the PK constraint first. It also allows us to commit the whole data set as a single batch.
    • The OLE DB destination inserts rows using the following statement –
    insert bulk [dbo].[SalesOrderDetail_Staging]([SalesOrderID] int,[SalesOrderDetailID] int,[CarrierTrackingNumber] nvarchar(25) collate Latin1_General_CI_AS,[OrderQty] smallint,[ProductID] int,[SpecialOfferID] int,[UnitPrice] money,[UnitPriceDiscount] money,[ModifiedDate] datetime)with(TABLOCK,CHECK_CONSTRAINTS)

    Compare this to the ADO.NET destination statement –

    insert bulk "dbo"."SalesOrderDetail_Staging" ([SalesOrderID] Int, [SalesOrderDetailID] Int, [CarrierTrackingNumber] NVarChar(25) COLLATE Latin1_General_CI_AS, [OrderQty] SmallInt, [ProductID] Int, [SpecialOfferID] Int, [UnitPrice] Money, [UnitPriceDiscount] Money, [ModifiedDate] DateTime)
    

    There are no table hints – in particular, no TABLOCK hint. TABLOCK is necessary for minimal logging, so even if we meet all the other criteria, the ADO.NET destination will never achieve a minimally logged load.

    In Conclusion

    • I found the OLE DB Source to be approximately 2.5x faster than the ADO.NET Source, and approximately 3.8x faster than the ODBC Source, when reading a large number of rows from a SQL Server database hosted on the same server as SSIS.
    • When combined with an OLE DB Source, I found the OLE DB Destination to be 3x faster than the ADO.NET Destination when transferring a large number of rows between SQL Server databases hosted on the same server.
    • Although I selected the "Table Name – Batch" data access mode, the ODBC Destination performed row-by-row inserts, which is not acceptable for a large data set. I need to investigate this further to determine a resolution.
    • My source and destination databases were hosted on the same server as SSIS. As such, my testing protocol does not account for network latencies or constraints. Arguably my testing protocol is flawed because of this. I will endeavour to repeat the tests on a more realistic infrastructure when one becomes available to me.
    • My results aren’t statistically significant – 3 observations per population – and my testing protocol is flawed. But I think they add to the anecdotal evidence suggesting that OLE DB should be the first port of call when building SSIS data flows.

    Caveat Emptor

    OLE DB would appear to be the most performant option for SSIS data flow SQL Server source and destination components.

    However, in August 2011 Microsoft announced that they would be deprecating the SQL Server OLE DB provider, with SQL Server 2012 being the last release to support it. True to their word, the provider has not been updated for SQL Server 2014, which is why I have used the 2012 provider "Native OLE DB\SQL Server Native Client 11.0" for my OLE DB connections.

    As discussed in their announcement, Microsoft are aligning their strategy with ODBC, and recommend using ODBC for new projects whilst planning for the upgrade of existing projects. I will not be doing either until it is clear that the performance of ODBC has been improved to match that of OLE DB. I’m sure Microsoft are working on it.

