Very Large Data Conversion Process

VLDB SQL Server conversions

By Valentin Iamandi, Principal DBA

The conversion myth

“The conversion is slow because there is a lot of data” – a SQL developer

Context


Some of the big conversion SQL Server databases projects are supposed to not only convert very large data sets (billions of rows and multiple terabytes of data) but also to expand the data so it can fit a very different database structure, semi-normalized.

Problem


I was brought in to help with such a project: upgrade a huge banking app that was using TB of data. The constraint was to run the conversion in under 48hrs (a weekend during which most of the banks will be closed). Until my arrival, the conversion was taking 2 weeks to finish in testing.

The data conversion was very slow, the SSIS packages execution was not able to finish under 2 weeks, in the last 6 months of the project duration. The project team had 5 SSIS developers and 2 DBAs.

Solution

To increase the performance, I applied the following VLDB best conversion practices. Using these principles, we were able to increase the performance by 75% in 3 months. 48hrs target achieved.

Performance checklist for SQL Server VLDB Conversions

 System or SQL Server level optimization

  1. Have the conversion database on the same physical server as the final destination database if possible.
  2. Pre-grow all databases file groups to correct size for data being added.  This should include the staging, the conversion and the destination database(s) or any other database involved in receiving large amounts of data.  This includes transaction log.  Transaction logs should be grown at once to the max size needed to avoid excessive VLF’s.
  3. Make sure the conversion database is on separate luns so space can be returned to SAN team after conversion.
  4. Ensure TEMPDB has adequate space and is pre-grown in advance of conversion.
  5. All Databases involved in the conversion process and if they are receiving large amounts of data must be run in simple mode if possible.
  6. Replication of large tables, mirroring and log shipping need to be dropped during conversion. Reestablish these processes after the conversion is complete.
  7. The static data conversion needs to be done in advance in order to reduce critical execution time.  This includes lookup tables, parent tables, History Tables, Static Images, etc.
  8. Appropriate grouping(s) of dependent tables should be determined.  Each of these groups should be executed in parallel so maximum performance and minimum elapsed time can be achieved.  This may mean multiple jobs, multiple steps, or multiple concurrent streams in SSIS packages.
  9. Existing SSIS packages (already developed) are usually serial vs. parallel in nature.  Tasks reorganization needs to be achieved for parallel processing.
  10. All big conversion steps should be able to be restarted.  Anything can and will fail at the most inopportune time.  We want to be able to fix the issue and resume without loss of time.  The cost of resume-ability vs. re-running the step needs to be measured.  A non-re-start-able process would need to be accepted on a case by case basis. The target is to have a 100% re-start-able process.
  11. SSIS packages must be used when data is to be moved from one physical server to another physical server. In some particular situations we can use  backup and restore, log shipping or “BCP out and BCP in”.
  12. In SQL Server operations should be used whenever possible to achieve optimal performance.  Any time the data needs to leave SQL server and go to other applications and then back into SQL server performance will degrade.  The exception to this is when the logic required cannot be performed within SQL server or can be performed in another manner more efficiently to justify the OLEDB performance overhead.
  13. Transactional units of work and performance vs. commit\rollback should be carefully considered when coding data conversion. Batches of 100k, 1000k or 10000k batches (set based) may be suitable values depending on circumstances. All batches sizes need to be tested. These should be chosen carefully and may vary per table or data conversion area.
  14. When large loads are used to transfer data into tables, the new Microsoft SSIS transform Balanced Data Distributor need to be considered.  This transform task takes a single input and distributes the incoming rows to one or more outputs uniformly via multithreading. http://www.microsoft.com/en-us/download/details.aspx?id=4123 ; In my testing, 5 or 6 parallel threads are optimal.  All indexes should be dropped from the destination tables including the clustered constraint before the parallel Insert operation takes place.
  15. Under certain circumstances, parallel stage areas processing can be considered to improve performance.  This could be by batch or any other mechanisms that ensure performance, data integrity, and guarantees processing the data only once.  Special care must be employed when using the staging area to ensure data inserts into the tables do not conflict with each other and cause performance degradation.  An example of this would be parallel staging areas where we copy data for further processing. This is one of the best ways to make sure the process can be restarted from a moment in time.
  16. Mid conversion database backups may be required for recovery points or reporting purposes.  This will need to be determined during the review of the process.

SQL Optimizations

  1. Identity values should be pre-allocated in the target environment in advance so that we do not need to deal with complex identity issues when inserting the big data into destination tables.  This should be performed in a way that minimal identity updates occur.
  2. Usage of cursors should be avoided unless absolutely required. Set-based data processing is advised.
  3. Merge statements need to be replaced with regular SQL DML operations: Insert, Update and Delete. These are much easier to manage and optimize.
  4. Inline TSQL should be replaced with stored procedures if possible.  Stored procedures calls should be used with appropriate parameters passed in to control transaction levels and restart-ability.  This will allow the code to be re-used.  Control tables can be used in place of certain parameters.
  5. Top N and loops can be used to control transaction levels.
  6. Indexes not needed in the conversion process need to be dropped and recreated later. Only the needed indexes for the conversion process need to be considered and evaluated.  
  7. Indexes should be dropped before large table loads and re-created after the load.  The exceptions to this are indexes that are required for the load process to perform acceptably.
  8. Indexes rebuilds should not occur during the conversion. Update statistics is advised if fresh statistics are required for a subsequent step.
  9. Run update statistics after any large data group load.
  10. Perform logical indexes testing and verification after a few runs of the conversion testing to see if some indexes can be dropped and other missing indexes created.
  11. Eliminate “Order BY” from queries unless the development team or the business rules justify the usage.  The Order BY creates a Sort operation which, when used in queries involving big tables, can be very costly.
  12. When we have big or complex queries using the OR clause, it may be better to use Union or Union ALL. The OR clause is slower in most cases than using Union or Union ALL (depends on the SQL version).  Testing is required.
  13. The In, Not IN, Exists or Not exists statements related to query external tables, need to be tested, verified and replaced if needed with Left Joins.
  14. Some conversion tasks may be using the same group of tables joined together to prepare multiple insert statements into different big tables. It may be faster to group the data from these tables grouped together into a permanent or temporary table (based on context), create an index on it and use it in subsequent inserts. In this way, we query the data only once and we insert it many times (we should be able to use the new table on at least 2 or 3 different queries).
  15. All TSQL should be performance tuned and optimized.
  16. Design the conversion process with reusability in mind.
  17. Some queries may use functions in the Where clause like substring, left, replace, etc. When using this type of functions in the Where clause the SQL Server engine does not use indexes. The solution to this problem is to:
  18. Copy the column content used by the function to a new column,
  19. Update the data in the new column by applying the text function on the data,
  20. Create an index or add the column to an already existing index on the same table.
  21. Test to see if the index is used properly.
  22. Use the new column in the where clause condition in place of the construct with the function.

Note: Because of these rules and me applying them, today we can still buy houses using the American banking system (😊)

Comments