We did POC for SharePlex and were getting updated source system data on SQL server side from Oracle.Now the next challenge was to get the star schema based Fact and Dimension tables updated for the incremental change of data.
So we decided to use CDC introduced since SQL server 2008.
Arshad Ali in his article on databasejournal has nicely explained where can we use CDC in a data warehouse system as follows:
There are two approaches to load\refresh data from a source system to a data warehouse;
the first approach is to load the full set of data every time
and the second approach is to load the full set of data for the first time
and then subsequently load only the changed data set as incremental data load.
As you might have guessed, the first approach is not feasible especially in cases where you have humongous volume of data in your source table
whereas in case of the second approach you need to have some mechanism to identify the changed data set in the source table after the last data pull
so that only those changed data sets can be considered while pulling it from the source table and to load into the data warehouse.
Change Data Capture (CDC) is an Enterprise edition (available in Developer and Evaluation editions as well) feature.
It captures DML changes (insert, update and delete activities) on a tracked table.
How CDC Works
Once CDC is enabled on a tracked table, SQL Server uses an asynchronous capture mechanism that reads the SQL Server transaction logs and populates the CDC table
(table which keeps track of history of changes along with meta-data about changes) with the row’s data that changes.
This feature is entrenched in transaction log architecture, thus a lot of the metadata in CDC is related around the concept of a Log Sequence Number
(LSN, every record in the transaction log is uniquely identified by a LSN.
LSNs are ordered such that if LSN2 is greater than LSN1, the change described by the log record referred to by LSN2 occurred after the change described by the log record LSN1).
Enabling CDC is a 2 step process – first at database level and then at table level.
–This command can be executed by a member of the sysadmin role
@source_schema = N’Person’,
@source_name = N’Address’,
@role_name = N’cdc_Admin’,
@supports_net_changes = 1
When the feature on the table is successfully enabled, the following messages are shown
Job ‘cdc.AdventureWorks2012_capture’ started successfully.
Job ‘cdc.AdventureWorks2012_cleanup’ started successfully.
There are four parameters available: @captured_column_list, @filegroup_name, @role_name, and @supports_net_changes
By default, all columns in the table are tracked.
If you want to track only the specific ones, use the @captured_column_list parameter.
The syntax is
@captured_column_list = N’AddressLine1, AddressLine2, City’
by default, the CDC table is created in the default filegroup of the database.
Though you can specify to create it on another filegroup using @filegroup_name parameter.
When the @role_name parameter is set to NULL, only members of sysadmin and db_owner roles have full access to captured information.
When set to a specific role(cdc_Admin), only the members of the role (called a gating role) can access the changed data table.
The @supports_net_changes parameter enables to show multiple changes aggregated as a single one.
This parameter can be used only on tables that have a primary key or unique index
For example, if the row was first inserted (_$operation = 2), and deleted (_$operation = 1) afterwards,
the net change will be that nothing has happened.
The feature captures both transactions, but the @supports_net_changes parameter enables to see both individual and net changes
When you enable CDC for a table, SQL Server creates the table (with this naming convention: cdc._CT and keeps recording DML changes happening to the tracked table in this table.
For example, in the current example following query returns the result-set of changes captured for the above changes:
SELECT * FROM [cdc].[dbo_Address_CT]
Along with the data there are some columns which capture meta information about the changes. For example, __$operation column captures the DML operation needed to apply the row of change data to the target data source. Valid values are 1 = delete, 2 = insert, 3 = value before update and 4 = value after update. __$update_mask is a bit mask representing columns that were changed during DML operations. It means delete (__$operation = 1) and insert (__$operation = 2) operation will have value set to 1 for all defined bits whereas for update (__$operation = 3 and __$operation = 4) only those bits corresponding to columns that changed are set to 1.
When you enable CDC, several functions are created to return changes. For example, cdc.fn_cdc_get_all_changes_ function returns one row for each change applied to the CDC tracked table within the specified log sequence number (LSN) range. If a source row had multiple changes during the specified range interval, each change is represented in the returned result set whereas cdc.fn_cdc_get_net_changes_ function returns one net change row for each source row changed within the specified LSN range. That is, when a source row has multiple changes during the specified LSN range, a single row that reflects the final content of the row is returned.
For example, as you can see below cdc.fn_cdc_get_all_changes_dbo_Employee function returns two rows for two updates of EmployeeID = 3:
DECLARE @MinimumLSN binary(10), @MaximumLSN binary(10) SET @MinimumLSN = sys.fn_cdc_get_min_lsn('dbo_Employee') SET @MaximumLSN = sys.fn_cdc_get_max_lsn() SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Employee (@MinimumLSN, @MaximumLSN, N'all'); GO
Two Rows Updated
Whereas cdc.fn_cdc_get_net_changes_dbo_Employee function returns net or final changes for EmployeeID = 3 in a single record:
DECLARE @MinimumLSN binary(10), @MaximumLSN binary(10) SET @MinimumLSN = sys.fn_cdc_get_min_lsn('dbo_Employee') SET @MaximumLSN = sys.fn_cdc_get_max_lsn() SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_Employee (@MinimumLSN, @MaximumLSN, N'all'); GO
Final Changes in a Single Record
The above functions expect starting LSN and ending LSN and returns the changes between these two LSNs. You can use sys.fn_cdc_get_min_lsn to get start LSN for the specified capture instance and sys.fn_cdc_get_max_lsn to get ending LSN from cdc.lsn_time_mapping system table.
Sometimes, you would like to pull data based on a time range instead of LSN range and hence you can use the sys.fn_cdc_map_time_to_lsn function to get start LSN from cdc.lsn_time_mapping system table for the specified time.
The Change Data Capture jobs
When the Change Data Capture feature is enabled for the first table in the database, two SQL Server jobs are automatically created – one to capture the changes and another for cleaning up the old captured information
The capture job is in charge of capturing data changes and processing them into change tables
“It runs continuously, processing a maximum of 1000 transactions per scan cycle with a wait of 5 seconds between cycles. The cleanup job runs daily at 2 A.M. It retains change table entries for 4320 minutes or 3 days, removing a maximum of 5000 entries with a single delete statement.” 
Like other SQL Server jobs, the capture job can be stopped and started. When the job is stopped, the online transaction log is not scanned for the changes, and changes are not added to the capture tables. The change capturing process is not broken, as the changes will be processed once the job is started again. As the feature that can cause a delay in log truncating, the un-scanned transactions will not be overwritten, unless the feature is disabled on the database. However, the capture job should be stopped only when necessary, such as in peak hours when scanning logs can add load, and restarted afterwards.
You can disable CDC for either each individual table or disable it at the database level. As I said before, for each CDC enabled database there will be two jobs created as shown below. The first job captures the information from the SQL Server transaction log as it works in asynchronous manner whereas the second job cleans up the tracked table. The cleanup process occurs every three days by default (this is configurable and can be changed as per specific needs). For more intense environments, where you want to directly manage the CDC table cleanup process, you can leverage the manual method using the system stored procedure sys.sp_cdc_cleanup_change_table. When we execute this system procedure you need to specify the low LSN and any change records occurring before this point are removed and the start_lsn is set to the low LSN we specified.
SQL Server Agent
As these jobs are essential for CDC to work properly, SQL Server Agent must be running.
Disabling Change Data Capture (CDC)
You can disable CDC for either each individual table or disable it at the database level, which in effect will disable CDC for all the tables of the given database. You need to be a member of the db_owner fixed database role to disable CDC at the table level and a member of the sysadmin fixed server for disabling at database level.
--To disable CDC for a table in a database for a given capture instance EXEC sys.sp_cdc_disable_table @source_schema = N'dbo', @source_name = N'Employee', @capture_instance = N'dbo_Employee' GO --To disable CDC for the database in the context EXEC sys.sp_cdc_disable_db GO
Disabling CDC at the database level removes all associated CDC metadata, including the CDC user and schema and the CDC jobs.
You can execute the scripts below to get more information about CDC configuration:
--Returns CDC configuration information for a specified schema and table EXECUTE sys.sp_cdc_help_change_data_capture @source_schema = N'db', @source_name = N'Employee'; GO --Returns CDC configuration information for all tables in the database EXECUTE sys.sp_cdc_help_change_data_capture GO