Wednesday, 7 July 2010

MERGE and Slowly Changing Dimensions

Blimey - has it really been over 2 months since my last post? A few things have happened since then - not least moving job from Clarity to MITIE. Anyway on with the subject of the post.

Using the MERGE statement is pretty well documented however what about using it to update a dimension that includes Type 1 and Type 2 changes? There are examples on the web but they're either not SQL specific or seem flawed in one way or another. To follow this example you'll need the AdventureWorksDW2008 database available here. The R2 version of AdventureWorksDW might work but as yet I haven't tested it. Firstly lets set up a staging table and a dimension table:



SELECT ProductKey,ProductAlternateKey AS ProductId,EnglishProductName,

       ISNULL(ListPrice, 0) AS ListPrice,HASHBYTES('SHA1', EnglishProductName)

                                         AS Type1CheckSum,

       HASHBYTES('SHA1', CAST(ISNULL(ListPrice, 0) AS VARCHAR(10))) AS

       Type2CheckSum

INTO   ProductStaging

FROM   AdventureWorksDW2008..DimProduct

WHERE  EndDate IS NULL



GO



CREATE TABLE [dbo].[DimProduct]

  (

     [ProductKey]         [INT] IDENTITY(1, 1) NOT NULL,

     ProductId            NVARCHAR(25) NOT NULL,

     [EnglishProductName] [NVARCHAR](50) NOT NULL,

     [ListPrice]          [MONEY] NOT NULL,

     DateUpdated          DATETIME NOT NULL DEFAULT GETDATE(),

     Type1Checksum        VARBINARY(8000) NULL DEFAULT 0,

     Type2Checksum        VARBINARY(8000) NULL DEFAULT 0,

     StartDate            DATETIME NOT NULL DEFAULT GETDATE(),

     EndDate              DATETIME NULL,

     CurrentFlag          BIT NOT NULL DEFAULT 1

     CONSTRAINT [PK_DimProduct_ProductKey] PRIMARY KEY CLUSTERED ( [ProductKey]

     ASC )

  ) 



Note that we're using HASHBYTES on the staging table - this will be used to compare columns. The SHA1 algorithm ensures sensitivity to changes across those columns. If you're using non-character fields you'll need to cast them before hashing them.

Firstly lets add new records (at this point all of them):



MERGE dbo.DimProduct AS dim

USING ProductStaging AS stg

ON dim.ProductId=stg.ProductId



WHEN MATCHED AND dim.Type1Checksum<>stg.Type1Checksum THEN --type1 changes

  UPDATE SET  dim.Type1Checksum=stg.Type1Checksum,

        dim.EnglishProductName=stg.EnglishProductName,

        dim.DateUpdated=GETDATE()



WHEN NOT MATCHED THEN --new records

  INSERT (ProductId, EnglishProductName, ListPrice, Type1Checksum, Type2Checksum)

  VALUES  (stg.ProductId, stg.EnglishProductName, stg.ListPrice, stg.Type1Checksum, stg.Type2Checksum);





You'll note that this statement actually does the Type 1 updates as well - ie where the checksum\hash columns are different.

OK, lets add some changes to the ProductName - which will be handled as Type 1 overwrites, and some changes to ListPrice - which we'll handle as Type 2:



UPDATE ProductStaging

SET    EnglishProductName = 'Fred',

       Type1CheckSum = HASHBYTES('SHA1', 'Fred')

WHERE  ProductId = 'AR-5381';



UPDATE ProductStaging

SET    ListPrice = 9.99,

       Type2CheckSum = HASHBYTES('SHA1', CAST(ISNULL(9.99, 0) AS VARCHAR(10)))

WHERE  ProductId = 'BE-2349';



GO



So now we'll do the Type 1 update using the same MERGE statement as above. We also set the DateUpdated column - for auditing purposes:




MERGE dbo.DimProduct AS dim

USING ProductStaging AS stg

ON dim.ProductId = stg.ProductId

WHEN MATCHED AND dim.Type1Checksum<>stg.Type1Checksum THEN --type1 changes

  UPDATE SET dim.Type1Checksum = stg.Type1Checksum,

             dim.EnglishProductName = stg.EnglishProductName,

             dim.DateUpdated = GETDATE()

WHEN NOT MATCHED THEN --new records

  INSERT (ProductId,

          EnglishProductName,

          ListPrice,

          Type1Checksum,

          Type2Checksum)

  VALUES (stg.ProductId,

          stg.EnglishProductName,

          stg.ListPrice,

          stg.Type1Checksum,

          stg.Type2Checksum); 



Here we're careful to ensure that Type 1 changes apply to all records including the historic ones. Some examples filter for current records only...

Finally the Type 2 INSERTS. The trick here is to use the OUTPUT statement on the UPDATE to create a pseudo-table and then SELECT from that into the DimProduct dimension table to create the new 'current' record. Note that I'm also setting the CurrentFlag and EndDate on the old dimension record - so that its clear this is a historic record. Again all in one statement:



INSERT dbo.DimProduct

       (ProductId,EnglishProductName,ListPrice,DateUpdated,Type1Checksum,

        Type2Checksum,

        CurrentFlag)

SELECT ProductId,EnglishProductName,ListPrice,GETDATE(),Type1Checksum,

       Type2Checksum,1

FROM   ( 

MERGE DimProduct AS dim

USING ProductStaging AS stg

ON dim.ProductId = stg.ProductId

WHEN MATCHED AND dim.Type2Checksum<>stg.Type2Checksum AND dim.CurrentFlag=1 THEN

  UPDATE SET CurrentFlag = 0,

             EndDate = GETDATE()

OUTPUT $Action AS Operation,stg.ProductId,stg.EnglishProductName,stg.ListPrice,

stg.Type1Checksum,stg.Type2Checksum 
) AS MergeOut

WHERE  Operation = 'UPDATE';





So that's it! Hope its um...useful.