Wednesday, January 18, 2012

CDC-MD5 value in informatica

Summary:
This document describes the message digest function, which can be used to detect changes in data more efficiently than checking every column through a lookup. The message digest function, known as MD5, can noticeably improve performance over lookups when more than 10 columns have to be compared.
Introduction:

Informatica provides two functions that can be used for change data capture: MD5 (Message Digest 5 algorithm) and CRC32 (Cyclic Redundancy Check). Of the two, MD5 is generally the recommended choice. It is particularly useful when there are no primary key columns to compare on and when a lookup would have to compare more than ten columns, which is not recommended.


Message Digest 5 algorithm:

The algorithm calculates a checksum of the input value and returns it as a 32-character hexadecimal string (a 128-bit digest). It is recommended for hash key generation and works well for change data capture. Cyclic Redundancy Check, by contrast, is mainly recommended for detecting transmission errors.
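
As a rough illustration outside Informatica, the two checksums can be compared in Python using the standard hashlib and zlib modules. The sample value and its "|" separator are assumptions made up for this sketch; it only shows the shape of each result, not how Informatica computes it internally.

```python
import hashlib
import zlib

# Hypothetical sample value; the "|" separator is an assumption for this sketch.
value = "John|Smith|London".encode("utf-8")

# MD5 produces a 128-bit digest, usually written as 32 hexadecimal characters.
md5_hex = hashlib.md5(value).hexdigest()

# CRC32 produces a single 32-bit unsigned integer.
crc = zlib.crc32(value)

print(len(md5_hex), md5_hex)
print(crc)
```

The much larger MD5 output makes accidental collisions between different rows far less likely than with a 32-bit CRC, which is one reason a digest suits change detection better.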





Traditional Approach:

To identify records for update and insert, we would normally use a Lookup transformation. However, the size of the cache built by the lookup depends on two factors: the number of columns in the comparison condition and the amount of data in the lookup table. When there are no primary key columns to identify the changes, there are two options: compare all the columns in the lookup, or capture the changes using PowerExchange Change Data Capture. This can degrade performance as well.
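
The all-columns comparison can be sketched in Python as follows. This is only a stand-in for a Lookup transformation, assuming the cache behaves like an in-memory set of complete rows; the function name and sample rows are hypothetical.

```python
def classify_by_full_lookup(source_rows, target_rows):
    """Split source rows into (new_or_changed, unchanged) by matching every column."""
    # The stand-in cache holds every comparison column of every target row,
    # so it grows with both the column count and the row count.
    cache = {tuple(row) for row in target_rows}
    new_or_changed = [row for row in source_rows if tuple(row) not in cache]
    unchanged = [row for row in source_rows if tuple(row) in cache]
    return new_or_changed, unchanged


# Hypothetical sample data with no primary key column.
target = [("John", "Smith", "London"), ("Ana", "Lopez", "Madrid")]
source = [("John", "Smith", "London"), ("Ana", "Lopez", "Lisbon")]

new_or_changed, unchanged = classify_by_full_lookup(source, target)
print(new_or_changed)
print(unchanged)
```

Because the whole row is the lookup condition, every extra column widens each cache entry, which is the cost described above.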



MD5 Approach:


MD5, the message digest algorithm, converts the input string data to a hexadecimal value. For the comparison, we concatenate all the columns and generate an MD5 value; if any column is modified, the MD5 value changes. By comparing the MD5 values we can therefore identify whether the data has changed or not.
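
The concatenate-and-hash idea can be sketched in Python as below. This is a minimal sketch, not an Informatica mapping: the "|" delimiter, the rendering of NULL as an empty string, the function names and the sample rows are all assumptions made for illustration. Non-string columns are converted to strings first, since the hash input has to be a string.

```python
import hashlib

DELIMITER = "|"  # assumed separator between column values


def row_md5(row):
    """Return the MD5 hex digest of all columns of a row, concatenated."""
    # Convert every column to a string; None (NULL) becomes an empty string here.
    parts = ["" if value is None else str(value) for value in row]
    return hashlib.md5(DELIMITER.join(parts).encode("utf-8")).hexdigest()


def detect_changes(source_rows, target_hashes):
    """Split source rows into (changed_or_new, unchanged) using stored MD5 values."""
    changed, unchanged = [], []
    for row in source_rows:
        if row_md5(row) in target_hashes:
            unchanged.append(row)
        else:
            changed.append(row)
    return changed, unchanged


# Hypothetical data: the target keeps one MD5 value per row instead of all columns.
target_rows = [("John", "Smith", "London", 42), ("Ana", "Lopez", "Madrid", 31)]
target_hashes = {row_md5(row) for row in target_rows}

source_rows = [("John", "Smith", "London", 42), ("Ana", "Lopez", "Lisbon", 31)]
changed, unchanged = detect_changes(source_rows, target_hashes)
print(changed)
print(unchanged)
```

The delimiter is a design choice in this sketch: without it, the column values ("ab", "c") and ("a", "bc") would concatenate to the same string and produce the same MD5 value.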









Conclusion:

Thus, by using MD5 values we can identify whether the data has changed or not without degrading performance, and handle the data effectively as well. MD5 is mainly recommended when there are many comparison columns and no primary key columns in the lookup table. There is also a limitation: the input to MD5 needs to be of string data type, and the function returns a 32-character hexadecimal string.
