XTIVIA recently kicked off a new Data Warehousing project for a customer, and as I reviewed the project plan, I noticed that we had assigned a significant portion of the effort to data profiling. I thought it would be worth sharing some of the reasoning behind that estimate.
Since the discussion points are fairly lengthy, I will split them into a series of posts.
So, now on to the fun part…
According to a study published by The Data Warehousing Institute (TDWI) entitled Taking Data Quality to the Enterprise through Data Governance, some issues are primarily technical in nature, such as the extra time required for reconciling data (85%) or delays in deploying new systems (52%). Other problems are closer to business issues, such as customer dissatisfaction (69%), compliance problems (39%) and revenue loss (35%). Poor-quality data can also cause problems with costs (67%) and credibility (77%).
Another statement that we come across frequently is: Less than 5% of our data is “bad”.
If you consider a fact table with 10 million rows, then about 500K rows are bad – a huge problem for analytics, operations, or anything else you may want to do with the data!
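To make that arithmetic concrete, here is a minimal sketch of the kind of check a data profiling pass performs. The records and validity rules below are hypothetical, invented purely for illustration – real profiling would run rules like these against the actual source tables.

```python
# Minimal data-profiling sketch (hypothetical records and rules):
# count how many rows violate simple validity checks and report the rate.
records = [
    {"customer_id": 101, "revenue": 250.0},
    {"customer_id": None, "revenue": 99.5},   # missing key
    {"customer_id": 103, "revenue": -40.0},   # negative revenue
    {"customer_id": 104, "revenue": 120.0},
]

def is_bad(row):
    # A row is "bad" if its key is missing or its measure is implausible.
    return row["customer_id"] is None or row["revenue"] < 0

bad = sum(is_bad(r) for r in records)
rate = bad / len(records)
print(f"{bad} of {len(records)} rows fail checks ({rate:.0%})")
```

Scale that same 5% failure rate up to a 10-million-row fact table and you get the 500K bad rows mentioned above.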
Even if you choose to ignore all the statistics and studies, it comes down to just a few questions –
- How do we get the information that we need and where do we get it from?
- Can we trust this information?
- What does it mean and how can we get this information in the format we need?
Data Quality Definition: Data is commonly deemed to be of high quality if it correctly represents the real-world constructs to which it refers.
The implementation of Data Quality processes and procedures can dramatically affect the quality and therefore usability of the information in your Data Warehouse.
To be continued…