Too often I meet data warehouse development teams who are implementing what they call »data quality« in order to help the business have cleaner data to work with. But what they are really implementing is »data gatekeeping« which hurts the business even more than if they didn’t have any data quality in the first place.
Let’s look at an example.
A data quality rule may say that the value for education level in the customer table must exist in the education level reference table. Fair enough, right? We must ensure that all reference data values in our tables match their corresponding values in the reference tables.
What happens when we receive a customer record with a value for education level that doesn’t exist in the education level reference table? Our data gatekeepers say: »No, this record is not allowed to enter the data warehouse because it violates a data quality rule.« Fair enough, right?
But think about it. We are not loading a customer record into the data warehouse because of a data quality violation. Then what happens when we start loading transaction records? Do we load this customer’s transactions? He has just been rejected from the data warehouse. His transaction records – if loaded – will not belong to any of the existing customers in the data warehouse. Most likely they will be rejected from the data warehuse as well because the data gatekeepers probably set up another data quality rule that doesn’t allow any transactions into the data warehouse if they don’t belong to an existing customer.
When the data warehouse is ready, the data gatekeepers are happy. They believe that by implementing their data quality rules they have ensured »data quality«. The business users, on the other hand, are screaming: »Where are my customers? I thought I had seven million customers but I only see six million! Where are my transactions?« And the data gatekeepers are eager to explain that they have ensured data quality by not loading the offending records. Is that really helpful? No, probably not.
Doesn’t anyone trust Ralph Kimball any more? What did he teach us about data quality? Yes, we can have data quality rules. And no, we don’t reject bad quality data from the data warehouse. We flag data of questionable quality so that the business users are aware of potential data quality related issues.