Data warehouses use timestamps to keep track of data that is changing over time. A common term that is used to describe implementations of historically changing data is slowly changing dimensions. There are several types of slowly changing dimensions, explained in various sources (examples here and here). For the purpose of this article I will focus on how to query slowly changing dimensions type 2 (SCD2 for short).
The reason I decided to write this article is that there is very little information out there on how to query SCD2. When I Google it, I get tons of links explaining how to insert data into SCD2 tables. But there is hardly any information on how to query it.
In my experience, developers who are working on ETL are usually well versed in using the ETL tools and the built-in functionalities for inserting SCD2 type data. But querying data from SCD2 tables is often misunderstood or not understood at all. To many developers and business users alike it is hard to understand historical representation of data. As a result, many don’t know how to get data out of SCD2 tables.
Let’s look at a simple example. Supposing our friend Jane became a customer of Company XYZ on January 1, 2000. At that time, when she filled out the application form, she provided her marital status as single. The entry in the Customers table would be:
The column First_name is her name and the column Marital_status represents her marital status at the time the data was entered. The column Valid_from tells us the date starting from when this data is valid. She became a customer on January 1, 2000, therefore as far as we know this is the starting date of when this information became known (assuming we don’t know anything about her before this date). The column Valid_to tells us the date until when this data valid. As far as we know, this data is valid until we are notified otherwise. In data warehouses we typically use a date far in the future (eg. December 31, 1999) to indicate that this is valid until further notice.
Jane was married last Saturday, on June 17, 2017. We record this information in our Customers table, using slowly changing dimension type 2, like this:
We now have two records for Jane in our Customers table, the first record representing the time span when she was single and the second record representing the time span when she is married. In the first record, the column Valid_to has been updated to June 16, 2017, because we now know that her single status was only valid until June 16. A second record was inserted for the Marital_status of married. In this second record, the value of Valid_from is June 17, 2017 because this is when she became married and her marital status changed. The value of Valid_to is set to December 31, 1999 because we don’t know how long she will stay married and we assume that this will remain so until further notice.
How do we query the SCD2 table?
If we want to find out Jane’s marital status, we might simply query like this:
WHERE First_name = 'Jane';
Because we haven’t applied any filters, the result of this query is two records:
This is probably not what we wanted and a business user who is writing such queries might wonder how is it possible that a person is both single and married. What we really wanted was to have only one marital status that describes Jane at a given point in time. We know that today, she is married and that last week she was single and a year ago she was also single. But how do we translate this into a query?
Let’s say we want to know Jane’s marital status a week ago, on June 14, 2017? Here is how we would query it:
WHERE First_name = 'Jane'
AND 'June 14, 2017' BETWEEN Valid_from and Valid_to;
This is where the query gets confusing for most people. What does the above query mean?
To explain, let’s do a calendar representation of the data:
Just by looking at this calendar, we can easily see that on June 14, Jane was single. And on June 21 (today) she is married. So how do we translate this observation that we can easily pick up from the calendar view into a SQL query?
We know that Jane’s marital status is represented as non-overlapping time intervals. In the interval between January 1, 2000 and June 16, 2017 she was single. And in the interval between June 17, 2017 and December 31, 1999 she is married. In order to find out what her marital status was on June 14, we just have to pick the correct interval in which June 14 lies. We do this by looking at which interval has June 14 between its start and end dates. This translates straightforward into the SQL query from above:
... 'June 14, 2017' BETWEEN Valid_from and Valid_to;
And this is all the magic there is in querying SCD2 tables. Once we understand the concept of identifying the correct time interval where we find our information for the point in time that we are interested in, it should work the same for any SCD2 table.