Data warehouses use timestamps to keep track of data that is changing over time. A common term that is used to describe implementations of historically changing data is slowly changing dimensions. There are several types of slowly changing dimensions, explained in various sources (examples here and here). In this post I will focus on how to query slowly changing dimensions type 2 or SCD2 for short.
The reason I decided to write this post is that there is very little information out there on how to query SCD2. When I Google it, I get tons of links explaining how to insert data into SCD2 tables. But there is hardly any information on how to query it.
In my experience, developers who are working on ETL are usually well versed in using the ETL tools and the built-in functionalities for inserting SCD2 type data. But querying data from SCD2 tables is often misunderstood or not understood at all. To many developers and business users alike it is hard to understand historical representation of data. As a result, many don’t know how to get data out of SCD2 tables.
Let’s look at a simple example. Suppose our friend Jane became a customer of Company XYZ on January 1, 2000. At that time, when she filled out the application form, she provided her marital status as single. The entry in the Customers table would be:
The column First_name is her name and the column Marital_status represents her marital status at the time the data was entered. The column Valid_from tells us the date starting from when this data is valid. She became a customer on January 1, 2000, therefore as far as we know this is the starting date of when this information became known (assuming we don’t know anything about her before this date). The column Valid_to tells us the date until when this data valid. As far as we know, this data is valid until we are notified otherwise. In data warehouses we typically use a date far in the future (eg. December 31, 9999) to indicate that this is valid until further notice.
Jane was married on June 12, 2021. We record this information in our Customers table, using slowly changing dimension type 2, like this:
We now have two records for Jane in our Customers table, the first record representing the time span when she was single and the second record representing the time span when she is married. In the first record, the column Valid_to has been updated to June 11, 2021, because we now know that her single status was only valid until June 11. A second record was inserted for the Marital_status of married. In this second record, the value of Valid_from is June 12, 2021 because this is when she became married and her marital status changed. The value of Valid_to is set to December 31, 9999 because we don’t know how long she will stay married and we assume that this will remain so until further notice.
How do we query the SCD2 table?
If we want to find out Jane’s marital status, we might simply query like this:
SELECT Marital_status FROM Customers WHERE First_name = 'Jane';
Because we haven’t applied any filters, the result of this query is two records:
This is probably not what we wanted and a business user who is writing such queries might wonder how is it possible that a person is both single and married. What we really wanted was to have only one marital status that describes Jane at a given point in time. We know that today, she is married and that last month she was single and a year ago she was also single. But how do we translate this into a query?
Let’s say we want to know Jane’s marital status on June 10, 2021. Here is how we would query it:
SELECT Marital_status FROM Customers WHERE First_name = 'Jane' AND 'June 10, 2021' BETWEEN Valid_from and Valid_to;
This is where the query gets confusing for many people. What does the above query mean?
To explain, let’s do a calendar representation of the data:
Just by looking at this calendar, we can easily see that on June 10, Jane was single. And on June 15, she is married. So how do we translate this observation that we can easily pick up from the calendar into a SQL query?
We know that Jane’s marital status is represented as non-overlapping time intervals. In the interval between January 1, 2000 and June 11, 2021 she was single. And in the interval between June 12, 2021 and December 31, 9999 she is married. In order to find out what her marital status was on June 10, we just have to pick the correct interval in which June 10 lies. We do this by looking at which interval has June 10 between its start and end dates. The first interval, from January 1, 2000 and June 11, 2021 has June 10 between its start and end dates, so it must be the right one. The second interval, from June 12, 2021 and December 31, 9999 does not have June 10 between its start and end dates, so it is not the right one.
This translates into the SQL query from above:
... 'June 10, 2021' BETWEEN Valid_from and Valid_to;
And this is all the magic there is in querying SCD2 tables. Once we understand the concept of identifying the correct time interval where we find our information for the point in time that we are interested in, it should work the same for any SCD2 table.