When companies move their infrastructure to the cloud, like for example Snowflake cloud where resources are billed according to their actual use, they become more conscious about the costs of storage and compute consumption. Those who are new to the cloud are often concerned about how much it will cost when they run a query or when they create a table instead of a view. It takes time to get used to this type of billing and also to understand the average consumption and the expected cost over time, so that it doesn’t become a day-to-day worry.
Features that save costs are welcome in the cloud. One of such features in Snowflake that is not available in other databases is zero-copy cloning. This feature creates a copy of a database, a schema, or a table. The cloned object incurs no additional data storage costs because it shares all the existing micro-partitions of the original object at the time it was cloned. The cloned object is writable, meaning that rows can be added, deleted, or updated in the clone independently from the original object. Each change to the clone results in new micro-partitions that are owned exclusively by the clone while the original object remains intact.
Cloning is completed almost instantly since no data is physically copied. This allows users to have copies of data available whenever they need it without the additional cost and time of actually replicating the data.
In this post, I don’t want to delve into the technical details of cloning since they are explained in more detail in the Snowflake documentation. Instead, I want to look at use cases from the data warehousing perspective where cloning is beneficial.
Cloning data between environments
Data warehouses are data intensive applications that can only be tested properly when they are loaded with meaningful data. Even when we are developing in the development environment, we usually want data that resembles production data to be able to test performance, data transformation correctness and so on. This is where Snowflake zero-copy cloning is very useful, because we can clone data as needed.
Below are some use cases for cloning data between environments in data warehousing projects.
- From development to test: set up the initial tables and configuration in the development environment and clone everything to the testing environment to continue integration testing without having to set up the test environment and load data from scratch
- From production to UAT: populate the UAT environment with data from production to allow data testing and comparison without having to load data from scratch, keeping in line with security access restrictions
- From production to development: load real data into the extract schema in the production environment and then periodically clone the data into the development environment where it is needed for testing the ETL process
- From production to sandbox: allow data scientists to have their own sandbox where they can analyze and transform data as they wish, on their own copy of the data without having to wait for the data to be loaded according to their specific needs and requirements
Load a copy of the table and discard in case of errors
Whenever we have a complicated ETL process that loads data into target tables in multiple steps, it can be a nightmare to restore the original state of the tables when a step fails somewhere in the middle of the loading process. In relational databases we typically use transactions to handle such situations. We start a transaction at the beginning, perform all the steps and if nothing fails, we commit the transaction, otherwise we rollback.
In data warehousing environments it may not always be practical to use transactions, especially when large amounts of data are involved that may overrun resources required to store the transaction details required for rollback.
An alternative approach is to create copies of the target tables that we want to load, perform all the data transformation and load processes on copies of the tables and finally, if nothing fails, switch the copies of the tables with the originals. In case anything fails, the original tables remain intact and no harm was done from the loading process.
Here is a brief example. We have a target table named FACT_CUST_TRANSACTIONS that is loaded via a complicated ETL process that involves many steps. To ensure consistency of the table, we first create a zero-copy clone:
CREATE TABLE COPY_FACT_CUST_TRANSACTIONS CLONE FACT_CUST_TRANSACTIONS;
The above command creates a clone of the original table named COPY_FACT_CUST_TRANSACTIONS which happens almost instantly, regardless of how much data the table contains. We then allow the ETL process to load data into this copy of the original table. At the end, if nothing fails, the tables are swapped using the SWAP command:
ALTER TABLE FACT_CUST_TRANSACTIONS SWAP WITH COPY_FACT_CUST_TRANSACTIONS;
This step switches the copy and the original table and this also happens almost instantly so that the integrity of the data in the target table is ensured.
Additional use cases may include situations when we want to restore data from time travel at a given point in time in the past, which can also be accomplished with cloning.
Although cloning tables is easy to understand and to perform, and can be used to efficiently transport data from one place to another, there are additional considerations to keep in mind when cloning schemas or databases such as access control, sequences, stages, pipes, streams and so on, all of which is explained in more detail in the documentation.