Elevate your Databricks development workflow with SHALLOW CLONE

In this post, I would like to show how the SHALLOW CLONE functionality can streamline your development process by accelerating the creation of dedicated development environments and enabling comprehensive data testing in your CI/CD pipeline.

One more time, what exactly is SHALLOW CLONE?

Let’s start with a quick recap of Delta Lake and Cloning.

Delta Lake is an optimized storage layer that serves as the foundation for tables in a Databricks Lakehouse. This modern open table format enhances Parquet files with a file-based transaction log, enabling ACID transactions and scalable metadata management.

Parquet data files, which form the basis of the table, are immutable and cannot be modified. As a consequence, DML operations on a Delta Lake table can result in:

  • creation of new Parquet data files during an INSERT operation,
  • creation of new Parquet files and the deletion of existing ones for UPDATE/DELETE statements.

Yeah, I know… deletion vectors; let’s set those aside for now ;).

Okay, just to clarify – we do not automatically remove the unused files from storage; instead, we simply mark them as deleted in the transaction log. Physical deletion occurs only after executing the VACUUM command (if they exceed the retention period).
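
To make the file lifecycle concrete, here is a minimal sketch, assuming a Databricks notebook (with the built-in `spark` session) and a hypothetical `prod.sales.orders` table:

```python
# An UPDATE rewrites the affected Parquet files; the replaced files are only
# marked as removed in the transaction log, not deleted from storage.
spark.sql("UPDATE prod.sales.orders SET status = 'CLOSED' WHERE order_date < '2024-01-01'")

# The table history shows the logical operations recorded in the log.
spark.sql("DESCRIBE HISTORY prod.sales.orders").show(truncate=False)

# Physical deletion happens only when VACUUM removes files that are no longer
# referenced and are older than the retention period (7 days by default).
spark.sql("VACUUM prod.sales.orders RETAIN 168 HOURS")
```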

What is a SHALLOW CLONE then? It is a replica of an existing Delta table that shares the same Parquet data files as the original table. It duplicates the current state of the metadata layer, pointing to precisely the same set of physical files on our storage.
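
In practice it is a single SQL statement. Here is a minimal sketch you could run in a Databricks notebook; the `prod.sales.orders` and `dev.sales.orders` names are hypothetical placeholders:

```python
# Create (or replace) a shallow clone of a production table in a dev catalog.
# Only the metadata is copied; the clone points at the source's Parquet files.
spark.sql("""
    CREATE OR REPLACE TABLE dev.sales.orders
    SHALLOW CLONE prod.sales.orders
""")
```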

I agree, it doesn’t sound exciting yet.

The real magic happens when you begin modifying your tables, both the original and the cloned one. Operations on one table do not affect the other. This is a consequence of immutability: since no files created before the SHALLOW CLONE are ever modified, the shared “core” of the table stays intact, while new operations live in their own separate branches.

So what about deletions, are those problematic? Not at all: a file is marked as deleted only in the transaction log of the table you modified, so the other table keeps reading it, as long as the physical files are not VACUUMed.
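
A quick sketch of that independence, continuing with the hypothetical `prod.sales.orders` table and its clone `dev.sales.orders` from the example above:

```python
# Writes to the clone add new Parquet files referenced only by the clone's log.
spark.sql("INSERT INTO dev.sales.orders SELECT * FROM dev.sales.orders LIMIT 10")

# Deletes on the source only mark files as removed in the source's own log;
# the clone still reads the original files it was created from.
spark.sql("DELETE FROM prod.sales.orders WHERE order_date < '2020-01-01'")

print(spark.table("prod.sales.orders").count())  # does not see the INSERT above
print(spark.table("dev.sales.orders").count())   # does not see the DELETE above
```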

Now imagine this scenario: you have a massive 2-3TB table that you need to test with the new version of your stored procedure, which might modify the data. Without this functionality, you would typically have to make a copy of the data, a process that could take a considerable amount of time. However, with SHALLOW CLONE, a copy of your table is ready in seconds. Pretty cool, isn’t it? 🙂

By the way, if you want to learn more about this topic, I invite you to read Adrian’s Delta Lake 101 series on this blog.

Spin up your DEV environment in seconds!

Let’s now scale this idea across the entire development team.

Long ago, when working with data warehouses in SQL Server, each of us had a local installation on our laptops where we developed our solutions. We then synchronized them with the shared version using SSDT (I still love this tool!).

In the case of Databricks, hosting a local development environment is not an option. However, dedicated development catalogs can serve as substitutes. All we need to do is parameterize the code that clones the production tables, wrap it, for example, in a Workflow, and invoke it whenever a developer requests a dedicated environment. Such an operation takes seconds or minutes, at a cost close to zero.
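
Here is a hedged sketch of what such a parameterized notebook might look like when run as a Workflow task; the widget, catalog names, and table list are hypothetical placeholders for your own setup:

```python
# Parameterized "spin up a dev environment" notebook (all names are hypothetical).
# A Workflow can pass a different target_catalog per developer, e.g. dev_jdoe.
dbutils.widgets.text("target_catalog", "dev_jdoe")
target_catalog = dbutils.widgets.get("target_catalog")
source_catalog = "prod"

# Tables to expose in the development environment.
tables = ["sales.orders", "sales.customers", "finance.invoices"]

for table in tables:
    schema = table.split(".")[0]
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {target_catalog}.{schema}")
    # Shallow clone: metadata only, so this finishes quickly even for huge tables.
    spark.sql(f"""
        CREATE OR REPLACE TABLE {target_catalog}.{table}
        SHALLOW CLONE {source_catalog}.{table}
    """)
```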

And what about security, you may ask. The user (or Service Principal) initiating the CLONE operation needs access to both the source and target catalogs. The workspace in which the CLONE operation is executed must also be bound to both catalogs (workspace-catalog bindings).

However, these requirements do not apply to users working with already existing clones – we can therefore isolate and secure development work by restricting permissions to production tables and detaching the source catalog from the workspace where development is conducted.
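
For illustration only, this is roughly what the grants could look like for a service principal that performs the cloning; the principal and object names below are hypothetical:

```python
# Grants for the principal that runs the CLONE: read the source, create in the target.
# All names are hypothetical placeholders.
clone_grants = [
    "GRANT USE CATALOG ON CATALOG prod TO `clone-sp`",
    "GRANT USE SCHEMA ON SCHEMA prod.sales TO `clone-sp`",
    "GRANT SELECT ON TABLE prod.sales.orders TO `clone-sp`",
    "GRANT USE CATALOG ON CATALOG dev_jdoe TO `clone-sp`",
    "GRANT USE SCHEMA ON SCHEMA dev_jdoe.sales TO `clone-sp`",
    "GRANT CREATE TABLE ON SCHEMA dev_jdoe.sales TO `clone-sp`",
]
for stmt in clone_grants:
    spark.sql(stmt)

# Developers who only query the existing clones need grants on dev_jdoe alone,
# so the prod catalog can stay locked down and even unbound from the dev workspace.
```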

Next step? Proper performance tests!

Let’s take it one step further.

What can you say about data testing in your project? And I’m not talking about simple unit tests invoked with pytest during the CI build. Do you perform full data processing before each Pull Request approval to verify that the changes haven’t significantly degraded performance or, for example, duplicated data? Yeah, it’s quite hard to do in complex systems and on large volumes of data.

The graphic below illustrates a potential expansion of our environment with a UAT catalog where full acceptance tests are performed. The CI process triggered for each Pull Request automates cloning of the current version from the production environment to the UAT environment, followed by invoking the new version of our solution, e.g., the modified Databricks Workflows responsible for processing subsequent data layers.
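
One way to wire this into CI is to trigger the cloning job and then the processing Workflow straight from the pipeline, for example with the Databricks SDK for Python; the job IDs and the two-step flow below are a hypothetical sketch, not a prescribed setup:

```python
# Hypothetical CI step: refresh the UAT catalog from prod, then run the new
# version of the processing Workflow against it (job IDs are placeholders).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up credentials from the CI environment

# 1. Job that shallow-clones the current prod tables into the UAT catalog.
w.jobs.run_now(job_id=111).result()

# 2. Job that runs the modified Workflows against the UAT catalog.
w.jobs.run_now(job_id=222).result()
```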

The Pull Request can then be approved once stakeholders have analyzed the results, for example by comparing data between the UAT and Production catalogs.
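
As a simple illustration of such a comparison, a reviewer could diff the two catalogs directly; the table names remain the hypothetical ones used earlier:

```python
# Rows produced in UAT by the new logic that do not exist in prod
# (the symmetric query would catch the opposite case). Names are hypothetical.
diff = spark.sql("""
    SELECT * FROM uat.sales.orders
    EXCEPT
    SELECT * FROM prod.sales.orders
""")
print(f"Rows in UAT but not in PROD: {diff.count()}")
```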

It’s worth noting that the CLONE functionality can be invoked with VERSION AS OF or TIMESTAMP AS OF arguments. This way, we can replicate yesterday’s state of our system in the UAT environment and load today’s data using the new processing logic.
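
A sketch of such a time-travel clone, with a placeholder timestamp and the same hypothetical table names:

```python
# Recreate yesterday's state of the table in UAT, then replay today's load
# with the new processing logic. The timestamp is a placeholder.
spark.sql("""
    CREATE OR REPLACE TABLE uat.sales.orders
    SHALLOW CLONE prod.sales.orders
    TIMESTAMP AS OF '2025-01-14 00:00:00'
""")
```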

Be careful with single-user access mode compute.

Last but not least, I need to mention that full isolation between the source and cloned catalogs is only possible when using a shared compute cluster. If you attempt to query the cloned tables with single-user access mode compute, you will encounter a series of errors asking you to assign the source catalog to the workspace you are currently in. You’ll also have to be granted USE CATALOG, USE SCHEMA, and SELECT permissions on the source side (or even MODIFY on the source table when trying to INSERT data into the cloned one!).

Curious why this is the case? Please read this blog post on the Unity Catalog Lakeguard mechanism and/or view an excellent session from DAIS 2024 by Stefania Leone and Martin Grund.
