
Understanding CLONE Functionality in Databricks for Delta Tables

Introduction

Databricks offers great functionality in the form of table CLONING, which works not only for Delta Tables but also for Parquet and Iceberg tables. In this article we will focus on Delta Tables.

Please note that in this article, whenever I mention:

  • Source Table: This refers to the original table used as the source for the clone operation.
  • Cloned Table: This refers to the table that is the target of the clone operation.
  • Clone: This refers to the clone operation itself, i.e. the sequence of events that copies the Source Table into the Cloned Table.

We have 2 types of clones:

  1. DEEP CLONE – This involves copying table metadata along with the data (you duplicate your data, so be aware of the cost when cloning a 1 TB table).
  2. SHALLOW CLONE – This involves copying table metadata without copying the data, resulting in a different table in the metastore that refers to the Source Table’s data files.

Why should you use CLONE?  

There are numerous use cases, especially once you understand exactly how clones work:

  • Data alignment between environments/teams.
  • Editable copy of production data without copying the data (shallow clone).
  • Disaster recovery: You can clone your table to a table in another cloud region after your transactions finish.
  • Test your pipeline on production data: You can shallow clone your entire production data to a test environment and test your pipeline before release.
  • ML experiments on dev environment over a snapshot of production data.

This is how I use clone, but you can probably find even more use cases.

Let’s dive into the details

Databricks supports CLONING for both External Tables and Managed Tables.

If you don’t know the difference between those two, here’s an easy explanation: if you provide an explicit path when creating the table, it will be an External Table; otherwise, it will be a Managed Table (in a managed storage location, e.g. the schema default location).

Example of creating an External Table:

In SQL:

CREATE TABLE 
  priv_wojciech_bukowski.cloning.src_table_cdf
LOCATION 
  'abfss://test-one@wbunitycatalog.dfs.core.windows.net/UNITY_EXT_LOC/cloning/src_table_cdf'

In Python:

(df
  .write
  .option("path", "abfss://test-one@wbunitycatalog.dfs.core.windows.net/UNITY_EXT_LOC/cloning/src_table_cdf")
  .saveAsTable("priv_wojciech_bukowski.cloning.src_table_cdf")
)

Example of creating a Managed Table:

In SQL:

CREATE TABLE priv_wojciech_bukowski.cloning.src_table_cdf

In Python:

df.write.saveAsTable("priv_wojciech_bukowski.cloning.src_table_cdf")

It’s important to understand the difference between table types because you can only make clones:

  • From Managed Table to Managed Table.
  • From External Table to External Table.

If you try to clone, for example, from a Managed Table to an External Table, you will get an error, which is misleading:

[CANNOT_SHALLOW_CLONE_NON_UC_MANAGED_TABLE_AS_SOURCE_OR_TARGET] Shallow clone is only supported for the MANAGED table type. The table [TABLE_NAME] is not MANAGED table. SQLSTATE: 0AKUC

It is misleading because clone now also works for External Tables, and we attempted a different operation (a clone from a Managed to an External Table) that is simply not allowed. From the Databricks docs: “Shallow clone support for Unity Catalog external table is in Public Preview in Databricks Runtime 14.2 and above.” So let’s hope they fix the error message.

How does Deep Clone work?

Deep Clone works the same way and has the same effect regardless of whether you clone an External or a Managed Table.

Deep Clone (or just CLONE) can be executed using the following SQL statement:

CREATE OR REPLACE TABLE 
  priv_wojciech_bukowski.cloning.deep_clone_incremental
DEEP CLONE
  priv_wojciech_bukowski.cloning.source_incremental
LOCATION 
  'abfss://test-one@wbunitycatalog.dfs.core.windows.net/UNITY_EXT_LOC/cloning/deep_clone_incremental'

You will learn later in this article why there is an “OR REPLACE” clause in this statement.

If you run this command, Databricks will:

  • Create a new table in the metastore (in our case, “deep_clone_incremental”).
  • Copy all data files (parquet files) to the Cloned Table storage location.
  • Create a new Delta Log with the following information:
    • Source Table name,
    • Source Table version,
    • Flag indicating whether it is a deep or shallow clone (you can inspect these fields with the query shown after this list).
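
You can check this metadata yourself. A minimal sketch, assuming the deep clone above has completed (the exact fields shown can vary by Databricks Runtime version):

-- Inspect the Cloned Table's history; the CLONE entry records the
-- information written into the new Delta Log.
DESCRIBE HISTORY priv_wojciech_bukowski.cloning.deep_clone_incremental;
-- Look at the operation ('CLONE') and operationParameters columns,
-- which typically include source, sourceVersion and isShallow.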

That means the Cloned Table has its own fresh and independent history. Consequently, you cannot Time Travel on a Cloned Table as you can on a Source Table due to the absence of history. However, you can still create a clone of the Source Table from a previous version. For example:

CREATE TABLE 
  priv_wojciech_bukowski.cloning.deep_clone_incremental
CLONE 
  priv_wojciech_bukowski.cloning.source_incremental VERSION AS OF 1

That will create a Cloned Table as of Version 1 of the Source Table and copy the corresponding data files. Now, if you run this command in a loop up to the current Source Table version, you might end up with a Cloned Table whose history is very similar, or even identical, to the Source Table’s, as sketched below.
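
A minimal sketch of that loop, unrolled in SQL and assuming the Source Table currently has versions 1 through 3:

-- Re-clone each Source Table version in order; every run adds a new
-- entry to the Cloned Table's own Delta Log.
CREATE OR REPLACE TABLE priv_wojciech_bukowski.cloning.deep_clone_incremental
  CLONE priv_wojciech_bukowski.cloning.source_incremental VERSION AS OF 1;
CREATE OR REPLACE TABLE priv_wojciech_bukowski.cloning.deep_clone_incremental
  CLONE priv_wojciech_bukowski.cloning.source_incremental VERSION AS OF 2;
CREATE OR REPLACE TABLE priv_wojciech_bukowski.cloning.deep_clone_incremental
  CLONE priv_wojciech_bukowski.cloning.source_incremental VERSION AS OF 3;

Keep in mind that the Cloned Table’s history entries will be CLONE operations pointing at those source versions, not the original INSERT or UPDATE operations, even though the data states match.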

DEEP CLONE IS INCREMENTAL!

I used the “CREATE OR REPLACE TABLE” syntax earlier because it guarantees incrementality for the clone operation.
This means that any subsequent DEEP CLONE will result in copying only new data files.

In my example, only one data file was copied after running DEEP CLONE again, and a new Delta Log entry was created.


We don’t know exactly what happened on the Source Table between versions 2 and 4, but “somehow” the Clone figured out that it only requires this single parquet file to be copied, so the Cloned Table will be in the same state as the Source Table.

What is also nice is that after we run Clone for the second time, it results in a new entry in the Delta Log and a new cloned table version, which means we can Time Travel on it.
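
If you want to verify the incremental behavior yourself, here is a sketch (the exact metric names may differ slightly between Databricks Runtime versions):

-- The newest history entry after the second DEEP CLONE reports how
-- many data files were actually copied.
DESCRIBE HISTORY priv_wojciech_bukowski.cloning.deep_clone_incremental LIMIT 1;
-- Check operationMetrics, e.g. numCopiedFiles and copiedFilesSize.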

I find Deep Clone useful for archiving, backup, or DR scenarios, but it’s a waste of money if you want to, for example, experiment on a production table as it requires data duplication.

That is why you should also check out Shallow Clone.

How does Shallow Clone work?

Shallow Clone differs from Deep Clone in that it does not copy any data from the Source Table.

The SQL code below creates a Cloned Table that is a Shallow Clone of the Source Managed table:

--Clone Managed table
CREATE OR REPLACE TABLE 
  priv_wojciech_bukowski.cloning.target_managed_table_shallow
SHALLOW CLONE 
  priv_wojciech_bukowski.cloning.source_managed_table
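
For External Tables, the same statement additionally needs an explicit LOCATION. A sketch following the earlier storage path convention (the table names here are illustrative, and shallow cloning External Tables requires Databricks Runtime 14.2 or above):

--Clone External table
CREATE OR REPLACE TABLE 
  priv_wojciech_bukowski.cloning.target_external_table_shallow
SHALLOW CLONE 
  priv_wojciech_bukowski.cloning.source_external_table
LOCATION 
  'abfss://test-one@wbunitycatalog.dfs.core.windows.net/UNITY_EXT_LOC/cloning/target_external_table_shallow'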

Running Shallow Clone resulted in:

  • A new table in the metastore,
  • A new delta table folder path in the storage location (managed),
  • A new delta log file,

And no data files.

Now, if you try to run a SELECT statement from the Cloned Table, you will physically read data files from the Source Table storage location.
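
You can see where each row is physically read from. A minimal sketch, assuming a Databricks Runtime recent enough to expose the _metadata column for Delta tables:

-- _metadata.file_path reveals the physical file behind each row; right
-- after a shallow clone, these paths point at the Source Table location.
SELECT _metadata.file_path, *
FROM priv_wojciech_bukowski.cloning.target_managed_table_shallow;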

SHALLOW CLONE is independent after cloning.

You can make modifications (updates, inserts, deletes) on the Cloned Table, and that operation is independent from the Source Table. It does not modify the Source Table; instead, the Cloned Table will have a separate history from the moment it was cloned.

For example, if you run the following statement:

UPDATE priv_wojciech_bukowski.cloning.target_managed_table_shallow
SET position = 'Data Architect'
WHERE id = 2

This will result in:

  • A new data file will be created on the Cloned Table storage location.
  • A new Delta Log entry will be created for this operation in the Cloned Table location.

This change will only be visible when querying the Cloned Table; the Source Table is not aware of that update.

Let’s see what happens when we try to read the Cloned Table now.

If a user runs SELECT * FROM <Cloned Table Name>, the system will read one data file from the Cloned Table storage location (the one that was created after running the Update) and two data files from the Source Table storage location.

This is a really cool feature, especially when dealing with large tables (say, 100 GB) where you only want to experiment on (modify) a small subset of the data while still needing read access to all of it.

NOTE: If you run the Shallow Clone command again, the Cloned Table will be overwritten by the Source Table version, but it won’t lose its history.

Let’s examine how it works with an example:

  1. We clone (Shallow) the table.
  2. We run the Update statement on the Cloned Table, resulting in a new table version and a new Data File.
  3. Nothing happens on the Source Table after we run Shallow Clone for the first time.
  4. We run Shallow Clone again.

The results will be:

  • We will have a new version of the Cloned Table that will simply ignore the data file containing the row after the “Update Statement” we executed in point (2).
  • We can still Time Travel on the Cloned Table and/or restore its previous state, as sketched below.
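
A sketch of both options, assuming version 1 of the Cloned Table is the state from before the second Shallow Clone:

-- Read the Cloned Table as it looked before the second Shallow Clone...
SELECT * 
FROM priv_wojciech_bukowski.cloning.target_managed_table_shallow VERSION AS OF 1;

-- ...or roll the whole table back to that state.
RESTORE TABLE priv_wojciech_bukowski.cloning.target_managed_table_shallow 
TO VERSION AS OF 1;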

Please remember: when using Shallow Clone, for example to create a table in a lower environment for developers, you need to ensure your Databricks clusters have connectivity to the storage account where the underlying data of the Source Table resides:

  • When cloning the table, the cluster that clones the table needs to have network access to both the Source and Target Storage Location.
  • When querying the cloned table (shallow), the cluster that is reading the data might need access to both the Source and Target Storage Location.

SHALLOW CLONE – be aware of VACUUM.

Recently, Databricks enhanced the metadata tracking for Shallow Clone. Now, when you attempt to run Vacuum on the Source Table, the system is aware of tables that are clones of the Source and can prevent data files from being removed when they are still in use by Cloned Tables.

This means that if you run Vacuum against the Source Table and the files are out of the retention period of the Original Table but are still referenced by other Cloned Tables, only the files that are not needed for either the Source Table or any Cloned Tables will be removed from the Source Table storage location.
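
If you want to preview this before anything is deleted, here is a sketch against the managed Source Table used earlier:

-- DRY RUN lists the files that would be removed, without deleting them.
VACUUM priv_wojciech_bukowski.cloning.source_managed_table DRY RUN;

-- The actual cleanup; files still referenced by a Cloned Table are kept.
VACUUM priv_wojciech_bukowski.cloning.source_managed_table;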


There is a small difference in how Vacuum works between Managed and External Tables:

  • For managed tables, VACUUM operations against either the source or target of a shallow clone operation might delete data files from the source table.
  • For external tables, VACUUM operations only remove data files from the source table when run against the source table.

Basically, this means that for Managed Tables, running Vacuum on either the Source or the Cloned Table will clear unnecessary files.

Summary

This guide provides insights into the workings of two primary cloning methods: Deep Clone and Shallow Clone. By understanding these methods, users can enhance their data platform workflows and operations. Table cloning in Databricks offers powerful capabilities for data management and experimentation, empowering users to streamline workflows, drive innovation, and optimize costs effectively.

Wojciech Bukowski
