In the first post of this series, we discussed the process of creating a metastore, which is essential for enabling workspaces for Unity Catalog. In this part, I would like to cover the provisioning of key elements of Unity Catalog's object model – specifically, catalogs and schemas. The goal of this article is also to demonstrate how Terraform constructs such as for_each, yamldecode, and flatten can simplify managing the expanding object structure in our solution.
As usual, let’s start with a quick overview of the objects we will be creating.
Catalogs are the highest-level organizational units in Unity Catalog. A catalog contains multiple schemas and helps manage permissions, data governance, and access control for a group of related data assets. Catalogs can be thought of as collections of databases.
Within a catalog, schemas are used to organize tables, views, volumes, and other data objects. Schemas help logically segment the data, allowing for better organization and control over individual datasets. They are similar to traditional database schemas.
But wait, “Catalogs can be thought of as collections of databases”? Isn’t that a mistake? No, actually, it’s not. People used to the MS SQL Server hierarchy (DATABASE.SCHEMA.TABLE) need to get used to the fact that in Databricks, DATABASE is simply an alias for SCHEMA, and the two can be used interchangeably.
In Azure Databricks, schemas are sometimes called databases. For example, CREATE DATABASE is an alias for CREATE SCHEMA. This terminology differs from that of some relational database systems in which a database is a collection of schemas. (link)
Ok, but that’s not all. We also need to configure our cloud object storage – the location where the data will physically reside. To do this, we need to set up the following:
- Databricks storage credential – a cloud-specific credential that enables Databricks to access external cloud storage systems such as ADLS, AWS S3, or GCS.
- Databricks external location – a securable object that combines a registered cloud storage path with a storage credential that allows access to that path.
External locations are used for both external data assets, such as external tables, as well as managed data assets, including managed tables and managed volumes.
To store managed objects in an external location, a managed storage location must be configured. This setting can be applied at three levels: metastore, catalog, or schema. Configuring it is mandatory for at least one of the first two – the metastore and/or the catalog. For a schema it is optional and enables features such as isolating each schema’s data within the storage account.
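To make the three levels concrete, below is a minimal, hypothetical sketch (resource names and storage URLs are illustrative placeholders, not taken from this project) of where the storage root setting lives at each level in the Databricks Terraform provider:

# Illustrative only – names and URLs are placeholders.

# Metastore level (covered in the first part of this series):
resource "databricks_metastore" "example" {
  name         = "example-metastore"
  storage_root = "abfss://metastore@examplestorage.dfs.core.windows.net/"
}

# Catalog level – mandatory if the metastore has no storage root:
resource "databricks_catalog" "example" {
  name         = "example_catalog"
  storage_root = "abfss://catalog@examplestorage.dfs.core.windows.net/"
}

# Schema level – optional, isolates the schema's managed data:
resource "databricks_schema" "example" {
  catalog_name = databricks_catalog.example.name
  name         = "bronze"
  storage_root = "abfss://bronze@examplestorage.dfs.core.windows.net/"
}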
Implementation
Quick summary of the objects we plan to create:
- Storage credentials – authorization mechanism
- External location – cloud storage path with a storage credential
- Catalog – primary unit of data organization
- Schemas – children of a catalog; they can contain tables, views, volumes, etc.
And let’s get started.
Disclaimer: Although the code below presents an end-to-end solution, it is written in a simplified version for demonstration purposes. I advise against copy-pasting it into your production solution without proper review and refactoring.
Azure Resources & Access Groups
Since the goal of this series is to prepare end-to-end solutions, we will revisit code similar to what was discussed in the previous article (repetitio est mater studiorum!).
Those familiar with Terraform and the azurerm provider may skip this section. For those who are just starting with Terraform, I recommend taking a closer look at the for_each meta-argument, which allows multiple containers to be created with a single resource. It’s worth reviewing the code once more to fully understand it.
In the first part, we create the following objects:
- Resource Group
- Storage Account (ADLS) and Containers
- Managed Identity
- Access Connector
- Storage Blob Data Contributor Role Assignment
resource "azurerm_resource_group" "rgrp-lake" { name = "rg-terradbx-lake-neu" location = "northeurope" } resource "azurerm_storage_account" "lake" { resource_group_name = azurerm_resource_group.rgrp-lake.name location = "northeurope" name = "staterradbxlakeneu" account_tier = "Standard" access_tier = "Hot" account_replication_type = "LRS" cross_tenant_replication_enabled = false account_kind = "StorageV2" is_hns_enabled = true } resource "azurerm_storage_container" "lake" { for_each = toset(["catalog", "bronze", "silver", "gold"]) name = each.key storage_account_name = azurerm_storage_account.lake.name container_access_type = "private" } resource "azurerm_user_assigned_identity" "lake" { resource_group_name = azurerm_resource_group.rgrp-lake.name location = "northeurope" name = "id-terradbx-lake-neu" } resource "azurerm_databricks_access_connector" "lake" { resource_group_name = azurerm_resource_group.rgrp-lake.name location = "northeurope" name = "dbac-terradbx-lake-neu" identity { type = "UserAssigned" identity_ids = [azurerm_user_assigned_identity.lake.id] } } resource "azurerm_role_assignment" "lake-contributor" { scope = azurerm_storage_account.lake.id principal_id = azurerm_user_assigned_identity.lake.principal_id role_definition_name = "Storage Blob Data Contributor" }
Next:
- Resource Group
- Databricks Workspace
- Metastore Assignment
At this point, it’s worth noting that, unlike the objects created in the first part of this series (the metastore), in this section we will rely mainly on resources that use the Workspace API. For details, please refer here and here. In the code, you will repeatedly see provider = databricks.account and provider = databricks.workspace, which indicate which API is used when a given resource is created.
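If you followed the previous part, these aliases are probably already declared; for reference, here is a minimal sketch of what such a configuration could look like (the account ID variable is hypothetical, and the exact values depend on your environment):

# Sketch of the two provider aliases referenced throughout this article.
provider "databricks" {
  alias      = "account"
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.databricks_account_id # hypothetical variable
}

provider "databricks" {
  alias = "workspace"
  host  = azurerm_databricks_workspace.proc.workspace_url
}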
resource "azurerm_resource_group" "rgrp-proc" { name = "rg-terradbx-proc-neu" location = "northeurope" } resource "azurerm_databricks_workspace" "proc" { resource_group_name = azurerm_resource_group.rgrp-proc.name location = "northeurope" name = "dbw-terradbx-proc-neu" managed_resource_group_name = "rg-terradbx-proc-neu-managed" sku = "premium" } resource "databricks_metastore_assignment" "proc" { provider = databricks.account metastore_id = var.metastore_id workspace_id = azurerm_databricks_workspace.proc.workspace_id default_catalog_name = "main" }
Finally, just like last time, set up a group in Unity Catalog to build good practices from the start.
resource "databricks_group" "lake" { provider = databricks.account display_name = "gr_lake_owners" } data "databricks_user" "me" { provider = databricks.account user_name = "example@gmail.com" } resource "databricks_group_member" "lake" { provider = databricks.account group_id = databricks_group.lake.id member_id = data.databricks_user.me.id }
Catalogs & Schemas
Now we can move on to the elements we discussed in the first section of this article. We start with the Storage Credential, which points to the Databricks Access Connector created in Azure.
resource "databricks_storage_credential" "lake" { provider = databricks.workspace name = "sc-terradbx-lake-neu" comment = "managed from terraform" owner = "gr_lake_owners" azure_managed_identity { access_connector_id = azurerm_databricks_access_connector.lake.id managed_identity_id = azurerm_user_assigned_identity.lake.id } }
Next, we create a set of external locations, one for each container. In our example, for simplicity, the same storage credential is used for all of them; for full isolation, you could instead create dedicated credentials with restricted permissions. As with the containers, the for_each construct lets us handle this with a single resource.
resource "databricks_external_location" "lake" { provider = databricks.workspace for_each = toset(["catalog", "bronze", "silver", "gold"]) name = "loc-terradbx-lake-neu-${each.key}" comment = "managed from terraform" owner = databricks_group.lake.display_name url = format("abfss://%s@%s.dfs.core.windows.net/", each.key, azurerm_storage_account.lake.name ) credential_name = databricks_storage_credential.lake.id }
Those external locations will then be used as the storage_root for our catalogs…
resource "databricks_catalog" "lake" { provider = databricks.workspace name = "terradbx" comment = "managed from terraform" owner = databricks_group.lake.display_name metastore_id = var.metastore_id storage_root = databricks_external_location.lake["catalog"].url }
… and schemas. Note how the appropriate naming convention allowed us to easily reference the correct external location in each of the created schemas ([each.key]).
resource "databricks_schema" "lake" { provider = databricks.workspace for_each = toset(["bronze", "silver", "gold"]) name = each.key owner = databricks_group.lake.display_name catalog_name = databricks_catalog.lake.name comment = "managed from terraform" storage_root = databricks_external_location.lake[each.key].url depends_on = [ databricks_catalog.lake ] }
And as a bonus – an external volume. There is room for debate about whether volumes should be managed via code or treated like other objects within schemas, such as tables and views. Personally, I have often created such volumes using Infrastructure as Code (IaC), especially when they point to a directory with raw data before it is loaded into Delta tables in the first schema of a lakehouse.
resource "databricks_volume" "volumes" { provider = databricks.workspace for_each = toset(["raw_erp", "raw_crm", "raw_hrm"]) name = each.key catalog_name = databricks_catalog.lake.name schema_name = databricks_schema.lake["bronze"].name volume_type = "EXTERNAL" storage_location = "${databricks_external_location.lake["bronze"].url}${each.key}" comment = "managed by terraform" }
After deploying everything, our solution should look as follows:
YAML configuration
For larger solutions, a setup built this way can quickly become cumbersome to maintain. One possible solution to this problem is to move the object configurations to a YAML file.
catalogs:
  - name: lakehouse_tmp
    owner: gr_lake_owners
    root:
      storage: staterradbxlakeneu
      path: catalog
    access_connector_id: /subscriptions/.../accessConnectors/dbac-terradbx-lake-neu
    managed_identity_id: /subscriptions/.../userAssignedIdentities/id-terradbx-lake-neu
    schemas:
      - name: raw
        storage_root: raw
      - name: bronze
        storage_root: bronze
      - name: silver
        storage_root: silver
      - name: gold
        storage_root: gold
Then, load the configuration using the yamldecode function, flatten it with the flatten function, and pass it to the appropriate resources. These resources can then deploy a list of any length using loops.
locals {
  # yamldecode expects YAML content, not a path, so read the file first
  catalogs = yamldecode(file("path/to/yaml/config/file"))

  schemas = flatten([
    for cat in local.catalogs.catalogs : [
      for sch in (cat.schemas == null ? [] : cat.schemas) : {
        id           = join("-", [cat.name, sch.name])
        catalog      = cat.name
        storage      = cat.root.storage
        owner        = cat.owner
        schema       = sch.name
        storage_root = sch.storage_root
      }
    ]
  ])
}
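To close the loop, here is one possible way (a sketch, not necessarily how you would structure production code) to feed the flattened list into a resource: converting it into a map keyed by the id built with join, so each schema gets a stable address in the state.

# Sketch: the flattened list becomes a map, so a single resource
# can create every schema defined in the YAML file.
resource "databricks_schema" "from_yaml" {
  provider = databricks.workspace
  for_each = { for sch in local.schemas : sch.id => sch }

  name         = each.value.schema
  catalog_name = each.value.catalog
  owner        = each.value.owner
  comment      = "managed from terraform"

  # Illustrative: builds the ABFSS path directly from the YAML fields;
  # earlier in the article the URLs come from databricks_external_location objects.
  storage_root = format(
    "abfss://%s@%s.dfs.core.windows.net/",
    each.value.storage_root,
    each.value.storage
  )
}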
Tables – please don’t 🙂
At the end of the article, I’d just like to mention that Terraform can also be used to create objects such as views and tables. After all, as the law of the instrument states: “Once you’ve learned Terraform, everything starts to look like infrastructure.”
resource "databricks_sql_table" "sample_table_managed" { provider = databricks.workspace name = "sample_table_managed" catalog_name = databricks_catalog.lake.name schema_name = databricks_schema.lake["bronze"].name table_type = "MANAGED" data_source_format = "DELTA" storage_location = "" comment = "this table is managed by terraform" column { name = "id" type = "int" } column { name = "code" type = "string" } column { name = "description" type = "string" } } resource "databricks_sql_table" "sample_view" { provider = databricks.workspace name = "sample_view" catalog_name = databricks_catalog.lake.name schema_name = databricks_schema.lake["bronze"].name table_type = "VIEW" comment = "this view is managed by terraform" view_definition = format( "SELECT * FROM %s", databricks_sql_table.sample_table_managed.id ) depends_on = [ databricks_sql_table.sample_table_managed ] }
But please remember, just because you can, doesn’t mean you should …