In the first post of this series, we discussed the process of creating a metastore, which is essential for enabling workspaces for Unity Catalog. In this part, I would like to cover the provisioning of key elements in Unity Catalog’s object model – specifically, catalogs and schemas. The goal of this article is also to demonstrate how Terraform constructs such as the for_each meta-argument and the yamldecode and flatten functions can simplify managing the expanding object structure in our solution.
As usual, let’s start with a quick overview of the objects we will be creating.
Catalogs are the highest-level organizational units in Unity Catalog. A catalog contains multiple schemas and helps manage permissions, data governance, and access control for a group of related data assets. Catalogs can be thought of as collections of databases.
Within a catalog, schemas are used to organize tables, views, volumes, and other data objects. Schemas help logically segment the data, allowing for better organization and control over individual datasets. They are similar to traditional database schemas.
But wait, “Catalogs can be thought of as collections of databases”? Isn’t that a mistake? No, actually, it’s not. People used to the MS SQL Server hierarchy (DATABASE.SCHEMA.TABLE) need to get used to the fact that in Databricks, DATABASE is simply an alias for SCHEMA, and the two can be used interchangeably.
In Azure Databricks, schemas are sometimes called databases. For example, CREATE DATABASE is an alias for CREATE SCHEMA. This terminology differs from that of some relational database systems in which a database is a collection of schemas. (link)
Ok, but that’s not all. We also need to configure our cloud object storage – the location where the data will physically reside. To do this, we need to set up the following:
- Databricks storage credential – a cloud-specific credential that enables Databricks to access external cloud storage systems such as ADLS, AWS S3, or GCS.
- Databricks external location – a securable object that combines a registered cloud storage path with a storage credential that grants access to that path.
External locations are used both for external data assets, such as external tables, and for managed data assets, including managed tables and managed volumes.
To store managed objects in an external location, a managed storage location must be configured. This setting can be applied at three different levels: metastore, catalog, or schema.
Configuring this property is mandatory at the metastore and/or catalog level – at least one of the two must define it. For a schema, it is an optional setting that enables features such as isolating schemas within the storage account.
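To illustrate the metastore-level default, recall from part one of this series that it is defined on the metastore resource itself. The sketch below is illustrative only – the names, region, and storage_root path are placeholder assumptions, not values from this article’s code:

```
# Illustrative sketch: managed storage location set at the metastore level.
# All names and the storage_root path below are placeholders.
resource "databricks_metastore" "this" {
  provider      = databricks.account
  name          = "metastore-neu"
  region        = "northeurope"
  storage_root  = "abfss://metastore@<storage-account>.dfs.core.windows.net/"
  force_destroy = false
}
```

If storage_root is omitted at the metastore level, every catalog must define its own managed storage location – which is exactly the approach taken later in this article.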
Implementation
Quick summary of the objects we plan to create:
- Storage credentials – authorization mechanism
- External location – cloud storage path with a storage credential
- Catalog – primary unit of data organization
- Schemas – children of a catalog that can contain tables, views, volumes, etc.
And let’s get started.
Disclaimer: Although the code below presents an end-to-end solution, it is written in a simplified version for demonstration purposes. I advise against copy-pasting it into your production solution without proper review and refactoring.
Azure Resources & Access Groups
Since the goal of this series is to prepare end-to-end solutions, we will revisit code similar to what was discussed in the previous article (repetitio est mater studiorum!).
Those familiar with Terraform and the azurerm provider may skip this section. For those who are just starting with Terraform, I recommend paying particular attention to the for_each meta-argument, which allows multiple containers to be created with a single resource block. It’s worth reviewing the code once more to fully understand it.
In the first part, we create the following objects:
- Resource Group
- Storage Account (ADLS) and Containers
- Managed Identity
- Access Connector
- Storage Blob Data Contributor Role Assignment
resource "azurerm_resource_group" "rgrp-lake" {
  name     = "rg-terradbx-lake-neu"
  location = "northeurope"
}

resource "azurerm_storage_account" "lake" {
  resource_group_name              = azurerm_resource_group.rgrp-lake.name
  location                         = "northeurope"
  name                             = "staterradbxlakeneu"
  account_tier                     = "Standard"
  access_tier                      = "Hot"
  account_replication_type         = "LRS"
  cross_tenant_replication_enabled = false
  account_kind                     = "StorageV2"
  is_hns_enabled                   = true
}

resource "azurerm_storage_container" "lake" {
  for_each              = toset(["catalog", "bronze", "silver", "gold"])
  name                  = each.key
  storage_account_name  = azurerm_storage_account.lake.name
  container_access_type = "private"
}

resource "azurerm_user_assigned_identity" "lake" {
  resource_group_name = azurerm_resource_group.rgrp-lake.name
  location            = "northeurope"
  name                = "id-terradbx-lake-neu"
}

resource "azurerm_databricks_access_connector" "lake" {
  resource_group_name = azurerm_resource_group.rgrp-lake.name
  location            = "northeurope"
  name                = "dbac-terradbx-lake-neu"

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.lake.id]
  }
}

resource "azurerm_role_assignment" "lake-contributor" {
  scope                = azurerm_storage_account.lake.id
  principal_id         = azurerm_user_assigned_identity.lake.principal_id
  role_definition_name = "Storage Blob Data Contributor"
}
Next:
- Resource Group
- Databricks Workspace
- Metastore Assignment
At this point, it’s worth noting that, unlike the objects created in the first part of this series (the metastore), in this section we will rely mainly on resources that use the Workspace API. For details, please refer here and here. In the code, you will repeatedly see provider = databricks.account and provider = databricks.workspace, which indicate which API is used during resource creation.
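For context, those two aliases assume a provider configuration roughly like the sketch below. Authentication details (Azure CLI, service principal, etc.) are omitted, and the variable name is an assumption:

```
# Account-level API (metastore assignments, groups, users).
provider "databricks" {
  alias      = "account"
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.databricks_account_id
}

# Workspace-level API (catalogs, schemas, external locations, ...).
provider "databricks" {
  alias = "workspace"
  host  = azurerm_databricks_workspace.proc.workspace_url
}
```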
resource "azurerm_resource_group" "rgrp-proc" {
  name     = "rg-terradbx-proc-neu"
  location = "northeurope"
}

resource "azurerm_databricks_workspace" "proc" {
  resource_group_name         = azurerm_resource_group.rgrp-proc.name
  location                    = "northeurope"
  name                        = "dbw-terradbx-proc-neu"
  managed_resource_group_name = "rg-terradbx-proc-neu-managed"
  sku                         = "premium"
}

resource "databricks_metastore_assignment" "proc" {
  provider             = databricks.account
  metastore_id         = var.metastore_id
  workspace_id         = azurerm_databricks_workspace.proc.workspace_id
  default_catalog_name = "main"
}
Finally, just like last time, set up a group in Unity Catalog to build good practices from the start.
resource "databricks_group" "lake" {
  provider     = databricks.account
  display_name = "gr_lake_owners"
}

data "databricks_user" "me" {
  provider  = databricks.account
  user_name = "example@gmail.com"
}

resource "databricks_group_member" "lake" {
  provider  = databricks.account
  group_id  = databricks_group.lake.id
  member_id = data.databricks_user.me.id
}
Catalogs & Schemas
Now we can move on to the elements we discussed in the first section of this article. We start with the Storage Credential, which points to the Databricks Access Connector created in Azure.
resource "databricks_storage_credential" "lake" {
  provider = databricks.workspace
  name     = "sc-terradbx-lake-neu"
  comment  = "managed from terraform"
  owner    = databricks_group.lake.display_name

  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.lake.id
    managed_identity_id = azurerm_user_assigned_identity.lake.id
  }
}
Next, we create a set of external locations, one for each container. In our example, for simplicity, the same storage credential is used for all objects. However, for full isolation, you can separate this and create dedicated objects with restricted permissions. Similarly to containers, the for_each construct allows us to handle this with a single resource.
resource "databricks_external_location" "lake" {
  provider = databricks.workspace
  for_each = toset(["catalog", "bronze", "silver", "gold"])
  name     = "loc-terradbx-lake-neu-${each.key}"
  comment  = "managed from terraform"
  owner    = databricks_group.lake.display_name

  url = format("abfss://%s@%s.dfs.core.windows.net/",
    each.key,
    azurerm_storage_account.lake.name
  )

  credential_name = databricks_storage_credential.lake.id
}
Those external locations will then be used as the storage_root for our catalogs…
resource "databricks_catalog" "lake" {
  provider     = databricks.workspace
  name         = "terradbx"
  comment      = "managed from terraform"
  owner        = databricks_group.lake.display_name
  metastore_id = var.metastore_id
  storage_root = databricks_external_location.lake["catalog"].url
}
… and schemas. Note how the appropriate naming convention allowed us to easily reference the correct external location in each of the created schemas ([each.key]).
resource "databricks_schema" "lake" {
  provider     = databricks.workspace
  for_each     = toset(["bronze", "silver", "gold"])
  name         = each.key
  owner        = databricks_group.lake.display_name
  catalog_name = databricks_catalog.lake.name
  comment      = "managed from terraform"
  storage_root = databricks_external_location.lake[each.key].url

  depends_on = [
    databricks_catalog.lake
  ]
}
And as a bonus – an external volume. There can be debate about whether it should be managed via code or treated like other objects within schemas, such as tables and views. Personally, I have often created such volumes using Infrastructure as Code (IaC), especially when they point to a location holding raw files before they are loaded into Delta tables in the first schema of a lakehouse.
resource "databricks_volume" "volumes" {
  provider         = databricks.workspace
  for_each         = toset(["raw_erp", "raw_crm", "raw_hrm"])
  name             = each.key
  catalog_name     = databricks_catalog.lake.name
  schema_name      = databricks_schema.lake["bronze"].name
  volume_type      = "EXTERNAL"
  storage_location = "${databricks_external_location.lake["bronze"].url}${each.key}"
  comment          = "managed by terraform"
}
After deploying everything, the resulting object hierarchy – catalog, schemas, and volumes – can be inspected in the workspace’s Catalog Explorer.
YAML configuration
For larger solutions, a setup built this way can quickly become cumbersome to maintain. One possible solution to this problem is to move the object configurations to a YAML file.
catalogs:
  - name: lakehouse_tmp
    owner: gr_lake_owners
    root:
      storage: staterradbxlakeneu
      path: catalog
    access_connector_id: /subscriptions/.../accessConnectors/dbac-terradbx-lake-neu
    managed_identity_id: /subscriptions/.../userAssignedIdentities/id-terradbx-lake-neu
    schemas:
      - name: raw
        storage_root: raw
      - name: bronze
        storage_root: bronze
      - name: silver
        storage_root: silver
      - name: gold
        storage_root: gold
Then, load the configuration with the yamldecode function, flatten the nested structure with the flatten function, and pass the result to the appropriate resources. These resources can then deploy a list of any length using for_each loops.
locals {
  catalogs = yamldecode(file("path/to/yaml/config/file"))

  schemas = flatten([
    for cat in local.catalogs.catalogs : [
      for sch in try(cat.schemas, []) : {
        id           = join("-", [cat.name, sch.name])
        catalog      = cat.name
        storage      = cat.root.storage
        owner        = cat.owner
        schema       = sch.name
        storage_root = sch.storage_root
      }
    ]
  ])
}
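The flattened list can then drive a single resource through for_each. The sketch below keys the map by the id built in the locals; the resource name and the abfss URL format mirror the naming convention used earlier in this article and are assumptions here, not part of the original configuration:

```
# One resource deploys every schema listed in the YAML file.
resource "databricks_schema" "from_config" {
  provider = databricks.workspace
  for_each = { for sch in local.schemas : sch.id => sch }

  name         = each.value.schema
  catalog_name = each.value.catalog
  owner        = each.value.owner
  comment      = "managed from terraform"

  # Assumed convention: one container per schema in the configured storage account.
  storage_root = format(
    "abfss://%s@%s.dfs.core.windows.net/",
    each.value.storage_root,
    each.value.storage
  )
}
```

Adding a new schema is then just one more entry in the YAML file – no new Terraform resources are required.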
Tables – please don’t 🙂
At the end of the article, I’d just like to mention that Terraform can also be used to create objects such as views and tables. After all, as the law of the instrument states: “Once you’ve learned Terraform, everything starts to look like infrastructure.”
resource "databricks_sql_table" "sample_table_managed" {
  provider           = databricks.workspace
  name               = "sample_table_managed"
  catalog_name       = databricks_catalog.lake.name
  schema_name        = databricks_schema.lake["bronze"].name
  table_type         = "MANAGED"
  data_source_format = "DELTA"
  storage_location   = ""
  comment            = "this table is managed by terraform"

  column {
    name = "id"
    type = "int"
  }
  column {
    name = "code"
    type = "string"
  }
  column {
    name = "description"
    type = "string"
  }
}
resource "databricks_sql_table" "sample_view" {
  provider     = databricks.workspace
  name         = "sample_view"
  catalog_name = databricks_catalog.lake.name
  schema_name  = databricks_schema.lake["bronze"].name
  table_type   = "VIEW"
  comment      = "this view is managed by terraform"

  view_definition = format(
    "SELECT * FROM %s",
    databricks_sql_table.sample_table_managed.id
  )

  depends_on = [
    databricks_sql_table.sample_table_managed
  ]
}
But please remember, just because you can, doesn’t mean you should …
- Terraforming Databricks #3: Lakehouse Federation - October 15, 2024
- Terraforming Databricks #2: Catalogs & Schemas - September 16, 2024
- Terraforming Databricks #1: Unity Catalog Metastore - September 4, 2024




