Terraforming Databricks #2: Catalogs & Schemas

In the first post of this series, we discussed the process of creating a metastore, which is essential for enabling workspaces for Unity Catalog. In this part, I would like to cover the provisioning of key elements of Unity Catalog’s object model – specifically, catalogs and schemas. The goal of this article is also to demonstrate how Terraform constructs such as for_each, yamldecode, and flatten can simplify managing the expanding object structure in our solution.

As usual, let’s start with a quick overview of the objects we will be creating.

Catalogs are the highest-level organizational units in Unity Catalog. A catalog contains multiple schemas and helps manage permissions, data governance, and access control for a group of related data assets. Catalogs can be thought of as collections of databases.

Within a catalog, schemas are used to organize tables, views, volumes, and other data objects. Schemas help logically segment the data, allowing for better organization and control over individual datasets. They are similar to traditional database schemas.

But wait, “Catalogs can be thought of as collections of databases”? Isn’t that a mistake? No, actually, it’s not. People used to the MS SQL Server hierarchy (DATABASE.SCHEMA.TABLE) need to get used to the fact that in Databricks, DATABASE is simply an alias for SCHEMA, and the two can be used interchangeably.

In Azure Databricks, schemas are sometimes called databases. For example, CREATE DATABASE is an alias for CREATE SCHEMA. This terminology differs from that of some relational database systems in which a database is a collection of schemas. (link)

Ok, but that’s not all. We also need to configure our cloud object storage – the location where the data will physically reside. To do this, we need to set up the following:

  • Databricks storage credential – a cloud-specific credential that enables Databricks to access external cloud storage systems such as ADLS, AWS S3, or GCS.
  • Databricks external location – a securable object that combines a registered cloud storage path with a storage credential that allows access to that path.

External locations are used both for external data assets, such as external tables, and for managed data assets, including managed tables and managed volumes.

To store managed objects in an external location, a managed storage location must be configured. This setting can be applied at three different levels: metastore, catalog, or schema.

Configuring this property is mandatory at the metastore and/or catalog level (at least one of the two). For a schema it is an optional setting that enables features such as isolating each schema’s data within the storage account.

Implementation

Quick summary of the objects we plan to create:

  • Storage credentials – authorization mechanism
  • External location – cloud storage path with a storage credential
  • Catalog – primary unit of data organization
  • Schemas – children of a catalog; they can contain tables, views, volumes, etc.

And let’s get started.

Disclaimer: Although the code below presents an end-to-end solution, it is simplified for demonstration purposes. I advise against copy-pasting it into a production solution without proper review and refactoring.

Azure Resources & Access Groups

Since the goal of this series is to prepare end-to-end solutions, we will revisit code similar to what was discussed in the previous article (repetitio est mater studiorum!).

Those familiar with Terraform and the azurerm provider may skip this section. For those just starting with Terraform, I especially recommend looking at the for_each meta-argument, which allows multiple containers to be created from a single resource block. It’s worth reviewing the code once more to fully understand it.
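If for_each is new to you, here is a small standalone sketch (the resource name and hard-coded values are illustrative only, not part of the solution) showing the two mechanisms we rely on throughout this post: each.key inside the loop and the ["key"] syntax for addressing a single created instance.

resource "azurerm_storage_container" "demo" {
  # One instance is created per element of the set.
  for_each = toset(["bronze", "silver", "gold"])

  name                  = each.key # "bronze", "silver" or "gold"
  storage_account_name  = "staterradbxlakeneu"
  container_access_type = "private"
}

# Individual instances are addressed by their key:
output "bronze_container_id" {
  value = azurerm_storage_container.demo["bronze"].id
}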

In the first part, we create the following objects:

  • Resource Group
  • Storage Account (ADLS) and Containers
  • Managed Identity
  • Access Connector
  • Storage Blob Data Contributor Role Assignment
resource "azurerm_resource_group" "rgrp-lake" {
  name     = "rg-terradbx-lake-neu"
  location = "northeurope"
}

resource "azurerm_storage_account" "lake" {
  resource_group_name = azurerm_resource_group.rgrp-lake.name
  location            = "northeurope"

  name                             = "staterradbxlakeneu"
  account_tier                     = "Standard"
  access_tier                      = "Hot"
  account_replication_type         = "LRS"
  cross_tenant_replication_enabled = false
  account_kind                     = "StorageV2"
  is_hns_enabled                   = true
}

resource "azurerm_storage_container" "lake" {
  for_each = toset(["catalog", "bronze", "silver", "gold"])

  name                  = each.key
  storage_account_name  = azurerm_storage_account.lake.name
  container_access_type = "private"
}

resource "azurerm_user_assigned_identity" "lake" {
  resource_group_name = azurerm_resource_group.rgrp-lake.name
  location            = "northeurope"

  name = "id-terradbx-lake-neu"
}

resource "azurerm_databricks_access_connector" "lake" {
  resource_group_name = azurerm_resource_group.rgrp-lake.name
  location            = "northeurope"

  name = "dbac-terradbx-lake-neu"

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.lake.id]
  }
}

resource "azurerm_role_assignment" "lake-contributor" {
  scope                = azurerm_storage_account.lake.id
  principal_id         = azurerm_user_assigned_identity.lake.principal_id
  role_definition_name = "Storage Blob Data Contributor"
}

Next:

  • Resource Group
  • Databricks Workspace
  • Metastore Assignment

At this point, it’s worth noting that, unlike the objects created in the first part of this series (the metastore), in this section we will rely mainly on resources that use the Workspace API. For details, please refer here and here. In the code, you will repeatedly see provider = databricks.account and provider = databricks.workspace, which indicate which API is used when a given resource is created.
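For reference, the two provider aliases used below could be configured more or less like this (a minimal sketch: the account ID variable is an assumption, and authentication details – Azure CLI, service principal, etc. – are omitted and depend on your environment):

# Account-level API – used e.g. for metastore assignments and groups.
provider "databricks" {
  alias      = "account"
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.databricks_account_id
}

# Workspace-level API – used for storage credentials, external locations,
# catalogs, schemas and volumes; points at the workspace created below.
provider "databricks" {
  alias = "workspace"
  host  = azurerm_databricks_workspace.proc.workspace_url
}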

resource "azurerm_resource_group" "rgrp-proc" {
  name     = "rg-terradbx-proc-neu"
  location = "northeurope"
}

resource "azurerm_databricks_workspace" "proc" {
  resource_group_name = azurerm_resource_group.rgrp-proc.name
  location            = "northeurope"

  name                        = "dbw-terradbx-proc-neu"
  managed_resource_group_name = "rg-terradbx-proc-neu-managed"
  sku                         = "premium"
}

resource "databricks_metastore_assignment" "proc" {
  provider = databricks.account

  metastore_id         = var.metastore_id
  workspace_id         = azurerm_databricks_workspace.proc.workspace_id
  default_catalog_name = "main"
}

Finally, just like last time, we set up a Databricks account-level group to build good practices from the start.

resource "databricks_group" "lake" {
  provider = databricks.account

  display_name = "gr_lake_owners"
}

data "databricks_user" "me" {
  provider = databricks.account

  user_name = "example@gmail.com"
}

resource "databricks_group_member" "lake" {
  provider = databricks.account

  group_id  = databricks_group.lake.id
  member_id = data.databricks_user.me.id
}

Catalogs & Schemas

Now we can move on to the elements we discussed in the first section of this article. We start with the Storage Credential, which points to the Databricks Access Connector created in Azure.

resource "databricks_storage_credential" "lake" {
  provider = databricks.workspace

  name    = "sc-terradbx-lake-neu"
  comment = "managed from terraform"
  owner   = "gr_lake_owners"

  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.lake.id
    managed_identity_id = azurerm_user_assigned_identity.lake.id
  }
}

Next, we create a set of external locations, one per container. In our example, for simplicity, the same storage credential is used for all of them. For full isolation, however, you could create dedicated credentials with restricted permissions. As with the containers, the for_each construct lets us handle this with a single resource.

resource "databricks_external_location" "lake" {
  provider = databricks.workspace
  for_each = toset(["catalog", "bronze", "silver", "gold"])

  name    = "loc-terradbx-lake-neu-${each.key}"
  comment = "managed from terraform"
  owner   = databricks_group.lake.display_name

  url = format("abfss://%s@%s.dfs.core.windows.net/",
    each.key,
    azurerm_storage_account.lake.name
  )

  credential_name = databricks_storage_credential.lake.id
}

Those external locations will then be used as the storage_root for our catalog…

resource "databricks_catalog" "lake" {
  provider = databricks.workspace

  name    = "terradbx"
  comment = "managed from terraform"
  owner   = databricks_group.lake.display_name

  metastore_id = var.metastore_id
  storage_root = databricks_external_location.lake["catalog"].url
}

… and schemas. Note how a consistent naming convention lets us easily reference the correct external location for each schema being created ([each.key]).

resource "databricks_schema" "lake" {
  provider = databricks.workspace
  for_each = toset(["bronze", "silver", "gold"])

  name         = each.key
  owner        = databricks_group.lake.display_name
  catalog_name = databricks_catalog.lake.name
  comment      = "managed from terraform"

  storage_root = databricks_external_location.lake[each.key].url

  depends_on = [
    databricks_catalog.lake
  ]
}

And as a bonus – an external volume. There is room for debate about whether such volumes should be managed via code or treated like other objects within schemas, such as tables and views. Personally, I have often created them using Infrastructure as Code (IaC), especially when they point to a directory of raw data that has not yet been loaded into the Delta tables of the lakehouse’s first schema.

resource "databricks_volume" "volumes" {
  provider = databricks.workspace
  for_each = toset(["raw_erp", "raw_crm", "raw_hrm"])

  name             = each.key
  catalog_name     = databricks_catalog.lake.name
  schema_name      = databricks_schema.lake["bronze"].name
  volume_type      = "EXTERNAL"
  storage_location = "${databricks_external_location.lake["bronze"].url}${each.key}"
  comment          = "managed by terraform"
}

After deploying everything, our solution should look as follows:

YAML configuration

For larger solutions, a setup built this way can quickly become cumbersome to maintain. One possible solution to this problem is to move the object configurations to a YAML file.

catalogs:
- name: lakehouse_tmp
  owner: gr_lake_owners
  root:
    storage: staterradbxlakeneu
    path: catalog
    access_connector_id: /subscriptions/.../accessConnectors/dbac-terradbx-lake-neu
    managed_identity_id: /subscriptions/.../userAssignedIdentities/id-terradbx-lake-neu
  schemas:
  - name: raw
    storage_root: raw
  - name: bronze
    storage_root: bronze
  - name: silver
    storage_root: silver
  - name: gold
    storage_root: gold

Then, load the configuration with the yamldecode function (wrapping file to read the file’s contents), flatten it with the flatten function, and pass it to the appropriate resources. These resources can then deploy a list of any length using for_each loops.

locals {
  # yamldecode expects YAML content, not a path, so read the file first
  catalogs = yamldecode(file("path/to/yaml/config/file"))

  schemas = flatten([
    for cat in local.catalogs.catalogs : [
      for sch in (cat.schemas == null ? [] : cat.schemas) : {
        id           = join("-", [cat.name, sch.name])
        catalog      = cat.name
        storage      = cat.root.storage
        owner        = cat.owner
        schema       = sch.name
        storage_root = sch.storage_root
      }
    ]
  ])
}
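
To close the loop, here is a minimal sketch of how the flattened list could feed a resource. The resource name from_config and the way the storage_root URL is assembled are my assumptions, and it presumes the matching catalogs and external locations are provisioned from the same configuration. Because for_each requires a map, the list is keyed by the generated id:

resource "databricks_schema" "from_config" {
  provider = databricks.workspace

  # Convert the list into a map keyed by the unique "catalog-schema" id.
  for_each = { for s in local.schemas : s.id => s }

  name         = each.value.schema
  catalog_name = each.value.catalog
  owner        = each.value.owner
  comment      = "managed from terraform"

  # Built from the storage account and container names held in the YAML config;
  # an external location covering this path must already exist.
  storage_root = format(
    "abfss://%s@%s.dfs.core.windows.net/",
    each.value.storage_root,
    each.value.storage
  )
}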

Tables – please don’t 🙂

At the end of the article, I’d just like to mention that Terraform can also be used to create objects such as views and tables. After all, as the law of the instrument states: “Once you’ve learned Terraform, everything starts to look like infrastructure.”

resource "databricks_sql_table" "sample_table_managed" {
  provider = databricks.workspace

  name               = "sample_table_managed"
  catalog_name       = databricks_catalog.lake.name
  schema_name        = databricks_schema.lake["bronze"].name
  table_type         = "MANAGED"
  data_source_format = "DELTA"
  storage_location   = ""
  comment            = "this table is managed by terraform"

  column {
    name = "id"
    type = "int"
  }
  column {
    name = "code"
    type = "string"
  }
  column {
    name = "description"
    type = "string"
  }
}

resource "databricks_sql_table" "sample_view" {
  provider     = databricks.workspace
  name         = "sample_view"
  catalog_name = databricks_catalog.lake.name
  schema_name  = databricks_schema.lake["bronze"].name
  table_type   = "VIEW"
  comment      = "this view is managed by terraform"

  view_definition = format(
    "SELECT * FROM %s",
    databricks_sql_table.sample_table_managed.id
  )

  depends_on = [
    databricks_sql_table.sample_table_managed
  ]
}

But please remember: just because you can doesn’t mean you should…
