Terraforming Databricks #2: Catalogs & Schemas

In the first post of this series, we discussed the process of creating a metastore, which is essential for enabling Unity Catalog on workspaces. In this part, I would like to cover the provisioning of key elements of Unity Catalog's object model – specifically, catalogs and schemas. The goal of this article is also to demonstrate how Terraform constructs such as for_each, yamldecode, and flatten can simplify managing the expanding object structure in our solution.

As usual, let’s start with a quick overview of the objects we will be creating.

Catalogs are the highest-level organizational units in Unity Catalog. A catalog contains multiple schemas and helps manage permissions, data governance, and access control for a group of related data assets. Catalogs can be thought of as collections of databases.

Within a catalog, schemas are used to organize tables, views, volumes, and other data objects. Schemas help logically segment the data, allowing for better organization and control over individual datasets. They are similar to traditional database schemas.

But wait, “Catalogs can be thought of as collections of databases”? Isn’t that a mistake? No, actually, it’s not. People used to the MS SQL Server hierarchy (DATABASE.SCHEMA.TABLE) need to get used to the fact that in Databricks, DATABASE is simply an alias for SCHEMA and the two can be used interchangeably.

In Azure Databricks, schemas are sometimes called databases. For example, CREATE DATABASE is an alias for CREATE SCHEMA. This terminology differs from that of some relational database systems in which a database is a collection of schemas. (link)

Ok, but that’s not all. We also need to configure our cloud object storage – the location where the data will physically reside. To do this, we need to set up the following:

  • Databricks storage credential – a cloud-specific credential that enables Databricks to access external cloud storage systems such as ADLS, AWS S3, or GCS.
  • Databricks external location – a securable object that combines a registered cloud storage path with a storage credential that grants access to that path.

External locations are used for both external data assets, such as external tables, and managed data assets, including managed tables and managed volumes.

To store managed objects in an external location, a managed storage location must be configured. This setting can be applied at three different levels: metastore, catalog, or schema.

Configuring this property is mandatory at the METASTORE or CATALOG level (at least one of the two). For SCHEMA, it is an optional setting that enables features such as isolating each schema’s data within the storage account.
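
To make the schema-level option concrete, here is a minimal, hypothetical sketch (the resource names, catalog name, and storage URL are placeholders, not part of this series’ code): a schema defined without storage_root stores its managed tables under the catalog’s managed location, while a schema with storage_root keeps its managed data isolated under its own path.

# Hypothetical sketch – names and URLs are placeholders; provider configuration omitted.
resource "databricks_schema" "inherits_catalog_location" {
  catalog_name = "example_catalog"
  name         = "staging"
  # No storage_root: managed tables land under the catalog's managed location.
}

resource "databricks_schema" "isolated_location" {
  catalog_name = "example_catalog"
  name         = "isolated"
  # Explicit storage_root: managed data for this schema is kept under its own path.
  storage_root = "abfss://isolated@stexample.dfs.core.windows.net/"
}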

Implementation

Quick summary of the objects we plan to create:

  • Storage credentials – authorization mechanism
  • External location – cloud storage path with a storage credential
  • Catalog – primary unit of data organization
  • Schemas – children of a catalog that can contain tables, views, volumes, etc.

And let’s get started.

Disclaimer: Although the code below presents an end-to-end solution, it is written in a simplified version for demonstration purposes. I advise against copy-pasting it into your production solution without proper review and refactoring.

Azure Resources & Access Groups

Since the goal of this series is to prepare end-to-end solutions, we will revisit code similar to what was discussed in the previous article (repetitio est mater studiorum!).

Those familiar with Terraform and the azurerm provider may skip this section. For those who are just starting with Terraform, I recommend paying particular attention to the for_each clause, which allows multiple containers to be created with a single resource. It’s worth reviewing the code once more to fully understand it.

In the first part, we create the following objects:

  • Resource Group
  • Storage Account (ADLS) and Containers
  • Managed Identity
  • Access Connector
  • Storage Blob Data Contributor Role Assignment

resource "azurerm_resource_group" "rgrp-lake" {
  name     = "rg-terradbx-lake-neu"
  location = "northeurope"
}

resource "azurerm_storage_account" "lake" {
  resource_group_name = azurerm_resource_group.rgrp-lake.name
  location            = "northeurope"

  name                             = "staterradbxlakeneu"
  account_tier                     = "Standard"
  access_tier                      = "Hot"
  account_replication_type         = "LRS"
  cross_tenant_replication_enabled = false
  account_kind                     = "StorageV2"
  is_hns_enabled                   = true
}

resource "azurerm_storage_container" "lake" {
  for_each = toset(["catalog", "bronze", "silver", "gold"])

  name                  = each.key
  storage_account_name  = azurerm_storage_account.lake.name
  container_access_type = "private"
}

resource "azurerm_user_assigned_identity" "lake" {
  resource_group_name = azurerm_resource_group.rgrp-lake.name
  location            = "northeurope"

  name = "id-terradbx-lake-neu"
}

resource "azurerm_databricks_access_connector" "lake" {
  resource_group_name = azurerm_resource_group.rgrp-lake.name
  location            = "northeurope"

  name = "dbac-terradbx-lake-neu"

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.lake.id]
  }
}

resource "azurerm_role_assignment" "lake-contributor" {
  scope                = azurerm_storage_account.lake.id
  principal_id         = azurerm_user_assigned_identity.lake.principal_id
  role_definition_name = "Storage Blob Data Contributor"
}

Next:

  • Resource Group
  • Databricks Workspace
  • Metastore Assignment

At this point, it’s worth noting that, unlike the objects created in the first part of this series (the metastore), in this section we will rely mainly on resources that use the Workspace API. For details, please refer here and here. In the code, you will repeatedly see provider = databricks.account and provider = databricks.workspace, which indicate which API is used when each resource is created.
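
The post assumes these two provider aliases are already configured. For reference, a minimal sketch could look like the following; the host, the account ID variable, and the authentication method are assumptions and will differ per environment.

# Sketch only – account ID variable and auth method are assumptions.
provider "databricks" {
  alias      = "account"
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.databricks_account_id
}

provider "databricks" {
  alias = "workspace"
  host  = azurerm_databricks_workspace.proc.workspace_url
}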

resource "azurerm_resource_group" "rgrp-proc" {
  name     = "rg-terradbx-proc-neu"
  location = "northeurope"
}

resource "azurerm_databricks_workspace" "proc" {
  resource_group_name = azurerm_resource_group.rgrp-proc.name
  location            = "northeurope"

  name                        = "dbw-terradbx-proc-neu"
  managed_resource_group_name = "rg-terradbx-proc-neu-managed"
  sku                         = "premium"
}

resource "databricks_metastore_assignment" "proc" {
  provider = databricks.account

  metastore_id         = var.metastore_id
  workspace_id         = azurerm_databricks_workspace.proc.workspace_id
  default_catalog_name = "main"
}

Finally, just like last time, set up a group in Unity Catalog to build good practices from the start.

resource "databricks_group" "lake" {
  provider = databricks.account

  display_name = "gr_lake_owners"
}

data "databricks_user" "me" {
  provider = databricks.account

  user_name = "example@gmail.com"
}

resource "databricks_group_member" "lake" {
  provider = databricks.account

  group_id  = databricks_group.lake.id
  member_id = data.databricks_user.me.id
}

Catalogs & Schemas

Now we can move on to the elements we discussed in the first section of this article. We start with the Storage Credential, which points to the Databricks Access Connector created in Azure.

resource "databricks_storage_credential" "lake" {
  provider = databricks.workspace

  name    = "sc-terradbx-lake-neu"
  comment = "managed from terraform"
  owner   = "gr_lake_owners"

  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.lake.id
    managed_identity_id = azurerm_user_assigned_identity.lake.id
  }
}

Next, we create a set of external locations, one for each container. In our example, for simplicity, the same storage credential is used for all of them. For full isolation, however, you can create dedicated credentials with restricted permissions. As with the containers, the for_each construct allows us to handle this with a single resource.

resource "databricks_external_location" "lake" {
  provider = databricks.workspace
  for_each = toset(["catalog", "bronze", "silver", "gold"])

  name    = "loc-terradbx-lake-neu-${each.key}"
  comment = "managed from terraform"
  owner   = databricks_group.lake.display_name

  url = format("abfss://%s@%s.dfs.core.windows.net/",
    each.key,
    azurerm_storage_account.lake.name
  )

  credential_name = databricks_storage_credential.lake.id
}

Those external locations will then be used as the storage_root for our catalogs…

resource "databricks_catalog" "lake" {
  provider = databricks.workspace

  name    = "terradbx"
  comment = "managed from terraform"
  owner   = databricks_group.lake.display_name

  metastore_id = var.metastore_id
  storage_root = databricks_external_location.lake["catalog"].url
}

… and schemas. Note how the appropriate naming convention allowed us to easily reference the correct external location in each of the created schemas ([each.key]).

resource "databricks_schema" "lake" {
  provider = databricks.workspace
  for_each = toset(["bronze", "silver", "gold"])

  name         = each.key
  owner        = databricks_group.lake.display_name
  catalog_name = databricks_catalog.lake.name
  comment      = "managed from terraform"

  storage_root = databricks_external_location.lake[each.key].url

  depends_on = [
    databricks_catalog.lake
  ]
}

And as a bonus – external volumes. There is room for debate about whether they should be managed via code or treated like other objects within schemas, such as tables and views. Personally, I have often created such volumes using Infrastructure as Code (IaC), especially when they point to a directory of raw data that has not yet been loaded into Delta tables in the first schema of a lakehouse.

resource "databricks_volume" "volumes" {
  provider = databricks.workspace
  for_each = toset(["raw_erp", "raw_crm", "raw_hrm"])

  name             = each.key
  catalog_name     = databricks_catalog.lake.name
  schema_name      = databricks_schema.lake["bronze"].name
  volume_type      = "EXTERNAL"
  storage_location = "${databricks_external_location.lake["bronze"].url}${each.key}"
  comment          = "managed by terraform"
}

After deploying everything, our solution should look as follows:

YAML configuration

For larger solutions, a setup built this way can quickly become cumbersome to maintain. One possible solution to this problem is to move the object configurations to a YAML file.

catalogs:
- name: lakehouse_tmp
  owner: gr_lake_owners
  root:
    storage: staterradbxlakeneu
    path: catalog
    access_connector_id: /subscriptions/.../accessConnectors/dbac-terradbx-lake-neu
    managed_identity_id: /subscriptions/.../userAssignedIdentities/id-terradbx-lake-neu
  schemas:
  - name: raw
    storage_root: raw
  - name: bronze
    storage_root: bronze
  - name: silver
    storage_root: silver
  - name: gold
    storage_root: gold

Then we load the file with the file function, decode it with yamldecode, flatten the nested structure with flatten, and pass the result to the appropriate resources. These resources can then deploy a list of any length using loops.

locals {
  catalogs = yamldecode(file("path/to/yaml/config/file"))

  schemas = flatten([
    for cat in local.catalogs.catalogs : [
      for sch in (cat.schemas == null ? [] : cat.schemas) : {
        id           = join("-", [cat.name, sch.name])
        catalog      = cat.name
        storage      = cat.root.storage
        owner        = cat.owner
        schema       = sch.name
        storage_root = sch.storage_root
      }
    ]
  ])
}
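
As an illustration of that last step, here is a hedged sketch (not part of the original setup) of how the flattened list could feed a schema resource: the map key comes from the id field built above, and the reference to databricks_external_location.lake[...] assumes the external locations are keyed by the same names as the storage_root values in the YAML file.

resource "databricks_schema" "from_yaml" {
  provider = databricks.workspace
  for_each = { for sch in local.schemas : sch.id => sch }

  name         = each.value.schema
  catalog_name = each.value.catalog
  owner        = each.value.owner
  comment      = "managed from terraform"

  # Assumption: external locations are keyed by the same names as storage_root.
  storage_root = databricks_external_location.lake[each.value.storage_root].url
}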

Tables – please don’t 🙂

At the end of the article, I’d just like to mention that Terraform can also be used to create objects such as views and tables. After all, as the law of the instrument states: “Once you’ve learned Terraform, everything starts to look like infrastructure.”

resource "databricks_sql_table" "sample_table_managed" {
  provider = databricks.workspace

  name               = "sample_table_managed"
  catalog_name       = databricks_catalog.lake.name
  schema_name        = databricks_schema.lake["bronze"].name
  table_type         = "MANAGED"
  data_source_format = "DELTA"
  storage_location   = ""
  comment            = "this table is managed by terraform"

  column {
    name = "id"
    type = "int"
  }
  column {
    name = "code"
    type = "string"
  }
  column {
    name = "description"
    type = "string"
  }
}

resource "databricks_sql_table" "sample_view" {
  provider     = databricks.workspace
  name         = "sample_view"
  catalog_name = databricks_catalog.lake.name
  schema_name  = databricks_schema.lake["bronze"].name
  table_type   = "VIEW"
  comment      = "this view is managed by terraform"

  view_definition = format(
    "SELECT * FROM %s",
    databricks_sql_table.sample_table_managed.id
  )

  depends_on = [
    databricks_sql_table.sample_table_managed
  ]
}

But please remember, just because you can, doesn’t mean you should …
