Terraforming Databricks #1: Unity Catalog Metastore

Over the past two years, we have participated in numerous projects where Azure Databricks was implemented from the ground up. Each of these deployments allowed us to learn something new, verify earlier solutions, and ultimately develop a methodology for deploying Azure Databricks in a standardized, enterprise-scale-ready manner. As a result, the newly created platform does not require a rebuild when new data sources and higher data volumes are introduced.

One of the key principles of such a deployment is its full automation through code, which allows for standardization and significantly reduces deployment time. Preparing ready-made modules or even snippets means we do not have to reinvent the wheel each time.

In this series of posts, I will walk through the key elements present in each of these deployments, describing them step by step and implementing them using Terraform. As part of the series, I will cover, among other things, Unity Catalog Metastores, Catalogs, and User Access Management. After that, I’ll focus on workspace-level functionality such as cluster management, cluster policies, secret scopes, and more.

Unity Catalog Metastore

The first step in provisioning our platform is configuring Unity Catalog. In practice, this involves creating a Unity Catalog Metastore and enabling it for all workspaces in our solution.

The Metastore serves as a central repository that stores metadata about data assets within a Databricks environment, such as tables, views, and other data objects. It facilitates better management and governance of data by providing a unified view of all data assets, including access controls, data lineage, and audit logs.

The term ‘central’ is crucial here, as implementing Unity Catalog allows access to data from multiple workspaces and provides a single place to manage data access policies.

But before we look at the configuration details, there is one important caveat. As the documentation states:

Databricks began automatically enabling new workspaces for Unity Catalog on November 9, 2023, with a gradual rollout across accounts.

You can check the details of how the automatic enablement works here. This default configuration might work well for most scenarios, but there may be cases where you need to set things up differently – for example, to assign default storage at the Metastore level. Or maybe you just like to have full control over everything in your solution :). Because of this, it’s still valuable to know how to set up the Metastore from scratch.

In the example below, I will describe a full configuration, which aligns with the one described here and includes both steps that are listed as optional: creating a user-managed identity and setting up a Metastore-level storage container.


Implementation

Disclaimer: Although the code below presents an end-to-end solution, it is written in a simplified version for demonstration purposes. I advise against copy-pasting it into your production solution without proper review and refactoring.

We will start by preparing a resource group where the Databricks Access Connector and its associated user-assigned managed identity will reside. As stated in the documentation:

The Access Connector for Azure Databricks is a first-party Azure resource that lets you connect managed identities to an Azure Databricks account. Each access connector for Azure Databricks can contain either one system-assigned managed identity or one user-assigned managed identity. If you want to use multiple managed identities, create a separate access connector for each.

# resource group
# ----------------------------------
resource "azurerm_resource_group" "rgrp-meta" {
  name     = "rg-terradbx-meta-neu"
  location = "northeurope"
}

# access connector
# ----------------------------------
resource "azurerm_user_assigned_identity" "meta" {
  resource_group_name = azurerm_resource_group.rgrp-meta.name
  location            = "northeurope"

  name = "id-terradbx-meta-neu"
}

resource "azurerm_databricks_access_connector" "meta" {
  resource_group_name = azurerm_resource_group.rgrp-meta.name
  location            = "northeurope"

  name = "dbac-terradbx-meta-neu"

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.meta.id]
  }
}

Next, we will create a storage account and container, and assign the appropriate permissions to the user-assigned managed identity so that a service authenticated through it can modify the data. In this case, it is necessary to assign at least the Storage Blob Data Contributor role.

# storage
# ----------------------------------
resource "azurerm_storage_account" "meta" {
  resource_group_name = azurerm_resource_group.rgrp-meta.name
  location            = "northeurope"

  name                             = "staterradbxmetaneu"
  account_tier                     = "Standard"
  access_tier                      = "Hot"
  account_replication_type         = "LRS"
  cross_tenant_replication_enabled = false
  account_kind                     = "StorageV2"
  is_hns_enabled                   = true
}

resource "azurerm_storage_container" "meta" {
  name                  = "meta"
  storage_account_name  = azurerm_storage_account.meta.name
  container_access_type = "private"
}

# permissions
# ----------------------------------
resource "azurerm_role_assignment" "meta-contributor" {
  scope                = azurerm_storage_account.meta.id
  principal_id         = azurerm_user_assigned_identity.meta.principal_id
  role_definition_name = "Storage Blob Data Contributor"
}

All the steps so far have been executed on the Azure side and implemented using the azurerm provider. The next block of code, however, utilizes the databricks provider configured at the account level.
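
For completeness, the snippets in this post assume that both providers are already configured: the azurerm provider for the Azure resources and a databricks provider aliased as account for the account-level resources. A minimal sketch of such a configuration could look like the one below (the account ID is a placeholder and I’m assuming Azure CLI authentication – adjust this to your own setup):

# providers
# ----------------------------------
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    databricks = {
      source = "databricks/databricks"
    }
  }
}

provider "azurerm" {
  features {}
}

# account-level provider referenced as databricks.account below
provider "databricks" {
  alias      = "account"
  host       = "https://accounts.azuredatabricks.net"
  account_id = "00000000-0000-0000-0000-000000000000" # placeholder: your Databricks account ID
  auth_type  = "azure-cli"
}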

One of the key lessons I have learned from previous projects is the importance of configuring ownership of all critical platform components with groups from the very beginning. Assigning objects to individual user accounts as a ‘temporary’ solution creates technical debt, and that debt often remains unresolved until the employee leaves the project, their account is deleted, and processes suddenly stop working. Therefore, before defining the Metastore, we will create a Databricks group, which will then be assigned as its owner.

# groups
# ----------------------------------
resource "databricks_group" "meta" {
  provider = databricks.account

  display_name = "gr_metastore_owners"
}

data "databricks_user" "me" {
  provider = databricks.account

  user_name = "tomek.kostyrka@gmail.com"
}

resource "databricks_group_member" "meta" {
  provider = databricks.account

  group_id  = databricks_group.meta.id
  member_id = data.databricks_user.me.id
}

After preparing all the Azure objects and Databricks access groups, we can provision our Metastore. It is important that the Metastore resides in the same region as the one where future Databricks workspaces will be provisioned. This code not only creates the Metastore but also assigns the default storage and credentials that will be used for future data operations.

# metastore
# ----------------------------------
resource "databricks_metastore" "meta" {
  provider = databricks.account

  name   = "metastore-neu"
  owner  = databricks_group.meta.display_name
  region = "northeurope"

  storage_root = format("abfss://%s@%s.dfs.core.windows.net/",
    azurerm_storage_container.meta.name,
    azurerm_storage_account.meta.name
  )
}

resource "databricks_metastore_data_access" "meta" {
  provider = databricks.account

  name         = "metastore-neu-dac"
  metastore_id = databricks_metastore.meta.id
  is_default   = true

  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.meta.id
    managed_identity_id = azurerm_user_assigned_identity.meta.id
  }
}

After completing this step, the configured Metastore should be visible in the Databricks account console.

The final step is to create a Databricks workspace and assign it to the Metastore, or, as the documentation puts it, ‘enable it for Unity Catalog’. Without this step, we won’t be able to access any Unity Catalog objects, such as catalogs, models, or volumes, from this workspace.

# workspace
# ----------------------------------
resource "azurerm_databricks_workspace" "workspace" {
  resource_group_name = azurerm_resource_group.rgrp-meta.name
  location            = "northeurope"

  name                        = "dbw-terradbx-meta-neu"
  managed_resource_group_name = "rg-terradbx-meta-neu-managed"
  sku                         = "premium"
}

resource "databricks_metastore_assignment" "workspace" {
  provider = databricks.account

  metastore_id         = databricks_metastore.meta.id
  workspace_id         = azurerm_databricks_workspace.workspace.workspace_id
  default_catalog_name = "main"
}

After successfully provisioning the last resources, you can check the list of all workspaces assigned to a given Metastore in the Databricks Account Console.
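
If you prefer not to look these identifiers up in the console every time, you can also expose them as Terraform outputs. This is just a small convenience sketch (the output names are arbitrary):

# outputs
# ----------------------------------
output "metastore_id" {
  description = "ID of the Unity Catalog Metastore"
  value       = databricks_metastore.meta.id
}

output "workspace_url" {
  description = "URL of the workspace attached to the Metastore"
  value       = azurerm_databricks_workspace.workspace.workspace_url
}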

Our new workspace is enabled for Unity Catalog :).
