Terraforming Databricks #3: Lakehouse Federation

In today’s post, the third in the Terraforming Databricks series, we’ll break down the process of setting up a connection to an Azure SQL Database as part of the Lakehouse Federation functionality.

Lakehouse Federation

Before diving into the implementation, let’s first define what Lakehouse Federation is. Here’s a brief description from the documentation.

Lakehouse Federation is the query federation platform for Azure Databricks. The term query federation describes a collection of features that enable users and systems to run queries against multiple data sources without needing to migrate all data to a unified system.

At this point, the mechanism allows us to connect to various systems, including PostgreSQL, MySQL, BigQuery, and many others. The provisioning process is quite similar across all scenarios, with the main difference being how the connection to the source system is configured.

Here, we’ll focus on connecting to Azure SQL Database. What makes the example more interesting is that this can be done either with SQL user authentication or with a Service Principal using the OAuth mechanism.

Providers

Before we move on to the implementation, a few words about the Terraform providers used, as we will need to configure three of them.

Providers are plugins that allow Terraform to interact with APIs from various platforms to provision and manage resources. They define how Terraform communicates with specific services. In this article, we’ll use the following:

  • azuread – to manage users, groups, service principals, and applications in Microsoft Entra ID
  • azurerm – to configure infrastructure in Azure using the ARM APIs
  • databricks – to interact with Databricks resources

terraform {
  required_providers {
    azuread = {
      source  = "hashicorp/azuread"
      version = "3.0.2"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "4.5.0"
    }
    databricks = {
      source  = "databricks/databricks"
      version = "1.53.0"
    }
  }
  backend "azurerm" {}
}

Additionally, the Databricks provider will be configured twice, as our solution will utilize both the Account and Workspace APIs.

provider "databricks" {
  alias = "account"

  host       = "https://accounts.azuredatabricks.net"
  account_id = "xxxxxxxxxxxxxxxxxxxx"
}

provider "databricks" {
  alias = "workspace"

  host  = "xxxxxxxxxxxxxxxxxxxx"
}

Implementation

Disclaimer: Although the code below presents an end-to-end solution, it is deliberately simplified for demonstration purposes. I advise against copy-pasting it into your production solution without proper review and refactoring.

Microsoft Entra ID

Let’s start by creating an Entra group. Later in the article, it will be set as the administrator of the Azure SQL Database.

We’ll also add our current user to this group right away. This can be easily done using azuread_client_config, which allows us to retrieve, among other things, the object_id of the currently authenticated principal.

data "azuread_client_config" "current" {}

# create group
# ----------------------------------
resource "azuread_group" "sqls" {
  display_name     = "terradbx-sql-admins"
  owners           = [data.azuread_client_config.current.object_id]
  security_enabled = true
}

# assign current user
# ----------------------------------
resource "azuread_group_member" "sqls-current" {
  group_object_id  = azuread_group.sqls.object_id
  member_object_id = data.azuread_client_config.current.object_id
}

Azure SQL Database

Next, we’ll create an Azure SQL server and a database. This will be the one we connect to using the Lakehouse Federation mechanism.

This piece of code requires a few comments. First, never hard-code passwords in Terraform code – you can always use the random provider to generate one. Second, during the provisioning of the database, we can set the sample_name argument to “AdventureWorksLT”. The database will then be preloaded with sample data – useful when building a POC or a demonstration project.

# create a resource group
# ----------------------------------
resource "azurerm_resource_group" "rgrp-sqls" {
  name     = "rg-terradbx-sqls-neu"
  location = "northeurope"
}

# password
# ----------------------------------
resource "random_password" "password" {
  length           = 16
  special          = true
  override_special = "!#$%&*()-_=+[]{}<>:?"
}

# server
# ----------------------------------
resource "azurerm_mssql_server" "sqls" {
  name                = "sqls-terradbx-lake-neu"
  resource_group_name = azurerm_resource_group.rgrp-sqls.name
  location            = "northeurope"

  version                       = "12.0"
  public_network_access_enabled = true
  administrator_login           = "adm-login"
  administrator_login_password  = random_password.password.result

  azuread_administrator {
    login_username = azuread_group.sqls.display_name
    object_id      = azuread_group.sqls.object_id
  }

  identity {
    type = "SystemAssigned"
  }
}

# database
# ----------------------------------
resource "azurerm_mssql_database" "sqls" {
  name      = "sample"
  server_id = azurerm_mssql_server.sqls.id

  sku_name              = "Basic"
  max_size_gb           = 2
  storage_account_type  = "Local"
  sample_name           = "AdventureWorksLT"
}

Finally, we’ll make things a bit easier by fully opening the firewall on the SQL server. This is only intended to simplify our demo – please avoid this type of practice in production environments.

# firewall
# ----------------------------------
resource "azurerm_mssql_firewall_rule" "sqls" {
  name             = "allow-all"
  server_id        = azurerm_mssql_server.sqls.id
  start_ip_address = "0.0.0.0"
  end_ip_address   = "255.255.255.255"
}

Databricks Workspace

We’ll skip the step of creating the Databricks workspace and assigning it to the metastore, as it has already been covered twice in this series. For those interested, I recommend checking the previous posts in the series, where you can find the complete code needed for this process.
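
For reference, the snippets later in this post assume that a workspace and a metastore assignment already exist – the OAuth section reads azurerm_databricks_workspace.lfed.workspace_url, and the catalog uses var.metastore_id. Below is a minimal sketch of what those references point to; the resource group and workspace names are assumptions, not the exact values from the earlier posts.

# minimal sketch – see the earlier posts in the series for the full setup
resource "azurerm_resource_group" "rgrp-dbx" {
  name     = "rg-terradbx-dbx-neu"
  location = "northeurope"
}

resource "azurerm_databricks_workspace" "lfed" {
  name                = "dbw-terradbx-lfed-neu"
  resource_group_name = azurerm_resource_group.rgrp-dbx.name
  location            = "northeurope"
  sku                 = "premium"
}

# id of the Unity Catalog metastore the workspace is assigned to
variable "metastore_id" {
  type = string
}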

Databricks Groups

In previous articles, we also mentioned the best practice of working with Entra/Databricks groups and avoiding assigning specific users as owners of objects. We’ll follow that same practice here as well.

resource "databricks_group" "fed" {
  provider = databricks.account

  display_name = "gr_federation_owners"
}

data "databricks_user" "me" {
  provider = databricks.account

  user_name = "example@gmail.com"
}

resource "databricks_group_member" "fed" {
  provider = databricks.account

  group_id  = databricks_group.fed.id
  member_id = data.databricks_user.me.id
}

Lakehouse Federation (user/pass)

Let’s start with the simple scenario – configuring a connection using basic authentication, i.e., connecting with a username and password that we created on the SQL server side.

The configuration of the lakehouse federation mechanism essentially involves two steps: creating a connection and a catalog that utilizes it.

  • A connection defines the path and credentials for accessing an external database system.
  • A foreign catalog replicates a database from an external system, allowing you to query and manage access to its data through Azure Databricks and Unity Catalog. To create a foreign catalog, you use an already defined connection to the data source.

Notably, those objects follow a one-to-many relationship – the same connection, such as one to an SQL server, can be used to create multiple federated catalogs (databases on that server).

Let’s begin by creating the CONNECTION object…

resource "databricks_connection" "fed-sqls" {
  provider = databricks.workspace

  name    = "${azurerm_mssql_server.sqls.name}-sqls"
  owner   = databricks_group.fed.display_name
  comment = "managed from terraform"

  connection_type = "SQLSERVER"
  options = {
    host     = azurerm_mssql_server.sqls.fully_qualified_domain_name
    port     = 1433
    user     = azurerm_mssql_server.sqls.administrator_login
    password = random_password.password.result
  }
}

… which, after provisioning, should look as follows.

Next, let’s configure a catalog pointing to our sample database.

resource "databricks_catalog" "fed-sqls" {
  provider = databricks.workspace

  name    = "${azurerm_mssql_database.sqls.name}-sqls"
  owner   = databricks_group.fed.display_name
  comment = "managed from terraform"

  metastore_id    = var.metastore_id
  connection_name = databricks_connection.fed-sqls.name

  isolation_mode = "OPEN"
  options = {
    database = azurerm_mssql_database.sqls.name
  }
}
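
As a side note on the one-to-many relationship mentioned earlier, the same connection could back additional foreign catalogs for other databases on the same server. A hypothetical sketch follows – the second database, here called “other”, is not part of this demo.

resource "databricks_catalog" "fed-sqls-other" {
  provider = databricks.workspace

  name    = "other-sqls"
  owner   = databricks_group.fed.display_name
  comment = "managed from terraform"

  metastore_id    = var.metastore_id
  connection_name = databricks_connection.fed-sqls.name

  isolation_mode = "OPEN"
  options = {
    # hypothetical second database reusing the same connection
    database = "other"
  }
}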

When verifying the created objects, it’s important to remember that, unlike traditional catalogs, even browsing objects (such as the list of tables and views) requires active compute on the Databricks side.

A quick look at the “sample data” tab will confirm that the federation has been configured correctly.

Quod erat demonstrandum! ;)

Lakehouse Federation (OAuth)

Now let’s move on to configuring the connection using OAuth – this is where the fun begins.

However, spoiler alert: it’s not currently possible to do this entirely through Terraform or SQL code. That said, we will implement a workaround that will allow us to automate the vast majority of the work.

In the first step, we need to create a dedicated app in Microsoft Entra ID. Additionally, this app must be configured properly – we need to set the redirect URI and the appropriate API delegations. A complete list of steps is available at this link.

data "azuread_application_published_app_ids" "well_known" {}

resource "azuread_service_principal" "sqldb" {
  client_id    = data.azuread_application_published_app_ids.well_known.result["AzureSqlDatabase"]
  use_existing = true
}

resource "azuread_application" "fed" {
  display_name = "databricks-dbxfed"
  owners       = [data.azuread_client_config.current.object_id]

  web {
    redirect_uris = ["https://${azurerm_databricks_workspace.lfed.workspace_url}/login/oauth/azure.html"]

    implicit_grant {
      access_token_issuance_enabled = false
      id_token_issuance_enabled     = false
    }
  }

  required_resource_access {
    resource_app_id = azuread_service_principal.sqldb.client_id

    resource_access {
      id   = azuread_service_principal.sqldb.oauth2_permission_scope_ids["user_impersonation"]
      type = "Scope"
    }
  }
}

resource "azuread_service_principal" "fed" {
  client_id = azuread_application.fed.client_id
  owners    = azuread_application.fed.owners

  app_role_assignment_required = false
}

resource "azuread_group_member" "fed" {
  group_object_id  = azuread_group.sqls.object_id
  member_object_id = azuread_service_principal.fed.object_id
}

Quick check of the app after creation shows that everything looks good.

One last thing before we proceed to creating the connection: we still need to create a client secret – unfortunately, federated identity is not an option here.

resource "azuread_application_password" "fed" {
  display_name   = "databricks-fed-secret"
  application_id = azuread_application.fed.id
}

Once again, I want to emphasize that this code is only here for the sake of a complete demo, and I do not recommend using it in this form for production deployments. Why? Because application secrets (like secrets in Key Vault, passwords for virtual machines, and access keys in storage accounts) end up in plain text in the state file.

So please use a different mechanism to create and rotate your secrets, or ensure that no one – absolutely no one – who shouldn’t have access can access the state file.
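
One possible approach, sketched below, is to create and rotate the secret outside of Terraform and only pass it in as a sensitive input variable (the variable name is made up for this example). Marking it as sensitive keeps the value out of plan output, but any resource that consumes it will still record it in its state, so restricting access to the state file remains essential.

# minimal sketch, assuming the secret is created and rotated outside of Terraform
variable "fed_client_secret" {
  description = "Client secret of the databricks-dbxfed application, managed outside of Terraform"
  type        = string
  sensitive   = true
}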

That being said, we can return to our demo. Our application is ready, so let’s continue by creating the connection, this time using OAuth authentication. Unfortunately, the documentation doesn’t tell us how to configure such a connection in Terraform. Let’s approach this from the other side – set it up in the GUI first and then see what drift Terraform detects.

Quick terraform plan after saving the changes and… it doesn’t work. Kudos to Terraform for not printing all those sensitive values on the screen.

Let’s then use our familiar backdoor. Remember, everything can be found in the tfstate file.

Now, based on the information we found, let’s try to configure our resources in Terraform. Initially, when we run the apply command with the -refresh-only flag, everything looks fine, but any attempts to change the resource result in an error. The term “pkce_verifier” visible in the message stands for “Proof Key for Code Exchange Verifier.”

Interested individuals can read more about this mechanism here; however, in the context of our issue, we will stop at this point and move on to trying to find a reasonable workaround.

For curiosity’s sake, I’ll mention that a week ago, similar attempts resulted in a different error, stating that the ‘user’ and ‘password’ attributes are required when creating ‘databricks_connection’ resources. I therefore assume that this functionality is still under development, and we will likely receive a straightforward and documented solution in the near future.

Alright, so what does the workaround look like? Step by step:

  1. create a connection and configure it with a non-existent user and password
  2. add the entire options block to lifecycle ignore_changes
  3. after the object is created, reconfigure it in the GUI to use OAuth, just once
  4. wait for the official functionality to be released. 🙂

resource "databricks_connection" "fed-sqls" {
  provider = databricks.workspace

  name    = "${azurerm_mssql_server.sqls.name}-sqls"
  owner   = databricks_group.fed.display_name
  comment = "managed from terraform"

  connection_type = "SQLSERVER"
  options = {
    host     = azurerm_mssql_server.sqls.fully_qualified_domain_name
    port     = 1433

    user     = "non-existing-user"
    password = "non-existing-password"
  }

  lifecycle {
    ignore_changes = [ options ]
  }
}

By the way, just for the record – we get the same result from the SQL level. I guess we just have to wait.