Terraforming ADF: Shared Self-Hosted Integration Runtime

In one of our previous posts, we explained what self-hosted integration runtimes are and how to fully configure them using Terraform. Today, we’ll take it a step further by discussing the sharing mechanism that allows us to reuse the same runtime across multiple Azure Data Factories.

Multiple Integration Runtimes

Let’s consider the following scenario: our solution consists of three fully functional environments (dev, test, prod), each requiring access to data stored on the same on-premises SQL Server instance. The load is not evenly distributed among them: the production environment operates on a scheduled basis, while the other environments require access irregularly, typically only during the development and testing of new features. At first, it might seem easiest to just set up three virtual machines and register a dedicated runtime on each, especially since we can define it as code (IaC) that can be parameterized and deployed multiple times.

However, this approach isn’t cost-effective due to the need for three separate machines. Moreover, given the irregular use in the dev/test environments, it would be a good idea to implement shutdown schedules and a mechanism for automatically starting up the machines when needed. This will require some additional code to be written.
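
For the shutdown part, the azurerm provider already offers a ready-made resource. The sketch below is a minimal example, assuming a Windows VM resource named shir_vm in a resource group rgrp (both hypothetical names); it simply powers the machine off every evening, while the automatic start-up logic would still need to be handled separately.

# auto-shutdown schedule for a runtime VM (sketch; names are placeholders)
# ---------------------------------
resource "azurerm_dev_test_global_vm_shutdown_schedule" "shir-vm-shutdown" {
  virtual_machine_id = azurerm_windows_virtual_machine.shir_vm.id
  location           = azurerm_resource_group.rgrp.location
  enabled            = true

  daily_recurrence_time = "1900"
  timezone              = "UTC"

  notification_settings {
    enabled = false
  }
}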

You might say, ‘Okay, after all, the integration runtime is just software; we could install it three times on the same machine.’ Unfortunately, according to the documentation, we cannot.

You can install only one instance of a self-hosted integration runtime on any single machine. If you have two data factories that need to access on-premises data sources, either use the self-hosted IR sharing feature to share the self-hosted IR, or install the self-hosted IR on two on-premises computers, one for each data factory or Synapse workspace. Synapse workspace doesn’t support Integration Runtime Sharing.

Sharing Integration Runtimes

Fortunately, there is a feature that simplifies management and reduces costs by providing centralized access to on-premises data sources, rather than requiring separate runtimes for each Azure Data Factory instance. This feature is the shared self-hosted integration runtime.
You can configure the original self-hosted integration runtime (Shared IR) to be accessible from another Azure Data Factory, and then create an integration runtime object (Linked IR) that references the original one. The Linked IR is essentially a logical reference to the Shared IR and leverages its infrastructure.

Let’s have a look at how this can be set up from the portal.

Manual Configuration

Sharing an IR is a fairly straightforward process and can be done through the Azure Data Factory portal. In the Integration Runtime options, we select the “Sharing” tab, …

…, then specify which Azure Data Factory instances and which managed identities (you can read about managed identities here) are allowed to connect to the shared integration runtime, and accept. That’s it – done.

What’s intriguing is what happens behind the scenes. When we copy the path of our integration runtime, we find a resource in the Azure portal that isn’t visible by default. By checking the permissions in the Access Control (IAM) tab, we’ll see that the managed identity we chose earlier has been granted Contributor permissions on this resource.

Two conclusions arise from this:

  • This looks like something that can easily be automated.
  • If we assign Contributor permissions to a user-assigned managed identity, any Azure Data Factory instance to which this identity is assigned will automatically gain access to our shared IR (see the sketch after this list).
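
To illustrate the second point, a minimal Terraform sketch could look like the following (the identity and resource names are hypothetical; the shared IR is the shir resource used later in this post):

# user-assigned identity shared by several Data Factories (hypothetical names)
# ---------------------------------
resource "azurerm_user_assigned_identity" "adf-shared" {
  name                = "id-adf-shared"
  resource_group_name = azurerm_resource_group.rgrp.name
  location            = azurerm_resource_group.rgrp.location
}

# Contributor on the shared IR resource - every factory that uses this
# identity can then link to the runtime
resource "azurerm_role_assignment" "uami-shared-contributor" {
  scope                = azurerm_data_factory_integration_runtime_self_hosted.shir.id
  principal_id         = azurerm_user_assigned_identity.adf-shared.principal_id
  role_definition_name = "Contributor"
}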

Now, let’s look at the other end of our process – linking. This part is also not complicated – we just create a new integration runtime in the Azure Data Factory portal, …

… paste the resource ID (the same one we used in the Azure portal to check IAM) and pick which managed identity should be used to authenticate.
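
For reference, the resource ID pasted here follows the standard ARM pattern for a Data Factory integration runtime (placeholder values shown):

/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DataFactory/factories/<factory-name>/integrationRuntimes/<ir-name>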

And that’s it – a linked self-hosted integration runtime is ready to be used.

Implementation

Now, let’s do the same with Terraform.

Disclaimer: Although the code below presents an end-to-end solution, it is written in a simplified version for demonstration purposes. I advise against copy-pasting it into your production solution without proper review and refactoring.

The following solution assumes that an Azure Data Factory with a self-hosted integration runtime has already been set up. You can find the Terraform code for this in that post. Here, we’ll start by creating a new resource group and an Azure Data Factory instance in which the linked integration runtime will be created.
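
For context, the pre-existing pieces that the code below refers to look roughly like this (a simplified sketch with illustrative names; see the earlier post for the full setup, including the VM and runtime registration):

# assumed to exist already (from the previous post) - shown only for context
# ---------------------------------
resource "azurerm_resource_group" "rgrp" {
  name     = "rg-neu-terra-adf"
  location = "North Europe"
}

resource "azurerm_data_factory" "adf" {
  name                = "adf-terra-neu"
  resource_group_name = azurerm_resource_group.rgrp.name
  location            = "North Europe"
}

resource "azurerm_data_factory_integration_runtime_self_hosted" "shir" {
  name            = "shir-on-prem"
  data_factory_id = azurerm_data_factory.adf.id
}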

# create a resource group
# ---------------------------------
resource "azurerm_resource_group" "rgrp-linked" {
  name     = "rg-neu-terra-adf-linked"
  location = "North Europe"
}

# create azure data factory
# ---------------------------------
resource "azurerm_data_factory" "adf-linked" {
  name = "adf-terra-linked-neu"

  resource_group_name = azurerm_resource_group.rgrp-linked.name
  location            = "North Europe"

  public_network_enabled = true

  identity {
    type = "SystemAssigned"
  }
}

Next, let’s proceed with sharing the existing integration runtime. This is straightforward – the Contributor role has to be assigned to the managed identity of our new Azure Data Factory, scoped to the shared integration runtime resource.

However, it’s important to note a significant limitation with the azurerm Terraform provider (v3.114.0) – it currently does not support creating linked integration runtimes with user-assigned managed identities used for authorization. Therefore, to fully automate our process, we must either use a system-assigned managed identity or find an alternative method for creating the linked IR, such as utilizing the azapi provider. We’ll pick the easier option and utilize system-assigned managed identity.

# share
# ---------------------------------
resource "azurerm_role_assignment" "adf-smid-shared-contributor" {
  scope                = azurerm_data_factory_integration_runtime_self_hosted.shir.id
  principal_id         = azurerm_data_factory.adf-linked.identity[0].principal_id
  role_definition_name = "Contributor"
}

Finally, we can create the linked IR.

# link
# ---------------------------------
resource "azurerm_data_factory_integration_runtime_self_hosted" "shared_on-prem" {
  name            = "shared-on-prem"
  description     = "---"
  data_factory_id = azurerm_data_factory.adf-linked.id

  rbac_authorization {
    resource_id = azurerm_data_factory_integration_runtime_self_hosted.shir.id
  }

  depends_on = [
    azurerm_role_assignment.adf-smid-shared-contributor
  ]
}

As demonstrated, the entire process that typically requires manual intervention across two Azure Data Factory resources can be automated with just two blocks of Terraform code: one for assigning the necessary role and the other for creating the linked integration runtime.
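
As an aside, if the user-assigned managed identity route mentioned earlier were required, the linked IR could be created with the azapi provider directly against the Data Factory REST API. The sketch below is unverified and makes a few assumptions: azapi 1.x syntax with a jsonencode body, the 2018-06-01 API schema for linked runtimes, and a hypothetical Data Factory credential named cred-uami wrapping the user-assigned identity.

# alternative: azapi-based linked IR with user-assigned identity (rough sketch)
# ---------------------------------
resource "azapi_resource" "shared_on-prem_uami" {
  type      = "Microsoft.DataFactory/factories/integrationRuntimes@2018-06-01"
  name      = "shared-on-prem-uami"
  parent_id = azurerm_data_factory.adf-linked.id

  body = jsonencode({
    properties = {
      type = "SelfHosted"
      typeProperties = {
        linkedInfo = {
          authorizationType = "RBAC"
          resourceId        = azurerm_data_factory_integration_runtime_self_hosted.shir.id
          credential = {
            referenceName = "cred-uami" # hypothetical ADF credential for the user-assigned identity
            type          = "CredentialReference"
          }
        }
      }
    }
  })
}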
