In one of our previous posts, we explained what self-hosted integration runtimes are and how to fully configure them using Terraform. Today, we’ll take it a step further by discussing the sharing mechanism that allows us to reuse the same runtime across multiple Azure Data Factories.
Multiple Integration Runtimes
Let’s consider the following scenario: our solution consists of three fully functional environments (dev, test, prod), each requiring access to data stored on the same on-premises SQL Server instance. The load is not evenly distributed among them: the production environment operates on a scheduled basis, while the other environments require access irregularly, typically only during the development and testing of new features. At first, it might seem easiest to just set up three virtual machines and register a dedicated runtime on each, especially since we can define it as code (IaC) that can be parameterized and deployed multiple times.
However, this approach isn’t cost-effective due to the need for three separate machines. Moreover, given the irregular use in the dev/test environments, it would be a good idea to implement shutdown schedules and a mechanism for automatically starting up the machines when needed. This will require some additional code to be written.
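To give a feel for that additional code, a nightly shutdown for one dev/test machine could be sketched with the azurerm provider's azurerm_dev_test_global_vm_shutdown_schedule resource. The VM reference, the time, and the timezone below are assumptions made for illustration. Starting the machine again on demand would need a separate mechanism, which this sketch does not cover.

```hcl
# Hypothetical: assumes the VM hosting the dev runtime is defined elsewhere
# as azurerm_windows_virtual_machine.shir_dev.
resource "azurerm_dev_test_global_vm_shutdown_schedule" "shir_dev" {
  virtual_machine_id    = azurerm_windows_virtual_machine.shir_dev.id
  location              = azurerm_windows_virtual_machine.shir_dev.location
  enabled               = true
  daily_recurrence_time = "1900" # HHmm, interpreted in the timezone below
  timezone              = "Central European Standard Time"

  # The block is required by the resource; notifications are switched off here.
  notification_settings {
    enabled = false
  }
}
```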
You might say, ‘Okay, after all, the integration runtime is just software; we could install it three times on the same machine.’ Unfortunately, according to the documentation, we cannot.
You can install only one instance of a self-hosted integration runtime on any single machine. If you have two data factories that need to access on-premises data sources, either use the self-hosted IR sharing feature to share the self-hosted IR, or install the self-hosted IR on two on-premises computers, one for each data factory or Synapse workspace. Synapse workspace doesn’t support Integration Runtime Sharing.
Sharing Integration Runtimes
Let’s have a look at how this can be set up from the portal.
Manual Configuration
Sharing an IR is a fairly straightforward process and can be done through the Azure Data Factory portal. In the Integration Runtime options, we select the “Sharing” tab, then specify which Azure Data Factory instances and which managed identities (you can read about managed identities here) are allowed to connect to the shared integration runtime, and confirm. That’s it – done.
What’s intriguing is what happens behind the scenes. When we copy the resource ID of our integration runtime and open it in the Azure portal, we find a resource that isn’t visible by default. By checking the permissions in the Access Control (IAM) tab, we’ll see that the managed identity we chose earlier has been granted Contributor permissions on this resource.
Two conclusions arise from this:
- Looks like something that can be easily automated.
- If we assign Contributor permissions to a user-assigned managed identity, any Azure Data Factory instance to which this identity is assigned will automatically gain access to our shared IR.
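The second conclusion can be sketched in Terraform roughly as follows. This is only an illustration of the access side: all names and variables are hypothetical, and it assumes the shared runtime already exists as azurerm_data_factory_integration_runtime_self_hosted.shir.

```hcl
# Hypothetical inputs for the identity's placement.
variable "resource_group_name" {
  type = string
}

variable "location" {
  type = string
}

# A user-assigned identity that can be attached to any number of factories.
resource "azurerm_user_assigned_identity" "shir_consumer" {
  name                = "id-shir-consumer"
  resource_group_name = var.resource_group_name
  location            = var.location
}

# Contributor on the shared IR itself, mirroring what the Sharing tab does.
resource "azurerm_role_assignment" "shir_consumer_contributor" {
  scope                = azurerm_data_factory_integration_runtime_self_hosted.shir.id
  role_definition_name = "Contributor"
  principal_id         = azurerm_user_assigned_identity.shir_consumer.principal_id
}
```

Any factory whose identity block lists this identity (type "UserAssigned" with the identity's ID in identity_ids) would then be covered by the same role assignment.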
Now, let’s look at the other end of our process – linking. This part is also not complicated: we create a new integration runtime in the Azure Data Factory portal, paste the resource ID (the same one we used in the Azure portal to check IAM), and pick which managed identity should be used to authenticate.
And that’s it – a linked self-hosted integration runtime is ready to be used.
Implementation
Now, let’s do the same with Terraform.
Disclaimer: Although the code below presents an end-to-end solution, it is written in a simplified version for demonstration purposes. I advise against copy-pasting it into your production solution without proper review and refactoring.
The following solution assumes that an Azure Data Factory with a self-hosted integration runtime has already been set up. You can find the Terraform code for this in the previous post. Here, we’ll start by creating a new Resource Group and the Azure Data Factory in which the linked integration runtime will be created.
```hcl
# create a resource group
# ---------------------------------
resource "azurerm_resource_group" "rgrp-linked" {
  name     = "rg-neu-terra-adf-linked"
  location = "North Europe"
}

# create azure data factory
# ---------------------------------
resource "azurerm_data_factory" "adf-linked" {
  name = "adf-terra-linked-neu"
  # place the factory in the resource group created above
  resource_group_name    = azurerm_resource_group.rgrp-linked.name
  location               = "North Europe"
  public_network_enabled = true

  # system-assigned identity, used later to authenticate against the shared IR
  identity {
    type = "SystemAssigned"
  }
}
```
Next, let’s proceed with sharing the existing integration runtime. This is straightforward – a Contributor role has to be assigned to the managed identity of our new Azure Data Factory.
However, it’s important to note a significant limitation of the azurerm Terraform provider (v3.114.0): it currently does not support creating linked integration runtimes that use user-assigned managed identities for authorization. Therefore, to fully automate our process, we must either use a system-assigned managed identity or find an alternative method for creating the linked IR, such as the azapi provider. We’ll pick the easier option and use a system-assigned managed identity.
```hcl
# share
# ---------------------------------
resource "azurerm_role_assignment" "adf-smid-shared-contributor" {
  scope                = azurerm_data_factory_integration_runtime_self_hosted.shir.id
  principal_id         = azurerm_data_factory.adf-linked.identity[0].principal_id
  role_definition_name = "Contributor"
}
```
Finally, we can create the linked IR.
```hcl
# link
# ---------------------------------
resource "azurerm_data_factory_integration_runtime_self_hosted" "shared_on-prem" {
  name            = "shared-on-prem"
  description     = "---"
  data_factory_id = azurerm_data_factory.adf-linked.id

  rbac_authorization {
    resource_id = azurerm_data_factory_integration_runtime_self_hosted.shir.id
  }

  depends_on = [
    azurerm_role_assignment.adf-smid-shared-contributor
  ]
}
```
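For the user-assigned route, a linked IR could in principle be declared through the azapi provider instead. The sketch below is untested and rests on several assumptions: the property names are taken from the ARM schema for Microsoft.DataFactory/factories/integrationRuntimes (API version 2018-06-01) and should be checked against the current reference, and all three variables are hypothetical inputs, with credential_name standing for a Data Factory credential that wraps the user-assigned identity.

```hcl
variable "linked_factory_id" {
  type        = string
  description = "Resource ID of the factory that receives the linked IR (hypothetical input)."
}

variable "shared_ir_id" {
  type        = string
  description = "Resource ID of the shared self-hosted IR (hypothetical input)."
}

variable "credential_name" {
  type        = string
  description = "Name of a factory credential backed by the user-assigned identity (hypothetical input)."
}

resource "azapi_resource" "linked_shir_umi" {
  type      = "Microsoft.DataFactory/factories/integrationRuntimes@2018-06-01"
  name      = "shared-on-prem-umi"
  parent_id = var.linked_factory_id

  # azapi 1.x expects a JSON string here; azapi 2.x takes the object directly,
  # in which case jsonencode should be dropped.
  body = jsonencode({
    properties = {
      type = "SelfHosted"
      typeProperties = {
        linkedInfo = {
          authorizationType = "RBAC"
          resourceId        = var.shared_ir_id
          # Assumed shape of the credential reference; verify before use.
          credential = {
            type          = "CredentialReference"
            referenceName = var.credential_name
          }
        }
      }
    }
  })
}
```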
As demonstrated, the entire process that typically requires manual intervention across two Azure Data Factory resources can be automated with just two blocks of Terraform code: one for assigning the necessary role and the other for creating the linked integration runtime.
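Once the linked runtime exists, it is referenced like any other self-hosted IR. As a minimal sketch, a SQL Server linked service in the new factory could point at it by name. The linked service name and the connection string variable are hypothetical placeholders.

```hcl
# Hypothetical input; in practice this would come from a secret store.
variable "on_prem_sql_connection_string" {
  type      = string
  sensitive = true
}

resource "azurerm_data_factory_linked_service_sql_server" "on_prem_sql" {
  name              = "ls-on-prem-sql"
  data_factory_id   = azurerm_data_factory.adf-linked.id
  connection_string = var.on_prem_sql_connection_string

  # Route traffic through the linked (shared) runtime created above.
  integration_runtime_name = azurerm_data_factory_integration_runtime_self_hosted.shared_on-prem.name
}
```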