Git and mandatory CI/CD have become widely accepted standards, and an increasing number of people advocate for defining cloud resources with Infrastructure as Code (IaC) tools. Many are coming to understand that quality, reliability, and scalability in modern cloud data projects are simply unattainable without embracing the DevOps culture and proper automation.
The ClickOps resistance movement is slowly losing its strength! 😉
In light of this, I’d like to kick off the ‘Terraforming Data’ series, where I’ll share some solutions to streamline the process of building and deploying a cloud data platform in Azure. The first topic will focus on creating and configuring a self-hosted integration runtime, a task that is still often performed manually.
Self-Hosted Integration Runtime
Let’s start with a quick reminder of what a self-hosted integration runtime is and in which scenarios it is needed.
Azure Data Factory’s self-hosted integration runtime is the component that enables secure data movement between on-premises data sources and Azure. It is needed whenever data has to flow between local environments and the cloud in either direction, especially when on-premises databases are involved. Additionally, self-hosted integration runtimes are required for connecting to Azure resources that have public network access disabled.
In short, creating an Azure Data Factory self-hosted integration runtime involves downloading the appropriate software, installing it on an on-premises machine or a virtual machine that has access to both your data sources and the Azure cloud, and finally configuring the integration runtime and linking it to your Azure Data Factory instance. The exact steps are detailed in the Microsoft documentation here.
Based on what I’ve observed, this part of the process is often done manually after the automatic provisioning of Data Factory and the virtual machine. But can it be done entirely through code?
Absolutely.
Implementation
Disclaimer: Although the code below presents an end-to-end solution, it is simplified for demonstration purposes. I advise against copy-pasting it into your production solution without proper review and refactoring.
As mentioned in the disclaimer above, the following code is intended to demonstrate a complete end-to-end solution. Therefore, we will start by creating a Resource Group, an Azure Data Factory resource, and a Self-Hosted Integration Runtime. This part is quite straightforward, and I hope it does not require further explanation.
# create a resource group
# ---------------------------------
resource "azurerm_resource_group" "rgrp" {
  name     = "rg-neu-terraforming-adf"
  location = "North Europe"
}

# create azure data factory
# ---------------------------------
resource "azurerm_data_factory" "adf" {
  name                   = "adf-terra-demo-neu"
  resource_group_name    = azurerm_resource_group.rgrp.name
  location               = "North Europe"
  public_network_enabled = true

  identity {
    type = "SystemAssigned"
  }
}

# create self-hosted integration runtime
# ---------------------------------
resource "azurerm_data_factory_integration_runtime_self_hosted" "shir" {
  name            = "shir-terra-demo"
  data_factory_id = azurerm_data_factory.adf.id
}
At this stage, you should already see the self-hosted integration runtime in Azure Data Factory; however, it does not yet have any registered nodes.
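If you want to inspect the key that new nodes use to register (for example, to hook up a node manually while testing), the integration runtime resource already exposes it. Below is a minimal, optional sketch, assuming you are comfortable with the key being stored in Terraform state and outputs:

# expose the authorization key (optional)
# ---------------------------------
output "shir_primary_authorization_key" {
  # exported by the azurerm_data_factory_integration_runtime_self_hosted resource
  value     = azurerm_data_factory_integration_runtime_self_hosted.shir.primary_authorization_key
  sensitive = true
}

With sensitive = true, Terraform hides the value in plan and apply output; running terraform output -raw shir_primary_authorization_key prints it on demand. We will pass the same attribute to the installation script later on.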
The next step involves creating network resources, which are essential for setting up a virtual machine. In real-life scenarios, we typically skip this step, since a dedicated team usually handles network management and we can reference an existing Virtual Network instead (a data-source sketch of that variant follows the code below).
# create virtual network & nic # --------------------------------- resource "azurerm_virtual_network" "vnet" { name = "vnet-terra-demo-neu" resource_group_name = azurerm_resource_group.rgrp.name location = "North Europe" address_space = ["10.0.0.0/16"] } resource "azurerm_subnet" "snet" { name = "snet-terra-demo-neu" resource_group_name = azurerm_resource_group.rgrp.name virtual_network_name = azurerm_virtual_network.vnet.name address_prefixes = ["10.0.1.0/24"] } resource "azurerm_network_interface" "vm-nic" { name = "nic-terra-demo-neu" resource_group_name = azurerm_resource_group.rgrp.name location = "North Europe" ip_configuration { name = "internal" subnet_id = azurerm_subnet.snet.id private_ip_address_allocation = "Static" private_ip_address = "10.0.1.4" } }
After creating the network resources, we can provision the virtual machine that will host our integration runtime. It’s important to ensure that the source system is accessible from the VM and that the firewall is not blocking any of the domains listed here.
# create virtual machine
# ---------------------------------
resource "azurerm_windows_virtual_machine" "vm" {
  name                  = "vm-terra-demo-neu"
  computer_name         = "vm-tfdemo-neu"
  resource_group_name   = azurerm_resource_group.rgrp.name
  location              = "North Europe"
  admin_username        = "dummyuser"
  admin_password        = "PlzCh4ng3th1sP@$$"
  size                  = "Standard_D2as_v4"
  patch_assessment_mode = "AutomaticByPlatform"

  network_interface_ids = [
    azurerm_network_interface.vm-nic.id,
  ]

  os_disk {
    name                 = "osd-terra-demo-neu"
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  source_image_reference {
    publisher = "MicrosoftWindowsServer"
    offer     = "WindowsServer"
    sku       = "2022-datacenter-azure-edition"
    version   = "latest"
  }

  identity {
    type = "SystemAssigned"
  }
}
In many projects I’ve been involved with, this is where automation typically stopped – once the virtual machine was created, someone with sufficient privileges manually installed and configured the integration runtime.
To automate the installation, we’ll begin by preparing a PowerShell script to download and install the runtime (you can find excellent examples of such scripts here or here). The script will then be uploaded to a storage account.
# create storage and upload script
# ---------------------------------
resource "azurerm_storage_account" "storage" {
  name                     = "staterrademoneu"
  resource_group_name      = azurerm_resource_group.rgrp.name
  location                 = "North Europe"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_storage_container" "container" {
  name                  = "scripts"
  storage_account_name  = azurerm_storage_account.storage.name
  container_access_type = "private"
}

locals {
  script_name = "gatewayInstall.ps1"
}

resource "azurerm_storage_blob" "script" {
  name                   = local.script_name
  storage_account_name   = azurerm_storage_account.storage.name
  storage_container_name = azurerm_storage_container.container.name
  type                   = "Block"
  source                 = "${path.module}/scripts/${local.script_name}"
  content_md5            = filemd5("${path.module}/scripts/${local.script_name}")
}
In the final step, we will use the Azure Custom Script Extension to run the previously mentioned script on our host. The Custom Script Extension is a tool for deploying and managing custom scripts on Azure Virtual Machines (VMs). It lets us automate the configuration and management of our VMs by running scripts, such as PowerShell or Bash, directly on the VMs during or after provisioning.
Because Terraform can read the authorization key from the previously defined integration runtime resource and pass it to the script, the runtime will automatically register itself with Azure Data Factory after installation.
# configure shir
# ---------------------------------
# grant the VM's managed identity read access to the script blob
resource "azurerm_role_assignment" "storage-shir-umid-rbac-contributor" {
  scope                = azurerm_storage_account.storage.id
  principal_id         = azurerm_windows_virtual_machine.vm.identity[0].principal_id
  role_definition_name = "Storage Blob Data Reader"
}

# download the script via the managed identity and run it with the auth key
resource "azurerm_virtual_machine_extension" "gateway" {
  name                       = "shir-install-register"
  virtual_machine_id         = azurerm_windows_virtual_machine.vm.id
  publisher                  = "Microsoft.Compute"
  type                       = "CustomScriptExtension"
  type_handler_version       = "1.10"
  auto_upgrade_minor_version = true

  settings = jsonencode(
    {
      "fileUris" : [
        "${azurerm_storage_account.storage.primary_blob_endpoint}${azurerm_storage_container.container.name}/${local.script_name}"
      ]
    }
  )

  protected_settings = jsonencode(
    {
      "managedIdentity" : {},
      "commandToExecute" : format(
        "powershell.exe -ExecutionPolicy Unrestricted -File ${local.script_name} %s",
        azurerm_data_factory_integration_runtime_self_hosted.shir.primary_authorization_key
      )
    }
  )
}
And that’s it – our integration runtime is up and running.
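As a quick usage example, a linked service can now be pointed at the new runtime. This is only a sketch; the linked service name, server, and connection string below are hypothetical placeholders:

# example: linked service routed through the shir
# ---------------------------------
resource "azurerm_data_factory_linked_service_sql_server" "onprem_sql" {
  name                     = "ls-onprem-sql" # hypothetical name
  data_factory_id          = azurerm_data_factory.adf.id
  integration_runtime_name = azurerm_data_factory_integration_runtime_self_hosted.shir.name

  # placeholder connection string for a hypothetical on-premises SQL Server
  connection_string = "Integrated Security=False;Data Source=onprem-sql01;Initial Catalog=DemoDb;User ID=demo;Password=change-me"
}

Any pipeline activity that uses a linked service configured this way will route its traffic through the node we just registered.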