Azure Data Factory is almost never created as an isolated resource. In nearly every project where we have used this service, it was provisioned alongside a Key Vault and a Storage Account. In the following article, I’ll describe the standard method for authenticating between these services using managed identities.
As indicated by the title, I’ll also demonstrate how to deploy everything using Terraform – without a single click in the Azure Portal! 😉
Managed Identities & Storage Credentials
Let’s start with a brief reminder of the objects that will appear later in the article.
Managed identities allow Azure services to authenticate to other Azure services without the need to manage credentials manually. Essentially, a managed identity is a special type of service principal attached to an Azure resource. There are two types of managed identities available in Azure:
- System-Assigned: Created and managed directly by Azure for a specific Azure resource. When the resource is deleted, the associated managed identity is automatically cleaned up.
- User-Assigned: Created as an independent Azure resource and can be assigned to one or more Azure resources. Its lifecycle is independent of the resources it is assigned to.
A single resource can have one System-Assigned and multiple User-Assigned Managed Identities.
As a rule, you should use managed identities instead of secrets or passwords whenever possible.
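To make the two identity types concrete, here is a minimal sketch of how each might be declared on a Data Factory with the azurerm provider. All resource names in this block are hypothetical and are not part of the solution built later in this article.

```hcl
# Illustration only: every name below is a hypothetical placeholder.
resource "azurerm_resource_group" "example" {
  name     = "rg-identity-example"
  location = "North Europe"
}

# System-Assigned: Azure creates the identity together with the factory
# and removes it when the factory is deleted.
resource "azurerm_data_factory" "system_only" {
  name                = "adf-identity-example-sys"
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location

  identity {
    type = "SystemAssigned"
  }
}

# User-Assigned: the identity is a standalone resource with its own lifecycle,
# so it can be attached to one or more resources.
resource "azurerm_user_assigned_identity" "shared" {
  name                = "id-identity-example"
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location
}

resource "azurerm_data_factory" "user_only" {
  name                = "adf-identity-example-usr"
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.shared.id]
  }
}
```

Destroying `system_only` also removes its identity, while `shared` survives the removal of `user_only` and can be reused elsewhere.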
Azure Data Factory credential objects are used to securely store and manage authentication information required for accessing various data sources and services. They can be used to authenticate to data stores, compute services, or other resources without embedding sensitive information directly in pipeline definitions.
A common practice that uses the objects mentioned above is to store passwords for on-premises sources in a Key Vault that our Azure Data Factory can access. The linked service (e.g., a SQL Server) is then configured to retrieve the password from the Key Vault on the fly, using a specified credential.
This approach allows passwords to be uploaded to the Key Vault without exposing them to developers and simplifies password rotation, which can be done without disrupting the Data Factory service itself.
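As a rough sketch of this pattern, the block below defines a SQL Server linked service that reads its password from Key Vault at runtime. It reuses the factory (`azurerm_data_factory.adf`) and the Key Vault linked service (`ls-akv`) defined in the Implementation section. The integration runtime, server, database, user, and secret names are assumptions made up for illustration. The secret itself is assumed to be uploaded to the vault outside of Terraform, so its value never appears in the state file.

```hcl
# Hypothetical self-hosted integration runtime for reaching the on-premises network.
resource "azurerm_data_factory_integration_runtime_self_hosted" "onprem" {
  name            = "ir-onprem-example"
  data_factory_id = azurerm_data_factory.adf.id
}

# SQL Server linked service; the password is not part of the definition.
resource "azurerm_data_factory_linked_service_sql_server" "onprem_sql" {
  name                     = "LS_SQL_onprem_example"
  data_factory_id          = azurerm_data_factory.adf.id
  integration_runtime_name = azurerm_data_factory_integration_runtime_self_hosted.onprem.name

  # Hypothetical server, database, and user; note the missing password.
  connection_string = "Integrated Security=False;Data Source=onprem-sql01;Initial Catalog=sales;User ID=adf_reader;"

  # The password is resolved at runtime through the Key Vault linked service.
  key_vault_password {
    linked_service_name = azurerm_data_factory_linked_custom_service.ls-akv.name
    secret_name         = "onprem-sql-password" # hypothetical secret name
  }
}
```

Rotating the password then only requires adding a new version of the secret in Key Vault; the linked service definition stays unchanged.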
Implementation
Disclaimer: Although the code below presents an end-to-end solution, it is simplified for demonstration purposes. I advise against copy-pasting it into your production solution without proper review and refactoring.
As in the previous article in this series, let’s start the “Implementation” section by creating core resources: a Resource Group, a Managed Identity, and an Azure Data Factory resource.
The identity block in Azure Data Factory references the previously created User-Assigned Managed Identity (UMID), resulting in the assignment of the identity to the resource. The identity_ids argument accepts a list of values because, as mentioned earlier, multiple managed identities can be assigned to a single resource.
# create a resource group
# ---------------------------------
resource "azurerm_resource_group" "rgrp" {
  name     = "rg-neu-terraforming-adf-umids"
  location = "North Europe"
}

# managed identity
# ---------------------------------
resource "azurerm_user_assigned_identity" "factory-umid" {
  name                = "id-terra-demo-neu"
  resource_group_name = azurerm_resource_group.rgrp.name
  location            = "North Europe"
}

# create azure data factory
# ---------------------------------
resource "azurerm_data_factory" "adf" {
  name                   = "adf-terra-demo-neu"
  resource_group_name    = azurerm_resource_group.rgrp.name
  location               = "North Europe"
  public_network_enabled = true

  identity {
    type = "SystemAssigned, UserAssigned"
    identity_ids = [
      azurerm_user_assigned_identity.factory-umid.id
    ]
  }
}
As a result of this step, our Azure Data Factory can use the assigned identity. However, this does not result in automatic creation of a credential object; it has to be defined separately.
# factory - credentials
# ---------------------------------
resource "azurerm_data_factory_credential_user_managed_identity" "default" {
  name            = "cred-default"
  data_factory_id = azurerm_data_factory.adf.id
  identity_id     = azurerm_user_assigned_identity.factory-umid.id
}
At this point, the credential object (cred-default) should be visible in Azure Data Factory.
In the next step, two resources that our Data Factory needs to connect to are created: an Azure Storage Account and an Azure Key Vault. For the purposes of this article, they are configured in a way similar to the default settings, with open public network access.
# create storage
# ---------------------------------
resource "azurerm_storage_account" "storage" {
  name                     = "stterrademoneu"
  resource_group_name      = azurerm_resource_group.rgrp.name
  location                 = "North Europe"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

# create keyvault
# ---------------------------------
data "azurerm_client_config" "current" {}

resource "azurerm_key_vault" "azkv" {
  name                          = "kvterrademoneu"
  resource_group_name           = azurerm_resource_group.rgrp.name
  location                      = "North Europe"
  tenant_id                     = data.azurerm_client_config.current.tenant_id
  sku_name                      = "premium"
  enable_rbac_authorization     = true
  public_network_access_enabled = true
}
In addition to creating the resources, access permissions also have to be configured appropriately.
By the way, please keep in mind that the Contributor role is not sufficient for performing these kinds of IAM operations: creating role assignments requires a role such as Owner or User Access Administrator. Therefore, if you are working in a shared environment with only Contributor access, the code at this point will return errors.
# permissions
# ---------------------------------
resource "azurerm_role_assignment" "umid-sta-contr" {
  scope                = azurerm_storage_account.storage.id
  principal_id         = azurerm_user_assigned_identity.factory-umid.principal_id
  role_definition_name = "Storage Blob Data Contributor"
}

resource "azurerm_role_assignment" "umid-kv-user" {
  scope                = azurerm_key_vault.azkv.id
  principal_id         = azurerm_user_assigned_identity.factory-umid.principal_id
  role_definition_name = "Key Vault Secrets User"
}
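The role assignments above cover only the User-Assigned identity, while the factory in this article is created with the "SystemAssigned, UserAssigned" type. If some object in the factory were to authenticate with the System-Assigned identity instead, that identity would need its own role assignment. The following is a minimal sketch under that assumption; the resource label `smid-kv-user` is hypothetical.

```hcl
# Optional: grant the factory's System-Assigned identity read access to secrets.
# identity[0].principal_id is exported by azurerm_data_factory once a
# system-assigned identity is enabled on the resource.
resource "azurerm_role_assignment" "smid-kv-user" {
  scope                = azurerm_key_vault.azkv.id
  principal_id         = azurerm_data_factory.adf.identity[0].principal_id
  role_definition_name = "Key Vault Secrets User"
}
```

If everything in the factory goes through the cred-default credential, this block is unnecessary and can be skipped.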
After configuring the appropriate permissions, we can proceed to the final step: defining Linked Services.
# linked service
# ---------------------------------
resource "azurerm_data_factory_linked_custom_service" "ls-akv" {
  name            = "LS_AKEV_terrademoneu"
  data_factory_id = azurerm_data_factory.adf.id
  type            = "AzureKeyVault"

  type_properties_json = jsonencode(
    {
      "baseUrl" : azurerm_key_vault.azkv.vault_uri,
      "credential" : {
        "referenceName" : azurerm_data_factory_credential_user_managed_identity.default.name,
        "type" : "CredentialReference"
      }
    }
  )
}

resource "azurerm_data_factory_linked_custom_service" "ls-sta" {
  name            = "LS_ADLS_terrademoneu"
  data_factory_id = azurerm_data_factory.adf.id
  type            = "AzureBlobFS"

  type_properties_json = jsonencode(
    {
      "url" : azurerm_storage_account.storage.primary_dfs_endpoint,
      "credential" : {
        "referenceName" : azurerm_data_factory_credential_user_managed_identity.default.name,
        "type" : "CredentialReference"
      }
    }
  )
}
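One detail worth checking: AzureBlobFS is the connector type for Azure Data Lake Storage Gen2, whereas the storage account in this article is created with near-default settings. If the intent is to use the account as a data lake, the hierarchical namespace would probably need to be enabled when the account is created. Here is a sketch under that assumption, with hypothetical account and filesystem names; it also assumes the principal running Terraform is allowed to create filesystems in the account.

```hcl
# Hypothetical variant of the storage account with hierarchical namespace enabled.
resource "azurerm_storage_account" "lake" {
  name                     = "stterrademolakeneu" # hypothetical name
  resource_group_name      = azurerm_resource_group.rgrp.name
  location                 = azurerm_resource_group.rgrp.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true # turns the account into ADLS Gen2
}

# Hypothetical filesystem (container) for pipelines to read from and write to.
resource "azurerm_storage_data_lake_gen2_filesystem" "raw" {
  name               = "raw"
  storage_account_id = azurerm_storage_account.lake.id
}
```

The hierarchical namespace setting is chosen at creation time in this resource, so changing it on an existing account forces Terraform to replace the account.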
I’m aware that the final point might be controversial – some may argue that elements within Data Factory are not infrastructure and therefore should not be defined using Terraform. Personally, I believe the boundary between what is managed through Infrastructure as Code (IaC) and what is not is somewhat blurred. In such situations, I follow this principle:
- Linked Services for resources that are created together with our ADF as a part of the core solution are defined using Terraform.
- Linked Services for specific source systems are created directly within Azure Data Factory.
Regardless of your opinion on the matter, one thing is certain: it’s worth knowing our options before deciding what best fits our project. 🙂