Introduction
Databricks Jobs can execute code stored locally (1 on the picture below) or stored in a remote Git repository (2). The second approach simplifies the creation and management of production jobs and enables automated continuous deployment. It eliminates the need to create and maintain a separate production repository within Azure Databricks, which reduces the burden of managing permissions and updates. It also protects production jobs against accidental changes, such as unintended local edits or branch switching. The job definition is consistently maintained in the remote repository, and each job run is associated with a specific commit hash. Today, I will demonstrate how to set up Databricks workflows to use a remote repository and execute them using a Service Principal.
Workflow with Git notebook reference
Let’s set up the Git workflow. As illustrated, you can set the ‘Source’ to Git.
When setting up connectivity, you can connect to Azure DevOps or GitHub. Interestingly, you can point not only to a specific branch but also to a particular commit or tag. In many scenarios, code changes are tagged, so you can execute different versions of the code without making any changes on the Databricks side, which is a significant advantage!
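The same choice between branch, tag, and commit can also be expressed when a job is defined through the Jobs API instead of the UI. The following is a minimal sketch, assuming the `git_source` block of the Jobs API with the fields `git_url`, `git_provider`, `git_branch`, `git_tag`, and `git_commit`. The helper `build_git_source`, the repository URL, and the tag name are hypothetical and exist only for illustration.

```python
from typing import Optional


def build_git_source(
    git_url: str,
    git_provider: str,
    branch: Optional[str] = None,
    tag: Optional[str] = None,
    commit: Optional[str] = None,
) -> dict:
    """Build a `git_source` block pinned to exactly one branch, tag, or commit."""
    refs = {"git_branch": branch, "git_tag": tag, "git_commit": commit}
    # Keep only the reference that was actually supplied.
    chosen = {key: value for key, value in refs.items() if value}
    if len(chosen) != 1:
        raise ValueError("Specify exactly one of branch, tag, or commit")
    return {"git_url": git_url, "git_provider": git_provider, **chosen}


# Hypothetical repository URL and tag, for illustration only.
git_source = build_git_source(
    git_url="https://dev.azure.com/my-org/my-project/_git/my-repo",
    git_provider="azureDevOpsServices",
    tag="v1.2.0",
)
print(git_source)
```

Pinning a job to a tag this way means a new version is rolled out by moving or creating a tag in the repository, with no edit to the job itself.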
To use Git references in workflow tasks, you need to configure a Linked Account. This configuration can be found under Settings. As shown in the screenshot below, you can set up your own Git configuration here. Please note that this setting is tied to your personal account and cannot be configured for other users from this interface.
As an example, I created a simple workflow that executes a single notebook. This notebook displays a “Hello, World” message along with the current user under whom the notebook is executed.
I also modified the notebook outside of Databricks by replacing the first print command with “Hello, World 2.” After executing the workflow again, you can see that these changes were automatically pulled and reflected without any additional intervention.
Generate Service Principal Git Credentials
Of course, the above examples illustrate how to retrieve Git code from a workflow using my user context, which is not the focus of this article. We can also specify which account will be used to execute the workflow. As shown below, we can add a Service Principal account for this purpose. It is important to ensure that the Service Principal has access to all resources used within the workflow and that it has Git credentials configured to pull code from the repository.
Git credentials for a Service Principal cannot be configured through the graphical interface. However, you can use the REST API to achieve this configuration. Below, I have provided Python code within a notebook to illustrate the process.
The code below authenticates against Azure Active Directory (now Microsoft Entra ID) as the Service Principal to obtain an access token. The token is then used to call the Databricks workspace REST API and list the Git credentials associated with the Service Principal. Note that 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is the fixed application ID of the Azure Databricks resource and is the same in every tenant; with /.default appended it forms the scope of the token request. To avoid hard-coding secrets, the client ID and client secret are read through Databricks utilities from a secret scope linked to Azure Key Vault.
import json

import requests

# The workspace URL must include the scheme, otherwise requests cannot resolve it.
databricks_workspace_url = "https://adb-123.azuredatabricks.net"
tenant_id = "123"

# f-string so that tenant_id is substituted into the token endpoint.
login_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
login_headers = {"Content-Type": "application/x-www-form-urlencoded"}

# Client ID and secret come from a Key Vault-backed secret scope.
login_data = {
    "client_id": dbutils.secrets.get(scope="akv-secret-scope", key="service-principal-client-id"),
    "grant_type": "client_credentials",
    "scope": "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default",
    "client_secret": dbutils.secrets.get(scope="akv-secret-scope", key="service-principal-client-secret"),
}

# Request an access token for the Service Principal.
access_token = requests.post(login_url, headers=login_headers, data=login_data).json()["access_token"]

# List Git credentials for the Service Principal.
list_url = f"{databricks_workspace_url}/api/2.0/git-credentials"
list_headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json",
}
list_response = requests.get(list_url, headers=list_headers).json()

# Filled in by the next step if a matching credential already exists.
credential_id = None
The next step creates the credential or updates an existing one. The code checks whether a Git credential for the Service Principal already exists in the workspace; if it does, the credential is updated, and if not, a new one is created. Both calls go through the Databricks REST API and are authenticated with the access token obtained above. The Git Personal Access Token is read from Key Vault and can be generated independently of this code, either manually or automatically:
# Use the same Git username for the lookup and for create/update.
git_username = "databricks_sp_to_git"

# Look for an existing credential registered under this Git username.
if "credentials" in list_response:
    for credential in list_response["credentials"]:
        if credential["git_username"] == git_username:
            credential_id = credential["credential_id"]
            break

# Headers and payload are identical for create and update.
api_headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json",
}
credential_data = json.dumps({
    "personal_access_token": dbutils.secrets.get(scope="akv-secret-scope", key="sp-pat-to-git"),
    "git_username": git_username,
    "git_provider": "azureDevOpsServices",
})

if credential_id is None:
    print("Credential does not exist. Creating...")
    create_url = f"{databricks_workspace_url}/api/2.0/git-credentials"
    response = requests.post(create_url, headers=api_headers, data=credential_data)
else:
    print(f"Credential with ID {credential_id} exists. Updating...")
    update_url = f"{databricks_workspace_url}/api/2.0/git-credentials/{credential_id}"
    response = requests.patch(update_url, headers=api_headers, data=credential_data)

print(response.text)
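The create-or-update decision can be checked without calling the workspace by isolating it into small pure functions. This is a minimal sketch, not part of the original notebook: the helper names `find_credential_id` and `plan_request` are hypothetical, and the sample payload only mimics the shape of the list response used above, with made-up values.

```python
from typing import Optional, Tuple


def find_credential_id(list_response: dict, git_username: str) -> Optional[int]:
    """Return the ID of the first credential matching git_username, or None."""
    for credential in list_response.get("credentials", []):
        if credential.get("git_username") == git_username:
            return credential.get("credential_id")
    return None


def plan_request(base_url: str, credential_id: Optional[int]) -> Tuple[str, str]:
    """Pick the HTTP method and URL: POST to create, PATCH to update."""
    endpoint = f"{base_url}/api/2.0/git-credentials"
    if credential_id is None:
        return ("POST", endpoint)
    return ("PATCH", f"{endpoint}/{credential_id}")


# Sample payload shaped like the list response (values are made up).
sample = {
    "credentials": [
        {
            "credential_id": 42,
            "git_username": "databricks_sp_to_git",
            "git_provider": "azureDevOpsServices",
        }
    ]
}

existing = find_credential_id(sample, "databricks_sp_to_git")
missing = find_credential_id({}, "databricks_sp_to_git")
print(plan_request("https://adb-123.azuredatabricks.net", existing))
print(plan_request("https://adb-123.azuredatabricks.net", missing))
```

Keeping the lookup separate from the HTTP calls makes it easy to see that the username used for the search must match the one written on create, otherwise the update branch is never reached.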
After execution, you will see that the credential has been successfully created.
Execute workflows as Service Principal
Following that, the job can be executed in the context of the Service Principal:
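For jobs defined as code, the "run as" identity can be set in the job definition itself. The sketch below assumes the Jobs API `run_as` block with a `service_principal_name` field holding the Service Principal's application ID, and a notebook task with `source` set to `GIT`. The function `build_job_settings`, the job name, the notebook path, the repository URL, and the application ID are all hypothetical, and compute settings are deliberately left out.

```python
import json


def build_job_settings(
    job_name: str,
    notebook_path: str,
    git_source: dict,
    sp_application_id: str,
) -> dict:
    """Assemble job settings that pull a notebook from Git and run as a Service Principal."""
    return {
        "name": job_name,
        "git_source": git_source,
        # Assumption: run_as takes the Service Principal's application (client) ID.
        "run_as": {"service_principal_name": sp_application_id},
        "tasks": [
            {
                "task_key": "run_notebook",
                # The path is relative to the repository root when the source is Git.
                "notebook_task": {"notebook_path": notebook_path, "source": "GIT"},
            }
        ],
    }


# Hypothetical values, for illustration only.
settings = build_job_settings(
    job_name="hello-world-from-git",
    notebook_path="notebooks/hello_world",
    git_source={
        "git_url": "https://dev.azure.com/my-org/my-project/_git/my-repo",
        "git_provider": "azureDevOpsServices",
        "git_branch": "main",
    },
    sp_application_id="00000000-0000-0000-0000-000000000000",
)
print(json.dumps(settings, indent=2))
```

The Service Principal named here is the identity whose Git credentials are used to fetch the code, which is why the credential had to be registered first.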
This approach works exceptionally well: all your code is managed directly in the repository, with no manual deployment of notebooks. I highly recommend exploring this technique, as it is a powerful feature that can be beneficial in various scenarios. For more information, please refer to the official documentation. That’s all for now. Thank you!
- Setup Git credentials for Service Principal in Azure Databricks - August 21, 2024