
Executing Databricks custom query from Azure Data Factory

Today, we’re going to talk about using Azure Data Factory to run a custom query in Databricks. Why do we need a dedicated article about it? Well, Azure Data Factory helps us move and transform data, and in some scenarios Databricks can be treated as a source from which we want to move data. When we use ADF and ADB together, we might expect a pretty simple and straightforward process, but the reality is different. In this article, I am going to focus on running a custom query on Databricks and getting the results using Azure Data Factory.

We’ll go through everything step by step, showing you how to set it up, how to configure it, and how to run it. We’ll also share some code snippets and explain what they do. First of all, when you look through the available Linked Services, you will probably notice two that are connected to Databricks: Azure Databricks and Azure Databricks Delta Lake.

When we look at the Activities, we can see only three of them, and all of them are connected to the Azure Databricks linked service (Azure Databricks Delta Lake has no dedicated Activity):

  • Notebook,
  • Jar,
  • Python

As you have probably noticed already, those activities give you the possibility to point at code stored as a .py file, a JAR, or a notebook. We don’t have the option to create a Dataset, which means this linked service is not meant for any data movement activities, only for orchestration. Let’s see if there are any customization possibilities:
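To make this concrete, here is a minimal sketch of what a Notebook activity looks like in the pipeline JSON; the linked service name, notebook path, parameter and library below are just illustrative placeholders:

{
    "name": "RunMyNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "AzureDatabricksLS",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "notebookPath": "/Shared/my_notebook",
        "baseParameters": {
            "run_date": "2024-01-01"
        },
        "libraries": [
            { "pypi": { "package": "some-package" } }
        ]
    }
}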

Unfortunately, the only things possible are to point to the path of the notebook, pass some parameters, and append libraries, so not that much. Of course, if you want to orchestrate notebooks in the correct order, this activity can be useful. The only problem that I see is the Linked Service itself, which supports only three authentication methods:

  • Personal Access Token,
  • System Managed Identity,
  • User Managed Identity

It would be great to also have the possibility to use a standard Service Principal, which could be used in many cases. To sum up, we cannot use the Azure Databricks linked service if we want to execute a custom query.
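For reference, an Azure Databricks linked service using System Managed Identity looks roughly like the sketch below; the workspace URL, workspace resource id and cluster id are placeholders:

{
    "name": "AzureDatabricksLS",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://adb-<workspace-id>.azuredatabricks.net",
            "authentication": "MSI",
            "workspaceResourceId": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Databricks/workspaces/<workspace-name>",
            "existingClusterId": "<cluster-id>"
        }
    }
}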

Let’s then switch to Azure Databricks Delta Lake to see what possibilities it gives us. For this connector we can create a Dataset and point it to a specific database (schema) and table. For some of you a question may appear: what about Unity Catalog? There we have a three-level namespace, and here we have the old two-level one. The answer is pretty simple: the connector does not support it directly, but we can fool it a little 🙂 Let’s select the Edit option for both fields and leave them empty!
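In JSON terms, the dataset we end up with is more or less the sketch below: the type points at the Delta Lake connector and the database and table properties are simply not set (the dataset and linked service names here are placeholders):

{
    "name": "DatabricksDeltaLakeDS",
    "properties": {
        "type": "AzureDatabricksDeltaLakeDataset",
        "linkedServiceName": {
            "referenceName": "AzureDatabricksDeltaLakeLS",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {}
    }
}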

Then we can use one of the standard activities like Copy or Lookup and, instead of selecting a table, choose “Query” and write our own query. And guess what? The three-level namespace is working!
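As an example, a Lookup activity with such a query could look like the sketch below; the catalog, schema and table names are made up for illustration:

{
    "name": "LookupFromDatabricks",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureDatabricksDeltaLakeSource",
            "query": "SELECT * FROM my_catalog.my_schema.my_table LIMIT 100"
        },
        "dataset": {
            "referenceName": "DatabricksDeltaLakeDS",
            "type": "DatasetReference"
        },
        "firstRowOnly": false
    }
}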

Data is available and we can use it inside the pipeline. A small magic trick and everything works as expected.

But that’s not all! When you try to load data from Databricks into another sink, you can run into some problems. I tried to perform a copy from Azure Databricks Delta Lake to Azure SQL and I got an error saying that the only supported targets are DelimitedText, Parquet, or Avro on Azure Data Lake Storage Gen2 or Azure Blob Storage:

The only way to achieve what I want is by enabling staging. This means that data will temporarily be saved on storage before being pushed to the target, which is quite suspicious. The answer to why this is the case can be found in the documentation (https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-databricks-delta-lake?tabs=data-factory). It states that the Copy Activity is not performing this save, but rather, Databricks is doing all the work!
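In practice this means the Copy Activity needs staging enabled and pointed at a storage account that both services can reach; a rough sketch of the relevant part of the activity (dataset, linked service and table names are placeholders):

{
    "name": "CopyDeltaToSql",
    "type": "Copy",
    "inputs": [ { "referenceName": "DatabricksDeltaLakeDS", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "AzureSqlTableDS", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "AzureDatabricksDeltaLakeSource",
            "query": "SELECT * FROM my_catalog.my_schema.my_table"
        },
        "sink": { "type": "AzureSqlSink" },
        "enableStaging": true,
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": "StagingAdlsGen2LS",
                "type": "LinkedServiceReference"
            },
            "path": "staging-container/adf"
        }
    }
}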

In this case, I strongly recommend writing all the code in a Databricks notebook, saving it in the appropriate format in the correct storage, and then allowing the Copy Activity to read and transfer it to the chosen Sink.
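If you go that route, the Data Factory side becomes a plain copy of the files the notebook produced, something along the lines of the sketch below (the dataset names are illustrative and assume a Parquet dataset on the storage account and an Azure SQL table dataset):

{
    "name": "CopyParquetToSql",
    "type": "Copy",
    "inputs": [ { "referenceName": "NotebookOutputParquetDS", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "AzureSqlTableDS", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "ParquetSource" },
        "sink": { "type": "AzureSqlSink" }
    }
}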

There are some possibilities using Dataflows, but I’m not a big fan of this technology. However, if you’re interested, you can find an official tutorial (link). You might also consider using the Lookup activity, but please remember that it has a limit of 1000 rows that can be returned, which can be a significant problem in many cases.

As you can see, there are some possibilities for extracting data from Databricks using Data Factory, but they are quite limited. I encourage you to perform your own tests and leverage the standard functionalities of both tools instead of resorting to overcomplicated solutions.
