FabricNotebooksOrchestration_00

Notebook orchestration in Microsoft Fabric made easy

When I meet with customers, we discuss how to efficiently organize notebook execution via orchestration mechanism. They often mention managing it manually using tools like Azure Data Factory. On the other hand, they sometimes express that Data Factory is too simplistic and they require other tools to address their needs. Similar questions arise regarding Microsoft Fabric, but the answer remains consistent in most cases: utilize a metadata-driven approach instead of crafting manual processes and you don’t need any complex tools to achieve your goals.

This is a crucial aspect because simple orchestration enhances productivity, scalability, and maintainability. Easy orchestration allows developers to focus on analysis and development rather than worrying about code execution. How to achieve it? It’s simple! In this article, I will demonstrate how to do it in Microsoft Fabric with just a few steps, using services available within the platform. Let’s start!

To begin, I created seven notebooks. It’s not crucial what’s inside them because that’s not the focus of this article. In reality, they could contain anything from complex transformations to simple data copy mechanisms.

As mentioned in the introduction, we will employ a straightforward metadata-driven approach, necessitating a storage space for this metadata. I’ve created a table in the Fabric Warehouse to house all configuration values. The table is minimalistic, featuring only a few columns.

  • notebook_id – identifier of a notebook that you can get using API or when you open a specific notebook you can get it from URL (section after word synapsenotebooks):
  • notebook_name – name of a notebook, it will not be used by orchestration mechanism but I included it for readability,
  • execution_group – a group of notebooks that we want to execute together,
  • execution_group_execution_order – order of execution of every group:
​
CREATE  TABLE notebooks_to_execute
(
    id                                  INTEGER,
    nteobook_id                         VARCHAR(50),
    notebook_name                       VARCHAR(50),
    execution_group                     VARCHAR(50),
    execution_group_execution_order     INTEGER
)

My current configuration looks like this:

INSERT INTO notebooks_to_execute
(
     id
    ,notebook_id
    ,notebook_name
    ,execution_group
    ,execution_group_execution_order
)
VALUES
 ( 1,'9c9f794d-93ff-4246-b425-046e12513ad3','notebook1','Group A',1)
,( 2,'95c4a86e-9f28-46c9-b155-276d0be75a60','notebook2','Group A',1)
,( 3,'c6eff9c6-3b8d-4349-ba45-72d301fcf4d1','notebook3','Group A',1)

,( 4,'3061c634-9f57-481a-a8ae-fd222a657ba1','notebook4','Group B',2)
,( 5,'16eb1211-bbcd-4080-a19f-fd03aa14648a','notebook5','Group B',2)

,( 6,'29e472b3-4cff-4022-b1b8-d41cf01d459b','notebook6','Group D',3)

,( 7,'7e0c866b-eefd-485b-8813-2d42b57fb307','notebook7','Group E',4)

I have four groups that I want to execute in a specific order. Some groups contain multiple notebooks, while others have only one notebook. My assumption is that every group can have a completely different execution order, but if needed, multiple groups can be executed within the same execution. Notebooks, by default, are in only one group, but if there is a need, there is no problem to mention them multiple times in multiple groups. The table itself is very simple to depict the process, but to make it more reliable, it can be normalized a little. However, this is not crucial at this moment.

With this approach, we can model any dependency and order of execution. It’s important to note that the notebook name and group name are not important for the process, but those columns are essential for readability and easy identification.

Good, we are ready to create our orchestration mechanism, and it will come as no surprise that we will use Data Factory for that. There are two reasons for this choice: simplicity and scalability, which can address even very complex orchestrations. For this purpose, we need three pipelines:

  1. orchestration_master: The main pipeline that will initiate the entire process.
  2. orchestration_groups: A pipeline that will execute groups in a specific order.
  3. orchestration_notebooks: A pipeline that will execute notebooks in a specific order.

First, we will create the orchestration_master. The construction of this pipeline is very simple and consists of just three elements:

  1. Lookup activity to retrieve the execution orders.
  2. Foreach loop to iterate through those execution orders.
  3. Invoke pipeline to execute the orchestration_groups pipeline.

The configuration of the Lookup activity appears as shown in the image below. There is no need to configure connectivity to the warehouse itself, especially since it is located in the same workspace. One of the main reasons I chose the warehouse is because I can execute a specific query against it, which is not possible with, for example, a lakehouse. The query added to this activity is a simple SELECT DISTINCT. For productive deployments, I recommend using a stored procedure to keep this code on the warehouse side.

SELECT DISTINCT execution_group_execution_order 
FROM notebooks_to_execute
ORDER BY execution_group_execution_order 

Regarding the “for each” loop, it simply iterates through the execution orders returned by the Lookup activity, ensuring that it goes through them sequentially. Invoking the orchestration_groups pipeline is also straightforward. The only thing to note is that we should pass the execution_group_execution_order as a parameter with the same name for the current item.

If you would like to reproduce the same approach this is an expression used in Items property of a loop:

@activity('Get groups to execute').output.value

and this is the expression passed to execution_groups_execution_order parameter:

@item().execution_group_execution_order

Our next pipeline is very similar to the previous one and has also three elements:

  1. Lookup activity to retrieve the groups that we want to execute.
  2. Foreach loop to iterate through those groups.
  3. Invoke pipeline to execute the orchestration_groups pipeline.

 

As I mentioned earlier single “execution order” can have multiple groups so we have to iterate over it. Lookup activity configuration is very similar to the previous one only query is a little different because we want to get distinct list execution_group values and the entire query is filtered by execution_group_execution_order parameter that was passed from the parent pipeline:

@concat('
SELECT DISTINCT execution_group 
FROM notebooks_to_execute
WHERE execution_group_execution_order =''',pipeline().parameters.execution_groups_execution_order,'''')

 

The configuration for the “for each” loop is also very similar. Please note that in this case, our loop is not sequential so it means that all the groups within single “execution order” can be executed in parallel (if you want them to not be executed in parallel just place them in separate execution order).

Invoke orchestration_notebook contains group_name parameter passed:

The last pipeline is called orchestration_notebooks and also has three elements:

  • Get notebooks – get from the configuration warehouse all notebooks that should be executed,
  • Foreach – loops through the notebooks returned by Get notebooks,
  • Execute notebook – pointer to the specific notebook that should be executed.

Lookup expression used to get notebooks:

@concat('
SELECT DISTINCT notebook_name
FROM notebooks_to_execute
WHERE execution_group = ''',pipeline().parameters.group_name,'''')

Typical not sequential for each loop:

@activity('Get notebooks').output.value

 

Notebook execution is very simple because the only information we need to provide is the Workspace and the Notebook object ID:

An interesting fact is that when we don’t want to parameterize the notebook reference, we can use its name. However, if there is any expression involved, then we have to provide the notebook’s ID:

Ok then, our process is ready so I executed it and the result is visible below:

As you can see, everything executed without any problems. We have four iterations in the loop because we have four execution group order numbers. We could delve deeper into the details, but you can already notice that everything works as expected. Right now, if we want to add more notebooks for execution, we just need to add them to the configuration table hosted in the warehouse, and that’s all! In the perfect scenario, we should not modify pipelines at all; instead, we should only modify this simple table.

The point of this article is to demonstrate how easy it is and to highlight the fact that we don’t need a very complex orchestration mechanism to reach our goals. That’s all for now. Thank you for your time, and see you next time.

 

4 Comments

  1. Great article! Adrian Chodkowski. Implementing a similar approach in Fabric and wondering how can we run the pipeline from Point of Failure if one of the notebook is failed inside the loop.

    • You have to mark what was done in meta-database or somewhere so next time loop will start when it failed πŸ™‚

  2. Also any advice on Breaking the loop and failing the pipeline run in case of one notebook failure…. the loop is continuing with other sequence even if a notebook is failed..

    could not able to use API to kill the session/pipeline. Alternative will be highly appreciated.

Leave a Reply