- notebook_id – identifier of a notebook that you can get using API or when you open a specific notebook you can get it from URL (section after word synapsenotebooks):
- notebook_name – name of a notebook, it will not be used by orchestration mechanism but I included it for readability,
- execution_group – a group of notebooks that we want to execute together,
- execution_group_execution_order – order of execution of every group:
β CREATE TABLE notebooks_to_execute ( id INTEGER, nteobook_id VARCHAR(50), notebook_name VARCHAR(50), execution_group VARCHAR(50), execution_group_execution_order INTEGER )
My current configuration looks like this:
INSERT INTO notebooks_to_execute ( id ,notebook_id ,notebook_name ,execution_group ,execution_group_execution_order ) VALUES ( 1,'9c9f794d-93ff-4246-b425-046e12513ad3','notebook1','Group A',1) ,( 2,'95c4a86e-9f28-46c9-b155-276d0be75a60','notebook2','Group A',1) ,( 3,'c6eff9c6-3b8d-4349-ba45-72d301fcf4d1','notebook3','Group A',1) ,( 4,'3061c634-9f57-481a-a8ae-fd222a657ba1','notebook4','Group B',2) ,( 5,'16eb1211-bbcd-4080-a19f-fd03aa14648a','notebook5','Group B',2) ,( 6,'29e472b3-4cff-4022-b1b8-d41cf01d459b','notebook6','Group D',3) ,( 7,'7e0c866b-eefd-485b-8813-2d42b57fb307','notebook7','Group E',4)
I have four groups that I want to execute in a specific order. Some groups contain multiple notebooks, while others have only one notebook. My assumption is that every group can have a completely different execution order, but if needed, multiple groups can be executed within the same execution. Notebooks, by default, are in only one group, but if there is a need, there is no problem to mention them multiple times in multiple groups. The table itself is very simple to depict the process, but to make it more reliable, it can be normalized a little. However, this is not crucial at this moment.
With this approach, we can model any dependency and order of execution. It’s important to note that the notebook name and group name are not important for the process, but those columns are essential for readability and easy identification.
The configuration of the Lookup activity appears as shown in the image below. There is no need to configure connectivity to the warehouse itself, especially since it is located in the same workspace. One of the main reasons I chose the warehouse is because I can execute a specific query against it, which is not possible with, for example, a lakehouse. The query added to this activity is a simple SELECT DISTINCT. For productive deployments, I recommend using a stored procedure to keep this code on the warehouse side.
SELECT DISTINCT execution_group_execution_order FROM notebooks_to_execute ORDER BY execution_group_execution_order
Regarding the “for each” loop, it simply iterates through the execution orders returned by the Lookup activity, ensuring that it goes through them sequentially. Invoking the orchestration_groups pipeline is also straightforward. The only thing to note is that we should pass the execution_group_execution_order as a parameter with the same name for the current item.
If you would like to reproduce the same approach this is an expression used in Items property of a loop:
@activity('Get groups to execute').output.value
and this is the expression passed to execution_groups_execution_order parameter:
@item().execution_group_execution_order
Our next pipeline is very similar to the previous one and has also three elements:
- Lookup activity to retrieve the groups that we want to execute.
- Foreach loop to iterate through those groups.
- Invoke pipeline to execute the orchestration_groups pipeline.
As I mentioned earlier single “execution order” can have multiple groups so we have to iterate over it. Lookup activity configuration is very similar to the previous one only query is a little different because we want to get distinct list execution_group values and the entire query is filtered by execution_group_execution_order parameter that was passed from the parent pipeline:
@concat(' SELECT DISTINCT execution_group FROM notebooks_to_execute WHERE execution_group_execution_order =''',pipeline().parameters.execution_groups_execution_order,'''')
Invoke orchestration_notebook contains group_name parameter passed:
The last pipeline is called orchestration_notebooks and also has three elements:
- Get notebooks – get from the configuration warehouse all notebooks that should be executed,
- Foreach – loops through the notebooks returned by Get notebooks,
- Execute notebook – pointer to the specific notebook that should be executed.
Lookup expression used to get notebooks:
@concat(' SELECT DISTINCT notebook_name FROM notebooks_to_execute WHERE execution_group = ''',pipeline().parameters.group_name,'''')
Typical not sequential for each loop:
@activity('Get notebooks').output.value
An interesting fact is that when we don’t want to parameterize the notebook reference, we can use its name. However, if there is any expression involved, then we have to provide the notebook’s ID:
Ok then, our process is ready so I executed it and the result is visible below:
- Avoiding Issues: Monitoring Query Pushdowns in Databricks Federated Queries - October 27, 2024
- Microsoft Fabric: Using Workspace Identity for Authentication - September 25, 2024
- Executing SQL queries from Azure DevOps using Service Connection credentials - August 28, 2024
Great article! Adrian Chodkowski. Implementing a similar approach in Fabric and wondering how can we run the pipeline from Point of Failure if one of the notebook is failed inside the loop.
You have to mark what was done in meta-database or somewhere so next time loop will start when it failed π
Also any advice on Breaking the loop and failing the pipeline run in case of one notebook failure…. the loop is continuing with other sequence even if a notebook is failed..
could not able to use API to kill the session/pipeline. Alternative will be highly appreciated.
Of course everything depends on scenario, but good point π