Orchestration is about organizing and controlling multiple computer systems, apps, and services, linking many tasks together to carry out a bigger workflow or process. These processes can consist of many automated tasks and can span many systems. The aim of orchestration is to make regular, repeatable processes run more smoothly and quickly, helping data teams handle complex tasks and workflows more easily. Whenever a process is repeatable and its tasks can be automated, orchestration can save time, improve efficiency, and eliminate redundant work.
In data engineering or ETL, orchestration often means running specific code in the right order for the job at hand. Today, I want to show you a way to run notebooks from other notebooks, in particular using a very handy function named runMultiple. I’ll concentrate only on running one notebook from another, although Microsoft Fabric offers other options too. I’ve already discussed one of them; you can find the article here. Let’s take a look at how it works.
My demo setup includes four notebooks:
- main_notebook – the main notebook that will run the other notebooks,
- notebook_01, notebook_02, notebook_03 – the worker notebooks that are run from the main notebook; each takes a parameter and returns it to the main notebook.
The code in notebook 1, notebook 2, and notebook 3 is very straightforward and only contains two cells: one to accept parameters and the other to return their values.
input_value = "default"
mssparkutils.notebook.exit(input_value)
Just a reminder: in Microsoft Fabric notebooks, parameters are set as variable definitions in one of the cells (usually the top one). You just need to go to the settings of that cell and select the “Toggle parameter cell” option. After that, you should see a small “parameters” indicator in the bottom right corner of the cell. This means the cell is treated in a special way and you can define all your parameters there. As you can see, I have only one string parameter with a default value of “default” (you can also skip this part and pass a parameter without declaring it in the child notebook, but IMHO that is not the best approach). The value of this parameter is returned using the mssparkutils.notebook.exit function. The same setup is present in notebooks 1, 2, and 3.
We’re all set to start. First, let’s use one of the magic commands in our main notebook, called %run. It’s very easy to use: type %run followed by the path to the notebook and then, optionally, key-value pairs of parameter values in curly braces, as shown below:
%run notebook_01 { "input_value": 1}
As you can see, it’s very simple to use, and we can also see the value returned by the child notebook. %run can be useful in many scenarios, but if we want a more Python-like way, for example to capture the returned value and save it in a variable, then we should consider using mssparkutils.notebook.run. This gives us the same functionality as a plain Python function call. The syntax is shown below, where 600 is the timeout in seconds for the executed notebook:
mssparkutils.notebook.run("notebook_01",600,{ "input_value": 1})
Everything is working as expected. As you can see in the screenshot below, there’s also an option “View notebook run”. If we click on it, we can see how the child notebook was executed, cell by cell.
The two options I showed are the most common ways to run one notebook from another. But if we put those commands in multiple cells, they will run in sequence, which is fine when each step depends on the previous one but wastes time when the notebooks are independent. What if we want to run them in parallel? Of course, we can use standard Python concurrency features like concurrent.futures:
import concurrent.futures

def run_notebook(notebook_name, input_value):
    return mssparkutils.notebook.run(notebook_name, 600, {"input_value": input_value})

notebook_names = ["notebook_01", "notebook_02", "notebook_03"]
input_values = [1, 2, 3]

with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [
        executor.submit(run_notebook, notebook_name, input_value)
        for notebook_name, input_value in zip(notebook_names, input_values)
    ]
    results = [future.result() for future in concurrent.futures.as_completed(futures)]

print(results)
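One caveat with the snippet above: as_completed yields futures in the order they finish, so the results list is not guaranteed to line up with notebook_names. Below is a minimal sketch of one way to keep each result paired with its notebook. Note that fake_run and run_in_parallel are hypothetical names used only for illustration; fake_run stands in for mssparkutils.notebook.run so the example can run outside Fabric.

```python
import concurrent.futures


def fake_run(notebook_name, timeout_seconds, args):
    # Hypothetical stand-in for mssparkutils.notebook.run so the sketch
    # runs outside Fabric; it simply echoes the parameter back as a string.
    return str(args["input_value"])


def run_in_parallel(run, notebook_names, input_values, timeout_seconds=600):
    # Map each future back to its notebook name so every result stays
    # labelled, even though as_completed yields in completion order.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future_to_name = {
            executor.submit(run, name, timeout_seconds, {"input_value": value}): name
            for name, value in zip(notebook_names, input_values)
        }
        return {
            future_to_name[future]: future.result()
            for future in concurrent.futures.as_completed(future_to_name)
        }


results = run_in_parallel(
    fake_run,
    ["notebook_01", "notebook_02", "notebook_03"],
    [1, 2, 3],
)
print(results)
```

Inside a Fabric notebook, the idea would be to pass mssparkutils.notebook.run in place of fake_run, since it is called with the same three positional arguments shown earlier.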
But this method isn’t the simplest, and there are other alternatives that I’d like to tell you about now. The real game-changer here is the function mssparkutils.notebook.runMultiple. In its simplest form, it can run all the notebooks passed as a list in parallel:
mssparkutils.notebook.runMultiple(["notebook_01", "notebook_02"])
As you can see, it works pretty well, but that’s not all this function can do. What if we want to model dependencies between notebooks? We can do this by describing them as a DAG in a JSON-like structure. What is a DAG? A Directed Acyclic Graph (DAG) is a graph whose edges have a direction and never form a cycle, which makes it a natural way to express dependencies in areas like scheduling, data compression, and especially data processing. In our case, the graph is defined as follows:
DAG = {
    "activities": [
        {
            "name": "execute_notebook_1",
            "path": "notebook_01",
            "timeoutPerCellInSeconds": 600,
            "args": {"input_value": "999"},
            "retry": 1,
            "retryIntervalInSeconds": 30,
            "dependencies": []
        },
        {
            "name": "execute_notebook_2",
            "path": "notebook_02",
            "timeoutPerCellInSeconds": 400,
            "args": {"input_value": "888"},
            "retry": 1,
            "retryIntervalInSeconds": 30,
            "dependencies": ["execute_notebook_1"]
        },
        {
            "name": "execute_notebook_3",
            "path": "notebook_03",
            "timeoutPerCellInSeconds": 600,
            "args": {"input_value": "777"},
            "retry": 1,
            "retryIntervalInSeconds": 30,
            "dependencies": ["execute_notebook_1"]
        },
        {
            "name": "execute_notebook_3_with_different_param",
            "path": "notebook_03",
            "timeoutPerCellInSeconds": 600,
            "args": {"input_value": "111"},
            "retry": 1,
            "retryIntervalInSeconds": 30,
            "dependencies": ["execute_notebook_2"]
        }
    ],
    "timeoutInSeconds": 43200,
    "concurrency": 0
}
Let’s briefly explain all the properties:
- activities – a list of activities (notebooks) executed within this DAG,
- name – a unique name assigned to the specific activity, the activity will be identifiable by this name,
- path – the path to the notebook that we want to execute within the activity,
- timeoutPerCellInSeconds – the timeout for every cell; according to the Microsoft documentation, the default is 90 seconds,
- args – arguments/parameters that can be passed to the notebook,
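- retry – how many times the notebook should be retried if an error occurs,
- retryIntervalInSeconds – the interval between retries; the default is 0, so a retry starts immediately,
- dependencies – the names of the activities on which this specific activity relies,
- timeoutInSeconds – set once for the whole DAG rather than per activity, the timeout for the entire run (43200 seconds is 12 hours),
- concurrency – also set once for the whole DAG, the maximum number of notebooks that run at the same time.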
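Two rules are implied by the list above: activity names must be unique, and every dependency has to refer to an existing activity. These can be checked before the DAG is handed to runMultiple. The following is a minimal sketch, not a complete validator; validate_dag is a hypothetical helper and is not part of mssparkutils.

```python
def validate_dag(dag):
    """Return a list of human-readable problems found in a DAG dict."""
    problems = []
    seen = set()

    # Activity names identify activities, so each one must be unique.
    for activity in dag["activities"]:
        name = activity["name"]
        if name in seen:
            problems.append(f"duplicate activity name: {name}")
        seen.add(name)

    # Every dependency must point at an activity defined in the same DAG.
    for activity in dag["activities"]:
        for dependency in activity.get("dependencies", []):
            if dependency not in seen:
                problems.append(
                    f"{activity['name']} depends on unknown activity: {dependency}"
                )
    return problems


good_dag = {
    "activities": [
        {"name": "execute_notebook_1", "path": "notebook_01", "dependencies": []},
        {"name": "execute_notebook_2", "path": "notebook_02",
         "dependencies": ["execute_notebook_1"]},
    ]
}
bad_dag = {
    "activities": [
        {"name": "load", "path": "notebook_01", "dependencies": []},
        {"name": "load", "path": "notebook_02", "dependencies": ["extract"]},
    ]
}

print(validate_dag(good_dag))
print(validate_dag(bad_dag))
```

Returning a list of problems instead of raising on the first one makes it easier to fix a generated DAG in a single pass.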
In our case, we first execute notebook 1, then notebooks 2 and 3 in parallel, and finally, we execute notebook 3 one more time with different parameters after notebook 2. The code is pretty simple because we just execute the function and pass our variable with a diagram to it:
mssparkutils.notebook.runMultiple(DAG,{"displayDAGViaGraphviz": True})
As you can see above, I also added an optional parameter named displayDAGViaGraphviz set to True – it will give us a simple visualization of our diagram of dependencies. Next to that, we have a nice summary of every execution, including duration, exit value (so the value returned by every notebook), and we can also see the details like in previous examples by clicking the name of a notebook in the Snapshot section.
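To reason about the execution order without running anything, the dependencies can be grouped into "waves" with graphlib from the Python standard library (Python 3.9+). This is only a sketch of the ordering logic: execution_waves is a hypothetical helper, and the real scheduler may start an activity as soon as its own dependencies finish rather than waiting for a whole wave.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Activity name -> names of the activities it depends on,
# copied from the "dependencies" entries of the DAG in this article.
dependencies = {
    "execute_notebook_1": [],
    "execute_notebook_2": ["execute_notebook_1"],
    "execute_notebook_3": ["execute_notebook_1"],
    "execute_notebook_3_with_different_param": ["execute_notebook_2"],
}


def execution_waves(dependencies):
    # Group activities into waves: everything in one wave only depends
    # on activities from earlier waves, so it could run in parallel.
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

As a side effect, prepare() also catches circular dependencies, which would make the structure no longer a valid DAG.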
It looks pretty nice, doesn’t it? For me, it’s a really good feature that is often overlooked. It can be even more powerful if we generate the DAG from metadata and automate it. I recommend testing this feature on your own.
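As a rough illustration of generating the DAG from metadata, here is a minimal sketch that turns a list of rows into the structure runMultiple expects. The row layout (name, path, args, depends_on) and the build_dag helper are assumptions made up for this example; in practice the rows might come from a configuration table or file. The timeout, retry, and concurrency values are simply copied from the DAG shown earlier.

```python
def build_dag(rows, timeout_per_cell=600, retry=1, retry_interval=30):
    # Turn metadata rows into a DAG dict shaped like the one used earlier.
    # The row keys ("name", "path", "args", "depends_on") are hypothetical.
    activities = []
    for row in rows:
        activities.append({
            "name": row["name"],
            "path": row["path"],
            "timeoutPerCellInSeconds": timeout_per_cell,
            "args": row.get("args", {}),
            "retry": retry,
            "retryIntervalInSeconds": retry_interval,
            "dependencies": list(row.get("depends_on", [])),
        })
    # Top-level values copied from the example DAG in this article.
    return {"activities": activities, "timeoutInSeconds": 43200, "concurrency": 0}


metadata = [
    {"name": "execute_notebook_1", "path": "notebook_01",
     "args": {"input_value": "999"}},
    {"name": "execute_notebook_2", "path": "notebook_02",
     "args": {"input_value": "888"}, "depends_on": ["execute_notebook_1"]},
    {"name": "execute_notebook_3", "path": "notebook_03",
     "args": {"input_value": "777"}, "depends_on": ["execute_notebook_1"]},
]

dag = build_dag(metadata)
print(dag["activities"][1])
```

The resulting dictionary could then be passed to mssparkutils.notebook.runMultiple in the same way as the handwritten DAG.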
Resources:
- https://learn.microsoft.com/en-us/fabric/data-engineering/microsoft-spark-utilities
- https://learn.microsoft.com/en-us/fabric/data-engineering/author-execute-notebook