
High concurrency mode for Fabric notebooks in pipelines

When you run a single notebook in Fabric, a new Apache Spark session is started. This is the default behavior of notebooks when using the standard approach. However, you can execute other notebooks from a single notebook using built-in methods from mssparkutils, such as runMultiple, which I have already described here. With this method, the same session is shared for the entire execution, which can be beneficial in various scenarios. But what if you want to orchestrate the execution from pipelines? I will explain all of this below.
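As a reminder, here is a minimal sketch of what such a call can look like. The notebook names are placeholders for your own items, and the DAG structure shows only a subset of the options runMultiple supports.

from notebookutils import mssparkutils  # preinstalled in the Fabric Spark runtime

# Simplest form: run both notebooks inside the current Spark session.
mssparkutils.notebook.runMultiple(["Notebook A", "Notebook B"])

# runMultiple also accepts a DAG-like structure when you need parameters
# or dependencies between notebooks.
dag = {
    "activities": [
        {
            "name": "Notebook A",          # unique activity name
            "path": "Notebook A",          # notebook to run
            "args": {"param1": "value1"},  # parameters passed to the notebook
        },
        {
            "name": "Notebook B",
            "path": "Notebook B",
            "dependencies": ["Notebook A"],  # run only after Notebook A finishes
        },
    ]
}
mssparkutils.notebook.runMultiple(dag)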

Let’s first see what the default behaviour is. In my Fabric workspace I have three notebooks that I will execute, and a pipeline that orchestrates all of them:

The pipeline is extremely simple; it contains only three notebook execution tasks. All these tasks are intended to execute in parallel, which is why there are no dependencies between them.

When I execute it, I can go to the Monitoring Hub to see what the execution looks like.

The above picture shows that all the notebooks were executed correctly, and each used a different Spark Session. How do we know this? Each activity name includes the notebook name and a Livy ID, which indicates the Spark Session created for it. When we examine the details of each execution, we can see that all these identifiers are different, meaning the notebooks were not executed within the same session.
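If you want to verify this yourself from inside the notebooks, a quick sketch is to print the Spark application ID in each of them and compare the values (spark is the session that Fabric prepares for every notebook):

from pyspark.sql import SparkSession

# In a Fabric notebook, "spark" is already defined; getOrCreate() simply
# returns that existing session instead of starting a new one.
spark = SparkSession.builder.getOrCreate()

# Compare this value across notebooks: different application IDs mean each
# notebook got its own Spark session, identical IDs mean a shared one.
print("Spark application ID:", spark.sparkContext.applicationId)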

As you probably know, every Fabric SKU has limits on concurrency and queue size. The maximum depends on the SKU you have, where 1 capacity unit (CU) equals 2 Spark VCores. Spark for Fabric enforces a cores-based throttling and queueing mechanism: users can submit jobs up to the limits of the purchased Fabric capacity SKU. The queueing mechanism is a simple FIFO queue that checks for available job slots and automatically retries jobs once capacity becomes available. The table below depicts these limits:

If you reach the limit, you can get an error like this one:

HTTP Response code 430: This Spark job can't be run because you have hit a Spark compute or API rate limit. To run this Spark job, cancel an active Spark job through the Monitoring hub, or choose a larger capacity SKU or try again later.

Of course, everything depends on the configured pool; you can have small or large custom pools, but the limits remain the same even when bursting is taken into account. For example, with an F64 SKU you have 128 Spark VCores available. With a burst factor of 3, it supports up to 384 Spark VCores for concurrent execution. In this configuration, three jobs using 128 VCores each can run concurrently, or a single job can use all 384 VCores.
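To make the arithmetic explicit, here is a quick sketch of the F64 numbers, assuming 2 Spark VCores per CU and a burst factor of 3 as described above:

# Capacity math for the F64 example: 1 CU = 2 Spark VCores, burst factor 3.
capacity_units = 64      # F64 SKU
vcores_per_cu = 2
burst_factor = 3

base_vcores = capacity_units * vcores_per_cu   # 64 * 2 = 128
max_burst_vcores = base_vcores * burst_factor  # 128 * 3 = 384

job_size_vcores = 128
max_concurrent_jobs = max_burst_vcores // job_size_vcores  # 3 jobs of 128 VCores

print(base_vcores, max_burst_vcores, max_concurrent_jobs)  # 128 384 3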

So, what if I want to reuse the same session across multiple notebooks orchestrated by a pipeline? First, you need to configure your workspace. In Workspace settings, navigate to the Data Engineering/Science section and then to Spark settings. There, you can enable high concurrency for pipelines running multiple notebooks.

 

That’s not all. In the pipeline, under Settings > Advanced settings of each notebook activity, we must provide a session tag. All notebook executions with the same session tag will share the same session. We can group our notebooks however we like, but there are some limitations to keep in mind. For now, let’s look at the result of our exercise.

Two of the notebooks, including Notebook B, are configured with the same session tag, while Notebook C has its own tag. When we go to the monitoring hub, we will see something like this:

As you may have noticed, a new ‘HC’ prefix has appeared, indicating that the notebooks ran in high concurrency mode. One Livy identifier is shared by two notebooks, while Notebook C, which has a different session tag, got its own. The two shared entries not only have the same ID but also show the same notebook name. This does not mean the same notebook was executed twice; it simply means both executions ran in the session initiated by Notebook B, which becomes evident when examining the details.

Everything works as it should. However, there are some limitations to consider:

  • Session tags cannot be used across workspaces.
  • The same user must run the process.
  • The same default lakehouse must be set up within the notebook.
  • The same Spark compute configurations must be set up.
  • The same library packages must be set up.

What are the benefits of using high concurrency mode? First, we can reuse an already active session, which significantly reduces startup time. The overall execution can also be faster, and in some cases we may not even need to move to a larger SKU.

That’s all for now! I hope you find it useful!

 

Adrian Chodkowski