Using variables in loops in Data Factory – why it’s not worth it

Loops are a well-known and fundamental element of programming, and Azure Data Factory (ADF) is no different. Variables also play their usual role in ADF: they store values fetched or calculated at a specific point in a pipeline. The typical structure of a Data Factory solution is familiar as well and involves nesting calls, for example a loop whose body invokes another pipeline. This approach hides a trap: if a pipeline contains a loop and a variable is set and read inside that loop, we expose ourselves to incorrect results. Why is that? The answer is very simple: a variable always has the scope of the entire pipeline, not of a single loop iteration.

Let’s examine this with an example. The image below shows the logical execution of a loop. We iterate over an array containing three fruits: Apples, Bananas, and Oranges. Inside the loop, we read the currently processed fruit and save it to a variable with “Set Variable”. We then pass the value of that variable to the next pipeline:

The iterations of a ForEach loop in ADF can run sequentially, one after another, or in parallel, as shown in the image above. So what’s wrong here? As mentioned earlier, a variable has the scope of the entire pipeline, not of a single loop iteration. The image above seems intuitively correct, but the actual execution works quite differently.

The correct diagram depicting the execution looks like this:

The iterations are started in parallel, but the “blocks” inside them usually do not run at exactly the same moment. Because there is only one variable, every iteration assigns its value to that same variable. In this example, the variable first holds “Apples”, then another iteration overwrites it with “Oranges”, and then another with “Bananas”. As a result, “Bananas” is passed to the next pipeline in every iteration, because that was the value of the variable when the call was made. Interesting, isn’t it? Let’s go to the portal to see whether ADF really behaves this way.
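The overwrite described above can be reproduced outside ADF. The following is a minimal Python sketch, not ADF code: `run_foreach` is a hypothetical helper that models one pipeline-scoped variable shared by all iterations. The barrier is an assumption that forces the worst case, in which every iteration finishes its "Set Variable" step before any iteration reads the variable; in a real pipeline the timing varies from run to run.

```python
import threading


def run_foreach(items, sequential):
    """Simulate a ForEach whose body does Set Variable, then Execute Pipeline."""
    variable = {"value": None}  # a single variable shared by every iteration
    passed = []                 # values handed to the "next pipeline"

    if sequential:
        for item in items:
            variable["value"] = item          # Set Variable
            passed.append(variable["value"])  # Execute Pipeline reads the variable
        return passed

    lock = threading.Lock()
    barrier = threading.Barrier(len(items))

    def iteration(item):
        with lock:
            variable["value"] = item          # Set Variable
        barrier.wait()                        # all writes happen before any read
        with lock:
            passed.append(variable["value"])  # Execute Pipeline reads the variable

    threads = [threading.Thread(target=iteration, args=(item,)) for item in items]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return passed


fruits = ["Apples", "Bananas", "Oranges"]
print(run_foreach(fruits, sequential=True))   # ['Apples', 'Bananas', 'Oranges']
print(run_foreach(fruits, sequential=False))  # three copies of whichever fruit was written last
```

In the parallel case, which fruit wins depends on thread scheduling, just as it depends on activity timing in ADF. What stays constant is that all three reads see the same value.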

So we have two variables: __varListOfFruits, which contains a list of fruits that we will iterate over, and __varResultFruit, which is the variable to which we will assign the fruit from the current iteration – this variable will also be passed to the next pipeline:

Next, we have a loop that iterates over the __varListOfFruits variable:

Note the Sequential switch, which is turned off by default and which affects our result. When this option is selected, the iterations run one after another and nothing is parallelized. Inside the loop, we have three tasks:

Set variable assigns the current value of the loop to the __varResultFruit variable:

The Wait task pauses processing for a few seconds to simulate the duration of real work.

The last task passes the value of the __varResultFruit variable to the next pipeline.
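For readers without access to the screenshots, the setup described above can be approximated as a pipeline definition. The sketch below builds it as a Python dictionary and prints it as JSON. It is an illustration under assumptions, not the author's exported pipeline: the pipeline names `pl_fruits_parent` and `pl_fruits_child`, the child parameter `fruit`, the activity names, and the 5-second wait are all hypothetical placeholders.

```python
import json

# Assumed placeholders: pipeline names, the "fruit" parameter, activity names,
# and the 5-second wait are not taken from the original pipeline.
set_variable = {
    "name": "Set variable",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "__varResultFruit",
        "value": {"value": "@item()", "type": "Expression"},
    },
}

wait = {
    "name": "Wait",
    "type": "Wait",
    "dependsOn": [{"activity": "Set variable", "dependencyConditions": ["Succeeded"]}],
    "typeProperties": {"waitTimeInSeconds": 5},
}

execute_pipeline = {
    "name": "Execute Pipeline",
    "type": "ExecutePipeline",
    "dependsOn": [{"activity": "Wait", "dependencyConditions": ["Succeeded"]}],
    "typeProperties": {
        "pipeline": {"referenceName": "pl_fruits_child", "type": "PipelineReference"},
        "waitOnCompletion": True,
        # The child receives whatever the shared variable holds at call time.
        "parameters": {
            "fruit": {"value": "@variables('__varResultFruit')", "type": "Expression"}
        },
    },
}

pipeline = {
    "name": "pl_fruits_parent",
    "properties": {
        "variables": {
            "__varListOfFruits": {
                "type": "Array",
                "defaultValue": ["Apples", "Bananas", "Oranges"],
            },
            "__varResultFruit": {"type": "String"},
        },
        "activities": [
            {
                "name": "ForEach fruit",
                "type": "ForEach",
                "typeProperties": {
                    "isSequential": False,  # the Sequential switch left off
                    "items": {
                        "value": "@variables('__varListOfFruits')",
                        "type": "Expression",
                    },
                    "activities": [set_variable, wait, execute_pipeline],
                },
            }
        ],
    },
}

print(json.dumps(pipeline, indent=2))
```

Both variables are declared once, at the pipeline level, next to the activities. Nothing in the ForEach definition gives an iteration its own copy of `__varResultFruit`.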

After running the pipeline in Debug mode, we get the following execution:

Notice that the individual iterations were executed one after another. The notification from ADF hints at what happened:

This means that Debug mode itself runs the iterations sequentially, so everything may appear to work as expected. It is more interesting to see what happens when we run the pipeline normally, via a trigger:

This time the steps are no longer executed in order, and the individual elements run in parallel. When we check which values of the variable were passed to the invoked pipeline, we are in for a small surprise: the value “Apples” was passed three times, which is not what we expected: