AutoSysOps

Parallel processing in Azure logic apps

Azure Logic Apps is a powerful low-code solution for automating workflows. It ships with a large number of connectors by default, and it gives people who don’t want to write code an alternative. Its drag-and-drop designer provides most of the control options you need, so you can build powerful systems without writing a line of code.

But these tools also come with challenges. Because a lot happens in the background, they might behave differently than you’d expect. This is especially the case with parallel processing. Automating tasks often involves retrieving and processing batches of data. Logic Apps are well suited to this and have a lot of built-in mechanisms to help. But if you are not careful this can also cause problems, because by default Logic Apps run what they can in parallel, and that can lead to unforeseen consequences.

A basic Logic App

Let’s look at a fairly simple logic app.

This logic app has some data. Normally it would be received by the trigger, but to keep things simple the data is defined as JSON in the first Initialize variable block. It is a list of messages, each with the name of the user who posted it.

This could be an export from some kind of ticketing system. For the purpose of this blog post, say we want to export this data to a SharePoint list, but instead of the user’s name we want to look the username up in our user system to get a user ID, which makes downstream processing easier. In this post our user system is a simple SharePoint list that looks like this:

As you can see, this list has two columns: the first is the name and the second is the ID.
At the start of the logic app the data is parsed from JSON and a variable is defined to store the ID. All users are also retrieved up front so they can be compared easily.

Defining a variable won’t always be needed. But often you want to do some transformation or search for something, so you need a variable.
In this logic app we loop through all messages, and for each one we look for the right userID by looping through all users retrieved from the list and checking which name matches the name in the message. For that user we store the userID in the variable. This loop looks like this:

With the userID found, the last step is saving the new data in a new SharePoint list. The userID that was found and the original message are saved in this list. So we’d expect a logic app like this to give us all the messages with the corresponding user IDs next to them. Let’s run it and see what the result in SharePoint looks like.

As you can see, there is one message with an ID of 222 while all others have 111. This is not what we expect, as there were many different users in the original data, all posting different messages. Some posted two messages, but none posted more. So something went wrong here. What went wrong is that we didn’t take into account that Logic Apps run the iterations of a loop in parallel by default.

Because we were setting a variable inside the loop, it was overwritten by all the different iterations running at the same time. When an iteration saved its message, it took whatever value was in the variable at that moment, and that value could have been set by a completely different iteration of the loop.

This issue is something I’ve seen happen quite often in Logic Apps. This logic app is very quick, so the result is obvious, but with more complex logic in your loop it’s possible that most iterations work well and only occasionally one picks up the wrong value. In that case the issue might only be discovered after a long time, which can cause all kinds of problems if the data is business critical.
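To make the overwritten variable concrete, here is a rough Python analogy rather than an actual logic app. The user names, the ID 333 and the message texts are made up for illustration, and the barrier is only there to force the unlucky timing every time; in a real run the timing is random.

```python
import threading

# Hypothetical stand-ins for the user list and the incoming messages.
USERS = [
    {"name": "alice", "id": 111},
    {"name": "bob", "id": 222},
    {"name": "carol", "id": 333},
]
MESSAGES = [
    {"user": "alice", "text": "Printer is broken"},
    {"user": "bob", "text": "Need a new laptop"},
    {"user": "carol", "text": "Cannot log in"},
]


def run_with_shared_variable(messages, users):
    """Run every loop iteration in its own thread, all sharing one variable."""
    shared = {"user_id": None}   # plays the role of the logic app variable
    saved = []                   # plays the role of the output list
    save_lock = threading.Lock()
    all_looked_up = threading.Barrier(len(messages))

    def iteration(message):
        # Inner loop: find the matching user and store the ID in the shared variable.
        for user in users:
            if user["name"] == message["user"]:
                shared["user_id"] = user["id"]
        # Force the bad interleaving: every iteration finishes its lookup
        # before any iteration saves its row.
        all_looked_up.wait()
        with save_lock:
            saved.append((message["text"], shared["user_id"]))

    threads = [threading.Thread(target=iteration, args=(m,)) for m in messages]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return saved
```

Every saved row ends up with the same ID, namely the one written by whichever iteration happened to write last, which is the same kind of result as the list full of 111s above.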

Using concurrency control to combat the issue

One way to solve the problem above is to enable concurrency control on the loop. With it enabled, you can select how many iterations of the loop can run at the same time. By setting this to one, you make sure the main loop, which loops through every message, runs one iteration at a time. This way the problem is solved.
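As a rough Python analogy (not a logic app definition), setting the degree of parallelism to one is like running the same loop body on a thread pool with a single worker. The data below is made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the user list and the incoming messages.
USERS = [
    {"name": "alice", "id": 111},
    {"name": "bob", "id": 222},
    {"name": "carol", "id": 333},
]
MESSAGES = [
    {"user": "alice", "text": "Printer is broken"},
    {"user": "bob", "text": "Need a new laptop"},
    {"user": "carol", "text": "Cannot log in"},
]


def run_serial(messages, users):
    """Same shared variable as before, but only one iteration runs at a time."""
    shared = {"user_id": None}   # plays the role of the logic app variable
    saved = []                   # plays the role of the output list

    def iteration(message):
        for user in users:
            if user["name"] == message["user"]:
                shared["user_id"] = user["id"]
        # Nothing else is running, so the variable still holds this message's ID.
        saved.append((message["text"], shared["user_id"]))

    # max_workers=1 is the analogue of a degree of parallelism of one.
    with ThreadPoolExecutor(max_workers=1) as pool:
        list(pool.map(iteration, messages))
    return saved
```

With a single worker the iterations run one after another in the order they were submitted, so the shared variable is safe and the rows come out in source order.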

When the logic app is run now, the result looks like this:

But when we compare the runtime of this logic app to the previous one, we notice something. With concurrency control disabled the runtime was around 5 seconds, but with it enabled and the degree of parallelism set to one the runtime is over 25 seconds. That’s more than a fivefold increase. For short jobs that only take a few seconds this isn’t much of a problem, but for a job that has to process a big batch of data overnight, a fivefold increase could mean the job isn’t done in time.
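The slowdown is easy to reproduce in a small Python sketch. The sleep below is a made-up stand-in for the work done per message, so the numbers are illustrative and not the measurements from the logic app.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_run(item_count, seconds_per_item, parallelism):
    """Return how long it takes to process all items with the given parallelism."""

    def iteration(_):
        time.sleep(seconds_per_item)   # hypothetical per-message work

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        list(pool.map(iteration, range(item_count)))
    return time.perf_counter() - start
```

Calling `timed_run(8, 0.05, 1)` takes roughly the sum of all the sleeps, while `timed_run(8, 0.05, 8)` takes roughly one sleep, because the waits overlap.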

Splitting up the logic app to create parallel processing

An alternative to concurrency control is to use child logic apps. Programmers will know about variable scopes: in many programming languages you can define a variable with the same name in two different functions, and setting one doesn’t overwrite the other. This doesn’t exist inside a single logic app, but we can achieve it by creating multiple logic apps and calling them with certain parameters when needed. This way you can send values to a child logic app, which then processes your request in parallel. So the main logic app we had before will now look like this:

You’ll see the first part is still the same, but now another logic app is called inside the main loop. That logic app has some parameters, which are sent as data from this logic app. The child logic app looks like this:

Here the users are retrieved and then a loop checks which user matches the name in the message. After that the item is created in the list. You see a response block at the end. This lets the main logic app know the job is done, and you can also use it to send data back. The child logic app receives its data through the first block, the request trigger. In here the parameters are defined like this:

These parameters are defined as a JSON schema. If you create a sample JSON object of what the child logic app can expect to receive from the main logic app, you can have the schema generated for you by clicking the “Use sample payload to generate schema” option. The incoming data is parsed automatically, so you don’t need to add another Parse JSON step for it.
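In programming terms, the child logic app plays the role of a function with its own local variable. A minimal Python sketch of that idea follows; the function names and data are hypothetical, and the child simply returns its row as the response instead of writing to a SharePoint list.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the user list and the incoming messages.
USERS = [
    {"name": "alice", "id": 111},
    {"name": "bob", "id": 222},
    {"name": "carol", "id": 333},
]
MESSAGES = [
    {"user": "alice", "text": "Printer is broken"},
    {"user": "bob", "text": "Need a new laptop"},
    {"user": "carol", "text": "Cannot log in"},
]


def get_users():
    """Stand-in for retrieving the users from the user list."""
    return USERS


def child_logic_app(message):
    """Handle one message; user_id is local, so each call has its own copy."""
    user_id = None
    for user in get_users():
        if user["name"] == message["user"]:
            user_id = user["id"]
    # The returned row plays the role of the response sent to the main logic app.
    return {"text": message["text"], "user_id": user_id}


def main_logic_app(messages):
    """Call the child once per message, with all calls running in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(messages))) as pool:
        return list(pool.map(child_logic_app, messages))
```

Because `user_id` lives inside `child_logic_app`, the parallel calls cannot overwrite each other’s value, which is what splitting the work into a child logic app achieves.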

Now when the main logic app is triggered, it triggers multiple instances of the child logic app, which can all process their data in parallel. Once all of them have responded, the main logic app is done too. Let’s see what the output looks like in the list.

The processing time of the whole logic app was around 4 seconds, so that is comparable to the first version. And as you can see, all the messages have the right userID next to them, but something is different from the version with concurrency control: the order of the messages is not the same as in the source data. This is because every message was processed in parallel, and some just took a little longer than others. If you are processing batches of independent data or timestamped data, this often isn’t an issue. But if you are processing data where the order also matters, keep this in mind when using this technique; you might need to use concurrency control instead.
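The ordering effect, and why data that carries its own ordering key is less affected, can be sketched in Python. The `source_index` field is a hypothetical addition that plays the same role a timestamp would, and the random sleep stands in for some calls taking a little longer than others.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_in_parallel(texts):
    """Return processed rows in the order they finished, not the source order."""

    def child(index, text):
        time.sleep(random.uniform(0, 0.02))   # some calls take a little longer
        return {"source_index": index, "text": text}

    with ThreadPoolExecutor(max_workers=max(1, len(texts))) as pool:
        futures = [pool.submit(child, i, text) for i, text in enumerate(texts)]
        # as_completed yields each result as soon as it is ready.
        return [future.result() for future in as_completed(futures)]


def restore_source_order(rows):
    """Sort rows back into source order using the key each row carries."""
    return sorted(rows, key=lambda row: row["source_index"])
```

The rows from `process_in_parallel` can come back in any order, but because each row carries its own key, `restore_source_order` can always put them back the way they were in the source.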

Something else to keep in mind when using child logic apps is the cost. Logic Apps have different pricing models. With the consumption model you pay for every operation that is executed. Splitting up a logic app can add operations that are needed to pass the data around correctly. If these logic apps run millions of times, this could increase your costs, so do keep it in mind.

Conclusion

When you use loops in a logic app, keep in mind that they are processed in parallel. If you set data that you also use later in the loop, take into account that it may have been overwritten by another iteration that accesses the same data. If this happens you have two options:

Make the loop serial by enabling concurrency control: you can use this when you don’t care about the processing time or when it’s important that the order of the data is preserved.

Split the logic app up into a main logic app with child logic apps: this will allow you to have a quick processing time. The downside is that, depending on the situation and pricing model, it could increase your costs.

So if you have logic apps running in your environments, check whether they suffer from this parallel processing issue, or whether you can speed them up by splitting them up!