Automatically tag Azure VMs behind a load balancer

January 6, 2022 Leo Visser

When managing virtual machines in Azure, it’s often useful to tag them based on certain criteria. These tags can then be used to filter the virtual machines for certain services. For example, the Update Management solution provided by Azure Automation lets you filter virtual machines based on tags. I’ve been in situations where we wanted different approaches for stand-alone virtual machines and for those behind a load balancer or application gateway. For example, before an update was applied, we wanted the virtual machine to be removed gracefully from the load balancer first, so that no messages were routed to it.

To achieve this, you could manually make sure that virtual machines get the right tag when you deploy them, but this leaves room for error. To make sure virtual machines always have the right tags, we can use Azure Policy. In this blog post I will describe how to create an Azure Policy that automatically tags virtual machines behind a load balancer or application gateway. All files for this blog are also shared on my GitHub.

The preparations - Custom Role

To make the policy work, there are a few things we need to set up beforehand. This policy uses a deployment script to make changes to the resources. When policies need to create or alter resources, they use a managed identity to perform these actions. Because we follow the principle of least privilege, we need to check which permissions are required to create a deployment script. To make sure the policy can only create a deployment script, we create a custom role with the following JSON file:

See this content in the original post

Create a .json file with the above as content and save it on your device.
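The role file itself is only in the original post. As an illustration only, the Python sketch below prints a hypothetical least-privilege role in what is assumed to be the layout of the portal's JSON tab (`properties` with `roleName`, `assignableScopes` and `permissions`). The role name and the action list are assumptions modelled on what an ARM deployment with a deployment script generally needs (deployments, deploymentScripts, a container instance, a storage account and assigning the user-assigned identity), not the author's file; check the list against the Azure deployment script documentation before using it.

```python
import json


def build_custom_role(role_name: str) -> dict:
    """Build a hypothetical custom role in the (assumed) layout of the portal's JSON tab."""
    return {
        "properties": {
            "roleName": role_name,
            "description": "Lets Azure Policy run an ARM deployment with a deployment script.",
            # Left empty on purpose: the portal asks for the assignable scope later.
            "assignableScopes": [],
            "permissions": [
                {
                    # Assumed action list; verify it against the Azure documentation.
                    "actions": [
                        "Microsoft.Resources/deployments/*",
                        "Microsoft.Resources/deploymentScripts/*",
                        "Microsoft.ContainerInstance/containerGroups/*",
                        "Microsoft.Storage/storageAccounts/*",
                        "Microsoft.ManagedIdentity/userAssignedIdentities/assign/action",
                    ],
                    "notActions": [],
                    "dataActions": [],
                    "notDataActions": [],
                }
            ],
        }
    }


if __name__ == "__main__":
    # Redirect the output to a file (for example custom-role.json) to upload it in the portal.
    print(json.dumps(build_custom_role("Policy deployment script runner"), indent=4))
```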

For this example we will add this role at subscription scope. In the Azure portal, go to Subscriptions, choose the subscription you want to use and select “Access Control (IAM)” in the left menu. Go to the “Roles” tab, select “Add” and pick the option “Add custom role”.

Select the option “Start from JSON” and pick the .json file we created. Most fields are now filled in, but there is still a red dot next to the “Assignable scopes” tab. Go to this tab.

Click the button “Add assignable scopes” and select the subscription you want to test this on. Now review and create the role, and wait for it to be created.
Next, look up the role you created under Access Control (IAM) of the subscription and use the JSON view to get its ID. It should look something like: "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/providers/Microsoft.Authorization/roleDefinitions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
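Since this ID is later pasted into the policy, it can help to check its shape first. The Python sketch below is a hypothetical helper (the function name is made up for this example) that accepts only the subscription-scoped format shown above and returns the subscription ID and the role GUID; anything else, such as a leftover placeholder, raises a `ValueError`.

```python
import re

# /subscriptions/<guid>/providers/Microsoft.Authorization/roleDefinitions/<guid>
GUID = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
ROLE_ID_PATTERN = re.compile(
    rf"^/subscriptions/(?P<subscription>{GUID})"
    rf"/providers/Microsoft\.Authorization/roleDefinitions/(?P<role>{GUID})$",
    re.IGNORECASE,  # ARM resource IDs are case-insensitive
)


def parse_role_definition_id(role_id: str) -> tuple[str, str]:
    """Return (subscription_id, role_guid), or raise ValueError for a malformed ID."""
    match = ROLE_ID_PATTERN.match(role_id.strip())
    if match is None:
        raise ValueError(f"not a subscription-scoped role definition ID: {role_id!r}")
    return match.group("subscription"), match.group("role")
```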

The preparations - User Assigned Managed Identity

The next thing we need is a managed identity. It provides permissions to the script that will be run to deploy the changes. To create one, go to “Managed Identities” in the portal and click “Create”.

A managed identity must be created in a resource group, so create one if you don’t have one yet. The identity also needs a unique name. After you’ve entered that information, you can create the managed identity. When it’s done, go to the resource you created. In the left menu, go to “Azure role assignments”. Here you can assign roles to the identity you created.

You can see a managed identity as a service account for which you don’t have a password; Azure takes care of the authentication. In this example we assign the Tag Contributor and Reader roles to the managed identity, because we are going to run a script that assigns tags to virtual machines. Due to how PowerShell works, the script also needs the Reader role, because otherwise it can’t find the resource to assign the tag to.
Now go back to the overview and use the JSON view to get the ID of the resource. You could also use Azure Resource Explorer to get this. It should look something like "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourcegroups/your-resource-group/providers/Microsoft.ManagedIdentity/userAssignedIdentities/your-identity"
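In an ARM template, a user-assigned identity is referenced by this resource ID inside an `identity` block. The Python sketch below is a hypothetical helper that checks the ID has the shape shown above and builds that block. It matches case-insensitively because ARM resource IDs are case-insensitive (the example above uses `resourcegroups`, other tools often show `resourceGroups`).

```python
import re

IDENTITY_ID_PATTERN = re.compile(
    r"^/subscriptions/[^/]+/resourcegroups/[^/]+"
    r"/providers/Microsoft\.ManagedIdentity/userAssignedIdentities/[^/]+$",
    re.IGNORECASE,  # resource IDs are case-insensitive
)


def user_assigned_identity_block(identity_id: str) -> dict:
    """Return the ARM `identity` block for a user-assigned identity (hypothetical helper)."""
    identity_id = identity_id.strip()
    if IDENTITY_ID_PATTERN.match(identity_id) is None:
        raise ValueError(f"not a user-assigned identity resource ID: {identity_id!r}")
    # The identity ID is used as a key; its value is an empty object.
    return {"type": "UserAssigned", "userAssignedIdentities": {identity_id: {}}}
```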

Creating the Policy

Now that the prerequisites are in place, it’s time to create the actual policy that will tag the VMs. A policy definition is a JSON document that consists of three parts:

mode - This defines if the policy should target all resources or only a subset.
policyRule - This contains the actual policy with an “if” and “then” structure.
parameters - This contains parameters which can be specified for the policy.
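To make the three parts concrete, here is a Python sketch of the outer shape of a definition as it is pasted into the portal. The condition and the `audit` effect are placeholders chosen for this example only; the real policy uses the parts discussed below.

```python
import json

# Outer shape of a policy definition; the rule content is a placeholder.
policy_definition = {
    "mode": "Indexed",
    "parameters": {},
    "policyRule": {
        "if": {"field": "type", "equals": "Microsoft.Compute/virtualMachines"},
        "then": {"effect": "audit"},  # placeholder; the real policy uses deployIfNotExists
    },
}

print(json.dumps(policy_definition, indent=2))
```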

When you look at a policy definition in Azure, you will see more parts; these are either created when you make the policy definition or are optional.

Let’s look at the policy part by part starting with the mode.

See this content in the original post

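The mode used here is Indexed. In this mode the policy only evaluates resource types that support tags and location, instead of all resource types plus resource groups and subscriptions (mode All), which limits it to the kind of resources we actually want to edit.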

Next we look at the parameters.

See this content in the original post

Three parameters are defined here. The first is the ID of the managed identity we created before. The second is the name of the tag we are going to use, and the third is the value we want to set the tag to when the policy finds a VM that is load balanced. The metadata of every parameter contains a displayName and description. It’s possible to add more options here, such as using strongType to create drop-down boxes for specific resources.
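The parameter definitions themselves are only in the original post. As a sketch of the general shape, the Python snippet below builds three String parameters with a displayName and description. The parameter names `managedIdentityId`, `tagName` and `tagValue` are hypothetical and may differ from the author's. It also shows the `[parameters('tagName')]` style of expression used to reference a parameter elsewhere in the definition.

```python
def string_parameter(display_name: str, description: str) -> dict:
    """A String policy parameter with displayName and description metadata."""
    return {
        "type": "String",
        "metadata": {"displayName": display_name, "description": description},
    }


def parameter_reference(name: str) -> str:
    """Expression that reads a parameter value inside the policy rule."""
    return f"[parameters('{name}')]"


# Hypothetical parameter names; the author's definition may use others.
parameters = {
    "managedIdentityId": string_parameter(
        "Managed identity ID", "Resource ID of the user-assigned identity that runs the script."
    ),
    "tagName": string_parameter("Tag name", "Name of the tag to set on load-balanced VMs."),
    "tagValue": string_parameter("Tag value", "Value of the tag to set on load-balanced VMs."),
}
```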

Now let’s look at the policyRule, which consists of two parts. First there is the “if” part, which contains a filter expression. If a resource matches the criteria in the “if” part, the policy goes on to the “then” part. The “if” part looks like this:

See this content in the original post

The Azure Policy syntax supports logical operators. If all statements must be true for the condition to be true, you use “allOf” (the equivalent of AND); if only one of the statements needs to be true, you use “anyOf” (the equivalent of OR). Written another way, the statement above would be:

type == Microsoft.Compute/virtualMachines AND (Microsoft.Compute/virtualMachines/storageProfile.osDisk.osType == Windows OR Microsoft.Compute/virtualMachines/storageProfile.imageReference.offer == Windows)

You’ll notice that the first statement uses a single word as the field name while the others are fully qualified. This is because “type” is one of the built-in fields of Azure Policy, while the other two are property aliases, which map to a path in the resource’s properties.
What is effectively filtered here are all virtual machines running Windows. If you want the policy to also work on other operating systems, you can remove the second part of the condition; it’s included to demonstrate that with policies you sometimes need to watch out for which field to check. The policy uses an “anyOf” statement to check whether either of the two fields is Windows. This is because while a virtual machine is being created, the osDisk field is not present, so a condition on that field alone won’t trigger the policy. But once the VM has been created, the imageReference field doesn’t exist anymore, so with a condition on that field alone a remediation task won’t find the right resources.
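To see why the “anyOf” matters, the Python sketch below writes the condition as policy JSON, using the field names from the statement above, and evaluates it with a tiny hypothetical evaluator that supports only `allOf`, `anyOf` and `equals`. Resources are modelled as flat dictionaries keyed by field name, which is a simplification of how Azure Policy resolves fields and aliases. A VM at creation time that only carries the imageReference field and an existing VM that only carries the osDisk field both match.

```python
from typing import Any

OS_TYPE = "Microsoft.Compute/virtualMachines/storageProfile.osDisk.osType"
IMAGE_OFFER = "Microsoft.Compute/virtualMachines/storageProfile.imageReference.offer"
VM_TYPE = "Microsoft.Compute/virtualMachines"

# The "if" condition from the statement above, written as policy JSON.
IF_CONDITION = {
    "allOf": [
        {"field": "type", "equals": VM_TYPE},
        {
            "anyOf": [
                {"field": OS_TYPE, "equals": "Windows"},
                {"field": IMAGE_OFFER, "equals": "Windows"},
            ]
        },
    ]
}


def evaluate(condition: dict, resource: dict[str, Any]) -> bool:
    """Evaluate a small subset of the policy condition language (hypothetical)."""
    if "allOf" in condition:
        return all(evaluate(c, resource) for c in condition["allOf"])
    if "anyOf" in condition:
        return any(evaluate(c, resource) for c in condition["anyOf"])
    if "equals" in condition:
        value = resource.get(condition["field"])  # a missing field never matches
        # String comparisons with equals are case-insensitive in Azure Policy.
        return isinstance(value, str) and value.casefold() == condition["equals"].casefold()
    raise ValueError(f"unsupported condition: {condition}")


creating_vm = {"type": VM_TYPE, IMAGE_OFFER: "Windows"}  # only imageReference present
existing_vm = {"type": VM_TYPE, OS_TYPE: "Windows"}  # only osDisk present
linux_vm = {"type": VM_TYPE, OS_TYPE: "Linux"}

for name, vm in [("creating", creating_vm), ("existing", existing_vm), ("linux", linux_vm)]:
    print(name, evaluate(IF_CONDITION, vm))
```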

Next up is the “then” part. Here you first find the effect. Normally for a tag policy you would use the modify effect, but that effect doesn’t allow extra filtering, which we need to pick virtual machines based on whether they are behind a load balancer. That is why the deployIfNotExists effect is chosen. In the details of this effect we find five parts:

type - This specifies the type of related resource to query. If there is a parent-child relationship between this resource type and the resource from the “if” part, only the related child resources are queried. We use the networkInterfaces resource type, which has no parent-child relationship with virtual machines, so by default all network interfaces in the virtual machine’s resource group are queried. But network interfaces do contain the link to the load balancer.
roleDefinitionIds - This is the role the managed identity needs to have. The policy then performs the deployment using this identity. Here we enter the ID of the custom role we created.
evaluationDelay - This is an optional field, included in this policy to make sure the evaluation happens as soon as possible. By default, the existence check for a newly created or updated resource runs 10 minutes after the request. If you want to speed this up, you can use this field to pick a different moment, such as AfterProvisioning or an ISO 8601 duration.
existenceCondition - This is a second filter, applied to the related resources queried via the “type” field rather than to the resource from the “if” part. The logic of the existenceCondition can be a bit tricky: for every resource that matches the “if” part, the condition is tested against every related resource of the queried type. If ANY of these related resources returns true, the deployment is not triggered.
deployment - This is a nested ARM template which is used to deploy resources.
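Put together, the “then” part has roughly the shape sketched below in Python. The existence condition, template and template parameters are passed in as arguments because they are covered next. The function name and the `AfterProvisioning` delay are assumptions for this example, and the deployment object follows the usual `properties`/`mode`/`template`/`parameters` layout of a deployIfNotExists deployment.

```python
def build_then_block(role_definition_id: str, existence_condition: dict,
                     template: dict, template_parameters: dict) -> dict:
    """Shape of the deployIfNotExists "then" part (hypothetical helper)."""
    return {
        "effect": "deployIfNotExists",
        "details": {
            # Related resources to query; no parent-child relation with the VM.
            "type": "Microsoft.Network/networkInterfaces",
            # The custom role created earlier.
            "roleDefinitionIds": [role_definition_id],
            # Assumed value; an ISO 8601 duration such as "PT5M" is also allowed.
            "evaluationDelay": "AfterProvisioning",
            "existenceCondition": existence_condition,
            "deployment": {
                "properties": {
                    "mode": "incremental",
                    "template": template,
                    "parameters": template_parameters,
                }
            },
        },
    }
```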

Now let’s look a bit closer at the last two parts starting with the existenceCondition.

See this content in the original post

Written as more readable pseudocode, the condition above looks like this:

(
    (
        count(loadBalancerBackendAddressPools).where(virtualMachine.id == parent.id) >= 1 
            OR
        count(applicationGatewayBackendAddressPools).where(virtualMachine.id == parent.id) >= 1
    )
    AND
    parent.tags -contains parameters.tagname
    AND
    virtualMachine.id == parent.id
)
OR
(
    count(loadBalancerBackendAddressPools).where(virtualMachine.id == parent.id) == 0
    AND
    count(applicationGatewayBackendAddressPools).where(virtualMachine.id == parent.id) == 0
    AND
    virtualMachine.id == parent.id
)

As you can see, this statement consists of two different checks. Both check whether virtualMachine.id is equal to parent.id. Remember that we are now evaluating network interfaces, so any network interface that isn’t attached to the parent virtual machine always returns false. This could be an issue if no network interface was attached, because then the whole check would return false and therefore trigger a deployment. But a virtual machine always needs at least one network interface attached, so we don’t have to worry about that. If that weren’t the case, we could add a check to the “if” part to see whether any network interface is attached and, if not, filter out that virtual machine.
Both parts of the expression also count the load balancer and application gateway backend address pools. The where clause of each count checks whether the network interface is actually attached to the parent virtual machine, so the count operator returns how many load balancer or application gateway backend pools this network interface is connected to.
Finally, the top part has an extra check to see whether the tag is already set.
So if there is no load balancer or application gateway and the network interface belongs to the virtual machine, the condition returns true. This is correct: these machines should not be tagged, so they are compliant. (As a bonus you could add a check here for whether the tag was set anyway and have that remediated, but this would increase the complexity of the deployment step.) If there is a load balancer or application gateway and the tag is set, the condition also returns true, which is what we want. But if the tag is not set, it returns false, and the policy goes on to the deployment step.
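This behaviour can be checked with a small Python model of the pseudocode. It is a hypothetical sketch, not how Azure Policy is implemented: each network interface is a dataclass holding the ID of the VM it is attached to and its backend pools, `existence_condition` mirrors the pseudocode, and `needs_deployment` applies the deployIfNotExists rule that a deployment only happens when no queried network interface satisfies the condition. The tag name `LoadBalanced` is made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NetworkInterface:
    virtual_machine_id: Optional[str]  # None when the NIC is not attached to a VM
    lb_backend_pools: list[str] = field(default_factory=list)
    appgw_backend_pools: list[str] = field(default_factory=list)


def existence_condition(nic: NetworkInterface, vm_id: str,
                        vm_tags: dict[str, str], tag_name: str) -> bool:
    """Mirror of the pseudocode above for one network interface."""
    attached = nic.virtual_machine_id is not None and nic.virtual_machine_id.lower() == vm_id.lower()
    # The where clause only counts pools of a NIC attached to the evaluated VM.
    lb_count = len(nic.lb_backend_pools) if attached else 0
    appgw_count = len(nic.appgw_backend_pools) if attached else 0
    balanced_and_tagged = (lb_count >= 1 or appgw_count >= 1) and tag_name in vm_tags and attached
    not_balanced = lb_count == 0 and appgw_count == 0 and attached
    return balanced_and_tagged or not_balanced


def needs_deployment(nics: list[NetworkInterface], vm_id: str,
                     vm_tags: dict[str, str], tag_name: str) -> bool:
    """deployIfNotExists: deploy only if no queried NIC satisfies the condition."""
    return not any(existence_condition(nic, vm_id, vm_tags, tag_name) for nic in nics)


VM1 = "/subscriptions/s/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm1"
VM2 = "/subscriptions/s/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm2"
# NICs in the same resource group: vm2 is behind a load balancer, vm1 is not.
nics = [NetworkInterface(VM1), NetworkInterface(VM2, lb_backend_pools=["pool1"])]

print(needs_deployment(nics, VM1, {}, "LoadBalanced"))  # stand-alone VM
print(needs_deployment(nics, VM2, {}, "LoadBalanced"))  # untagged load-balanced VM
print(needs_deployment(nics, VM2, {"LoadBalanced": "true"}, "LoadBalanced"))  # already tagged
```

With these inputs the stand-alone VM and the already tagged VM need no deployment, while the untagged load-balanced VM does.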

The deployment part of the policy looks like this:

See this content in the original post

This is an ARM template with a deployment script. It creates a temporary Azure Container Instance on which the PowerShell script runs, and afterwards removes the container instance again. The deployment script resource itself, which contains the output of the script, is also cleaned up once its retention period is over.
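The template itself is only in the original post. For orientation, the Python sketch below builds a single `Microsoft.Resources/deploymentScripts` resource as it could appear in such a template. The property names (`azPowerShellVersion`, `primaryScriptUri`, `arguments`, `timeout`, `cleanupPreference`, `retentionInterval`) belong to the deployment script resource; the resource name, the Az version and the interval values are assumptions for this example. `cleanupPreference: Always` deletes the supporting container instance and storage account once the script reaches a terminal state, and `retentionInterval` controls how long the deployment script resource itself is kept.

```python
def deployment_script_resource(identity_id: str, script_uri: str, arguments: str) -> dict:
    """One deploymentScripts resource for the nested template (hypothetical values)."""
    return {
        "type": "Microsoft.Resources/deploymentScripts",
        "apiVersion": "2020-10-01",
        "name": "tag-load-balanced-vm",  # hypothetical name
        "location": "[resourceGroup().location]",
        "kind": "AzurePowerShell",
        # The user-assigned identity the script authenticates with.
        "identity": {"type": "UserAssigned", "userAssignedIdentities": {identity_id: {}}},
        "properties": {
            "azPowerShellVersion": "8.3",  # assumed Az version; pick a supported one
            "primaryScriptUri": script_uri,  # or use "scriptContent" to embed the script
            "arguments": arguments,
            "timeout": "PT30M",
            # Delete the container instance and storage account once the script ends.
            "cleanupPreference": "Always",
            # Keep the deploymentScripts resource (and its output) for one hour.
            "retentionInterval": "PT1H",
        },
    }
```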
The ARM template has some parameters which are passed in by the Azure Policy. The script it triggers looks like this:

See this content in the original post

This fairly simple PowerShell script uses the Update-AzTag cmdlet to set the tag(s) specified in the parameters on the virtual machine specified in the parameters. The deployment script runs Azure PowerShell, which means all the Az modules are available; in the ARM template you specify which version of Az to use. The deployment script also handles authentication by using the managed identity we created before.
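Update-AzTag has an `-Operation` parameter that accepts `Merge`, `Replace` or `Delete`, and the choice decides what happens to tags that are already on the VM. The post does not show which value the script uses. The Python sketch below only emulates `Merge` and `Replace` on plain dictionaries (it does not call Azure) to show why `Merge` is the safer choice when a VM already carries other tags; the example tag names are made up.

```python
def apply_tag_operation(existing: dict[str, str], new: dict[str, str],
                        operation: str) -> dict[str, str]:
    """Emulate Update-AzTag's Merge and Replace operations on a tag dictionary.

    Unlike Azure, this emulation treats tag names as case-sensitive.
    """
    if operation == "Merge":
        merged = dict(existing)
        merged.update(new)  # new values win for tags that already exist
        return merged
    if operation == "Replace":
        return dict(new)  # all existing tags are dropped
    raise ValueError(f"operation not emulated here: {operation}")


existing_tags = {"CostCenter": "1234", "Owner": "ops"}
print(apply_tag_operation(existing_tags, {"LoadBalanced": "true"}, "Merge"))
print(apply_tag_operation(existing_tags, {"LoadBalanced": "true"}, "Replace"))
```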
Because an ARM template can’t pass PowerShell objects in the arguments very well, the script first uses ConvertFrom-Json to convert the input into a hashtable that can then be used. This allows us to build a JSON string in the ARM template containing the right values for the tags. In this case we only need one tag, so it’s a simple JSON string. To make sure everything is parsed correctly, we use escape characters.
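The escaping can be illustrated in Python. The sketch below serialises the tag as JSON, escapes backslashes and double quotes, and wraps the value in double quotes inside an arguments string; `shlex.split` then stands in for the way the arguments string is split into separate values before the script receives them. The parameter names `-vmId` and `-tags` are hypothetical, and the deployment script service uses its own splitting rules, which handle this simple case the same way but may differ in edge cases.

```python
import json
import shlex


def build_arguments(vm_id: str, tag_name: str, tag_value: str) -> str:
    """Build an arguments string with an escaped JSON tag value (hypothetical names)."""
    tags_json = json.dumps({tag_name: tag_value}, separators=(",", ":"))
    # Escape backslashes first, then double quotes, so the JSON survives the quoting.
    escaped = tags_json.replace("\\", "\\\\").replace('"', '\\"')
    return f'-vmId "{vm_id}" -tags "{escaped}"'


def parse_arguments(arguments: str) -> dict:
    """Split the arguments like a shell would and decode the JSON tag value."""
    tokens = shlex.split(arguments)
    named = dict(zip(tokens[::2], tokens[1::2]))  # pairs like "-tags" -> value
    # json.loads plays the role of ConvertFrom-Json in the PowerShell script.
    return {"vmId": named["-vmId"], "tags": json.loads(named["-tags"])}


vm = "/subscriptions/s/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm1"
args = build_arguments(vm, "LoadBalanced", "true")
print(args)
print(parse_arguments(args))
```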
In this example we use a URI to the script; this URI can also point to a storage account in Azure if you don’t want to use a public repository. You can also embed the script in the ARM template itself by using the “scriptContent” property.

Assigning the policy

Now that we’ve reviewed the policy definition, let’s actually create it in Azure and assign it to a scope.
In the Azure portal, go to Policy and select “Definitions” in the left menu. Here, create a new policy definition.

In “Definition location”, define where you want to store this policy. Then enter a clear name and description for your policy. For the category you can select “Use existing” and search for the built-in “Tags” category. Now copy the policy definition from GitHub and paste it into the Policy Rule field. Don’t forget to replace "<YOUR ROLE ID>" with the ID of the role you created; if you forget this, you will get an error when saving.

If the creation succeeded, the next step is to assign the policy definition to a scope. Click the “Assign” button in the policy definition. In the first tab you don’t have to change anything, unless you want to assign the policy to a different scope than the one you stored it in (for example only a certain resource group). Go to the “Parameters” tab.

Enter the ID of the managed identity and the tag name and value. In the “Remediation” tab you can check whether the managed identity that will be created shows the right role in the permissions field. Now create the policy assignment.
Our policy is now active. If you already have resources you want to evaluate, you can create a remediation task for them. To test whether it works, you can create two VMs, one behind a load balancer and one not, and check whether the tag is applied.

After you’ve deployed the VMs, you still need to be a little patient before the tag shows up. It’s not instant, but it should be done within 5 minutes or so. You can check the activity log of the VMs: it should show a “'deployIfNotExists' Policy action.” entry, which shows that the policy was triggered.

The conclusion

This solution lets you rest easier, because you now know every virtual machine in your environment will be created with the right tag. You can combine this policy with other (tagging) policies to make sure everything is set up correctly.
In the next blog we will take the policy created here and deploy it via an ARM template, including the necessary roles and permissions, to turn this solution into an IaC solution. We will also build a CI/CD pipeline around it to deploy the solution, with tests included, so I hope to see you back then.