Make your AI more sustainable with API Management

It’s time for the Festive Tech Calendar 2025 again! This is my submission for this year. Please check out the whole event and consider donating to the charity (Beatson Cancer Charity) via their JustGiving page.

In this blogpost I want to explain some of the uses of Azure API Management when working with AI systems. Recently some very interesting features have been released to make development with AI systems a lot easier!

For my AI backend I will be using Microsoft Foundry. Here you can host AI models, and it offers an easy way to deploy and manage them. The techniques I will describe work with any AI system that has HTTP endpoints.

The advantage of using Microsoft Foundry is that there are easy ways to link it to Azure API Management, whereas with some other tools this requires a bit more work.

To use Microsoft Foundry you will need to deploy a resource in Azure. Once you’ve done that, you can use the Foundry portal to deploy a chat completion model. For example, I deployed the gpt-5-mini model.

For the demos in this blogpost I use the Basic v2 tier of API Management. Most of the features require at least a Basic tier (Developer is possible, but that’s not suitable for production workloads). You can use the v1 tiers, but it’s strongly advised to use the v2 tiers instead.

Once you’ve created an API Management instance, you can go to the APIs section and click “Add API”. In this view you’ll see the option to connect to Azure AI Foundry.

You can now select the Microsoft Foundry instance. It will show the endpoint and how many models are deployed.

In the next tab you can configure the API. You can give it a display name and an internal name. You can also give it a Base Path, which will be added to the URL of your API call. Especially if you are going to host more APIs, you want to pick this base path carefully.
In the client compatibility section you can select how the API should be presented. For my demos I used the Azure AI option.

As you’ll notice, there are more tabs in this screen and I will be using them. But first, let’s explain what we are going to do and make sure all the requirements are met.

There are many ways to make our AI system more sustainable, but in my opinion an easy way is to simply use it less. Depending on your workload, you can have situations where an AI system is asked the same question multiple times. Every time a question is asked, the AI system spends a lot of compute to infer the answer. If there were a way to cache answers that were already given and return those instead, the AI system could be used less. If you set up a system like this, it will look like this picture:

These are the steps that will happen:

  1. The user sends a request to the app.

  2. The app creates an AI request and sends it to API Management.

  3. API Management vectorizes the request by sending it to an embeddings model in AI Foundry.

  4. The vector of the request is sent back to API Management.

  5. The vector is sent to the cache.

  6. The cache looks in its vector database to see if it has a similar cached request. If so, the cached response is sent back; if not, it lets API Management know no cached response is available.

  7. (Only happens if no cache is found) The request is sent to the chat completion model in AI Foundry.

  8. (Only happens if no cache is found) AI Foundry sends the response back to API Management, which stores it in the cache.

  9. The response is sent back to the app.

  10. The app presents the response to the user.

This process is explained in more depth in this Learn article.

To do this you will need to create a managed Redis cache in Azure. Make sure you enable the RediSearch module while setting it up. You also need to enable Access Keys Authentication on your Redis cache.

At the moment, API Management doesn’t support signing in to the Redis cache via Entra ID.

You also need to deploy a text embeddings model in AI Foundry. I used the text-embedding-ada-002 model.

Once you’ve set up these requirements, you can go back to the creation of the API and go to the “Apply semantic caching” tab.

Select the first checkbox, and then under Azure OpenAI instance select the Foundry resource where your text embeddings model is deployed. Under model deployment, select the embeddings model.

The similarity score can be any value between 0.0 and 1.0, where 0.0 is the most strict and 1.0 the least strict. This value decides how much variance is allowed between requests for them to be matched. It’s important to test this value well, as you don’t want any false positives when calling the cache.

The cache duration specifies how long a response is allowed to live in the cache before it is removed. If you are working with static data and your LLM only uses that, you can have a very long cache duration. But if your backend AI system uses RAG or other systems to get more data for your request (like an agentic flow), you probably want to keep this value low, as cached answers would otherwise not reflect the new data.

Once you set this up you can go to review + create and create your API. If you then go to your API you will see something like this:

Things to note are the list of operations: based on the types of models you have deployed, API Management will set up the right API calls for these models. In the inbound processing you see the “set-backend-service” policy, which connects to AI Foundry. You also see the “llm-semantic-cache-lookup” policy, which sets up the lookup in the cache. In the outbound processing you see the “llm-semantic-cache-store” policy, which stores the results in the cache. Notice that these policies are applied to all operations. If needed, you can change this to only apply to specific operations. The policy code looks like this:

<policies>
    <inbound>
        <base />
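        <!-- Vectorize the incoming prompt and look it up in the semantic cache; score-threshold is the similarity score chosen during setup -->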
        <llm-semantic-cache-lookup score-threshold="0.0" embeddings-backend-id="AI-SC-text-embedding-bcd7dmzpjvnnuw3" />
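        <!-- Route requests to the AI Foundry backend (only reached when the cache lookup misses) -->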
        <set-backend-service id="apim-generated-policy" backend-id="aifoundrycache-ai-endpoint" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
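        <!-- Store the model response in the semantic cache for the configured cache duration -->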
        <llm-semantic-cache-store duration="20" />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>

Here you can see the values that were set during the setup phase. The “score-threshold” is the similarity score, and the “duration” is the cache duration. There are also references to backends that contain the connection data for the different models and the cache.
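These policies can be tweaked further. For example, the lookup policy supports vary-by expressions to partition the cache (you’ll see the vary-by partition mentioned in the traces later on), so that different callers don’t share cached answers. A minimal sketch of what that could look like, assuming you keep the generated backend ID and want one partition per API Management subscription:

<llm-semantic-cache-lookup score-threshold="0.0" embeddings-backend-id="AI-SC-text-embedding-bcd7dmzpjvnnuw3">
    <!-- Give every subscription its own cache partition so cached answers are never shared between callers -->
    <vary-by>@(context.Subscription.Id)</vary-by>
</llm-semantic-cache-lookup>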

After you’ve set this up, you can test it by going to the Test tab in API Management. For the request body I use this:

{"model":"gpt-5-mini","messages":[{"role":"system","content":"You are a helpful assistant"},{"role":"user","content":"Can you tell me more about the website AutoSysOps?"}]}

Do note that by default it will show a gpt-4o body, which contains a parameter that is not supported by gpt-5-mini, so you need to remove that. The body I provided here should work if you set everything up the same way I did.
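If you’d rather test from your own code instead of the portal, a minimal sketch in Python could look like the block below. The URL, the api-version and the subscription key header name are assumptions: copy the request URL from the Test tab of your own API and use the subscription key header your API is configured with.

import requests

# Assumption: copy the full request URL from the Test tab of your own API;
# the base path and api-version below are only placeholders.
url = "https://<your-apim-name>.azure-api.net/<your-base-path>/chat/completions?api-version=2024-05-01-preview"

headers = {
    "Content-Type": "application/json",
    # Assumption: the subscription key header name depends on your API settings
    # (commonly "api-key" or "Ocp-Apim-Subscription-Key").
    "api-key": "<your-apim-subscription-key>",
}

body = {
    "model": "gpt-5-mini",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Can you tell me more about the website AutoSysOps?"},
    ],
}

# The caching is completely transparent to the caller: the first call is answered
# by the model, a repeat within the cache duration is answered from the Redis cache.
response = requests.post(url, headers=headers, json=body)
print(response.json()["choices"][0]["message"]["content"])

Because the caching happens entirely inside API Management, this calling code stays exactly the same whether the answer comes from the cache or from the model.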

If you run a Trace in API Management, you will see it go through several steps and eventually return an answer. If you look at the trace, you will see something like this:

azure-openai-semantic-cache-lookup (75.880 ms)
{
    "message": "No results were found in semantic search under the vary-by partition 'None'."
}
azure-openai-semantic-cache-lookup (1.473 ms)
{
    "message": "Cache lookup resulted in a miss. Cache headers listed below were removed from the request to prompt the backend service to send back a complete response.",
    "cacheKey": "<REDACTED>",
    "cacheHeaders": []
}
set-backend-service (1.625 ms)
"Backend was changed to backend with ID 'aifoundrycache-ai-endpoint' (type: Single)."

As you are running it for the first time, it’s logical that nothing was found in the cache. If you now run it again before the cache duration expires, you will see something like this in your trace:

azure-openai-semantic-cache-lookup (55.513 ms)
{
    "message": "Found cache entry using semantic search within threshold of '0', and similarity score '1.0000001' under the vary-by partition 'None'. Vector distance is '-1.1920929E-07'."
}
azure-openai-semantic-cache-lookup (0.641 ms)
{
    "message": "Cache lookup resulted in a hit! Cached response will be used. Processing will continue from the step in the response pipeline that is after the corresponding cache-store.",
    "cacheKey": "<REDACTED>"
}

Here you see that something was found in the cache and it was returned instantly.

An added benefit of using caching like this is speed: if the response is already in the cache, it is returned much quicker, as the answer doesn’t have to be inferred again. It can also reduce the variety in the answers. If you ask an LLM the same question three times, it’s possible you’ll get three completely different answers.

In the image above you’ll see a Power App I created. On the left, the question was sent to a non-cached LLM three times; as you can see, the answers are quite different. On the right you see the cached responses, which are all exactly the same.

Whether any environmental savings are made depends on your implementation and how often the cache is hit. Keep in mind that this setup adds overhead to your queries, as every request first needs to be vectorized and looked up in the database. These actions are a lot less resource intensive than an AI inference, but if the cache is rarely hit this could actually end up being more wasteful. So if you are implementing this, I advise you to monitor how often the cache is hit to make sure you are creating a solution that actually helps.

Azure API Management offers more tools to control and improve your AI endpoints. In a follow-up blogpost I will tell you more about those. For now, don’t forget to check out all the other nice contributions to the Festive Tech Calendar, and thank you for reading. If you have any questions about this, feel free to ask them in the comments or contact me on the socials!