Analyze sustainability data periodically with GitHub Actions for free
For a project I’m working on I wanted to know the Power Usage Effectiveness (PUE) of different Azure regions. Microsoft publishes sustainability fact sheets for many regions, which contain data like the PUE, the renewable energy coverage and the Water Usage Effectiveness (WUE).
This information is very useful, but as far as I have found these documents are the only place to get it. I would like to use this information inside a PowerShell module, and if I had to parse the PDF documents every time the module is used that would be slow and annoying. So that made me think about how to solve this problem.
The solution I came up with was to use GitHub Actions to parse all the documents and compile it into one JSON file which then can be hosted on GitHub publicly for anyone who wants to use it.
I decided to use GitHub Actions because the check only has to happen once a day, as these files are not updated very frequently. Creating resources for this in a cloud platform would still cost me some money, but with GitHub Actions I can use the free tier to check for changes daily.
The repository which hosts the JSON file and the code for the GitHub Action can be found here. The regiondata.json file contains all the data. It looks like this:
{
  "id": "southeastasia",
  "carbon": {
    "pue": "1.358",
    "renewable": "13%"
  },
  "water": {
    "wue": "2.06"
  }
},
{
  "id": "australiaeast",
  "carbon": {
    "pue": "1.12",
    "renewable": "28%"
  },
  "water": {
    "wue": "0.012"
  }
},
...
The file goes on with more regions. You can see that the values for the PUE and the renewable energy percentage are in there, as well as the WUE, which describes the water usage efficiency.
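To give an idea of how this can be consumed, here is a minimal sketch of how a PowerShell script or module could read the hosted file. The raw URL is an assumption based on the repository mentioned later in this post.

# Minimal sketch: read the hosted JSON straight from GitHub
# (the raw URL is an assumption based on the autosysops/azure_sustainability_data repository)
$url = "https://raw.githubusercontent.com/autosysops/azure_sustainability_data/main/regiondata.json"
$regiondata = Invoke-RestMethod -Uri $url -Method Get

# Look up the PUE for a single region
$region = $regiondata | Where-Object { $_.id -eq "australiaeast" }
$region.carbon.pue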
In the scripts folder you can find a PowerShell script to parse the data. I will not go through the whole script in detail in this blog post, but I will point out some of the highlights:
Install-Module -Name ExtractPDFData -Confirm:$false -Force
Import-Module -Name ExtractPDFData -Force
# Retrieve all Sustainability Reports
$regions = Invoke-RestMethod -Uri "https://datacenters.microsoft.com/globe/data/geo/regions.json" -Method Get
In the first part of the script my new module ExtractPDFData is installed. There are multiple PowerShell modules in the PSGallery to get data out of a PDF file, but all the ones I tested return the text unformatted, without any care for where it was positioned on the page.
Because the PDF files contain the data in columns, I had to get the text in a way that let me see which data was in which column. By using a .NET library called FreeSpire.PDF I was able to get the text out of the PDF while still retaining some positional information.
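The module wraps that library. A rough sketch of the idea, with an assumed assembly path and file name rather than the actual module code, looks like this:

# Rough sketch of the extraction idea, not the actual ExtractPDFData code
# Load the Spire.Pdf assembly that ships with FreeSpire.PDF (path is an assumption)
Add-Type -Path ".\lib\Spire.Pdf.dll"

$pdf = New-Object Spire.Pdf.PdfDocument
$pdf.LoadFromFile(".\factsheet.pdf")   # file name is an assumption

# ExtractText() returns the page text with whitespace that roughly preserves
# the layout, which is what makes the column detection possible
$text = ""
foreach ($page in $pdf.Pages) {
    $text += $page.ExtractText()
}
$pdf.Close()

$lines = $text -split "`r?`n"

The next snippet shows how the words in the header line are detected: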
# Get the words in the header
$inWord = $false
$words = @()
for ($charindex = 0; $charindex -lt $line.Length; $charindex++) {
    if ((-not $inWord) -and ($line[$charindex] -ne " ")) {
        $word = @{
            start = $charindex
        }
        $inWord = $true
    }
    if ($inWord -and (($line[$charindex] -eq " ") -or ($charindex -eq ($line.Length - 1)))) {
        $word.end = $charindex
        $word.word = $line[($word.start)..($word.end - 1)] -join ""
        $words += $word
        $inWord = $false
    }
}

# Get the distances between words; half of the smallest gap is used as padding
$distances = @()
for ($w = 0; $w -lt ($words.count - 1); $w++) {
    $distances += $words[$w + 1].start - $words[$w].end
}
$padding = [System.Math]::Round(($distances | Measure-Object -Minimum).Minimum / 2, 0)

# Create the columns
$columns = @()
$columns += @{
    Name  = "Text"
    Start = 0
    End   = $words[0].start - ($padding + 1)
    Text  = @()
}
foreach ($word in $words) {
    $columns += @{
        Name  = $word.word
        Start = $word.start - $padding
        End   = $word.end + $padding
        Text  = @()
    }
}
In this part the code tries to determine the different columns in the document. By taking the words in the header row and measuring the gaps between them, a padding of half the smallest gap is added around each header word, creating boundaries that decide into which column a piece of text falls.
In the next part every line of words is processed, and by checking where in the line each word sits it is determined into which column it fits.
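That part of the script is not shown here, but conceptually it comes down to something like this (a simplified sketch, not the exact code from the repository):

# Simplified sketch: put every word of a data line into the column whose
# character range it falls in ($lineWords is assumed to be built the same
# way as the header words above)
foreach ($word in $lineWords) {
    foreach ($column in $columns) {
        if (($word.start -ge $column.Start) -and ($word.start -le $column.End)) {
            $column.Text += $word.word
            break
        }
    }
}

With the columns filled, the individual values can be pulled out of them: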
$carbondata = (($columns | Where-Object { $_.Name -eq "CARBON" }).text -join " ")
if ($carbondata -match "(\d+(?:[,.]\d+))(.*)(\(PUE\))") {
    $pue = $matches[1]
}
else {
    $pue = $null
}

# retrieve renewable energy data
if ($carbondata -match "(\d+%)(.*)(enewable energy coverage)") {
    $renewable = $matches[1]
}
else {
    $renewable = $null
}

# retrieve WUE data
$waterdata = (($columns | Where-Object { $_.Name -eq "WATER" }).text -join " ")
if ($waterdata -match "(\d+(?:[,.]\d+))(.*)(\(WUE\))") {
    $wue = $matches[1]
}
else {
    $wue = $null
}
Finally, the different values are retrieved from their respective columns and stored in an array, which is eventually converted to the JSON file.
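In essence that last part does something along these lines (a sketch with assumed variable names, not the literal code):

# Sketch with assumed variable names: collect the values per region ...
# (assuming $regiondata was initialised as an empty array earlier)
$regiondata += [PSCustomObject]@{
    id     = $region.id
    carbon = @{
        pue       = $pue
        renewable = $renewable
    }
    water  = @{
        wue = $wue
    }
}

# ... and once all regions are processed, write the JSON file
$regiondata | ConvertTo-Json -Depth 3 | Set-Content -Path ".\regiondata.json"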
But just having this JSON file is not enough: we want it stored on GitHub, compared with the existing data, and replaced if it differs.
To do this a GitHub Action was made, which can be found in the .github/workflows/run.yml file.
If you have ever worked with things like Azure DevOps Pipelines, the way it is constructed should look familiar. First there is a section with some general info; in this case that is the schedule for when it needs to run, which uses a CRON expression as the trigger. After that a series of steps is defined, the first being:
- uses: actions/checkout@v1
This is a simple checkout action which gets the files from the repository.
- run: .\Scripts\Parse-AzureSustainabilityData.ps1
  shell: pwsh
This step runs the script, which outputs the JSON file it creates. Because the GitHub Action runs in an isolated environment, we can store files there without worrying that they might get compromised; after the run the environment, including the stored files, is deleted.
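Putting the trigger and these first steps together, the top of the workflow looks roughly like this (the job name, runner and exact schedule are assumptions; the real file is in the repository):

on:
  schedule:
    - cron: '0 6 * * *'        # once a day, the exact time is an assumption

jobs:
  update-regiondata:           # job name is an assumption
    runs-on: windows-latest    # runner choice is an assumption
    steps:
      - uses: actions/checkout@v1
      - run: .\Scripts\Parse-AzureSustainabilityData.ps1
        shell: pwsh

The next fragment of the workflow commits the generated file: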
- uses: actions-js/push@master
  with:
    github_token: ${{ secrets.GITHUB_TOKEN }}
    message: 'New version of regiondata'
    branch: 'refs/heads/newbranch'
Here a commit is made to a new branch containing the newly created JSON file. To achieve this, a step from the GitHub Marketplace is used. In this marketplace developers can share steps they have made, which can then be used in GitHub Actions workflows.
To authenticate to GitHub the special built-in secret is used, so the commit is made with the identity the workflow runs under. To make sure this works, I had to enable the options that give the workflow write access in the repository settings under Actions -> General.
I didn't want the Action to commit straight to the main branch, as it might not work correctly 100% of the time, and I prefer to review the changes once they are detected. To achieve this the last part is added to the Action:
- uses: rematocorp/open-pull-request-action@v1
  if: success()
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    from-branch: newbranch
    to-branch: main
    repository-owner: autosysops
    repository: azure_sustainability_data
This step creates a pull request and is also taken from the Marketplace. Once a new pull request is created I will be notified. You might spot that this step contains an if clause: when no changes were made to the file, the previous step fails because there is nothing to commit, so this step only runs if the commit was successful.
Because it would be annoying if the Action returned a failed state every time a commit failed, I added this line to the job itself.
This isn't the best way to do it, because when something else fails I have no easy way of detecting it. In the future I want to make it more robust, but for a first version this works.
The Action now runs every day and keeps my data up to date. I hope this gives you some ideas on how to utilize tools like this when you only need something performed occasionally. If you want to help improve it, feel free to fork the repository and create a pull request once you have made your changes. I would love to see contributions from the community!