Home server monitoring with Azure Metrics and Telegraf

April 4, 2020

This is the first part of my Azure blog series. In these posts, I would like to share how I use Azure services with my home servers, and hopefully give you some practical examples of how you can integrate on-premises infrastructure with Azure.

In this post I will use Telegraf to monitor the CPU load and disk usage of my home server and send the collected data to Azure Metrics. I will also set up an alert that sends an email notification when a metric exceeds a threshold.

For this demonstration, I use my Orange Pi home server and run Telegraf in a container, but you can use any hardware with Docker and docker-compose installed.

I will assume in this tutorial that you already have an Azure subscription and the Azure CLI installed.

Let’s start with creating a new resource group.

$ az login
...
$ az group create -l northeurope -g rg-custom-metrics-test
Location     Name
-----------  ----------------------
northeurope  rg-custom-metrics-test

First we need an Azure resource that our metrics are reported against. I will create a new Application Insights resource for this purpose (later we will also use its availability test service to monitor our servers).

$ az extension add -n application-insights
$ az monitor app-insights component create --app appis-custom-metrics-test --location westeurope -g rg-custom-metrics-test

We need to create a new service principal to give our Telegraf instance permission to publish metrics. This is a security identity in Azure Active Directory, and we can create one with the following command.

$ az ad sp create-for-rbac -n sp-custom-metrics-test --role "Monitoring Metrics Publisher" -o yaml
appId: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
displayName: sp-custom-metrics-test
name: http://sp-custom-metrics-test
password: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX
tenant: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

Save the output of this command; we need it in the next step to configure the Telegraf output plugin.

Create a new docker-compose file on your server.

version: '3'
services:
  telegraf:
    image: telegraf:latest
    volumes:
      - ./telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro
      - /:/rootfs:ro
    environment:
      HOST_SYS: /rootfs/sys
      HOST_MOUNT_PREFIX: /rootfs
      AZURE_TENANT_ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
      AZURE_CLIENT_ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
      AZURE_CLIENT_SECRET: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX
    command:
      - "--test"

Here is what happens in this file: we mount the host file system into the container read-only and set the HOST_SYS variable, so Telegraf can read disk usage info from the host. We also set HOST_MOUNT_PREFIX to cut the /rootfs part from the path when reporting metrics.
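
To make the effect of HOST_MOUNT_PREFIX concrete, here is a minimal shell sketch of the prefix stripping. It is only an illustration of the idea, not Telegraf's actual code, and the strip_mount_prefix function name is made up for this example.

```shell
# Illustration only: mimic how a mount prefix is cut from a path.
# strip_mount_prefix is a hypothetical helper, not part of Telegraf.
strip_mount_prefix() {
  prefix=$1
  path=$2
  # Remove the prefix from the start of the path, if it is there.
  stripped=${path#"$prefix"}
  # The prefix on its own corresponds to the root of the host file system.
  [ -n "$stripped" ] || stripped=/
  printf '%s\n' "$stripped"
}

strip_mount_prefix /rootfs /rootfs/mnt/data   # prints /mnt/data
strip_mount_prefix /rootfs /rootfs            # prints /
```

With the taginclude setting used later in this post the path tag is dropped anyway, so this only matters if you decide to report the path as well.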

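Set the following environment variables from the output of the create-for-rbac command: AZURE_TENANT_ID is the tenant value, AZURE_CLIENT_ID is the appId value and AZURE_CLIENT_SECRET is the password value.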
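
If you would rather not keep the secret in the compose file itself, one option is to put the three variables into a separate file. The sketch below assumes that appId maps to AZURE_CLIENT_ID, password to AZURE_CLIENT_SECRET and tenant to AZURE_TENANT_ID; the sp_to_env helper and the telegraf.env file name are hypothetical.

```shell
# Hypothetical helper: print the three variables in KEY=value form.
# Arguments: appId, password and tenant from the create-for-rbac output.
sp_to_env() {
  printf 'AZURE_TENANT_ID=%s\n' "$3"
  printf 'AZURE_CLIENT_ID=%s\n' "$1"
  printf 'AZURE_CLIENT_SECRET=%s\n' "$2"
}

# Example (assumed file name):
# sp_to_env "<appId>" "<password>" "<tenant>" > telegraf.env
```

docker-compose can load such a file through the env_file key of the service, which replaces the three entries in the environment section.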

Finally we need a simple configuration file for Telegraf to report CPU and disk usage.

[agent]
  interval = "1m"
  hostname = "orangepi-v2" # Override container hostname

[[inputs.disk]]
  taginclude = ["device", "host"]
  fieldpass = ["used_percent"]
  mount_points = ["/rootfs/mnt/data", "/rootfs"]

[[inputs.system]]
  fieldpass = ["load*"]

[[outputs.azure_monitor]]
  namespace_prefix = "Telegraf/"
  region = "westeurope"
  resource_id = "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rg-custom-metrics-test/providers/microsoft.insights/components/appis-custom-metrics-test"

A few things to notice here: I wanted to report only a handful of metrics, so I used the fieldpass directive to select only the fields I need. You need to fill in your own resource details in the output config. Use this command to find your resource id and region name:

az resource show -n appis-custom-metrics-test -g rg-custom-metrics-test --resource-type microsoft.insights/components
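
The full output of this command is fairly long. As a sketch, the global --query option (a JMESPath expression) and -o yaml can narrow it down to the two values needed here. The show_resource_details wrapper and the resource_id and region key names are my own choice for this example, picked to match the Telegraf config.

```shell
# Print only the resource id and the region of the Application Insights resource.
# show_resource_details is a hypothetical wrapper around the command above.
show_resource_details() {
  az resource show \
    -n appis-custom-metrics-test \
    -g rg-custom-metrics-test \
    --resource-type microsoft.insights/components \
    --query "{resource_id: id, region: location}" \
    -o yaml
}

# Usage, after az login:
# show_resource_details
```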

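Now we can start Telegraf with docker-compose and check the collected metrics. Because of the --test flag in the compose file, Telegraf gathers the metrics once, prints them and exits without sending anything to the outputs; remove the flag once the printed values look right.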

$ docker-compose up
2020-04-04T19:28:34Z I! Starting Telegraf 1.13.4
2020-04-04T19:28:34Z I! Using config file: /etc/telegraf/telegraf.conf
> disk,device=mmcblk0p2,host=orangepi-v2 used_percent=94.26396009711372 1586028514000000000
> disk,device=mmcblk0p3,host=orangepi-v2 used_percent=15.371661730642543 1586028514000000000
> system,host=orangepi-v2 load1=0.22,load15=0.02,load5=0.08 1586028514000000000
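
The lines starting with > are in InfluxDB line protocol: the measurement name with its tags, then the fields, then a timestamp in nanoseconds. As a rough sketch, the helper below pulls the device and used_percent out of such output. The function name is made up, and it assumes that tag values contain no spaces or escaped characters, which holds for the output above.

```shell
# Hypothetical helper: read Telegraf --test output on stdin and print
# "<device> <used_percent>" for every disk line.
disk_usage_from_test_output() {
  awk '
    /^> disk,/ {
      # $2 is "disk,tag=value,..." and $3 is "field=value,..."
      device = ""
      n = split($2, tags, ",")
      for (i = 2; i <= n; i++) {
        split(tags[i], kv, "=")
        if (kv[1] == "device") device = kv[2]
      }
      m = split($3, fields, ",")
      for (j = 1; j <= m; j++) {
        split(fields[j], kv, "=")
        if (kv[1] == "used_percent") printf "%s %s\n", device, kv[2]
      }
    }'
}

# Usage:
# docker-compose up | disk_usage_from_test_output
```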

Log in to the Azure portal and use the search bar at the top to quickly navigate to Metrics. Select your subscription and find the Application Insights resource. After a few minutes, your Telegraf metric namespaces appear, and you can display your load metrics.

[Image: telegraf-metrics]

Now that we have some data in Azure, we can create an alert. Open Alerts and select New Alert Rule.

The instructions are fairly straightforward from here: select your resource, then add a condition and select load1 from the dropdown list. Set the threshold to 1 for testing.

[Image: telegraf-alert]

Finally add an action group and create a new email alert.

We can test the alert by generating some artificial load. Here is a simple Dockerfile for building a container with the stress tool.

FROM ubuntu
RUN apt-get update && apt-get install -y stress
ENTRYPOINT ["/usr/bin/stress", "-v"]

Build the container and start a CPU hog.

docker build -t stress . && docker run --rm -it stress --cpu 2

After about a minute, you should receive your first email alert.
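
If the email does not arrive, it helps to confirm on the server that the load really went above the threshold. The following is a small sketch for a Linux host; it reads the 1-minute load average from /proc/loadavg, and the load1_exceeds name is hypothetical.

```shell
# Succeeds when the 1-minute load average is above the given threshold.
# The optional second argument (file to read) is there only to make testing easy.
load1_exceeds() {
  threshold=$1
  loadavg_file=${2:-/proc/loadavg}
  # The first field of /proc/loadavg is the 1-minute load average.
  awk -v t="$threshold" '{ found = ($1 > t) } END { exit !found }' "$loadavg_file"
}

if load1_exceeds 1; then
  echo "load1 is above 1, the alert condition should be met"
else
  echo "load1 is not above 1 yet"
fi
```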

In the next tutorials, I will add new alert groups to trigger Azure Logic Apps and send a message to a Microsoft Teams channel.