Dev Talk: Monitoring with Prometheus, Grafana & Docker Part 2

Introduction

Previously we looked at setting up a Prometheus server, an exporter to report metrics, and Grafana as the graphical front-end for data display.

At the end of part one, three tasks remained: setting up an alert, routing it to a service like Slack, and securing the set-up by locking down ports and adding SSL. These are the topics of this second part of the blog post.

Alerts

A monitoring system that you need to stare at isn’t very helpful unless you can afford to do a lot of staring and never sleep. And while you’ll never be able to avoid staring entirely, it’s still best for your peace of mind to know that some things will be reported to you automatically.

Prometheus uses so-called alert rules to define when certain conditions are met and how to deal with them exactly. We’ll begin by adding an alert that fires when an instance goes down.

To set this up we need to make three changes:

  • add an alert.rules file
  • map this file into the container by editing docker-compose.yml
  • update prometheus.yml to make it use the file

Here’s what alert.rules looks like:

ALERT service_down
  IF up == 0

More on what this does in a moment; let’s first add it to the container. For this we need to add the following line to the volumes section of the prometheus service in docker-compose.yml:

services:
    prometheus:
        ...
        volumes:
            ...
            - ./alert.rules:/etc/prometheus/alert.rules

Finally, we’ll need to tell Prometheus that this is where the alerts are defined. Simply append a top level entry rule_files: to prometheus.yml:

...
rule_files:
    - 'alert.rules'

Prometheus Expressions

The alert rule for our service status looks deceptively simple. The syntax is based on the Prometheus expression language and allows you to set up conditions based on complex queries of the metrics.

In our initial example, we’re querying against what is probably the most basic metric available: the up state of the exporters. This binary metric (you can inspect it here) reports 1 or 0 for the configured exporters.
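As a rough illustration of how this metric can be read outside the web interface, the following Python sketch picks out the instances whose up value is 0 from a query result. It assumes your Prometheus version exposes the v1 HTTP API (GET /api/v1/query?query=up) and that the response has the shape shown in the sample; the instance and job names in the sample are made up for the example.

```python
import json

# Sample response in the shape returned by the Prometheus HTTP API for the
# instant query `up`. The instance and job names are invented for this sketch.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "instance": "localhost:9090", "job": "prometheus"},
       "value": [1480000000.0, "1"]},
      {"metric": {"__name__": "up", "instance": "node-exporter:9100", "job": "node"},
       "value": [1480000000.0, "0"]}
    ]
  }
}
"""


def down_instances(response_text):
    """Return the instance labels of all samples whose value equals 0."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    down = []
    for sample in payload["data"]["result"]:
        # Each sample value is a [timestamp, "value-as-string"] pair.
        _timestamp, value = sample["value"]
        if float(value) == 0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return down


print(down_instances(SAMPLE_RESPONSE))
```

The alert rule IF up == 0 expresses the same condition, only evaluated continuously by Prometheus itself rather than by an external script.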

To see this in action, simply shut down the node-exporter:

docker-compose stop node-exporter

and refresh the graph or check the alerts page:

Load Check

What we’ve done here for the built-in up metric is easily done for others as well. So next we’ll set up an alert that fires when the one-minute load average (node_load1) rises above 0.5. Add the following to alert.rules:

ALERT high_load
    IF node_load1 > 0.5
    ANNOTATIONS {
      summary = "Instance {{ $labels.instance }} under high load",
      description = "{{ $labels.instance }} of job {{ $labels.job }} is under high load.",
    }

Note that we’ve also taken the opportunity to add an ANNOTATIONS block, the purpose of which will become apparent in a minute.
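To get a feel for what the {{ $labels.instance }} placeholders in the annotations turn into, here is a small Python sketch that imitates the substitution. Prometheus uses Go templating for this; the function below is only a simplified stand-in that handles plain $labels lookups, and the label values are hypothetical.

```python
import re

# Matches placeholders of the form {{ $labels.name }} (whitespace optional).
LABEL_REF = re.compile(r"\{\{\s*\$labels\.(\w+)\s*\}\}")


def render_annotation(template, labels):
    """Replace each {{ $labels.x }} with the value of label x.

    Unknown labels are rendered as an empty string, which is a choice
    made for this sketch rather than a statement about Prometheus.
    """
    return LABEL_REF.sub(lambda match: labels.get(match.group(1), ""), template)


# Hypothetical labels of a firing alert.
labels = {"instance": "node-exporter:9100", "job": "node"}

print(render_annotation("Instance {{ $labels.instance }} under high load", labels))
print(render_annotation(
    "{{ $labels.instance }} of job {{ $labels.job }} is under high load.", labels))
```

Each firing alert carries the labels of the time series that triggered it, so the same rule produces a separate, instance-specific message per affected host.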

First let’s confirm we can trigger this alert by creating some load, for example by running

docker run --rm -it busybox sh -c "while true; do :; done"

We should be seeing the following after a while:

Alertmanager

Alerts themselves are metrics that can be displayed, which means they can easily be added to a Grafana dashboard:

The metric shown at the bottom in two different variants is the following:

ALERTS{alertname="high_load",alertstate="firing"}

The configuration for this can be imported from dashboard.json in this post’s GitHub repo, and you can inspect the set-up of the panels to see how to represent the values as shown above.

While it is useful to have this display, you will also want to be notified by other means, like a Slack channel or email. To set this up we need to add another component to the mix, the Alertmanager, which is also part of Prometheus. We need to make only a handful of changes:

  • extend docker-compose.yml with a section to launch the container
  • in that same file, tell prometheus how to connect to the Alertmanager, by passing in the -alertmanager.url flag
  • provide an alertmanager.yml configuration file with our specific alert routes

So, in more detail, these are the additions to docker-compose.yml:

# docker-compose.yml
version: '2'
services:
    prometheus:
        ...
        command:
            - '-config.file=/etc/prometheus/prometheus.yml'
            - '-alertmanager.url=http://alertmanager:9093'
        ports:
            ...
    alertmanager:
        image: prom/alertmanager:0.1.1
        volumes:
            - ./alertmanager.yml:/alertmanager.yml
        command:
            - '-config.file=/alertmanager.yml'
volumes:
    ...

This is all that’s needed to launch the alertmanager service and connect prometheus to it. (Again, note how we can reference the service simply by its service name, thanks to the name resolution in the container network.)

Slack Receiver

The alertmanager takes care of routing any alerts that fire to whatever service is configured in its configuration file alertmanager.yml, which looks as follows:

# alertmanager.yml
route:
    receiver: 'slack'
receivers:
    - name: 'slack'
      slack_configs:
          - send_resolved: true
            username: 'Prometheus'
            channel: '#random'
            api_url: 'https://hooks.slack.com/services/<your>/<stuff>/<here>'

In this case we set up a Slack receiver for our alerts, which results in a message like the following being posted when alerts occur:

In order to make this possible, you will need to set up an incoming webhook integration for your Slack team and update the api_url: config with the value you get from the integration.
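Before wiring the webhook into Alertmanager, it can help to confirm that the URL works at all. The following Python sketch builds a simple message and posts it to an incoming webhook. The message text and the function names are assumptions made for this example; it is not the format Alertmanager itself sends, and the URL must be replaced with the value from your own integration.

```python
import json
import urllib.request


def build_slack_payload(alert_name, instance, resolved=False,
                        channel="#random", username="Prometheus"):
    """Build a minimal incoming-webhook payload for a test message."""
    status = "RESOLVED" if resolved else "FIRING"
    return {
        "channel": channel,
        "username": username,
        "text": f"[{status}] {alert_name} on {instance}",
    }


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON to the webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Only the payload is built here; nothing is sent until post_to_slack is called.
print(json.dumps(build_slack_payload("high_load", "node-exporter:9100")))
```

Calling post_to_slack with your webhook URL and this payload should make a test message appear in the channel. If it does not, the problem lies with the webhook rather than with the Alertmanager configuration.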

You can see from the screenshot that you can also get notified when an alert is resolved, thanks to the send_resolved: true setting in the config file. There are a few other parameters you can set, as described in the Slack receiver documentation.

SSL Configuration

The final piece necessary to make this set-up deployable is to protect the monitoring site with SSL, and the easiest way to do that is with another dockerised service: lets-nginx. This is another great example of how you can add pieces to the puzzle of building up a service in a modular way.

We start by adding a section for the ssl service to our docker-compose.yml:

    ssl:
        image: opencapacity/lets-nginx:1.3
        ports:
            - "443:443"
        volumes:
            - letsencrypt:/etc/letsencrypt
            - letsencrypt_backups:/var/lib/letsencrypt
            - dhparam_cache:/cache

This sits at the same level as the other services (prometheus, grafana, etc.) in that file.

You’ll notice that we are referencing three volumes here that we also need to add to the volumes section at the very end of the file. Simply append the following:

volumes:
    ...
    letsencrypt: {}
    letsencrypt_backups: {}
    dhparam_cache: {}

While it’s not strictly necessary to do this, it is advisable for the following reasons:

  • lets-nginx requests new certs every time you launch the container if there are no valid certs
  • if you don’t keep your certs around between restarts you may hit Let’s Encrypt’s rate limit (currently 5 per week)
  • creating the Diffie-Hellman parameters takes quite a while and you don’t want to re-create them on every start-up

With this we have set up a generic ssl container which is not yet configured in any way specific to our service. To do so, simply add the following three lines to a new environment: section, for example between image: and ports: and at the same level:

            - EMAIL=<your email, e.g. info@mydomain.com>
            - DOMAIN=<your domain, e.g. mydomain.com>
            - UPSTREAM=grafana:3000

These three lines set up environment variables which determine the parameters lets-nginx uses during startup. EMAIL and DOMAIN configure your SSL cert while UPSTREAM tells nginx what host to proxy to.

There is one small detail we need to take care of before running this, and that is adding a dependency of ssl on grafana. The reason for this is to prevent the ssl service from exiting because the grafana host is not available, which can happen if ssl launches faster than grafana (which is typically the case, unless ssl computes the DH parameters). Add the following sub-section to the ssl: entry:

        depends_on:
            - grafana

And that is all there is to getting SSL for your service. If you run docker-compose up -d now you should be able to access your service on your public IP address via SSL. (Be aware that the first launch of ssl will be slow because of the DH parameter computation.)

Removing Open Ports

Of course, we’re not quite done yet. While we’ve added an SSL proxy to our grafana service, we haven’t closed the door yet on the unsecured ports from the other services. This is as simple as removing all ports: sub-sections from docker-compose.yml except for the one for port 443 under ssl:.

The ssl service will still be able to talk to grafana even without its ports: declared, because the container’s ports remain reachable at the container network level, just not externally.

What this means is that you will no longer be able to connect to Prometheus directly at port 9090 or to node-exporter at port 9100. However, direct access is really only necessary while setting up the system or for troubleshooting, as all the information gathered via those subsystems is displayed via Grafana.
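One way to double-check the result from outside the host is a quick TCP probe of the ports that should now be closed. The Python sketch below is a minimal version of such a check; the function names are made up for this example, and the port list simply reflects the services mentioned in this series (9090 for Prometheus, 9100 for node-exporter, 9093 for the Alertmanager, 3000 for Grafana).

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def unexpected_open_ports(host, ports, allowed=(443,)):
    """Return the ports from `ports` that accept connections but are not allowed."""
    return [port for port in ports
            if port not in allowed and is_port_open(host, port)]
```

Run from a machine other than the Docker host, a call such as unexpected_open_ports("<your domain>", [3000, 9090, 9093, 9100, 443]) should return an empty list once the ports: sub-sections have been removed. A filtered port may take the full timeout to report as closed.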

Conclusion

This concludes our two-part mini-series about monitoring with Prometheus, Grafana & Docker. The configuration files are available on GitHub in opencapacity/blogpost-prometheus, with the tags part1 and part2 pointing to the current state of the configuration at the end of each part.

There is one difference between the files in this repository and the description in this blog post, and that is how the configuration files are added to the containers. Throughout this series, this mapping was declared as follows:

...
        volumes:
            ...
            - ./prometheus.yml:/etc/prometheus/prometheus.yml

This works fine as long as you run docker-compose against a local docker daemon. However, if you attempt to run this set-up with a docker-machine, for example on Digital Ocean, this will fail. The reason is that the volume mapping to these local files will not ‘travel’ with the service description and the services will not find their configuration.

Therefore, we have made a small change in commit 3a01d23 to copy the configuration files into the images via specific Dockerfiles for the two services that require configuration files, prometheus and alertmanager. This makes it much easier to test the set-up with a hosted service, which in turn is an easy way to get a public IP for the SSL set-up.

If you have any questions or feedback, please contact the author.