Dev Talk: Monitoring with Prometheus, Grafana & Docker Part 2

Introduction

Previously we looked at setting up a Prometheus server, an exporter to report metrics, and Grafana as the graphical front-end for data display.

Where we left off was to set up an alert, route it to a service like Slack, and to secure the set-up by locking down ports and adding SSL. This is what we will be looking at in this second part of the blog post.

Alerts

A monitoring system that you need to stare at isn’t very helpful unless you can afford to do a lot of staring and never sleep. And while you’ll never be able to avoid staring entirely, it’s still best for your peace of mind to know that some things will be reported to you automatically.

Prometheus uses so-called alert rules to define the conditions to watch for and what exactly should happen when they are met. We’ll begin by adding an alert that fires when an instance goes down.

To set this up we need to make three changes:

  • add an alert.rules file
  • map this file into the container by editing docker-compose.yml
  • update prometheus.yml to make it use the file

Here’s what alert.rules looks like:

ALERT service_down
  IF up == 0

More on what this does in a moment; first, let’s add it to the container. For this we need to add the following line to the volumes section of the prometheus service in docker-compose.yml:

services:
    prometheus:
        ...
        volumes:
            ...
            - ./alert.rules:/etc/prometheus/alert.rules

Finally, we’ll need to tell Prometheus that this is where the alerts are defined. Simply append a top-level entry rule_files: to prometheus.yml:

...
rule_files:
    - 'alert.rules'
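For reference, with this addition the prometheus.yml from the first part ends up looking roughly like this (a sketch; the scrape_configs are the ones we set up previously and may differ in your set-up):

# prometheus.yml
global:
    scrape_interval: 5s
    external_labels:
        monitor: 'my-monitor'
rule_files:
    - 'alert.rules'
scrape_configs:
    - job_name: 'prometheus'
      target_groups:
          - targets: ['localhost:9090']
    - job_name: 'node-exporter'
      target_groups:
          - targets: ['node-exporter:9100']

After changing the rule file, restart the service (for example with docker-compose restart prometheus) so Prometheus picks it up.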

Prometheus Expressions

The alert rule for our service status looks deceptively simple. The syntax is based on the Prometheus expression language and allows you to set up conditions based on arbitrarily complex queries over the collected metrics.

In our initial example, we’re querying against what is probably the most basic metric available: the up state of the exporters. This binary metric, which you can inspect in the Prometheus expression browser, reports 1 or 0 for each configured exporter.
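If you want to experiment before writing rules, the expression browser at http://localhost:9090/graph is a convenient place to try queries. A few illustrative expressions (assuming the prometheus and node-exporter jobs from the first part):

up                          # 1 for every target Prometheus can scrape, 0 otherwise
up == 0                     # only the targets that are currently down
up{job="node-exporter"}     # restrict the query to a single job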

To see this in action, simply shut down the node-exporter:

docker-compose stop node-exporter

and refresh the graph or check the alerts page:
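Once you have seen the alert fire, you can bring the exporter back up so the following examples have data to work with:

docker-compose start node-exporter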

Load Check

What we’ve done here for the built-in up metric is easily done for others as well. So next we’ll set up an alert that fires when the one-minute load average rises above 0.5. Add the following to alert.rules:

ALERT high_load
    IF node_load1 > 0.5
    ANNOTATIONS {
      summary = "Instance {{ $labels.instance }} under high load",
      description = "{{ $labels.instance }} of job {{ $labels.job }} is under high load.",
    }

Note that we’ve also taken the opportunity to add an ANNOTATIONS block, the purpose of which will become apparent in a minute.
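As an aside, alert rules in this syntax also accept a FOR clause, which keeps the alert in a pending state until the condition has held for the given duration. This is useful to avoid firing on brief spikes; a sketch of the same rule with a one-minute pending period:

ALERT high_load
    IF node_load1 > 0.5
    FOR 1m
    ANNOTATIONS {
      summary = "Instance {{ $labels.instance }} under high load",
      description = "{{ $labels.instance }} of job {{ $labels.job }} is under high load.",
    }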

First let’s confirm we can trigger this alert by creating some load, for example by running

docker run --rm -it busybox sh -c "while true; do :; done"

We should be seeing the following after a while:

Alertmanager

Alerts themselves are metrics that can be displayed, which means they can easily be added to a Grafana dashboard:

The metric shown at the bottom in two different variants is the following:

ALERTS{alertname="high_load",alertstate="firing"}

The configuration for this can be imported from dashboard.json in this post’s github repo, and you can inspect the set-up of the panels to see how the values are represented as shown above.

While it is useful to have this display, you will also want to be notified by other means, like a Slack channel or via email. To set this up we need to add another component to the mix, the Alertmanager, which is also part of Prometheus. We need to make only a handful of changes:

  • extend docker-compose.yml with a section to launch the container
  • in that same file, tell prometheus how to connect to the Alertmanager, by passing in the -alertmanager.url flag
  • provide an alertmanager.yml configuration file with our specific alert routes

So, in more detail, these are the additions to docker-compose.yml:

# docker-compose.yml
version: '2'
services:
    prometheus:
            ...
        command:
            - '-config.file=/etc/prometheus/prometheus.yml'
            - '-alertmanager.url=http://alertmanager:9093'
        ports:
            ...
    alertmanager:
        image: prom/alertmanager:0.1.1
        volumes:
            - ./alertmanager.yml:/alertmanager.yml
        command:
            - '-config.file=/alertmanager.yml'
volumes:
    ...

This is all that’s needed to launch the alertmanager service and connect prometheus to it. (Again, note how we can reference the service simply by its service name, thanks to the name resolution in the container network.)

Slack Receiver

The alertmanager takes care of routing any alerts that fire to whatever service is configured in its configuration file alertmanager.yml, which looks as follows:

# alertmanager.yml
route:
    receiver: 'slack'
receivers:
    - name: 'slack'
      slack_configs:
          - send_resolved: true
            username: 'Prometheus'
            channel: '#random'
            api_url: 'https://hooks.slack.com/services/<your>/<stuff>/<here>'

In this case we set up a Slack receiver for our alerts, which will result in the following message being posted when alerts fire:

In order to make this possible, you will need to set up an incoming webhook integration for your Slack team and update the api_url: config with the value you get from the integration.

You can see from the screenshot that you can also get notified when an alert is resolved, thanks to the send_resolved: true setting in the config file. There are a few other parameters you can set, as described in the Slack receiver documentation.
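Beyond the receiver itself, the route can also be tuned, for example to group related alerts into a single notification and to control how often an unresolved alert is re-sent. A sketch with illustrative values (check the Alertmanager documentation for your version for the exact parameter names and defaults):

# alertmanager.yml
route:
    receiver: 'slack'
    group_by: ['alertname', 'instance']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 3h
receivers:
    ...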

SSL Configuration

The final piece necessary to make this set-up deployable is to protect the monitoring site with SSL, and the easiest way to do that is with another dockerised service: lets-nginx. This is another great example of how you can add pieces to the puzzle of building up a service in a modular way.

We start by adding a section for the ssl service to our docker-compose.yml:

    ssl:
        image: opencapacity/lets-nginx:1.3
        ports:
            - "443:443"
        volumes:
            - letsencrypt:/etc/letsencrypt
            - letsencrypt_backups:/var/lib/letsencrypt
            - dhparam_cache:/cache

This sits at the same level as the other services, prometheus, grafana, etc., in that file.

You’ll notice that we are referencing three volumes here that we also need to add to the volumes section at the very end of the file. Simply append the following:

volumes:
    ...
    letsencrypt: {}
    letsencrypt_backups: {}
    dhparam_cache: {}

While it’s not strictly necessary to do this, it is advisable for the following reasons:

  • lets-nginx requests new certs every time you launch the container if there are no valid certs
  • if you don’t keep your certs around between restarts you may hit letsencrypt’s rate limit (currently 5 per week)
  • creating the Diffie Hellman parameters takes quite a while and you don’t want to re-create them on every start up

With this we have set up a generic ssl container which is not configured in any way specific to our service yet. To do so, simply add the following three lines to a new environment: section, for example between image: and ports: and at the same level:

            - EMAIL=<your email, e.g info@mydomain.com>
            - DOMAIN=<your domain, e.g. mydomain.com>
            - UPSTREAM=grafana:3000

These three lines set up environment variables which determine the parameters lets-nginx uses during startup. EMAIL and DOMAIN configure your SSL cert while UPSTREAM tells nginx what host to proxy to.

There is one small detail we need to take care of before running this, and that is adding a dependency of ssl on grafana. The reason is to prevent the ssl container from exiting because the grafana host is not yet available, which can happen if ssl launches faster than grafana (typically the case, unless ssl first has to compute the DH parameters). Add the following sub-section to the ssl: entry:

        depends_on:
            - grafana

And that is all there is to getting SSL for your service. If you run docker-compose up -d now you should be able to access your service on your public IP address via SSL. (Be aware that the first launch of ssl will be slow because of the DH parameter computation.)
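For reference, the assembled ssl: service in docker-compose.yml now looks roughly like this (with your own values in the placeholders):

    ssl:
        image: opencapacity/lets-nginx:1.3
        environment:
            - EMAIL=<your email, e.g info@mydomain.com>
            - DOMAIN=<your domain, e.g. mydomain.com>
            - UPSTREAM=grafana:3000
        depends_on:
            - grafana
        ports:
            - "443:443"
        volumes:
            - letsencrypt:/etc/letsencrypt
            - letsencrypt_backups:/var/lib/letsencrypt
            - dhparam_cache:/cache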

Removing Open Ports

Of course, we’re not quite done yet. While we’ve added an SSL proxy to our grafana service, we haven’t closed the door yet on the unsecured ports from the other services. This is as simple as removing all ports: sub-sections from docker-compose.yml except for the one for port 443 under ssl:.

The ssl service will still be able to talk to grafana even without its ports: declared, because they are still exposed at the container network level, just not externally.
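To make this concrete, here is a sketch of the relevant fragments after the change; only the ssl service still publishes a port to the host:

services:
    prometheus:
        ...
        # no 'ports:' section any more; only reachable inside the container network
    node-exporter:
        image: prom/node-exporter:0.12.0rc1
        # '9100:9100' mapping removed
    grafana:
        ...
        # '3000:3000' mapping removed; reached via the ssl proxy instead
    ssl:
        ...
        ports:
            - "443:443"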

What this means is that you will no longer be able to connect to Prometheus directly at port 9090 or to node-exporter at port 9100. However, this is really only necessary while setting up the system or for troubleshooting, since all the information gathered by those subsystems is displayed via Grafana.

Conclusion

This concludes our two part mini-series about monitoring with Prometheus, Grafana & Docker. The configuration files are available on github in opencapacity/blogpost-prometheus, with the tags part1 and part2 pointing to the current state of the configuration at the end of each part.

There is one difference between the files in this repository and the description in this blog post, and that is how the configuration files are added to the containers. Throughout this series, this mapping was declared as follows:

...
        volumes:
            ...
            - ./prometheus.yml:/etc/prometheus/prometheus.yml

This works fine as long as you run docker-compose against a local docker daemon. However, if you attempt to run this set-up with docker-machine, for example on DigitalOcean, it will fail. The reason is that the volume mapping to these local files will not ‘travel’ with the service description, and the services will not find their configuration.

Therefore, we have made a small change in commit 3a01d23 to copy the configuration files into the images via specific Dockerfiles for the two services that require configuration files, prometheus and alertmanager. This makes it much easier to test the set-up with a hosted service, which in turn is an easy way to get a public IP for the SSL set-up.
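The idea behind those Dockerfiles is simply to copy the configuration on top of the stock image; a minimal sketch for the prometheus service might look like this (the actual files in the repository may differ in the details):

# Dockerfile (sketch)
FROM prom/prometheus:0.18.0
COPY prometheus.yml /etc/prometheus/prometheus.yml
COPY alert.rules /etc/prometheus/alert.rules

In docker-compose.yml the service then gets a build: entry pointing at this Dockerfile instead of the plain image:, and the corresponding volume mapping can be dropped.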

If you have any questions or feedback, please contact the author. You can also follow us on Twitter.

Dev Talk: Monitoring with Prometheus, Grafana & Docker Part 1

Introduction

The choice of monitoring systems out there is overwhelming. When we recently decided to set up a monitoring system for our handful of servers, it became clear that many of the go-to solutions like Nagios, Sensu, New Relic would be either too heavy or too expensive – or both.

What we really needed was something lean we could spin up in a docker container and then ‘grow’ by extending the configuration or adding components as and when our needs change.

With those requirements in hand we soon came across Prometheus, a monitoring system and time series database, with its de-facto graphical front-end Grafana. We set it up for a trial run and it fit our needs perfectly.

While the set-up went fairly smoothly, we did find some of the information on the web for similar set-ups slightly outdated and wanted to pull everything together in one place as a reference. This post covers the versions current at the time of writing: Prometheus 0.18.0 and Grafana 3.0.1.

Because pictures are worth more than a thousand words, here’s what a Prometheus powered Grafana dashboard looks like:

Requirements

In order to follow along, you will need only two things:

  • docker
  • docker-compose

Follow the respective installation instructions. The versions we use are docker 1.11 and docker-compose 1.7. Note that older versions will very likely work as well, but we have not tested them.

Components

Prometheus is a system originally developed by SoundCloud as part of a move towards a microservice architecture. As such, it consists of a few moving parts that are launched and configured separately.

While this may be a bit more complicated to set up and manage on the surface, thanks to docker-compose it is actually quite easy to bundle everything up as a single service again with only one service definition file and (in our example) three configuration files.

Before we dive in, here’s a brief run-down of the components and what they do:

  • Prometheus - this is the central piece: it contains the time series database and the logic for scraping stats from exporters (see below), as well as for alerting.
  • Grafana is the ‘face’ of Prometheus. While Prometheus exposes some of its internals like settings and the stats it gathers via basic web front-ends, it delegates the heavy lifting of proper graphical displays and dashboards to Grafana.
  • Alertmanager manages the routing of alerts which Prometheus raises to various different channels like email, pagers, slack - and so on. So while Prometheus collects stats and raises alerts it is completely agnostic of where these alerts should be displayed. This is where the alertmanager picks up.
  • Exporters are http endpoints which expose ‘prometheus metrics’ for scraping by the Prometheus server. What this means is that this is a pull set-up. Note that it is also possible to set up a push-gateway which is essentially an intermediary push target which Prometheus can then scrape. This is useful for scenarios where pull is not appropriate or feasible (for example short lived processes).

Getting started

Launching Prometheus

We’ll start off by launching Prometheus via a very simple docker-compose.yml configuration file

# docker-compose.yml
version: '2'
services:
    prometheus:
        image: prom/prometheus:0.18.0
        volumes:
            - ./prometheus.yml:/etc/prometheus/prometheus.yml
        command:
            - '-config.file=/etc/prometheus/prometheus.yml'
        ports:
            - '9090:9090'

and a prometheus configuration file prometheus.yml:

# prometheus.yml
global:
    scrape_interval: 5s
    external_labels:
        monitor: 'my-monitor'
scrape_configs:
    - job_name: 'prometheus'
      target_groups:
          - targets: ['localhost:9090']

As you can see, inside docker-compose.yml we map the prometheus config file into the container as a volume and add a -config.file parameter to the command pointing to this file.

To launch prometheus, run the command

docker-compose up

Visit http://localhost:9090/status to confirm the server is running and the configuration is the one we provided.

Targets

Further down, below the ‘Configuration’ section on the status page, you will find a section ‘Targets’ which lists a ‘prometheus’ endpoint. This corresponds to the scrape_configs entry with the same job_name and is a source of metrics provided by Prometheus. In other words, the Prometheus server comes with a metrics endpoint - or exporter, as we called it above - which reports stats for the Prometheus server itself.

The raw metrics can be inspected by visiting http://localhost:9090/metrics.
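You can also fetch them from the command line. The output is plain text in the Prometheus exposition format, roughly along these lines (the exact metric names and values will differ):

$ curl -s http://localhost:9090/metrics | head
# HELP http_requests_total Total number of HTTP requests made.
# TYPE http_requests_total counter
http_requests_total{code="200",handler="graph",method="get"} 5
...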

Adding a node-exporter target

While it’s certainly a good idea to monitor the monitoring service itself, this is just going to be an additional aspect of the set-up. The main point is to monitor other things by adding targets to the scrape_configs section in prometheus.yml. As described above, these targets need to export metrics in the prometheus format.

One such exporter is node-exporter, another piece of the puzzle provided as part of Prometheus. What it does is collect system metrics like cpu/memory/storage usage and then export them for Prometheus to scrape. The beauty of this is that it can be run as a docker container while also reporting stats for the host system. It is therefore very easy to instrument any system that can run docker containers.

We will add a configuration setting to our existing docker-compose.yml to bring up node-exporter alongside prometheus. However, this is mainly for convenience in this example; in a normal setup, where one prometheus instance monitors many other machines, these exporters would likely be launched by other means.

Here’s what our new docker-compose.yml looks like:

# docker-compose.yml
version: '2'
services:
    prometheus:
        image: prom/prometheus:0.18.0
        volumes:
            - ./prometheus.yml:/etc/prometheus/prometheus.yml
        command:
            - '-config.file=/etc/prometheus/prometheus.yml'
        ports:
            - '9090:9090'
    node-exporter:
        image: prom/node-exporter:0.12.0rc1
        ports:
            - '9100:9100'

We simply added a node-exporter section. Configuring it as a target only requires a small extension to prometheus.yml:

# prometheus.yml
global:
    scrape_interval: 5s
    external_labels:
        monitor: 'my-monitor'
scrape_configs:
    - job_name: 'prometheus'
      target_groups:
          - targets: ['localhost:9090']
    - job_name: 'node-exporter'
      target_groups:
          - targets: ['node-exporter:9100']

Note that we reference the node-exporter by its service name, which we specified in docker-compose.yml (the service label). Docker Compose makes the service available by that name for inter-container connectivity.

Grafana

At this point we’ve set up a Prometheus server in a basic configuration with two probes exporting metrics. If you’ve had a look around the Prometheus web front-end you’ve probably noticed that there’s a rudimentary interface in place to look at metrics.

This helps to get an overview or take a quick look but Grafana offers a much more powerful picture, as the very first screenshot in this blog post shows.

Adding Grafana to our set-up is again a simple extension to docker-compose.yml. Append the following lines:

    grafana:
        image: grafana/grafana:3.0.0-beta7
        environment:
            - GF_SECURITY_ADMIN_PASSWORD=pass
        depends_on:
            - prometheus
        ports:
            - "3000:3000"

The complete final version of all config files can be found in this github repo.

After restarting the service with

docker-compose up

you can access Grafana at http://localhost:3000/login

Persistence

Up to this point we’ve simply set things up via configuration files to bring up the service. While we did record some metrics within the running containers, we’ve not made any changes that we would necessarily want to keep around between restarts of the system.

Obviously this is fine for testing but before we go on to configure dashboards, we’ll want to make sure everything is actually persisted in docker volumes.

To do so, we append the following lines to our docker-compose.yml configuration file:

volumes:
    prometheus_data: {}
    grafana_data: {}

This defines two data volumes, one for Prometheus and one for Grafana. To use them, add the following lines to their service definitions (adding a new volumes: section to grafana’s service definition):

services:
    prometheus:
        ...
        volumes:
            - prometheus_data:/prometheus
            ...
    grafana:
        ...
        volumes:
            - grafana_data:/var/lib/grafana

This tells docker-compose to map the docker volumes we’ve defined into the containers where their data directories are located. The docker volumes will be created if they don’t exist and will persist even after the containers are stopped and removed.
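If you want to confirm the volumes exist, docker volume ls will list them. docker-compose prefixes the names with the project name, which by default is derived from the directory, so the output will look something like:

$ docker volume ls
DRIVER              VOLUME NAME
local               blogpostprometheus_grafana_data
local               blogpostprometheus_prometheus_data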

Now because we’ve been running the service before we need to remove the containers before we bring it back up again. (Otherwise you will get a WARNING: Service "prometheus" is using volume "/prometheus" from the previous container, going on to inform you that the volume will not be used.)

So we run

docker-compose rm

followed by

docker-compose up

and everything should be the same as before, except now we save all changes (and recorded metrics) to the specified docker volumes.

Configuring Grafana

As you’ve probably noticed, we’ve specified the admin user’s password in docker-compose.yml as pass. Use these credentials to log in and navigate to http://localhost:3000/datasources to set up our Prometheus server as the data source for Grafana. Give it a name, make sure you select “Prometheus” as the type, and set the URI to http://prometheus:9090.

Note again that we can use the service name as the host name for the URI to connect Grafana to Prometheus.

Next head over to http://localhost:3000/dashboard/new to create a new dashboard and add a graph:

Make sure to select “My Monitor” as the panel data source; after doing so, you can use the metric lookup field to filter for any of the metrics available in Prometheus.
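As a simple example, the following queries will plot the one-minute load average and a per-mode rate of CPU time (metric names as exported by this node-exporter version; use the metric lookup field to confirm them on your system):

node_load1
rate(node_cpu{mode="user"}[1m])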

Grafana offers a lot of options to create great looking graphs and all of this is extensively documented at http://docs.grafana.org.

Limitations

The current set-up has a few limitations which make it unsuitable to be run as-is except for testing and learning purposes:

  • The Prometheus web front-end is exposed on port 9090 and freely accessible without authentication.
  • Grafana supports authentication but is not configured for SSL.

In addition, while we display metrics in graphs, we have not yet set up any automatic alerts that trigger when certain conditions are met. These alerts should be delivered as notifications in various ways, for example via email or a Slack message.

We will address all of these points in the second part of this blog post.

If you have any questions or feedback, please contact the author. You can also follow us on Twitter.