Prometheus part two - Movio blog

This post is part two of a three-post blog series; you can catch up with part one here - Prometheus: Lighting the way.

In this post we will talk about the way we export our system and application metrics to Prometheus, and how it has allowed us to vastly improve our monitoring and alerting.

The aim

In previous years we had to rely on logging to get metrics out of our services. This wasn’t very flexible: a log line is normally triggered by an event happening inside a service, so it only really gave us event-based monitoring. We found instrumenting our code to be a lot more flexible, as we can read a value at any point in time without writing extra lines of code to log it. It also means we no longer need to grep values out of log files, because Prometheus stores them as numeric time series that we can query and manipulate directly.

The aim of using exporters and instrumentation is to expose metrics to Prometheus which aren’t already exposed somewhere else. When it comes to instrumenting our code, most of the metrics we expose would otherwise not be visible at all. One of the most valuable things we got out of exposing them is that they tell us a lot about the current state of our services, which greatly improves the visibility we have over what’s happening inside each one.

Instrumentation

As developers there are certain metrics we want access to which are often stored in variables or accessible from a function inside our services. As these metrics are held in memory, it isn’t easy to make them visible from outside the process; however, we can rely on instrumentation in our code to expose them so that they can be collected and read by Prometheus.

We use JMX as the interface between our running Scala services and the Prometheus JMX Exporter, which exports their metrics to Prometheus. JMX is the standard technology for monitoring JVM-based applications and services, and the Prometheus organization provides and maintains the JMX Exporter, so we decided that JMX was the best way to expose our metrics. The exporter reads the MBeans we expose and serves them on an HTTP endpoint in a format Prometheus can understand.
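To make the MBean idea concrete, here is a minimal sketch that uses only the JDK's own JMX classes and none of our library code. The names JmxCounterSketch, RequestCounter and the object name movio.sketch:type=RequestCounter are made up for illustration. It registers a counter as a standard MBean and reads it back the same way a JMX client would.

```scala
import java.lang.management.ManagementFactory
import java.util.concurrent.atomic.AtomicLong
import javax.management.ObjectName

object JmxCounterSketch {
  // Standard MBean convention: a class X is managed through an interface named XMBean.
  trait RequestCounterMBean {
    def getCount: Long // surfaces as the read-only JMX attribute "Count"
  }

  class RequestCounter extends RequestCounterMBean {
    private val count = new AtomicLong
    def inc(): Long = count.incrementAndGet()
    override def getCount: Long = count.get()
  }

  private val server = ManagementFactory.getPlatformMBeanServer

  // Register the counter under the given (hypothetical) object name.
  def register(name: String, counter: RequestCounter): ObjectName = {
    val objectName = new ObjectName(name)
    server.registerMBean(counter, objectName)
    objectName
  }

  // Read the attribute back through the MBean server, as a JMX client would.
  def read(objectName: ObjectName): Long =
    server.getAttribute(objectName, "Count").asInstanceOf[Long]
}
```

Every read goes through the MBean server, so the value returned is whatever the counter holds at that moment rather than a copy taken at registration time.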

After we discovered that JMX and MBeans gave us what we needed to improve the monitoring of our services and the visibility of their current state, we decided to create a library to make it easy to create MBeans. Using this library we can get a variable or closure into Prometheus with just a few lines of code. You can find this library here.

Example:

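import java.util.concurrent.atomic.AtomicLong

// Thread-safe counter; its current value is read on every scrape
val counter = new AtomicLong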

MetricsMBean.create(
  ObjectName("movio.test:type=RequestCounter"),
  Map(
    "counter" → counter
  )
)

def inc() = counter.incrementAndGet()

The above code creates a metric called “counter” backed by the counter variable; each time the JMX exporter scrapes, it gets the current value of counter. We can do the same thing with a closure, in which case each scrape gets the value the closure currently returns, as shown below:

Example:

def getStorageStatusCode: Long = ???

MetricsMBean.create(
  ObjectName("movio.test:type=StorageStatus"),
  Map(
    "status" → Metric(getStorageStatusCode)
  )
)

Each time the JMX exporter scrapes, it retrieves the current value of getStorageStatusCode and makes it visible to Prometheus under the metric name ‘status’.
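For readers curious how a closure can sit behind a JMX attribute, the following is a rough sketch built on the JDK's DynamicMBean interface. It is not the implementation of our library. ClosureMBean, ClosureMBeanSketch and the object name used in the test are hypothetical, and only Long-valued, read-only metrics are handled.

```scala
import java.lang.management.ManagementFactory
import javax.management.{Attribute, AttributeList, AttributeNotFoundException,
  DynamicMBean, MBeanAttributeInfo, MBeanInfo, ObjectName}

object ClosureMBeanSketch {
  // Hypothetical stand-in: every attribute is a closure that is
  // evaluated each time JMX reads it.
  class ClosureMBean(metrics: Map[String, () => Long]) extends DynamicMBean {
    override def getAttribute(name: String): AnyRef =
      metrics.get(name) match {
        case Some(read) => Long.box(read())
        case None       => throw new AttributeNotFoundException(name)
      }

    override def getAttributes(names: Array[String]): AttributeList = {
      val list = new AttributeList()
      names.filter(metrics.contains).foreach { n =>
        list.add(new Attribute(n, getAttribute(n)))
      }
      list
    }

    // Metrics are read-only in this sketch.
    override def setAttribute(attribute: Attribute): Unit =
      throw new AttributeNotFoundException("metrics are read-only")

    override def setAttributes(attributes: AttributeList): AttributeList =
      new AttributeList()

    override def invoke(action: String, params: Array[AnyRef], signature: Array[String]): AnyRef =
      throw new UnsupportedOperationException(action)

    // Describe one readable, non-writable attribute of type long per metric.
    override def getMBeanInfo(): MBeanInfo = {
      val attrs = metrics.keys.toArray.sorted.map { n =>
        new MBeanAttributeInfo(n, "long", s"metric $n", true, false, false)
      }
      new MBeanInfo(getClass.getName, "closure-backed metrics", attrs, null, null, null)
    }
  }

  private val server = ManagementFactory.getPlatformMBeanServer

  def register(name: String, metrics: Map[String, () => Long]): ObjectName = {
    val objectName = new ObjectName(name)
    server.registerMBean(new ClosureMBean(metrics), objectName)
    objectName
  }

  def read(objectName: ObjectName, attribute: String): Long =
    server.getAttribute(objectName, attribute).asInstanceOf[Long]
}
```

Because the closure is only invoked inside getAttribute, nothing is computed between scrapes; the cost of a metric is paid when something actually reads it.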

With JMX you also get metrics that tell you about the state of the JVM, such as memory usage, garbage collection, and threads. These can be extremely helpful when determining the impact your service has on the resources of the machine it is running on.
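As a small illustration of the JVM metrics mentioned above, this sketch reads a few of them directly from the platform MXBeans in java.lang.management. The metric names in the returned map are invented for the example and are not the names the JMX exporter produces.

```scala
import java.lang.management.ManagementFactory

object JvmMetricsSketch {
  // Take a point-in-time reading of a few JVM-level values.
  def snapshot(): Map[String, Long] = {
    val heap    = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
    val threads = ManagementFactory.getThreadMXBean

    // Sum collection counts across all garbage collectors;
    // a collector reports -1 when the count is undefined.
    var gcCount = 0L
    val gcBeans = ManagementFactory.getGarbageCollectorMXBeans.iterator()
    while (gcBeans.hasNext) gcCount += math.max(0L, gcBeans.next().getCollectionCount)

    Map(
      "jvm_heap_used_bytes"      -> heap.getUsed,                  // illustrative name
      "jvm_threads_current"      -> threads.getThreadCount.toLong, // illustrative name
      "jvm_gc_collections_total" -> gcCount                        // illustrative name
    )
  }
}
```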

Additionally, because we use Kafka at Movio, we instantly get a bunch of metrics about the consumers in our services, as the Kafka client library already exposes them via JMX. Our Kafka cluster also exposes metrics via JMX, such as the broker state, throughput, and health of the partitions in the cluster. The same applies to other libraries which expose metrics via JMX: once you use the library, its metrics become available via JMX.

Exporters

The Prometheus JMX exporter is just one of the many exporters available for Prometheus. Here at Movio we also use a bunch of other exporters to get metrics from different sources, such as the node exporter and the mysqld exporter. We also use a couple developed by Movio developer Braedon Vickers: an Elasticsearch exporter and a MySQL exporter (which allows us to run MySQL queries and expose the results to Prometheus).

Each of these exporters gives us more visibility and monitoring around what’s happening inside our applications and also with our data. Thanks to the data the exporters expose to Prometheus, we are able to create meaningful graphs which let us easily see the current and previous state and pick up on potential issues.

The format which Prometheus parses is quite simple, which makes it easy to create your own exporters. There are client libraries available for most popular programming languages which output your metrics in the format Prometheus can understand. It’s as simple as creating an endpoint and creating counters for your metrics; depending on the library you use, it can be as easy as adding each counter to a registry.
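To show how little is involved, here is a hand-rolled sketch of an exporter that uses only the JDK's built-in HTTP server rather than an official client library. TinyExporterSketch, Counter and the choice of port are assumptions for the example; in practice a client library would do the rendering and registry bookkeeping for you.

```scala
import java.net.InetSocketAddress
import java.nio.charset.StandardCharsets
import java.util.concurrent.atomic.AtomicLong
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}

object TinyExporterSketch {
  // A counter plus the metadata the text exposition format expects.
  final case class Counter(name: String, help: String, value: AtomicLong = new AtomicLong)

  // Render counters as "# HELP", "# TYPE" and one sample line each.
  def render(counters: Seq[Counter]): String =
    counters.map { c =>
      s"# HELP ${c.name} ${c.help}\n# TYPE ${c.name} counter\n${c.name} ${c.value.get()}\n"
    }.mkString

  // Serve the registry on /metrics; the caller picks the port and stops the server.
  def serve(port: Int, registry: Seq[Counter]): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/metrics", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        val body = render(registry).getBytes(StandardCharsets.UTF_8)
        exchange.getResponseHeaders.add("Content-Type", "text/plain; version=0.0.4")
        exchange.sendResponseHeaders(200, body.length.toLong)
        exchange.getResponseBody.write(body)
        exchange.close()
      }
    })
    server.start()
    server
  }
}
```

The registry here is just a Seq passed to serve; the values are read at request time, so each scrape sees the counters as they are at that moment.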

Alerting

Prometheus evaluates alert rules as queries against its own data store and fires an alert if a condition is met. The alert manager (documentation) then handles those alerts and sends notifications; it can integrate with email and services such as PagerDuty and OpsGenie.

With the alert manager we have been able to create alerts to warn us when we are receiving a large amount of abnormal data, which allows us to act before our customers notice. We have also set up the integration between the alert manager and PagerDuty so that alerts which trigger in the alert manager propagate through to PagerDuty. This integration gives us all the features of PagerDuty while letting us keep one source of alerts, which is the alert manager.

Alert rules are defined in a syntax which is easy to understand and write. The Prometheus query language is extremely powerful and provides a lot of functions, all of which can be used in alert rules.

Example:

ALERT email_template_service_sustained_errors
  IF rate(errors_counter_by_host[6m]) > 0
  FOR 10m
  LABELS { service = "email-template-service" }
  ANNOTATIONS {
    summary = "The email template service on {{ $labels.host }} is logging sustained errors",
    dashboard = "http://prometheus.example.com/dashboard/email-template-service",
  }

This definition creates an alert called email_template_service_sustained_errors, which triggers if the rate of the metric named errors_counter_by_host (its per-second average increase over the last 6 minutes) stays greater than 0 for at least 10 minutes. You can read more about the rate function and more about the Prometheus query language here.
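The arithmetic behind a rate over a window can be sketched in a few lines. This is a deliberate simplification and not Prometheus's actual algorithm: it divides the total increase between the first and last sample by the time between them and treats any drop as a counter reset, whereas the real rate function also extrapolates to the window boundaries. Sample and simpleRate are names invented for the example.

```scala
object RateSketch {
  final case class Sample(timestampSeconds: Double, value: Double)

  // Simplified per-second rate over the samples that fall inside one window.
  def simpleRate(samples: Seq[Sample]): Option[Double] =
    if (samples.size < 2) None
    else {
      // Sum the increases between neighbouring samples;
      // a drop is treated as a counter reset to zero.
      val increase = samples.sliding(2).map { case Seq(a, b) =>
        if (b.value >= a.value) b.value - a.value else b.value
      }.sum
      val elapsed = samples.last.timestampSeconds - samples.head.timestampSeconds
      if (elapsed > 0) Some(increase / elapsed) else None
    }
}
```

With one sample a minute over six minutes and a counter that climbs from 0 to 6, the sketch yields 6 / 360 errors per second, which is greater than 0 and so would satisfy the IF condition of the rule above.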

Below is an image of a graph in Grafana for the rate in a 6-minute window (average of the last 6 minutes) of the errors_counter_by_host metric. If the above alert had been defined, it would have triggered, because we can see that the rate (in a 6-minute window) was above 0 for more than 10 minutes, from some time before 13:45 to some time after 13:55.
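The effect of the FOR clause can be simulated with a small sketch. It assumes a fixed list of rule evaluations, each a timestamp in seconds paired with whether the IF condition held, and returns the evaluation times at which the alert would be firing rather than merely pending. ForClauseSketch and firingTimes are hypothetical names, and real evaluation intervals depend on your Prometheus configuration.

```scala
object ForClauseSketch {
  // evaluations: (timestampSeconds, conditionHolds), in time order.
  // Returns the timestamps at which the condition has held
  // continuously for at least forSeconds.
  def firingTimes(evaluations: Seq[(Long, Boolean)], forSeconds: Long): Seq[Long] = {
    var activeSince: Option[Long] = None
    evaluations.flatMap { case (t, holds) =>
      if (!holds) {
        activeSince = None // condition broke: the pending timer resets
        None
      } else {
        val since = activeSince.getOrElse(t)
        activeSince = Some(since)
        if (t - since >= forSeconds) Some(t) else None
      }
    }
  }
}
```

A single evaluation where the condition is false resets the timer, which is why a brief dip to a rate of 0 would delay the alert by another full FOR period.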

In summary

Since we started using Prometheus we have been able to rapidly improve the monitoring and alerting we have over not only our services but also the servers we run them on, as well as getting the bigger picture of how our systems are performing. This wouldn’t have been possible without multiple exporters which constantly collect system and application metrics from multiple sources. In addition, we can easily find relations between metrics, e.g. a spike in memory usage could point to a memory leak in one of our services, which we could then easily identify.

In our next post we will talk about how service discovery works with Prometheus and Kubernetes and how we've implemented it at Movio.

17 May 2016 · 10 min read · Jack Hopner & Jerry Peng

Prometheus: Wielding The Flame
