Metrics Agent

Metrics Agent is a component of NKE that allows you to monitor your applications. Metrics Agent is the successor of our deprecated Prometheus product.

Availability

Metrics Agent is available as an optional service for NKE. It can be deployed on an existing NKE cluster using Cockpit.

Usage

The following sections explain how to use Metrics Agent.

General information about the setup

When a Metrics Agent is created, a new Metrics Agent instance will be deployed in your NKE cluster in the nine-system namespace. The pods will run on the control-plane nodes, leaving your node pools fully available for your applications.

The Metrics Agent is based on Victoria Metrics.

The Metrics Agent configuration is based on the prometheus-operator project. Accordingly, you can use the following resource types in your cluster to define scraping configurations as well as alerting and recording rules:

  • ServiceMonitors
  • PodMonitors
  • PrometheusRules

The Metrics Agent will collect your metrics and send them to our Metrics Cluster, which runs externally from your cluster. This will free up a lot of resources from the control-plane nodes in your NKE cluster compared to the deprecated Prometheus product.

Exporters and Metrics

Metrics Agent collects a set of default metrics automatically. Additional metrics from the following exporters can be enabled optionally:

  • CertManager
  • IngressNginx
  • NodeExporter
  • Kubelet
  • Kubelet cAdvisor
  • KubeStateMetrics
  • Velero

You will need to tell us which of these exporters you want to enable. In the future, you will be able to enable them yourself in Cockpit.

Accessing metrics - Grafana

Create a Grafana instance; depending on which project you create it in, you will see different metrics.

If you create your Grafana instance in the default project (same name as your organization), you will be able to see all metrics in your whole organization.

If you create the instance in any other project, you will only see metrics of the same project.

Instrumenting your application

Before Metrics Agent can scrape metrics from your application, you need to instrument your application to expose metrics in the Prometheus text exposition format. You can find information about how to do this in the official Prometheus documentation.
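As a minimal sketch of what such an endpoint looks like, the following uses only Python's standard library and a hypothetical myapp_http_requests_total counter (real applications would normally use an official Prometheus client library, which handles metric types and escaping for you):

```python
# Sketch: expose metrics in the Prometheus text exposition format.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # hypothetical counter incremented by your application


def render_metrics() -> str:
    """Render metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP myapp_http_requests_total Total HTTP requests handled.",
        "# TYPE myapp_http_requests_total counter",
        f"myapp_http_requests_total {REQUEST_COUNT}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    # Serve on port 8080, matching the "web" port used in the examples below.
    HTTPServer(("", 8080), MetricsHandler).serve_forever()
```

Requesting /metrics on port 8080 then returns the counter in a format Metrics Agent can scrape.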

Adding application metrics to Metrics Agent

Once your application supports metrics, you can use ServiceMonitors or PodMonitors to let Metrics Agent scrape your application's metrics.

ServiceMonitors scrape all pods which are targeted by one or more Services; this is the resource you will use in most cases. Define a label selector in the ServiceMonitor to select the desired Services, and create the ServiceMonitor in the same namespace as the Service(s) it selects. In addition to the label selector, the ServiceMonitor must also carry the label prometheus.nine.ch/<your metrics agent name>: scrape, where the key contains the name of your Metrics Agent instance. Consider the following example ServiceMonitor and Service definition:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: web
---
kind: Service
apiVersion: v1
metadata:
  name: my-app-service
  namespace: my-app
  labels:
    app: my-app
spec:
  selector:
    application: example-app
  ports:
  - name: web
    port: 8080

The ServiceMonitor above selects the Service "my-app-service" because that Service carries the label "app: my-app". Metrics Agent then finds all pods which are targeted by this Service and starts scraping them for metrics on port 8080 (the ServiceMonitor defines the port in the endpoints field).

PodMonitors scrape all pods which are selected by the given label selector. They work very similarly to the ServiceMonitor resource, just without an actual Service resource. You can use a PodMonitor if your application does not otherwise need a Service resource (as is the case for some exporters). The pods should run in the same namespace as the PodMonitor is defined in. Here is an example of a PodMonitor with a corresponding pod:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-pods
  namespace: my-app
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  selector:
    matchLabels:
      application: my-app
  podMetricsEndpoints:
  - port: web
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    application: my-app
  name: my-app
  namespace: my-app
spec:
  containers:
  - image: mycompany/example-app
    name: app
    ports:
    - name: web
      containerPort: 8080

Based on the given PodMonitor resource the Metrics Agent will generate a scrape config which scrapes the shown pod "my-app" on port 8080 for metrics.

Metrics Agent will create a job for every ServiceMonitor or PodMonitor resource you define. It will also add a job label to all scraped metrics which have been gathered in the corresponding job. This can be used to find out from which services or pods a given metric has been scraped.

Use ScrapeConfig to scrape an external target

The ScrapeConfig CRD can be employed to scrape targets outside the Kubernetes cluster or to create scrape configurations that are not achievable with higher-level resources such as ServiceMonitor or PodMonitor. Currently, ScrapeConfig supports a limited set of service discovery mechanisms.

Although numerous options are available (for a comprehensive list, refer to the API documentation), we currently only support static_config and http_sd configurations. The CRD is continually evolving (for now at the v1alpha1 stage), with new features and support for additional service discoveries being added regularly. We need to carefully determine which fields will be useful and need to be maintained in the long term.

static_config example

The following example provides a basic configuration and does not cover all supported options. For example, to scrape the target located at http://metricsagent.demo.do.metricsagent.io:9090, use the following configuration:

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-static-config
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  staticConfigs:
  - labels:
      job: metricsagent
    targets:
    - metricsagent.demo.do.metricsagent.io:9090
Note: The target must be specified as a hostname, not as an HTTP(S) URL. For instance, to scrape the target located at http://metricsagent.demo.do.metricsagent.io:9090, enter metricsagent.demo.do.metricsagent.io:9090 in the targets field.

For further details, refer to the Configuration and the API documentation.

http_sd example

HTTP-based service discovery provides a more generic way to configure static targets and serves as an interface to plug in custom service discovery mechanisms.

It fetches targets from an HTTP endpoint that returns a list of zero or more static_configs. The endpoint must reply with an HTTP 200 response, the Content-Type header must be application/json, and the body must be valid, UTF-8 encoded JSON. If there are no targets to transmit, the endpoint must still return HTTP 200 with an empty list []. Target lists are unordered. See Requirements of HTTP SD endpoints for more information. In general, the response body looks as follows:

[
  {
    "targets": [ "<host>", ... ],
    "labels": {
      "<labelname>": "<labelvalue>", ...
    }
  },
  ...
]

Example response body:

[
  {
    "targets": ["metricsagent.demo.do.metricsagent.io:9090"],
    "labels": {
      "job": "metricsagent",
      "__meta_test_label": "test_label1"
    }
  }
]
Note: The URL of the HTTP SD endpoint is not considered secret. Authentication and any API keys should be passed with the appropriate authentication mechanisms; Metrics Agent supports TLS authentication, basic authentication, OAuth2, and authorization headers.

The endpoint is queried periodically at the specified refresh interval.

The whole list of targets must be returned on every request; there is no support for incremental updates. A Metrics Agent instance does not send its hostname, so an SD endpoint cannot tell whether a request is the first one after a restart.

Each target has a meta label __meta_url during the relabeling phase. Its value is set to the URL from which the target was extracted.
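A minimal HTTP SD endpoint satisfying these requirements could be sketched as follows, using only Python's standard library (the hostnames, labels, and port are hypothetical examples; a real endpoint would query your inventory or service registry):

```python
# Sketch: an HTTP SD endpoint that always returns the full target list
# as application/json with an HTTP 200 response.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def discover_targets() -> list:
    """Return the full target list on every request (no incremental updates)."""
    # Hypothetical targets; in a real setup this would query your inventory.
    return [
        {
            "targets": ["app-1.example.com:9090", "app-2.example.com:9090"],
            "labels": {"job": "my-external-app", "env": "production"},
        }
    ]


class SDHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Always answer HTTP 200 with Content-Type application/json,
        # even when the target list is empty ("[]").
        body = json.dumps(discover_targets()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("", 8000), SDHandler).serve_forever()
```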

A simple example:

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-http-sd
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  httpSDConfigs:
  - url: http://my-external-api/discovery
    refreshInterval: 15s

Metrics Agent caches target lists and continues to use the current list if an error occurs while fetching an updated one. However, the targets list is not preserved across restarts. Therefore, it is crucial to monitor your HTTP service discovery (HTTP SD) endpoints for downtime. During a Metrics Agent restart, which may occur during our regular maintenance window, the cache will be cleared. If the HTTP SD endpoints are also down at this time, you may lose the endpoint target list. For more information, refer to the Requirements of HTTP SD endpoints documentation.

For further details, refer to the Configuration and the API documentation.

Querying for metrics

You can use PromQL to query for metrics; the official Prometheus documentation contains some examples. Querying can be done in Grafana's explore view. When using Grafana, make sure to select the data source matching your Metrics Agent instance. The data source name will be <YOUR PROJECT NAME>/.
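For example (the job and namespace names below are hypothetical, and container_cpu_usage_seconds_total requires the Kubelet cAdvisor exporter to be enabled), typical queries might look like:

```promql
# 1 if the target could be scraped, 0 otherwise
up{job="my-app"}

# per-pod CPU usage rate over the last 5 minutes
sum(rate(container_cpu_usage_seconds_total{namespace="my-app"}[5m])) by (pod)
```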

Adding rules to Metrics Agent

Metrics Agent supports two kinds of rules: recording rules and alerting rules. Both have a similar syntax, but a different use case.

Recording rules can be used to calculate new metrics from already existing ones. This can be useful if you use computationally expensive queries in dashboards. To speed them up you can create a recording rule which will evaluate the query in a defined interval and stores the result as a new metric. You can then use this new metric in your dashboard queries.

Alerting rules allow you to define alert conditions (based on PromQL). When those conditions are true, Metrics Agent will send out an alert to the connected Alertmanager instances. Alertmanager will then send notifications to users about alerts.

When creating alerting or recording rules, please make sure to add the prometheus.nine.ch/<your metrics agent name>: scrape label with the name of your Metrics Agent instance. This will assign the created rule to your Metrics Agent instance.

The following example alerting rule fires once a job can no longer reach its configured pods (targets):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
    role: alert-rules
  name: jobs-check
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: Critical
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

This alerting rule definition triggers an alert once an up metric takes the value 0. The up metric is special: Metrics Agent adds it automatically for every job target (pod). Once a pod can no longer be scraped, the corresponding up metric drops to 0. If it stays at 0 for more than 5 minutes (in this case), Metrics Agent triggers an alert. The specified labels and annotations can be used in Alertmanager to customize your notification messages and routing decisions.

The full spec for the PrometheusRule definition can be found here.

Here is an example of a recording rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
    role: recording-rules
  name: cpu-per-namespace-recording
spec:
  groups:
  - name: ./example.rules
    rules:
    - record: namespace:container_cpu_usage_seconds_total:sum_rate
      expr: sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[5m])) by (namespace)

This recording rule creates a new metric called namespace:container_cpu_usage_seconds_total:sum_rate which shows the summed CPU usage of all containers per namespace. This metric can easily be shown in a Grafana dashboard to get an overview of the CPU usage of all pods per namespace.

The kubernetes-mixins project contains sample alerts and rules for various exporters. It is a good place to get some inspiration for alerting and recording rules.

Video Guide

Check out our video guide series for GKE Application Monitoring. While the videos use our GKE product with Prometheus, the concepts are the same.