Metrics Agent

Metrics Agent is a component of NKE that is used to monitor applications. It is the successor to Prometheus.

Usage

A new Metrics Agent instance is deployed in the nine-system namespace upon creation. The pods run on control-plane nodes, so node pools remain fully available for applications.

The Metrics Agent is based on Victoria Metrics.

Configuration is managed via prometheus-operator project resources. The following resource types are used for scraping configurations, alerting, and recording rules:

  • ServiceMonitors
  • PodMonitors
  • PrometheusRules

Metrics are collected and sent to an external Metrics Cluster. This offloads resources from the NKE cluster control-plane compared to the previous Prometheus product.

Exporters and Metrics

A set of default metrics is collected automatically. Additional metrics from the following exporters can be optionally enabled:

  • CertManager
  • IngressNginx
  • NodeExporter
  • Kubelet
  • Kubelet cAdvisor
  • KubeStateMetrics
  • Velero

Please contact us to enable specific exporters. Future updates will allow self-service activation via Cockpit.

Visualizing Metrics With Grafana

note

Grafana Alerting is currently not supported when using Metrics Agent. Please use Alertmanager instead.

Create a Grafana instance to visualize metrics.

If the Grafana instance is created in the default project (same name as the organization), all metrics in the organization are visible.

If the instance is created in any other project, only metrics of the same project are visible.

Instrumenting Your Application

To enable metric scraping, the application must be instrumented to export metrics in a supported format. Details are available in the official Prometheus documentation.

Once metrics support is added, ServiceMonitors or PodMonitors are used to configure scraping.

ServiceMonitors scrape all pods targeted by one or more services. This resource is used in most cases. A label selector must be defined in the ServiceMonitor to find the desired services, and the ServiceMonitor should be created in the same namespace as the service(s) it selects. In addition to the label selector, the ServiceMonitor must carry the label prometheus.nine.ch/<your metrics agent name>: scrape, where <your metrics agent name> is the name of your Metrics Agent instance. Consider the following example ServiceMonitor and Service definitions:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: web
---
kind: Service
apiVersion: v1
metadata:
  name: my-app-service
  namespace: my-app
  labels:
    app: my-app
spec:
  selector:
    application: example-app
  ports:
  - name: web
    port: 8080

The given ServiceMonitor definition selects the service "my-app-service" because the label "app: my-app" exists on that service. Metrics Agent then looks up all pods targeted by this service and scrapes them for metrics on port 8080 (the service port named web, referenced in the ServiceMonitor's endpoints field).
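
For completeness, the pods behind my-app-service could come from a Deployment such as the following sketch (the Deployment name and replica count are illustrative and not part of the original example); its pod template carries the application: example-app label matched by the Service selector and exposes the container port named web on 8080:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app               # illustrative name
  namespace: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      application: example-app
  template:
    metadata:
      labels:
        application: example-app  # matched by the Service selector above
    spec:
      containers:
      - name: app
        image: mycompany/example-app
        ports:
        - name: web               # port name referenced by the Service
          containerPort: 8080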

PodMonitors scrape all pods selected by the given label selector. This works similarly to the ServiceMonitor resource, but without an actual Service resource. The PodMonitor resource can be used if the application does not need a Service (as is the case for some exporters). The pods should run in the same namespace as the PodMonitor. Here is an example of a PodMonitor with a corresponding pod:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-pods
  namespace: my-app
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  selector:
    matchLabels:
      application: my-app
  podMetricsEndpoints:
  - port: web
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    application: my-app
  name: my-app
  namespace: my-app
spec:
  containers:
  - image: mycompany/example-app
    name: app
    ports:
    - name: web
      containerPort: 8080

Based on the given PodMonitor resource, the Metrics Agent generates a scrape config which scrapes the shown pod "my-app" on port 8080 for metrics.

Metrics Agent creates a job for every ServiceMonitor or PodMonitor resource defined. A job label is added to all scraped metrics gathered in the corresponding job. This can be used to identify from which services or pods a given metric was scraped.
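
For example, grouping the up metric (described further below) by this label shows how many scrape targets each job currently has; this is a generic query sketch, and the exact job values depend on your ServiceMonitor and PodMonitor definitions:

count by (job) (up)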

Scraping External Targets

The ScrapeConfig CRD is used to scrape targets outside the Kubernetes cluster or to create scrape configurations not achievable with higher-level resources such as ServiceMonitor or PodMonitor. Currently, ScrapeConfig supports a limited set of service discovery mechanisms.

Although numerous options are available (for a comprehensive list, refer to the API documentation), only static_config and http_sd configurations are currently supported. The CRD is continually evolving (for now at the v1alpha1 stage), with new features and support for additional service discoveries being added regularly.

static_config example

The following example shows a basic configuration and does not cover all supported options. To scrape the target located at http://metricsagent.demo.do.metricsagent.io:9090, use the following configuration:

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-static-config
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  staticConfigs:
  - labels:
      job: metricsagent
    targets:
    - metricsagent.demo.do.metricsagent.io:9090
note

The target must be specified as a hostname, not as an HTTP(S) URL. For instance, to scrape the target located at http://metricsagent.demo.do.metricsagent.io:9090, enter metricsagent.demo.do.metricsagent.io:9090 in the targets field.

For further details, refer to the Configuration and the API documentation.

http_sd example

HTTP-based service discovery provides a generic way to configure static targets and serves as an interface to plug in custom service discovery mechanisms.

It fetches targets from an HTTP endpoint containing a list of zero or more static_configs. The endpoint must reply with an HTTP 200 response, the Content-Type header must be application/json, and the body must be valid, UTF-8 encoded JSON. If there are no targets to transmit, HTTP 200 must still be returned with an empty list []. Target lists are unordered. See Requirements of HTTP SD endpoints for more information. In general, the response body looks as follows:

[
  {
    "targets": [ "<host>", ... ],
    "labels": {
      "<labelname>": "<labelvalue>", ...
    }
  },
  ...
]

Example response body:

[
  {
    "targets": ["metricsagent.demo.do.metricsagent.io:9090"],
    "labels": {
      "job": "metricsagent",
      "__meta_test_label": "test_label1"
    }
  }
]
note

The URL to the HTTP SD is not considered secret. Authentication and any API keys should be passed with the appropriate authentication mechanisms. Metrics Agent supports TLS authentication, basic authentication, OAuth2, and authorization headers.

The endpoint is queried periodically at the specified refresh interval.

The whole list of targets must be returned on every scrape. There is no support for incremental updates. A Metrics Agent instance does not send its hostname, and it is not possible for an SD endpoint to know whether the SD request is the first one after a restart.

Each target has a meta label __meta_url during the relabeling phase. Its value is set to the URL from which the target was extracted.

A simple example:

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-http-sd
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  httpSDConfigs:
  - url: http://my-external-api/discovery
    refreshInterval: 15s
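
Building on this simple example, the following sketch illustrates the note above on authentication together with the __meta_url meta label: it adds HTTP basic authentication backed by a Kubernetes Secret and a relabeling that copies __meta_url into a regular label. The Secret name, its keys, and the target label name are hypothetical, and the basicAuth and relabelings fields are assumed to be available in the ScrapeConfig CRD version used by your cluster:

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-http-sd
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  httpSDConfigs:
  - url: http://my-external-api/discovery
    refreshInterval: 15s
    basicAuth:                        # credentials read from a Secret in the same namespace (hypothetical)
      username:
        name: http-sd-credentials
        key: username
      password:
        name: http-sd-credentials
        key: password
  relabelings:
  - sourceLabels: [__meta_url]        # URL the target was discovered from
    targetLabel: discovery_url        # hypothetical label name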

Metrics Agent caches target lists and continues to use the current list if an error occurs while fetching an updated one. However, the targets list is not preserved across restarts. Therefore, it is crucial to monitor HTTP service discovery (HTTP SD) endpoints for downtime. During a Metrics Agent restart, which may occur during regular maintenance windows, the cache is cleared. If the HTTP SD endpoints are also down at this time, the endpoint target list may be lost. For more information, refer to the Requirements of HTTP SD endpoints documentation.

For further details, refer to the Configuration and the API documentation.

Querying Metrics

PromQL is used to query metrics; some examples can be found in the official Prometheus documentation. Querying can be done in Grafana's Explore view. When using Grafana, make sure to select the data source matching your Metrics Agent instance. The data source name starts with <YOUR PROJECT NAME>/.
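
To illustrate, here are two query sketches based on metrics that appear elsewhere in this document; the second one requires the Kubelet cAdvisor metrics listed under Exporters and Metrics to be enabled:

# list all scrape targets that are currently down, together with their job label
up == 0

# CPU usage per namespace, averaged over the last 5 minutes
sum by (namespace) (rate(container_cpu_usage_seconds_total{image!=""}[5m]))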

Adding Rules

Metrics Agent supports two kinds of rules: recording rules and alerting rules. Both have a similar syntax, but a different use case.

Recording rules are used to calculate new metrics from already existing ones. This can be useful for computationally expensive queries in dashboards. To speed them up, a recording rule can be created which evaluates the query in a defined interval and stores the result as a new metric. This new metric can then be used in dashboard queries.

Alerting rules are used to define alert conditions (based on PromQL). When those conditions are true, Metrics Agent sends an alert to the connected Alertmanager instances. Alertmanager then sends notifications to users about alerts.

When creating alerting or recording rules, the prometheus.nine.ch/<your metrics agent name>: scrape label must be added with the name of the Metrics Agent instance. This assigns the created rule to the Metrics Agent instance.

The following example alerting rule fires once a job can no longer reach the configured pods (targets):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
    role: alert-rules
  name: jobs-check
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: Critical
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

This alerting rule definition triggers an alert once an up metric gets a value of 0. The up metric is special: it is added by Metrics Agent itself for every job target (pod). Once a pod can no longer be scraped, the corresponding up metric drops to 0. If the up metric stays at 0 for more than 5 minutes (in this case), Metrics Agent triggers an alert. The specified labels and annotations can be used in Alertmanager to customize notification messages and routing decisions.
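
As an illustration of such a routing decision, the following is a generic Alertmanager configuration sketch, not specific to Metrics Agent; the receiver names and webhook URL are hypothetical, and how the Alertmanager configuration is managed in your setup may differ. It routes alerts carrying the severity: Critical label from the rule above to a dedicated receiver:

route:
  receiver: default                 # fallback receiver for all other alerts
  routes:
  - matchers:
    - severity="Critical"           # matches the label set by the InstanceDown rule
    receiver: critical-alerts
receivers:
- name: default
- name: critical-alerts
  webhook_configs:
  - url: https://example.com/alert-hook   # hypothetical notification endpoint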

The full spec for the PrometheusRule definition can be found here.

Here is an example of a recording rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
    role: recording-rules
  name: cpu-per-namespace-recording
spec:
  groups:
  - name: ./example.rules
    rules:
    - record: namespace:container_cpu_usage_seconds_total:sum_rate
      expr: sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[5m])) by (namespace)

This recording rule creates a new metric called namespace:container_cpu_usage_seconds_total:sum_rate, which shows the summed CPU usage of all containers per namespace. This metric can easily be shown in a Grafana dashboard to get an overview of the CPU usage of all pods per namespace.
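
In a dashboard panel, the recorded metric can then be queried directly instead of the original expression, for example (the namespace value is hypothetical):

namespace:container_cpu_usage_seconds_total:sum_rate{namespace="my-app"}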

The kubernetes-mixins project contains sample alerts and rules for various exporters. It is a good place to get some inspiration for alerting and recording rules.

Video Guide

Check out our video guide series for GKE Application Monitoring. While the videos were recorded on our GKE product with Prometheus, the concepts are the same.