Metrics Agent

Metrics Agent is a component of NKE that is used to monitor applications. It is the successor to Prometheus.

Usage

A new Metrics Agent instance is deployed in the nine-system namespace upon creation. The pods run on control-plane nodes, so node pools remain fully available for applications.

The Metrics Agent is based on Victoria Metrics.

Configuration is managed via prometheus-operator project resources. The following resource types are used for scraping configurations, alerting, and recording rules:

  • ServiceMonitors
  • PodMonitors
  • PrometheusRules

Metrics are collected and sent to an external Metrics Cluster. This offloads resources from the NKE cluster control-plane compared to the previous Prometheus product.

Exporters and Metrics

A set of default metrics is collected automatically. Additional metrics from the following exporters can be optionally enabled:

  • CertManager
  • IngressNginx
  • NodeExporter
  • Kubelet
  • Kubelet cAdvisor
  • KubeStateMetrics
  • Velero

Please contact us to enable specific exporters. Future updates will allow self-service activation via Cockpit.

Visualizing Metrics With Grafana

note

Grafana Alerting is currently not supported when using Metrics Agent. Please use Alertmanager instead.

Create a Grafana instance to visualize metrics.

If the Grafana instance is created in the default project (same name as the organization), all metrics in the organization are visible.

If the instance is created in any other project, only metrics of the same project are visible.

Instrumenting Your Application

To enable metric scraping, the application must be instrumented to export metrics in a supported format. Details are available in the official Prometheus documentation.

Once metrics support is added, ServiceMonitors or PodMonitors are used to configure scraping.

ServiceMonitors scrape all pods targeted by one or more services. This resource is used in most cases. A label selector must be defined in the ServiceMonitor to find the desired services, and the ServiceMonitor should be created in the same namespace as the service(s) it selects. In addition to the label selector, the ServiceMonitor must carry the label prometheus.nine.ch/<your metrics agent name>: scrape, where <your metrics agent name> is the name of your Metrics Agent instance. Consider the following example ServiceMonitor and Service definitions:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: web
---
kind: Service
apiVersion: v1
metadata:
  name: my-app-service
  namespace: my-app
  labels:
    app: my-app
spec:
  selector:
    application: example-app
  ports:
  - name: web
    port: 8080

The given ServiceMonitor definition selects the service "my-app-service" because the label "app: my-app" exists on that service. Metrics Agent then looks up all pods targeted by this service and scrapes them for metrics on port 8080 (the service port named web, referenced in the ServiceMonitor's endpoints field).
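
For completeness, the pods behind my-app-service could come from a Deployment such as the following sketch (the Deployment name and replica count are illustrative and not part of the original example); its pod template carries the application: example-app label matched by the Service selector and exposes the container port named web on 8080:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app               # illustrative name
  namespace: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      application: example-app
  template:
    metadata:
      labels:
        application: example-app  # matched by the Service selector above
    spec:
      containers:
      - name: app
        image: mycompany/example-app
        ports:
        - name: web               # port name referenced by the Service
          containerPort: 8080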

PodMonitors scrape all pods selected by the given label selector. This works similarly to the ServiceMonitor resource, but without an actual Service resource. The PodMonitor resource can be used if the application does not need a Service (as is the case for some exporters). The pods should run in the same namespace as the PodMonitor. Here is an example of a PodMonitor with a corresponding pod:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-pods
  namespace: my-app
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  selector:
    matchLabels:
      application: my-app
  podMetricsEndpoints:
  - port: web
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    application: my-app
  name: my-app
  namespace: my-app
spec:
  containers:
  - image: mycompany/example-app
    name: app
    ports:
    - name: web
      containerPort: 8080

Based on the given PodMonitor resource, the Metrics Agent generates a scrape config which scrapes the shown pod "my-app" on port 8080 for metrics.

Metrics Agent creates a job for every ServiceMonitor or PodMonitor resource defined. A job label is added to all scraped metrics gathered in the corresponding job. This can be used to identify from which services or pods a given metric was scraped.
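
For example, grouping the up metric (described further below) by this label shows how many scrape targets each job currently has; this is a generic query sketch, and the exact job values depend on your ServiceMonitor and PodMonitor definitions:

count by (job) (up)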

Scraping External Targets

The ScrapeConfig CRD is used to scrape targets outside the Kubernetes cluster or to create scrape configurations not achievable with higher-level resources such as ServiceMonitor or PodMonitor. Currently, ScrapeConfig supports a limited set of service discovery mechanisms.

Although numerous options are available (for a comprehensive list, refer to the API documentation), only static_config and http_sd configurations are currently supported. The CRD is continually evolving (for now at the v1alpha1 stage), with new features and support for additional service discoveries being added regularly.

static_config example

The following example shows a basic configuration and does not cover all supported options. To scrape the target located at http://metricsagent.demo.do.metricsagent.io:9090, use the following configuration:

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-static-config
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  staticConfigs:
  - labels:
      job: metricsagent
    targets:
    - metricsagent.demo.do.metricsagent.io:9090
note

The target must be specified as a hostname, not as an HTTP(S) URL. For instance, to scrape the target located at http://metricsagent.demo.do.metricsagent.io:9090, enter metricsagent.demo.do.metricsagent.io:9090 in the targets field.

For further details, refer to the Configuration and the API documentation.

http_sd example

HTTP-based service discovery provides a generic way to configure static targets and serves as an interface to plug in custom service discovery mechanisms.

It fetches targets from an HTTP endpoint containing a list of zero or more static_configs. The endpoint must reply with an HTTP 200 response, the Content-Type header must be application/json, and the body must be valid, UTF-8 encoded JSON. If there are no targets to transmit, HTTP 200 must still be returned with an empty list []. Target lists are unordered. See Requirements of HTTP SD endpoints for more information. In general, the response body looks as follows:

[
  {
    "targets": [ "<host>", ... ],
    "labels": {
      "<labelname>": "<labelvalue>", ...
    }
  },
  ...
]

Example response body:

[
  {
    "targets": ["metricsagent.demo.do.metricsagent.io:9090"],
    "labels": {
      "job": "metricsagent",
      "__meta_test_label": "test_label1"
    }
  }
]
note

The URL to the HTTP SD is not considered secret. Authentication and any API keys should be passed with the appropriate authentication mechanisms. Metrics Agent supports TLS authentication, basic authentication, OAuth2, and authorization headers.

The endpoint is queried periodically at the specified refresh interval.

The whole list of targets must be returned on every scrape. There is no support for incremental updates. A Metrics Agent instance does not send its hostname, and it is not possible for an SD endpoint to know whether the SD request is the first one after a restart.

Each target has a meta label __meta_url during the relabeling phase. Its value is set to the URL from which the target was extracted.

A simple example:

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-http-sd
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  httpSDConfigs:
  - url: http://my-external-api/discovery
    refreshInterval: 15s
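
Building on this simple example, the following sketch illustrates the note above on authentication together with the __meta_url meta label: it adds HTTP basic authentication backed by a Kubernetes Secret and a relabeling that copies __meta_url into a regular label. The Secret name, its keys, and the target label name are hypothetical, and the basicAuth and relabelings fields are assumed to be available in the ScrapeConfig CRD version used by your cluster:

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-http-sd
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  httpSDConfigs:
  - url: http://my-external-api/discovery
    refreshInterval: 15s
    basicAuth:                        # credentials read from a Secret in the same namespace (hypothetical)
      username:
        name: http-sd-credentials
        key: username
      password:
        name: http-sd-credentials
        key: password
  relabelings:
  - sourceLabels: [__meta_url]        # URL the target was discovered from
    targetLabel: discovery_url        # hypothetical label name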

Metrics Agent caches target lists and continues to use the current list if an error occurs while fetching an updated one. However, the targets list is not preserved across restarts. Therefore, it is crucial to monitor HTTP service discovery (HTTP SD) endpoints for downtime. During a Metrics Agent restart, which may occur during regular maintenance windows, the cache is cleared. If the HTTP SD endpoints are also down at this time, the endpoint target list may be lost. For more information, refer to the Requirements of HTTP SD endpoints documentation.

For further details, refer to the Configuration and the API documentation.

Querying Metrics

PromQL is used to query metrics; some examples can be found in the official Prometheus documentation. Querying can be done in Grafana's Explore view. When using Grafana, make sure to select the data source matching your Metrics Agent instance. The data source name starts with <YOUR PROJECT NAME>/.
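
To illustrate, here are two query sketches based on metrics that appear elsewhere in this document; the second one requires the Kubelet cAdvisor metrics listed under Exporters and Metrics to be enabled:

# list all scrape targets that are currently down, together with their job label
up == 0

# CPU usage per namespace, averaged over the last 5 minutes
sum by (namespace) (rate(container_cpu_usage_seconds_total{image!=""}[5m]))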

Adding Rules

Metrics Agent supports two kinds of rules: recording rules and alerting rules. Both have a similar syntax, but a different use case.

Recording rules are used to calculate new metrics from already existing ones. This can be useful for computationally expensive queries in dashboards. To speed them up, a recording rule can be created which evaluates the query in a defined interval and stores the result as a new metric. This new metric can then be used in dashboard queries.

Alerting rules are used to define alert conditions (based on PromQL). When those conditions are true, Metrics Agent sends an alert to the connected Alertmanager instances. Alertmanager then sends notifications to users about alerts.

When creating alerting or recording rules, the prometheus.nine.ch/<your metrics agent name>: scrape label must be added with the name of the Metrics Agent instance. This assigns the created rule to the Metrics Agent instance.

The following example alerting rule fires once a job can no longer reach the configured pods (targets):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
    role: alert-rules
  name: jobs-check
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: Critical
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

This alerting rule definition triggers an alert once an up metric gets a value of 0. The up metric is special: it is added by Metrics Agent itself for every job target (pod). Once a pod can no longer be scraped, the corresponding up metric drops to 0. If the up metric stays at 0 for more than 5 minutes (in this case), Metrics Agent triggers an alert. The specified labels and annotations can be used in Alertmanager to customize notification messages and routing decisions.
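
As an illustration of such a routing decision, the following is a generic Alertmanager configuration sketch, not specific to Metrics Agent; the receiver names and webhook URL are hypothetical, and how the Alertmanager configuration is managed in your setup may differ. It routes alerts carrying the severity: Critical label from the rule above to a dedicated receiver:

route:
  receiver: default                 # fallback receiver for all other alerts
  routes:
  - matchers:
    - severity="Critical"           # matches the label set by the InstanceDown rule
    receiver: critical-alerts
receivers:
- name: default
- name: critical-alerts
  webhook_configs:
  - url: https://example.com/alert-hook   # hypothetical notification endpoint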

The full spec for the PrometheusRule definition can be found here.

Here is an example of a recording rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
    role: recording-rules
  name: cpu-per-namespace-recording
spec:
  groups:
  - name: ./example.rules
    rules:
    - record: namespace:container_cpu_usage_seconds_total:sum_rate
      expr: sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[5m])) by (namespace)

This recording rule creates a new metric called namespace:container_cpu_usage_seconds_total:sum_rate, which shows the summed CPU usage of all containers per namespace. This metric can easily be shown in a Grafana dashboard to get an overview of the CPU usage of all pods per namespace.
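
In a dashboard panel, the recorded metric can then be queried directly instead of the original expression, for example (the namespace value is hypothetical):

namespace:container_cpu_usage_seconds_total:sum_rate{namespace="my-app"}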

The kubernetes-mixins project contains sample alerts and rules for various exporters. It is a good place to get some inspiration for alerting and recording rules.

Video Guide

Check out our video guide series for GKE Application Monitoring. While the videos were recorded on our GKE product with Prometheus, the concepts are the same.