Metrics Agent
Metrics Agent is a component of NKE that monitors applications. It succeeds Prometheus.
Usage
NKE deploys a new Metrics Agent instance in the nine-system namespace upon creation. The pods run on control-plane nodes, so node pools remain fully available for applications.
The Metrics Agent is based on Victoria Metrics.
Manage configuration via prometheus-operator project resources. Use the following resource types to configure scraping, alerting, and recording rules:
- ServiceMonitors
- PodMonitors
- PrometheusRules
The agent collects metrics and sends them to an external Metrics Cluster. This offloads resources from the NKE cluster control-plane compared to the previous Prometheus product.
Migrating from Prometheus
To ensure a seamless migration, name your Metrics Agent the same as your existing Prometheus instance. This eliminates the need to update labels on ServiceMonitors and PodMonitors.
If you're currently using Prometheus and want to migrate to Metrics Agent, the process is straightforward:
1. Select your Kubernetes Cluster in Cockpit, open the Metrics tab, and note your current Prometheus name.
2. Click on Add Metrics Agent and define the same name as your existing Prometheus instance, so that Metrics Agent automatically picks up all existing ServiceMonitors and PodMonitors without any configuration changes. If you use a different name, you need to update the label on your existing ServiceMonitors and PodMonitors to match the new Metrics Agent name. For example, if your Prometheus instance was named prometheus and you create a Metrics Agent named metrics-agent, change the label from prometheus.nine.ch/prometheus: scrape to prometheus.nine.ch/metrics-agent: scrape (see the sketch after this list).
3. Give Metrics Agent some time to discover the monitors and start scraping metrics.
4. Update the data source in Grafana to point to the new Metrics Agent.
5. After you verify that metrics collection is working correctly, delete the old Prometheus instance.
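As a minimal sketch, this is what the updated label would look like on an existing ServiceMonitor after renaming, assuming a Metrics Agent named metrics-agent (the monitor name my-app is illustrative):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    # replaces the old prometheus.nine.ch/prometheus: scrape label
    prometheus.nine.ch/metrics-agent: scrape
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: web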
Exporters and Metrics
The agent automatically collects a set of default metrics. You can optionally enable additional metrics from the following exporters:
- CertManager
- IngressNginx
- NodeExporter
- Kubelet
- Kubelet cAdvisor
- KubeStateMetrics
- Velero
Please contact us to enable specific exporters. Future updates will allow self-service activation via Cockpit.
Visualizing Metrics with Grafana
Create a Grafana instance to visualize metrics. Note that Metrics Agent does not currently support Grafana Alerting; use Alertmanager instead.
If you create the Grafana instance in the default project (the project with the same name as the organization), all metrics in the organization are visible. If you create the instance in any other project, only metrics of the same project are visible.
Instrumenting Your Application
To enable metric scraping, you must instrument the application to export metrics in a supported format. For details, see the official Prometheus documentation.
After you add metrics support, use ServiceMonitors or PodMonitors to configure scraping.
ServiceMonitors scrape all pods targeted by one or more services. Use this resource in most cases. Define a label selector in the ServiceMonitor to find the desired services. Create the ServiceMonitor in the same namespace as the service(s) it selects. In addition to the label selector, set the label prometheus.nine.ch/<your metrics agent name>: scrape with the name of the Metrics Agent instance. Consider the following example ServiceMonitor and Service definition:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: web
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
  namespace: my-app
  labels:
    app: my-app
spec:
  selector:
    application: example-app
  ports:
  - name: web
    port: 8080
The given ServiceMonitor definition selects the service "my-app-service" because the label "app: my-app" exists on that service. Metrics Agent then searches for all pods targeted by this service and scrapes them for metrics on port 8080 (the ServiceMonitor references the named port web in its endpoints field, which the Service maps to port 8080).
PodMonitors scrape all pods selected by the given label selector. This works similarly to the ServiceMonitor resource, but without an actual Service resource. Use a PodMonitor if the application does not need a Service (like some exporters). Run the pods in the same namespace where you define the PodMonitor. Here is an example of a PodMonitor with a corresponding pod:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-pods
  namespace: my-app
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  selector:
    matchLabels:
      application: my-app
  podMetricsEndpoints:
  - port: web
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    application: my-app
  name: my-app
  namespace: my-app
spec:
  containers:
  - image: mycompany/example-app
    name: app
    ports:
    - name: web
      containerPort: 8080
Based on the given PodMonitor resource, the Metrics Agent generates a scrape config which scrapes the shown pod "my-app" on port 8080 for metrics.
Metrics Agent creates a job for every ServiceMonitor or PodMonitor resource defined. The agent adds a job label to all scraped metrics gathered in the corresponding job. Use this label to identify from which services or pods a given metric was scraped.
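For example, to inspect only the metrics gathered through the ServiceMonitor shown above, you can filter on the job label. A minimal PromQL sketch, assuming the prometheus-operator default of using the Service name as the job value (this can be changed via the jobLabel field):

# scrape targets discovered through the my-app-service Service
up{job="my-app-service"}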
Scraping External Targets
Use the ScrapeConfig CRD to scrape targets outside the Kubernetes cluster or to create scrape configurations not achievable with higher-level resources such as ServiceMonitor or PodMonitor. Currently, ScrapeConfig supports a limited set of service discovery mechanisms.
Although numerous options are available (for a comprehensive list, refer to the API documentation), only static_config and http_sd configurations are currently supported. The CRD is still evolving (currently at the v1alpha1 stage) and regularly gains new features and support for additional service discovery mechanisms.
static_config Example
The following example shows a basic configuration and does not cover all supported options. To scrape the target located at http://metricsagent.demo.do.metricsagent.io:9090, use the following configuration:
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-static-config
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  staticConfigs:
  - labels:
      job: metricsagent
    targets:
    - metricsagent.demo.do.metricsagent.io:9090
Specify the target as a hostname, not as an HTTP(S) URL. For instance, to scrape the target located at http://metricsagent.demo.do.metricsagent.io:9090, enter metricsagent.demo.do.metricsagent.io:9090 in the targets field.
For further details, refer to the Configuration and the API documentation.
http_sd Example
HTTP-based service discovery provides a generic way to configure static targets and serves as an interface to plug in custom service discovery mechanisms.
It fetches targets from an HTTP endpoint containing a list of zero or more static_configs. The endpoint must meet the following requirements:
- Reply with an HTTP 200 response
- Set the HTTP header Content-Type to application/json
- Always respond with a JSON array
- Format the response in UTF-8
- If no targets exist, also emit HTTP 200 with an empty list []
- Target lists are unordered
See Requirements of HTTP SD endpoints for more information. In general, the content of the answer is as follows:
[
  {
    "targets": [ "<host>", ... ],
    "labels": {
      "<labelname>": "<labelvalue>", ...
    }
  },
  ...
]
Example response body:
[
  {
    "targets": ["metricsagent.demo.do.metricsagent.io:9090"],
    "labels": {
      "job": "metricsagent",
      "__meta_test_label": "test_label1"
    }
  }
]
The HTTP SD URL is not treated as a secret. Pass credentials and API keys via the appropriate authentication mechanisms; Metrics Agent supports TLS authentication, basic authentication, OAuth2, and authorization headers.
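As a sketch of what authenticated HTTP SD could look like, the following uses the basicAuth field of httpSDConfigs; the Secret sd-credentials and its keys are hypothetical:

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-authenticated-http-sd
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  httpSDConfigs:
  - url: http://my-external-api/discovery
    basicAuth:
      # hypothetical Secret in the same namespace holding the credentials
      username:
        name: sd-credentials
        key: username
      password:
        name: sd-credentials
        key: password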
The endpoint is queried periodically at the specified refresh interval.
Return the whole list of targets on every fetch; Metrics Agent does not support incremental updates. A Metrics Agent instance does not send its hostname, and SD endpoints cannot determine whether an SD request is the first one after a restart.
Each target has a meta label __meta_url during the relabeling phase. The value contains the URL from which the target was extracted.
A simple example:
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-http-sd
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  httpSDConfigs:
  - url: http://my-external-api/discovery
    refreshInterval: 15s
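Building on this, here is a minimal sketch that uses the __meta_url meta label mentioned above to keep the discovery URL as a regular label on every target (relabelings is part of the ScrapeConfig spec; the label name sd_url is an arbitrary choice):

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: my-http-sd-with-url-label
  namespace: my-namespace
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
spec:
  httpSDConfigs:
  - url: http://my-external-api/discovery
  relabelings:
  # copy the discovery URL into a label visible on the scraped metrics
  - sourceLabels: [__meta_url]
    targetLabel: sd_url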
Metrics Agent caches target lists and continues to use the current list if an error occurs while fetching an updated one. However, restarts do not preserve the target list, so monitor your HTTP service discovery (HTTP SD) endpoints for downtime: when the agent restarts (which may happen during regular maintenance windows), it clears the cache, and if the HTTP SD endpoints are down at that moment, the endpoint target list is lost. For more information, refer to the Requirements of HTTP SD endpoints documentation.
For further details, refer to the Configuration and the API documentation.
Querying Metrics
Use PromQL to query metrics; see examples on the official Prometheus page. You can query metrics in Grafana's Explore view. When using Grafana, select the data source matching your Metrics Agent instance; the data source name begins with <YOUR PROJECT NAME>/.
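As a small illustrative query, reusing the example application from earlier (the namespace value is an assumption):

# scrape health of all targets in the my-app namespace
up{namespace="my-app"}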
Adding Rules
Metrics Agent supports two kinds of rules: recording rules and alerting rules. Both have a similar syntax, but a different use case.
Recording rules calculate new metrics from existing ones. This is useful for computationally expensive queries in dashboards. To speed them up, create a recording rule that evaluates the query at a defined interval and stores the result as a new metric. Use this new metric in dashboard queries.
Alerting rules define alert conditions (based on PromQL). When those conditions are true, Metrics Agent sends an alert to the connected Alertmanager instances. Alertmanager then sends notifications to users about alerts.
When creating alerting or recording rules, add the prometheus.nine.ch/<your metrics agent name>: scrape label with the name of the Metrics Agent instance. This assigns the created rule to the Metrics Agent instance.
The following example alerting rule fires once a job can no longer reach its configured pods (targets):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
    role: alert-rules
  name: jobs-check
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: Critical
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
This alerting rule definition triggers an alert once an up metric gets a value of 0. The up metric is special because Metrics Agent itself adds it for every job target (pod). When a pod can no longer be scraped, the corresponding up metric turns to 0. If the up metric is 0 for more than 5 minutes (in this case), Metrics Agent triggers an alert. Use the specified "labels" and "annotations" in Alertmanager to customize notification messages and routing decisions.
See the full spec for the PrometheusRule definition in the prometheus-operator documentation.
Here is an example of a recording rule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus.nine.ch/mymetricsagent01: scrape
    role: recording-rules
  name: cpu-per-namespace-recording
spec:
  groups:
  - name: ./example.rules
    rules:
    - record: namespace:container_cpu_usage_seconds_total:sum_rate
      expr: sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[5m])) by (namespace)
This recording rule creates a new metric called namespace:container_cpu_usage_seconds_total:sum_rate, which shows the summed CPU usage of all containers per namespace. You can display this metric in a Grafana dashboard to get an overview of the CPU usage of all pods per namespace.
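A dashboard panel can then query the precomputed metric directly, for example (the namespace value is illustrative):

namespace:container_cpu_usage_seconds_total:sum_rate{namespace="my-app"}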
The kubernetes-mixins project contains sample alerts and rules for various exporters. It is a good place to get some inspiration for alerting and recording rules.
Limitations
Metrics Agent is limited to 100,000 unique time series per day. This limit prevents our Global Metrics cluster from becoming congested.
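To get a rough sense of where you stand, you can count the series the agent currently exposes. This is only an approximation, as it counts series present right now rather than unique series over a whole day:

# number of time series currently visible to this agent
count({__name__=~".+"})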
Documentation and Links
Video Guide
Check out our video guide series on GKE Application Monitoring. Although the videos were recorded on our GKE product with Prometheus, the concepts are the same.
- Part 1: The Basics
- Part 2: Service Monitors
- Part 3: Blackbox Exporter