Application Monitoring
Application Monitoring is a component of nine Managed GKE that allows you to monitor your applications in a self-service way.
Video Guide
Check out our video guide series for Application Monitoring.
Details
With application monitoring, nine provides a complete monitoring solution for you with the following features:
- managed Prometheus instance
- integrated exporters
- service discovery and Prometheus rules can be configured in a self-service way
- managed alertmanager to send out notifications
- alertmanager configuration can be changed in a self-service way
- integrated grafana datasource for Prometheus
Availability
Application monitoring is charged separately from the nine managed GKE base platform. To order the needed components for application monitoring please create a support ticket.
Prerequisites
Nine nodes need to be upgraded to at least n1-standard-2 nodes, as n1-standard-1 nodes do not have enough resources. If you order Application Monitoring, we will upgrade the nodes to n1-standard-2.
Usage
Please see the following sections for an explanation on how to use the application monitoring solution.
General information about the setup
The application monitoring solution is based on the prometheus-operator project. When ordering the application monitoring product, nine will:
- run one (or more) Prometheus instance(s) (backed by GCP regional storage) for you on the nine node pool
- run an instance of the Prometheus-operator on the nine node pool
- run 2 instances of alertmanager on the nine node pool
- pre-configure some metric exporters in the customer Prometheus
- create a grafana datasource for the customer Prometheus
Running all components on the nine node pool will leave all available resources of your node pools to your applications. Moreover, by using GCP regional storage for Prometheus, the instance can failover to another GCP compute zone (high availability).
With the help of the prometheus-operator project you can then use the following resources to create scraping configurations and recording/alerting rules:
- ServiceMonitors
- PodMonitors
- PrometheusRules
It is possible to run multiple Prometheus instances in your cluster if needed. Every Prometheus instance gets a name which you can see on runway. This name must be used in the nine.ch/prometheus label of all corresponding resources so that they are picked up by the right Prometheus instance.
Please have a look at "Adding application metrics to Prometheus" and "Adding rules to Prometheus" for more information about how to use these resources.
The provided alertmanager will send out notifications when Prometheus alerting rules fire. The alertmanager instances can be configured by providing a specially named kubernetes secret in a certain namespace. Please see "Configuring Alertmanager" for further information.
Accessing the web UI
The Prometheus and alertmanager web UI URLs can be found on runway.
Instrumenting your application
Before Prometheus can scrape metrics from your application, you need to instrument your application to expose metrics in the Prometheus exposition format. You can find information about how to do this in the official Prometheus documentation.
Adding application metrics to Prometheus
Once your application exposes metrics, you can use ServiceMonitors or PodMonitors to let Prometheus scrape your application's metrics.
ServiceMonitors scrape all pods which are targeted by one or more services. This is the resource you will need in most cases. You define a label selector in the ServiceMonitor which is used to find the wanted services. The ServiceMonitor should be created in the same namespace as the service(s) it selects. In addition to the label selector, your ServiceMonitor should also have the label nine.ch/prometheus set to the name of your Prometheus instance (which can be found on runway). Consider the following example ServiceMonitor and Service definitions:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    nine.ch/prometheus: myPrometheus
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: web
kind: Service
apiVersion: v1
metadata:
  name: my-app-service
  namespace: my-app
  labels:
    app: my-app
spec:
  selector:
    application: example-app
  ports:
    - name: web
      port: 8080
The given ServiceMonitor definition will select the service "my-app-service" because the label "app: my-app" exists on that service. Prometheus will then search for all pods which are targeted by this service and start scraping them for metrics on port 8080 (the ServiceMonitor defines the port in the endpoints field).
PodMonitors will scrape all pods which are selected by the given label selector. They work very similarly to the ServiceMonitor resource, just without an actual Service resource. You can use the PodMonitor resource if your application does not need a Service resource for any other reason (as is the case for some exporters). The pods should run in the same namespace as the PodMonitor is defined in. Here is an example of a PodMonitor with a corresponding pod:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-pods
  namespace: my-app
  labels:
    nine.ch/prometheus: myPrometheus
spec:
  selector:
    matchLabels:
      application: my-app
  podMetricsEndpoints:
    - port: web
apiVersion: v1
kind: Pod
metadata:
  labels:
    application: my-app
  name: my-app
  namespace: my-app
spec:
  containers:
    - image: mycompany/example-app
      name: app
      ports:
        - name: web
          containerPort: 8080
Based on the given PodMonitor resource, the prometheus-operator will generate a scrape config which scrapes the shown pod "my-app" on port 8080 for metrics.
Prometheus will create a job for every ServiceMonitor or PodMonitor resource you define. It will also add a job label to all metrics which have been gathered by the corresponding job. This can be used to find out from which services or pods a given metric has been scraped.
Querying for metrics
You can use PromQL to query for metrics. There are some examples on the official Prometheus page. Querying can be done either by using the Prometheus web UI or by using grafana in the explore view. When using grafana, please make sure to select the data source matching your Prometheus instance. The data source name will be prometheus-<YOUR PROMETHEUS NAME>.
Adding rules to Prometheus
Prometheus supports two kinds of rules: recording rules and alerting rules. Both have a similar syntax, but a different use case.
Recording rules can be used to calculate new metrics from already existing ones. This can be useful if you use computationally expensive queries in dashboards. To speed them up, you can create a recording rule which evaluates the query at a defined interval and stores the result as a new metric. You can then use this new metric in your dashboard queries.
Alerting rules allow you to define alert conditions (based on PromQL). When those conditions are true, Prometheus will send out an alert to the connected alertmanager instances. Alertmanager will then send notifications to users about alerts.
When creating alerting or recording rules, please make sure to add the nine.ch/prometheus label with the name of your Prometheus instance. This will assign the created rule to your Prometheus instance.
The following example alerting rule will alert once a job cannot reach the configured pods (targets) anymore:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    nine.ch/prometheus: myPrometheus
    role: alert-rules
  name: jobs-check
spec:
  groups:
    - name: ./example.rules
      rules:
        - alert: InstanceDown
          expr: up == 0
          for: 5m
          labels:
            severity: Critical
          annotations:
            summary: "Instance {{ $labels.instance }} down"
            description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
This alerting rule definition will trigger an alert once an up metric gets a value of 0. The up metric is a special metric as it is added by Prometheus itself for every job target (pod). Once a pod cannot be scraped anymore, the corresponding up metric will turn to 0. If the up metric is 0 for more than 5 minutes (in this case), Prometheus will trigger an alert. The specified "labels" and "annotations" can be used in alertmanager to customize your notification messages and routing decisions.
The full spec for the PrometheusRule
definition can be found here.
Here is an example of a recording rule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    nine.ch/prometheus: myPrometheus
    role: recording-rules
  name: cpu-per-namespace-recording
spec:
  groups:
    - name: ./example.rules
      rules:
        - record: namespace:container_cpu_usage_seconds_total:sum_rate
          expr: sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[5m])) by (namespace)
This recording rule will create a new metric called namespace:container_cpu_usage_seconds_total:sum_rate which shows the sum of the CPU used by all containers per namespace. This metric can easily be shown in a grafana dashboard to get an overview of the CPU usage of all pods per namespace.
The kubernetes-mixins project contains sample alerts and rules for various exporters. It is a good place to get some inspiration for alerting and recording rules. Some of those rules and alerts have already been integrated into your instance of Prometheus.
Checking website availability
You can easily check the HTTP return code of any HTTP endpoint which is reachable via a Kubernetes ingress. All you need to do is add a label named nine.ch/prometheus, with the name of your Prometheus instance as the value, to your Kubernetes ingress. The name of your Prometheus instance can be found on runway. Prometheus will then start to check your HTTP endpoint. This is done by utilising the Prometheus blackbox exporter, which we run as a GCP Cloud Function alongside your GCP project. This instance is only accessible from within your nine Managed GKE cluster.
To further customize the check we support the following annotations on the ingress resource:
annotation | description | example |
---|---|---|
blackbox-exporter.nine.ch/valid_status_codes | Allow the check to be successful if the return code matches one of the given comma separated status codes. Shortcuts like 2xx can be used to specify the corresponding full range of http status codes. If this annotation is not present, the default expected status codes are 200-299 (2xx). | blackbox-exporter.nine.ch/valid_status_codes: 2xx, 3xx, 401 |
blackbox-exporter.nine.ch/expect_regexp | Mark the probe as successful if the given regular expression was found in the body of the response. | blackbox-exporter.nine.ch/expect_regexp: status=[oO][kK] |
blackbox-exporter.nine.ch/fail_on_regexp | Mark the probe as unsuccessful if the given regular expression was found in the body of the response. | blackbox-exporter.nine.ch/fail_on_regexp: status=([fF]ailed|[eE]rror) |
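For illustration, the following is a minimal sketch of an ingress which enables the check and customizes the accepted status codes. The host, namespace, service name and the Prometheus instance name myPrometheus are placeholders; adapt them to your setup.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: my-app
  labels:
    # enables the blackbox check; set the value to the name of your Prometheus instance
    nine.ch/prometheus: myPrometheus
  annotations:
    # accept redirects and 401 responses in addition to the default 2xx range
    blackbox-exporter.nine.ch/valid_status_codes: 2xx, 3xx, 401
spec:
  rules:
    - host: www.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-service
                port:
                  number: 8080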
You can query your Prometheus instance for the relevant metrics returned by this check with the following query:
{job="ingress-check"}
To get the returned status code you can use:
probe_http_status_code{job="ingress-check", namespace="<namespace of your ingress>", ingress="<name of your ingress>"}
To see if your check returned successfully you can leverage the "probe_success" metric:
probe_success{job="ingress-check", namespace="<namespace of your ingress>", ingress="<name of your ingress>"}
The metric "probe_duration_seconds" will show how long it took to check the HTTP endpoint. It contains a "phase" label which helps to identify the correspondiung probe duration of a specific HTTP connection stage.
Configuring Alertmanager
Alertmanager is the component responsible for sending out notifications in case of Prometheus alerts. Alertmanager supports various channels for notifications, like Slack, email, Hipchat, PagerDuty, etc. Please have a look at the official documentation for detailed information about the configuration. We also supply example configurations below.
The default alertmanager instances, which are created by us, do not have any notification receivers configured by default. To configure them, we provide a way to supply a complete alertmanager configuration. You can create a secret which has to be named alertmanager in a namespace called alertmanager-config. The secret needs to have a key called alertmanager.yaml which contains a full alertmanager configuration. Furthermore special annotations have to be set. The contained alertmanager configuration will be checked for validity on creation and update of the secret. If the configuration contains errors or the mandatory annotations of the secret are missing, the secret will not be accepted. In that case you will receive an error message which describes the issue.
As the configuration is given as a secret, we recommend using a sealed secret resource in combination with GitOps techniques to create and maintain it. That way, you do not store confidential information in your configuration Git repository.
Mandatory annotations
Please make sure that your alertmanager secret contains the following annotations:
replicator.v1.mittwald.de/replication-allowed: "true"
replicator.v1.mittwald.de/replication-allowed-namespaces: "nine-alertmanager-customer"
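For illustration, a manually written secret manifest containing these annotations could look like the following sketch. The embedded configuration is a minimal placeholder which routes all alerts to a receiver without any notification channel; see the configuration examples at the end of this page for real setups.
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager
  namespace: alertmanager-config
  annotations:
    replicator.v1.mittwald.de/replication-allowed: "true"
    replicator.v1.mittwald.de/replication-allowed-namespaces: "nine-alertmanager-customer"
stringData:
  # the full alertmanager configuration goes into this key
  alertmanager.yaml: |
    route:
      receiver: "devnull"
    receivers:
      - name: "devnull"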
Manual steps to create the configuration
- create the namespace alertmanager-config
$> kubectl create namespace alertmanager-config
- create a local directory where all needed configuration files go. The directory needs to have at least a file called alertmanager.yaml which contains a valid alertmanager configuration. Have a look at our examples. You can also put template files into this directory (extension should be .tmpl).
- create a secret called alertmanager with the mandatory annotations in the above created namespace alertmanager-config
$> export AMDIR=<path-to-your-directory>
$> kubectl create secret generic alertmanager --from-file=$AMDIR --dry-run -o yaml -n alertmanager-config | \
kubectl annotate -f- --dry-run --local -o yaml \
replicator.v1.mittwald.de/replication-allowed=true \
replicator.v1.mittwald.de/replication-allowed-namespaces=nine-alertmanager-customer | \
kubectl apply -f-
If you use template files in your configuration, please make sure to include the following snippet in your alertmanager.yaml file:
templates:
- "/etc/alertmanager/config/*.tmpl"
Using sealed secrets for configuration
If you want to use GitOps to control the alertmanager configuration, we recommend using a SealedSecret to define your configuration. This way you do not expose access credentials in Git.
You can use runway to generate a secret with the name alertmanager in the namespace alertmanager-config. The secret needs to have at least a key called alertmanager.yaml which contains a valid alertmanager configuration. It might have additional keys which define template names (like 'slack.tmpl' for example). To make sure that the sealed secrets controller adds the mandatory annotations please add a 'template' section to the generated secret as shown below:
example secret generated by runway:
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: alertmanager
  namespace: alertmanager-config
spec:
  encryptedData:
    alertmanager.yaml: <encrypted data>
secret after adding the mandatory template section:
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: alertmanager
  namespace: alertmanager-config
spec:
  encryptedData:
    alertmanager.yaml: <encrypted data>
  template:
    metadata:
      annotations:
        replicator.v1.mittwald.de/replication-allowed: "true"
        replicator.v1.mittwald.de/replication-allowed-namespaces: "nine-alertmanager-customer"
Overwriting an existing secret
If you already created a Kubernetes secret alertmanager manually and now want to overwrite it with a sealed secret, you first need to add the following annotation to the existing secret:
sealedsecrets.bitnami.com/managed: "true"
This can be achieved with the following command:
kubectl annotate secret alertmanager sealedsecrets.bitnami.com/managed="true" -n alertmanager-config
If this annotation is missing, the sealed secrets controller will refuse to overwrite the already existing secret.
Alertmanager configuration examples
1. Send all alerts via email
global:
  resolve_timeout: 5m
route:
  receiver: "email"
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  routes: []
receivers:
  - name: "email"
    email_configs:
      - to: "monitoring-alerts-list@your-domain.ch"
        send_resolved: true
        # when using STARTTLS (port 587) this needs to be 'true'
        require_tls: false
        from: "alertmanager@your-domain.ch"
        smarthost: smtp.your-domain.ch:465
        auth_username: "alertmanager@your-domain.ch"
        auth_password: "verysecretsecret"
        headers: { Subject: "[Alert] Prometheus Alert Email" }
2. Send all critical alerts via slack. All other severities will be sent out via email. Please make sure to add a severity label to your alerts.
global:
  resolve_timeout: 5m
route:
  # this specifies the default receiver which will be used if no route matches
  receiver: "email"
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  routes:
    - receiver: "slack"
      match_re:
        severity: "[cC]ritical"
receivers:
  - name: "email"
    email_configs:
      - to: "monitoring-alerts-list@your-domain.ch"
        send_resolved: true
        # when using STARTTLS (port 587) this needs to be 'true'
        require_tls: false
        from: "alertmanager@your-domain.ch"
        smarthost: smtp.your-domain.ch:465
        auth_username: "alertmanager@your-domain.ch"
        auth_password: "verysecretsecret"
        headers: { Subject: "[Alert] Prometheus Alert Email" }
  - name: "slack"
    slack_configs:
      - send_resolved: true
        api_url: https://hooks.slack.com/services/s8o3m2e0r8a8n2d/8snx2X983
        channel: "#alerts"
3. Send all alerts of the production environment via slack. Drop all other alerts. Please make sure to define the label 'environment' in your alerts (see the rule sketch after this example).
global:
  resolve_timeout: 5m
route:
  # this specifies the default receiver which will be used if no route matches
  receiver: "devnull"
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  routes:
    - receiver: "slack"
      match:
        environment: production
receivers:
  - name: "slack"
    slack_configs:
      - send_resolved: true
        api_url: https://hooks.slack.com/services/s8o3m2e0r8a8n2d/8snx2X983
        channel: "#alerts"
  - name: devnull
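As a companion to example 3, an alerting rule that carries the required environment label could look like the following sketch. The rule name and expression are only placeholders; the important part is the environment label, which is matched by the route above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    nine.ch/prometheus: myPrometheus
    role: alert-rules
  name: production-instance-down
spec:
  groups:
    - name: ./production.rules
      rules:
        - alert: ProductionInstanceDown
          expr: up == 0
          for: 5m
          labels:
            severity: Critical
            # this label is matched by the "environment: production" route in example 3
            environment: production
          annotations:
            summary: "Instance {{ $labels.instance }} down in production"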
4. Use templates to customize your notifications and send all alerts via slack. Here we define some templates in a file called 'slack.tmpl'.
filename: alertmanager.yaml
global:
  resolve_timeout: 5m
# THIS LINE IS VERY IMPORTANT AS OTHERWISE YOUR TEMPLATES WILL NOT BE LOADED
templates:
  - "/etc/alertmanager/config/*.tmpl"
route:
  receiver: "slack"
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  routes: []
receivers:
  - name: "slack"
    slack_configs:
      - send_resolved: true
        api_url: https://hooks.slack.com/services/s8o3m2e0r8a8n2d/8snx2X983
        channel: "#alerts"
        pretext: "{{ .CommonAnnotations.description }}"
        text: '{{ template "slack.myorg.text" . }}'
filename: slack.tmpl
{{ define "slack.myorg.text" -}}
{{ range .Alerts -}}
*Alert:* {{ .Labels.alertname }} - `{{ .Labels.severity }}`
*Description:* {{ .Annotations.description }}
*Details:*
{{ range .Labels.SortedPairs -}}
• *{{ .Name }}:* `{{ .Value }}`
{{ end -}}
{{ template "slack.default.text" . }}
{{ end -}}
{{ end -}}