The networking control plane emits metrics that can be used to understand the health of the platform. These metrics can be consumed by Prometheus and graphed with Grafana, but the installation and configuration of those tools are outside the scope of this document.
There is existing official documentation on what metrics each component exposes.
One goal you might have when monitoring is to ensure that requests are being served and that the components are not becoming overloaded.
Significant changes in the number or size of requests are a sign that load on the cluster is changing and the ingress-gateways might need to be scaled. Because Istio configures Envoy to output data plane metrics, it is possible to measure the global data plane load across all Envoys.
The following query will return the total number of requests per second across the data plane, rounded to the nearest thousandth:
round(sum(irate(istio_requests_total{reporter="destination"}[1m])), 0.001)
The following query will return the 95th percentile request latency in milliseconds:
histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le))
The following query will return the 95th percentile request size in bytes:
histogram_quantile(0.95, sum(rate(istio_request_bytes_bucket[5m])) by (le))
The following query will return the 95th percentile response size in bytes:
histogram_quantile(0.95, sum(rate(istio_response_bytes_bucket[5m])) by (le))
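If you want to graph or alert on these numbers without re-evaluating the raw expressions everywhere, one option is to capture them as Prometheus recording rules. The following is a minimal sketch that wraps the first two queries above; the group and rule names are illustrative, not names used by the platform.

groups:
- name: dataplane-load    # illustrative group name
  rules:
  # Global request rate across all Envoys, as in the first query above.
  - record: dataplane:istio_requests:rate1m
    expr: 'round(sum(irate(istio_requests_total{reporter="destination"}[1m])), 0.001)'
  # 95th percentile request latency in milliseconds, as in the second query above.
  - record: dataplane:istio_request_duration_ms:p95
    expr: 'histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le))'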
In addition to measuring the load, it can be useful to watch for spikes in the rate of error codes or cancelled requests.
The following query will return the rate of requests per second with a given status word (cancelled, completed, or total):
round(sum(irate(envoy_cluster_upstream_rq_<status word>[1m])), 0.001)
The following query will return the rate of requests per second with a given HTTP status code (e.g. 200, 404, 503):
round(sum(irate(envoy_cluster_upstream_rq_<http status code>[1m])), 0.001)
Note that these metrics are emitted for both the ingress-gateways and the sidecars, so if you want to monitor the sidecars and the ingress-gateways separately, the metrics will need to be filtered. For example, the following query will return the rate of 503s per second across all the ingress-gateways:
sum(irate(envoy_cluster_upstream_rq_503{namespace="istio-system"}[1m]))
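A sustained spike in 503s is usually worth alerting on. Below is a sketch of a Prometheus alerting rule built on the query above; the alert name, the threshold of five requests per second, and the five-minute duration are illustrative assumptions that should be tuned to your traffic.

groups:
- name: ingressgateway-errors    # illustrative group name
  rules:
  - alert: IngressGateway503Spike
    # Fire if the ingress-gateways return more than five 503s per second for five minutes.
    expr: 'sum(irate(envoy_cluster_upstream_rq_503{namespace="istio-system"}[1m])) > 5'
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Ingress-gateways are returning more than five 503s per second."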
Monitoring the self-reported state of the ingress-gateways is one way to determine their health. The relevant metrics and queries are described below.
The following query will return the time since each ingress-gateway last restarted:
envoy_server_uptime{pod_name=~"istio-ingressgateway-.*"}
The following query will return the current state of each ingress-gateway:
envoy_server_state{pod_name=~"istio-ingressgateway-.*"}
envoy_server_state has the following values:
0: live
1: draining
2: pre-initializing
3: initializing
During an upgrade, some pods would be spinning up (in states 2: pre-initializing and 3: initializing) and others would be 1: draining to make room for the new pods.
The following query will return the current number of live ingress-gateways:
sum(envoy_server_live{pod_name=~"istio-ingressgateway-.*"})
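These gauges lend themselves to a simple liveness alert. The sketch below assumes a deployment of two ingress-gateway replicas; the alert name and the threshold are illustrative and should match your actual replica count.

groups:
- name: ingressgateway-health    # illustrative group name
  rules:
  - alert: IngressGatewayNotLive
    # Fire if fewer ingress-gateways report themselves live than the expected
    # replica count (assumed here to be 2).
    expr: 'sum(envoy_server_live{pod_name=~"istio-ingressgateway-.*"}) < 2'
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Fewer ingress-gateways are live than expected."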
For all the components, you can monitor the standard set of resource usage metrics. For example, the following queries measure efficiency by dividing ingress-gateway CPU and memory usage by the request rate, expressed in thousands of requests per second:
sum(irate(container_cpu_usage_seconds_total{pod=~"istio-ingressgateway-.*",container="istio-proxy"}[1m]))
/ (round(
sum(irate(
istio_requests_total{source_workload="istio-ingressgateway", reporter="source"}[1m])),
0.001)
/ 1000)
sum(irate(container_memory_usage_bytes{pod=~"istio-ingressgateway-.*",container="istio-proxy"}[1m]))
/ (round(
sum(irate(
istio_requests_total{source_workload="istio-ingressgateway", reporter="source"}[1m])),
0.001)
/ 1000)
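Because these expressions are long, it can be convenient to precompute them with a Prometheus recording rule and graph the result instead. The sketch below wraps the CPU query above verbatim; the group and rule names are illustrative.

groups:
- name: ingressgateway-efficiency    # illustrative group name
  rules:
  # CPU used by the ingress-gateway proxies per 1,000 requests per second,
  # exactly as in the first query above.
  - record: ingressgateway:cpu_cores_per_1000rps
    expr: |
      sum(irate(container_cpu_usage_seconds_total{pod=~"istio-ingressgateway-.*",container="istio-proxy"}[1m]))
      / (round(sum(irate(istio_requests_total{source_workload="istio-ingressgateway", reporter="source"}[1m])), 0.001) / 1000)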
The path from making a configuration change using the CF CLI to that change being applied in Envoy (especially the Envoy ingress-gateways) has a series of steps. It is recommended to monitor each step separately in order to ensure that control plane latency is not too high overall and to make fixing latency problems easier. This section will cover metrics relevant to monitoring each step.
Because most steps involve the K8s API, we will first cover the K8s API metrics that are relevant to all those steps.
All of the metrics beginning with apiserver_request_ measure the load on the K8s API server.
The following histogram metric records the latency of the various API server actions (e.g. CREATE, WATCH) and resources (e.g. pods), and is the basis for the latency queries below:
apiserver_request_duration_seconds_bucket
The following query will return the number of requests per second to the API server over the range of a minute, rounded to the nearest thousandth:
round(sum(irate(apiserver_request_total[1m])), 0.001)
The following query will return the fraction of requests to the API server that result in errors such as HTTP 5xx errors:
rate(apiserver_request_total{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_total[5m])
(?:5..) can be replaced with other status code patterns (e.g. (?:4..) for HTTP 4xx errors).
The following query will return the 95th percentile latency for all Kubernetes resources and verbs:
histogram_quantile(0.95,
sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, resource,
subresource, verb))
For example, the following query will return the 95th percentile latency for requests for VirtualServices only:
histogram_quantile(0.95,
  sum(rate(apiserver_request_duration_seconds_bucket{resource="virtualservices"}[5m]))
  by (le, resource, subresource, verb))
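If API server latency is a recurring concern, the percentile query above can be turned into an alert. The sketch below excludes WATCH requests, which are long-lived and would skew the percentile, and uses a one-second threshold; the alert name, the threshold, and the duration are illustrative assumptions.

groups:
- name: apiserver-latency    # illustrative group name
  rules:
  - alert: APIServerSlowRequests
    # WATCH requests are excluded because they are long-lived by design.
    expr: |
      histogram_quantile(0.95,
        sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m]))
        by (le, resource, verb)) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "95th percentile API server latency for {{ $labels.verb }} requests on {{ $labels.resource }} is above 1 second."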
The CC API creates or updates a Route CR to reflect changes requested by a CF CLI command. The CC API does not emit metrics, so K8s API metrics related to Route CRs are the most relevant.
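For example, assuming the Route CRD is served under the networking.cloudfoundry.org API group, a query along the following lines would return the rate of API server requests per second that touch Route CRs, grouped by verb (the group and resource label values are assumptions; check the labels your API server actually reports):
round(sum(irate(apiserver_request_total{group="networking.cloudfoundry.org", resource="routes"}[1m])) by (verb), 0.001)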
The Route Controller consumes Route CRs and outputs Istio configuration as other CRs. There are no metrics output by Route Controller at this time, so the most relevant metrics are emitted by the K8s API.
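For example, the following query would return the rate of API server requests per second for VirtualServices, grouped by verb, which gives a rough indication of how quickly Istio configuration is being written and read (note that istiod's own reads and watches are included in this count):
round(sum(irate(apiserver_request_total{resource="virtualservices"}[1m])) by (verb), 0.001)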
Istio consumes its config as VirtualService CRs and emits configuration to each Envoy via XDS. In addition to K8s API metrics related to VirtualServices, istiod outputs several relevant metrics. For example, the following query will return the 99th percentile delay between a configuration change and the proxies receiving the resulting configuration:
histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket[1m])) by (le))
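If config propagation regularly slows down, an alert on this percentile can catch it early. The sketch below uses a threshold of 10 seconds and a ten-minute duration, both of which are illustrative assumptions rather than recommended values.

groups:
- name: istiod-config-push    # illustrative group name
  rules:
  - alert: IstioProxyConvergenceSlow
    # Fire if the 99th percentile delay between a config change and the proxies
    # receiving the resulting configuration stays high.
    expr: 'histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket[1m])) by (le)) > 10'
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "istiod is taking more than 10 seconds to push configuration to the Envoys."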