Centralized monitoring for your OpenShift clusters

Jaydeep Ayachit
10 min read · Jul 2, 2021

Introduction

Red Hat OpenShift is an enterprise-ready Kubernetes container platform with full-stack automated operations to manage hybrid cloud, multicloud, and edge deployments. Red Hat OpenShift is optimized to improve developer productivity and promote innovation.

In this article we will look at OpenShift monitoring capabilities and the out-of-the-box support for integrating with external monitoring systems. We will also look at several enterprise monitoring systems that you can use to monitor and collect metrics from OpenShift.

If you are interested in centralized logging for your OpenShift clusters, take a look at Aggregate OpenShift logs into enterprise logging system | by Jaydeep Ayachit | Jun, 2021 | Medium

Red Hat OpenShift Monitoring

OpenShift Container Platform includes a pre-configured, pre-installed, and self-updating monitoring stack that provides monitoring for core platform components. A set of alerts is included by default to immediately notify cluster administrators about issues with a cluster. Default dashboards in the OpenShift Container Platform web console include visual representations of cluster metrics to help you quickly understand the state of your cluster.

The OpenShift Container Platform monitoring stack is based on the Prometheus open source project and its wider ecosystem. The monitoring stack includes the following:

  • Default platform monitoring components: A set of platform monitoring components is installed in the openshift-monitoring project by default during an OpenShift Container Platform installation. This provides monitoring for core OpenShift Container Platform components including Kubernetes services. The default monitoring stack also enables remote health monitoring for clusters.
  • Components for monitoring user-defined projects: After optionally enabling monitoring for user-defined projects, additional monitoring components are installed in the openshift-user-workload-monitoring project. This provides monitoring for user-defined projects.

Using monitoring for user-defined projects, you can query metrics, review dashboards, and manage alerting rules and silences for your own projects in the OpenShift Container Platform web console. When monitoring is enabled for user-defined projects, you can monitor:

  • Metrics provided through service endpoints in user-defined projects.
  • Pods running in user-defined projects.
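As a rough illustration of the two steps involved, the sketch below builds the ConfigMap that enables monitoring for user-defined projects and a ServiceMonitor that points the user-workload Prometheus at a service endpoint. The ConfigMap name, namespace, and the enableUserWorkload key follow the OpenShift Container Platform 4.7 documentation; the project name, labels, and port name in the ServiceMonitor are hypothetical placeholders. The manifests are emitted as JSON, which oc apply -f accepts just as it accepts YAML.

```python
import json


def user_workload_config_map() -> dict:
    """ConfigMap that switches on monitoring for user-defined projects."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "cluster-monitoring-config",
            "namespace": "openshift-monitoring",
        },
        # The cluster monitoring stack reads its settings from the config.yaml key.
        "data": {"config.yaml": "enableUserWorkload: true\n"},
    }


def service_monitor(name: str, namespace: str, app_label: str, port_name: str) -> dict:
    """ServiceMonitor selecting services by label and scraping their /metrics path.

    All argument values are illustrative; use the names from your own project.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Services carrying this label become scrape targets.
            "selector": {"matchLabels": {"app": app_label}},
            "endpoints": [{"port": port_name, "path": "/metrics", "interval": "30s"}],
        },
    }


if __name__ == "__main__":
    # Hypothetical project, label, and port name.
    print(json.dumps(user_workload_config_map(), indent=2))
    print(json.dumps(service_monitor("demo-monitor", "demo-project", "demo-app", "web"), indent=2))
```

Saving each printed document to a file and applying it with oc apply -f is one way to try this out. Note that if a cluster-monitoring-config ConfigMap already exists, you would edit it rather than overwrite it.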

By default, the OpenShift Container Platform 4.7 monitoring stack includes these key components:

  • Prometheus Operator: The Prometheus Operator creates, configures, and manages platform Prometheus instances and Alertmanager instances. It also automatically generates monitoring target configurations based on Kubernetes label queries.
  • Prometheus: Prometheus is the monitoring system on which the OpenShift Container Platform monitoring stack is based. Prometheus is a time-series database and a rule evaluation engine for metrics. Prometheus sends alerts to Alertmanager for processing.
  • Prometheus Adapter: The Prometheus Adapter translates Kubernetes node and pod queries for use in Prometheus. The resource metrics that are translated include CPU and memory utilization metrics. The Prometheus Adapter exposes the cluster resource metrics API for horizontal pod autoscaling.
  • Alertmanager: The Alertmanager service handles alerts received from Prometheus. Alertmanager is also responsible for sending the alerts to external notification systems.
  • kube-state-metrics agent: The kube-state-metrics exporter agent converts Kubernetes objects to metrics that Prometheus can use.
  • openshift-state-metrics agent: The openshift-state-metrics exporter expands upon kube-state-metrics by adding metrics for OpenShift Container Platform-specific resources.
  • node-exporter agent: The node-exporter agent collects metrics about every node in a cluster. The node-exporter agent is deployed on every node.
  • Thanos Querier: The Thanos Querier aggregates and optionally deduplicates core OpenShift Container Platform metrics and metrics for user-defined projects under a single, multi-tenant interface.
  • Grafana: The Grafana analytics platform provides dashboards for analyzing and visualizing the metrics. The Grafana instance that is provided with the monitoring stack, along with its dashboards, is read-only.
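Because Thanos Querier speaks the standard Prometheus HTTP API, the aggregated metrics can also be read programmatically. The following is a minimal sketch, assuming you have the URL of the Thanos Querier route and a bearer token for an account that is allowed to read metrics; the hostname in the usage example is a placeholder, not a real endpoint.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(base_url: str, promql: str, token: str) -> urllib.request.Request:
    """Build an instant-query request against the Prometheus HTTP API (/api/v1/query)."""
    query_string = urllib.parse.urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/api/v1/query?{query_string}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def parse_instant_vector(body: str) -> list:
    """Turn an instant-query response into a list of (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result carries its label set and a [timestamp, "value"] pair.
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def run_query(base_url: str, promql: str, token: str) -> list:
    """Send the query and parse the response (requires network access to the route)."""
    with urllib.request.urlopen(build_query_request(base_url, promql, token)) as resp:
        return parse_instant_vector(resp.read().decode("utf-8"))
```

For example, run_query("https://thanos-querier.example.invalid", "up", token) would return one entry per scrape target, with a value of 1.0 for targets that are up.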

Source:

Understanding the monitoring stack | Monitoring | OpenShift Container Platform 4.7

Enabling monitoring for user-defined projects | Monitoring | OpenShift Container Platform 4.7

Need for centralized monitoring

“Observability” is a popular term for how well you can understand a system's performance from the data it exposes. The more observable a system, the more quickly and accurately you can identify an issue and its root cause. One of the key pillars of observability is “Metrics”. Metrics can originate from a variety of sources, including infrastructure, hosts, services, applications, cloud platforms and external sources. Metrics typically represent key performance indicators (KPIs) for your systems, such as CPU and memory usage, page faults, and HTTP errors. Metrics are represented either as counts or as values aggregated over a period of time. Combined with Application Performance Monitoring (APM) tools, you can instrument your code to make additional key information available for monitoring.

In traditional infrastructures, applications operate on relatively static instances and thus monitoring applications is quite manageable. Containerized workloads operate across a fleet of underlying hosts where multiple instances of an application may be running at a given time. By definition, containers have a short life span and monitoring them during runtime can be extremely challenging. Compliance risks are also very high because of the fast-moving nature of container environments.

In many cases, organizations already have enterprise monitoring systems in place that act as a centralized system for metrics collection, aggregation, analytics and visualization. Such organizations want the metrics from OpenShift, and from the workloads running in OpenShift, to be available in the centralized monitoring solution.

The following sections explain various enterprise monitoring systems and how your OpenShift clusters can be administered to take advantage of a centralized monitoring solution.

Overview: OpenShift Monitoring and Enterprise Tools

The following diagram depicts:

  • OpenShift monitoring stack that collects metrics from various sources
  • Various enterprise systems that collect metrics from OpenShift cluster

Out-of-the-box support for extension

As of OpenShift Container Platform 4.7, OpenShift metrics cannot be sent directly to an external system, nor can you export metrics from the OpenShift Prometheus instance to an external system. If you are running multiple OpenShift clusters, each cluster has its own monitoring stack, so you cannot get a centralized view across all clusters.

If you are running an external Prometheus system, it can be configured to scrape metrics from your applications, provided each application exposes a public /metrics endpoint that serves metrics in the Prometheus format. There are a number of libraries and servers which help in exporting existing metrics from third-party systems as Prometheus metrics. This is useful for cases where it is not feasible to instrument a given system with Prometheus metrics directly (for example, HAProxy or Linux system stats). Take a look at Exporters and integrations | Prometheus
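To make the /metrics requirement concrete, here is a minimal sketch of an application endpoint that serves a counter in the Prometheus text exposition format, using only the Python standard library. The metric name demo_http_requests_total, the in-memory counter, and port 8000 are made up for illustration; a real application would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter, keyed by HTTP method.
REQUEST_COUNTS = {"GET": 0}


def render_metrics(counts: dict) -> str:
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_http_requests_total Total HTTP requests handled.",
        "# TYPE demo_http_requests_total counter",
    ]
    for method in sorted(counts):
        lines.append(f'demo_http_requests_total{{method="{method}"}} {counts[method]}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        REQUEST_COUNTS["GET"] += 1
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Start the demo server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a Prometheus scrape job (or any of the agents described below) at port 8000 and path /metrics is enough to see the counter appear.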

Enterprise Monitoring Systems

Datadog

Datadog enables you to collect and analyze metrics, logs, performance data from your applications, and more, using one unified platform.

The Datadog Agent collects information from several different sources in your OpenShift cluster, including the Kubernetes API server, each node’s kubelet process, and the hosts themselves. When the Datadog Agent is successfully deployed, resource metrics and events from your cluster are streamed into Datadog. In addition to metrics from your nodes’ kubelets, data from services like kube-state-metrics and the Kubernetes API server automatically appear via the Datadog Agent’s Autodiscovery feature, which listens for container creation events, detects when certain services are running on those containers, and starts collecting data from supported services. This gives you out-of-the-box access to cluster state information, including metrics that OpenShift exposes through the Kubernetes API that track OpenShift-specific objects like cluster resource quotas.

Datadog integrates with Kubernetes components including the API server, controller manager, scheduler, and etcd. This means that, once enabled, in addition to key metrics from your nodes and pods you can also monitor the health and workload of your cluster’s control plane. Datadog provides out-of-the-box dashboards for several of these components, including the scheduler.

Datadog provides three general levels of data collection based on what permissions are required:

  • Restricted for basic metric collection
  • Host network for APM, container logs, and custom metrics
  • Custom for full Datadog monitoring

Datadog agents

Datadog provides two types of agents: node-level and cluster-level. Depending on your use case, you can choose to go with one or both.

The Datadog Agent: The Datadog Agent is open source software that collects and reports metrics, distributed traces, and logs from each of your nodes, so you can view and monitor your entire infrastructure in one place. In addition to collecting telemetry data from Kubernetes, Docker, CRI-O, and other infrastructure technologies, the Agent automatically collects and reports resource metrics (such as CPU, memory, and network traffic) from your nodes, regardless of whether they’re running in the cloud or on-prem infrastructure.

The Datadog Cluster Agent: The Datadog Cluster Agent provides several additional benefits over using the node-based DaemonSet alone for large-scale, production use cases. For instance, the Cluster Agent:

  • reduces the load on the Kubernetes API server for gathering cluster-level data by serving as a proxy between the API server and the node-based Agents
  • provides additional security by reducing the permissions needed for the node-based Agents
  • enables auto-scaling of Kubernetes workloads using any metric collected by Datadog

Source: OpenShift Monitoring With Datadog | Datadog (datadoghq.com)

Scraping metrics from Prometheus-compliant endpoints

Datadog can scrape metrics that your applications expose in Prometheus-compliant format on /metrics endpoints. For more details please see Prometheus (datadoghq.com) and Kubernetes Prometheus and OpenMetrics metrics collection (datadoghq.com)

One key thing to note is that Datadog is a SaaS-only service. It does not support on-premises deployment in your own infrastructure or data center.
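One way this is typically wired up is through Autodiscovery annotations on the pod. The sketch below generates such annotations for the Agent's OpenMetrics check. The annotation keys and the prometheus_url, namespace, and metrics fields reflect my understanding of the Datadog documentation at the time of writing and are an assumption to verify against the documentation for your Agent version (newer versions use different keys); the container name, namespace, and metric names are hypothetical.

```python
import json


def openmetrics_annotations(container: str, namespace: str, metrics: list,
                            path: str = "/metrics") -> dict:
    """Build pod annotations asking the Datadog Agent to scrape a Prometheus endpoint.

    Annotation names are assumed from the Datadog Autodiscovery docs; check them
    against the documentation for the Agent version you run.
    """
    prefix = f"ad.datadoghq.com/{container}"
    instance = {
        # %%host%% and %%port%% are template variables resolved by the Agent.
        "prometheus_url": "http://%%host%%:%%port%%" + path,
        "namespace": namespace,   # prefix applied to the metric names in Datadog
        "metrics": list(metrics), # which exposed metrics to collect
    }
    # Each annotation value is itself a JSON-encoded string.
    return {
        f"{prefix}.check_names": json.dumps(["openmetrics"]),
        f"{prefix}.init_configs": json.dumps([{}]),
        f"{prefix}.instances": json.dumps([instance]),
    }


if __name__ == "__main__":
    # Hypothetical container name, namespace, and metric.
    annotations = openmetrics_annotations("demo-app", "demo", ["demo_http_requests_total"])
    print(json.dumps(annotations, indent=2))
```

The resulting key/value pairs would go under metadata.annotations in the pod template of the workload that exposes the endpoint.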

AppDynamics

The AppDynamics Cluster Agent is a lightweight Agent used to monitor Kubernetes and OpenShift clusters. You can use the Cluster Agent to monitor and understand how Kubernetes infrastructure affects your applications and business performance. With the Cluster Agent, you can collect metadata, metrics, and events for a Kubernetes cluster. The Cluster Agent is supported on Red Hat OpenShift and cloud-based Kubernetes platforms, such as Amazon EKS, Azure AKS, and Rancher.

The Cluster Agent monitors events and metrics of Kubernetes or OpenShift clusters. It also tracks the state of most Kubernetes resources: pods, replica sets, deployments, services, persistent volumes, nodes, and so on. The data is received through the Kubernetes API server and is sent to the AppDynamics Controller.

The Cluster Dashboard provides an overview of potential issues with cluster health, grouped by category and severity. It shows error events, evictions, node resource starvation, distribution of pod phases, and issues associated with:

  • Applications
  • Cluster configuration
  • Image or storage access
  • Security access errors
  • Quota violations

The dashboard contains cluster resource capacity stats and resource usage data relative to the deployment requests and limits for CPU, Memory, and Storage. The dashboard also provides real-time statistics on the state of monitored objects on the cluster, best-practice violations, and missing dependencies.

Source: Overview of Cluster Monitoring (appdynamics.com)

For a detailed description of cluster metrics please see Cluster Metrics (appdynamics.com)

Extensions and Custom Metrics

Using the Machine Agent, you can supplement the existing metrics in the AppDynamics Controller UI with your own custom metrics. There are many extensions currently available on the AppSphere Community site. For more information please see Extensions and Custom Metrics (appdynamics.com)

Dynatrace

Dynatrace is a container-aware, full-stack monitoring platform that comes with built-in monitoring support for Kubernetes and Red Hat OpenShift via the OneAgent Operator. Dynatrace supports full-stack monitoring for OpenShift, from the application down to the infrastructure layer.

Dynatrace Operator manages the classic full-stack injection after deploying the following custom resources.

  • OneAgent pod, deployed as a DaemonSet, collects host metrics from Kubernetes nodes. It also detects new containers and injects OneAgent code modules into application pods.
  • Dynatrace Kubernetes monitor pod collects cluster and workload metrics, events, and status from the Kubernetes API.

Dynatrace also supports custom metrics. Dynatrace metric ingestion provides a simple way to push any custom metric to Dynatrace.
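As a hedged sketch of what pushing a custom metric can look like, the code below formats data points in the line-based ingestion format (metric key, optional comma-separated dimensions, then the value) and prepares a POST to the Metrics API v2 ingest endpoint. The endpoint path and the Api-Token authorization scheme are taken from my reading of the Dynatrace documentation and should be confirmed there; the environment URL, token, metric keys, and dimensions are hypothetical, and the helper deliberately rejects dimension values that would need quoting.

```python
import urllib.request


def to_ingest_line(key: str, value, dimensions: dict = None) -> str:
    """Format one data point as 'key,dim=value,... value'."""
    parts = []
    for name, dim_value in sorted((dimensions or {}).items()):
        text = str(dim_value)
        # Simplification: values needing quoting or escaping are not handled here.
        if any(ch in text for ch in ' ,="'):
            raise ValueError(f"dimension value needs quoting: {text!r}")
        parts.append(f",{name}={text}")
    return f"{key}{''.join(parts)} {value}"


def build_ingest_request(environment_url: str, api_token: str, lines: list) -> urllib.request.Request:
    """Prepare (but do not send) the ingest request; one data point per line."""
    return urllib.request.Request(
        f"{environment_url.rstrip('/')}/api/v2/metrics/ingest",
        data="\n".join(lines).encode("utf-8"),
        headers={
            "Authorization": f"Api-Token {api_token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical metric key and dimensions.
    print(to_ingest_line("shop.orders.count", 7, {"region": "emea", "app": "cart"}))
```

Passing the prepared request to urllib.request.urlopen would send it; that step is left out here because it needs a real environment URL and token.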

Monitoring Prometheus metrics

Prometheus is an open-source monitoring and alerting toolkit which is popular in the Kubernetes community. Prometheus scrapes metrics from a number of HTTP(s) endpoints that expose metrics in the OpenMetrics format.

Dynatrace integrates gauge and counter metrics from Prometheus exporters in Kubernetes and makes them available for charting, alerting, and analysis. Metrics from Prometheus exporters are available in the Dynatrace Data Explorer for custom charting.
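To illustrate what "gauge and counter metrics" means in terms of an exporter's output, the sketch below reads Prometheus text exposition output and keeps only the samples whose declared type is gauge or counter. It is a simplified illustration of the filtering idea, not Dynatrace's implementation, and it assumes samples carry no trailing timestamp.

```python
def gauge_and_counter_samples(exposition: str) -> list:
    """Return (series, value) pairs for metrics declared as gauge or counter."""
    types = {}    # metric name -> declared type from "# TYPE" lines
    samples = []
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# TYPE "):
            _, _, name, kind = line.split(maxsplit=3)
            types[name] = kind
            continue
        if line.startswith("#"):
            continue  # HELP lines and other comments
        # Assumption: no trailing timestamp, so the value is the last field.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if types.get(name) in ("gauge", "counter"):
            samples.append((series, float(value)))
    return samples


if __name__ == "__main__":
    # Made-up exporter output for illustration.
    demo = (
        "# TYPE demo_http_requests_total counter\n"
        'demo_http_requests_total{method="GET"} 3\n'
        "# TYPE demo_latency_seconds histogram\n"
        'demo_latency_seconds_bucket{le="1"} 2\n'
        "# TYPE demo_temperature_celsius gauge\n"
        "demo_temperature_celsius 21.5\n"
    )
    for series, value in gauge_and_counter_samples(demo):
        print(series, value)
```

In this made-up input the histogram bucket is skipped, while the counter and gauge samples are kept.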

Source:

Red Hat OpenShift | Dynatrace Documentation

Red Hat OpenShift monitoring | Dynatrace

Metric ingestion | Dynatrace Documentation

Monitor Prometheus metrics | Dynatrace Documentation

New Relic

The New Relic Kubernetes integration gives you infrastructure-centric and application-centric views. The Kubernetes integration reports on data and metadata about the nodes, namespaces, deployments, ReplicaSets, pods, clusters, and containers running in OpenShift, so you can fully monitor the frontend and backend applications and hosts running in your cluster.

New Relic’s Kubernetes integration gives full observability into the health and performance of your environment, no matter whether you run Kubernetes on-premises or in the cloud. With cluster explorer, you can cut through layers of complexity to see how your cluster is performing, from the heights of the control plane down to applications running on a single pod.

The New Relic Kubernetes integration includes the New Relic Infrastructure agent. The Infrastructure agent is installed as a Kubernetes DaemonSet, which ensures that the New Relic Kubernetes integration is automatically running on each node in your OpenShift cluster.

One key thing to note is that New Relic is a SaaS-only service. It does not support on-premises deployment in your own infrastructure or data center.

Source:

Monitor Applications and Infrastructure In Red Hat OpenShift with New Relic | New Relic

A Complete Introduction to Monitoring Kubernetes with New Relic

Sysdig

Sysdig Monitor is part of Sysdig’s container intelligence platform. Sysdig uses a unified platform to deliver security, monitoring, and forensics in a container- and microservices-friendly architecture. Sysdig Monitor is a monitoring, troubleshooting, and alerting suite offering deep, process-level visibility into dynamic, distributed production environments. Sysdig Monitor captures, correlates, and visualizes full-stack data, and provides dashboards for monitoring.

In the background, the Sysdig agent lives on the hosts being monitored and collects the appropriate metrics and events. Out of the box, the agent reports on a wide variety of pre-defined metrics. Additional metrics and custom parameters are available via agent configuration files.

Monitoring Prometheus metrics

Sysdig Monitor transforms Prometheus metrics into usable, actionable entries in two ways: Calculated Metrics and Raw Metrics. The Prometheus metrics that are scraped by the Sysdig agent and transformed into the traditional StatsD model are called calculated metrics. The Prometheus metrics that are scraped (by the Sysdig agent), collected, sent, stored, visualized, and presented exactly as Prometheus exposes them are called raw metrics.

Source:

Sysdig Monitor

Prometheus Metrics Types (sysdig.com)
