Monitoring and Observability

Overview

Rancher provides monitoring through integration with Prometheus and the Prometheus Operator. This enables cluster-level and project-level metrics collection, alerting, and observability for your Kubernetes infrastructure and applications.

Monitoring Architecture

Rancher’s monitoring system consists of several key components:

Core Components

Prometheus

Prometheus is the core metrics collection and storage engine:

Time-series database: Stores metric samples with timestamps and labels
Pull-based model: Scrapes metrics from configured targets
PromQL: Powerful query language for metric analysis
Service discovery: Automatically discovers targets in Kubernetes

Source: pkg/image/origins.go:256 (prom-prometheus)

Prometheus Operator

The Prometheus Operator manages Prometheus instances declaratively:

Creates and manages Prometheus deployments
Handles ServiceMonitor and PodMonitor resources
Manages PrometheusRule resources for alerts and recording rules
Automatically updates configuration based on Kubernetes resources

Source: pkg/image/origins.go:230 (prometheus-operator)

Alertmanager

Alertmanager handles alert routing and notifications:

Receives alerts from Prometheus
Groups, deduplicates, and routes alerts
Sends notifications to configured receivers (email, Slack, PagerDuty, etc.)
Supports silencing and inhibition rules

Source: pkg/image/origins.go:227 (prometheus-alertmanager)

Monitoring Namespaces

Rancher uses specific namespaces for monitoring components:

cattle-prometheus: Legacy namespace for cluster-level monitoring (source:pkg/monitoring/monitoring.go:12)
cattle-monitoring-system: Current namespace for monitoring stack (source:pkg/data/management/podadmissionconfigurationtemplate_data.go:24)
rancher-alerting-drivers: Alert driver integrations (source:pkg/data/management/podadmissionconfigurationtemplate_data.go:25)

Monitoring Levels

Rancher supports monitoring at multiple levels:

Cluster-Level Monitoring

Monitors the entire cluster infrastructure:

Node metrics: CPU, memory, disk, network usage
Cluster resources: Overall cluster capacity and utilization
Control plane components: API server, etcd, scheduler, controller manager
System workloads: Kubelet, kube-proxy, CNI plugins

Application: cluster-alerting (source:pkg/monitoring/monitoring.go:18)
Namespace: cattle-prometheus (source:pkg/monitoring/monitoring.go:12)

Project-Level Monitoring

Provides isolation and multi-tenancy for monitoring:

Project-specific metrics: Only resources within the project's namespaces
Isolated Prometheus: Separate Prometheus instance per project
Custom dashboards: Project-specific Grafana dashboards
RBAC integration: Access control based on project membership

Application: project-monitoring (source:pkg/monitoring/monitoring.go:17)
Namespace pattern: cattle-prometheus- (source:pkg/monitoring/monitoring.go:41)

Prometheus Integration

Metrics Collection

Prometheus collects metrics from various sources:

Node Exporter

Collects node-level system metrics:

CPU usage and load average
Memory and swap utilization
Disk I/O and filesystem metrics
Network interface statistics

Source: pkg/image/origins.go:224 (mirrored-prom-node-exporter)

Kube-State-Metrics

Exports Kubernetes object state as metrics:

Deployment status and replica counts
Pod phase and resource requests/limits
Node conditions and capacity
ConfigMap and Secret metadata

cAdvisor

Container-level metrics from kubelet:

Container CPU and memory usage
Network traffic per container
Filesystem I/O per container
Container lifecycle events

ServiceMonitor Resources

ServiceMonitors define how Prometheus scrapes metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-application
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

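The Prometheus Operator turns ServiceMonitor definitions into scrape configuration, so Prometheus discovers and scrapes matching targets automatically. In the example above, port: metrics refers to a named port on the selected Service.

Source reference: pkg/data/management/role_data.go:214 (RBAC for servicemonitors)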
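
To make the selector semantics concrete, the following sketch mimics how a matchLabels selector picks Services: every key/value pair in the selector must be present in the Service's labels. The function name and the sample Services are illustrative assumptions, not part of the Prometheus Operator API.

```python
def matches_selector(match_labels: dict, labels: dict) -> bool:
    """Return True when every selector pair is present in the object's labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


# Hypothetical Services and their labels, for illustration only.
services = {
    "my-application": {"app": "my-application", "tier": "backend"},
    "other-service": {"app": "other"},
}

selector = {"app": "my-application"}  # mirrors spec.selector.matchLabels above
selected = [
    name for name, labels in services.items() if matches_selector(selector, labels)
]
print(selected)
```

When a ServiceMonitor scrapes nothing, comparing the selector against the Service's actual labels in this way is usually the first check worth making.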

PrometheusRule Resources

PrometheusRules define alerting and recording rules:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-rules
  namespace: default
spec:
  groups:
  - name: my-app-alerts
    interval: 30s
    rules:
    - alert: HighErrorRate
      expr: rate(http_requests_total{status="500"}[5m]) > 0.05
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: High error rate detected

Rancher identifies its alert rules by the following label:

Label key: source (source:pkg/monitoring/monitoring.go:24)
Label value: rancher-alert (source:pkg/monitoring/monitoring.go:25)

Source reference: pkg/data/management/role_data.go:214
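
As a rough illustration of what the expression in the HighErrorRate rule above evaluates, the sketch below computes a simplified per-second rate over a window of counter samples and compares it with the 0.05 threshold. It ignores counter resets and the extrapolation that PromQL's rate() performs, and the sample values are made up.

```python
def simple_rate(samples: list) -> float:
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    if t_last <= t_first:
        raise ValueError("need at least two samples with increasing timestamps")
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical counter samples covering a 5-minute window: (seconds, count).
window = [(0.0, 100.0), (150.0, 110.0), (300.0, 130.0)]

rate = simple_rate(window)   # (130 - 100) / 300 requests per second
condition_met = rate > 0.05  # the threshold used in the HighErrorRate rule
print(rate, condition_met)
```

With for: 10m, the alert fires only after the condition has held continuously for ten minutes; until then it stays pending.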

Cluster Monitoring

Node Metrics

Rancher collects and exposes node-level metrics:

Node Count Metrics

Tracks the number of nodes per cluster:

rancher_cluster_nodes{cluster="cluster1", provider="rke2"} 5

Source: pkg/metrics/node.go:26-32

Node Core Metrics

Tracks the total number of CPU cores per cluster:

rancher_cluster_node_cores{cluster="cluster1", provider="rke2"} 20

Source: pkg/metrics/node.go:33-39
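
The two sample lines above follow the Prometheus text exposition format. The sketch below parses them into a metric name, a label dictionary, and a value. It handles only simple labelled lines like these (no escaped quotes, timestamps, or comment lines), and the helper name is an assumption for illustration.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str) -> tuple:
    """Split a labelled sample line into (metric name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labelled sample: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


lines = [
    'rancher_cluster_nodes{cluster="cluster1", provider="rke2"} 5',
    'rancher_cluster_node_cores{cluster="cluster1", provider="rke2"} 20',
]
parsed = [parse_sample(line) for line in lines]

# Average cores per node for the cluster in the sample data.
cores_per_node = parsed[1][2] / parsed[0][2]
print(parsed[0], cores_per_node)
```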

Prometheus Metrics Endpoint

Rancher exposes Prometheus metrics:

GET /metrics

Metrics include:

Rancher server metrics
Cluster ownership information
Node counts and resource capacity
API request rates and latencies

Source: pkg/multiclustermanager/routes.go:9 (prometheus client integration)

Metrics Registration

Rancher registers custom metrics collectors:

prometheus.MustRegister(clusterOwner)
prometheus.MustRegister(numNodes)
prometheus.MustRegister(numCores)

Source: pkg/metrics/metrics.go:46-50

Alerting

Alert Configuration

Configure alerts through PrometheusRule resources or the Rancher UI:

Define alert rules: Specify conditions and thresholds
Configure receivers: Set up notification channels
Set routing: Determine which alerts go to which receivers
Test alerts: Verify alert delivery

Alertmanager Configuration

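Alertmanager reads its configuration from a Kubernetes Secret rather than a ConfigMap. In the rancher-monitoring stack this is typically the alertmanager-rancher-monitoring-alertmanager Secret, which holds a configuration such as: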

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
receivers:
- name: 'default'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    channel: '#alerts'
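
The effect of group_by in the route above can be sketched as follows: alerts that share the same values for the listed labels land in one group, and each group produces one notification. The alert label sets are hypothetical, and the function is a simplified model rather than Alertmanager's implementation.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "cluster")  # mirrors route.group_by above


def group_alerts(alerts: list) -> dict:
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in GROUP_BY)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alert label sets, for illustration only.
alerts = [
    {"alertname": "HighErrorRate", "cluster": "cluster1", "pod": "web-0"},
    {"alertname": "HighErrorRate", "cluster": "cluster1", "pod": "web-1"},
    {"alertname": "HighErrorRate", "cluster": "cluster2", "pod": "web-0"},
]
groups = group_alerts(alerts)
for key, members in groups.items():
    print(key, len(members))
```

Here three alerts collapse into two notifications, one per cluster, which is the noise reduction that grouping is meant to provide.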

Alert Receivers

Alertmanager supports multiple receiver types:

Email: SMTP-based notifications
Slack: Webhook-based Slack messages
PagerDuty: Incident management integration
Webhook: Generic HTTP webhook
Opsgenie: Alert management platform
VictorOps: Incident response platform

Alertmanager Endpoints

Access Alertmanager:

Service: alertmanager-operated (source:pkg/monitoring/monitoring.go:21)
Namespace: cattle-prometheus
Port: 9093 (source:pkg/monitoring/monitoring.go:45)
Headless service for clustering (source:pkg/monitoring/monitoring.go:21)
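
A minimal sketch of how the values above combine into a cluster-internal address, using the standard Kubernetes Service DNS form service.namespace.svc. The helper name is an assumption, and whether the endpoint is served over plain HTTP depends on your deployment.

```python
def in_cluster_address(service: str, namespace: str, port: int) -> str:
    """Build the cluster-internal host:port for a Kubernetes Service."""
    return f"{service}.{namespace}.svc:{port}"


address = in_cluster_address("alertmanager-operated", "cattle-prometheus", 9093)
print(address)
```

Because the service is headless, resolving this name returns the addresses of the individual Alertmanager pods rather than a single virtual IP.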

Monitoring Access Control

RBAC for Monitoring

Rancher provides role templates for monitoring access:

Project Monitoring View Role

Read-only access to monitoring resources:

View Prometheus queries and dashboards
View alerts and alert status
Access Grafana dashboards
No ability to modify monitoring configuration

Role: project-monitoring-readonly (source:pkg/data/management/role_data.go:347-349)

Monitoring Endpoints

Rancher configures access to the following monitoring services:

Prometheus
Alertmanager
Grafana

Source: pkg/data/management/role_data.go:352-357

Service Access Rules

Services accessible for monitoring:

rancher-monitoring-prometheus (source:pkg/data/management/role_data.go:360)
rancher-monitoring-alertmanager (source:pkg/data/management/role_data.go:361)
rancher-monitoring-grafana (source:pkg/data/management/role_data.go:362)

Grafana Integration

While not directly shown in the analyzed source code, Rancher’s monitoring stack typically includes:

Pre-configured Grafana dashboards
Automatic Prometheus data source configuration
Dashboard provisioning for common metrics
Custom dashboard creation and management

Configuration and Customization

Monitoring Configuration Template

Example monitoring configuration with customization options:

# Node selector for Prometheus, Grafana, and exporters
prometheus.nodeSelector.region: region-a
prometheus.nodeSelector.zone: zone-b
grafana.nodeSelector.region: region-a
grafana.nodeSelector.zone: zone-b
exporter-kube-state.nodeSelector.region: region-a
exporter-kube-state.nodeSelector.zone: zone-b

# Prometheus retention and persistence
prometheus.retention: 360h
prometheus.persistence.enabled: true
prometheus.persistence.storageClass: default
prometheus.persistence.accessMode: ReadWriteOnce
prometheus.persistence.size: 50Gi

# Grafana persistence
grafana.persistence.enabled: false
grafana.persistence.storageClass: default
grafana.persistence.accessMode: ReadWriteOnce
grafana.persistence.size: 10Gi

# Node exporter
exporter-node.ports.metrics.port: 9100

Source: pkg/monitoring/monitoring.go:48-95
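
Assuming the dotted keys above are read the way Helm-style value paths are, with each dot descending one level, the sketch below expands a flat set of answers into a nested structure. The function name and that interpretation are assumptions; the source file may process the keys differently.

```python
def nest(flat: dict) -> dict:
    """Expand dotted keys such as 'prometheus.retention' into nested dicts."""
    tree = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree


answers = {
    "prometheus.retention": "360h",
    "prometheus.persistence.enabled": "true",
    "prometheus.persistence.size": "50Gi",
    "grafana.persistence.enabled": "false",
}
values = nest(answers)
print(values["prometheus"]["persistence"])
```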

Template-Based Configuration

Rancher supports template prefixes for configuration:

_tpl-Node_Selector: Applies node selectors to multiple components
_tpl-Storage_Class: Applies storage configuration to multiple components

Pattern: _tpl-{ConfigName} expands to {component}.{configKey}

Example:

_tpl-Node_Selector | nodeSelector#(prometheus,grafana,exporter-kube-state)

Expands to:

prometheus.nodeSelector.*
grafana.nodeSelector.*
exporter-kube-state.nodeSelector.*

Source: pkg/monitoring/monitoring.go:48-94
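
The expansion described above can be sketched as follows. This is one reading of the configKey#(component,...) pattern, not the parsing logic in pkg/monitoring/monitoring.go; the function name and the region/zone settings are assumptions borrowed from the configuration example.

```python
import re

TEMPLATE_RE = re.compile(r"^(?P<key>[\w-]+)#\((?P<components>[^)]*)\)$")


def expand_template(spec: str, settings: dict) -> dict:
    """Expand 'configKey#(a,b)' into '{component}.{configKey}.{name}' entries."""
    match = TEMPLATE_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"unrecognised template spec: {spec!r}")
    config_key = match.group("key")
    components = [c.strip() for c in match.group("components").split(",") if c.strip()]
    return {
        f"{component}.{config_key}.{name}": value
        for component in components
        for name, value in settings.items()
    }


expanded = expand_template(
    "nodeSelector#(prometheus,grafana,exporter-kube-state)",
    {"region": "region-a", "zone": "zone-b"},
)
for key in sorted(expanded):
    print(key, "=", expanded[key])
```

One template entry with two settings yields six concrete keys, which is why the template form is convenient when several components share the same placement or storage settings.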

Best Practices

Retention and Storage

Configure appropriate retention periods (default: 360h/15 days)
Use persistent volumes for production environments
Monitor storage usage and set up cleanup policies
Consider remote write for long-term storage

Resource Allocation

Allocate sufficient CPU and memory to Prometheus
Use node selectors to place monitoring components strategically
Configure resource requests and limits
Track the resource usage of the monitoring stack itself

Alert Management

Create meaningful alert rules with appropriate thresholds
Group related alerts to reduce noise
Configure appropriate notification channels
Test alerts regularly to ensure they work
Document runbooks for common alerts

Multi-Tenancy

Use project-level monitoring for team isolation
Configure RBAC to control monitoring access
Separate monitoring resources per project/team
Use label-based routing for alerts

Troubleshooting

Common Issues

Prometheus Not Scraping Targets

Verify ServiceMonitor selectors match services
Check network policies allow scraping
Ensure metrics endpoints are accessible
Review Prometheus logs for scrape errors

High Memory Usage

Reduce retention period
Decrease scrape frequency
Limit cardinality of labels
Configure memory limits appropriately
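
To see why label cardinality matters for memory, note that the number of time series for one metric is bounded by the product of the distinct values of each label. The sketch below computes that upper bound for hypothetical label sets; the label names and counts are invented for illustration.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for a single metric.
bounded = {"status": 5, "method": 4, "pod": 20}
with_user_id = {**bounded, "user_id": 10_000}

print(max_series(bounded))       # 5 * 4 * 20
print(max_series(with_user_id))  # the same metric with an unbounded-style label
```

Adding a single high-cardinality label such as a user ID multiplies the bound by the number of distinct values, so dropping or bucketing such labels is often the most effective way to reduce Prometheus memory usage.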

Missing Metrics

Verify exporters are running
Check ServiceMonitor configuration
Ensure endpoints expose metrics in Prometheus format
Review Prometheus service discovery

Debugging Commands

# Check Prometheus status
kubectl get prometheus -n cattle-monitoring-system
kubectl describe prometheus -n cattle-monitoring-system

# View Prometheus logs
kubectl logs -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 -c prometheus

# Check ServiceMonitors
kubectl get servicemonitors -A

# View PrometheusRules
kubectl get prometheusrules -A

# Check Alertmanager
kubectl get pods -n cattle-monitoring-system | grep alertmanager
kubectl logs -n cattle-monitoring-system alertmanager-rancher-monitoring-alertmanager-0 -c alertmanager

Next Steps

Configure ServiceMonitors for custom applications
Create custom dashboards in Grafana
Set up alert rules for critical metrics
Integrate with external monitoring systems
Explore Telemetry for usage analytics