
Kubernetes Monitoring Stack: Prometheus + Grafana + Loki

997 words · 5 mins
Author
Maksim P.
DevOps Engineer / SRE

TL;DR

  • Prometheus for metrics collection and alerting rules
  • Grafana for dashboards and visualization
  • Loki for log aggregation (lightweight alternative to Elasticsearch)
  • AlertManager for routing alerts to Slack/PagerDuty
  • Runs on k3s or any Kubernetes cluster, total overhead: ~1.5 GB RAM
  • Everything installed via Helm, configured via values files you can version control

Who this stack is for

You run a k3s or small Kubernetes cluster and want real monitoring without paying for Datadog. You’re comfortable with Helm and kubectl. You want logs and metrics in one place, with alerts that actually wake someone up when things break.

The stack

Layer          Tool           Why
Metrics        Prometheus     Industry standard, huge ecosystem
Dashboards     Grafana        Flexible, free, large dashboard library
Logs           Loki           Log aggregation without the Elasticsearch overhead
Log collector  Promtail       Ships logs to Loki, runs as DaemonSet
Alerts         AlertManager   Routes alerts by severity to Slack, PagerDuty, email

Namespace setup

kubectl create namespace monitoring

Prometheus + AlertManager

Install via the kube-prometheus-stack Helm chart, which bundles Prometheus, AlertManager, Grafana, and common dashboards.

Save as prometheus-values.yaml:

prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 512Mi
        cpu: 250m
      limits:
        memory: 1Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi

    # Scrape all ServiceMonitors across namespaces
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: "slack-notifications"
      group_by: ["alertname", "namespace"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - receiver: "slack-critical"
          match:
            severity: critical
          repeat_interval: 1h
        - receiver: "slack-notifications"
          match:
            severity: warning
    receivers:
      - name: "slack-notifications"
        slack_configs:
          - api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
            channel: "#alerts"
            title: "{{ .GroupLabels.alertname }}"
            text: >-
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Severity:* {{ .Labels.severity }}
              *Namespace:* {{ .Labels.namespace }}
              {{ end }}
      - name: "slack-critical"
        slack_configs:
          - api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
            channel: "#alerts-critical"
            title: "CRITICAL: {{ .GroupLabels.alertname }}"
            text: >-
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Namespace:* {{ .Labels.namespace }}
              {{ end }}

  alertmanagerSpec:
    resources:
      requests:
        memory: 64Mi
        cpu: 50m
      limits:
        memory: 128Mi

grafana:
  adminPassword: "change-me-immediately"
  persistence:
    enabled: true
    size: 5Gi
  resources:
    requests:
      memory: 128Mi
      cpu: 100m
    limits:
      memory: 256Mi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: "default"
          orgId: 1
          folder: ""
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default

# Default recording and alerting rules
defaultRules:
  create: true
  rules:
    etcd: false  # Disable if not running etcd (e.g., k3s with SQLite)
    kubeScheduler: false  # Not accessible on most managed clusters
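
A quick sanity check on the 20Gi volume against the 15d retention. Prometheus disk usage is roughly retention seconds × ingested samples per second × bytes per sample, with compressed samples commonly landing around 1-2 bytes; the 5,000 samples/s below is an assumed figure for a small cluster, not a measurement:

```shell
# disk ≈ retention_seconds * samples_per_second * bytes_per_sample
awk 'BEGIN {
  retention_s      = 15 * 24 * 3600   # 15d retention
  samples_per_s    = 5000             # assumed small-cluster scrape volume
  bytes_per_sample = 2                # conservative upper bound
  printf "~%.1f GiB\n", retention_s * samples_per_s * bytes_per_sample / (1024 ^ 3)
}'
# prints ~12.1 GiB, comfortably inside the 20Gi claim
```

If your scrape volume is much higher, redo this math before picking a volume size.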

Install it:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml
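
Once the release settles, confirm everything came up. The chart prefixes resources with the Helm release name, so with `kube-prometheus` as the release the Grafana service is `kube-prometheus-grafana`; other service names can get truncated, so list them rather than guessing:

```shell
# All pods should reach Running/Ready
kubectl get pods -n monitoring

# Find the exact service names the chart generated
kubectl get svc -n monitoring

# Port-forward the Prometheus service (substitute the name from the list above)
# and check http://localhost:9090/targets for healthy scrape targets
kubectl port-forward svc/kube-prometheus-kube-prome-prometheus 9090:9090 -n monitoring
```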

Loki + Promtail

Save as loki-values.yaml:

loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h
  limits_config:
    retention_period: 168h  # 7 days
    max_query_series: 500
  compactor:
    retention_enabled: true

singleBinary:
  replicas: 1
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
  persistence:
    enabled: true
    size: 10Gi

gateway:
  enabled: false

# Minimal deployment — no read/write separation
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0

Save as promtail-values.yaml:

config:
  clients:
    - url: http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push

  snippets:
    pipelineStages:
      - cri: {}
      - multiline:
          firstline: '^\d{4}-\d{2}-\d{2}'
          max_wait_time: 3s

resources:
  requests:
    memory: 64Mi
    cpu: 50m
  limits:
    memory: 128Mi
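
The multiline stage buffers lines until one matches firstline, so a stack trace gets attached to the log line that preceded it rather than becoming separate entries. A rough illustration of what that regex selects (the sample log lines are made up):

```shell
# Only the timestamped line matches firstline; the indented continuation
# line does not, so Promtail folds both raw lines into one log entry.
printf '2024-05-01 12:00:00 ERROR boom\n    at com.example.Main(Main.java:42)\n' \
  | grep -cE '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
# prints 1: one entry boundary for two raw lines
```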

Install both:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki \
  --namespace monitoring \
  --values loki-values.yaml

helm install promtail grafana/promtail \
  --namespace monitoring \
  --values promtail-values.yaml
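
To confirm logs are actually flowing, port-forward Loki and hit its HTTP API; these are Loki's standard endpoints, on the same service and port Promtail pushes to:

```shell
kubectl port-forward svc/loki 3100:3100 -n monitoring &
sleep 2

# Readiness probe: should print "ready"
curl -s http://localhost:3100/ready

# List the namespace label values Loki has seen;
# a populated list means Promtail is shipping logs
curl -s 'http://localhost:3100/loki/api/v1/label/namespace/values'
```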

Connect Loki to Grafana

Add Loki as a data source in Grafana. You can do this via the UI or by adding to your prometheus-values.yaml under the Grafana section:

grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki.monitoring.svc.cluster.local:3100
      access: proxy
      isDefault: false

Then upgrade the Helm release:

helm upgrade kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml

Custom alert rules

Add application-specific alerts. Save the manifest below as custom-alerts.yaml and apply it with kubectl apply -f custom-alerts.yaml. The release: kube-prometheus label must match your Helm release name; that label is how the operator discovers the rule, and without it Prometheus silently ignores the resource.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  groups:
    - name: app.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
            > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "More than 5% of requests are failing on {{ $labels.service }} for the last 5 minutes."

        - alert: HighLatency
          expr: |
            histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
            > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High p95 latency on {{ $labels.service }}"
            description: "p95 latency is above 1s on {{ $labels.service }}."

        - alert: PodRestartLoop
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restart-looping"
            description: "Pod {{ $labels.pod }} in {{ $labels.namespace }} has restarted more than 5 times in the last hour."

        - alert: PersistentVolumeSpaceLow
          expr: |
            kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.15
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PV {{ $labels.persistentvolumeclaim }} is running low on space"
            description: "Less than 15% space remaining."
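
Worth spelling out the HighErrorRate arithmetic: the expression divides the 5xx rate by the total request rate per service, so for example 6 failing requests out of 100 in the window gives 0.06, which clears the 0.05 threshold and, once it has held for the 5m `for` duration, fires. The same comparison with assumed numbers:

```shell
# Same ratio-vs-threshold comparison the PromQL expression makes,
# using hypothetical 5m request counts
awk 'BEGIN {
  errors = 6; total = 100
  ratio  = errors / total
  printf "ratio=%.2f fires=%s\n", ratio, (ratio > 0.05) ? "yes" : "no"
}'
# prints ratio=0.06 fires=yes
```

After applying, the rules show up in the Prometheus UI under Status → Rules.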

Access Grafana

# Port-forward to access locally
kubectl port-forward svc/kube-prometheus-grafana 3000:80 -n monitoring

Or expose Grafana via an ingress. Traefik IngressRoute for k3s:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`grafana.your-domain.com`)
      kind: Rule
      services:
        - name: kube-prometheus-grafana
          port: 80
  tls:
    certResolver: letsencrypt
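
The letsencrypt certResolver has to exist in Traefik's static configuration, and k3s's bundled Traefik doesn't define one by default. A sketch of wiring it up through k3s's HelmChartConfig mechanism, assuming the TLS-ALPN challenge; the email and storage path are placeholders:

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    additionalArguments:
      - "--certificatesresolvers.letsencrypt.acme.email=you@example.com"
      - "--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json"
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
```

k3s redeploys Traefik when this changes; you'll also want persistence for /data so issued certificates survive pod restarts.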

Resource totals

What this stack costs in cluster resources:

Component            Memory request   Memory limit   CPU request
Prometheus           512Mi            1Gi            250m
AlertManager         64Mi             128Mi          50m
Grafana              128Mi            256Mi          100m
Loki                 256Mi            512Mi          100m
Promtail (per node)  64Mi             128Mi          50m
Total (single node)  ~1 Gi            ~2 Gi          ~550m

Fits comfortably on a 4 GB k3s node alongside your workloads.

When to outgrow this stack

  • Logs exceed 50 GB/day: Consider Loki with S3 backend or switch to a managed solution
  • Multi-cluster: Add Thanos or Mimir on top of Prometheus for cross-cluster metrics
  • Compliance needs: Switch to a managed service that handles retention policies and audit trails
  • Team grows past 15 engineers: The dashboard sprawl becomes real, consider Datadog or Grafana Cloud
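
For the S3 route, the change is mostly confined to the storage block of loki-values.yaml; the bucket name, region, and credential handling below are placeholders:

```yaml
loki:
  storage:
    type: s3
    bucketNames:
      chunks: my-loki-chunks     # placeholder bucket name
    s3:
      region: us-east-1          # placeholder region
      # Credentials: prefer an IAM role / IRSA over static keys
```

The schemaConfig object_store also changes from filesystem to s3, typically via a new schema entry dated from the cutover so existing local chunks stay readable.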
