## TL;DR
- Prometheus for metrics collection and alerting rules
- Grafana for dashboards and visualization
- Loki for log aggregation (lightweight alternative to Elasticsearch)
- AlertManager for routing alerts to Slack/PagerDuty
- Runs on k3s or any Kubernetes cluster, total overhead: ~1.5 GB RAM
- Everything installed via Helm, configured via values files you can version control
## Who this stack is for
You run a k3s or small Kubernetes cluster and want real monitoring without paying for Datadog. You’re comfortable with Helm and kubectl. You want logs and metrics in one place, with alerts that actually wake someone up when things break.
## The stack
| Layer | Tool | Why |
|---|---|---|
| Metrics | Prometheus | Industry standard, huge ecosystem |
| Dashboards | Grafana | Flexible, free, large dashboard library |
| Logs | Loki | Log aggregation without the Elasticsearch overhead |
| Log collector | Promtail | Ships logs to Loki, runs as DaemonSet |
| Alerts | AlertManager | Routes alerts by severity to Slack, PagerDuty, email |
## Namespace setup

```bash
kubectl create namespace monitoring
```

## Prometheus + AlertManager
Install via the `kube-prometheus-stack` Helm chart, which bundles Prometheus, AlertManager, Grafana, and common dashboards.
Save as `prometheus-values.yaml`:

```yaml
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 512Mi
        cpu: 250m
      limits:
        memory: 1Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
    # Scrape all ServiceMonitors across namespaces
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: "slack-notifications"
      group_by: ["alertname", "namespace"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - receiver: "slack-critical"
          match:
            severity: critical
          repeat_interval: 1h
        - receiver: "slack-notifications"
          match:
            severity: warning
    receivers:
      - name: "slack-notifications"
        slack_configs:
          - api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
            channel: "#alerts"
            title: "{{ .GroupLabels.alertname }}"
            text: >-
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Severity:* {{ .Labels.severity }}
              *Namespace:* {{ .Labels.namespace }}
              {{ end }}
      - name: "slack-critical"
        slack_configs:
          - api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
            channel: "#alerts-critical"
            title: "CRITICAL: {{ .GroupLabels.alertname }}"
            text: >-
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Namespace:* {{ .Labels.namespace }}
              {{ end }}
  alertmanagerSpec:
    resources:
      requests:
        memory: 64Mi
        cpu: 50m
      limits:
        memory: 128Mi

grafana:
  adminPassword: "change-me-immediately"
  persistence:
    enabled: true
    size: 5Gi
  resources:
    requests:
      memory: 128Mi
      cpu: 100m
    limits:
      memory: 256Mi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: "default"
          orgId: 1
          folder: ""
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default

# Default recording and alerting rules
defaultRules:
  create: true
  rules:
    etcd: false          # Disable if not running etcd (e.g., k3s with SQLite)
    kubeScheduler: false # Not accessible on most managed clusters
```

Install it:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml
```

## Loki + Promtail
Save as `loki-values.yaml`:

```yaml
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h
  limits_config:
    retention_period: 168h # 7 days
    max_query_series: 500
  compactor:
    retention_enabled: true

singleBinary:
  replicas: 1
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
  persistence:
    enabled: true
    size: 10Gi

gateway:
  enabled: false

# Minimal deployment: no read/write separation
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0
```

Save as `promtail-values.yaml`:

```yaml
config:
  clients:
    - url: http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push
  snippets:
    pipelineStages:
      - cri: {}
      - multiline:
          firstline: '^\d{4}-\d{2}-\d{2}'
          max_wait_time: 3s

resources:
  requests:
    memory: 64Mi
    cpu: 50m
  limits:
    memory: 128Mi
```

Install both:

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki \
  --namespace monitoring \
  --values loki-values.yaml
helm install promtail grafana/promtail \
  --namespace monitoring \
  --values promtail-values.yaml
```

## Connect Loki to Grafana
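Once the data source is in place, logs become queryable with LogQL from Grafana's Explore view. Two sketches, assuming Promtail's default Kubernetes labels (`namespace`, `pod`, `container`):

```logql
# All log lines from a namespace that contain "error"
{namespace="production"} |= "error"

# Per-pod rate of lines containing " 500 " in nginx container logs
sum by (pod) (rate({container="nginx"} |= " 500 " [5m]))
```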
Add Loki as a data source in Grafana. You can do this via the UI or by adding it to `prometheus-values.yaml` under the `grafana` section:
```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki.monitoring.svc.cluster.local:3100
      access: proxy
      isDefault: false
```

Then upgrade the Helm release:

```bash
helm upgrade kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml
```

## Custom alert rules
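The rules below assume Prometheus is scraping your application's metrics in the first place. Because the values file sets `serviceMonitorSelectorNilUsesHelmValues: false`, a ServiceMonitor in any namespace is picked up automatically. A minimal sketch, where the name, namespace, and port are placeholders for your own service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app-ns
spec:
  selector:
    matchLabels:
      app: my-app      # must match the labels on your Service
  endpoints:
    - port: http       # named port on the Service that serves /metrics
      path: /metrics
      interval: 30s
```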
Add application-specific alerts. Save the following as `custom-alerts.yaml` and apply it with `kubectl apply -f custom-alerts.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  groups:
    - name: app.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              /
            sum(rate(http_requests_total[5m])) by (service)
              > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "More than 5% of requests are failing on {{ $labels.service }} for the last 5 minutes."
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High p95 latency on {{ $labels.service }}"
            description: "p95 latency is above 1s on {{ $labels.service }}."
        - alert: PodRestartLoop
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restart-looping"
            description: "Pod {{ $labels.pod }} in {{ $labels.namespace }} has restarted more than 5 times in the last hour."
        - alert: PersistentVolumeSpaceLow
          expr: |
            kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.15
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PV {{ $labels.persistentvolumeclaim }} is running low on space"
            description: "Less than 15% space remaining."
```

## Access Grafana
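The chart keeps Grafana's admin credentials in a Secret, which is handy if you forget what you set. The Secret name below assumes the `kube-prometheus` release name used above:

```bash
kubectl get secret kube-prometheus-grafana -n monitoring \
  -o jsonpath="{.data.admin-password}" | base64 -d; echo
```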
```bash
# Port-forward to access locally
kubectl port-forward svc/kube-prometheus-grafana 3000:80 -n monitoring
```

Or expose it via an Ingress. A Traefik IngressRoute example for k3s:

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`grafana.your-domain.com`)
      kind: Rule
      services:
        - name: kube-prometheus-grafana
          port: 80
  tls:
    certResolver: letsencrypt
```

## Resource totals
What this stack costs in cluster resources:
| Component | Memory request | Memory limit | CPU request |
|---|---|---|---|
| Prometheus | 512Mi | 1Gi | 250m |
| AlertManager | 64Mi | 128Mi | 50m |
| Grafana | 128Mi | 256Mi | 100m |
| Loki | 256Mi | 512Mi | 100m |
| Promtail (per node) | 64Mi | 128Mi | 50m |
| Total (single node) | ~1 Gi | ~2 Gi | ~550m |
Fits comfortably on a 4 GB k3s node alongside your workloads.
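To see what the stack actually consumes on your cluster, metrics-server (bundled with k3s by default) can report live numbers:

```bash
kubectl top pods -n monitoring
```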
## When to outgrow this stack
- Logs exceed 50 GB/day: Consider Loki with S3 backend or switch to a managed solution
- Multi-cluster: Add Thanos or Mimir on top of Prometheus for cross-cluster metrics
- Compliance needs: Switch to a managed service that handles retention policies and audit trails
- Team grows past 15 engineers: The dashboard sprawl becomes real, consider Datadog or Grafana Cloud
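For the first case, moving Loki's chunks to object storage is mostly a values change rather than a re-architecture. A sketch following the grafana/loki chart's storage keys; the bucket name and region are placeholders, and credentials are assumed to come from an IAM role or instance profile:

```yaml
loki:
  storage:
    type: s3
    bucketNames:
      chunks: my-loki-chunks
    s3:
      region: us-east-1
      # accessKeyId / secretAccessKey can go here if you are not using IAM roles
```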
## Related reads
- Observability on a Budget — philosophy behind this stack
- k3s: Lightweight Kubernetes — the runtime this stack pairs best with
- Incident Response for Small Teams — what to do when alerts fire