[Engineering Log] The Migration Strangler: Promtail to Alloy
- #retrospective
- #platform
- #kubernetes
Context
Promtail is reaching End-of-Life (EOL). To stay current with the Grafana ecosystem and move toward a unified telemetry agent (logs, metrics, and traces in one binary), I needed to migrate to Grafana Alloy.
Challenge
- The Pain Point: Promtail's YAML-based pipelines were reaching their limits for complex relabeling, and staying on EOL software introduces significant security and technical debt.
- The Question: How do I migrate the "telemetry brain" of a live system without losing log visibility during the transition?
Investigation
- Concept: The Strangler Fig Pattern. Instead of a "Big Bang" replacement, I deployed Alloy alongside Promtail.
- Discovery: Alloy uses River (an HCL-based language), which is programmable and component-based, unlike Promtail's static YAML.
| Feature | Promtail (Legacy) | Grafana Alloy (New) |
|---|---|---|
| Config | Static YAML | Programmable River/HCL |
| Scope | Logs only | Logs, Metrics, Traces |
| Deployment | Standalone Container | kubernetes-native DaemonSet |
Solution
I deployed Alloy as a kubernetes DaemonSet to scrape the host's /var/log/journal. The critical insight was the two-stage cutover:
- Phase 1 (Hybrid): Alloy (in Kubernetes) was configured to push to the Docker-Loki instance. This allowed me to compare Alloy's logs directly against Promtail's logs in the same Grafana instance to verify parity.
- Phase 2 (Native): Once Loki was migrated to Kubernetes, I simply updated the endpoint URL.
// Alloy's filtered pipeline: Only keeping what matters
loki.source.journal "systemd" {
forward_to = [loki.process.journal_pipeline.receiver]
relabel_rules = loki.relabel.journal_relabel.rules
// ...
}
loki.write "local_loki" {
endpoint {
// Initial: "http://host.docker.internal:3100/loki/api/v1/push"
// Final (Kubernetes):
url = "http://loki.observability.svc.cluster.local:3100/loki/api/v1/push"
}
}
This "Strangler Fig" approach meant I never had a "blind spot" in my logs during the migration.
Outcome
- Result: Successfully migrated host-level logs with zero downtime.
- Lesson: Components > Static Config. Being able to "wire" telemetry outputs to multiple destinations (Loki + Debug) made validation trivial.
Thank you
Big thanks for reading! You’re awesome, and I hope this post helped. Until next time!