Grafana
Why I recommend it for teams serious about observability
I’ve run Grafana in production environments and home lab settings, connecting it to Prometheus for metrics, Loki for logs, and various external data sources including CloudWatch and PostgreSQL. The dashboards are genuinely the best in class. The ops investment to run the full LGTM stack yourself is real, and I’ve watched teams underestimate it. This is the honest version of that evaluation.
Why Grafana works
Four reasons hold up across deployments.
Best-in-class dashboards
Visualization quality is the headline. Time-series charts handle high-cardinality data without flinching. Heatmaps make latency distribution visible. Geomaps, state timelines, and canvas panels cover use cases from NOC boards to executive summaries. Variables and templating make one dashboard reusable across environments, regions, or services: pick a value from a dropdown and every panel re-queries against it.
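To make the templating point concrete, here is a minimal sketch that builds a dashboard definition with one dropdown variable and one panel that references it. The field names follow Grafana's dashboard JSON model as I understand it; the metric name, the `env` label, and the `uid` are hypothetical placeholders, not anything from a real deployment.

```python
import json


def build_dashboard(title: str, environments: list[str]) -> dict:
    """Return a dashboard dict with one 'env' dropdown and one panel using it."""
    env_variable = {
        "name": "env",
        "label": "Environment",
        "type": "custom",
        # Custom variables take their options as a comma-separated string.
        "query": ",".join(environments),
        "current": {"text": environments[0], "value": environments[0]},
    }
    panel = {
        "id": 1,
        "type": "timeseries",
        "title": "Request rate ($env)",
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
        "targets": [
            {
                "refId": "A",
                # Hypothetical metric and label; $env is filled in by the dropdown.
                "expr": 'sum(rate(http_requests_total{env="$env"}[5m]))',
            }
        ],
    }
    return {
        "uid": "request-rate",  # hypothetical stable identifier
        "title": title,
        "templating": {"list": [env_variable]},
        "panels": [panel],
    }


dashboard = build_dashboard("Service overview", ["prod", "staging", "dev"])
print(json.dumps(dashboard, indent=2))
```

The same definition serves all three environments; adding a fourth is a one-word change to the list rather than a new dashboard.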
The first time a team sees a correlated view — Prometheus metrics in one panel, Loki log lines in the next, filtered to the same 90-second window of a latency spike — the value is obvious. You stop asking "what happened" and start reading the answer.
The LGTM stack (Loki, Grafana, Tempo, Mimir)
Metrics via Prometheus or Mimir, logs via Loki, traces via Tempo, visualized in Grafana. Each component is independently best-in-class for its domain. The integration is where the compounding value appears — trace IDs in Loki log lines link directly to Tempo trace spans; exemplars in Prometheus link metric anomalies to specific request traces.
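The log-to-trace link depends on being able to pull a trace ID out of each log line. The sketch below shows that extraction step with a regular expression, and then a dict shaped like a Loki data source "derived field" entry that would carry the same regex. The log format (`trace_id=` in logfmt style), the Tempo data source UID, and the exact derived-field keys are assumptions to check against your own setup and the Grafana docs.

```python
import re
from typing import Optional

# Assumed log format: logfmt-style lines carrying a `trace_id=` field.
TRACE_ID_PATTERN = re.compile(r"trace_id=(\w+)")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID embedded in a log line, or None if there is none."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


def derived_field(tempo_uid: str) -> dict:
    """Build a dict shaped like a Loki derived-field entry (keys are assumed)."""
    return {
        "name": "TraceID",
        "matcherRegex": TRACE_ID_PATTERN.pattern,
        "datasourceUid": tempo_uid,  # hypothetical UID of the Tempo data source
        "url": "${__value.raw}",     # passes the captured ID through as the query
    }


line = 'level=error msg="upstream timeout" trace_id=4bf92f3577b34da6 duration=2.1s'
print(extract_trace_id(line))  # 4bf92f3577b34da6
print(derived_field("tempo"))
```

If the application does not log a trace ID at all, no amount of dashboard configuration creates the link; the instrumentation has to come first.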
Mimir extends Prometheus to horizontally scalable, long-retention storage. Loki stores logs without indexing the full content, which makes ingestion dramatically cheaper than Elasticsearch-based stacks. The cost delta at high log volume is not small.
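The cost difference comes from what gets indexed. The toy model below is not Loki's implementation, only an illustration of the idea: log lines are grouped into streams keyed by their label set, only the labels are indexed, and content matching is a scan over the selected streams at query time. All class and label names are made up for the example.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy log store: indexes label sets only, never the log content."""

    def __init__(self) -> None:
        # One append-only list of lines per unique label set (a "stream").
        self._streams: dict[frozenset, list[str]] = defaultdict(list)

    def push(self, labels: dict[str, str], line: str) -> None:
        # Ingestion is a cheap append; the line itself is not tokenized.
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict[str, str], contains: str = "") -> list[str]:
        # Step 1: narrow down to matching streams using the label index.
        wanted = set(selector.items())
        results: list[str] = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:
                # Step 2: brute-force scan of content within those streams.
                results.extend(line for line in lines if contains in line)
        return results


store = LabelIndexedStore()
store.push({"app": "api", "env": "prod"}, "GET /orders 200")
store.push({"app": "api", "env": "prod"}, "GET /orders 504 upstream timeout")
store.push({"app": "worker", "env": "prod"}, "job 42 timeout")
print(store.query({"app": "api"}, contains="timeout"))
```

The tradeoff is visible even in the toy: writes are cheap, and queries stay fast only when the label selector narrows the scan to a small number of streams.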
Data source federation
One Grafana instance can query Prometheus, Loki, CloudWatch, BigQuery, Elasticsearch, MySQL, InfluxDB, Datadog, and over 100 other data sources simultaneously. Cross-source correlation is where the real value lives — a log panel next to a metric panel for the same incident time window, pulling from different backends, visualized side by side without data migration.
For multi-cloud shops that have AWS CloudWatch metrics, GCP Cloud Logging, and on-prem Prometheus all running at once, Grafana is often the only tool that can render a coherent ops picture without forcing data into a single backend. That federation capability is genuinely hard to replicate anywhere else.
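What Grafana does for you in a correlated view is, roughly, send the same time window to different backends. The sketch below builds range-query URLs for Prometheus and Loki around a single incident timestamp without sending any requests. The base URLs, the PromQL and LogQL expressions, and the 90-second default are illustrative assumptions; the endpoint paths are the standard `query_range` routes of each HTTP API.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def incident_queries(
    prom_url: str, loki_url: str, spike_at: datetime, window_seconds: int = 90
) -> tuple[str, str]:
    """Build Prometheus and Loki range-query URLs for one shared time window."""
    half = timedelta(seconds=window_seconds / 2)
    start_s = int((spike_at - half).timestamp())
    end_s = int((spike_at + half).timestamp())

    prom_params = {
        # Hypothetical latency histogram metric.
        "query": "histogram_quantile(0.99, sum by (le) "
                 "(rate(http_request_duration_seconds_bucket[1m])))",
        "start": start_s,  # Prometheus accepts Unix seconds
        "end": end_s,
        "step": "15s",
    }
    loki_params = {
        # Hypothetical stream selector and line filter.
        "query": '{app="api"} |= "error"',
        "start": start_s * 1_000_000_000,  # Loki accepts Unix nanoseconds
        "end": end_s * 1_000_000_000,
        "limit": 500,
    }
    return (
        f"{prom_url}/api/v1/query_range?{urlencode(prom_params)}",
        f"{loki_url}/loki/api/v1/query_range?{urlencode(loki_params)}",
    )


spike = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
for url in incident_queries("http://prometheus:9090", "http://loki:3100", spike):
    print(url)
```

Neither backend knows about the other; the correlation lives entirely in the shared start and end values.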
Alloy agent for unified telemetry collection
Grafana Alloy replaces Promtail and Grafana Agent, and can take over the role of a standalone OpenTelemetry Collector, with a single binary and a single config file. It's OpenTelemetry-native, which matters as the industry converges on OTel as the telemetry wire standard. One agent, one config, one upgrade path for the entire collection layer.
The operational simplification of a single agent at the host level — versus maintaining separate scrape configs, Promtail pipelines, and OTel collectors in parallel — pays back quickly on any fleet larger than a handful of nodes. Alloy is the answer to "why are we running three agents on every box."
Where it fits best
Not every shop. The fit is sharpest when one of these describes you:
- Teams that want to own their telemetry pipeline, avoid proprietary ingest formats, and aren't willing to pay Datadog prices at scale. The platform rewards engineering investment; it doesn't compensate for the absence of it.
- Prometheus is the default for Kubernetes metrics. The kube-state-metrics and node-exporter Helm charts are well-maintained; the Kubernetes Grafana dashboards are published and community-hardened. The integration isn't a project; it's a few helm installs.
- AWS, Azure, and GCP all have native monitoring that doesn't talk to each other. Grafana's data source plugins for CloudWatch, Azure Monitor, and Cloud Monitoring bring those signals into one place without requiring data migration to a central backend.
- The LGTM stack self-hosted requires someone who understands object storage backends (S3, GCS, MinIO), retention configuration, and horizontal scaling. If that sounds manageable, the economics are excellent. If that sounds like a second job, Grafana Cloud is the better entry point.
If your monitoring needs are infrastructure-first (ping + SNMP), Zabbix is more direct. If you want zero-ops SaaS, look at Datadog. If SaaS pricing is the barrier, as it often is in LATAM, the Grafana Cloud Free tier covers a lot.
The honest tradeoffs
Marketing won’t print these. I have, in production.
Ops complexity: self-hosting LGTM means running four services plus object storage
A production-grade self-hosted LGTM stack means Grafana, Prometheus (or Mimir for scale), Loki, and Tempo, each with its own HA configuration and retention tuning, plus an object storage backend (S3 or GCS) behind Loki, Tempo, and Mimir. Four services, four upgrade paths, four failure modes. Grafana Cloud sidesteps all of this; pricing grows with ingestion volume, but the ops cost drops close to zero. For teams without dedicated SRE capacity, Cloud is often the right answer even if self-hosted looks cheaper on a spreadsheet.
Alerting maturity: unified alerting caught up late and legacy migrations are real work
Grafana's unified alerting, introduced in 8.x and refined through 9.x and 10.x, is solid by 11.x. Teams running Grafana 7.x with the legacy alerting system face a real migration project to reach current state. Prometheus Alertmanager was historically more flexible for complex notification routing. The gap is largely closed in Grafana 11, but any org with legacy alert rules defined on dashboard panels has cleanup work before accessing the modern system.
Dashboard sprawl: without governance, 400 dashboards accumulate and no one owns them
The most common failure mode in mature Grafana deployments. Every team builds their own dashboards. After 18 months: 400 dashboards with overlapping queries, no canonical source for any signal, and a third of them unviewed in six months. The solution is dashboard provisioning from code (Jsonnet via Grafonnet, Terraform via the Grafana provider, or grafana-as-code tooling) from day one. Click-built dashboards become technical debt fast; provisioned ones survive team turnover and cluster rebuilds. This is an ops discipline problem, not a Grafana bug — but it's predictable enough to plan for before it bites you.
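Before deciding which dashboards to retire, it helps to measure the overlap. The sketch below takes a list of exported dashboard JSON documents (already loaded as dicts) and reports every query expression that appears in more than one dashboard. It assumes Prometheus-style targets with an `expr` field and only looks at top-level panels, so panels nested inside collapsed rows would need extra handling; the sample data is invented.

```python
from collections import defaultdict


def find_overlapping_queries(dashboards: list[dict]) -> dict[str, list[str]]:
    """Map each query used by 2+ dashboards to the titles of those dashboards."""
    owners: dict[str, set[str]] = defaultdict(set)
    for dashboard in dashboards:
        for panel in dashboard.get("panels", []):
            for target in panel.get("targets", []):
                # Collapse whitespace so trivially reformatted copies still match.
                expr = " ".join(target.get("expr", "").split())
                if expr:
                    owners[expr].add(dashboard["title"])
    return {
        expr: sorted(titles) for expr, titles in owners.items() if len(titles) > 1
    }


# Invented sample data standing in for exported dashboards.
exported = [
    {"title": "Team A API", "panels": [{"targets": [{"expr": "sum(rate(http_requests_total[5m]))"}]}]},
    {"title": "Team B API", "panels": [{"targets": [{"expr": "sum(rate(http_requests_total[5m]))"}]}]},
    {"title": "Nodes", "panels": [{"targets": [{"expr": "node_load1"}]}]},
]
for expr, titles in find_overlapping_queries(exported).items():
    print(expr, "->", titles)
```

A query shared by five dashboards is a strong candidate for one canonical, provisioned panel that the others link to.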
Open-source license change: Grafana Labs moved to AGPLv3 in 2021, and enterprise legal teams sometimes balk
Grafana, Loki, and Tempo moved from Apache 2.0 to AGPLv3 in 2021; Mimir launched in 2022 and has been AGPLv3 from the start. For teams running the stack internally for their own observability, AGPLv3 is not a problem. The obligations bite when you embed or redistribute the software in a product you sell, or offer a modified version to outside users as a network service. Enterprise legal teams sometimes flag AGPLv3 reflexively without analyzing whether internal use is restricted; it usually isn't. If your legal team asks: internal observability use is fine; reselling Grafana-as-a-service is where you need a commercial agreement with Grafana Labs.
Grafana is the open-source observability platform you build and own. If you'd rather rent than build, the calculus is different.
Is it right for your company?
Four dimensions to check before you commit:
- Size: 50–10,000+ employees with SRE or DevOps capacity. Below 50, Grafana Cloud Free tier handles most use cases and keeps ops overhead near zero. Above that, the self-hosted vs. Cloud decision is mostly an economic and ops-capacity question, not a features question.
- IT maturity: Kubernetes, Prometheus, or Linux-server-aware engineering team. Someone who knows what a scrape interval is, has written a PromQL query before, and understands retention tradeoffs. This is not a beginner’s observability platform.
- Existing stack: Cloud-native, multi-cloud, or hybrid with telemetry maturity ambitions. If you already have Prometheus running, Grafana is the obvious next step. If you’re starting from zero observability, consider whether Grafana Cloud’s managed ingest is a better entry point than standing up the full stack yourself.
- Geography: Global. LATAM has a strong and active Grafana open-source community, and Grafana Cloud’s pricing is USD-denominated with a free tier that covers meaningful workloads — useful in markets where enterprise SaaS pricing is a barrier.
If three of the four match, Grafana is on the shortlist. If all four match, it’s probably the right answer.
Who implements it
Internally, the lead implementer should be an SRE or senior DevOps engineer with Prometheus and Linux fluency. Grafana Labs has a certification program, but the labor market trains primarily through OSS adoption: most experienced Grafana engineers built their skills running the stack, not studying for an exam. The hiring signal is practical: can they write PromQL, have they configured Loki pipelines, do they understand exemplars and distributed tracing?
Grafana Labs offers Professional Services for enterprise rollouts — LGTM stack deployment, alerting migration, and training. Independent consultants with OTel ecosystem backgrounds can also fill this role.
If you’re evaluating self-host versus Grafana Cloud, or want a second opinion on your dashboard sprawl and alert architecture, let’s talk — a 30-minute scoping call is enough.
First steps
- Decide the deployment model first. Grafana Cloud Free (up to 10,000 active metrics series, 50 GB logs, 50 GB traces per month) is the lowest-friction entry and doesn't require any infrastructure. Self-hosted Grafana + Prometheus is the next step — start there if you already have Prometheus running or prefer to own the stack. Full LGTM self-hosting (Mimir, Loki, Tempo) is enterprise-tier ops investment; don't start there on day one unless you already have the SRE capacity to run it.
- Start with one data source and one use case. Don't try to unify metrics, logs, and traces on day one. Pick the most painful observability gap — usually application metrics — connect Prometheus or a CloudWatch data source, build three useful dashboards, and live with them for a month. The team learns what good looks like before scaling the scope. Add Loki for logs and Tempo for traces as the team matures into them, not before.
- Provision dashboards as code from day one. Use Jsonnet with Grafonnet, Terraform with the Grafana provider, or other grafana-as-code tooling. Click-built dashboards feel faster in week one and become painful in month six. Provisioned dashboards survive engineers leaving, cluster rebuilds, and Grafana upgrades without losing configuration. The governance habit is easier to start than to retrofit.
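For the provisioning step, the core habit is that dashboards live as files in version control and get written out deterministically. The sketch below is one minimal way to do that in plain Python, assuming Grafana's file-based provisioning is pointed at the output directory; the directory name, the `uid`, and the dashboard content are hypothetical, and in practice Grafonnet or Terraform would usually generate the definition.

```python
import json
import tempfile
from pathlib import Path


def write_dashboard(dashboard: dict, out_dir: Path) -> Path:
    """Write a dashboard definition to <out_dir>/<uid>.json with stable formatting."""
    if not dashboard.get("uid"):
        # A stable uid keeps links and provisioning updates pointing at the same dashboard.
        raise ValueError("dashboard needs a stable 'uid'")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{dashboard['uid']}.json"
    # sort_keys plus fixed indentation keeps diffs small and reviewable.
    path.write_text(json.dumps(dashboard, indent=2, sort_keys=True) + "\n")
    return path


example = {"uid": "service-overview", "title": "Service overview", "panels": []}
with tempfile.TemporaryDirectory() as tmp:
    written = write_dashboard(example, Path(tmp) / "dashboards")
    print(written.name)  # service-overview.json
```

Because the output is byte-for-byte reproducible, a change to a dashboard shows up as a normal code review diff instead of an untracked edit in the UI.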
Beyond first steps: talk to me about your observability stack. I’ll tell you in 30 minutes whether it’s a Grafana job, a Zabbix job, a Datadog job, or “instrument your apps before adding more dashboards.”