From Metric Overload to Victoria: Simplifying My Observability Stack

Or: how I stopped worrying and learned to love fewer containers

The Problem With "Best Practice" Stacks

There's a certain kind of infrastructure creep that happens to every DevOps engineer who's been around long enough. You start with a simple goal — "I want to know if my servers are alive" — and a few months later you're running a distributed metrics backend, object storage, log aggregation, tracing, and a Grafana dashboard that takes long enough to load that you reconsider your life choices.

That was me.

As a freelancer, I monitor a handful of client systems and personal projects. Nothing massive — around a dozen hosts, a few applications, some databases. But over time, I'd assembled a stack that looked like it belonged in a mid-sized SaaS company. The only problem: I'm not a mid-sized SaaS company. I'm one person with a laptop, a coffee habit, and a growing suspicion that I'd massively over-engineered things.

My setup looked like this:

  • Prometheus — metrics scraping
  • Mimir × 3 — HA long-term metrics storage
  • nginx — load balancing across the Mimir instances
  • Hetzner Object Storage (S3-compatible) — backend storage for Mimir
  • Loki — log aggregation
  • Promtail — log shipping
  • Tempo — distributed tracing
  • Alloy — ingestion and collection

Eight components. All working. All necessary — in the wrong context.

Every incident started with the same question: "Which of the eight things is broken this time?" That's usually a signal you've overdone it. And then there was the cost — three Mimir instances, an nginx server, and an S3 bucket continuously accumulating chunks and compacted blocks don't run for free. For a freelancer, that's real money every month to support infrastructure that was, let's be honest, sized for problems I didn't have.

Something had to change. But to understand how I got here, you need to know how it started — with a perfectly reasonable decision to ditch Graylog.


How It Started: Escaping the Elasticsearch Tax

Like many setups, this one started with logs.

I was running Graylog backed by Elasticsearch. It worked, but Elasticsearch has a very specific personality: it will happily consume every gigabyte of RAM you give it and then politely ask for more. Running it for a small environment felt like hiring a 40-tonne lorry to help you move a studio flat. Technically it gets the job done. Practically, it's absurd.

That's when I moved to Grafana Loki. The design made sense — label-based indexing instead of full-text search, lower resource footprint, tight Grafana integration. That migration was a clear win.

While rebuilding the logging side, I started looking at the metrics layer with fresh eyes too. Mimir had just become widely available — Grafana Labs had open-sourced it as a horizontally scalable, Prometheus-compatible long-term storage backend. It looked modern. It looked powerful. It had a shiny new logo and everything.

Reader, I installed it immediately.


The Mimir Chapter: Powerful, Educational, and Slightly Misplaced

To be fair to Mimir — it's genuinely impressive software, designed for multi-tenant environments at massive scale. If you're running metrics infrastructure for hundreds of services across multiple teams, it's a serious tool solving serious problems.

I was monitoring twelve hosts.

But at the time I was genuinely exploring the ecosystem, building experience with tools I might recommend to clients, and learning how modern observability infrastructure is actually architected. In that sense, the Mimir deep-dive was valuable. It taught me how distributed metrics systems are structured — distributors, ingesters, queriers, compactors — how object storage behaves under real workloads, and what multi-tenancy actually means when you have to configure it yourself.

Getting it properly set up took a while. My three-instance setup used nginx as a load balancer in front of the Mimir cluster — worth noting that nginx wasn't a Mimir requirement, just my approach to distributing traffic outside of Kubernetes. Each instance wrote to the same Hetzner S3 bucket for shared storage:

upstream mimir_cluster {
    server mimir-1:8080;
    server mimir-2:8080;
    server mimir-3:8080;
}

server {
    listen 8080;

    location /api/v1/push {
        proxy_pass http://mimir_cluster;
    }

    location / {
        proxy_pass http://mimir_cluster;
    }
}

Alloy wrote metrics to nginx, which distributed the load across the cluster. Lose one instance, the other two kept going. It was resilient, and once tuned correctly, stable.
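From Alloy's side, the write path was a single `remote_write` block pointed at the load balancer — a sketch from memory, with placeholder hostnames rather than the exact names I used:

```alloy
// Alloy pushes metrics to nginx, which round-robins across mimir-1..3.
// "nginx" is an illustrative hostname, not the real one.
prometheus.remote_write "mimir" {
  endpoint {
    url = "http://nginx:8080/api/v1/push"
  }
}
```

The nice side effect: because Alloy only ever knew about one URL, swapping what sat behind that URL later didn't require touching the collection layer at all.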

But using it day-to-day revealed some consistent friction.


Where the Friction Appeared

The Object Storage Reality

Both Mimir and Loki rely on object storage for long-term data. At scale, this design is excellent — cheap storage, virtually unlimited retention, horizontal scalability. But it comes with a trade-off that only becomes obvious when you're using it.

Recent data is fast because it lives in memory or local caches. Older data has to be fetched from object storage over the network, often involving multiple chunk reads per query. Without aggressive caching layers or query frontends, the experience looked like this:

  • Recent queries → fast
  • Historical queries → noticeably, frustratingly slower

Not broken. Not unusable. Just consistently slow in a way that made incident retrospectives and capacity planning feel like wading through mud. You'd watch the Grafana loading spinner and just... wait. Loki had exactly the same problem — recent logs were snappy, but anything from a few weeks back meant the same chunk-fetching dance with S3, and you'd feel every millisecond of network latency.

Features I Was Paying to Not Use

Mimir's multi-tenancy is one of its headline features. I never used it. Neither did I use horizontal scaling, tenant isolation, or complex ingestion pipelines. All genuinely valuable — just not for my environment. I was running enterprise-grade multi-tenancy infrastructure to monitor a dozen single-tenant servers.

"We built a system capable of handling the observability needs of a mid-sized SaaS company. We used it to monitor twelve servers." — my git commit history, essentially

Operational Overhead

The real issue wasn't performance or cost alone — it was operational complexity. Eight moving parts means more failure modes, more debugging surface, and more mental overhead. The stack was technically correct, just badly misaligned with reality.


Enter the Victoria Stack

After enough late evenings debugging S3 connectivity and waiting on slow historical queries, I started seriously looking at alternatives that preserved functionality but reduced complexity. I ended up migrating to:

  • VictoriaMetrics — replacing Prometheus + Mimir × 3 + nginx + S3
  • VictoriaLogs — replacing Loki + Promtail
  • Alloy — kept, unchanged
  • Tempo — kept, unchanged

Eight components down to four. The Hetzner S3 bucket got deleted. The nginx config became a memory. The monthly bill got friendlier. And critically — the separation between collection (Alloy) and storage (Victoria) made the whole migration straightforward, because I only had to swap one layer at a time.


What I Kept (And Why)

Grafana Alloy: The Ingestion Layer That Stays

Alloy is the constant throughout all of this, and it earns it. It acts as a vendor-neutral ingestion layer — collecting metrics, logs, and traces, and forwarding to any backend — which is exactly why swapping the storage layer underneath it was so clean.

It also replaces multiple agents in one. Before Alloy, you might run Promtail for logs, a Prometheus agent for metrics, and something else for traces. Alloy handles all of it with a single binary and HCL-based config that feels familiar if you've ever written Terraform.

Example: shipping logs and metrics from a single Alloy config

// Scrape node metrics
prometheus.scrape "node" {
  targets    = [{"__address__" = "localhost:9100"}]
  forward_to = [prometheus.remote_write.victoria.receiver]
}

// Send metrics to VictoriaMetrics
prometheus.remote_write "victoria" {
  endpoint {
    url = "http://victoriametrics:8428/api/v1/write"
  }
}

// Tail logs from a file
local.file_match "app_logs" {
  path_targets = [{"__path__" = "/var/log/myapp/*.log"}]
}

loki.source.file "app" {
  targets    = local.file_match.app_logs.targets
  forward_to = [loki.write.victoria_logs.receiver]
}

// Send logs to VictoriaLogs via Loki-compatible endpoint
loki.write "victoria_logs" {
  endpoint {
    url = "http://victorialogs:9428/insert/loki/api/v1/push"
  }
}

One config file. Metrics and logs. No Promtail, no separate Prometheus agent, no config files to keep in sync.

Tempo: Focused and Effective

Tempo stays because it does one thing well, and nothing in the Victoria ecosystem replaces it directly. Alloy already speaks OTLP and forwards traces to Tempo without complaint. If it ain't broke, don't replace it with three other things that might break.
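That trace path in Alloy is short. Roughly this — with the endpoints shown as illustrative defaults, not my exact config:

```alloy
// Receive OTLP traces from instrumented applications over gRPC
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

// Forward traces to Tempo's OTLP gRPC endpoint
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    tls {
      insecure = true
    }
  }
}
```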


What Changed

Prometheus + Mimir × 3 + nginx + S3 → VictoriaMetrics

VictoriaMetrics is a single binary that replaces five separate processes — Prometheus plus the three Mimir instances plus nginx — and the S3 bucket behind them. And it directly solves the S3 latency problem — because there is no S3. Data lives on local disk, and historical queries are fast.

Key improvements:

  • Single process, single port — runs on :8428, accepts Prometheus remote write, serves PromQL on the same port
  • No object storage dependency — no S3 bucket, no network latency on reads, no compactor debugging at 11pm
  • Fast historical queries — without S3 round-trips per chunk fetch, querying weeks of data is dramatically faster; this alone would have justified the migration
  • PromQL compatibility — existing Grafana dashboards work unchanged, with minor edge-case differences worth being aware of
  • Built-in retention — one flag: --retentionPeriod=6 for six months; no lifecycle rules, no separate compactor service

Example: VictoriaMetrics in Docker

services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    ports:
      - "8428:8428"
    volumes:
      - vm_data:/victoria-metrics-data
    command:
      - "--retentionPeriod=6"
      - "--storageDataPath=/victoria-metrics-data"

volumes:
  vm_data:

No nginx config. No S3 credentials. No cluster to coordinate. The same job that required a load balancer, three application instances, and cloud object storage now runs as a single container with a local volume.
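A quick way to sanity-check a fresh instance — an instant PromQL query over plain HTTP (hostname and metric are illustrative):

```shell
# Instant PromQL query — ingestion and querying share port 8428
curl -s 'http://localhost:8428/api/v1/query?query=up'
```

The familiar Prometheus read endpoints (`/api/v1/query_range`, `/api/v1/series`, `/api/v1/labels`) live on the same port, which is why Grafana can be pointed at it like any Prometheus datasource.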

"The best ops tool is the one you don't have to think about at 2am." — Ancient DevOps proverb (I may have invented this)

Loki + Promtail → VictoriaLogs

VictoriaLogs replaces Loki and solves the same class of problems on the logging side — including the slow historical query issue.

Key points:

  • Loki-compatible ingestion API — Alloy's loki.write component works completely unchanged, just pointed at a different URL
  • Local storage — no object storage latency; older logs query at local disk speeds
  • LogsQL instead of LogQL — different, not universally better or worse, but handles high-cardinality filtering well and felt intuitive after a short adjustment period
  • Grafana datasource plugin — existing log dashboards need minimal tweaking

Example: VictoriaLogs in Docker

services:
  victorialogs:
    image: victoriametrics/victoria-logs:latest
    ports:
      - "9428:9428"
    volumes:
      - vl_data:/vlogs-data
    command:
      - "--storageDataPath=/vlogs-data"
      - "--retentionPeriod=4w"

volumes:
  vl_data:

Four weeks of logs, one process, one port, zero Promtail sitting alongside it.
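Querying it is a plain HTTP call too — a minimal sketch against VictoriaLogs' LogsQL endpoint, with an illustrative hostname and filter:

```shell
# LogsQL over HTTP: log lines containing "error" from the last hour
curl -s http://localhost:9428/select/logsql/query \
  --data-urlencode 'query=_time:1h error'
```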


The Migration (Incremental, Zero Downtime)

Metrics:

  1. Run VictoriaMetrics alongside the existing stack
  2. Add a second prometheus.remote_write target in Alloy — dual-write to both
  3. Verify dashboards look identical against both datasources
  4. Decommission Prometheus, Mimir × 3, nginx, and the S3 bucket
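Step 2 sketched as Alloy config — hostnames are illustrative, and the Mimir endpoint still goes through the nginx front end:

```alloy
prometheus.scrape "node" {
  targets    = [{"__address__" = "localhost:9100"}]
  // Fan out every sample to both backends for the migration window
  forward_to = [
    prometheus.remote_write.mimir.receiver,
    prometheus.remote_write.victoria.receiver,
  ]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://nginx:8080/api/v1/push"
  }
}

prometheus.remote_write "victoria" {
  endpoint {
    url = "http://victoriametrics:8428/api/v1/write"
  }
}
```

Once the dashboards agreed, step 4 was just deleting the `mimir` block and everything behind it.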

Logs:

  1. Run VictoriaLogs alongside Loki
  2. Add a second loki.write target in Alloy — dual-write to both
  3. Update the Grafana datasource, verify dashboards
  4. Remove Loki, Promtail, and the redundant Alloy write target
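The log side mirrors it — dual-writing from the existing file pipeline, again with illustrative hostnames. VictoriaLogs exposes its Loki-compatible ingestion path under `/insert/`:

```alloy
loki.source.file "app" {
  targets    = local.file_match.app_logs.targets
  // Dual-write: every log line goes to both Loki and VictoriaLogs
  forward_to = [
    loki.write.loki.receiver,
    loki.write.victoria_logs.receiver,
  ]
}

loki.write "loki" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

loki.write "victoria_logs" {
  endpoint {
    url = "http://victorialogs:9428/insert/loki/api/v1/push"
  }
}
```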

Total downtime: zero. Total swearing: minimal. Total components decommissioned: six.


Before and After

| Component | Before | After |
| --- | --- | --- |
| Metrics collection | Prometheus | Alloy |
| Metrics storage | Mimir × 3 | VictoriaMetrics |
| Metrics load balancing | nginx | (removed) |
| Metrics backend | Hetzner S3 | Local disk |
| Log shipping | Promtail | Alloy |
| Log storage | Loki | VictoriaLogs |
| Tracing | Tempo | Tempo |
| Total services | 8 | 4 |
| Historical query speed | Slow (S3 latency) | Fast (local disk) |
| S3 costs | Monthly & growing | Gone |
| Multi-tenancy used | Never | Still never |

Trade-offs (Real Ones)

This setup is simpler — but not universally better. Worth being honest about what changed in both directions.

What I gave up:

  • Built-in horizontal scalability
  • Multi-tenancy support
  • The maturity and ecosystem depth of Mimir and Loki
  • High availability by default (this is now a single-node setup)

What I gained:

  • Significantly lower operational overhead
  • Fast historical queries without caching infrastructure
  • A simpler mental model — fewer things to reason about under pressure
  • Lower running costs

For a freelancer monitoring a dozen hosts, that's the right trade. For a team running multi-tenant infrastructure at real scale, Mimir and Loki are still the right answer.


Was the Mimir Detour Worth It?

Yes — genuinely.

Running a real Mimir cluster, configuring S3 backends, tuning ingesters and compactors, and debugging query performance gave me solid, practical knowledge of how large-scale metrics infrastructure actually works. That experience matters when advising clients who are running at that scale, or evaluating whether complexity is justified for a given workload.

The lesson isn't "these tools are too complex." The lesson is:

Use tools that solve problems you actually have — not problems you might have someday.

Mimir and Loki are excellent. They're just excellent for a scale I wasn't at.


Final Thoughts

The stack evolved like this:

  • Graylog + Elasticsearch → too heavy for the workload
  • Mimir + Loki + S3 + nginx → powerful, educational, over-scoped
  • VictoriaMetrics + VictoriaLogs + Alloy + Tempo → proportional to actual needs

Each step had value. Each migration left me understanding the previous tool better than when I adopted it.

But the biggest takeaway is this:

Simplicity is not the opposite of capability. It's the result of choosing tools that match your scale.

This stack is fast enough, simple enough, and predictable enough. And most importantly — it's proportional. The monitoring no longer needs monitoring.


Running a similar stack? Still deep in the Mimir rabbit hole? I'd love to know — are you solving today's problems, or preparing for problems you might never have? Drop a comment or find me wherever developers argue about YAML indentation.