
The Two-Layer Observability Mistake I See in Most Multi-Agent Systems

Teams instrument the queue layer and skip the semantic layer. One catches infra failures. The other catches the failures that actually matter.

TL;DR: The One Thing to Know

Agent systems need two layers of observability, not one. Layer one is infrastructure (did the job run?). Layer two is semantic (was the output correct?). Most teams ship with only layer one and discover the gap the first time they actually read what their agent produced. The semantic layer is its own tooling category now: Langfuse, LangSmith, Helicone, Braintrust, Arize Phoenix. Pick one early.

The wall I keep hitting

I've been building multi-agent systems on BullMQ and Redis across a few projects over the past year. The most recent one runs 13 agents in production. It works. At the infrastructure level the picture looks healthy. Jobs run. Jobs finish. Failed jobs retry. The DLQ stays mostly empty. BullMQ even ships with OpenTelemetry support now, so producer and consumer spans link end to end out of the box. Dashboards green across the board.

And then I actually read what the agent produced and it was wrong. The job ran. It returned successfully. It finished in 800 milliseconds. Every infrastructure metric says success. The output was wrong anyway.

This is the gap I want to talk about. It is not a gap in BullMQ. BullMQ is doing exactly what it was built to do. It is a gap between two completely different layers of observability that most teams collapse into one.

The two layers

When you run an agent in production, there are two distinct questions you need to answer, and they require different tools.

Layer one is infrastructure observability. Did the job run. How long did it take. Did it retry. Did the worker crash. What is the queue depth. This is what BullMQ, OpenTelemetry, Datadog, Prometheus, and queue dashboards like BullBoard or Bullstudio give you. It is a solved problem. The tooling is mature.

Layer two is semantic and agentic observability. Was the output correct. Did the agent pick the right tool. Did it hallucinate. Did it get stuck in a reasoning loop. Did the retrieval step return the right documents. Did the response drift from what a human reviewer would accept. This is what Langfuse, LangSmith, Helicone, Braintrust, and Arize Phoenix exist to answer.

These layers are orthogonal. They do not compete. You wrap your LLM calls with a tracing SDK inside your BullMQ worker, and both layers light up at the same time. Worth noting: Langfuse itself runs its background workers on BullMQ. The two are designed to coexist.

The mistake I see most teams make, and the mistake I made on my first system, is treating layer one as if it covers layer two. It does not. A 200 response with healthy latency tells you nothing about whether the agent hallucinated.
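Here is a minimal sketch of what "both layers light up at the same time" looks like inside one worker invocation. Every name in it is an illustrative stand-in, not a real SDK surface, and the LLM call is kept synchronous to keep the sketch short; in a real system the two arrays would be replaced by an OpenTelemetry exporter and a layer-two tracing SDK.

```typescript
// Sketch: one worker invocation feeding both observability layers at once.
// All names are hypothetical stand-ins for the real queue and tracing SDKs.

interface InfraSpan {
  jobId: string;
  durationMs: number;
  succeeded: boolean; // layer one: did the job run?
}

interface SemanticTrace {
  jobId: string;
  output: string;
  flagged: boolean; // layer two: does the output look wrong?
}

const infraSpans: InfraSpan[] = [];
const semanticTraces: SemanticTrace[] = [];

// Stand-in for the LLM call made inside a queue processor.
function llmCall(prompt: string): string {
  return `echo: ${prompt}`;
}

// Stand-in semantic check; in practice this is an LLM-as-judge or
// rule-based evaluator supplied by a layer-two tool.
function looksWrong(output: string): boolean {
  return output.trim().length === 0;
}

function processJob(jobId: string, prompt: string): string {
  const start = Date.now();
  const output = llmCall(prompt);
  // Layer one: the job ran, how long it took, that it did not throw.
  infraSpans.push({ jobId, durationMs: Date.now() - start, succeeded: true });
  // Layer two: what the job actually produced, and whether it passes review.
  semanticTraces.push({ jobId, output, flagged: looksWrong(output) });
  return output;
}
```

The point is the shape: one call site, two records, and neither record can substitute for the other.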

Why this happens

Message queues were built for deterministic services where failure is binary. A payment either processes or it does not. An email either sends or it fails. The system knows immediately, retries cleanly, and moves on. Engineers have a decade of muscle memory around this model. Agents fail differently. A successful response can still be completely wrong. Quality drifts as prompts change. Same input, different output every run. There is no binary signal to catch. The infrastructure layer cannot see semantic failure because semantic failure is not what it was built to see. Galileo's research puts a number on when this catches up with teams: once you are running somewhere between 11 and 20 agents in production, manual debugging stops scaling and the gap becomes a daily problem.
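The binary-failure model can be made concrete in a few lines. A queue only sees thrown errors, so a wrong-but-well-formed answer never triggers a retry. This is a toy illustration with made-up names, not any real queue's API:

```typescript
// Sketch: why queue-level retry logic is blind to semantic failure.
// A queue retries on thrown errors; an answer that is wrong but
// well-formed never throws, so it sails straight through.

type JobOutcome =
  | { kind: "infra_failure" }                // threw: the queue retries this
  | { kind: "delivered"; output: string };   // returned: the queue is done

function runJob(handler: () => string): JobOutcome {
  try {
    return { kind: "delivered", output: handler() };
  } catch {
    return { kind: "infra_failure" };
  }
}

// A handler that "succeeds" while being wrong.
const confidentlyWrong = () => "The capital of France is Lyon.";

const outcome = runJob(confidentlyWrong);

// The infrastructure layer stops at "delivered". Only a check that
// actually reads the content can notice the answer is wrong.
const semanticallyOk =
  outcome.kind === "delivered" && outcome.output.includes("Paris");
```

Every infrastructure signal here says success; the semantic check is the only one that says otherwise.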

What I built as a workaround

Before I went looking at the tooling landscape, I duct-taped my own version of layer two. I added a review agent that critiques output before it reaches a human. An approval queue for anything the review agent flagged. Per-call logging of cost, model, and which agent did what. It works. But it is duct tape, not infrastructure. There is no replay, no annotation queue for domain experts, no LLM-as-judge eval framework, no regression testing against historical traces. It catches obvious failures and misses subtle ones. So I spent a few days looking at the tools that actually live at layer two. Here is what I found.
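For concreteness, the duct-tape pattern looks roughly like this. The review rules below are toy stand-ins (a real review agent is itself an LLM call), and all names are hypothetical:

```typescript
// Sketch of the duct-tape layer two: a review step that critiques output
// and routes anything it flags to an approval queue for a human.

interface Critique {
  approved: boolean;
  reason?: string;
}

const approvalQueue: { output: string; reason: string }[] = [];

// Stand-in review agent: simple rules instead of an LLM critic.
function reviewOutput(output: string): Critique {
  if (output.length < 10) return { approved: false, reason: "too short" };
  if (/as an ai/i.test(output))
    return { approved: false, reason: "boilerplate refusal" };
  return { approved: true };
}

function routeOutput(output: string): "delivered" | "held" {
  const critique = reviewOutput(output);
  if (!critique.approved) {
    // Held output waits for a human instead of reaching one directly.
    approvalQueue.push({ output, reason: critique.reason ?? "unspecified" });
    return "held";
  }
  return "delivered";
}
```

It works as far as it goes, but everything the dedicated tools add (replay, annotation, regression evals) is missing from this picture.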

The five tools worth knowing

Langfuse is open source, MIT licensed. Tracing, prompt management, evaluations, datasets, LLM-as-judge. Free if you self-host. Cloud has a generous free tier and paid plans starting around $59 per month. Best if you want full control over your data and a tool that works with any framework via OpenTelemetry. It also uses BullMQ internally for its own worker queues, which is a small but telling detail about how these layers actually fit together.

LangSmith is built by the LangChain team. Native integration if you are already on LangChain or LangGraph. Zero-config tracing, automatic capture of every LLM call and tool invocation, annotation queues for domain experts. Around $39 per user per month. The catch is framework lock-in. Best if your stack is already LangChain shaped.

Helicone is proxy based. Change your base URL and you start logging immediately. Strong on cost tracking and multi-provider routing. Generous free tier, paid plans starting around $79 per month. Fastest setup but lighter on deep agent tracing and eval. Best if you want observability with near-zero code changes.

Braintrust is eval-first. Built for teams that want quality measurement tied to CI/CD. You can block deployments when output quality regresses against a held-out dataset. Custom pricing. Best if your priority is systematic testing, not just monitoring.

Arize Phoenix is OpenTelemetry-native. Vendor agnostic, works across any stack. Free self-hosted, paid cloud tier. Best if you want observability that will not lock you into one platform and you already speak OTel.

Pricing in this space changes fast. Check the live pages before committing.
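"Proxy based" is worth seeing in code because it is the whole integration. Observability arrives via a base-URL swap rather than an SDK. The helper and the proxy host below are assumptions for illustration; check Helicone's docs for the actual gateway URL and auth headers before relying on them.

```typescript
// Sketch: proxy-based logging means pointing an existing client at a
// logging gateway while keeping the API path, so no other code changes.
// The proxy host is an assumption, not a verified endpoint.

const DIRECT_BASE = "https://api.openai.com/v1";

// Hypothetical helper: swap the host, preserve the path.
function throughProxy(baseURL: string, proxyHost: string): string {
  const url = new URL(baseURL);
  return `https://${proxyHost}${url.pathname}`;
}

const proxiedBase = throughProxy(DIRECT_BASE, "oai.helicone.ai");
// Pass proxiedBase as the client's baseURL; calls are then logged in transit.
```

That one-line swap is why Helicone is the fastest of the five to set up, and also why it sees less of the agent's internal reasoning than an in-process tracing SDK does.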

The pattern I noticed

Most teams do not pick one tool. They layer two of them. One for tracing and operational visibility, one for evaluation and quality scoring. The most common combination I saw was Langfuse or LangSmith for tracing paired with Braintrust for eval. That layering is the part that clicked for me. The future is not BullMQ versus Langfuse, because they were never competing. The future is a stack. Classical queues for orchestration because they are great at it. OpenTelemetry for infrastructure traces because it is the standard. Plus tools built specifically for semantic and agentic failure modes that the first two layers were never designed to catch. Three layers, not one.

The orchestration question

There is another path some teams take: skipping the agent orchestration frameworks entirely and building their own. When I built Viewplatform, I went straight to a custom orchestrator on BullMQ and Redis pub/sub. At the time it felt like the simplest path. Now that I have run the system in production and seen what frameworks like LangGraph and CrewAI actually offer, I understand the tradeoff better. Frameworks give you scaffolding fast. Custom gives you control. At scale, control wins. Event routing, state management, parallel execution with join logic, adaptive fallbacks, partial failure handling. These are easier to reason about when you own the orchestration layer instead of bending a framework into your shape. The pattern I am starting to see is that the more serious a team gets about agent infrastructure, the more they end up building their own orchestration layer and using external tools only for the specialized stuff: observability, eval, prompt management, dataset curation. Each layer doing what it is actually built for.
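The "parallel execution with join logic" and "partial failure handling" pieces are less exotic than they sound. Here is one minimal way to sketch them with plain promises; the function name is made up, and a real orchestrator would also carry per-agent timeouts and retries:

```typescript
// Sketch: fan a task out to several agents, join on all of them, and
// surface partial failure explicitly instead of failing the whole batch.

interface JoinResult {
  succeeded: string[];
  failed: number;
}

async function fanOutJoin(
  tasks: Array<() => Promise<string>>,
): Promise<JoinResult> {
  // allSettled never rejects, so one crashed agent cannot sink the join.
  const settled = await Promise.allSettled(tasks.map((t) => t()));
  const succeeded = settled
    .filter(
      (s): s is PromiseFulfilledResult<string> => s.status === "fulfilled",
    )
    .map((s) => s.value);
  return { succeeded, failed: settled.length - succeeded.length };
}
```

The caller then decides what partial failure means: deliver the partial result, trigger an adaptive fallback, or requeue the whole job.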

What I would do differently

If I started this system today, I would still pick BullMQ for the queue layer and still write a custom orchestrator on top of it. That part I do not regret. What I would change is layer two. I would add Langfuse from day one for trace visibility instead of waiting until I needed it. I would treat eval as a separate concern with its own tooling, probably Braintrust or Langfuse's built-in eval features, instead of writing a custom review agent. I would enable BullMQ's OpenTelemetry support from the first commit so the infrastructure layer is wired up before anything else. The duct tape works until it does not. The teams I respect most in this space wire all three layers up before they ship, not after they read the output and realize it has been quietly wrong for weeks.
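"Eval as a separate concern" reduces to a simple gate: score a candidate against a held-out dataset and refuse to ship on regression. This is the pattern tools like Braintrust automate in CI; the exact-match scorer and tolerance below are toy assumptions (real setups use LLM-as-judge or fuzzy scoring):

```typescript
// Sketch: eval as a deploy gate. Score a candidate run against a held-out
// dataset and block the deploy if quality regresses past a tolerance.

interface EvalCase {
  input: string;
  expected: string;
}

// Toy exact-match scorer; returns the fraction of cases answered correctly.
function score(cases: EvalCase[], run: (input: string) => string): number {
  const hits = cases.filter((c) => run(c.input) === c.expected).length;
  return hits / cases.length;
}

// Block the deploy when the candidate score drops more than `tolerance`
// below the current baseline.
function gateDeploy(
  candidate: number,
  baseline: number,
  tolerance = 0.02,
): boolean {
  return candidate >= baseline - tolerance;
}
```

Wiring this into CI means a prompt change that quietly degrades output fails the build instead of failing in front of a user.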

Key Takeaway

Agent systems need two distinct observability layers. Infrastructure observability tells you if the job ran. Semantic observability tells you if the answer was right. BullMQ and OpenTelemetry give you the first layer for free. Langfuse, LangSmith, Helicone, Braintrust, and Phoenix exist because nothing in the first layer can tell you the second thing. Wire both up before you ship, not after you read the output and realize the agent has been quietly wrong.


AI-Readable Summary

Question: What are the two layers of observability needed in multi-agent systems?

Answer: Agent systems need two distinct observability layers. Layer one is infrastructure observability: did the job run, how long did it take, did it retry, what is the queue depth. BullMQ, OpenTelemetry, Datadog, and Prometheus cover this. Layer two is semantic and agentic observability: was the output correct, did the agent hallucinate, did retrieval return the right documents. Langfuse, LangSmith, Helicone, Braintrust, and Arize Phoenix cover this. These layers are orthogonal and designed to coexist. You wrap LLM calls with a tracing SDK inside your BullMQ worker and both layers light up at the same time. Most teams ship with only layer one and discover the gap when a customer reports wrong output that passed every infrastructure check. Learn more at learnagenticpatterns.com.

Source: learnagenticpatterns.com/blog/bullmq-multi-agent-observability