Large language model (LLM) applications are moving from experimental pilots into critical production systems at a rapid pace. But unlike traditional software, where a failed request returns a 5xx error you can immediately alert on, LLM applications can fail silently. A model can produce a confident but hallucinated answer, slowly inflate your token costs, or leak sensitive data in a way that looks like a perfectly successful HTTP 200 response.
This guide explains how to monitor LLM applications in production: what signals to track, which metrics matter, how to instrument your pipelines with OpenTelemetry, and what tools teams are using in 2026. Whether you are running a RAG chatbot, a code assistant, or a multi-agent workflow, the framework here applies.
🔑 Key Takeaways
- LLM monitoring covers five signal families: performance, quality, cost, safety, and agentic workflow metrics.
- Traditional APM misses LLM-specific failures because outputs are non-deterministic. You need semantic evals alongside latency and error rates.
- OpenTelemetry GenAI semantic conventions provide a standard way to instrument LLM pipelines with traces and metrics.
- Hallucination rate, groundedness, token cost per request, and prompt injection detection rate are the most business-critical metrics to start with.
- Tools such as Langfuse, MLflow, and OpenLIT (OpenTelemetry-native) cover open-source needs, while CubeAPM adds self-hosted APM with OTel trace ingestion.
- Start with basic request/response logging, then layer quality scoring and alerting incrementally.
The LLM Monitoring Landscape at a Glance

Monitoring an LLM application means tracking three foundational signal types: traces (the execution path of each request through your pipeline), metrics (aggregated performance and cost data), and evaluations (quality and safety scoring of outputs). Together these give you the visibility to detect and fix issues before users notice them.
What Is LLM Monitoring?
LLM monitoring is the ongoing process of tracking the performance, quality, cost, and safety of large language model applications in production. It encompasses:
- Performance monitoring: response latency, time to first token (TTFT), throughput, and error rates.
- Quality monitoring: evaluating whether model outputs are accurate, relevant, coherent, and grounded in provided context.
- Cost monitoring: tracking token consumption and API spend per request, per user, and per feature.
- Safety and compliance monitoring: detecting hallucinations, prompt injection attempts, PII leakage, toxic content, and policy violations.
- Agentic workflow monitoring: tracking tool selection accuracy, action completion, and reasoning coherence across multi-step agent pipelines.
LLM monitoring vs. LLM observability: these terms are often used interchangeably, but there is a useful distinction. Monitoring tracks predefined metrics (what is happening). Observability provides full visibility into the internal state of the system so you can reconstruct why something happened, even for failure modes you did not anticipate. For LLM applications, you need both.
Why LLM Monitoring Is Different From Traditional APM
Traditional application performance monitoring (APM) assumes deterministic behavior: the same input reliably produces the same output. LLMs break this assumption. Research cited by Galileo found that even at temperature=0, a frontier model produced 80 unique completions across 1,000 identical runs. This non-determinism creates failure modes that standard dashboards cannot surface:
- A request can return HTTP 200 while the model has hallucinated a policy, a fact, or a routing instruction.
- Cost can spike because a prompt change silently increases average token length across millions of requests.
- A multi-step agent can loop, stall, or call the wrong tool without triggering any infrastructure alert.
- Sensitive data can leak through outputs without a detectable error code.
To catch these failures, LLM monitoring adds semantic quality scoring, hallucination detection, cost attribution, and safety evaluations as first-class signals alongside the latency and error rates that traditional APM already tracks.
Key Metrics to Monitor in LLM Applications
Organize your metrics into five families. Each maps to a distinct type of failure and a distinct business risk.
| Signal / Layer | What to Measure | Key Metrics | Why It Matters |
| --- | --- | --- | --- |
| Performance | Speed & throughput | Latency, TTFT, req/s | User experience |
| Quality | Output accuracy | Groundedness, relevance | Prevents hallucinations |
| Cost | Token consumption | Tokens/req, $ per session | Budget control |
| Safety | Security & compliance | Injection rate, PII leaks | Risk mitigation |
| Agentic | Multi-step reliability | Tool accuracy, completion | Agent workflow health |
1. Performance Metrics
Performance metrics measure how reliably and quickly your LLM application serves requests.
- Latency (P50, P95, P99): end-to-end response time distribution. P99 latency is often the most useful SLA signal because it captures tail behavior under load.
- Time to first token (TTFT): for streaming responses, the time before the first token appears. This directly determines perceived responsiveness.
- Throughput: requests processed per second. Critical for capacity planning.
- Error rate: HTTP errors, timeouts, and provider-side rate limit hits tracked separately so you can distinguish client-side from server-side failures.
- Data and prediction drift: track a statistical distance, such as Jensen-Shannon distance or the Population Stability Index (PSI), between production inputs and a baseline. Prompt drift is a common cause of quality degradation that latency dashboards will not show.
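The drift check in the last bullet can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming you have already bucketed some input feature (here, hypothetical prompt-length buckets) into histograms with identical bins for the baseline and for production. The bucket boundaries and counts are made up for the example.

```python
import math

def _normalize(counts):
    """Turn raw bin counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must contain at least one observation")
    return [c / total for c in counts]

def jensen_shannon_distance(baseline_counts, production_counts):
    """Jensen-Shannon distance (log base 2) between two histograms with identical bins.

    Ranges from 0.0 (identical distributions) to 1.0 (no overlap).
    """
    p = _normalize(baseline_counts)
    q = _normalize(production_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Bins where a_i == 0 contribute nothing, by convention.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # clamp tiny negative rounding errors

def population_stability_index(baseline_counts, production_counts, eps=1e-6):
    """PSI = sum((actual - expected) * ln(actual / expected)) over bins."""
    expected = _normalize(baseline_counts)
    actual = _normalize(production_counts)
    # eps keeps the logarithm defined when a bin is empty.
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

# Hypothetical prompt-length buckets: <100, 100-500, 500-2000, >2000 tokens.
baseline = [400, 450, 130, 20]
production = [250, 400, 250, 100]
print(jensen_shannon_distance(baseline, production))
print(population_stability_index(baseline, production))
```

Which value counts as "too much drift" depends on the feature and the application, so treat any alert threshold as something to calibrate against your own history.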
2. Quality Metrics
Quality metrics answer whether model outputs are correct and useful. These are the metrics that traditional APM cannot provide at all.
- Groundedness / faithfulness: does the answer follow from the provided context (for RAG applications)? This is the primary defense against hallucinations.
- Answer relevancy: does the response address what the user actually asked?
- Context precision and recall: for retrieval-augmented pipelines, how well does the retrieval step surface the right documents?
- Coherence and fluency: is the output logically consistent and grammatically correct?
- F1 score / BLEU / ROUGE: useful for classification and summarization tasks where a ground truth reference exists.
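- Perplexity: how well the model predicts a given token sequence, computed as the exponential of the average negative log-likelihood per token. Lower is better. It is a rough proxy for language quality, not for factual accuracy.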
Production teams commonly target hallucination rates below 0.5% for high-stakes applications such as legal, financial, or medical assistants. The right threshold differs by environment and error tolerance.
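Of the quality metrics listed above, perplexity is the simplest to compute directly. The sketch below assumes you have per-token log-probabilities as natural logarithms, which some provider APIs can return on request. Whether and how they are exposed varies by provider.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood per token).

    token_logprobs: natural-log probabilities of each generated token.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.5 to every token is as uncertain
# as a fair coin flip at each step.
print(perplexity([math.log(0.5)] * 4))
```

Low perplexity only says the text was predictable to the model. A fluent hallucination can score well, which is why groundedness checks remain the primary defense.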
3. Cost Metrics
Token consumption directly determines LLM API spend. Without cost visibility at the request level, budget overruns arrive as monthly billing surprises.
- Token usage per request: prompt tokens plus completion tokens. Track the distribution, not just the average, because a long tail of expensive requests often drives most of the cost.
- Cost per request and per session: useful for unit economics and per-feature cost attribution.
- Token efficiency: tokens consumed relative to output value. Inefficient prompt templates waste tokens on every call.
- Cost anomaly detection: automated alerts when cost per request exceeds a rolling baseline. A single poorly optimized prompt can multiply costs 10x overnight.
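As a rough sketch of the first two bullets, the snippet below derives cost per request from token counts and measures how much of total spend comes from the most expensive requests. The model name and per-token prices are hypothetical placeholders, not real provider rates.

```python
import math

# Hypothetical prices in USD per 1M tokens; replace with your provider's actual rates.
PRICES = {"example-model": {"input": 1.00, "output": 3.00}}

def request_cost(model, input_tokens, output_tokens, prices=PRICES):
    """Estimated USD cost of one call, from token counts and a price table."""
    rate = prices[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000

def tail_cost_share(costs, top_fraction=0.1):
    """Share of total spend that comes from the most expensive requests."""
    if not costs:
        return 0.0
    ranked = sorted(costs, reverse=True)
    top_n = max(1, math.ceil(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0

print(request_cost("example-model", input_tokens=1000, output_tokens=500))
print(tail_cost_share([1] * 9 + [91]))  # one request dominates the bill
```

A high tail share suggests looking at individual expensive traces before tuning the average prompt.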
4. Safety and Compliance Metrics
Safety failures often have immediate legal, regulatory, or reputational consequences. The OWASP Top 10 for LLM Applications ranks prompt injection and sensitive information disclosure among the most critical vulnerabilities for LLM systems.
- Hallucination rate: percentage of outputs flagged as ungrounded by an automated evaluator.
- Prompt injection detection rate: the share of requests your detectors flag as injection attempts. The threat is active: a study cited by Galileo logged 91,403 attack sessions targeting exposed LLM services between October 2025 and January 2026.
- PII leakage rate: frequency of outputs that include personally identifiable information not in the original prompt.
- Toxicity score: automated classification of harmful, biased, or offensive content.
- Policy compliance: tracking guardrail activations and content filter hits to confirm safety policies are enforced.
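The PII leakage metric above can be approximated with a simple detector. The sketch below is deliberately narrow: it only looks for email addresses, using a simplified regular expression that will miss edge cases. A production setup would normally rely on a dedicated PII detection tool. The function names are illustrative.

```python
import re

# Simplified email pattern for illustration; not a complete RFC-compliant matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def leaked_emails(prompt, response):
    """Email addresses that appear in the response but not in the prompt."""
    in_prompt = {e.lower() for e in EMAIL_RE.findall(prompt)}
    in_response = {e.lower() for e in EMAIL_RE.findall(response)}
    return sorted(in_response - in_prompt)

def pii_leakage_rate(interactions):
    """Fraction of (prompt, response) pairs whose response leaks a new email."""
    if not interactions:
        return 0.0
    flagged = sum(1 for prompt, response in interactions
                  if leaked_emails(prompt, response))
    return flagged / len(interactions)

interactions = [
    ("Contact me at a@example.com", "Sure, a@example.com and also b@example.org."),
    ("What is your refund policy?", "Refunds are processed within 14 days."),
]
print(leaked_emails(*interactions[0]))
print(pii_leakage_rate(interactions))
```

Comparing against the prompt matters: echoing back an address the user supplied is usually not a leak, while surfacing a new one is.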
The legal exposure is real: in 2024 a Canadian tribunal held Air Canada liable after its chatbot gave a customer incorrect information about the airline's bereavement refund policy. The EU AI Act GPAI governance obligations are in effect as of 2025, and NIST IR 8596 identifies prompt injection, data leakage, and overreliance as priority risk areas requiring runtime controls.
5. Agentic Workflow Metrics
Autonomous agents that chain multiple LLM calls, tools, and retrieval steps introduce failure modes that have no equivalent in single-call monitoring.
- Tool selection quality: whether the agent invokes the correct tool with the correct parameters at each step.
- Action completion rate: whether the agent finishes assigned tasks rather than stalling or looping.
- Reasoning coherence: whether the agent maintains logical consistency across chained decision steps.
- Overall workflow reliability: end-to-end success rate across complete workflows, not just individual LLM calls.
Many agentic failures return HTTP 200 at the infrastructure layer. The request succeeds, but the agent chose the wrong tool or skipped a required step. You only discover the failure downstream.
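Two of the agentic metrics above, tool selection quality and action completion rate, can be computed offline from recorded runs. This sketch assumes you have an evaluation set in which each run carries a reference tool sequence. The `AgentRun` structure and the position-by-position comparison are illustrative choices, not a standard.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    expected_tools: list  # reference tool sequence from a labeled eval set (assumed)
    called_tools: list    # tools the agent actually invoked, in order
    completed: bool       # did the agent finish the assigned task?

def tool_selection_accuracy(runs):
    """Position-wise match rate between expected and actual tool calls.

    Extra or missing calls count as errors because the denominator uses
    the longer of the two sequences.
    """
    correct = total = 0
    for run in runs:
        total += max(len(run.expected_tools), len(run.called_tools))
        correct += sum(e == c for e, c in zip(run.expected_tools, run.called_tools))
    return correct / total if total else 0.0

def action_completion_rate(runs):
    """Fraction of runs that finished rather than stalling or looping."""
    return sum(r.completed for r in runs) / len(runs) if runs else 0.0

runs = [
    AgentRun(["search", "book"], ["search", "book"], completed=True),
    AgentRun(["search", "book"], ["search", "search", "search"], completed=False),
]
print(tool_selection_accuracy(runs))
print(action_completion_rate(runs))
```

The second run shows the pattern described above: each individual call may succeed at the HTTP level while the agent loops on the wrong tool.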
How to Instrument LLM Applications with OpenTelemetry
OpenTelemetry is the open standard for instrumenting distributed systems. In 2024 the OTel community established dedicated GenAI semantic conventions that define standard attribute names for LLM instrumentation, making it possible to route LLM observability data to any OTel-compatible backend including CubeAPM, Jaeger, Prometheus, or Grafana.
The key OTel signals for LLM applications are:
- Traces and spans: capture the full execution path of each request through your application, retrieval pipeline, LLM provider call, and tool invocations.
- Metrics: aggregate request volume, duration distribution, token counts, and cost.
The GenAI semantic conventions define standard attributes such as gen_ai.agent.name, gen_ai.agent.id, gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens that make every decision node in an LLM pipeline traceable.
Step-by-Step: Instrument a Python LLM App with OpenLIT
OpenLIT is an OpenTelemetry-based auto-instrumentation library for LLM applications. It aligns with OTel GenAI semantic conventions and works with backends such as Prometheus, Jaeger, and Grafana.
- Install the OpenTelemetry Collector and configure it to send metrics to Prometheus and traces to Jaeger (or any OTLP-compatible backend).
- Install OpenLIT in your Python application:

    pip install openlit

- Initialize OpenLIT at application startup:

    import openlit
    openlit.init(otlp_endpoint="YOUR_OTELCOL_URL:4318")

- OpenLIT auto-instruments calls to OpenAI, Anthropic, Cohere, LangChain, LlamaIndex, and other popular LLM providers and frameworks without code changes to your application logic.
- Add Prometheus as a Grafana data source and import the OpenLIT dashboard to visualize request volume, latency distribution, token counts, and cost over time.
Key Trace Attributes to Capture
For each LLM request, instrument the following trace attributes:
- Model name and version (gen_ai.request.model): essential for correlating quality changes with model updates.
- Temperature and top_p (gen_ai.request.temperature, gen_ai.request.top_p): sampling parameters that directly affect output randomness and can indirectly affect output length and cost.
- Prompt content (as a span event, not a span attribute, because large payloads can overwhelm some backends).
- Input token count (gen_ai.usage.input_tokens) and output token count (gen_ai.usage.output_tokens).
- Estimated cost per call: derived from token counts and provider pricing.
- Response content (again as a span event).
- For RAG pipelines: vector retrieval spans with document IDs, relevance scores, and chunk content.
- For agents: tool invocation spans with tool name, input parameters, and return value.
Note: the OTel LLM Working Group recommends capturing prompt and response content on span events rather than span attributes because many backend systems struggle with large attribute payloads.
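If you instrument manually rather than through an auto-instrumentation library, it can help to build the attribute set in one place. The helper below is a hypothetical sketch that only assembles a dictionary keyed by GenAI semantic convention names. It does not import OpenTelemetry itself, and the model and provider values are placeholders.

```python
def genai_span_attributes(model, input_tokens, output_tokens,
                          system=None, temperature=None, top_p=None):
    """Build span attributes keyed by OTel GenAI semantic convention names."""
    attrs = {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    optional = {
        "gen_ai.system": system,
        "gen_ai.request.temperature": temperature,
        "gen_ai.request.top_p": top_p,
    }
    # Only include optional attributes that were actually provided.
    attrs.update({k: v for k, v in optional.items() if v is not None})
    return attrs

attrs = genai_span_attributes(
    model="example-model",        # placeholder model name
    input_tokens=812,
    output_tokens=164,
    system="example-provider",    # placeholder provider name
    temperature=0.2,
)
print(attrs)
```

With the OpenTelemetry Python API, a dictionary like this can be passed to `span.set_attributes(attrs)` on the span that wraps the provider call, while prompt and response text go on span events via `span.add_event(...)`, in line with the note above.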
Tracing Multi-Step LLM and Agent Workflows
Single-call LLM monitoring is relatively straightforward. Multi-agent pipelines are harder because failures compound across steps. A broken retrieval step can skew an LLM answer; the LLM answer can cause a downstream tool to fail; the tool failure can cascade into a billing anomaly.
The trace hierarchy for agentic systems follows a consistent pattern:
- Trace: one complete user interaction from start to finish.
- Agent span: one autonomous agent participating in the workflow.
- Generation span: one LLM call made by an agent.
- Tool span: one external tool invocation (API call, code execution, database query).
- Retriever span: one vector search or embedding lookup.
With distributed tracing across the full workflow, you can visualize the execution graph, see exactly where latency accumulates, and identify which hop caused a failure. Without this, debugging a multi-agent failure means sifting through disconnected logs.
Graph-based trace views that show every branch and tool call in sequence are particularly valuable for multi-agent debugging. They convert guesswork into targeted investigation.
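To make the trace hierarchy concrete, here is a small in-memory model of a trace and two queries a debugging view typically answers: where did the time go, and which hop failed first? The `Span` class is a simplified stand-in for real trace data, and the self-time calculation assumes child spans run sequentially rather than in parallel.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    kind: str               # "agent", "generation", "tool", or "retriever"
    duration_ms: float
    ok: bool = True
    children: list = field(default_factory=list)

def walk(span):
    """Yield the span and all of its descendants, depth-first."""
    yield span
    for child in span.children:
        yield from walk(child)

def self_time_ms(span):
    """Time spent in the span itself, excluding its children (sequential assumption)."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)

def slowest_hop(root):
    """The span with the largest self time anywhere in the trace."""
    return max(walk(root), key=self_time_ms)

def root_cause_failures(root):
    """Failed spans with no failed children: the deepest point of each failure."""
    return [s.name for s in walk(root)
            if not s.ok and all(c.ok for c in s.children)]

trace = Span("support-agent", "agent", 1000, ok=False, children=[
    Span("vector-search", "retriever", 100),
    Span("draft-answer", "generation", 700),
    Span("refund-api", "tool", 150, ok=False),
])
print(slowest_hop(trace).name)
print(root_cause_failures(trace))
```

In this example the agent span is marked failed, but the query points at the tool call underneath it, which is the kind of narrowing a graph-based trace view gives you visually.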
How to Automate Quality Evaluation at Scale
Manual review of LLM outputs does not scale beyond a handful of requests per day. Automated evaluation pipelines score every production request continuously and feed results into dashboards, regression tests, and alerting workflows.
Automated Evaluation Approaches
- LLM-as-judge: use a separate language model (often a smaller, cheaper one) to score each production response for groundedness, relevance, and safety.
- Rule-based scorers: deterministic checks for format compliance, output length bounds, prohibited phrases, or required fields.
- Semantic similarity: compare outputs to expected responses using embedding cosine similarity when ground truth references are available.
- Retrieval metrics (RAGAS): for RAG pipelines, compute context precision, context recall, faithfulness, and answer relevancy as a standard evaluation suite.
Cost consideration: running a large frontier model as a judge adds a second LLM call for every production request. Research comparing small and large judge models found that a 14B-parameter model achieved comparable evaluation agreement at 46% of the per-query cost of GPT-4o, while an 8B-parameter model cut costs by 82%. Starting with a smaller evaluation model lets you score 100% of traffic instead of sampling a small fraction.
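Not every evaluator needs a model call. The rule-based and semantic-similarity approaches listed above can run cheaply on every request. The sketch below assumes a hypothetical response format (a JSON object with `answer` and `sources` fields) and takes embeddings as plain lists of floats produced elsewhere.

```python
import json
import math

def rule_based_failures(output, required_fields, max_chars, banned_phrases):
    """Deterministic checks: length bound, prohibited phrases, JSON shape."""
    failures = []
    if len(output) > max_chars:
        failures.append("too_long")
    lowered = output.lower()
    failures += [f"banned:{p}" for p in banned_phrases if p.lower() in lowered]
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        failures.append("invalid_json")
    else:
        if isinstance(payload, dict):
            failures += [f"missing:{f}" for f in required_fields if f not in payload]
        else:
            failures.append("not_an_object")
    return failures

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

good = '{"answer": "Refunds take 14 days.", "sources": ["policy.md"]}'
print(rule_based_failures(good, ["answer", "sources"], 500, ["guaranteed"]))
print(rule_based_failures("not json", ["answer"], 500, []))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Running these checks first also reduces judge cost, since responses that fail a deterministic rule do not need a model-based score to be rejected.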
Sampling Strategy
- At low traffic volumes: score every request.
- At high traffic volumes: sample 10-20% for detailed evaluation, log basic metrics (tokens, latency, cost) for all requests.
- Always score 100% of high-risk request categories (financial decisions, medical information, legal advice).
- Feed failed or borderline responses back into your evaluation dataset as regression test cases.
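The sampling rules above can be expressed as one small decision function. This sketch hashes the request ID so the decision is deterministic: the same request is always in or out of the sample, which makes results reproducible across services. The category labels and the 10% default are assumptions for illustration.

```python
import hashlib

# Hypothetical category labels; use whatever your request classifier emits.
HIGH_RISK_CATEGORIES = {"financial", "medical", "legal"}

def should_evaluate(request_id, category, sample_rate=0.1):
    """Decide whether a request gets a detailed quality evaluation."""
    if category in HIGH_RISK_CATEGORIES:
        return True  # always score high-risk traffic
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

print(should_evaluate("req-123", "medical"))
sampled = sum(should_evaluate(f"req-{i}", "chat") for i in range(10_000))
print(sampled / 10_000)  # close to the configured sample rate
```

Setting `sample_rate=1.0` covers the low-traffic case where every request is scored.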
Connecting Evals to CI/CD
Evaluation logic built pre-production can be redeployed as runtime enforcement in production. The pattern is:
- Define evaluation criteria and acceptable score thresholds during development.
- Wire automated evals into your CI/CD pipeline so that pull requests fail if quality scores drop below threshold.
- Deploy the same eval logic as runtime guardrails that block or redact unsafe outputs before they reach users.
- Feed production failures back into your pre-production eval dataset so the bar rises over time.
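A quality gate for the second step can be as simple as comparing aggregate eval scores to minimum thresholds. The sketch below is one possible shape. The metric names and threshold values are placeholders, and how the scores are produced is left to your evaluation suite.

```python
def check_quality_gate(scores, thresholds):
    """Return a list of human-readable violations; an empty list means the gate passes."""
    violations = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            violations.append(f"{metric}: missing score")
        elif value < minimum:
            violations.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return violations

# Placeholder thresholds and scores for illustration.
thresholds = {"groundedness": 0.90, "relevance": 0.80}
scores = {"groundedness": 0.91, "relevance": 0.72}
print(check_quality_gate(scores, thresholds))
```

In a CI job, the script would exit with a non-zero status when the returned list is non-empty, which is what causes the pull request check to fail.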
Setting Up Alerting for LLM Applications
Effective alerting for LLM applications requires moving beyond simple threshold checks. LLM traffic has natural variation by time of day, day of week, and user segment. A flat threshold on quality score will generate false positives during unusual but legitimate traffic patterns.
Alerts to Configure
- Latency SLA breach: alert when P99 latency exceeds the agreed threshold for a sustained window (e.g., 5 minutes).
- Quality degradation: alert when rolling average quality score drops more than 5-10% below the 7-day baseline.
- Cost anomaly: alert at 50%, 80%, and 100% of daily or weekly token budget. A single runaway prompt can consume thousands of dollars overnight.
- Hallucination spike: alert when hallucination rate exceeds the target threshold (commonly 0.5-1% for production applications).
- Prompt injection: alert on any detected prompt injection attempt; these should be treated as security events.
- Error rate: alert on sustained increases in provider API errors or application-level exceptions.
- Agentic loop detection: alert when an agent workflow exceeds a maximum step count or wall-clock duration without completing.
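Several of these alerts reduce to small predicates that an alerting job can evaluate on each tick. The following sketch covers three of them. The default limits (a 5% quality drop, 20 steps, 120 seconds) are illustrative values, not recommendations from the source.

```python
def budget_alerts(spend, budget, levels=(0.5, 0.8, 1.0)):
    """Budget levels (as fractions) that current spend has reached or passed."""
    return [level for level in levels if spend >= level * budget]

def quality_degraded(current_avg, baseline_avg, max_drop=0.05):
    """True when the rolling score falls more than max_drop below the baseline."""
    return current_avg < baseline_avg * (1 - max_drop)

def agent_stuck(step_count, elapsed_s, max_steps=20, max_seconds=120.0):
    """True when a workflow exceeds its step or wall-clock budget."""
    return step_count > max_steps or elapsed_s > max_seconds

print(budget_alerts(spend=85.0, budget=100.0))
print(quality_degraded(current_avg=0.80, baseline_avg=0.90))
print(agent_stuck(step_count=25, elapsed_s=10.0))
```

To avoid repeat notifications, an alerting job would typically remember which budget levels have already fired for the current period.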
Anomaly Detection vs. Static Thresholds
For mature LLM monitoring setups, replace static thresholds with anomaly detection that establishes baselines per time window, per user segment, or per prompt version. Statistical approaches such as time-series anomaly detection with Jensen-Shannon Distance catch subtle drift that static alerts miss. This is particularly important for:
- Token usage: gradual prompt bloat that doubles costs over weeks.
- Quality degradation: slow decline after a model version update.
- Bias drift: outputs that gradually shift toward problematic patterns without a sharp step change.
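One simple way to maintain a baseline per segment is a rolling z-score keyed by, for example, prompt version. This is a swapped-in, simpler technique than distribution distances such as Jensen-Shannon, and is shown here only to illustrate the per-segment baseline idea. The window size and z threshold are assumptions to tune.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

class BaselineDetector:
    """Rolling z-score anomaly detector with one baseline per key."""

    def __init__(self, window=50, z_threshold=3.0, min_samples=10):
        self.z_threshold = z_threshold
        self.min_samples = min_samples
        self.series = defaultdict(lambda: deque(maxlen=window))

    def observe(self, key, value):
        """Record a value and report whether it deviates from the key's baseline."""
        history = self.series[key]
        anomalous = False
        if len(history) >= self.min_samples:
            mu, sigma = mean(history), pstdev(history)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.z_threshold
            else:
                anomalous = value != mu  # flat baseline: any change stands out
        history.append(value)
        return anomalous

detector = BaselineDetector()
# Tokens per request for a hypothetical prompt version "v1".
warmup = [detector.observe("v1", 100 + 2 * (i % 2)) for i in range(20)]
print(any(warmup))
print(detector.observe("v1", 150))
print(detector.observe("v2", 150))  # new segment, no baseline yet
```

A sharp jump is caught immediately. Very gradual drift can pull a rolling baseline along with it, which is why a fixed reference window (such as the 7-day baseline used for quality alerts) is a useful complement.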
📌 Monitor Your LLM Applications with CubeAPM
CubeAPM is a self-hosted, OpenTelemetry-native APM platform built for teams that need full observability over LLM pipelines, agent workflows, and backend services. With flat-rate pricing at $0.15/GB, no per-host or per-user charges, and native OpenTelemetry trace ingestion, CubeAPM gives you end-to-end visibility into latency, token cost, and error paths in your LLM applications without the unpredictable bills of SaaS vendors.
Best Practices for Monitoring LLM Applications in Production
Do not attempt to build a comprehensive LLMOps platform from day one. Start with basic request/response logging and performance metrics. Add quality scoring incrementally. Validate each monitoring component before adding complexity.
A practical evolution path:
- Weeks 1-2: Basic request/response logging with latency and token counts.
- Weeks 3-4: Performance alerting on latency and error rate.
- Month 2: Automated quality scoring for groundedness and relevance.
- Month 3: Human evaluation sampling and feedback integration.
- Month 4 onward: Advanced anomaly detection, cost attribution, and agentic workflow tracing.
Log every LLM interaction with its prompt (hashed if necessary for privacy), response, latency, token counts, estimated cost, model version, and session ID. Storage is cheap. Missing data during an incident is expensive. Then run detailed evaluations on a sampled subset of traffic rather than blocking every request for scoring.
Treat every prompt template, retrieval template, and model parameter set as a version-controlled artifact. Link every trace to the exact prompt version that produced it. When quality drops, you can identify precisely which prompt change caused it rather than diffing logs.
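One lightweight way to link traces to prompt versions is to derive a stable identifier from the template and its parameters and attach it to every trace. The sketch below uses a content hash. The `prompt_version` function and the 12-character truncation are illustrative choices rather than an established convention.

```python
import hashlib
import json

def prompt_version(template, params):
    """Stable short ID derived from the prompt template and model parameters."""
    # sort_keys makes the ID independent of dictionary key order.
    canonical = json.dumps({"template": template, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

template = "Answer using only the context below.\n\n{context}\n\nQuestion: {question}"
v1 = prompt_version(template, {"temperature": 0.2, "top_p": 0.9})
v2 = prompt_version(template + " Be concise.", {"temperature": 0.2, "top_p": 0.9})
print(v1, v2, v1 != v2)
```

Because the ID changes whenever the template or parameters change, grouping quality scores by this value shows directly which edit preceded a regression.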
Hash or redact sensitive fields before storing prompt and response content. Implement role-based access controls so that most team members see aggregated metrics and anonymized samples rather than raw user data. Automate data retention policies to purge detailed logs after the required retention period.
Run automated evaluations on every pull request. Fail the build if quality scores drop below the acceptable threshold. Connect pre-production evals to runtime guardrails so that the same logic that catches issues in testing blocks them in production.
Cost and latency are both reliability signals in LLM applications. A doubling of average token usage per request is as much a sign of a regression as a 2x latency increase. Set budget alerts at 50%, 80%, and 100% of daily spend limits.
LLM-specific incidents require different response playbooks than traditional software outages. Prepare runbooks for:
- Quality degradation: implement fallback responses, investigate prompt drift or model version changes, add regression tests.
- Hallucination spike: activate guardrails, notify affected users if required, trace to source in retrieval or prompt.
- Cost overrun: identify the high-token requests via cost attribution, roll back the offending prompt version.
- Prompt injection incident: block the offending session, audit the attack type, update guardrails.
- Agent loop or stall: terminate the stuck workflow, analyze the trace to find the offending decision node.
Conclusion
Monitoring LLM applications in production requires a new category of observability beyond what traditional APM provides. The combination of non-deterministic outputs, semantic quality failures, token-based costs, and multi-step agentic workflows means that reliable LLM operations require traces, metrics, and evaluations working together.
Start by instrumenting your application with OpenTelemetry using a library such as OpenLIT, which gives you standard traces and metrics routable to any OTel-compatible backend. Layer in automated quality scoring for groundedness, relevance, and safety. Set cost alerts before token budgets overrun. And as your application matures, connect your evaluation logic to CI/CD gates and runtime guardrails so that the same quality bar enforced in testing holds in production.
The teams that get this right ship more reliable AI experiences, catch regressions before users notice them, and control costs proactively. The teams that do not are running blind until a customer reports a problem.
⚠️ Disclaimer
- The information in this article is provided for educational purposes only. Tool capabilities, pricing, and API availability change frequently. Always verify details against official vendor documentation before making implementation or procurement decisions. Pricing figures referenced in this article reflect publicly available information at the time of writing and may have changed.
FAQs
1. What is the difference between LLM monitoring and LLM observability?
LLM monitoring tracks predefined metrics in real time, such as response latency, error rates, and token counts. It answers the question of what is happening. LLM observability provides full visibility into the internal state of the system so you can reconstruct why an unexpected behavior occurred, even for failure modes you did not anticipate in advance. In practice, you need both: monitoring for operational alerting and observability for root cause analysis and debugging.
2. What metrics should I start with when monitoring LLM applications?
If you are starting from scratch, prioritize these metrics first: latency (P99), token usage per request, estimated cost per request, error rate, and hallucination or groundedness score. These five metrics cover the most common and impactful failure modes: slow responses, runaway costs, broken requests, and inaccurate outputs. Add more specialized metrics (prompt injection detection, agentic tool accuracy) as your application matures.
3. How do I monitor hallucinations in production?
Hallucination detection in production uses automated evaluation models that score each response for groundedness (whether the answer follows from the retrieved context) and faithfulness. For RAG applications, frameworks such as RAGAS provide standard metrics. For general LLM applications, an LLM-as-judge approach uses a smaller evaluation model to score every production response. Common production targets are hallucination rates below 0.5% for high-stakes applications. Feed detected hallucinations back into your evaluation dataset as regression test cases.
4. Can I use OpenTelemetry to monitor LLM applications?
Yes. The OpenTelemetry project publishes dedicated GenAI semantic conventions that define standard attribute names for LLM instrumentation, including model name, token counts, input and output content, and agent identifiers. Libraries such as OpenLIT auto-instrument popular LLM providers and frameworks using these conventions and export traces and metrics to any OTel-compatible backend. This makes your LLM observability data portable across observability backends including CubeAPM, Jaeger, Prometheus, Grafana, and others.
5. What is the best open-source tool for monitoring LLM applications?
Langfuse is the most widely adopted open-source LLM observability platform as of 2026, with support for distributed tracing, session tracking, prompt versioning, and both automated and human-in-the-loop evaluation. OpenLIT is the best choice for teams wanting OpenTelemetry-native instrumentation that routes to an existing OTel stack. Phoenix by Arize adds embedding visualization and drift detection. The right choice depends on whether you prioritize self-hosting, OTel compatibility, or depth of quality evaluation features.





