CubeAPM

How to Monitor Tekton Pipeline Runs and Task Failures

Tekton is a Kubernetes-native open-source CI/CD framework that models pipeline execution as a set of custom resources: Tasks, Pipelines, TaskRuns, and PipelineRuns. Each TaskRun executes as a Kubernetes Pod, and a PipelineRun coordinates the TaskRuns it creates, which means full observability is possible, but only if you know where to look.

Without Tekton monitoring in place, a silently failing test task or a stuck pipeline run can delay deployments for hours before anyone notices. Platform engineers and DevOps teams need a reliable way to know the status of every PipelineRun, identify which TaskRun failed and why, measure duration trends, and get alerted before problems escalate.

This guide walks through every layer of Tekton monitoring: the Prometheus metrics Tekton exposes out of the box, how to configure Grafana dashboards, how to handle task failures gracefully, how to use Tekton Results for historical pipeline data, and how to set up alerting that fires when something goes wrong.

Key Takeaways

  • Tekton exposes Prometheus metrics on port 9090 of the controller service, covering PipelineRun duration, TaskRun counts, and queue depth.
  • The observability ConfigMap lets you switch between Prometheus and OTLP (gRPC/HTTP) export without restarting the controller.
  • Grafana dashboards built on tekton_pipelines_controller_* metrics give you real-time visibility into success rates, failure trends, and task latency.
  • TEP-0050 introduced the onError: continue field, enabling a Pipeline to keep executing downstream tasks even when a specific TaskRun fails.
  • Tekton Results stores completed run data in a queryable gRPC/REST API backed by PostgreSQL, so you keep history even after the PipelineRun and TaskRun objects are pruned from the cluster.
  • kubectl get events -n tekton-pipelines and kubectl describe pipelinerun are the fastest tools for real-time debugging.
  • CubeAPM can sit on top of your Prometheus data to provide correlated traces, service topology, and alert routing in one place.

1. How Tekton Monitoring Works

Tekton Pipelines ships with a built-in metrics exporter. The pipeline controller exposes a Prometheus-compatible scrape endpoint at port 9090 of the controller-service. By default, Prometheus export is enabled. You can also configure OTLP (gRPC and HTTP) export to send metrics directly to an OpenTelemetry Collector or any compatible backend.

Metrics behaviour is controlled through the observability ConfigMap in the tekton-pipelines namespace. Changing this ConfigMap applies immediately, with no controller restart required. 

# Check the observability ConfigMap
kubectl get configmap config-observability \
  -n tekton-pipelines -o yaml

What Gets Measured

Tekton exposes two categories of metrics: core Tekton metrics and infrastructure metrics inherited from the Knative and Go runtime.

| Metric Name | Type | What It Tells You |
| --- | --- | --- |
| tekton_pipelines_controller_pipelinerun_duration_seconds | Histogram / Gauge | End-to-end duration of each PipelineRun, labelled by pipeline, status, namespace, and reason |
| tekton_pipelines_controller_pipelinerun_total | Counter | Total number of PipelineRuns by status (success / failed / cancelled) |
| tekton_pipelines_controller_running_pipelineruns | Gauge | Number of PipelineRuns currently in progress |
| tekton_pipelines_controller_taskrun_duration_seconds | Histogram / Gauge | Duration of individual TaskRuns, labelled by task, status, namespace, and reason |
| tekton_pipelines_controller_taskrun_total | Counter | Total TaskRuns by status |
| tekton_pipelines_controller_running_taskruns | Gauge | Live count of active TaskRuns |
| tekton_pipelines_controller_running_taskruns_throttled_by_quota | Gauge | TaskRuns blocked by namespace resource quotas |
| tekton_pipelines_controller_running_taskruns_throttled_by_node | Gauge | TaskRuns blocked because no eligible node is available |
| tekton_pipelines_controller_taskruns_pod_latency_milliseconds | Histogram | Time between TaskRun creation and its Pod being scheduled |

Note: All metrics carry an otel_scope_name label identifying the instrumentation package. This label is informational and transparent to most PromQL queries. Optional labels (pipeline, pipelinerun, task, taskrun, reason) are marked with an asterisk in the official schema and are off by default to avoid cardinality explosion; enable them in the ConfigMap only when you need per-run granularity.
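
To make the cardinality warning concrete, here is a rough worst-case calculation in Python. The label counts and the bucket count below are made-up numbers, not Tekton defaults, and the function name is hypothetical; the sketch only shows how a single per-run label multiplies the number of series a histogram can produce.

```python
from math import prod


def histogram_series_upper_bound(label_cardinalities: dict[str, int], bucket_count: int) -> int:
    """Worst-case number of time series for one Prometheus histogram.

    Every label combination yields one series per bucket (bucket_count
    includes the +Inf bucket) plus the _sum and _count series.
    """
    combinations = prod(label_cardinalities.values())
    return combinations * (bucket_count + 2)


# Hypothetical cluster: 20 pipelines, 3 statuses, 4 namespaces, 16 buckets.
base = {"pipeline": 20, "status": 3, "namespace": 4}
# The same cluster with a per-run label and 5,000 distinct run names retained.
per_run = {**base, "pipelinerun": 5000}

print(histogram_series_upper_bound(base, 16))     # 4320
print(histogram_series_upper_bound(per_run, 16))  # 21600000
```

This is an upper bound, since not every label combination occurs in practice. A given run belongs to one pipeline and one namespace, so the real figure is closer to the number of runs times the bucket count, but that figure still grows with every run you execute.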

2. Configuring Prometheus to Scrape Tekton

Tekton does not register a ServiceMonitor automatically. You need to tell Prometheus where to scrape. If you use the Prometheus Operator (kube-prometheus-stack), add a ServiceMonitor pointing at the controller service on port 9090.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-pipelines
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - tekton-pipelines
  selector:
    matchLabels:
      app: tekton-pipelines-controller
  endpoints:
    # Must match the *name* of the Service port that serves 9090.
    # Confirm it with: kubectl get svc -n tekton-pipelines -o yaml
    - port: http-metrics
      interval: 30s
      path: /metrics

If you manage Prometheus with a static configuration file instead, add a scrape job:

# prometheus.yaml (static config)
scrape_configs:
  - job_name: tekton
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [tekton-pipelines]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: tekton-pipelines-controller
        action: keep

Verify the scrape is working by querying for a known metric in the Prometheus UI or via CLI:

# Confirm Tekton metrics are visible in Prometheus
curl -s http://<prometheus-host>:9090/api/v1/query \
  --data-urlencode 'query=tekton_pipelines_controller_pipelinerun_total' | jq .
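
If you want to run the same check from a script, the following Python sketch builds the query URL and reads an instant-vector response. The response shape follows the Prometheus HTTP API; the host name, the helper names, and the sample payload are illustrative assumptions, and the sketch parses a canned response rather than calling a live server.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_instant_vector(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-vector query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload.get("data", {})
    if data.get("resultType") != "vector":
        raise ValueError(f"unexpected result type: {data.get('resultType')}")
    # Each sample value is [unix_timestamp, "value-as-string"].
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]


# Hypothetical response body, shaped like a Prometheus instant-vector result.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"namespace": "ci", "status": "failed"}, "value": [1700000000, "7"]}
    ]
  }
}
""")

print(build_query_url("http://prometheus.example:9090",
                      "tekton_pipelines_controller_pipelinerun_total"))
for labels, value in parse_instant_vector(SAMPLE_RESPONSE):
    print(labels, value)
```

An empty result list usually means the scrape target is not being discovered, which is worth checking on the Prometheus targets page before debugging Tekton itself.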

OTLP Export (Optional)

If your observability stack is built around OpenTelemetry Collector rather than direct Prometheus scraping, you can enable OTLP export in the ConfigMap. Tekton supports both gRPC and HTTP OTLP endpoints.

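# Patch the observability ConfigMap to export metrics to a collector.
# Note: these are the OpenCensus-era keys, and 55678 is the OpenCensus
# receiver port. OTLP-based releases use different keys, so check the
# config-observability example shipped with your Tekton version first.
kubectl patch configmap config-observability \
  -n tekton-pipelines \
  --type merge \
  -p '{"data":{"metrics.backend-destination":"opencensus",
       "metrics.opencensus-address":"otel-collector.monitoring:55678"}}'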

3. Building Grafana Dashboards for Tekton

With Tekton metrics flowing into Prometheus, you can build dashboards that answer the questions platform teams ask every day. A practical Tekton monitoring dashboard should contain at least the following panels.

| Panel | PromQL Query | Why It Matters |
| --- | --- | --- |
| Pipeline Success Rate | sum(rate(tekton_pipelines_controller_pipelinerun_total{status="success"}[5m])) / sum(rate(tekton_pipelines_controller_pipelinerun_total[5m])) | Tracks overall CI health at a glance |
| Failure Count (5m) | increase(tekton_pipelines_controller_pipelinerun_total{status="failed"}[5m]) | Spikes indicate broken branches or flaky tests |
| Active PipelineRuns | tekton_pipelines_controller_running_pipelineruns | Detects queue buildup and concurrency issues |
| P95 Task Duration | histogram_quantile(0.95, rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket[10m])) | Identifies slow tasks that block pipelines |
| Throttled by Quota | tekton_pipelines_controller_running_taskruns_throttled_by_quota | Shows when namespace resource limits are a bottleneck |
| Pod Scheduling Latency | histogram_quantile(0.99, rate(tekton_pipelines_controller_taskruns_pod_latency_milliseconds_bucket[10m])) | Reveals node pressure or missing resources |
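
The percentile panels are estimates, not exact measurements. The Python sketch below approximates the linear interpolation that Prometheus documents for histogram_quantile on classic histograms, assuming the buckets have already been aggregated into one set of cumulative counts. The bucket boundaries and counts are invented for illustration and are not Tekton's defaults.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The bucket list must include the +Inf bucket, whose count is the total.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is the highest finite upper bound.
                return ordered[-2][0] if len(ordered) > 1 else math.nan
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return math.nan  # not reached: the +Inf bucket always satisfies the check


# Hypothetical TaskRun durations in seconds: (upper bound, cumulative count).
demo = [(60, 10), (300, 70), (600, 95), (1200, 99), (math.inf, 100)]
print(histogram_quantile(0.5, demo))
print(histogram_quantile(0.95, demo))
```

Because the estimate can never exceed the highest finite bucket boundary, a P95 or P99 panel that sits flat at one value is often a sign that durations have outgrown the configured buckets.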

4. Handling Task Failures in a Pipeline

By default, a single failing TaskRun causes the entire Pipeline to stop, leaving downstream tasks in the Skipped state. This is the correct default for most deployments, but there are cases where you want a pipeline to continue despite a non-critical task failure.

4.1  The onError Field (TEP-0050)

Tekton Enhancement Proposal 0050 (TEP-0050), now marked as implemented, introduced an onError field on individual tasks within a Pipeline. Setting it to continue allows the pipeline to keep executing even if that task fails. (Source: TEP-0050)

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: build-and-test
spec:
  tasks:
    - name: run-unit-tests
      taskRef:
        name: go-test
      onError: continue   # Pipeline continues even if this task fails
    - name: build-image
      runAfter: [run-unit-tests]
      taskRef:
        name: kaniko-build

Important: When a task with onError: continue fails, the failure is still recorded in the run's status rather than hidden. The task is marked as "failed but ignored", so you can still detect and alert on the failure without blocking the pipeline. Results emitted by a failed task are also supported and remain accessible to downstream tasks.

4.2  Inspecting a Failed PipelineRun

The fastest way to diagnose a task failure is to inspect the PipelineRun status and then look at the individual TaskRun logs.

# Check the overall pipeline status and the reason field
kubectl get pipelinerun <name> -n <namespace> -o json \
  | jq .status.conditions

# List all TaskRuns belonging to a PipelineRun
kubectl get taskrun -n <namespace> \
  -l tekton.dev/pipelineRun=<pipelinerun-name>

# Fetch logs for a specific failed TaskRun (add -f to stream)
kubectl logs -n <namespace> \
  -l tekton.dev/taskRun=<taskrun-name> --all-containers

# Use Tekton CLI for a friendlier view
tkn pipelinerun describe <name> -n <namespace>
tkn taskrun logs <name> -n <namespace> --follow
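
When a PipelineRun fans out into many TaskRuns, it can help to filter the JSON output instead of reading it by eye. This Python sketch assumes the output of kubectl get taskrun -o json has been loaded into a dict and picks out TaskRuns whose Succeeded condition is False. The sample data and the function name are hypothetical.

```python
def failed_taskruns(taskrun_list: dict) -> list[dict]:
    """Return name, reason, and message for every TaskRun that failed."""
    failures = []
    for item in taskrun_list.get("items", []):
        for condition in item.get("status", {}).get("conditions", []):
            # Tekton reports completion through a condition of type "Succeeded".
            if condition.get("type") == "Succeeded" and condition.get("status") == "False":
                failures.append({
                    "name": item["metadata"]["name"],
                    "reason": condition.get("reason", ""),
                    "message": condition.get("message", ""),
                })
    return failures


# Hypothetical, trimmed-down output of `kubectl get taskrun -o json`.
SAMPLE_TASKRUNS = {
    "items": [
        {
            "metadata": {"name": "build-and-test-run-unit-tests"},
            "status": {"conditions": [
                {"type": "Succeeded", "status": "False",
                 "reason": "Failed", "message": "step-test exited with code 1"},
            ]},
        },
        {
            "metadata": {"name": "build-and-test-build-image"},
            "status": {"conditions": [
                {"type": "Succeeded", "status": "True", "reason": "Succeeded"},
            ]},
        },
        # A TaskRun that has not reported any status yet.
        {"metadata": {"name": "build-and-test-deploy"}},
    ]
}

for failure in failed_taskruns(SAMPLE_TASKRUNS):
    print(f"{failure['name']}: {failure['reason']} ({failure['message']})")
```

TaskRuns that are still running report the Succeeded condition as Unknown, so they are deliberately left out of the result.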

4.3  Kubernetes Events

Kubernetes events capture scheduling and execution issues that do not always appear in container logs. These include Pod scheduling failures, resource quota denials, and image pull errors.

# See all events in the tekton-pipelines namespace, sorted by time
kubectl get events -n tekton-pipelines \
  --sort-by=.lastTimestamp

# Filter events related to a specific PipelineRun
kubectl get events -n <namespace> \
  --field-selector involvedObject.name=<pipelinerun-name>

5. Alerting on Tekton Pipeline Failures

Observing metrics in a dashboard is reactive. Alerting makes monitoring proactive. The following Prometheus alerting rules cover the most important failure and performance scenarios.

groups:
  - name: tekton
    rules:
      # Alert when PipelineRun failures exceed 2 per minute
      - alert: TektonPipelineRunFailures
        expr: >
          rate(tekton_pipelines_controller_pipelinerun_total{status="failed"}[5m]) * 60 > 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Elevated Tekton PipelineRun failure rate"
          description: "More than 2 failures/min for 2 minutes."

      # Alert when TaskRun P95 duration exceeds 20 minutes
      - alert: TektonTaskRunSlow
        expr: >
          histogram_quantile(0.95,
          rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket[10m]))
          > 1200
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tekton TaskRuns are running slowly"

      # Alert when more than 5 TaskRuns are throttled by quota
      - alert: TektonTaskRunThrottled
        expr: >
          tekton_pipelines_controller_running_taskruns_throttled_by_quota > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TaskRuns throttled by namespace quota"

Apply these rules by placing the file in your Prometheus rules directory or by creating a PrometheusRule resource if you use the Prometheus Operator. Route alerts to Slack, PagerDuty, or email via Alertmanager receivers.
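
The failure-rate threshold is easier to tune once the arithmetic is explicit. The Python sketch below is a simplified model of a per-second counter rate scaled to failures per minute, with a drop in the counter treated as a reset. It ignores the extrapolation that Prometheus's rate() applies at window edges, and the sample values are invented.

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase across ordered counter samples.

    A drop between two samples is treated as a counter reset, in which
    case the new value counts as the increase since the reset.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total


def failures_per_minute(samples: list[float], window_seconds: float) -> float:
    """Scale the increase over the window to a per-minute rate."""
    return counter_increase(samples) * 60 / window_seconds


# Hypothetical failed-PipelineRun counter values observed over a 5-minute window.
window = [100, 103, 107, 112]
rate = failures_per_minute(window, 300)
print(rate, "failures/min ->", "alert" if rate > 2 else "ok")
```

In this made-up window, 12 failures in 5 minutes works out to 2.4 per minute, which crosses the threshold of 2. The rule would still need that condition to hold for the full for: 2m period before firing.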

6. Long-Term Pipeline History with Tekton Results

By default, completed PipelineRun and TaskRun objects are stored as Kubernetes custom resources in etcd. Over time, these accumulate and consume cluster resources. Once they are pruned to reclaim those resources, their history is gone. Tekton Results solves this by providing a dedicated storage layer for CI/CD history.

Architecture

Tekton Results has three components: a Result Watcher that monitors the Kubernetes API for TaskRun and PipelineRun changes, a gRPC/REST API server that stores and serves result data, and a retention policy agent that removes records beyond a configurable age. The default storage backend is PostgreSQL.

Installing Tekton Results

# Deploy Tekton Results from the official release manifest
# (includes API server, Watcher, and bundled PostgreSQL for dev)

kubectl apply -f \

# Verify all components are running
kubectl get pods -n tekton-pipelines \
  -l app.kubernetes.io/part-of=tekton-results

# Confirm the API server is up
kubectl rollout status deployment/tekton-results-api -n tekton-pipelines

Querying Results

Once installed, the Result Watcher creates a record for every completed run. Each record follows the naming pattern <namespace>/results/<parent-run-uuid>. You can query records using the Tekton CLI (tkn), the REST API, or custom tooling against the gRPC endpoint.

# List all results in a namespace using the Tekton CLI
tkn result list -n <namespace>

# Fetch a specific result record
tkn result get <namespace>/results/<uuid> -n <namespace>

# Query via REST (requires port-forwarding the API service;
# confirm the Service name with: kubectl get svc -n tekton-pipelines)
kubectl port-forward svc/tekton-results-api-service \
  -n tekton-pipelines 8080:8080

# Keep the URL in one quoted argument; the namespace is the "parent" of its results
curl -s "http://localhost:8080/apis/results.tekton.dev/v1alpha2/parents/<namespace>/results" | jq .

Note: In Red Hat OpenShift Pipelines 1.14, Tekton Results is available as a Technology Preview feature. The result name format used is <namespace>/results/<parent_run_uuid>. 
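
Scripts that consume these names usually need the namespace and the run UUID separately. The following Python sketch splits a name of the form <namespace>/results/<parent-run-uuid> and rejects anything that does not match. The helper and the sample name are hypothetical, and the UUID check assumes the last segment is always a UUID, as the naming pattern above states.

```python
import uuid
from typing import NamedTuple


class ResultName(NamedTuple):
    namespace: str
    uid: str


def parse_result_name(name: str) -> ResultName:
    """Split '<namespace>/results/<uuid>' into its namespace and UUID parts."""
    parts = name.split("/")
    if len(parts) != 3 or parts[1] != "results" or not parts[0]:
        raise ValueError(f"not a result name: {name!r}")
    # uuid.UUID raises ValueError when the segment is not a well-formed UUID.
    return ResultName(namespace=parts[0], uid=str(uuid.UUID(parts[2])))


# Hypothetical result name for illustration.
parsed = parse_result_name("ci/results/3f2b8c1e-9a4d-4e6f-8b2a-1c7d5e9f0a3b")
print(parsed.namespace, parsed.uid)
```

Validating the name up front keeps a malformed identifier from turning into a confusing not-found error from the API later.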

7. Real-Time Debugging in Tekton Pipelines

When a pipeline fails unexpectedly, you need to narrow down the problem quickly. The following sequence covers the most efficient real-time debugging path.

  1. Check PipelineRun status: Run kubectl describe pipelinerun <name> to see the status conditions, failed task names, and reason codes.
  2. Identify the failed TaskRun: The PipelineRun status block includes a childReferences list that names every TaskRun created by the pipeline. Look up each named TaskRun to see its status.
  3. Read container logs: Each TaskRun step runs as a separate container in the same Pod. Use kubectl logs <pod-name> -c step-<step-name> to read the output of a specific step.
  4. Check Kubernetes events: Use kubectl get events -n tekton-pipelines --sort-by=.lastTimestamp to see scheduling errors, OOM kills, and resource quota denials.
  5. Enable debug mode (TEP): Red Hat Developer guidance shows that you can attach a debug breakpoint to a TaskRun by annotating it, then exec into the running pod to inspect the workspace state before the step exits. This is particularly useful for intermittent failures. 
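
Steps 2 and 3 above can be sketched in Python. The example assumes a PipelineRun has been fetched with -o json and loaded into a dict; it lists the TaskRuns named in childReferences and derives the step container name using the step-<step-name> convention. The sample object, the step name, and the helper names are hypothetical.

```python
def child_taskruns(pipelinerun: dict) -> list[tuple[str, str]]:
    """Return (pipeline task name, TaskRun name) pairs from childReferences."""
    references = pipelinerun.get("status", {}).get("childReferences", [])
    return [
        (ref.get("pipelineTaskName", ""), ref["name"])
        for ref in references
        if ref.get("kind") == "TaskRun"
    ]


def step_container(step_name: str) -> str:
    """Container name for a step, following the step-<step-name> convention."""
    return f"step-{step_name}"


# Hypothetical, trimmed-down output of `kubectl get pipelinerun <name> -o json`.
SAMPLE_PIPELINERUN = {
    "status": {
        "childReferences": [
            {"apiVersion": "tekton.dev/v1", "kind": "TaskRun",
             "name": "build-and-test-run-unit-tests",
             "pipelineTaskName": "run-unit-tests"},
            {"apiVersion": "tekton.dev/v1", "kind": "TaskRun",
             "name": "build-and-test-build-image",
             "pipelineTaskName": "build-image"},
        ]
    }
}

for task_name, taskrun_name in child_taskruns(SAMPLE_PIPELINERUN):
    print(f"{task_name} -> {taskrun_name}")
print(step_container("test"))
```

Each TaskRun name from the list can then be passed to tkn taskrun logs, or used to look up its Pod before reading a specific step container.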

8. Monitoring Comparison: Tekton Tools and Platforms

| Tool / Approach | What It Covers | Gaps to Be Aware Of |
| --- | --- | --- |
| CubeAPM | Unified APM over Prometheus metrics, distributed tracing, service topology, and correlated alert routing. Works with any Tekton deployment. | Requires Prometheus scraping to be configured first. |
| Prometheus + Grafana | Core tekton_pipelines_controller_* metrics, custom dashboards, PromQL-based alerting. | No persistent run history. Requires manual dashboard creation. |
| Tekton Results | Long-term storage of PipelineRun and TaskRun data with queryable gRPC/REST API. | Does not provide real-time metrics or alerting. |
| Tekton Dashboard | Web UI for browsing PipelineRuns and TaskRuns, viewing logs, and triggering runs. | Read-only observability; no alerting or metrics aggregation. |
| Elastic Stack (via mgreau/tekton-pipelines-elastic-o11y) | Log ingestion from Tekton pods into Elasticsearch, visualised in Kibana. | Requires Beats or Fluent Bit pipeline setup. No native Tekton integration. |
| kubectl / tkn CLI | Ad-hoc inspection of PipelineRuns, TaskRuns, events, and pod logs. | Manual and reactive. Not suitable for continuous monitoring. |

Monitor Tekton Pipelines with CubeAPM

Tekton exposes Prometheus metrics, but scraping and querying raw metrics is just the start. CubeAPM gives your team a unified observability layer over those metrics, with automatic service topology, correlated traces, and alert routing, so you can go from a failed PipelineRun to root cause in seconds rather than minutes.

Summary: Tekton Monitoring Checklist

| Layer | Tool / Feature | Key Action |
| --- | --- | --- |
| Metrics scraping | Prometheus ServiceMonitor | Point at port 9090 of tekton-pipelines-controller service |
| Metrics export (alternative) | OTLP via observability ConfigMap | Set metrics.backend-destination to opencensus and configure the collector address |
| Dashboards | Grafana | Build panels for success rate, failure count, active runs, P95 duration, and throttled tasks |
| Alerting | Prometheus alerting rules + Alertmanager | Alert on failure rate, slow tasks, throttled TaskRuns |
| Task failure handling | onError: continue (TEP-0050) | Use on non-critical tasks to prevent a single failure blocking the whole pipeline |
| Historical data | Tekton Results | Install from official release manifest; query with tkn or REST API |
| Real-time debugging | kubectl / tkn CLI | Use describe, events, and logs commands as the first debugging step |
| Unified observability | CubeAPM | Layer over Prometheus for traces, topology, and alert correlation |

Conclusion

Tekton monitoring is not a single tool problem. The full picture requires Prometheus scraping the controller metrics, Grafana dashboards giving your team visibility into success rates and slow tasks, alerting rules to catch problems before they escalate, Tekton Results preserving history after Kubernetes cleans up the CRDs, and the Tekton CLI for fast real-time debugging.

The key insight from the official Tekton metrics documentation is that the controller already instruments itself. Your job as a platform engineer is to wire up the scraping, build dashboards that surface the right signals, and add a persistence layer with Tekton Results so that a completed run does not disappear from the record.

For teams that want correlated traces, service topology, and alert routing on top of those Prometheus metrics, CubeAPM provides a unified observability layer that sits alongside your existing Tekton and Kubernetes setup.

Disclaimer: Metric names, configuration fields, and API endpoints are based on the official Tekton documentation available at the time of writing. Tekton is an actively maintained open-source project; always verify details against the current official documentation at tekton.dev before applying configurations in production environments. Third-party tool features and pricing referenced in comparison sections are subject to change.

FAQs

1. Where does Tekton expose its Prometheus metrics?

The Tekton Pipelines controller exposes a Prometheus-compatible metrics endpoint on port 9090 of the controller-service in the tekton-pipelines namespace. You configure the export format (Prometheus or OTLP) through the config-observability ConfigMap.

2. How do I stop one failing task from cancelling the entire pipeline?

Add onError: continue to the task definition inside your Pipeline spec. This feature was introduced in TEP-0050 and is now fully implemented in Tekton Pipelines. The PipelineRun still records the failure, so alerting and monitoring remain accurate.

3. How do I retain pipeline run history after Kubernetes GC prunes it?

Install Tekton Results from the official release manifest. The Result Watcher automatically archives every completed PipelineRun and TaskRun into a PostgreSQL-backed API server, which you can query with the tkn CLI or REST API long after the original CRD objects have been deleted.

4. What is the fastest way to debug a failed TaskRun?

Run tkn pipelinerun describe <name> to see which task failed and its reason. Then use tkn taskrun logs <name> --follow to stream the step-level output. If the issue is scheduling rather than execution, check kubectl get events -n <namespace> --sort-by=.lastTimestamp for quota denials and image pull errors.

5. Can I send Tekton metrics to Datadog or New Relic instead of Prometheus?

Yes. Enable OTLP export in the config-observability ConfigMap and point the collector address at an OpenTelemetry Collector that has a Datadog or New Relic exporter configured. Alternatively, both Datadog and New Relic support Prometheus remote_write, so you can forward metrics from Prometheus to those platforms. CubeAPM can also be deployed as a lightweight alternative that consumes Prometheus metrics directly without a separate collector.
