CubeAPM

How to Monitor GitLab CI/CD Pipeline Performance and Errors

GitLab CI/CD is the engine behind most modern software delivery workflows, but a pipeline that silently fails, queues for 20 minutes, or flaps between success and failure drains developer productivity and erodes deployment confidence. GitLab CI monitoring means proactively observing pipeline duration, failure rates, runner health, and job-level errors, so you can fix problems before they block a release.

This guide walks through every monitoring layer available to GitLab users: the built-in analytics UI (available on all tiers), Prometheus and Grafana integration, OpenTelemetry-based pipeline tracing, and third-party APM tools. 

Key Takeaways

  • ✓  GitLab’s built-in CI/CD Analytics page tracks total runs, median duration, 95th-percentile duration, and failure rate without any additional setup.
  • ✓  The GitLab CI Pipelines Exporter scrapes the GitLab API and exposes Prometheus metrics, enabling Grafana dashboards and threshold-based alerts.
  • ✓  GitLab Observability (introduced in GitLab 18.1 as an experiment) auto-instruments pipelines via a single OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
  • ✓  Runner health (CPU, memory, disk I/O) is a leading indicator of pipeline slowdowns and should be monitored alongside pipeline metrics.
  • ✓  CubeAPM correlates pipeline telemetry with application traces, bridging the gap between a failing CI job and its production impact.
  • ✓  Best practices: set failure-rate alert thresholds, monitor the 95th-percentile duration, use job-level caching, and run fast-failing jobs first.

1. Understanding What to Monitor in GitLab CI/CD

Before configuring any tool, it helps to understand which signals matter. GitLab’s own documentation on pipeline efficiency identifies four primary areas:

  • Pipeline duration (median and 95th percentile): measures how long pipelines take end to end.
  • Failure rate: the percentage of pipeline runs that did not complete successfully.
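  • Success rate: the percentage of pipeline runs that completed successfully; your baseline reliability indicator. Together with the failure rate and the share of skipped or canceled runs, it adds up to 100%.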
  • Runner health: CPU, memory, disk I/O, and process count on the machines executing jobs.

According to GitLab’s Pipeline efficiency documentation, the total pipeline duration is heavily influenced by the size of the repository, the number of stages and jobs, dependencies between jobs, and the resources provisioned to runners.
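As a rough illustration of how these signals are derived, the following Python sketch computes median duration, 95th-percentile duration, and the outcome rates from a list of pipeline records. The record shape (a `status` string and a `duration` in seconds) is an assumption loosely modeled on fields returned by the GitLab Pipelines API, and the nearest-rank percentile method is chosen for simplicity; GitLab's own analytics may calculate percentiles differently.

```python
import math
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(pipelines):
    """Summarize pipeline records (hypothetical shape: {'status': str, 'duration': seconds or None})."""
    total = len(pipelines)
    if total == 0:
        raise ValueError("no pipelines")
    failed = sum(1 for p in pipelines if p["status"] == "failed")
    succeeded = sum(1 for p in pipelines if p["status"] == "success")
    other = total - failed - succeeded  # skipped, canceled, and so on
    # Only runs that report a duration contribute to the timing statistics.
    durations = [p["duration"] for p in pipelines if p.get("duration") is not None]
    return {
        "total": total,
        "median_s": median(durations) if durations else None,
        "p95_s": percentile(durations, 95) if durations else None,
        "failure_rate_pct": 100 * failed / total,
        "success_rate_pct": 100 * succeeded / total,
        "other_rate_pct": 100 * other / total,
    }
```

With only a handful of runs, a single slow pipeline dominates the 95th percentile, which is exactly the outlier the median hides.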

2. Using GitLab’s Built-in CI/CD Analytics

GitLab provides a native CI/CD Analytics page for every project. It is available on all tiers (Free, Premium, Ultimate) and requires no external tooling.

How to Access CI/CD Analytics

  1. In the top bar, select Search or go to and find your project.
  2. In the left sidebar, select Analyze > CI/CD analytics.
  3. Use the Source, Branch, and Date range filters to focus on specific workflows.

Metrics Available

| Metric | Description | Available On |
|---|---|---|
| Total pipeline runs | All pipelines including child pipelines and failed YAML pipelines | Free+ |
| Median duration | 50th-percentile pipeline execution time | Free+ |
| 95th-percentile duration | 95% of pipelines finish within this time; used to spot outliers | Free+ |
| Failure rate | Percentage of pipelines that did not complete successfully | Free+ |
| Success rate | Percentage of pipelines that completed successfully | Free+ |
| Other rate | Skipped or canceled pipelines | Free+ |
| Job performance metrics (P50, P95 duration, failure rate per job) | Per-job breakdown; requires ClickHouse on Self-Managed | Premium/Ultimate |

Note on job performance metrics: CI/CD job performance metrics were introduced in GitLab 18.9 as limited availability. On GitLab Self-Managed and GitLab Dedicated instances, you must configure ClickHouse to enable this feature. On GitLab.com it is available by default for Premium and Ultimate tiers.

3. Monitoring Job Logs for Errors

Job logs are the most immediate source of error information. Every CI/CD job writes a full execution log, accessible from your project under Build > Pipelines in the left sidebar. Select any pipeline, then select a job to view its log.

Collapsible Log Sections

GitLab supports custom collapsible sections in job logs. This is useful when a build step produces hundreds of lines of output. You can mark sections using ANSI escape codes in your .gitlab-ci.yml script:

job1:
  script:
    - echo -e "\e[0Ksection_start:`date +%s`:install_deps\r\e[0KInstalling dependencies"
    - npm install
    - echo -e "\e[0Ksection_end:`date +%s`:install_deps\r\e[0K"

When the feature flag FF_SCRIPT_SECTIONS is enabled on the runner, multi-line script commands automatically appear as collapsible sections in the GitLab UI. 

Retrieving Logs via the API

For automated error analysis, retrieve job logs programmatically using the GitLab Jobs API:

curl --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://gitlab.example.com/api/v4/projects/<project_id>/jobs/<job_id>/trace"
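Once a trace is retrieved, a small script can pull out the lines most likely to explain a failure. The sketch below is one possible approach, not an official GitLab tool: `trace_url`, `fetch_trace`, and `find_error_lines` are hypothetical helper names, and the error patterns are illustrative guesses that you would tune to your own build output. It uses the same `/jobs/<job_id>/trace` endpoint and `PRIVATE-TOKEN` header as the curl command above.

```python
import re
import urllib.request
from urllib.parse import quote

# Illustrative patterns only; adjust them to the tools your jobs run.
ERROR_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\berror\b", r"\bfatal\b", r"exit code [1-9]")
]
# Job logs contain ANSI color and section codes; strip them before matching.
ANSI_CODES = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def trace_url(base_url, project_id, job_id):
    """Build the Jobs API trace URL; a project path such as 'group/project' is URL-encoded."""
    project = quote(str(project_id), safe="")
    return f"{base_url.rstrip('/')}/api/v4/projects/{project}/jobs/{job_id}/trace"


def fetch_trace(base_url, project_id, job_id, token):
    """Download a job log as text (performs a network call)."""
    request = urllib.request.Request(
        trace_url(base_url, project_id, job_id),
        headers={"PRIVATE-TOKEN": token},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def find_error_lines(trace_text):
    """Return (line_number, cleaned_line) pairs for lines matching any error pattern."""
    hits = []
    for number, raw in enumerate(trace_text.splitlines(), start=1):
        line = ANSI_CODES.sub("", raw).strip()
        if any(pattern.search(line) for pattern in ERROR_PATTERNS):
            hits.append((number, line))
    return hits
```

Keeping the matching logic separate from the download makes it easy to run the same scan over logs saved as artifacts or pasted from the UI.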

4. Prometheus and Grafana Integration

For teams that need historical trends, alerting, and cross-project visibility, Prometheus is the standard approach. GitLab’s own pipeline efficiency documentation recommends the GitLab CI Pipelines Exporter as the primary Prometheus integration.

GitLab CI Pipelines Exporter

The GitLab CI Pipelines Exporter polls the GitLab API and emits Prometheus metrics for pipeline status, duration, and job-level data. It can automatically discover branches and environments.

# docker-compose.yml (example)
version: '3'
services:
  gitlab-ci-pipelines-exporter:
    image: mvisonneau/gitlab-ci-pipelines-exporter:latest
    environment:
      GCPE_GITLAB_TOKEN: "<your-gitlab-token>"
      # Point the exporter at the mounted config file
      GCPE_CONFIG: /etc/gitlab-ci-pipelines-exporter/config.yml
    volumes:
      - ./config.yml:/etc/gitlab-ci-pipelines-exporter/config.yml
    ports:
      - "8080:8080"

A minimal config.yml to monitor a project:

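# config.yml
# Schema used by recent exporter releases; check the exporter's
# configuration reference for the version you deploy.
projects:
  - name: my-org/my-project
    pull:
      refs:
        branches:
          # Regular expression selecting the branches to track
          regexp: "^(main|develop)$"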

Once running, you can build Grafana dashboards from these metrics. Embedded metric charts can also be referenced in GitLab incident management issues, making it easier to correlate pipeline failures with system incidents.

Runner Monitoring with Prometheus Node Exporter

GitLab’s documentation on pipeline efficiency recommends the Prometheus Node Exporter for monitoring runner host systems (CPU, memory, disk I/O), and kube-state-metrics for Kubernetes-based runners.
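To make the runner metrics concrete, here is a minimal Python sketch of how CPU and memory utilization are typically derived from the kind of counters the Node Exporter exposes (cumulative CPU seconds per mode, and available versus total memory). The function names and the input shape are assumptions for illustration; in practice you would usually express the same arithmetic as a Prometheus query rather than in application code.

```python
def cpu_utilization_pct(before, after):
    """CPU busy percentage between two samples.

    `before` and `after` map a CPU mode (idle, user, system, ...) to cumulative
    seconds summed across cores, as a counter would report them.
    """
    total = sum(after[mode] - before[mode] for mode in after)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    idle = after["idle"] - before["idle"]
    return 100 * (1 - idle / total)


def memory_usage_pct(mem_available_bytes, mem_total_bytes):
    """Share of memory in use, based on available rather than free memory."""
    if mem_total_bytes <= 0:
        raise ValueError("total memory must be positive")
    return 100 * (1 - mem_available_bytes / mem_total_bytes)
```

Using deltas between two samples matters because the CPU values are ever-growing counters; a single reading says nothing about current load.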

5. OpenTelemetry-Based Pipeline Tracing

GitLab Observability was introduced in GitLab 18.1 as an experiment and provides distributed tracing, metrics, and logs within the GitLab platform. According to the GitLab Observability documentation, it automatically instruments CI/CD pipelines when you set one environment variable.

Enabling Automatic CI Pipeline Tracing

Set the following variable in your project’s CI/CD settings or directly in .gitlab-ci.yml:

variables:
  OTEL_EXPORTER_OTLP_ENDPOINT: "https://<gitlab-instance>/-/otel/traces"
  OTEL_EXPORTER_OTLP_HEADERS: "Authorization=Bearer <token>"

GitLab Observability is free for all tiers with no per-seat, per-metric, or per-host charges, and no cardinality limits. As of the week of April 21, 2026, users are processing more than 57 million traces daily and monitoring more than 3,000 services on GitLab.com.

Sending to an External OTLP Backend

If you use an external OpenTelemetry collector or APM backend, redirect the same environment variable to your collector endpoint:

variables:
  OTEL_EXPORTER_OTLP_ENDPOINT: "https://your-apm-backend:4318"
  OTEL_SERVICE_NAME: "my-gitlab-pipeline"

6. Key Metrics to Monitor and Alert On

Whether you use GitLab’s built-in analytics, Prometheus, or an external APM, these are the metrics that matter most for GitLab CI monitoring:

| Metric | What It Tells You | Alert Threshold (guideline) |
|---|---|---|
| Failure rate | Pipeline reliability; recurring failures indicate flaky tests or infra problems | > 10% on main branch |
| Median pipeline duration | Typical developer wait time from push to green | Baseline + 20% |
| P95 pipeline duration | Worst-case experience; outliers not visible in median | Baseline + 50% |
| Job-level P95 duration | Which specific job is the bottleneck | Job baseline + 30% |
| Runner CPU utilization | Runner saturation; queuing upstream | > 80% sustained |
| Runner memory usage | OOM kills cause silent job failures | > 85% |
| Pipeline queue time | Time waiting for an available runner | > 5 min on critical branches |
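The guideline thresholds above mix two kinds of rule: absolute limits (for example, failure rate above 10%) and limits relative to a baseline (for example, P95 more than 50% above normal). The following Python sketch shows one way to evaluate both in a single pass. The metric key names and the `breaches` helper are hypothetical, and the numbers simply mirror the table; treat them as starting points rather than recommended values for every team.

```python
# (kind, limit): "absolute" compares to the limit directly,
# "relative" compares to baseline * (1 + limit).
THRESHOLDS = {
    "failure_rate_pct": ("absolute", 10.0),
    "median_duration_s": ("relative", 0.20),
    "p95_duration_s": ("relative", 0.50),
    "job_p95_duration_s": ("relative", 0.30),
    "runner_cpu_pct": ("absolute", 80.0),
    "runner_memory_pct": ("absolute", 85.0),
    "queue_time_s": ("absolute", 300.0),
}


def breaches(current, baseline):
    """Return (metric, value, threshold) for every metric over its guideline threshold."""
    alerts = []
    for name, (kind, limit) in THRESHOLDS.items():
        if name not in current:
            continue  # metric not collected; nothing to evaluate
        if kind == "absolute":
            threshold = limit
        else:
            if name not in baseline:
                continue  # relative rules need a baseline to compare against
            threshold = baseline[name] * (1 + limit)
        if current[name] > threshold:
            alerts.append((name, current[name], threshold))
    return alerts
```

Relative rules keep alerts meaningful as a project grows: a 12-minute pipeline is fine for a monorepo and alarming for a small service.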

7. Identifying and Fixing Common Pipeline Errors

Monitoring surfaces the symptoms; the following patterns explain the root causes.

Flaky Jobs

Flaky jobs fail intermittently even though the code has not changed. GitLab’s pipeline efficiency documentation lists flaky unit tests, along with the test coverage drops and code quality changes that correlate with them, among the failures worth monitoring. Use the job-level failure rate metric to identify jobs with inconsistent results, and move them to a dedicated flaky-test stage so they do not block the main pipeline while you fix them.
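One simple heuristic for spotting flakiness, sketched below in Python, is to look for jobs that both passed and failed on the same commit: since the code did not change between those runs, the difference is likely environmental or test-related. The record shape (`name`, `sha`, `status`) is an assumption modeled loosely on job data from the GitLab API, and `find_flaky_jobs` is a hypothetical helper, not a GitLab feature.

```python
from collections import defaultdict


def find_flaky_jobs(job_runs):
    """Count, per job name, the commits on which the job both succeeded and failed."""
    outcomes = defaultdict(set)
    for run in job_runs:
        # Ignore canceled, skipped, or still-running jobs.
        if run["status"] in ("success", "failed"):
            outcomes[(run["name"], run["sha"])].add(run["status"])

    flaky = defaultdict(int)
    for (name, _sha), statuses in outcomes.items():
        if statuses == {"success", "failed"}:
            flaky[name] += 1
    return dict(flaky)
```

A job that fails consistently on a commit is a genuine failure, not a flake, so it does not appear in the result.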

Slow Pipelines Due to Missing Cache

Install steps such as npm install or pip install are a common bottleneck. Cache them across pipeline runs:

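build:
  script:
    # npm ci removes node_modules before installing, so cache npm's download cache instead
    - npm ci --cache .npm --prefer-offline
  cache:
    key: ${CI_COMMIT_REF_SLUG}
    paths:
      - .npm/
    when: always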

Late-Failing Long Jobs

If a slow job (e.g., end-to-end tests) fails only after 30 minutes, developers wait that long to learn about a problem that a faster check might have caught. Structure the pipeline so fast-failing jobs run first, and use the needs keyword so a job starts as soon as the jobs it depends on finish instead of waiting for the whole previous stage:

stages:
  - validate
  - integration

lint:
  stage: validate
  script: npm run lint

unit-tests:
  stage: validate
  needs: []
  script: npm test

e2e-tests:
  stage: integration
  needs: ["unit-tests"]
  script: npm run e2e

Runner Resource Starvation

When CPU or memory on the runner host is exhausted, jobs queue or are killed silently. Monitor runner hosts using the Prometheus Node Exporter and alert at 80% CPU or 85% memory sustained utilization. For Kubernetes-based runners, kube-state-metrics tracks pod resource requests and limits.

8. Tools Comparison for GitLab CI Monitoring

| Tool | Primary Strength | Setup Effort | Best For |
|---|---|---|---|
| CubeAPM | Correlates pipeline traces with app APM; OTel-native; self-hosted; no cardinality limits | Low (single OTLP env var) | Teams wanting pipeline + app traces in one pane |
| GitLab CI/CD Analytics (built-in) | Zero setup; pipeline success/duration charts; job performance metrics (Premium+) | None | Quick overview on all GitLab tiers |
| GitLab Observability (built-in, experiment) | Distributed tracing, metrics, and logs natively in GitLab; auto pipeline instrumentation | Low | GitLab 18.1+ users wanting native OTel tracing |
| GitLab CI Pipelines Exporter + Prometheus | Long-term metrics retention; Grafana dashboards; cross-project alerting | Medium | Teams with existing Prometheus/Grafana stack |
| Datadog CI Visibility | End-to-end pipeline and test visibility with APM correlation; hosted SaaS | Medium | Enterprise teams on Datadog |
| Netdata | Per-second runner host metrics; zero config; anomaly detection | Low (agent install) | Real-time runner resource monitoring |
| Grafana + Prometheus (custom) | Fully custom dashboards; any metric source | High | Teams needing bespoke alerting logic |

9. Best Practices for GitLab CI Monitoring

  • Monitor the 95th percentile, not just the median. Median duration hides outliers. Alert when P95 exceeds your baseline by 50%.
  • Set branch-level filters. CI/CD Analytics lets you filter by branch. Prioritize alerts on your main or production branch.
  • Alert on failure rate trends, not individual failures. A single failed pipeline is noise; a rising failure rate over 24 hours is a signal.
  • Run fast-failing jobs first. Linting, security scanning, and unit tests should appear before long integration tests so developers get feedback within minutes, not hours.
  • Cache aggressively. Use cache:when: always to preserve downloaded dependencies even after a failed job run.
  • Monitor runner resources proactively. Runner saturation causes invisible queuing. Add CPU and memory dashboards next to your pipeline dashboards.
  • Correlate code changes with performance. Use GitLab Observability or CubeAPM to link a specific merge request to a duration spike.
  • Review storage usage regularly. Job artifacts accumulate quickly. Set expire_in on artifacts to prevent storage bloat from slowing pipelines.
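The advice to alert on failure-rate trends rather than individual failures can be sketched as a comparison between the most recent 24-hour window and the one before it. The Python example below is one possible interpretation: the `(finished_at, status)` tuple shape, the `failure_rate_rising` helper, and the 5-percentage-point minimum increase are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def failure_rate(runs, start, end):
    """Failure percentage for runs finishing in [start, end), or None if there were none."""
    statuses = [status for finished_at, status in runs if start <= finished_at < end]
    if not statuses:
        return None
    return 100 * sum(1 for s in statuses if s == "failed") / len(statuses)


def failure_rate_rising(runs, now, window=timedelta(hours=24), min_increase_pct=5.0):
    """True when the latest window's failure rate exceeds the previous window's by the margin."""
    current = failure_rate(runs, now - window, now)
    previous = failure_rate(runs, now - 2 * window, now - window)
    if current is None or previous is None:
        return False  # not enough history to call it a trend
    return current - previous >= min_increase_pct
```

Requiring data in both windows is what filters out the single failed pipeline: one failure with no history to compare against never triggers the signal.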

Go Beyond GitLab’s Built-in Analytics with CubeAPM

GitLab CI/CD analytics shows you pipeline success rates and duration, but it does not correlate failures with application-level traces, runner resource saturation, or downstream service degradation. CubeAPM bridges that gap. It ingests OpenTelemetry pipeline telemetry alongside your application traces, giving you a single pane of glass that connects a failed CI job to the exact service error it caused in production.

CubeAPM is self-hosted and fully open-standards based. You can instrument your GitLab pipelines using the native OpenTelemetry environment variable (OTEL_EXPORTER_OTLP_ENDPOINT) and start sending traces to CubeAPM in minutes, with no per-seat or per-metric pricing.

Summary

Here is a quick reference of every monitoring layer covered in this guide:

| Layer | Tool / Feature | Key Benefit |
|---|---|---|
| Built-in analytics | GitLab CI/CD Analytics (Analyze > CI/CD analytics) | Instant pipeline health overview; no setup |
| Job logs | GitLab Job Logs UI + API (/jobs/:id/trace) | Raw error output; collapsible sections |
| Prometheus metrics | GitLab CI Pipelines Exporter | Long-term trends; Grafana dashboards; alerting |
| Runner monitoring | Prometheus Node Exporter / kube-state-metrics | CPU, memory, disk I/O on runner hosts |
| Distributed tracing | GitLab Observability (OTel, experiment) | Auto-instrumented pipeline traces in GitLab UI |
| Full APM correlation | CubeAPM (docs.cubeapm.com) | Pipeline traces + app traces; self-hosted; OTel-native |

Disclaimer

The information in this article is provided for educational purposes only. All GitLab feature availability, version numbers, and tier requirements have been verified against official GitLab documentation as of the date of publication. Features marked as experimental or limited availability may change. Always refer to the official GitLab documentation at docs.gitlab.com for the most current information. Third-party tool details (pricing, features) may change independently of this article.

FAQs

1. What is the easiest way to start GitLab CI monitoring without any extra tools?

Navigate to Analyze > CI/CD analytics inside your GitLab project. This built-in page shows total pipeline runs, median and 95th-percentile duration, and success and failure rates for any date range and branch. No configuration is required and it is available on the Free tier.

2. How do I get Prometheus metrics for GitLab CI/CD pipelines?

Deploy the GitLab CI Pipelines Exporter (github.com/mvisonneau/gitlab-ci-pipelines-exporter). It polls the GitLab API and exposes Prometheus metrics for pipeline status, duration, and job-level data. Point a Prometheus scrape job at it and build Grafana dashboards on top of the collected metrics.

3. Does GitLab support OpenTelemetry tracing for pipelines?

Yes. GitLab Observability (introduced as an experiment in GitLab 18.1) auto-instruments CI/CD pipelines when you set the OTEL_EXPORTER_OTLP_ENDPOINT environment variable. The feature is free for all tiers and has no cardinality limits on traces, metrics, or logs.

4. How do I monitor GitLab Runner performance?

Use the Prometheus Node Exporter on Linux runner hosts to track CPU, memory, disk I/O, and process resources. For Kubernetes-based runners, deploy kube-state-metrics. GitLab’s pipeline efficiency documentation also recommends testing runner auto-scaling with cloud providers and defining offline times to reduce costs.

5. What is the difference between GitLab CI/CD Analytics and GitLab Observability?

CI/CD Analytics (available to all tiers) shows aggregated pipeline and job metrics directly in the GitLab UI, such as success rates and median duration. GitLab Observability is a separate experimental feature (GitLab 18.1+) that provides distributed tracing, logs, and metrics using OpenTelemetry standards, allowing you to follow a request across microservices and correlate code changes with application performance issues.
