The bill arrives at the end of the month and the number is bigger than last month — again. Your continuous integration pipelines are not running more often, your team has not grown, and yet the compute line for your build infrastructure keeps climbing. Worse, when you ask developers how the pipelines feel, the answer is "slow." That combination — rising cost and rising wait times at the same time — is the signature of a runner fleet that is provisioned for capacity but not tuned for efficiency.
It is a counterintuitive failure mode. The instinct when builds queue is to add more capacity, and the instinct when costs rise is to cut it. Both instincts, applied blindly, make a self-hosted Linux runner fleet less efficient, not more. Capacity efficiency is the discipline of getting the most useful build throughput out of every dollar and every core-hour — and on Linux-based GitLab Runner and GitHub Actions fleets, the levers that move it are surprisingly specific.
Key Takeaways
- Self-hosted CI/CD typically consumes 15–40% of total cloud infrastructure spend, and that line grows 20–35% a year if left untuned.
- The largest sources of waste are always-on idle agents, oversized instances, and redundant or failed jobs — not a shortage of capacity.
- The Docker Machine executor is deprecated as of GitLab 17.5 and scheduled for removal in GitLab 20.0 (May 2027); plan a migration to the GitLab Runner Autoscaler or the Kubernetes executor.
- Right-sizing instances, scheduling idle capacity to zero off-hours, and running on spot instances routinely cut CI compute bills 35–80% without slowing builds.
- Efficiency is measurable: correlate runner saturation with pipeline queue time, and you can tune the fleet instead of guessing.
What "capacity efficiency" actually means for a runner fleet
A runner fleet has two failure directions, and they pull against each other. Provision too little capacity and jobs sit in a queue waiting for an executor to free up — developers wait, deployments slip, and the cost of engineering idle time dwarfs any compute savings. Provision too much and instances sit powered on with nothing to do, burning money on idle cores. Capacity efficiency is the point of balance between those two: enough headroom to keep queue time low, with little enough idle time that you are not paying for air.
The reason this is hard is that CI/CD load is bursty and non-uniform. A team of forty engineers does not generate a smooth, predictable stream of jobs. They push in clusters around stand-up, before lunch, and in the late-afternoon rush to merge before the end of the day. A fixed-size fleet sized for the afternoon peak is two-thirds idle at 3 a.m.; a fleet sized for the average is overwhelmed at peak. The only way to be efficient against a bursty load is to make the fleet elastic — to grow when the queue grows and shrink when it drains.
The core insight:
Efficiency is not a single number you maximize. It is the gap you minimize between "capacity you pay for" and "capacity that does useful work." Every optimization in this article either shrinks the idle gap or shrinks the wait gap — ideally without widening the other.
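To make the two gaps concrete, here is a minimal sketch that scores a fleet against an hourly demand profile. The demand numbers and the function name fleet_gaps are hypothetical; demand is counted in jobs wanting a slot each hour, and capacity in job slots paid for that hour.

```python
def fleet_gaps(demand, capacity):
    """Return (idle_gap, wait_gap) in slot-hours.

    idle_gap: slots paid for but unused (paying for air)
    wait_gap: jobs that found no free slot (developers waiting)
    """
    idle_gap = sum(max(c - d, 0) for d, c in zip(demand, capacity))
    wait_gap = sum(max(d - c, 0) for d, c in zip(demand, capacity))
    return idle_gap, wait_gap


# Hypothetical bursty day: quiet overnight, clusters at stand-up, lunch, and late afternoon.
demand = [0, 0, 0, 0, 0, 0, 0, 2, 8, 10, 6, 4, 9, 5, 6, 8, 12, 12, 4, 1, 0, 0, 0, 0]
peak = max(demand)
average = round(sum(demand) / len(demand))

sized_for_peak = [peak] * 24        # fixed fleet sized to the afternoon rush
sized_for_average = [average] * 24  # fixed fleet sized to the daily mean
elastic = list(demand)              # idealized: capacity tracks demand exactly

for name, cap in [("peak", sized_for_peak), ("average", sized_for_average), ("elastic", elastic)]:
    print(name, fleet_gaps(demand, cap))
```

With this profile the peak-sized fleet pays for 201 idle slot-hours out of 288 (roughly 70%), the average-sized fleet trades most of that idle time for 40 slot-hours of queueing, and the idealized elastic fleet has neither gap.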
Where the waste actually hides
Before tuning anything, it helps to know where the money goes. Industry analyses of CI/CD spend converge on three recurring culprits, and none of them is "we need a bigger fleet."
- 15–40% of total cloud spend goes to CI/CD compute
- ~20% of CI spend is wasted on redundant or failed jobs
- 20–35% annual cost growth when fleets go untuned
Sources: Security Boulevard (2026), CI/CD Watch, nOps
The first culprit is idle always-on agents. Traditional self-hosted CI runs dedicated build servers that stay powered on 24 hours a day, seven days a week — which means they bill for nights, weekends, and holidays when no one is pushing code. For a fleet sized to peak demand, the majority of paid hours produce zero builds.
The second is oversized instances. Teams reach for large machine types "to be safe," but most CI jobs are I/O-bound (cloning, dependency installs, artifact uploads) rather than CPU-bound, and they finish in the same wall-clock time on a modest instance as on an expensive one. Paying for sixteen vCPUs to run a job that saturates two is pure overhead.
The third is redundant and failed work. Up to a fifth of CI spend comes from jobs that did not need to run — pipelines triggered on documentation-only changes, test suites re-running steps whose inputs never changed, and jobs that fail late after consuming full compute. A cache that misses, a pipeline without dependency-aware staging, a missing path filter: each quietly multiplies the bill.
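A missing path filter is the easiest of these to fix. In GitLab it is normally expressed with rules: changes: in .gitlab-ci.yml (GitHub Actions offers paths-ignore). The sketch below only models the decision such a filter makes, using hypothetical ignore patterns, so the logic is easy to reason about and test.

```python
from fnmatch import fnmatch

# Hypothetical patterns for changes that never need a build.
DOCS_ONLY_PATTERNS = ["docs/*", "*.md", "LICENSE"]


def needs_pipeline(changed_files, ignore_patterns=DOCS_ONLY_PATTERNS):
    """True if at least one changed file falls outside the ignore patterns.

    fnmatch's "*" also matches "/", so "docs/*" covers nested paths too.
    """
    return any(
        not any(fnmatch(path, pattern) for pattern in ignore_patterns)
        for path in changed_files
    )


print(needs_pipeline(["README.md", "docs/setup/install.md"]))  # documentation-only push
print(needs_pipeline(["README.md", "src/server.go"]))          # real code change
```

Skipping the documentation-only push saves a full pipeline's worth of compute; any push that touches code still runs.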
Self-hosted Linux runners give teams control over cost and capacity — but only if the fleet is sized and scheduled to match real demand rather than worst-case peaks.
The executor decision shapes everything downstream
On GitLab Runner, the executor determines how jobs map to compute, and that choice sets the ceiling on how efficient the fleet can be. It is also a decision with a deadline attached: the Docker Machine executor was deprecated in GitLab 17.5 and is scheduled for removal in GitLab 20.0 (May 2027). Fleets that still autoscale through Docker Machine on AWS EC2, Azure, or Google Compute Engine need a migration plan toward the newer GitLab Runner Autoscaler or the Kubernetes executor [GitLab Docs].
The Linux-based options trade off differently between operational simplicity and resource utilization; the deprecated Docker Machine executor is included for comparison:
| Executor | Best for | Efficiency profile | Status |
|---|---|---|---|
| Docker (single host) | Steady, predictable load on a dedicated instance | A c5.2xlarge handles 8–12 concurrent jobs with sub-30s container startup; no elasticity on its own | Supported |
| GitLab Runner Autoscaler | Bursty load on cloud instances (EC2, Azure, GCE) | Elastic — provisions only as many instances as the queue needs; the successor to Docker Machine | Recommended |
| Kubernetes | Teams already running a cluster; cloud-native shops | Schedules jobs as pods for dense bin-packing; EKS Auto Mode reports up to 90% cost reduction vs. dedicated | Supported |
| Docker Machine | Legacy autoscaling deployments | Once the standard autoscaler — now superseded | Deprecated (removal in GitLab 20.0) |
For most organizations the practical answer is the GitLab Runner Autoscaler if you live on raw cloud instances, or the Kubernetes executor if you already operate a cluster. Both replace the "always-on, fixed-size" model with one that bends to demand. The Kubernetes path is particularly attractive when the same cluster runs other workloads, because the scheduler can bin-pack CI pods into capacity that would otherwise sit reserved. Running a fleet on either, of course, means owning the underlying Linux hosts — patching, hardening, and monitoring them — which is exactly the kind of infrastructure work that benefits from disciplined Linux IT support whether handled in-house or by a managed partner.
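Bin-packing is what makes the Kubernetes path dense. The sketch below uses first-fit decreasing, a classic bin-packing heuristic, to show how CPU requests determine how many nodes a burst of CI pods needs. It is an illustration only: kube-scheduler uses its own filter-and-score plugins (and by default favors less-allocated nodes), so real placement depends on scheduler and cluster-autoscaler configuration. The node size and pod requests, in millicores, are hypothetical.

```python
def first_fit_decreasing(requests_m, node_capacity_m):
    """Pack pod CPU requests (millicores) onto nodes, largest request first.

    Returns a list of nodes, each a list of the requests placed on it.
    """
    nodes = []  # each entry: [free_millicores, [placed requests]]
    for request in sorted(requests_m, reverse=True):
        if request > node_capacity_m:
            raise ValueError(f"request {request}m exceeds node size")
        for node in nodes:
            if node[0] >= request:  # first node with room wins
                node[0] -= request
                node[1].append(request)
                break
        else:  # no existing node had room: open a new one
            nodes.append([node_capacity_m - request, [request]])
    return [placed for _, placed in nodes]


# Hypothetical burst: ten build pods at a 500m request, three heavier jobs at 2 CPUs.
burst = [500] * 10 + [2000] * 3
packed = first_fit_decreasing(burst, node_capacity_m=4000)
print(len(packed), packed)
```

The 11 CPUs of requests land on three 4-CPU nodes, which is the minimum possible; a fleet of one dedicated instance per job would need thirteen.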
Right-sizing: the highest-leverage tuning you are not doing
Once the fleet is elastic, the next lever is matching instance size and concurrency to the actual shape of your jobs. Two knobs do most of the work.
The first is instance type. The guidance from GitLab's own autoscaling documentation is blunt: most CI jobs do not need large instances. A smaller type such as t3.medium (amazonec2-instance-type=t3.medium in a Docker Machine configuration) handles the typical clone-build-test job perfectly well, and you move to a larger type only for the specific jobs that genuinely need it. Defaulting the whole fleet to a large type "for safety" is the single most common way teams overspend.
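The rule of thumb translates into a simple selection: measure a job's real peak, add headroom, and take the smallest type that fits. The catalog lists real EC2 shapes (vCPU and memory only, no prices), but the 25% headroom factor and the function name right_size are assumptions to adjust against your own metrics.

```python
# (name, vCPUs, memory GiB), ordered smallest first.
CATALOG = [
    ("t3.medium", 2, 4),
    ("t3.large", 2, 8),
    ("c5.xlarge", 4, 8),
    ("c5.2xlarge", 8, 16),
    ("c5.4xlarge", 16, 32),
]


def right_size(peak_cpu, peak_mem_gib, headroom=1.25, catalog=CATALOG):
    """Smallest catalog entry that covers observed peak usage plus headroom."""
    need_cpu = peak_cpu * headroom
    need_mem = peak_mem_gib * headroom
    for name, vcpus, mem in catalog:
        if vcpus >= need_cpu and mem >= need_mem:
            return name
    raise ValueError("no catalog entry is large enough; split the job instead")


print(right_size(peak_cpu=1.4, peak_mem_gib=2.5))   # typical clone-build-test job
print(right_size(peak_cpu=6.0, peak_mem_gib=10.0))  # a genuinely heavy compile
```

One design caveat: t3 types are burstable, so jobs that hold the CPU busy for long stretches may be better served by a fixed-performance family even when the vCPU count fits.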
The second is concurrency and resource limits. On the Kubernetes executor, you set CPU and memory requests and limits for the build, helper, and service containers in config.toml, and can allow per-job overrides in .gitlab-ci.yml. The asymmetry here matters: a container that exceeds its memory limit is OOM-killed, so memory must be sized for the job's real peak, whereas the CPU limit is softer. Exceeding it throttles the container rather than killing it, which means a lower CPU ceiling lengthens job duration instead of failing it. The efficient configuration sets memory honestly and treats CPU as a dial between speed and density.
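A toy model makes the asymmetry visible. It assumes a job with a fixed amount of CPU work that could keep a given number of cores busy, treats the CPU limit as a cap on usable cores, and treats the memory limit as a kill threshold. Real CFS throttling is more complex than this, so read the durations as directional rather than predictive; all names are hypothetical.

```python
def run_under_limits(cpu_seconds, max_parallelism, peak_mem_gib, cpu_limit, mem_limit_gib):
    """Return (outcome, wall_clock_seconds) for a job under container limits.

    cpu_seconds:     total CPU work the job performs
    max_parallelism: cores the job could keep busy if unthrottled
    """
    if peak_mem_gib > mem_limit_gib:
        # Memory is a hard wall: the container is OOM-killed and the work is lost.
        return "OOMKilled", None
    # CPU is a soft wall: throttling caps usable cores, so the job just runs longer.
    usable_cores = min(max_parallelism, cpu_limit)
    return "succeeded", cpu_seconds / usable_cores


job = dict(cpu_seconds=240, max_parallelism=4, peak_mem_gib=1.8)
print(run_under_limits(**job, cpu_limit=4, mem_limit_gib=2))    # roomy limits
print(run_under_limits(**job, cpu_limit=2, mem_limit_gib=2))    # tighter CPU: slower
print(run_under_limits(**job, cpu_limit=4, mem_limit_gib=1.5))  # tight memory: killed
```

Halving the CPU limit doubles this job's runtime but still produces a result; trimming memory below the real peak produces nothing at all.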
config.toml: right-sized defaults with override room

```toml
concurrent = 12  # global cap on simultaneous jobs across all [[runners]] entries

[[runners]]
  # name, url, and token omitted for brevity
  executor = "kubernetes"
  [runners.kubernetes]
    cpu_request = "500m"      # honest baseline, not worst case
    memory_request = "1Gi"
    cpu_limit = "2"           # soft ceiling: throttles, does not fail
    memory_limit = "2Gi"      # sized to real peak: containers over this are OOM-killed
    helper_cpu_limit = "500m"
    helper_memory_limit = "256Mi"
    service_cpu_limit = "1"
    service_memory_limit = "1Gi"
    # let heavy jobs raise limits with KUBERNETES_CPU_LIMIT / KUBERNETES_MEMORY_LIMIT
    # variables in .gitlab-ci.yml, up to these caps
    cpu_limit_overwrite_max_allowed = "4"
    memory_limit_overwrite_max_allowed = "8Gi"
```

The concurrent setting deserves attention because it is also your capacity-planning instrument. Watching the number of running jobs against the concurrent (or per-runner limit) value tells you whether the fleet still has room to absorb more work or is saturated and forcing jobs to queue. That single ratio is the foundation of every scaling decision that follows.
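The ratio is simple to compute from periodic samples of running jobs. The sketch below is hypothetical: it assumes you already collect running-job counts at a regular interval (for example from the runner's Prometheus metrics endpoint) and reports average and peak utilization against concurrent.

```python
def utilization(running_samples, concurrent):
    """Return (average, peak) of running / concurrent over the samples."""
    if concurrent <= 0 or not running_samples:
        raise ValueError("need a positive concurrent value and at least one sample")
    ratios = [running / concurrent for running in running_samples]
    return sum(ratios) / len(ratios), max(ratios)


# Hypothetical samples of running jobs taken across a busy afternoon, concurrent = 12.
samples = [2, 4, 9, 12, 12, 7, 5, 10, 12, 12, 6, 3]
avg, peak = utilization(samples, concurrent=12)
print(f"average {avg:.0%}, peak {peak:.0%}")
```

A peak pinned at 100% alongside a moderate average says the fleet is short only at peak, which is the case elastic scaling addresses; a peak that never approaches 100% means concurrent, and the capacity behind it, can shrink.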
The off-hours and spot levers: where the big savings live
Elasticity and right-sizing get you most of the way; the cost structure of the underlying compute gets you the rest. Two moves dominate.
Schedule idle capacity to zero. The GitLab autoscaler supports time-based scaling periods, so you can hold a warm pool of runners during working hours and set the idle count (idle_count in the Runner Autoscaler policy, IdleCount in the legacy Docker Machine configuration) to zero overnight and on weekends, with a short idle timeout (five minutes is typical) so instances spin down quickly once a burst clears. For a fleet that previously ran 24/7, eliminating nights and weekends alone removes well over half of all paid hours: pure savings with no impact on a sleeping team's productivity.
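The arithmetic behind that claim is easy to check. The sketch below models a schedule in the spirit of the autoscaler's periods: a warm pool of four idle instances on weekdays from 08:00 to 18:00, and none otherwise. The window and pool size are hypothetical, and the real schedule lives in config.toml, not in Python.

```python
from datetime import datetime, timedelta


def warm_pool_size(ts, idle_count=4, start_hour=8, end_hour=18):
    """Idle instances to hold at time ts: a warm pool in working hours, zero otherwise."""
    is_weekday = ts.weekday() < 5  # Monday=0 ... Friday=4
    in_hours = start_hour <= ts.hour < end_hour
    return idle_count if is_weekday and in_hours else 0


def weekly_warm_hours(**kwargs):
    """Instance-hours spent on the warm pool over one week, sampled hourly."""
    monday = datetime(2024, 1, 1)  # 2024-01-01 was a Monday
    return sum(warm_pool_size(monday + timedelta(hours=h), **kwargs) for h in range(7 * 24))


always_on = 4 * 7 * 24  # the same four instances left running 24/7
scheduled = weekly_warm_hours()
print(scheduled, always_on, f"{1 - scheduled / always_on:.0%} fewer paid hours")
```

This counts only the warm pool; burst instances still scale up on demand during the day, but they no longer run through nights and weekends waiting for work.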
Run on spot capacity. Cloud providers discount spare capacity 60–90% through spot or preemptible instances, and CI is almost the perfect spot workload: jobs are short, stateless, and retryable, so an interrupted build simply re-runs on a fresh instance. Teams running self-hosted runners on spot routinely cut CI compute bills 60–80% [nOps]. Pairing a spot-integrated autoscaling group with the patterns above is how organizations reach the upper end of those reductions.
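Whether spot still pays off after retries is a short expected-value calculation. The sketch assumes each interruption wastes a fraction of one job's compute before the re-run; the discount, interruption rate, and waste fraction are illustrative inputs, not provider figures. In GitLab, automatic re-runs can be requested with the retry keyword (for example retry: when: runner_system_failure), though how a given interruption is classified depends on your setup.

```python
def spot_cost_ratio(discount, interruption_rate, wasted_fraction):
    """Expected spot cost as a fraction of on-demand cost for the same work.

    discount:          spot price reduction vs on-demand (0.7 = 70% off)
    interruption_rate: share of jobs interrupted once and re-run
    wasted_fraction:   share of a job's compute lost when it is interrupted
    """
    rerun_overhead = interruption_rate * wasted_fraction
    return (1 - discount) * (1 + rerun_overhead)


ratio = spot_cost_ratio(discount=0.70, interruption_rate=0.05, wasted_fraction=0.5)
print(f"spot costs {ratio:.1%} of on-demand, a {1 - ratio:.1%} saving")
```

With these inputs the re-run tax adds only 2.5% to the spot bill, leaving roughly a 69% saving; interruptions would have to be very frequent and very late before spot stopped paying off.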
An elastic fleet grows with queue depth and drains to a minimal warm pool — or to zero — when demand falls, so paid capacity tracks real work.
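That behavior comes down to a target-size calculation run on every scaling tick. The sketch below is a hypothetical policy, not the GitLab Runner Autoscaler's internal algorithm: it sizes the fleet from pending and running jobs, adds a configurable warm pool, and clamps to a maximum.

```python
import math


def desired_instances(pending, running, jobs_per_instance, warm_pool=0, max_instances=20):
    """Instances needed to run every pending and running job, plus a warm pool."""
    busy = math.ceil((pending + running) / jobs_per_instance)
    return min(busy + warm_pool, max_instances)


print(desired_instances(pending=0, running=0, jobs_per_instance=4))                # overnight
print(desired_instances(pending=9, running=12, jobs_per_instance=4, warm_pool=2))  # burst
print(desired_instances(pending=120, running=40, jobs_per_instance=4))             # capped
```

The overnight case drains to zero, the burst grows to cover every queued job plus a small cushion, and the cap keeps a runaway queue from turning into a runaway bill.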
You cannot tune what you do not measure
Every optimization above depends on visibility. The metric that ties the whole system together is the relationship between infrastructure saturation and pipeline queue time. Runner and infrastructure dashboards that combine host metrics (CPU, memory, and disk from a Node Exporter agent) with pipeline queue-time data let you see, at a glance, whether wait times are caused by saturated runners — meaning you should add capacity — or by something else entirely, like a slow dependency or a stuck stage.
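That reading can be written down as a rule. The thresholds below (a 60-second p99 queue time, 90% peak utilization for "saturated", 50% for "roomy") are hypothetical starting points to tune against your own baselines, and the function name is illustrative.

```python
def diagnose(p99_queue_s, peak_utilization, slow_queue_s=60, saturated=0.9, roomy=0.5):
    """Map queue time and runner saturation to a next action."""
    if p99_queue_s > slow_queue_s and peak_utilization >= saturated:
        return "add capacity: jobs wait because every slot is busy"
    if p99_queue_s > slow_queue_s:
        return "investigate: jobs wait while slots are free, so the cause is elsewhere"
    if peak_utilization < roomy:
        return "shrink: short queues and plenty of idle slots even at peak"
    return "healthy: leave the fleet as it is"


print(diagnose(p99_queue_s=300, peak_utilization=1.0))
print(diagnose(p99_queue_s=300, peak_utilization=0.4))
```

The second case is the one that guessing gets wrong: long waits with idle runners mean more capacity would only add cost.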
A handful of signals are worth instrumenting before you touch any scaling parameter:
| Signal | What it tells you |
|---|---|
| Job queue time (p50 / p99) | Are jobs waiting for an executor? A rising p99 means the fleet is under-provisioned at peak |
| Runner utilization (running ÷ concurrent) | Headroom indicator: low at peak means you can shrink the fleet |
| Idle instance-hours | Capacity paid for but unused: the off-hours scheduling target |
| Cache hit rate | Low rates inflate job duration and redundant compute |
| Failed-job compute share | Spend on work that produced nothing: the redundancy tax |
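Most of these signals fall out of per-job records you may already export. The sketch assumes a hypothetical record shape (queue seconds, run seconds, status, cache hit) rather than any particular CI system's API, and uses the nearest-rank method for percentiles.

```python
import math
from dataclasses import dataclass


@dataclass
class JobRecord:  # hypothetical shape; map your CI system's job data onto it
    queued_s: float
    duration_s: float
    status: str  # "success" or "failed"
    cache_hit: bool


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def fleet_signals(jobs):
    total_run = sum(j.duration_s for j in jobs)
    failed_run = sum(j.duration_s for j in jobs if j.status == "failed")
    queued = [j.queued_s for j in jobs]
    return {
        "queue_p50_s": percentile(queued, 50),
        "queue_p99_s": percentile(queued, 99),
        "cache_hit_rate": sum(j.cache_hit for j in jobs) / len(jobs),
        "failed_compute_share": failed_run / total_run,
    }


jobs = [
    JobRecord(5, 300, "success", True),
    JobRecord(8, 280, "success", True),
    JobRecord(12, 310, "success", False),
    JobRecord(95, 600, "failed", False),
]
print(fleet_signals(jobs))
```

Even in this tiny sample, one slow, cache-missing, failed job accounts for about 40% of the compute, which is the redundancy tax the table describes.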
This is where capacity efficiency stops being a one-time project and becomes an operating practice. Saturation and queue-time trends drive scaling decisions; cache and failure metrics drive pipeline hygiene. Wiring those signals into the same observability stack you already use for production — the same discipline behind continuous infrastructure and network monitoring — turns guesswork into a feedback loop.
What "good" looks like: a reference result
The payoff is not theoretical. In one documented migration, an engineering team moved off managed runners to eight self-hosted Graviton3 (Arm) instances on AWS, enabled GitLab's distributed caching, restructured pipelines with the needs keyword to parallelize the bulk of their jobs into a directed acyclic graph, and autoscaled runners against live queue depth through the GitLab API. The combined effect was not a single lever but the stacked discipline this article describes [johal.in].
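The needs keyword pays off because a job can start as soon as the jobs it depends on finish, instead of waiting for the slowest job in the previous stage. The sketch compares the two models on a small hypothetical pipeline; job names and durations are invented, and it assumes unlimited runner slots so that only dependencies constrain start times.

```python
from functools import lru_cache

# Hypothetical pipeline: job -> (stage, duration in minutes, jobs it needs)
PIPELINE = {
    "build-api": (0, 3, []),
    "build-web": (0, 6, []),
    "test-api": (1, 4, ["build-api"]),
    "test-web": (1, 2, ["build-web"]),
    "deploy-api": (2, 1, ["test-api"]),
}


def stage_based_minutes(pipeline):
    """Classic stages: each stage waits for the slowest job of the stage before it."""
    slowest = {}
    for stage, duration, _ in pipeline.values():
        slowest[stage] = max(slowest.get(stage, 0), duration)
    return sum(slowest.values())


def dag_minutes(pipeline):
    """With needs: the pipeline takes as long as its longest dependency chain."""

    @lru_cache(maxsize=None)
    def finish(job):
        _, duration, needs = pipeline[job]
        return duration + max((finish(dep) for dep in needs), default=0)

    return max(finish(job) for job in pipeline)


print(stage_based_minutes(PIPELINE), dag_minutes(PIPELINE))
```

Here the staged pipeline takes 11 minutes and the DAG version 8, because the API branch no longer waits on the slower web build.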
Before
- Managed runners billed at a premium per minute
- p99 pipeline latency near 12 minutes
- Low cache hit rate, jobs re-doing work
- Peak-hour queue waits stacking up
After
- Monthly CI/CD spend cut 37% (to ~$8,940)
- p99 latency down to 8m12s — about 30% faster
- Cache hit rate up to 92%
- Peak wait times down to roughly 2 minutes
The detail that matters most: the fleet got cheaper and faster at the same time. That is the proof that the rising-cost-and-rising-wait spiral is not an iron law of CI/CD. It is a symptom of a fleet tuned for neither dimension — and it reverses when capacity is made elastic, right-sized, scheduled, and measured.
"The goal is not the biggest fleet or the cheapest fleet. It is the smallest fleet that never makes a developer wait."
— Infrastructure Operations, ITECS
Where a managed partner fits
None of these levers is exotic, but pulling them in the right order — and keeping them tuned as the team and the codebase grow — is ongoing operational work. It means owning hardened Linux hosts, running an autoscaler against live metrics, managing spot interruption gracefully, keeping caches warm, and watching the saturation-versus-queue-time relationship week over week. For organizations whose engineers would rather ship product than babysit build infrastructure, that work is a natural fit for a managed services model.
ITECS designs, hosts, and operates Linux infrastructure for exactly this kind of workload — from the underlying managed cloud hosting that runner instances live on, to the day-to-day operations that keep a fleet efficient. If your CI/CD bill is climbing while your builds slow down, that gap is recoverable, and it usually starts with measuring where the idle and the waiting actually are.
Stop paying for idle build capacity
Get an assessment of your CI/CD and Linux infrastructure — where the waste hides, what an elastic fleet would cost, and how much faster your pipelines could run.
Sources
- GitLab Docs — GitLab Runner Autoscaling
- GitLab Docs — Docker Machine Executor Autoscale Configuration (deprecation notice)
- GitLab Docs — Kubernetes Executor
- GitLab Docs — Plan and Operate a Fleet of Runners
- nOps — Cost-Optimizing CI/CD Pipelines with Spot-Integrated ASGs
- AWS — GitLab Runners with Amazon EKS Auto Mode
- Reducing CI/CD Costs 35% with GitLab CI 16.10 and Self-Hosted Runners
- Security Boulevard — DevOps Best Practices That Cut Cloud Waste (2026)
