Why and How to Run Your
Own Gitlab Runner Fleet
Casey Zednick
Principal Software Engineer – DE-Tools
Agenda
1. Learning briefly about Gitlab.
2. Exploring the benefits of running your own Gitlab runners.
3. Understanding costs.
4. Designing an auto-scale runner fleet.
5. Creating and meeting service level objectives (SLOs).
6. Listing of resources.
What is Gitlab?
• Gitlab is a complete development platform that supports git repositories, issue tracking, code reviews, and CI/CD
• Comes in four main offerings
• Community Edition - Open-Source Software (MIT license)
BACKGROUND
What are Gitlab Runners?
• Run the actual jobs of Gitlab’s continuous integration / continuous delivery (CI/CD) pipelines.
• Two main parts:
• runners – talk to the Gitlab API and to executors.
• executors – do the actual CI/CD work.
• For example, runners pick up tasks, such as building, testing, and deploying the code, and give them to executors, which run the actual commands (a minimal pipeline sketch follows below).
BACKGROUND
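To make that concrete, here is a minimal sketch of the kind of pipeline definition a runner picks up jobs from (the image and commands are placeholders, not from the talk):

.gitlab-ci.yml
# Illustrative pipeline; the runner claims each job from the Gitlab API
# and hands it to an executor, which runs the script in the named image.
stages:
  - build
  - test

build-app:
  stage: build
  image: golang:1.19
  script:
    - go build ./...

unit-tests:
  stage: test
  image: golang:1.19
  script:
    - go test ./...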
Why Run Your Own Runners?
BENEFITS
Availability
• Gitlab’s SaaS runners run in GCP. If your resources are in AWS, Azure, etc., emergent cross-cloud issues impact availability
• Ability to monitor, diagnose, and mitigate emergent conditions without waiting on support-ticket response times
Security
• No chance of your pipeline variables leaking to other tenants due to runner bugs or misconfiguration
• Control of the runner and executor hosts to enable increased supply-chain security
Features
• Right-size VMs to job workloads instead of using Gitlab’s single shared type
• Run jobs on non-Intel VMs
Cost
• Run jobs for less than Gitlab’s $10 per 1,000 minutes ($0.60 per hour)
Understanding Costs of Running a Fleet
Understanding how much it costs to run your own runner fleet isn’t easy.
You might think you can look at how many hours of job time you need and multiply that by your VM costs.
In practice, it’s not that simple. Nevertheless, I’ll show you how to forecast your expected total costs so you can make an informed decision (a rough cost model is sketched after the lists below).
COSTS
Fixed costs:
• Personnel time (omitted in this model)
• Compute and storage for runners
• Compute and storage for central metrics/logs
Dynamic costs:
• Compute for executors (the VMs running the actual jobs)
• Storage for job artifacts
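As a rough sketch (my notation, not from the talk), the monthly forecast is the fixed costs plus per-VM-type compute hours, both busy and idle, plus artifact storage:

C_{\text{month}} \approx \underbrace{C_{\text{runners}} + C_{\text{metrics}}}_{\text{fixed}} + \sum_{t \in \text{VM types}} \left( H^{\text{job}}_{t} + H^{\text{idle}}_{t} \right) r_{t} + G_{\text{artifacts}} \, r_{\text{storage}}

where H is hours per month, r_t is the hourly rate for VM type t, and G is artifact gigabyte-months. The idle-hours term is what makes the naive “job hours × VM rate” estimate come in too low.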
Forecasting Fleet Costs
COSTS
Optimizing Fleet Compute Costs
COSTS
config.toml
[[runners.machine.autoscaling]]
  Periods = ["* * * * * * *"]  # Default: no idle runners
  IdleCount = 0
  IdleTime = 600
  Timezone = "America/Los_Angeles"

[[runners.machine.autoscaling]]
  Periods = ["* * 8-17 * * mon-fri *"]  # Pacific business hours
  IdleCount = 40
  IdleCountMin = 5
  IdleScaleFactor = 1.5  # Current number of idle machines will be 1.5 * in-use machines
  IdleTime = 1200
  Timezone = "America/Los_Angeles"

[[runners.machine.autoscaling]]
  Periods = ["* * 10-15 * * mon-fri *"]  # Pacific peak development
  IdleCount = 60
  IdleCountMin = 10
  IdleScaleFactor = 1.5  # Current number of idle machines will be 1.5 * in-use machines
  IdleTime = 1200
  Timezone = "America/Los_Angeles"
• Scale idle count by time of day
• Right-size runners: our least expensive executor is 8 times cheaper than our most expensive
• For the most expensive runners, keep the idle count at zero, as idle costs add up quickly (a tag-routing sketch follows below)
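Getting jobs onto right-sized executors is usually done with runner tags; a minimal sketch, where the tag names small and large are hypothetical:

.gitlab-ci.yml
# Illustrative tag-based routing; tag names are placeholders for
# whatever your runners are registered with.
lint:
  tags: [small]        # cheap VM class for light jobs
  script:
    - make lint

integration-tests:
  tags: [large]        # expensive VM class, kept at zero idle
  script:
    - make integration-tests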
Understanding the Key Parts of Fleet Design
• Gitlab Instance (On-Prem / SaaS)
• Hosts source code and job artifacts
• Provides the API from which runners get pipeline jobs
• Runners
• Retrieve jobs from the Gitlab API and give them to executors
• Control how many executors are available for jobs
• Executors
• Run the actual workloads – think go build in a Docker container
DESIGN
Hosting the Runner (Control)
DESIGN
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-autoscale-uswest3-small-runner
  namespace: nginx-autoscale-uswest3-small-runner
  labels:
    app: nginx-autoscale-uswest3-small-runner
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: nginx-autoscale-uswest3-small-runner
  template:
    metadata:
      labels:
        app: nginx-autoscale-uswest3-small-runner
    spec:
      containers:
        - name: nginx-autoscale-uswest3-small-runner
          image: gitlab/gitlab-runner:v14.5.2
• Long-running daemon process
• Doesn’t have to share the same network or cloud as the docker+machine executors
• One runner can control many executors
• We use one runner per executor type, each deployed in its own k8s namespace, to allow more control over config updates (one possible config-mount sketch follows below)
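The runner reads its config.toml from /etc/gitlab-runner inside the container; one common way to supply it (my assumption, not shown on the slide) is a ConfigMap volume merged into the Deployment above:

deployment.yaml (hypothetical continuation)
          # Mount the runner's config.toml from a ConfigMap so a config
          # change is just a ConfigMap update plus a rollout.
          volumeMounts:
            - name: runner-config
              mountPath: /etc/gitlab-runner
      volumes:
        - name: runner-config
          configMap:
            name: nginx-autoscale-uswest3-small-runner-config  # hypothetical ConfigMap name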
Understanding Autoscaling Executors (Compute)
• Kubernetes
• Quick job starts
• Scales using k8s horizontal pod autoscaling
• Problematic for Docker-in-Docker (DIND) use
• docker+machine (what I’m covering)
• Quick-ish job starts with a tuned idle pool
• Orchestrates cloud VMs
• High isolation and support for DIND (a minimal DIND job sketch follows the lifecycle list below)
DESIGN
docker+machine VM lifecycle
1. VM provisioned via cloud APIs
2. VM OS updated and configured via SSH
3. VM’s Docker pulls the job’s Docker container
4. Setup scripts run in the container
5. Source code checked out
6. Pipeline job script code runs
7. Container exits
8. VM destroyed or returned to idle pool
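Because each docker+machine job gets its own VM, Docker-in-Docker works without worrying about noisy or nosy neighbors; a minimal sketch of a DIND job, assuming the runner allows privileged containers (image tags are illustrative):

.gitlab-ci.yml
# Illustrative Docker-in-Docker job; needs a privileged executor.
build-image:
  image: docker:20.10
  services:
    - docker:20.10-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  script:
    - docker build -t my-app:latest .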
Hey, my pipeline job…
Ensuring your fleet meets your needs by using site reliability engineering (SRE)
Establishing Service Level Objectives (SLOs)
SERVICE LEVEL OBJECTIVES (SLOs)
Volume
• How many jobs must you support per hour? 50, 100s, 1,000s?
• Our volume SLO is 45% of our compute capacity, 99% of the time
Availability
• How long can a job wait before you consider the system down?
• Our availability SLO: 99% of jobs start in under 300 seconds
Latency
• How quickly do your developers need their jobs serviced?
• Our latency SLO: 95% of our jobs start within 20 seconds
Errors
• How many errors are normal?
• Our error SLO: fewer than 12 failed jobs per minute. That might sound high, but in our experience everything is good until it isn’t
Gathering Service Level Indicators (SLIs)
• Gitlab’s API
• Job queued_duration
• Job status: success, failed, etc.
• Gitlab Runner logs
• Cloud API errors
• General orchestration info
• Gitlab Runner’s Prometheus endpoint
• http://localhost:9252/metrics
• Good scaling information (a sample scrape config is sketched below)
SERVICE LEVEL OBJECTIVES (SLOs)
Note: many low-level metrics, like docker pull times or the time spent in the various executor stages, aren’t exposed in an easy-to-use fashion. :/
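A minimal sketch of scraping that endpoint with Prometheus (the job name and target host are placeholders):

prometheus.yml
# Illustrative scrape config for gitlab-runner's built-in metrics endpoint.
scrape_configs:
  - job_name: gitlab-runners
    metrics_path: /metrics
    static_configs:
      - targets:
          - runner-host:9252   # placeholder; the runner listens on :9252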
Visualizing Service Level Indicators (SLIs)
SERVICE LEVEL OBJECTIVES (SLOs)
Resources
• Gitlab Runner Overview
• Gitlab Fleet Scaling
• Excel runner cost model
RESOURCES