Configuration
Manually deploying a BuildGrid
To get anything done, you first need to have a PostgreSQL database available
with the migrations from data/revisions/all.sql applied.
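The exact commands depend on your PostgreSQL setup, but creating a database and applying the migrations might look something like this (the bgd database name and a local PostgreSQL server are assumptions, not requirements):

createdb bgd
psql -d bgd -f data/revisions/all.sql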
Configuration File
To get started, use buildgrid/data/config/all-in-one.yml as an example
configuration: copy its contents into a file called config.yml, and edit
the connection-string option to point at your database.
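For example, the relevant section of config.yml might end up looking like this (the credentials and host shown here are placeholders for your own database details):

connections:
  - !sql-connection &sql
    connection-string: postgresql://bgd:insecure@localhost:5432/bgd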
To start BuildGrid with this configuration, run:
bgd server start --verbose /path/to/config.yml
See the Reference configuration section below to learn more about this file. For now, we will continue setting up BuildGrid to do some work.
Setting up a bot
Now we will need a worker. The recommended worker for use with BuildGrid is buildbox-worker, which performs best alongside a local CAS cache called buildbox-casd. First, build these tools by following the instructions in their READMEs.
Then, start the CAS cache.
buildbox-casd --cas-remote=http://localhost:50051 --bind=127.0.0.1:50011 ~/casd &
Once CASD is running, we can start the worker itself, pointing it at CASD for CAS requests.
buildbox-worker --buildbox-run=buildbox-run-hosttools --bots-remote=http://localhost:50051 \
--cas-remote=http://127.0.0.1:50011 --request-timeout=30 my_bot
We should see this worker connect, as log messages for CreateBotSession and
UpdateBotSession requests appear in the server logs.
Without CASD
Warning
While this approach has fewer moving parts, it will make your builds slower, since
the input root must be freshly fetched for every Action rather than kept in a local
cache. With large input roots, this can completely wipe out any benefits gained by
using remote execution. Production deployments should use buildbox-casd.
buildbox-worker supports running without buildbox-casd by pointing it at the remote CAS
rather than a local CASD, although this isn't recommended due to the additional network
load it causes. When running in this configuration, it's important to tell the runner
command not to use the LocalCAS protocol to stage the input root.
buildbox-worker --buildbox-run=buildbox-run-hosttools --bots-remote=http://localhost:50051 \
--cas-remote=http://localhost:50051 --request-timeout=30 --runner-arg=--disable-localcas my_bot
Reference configuration
Below is an example of the full configuration reference:
# Server's configuration description.
description: |
  BuildGrid's server reference configuration.

# Server's network configuration.
server:
  - !channel
    # Address and TCP port to listen on.
    address: "[::]:50051"
    # Whether or not to activate SSL/TLS encryption.
    insecure-mode: true
    # SSL/TLS credentials.
    credentials:
      tls-server-key: !expand-path ~/.config/buildgrid/server.key
      tls-server-cert: !expand-path ~/.config/buildgrid/server.cert
      tls-client-certs: !expand-path ~/.config/buildgrid/client.cert

# gRPC tunables to pass to the gRPC server.
# See https://grpc.github.io/grpc/core/group__grpc__arg__keys.html for the full list.
grpc-server-options:
  grpc.so_reuseport: 0
  grpc.max_connection_age_ms: 300000
# Server's authorization configuration.
authorization:
  # Type of authorization method.
  #  none - Bypass the authorization mechanism
  #  jwt - OAuth 2.0 bearer with JWT tokens
  method: jwt
  # Location of the file containing the secret, password,
  # or key needed by 'method' to authorize requests.
  secret: !expand-path ~/.config/buildgrid/auth.secret
  # The URL to fetch the JWKs from.
  # Either 'secret' or this field must be specified. Defaults to ``None``.
  jwks-url: https://test.dev/.well-known/jwks.json
  # Audience used to validate the JWT.
  # This field must be specified if 'jwks-url' is specified.
  # This field is case sensitive!
  audience: BuildGrid
  # The amount of time between fetches of the JWKs.
  # Defaults to 60 minutes.
  jwks-fetch-minutes: 30
  # Encryption algorithm to be used together with 'secret'
  # by 'method' to authorize requests (optional).
  #  hs256 - HMAC+SHA-256 for JWT method
  #  rs256 - RSASSA-PKCS1-v1_5+SHA-256 for JWT method
  algorithm: rs256
# List of connections to use for items like SQL and Redis.
connections:
  - !sql-connection &sql
    # URI for connecting to a PostgreSQL database:
    connection-string: postgresql://bgd:insecure@database/bgd
    # SQLAlchemy pool options.
    pool-size: 5
    pool-timeout: 30
    pool-recycle: 3600
    max-overflow: 10
# List of storage backends for the instance.
storages:
  - !disk-storage &main-storage
    # Path to the local storage folder.
    path: !expand-path $HOME/cas

# List of action cache stores.
caches:
  - !lru-action-cache &main-action
    # Alias to a storage backend, see 'storages'.
    storage: *main-storage
    # Maximum number of entries kept in cache.
    max-cached-refs: 256
    # Whether writing to the cache is allowed.
    allow-updates: true
    # Whether failed actions (non-zero exit code) are stored.
    cache-failed-actions: true
# List of schedulers to use in Execution and Bots services.
schedulers:
  - !sql-scheduler &state-database
    sql: *sql
    action-cache: *main-action
    storage: *main-storage
    property-set: !dynamic-property-set
      # Non-standard keys which BuildGrid will allow jobs to set and use in the
      # scheduling algorithm when matching a job to an appropriate worker.
      #
      # Jobs with keys which aren't defined in either this list or
      # `wildcard-property-keys` will be rejected.
      match-property-keys:
        # BuildGrid will match workers and jobs on foo, if set by the job.
        - foo
        # Multiple keys can be specified.
        - bar
      # Non-standard keys which BuildGrid will allow jobs to set. These keys
      # won't be considered when matching jobs to workers.
      #
      # Jobs with keys which aren't defined in either this list or
      # `match-property-keys` will be rejected.
      wildcard-property-keys:
        # BuildGrid won't use the `chrootRootDigest` property to match jobs to
        # workers, but workers will still be able to use the value of the key
        # to determine what environment the job needs.
        - chrootRootDigest
    # A static property set can be used instead of a dynamic property set
    # using !static-property-set.
    # Static property sets require the values of keys to be pre-defined.
    # This decreases the scheduling cost to linear in comparison to the
    # dynamic set, but requires the definition of all valid property sets.
    # property-labels: defines the set of property combinations allowed by
    # the scheduler.
    #   - { label: linuxGreen, properties: [[platform, linux], [colour, green]] }
    #   - { label: linuxBlue, properties: [[platform, linux], [colour, blue]] }
    cohort-set: !cohort-set
      # A cohort is a named group of workers that share a set of property
      # labels. While a worker can have more than one property label, it can
      # only belong to one cohort. If all property-labels of a worker match
      # the property-labels of a cohort, it will be assigned to that cohort.
      cohorts:
        - name: default
          property-labels: ["unknown", "linux"]
        - name: linux-large
          property-labels: ["linux-large"]
    # Base URL for an external build action (web) browser service.
    action-browser-url: http://localhost:8080
    # BotSession keepalive timeout: the maximum time (in seconds)
    # to wait to hear back from a bot before assuming it is unhealthy.
    bot-session-keepalive-timeout: 120
    # Max execution timeout: the maximum amount of time (in seconds) a job
    # can remain in the executing state. If it exceeds the maximum execution
    # timeout, it will be marked as cancelled.
    # (Default: 7200)
    max-execution-timeout: 7200
    # Maximum number of locality hints to be associated with each bot.
    bot-locality-hint-limit: 10
    assigners:
      - !priority-age-assigner
        # Number of assigner threads to run.
        count: 5
        # Interval (in seconds) between each assignment attempt.
        interval: 1.0
        # Percentage of jobs that will be assigned by priority.
        priority-assignment-percentage: 95
        # Bot assignment strategy to use for finding a bot to assign the job to.
        bot-assignment-strategy: !assign-by-locality
          sampling: !sampling-config
            # Sample size: the number of bots to sample when assigning a job.
            sample-size: 5
            # Max attempts: the maximum number of times to attempt to sample bots.
            max-attempts: 3
          fallback: !assign-by-capacity
      - !cohort-assigner
        # Number of assigner threads to run.
        count: 3
        # Cohort set to use for this assigner.
        cohort-set: ["default", "linux-large"]
        # Number of seconds to wait before a job is eligible for preemptive
        # assignment.
        preemption-delay: 20.0
        # Bot assignment strategy to use for finding a bot to assign the job to.
        bot-assignment-strategy: !assign-by-locality
          sampling: !sampling-config
            # Sample size: the number of bots to sample when assigning a job.
            sample-size: 5
            # Max attempts: the maximum number of times to attempt to sample bots.
            max-attempts: 3
          fallback: !assign-by-capacity
# Server's instances configuration.
instances:
  - name: main
    description: |
      The 'main' server instance.
    # List of services for the instance.
    #  action-cache - REAPI ActionCache service.
    #  bytestream - Google APIs ByteStream service.
    #  cas - REAPI ContentAddressableStorage service.
    #  execution - REAPI Execution + RWAPI Bots services.
    services:
      - !action-cache
        cache: *main-action
      - !execution
        scheduler: *state-database
        # Operation stream keepalive timeout: the maximum time (in seconds)
        # to wait before sending the current status in an Operation response
        # stream of an `Execute` or `WaitExecution` request.
        operation-stream-keepalive-timeout: 120
        # Max ListOperations page size: the maximum number of results that can
        # be returned from a ListOperations request. BuildGrid will provide a
        # page_token with the response that the client can specify to get the
        # next page of results.
        # (Default: 1000)
        max-list-operations-page-size: 1000
      - !cas
        # Alias to a storage backend, see 'storages'.
        storage: *main-storage
      - !bytestream
        # Alias to a storage backend, see 'storages'.
        storage: *main-storage

# List of services that are not tied to a specific instance.
services:
  - !quota-service
    scheduler: *state-database
# Server's internal monitoring configuration.
monitoring:
  # Whether or not to activate the monitoring subsystem.
  enabled: false
  # Type of the monitoring bus endpoint.
  #  stdout - Standard output stream.
  #  file - On-disk file.
  #  socket - UNIX domain socket.
  #  udp - Port listening for UDP packets.
  endpoint-type: socket
  # Location of the monitoring bus endpoint. Only
  # necessary for the 'file', 'socket', and 'udp' endpoint types.
  # A full path is expected for 'file', a name
  # only for 'socket', and `hostname:port` for 'udp'.
  endpoint-location: monitoring_bus_socket
  # Message serialisation format.
  #  binary - Protobuf binary format.
  #  json - JSON format.
  #  statsd - StatsD format. Only metrics are kept - logs are dropped.
  serialization-format: binary
  # Prefix to prepend to the metric name before writing
  # to the configured endpoint.
  metric-prefix: buildgrid

# Maximum number of gRPC threads. Defaults to 5 times
# the CPU count if not specified. A minimum of 5 is
# enforced, whatever the configuration is.
thread-pool-size: 30

# Limit on concurrent requests. Set this lower than thread-pool-size so that
# excess requests receive error responses rather than queueing.
limiter: !limiter
  concurrent-request-limit: 25
See the Parser API reference for details on the tagged YAML nodes in this configuration.
Deployment Architecture
BuildGrid is designed for flexibility in deployment topology. It can be configured with any combination of the supported services in a single server configuration.
Due to BuildGrid’s use of a thread pool for handling gRPC requests, along with the Python GIL, it is sensible to split up services into several processes to scale concurrent connection counts. With the exception of the Build Events related services, each service is horizontally scalable to support running multiple processes for the same service across multiple machines.
The recommended split is as follows:
Action Cache, ByteStream, and CAS
Execution, Operations, and Introspection
BotsInterface
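With one configuration file per group, such a split deployment amounts to running three separate `bgd server` processes (the file names here are illustrative, and in practice each process would typically run on its own machine or container):

bgd server start cas.yml &
bgd server start execution.yml &
bgd server start bots.yml &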
Action Cache, ByteStream, and CAS
[Diagram: a single `bgd server` process exposing the ByteStream, CAS, and
Action Cache gRPC services. The CAS and ByteStream services read and write
through a ShardedStorage over a RedisIndex over a SizeDifferentiatedStorage,
which routes blobs to either an SQLStorage (backed by PostgreSQL) or an
S3Storage (backed by S3). The Action Cache service uses a ShardedActionCache
backed by a RedisActionCache (backed by Redis).]
This configuration specifies all the services needed for cache-only usage.
The exact choice of storage backends to use is dependent on your expected
workloads and availability of other services. Using an index somewhere in
the stack is strongly recommended to support handling FindMissingBlobs
without querying the actual storage. The Redis index used in this example is
more performant than the SQL index, but likely requires sharding to scale to
production workloads.
As build workflows often involve many small blobs and a small number of much
larger blobs, it can be beneficial to store smaller blobs in a faster storage
location. In this example we use SizeDifferentiatedStorage to direct small
blobs to an SQLStorage, whilst large blobs are stored in a slower but
significantly larger S3Storage.
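A storage stack along those lines might be sketched roughly as follows. The tag names and option keys below are illustrative rather than authoritative, so check the Parser API reference for the exact schema:

storages:
  # Fast storage for the many small blobs (options illustrative).
  - !sql-storage &small-blobs
    sql: *sql
  # Slower but much larger storage for big blobs (options illustrative).
  - !s3-storage &large-blobs
    bucket: cas-blobs
  # Route blobs by size; the key names here are hypothetical.
  - !size-differentiated-storage &main-storage
    size-threshold: 1M
    small: *small-blobs
    large: *large-blobs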
It may be beneficial to add a cache layer using WithCacheStorage for
particularly slow storage backends, although this backend doesn’t support
any special routing to reduce duplication of storage.
The Action Cache backends are separate from the storage backends, though they
can reference them. Again, the choice here depends on service availability and
desired cache behaviour. An S3ActionCache will be slow but large and
resilient, whereas an LRUActionCache will be fast but small and
short-lived. The RedisActionCache used here provides a middle-ground and
is generally the best option for most workloads.
Execution, Operations, and Introspection
[Diagram: a single `bgd server` process exposing the Execution, Operations,
CAS, and Introspection gRPC services. The CAS service proxies to a Remote
storage backend. The Execution, Operations, and Introspection services share a
Scheduler, which uses a RemoteActionCache and the Remote storage backend (both
pointed at a BuildGrid CAS deployment) along with three SQL connections to
PostgreSQL: read/write, read-only, and notifiers.]
This configuration specifies the services needed to support the client-side parts of remote execution.
The Execution service uses its configured Scheduler to queue incoming jobs
in the database for assignment. When using PostgreSQL as in this example, the
Scheduler will use LISTEN/NOTIFY to listen for updates to job state, which
will be reported back to clients by the Execution service.
The Operations service and Introspection service are mainly used to query the current internal state of BuildGrid. The Operations service is also used to request cancellation of a previously queued job.
All three of these services use the Scheduler, which in turn uses up to
three different SQL connection configurations. This allows for example sending
read-only traffic to a read-only database replica, or using an external
connection pool such as PGBouncer for regular queries whilst maintaining an
in-process pool for the long running connections used for LISTEN/NOTIFY.
The Scheduler also needs access to a cache backend and a storage backend.
The RemoteActionCache and RemoteStorage backends exist to support
splitting the configuration like this, and in this example should be pointed to
a BuildGrid running the Action Cache, ByteStream, and CAS configuration above.
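In configuration terms, the scheduler in such a process might point at remote backends roughly like this. The service URL and the exact tag and key names are illustrative; consult the Parser API reference for the real schema:

connections:
  - !sql-connection &sql
    connection-string: postgresql://bgd:insecure@database/bgd

storages:
  # Proxy CAS requests to the separate CAS deployment (URL illustrative).
  - !remote-storage &cas-backend
    url: http://cas-service:50051

caches:
  # Proxy Action Cache requests to the same deployment.
  - !remote-action-cache &cache-backend
    url: http://cas-service:50051

schedulers:
  - !sql-scheduler &state-database
    sql: *sql
    storage: *cas-backend
    action-cache: *cache-backend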
BotsInterface
[Diagram: a single `bgd server` process exposing only the Bots gRPC service.
The Bots service uses a Scheduler which, as in the Execution configuration,
uses a RemoteActionCache and a Remote storage backend pointed at a BuildGrid
CAS deployment, plus read/write, read-only, and notifier SQL connections to
PostgreSQL.]
This configuration is just for a BotsInterface, the server side of the RWAPI.
It is very similar to the Execution configuration, using a Scheduler which
has all the same configuration options as before. This Scheduler also has
configuration for an assigner thread, which periodically fetches the next job
in the queue and attempts to assign it to an available bot. This thread could
be in the Execution process instead, with the same functionality.
A Scheduler used just for the RWAPI side like this also uses LISTEN/NOTIFY
when given a PostgreSQL database. In this case it is used to listen for
assignment of work to a connected bot. Bots can long-poll when sending
UpdateBotSession and CreateBotSession requests, returning immediately
when work is assigned.