Remote execution overview
Remote execution aims to speed up build processes by relying on two separate but related concepts: remote caching and remote execution. Remote caching allows users to share build outputs, while remote execution allows operations to run on a remote cluster of machines that may be more powerful than, or configured differently from, what the user has access to locally.
The Remote Execution API (REAPI) describes a gRPC + protocol-buffers interface that has three main services for remote caching and execution:
A ContentAddressableStorage (CAS) service: a remote storage end-point where content is addressed by digests, a digest being a pair of the hash and size of the data stored or retrieved.
An ActionCache (AC) service: a mapping between build actions already performed and their corresponding resulting artifacts (usually lives with the CAS service).
An Execution service: the main end-point, allowing one to request that build jobs be performed against the build farm.
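To make the digest and cache concepts concrete, here is a minimal in-memory sketch. It is not BuildGrid's implementation: the class names InMemoryCAS and InMemoryActionCache are hypothetical, and SHA-256 is assumed as the hash function.

```python
import hashlib
from typing import Dict, NamedTuple, Optional


class Digest(NamedTuple):
    """A (hash, size) pair identifying a blob."""
    hash: str
    size_bytes: int


def digest_of(data: bytes) -> Digest:
    # SHA-256 is an assumption of this sketch; the hash function is configurable.
    return Digest(hashlib.sha256(data).hexdigest(), len(data))


class InMemoryCAS:
    """Hypothetical stand-in for a CAS service: blobs addressed by digest."""

    def __init__(self) -> None:
        self._blobs: Dict[Digest, bytes] = {}

    def put(self, data: bytes) -> Digest:
        digest = digest_of(data)
        self._blobs[digest] = data
        return digest

    def get(self, digest: Digest) -> Optional[bytes]:
        return self._blobs.get(digest)


class InMemoryActionCache:
    """Hypothetical stand-in for an ActionCache: action digest -> result digest."""

    def __init__(self) -> None:
        self._results: Dict[Digest, Digest] = {}

    def update(self, action: Digest, result: Digest) -> None:
        self._results[action] = result

    def lookup(self, action: Digest) -> Optional[Digest]:
        return self._results.get(action)
```

A client would first look up the action digest in the ActionCache, and only request execution when the lookup returns nothing.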
The Remote Worker API (RWAPI) describes another gRPC + protocol-buffers interface that allows a central BotsService to manage a farm of pluggable workers.
BuildGrid combines these two interfaces to provide a complete remote caching and execution service. The high-level architecture can be represented like this:
BuildGrid can be split up into separate endpoints. It is possible to have a separate ActionCache and CAS from the Controller. The following diagram shows a typical setup.
The flow of BuildGrid requests
BuildGrid uses various threads to achieve different tasks. The following diagram is an overview of the interactions between components of BuildGrid in response to a gRPC request.
Light green boxes signify distinct threads; entities outside the green boxes are shared among all threads.
Concurrency Model
As shown in the diagram above, there are various approaches to concurrency used in a bgd server process, and the purpose of everything isn’t necessarily immediately clear. This section describes the concurrency model in more detail.
First, the two processes. The main bgd server process forks shortly after startup, with the parent being the actual gRPC server process, and the child being a separate process to handle publishing metrics.
Metrics Writer Process
This subprocess consumes LogRecord and MonitoringRecord messages from a multiprocessing.Queue. These messages are then processed and published to the configured monitoring output, e.g. a StatsD server.
Messages are sent to the queue in the main BuildGrid process by the methods in buildgrid.server.metrics_utils.
This metrics writer lives in a subprocess because the required rate of metrics throughput when BuildGrid is under load wasn’t achievable using coroutines or threads in the same process as the gRPC handler threads.
BuildGrid Process
This is the main bgd server process. It uses both asyncio and threading for concurrency, whilst it runs a gRPC server to handle incoming requests to the configured BuildGrid services.
Coroutines
There are potentially three long-running coroutines in a BuildGrid server. There are also some other short-lived coroutines that run, as implementation details of the Janus queue used for logging and for submitting metrics asynchronously to the Metrics Writer Process.
Log Writer Coroutine
This coroutine handles formatting and writing log messages to stdout.
In BuildGrid, the logging library gets configured to write logs into a Janus queue. A Janus queue is a simple wrapper around a queue.Queue with both synchronous and asynchronous get/put methods.
The Log Writer coroutine asynchronously reads messages from this Janus queue, and writes them to stdout. This approach is intended to ensure that writing log messages doesn’t become a bottleneck in a gRPC request handler at some point.
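The following sketch shows the same shape using only the standard library. It deliberately swaps Janus for a plain queue.Queue read through asyncio.to_thread, so it illustrates the hand-off between handler threads and a writer coroutine rather than BuildGrid's actual wiring. The logger name, format string, and None sentinel are assumptions.

```python
import asyncio
import logging
import logging.handlers
import queue
from typing import Callable, List

_STOP = None  # sentinel used to shut the writer down (an assumption)


async def log_writer(log_queue: queue.Queue, sink: Callable[[str], None]) -> None:
    """Format queued records and hand them to `sink` (stdout in a real server)."""
    formatter = logging.Formatter("%(levelname)s %(name)s: %(message)s")
    while True:
        # The blocking get() runs in a worker thread, so the event loop stays free.
        record = await asyncio.to_thread(log_queue.get)
        if record is _STOP:
            break
        sink(formatter.format(record))


def build_logger(log_queue: queue.Queue) -> logging.Logger:
    """Configure a logger whose only handler enqueues records."""
    logger = logging.getLogger("sketch.request_handler")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger


async def demo() -> List[str]:
    log_queue: queue.Queue = queue.Queue()
    logger = build_logger(log_queue)
    lines: List[str] = []
    writer = asyncio.create_task(log_writer(log_queue, lines.append))
    # Log from another thread, as a gRPC handler thread would.
    await asyncio.to_thread(logger.info, "handled request %s", "abc")
    log_queue.put(_STOP)
    await writer
    return lines
```

The handler thread only pays for an enqueue; formatting and output happen in the writer coroutine.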
State Monitoring Worker Coroutine
The State Monitoring Worker periodically inspects any Execution or Bots services and related Schedulers, and generates metrics about their current state. These metrics are then sent to the multiprocessing.Queue used by the Metrics Writer.
If the server doesn’t have any Execution or Bots services, and doesn’t have any Schedulers configured either, then this coroutine will currently still exist but won’t actually do anything.
BotSession Reaper Coroutine
Note
This is only present when the server contains a Bots service.
The BotSession Reaper is part of the Bots service, and keeps track of when specific workers were last seen by BuildGrid. If a worker fails to reconnect by the agreed deadline, then this coroutine handles killing the relevant BotSession and requeueing the job that was being executed by the missing worker (if any).
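A minimal sketch of the reaping logic, assuming a simple per-worker deadline table. The class BotSessionReaper below and its method names are hypothetical and do not mirror BuildGrid's internals; time values are passed in explicitly so the behaviour is easy to follow.

```python
import asyncio
from typing import Dict, List, Optional


class BotSessionReaper:
    """Hypothetical tracker of worker deadlines; not BuildGrid's class."""

    def __init__(self) -> None:
        self._deadlines: Dict[str, float] = {}        # bot name -> deadline
        self._leases: Dict[str, Optional[str]] = {}   # bot name -> job in progress
        self.requeued: List[str] = []                 # jobs handed back to the queue

    def seen(self, bot: str, now: float, timeout: float,
             job: Optional[str] = None) -> None:
        """Record contact from a worker and push its deadline back."""
        self._deadlines[bot] = now + timeout
        self._leases[bot] = job

    def reap_expired(self, now: float) -> List[str]:
        """Drop every session past its deadline and requeue its job, if any."""
        expired = [bot for bot, deadline in self._deadlines.items() if deadline < now]
        for bot in expired:
            del self._deadlines[bot]
            job = self._leases.pop(bot)
            if job is not None:
                self.requeued.append(job)
        return expired

    async def run(self, interval: float, stop: asyncio.Event) -> None:
        """Long-running coroutine: check deadlines until asked to stop."""
        loop = asyncio.get_running_loop()
        while not stop.is_set():
            self.reap_expired(loop.time())
            try:
                await asyncio.wait_for(stop.wait(), timeout=interval)
            except asyncio.TimeoutError:
                pass  # interval elapsed; check deadlines again
```

For example, a worker last seen at time 0 with a 10 second timeout and a leased job is reaped by any check after time 10, and its job lands in requeued.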
Threads
There are a significant number of threads in use in a bgd server process. The majority of these live inside the gRPC thread pool used to handle gRPC requests concurrently, but there are a few threads that we create for both background tasks and optimizing access to resources.
Main Thread
This is the main execution thread. It contains the asyncio event loop which runs the coroutines described above. It also starts the gRPC server, and handles stopping the gRPC server cleanly.
gRPC Thread
This is a daemon thread created by the gRPC server when Server.start is called. This thread handles the actual running of the gRPC server, such as receiving incoming requests and routing them to handler methods, and running any callbacks when connections are closed.
gRPC Executor Threads
These are the threads in a ThreadPoolExecutor used by the gRPC server to handle incoming requests. The number of threads is configurable using the thread-pool-size configuration key.
When the gRPC server gets a new request, it locates the correct handler method in the servicers that are registered with the server, and then runs that handler in a thread from this pool.
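The dispatch step can be imitated with the standard library alone. In the sketch below, MiniServer is a hypothetical stand-in for the gRPC server, and string requests stand in for protobuf messages. With grpcio itself, the pool is handed over as `grpc.server(futures.ThreadPoolExecutor(max_workers=...))`.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class MiniServer:
    """Hypothetical dispatcher imitating how requests reach handler threads."""

    def __init__(self, thread_pool_size: int) -> None:
        # One pool shared by all handlers, sized like thread-pool-size.
        self._pool = ThreadPoolExecutor(max_workers=thread_pool_size)
        self._handlers: Dict[str, Callable[[str], str]] = {}

    def register(self, method: str, handler: Callable[[str], str]) -> None:
        self._handlers[method] = handler

    def dispatch(self, method: str, request: str) -> Future:
        """Locate the handler for `method` and run it on a pool thread."""
        handler = self._handlers[method]
        return self._pool.submit(handler, request)

    def stop(self) -> None:
        self._pool.shutdown(wait=True)
```

Because every in-flight request occupies one pool thread, long-lived streaming handlers reduce the capacity left for other requests, which is why the pool size is worth tuning.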
Job Watcher Thread
Note
This is only present when the server contains a Scheduler.
This thread is part of the data store implementations used by the Scheduler.
BuildGrid needs to keep track of all the jobs it is currently serving update streams for (via either Execute or WaitExecution requests), and detect changes in the state of those jobs so that the update messages can be sent.
Rather than having each handler thread monitor the database (or in-memory job state) independently, the handlers register their job with the Job Watcher thread, and wait for notification of updates.
The Job Watcher thread watches for updates in the data store (e.g. by using PostgreSQL’s LISTEN if available in the SQL implementation). When it finds an update to a job that is being watched, it notifies the relevant handler threads of the change.
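The register-and-notify arrangement can be sketched as below. This is an assumption-laden illustration: JobWatcher here is a hypothetical class, a plain dictionary stands in for the data store, and polling replaces PostgreSQL’s LISTEN.

```python
import queue
import threading
from typing import Dict, List, Optional


class JobWatcher:
    """Hypothetical watcher: one thread polls job state on behalf of many handlers."""

    def __init__(self, store: Dict[str, str], poll_interval: float = 0.01) -> None:
        self._store = store  # job name -> state; stand-in for the real data store
        self._poll_interval = poll_interval
        self._lock = threading.Lock()
        self._last_seen: Dict[str, Optional[str]] = {}
        self._subscribers: Dict[str, List[queue.Queue]] = {}
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def register(self, job: str) -> queue.Queue:
        """Called by a handler thread; updates for `job` arrive on the returned queue."""
        updates: queue.Queue = queue.Queue()
        with self._lock:
            self._subscribers.setdefault(job, []).append(updates)
            self._last_seen.setdefault(job, self._store.get(job))
        return updates

    def _run(self) -> None:
        # Poll until stop() is called; wait() doubles as the sleep between polls.
        while not self._stop.wait(self._poll_interval):
            with self._lock:
                for job, subscribers in self._subscribers.items():
                    state = self._store.get(job)
                    if state != self._last_seen.get(job):
                        self._last_seen[job] = state
                        for updates in subscribers:
                            updates.put(state)
```

A handler thread calls register once and then blocks on the returned queue, so the data store is polled by one thread no matter how many update streams are open.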
Scheduler Pruner Thread
Note
This is only present when the server contains an SQL Scheduler with pruning enabled in the configuration.
This thread is part of the SQL Scheduler data store implementation.
If pruning is enabled, then this thread is started to manage pruning the jobs table in the database, which otherwise grows very large unless some external cleanup mechanism is in place.
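As a rough sketch of what a periodic pruner involves, the example below uses sqlite3 in place of the real SQL backend. The finished_at column, the cutoff policy, and the function names are assumptions made for illustration; BuildGrid's actual schema and pruning options may differ.

```python
import sqlite3
import threading
import time
from typing import Callable


def prune_jobs(conn: sqlite3.Connection, cutoff: float) -> int:
    """Delete finished jobs older than `cutoff`; return the number of rows removed."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "DELETE FROM jobs WHERE finished_at IS NOT NULL AND finished_at < ?",
            (cutoff,),
        )
    return cursor.rowcount


def pruner_loop(db_path: str, interval: float, max_age: float,
                stop: threading.Event,
                clock: Callable[[], float] = time.time) -> None:
    """Body of the pruner thread: prune every `interval` seconds until stopped."""
    conn = sqlite3.connect(db_path)  # the connection is created in the pruner thread
    try:
        while not stop.wait(interval):
            prune_jobs(conn, clock() - max_age)
    finally:
        conn.close()
```

The loop would be started with something like `threading.Thread(target=pruner_loop, args=(path, 3600.0, 86400.0, stop), daemon=True).start()`, where the interval and age values are placeholders. Jobs that are still running have no finished_at value and are never pruned.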
Deferred CAS Writer Threads
Note
These are only present when the server contains a !with-cache storage configured with deferred writes enabled.
These threads handle deferred writes of blobs to the cache’s fallback storage. They form another ThreadPoolExecutor, this time used by the implementation of the !with-cache storage backend. Deferring the fallback write allows a write to return early, after only the write to the cache layer, which is likely faster than the fallback.
The number of threads in this pool is configurable using the fallback-writer-threads configuration key in the !with-cache config dictionary.
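The deferred-write idea can be sketched with two dictionaries standing in for the cache and fallback storages. WithCacheStorage below is a hypothetical class written for this example, not the real backend; only the fallback_writer_threads parameter name echoes the configuration key above.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, Optional


class WithCacheStorage:
    """Hypothetical cache-plus-fallback storage with deferred fallback writes."""

    def __init__(self, cache: Dict[str, bytes], fallback: Dict[str, bytes],
                 fallback_writer_threads: int = 2) -> None:
        self._cache = cache
        self._fallback = fallback
        self._writers = ThreadPoolExecutor(max_workers=fallback_writer_threads)

    def write(self, key: str, blob: bytes) -> Future:
        """Write to the cache now; hand the fallback write to the pool."""
        self._cache[key] = blob  # fast path, completed before returning
        return self._writers.submit(self._fallback.__setitem__, key, blob)

    def read(self, key: str) -> Optional[bytes]:
        """Prefer the cache, falling back to the slower storage on a miss."""
        blob = self._cache.get(key)
        return blob if blob is not None else self._fallback.get(key)

    def close(self) -> None:
        # Wait for any outstanding fallback writes before shutting down.
        self._writers.shutdown(wait=True)
```

The trade-off is that a blob is briefly present only in the cache layer, so a crash before the deferred write completes can lose it from the fallback.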