Remote execution overview¶
Remote execution aims to speed up build processes by relying on two separate but related concepts: remote caching and remote execution. Remote caching allows users to share build outputs, while remote execution allows operations to be run on a remote cluster of machines which may be more powerful (or differently configured) than what the user has access to locally.
ContentAddressableStorage (CAS) service: a remote storage endpoint where content is addressed by digests, a digest being a pair of the hash and the size of the data stored or retrieved.
ActionCache (AC) service: a mapping between build actions already performed and their corresponding resulting artifacts (usually lives with the CAS service).
Execution service: the main endpoint allowing one to request that a build job be performed against the build farm.
BuildGrid combines these interfaces in order to provide a complete remote caching and execution service. The high-level architecture can be represented like this:
BuildGrid can be split up into separate endpoints. For example, it is possible to run the CAS separately from the Execution service. The following diagram shows a typical setup.
The flow of BuildGrid requests¶
BuildGrid uses various threads to achieve different tasks. The following diagram is an overview of the interactions between components of BuildGrid in response to a gRPC request.
Light green is used to signify distinct threads; entities outside of the green boxes are shared among all threads.
As shown in the diagram above, there are various approaches to concurrency used in a bgd server process, and the purpose of each isn't necessarily immediately clear. This section describes the concurrency model in more detail.
First, the two processes: the main bgd server process forks shortly after startup, with the parent being the actual gRPC server process and the child being a separate process that handles publishing metrics.
Metrics Writer Process¶
This subprocess consumes MonitoringRecord messages from a multiprocessing.Queue. These messages are then processed and published to the configured monitoring output, e.g. a StatsD server.
Messages are sent to the queue from the main BuildGrid process.
This metrics writer lives in a subprocess because the required rate of metrics throughput when BuildGrid is under load wasn’t achievable using coroutines or threads in the same process as the gRPC handler threads.
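The producer/consumer split between the two processes can be sketched with the standard library. The record shape and the print-based publishing below are illustrative stand-ins for BuildGrid's MonitoringRecord handling, and the fork start method is forced explicitly (a Unix host is assumed) to mirror the fork described above:

```python
import multiprocessing

def metrics_writer(queue):
    # Consume MonitoringRecord-like messages until a sentinel arrives.
    # The real writer formats these and publishes to e.g. a StatsD
    # server; a print stands in for that here.
    while True:
        record = queue.get()
        if record is None:
            break
        print(f"metric {record['name']}={record['value']}")

# The parent stays the gRPC server; the child consumes metrics records.
ctx = multiprocessing.get_context("fork")  # assumes a Unix host
queue = ctx.Queue()
writer = ctx.Process(target=metrics_writer, args=(queue,))
writer.start()

# Main-process side: enqueueing is cheap, so request handlers never
# block on metrics I/O.
queue.put({"name": "requests_total", "value": 1})
queue.put(None)  # sentinel so the sketch terminates
writer.join()
```

Putting the slow publishing work on the far side of a queue is what keeps the metrics rate from limiting request throughput.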
This is the main bgd server process. It uses both asyncio and threading for concurrency, while running a gRPC server to handle incoming requests to the configured BuildGrid services.
There are potentially three long-running coroutines in a BuildGrid server. There are also some short-lived coroutines that run as implementation details of the Janus queue used for logging and for submitting metrics asynchronously to the Metrics Writer Process.
Log Writer Coroutine¶
This coroutine handles formatting and writing log messages to stdout.
In BuildGrid, the logging library is configured to write logs into a Janus queue. A Janus queue is a simple queue wrapper with both synchronous and asynchronous get/put methods.
The Log Writer coroutine asynchronously reads messages from this Janus queue and writes them to stdout. This approach is intended to ensure that writing log messages doesn't become a bottleneck in gRPC request handlers.
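The pattern can be approximated with only the standard library (the real implementation uses a Janus queue rather than run_in_executor): synchronous code logs into a plain queue via QueueHandler, and a coroutine drains it off the request-handling path. The sentinel and the `written` list are artifacts of the sketch:

```python
import asyncio
import logging
import logging.handlers
import queue

log_queue: queue.Queue = queue.Queue()
written = []

# Handlers put records on the queue instead of doing any I/O themselves.
root = logging.getLogger()
root.addHandler(logging.handlers.QueueHandler(log_queue))
root.setLevel(logging.INFO)

async def log_writer():
    # Drain records without blocking the event loop; formatting and the
    # final write happen here, not in the thread that logged the message.
    loop = asyncio.get_running_loop()
    while True:
        record = await loop.run_in_executor(None, log_queue.get)
        if record is None:  # sentinel so the sketch terminates
            break
        written.append(f"{record.levelname}: {record.getMessage()}")

async def main():
    writer = asyncio.create_task(log_writer())
    logging.info("request handled")  # e.g. from a gRPC handler thread
    log_queue.put(None)
    await writer

asyncio.run(main())
```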
State Monitoring Worker Coroutine¶
The State Monitoring Worker periodically inspects any Execution or Bots services and related Schedulers, and generates metrics about their current state. These metrics are then sent to the queue used by the Metrics Writer Process.
If the server doesn’t have any Execution or Bots services, and doesn’t have any Schedulers configured either, then this coroutine will currently still exist but won’t actually do anything.
BotSession Reaper Coroutine¶
This is only present when the server contains a Bots service.
The BotSession Reaper is part of the Bots service, and keeps track of when specific workers were last seen by BuildGrid. If a worker fails to reconnect by the agreed deadline, then this coroutine handles killing the relevant BotSession and requeueing the job that was being executed by the missing worker (if any).
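The last-seen bookkeeping can be sketched like this; the class and method names below are illustrative rather than BuildGrid's actual API, and requeueing of the missing worker's job is reduced to a comment:

```python
import time

class BotSessionReaper:
    # Illustrative sketch, not BuildGrid's API: track when each worker
    # was last seen and reap sessions that missed their deadline.
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def touch(self, bot_name, now=None):
        # Called whenever a worker reconnects or updates its BotSession.
        self.last_seen[bot_name] = time.time() if now is None else now

    def reap_expired(self, now=None):
        now = time.time() if now is None else now
        expired = [bot for bot, seen in self.last_seen.items()
                   if now - seen > self.timeout]
        for bot in expired:
            # The real coroutine also requeues the job this worker held.
            del self.last_seen[bot]
        return expired

reaper = BotSessionReaper(timeout_seconds=30)
reaper.touch("worker-1", now=0.0)
reaper.touch("worker-2", now=20.0)
dead = reaper.reap_expired(now=40.0)  # worker-1 is 40s stale, past deadline
```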
There are a significant number of threads in use in a bgd server process. The majority of these live inside the gRPC thread pool used to handle gRPC requests concurrently, but there are a few threads that we create for both background tasks and optimizing access to resources.
This is the main execution thread. It contains the asyncio event loop which runs the coroutines noted above. It also starts the gRPC server, and handles stopping the gRPC server cleanly.
This is a daemon thread created when the gRPC server is started. It handles the actual running of the gRPC server, such as receiving incoming requests and routing them to handler methods, and running any callbacks when connections are closed.
gRPC Executor Threads¶
This is a number of threads in a ThreadPoolExecutor used by the gRPC server to handle incoming requests. The number is configurable using the thread-pool-size configuration key.
When the gRPC server gets a new request, it locates the correct handler method in the servicers that are registered with the server, and then runs that handler in a thread from this pool.
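The dispatch model can be approximated with a plain ThreadPoolExecutor. This is a simulation of the dispatch only, not gRPC's internals; the `handler` function and the request values are made up:

```python
from concurrent import futures

def handler(request):
    # Stand-in for a servicer method located and invoked by the gRPC server.
    return f"handled {request}"

# Analogous to the pool sized by the thread-pool-size configuration key.
pool = futures.ThreadPoolExecutor(max_workers=4)

# Each incoming request is submitted to the pool, so up to max_workers
# handlers run concurrently.
submitted = [pool.submit(handler, request) for request in range(3)]
responses = [future.result() for future in submitted]
pool.shutdown()
```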
Job Watcher Thread¶
This is only present when the server contains a Scheduler.
This thread is part of the data store implementations used by the Scheduler.
BuildGrid needs to keep track of all the jobs it is currently serving update streams for (via either Execute or WaitExecution requests), and detect changes in the state of those jobs so that the update messages can be sent.
Rather than having each handler thread monitor the database (or in-memory job state) independently, the handlers register their job with the Job Watcher thread, and wait for notification of updates.
The Job Watcher thread watches for updates in the data store (e.g. by using LISTEN if available in the SQL implementation), and when it finds an update to a job that is being watched it notifies the relevant handler threads of the change.
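The register-and-wait relationship between handler threads and the watcher can be sketched with a condition variable. The class below is illustrative, not BuildGrid's implementation, and database polling is replaced by a direct call to `notify_update`:

```python
import threading

class JobWatcher:
    # Sketch: handler threads register interest in a job and block until
    # the watcher thread observes a state change for it.
    def __init__(self):
        self._cond = threading.Condition()
        self._states = {}

    def wait_for_update(self, job_id, last_state):
        # Called from a gRPC handler thread serving an update stream.
        with self._cond:
            while self._states.get(job_id) == last_state:
                self._cond.wait()
            return self._states[job_id]

    def notify_update(self, job_id, new_state):
        # Called by the watcher thread when it detects a change
        # (e.g. via LISTEN in the SQL data store).
        with self._cond:
            self._states[job_id] = new_state
            self._cond.notify_all()

watcher = JobWatcher()
results = []
handler = threading.Thread(
    target=lambda: results.append(watcher.wait_for_update("job-1", None)))
handler.start()
watcher.notify_update("job-1", "COMPLETED")
handler.join()
```

Centralizing the watching this way means one thread polls the data store instead of one per open update stream.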
Scheduler Pruner Thread¶
This is only present when the server contains an SQL Scheduler with pruning enabled in the configuration.
This thread is part of the SQL Scheduler data store implementation.
If pruning is enabled, then this thread is started to manage pruning the jobs table in the database, which would otherwise grow very large without some external cleanup mechanism.
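The pruning step itself amounts to a periodic age-based delete. This sketch uses an in-memory SQLite database with a made-up schema; BuildGrid's actual jobs table and retention configuration differ:

```python
import sqlite3
import time

# Illustrative schema only, not BuildGrid's actual jobs table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (name TEXT, completed_at REAL)")
now = time.time()
db.execute("INSERT INTO jobs VALUES ('old-job', ?)", (now - 7200,))
db.execute("INSERT INTO jobs VALUES ('recent-job', ?)", (now - 60,))

def prune_jobs(conn, max_age_seconds):
    # The real thread runs a step like this periodically in a loop.
    cutoff = time.time() - max_age_seconds
    cursor = conn.execute("DELETE FROM jobs WHERE completed_at < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount

pruned = prune_jobs(db, max_age_seconds=3600)
remaining = [row[0] for row in db.execute("SELECT name FROM jobs")]
```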
Deferred CAS Writer Threads¶
These are only present when the server contains a with-cache storage configured with deferred writes enabled.
This is a number of threads which handle deferring writes of blobs to the cache's fallback storage. It's another ThreadPoolExecutor, this time used by the implementation of the !with-cache storage backend to defer the write of a blob to the configured fallback storage. This allows writes to return early, after just the write to the cache layer, which is likely faster than the fallback.
The number of threads in this pool is configurable using the fallback-writer-threads configuration key.
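The deferred-write pattern can be sketched as follows. The class name, dict-backed storages, and digest string are all illustrative, not BuildGrid's actual storage API:

```python
from concurrent import futures

class WithCacheStorage:
    # Sketch: write the blob to the fast cache synchronously, and hand
    # the slower fallback write to a pool so the caller returns early.
    def __init__(self, cache, fallback, fallback_writer_threads=8):
        self.cache = cache
        self.fallback = fallback
        self._pool = futures.ThreadPoolExecutor(
            max_workers=fallback_writer_threads)

    def commit_write(self, digest, blob):
        self.cache[digest] = blob  # fast path, done before returning
        return self._pool.submit(self._write_fallback, digest, blob)

    def _write_fallback(self, digest, blob):
        self.fallback[digest] = blob  # slow path, runs in the background

cache, fallback = {}, {}
storage = WithCacheStorage(cache, fallback, fallback_writer_threads=2)
future = storage.commit_write("abc/3", b"foo")
future.result()  # wait only so the sketch can show both writes landed
```

The trade-off is the usual one for write-behind caches: callers see lower latency, at the cost of a window where the blob exists only in the cache layer.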
Future RabbitMQ threads¶
With the new RabbitMQ-based architecture (see RabbitMQ BuildGrid Architecture), some extra threads will be needed.
A RetryingPikaConsumer monitors a PikaConsumer instance from a connection watcher thread, and the consumer loop runs in another thread to keep it in the background.