BuildGrid is made up of a number of components which work together to provide client-agnostic remote caching and remote execution functionality. These components can be deployed independently if only a subset of the services is needed for your use case.

For detail on the APIs provided by the services, see Resources.

digraph buildgrid_overview {

    graph [fontsize=14 fontname="Verdana" compound=true];
    node [shape=box fontsize=10 fontname="Verdana"];
    edge [fontsize=10 fontname="Verdana"];

    label="BuildGrid Deployment Example";

    subgraph cluster_bgd_cas {
        label="CAS service";

        cas [label="CAS"];
        bytestream [label="ByteStream"];
    }

    subgraph cluster_bgd_ac {
        label="Action Cache service";

        action_cache [label="Action Cache"];
    }

    subgraph cluster_bgd_execution {
        label="Execution service";

        execution [label="Execution"];
        operations [label="Operations"];
    }

    subgraph cluster_bgd_bots {
        label="Bots service";

        bots [label="Bots"];
    }

    sql [label="PostgreSQL (configurable)"];
    s3 [label="S3 (configurable)"];

    {cas execution operations bots} -> sql;
    {cas bytestream action_cache} -> s3;
}


CAS

The CAS, or Content Addressable Storage, is a service which stores blobs and allows them to be retrieved using the “digest” of the blobs themselves. A digest here is a pair of the hash of the content and the size of the blob in bytes.
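As a concrete illustration, a digest can be computed as a (hash, size) pair. SHA-256 is assumed here as the hash function; the `compute_digest` helper is illustrative, not part of BuildGrid’s API:

```python
import hashlib

def compute_digest(blob: bytes):
    """Return the (hash, size_bytes) pair identifying a blob in CAS.

    SHA-256 is assumed as the digest function for this sketch.
    """
    return hashlib.sha256(blob).hexdigest(), len(blob)

# Two different blobs of the same size still get distinct digests,
# because the hash half of the pair differs.
digest = compute_digest(b"hello world")
```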

The CAS can be used to store and retrieve arbitrary blobs. In BuildGrid it is used for input and output files, gRPC messages (such as the Actions sent by clients and the corresponding ActionResults), and the stdout/stderr from Action execution. In a remote-caching-only deployment, the CAS stores the actual cached blobs.

BuildGrid’s CAS implementation supports a number of storage backends, and some more complex options.


In-Memory

This stores blobs in memory, which is fast but obviously has limitations on both the number of blobs that can be stored and the size of those blobs. This is probably most useful for testing, or as the cache part of a two-level CAS (see Cache + Fallback).

If adding a new blob results in the CAS being full, then old blobs are deleted on a least-recently-used basis.
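A minimal sketch of that least-recently-used eviction, assuming a byte-count capacity; the `LRUMemoryStorage` class and its accounting are illustrative, not BuildGrid’s actual implementation:

```python
from collections import OrderedDict

class LRUMemoryStorage:
    """Toy in-memory blob store with least-recently-used eviction.

    Capacity is tracked as total stored bytes (an assumption made for
    this sketch).
    """

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self._blobs = OrderedDict()  # insertion order == recency order
        self._used = 0

    def put(self, digest, blob):
        if digest in self._blobs:
            self._used -= len(self._blobs.pop(digest))
        self._blobs[digest] = blob
        self._used += len(blob)
        # Evict least-recently-used blobs until the store fits again.
        while self._used > self.max_bytes:
            _, evicted = self._blobs.popitem(last=False)
            self._used -= len(evicted)

    def get(self, digest):
        blob = self._blobs.get(digest)
        if blob is not None:
            self._blobs.move_to_end(digest)  # mark as recently used
        return blob
```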

Local Disk

This stores blobs in a directory on the CAS machine’s local disk. This is slower than the in-memory storage, but doesn’t have limitations on size and number of blobs.

There is currently no internal mechanism to clean up this storage, but work is ongoing to implement a cleanup command to work alongside Indexed CAS which will be able to handle this.


Redis

This stores blobs in a Redis key/value store. This also has no enforced limitations on blob counts and size, though it is probably somewhat unwise to use this for very large blobs.


S3

This storage backend stores blobs using the AWS S3 API. It should be compatible with anything which exposes the S3 API, from AWS itself to other object storage implementations like Ceph or Swift.

There is currently no internal mechanism to clean up this storage, but work is ongoing to implement a cleanup command to work alongside Indexed CAS which will be able to handle this.


Remote

This storage backend looks for the requested blobs in another remote gRPC server. This is especially useful for connecting a BuildGrid Execution Service with a remote BuildGrid CAS, or for using another CAS implementation from BuildGrid.

The gRPC connection to these remote services can be configured using the channel-options config option, which takes multiple key-value options where the keys are the name of the channel option without the grpc. prefix and with all _ replaced with -.
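The key translation described above can be sketched as a one-line transformation; the `to_grpc_channel_option` helper and the example option name are illustrative:

```python
def to_grpc_channel_option(config_key, value):
    """Map a channel-options config key back to a gRPC option name.

    Per the docs, config keys drop the ``grpc.`` prefix and use ``-`` in
    place of ``_``, so this reverses both transformations.
    """
    return ("grpc." + config_key.replace("-", "_"), value)

# e.g. max-receive-message-length -> grpc.max_receive_message_length
option = to_grpc_channel_option("max-receive-message-length", 16 * 1024 * 1024)
```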

See grpc_types.h for the list of channel options.

Cache + Fallback

This is an implementation of BuildGrid’s storage API which handles writing blobs to multiple other storage implementations. It is used to provide a fast cache layer on top of a slower but persistent storage, such as S3.

This storage type can also optionally defer the write to the fallback storage. This allows write requests to return once the write to the cache layer completes, which is potentially much faster than writing to the fallback.

However, this approach is not safe in all circumstances; it requires that the cache layer can reliably be expected to contain anything written to it for at least the duration of the related build.

As such, it shouldn’t be used when using a small cache, or a cache that isn’t shared amongst instances in a multi-BuildGrid deployment.
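The read/write flow described above can be sketched as follows. The `CacheFallbackStorage` class is illustrative, with plain dicts standing in for the two storages, and the deferred write is merely queued rather than performed asynchronously:

```python
class CacheFallbackStorage:
    """Sketch of a two-level storage: a fast cache over a slower fallback.

    ``defer_fallback_writes`` mirrors the deferred-write option described
    in the text; here the deferred write is just recorded for illustration.
    """

    def __init__(self, cache, fallback, defer_fallback_writes=False):
        self.cache = cache
        self.fallback = fallback
        self.defer = defer_fallback_writes
        self.pending = []  # writes queued for the fallback

    def put(self, digest, blob):
        self.cache[digest] = blob
        if self.defer:
            self.pending.append((digest, blob))  # flushed later
        else:
            self.fallback[digest] = blob

    def get(self, digest):
        if digest in self.cache:
            return self.cache[digest]
        blob = self.fallback.get(digest)
        if blob is not None:
            self.cache[digest] = blob  # repopulate the cache on a miss
        return blob
```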

Size Differentiated

This is a storage provider which is intended to wrap two or more other storages. It takes a list of storages paired with a maximum blob size allowed in the storage, and a fallback storage to handle any blobs which are too big for any of the other storages.

This can be used in conjunction with the Cache + Fallback storage to provide a more efficient cache layer, by caching blobs differently based on their size. This allows the faster, size-limited storage like In-memory to be used by many small blobs, with larger blobs being cached somewhere with more space.
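The routing decision amounts to picking the first storage whose size limit accommodates the blob, falling back when none does. A sketch, with string names standing in for real storage objects:

```python
def choose_storage(blob_size, sized_storages, fallback):
    """Pick the first storage whose size limit fits the blob.

    ``sized_storages`` is a list of (max_blob_size, storage) pairs; any
    blob too big for all of them goes to ``fallback``. Names here are
    illustrative, not BuildGrid's configuration keys.
    """
    for max_size, storage in sized_storages:
        if blob_size <= max_size:
            return storage
    return fallback
```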

Indexed CAS

Indexed CAS is a storage implementation which maintains an index of the storage’s contents, and hands the reading/writing off to another backend.

This index is used to speed up requests like FindMissingBlobs, by looking up blobs in the index rather than in a slower storage.

The index will also be used for handling cleanup of storages which don’t have a built-in mechanism for cleanup/expiry of blobs, since it can track when blobs were last accessed.
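A minimal sketch of such an index, answering FindMissingBlobs from an in-memory map and tracking last-access times for cleanup; the `CasIndex` class is illustrative, not BuildGrid’s implementation:

```python
import time

class CasIndex:
    """Toy index of stored digests with last-access tracking."""

    def __init__(self):
        self._last_access = {}  # digest -> last access time

    def add(self, digest):
        self._last_access[digest] = time.time()

    def find_missing_blobs(self, digests):
        """Answer FindMissingBlobs from the index, not the backend."""
        missing = []
        for digest in digests:
            if digest in self._last_access:
                self._last_access[digest] = time.time()  # touch for cleanup
            else:
                missing.append(digest)
        return missing

    def least_recently_used(self):
        """Cleanup candidates, oldest access first."""
        return sorted(self._last_access, key=self._last_access.get)
```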


ByteStream

The ByteStream service is a generic API for writing/reading bytes to/from a resource. BuildGrid uses it to write/read blobs to/from CAS, and as such a ByteStream service should be deployed in the same server as the CAS. It is also used by BuildGrid’s LogStream service, to handle reading/writing streams of logs. Any LogStream service also needs a ByteStream service in the same server to function correctly.

Action Cache

The Action Cache is a key/value store which maps Action digests to their corresponding ActionResults. Internally it stores the digest of the ActionResult, and also handles retrieving the result message itself from the CAS.
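That two-step lookup can be sketched as below, with plain dicts standing in for the cache mapping and the CAS; the `ActionCache` class here is illustrative only:

```python
class ActionCache:
    """Sketch: map Action digests to ActionResult digests, then resolve
    the actual result message via the CAS."""

    def __init__(self, cas):
        self.cas = cas      # result digest -> serialized ActionResult
        self.results = {}   # action digest -> result digest

    def update(self, action_digest, result_digest):
        self.results[action_digest] = result_digest

    def get_action_result(self, action_digest):
        result_digest = self.results.get(action_digest)
        if result_digest is None:
            return None  # cache miss
        return self.cas.get(result_digest)  # fetch the message from CAS
```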

BuildGrid’s Action Cache can be configured to store this mapping either in-memory, using Redis, or using the S3 API. Additionally a Remote Action Cache can be specified and queries made against the remote service.

Write-Once Action Cache

BuildGrid also has an Action Cache which only allows a given key to be written once. This was added for testing purposes, but may be useful anywhere that an immutable cache of Action results is needed.


Operations

The Operations service is used to inspect the state of Actions currently being executed by BuildGrid. It also handles cancellation of requested Actions, and is normally deployed in the same place as the Execution service (some tools expect it to be accessible at the same endpoint). The Operations service can be used to either inspect Operations (GetOperation) or list all Operations that BuildGrid knows about (ListOperations).

Note that BuildGrid currently maintains knowledge of all past Operations, so the list of Operations can get quite long. To deal with this, Operations are returned in paginated responses, with each ListOperationsResponse containing a next_page_token to get the next page of results.
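The pagination loop a client runs looks roughly like this; `list_page` is a stand-in for the actual RPC call, taking a page token and returning a response with `operations` and `next_page_token` fields, as ListOperationsResponse does:

```python
from types import SimpleNamespace

def list_all_operations(list_page):
    """Drain a paginated ListOperations endpoint page by page."""
    operations = []
    page_token = ""
    while True:
        response = list_page(page_token)
        operations.extend(response.operations)
        page_token = response.next_page_token
        if not page_token:  # an empty token marks the final page
            return operations

# Two fake pages standing in for server responses.
pages = {
    "": SimpleNamespace(operations=["op1", "op2"], next_page_token="t1"),
    "t1": SimpleNamespace(operations=["op3"], next_page_token=""),
}
all_ops = list_all_operations(lambda token: pages[token])
```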

ListOperations Filtering and Sorting

You can filter the output of ListOperations by passing a string to the filter parameter. A filter string looks like the following:

  • completed_time > 2020-07-30T14:30:00 & stage = COMPLETED

The supported parameters are:

  • name (the operation name without the instance name prefix)

  • stage (the current execution stage of the Action, e.g. QUEUED, EXECUTING, or COMPLETED)

  • queued_time (an ISO-8601 timestamp indicating the time the Action was queued)

  • start_time (an ISO-8601 timestamp indicating the time work on the Action began)

  • completed_time (an ISO-8601 timestamp indicating the time work on the Action completed)

  • tool_name (the name of the tool used to send the Action)

  • tool_version (the version of the tool used to send the Action)

  • invocation_id (the invocation ID set by the tool used to send the Action; used to tie together multiple related Actions sent by the same invocation of the tool)

  • correlated_invocations_id (the correlated invocations ID set by the tool used to send the Action; used to tie together multiple related invocations of the tool)

The supported operators are: =, !=, >, >=, <, <=

You can also use a special sort_order parameter to adjust the order in which results are returned, like this:

  • completed_time > 2020-07-30T14:30:00 & sort_order = completed_time

Any of the filtering parameters above can be used as values for sort_order. By default, sort_order indicates ascending order. You can use (asc) or (desc) at the end of the value to explicitly call out ascending or descending order, like this:

  • completed_time > 2020-07-30T14:30:00 & sort_order = completed_time(asc)

  • completed_time > 2020-07-30T14:30:00 & sort_order = completed_time(desc)

You can use multiple sort_order keys in the filter string. Each subsequent sort_order key breaks ties among elements sorted by previous keys.

  • completed_time > 2020-07-30T14:30:00 & sort_order = stage & sort_order = queued_time

The default filter is:

  • stage != COMPLETED & sort_order = queued_time
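Filter strings of this shape can be broken into (parameter, operator, value) triples by splitting on &. The parser below is a hypothetical sketch of that grammar; BuildGrid’s real parsing may differ in details such as quoting:

```python
import re

# One filter term: parameter, one of the supported operators, value.
FILTER_TERM = re.compile(r"\s*(\w+)\s*(!=|>=|<=|=|>|<)\s*(\S+)\s*")

def parse_filter(filter_string):
    """Split 'a = b & c > d' into (parameter, operator, value) triples."""
    terms = []
    for term in filter_string.split("&"):
        match = FILTER_TERM.fullmatch(term)
        if match is None:
            raise ValueError(f"invalid filter term: {term!r}")
        terms.append(match.groups())
    return terms

parsed = parse_filter("stage != COMPLETED & sort_order = queued_time")
```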


Execution

The Execution service implements the execution part of the Remote Execution API. It receives Execute requests containing Action digests, and schedules the Actions for execution. Actions are prioritized first by their priority, where smaller integers are higher priority, and then by how long the Action has been queued.
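That ordering (priority first, queue time as tie-breaker) maps naturally onto a priority queue keyed on a (priority, sequence) tuple. A sketch, with a monotonic counter standing in for the queued timestamp; the `JobQueue` class is illustrative, not BuildGrid’s scheduler:

```python
import heapq
import itertools

class JobQueue:
    """Pop order: lower priority integers first, earlier-queued first
    among equal priorities."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # stands in for queued_time

    def push(self, action, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), action))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```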

BuildGrid’s Execution service has a pluggable scheduling component. Currently there are two scheduler implementations: in-memory and SQL-based. The SQL scheduler is tested with SQLite and PostgreSQL, but theoretically could work with any database supported by SQLAlchemy. Production BuildGrid deployments should use the SQL scheduler with PostgreSQL, to provide a reliable and persistent job queue.


Bots

The Bots service implements the Remote Workers API. It handles assigning queued Actions to workers, and reporting updates on their execution.

If the Execution service is using an in-memory scheduler, the Bots service needs to be deployed in the same server. However, using an SQL scheduler allows the Bots service to be independently deployed, as long as it uses the same database as the Execution service.


LogStream

The LogStream service implements the LogStream API. In a BuildGrid context, this provides a mechanism for workers to stream logs to interested clients whilst the build is in progress. The client doesn’t necessarily need to be the tool which made an Execute request; the resource name used to read the stream can be obtained using the Operations API.

The LogStream service just handles creating the actual stream resource; reading from and writing to the stream is done using the ByteStream API. This means that any config including a LogStream service also needs a ByteStream service to function correctly.

Use of the LogStream service isn’t limited to streaming build logs from a BuildBox worker; the buildbox-tools repository provides tooling for writing to a stream generically, which could be reused for other purposes. The LogStream service is also completely independent of the rest of BuildGrid (except for the ByteStream service used for read/write access), and so can be used in situations with no need for the rest of the remote execution/caching functionality. An example LogStream-only deployment is provided in this docker-compose example.

Build Events Stream

The PublishBuildEvents service implements the Build Event Protocol. This protocol is used by Bazel to publish lifecycle events and build information to help future debugging. The implementation in BuildGrid directly supports this Bazel use-case, but is also usable with any other non-Bazel event streams using the same protocol.

The QueryBuildEvents service implements a custom BuildGrid-specific proto which allows these event streams to be retrieved from the server after completion of the related build. This service supports querying the set of streams by regex on the stream ID, which allows easy retrieval of all streams related to a specific Build.

Stream IDs internally are of the format build_id.component.invocation_id. Elsewhere in BuildGrid this build_id is called the correlated invocations ID. Values for the component part are defined in the build_events proto file, as part of the Build Events Protocol itself.

An example query to get all streams for a specific build/correlated invocations ID: 75e9ee07-9a1c-4a80-aa05-13c377c5a1f3\..*\..*
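The structure of that query can be seen by building it from its parts; the stream IDs below are made-up examples, and `re.fullmatch` stands in for however the server applies the regex:

```python
import re

# Stream IDs follow build_id.component.invocation_id; this pattern selects
# every stream for one correlated invocations ID, as in the example above.
build_id = "75e9ee07-9a1c-4a80-aa05-13c377c5a1f3"
pattern = re.compile(re.escape(build_id) + r"\..*\..*")

stream_ids = [
    "75e9ee07-9a1c-4a80-aa05-13c377c5a1f3.bazel.abc123",
    "deadbeef-0000-0000-0000-000000000000.bazel.def456",
]
matching = [s for s in stream_ids if pattern.fullmatch(s)]
```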