BuildGrid is made up of a number of components which work together to provide client-agnostic remote caching and remote execution functionality. These components can be deployed independently if only a subset of the services is needed for your use case.

For detail on the APIs provided by the services, see Resources.

digraph buildgrid_overview {

    graph [fontsize=14 fontname="Verdana" compound=true];
    node [shape=box fontsize=10 fontname="Verdana"];
    edge [fontsize=10 fontname="Verdana"];

    label="BuildGrid Deployment Example";

    subgraph cluster_bgd_cas {
        label="CAS service";

        cas [label="CAS"];
        bytestream [label="ByteStream"];
    }

    subgraph cluster_bgd_ac {
        label="Action Cache service";

        action_cache [label="Action Cache"];
    }

    subgraph cluster_bgd_execution {
        label="Execution service";

        execution [label="Execution"];
        operations [label="Operations"];
    }

    subgraph cluster_bgd_bots {
        label="Bots service";

        bots [label="Bots"];
    }

    sql [label="PostgreSQL (configurable)"];
    s3 [label="S3 (configurable)"];

    {cas execution operations bots} -> sql;
    {cas bytestream action_cache} -> s3;
}


CAS

The CAS, or Content Addressable Storage, is a service which stores blobs and allows them to be retrieved using the “digest” of the blobs themselves. A digest here is a pair of the hash of the content and the size of the blob in bytes.
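As a concrete illustration, a digest can be computed as a (hash, size) pair. SHA-256 is assumed here as the hash function; the `compute_digest` helper is illustrative, not part of BuildGrid’s API:

```python
import hashlib

def compute_digest(blob: bytes):
    """Return the (hash, size_bytes) pair identifying a blob in CAS.

    SHA-256 is assumed as the digest function for this sketch.
    """
    return hashlib.sha256(blob).hexdigest(), len(blob)

# Two different blobs of the same size still get distinct digests,
# because the hash half of the pair differs.
digest = compute_digest(b"hello world")
```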

The CAS can be used to store and retrieve arbitrary blobs. In BuildGrid it is used for input and output files, gRPC messages (such as the Actions sent by clients and the corresponding ActionResults), and the stdout/stderr from Action execution. In a remote-caching-only deployment, the CAS stores the actual cached blobs.

BuildGrid’s CAS implementation supports a number of storage backends, and some more complex options.


In-Memory

This stores blobs in memory, which is fast but obviously has limitations on both the number of blobs that can be stored and the size of those blobs. This is probably most useful for testing, or as the cache part of a two-level CAS (see Cache + Fallback).

If adding a new blob results in the CAS being full, then old blobs are deleted on a least-recently-used basis.
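A minimal sketch of that least-recently-used eviction, assuming a byte-count capacity; the `LRUMemoryStorage` class and its accounting are illustrative, not BuildGrid’s actual implementation:

```python
from collections import OrderedDict

class LRUMemoryStorage:
    """Toy in-memory blob store with least-recently-used eviction.

    Capacity is tracked as total stored bytes (an assumption made for
    this sketch).
    """

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self._blobs = OrderedDict()  # insertion order == recency order
        self._used = 0

    def put(self, digest, blob):
        if digest in self._blobs:
            self._used -= len(self._blobs.pop(digest))
        self._blobs[digest] = blob
        self._used += len(blob)
        # Evict least-recently-used blobs until the store fits again.
        while self._used > self.max_bytes:
            _, evicted = self._blobs.popitem(last=False)
            self._used -= len(evicted)

    def get(self, digest):
        blob = self._blobs.get(digest)
        if blob is not None:
            self._blobs.move_to_end(digest)  # mark as recently used
        return blob
```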

Local Disk

This stores blobs in a directory on the CAS machine’s local disk. This is slower than the in-memory storage, but doesn’t have limitations on size and number of blobs.

There is currently no internal mechanism to clean up this storage, but work is ongoing to implement a cleanup command to work alongside Indexed CAS which will be able to handle this.


Redis

This stores blobs in a Redis key/value store. This also has no enforced limitations on blob counts and size, though it is probably somewhat unwise to use this for very large blobs.


S3

This storage backend stores blobs using the AWS S3 API. It should be compatible with anything which exposes the S3 API, from AWS itself to other object storage implementations like Ceph or Swift.

There is currently no internal mechanism to clean up this storage, but work is ongoing to implement a cleanup command to work alongside Indexed CAS which will be able to handle this.


Remote

This storage backend looks for the requested blobs in another remote gRPC server. This is especially useful for connecting a BuildGrid Execution Service with a remote BuildGrid CAS, or for using another CAS implementation from BuildGrid.

The gRPC connection to these remote services can be configured using the channel-options config option, which takes multiple key-value options where the keys are the name of the channel option without the grpc. prefix and with all _ replaced with -.
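The key translation described above can be sketched as a one-line transformation; the `to_grpc_channel_option` helper and the example option name are illustrative:

```python
def to_grpc_channel_option(config_key, value):
    """Map a channel-options config key back to a gRPC option name.

    Per the docs, config keys drop the ``grpc.`` prefix and use ``-`` in
    place of ``_``, so this reverses both transformations.
    """
    return ("grpc." + config_key.replace("-", "_"), value)

# e.g. max-receive-message-length -> grpc.max_receive_message_length
option = to_grpc_channel_option("max-receive-message-length", 16 * 1024 * 1024)
```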

See grpc_types.h for the list of channel options.

Cache + Fallback

This is an implementation of BuildGrid’s storage API which handles writing blobs to multiple other storage implementations. It is used to provide a fast cache layer on top of a slower but persistent storage, such as S3.

This storage type can also optionally defer the write to the fallback storage. This allows write requests to return once the write to the cache layer completes, which is potentially much faster than writing to the fallback.

However, this approach is not safe in all circumstances; it requires that the cache layer can reliably be expected to contain anything written to it for at least the duration of the related build.

As such, it shouldn’t be used when using a small cache, or a cache that isn’t shared amongst instances in a multi-BuildGrid deployment.
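The read/write flow described above can be sketched as follows. The `CacheFallbackStorage` class is illustrative, with plain dicts standing in for the two storages, and the deferred write is merely queued rather than performed asynchronously:

```python
class CacheFallbackStorage:
    """Sketch of a two-level storage: a fast cache over a slower fallback.

    ``defer_fallback_writes`` mirrors the deferred-write option described
    in the text; here the deferred write is just recorded for illustration.
    """

    def __init__(self, cache, fallback, defer_fallback_writes=False):
        self.cache = cache
        self.fallback = fallback
        self.defer = defer_fallback_writes
        self.pending = []  # writes queued for the fallback

    def put(self, digest, blob):
        self.cache[digest] = blob
        if self.defer:
            self.pending.append((digest, blob))  # flushed later
        else:
            self.fallback[digest] = blob

    def get(self, digest):
        if digest in self.cache:
            return self.cache[digest]
        blob = self.fallback.get(digest)
        if blob is not None:
            self.cache[digest] = blob  # repopulate the cache on a miss
        return blob
```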

Size Differentiated

This is a storage provider which is intended to wrap two or more other storages. It takes a list of storages paired with a maximum blob size allowed in the storage, and a fallback storage to handle any blobs which are too big for any of the other storages.

This can be used in conjunction with the Cache + Fallback storage to provide a more efficient cache layer, by caching blobs differently based on their size. This allows the faster, size-limited storage like In-memory to be used by many small blobs, with larger blobs being cached somewhere with more space.
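The routing decision amounts to picking the first storage whose size limit accommodates the blob, falling back when none does. A sketch, with string names standing in for real storage objects:

```python
def choose_storage(blob_size, sized_storages, fallback):
    """Pick the first storage whose size limit fits the blob.

    ``sized_storages`` is a list of (max_blob_size, storage) pairs; any
    blob too big for all of them goes to ``fallback``. Names here are
    illustrative, not BuildGrid's configuration keys.
    """
    for max_size, storage in sized_storages:
        if blob_size <= max_size:
            return storage
    return fallback
```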

Indexed CAS

Indexed CAS is a storage implementation which maintains an index of the storage’s contents, and hands the reading/writing off to another backend.

This index is used to speed up requests like FindMissingBlobs, by looking up blobs in the index rather than in a slower storage.

The index will also be used for handling cleanup of storages which don’t have a built-in mechanism for cleanup/expiry of blobs, since it can track when blobs were last accessed.
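A minimal sketch of such an index, answering FindMissingBlobs from an in-memory map and tracking last-access times for cleanup; the `CasIndex` class is illustrative, not BuildGrid’s implementation:

```python
import time

class CasIndex:
    """Toy index of stored digests with last-access tracking."""

    def __init__(self):
        self._last_access = {}  # digest -> last access time

    def add(self, digest):
        self._last_access[digest] = time.time()

    def find_missing_blobs(self, digests):
        """Answer FindMissingBlobs from the index, not the backend."""
        missing = []
        for digest in digests:
            if digest in self._last_access:
                self._last_access[digest] = time.time()  # touch for cleanup
            else:
                missing.append(digest)
        return missing

    def least_recently_used(self):
        """Cleanup candidates, oldest access first."""
        return sorted(self._last_access, key=self._last_access.get)
```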


ByteStream

The ByteStream service is a generic API for writing/reading bytes to/from a resource. BuildGrid uses it to write/read blobs to/from CAS, and as such a ByteStream service should be deployed in the same server as the CAS. It is also used by BuildGrid’s LogStream service, to handle reading/writing streams of logs. Any LogStream service also needs a ByteStream service in the same server to function correctly.

Action Cache

The Action Cache is a key/value store which maps Action digests to their corresponding ActionResults. Internally it stores the digest of the ActionResult, and also handles retrieving the result message itself from the CAS.
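That two-step lookup can be sketched as below, with plain dicts standing in for the cache mapping and the CAS; the `ActionCache` class here is illustrative only:

```python
class ActionCache:
    """Sketch: map Action digests to ActionResult digests, then resolve
    the actual result message via the CAS."""

    def __init__(self, cas):
        self.cas = cas      # result digest -> serialized ActionResult
        self.results = {}   # action digest -> result digest

    def update(self, action_digest, result_digest):
        self.results[action_digest] = result_digest

    def get_action_result(self, action_digest):
        result_digest = self.results.get(action_digest)
        if result_digest is None:
            return None  # cache miss
        return self.cas.get(result_digest)  # fetch the message from CAS
```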

BuildGrid’s Action Cache can be configured to store this mapping either in-memory, using Redis, or using the S3 API. Additionally a Remote Action Cache can be specified and queries made against the remote service.

Write-Once Action Cache

BuildGrid also has an Action Cache which only allows a given key to be written once. This was added for testing purposes, but may be useful anywhere that an immutable cache of Action results is needed.


Operations

The Operations service is used to inspect the state of Actions currently being executed by BuildGrid. It also handles cancellation of requested Actions, and is normally deployed in the same place as the Execution service (some tools expect it to be accessible at the same endpoint). The Operations service can be used to either inspect Operations (GetOperation) or list all Operations that BuildGrid knows about (ListOperations).

Note that BuildGrid currently maintains knowledge of all past Operations, so the list of Operations can get quite long. To deal with this, Operations are returned in paginated responses, with each ListOperationsResponse containing a next_page_token to get the next page of results.
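The pagination loop a client runs looks roughly like this; `list_page` is a stand-in for the actual RPC call, taking a page token and returning a response with `operations` and `next_page_token` fields, as ListOperationsResponse does:

```python
from types import SimpleNamespace

def list_all_operations(list_page):
    """Drain a paginated ListOperations endpoint page by page."""
    operations = []
    page_token = ""
    while True:
        response = list_page(page_token)
        operations.extend(response.operations)
        page_token = response.next_page_token
        if not page_token:  # an empty token marks the final page
            return operations

# Two fake pages standing in for server responses.
pages = {
    "": SimpleNamespace(operations=["op1", "op2"], next_page_token="t1"),
    "t1": SimpleNamespace(operations=["op3"], next_page_token=""),
}
all_ops = list_all_operations(lambda token: pages[token])
```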

ListOperations Filtering and Sorting

You can filter the output of ListOperations by passing a string to the filter parameter. A filter string looks like the following:

  • completed_time > 2020-07-30T14:30:00 & stage = COMPLETED

The supported parameters are:

  • name (the operation name without the instance name prefix)

  • stage (the current execution stage of the Action, e.g. QUEUED, EXECUTING, or COMPLETED)

  • queued_time (an ISO-8601 timestamp indicating the time the Action was queued)

  • start_time (an ISO-8601 timestamp indicating the time work on the Action began)

  • completed_time (an ISO-8601 timestamp indicating the time work on the Action completed)

  • tool_name (the name of the tool used to send the Action)

  • tool_version (the version of the tool used to send the Action)

  • invocation_id (the invocation ID set by the tool used to send the Action; used to tie together multiple related Actions sent by the same invocation of the tool)

  • correlated_invocations_id (the correlated invocations ID set by the tool used to send the Action; used to tie together multiple related invocations of the tool)

The supported operators are: =, !=, >, >=, <, <=

You can also use a special sort_order parameter to adjust the order in which results are returned, like this:

  • completed_time > 2020-07-30T14:30:00 & sort_order = completed_time

Any of the filtering parameters above can be used as values for sort_order. By default, sort_order indicates ascending order. You can use (asc) or (desc) at the end of the value to explicitly call out ascending or descending order, like this:

  • completed_time > 2020-07-30T14:30:00 & sort_order = completed_time(asc)

  • completed_time > 2020-07-30T14:30:00 & sort_order = completed_time(desc)

You can use multiple sort_order keys in the filter string. Each subsequent sort_order key breaks ties among elements sorted by previous keys.

  • completed_time > 2020-07-30T14:30:00 & sort_order = stage & sort_order = queued_time

The default filter is:

  • stage != COMPLETED & sort_order = queued_time
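Filter strings of this shape can be broken into (parameter, operator, value) triples by splitting on &. The parser below is a hypothetical sketch of that grammar; BuildGrid’s real parsing may differ in details such as quoting:

```python
import re

# One filter term: parameter, one of the supported operators, value.
FILTER_TERM = re.compile(r"\s*(\w+)\s*(!=|>=|<=|=|>|<)\s*(\S+)\s*")

def parse_filter(filter_string):
    """Split 'a = b & c > d' into (parameter, operator, value) triples."""
    terms = []
    for term in filter_string.split("&"):
        match = FILTER_TERM.fullmatch(term)
        if match is None:
            raise ValueError(f"invalid filter term: {term!r}")
        terms.append(match.groups())
    return terms

parsed = parse_filter("stage != COMPLETED & sort_order = queued_time")
```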


Execution

The Execution service implements the execution part of the Remote Execution API. It receives Execute requests containing Action digests, and schedules the Actions for execution. Actions are prioritized first by their priority, where smaller integers are higher priority, and then by how long the Action has been queued.
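That ordering (priority first, queue time as tie-breaker) maps naturally onto a priority queue keyed on a (priority, sequence) tuple. A sketch, with a monotonic counter standing in for the queued timestamp; the `JobQueue` class is illustrative, not BuildGrid’s scheduler:

```python
import heapq
import itertools

class JobQueue:
    """Pop order: lower priority integers first, earlier-queued first
    among equal priorities."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # stands in for queued_time

    def push(self, action, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), action))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```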

BuildGrid’s Execution service has a pluggable scheduling component. Currently there are two scheduler implementations: in-memory and SQL-based. The SQL scheduler is tested with SQLite and PostgreSQL, but theoretically could work with any database supported by SQLAlchemy. Production BuildGrid deployments should use the SQL scheduler with PostgreSQL, to provide a reliable and persistent job queue.


Bots

The Bots service implements the Remote Workers API. It handles assigning queued Actions to workers, and reporting updates on their execution.

If the Execution service is using an in-memory scheduler, the Bots service needs to be deployed in the same server. However, using an SQL scheduler allows the Bots service to be independently deployed, as long as it uses the same database as the Execution service.


LogStream

The LogStream service implements the LogStream API. In a BuildGrid context, this provides a mechanism for workers to stream logs to interested clients whilst the build is in progress. The client doesn’t necessarily need to be the tool which made an Execute request; the resource name used to read the stream can be obtained using the Operations API.

The LogStream service just handles creating the actual stream resource; reading from and writing to the stream is done using the ByteStream API. This means that any config including a LogStream service also needs a ByteStream service to function correctly.

Use of the LogStream service isn’t limited to streaming build logs from a BuildBox worker; the buildbox-tools repository provides tooling for writing to a stream generically, which could be reused for other purposes. The LogStream service is also completely independent of the rest of BuildGrid (except for the ByteStream service used for read/write access), and so can be used in situations with no need for the rest of the remote execution/caching functionality. An example LogStream-only deployment is provided in this docker-compose example.

Build Events Stream

The PublishBuildEvents service implements the Build Event Protocol. This protocol is used by Bazel to publish lifecycle events and build information to help future debugging. The implementation in BuildGrid directly supports this Bazel use-case, but is also usable with any other non-Bazel event streams using the same protocol.

The QueryBuildEvents service implements a custom BuildGrid-specific proto which allows these event streams to be retrieved from the server after completion of the related build. This service supports querying the set of streams by regex on the stream ID, which allows easy retrieval of all streams related to a specific Build.

Stream IDs internally are of the format build_id.component.invocation_id. Elsewhere in BuildGrid this build_id is called the correlated invocations ID. Values for the component part are defined in the build_events proto file, as part of the Build Events Protocol itself.

An example query to get all streams for a specific build/correlated invocations ID: 75e9ee07-9a1c-4a80-aa05-13c377c5a1f3\..*\..*
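The structure of that query can be seen by building it from its parts; the stream IDs below are made-up examples, and `re.fullmatch` stands in for however the server applies the regex:

```python
import re

# Stream IDs follow build_id.component.invocation_id; this pattern selects
# every stream for one correlated invocations ID, as in the example above.
build_id = "75e9ee07-9a1c-4a80-aa05-13c377c5a1f3"
pattern = re.compile(re.escape(build_id) + r"\..*\..*")

stream_ids = [
    "75e9ee07-9a1c-4a80-aa05-13c377c5a1f3.bazel.abc123",
    "deadbeef-0000-0000-0000-000000000000.bazel.def456",
]
matching = [s for s in stream_ids if pattern.fullmatch(s)]
```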