Components
==========

BuildGrid is made up of a number of components which work together to provide
client-agnostic remote caching and remote execution functionality. These
components can be deployed independently if only a subset of the services is
needed for your use case.

For detail on the APIs provided by the services, see :ref:`external-resources`.

.. graphviz::
   :align: center

   digraph buildgrid_overview {
       bgcolor="#fcfcfc";
       graph [fontsize=14 fontname="Verdana" compound=true];
       node [shape=box fontsize=10 fontname="Verdana"];
       edge [fontsize=10 fontname="Verdana"];
       label="BuildGrid Deployment Example";
       labelloc=top;

       subgraph cluster_bgd_cas {
           label="CAS service";
           fontsize=10;
           cas [ label="CAS" ];
           bytestream [ label="ByteStream" ];
       }

       subgraph cluster_bgd_ac {
           label="Action Cache service";
           fontsize=10;
           action_cache [ label="Action Cache" ];
       }

       subgraph cluster_bgd_execution {
           label="Execution service";
           fontsize=10;
           execution [ label="Execution" ];
           operations [ label="Operations" ];
       }

       subgraph cluster_bgd_bots {
           label="Bots service";
           fontsize=10;
           bots [ label="Bots" ];
       }

       {cas execution operations bots} -> sql;
       {cas bytestream action_cache} -> s3;

       sql [ label="PostgreSQL (configurable)" ];
       s3 [ label="S3 (configurable)" ];
   }

CAS
---

The CAS, or **C**\ ontent **A**\ ddressable **S**\ torage, is a service which
stores blobs and can retrieve them using the "digest" of the blobs themselves.
A digest here is a pair of the hash of the content and the size of the blob.

The CAS can be used to store and retrieve arbitrary blobs, but in BuildGrid it
is used in particular for input and output files, gRPC messages (such as the
Actions sent by clients and the corresponding ActionResults), and the
stdout/stderr from Action execution. In a remote caching only deployment, the
CAS stores the actual cached blobs.
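As a minimal illustration of the digest addressing scheme (not BuildGrid's
actual implementation; SHA-256 is assumed here as the configured hash
function), a digest can be computed like this:

.. code-block:: python

   import hashlib

   def compute_digest(blob: bytes) -> tuple[str, int]:
       """Return the (hash, size) pair used to address a blob in CAS."""
       return hashlib.sha256(blob).hexdigest(), len(blob)

   # Identical content always yields an identical digest, so blobs are
   # naturally deduplicated by the store.
   digest = compute_digest(b"hello world")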
BuildGrid's CAS implementation supports a number of storage backends, along
with some more complex composite options.

.. _in-memory-storage:

In-memory
~~~~~~~~~

This stores blobs in-memory, which is fast but clearly limits both the number
of blobs that can be stored and the size those blobs can be. This is most
useful for testing, or as the cache part of a two-level CAS (see
:ref:`cache-fallback-storage`). If adding a new blob would make the CAS full,
then old blobs are deleted on a least-recently-used basis.

.. _disk-storage:

Local Disk
~~~~~~~~~~

This stores blobs in a directory on the CAS machine's local disk. This is
slower than the in-memory storage, but isn't limited in the size or number of
blobs beyond the capacity of the disk itself. There is currently no internal
mechanism to clean up this storage, but work is ongoing to implement a cleanup
command to work alongside :ref:`sql-index-storage` which will be able to
handle this.

.. _redis-storage:

Redis
~~~~~

This stores blobs in a Redis key/value store. This also has no enforced
limitations on blob count and size, though it is probably unwise to use it
for very large blobs.

.. _s3-storage:

S3
~~

This storage backend stores blobs using the AWS S3 API. It should be
compatible with anything which exposes the S3 API, from AWS itself to other
object storage implementations like Ceph or Swift. There is currently no
internal mechanism to clean up this storage, but work is ongoing to implement
a cleanup command to work alongside :ref:`sql-index-storage` which will be
able to handle this.

.. _remote-storage:

Remote
~~~~~~

This storage backend looks for the requested blobs in another remote gRPC
server. This is especially useful for connecting a BuildGrid Execution
service with a remote BuildGrid CAS, or to use another CAS implementation
from BuildGrid.

The gRPC connection to these remote services can be configured using the
``channel-options`` config option, which takes multiple key-value options,
where the keys are the names of the channel options without the ``grpc.``
prefix and with all ``_`` replaced with ``-``. See `grpc_types.h`_ for the
list of channel options.
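As a sketch of this key mapping (the endpoint and option value here are
hypothetical), the config key ``max-receive-message-length`` corresponds to
the ``grpc.max_receive_message_length`` channel option:

.. code-block:: python

   import grpc

   def to_channel_option(key: str, value: int) -> tuple[str, int]:
       # Restore the "grpc." prefix and replace "-" with "_" to recover
       # the channel option name as listed in grpc_types.h.
       return ("grpc." + key.replace("-", "_"), value)

   # Roughly equivalent to setting channel-options in a BuildGrid config.
   channel = grpc.insecure_channel(
       "remote-cas.example.com:50051",
       options=[to_channel_option("max-receive-message-length", 16 * 1024 * 1024)],
   )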
.. _cache-fallback-storage:

Cache + Fallback
~~~~~~~~~~~~~~~~

This is an implementation of BuildGrid's storage API which handles writing
blobs to multiple other storage implementations. It is used to provide a
cache layer for speed on top of a slower but persistent storage, such as S3.

This storage type can also optionally defer the write to the fallback
storage. This allows write requests to return once the write to the cache
layer completes, which is potentially much faster than writing to the
fallback. However, this approach is not safe in all circumstances; it
requires that the cache layer can reliably be expected to contain anything
written to it for at least the duration of the related build. As such, it
shouldn't be used with a small cache, or with a cache that isn't shared
amongst instances in a multi-BuildGrid deployment.

.. _size-differentiated-storage:

Size Differentiated
~~~~~~~~~~~~~~~~~~~

This is a storage provider which is intended to wrap two or more other
storages. It takes a list of storages, each paired with the maximum blob size
allowed in that storage, and a fallback storage to handle any blobs which are
too big for the others.

This can be used in conjunction with the :ref:`cache-fallback-storage`
storage to provide a more efficient cache layer, by caching blobs differently
based on their size. This allows a faster, size-limited storage like
:ref:`in-memory-storage` to be used for many small blobs, with larger blobs
being cached somewhere with more space.

.. _sql-index-storage:

Indexed CAS
~~~~~~~~~~~

Indexed CAS is a storage implementation which maintains an index of the
storage's contents, and hands the actual reading/writing off to another
backend. This index is used to speed up requests like ``FindMissingBlobs``,
by looking up blobs in the index rather than in a slower storage. The index
will also be used for handling cleanup of storages which don't have a
built-in mechanism for cleanup/expiry of blobs, since it can track when blobs
were last accessed.

ByteStream
----------

The ByteStream service is a generic API for writing/reading bytes to/from a
resource. BuildGrid uses it to write/read blobs to/from CAS, and as such a
ByteStream service should be deployed in the same server as the CAS.

It is also used by BuildGrid's LogStream service, to handle reading/writing
streams of logs. Any LogStream service also needs a ByteStream service in the
same server to function correctly.

Action Cache
------------

The Action Cache is a key/value store which maps Action digests to their
corresponding ActionResults. Internally it stores only the digest of the
result, and handles retrieving the full result message from the CAS.
BuildGrid's Action Cache can be configured to store this mapping either
in-memory, using Redis, or using the S3 API. Additionally, a remote Action
Cache can be specified, with queries made against that remote service.

Write-Once Action Cache
~~~~~~~~~~~~~~~~~~~~~~~

BuildGrid also has an Action Cache which only allows a given key to be
written once. This was added for testing purposes, but may be useful anywhere
that an immutable cache of Action results is needed.

Operations
----------

The Operations service is used to inspect the state of Actions currently
being executed by BuildGrid. It also handles cancellation of requested
Actions, and is normally deployed in the same place as the Execution service
(some tools expect it to be accessible at the same endpoint).

The Operations service can be used to either inspect specific Operations
(``GetOperation``) or list all Operations that BuildGrid knows about
(``ListOperations``). Note that BuildGrid currently retains knowledge of all
past Operations, so the list of Operations can get quite long. To deal with
this, Operations are returned in paginated responses, with each
``ListOperationsResponse`` containing a ``next_page_token`` to use to get the
next page of results.

ListOperations Filtering and Sorting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can filter the output of ``ListOperations`` by passing a string to the
``filter`` parameter; a client-side sketch using these filters follows at the
end of this section. A filter string looks like the following:

- ``completed_time > 2020-07-30T14:30:00 & stage = COMPLETED``

The supported parameters are:

- ``name`` (the operation name without the instance name prefix)
- ``stage`` (``UNKNOWN``, ``CACHE_CHECK``, ``QUEUED``, ``EXECUTING``, or ``COMPLETED``)
- ``queued_time`` (an ISO 8601 timestamp indicating the time the Action was queued)
- ``start_time`` (an ISO 8601 timestamp indicating the time work on the Action began)
- ``completed_time`` (an ISO 8601 timestamp indicating the time work on the Action completed)
- ``tool_name`` (the name of the tool used to send the Action)
- ``tool_version`` (the version of the tool used to send the Action)
- ``invocation_id`` (the invocation ID set by the tool used to send the Action; used to tie together multiple related Actions sent by the same invocation of the tool)
- ``correlated_invocations_id`` (the correlated invocations ID set by the tool used to send the Action; used to tie together multiple related invocations of the tool)

The supported operators are: ``=``, ``!=``, ``>``, ``>=``, ``<``, ``<=``.

You can also use a special ``sort_order`` parameter to adjust the order in
which results are displayed, like this:

- ``completed_time > 2020-07-30T14:30:00 & sort_order = completed_time``

Any of the filtering parameters above can be used as values for
``sort_order``. By default, ``sort_order`` indicates ascending order. You can
use ``(asc)`` or ``(desc)`` at the end of the value to explicitly call out
ascending or descending order, like this:

- ``completed_time > 2020-07-30T14:30:00 & sort_order = completed_time(asc)``
- ``completed_time > 2020-07-30T14:30:00 & sort_order = completed_time(desc)``

You can use multiple ``sort_order`` keys in the filter string. Each
subsequent ``sort_order`` key breaks ties among elements sorted by previous
keys.

- ``completed_time > 2020-07-30T14:30:00 & sort_order = stage & sort_order = queued_time``

The default filter is:

- ``stage != COMPLETED & sort_order = queued_time``
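As a sketch of issuing a filtered ``ListOperations`` request (the endpoint
and instance name are hypothetical, and the standard ``google.longrunning``
operations protos with gRPC stubs are assumed to be available, e.g. from
``googleapis-common-protos``):

.. code-block:: python

   import grpc
   from google.longrunning import operations_pb2, operations_pb2_grpc

   channel = grpc.insecure_channel("buildgrid.example.com:50051")
   stub = operations_pb2_grpc.OperationsStub(channel)

   request = operations_pb2.ListOperationsRequest(
       name="dev",  # the instance to list Operations for
       filter="completed_time > 2020-07-30T14:30:00 & sort_order = completed_time(desc)",
   )

   # Walk the paginated responses using next_page_token.
   while True:
       response = stub.ListOperations(request)
       for operation in response.operations:
           print(operation.name)
       if not response.next_page_token:
           break
       request.page_token = response.next_page_token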
Execution
---------

The Execution service implements the execution part of the Remote Execution
API. It receives Execute requests containing Action digests, and schedules
the Action for execution. Actions are prioritized first by their
``priority``, where smaller integers mean higher priority, and then by how
long the Action has been queued.

BuildGrid's Execution service has a pluggable scheduling component. Currently
there are two scheduler implementations: in-memory and SQL-based. The SQL
scheduler is tested with SQLite and PostgreSQL, but could theoretically work
with any database supported by SQLAlchemy. Production BuildGrid deployments
should use the SQL scheduler with PostgreSQL, to provide a reliable and
persistent job queue.

Bots
----

The Bots service implements the Remote Workers API. It handles assigning
queued Actions to workers, and reporting updates on their execution.

If the Execution service is using an in-memory scheduler, the Bots service
needs to be deployed in the same server. However, using an SQL scheduler
allows the Bots service to be deployed independently, as long as it uses the
same database as the Execution service.

LogStream
---------

The LogStream service implements the LogStream API. In a BuildGrid context,
this provides a mechanism for workers to stream logs to interested clients
whilst the build is in progress. The client doesn't necessarily need to be
the tool which made the Execute request; the resource name used to read the
stream can be obtained using the Operations API.

The LogStream service just handles creating the actual stream resource;
reading from and writing to the stream uses the ByteStream API. This means
that any config including a LogStream service also needs a ByteStream service
to function correctly.

Use of the LogStream service isn't limited to streaming build logs from a
BuildBox worker; the buildbox-tools repository provides `tooling`_ for
writing to a stream generically, which could be reused for other purposes.
The LogStream service is also completely independent of the rest of BuildGrid
(except for the ByteStream service used for read/write access), and so can be
used in situations with no need for the rest of the remote execution/caching
functionality. An example LogStream-only deployment is provided in this
`docker-compose example`_.

.. _tooling: https://gitlab.com/BuildGrid/buildbox/buildbox-tools/-/tree/master/cpp/outputstreamer
.. _docker-compose example: https://gitlab.com/BuildGrid/buildgrid/-/tree/master/data/docker-compose-examples/logstream.yaml
.. _grpc_types.h: https://github.com/grpc/grpc/blob/master/include/grpc/impl/codegen/grpc_types.h

Build Events Stream
-------------------

The ``PublishBuildEvents`` service implements the `Build Event Protocol`_.
This protocol is used by Bazel to publish lifecycle events and build
information to help with future debugging. The implementation in BuildGrid
directly supports this Bazel use case, but is also usable with any other
non-Bazel event streams using the same protocol.

.. _Build Event Protocol: https://docs.bazel.build/versions/4.0.0/build-event-protocol.html

The ``QueryBuildEvents`` service implements a custom BuildGrid-specific proto
which allows these event streams to be retrieved from the server after
completion of the related build. This service supports querying the set of
streams by regex on the stream ID, which allows easy retrieval of all streams
related to a specific build.

Stream IDs internally are of the format ``build_id.component.invocation_id``.
Elsewhere in BuildGrid this ``build_id`` is called the
``correlated invocations ID``. Values for the ``component`` part are defined
in the ``build_events`` proto file, as part of the Build Event Protocol
itself. An example query to get all streams for a specific build/correlated
invocations ID would be ``75e9ee07-9a1c-4a80-aa05-13c377c5a1f3\..*\..*``.
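As a minimal sketch of what such a query matches (the stream IDs and
``component`` values below are hypothetical), checking stream IDs against the
regex looks like this:

.. code-block:: python

   import re

   # The example query from above: every stream belonging to one
   # build/correlated invocations ID, with any component and invocation ID.
   pattern = re.compile(r"75e9ee07-9a1c-4a80-aa05-13c377c5a1f3\..*\..*")

   stream_ids = [
       "75e9ee07-9a1c-4a80-aa05-13c377c5a1f3.bazel.0f1a2b3c",
       "1870e065-40a9-4ed6-b7c9-5e6bc4fe51f0.bazel.9d8e7f6a",
   ]

   # Only the first stream ID matches; the second belongs to a different build.
   matching = [s for s in stream_ids if pattern.fullmatch(s)]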