Components
==========

BuildGrid is made up of a number of components which work together to provide
client-agnostic remote caching and remote execution functionality. These
components can be deployed independently if only a subset of the services is
needed for your use case.

For detail on the APIs provided by the services, see :ref:`external-resources`.

.. graphviz::
   :align: center

   digraph buildgrid_overview {
       bgcolor="#fcfcfc";
       graph [fontsize=14 fontname="Verdana" compound=true];
       node [shape=box fontsize=10 fontname="Verdana"];
       edge [fontsize=10 fontname="Verdana"];
       label="BuildGrid Deployment Example";
       labelloc=top;

       subgraph cluster_bgd_cas {
           label="CAS service";
           fontsize=10;
           cas [ label="CAS" ];
           bytestream [ label="ByteStream" ];
       }

       subgraph cluster_bgd_ac {
           label="Action Cache service";
           fontsize=10;
           action_cache [ label="Action Cache" ];
       }

       subgraph cluster_bgd_execution {
           label="Execution service";
           fontsize=10;
           execution [ label="Execution" ];
           operations [ label="Operations" ];
       }

       subgraph cluster_bgd_bots {
           label="Bots service";
           fontsize=10;
           bots [ label="Bots" ];
       }

       {cas execution operations bots} -> sql;
       {cas bytestream action_cache} -> s3;

       sql [ label="PostgreSQL (configurable)" ];
       s3 [ label="S3 (configurable)" ];
   }

CAS
---

The CAS, or **C**\ ontent **A**\ ddressable **S**\ torage, is a service which
stores blobs and can retrieve them using the "digest" of the blobs themselves.
A digest here is a pair of the hash of the content and the size of the blob.

The CAS can be used to store and retrieve arbitrary blobs, but in BuildGrid it
is used in particular for input and output files, gRPC messages (such as the
Actions sent by clients and the corresponding ActionResults), and the
stdout/stderr from Action execution. In a remote caching only deployment, the
CAS stores the actual cached blobs.
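As a minimal illustration of the digest addressing scheme (not BuildGrid's
actual implementation; SHA-256 is assumed here as the configured hash
function), a digest can be computed like this:

.. code-block:: python

   import hashlib

   def compute_digest(blob: bytes) -> tuple[str, int]:
       """Return the (hash, size) pair used to address a blob in CAS."""
       return hashlib.sha256(blob).hexdigest(), len(blob)

   # Identical content always yields an identical digest, so blobs are
   # naturally deduplicated by the store.
   digest = compute_digest(b"hello world")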
BuildGrid's CAS implementation supports a number of storage backends, along
with some more complex composite options.

.. _in-memory-storage:

In-memory
~~~~~~~~~

This stores blobs in-memory, which is fast but clearly limits both the number
of blobs that can be stored and the size those blobs can be. This is most
useful for testing, or as the cache part of a two-level CAS (see
:ref:`cache-fallback-storage`). If adding a new blob would make the CAS full,
then old blobs are deleted on a least-recently-used basis.

.. _disk-storage:

Local Disk
~~~~~~~~~~

This stores blobs in a directory on the CAS machine's local disk. This is
slower than the in-memory storage, but isn't limited in the size or number of
blobs beyond the capacity of the disk itself. There is currently no internal
mechanism to clean up this storage, but work is ongoing to implement a cleanup
command to work alongside :ref:`sql-index-storage` which will be able to
handle this.

.. _redis-storage:

Redis
~~~~~

This stores blobs in a Redis key/value store. This also has no enforced
limitations on blob count and size, though it is probably unwise to use it
for very large blobs.

.. _s3-storage:

S3
~~

This storage backend stores blobs using the AWS S3 API. It should be
compatible with anything which exposes the S3 API, from AWS itself to other
object storage implementations like Ceph or Swift. There is currently no
internal mechanism to clean up this storage, but work is ongoing to implement
a cleanup command to work alongside :ref:`sql-index-storage` which will be
able to handle this.

.. _remote-storage:

Remote
~~~~~~

This storage backend looks for the requested blobs in another remote gRPC
server. This is especially useful for connecting a BuildGrid Execution
service with a remote BuildGrid CAS, or to use another CAS implementation
from BuildGrid.

The gRPC connection to these remote services can be configured using the
``channel-options`` config option, which takes multiple key-value options,
where the keys are the names of the channel options without the ``grpc.``
prefix and with all ``_`` replaced with ``-``. See `grpc_types.h`_ for the
list of channel options.
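As a sketch of this key mapping (the endpoint and option value here are
hypothetical), the config key ``max-receive-message-length`` corresponds to
the ``grpc.max_receive_message_length`` channel option:

.. code-block:: python

   import grpc

   def to_channel_option(key: str, value: int) -> tuple[str, int]:
       # Restore the "grpc." prefix and replace "-" with "_" to recover
       # the channel option name as listed in grpc_types.h.
       return ("grpc." + key.replace("-", "_"), value)

   # Roughly equivalent to setting channel-options in a BuildGrid config.
   channel = grpc.insecure_channel(
       "remote-cas.example.com:50051",
       options=[to_channel_option("max-receive-message-length", 16 * 1024 * 1024)],
   )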
.. _cache-fallback-storage:

Cache + Fallback
~~~~~~~~~~~~~~~~

This is an implementation of BuildGrid's storage API which handles writing
blobs to multiple other storage implementations. It is used to provide a
cache layer for speed on top of a slower but persistent storage, such as S3.

This storage type can also optionally defer the write to the fallback
storage. This allows write requests to return once the write to the cache
layer completes, which is potentially much faster than writing to the
fallback. However, this approach is not safe in all circumstances; it
requires that the cache layer can reliably be expected to contain anything
written to it for at least the duration of the related build. As such, it
shouldn't be used with a small cache, or with a cache that isn't shared
amongst instances in a multi-BuildGrid deployment.

.. _size-differentiated-storage:

Size Differentiated
~~~~~~~~~~~~~~~~~~~

This is a storage provider which is intended to wrap two or more other
storages. It takes a list of storages, each paired with the maximum blob size
allowed in that storage, and a fallback storage to handle any blobs which are
too big for the others.

This can be used in conjunction with the :ref:`cache-fallback-storage`
storage to provide a more efficient cache layer, by caching blobs differently
based on their size. This allows a faster, size-limited storage like
:ref:`in-memory-storage` to be used for many small blobs, with larger blobs
being cached somewhere with more space.

.. _sql-index-storage:

Indexed CAS
~~~~~~~~~~~

Indexed CAS is a storage implementation which maintains an index of the
storage's contents, and hands the actual reading/writing off to another
backend. This index is used to speed up requests like ``FindMissingBlobs``,
by looking up blobs in the index rather than in a slower storage. The index
will also be used for handling cleanup of storages which don't have a
built-in mechanism for cleanup/expiry of blobs, since it can track when blobs
were last accessed.

ByteStream
----------

The ByteStream service is a generic API for writing/reading bytes to/from a
resource. BuildGrid uses it to write/read blobs to/from CAS, and as such a
ByteStream service should be deployed in the same server as the CAS.

It is also used by BuildGrid's LogStream service, to handle reading/writing
streams of logs. Any LogStream service also needs a ByteStream service in the
same server to function correctly.

Action Cache
------------

The Action Cache is a key/value store which maps Action digests to their
corresponding ActionResults. Internally it stores only the digest of the
result, and handles retrieving the full result message from the CAS.
BuildGrid's Action Cache can be configured to store this mapping either
in-memory, using Redis, or using the S3 API. Additionally, a remote Action
Cache can be specified, with queries made against that remote service.

Write-Once Action Cache
~~~~~~~~~~~~~~~~~~~~~~~

BuildGrid also has an Action Cache which only allows a given key to be
written once. This was added for testing purposes, but may be useful anywhere
that an immutable cache of Action results is needed.

Operations
----------

The Operations service is used to inspect the state of Actions currently
being executed by BuildGrid. It also handles cancellation of requested
Actions, and is normally deployed in the same place as the Execution service
(some tools expect it to be accessible at the same endpoint).

The Operations service can be used to either inspect specific Operations
(``GetOperation``) or list all Operations that BuildGrid knows about
(``ListOperations``). Note that BuildGrid currently retains knowledge of all
past Operations, so the list of Operations can get quite long. To deal with
this, Operations are returned in paginated responses, with each
``ListOperationsResponse`` containing a ``next_page_token`` to use to get the
next page of results.

ListOperations Filtering and Sorting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can filter the output of ``ListOperations`` by passing a string to the
``filter`` parameter; a client-side sketch using these filters follows at the
end of this section. A filter string looks like the following:

- ``completed_time > 2020-07-30T14:30:00 & stage = COMPLETED``

The supported parameters are:

- ``name`` (the operation name without the instance name prefix)
- ``stage`` (``UNKNOWN``, ``CACHE_CHECK``, ``QUEUED``, ``EXECUTING``, or ``COMPLETED``)
- ``queued_time`` (an ISO 8601 timestamp indicating the time the Action was queued)
- ``start_time`` (an ISO 8601 timestamp indicating the time work on the Action began)
- ``completed_time`` (an ISO 8601 timestamp indicating the time work on the Action completed)
- ``tool_name`` (the name of the tool used to send the Action)
- ``tool_version`` (the version of the tool used to send the Action)
- ``invocation_id`` (the invocation ID set by the tool used to send the Action; used to tie together multiple related Actions sent by the same invocation of the tool)
- ``correlated_invocations_id`` (the correlated invocations ID set by the tool used to send the Action; used to tie together multiple related invocations of the tool)

The supported operators are: ``=``, ``!=``, ``>``, ``>=``, ``<``, ``<=``.

You can also use a special ``sort_order`` parameter to adjust the order in
which results are displayed, like this:

- ``completed_time > 2020-07-30T14:30:00 & sort_order = completed_time``

Any of the filtering parameters above can be used as values for
``sort_order``. By default, ``sort_order`` indicates ascending order. You can
use ``(asc)`` or ``(desc)`` at the end of the value to explicitly call out
ascending or descending order, like this:

- ``completed_time > 2020-07-30T14:30:00 & sort_order = completed_time(asc)``
- ``completed_time > 2020-07-30T14:30:00 & sort_order = completed_time(desc)``

You can use multiple ``sort_order`` keys in the filter string. Each
subsequent ``sort_order`` key breaks ties among elements sorted by previous
keys.

- ``completed_time > 2020-07-30T14:30:00 & sort_order = stage & sort_order = queued_time``

The default filter is:

- ``stage != COMPLETED & sort_order = queued_time``
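As a sketch of issuing a filtered ``ListOperations`` request (the endpoint
and instance name are hypothetical, and the standard ``google.longrunning``
operations protos with gRPC stubs are assumed to be available, e.g. from
``googleapis-common-protos``):

.. code-block:: python

   import grpc
   from google.longrunning import operations_pb2, operations_pb2_grpc

   channel = grpc.insecure_channel("buildgrid.example.com:50051")
   stub = operations_pb2_grpc.OperationsStub(channel)

   request = operations_pb2.ListOperationsRequest(
       name="dev",  # the instance to list Operations for
       filter="completed_time > 2020-07-30T14:30:00 & sort_order = completed_time(desc)",
   )

   # Walk the paginated responses using next_page_token.
   while True:
       response = stub.ListOperations(request)
       for operation in response.operations:
           print(operation.name)
       if not response.next_page_token:
           break
       request.page_token = response.next_page_token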
Execution
---------

The Execution service implements the execution part of the Remote Execution
API. It receives Execute requests containing Action digests, and schedules
the Action for execution. Actions are prioritized first by their
``priority``, where smaller integers mean higher priority, and then by how
long the Action has been queued.

BuildGrid's Execution service has a pluggable scheduling component. Currently
there are two scheduler implementations: in-memory and SQL-based. The SQL
scheduler is tested with SQLite and PostgreSQL, but could theoretically work
with any database supported by SQLAlchemy. Production BuildGrid deployments
should use the SQL scheduler with PostgreSQL, to provide a reliable and
persistent job queue.

Bots
----

The Bots service implements the Remote Workers API. It handles assigning
queued Actions to workers, and reporting updates on their execution.

If the Execution service is using an in-memory scheduler, the Bots service
needs to be deployed in the same server. However, using an SQL scheduler
allows the Bots service to be deployed independently, as long as it uses the
same database as the Execution service.

LogStream
---------

The LogStream service implements the LogStream API. In a BuildGrid context,
this provides a mechanism for workers to stream logs to interested clients
whilst the build is in progress. The client doesn't necessarily need to be
the tool which made the Execute request; the resource name used to read the
stream can be obtained using the Operations API.

The LogStream service just handles creating the actual stream resource;
reading from and writing to the stream uses the ByteStream API. This means
that any config including a LogStream service also needs a ByteStream service
to function correctly.

Use of the LogStream service isn't limited to streaming build logs from a
BuildBox worker; the buildbox-tools repository provides `tooling`_ for
writing to a stream generically, which could be reused for other purposes.
The LogStream service is also completely independent of the rest of BuildGrid
(except for the ByteStream service used for read/write access), and so can be
used in situations with no need for the rest of the remote execution/caching
functionality. An example LogStream-only deployment is provided in this
`docker-compose example`_.

.. _tooling: https://gitlab.com/BuildGrid/buildbox/buildbox-tools/-/tree/master/cpp/outputstreamer
.. _docker-compose example: https://gitlab.com/BuildGrid/buildgrid/-/tree/master/data/docker-compose-examples/logstream.yaml
.. _grpc_types.h: https://github.com/grpc/grpc/blob/master/include/grpc/impl/codegen/grpc_types.h

Build Events Stream
-------------------

The ``PublishBuildEvents`` service implements the `Build Event Protocol`_.
This protocol is used by Bazel to publish lifecycle events and build
information to help with future debugging. The implementation in BuildGrid
directly supports this Bazel use case, but is also usable with any other
non-Bazel event streams using the same protocol.

.. _Build Event Protocol: https://docs.bazel.build/versions/4.0.0/build-event-protocol.html

The ``QueryBuildEvents`` service implements a custom BuildGrid-specific proto
which allows these event streams to be retrieved from the server after
completion of the related build. This service supports querying the set of
streams by regex on the stream ID, which allows easy retrieval of all streams
related to a specific build.

Stream IDs internally are of the format ``build_id.component.invocation_id``.
Elsewhere in BuildGrid this ``build_id`` is called the
``correlated invocations ID``. Values for the ``component`` part are defined
in the ``build_events`` proto file, as part of the Build Event Protocol
itself. An example query to get all streams for a specific build/correlated
invocations ID would be ``75e9ee07-9a1c-4a80-aa05-13c377c5a1f3\..*\..*``.
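As a minimal sketch of what such a query matches (the stream IDs and
``component`` values below are hypothetical), checking stream IDs against the
regex looks like this:

.. code-block:: python

   import re

   # The example query from above: every stream belonging to one
   # build/correlated invocations ID, with any component and invocation ID.
   pattern = re.compile(r"75e9ee07-9a1c-4a80-aa05-13c377c5a1f3\..*\..*")

   stream_ids = [
       "75e9ee07-9a1c-4a80-aa05-13c377c5a1f3.bazel.0f1a2b3c",
       "1870e065-40a9-4ed6-b7c9-5e6bc4fe51f0.bazel.9d8e7f6a",
   ]

   # Only the first stream ID matches; the second belongs to a different build.
   matching = [s for s in stream_ids if pattern.fullmatch(s)]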