Configuration
=============

.. _manual-configuration:

Manually deploying a BuildGrid
------------------------------

Before anything else, you need a PostgreSQL database available with the
migrations from ``data/revisions/all.sql`` applied.
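
As a minimal sketch, the migrations can be applied with ``psql``. The
``apply_migrations`` helper and the connection URI in the example are
illustrative names, not part of BuildGrid, and the path assumes the command is
run from the directory that contains ``data/revisions``.

.. code-block:: sh

    # Hypothetical helper: apply the BuildGrid schema to an existing database.
    # $1 is a PostgreSQL connection URI, $2 is the path to all.sql.
    apply_migrations() {
        # ON_ERROR_STOP makes psql abort on the first failing statement
        # instead of carrying on with a half-applied schema.
        psql "$1" -v ON_ERROR_STOP=1 -f "$2"
    }

    # Example (assumed values):
    #   apply_migrations "postgresql://bgd:secret@localhost:5432/bgd" data/revisions/all.sql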

Configuration File
~~~~~~~~~~~~~~~~~~

To get started, use ``buildgrid/data/config/all-in-one.yml`` as an example
configuration.

Copy the contents of ``buildgrid/data/config/all-in-one.yml`` into a file
called ``config.yml``, and edit the ``connection-string`` option to point to
your database.
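
The copy step might look like the sketch below. ``prepare_config`` is a
hypothetical helper; it only copies the file and prints the lines that mention
``connection-string`` so they can be edited by hand, since the exact layout of
the example file is not reproduced here.

.. code-block:: sh

    # Hypothetical helper: copy the example config ($1) to a new file ($2)
    # and list the line numbers where connection-string appears.
    prepare_config() {
        cp "$1" "$2" && grep -n 'connection-string' "$2"
    }

    # Example:
    #   prepare_config buildgrid/data/config/all-in-one.yml config.yml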

To start BuildGrid with this configuration, run:

.. code-block:: sh

    bgd server start --verbose /path/to/config.yml

See the `reference-configuration`_ section to learn more about this file.
For now, we will continue setting up BuildGrid for work.

Setting up a bot
~~~~~~~~~~~~~~~~

Now we will need a worker. The recommended worker to use with BuildGrid is `buildbox-worker`_.
This worker works best when used alongside a local CAS cache called `buildbox-casd`_. First,
build these tools following the instructions in their READMEs.

Then, start the CAS cache.

.. code-block:: sh

    buildbox-casd --cas-remote=http://localhost:50051 --bind=127.0.0.1:50011 ~/casd &

Once CASD is running we can start the worker itself, pointing it to CASD for CAS requests.

.. code-block:: sh

    buildbox-worker --buildbox-run=buildbox-run-hosttools --bots-remote=http://localhost:50051 \
        --cas-remote=http://127.0.0.1:50011 --request-timeout=30 my_bot

We should be able to see this worker connecting as log messages for ``CreateBotSession`` and
``UpdateBotSession`` requests in the server logs.
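
One way to check for these messages is to capture the server output to a file
and search it. This is a sketch only: the ``server.log`` file name and the
``count_bot_sessions`` helper are assumptions, and the match relies on the
request names alone rather than on any particular log format.

.. code-block:: sh

    # Count bot session requests recorded in a captured server log.
    # One possible way to capture the log (assumed, not required):
    #   bgd server start --verbose /path/to/config.yml 2>&1 | tee server.log
    count_bot_sessions() {
        grep -cE 'CreateBotSession|UpdateBotSession' "$1"
    }

    # Example:
    #   count_bot_sessions server.log

A non-zero count suggests the worker has reached the server.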

.. _buildbox-worker: https://gitlab.com/BuildGrid/buildbox/buildbox/-/blob/master/worker/
.. _buildbox-casd: https://gitlab.com/BuildGrid/buildbox/buildbox/-/blob/master/casd/

Without CASD
''''''''''''

.. warning::
    Whilst this approach has fewer moving parts, it **will** make your build slower due to
    needing to freshly fetch the input root for every Action rather than keeping a local
    cache. With large input roots, this will completely wipe out any benefits gained by
    using remote execution. Production deployments should use ``buildbox-casd``.

``buildbox-worker`` supports running without ``buildbox-casd`` by pointing it to the remote CAS
rather than the local CASD, although this isn't recommended due to the additional network load
it causes. When running in this configuration, it's important to tell the runner command
not to use the LocalCAS protocol to stage the input root.

.. code-block:: sh

    buildbox-worker --buildbox-run=buildbox-run-hosttools --bots-remote=http://localhost:50051 \
        --cas-remote=http://localhost:50051 --request-timeout=30 --runner-arg=--disable-localcas my_bot

.. _reference-configuration:

Reference configuration
-----------------------

Below is the full reference configuration:

.. literalinclude:: ../../../buildgrid/server/app/settings/reference.yml
   :language: yaml

See the :ref:`Parser API reference <server-config-parser>` for details on the
tagged YAML nodes in this configuration.

.. _deployment-architecture:

Deployment Architecture
-----------------------

BuildGrid is designed for flexibility in deployment topology. It can be
configured with any combination of the supported services in a single server
configuration.

Due to BuildGrid's use of a thread pool for handling gRPC requests, along with
the Python `GIL`_, it is sensible to split up services into several processes
to scale concurrent connection counts. With the exception of the Build Events
related services, each service is horizontally scalable to support running
multiple processes for the same service across multiple machines.

The recommended split is as follows:

1. Action Cache, ByteStream, and CAS
2. Execution, Operations, and Introspection
3. BotsInterface

.. _GIL: https://docs.python.org/3/glossary.html#term-global-interpreter-lock
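
As a rough sketch of the split above, each group of services can run as its own
``bgd server`` process with its own configuration file. The ``start_split``
helper and the three file names are hypothetical; in a real deployment each
process would more likely be managed by a service supervisor or container
orchestrator.

.. code-block:: sh

    # Hypothetical helper: start one `bgd server` process per config file
    # in the background, then wait for all of them to exit.
    start_split() {
        for cfg in "$@"; do
            bgd server start --verbose "$cfg" &
        done
        wait
    }

    # Example (assumed file names, one per group in the recommended split):
    #   start_split cas.yml execution.yml bots.yml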

Action Cache, ByteStream, and CAS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. graphviz::
   :align: center

    digraph cas_process_config {

      rankdir = LR;
      bgcolor = "#fcfcfc";

      graph [
        fontname = "Verdana",
        fontsize = 10,
      ];

      node [
        style = filled,
        shape = box,
        fontname = "Verdana",
        fontsize = 10
      ];

      subgraph cluster_cas {
        bgcolor = "#eaeaea";
        style = "dashed";

        subgraph cluster_storage {
            color = "#f4f4f4";
            style = filled;
            label = "Storage backends"

            node [
                fillcolor = "#bbf0c3",
                fontcolor = "#294a2e",
                color = "#294a2e"
            ];
            
            edge [
                color = "#294a2e"
            ];

            CAS [
                label=<
                
        <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
          <TR>
            <TD colspan="2" port="sharded">ShardedStorage</TD>
          </TR>
          <TR>
            <TD colspan="2" port="redis">RedisIndex</TD>
          </TR>
          <TR>
            <TD colspan="2">SizeDifferentiatedStorage</TD>
          </TR>
          <TR>
            <TD port="sql">SQLStorage</TD>
            <TD port="s3">S3Storage</TD>
          </TR>
        </TABLE>>,
            ];
        }

        subgraph cluster_caches {
            color = "#f4f4f4";
            style = filled;
            label = "Cache backends"

            node [
                fillcolor = "#bbf0c3",
                fontcolor = "#294a2e",
                color = "#294a2e"
            ];
            
            edge [
                color = "#294a2e"
            ];

            caches [
                label=<
                
        <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
          <TR>
            <TD colspan="2" port="sharded">ShardedActionCache</TD>
          </TR>
          <TR>
            <TD colspan="2" port="redis">RedisActionCache</TD>
          </TR>
        </TABLE>>,
            ];
        }

        subgraph cluster_services {
            color = "#f4f4f4";
            style = filled;
            label = "gRPC Services"

            node [
                color = lightgrey
            ];

            ByteStream [
                label = "ByteStream"
            ];
            
            cas [ 
                label = "CAS"
            ];
            
            actioncache [
                label = "Action Cache"
            ];
        }

        label = "`bgd server` process";
      }

      S3 [
        shape = "cylinder"
      ];
      PostgreSQL [
        shape = "cylinder"
      ];
      Redis [
        shape = "cylinder"
      ];

      caches:redis -> Redis;
      CAS:redis -> Redis;
      CAS:sql -> PostgreSQL;
      CAS:s3 -> S3;
      
      cas -> CAS:sharded;
      ByteStream -> CAS:sharded;
      actioncache -> caches:sharded;
      
      {rank=same Redis PostgreSQL S3}
    }

This configuration specifies all the services needed for cache-only usage.

The exact choice of storage backends to use is dependent on your expected
workloads and availability of other services. Using an index somewhere in
the stack is strongly recommended to support handling ``FindMissingBlobs``
without querying the actual storage. The Redis index used in this example is
more performant than the SQL index, but likely requires sharding to scale to
production workloads.

As build workflows often involve many small blobs and a small number of much
larger blobs, it can be beneficial to store smaller blobs in a faster storage
location. In this example we use ``SizeDifferentiatedStorage`` to direct small
blobs to an ``SQLStorage``, whilst large blobs are stored in a slower but
significantly larger ``S3Storage``.

It may be beneficial to add a cache layer using ``WithCacheStorage`` for
particularly slow storage backends, although this backend doesn't support
any special routing to reduce duplication of storage.

The Action Cache backends are separate to the storage backends, though they can
reference them. Again the choice here depends on service availability and
desired cache behaviour. An ``S3ActionCache`` will be slow but large and
resilient, whereas an ``LRUActionCache`` will be fast but small and
short-lived. The ``RedisActionCache`` used here provides a middle-ground and
is generally the best option for most workloads.


Execution, Operations, and Introspection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. graphviz::
   :align: center

    digraph execution_process_config {

      rankdir = LR;
      bgcolor = "#fcfcfc";

      graph [
        fontname = "Verdana",
        fontsize = 10,
      ];

      node [
        style = filled,
        shape = box,
        fontname = "Verdana",
        fontsize = 10
      ];

      subgraph cluster_cas {
        bgcolor = "#eaeaea";
        style = "dashed";

        subgraph cluster_storage {
            color = "#f4f4f4";
            style = filled;
            label = "Storage backends"

            node [
                fillcolor = "#bbf0c3",
                fontcolor = "#294a2e",
                color = "#294a2e"
            ];
            
            edge [
                color = "#294a2e"
            ];

            cas_remote [
                label = "Remote"
            ];
        }

        subgraph cluster_caches {
            color = "#f4f4f4";
            style = filled;
            label = "Cache backends"

            node [
                fillcolor = "#bbf0c3",
                fontcolor = "#294a2e",
                color = "#294a2e"
            ];
            
            edge [
                color = "#294a2e"
            ];

            cache_remote [
                label = "RemoteActionCache"
            ];
        }

        subgraph cluster_schedulers {
            color = "#f4f4f4";
            style = filled;
            label = "Schedulers"

            node [
              fillcolor = "#bbf0c3",
              fontcolor = "#294a2e"
              color = "#294a2e"
            ];

            scheduler [
              label = "Scheduler"
            ];
        }

        subgraph cluster_connections {
            color = "#f4f4f4";
            style = filled;
            label = "Connections"

            node [
              fillcolor = "#bbf0c3",
              fontcolor = "#294a2e"
              color = "#294a2e"
            ];

            sql_writeable [
              label = "SQL (read/write)"
            ];
            sql_read_only [
              label = "SQL (read-only)"
            ];
            sql_listen_notify [
              label = "SQL (notifiers)"
            ];
        }

        subgraph cluster_services {
            color = "#f4f4f4";
            style = filled;
            label = "gRPC Services"

            node [
              color = lightgrey
            ];

            execution [
              label = "Execution"
            ];

            operations [
              label = "Operations"
            ];
            
            cas [ 
              label = "CAS"
            ];

            introspection [
              label = "Introspection"
            ];
        }

        label = "`bgd server` process";
      }

      PostgreSQL [
        shape = "cylinder"
      ];
      BuildGridCAS [
        label = "BuildGrid CAS";
        shape = "cylinder";
      ];

      cache_remote -> BuildGridCAS;
      cas_remote -> BuildGridCAS;
      
      cas -> cas_remote;
      execution -> scheduler;
      operations -> scheduler;
      introspection -> scheduler;
  
      scheduler -> cas_remote;
      scheduler -> sql_writeable;
      scheduler -> sql_read_only;
      scheduler -> sql_listen_notify;
      scheduler -> cache_remote;

      sql_writeable -> PostgreSQL;
      sql_read_only -> PostgreSQL;
      sql_listen_notify -> PostgreSQL;
      
      {rank=same PostgreSQL BuildGridCAS}
    }

This configuration specifies the services needed to support the client-side
parts of remote execution.

The Execution service uses its configured ``Scheduler`` to queue incoming jobs
in the database for assignment. When using PostgreSQL as in this example, the
``Scheduler`` will use LISTEN/NOTIFY to listen for updates to job state, which
will be reported back to clients by the Execution service.
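
For readers unfamiliar with the mechanism, the sketch below exercises
PostgreSQL's LISTEN/NOTIFY from a single ``psql`` session. The ``notify_demo``
helper, the channel name, and the payload are placeholders chosen for
illustration; they are not the channels BuildGrid itself uses.

.. code-block:: sh

    # Hypothetical demo: subscribe to a channel, then publish to it from the
    # same session. $1 is a PostgreSQL connection URI.
    notify_demo() {
        psql "$1" \
            -c "LISTEN demo_job_updates" \
            -c "NOTIFY demo_job_updates, 'job state changed'"
    }

    # Example (assumed connection URI):
    #   notify_demo "postgresql://bgd:secret@localhost:5432/bgd"

``psql`` reports the asynchronous notification once the ``NOTIFY`` command
completes, which mirrors how a listening ``Scheduler`` is woken when job state
changes rather than polling the table.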

The Operations service and Introspection service are mainly for querying for
information regarding the current internal state of BuildGrid. The Operations
service is also used for requesting cancellation of a previously queued job.

All three of these services use the ``Scheduler``, which in turn uses up to
three different SQL connection configurations. This allows, for example,
sending read-only traffic to a read-only database replica, or using an external
connection pool such as PgBouncer for regular queries whilst maintaining an
in-process pool for the long-running connections used for LISTEN/NOTIFY.

The ``Scheduler`` also needs access to a cache backend and a storage backend.
The ``RemoteActionCache`` and ``RemoteStorage`` backends exist to support
splitting the configuration like this, and in this example should be pointed to
a BuildGrid running the Action Cache, ByteStream, and CAS configuration above.

BotsInterface
~~~~~~~~~~~~~

.. graphviz::
   :align: center

    digraph bots_process_config {

      rankdir = LR;
      bgcolor = "#fcfcfc";

      graph [
        fontname = "Verdana",
        fontsize = 10,
      ];

      node [
        style = filled,
        shape = box,
        fontname = "Verdana",
        fontsize = 10
      ];

      subgraph cluster_cas {
        bgcolor = "#eaeaea";
        style = "dashed";

        subgraph cluster_storage {
            color = "#f4f4f4";
            style = filled;
            label = "Storage backends"

            node [
                fillcolor = "#bbf0c3",
                fontcolor = "#294a2e",
                color = "#294a2e"
            ];
            
            edge [
                color = "#294a2e"
            ];

            cas_remote [
                label = "Remote"
            ];
        }

        subgraph cluster_caches {
            color = "#f4f4f4";
            style = filled;
            label = "Cache backends"

            node [
                fillcolor = "#bbf0c3",
                fontcolor = "#294a2e",
                color = "#294a2e"
            ];
            
            edge [
                color = "#294a2e"
            ];

            cache_remote [
                label = "RemoteActionCache"
            ];
        }

        subgraph cluster_schedulers {
            color = "#f4f4f4";
            style = filled;
            label = "Schedulers"

            node [
              fillcolor = "#bbf0c3",
              fontcolor = "#294a2e"
              color = "#294a2e"
            ];

            scheduler [
              label = "Scheduler"
            ];
        }

        subgraph cluster_connections {
            color = "#f4f4f4";
            style = filled;
            label = "Connections"

            node [
              fillcolor = "#bbf0c3",
              fontcolor = "#294a2e"
              color = "#294a2e"
            ];

            sql_writeable [
              label = "SQL (read/write)"
            ];
            sql_read_only [
              label = "SQL (read-only)"
            ];
            sql_listen_notify [
              label = "SQL (notifiers)"
            ];
        }

        subgraph cluster_services {
            color = "#f4f4f4";
            style = filled;
            label = "gRPC Services"

            node [
              color = lightgrey
            ];

            bots [
              label = "Bots"
            ];
        }

        label = "`bgd server` process";
      }

      PostgreSQL [
        shape = "cylinder"
      ];
      BuildGridCAS [
        label = "BuildGrid CAS";
        shape = "cylinder";
      ];

      cache_remote -> BuildGridCAS;
      cas_remote -> BuildGridCAS;
      
      bots -> scheduler;
  
      scheduler -> cas_remote;
      scheduler -> sql_writeable;
      scheduler -> sql_read_only;
      scheduler -> sql_listen_notify;
      scheduler -> cache_remote;

      sql_writeable -> PostgreSQL;
      sql_read_only -> PostgreSQL;
      sql_listen_notify -> PostgreSQL;
      
      {rank=same PostgreSQL BuildGridCAS}
    }

This configuration is just for a BotsInterface, the server side of the Remote
Workers API (RWAPI).

It is very similar to the Execution configuration, using a ``Scheduler`` which
has all the same configuration options as before. This ``Scheduler`` also has
configuration for an assigner thread, which periodically fetches the next job
in the queue and attempts to assign it to an available bot. This thread could
be in the Execution process instead, with the same functionality.

A ``Scheduler`` used just for the RWAPI side like this also uses LISTEN/NOTIFY
when given a PostgreSQL database. In this case it is used to listen for
assignment of work to a connected bot. Bots can long-poll when sending
``UpdateBotSession`` and ``CreateBotSession`` requests, and the request returns
as soon as work is assigned.