Configuration

Manually deploying a BuildGrid

To get anything done, you first need a PostgreSQL database with the migrations from data/revisions/all.sql applied.

Configuration File

To get started, use buildgrid/data/config/all-in-one.yml as an example configuration.

Copy the contents of buildgrid/data/config/all-in-one.yml into a file called config.yml, and edit the connection-string option to point to your database.
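Assuming the all-in-one example follows the same shape as the reference configuration later on this page, the edited part of config.yml might look like this (the host, credentials, and database name are placeholders for your own):

```yaml
connections:
  - !sql-connection &sql
    # Point this at the PostgreSQL database you applied the migrations to.
    connection-string: postgresql://bgd:insecure@localhost:5432/bgd
```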

To start BuildGrid with this configuration, run:

bgd server start --verbose /path/to/config.yml

See the Reference configuration section below to learn more about this file. For now, we will continue by setting up a worker.

Setting up a bot

Now we will need a worker. The recommended worker to use with BuildGrid is buildbox-worker. It works best when paired with a local CAS cache, buildbox-casd. First, build these tools following the instructions in their READMEs.

Then, start the CAS cache.

buildbox-casd --cas-remote=http://localhost:50051 --bind=127.0.0.1:50011 ~/casd &

Once CASD is running, we can start the worker itself, pointing it at CASD for CAS requests.

buildbox-worker --buildbox-run=buildbox-run-hosttools --bots-remote=http://localhost:50051 \
    --cas-remote=http://127.0.0.1:50011 --request-timeout=30 my_bot

You should see this worker connect, showing up as CreateBotSession and UpdateBotSession requests in the server logs.

Without CASD

Warning

Whilst this approach has fewer moving parts, it will make your builds slower: the input root must be fetched afresh for every Action rather than served from a local cache. With large input roots, this will completely wipe out any benefits gained by using remote execution. Production deployments should use buildbox-casd.

buildbox-worker supports running without buildbox-casd by pointing it to the remote CAS rather than the local CASD, although this isn’t recommended due to the additional network load it causes. When running in this configuration, it’s important to tell the runner command not to use the LocalCAS protocol to stage the input root.

buildbox-worker --buildbox-run=buildbox-run-hosttools --bots-remote=http://localhost:50051 \
    --cas-remote=http://localhost:50051 --request-timeout=30 --runner-arg=--disable-localcas my_bot

Reference configuration

Below is an annotated example of the full reference configuration:

# Server's configuration description.
description: |
  BuildGrid's server reference configuration.

# Server's network configuration.
server:
  - !channel
    # TCP port number.
    address: "[::]:50051"
    # Whether or not to activate SSL/TLS encryption.
    insecure-mode: true
    # SSL/TLS credentials.
    credentials:
      tls-server-key: !expand-path ~/.config/buildgrid/server.key
      tls-server-cert: !expand-path ~/.config/buildgrid/server.cert
      tls-client-certs: !expand-path ~/.config/buildgrid/client.cert

# gRPC tunables to pass to the gRPC server.
# See https://grpc.github.io/grpc/core/group__grpc__arg__keys.html for the full list
grpc-server-options:
  grpc.so_reuseport: 0
  grpc.max_connection_age_ms: 300000

# Server's authorization configuration.
authorization:
  # Type of authorization method.
  #  none  - Bypass the authorization mechanism
  #  jwt   - OAuth 2.0 bearer with JWT tokens
  method: jwt
  # Location for the file containing the secret, pass
  # or key needed by 'method' to authorize requests.
  secret: !expand-path ~/.config/buildgrid/auth.secret
  # The url to fetch the JWKs.
  # Either secret or this field must be specified. Defaults to ``None``.
  jwks-url: https://test.dev/.well-known/jwks.json
  # Audience used to validate the JWT.
  # This field must be specified if jwks-url is specified.
  # This field is case sensitive!
  audience: BuildGrid
  # The amount of time between fetching of the JWKs.
  # Defaults to 60 minutes.
  jwks-fetch-minutes: 30
  # Encryption algorithm to be used together with 'secret'
  # by 'method' to authorize requests (optional).
  #  hs256  - HMAC+SHA-256 for JWT method
  #  rs256  - RSASSA-PKCS1-v1_5+SHA-256 for JWT method
  algorithm: rs256

# List of connections to use for items like sql and redis
connections:
  - !sql-connection &sql
    # URI for connecting to a PostgreSQL database:
    connection-string: postgresql://bgd:insecure@database/bgd

    # SQLAlchemy Pool Options
    pool-size: 5
    pool-timeout: 30
    pool-recycle: 3600
    max-overflow: 10

# List of storage backends for the instance.
storages:
  - !disk-storage &main-storage
    # Path to the local storage folder.
    path: !expand-path $HOME/cas

# List of action cache stores
caches:
  - !lru-action-cache &main-action
    # Alias to a storage backend, see 'storages'.
    storage: *main-storage
    # Maximum number of entries kept in cache.
    max-cached-refs: 256
    # Whether writing to the cache is allowed.
    allow-updates: true
    # Whether failed actions (non-zero exit code) are stored.
    cache-failed-actions: true

# List of schedulers to use in Execution and Bots services
schedulers:
  - !sql-scheduler &state-database
    sql: *sql
    action-cache: *main-action
    storage: *main-storage

    property-set:
      !dynamic-property-set
      # Non-standard keys which BuildGrid will allow jobs to set and use in the
      # scheduling algorithm when matching a job to an appropriate worker
      #
      # Jobs with keys which aren't defined in either this list or
      # `wildcard-property-keys` will be rejected.
      match-property-keys:
        # BuildGrid will match worker and jobs on foo, if set by job
        - foo
        # Can specify multiple keys.
        - bar

      # Non-standard keys which BuildGrid will allow jobs to set. These keys
      # won't be considered when matching jobs to workers.
      #
      # Jobs with keys which aren't defined in either this list or
      # `match-property-keys` will be rejected.
      wildcard-property-keys:
        # BuildGrid won't use the `chrootRootDigest` property to match jobs to workers,
        # but workers will still be able to use the value of the key to determine
        # what environment the job needs
        - chrootRootDigest

      # A static property set can be used instead of a dynamic property set using
      # !static-property-set
      # Static property sets require the value of keys to be pre-defined.
      # This decreases the scheduling cost to linear in comparison to the dynamic set but
      # requires the definition of all valid property sets.

      # property-labels: define a set of property combinations which are allowed by the scheduler.
        # - { label: linuxGreen, properties: [[platform, linux], [colour, green]] }
        # - { label: linuxBlue, properties: [[platform, linux], [colour, blue]] }

    cohort-set: !cohort-set
      # A cohort is a named group of workers that share a set of property labels.
      # While a worker can have more than one property label, it can only belong to one cohort.
      # If all property-labels of a worker match the property-labels of a cohort, it will be
      # assigned to that cohort.
      cohorts:
        - name: default
          property-labels: ["unknown", "linux"]
        - name: linux-large
          property-labels: ["linux-large"]

    # Base URL for external build action (web) browser service.
    action-browser-url: http://localhost:8080

    # BotSession Keepalive Timeout: The maximum time (in seconds)
    # to wait to hear back from a bot before assuming they're unhealthy.
    bot-session-keepalive-timeout: 120

    # Max Execution Timeout: Specify the maximum amount of time (in seconds) a job
    # can remain in executing state. If it exceeds the maximum execution timeout,
    # it will be marked as cancelled.
    # (Default: 7200)
    max-execution-timeout: 7200

    # Max number of locality hints to be associated with each bot.
    bot-locality-hint-limit: 10

    assigners:
      - !priority-age-assigner
        # Number of assigner threads to run.
        count: 5
        # Interval (in seconds) between each assignment attempt.
        interval: 1.0
        # Percentage of jobs that will be assigned by priority
        priority-assignment-percentage: 95
        # Bot assignment strategy to use for finding a bot to assign the job to.
        bot-assignment-strategy: !assign-by-locality
          sampling: !sampling-config
            # Sample size: The number of bots to sample when assigning a job.
            sample-size: 5
            # Max attempts: The maximum number of times to attempt to sample bots
            max-attempts: 3
          fallback: !assign-by-capacity
      - !cohort-assigner
        # Number of assigner threads to run.
        count: 3
        # Cohort set to use for this assigner.
        cohort-set: ["default", "linux-large"]
        # Number of seconds to wait before a job is eligible for preemptive assignment.
        preemption-delay: 20.0
        # Bot assignment strategy to use for finding a bot to assign the job to.
        bot-assignment-strategy: !assign-by-locality
          sampling: !sampling-config
            # Sample size: The number of bots to sample when assigning a job.
            sample-size: 5
            # Max attempts: The maximum number of times to attempt to sample bots
            max-attempts: 3
          fallback: !assign-by-capacity

# Server's instances configuration.
instances:
  - name: main
    description: |
      The 'main' server instance.

    # List of services for the instance.
    #  action-cache     - REAPI ActionCache service.
    #  bytestream       - Google APIs ByteStream service.
    #  cas              - REAPI ContentAddressableStorage service.
    #  execution        - REAPI Execution + RWAPI Bots services.
    services:
      - !action-cache
        cache: *main-action

      - !execution
        scheduler: *state-database

        # Operation Stream Keepalive Timeout: The maximum time (in seconds)
        # to wait before sending the current status in an Operation response
        # stream of an `Execute` or `WaitExecution` request.
        operation-stream-keepalive-timeout: 120

        # Max List Operations Page Size: Specify the maximum number of results that can
        # be returned from a ListOperations request. BuildGrid will provide a page_token
        # with the response that the client can specify to get the next page of results.
        # (Default: 1000)
        max-list-operations-page-size: 1000

      - !cas
        # Alias to a storage backend, see 'storages'.
        storage: *main-storage

      - !bytestream
        # Alias to a storage backend, see 'storages'.
        storage: *main-storage

# List of services that are not tied to a specific instance.
services:
  - !quota-service
    scheduler: *state-database

# Server's internal monitoring configuration.
monitoring:
  # Whether or not to activate the monitoring subsystem.
  enabled: false

  # Type of the monitoring bus endpoint.
  #  stdout  - Standard output stream.
  #  file    - On-disk file.
  #  socket  - UNIX domain socket.
  #  udp     - Port listening for UDP packets
  endpoint-type: socket

  # Location for the monitoring bus endpoint. Only
  # necessary for 'file', 'socket', and 'udp' `endpoint-type`.
  # Full path is expected for 'file', name
  # only for 'socket', and `hostname:port` for 'udp'.
  endpoint-location: monitoring_bus_socket

  # Messages serialisation format.
  #  binary  - Protobuf binary format.
  #  json    - JSON format.
  #  statsd  - StatsD format. Only metrics are kept - logs are dropped.
  serialization-format: binary

  # Prefix to prepend to the metric name before writing
  # to the configured endpoint.
  metric-prefix: buildgrid

# Maximum number of gRPC threads. Defaults to 5 times
# the CPU count if not specified. A minimum of 5 is
# enforced regardless of the configured value.
thread-pool-size: 30

# Set a concurrent request limit lower than thread-pool-size to reject
# excess requests with an error response rather than exhausting the thread pool.
limiter:
  !limiter
  concurrent-request-limit: 25

See the Parser API reference for details on the tagged YAML nodes in this configuration.

Deployment Architecture

BuildGrid is designed for flexibility in deployment topology. It can be configured with any combination of the supported services in a single server configuration.

Due to BuildGrid’s use of a thread pool for handling gRPC requests, along with the Python GIL, it is sensible to split up services into several processes to scale concurrent connection counts. With the exception of the Build Events related services, each service is horizontally scalable to support running multiple processes for the same service across multiple machines.

The recommended split is as follows:

  1. Action Cache, ByteStream, and CAS

  2. Execution, Operations, and Introspection

  3. BotsInterface
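Concretely, each process in this split can be given its own configuration file whose instances section lists only the relevant services. A minimal sketch for the first process, reusing the storage and cache aliases from the reference configuration (the file name and instance name are illustrative):

```yaml
# cas-process.yml: serves only the Action Cache, ByteStream, and CAS services.
# Assumes 'storages' and 'caches' sections defining the *main-storage and
# *main-action aliases, as in the reference configuration.
instances:
  - name: main
    services:
      - !action-cache
        cache: *main-action
      - !cas
        storage: *main-storage
      - !bytestream
        storage: *main-storage
```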

Action Cache, ByteStream, and CAS

digraph cas_process_config {

   rankdir = LR;
   bgcolor = "#fcfcfc";

   graph [
     fontname = "Verdana",
     fontsize = 10,
   ];

   node [
     style = filled,
     shape = box,
     fontname = "Verdana",
     fontsize = 10
   ];

   subgraph cluster_cas {
     bgcolor = "#eaeaea";
     style = "dashed";

     subgraph cluster_storage {
         color = "#f4f4f4";
         style = filled;
         label = "Storage backends"

         node [
             fillcolor = "#bbf0c3",
             fontcolor = "#294a2e",
             color = "#294a2e"
         ];

         edge [
             color = "#294a2e"
         ];

         CAS [
             label=<

     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
       <TR>
         <TD colspan="2" port="sharded">ShardedStorage</TD>
       </TR>
       <TR>
         <TD colspan="2" port="redis">RedisIndex</TD>
       </TR>
       <TR>
         <TD colspan="2">SizeDifferentiatedStorage</TD>
       </TR>
       <TR>
         <TD port="sql">SQLStorage</TD>
         <TD port="s3">S3Storage</TD>
       </TR>
     </TABLE>>,
         ];
     }

     subgraph cluster_caches {
         color = "#f4f4f4";
         style = filled;
         label = "Cache backends"

         node [
             fillcolor = "#bbf0c3",
             fontcolor = "#294a2e",
             color = "#294a2e"
         ];

         edge [
             color = "#294a2e"
         ];

         caches [
             label=<

     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
       <TR>
         <TD colspan="2" port="sharded">ShardedActionCache</TD>
       </TR>
       <TR>
         <TD colspan="2" port="redis">RedisActionCache</TD>
       </TR>
     </TABLE>>,
         ];
     }

     subgraph cluster_services {
         color = "#f4f4f4";
         style = filled;
         label = "gRPC Services"

         node [
             color = lightgrey
         ];

         ByteStream [
             label = "ByteStream"
         ];

         cas [
             label = "CAS"
         ];

         actioncache [
             label = "Action Cache"
         ];
     }

     label = "`bgd server` process";
   }

   S3 [
     shape = "cylinder"
   ];
   PostgreSQL [
     shape = "cylinder"
   ];
   Redis [
     shape = "cylinder"
   ];

   caches:redis -> Redis;
   CAS:redis -> Redis;
   CAS:sql -> PostgreSQL;
   CAS:s3 -> S3;

   cas -> CAS:sharded;
   ByteStream -> CAS:sharded;
   actioncache -> caches:sharded;

   {rank=same Redis PostgreSQL S3}
 }

This configuration specifies all the services needed for cache-only usage.

The exact choice of storage backends to use is dependent on your expected workloads and availability of other services. Using an index somewhere in the stack is strongly recommended to support handling FindMissingBlobs without querying the actual storage. The Redis index used in this example is more performant than the SQL index, but likely requires sharding to scale to production workloads.

As build workflows often involve many small blobs and a small number of much larger blobs, it can be beneficial to store smaller blobs in a faster storage location. In this example we use SizeDifferentiatedStorage to direct small blobs to an SQLStorage, whilst large blobs are stored in a slower but significantly larger S3Storage.

It may be beneficial to add a cache layer using WithCacheStorage for particularly slow storage backends, although this backend doesn’t support any special routing to reduce duplication of storage.

The Action Cache backends are separate from the storage backends, though they can reference them. Again, the choice here depends on service availability and desired cache behaviour. An S3ActionCache will be slow but large and resilient, whereas an LRUActionCache will be fast but small and short-lived. The RedisActionCache used here provides a middle ground and is generally the best option for most workloads.
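A storage stack matching the diagram above might be sketched as follows. The tag and option names here are assumptions based on the backend names (check the Parser API reference for the exact schema supported by your BuildGrid version), and the bucket and Redis details are placeholders:

```yaml
storages:
  # Fast tier for small blobs and a larger, slower tier for big blobs.
  - !sql-storage &small-blobs
    sql: *sql
  - !s3-storage &large-blobs
    bucket: bgd-cas                # hypothetical bucket name
  # Route blobs between the tiers by size.
  - !size-differentiated-storage &tiered
    # size thresholds mapping small blobs to *small-blobs and the
    # rest to *large-blobs go here
  # Index in front of the stack so FindMissingBlobs avoids hitting
  # the backends directly.
  - !redis-index &indexed
    storage: *tiered
    host: localhost
    port: 6379
```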

Execution, Operations, and Introspection

digraph execution_process_config {

   rankdir = LR;
   bgcolor = "#fcfcfc";

   graph [
     fontname = "Verdana",
     fontsize = 10,
   ];

   node [
     style = filled,
     shape = box,
     fontname = "Verdana",
     fontsize = 10
   ];

   subgraph cluster_cas {
     bgcolor = "#eaeaea";
     style = "dashed";

     subgraph cluster_storage {
         color = "#f4f4f4";
         style = filled;
         label = "Storage backends"

         node [
             fillcolor = "#bbf0c3",
             fontcolor = "#294a2e",
             color = "#294a2e"
         ];

         edge [
             color = "#294a2e"
         ];

         cas_remote [
             label = "Remote"
         ];
     }

     subgraph cluster_caches {
         color = "#f4f4f4";
         style = filled;
         label = "Cache backends"

         node [
             fillcolor = "#bbf0c3",
             fontcolor = "#294a2e",
             color = "#294a2e"
         ];

         edge [
             color = "#294a2e"
         ];

         cache_remote [
             label = "RemoteActionCache"
         ];
     }

     subgraph cluster_schedulers {
         color = "#f4f4f4";
         style = filled;
         label = "Schedulers"

         node [
           fillcolor = "#bbf0c3",
           fontcolor = "#294a2e"
           color = "#294a2e"
         ];

         scheduler [
           label = "Scheduler"
         ];
     }

     subgraph cluster_connections {
         color = "#f4f4f4";
         style = filled;
         label = "Connections"

         node [
           fillcolor = "#bbf0c3",
           fontcolor = "#294a2e"
           color = "#294a2e"
         ];

         sql_writeable [
           label = "SQL (read/write)"
         ];
         sql_read_only [
           label = "SQL (read-only)"
         ];
         sql_listen_notify [
           label = "SQL (notifiers)"
         ];
     }

     subgraph cluster_services {
         color = "#f4f4f4";
         style = filled;
         label = "gRPC Services"

         node [
           color = lightgrey
         ];

         execution [
           label = "Execution"
         ];

         operations [
           label = "Operations"
         ];

         cas [
           label = "CAS"
         ];

         introspection [
           label = "Introspection"
         ];
     }

     label = "`bgd server` process";
   }

   PostgreSQL [
     shape = "cylinder"
   ];
   BuildGridCAS [
     label = "BuildGrid CAS";
     shape = "cylinder";
   ];

   cache_remote -> BuildGridCAS;
   cas_remote -> BuildGridCAS;

   cas -> cas_remote;
   execution -> scheduler;
   operations -> scheduler;
   introspection -> scheduler;

   scheduler -> cas_remote;
   scheduler -> sql_writeable;
   scheduler -> sql_read_only;
   scheduler -> sql_listen_notify;
   scheduler -> cache_remote;

   sql_writeable -> PostgreSQL;
   sql_read_only -> PostgreSQL;
   sql_listen_notify -> PostgreSQL;

   {rank=same PostgreSQL BuildGridCAS}
 }

This configuration specifies the services needed to support the client-side parts of remote execution.

The Execution service uses its configured Scheduler to queue incoming jobs in the database for assignment. When using PostgreSQL as in this example, the Scheduler will use LISTEN/NOTIFY to listen for updates to job state, which will be reported back to clients by the Execution service.

The Operations service and Introspection service are mainly for querying information about the current internal state of BuildGrid. The Operations service is also used to request cancellation of a previously queued job.

All three of these services use the Scheduler, which in turn uses up to three different SQL connection configurations. This allows, for example, sending read-only traffic to a read-only database replica, or using an external connection pooler such as PgBouncer for regular queries whilst maintaining an in-process pool for the long-running connections used for LISTEN/NOTIFY.

The Scheduler also needs access to a cache backend and a storage backend. The RemoteActionCache and RemoteStorage backends exist to support splitting the configuration like this, and in this example should be pointed to a BuildGrid running the Action Cache, ByteStream, and CAS configuration above.
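For example, the storage and cache sections of this process's configuration can point at the CAS deployment. The url option name is an assumption and the host is a placeholder:

```yaml
storages:
  - !remote-storage &cas-backend
    url: http://cas.internal:50051
caches:
  - !remote-action-cache &ac-backend
    url: http://cas.internal:50051
```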

BotsInterface

digraph bots_process_config {

   rankdir = LR;
   bgcolor = "#fcfcfc";

   graph [
     fontname = "Verdana",
     fontsize = 10,
   ];

   node [
     style = filled,
     shape = box,
     fontname = "Verdana",
     fontsize = 10
   ];

   subgraph cluster_cas {
     bgcolor = "#eaeaea";
     style = "dashed";

     subgraph cluster_storage {
         color = "#f4f4f4";
         style = filled;
         label = "Storage backends"

         node [
             fillcolor = "#bbf0c3",
             fontcolor = "#294a2e",
             color = "#294a2e"
         ];

         edge [
             color = "#294a2e"
         ];

         cas_remote [
             label = "Remote"
         ];
     }

     subgraph cluster_caches {
         color = "#f4f4f4";
         style = filled;
         label = "Cache backends"

         node [
             fillcolor = "#bbf0c3",
             fontcolor = "#294a2e",
             color = "#294a2e"
         ];

         edge [
             color = "#294a2e"
         ];

         cache_remote [
             label = "RemoteActionCache"
         ];
     }

     subgraph cluster_schedulers {
         color = "#f4f4f4";
         style = filled;
         label = "Schedulers"

         node [
           fillcolor = "#bbf0c3",
           fontcolor = "#294a2e"
           color = "#294a2e"
         ];

         scheduler [
           label = "Scheduler"
         ];
     }

     subgraph cluster_connections {
         color = "#f4f4f4";
         style = filled;
         label = "Connections"

         node [
           fillcolor = "#bbf0c3",
           fontcolor = "#294a2e"
           color = "#294a2e"
         ];

         sql_writeable [
           label = "SQL (read/write)"
         ];
         sql_read_only [
           label = "SQL (read-only)"
         ];
         sql_listen_notify [
           label = "SQL (notifiers)"
         ];
     }

     subgraph cluster_services {
         color = "#f4f4f4";
         style = filled;
         label = "gRPC Services"

         node [
           color = lightgrey
         ];

         bots [
           label = "Bots"
         ];
     }

     label = "`bgd server` process";
   }

   PostgreSQL [
     shape = "cylinder"
   ];
   BuildGridCAS [
     label = "BuildGrid CAS";
     shape = "cylinder";
   ];

   cache_remote -> BuildGridCAS;
   cas_remote -> BuildGridCAS;

   bots -> scheduler;

   scheduler -> cas_remote;
   scheduler -> sql_writeable;
   scheduler -> sql_read_only;
   scheduler -> sql_listen_notify;
   scheduler -> cache_remote;

   sql_writeable -> PostgreSQL;
   sql_read_only -> PostgreSQL;
   sql_listen_notify -> PostgreSQL;

   {rank=same PostgreSQL BuildGridCAS}
 }

This configuration is just for a BotsInterface, the server side of the RWAPI.

It is very similar to the Execution configuration, using a Scheduler with all the same configuration options as before. This Scheduler also has configuration for an assigner thread, which periodically fetches the next job in the queue and attempts to assign it to an available bot. This thread could equally run in the Execution process, with the same functionality.
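Assuming the same schedulers section as in the reference configuration, the services section of a Bots-only process can be very small. The !bots tag is an assumption; in the reference configuration above, the Bots service is bundled into !execution:

```yaml
instances:
  - name: main
    services:
      # RWAPI Bots service only; *state-database is the scheduler alias
      # defined in the reference configuration's schedulers section.
      - !bots
        scheduler: *state-database
```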

A Scheduler used just for the RWAPI side like this also uses LISTEN/NOTIFY when given a PostgreSQL database. In this case it is used to listen for assignment of work to a connected bot. Bots can long-poll when sending UpdateBotSession and CreateBotSession requests, returning immediately when work is assigned.