Configuration

This section describes how to tune BuildGrid’s configuration.

Hint

To spin up a server instance using a given server.conf configuration file, run:

bgd server start server.conf

Please refer to the CLI reference section for command line interface details.

Reference configuration

Below is an example of the full configuration reference:

##
# Server's configuration description.
description: |
  BuildGrid's server reference configuration.

##
# Server's network configuration.
server:
  - !channel
    ##
    # TCP port number.
    port: 50051
    ##
    # Whether or not to disable SSL/TLS encryption (true = unencrypted).
    insecure-mode: true
    ##
    # SSL/TLS credentials.
    credentials:
      tls-server-key: !expand-path ~/.config/buildgrid/server.key
      tls-server-cert: !expand-path ~/.config/buildgrid/server.cert
      tls-client-certs: !expand-path ~/.config/buildgrid/client.cert

##
# Server's authorization configuration.
authorization:
  ##
  # Type of authorization method.
  #  none  - Bypass the authorization mechanism
  #  jwt   - OAuth 2.0 bearer with JWT tokens
  method: jwt
  ##
  # Location for the file containing the secret, pass
  # or key needed by 'method' to authorize requests.
  secret: !expand-path ~/.config/buildgrid/auth.secret
  ##
  # Encryption algorithm to be used together with 'secret'
  # by 'method' to authorize requests (optional).
  #  hs256  - HMAC+SHA-256 for JWT method
  #  rs256  - RSASSA-PKCS1-v1_5+SHA-256 for JWT method
  algorithm: rs256

##
# Server's instances configuration.
instances:
  - name: main
    description: |
      The 'main' server instance.
    ##
    # List of storage backends for the instance.
    #  disk-storage    - On-disk storage.
    #  lru-storage     - In-memory storage (non-persistent).
    #  remote-storage  - Proxy to remote storage.
    #  s3-storage      - Amazon S3 storage.
    storages:
      - !disk-storage &main-storage
        ##
        # Path to the local storage folder.
        path: !expand-path $HOME/cas

    ##
    # List of schedulers to use in Execution and Bots services
    #  sql-scheduler      - A scheduler which uses a SQLAlchemy-compatible
    #                       database
    #  memory-scheduler   - Legacy in-memory scheduler, not recommended
    #                       for use in production scenarios due to lack of
    #                       persistent state and inability to horizontally
    #                       scale (or even share state outside a single
    #                       configuration at all)
    schedulers:
      - !sql-scheduler &state-database
        # Storage backend that results should be stored in
        storage: *main-storage

        # URI for connecting to a PostgreSQL database:
        connection-string: postgresql://bgd:insecure@database/bgd
        # URI for connecting to an SQLite database:
        #connection-string: sqlite:///./example.db

        ##
        # Whether or not to automatically run database migrations
        # when starting the server
        automigrate: yes

        # SQLAlchemy Pool Options
        pool-size: 5
        pool-timeout: 30
        max-overflow: 10

      # Using an in-memory scheduler isn't recommended for long-term
      # or production use. With it, the features that allow horizontal
      # scaling will not function correctly, and the job queue is
      # ephemeral and lost on restart.
      - !memory-scheduler &state-in-memory
        # Storage backend that results should be stored in
        storage: *main-storage

    ##
    # List of services for the instance.
    #  action-cache     - REAPI ActionCache service.
    #  bytestream       - Google APIs ByteStream service.
    #  cas              - REAPI ContentAddressableStorage service.
    #  execution        - REAPI Execution + RWAPI Bots services.
    #  reference-cache  - BuildStream ReferenceStorage service.
    services:
      - !action-cache &main-action
        ##
        # Alias to a storage backend, see 'storages'.
        storage: *main-storage
        ##
        # Maximum number of entries kept in cache.
        max-cached-refs: 256
        ##
        # Whether or not writing to the cache is allowed.
        allow-updates: true
        ##
        # Whether failed actions (non-zero exit code) are stored.
        cache-failed-actions: true

      - !execution
        ##
        # Alias to a storage backend, see 'storages'.
        storage: *main-storage
        ##
        # Alias to an action-cache service.
        action-cache: *main-action
        ##
        # BotSession Keepalive Timeout: The maximum time (in seconds)
        # to wait to hear back from a bot before assuming they're unhealthy.
        bot-session-keepalive-timeout: 120

        # Non-standard keys which BuildGrid will allow jobs to set.
        # Jobs with non-standard keys not in this list will be rejected.
        property-keys:
          ##
          # BuildGrid will match workers and jobs on foo, if set by the job
          - foo
          ##
          # Can specify multiple keys.
          - bar
        ##
        # Base URL for external build action (web) browser service.
        action-browser-url: http://localhost:8080
        ##
        # Alias to a data store used to store the scheduler's state, see 'schedulers'.
        scheduler: *state-database
        ##
        # Remove operation if there are no clients currently connected watching it.
        # (Default: false)
        discard-unwatched-jobs: true
        ##
        # Max Execution Timeout: Specify the maximum amount of time (in seconds) a job
        # can remain in executing state. If it exceeds the maximum execution timeout,
        # it will be marked as cancelled.
        # (Default: 7200)
        max-execution-timeout: 7200
        ##
        # Max List Operations Page Size: Specify the maximum number of results that can
        # be returned from a ListOperations request. BuildGrid will provide a page_token
        # with the response that the client can specify to get the next page of results.
        # (Default: 1000)
        max-list-operation-page-size: 1000

      - !cas
        ##
        # Alias to a storage backend, see 'storages'.
        storage: *main-storage
        ##
        # Whether the CAS should be read only or not
        read-only: false

      - !bytestream
        ##
        # Alias to a storage backend, see 'storages'.
        storage: *main-storage
        ##
        # Whether the ByteStream should be read-only
        read-only: false

      - !reference-cache
        ##
        # Alias to a storage backend, see 'storages'.
        storage: *main-storage
        ##
        # Maximum number of entries kept in cache.
        max-cached-refs: 256
        ##
        # Whether or not writing to the cache is allowed.
        allow-updates: true

##
# Server's internal monitoring configuration.
monitoring:
  ##
  # Whether or not to activate the monitoring subsystem.
  enabled: false

  ##
  # Type of the monitoring bus endpoint.
  #  stdout  - Standard output stream.
  #  file    - On-disk file.
  #  socket  - UNIX domain socket.
  #  udp     - Port listening for UDP packets
  endpoint-type: socket

  ##
  # Location for the monitoring bus endpoint. Only
  # necessary for 'file', 'socket', and 'udp' `endpoint-type`.
  # Full path is expected for 'file', name
  # only for 'socket', and `hostname:port` for 'udp'.
  endpoint-location: monitoring_bus_socket

  ##
  # Messages serialisation format.
  #  binary  - Protobuf binary format.
  #  json    - JSON format.
  #  statsd  - StatsD format. Only metrics are kept - logs are dropped.
  serialization-format: binary

  ##
  # Prefix to prepend to the metric name before writing
  # to the configured endpoint.
  metric-prefix: buildgrid

##
# Maximum number of gRPC threads. Defaults to 5 times
# the CPU count if not specified. A minimum of 5 is
# enforced, whatever the configuration is.
thread-pool-size: 20

See the Parser API reference for details on the tagged YAML nodes in this configuration.
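
The reference configuration wires components together with YAML anchors (such as &main-storage) and aliases (such as *main-storage). As a rough pre-flight check, the sketch below scans a configuration text for aliases that are used before any matching anchor has been defined. It is a naive, regex-based illustration and not BuildGrid's parser: the helper name undefined_aliases and the sample text are hypothetical, and a '#' or '&' inside a quoted string or URL would confuse it.

```python
import re

# Anchor definitions look like "&name"; alias uses look like "*name".
ANCHOR = re.compile(r"&([A-Za-z0-9_-]+)")
ALIAS = re.compile(r"(?<![\w*])\*([A-Za-z0-9_-]+)")


def undefined_aliases(text):
    """Return (line_number, alias) pairs for aliases used before definition."""
    anchors = set()
    missing = []
    for number, line in enumerate(text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # drop comments (naive: ignores quoting)
        anchors.update(ANCHOR.findall(code))
        for name in ALIAS.findall(code):
            if name not in anchors:
                missing.append((number, name))
    return missing


# Hypothetical sample: the second alias has no matching anchor.
SAMPLE = """\
storages:
  - !disk-storage &main-storage
    path: /tmp/cas
services:
  - !cas
    storage: *main-storage
  - !bytestream
    storage: *other-storage
"""

print(undefined_aliases(SAMPLE))
```

An empty result only means that every alias has an earlier anchor; it says nothing about whether the tagged nodes themselves are valid.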

Deployment Guidance

BuildGrid is designed to be flexible about deployment topology. The services it provides can be configured in any combination on a given server. This section provides some example configuration files for different deployment topologies.

For details of the services, see Understanding the configuration file.

All-in-one

server:
  - !channel
    port: 50051
    insecure-mode: true

description: >
  BuildGrid's default configuration:
    - Unauthenticated plain HTTP at :50051
    - Single instance: [unnamed]
    - In-memory data, max. 2 GiB
    - DataStore: sqlite:///./example.db
    - Hosted services:
       - ActionCache
       - Execute
       - ContentAddressableStorage
       - ByteStream

authorization:
  method: none

monitoring:
  enabled: false

instances:
  - name: ''
    description: |
      The unique '' instance.

    storages:
      - !lru-storage &cas-storage
        size: 2048M

    schedulers:
      - !sql-scheduler &state-database
        storage: *cas-storage
        connection-string: sqlite:///./example.db
        automigrate: yes
        connection-timeout: 15
        poll-interval: 0.5

    caches:
      - !lru-action-cache &build-cache
        storage: *cas-storage
        max-cached-refs: 256
        cache-failed-actions: true
        allow-updates: true

    services:
      - !action-cache
        cache: *build-cache

      - !execution
        storage: *cas-storage
        action-cache: *build-cache
        scheduler: *state-database
        max-execution-timeout: 7200

      - !cas
        storage: *cas-storage

      - !bytestream
        storage: *cas-storage

This configuration includes all the services required for remote execution and caching in a single gRPC server. It is ideal for trying out BuildGrid locally, but not recommended for production: under production load, you are likely to quickly run out of threads available to handle incoming requests.

In this configuration, all requests are sent to the same endpoint (which is exposed on port 50051).
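
The number of threads available to handle requests is governed by the thread-pool-size key. According to the comments in the reference configuration, it defaults to 5 times the CPU count when unset, and a minimum of 5 is always enforced. The sketch below restates that rule for quick estimation. It only mirrors the documented behaviour and is not BuildGrid's own code; the function name is hypothetical.

```python
import os


def effective_thread_pool_size(configured=None):
    """Apply the documented rule: default to 5 x CPU count, never below 5."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    size = configured if configured is not None else 5 * cpu_count
    return max(5, size)


print(effective_thread_pool_size(20))  # an explicit value above the minimum
print(effective_thread_pool_size(2))   # raised to the enforced minimum
print(effective_thread_pool_size())    # depends on the machine's CPU count
```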

Separate Execution and CAS/ActionCache

This example deploys two separate gRPC servers, one exposing the Execution, Operations, and Bots services, and the other exposing the CAS, ByteStream, and ActionCache services. In general, there is rarely a good reason to separate the CAS and ByteStream services, whatever the rest of your deployment looks like.

Execution, Operations, and Bots services

server:
  - !channel
    port: 50051
    insecure-mode: true

instances:
  - name: ''

    storages:
      - !remote-storage &remote-cas
        url: http://storage:50052
        instance-name: ''

    caches:
      - !remote-action-cache &remote-cache
        url: http://storage:50052
        instance-name: ''

    schedulers:
      - !sql-scheduler &state-database
        storage: *remote-cas
        connection-string: sqlite:///./example.db
        automigrate: yes
        connection-timeout: 15
        poll-interval: 0.5

    services:
      - !execution
        storage: *remote-cas
        action-cache: *remote-cache
        scheduler: *state-database
        max-execution-timeout: 7200
        endpoints:
          - execution
          - operations

      - !bots
        storage: *remote-cas
        action-cache: *remote-cache
        scheduler: *state-database

thread-pool-size: 1000

This configuration file defines the Execution, Operations, and Bots services. The Bots service is defined separately to show how it can be configured independently. The !execution tag still supports including a Bots service if defined as follows:

- !execution
  storage: *remote-cas
  action-cache: *remote-cache
  scheduler: *state-database
  max-execution-timeout: 7200
  endpoints:
    - execution
    - operations
    - bots

Omitting the endpoints key has the same effect, because enabling all three services is currently the default.

CAS, ByteStream, and ActionCache services

server:
  - !channel
    port: 50052
    insecure-mode: true

instances:
  - name: ''

    storages:
      - !disk-storage &main-storage
        path: !expand-path $HOME/cas

    services:
      - !action-cache &main-action
        storage: *main-storage
        max-cached-refs: 256
        allow-updates: true

      - !cas
        storage: *main-storage

      - !bytestream
        storage: *main-storage

thread-pool-size: 1000

This configuration file defines the CAS, ByteStream, and ActionCache services. These are the services referenced by the !remote-storage and !remote-action-cache tags in the earlier configuration.

This configuration is a bit more production-ready than the all-in-one example, but it still has a few limitations.

  • PostgreSQL should be used for the scheduler’s data store, rather than SQLite

  • The ActionCache is probably too small for real use.

  • The ActionCache as configured here won’t support horizontal scaling, which will be needed to handle a good amount of incoming requests (due to the thread limit being set to 1000).

It is also possible (if your client supports it) to split out the services further, for example splitting the ActionCache service out into a separate server, and similarly moving out the Bots service. It’s worth noting that Bazel doesn’t support that topology for the ActionCache, since it assumes the ActionCache to be colocated with CAS.

This kind of further splitting can be useful for targeting specific parts of the deployment for horizontal scaling.

Configuration location

Unless a configuration file is explicitly specified on the command line when invoking bgd, BuildGrid will always attempt to load configuration resources from $XDG_CONFIG_HOME/buildgrid. On most Linux-based systems, this location is ~/.config/buildgrid.

This location is referred to as $CONFIG_HOME in the rest of this document.
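
The lookup described above can be sketched as follows. The fallback to ~/.config when XDG_CONFIG_HOME is unset or empty follows the XDG Base Directory convention; whether BuildGrid resolves the directory in exactly this way is an assumption, and the function name buildgrid_config_home is hypothetical.

```python
import os
from pathlib import Path


def buildgrid_config_home(environ=None):
    """Resolve $XDG_CONFIG_HOME/buildgrid, falling back to ~/.config."""
    environ = os.environ if environ is None else environ
    # An unset or empty XDG_CONFIG_HOME falls back to ~/.config.
    base = environ.get("XDG_CONFIG_HOME") or os.path.join("~", ".config")
    return Path(os.path.expanduser(base)) / "buildgrid"


print(buildgrid_config_home())
print(buildgrid_config_home({"XDG_CONFIG_HOME": "/etc/xdg-test"}))
```

Passing a dictionary instead of the real environment makes the behaviour easy to check without touching the process environment.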

TLS encryption

Every BuildGrid gRPC communication channel can be encrypted using SSL/TLS. By default, the BuildGrid server will try to set up secure gRPC endpoints and return an error if that fails. You must specify --allow-insecure explicitly if you want it to use non-encrypted connections.

The TLS protocol handshake relies on an asymmetric cryptography system that requires the server and the client to own a public/private key pair. BuildGrid will try to load keys from these locations by default:

  • Server private key: $CONFIG_HOME/server.key

  • Server public key/certificate: $CONFIG_HOME/server.crt

  • Client private key: $CONFIG_HOME/client.key

  • Client public key/certificate: $CONFIG_HOME/client.crt
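
Before starting a server or client with TLS enabled, it can help to confirm which of these default files are actually present. The sketch below is a hypothetical helper, not part of BuildGrid: it takes a directory standing in for $CONFIG_HOME and reports which of the four default files are missing. The demonstration uses a throwaway directory containing only the server pair.

```python
import tempfile
from pathlib import Path

# File names taken from the list of default locations above.
DEFAULT_TLS_FILES = {
    "server private key": "server.key",
    "server certificate": "server.crt",
    "client private key": "client.key",
    "client certificate": "client.crt",
}


def missing_tls_files(config_home):
    """Return the labels of default TLS files absent from config_home."""
    config_home = Path(config_home)
    return [
        label
        for label, name in DEFAULT_TLS_FILES.items()
        if not (config_home / name).is_file()
    ]


# Demonstration against a temporary directory holding only the server pair.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("server.key", "server.crt"):
        (Path(tmp) / name).touch()
    missing = missing_tls_files(tmp)

print(missing)
```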

Server key pair

The TLS protocol requires a key pair to be used by the server. The following example generates a private key server.key and a matching self-signed certificate server.crt; because the certificate is self-signed, clients need a copy of server.crt. You can of course use a key pair obtained from a trusted certificate authority instead.

openssl req -new -newkey rsa:4096 -x509 -sha256 -days 3650 -nodes -batch -subj "/CN=localhost" -out server.crt -keyout server.key

Client key pair

If the server requires clients to authenticate before granting special permissions such as uploading to CAS, a client-side key pair is required. The following example generates a private key client.key and a matching self-signed certificate client.crt; the server needs a copy of client.crt.

openssl req -new -newkey rsa:4096 -x509 -sha256 -days 3650 -nodes -batch -subj "/CN=client" -out client.crt -keyout client.key

Persisting Internal State

BuildGrid’s Execution and Bots services can be configured to store their internal state (such as the job queue) in an external data store. At the moment, the only supported type of data store is an SQL database with a driver supported by SQLAlchemy.

This makes it possible to restart a BuildGrid process while preserving the Job Queue, alleviating concerns about having to finish currently queued work before restarting the scheduler or else losing track of that work. Upon restarting, BuildGrid will load the jobs it previously knew about from the data store, and recreate its internal state.

However, note that:

  • Previous connections will need to be recreated.

    • For clients, that can be done by sending a WaitExecution request with the relevant operation name.

    • For bots, they can re-register by sending a CreateBotSession request to accept more work.

  • Work executing during the restart will be re-assigned to a capable, newly-registered bot when it gets picked up from the queue, thus progress will be lost.

Hint

Permissive BotSession Mode is an option for the Bots Interface that allows configurations using a persistent scheduler to verify some of the ongoing leases that were assigned by a different BuildGrid process. This makes it possible to keep the progress made on a lease while BuildGrid is restarting (or in a round-robin BuildGrid cluster). However, enabling this option may cause issues when combined with the bot-session-keepalive-timeout option. For example, if bots start talking to another BuildGrid process while executing a job, and the process(es) they were previously talking to are still running with bot-session-keepalive-timeout enabled, BuildGrid may re-queue some jobs and cancel the relevant existing leases. The option works well when a bot always talks to the same BuildGrid process except while that process is restarting (for example in a primary/backup or sticky-session set-up configured at the DNS level).

To use this feature, set the following option in the execution service configuration:

services:
...
    - !execution
      storage: ...
      action-cache: ...
      scheduler: ...
      permissive-bot-session: True
      ...

SQL Database

The SQL data store implementation uses SQLAlchemy to connect to a database for storing the job queue and related state.

There are database migrations provided, and BuildGrid can be configured to automatically run them when connecting to the database. Alternatively, this can be disabled and the migrations can be executed manually using Alembic.

When using the SQL Data Store with the default configuration (i.e. no connection-string), a temporary SQLite database will be created for the lifetime of BuildGrid’s execution.

Hint

BuildGrid does not support SQLite in-memory databases. This restriction ensures that multiple threads can share the same state database without any issues (using SQLAlchemy’s StaticPool).
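
As background on why in-memory databases are awkward to share, the sketch below uses Python's standard sqlite3 module (not BuildGrid or SQLAlchemy) to show a general SQLite property: each plain ":memory:" connection gets its own private database, whereas two connections to the same file see the same tables. The table name jobs and the file name state.db are made up for the illustration.

```python
import os
import sqlite3
import tempfile


def second_connection_sees_table(database):
    """Create a table on one connection and look for it from another."""
    writer = sqlite3.connect(database)
    reader = sqlite3.connect(database)
    try:
        writer.execute("CREATE TABLE jobs (name TEXT)")
        writer.commit()
        rows = reader.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        return ("jobs",) in rows
    finally:
        writer.close()
        reader.close()


# A file-backed database is visible to both connections.
with tempfile.TemporaryDirectory() as tmp:
    shared_on_disk = second_connection_sees_table(os.path.join(tmp, "state.db"))

# Each ":memory:" connection opens a separate, private database.
shared_in_memory = second_connection_sees_table(":memory:")

print(shared_on_disk, shared_in_memory)
```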

SQLite Configuration Block Example

instances:
  - name: ''

    storages:
      - !lru-storage &cas-storage
        size: 2048M

    schedulers:
      - !sql-scheduler &state-database
        storage: *cas-storage
        connection-string: sqlite:////path/to/sqlite.db
        # ... or don't specify the connection-string and BuildGrid will create a tempfile

    services:
      - !execution
        storage: *cas-storage
        scheduler: *state-database

PostgreSQL Configuration Block Example

instances:
  - name: ''

    storages:
      - !lru-storage &cas-storage
        size: 2048M

    schedulers:
      - !sql-scheduler &state-database
        storage: *cas-storage
        connection-string: postgresql://username:password@sql_server/database_name
        # SQLAlchemy Pool Options
        pool-size: 5
        pool-timeout: 30
        max-overflow: 10

    services:
      - !execution
        storage: *cas-storage
        scheduler: *state-database
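
When debugging a connection-string such as the PostgreSQL one above, it can be useful to split it into its parts. The sketch below uses Python's generic URL parser from the standard library rather than SQLAlchemy's own URL handling, so it is only an approximation intended for network dialects; SQLite paths follow different slash rules and are not handled. The helper name is hypothetical.

```python
from urllib.parse import urlsplit


def describe_connection_string(connection_string):
    """Break a dialect://user:password@host[:port]/database URL into parts."""
    parts = urlsplit(connection_string)
    return {
        "dialect": parts.scheme,
        "username": parts.username,
        # Report only whether a password is present, never its value.
        "password_set": parts.password is not None,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_connection_string(
    "postgresql://username:password@sql_server/database_name"
)
print(info)
```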

With automigrate: no, the migrations can be run by cloning the git repository, modifying the sqlalchemy.url line in alembic.ini to match the connection-string in the configuration, and executing

tox -e venv -- alembic --config ./alembic.ini upgrade head

in the root directory of the repository. The docker-compose files in the git repository offer an example approach for PostgreSQL.
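
The alembic.ini edit described above can also be scripted. The sketch below is a minimal illustration using Python's configparser on a made-up sample file: it assumes the sqlalchemy.url key lives in the [alembic] section (Alembic's default section name), and the script_location value shown is hypothetical rather than taken from the BuildGrid repository. Note that configparser discards comments when it rewrites a file, so editing the line by hand may be preferable for a real alembic.ini.

```python
import configparser
import io

# Hypothetical stand-in for the repository's alembic.ini.
SAMPLE_INI = """\
[alembic]
script_location = migrations
sqlalchemy.url = driver://user:pass@localhost/dbname
"""


def set_sqlalchemy_url(ini_text, connection_string):
    """Return ini_text with sqlalchemy.url replaced by connection_string."""
    # Disable interpolation so '%' characters in values are left untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    parser["alembic"]["sqlalchemy.url"] = connection_string
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = set_sqlalchemy_url(
    SAMPLE_INI, "postgresql://username:password@sql_server/database_name"
)
print(updated)
```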

Hint

Depending on your permissions and database configuration, you may need to create and initialize the database yourself before Alembic can create all the tables for you.

If Alembic fails to create the tables because it cannot read or create the alembic_version table, you could use the following SQL command:

CREATE TABLE alembic_version (
  version_num VARCHAR(32) NOT NULL,
  CONSTRAINT alembic_version_pkc PRIMARY KEY (version_num));

Monitoring and Metrics

BuildGrid provides a mechanism to output its logs in a number of formats, in addition to printing them to stdout. Log messages can be formatted as JSON or the binary form of the protobuf messages, and can be written to a file, a UNIX domain socket, or a UDP port.

BuildGrid also provides some metrics to give insight into the current health and utilisation of the BuildGrid instance. These metrics are protobuf messages similar to the log messages, and can be configured in the same way. Additionally, metrics can be formatted as StatsD metric strings, so that BuildGrid can easily be configured to send its metrics to a remote StatsD server.

If the statsd format is used, then log messages are dropped and only metrics are written to the configured endpoint. The log messages are still written to stdout in this situation.

See Monitoring and Metrics for more details on monitoring options.

StatsD Metrics

A common monitoring setup is to have metrics published to a StatsD server, for aggregation and display using a tool like Grafana. BuildGrid’s udp monitoring endpoint-type supports this directly.

This configuration snippet will cause metrics to be published with a buildgrid prefix to a StatsD server listening on port 8125 with a hostname statsd-server which is resolvable by the BuildGrid instance.

monitoring:
  enabled: true
  endpoint-type: udp
  endpoint-location: statsd-server:8125
  serialization-format: statsd
  metric-prefix: buildgrid
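
To check locally what arrives on such an endpoint, a throwaway UDP listener is enough. The sketch below binds a local socket in place of statsd-server:8125, sends one hand-written line in the standard StatsD text format (name:value|type), and parses it. The metric name is made up, and the way BuildGrid joins the prefix to the metric name is an assumption; the sketch does not reproduce BuildGrid's actual output.

```python
import socket


def parse_statsd_line(line):
    """Split '<name>:<value>|<type>' into its three parts."""
    name, _, rest = line.partition(":")
    value, _, metric_type = rest.partition("|")
    return name, value, metric_type


# Listen on an ephemeral local UDP port, standing in for statsd-server:8125.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)

# Send one hand-written counter line; the metric name is hypothetical.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"buildgrid.example-counter:1|c", receiver.getsockname())

payload, _ = receiver.recvfrom(1024)
sender.close()
receiver.close()

parsed = parse_statsd_line(payload.decode("utf-8"))
print(parsed)
```

Pointing endpoint-location at the listener's host and port would let the same loop print whatever BuildGrid emits.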

Server Reflection

For every service specified in the configuration file, BuildGrid supports server reflection. This allows clients to send requests to specific services without having the protos. For example, the details of an ongoing operation can be listed using grpc_cli as follows:

./grpc_cli call localhost:50051 GetOperation "name: '46a5640e-c3c5-4c7e-b622-df0709540107'"
connecting to localhost:50051
{
  "name": "dev/46a5640e-c3c5-4c7e-b622-df0709540107",
  "metadata": {
    "@type": "type.googleapis.com/build.bazel.remote.execution.v2.ExecuteOperationMetadata",
    "stage": "QUEUED",
    "actionDigest": {
      "hash": "267d1ff6e8d45b812fbc535fdbb8b69cbd6f7401ac3cc4ba21daa02750045906",
      "sizeBytes": "138"
    }
  },
  "response": {
    "@type": "type.googleapis.com/build.bazel.remote.execution.v2.ExecuteResponse"
  }
}

Rpc succeeded with OK status

Server reflection is enabled by default and can be disabled by setting the following key in the YAML configuration: server-reflection: false