Configuration

The details of how to tune BuildGrid’s configuration.

Hint

To spin up a server instance using a given server.yml configuration file, run:

bgd server start server.yml

Please refer to the CLI reference section for command line interface details.

Anatomy of a configuration file

BuildGrid configuration files describe how the server should run, and which of the various components of BuildGrid should be included. This section walks through an example configuration file to describe how each section works.

BuildGrid configuration is stored in a YAML file, and uses custom YAML tags to instantiate actual Python classes at parse time. These tags are enumerated in detail in the Parser API reference.
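To make the idea of tags instantiating Python classes concrete, here is a minimal, dependency-free sketch of the dispatch step. It does not use BuildGrid's parser or any YAML library: the `LRUStorage` class, the `TAG_CONSTRUCTORS` table, and the `construct` helper are hypothetical names invented for illustration, and the size is kept as an opaque string.

```python
from dataclasses import dataclass


@dataclass
class LRUStorage:
    """Hypothetical stand-in for a storage backend class."""
    size: str


# Hypothetical registry: maps a YAML tag to the class it instantiates.
TAG_CONSTRUCTORS = {"!lru-storage": LRUStorage}


def construct(tag, mapping):
    """Build the Python object for a tagged node from its key/value pairs."""
    try:
        cls = TAG_CONSTRUCTORS[tag]
    except KeyError:
        raise ValueError(f"unknown tag: {tag}") from None
    # YAML keys use dashes; Python keyword arguments use underscores.
    kwargs = {key.replace("-", "_"): value for key, value in mapping.items()}
    return cls(**kwargs)


storage = construct("!lru-storage", {"size": "2048M"})
print(storage)
```

The point of the sketch is only that the object exists as soon as the node has been parsed, so later parts of the file can refer to it.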

Example server configuration

server:
- !channel
  address: "[::]:50051"
  insecure-mode: true

description: >
  A simple configuration example

authorization:
  method: none

monitoring:
  enabled: true
  endpoint-type: udp
  endpoint-location: statsd:8125
  serialization-format: statsd
  metric-prefix: buildgrid

instances:
  - name: ''
    description: |
      The unique '' instance.

    storages:
      - !lru-storage &cas-backend
        size: 2048M

    caches:
      - !lru-action-cache &build-cache
        storage: *cas-backend
        max-cached-refs: 256
        cache-failed-actions: true
        allow-updates: true

    schedulers:
      - !sql-scheduler &scheduler
        storage: *cas-backend
        connection-string: sqlite:///./example.db
        connection-timeout: 15
        poll-interval: 0.5
        action-cache: *build-cache
        max-execution-timeout: 7200

    services:
      - !action-cache
        cache: *build-cache

      - !execution
        scheduler: *scheduler

      - !cas
        storage: *cas-backend

      - !bytestream
        storage: *cas-backend

This configuration results in a BuildGrid server listening on port 50051 and providing the following gRPC services:

ActionCache service
Execution service
Operations service
Bots service
CAS service
ByteStream service

Let’s go through this config piece by piece.

server key

server:
- !channel
  address: "[::]:50051"
  insecure-mode: true

The server key contains a list of Channel objects, which define what ports the gRPC server should bind to. These Channel objects are generated by the parser when it finds a !channel tag.

description key

description: >
  A simple configuration example

The description key expects a string value. This is intended to be a human-readable string describing the configuration. This key is completely optional.

authorization section

authorization:
  method: none

The authorization section specifies what auth is expected by BuildGrid. Currently BuildGrid supports JWT-based authorization, allowing access to services to be restricted based on the contents of the caller’s JWT.

This key is optional, and the default is no authorization. This allows all clients access to all services in the config, unless a proxy between the client and BuildGrid does some other authorization.

monitoring section

monitoring:
  enabled: true
  endpoint-type: udp
  endpoint-location: statsd:8125
  serialization-format: statsd
  metric-prefix: buildgrid

The monitoring section contains configuration for BuildGrid’s metrics publishing functionality. This is disabled by default, but when enabled allows configuring where and how metrics should be written.

This example publishes metrics to a UDP port (endpoint-type: udp) located at statsd:8125 (endpoint-location). The metrics will be written in StatsD format (serialization-format) with the metric names prefixed with buildgrid (metric-prefix).
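As a rough illustration of what ends up on the wire with these settings, the sketch below formats a counter in the StatsD line format and sends it as a UDP datagram. The metric name `example-requests` and the dot used to join prefix and name are assumptions made for the example; BuildGrid's real metric names and prefix handling are not shown in this document.

```python
import socket


def format_counter(prefix, name, value):
    """Render one counter in the StatsD line format: <name>:<value>|c"""
    # Joining prefix and name with a dot is an assumption of this sketch.
    return f"{prefix}.{name}:{value}|c"


def send_metric(line, host, port):
    """Send one metric line as a single UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


line = format_counter("buildgrid", "example-requests", 1)
print(line)
# To publish it to the endpoint from the example configuration:
# send_metric(line, "statsd", 8125)
```

Because UDP is connectionless, the sender gets no confirmation that a StatsD server received the datagram.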

instances section

instances:
  - ...

The instances section defines a list of the “instances” to serve from the configured BuildGrid. These instances are self-contained sets of gRPC services, with distinct “instance names” which allow clients to select which set of services to use.

Normally there’ll only be need for a single instance in any given config file, to maximise the number of gRPC handler threads available for the instance. However, multiple elements in this list are supported.

thread-pool-size key

thread-pool-size: 100

This defines the size of the thread pool used to provide gRPC handler threads. This is a hard cap on the number of gRPC requests that the server can be handling at any given time, with further requests being rejected to avoid deadlocks (e.g. a situation where no workers can connect because the connections are full of Execute requests, but those requests can’t be handled because no workers can connect).
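The "hard cap with rejection" behaviour can be pictured with a small sketch. This is not how gRPC or BuildGrid implement it; `RequestLimiter` is a hypothetical class that only shows the idea of refusing work immediately rather than queueing it once every slot is in use.

```python
import threading


class RequestLimiter:
    """Reject work beyond a fixed number of in-flight requests."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self):
        # Non-blocking: returns False immediately when every slot is taken.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


limiter = RequestLimiter(limit=2)
results = [limiter.try_acquire() for _ in range(3)]
print(results)  # [True, True, False]
limiter.release()  # one request finishes, freeing a slot
```

Rejecting instead of blocking means a caller learns straight away that the server is saturated and can retry later.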

Instance Configuration

Each instance in the instances list is a complex object with several non-standard YAML tags. Let’s look at the example instance in a bit more detail too.

name key

instances:
  - name: ""

This key defines the instance name for this instance. Clients use this name when connecting, which provides a way to, for example, separate workloads into different logical sections of infrastructure.

description key

instances:
  - ...
    description: |
      An instance description goes here

This key contains a human-readable description of the instance. This is completely optional, but allows a place for details about the instance to be documented inside the config file.

storages section

instances:
  - ...
    storages:
      - !lru-storage &cas-backend
        size: 2048M

This section contains a list of objects tagged with one of the YAML tags which get parsed as storage backend implementations. The parsing of the tags in this section actually instantiates the storage backend objects which BuildGrid uses internally, and we hook them up with the gRPC service implementations later in the instance config.

This storage is anchored as cas-backend, so that we can refer to the same Python object constructed from this node later in the configuration.

schedulers section

instances:
  - ...
    schedulers:
      - !sql-scheduler &scheduler
        storage: *cas-backend
        connection-string: sqlite:///./example.db
        connection-timeout: 15
        poll-interval: 0.5

This section is a list of objects tagged with one of the YAML tags which get parsed as scheduler backend implementations. Most commonly this will likely be an SQL scheduler, as in this example.

The storage key in this object expects an object annotated with a storage tag. In this example we’re aliasing the cas-backend storage object we anchored in the storages section. This means that our SQL scheduler will be passed the Python object we created earlier.

This approach using anchors and aliases lets us define storages in a single place, and share the resulting Python objects amongst various other pieces of the config which need storages. For an in-memory storage like this example, this is important to ensure that both CAS and ByteStream are using the same data structure for storing blobs. We’ll see more on that soon.
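The effect of an anchor and its aliases can be sketched in plain Python. The `InMemoryStorage` and `Service` classes and the `anchors` dictionary below are hypothetical stand-ins, not BuildGrid classes; the sketch only shows that every alias resolves to the one object built at the anchor.

```python
class InMemoryStorage:
    """Hypothetical in-memory blob store keyed by digest."""

    def __init__(self):
        self.blobs = {}


class Service:
    """Hypothetical service that reads and writes through a storage backend."""

    def __init__(self, storage):
        self.storage = storage


# "&cas-backend": construct the object once and remember it under its anchor.
anchors = {"cas-backend": InMemoryStorage()}

# "*cas-backend": every alias resolves to that same object, not a copy.
cas = Service(storage=anchors["cas-backend"])
bytestream = Service(storage=anchors["cas-backend"])

bytestream.storage.blobs["digest-1"] = b"hello"
print(cas.storage.blobs["digest-1"])  # b'hello'
print(cas.storage is bytestream.storage)  # True
```

Had each service built its own in-memory storage, a blob written through one would be invisible to the other.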

The connection-string key is an SQLAlchemy-compatible connection string. It should point at your database, whether that’s an SQLite database file as in this example, or a URL to a database server. Currently SQLite and PostgreSQL are supported and tested.

The connection-timeout key defines how many seconds to wait for the database to respond to queries before timing out the request.

The poll-interval key specifies how many seconds to wait between polling the database for current job state. This polling only happens when not using PostgreSQL, and is used to trigger sending update messages to Execution request clients when the state of the Job they requested changes. When using PostgreSQL, this number is only used to decide how frequently to check for whether to stop the thread responsible for this behaviour, since it uses LISTEN/NOTIFY to detect database updates rather than polling.
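A polling loop of the kind described here might look like the following sketch. The function name, the `read_state` and `on_change` callbacks, and the state strings are all hypothetical; the sketch only illustrates how one interval can serve both as the polling period and as the frequency of the stop check.

```python
import threading


def watch_job_state(read_state, on_change, poll_interval, stop_event):
    """Poll read_state() every poll_interval seconds until stop_event is set."""
    last_seen = read_state()
    # Event.wait doubles as an interruptible sleep: it returns True as soon
    # as stop_event is set, which ends the loop.
    while not stop_event.wait(poll_interval):
        current = read_state()
        if current != last_seen:
            on_change(current)
            last_seen = current


# Demo with a scripted sequence of states instead of a database.
states = iter(["QUEUED", "QUEUED", "EXECUTING", "COMPLETED"])
seen = []
stop = threading.Event()


def record(state):
    seen.append(state)
    if state == "COMPLETED":
        stop.set()


watch_job_state(lambda: next(states), record, 0.001, stop)
print(seen)  # ['EXECUTING', 'COMPLETED']
```

A smaller interval means clients hear about state changes sooner, at the cost of more frequent queries.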

Having multiple schedulers in this list is possible, but generally not recommended unless you want different services to use different database settings for some reason.

caches section

instances:
  - ...
    caches:
      - !lru-action-cache &build-cache
        storage: *cas-backend
        max-cached-refs: 256
        cache-failed-actions: true
        allow-updates: true

The caches section contains a list of ActionCache backends. These are defined with YAML tags that end in -action-cache. Like the previous tags, these instantiate the actual Python objects that are used by ActionCache instances to interact with their backing store. This example creates an in-memory LRU ActionCache backend, and gives it the build-cache anchor.

Like the scheduler, the storage key here takes a CAS storage backend object. Here we use a YAML reference to pass the same storage we defined earlier and used in the scheduler.

max-cached-refs specifies the size of this LRU ActionCache. cache-failed-actions toggles whether or not the ActionCache should accept ActionResults containing failures, and allow-updates specifies whether or not UpdateActionResult requests should be allowed. Setting allow-updates to false creates a read-only ActionCache service. This isn’t much use for an LRU cache, but for persistent cache implementations such as the Redis ActionCache it allows a read-only client-facing ActionCache, enforcing that only workers can populate the cache.
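How the three options interact can be sketched with a small LRU cache. `LRUActionCache` below is a hypothetical illustration rather than BuildGrid's implementation: it stores an exit code where a real cache would store an ActionResult, and it assumes a failure means a non-zero exit code.

```python
from collections import OrderedDict


class LRUActionCache:
    """Hypothetical LRU cache mirroring the three options described above."""

    def __init__(self, max_cached_refs, allow_updates=True, cache_failed_actions=True):
        self._max = max_cached_refs
        self._allow_updates = allow_updates
        self._cache_failed = cache_failed_actions
        self._entries = OrderedDict()

    def get(self, action_digest):
        if action_digest not in self._entries:
            return None
        self._entries.move_to_end(action_digest)  # mark as most recently used
        return self._entries[action_digest]

    def update(self, action_digest, exit_code):
        if not self._allow_updates:
            raise PermissionError("cache is read-only")
        if exit_code != 0 and not self._cache_failed:
            return False  # failed results are not stored
        self._entries[action_digest] = exit_code
        self._entries.move_to_end(action_digest)
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
        return True


cache = LRUActionCache(max_cached_refs=2, cache_failed_actions=False)
cache.update("a", 0)
cache.update("b", 0)
cache.get("a")        # "a" is now more recently used than "b"
cache.update("c", 0)  # exceeds the limit, so "b" is evicted
print(cache.get("b"), cache.get("a"), cache.update("d", 1))  # None 0 False
```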

services section

instances:
  - ...
    services:
      ...

The services section defines the actual gRPC services that should be enabled in this instance.

In this list we’ll use more YAML tags to instantiate the service instances, and pass references to the storages, schedulers, and caches defined earlier.

ActionCache service

instances:
  - ...
    services:
      - !action-cache
        cache: *build-cache

The !action-cache tag instantiates an ActionCache instance. Here we pass the anchored build-cache cache backend that we defined earlier to the cache key, ensuring that our ActionCache instance uses the cache we created earlier.

Execution service

instances:
  - ...
    services:
      ...
      - !execution
        storage: *cas-backend
        action-cache: *build-cache
        scheduler: *scheduler
        max-execution-timeout: 7200

The !execution tag creates an Execution instance, and by default a BotsInterface and an Operations instance too.

Here we pass our previously created storage, cache, and scheduler backends to the appropriate config keys using YAML references. Note that this means our Execution service and ActionCache implementation are using the same backend object.

We also specify a max-execution-timeout for our Execution service here.

To enable only a subset of these services, set the endpoints key to a list of the services you want, chosen from execution, bots, and operations.

The Execution service can take a number of other configuration options, which are listed in the Parser API reference.

CAS and ByteStream services

instances:
  - ...
    services:
      ...
      - !cas
        storage: *cas-backend

      - !bytestream
        storage: *cas-backend

The !cas tag creates a CAS instance, whilst the !bytestream tag creates a ByteStream instance. For a CAS to function correctly, both of these services need to be present in the instance configuration.

We pass the same anchored storage backend to both services, so that they’re both working to serve the same content.

Key points

Enabled gRPC services are in the instances -> services section of the configuration.
The YAML tags instantiate actual Python objects.
Reuse these objects using YAML anchors and references, to make sure everything gets wired up correctly.

Reference configuration

Below is an example of the full configuration reference:

# Server's configuration description.
description: |
  BuildGrid's server reference configuration.

# Server's network configuration.
server:
  - !channel
    # Address and TCP port to listen on.
    address: "[::]:50051"
    # Whether or not to disable SSL/TLS encryption.
    insecure-mode: true
    # SSL/TLS credentials.
    credentials:
      tls-server-key: !expand-path ~/.config/buildgrid/server.key
      tls-server-cert: !expand-path ~/.config/buildgrid/server.cert
      tls-client-certs: !expand-path ~/.config/buildgrid/client.cert

# gRPC tunables to pass to the gRPC server.
# See https://grpc.github.io/grpc/core/group__grpc__arg__keys.html for the full list
grpc-server-options:
  grpc.so_reuseport: 0
  grpc.max_connection_age_ms: 300000

# Server's authorization configuration.
authorization:
  # Type of authorization method.
  #  none  - Bypass the authorization mechanism
  #  jwt   - OAuth 2.0 bearer with JWT tokens
  method: jwt
  # Location for the file containing the secret, pass
  # or key needed by 'method' to authorize requests.
  secret: !expand-path ~/.config/buildgrid/auth.secret
  # The url to fetch the JWKs.
  # Either secret or this field must be specified. Defaults to ``None``.
  jwks-url: https://test.dev/.well-known/jwks.json
  # Audience used to validate the JWT.
  # This field must be specified if jwks-url is specified.
  # This field is case sensitive!
  audience: BuildGrid
  # The amount of time between fetching of the JWKs.
  # Defaults to 60 minutes.
  jwks-fetch-minutes: 30
  # Encryption algorithm to be used together with 'secret'
  # by 'method' to authorize requests (optional).
  #  hs256  - HMAC+SHA-256 for JWT method
  #  rs256  - RSASSA-PKCS1-v1_5+SHA-256 for JWT method
  algorithm: rs256

# List of connections to use for items like sql and redis
connections:
  - !sql-connection &sql
    # URI for connecting to a PostgreSQL database:
    connection-string: postgresql://bgd:insecure@database/bgd
    # URI for connecting to an SQLite database:
    #connection-string: sqlite:///./example.db

    # SQLAlchemy Pool Options
    pool-size: 5
    pool-timeout: 30
    pool-recycle: 3600
    max-overflow: 10

# List of storage backends for the instance.
storages:
  - !disk-storage &main-storage
    # Path to the local storage folder.
    path: !expand-path $HOME/cas

# List of action cache stores
caches:
  - !lru-action-cache &main-action
    # Alias to a storage backend, see 'storages'.
    storage: *main-storage
    # Maximum number of entries kept in cache.
    max-cached-refs: 256
    # Whether writing to the cache is allowed.
    allow-updates: true
    # Whether failed actions (non-zero exit code) are stored.
    cache-failed-actions: true

# List of schedulers to use in Execution and Bots services
schedulers:
  - !sql-scheduler &state-database
    sql: *sql
    action-cache: *main-action
    storage: *main-storage

    property-set:
      !dynamic-property-set
      # Non-standard keys which BuildGrid will allow jobs to set and use in the
      # scheduling algorithm when matching a job to an appropriate worker
      #
      # Jobs with keys which aren't defined in either this list or
      # `wildcard-property-keys` will be rejected.
      match-property-keys:
        # BuildGrid will match worker and jobs on foo, if set by job
        - foo
        # Can specify multiple keys.
        - bar

      # Non-standard keys which BuildGrid will allow jobs to set. These keys
      # won't be considered when matching jobs to workers.
      #
      # Jobs with keys which aren't defined in either this list or
      # `match-property-keys` will be rejected.
      wildcard-property-keys:
        # BuildGrid won't use the `chrootRootDigest` property to match jobs to workers,
        # but workers will still be able to use the value of the key to determine
        # what environment the job needs
        - chrootRootDigest

      # A static property set can be used instead of a dynamic property set using
      # !static-property-set
      # Static property sets require the value of keys to be pre-defined.
      # This decreases the scheduling cost to linear in comparison to the dynamic set but
      # requires the definition of all valid property sets.

      # property-labels: define a set of property combinations which are allowed by the scheduler.
        # - { label: linuxGreen, properties: [[platform, linux], [colour, green]] }
        # - { label: linuxBlue, properties: [[platform, linux], [colour, blue]] }

    # Base URL for external build action (web) browser service.
    action-browser-url: http://localhost:8080

    # BotSession Keepalive Timeout: The maximum time (in seconds)
    # to wait to hear back from a bot before assuming they're unhealthy.
    bot-session-keepalive-timeout: 120

    # Max Execution Timeout: Specify the maximum amount of time (in seconds) a job
    # can remain in executing state. If it exceeds the maximum execution timeout,
    # it will be marked as cancelled.
    # (Default: 7200)
    max-execution-timeout: 7200

    # Max number of locality hints to be associated with each bot.
    bot-locality-hint-limit: 10

    assigners:
      - !priority-age-assigner
        # Number of assigner threads to run.
        count: 5
        # Interval (in seconds) between each assignment attempt.
        interval: 1.0
        # Percentage of jobs that will be assigned by priority
        priority-assignment-percentage: 95
        # Bot assignment strategy to use for finding a bot to assign the job to.
        bot-assignment-strategy: !assign-by-locality
          sampling: !sampling-config
            # Sample size: The number of bots to sample when assigning a job.
            sample-size: 5
            # Max attempts: The maximum number of times to attempt to sample bots
            max-attempts: 3
          fallback: !assign-by-capacity

# Server's instances configuration.
instances:
  - name: main
    description: |
      The 'main' server instance.

    # List of services for the instance.
    #  action-cache     - REAPI ActionCache service.
    #  bytestream       - Google APIs ByteStream service.
    #  cas              - REAPI ContentAddressableStorage service.
    #  execution        - REAPI Execution + RWAPI Bots services.
    services:
      - !action-cache
        cache: *main-action

      - !execution
        scheduler: *state-database

        # Operation Stream Keepalive Timeout: The maximum time (in seconds)
        # to wait before sending the current status in an Operation response
        # stream of an `Execute` or `WaitExecution` request.
        operation-stream-keepalive-timeout: 120

        # Max List Operations Page Size: Specify the maximum number of results that can
        # be returned from a ListOperations request. BuildGrid will provide a page_token
        # with the response that the client can specify to get the next page of results.
        # (Default: 1000)
        max-list-operations-page-size: 1000

      - !cas
        # Alias to a storage backend, see 'storages'.
        storage: *main-storage

      - !bytestream
        # Alias to a storage backend, see 'storages'.
        storage: *main-storage

# Server's internal monitoring configuration.
monitoring:
  # Whether or not to activate the monitoring subsystem.
  enabled: false

  # Type of the monitoring bus endpoint.
  #  stdout  - Standard output stream.
  #  file    - On-disk file.
  #  socket  - UNIX domain socket.
  #  udp     - Port listening for UDP packets
  endpoint-type: socket

  # Location for the monitoring bus endpoint. Only
  # necessary for 'file', 'socket', and 'udp' `endpoint-type`.
  # Full path is expected for 'file', name
  # only for 'socket', and `hostname:port` for 'udp'.
  endpoint-location: monitoring_bus_socket

  # Messages serialisation format.
  #  binary  - Protobuf binary format.
  #  json    - JSON format.
  #  statsd  - StatsD format. Only metrics are kept - logs are dropped.
  serialization-format: binary

  # Prefix to prepend to the metric name before writing
  # to the configured endpoint.
  metric-prefix: buildgrid

# Maximum number of gRPC threads. Defaults to 5 times
# the CPU count if not specified. A minimum of 5 is
# enforced, whatever the configuration is.
thread-pool-size: 30

# Set unavailability lower than thread-pool-size to customize error responses.
limiter:
  !limiter
  concurrent-request-limit: 25

See the Parser API reference for details on the tagged YAML nodes in this configuration.

Deployment Guidance

BuildGrid is designed to be flexible about deployment topology. Each of the services it can provide can be configured in any combination in a given server. This section provides some example configuration files for different deployment topologies.

For details of the services, see Understanding the configuration file.

All-in-one

server:
  - !channel
    address: "[::]:50051"
    insecure-mode: true

grpc-server-options:
  grpc.max_connection_age_ms: 300000

description: >
  BuildGrid's default configuration:
    - Unauthenticated plain HTTP at :50051
    - Single instance: [unnamed]
    - In-memory data, max. 2Gio
    - DataStore: sqlite:///./example.db
    - Hosted services:
       - ActionCache
       - Execute
       - ContentAddressableStorage
       - ByteStream

authorization:
  method: none

monitoring:
  enabled: false

connections:
  - !sql-connection &sql
    connection-string: ''
    connection-timeout: 15

storages:
  - !lru-storage &cas-storage
    size: 2048M

caches:
  - !lru-action-cache &build-cache
    storage: *cas-storage
    max-cached-refs: 256
    cache-failed-actions: true
    allow-updates: true

schedulers:
  - !sql-scheduler &state-database
    sql: *sql
    storage: *cas-storage
    action-cache: *build-cache
    max-execution-timeout: 7200
    poll-interval: 0.5

instances:
  - name: ''
    description: |
      The unique '' instance.

    services:
      - !action-cache
        cache: *build-cache

      - !execution
        scheduler: *state-database

      - !cas
        storage: *cas-storage

      - !bytestream
        storage: *cas-storage

This configuration includes all the services required for remote execution and caching in a single gRPC server. It is ideal for trying out BuildGrid locally, but is not recommended for production, where you are likely to quickly run out of threads available to handle incoming requests.

In this configuration, all requests are sent to the same endpoint (which is exposed on port 50051).

Separate Execution and CAS/ActionCache

This example is for deploying two separate gRPC servers, one exposing the Execution, Operations, and Bots services, and the other exposing the CAS, ByteStream, and ActionCache services. In general, there’s unlikely to be a good reason to not colocate the CAS and ByteStream services no matter what the rest of your deployment looks like.

Execution, Operations, and Bots services

server:
  - !channel
    address: localhost:50051
    insecure-mode: true

connections:
  - !sql-connection &sql
    connection-string: sqlite:///./example.db
    connection-timeout: 15

storages:
  - !remote-storage &remote-cas
    url: http://storage:50052
    instance-name: ''

caches:
  - !remote-action-cache &remote-cache
    url: http://storage:50052
    instance-name: ''

schedulers:
  - !sql-scheduler &state-database
    sql: *sql
    storage: *remote-cas
    action-cache: *remote-cache
    poll-interval: 0.5
    max-execution-timeout: 7200

instances:
  - name: ''

    services:
      - !execution
        scheduler: *state-database

thread-pool-size: 1000

This configuration file defines the Execution, Operations, and Bots services. The Bots service is defined separately to give an example of how it can be independently defined. The !execution tag still supports including a Bots service if defined as follows.

- !execution
  storage: *remote-cas
  action-cache: *remote-cache
  data-store: *state-database
  max-execution-timeout: 7200
  endpoints:
    - execution
    - operations
    - bots

Omitting the endpoints key has the same effect, as enabling all three services is currently the default.

CAS, ByteStream, and ActionCache services

server:
  - !channel
    address: localhost:50052
    insecure-mode: true

storages:
  - !disk-storage &main-storage
    path: !expand-path $HOME/cas

caches:
  - !lru-action-cache &main-action
    storage: *main-storage
    max-cached-refs: 256
    allow-updates: true

instances:
  - name: ''

    services:
      - !action-cache
        cache: *main-action

      - !cas
        storage: *main-storage

      - !bytestream
        storage: *main-storage

thread-pool-size: 1000

This configuration file defines the CAS, ByteStream, and ActionCache services. These are the services referenced by the !remote-storage and !remote-action-cache tags in the earlier configuration.

This configuration is a bit more production-ready than the all-in-one example, but there are still a few limitations.

PostgreSQL should be used for the scheduler’s data store, rather than SQLite.
The ActionCache is probably too small for real use.
The ActionCache as configured here won’t support horizontal scaling, which will be needed to handle a good amount of incoming requests (due to the thread limit being set to 1000).

It is also possible (if your client supports it) to split out the services further, for example splitting the ActionCache service out into a separate server, and similarly moving out the Bots service. It’s worth noting that Bazel doesn’t support that topology for the ActionCache, since it assumes the ActionCache to be colocated with CAS.

This kind of further splitting can be useful for targeting specific parts of the deployment for horizontal scaling.

Behind a Proxy

BuildGrid can be deployed behind a gRPC proxy to allow services to be deployed separately as described above, whilst providing the ease of having all services exposed via a single URL.

This also avoids the aforementioned need to colocate the CAS and ActionCache in order to support Bazel as a client, since pointing Bazel at a proxy which can route to separate CAS and ActionCache services is functionally the same.

BuildGrid should work behind any web server which can handle routing gRPC requests, for example nginx or Envoy. The proxy should be configured to route requests to the relevant service, with GetCapabilities requests being routed to the Execution service. The Execution service handles GetCapabilities requests specially: it also forwards the request to the CAS and ActionCache it is configured to use, and combines the results before returning. This allows it to effectively report on the capabilities of the whole BuildGrid deployment.

In this example routing is done on a service level, with each request being routed to the relevant backend BuildGrid service. Note that requests to Capabilities are routed to the Execution service. A more complex deployment may find it useful to route at the request level, for example routing ByteStream Write requests to a specific place.

Configuration location

Unless a configuration file is explicitly specified on the command line when invoking bgd, BuildGrid will always attempt to load configuration resources from $XDG_CONFIG_HOME/buildgrid. On most Linux based systems, the location will be ~/.config/buildgrid.

This location is referred to as $CONFIG_HOME in the rest of the document.
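The lookup described above can be approximated as follows. This is a sketch of the documented location, not BuildGrid's own code; the `config_home` helper is a hypothetical name, and the fallback to ~/.config follows the XDG Base Directory convention for an unset or empty XDG_CONFIG_HOME.

```python
import os
from pathlib import Path


def config_home(environ=None):
    """Return the directory searched for BuildGrid configuration resources."""
    environ = os.environ if environ is None else environ
    # An unset or empty XDG_CONFIG_HOME falls back to ~/.config.
    base = environ.get("XDG_CONFIG_HOME") or os.path.join(
        os.path.expanduser("~"), ".config"
    )
    return Path(base) / "buildgrid"


print(config_home({"XDG_CONFIG_HOME": "/tmp/xdg-example"}))
print(config_home({}))
```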

TLS encryption

Every BuildGrid gRPC communication channel can be encrypted using SSL/TLS. By default, the BuildGrid server will try to set up secure gRPC endpoints and return an error if that fails. You must specify --allow-insecure explicitly if you want it to use non-encrypted connections.

The TLS protocol handshake relies on an asymmetric cryptography system that requires the server and the client to own a public/private key pair. BuildGrid will try to load keys from these locations by default:

Server private key: $CONFIG_HOME/server.key
Server public key/certificate: $CONFIG_HOME/server.crt
Client private key: $CONFIG_HOME/client.key
Client public key/certificate: $CONFIG_HOME/client.crt

Server key pair

The TLS protocol requires a key pair to be used by the server. The following example generates a self-signed key server.key, which requires clients to have a copy of the server certificate server.crt. You can of course use a key pair obtained from a trusted certificate authority instead.

openssl req -new -newkey rsa:4096 -x509 -sha256 -days 3650 -nodes -batch -subj "/CN=localhost" -out server.crt -keyout server.key

Client key pair

If the server requires authentication in order to be granted special permissions like uploading to CAS, a client side key pair is required. The following example generates a self-signed key client.key, which requires the server to have a copy of the client certificate client.crt.

openssl req -new -newkey rsa:4096 -x509 -sha256 -days 3650 -nodes -batch -subj "/CN=client" -out client.crt -keyout client.key

Persisting Internal State

BuildGrid’s Execution and Bots services can be configured to store their internal state (such as the job queue) in an external data store of some kind. At the moment the only supported type of data store is an SQL database with a driver supported by SQLAlchemy.

This makes it possible to restart a BuildGrid process while preserving the Job Queue, alleviating concerns about having to finish currently queued work before restarting the scheduler or else losing track of that work. Upon restarting, BuildGrid will load the jobs it previously knew about from the data store, and recreate its internal state.

However, note that:

Previous connections will need to be recreated.
- For clients, that can be done by sending a WaitExecution request with the relevant operation name.
- For bots, they can re-register by sending a CreateBotSession request to accept more work.
Work executing during the restart will be re-assigned to a capable, newly-registered bot when it gets picked up from the queue, thus progress will be lost.

Hint

Permissive BotSession Mode is an option for the Bots Interface which allows configurations using a persistent scheduler to verify some of the ongoing leases that were assigned by a different BuildGrid process. This makes it possible to keep the progress made on a lease while BuildGrid is restarting (or in a round-robin BuildGrid cluster). However, enabling this option may cause issues if used with the bot_session_keepalive_timeout option. For example, BuildGrid may re-queue some jobs and cancel the relevant existing leases if a bot starts talking to another BuildGrid process while executing a job, and the process(es) it was previously talking to are still running with bot_session_keepalive_timeout enabled. The option works well when the bot is able to talk to the same BuildGrid process except while that process is restarting (for example in a primary/backup or sticky-session set-up, configured at the DNS level).

To use this feature, use the following option in the scheduler config:

services:
...
    - !execution
      storage: ...
      action-cache: ...
      scheduler: ...
      permissive-bot-session: True
      ...

SQL Database

The SQL data store implementation uses SQLAlchemy to connect to a database for storing the job queue and related state.

There are database migrations provided, and BuildGrid can be configured to automatically run them when connecting to the database. Alternatively, this can be disabled and the migrations can be executed manually using Alembic.

When using the SQL Data Store with the default configuration (i.e. no connection-string), a temporary SQLite database will be created for the lifetime of BuildGrid’s execution.

Hint

SQLite in-memory databases are not supported by BuildGrid to ensure multiple threads can share the same state database without any issues (using SQLAlchemy’s StaticPool).
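The difference between an in-memory and a file-backed SQLite database can be seen with the standard library alone. The sketch below uses the sqlite3 module directly rather than SQLAlchemy, and the `jobs` table is a hypothetical name; it only shows that separate connections to ":memory:" do not share state, whereas connections to the same temporary file do.

```python
import os
import sqlite3
import tempfile


def visible_tables(conn):
    """List the table names this connection can see."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [name for (name,) in rows]


# Two connections to ":memory:" get two unrelated, private databases.
mem_a = sqlite3.connect(":memory:")
mem_b = sqlite3.connect(":memory:")
mem_a.execute("CREATE TABLE jobs (name TEXT)")
mem_a.commit()
memory_shared = "jobs" in visible_tables(mem_b)

# Two connections to the same temporary file see the same tables.
fd, path = tempfile.mkstemp(suffix=".db")
os.close(fd)
file_a = sqlite3.connect(path)
file_b = sqlite3.connect(path)
file_a.execute("CREATE TABLE jobs (name TEXT)")
file_a.commit()
file_shared = "jobs" in visible_tables(file_b)

print(memory_shared, file_shared)  # False True

for conn in (mem_a, mem_b, file_a, file_b):
    conn.close()
os.remove(path)
```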

SQLite Configuration Block Example

instances:
  - name: ''

    storages:
      - !lru-storage &cas-storage
        size: 2048M

    schedulers:
      - !sql-scheduler &state-database
        storage: *cas-storage
        connection-string: sqlite:////path/to/sqlite.db
        # ... or don't specify the connection-string and BuildGrid will create a tempfile

    services:
      - !execution
        storage: *cas-storage
        scheduler: *state-database
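
The number of slashes in a SQLite connection-string matters. Under SQLAlchemy's URL convention, three slashes introduce a relative path and four slashes an absolute one. The sketch below illustrates that convention in plain Python. Note that sqlite_path is a hypothetical helper written for this example; it is not part of BuildGrid or SQLAlchemy.

```python
SQLITE_PREFIX = "sqlite:///"


def sqlite_path(connection_string: str) -> str:
    """Return the file path encoded in a SQLite connection string.

    Hypothetical helper for illustration only.
    """
    if not connection_string.startswith(SQLITE_PREFIX):
        raise ValueError(f"not a SQLite file URL: {connection_string!r}")
    # Everything after the third slash is the path, so a fourth slash
    # makes the path absolute.
    return connection_string[len(SQLITE_PREFIX):]


print(sqlite_path("sqlite:///./example.db"))        # ./example.db
print(sqlite_path("sqlite:////path/to/sqlite.db"))  # /path/to/sqlite.db
```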

PostgreSQL Configuration Block Example

instances:
  - name: ''

    storages:
      - !lru-storage &cas-storage
        size: 2048M

    schedulers:
      - !sql-scheduler &state-database
        storage: *cas-storage
        connection-string: postgresql://username:password@sql_server/database_name
        # SQLAlchemy Pool Options
        pool-size: 5
        pool-timeout: 30
        pool-pre-ping: yes
        pool-recycle: 3600
        max-overflow: 10

    services:
      - !execution
        storage: *cas-storage
        scheduler: *state-database

The database migrations can be run by cloning the git repository, modifying the sqlalchemy.url line in alembic.ini to match the connection-string in the configuration, and executing

tox -e venv -- alembic --config ./alembic.ini upgrade head

in the root directory of the repository. The docker-compose files in the git repository offer an example approach for PostgreSQL.
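
If the sqlalchemy.url edit needs to be scripted (for example in a deployment job), a small helper along the following lines may be enough. This is a minimal sketch, assuming alembic.ini contains exactly one line starting with sqlalchemy.url; set_sqlalchemy_url is a hypothetical name and is not shipped with BuildGrid or Alembic.

```python
import re
from pathlib import Path


def set_sqlalchemy_url(ini_path: str, connection_string: str) -> None:
    """Replace the sqlalchemy.url line of an alembic.ini file in place.

    Hypothetical helper; comments and all other settings are left untouched.
    """
    path = Path(ini_path)
    text = path.read_text()
    # A function is used as the replacement so that backslashes in the
    # connection string are not interpreted as regex escapes.
    new_text, count = re.subn(
        r"(?m)^sqlalchemy\.url\s*=.*$",
        lambda _match: f"sqlalchemy.url = {connection_string}",
        text,
    )
    if count != 1:
        raise ValueError(f"expected exactly one sqlalchemy.url line, found {count}")
    path.write_text(new_text)
```

It could then be called as set_sqlalchemy_url("alembic.ini", "postgresql://username:password@sql_server/database_name") before running the alembic command above.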

Hint

Depending on the permissions and database configuration, you may need to create and initialize the database yourself before Alembic can create all the tables for you.

If Alembic fails to create the tables because it cannot read or create the alembic_version table, you could use the following SQL command:

CREATE TABLE alembic_version (
  version_num VARCHAR(32) NOT NULL,
  CONSTRAINT alembic_version_pkc PRIMARY KEY (version_num));

Automatic job pruning

When a job completes, its associated record will remain in the database so that queries continue to reflect its status.

The automatic pruning mechanism ensures that jobs that have been completed for longer than a given age are removed from the database, freeing up space.

When enabled, a cleanup routine runs once every pruner-period and deletes jobs that are older than pruner-job-max-age. Internally, it follows this logic:

pruning_thread():
  every pruner-period:
    delete at most pruner-max-delete-window jobs older than pruner-job-max-age

Because the delete operation will block the database, another option, pruner-max-delete-window, allows setting an upper bound on the number of records that can be deleted in one pass.

Note

A lower pruner-max-delete-window size will make each pruning pass less expensive but will make the recovery of free space take longer.
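
The effect of the delete window can be illustrated with a standalone sketch of a single pruning pass. It is not BuildGrid's implementation: the jobs table, its name and completed_at columns, and the prune_once and make_demo_db functions are all assumptions made for this example, and it uses the sqlite3 module directly instead of SQLAlchemy.

```python
import sqlite3
import time

DAY = 86400  # seconds


def prune_once(conn, max_age_seconds, max_delete_window, now=None):
    """Delete at most max_delete_window completed jobs older than the cutoff.

    Returns the number of rows deleted in this pass.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    cursor = conn.execute(
        """
        DELETE FROM jobs
        WHERE name IN (
            SELECT name FROM jobs
            WHERE completed_at IS NOT NULL AND completed_at < ?
            ORDER BY completed_at
            LIMIT ?
        )
        """,
        (cutoff, max_delete_window),
    )
    conn.commit()
    return cursor.rowcount


def make_demo_db(completed_times):
    """Build a throwaway database (hypothetical schema) for the demo."""
    # An in-memory database is only used here because the demo is standalone.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (name TEXT PRIMARY KEY, completed_at REAL)")
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?)",
        [(f"job-{i}", t) for i, t in enumerate(completed_times)],
    )
    conn.commit()
    return conn


now = time.time()
# Five jobs completed 100 days ago, one job completed yesterday.
conn = make_demo_db([now - 100 * DAY] * 5 + [now - 1 * DAY])
print(prune_once(conn, 90 * DAY, 3, now=now))  # 3: capped by the window
print(prune_once(conn, 90 * DAY, 3, now=now))  # 2: the remaining old jobs
```

With a window of 3, two passes are needed to remove the five expired jobs, which is the trade-off described in the note above.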

Configuration

The example below shows an SQL-backed scheduler that keeps jobs for 90 days after their completion, pruning at most 10,000 database entries every 48 hours.

Durations can be specified as floating-point amounts of weeks, days, hours, and combinations thereof.

schedulers:
  - !sql-scheduler &state-database
    storage: *cas-storage
    connection-string: sqlite:///./example.db
    connection-timeout: 15
    poll-interval: 0.5
    # Automatic pruning options:
    pruner-job-max-age:
      days: 90
    pruner-period:
      hours: 48
    pruner-max-delete-window: 10000
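
To get a feel for what a combined duration amounts to, the mappings can be mirrored with Python's datetime.timedelta, which accepts the same unit names as keyword arguments. This is only an illustration: whether BuildGrid parses durations this way is an assumption, and to_timedelta is a hypothetical helper.

```python
from datetime import timedelta

# Units named in this section; anything else is rejected by the sketch.
ALLOWED_UNITS = {"weeks", "days", "hours"}


def to_timedelta(duration: dict) -> timedelta:
    """Convert a mapping such as {"days": 90} into a timedelta."""
    unknown = set(duration) - ALLOWED_UNITS
    if unknown:
        raise ValueError(f"unsupported duration units: {sorted(unknown)}")
    return timedelta(**duration)


print(to_timedelta({"days": 90}))               # 90 days, 0:00:00
print(to_timedelta({"hours": 48}))              # 2 days, 0:00:00
print(to_timedelta({"weeks": 1, "days": 0.5}))  # 7 days, 12:00:00
```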

Monitoring and Metrics

BuildGrid provides a mechanism to output its logs in a number of formats, in addition to printing them to stdout. Log messages can be formatted as JSON or the binary form of the protobuf messages, and can be written to a file, a UNIX domain socket, or a UDP port.

BuildGrid also provides some metrics to give insight into the current health and utilisation of the BuildGrid instance. These metrics are protobuf messages similar to the log messages, and can be configured in the same way. Additionally, metrics can be formatted as StatsD metric strings, so that BuildGrid can send its metrics directly to a remote StatsD server.

If the statsd format is used, then log messages are dropped and only metrics are written to the configured endpoint. The log messages are still written to stdout in this situation.

See Monitoring and Metrics for more details on monitoring options.

StatsD Metrics

A common monitoring setup is to publish metrics to a StatsD server, for aggregation and display using a tool like Grafana. BuildGrid’s udp monitoring endpoint-type supports this directly.

This configuration snippet causes metrics to be published with a buildgrid prefix to a StatsD server listening on port 8125 at the hostname statsd-server, which must be resolvable by the BuildGrid instance.

monitoring:
  enabled: true
  endpoint-type: udp
  endpoint-location: statsd-server:8125
  serialization-format: statsd
  metric-prefix: buildgrid
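
When debugging such a setup, it can help to see what a StatsD datagram looks like on the wire. The sketch below formats a counter in the general StatsD line format (name:value|c) and sends it over UDP to a local receiver. The metric name example_metric is made up, and joining the prefix and the name with a dot is an assumption, not a statement about the names BuildGrid emits.

```python
import socket


def format_counter(prefix: str, name: str, value: int = 1) -> str:
    """Build a StatsD counter line such as 'buildgrid.example_metric:1|c'."""
    return f"{prefix}.{name}:{value}|c"


def send_udp(line: str, host: str, port: int) -> None:
    """Send one metric line as a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Stand-in for a StatsD server: a UDP socket bound to a free local port.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver:
    receiver.bind(("127.0.0.1", 0))
    receiver.settimeout(5)
    host, port = receiver.getsockname()
    send_udp(format_counter("buildgrid", "example_metric"), host, port)
    data, _addr = receiver.recvfrom(1024)
    print(data.decode("ascii"))
```

Pointing send_udp at statsd-server and port 8125 instead of the local receiver is a quick way to check that the endpoint-location is reachable from the BuildGrid host.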

Server Reflection

BuildGrid supports server reflection for every service specified in the configuration file. This allows clients to send requests to specific services without having the proto definitions. For example, the details of an ongoing operation can be listed with grpc_cli as follows:

./grpc_cli call localhost:50051 GetOperation  "name: '46a5640e-c3c5-4c7e-b622-df0709540107'"
connecting to localhost:50051
{
"name": "dev/46a5640e-c3c5-4c7e-b622-df0709540107",
"metadata": {
  "@type": "type.googleapis.com/build.bazel.remote.execution.v2.ExecuteOperationMetadata",
  "stage": "QUEUED",
  "actionDigest": {
  "hash": "267d1ff6e8d45b812fbc535fdbb8b69cbd6f7401ac3cc4ba21daa02750045906",
  "sizeBytes": "138"
  }
},
"response": {
  "@type": "type.googleapis.com/build.bazel.remote.execution.v2.ExecuteResponse"
}
}

Rpc succeeded with OK status

Server reflection is enabled by default, and can be disabled by specifying the following key in the YAML configuration: server-reflection: false