Worker Resource Sharing

BuildGrid’s concept of instance names allows creation of separate tenancies for different client use-cases. In practice, many of these use-cases can share the same worker machines. BuildGrid provides features that allow instances sharing a database to also share worker compute resources efficiently.

Bot cohorts

BuildGrid has a concept of bot cohorts, which are groups of bots that match specific characteristics. A cohort can be thought of as a pool of available compute capacity of a specific type, and cohorts are used to model the overall capacity of the grid without considering individual machines.

Each bot connected to BuildGrid has a set of platform properties. These properties are used to apply platform labels to the bots based on the property set configuration of the server.

Cohorts are configured with a set of labels, and a bot is considered to be in a cohort if all of the labels applied to the bot are within the cohort’s set of labels.
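The membership rule above amounts to a subset check. The following is a minimal sketch of that rule, not BuildGrid's implementation: the function name `bot_in_cohort` is hypothetical, and the handling of a bot with no labels is an assumption, since the text does not cover that case.

```python
from typing import AbstractSet


def bot_in_cohort(bot_labels: AbstractSet[str], cohort_labels: AbstractSet[str]) -> bool:
    """Return True when every label applied to the bot is in the cohort's label set.

    Hypothetical helper for illustration only.
    """
    # Assumption: a bot with no labels at all is treated as matching no cohort.
    if not bot_labels:
        return False
    # The rule from the text: all of the bot's labels must be within the cohort's labels.
    return bot_labels <= cohort_labels


# A bot labelled only "linux-arm" fits a cohort that lists "linux-arm".
arm_only = bot_in_cohort({"linux-arm"}, {"linux-arm"})
# A bot carrying a label the cohort does not list falls outside the cohort.
mixed = bot_in_cohort({"linux-arm", "linux-x86"}, {"linux-arm"})
```

Note that the check runs from the bot's labels to the cohort's labels, so a cohort listing several labels can contain bots that carry only one of them.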

Cohorts can be used in conjunction with a cohort job assigner to support fair resource sharing across instances, as described below.

Instance-agnostic bots

Normally bots specify an instance name when connecting to BuildGrid, similarly to other clients. This restricts each bot to consuming jobs from that specific instance. That is sometimes desirable, but it makes it difficult to use compute efficiently when instances see fluctuating rates of incoming Execute requests.

To resolve this situation, a Bots service can be configured for a special instance name, *. Bots connecting to this instance name are instance-agnostic and accept jobs from all instance names, provided the other platform requirements are met.
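The difference between the two kinds of bot can be illustrated with a small eligibility check. This is a sketch under stated assumptions rather than BuildGrid's scheduling code: `bot_may_take_job` and its `platform_ok` parameter are hypothetical names, with `platform_ok` standing in for the platform requirement matching that the text mentions but does not detail.

```python
WILDCARD_INSTANCE = "*"  # the special instance name for instance-agnostic bots


def bot_may_take_job(bot_instance: str, job_instance: str, platform_ok: bool) -> bool:
    """Decide whether a bot is eligible for a job (hypothetical helper).

    platform_ok stands in for "other platform requirements are met".
    """
    if not platform_ok:
        return False
    # An instance-agnostic bot accepts jobs from any instance name;
    # any other bot only accepts jobs from the instance it connected with.
    return bot_instance == WILDCARD_INSTANCE or bot_instance == job_instance
```

For example, a bot connected with `*` is eligible for jobs from both a `dev` and a `release` instance, while a bot connected with `dev` is eligible only for `dev` jobs.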

Quota

Using the instance-agnostic bots functionality allows instances to make use of all available worker machines. However, it introduces a problem whereby a single busy instance can starve other instances of compute. This problem is resolved by the Quota service, a BuildGrid-specific service for managing the number of worker resources available for use on a per-instance basis.

The Quota service allows setting two different quotas per instance name:

min_quota - The minimum guaranteed number of concurrent job slots on worker machines for this instance name.
max_quota - The maximum number of concurrent job slots on worker machines that can be used by this instance name.

Quotas are set per bot cohort as well as per instance name. Cohorts themselves are configured based on the property labels applied to bots.

A recommended approach is to configure a maximum quota equal to the expected maximum number of job slots in the cohort, and a minimum quota sufficient to provide the desired throughput based on the complexity of your execution jobs.

The sum of all minimum quotas in a cohort must be less than or equal to the total number of expected job slots in the cohort. If this is not the case, the specified minimum quotas will not be respected by the assigners.
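The constraint on minimum quotas is simple arithmetic and can be checked before quotas are applied. The sketch below is illustrative only: `min_quotas_fit` is a hypothetical helper, and the instance names and slot counts are made-up values.

```python
from typing import Mapping


def min_quotas_fit(min_quotas: Mapping[str, int], expected_slots: int) -> bool:
    """Check that the minimum quotas in one cohort can all be honoured at once.

    min_quotas maps instance name -> min_quota for a single cohort.
    expected_slots is the total number of job slots expected in that cohort.
    """
    return sum(min_quotas.values()) <= expected_slots


# Hypothetical cohort with 100 expected job slots.
ok = min_quotas_fit({"dev": 20, "release": 50, "ci": 30}, expected_slots=100)
too_much = min_quotas_fit({"dev": 40, "release": 50, "ci": 30}, expected_slots=100)
```

In the second case the minimums add up to 120 against 100 slots, so not every instance could be given its guaranteed share at the same time.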

Configuring quotas

Quotas can be configured using the bgd quota CLI tool.

bgd quota get INSTANCE_NAME BOT_COHORT will retrieve information about the quotas currently configured for the given instance on the given bot cohort. This also retrieves the current usage.

bgd quota put --min-quota=X --max-quota=Y INSTANCE_NAME BOT_COHORT will set or update quotas for the given instance on the given bot cohort. Both min and max quota are integers, and both must be specified in the command.

bgd quota delete INSTANCE_NAME BOT_COHORT will remove all quota configuration from the given instance on the given bot cohort.

See bgd quota.
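When quotas for many instances are managed from a script, the put command can be assembled programmatically. The sketch below only builds the argument list in the form shown above; `quota_put_command` is a hypothetical helper, the sanity check on the values is an assumption of this sketch rather than documented CLI behaviour, and any connection options a deployment needs are not shown.

```python
import shlex
from typing import List


def quota_put_command(instance_name: str, bot_cohort: str,
                      min_quota: int, max_quota: int) -> List[str]:
    """Build the argv for `bgd quota put` (hypothetical helper)."""
    # Assumption of this sketch: reject negative values and min_quota > max_quota.
    if min_quota < 0 or max_quota < min_quota:
        raise ValueError("expected 0 <= min_quota <= max_quota")
    return [
        "bgd", "quota", "put",
        f"--min-quota={min_quota}",
        f"--max-quota={max_quota}",
        instance_name,
        bot_cohort,
    ]


argv = quota_put_command("dev", "x86", min_quota=10, max_quota=50)
# shlex.join renders the list as a shell-safe string for logging or review.
printable = shlex.join(argv)
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the `bgd` tool is installed and can reach the server.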

Pre-emption scheduling

When using the Quota service with non-zero minimum quotas, it is recommended to configure pre-emption-based scheduling using the cohort job assigner.

This form of assignment will cancel in-flight jobs for instances over their min_quota if there are no job slots available for a job from an instance below its min_quota. This is what allows the minimum quotas to actually be enforced. The cancellation is not immediate: a grace period first gives in-flight jobs a chance to complete, and cancellation is used only after that period has passed.
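The pre-emption condition can be sketched as a pure function over per-instance usage in one cohort. This is an illustration of the rule as described, not the cohort assigner's actual code: `InstanceUsage` and `eviction_candidates` are hypothetical names, the ordering of candidates by how far they exceed their minimum is a choice made for this sketch, and the grace period is not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstanceUsage:
    """Usage of one instance within a single cohort (hypothetical structure)."""
    name: str
    min_quota: int
    in_flight: int  # job slots currently occupied by this instance


def eviction_candidates(usages: Dict[str, InstanceUsage],
                        requesting: str, free_slots: int) -> List[str]:
    """Return instances whose in-flight jobs could be pre-empted for `requesting`."""
    requester = usages[requesting]
    # Pre-emption only applies when no slot is free and the requesting
    # instance is still below its guaranteed minimum.
    if free_slots > 0 or requester.in_flight >= requester.min_quota:
        return []
    over = [u for u in usages.values()
            if u.name != requesting and u.in_flight > u.min_quota]
    # Sketch-only ordering: largest overage first, then by name for determinism.
    over.sort(key=lambda u: (-(u.in_flight - u.min_quota), u.name))
    return [u.name for u in over]


usages = {
    "a": InstanceUsage("a", min_quota=2, in_flight=5),
    "b": InstanceUsage("b", min_quota=3, in_flight=1),
    "c": InstanceUsage("c", min_quota=4, in_flight=4),
}
candidates = eviction_candidates(usages, requesting="b", free_slots=0)
```

Here instance `b` is below its minimum and the cohort is full, so only `a`, which is running above its own minimum, is a candidate. Instance `c` sits exactly at its minimum and is left alone.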

Eviction cooldown

After a job is evicted from an instance via pre-emption, the instance’s effective max_quota is temporarily reduced by the number of recent evictions within a configurable cooldown window. This prevents the evicted instance from immediately reclaiming slots at its full max_quota, giving the evicting instance time to make use of the freed capacity.

The cooldown window is controlled by the eviction_cooldown_seconds scheduler setting, which defaults to 600 seconds. Setting this value to 0 disables the adjustment entirely.
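The adjustment described above can be written out as a short calculation. The sketch below is an interpretation of the text, not BuildGrid's code: `effective_max_quota` is a hypothetical function, timestamps are plain seconds, and both the clamp at zero and the treatment of an eviction exactly at the window boundary are assumptions.

```python
from typing import Iterable


def effective_max_quota(max_quota: int, eviction_times: Iterable[float],
                        now: float, cooldown_seconds: float = 600.0) -> int:
    """Reduce max_quota by the number of evictions inside the cooldown window."""
    # A cooldown of 0 disables the adjustment entirely.
    if cooldown_seconds <= 0:
        return max_quota
    # Count evictions that happened less than cooldown_seconds ago
    # (assumption: an eviction exactly at the boundary no longer counts).
    recent = sum(1 for t in eviction_times if now - t < cooldown_seconds)
    # Assumption: the effective quota never drops below zero.
    return max(0, max_quota - recent)


# Evictions at t=0, t=500 and t=950, evaluated at t=1000 with the default window:
# only the last two fall within the preceding 600 seconds.
reduced = effective_max_quota(10, [0.0, 500.0, 950.0], now=1000.0)
```

With a `max_quota` of 10 and two recent evictions, the instance is limited to 8 slots until those evictions age out of the window.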

Example configuration

server:
  - !channel
    address: localhost:50051
    insecure-mode: true

connections:
  - !sql-connection &sql
    connection-string: postgresql://bgd:insecure@database/bgd
    pool-size: 5
    pool-timeout: 30
    max-overflow: 10

storages:
  - !remote-storage &remote-cas
    url: http://storage:50052
    instance-name: ''

caches:
  - !remote-action-cache &remote-cache
    url: http://storage:50052
    instance-name: ''

schedulers:
  - !sql-scheduler &scheduler
    sql: *sql
    storage: *remote-cas
    action-cache: *remote-cache
    property-set:
      # Configuration of a static property set, allowing linux bots running on
      # either x86-64 or arm-a64 machines to connect.
      #
      # Since these are mutually exclusive, bots will only ever have one of
      # these labels. In some configurations, a bot may be able to match
      # multiple labels.
      !static-property-set
      property-labels: 
        - { label: "linux-arm", properties: [["OSFamily", "linux"], ["ISA", "arm-a64"]] }
        - { label: "linux-x86", properties: [["OSFamily", "linux"], ["ISA", "x86-64"]] }
    cohort-set:
      # Configuration to specify two separate cohorts of bots. Cohorts are
      # groups of bots which we want to consider together for scheduling
      # purposes.
      #
      # This example is simple and splits based on the ISA which we note in
      # the property label. More complex configurations can group multiple
      # labels into a single cohort. A bot is considered part of a cohort if
      # all property labels the bot matches are configured as part of the
      # `property-labels` list in a `!cohort-set`.
      !cohort-set
      cohorts:
        - name: "arm"
          property-labels: ["linux-arm"]
        - name: "x86"
          property-labels: ["linux-x86"]
    assigners:
      # Configuration for a cohort assigner, assigning with a preemption
      # delay of 10 seconds. This will start a thread for each cohort in
      # `cohort-set`, which will assign matching jobs to bots in that cohort.
      # If the instance name a job is for is below its `min_quota` in the
      # cohort, then after the configured 10 second delay another job will be
      # evicted from the cohort.
      - !cohort-assigner
        cohort-set: ["linux-arm", "linux-x86"]
        count: 1
        preemption-delay: 10.0

# This services block outside of any instance name configuration is used to
# configure services which will accept arbitrary instance names and operate on
# them appropriately.
services:
  - !bots-service
    scheduler: *scheduler

  - !quota-service
    scheduler: *scheduler

thread-pool-size: 100