Worker Resource Sharing
BuildGrid’s concept of instance names allows creation of separate tenancies for different client use-cases. In practice, many of these use-cases can share the same worker machines. BuildGrid provides features that allow instances which share a database to also share worker compute resources efficiently.
Bot cohorts
BuildGrid has a concept of bot cohorts, which are groups of bots which match specific characteristics. These can be thought of as groups of available compute capacity of a specific type, and used to model overall capacity of the grid without considering individual machines.
Each bot connected to BuildGrid has a set of platform properties. The server’s property set configuration uses these properties to apply property labels to each bot.
Cohorts are configured with a set of labels, and a bot is considered to be in a cohort if all of the labels applied to the bot are within the cohort’s set of labels.
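The membership rule above amounts to a subset test. The following is a minimal sketch of that rule only, not BuildGrid's implementation: the function name `cohorts_for_bot`, the dictionary layout, and the extra "linux" cohort are hypothetical, and treating a bot with no labels as belonging to no cohort is an assumption of this sketch.

```python
def cohorts_for_bot(bot_labels, cohorts):
    """Return the names of cohorts whose label set contains every label on the bot.

    `cohorts` maps a cohort name to its configured set of labels.
    Assumption of this sketch: a bot with no labels is in no cohort.
    """
    labels = set(bot_labels)
    if not labels:
        return []
    return [
        name
        for name, cohort_labels in cohorts.items()
        if labels <= set(cohort_labels)  # all bot labels are within the cohort's labels
    ]


# Hypothetical cohorts: two single-label cohorts and one grouping both labels.
cohorts = {
    "arm": {"linux-arm"},
    "x86": {"linux-x86"},
    "linux": {"linux-arm", "linux-x86"},
}

print(cohorts_for_bot(["linux-arm"], cohorts))
print(cohorts_for_bot(["linux-arm", "gpu"], cohorts))
```

A bot carrying a label that a cohort does not list (the hypothetical "gpu" label here) falls outside that cohort, even if its other labels match.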
Cohorts can be used in conjunction with a cohort job assigner to support fair resource sharing across instances, as described below.
Instance-agnostic bots
Normally, bots specify an instance name when connecting to BuildGrid, similarly to other clients. This restricts each bot to consuming jobs from that specific instance, which can sometimes be desirable, but makes it difficult to use compute efficiently when the rate of incoming Execute requests fluctuates between instances.
To resolve this situation, a Bots service can be configured for a special
instance name, *. Bots connecting to this instance name will be
instance-agnostic, and accept jobs from all instance names assuming other
platform requirements are met.
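The effect of the wildcard instance name on job matching can be sketched as below. This is an illustration under stated assumptions rather than BuildGrid's matching code: `bot_accepts_job` is a hypothetical helper, the instance names are made up, and "other platform requirements" are reduced to a single required property label.

```python
WILDCARD_INSTANCE = "*"


def bot_accepts_job(bot_instance, job_instance, bot_labels, required_label):
    """Decide whether a bot may take a job.

    A bot connected to the "*" instance accepts jobs from any instance;
    any other bot only accepts jobs from its own instance. In both cases
    the platform requirement (simplified here to one label) must be met.
    """
    instance_ok = bot_instance == WILDCARD_INSTANCE or bot_instance == job_instance
    return instance_ok and required_label in bot_labels


# An instance-agnostic x86 bot can serve a hypothetical "team-a" instance,
# while a bot pinned to "team-b" cannot.
print(bot_accepts_job("*", "team-a", {"linux-x86"}, "linux-x86"))
print(bot_accepts_job("team-b", "team-a", {"linux-x86"}, "linux-x86"))
```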
Quota
Using the instance-agnostic bots functionality allows instances to make use of all available worker machines. However, it introduces a problem whereby a single busy instance can starve other instances of compute. This problem is resolved by the Quota service, a BuildGrid-specific service for managing the number of worker resources available for use on a per-instance basis.
The Quota service allows setting two different quotas per instance name:
min_quota - The minimum guaranteed number of concurrent job slots on worker machines for this instance name.
max_quota - The maximum number of concurrent job slots on worker machines that can be used by this instance name.
These quotas are also per bot cohort, which can be configured based on the property labels applied to a bot.
A recommended approach is to configure a maximum quota equal to the expected maximum number of job slots in the cohort, and a minimum quota sufficient to provide the desired throughput based on the complexity of your execution jobs.
The sum of all minimum quotas in a cohort must be less than or equal to the total number of expected job slots in the cohort. If this is not the case, the specified minimum quotas will not be respected by the assigners.
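The constraint on minimum quotas is simple arithmetic and can be checked before applying a configuration. The sketch below assumes you already know the expected number of job slots in a cohort; the helper name `min_quotas_fit`, the instance names, and the numbers are all hypothetical.

```python
def min_quotas_fit(min_quotas, expected_slots):
    """Return True if the minimum quotas for one cohort can all be honoured.

    `min_quotas` maps instance name to its min_quota in the cohort;
    `expected_slots` is the total number of job slots expected in the cohort.
    """
    return sum(min_quotas.values()) <= expected_slots


# Hypothetical cohort with 100 expected job slots.
ok = {"team-a": 40, "team-b": 40, "team-c": 20}           # sums to 100
oversubscribed = {"team-a": 40, "team-b": 40, "team-c": 30}  # sums to 110

print(min_quotas_fit(ok, 100))
print(min_quotas_fit(oversubscribed, 100))
```

In the second case the minimum quotas add up to 110 slots against 100 available, so at least one instance cannot receive its guaranteed minimum.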
Configuring quotas
Quotas can be configured using the bgd quota CLI tool.
bgd quota get INSTANCE_NAME BOT_COHORT will retrieve information about the
quotas currently configured for the given instance on the given bot cohort.
This also retrieves the current usage.
bgd quota put --min-quota=X --max-quota=Y INSTANCE_NAME BOT_COHORT will set
or update quotas for the given instance on the given bot cohort. Both min and
max quota are integers, and both must be specified in the command.
bgd quota delete INSTANCE_NAME BOT_COHORT will remove all quota
configuration from the given instance on the given bot cohort.
See bgd quota.
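When quotas are managed from a script, the put command can be assembled programmatically. The sketch below only builds the argument list in the form shown above; `quota_put_command` is a hypothetical helper, the instance and cohort names are placeholders, and any additional options the CLI may need to reach your server are not covered here.

```python
import shlex


def quota_put_command(instance_name, bot_cohort, min_quota, max_quota):
    """Build the argument list for `bgd quota put`.

    Both quotas are required by the command and must be integers,
    so they are converted with int() to fail early on bad input.
    """
    return [
        "bgd", "quota", "put",
        f"--min-quota={int(min_quota)}",
        f"--max-quota={int(max_quota)}",
        instance_name,
        bot_cohort,
    ]


# Placeholder instance name and cohort name.
command = quota_put_command("dev", "x86", 10, 50)
print(shlex.join(command))

# To actually run it (requires bgd to be installed and configured):
# import subprocess
# subprocess.run(command, check=True)
```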
Pre-emption scheduling
When using the Quota service with non-zero minimum quotas, it is recommended to configure pre-emption-based scheduling using the cohort job assigner.
This form of assignment will cancel in-flight jobs for instances over their
min_quota if there are no job slots available for a job from an instance
below its min_quota. Using this method allows the minimum quotas to
actually be enforced. The cancellation is not immediate: there is a grace
period during which the assigner waits for an in-flight job to complete
before resorting to cancellation.
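The pre-emption decision described above can be sketched as follows. This is a simplified model, not the cohort assigner's actual logic: the `Usage` type and `pick_preemption_victim` function are hypothetical, and choosing the instance furthest over its min_quota as the victim is an assumption made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    min_quota: int   # guaranteed job slots for the instance in this cohort
    in_flight: int   # job slots the instance currently occupies


def pick_preemption_victim(requesting, usage, free_slots, waited_seconds, grace_period):
    """Return the instance to evict a job from, or None if no eviction applies."""
    if free_slots > 0:
        return None  # a slot is available, nothing needs to be cancelled
    requester = usage[requesting]
    if requester.in_flight >= requester.min_quota:
        return None  # requester already has its guaranteed minimum
    if waited_seconds < grace_period:
        return None  # still waiting for an in-flight job to finish on its own
    over = [
        name for name, u in usage.items()
        if name != requesting and u.in_flight > u.min_quota
    ]
    if not over:
        return None
    # Assumption: evict from the instance furthest above its min_quota.
    return max(over, key=lambda name: usage[name].in_flight - usage[name].min_quota)


usage = {"team-a": Usage(min_quota=2, in_flight=0), "team-b": Usage(min_quota=2, in_flight=4)}
print(pick_preemption_victim("team-a", usage, free_slots=0, waited_seconds=11.0, grace_period=10.0))
```

With no free slots and the grace period elapsed, the hypothetical "team-b" instance, which is two slots over its minimum, is selected so that "team-a" can reach its guaranteed share.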
Example configuration
server:
  - !channel
    address: localhost:50051
    insecure-mode: true

connections:
  - !sql-connection &sql
    connection-string: postgresql://bgd:insecure@database/bgd
    pool-size: 5
    pool-timeout: 30
    max-overflow: 10

storages:
  - !remote-storage &remote-cas
    url: http://storage:50052
    instance-name: ''

caches:
  - !remote-action-cache &remote-cache
    url: http://storage:50052
    instance-name: ''

schedulers:
  - !sql-scheduler &scheduler
    sql: *sql
    storage: *remote-cas
    action-cache: *remote-cache
    property-set:
      # Configuration of a static property set, allowing linux bots running on
      # either x86-64 or arm-a64 machines to connect.
      #
      # Since these are mutually exclusive, bots will only ever have one of
      # these labels. In some configurations, a bot may be able to match
      # multiple labels.
      !static-property-set
      property-labels:
        - { label: "linux-arm", properties: [["OSFamily", "linux"], ["ISA", "arm-a64"]] }
        - { label: "linux-x86", properties: [["OSFamily", "linux"], ["ISA", "x86-64"]] }
    cohort-set:
      # Configuration to specify two separate cohorts of bots. Cohorts are
      # groups of bots which we want to consider together for scheduling
      # purposes.
      #
      # This example is simple and splits based on the ISA which we note in
      # the property label. More complex configurations can group multiple
      # labels into a single cohort. A bot is considered part of a cohort if
      # all property labels the bot matches are configured as part of the
      # `property-labels` list in a `!cohort-set`.
      !cohort-set
      cohorts:
        - name: "arm"
          property-labels: ["linux-arm"]
        - name: "x86"
          property-labels: ["linux-x86"]
    assigners:
      # Configuration for a cohort assigner, assigning with a preemption
      # delay of 10 seconds. This will start a thread for each cohort in
      # `cohort-set`, which will assign matching jobs to bots in that cohort.
      # If the instance name that a job is for is below its `min_quota` in the
      # cohort, then after the configured 10 second delay another job will be
      # evicted from the cohort.
      - !cohort-assigner
        cohort-set: ["linux-arm", "linux-x86"]
        count: 1
        preemption-delay: 10.0

# This services block outside of any instance name configuration is used to
# configure services which will accept arbitrary instance names and operate on
# them appropriately.
services:
  - !bots-service
    scheduler: *scheduler
  - !quota-service
    scheduler: *scheduler

thread-pool-size: 100