Indexed CAS
To improve the performance of CAS-related requests like FindMissingBlobs, it is possible to configure an “indexed” CAS. This also facilitates intelligent cleanup of blobs, currently only for S3.
This index keeps track of all the blobs currently stored in CAS, as well as the time each blob was last accessed.
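As a rough illustration of why an index speeds up FindMissingBlobs, the sketch below keeps one row per blob in an in-memory SQLite table and answers "which of these digests are missing?" with a single query instead of probing the storage once per blob. The table name, column names, and helper functions are hypothetical and chosen for this example only; they are not BuildGrid's actual schema or API.

```python
import sqlite3
import time

# Hypothetical, simplified schema: one row per blob known to the index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blob_index ("
    " digest_hash TEXT PRIMARY KEY,"
    " size_bytes INTEGER NOT NULL,"
    " accessed_at REAL NOT NULL)"
)


def record_blob(digest_hash, size_bytes, now=None):
    """Insert a blob into the index, or overwrite its row (and access time)."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO blob_index VALUES (?, ?, ?)",
        (digest_hash, size_bytes, now),
    )


def find_missing(digest_hashes):
    """Return the requested hashes that the index does not know about."""
    requested = list(digest_hashes)
    if not requested:
        return []
    placeholders = ",".join("?" for _ in requested)
    rows = conn.execute(
        f"SELECT digest_hash FROM blob_index WHERE digest_hash IN ({placeholders})",
        requested,
    )
    present = {row[0] for row in rows}
    # Preserve the order in which the caller asked for the digests.
    return [h for h in requested if h not in present]


record_blob("aaa", 10)
record_blob("bbb", 20)
missing = find_missing(["aaa", "ccc", "bbb", "ddd"])
print(missing)
```

The point of the design is that the answer comes from the index alone; the backing storage is never consulted for the existence check.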
Configuration
In a configuration file, the CAS index is defined just like any other storage backend. It implements the same interface as the other storage backends, so you can pass it to a service definition as you would any other storage implementation.
For example, here is a basic disk-based CAS configuration:
server:
  - !channel
    address: localhost:50051
    insecure-mode: true
authorization:
  method: none
monitoring:
  enabled: false
instances:
  - name: ''
    description: |
      The unique '' instance.
    storages:
      - !disk-storage &disk-store
        path: example-cas/
    services:
      - !cas
        storage: *disk-store
      - !bytestream
        storage: *disk-store
To add an index to this, we simply need to add a CAS index to the list of storages, point it at the disk storage, and then use the index in the service definitions rather than the disk storage.
server:
  - !channel
    address: localhost:50051
    insecure-mode: true
authorization:
  method: none
monitoring:
  enabled: false
instances:
  - name: ''
    description: |
      The unique '' instance.
    storages:
      - !disk-storage &disk-store
        path: example-cas/
      - !sql-index &cas-index
        storage: *disk-store
        connection-string: sqlite:///./index.example.db
        automigrate: yes
    services:
      - !cas
        storage: *cas-index
      - !bytestream
        storage: *cas-index
Warning
If you add an index to an existing storage that is also used by another service which stores digests, such as an ActionCache or Asset API, either those services need to be cleared or the fallback_on_get config option needs to be temporarily set to True on the index.
By default, the index reports a blob as missing if it isn’t in the index, whether or not it exists in storage. fallback_on_get carries a significant performance cost, so it is off by default and should ideally not be left on permanently.
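The behaviour described in this warning can be pictured with a toy model. The class below is a hypothetical stand-in written for this example, not BuildGrid's SQLIndex; it only mimics the documented rule that an unindexed blob is reported as missing unless fallback_on_get is enabled, in which case the storage is consulted directly and any blob found is loaded into the index.

```python
class ToyIndex:
    """Hypothetical stand-in for an index wrapped around a backing storage."""

    def __init__(self, storage, fallback_on_get=False):
        self.storage = storage        # dict: digest -> bytes (the "real" storage)
        self.indexed = set()          # digests the index knows about
        self.fallback_on_get = fallback_on_get

    def get_blob(self, digest):
        if self.fallback_on_get:
            # Fallback path: ask the storage directly, then backfill the index.
            blob = self.storage.get(digest)
            if blob is not None:
                self.indexed.add(digest)
            return blob
        if digest not in self.indexed:
            # Default path: not indexed means "missing", even if storage has it.
            return None
        return self.storage.get(digest)


# A blob that was written before the index was introduced.
existing_storage = {"abc123": b"blob written before the index existed"}

strict = ToyIndex(existing_storage)
lenient = ToyIndex(existing_storage, fallback_on_get=True)

strict_result = strict.get_blob("abc123")    # reported as missing
lenient_result = lenient.get_blob("abc123")  # found in storage and indexed
```

In this model the fallback path touches the storage on every read, which is one way to see why the option is meant as a temporary migration aid rather than a permanent setting.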
Index Types
The CAS index implementation is pluggable, similarly to the storage implementation. Currently only an SQL-based implementation exists, but it is possible to write a custom index implementation without too much effort.
SQL
As the name suggests, the SQL index uses an SQL database to store the index. It is tested to support both SQLite and PostgreSQL as the database, and theoretically could support any database backend that SQLAlchemy supports (providing that the database has support for window functions).
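For readers unfamiliar with window functions, the sketch below shows the kind of query they make possible on an access-time index: a running total of blob sizes ordered from least to most recently accessed, used to pick enough old blobs to free a target number of bytes. This is only an illustration of the SQL feature under assumed table and column names; the source does not state which queries BuildGrid actually runs. It uses Python's built-in sqlite3 module and needs an SQLite build with window-function support (3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema for the example only.
conn.execute(
    "CREATE TABLE blob_index (digest_hash TEXT, size_bytes INTEGER, accessed_at INTEGER)"
)
conn.executemany(
    "INSERT INTO blob_index VALUES (?, ?, ?)",
    [("old", 100, 1), ("older", 50, 0), ("recent", 70, 5), ("newest", 30, 9)],
)


def least_recently_used(bytes_to_free):
    """Return the oldest blobs, stopping once their sizes cover bytes_to_free."""
    rows = conn.execute(
        """
        SELECT digest_hash FROM (
            SELECT digest_hash, size_bytes, accessed_at,
                   -- Window function: cumulative size from oldest to newest.
                   SUM(size_bytes) OVER (ORDER BY accessed_at) AS running_total
            FROM blob_index
        )
        -- Keep a row while the total *before* it is still below the target.
        WHERE running_total - size_bytes < ?
        ORDER BY accessed_at
        """,
        (bytes_to_free,),
    )
    return [row[0] for row in rows]


candidates = least_recently_used(120)
print(candidates)
```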
The storage instance and connection string must be provided, but the other parameters have defaults that should be functional.
- class buildgrid._app.settings.parser.SQL_Index(_yaml_filename: str, storage: StorageABC, connection_string: str | None = None, sql: SqlProvider | None = None, automigrate: bool = False, window_size: int = 1000, inclause_limit: int = -1, fallback_on_get: bool = False, max_inline_blob_size: int = 0, refresh_accesstime_older_than: int = 0, **kwargs)
Generates buildgrid.server.cas.storage.index.sql.SQLIndex using the tag !sql-index.
- Usage
- !sql-index
  # This assumes that a storage instance is defined elsewhere
  # with a `&cas-storage` anchor
  storage: *cas-storage
  sql: *sql
  window-size: 1000
  inclause-limit: -1
  fallback-on-get: no
  max-inline-blob-size: 256
  refresh-accesstime-older-than: 0
- Parameters:
storage (buildgrid.server.cas.storage.storage_abc.StorageABC) – Instance of storage to use. This must be a storage object constructed using a YAML tag ending in -storage, for example !disk-storage.
connection_string (str) – SQLAlchemy connection string
automigrate (bool) – Attempt to automatically upgrade an existing DB schema to the newest version.
window_size (uint) – Maximum number of blobs to fetch in one SQL operation (larger resultsets will be automatically split into multiple queries)
inclause_limit (int) – If nonnegative, overrides the default number of variables permitted per “in” clause. See the buildgrid.server.cas.storage.index.sql.SQLIndex comments for more details.
fallback_on_get (bool) – By default, the SQL Index only fetches blobs from the underlying storage if they’re present in the index on get_blob/bulk_read_blobs requests to minimize interactions with the storage. If this is set, the index instead checks the underlying storage directly on get_blob/bulk_read_blobs requests, then loads all blobs found into the index.
max_inline_blob_size (int) – Blobs of this size or smaller are stored directly in the index and not in the backing storage (must be nonnegative).
refresh_accesstime_older_than (int) – When reading a blob, its access timestamp is updated only if that timestamp is at least refresh_accesstime_older_than seconds older than the current time. Set this to reduce the load associated with frequent timestamp updates.
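The three tuning parameters window_size, refresh_accesstime_older_than, and max_inline_blob_size can be read as simple rules. The helpers below are a minimal sketch of those rules as described in the parameter list; the function names are invented for this example and do not correspond to functions in BuildGrid.

```python
from datetime import datetime, timedelta


def split_into_windows(digests, window_size=1000):
    """Yield successive batches of at most window_size digests."""
    for start in range(0, len(digests), window_size):
        yield digests[start:start + window_size]


def needs_accesstime_refresh(last_accessed, now, refresh_accesstime_older_than=0):
    """True when the stored timestamp is at least the threshold older than now."""
    return now - last_accessed >= timedelta(seconds=refresh_accesstime_older_than)


def stored_inline(blob_size, max_inline_blob_size=0):
    """True when a blob is small enough to be kept in the index itself."""
    return blob_size <= max_inline_blob_size


# 2500 digests with window_size=1000 become three queries' worth of work.
batches = list(split_into_windows(list(range(2500)), window_size=1000))
batch_sizes = [len(batch) for batch in batches]

# With a 60 second threshold, a blob read 30 seconds ago keeps its timestamp,
# while one last read 90 seconds ago gets its timestamp updated.
now = datetime(2024, 1, 1, 12, 0, 0)
recent_read = needs_accesstime_refresh(now - timedelta(seconds=30), now, 60)
stale_read = needs_accesstime_refresh(now - timedelta(seconds=90), now, 60)

# With max_inline_blob_size=256, a 200 byte blob is inlined, a 300 byte one is not.
small_inline = stored_inline(200, max_inline_blob_size=256)
large_inline = stored_inline(300, max_inline_blob_size=256)
```

Under this reading, the default refresh_accesstime_older_than of 0 means every read updates the timestamp, and raising it trades timestamp precision for fewer index writes.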