Indexed CAS

To improve the performance of CAS requests such as FindMissingBlobs, you can configure an “indexed” CAS. The index also enables intelligent cleanup of blobs, which is currently supported only for S3.

The index keeps track of every blob currently stored in CAS, along with the time each blob was last accessed.

Configuration

In a configuration file, a CAS index is defined like any other storage backend. It implements the same interface as the other storage backends, so it can be passed to a service definition in place of any other storage implementation.

For example, here is a basic disk-based CAS configuration:

server:
  - !channel
    port: 50051
    insecure-mode: true

authorization:
  method: none

monitoring:
  enabled: false

instances:
  - name: ''
    description: |
      The unique '' instance.

    storages:
      - !disk-storage &disk-store
        path: example-cas/

    services:
      - !cas
        storage: *disk-store

      - !bytestream
        storage: *disk-store

To add an index, add a CAS index to the list of storages, point it at the disk storage, and then reference the index rather than the disk storage in the service definitions:

server:
  - !channel
    port: 50051
    insecure-mode: true

authorization:
  method: none

monitoring:
  enabled: false

instances:
  - name: ''
    description: |
      The unique '' instance.

    storages:
      - !disk-storage &disk-store
        path: example-cas/

      - !sql-index &cas-index
        storage: *disk-store
        connection_string: sqlite:///./index.example.db
        automigrate: yes

    services:
      - !cas
        storage: *cas-index

      - !bytestream
        storage: *cas-index

Warning

If you add an index to an existing storage that is also used by another service that stores digests, such as an ActionCache or the Asset API, either clear those services or temporarily enable the fallback_on_get option on the index.

By default, the index reports a blob as missing if it is not in the index, whether or not it exists in storage. Because fallback_on_get carries a significant performance cost, it is off by default and should ideally not be left on permanently.
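
As a sketch of the temporary setup described in this warning, the index from the earlier example could be configured as follows. The anchor names and the SQLite path are carried over from that example, and the hyphenated key spelling follows the usage example in the SQL section; adapt both to your own configuration.

```yaml
storages:
  - !disk-storage &disk-store
    path: example-cas/

  - !sql-index &cas-index
    storage: *disk-store
    connection_string: sqlite:///./index.example.db
    automigrate: yes
    # Temporary: check the underlying storage directly on reads, so that
    # blobs written before the index existed are found and added to it.
    # Switch back to `no` once the blobs that other services depend on
    # have been read through the index.
    fallback-on-get: yes
```

The services keep pointing at `*cas-index`, so only the index definition changes while the fallback is enabled.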

Index Types

The CAS index implementation is pluggable, like the storage implementation. Currently only an SQL-based implementation exists, but a custom index implementation can be written without much effort.

SQL

As the name suggests, the SQL index stores the index in an SQL database. It is tested with both SQLite and PostgreSQL, and could in theory work with any database backend that SQLAlchemy supports, provided that the database supports window functions.

The storage instance and connection string must be provided; the other parameters have defaults that should work as they are.
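
A minimal sketch of an index definition that sets only the two required parameters and leaves the rest at their defaults. The `&disk-store` anchor and the SQLite path are borrowed from the configuration example earlier on this page. Whether a fresh database also needs `automigrate` to create its schema is not stated here, so treat that as something to verify.

```yaml
- !sql-index &cas-index
  # Required: the storage instance that the index sits in front of
  storage: *disk-store
  # Required: SQLAlchemy connection string for the index database
  connection_string: sqlite:///./index.example.db
```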

class buildgrid._app.settings.parser.SQL_Index

Generates buildgrid.server.cas.storage.index.sql.SQLIndex using the tag !sql-index.

Usage

- !sql-index
  # This assumes that a storage instance is defined elsewhere
  # with a `&cas-storage` anchor
  storage: *cas-storage
  connection_string: postgresql://bgd:insecure@database/bgd
  automigrate: yes
  window-size: 1000
  inclause-limit: -1
  fallback-on-get: no

Parameters

  • storage (buildgrid.server.cas.storage.storage_abc.StorageABC) – Instance of storage to use. This must be a storage object constructed using a YAML tag ending in -storage, for example !disk-storage.

  • connection_string (str) – SQLAlchemy connection string

  • automigrate (bool) – Attempt to automatically upgrade an existing DB schema to the newest version.

  • window_size (uint) – Maximum number of blobs to fetch in one SQL operation. Larger result sets are automatically split into multiple queries.

  • inclause_limit (int) – If nonnegative, overrides the default number of variables permitted per “in” clause. See the buildgrid.server.cas.storage.index.sql.SQLIndex comments for more details.

  • fallback_on_get (bool) – By default, to minimize interactions with the storage, the SQL index fetches blobs from the underlying storage on get_blob/bulk_read_blobs requests only if they are present in the index. If this is set, the index instead checks the underlying storage directly on those requests and adds any blobs it finds to the index.