Indexed CAS

To improve the performance of CAS-related requests such as FindMissingBlobs, you can configure an “indexed” CAS. The index also enables intelligent cleanup of blobs, although cleanup is currently supported only for S3.

The index keeps track of every blob currently stored in CAS, along with the time each blob was last accessed.

Configuration

In a configuration file, the CAS index is defined just like any other storage backend. It implements the same interface as the other storage backends, so you can pass it to a service definition in place of any other storage implementation.

For example, here is a basic disk-based CAS configuration:

server:
  - !channel
    address: localhost:50051
    insecure-mode: true

authorization:
  method: none

monitoring:
  enabled: false

instances:
  - name: ''
    description: |
      The unique '' instance.

    storages:
      - !disk-storage &disk-store
        path: example-cas/

    services:
      - !cas
        storage: *disk-store

      - !bytestream
        storage: *disk-store

To add an index to this, define an SQL connection, add a CAS index to the list of storages that points at both the connection and the disk storage, and then use the index in the service definitions instead of the disk storage:

server:
  - !channel
    address: localhost:50051
    insecure-mode: true

authorization:
  method: none

monitoring:
  enabled: false


connections:
  - !sql-connection &sql
    connection-string: sqlite:///./example.db
    connection-timeout: 15

storages:
  - !disk-storage &disk-store
    path: example-cas/

  - !sql-index &cas-index
    sql: *sql
    storage: *disk-store

instances:
  - name: ''
    description: |
      The unique '' instance.

    services:
      - !cas
        storage: *cas-index

      - !bytestream
        storage: *cas-index

Warning

If you add an index to an existing storage that is also used by another service that stores digests, such as an ActionCache or the Asset API, either clear those services or temporarily set the fallback_on_get config option to True on the index.

By default, the index reports a blob as missing if it isn’t in the index, whether or not it exists in storage. Enabling fallback_on_get carries a significant performance cost, so it is off by default and ideally should not be left on permanently.
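
As a rough sketch of the temporary setup described in this warning, the index definition from the example configuration could be extended as shown here. The option is written exactly as it is named in the text; because the other keys in the example file are hyphenated, the parser may expect a hyphenated spelling instead, so treat both the key spelling and the value as assumptions to check against your installed version.

```yaml
connections:
  - !sql-connection &sql
    connection-string: sqlite:///./example.db
    connection-timeout: 15

storages:
  - !disk-storage &disk-store
    path: example-cas/

  - !sql-index &cas-index
    sql: *sql
    storage: *disk-store
    # Assumption: key spelled as in the prose above; the parser may instead
    # expect a hyphenated form, like the other keys in this file.
    # Temporary: remove this line (restoring the default of off) once the
    # index has caught up with the blobs already in storage.
    fallback_on_get: true
```

Once the other services have been cleared or their blobs are known to the index, dropping the option again avoids the ongoing performance cost.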

Index Types

The CAS index implementation is pluggable, much like the storage implementation. Currently only an SQL-based implementation exists, but a custom index implementation can be written without much effort.

SQL

As the name suggests, the SQL index uses an SQL database to store the index. It is tested with both SQLite and PostgreSQL, and in theory it could support any database backend that SQLAlchemy supports, provided that the database supports window functions.

The storage instance and connection string must be provided; the other parameters have defaults that should work without further tuning.
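
For illustration, a minimal sketch of the same index backed by PostgreSQL rather than SQLite might look like the following. Only the two required values are set. The host, port, database name, and credentials are hypothetical placeholders; the URL follows SQLAlchemy's usual dialect://user:password@host:port/database form.

```yaml
connections:
  - !sql-connection &sql
    # Hypothetical host, database, and credentials; replace with your own.
    connection-string: postgresql://bgd:changeme@db.example.com:5432/bgd
    connection-timeout: 15

storages:
  - !disk-storage &disk-store
    path: example-cas/

  - !sql-index &cas-index
    sql: *sql              # required: the SQL connection defined above
    storage: *disk-store   # required: the storage being indexed
```

Every other index parameter is left at its default here.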