.. _cas-operation:

CAS
===

CAS, or Content Addressable Storage, is the service responsible for
persistently storing data in BuildGrid. The data stored in CAS **must** exist
for the lifespan of a job, from submission through to the ``ActionResult`` being
returned to the client. Configure cleanup and any other TTLs accordingly, and
keep this requirement in mind when determining storage size requirements.

In order to benefit from remote caching, blobs in CAS will need a longer
minimum lifespan than this. The exact lifespan to aim for is dictated by your
storage availability and use-case.

.. _cas-cleanup:

Cleanup
-------

CAS object lifespans are enforced by the CAS cleanup daemon. This tool is
configured with a minimum blob age for deletion, a high watermark, and a low
watermark. Cleanup starts when the total CAS size exceeds the high watermark,
and ends when the size is below the low watermark. Blobs that were last
recorded as accessed less than the minimum blob age (``only-if-unused-for``)
ago will not be deleted.
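
The watermark behaviour described above can be sketched in a few lines of
Python. This is an illustrative model only, not BuildGrid's implementation: the
function name ``run_cleanup_pass``, the in-memory ``blobs`` mapping, and the
oldest-first deletion order are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def run_cleanup_pass(
    blobs: Dict[str, Tuple[int, int]],
    now: int,
    high_watermark: int,
    low_watermark: int,
    only_if_unused_for: int,
) -> List[str]:
    """Delete blobs from `blobs` in place and return the deleted digests.

    `blobs` maps digest -> (size_in_bytes, recorded_access_time).
    All times are plain integers in seconds (hypothetical model).
    """
    total = sum(size for size, _ in blobs.values())
    # Cleanup only starts once the high watermark is exceeded.
    if total <= high_watermark:
        return []

    deleted = []
    # Assumption: candidates are visited oldest access time first.
    for digest, (size, accessed) in sorted(blobs.items(), key=lambda kv: kv[1][1]):
        # Cleanup ends once the size is below the low watermark.
        if total < low_watermark:
            break
        # Blobs accessed too recently are never deleted; every later
        # candidate is more recent still, so stop here.
        if now - accessed < only_if_unused_for:
            break
        del blobs[digest]
        total -= size
        deleted.append(digest)
    return deleted
```

With three 40-byte blobs accessed at times 0, 10, and 95, ``now=100``, a high
watermark of 100, a low watermark of 50, and ``only_if_unused_for=20``, the
sketch deletes the two oldest blobs and stops once the total drops to 40.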

Usage
~~~~~

To run the cleanup daemon:

.. code-block:: sh

    bgd cleanup --high-watermark 10G --low-watermark 7.5G --batch-size 100M \
        --sleep-interval 10 deployment.yml

The batch size and high/low watermark parameters are given in bytes. The
suffixes K, M, G, and T are available as shorthands for kB, MB, GB, and TB
respectively, as seen in the example.
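
As a rough illustration of how such shorthands map to byte counts, the
following sketch parses values like ``10G`` or ``7.5G``. The helper
``parse_size`` is hypothetical and is not BuildGrid's own parser; in
particular, it assumes decimal multipliers (1 kB = 1000 bytes), so check the
behaviour of your BuildGrid version if the exact figure matters.

```python
from decimal import Decimal

# Assumption: decimal (SI) multipliers, matching the kB/MB/GB/TB naming.
_MULTIPLIERS = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_size(text: str) -> int:
    """Convert a size string such as '100M' or '7.5G' to a byte count."""
    text = text.strip().upper()
    if text and text[-1] in _MULTIPLIERS:
        # Decimal avoids float rounding for fractional values like 7.5.
        return int(Decimal(text[:-1]) * _MULTIPLIERS[text[-1]])
    # No suffix: the value is already a number of bytes.
    return int(text)
```

Under these assumptions, the example command's ``--low-watermark 7.5G``
corresponds to 7,500,000,000 bytes and ``--batch-size 100M`` to 100,000,000
bytes.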

The batch size is the minimum amount of space cleared in one go. The cleanup
tool will try to remain as close as possible to the configured batch size, but
depending on the size of blobs in the CAS will sometimes delete more than the
specified batch at a time.
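
The reason a batch can exceed the configured size is that blobs are deleted
whole. A minimal sketch of that selection logic, assuming a hypothetical
``select_batch`` helper that receives candidates already sorted in deletion
order:

```python
from typing import Iterable, List, Tuple


def select_batch(
    candidates: Iterable[Tuple[str, int]], batch_size: int
) -> Tuple[List[str], int]:
    """Pick whole blobs until at least `batch_size` bytes are selected.

    `candidates` yields (digest, size_in_bytes) pairs in deletion order.
    Returns the chosen digests and the total number of bytes they occupy.
    """
    selected: List[str] = []
    total = 0
    for digest, size in candidates:
        if total >= batch_size:
            break
        # The last blob added may push the total past batch_size.
        selected.append(digest)
        total += size
    return selected, total
```

For three 60-byte candidates and a batch size of 100, the sketch selects the
first two blobs and clears 120 bytes, overshooting the batch size by 20.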

A smaller batch size adds more load to the database and the storage backend,
but space starts being freed sooner than with a large batch size.

If the batch size is larger than the difference between the current CAS size
and the low watermark, then the whole set of deletions required will be done
in one batch.

The sleep interval is the time in seconds to sleep after checking whether the
CAS size has reached the configured high watermark. A lower sleep interval
means a more reactive cleanup, at the cost of more database load.

The configuration file should contain the index and backend storage
definitions. The easiest way to achieve this is to reuse the config file that
was used to deploy the indexed CAS in the first place.

If monitoring is configured in the provided config file (see
:ref:`monitoring`), any metrics produced by the cleanup tool will be published
to the configured destination. If that should not be the same destination as
the indexed CAS metrics, the config will need to be changed.

.. warning::

   If using a ``!with-cache`` storage type and a non-distributed storage type,
   such as ``!lru-storage``, the caches will *not* be cleaned up along with the
   backing storage. In rare cases this can cause issues. To reduce the
   likelihood of this, the configured cache size across all BuildGrids should
   be smaller than the configured low watermark.


.. _indexed-cas:

Index
-----

In order to use the cleanup daemon, an index is currently needed. BuildGrid
supports a CAS index stored in either Redis or PostgreSQL.

This index also provides performance improvements for ``FindMissingBlobs``
requests, allowing the CAS to perform existence checking without using
potentially slow storage services.


Configuration
~~~~~~~~~~~~~

In a configuration file, the CAS index acts just like another storage backend
definition. It fully implements the same API interface as the other storage
backends, so you can just pass it to the service definition as with any
other storage implementation.

For example, here is a basic disk-based CAS configuration:

.. literalinclude:: ../data/basic-disk-cas.yml
   :language: yaml

To add an index to this, we simply need to add a CAS index to the list of
storages, point it at the disk storage, and then use the index in the service
definitions rather than the disk storage.

.. literalinclude:: ../data/postgresql-index-cas-only.yml
   :language: yaml

.. warning::

    Adding a new, empty index to an existing storage is the same as deleting
    everything from the storage from a client perspective. If the storage is
    used by another service which stores digests, such as an ActionCache or
    Asset API, the contents of those services will also be lost.

    Consider a migration phase using a ``!replicated-storage`` wrapper to work
    around this issue.


Index Types
~~~~~~~~~~~

The CAS index implementation is pluggable, similarly to the storage
implementation. Currently both Redis and SQL are supported as storage indices.

Redis
'''''

The Redis Index uses Redis to store TTLs for CAS objects. The keys themselves
are used to determine existence of a blob in CAS, and the TTL of the key is
set to the expiry time of the blob. The value is set to a dummy value of 1,
since we only care about existence checking and TTLs.

.. autofunction:: buildgrid.server.app.settings.parser.load_redis_index
   :noindex:

SQL
'''

As the name suggests, the SQL index uses a PostgreSQL database to store the index.

The storage instance and connection string must be provided, but the other
parameters have defaults that should be functional.

.. autofunction:: buildgrid.server.app.settings.parser.load_sql_index
   :noindex:

.. important::
  
  It is **strongly recommended** to set ``refresh-accesstime-older-than`` to
  a reasonably large value to minimise database churn when using an SQLIndex.

  See :ref:`index-lifespan` for details on how this affects blob lifespans.

.. versionchanged:: 0.4.0
   ``fallback_on_get`` functionality was removed. ``!replicated-storage``
   allows for clean migrations more generally.

.. _index-lifespan:

Lifespan
~~~~~~~~

The index implementations maintain access timestamps which are used by cleanup
to determine whether or not a blob is old enough to be deleted. The Redis index
updates this timestamp whenever the blob is referenced by a
``FindMissingBlobs`` request, but the SQL index can be configured to skip this
update when the recorded timestamp is recent enough.

This configuration option (``refresh-accesstime-older-than``) affects the
guaranteed lifespan of blobs, however, and should be taken into account when
configuring ``only-if-unused-for`` in the cleanup process.

Specifically, the real last access could have occurred at any point in the
``refresh-accesstime-older-than`` seconds following the recorded access time.

.. code-block::

    |------ SQLIndex ------|
    |--------------------------- Cleanup only-if-unused-for ----------------------------|
                           |-------------------- Guaranteed Lifespan -------------------|

This leaves the actual guaranteed lifespan of blobs equal to:

.. code-block::

   l = only_if_unused_for - refresh_accesstime_older_than
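
To see where the subtraction comes from, the sketch below models the
skipped-update behaviour and works through the worst case. It is a simplified
model rather than the SQL index's actual code: ``record_access`` is a
hypothetical helper, times are integer seconds, and the 6 hour and 30 hour
settings are example values.

```python
def record_access(recorded_at: int, now: int, refresh_accesstime_older_than: int) -> int:
    """Return the access time the index holds after an access at `now`.

    Hypothetical model: the timestamp is only rewritten when the recorded
    value is at least `refresh_accesstime_older_than` seconds old.
    """
    if now - recorded_at >= refresh_accesstime_older_than:
        return now
    return recorded_at


HOUR = 3600
refresh = 6 * HOUR             # example refresh-accesstime-older-than
only_if_unused_for = 30 * HOUR  # example cleanup setting

# The blob is accessed one second before the refresh window closes, so the
# recorded timestamp (0) is left untouched.
last_real_access = refresh - 1
recorded = record_access(0, last_real_access, refresh)

# Cleanup measures age from the recorded timestamp, not the real access.
eligible_for_deletion_at = recorded + only_if_unused_for
lifespan_after_last_access = eligible_for_deletion_at - last_real_access
```

In this model the blob becomes eligible for deletion just over 24 hours after
its real last access, which matches ``30h - 6h`` from the formula above.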


FMB Cache
---------

BuildGrid also has a ``RedisFMBCache`` storage type. This stores blob existence
information in Redis in a similar way to the ``RedisIndex``; however, it sets
the value of each digest's key to ``1`` if the blob exists, and ``0`` if not.

Keys in the ``RedisFMBCache`` also expire after a configured TTL, which is only
refreshed when changing the value of the key in the cache. This is to force
requests to sometimes hit the underlying storage layer, to allow that storage
implementation to update its access time metadata for the blob if necessary.

This cache is transparent, and will gracefully fall back to the underlying
storages in the case of Redis unavailability. Although it also speeds up
``FindMissingBlobs`` calls, it is a separate concept from the index and cannot
be used for cleanup. This cache is not an authoritative source for the
existence (or not) of blobs whose digests are not cached.
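
The expiry behaviour can be illustrated with an in-memory stand-in for Redis.
Everything here is an assumption made for the example, including the
``FMBCacheSketch`` class, the injected ``clock``, and the ``backend_has_blob``
callable; the real ``RedisFMBCache`` talks to Redis and may differ in detail.

```python
from typing import Callable, Dict, Tuple


class FMBCacheSketch:
    """Toy existence cache whose entries expire after a fixed TTL."""

    def __init__(
        self,
        backend_has_blob: Callable[[str], bool],
        ttl: int,
        clock: Callable[[], int],
    ) -> None:
        self._backend_has_blob = backend_has_blob
        self._ttl = ttl
        self._clock = clock
        # digest -> (exists, expires_at)
        self._entries: Dict[str, Tuple[bool, int]] = {}

    def has_blob(self, digest: str) -> bool:
        now = self._clock()
        entry = self._entries.get(digest)
        if entry is not None and entry[1] > now:
            # Cache hit: answer from the cache WITHOUT extending the TTL,
            # so the entry still expires on schedule.
            return entry[0]
        # Miss or expired entry: ask the underlying storage, which gives it
        # the chance to update its own access time metadata.
        exists = self._backend_has_blob(digest)
        self._entries[digest] = (exists, now + self._ttl)
        return exists
```

Because hits never extend the TTL, a frequently requested digest still reaches
the underlying storage at least once per TTL period in this model.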

Lifespan
~~~~~~~~

Adding this extra layer of caching complicates the minimum guaranteed lifespan
calculation.

.. code-block::
   
                           |- FMB Cache TTL -|
    |------ SQLIndex ------|
    |--------------------------- Cleanup only-if-unused-for ----------------------------|
                                             |----------- Guaranteed Lifespan ----------|

The pathological case is a digest which is included in a ``FindMissingBlobs``
call close to the end of the SQLIndex access timestamp update window. This will
cache the existence, but not update the timestamp. As such, there is a window
equal to the size of the FMB cache TTL during which we may have received a
request which should have updated the access timestamp but did not.

The guaranteed lifespan for a blob in CAS is now:

.. code-block::

   l = only_if_unused_for - refresh_accesstime_older_than - fmb_cache_ttl

This should be accounted for when configuring cleanup.

As a worked example, say we want to guarantee that blobs in CAS are available
for a minimum of 24 hours after their last access. If our SQLIndex is
configured with ``refresh-accesstime-older-than`` set to 6 hours, and an FMB
cache with a 1 hour TTL sits in front of the index, we need to set
``only-if-unused-for`` to 24 + 6 + 1 = 31 hours when configuring cleanup.
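
The same arithmetic can be written as a small helper for checking a planned
configuration. The function name ``required_only_if_unused_for`` is made up for
this example; it simply rearranges the lifespan formula above.

```python
from datetime import timedelta


def required_only_if_unused_for(
    desired_lifespan: timedelta,
    refresh_accesstime_older_than: timedelta,
    fmb_cache_ttl: timedelta = timedelta(0),
) -> timedelta:
    """Rearrange l = only_if_unused_for - refresh - fmb_ttl for cleanup."""
    return desired_lifespan + refresh_accesstime_older_than + fmb_cache_ttl


required = required_only_if_unused_for(
    desired_lifespan=timedelta(hours=24),
    refresh_accesstime_older_than=timedelta(hours=6),
    fmb_cache_ttl=timedelta(hours=1),
)
print(required)  # 1 day, 7:00:00 (that is, 31 hours)
```

Without an FMB cache, leave ``fmb_cache_ttl`` at its default of zero and the
result reduces to the simpler formula from the index section.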