Internal Data Model

REAPI

The Remote Execution API (REAPI) has the concept of an Operation, which reflects the state of the work requested in an Execute request. Both Execute and WaitExecute requests return a stream of Operation messages. A WaitExecute request takes the name of the operation whose updates it should stream, so the server needs to track operations by name.

In BuildGrid, the state of each Operation is split between the buildgrid.server.job.Job class and the Operation protobuf objects held in that class's operations_by_name attribute.

When an update is to be communicated to the peer (client) for a specific operation, the data in the Job is combined with the data already in the Operation, and the resulting Operation message is sent to the peer.
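As a minimal sketch of the merge described above, the following uses plain dataclasses as stand-ins for the real protobuf messages. The field names `stage` and `response`, the stage strings, and the `build_update` method are assumptions made for illustration; only `operations_by_name` is taken from the text.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional


@dataclass
class Operation:
    # Stand-in for the Operation protobuf message (hypothetical fields).
    name: str
    done: bool = False
    cancelled: bool = False
    stage: str = "UNKNOWN"          # filled in from the Job when an update is built
    response: Optional[str] = None  # filled in from the Job when an update is built


@dataclass
class Job:
    # Stand-in for the Job class; holds the state shared by all its operations.
    action: str
    operation_stage: str = "QUEUED"
    execute_response: Optional[str] = None
    operations_by_name: Dict[str, Operation] = field(default_factory=dict)

    def build_update(self, operation_name: str) -> Operation:
        """Return a copy of the named operation with job-level state merged in."""
        operation = self.operations_by_name[operation_name]
        return replace(
            operation,
            stage=self.operation_stage,
            response=self.execute_response,
        )
```

In this sketch the stored Operation is left untouched and a merged copy is sent to the peer, so per-operation fields such as `cancelled` stay separate from the job-wide stage.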

The Job abstraction exists for a couple of main reasons:

  • It allows us to deduplicate work by tying multiple operations to the same underlying execution task.

  • It allows us to tie the REAPI Operation concept to the RWAPI Lease concept.

In addition to tracking the various operations and lease(s) for the work, the Job class stores the Action being executed for the relevant Execute request, plus other attributes such as its priority and platform requirements.

The platform requirements are used to schedule work onto workers whose environment matches the constraints set by the peer.
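One way to picture this matching is as a subset check per property. The sketch below assumes requirements and capabilities are both mappings from a property name to a set of values; that representation and the `worker_satisfies` helper are hypothetical, not BuildGrid's actual matching code.

```python
from typing import Dict, Set

# Hypothetical representation: property name -> set of acceptable values.
Properties = Dict[str, Set[str]]


def worker_satisfies(requirements: Properties, capabilities: Properties) -> bool:
    """Return True if the worker offers every value the job requires."""
    return all(
        values <= capabilities.get(key, set())
        for key, values in requirements.items()
    )
```

A job with no requirements matches any worker under this rule, and a worker may advertise more than a job asks for.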

Handling an Execute request

This diagram shows how the data in an Execute request is split up within BuildGrid, for a request to execute an Action that isn’t already queued or executing. The data from the Job is combined with the relevant Operation in update messages streamed back to the peer.

digraph reapi_data_flow {
     bgcolor="#fcfcfc";
     rankdir=LR;

     graph [fontsize=10 fontname="Verdana" compound=true]
     node [shape=box fontsize=10 fontname="Verdana"];
     edge [fontsize=10 fontname="Verdana"];

     subgraph cluster_bgd_reapi {
         label="BuildGrid Execution Service";
         style="dashed";

         node [shape=box];

         servicer -> job, operation [
             label="Creates"
         ];
         operation -> job [
             style="dashed"
             label="Some state in"
         ];

         { rank=same; job, operation }

         job, operation [
             style=filled;
         ];
         job [
             label=<
 <b>Job</b><br/><br align="left"/>
 - action<br align="left"/>
 - execute_response<br align="left"/>
 - operation_stage<br align="left"/>
 - priority<br align="left"/>
 - platform_requirements
 >
         ];
         operation [
             label=<
 <b>Operation</b><br/><br align="left"/>
 - name<br align="left"/>
 - done<br align="left"/>
 - cancelled
 >
         ];
         servicer [
             label="ExecutionService"
             shape="circle"
         ];
     }

     execute -> servicer [
         dir="both"
         lhead=cluster_bgd_reapi
         label="Send Execute request\nStream Operation messages"
     ];

     execute [
         label="Peer\ne.g. recc, bazel, bst"
     ];
 }

If the request is for an Action that is already queued or executing, no new Job is created. Instead, the priority of the existing Job is updated if needed.
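The deduplication path can be sketched roughly as follows. The `Scheduler` class, its `queue_execute` method, and keying jobs by an action digest string are all assumptions for illustration. The sketch also assumes that a lower priority value means the work should run sooner, which is how REAPI generally describes priority.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Job:
    action_digest: str
    priority: int
    operation_names: Set[str] = field(default_factory=set)


class Scheduler:
    """Hypothetical scheduler that keeps one Job per distinct Action."""

    def __init__(self) -> None:
        self.jobs_by_action: Dict[str, Job] = {}

    def queue_execute(self, action_digest: str, priority: int = 0) -> str:
        job = self.jobs_by_action.get(action_digest)
        if job is None:
            # First request for this Action: create the Job.
            job = Job(action_digest, priority)
            self.jobs_by_action[action_digest] = job
        elif priority < job.priority:
            # Already queued or executing: only raise its urgency if needed.
            job.priority = priority
        # Every Execute request gets its own Operation name.
        operation_name = str(uuid.uuid4())
        job.operation_names.add(operation_name)
        return operation_name
```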

In the case of a WaitExecute request, neither the Job nor the Operation is created. Instead, a message queue is created for the peer, which receives updates for the specified Operation.
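A per-peer message queue of this kind could look like the sketch below, built on the standard library's `queue.Queue`. The `OperationUpdates` class, the dictionary-shaped messages, and the rule of stopping once a message has `done` set are assumptions for illustration only.

```python
import queue
from typing import Dict, Iterator, List


class OperationUpdates:
    """Hypothetical fan-out of operation updates to per-peer queues."""

    def __init__(self) -> None:
        self._queues: Dict[str, List[queue.Queue]] = {}

    def subscribe(self, operation_name: str) -> queue.Queue:
        # Called when a peer sends WaitExecute for an existing operation.
        peer_queue: queue.Queue = queue.Queue()
        self._queues.setdefault(operation_name, []).append(peer_queue)
        return peer_queue

    def publish(self, operation_name: str, message: dict) -> None:
        # Deliver one update to every peer watching this operation.
        for peer_queue in self._queues.get(operation_name, []):
            peer_queue.put(message)


def stream_updates(peer_queue: queue.Queue) -> Iterator[dict]:
    """Yield updates to the peer until one is marked as done."""
    while True:
        message = peer_queue.get()
        yield message
        if message["done"]:
            return
```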

RWAPI

The remote worker API has a concept of a Lease, which contains the state of a given task being executed by a worker. This is implemented fairly straightforwardly in BuildGrid: a worker requests a new Lease from the server, and the server finds a Job in the queue with requirements that match the capabilities advertised by the worker. The server then creates a Lease for this job, and sends it to the worker in the response.

The Lease message contains a payload field, which BuildGrid will populate with the Job’s Action message. [1]

All of a Lease's state is held in the Lease object itself; none of it is stored in the Job. Each Job can track multiple leases, so that work can be retried.
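The relationship between a Job and its leases might be modelled as in this sketch. The lease id format, the state strings, and the `create_lease` method are hypothetical; the only points taken from the text are that the payload carries the Job's Action and that a Job can hold more than one Lease.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Lease:
    id: str
    payload: str           # stand-in for the Job's Action message
    state: str = "PENDING"


@dataclass
class Job:
    name: str
    action: str
    leases: List[Lease] = field(default_factory=list)

    def create_lease(self) -> Lease:
        """Create a fresh Lease for this job, keeping earlier ones for reference."""
        lease = Lease(
            id=f"{self.name}-lease-{len(self.leases) + 1}",  # hypothetical id scheme
            payload=self.action,
        )
        self.leases.append(lease)
        return lease
```

If a first lease ends without completing the work, calling `create_lease` again gives the retry its own independent state.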

[1] ... which required a worker to fetch the latter from the CAS.

Handling a CreateBotSession request

The initial connection from a worker to BuildGrid should be a CreateBotSession request. On receiving one, BuildGrid starts tracking the bot for metrics and then looks for queued jobs that match the worker's platform properties.

If a job is found, a Lease is created and sent in the response, and the job state is updated to reflect that it is now being worked on.

digraph rwapi_data_flow {
     bgcolor="#fcfcfc";
     rankdir=LR;

     graph [fontsize=10 fontname="Verdana" compound=true]
     node [shape=box fontsize=10 fontname="Verdana"];
     edge [fontsize=10 fontname="Verdana"];

     subgraph cluster_bgd_rwapi {
         label="BuildGrid Bots Service"
         style="dashed";

         node [shape=box];

         servicer -> job [
             dir="both"
             label="Search,\nUpdate"
         ];
         servicer -> lease [
             label="Create"
         ];
         lease -> job [
             style="dotted"
             label="Relates to"
         ];

         { rank=same; job, lease }

         servicer [
             label="BotsService"
             shape="circle"
         ];
         job, lease [
             style="filled"
         ];
         job [
             label=<
 <b>Job</b><br/><br align="left"/>
 - action<br align="left"/>
 - execute_response<br align="left"/>
 - operation_stage<br align="left"/>
 - priority<br align="left"/>
 - platform_requirements
 >
         ];
         lease [
             label=<
 <b>Lease</b><br/><br align="left"/>
 - id<br align="left"/>
 - state<br align="left"/>
 - status
 >
         ];
     }

     worker -> servicer [
         label="Send CreateBotSession request"
         lhead="cluster_bgd_rwapi"
     ];

     worker [
         label="Worker\ne.g. buildbox-worker"
     ];
 }
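The CreateBotSession flow described in this section can be sketched as below. All class and method names, the stage strings, and the subset-based property matching are assumptions for illustration rather than BuildGrid's real implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Lease:
    id: str
    payload: str
    state: str = "PENDING"


@dataclass
class Job:
    name: str
    action: str
    platform_requirements: Dict[str, Set[str]]
    operation_stage: str = "QUEUED"


@dataclass
class BotSession:
    bot_id: str
    leases: List[Lease] = field(default_factory=list)


class BotsService:
    """Hypothetical servicer holding a simple in-memory job queue."""

    def __init__(self, queued_jobs: List[Job]) -> None:
        self.queue = list(queued_jobs)
        self.known_bots: Set[str] = set()

    def _assign_lease(self, properties: Dict[str, Set[str]]) -> Optional[Lease]:
        for job in self.queue:
            if all(values <= properties.get(key, set())
                   for key, values in job.platform_requirements.items()):
                self.queue.remove(job)
                job.operation_stage = "EXECUTING"  # job is now being worked on
                return Lease(id=job.name, payload=job.action)
        return None

    def create_bot_session(self, bot_id: str,
                           properties: Dict[str, Set[str]]) -> BotSession:
        self.known_bots.add(bot_id)  # start tracking the bot
        session = BotSession(bot_id)
        lease = self._assign_lease(properties)
        if lease is not None:
            session.leases.append(lease)
        return session
```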

Handling an UpdateBotSession request

Subsequent connections should be UpdateBotSession requests. Internally, these are handled very similarly to CreateBotSession requests. First, the state of any leases held by the bot is checked, and the internal representation is updated to match. If a lease's change implies a change to the job state, that is also updated at this point.

After that, if the bot needs a new lease, BuildGrid looks for a queued job in the same way as before, and adds any new lease to the response.

digraph rwapi_data_flow {
     bgcolor="#fcfcfc";
     rankdir=LR;

     graph [fontsize=10 fontname="Verdana" compound=true]
     node [shape=box fontsize=10 fontname="Verdana"];
     edge [fontsize=10 fontname="Verdana"];

     subgraph cluster_bgd_rwapi {
         label="BuildGrid Bots Service"
         style="dashed";

         node [shape=box];

         servicer -> job [
             dir="both"
             label="Search,\nUpdate"
         ];
         servicer -> lease [
             dir="both"
             label="Create,\nUpdate"
         ];
         lease -> job [
             style="dotted"
             label="Relates to"
         ];

         { rank=same; job, lease }

         servicer [
             label="BotsService"
         ];
         job, lease [
             style="filled"
         ];
         job [
             label=<
 <b>Job</b><br/><br align="left"/>
 - action<br align="left"/>
 - execute_response<br align="left"/>
 - operation_stage<br align="left"/>
 - priority<br align="left"/>
 - platform_requirements
 >
         ];
         lease [
             label=<
 <b>Lease</b><br/><br align="left"/>
 - id<br align="left"/>
 - state<br align="left"/>
 - status
 >
         ];
     }

     worker -> servicer [
         label="Send UpdateBotSession request"
         lhead="cluster_bgd_rwapi"
     ];

     worker [
         label="Worker\ne.g. buildbox-worker"
     ];
 }
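The two steps of handling an UpdateBotSession request can be sketched as follows. This is a simplified illustration with hypothetical names: leases are identified by job name, platform matching is left out, and a bot is assumed to need a new lease whenever it holds no unfinished one.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Lease:
    id: str
    state: str = "PENDING"


@dataclass
class Job:
    name: str
    operation_stage: str = "QUEUED"
    lease: Optional[Lease] = None


class BotsService:
    """Hypothetical servicer; jobs are looked up by the lease id."""

    def __init__(self, jobs: List[Job]) -> None:
        self.jobs: Dict[str, Job] = {job.name: job for job in jobs}

    def update_bot_session(self, reported: List[Lease]) -> List[Lease]:
        unfinished: List[Lease] = []
        # Step 1: sync the internal view with what the bot reported.
        for lease in reported:
            job = self.jobs[lease.id]
            job.lease = lease
            if lease.state == "COMPLETED":
                job.operation_stage = "COMPLETED"
            else:
                unfinished.append(lease)
        # Step 2: if the bot is idle, hand it the next queued job.
        if not unfinished:
            for job in self.jobs.values():
                if job.operation_stage == "QUEUED":
                    job.operation_stage = "EXECUTING"
                    job.lease = Lease(id=job.name)
                    unfinished.append(job.lease)
                    break
        return unfinished
```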