AUTOSCALING IN ALCHEMISCALE TECHNICAL DESIGN DOCUMENT
The autoscaling API allows custom, platform-specific manager services to perform automatic horizontal scaling of compute resources.
Background
Currently, the number of alchemiscale compute services must be manually managed by administrators. This is tedious and error-prone, especially when balancing across a heterogeneous set of compute platforms.
Objectives
To reduce the need for administrators to micromanage compute services, we aim to define a protocol enabling the implementation of a "compute manager" application.
These compute managers will allow compute services to be horizontally scaled based on demand, as determined by the number of claimable tasks in the statestore.
Requirements
- Compute service growth can be made automatic, dictated by a compute manager's settings and statestore contents.
- Compute managers can be signaled to stop creating new services.
- Scaling decisions will be based on throughput estimates and the current number of compute services registered.
- A maximum number of instances will be enforced to prevent resource exhaustion.
Design
A compute manager is responsible for communicating with the alchemiscale compute API to determine whether to allocate more resources in response to the number of claimable tasks. The compute API will issue a signal (and an optional data payload) indicating how the manager should proceed:
- GROW (int) - The compute manager is allowed to scale up. The number of claimable tasks is included along with the signal.
- SKIP (None) - The compute manager should skip this cycle and try again later.
- SHUTDOWN (string) - The compute manager should shut down. A string is provided to indicate why the shutdown was requested and should ideally be written to the compute manager log before shutting down.
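As a minimal sketch (assuming the ComputeManagerClient proposed later in this document, with the signal values shown above), a manager cycle might dispatch on these signals as follows:

def run_cycle(client, logger, scale_up) -> bool:
    """Run one manager cycle; return False once shutdown is requested."""
    instruction, payload = client.get_instruction()
    if instruction == "GROW":
        # payload is the current number of claimable tasks
        scale_up(n_claimable=payload)
    elif instruction == "SKIP":
        # nothing to do this cycle; poll again later
        pass
    elif instruction == "SHUTDOWN":
        # payload is the server's stated reason for the shutdown
        logger.info("shutdown requested by server: %s", payload)
        return False
    return True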
To scale up, compute services, which claim tasks and execute their associated protocols, are created programmatically. In addition to the number of tasks, a compute manager can also query the statestore about the compute services it was responsible for creating. Compute services are created by a compute manager from a predefined compute service settings template file.
We will assume that a compute manager won't have direct access to the compute services it creates. This mostly accommodates systems running on a queuing system, such as Slurm. Because of this, created services must be able to cleanly shut themselves down. This can be triggered by a time constraint, the number of executed tasks, or the number of empty task claims, all of which are configurable when creating the compute service. Therefore, while the compute manager has the responsibility of up-scaling, the created services are responsible for down-scaling.
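For illustration only, the three down-scaling triggers might appear in a service settings template along these lines (the field names here are assumptions, not the actual alchemiscale settings keys):

from dataclasses import dataclass

@dataclass
class ServiceShutdownSettings:
    # wall-clock lifetime limit for the service, in seconds
    max_time: int = 4 * 3600
    # shut down after this many tasks have been executed
    max_tasks: int = 100
    # shut down after this many consecutive claims that return no task
    max_empty_claims: int = 10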
A note on uniqueness and manager identities
It is possible that an administrator will erroneously deploy duplicate, competing managers. Requiring a unique ID to deploy a compute manager allows the alchemiscale server to limit communication to one instance of a manager with a given ID. This unique ID will also be tracked by the compute services created by the compute manager, allowing managed services to be re-associated with a restarted compute manager. Should a manager fail unexpectedly, the alchemiscale server will not allow the registration of a new manager until the previous registration has been flushed.
To effectively differentiate compute managers sharing the same manager ID, a UUID (v4) is introduced. This will allow new compute managers to take over the responsibilities of compute managers that were unable to cleanly shut down or that cannot communicate with the server due to network issues. Only a compute manager with the correct manager ID and UUID will be given the affirmative signal for compute upscaling.
The two identifiers will be combined into a single identifier matching the pattern {manager_id}-{uuid4}.
Implementation
Introducing autoscaling will require a new client, an additional model in the statestore, new API endpoints, and an abstract class for implementing new compute managers. Additionally, ComputeServiceRegistration nodes in the statestore will gain an additional field referencing the name of a compute manager.
A new client supporting compute managers
While compute managers overlap significantly with compute services, the AlchemiscaleComputeClient provides functionality that shouldn't be exposed to the manager, such as task claiming. We will define a new, minimal client that fulfills the requirements of a compute manager.
class ComputeManagerClient(AlchemiscaleBaseClient):
    def register(self) -> ComputeManagerID:
        ...

    def deregister(self) -> ComputeManagerID:
        ...

    def heartbeat(self) -> ComputeManagerID:
        ...

    def get_instruction(self) -> tuple[ComputeManagerInstruction, int | str | None]:
        ...

    def get_registered_compute_services(self) -> list[ComputeServiceID]:
        ...
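A manager session using this client might look roughly like the following (the constructor arguments mirror other alchemiscale clients and are assumptions here, as is the 60-second poll interval):

import time

client = ComputeManagerClient(
    "https://compute.example.org",  # compute API URL (hypothetical)
    "my-manager-identity",
    "my-api-key",
)
client.register()
try:
    while True:
        client.heartbeat()
        instruction, payload = client.get_instruction()
        if instruction == ComputeManagerInstruction.SHUTDOWN:
            break
        # ... act on GROW/SKIP as sketched in the design section ...
        time.sleep(60)
finally:
    client.deregister()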
where ComputeManagerID is a subclass of str:
class ComputeManagerID(str):
    # The canonical string form of a UUID4 is 36 characters and itself
    # contains hyphens, so we slice from the right rather than splitting
    # the whole string on '-'.
    @property
    def uuid(self) -> str:
        return self[-36:]

    @property
    def name(self) -> str:
        return self[:-37]
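For example (values are arbitrary), splitting works even when the manager name itself contains hyphens:

import uuid

compute_manager_id = ComputeManagerID(f"slurm-manager-{uuid.uuid4()}")
print(compute_manager_id.name)  # 'slurm-manager'
print(compute_manager_id.uuid)  # the trailing 36-character UUID4 string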
and the ComputeManagerInstruction is a StrEnum, reflecting the signals specified in the design section:
from enum import StrEnum

class ComputeManagerInstruction(StrEnum):
    GROW = "GROW"
    SKIP = "SKIP"
    SHUTDOWN = "SHUTDOWN"
Statestore representation
A compute manager must be registered with the alchemiscale statestore, similarly to how compute services are registered.
class ComputeManagerRegistration:
    compute_manager_id: str
    registered: datetime
    heartbeat: datetime
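The heartbeat timestamp gives the server a basis for deciding when a dead manager's registration may be flushed (per the uniqueness discussion above). One plausible mechanism, sketched here with an assumed staleness threshold:

from datetime import datetime, timedelta, timezone

# the threshold is an assumption; it would presumably be configurable
STALE_AFTER = timedelta(minutes=10)

def is_stale(registration: ComputeManagerRegistration) -> bool:
    # a registration whose heartbeat is too old may be flushed, freeing
    # its manager ID for re-registration
    return datetime.now(timezone.utc) - registration.heartbeat > STALE_AFTER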
New API methods
New methods will be added to the compute API:
- register_computemanager - Attempt to register a compute manager with the statestore. If a compute manager with the same ID is already registered and has not been flushed, the registration is rejected.
- deregister_computemanager - Deregister a compute manager from the statestore. This leaves any ComputeServiceRegistration nodes resulting from the manager intact.
- created_compute_service_ids - Given a ComputeManagerID, return a list of ComputeServiceID strings that were created by that compute manager.
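Hypothetically, these could take roughly the following shape in the FastAPI style of the existing compute API (the paths, signatures, and return types are assumptions):

from fastapi import APIRouter

router = APIRouter()

@router.post("/computemanager/{compute_manager_id}/register")
def register_computemanager(compute_manager_id: str) -> str:
    ...

@router.post("/computemanager/{compute_manager_id}/deregister")
def deregister_computemanager(compute_manager_id: str) -> str:
    ...

@router.get("/computemanager/{compute_manager_id}/computeservices")
def created_compute_service_ids(compute_manager_id: str) -> list[str]:
    ...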
Compute manager abstract class
The compute manager abstract class structure largely follows the current implementation of the SynchronousComputeService.
class ComputeManager:
    compute_manager_id: ComputeManagerID
    heartbeat_interval: int
    sleep_interval: int
    client: ComputeManagerClient
    service_settings_template: bytes

class ComputeManagerSettings:
    ...
The implementation requires a new interface, ComputeManager, which defines the following methods:
- register
- request_claimable
- check_managed_compute_services
- create_compute_service (abstract)
- introspect_local (abstract)
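As an illustration of where the abstract methods come in, a platform-specific subclass for a Slurm cluster might look roughly like this (the subclass, script name, and method bodies are hypothetical):

import subprocess

class SlurmComputeManager(ComputeManager):
    def create_compute_service(self):
        # submit a batch job that launches a compute service built from
        # the predefined service settings template
        subprocess.run(["sbatch", "launch_compute_service.sh"], check=True)

    def introspect_local(self):
        # inspect the local Slurm queue for jobs this manager submitted
        ...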
The register method communicates to the alchemiscale compute API the unique ID given to the compute manager. If a compute manager with that ID already exists in the statestore, None is returned.
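A minimal sketch of that behavior, assuming the client surfaces a registration conflict as an exception (the exception name is hypothetical):

class ComputeManagerExistsError(Exception):
    """Hypothetical error for an already-registered manager ID."""

class ComputeManager:
    # ... attributes as above ...
    def register(self) -> ComputeManagerID | None:
        try:
            return self.client.register()
        except ComputeManagerExistsError:
            return None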
Changes to ComputeServiceRegistration and the compute service register method
ComputeServiceRegistration nodes will need to keep track of the compute managers responsible for their creation. To support this at the statestore level, we need only add another string field, which can optionally store a compute manager name.
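Concretely, the model change amounts to something like the following (the field name is an assumption):

class ComputeServiceRegistration:
    # ... existing fields ...
    # name of the compute manager that created this service;
    # None for services deployed manually by an administrator
    compute_manager_name: str | None = None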
Testing and validation
Since this feature defines the interface through which a specific compute manager interacts with the database and creates compute services, the majority of testing is left to those specific implementations.
Risks
Since this feature interacts directly with the allocation of high-performance computing resources, implementation errors and deployment misconfigurations can manifest as very real, potentially ballooning financial costs. Communicating these risks and putting appropriate guardrails in place is paramount.