# Understanding Balsam
Balsam is made up of:
- A centrally-managed, multi-tenant web application for securely curating HPC applications, authoring workflows, and managing high-throughput job campaigns across one or many computing facilities.
- Distributed, user-run Balsam Sites that sync with the central API to orchestrate and carry out the workflows defined by users on a given HPC platform.
To understand how Balsam is organized, first consider the server-side entities. The graph of the database schema shows each node as a table in the database, represented by one of the model classes in the ORM; each arrow represents a ForeignKey (many-to-one) relationship between two tables.
## The Database Schema
- A `User` represents a Balsam user account. All items in the database are linked to a single owner (tenant), which is reflected in the connectivity of the graph. For example, to get all the jobs belonging to `current_user`, join the tables via `Job.objects.filter(app__site__user=current_user)`.
- A `Site` must have a globally unique `name` which corresponds to a directory on some machine. One user can own several Balsam Sites located across one or several machines. Each Site is an independent endpoint where applications are registered, data is transferred in and out, and Job working directories are located. Each Balsam Site runs a daemon on behalf of the user that communicates with the central API. If a user has multiple active Balsam Sites, a separate daemon runs at each of them. The authenticated daemons communicate with the central Balsam API to fetch jobs, orchestrate the workflow locally, and update the database state.
- An `App` represents a runnable application at a particular Balsam Site. Every Balsam Site contains an `apps/` directory with Python modules containing `ApplicationDefinition` classes. The set of `ApplicationDefinitions` determines the applications which may run at the Site. An `App` instance in the data model is merely a reference to an `ApplicationDefinition` class, uniquely identified by the Site ID and class path.
- A `Job` represents a single run of an `App` at a particular `Site`. The `Job` contains both application-specific data (like command line arguments) and resource requirements (like the number of MPI ranks per node) for the run. It is important to note that Job → App → Site are non-nullable relations, so a `Job` is always bound to run at a particular `Site` from the moment it is created. Therefore, the corresponding Balsam service daemon may begin staging in data as soon as a `Job` becomes visible, as appropriate.
- A `BatchJob` represents a job launch script and resource request submitted by the `Site` to the local workload manager (e.g. Slurm). Notice that the relation of `BatchJob` to `Site` is many-to-one, and that `Job` to `BatchJob` is many-to-one. That is, many `Jobs` run in a single `BatchJob`, and many `BatchJobs` are submitted at a `Site` over time.
- The `Session` is an internal model representing an active Balsam launcher session. `Jobs` have a nullable relationship to `Session`; when it is not null, the job is said to be locked by a launcher, and no other launcher should try running it. The Balsam session API is used by launchers acquiring jobs concurrently to avoid race conditions. Sessions contain a heartbeat timestamp that must be periodically ticked to maintain the session.
- A `TransferItem` is created for each stage-in or stage-out task associated with a `Job`. This permits the transfer module of the Balsam service to group transfers according to the remote source or destination, and therefore batch small transfers efficiently. When all the stage-in `TransferItems` linked to a `Job` are finished, the `Job` is considered "staged in" and moves ahead to preprocessing.
- A `LogEvent` contains a `timestamp`, `from_state`, `to_state`, and `message` for each state transition linked to a `Job`. The benefit of breaking a Job's state history out into a separate table is that it becomes easy to query for aggregate throughput and other metrics without having to first parse and accumulate timestamps nested inside a `Job` field.
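The ownership chain described above can be sketched with plain dataclasses. This is an illustrative, framework-free model (the real server uses Django ORM classes); the class and helper names here are assumptions for demonstration:

```python
from dataclasses import dataclass

# Hypothetical in-memory mirror of the schema: each reference below is a
# many-to-one (ForeignKey) relationship, matching the graph described above.

@dataclass
class User:
    username: str

@dataclass
class Site:
    name: str   # globally unique, maps to a directory on some machine
    user: User  # every Site has exactly one owner (tenant)

@dataclass
class App:
    class_path: str  # identifies an ApplicationDefinition class
    site: Site       # Apps are registered at exactly one Site

@dataclass
class Job:
    workdir: str
    app: App  # non-nullable: a Job is bound to an App (and thus a Site)

def jobs_of_user(jobs: list, user: User) -> list:
    """Equivalent in spirit to Job.objects.filter(app__site__user=current_user):
    walk the Job -> App -> Site -> User chain."""
    return [j for j in jobs if j.app.site.user is user]
```

Because every table chains back to `User`, tenant isolation falls out of a single join path rather than per-table owner columns.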
## The REST API

Refer to the interactive documentation located under the `/docs` URL of your Balsam server for detailed information about each endpoint. For instance, launch a local server with `docker-compose up` and visit `localhost:8000/docs`.
### User & a note on Auth

Generally, Balsam needs two types of auth to function:

- Login auth: this will likely be a pair of views providing an OAuth flow, where Balsam redirects the user to an external auth system and, upon successful authentication, user information is redirected back to a Balsam callback view. For testing purposes, basic password-based login could be used instead.
- Token auth: after the initial login, Balsam clients need a way to authenticate subsequent requests to the API. This can be performed with Token authentication and a secure setup like Django REST Knox. Upon successful login authentication (step 1), a Token is generated and stored (encrypted) for the User. This token is returned to the client in the login response. The client then stores this token, which has some expiration date, and includes it as an HTTP header on every subsequent request to the API (e.g. `Authorization: Token 4789ac8372...`). This is how both JavaScript web clients and automated Balsam Site services communicate with the API.
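A minimal client-side sketch of the token scheme, using only the standard library (the URL and token value are placeholders, not real credentials):

```python
import urllib.request

def authed_request(url: str, token: str) -> urllib.request.Request:
    """Build a request carrying the Balsam API token in the
    Authorization header, as described above."""
    req = urllib.request.Request(url)
    req.add_header("Authorization", f"Token {token}")
    return req

# Example: every API call after login reuses the stored token.
req = authed_request("http://localhost:8000/api/jobs/", "4789ac8372")
```

Site daemons and web clients differ only in where they persist the token; the request shape is identical.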
### Summary of Endpoints
| HTTP Method | URL | Description | Example usage |
|---|---|---|---|
| GET | /sites/ | Retrieve the current user's list of Sites | A user checks their Balsam Site statuses on the dashboard |
| POST | /sites/ | Create a new Site | `balsam init` creates a Site and stores the new ID locally |
| PUT | /sites/{id} | Update Site information | Service daemon syncs `backfill_windows` periodically |
| DELETE | /sites/{id} | Delete Site | User deletes their Site with `balsam rm site` |
| GET | /apps/ | Retrieve the current user's list of Apps | `balsam ls apps` shows Apps across Sites |
| POST | /apps/ | Create a new App | `balsam app sync` creates new Apps from local ApplicationDefinitions |
| PUT | /apps/{id} | Update App information | `balsam app sync` updates existing Apps with changes from local ApplicationDefinitions |
| DELETE | /apps/{id} | Delete App | User deletes an App; all related Jobs are deleted |
| GET | /jobs/ | Get paginated Job lists, filtered by Site, state, tags, BatchJob, or App | `balsam ls` |
| POST | /jobs/ | Bulk-create Jobs | Create 1k Jobs with a single API call |
| PUT | /jobs/{id} | Update Job information | Tweak a single Job in the web UI |
| DELETE | /jobs/{id} | Delete Job | Delete a single Job in the web UI |
| PUT | /jobs/ | Bulk-update Jobs: apply the same update to all Jobs matching a query | Restart all Jobs at Site X with tag `workflow="foo"` |
| PATCH | /jobs/ | Bulk-update Jobs: apply a list of patches job-wise | Balsam StatusUpdater component sends a list of status updates to the API |
| GET | /batch-jobs/ | Get BatchJobs | Web client lists recent BatchJobs |
| POST | /batch-jobs/ | Create BatchJob | Web client or ElasticQueue submits a new BatchJob |
| PUT | /batch-jobs/{id} | Alter BatchJob by ID | Web client alters job runtime while queued |
| DELETE | /batch-jobs/{id} | Delete BatchJob by ID | User deletes a BatchJob before it was ever submitted |
| PATCH | /batch-jobs/ | Bulk-update BatchJobs by patch list | Service syncs BatchJob states |
| GET | /sessions | Get Sessions list | BatchJob web view shows "Last Heartbeat" for each running |
| POST | /sessions | Create new Session | Launcher JobSource initialized |
| POST | /sessions/{id}/acquire | Acquire Jobs for a launcher | JobSource acquires new Jobs to run |
| PUT | /sessions/{id} | Tick Session heartbeat | JobSource ticks Session periodically |
| DELETE | /sessions/{id} | Destroy Session and release Jobs | Final JobSource `release()` call |
| GET | /transfers/ | List TransferItems | Transfer module gets list of pending transfers |
| PUT | /transfers/{id} | Update TransferItem state | Transfer module updates status |
| PATCH | /transfers/ | Bulk-update TransferItems via patch list | Transfer module bulk-updates statuses of finished transfers |
| GET | /events | Fetch LogEvents | Web client filters by Job tags and the last 24 hours for a quick view of throughput/utilization for a particular job type |
## Site

| Field Name | Description |
|---|---|
| `id` | Unique Site ID |
| `name` | The unique Site name, like `theta-knl` |
| `path` | Absolute POSIX path to the Site directory |
| `last_refresh` | Automatically updated timestamp: last update to Site information |
| `creation_date` | Timestamp when the Site was created |
| `owner` | ForeignKey to the User model |
| `globus_endpoint_id` | Optional UUID: sets an associated Globus endpoint for data transfer |
| `num_nodes` | Number of compute nodes available at the Site |
| `backfill_windows` | JSONField: array of `[queue, num_nodes, wall_time_min]` tuples indicating backfill slots |
| `queued_jobs` | JSONField: array of `[queue, num_nodes, wall_time_min, state]` indicating currently queued and running jobs |
| `optional_batch_job_params` | JSONField used in BatchJob forms/validation: `{name: default_value}`. Taken from the Site config. |
| `allowed_projects` | JSONField used in BatchJob forms/validation: `[name: str]` |
| `allowed_queues` | JSONField used in BatchJob forms/validation: `{name: {max_nodes, max_walltime, max_queued}}` |
| `transfer_locations` | JSONField used in Job stage-in/stage-out validation: `{alias: {protocol, netloc}}` |
## App

| Field Name | Description |
|---|---|
| `id` | Unique App ID |
| `site` | ForeignKey to the Site instance containing this App |
| `name` | Short name identifying the app |
| `description` | Text description (useful in generating web forms) |
| `name` | Name of the ApplicationDefinition class |
| `parameters` | Command line template or function parameters. A dict of dicts with the structure: `{name: {required: bool, default: str, help: str}}` |
| `transfers` | A dict of stage-in/stage-out slots with the structure: `{name: {required: bool, direction: "in"\|"out", target_path: str, help: str}}` |
The `App` model merely indexes the `ApplicationDefinition` classes that a user has registered at their Balsam Sites. The `parameters` field represents "slots" for each adjustable command line parameter. For example, an `ApplicationDefinition` command template of `"echo hello, {{first_name}}!"` would result in an `App` having the `parameters` list `[{name: "first_name", required: true, default: "", help: ""}]`. None of the Balsam Site components use `App.parameters` internally; the purpose of mirroring this field in the database is simply to facilitate Job validation and create App-tailored web forms. Similarly, `transfers` mirrors data on the `ApplicationDefinition` for Job input and validation purposes only. For security reasons, the validation of Job input parameters takes place in the site-local `ApplicationDefinition` module. Even if a malicious user altered the `parameters` field in the API, they would not be able to successfully run a Job with injected parameters.
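The "slots" idea can be illustrated by scanning a command template for `{{param}}` placeholders and building the corresponding `parameters` entries. This is a hypothetical helper, not the actual ApplicationDefinition machinery:

```python
import re

def template_parameters(command_template: str) -> list:
    """Build a parameters 'slot' list from a {{name}}-style command template."""
    names = re.findall(r"{{\s*(\w+)\s*}}", command_template)
    return [
        {"name": n, "required": True, "default": "", "help": ""}
        for n in names
    ]

# The example template from above yields one slot for "first_name".
slots = template_parameters("echo hello, {{first_name}}!")
```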
## Job

| Field Name | Description |
|---|---|
| `id` | Unique Job ID |
| `workdir` | Working directory, relative to the Site `data/` directory |
| `tags` | JSON `{str: str}` mappings for tagging and selecting Jobs |
| `session` | ForeignKey to Session instance |
| `app` | ForeignKey to App instance |
| `parameters` | JSON `{paramName: paramValue}` for the App command template parameters |
| `batch_job` | ForeignKey to the current or most recent BatchJob in which this Job ran |
| `state` | Current state of the Job |
| `last_update` | Timestamp of last modification to the Job |
| `data` | Arbitrary JSON data storage |
| `return_code` | Most recent return code of the Job |
| `parents` | Non-symmetric many-to-many parent → child relations between Jobs |
| `num_nodes` | Number of compute nodes required (> 1 implies MPI usage) |
| `ranks_per_node` | Number of ranks per node (> 1 implies MPI usage) |
| `threads_per_rank` | Number of logical threads per MPI rank |
| `threads_per_core` | Number of logical threads per hardware core |
| `launch_params` | Optional pass-through parameters to the MPI launcher (e.g. `-cc depth`) |
| `gpus_per_rank` | Number of GPUs per MPI rank |
| `node_packing_count` | Maximum number of instances that can run on a single node |
| `wall_time_min` | Lower-bound estimate for the runtime of the Job (leaving the default of 0 is allowed) |
Workdir uniqueness is left as the user's responsibility: if they create two Jobs with the same workdir, it is assumed intentional. We can ensure that the `stdout` of each Job goes into a file named by Job ID, so multiple runs do not collide.
A user can only access Jobs they own. The related App, BatchJob, and `parents` are included by ID in the serialized representation. The `session` is excluded, since it is only used internally. Reverse (one-to-many) relationships with `transfers` and `events` are also not included in the Job representation, as they can be accessed through separate API endpoints.
The related entities are represented in JSON as follows:

| Field | Serialized | Deserialized |
|---|---|---|
| `id` | Primary key | Fetch Job from user-filtered queryset |
| `app_id` | Primary key | Fetch App from user-filtered queryset |
| `batch_job_id` | Primary key | Fetch BatchJob from user-filtered queryset |
| `parent_ids` | Primary key list | Fetch parent Jobs from user-filtered queryset |
| `transfers` | N/A | Create only: dict of `{transfer_item_name: {location_alias: str, path: str}}` |
| `events` | N/A | N/A |
| `session` | N/A | N/A |
`transfers` are nested in the Job for `POST` only: `Job` creation is an atomic transaction grouping the addition of the `Job` with its related `TransferItems`. The API fetches the related `App.transfers` and `Site.transfer_locations` to validate each transfer item:

- `transfer_item_name` must match one of the keys in `App.transfers`, which determines the `direction` and local path
- The `location_alias` must match one of the keys in `Site.transfer_locations`, which determines the `protocol` and `remote_netloc`
- Finally, the remote path is determined by the `path` key in each `Job` transfer item
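The three validation rules above can be sketched as a standalone function. Names like `validate_transfers` and the exact error handling are assumptions for illustration, not the server's actual code:

```python
def validate_transfers(job_transfers: dict, app_transfers: dict,
                       transfer_locations: dict) -> list:
    """Check a Job's nested transfer items against App.transfers slots
    and Site.transfer_locations aliases, per the three rules above."""
    validated = []
    for name, item in job_transfers.items():
        if name not in app_transfers:           # rule 1: known transfer slot
            raise ValueError(f"Unknown transfer item: {name}")
        slot = app_transfers[name]
        alias = item["location_alias"]
        if alias not in transfer_locations:     # rule 2: known location alias
            raise ValueError(f"Unknown location alias: {alias}")
        loc = transfer_locations[alias]
        validated.append({
            "direction": slot["direction"],     # from App.transfers
            "local_path": slot["target_path"],
            "protocol": loc["protocol"],        # from Site.transfer_locations
            "remote_netloc": loc["netloc"],
            "remote_path": item["path"],        # rule 3: from the Job item
        })
    return validated
```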
## BatchJob

| Field Name | Description |
|---|---|
| `id` | Unique ID. Not to be confused with the scheduler ID, which is not necessarily unique across Sites! |
| `site` | ForeignKey to the Site where submitted |
| `scheduler_id` | ID assigned by the Site's batch scheduler (null if unassigned) |
| `project` | Project/allocation to be charged for the job submission |
| `queue` | Scheduler queue to which the BatchJob is submitted |
| `num_nodes` | Number of nodes requested for the BatchJob |
| `wall_time_min` | Wall time requested, in minutes |
| `job_mode` | Balsam launcher job mode |
| `optional_params` | Extra pass-through parameters to the job template |
| `filter_tags` | Restrict the launcher to run Jobs with matching tags. JSONField dict: `{tag_key: tag_val}` |
| `state` | Current status of the BatchJob |
| `status_info` | JSON: error or custom data received from the scheduler |
| `start_time` | DateTime when the BatchJob started running |
| `end_time` | DateTime when the BatchJob ended |
Every workload manager is different, and there are numerous job states intentionally not considered in the `BatchJob` model, including `starting`, `exiting`, `user_hold`, `dep_hold`, etc. It is the responsibility of the Site's Scheduler interface to translate real scheduler states to one of the few coarse-grained Balsam `BatchJob` states: `queued`, `running`, or `finished`.
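For example, a Slurm-backed Scheduler interface might coarse-grain states like this. This is a sketch under assumed Slurm state names; the real interface covers more states and schedulers:

```python
# Hypothetical mapping from fine-grained Slurm states to the three
# coarse-grained Balsam BatchJob states described above.
SLURM_TO_BALSAM = {
    "PENDING": "queued",
    "CONFIGURING": "queued",
    "RUNNING": "running",
    "COMPLETING": "running",
    "COMPLETED": "finished",
    "FAILED": "finished",
    "CANCELLED": "finished",
    "TIMEOUT": "finished",
}

def to_balsam_state(scheduler_state: str) -> str:
    # Unmapped fine-grained states (holds, etc.) default to "queued"
    # rather than erroring, in this sketch.
    return SLURM_TO_BALSAM.get(scheduler_state.upper(), "queued")
```

Keeping the translation at the Site means the central API never needs to know about any particular workload manager's state machine.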
## Session

| Field Name | Description |
|---|---|
| `id` | Unique ID |
| `heartbeat` | DateTime of the last Session tick API call |
| `batch_job` | Non-nullable ForeignKey to the BatchJob this Session is running under |

- `Session` creation only requires providing `batch_job_id`.
- A `Session` tick has an empty payload.
- The `Session` acquire endpoint uses a special `JobAcquireSerializer` representation:
| Field | Description |
|---|---|
| `states` | List of states to acquire |
| `max_num_acquire` | Limit on the number of Jobs to acquire |
| `filter_tags` | Filter Jobs for which `job.tags` contains all `{tag_name: tag_value}` pairs |
| `node_resources` | Nested NodeResource representation placing resource constraints on which Jobs may be acquired |
| `order_by` | Order returned Jobs according to a set of Job fields (may include ascending or descending `num_nodes`, `node_packing_count`, `wall_time_min`) |
The nested `NodeResource` representation is provided as a dict with the structure:

```python
{
    "max_jobs_per_node": 1,  # Determined by Site settings for each launcher job mode
    "max_wall_time_min": 60,
    "running_job_counts": [0, 1, 0],
    "node_occupancies": [0.0, 1.0, 0.0],
    "idle_cores": [64, 63, 64],
    "idle_gpus": [1, 0, 1],
}
```
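A launcher might use this payload to decide which nodes can still accept a Job. The function name and packing logic below are illustrative, keyed to the dict fields shown above:

```python
def nodes_that_fit(resources: dict, cores_needed: int, gpus_needed: int) -> list:
    """Return indices of nodes with an open job slot and enough idle
    cores/GPUs, based on the NodeResource dict shown above."""
    fits = []
    for i, count in enumerate(resources["running_job_counts"]):
        if count >= resources["max_jobs_per_node"]:
            continue  # node already at its per-node job limit
        if (resources["idle_cores"][i] >= cores_needed
                and resources["idle_gpus"][i] >= gpus_needed):
            fits.append(i)
    return fits
```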
## TransferItem

| Field Name | Description |
|---|---|
| `id` | Unique TransferItem ID |
| `job` | ForeignKey to the Job |
| `protocol` | `globus` or `rsync` |
| `direction` | `in` or `out`. If `in`, the transfer is from `remote_netloc:source_path` to `Job.workdir/destination_path`. If `out`, the transfer is from `Job.workdir/source_path` to `remote_netloc:destination_path`. |
| `remote_netloc` | The Globus endpoint UUID or `user@hostname` of the remote data location |
| `source_path` | If stage-in: the remote path. If stage-out: the local path |
| `destination_path` | If stage-in: the local path. If stage-out: the remote path |
| `state` | `pending` → `active` → `done` or `error` |
| `task_id` | Unique identifier of the transfer task (e.g. Globus Task UUID) |
| `transfer_info` | JSONField for error messages, average bandwidth, transfer time, etc. |
There is no create (`POST`) method on the `/transfers` endpoint, because TransferItem creation is directly linked with Job creation. The related transfers are nested in the Job representation when POSTing new Jobs. The following fields are fixed at creation time:

- `id`
- `job`
- `protocol`
- `direction`
- `remote_netloc`
- `source_path`
- `destination_path`
For list (`GET`), the representation includes all fields; `job_id` represents the `Job` by primary key.

For update (`PUT` and `PATCH`), only `state`, `task_id`, and `transfer_info` may be modified. Updating a state to `done` triggers a check of the related `Job`'s transfers to determine whether the Job can be advanced to `STAGED_IN`.
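That `done`-triggered check can be sketched as follows (hypothetical helper; the real logic lives in the server's update handler):

```python
def ready_to_stage_in(transfer_items: list) -> bool:
    """True when every stage-in TransferItem of a Job is done, i.e. the
    Job may advance to STAGED_IN as described above."""
    stage_ins = [t for t in transfer_items if t["direction"] == "in"]
    return all(t["state"] == "done" for t in stage_ins)
```

Stage-out items are ignored here on purpose: only the stage-in items gate the transition to preprocessing.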
## LogEvent

| Field Name | Description |
|---|---|
| `id` | Unique ID |
| `job` | ForeignKey to the Job undergoing the event |
| `timestamp` | DateTime of the event |
| `from_state` | Job state before the transition |
| `to_state` | Job state after the transition |
| `data` | JSONField containing `{message: str}` and other optional data |
For transitions to or from `RUNNING`, the `data` includes `nodes` as a fractional number of occupied nodes. This enables clients to generate throughput and utilization views without having to fetch entire related Jobs.

This is a read-only API with all fields included. The related Job is represented by the primary-key `job_id` field.
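For instance, a client could accumulate the fractional `nodes` occupancy over RUNNING intervals to compute node-seconds of utilization. This is a sketch: the event dicts mirror the fields above, and ISO 8601 timestamps are an assumption:

```python
from datetime import datetime

def node_seconds(events: list) -> float:
    """Sum fractional node-occupancy time over RUNNING intervals.
    Each event dict mirrors the LogEvent fields described above."""
    total, started = 0.0, {}
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        ts = datetime.fromisoformat(ev["timestamp"])
        job = ev["job_id"]
        if ev["to_state"] == "RUNNING":
            started[job] = (ts, ev["data"]["nodes"])
        elif ev["from_state"] == "RUNNING" and job in started:
            t0, nodes = started.pop(job)
            total += (ts - t0).total_seconds() * nodes
    return total
```

Because each event carries the occupancy alongside the transition, this aggregation needs only the `/events` endpoint and never touches the Job table.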