Scheduling Launchers
Unless Auto Scaling is enabled, Jobs do not automatically run in Balsam. After the Site agent completes data staging and preprocessing for a Job, it waits in the PREPROCESSED state until a launcher (pilot job) submitted to the HPC batch scheduler takes over.
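For example, we can count the Jobs at a Site that are waiting in this state with the Python API (a minimal sketch; the site_id value is a placeholder):
from balsam.api import Job

# Count Jobs that have finished preprocessing and are waiting for a launcher
# (site_id=123 is a placeholder for a real Site ID)
num_ready = Job.objects.filter(site_id=123, state="PREPROCESSED").count()
print(f"{num_ready} Jobs are ready to run")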
Launchers run independently of the Site agent, continuously fetching and executing runnable Jobs on the available compute resources. They track occupancy of the allocated CPUs/GPUs while launching Jobs with the requested resources. Because launchers acquire Jobs on-the-fly, you can submit Jobs to any system, and they will execute in real-time on an existing allocation!
The Balsam launcher thus handles the compute-intensive core of the Job lifecycle: from PREPROCESSED to RUNNING to RUN_DONE. As users, we need only submit a BatchJob to any one of our Sites. The BatchJob represents an HPC resource allocation as a fixed block of node-hours. Sites handle new BatchJobs by generating the script to run a launcher and submitting it to the local HPC scheduler.
Using the CLI
If we are inside a Site directory, we can submit a BatchJob to that Site from the CLI:
$ balsam queue submit -q QUEUE -A PROJECT -n 128 -t 70 -j mpi
The CLI works remotely, too: we just need to target a specific --site so Balsam knows where the BatchJob should run:
$ balsam queue submit --site=my-site -q QUEUE -A PROJECT -n 128 -t 70 -j mpi
Balsam will perform the appropriate submission to the underlying HPC scheduler and synchronize your BatchJobs with the local scheduler state. Therefore, instead of checking queues locally (e.g. qstat), we can check on BatchJobs across all of our Sites with a single command:
$ balsam queue ls --site=all
Systems without a batch scheduler
Use -n 1 -q local -A local when creating BatchJobs at generic macOS/Linux Sites without a real batch scheduler or multiple nodes. The OS process manager takes the place of the resource manager, but everything else looks the same!
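For example, a one-node BatchJob on such a Site might be submitted as follows (the wall-time value here is arbitrary):
$ balsam queue submit -n 1 -t 30 -q local -A local -j mpi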
Selecting a Launcher Job Mode
All of the queue submit options pass through to the usual scheduler interface (like sbatch or qsub), except for the -j/--job-mode flag, which may be either mpi or serial. This flag selects the launcher job mode, which determines the pilot job implementation that will actually run.
- mpi mode is the most flexible and should be preferred unless you have a particularly extreme throughput requirement or need to use one of the workarounds offered by serial mode. The mpi launcher runs on the head node of the BatchJob and executes each job using the system's MPI launch command (e.g. srun or mpirun).
- serial mode only handles single-process (non-distributed memory) Jobs that run within a single compute node. Higher throughput of fine-grained tasks (e.g. millions of single-core tasks) is achieved by running a worker process on each compute node and fanning out cached Jobs acquired from the REST API.
Both launcher modes can simultaneously execute multiple applications per node, as long as the underlying HPC system provides support. This is not always the case: for example, on ALCF's Theta-KNL system, serial mode is required to pack multiple runs per node.
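As an illustration, the sketch below packs eight single-process Jobs per node and runs them under a serial-mode launcher (the Site, App, queue, and project names, and the site_id, are placeholders):
from balsam.api import Job, BatchJob

# Create 64 single-process Jobs, packed 8 per node (all names are placeholders)
for i in range(64):
    Job.objects.create(
        site_name="my-site",
        app_id="SimulationX",
        workdir=f"serial-runs/{i}",
        node_packing_count=8,
    )

# A serial-mode launcher on 8 nodes fans these Jobs out across its per-node workers
BatchJob.objects.create(
    site_id=123,
    num_nodes=8,
    wall_time_min=30,
    job_mode="serial",
    project="PROJECT",
    queue="QUEUE",
)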
You can submit multiple BatchJobs to a Site
Balsam launchers cooperatively divide and conquer the runnable Jobs at a Site. You may therefore choose between queueing up fewer large BatchJobs or several smaller BatchJobs simultaneously. On a busy HPC cluster, smaller BatchJobs can get through the queues faster and improve overall throughput.
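For instance, here is a sketch of queueing two smaller allocations instead of one large one (all values are placeholders):
from balsam.api import BatchJob

# Two 32-node BatchJobs rather than a single 64-node allocation
for _ in range(2):
    BatchJob.objects.create(
        site_id=123,
        num_nodes=32,
        wall_time_min=60,
        job_mode="mpi",
        project="PROJECT",
        queue="QUEUE",
    )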
Ordering Job Execution
By default, Balsam sorts jobs that are ready to run first by num_nodes in ascending order, then by node_packing_count in descending order, and finally by wall_time_min in descending order. This default behavior results in the smallest jobs by node count starting first.
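To make the ordering concrete, the default model is equivalent to sorting runnable jobs by a key like this (an illustration only, not Balsam's actual internals):
# Smallest node count first, then highest node_packing_count, then longest wall_time_min
def default_sort_key(job):
    return (job.num_nodes, -job.node_packing_count, -job.wall_time_min)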
There is an alternative sorting model that sorts jobs first by wall_time_min in descending order, then by num_nodes in descending order, and finally by node_packing_count in descending order. This alternative behavior starts the longest-running jobs, as estimated by wall_time_min, first. If jobs have no wall_time_min set, it starts the largest jobs by node count first. This alternative sorting model can be enabled for a Site by modifying the Site's settings.yml configuration file.
Under launcher, add this option:
sort_by: long_large_first # set this to enable alternative sorting model that starts the longest running and largest node count jobs first
Restart the Site after modifying settings.yml for the changes to take effect.
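In context, the relevant portion of settings.yml would look roughly like this (other launcher settings omitted):
launcher:
  # ...other launcher settings
  sort_by: long_large_first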
Using the API
A unique capability of the Balsam Python API is that it allows us to programmatically manage HPC resources (via BatchJob) and tasks (via Job) on equal footing. We can submit and monitor Jobs and BatchJobs at any Site with ease, using a single, consistent programming model.
from balsam.api import Job, BatchJob

# Create Jobs:
job = Job.objects.create(
    site_name="myProject-theta-gpu",
    app_id="SimulationX",
    workdir="test-runs/foo/1",
)

# Or allocate resources:
BatchJob.objects.create(
    site_id=job.site_id,
    num_nodes=1,
    wall_time_min=20,
    job_mode="mpi",
    project="datascience",
    queue="full-node",
)
We can query BatchJobs to track how many resources are currently available or waiting in the queue at each Site:
queued_nodes = sum(
    batch_job.num_nodes
    for batch_job in BatchJob.objects.filter(site_id=123, state="queued")
)
Or we can instruct Balsam to cleanly terminate an allocation:
BatchJob.objects.filter(scheduler_id=1234).update(state="pending_deletion")
Jobs running in that BatchJob will be marked RUN_TIMEOUT and handled by their respective Apps' handle_timeout hooks.
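If needed, we can then inspect which Jobs were cut short (a sketch; the site_id value is a placeholder):
from balsam.api import Job

# List Jobs that were interrupted when the allocation ended
for job in Job.objects.filter(site_id=123, state="RUN_TIMEOUT"):
    print(job.workdir, job.state)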
Job Templates and specialized parameters
Behind the scenes, each BatchJob materializes as a batch job script rendered from the Site's job template. These templates can be customized to support new scheduler flags, load global modules, or perform general pre-execution logic. They also accept optional, system-specific parameters that can be passed on the CLI via -x or via the BatchJob optional_params dictionary.
Theta-KNL Optional Params
On Theta-KNL, we can prime the LDAP cache on each compute node prior to a large-scale ensemble of Singularity jobs. This is necessary to avoid a system error that arises in Singularity startup at scale.
With the CLI:
$ balsam queue submit -x singularity_prime_cache=yes # ..other args
With the Python API:
BatchJob.objects.create(
    # ...other kwargs
    optional_params={"singularity_prime_cache": "yes"}
)
ThetaGPU
On Theta-GPU, we can partition each of the 8 physical A100 GPUs into 2, 3, or 7 Multi-Instance GPU (MIG) resources. This allows us to achieve higher GPU utilization with high-throughput tasks consuming a fraction of the 40 GB device memory. Jobs using a MIG instance should still request a single logical GPU with gpus_per_rank=1 but specify a higher node-packing count (e.g. node_packing_count should be 8*3 = 24 for a 3-way MIG partitioning).
With the CLI:
$ balsam queue submit -x mig_count=3 # ..other args
With the Python API:
BatchJob.objects.create(
    # ...other kwargs
    optional_params={"mig_count": "3"}
)
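For reference, a Job intended to run on one MIG instance in this configuration might look like the following sketch (the Site, App, and workdir names are placeholders):
from balsam.api import Job

# Each Job requests one logical GPU; 24 Jobs per node covers 8 GPUs x 3 MIG slices
Job.objects.create(
    site_name="myProject-theta-gpu",
    app_id="SimulationX",
    workdir="mig-runs/0",
    gpus_per_rank=1,
    node_packing_count=24,
)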
Restricting BatchJobs with tags
We strongly encourage the use of descriptive tags to facilitate monitoring Jobs. Another major use of tags is to restrict which Jobs can run in a given BatchJob.
The default behavior is that a BatchJob will run as many Jobs as possible: Balsam decides what runs in any given allocation. But perhaps we wish to prioritize a certain group of runs, or deliberately run Jobs in separate partitions as part of a scalability study.
This is easy with the CLI:
# Only run jobs with tags system=H2O and scale=4
$ balsam queue submit -n 4 --tag system=H2O --tag scale=4 # ...other args
Or with the API:
BatchJob.objects.create(
    # ...other kwargs
    num_nodes=4,
    filter_tags={"system": "H2O", "scale": "4"}
)
Partitioning BatchJobs
By default, a single launcher process manages the entire allocation of compute nodes with a single job mode of either mpi or serial. In advanced use cases, we can divide a single queue submission/allocation into multiple launcher partitions. Each of these partitions can have its own number of compute nodes, job mode, and filter tags. This can be useful in different contexts:
- Dividing an allocation to run a mixed workload of MPI applications and high-throughput sub-node applications.
- Improving scalability to large node counts by parallelizing the job launcher.
We can request that a BatchJob is split into partitions on the CLI. In this example, we split a 128-node allocation into a 2-node MPI launcher (to run some "leader" MPI app on 2 nodes), while the remaining 126 nodes are managed by the efficient serial-mode launcher for high throughput.
$ balsam queue submit -n 128 -p mpi:2 -p serial:126 # ...other args
We could also apply tag restrictions to ensure that the right Jobs run in the right partition:
$ balsam queue submit -n 128 -p mpi:2:role=leader -p serial:126:role=worker # ...other args
With the Python API, this looks like:
BatchJob.objects.create(
    # ...other kwargs
    num_nodes=128,
    partitions=[
        {"job_mode": "mpi", "num_nodes": 2, "filter_tags": {"role": "leader"}},
        {"job_mode": "serial", "num_nodes": 126, "filter_tags": {"role": "worker"}},
    ]
)