Auto-scaling Resources with Balsam Elastic Queue¶
A Balsam Site does not use your computing allocation by default.
Instead, we must use the BatchJob API or balsam queue CLI to
explicitly request compute nodes at a given Balsam Site. This follows the
principle of least surprise: as the User, you explicitly decide when and how
resources are spent. This is not a limitation since we can remotely submit
BatchJobs to any Site from a computer where Balsam is installed!
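For example, a minimal sketch of such a remote submission through the Python BatchJob API might look like the following (the Site ID, project, and queue here are placeholders, and the exact field names may vary slightly between Balsam versions):

from balsam.api import BatchJob

# Request a 2-node, 30-minute allocation at a remote Site.
# site_id, project, and queue are placeholders for your own values.
BatchJob.objects.create(
    site_id=123,
    num_nodes=2,
    wall_time_min=30,
    job_mode="mpi",
    project="datascience",
    queue="debug",
)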
However, we can opt in to the elastic_queue plugin, which creates
BatchJob submissions on our behalf. This can be a useful service
in bursty or real-time workloads: instead of micro-managing the queues, we
simply submit a stream of Jobs and allow the Site to provision resources
as needed over time.
Enabling the Elastic Queue Plugin¶
Auto-scaling is enabled at a Site by setting the elastic_queue configuration appropriately in the settings.yml file.
You should find this line uncommented by default:
elastic_queue: null
Following this line is a commented block showing an example elastic_queue configuration. To enable the plugin, we comment out the elastic_queue: null line and uncomment the example configuration, setting each parameter appropriately for our use case.
Once the Plugin has been configured properly (see below), we must restart the Balsam Site Agent to load it:
$ balsam site sync
# Or if the Site isn't already running:
$ balsam site start
Disabling the Elastic Queue Plugin¶
To disable, comment out (or delete) the current elastic_queue
configuration in settings.yml and replace it with the line:
elastic_queue: null
Then restart the Balsam Site to run without the elastic queue plugin:
$ balsam site sync
Configuring the Elastic Queue¶
The configuration is flexible enough to support a wide range of use cases. This section explains the YAML configuration in chunks.
Project, Queue, and Submit Frequency¶
Firstly, service_period controls the waiting period (in seconds) between
cycles in which a new BatchJob might be submitted to the queue. The
submit_project, submit_queue, and job_mode are directly passed through to
the new BatchJob.
The max_queue_wait_time_min determines how long a submitted BatchJob should
be enqueued before the elastic queue deletes it and tries re-submitting. When
using backfill to grab idle nodes (see next section), it makes sense to set a
relatively short waiting time of 5-10 minutes. Otherwise, this duration should
be increased to a reasonable upper threshold to avoid deleting BatchJobs that
have accrued priority in the queues.
The elastic queue will maintain up to max_queued_jobs in the queue at any
given time. This should be set to the maximum desired (or allowed) number of
simultaneously queued/running BatchJobs at the Site.
elastic_queue:
service_period: 60
submit_project: "datascience"
submit_queue: "balsam"
job_mode: "mpi"
use_backfill: True
min_wall_time_min: 35
max_wall_time_min: 360
wall_time_pad_min: 5
min_num_nodes: 20
max_num_nodes: 127
max_queue_wait_time_min: 10
max_queued_jobs: 20
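To make the submission cadence concrete, the following Python sketch illustrates how these parameters interact on each cycle. It is only an illustration of the behavior described above, not the plugin's actual source; every name other than the configuration keys is hypothetical:

import time

def run_elastic_queue(cfg, site_queue):
    """Hypothetical main loop: one pass every service_period seconds."""
    while True:
        # Delete BatchJobs that have waited longer than allowed, so they
        # can be re-submitted later (possibly with a different size).
        for batch_job in site_queue.queued_batch_jobs():
            if batch_job.minutes_waiting > cfg["max_queue_wait_time_min"]:
                site_queue.delete(batch_job)

        # Never hold more than max_queued_jobs queued/running BatchJobs.
        if site_queue.num_batch_jobs() < cfg["max_queued_jobs"]:
            site_queue.submit(
                project=cfg["submit_project"],
                queue=cfg["submit_queue"],
                job_mode=cfg["job_mode"],
            )

        time.sleep(cfg["service_period"])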
Wall Time and Backfilling¶
Many HPC systems use backfilling schedulers, which attempt to place small Jobs
on nodes that would otherwise sit idle while draining for larger Jobs scheduled to start at a determined future time.
By opportunistically sizing jobs to fit into these idle node-hour windows,
Balsam effectively "fills the gaps" in unused resources. We enable this dynamic
sizing with use_backfill: True.
The interpretation of min_wall_time_min and max_wall_time_min depends on whether or not use_backfill is enabled:
- When use_backfill is False: min_wall_time_min is ignored and BatchJobs are submitted with a constant wallclock time limit of max_wall_time_min.
- When use_backfill is True: Balsam selects backfill windows that are at least as long as min_wall_time_min (this avoids futile 5-minute submissions when all Jobs take at least 30 minutes). The wallclock time limit is then the lesser of the scheduler's backfill window and max_wall_time_min.
- Finally, a "padding" value of wall_time_pad_min is subtracted from the final wallclock time limit in all BatchJob submissions. This should be set to a couple of minutes when use_backfill is True and 0 otherwise.
elastic_queue:
service_period: 60
submit_project: "datascience"
submit_queue: "balsam"
job_mode: "mpi"
use_backfill: True
min_wall_time_min: 35
max_wall_time_min: 360
wall_time_pad_min: 5
min_num_nodes: 20
max_num_nodes: 127
max_queue_wait_time_min: 10
max_queued_jobs: 20
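The wall time rules above can be condensed into a short Python sketch. This is an illustration of the documented behavior, not the plugin's source; backfill_window_min stands in for the idle window reported by the local scheduler:

def select_wall_time_min(cfg, backfill_window_min=None):
    """Wallclock limit (minutes) for the next BatchJob, or None to skip."""
    if not cfg["use_backfill"]:
        # Fixed-size submissions: min_wall_time_min is ignored.
        return cfg["max_wall_time_min"] - cfg["wall_time_pad_min"]

    # Backfill mode: ignore windows shorter than min_wall_time_min.
    if backfill_window_min is None or backfill_window_min < cfg["min_wall_time_min"]:
        return None

    # Lesser of the backfill window and max_wall_time_min, minus the padding.
    return min(backfill_window_min, cfg["max_wall_time_min"]) - cfg["wall_time_pad_min"]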
Node Count¶
Finally, the min_num_nodes and max_num_nodes determine the permissible range
of node counts in submitted BatchJobs.
When use_backfill is True, backfill windows with fewer than min_num_nodes nodes are ignored. Otherwise, BatchJob submissions use min_num_nodes as a lower bound. Likewise, max_num_nodes gives an upper bound on the BatchJob's node count.
The actual submitted BatchJob node count falls somewhere in this range. It is determined from the difference between how many nodes are currently requested (queued or running BatchJobs) and the aggregate node footprint of all runnable Jobs.
elastic_queue:
service_period: 60
submit_project: "datascience"
submit_queue: "balsam"
job_mode: "mpi"
use_backfill: True
min_wall_time_min: 35
max_wall_time_min: 360
wall_time_pad_min: 5
min_num_nodes: 20
max_num_nodes: 127
max_queue_wait_time_min: 10
max_queued_jobs: 20
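The node count selection can be sketched in the same illustrative style, under the same caveat that this is not the plugin's actual implementation:

def select_num_nodes(cfg, requested_nodes, runnable_footprint, backfill_window_nodes=None):
    """Node count for the next BatchJob, or None to skip submission.

    requested_nodes: nodes held by currently queued/running BatchJobs.
    runnable_footprint: aggregate nodes needed by all runnable Jobs.
    """
    deficit = runnable_footprint - requested_nodes
    if deficit <= 0:
        return None  # existing requests already cover the workload

    # Clamp the request into the [min_num_nodes, max_num_nodes] range.
    num_nodes = max(cfg["min_num_nodes"], min(deficit, cfg["max_num_nodes"]))

    if cfg["use_backfill"]:
        # Ignore backfill windows smaller than min_num_nodes.
        if backfill_window_nodes is None or backfill_window_nodes < cfg["min_num_nodes"]:
            return None
        num_nodes = min(num_nodes, backfill_window_nodes)

    return num_nodes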
Therefore, the elastic queue automatically controls the size and number
of requested BatchJobs as the workload grows. We can think of each BatchJob
as a flexibly sized block of resources, and the elastic queue creates multiple
blocks (at most one per service_period) while choosing their sizes. If a single BatchJob
cannot accommodate the incoming volume of tasks, then maximum-size BatchJobs
continue to be submitted over successive iterations.
When the incoming Jobs slow down and the backlog falls inside the
(min_num_nodes, max_num_nodes) range, the BatchJobs shrink to a single,
smaller allocation of resources. As utilization decreases and launchers
become idle, the nodes are released according to the launcher's idle_ttl_sec
configuration (also in settings.yml).