Home
A unified platform to manage high-throughput workflows across the HPC landscape.
Run Balsam on any laptop, cluster, or supercomputer.
$ pip install --pre balsam
$ balsam login
$ balsam site init my-site
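Once the Site is initialized, you can check from a Python session that it registered with the Balsam service. A minimal sketch, assuming the `Site` manager in `balsam.api` supports a `get(name=...)` lookup and exposes `id` and `path` fields:

```python
from balsam.api import Site

# Look up the Site created by `balsam site init my-site` above.
# The printed fields (`id`, `path`) are assumptions about the Site model.
site = Site.objects.get(name="my-site")
print(site.id, site.path)
```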
Python class-based declaration of Apps and execution lifecycles.
from balsam.api import ApplicationDefinition

class Hello(ApplicationDefinition):
    site = "my-laptop"
    command_template = "echo hello {{ name }}"

    def handle_timeout(self):
        self.job.state = "RESTART_READY"
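`handle_timeout` is one of several lifecycle hooks an App can override. The sketch below fills in error handling and pre/post-processing; the hook names (`preprocess`, `postprocess`, `handle_error`) and the job states they set are assumptions based on the same lifecycle API and may differ across Balsam versions:

```python
from balsam.api import ApplicationDefinition

class Hello(ApplicationDefinition):
    site = "my-laptop"
    command_template = "echo hello {{ name }}"

    def preprocess(self):
        # Stage or generate inputs here, then release the job to run.
        self.job.state = "PREPROCESSED"

    def postprocess(self):
        # Parse outputs after a clean exit, then mark the job done.
        self.job.state = "POSTPROCESSED"

    def handle_timeout(self):
        # Walltime expired: requeue the job for another attempt.
        self.job.state = "RESTART_READY"

    def handle_error(self):
        # Nonzero return code: give up and mark the job failed.
        self.job.state = "FAILED"
```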
Seamless remote job management.
# On any machine with internet access...
from balsam.api import Job, BatchJob

# Create Jobs:
job = Job.objects.create(
    site_name="my-laptop",
    app_id="Hello",
    workdir="test/say-hello",
    parameters={"name": "world!"},
)

# Or allocate resources:
BatchJob.objects.create(
    site_id=job.site_id,
    num_nodes=1,
    wall_time_min=10,
    job_mode="serial",
    project="local",
    queue="local",
)
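Because job state lives in the central service, the same remote session can also check on and retry work. A minimal sketch, assuming `Job.objects.filter` accepts `site_id` and `state` keyword filters:

```python
from balsam.api import Job

# Find jobs at this Site that ended in a failed state...
for failed in Job.objects.filter(site_id=job.site_id, state="FAILED"):
    print(failed.workdir, "failed; resubmitting")
    # ...and reset them so the launcher picks them up on the next run.
    failed.state = "RESTART_READY"
    failed.save()
```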
Dispatch Python Apps across heterogeneous resources from a single session.
import numpy as np
from balsam.api import ApplicationDefinition, Job

class MyApp(ApplicationDefinition):
    site = "theta-gpu"

    def run(self, vec):
        from mpi4py import MPI
        rank = MPI.COMM_WORLD.Get_rank()
        print("Hello from rank", rank)
        return np.linalg.norm(vec)

jobs = [
    MyApp.submit(
        workdir=f"test/{i}",
        vec=np.random.rand(3),
        ranks_per_node=4,
        gpus_per_rank=0,
    )
    for i in range(10)
]

for job in Job.objects.as_completed(jobs):
    print(job.workdir, job.result())
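To run these jobs, allocate nodes at the remote Site just as before. A sketch in which the `project` and `queue` values are placeholders for a real allocation on the target machine:

```python
from balsam.api import BatchJob

# Request a 2-node, MPI-mode allocation at the Site the jobs target.
# "MY_ALLOCATION" and "full-node" are placeholder project/queue names.
BatchJob.objects.create(
    site_id=jobs[0].site_id,
    num_nodes=2,
    wall_time_min=30,
    job_mode="mpi",
    project="MY_ALLOCATION",
    queue="full-node",
)
```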
Features
- Easy `pip` installation runs out-of-the-box on several HPC systems and is easily adaptable to others
- Balsam Sites are remotely controlled by design: submit and monitor workflows from anywhere
- Run any existing application, with flexible execution environments and job lifecycle hooks
- High-throughput and fault-tolerant task execution on diverse resources
- Define data dependencies for any task: Balsam orchestrates the necessary data transfers
- Elastic queueing: auto-scale resources to the workload size
- Monitoring APIs: query recent task failures, node utilization, or throughput (see the sketch below)
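For example, recent failures can be pulled from the event log. A minimal sketch, assuming the `EventLog` model in `balsam.api` with `to_state` and `timestamp_after` query filters and `job_id`/`from_state`/`to_state` fields:

```python
from datetime import datetime, timedelta
from balsam.api import EventLog

# List state transitions into FAILED over the past hour.
since = datetime.utcnow() - timedelta(hours=1)
for event in EventLog.objects.filter(to_state="FAILED", timestamp_after=since):
    print(event.job_id, event.timestamp, event.from_state, "->", event.to_state)
```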