# Data Transfers

## Background
Each Balsam Job may require data to be staged in prior to execution or staged out after execution. A core feature of Balsam is its ability to interface with services such as Globus Transfer and automatically submit and monitor batched transfer tasks between endpoints. This enables distributed workflows in which large numbers of Jobs with relatively small datasets are submitted in real time: the Site manages the details of efficient batch transfers and marks individual Jobs as `STAGED_IN` as the requisite data arrives.
To use this functionality, the first step is to define the Transfer Slots for a given Balsam App. We can then submit Jobs with transfer items that fill the required transfer slots, as in the sketch below. Be sure to read the corresponding sections of the user guide for more information.
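For illustration, here is a minimal sketch of an App declaring one stage-in slot and a Job that fills it. The App name, Site name, command, slot name, location alias, and paths are all hypothetical; the exact transfer-slot and transfer-item schemas are documented in the user guide sections mentioned above.

```python
from balsam.api import ApplicationDefinition

class ProcessDataset(ApplicationDefinition):
    """Hypothetical App declaring one required stage-in Transfer Slot."""
    site = "my-site"  # illustrative Site name
    command_template = "python /path/to/process.py input.dat"
    transfers = {
        "input_file": {
            "required": True,
            "direction": "in",
            "local_path": "input.dat",  # destination, relative to the Job workdir
            "description": "Raw input dataset",
            "recursive": False,
        },
    }

# Submit a Job whose transfer item fills the "input_file" slot.
# "laptop" must be a trusted alias in the Site's transfer_locations (see below).
job = ProcessDataset.submit(
    workdir="demo/run1",
    transfers={"input_file": {"location_alias": "laptop", "path": "/data/input.dat"}},
)
```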
The only other requirement is to configure the transfer plugin at the Balsam Site and authenticate with Globus, which we explain below.
## Configuring Transfers
When using the Globus transfer interface, Balsam needs an access token to communicate with the Globus Transfer API. You may already have an access token stored from a Globus CLI installation on your machine: run `globus whoami` to check whether this is the case. Otherwise, Balsam ships with the necessary tooling, and you can follow the same Globus authentication flow by running:
```bash
$ balsam site globus-login
```
Next, we configure the `transfers` section of `settings.yml`; each option is described below, followed by a sketch of a complete section:
- `transfer_locations` should be set to a dictionary of trusted location aliases. If you need to add Globus endpoints, they can be inserted here.
- `globus_endpoint_id` should refer to the endpoint ID of the local Site.
- `globus_endpoint_site_path` specifies the path on the Globus endpoint, which might be different from the path used on login/compute nodes (e.g., on the ALCF home filesystem, paths begin with `/home/${USER}`, but on the `dtn_home` endpoint, paths begin with `/${USER}`).
- `max_concurrent_transfers` determines the maximum number of in-flight transfer tasks, where each task manages a batch of files for many Jobs.
- `transfer_batch_size` determines the maximum number of transfer items per transfer task. This should be tuned depending on your workload (a higher number makes sense to utilize available bandwidth for smaller files).
- `num_items_query_limit` determines the maximum number of transfer items considered in any single transfer task submission.
- `service_period` determines the interval (in seconds) between transfer task submissions.
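The following is a minimal sketch of a configured `transfers` section, assuming a single trusted alias named `laptop`; both endpoint UUIDs are placeholders, and the numeric values are illustrative rather than recommended:

```yaml
transfers:
  transfer_locations:
    # Trusted alias -> Globus endpoint ID (placeholder UUID)
    laptop: globus://aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
  # Endpoint ID of the local Site (placeholder UUID)
  globus_endpoint_id: ffffffff-1111-2222-3333-444444444444
  # Site path as seen from that endpoint
  globus_endpoint_site_path: /${USER}/my-site
  max_concurrent_transfers: 5
  transfer_batch_size: 100
  num_items_query_limit: 2000
  service_period: 5
```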
Globus requires that you give Balsam consent to make transfers on your behalf; consent is granted for each endpoint that you intend to use. You can review your consents in the Globus web app. For each endpoint you have configured above (including the `globus_endpoint_id`), determine the Globus endpoint ID and pass them all to the following command:
```bash
$ balsam site globus-login -e ENDPOINT_ID1 -e ENDPOINT_ID2
```
Note that the `globus_endpoint_id` in `settings.yml` will be used both to stage input data in and to stage output data out. This endpoint ID depends on the filesystem where your Site is located (e.g., at ALCF, if the Site is in your home directory, use `alcf#dtn_home`; if it is on the Eagle filesystem, use `alcf#eagle_dtn`). Also make sure that the path to your Site corresponds to how it is mapped on your Globus endpoint, using the `globus_endpoint_site_path` setting described above.
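For example, a hypothetical Site at `/home/user/my-site` on the ALCF home filesystem could be configured along these lines (the UUID is a placeholder; look up the actual ID of `alcf#dtn_home` in the Globus web app):

```yaml
# Placeholder UUID for the alcf#dtn_home endpoint
globus_endpoint_id: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
# /home/user/my-site on the login nodes appears as /user/my-site on dtn_home
globus_endpoint_site_path: /user/my-site
```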
Once `settings.yml` has been configured appropriately, be sure to restart the Balsam Site:

```bash
$ balsam site sync
```
The Site will immediately start issuing stage-in and stage-out tasks and advancing Jobs as needed. The state of transfers can be tracked using the Python API:
```python
from balsam.api import TransferItem

# List stage-in items currently moving as part of an in-flight transfer task
for item in TransferItem.objects.filter(direction="in", state="active"):
    print(f"File {item.remote_path} is currently staging in via task ID: {item.task_id}")
```