Inference with vLLM

vLLM is an open-source library designed to optimize LLM inference and serving. Originally developed at UC Berkeley's Sky Computing Lab, it has evolved into a community-driven project. The library is built around the innovative PagedAttention algorithm, which significantly improves memory management by reducing waste in Key-Value (KV) cache memory.

Install vLLM

First, SSH to an Aurora login node:

ssh <username>@aurora.alcf.anl.gov

Refer to Getting Started on Aurora for additional information. In particular, you need to set the environment variables that provide access to the proxy host.
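
For example, the proxy settings used in the multi-node setup script later on this page are:

export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
export http_proxy="http://proxy.alcf.anl.gov:3128"
export https_proxy="http://proxy.alcf.anl.gov:3128"
export ftp_proxy="http://proxy.alcf.anl.gov:3128"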

Note

The instructions below should be run directly from a compute node. Explicitly, to request an interactive job (from aurora-uan):

qsub -I -q <your_Queue> -l select=1,walltime=60:00 -A <your_ProjectName> -l filesystems=<fs1:fs2>

Refer to job scheduling and execution for additional information.
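
For illustration only, a filled-in request might look like the following (the queue, project, and filesystem names here are placeholders; replace them with your own):

qsub -I -q debug -l select=1,walltime=60:00 -A YourProjectName -l filesystems=flare:home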

Install vLLM using pre-built wheels
module load frameworks
conda create --name vllm python=3.10 -y
conda activate vllm

module unload oneapi/eng-compiler/2024.07.30.002
module use /opt/aurora/24.180.3/spack/unified/0.8.0/install/modulefiles/oneapi/2024.07.30.002
module use /soft/preview/pe/24.347.0-RC2/modulefiles
module add oneapi/release

pip install /flare/datascience/sraskar/vllm-install/wheels/*
pip install /flare/datascience/sraskar/vllm-install/vllm-0.6.6.post2.dev28+g5dbf8545.d20250129.xpu-py3-none-any.whl
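
To confirm that the wheels installed correctly, a quick import check (illustrative):

python -c "import vllm; print(vllm.__version__)"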

Access Model Weights

Model weights for commonly used open-weight models are downloaded and available in the following directory on Aurora:

/flare/datascience/model-weights/hub

To ensure your workflows utilize the preloaded model weights and datasets, update the following environment variables in your session. Some models hosted on Hugging Face may be gated, requiring additional authentication. To access these gated models, you will need a Hugging Face authentication token.
export HF_HOME="/flare/datascience/model-weights/hub"
export HF_DATASETS_CACHE="/flare/datascience/model-weights/hub"
export HF_TOKEN="YOUR_HF_TOKEN"
export RAY_TMPDIR="/tmp"
export TMPDIR="/tmp"
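
As a quick sanity check (illustrative), you can list the preloaded models in the shared cache:

ls "$HF_HOME"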

Common Configuration Recommendations

For small models that fit within a single tile's memory (64 GB), no additional configuration is required to serve the model: simply set the tensor parallelism to 1 (TP=1). This runs the model on a single tile without any distributed setup. Models with fewer than 7 billion parameters typically fit within a single tile. To utilize multiple tiles for larger models (TP>1), a more advanced setup is necessary: a Ray cluster must be configured and the ZE_FLAT_DEVICE_HIERARCHY environment variable set, as shown below:

export ZE_FLAT_DEVICE_HIERARCHY=FLAT

export VLLM_HOST_IP=$(getent hosts $(hostname).hsn.cm.aurora.alcf.anl.gov | awk '{ print $1 }' | tr ' ' '\n' | sort | head -n 1)
export tiles=12
ray --logging-level debug start --head --verbose --node-ip-address=$VLLM_HOST_IP --port=6379 --num-cpus=64 --num-gpus=$tiles&
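
Once Ray has started, you can confirm (illustrative check) that all 12 tiles are registered as GPUs before launching vLLM:

ray status

The resources section of the output should report 12 GPUs.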

Serve Small Models

Using Single Tile

The following command serves meta-llama/Llama-2-7b-chat-hf on a single tile of a single node:

vllm serve meta-llama/Llama-2-7b-chat-hf --port 8000 --device xpu --dtype float16
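
Once the server reports that it is running, you can send a test request to its OpenAI-compatible endpoint from the same node (a minimal sketch; adjust the host and port if you changed them):

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "Aurora is", "max_tokens": 32}'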

Using Multiple Tiles

The following commands set up the Ray cluster and serve meta-llama/Llama-2-7b-chat-hf across 8 tiles of a single node.

export VLLM_HOST_IP=$(getent hosts $(hostname).hsn.cm.aurora.alcf.anl.gov | awk '{ print $1 }' | tr ' ' '\n' | sort | head -n 1)
ray --logging-level debug start --head --verbose --node-ip-address=$VLLM_HOST_IP --port=6379 --num-cpus=64 --num-gpus=8&
vllm serve meta-llama/Llama-2-7b-chat-hf --port 8000 --tensor-parallel-size 8 --device xpu --dtype float16 --trust-remote-code

Serve Medium Models

Using Single Node

These commands set up a Ray cluster and serve meta-llama/Llama-3.3-70B-Instruct on 8 tiles of a single node. Models with up to 70 billion parameters can usually fit within a single node, utilizing multiple tiles.

export VLLM_HOST_IP=$(getent hosts $(hostname).hsn.cm.aurora.alcf.anl.gov | awk '{ print $1 }' | tr ' ' '\n' | sort | head -n 1)
ray --logging-level debug start --head --verbose --node-ip-address=$VLLM_HOST_IP --port=6379 --num-cpus=64 --num-gpus=8&
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000 --tensor-parallel-size 8 --device xpu --dtype float16 --trust-remote-code
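
As with the smaller models, you can verify the deployment with a request to the chat completions endpoint (illustrative):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.3-70B-Instruct", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'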

Serve Large Models

Using Multiple Nodes

The following example serves the meta-llama/Llama-3.1-405B-Instruct model using 2 nodes with TP=8 and PP=2 (tensor and pipeline parallelism). Models exceeding 70 billion parameters generally require more than one Aurora node. First, use the setup_ray_cluster.sh script to set up a Ray cluster across the nodes:

Setup script
setup_ray_cluster.sh
########################################################################
# FUNCTIONS
########################################################################

# Setup environment and variables needed to setup ray and vllm
setup_environment() {
    echo "[$(hostname)] Setting up the environment..."
    # Set proxy configurations
    export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
    export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
    export http_proxy="http://proxy.alcf.anl.gov:3128"
    export https_proxy="http://proxy.alcf.anl.gov:3128"
    export ftp_proxy="http://proxy.alcf.anl.gov:3128"

    # Define the common setup script path (make sure this file is accessible on all nodes)
    export COMMON_SETUP_SCRIPT="/path/to/setup_ray_cluster.sh"

    # Load modules and activate your conda environment
    module load frameworks
    conda activate vllm

    module unload oneapi/eng-compiler/2024.07.30.002
    module use /opt/aurora/24.180.3/spack/unified/0.8.0/install/modulefiles/oneapi/2024.07.30.002
    module use /soft/preview/pe/24.347.0-RC2/modulefiles
    module add oneapi/release

    export TORCH_LLM_ALLREDUCE=1
    export CCL_ZE_IPC_EXCHANGE=drmfd

    export ZE_FLAT_DEVICE_HIERARCHY=FLAT

    export HF_TOKEN="YOUR_HF_TOKEN"
    export HF_HOME="/flare/datascience/model-weights/hub"
    export HF_DATASETS_CACHE="/flare/datascience/model-weights/hub"
    export TMPDIR="/tmp"

    export RAY_TMPDIR="/tmp"
    export VLLM_IMAGE_FETCH_TIMEOUT=60

    ulimit -c unlimited

    # Derive the node's HSN IP address (modify the getent command as needed)
    export HSN_IP_ADDRESS=$(getent hosts "$(hostname).hsn.cm.aurora.alcf.anl.gov" | awk '{ print $1 }' | sort | head -n 1)
    export VLLM_HOST_IP="$HSN_IP_ADDRESS"

    echo "[$(hostname)] Environment setup complete. HSN_IP_ADDRESS is $HSN_IP_ADDRESS"
}

# Stop any running Ray processes
stop_ray() {
    echo "[$(hostname)] Stopping Ray (if running)..."
    ray stop -f
}

# Start Ray head node
start_ray_head() {
    echo "[$(hostname)] Starting Ray head..."
    ray start --num-gpus=8 --num-cpus=64 --head --node-ip-address="$HSN_IP_ADDRESS" --temp-dir=/tmp

    # Wait until Ray reports that the head node is up
    echo "[$(hostname)] Waiting for Ray head to be up..."
    until ray status &>/dev/null; do
        sleep 5
        echo "[$(hostname)] Waiting for Ray head..."
    done
    echo "[$(hostname)] ray status: $(ray status)"
    echo "[$(hostname)] Ray head node is up."
}

# Start Ray worker node
start_ray_worker() {
    echo "[$(hostname)] Starting Ray worker, connecting to head at $RAY_HEAD_IP..."
    echo "HSN IP Address : $HSN_IP_ADDRESS"
    ray start --num-gpus=8 --num-cpus=64 --address="$RAY_HEAD_IP:6379" --node-ip-address="$HSN_IP_ADDRESS" --temp-dir=/tmp

    echo "[$(hostname)] Waiting for Ray worker to be up..."
    until ray status &>/dev/null; do
        sleep 5
        echo "[$(hostname)] Waiting for Ray worker..."
    done
    echo "[$(hostname)] ray status: $(ray status)"
    echo "[$(hostname)] Ray worker node is up."
}

########################################################################
# MAIN SCRIPT LOGIC
########################################################################


main() {

    # Ensure that the script is being run within a PBS job
    if [ -z "$PBS_NODEFILE" ]; then
        echo "Error: PBS_NODEFILE not set. This script must be run within a PBS job allocation."
        exit 1
    fi

    # Read all nodes from the PBS_NODEFILE into an array.
    mapfile -t nodes_full < "$PBS_NODEFILE"
    num_nodes=${#nodes_full[@]}

    echo "Allocated nodes ($num_nodes):"
    printf " - %s\n" "${nodes_full[@]}"

    # Require at least 2 nodes (one head + one worker)
    if [ "$num_nodes" -lt 2 ]; then
        echo "Error: Need at least 2 nodes to launch the Ray cluster."
        exit 1
    fi

    # The first node will be our Ray head.
    head_node_full="${nodes_full[0]}"

    # All remaining nodes will be the workers.
    worker_nodes_full=("${nodes_full[@]:1}")

    # Run this master script on the designated head node (the first node in PBS_NODEFILE).
    current_node=$(hostname -f)
    if [ "$current_node" != "$head_node_full" ]; then
        echo "Warning: expected to run on head node $head_node_full, but running on $current_node."
    fi

    echo "[$(hostname)] Running on head node."

    # --- Setup and start the head node ---
    setup_environment
    stop_ray
    start_ray_head

    # Export the head node's IP so that workers can join.
    export RAY_HEAD_IP="$HSN_IP_ADDRESS"
    echo "[$(hostname)] RAY_HEAD_IP exported as $RAY_HEAD_IP"

    # --- Launch Ray workers on each of the other nodes via SSH ---
    for worker in "${worker_nodes_full[@]}"; do
        echo "[$(hostname)] Launching Ray worker on $worker..."
        ssh "$worker" "bash -l -c 'set -x; export RAY_HEAD_IP=${RAY_HEAD_IP}; export COMMON_SETUP_SCRIPT="/flare/datascience/sraskar/vllm-2025_1_release/vllm-2025_1/vllm/examples/submit-dist.sh" ;source \$COMMON_SETUP_SCRIPT; setup_environment; stop_ray; start_ray_worker'" &
    done

    # Wait for all background SSH jobs to finish.
    wait

    echo "[$(hostname)] Ray cluster is up and running with $num_nodes nodes."
}

main 
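
A minimal sketch of how to invoke the script from the first (head) node of your allocation; the path below is a placeholder, and sourcing the script keeps the exported environment (including VLLM_HOST_IP) available to the vllm command that follows:

source /path/to/setup_ray_cluster.sh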

Then, execute vLLM:

vllm serve meta-llama/Llama-3.1-405B-Instruct --port 8000 --tensor-parallel-size 8 --pipeline-parallel-size 2 --device xpu --dtype float16 --trust-remote-code --max-model-len 1024

Setting --max-model-len is important in order to fit this model on 2 nodes. To use higher --max-model-len values, you will need additional nodes.