Multi-Node Milvus Deployment on Aurora

Overview

Milvus is an open-source, high-performance vector database designed for scalable similarity search and AI applications. It supports indexing and querying billions of vectors with low latency, making it well-suited for retrieval-augmented generation (RAG), recommendation systems, image search, and other machine learning workloads that rely on dense vector embeddings.

At ALCF, Milvus is deployed on Aurora to support large-scale vector search workflows across multiple compute nodes. This multi-node deployment leverages Apptainer containers and PBS job scheduling to run Milvus in a distributed configuration, enabling researchers to perform high-throughput vector indexing and retrieval as part of their AI/ML pipelines.

For more information, please refer to the Milvus documentation.

Obtaining the Source Code

The deployment scripts and configuration files for running Milvus on Aurora are available in the VDB repository. Clone the aurora branch to your home directory on Aurora:

git clone -b aurora https://github.com/argonne-lcf/VDB.git
cd VDB

The Milvus multi-node deployment scripts are located under milvus/multiNode/.

Building Milvus on Aurora

Installing the Python Client

Before deploying the Milvus server, install the Python client library (pymilvus) into a local directory using the pre-configured frameworks module on Aurora. This provides the necessary tools to interact with the Milvus cluster from a client node.

mkdir milvus-test
cd milvus-test
module load frameworks
pip install --target=$HOME/milvus-install -U "pymilvus[milvus-lite]"

After installation, add the install directory to your Python path so that pymilvus is available:

export PYTHONPATH=$HOME/milvus-install:/opt/aurora/25.190.0/oneapi/advisor/latest/pythonapi
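If you want to sanity-check the export from Python, the stdlib can report whether the install directory is visible and whether pymilvus resolves. This is a minimal sketch, not part of the deployment scripts; the `~/milvus-install` path is an assumption matching the install target used above:

```python
import importlib.util
import os
import sys

def check_install(install_dir, package="pymilvus"):
    """Report whether install_dir is on sys.path and package is importable.

    PYTHONPATH entries are folded into sys.path at interpreter startup,
    so a missing entry here usually means the export was not in effect
    when this Python process was launched.
    """
    target = os.path.abspath(install_dir)
    on_path = target in (os.path.abspath(p) for p in sys.path if p)
    # find_spec returns None when the package cannot be located on sys.path.
    importable = importlib.util.find_spec(package) is not None
    return on_path, importable
```

For example, `check_install(os.path.expanduser("~/milvus-install"))` should return `(True, True)` in a shell where the export above is active.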

Verifying the Python Version

You can confirm that the correct Python interpreter is active by running:

python --version

Expected output:

Python 3.10.14

Running Milvus on Aurora

Requesting Compute Nodes

A multi-node Milvus deployment requires at least two compute nodes: one to host the Milvus server and one to act as the client. Request an interactive session with two nodes using PBS:

qsub -l select=2 -l walltime=60:00 -l filesystems=flare -A <project_name> -q debug -I

Replace <project_name> with your ALCF project allocation name.

Launching the Milvus Server

Navigate to the multi-node deployment directory within the cloned repository:

cd VDB/milvus/multiNode

Before launching the server, update the repository path in the deployment script to match your local installation. Open apptainer_deploy.sh and modify the directory path:

# Change this line to point to your cloned repo location
cd /your/path/to/VDB/milvus/multiNode/$myDIR

Once the path is configured, launch the Milvus server on the first compute node. The deployment script starts etcd, MinIO, and the Milvus process using Apptainer containers:

bash apptainer_deploy.sh
Example output:
Replaced <ETCD_HOST> with 10.115.31.21 and <MINIO_HOST> with 10.115.31.21 in config/milvus.yaml

After the server starts, navigate back to the parent directory and set the correct file permissions on the multiNode directory so that the client node can access the shared files:

cd ..
chmod -R 755 multiNode

You can verify the deployment size to confirm all necessary files were generated:

du -sh multiNode
Example output:
124M    multiNode

Connecting from the Client Node

With the server running, connect from the second compute node in your allocation (the client node) to issue queries against the Milvus cluster.

First, identify the available compute nodes and the server's IP address. Run the following on the server node:

# List all allocated compute nodes
cat $PBS_NODEFILE

# Get the server node's IP address (needed for the client URI)
hostname -I | awk '{print $1}'
Example output:
10.115.31.26
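If you prefer to script this step, the node list in `$PBS_NODEFILE` can be parsed directly. The sketch below is a stdlib-only illustration; treating the first unique host as the server and the second as the client is a convention assumed here, not something PBS enforces, and `19530` is Milvus's default gRPC port:

```python
import socket

def pick_nodes(nodefile):
    """Return (server, client) hostnames from a PBS nodefile.

    The nodefile may list a host once per rank, so duplicates are
    collapsed while preserving order.
    """
    hosts = []
    with open(nodefile) as f:
        for line in f:
            host = line.strip()
            if host and host not in hosts:
                hosts.append(host)
    if len(hosts) < 2:
        raise ValueError("need at least two nodes for a multi-node deployment")
    return hosts[0], hosts[1]

def server_uri(server_host, port=19530):
    """Build the client URI from the server's resolved IPv4 address."""
    return f"http://{socket.gethostbyname(server_host)}:{port}"
```

Calling `pick_nodes(os.environ["PBS_NODEFILE"])` inside the job, then `server_uri(server)`, yields the same URI you would otherwise assemble by hand from `hostname -I`.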

Next, open a new terminal window and SSH into the second compute node (the client node):

ssh <client_hostname>
cd VDB/milvus/multiNode

Update the connection URI in test_cluster.py to point to the server node's IP address:

# Replace the placeholder with the server node's IP from the previous step, for example
- URI = "http://<server-ip>:19530"
+ URI = "http://10.115.31.26:19530"

Finally, run the test script on the client node. Load the frameworks module, set the Python path, and unset the proxy environment variables to ensure direct network communication between nodes:

module load frameworks
export PYTHONPATH=/home/$USER/milvus-install:/opt/aurora/25.190.0/oneapi/advisor/latest/pythonapi
NO_PROXY="" no_proxy="" http_proxy="" https_proxy="" HTTP_PROXY="" HTTPS_PROXY="" python test_cluster.py

Proxy settings

Aurora compute nodes are configured with HTTP proxy environment variables by default. These must be unset when communicating between compute nodes to avoid routing traffic through the proxy, which would prevent the client from reaching the Milvus server.
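Equivalently, a client script can scrub the proxy variables from its own environment before opening the connection. This is a minimal stdlib sketch, not part of test_cluster.py; the variable names are the ones unset on the command line above:

```python
import os

PROXY_VARS = (
    "NO_PROXY", "no_proxy",
    "http_proxy", "https_proxy",
    "HTTP_PROXY", "HTTPS_PROXY",
)

def clear_proxies(env=os.environ):
    """Remove proxy settings so node-to-node traffic goes direct.

    Returns the removed values, in case they need to be restored
    later for internet-facing requests that do require the proxy.
    """
    removed = {}
    for var in PROXY_VARS:
        if var in env:
            removed[var] = env.pop(var)
    return removed
```

Call `clear_proxies()` before constructing the Milvus client so the library never sees the proxy settings.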

Expected Output

A successful run of test_cluster.py produces output confirming that vectors were inserted and queried against the Milvus cluster:

row_count: 100
top-5: data: [[{'id': 0, 'distance': 1.0000001192092896, 'entity': {'tag': 'red', 'id': 0}},
                {'id': 14, 'distance': 0.9414785504341125, 'entity': {'tag': 'blue', 'id': 14}},
                {'id': 82, 'distance': 0.7108423113822937, 'entity': {'tag': 'blue', 'id': 82}},
                {'id': 74, 'distance': 0.549793004989624, 'entity': {'tag': 'blue', 'id': 74}},
                {'id': 47, 'distance': 0.4732419550418854, 'entity': {'tag': 'blue', 'id': 47}}]]
top ids: [0, 14, 82, 74, 47]

The output shows that 100 vectors were inserted into the collection and a similarity search returned the top 5 nearest neighbors with their distances and metadata tags.
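The distance values are consistent with a cosine similarity metric, where 1.0 means the vectors point in the same direction; whether test_cluster.py actually configures COSINE is an assumption here, but the arithmetic behind the top hit (id 0 at distance ~1.0, i.e. the query vector compared with itself) is easy to check with the stdlib:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A vector compared with itself scores 1.0 (up to float rounding),
# matching the top hit's distance of ~1.0000001 in the output above.
v = [0.3, -1.2, 0.7]
print(cosine_similarity(v, v))
```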

Restarting the Deployment

If you need to restart the Milvus deployment (for example, after a configuration change or a failed run), clean up the running processes and temporary files on both nodes before relaunching.

On the server node, stop the Milvus process and remove cached state:

pkill -f "milvus run"
rm -rf ./srv ./volumes
rm -f etcd.ip minio.ip

Also on the server node, stop the supporting etcd and MinIO services that the deployment script started:

pkill -f etcd; pkill -f minio

After cleanup, you can relaunch the server following the steps in Launching the Milvus Server.

Troubleshooting

Stale Configuration from a Previous Run

The deployment script may fail with an error similar to the following:

Replaced <ETCD_HOST> with 10.115.31.160 and <MINIO_HOST> with 10.115.31.160 in config/milvus.yaml
apptainer_deploy.sh: line 118: 159064 Aborted  apptainer exec --fakeroot ...

This typically occurs because milvus/multiNode/config/milvus.yaml was modified during a previous failed run and still contains hardcoded IP addresses instead of the expected placeholder values.

Fix

Open config/milvus.yaml and restore the placeholder tokens. Locate the etcd endpoint configuration and reset it:

endpoints: <ETCD_HOST>:2379

Similarly, locate the MinIO address configuration and reset it:

address: <MINIO_HOST>:9000
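If you would rather not hand-edit the file, the substitution can be reversed with a short script. This is a sketch under the assumption that only the etcd `endpoints` and MinIO `address` lines were rewritten, and it matches any IPv4 address in those positions:

```python
import re

def restore_placeholders(text):
    """Replace hardcoded IPs in the etcd/MinIO lines of milvus.yaml
    with the placeholder tokens the deployment script expects."""
    ipv4 = r"\d{1,3}(?:\.\d{1,3}){3}"
    text = re.sub(rf"(endpoints:\s*){ipv4}(:2379)", r"\1<ETCD_HOST>\2", text)
    text = re.sub(rf"(address:\s*){ipv4}(:9000)", r"\1<MINIO_HOST>\2", text)
    return text
```

Read config/milvus.yaml, pass its contents through `restore_placeholders`, and write the result back before rerunning apptainer_deploy.sh.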