ALCF Inference Endpoints¶
Unlock powerful AI inference at the Argonne Leadership Computing Facility (ALCF). This service provides API access to a variety of state-of-the-art open-source models running on dedicated ALCF hardware. Join our mailing list to receive updates and maintenance notifications.
Quick Start¶
This guide will walk you through the fastest ways to start using the ALCF Inference Endpoints.
Web UI¶
The easiest way to get started is through the web interface, accessible at https://inference.alcf.anl.gov/.
The UI is based on the popular Open WebUI platform. After logging in with your ANL or ALCF credentials, you can:
- Select a model from the dropdown menu at the top of the screen.
- Start a conversation directly in the chat interface.
In the model selection dropdown, you can see the status of each model:

- Live: These models are "hot" and ready for immediate use.
- Starting: A node has been acquired and the model is being loaded into memory.
- Queued: The model is in a queue waiting for resources to become available.
- Offline: The model is available but not currently loaded. It will be queued for loading when a user sends a request.
- All: Lists all available models regardless of their status.
For Advanced UI Features
For a full guide on advanced features like RAG (Retrieval-Augmented Generation), function calling, and more, please refer to the official Open WebUI documentation.
API Access¶
For programmatic access, you can use the API endpoints directly.
1. Setup Your Environment¶
You can run the following setup from anywhere (your local machine or an ALCF machine).

```bash
# Create a new Conda environment
conda create -n globus_env python=3.11.9 -y
conda activate globus_env

# Install the required packages
pip install openai globus_sdk
```
2. Authenticate¶
To access the endpoints, you need an authentication token.
```bash
# Download the authentication helper script
wget https://raw.githubusercontent.com/argonne-lcf/inference-endpoints/refs/heads/main/inference_auth_token.py

# Authenticate with your Globus account
python inference_auth_token.py authenticate
```
This will generate and store access and refresh tokens in your home directory. The helper script can also report how much time remains before your access token expires (in seconds, minutes, or hours).
Token Validity
- Access tokens are valid for 48 hours. The `get_access_token` command will automatically refresh your token if it has expired.
- An internal policy requires re-authentication every 7 days. If you encounter permission errors, log out from Globus at app.globus.org/logout and re-run `python inference_auth_token.py authenticate --force`.
3. Make a Test Call¶
Once authenticated, you can make a test call using cURL or Python.
Using cURL:

```bash
#!/bin/bash

# Get your access token
access_token=$(python inference_auth_token.py get_access_token)

curl -X POST "https://inference-api.alcf.anl.gov/resource_server/metis/api/v1/chat/completions" \
     -H "Authorization: Bearer ${access_token}" \
     -H "Content-Type: application/json" \
     -d '{
           "model": "Meta-Llama-3.1-8B-Instruct",
           "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}]
         }'
```
Using Python:

```python
from openai import OpenAI
from inference_auth_token import get_access_token

# Get your access token
access_token = get_access_token()

client = OpenAI(
    api_key=access_token,
    base_url="https://inference-api.alcf.anl.gov/resource_server/metis/api/v1"
)

response = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}]
)

print(response.choices[0].message.content)
```
System Details¶
Available Clusters¶
Two clusters are currently active, with additional systems coming soon:
| Cluster | Status | Framework | Base URL | Supported Endpoints |
|---|---|---|---|---|
| Sophia | Active | vLLM | /resource_server/sophia/vllm/v1 | /chat/completions, /completions, /embeddings, /batches |
| SambaNova SN40L (Metis) | Active | SambaNova API | /resource_server/metis/api/v1 | /chat/completions |
| Cerebras CS-3 | Coming Soon | - | - | - |
| GH200 Nvidia | Coming Soon | - | - | - |
Cluster Differences
- Sophia uses vLLM and supports the full range of OpenAI-compatible endpoints including chat, completions, embeddings, and batch processing.
- Metis uses SambaNova's inference API and currently supports only chat completions.
Discovering Available Models
You can programmatically query all available models and endpoints.
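A minimal sketch of one way to do this, assuming the Sophia vLLM deployment exposes the standard OpenAI-compatible model listing route (`GET /models`) at the base URL used throughout this guide:

```python
from openai import OpenAI
from inference_auth_token import get_access_token

access_token = get_access_token()

# Assumption: the Sophia vLLM deployment serves the OpenAI-compatible
# /models route alongside /chat/completions.
client = OpenAI(
    api_key=access_token,
    base_url="https://inference-api.alcf.anl.gov/resource_server/sophia/vllm/v1"
)

# Print the ID of every model the endpoint currently advertises
for model in client.models.list():
    print(model.id)
```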
API Usage Examples¶
Querying Endpoint Status¶
You can check the status of models on the cluster and list all available endpoints programmatically.
This endpoint provides information about what is currently live or queued.
```bash
#!/bin/bash

# Get your access token
access_token=$(python inference_auth_token.py get_access_token)

# Check Sophia cluster status
curl -X GET "https://inference-api.alcf.anl.gov/resource_server/sophia/jobs" \
     -H "Authorization: Bearer ${access_token}"

# Check Metis cluster status (replace 'sophia' with 'metis')
curl -X GET "https://inference-api.alcf.anl.gov/resource_server/metis/jobs" \
     -H "Authorization: Bearer ${access_token}"
```
Switching Between Clusters
Replace /sophia/ with /metis/ in the URL to query the Metis cluster instead.
Chat Completions¶
This endpoint is used for conversational AI.
Using cURL:

```bash
#!/bin/bash

access_token=$(python inference_auth_token.py get_access_token)

# Sophia cluster example
curl -X POST "https://inference-api.alcf.anl.gov/resource_server/sophia/vllm/v1/chat/completions" \
     -H "Authorization: Bearer ${access_token}" \
     -H "Content-Type: application/json" \
     -d '{
           "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
           "temperature": 0.2,
           "max_tokens": 150,
           "messages": [{"role": "user", "content": "What are the symptoms of diabetes?"}]
         }'

# Metis cluster example (replace '/sophia/vllm' with '/metis/api')
curl -X POST "https://inference-api.alcf.anl.gov/resource_server/metis/api/v1/chat/completions" \
     -H "Authorization: Bearer ${access_token}" \
     -H "Content-Type: application/json" \
     -d '{
           "model": "Meta-Llama-3.1-8B-Instruct",
           "temperature": 0.2,
           "max_tokens": 150,
           "messages": [{"role": "user", "content": "What are the symptoms of diabetes?"}]
         }'
```
Using Python:

```python
from openai import OpenAI
from inference_auth_token import get_access_token

access_token = get_access_token()

# Sophia cluster
client = OpenAI(
    api_key=access_token,
    base_url="https://inference-api.alcf.anl.gov/resource_server/sophia/vllm/v1"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What are the symptoms of diabetes?"}]
)
print(response.choices[0].message.content)

# Metis cluster (replace '/sophia/vllm' with '/metis/api')
client_metis = OpenAI(
    api_key=access_token,
    base_url="https://inference-api.alcf.anl.gov/resource_server/metis/api/v1"
)

response = client_metis.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What are the symptoms of diabetes?"}]
)
print(response.choices[0].message.content)
```
Switching Between Clusters
To target a different cluster, simply replace the cluster/framework portion of the URL:
- Sophia: `/resource_server/sophia/vllm/v1`
- Metis: `/resource_server/metis/api/v1`
Vision Language Models¶
Use this endpoint to analyze images with text prompts.
```python
import base64

from openai import OpenAI
from inference_auth_token import get_access_token

access_token = get_access_token()

client = OpenAI(
    api_key=access_token,
    base_url="https://inference-api.alcf.anl.gov/resource_server/sophia/vllm/v1"
)

def encode_image(image_path):
    """Read an image file and return its base64-encoded contents."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "scientific_diagram.png"  # Replace with your image
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the key components in this scientific diagram"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
            ]
        }
    ],
    max_tokens=300
)
print(response.choices[0].message.content)
```
Embeddings¶
This endpoint generates vector embeddings from text; embedding models are currently served by the infinity framework.
```python
from openai import OpenAI
from inference_auth_token import get_access_token

access_token = get_access_token()

client = OpenAI(
    api_key=access_token,
    base_url="https://inference-api.alcf.anl.gov/resource_server/sophia/vllm/v1"
)

response = client.embeddings.create(
    model="mistralai/Mistral-7B-Instruct-v0.3-embed",
    input="The food was delicious and the waiter...",
    encoding_format="float"
)
print(response.data[0].embedding)
```
For more examples, please see the inference-endpoints GitHub repository.
Available Models¶
Models are organized by cluster and marked with the following capabilities:
- B - Batch Processing Enabled
- T - Tool Calling Enabled
- R - Reasoning Enabled
- H - Always Hot Model
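To illustrate what "Tool Calling Enabled" (T) means in practice, here is a minimal sketch that passes an OpenAI-style tool definition to a T-marked model on Sophia. The `get_weather` schema is invented purely for demonstration, and the sketch assumes the vLLM deployment honors the standard `tools` parameter for these models.

```python
from openai import OpenAI
from inference_auth_token import get_access_token

access_token = get_access_token()

client = OpenAI(
    api_key=access_token,
    base_url="https://inference-api.alcf.anl.gov/resource_server/sophia/vllm/v1"
)

# Hypothetical tool schema, invented for illustration only
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # marked T below
    messages=[{"role": "user", "content": "What is the weather in Chicago?"}],
    tools=tools
)

# If the model elected to call the tool, the name and JSON arguments
# are returned instead of plain text content.
message = response.choices[0].message
if message.tool_calls:
    print(message.tool_calls[0].function.name)
    print(message.tool_calls[0].function.arguments)
else:
    print(message.content)
```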
Sophia Cluster (vLLM)¶
Chat Language Models
Qwen Family
- Qwen/Qwen2.5-7B-Instruct (B, T)
- Qwen/Qwen2.5-14B-Instruct (B, T)
- Qwen/QwQ-32B (B, R, T)
- Qwen/Qwen3-32B (B, R, T, H)
- Qwen/Qwen3-235B-A22B (T)
- Qwen/Qwen3-Next-80B-A3B-Instruct (T)
- Qwen/Qwen3-Next-80B-A3B-Thinking (R, T)
Meta Llama Family
- meta-llama/Meta-Llama-3.1-8B-Instruct (B, T, H)
- meta-llama/Meta-Llama-3.1-70B-Instruct (B, T, H)
- meta-llama/Meta-Llama-3.1-405B-Instruct (B, T)
- meta-llama/Llama-3.3-70B-Instruct (B, T)
- meta-llama/Llama-4-Scout-17B-16E-Instruct (B, T, H)
- meta-llama/Llama-4-Maverick-17B-128E-Instruct (T)
Mistral Family
- mistralai/Mistral-Large-Instruct-2407
- mistralai/Mixtral-8x22B-Instruct-v0.1
OpenAI Family
- openai/gpt-oss-20b (B, R, T, H)
- openai/gpt-oss-120b (B, R, T, H)
Aurora GPT Family
- argonne/AuroraGPT-IT-v4-0125 (B)
- argonne/AuroraGPT-Tulu3-SFT-0125 (B)
- argonne/AuroraGPT-DPO-UFB-0225 (B)
- argonne/AuroraGPT-KTO-UFB-0325 (B)
Other Models
- allenai/Llama-3.1-Tulu-3-405B
- google/gemma-3-27b-it (B, T, H)
- mgoin/Nemotron-4-340B-Instruct-hf
- zai-org/GLM-4.5-Air (T)
Vision Language Models
- Qwen/Qwen2-VL-72B-Instruct (T)
- Qwen/Qwen2.5-VL-72B-Instruct (T)
- meta-llama/Llama-3.2-90B-Vision-Instruct
Embedding Models
- mistralai/Mistral-7B-Instruct-v0.3-embed
- Qwen/Qwen3-Embedding-8B
- Salesforce/SFR-Embedding-Mistral
Metis Cluster (SambaNova)¶
Chat Language Models
- DeepSeek-R1 (R, H)
- Meta-Llama-3.1-8B-Instruct (H)
- Meta-Llama-3.3-70B-Instruct (H)
- Qwen2.5-Coder-0.5B-Instruct (H)
Metis Limitations
- Batch processing and tool calling are not currently supported on the Metis cluster.
- Only the chat completions endpoint is available.
Want to add a model?
To request a new model, please contact ALCF Support.
Batch Processing¶
For large-scale inference, the batch processing service allows you to submit a file with up to 150,000 requests.
Batch Processing Requirements
- You must have an active ALCF allocation.
- Input files and output folders must be located within the `/eagle/argonne_tpc` project space or a world-readable directory.
- Each line in the input file must be a complete JSON request object (JSON Lines format), as illustrated below.
- Only models marked with B support batch processing.
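For illustration only, a single input line might look like the following. This is a hypothetical sketch that assumes each line pairs an identifier with an OpenAI-style chat completions body; consult the inference-endpoints GitHub repository for the authoritative schema.

```json
{"custom_id": "request-1", "body": {"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize the following abstract."}], "max_tokens": 256}}
```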
Batch API Endpoints¶
Create Batch¶
Using cURL:

```bash
#!/bin/bash

# Get your access token
access_token=$(python inference_auth_token.py get_access_token)

# Define the base URL
base_url="https://inference-api.alcf.anl.gov/resource_server/sophia/vllm/v1/batches"

# Submit batch request
curl -X POST "$base_url" \
     -H "Authorization: Bearer ${access_token}" \
     -H "Content-Type: application/json" \
     -d '{
           "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
           "input_file": "/eagle/argonne_tpc/path/to/your/input.jsonl"
         }'

# Submit batch request with custom output folder
curl -X POST "$base_url" \
     -H "Authorization: Bearer ${access_token}" \
     -H "Content-Type: application/json" \
     -d '{
           "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
           "input_file": "/eagle/argonne_tpc/path/to/your/input.jsonl",
           "output_folder_path": "/eagle/argonne_tpc/path/to/your/output/folder/"
         }'
```
Using Python:

```python
import requests
from inference_auth_token import get_access_token

# Get your access token
access_token = get_access_token()

# Define headers and URL
headers = {
    'Authorization': f'Bearer {access_token}',
    'Content-Type': 'application/json'
}
url = "https://inference-api.alcf.anl.gov/resource_server/sophia/vllm/v1/batches"

# Submit batch request
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "input_file": "/eagle/argonne_tpc/path/to/your/input.jsonl",
    "output_folder_path": "/eagle/argonne_tpc/path/to/your/output/folder/"
}
response = requests.post(url, headers=headers, json=data)
print(response.json())
```
Retrieve Batch¶
```python
import requests
from inference_auth_token import get_access_token

# Get your access token
access_token = get_access_token()

# Define headers and URL
headers = {'Authorization': f'Bearer {access_token}'}
batch_id = "your-batch-id"
url = f"https://inference-api.alcf.anl.gov/resource_server/v1/batches/{batch_id}/result"

# Get batch results
response = requests.get(url, headers=headers)
print(response.json())
```
Sample Output:
```json
{
  "results_file": "/eagle/argonne_tpc/path/to/your/output/folder/<input-file-name>_<model>_<batch-id>/<input-file-name>_<timestamp>.results.jsonl",
  "progress_file": "/eagle/argonne_tpc/path/to/your/output/folder/<input-file-name>_<model>_<batch-id>/<input-file-name>_<timestamp>.progress.json",
  "metrics": {
    "response_time": 27837.440138816833,
    "throughput_tokens_per_second": 3899.833442250346,
    "total_tokens": 108561380,
    "num_responses": 99985,
    "lines_processed": 100000
  }
}
```
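Since the results file is JSON Lines, each line can be parsed independently. Here is a minimal sketch for loading it (the path below is a hypothetical stand-in for the `results_file` value returned above):

```python
import json

# Hypothetical path; substitute the "results_file" value from your batch result
results_path = "/eagle/argonne_tpc/path/to/your/output/folder/results.jsonl"

with open(results_path) as f:
    results = [json.loads(line) for line in f]

print(f"Loaded {len(results)} responses")
```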
List Batch¶
Using cURL:

```bash
#!/bin/bash

# Get your access token
access_token=$(python inference_auth_token.py get_access_token)

# List all batches
curl -X GET "https://inference-api.alcf.anl.gov/resource_server/v1/batches" \
     -H "Authorization: Bearer ${access_token}"

# Optionally filter by status (pending, running, completed, or failed)
curl -X GET "https://inference-api.alcf.anl.gov/resource_server/v1/batches?status=completed" \
     -H "Authorization: Bearer ${access_token}"
```
Using Python:

```python
import requests
from inference_auth_token import get_access_token

# Get your access token
access_token = get_access_token()

# Define headers and URL
headers = {'Authorization': f'Bearer {access_token}'}
url = "https://inference-api.alcf.anl.gov/resource_server/v1/batches"

# List all batches
response = requests.get(url, headers=headers)
print(response.json())

# Optionally filter by status (pending, running, completed, or failed)
params = {'status': 'completed'}
response = requests.get(url, headers=headers, params=params)
print(response.json())
```
Sample Output:
```json
[
  {
    "batch_id": "f8fa8efd-1111-476d-a0a0-111111111111",
    "cluster": "sophia",
    "created_at": "2025-02-20 18:39:58.049584+00:00",
    "framework": "vllm",
    "input_file": "/eagle/argonne_tpc/path/to/your/output/folder/chunk_a.jsonl",
    "status": "pending"
  },
  {
    "batch_id": "4b8a31b8-2222-479f-8c8c-222222222222",
    "cluster": "sophia",
    "created_at": "2025-02-20 18:40:30.882414+00:00",
    "framework": "vllm",
    "input_file": "/eagle/argonne_tpc/path/to/your/output/folder/chunk_b.jsonl",
    "status": "pending"
  }
]
```
Batch Status¶
```python
import requests
from inference_auth_token import get_access_token

# Get your access token
access_token = get_access_token()

# Define headers and URL
headers = {'Authorization': f'Bearer {access_token}'}
batch_id = "your-batch-id"
url = f"https://inference-api.alcf.anl.gov/resource_server/v1/batches/{batch_id}"

# Get batch status
response = requests.get(url, headers=headers)
print(response.json())
```
Batch Status Codes:
- pending: The request was submitted, but the job has not started yet.
- running: The job is currently running on a compute node.
- failed: An error occurred; the error message will be displayed when querying the result.
- completed: The job finished successfully. :tada:
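Putting the status endpoint to work, here is a minimal polling sketch. It assumes the single-batch status response carries a `status` field like the list output shown above; the batch ID and polling interval are illustrative.

```python
import time

import requests
from inference_auth_token import get_access_token

access_token = get_access_token()
headers = {'Authorization': f'Bearer {access_token}'}

batch_id = "your-batch-id"  # replace with a real batch ID
url = f"https://inference-api.alcf.anl.gov/resource_server/v1/batches/{batch_id}"

# Poll until the batch reaches a terminal state
while True:
    status = requests.get(url, headers=headers).json().get("status")
    print(f"Batch status: {status}")
    if status in ("completed", "failed"):
        break
    time.sleep(60)  # illustrative interval; large batches can run for hours
```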
Cancel Batch¶
The inference team is currently developing a mechanism for users to cancel submitted batches. In the meantime, please contact us with your batch_id if you have a batch to cancel.
Performance and Wait Times¶
- Cold Starts: The first query to an inactive model on Sophia may take 10-15 minutes while the model loads.
- Queueing: During high demand, your request may be queued until resources are available.
- Payload Limits: Each request payload is limited to 10 MB and is further constrained by the model's context window.
On Sophia, of the 10 nodes reserved for inference, 5 are dedicated to serving popular models "hot" for immediate access. The remaining 5 rotate through other models based on user requests. These dynamically loaded models remain active for up to 24 hours and are unloaded after 2 hours of inactivity.
Important Notes¶
- If you’re interested in extended model runtimes, reservations, or private model deployments, please contact ALCF Support.
Troubleshooting¶
- Connection Timeout: The model you are requesting may be queued because the cluster has too many pending jobs. You can check model status by querying the `/jobs` endpoint; see Querying Endpoint Status for an example.
- Permission Denied: Your token may have expired. Log out from Globus at app.globus.org/logout and re-authenticate using the `--force` flag.
- Batch Permission Error: Ensure your input/output paths are in a readable location such as `/eagle/argonne_tpc`. The batch service is currently internal to ALCF and will be made public in the future.
- IdentityMismatchError ("Detected a change in identity"): This occurs when you request an access token with a Globus identity that is not linked to the one you previously used to generate your tokens. Locate your tokens file (typically at `~/.globus/app/58fdd3bc-e1c3-4ce5-80ea-8d6b87cfb944/inference_app/tokens.json`), delete it, and restart the authentication process.
Notifications¶
To receive notifications about new model support, maintenance windows, and policy updates, please join our mailing list.
Contact Us¶
For questions or support, please contact ALCF Support.