Running Local Models with vLLM
Note
This section describes how to set up and run local language models using the vLLM inference server.
Inference Backend Setup (Remote/Local)
Virtual Python Environment
All instructions below must be executed within a Python virtual environment. Ensure the virtual environment uses the same Python version as your project (e.g., Python 3.11).
Example 1: Using conda
conda create -n vllm-env python=3.11
conda activate vllm-env
Example 2: Using python venv
python3.11 -m venv vllm-env
source vllm-env/bin/activate # On Windows use `vllm-env\Scripts\activate`
Install Inference Server (vLLM)
vLLM is recommended for serving many transformer models efficiently.
Basic vLLM installation from source: Make sure your virtual environment is activated.
# Ensure git is installed
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
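After the editable install finishes, it can help to confirm that the package imports from the active environment. The check below is a minimal sketch: it assumes `python` on your PATH is the interpreter of the activated virtual environment, and the `VLLM_VERSION` variable name is only illustrative.

```shell
# Print the installed vLLM version, or a fallback message if the import fails.
# Assumption: `python` resolves to the interpreter of the activated environment.
VLLM_VERSION="$(python -c 'import vllm; print(vllm.__version__)' 2>/dev/null || echo 'not installed')"
echo "vLLM version: ${VLLM_VERSION}"
```

If this prints "not installed", the virtual environment is probably not activated or the install step did not complete.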
Running the vLLM Server (Standalone)
A script is provided at scripts/run_vllm_server.sh to help start a vLLM server with features like logging, retry attempts, and timeout. This is intended for running vLLM as a separate standalone service (for example, on a machine with GPU access).
Before running the script, ensure your vLLM Python virtual environment is activated:
# Example: if you used conda
# conda activate vllm-env
# Example: if you used python venv
# source path/to/your/vllm-env/bin/activate
To run the script, pass any of the following arguments:
- [MODEL_IDENTIFIER] (optional): The Hugging Face model identifier. Defaults to facebook/opt-125m.
- [PORT] (optional): The port for the vLLM server. Defaults to 8001.
- [MAX_MODEL_LENGTH] (optional): The maximum model length. Defaults to 4096.
Example:
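The invocation below is a sketch rather than the script's documented usage. It assumes the arguments are positional and follow the order in which they are listed above (model identifier, port, maximum model length), and that the command is run from the project root. The values shown are simply the stated defaults.

```shell
# Assumption: arguments are positional, in the order
# MODEL_IDENTIFIER, PORT, MAX_MODEL_LENGTH.
MODEL_IDENTIFIER="facebook/opt-125m"
PORT=8001
MAX_MODEL_LENGTH=4096
SCRIPT="scripts/run_vllm_server.sh"

# Only call the script if it exists relative to the current directory.
if [ -f "$SCRIPT" ]; then
  bash "$SCRIPT" "$MODEL_IDENTIFIER" "$PORT" "$MAX_MODEL_LENGTH"
else
  echo "Script not found at $SCRIPT; run this from the project root." >&2
fi
```

Because all three arguments are optional, running the script with no arguments should be equivalent to the call above.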
Important Note on Gated Models (e.g., Llama 3):
Many models, such as those from the Llama family by Meta, are gated and require you to accept their terms of use on Hugging Face and use an access token for download.
To use such models with vLLM:
- Hugging Face Account and Token: Ensure you have a Hugging Face account and have generated an access token with read permissions. You can find this in your Hugging Face account settings under "Access Tokens".
- Accept Model License: Navigate to the Hugging Face page of the specific model you want to use (e.g., meta-llama/Meta-Llama-3-8B-Instruct) and accept its license/terms if prompted.
- Environment Variables: Before running the vLLM server, set your Hugging Face access token as an environment variable in your terminal session or environment configuration (e.g., .bashrc, .zshrc). vLLM will use it to authenticate with Hugging Face and download the model weights.
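The exact variable names from the original instructions are not reproduced here, so the snippet below is an assumption-based sketch. It uses `HF_TOKEN`, which current `huggingface_hub` releases read, and also sets the older `HUGGING_FACE_HUB_TOKEN` name for tooling that still expects it. The token value is a placeholder.

```shell
# Assumption: HF_TOKEN is the variable your installed huggingface_hub reads.
# Keep an existing token if one is already set; otherwise use a placeholder
# that you must replace with your own read-scoped token.
export HF_TOKEN="${HF_TOKEN:-hf_your_token_here}"

# Older tooling looked for this name instead; mirror the same value.
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
```

Set these in the same shell session that launches the server, since a background process inherits its environment at start time.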
The script will:
- Attempt to start the vLLM OpenAI-compatible API server.
- Log output to a file in the logs/ directory (created if it doesn't exist at the project root).
- Run the server in the background via nohup.
This standalone script is the recommended approach for users who manage their own vLLM instances directly.
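Since the server runs in the background, a quick request is a simple way to confirm it came up. The sketch below assumes the server is on the local machine and uses the default port 8001. It queries the `/v1/models` route of the OpenAI-compatible API, and the `HOST`, `PORT`, and `BASE_URL` names are illustrative.

```shell
# Assumptions: the server runs on this machine and uses the default port 8001.
HOST="localhost"
PORT=8001
BASE_URL="http://${HOST}:${PORT}/v1"

# List the models being served; report clearly if the server is not up yet.
if curl --silent --fail --max-time 5 "${BASE_URL}/models"; then
  echo
  echo "vLLM server is reachable at ${BASE_URL}"
else
  echo "No response from ${BASE_URL}; check the newest file in logs/ for startup errors." >&2
fi
```

Model loading can take a while, so a failed first attempt does not necessarily mean the server crashed. Retry after checking the log output.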