# Running Local Models with vLLM
> Note: This section describes how to set up and run local language models using the vLLM inference server.
## Inference Backend Setup (Remote/Local)
### Virtual Python Environment
All instructions below must be executed within a Python virtual environment. Ensure the virtual environment uses the same Python version as your project (e.g., Python 3.11).
Example 1: Using conda
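The conda commands are not preserved in this copy; a minimal sketch, assuming the environment name `vllm-env` used later in this section:

```bash
# Create and activate a conda environment matching the project's Python version
conda create -n vllm-env python=3.11
conda activate vllm-env
```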
Example 2: Using python venv
```bash
python3.11 -m venv vllm-env
source vllm-env/bin/activate  # On Windows use `vllm-env\Scripts\activate`
```
### Install Inference Server (vLLM)
vLLM is recommended for serving many transformer models efficiently.
Basic vLLM installation from source (make sure your virtual environment is activated):
```bash
# Ensure git is installed
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```
### Running the vLLM Server (Standalone)
A script is provided at `scripts/run_vllm_server.sh` to help start a vLLM server with features like logging, retry attempts, and timeouts. This is useful for running vLLM outside of Docker Compose, for example, directly on a machine with GPU access.
Before running the script:

1. Ensure your vLLM Python virtual environment is activated.
   ```bash
   # Example: if you used conda
   # conda activate vllm-env

   # Example: if you used python venv
   # source path/to/your/vllm-env/bin/activate
   ```
To run the script:
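The exact invocation is not shown in this copy; given the script path above and the positional arguments described below, a usage sketch is:

```bash
./scripts/run_vllm_server.sh [MODEL_IDENTIFIER] [PORT] [MAX_MODEL_LENGTH]
```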
- `[MODEL_IDENTIFIER]` (optional): The Hugging Face model identifier. Defaults to `facebook/opt-125m`.
- `[PORT]` (optional): The port for the vLLM server. Defaults to `8001`.
- `[MAX_MODEL_LENGTH]` (optional): The maximum model length. Defaults to `4096`.
Example:
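The original example command is not preserved here; a plausible invocation, assuming the positional argument order above:

```bash
./scripts/run_vllm_server.sh meta-llama/Meta-Llama-3-8B-Instruct 8001 4096
```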
Important Note on Gated Models (e.g., Llama 3):
- Many models, such as those from the Llama family by Meta, are gated and require you to accept their terms of use on Hugging Face and use an access token for download.
- To use such models with vLLM (either via the script or Docker Compose):
  - Hugging Face Account and Token: Ensure you have a Hugging Face account and have generated an access token with `read` permissions. You can find this in your Hugging Face account settings under "Access Tokens".
  - Accept Model License: Navigate to the Hugging Face page of the specific model you want to use (e.g., `meta-llama/Meta-Llama-3-8B-Instruct`) and accept its license/terms if prompted.
  - Environment Variables: Before running the vLLM server (either via the script or `docker-compose up`), you need to set the following environment variables in your terminal session or within your environment configuration (e.g., `.bashrc`, `.zshrc`, or by passing them to Docker Compose if applicable). vLLM will use these environment variables to authenticate with Hugging Face and download the model weights.
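    A sketch of the export, assuming the standard Hugging Face token variable read by the `huggingface_hub` library (the exact variable names the script expects are not preserved in this copy):

    ```bash
    # Assumption: vLLM/huggingface_hub pick up the token from this variable.
    # Replace the placeholder with your own Hugging Face access token.
    export HUGGING_FACE_HUB_TOKEN="hf_your_token_here"
    ```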
- The script will:
  - Attempt to start the vLLM OpenAI-compatible API server.
  - Log output to a file in the `logs/` directory (created if it doesn't exist at the project root).
  - Run the server in the background via `nohup`.
- This standalone script is an alternative to running vLLM via Docker Compose and is primarily for users who manage their vLLM instances directly.
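Once the server is up, you can sanity-check the OpenAI-compatible endpoint; a minimal check, assuming the default port `8001` described above:

```bash
# List the models served by the running vLLM instance via the OpenAI-compatible API.
curl http://localhost:8001/v1/models
```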