vLLM app
vLLM is a high-performance library for serving large language models (LLMs). Flyte provides VLLMAppEnvironment for deploying vLLM model servers.
Installation
First, install the vLLM plugin:
pip install --pre flyteplugins-vllm
Basic vLLM app
Here’s a simple example serving a HuggingFace model:
"""A simple vLLM app example."""
from flyteplugins.vllm import VLLMAppEnvironment
import flyte
vllm_app = VLLMAppEnvironment(
name="my-llm-app",
model_hf_path="Qwen/Qwen3-0.6B", # HuggingFace model path
model_id="qwen3-0.6b", # Model ID exposed by vLLM
resources=flyte.Resources(
cpu="4",
memory="16Gi",
gpu="L40s:1", # GPU required for LLM serving
disk="10Gi",
),
scaling=flyte.app.Scaling(
replicas=(0, 1),
scaledown_after=300, # Scale down after 5 minutes of inactivity
),
requires_auth=False,
)
if __name__ == "__main__":
flyte.init_from_config()
app = flyte.serve(vllm_app)
print(f"Deployed vLLM app: {app.url}")
Using prefetched models
You can use models prefetched with flyte.prefetch:
"""vLLM app using prefetched models."""
from flyteplugins.vllm import VLLMAppEnvironment
import flyte
# Base environment; the prefetched model is swapped in via clone_with below
vllm_app = VLLMAppEnvironment(
    name="my-llm-app",
    model_hf_path="Qwen/Qwen3-0.6B",  # placeholder, replaced with the prefetched model
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    stream_model=True,  # Stream model directly from blob store to GPU
    requires_auth=False,
)

if __name__ == "__main__":
    flyte.init_from_config()

    # Prefetch the model first
    run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
    run.wait()

    # Use the prefetched model
    app = flyte.serve(
        vllm_app.clone_with(
            vllm_app.name,
            model_hf_path=None,
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
        )
    )
    print(f"Deployed vLLM app: {app.url}")
Model streaming
VLLMAppEnvironment supports streaming models directly from blob storage to GPU memory, reducing startup time.
When stream_model=True and model_path is set to either a flyte.io.Dir or a RunOutput pointing to a path in object storage:
- Model weights stream directly from storage to GPU
- Faster startup time (no full download required)
- Lower disk space requirements
The contents of the model directory must be compatible with the vLLM-supported formats, e.g. the HuggingFace model serialization format.
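If a model directory already lives in object storage, you can point model_path at it directly. The sketch below is illustrative: the bucket path is a placeholder, and constructing the reference with flyte.io.Dir.from_existing_remote is an assumption, so check the flyte.io API for the exact constructor:

from flyte.io import Dir
from flyteplugins.vllm import VLLMAppEnvironment

import flyte

vllm_app = VLLMAppEnvironment(
    name="streamed-llm-app",
    # Hypothetical bucket path holding a HuggingFace-format model directory;
    # from_existing_remote is an assumed helper -- verify against the flyte.io docs
    model_path=Dir.from_existing_remote("s3://my-bucket/models/qwen3-0.6b"),
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    stream_model=True,  # Weights stream from blob storage to GPU at startup
    requires_auth=False,
)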
Custom vLLM arguments
Use extra_args to pass additional arguments to vLLM:
vllm_app = VLLMAppEnvironment(
    name="custom-vllm-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    extra_args=[
        "--max-model-len", "8192",  # Maximum context length
        "--gpu-memory-utilization", "0.8",  # GPU memory utilization
        "--trust-remote-code",  # Trust remote code in models
    ],
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    # ...
)

See the vLLM documentation for all available arguments.
Using the OpenAI-compatible API
Once deployed, your vLLM app exposes an OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(
    base_url="https://your-app-url/v1",  # vLLM endpoint
    api_key="your-api-key",  # If you passed an --api-key argument
)

response = client.chat.completions.create(
    model="qwen3-0.6b",  # Your model_id
    messages=[
        {"role": "user", "content": "Hello, how are you?"},
    ],
)
print(response.choices[0].message.content)

If you passed an --api-key argument, use the api_key parameter to authenticate your requests. See the app documentation for more details on how to pass auth secrets to your app.
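The same endpoint also supports streamed responses. Continuing with the client above (same placeholder URL and model_id), a minimal sketch using the standard OpenAI Python client:

stream = client.chat.completions.create(
    model="qwen3-0.6b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,  # Receive tokens incrementally as they are generated
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)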
Multi-GPU inference (Tensor Parallelism)
For larger models, use multiple GPUs with tensor parallelism:
"""vLLM app with multi-GPU tensor parallelism."""
from flyteplugins.vllm import VLLMAppEnvironment
import flyte
vllm_app = VLLMAppEnvironment(
    name="multi-gpu-llm-app",
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(
        cpu="8",
        memory="32Gi",
        gpu="L40s:4",  # 4 GPUs for tensor parallelism
        disk="100Gi",
    ),
    extra_args=[
        "--tensor-parallel-size", "4",  # Use 4 GPUs
        "--max-model-len", "4096",
        "--gpu-memory-utilization", "0.9",
    ],
    requires_auth=False,
)
The --tensor-parallel-size value should match the number of GPUs specified in resources; one way to keep the two in sync is to derive both from a single constant, as sketched below.
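For example (illustrative only; NUM_GPUS is a local constant, not a plugin parameter):

NUM_GPUS = 4  # single source of truth for the GPU count

vllm_app = VLLMAppEnvironment(
    name="multi-gpu-llm-app",
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(
        cpu="8",
        memory="32Gi",
        gpu=f"L40s:{NUM_GPUS}",  # GPU request derived from the constant
        disk="100Gi",
    ),
    extra_args=["--tensor-parallel-size", str(NUM_GPUS), "--max-model-len", "4096"],
    requires_auth=False,
)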
Model sharding with prefetch
You can prefetch and shard models for multi-GPU inference:
# Prefetch with sharding configuration
run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-70b-hf",
    accelerator="L40s:4",
    shard_config=flyte.prefetch.ShardConfig(
        engine="vllm",
        args=flyte.prefetch.VLLMShardArgs(
            tensor_parallel_size=4,
            dtype="auto",
            trust_remote_code=True,
        ),
    ),
)
run.wait()

# Use the sharded model
vllm_app = VLLMAppEnvironment(
    name="sharded-llm-app",
    model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
    model_id="llama-2-70b",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4", disk="100Gi"),
    extra_args=["--tensor-parallel-size", "4"],
    stream_model=True,
)

See Prefetching models for more details on sharding.
Autoscaling
vLLM apps work well with autoscaling:
vllm_app = VLLMAppEnvironment(
    name="autoscaling-llm-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Scale to zero when idle
        scaledown_after=600,  # 10 minutes idle before scaling down
    ),
    # ...
)

Best practices
- Use prefetching: Prefetch models for faster deployment and better reproducibility
- Enable streaming: Use stream_model=True to reduce startup time and disk usage
- Right-size GPUs: Match GPU memory to model size
- Configure memory utilization: Use --gpu-memory-utilization to control memory usage
- Use tensor parallelism: For large models, use multiple GPUs with --tensor-parallel-size
- Set autoscaling: Use an appropriate scaledown_after to balance cost and performance
- Limit context length: Lower --max-model-len to reduce memory usage when you don't need long contexts
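A sketch that puts several of these practices together, reusing only APIs shown earlier on this page (the app name and values are illustrative):

# Prefetch once for reproducible deployments
run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
run.wait()

vllm_app = VLLMAppEnvironment(
    name="tuned-llm-app",
    model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    stream_model=True,  # Stream weights from blob storage to GPU
    extra_args=[
        "--max-model-len", "4096",  # Cap context length to save memory
        "--gpu-memory-utilization", "0.9",
    ],
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Scale to zero when idle
        scaledown_after=600,
    ),
)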
Troubleshooting
Model loading fails:
- Verify GPU memory is sufficient for the model
- Check that the model path or HuggingFace path is correct
- Review container logs for detailed error messages
Out of memory errors:
- Reduce --max-model-len
- Lower --gpu-memory-utilization
- Use a smaller model or more GPUs
Slow startup:
- Enable stream_model=True for faster loading
- Prefetch models before deployment
- Use faster storage backends