vLLM app

vLLM is a high-throughput, memory-efficient inference engine for serving large language models (LLMs). Flyte provides VLLMAppEnvironment for deploying vLLM model servers as Flyte apps.

Installation

First, install the vLLM plugin:

pip install --pre flyteplugins-vllm

Basic vLLM app

Here’s a simple example serving a HuggingFace model:

basic_vllm.py
"""A simple vLLM app example."""

from flyteplugins.vllm import VLLMAppEnvironment
import flyte

vllm_app = VLLMAppEnvironment(
    name="my-llm-app",
    model_hf_path="Qwen/Qwen3-0.6B",  # HuggingFace model path
    model_id="qwen3-0.6b",  # Model ID exposed by vLLM
    resources=flyte.Resources(
        cpu="4",
        memory="16Gi",
        gpu="L40s:1",  # GPU required for LLM serving
        disk="10Gi",
    ),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),
        scaledown_after=300,  # Scale down after 5 minutes of inactivity
    ),
    requires_auth=False,
)

if __name__ == "__main__":
    flyte.init_from_config()
    app = flyte.serve(vllm_app)
    print(f"Deployed vLLM app: {app.url}")

Using prefetched models

You can use models prefetched with flyte.prefetch:

vllm_with_prefetch.py
"""vLLM app using prefetched models."""

from flyteplugins.vllm import VLLMAppEnvironment
import flyte

# Define the app; the prefetched model path is supplied via clone_with in __main__ below
vllm_app = VLLMAppEnvironment(
    name="my-llm-app",
    model_hf_path="Qwen/Qwen3-0.6B",  # placeholder; overridden with the prefetched model below
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    stream_model=True,  # Stream model directly from blob store to GPU
    requires_auth=False,
)

if __name__ == "__main__":
    flyte.init_from_config()

    # Prefetch the model first
    run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
    run.wait()

    # Use the prefetched model
    app = flyte.serve(
        vllm_app.clone_with(
            vllm_app.name,
            model_hf_path=None,
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
        )
    )
    print(f"Deployed vLLM app: {app.url}")

Model streaming

VLLMAppEnvironment supports streaming models directly from blob storage to GPU memory, reducing startup time. When stream_model=True and model_path is set to either a flyte.io.Dir or a RunOutput pointing to a path in the object store:

  • Model weights stream directly from storage to GPU
  • Faster startup time (no full download required)
  • Lower disk space requirements

The contents of the model directory must be in a format that vLLM supports, e.g. the HuggingFace model serialization format.
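
As a minimal sketch, the app below streams weights from a directory that already exists in blob storage. The bucket path is hypothetical, and the Dir.from_existing_remote constructor is an assumption about the flyte.io API; check the flyte.io reference for the exact call.

from flyteplugins.vllm import VLLMAppEnvironment
import flyte

vllm_app = VLLMAppEnvironment(
    name="streamed-llm-app",
    # Hypothetical object-store path holding HuggingFace-format weights
    model_path=flyte.io.Dir.from_existing_remote("s3://my-bucket/models/qwen3-0.6b"),
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    stream_model=True,  # Stream weights from blob storage to GPU instead of downloading to disk
    requires_auth=False,
)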

Custom vLLM arguments

Use extra_args to pass additional arguments to vLLM:

vllm_app = VLLMAppEnvironment(
    name="custom-vllm-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    extra_args=[
        "--max-model-len", "8192",  # Maximum context length
        "--gpu-memory-utilization", "0.8",  # GPU memory utilization
        "--trust-remote-code",  # Trust remote code in models
    ],
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    # ...
)

See the vLLM documentation for all available arguments.

Using the OpenAI-compatible API

Once deployed, your vLLM app exposes an OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-app-url/v1",  # vLLM endpoint
    api_key="your-api-key",  # If you passed an --api-key argument
)

response = client.chat.completions.create(
    model="qwen3-0.6b",  # Your model_id
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)

print(response.choices[0].message.content)

If you started the server with an --api-key argument (for example via extra_args), pass the same key through the client's api_key parameter to authenticate your requests. See the documentation on passing auth secrets to apps for more details.
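
The endpoint also supports token streaming through the standard OpenAI client. The sketch below reuses the client and model_id from the example above:

# Stream tokens as they are generated instead of waiting for the full response
stream = client.chat.completions.create(
    model="qwen3-0.6b",  # Your model_id
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # Role/finish chunks carry no content
        print(delta, end="", flush=True)
print()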

Multi-GPU inference (Tensor Parallelism)

For larger models, use multiple GPUs with tensor parallelism:

vllm_multi_gpu.py
"""vLLM app with multi-GPU tensor parallelism."""

from flyteplugins.vllm import VLLMAppEnvironment
import flyte

vllm_app = VLLMAppEnvironment(
    name="multi-gpu-llm-app",
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(
        cpu="8",
        memory="32Gi",
        gpu="L40s:4",  # 4 GPUs for tensor parallelism
        disk="100Gi",
    ),
    extra_args=[
        "--tensor-parallel-size", "4",  # Use 4 GPUs
        "--max-model-len", "4096",
        "--gpu-memory-utilization", "0.9",
    ],
    requires_auth=False,
)

The --tensor-parallel-size value should match the number of GPUs requested in resources.
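
One way to keep the two in sync is to derive both from a single constant, as in this sketch (values mirror the example above):

# Define the GPU count once and reuse it for the resource request and the vLLM flag
NUM_GPUS = 4

vllm_app = VLLMAppEnvironment(
    name="multi-gpu-llm-app",
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu=f"L40s:{NUM_GPUS}", disk="100Gi"),
    extra_args=["--tensor-parallel-size", str(NUM_GPUS), "--max-model-len", "4096"],
    requires_auth=False,
)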

Model sharding with prefetch

You can prefetch and shard models for multi-GPU inference:

# Prefetch with sharding configuration
run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-70b-hf",
    accelerator="L40s:4",
    shard_config=flyte.prefetch.ShardConfig(
        engine="vllm",
        args=flyte.prefetch.VLLMShardArgs(
            tensor_parallel_size=4,
            dtype="auto",
            trust_remote_code=True,
        ),
    ),
)
run.wait()

# Use the sharded model
vllm_app = VLLMAppEnvironment(
    name="sharded-llm-app",
    model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
    model_id="llama-2-70b",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4", disk="100Gi"),
    extra_args=["--tensor-parallel-size", "4"],
    stream_model=True,
)

See Prefetching models for more details on sharding.

Autoscaling

vLLM apps support autoscaling, including scaling to zero when idle:

vllm_app = VLLMAppEnvironment(
    name="autoscaling-llm-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Scale to zero when idle
        scaledown_after=600,  # 10 minutes idle before scaling down
    ),
    # ...
)

Best practices

  1. Use prefetching: Prefetch models for faster deployment and better reproducibility
  2. Enable streaming: Use stream_model=True to reduce startup time and disk usage
  3. Right-size GPUs: Match GPU memory to model size
  4. Configure memory utilization: Use --gpu-memory-utilization to control memory usage
  5. Use tensor parallelism: For large models, use multiple GPUs with tensor-parallel-size
  6. Set autoscaling: Use appropriate idle TTL to balance cost and performance
  7. Limit context length: Set --max-model-len to a smaller value when you do not need long contexts, reducing memory usage (several of these practices are combined in the sketch below)
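
The following sketch combines several of these practices; the exact resource sizes and argument values are illustrative and should be tuned for your model:

from flyteplugins.vllm import VLLMAppEnvironment
import flyte

vllm_app = VLLMAppEnvironment(
    name="tuned-llm-app",
    model_hf_path="Qwen/Qwen3-0.6B",  # placeholder; overridden with the prefetched model below
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    extra_args=[
        "--max-model-len", "8192",  # Limit context length to bound KV-cache memory
        "--gpu-memory-utilization", "0.9",  # Leave a little headroom on the GPU
    ],
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Scale to zero when idle
        scaledown_after=600,  # 10 minutes of inactivity before scaling down
    ),
    stream_model=True,  # Stream weights from blob storage instead of downloading to disk
    requires_auth=False,
)

if __name__ == "__main__":
    flyte.init_from_config()

    # Prefetch once for faster, reproducible deployments
    run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
    run.wait()

    app = flyte.serve(
        vllm_app.clone_with(
            vllm_app.name,
            model_hf_path=None,
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
        )
    )
    print(f"Deployed vLLM app: {app.url}")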

Troubleshooting

Model loading fails:

  • Verify GPU memory is sufficient for the model
  • Check that the model path or HuggingFace path is correct
  • Review container logs for detailed error messages

Out of memory errors:

  • Reduce --max-model-len
  • Lower --gpu-memory-utilization
  • Use a smaller model or more GPUs

Slow startup:

  • Enable stream_model=True for faster loading
  • Prefetch models before deployment
  • Use faster storage backends