vLLM app

vLLM is a high-throughput, memory-efficient inference engine for serving large language models (LLMs). Flyte provides VLLMAppEnvironment for deploying vLLM model servers as Flyte apps.

Installation

First, install the vLLM plugin:

pip install --pre flyteplugins-vllm

Basic vLLM app

Here’s a simple example serving a HuggingFace model:

basic_vllm.py
"""A simple vLLM app example."""

from flyteplugins.vllm import VLLMAppEnvironment
import flyte

vllm_app = VLLMAppEnvironment(
    name="my-llm-app",
    model_hf_path="Qwen/Qwen3-0.6B",  # HuggingFace model path
    model_id="qwen3-0.6b",  # Model ID exposed by vLLM
    resources=flyte.Resources(
        cpu="4",
        memory="16Gi",
        gpu="L40s:1",  # GPU required for LLM serving
        disk="10Gi",
    ),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),
        scaledown_after=300,  # Scale down after 5 minutes of inactivity
    ),
    requires_auth=False,
)

if __name__ == "__main__":
    flyte.init_from_config()
    app = flyte.serve(vllm_app)
    print(f"Deployed vLLM app: {app.url}")

Using prefetched models

You can use models prefetched with flyte.prefetch:

vllm_with_prefetch.py
"""vLLM app using prefetched models."""

from flyteplugins.vllm import VLLMAppEnvironment
import flyte

# Define the app; the prefetched model path is supplied via clone_with in __main__ below
vllm_app = VLLMAppEnvironment(
    name="my-llm-app",
    model_hf_path="Qwen/Qwen3-0.6B",  # placeholder; overridden with the prefetched model below
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    stream_model=True,  # Stream model directly from blob store to GPU
    requires_auth=False,
)

if __name__ == "__main__":
    flyte.init_from_config()

    # Prefetch the model first
    run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
    run.wait()

    # Use the prefetched model
    app = flyte.serve(
        vllm_app.clone_with(
            vllm_app.name,
            model_hf_path=None,
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
        )
    )
    print(f"Deployed vLLM app: {app.url}")

Model streaming

VLLMAppEnvironment supports streaming models directly from blob storage to GPU memory, reducing startup time. When stream_model=True and model_path is set to either a flyte.io.Dir or a RunOutput pointing to a path in the object store:

  • Model weights stream directly from storage to GPU
  • Faster startup time (no full download required)
  • Lower disk space requirements

The contents of the model directory must be in a format that vLLM supports, e.g. the HuggingFace model serialization format.
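
As a minimal sketch, the app below streams weights from a directory that already exists in blob storage. The bucket path is hypothetical, and the Dir.from_existing_remote constructor is an assumption about the flyte.io API; check the flyte.io reference for the exact call.

from flyteplugins.vllm import VLLMAppEnvironment
import flyte

vllm_app = VLLMAppEnvironment(
    name="streamed-llm-app",
    # Hypothetical object-store path holding HuggingFace-format weights
    model_path=flyte.io.Dir.from_existing_remote("s3://my-bucket/models/qwen3-0.6b"),
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    stream_model=True,  # Stream weights from blob storage to GPU instead of downloading to disk
    requires_auth=False,
)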

Custom vLLM arguments

Use extra_args to pass additional arguments to vLLM:

vllm_app = VLLMAppEnvironment(
    name="custom-vllm-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    extra_args=[
        "--max-model-len", "8192",  # Maximum context length
        "--gpu-memory-utilization", "0.8",  # GPU memory utilization
        "--trust-remote-code",  # Trust remote code in models
    ],
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    # ...
)

See the vLLM documentation for all available arguments.

Using the OpenAI-compatible API

Once deployed, your vLLM app exposes an OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-app-url/v1",  # vLLM endpoint
    api_key="your-api-key",  # If you passed an --api-key argument
)

response = client.chat.completions.create(
    model="qwen3-0.6b",  # Your model_id
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)

print(response.choices[0].message.content)

If you started the server with an --api-key argument (for example via extra_args), pass the same key through the client's api_key parameter to authenticate your requests. See the documentation on passing auth secrets to apps for more details.
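
The endpoint also supports token streaming through the standard OpenAI client. The sketch below reuses the client and model_id from the example above:

# Stream tokens as they are generated instead of waiting for the full response
stream = client.chat.completions.create(
    model="qwen3-0.6b",  # Your model_id
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # Role/finish chunks carry no content
        print(delta, end="", flush=True)
print()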

Multi-GPU inference (Tensor Parallelism)

For larger models, use multiple GPUs with tensor parallelism:

vllm_multi_gpu.py
"""vLLM app with multi-GPU tensor parallelism."""

from flyteplugins.vllm import VLLMAppEnvironment
import flyte

vllm_app = VLLMAppEnvironment(
    name="multi-gpu-llm-app",
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(
        cpu="8",
        memory="32Gi",
        gpu="L40s:4",  # 4 GPUs for tensor parallelism
        disk="100Gi",
    ),
    extra_args=[
        "--tensor-parallel-size", "4",  # Use 4 GPUs
        "--max-model-len", "4096",
        "--gpu-memory-utilization", "0.9",
    ],
    requires_auth=False,
)

The --tensor-parallel-size value should match the number of GPUs requested in resources.
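
One way to keep the two in sync is to derive both from a single constant, as in this sketch (values mirror the example above):

# Define the GPU count once and reuse it for the resource request and the vLLM flag
NUM_GPUS = 4

vllm_app = VLLMAppEnvironment(
    name="multi-gpu-llm-app",
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu=f"L40s:{NUM_GPUS}", disk="100Gi"),
    extra_args=["--tensor-parallel-size", str(NUM_GPUS), "--max-model-len", "4096"],
    requires_auth=False,
)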

Model sharding with prefetch

You can prefetch and shard models for multi-GPU inference:

# Prefetch with sharding configuration
run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-70b-hf",
    accelerator="L40s:4",
    shard_config=flyte.prefetch.ShardConfig(
        engine="vllm",
        args=flyte.prefetch.VLLMShardArgs(
            tensor_parallel_size=4,
            dtype="auto",
            trust_remote_code=True,
        ),
    ),
)
run.wait()

# Use the sharded model
vllm_app = VLLMAppEnvironment(
    name="sharded-llm-app",
    model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
    model_id="llama-2-70b",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4", disk="100Gi"),
    extra_args=["--tensor-parallel-size", "4"],
    stream_model=True,
)

See Prefetching models for more details on sharding.

Autoscaling

vLLM apps support autoscaling, including scaling to zero when idle:

vllm_app = VLLMAppEnvironment(
    name="autoscaling-llm-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Scale to zero when idle
        scaledown_after=600,  # 10 minutes idle before scaling down
    ),
    # ...
)

Best practices

  1. Use prefetching: Prefetch models for faster deployment and better reproducibility
  2. Enable streaming: Use stream_model=True to reduce startup time and disk usage
  3. Right-size GPUs: Match GPU memory to model size
  4. Configure memory utilization: Use --gpu-memory-utilization to control memory usage
  5. Use tensor parallelism: For large models, use multiple GPUs with tensor-parallel-size
  6. Set autoscaling: Use appropriate idle TTL to balance cost and performance
  7. Limit context length: Set --max-model-len to a smaller value when you do not need long contexts, reducing memory usage (several of these practices are combined in the sketch below)
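
The following sketch combines several of these practices; the exact resource sizes and argument values are illustrative and should be tuned for your model:

from flyteplugins.vllm import VLLMAppEnvironment
import flyte

vllm_app = VLLMAppEnvironment(
    name="tuned-llm-app",
    model_hf_path="Qwen/Qwen3-0.6B",  # placeholder; overridden with the prefetched model below
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    extra_args=[
        "--max-model-len", "8192",  # Limit context length to bound KV-cache memory
        "--gpu-memory-utilization", "0.9",  # Leave a little headroom on the GPU
    ],
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Scale to zero when idle
        scaledown_after=600,  # 10 minutes of inactivity before scaling down
    ),
    stream_model=True,  # Stream weights from blob storage instead of downloading to disk
    requires_auth=False,
)

if __name__ == "__main__":
    flyte.init_from_config()

    # Prefetch once for faster, reproducible deployments
    run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
    run.wait()

    app = flyte.serve(
        vllm_app.clone_with(
            vllm_app.name,
            model_hf_path=None,
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
        )
    )
    print(f"Deployed vLLM app: {app.url}")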

Troubleshooting

Model loading fails:

  • Verify GPU memory is sufficient for the model
  • Check that the model path or HuggingFace path is correct
  • Review container logs for detailed error messages

Out of memory errors:

  • Reduce --max-model-len
  • Lower --gpu-memory-utilization
  • Use a smaller model or more GPUs

Slow startup:

  • Enable stream_model=True for faster loading
  • Prefetch models before deployment
  • Use faster storage backends