Deploy Optimized LLM Endpoints with vLLM and SGLang
This guide shows you how to deploy high-performance LLM endpoints using SGLang and vLLM, and how to use Union’s optimized serving images, which are designed to reduce cold start times and serve models efficiently.
For information on how to cache models from HuggingFace Hub as Union Artifacts, see the Cache a HuggingFace Model as an Artifact guide.
Overview
Union provides two specialized app classes for serving high-performance LLM endpoints:
- SGLangApp: Optimized for structured generation and complex reasoning tasks
- VLLMApp: High-performance inference engine with excellent throughput
By default, both classes provide:
- Reduced cold start times through optimized image loading.
- Fast model loading by streaming model weights directly from blob storage to GPU memory.
- Distributed inference with options for shared memory and tensor parallelism.
You can also serve models with other frameworks such as FastAPI, but achieving comparable performance takes significantly more effort; vLLM and SGLang provide highly performant LLM endpoints out of the box.
Basic Example: Deploy a Non-Sharded Model
Deploy with vLLM
Assuming that you have followed the guide to cache models from HuggingFace Hub and have a model artifact named qwen2-5-0-5b-instruct, you can deploy a simple LLM endpoint with the following code:
# vllm_app.py
import union
from union.app.llm import VLLMApp
from flytekit.extras.accelerators import L4
# Reference the cached model artifact
Model = union.Artifact(name="qwen2-5-0-5b-instruct")
# Deploy with default optimized image
vllm_app = VLLMApp(
name="basic-vllm-app",
requests=union.Resources(cpu="12", mem="24Gi", gpu="1"),
accelerator=L4,
model=Model.query(), # Query the cached artifact
model_id="qwen2",
scaledown_after=300,
stream_model=True, # Enable streaming for faster loading
port=8084,
requires_auth=False,
)
Here we’re using a single L4 GPU to serve the model and specifying stream_model=True
to stream the model weights directly to GPU memory.
Deploy the app:
union deploy apps vllm_app.py basic-vllm-app
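Once the app is running, you can query it like any OpenAI-compatible endpoint, since the vLLM server exposes the OpenAI API. The snippet below is a minimal sketch: the base URL is a placeholder for your deployed app’s URL, and query_vllm_app.py is just an illustrative filename.
# query_vllm_app.py (illustrative)
from openai import OpenAI

# Placeholder: replace with the URL of your deployed app
client = OpenAI(
    base_url="https://<your-app-url>/v1",
    api_key="EMPTY",  # requires_auth=False, so no real key is needed
)

response = client.chat.completions.create(
    model="qwen2",  # matches the model_id set in the VLLMApp
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=64,
)
print(response.choices[0].message.content)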
Deploy with SGLang
The equivalent deployment with SGLang looks like this:
# sglang_app.py
import union
from union.app.llm import SGLangApp
from flytekit.extras.accelerators import L4
# Reference the cached model artifact
Model = union.Artifact(name="qwen2-5-0-5b-instruct")
# Deploy with default optimized image
sglang_app = SGLangApp(
name="basic-sglang-app",
requests=union.Resources(cpu="12", mem="24Gi", gpu="1"),
accelerator=L4,
model=Model.query(), # Query the cached artifact
model_id="qwen2",
scaledown_after=300,
stream_model=True, # Enable streaming for faster loading
port=8000,
requires_auth=False,
)
Deploy the app:
union deploy apps sglang_app.py basic-sglang-app
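SGLang also serves an OpenAI-compatible API, so a plain HTTP request works as well. This is a sketch that assumes a placeholder URL for the deployed app:
# query_sglang_app.py (illustrative)
import requests

# Placeholder: replace with the URL of your deployed app
BASE_URL = "https://<your-app-url>"

response = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "qwen2",  # matches the model_id set in the SGLangApp
        "messages": [{"role": "user", "content": "Summarize what SGLang is in one sentence."}],
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["message"]["content"])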
Custom Image Example: Deploy with Your Own Image
If you need more control over the serving environment, you can define a custom ImageSpec.
For vLLM apps, that would look like this:
import union
from union.app.llm import VLLMApp
from flytekit.extras.accelerators import L4
# Reference the cached model artifact
Model = union.Artifact(name="qwen2-5-0-5b-instruct")
# Define custom optimized image
image = union.ImageSpec(
name="vllm-serving-custom",
builder="union",
apt_packages=["build-essential"],
packages=["union[vllm]==0.1.187"],
env={
"NCCL_DEBUG": "INFO",
"CUDA_LAUNCH_BLOCKING": "1",
},
)
# Deploy with custom image
vllm_app = VLLMApp(
name="vllm-app-custom",
container_image=image,
...
)
And for SGLang apps, it would look like this:
# sglang_app.py
import union
from union.app.llm import SGLangApp
from flytekit.extras.accelerators import L4
# Reference the cached model artifact
Model = union.Artifact(name="qwen2-5-0-5b-instruct")
# Define custom optimized image
image = union.ImageSpec(
name="sglang-serving-custom",
builder="union",
python_version="3.12",
apt_packages=["build-essential"],
packages=["union[sglang]==0.1.187"],
)
# Deploy with custom image
sglang_app = SGLangApp(
name="sglang-app-custom",
container_image=image,
...
)
Defining your own image gives you control over the exact package versions, but at the cost of longer cold starts: the default Union serving images are optimized with Nydus, which streams container image layers so the container can start before the image is fully downloaded.
Advanced Example: Deploy a Sharded Model
For large models that require distributed inference, deploy using a sharded model artifact:
Cache a Sharded Model
Cache a large model with sharding (see Cache a HuggingFace Model as an Artifact for details). First, create a shard configuration file:
# shard_config.yaml
engine: vllm
args:
model: unsloth/Llama-3.3-70B-Instruct
tensor_parallel_size: 4
gpu_memory_utilization: 0.9
extra_args:
max_model_len: 16384
Then cache the model:
union cache model-from-hf unsloth/Llama-3.3-70B-Instruct \
--hf-token-key HUGGINGFACE_TOKEN \
--union-api-key EAGER_API_KEY \
--artifact-name llama-3-3-70b-instruct-sharded \
--cpu 36 \
--gpu 4 \
--mem 300Gi \
--ephemeral-storage 300Gi \
--accelerator nvidia-l40s \
--shard-config shard_config.yaml \
--project flytesnacks \
--domain development \
--wait
Deploy with VLLMApp
Once the model is cached, you can deploy it to a vLLM app:
# vllm_app_sharded.py
from flytekit.extras.accelerators import L40S
from union import Artifact, Resources
from union.app.llm import VLLMApp
# Reference the sharded model artifact
LLMArtifact = Artifact(name="llama-3-3-70b-instruct-sharded")
# Deploy sharded model with optimized configuration
vllm_app = VLLMApp(
name="vllm-app-sharded",
requests=Resources(
cpu="36",
mem="300Gi",
gpu="4",
ephemeral_storage="300Gi",
),
accelerator=L40S,
model=LLMArtifact.query(),
model_id="llama3",
# Additional arguments to pass into the vLLM engine:
# see https://docs.vllm.ai/en/stable/serving/engine_args.html
# or run `vllm serve --help` to see all available arguments
extra_args=[
"--tensor-parallel-size", "4",
"--gpu-memory-utilization", "0.8",
"--max-model-len", "4096",
"--max-num-seqs", "256",
"--enforce-eager",
],
env={
"NCCL_DEBUG": "INFO",
"CUDA_LAUNCH_BLOCKING": "1",
"VLLM_SKIP_P2P_CHECK": "1",
},
shared_memory=True, # Enable shared memory for multi-GPU
scaledown_after=300,
stream_model=True,
port=8084,
requires_auth=False,
)
Then deploy the app:
union deploy apps vllm_app_sharded.py vllm-app-sharded
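The sharded endpoint is still OpenAI-compatible, so you can stream tokens as they are generated. The sketch below uses a placeholder URL for the deployed app:
# stream_vllm_sharded.py (illustrative)
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-app-url>/v1",  # placeholder for your deployed app's URL
    api_key="EMPTY",  # requires_auth=False
)

# Stream tokens from the sharded model as they are generated
stream = client.chat.completions.create(
    model="llama3",  # matches the model_id set in the VLLMApp
    messages=[{"role": "user", "content": "Explain tensor parallelism in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)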
Deploy with SGLangApp
You can also deploy the sharded model to an SGLang app:
# sglang_app_sharded.py
from flytekit.extras.accelerators import GPUAccelerator
from union import Artifact, Resources
from union.app.llm import SGLangApp
# Reference the sharded model artifact
LLMArtifact = Artifact(name="llama-3-3-70b-instruct-sharded")
# Deploy sharded model with SGLang
sglang_app = SGLangApp(
name="sglang-app-sharded",
requests=Resources(
cpu="36",
mem="300Gi",
gpu="4",
ephemeral_storage="300Gi",
),
accelerator=GPUAccelerator("nvidia-l40s"),
model=LLMArtifact.query(),
model_id="llama3",
# Additional arguments to pass into the SGLang engine:
# See https://docs.sglang.ai/backend/server_arguments.html for details.
extra_args=[
"--tensor-parallel-size", "4",
"--mem-fraction-static", "0.8",
],
env={
"NCCL_DEBUG": "INFO",
"CUDA_LAUNCH_BLOCKING": "1",
},
shared_memory=True,
scaledown_after=300,
stream_model=True,
port=8084,
requires_auth=False,
)
Then deploy the app:
union deploy apps sglang_app_sharded.py sglang-app-sharded
Performance Tuning
You can refer to the corresponding documentation for vLLM and SGLang for more information on how to tune the performance of your app.
- vLLM: see the optimization and tuning and engine arguments pages to learn how to tune your app’s performance, and the distributed inference and serving page for details on multi-GPU serving.
- SGLang: see the environment variables and server arguments pages for the full set of available serving options.
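When adjusting arguments such as --max-num-seqs or --gpu-memory-utilization, it helps to measure latency against your own workload. Below is a minimal, framework-agnostic sketch that sends concurrent chat-completion requests to a deployed endpoint (placeholder URL) and reports average latency; it is an illustration, not a rigorous benchmark.
# benchmark_endpoint.py (illustrative)
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "https://<your-app-url>"  # placeholder for your deployed app's URL
PROMPT = "List three uses of tensor parallelism."

def one_request() -> float:
    """Send a single chat completion request and return its latency in seconds."""
    start = time.perf_counter()
    requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json={
            "model": "llama3",  # matches the model_id of the deployed app
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": 128,
        },
        timeout=120,
    ).raise_for_status()
    return time.perf_counter() - start

# Fire 16 concurrent requests and report the average latency
with ThreadPoolExecutor(max_workers=16) as pool:
    latencies = list(pool.map(lambda _: one_request(), range(16)))
print(f"avg latency: {sum(latencies) / len(latencies):.2f}s")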