Deploy Optimized LLM Endpoints with vLLM and SGLang
This guide shows you how to deploy high-performance LLM endpoints using SGLang and vLLM, and how to use Union’s optimized serving images, which are designed to reduce cold start times and serve models efficiently.
For information on how to cache models from HuggingFace Hub as Union Artifacts, see the Cache a HuggingFace Model as an Artifact guide.
Overview
Union provides two specialized app classes for serving high-performance LLM endpoints:
- SGLangApp: Optimized for structured generation and complex reasoning tasks
- VLLMApp: High-performance inference engine with excellent throughput
By default, both classes provide:
- Reduced cold start times through optimized image loading.
- Fast model loading by streaming model weights directly from blob storage to GPU memory.
- Distributed inference with options for shared memory and tensor parallelism.
You can also serve models with other frameworks such as FastAPI, but achieving comparable performance takes significantly more effort; vLLM and SGLang provide highly performant LLM endpoints out of the box.
Basic Example: Deploy a Non-Sharded Model
Deploy with vLLM
Assuming that you have followed the guide to cache models from HuggingFace Hub and have a model artifact named qwen2-5-0-5b-instruct, you can deploy a simple LLM endpoint with the following code:
# vllm_app.py
import union
from union.app.llm import VLLMApp
from flytekit.extras.accelerators import L4
# Reference the cached model artifact
Model = union.Artifact(name="qwen2-5-0-5b-instruct")
# Deploy with default optimized image
vllm_app = VLLMApp(
name="basic-vllm-app",
requests=union.Resources(cpu="12", mem="24Gi", gpu="1"),
accelerator=L4,
model=Model.query(), # Query the cached artifact
model_id="qwen2",
scaledown_after=300,
stream_model=True, # Enable streaming for faster loading
port=8084,
requires_auth=False,
)
Here we’re using a single L4 GPU to serve the model and specifying stream_model=True
to stream the model weights directly to GPU memory.
Deploy the app:
union deploy apps vllm_app.py basic-vllm-app
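Once the app is running, you can query it like any OpenAI-compatible endpoint, since the vLLM server exposes the OpenAI API. The snippet below is a minimal sketch: the base URL is a placeholder for your deployed app’s URL, and query_vllm_app.py is just an illustrative filename.
# query_vllm_app.py (illustrative)
from openai import OpenAI

# Placeholder: replace with the URL of your deployed app
client = OpenAI(
    base_url="https://<your-app-url>/v1",
    api_key="EMPTY",  # requires_auth=False, so no real key is needed
)

response = client.chat.completions.create(
    model="qwen2",  # matches the model_id set in the VLLMApp
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=64,
)
print(response.choices[0].message.content)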
Deploy with SGLang
The equivalent deployment with SGLang looks like this:
# sglang_app.py
import union
from union.app.llm import SGLangApp
from flytekit.extras.accelerators import L4
# Reference the cached model artifact
Model = union.Artifact(name="qwen2-5-0-5b-instruct")
# Deploy with default optimized image
sglang_app = SGLangApp(
name="basic-sglang-app",
requests=union.Resources(cpu="12", mem="24Gi", gpu="1"),
accelerator=L4,
model=Model.query(), # Query the cached artifact
model_id="qwen2",
scaledown_after=300,
stream_model=True, # Enable streaming for faster loading
port=8000,
requires_auth=False,
)
Deploy the app:
union deploy apps sglang_app.py basic-sglang-app
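SGLang also serves an OpenAI-compatible API, so a plain HTTP request works as well. This is a sketch that assumes a placeholder URL for the deployed app:
# query_sglang_app.py (illustrative)
import requests

# Placeholder: replace with the URL of your deployed app
BASE_URL = "https://<your-app-url>"

response = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "qwen2",  # matches the model_id set in the SGLangApp
        "messages": [{"role": "user", "content": "Summarize what SGLang is in one sentence."}],
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["message"]["content"])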
Custom Image Example: Deploy with Your Own Image
If you need more control over the serving environment, you can define a custom ImageSpec.
For vLLM apps, that would look like this:
import union
from union.app.llm import VLLMApp
from flytekit.extras.accelerators import L4
# Reference the cached model artifact
Model = union.Artifact(name="qwen2-5-0-5b-instruct")
# Define custom optimized image
image = union.ImageSpec(
name="vllm-serving-custom",
builder="union",
apt_packages=["build-essential"],
packages=["union[vllm]==0.1.187"],
env={
"NCCL_DEBUG": "INFO",
"CUDA_LAUNCH_BLOCKING": "1",
},
)
# Deploy with custom image
vllm_app = VLLMApp(
name="vllm-app-custom",
container_image=image,
...
)
And for SGLang apps, it would look like this:
# sglang_app.py
import union
from union.app.llm import SGLangApp
from flytekit.extras.accelerators import L4
# Reference the cached model artifact
Model = union.Artifact(name="qwen2-5-0-5b-instruct")
# Define custom optimized image
image = union.ImageSpec(
name="sglang-serving-custom",
builder="union",
python_version="3.12",
apt_packages=["build-essential"],
packages=["union[sglang]==0.1.187"],
)
# Deploy with custom image
sglang_app = SGLangApp(
name="sglang-app-custom",
container_image=image,
...
)
Defining your own image gives you control over the exact package versions, but at the cost of longer cold starts: the default Union serving images are optimized with Nydus, which streams container image layers so the container can start before the image is fully downloaded.
Advanced Example: Deploy a Sharded Model
For large models that require distributed inference, deploy using a sharded model artifact:
Cache a Sharded Model
Cache a large model with sharding (see Cache a HuggingFace Model as an Artifact for details). First, create a shard configuration file:
# shard_config.yaml
engine: vllm
args:
model: unsloth/Llama-3.3-70B-Instruct
tensor_parallel_size: 4
gpu_memory_utilization: 0.9
extra_args:
max_model_len: 16384
Then cache the model:
union cache model-from-hf unsloth/Llama-3.3-70B-Instruct \
--hf-token-key HUGGINGFACE_TOKEN \
--union-api-key EAGER_API_KEY \
--artifact-name llama-3-3-70b-instruct-sharded \
--cpu 36 \
--gpu 4 \
--mem 300Gi \
--ephemeral-storage 300Gi \
--accelerator nvidia-l40s \
--shard-config shard_config.yaml \
--project flytesnacks \
--domain development \
--wait
Deploy with VLLMApp
Once the model is cached, you can deploy it to a vLLM app:
# vllm_app_sharded.py
from flytekit.extras.accelerators import L40S
from union import Artifact, Resources
from union.app.llm import VLLMApp
# Reference the sharded model artifact
LLMArtifact = Artifact(name="llama-3-3-70b-instruct-sharded")
# Deploy sharded model with optimized configuration
vllm_app = VLLMApp(
name="vllm-app-sharded",
requests=Resources(
cpu="36",
mem="300Gi",
gpu="4",
ephemeral_storage="300Gi",
),
accelerator=L40S,
model=LLMArtifact.query(),
model_id="llama3",
# Additional arguments to pass into the vLLM engine:
# see https://docs.vllm.ai/en/stable/serving/engine_args.html
# or run `vllm serve --help` to see all available arguments
extra_args=[
"--tensor-parallel-size", "4",
"--gpu-memory-utilization", "0.8",
"--max-model-len", "4096",
"--max-num-seqs", "256",
"--enforce-eager",
],
env={
"NCCL_DEBUG": "INFO",
"CUDA_LAUNCH_BLOCKING": "1",
"VLLM_SKIP_P2P_CHECK": "1",
},
shared_memory=True, # Enable shared memory for multi-GPU
scaledown_after=300,
stream_model=True,
port=8084,
requires_auth=False,
)
Then deploy the app:
union deploy apps vllm_app_sharded.py vllm-app-sharded
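The sharded endpoint is still OpenAI-compatible, so you can stream tokens as they are generated. The sketch below uses a placeholder URL for the deployed app:
# stream_vllm_sharded.py (illustrative)
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-app-url>/v1",  # placeholder for your deployed app's URL
    api_key="EMPTY",  # requires_auth=False
)

# Stream tokens from the sharded model as they are generated
stream = client.chat.completions.create(
    model="llama3",  # matches the model_id set in the VLLMApp
    messages=[{"role": "user", "content": "Explain tensor parallelism in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)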
Deploy with SGLangApp
You can also deploy the sharded model to an SGLang app:
# sglang_app_sharded.py
from flytekit.extras.accelerators import GPUAccelerator
from union import Artifact, Resources
from union.app.llm import SGLangApp
# Reference the sharded model artifact
LLMArtifact = Artifact(name="llama-3-3-70b-instruct-sharded")
# Deploy sharded model with SGLang
sglang_app = SGLangApp(
name="sglang-app-sharded",
requests=Resources(
cpu="36",
mem="300Gi",
gpu="4",
ephemeral_storage="300Gi",
),
accelerator=GPUAccelerator("nvidia-l40s"),
model=LLMArtifact.query(),
model_id="llama3",
# Additional arguments to pass into the SGLang engine:
# See https://docs.sglang.ai/backend/server_arguments.html for details.
extra_args=[
"--tensor-parallel-size", "4",
"--mem-fraction-static", "0.8",
],
env={
"NCCL_DEBUG": "INFO",
"CUDA_LAUNCH_BLOCKING": "1",
},
shared_memory=True,
scaledown_after=300,
stream_model=True,
port=8084,
requires_auth=False,
)
Then deploy the app:
union deploy apps sglang_app_sharded.py sglang-app-sharded
Performance Tuning
You can refer to the corresponding documentation for vLLM and SGLang for more information on how to tune the performance of your app.
- vLLM: see the optimization and tuning and engine arguments pages to learn how to tune your app’s performance, and the distributed inference and serving page for details on multi-GPU serving.
- SGLang: see the environment variables and server arguments pages for the full set of available serving options.
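When adjusting arguments such as --max-num-seqs or --gpu-memory-utilization, it helps to measure latency against your own workload. Below is a minimal, framework-agnostic sketch that sends concurrent chat-completion requests to a deployed endpoint (placeholder URL) and reports average latency; it is an illustration, not a rigorous benchmark.
# benchmark_endpoint.py (illustrative)
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "https://<your-app-url>"  # placeholder for your deployed app's URL
PROMPT = "List three uses of tensor parallelism."

def one_request() -> float:
    """Send a single chat completion request and return its latency in seconds."""
    start = time.perf_counter()
    requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json={
            "model": "llama3",  # matches the model_id of the deployed app
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": 128,
        },
        timeout=120,
    ).raise_for_status()
    return time.perf_counter() - start

# Fire 16 concurrent requests and report the average latency
with ThreadPoolExecutor(max_workers=16) as pool:
    latencies = list(pool.map(lambda _: one_request(), range(16)))
print(f"avg latency: {sum(latencies) / len(latencies):.2f}s")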