SGLang app
SGLang is a fast structured generation library for large language models (LLMs). Flyte provides SGLangAppEnvironment for deploying SGLang model servers.
Installation
First, install the SGLang plugin:
pip install --pre flyteplugins-sglang
Basic SGLang app
Here’s a simple example serving a HuggingFace model:
"""A simple SGLang app example."""
from flyteplugins.sglang import SGLangAppEnvironment
import flyte
sglang_app = SGLangAppEnvironment(
name="my-sglang-app",
model_hf_path="Qwen/Qwen3-0.6B", # HuggingFace model path
model_id="qwen3-0.6b", # Model ID exposed by SGLang
resources=flyte.Resources(
cpu="4",
memory="16Gi",
gpu="L40s:1", # GPU required for LLM serving
disk="10Gi",
),
scaling=flyte.app.Scaling(
replicas=(0, 1),
scaledown_after=300, # Scale down after 5 minutes of inactivity
),
requires_auth=False,
)
if __name__ == "__main__":
flyte.init_from_config()
app = flyte.serve(sglang_app)
print(f"Deployed SGLang app: {app.url}")
Using prefetched models
You can use models prefetched with flyte.prefetch:
"""SGLang app using prefetched models."""
from flyteplugins.sglang import SGLangAppEnvironment
import flyte
# Use the prefetched model
sglang_app = SGLangAppEnvironment(
name="my-sglang-app",
model_hf_path="Qwen/Qwen3-0.6B", # this is a placeholder
model_id="qwen3-0.6b",
resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
stream_model=True, # Stream model directly from blob store to GPU
requires_auth=False,
)
if __name__ == "__main__":
flyte.init_from_config()
# Prefetch the model first
run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
run.wait()
app = flyte.serve(
sglang_app.clone_with(
sglang_app.name,
model_hf_path=None,
model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
)
)
print(f"Deployed SGLang app: {app.url}")
Model streaming
SGLangAppEnvironment supports streaming models directly from blob storage to GPU memory, reducing startup time.
When stream_model=True and the model_path argument is set to either a flyte.io.Dir or a RunOutput pointing to a location in object storage:
- Model weights stream directly from storage to GPU
- Faster startup time (no full download required)
- Lower disk space requirements
The contents of the model directory must be in an SGLang-supported format, e.g. the HuggingFace model serialization format.
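As a minimal sketch, streaming a model that already lives in your blob store might look like the following (the s3:// URI is a placeholder, and constructing flyte.io.Dir directly from a remote URI is an assumption; the RunOutput form shown earlier works the same way):
"""Sketch: stream a model directory that already exists in blob storage."""
from flyteplugins.sglang import SGLangAppEnvironment
import flyte
import flyte.io

streamed_app = SGLangAppEnvironment(
    name="streamed-sglang-app",
    # Placeholder URI; assumes flyte.io.Dir can point at an existing model directory
    model_path=flyte.io.Dir(path="s3://my-bucket/models/qwen3-0.6b"),
    model_id="qwen3-0.6b",
    stream_model=True,  # weights stream from blob storage to GPU memory
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    requires_auth=False,
)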
Custom SGLang arguments
Use extra_args to pass additional arguments to SGLang:
sglang_app = SGLangAppEnvironment(
name="custom-sglang-app",
model_hf_path="Qwen/Qwen3-0.6B",
model_id="qwen3-0.6b",
extra_args=[
"--max-model-len", "8192", # Maximum context length
"--mem-fraction-static", "0.8", # Memory fraction for static allocation
"--trust-remote-code", # Trust remote code in models
],
resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
# ...
)
See the SGLang server arguments documentation for all available options.
Using the OpenAI-compatible API
Once deployed, your SGLang app exposes an OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(
base_url="https://your-app-url/v1", # SGLang endpoint
api_key="your-api-key", # If you passed an --api-key argument
)
response = client.chat.completions.create(
model="qwen3-0.6b", # Your model_id
messages=[
{"role": "user", "content": "Hello, how are you?"}
],
)
print(response.choices[0].message.content)
If you passed an --api-key argument, you can use the api_key parameter to authenticate your requests.
See here for more details on how to pass auth secrets to your app.
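As a sketch, one way to wire up API-key authentication is to pass --api-key to the server through extra_args and hand the same value to the OpenAI client above; in practice, source the key from a secret rather than hard-coding it (the value below is a placeholder):
from flyteplugins.sglang import SGLangAppEnvironment
import flyte

secured_app = SGLangAppEnvironment(
    name="secured-sglang-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    # Placeholder key for illustration only; pull the real value from a secret
    extra_args=["--api-key", "<your-api-key>"],
)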
Multi-GPU inference (Tensor Parallelism)
For larger models, use multiple GPUs with tensor parallelism:
sglang_app = SGLangAppEnvironment(
name="multi-gpu-sglang-app",
model_hf_path="meta-llama/Llama-2-70b-hf",
model_id="llama-2-70b",
resources=flyte.Resources(
cpu="8",
memory="32Gi",
gpu="L40s:4", # 4 GPUs for tensor parallelism
disk="100Gi",
),
extra_args=[
"--tp", "4", # Tensor parallelism size (4 GPUs)
"--max-model-len", "4096",
"--mem-fraction-static", "0.9",
],
requires_auth=False,
)
The tensor parallelism size (--tp) should match the number of GPUs specified in resources.
Model sharding with prefetch
You can prefetch and pre-shard models for multi-GPU inference by passing a shard configuration to flyte.prefetch.hf_model:
# Prefetch with sharding configuration
run = flyte.prefetch.hf_model(
repo="meta-llama/Llama-2-70b-hf",
accelerator="L40s:4",
shard_config=flyte.prefetch.ShardConfig(
engine="vllm",
args=flyte.prefetch.VLLMShardArgs(
tensor_parallel_size=4,
dtype="auto",
trust_remote_code=True,
),
),
)
run.wait()
# Use the sharded model
sglang_app = SGLangAppEnvironment(
name="sharded-sglang-app",
model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
model_id="llama-2-70b",
resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4", disk="100Gi"),
extra_args=["--tp", "4"],
stream_model=True,
)
See Prefetching models for more details on sharding.
Autoscaling
SGLang apps work well with autoscaling:
sglang_app = SGLangAppEnvironment(
name="autoscaling-sglang-app",
model_hf_path="Qwen/Qwen3-0.6B",
model_id="qwen3-0.6b",
resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
scaling=flyte.app.Scaling(
replicas=(0, 1), # Scale to zero when idle
scaledown_after=600, # 10 minutes idle before scaling down
),
# ...
)
Structured generation
SGLang is particularly well-suited for structured generation tasks. The deployed app supports standard OpenAI API calls, and you can use SGLang’s advanced features through the API.
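For example, here is a sketch of JSON-schema constrained output through the OpenAI-compatible endpoint (whether the json_schema response format is honored depends on your SGLang version and its grammar backend):
from openai import OpenAI

client = OpenAI(base_url="https://your-app-url/v1", api_key="your-api-key")

response = client.chat.completions.create(
    model="qwen3-0.6b",
    messages=[{"role": "user", "content": "Return the capital of France as JSON."}],
    # Constrain the output to a JSON schema
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capital_info",
            "schema": {
                "type": "object",
                "properties": {
                    "country": {"type": "string"},
                    "capital": {"type": "string"},
                },
                "required": ["country", "capital"],
            },
        },
    },
)
print(response.choices[0].message.content)  # JSON conforming to the schema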
Best practices
- Use prefetching: Prefetch models for faster deployment and better reproducibility
- Enable streaming: Use stream_model=True to reduce startup time and disk usage
- Right-size GPUs: Match GPU memory to model size
- Use tensor parallelism: For large models, use multiple GPUs with --tp
- Set autoscaling: Use an appropriate idle TTL to balance cost and performance
- Configure memory: Use --mem-fraction-static to control memory allocation
- Limit context length: Use --max-model-len for smaller models to reduce memory usage (a combined sketch follows this list)
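Taken together, a rough sketch that applies most of these recommendations (values are illustrative, not tuned, and run is assumed to be a completed flyte.prefetch.hf_model run as in the earlier examples):
sglang_app = SGLangAppEnvironment(
    name="tuned-sglang-app",
    model_path=flyte.app.RunOutput(type="directory", run_name=run.name),  # prefetched model
    model_id="qwen3-0.6b",
    stream_model=True,  # stream weights from blob storage to the GPU
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    extra_args=[
        "--max-model-len", "8192",        # cap context length
        "--mem-fraction-static", "0.85",  # control static memory allocation
    ],
    scaling=flyte.app.Scaling(
        replicas=(0, 2),       # scale to zero when idle
        scaledown_after=600,   # 10 minutes idle before scaling down
    ),
)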
Troubleshooting
Model loading fails:
- Verify GPU memory is sufficient for the model
- Check that the model path or HuggingFace path is correct
- Review container logs for detailed error messages
Out of memory errors:
- Reduce --max-model-len
- Lower --mem-fraction-static
- Use a smaller model or more GPUs
Slow startup:
- Enable stream_model=True for faster loading
- Prefetch models before deployment
- Use faster storage backends