SGLang app

SGLang is a fast serving framework for large language models (LLMs) with first-class support for structured generation. Flyte provides SGLangAppEnvironment for deploying SGLang model servers.

Installation

First, install the SGLang plugin:

pip install --pre flyteplugins-sglang

Basic SGLang app

Here’s a simple example serving a HuggingFace model:

basic_sglang.py
"""A simple SGLang app example."""

from flyteplugins.sglang import SGLangAppEnvironment
import flyte

sglang_app = SGLangAppEnvironment(
    name="my-sglang-app",
    model_hf_path="Qwen/Qwen3-0.6B",  # HuggingFace model path
    model_id="qwen3-0.6b",  # Model ID exposed by SGLang
    resources=flyte.Resources(
        cpu="4",
        memory="16Gi",
        gpu="L40s:1",  # GPU required for LLM serving
        disk="10Gi",
    ),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),
        scaledown_after=300,  # Scale down after 5 minutes of inactivity
    ),
    requires_auth=False,
)

if __name__ == "__main__":
    flyte.init_from_config()
    app = flyte.serve(sglang_app)
    print(f"Deployed SGLang app: {app.url}")

Using prefetched models

You can use models prefetched with flyte.prefetch:

sglang_with_prefetch.py
"""SGLang app using prefetched models."""

from flyteplugins.sglang import SGLangAppEnvironment
import flyte


# Use the prefetched model
sglang_app = SGLangAppEnvironment(
    name="my-sglang-app",
    model_hf_path="Qwen/Qwen3-0.6B",  # this is a placeholder
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    stream_model=True,  # Stream model directly from blob store to GPU
    requires_auth=False,
)

if __name__ == "__main__":
    flyte.init_from_config()

    # Prefetch the model first
    run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
    run.wait()

    app = flyte.serve(
        sglang_app.clone_with(
            sglang_app.name,
            model_hf_path=None,
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
        )
    )
    print(f"Deployed SGLang app: {app.url}")

Model streaming

SGLangAppEnvironment supports streaming models directly from blob storage to GPU memory, reducing startup time. When stream_model=True and model_path is set to either a flyte.io.Dir or a RunOutput pointing to a location in object storage:

  • Model weights stream directly from storage to GPU
  • Faster startup time (no full download required)
  • Lower disk space requirements

The contents of the model directory must be in a format SGLang supports, e.g. the HuggingFace model serialization format.
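
For example, a deployment that streams a previously prefetched model straight from blob storage could look like the following minimal sketch (the run name my-prefetch-run is illustrative and assumes the prefetch run has already completed):

# Minimal sketch: serve a prefetched model with streaming enabled.
# "my-prefetch-run" is an illustrative run name; substitute the name of your
# completed flyte.prefetch run.
streaming_app = SGLangAppEnvironment(
    name="streaming-sglang-app",
    model_path=flyte.app.RunOutput(type="directory", run_name="my-prefetch-run"),
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    stream_model=True,  # Stream weights from object storage to GPU memory
    requires_auth=False,
)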

Custom SGLang arguments

Use extra_args to pass additional arguments to SGLang:

sglang_app = SGLangAppEnvironment(
    name="custom-sglang-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    extra_args=[
        "--max-model-len", "8192",  # Maximum context length
        "--mem-fraction-static", "0.8",  # Memory fraction for static allocation
        "--trust-remote-code",  # Trust remote code in models
    ],
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    # ...
)

See the SGLang server arguments documentation for all available options.

Using the OpenAI-compatible API

Once deployed, your SGLang app exposes an OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-app-url/v1",  # SGLang endpoint
    api_key="your-api-key",  # If you passed an --api-key argument
)

response = client.chat.completions.create(
    model="qwen3-0.6b",  # Your model_id
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)

print(response.choices[0].message.content)

If you passed an --api-key argument, you can use the api_key parameter to authenticate your requests. See here for more details on how to pass auth secrets to your app.
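
For example, you can require an API key on the server by passing --api-key via extra_args and supplying the same key from the client. This is a minimal sketch: the key value is illustrative, and in practice it should come from a secret rather than being hard-coded.

# Minimal sketch: require an API key on the SGLang server.
# "my-secret-key" is illustrative; in practice, inject the key from a secret
# rather than hard-coding it in source.
sglang_app = SGLangAppEnvironment(
    name="my-sglang-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    extra_args=["--api-key", "my-secret-key"],  # Requests without this key are rejected
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    requires_auth=False,
)

# Clients then authenticate with the same key:
client = OpenAI(base_url="https://your-app-url/v1", api_key="my-secret-key")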

Multi-GPU inference (Tensor Parallelism)

For larger models, use multiple GPUs with tensor parallelism:

sglang_app = SGLangAppEnvironment(
    name="multi-gpu-sglang-app",
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(
        cpu="8",
        memory="32Gi",
        gpu="L40s:4",  # 4 GPUs for tensor parallelism
        disk="100Gi",
    ),
    extra_args=[
        "--tp", "4",  # Tensor parallelism size (4 GPUs)
        "--max-model-len", "4096",
        "--mem-fraction-static", "0.9",
    ],
    requires_auth=False,
)

The tensor parallelism size (--tp) should match the number of GPUs specified in resources.
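
One way to keep the two in sync is to derive both from a single variable, as in this illustrative sketch:

# Illustrative sketch: derive the GPU request and --tp from one variable so the
# two values cannot drift apart.
num_gpus = 4

sglang_app = SGLangAppEnvironment(
    name="multi-gpu-sglang-app",
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu=f"L40s:{num_gpus}", disk="100Gi"),
    extra_args=["--tp", str(num_gpus)],
    requires_auth=False,
)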

Model sharding with prefetch

You can prefetch and shard models for multi-GPU inference using flyte.prefetch's sharding support (the example below shards with the vLLM engine):

# Prefetch with sharding configuration
run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-70b-hf",
    accelerator="L40s:4",
    shard_config=flyte.prefetch.ShardConfig(
        engine="vllm",
        args=flyte.prefetch.VLLMShardArgs(
            tensor_parallel_size=4,
            dtype="auto",
            trust_remote_code=True,
        ),
    ),
)
run.wait()

# Use the sharded model
sglang_app = SGLangAppEnvironment(
    name="sharded-sglang-app",
    model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
    model_id="llama-2-70b",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4", disk="100Gi"),
    extra_args=["--tp", "4"],
    stream_model=True,
)

See Prefetching models for more details on sharding.

Autoscaling

SGLang apps work well with autoscaling:

sglang_app = SGLangAppEnvironment(
    name="autoscaling-sglang-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Scale to zero when idle
        scaledown_after=600,  # 10 minutes idle before scaling down
    ),
    # ...
)

Structured generation

SGLang is particularly well-suited for structured generation tasks. The deployed app supports standard OpenAI API calls, and you can use SGLang’s advanced features through the API.
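
For example, you can request output that conforms to a JSON schema through the OpenAI-compatible endpoint. This is a hedged sketch: it assumes the deployed server honors OpenAI-style response_format with a JSON schema (SGLang's structured output support); the schema and field names are illustrative.

from openai import OpenAI

client = OpenAI(base_url="https://your-app-url/v1", api_key="your-api-key")

# Assumption: the SGLang OpenAI-compatible endpoint accepts response_format
# with a JSON schema for constrained generation.
response = client.chat.completions.create(
    model="qwen3-0.6b",
    messages=[{"role": "user", "content": "Give me the capital of France as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capital_info",  # Illustrative schema name
            "schema": {
                "type": "object",
                "properties": {
                    "country": {"type": "string"},
                    "capital": {"type": "string"},
                },
                "required": ["country", "capital"],
            },
        },
    },
)

print(response.choices[0].message.content)  # A JSON string matching the schema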

Best practices

  1. Use prefetching: Prefetch models for faster deployment and better reproducibility
  2. Enable streaming: Use stream_model=True to reduce startup time and disk usage
  3. Right-size GPUs: Match GPU memory to model size
  4. Use tensor parallelism: For large models, use multiple GPUs with --tp
  5. Set autoscaling: Use an appropriate scaledown_after to balance cost against cold-start latency
  6. Configure memory: Use --mem-fraction-static to control memory allocation
  7. Limit context length: Use --context-length to cap the context window and reduce memory usage (see the consolidated sketch below)
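
The sketch below combines several of these practices into one configuration; the values and the prefetch run name are illustrative.

# Illustrative configuration combining prefetching, model streaming, tensor
# parallelism, memory tuning, and autoscaling. Values are examples only.
sglang_app = SGLangAppEnvironment(
    name="production-sglang-app",
    model_path=flyte.app.RunOutput(type="directory", run_name="my-prefetch-run"),  # Prefetched model
    model_id="llama-2-70b",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4", disk="100Gi"),
    stream_model=True,  # Stream weights from blob storage to GPU
    extra_args=[
        "--tp", "4",  # Match the GPU count in resources
        "--context-length", "4096",  # Cap context length to reduce memory usage
        "--mem-fraction-static", "0.9",  # Memory fraction for static allocation
    ],
    scaling=flyte.app.Scaling(
        replicas=(0, 2),  # Scale to zero when idle
        scaledown_after=600,
    ),
)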

Troubleshooting

Model loading fails:

  • Verify GPU memory is sufficient for the model
  • Check that the model path or HuggingFace path is correct
  • Review container logs for detailed error messages

Out of memory errors:

  • Reduce --context-length
  • Lower --mem-fraction-static
  • Use a smaller model or more GPUs

Slow startup:

  • Enable stream_model=True for faster loading
  • Prefetch models before deployment
  • Use faster storage backends