SGLang app

SGLang is a fast serving framework for large language models (LLMs) with first-class support for structured generation. Flyte provides SGLangAppEnvironment for deploying SGLang model servers.

Installation

First, install the SGLang plugin:

pip install --pre flyteplugins-sglang

Basic SGLang app

Here’s a simple example serving a HuggingFace model:

basic_sglang.py
"""A simple SGLang app example."""

from flyteplugins.sglang import SGLangAppEnvironment
import flyte

sglang_app = SGLangAppEnvironment(
    name="my-sglang-app",
    model_hf_path="Qwen/Qwen3-0.6B",  # HuggingFace model path
    model_id="qwen3-0.6b",  # Model ID exposed by SGLang
    resources=flyte.Resources(
        cpu="4",
        memory="16Gi",
        gpu="L40s:1",  # GPU required for LLM serving
        disk="10Gi",
    ),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),
        scaledown_after=300,  # Scale down after 5 minutes of inactivity
    ),
    requires_auth=False,
)

if __name__ == "__main__":
    flyte.init_from_config()
    app = flyte.serve(sglang_app)
    print(f"Deployed SGLang app: {app.url}")

Using prefetched models

You can use models prefetched with flyte.prefetch:

sglang_with_prefetch.py
"""SGLang app using prefetched models."""

from flyteplugins.sglang import SGLangAppEnvironment
import flyte


# Use the prefetched model
sglang_app = SGLangAppEnvironment(
    name="my-sglang-app",
    model_hf_path="Qwen/Qwen3-0.6B",  # this is a placeholder
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    stream_model=True,  # Stream model directly from blob store to GPU
    requires_auth=False,
)

if __name__ == "__main__":
    flyte.init_from_config()

    # Prefetch the model first
    run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
    run.wait()

    app = flyte.serve(
        sglang_app.clone_with(
            sglang_app.name,
            model_hf_path=None,
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
        )
    )
    print(f"Deployed SGLang app: {app.url}")

Model streaming

SGLangAppEnvironment supports streaming models directly from blob storage to GPU memory, reducing startup time. When stream_model=True and model_path is set to either a flyte.io.Dir or a RunOutput pointing to a location in object storage:

  • Model weights stream directly from storage to GPU
  • Faster startup time (no full download required)
  • Lower disk space requirements

The contents of the model directory must be in a format SGLang supports, e.g. the HuggingFace model serialization format.
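
For example, a deployment that streams a previously prefetched model straight from blob storage could look like the following minimal sketch (the run name my-prefetch-run is illustrative and assumes the prefetch run has already completed):

# Minimal sketch: serve a prefetched model with streaming enabled.
# "my-prefetch-run" is an illustrative run name; substitute the name of your
# completed flyte.prefetch run.
streaming_app = SGLangAppEnvironment(
    name="streaming-sglang-app",
    model_path=flyte.app.RunOutput(type="directory", run_name="my-prefetch-run"),
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1", disk="10Gi"),
    stream_model=True,  # Stream weights from object storage to GPU memory
    requires_auth=False,
)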

Custom SGLang arguments

Use extra_args to pass additional arguments to SGLang:

sglang_app = SGLangAppEnvironment(
    name="custom-sglang-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    extra_args=[
        "--max-model-len", "8192",  # Maximum context length
        "--mem-fraction-static", "0.8",  # Memory fraction for static allocation
        "--trust-remote-code",  # Trust remote code in models
    ],
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    # ...
)

See the SGLang server arguments documentation for all available options.

Using the OpenAI-compatible API

Once deployed, your SGLang app exposes an OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-app-url/v1",  # SGLang endpoint
    api_key="your-api-key",  # If you passed an --api-key argument
)

response = client.chat.completions.create(
    model="qwen3-0.6b",  # Your model_id
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)

print(response.choices[0].message.content)

If you passed an --api-key argument, you can use the api_key parameter to authenticate your requests. See here for more details on how to pass auth secrets to your app.
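
For example, you can require an API key on the server by passing --api-key via extra_args and supplying the same key from the client. This is a minimal sketch: the key value is illustrative, and in practice it should come from a secret rather than being hard-coded.

# Minimal sketch: require an API key on the SGLang server.
# "my-secret-key" is illustrative; in practice, inject the key from a secret
# rather than hard-coding it in source.
sglang_app = SGLangAppEnvironment(
    name="my-sglang-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    extra_args=["--api-key", "my-secret-key"],  # Requests without this key are rejected
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    requires_auth=False,
)

# Clients then authenticate with the same key:
client = OpenAI(base_url="https://your-app-url/v1", api_key="my-secret-key")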

Multi-GPU inference (Tensor Parallelism)

For larger models, use multiple GPUs with tensor parallelism:

sglang_app = SGLangAppEnvironment(
    name="multi-gpu-sglang-app",
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(
        cpu="8",
        memory="32Gi",
        gpu="L40s:4",  # 4 GPUs for tensor parallelism
        disk="100Gi",
    ),
    extra_args=[
        "--tp", "4",  # Tensor parallelism size (4 GPUs)
        "--max-model-len", "4096",
        "--mem-fraction-static", "0.9",
    ],
    requires_auth=False,
)

The tensor parallelism size (--tp) should match the number of GPUs specified in resources.
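
One way to keep the two in sync is to derive both from a single variable, as in this illustrative sketch:

# Illustrative sketch: derive the GPU request and --tp from one variable so the
# two values cannot drift apart.
num_gpus = 4

sglang_app = SGLangAppEnvironment(
    name="multi-gpu-sglang-app",
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu=f"L40s:{num_gpus}", disk="100Gi"),
    extra_args=["--tp", str(num_gpus)],
    requires_auth=False,
)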

Model sharding with prefetch

You can prefetch and shard models for multi-GPU inference using flyte.prefetch's sharding support (the example below shards with the vLLM engine):

# Prefetch with sharding configuration
run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-70b-hf",
    accelerator="L40s:4",
    shard_config=flyte.prefetch.ShardConfig(
        engine="vllm",
        args=flyte.prefetch.VLLMShardArgs(
            tensor_parallel_size=4,
            dtype="auto",
            trust_remote_code=True,
        ),
    ),
)
run.wait()

# Use the sharded model
sglang_app = SGLangAppEnvironment(
    name="sharded-sglang-app",
    model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
    model_id="llama-2-70b",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4", disk="100Gi"),
    extra_args=["--tp", "4"],
    stream_model=True,
)

See Prefetching models for more details on sharding.

Autoscaling

SGLang apps work well with autoscaling:

sglang_app = SGLangAppEnvironment(
    name="autoscaling-sglang-app",
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Scale to zero when idle
        scaledown_after=600,  # 10 minutes idle before scaling down
    ),
    # ...
)

Structured generation

SGLang is particularly well-suited for structured generation tasks. The deployed app supports standard OpenAI API calls, and you can use SGLang’s advanced features through the API.
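
For example, you can request output that conforms to a JSON schema through the OpenAI-compatible endpoint. This is a hedged sketch: it assumes the deployed server honors OpenAI-style response_format with a JSON schema (SGLang's structured output support); the schema and field names are illustrative.

from openai import OpenAI

client = OpenAI(base_url="https://your-app-url/v1", api_key="your-api-key")

# Assumption: the SGLang OpenAI-compatible endpoint accepts response_format
# with a JSON schema for constrained generation.
response = client.chat.completions.create(
    model="qwen3-0.6b",
    messages=[{"role": "user", "content": "Give me the capital of France as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capital_info",  # Illustrative schema name
            "schema": {
                "type": "object",
                "properties": {
                    "country": {"type": "string"},
                    "capital": {"type": "string"},
                },
                "required": ["country", "capital"],
            },
        },
    },
)

print(response.choices[0].message.content)  # A JSON string matching the schema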

Best practices

  1. Use prefetching: Prefetch models for faster deployment and better reproducibility
  2. Enable streaming: Use stream_model=True to reduce startup time and disk usage
  3. Right-size GPUs: Match GPU memory to model size
  4. Use tensor parallelism: For large models, use multiple GPUs with --tp
  5. Set autoscaling: Use an appropriate scaledown_after to balance cost against cold-start latency
  6. Configure memory: Use --mem-fraction-static to control memory allocation
  7. Limit context length: Use --context-length to cap the context window and reduce memory usage (see the consolidated sketch below)
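
The sketch below combines several of these practices into one configuration; the values and the prefetch run name are illustrative.

# Illustrative configuration combining prefetching, model streaming, tensor
# parallelism, memory tuning, and autoscaling. Values are examples only.
sglang_app = SGLangAppEnvironment(
    name="production-sglang-app",
    model_path=flyte.app.RunOutput(type="directory", run_name="my-prefetch-run"),  # Prefetched model
    model_id="llama-2-70b",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4", disk="100Gi"),
    stream_model=True,  # Stream weights from blob storage to GPU
    extra_args=[
        "--tp", "4",  # Match the GPU count in resources
        "--context-length", "4096",  # Cap context length to reduce memory usage
        "--mem-fraction-static", "0.9",  # Memory fraction for static allocation
    ],
    scaling=flyte.app.Scaling(
        replicas=(0, 2),  # Scale to zero when idle
        scaledown_after=600,
    ),
)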

Troubleshooting

Model loading fails:

  • Verify GPU memory is sufficient for the model
  • Check that the model path or HuggingFace path is correct
  • Review container logs for detailed error messages

Out of memory errors:

  • Reduce --context-length
  • Lower --mem-fraction-static
  • Use a smaller model or more GPUs

Slow startup:

  • Enable stream_model=True for faster loading
  • Prefetch models before deployment
  • Use faster storage backends