
# Voice customer-service agent

> [!NOTE]
> Code available [here](https://github.com/unionai/unionai-examples/tree/main/v2/tutorials/voice_customer_service).

Talk to a customer-support agent in your browser and hear it answer back. This tutorial builds that agent as two Flyte apps: a small Qwen model served with vLLM on a GPU, and a web app that serves a single-page voice UI and proxies the model. Speech recognition runs in the browser, the reply streams back as text, and the text is spoken aloud. Once Union is set up, bringing the whole thing online is two `python` commands.

![Voice customer-service agent in the browser](https://www.union.ai/docs/v2/union/_static/images/tutorials/voice_customer_service/demo.gif)

By the end you will understand, and have running:

- how to serve an LLM as a Flyte app with `VLLMAppEnvironment`, exposing an OpenAI-compatible endpoint,
- how to build a web app with `FastAPIAppEnvironment` that serves a UI and proxies the model,
- how the two apps compose, with the browser only ever talking to its own origin,
- how text-to-speech is switchable between the browser and a neural voice served on Union, with a live latency comparison, and
- how Flyte app features like health checks, warm replicas, and per-app routing turn this into something that feels like a product.

The agent is "Ava", a support rep for a fictional electronics company called Northwind. The model is `Qwen/Qwen2.5-3B-Instruct`, which is public, so no Hugging Face token is needed, and small enough to be snappy on a single L4 GPU.
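To put a rough number on "small enough" for a single L4, here is a back-of-the-envelope sketch of the weight footprint. The parameter count (about 3.09 billion) is an assumption used for illustration, and the helper `weight_footprint_gb` is hypothetical, not part of the tutorial code. It only counts weights, so it leaves out the KV cache, activations, and runtime overhead.

```python
def weight_footprint_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: approximate parameter count of Qwen2.5-3B-Instruct.
N_PARAMS = 3.09e9
# An NVIDIA L4 has 24 GB of GPU memory.
L4_MEMORY_GB = 24

bf16_gb = weight_footprint_gb(N_PARAMS)      # bf16 stores 2 bytes per parameter
fp32_gb = weight_footprint_gb(N_PARAMS, 4)   # fp32 would need 4 bytes per parameter

print(f"bf16 weights: {bf16_gb:.1f} GB ({bf16_gb / L4_MEMORY_GB:.0%} of an L4)")
print(f"fp32 weights: {fp32_gb:.1f} GB")
```

Under these assumptions the bf16 weights come to a little over 6 GB, roughly a quarter of the card, which is why most of the GPU stays free for serving.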
## How it fits together

The demo is two apps plus the browser:

| App       | Environment             | Hardware | Job                                                              |
| --------- | ----------------------- | -------- | ---------------------------------------------------------------- |
| `llm_app` | `VLLMAppEnvironment`    | L4 GPU   | serve Qwen with vLLM, OpenAI-compatible `/v1`                    |
| `ui_app`  | `FastAPIAppEnvironment` | CPU      | serve the voice page, proxy chat to `llm_app`, synthesize speech |

Speech recognition and the default text-to-speech run in the browser through the Web Speech API, so there is no audio model to host for input and the GPU footprint stays tiny. The browser talks only to the UI app, and the UI app talks to the model. That keeps the model internal and avoids any cross-origin setup in the browser.

```mermaid
flowchart LR
    B["Browser<br/>mic + speaker<br/>Web Speech API (STT)"]

    subgraph UI["ui_app · FastAPIAppEnvironment · CPU"]
        P["/ voice page<br/>/api/chat proxy<br/>/api/tts (Kokoro)"]
    end

    subgraph LLM["llm_app · VLLMAppEnvironment · L4 GPU"]
        Q["Qwen2.5-3B-Instruct<br/>OpenAI /v1"]
    end

    B -- "text turn" --> P
    P -- "/v1/chat/completions (streamed)" --> Q
    Q -- "tokens" --> P
    P -- "reply text or spoken WAV" --> B

    classDef ui fill:#e0f2fe,stroke:#0369a1,color:#1a1a2e;
    classDef gpu fill:#fde68a,stroke:#b45309,color:#1a1a2e;
    classDef br fill:#ede9fe,stroke:#6d28d9,color:#1a1a2e;
    class P ui;
    class Q gpu;
    class B br;
```

Serving the UI as a Flyte app, rather than hosting it somewhere separate, means the web tier gets the same treatment as the model: a managed endpoint, autoscaling, logs, and one deploy and auth story across both apps, with no separate web server to stand up or operate. It sits next to the model it proxies instead of reaching across the internet to it. Serving over HTTPS by default is part of that same package, and it happens to clear a practical hurdle for voice, since the browser only grants microphone access and speech recognition on a secure origin. So the page works the moment it deploys.

## Serving the model

The model app is a `VLLMAppEnvironment`. It wraps vLLM, downloads the weights, and exposes the standard OpenAI `/v1` API, so the UI talks to it the same way it would talk to any OpenAI-compatible server.

First, the image. Flyte's `Image` API defines the runtime as a chain of layers you control, so you pin exactly what a served model needs.

```
"""
Voice customer-service agent: talk in the browser, it talks back.

A two-app Flyte demo:

  * ``llm_app``: a small, fast Qwen instruct model served with vLLM on an L4 GPU
    (OpenAI-compatible API). This is the "brain".
  * ``ui_app``: a tiny FastAPI app that serves a single-page voice UI and proxies
    chat requests to ``llm_app``.

Speech-to-text and text-to-speech happen **in the browser** via the Web Speech API,
so there is no audio model to host: the mic is transcribed locally, the text goes to
the LLM, and the reply is spoken locally. That keeps latency low and the GPU
footprint tiny (a 3B model on one L4).

    🎤  browser STT ──► /api/chat (FastAPI proxy) ──► vLLM /v1 (Qwen on L4)
                                                          │ streamed tokens
    🔊  browser TTS ◄── streamed text ◄────────────────────┘

The UI is served over HTTPS from the Flyte app, which is what lets the browser grant
microphone access and use speech recognition (both require a secure context). The
proxy means the browser only ever talks to its own origin, so there are no CORS
headaches.

Deploy
------
    # 1. Bring up the GPU model server (long pole: provisions an L4 + pulls weights)
    python app.py llm

    # 2. Bring up the voice UI, pointed at the LLM from step 1
    python app.py ui --llm-url <LLM app URL printed by step 1>

Then open the printed UI url in Chrome and click the mic.
"""

from __future__ import annotations

import asyncio
import base64
import io
import json
import os
import sys
import time

import httpx
from fastapi import FastAPI, Request, Response
from fastapi.responses import HTMLResponse, StreamingResponse

import flyte
import flyte.app
from flyte.app.extras import FastAPIAppEnvironment

# NOTE: `flyteplugins.vllm` is imported inside a try/except (see the llm_app
# definition below) rather than unconditionally at module top. This module is
# loaded by BOTH app containers; the lightweight UI image does not install
# flyteplugins-vllm, so an unguarded top-level import would crash the UI app on
# startup.

# ---------------------------------------------------------------------------
# 1. The LLM: small, fast Qwen instruct model on vLLM / L4
#
# Qwen2.5-3B-Instruct is a good "quality is OK, latency matters" pick: ~6 GB in
# bf16, trivially fits a 24 GB L4, and decodes fast enough that the browser's
# TTS is the pacing factor, not the model. vLLM downloads the weights straight
# from the Hugging Face hub (the model is public, so no token is needed).
# ---------------------------------------------------------------------------
MODEL_ID = "qwen"

# Pin the serving image.
# The plugin's default image pins vllm==0.11.0 but not
# transformers, and the newest transformers breaks vllm 0.11's tokenizer caching
# (AttributeError: Qwen2Tokenizer has no attribute all_special_tokens_extended).
# transformers==4.57.6 is the version the repo's own vLLM example uses.
# {{docs-fragment vllm_image}}
vllm_image = (
    flyte.Image.from_debian_base(name="vllm-app-image", install_flyte=False)
    .with_pip_packages("flashinfer-python", "flashinfer-cubin")
    .with_pip_packages("flashinfer-jit-cache", index_url="https://flashinfer.ai/whl/cu129")
    .with_pip_packages("flyteplugins-vllm")
    .with_pip_packages("vllm==0.11.0", "transformers==4.57.6")
)
# {{/docs-fragment vllm_image}}

# {{docs-fragment llm_app}}
try:
    from flyteplugins.vllm import VLLMAppEnvironment

    llm_app = VLLMAppEnvironment(
        name="cs-qwen-llm",
        model_id=MODEL_ID,
        model_hf_path="Qwen/Qwen2.5-3B-Instruct",
        image=vllm_image,
        resources=flyte.Resources(cpu="6", memory="20Gi", gpu="L4:1", disk="40Gi"),
        # One warm replica so there's no cold start mid-demo. Flip to (0, 1) +
        # scaledown_after to save the GPU when idle, at the cost of a cold start.
        scaling=flyte.app.Scaling(replicas=(1, 1)),
        requires_auth=False,
        extra_args=[
            # Short context keeps the KV cache small and latency low; a customer
            # service turn is tiny.
            "--max-model-len",
            "8192",
            "--max-num-seqs",
            "16",
        ],
    )
except ImportError:
    llm_app = None  # flyteplugins-vllm not installed (e.g. the UI container)
# {{/docs-fragment llm_app}}

# ---------------------------------------------------------------------------
# 1b. The combined app: ONE model that does LLM + speech (Qwen2.5-Omni-3B)
#
# Qwen2.5-Omni uses a Thinker-Talker architecture: a single
# /v1/chat/completions call with "modalities": ["audio"] returns BOTH the text
# reply and synthesized speech. Served by vllm-omni (a separate vLLM project that
# adds omni-modality output), NOT the flyteplugins-vllm plugin, which pins an
# older vLLM without omni support.
# We run the OpenAI server via a custom
# container `command`, which bypasses Flyte's default fserve entrypoint.
# ---------------------------------------------------------------------------
OMNI_HF_MODEL = "Qwen/Qwen2.5-Omni-3B"
OMNI_MODEL_ID = "omni"

# vllm-omni installs from source on top of vLLM 0.23.0 (see its quickstart).
# CRITICAL: pin --torch-backend=cu130 (NOT auto). The remote image builder has no
# GPU, so `auto` resolves to CPU torch (torch+cpu) and vllm._C then fails with
# `libcudart.so.13: cannot open shared object file`. The demo L4 nodes run driver
# 580 / CUDA 13, so cu130 is the right GPU build. No separate flashinfer (the old
# cu129 wheels are CUDA 12.9 and conflict with the CUDA-13 stack).
omni_image = (
    flyte.Image.from_debian_base(name="vllm-omni-server", install_flyte=False)
    .with_apt_packages("git")
    .with_commands(
        [
            "uv pip install --system vllm==0.23.0 --torch-backend=cu130",
            "git clone https://github.com/vllm-project/vllm-omni.git /opt/vllm-omni",
            "uv pip install --system -e /opt/vllm-omni",
        ]
    )
)


def build_omni_app():
    """A single model that returns text + speech (Qwen2.5-Omni-3B via vllm-omni)."""
    return flyte.app.AppEnvironment(
        name="cs-omni",
        image=omni_image,
        # Raw vllm OpenAI server with omni audio output enabled.
        # vllm-omni runs each stage (thinker + talker) as a SEPARATE engine on the
        # SAME GPU, and each applies --gpu-memory-utilization to the whole device. So
        # the stages must share: 0.45 each (~0.90 total) leaves room for both. The
        # thinker model alone is ~8.8 GB, so the 24 GB L4 is too tight for two stages
        # with usable KV cache; the 48 GB L40S fits both comfortably.
        command=[
            "bash",
            "-lc",
            "export PATH=/opt/venv/bin:/usr/local/bin:$PATH; "
            f"exec vllm serve {OMNI_HF_MODEL} --omni --trust-remote-code "
            f"--served-model-name {OMNI_MODEL_ID} --port 8080 "
            "--gpu-memory-utilization 0.45 --max-model-len 8192",
        ],
        # This runtime image has the CUDA *runtime* libs (from torch) but no CUDA
        # *toolkit* (nvcc / CUDA_HOME).
        # Several vLLM kernels JIT-compile at startup and
        # assert a toolkit is present, killing the engine core. Disable those so they
        # use prebuilt/native paths: the flashinfer sampler and deep_gemm. (The crash
        # was never RAM/GPU size: L4 and L40S failed identically.)
        env_vars={"VLLM_USE_FLASHINFER_SAMPLER": "0", "VLLM_USE_DEEP_GEMM": "0"},
        port=8080,
        # L40S (g6e.12xlarge): 48 GB GPU fits both omni stages; big node so cpu/mem/disk
        # requests schedule freely. (Earlier L40S attempt failed only at the now-fixed
        # flashinfer error, before reaching this two-stage memory split.)
        resources=flyte.Resources(cpu="12", memory="48Gi", gpu="L40s:1", disk="60Gi"),
        requires_auth=False,
        scaling=flyte.app.Scaling(replicas=(1, 1)),
    )


# ---------------------------------------------------------------------------
# 2. The voice UI: FastAPI serving the page + proxying to the LLM
# ---------------------------------------------------------------------------

# {{docs-fragment system_prompt}}
SYSTEM_PROMPT = (
    "You are Ava, a warm, efficient customer-support agent for 'Northwind', a "
    "consumer electronics company. Your replies are spoken aloud in a live phone-"
    "like call, so keep them very short (1-2 sentences), natural, and free of "
    "markdown, lists, or emoji. Get to the point in the first sentence. Ask one "
    "clarifying question at a time. The caller may interrupt you at any moment; if "
    "they do, stop and listen. If you don't know an account-specific detail, say "
    "you'll look into it rather than inventing facts."
)
# {{/docs-fragment system_prompt}}

# The LLM endpoint is injected at deploy time (see __main__) via this env var.
LLM_BASE_URL = os.environ.get("LLM_BASE_URL", "")

# {{docs-fragment backends}}
# Optional model switcher. Set LLM_BACKENDS to a comma-separated list of "Label|https://url"
# pairs (each url is its own vLLM app), and the UI shows a dropdown to route between them.
# Serving another model is just another Flyte app, so this is the whole "switch models" story.
# When unset, the single LLM_BASE_URL above is used and no switcher appears (default demo).
LLM_BACKENDS = os.environ.get("LLM_BACKENDS", "")

# Served-model-id per backend url, cached so each vLLM app is asked at most once.
_model_cache: dict = {}


def _backends() -> list:
    """The list of {label, url} chat backends; a single Default unless LLM_BACKENDS is set."""
    pairs = []
    for item in LLM_BACKENDS.split(","):
        label, sep, url = item.partition("|")
        if sep and url.strip():
            pairs.append({"label": label.strip(), "url": url.strip().rstrip("/")})
    if pairs:
        return pairs
    base = os.environ.get("LLM_BASE_URL", LLM_BASE_URL).rstrip("/")
    return [{"label": "Default", "url": base}] if base else []


def _pick_backend(label: str | None) -> dict | None:
    """Choose a backend by label, falling back to the first configured one."""
    backends = _backends()
    return next((b for b in backends if b["label"] == label), backends[0] if backends else None)


async def _model_id_for(base: str) -> str:
    """Ask a vLLM backend which model id it serves (cached); fall back to MODEL_ID."""
    if not base:
        return MODEL_ID
    if base not in _model_cache:
        mid = MODEL_ID
        try:
            async with httpx.AsyncClient(timeout=4.0) as client:
                r = await client.get(f"{base}/v1/models")
                r.raise_for_status()
                mid = ((r.json().get("data") or [{}])[0].get("id")) or MODEL_ID
        except Exception:
            mid = MODEL_ID
        _model_cache[base] = mid
    return _model_cache[base]


# {{/docs-fragment backends}}

# TTS configuration.
#   TTS_MODE: "both" (show the in-UI switch) | "browser" | "server" (lock one mode)
#   TTS_VOICE: a Kokoro voice id; af_heart is a warm female voice that fits "Ava".
TTS_MODE = os.environ.get("TTS_MODE", "both")
TTS_VOICE = os.environ.get("TTS_VOICE", "af_heart")

# Omni (combined LLM+TTS) backend: Qwen2.5-Omni via vllm-omni. Injected at
# deploy time; when set, the UI exposes an "Omni" engine that does chat+speech in
# one call.
# OMNI_SAMPLE_RATE is used only if the model returns raw PCM (no header).
OMNI_BASE_URL = os.environ.get("OMNI_BASE_URL", "")
OMNI_MODEL_ID = os.environ.get("OMNI_MODEL_ID", "omni")
OMNI_SAMPLE_RATE = int(os.environ.get("OMNI_SAMPLE_RATE", "24000"))

# Kokoro is loaded lazily/once at startup (heavy torch import) and only when the
# server-side TTS path is enabled. Stored on app state so requests reuse it.
_tts_state: dict = {"pipeline": None, "error": None}

# Kokoro synthesis is CPU-bound; running several at once just thrashes the cores
# and makes each one slower. Serialize so every clause stays fast (~0.5s) even if
# the client's prefetch ever overlaps two requests.
_synth_sem = asyncio.Semaphore(1)


def _load_kokoro():
    """Build the Kokoro pipeline once and warm it. Returns the pipeline or raises."""
    from kokoro import KPipeline  # heavy (torch); imported only when serving TTS

    pipeline = KPipeline(lang_code="a")  # 'a' = American English
    # Warm-up: the first synth compiles/caches; do it now so real calls are fast.
    for _ in pipeline("Hello.", voice=TTS_VOICE):
        pass
    return pipeline


def _synth(text: str):
    """Run Kokoro and return concatenated 24 kHz float32 audio (numpy)."""
    import numpy as np

    pipeline = _tts_state["pipeline"]
    chunks = [audio for _, _, audio in pipeline(text, voice=TTS_VOICE)]
    if not chunks:
        return np.zeros(1, dtype="float32")
    return np.concatenate(chunks).astype("float32")


def _wav_bytes(audio, sr: int = 24000) -> bytes:
    import soundfile as sf

    buf = io.BytesIO()
    sf.write(buf, audio, sr, format="WAV", subtype="PCM_16")
    return buf.getvalue()


fastapi_app = FastAPI(title="Northwind Voice Support")


@fastapi_app.on_event("startup")
async def _startup():
    # Load Kokoro unless TTS is browser-only (then we skip the heavy import).
    if TTS_MODE == "browser":
        return
    try:
        _tts_state["pipeline"] = await asyncio.to_thread(_load_kokoro)
    except Exception as e:  # keep the app up; server-TTS just stays unavailable
        _tts_state["error"] = f"{type(e).__name__}: {e}"


@fastapi_app.get("/healthz")
async def healthz():
    return {
        "ok": True,
        "llm": LLM_BASE_URL or "unset",
        "tts_mode": TTS_MODE,
        "tts_ready": _tts_state["pipeline"] is not None,
        "tts_error": _tts_state["error"],
        "omni": OMNI_BASE_URL or "unset",
    }


@fastapi_app.get("/api/config")
async def config():
    """Tells the browser which TTS modes / engines / model backends are available."""
    return {
        "tts_mode": TTS_MODE,
        "tts_ready": _tts_state["pipeline"] is not None,
        "omni_ready": bool(OMNI_BASE_URL),
        "backends": [b["label"] for b in _backends()],
    }


# {{docs-fragment backend_status}}
@fastapi_app.get("/api/backend")
async def backend_status(req: Request):
    """Liveness of a chat backend, for the "model warm / waking" pill.

    Pings the vLLM app's ``/v1/models``. A quick OK means a warm replica is already
    serving; a failure or a timeout is the cold start you'd see with
    ``Scaling(replicas=(0, 1))``.
    """
    chosen = _pick_backend(req.query_params.get("backend"))
    base = (chosen or {}).get("url", "")
    if not base:
        return {"up": False, "model": None}
    try:
        async with httpx.AsyncClient(timeout=4.0) as client:
            r = await client.get(f"{base}/v1/models")
            r.raise_for_status()
            data = r.json()
            return {"up": True, "model": (data.get("data") or [{}])[0].get("id")}
    except Exception:
        return {"up": False, "model": None}


# {{/docs-fragment backend_status}}


# {{docs-fragment tts_endpoint}}
@fastapi_app.post("/api/tts")
async def tts(req: Request):
    """Synthesize speech for one clause with Kokoro; returns a 24 kHz WAV.

    The X-Synth-Ms response header carries the measured server-side synthesis time
    so the client can display/compare latency.
    """
    body = await req.json()
    text = (body.get("text") or "").strip()
    if not text:
        return Response(status_code=204)
    if _tts_state["pipeline"] is None:
        return Response(status_code=503, content=_tts_state["error"] or "TTS not ready")
    t0 = time.perf_counter()
    async with _synth_sem:
        audio = await asyncio.to_thread(_synth, text)
    wav = await asyncio.to_thread(_wav_bytes, audio)
    synth_ms = int((time.perf_counter() - t0) * 1000)
    return Response(content=wav, media_type="audio/wav", headers={"X-Synth-Ms": str(synth_ms)})


# {{/docs-fragment tts_endpoint}}


# {{docs-fragment chat_proxy}}
@fastapi_app.post("/api/chat")
async def chat(req: Request):
    """Proxy a chat turn to the selected vLLM backend and stream the text reply back."""
    body = await req.json()
    history = body.get("messages", [])
    chosen = _pick_backend(body.get("backend"))
    base = (chosen or {}).get("url", "")
    payload = {
        "model": await _model_id_for(base),
        "messages": [{"role": "system", "content": SYSTEM_PROMPT}, *history],
        "stream": True,
        "max_tokens": 200,
        "temperature": 0.3,
    }

    async def gen():
        url = f"{base}/v1/chat/completions"
        async with httpx.AsyncClient(timeout=120.0) as client:
            async with client.stream("POST", url, json=payload) as r:
                r.raise_for_status()
                async for line in r.aiter_lines():
                    if not line.startswith("data:"):
                        continue
                    data = line[len("data:") :].strip()
                    if data == "[DONE]":
                        break
                    try:
                        delta = json.loads(data)["choices"][0]["delta"].get("content")
                    except (json.JSONDecodeError, KeyError, IndexError):
                        continue
                    if delta:
                        yield delta

    return StreamingResponse(gen(), media_type="text/plain")


# {{/docs-fragment chat_proxy}}


def _omni_extract(data: dict) -> tuple[str, bytes]:
    """Pull (reply_text, wav_bytes) out of a Qwen2.5-Omni chat-completion response.

    The omni audio field shape isn't fully documented, so be defensive: text is in
    choices[0]; audio is in some later choice's message.audio, as either a base64
    string or a dict with a base64 ``data`` field.
    If the decoded bytes are already a WAV (RIFF) we
    pass them through; otherwise we assume raw PCM16 and add a header.
    """
    choices = data.get("choices") or []
    text = ""
    audio_b64 = None
    for ch in choices:
        msg = ch.get("message") or {}
        if not text and msg.get("content"):
            text = msg["content"]
        aud = msg.get("audio")
        if aud is not None and audio_b64 is None:
            audio_b64 = aud.get("data") if isinstance(aud, dict) else aud
            if isinstance(aud, dict) and not text and aud.get("transcript"):
                text = aud["transcript"]
    if not audio_b64:
        raise ValueError("no audio in omni response")
    raw = base64.b64decode(audio_b64)
    if raw[:4] == b"RIFF":
        return text, raw  # already a WAV container
    # Raw PCM16 -> wrap in a WAV header at the configured sample rate.
    import numpy as np

    pcm = np.frombuffer(raw, dtype="
# [...]
    # phoneme runtime dep
    # CPU torch wheel keeps the image far smaller than the default CUDA build.
    .with_pip_packages("torch", index_url="https://download.pytorch.org/whl/cpu")
    .with_pip_packages(
        "fastapi", "uvicorn", "httpx", "kokoro>=0.9.2", "soundfile", "numpy"
    )
    # Kokoro's G2P (misaki) needs spaCy's en_core_web_sm.
    .with_commands(["python -m spacy download en_core_web_sm"])
)
# {{/docs-fragment ui_image}}

# {{docs-fragment ui_app}}
ui_app = FastAPIAppEnvironment(
    name="cs-voice-ui",
    app=fastapi_app,
    description="Browser voice UI for the Qwen customer-service agent (browser + Kokoro TTS)",
    image=ui_image,
    # Bumped for torch + the Kokoro model living in memory.
    resources=flyte.Resources(cpu="6", memory="8Gi"),
    requires_auth=False,
    scaling=flyte.app.Scaling(replicas=(1, 1)),
)
# {{/docs-fragment ui_app}}

# ---------------------------------------------------------------------------
# Single-page voice UI (Web Speech API: SpeechRecognition + speechSynthesis)
# ---------------------------------------------------------------------------
INDEX_HTML = """ Northwind Voice Support

◆ Northwind Voice Support

App Model Served on Union

"""

# ---------------------------------------------------------------------------
# Deploy driver
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("target", choices=["llm", "omni", "ui"])
    parser.add_argument("--llm-url", default=os.environ.get("LLM_BASE_URL", ""))
    parser.add_argument("--omni-url", default=os.environ.get("OMNI_BASE_URL", ""))
    args = parser.parse_args()

    # Reads your default Flyte config; uses the remote image builder (no local Docker needed).
    flyte.init_from_config(image_builder="remote")

    if args.target == "llm":
        if llm_app is None:
            sys.exit("flyteplugins-vllm not importable; run `uv pip install -e plugins/vllm --no-deps`")
        # GPU provisioning + image build + weight download can take a while.
        app = flyte.with_servecontext(activate_timeout=1800.0).serve(llm_app)
        print(f"LLM app: {app.url}")
    elif args.target == "omni":
        # vllm-omni builds from source + downloads a multimodal model, so be patient.
        app = flyte.with_servecontext(activate_timeout=1800.0).serve(build_omni_app())
        print(f"Omni app: {app.url}")
    else:
        if not args.llm_url:
            sys.exit("--llm-url (or LLM_BASE_URL) is required for the ui target")
        # Bake the backend endpoints into the app's container env so the proxies can reach them.
        env = {**(ui_app.env_vars or {}), "LLM_BASE_URL": args.llm_url}
        if args.omni_url:
            env["OMNI_BASE_URL"] = args.omni_url
        ui_app.env_vars = env
        app = flyte.serve(ui_app)
        print(f"Voice UI: {app.url}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/voice_customer_service/app.py*

The app itself is a few lines. `Qwen/Qwen2.5-3B-Instruct` is about 6 GB in bf16 and fits an L4 comfortably. `scaling=flyte.app.Scaling(replicas=(1, 1))` keeps exactly one warm replica so there is no cold start mid-demo, and the short `--max-model-len` keeps the KV cache small and latency low, which is all a customer-service turn needs.
``` """ Voice customer-service agent — talk in the browser, it talks back. A two-app Flyte demo: * ``llm_app`` — a small, fast Qwen instruct model served with vLLM on an L4 GPU (OpenAI-compatible API). This is the "brain". * ``ui_app`` — a tiny FastAPI app that serves a single-page voice UI and proxies chat requests to ``llm_app``. Speech-to-text and text-to-speech happen **in the browser** via the Web Speech API, so there is no audio model to host: the mic is transcribed locally, the text goes to the LLM, and the reply is spoken locally. That keeps latency low and the GPU footprint tiny (a 3B model on one L4). 🎤 browser STT ──► /api/chat (FastAPI proxy) ──► vLLM /v1 (Qwen on L4) │ streamed tokens 🔊 browser TTS ◄── streamed text ◄────────────────────┘ The UI is served over HTTPS from the Flyte app, which is what lets the browser grant microphone access and use speech recognition (both require a secure context). The proxy means the browser only ever talks to its own origin, so there are no CORS headaches. Deploy ------ # 1. Bring up the GPU model server (long pole: provisions an L4 + pulls weights) python app.py llm # 2. Bring up the voice UI, pointed at the LLM from step 1 python app.py ui --llm-url Then open the printed UI url in Chrome and click the mic. """ from __future__ import annotations import asyncio import base64 import io import json import os import sys import time import httpx from fastapi import FastAPI, Request, Response from fastapi.responses import HTMLResponse, StreamingResponse import flyte import flyte.app from flyte.app.extras import FastAPIAppEnvironment # NOTE: `flyteplugins.vllm` is imported lazily inside build_llm_app() rather than # at module top. This module is loaded by BOTH app containers; the lightweight UI # image does not install flyteplugins-vllm, so a top-level import would crash the # UI app on startup. # --------------------------------------------------------------------------- # 1. 
The LLM: small, fast Qwen instruct model on vLLM / L4 # # Qwen2.5-3B-Instruct is a good "quality is OK, latency matters" pick: ~6 GB in # bf16, trivially fits a 24 GB L4, and decodes fast enough that the browser's # TTS is the pacing factor, not the model. vLLM downloads the weights straight # from the Hugging Face hub (the model is public — no token needed). # --------------------------------------------------------------------------- MODEL_ID = "qwen" # Pin the serving image. The plugin's default image pins vllm==0.11.0 but not # transformers, and the newest transformers breaks vllm 0.11's tokenizer caching # (AttributeError: Qwen2Tokenizer has no attribute all_special_tokens_extended). # transformers==4.57.6 is the version the repo's own vLLM example uses. # {{docs-fragment vllm_image}} vllm_image = ( flyte.Image.from_debian_base(name="vllm-app-image", install_flyte=False) .with_pip_packages("flashinfer-python", "flashinfer-cubin") .with_pip_packages("flashinfer-jit-cache", index_url="https://flashinfer.ai/whl/cu129") .with_pip_packages("flyteplugins-vllm") .with_pip_packages("vllm==0.11.0", "transformers==4.57.6") ) # {{/docs-fragment vllm_image}} # {{docs-fragment llm_app}} try: from flyteplugins.vllm import VLLMAppEnvironment llm_app = VLLMAppEnvironment( name="cs-qwen-llm", model_id=MODEL_ID, model_hf_path="Qwen/Qwen2.5-3B-Instruct", image=vllm_image, resources=flyte.Resources(cpu="6", memory="20Gi", gpu="L4:1", disk="40Gi"), # One warm replica so there's no cold start mid-demo. Flip to (0, 1) + # scaledown_after to save the GPU when idle, at the cost of a cold start. scaling=flyte.app.Scaling(replicas=(1, 1)), requires_auth=False, extra_args=[ # Short context keeps the KV cache small and latency low; a customer # service turn is tiny. "--max-model-len", "8192", "--max-num-seqs", "16", ], ) except ImportError: llm_app = None # flyteplugins-vllm not installed (e.g. 
the UI container) # {{/docs-fragment llm_app}} # --------------------------------------------------------------------------- # 1b. The combined app: ONE model that does LLM + speech (Qwen2.5-Omni-3B) # # Qwen2.5-Omni uses a Thinker-Talker architecture: a single # /v1/chat/completions call with "modalities": ["audio"] returns BOTH the text # reply and synthesized speech. Served by vllm-omni (a separate vLLM project that # adds omni-modality output) — NOT the flyteplugins-vllm plugin, which pins an # older vLLM without omni support. We run the OpenAI server via a custom # container `command`, which bypasses Flyte's default fserve entrypoint. # --------------------------------------------------------------------------- OMNI_HF_MODEL = "Qwen/Qwen2.5-Omni-3B" OMNI_MODEL_ID = "omni" # vllm-omni installs from source on top of vLLM 0.23.0 (see its quickstart). # CRITICAL: pin --torch-backend=cu130 (NOT auto). The remote image builder has no # GPU, so `auto` resolves to CPU torch (torch+cpu) and vllm._C then fails with # `libcudart.so.13: cannot open shared object file`. The demo L4 nodes run driver # 580 / CUDA 13, so cu130 is the right GPU build. No separate flashinfer (the old # cu129 wheels are CUDA 12.9 and conflict with the CUDA-13 stack). omni_image = ( flyte.Image.from_debian_base(name="vllm-omni-server", install_flyte=False) .with_apt_packages("git") .with_commands( [ "uv pip install --system vllm==0.23.0 --torch-backend=cu130", "git clone https://github.com/vllm-project/vllm-omni.git /opt/vllm-omni", "uv pip install --system -e /opt/vllm-omni", ] ) ) def build_omni_app(): """A single model that returns text + speech (Qwen2.5-Omni-3B via vllm-omni).""" return flyte.app.AppEnvironment( name="cs-omni", image=omni_image, # Raw vllm OpenAI server with omni audio output enabled. # vllm-omni runs each stage (thinker + talker) as a SEPARATE engine on the # SAME GPU, and each applies --gpu-memory-utilization to the whole device. 
        # So the stages must share: 0.45 each (~0.90 total) leaves room for both. The
        # thinker model alone is ~8.8 GB, so the 24 GB L4 is too tight for two stages
        # with usable KV cache; the 48 GB L40S fits both comfortably.
        command=[
            "bash",
            "-lc",
            "export PATH=/opt/venv/bin:/usr/local/bin:$PATH; "
            f"exec vllm serve {OMNI_HF_MODEL} --omni --trust-remote-code "
            f"--served-model-name {OMNI_MODEL_ID} --port 8080 "
            "--gpu-memory-utilization 0.45 --max-model-len 8192",
        ],
        # This runtime image has the CUDA *runtime* libs (from torch) but no CUDA
        # *toolkit* (nvcc / CUDA_HOME). Several vLLM kernels JIT-compile at startup and
        # assert a toolkit is present, killing the engine core. Disable those so they
        # use prebuilt/native paths: the flashinfer sampler and deep_gemm. (The crash
        # was never about RAM or GPU size: L4 and L40S failed identically.)
        env_vars={"VLLM_USE_FLASHINFER_SAMPLER": "0", "VLLM_USE_DEEP_GEMM": "0"},
        port=8080,
        # L40S (g6e.12xlarge): 48 GB GPU fits both omni stages; big node so cpu/mem/disk
        # requests schedule freely. (Earlier L40S attempt failed only at the now-fixed
        # flashinfer error, before reaching this two-stage memory split.)
        resources=flyte.Resources(cpu="12", memory="48Gi", gpu="L40s:1", disk="60Gi"),
        requires_auth=False,
        scaling=flyte.app.Scaling(replicas=(1, 1)),
    )


# ---------------------------------------------------------------------------
# 2. The voice UI: FastAPI serving the page + proxying to the LLM
# ---------------------------------------------------------------------------

# {{docs-fragment system_prompt}}
SYSTEM_PROMPT = (
    "You are Ava, a warm, efficient customer-support agent for 'Northwind', a "
    "consumer electronics company. Your replies are spoken aloud in a live phone-"
    "like call, so keep them very short (1-2 sentences), natural, and free of "
    "markdown, lists, or emoji. Get to the point in the first sentence. Ask one "
    "clarifying question at a time. The caller may interrupt you at any moment; if "
    "they do, stop and listen. If you don't know an account-specific detail, say "
    "you'll look into it rather than inventing facts."
)
# {{/docs-fragment system_prompt}}

# The LLM endpoint is injected at deploy time (see __main__) via this env var.
LLM_BASE_URL = os.environ.get("LLM_BASE_URL", "")

# {{docs-fragment backends}}
# Optional model switcher. Set LLM_BACKENDS to a comma-separated list of "Label|https://url"
# pairs (each url is its own vLLM app) and the UI shows a dropdown to route between them.
# Serving another model is just another Flyte app, so this is the whole "switch models" story.
# When unset, the single LLM_BASE_URL above is used and no switcher appears (default demo).
LLM_BACKENDS = os.environ.get("LLM_BACKENDS", "")

# Served-model-id per backend url, cached so each vLLM app is asked at most once.
_model_cache: dict = {}


def _backends() -> list:
    """The list of {label, url} chat backends; a single Default unless LLM_BACKENDS is set."""
    pairs = []
    for item in LLM_BACKENDS.split(","):
        label, sep, url = item.partition("|")
        if sep and url.strip():
            pairs.append({"label": label.strip(), "url": url.strip().rstrip("/")})
    if pairs:
        return pairs
    base = os.environ.get("LLM_BASE_URL", LLM_BASE_URL).rstrip("/")
    return [{"label": "Default", "url": base}] if base else []


def _pick_backend(label: str | None) -> dict | None:
    """Choose a backend by label, falling back to the first configured one."""
    backends = _backends()
    return next((b for b in backends if b["label"] == label), backends[0] if backends else None)


async def _model_id_for(base: str) -> str:
    """Ask a vLLM backend which model id it serves (cached); fall back to MODEL_ID."""
    if not base:
        return MODEL_ID
    if base not in _model_cache:
        mid = MODEL_ID
        try:
            async with httpx.AsyncClient(timeout=4.0) as client:
                r = await client.get(f"{base}/v1/models")
                r.raise_for_status()
                mid = ((r.json().get("data") or [{}])[0].get("id")) or MODEL_ID
        except Exception:
            mid = MODEL_ID
        _model_cache[base] = mid
    return _model_cache[base]
# {{/docs-fragment backends}}


# TTS configuration.
#   TTS_MODE:  "both" (show the in-UI switch) | "browser" | "server" (lock one mode)
#   TTS_VOICE: a Kokoro voice id; af_heart is a warm female voice that fits "Ava".
TTS_MODE = os.environ.get("TTS_MODE", "both")
TTS_VOICE = os.environ.get("TTS_VOICE", "af_heart")

# Omni (combined LLM+TTS) backend: Qwen2.5-Omni via vllm-omni. Injected at
# deploy time; when set, the UI exposes an "Omni" engine that does chat+speech in
# one call. OMNI_SAMPLE_RATE is used only if the model returns raw PCM (no header).
OMNI_BASE_URL = os.environ.get("OMNI_BASE_URL", "")
OMNI_MODEL_ID = os.environ.get("OMNI_MODEL_ID", "omni")
OMNI_SAMPLE_RATE = int(os.environ.get("OMNI_SAMPLE_RATE", "24000"))

# Kokoro is loaded lazily/once at startup (heavy torch import) and only when the
# server-side TTS path is enabled. Stored on app state so requests reuse it.
_tts_state: dict = {"pipeline": None, "error": None}

# Kokoro synthesis is CPU-bound; running several at once just thrashes the cores
# and makes each one slower. Serialize so every clause stays fast (~0.5s) even if
# the client's prefetch ever overlaps two requests.
_synth_sem = asyncio.Semaphore(1)


def _load_kokoro():
    """Build the Kokoro pipeline once and warm it. Returns the pipeline or raises."""
    from kokoro import KPipeline  # heavy (torch); imported only when serving TTS

    pipeline = KPipeline(lang_code="a")  # 'a' = American English
    # Warm-up: the first synth compiles/caches; do it now so real calls are fast.
    for _ in pipeline("Hello.", voice=TTS_VOICE):
        pass
    return pipeline


def _synth(text: str):
    """Run Kokoro and return concatenated 24 kHz float32 audio (numpy)."""
    import numpy as np

    pipeline = _tts_state["pipeline"]
    chunks = [audio for _, _, audio in pipeline(text, voice=TTS_VOICE)]
    if not chunks:
        return np.zeros(1, dtype="float32")
    return np.concatenate(chunks).astype("float32")


def _wav_bytes(audio, sr: int = 24000) -> bytes:
    import soundfile as sf

    buf = io.BytesIO()
    sf.write(buf, audio, sr, format="WAV", subtype="PCM_16")
    return buf.getvalue()


fastapi_app = FastAPI(title="Northwind Voice Support")


@fastapi_app.on_event("startup")
async def _startup():
    # Load Kokoro unless TTS is browser-only (then we skip the heavy import).
    if TTS_MODE == "browser":
        return
    try:
        _tts_state["pipeline"] = await asyncio.to_thread(_load_kokoro)
    except Exception as e:  # keep the app up; server-TTS just stays unavailable
        _tts_state["error"] = f"{type(e).__name__}: {e}"


@fastapi_app.get("/healthz")
async def healthz():
    return {
        "ok": True,
        "llm": LLM_BASE_URL or "unset",
        "tts_mode": TTS_MODE,
        "tts_ready": _tts_state["pipeline"] is not None,
        "tts_error": _tts_state["error"],
        "omni": OMNI_BASE_URL or "unset",
    }


@fastapi_app.get("/api/config")
async def config():
    """Tells the browser which TTS modes / engines / model backends are available."""
    return {
        "tts_mode": TTS_MODE,
        "tts_ready": _tts_state["pipeline"] is not None,
        "omni_ready": bool(OMNI_BASE_URL),
        "backends": [b["label"] for b in _backends()],
    }


# {{docs-fragment backend_status}}
@fastapi_app.get("/api/backend")
async def backend_status(req: Request):
    """Liveness of a chat backend, for the "model warm / waking" pill.

    Pings the vLLM app's ``/v1/models``. A quick OK means a warm replica is already
    serving; a failure or a timeout is the cold start you'd see with
    ``Scaling(replicas=(0, 1))``.
    """
    chosen = _pick_backend(req.query_params.get("backend"))
    base = (chosen or {}).get("url", "")
    if not base:
        return {"up": False, "model": None}
    try:
        async with httpx.AsyncClient(timeout=4.0) as client:
            r = await client.get(f"{base}/v1/models")
            r.raise_for_status()
            data = r.json()
            return {"up": True, "model": (data.get("data") or [{}])[0].get("id")}
    except Exception:
        return {"up": False, "model": None}
# {{/docs-fragment backend_status}}


# {{docs-fragment tts_endpoint}}
@fastapi_app.post("/api/tts")
async def tts(req: Request):
    """Synthesize speech for one clause with Kokoro; returns a 24 kHz WAV.

    The X-Synth-Ms response header carries the measured server-side synthesis time
    so the client can display/compare latency.
    """
    body = await req.json()
    text = (body.get("text") or "").strip()
    if not text:
        return Response(status_code=204)
    if _tts_state["pipeline"] is None:
        return Response(status_code=503, content=_tts_state["error"] or "TTS not ready")
    t0 = time.perf_counter()
    async with _synth_sem:
        audio = await asyncio.to_thread(_synth, text)
    wav = await asyncio.to_thread(_wav_bytes, audio)
    synth_ms = int((time.perf_counter() - t0) * 1000)
    return Response(content=wav, media_type="audio/wav", headers={"X-Synth-Ms": str(synth_ms)})
# {{/docs-fragment tts_endpoint}}


# {{docs-fragment chat_proxy}}
@fastapi_app.post("/api/chat")
async def chat(req: Request):
    """Proxy a chat turn to the selected vLLM backend and stream the text reply back."""
    body = await req.json()
    history = body.get("messages", [])
    chosen = _pick_backend(body.get("backend"))
    base = (chosen or {}).get("url", "")
    payload = {
        "model": await _model_id_for(base),
        "messages": [{"role": "system", "content": SYSTEM_PROMPT}, *history],
        "stream": True,
        "max_tokens": 200,
        "temperature": 0.3,
    }

    async def gen():
        url = f"{base}/v1/chat/completions"
        async with httpx.AsyncClient(timeout=120.0) as client:
            async with client.stream("POST", url, json=payload) as r:
                r.raise_for_status()
                async for line in r.aiter_lines():
                    if not line.startswith("data:"):
                        continue
                    data = line[len("data:") :].strip()
                    if data == "[DONE]":
                        break
                    try:
                        delta = json.loads(data)["choices"][0]["delta"].get("content")
                    except (json.JSONDecodeError, KeyError, IndexError):
                        continue
                    if delta:
                        yield delta

    return StreamingResponse(gen(), media_type="text/plain")
# {{/docs-fragment chat_proxy}}


def _omni_extract(data: dict) -> tuple[str, bytes]:
    """Pull (reply_text, wav_bytes) out of a Qwen2.5-Omni chat-completion response.

    The omni audio field shape isn't fully documented, so be defensive: text is in
    choices[0]; audio is in some later choice's message.audio, as either a base64
    string or a dict with a base64 ``data`` field. If the decoded bytes are already
    a WAV (RIFF) we pass them through; otherwise we assume raw PCM16 and add a header.
    """
    choices = data.get("choices") or []
    text = ""
    audio_b64 = None
    for ch in choices:
        msg = ch.get("message") or {}
        if not text and msg.get("content"):
            text = msg["content"]
        aud = msg.get("audio")
        if aud is not None and audio_b64 is None:
            audio_b64 = aud.get("data") if isinstance(aud, dict) else aud
            if isinstance(aud, dict) and not text and aud.get("transcript"):
                text = aud["transcript"]
    if not audio_b64:
        raise ValueError("no audio in omni response")
    raw = base64.b64decode(audio_b64)
    if raw[:4] == b"RIFF":
        return text, raw  # already a WAV container
    # Raw PCM16 -> wrap in a WAV header at the configured sample rate.
    import numpy as np

    pcm = np.frombuffer(raw, dtype="phoneme runtime dep
    # CPU torch wheel keeps the image far smaller than the default CUDA build.
    .with_pip_packages("torch", index_url="https://download.pytorch.org/whl/cpu")
    .with_pip_packages(
        "fastapi", "uvicorn", "httpx", "kokoro>=0.9.2", "soundfile", "numpy"
    )
    # Kokoro's G2P (misaki) needs spaCy's en_core_web_sm.
    .with_commands(["python -m spacy download en_core_web_sm"])
)
# {{/docs-fragment ui_image}}

# {{docs-fragment ui_app}}
ui_app = FastAPIAppEnvironment(
    name="cs-voice-ui",
    app=fastapi_app,
    description="Browser voice UI for the Qwen customer-service agent (browser + Kokoro TTS)",
    image=ui_image,
    # Bumped for torch + the Kokoro model living in memory.
    resources=flyte.Resources(cpu="6", memory="8Gi"),
    requires_auth=False,
    scaling=flyte.app.Scaling(replicas=(1, 1)),
)
# {{/docs-fragment ui_app}}


# ---------------------------------------------------------------------------
# Single-page voice UI (Web Speech API: SpeechRecognition + speechSynthesis)
# ---------------------------------------------------------------------------
INDEX_HTML = """ Northwind Voice Support

◆ Northwind Voice Support

App Model Served on Union

"""


# ---------------------------------------------------------------------------
# Deploy driver
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("target", choices=["llm", "omni", "ui"])
    parser.add_argument("--llm-url", default=os.environ.get("LLM_BASE_URL", ""))
    parser.add_argument("--omni-url", default=os.environ.get("OMNI_BASE_URL", ""))
    args = parser.parse_args()

    # Reads your default Flyte config; uses the remote image builder (no local Docker needed).
    flyte.init_from_config(image_builder="remote")

    if args.target == "llm":
        if llm_app is None:
            sys.exit("flyteplugins-vllm not importable; run `uv pip install -e plugins/vllm --no-deps`")
        # GPU provisioning + image build + weight download can take a while.
        app = flyte.with_servecontext(activate_timeout=1800.0).serve(llm_app)
        print(f"LLM app: {app.url}")
    elif args.target == "omni":
        # vllm-omni builds from source + downloads a multimodal model, so be patient.
        app = flyte.with_servecontext(activate_timeout=1800.0).serve(build_omni_app())
        print(f"Omni app: {app.url}")
    else:
        if not args.llm_url:
            sys.exit("--llm-url (or LLM_BASE_URL) is required for the ui target")
        # Bake the backend endpoints into the app's container env so the proxies can reach them.
        env = {**(ui_app.env_vars or {}), "LLM_BASE_URL": args.llm_url}
        if args.omni_url:
            env["OMNI_BASE_URL"] = args.omni_url
        ui_app.env_vars = env
        app = flyte.serve(ui_app)
        print(f"Voice UI: {app.url}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/voice_customer_service/app.py*

> [!NOTE] Why `llm_app` is a module-level variable
> The default serving entry point resolves the app by `module:attr`, so `llm_app` has to be importable at module level. If it were created inside a function, the resolver would silently fall back to `ui_app`, and the GPU pod would end up running the web UI and returning 404 on `/v1`.
The `flyteplugins-vllm` import is guarded so the lightweight UI image, which never installs that plugin, still imports this module cleanly.

## The voice UI app

The UI is a `FastAPIAppEnvironment`. You hand it a plain FastAPI app, and Flyte serves it over HTTPS. This one serves the single-page voice client at `/`, proxies chat to the model at `/api/chat`, and synthesizes neural speech at `/api/tts`.

```
ui_app = FastAPIAppEnvironment(
    name="cs-voice-ui",
    app=fastapi_app,
    description="Browser voice UI for the Qwen customer-service agent (browser + Kokoro TTS)",
    image=ui_image,
    # Bumped for torch + the Kokoro model living in memory.
    resources=flyte.Resources(cpu="6", memory="8Gi"),
    requires_auth=False,
    scaling=flyte.app.Scaling(replicas=(1, 1)),
)
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/voice_customer_service/app.py*

It runs on CPU, with no GPU at all. The server-side voice uses Kokoro, an 82M-parameter text-to-speech model that runs comfortably on CPU, so the image bundles a CPU build of torch to keep it small.

```
"""
Voice customer-service agent: talk in the browser, it talks back.

A two-app Flyte demo:

  * ``llm_app``: a small, fast Qwen instruct model served with vLLM on an L4
    GPU (OpenAI-compatible API). This is the "brain".
  * ``ui_app``: a tiny FastAPI app that serves a single-page voice UI and
    proxies chat requests to ``llm_app``.

Speech-to-text and text-to-speech happen **in the browser** via the Web Speech
API, so there is no audio model to host: the mic is transcribed locally, the
text goes to the LLM, and the reply is spoken locally. That keeps latency low
and the GPU footprint tiny (a 3B model on one L4).

    🎤  browser STT ──► /api/chat (FastAPI proxy) ──► vLLM /v1 (Qwen on L4)
                                                        │ streamed tokens
    🔊  browser TTS ◄── streamed text ◄────────────────────┘

The UI is served over HTTPS from the Flyte app, which is what lets the browser
grant microphone access and use speech recognition (both require a secure
context). The proxy means the browser only ever talks to its own origin, so
there are no CORS headaches.

Deploy
------
    # 1. Bring up the GPU model server (long pole: provisions an L4 + pulls weights)
    python app.py llm

    # 2. Bring up the voice UI, pointed at the LLM from step 1
    python app.py ui --llm-url

Then open the printed UI url in Chrome and click the mic.
"""

from __future__ import annotations

import asyncio
import base64
import io
import json
import os
import sys
import time

import httpx
from fastapi import FastAPI, Request, Response
from fastapi.responses import HTMLResponse, StreamingResponse

import flyte
import flyte.app
from flyte.app.extras import FastAPIAppEnvironment

# NOTE: `flyteplugins.vllm` is imported inside a try/except (see llm_app below)
# rather than unconditionally at module top. This module is loaded by BOTH app
# containers; the lightweight UI image does not install flyteplugins-vllm, so an
# unguarded top-level import would crash the UI app on startup.

# ---------------------------------------------------------------------------
# 1. The LLM: small, fast Qwen instruct model on vLLM / L4
#
# Qwen2.5-3B-Instruct is a good "quality is OK, latency matters" pick: ~6 GB in
# bf16, trivially fits a 24 GB L4, and decodes fast enough that the browser's
# TTS is the pacing factor, not the model. vLLM downloads the weights straight
# from the Hugging Face hub (the model is public, so no token is needed).
# ---------------------------------------------------------------------------
MODEL_ID = "qwen"

# Pin the serving image. The plugin's default image pins vllm==0.11.0 but not
# transformers, and the newest transformers breaks vllm 0.11's tokenizer caching
# (AttributeError: Qwen2Tokenizer has no attribute all_special_tokens_extended).
# transformers==4.57.6 is the version the repo's own vLLM example uses.
# {{docs-fragment vllm_image}}
vllm_image = (
    flyte.Image.from_debian_base(name="vllm-app-image", install_flyte=False)
    .with_pip_packages("flashinfer-python", "flashinfer-cubin")
    .with_pip_packages("flashinfer-jit-cache", index_url="https://flashinfer.ai/whl/cu129")
    .with_pip_packages("flyteplugins-vllm")
    .with_pip_packages("vllm==0.11.0", "transformers==4.57.6")
)
# {{/docs-fragment vllm_image}}

# {{docs-fragment llm_app}}
try:
    from flyteplugins.vllm import VLLMAppEnvironment

    llm_app = VLLMAppEnvironment(
        name="cs-qwen-llm",
        model_id=MODEL_ID,
        model_hf_path="Qwen/Qwen2.5-3B-Instruct",
        image=vllm_image,
        resources=flyte.Resources(cpu="6", memory="20Gi", gpu="L4:1", disk="40Gi"),
        # One warm replica so there's no cold start mid-demo. Flip to (0, 1) +
        # scaledown_after to save the GPU when idle, at the cost of a cold start.
        scaling=flyte.app.Scaling(replicas=(1, 1)),
        requires_auth=False,
        extra_args=[
            # Short context keeps the KV cache small and latency low; a customer
            # service turn is tiny.
            "--max-model-len", "8192",
            "--max-num-seqs", "16",
        ],
    )
except ImportError:
    llm_app = None  # flyteplugins-vllm not installed (e.g. the UI container)
# {{/docs-fragment llm_app}}


# ---------------------------------------------------------------------------
# 1b. The combined app: ONE model that does LLM + speech (Qwen2.5-Omni-3B)
#
# Qwen2.5-Omni uses a Thinker-Talker architecture: a single
# /v1/chat/completions call with "modalities": ["audio"] returns BOTH the text
# reply and synthesized speech. Served by vllm-omni (a separate vLLM project that
# adds omni-modality output), NOT the flyteplugins-vllm plugin, which pins an
# older vLLM without omni support. We run the OpenAI server via a custom
# container `command`, which bypasses Flyte's default fserve entrypoint.
# ---------------------------------------------------------------------------
OMNI_HF_MODEL = "Qwen/Qwen2.5-Omni-3B"
OMNI_MODEL_ID = "omni"

# vllm-omni installs from source on top of vLLM 0.23.0 (see its quickstart).
# CRITICAL: pin --torch-backend=cu130 (NOT auto). The remote image builder has no
# GPU, so `auto` resolves to CPU torch (torch+cpu) and vllm._C then fails with
# `libcudart.so.13: cannot open shared object file`. The demo L4 nodes run driver
# 580 / CUDA 13, so cu130 is the right GPU build. No separate flashinfer (the old
# cu129 wheels are CUDA 12.9 and conflict with the CUDA-13 stack).
omni_image = (
    flyte.Image.from_debian_base(name="vllm-omni-server", install_flyte=False)
    .with_apt_packages("git")
    .with_commands(
        [
            "uv pip install --system vllm==0.23.0 --torch-backend=cu130",
            "git clone https://github.com/vllm-project/vllm-omni.git /opt/vllm-omni",
            "uv pip install --system -e /opt/vllm-omni",
        ]
    )
)


def build_omni_app():
    """A single model that returns text + speech (Qwen2.5-Omni-3B via vllm-omni)."""
    return flyte.app.AppEnvironment(
        name="cs-omni",
        image=omni_image,
        # Raw vllm OpenAI server with omni audio output enabled.
        # vllm-omni runs each stage (thinker + talker) as a SEPARATE engine on the
        # SAME GPU, and each applies --gpu-memory-utilization to the whole device. So
        # the stages must share: 0.45 each (~0.90 total) leaves room for both. The
        # thinker model alone is ~8.8 GB, so the 24 GB L4 is too tight for two stages
        # with usable KV cache; the 48 GB L40S fits both comfortably.
        command=[
            "bash",
            "-lc",
            "export PATH=/opt/venv/bin:/usr/local/bin:$PATH; "
            f"exec vllm serve {OMNI_HF_MODEL} --omni --trust-remote-code "
            f"--served-model-name {OMNI_MODEL_ID} --port 8080 "
            "--gpu-memory-utilization 0.45 --max-model-len 8192",
        ],
        # This runtime image has the CUDA *runtime* libs (from torch) but no CUDA
        # *toolkit* (nvcc / CUDA_HOME). Several vLLM kernels JIT-compile at startup and
        # assert a toolkit is present, killing the engine core. Disable those so they
        # use prebuilt/native paths: the flashinfer sampler and deep_gemm. (The crash
        # was never about RAM or GPU size: L4 and L40S failed identically.)
        env_vars={"VLLM_USE_FLASHINFER_SAMPLER": "0", "VLLM_USE_DEEP_GEMM": "0"},
        port=8080,
        # L40S (g6e.12xlarge): 48 GB GPU fits both omni stages; big node so cpu/mem/disk
        # requests schedule freely. (Earlier L40S attempt failed only at the now-fixed
        # flashinfer error, before reaching this two-stage memory split.)
        resources=flyte.Resources(cpu="12", memory="48Gi", gpu="L40s:1", disk="60Gi"),
        requires_auth=False,
        scaling=flyte.app.Scaling(replicas=(1, 1)),
    )


# ---------------------------------------------------------------------------
# 2. The voice UI: FastAPI serving the page + proxying to the LLM
# ---------------------------------------------------------------------------

# {{docs-fragment system_prompt}}
SYSTEM_PROMPT = (
    "You are Ava, a warm, efficient customer-support agent for 'Northwind', a "
    "consumer electronics company. Your replies are spoken aloud in a live phone-"
    "like call, so keep them very short (1-2 sentences), natural, and free of "
    "markdown, lists, or emoji. Get to the point in the first sentence. Ask one "
    "clarifying question at a time. 
The caller may interrupt you at any moment; if " "they do, stop and listen. If you don't know an account-specific detail, say " "you'll look into it rather than inventing facts." ) # {{/docs-fragment system_prompt}} # The LLM endpoint is injected at deploy time (see __main__) via this env var. LLM_BASE_URL = os.environ.get("LLM_BASE_URL", "") # {{docs-fragment backends}} # Optional model switcher. Set LLM_BACKENDS to a comma-separated list of "Label|https://url" # pairs — each url is its own vLLM app — and the UI shows a dropdown to route between them. # Serving another model is just another Flyte app, so this is the whole "switch models" story. # When unset, the single LLM_BASE_URL above is used and no switcher appears (default demo). LLM_BACKENDS = os.environ.get("LLM_BACKENDS", "") # Served-model-id per backend url, cached so each vLLM app is asked at most once. _model_cache: dict = {} def _backends() -> list: """The list of {label, url} chat backends; a single Default unless LLM_BACKENDS is set.""" pairs = [] for item in LLM_BACKENDS.split(","): label, sep, url = item.partition("|") if sep and url.strip(): pairs.append({"label": label.strip(), "url": url.strip().rstrip("/")}) if pairs: return pairs base = os.environ.get("LLM_BASE_URL", LLM_BASE_URL).rstrip("/") return [{"label": "Default", "url": base}] if base else [] def _pick_backend(label: str | None) -> dict | None: """Choose a backend by label, falling back to the first configured one.""" backends = _backends() return next((b for b in backends if b["label"] == label), backends[0] if backends else None) async def _model_id_for(base: str) -> str: """Ask a vLLM backend which model id it serves (cached); fall back to MODEL_ID.""" if not base: return MODEL_ID if base not in _model_cache: mid = MODEL_ID try: async with httpx.AsyncClient(timeout=4.0) as client: r = await client.get(f"{base}/v1/models") r.raise_for_status() mid = ((r.json().get("data") or [{}])[0].get("id")) or MODEL_ID except Exception: mid = 
MODEL_ID _model_cache[base] = mid return _model_cache[base] # {{/docs-fragment backends}} # TTS configuration. # TTS_MODE: "both" (show the in-UI switch) | "browser" | "server" (lock one mode) # TTS_VOICE: a Kokoro voice id; af_heart is a warm female voice that fits "Ava". TTS_MODE = os.environ.get("TTS_MODE", "both") TTS_VOICE = os.environ.get("TTS_VOICE", "af_heart") # Omni (combined LLM+TTS) backend — Qwen2.5-Omni via vllm-omni. Injected at # deploy time; when set, the UI exposes an "Omni" engine that does chat+speech in # one call. OMNI_SAMPLE_RATE is used only if the model returns raw PCM (no header). OMNI_BASE_URL = os.environ.get("OMNI_BASE_URL", "") OMNI_MODEL_ID = os.environ.get("OMNI_MODEL_ID", "omni") OMNI_SAMPLE_RATE = int(os.environ.get("OMNI_SAMPLE_RATE", "24000")) # Kokoro is loaded lazily/once at startup (heavy torch import) and only when the # server-side TTS path is enabled. Stored on app state so requests reuse it. _tts_state: dict = {"pipeline": None, "error": None} # Kokoro synthesis is CPU-bound; running several at once just thrashes the cores # and makes each one slower. Serialize so every clause stays fast (~0.5s) even if # the client's prefetch ever overlaps two requests. _synth_sem = asyncio.Semaphore(1) def _load_kokoro(): """Build the Kokoro pipeline once and warm it. Returns the pipeline or raises.""" from kokoro import KPipeline # heavy (torch); imported only when serving TTS pipeline = KPipeline(lang_code="a") # 'a' = American English # Warm-up: the first synth compiles/caches; do it now so real calls are fast. 
for _ in pipeline("Hello.", voice=TTS_VOICE): pass return pipeline def _synth(text: str): """Run Kokoro and return concatenated 24 kHz float32 audio (numpy).""" import numpy as np pipeline = _tts_state["pipeline"] chunks = [audio for _, _, audio in pipeline(text, voice=TTS_VOICE)] if not chunks: return np.zeros(1, dtype="float32") return np.concatenate(chunks).astype("float32") def _wav_bytes(audio, sr: int = 24000) -> bytes: import soundfile as sf buf = io.BytesIO() sf.write(buf, audio, sr, format="WAV", subtype="PCM_16") return buf.getvalue() fastapi_app = FastAPI(title="Northwind Voice Support") @fastapi_app.on_event("startup") async def _startup(): # Load Kokoro unless TTS is browser-only (then we skip the heavy import). if TTS_MODE == "browser": return try: _tts_state["pipeline"] = await asyncio.to_thread(_load_kokoro) except Exception as e: # keep the app up; server-TTS just stays unavailable _tts_state["error"] = f"{type(e).__name__}: {e}" @fastapi_app.get("/healthz") async def healthz(): return { "ok": True, "llm": LLM_BASE_URL or "unset", "tts_mode": TTS_MODE, "tts_ready": _tts_state["pipeline"] is not None, "tts_error": _tts_state["error"], "omni": OMNI_BASE_URL or "unset", } @fastapi_app.get("/api/config") async def config(): """Tells the browser which TTS modes / engines / model backends are available.""" return { "tts_mode": TTS_MODE, "tts_ready": _tts_state["pipeline"] is not None, "omni_ready": bool(OMNI_BASE_URL), "backends": [b["label"] for b in _backends()], } # {{docs-fragment backend_status}} @fastapi_app.get("/api/backend") async def backend_status(req: Request): """Liveness of a chat backend, for the "model warm / waking" pill. Pings the vLLM app's ``/v1/models``. A quick OK means a warm replica is already serving; a failure or a timeout is the cold start you'd see with ``Scaling(replicas=(0, 1))``. 
""" chosen = _pick_backend(req.query_params.get("backend")) base = (chosen or {}).get("url", "") if not base: return {"up": False, "model": None} try: async with httpx.AsyncClient(timeout=4.0) as client: r = await client.get(f"{base}/v1/models") r.raise_for_status() data = r.json() return {"up": True, "model": (data.get("data") or [{}])[0].get("id")} except Exception: return {"up": False, "model": None} # {{/docs-fragment backend_status}} # {{docs-fragment tts_endpoint}} @fastapi_app.post("/api/tts") async def tts(req: Request): """Synthesize speech for one clause with Kokoro; returns a 24 kHz WAV. The X-Synth-Ms response header carries the measured server-side synthesis time so the client can display/compare latency. """ body = await req.json() text = (body.get("text") or "").strip() if not text: return Response(status_code=204) if _tts_state["pipeline"] is None: return Response(status_code=503, content=_tts_state["error"] or "TTS not ready") t0 = time.perf_counter() async with _synth_sem: audio = await asyncio.to_thread(_synth, text) wav = await asyncio.to_thread(_wav_bytes, audio) synth_ms = int((time.perf_counter() - t0) * 1000) return Response(content=wav, media_type="audio/wav", headers={"X-Synth-Ms": str(synth_ms)}) # {{/docs-fragment tts_endpoint}} # {{docs-fragment chat_proxy}} @fastapi_app.post("/api/chat") async def chat(req: Request): """Proxy a chat turn to the selected vLLM backend and stream the text reply back.""" body = await req.json() history = body.get("messages", []) chosen = _pick_backend(body.get("backend")) base = (chosen or {}).get("url", "") payload = { "model": await _model_id_for(base), "messages": [{"role": "system", "content": SYSTEM_PROMPT}, *history], "stream": True, "max_tokens": 200, "temperature": 0.3, } async def gen(): url = f"{base}/v1/chat/completions" async with httpx.AsyncClient(timeout=120.0) as client: async with client.stream("POST", url, json=payload) as r: r.raise_for_status() async for line in r.aiter_lines(): if not 
line.startswith("data:"): continue data = line[len("data:") :].strip() if data == "[DONE]": break try: delta = json.loads(data)["choices"][0]["delta"].get("content") except (json.JSONDecodeError, KeyError, IndexError): continue if delta: yield delta return StreamingResponse(gen(), media_type="text/plain") # {{/docs-fragment chat_proxy}} def _omni_extract(data: dict) -> tuple[str, bytes]: """Pull (reply_text, wav_bytes) out of a Qwen2.5-Omni chat-completion response. The omni audio field shape isn't fully documented, so be defensive: text is in choices[0]; audio is in some later choice's message.audio, as either a base64 string or a dict with a base64 ``data`` field. If the decoded bytes are already a WAV (RIFF) we pass them through; otherwise we assume raw PCM16 and add a header. """ choices = data.get("choices") or [] text = "" audio_b64 = None for ch in choices: msg = ch.get("message") or {} if not text and msg.get("content"): text = msg["content"] aud = msg.get("audio") if aud is not None and audio_b64 is None: audio_b64 = aud.get("data") if isinstance(aud, dict) else aud if isinstance(aud, dict) and not text and aud.get("transcript"): text = aud["transcript"] if not audio_b64: raise ValueError("no audio in omni response") raw = base64.b64decode(audio_b64) if raw[:4] == b"RIFF": return text, raw # already a WAV container # Raw PCM16 -> wrap in a WAV header at the configured sample rate. import numpy as np pcm = np.frombuffer(raw, dtype="phoneme runtime dep # CPU torch wheel keeps the image far smaller than the default CUDA build. .with_pip_packages("torch", index_url="https://download.pytorch.org/whl/cpu") .with_pip_packages( "fastapi", "uvicorn", "httpx", "kokoro>=0.9.2", "soundfile", "numpy" ) # Kokoro's G2P (misaki) needs spaCy's en_core_web_sm. 
.with_commands(["python -m spacy download en_core_web_sm"]) ) # {{/docs-fragment ui_image}} # {{docs-fragment ui_app}} ui_app = FastAPIAppEnvironment( name="cs-voice-ui", app=fastapi_app, description="Browser voice UI for the Qwen customer-service agent (browser + Kokoro TTS)", image=ui_image, # Bumped for torch + the Kokoro model living in memory. resources=flyte.Resources(cpu="6", memory="8Gi"), requires_auth=False, scaling=flyte.app.Scaling(replicas=(1, 1)), ) # {{/docs-fragment ui_app}} # --------------------------------------------------------------------------- # Single-page voice UI (Web Speech API: SpeechRecognition + speechSynthesis) # --------------------------------------------------------------------------- INDEX_HTML = """ Northwind Voice Support

◆ Northwind Voice Support

App Model Served on Union

""" # --------------------------------------------------------------------------- # Deploy driver # --------------------------------------------------------------------------- if __name__ == "__main__": import argparse parser = argparse.ArgumentParser() parser.add_argument("target", choices=["llm", "omni", "ui"]) parser.add_argument("--llm-url", default=os.environ.get("LLM_BASE_URL", "")) parser.add_argument("--omni-url", default=os.environ.get("OMNI_BASE_URL", "")) args = parser.parse_args() # Reads your default Flyte config; uses the remote image builder (no local Docker needed). flyte.init_from_config(image_builder="remote") if args.target == "llm": if llm_app is None: sys.exit("flyteplugins-vllm not importable; run `uv pip install -e plugins/vllm --no-deps`") # GPU provisioning + image build + weight download can take a while. app = flyte.with_servecontext(activate_timeout=1800.0).serve(llm_app) print(f"LLM app: {app.url}") elif args.target == "omni": # vllm-omni builds from source + downloads a multimodal model — be patient. app = flyte.with_servecontext(activate_timeout=1800.0).serve(build_omni_app()) print(f"Omni app: {app.url}") else: if not args.llm_url: sys.exit("--llm-url (or LLM_BASE_URL) is required for the ui target") # Bake the backend endpoints into the app's container env so the proxies can reach them. env = {**(ui_app.env_vars or {}), "LLM_BASE_URL": args.llm_url} if args.omni_url: env["OMNI_BASE_URL"] = args.omni_url ui_app.env_vars = env app = flyte.serve(ui_app) print(f"Voice UI: {app.url}") ``` *Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/voice_customer_service/app.py* ## The agent and the proxy The agent's whole personality is a system prompt. Because the replies are spoken aloud in something that feels like a phone call, it asks for short, plain sentences, one question at a time, and tells the model to say it will look into account-specific details rather than invent them. 
``` """ Voice customer-service agent — talk in the browser, it talks back. A two-app Flyte demo: * ``llm_app`` — a small, fast Qwen instruct model served with vLLM on an L4 GPU (OpenAI-compatible API). This is the "brain". * ``ui_app`` — a tiny FastAPI app that serves a single-page voice UI and proxies chat requests to ``llm_app``. Speech-to-text and text-to-speech happen **in the browser** via the Web Speech API, so there is no audio model to host: the mic is transcribed locally, the text goes to the LLM, and the reply is spoken locally. That keeps latency low and the GPU footprint tiny (a 3B model on one L4). 🎤 browser STT ──► /api/chat (FastAPI proxy) ──► vLLM /v1 (Qwen on L4) │ streamed tokens 🔊 browser TTS ◄── streamed text ◄────────────────────┘ The UI is served over HTTPS from the Flyte app, which is what lets the browser grant microphone access and use speech recognition (both require a secure context). The proxy means the browser only ever talks to its own origin, so there are no CORS headaches. Deploy ------ # 1. Bring up the GPU model server (long pole: provisions an L4 + pulls weights) python app.py llm # 2. Bring up the voice UI, pointed at the LLM from step 1 python app.py ui --llm-url Then open the printed UI url in Chrome and click the mic. """ from __future__ import annotations import asyncio import base64 import io import json import os import sys import time import httpx from fastapi import FastAPI, Request, Response from fastapi.responses import HTMLResponse, StreamingResponse import flyte import flyte.app from flyte.app.extras import FastAPIAppEnvironment # NOTE: `flyteplugins.vllm` is imported lazily inside build_llm_app() rather than # at module top. This module is loaded by BOTH app containers; the lightweight UI # image does not install flyteplugins-vllm, so a top-level import would crash the # UI app on startup. # --------------------------------------------------------------------------- # 1. 
The LLM: small, fast Qwen instruct model on vLLM / L4 # # Qwen2.5-3B-Instruct is a good "quality is OK, latency matters" pick: ~6 GB in # bf16, trivially fits a 24 GB L4, and decodes fast enough that the browser's # TTS is the pacing factor, not the model. vLLM downloads the weights straight # from the Hugging Face hub (the model is public — no token needed). # --------------------------------------------------------------------------- MODEL_ID = "qwen" # Pin the serving image. The plugin's default image pins vllm==0.11.0 but not # transformers, and the newest transformers breaks vllm 0.11's tokenizer caching # (AttributeError: Qwen2Tokenizer has no attribute all_special_tokens_extended). # transformers==4.57.6 is the version the repo's own vLLM example uses. # {{docs-fragment vllm_image}} vllm_image = ( flyte.Image.from_debian_base(name="vllm-app-image", install_flyte=False) .with_pip_packages("flashinfer-python", "flashinfer-cubin") .with_pip_packages("flashinfer-jit-cache", index_url="https://flashinfer.ai/whl/cu129") .with_pip_packages("flyteplugins-vllm") .with_pip_packages("vllm==0.11.0", "transformers==4.57.6") ) # {{/docs-fragment vllm_image}} # {{docs-fragment llm_app}} try: from flyteplugins.vllm import VLLMAppEnvironment llm_app = VLLMAppEnvironment( name="cs-qwen-llm", model_id=MODEL_ID, model_hf_path="Qwen/Qwen2.5-3B-Instruct", image=vllm_image, resources=flyte.Resources(cpu="6", memory="20Gi", gpu="L4:1", disk="40Gi"), # One warm replica so there's no cold start mid-demo. Flip to (0, 1) + # scaledown_after to save the GPU when idle, at the cost of a cold start. scaling=flyte.app.Scaling(replicas=(1, 1)), requires_auth=False, extra_args=[ # Short context keeps the KV cache small and latency low; a customer # service turn is tiny. "--max-model-len", "8192", "--max-num-seqs", "16", ], ) except ImportError: llm_app = None # flyteplugins-vllm not installed (e.g. 
the UI container) # {{/docs-fragment llm_app}} # --------------------------------------------------------------------------- # 1b. The combined app: ONE model that does LLM + speech (Qwen2.5-Omni-3B) # # Qwen2.5-Omni uses a Thinker-Talker architecture: a single # /v1/chat/completions call with "modalities": ["audio"] returns BOTH the text # reply and synthesized speech. Served by vllm-omni (a separate vLLM project that # adds omni-modality output) — NOT the flyteplugins-vllm plugin, which pins an # older vLLM without omni support. We run the OpenAI server via a custom # container `command`, which bypasses Flyte's default fserve entrypoint. # --------------------------------------------------------------------------- OMNI_HF_MODEL = "Qwen/Qwen2.5-Omni-3B" OMNI_MODEL_ID = "omni" # vllm-omni installs from source on top of vLLM 0.23.0 (see its quickstart). # CRITICAL: pin --torch-backend=cu130 (NOT auto). The remote image builder has no # GPU, so `auto` resolves to CPU torch (torch+cpu) and vllm._C then fails with # `libcudart.so.13: cannot open shared object file`. The demo L4 nodes run driver # 580 / CUDA 13, so cu130 is the right GPU build. No separate flashinfer (the old # cu129 wheels are CUDA 12.9 and conflict with the CUDA-13 stack). omni_image = ( flyte.Image.from_debian_base(name="vllm-omni-server", install_flyte=False) .with_apt_packages("git") .with_commands( [ "uv pip install --system vllm==0.23.0 --torch-backend=cu130", "git clone https://github.com/vllm-project/vllm-omni.git /opt/vllm-omni", "uv pip install --system -e /opt/vllm-omni", ] ) ) def build_omni_app(): """A single model that returns text + speech (Qwen2.5-Omni-3B via vllm-omni).""" return flyte.app.AppEnvironment( name="cs-omni", image=omni_image, # Raw vllm OpenAI server with omni audio output enabled. # vllm-omni runs each stage (thinker + talker) as a SEPARATE engine on the # SAME GPU, and each applies --gpu-memory-utilization to the whole device. 
So # the stages must share: 0.45 each (~0.90 total) leaves room for both. The # thinker model alone is ~8.8 GB, so the 24 GB L4 is too tight for two stages # with usable KV cache — the 48 GB L40S fits both comfortably. command=[ "bash", "-lc", "export PATH=/opt/venv/bin:/usr/local/bin:$PATH; " f"exec vllm serve {OMNI_HF_MODEL} --omni --trust-remote-code " f"--served-model-name {OMNI_MODEL_ID} --port 8080 " "--gpu-memory-utilization 0.45 --max-model-len 8192", ], # This runtime image has the CUDA *runtime* libs (from torch) but no CUDA # *toolkit* (nvcc / CUDA_HOME). Several vLLM kernels JIT-compile at startup and # assert a toolkit is present, killing the engine core. Disable those so they # use prebuilt/native paths: the flashinfer sampler and deep_gemm. (The crash # was never RAM/GPU size — L4 and L40S failed identically — so we use the L4.) env_vars={"VLLM_USE_FLASHINFER_SAMPLER": "0", "VLLM_USE_DEEP_GEMM": "0"}, port=8080, # L40S (g6e.12xlarge): 48 GB GPU fits both omni stages; big node so cpu/mem/disk # requests schedule freely. (Earlier L40S attempt failed only at the now-fixed # flashinfer error, before reaching this two-stage memory split.) resources=flyte.Resources(cpu="12", memory="48Gi", gpu="L40s:1", disk="60Gi"), requires_auth=False, scaling=flyte.app.Scaling(replicas=(1, 1)), ) # --------------------------------------------------------------------------- # 2. The voice UI: FastAPI serving the page + proxying to the LLM # --------------------------------------------------------------------------- # {{docs-fragment system_prompt}} SYSTEM_PROMPT = ( "You are Ava, a warm, efficient customer-support agent for 'Northwind', a " "consumer electronics company. Your replies are spoken aloud in a live phone-" "like call, so keep them very short (1-2 sentences), natural, and free of " "markdown, lists, or emoji. Get to the point in the first sentence. Ask one " "clarifying question at a time. 
The caller may interrupt you at any moment; if " "they do, stop and listen. If you don't know an account-specific detail, say " "you'll look into it rather than inventing facts." ) # {{/docs-fragment system_prompt}} # The LLM endpoint is injected at deploy time (see __main__) via this env var. LLM_BASE_URL = os.environ.get("LLM_BASE_URL", "") # {{docs-fragment backends}} # Optional model switcher. Set LLM_BACKENDS to a comma-separated list of "Label|https://url" # pairs — each url is its own vLLM app — and the UI shows a dropdown to route between them. # Serving another model is just another Flyte app, so this is the whole "switch models" story. # When unset, the single LLM_BASE_URL above is used and no switcher appears (default demo). LLM_BACKENDS = os.environ.get("LLM_BACKENDS", "") # Served-model-id per backend url, cached so each vLLM app is asked at most once. _model_cache: dict = {} def _backends() -> list: """The list of {label, url} chat backends; a single Default unless LLM_BACKENDS is set.""" pairs = [] for item in LLM_BACKENDS.split(","): label, sep, url = item.partition("|") if sep and url.strip(): pairs.append({"label": label.strip(), "url": url.strip().rstrip("/")}) if pairs: return pairs base = os.environ.get("LLM_BASE_URL", LLM_BASE_URL).rstrip("/") return [{"label": "Default", "url": base}] if base else [] def _pick_backend(label: str | None) -> dict | None: """Choose a backend by label, falling back to the first configured one.""" backends = _backends() return next((b for b in backends if b["label"] == label), backends[0] if backends else None) async def _model_id_for(base: str) -> str: """Ask a vLLM backend which model id it serves (cached); fall back to MODEL_ID.""" if not base: return MODEL_ID if base not in _model_cache: mid = MODEL_ID try: async with httpx.AsyncClient(timeout=4.0) as client: r = await client.get(f"{base}/v1/models") r.raise_for_status() mid = ((r.json().get("data") or [{}])[0].get("id")) or MODEL_ID except Exception: mid = 
MODEL_ID _model_cache[base] = mid return _model_cache[base] # {{/docs-fragment backends}} # TTS configuration. # TTS_MODE: "both" (show the in-UI switch) | "browser" | "server" (lock one mode) # TTS_VOICE: a Kokoro voice id; af_heart is a warm female voice that fits "Ava". TTS_MODE = os.environ.get("TTS_MODE", "both") TTS_VOICE = os.environ.get("TTS_VOICE", "af_heart") # Omni (combined LLM+TTS) backend — Qwen2.5-Omni via vllm-omni. Injected at # deploy time; when set, the UI exposes an "Omni" engine that does chat+speech in # one call. OMNI_SAMPLE_RATE is used only if the model returns raw PCM (no header). OMNI_BASE_URL = os.environ.get("OMNI_BASE_URL", "") OMNI_MODEL_ID = os.environ.get("OMNI_MODEL_ID", "omni") OMNI_SAMPLE_RATE = int(os.environ.get("OMNI_SAMPLE_RATE", "24000")) # Kokoro is loaded lazily/once at startup (heavy torch import) and only when the # server-side TTS path is enabled. Stored on app state so requests reuse it. _tts_state: dict = {"pipeline": None, "error": None} # Kokoro synthesis is CPU-bound; running several at once just thrashes the cores # and makes each one slower. Serialize so every clause stays fast (~0.5s) even if # the client's prefetch ever overlaps two requests. _synth_sem = asyncio.Semaphore(1) def _load_kokoro(): """Build the Kokoro pipeline once and warm it. Returns the pipeline or raises.""" from kokoro import KPipeline # heavy (torch); imported only when serving TTS pipeline = KPipeline(lang_code="a") # 'a' = American English # Warm-up: the first synth compiles/caches; do it now so real calls are fast. 
for _ in pipeline("Hello.", voice=TTS_VOICE): pass return pipeline def _synth(text: str): """Run Kokoro and return concatenated 24 kHz float32 audio (numpy).""" import numpy as np pipeline = _tts_state["pipeline"] chunks = [audio for _, _, audio in pipeline(text, voice=TTS_VOICE)] if not chunks: return np.zeros(1, dtype="float32") return np.concatenate(chunks).astype("float32") def _wav_bytes(audio, sr: int = 24000) -> bytes: import soundfile as sf buf = io.BytesIO() sf.write(buf, audio, sr, format="WAV", subtype="PCM_16") return buf.getvalue() fastapi_app = FastAPI(title="Northwind Voice Support") @fastapi_app.on_event("startup") async def _startup(): # Load Kokoro unless TTS is browser-only (then we skip the heavy import). if TTS_MODE == "browser": return try: _tts_state["pipeline"] = await asyncio.to_thread(_load_kokoro) except Exception as e: # keep the app up; server-TTS just stays unavailable _tts_state["error"] = f"{type(e).__name__}: {e}" @fastapi_app.get("/healthz") async def healthz(): return { "ok": True, "llm": LLM_BASE_URL or "unset", "tts_mode": TTS_MODE, "tts_ready": _tts_state["pipeline"] is not None, "tts_error": _tts_state["error"], "omni": OMNI_BASE_URL or "unset", } @fastapi_app.get("/api/config") async def config(): """Tells the browser which TTS modes / engines / model backends are available.""" return { "tts_mode": TTS_MODE, "tts_ready": _tts_state["pipeline"] is not None, "omni_ready": bool(OMNI_BASE_URL), "backends": [b["label"] for b in _backends()], } # {{docs-fragment backend_status}} @fastapi_app.get("/api/backend") async def backend_status(req: Request): """Liveness of a chat backend, for the "model warm / waking" pill. Pings the vLLM app's ``/v1/models``. A quick OK means a warm replica is already serving; a failure or a timeout is the cold start you'd see with ``Scaling(replicas=(0, 1))``. 
""" chosen = _pick_backend(req.query_params.get("backend")) base = (chosen or {}).get("url", "") if not base: return {"up": False, "model": None} try: async with httpx.AsyncClient(timeout=4.0) as client: r = await client.get(f"{base}/v1/models") r.raise_for_status() data = r.json() return {"up": True, "model": (data.get("data") or [{}])[0].get("id")} except Exception: return {"up": False, "model": None} # {{/docs-fragment backend_status}} # {{docs-fragment tts_endpoint}} @fastapi_app.post("/api/tts") async def tts(req: Request): """Synthesize speech for one clause with Kokoro; returns a 24 kHz WAV. The X-Synth-Ms response header carries the measured server-side synthesis time so the client can display/compare latency. """ body = await req.json() text = (body.get("text") or "").strip() if not text: return Response(status_code=204) if _tts_state["pipeline"] is None: return Response(status_code=503, content=_tts_state["error"] or "TTS not ready") t0 = time.perf_counter() async with _synth_sem: audio = await asyncio.to_thread(_synth, text) wav = await asyncio.to_thread(_wav_bytes, audio) synth_ms = int((time.perf_counter() - t0) * 1000) return Response(content=wav, media_type="audio/wav", headers={"X-Synth-Ms": str(synth_ms)}) # {{/docs-fragment tts_endpoint}} # {{docs-fragment chat_proxy}} @fastapi_app.post("/api/chat") async def chat(req: Request): """Proxy a chat turn to the selected vLLM backend and stream the text reply back.""" body = await req.json() history = body.get("messages", []) chosen = _pick_backend(body.get("backend")) base = (chosen or {}).get("url", "") payload = { "model": await _model_id_for(base), "messages": [{"role": "system", "content": SYSTEM_PROMPT}, *history], "stream": True, "max_tokens": 200, "temperature": 0.3, } async def gen(): url = f"{base}/v1/chat/completions" async with httpx.AsyncClient(timeout=120.0) as client: async with client.stream("POST", url, json=payload) as r: r.raise_for_status() async for line in r.aiter_lines(): if not 
line.startswith("data:"): continue data = line[len("data:") :].strip() if data == "[DONE]": break try: delta = json.loads(data)["choices"][0]["delta"].get("content") except (json.JSONDecodeError, KeyError, IndexError): continue if delta: yield delta return StreamingResponse(gen(), media_type="text/plain") # {{/docs-fragment chat_proxy}} def _omni_extract(data: dict) -> tuple[str, bytes]: """Pull (reply_text, wav_bytes) out of a Qwen2.5-Omni chat-completion response. The omni audio field shape isn't fully documented, so be defensive: text is in choices[0]; audio is in some later choice's message.audio, as either a base64 string or a dict with a base64 ``data`` field. If the decoded bytes are already a WAV (RIFF) we pass them through; otherwise we assume raw PCM16 and add a header. """ choices = data.get("choices") or [] text = "" audio_b64 = None for ch in choices: msg = ch.get("message") or {} if not text and msg.get("content"): text = msg["content"] aud = msg.get("audio") if aud is not None and audio_b64 is None: audio_b64 = aud.get("data") if isinstance(aud, dict) else aud if isinstance(aud, dict) and not text and aud.get("transcript"): text = aud["transcript"] if not audio_b64: raise ValueError("no audio in omni response") raw = base64.b64decode(audio_b64) if raw[:4] == b"RIFF": return text, raw # already a WAV container # Raw PCM16 -> wrap in a WAV header at the configured sample rate. import numpy as np pcm = np.frombuffer(raw, dtype="phoneme runtime dep # CPU torch wheel keeps the image far smaller than the default CUDA build. .with_pip_packages("torch", index_url="https://download.pytorch.org/whl/cpu") .with_pip_packages( "fastapi", "uvicorn", "httpx", "kokoro>=0.9.2", "soundfile", "numpy" ) # Kokoro's G2P (misaki) needs spaCy's en_core_web_sm. 
.with_commands(["python -m spacy download en_core_web_sm"]) ) # {{/docs-fragment ui_image}} # {{docs-fragment ui_app}} ui_app = FastAPIAppEnvironment( name="cs-voice-ui", app=fastapi_app, description="Browser voice UI for the Qwen customer-service agent (browser + Kokoro TTS)", image=ui_image, # Bumped for torch + the Kokoro model living in memory. resources=flyte.Resources(cpu="6", memory="8Gi"), requires_auth=False, scaling=flyte.app.Scaling(replicas=(1, 1)), ) # {{/docs-fragment ui_app}} # --------------------------------------------------------------------------- # Single-page voice UI (Web Speech API: SpeechRecognition + speechSynthesis) # --------------------------------------------------------------------------- INDEX_HTML = """ Northwind Voice Support

<!-- Page text: "◆ Northwind Voice Support", "App", "Model", "Served on Union" (page markup, styles, and script not reproduced in this listing) -->
""" # --------------------------------------------------------------------------- # Deploy driver # --------------------------------------------------------------------------- if __name__ == "__main__": import argparse parser = argparse.ArgumentParser() parser.add_argument("target", choices=["llm", "omni", "ui"]) parser.add_argument("--llm-url", default=os.environ.get("LLM_BASE_URL", "")) parser.add_argument("--omni-url", default=os.environ.get("OMNI_BASE_URL", "")) args = parser.parse_args() # Reads your default Flyte config; uses the remote image builder (no local Docker needed). flyte.init_from_config(image_builder="remote") if args.target == "llm": if llm_app is None: sys.exit("flyteplugins-vllm not importable; run `uv pip install -e plugins/vllm --no-deps`") # GPU provisioning + image build + weight download can take a while. app = flyte.with_servecontext(activate_timeout=1800.0).serve(llm_app) print(f"LLM app: {app.url}") elif args.target == "omni": # vllm-omni builds from source + downloads a multimodal model — be patient. app = flyte.with_servecontext(activate_timeout=1800.0).serve(build_omni_app()) print(f"Omni app: {app.url}") else: if not args.llm_url: sys.exit("--llm-url (or LLM_BASE_URL) is required for the ui target") # Bake the backend endpoints into the app's container env so the proxies can reach them. env = {**(ui_app.env_vars or {}), "LLM_BASE_URL": args.llm_url} if args.omni_url: env["OMNI_BASE_URL"] = args.omni_url ui_app.env_vars = env app = flyte.serve(ui_app) print(f"Voice UI: {app.url}") ``` *Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/voice_customer_service/app.py* The proxy is what the browser actually calls. It injects the system prompt, forwards the turn to the selected model backend, and streams the reply back as plain text token by token. Keeping the model behind this proxy is what lets the browser talk only to its own origin. ``` """ Voice customer-service agent — talk in the browser, it talks back. 
"""


@fastapi_app.post("/api/chat")
async def chat(req: Request):
    """Proxy a chat turn to the selected vLLM backend and stream the text reply back."""
    body = await req.json()
    history = body.get("messages", [])
    chosen = _pick_backend(body.get("backend"))
    base = (chosen or {}).get("url", "")
    payload = {
        "model": await _model_id_for(base),
        "messages": [{"role": "system", "content": SYSTEM_PROMPT}, *history],
        "stream": True,
        "max_tokens": 200,
        "temperature": 0.3,
    }

    async def gen():
        url = f"{base}/v1/chat/completions"
        async with httpx.AsyncClient(timeout=120.0) as client:
            async with client.stream("POST", url, json=payload) as r:
                r.raise_for_status()
                async for line in r.aiter_lines():
                    if not line.startswith("data:"):
                        continue
                    data = line[len("data:") :].strip()
                    if data == "[DONE]":
                        break
                    try:
                        delta = json.loads(data)["choices"][0]["delta"].get("content")
                    except (json.JSONDecodeError, KeyError, IndexError):
                        continue
                    if delta:
                        yield delta

    return StreamingResponse(gen(), media_type="text/plain")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/voice_customer_service/app.py*

## Speech in, speech out

Speech recognition happens in the browser through the Web Speech API, so the microphone is transcribed locally and only text is sent to the model. This needs Chrome or Edge.

For the reply, the page offers two voices you can switch between live:

- **Browser**, using the built-in `speechSynthesis`.
It is the lowest latency, but its audio is not echo-cancelled, so it is best with headphones. - **Server, using Kokoro**, a neural voice served by the UI app. Its audio plays through the Web Audio graph, which the browser's echo canceller can subtract, so it works on open speakers without the agent interrupting itself. The page defaults to this when it is ready. The server voice is one endpoint. It synthesizes a clause of speech with Kokoro and returns a WAV, with the measured synthesis time in a response header so the page can show it. ``` """ Voice customer-service agent — talk in the browser, it talks back. A two-app Flyte demo: * ``llm_app`` — a small, fast Qwen instruct model served with vLLM on an L4 GPU (OpenAI-compatible API). This is the "brain". * ``ui_app`` — a tiny FastAPI app that serves a single-page voice UI and proxies chat requests to ``llm_app``. Speech-to-text and text-to-speech happen **in the browser** via the Web Speech API, so there is no audio model to host: the mic is transcribed locally, the text goes to the LLM, and the reply is spoken locally. That keeps latency low and the GPU footprint tiny (a 3B model on one L4). 🎤 browser STT ──► /api/chat (FastAPI proxy) ──► vLLM /v1 (Qwen on L4) │ streamed tokens 🔊 browser TTS ◄── streamed text ◄────────────────────┘ The UI is served over HTTPS from the Flyte app, which is what lets the browser grant microphone access and use speech recognition (both require a secure context). The proxy means the browser only ever talks to its own origin, so there are no CORS headaches. Deploy ------ # 1. Bring up the GPU model server (long pole: provisions an L4 + pulls weights) python app.py llm # 2. Bring up the voice UI, pointed at the LLM from step 1 python app.py ui --llm-url Then open the printed UI url in Chrome and click the mic. 
""" from __future__ import annotations import asyncio import base64 import io import json import os import sys import time import httpx from fastapi import FastAPI, Request, Response from fastapi.responses import HTMLResponse, StreamingResponse import flyte import flyte.app from flyte.app.extras import FastAPIAppEnvironment # NOTE: `flyteplugins.vllm` is imported lazily inside build_llm_app() rather than # at module top. This module is loaded by BOTH app containers; the lightweight UI # image does not install flyteplugins-vllm, so a top-level import would crash the # UI app on startup. # --------------------------------------------------------------------------- # 1. The LLM: small, fast Qwen instruct model on vLLM / L4 # # Qwen2.5-3B-Instruct is a good "quality is OK, latency matters" pick: ~6 GB in # bf16, trivially fits a 24 GB L4, and decodes fast enough that the browser's # TTS is the pacing factor, not the model. vLLM downloads the weights straight # from the Hugging Face hub (the model is public — no token needed). # --------------------------------------------------------------------------- MODEL_ID = "qwen" # Pin the serving image. The plugin's default image pins vllm==0.11.0 but not # transformers, and the newest transformers breaks vllm 0.11's tokenizer caching # (AttributeError: Qwen2Tokenizer has no attribute all_special_tokens_extended). # transformers==4.57.6 is the version the repo's own vLLM example uses. 
# {{docs-fragment vllm_image}} vllm_image = ( flyte.Image.from_debian_base(name="vllm-app-image", install_flyte=False) .with_pip_packages("flashinfer-python", "flashinfer-cubin") .with_pip_packages("flashinfer-jit-cache", index_url="https://flashinfer.ai/whl/cu129") .with_pip_packages("flyteplugins-vllm") .with_pip_packages("vllm==0.11.0", "transformers==4.57.6") ) # {{/docs-fragment vllm_image}} # {{docs-fragment llm_app}} try: from flyteplugins.vllm import VLLMAppEnvironment llm_app = VLLMAppEnvironment( name="cs-qwen-llm", model_id=MODEL_ID, model_hf_path="Qwen/Qwen2.5-3B-Instruct", image=vllm_image, resources=flyte.Resources(cpu="6", memory="20Gi", gpu="L4:1", disk="40Gi"), # One warm replica so there's no cold start mid-demo. Flip to (0, 1) + # scaledown_after to save the GPU when idle, at the cost of a cold start. scaling=flyte.app.Scaling(replicas=(1, 1)), requires_auth=False, extra_args=[ # Short context keeps the KV cache small and latency low; a customer # service turn is tiny. "--max-model-len", "8192", "--max-num-seqs", "16", ], ) except ImportError: llm_app = None # flyteplugins-vllm not installed (e.g. the UI container) # {{/docs-fragment llm_app}} # --------------------------------------------------------------------------- # 1b. The combined app: ONE model that does LLM + speech (Qwen2.5-Omni-3B) # # Qwen2.5-Omni uses a Thinker-Talker architecture: a single # /v1/chat/completions call with "modalities": ["audio"] returns BOTH the text # reply and synthesized speech. Served by vllm-omni (a separate vLLM project that # adds omni-modality output) — NOT the flyteplugins-vllm plugin, which pins an # older vLLM without omni support. We run the OpenAI server via a custom # container `command`, which bypasses Flyte's default fserve entrypoint. # --------------------------------------------------------------------------- OMNI_HF_MODEL = "Qwen/Qwen2.5-Omni-3B" OMNI_MODEL_ID = "omni" # vllm-omni installs from source on top of vLLM 0.23.0 (see its quickstart). 
# CRITICAL: pin --torch-backend=cu130 (NOT auto). The remote image builder has no # GPU, so `auto` resolves to CPU torch (torch+cpu) and vllm._C then fails with # `libcudart.so.13: cannot open shared object file`. The demo L4 nodes run driver # 580 / CUDA 13, so cu130 is the right GPU build. No separate flashinfer (the old # cu129 wheels are CUDA 12.9 and conflict with the CUDA-13 stack). omni_image = ( flyte.Image.from_debian_base(name="vllm-omni-server", install_flyte=False) .with_apt_packages("git") .with_commands( [ "uv pip install --system vllm==0.23.0 --torch-backend=cu130", "git clone https://github.com/vllm-project/vllm-omni.git /opt/vllm-omni", "uv pip install --system -e /opt/vllm-omni", ] ) ) def build_omni_app(): """A single model that returns text + speech (Qwen2.5-Omni-3B via vllm-omni).""" return flyte.app.AppEnvironment( name="cs-omni", image=omni_image, # Raw vllm OpenAI server with omni audio output enabled. # vllm-omni runs each stage (thinker + talker) as a SEPARATE engine on the # SAME GPU, and each applies --gpu-memory-utilization to the whole device. So # the stages must share: 0.45 each (~0.90 total) leaves room for both. The # thinker model alone is ~8.8 GB, so the 24 GB L4 is too tight for two stages # with usable KV cache — the 48 GB L40S fits both comfortably. command=[ "bash", "-lc", "export PATH=/opt/venv/bin:/usr/local/bin:$PATH; " f"exec vllm serve {OMNI_HF_MODEL} --omni --trust-remote-code " f"--served-model-name {OMNI_MODEL_ID} --port 8080 " "--gpu-memory-utilization 0.45 --max-model-len 8192", ], # This runtime image has the CUDA *runtime* libs (from torch) but no CUDA # *toolkit* (nvcc / CUDA_HOME). Several vLLM kernels JIT-compile at startup and # assert a toolkit is present, killing the engine core. Disable those so they # use prebuilt/native paths: the flashinfer sampler and deep_gemm. (The crash # was never RAM/GPU size — L4 and L40S failed identically — so we use the L4.) 
env_vars={"VLLM_USE_FLASHINFER_SAMPLER": "0", "VLLM_USE_DEEP_GEMM": "0"}, port=8080, # L40S (g6e.12xlarge): 48 GB GPU fits both omni stages; big node so cpu/mem/disk # requests schedule freely. (Earlier L40S attempt failed only at the now-fixed # flashinfer error, before reaching this two-stage memory split.) resources=flyte.Resources(cpu="12", memory="48Gi", gpu="L40s:1", disk="60Gi"), requires_auth=False, scaling=flyte.app.Scaling(replicas=(1, 1)), ) # --------------------------------------------------------------------------- # 2. The voice UI: FastAPI serving the page + proxying to the LLM # --------------------------------------------------------------------------- # {{docs-fragment system_prompt}} SYSTEM_PROMPT = ( "You are Ava, a warm, efficient customer-support agent for 'Northwind', a " "consumer electronics company. Your replies are spoken aloud in a live phone-" "like call, so keep them very short (1-2 sentences), natural, and free of " "markdown, lists, or emoji. Get to the point in the first sentence. Ask one " "clarifying question at a time. The caller may interrupt you at any moment; if " "they do, stop and listen. If you don't know an account-specific detail, say " "you'll look into it rather than inventing facts." ) # {{/docs-fragment system_prompt}} # The LLM endpoint is injected at deploy time (see __main__) via this env var. LLM_BASE_URL = os.environ.get("LLM_BASE_URL", "") # {{docs-fragment backends}} # Optional model switcher. Set LLM_BACKENDS to a comma-separated list of "Label|https://url" # pairs — each url is its own vLLM app — and the UI shows a dropdown to route between them. # Serving another model is just another Flyte app, so this is the whole "switch models" story. # When unset, the single LLM_BASE_URL above is used and no switcher appears (default demo). LLM_BACKENDS = os.environ.get("LLM_BACKENDS", "") # Served-model-id per backend url, cached so each vLLM app is asked at most once. 
_model_cache: dict = {} def _backends() -> list: """The list of {label, url} chat backends; a single Default unless LLM_BACKENDS is set.""" pairs = [] for item in LLM_BACKENDS.split(","): label, sep, url = item.partition("|") if sep and url.strip(): pairs.append({"label": label.strip(), "url": url.strip().rstrip("/")}) if pairs: return pairs base = os.environ.get("LLM_BASE_URL", LLM_BASE_URL).rstrip("/") return [{"label": "Default", "url": base}] if base else [] def _pick_backend(label: str | None) -> dict | None: """Choose a backend by label, falling back to the first configured one.""" backends = _backends() return next((b for b in backends if b["label"] == label), backends[0] if backends else None) async def _model_id_for(base: str) -> str: """Ask a vLLM backend which model id it serves (cached); fall back to MODEL_ID.""" if not base: return MODEL_ID if base not in _model_cache: mid = MODEL_ID try: async with httpx.AsyncClient(timeout=4.0) as client: r = await client.get(f"{base}/v1/models") r.raise_for_status() mid = ((r.json().get("data") or [{}])[0].get("id")) or MODEL_ID except Exception: mid = MODEL_ID _model_cache[base] = mid return _model_cache[base] # {{/docs-fragment backends}} # TTS configuration. # TTS_MODE: "both" (show the in-UI switch) | "browser" | "server" (lock one mode) # TTS_VOICE: a Kokoro voice id; af_heart is a warm female voice that fits "Ava". TTS_MODE = os.environ.get("TTS_MODE", "both") TTS_VOICE = os.environ.get("TTS_VOICE", "af_heart") # Omni (combined LLM+TTS) backend — Qwen2.5-Omni via vllm-omni. Injected at # deploy time; when set, the UI exposes an "Omni" engine that does chat+speech in # one call. OMNI_SAMPLE_RATE is used only if the model returns raw PCM (no header). 
OMNI_BASE_URL = os.environ.get("OMNI_BASE_URL", "")
OMNI_MODEL_ID = os.environ.get("OMNI_MODEL_ID", "omni")
OMNI_SAMPLE_RATE = int(os.environ.get("OMNI_SAMPLE_RATE", "24000"))

# Kokoro is loaded lazily/once at startup (heavy torch import) and only when the
# server-side TTS path is enabled. Stored on app state so requests reuse it.
_tts_state: dict = {"pipeline": None, "error": None}

# Kokoro synthesis is CPU-bound; running several at once just thrashes the cores
# and makes each one slower. Serialize so every clause stays fast (~0.5s) even if
# the client's prefetch ever overlaps two requests.
_synth_sem = asyncio.Semaphore(1)


def _load_kokoro():
    """Build the Kokoro pipeline once and warm it. Returns the pipeline or raises."""
    from kokoro import KPipeline  # heavy (torch); imported only when serving TTS

    pipeline = KPipeline(lang_code="a")  # 'a' = American English
    # Warm-up: the first synth compiles/caches; do it now so real calls are fast.
    for _ in pipeline("Hello.", voice=TTS_VOICE):
        pass
    return pipeline


def _synth(text: str):
    """Run Kokoro and return concatenated 24 kHz float32 audio (numpy)."""
    import numpy as np

    pipeline = _tts_state["pipeline"]
    chunks = [audio for _, _, audio in pipeline(text, voice=TTS_VOICE)]
    if not chunks:
        return np.zeros(1, dtype="float32")
    return np.concatenate(chunks).astype("float32")


def _wav_bytes(audio, sr: int = 24000) -> bytes:
    import soundfile as sf

    buf = io.BytesIO()
    sf.write(buf, audio, sr, format="WAV", subtype="PCM_16")
    return buf.getvalue()


fastapi_app = FastAPI(title="Northwind Voice Support")


@fastapi_app.on_event("startup")
async def _startup():
    # Load Kokoro unless TTS is browser-only (then we skip the heavy import).
    if TTS_MODE == "browser":
        return
    try:
        _tts_state["pipeline"] = await asyncio.to_thread(_load_kokoro)
    except Exception as e:  # keep the app up; server-TTS just stays unavailable
        _tts_state["error"] = f"{type(e).__name__}: {e}"


@fastapi_app.get("/healthz")
async def healthz():
    return {
        "ok": True,
        "llm": LLM_BASE_URL or "unset",
        "tts_mode": TTS_MODE,
        "tts_ready": _tts_state["pipeline"] is not None,
        "tts_error": _tts_state["error"],
        "omni": OMNI_BASE_URL or "unset",
    }


@fastapi_app.get("/api/config")
async def config():
    """Tells the browser which TTS modes / engines / model backends are available."""
    return {
        "tts_mode": TTS_MODE,
        "tts_ready": _tts_state["pipeline"] is not None,
        "omni_ready": bool(OMNI_BASE_URL),
        "backends": [b["label"] for b in _backends()],
    }


# {{docs-fragment backend_status}}
@fastapi_app.get("/api/backend")
async def backend_status(req: Request):
    """Liveness of a chat backend, for the "model warm / waking" pill.

    Pings the vLLM app's ``/v1/models``. A quick OK means a warm replica is already
    serving; a failure or a timeout is the cold start you'd see with
    ``Scaling(replicas=(0, 1))``.
    """
    chosen = _pick_backend(req.query_params.get("backend"))
    base = (chosen or {}).get("url", "")
    if not base:
        return {"up": False, "model": None}
    try:
        async with httpx.AsyncClient(timeout=4.0) as client:
            r = await client.get(f"{base}/v1/models")
            r.raise_for_status()
            data = r.json()
            return {"up": True, "model": (data.get("data") or [{}])[0].get("id")}
    except Exception:
        return {"up": False, "model": None}
# {{/docs-fragment backend_status}}


# {{docs-fragment tts_endpoint}}
@fastapi_app.post("/api/tts")
async def tts(req: Request):
    """Synthesize speech for one clause with Kokoro; returns a 24 kHz WAV.

    The X-Synth-Ms response header carries the measured server-side synthesis time
    so the client can display/compare latency.
    """
    body = await req.json()
    text = (body.get("text") or "").strip()
    if not text:
        return Response(status_code=204)
    if _tts_state["pipeline"] is None:
        return Response(status_code=503, content=_tts_state["error"] or "TTS not ready")
    t0 = time.perf_counter()
    async with _synth_sem:
        audio = await asyncio.to_thread(_synth, text)
    wav = await asyncio.to_thread(_wav_bytes, audio)
    synth_ms = int((time.perf_counter() - t0) * 1000)
    return Response(content=wav, media_type="audio/wav", headers={"X-Synth-Ms": str(synth_ms)})
# {{/docs-fragment tts_endpoint}}


# {{docs-fragment chat_proxy}}
@fastapi_app.post("/api/chat")
async def chat(req: Request):
    """Proxy a chat turn to the selected vLLM backend and stream the text reply back."""
    body = await req.json()
    history = body.get("messages", [])
    chosen = _pick_backend(body.get("backend"))
    base = (chosen or {}).get("url", "")
    payload = {
        "model": await _model_id_for(base),
        "messages": [{"role": "system", "content": SYSTEM_PROMPT}, *history],
        "stream": True,
        "max_tokens": 200,
        "temperature": 0.3,
    }

    async def gen():
        url = f"{base}/v1/chat/completions"
        async with httpx.AsyncClient(timeout=120.0) as client:
            async with client.stream("POST", url, json=payload) as r:
                r.raise_for_status()
                async for line in r.aiter_lines():
                    if not line.startswith("data:"):
                        continue
                    data = line[len("data:") :].strip()
                    if data == "[DONE]":
                        break
                    try:
                        delta = json.loads(data)["choices"][0]["delta"].get("content")
                    except (json.JSONDecodeError, KeyError, IndexError):
                        continue
                    if delta:
                        yield delta

    return StreamingResponse(gen(), media_type="text/plain")
# {{/docs-fragment chat_proxy}}


def _omni_extract(data: dict) -> tuple[str, bytes]:
    """Pull (reply_text, wav_bytes) out of a Qwen2.5-Omni chat-completion response.

    The omni audio field shape isn't fully documented, so be defensive: text is in
    choices[0]; audio is in some later choice's message.audio, as either a base64
    string or a dict with a base64 ``data`` field.
    If the decoded bytes are already a WAV (RIFF) we pass them through; otherwise we
    assume raw PCM16 and add a header.
    """
    choices = data.get("choices") or []
    text = ""
    audio_b64 = None
    for ch in choices:
        msg = ch.get("message") or {}
        if not text and msg.get("content"):
            text = msg["content"]
        aud = msg.get("audio")
        if aud is not None and audio_b64 is None:
            audio_b64 = aud.get("data") if isinstance(aud, dict) else aud
        if isinstance(aud, dict) and not text and aud.get("transcript"):
            text = aud["transcript"]
    if not audio_b64:
        raise ValueError("no audio in omni response")
    raw = base64.b64decode(audio_b64)
    if raw[:4] == b"RIFF":
        return text, raw  # already a WAV container
    # Raw PCM16 -> wrap in a WAV header at the configured sample rate.
    import numpy as np

    pcm = np.frombuffer(raw, dtype="phoneme runtime dep
    # CPU torch wheel keeps the image far smaller than the default CUDA build.
    .with_pip_packages("torch", index_url="https://download.pytorch.org/whl/cpu")
    .with_pip_packages(
        "fastapi", "uvicorn", "httpx", "kokoro>=0.9.2", "soundfile", "numpy"
    )
    # Kokoro's G2P (misaki) needs spaCy's en_core_web_sm.
    .with_commands(["python -m spacy download en_core_web_sm"])
)
# {{/docs-fragment ui_image}}


# {{docs-fragment ui_app}}
ui_app = FastAPIAppEnvironment(
    name="cs-voice-ui",
    app=fastapi_app,
    description="Browser voice UI for the Qwen customer-service agent (browser + Kokoro TTS)",
    image=ui_image,
    # Bumped for torch + the Kokoro model living in memory.
    resources=flyte.Resources(cpu="6", memory="8Gi"),
    requires_auth=False,
    scaling=flyte.app.Scaling(replicas=(1, 1)),
)
# {{/docs-fragment ui_app}}


# ---------------------------------------------------------------------------
# Single-page voice UI (Web Speech API: SpeechRecognition + speechSynthesis)
# ---------------------------------------------------------------------------
INDEX_HTML = """ Northwind Voice Support

◆ Northwind Voice Support

App Model Served on Union

"""


# ---------------------------------------------------------------------------
# Deploy driver
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("target", choices=["llm", "omni", "ui"])
    parser.add_argument("--llm-url", default=os.environ.get("LLM_BASE_URL", ""))
    parser.add_argument("--omni-url", default=os.environ.get("OMNI_BASE_URL", ""))
    args = parser.parse_args()

    # Reads your default Flyte config; uses the remote image builder (no local Docker needed).
    flyte.init_from_config(image_builder="remote")

    if args.target == "llm":
        if llm_app is None:
            sys.exit("flyteplugins-vllm not importable; run `uv pip install -e plugins/vllm --no-deps`")
        # GPU provisioning + image build + weight download can take a while.
        app = flyte.with_servecontext(activate_timeout=1800.0).serve(llm_app)
        print(f"LLM app: {app.url}")
    elif args.target == "omni":
        # vllm-omni builds from source + downloads a multimodal model — be patient.
        app = flyte.with_servecontext(activate_timeout=1800.0).serve(build_omni_app())
        print(f"Omni app: {app.url}")
    else:
        if not args.llm_url:
            sys.exit("--llm-url (or LLM_BASE_URL) is required for the ui target")
        # Bake the backend endpoints into the app's container env so the proxies can reach them.
        env = {**(ui_app.env_vars or {}), "LLM_BASE_URL": args.llm_url}
        if args.omni_url:
            env["OMNI_BASE_URL"] = args.omni_url
        ui_app.env_vars = env
        app = flyte.serve(ui_app)
        print(f"Voice UI: {app.url}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/voice_customer_service/app.py*

A few details make the conversation feel natural rather than like a walkie-talkie:

- **Clause-by-clause speech.** The reply is spoken as soon as the first clause is ready, and the next clause is fetched while the current one plays, so only the first clause's latency is ever felt.
- **Barge-in.** The page watches microphone energy, and when you start talking over Ava it cancels both the model stream and the speech playback, so she stops and listens.
- **A live latency comparison.** After each reply the footer shows time-to-first-audio for the voice you used, and keeps a running average for both, so you can compare the browser and server voices side by side.

## What makes this a good Flyte app

The interesting part is not the model; it is how little stands between "I have a model" and "I have a product". A few things the UI surfaces on purpose:

**Two right-sized apps, composed.** A GPU model server and a CPU web app are separate environments with their own images and resources. They are wired together at deploy time by passing the model's URL into the UI app's environment, and nothing else.

**Health you can see.** The header shows two status pills. One pings the UI app's own health endpoint. The other pings the model app and reports whether a warm replica is serving. That second check is a direct read on the `Scaling` policy: the model stays warm with `replicas=(1, 1)`, and the pill would show a cold start if you let the model scale to zero when idle.

```
@fastapi_app.get("/api/backend")
async def backend_status(req: Request):
    """Liveness of a chat backend, for the "model warm / waking" pill.

    Pings the vLLM app's ``/v1/models``. A quick OK means a warm replica is already
    serving; a failure or a timeout is the cold start you'd see with
    ``Scaling(replicas=(0, 1))``.
    """
    chosen = _pick_backend(req.query_params.get("backend"))
    base = (chosen or {}).get("url", "")
    if not base:
        return {"up": False, "model": None}
    try:
        async with httpx.AsyncClient(timeout=4.0) as client:
            r = await client.get(f"{base}/v1/models")
            r.raise_for_status()
            data = r.json()
            return {"up": True, "model": (data.get("data") or [{}])[0].get("id")}
    except Exception:
        return {"up": False, "model": None}
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/voice_customer_service/app.py*

**Serve many models, switch live.** Because another model is just another app, the UI can route between several. Point it at more than one backend and a model switcher appears in the page; each chat turn goes to the selected one. With a single backend it stays hidden, so the default demo is unchanged.
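As a rough illustration of that routing, the sketch below parses a comma-separated string of `Label|url` pairs (the format the app reads from the `LLM_BACKENDS` environment variable) and picks a backend by label, falling back to the first one. It follows the same parsing idea as `app.py` but is a standalone approximation: the function names `parse_backends` and `pick_backend` and the example URLs are hypothetical.

```python
from __future__ import annotations


def parse_backends(spec: str) -> list[dict]:
    """Parse 'Label|https://url' pairs separated by commas; skip malformed items."""
    pairs = []
    for item in spec.split(","):
        label, sep, url = item.partition("|")
        # Keep only items that contain a '|' and a non-empty url.
        if sep and url.strip():
            pairs.append({"label": label.strip(), "url": url.strip().rstrip("/")})
    return pairs


def pick_backend(backends: list[dict], label: str | None) -> dict | None:
    """Return the backend whose label matches, else the first one, else None."""
    return next((b for b in backends if b["label"] == label), backends[0] if backends else None)


# Hypothetical URLs, for illustration only.
spec = "Qwen 3B|https://qwen-3b.example.com/, Qwen 7B|https://qwen-7b.example.com"
backends = parse_backends(spec)

print([b["label"] for b in backends])            # ['Qwen 3B', 'Qwen 7B']
print(pick_backend(backends, "Qwen 7B")["url"])  # https://qwen-7b.example.com
print(pick_backend(backends, "unknown")["url"])  # https://qwen-3b.example.com
```

An unknown or missing label falls back to the first backend, so a browser that sends no `backend` field still gets an answer from the default model.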
```
"""
Voice customer-service agent — talk in the browser, it talks back.

A two-app Flyte demo:

  * ``llm_app`` — a small, fast Qwen instruct model served with vLLM on an L4 GPU
    (OpenAI-compatible API). This is the "brain".
  * ``ui_app`` — a tiny FastAPI app that serves a single-page voice UI and proxies
    chat requests to ``llm_app``.

Speech-to-text and text-to-speech happen **in the browser** via the Web Speech
API, so there is no audio model to host: the mic is transcribed locally, the text
goes to the LLM, and the reply is spoken locally. That keeps latency low and the
GPU footprint tiny (a 3B model on one L4).

    🎤 browser STT ──► /api/chat (FastAPI proxy) ──► vLLM /v1 (Qwen on L4)
                                                         │ streamed tokens
    🔊 browser TTS ◄── streamed text ◄────────────────────┘

The UI is served over HTTPS from the Flyte app, which is what lets the browser
grant microphone access and use speech recognition (both require a secure
context). The proxy means the browser only ever talks to its own origin, so there
are no CORS headaches.

Deploy
------
    # 1. Bring up the GPU model server (long pole: provisions an L4 + pulls weights)
    python app.py llm

    # 2. Bring up the voice UI, pointed at the LLM from step 1
    python app.py ui --llm-url

Then open the printed UI url in Chrome and click the mic.
"""

from __future__ import annotations

import asyncio
import base64
import io
import json
import os
import sys
import time

import httpx
from fastapi import FastAPI, Request, Response
from fastapi.responses import HTMLResponse, StreamingResponse

import flyte
import flyte.app
from flyte.app.extras import FastAPIAppEnvironment

# NOTE: `flyteplugins.vllm` is imported lazily inside build_llm_app() rather than
# at module top. This module is loaded by BOTH app containers; the lightweight UI
# image does not install flyteplugins-vllm, so a top-level import would crash the
# UI app on startup.

# ---------------------------------------------------------------------------
# 1. The LLM: small, fast Qwen instruct model on vLLM / L4
#
# Qwen2.5-3B-Instruct is a good "quality is OK, latency matters" pick: ~6 GB in
# bf16, trivially fits a 24 GB L4, and decodes fast enough that the browser's
# TTS is the pacing factor, not the model. vLLM downloads the weights straight
# from the Hugging Face hub (the model is public — no token needed).
# ---------------------------------------------------------------------------
MODEL_ID = "qwen"

# Pin the serving image. The plugin's default image pins vllm==0.11.0 but not
# transformers, and the newest transformers breaks vllm 0.11's tokenizer caching
# (AttributeError: Qwen2Tokenizer has no attribute all_special_tokens_extended).
# transformers==4.57.6 is the version the repo's own vLLM example uses.
# {{docs-fragment vllm_image}}
vllm_image = (
    flyte.Image.from_debian_base(name="vllm-app-image", install_flyte=False)
    .with_pip_packages("flashinfer-python", "flashinfer-cubin")
    .with_pip_packages("flashinfer-jit-cache", index_url="https://flashinfer.ai/whl/cu129")
    .with_pip_packages("flyteplugins-vllm")
    .with_pip_packages("vllm==0.11.0", "transformers==4.57.6")
)
# {{/docs-fragment vllm_image}}

# {{docs-fragment llm_app}}
try:
    from flyteplugins.vllm import VLLMAppEnvironment

    llm_app = VLLMAppEnvironment(
        name="cs-qwen-llm",
        model_id=MODEL_ID,
        model_hf_path="Qwen/Qwen2.5-3B-Instruct",
        image=vllm_image,
        resources=flyte.Resources(cpu="6", memory="20Gi", gpu="L4:1", disk="40Gi"),
        # One warm replica so there's no cold start mid-demo. Flip to (0, 1) +
        # scaledown_after to save the GPU when idle, at the cost of a cold start.
        scaling=flyte.app.Scaling(replicas=(1, 1)),
        requires_auth=False,
        extra_args=[
            # Short context keeps the KV cache small and latency low; a customer
            # service turn is tiny.
            "--max-model-len",
            "8192",
            "--max-num-seqs",
            "16",
        ],
    )
except ImportError:
    llm_app = None  # flyteplugins-vllm not installed (e.g. the UI container)
# {{/docs-fragment llm_app}}


# ---------------------------------------------------------------------------
# 1b. The combined app: ONE model that does LLM + speech (Qwen2.5-Omni-3B)
#
# Qwen2.5-Omni uses a Thinker-Talker architecture: a single
# /v1/chat/completions call with "modalities": ["audio"] returns BOTH the text
# reply and synthesized speech. Served by vllm-omni (a separate vLLM project that
# adds omni-modality output) — NOT the flyteplugins-vllm plugin, which pins an
# older vLLM without omni support. We run the OpenAI server via a custom
# container `command`, which bypasses Flyte's default fserve entrypoint.
# ---------------------------------------------------------------------------
OMNI_HF_MODEL = "Qwen/Qwen2.5-Omni-3B"
OMNI_MODEL_ID = "omni"

# vllm-omni installs from source on top of vLLM 0.23.0 (see its quickstart).
# CRITICAL: pin --torch-backend=cu130 (NOT auto). The remote image builder has no
# GPU, so `auto` resolves to CPU torch (torch+cpu) and vllm._C then fails with
# `libcudart.so.13: cannot open shared object file`. The demo L4 nodes run driver
# 580 / CUDA 13, so cu130 is the right GPU build. No separate flashinfer (the old
# cu129 wheels are CUDA 12.9 and conflict with the CUDA-13 stack).
omni_image = (
    flyte.Image.from_debian_base(name="vllm-omni-server", install_flyte=False)
    .with_apt_packages("git")
    .with_commands(
        [
            "uv pip install --system vllm==0.23.0 --torch-backend=cu130",
            "git clone https://github.com/vllm-project/vllm-omni.git /opt/vllm-omni",
            "uv pip install --system -e /opt/vllm-omni",
        ]
    )
)


def build_omni_app():
    """A single model that returns text + speech (Qwen2.5-Omni-3B via vllm-omni)."""
    return flyte.app.AppEnvironment(
        name="cs-omni",
        image=omni_image,
        # Raw vllm OpenAI server with omni audio output enabled.
        # vllm-omni runs each stage (thinker + talker) as a SEPARATE engine on the
        # SAME GPU, and each applies --gpu-memory-utilization to the whole device. So
        # the stages must share: 0.45 each (~0.90 total) leaves room for both. The
        # thinker model alone is ~8.8 GB, so the 24 GB L4 is too tight for two stages
        # with usable KV cache — the 48 GB L40S fits both comfortably.
        command=[
            "bash",
            "-lc",
            "export PATH=/opt/venv/bin:/usr/local/bin:$PATH; "
            f"exec vllm serve {OMNI_HF_MODEL} --omni --trust-remote-code "
            f"--served-model-name {OMNI_MODEL_ID} --port 8080 "
            "--gpu-memory-utilization 0.45 --max-model-len 8192",
        ],
        # This runtime image has the CUDA *runtime* libs (from torch) but no CUDA
        # *toolkit* (nvcc / CUDA_HOME). Several vLLM kernels JIT-compile at startup and
        # assert a toolkit is present, killing the engine core. Disable those so they
        # use prebuilt/native paths: the flashinfer sampler and deep_gemm. (The crash
        # was never RAM/GPU size — L4 and L40S failed identically — so we use the L4.)
        env_vars={"VLLM_USE_FLASHINFER_SAMPLER": "0", "VLLM_USE_DEEP_GEMM": "0"},
        port=8080,
        # L40S (g6e.12xlarge): 48 GB GPU fits both omni stages; big node so cpu/mem/disk
        # requests schedule freely. (Earlier L40S attempt failed only at the now-fixed
        # flashinfer error, before reaching this two-stage memory split.)
        resources=flyte.Resources(cpu="12", memory="48Gi", gpu="L40s:1", disk="60Gi"),
        requires_auth=False,
        scaling=flyte.app.Scaling(replicas=(1, 1)),
    )


# ---------------------------------------------------------------------------
# 2. The voice UI: FastAPI serving the page + proxying to the LLM
# ---------------------------------------------------------------------------
# {{docs-fragment system_prompt}}
SYSTEM_PROMPT = (
    "You are Ava, a warm, efficient customer-support agent for 'Northwind', a "
    "consumer electronics company. Your replies are spoken aloud in a live phone-"
    "like call, so keep them very short (1-2 sentences), natural, and free of "
    "markdown, lists, or emoji. Get to the point in the first sentence. Ask one "
    "clarifying question at a time. The caller may interrupt you at any moment; if "
    "they do, stop and listen. If you don't know an account-specific detail, say "
    "you'll look into it rather than inventing facts."
)
# {{/docs-fragment system_prompt}}

# The LLM endpoint is injected at deploy time (see __main__) via this env var.
LLM_BASE_URL = os.environ.get("LLM_BASE_URL", "")

# {{docs-fragment backends}}
# Optional model switcher. Set LLM_BACKENDS to a comma-separated list of "Label|https://url"
# pairs — each url is its own vLLM app — and the UI shows a dropdown to route between them.
# Serving another model is just another Flyte app, so this is the whole "switch models" story.
# When unset, the single LLM_BASE_URL above is used and no switcher appears (default demo).
LLM_BACKENDS = os.environ.get("LLM_BACKENDS", "")

# Served-model-id per backend url, cached so each vLLM app is asked at most once.
_model_cache: dict = {}


def _backends() -> list:
    """The list of {label, url} chat backends; a single Default unless LLM_BACKENDS is set."""
    pairs = []
    for item in LLM_BACKENDS.split(","):
        label, sep, url = item.partition("|")
        if sep and url.strip():
            pairs.append({"label": label.strip(), "url": url.strip().rstrip("/")})
    if pairs:
        return pairs
    base = os.environ.get("LLM_BASE_URL", LLM_BASE_URL).rstrip("/")
    return [{"label": "Default", "url": base}] if base else []


def _pick_backend(label: str | None) -> dict | None:
    """Choose a backend by label, falling back to the first configured one."""
    backends = _backends()
    return next((b for b in backends if b["label"] == label), backends[0] if backends else None)


async def _model_id_for(base: str) -> str:
    """Ask a vLLM backend which model id it serves (cached); fall back to MODEL_ID."""
    if not base:
        return MODEL_ID
    if base not in _model_cache:
        mid = MODEL_ID
        try:
            async with httpx.AsyncClient(timeout=4.0) as client:
                r = await client.get(f"{base}/v1/models")
                r.raise_for_status()
                mid = ((r.json().get("data") or [{}])[0].get("id")) or MODEL_ID
        except Exception:
            mid = MODEL_ID
        _model_cache[base] = mid
    return _model_cache[base]
# {{/docs-fragment backends}}


# TTS configuration.
# TTS_MODE: "both" (show the in-UI switch) | "browser" | "server" (lock one mode)
# TTS_VOICE: a Kokoro voice id; af_heart is a warm female voice that fits "Ava".
TTS_MODE = os.environ.get("TTS_MODE", "both")
TTS_VOICE = os.environ.get("TTS_VOICE", "af_heart")

# Omni (combined LLM+TTS) backend — Qwen2.5-Omni via vllm-omni. Injected at
# deploy time; when set, the UI exposes an "Omni" engine that does chat+speech in
# one call. OMNI_SAMPLE_RATE is used only if the model returns raw PCM (no header).
OMNI_BASE_URL = os.environ.get("OMNI_BASE_URL", "")
OMNI_MODEL_ID = os.environ.get("OMNI_MODEL_ID", "omni")
OMNI_SAMPLE_RATE = int(os.environ.get("OMNI_SAMPLE_RATE", "24000"))

# Kokoro is loaded lazily/once at startup (heavy torch import) and only when the
# server-side TTS path is enabled. Stored on app state so requests reuse it.
_tts_state: dict = {"pipeline": None, "error": None}

# Kokoro synthesis is CPU-bound; running several at once just thrashes the cores
# and makes each one slower. Serialize so every clause stays fast (~0.5s) even if
# the client's prefetch ever overlaps two requests.
_synth_sem = asyncio.Semaphore(1)


def _load_kokoro():
    """Build the Kokoro pipeline once and warm it. Returns the pipeline or raises."""
    from kokoro import KPipeline  # heavy (torch); imported only when serving TTS

    pipeline = KPipeline(lang_code="a")  # 'a' = American English
    # Warm-up: the first synth compiles/caches; do it now so real calls are fast.
    for _ in pipeline("Hello.", voice=TTS_VOICE):
        pass
    return pipeline


def _synth(text: str):
    """Run Kokoro and return concatenated 24 kHz float32 audio (numpy)."""
    import numpy as np

    pipeline = _tts_state["pipeline"]
    chunks = [audio for _, _, audio in pipeline(text, voice=TTS_VOICE)]
    if not chunks:
        return np.zeros(1, dtype="float32")
    return np.concatenate(chunks).astype("float32")


def _wav_bytes(audio, sr: int = 24000) -> bytes:
    import soundfile as sf

    buf = io.BytesIO()
    sf.write(buf, audio, sr, format="WAV", subtype="PCM_16")
    return buf.getvalue()


fastapi_app = FastAPI(title="Northwind Voice Support")


@fastapi_app.on_event("startup")
async def _startup():
    # Load Kokoro unless TTS is browser-only (then we skip the heavy import).
    if TTS_MODE == "browser":
        return
    try:
        _tts_state["pipeline"] = await asyncio.to_thread(_load_kokoro)
    except Exception as e:  # keep the app up; server-TTS just stays unavailable
        _tts_state["error"] = f"{type(e).__name__}: {e}"


@fastapi_app.get("/healthz")
async def healthz():
    return {
        "ok": True,
        "llm": LLM_BASE_URL or "unset",
        "tts_mode": TTS_MODE,
        "tts_ready": _tts_state["pipeline"] is not None,
        "tts_error": _tts_state["error"],
        "omni": OMNI_BASE_URL or "unset",
    }


@fastapi_app.get("/api/config")
async def config():
    """Tells the browser which TTS modes / engines / model backends are available."""
    return {
        "tts_mode": TTS_MODE,
        "tts_ready": _tts_state["pipeline"] is not None,
        "omni_ready": bool(OMNI_BASE_URL),
        "backends": [b["label"] for b in _backends()],
    }


# {{docs-fragment backend_status}}
@fastapi_app.get("/api/backend")
async def backend_status(req: Request):
    """Liveness of a chat backend, for the "model warm / waking" pill.

    Pings the vLLM app's ``/v1/models``. A quick OK means a warm replica is already
    serving; a failure or a timeout is the cold start you'd see with
    ``Scaling(replicas=(0, 1))``.
    """
    chosen = _pick_backend(req.query_params.get("backend"))
    base = (chosen or {}).get("url", "")
    if not base:
        return {"up": False, "model": None}
    try:
        async with httpx.AsyncClient(timeout=4.0) as client:
            r = await client.get(f"{base}/v1/models")
            r.raise_for_status()
            data = r.json()
            return {"up": True, "model": (data.get("data") or [{}])[0].get("id")}
    except Exception:
        return {"up": False, "model": None}
# {{/docs-fragment backend_status}}


# {{docs-fragment tts_endpoint}}
@fastapi_app.post("/api/tts")
async def tts(req: Request):
    """Synthesize speech for one clause with Kokoro; returns a 24 kHz WAV.

    The X-Synth-Ms response header carries the measured server-side synthesis time
    so the client can display/compare latency.
    """
    body = await req.json()
    text = (body.get("text") or "").strip()
    if not text:
        return Response(status_code=204)
    if _tts_state["pipeline"] is None:
        return Response(status_code=503, content=_tts_state["error"] or "TTS not ready")
    t0 = time.perf_counter()
    async with _synth_sem:
        audio = await asyncio.to_thread(_synth, text)
        wav = await asyncio.to_thread(_wav_bytes, audio)
    synth_ms = int((time.perf_counter() - t0) * 1000)
    return Response(content=wav, media_type="audio/wav", headers={"X-Synth-Ms": str(synth_ms)})
# {{/docs-fragment tts_endpoint}}


# {{docs-fragment chat_proxy}}
@fastapi_app.post("/api/chat")
async def chat(req: Request):
    """Proxy a chat turn to the selected vLLM backend and stream the text reply back."""
    body = await req.json()
    history = body.get("messages", [])
    chosen = _pick_backend(body.get("backend"))
    base = (chosen or {}).get("url", "")
    payload = {
        "model": await _model_id_for(base),
        "messages": [{"role": "system", "content": SYSTEM_PROMPT}, *history],
        "stream": True,
        "max_tokens": 200,
        "temperature": 0.3,
    }

    async def gen():
        url = f"{base}/v1/chat/completions"
        async with httpx.AsyncClient(timeout=120.0) as client:
            async with client.stream("POST", url, json=payload) as r:
                r.raise_for_status()
                async for line in r.aiter_lines():
                    if not line.startswith("data:"):
                        continue
                    data = line[len("data:") :].strip()
                    if data == "[DONE]":
                        break
                    try:
                        delta = json.loads(data)["choices"][0]["delta"].get("content")
                    except (json.JSONDecodeError, KeyError, IndexError):
                        continue
                    if delta:
                        yield delta

    return StreamingResponse(gen(), media_type="text/plain")
# {{/docs-fragment chat_proxy}}


def _omni_extract(data: dict) -> tuple[str, bytes]:
    """Pull (reply_text, wav_bytes) out of a Qwen2.5-Omni chat-completion response.

    The omni audio field shape isn't fully documented, so be defensive: text is in
    choices[0]; audio is in some later choice's message.audio, as either a base64
    string or a dict with a base64 ``data`` field. If the decoded bytes are already
    a WAV (RIFF) we pass them through; otherwise we assume raw PCM16 and add a header.
    """
    choices = data.get("choices") or []
    text = ""
    audio_b64 = None
    for ch in choices:
        msg = ch.get("message") or {}
        if not text and msg.get("content"):
            text = msg["content"]
        aud = msg.get("audio")
        if aud is not None and audio_b64 is None:
            audio_b64 = aud.get("data") if isinstance(aud, dict) else aud
        if isinstance(aud, dict) and not text and aud.get("transcript"):
            text = aud["transcript"]
    if not audio_b64:
        raise ValueError("no audio in omni response")
    raw = base64.b64decode(audio_b64)
    if raw[:4] == b"RIFF":
        return text, raw  # already a WAV container
    # Raw PCM16 -> wrap in a WAV header at the configured sample rate.
    import numpy as np

    pcm = np.frombuffer(raw, dtype="phoneme runtime dep
    # CPU torch wheel keeps the image far smaller than the default CUDA build.
    .with_pip_packages("torch", index_url="https://download.pytorch.org/whl/cpu")
    .with_pip_packages(
        "fastapi", "uvicorn", "httpx", "kokoro>=0.9.2", "soundfile", "numpy"
    )
    # Kokoro's G2P (misaki) needs spaCy's en_core_web_sm.
    .with_commands(["python -m spacy download en_core_web_sm"])
)
# {{/docs-fragment ui_image}}

# {{docs-fragment ui_app}}
ui_app = FastAPIAppEnvironment(
    name="cs-voice-ui",
    app=fastapi_app,
    description="Browser voice UI for the Qwen customer-service agent (browser + Kokoro TTS)",
    image=ui_image,
    # Bumped for torch + the Kokoro model living in memory.
    resources=flyte.Resources(cpu="6", memory="8Gi"),
    requires_auth=False,
    scaling=flyte.app.Scaling(replicas=(1, 1)),
)
# {{/docs-fragment ui_app}}


# ---------------------------------------------------------------------------
# Single-page voice UI (Web Speech API: SpeechRecognition + speechSynthesis)
# ---------------------------------------------------------------------------
INDEX_HTML = """
Northwind Voice Support

◆ Northwind Voice Support

App Model Served on Union