Serve your LLM with MAX Serve
MAX Serve is a high-performance inference server for deploying large language models. In this tutorial, we learn how to cache a model from Hugging Face and serve it with MAX Serve and Union Serving.
Once you have a Union account, install union:
pip install union
Export the following environment variable to build and push images to your own container registry:
# replace with your registry name
export IMAGE_SPEC_REGISTRY="<your-container-registry>"
Then run the following commands to execute the workflow:
$ git clone https://github.com/unionai/unionai-examples
$ cd unionai-examples
$ union run --remote <path/to/file.py> <workflow_name> <params>
The source code for this example can be found here.
Managing Dependencies
First we import the dependencies for defining the Union App:
from union import Resources, ImageSpec, Artifact
from union.app import App, Input
from flytekit.extras.accelerators import L4
import os
To define the image, we install union-runtime into Modular’s base image using ImageSpec and the
default image builder. Set the IMAGE_SPEC_REGISTRY environment variable to a public registry you can push to.
With python_exec="/opt/venv/bin/python", we configure the image builder to install any new packages
into the base image’s Python environment.
image = ImageSpec(
    name="modular-max",
    base_image="modular/max-nvidia-base:25.4.0.dev2025050705",
    builder="default",
    packages=["union-runtime>=0.1.18"],
    entrypoint=["/bin/bash"],
    python_exec="/opt/venv/bin/python",
    registry=os.environ.get("IMAGE_SPEC_REGISTRY"),
)
Defining the Union App
The workflow in cache_model.py caches the Qwen2.5 model from Hugging Face into a Union Artifact. Here
we use the same Artifact as an Input to the Union App; it is downloaded to the path given by mount=/root/qwen-0-5.
The args are set to the MAX Serve entrypoint, where --model-path=/root/qwen-0-5
configures MAX Serve to load the model from /root/qwen-0-5.
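For reference, here is a minimal sketch of what such a caching workflow could look like. It assumes huggingface_hub's snapshot_download and the standard pattern of returning an Annotated FlyteDirectory to publish the Artifact; the actual cache_model.py in the example repository may differ in details such as the repo id, resources, and task image. The App definition itself is shown next.
from typing import Annotated

from flytekit.types.directory import FlyteDirectory
from huggingface_hub import snapshot_download  # assumes huggingface_hub is in the task image
from union import Artifact, Resources, task, workflow

Qwen_Coder_Artifact = Artifact(name="Qwen2.5-Coder-0.5B")

@task(requests=Resources(cpu="2", mem="8Gi", ephemeral_storage="20Gi"))
def download_model(repo_id: str) -> Annotated[FlyteDirectory, Qwen_Coder_Artifact]:
    # Download the model weights from Hugging Face into a local directory;
    # returning the directory publishes it as the named Union Artifact.
    local_dir = snapshot_download(repo_id=repo_id, local_dir="/tmp/qwen-0-5")
    return FlyteDirectory(local_dir)

@workflow
def cache_model(repo_id: str = "Qwen/Qwen2.5-Coder-0.5B") -> FlyteDirectory:
    return download_model(repo_id=repo_id)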
Qwen_Coder_Artifact = Artifact(name="Qwen2.5-Coder-0.5B")
modular_model = App(
    name="modular-qwen-0-5-coder",
    container_image=image,
    inputs=[Input(name="model", value=Qwen_Coder_Artifact.query(), env_var="MODEL", mount="/root/qwen-0-5")],
    args=[
        "python",
        "-m",
        "max.entrypoints.pipelines",
        "serve",
        "--model-path=/root/qwen-0-5",
        "--device-memory-utilization",
        "0.7",
        "--max-length",
        "2048",
    ],
    port=8000,
    requests=Resources(cpu="7", mem="20Gi", gpu="1", ephemeral_storage="20Gi"),
    accelerator=L4,
    scaledown_after=300,
)
Caching and deploying the App
Run the workflow to cache the LLM:
union run --remote cache_model.py cache_model
Deploy the Union App backed by MAX Serve:
union deploy apps max_serve.py modular-qwen-0-5-coder
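Once the app is running, you can send it requests. MAX Serve exposes an OpenAI-compatible API, so a quick smoke test from Python might look like the following sketch; the endpoint URL is a placeholder for the URL that union deploy prints, and the model name is assumed to match the --model-path passed to the server.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-app-endpoint>/v1",  # placeholder: use the URL printed by `union deploy`
    api_key="EMPTY",  # placeholder value; adjust if your endpoint requires authentication
)

response = client.chat.completions.create(
    model="/root/qwen-0-5",  # assumed to match the --model-path the server was started with
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(response.choices[0].message.content)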