Cache a HuggingFace Model as an Artifact

This guide shows you how to cache HuggingFace models as Union Artifacts.

The union cache model-from-hf command allows you to automatically download and cache models from HuggingFace Hub as Union Artifacts. This is particularly useful for serving large language models (LLMs) and other AI models efficiently in production environments.

Why Cache Models from HuggingFace?

Caching models from HuggingFace Hub as Union Artifacts provides several key benefits:

  • Faster Model Downloads: Once cached, models load much faster since they're stored in Union's optimized blob storage.
  • Streaming into GPU Memory: Union's SGLangApp and VLLMApp classes can stream cached model weights from blob storage directly into GPU memory, instead of downloading the weights to disk first and then loading them onto the GPU.
  • Reliability: Eliminates the dependency on HuggingFace Hub availability during model serving.
  • Cost Efficiency: Reduces repeated downloads and bandwidth costs from HuggingFace Hub.
  • Version Control: Each cached model gets a unique artifact ID for reproducible deployments.
  • Sharding Support: Large models can be automatically sharded for distributed inference.

Prerequisites

Before using the union cache model-from-hf command, you need to set up authentication:

  1. Create a HuggingFace API Token:

    union create secret --name HUGGINGFACE_TOKEN
  2. Create a Union API Key (optional):

    union create api-key admin --name MY_API_KEY
    union create secret --name MY_API_KEY

If you don't want to create a Union API key, Union tenants typically ship with an EAGER_API_KEY secret, an internally provisioned Union API key that you can use for caching HuggingFace models.

Basic Example: Cache a Model As-Is

The simplest way to cache a model is to download it directly from HuggingFace without any modifications:

union cache model-from-hf Qwen/Qwen2.5-0.5B-Instruct \
    --hf-token-key HUGGINGFACE_TOKEN \
    --union-api-key EAGER_API_KEY \
    --artifact-name qwen2-5-0-5b-instruct \
    --cpu 2 \
    --mem 8Gi \
    --ephemeral-storage 10Gi \
    --wait

Command Breakdown

  • Qwen/Qwen2.5-0.5B-Instruct: The HuggingFace model repository
  • --hf-token-key HUGGINGFACE_TOKEN: Union secret containing your HuggingFace API token
  • --union-api-key EAGER_API_KEY: Union secret with admin permissions
  • --artifact-name qwen2-5-0-5b-instruct: Custom name for the cached artifact. If not provided, the model repository name is lower-cased and . characters are replaced with - (see the sketch after this list).
  • --cpu 2: CPU resources for the download and caching process
  • --mem 8Gi: Memory resources for downloading and caching
  • --ephemeral-storage 10Gi: Temporary storage for the download process
  • --wait: Wait for the caching process to complete
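
For reference, the default artifact name rule described above amounts to a simple string transformation. The sketch below is illustrative only; in particular, dropping the "Qwen/" org prefix is an assumption, not something this guide confirms. Pass --artifact-name explicitly if you need a specific name.

# Illustrative only: the default-name rule described above, applied to the
# example repository. Assumes the "Qwen/" org prefix is dropped.
repo = "Qwen/Qwen2.5-0.5B-Instruct"
default_name = repo.split("/")[-1].lower().replace(".", "-")
print(default_name)  # qwen2-5-0-5b-instruct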

Output

When the command runs, you'll see output like this:

🔄 Started background process to cache model from Hugging Face repo Qwen/Qwen2.5-0.5B-Instruct.
Check the console for status at
https://acme.union.ai/console/projects/flytesnacks/domains/development/executions/a5nr2g79xb9rtnzczqtp

You can visit this URL to watch the model caching workflow in the Union UI.

If you pass the --wait flag to union cache model-from-hf, the command waits for the model to be cached and then prints additional information:

Cached model at:
/tmp/flyte-axk70dc8/sandbox/local_flytekit/50b27158c2bb42efef8e60622a4d2b6d/model_snapshot
Model Artifact ID:
flyte://av0.2/acme/flytesnacks/development/qwen2-5-0-5b-instruct@322a60c7ba4df41621be528a053f3b1a

To deploy this model run:
union deploy model --project None --domain development \
    flyte://av0.2/acme/flytesnacks/development/qwen2-5-0-5b-instruct@322a60c7ba4df41621be528a053f3b1a
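
The artifact URI encodes the org, project, domain, artifact name, and version hash. If you ever need to pull those pieces apart in a script (for example, to reference the name in an Artifact query as shown in the next section), a simple split works. This sketch relies only on the URI format shown in the output above:

# Illustrative only: split the artifact URI printed above into its parts,
# based on the flyte://av0.2/<org>/<project>/<domain>/<name>@<version>
# layout shown in the example output.
uri = (
    "flyte://av0.2/acme/flytesnacks/development/"
    "qwen2-5-0-5b-instruct@322a60c7ba4df41621be528a053f3b1a"
)
org, project, domain, name_and_version = uri.removeprefix("flyte://av0.2/").split("/")
name, version = name_and_version.split("@")
print(name)     # qwen2-5-0-5b-instruct
print(version)  # 322a60c7ba4df41621be528a053f3b1a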

Using Cached Models in Applications

Once you have cached a model, you can use it in your Union serving apps:

VLLM App Example

from union import Artifact, Resources
from union.app.llm import VLLMApp
from flytekit.extras.accelerators import L4

# Use the cached model artifact
Model = Artifact(name="qwen2-5-0-5b-instruct")

vllm_app = VLLMApp(
    name="vllm-app-3",
    requests=Resources(cpu="12", mem="24Gi", gpu="1"),
    accelerator=L4,
    model=Model.query(),  # Query the cached artifact
    model_id="qwen2",
    scaledown_after=300,
    stream_model=True,
    port=8084,
)

SGLang App Example

from union import Artifact, Resources
from union.app.llm import SGLangApp
from flytekit.extras.accelerators import L4

# Use the cached model artifact
Model = Artifact(name="qwen2-5-0-5b-instruct")

sglang_app = SGLangApp(
    name="sglang-app-3",
    requests=Resources(cpu="12", mem="24Gi", gpu="1"),
    accelerator=L4,
    model=Model.query(),  # Query the cached artifact
    model_id="qwen2",
    scaledown_after=300,
    stream_model=True,
    port=8000,
)

Advanced Example: Sharding a Model with the vLLM Engine

For large models that require distributed inference, you can use the --shard-config option to automatically shard the model using the vLLM inference engine.

Create a Shard Configuration File

Create a YAML file (e.g., shard_config.yaml) with the sharding parameters:

engine: vllm
args:
  model: unsloth/Llama-3.3-70B-Instruct
  tensor_parallel_size: 4
  gpu_memory_utilization: 0.9
  extra_args:
    max_model_len: 16384

The shard_config.yaml file should conform to the remote.ShardConfig dataclass, where the args field contains configuration that's forwarded to the underlying inference engine. Currently, only the vLLM engine is supported for sharding, so the args field should conform to the remote.VLLMShardArgs dataclass.
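
As a quick local sanity check before kicking off the caching job, you can load the file and confirm it contains the fields used in this guide. This is only a sketch against the structure shown above, and it assumes PyYAML is installed; the authoritative schema is the remote.ShardConfig and remote.VLLMShardArgs dataclasses in the union SDK.

import yaml

# Load shard_config.yaml and check the fields shown in this guide.
# Local sanity check only; the union SDK's ShardConfig dataclass is the
# source of truth for the full schema.
with open("shard_config.yaml") as f:
    config = yaml.safe_load(f)

assert config["engine"] == "vllm", "only the vLLM engine is supported for sharding"
args = config["args"]
for key in ("model", "tensor_parallel_size"):
    assert key in args, f"missing field: {key}"

print(f"Sharding {args['model']} across {args['tensor_parallel_size']} GPUs")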

Cache the Sharded Model

union cache model-from-hf unsloth/Llama-3.3-70B-Instruct \
    --hf-token-key HUGGINGFACE_TOKEN \
    --union-api-key EAGER_API_KEY \
    --artifact-name llama-3-3-70b-instruct-sharded \
    --cpu 36 \
    --gpu 4 \
    --mem 300Gi \
    --ephemeral-storage 300Gi \
    --accelerator nvidia-l40s \
    --shard-config shard_config.yaml \
    --project flytesnacks \
    --domain development \
    --wait

Best Practices

Resource sizing and sharding guidelines

  1. Resource Sizing: Allocate sufficient resources for the model size (see the sizing sketch after this list):

    • Small models (< 1B): 2-4 CPU, 4-8Gi memory
    • Medium models (1-7B): 4-8 CPU, 8-16Gi memory
    • Large models (7B+): 8+ CPU, 16Gi+ memory
  2. Sharding for Large Models: Use tensor parallelism for models > 7B parameters:

    • 7-13B models: 2-4 GPUs
    • 13-70B models: 4-8 GPUs
    • 70B+ models: 8+ GPUs
  3. Storage Considerations: Ensure sufficient ephemeral storage for the download process; it should comfortably exceed the size of the model weights.
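
These guidelines come down to simple arithmetic on the parameter count: fp16/bf16 weights take roughly 2 bytes per parameter, plus headroom for the KV cache, activations, and the runtime. The sketch below is a rough rule-of-thumb estimator under those assumptions, not an exact formula; the 48 GiB default simply reflects an L40S-class GPU.

import math

# Rough rule-of-thumb sizing: fp16/bf16 weights are ~2 bytes per parameter.
# Leaves ~30% of each GPU free for KV cache, activations, and the runtime.
# Heuristic only; real requirements depend on context length, dtype, and engine.

def estimate_weights_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30

def suggest_gpu_count(params_billion: float, gpu_mem_gib: float = 48.0) -> int:
    """Suggest a tensor-parallel GPU count, keeping ~30% headroom per GPU."""
    return max(1, math.ceil(estimate_weights_gib(params_billion) / (gpu_mem_gib * 0.7)))

print(round(estimate_weights_gib(70)))  # ~130 GiB of weights for a 70B model
print(suggest_gpu_count(70))            # 4 GPUs with 48 GiB each (e.g. L40S)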