Samhita Alla
Shalabh Chaudhri

Union unveils a powerful model deployment stack built with AWS SageMaker and NVIDIA Triton Inference Server

Machine learning engineering is a complex process, spanning data procurement and processing, model training, deployment, and scaling. Manually integrating tools to build end-to-end deployment pipelines can quickly become a liability, leading to higher operational risks and inefficient utilization of engineering time, ultimately resulting in longer time-to-market and poor return on investment (ROI) from ML initiatives.

Imagine the headache of reproducing results from a manual integration process, complete with additional testing and debugging steps to ensure seamless operation. Cross-functional collaboration with DevOps personnel may be required, further compounding the complexity.

Error-prone manual pipelines not only increase the risk of downtime, performance issues, and security vulnerabilities, but also delay the availability of models for end users. This can significantly impact your competitive advantage, especially if the models power critical business applications or strategic solutions.

The solution? Automate manual deployments to free your machine learning engineers from operational complexities, allowing them to focus on business logic. By leveraging the combined power of Union, NVIDIA Triton Inference Server, and AWS SageMaker, you can build centralized end-to-end deployment workflows, enabling reproducible and traceable deployments.

AWS SageMaker

SageMaker is a managed service for building, training, and deploying ML models at scale. For efficient model deployment, SageMaker inference is a game changer: it scales model deployments, reduces inference costs, makes models easier to manage in production, and lowers the overall operational burden.

SageMaker inference offers native integration with NVIDIA Triton Inference Server, which is available with the NVIDIA AI Enterprise software platform and standardizes AI model deployment and execution, allowing developers to leverage high-performance inference capabilities within the SageMaker ecosystem. With this integration, you no longer need to spend time building images with model serving code, since SageMaker provides pre-built Triton Inference Server images out of the box, allowing you to hit the ground running with your model deployments.

Union

Union abstracts away the low-level details of production-grade AI orchestration so that ML engineers and data scientists can focus on delivering value from data. Developed by the team behind the popular open-source platform flyte.org, the Union SDK uses flytekit at its core. This makes it easy to build pipelines that run locally and are automatically packaged into containers and shipped to remote servers, enabling scalable and reproducible executions.

Figure 1 shows a successful workflow execution on Union.

Figure 1. An example Union DAG 

You can build pipelines in Union using the @task and @workflow decorators that you can import from the flytekit library. Tasks represent the individual steps in your pipeline, such as data processing or fine-tuning. By linking these tasks together, you create a workflow.

from flytekit import task, workflow

@task
def slope(x: list[int], y: list[int]) -> float:
    ...

@task
def intercept(x: list[int], y: list[int], slope: float) -> float:
    ...

@workflow
def simple_wf(x: list[int], y: list[int]) -> float:
    slope_value = slope(x=x, y=y)
    intercept_value = intercept(x=x, y=y, slope=slope_value)
    return intercept_value

In the above example, slope and intercept represent tasks, while simple_wf functions as a workflow that links these tasks together. 
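Because flytekit tasks and workflows are regular Python callables, you can also execute the workflow locally for quick iteration before registering it to Union. A minimal sketch, assuming the task bodies above are filled in:

if __name__ == "__main__":
    # Runs both tasks in-process on your machine (no cluster required).
    print(simple_wf(x=[-3, 0, 3], y=[7, 4, -2]))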

With Union, you can build end-to-end deployment workflows. Union offers integration with SageMaker through the SageMaker agent, which alleviates the need to manage container images and models. Union also supports model deployment using Triton Inference Server through the agent.

Distributed training, NVIDIA TensorRT, and Triton Inference Server on SageMaker

Imagine you want to deploy a Stable Diffusion model and need it to perform particularly well on a certain class of images, so you decide fine-tuning is the way to go. That means you have to implement both fine-tuning and deployment.

While it’s possible to manually fine-tune and deploy the model as separate processes, this traditional approach brings its own set of challenges, in particular:

  • Code decoupling: When fine-tuning and serving code live in separate environments, code drift can occur, making the workflow non-reproducible and troubleshooting a nightmare.
  • Deployment complexity: Deploying a model in SageMaker requires packaging the inference code, dependencies, and entry point into a single Docker image. Writing Dockerfiles and building containers introduces complexity, especially during the experimental phase when iterating on the deployment pipeline.

The ideal solution is a single, centralized workflow that handles fine-tuning and deployment, simplifying the process and enabling effortless iteration and production deployment.

Union serves as a single source of truth, centralizing the entire model deployment pipeline. With Union, you can fine-tune and deploy your Stable Diffusion model in a single, unified pipeline, leveraging the benefits of orchestration and reproducibility.

from flytekit import workflow

@workflow
def stable_diffusion_on_triton_wf(
    execution_role_arn: str,
    finetuning_args: FineTuningArgs = FineTuningArgs(),
    model_name: str = "stable-diffusion-model",
    endpoint_config_name: str = "stable-diffusion-endpoint-config",
    endpoint_name: str = "stable-diffusion-endpoint",
    instance_type: str = "ml.g4dn.xlarge",
    initial_instance_count: int = 1,
    region: str = "us-east-2",
) -> str:
    repo_id = stable_diffusion_finetuning(args=finetuning_args)
    model_repo = optimize_model(
        model_name=finetuning_args.pretrained_model_name_or_path,
        repo_id=repo_id,
    )
    compressed_model = compress_model(model_repo=model_repo)
    deployment = sd_deployment(
        model_name=model_name,
        endpoint_config_name=endpoint_config_name,
        endpoint_name=endpoint_name,
        model_path=compressed_model,
        execution_role_arn=execution_role_arn,
        instance_type=instance_type,
        initial_instance_count=initial_instance_count,
        region=region,
    )
    return deployment

The workflow above also includes a model optimization step using the NVIDIA TensorRT software development kit to improve the inference performance. You can find the end-to-end pipeline in the Union GitHub repository.
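The exact optimization code lives in that repository, but a simplified, hypothetical sketch of such a task helps illustrate the idea: convert an ONNX export of a model component into a TensorRT engine with trtexec, and arrange the output in the directory layout that Triton Inference Server expects. (The real optimize_model task handles the full Stable Diffusion pipeline and declares the model inputs and outputs in config.pbtxt.)

import subprocess
from pathlib import Path

from flytekit import task
from flytekit.types.directory import FlyteDirectory


@task
def optimize_component(onnx_model: str, component: str) -> FlyteDirectory:
    # Triton expects <repo>/<model_name>/<version>/model.plan plus a config.pbtxt.
    repo = Path("model_repository") / component
    version_dir = repo / "1"
    version_dir.mkdir(parents=True, exist_ok=True)

    # Build a TensorRT engine from the ONNX export; --fp16 enables
    # half-precision kernels for higher throughput on supported GPUs.
    subprocess.run(
        [
            "trtexec",
            f"--onnx={onnx_model}",
            f"--saveEngine={version_dir / 'model.plan'}",
            "--fp16",
        ],
        check=True,
    )

    # A minimal config; the real task also declares input and output tensors.
    (repo / "config.pbtxt").write_text(
        f'name: "{component}"\nplatform: "tensorrt_plan"\nmax_batch_size: 1\n'
    )
    return FlyteDirectory(path=str(repo.parent))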

When you execute this pipeline on Union:

  1. The model will undergo fine-tuning in a distributed setup on a single node with 8 NVIDIA GPUs using the Flyte PyTorch plugin (see the sketch after this list).
  2. The fine-tuned model will be optimized using ONNX and TensorRT, creating a directory with the necessary Triton Inference Server configuration for optimized inference.
  3. The optimized model will be compressed to tar.gz format, ready for deployment on SageMaker's inference infrastructure.
  4. With minimal configuration, Union will deploy the compressed, optimized model to a SageMaker endpoint, making it production-ready and accessible for inference requests.
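Step 1 is expressed declaratively with the Flyte PyTorch plugin: you attach a task config that describes the distributed launch, and Union provisions the workers for you. The snippet below is a hypothetical sketch of that configuration; the signature mirrors the stable_diffusion_finetuning task used in the workflow above, while the actual training loop lives in the repository.

from flytekit import Resources, task
from flytekitplugins.kfpytorch import Elastic


@task(
    # torchrun-style elastic launch: 1 node, 8 worker processes (one per GPU).
    task_config=Elastic(nnodes=1, nproc_per_node=8),
    requests=Resources(gpu="8", mem="64Gi"),
)
def stable_diffusion_finetuning(args: FineTuningArgs) -> str:
    # Fine-tune the model and return a reference (e.g., a Hugging Face Hub
    # repo ID) to the checkpoint consumed by the downstream optimization step.
    ...

The sd_deployment task that step 4 refers to is generated with the create_sagemaker_deployment helper from the SageMaker inference plugin: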
from flytekitplugins.awssagemaker_inference import (
    create_sagemaker_deployment,
    triton_image_uri,
)

sd_deployment = create_sagemaker_deployment(
    model_config={...},
    endpoint_config_config={...},
    endpoint_config={...},
    images={"sd_deployment_image": triton_image_uri(version="23.12")},
    region_at_runtime=True,
)

SageMaker deployment workflow

You can use the pre-built Triton Inference Server image by importing the triton_image_uri function from the plugin and specifying the version.

The SageMaker agent also allows you to use custom images, for which you need to define the ImageSpec. The agent takes care of the heavy lifting by automatically building and pushing the image to the specified registry: 

import os

from flytekit import ImageSpec

sam_deployment_image = ImageSpec(
    name="sam-deployment",
    registry=os.getenv("REGISTRY"),
    packages=["transformers==4.38.2", ...],
    source_root="sam/tasks/fastapi",
).with_commands(["chmod +x /root/serve"])

sam_deployment = create_sagemaker_deployment(
    images={"image": sam_deployment_image}, 
    ...
)

A custom image for FastAPI inference on SageMaker 
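Once an endpoint is live, any SageMaker client can call it. The snippet below is a hypothetical invocation of the Stable Diffusion endpoint deployed earlier using boto3, assuming the Triton model exposes a single text input named prompt; adjust the payload to match your model's actual input signature.

import json

import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")

# KServe v2 inference payload understood by Triton Inference Server.
payload = {
    "inputs": [
        {
            "name": "prompt",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["a watercolor painting of a lighthouse at dawn"],
        }
    ]
}

response = runtime.invoke_endpoint(
    EndpointName="stable-diffusion-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())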

A single pipeline is all you need to tap into the combined capabilities of Triton Inference Server, TensorRT, SageMaker, and Union. From centralizing the code to enabling caching and reproducibility to automating the deployment process, this deployment stack is a go-to option for optimizing models, enhancing inference performance, and saving valuable ML engineering time.

Training and serving models is no longer a hassle!

Throughput and latency are crucial factors determining inference performance. To achieve the best of both worlds—high throughput and low latency—this deployment stack is worth exploring. Union lets you connect the disparate pieces of the puzzle—fine-tuning, optimization, and serving—into a cohesive, end-to-end solution.

If you’re interested in diving deeper, check out the relevant docs.

The end-to-end Stable Diffusion fine-tuning and deployment code is available on GitHub. 

Don’t hesitate to reach out to the Union team if you're considering implementing this deployment stack.
