Autoscaling apps

Flyte apps support autoscaling, allowing them to scale up and down based on traffic. This helps optimize costs by scaling down when there’s no traffic and scaling up when needed.

Scaling configuration

The scaling parameter takes a flyte.app.Scaling object that configures autoscaling behavior:

scaling=flyte.app.Scaling(
    replicas=(min_replicas, max_replicas),
    scaledown_after=idle_ttl_seconds,
)

Parameters

replicas: A tuple (min_replicas, max_replicas) specifying the minimum and maximum number of replicas.
scaledown_after: Time in seconds to wait before scaling down when idle (idle TTL).
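
The constraints these two parameters imply can be checked before deploying. The helper below is a hypothetical sketch and not part of the Flyte SDK: the name validate_scaling and its rules (0 <= min_replicas <= max_replicas, and a positive scaledown_after when one is given) are assumptions made for illustration.

```python
from typing import Optional, Tuple


def validate_scaling(replicas: Tuple[int, int],
                     scaledown_after: Optional[int] = None) -> None:
    """Raise ValueError if the values look inconsistent.

    These are illustrative rules, not rules enforced by the SDK.
    """
    min_replicas, max_replicas = replicas
    if min_replicas < 0:
        raise ValueError("min_replicas must be >= 0")
    if max_replicas < min_replicas:
        raise ValueError("max_replicas must be >= min_replicas")
    if scaledown_after is not None and scaledown_after <= 0:
        raise ValueError("scaledown_after must be a positive number of seconds")


# A scale-to-zero configuration with a 5-minute idle TTL passes silently.
validate_scaling((0, 1), scaledown_after=300)
```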

Basic scaling example

Here’s a simple example with scaling from 0 to 1 replica:

autoscaling-examples.py

import flyte.app

# Basic example: scale from 0 to 1 replica
app_env = flyte.app.AppEnvironment(
    name="autoscaling-app",
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Scale from 0 to 1 replica
        scaledown_after=300,  # Scale down after 5 minutes of inactivity
    ),
    # ...
)

This configuration:

Starts with 0 replicas (no running instances)
Scales up to 1 replica when there’s traffic
Scales back down to 0 after 5 minutes (300 seconds) of no traffic
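
The behavior listed above can be pictured with a small model. This is a simplified sketch, not the actual autoscaler: the function replicas_at and the request timestamps are made up, and it assumes a replica starts instantly and stops exactly scaledown_after seconds after the last request.

```python
from typing import List


def replicas_at(t: float, request_times: List[float],
                scaledown_after: float = 300) -> int:
    """Simplified model of replicas=(0, 1): 1 while traffic is recent, else 0."""
    past = [r for r in request_times if r <= t]
    if not past:
        return 0  # no traffic yet, so nothing is running
    idle_for = t - max(past)  # seconds since the most recent request
    return 1 if idle_for < scaledown_after else 0


requests = [10, 60, 200]  # seconds at which requests arrive (made-up data)
for t in (0, 10, 400, 499, 500, 900):
    print(t, replicas_at(t, requests))
```

In this model the app is at 0 replicas before the first request, stays at 1 until 300 seconds after the request at t=200, and is back at 0 from t=500 onward.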

Scaling patterns

Always-on app

For apps that need to always be running:

# Always-on app
app_env2 = flyte.app.AppEnvironment(
    name="always-on-api",
    scaling=flyte.app.Scaling(
        replicas=(1, 1),  # Always keep 1 replica running
        # scaledown_after is omitted: with min == max the replica count never changes
    ),
    # ...
)

Scale-to-zero app

For apps that can scale to zero when idle:

# Scale-to-zero app
app_env3 = flyte.app.AppEnvironment(
    name="scale-to-zero-app",
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Can scale down to 0
        scaledown_after=600,  # Scale down after 10 minutes of inactivity
    ),
    # ...
)

High-availability app

For apps that need multiple replicas for availability:

# High-availability app
app_env4 = flyte.app.AppEnvironment(
    name="ha-api",
    scaling=flyte.app.Scaling(
        replicas=(2, 5),  # Keep at least 2, scale up to 5
        scaledown_after=300,  # Scale down after 5 minutes
    ),
    # ...
)

Burstable app

For apps with variable load:

# Burstable app
app_env5 = flyte.app.AppEnvironment(
    name="bursty-app",
    scaling=flyte.app.Scaling(
        replicas=(1, 10),  # Start with 1, scale up to 10 under load
        scaledown_after=180,  # Scale down quickly after 3 minutes
    ),
    # ...
)

Idle TTL (Time To Live)

The scaledown_after parameter (idle TTL) determines how long an app instance can be idle before it’s scaled down.

Considerations

Too short: May cause frequent scale up/down cycles, leading to cold starts.
Too long: Keeps resources running unnecessarily, increasing costs.
Optimal: Long enough to span the typical gaps between requests, short enough that idle replicas don't run for long stretches.
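
The trade-off can be made concrete by replaying a request trace against different idle TTL values. The sketch below is a simplified model under stated assumptions, not a description of the real autoscaler: simulate_idle_ttl is a hypothetical helper, the app has at most one replica, a replica starts instantly on a request, and it stops exactly scaledown_after seconds after the last request.

```python
from typing import List, Tuple


def simulate_idle_ttl(request_times: List[float],
                      scaledown_after: float) -> Tuple[int, float]:
    """Return (cold_starts, warm_seconds) for a scale-to-zero app."""
    cold_starts = 0
    warm_seconds = 0.0
    last = None  # time of the previous request, if any
    for t in sorted(request_times):
        if last is None or t - last >= scaledown_after:
            # The replica was not running: this request triggers a cold start.
            cold_starts += 1
            if last is not None:
                # The previous replica idled for the full TTL before stopping.
                warm_seconds += scaledown_after
        else:
            # The replica stayed warm for the whole gap between requests.
            warm_seconds += t - last
        last = t
    if last is not None:
        warm_seconds += scaledown_after  # idle tail after the final request
    return cold_starts, warm_seconds


requests = [0, 100, 1000]  # made-up request times in seconds
for ttl in (60, 600, 1200):
    print(ttl, simulate_idle_ttl(requests, ttl))
```

With this made-up trace, a 60-second TTL gives 3 cold starts and 180 warm seconds, while a 1200-second TTL gives 1 cold start and 2200 warm seconds.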

Common idle TTL values

Development/Testing: 60-180 seconds (1-3 minutes) - quick scale down for cost savings.
Production APIs: 300-600 seconds (5-10 minutes) - balance cost and responsiveness.
Batch processing: 900-1800 seconds (15-30 minutes) - longer to handle bursts.
Always-on: Set min_replicas > 0 - the app never scales below min_replicas, so it never scales to zero.
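
One way to pick a value within these ranges is to derive it from observed traffic. The sketch below is an assumption-laden illustration rather than a recommended method: suggest_ttl is a hypothetical helper that takes the gaps between consecutive requests, chooses the gap at a given coverage level (nearest-rank percentile), and clamps the result to the 60-1800 second span listed above.

```python
import math
from typing import List


def suggest_ttl(request_times: List[float], coverage: float = 0.9,
                low: int = 60, high: int = 1800) -> int:
    """Suggest an idle TTL in seconds that covers `coverage` of request gaps."""
    times = sorted(request_times)
    gaps = sorted(b - a for a, b in zip(times, times[1:]))
    if not gaps:
        return low  # not enough data; fall back to the shortest value
    # Nearest-rank percentile: the smallest gap that covers the requested share.
    rank = max(1, math.ceil(coverage * len(gaps)))
    ttl = math.ceil(gaps[rank - 1])
    return max(low, min(high, ttl))  # clamp to the allowed range


observed = [0, 30, 90, 400, 450, 2000]  # made-up request times in seconds
print(suggest_ttl(observed))
print(suggest_ttl(observed, coverage=0.5))
```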

Autoscaling best practices

Start conservative: Begin with longer idle TTL values and adjust based on usage.
Monitor cold starts: Track how long it takes for your app to become ready after scaling up.
Consider costs: Balance idle TTL between cost savings and user experience.
Use appropriate min replicas: Set min_replicas > 0 for critical apps that need to be always available.
Test scaling behavior: Verify your app handles scale up/down correctly (for example, state management and connections).
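
To monitor cold starts, it helps to time how long the app takes to become ready after being scaled to zero. The following is a minimal sketch: measure_cold_start and the is_ready callable are hypothetical, and how readiness is checked (for example, an HTTP request to a health endpoint your app exposes) is left to the caller.

```python
import time
from typing import Callable


def measure_cold_start(is_ready: Callable[[], bool], timeout: float = 120.0,
                       interval: float = 1.0) -> float:
    """Poll is_ready() until it returns True and return the elapsed seconds."""
    start = time.monotonic()
    while True:
        if is_ready():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            raise TimeoutError(f"app not ready after {timeout} seconds")
        time.sleep(interval)  # wait before the next readiness check


# Stand-in readiness check that succeeds on the third poll.
attempts = {"count": 0}


def fake_ready() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


elapsed = measure_cold_start(fake_ready, timeout=5.0, interval=0.01)
print(f"ready after {elapsed:.3f} seconds")
```

Running this against a real app after it has scaled to zero, and again while it is warm, gives a rough baseline for choosing between a longer idle TTL and min_replicas = 1.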

Autoscaling limitations

Scaling is based on traffic/request patterns, not CPU/memory utilization.
Cold starts may occur when scaling from zero.
Stateful apps need careful design to handle scaling (use external state stores).
Maximum replicas are limited by your cluster capacity.

Autoscaling troubleshooting

App scales down too quickly:

Increase scaledown_after value.
Set min_replicas > 0 if the app needs to stay warm.

App doesn’t scale up fast enough:

Ensure your cluster has spare capacity for new replicas.
Check whether the app's resource requests (for example, CPU, memory, or GPU) are delaying scheduling.

Cold starts are too slow:

Pre-warm with min_replicas = 1.
Optimize app startup time.
Consider using faster storage for model loading.