Autoscaling apps
Flyte apps support autoscaling, allowing them to scale up and down based on traffic. This helps optimize costs by scaling down when there’s no traffic and scaling up when needed.
Scaling configuration
The scaling parameter uses a Scaling object to configure autoscaling behavior:
scaling=flyte.app.Scaling(
    replicas=(min_replicas, max_replicas),
    scaledown_after=idle_ttl_seconds,
)
Parameters
- replicas: A tuple (min_replicas, max_replicas) specifying the minimum and maximum number of replicas.
- scaledown_after: Time in seconds to wait before scaling down when idle (idle TTL).
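As a quick sanity check before deploying, the two settings can be validated in plain Python. The sketch below is a hypothetical stand-in, not part of Flyte: the class name ScalingSpec and the validation rules (non-negative minimum, maximum at least as large as the minimum, positive idle TTL) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ScalingSpec:
    """Hypothetical mirror of the two scaling settings (not a Flyte class)."""

    replicas: Tuple[int, int]
    scaledown_after: Optional[int] = None

    def __post_init__(self) -> None:
        min_replicas, max_replicas = self.replicas
        # Assumed rules: these are illustrative, not documented Flyte checks.
        if min_replicas < 0:
            raise ValueError("min_replicas must be >= 0")
        if max_replicas < max(min_replicas, 1):
            raise ValueError("max_replicas must be >= max(min_replicas, 1)")
        if self.scaledown_after is not None and self.scaledown_after <= 0:
            raise ValueError("scaledown_after must be a positive number of seconds")

    @property
    def scales_to_zero(self) -> bool:
        """True when the minimum replica count is 0."""
        return self.replicas[0] == 0


spec = ScalingSpec(replicas=(0, 1), scaledown_after=300)
print(spec.scales_to_zero)  # True
```

Validating the tuple early catches swapped values such as (5, 2) before they reach the cluster.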
Basic scaling example
Here’s a simple example with scaling from 0 to 1 replica:
# Basic example: scale from 0 to 1 replica
app_env = flyte.app.AppEnvironment(
    name="autoscaling-app",
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Scale from 0 to 1 replica
        scaledown_after=300,  # Scale down after 5 minutes of inactivity
    ),
    # ...
)
This configuration:
- Starts with 0 replicas (no running instances)
- Scales up to 1 replica when there’s traffic
- Scales back down to 0 after 5 minutes (300 seconds) of no traffic
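The behavior listed above can be sketched as a small timeline model. This is a simplified illustration, not how Flyte implements autoscaling: it assumes requests are instantaneous, ignores startup time, and treats the replica as up from a request until scaledown_after seconds pass without another one. The function name replicas_at is hypothetical.

```python
from typing import Sequence


def replicas_at(t: float, request_times: Sequence[float], scaledown_after: float) -> int:
    """Return the replica count (0 or 1) at time t for a (0, 1) configuration.

    Simplified model (an assumption): the replica is up from the moment of a
    request until scaledown_after seconds have passed with no further request.
    """
    earlier = [r for r in request_times if r <= t]
    if not earlier:
        return 0  # no traffic yet, so nothing is running
    idle_for = t - max(earlier)
    return 1 if idle_for < scaledown_after else 0


requests = [10, 50, 400]  # request arrival times in seconds
print(replicas_at(0, requests, 300))    # 0: no traffic yet
print(replicas_at(100, requests, 300))  # 1: last request was 50 s ago
print(replicas_at(360, requests, 300))  # 0: idle for 310 s
print(replicas_at(400, requests, 300))  # 1: new request scales back up
```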
Scaling patterns
Always-on app
For apps that need to always be running:
# Always-on app
app_env2 = flyte.app.AppEnvironment(
    name="always-on-api",
    scaling=flyte.app.Scaling(
        replicas=(1, 1),  # Always keep 1 replica running
        # scaledown_after has no effect here: with min_replicas == max_replicas
        # the replica count never changes
    ),
    # ...
)
Scale-to-zero app
For apps that can scale to zero when idle:
# Scale-to-zero app
app_env3 = flyte.app.AppEnvironment(
    name="scale-to-zero-app",
    scaling=flyte.app.Scaling(
        replicas=(0, 1),  # Can scale down to 0
        scaledown_after=600,  # Scale down after 10 minutes of inactivity
    ),
    # ...
)
High-availability app
For apps that need multiple replicas for availability:
# High-availability app
app_env4 = flyte.app.AppEnvironment(
    name="ha-api",
    scaling=flyte.app.Scaling(
        replicas=(2, 5),  # Keep at least 2, scale up to 5
        scaledown_after=300,  # Scale down after 5 minutes
    ),
    # ...
)
Burstable app
For apps with variable load:
# Burstable app
app_env5 = flyte.app.AppEnvironment(
    name="bursty-app",
    scaling=flyte.app.Scaling(
        replicas=(1, 10),  # Start with 1, scale up to 10 under load
        scaledown_after=180,  # Scale down quickly after 3 minutes
    ),
    # ...
)
Idle TTL (Time To Live)
The scaledown_after parameter (idle TTL) determines how long an app instance can be idle before it’s scaled down.
Considerations
- Too short: May cause frequent scale up/down cycles, leading to cold starts.
- Too long: Keeps resources running unnecessarily, increasing costs.
- Optimal: Long enough to cover typical gaps between requests, short enough that idle replicas don't run for long.
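To make the trade-off concrete, the following sketch counts cold starts and idle replica-seconds for a candidate idle TTL given a list of request timestamps. It is a rough model under stated assumptions, not a description of Flyte's autoscaler: a single replica, a scale-to-zero configuration, instantaneous requests, and no startup time. The function name ttl_tradeoff and the sample timestamps are hypothetical.

```python
from typing import Sequence, Tuple


def ttl_tradeoff(request_times: Sequence[float], ttl: float) -> Tuple[int, float]:
    """Return (cold_starts, idle_seconds) for one replica that scales to zero.

    Simplified model (an assumption): the first request is a cold start, and so
    is any request arriving ttl seconds or more after the previous one. Idle
    time is the time the replica stays up without traffic, at most ttl per gap.
    """
    times = sorted(request_times)
    if not times:
        return 0, 0.0
    cold_starts = 1
    idle_seconds = 0.0
    for prev, cur in zip(times, times[1:]):
        gap = cur - prev
        if gap >= ttl:
            cold_starts += 1      # replica was scaled down before this request
            idle_seconds += ttl   # it idled for the full TTL first
        else:
            idle_seconds += gap   # replica stayed warm through the gap
    idle_seconds += ttl  # after the last request the replica idles for one TTL
    return cold_starts, idle_seconds


requests = [0, 30, 200, 900]  # request arrival times in seconds
for ttl in (60, 300):
    print(ttl, ttl_tradeoff(requests, ttl))
# 60 (3, 210.0)
# 300 (2, 800.0)
```

In this toy trace, raising the TTL from 60 to 300 seconds removes one cold start at the price of roughly four times as much idle time.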
Common idle TTL values
- Development/Testing: 60-180 seconds (1-3 minutes) - quick scale down for cost savings.
- Production APIs: 300-600 seconds (5-10 minutes) - balance cost and responsiveness.
- Batch processing: 900-1800 seconds (15-30 minutes) - longer to handle bursts.
- Always-on: Set min_replicas > 0 so the app never scales to zero.
Autoscaling best practices
- Start conservative: Begin with longer idle TTL values and adjust based on usage.
- Monitor cold starts: Track how long it takes for your app to become ready after scaling up.
- Consider costs: Balance idle TTL between cost savings and user experience.
- Use appropriate min replicas: Set min_replicas > 0 for critical apps that need to be always available.
- Test scaling behavior: Verify your app handles scale up/down correctly (for example, state management and connections).
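One way to track cold-start duration is to poll a readiness check and record how long it takes to pass. The helper below is a generic sketch that uses only the standard library. The name time_until_ready and the injected is_ready callable are hypothetical; in practice is_ready might wrap an HTTP request to whatever health endpoint your app exposes.

```python
import time
from typing import Callable, Optional


def time_until_ready(
    is_ready: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 1.0,
) -> Optional[float]:
    """Poll is_ready() until it returns True.

    Returns the elapsed seconds, or None if the timeout is reached first.
    """
    start = time.monotonic()
    while True:
        if is_ready():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)


# Usage with a stand-in check that succeeds on the third poll.
attempts = {"count": 0}


def fake_check() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


elapsed = time_until_ready(fake_check, timeout=5.0, interval=0.01)
print(f"ready after {elapsed:.3f} s and {attempts['count']} polls")
```

Passing the check in as a callable keeps the timing logic independent of how readiness is actually determined.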
Autoscaling limitations
- Scaling is based on traffic/request patterns, not CPU/memory utilization.
- Cold starts may occur when scaling from zero.
- Stateful apps need careful design to handle scaling (use external state stores).
- Maximum replicas are limited by your cluster capacity.
Autoscaling troubleshooting
App scales down too quickly:
- Increase the scaledown_after value.
- Set min_replicas > 0 if the app needs to stay warm.
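If request logs are available, a larger scaledown_after value can be derived from the observed gaps between requests instead of guessed. The sketch below picks the smallest whole-second value that exceeds a chosen fraction of those gaps. It is a heuristic, not a Flyte feature: the function name suggest_scaledown_after, the coverage parameter, and the sample timestamps are all hypothetical.

```python
import math
from typing import Sequence


def suggest_scaledown_after(request_times: Sequence[float], coverage: float = 0.9) -> int:
    """Return the smallest whole-second TTL longer than `coverage` of observed gaps.

    Heuristic (an assumption): a gap shorter than the TTL keeps the app warm,
    so covering most gaps should avoid most scale-down/scale-up cycles.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    times = sorted(request_times)
    gaps = sorted(later - earlier for earlier, later in zip(times, times[1:]))
    if not gaps:
        raise ValueError("need at least two request timestamps")
    index = math.ceil(coverage * len(gaps)) - 1  # gap at the coverage rank
    return math.floor(gaps[index]) + 1  # strictly longer than that gap


requests = [0, 30, 200, 900, 960]  # gaps: 30, 170, 700, 60 seconds
print(suggest_scaledown_after(requests, coverage=0.5))   # 61
print(suggest_scaledown_after(requests, coverage=0.75))  # 171
```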
App doesn’t scale up fast enough:
- Ensure your cluster has capacity.
- Check if there are resource constraints.
Cold starts are too slow:
- Pre-warm with min_replicas = 1.
- Optimize app startup time.
- Consider using faster storage for model loading.