Job Reliability
Understanding cancellation, progress tracking, and resume capabilities
VirtuousAI ensures long-running jobs complete reliably, even during worker failures or deployments. This guide explains the reliability mechanisms and how to use them.
Execution Modes
Jobs run in one of two modes:
| Mode | Duration | Mechanism | Example |
|---|---|---|---|
| SYNC | Under 30 seconds | Inline in API request | Web searches, simple queries |
| ASYNC_QUEUE | Minutes to hours | SQS + Dramatiq workers | Data extractions, large syncs |
Long-running data extractions always use ASYNC_QUEUE mode with full reliability features.
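As a rough illustration of how mode selection works (the function and constant below are hypothetical, not the actual dispatch code), the choice comes down to a duration threshold:

```python
# Hypothetical mode-selection sketch; names are illustrative only.
from enum import Enum

class ExecutionMode(Enum):
    SYNC = "sync"                # runs inline in the API request
    ASYNC_QUEUE = "async_queue"  # enqueued to SQS for a Dramatiq worker

SYNC_THRESHOLD_SECONDS = 30

def choose_mode(estimated_seconds: float) -> ExecutionMode:
    """Short jobs run inline; anything longer goes through the queue."""
    if estimated_seconds < SYNC_THRESHOLD_SECONDS:
        return ExecutionMode.SYNC
    return ExecutionMode.ASYNC_QUEUE
```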
Cancelling Jobs
Pending Jobs
Jobs in PENDING status are cancelled immediately:
```bash
vai actions cancel run_abc123
# Run cancelled
```

Or cancel via the API:

```bash
curl -X POST https://api.virtuousai.com/api/v1/action-runs/run_abc123/cancel \
  -H "Authorization: Bearer $VAI_API_KEY"
```

Running Jobs
Jobs in RUNNING status use cooperative cancellation:
- API sets `cancel_requested_at` on the run
- Worker checks for cancellation every 30 seconds
- Current resource completes (no partial data)
- Job transitions to `CANCELLED`
Cancellation is checked between resources, not mid-resource. The current resource always completes to avoid partial data.
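To make the cooperative model concrete, here is a minimal sketch of the worker loop, assuming hypothetical `Run` fields and an injected `extract` callable; this is not the actual worker code:

```python
# Illustrative cooperative cancellation; names are hypothetical.
import time
from dataclasses import dataclass, field

CANCEL_CHECK_INTERVAL = 30  # seconds between cancellation checks

@dataclass
class Run:
    id: str
    cancel_requested_at: float | None = None  # set by the API on cancel
    status: str = "RUNNING"
    completed: list[str] = field(default_factory=list)

def process(run: Run, resources: list[str], extract) -> None:
    last_check = 0.0
    for resource in resources:
        # Check at most every CANCEL_CHECK_INTERVAL, and only *between*
        # resources, so the in-flight resource always finishes cleanly.
        now = time.monotonic()
        if now - last_check >= CANCEL_CHECK_INTERVAL:
            last_check = now
            if run.cancel_requested_at is not None:
                run.status = "CANCELLED"
                return
        extract(resource)                # completes fully; no partial data
        run.completed.append(resource)   # checkpoint after each resource
    run.status = "COMPLETED"
```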
Cancellation Sources
| Source | Trigger | Behavior |
|---|---|---|
| User Request | API call or CLI | Cooperative, graceful |
| SIGTERM | ECS deployment | Cooperative, 120s grace period |
| Lease Lost | Worker crash | Watchdog marks failed |
| Timeout | Dramatiq time limit | Forced termination |
Progress Tracking
Monitor extraction progress in real-time:
```bash
vai actions get run_abc123
```

Progress includes:

- Current phase: `extracting`, `normalizing`, `loading`
- Current resource being processed
- Resources completed vs total
- Rows extracted so far
- Elapsed time
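If you need to wait on a run programmatically rather than watching the CLI, something like the sketch below works. It assumes a GET endpoint at `/api/v1/action-runs/{id}` that mirrors the CLI output; the endpoint path and the `phase`/`current_resource` response fields are assumptions, not documented API:

```python
# Hedged polling sketch; the GET endpoint and response fields are assumptions.
import os
import time
import requests

API = "https://api.virtuousai.com/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['VAI_API_KEY']}"}
TERMINAL_PHASES = {"completed", "failed", "cancelled"}

def wait_for_run(run_id: str, poll_seconds: int = 30) -> dict:
    """Poll the run until it reaches a terminal phase, printing progress."""
    while True:
        run = requests.get(f"{API}/action-runs/{run_id}", headers=HEADERS).json()
        phase = run.get("phase", "unknown")
        print(f"{run_id}: phase={phase} resource={run.get('current_resource')}")
        if phase in TERMINAL_PHASES:
            return run
        time.sleep(poll_seconds)
```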
Progress Phases
| Phase | Description |
|---|---|
| `starting` | Initializing extraction |
| `extracting` | Pulling data from source API |
| `normalizing` | Applying schema transformations |
| `loading` | Writing to S3 bronze layer |
| `completed` | Finished successfully |
| `failed` | Encountered error |
| `cancelled` | User or system cancelled |
Resume After Failure
If a job fails mid-extraction, it resumes from the last checkpoint.
Per-Resource Checkpointing
- Each resource extracts independently
- Checkpoint saved after each resource completes
- On retry, completed resources are skipped
Example: Extracting profiles, events, lists
| Scenario | On Retry |
|---|---|
| Crashed during profiles | Re-extract profiles from dlt cursor |
| Crashed during events | Skip profiles, resume events |
| Crashed during lists | Skip profiles + events, resume lists |
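A minimal sketch of the skip-completed-resources pattern, using an in-memory checkpoint store keyed by run ID; the store and function names are hypothetical, not the actual implementation:

```python
# Illustrative per-resource checkpointing: a resource is recorded only after
# it finishes, so completed resources are skipped on retry. Names hypothetical.
def run_with_checkpoints(run_id: str, resources: list[str], extract, store: dict) -> None:
    completed: set[str] = store.setdefault(run_id, set())
    for resource in resources:
        if resource in completed:
            continue                  # already extracted on a previous attempt
        extract(resource)             # within a resource, the dlt cursor limits re-extraction
        completed.add(resource)       # checkpoint after the resource completes

# Example: the first attempt crashed during events, so profiles is skipped on retry.
store: dict[str, set[str]] = {"run_abc123": {"profiles"}}
run_with_checkpoints("run_abc123", ["profiles", "events", "lists"],
                     extract=lambda r: print(f"extracting {r}"), store=store)
```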
dlt Incremental State
Within each resource, dlt maintains cursor state:
- Stored in S3 under `_dlt_pipeline_state/`
- Tracks the last `updated_at` or similar cursor
- On restart, only fetches records after the cursor
dlt state commits at the END of each resource. If crashed mid-resource, that resource re-extracts from its last cursor (not mid-page).
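The dlt side of this looks roughly like the sketch below. The `@dlt.resource` decorator, `dlt.sources.incremental`, and `last_value` are standard dlt APIs; the source client (`fetch_profiles`) and pipeline names are made up for illustration:

```python
# Sketch of dlt incremental loading: the updated_at cursor lives in pipeline
# state and is committed with the load when the resource finishes.
import dlt

def fetch_profiles(since: str):
    """Hypothetical source client: yields pages of records newer than `since`."""
    yield [{"id": 1, "updated_at": "2024-01-01T00:00:00Z"}]

@dlt.resource(name="profiles", write_disposition="append")
def profiles(updated_at=dlt.sources.incremental("updated_at",
                                                initial_value="1970-01-01T00:00:00Z")):
    # Only records newer than the stored cursor are fetched on restart.
    for page in fetch_profiles(since=updated_at.last_value):
        yield page

# Destination credentials (e.g. the S3 bucket_url) come from dlt config/env.
pipeline = dlt.pipeline(pipeline_name="extract_example", destination="filesystem")
pipeline.run(profiles())
```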
Lease-Based Ownership
Workers must acquire a database lease before processing a job:
| Parameter | Value | Purpose |
|---|---|---|
| Lease Duration | 90 seconds | How long a worker owns a job |
| Heartbeat Interval | 30 seconds | How often lease is extended |
| Watchdog Grace | 180 seconds | How stale before recovery kicks in |
This prevents duplicate processing when:
- SQS delivers the same message twice
- A worker is slow but not dead
- Network partitions occur
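One common way to implement this kind of lease is a conditional UPDATE that only succeeds when the lease is free or expired. The table and column names below are hypothetical, not the actual schema:

```python
# Illustrative lease acquisition via conditional UPDATE (hypothetical schema).
# A worker only wins the 90-second lease if it is unclaimed or already expired.
ACQUIRE_LEASE_SQL = """
UPDATE action_runs
SET lease_owner = %(worker_id)s,
    lease_expires_at = now() + interval '90 seconds'
WHERE id = %(run_id)s
  AND (lease_owner IS NULL OR lease_expires_at < now())
RETURNING id;
"""

def try_acquire_lease(cursor, run_id: str, worker_id: str) -> bool:
    """Returns True only if this worker now owns the run; losers back off."""
    cursor.execute(ACQUIRE_LEASE_SQL, {"run_id": run_id, "worker_id": worker_id})
    return cursor.fetchone() is not None
```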
Deployment Safety
When deploying new worker versions:
- ECS sends `SIGTERM` to running containers
- Workers have 120 seconds to finish gracefully
- Cancellation token is set immediately
- Current resource completes and checkpoints
- Job marked as `CANCELLED` (retryable)
- New worker picks up from checkpoint
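As a sketch of the worker side (standard-library only; the helper name is hypothetical), the SIGTERM handler simply flips a cancellation flag that the extraction loop checks between resources:

```python
# SIGTERM handling sketch: the signal sets a flag; nothing is killed directly.
import signal
import threading

cancel_token = threading.Event()

def _handle_sigterm(signum, frame):
    # ECS sends SIGTERM first, then SIGKILL after the ~120s grace period.
    cancel_token.set()

signal.signal(signal.SIGTERM, _handle_sigterm)

def should_stop() -> bool:
    """Checked by the extraction loop between resources."""
    return cancel_token.is_set()
```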
Best Practices for Deployments
- Check running jobs before deploying: `vai actions list --status running`
- Wait for completion if possible (safest)
- Deploy with confidence; jobs resume automatically from checkpoints
Watchdog Recovery
A background task monitors for abandoned jobs:
| Check | Frequency | Action |
|---|---|---|
| Expired leases | Every 5 minutes | Mark FAILED (retryable) |
| Stuck PENDING | Every 5 minutes | Re-enqueue if stale |
Jobs with a `worker_lost` error are automatically re-enqueued if the retry count allows.
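A watchdog pass might look roughly like this, reusing the hypothetical schema from the lease sketch above along with an injected `enqueue_retry` helper:

```python
# Illustrative watchdog pass, run every 5 minutes (hypothetical schema).
RECOVER_EXPIRED_SQL = """
UPDATE action_runs
SET status = 'FAILED', error_code = 'worker_lost'
WHERE status = 'RUNNING'
  AND lease_expires_at < now() - interval '180 seconds'
RETURNING id;
"""

def watchdog_pass(cursor, enqueue_retry) -> None:
    """Fail runs whose lease is stale past the grace period, then retry them."""
    cursor.execute(RECOVER_EXPIRED_SQL)
    for (run_id,) in cursor.fetchall():
        enqueue_retry(run_id)  # re-enqueued only if the retry budget allows
```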
Troubleshooting
Job stuck in RUNNING
- Check if worker is alive (lease should be fresh)
- If lease expired, watchdog will recover within 5 minutes
- Manual recovery: `vai actions retry run_abc123`
Job keeps failing
- Check error details: `vai actions get run_abc123`
- If `AUTH_ERROR`: update connection credentials
- If `RATE_LIMITED`: job will auto-retry with backoff
- If `worker_lost`: infrastructure issue, check ECS logs
Data seems duplicated
The bronze layer may contain duplicate files after a crash and restart. This is expected:
- Bronze = raw data (duplicates acceptable)
- Silver layer deduplicates during transformation
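For intuition, silver-layer deduplication amounts to keeping the newest record per key across whatever duplicate bronze files exist. This pandas sketch is illustrative, not the actual transformation:

```python
# Illustrative dedup: keep the latest record per primary key across bronze files.
import pandas as pd

def dedupe_bronze(frames: list[pd.DataFrame], key: str = "id",
                  cursor: str = "updated_at") -> pd.DataFrame:
    combined = pd.concat(frames, ignore_index=True)
    return (combined.sort_values(cursor)
                    .drop_duplicates(subset=[key], keep="last"))
```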
SQS Visibility Heartbeat
For very long jobs (8+ hours), the system extends SQS message visibility:
| Parameter | Value |
|---|---|
| Initial Visibility | 30 minutes |
| Extension Interval | 5 minutes |
| Extension Amount | 10 minutes |
This prevents SQS from re-delivering messages during extremely long extractions.
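The heartbeat itself can be implemented with the standard `ChangeMessageVisibility` call. This boto3 sketch mirrors the parameters in the table above; the threading wrapper is illustrative, not the actual worker code:

```python
# SQS visibility heartbeat sketch: every 5 minutes, push visibility out 10 minutes.
import threading
import boto3

sqs = boto3.client("sqs")

def start_visibility_heartbeat(queue_url: str, receipt_handle: str,
                               interval: int = 300, extension: int = 600) -> threading.Event:
    """Returns an Event; call .set() when the job finishes to stop the heartbeat."""
    stop = threading.Event()

    def _beat():
        while not stop.wait(interval):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=extension,  # seconds from now
            )

    threading.Thread(target=_beat, daemon=True).start()
    return stop
```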