Job Reliability

Understanding cancellation, progress tracking, and resume capabilities

VirtuousAI ensures long-running jobs complete reliably, even during worker failures or deployments. This guide explains the reliability mechanisms and how to use them.

Execution Modes

Jobs run in one of two modes:

| Mode | Duration | Mechanism | Example |
| --- | --- | --- | --- |
| SYNC | Under 30 seconds | Inline in API request | Web searches, simple queries |
| ASYNC_QUEUE | Minutes to hours | SQS + Dramatiq workers | Data extractions, large syncs |

Long-running data extractions always use ASYNC_QUEUE mode with full reliability features.
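
As a rough sketch of how this split might look in application code (the ExecutionMode enum, choose_mode, and the duration estimate below are illustrative names, not part of VirtuousAI's actual API):

```python
# Illustrative only: route a job to an execution mode by expected duration.
from enum import Enum


class ExecutionMode(Enum):
    SYNC = "sync"                # runs inline in the API request
    ASYNC_QUEUE = "async_queue"  # enqueued to SQS, processed by Dramatiq workers


SYNC_THRESHOLD_SECONDS = 30


def choose_mode(estimated_seconds: float) -> ExecutionMode:
    """Short jobs run inline; anything longer goes to the durable queue."""
    if estimated_seconds < SYNC_THRESHOLD_SECONDS:
        return ExecutionMode.SYNC
    return ExecutionMode.ASYNC_QUEUE
```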

Cancelling Jobs

Pending Jobs

Jobs in PENDING status are cancelled immediately:

# CLI
vai actions cancel run_abc123
# Run cancelled

# API
curl -X POST https://api.virtuousai.com/api/v1/action-runs/run_abc123/cancel \
  -H "Authorization: Bearer $VAI_API_KEY"

Running Jobs

Jobs in RUNNING status use cooperative cancellation:

  1. API sets cancel_requested_at on the run
  2. Worker checks for cancellation every 30 seconds
  3. Current resource completes (no partial data)
  4. Job transitions to CANCELLED

Cancellation is checked between resources, not mid-resource. The current resource always completes to avoid partial data.
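
A minimal sketch of that cooperative loop, assuming a hypothetical RunState record that carries cancel_requested_at; none of these names are VirtuousAI's real internals:

```python
# Sketch of cooperative cancellation checked between resources.
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunState:
    """Stand-in for the run row holding cancel_requested_at and status."""
    cancel_requested_at: float | None = None
    status: str = "RUNNING"


def run_extraction(
    run: RunState,
    resources: list[str],
    extract_resource: Callable[[str], None],
    check_interval: float = 30.0,
) -> None:
    last_check = time.monotonic()
    for resource in resources:
        # Cancellation is checked only *between* resources, at most once
        # per check_interval seconds.
        if time.monotonic() - last_check >= check_interval:
            last_check = time.monotonic()
            if run.cancel_requested_at is not None:
                run.status = "CANCELLED"
                return
        # The current resource always runs to completion, so no partial
        # data is written before its checkpoint.
        extract_resource(resource)
    run.status = "COMPLETED"
```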

Cancellation Sources

| Source | Trigger | Behavior |
| --- | --- | --- |
| User Request | API call or CLI | Cooperative, graceful |
| SIGTERM | ECS deployment | Cooperative, 120s grace period |
| Lease Lost | Worker crash | Watchdog marks failed |
| Timeout | Dramatiq time limit | Forced termination |

Progress Tracking

Monitor extraction progress in real-time:

vai actions get run_abc123

Progress includes:

  • Current phase: extracting, normalizing, loading
  • Current resource being processed
  • Resources completed vs total
  • Rows extracted so far
  • Elapsed time
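
These fields can also be polled programmatically until the run reaches a terminal phase. The sketch below assumes a GET endpoint for individual runs and the progress field names shown (phase, resources_completed, resources_total, rows_extracted); those are illustrative assumptions, and only the phase values come from this page:

```python
# Poll a run until it reaches a terminal phase. The GET endpoint and the
# progress field names are assumptions made for this example.
import os
import time

import requests

API_KEY = os.environ["VAI_API_KEY"]
RUN_ID = "run_abc123"
TERMINAL_PHASES = {"completed", "failed", "cancelled"}

while True:
    run = requests.get(
        f"https://api.virtuousai.com/api/v1/action-runs/{RUN_ID}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    ).json()
    progress = run.get("progress", {})
    print(
        f"{progress.get('phase')}: "
        f"{progress.get('resources_completed')}/{progress.get('resources_total')} resources, "
        f"{progress.get('rows_extracted')} rows extracted"
    )
    if progress.get("phase") in TERMINAL_PHASES:
        break
    time.sleep(15)  # poll interval is arbitrary
```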

Progress Phases

| Phase | Description |
| --- | --- |
| starting | Initializing extraction |
| extracting | Pulling data from source API |
| normalizing | Applying schema transformations |
| loading | Writing to S3 bronze layer |
| completed | Finished successfully |
| failed | Encountered error |
| cancelled | User or system cancelled |

Resume After Failure

If a job fails mid-extraction, it resumes from the last checkpoint.

Per-Resource Checkpointing

  1. Each resource extracts independently
  2. Checkpoint saved after each resource completes
  3. On retry, completed resources are skipped

Example: an extraction that covers the profiles, events, and lists resources

| Scenario | On Retry |
| --- | --- |
| Crashed during profiles | Re-extract profiles from dlt cursor |
| Crashed during events | Skip profiles, resume events |
| Crashed during lists | Skip profiles + events, resume lists |
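
A sketch of this skip-completed-resources behavior, assuming a per-run set of completed resources is persisted somewhere; the Checkpoint type and helpers are illustrative only:

```python
# Per-resource checkpointing sketch: completed resources are skipped on retry.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Checkpoint:
    """Persisted per run; records which resources have already completed."""
    completed: set[str] = field(default_factory=set)


def extract_with_checkpoints(
    checkpoint: Checkpoint,
    resources: list[str],
    extract_resource: Callable[[str], None],
) -> None:
    for resource in resources:
        if resource in checkpoint.completed:
            continue  # finished on a previous attempt; skip it on retry
        extract_resource(resource)          # may raise and fail the run
        checkpoint.completed.add(resource)  # recorded only after success
```

With resources = ["profiles", "events", "lists"], a crash during events leaves only profiles in the checkpoint, so the retry skips profiles and resumes at events.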

dlt Incremental State

Within each resource, dlt maintains cursor state:

  • Stored in S3 _dlt_pipeline_state/
  • Tracks last updated_at or similar cursor
  • On restart, only fetches records after cursor

dlt state commits at the end of each resource. If a job crashes mid-resource, that resource re-extracts from its last committed cursor rather than from the middle of a page.
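
The standard dlt pattern this refers to looks roughly like the following; the fetch_pages helper and the updated_at field are placeholders, not VirtuousAI's actual source code:

```python
# Generic dlt incremental-cursor pattern (not VirtuousAI's real pipeline).
import dlt


def fetch_pages(since: str):
    """Placeholder for a paginated call to the source API."""
    yield from ()  # would yield batches of records carrying an updated_at field


@dlt.resource(name="events", write_disposition="append")
def events(
    updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z"),
):
    # dlt tracks the highest updated_at seen; after a restart, last_value is
    # the cursor committed at the end of the previous successful run, so only
    # newer records are fetched.
    yield from fetch_pages(since=updated_at.last_value)
```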

Lease-Based Ownership

Workers must acquire a database lease before processing a job:

| Parameter | Value | Purpose |
| --- | --- | --- |
| Lease Duration | 90 seconds | How long a worker owns a job |
| Heartbeat Interval | 30 seconds | How often lease is extended |
| Watchdog Grace | 180 seconds | How stale before recovery kicks in |

This prevents duplicate processing when:

  • SQS delivers the same message twice
  • A worker is slow but not dead
  • Network partitions occur
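
A simplified lease sketch using the durations from the table above; the Lease class and its in-memory storage are illustrative, whereas a production version would be a compare-and-set update against the job row in the database:

```python
# Illustrative lease ownership with acquire + heartbeat.
import time
from dataclasses import dataclass

LEASE_DURATION = 90      # seconds a worker owns the job after each heartbeat
HEARTBEAT_INTERVAL = 30  # seconds between lease extensions
WATCHDOG_GRACE = 180     # seconds of staleness before the watchdog steps in


@dataclass
class Lease:
    owner: str | None = None
    expires_at: float = 0.0

    def try_acquire(self, worker_id: str) -> bool:
        """Take the lease only if it is free or expired (prevents duplicates)."""
        now = time.time()
        if self.owner is None or now >= self.expires_at:
            self.owner = worker_id
            self.expires_at = now + LEASE_DURATION
            return True
        return self.owner == worker_id  # re-delivery to the same worker is fine

    def heartbeat(self, worker_id: str) -> bool:
        """Extend the lease; called every HEARTBEAT_INTERVAL while working."""
        if self.owner != worker_id:
            return False  # lost the lease; stop processing
        self.expires_at = time.time() + LEASE_DURATION
        return True
```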

Deployment Safety

When deploying new worker versions:

  1. ECS sends SIGTERM to running containers
  2. Workers have 120 seconds to finish gracefully
  3. Cancellation token is set immediately
  4. Current resource completes and checkpoints
  5. Job marked as CANCELLED (retryable)
  6. New worker picks up from checkpoint
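
A sketch of the SIGTERM handling described in steps 1-3, assuming a worker that checks a shared shutdown flag between resources; the names are illustrative:

```python
# Translate ECS's SIGTERM into the same cooperative-cancellation signal the
# worker already checks between resources.
import signal
import threading

shutdown_requested = threading.Event()


def _handle_sigterm(signum, frame):
    # ECS sends SIGTERM at deploy time; roughly 120 seconds remain before SIGKILL.
    shutdown_requested.set()


signal.signal(signal.SIGTERM, _handle_sigterm)


def should_stop() -> bool:
    """Checked between resources; the in-flight resource still completes."""
    return shutdown_requested.is_set()
```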

Best Practices for Deployments

  1. Check running jobs before deploying:

     vai actions list --status running

  2. Wait for completion if possible (safest)
  3. Deploy with confidence: jobs resume automatically from checkpoints

Watchdog Recovery

A background task monitors for abandoned jobs:

| Check | Frequency | Action |
| --- | --- | --- |
| Expired leases | Every 5 minutes | Mark FAILED (retryable) |
| Stuck PENDING | Every 5 minutes | Re-enqueue if stale |

Jobs that fail with a worker_lost error are automatically re-enqueued if the retry count allows.
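
A sketch of one watchdog pass under these rules; the Job fields and the staleness threshold for stuck PENDING jobs are assumptions:

```python
# One watchdog pass: recover jobs whose lease went stale and re-enqueue stuck
# PENDING jobs. Job shape and the PENDING staleness threshold are assumptions.
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    status: str
    lease_heartbeat_at: float
    enqueued_at: float
    retries_remaining: int = 3
    error: str | None = None


def watchdog_pass(
    jobs: list[Job],
    enqueue: Callable[[Job], None],
    grace: float = 180.0,          # watchdog grace from the lease table
    pending_stale: float = 600.0,  # assumed threshold for "stale" PENDING jobs
) -> None:
    now = time.time()
    for job in jobs:
        if job.status == "RUNNING" and now - job.lease_heartbeat_at > grace:
            job.status = "FAILED"   # retryable failure
            job.error = "worker_lost"
            if job.retries_remaining > 0:
                job.retries_remaining -= 1
                enqueue(job)        # another worker resumes from the checkpoint
        elif job.status == "PENDING" and now - job.enqueued_at > pending_stale:
            enqueue(job)            # original message presumed lost
```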

Troubleshooting

Job stuck in RUNNING

  1. Check if worker is alive (lease should be fresh)
  2. If lease expired, watchdog will recover within 5 minutes
  3. Manual recovery: vai actions retry run_abc123

Job keeps failing

  1. Check error details: vai actions get run_abc123
  2. If AUTH_ERROR: Update connection credentials
  3. If RATE_LIMITED: Job will auto-retry with backoff
  4. If worker_lost: Infrastructure issue, check ECS logs

Data seems duplicated

Bronze layer may have duplicate files after crash/restart. This is expected:

  • Bronze = raw data (duplicates acceptable)
  • Silver layer deduplicates during transformation
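
A minimal sketch of the kind of deduplication the silver layer might apply, keeping the most recently loaded copy of each key; the id and _loaded_at column names are assumptions, and the real transformation may differ:

```python
# Illustrative dedup: keep one row per key, preferring the latest loaded copy.
def deduplicate(rows: list[dict], key: str = "id", loaded_at: str = "_loaded_at") -> list[dict]:
    latest: dict = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[loaded_at] > current[loaded_at]:
            latest[row[key]] = row
    return list(latest.values())
```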

SQS Visibility Heartbeat

For very long jobs (8+ hours), the system extends SQS message visibility:

| Parameter | Value |
| --- | --- |
| Initial Visibility | 30 minutes |
| Extension Interval | 5 minutes |
| Extension Amount | 10 minutes |

This prevents SQS from re-delivering messages during extremely long extractions.
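
A sketch of such a visibility heartbeat using boto3's change_message_visibility, run in a background thread while the job executes; the queue URL and receipt handle are placeholders:

```python
# Background visibility heartbeat: periodically reset the message's visibility
# timeout so SQS does not re-deliver it mid-job.
import threading

import boto3

EXTENSION_INTERVAL = 5 * 60   # seconds between extensions
EXTENSION_AMOUNT = 10 * 60    # new visibility timeout applied on each extension

sqs = boto3.client("sqs")


def visibility_heartbeat(queue_url: str, receipt_handle: str, done: threading.Event) -> None:
    """Run in a background thread; set `done` when the job finishes."""
    while not done.wait(EXTENSION_INTERVAL):
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=EXTENSION_AMOUNT,
        )
```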
