Job Reliability
Understanding cancellation, progress tracking, and resume capabilities
VirtuousAI ensures long-running jobs complete reliably, even during worker failures or deployments. This guide explains the reliability mechanisms and how to use them.
Execution Modes
Jobs run in one of two modes:
| Mode | Duration | Mechanism | Example |
|---|---|---|---|
| SYNC | Under 30 seconds | Inline in API request | Web searches, simple queries |
| ASYNC_QUEUE | Minutes to hours | SQS + Dramatiq workers | Data extractions, large syncs |
Long-running data extractions always use ASYNC_QUEUE mode with full reliability features.
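As a rough illustration of how mode selection works (the function and constant below are hypothetical, not the actual dispatch code), the choice comes down to a duration threshold:

```python
# Hypothetical mode-selection sketch; names are illustrative only.
from enum import Enum

class ExecutionMode(Enum):
    SYNC = "sync"                # runs inline in the API request
    ASYNC_QUEUE = "async_queue"  # enqueued to SQS for a Dramatiq worker

SYNC_THRESHOLD_SECONDS = 30

def choose_mode(estimated_seconds: float) -> ExecutionMode:
    """Short jobs run inline; anything longer goes through the queue."""
    if estimated_seconds < SYNC_THRESHOLD_SECONDS:
        return ExecutionMode.SYNC
    return ExecutionMode.ASYNC_QUEUE
```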
Cancelling Jobs
Pending Jobs
Jobs in PENDING status are cancelled immediately:
```bash
vai actions cancel run_abc123
# Run cancelled
```

Or cancel via the API:

```bash
curl -X POST https://api.virtuousai.com/api/v1/action-runs/run_abc123/cancel \
  -H "Authorization: Bearer $VAI_API_KEY"
```

Running Jobs
Jobs in RUNNING status use cooperative cancellation:
- API sets `cancel_requested_at` on the run
- Worker checks for cancellation every 30 seconds
- Current resource completes (no partial data)
- Job transitions to `CANCELLED`
Cancellation is checked between resources, not mid-resource. The current resource always completes to avoid partial data.
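To make the cooperative model concrete, here is a minimal sketch of the worker loop, assuming hypothetical `Run` fields and an injected `extract` callable; this is not the actual worker code:

```python
# Illustrative cooperative cancellation; names are hypothetical.
import time
from dataclasses import dataclass, field

CANCEL_CHECK_INTERVAL = 30  # seconds between cancellation checks

@dataclass
class Run:
    id: str
    cancel_requested_at: float | None = None  # set by the API on cancel
    status: str = "RUNNING"
    completed: list[str] = field(default_factory=list)

def process(run: Run, resources: list[str], extract) -> None:
    last_check = 0.0
    for resource in resources:
        # Check at most every CANCEL_CHECK_INTERVAL, and only *between*
        # resources, so the in-flight resource always finishes cleanly.
        now = time.monotonic()
        if now - last_check >= CANCEL_CHECK_INTERVAL:
            last_check = now
            if run.cancel_requested_at is not None:
                run.status = "CANCELLED"
                return
        extract(resource)                # completes fully; no partial data
        run.completed.append(resource)   # checkpoint after each resource
    run.status = "COMPLETED"
```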
Cancellation Sources
| Source | Trigger | Behavior |
|---|---|---|
| User Request | API call or CLI | Cooperative, graceful |
| SIGTERM | ECS deployment | Cooperative, 120s grace period |
| Lease Lost | Worker crash | Watchdog marks failed |
| Timeout | Dramatiq time limit | Forced termination |
Progress Tracking
Monitor extraction progress in real-time:
```bash
vai actions get run_abc123
```

Progress includes:

- Current phase: `extracting`, `normalizing`, `loading`
- Current resource being processed
- Resources completed vs total
- Rows extracted so far
- Elapsed time
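If you need to wait on a run programmatically rather than watching the CLI, something like the sketch below works. It assumes a GET endpoint at `/api/v1/action-runs/{id}` that mirrors the CLI output; the endpoint path and the `phase`/`current_resource` response fields are assumptions, not documented API:

```python
# Hedged polling sketch; the GET endpoint and response fields are assumptions.
import os
import time
import requests

API = "https://api.virtuousai.com/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['VAI_API_KEY']}"}
TERMINAL_PHASES = {"completed", "failed", "cancelled"}

def wait_for_run(run_id: str, poll_seconds: int = 30) -> dict:
    """Poll the run until it reaches a terminal phase, printing progress."""
    while True:
        run = requests.get(f"{API}/action-runs/{run_id}", headers=HEADERS).json()
        phase = run.get("phase", "unknown")
        print(f"{run_id}: phase={phase} resource={run.get('current_resource')}")
        if phase in TERMINAL_PHASES:
            return run
        time.sleep(poll_seconds)
```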
Progress Phases
| Phase | Description |
|---|---|
| `starting` | Initializing extraction |
| `extracting` | Pulling data from source API |
| `normalizing` | Applying schema transformations |
| `loading` | Writing to S3 bronze layer |
| `completed` | Finished successfully |
| `failed` | Encountered error |
| `cancelled` | User or system cancelled |
Resume After Failure
If a job fails mid-extraction, it resumes from the last checkpoint.
Per-Resource Checkpointing
- Each resource extracts independently
- Checkpoint saved after each resource completes
- On retry, completed resources are skipped
Example: Extracting profiles, events, lists
| Scenario | On Retry |
|---|---|
| Crashed during profiles | Re-extract profiles from dlt cursor |
| Crashed during events | Skip profiles, resume events |
| Crashed during lists | Skip profiles + events, resume lists |
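A minimal sketch of the skip-completed-resources pattern, using an in-memory checkpoint store keyed by run ID; the store and function names are hypothetical, not the actual implementation:

```python
# Illustrative per-resource checkpointing: a resource is recorded only after
# it finishes, so completed resources are skipped on retry. Names hypothetical.
def run_with_checkpoints(run_id: str, resources: list[str], extract, store: dict) -> None:
    completed: set[str] = store.setdefault(run_id, set())
    for resource in resources:
        if resource in completed:
            continue                  # already extracted on a previous attempt
        extract(resource)             # within a resource, the dlt cursor limits re-extraction
        completed.add(resource)       # checkpoint after the resource completes

# Example: the first attempt crashed during events, so profiles is skipped on retry.
store: dict[str, set[str]] = {"run_abc123": {"profiles"}}
run_with_checkpoints("run_abc123", ["profiles", "events", "lists"],
                     extract=lambda r: print(f"extracting {r}"), store=store)
```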
dlt Incremental State
Within each resource, dlt maintains cursor state:
- Stored in S3 under `_dlt_pipeline_state/`
- Tracks the last `updated_at` or similar cursor
- On restart, only fetches records after the cursor
dlt state commits at the END of each resource. If crashed mid-resource, that resource re-extracts from its last cursor (not mid-page).
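The dlt side of this looks roughly like the sketch below. The `@dlt.resource` decorator, `dlt.sources.incremental`, and `last_value` are standard dlt APIs; the source client (`fetch_profiles`) and pipeline names are made up for illustration:

```python
# Sketch of dlt incremental loading: the updated_at cursor lives in pipeline
# state and is committed with the load when the resource finishes.
import dlt

def fetch_profiles(since: str):
    """Hypothetical source client: yields pages of records newer than `since`."""
    yield [{"id": 1, "updated_at": "2024-01-01T00:00:00Z"}]

@dlt.resource(name="profiles", write_disposition="append")
def profiles(updated_at=dlt.sources.incremental("updated_at",
                                                initial_value="1970-01-01T00:00:00Z")):
    # Only records newer than the stored cursor are fetched on restart.
    for page in fetch_profiles(since=updated_at.last_value):
        yield page

# Destination credentials (e.g. the S3 bucket_url) come from dlt config/env.
pipeline = dlt.pipeline(pipeline_name="extract_example", destination="filesystem")
pipeline.run(profiles())
```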
Lease-Based Ownership
Workers must acquire a database lease before processing a job:
| Parameter | Value | Purpose |
|---|---|---|
| Lease Duration | 90 seconds | How long a worker owns a job |
| Heartbeat Interval | 30 seconds | How often lease is extended |
| Watchdog Grace | 180 seconds | How stale before recovery kicks in |
This prevents duplicate processing when:
- SQS delivers the same message twice
- A worker is slow but not dead
- Network partitions occur
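One common way to implement this kind of lease is a conditional UPDATE that only succeeds when the lease is free or expired. The table and column names below are hypothetical, not the actual schema:

```python
# Illustrative lease acquisition via conditional UPDATE (hypothetical schema).
# A worker only wins the 90-second lease if it is unclaimed or already expired.
ACQUIRE_LEASE_SQL = """
UPDATE action_runs
SET lease_owner = %(worker_id)s,
    lease_expires_at = now() + interval '90 seconds'
WHERE id = %(run_id)s
  AND (lease_owner IS NULL OR lease_expires_at < now())
RETURNING id;
"""

def try_acquire_lease(cursor, run_id: str, worker_id: str) -> bool:
    """Returns True only if this worker now owns the run; losers back off."""
    cursor.execute(ACQUIRE_LEASE_SQL, {"run_id": run_id, "worker_id": worker_id})
    return cursor.fetchone() is not None
```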
Deployment Safety
When deploying new worker versions:
- ECS sends `SIGTERM` to running containers
- Workers have 120 seconds to finish gracefully
- Cancellation token is set immediately
- Current resource completes and checkpoints
- Job marked as `CANCELLED` (retryable)
- New worker picks up from checkpoint
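As a sketch of the worker side (standard-library only; the helper name is hypothetical), the SIGTERM handler simply flips a cancellation flag that the extraction loop checks between resources:

```python
# SIGTERM handling sketch: the signal sets a flag; nothing is killed directly.
import signal
import threading

cancel_token = threading.Event()

def _handle_sigterm(signum, frame):
    # ECS sends SIGTERM first, then SIGKILL after the ~120s grace period.
    cancel_token.set()

signal.signal(signal.SIGTERM, _handle_sigterm)

def should_stop() -> bool:
    """Checked by the extraction loop between resources."""
    return cancel_token.is_set()
```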
Best Practices for Deployments
- Check running jobs before deploying: `vai actions list --status running`
- Wait for completion if possible (safest)
- Deploy with confidence; jobs resume automatically from checkpoints
Watchdog Recovery
A background task monitors for abandoned jobs:
| Check | Frequency | Action |
|---|---|---|
| Expired leases | Every 5 minutes | Mark FAILED (retryable) |
| Stuck PENDING | Every 5 minutes | Re-enqueue if stale |
Jobs with a `worker_lost` error are automatically re-enqueued if the retry count allows.
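A watchdog pass might look roughly like this, reusing the hypothetical schema from the lease sketch above along with an injected `enqueue_retry` helper:

```python
# Illustrative watchdog pass, run every 5 minutes (hypothetical schema).
RECOVER_EXPIRED_SQL = """
UPDATE action_runs
SET status = 'FAILED', error_code = 'worker_lost'
WHERE status = 'RUNNING'
  AND lease_expires_at < now() - interval '180 seconds'
RETURNING id;
"""

def watchdog_pass(cursor, enqueue_retry) -> None:
    """Fail runs whose lease is stale past the grace period, then retry them."""
    cursor.execute(RECOVER_EXPIRED_SQL)
    for (run_id,) in cursor.fetchall():
        enqueue_retry(run_id)  # re-enqueued only if the retry budget allows
```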
Troubleshooting
Job stuck in RUNNING
- Check if worker is alive (lease should be fresh)
- If lease expired, watchdog will recover within 5 minutes
- Manual recovery: `vai actions retry run_abc123`
Job keeps failing
- Check error details: `vai actions get run_abc123`
- If `AUTH_ERROR`: update connection credentials
- If `RATE_LIMITED`: job will auto-retry with backoff
- If `worker_lost`: infrastructure issue, check ECS logs
Data seems duplicated
The bronze layer may contain duplicate files after a crash and restart. This is expected:
- Bronze = raw data (duplicates acceptable)
- Silver layer deduplicates during transformation
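For intuition, silver-layer deduplication amounts to keeping the newest record per key across whatever duplicate bronze files exist. This pandas sketch is illustrative, not the actual transformation:

```python
# Illustrative dedup: keep the latest record per primary key across bronze files.
import pandas as pd

def dedupe_bronze(frames: list[pd.DataFrame], key: str = "id",
                  cursor: str = "updated_at") -> pd.DataFrame:
    combined = pd.concat(frames, ignore_index=True)
    return (combined.sort_values(cursor)
                    .drop_duplicates(subset=[key], keep="last"))
```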
SQS Visibility Heartbeat
For very long jobs (8+ hours), the system extends SQS message visibility:
| Parameter | Value |
|---|---|
| Initial Visibility | 30 minutes |
| Extension Interval | 5 minutes |
| Extension Amount | 10 minutes |
This prevents SQS from re-delivering messages during extremely long extractions.
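The heartbeat itself can be implemented with the standard `ChangeMessageVisibility` call. This boto3 sketch mirrors the parameters in the table above; the threading wrapper is illustrative, not the actual worker code:

```python
# SQS visibility heartbeat sketch: every 5 minutes, push visibility out 10 minutes.
import threading
import boto3

sqs = boto3.client("sqs")

def start_visibility_heartbeat(queue_url: str, receipt_handle: str,
                               interval: int = 300, extension: int = 600) -> threading.Event:
    """Returns an Event; call .set() when the job finishes to stop the heartbeat."""
    stop = threading.Event()

    def _beat():
        while not stop.wait(interval):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=extension,  # seconds from now
            )

    threading.Thread(target=_beat, daemon=True).start()
    return stop
```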