# Architecture
Understanding the VirtuousAI execution model and infrastructure
VirtuousAI provides the execution backbone for data workflows. It's designed around reliability, composability, and strong typing.
## Execution Entry Points
All execution flows through the same primitives, regardless of how it's triggered:
| Entry Point | Creates ActionRun With | Use Case |
|---|---|---|
| REST API | Direct POST /action-runs | Programmatic access, integrations |
| Chat (LLM Tools) | turn_id + message_id | Agentic tool execution |
| Automations | automation_run_id + step_key | DAG orchestration |
| Webhooks | Trigger source metadata | Event-driven workflows |
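As a concrete example of the REST entry point, a client might create a run like the sketch below. Only the `POST /action-runs` path comes from the table above; the base URL, auth scheme, and payload fields (`action_kind`, `inputs`) are illustrative assumptions.

```python
# Minimal sketch of the REST entry point; payload shape is hypothetical.
import requests

API_BASE = "https://api.example.com"  # placeholder base URL

def create_action_run(token: str, action_kind: str, inputs: dict) -> dict:
    """POST /action-runs and return the created ActionRun as JSON."""
    resp = requests.post(
        f"{API_BASE}/action-runs",
        json={"action_kind": action_kind, "inputs": inputs},
        headers={"Authorization": f"Bearer {token}"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```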
## Execution Modes
The ActionKind Catalog determines how each action type executes:
| Mode | When Used | Characteristics |
|---|---|---|
| SYNC | Fast operations (<30s) | Executes inline in API request, immediate response |
| ASYNC_QUEUE | Long-running jobs | Enqueued to SQS, processed by Dramatiq workers |
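A hedged sketch of how the catalog could drive dispatch; `CATALOG`, `execute_inline`, and `enqueue_to_sqs` are hypothetical names, not VirtuousAI internals.

```python
# Illustrative dispatch between SYNC and ASYNC_QUEUE execution modes.
from enum import Enum

class ExecutionMode(Enum):
    SYNC = "sync"
    ASYNC_QUEUE = "async_queue"

CATALOG = {  # ActionKind -> execution mode (example entries only)
    "web_search": ExecutionMode.SYNC,
    "data_extraction": ExecutionMode.ASYNC_QUEUE,
}

def dispatch(action_kind: str, run_id: str, execute_inline, enqueue_to_sqs):
    """Run fast actions inline; enqueue long-running ones for workers."""
    mode = CATALOG[action_kind]
    if mode is ExecutionMode.SYNC:
        return execute_inline(run_id)   # immediate result in the API request
    enqueue_to_sqs(run_id)              # 202 Accepted; a worker picks it up
    return {"run_id": run_id, "status": "PENDING"}
```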
### Synchronous Execution

For fast operations like web searches or simple API calls:

- Client sends `POST /action-runs`
- API creates ActionRun (PENDING)
- Executor runs immediately
- Client receives completed ActionRun
### Asynchronous Execution

For long-running operations like data extraction:

Steps:

- Client sends `POST /action-runs`
- API creates ActionRun (PENDING)
- API enqueues message to SQS
- Client receives 202 Accepted with `run_id`
- Worker picks up message, acquires lease
- Executor runs (can take minutes to hours)
- Worker completes/fails the run
- Client polls or subscribes to SSE for status
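Client-side polling might look like the sketch below; the `GET /action-runs/{run_id}` endpoint is assumed by analogy with the POST path and is not taken from this document.

```python
# Hedged sketch of polling an async run until it reaches a terminal status.
import time
import requests

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "REJECTED"}

def wait_for_run(base_url: str, token: str, run_id: str, interval: float = 5.0) -> dict:
    """Poll the run until it reaches a terminal status."""
    while True:
        resp = requests.get(
            f"{base_url}/action-runs/{run_id}",
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        resp.raise_for_status()
        run = resp.json()
        if run.get("status") in TERMINAL:
            return run
        time.sleep(interval)
```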
## Lease-Based Distributed Ownership
Long-running jobs use leases to handle worker failures gracefully:
| Parameter | Value | Purpose |
|---|---|---|
| Lease Duration | 90 seconds | How long a worker owns a job |
| Heartbeat Interval | 30 seconds | How often lease is extended |
| Cancel Check Interval | 30 seconds | How often to check for cancellation |
| Watchdog Grace | 180 seconds | How stale before recovery kicks in |
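A minimal heartbeat loop sketch using the parameters above; `extend_lease` and `cancel_requested` stand in for database-backed calls.

```python
# Heartbeat loop sketch: extend the lease and check cancellation every 30s.
import threading

LEASE_DURATION = 90      # seconds a worker owns the job
HEARTBEAT_INTERVAL = 30  # seconds between lease extensions / cancel checks

def heartbeat_loop(run_id: str, extend_lease, cancel_requested, stop: threading.Event):
    """Keep the lease alive until the job finishes, is cancelled, or the lease is lost."""
    while not stop.wait(HEARTBEAT_INTERVAL):
        if cancel_requested(run_id):
            stop.set()                  # cooperative cancellation
            break
        if not extend_lease(run_id, LEASE_DURATION):
            stop.set()                  # lease lost: another worker may take over
            break
```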
Recovery scenarios:
- Worker crash → Lease expires → Watchdog marks FAILED (retryable)
- Network partition → Heartbeat fails → Lease expires → Takeover or watchdog
- Graceful shutdown → Cancellation token set → Completes current work → Marks CANCELLED
## ActionRun Lifecycle
Every ActionRun goes through a defined state machine:
| Status | Description | Transitions To |
|---|---|---|
| PENDING | Created, waiting for execution | RUNNING, CANCELLED, AWAITING_APPROVAL |
| AWAITING_APPROVAL | Requires human approval | RUNNING (approved), REJECTED |
| RUNNING | Currently executing | COMPLETED, FAILED, CANCELLED |
| COMPLETED | Finished successfully | Terminal |
| FAILED | Finished with error | Terminal (can retry → PENDING) |
| CANCELLED | User or system cancelled | Terminal |
| REJECTED | Approval denied | Terminal |
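The state machine can be expressed as a transition map, as in this sketch (statuses copied from the table; the validation helper is illustrative).

```python
# Allowed status transitions for an ActionRun, per the lifecycle table.
ALLOWED_TRANSITIONS = {
    "PENDING": {"RUNNING", "CANCELLED", "AWAITING_APPROVAL"},
    "AWAITING_APPROVAL": {"RUNNING", "REJECTED"},
    "RUNNING": {"COMPLETED", "FAILED", "CANCELLED"},
    "COMPLETED": set(),
    "FAILED": {"PENDING"},      # retry path
    "CANCELLED": set(),
    "REJECTED": set(),
}

def transition(current: str, new: str) -> str:
    """Validate a status change against the state machine."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```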
## Cooperative Cancellation
Cancellation is cooperative and can come from multiple sources:
| Source | How It Works |
|---|---|
| User Request | cancel_requested_at set on ActionRun |
| SIGTERM | ECS shutdown signal sets cancellation token |
| Lease Lost | Heartbeat failed, another worker may take over |
| Task Timeout | Dramatiq time_limit exceeded |
For data extraction, cancellation is checked between resources — the current resource completes before graceful exit.
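A sketch of that per-resource cancellation check; `extract_resource`, `cancel_token`, and the return values are placeholders.

```python
# Cooperative cancellation checked only at resource boundaries.
import threading

def run_extraction(resources: list[str], extract_resource, cancel_token: threading.Event):
    """Process resources one at a time; stop cleanly at the next boundary."""
    for resource in resources:
        if cancel_token.is_set():
            return "CANCELLED"          # current resource already finished
        extract_resource(resource)      # never interrupted mid-resource
    return "COMPLETED"
```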
## Job Reliability Features
Long-running extractions use multiple reliability mechanisms working together:
### SQS Visibility Heartbeat
For jobs lasting hours, a background thread extends SQS message visibility:
| Parameter | Value | Purpose |
|---|---|---|
| Initial Visibility | 30 minutes | Default SQS timeout |
| Extension Interval | 5 minutes | How often to extend |
| Extension Amount | 10 minutes | How much time to add |
This prevents SQS from re-delivering messages during 8+ hour extractions.
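A sketch of such a heartbeat thread using boto3's `change_message_visibility`; the queue URL and receipt handle come from the in-flight message, and the actual worker implementation may differ.

```python
# Background visibility heartbeat: extend the message every 5 minutes by 10 minutes.
import threading
import boto3

EXTENSION_INTERVAL = 5 * 60   # how often to extend
EXTENSION_AMOUNT = 10 * 60    # how much visibility to add each time

def start_visibility_heartbeat(queue_url: str, receipt_handle: str, stop: threading.Event):
    sqs = boto3.client("sqs")

    def loop():
        while not stop.wait(EXTENSION_INTERVAL):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=EXTENSION_AMOUNT,
            )

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```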
### Per-Resource Checkpointing

Data extractions process resources sequentially with checkpoints between each:

If a worker crashes after `profiles` completes:

- Checkpoint shows `profiles` done
- New worker skips `profiles`
- Resumes from `events`
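A resume sketch, assuming the checkpoint is stored as a set of completed resource names per run (`load_done` and `mark_done` are hypothetical helpers).

```python
# Resume from the last checkpoint: skip completed resources, continue with the rest.
def run_with_checkpoints(run_id: str, resources: list[str], extract, load_done, mark_done):
    done = load_done(run_id)                 # e.g. {"profiles"} after a crash
    for resource in resources:               # e.g. ["profiles", "events", ...]
        if resource in done:
            continue                         # already completed by a previous worker
        extract(resource)
        mark_done(run_id, resource)          # checkpoint before moving on
```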
### Watchdog Recovery

A system task runs every 5 minutes to detect and recover abandoned jobs:

- Find runs where `status = RUNNING` and `lease_expires_at < NOW()`
- Mark as `FAILED` with `worker_lost` error code
- Optionally re-enqueue for automatic retry

This handles scenarios where workers crash without graceful shutdown.
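A sketch of the sweep, assuming an `action_runs` table with the columns named above and a DB-API cursor; the real task may also apply the 180-second watchdog grace from the lease table.

```python
# Watchdog sweep: fail stale RUNNING jobs and optionally re-enqueue them.
WATCHDOG_SQL = """
    UPDATE action_runs
       SET status = 'FAILED',
           error_code = 'worker_lost'
     WHERE status = 'RUNNING'
       AND lease_expires_at < NOW()   -- the 180s grace window may also apply here
 RETURNING id
"""

def recover_abandoned_runs(cursor, re_enqueue=None):
    cursor.execute(WATCHDOG_SQL)
    for (run_id,) in cursor.fetchall():
        if re_enqueue:
            re_enqueue(run_id)
```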
## Infrastructure

### AWS Architecture
| Component | Service | Configuration |
|---|---|---|
| API | ECS Fargate | Auto-scaled, behind ALB |
| Workers | ECS Fargate | 2 vCPU, 4GB RAM, 120s stop timeout |
| Queue | SQS | 30min visibility timeout |
| Database | RDS PostgreSQL | Private subnets, TLS enforced |
| Storage | S3 | Bronze (raw) and Silver (processed) buckets |
| CDN | CloudFront + WAF | OWASP rules, rate limiting |
### Traffic Flow
- Internet → CloudFront (CDN + WAF)
- CloudFront → ALB (origin verification header)
- ALB → ECS API (private subnets)
- API → SQS (async jobs)
- Workers → SQS (pull messages)
- Workers → S3 (write artifacts)
- Workers → External APIs (via connections)
### Credential Security
Credentials use envelope encryption with a two-layer key hierarchy:
| Layer | Purpose |
|---|---|
| DEK (Data Encryption Key) | Random Fernet key, encrypts actual credentials |
| KEK (Key Encryption Key) | Derived from master secret, encrypts the DEK |
How it works:
- Seal (write): Generate random DEK → Encrypt credentials with DEK → Encrypt DEK with KEK → Store both
- Open (read): Derive KEK from master → Decrypt DEK using KEK → Decrypt credentials using DEK
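A minimal seal/open sketch with Fernet; deriving the KEK via HKDF over the master secret and `key_id` is an illustrative choice, not necessarily the exact scheme used.

```python
# Envelope encryption sketch: a random DEK encrypts credentials, a derived KEK wraps the DEK.
import base64
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def derive_kek(master_secret: bytes, key_id: bytes) -> Fernet:
    raw = HKDF(algorithm=hashes.SHA256(), length=32, salt=None, info=key_id).derive(master_secret)
    return Fernet(base64.urlsafe_b64encode(raw))

def seal(master_secret: bytes, key_id: bytes, credentials: bytes) -> tuple[bytes, bytes]:
    """Seal: DEK encrypts the credentials; the KEK encrypts the DEK."""
    dek = Fernet.generate_key()
    ciphertext = Fernet(dek).encrypt(credentials)
    wrapped_dek = derive_kek(master_secret, key_id).encrypt(dek)
    return wrapped_dek, ciphertext            # both are stored, never the plaintext

def open_(master_secret: bytes, key_id: bytes, wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    """Open: derive the KEK, unwrap the DEK, then decrypt the credentials."""
    dek = derive_kek(master_secret, key_id).decrypt(wrapped_dek)
    return Fernet(dek).decrypt(ciphertext)
```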
Security properties:

- Per-connection isolation (each connection has its own DEK)
- Key rotation support via `key_id`
- No plaintext in database
- Tenant isolation enforced at every layer
### Multi-Tenancy

All resources are strictly tenant-isolated:

- Every table has a `tenant_id` column
- All queries automatically filter by current tenant
- Connections, actions, automations, events — all scoped
- Users can belong to multiple orgs and switch between them
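A sketch of the scoping pattern, assuming a DB-API cursor with `%s` placeholders (e.g. psycopg); everything other than the `tenant_id` column is a placeholder.

```python
# Every read against a tenant-owned table carries a mandatory tenant_id filter.
def fetch_scoped(cursor, table: str, tenant_id: str):
    assert table.isidentifier()               # simple guard against injection via the table name
    cursor.execute(f"SELECT * FROM {table} WHERE tenant_id = %s", (tenant_id,))
    return cursor.fetchall()
```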