VirtuousAI

Architecture

Understanding the VirtuousAI execution model and infrastructure

VirtuousAI provides the execution backbone for data workflows. It's designed around reliability, composability, and strong typing.

Execution Entry Points

All execution flows through the same primitives, regardless of how it's triggered:

| Entry Point | Creates ActionRun With | Use Case |
| --- | --- | --- |
| REST API | Direct POST /action-runs | Programmatic access, integrations |
| Chat (LLM Tools) | turn_id + message_id | Agentic tool execution |
| Automations | automation_run_id + step_key | DAG orchestration |
| Webhooks | Trigger source metadata | Event-driven workflows |

Execution Modes

The ActionKind Catalog determines how each action type executes:

| Mode | When Used | Characteristics |
| --- | --- | --- |
| SYNC | Fast operations (<30s) | Executes inline in the API request, immediate response |
| ASYNC_QUEUE | Long-running jobs | Enqueued to SQS, processed by Dramatiq workers |
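
As a rough illustration of the table above, the following sketch routes an action by its catalog mode. The catalog entries (web_search, data_extraction), the dispatch function, and its callback parameters are hypothetical names chosen for this example; the page does not show the real ActionKind Catalog.

```python
from enum import Enum


class ExecutionMode(Enum):
    SYNC = "SYNC"
    ASYNC_QUEUE = "ASYNC_QUEUE"


# Hypothetical catalog entries; the real ActionKind names are not listed here.
CATALOG = {
    "web_search": ExecutionMode.SYNC,
    "data_extraction": ExecutionMode.ASYNC_QUEUE,
}


def dispatch(kind, run_inline, enqueue):
    """Run SYNC kinds inline and hand ASYNC_QUEUE kinds to the queue."""
    mode = CATALOG[kind]
    if mode is ExecutionMode.SYNC:
        # Inline execution: the caller gets the result in the same request.
        return run_inline(kind)
    # Queued execution: the caller only learns that the job was accepted.
    enqueue(kind)
    return None
```

Keeping the mode in a catalog rather than in each caller means every entry point makes the same sync-versus-queue decision.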

Synchronous Execution

For fast operations like web searches or simple API calls:

  1. Client sends POST /action-runs
  2. API creates ActionRun (PENDING)
  3. Executor runs immediately
  4. Client receives completed ActionRun

Asynchronous Execution

For long-running operations like data extraction:

  1. Client sends POST /action-runs
  2. API creates ActionRun (PENDING)
  3. API enqueues message to SQS
  4. Client receives 202 Accepted with run_id
  5. Worker picks up message, acquires lease
  6. Executor runs (can take minutes to hours)
  7. Worker completes/fails the run
  8. Client polls or subscribes to SSE for status
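
From the client's side, the last step above can be sketched as a polling loop. This is a minimal sketch, not the official client: wait_for_run, fetch_status, and the default poll interval and timeout are assumptions for illustration, and fetch_status stands in for whatever call returns the current status of a run.

```python
import time

# Statuses after which polling can stop (see the ActionRun lifecycle).
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "REJECTED"}


def wait_for_run(fetch_status, run_id, poll_seconds=2.0,
                 timeout_seconds=3600.0, sleep=time.sleep,
                 clock=time.monotonic):
    """Poll until the run reaches a terminal status or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status(run_id)
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run {run_id} still {status} after timeout")
        sleep(poll_seconds)
```

Subscribing to SSE avoids the polling delay entirely; polling is shown here only because it needs no extra client machinery.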

Lease-Based Distributed Ownership

Long-running jobs use leases to handle worker failures gracefully:

| Parameter | Value | Purpose |
| --- | --- | --- |
| Lease Duration | 90 seconds | How long a worker owns a job |
| Heartbeat Interval | 30 seconds | How often the lease is extended |
| Cancel Check Interval | 30 seconds | How often to check for cancellation |
| Watchdog Grace | 180 seconds | How stale a run must be before recovery kicks in |

Recovery scenarios:

  • Worker crash → Lease expires → Watchdog marks FAILED (retryable)
  • Network partition → Heartbeat fails → Lease expires → Another worker takes over, or the watchdog recovers the run
  • Graceful shutdown → Cancellation token set → Completes current work → Marks CANCELLED
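
The lease rules above can be modelled in a few lines. This is an in-memory sketch under stated assumptions: the Lease class and the acquire and heartbeat functions are hypothetical, and a real implementation would perform these checks atomically in the database rather than on a Python object.

```python
from dataclasses import dataclass

LEASE_DURATION_S = 90     # from the table above
HEARTBEAT_INTERVAL_S = 30  # how often a worker would call heartbeat()


@dataclass
class Lease:
    owner: str
    expires_at: float


def acquire(current, worker_id, now):
    """Return a new lease if none is held or the held one expired, else None."""
    if current is None or current.expires_at <= now:
        return Lease(worker_id, now + LEASE_DURATION_S)
    return None


def heartbeat(lease, worker_id, now):
    """Extend the lease only if this worker still owns an unexpired lease."""
    if lease.owner != worker_id or lease.expires_at <= now:
        return False  # lease lost: the worker should stop cooperatively
    lease.expires_at = now + LEASE_DURATION_S
    return True
```

With a 90-second lease and a 30-second heartbeat, a worker can miss roughly two consecutive heartbeats before its lease lapses and another worker may take over.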

ActionRun Lifecycle

Every ActionRun goes through a defined state machine:

| Status | Description | Transitions To |
| --- | --- | --- |
| PENDING | Created, waiting for execution | RUNNING, CANCELLED, AWAITING_APPROVAL |
| AWAITING_APPROVAL | Requires human approval | RUNNING (approved), REJECTED |
| RUNNING | Currently executing | COMPLETED, FAILED, CANCELLED |
| COMPLETED | Finished successfully | Terminal |
| FAILED | Finished with error | Terminal (can retry → PENDING) |
| CANCELLED | User or system cancelled | Terminal |
| REJECTED | Approval denied | Terminal |
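
The table above translates directly into a transition map. The sketch below is one way to enforce it; the TRANSITIONS dictionary and the helper names are assumptions for illustration, not the platform's actual code.

```python
# Allowed transitions, copied from the lifecycle table.
TRANSITIONS = {
    "PENDING": {"RUNNING", "CANCELLED", "AWAITING_APPROVAL"},
    "AWAITING_APPROVAL": {"RUNNING", "REJECTED"},
    "RUNNING": {"COMPLETED", "FAILED", "CANCELLED"},
    "COMPLETED": set(),
    "FAILED": {"PENDING"},  # only via an explicit retry
    "CANCELLED": set(),
    "REJECTED": set(),
}


def can_transition(src, dst):
    """True if the lifecycle table allows moving from src to dst."""
    return dst in TRANSITIONS[src]


def transition(src, dst):
    """Return dst if the move is legal, otherwise raise ValueError."""
    if not can_transition(src, dst):
        raise ValueError(f"illegal transition {src} -> {dst}")
    return dst
```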

Cooperative Cancellation

Cancellation is cooperative and can come from multiple sources:

| Source | How It Works |
| --- | --- |
| User Request | cancel_requested_at set on ActionRun |
| SIGTERM | ECS shutdown signal sets cancellation token |
| Lease Lost | Heartbeat failed, another worker may take over |
| Task Timeout | Dramatiq time_limit exceeded |

For data extraction, cancellation is checked between resources, so the current resource finishes before the run exits gracefully.
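
A minimal sketch of that between-resources check, assuming the cancellation token is a threading.Event (the page does not say what the real token type is). The extract function and its parameters are hypothetical.

```python
import threading


def extract(resources, process, cancel_token):
    """Process resources in order, checking for cancellation between them."""
    done = []
    for name in resources:
        # The token is only consulted here, so a resource that has already
        # started always runs to completion before the run exits.
        if cancel_token.is_set():
            return "CANCELLED", done
        process(name)
        done.append(name)
    return "COMPLETED", done
```

Each cancellation source in the table only needs to set the token; the extraction loop decides when it is safe to stop.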

Job Reliability Features

Long-running extractions use multiple reliability mechanisms working together:

SQS Visibility Heartbeat

For jobs lasting hours, a background thread extends SQS message visibility:

| Parameter | Value | Purpose |
| --- | --- | --- |
| Initial Visibility | 30 minutes | Default SQS timeout |
| Extension Interval | 5 minutes | How often to extend |
| Extension Amount | 10 minutes | How much time to add |

This prevents SQS from re-delivering messages during 8+ hour extractions.
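
The background thread could look roughly like the sketch below. It assumes a boto3-style SQS client whose change_message_visibility method takes QueueUrl, ReceiptHandle, and VisibilityTimeout; the start_visibility_heartbeat function and its parameters are hypothetical. Two SQS behaviours are worth keeping in mind: the new timeout is measured from the moment of the call rather than added to the remaining time, and a message cannot stay invisible for more than 12 hours in total.

```python
import threading

EXTENSION_INTERVAL_S = 5 * 60   # how often to extend
EXTENSION_AMOUNT_S = 10 * 60    # visibility requested on each extension


def start_visibility_heartbeat(sqs, queue_url, receipt_handle, stop,
                               interval_s=EXTENSION_INTERVAL_S):
    """Extend message visibility on a timer until the stop event is set."""

    def loop():
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop exits promptly when the job finishes.
        while not stop.wait(interval_s):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=EXTENSION_AMOUNT_S,
            )

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

The worker would set the stop event in a finally block around the job so the thread never outlives the message it is protecting.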

Per-Resource Checkpointing

Data extractions process resources sequentially and record a checkpoint after each one completes.

For example, if a worker crashes after the profiles resource completes, with events still to run:

  1. Checkpoint shows profiles done
  2. New worker skips profiles
  3. Resumes from events
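
The resume behaviour can be sketched as a loop that skips anything already checkpointed. Here the checkpoint is a plain in-memory set and run_extraction is a hypothetical name; how checkpoints are actually persisted is not described on this page.

```python
def run_extraction(resources, checkpoint, extract):
    """Extract each resource once, skipping those already checkpointed."""
    for name in resources:
        if name in checkpoint:
            continue  # finished by a previous worker
        extract(name)
        # Record completion only after the resource fully succeeds, so a
        # crash mid-resource causes that resource to be re-run, not skipped.
        checkpoint.add(name)
```

Because a resource interrupted mid-run is repeated from its start, this pattern works best when each resource's extraction is safe to re-run.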

Watchdog Recovery

A system task runs every 5 minutes to detect and recover abandoned jobs:

  1. Find runs where status = RUNNING and lease_expires_at < NOW()
  2. Mark as FAILED with worker_lost error code
  3. Optionally re-enqueue for automatic retry

This handles scenarios where workers crash without graceful shutdown.
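
One pass of the watchdog might look like the following sketch. The Run dataclass and the sweep function are hypothetical stand-ins for the real table and task; timestamps are plain numbers to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    id: str
    status: str
    lease_expires_at: float
    error_code: Optional[str] = None


def sweep(runs, now, reenqueue=None):
    """Fail RUNNING runs whose lease has expired; optionally re-enqueue them."""
    recovered = []
    for run in runs:
        if run.status == "RUNNING" and run.lease_expires_at < now:
            run.status = "FAILED"
            run.error_code = "worker_lost"
            if reenqueue is not None:
                reenqueue(run.id)  # automatic retry is optional
            recovered.append(run.id)
    return recovered
```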

Infrastructure

AWS Architecture

| Component | Service | Configuration |
| --- | --- | --- |
| API | ECS Fargate | Auto-scaled, behind ALB |
| Workers | ECS Fargate | 2 vCPU, 4 GB RAM, 120s stop timeout |
| Queue | SQS | 30-minute visibility timeout |
| Database | RDS PostgreSQL | Private subnets, TLS enforced |
| Storage | S3 | Bronze (raw) and Silver (processed) buckets |
| CDN | CloudFront + WAF | OWASP rules, rate limiting |

Traffic Flow

  1. Internet → CloudFront (CDN + WAF)
  2. CloudFront → ALB (origin verification header)
  3. ALB → ECS API (private subnets)
  4. API → SQS (async jobs)
  5. Workers → SQS (pull messages)
  6. Workers → S3 (write artifacts)
  7. Workers → External APIs (via connections)

Credential Security

Credentials use envelope encryption with a two-layer key hierarchy:

| Layer | Purpose |
| --- | --- |
| DEK (Data Encryption Key) | Random Fernet key, encrypts actual credentials |
| KEK (Key Encryption Key) | Derived from master secret, encrypts the DEK |

How it works:

  1. Seal (write): Generate random DEK → Encrypt credentials with DEK → Encrypt DEK with KEK → Store both
  2. Open (read): Derive KEK from master → Decrypt DEK using KEK → Decrypt credentials using DEK
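
The seal and open steps can be sketched with the Fernet class from the Python cryptography package. This is an illustration under stated assumptions, not the platform's implementation: the function names and record layout are hypothetical, and the KEK is derived here with a bare SHA-256 hash purely to keep the example short. The page does not say which key-derivation function is actually used, and a production system would normally use a dedicated KDF such as HKDF.

```python
import base64
import hashlib

from cryptography.fernet import Fernet


def derive_kek(master_secret: bytes) -> Fernet:
    # Assumption: SHA-256 stands in for the real (unspecified) KDF.
    # A Fernet key is 32 bytes, url-safe base64 encoded.
    key = base64.urlsafe_b64encode(hashlib.sha256(master_secret).digest())
    return Fernet(key)


def seal(master_secret: bytes, credentials: bytes) -> dict:
    """Encrypt credentials with a fresh DEK, then wrap the DEK with the KEK."""
    dek = Fernet.generate_key()
    return {
        "ciphertext": Fernet(dek).encrypt(credentials),
        "wrapped_dek": derive_kek(master_secret).encrypt(dek),
    }


def open_sealed(master_secret: bytes, record: dict) -> bytes:
    """Unwrap the DEK with the KEK, then decrypt the credentials."""
    dek = derive_kek(master_secret).decrypt(record["wrapped_dek"])
    return Fernet(dek).decrypt(record["ciphertext"])
```

Because each call to seal generates its own DEK, rotating the master secret only requires re-wrapping the stored DEKs, not re-encrypting every credential.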

Security properties:

  • Per-connection isolation (each connection has its own DEK)
  • Key rotation support via key_id
  • No plaintext in database
  • Tenant isolation enforced at every layer

Multi-Tenancy

All resources are strictly tenant-isolated:

  • Every table has tenant_id column
  • All queries automatically filter by current tenant
  • Connections, actions, automations, and events are all tenant-scoped
  • Users can belong to multiple orgs and switch between them
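
Automatic tenant filtering is often implemented by routing every query through a tenant-bound wrapper. The sketch below shows the idea with SQLite from the Python standard library; the TenantScopedDB class, the connections table layout, and the sample rows are all hypothetical, and the production database is PostgreSQL.

```python
import sqlite3


class TenantScopedDB:
    """Wraps a connection so every query is filtered by one tenant_id."""

    def __init__(self, conn, tenant_id):
        self.conn = conn
        self.tenant_id = tenant_id

    def list_connections(self):
        # The tenant filter is applied here, not by the caller, so it
        # cannot be forgotten at individual call sites.
        rows = self.conn.execute(
            "SELECT name FROM connections WHERE tenant_id = ? ORDER BY name",
            (self.tenant_id,),
        )
        return [row[0] for row in rows]


def demo_db():
    """Build an in-memory database with rows for two tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE connections (tenant_id TEXT NOT NULL, name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO connections VALUES (?, ?)",
        [("org_a", "crm"), ("org_b", "billing"), ("org_a", "warehouse")],
    )
    return conn
```

A user who belongs to several orgs would get a differently bound wrapper after switching, rather than a query with a different filter.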
