# Architecture
Understanding the VirtuousAI execution model and infrastructure
VirtuousAI provides the execution backbone for data workflows. It's designed around reliability, composability, and strong typing.
## Execution Entry Points
All execution flows through the same primitives, regardless of how it's triggered:
| Entry Point | Creates ActionRun With | Use Case |
|---|---|---|
| REST API | Direct POST /action-runs | Programmatic access, integrations |
| Chat (LLM Tools) | turn_id + message_id | Agentic tool execution |
| Automations | automation_run_id + step_key | DAG orchestration |
| Webhooks | Trigger source metadata | Event-driven workflows |
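As a concrete example of the REST entry point, a client might create a run like the sketch below. Only the `POST /action-runs` path comes from the table above; the base URL, auth scheme, and payload fields (`action_kind`, `inputs`) are illustrative assumptions.

```python
# Minimal sketch of the REST entry point; payload shape is hypothetical.
import requests

API_BASE = "https://api.example.com"  # placeholder base URL

def create_action_run(token: str, action_kind: str, inputs: dict) -> dict:
    """POST /action-runs and return the created ActionRun as JSON."""
    resp = requests.post(
        f"{API_BASE}/action-runs",
        json={"action_kind": action_kind, "inputs": inputs},
        headers={"Authorization": f"Bearer {token}"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```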
## Execution Modes
The ActionKind Catalog determines how each action type executes:
| Mode | When Used | Characteristics |
|---|---|---|
| SYNC | Fast operations (<30s) | Executes inline in API request, immediate response |
| ASYNC_QUEUE | Long-running jobs | Enqueued to SQS, processed by Dramatiq workers |
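A hedged sketch of how the catalog could drive dispatch; `CATALOG`, `execute_inline`, and `enqueue_to_sqs` are hypothetical names, not VirtuousAI internals.

```python
# Illustrative dispatch between SYNC and ASYNC_QUEUE execution modes.
from enum import Enum

class ExecutionMode(Enum):
    SYNC = "sync"
    ASYNC_QUEUE = "async_queue"

CATALOG = {  # ActionKind -> execution mode (example entries only)
    "web_search": ExecutionMode.SYNC,
    "data_extraction": ExecutionMode.ASYNC_QUEUE,
}

def dispatch(action_kind: str, run_id: str, execute_inline, enqueue_to_sqs):
    """Run fast actions inline; enqueue long-running ones for workers."""
    mode = CATALOG[action_kind]
    if mode is ExecutionMode.SYNC:
        return execute_inline(run_id)   # immediate result in the API request
    enqueue_to_sqs(run_id)              # 202 Accepted; a worker picks it up
    return {"run_id": run_id, "status": "PENDING"}
```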
### Synchronous Execution

For fast operations like web searches or simple API calls:

- Client sends `POST /action-runs`
- API creates ActionRun (PENDING)
- Executor runs immediately
- Client receives completed ActionRun
### Asynchronous Execution

For long-running operations like data extraction:

Steps:

- Client sends `POST /action-runs`
- API creates ActionRun (PENDING)
- API enqueues message to SQS
- Client receives 202 Accepted with `run_id`
- Worker picks up message, acquires lease
- Executor runs (can take minutes to hours)
- Worker completes/fails the run
- Client polls or subscribes to SSE for status
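Client-side polling might look like the sketch below; the `GET /action-runs/{run_id}` endpoint is assumed by analogy with the POST path and is not taken from this document.

```python
# Hedged sketch of polling an async run until it reaches a terminal status.
import time
import requests

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "REJECTED"}

def wait_for_run(base_url: str, token: str, run_id: str, interval: float = 5.0) -> dict:
    """Poll the run until it reaches a terminal status."""
    while True:
        resp = requests.get(
            f"{base_url}/action-runs/{run_id}",
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        resp.raise_for_status()
        run = resp.json()
        if run.get("status") in TERMINAL:
            return run
        time.sleep(interval)
```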
## Lease-Based Distributed Ownership
Long-running jobs use leases to handle worker failures gracefully:
| Parameter | Value | Purpose |
|---|---|---|
| Lease Duration | 90 seconds | How long a worker owns a job |
| Heartbeat Interval | 30 seconds | How often lease is extended |
| Cancel Check Interval | 30 seconds | How often to check for cancellation |
| Watchdog Grace | 180 seconds | How stale before recovery kicks in |
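A minimal heartbeat loop sketch using the parameters above; `extend_lease` and `cancel_requested` stand in for database-backed calls.

```python
# Heartbeat loop sketch: extend the lease and check cancellation every 30s.
import threading

LEASE_DURATION = 90      # seconds a worker owns the job
HEARTBEAT_INTERVAL = 30  # seconds between lease extensions / cancel checks

def heartbeat_loop(run_id: str, extend_lease, cancel_requested, stop: threading.Event):
    """Keep the lease alive until the job finishes, is cancelled, or the lease is lost."""
    while not stop.wait(HEARTBEAT_INTERVAL):
        if cancel_requested(run_id):
            stop.set()                  # cooperative cancellation
            break
        if not extend_lease(run_id, LEASE_DURATION):
            stop.set()                  # lease lost: another worker may take over
            break
```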
Recovery scenarios:
- Worker crash → Lease expires → Watchdog marks FAILED (retryable)
- Network partition → Heartbeat fails → Lease expires → Takeover or watchdog
- Graceful shutdown → Cancellation token set → Completes current work → Marks CANCELLED
## ActionRun Lifecycle
Every ActionRun goes through a defined state machine:
| Status | Description | Transitions To |
|---|---|---|
| PENDING | Created, waiting for execution | RUNNING, CANCELLED, AWAITING_APPROVAL |
| AWAITING_APPROVAL | Requires human approval | RUNNING (approved), REJECTED |
| RUNNING | Currently executing | COMPLETED, FAILED, CANCELLED |
| COMPLETED | Finished successfully | Terminal |
| FAILED | Finished with error | Terminal (can retry → PENDING) |
| CANCELLED | User or system cancelled | Terminal |
| REJECTED | Approval denied | Terminal |
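The state machine can be expressed as a transition map, as in this sketch (statuses copied from the table; the validation helper is illustrative).

```python
# Allowed status transitions for an ActionRun, per the lifecycle table.
ALLOWED_TRANSITIONS = {
    "PENDING": {"RUNNING", "CANCELLED", "AWAITING_APPROVAL"},
    "AWAITING_APPROVAL": {"RUNNING", "REJECTED"},
    "RUNNING": {"COMPLETED", "FAILED", "CANCELLED"},
    "COMPLETED": set(),
    "FAILED": {"PENDING"},      # retry path
    "CANCELLED": set(),
    "REJECTED": set(),
}

def transition(current: str, new: str) -> str:
    """Validate a status change against the state machine."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```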
## Cooperative Cancellation
Cancellation is cooperative and can come from multiple sources:
| Source | How It Works |
|---|---|
| User Request | cancel_requested_at set on ActionRun |
| SIGTERM | ECS shutdown signal sets cancellation token |
| Lease Lost | Heartbeat failed, another worker may take over |
| Task Timeout | Dramatiq time_limit exceeded |
For data extraction, cancellation is checked between resources — the current resource completes before graceful exit.
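A sketch of that per-resource cancellation check; `extract_resource`, `cancel_token`, and the return values are placeholders.

```python
# Cooperative cancellation checked only at resource boundaries.
import threading

def run_extraction(resources: list[str], extract_resource, cancel_token: threading.Event):
    """Process resources one at a time; stop cleanly at the next boundary."""
    for resource in resources:
        if cancel_token.is_set():
            return "CANCELLED"          # current resource already finished
        extract_resource(resource)      # never interrupted mid-resource
    return "COMPLETED"
```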
## Job Reliability Features
Long-running extractions use multiple reliability mechanisms working together:
### SQS Visibility Heartbeat
For jobs lasting hours, a background thread extends SQS message visibility:
| Parameter | Value | Purpose |
|---|---|---|
| Initial Visibility | 30 minutes | Default SQS timeout |
| Extension Interval | 5 minutes | How often to extend |
| Extension Amount | 10 minutes | How much time to add |
This prevents SQS from re-delivering messages during 8+ hour extractions.
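A sketch of such a heartbeat thread using boto3's `change_message_visibility`; the queue URL and receipt handle come from the in-flight message, and the actual worker implementation may differ.

```python
# Background visibility heartbeat: extend the message every 5 minutes by 10 minutes.
import threading
import boto3

EXTENSION_INTERVAL = 5 * 60   # how often to extend
EXTENSION_AMOUNT = 10 * 60    # how much visibility to add each time

def start_visibility_heartbeat(queue_url: str, receipt_handle: str, stop: threading.Event):
    sqs = boto3.client("sqs")

    def loop():
        while not stop.wait(EXTENSION_INTERVAL):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=EXTENSION_AMOUNT,
            )

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```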
### Per-Resource Checkpointing

Data extractions process resources sequentially with checkpoints between each:

If a worker crashes after `profiles` completes:

- Checkpoint shows `profiles` done
- New worker skips `profiles`
- Resumes from `events`
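A resume sketch, assuming the checkpoint is stored as a set of completed resource names per run (`load_done` and `mark_done` are hypothetical helpers).

```python
# Resume from the last checkpoint: skip completed resources, continue with the rest.
def run_with_checkpoints(run_id: str, resources: list[str], extract, load_done, mark_done):
    done = load_done(run_id)                 # e.g. {"profiles"} after a crash
    for resource in resources:               # e.g. ["profiles", "events", ...]
        if resource in done:
            continue                         # already completed by a previous worker
        extract(resource)
        mark_done(run_id, resource)          # checkpoint before moving on
```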
### Watchdog Recovery

A system task runs every 5 minutes to detect and recover abandoned jobs:

- Find runs where `status = RUNNING` and `lease_expires_at < NOW()`
- Mark as `FAILED` with `worker_lost` error code
- Optionally re-enqueue for automatic retry

This handles scenarios where workers crash without graceful shutdown.
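A sketch of the sweep, assuming an `action_runs` table with the columns named above and a DB-API cursor; the real task may also apply the 180-second watchdog grace from the lease table.

```python
# Watchdog sweep: fail stale RUNNING jobs and optionally re-enqueue them.
WATCHDOG_SQL = """
    UPDATE action_runs
       SET status = 'FAILED',
           error_code = 'worker_lost'
     WHERE status = 'RUNNING'
       AND lease_expires_at < NOW()   -- the 180s grace window may also apply here
 RETURNING id
"""

def recover_abandoned_runs(cursor, re_enqueue=None):
    cursor.execute(WATCHDOG_SQL)
    for (run_id,) in cursor.fetchall():
        if re_enqueue:
            re_enqueue(run_id)
```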
## Infrastructure

### AWS Architecture
| Component | Service | Configuration |
|---|---|---|
| API | ECS Fargate | Auto-scaled, behind ALB |
| Workers | ECS Fargate | 2 vCPU, 4GB RAM, 120s stop timeout |
| Queue | SQS | 30min visibility timeout |
| Database | RDS PostgreSQL | Private subnets, TLS enforced |
| Storage | S3 | Bronze (raw) and Silver (processed) buckets |
| CDN | CloudFront + WAF | OWASP rules, rate limiting |
### Traffic Flow
- Internet → CloudFront (CDN + WAF)
- CloudFront → ALB (origin verification header)
- ALB → ECS API (private subnets)
- API → SQS (async jobs)
- Workers → SQS (pull messages)
- Workers → S3 (write artifacts)
- Workers → External APIs (via connections)
### Credential Security
Credentials use envelope encryption with a two-layer key hierarchy:
| Layer | Purpose |
|---|---|
| DEK (Data Encryption Key) | Random Fernet key, encrypts actual credentials |
| KEK (Key Encryption Key) | Derived from master secret, encrypts the DEK |
How it works:
- Seal (write): Generate random DEK → Encrypt credentials with DEK → Encrypt DEK with KEK → Store both
- Open (read): Derive KEK from master → Decrypt DEK using KEK → Decrypt credentials using DEK
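A minimal seal/open sketch with Fernet; deriving the KEK via HKDF over the master secret and `key_id` is an illustrative choice, not necessarily the exact scheme used.

```python
# Envelope encryption sketch: a random DEK encrypts credentials, a derived KEK wraps the DEK.
import base64
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def derive_kek(master_secret: bytes, key_id: bytes) -> Fernet:
    raw = HKDF(algorithm=hashes.SHA256(), length=32, salt=None, info=key_id).derive(master_secret)
    return Fernet(base64.urlsafe_b64encode(raw))

def seal(master_secret: bytes, key_id: bytes, credentials: bytes) -> tuple[bytes, bytes]:
    """Seal: DEK encrypts the credentials; the KEK encrypts the DEK."""
    dek = Fernet.generate_key()
    ciphertext = Fernet(dek).encrypt(credentials)
    wrapped_dek = derive_kek(master_secret, key_id).encrypt(dek)
    return wrapped_dek, ciphertext            # both are stored, never the plaintext

def open_(master_secret: bytes, key_id: bytes, wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    """Open: derive the KEK, unwrap the DEK, then decrypt the credentials."""
    dek = derive_kek(master_secret, key_id).decrypt(wrapped_dek)
    return Fernet(dek).decrypt(ciphertext)
```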
Security properties:

- Per-connection isolation (each connection has its own DEK)
- Key rotation support via `key_id`
- No plaintext in database
- Tenant isolation enforced at every layer
### Multi-Tenancy

All resources are strictly tenant-isolated:

- Every table has a `tenant_id` column
- All queries automatically filter by current tenant
- Connections, actions, automations, events — all scoped
- Users can belong to multiple orgs and switch between them
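A sketch of the scoping pattern, assuming a DB-API cursor with `%s` placeholders (e.g. psycopg); everything other than the `tenant_id` column is a placeholder.

```python
# Every read against a tenant-owned table carries a mandatory tenant_id filter.
def fetch_scoped(cursor, table: str, tenant_id: str):
    assert table.isidentifier()               # simple guard against injection via the table name
    cursor.execute(f"SELECT * FROM {table} WHERE tenant_id = %s", (tenant_id,))
    return cursor.fetchall()
```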