Skip to content

Reliability & consistency

A scan can involve thousands of tasks across a fleet of disposable machines, any of which can crash, stall, or get the same message twice. Sonar is built so that none of that corrupts your data or wedges a scan. This page is the tour of how — the parts of the system we're proudest of.

The single governing idea: delivery is at-least-once, but every effect is idempotent. Messages may be retried and redelivered; applying them twice is always safe.

The reconciler: a self-healing control loop

Every ~5 seconds a background job (WorkflowReconcilerJob, Quartz, [DisallowConcurrentExecution]) wakes up and, for each running scan, asks: given what's finished, what should happen next? It computes ready steps, dispatches new tasks within capacity, creates downstream tasks, and checks for completion.

Because it's a loop that re-derives state from the database every tick — not a one-shot script — a scan is robust to interruption. Restart the backend mid-scan and the next tick simply picks up where reality left off. There is no in-memory progress to lose.

One scan at a time: Postgres advisory locks

Two overlapping ticks (or two backend instances) must never process the same scan at once. Each scan is reconciled while holding a session-scoped Postgres advisory lock, keyed by the first 8 bytes of the scan's UUID:

sql
SELECT pg_try_advisory_lock(:scanKey)   -- non-blocking; skip the scan if not acquired
-- … reconcile …
SELECT pg_advisory_unlock(:scanKey)     -- always, in a finally block

pg_try_advisory_lock is non-blocking: if another worker holds the scan, this tick just skips it and moves on. No deadlocks, no double-dispatch. (The key is only the UUID's first 8 bytes, so two different scans could collide on it — but the effect is merely that one waits for the next tick. Correctness is preserved; at worst a scan is reconciled a few seconds later.)

Idempotency, layer by layer

A task's result can arrive more than once (a worker retries; RabbitMQ redelivers). Several independent layers make that harmless:

  1. Worker pre-check. Before running, a worker calls GET /api/scans/tasks/{id}/status and skips tasks that are already terminal or whose scan was stopped.
  2. Stable dispatch key. Every task carries a dispatchKey = "{scanId}:{taskId}:{stepId}" — stable across retries — plus a unique per-dispatch jobId. The worker echoes both back: the stable dispatchKey lets the backend recognize "this is that task" across retries, while the per-dispatch jobId distinguishes two delivery attempts in the logs.
  3. command-started short-circuit. This is what actually enforces the dedup: if a task is already Running/Completed/Failed, the start webhook returns { duplicate: true } and mutates nothing.
  4. ON CONFLICT output writes. The real guarantee for data: all scan output is ingested with upserts (next section), so processing the same output twice can't duplicate rows.
  5. Requeue on failure, not drop. If a worker can't deliver its result after retries, it nacks (negative-acknowledges, telling RabbitMQ to redeliver) with requeue — the task comes back, never lost.

Delivery is at-least-once; the effect is exactly-once.

Writing results without duplicates: streaming upserts

Structured output (domains, ports, http_paths, …) is ingested by BulkUpsertService with a deliberate, fast, idempotent pipeline:

CREATE TEMP TABLE …             -- staging
COPY … FROM STDIN               -- stream rows in (no giant in-memory list)
materialize foreign keys        -- e.g. before inserting http_paths, ensure the parent
                                --   domain rows exist so the FK join finds a match
INSERT … SELECT DISTINCT ON (conflictKeys) …
  ON CONFLICT (conflictKeys) DO UPDATE SET …, updated_at = NOW()

Two subtleties worth calling out:

  • DISTINCT ON (conflictKeys) dedupes within a single batch, avoiding Postgres's "ON CONFLICT cannot affect row a second time" error when a tool emits the same value twice.
  • Conflict keys are declared, not hardcodedOutputTableConstants maps each table to its conflict columns (e.g. domainsvalue, scopes(program_id, name)), and the actual SQL column names are resolved from EF Core's model at runtime. One place to change, and it can't drift from the schema. See Output tables.

The output is read by streaming from MinIO (JSON array or JSONL), so even multi-hundred-MB results never balloon memory.

The DAG can't lie: validation & cycle detection

Before a scan starts, the workflow is validated (WorkflowValidator, rules R1–R9) and the graph is checked for cycles (DFS with a recursion stack). The rules encode the engine's invariants so a malformed workflow is rejected at authoring time, not discovered at 3 a.m.: CustomSql steps must declare their input file and can't also be a streaming child; at most one Single dependency per step; a Single upstream must emit an output; no self- or duplicate dependencies; reserved variable names are off-limits; every {PARAM.x} must resolve. (Full list on Author a scan workflow.)

Fan-out is tracked, and it's idempotent

When a Single dependency fans out (one child per finished upstream task), each child inherits the upstream's LineageId (so you can trace a result back to the input that produced it) and records its ParentTaskId. That parent link is also the idempotency key: the reconciler finds "upstream tasks needing fan-out" with a single anti-join query (NOT EXISTS a child with that ParentTaskId) — so re-running the tick never creates a duplicate child, and in steady state the query returns nothing and does no work. (This replaced an N+1 that grew with the cumulative completed-task count.)

Reliable dispatch: the transactional outbox

Publishing a task to RabbitMQ and marking it "published" in the DB must be all or nothing — otherwise a crash between the two leaves a task that's either double-sent or lost. Sonar uses the outbox pattern:

  • An EF SaveChanges interceptor drains each entity's domain events into an outbox_messages row in the same transaction as the state change. The intent to publish commits atomically with the task flip.
  • A poller (ProcessOutboxJob, ~5 s) claims one row per short transaction with SELECT … FOR UPDATE SKIP LOCKED — each poller grabs a row no other poller has locked and skips the locked ones instead of waiting, so many pollers run in parallel without stepping on each other — publishes, and marks it processed. Failures retry up to a cap, then dead-letter (retained with the error, not silently dropped).
  • Handlers are themselves idempotent: an OutboxMessageConsumer(messageId, handler) record means a redelivered message never runs the same handler twice.

When machines vanish: dead-worker recovery

Workers are disposable and will die mid-task. Every reconciler tick runs a recovery pass (RecoverStaleInFlight) even when no scans are running:

  • A task Published but unacknowledged for >10 min → reset to Pending (redispatched).
  • A task Running on a worker that is no longer live → reset to Pending.

Liveness comes from a heartbeat: workers POST /api/workers/pulse every few seconds, updating a registry with LastPulseAt and their routing tags. A separate cleanup job ends GitHub runs / tears down AWS workers that have gone stale. Crucially, the orphan signal is worker liveness, not a task timeout — a legitimately long task on a healthy worker is left alone, but the moment its worker disappears, its work is requeued.

Spot instances get special care: a worker polls EC2's interruption notice, and on a 2-minute warning it stops taking new jobs, tells the backend, and drains — so a reclaimed spot node doesn't strand tasks.

Smaller guards that matter

  • Empty-chunk guard. When a fan-out finds zero files/lines, the parent task is marked Completed immediately instead of looping — this is what lets transform steps emit [] for empty chunks without wedging the graph.
  • Duplicate-scan guard. Starting the "same" scan twice within 10 s returns the existing scan — so a client retry after a timeout doesn't spawn a phantom run.
  • Bounded output processing. The result webhook returns 200 immediately and hands work to a bounded channel (capacity 2000) drained at concurrency 5 — so a burst of finishing tasks can't exhaust the Postgres connection pool.
  • Water-filled dispatch. Tasks are spread across worker tags up to each tag's live capacity and each step's maxConcurrent, with a global per-machine ceiling so a multi-tag worker isn't over-committed.

The snapshot

When a scan starts, Sonar stores a WorkflowSnapshot — a jsonb copy of the workflow's steps and edges at that moment. Its job is faithful history and rendering: a scan's DAG and audit trail display correctly forever, even after the underlying workflow is edited or deleted.

Precise scope

The snapshot backs display and history. The runtime reconciler still reads the live workflow definition each tick, so it is not a guarantee that execution is frozen against mid-run edits — it's a guarantee that a scan's record is stable. Treat workflows as stable while a scan of them is running.