Sergey Orsik.dev

2026-05-16

AI Run Audit Trail and Prompt Versioning

Every LLM call creates an ai_runs row with input hash, prompt version FK, token usage, and validation errors for replay and debugging.

Purpose

Non-deterministic LLM output has to be debuggable in production: when a clip boundary is wrong, you need to know which prompt version, which model, and which input produced it. The system treats AI calls as metered, auditable batch jobs, not fire-and-forget HTTP requests.

ai_runs lifecycle

Each agent step in AnalyzeTranscriptWorkflowService follows the same pattern:

  1. Resolve prompt: promptVersions.resolveActive(promptKey) → FK to prompt_versions.
  2. Open run: aiRuns.startRun({ workflowName: 'analyze_transcript', agentName, inputHash, inputPreview, status: 'running' }).
  3. Call model: ai.generateStructured(...) via OpenRouter / Mastra.
  4. Close run: aiRuns.finishRun with status succeeded | failed | repaired, plus optional outputJson, validationErrorsJson, tokens, and latencyMs.
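
The four steps above can be sketched as one wrapper that opens the row before the model is called and always closes it. This is a minimal sketch, not the actual service: the in-memory runs array stands in for the ai_runs table, and runAgentStep, the AiRun shape, and the error payload are hypothetical names chosen for illustration.

```typescript
// Hypothetical in-memory stand-ins; the real services would write to ai_runs.
type RunStatus = "running" | "succeeded" | "failed" | "repaired";

interface AiRun {
  id: number;
  workflowName: string;
  agentName: string;
  promptVersionId: number; // FK to prompt_versions
  inputHash: string;
  status: RunStatus;
  outputJson?: unknown;
  validationErrorsJson?: unknown;
  latencyMs?: number;
}

const runs: AiRun[] = [];

function startRun(fields: Omit<AiRun, "id" | "status">): AiRun {
  const run: AiRun = { ...fields, id: runs.length + 1, status: "running" };
  runs.push(run);
  return run;
}

function finishRun(run: AiRun, patch: Partial<AiRun> & { status: RunStatus }): void {
  Object.assign(run, patch);
}

// Wraps one agent step: the row exists before the model is called, and it is
// closed on both the success path and the error path.
async function runAgentStep<T>(
  agentName: string,
  promptVersionId: number,
  inputHash: string,
  callModel: () => Promise<T>,
): Promise<T> {
  const run = startRun({ workflowName: "analyze_transcript", agentName, promptVersionId, inputHash });
  const startedAt = Date.now();
  try {
    const output = await callModel();
    finishRun(run, { status: "succeeded", outputJson: output, latencyMs: Date.now() - startedAt });
    return output;
  } catch (err) {
    finishRun(run, {
      status: "failed",
      validationErrorsJson: { message: err instanceof Error ? err.message : String(err) },
      latencyMs: Date.now() - startedAt,
    });
    throw err; // the caller still sees the failure; the row records it
  }
}
```

Rethrowing after finishRun keeps the audit row and the caller's error handling independent: the row is closed either way, and the workflow decides what to do with the failure.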

Input fingerprint

  • inputHash = sha256(JSON.stringify(input)) — dedup analysis and a hook for future cache layers (cache not implemented today).
  • inputPreview — first 600 chars for admin UI; full chunk text stays in transcript_chunks.
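
A minimal sketch of the fingerprint, assuming Node's built-in crypto module and assuming the preview is cut from the chunk text rather than from the serialized input. fingerprintInput and PREVIEW_CHARS are hypothetical names.

```typescript
import { createHash } from "node:crypto";

const PREVIEW_CHARS = 600;

interface InputFingerprint {
  inputHash: string;    // hex-encoded SHA-256 of the serialized input
  inputPreview: string; // short snippet for the admin UI
}

// `text` is the human-readable part of the input (assumed to be the chunk text).
function fingerprintInput(input: unknown, text: string): InputFingerprint {
  const serialized = JSON.stringify(input);
  const inputHash = createHash("sha256").update(serialized, "utf8").digest("hex");
  return { inputHash, inputPreview: text.slice(0, PREVIEW_CHARS) };
}
```

One caveat if the hash is ever used as a cache key: JSON.stringify serializes object keys in insertion order, so two inputs with the same fields in a different order produce different hashes. Building the input object in one place, or sorting keys before hashing, avoids spurious misses.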

Prompt registry

Piece | Role
PROMPT_DEFINITIONS + PROMPT_KEYS | Central agent names and instructions
PROMPT_VERSION | Bumps tie into idempotency keys (analyzeTranscriptKey(..., promptVersion))
ensureRegistered on workflow start | Creates DB rows when keys are missing
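
How the registry pieces could fit together, as a sketch under stated assumptions: the Map stands in for the prompt_versions table, the prompt keys and the "v3" value are placeholders, and exampleAnalyzeKey is a hypothetical illustration of a version-aware idempotency key, not the real analyzeTranscriptKey signature.

```typescript
// Placeholder values; the real definitions are not shown in this note.
const PROMPT_VERSION = "v3";
const PROMPT_DEFINITIONS: Record<string, string> = {
  scout: "placeholder instructions",
  planner: "placeholder instructions",
};

interface PromptVersionRow {
  id: number;
  promptKey: string;
  version: string;
}

// Stand-in for the prompt_versions table, keyed by "<promptKey>@<version>".
const promptVersionRows = new Map<string, PromptVersionRow>();

// Idempotent: creates a row only when the key/version pair is missing.
function ensureRegistered(): void {
  for (const promptKey of Object.keys(PROMPT_DEFINITIONS)) {
    const rowKey = `${promptKey}@${PROMPT_VERSION}`;
    if (!promptVersionRows.has(rowKey)) {
      promptVersionRows.set(rowKey, {
        id: promptVersionRows.size + 1,
        promptKey,
        version: PROMPT_VERSION,
      });
    }
  }
}

// Hypothetical key builder: including the prompt version means a version bump
// yields a new key, so the same project can be re-analyzed with the new prompts.
function exampleAnalyzeKey(projectId: string, promptVersion: string): string {
  return `analyze_transcript:${projectId}:${promptVersion}`;
}
```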

Structured output

Zod schemas (chunkScoutOutputSchema, clipRankerOutputSchema, clipPlanOutputSchema, jsonRepairOutputSchema) validate every response before persisting candidates.

Invalid scout, ranker, or planner output produces a failed run and an AiAnalysisError with a stable error code for API consumers.
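
The validate-then-fail step might look like the sketch below. To keep it dependency-free, a hand-written parser stands in for the Zod schema (in Zod this would be a safeParse call); the ClipPlan fields, the validateOrThrow helper, and the error-code format are hypothetical.

```typescript
// Stand-in for a schema parse result: either typed data or a list of issues.
type ParseResult<T> = { success: true; data: T } | { success: false; issues: string[] };

interface ClipPlan {
  startSec: number;
  endSec: number;
}

function parseClipPlan(value: unknown): ParseResult<ClipPlan> {
  const issues: string[] = [];
  const v = (typeof value === "object" && value !== null ? value : {}) as Record<string, unknown>;
  const startSec = v["startSec"];
  const endSec = v["endSec"];
  if (typeof startSec !== "number") issues.push("startSec must be a number");
  if (typeof endSec !== "number") issues.push("endSec must be a number");
  if (typeof startSec === "number" && typeof endSec === "number") {
    if (endSec <= startSec) issues.push("endSec must be greater than startSec");
    if (issues.length === 0) return { success: true, data: { startSec, endSec } };
  }
  return { success: false, issues };
}

class AiAnalysisError extends Error {
  readonly code: string;     // stable, machine-readable; safe to expose to API consumers
  readonly issues: string[]; // detail for validationErrorsJson
  constructor(code: string, issues: string[]) {
    super(`${code}: ${issues.join("; ")}`);
    this.name = "AiAnalysisError";
    this.code = code;
    this.issues = issues;
  }
}

// Validates before anything is persisted; the caller marks the run as failed
// and stores error.issues when this throws.
function validateOrThrow(agentName: string, value: unknown): ClipPlan {
  const result = parseClipPlan(value);
  if (!result.success) {
    throw new AiAnalysisError(`${agentName.toUpperCase()}_INVALID_OUTPUT`, result.issues);
  }
  return result.data;
}
```

Keeping the code separate from the message lets the wording of validation issues change without breaking clients that branch on the code.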

JSON repair path

When the fix agent succeeds, its work is recorded as a separate run with status repaired. That keeps the first-pass planner failure distinct from the recovery, which helps support triage.
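
A sketch of the two-row pattern, with several simplifications that are assumptions: JSON.parse stands in for schema validation, the repair agent is passed in as a plain synchronous function (in practice it would be another model call), and the agent names and RunRecord shape are hypothetical.

```typescript
type RunStatus = "succeeded" | "failed" | "repaired";

interface RunRecord {
  agentName: string;
  status: RunStatus;
}

// Tries the planner output first; on failure, records that failure and then
// records the repair attempt as its own row.
function parsePlanWithRepair(
  raw: string,
  repair: (broken: string) => string,
  log: RunRecord[],
): unknown {
  try {
    const parsed: unknown = JSON.parse(raw);
    log.push({ agentName: "planner", status: "succeeded" });
    return parsed;
  } catch {
    // First-pass failure stays visible even if the repair succeeds.
    log.push({ agentName: "planner", status: "failed" });
  }
  try {
    const repaired: unknown = JSON.parse(repair(raw));
    log.push({ agentName: "json_repair", status: "repaired" });
    return repaired;
  } catch (err) {
    log.push({ agentName: "json_repair", status: "failed" });
    throw err;
  }
}
```

With this shape, counting failed planner rows against repaired rows shows how often the first pass breaks, independent of whether users ever saw an error.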

Product linkage

clip_candidates.sourceAiRunId points to the planner run, not to the scout runs: many scout runs belong to a single project analyze job.

Operational invariant: every LLM invocation must leave an ai_runs row before side effects touch clip_candidates. If you cannot replay from the row, the step is not production-ready.

Failure modes

Risk | Impact
Partial scout success, then job failure | Orphan succeeded ai_runs for early chunks; rerun may duplicate scout spend unless idempotency blocks
inputPreview carries PII | Chunk snippet in DB; fine for internal admin, consider redaction for multi-tenant
AI_USE_FIXTURE=true in prod | Deterministic fake clips; requires env guard
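
One possible shape for the env guard in the last row, as a sketch: it assumes production is signalled by NODE_ENV=production, which this note does not state, and resolveFixtureMode is a hypothetical name. The env is passed in as a plain object so the function is easy to test.

```typescript
interface EnvLike {
  NODE_ENV?: string;
  AI_USE_FIXTURE?: string;
}

// Returns whether fixture mode is on, and refuses to start when fixtures are
// enabled in what looks like a production environment.
function resolveFixtureMode(env: EnvLike): boolean {
  const useFixture = env.AI_USE_FIXTURE === "true";
  if (useFixture && env.NODE_ENV === "production") {
    throw new Error("AI_USE_FIXTURE=true is not allowed when NODE_ENV=production");
  }
  return useFixture;
}
```

Calling this once at startup (for example with process.env) fails the deploy instead of silently serving deterministic fake clips.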

Tradeoff

Per-chunk scout runs improve quality and context fit but multiply cost vs single-pass summarization. tokenEstimate on chunks is the hook for future budgeting dashboards.