VendorSync

Enterprise-grade vendor change management platform. VendorSync automatically ingests vendor notification emails, uses a Large Language Model to triage them against your application registry and rule set, creates tracked tickets with full SLA enforcement, and opens Jira issues for every affected engineering team — all without a human touching a keyboard.

What is VendorSync?
Why VendorSync Exists
How It Works — End to End
Architecture Overview
Service Topology
Redis Streams — The Job Bus
The Triage Agent
Ticket Lifecycle & State Machine
SLA Enforcement & Escalation
Jira Integration Deep Dive
Notification System
Authentication & Authorization
Data Model
Technology Stack
Project Structure
Configuration Reference
Getting Started (Local Development)
Business Rules Reference
Architectural Principles

What is VendorSync?

VendorSync is a full-stack internal platform that solves a specific, painful problem at scale: vendor change management.

Every enterprise engineering organization receives a constant stream of emails from third-party vendors — API endpoint changes, certificate rotations, protocol deprecations, breaking schema updates, deadline-driven migration notices. Each of these can break production systems if missed or mishandled. Manually routing these emails, figuring out which internal applications are affected, assigning owners, opening Jira tickets, and tracking deadlines is tedious, error-prone, and doesn't scale.

VendorSync eliminates all of that manual work. It:

Watches your mailboxes — polls configured email sources (IMAP, Microsoft Exchange, Gmail API) for new vendor notifications
Reads and understands the email — sends it to an LLM (Claude, GPT-4o, or Gemini) along with your entire application registry and rule set, asking it to decide: which vendor sent this, which of your apps are impacted, how severe it is, who owns it, what the deadline is
Creates a tracked VendorSync ticket — a first-class record with severity, SLA, owner, and deadline
Opens Jira issues automatically — one per affected application, pre-filled with full context so engineers never have to ask "what is this about"
Enforces SLAs — monitors deadlines and fires escalations at D-7, D-3, D-1, opening war rooms and paging on-call engineers if needed
Closes the loop — syncs Jira status back into VendorSync every 5 minutes; when all Jira issues are Done, the ticket auto-transitions to Resolved for a change manager to confirm

The entire pipeline from email arrival to Jira issues created and notifications sent targets under 20 seconds at the 95th percentile.

Why VendorSync Exists

The alternative is a spreadsheet, a shared inbox, and hope. Common failure modes VendorSync prevents:

Failure mode	How VendorSync prevents it
Vendor email goes to wrong person or gets lost	Polls dedicated mailboxes on a schedule, so intake does not depend on one person's inbox
No one knows which apps are affected	LLM reads app descriptions and decides automatically
Wrong team gets paged	Rules + app registry route to the right owner every time
Deadline slips without anyone noticing	Scheduled deadline monitor, three escalation thresholds, breach detection
Jira issues created inconsistently or not at all	Automatic creation with full context on every ticket
No audit trail of what happened and when	Immutable audit log on every state change
Hard to tell which open changes are at risk	Dashboard with SLA countdowns, severity badges, breach indicators

How It Works — End to End

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                                                                 │
│  Vendor sends   →  Mailbox  →  [Scheduler polls]  →  vs:emails:incoming        │
│  notification                                                                   │
│  email                                                                          │
│                                                                                 │
│  [Worker] fetches message, stores in source_emails, publishes process_email job │
│                                                                                 │
│  [Triage Agent]  ←── all active rules ──────────────────── DB                  │
│       │          ←── all applications + descriptions ─────  DB                 │
│       │          ←── email subject + body + attachments                        │
│       ↓                                                                         │
│  TriageDecision {                                                               │
│    vendor, topic, effective_date,                                               │
│    affected_application_ids,                                                    │
│    severity, owner_team, sla_days,                                              │
│    alert_channel, matched_rule_ids,                                             │
│    reasoning                                                                    │
│  }                                                                              │
│       ↓                                                                         │
│  [Ticket Engine]                                                                │
│    → INSERT tickets (status: open)                                              │
│    → INSERT ticket_applications (one per affected app)                          │
│    → INSERT ticket_jira_links (one per app with Jira enabled)                   │
│    → PUBLISH create_jira_issue jobs  →  vs:jira:create                         │
│    → PUBLISH notify job              →  vs:tickets:notify                      │
│                                                                                 │
│  [Worker] creates Jira issues, stores issue keys, syncs every 5 min            │
│  [Worker] sends Slack/email/PagerDuty notifications                             │
│  [Scheduler] monitors deadlines every 15 min, escalates at D-7/D-3/D-1        │
│  [Worker] syncs Jira status → auto-transitions VS ticket to Resolved           │
│  [Change manager] confirms closure in UI → ticket Closed                       │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘

Architecture Overview

VendorSync is decomposed into three Python processes sharing one codebase, backed by PostgreSQL and Redis, served through a Traefik reverse proxy, with a Next.js frontend.

                        ┌───────────────────────────────────────────┐
                        │                 Traefik                   │
                        │  /        → frontend (Next.js)            │
                        │  /api/*   → backend  (FastAPI)            │
                        └──────────────┬────────────────────────────┘
                                       │
              ┌──────────────┬─────────┴──────────┬────────────────────┐
              ▼              ▼                     ▼                    ▼
         ┌──────────┐  ┌──────────┐         ┌──────────┐        ┌───────────┐
         │ Frontend │  │ Backend  │         │  Worker  │        │ Scheduler │
         │ Next.js  │  │ FastAPI  │         │  (N      │        │APScheduler│
         │ App      │  │ HTTP API │         │ replicas)│        │ periodic  │
         │ Router   │  │          │         │ Streams  │        │ jobs      │
         └──────────┘  └────┬─────┘         └────┬─────┘        └─────┬─────┘
                            │                    │                     │
                    ┌───────┴──────┐    ┌────────┴──────────────────────┘
                    │              │    │
              ┌─────▼─────┐  ┌────▼────▼──┐
              │ PostgreSQL │  │   Redis    │
              │  primary   │  │  Streams   │
              │  database  │  │  + cache   │
              └────────────┘  └────────────┘

Module boundaries

The Python codebase is organized into strict layers. No module depends on a layer above it:

Module	Responsibility
`backend.api`	HTTP routes only — validates input, calls services, returns responses
`backend.services`	Business logic — pure Python, no FastAPI imports, reusable from both API and workers
`backend.agents`	The Triage Agent and future agent abstractions
`backend.models`	SQLAlchemy ORM models — source of truth for DB schema
`backend.schemas`	Pydantic v2 schemas for API input/output validation
`backend.workers`	Redis Streams consumers — call into services and agents
`backend.scheduler`	APScheduler periodic jobs — publish work items to streams
`backend.llm`	LiteLLM wrapper, prompt templates, response parsing
`backend.ingestion`	Email fetching (IMAP / Exchange / Gmail), attachment handling
`backend.jira`	Jira REST client, sync logic, comment helpers
`backend.auth`	Okta OIDC integration, JWT validation, role middleware
`backend.core`	Config, DB session, Redis client, security helpers

Dependency rule: api, workers, and scheduler may depend on services and agents. services and agents depend on models, llm, ingestion, jira. Nothing depends on api or workers.

Service Topology

VendorSync runs as 7 Docker services in a single Compose stack:

Service	Image	Purpose
`traefik`	`traefik:v3`	Reverse proxy, single external entry point, HTTPS via Let's Encrypt in production
`backend`	custom Python	FastAPI HTTP API — all REST endpoints
`worker`	custom Python (same image)	Redis Streams consumer — ingestion, LLM calls, Jira creation, notifications, sync
`scheduler`	custom Python (same image)	APScheduler — mailbox polling, Jira sync batch, deadline monitor
`frontend`	custom Node	Next.js App Router UI
`postgres`	`postgres:17`	Primary relational database, all persistent state
`redis`	`redis:7`	Streams job bus + caching

backend, worker, and scheduler run the same Docker image, differentiated by the command: override in docker-compose.yml. This means a single docker compose build backend rebuilds all three.

Traefik routing

Path	Routed to
`/` and all UI routes	Next.js frontend (port 3000 internally)
`/api/*`	FastAPI backend (port 8000 internally)
`/health`	Backend health check
`:8080/dashboard`	Traefik live dashboard (dev only)

In production, Traefik handles TLS termination via Let's Encrypt and redirects all HTTP to HTTPS. Service discovery is via Docker labels — no static config files need editing when adding services.

Redis Streams — The Job Bus

All asynchronous work flows through Redis Streams. There is no Celery, no RabbitMQ, no separate broker — Redis is already required for caching, and Streams on top of it replace all of that at a fraction of the operational overhead.

VendorSync streams

Stream	Jobs carried
`vs:emails:incoming`	`poll_mailbox` (fetch new messages from IMAP/Exchange/Gmail)
`vs:jira:create`	`create_jira_issue` (one job per app per ticket, with retry)
`vs:jira:sync`	`sync_jira_batch` (periodic) + `sync_jira_single` (manual)
`vs:tickets:notify`	`send_notifications` (Slack, email, PagerDuty fanout)
`vs:tickets:escalate`	`escalate_ticket` (D-7, D-3, D-1 actions)
`vs:dlq`	Dead-letter queue — exhausted jobs held for admin review

UI event stream

Stream	Events pushed
`vs:ui:events`	`system.status` — queue depths and scheduler heartbeats (every 5s)

Worker pattern

Workers are stateless consumers in a consumer group (vs-workers). The pattern is simple and reliable:

XREADGROUP GROUP vs-workers <worker-id> COUNT 10 BLOCK 5000 STREAMS vs:emails:incoming >
  → process message
  → on success:  XACK vs:emails:incoming vs-workers <message-id>
  → on failure:  retry (up to 3×); on exhaustion: XADD vs:dlq ...

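Multiple worker replicas can run concurrently: the consumer group delivers each message to exactly one consumer. Redis does not redeliver pending messages on its own, so messages left pending by a crashed worker are claimed by another consumer (for example with XAUTOCLAIM) once they have been idle longer than a configurable visibility timeout.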
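A minimal Python sketch of the worker pattern above. It assumes the `client` argument is a redis-py asyncio client (`redis.asyncio.Redis`) created with `decode_responses=True`. The names `handle_entry` and `consume_once`, the shape of the DLQ entry, and the reading of "up to 3×" as three retries after the first attempt are illustrative assumptions, not the project's actual code.

```python
import asyncio
from typing import Any, Awaitable, Callable

STREAM = "vs:emails:incoming"
GROUP = "vs-workers"
DLQ = "vs:dlq"
MAX_RETRIES = 3  # assumption: three retries after the first attempt

Handler = Callable[[dict[str, str]], Awaitable[None]]


async def handle_entry(client: Any, message_id: str,
                       fields: dict[str, str], handler: Handler) -> bool:
    """Run one job; ACK on success, dead-letter it once retries are exhausted."""
    last_error = ""
    for _attempt in range(1 + MAX_RETRIES):
        try:
            await handler(fields)
        except Exception as exc:  # sketch only; real code would narrow this
            last_error = repr(exc)
            continue
        await client.xack(STREAM, GROUP, message_id)
        return True
    # Hypothetical DLQ entry: original fields plus provenance and the last error.
    await client.xadd(DLQ, {**fields, "source_stream": STREAM,
                            "source_id": message_id, "error": last_error})
    # ACK the original so the dead-lettered job is not delivered again.
    await client.xack(STREAM, GROUP, message_id)
    return False


async def consume_once(client: Any, worker_id: str, handler: Handler) -> int:
    """Read up to 10 new entries (blocking up to 5 s) and handle each one."""
    response = await client.xreadgroup(GROUP, worker_id, {STREAM: ">"},
                                       count=10, block=5000)
    handled = 0
    for _stream, entries in response or []:
        for message_id, fields in entries:
            await handle_entry(client, message_id, fields, handler)
            handled += 1
    return handled
```

A real worker would call `consume_once` in a loop, one loop per stream. Retrying in-process is only one option; a production worker might instead leave the entry pending and retry it on a later delivery.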

Scheduler pattern

APScheduler runs in its own process. Its only job is to publish work items into streams on schedule. It never does the work itself. This means the scheduler is lightweight, stateless, and easily restartable — the actual processing always happens in workers through the same code paths.

The Triage Agent

The Triage Agent is the brain of VendorSync. It is a stable interface over an LLM call — a class that takes an email and returns a fully structured decision. All the complexity of prompt construction, LLM provider routing, response parsing, and validation lives inside it. The rest of the codebase calls triage_agent.decide(email) and gets a TriageDecision back.

What the agent does

class TriageAgent:
    async def decide(self, email: SourceEmail) -> TriageDecision:
        # 1. Load all active rules from DB (filtered by vendor if known)
        # 2. Load all active applications with their descriptions
        # 3. Construct a structured prompt with the email + context
        # 4. Call LLM (via LiteLLM — provider-agnostic)
        # 5. Parse and validate the JSON response
        # 6. Apply post-LLM severity guardrails (deterministic)
        # 7. Return TriageDecision
        ...

The TriageDecision output

{
  "vendor_id": "uuid-of-matched-vendor",
  "topic": "URL endpoint change",
  "effective_date": "2027-04-20",
  "affected_application_ids": ["uuid-1", "uuid-2"],
  "severity": "high",
  "owner_team": "Payments",
  "sla_days": 14,
  "alert_channel": "#payments-changes",
  "matched_rule_ids": ["uuid-of-matched-rule"],
  "reasoning": "PaymentService directly calls the Visa authorization endpoint per its description and is critically affected. AccountUpdate uses Visa schemas — included as a precaution since it likely uses the same base URL infrastructure. Billing is excluded — its description mentions billing operations, not authorization flows."
}
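Step 5 of the agent ("parse and validate the JSON response") could look roughly like the sketch below. It uses a stdlib dataclass in place of the project's Pydantic v2 schema so that it runs without dependencies. The severity set is taken from the notification table in this document; treating `effective_date` as optional and requiring a positive `sla_days` are assumptions.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional

SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass(frozen=True)
class TriageDecision:
    vendor_id: str
    topic: str
    effective_date: Optional[date]  # assumption: a notice may carry no date
    affected_application_ids: list[str]
    severity: str
    owner_team: str
    sla_days: int
    alert_channel: str
    matched_rule_ids: list[str]
    reasoning: str


def parse_decision(raw: str) -> TriageDecision:
    """Parse the LLM's JSON reply; raises KeyError on a missing field."""
    data = json.loads(raw)
    if data["severity"] not in SEVERITIES:
        raise ValueError(f"unknown severity: {data['severity']!r}")
    if not isinstance(data["sla_days"], int) or data["sla_days"] <= 0:
        raise ValueError("sla_days must be a positive integer")
    effective = data.get("effective_date")
    return TriageDecision(
        vendor_id=data["vendor_id"],
        topic=data["topic"],
        effective_date=date.fromisoformat(effective) if effective else None,
        affected_application_ids=list(data["affected_application_ids"]),
        severity=data["severity"],
        owner_team=data["owner_team"],
        sla_days=data["sla_days"],
        alert_channel=data["alert_channel"],
        matched_rule_ids=list(data["matched_rule_ids"]),
        reasoning=data["reasoning"],
    )
```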

What the LLM sees

The LLM receives a single structured prompt containing:

The raw email (subject + body + attachments as multimodal inputs)
The full list of vendor names VendorSync tracks
Every active rule (natural language, with severity / SLA / owner / instruction text)
Every active application (name + plain-language description + owner team)

No keyword pre-filtering is done before the LLM call. The LLM does the matching. This is intentional — keyword-based pre-filtering introduces false negatives; the LLM is better at semantic matching.

Severity guardrails (post-LLM, deterministic)

After the LLM responds, a deterministic check runs before the TriageDecision is returned:

If the vendor is tier_1 AND the change is classified as breaking AND the LLM assigned low or medium severity → override to high.

This override is written to the audit log with the original LLM severity preserved in llm_reasoning. Because the check runs as ordinary code after the LLM call, text injected into an email cannot switch it off.
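The guardrail rule above is small enough to express as a pure function. The sketch below is one way to write it; `GuardrailResult` and the `is_breaking` flag are hypothetical names, and how the "breaking" classification is obtained is not specified in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailResult:
    severity: str       # severity after the guardrail
    llm_severity: str   # what the LLM originally assigned (for the audit log)
    overridden: bool


def apply_severity_guardrail(llm_severity: str, vendor_tier: str,
                             is_breaking: bool) -> GuardrailResult:
    """Raise low/medium to high for breaking changes from tier_1 vendors."""
    if vendor_tier == "tier_1" and is_breaking and llm_severity in {"low", "medium"}:
        return GuardrailResult("high", llm_severity, True)
    return GuardrailResult(llm_severity, llm_severity, False)
```

Severities the LLM already set to high or critical pass through unchanged, so the guardrail can only raise a severity, never lower it.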

Application descriptions matter

The quality of triage directly depends on how well applications are described. The description field on each application is fed verbatim to the LLM:

"API-based application that exchanges Visa schema files for credit card status updates, polls Reuters rates feed every 2 hours, processes payments in real-time. Calls the Visa authorization v2 endpoint on every transaction."

Rich descriptions = accurate routing. Sparse descriptions = false negatives.

Ticket Lifecycle & State Machine

Every vendor change that passes triage becomes a VendorSync ticket — the single source of truth for that change's progress. Tickets have a strictly enforced state machine.

                 ┌─────────────┐
                 │    open     │  ← created by system on email triage
                 └──────┬──────┘
                        │  any linked Jira issue → "In Progress"
                        │  OR owner manually transitions in VS UI
                        ▼
                 ┌─────────────┐
                 │ in_progress │
                 └──────┬──────┘
                        │  ALL linked Jira issues → "Done"
                        │  AND all non-Jira apps → non_jira_status = complete
                        ▼
                 ┌─────────────┐
                 │  resolved   │  ← awaiting change manager confirmation
                 └──────┬──────┘
                        │  change manager clicks "Confirm closure" in VS UI
                        ▼
                 ┌─────────────┐
                 │   closed    │  ← terminal
                 └─────────────┘

  Any non-closed state + effective_date passed → breached  (terminal unless an admin force-closes)
  Any non-closed state + admin force-close → closed  (with mandatory reason)

State definitions

State	Meaning
`open`	Ticket created, no work started; all Jira issues in "To Do"
`in_progress`	At least one Jira issue has moved to "In Progress"
`resolved`	All Jira issues Done + all non-Jira apps marked complete; awaiting human confirmation
`closed`	Change manager confirmed; change is handled
`breached`	Effective date passed without closure — SLA violated

Status is driven by Jira, not by humans

For tickets with linked Jira issues, state transitions from open → in_progress → resolved happen automatically based on Jira sync results. Humans do not manually move these states. Only resolved → closed requires a human action.

For apps configured with creates_jira_ticket = false, owners use VS UI buttons to update non_jira_status on their app's ticket row. The combined evaluation (all Jira issues Done + all non-Jira apps complete) triggers resolution.
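The combined evaluation can be sketched as a function over the stored Jira status categories and the non-Jira statuses. This is an illustration under assumptions: `evaluate_status` is a hypothetical name, an issue that is already Done counts as started work, and `complete` is the only `non_jira_status` value this document names, so partial progress on non-Jira apps does not move the ticket here.

```python
def evaluate_status(jira_categories: list[str], non_jira_statuses: list[str]) -> str:
    """Derive the ticket status implied by its linked work items."""
    has_items = bool(jira_categories or non_jira_statuses)
    all_jira_done = all(c == "done" for c in jira_categories)
    all_non_jira_complete = all(s == "complete" for s in non_jira_statuses)
    if has_items and all_jira_done and all_non_jira_complete:
        return "resolved"
    # Assumption: any issue past "to_do" means work has started.
    if any(c in {"in_progress", "done"} for c in jira_categories):
        return "in_progress"
    return "open"
```

For example, two Done issues plus one non-Jira app that is not yet complete evaluates to `in_progress`, not `resolved`.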

Two-person close rule

The change manager who confirmed the last state transition on a ticket cannot also confirm its closure. The backend enforces:

if closed_by == ticket.last_transitioned_by:
    raise HTTPException(403, "The user who last transitioned this ticket cannot also confirm closure")

Admin role can self-close but the bypass is flagged in the audit log.

Forbidden transitions

closed → anything (terminal — cannot be reopened; create a new ticket referencing it)
breached → closed requires admin manual override with a documented reason
Skipping states (e.g. open → resolved) is forbidden. If Jira sync finds all issues Done on a still-open ticket, two separate transitions are written to the audit log atomically: open → in_progress → resolved.
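The no-skipping rule can be illustrated with a function that expands a target status into the individual steps to record. This is a sketch, not the project's state machine: `plan_transitions` is a hypothetical name, and `breached` and admin force-close are deliberately left out.

```python
ORDER = ["open", "in_progress", "resolved", "closed"]


def plan_transitions(current: str, target: str) -> list[tuple[str, str]]:
    """Return every single-step transition between current and target."""
    if current not in ORDER or target not in ORDER:
        raise ValueError(f"unsupported state: {current!r} -> {target!r}")
    if current == "closed":
        raise ValueError("closed is terminal")
    start, end = ORDER.index(current), ORDER.index(target)
    if end < start:
        raise ValueError("tickets never move backwards")
    return [(ORDER[i], ORDER[i + 1]) for i in range(start, end)]
```

`plan_transitions("open", "resolved")` yields the two steps `open → in_progress` and `in_progress → resolved`, which a caller would write to the audit log inside one database transaction.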

SLA Enforcement & Escalation

VendorSync's scheduler checks all open tickets every 15 minutes. For each ticket with an effective_date, it calculates the days remaining and publishes an escalate_ticket job to vs:tickets:escalate for each threshold that has been reached.

Escalation thresholds

Threshold	Actions triggered
D-7 (7 days before deadline)	Comment on all linked Jira issues · Slack DM to ticket owner · Email to owner
D-3 (3 days before deadline)	Comment on all linked Jira issues · Slack channel ping · Email to manager
D-1 (1 day before deadline)	Comment on all linked Jira issues · War room Slack channel alert · PagerDuty on-call page
D-0 (breached)	Ticket status → `breached` · Slack + email alert to admin + senior management

Each threshold fires at most once per ticket — enforced via notification_log with a deduplication index on (ticket_id, channel, metadata.escalation_threshold). There is no double-paging.

Escalations stop if the ticket reaches resolved or closed before the threshold fires.
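A sketch of the threshold check for the D-7, D-3, and D-1 reminders. The function name `due_threshold` is hypothetical. Two behaviours here are assumptions rather than documented rules: when several thresholds have been reached at once, only the tightest one is considered, and breach handling (D-0) is out of scope.

```python
from datetime import date
from typing import Optional

# Tightest threshold first, so a late-created ticket gets the most urgent one.
THRESHOLDS = [(1, "D-1"), (3, "D-3"), (7, "D-7")]


def due_threshold(effective_date: date, today: date,
                  already_fired: set[str], status: str) -> Optional[str]:
    """Return the reminder threshold to fire now, or None."""
    if status in {"resolved", "closed", "breached"}:
        return None  # escalations stop once the ticket is resolved or closed
    days_remaining = (effective_date - today).days
    if days_remaining <= 0:
        return None  # breach detection is handled elsewhere in this sketch
    for limit, label in THRESHOLDS:
        if days_remaining <= limit:
            # Each threshold fires at most once per ticket.
            return None if label in already_fired else label
    return None
```

In the real system the `already_fired` set would come from notification_log, where the deduplication index enforces the at-most-once guarantee.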

Jira comments at each threshold

🔔 VendorSync — D-7 reminder

Vendor deadline approaching: April 20, 2027 (in 7 days)

This Jira issue is linked to VendorSync ticket VS-2847.
Vendor change: Visa — URL endpoint change

If this change has been completed, please move this Jira issue to Done.

VendorSync ticket: https://vendorsync.company.com/tickets/VS-2847

SLA defaults

If no rule matches a vendor change, SLA is determined by the vendor's criticality tier:

Vendor tier	Default SLA
`tier_1`	14 days
`tier_2`	30 days
`tier_3`	60 days

Configurable in system_settings.defaults.sla_days_by_tier.

Jira Integration Deep Dive

VendorSync talks to Jira via direct async HTTP calls (httpx) to the Jira REST API. No SDKs, no proxies. Both Jira Cloud (API v3) and Jira Server/Data Center (API v2) are supported — configured per-instance via jira_config in the database.

What VendorSync creates in Jira

When a ticket is triaged and linked applications have Jira enabled, one issue is created per application:

Summary:   Visa — URL endpoint change [VS-2847]
Project:   PAY  (from applications.jira_project_key)
Type:      Task  (from applications.jira_issue_type)
Due date:  2027-04-20
Assignee:  <from applications.jira_default_assignee if set>
Labels:    ["vendorsync", "vendor:visa", <app.jira_labels...>]

Description:
═══════════════════════════════════════════════════════
VENDOR CHANGE — Visa URL endpoint update
═══════════════════════════════════════════════════════
Vendor:           Visa
Change type:      URL endpoint change
Effective date:   April 20, 2027 (D-180)
Severity:         High
SLA:              14 days
Affected application: PaymentService (Payments team)

Summary: [LLM-generated 2-3 sentence summary]
VendorSync routing reasoning: [LLM reasoning text]

VendorSync ticket: VS-2847
Direct link: https://vendorsync.company.com/tickets/VS-2847
═══════════════════════════════════════════════════════

The engineer working the Jira issue has everything they need without opening VendorSync. The original email is accessible via the VendorSync link if needed.

Sync loop (every 5 minutes)

The periodic sync never queries Jira issue by issue (only the manual sync_jira_single job targets a single issue). It is batched:

Pull all open jira_issue_key values in one SQL query
Chunk into batches of 100
For each batch: one JQL search request — key in (PAY-9421, PAY-9422, ...)
Diff stored status against Jira status category
Update ticket_jira_links, write audit log entries
Re-evaluate parent VS ticket status

With batches of 100, the sync issues up to 100× fewer Jira requests than per-issue queries, which makes it faster and far lighter on Jira API quota.
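The chunking and JQL steps of the sync loop can be sketched in a few lines. `chunked` and `build_jql` are hypothetical helper names; the batch size and the `key in (...)` form come from the steps above.

```python
from typing import Iterator

BATCH_SIZE = 100


def chunked(keys: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` issue keys."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def build_jql(keys: list[str]) -> str:
    """Build one JQL search covering a whole batch of issue keys."""
    if not keys:
        raise ValueError("a batch must contain at least one issue key")
    return f"key in ({', '.join(keys)})"
```

For 250 open issue keys this produces three searches (100, 100, and 50 keys) instead of 250 individual requests.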

Status mapping

VendorSync uses Jira's status category (standardized across all Jira projects) rather than exact status names (which vary by workflow):

Jira status category (example status names)	VendorSync `jira_status_category`
To Do / Open / Backlog	`to_do`
In Progress / In Review / etc.	`in_progress`
Done / Closed / Resolved / etc.	`done`
Anything else	`unknown`

The exact Jira status name is preserved in jira_status for display in the UI.
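A sketch of the mapping, assuming the issue JSON carries the category under `fields.status.statusCategory.key` with Jira's built-in keys `new` (To Do), `indeterminate` (In Progress), and `done`. The function name `map_status` is hypothetical.

```python
CATEGORY_BY_JIRA_KEY = {
    "new": "to_do",
    "indeterminate": "in_progress",
    "done": "done",
}


def map_status(issue: dict) -> tuple[str, str]:
    """Return (jira_status, jira_status_category) for one Jira issue payload."""
    status = issue["fields"]["status"]
    category_key = status.get("statusCategory", {}).get("key", "")
    # Anything outside the three known categories is stored as "unknown".
    return status["name"], CATEGORY_BY_JIRA_KEY.get(category_key, "unknown")
```

Keying on the category rather than the name means a custom workflow status such as "In Review" still maps to `in_progress` without any per-project configuration.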

Error handling and retry

Error	Action
`400` bad payload	Log full error, alert admin — likely a misconfigured project key or issue type
`401` unauthorized	Disable Jira config, urgent admin alert
`403` forbidden	Log, admin alert — missing "Create Issues" permission
`404` project not found	Log, admin alert — `jira_project_key` is wrong
`5xx` / network timeout	Retry × 3 with exponential backoff
Exhausted retries	Mark `sync_status = orphan`, background retry every 15 min

Jira creation failures never block ticket creation. VendorSync tickets exist and are tracked regardless of Jira state. Orphaned links are surfaced in the admin view.
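The error table can be read as a small decision function plus a backoff schedule. The sketch below is illustrative: the action labels are made-up names, a network timeout is represented as `None`, the one-second base delay is an assumption, and status codes the table does not list are returned as `unspecified` rather than guessed.

```python
from typing import Optional

MAX_RETRIES = 3
BASE_DELAY_SECONDS = 1.0  # assumption: the document does not give a base delay


def classify_jira_error(status_code: Optional[int]) -> str:
    """Map an HTTP status (None = network timeout) to the action in the table."""
    if status_code is None or 500 <= status_code <= 599:
        return "retry"
    if status_code == 401:
        return "disable_config_and_alert"
    if status_code in (400, 403, 404):
        return "log_and_alert"
    return "unspecified"


def backoff_delays(retries: int = MAX_RETRIES,
                   base: float = BASE_DELAY_SECONDS) -> list[float]:
    """Exponential backoff: base, 2*base, 4*base, ..."""
    return [base * 2 ** attempt for attempt in range(retries)]
```

With the defaults, the three retries wait 1, 2, and 4 seconds before the link is marked as an orphan.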

Permissions VendorSync needs in Jira

The Jira account used by VendorSync requires:

Browse Projects on all target projects
Create Issues on all target projects
Add Comments on all target projects

VendorSync intentionally does not transition Jira issue statuses, delete issues, or touch issues it didn't create.

Notification System

Notifications are dispatched asynchronously through the vs:tickets:notify stream, after ticket creation. Severity determines the fanout:

Severity	Slack channel	Slack DM to owner	Email	PagerDuty
`critical`	✓	✓	✓	✓
`high`	✓	✓	✓	—
`medium`	✓	—	✓	—
`low`	✓	—	—	—

Every notification dispatched is recorded in notification_log. If a notification channel fails (Slack down, SMTP unreachable), the failure is logged, the ticket creation still completes, and retry happens via DLQ replay — notifications never block the primary pipeline.
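The fanout table translates directly into a lookup. The channel identifiers below (`slack_channel`, `slack_dm`, `email`, `pagerduty`) are hypothetical names chosen for this sketch.

```python
FANOUT: dict[str, tuple[str, ...]] = {
    "critical": ("slack_channel", "slack_dm", "email", "pagerduty"),
    "high": ("slack_channel", "slack_dm", "email"),
    "medium": ("slack_channel", "email"),
    "low": ("slack_channel",),
}


def channels_for(severity: str) -> tuple[str, ...]:
    """Return the notification channels to use for a ticket severity."""
    try:
        return FANOUT[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```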

Authentication & Authorization

Authentication — Okta OIDC

VendorSync uses Okta as the identity provider. The backend validates JWTs with authlib; the frontend uses Auth.js v5 (NextAuth) with the Okta provider.

User opens VendorSync
  → frontend redirects to Okta login
  → user authenticates (MFA, SSO, etc.)
  → Okta redirects back with auth code
  → Auth.js exchanges code for tokens
  → frontend sends Okta access token in Authorization header
  → backend validates JWT via Okta's JWKS endpoint
  → backend looks up role in users table
  → request proceeds with full user context

Authorization — internal roles

Okta identifies who the user is. VendorSync's users table controls what they can do. Role changes don't require Okta admin involvement.

Role	What they can do
`admin`	Everything — users, config, system settings, manual override close
`change_manager`	Manage rules and registry, confirm closures, manual override
`owner`	Work tickets for their team, mark non-Jira app progress
`viewer`	Read-only — dashboards, ticket lists, audit log

Permission matrix

Action	Admin	Change Manager	Owner	Viewer
View all tickets	✓	✓	✓	✓
Mark non-Jira app progress (own team)	✓	✓	✓	—
Trigger manual Jira sync	✓	✓	✓	—
Confirm ticket closure	✓	✓	—	—
Manual override close	✓	✓	—	—
Edit rules	✓	✓	—	—
Edit application registry	✓	✓	—	—
Edit vendors	✓	—	—	—
Edit email sources / LLM / Jira config	✓	—	—	—
Manage users and roles	✓	—	—	—
View audit log	✓	✓	✓	✓

First-time login (auto-provisioning)

When an unknown Okta user hits the system for the first time, VendorSync auto-provisions them with role = viewer. Existing admins are notified via Slack/email: "New user X just logged in — set their role."

Auto-provisioning can be disabled (system_settings.auth.auto_provision = false), requiring admins to pre-create accounts.

Bootstrap admin

On first run, no admins exist. Set the BOOTSTRAP_ADMINS environment variable to comma-separated emails:

BOOTSTRAP_ADMINS=alice@company.com,bob@company.com

When those users log in for the first time, they receive admin role automatically. The variable is idempotent — re-applying it will not downgrade existing admins.
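A sketch of how the bootstrap list might be applied at login. Function names are hypothetical. Case-insensitive email matching is an assumption, and the sketch ignores the `auto_provision = false` setting.

```python
from typing import Optional


def parse_bootstrap_admins(raw: str) -> set[str]:
    """Split the comma-separated env value into normalized email addresses."""
    return {item.strip().lower() for item in raw.split(",") if item.strip()}


def role_on_login(email: str, existing_role: Optional[str],
                  bootstrap_admins: set[str]) -> str:
    """Pick the role for a login; existing users always keep their role."""
    if existing_role is not None:
        return existing_role  # never downgrade or overwrite an existing account
    if email.strip().lower() in bootstrap_admins:
        return "admin"
    return "viewer"  # default for auto-provisioned users
```

Because existing roles are returned untouched, applying the same `BOOTSTRAP_ADMINS` value on every start is harmless.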

Data Model

VendorSync's database is PostgreSQL 17, managed via Alembic migrations, accessed through SQLAlchemy 2.x async ORM. All tables use UUID primary keys and UTC timestamps.

Core tables

Table	Purpose
`users`	Okta-authenticated users with internal roles
`vendors`	External vendors that send change notifications
`applications`	Internal applications, with Jira routing config and owner info
`application_vendors`	Many-to-many: which apps integrate with which vendors
`rules`	Natural-language routing rules consumed by the Triage Agent
`tickets`	The core entity — one per vendor change, with full lifecycle
`ticket_applications`	Many-to-many: which apps are affected by each ticket
`ticket_jira_links`	One row per Jira issue linked to a ticket
`ticket_notes`	User-authored notes on tickets
`source_emails`	Ingested emails with classification metadata
`email_attachments`	Attachment metadata and storage paths
`audit_log`	Immutable append-only record of every state change
`notification_log`	Record of every notification dispatched

Configuration tables

Table	Purpose
`email_sources`	Configured mailboxes (IMAP / Exchange / Gmail)
`llm_config`	LLM provider config — primary + fallback, with encrypted API keys
`jira_config`	Jira instance config — base URL, API version, auth token
`system_settings`	Key/value store for all system-wide defaults

Ticket numbering

Tickets are assigned human-readable IDs (VS-0001, VS-0002, …) via a dedicated Postgres sequence ticket_number_seq. The application reads nextval('ticket_number_seq') at insert time and formats it as VS- + zero-padded integer. Sequence values are never manually assigned.
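The formatting step is a one-liner. The sketch below assumes a minimum width of four digits, inferred from the VS-0001 example; larger sequence values simply grow wider.

```python
def format_ticket_number(sequence_value: int) -> str:
    """Format a value from ticket_number_seq as a human-readable ticket ID."""
    if sequence_value < 1:
        raise ValueError("sequence values start at 1")
    return f"VS-{sequence_value:04d}"
```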

Encryption at rest

Sensitive fields (email_sources.password_encrypted, llm_config.api_key_encrypted, jira_config.api_token_encrypted) are encrypted with Fernet symmetric encryption using VENDORSYNC_SECRET_KEY from the environment. Raw values are never written to the database. *_encrypted fields are never exposed in API read schemas.

Soft deletes only

Nothing is ever hard-deleted. All entities use is_active = false flags. Queries filter on is_active = true by default. Historical records are always queryable for audit purposes.

Key indexes

Beyond primary keys, the schema carries targeted indexes for the hottest query paths:

tickets (status, effective_date) — deadline monitor
tickets (vendor_id, created_at) — vendor history views
source_emails (processing_status, received_at) — triage queue
audit_log (entity_type, entity_id, created_at) — per-entity history
notification_log — expression index on metadata->>'escalation_threshold' for escalation deduplication
ticket_jira_links (jira_issue_key) — Jira key → VS ticket lookup
rules (vendor_id, is_active) — rule loading in triage

Technology Stack

Every technology choice in VendorSync is deliberate. The stack is designed for developer ergonomics, operational simplicity, and correctness — not novelty.

Backend

Component	Choice	Why
Language	Python 3.12+	Ecosystem depth for LLM, async I/O, data tooling
Framework	FastAPI	Native async, automatic OpenAPI docs, Pydantic integration
ASGI server	Uvicorn (dev) / Gunicorn + Uvicorn workers (prod)	Production-proven, minimal
ORM	SQLAlchemy 2.x async	Industry standard, excellent async support
Migrations	Alembic	Pairs with SQLAlchemy, reliable version tracking
Job queue	Redis Streams via `redis-py` async	No additional infrastructure, persistent, debuggable
Scheduler	APScheduler	Lightweight in-process, no separate beat service
Validation	Pydantic v2	Fast, type-safe, shared schemas with FastAPI
Package manager	`uv`	10–100× faster than pip, deterministic lock file

LLM

Component	Choice	Why
Abstraction	LiteLLM	Single API across all providers — switch providers via config with zero code changes
Default provider	Anthropic Claude	Best instruction following; multimodal for PDF/image attachments
Fallback providers	OpenAI, Google Gemini, Azure OpenAI	Configurable at runtime
Document parsing	Delegated to LLM	No separate OCR library needed — Claude Sonnet, GPT-4o, Gemini all handle images and PDFs natively

Frontend

Component	Choice	Why
Framework	Next.js (App Router)	File-based routing, server components, production-ready
Language	TypeScript	End-to-end type safety
Styling	Tailwind CSS	Utility-first, co-located with markup
Components	shadcn/ui	Accessible primitives, fully owned source code
Icons	lucide-react	Consistent, tree-shakeable
Server state	TanStack Query	Best-in-class caching, background refetch, stale-while-revalidate
Forms	react-hook-form + zod	Type-safe validation, reusable schemas
Auth	Auth.js v5 (NextAuth) with Okta provider	Full App Router compatibility — server components, middleware, server actions

Infrastructure

Component	Choice	Why
Reverse proxy	Traefik v3	Docker label service discovery, built-in Let's Encrypt, zero config for adding services
Database	PostgreSQL 17	ACID, JSONB, sequences, GIN indexes, industry standard
Cache / queue	Redis 7	Streams + pub/sub + caching in one service
Container runtime	Docker Compose	One `docker compose up` brings up the full stack

Email ingestion

Protocol	Library
IMAP	`aioimaplib` (async)
Microsoft Exchange	`exchangelib`
Gmail API	`google-api-python-client`
POP3	Not supported — not in the protocol enum

Notifications

Channel	Library / API
Slack	`slack-sdk` Python
Email	`smtplib` (or SendGrid SDK)
PagerDuty	Events API v2 (direct HTTP POST)

Project Structure

vendorsync/
├── backend/
│   ├── app/
│   │   ├── api/              # FastAPI routes (one file per resource)
│   │   ├── models/           # SQLAlchemy ORM models
│   │   ├── schemas/          # Pydantic v2 request/response schemas
│   │   ├── services/         # Business logic (pure Python)
│   │   ├── agents/           # Triage Agent (and future agents)
│   │   ├── workers/          # Redis Streams consumer functions
│   │   ├── scheduler/        # APScheduler periodic job definitions
│   │   ├── llm/              # LiteLLM wrapper, prompt templates
│   │   ├── ingestion/        # Email fetch and attachment handling
│   │   ├── jira/             # Jira REST client and sync logic
│   │   ├── auth/             # Okta JWT validation, role decorators
│   │   ├── core/             # Config, DB session, Redis client
│   │   └── main.py           # FastAPI app entry point
│   ├── tests/
│   ├── alembic/              # Database migration scripts
│   ├── pyproject.toml        # uv project manifest
│   ├── uv.lock               # Deterministic dependency lock
│   └── Dockerfile            # Single image for backend + worker + scheduler
├── frontend/
│   ├── src/
│   │   ├── app/              # Next.js App Router pages
│   │   │   ├── layout.tsx    # Root layout (sidebar + header)
│   │   │   ├── page.tsx      # Dashboard
│   │   │   ├── tickets/[id]/ # Ticket detail
│   │   │   ├── vendors/      # Vendor registry
│   │   │   ├── applications/ # Application registry
│   │   │   ├── rules/        # Rules admin
│   │   │   ├── settings/     # Email / LLM / Jira / users config
│   │   │   ├── audit/        # Audit log viewer
│   │   │   └── docs/         # In-app documentation tab
│   │   ├── components/
│   │   │   ├── ui/           # shadcn/ui primitives
│   │   │   ├── layout/       # Sidebar, Header
│   │   │   ├── tickets/      # Ticket-specific components
│   │   │   └── shared/       # Badges, cards, sync buttons
│   │   ├── hooks/            # TanStack Query hooks, WS hooks
│   │   ├── lib/
│   │   │   ├── api/          # Typed API client functions
│   │   │   ├── auth/         # Auth.js helpers
│   │   │   ├── time.ts       # formatDateTime, formatDate, etc.
│   │   │   └── types/        # Shared TypeScript types
│   │   └── styles/
│   │       └── globals.css
│   ├── package.json
│   ├── tsconfig.json
│   └── Dockerfile
├── traefik/
│   └── traefik.yml           # Traefik static config
├── docker-compose.yml
├── docker-compose.server.yml # Production server overrides
├── .env.example              # All required env vars with comments
├── .gitattributes            # LF line endings enforced for all text files
└── docs/                     # Architecture, data model, business rules, etc.

Configuration Reference

All secrets and environment configuration are in .env (never committed — see .env.example).

Required variables

# Database
DATABASE_URL=postgresql+asyncpg://vendorsync:password@postgres:5432/vendorsync

# Redis
REDIS_URL=redis://redis:6379/0

# Encryption key (Fernet) — used for all encrypted DB fields
VENDORSYNC_SECRET_KEY=<URL-safe base64-encoded 32-byte key>

# Okta OIDC (backend)
OKTA_DOMAIN=company.okta.com
OKTA_CLIENT_ID=...
OKTA_CLIENT_SECRET=...
OKTA_AUDIENCE=api://vendorsync
OKTA_ISSUER=https://company.okta.com/oauth2/default

# Auth.js (frontend)
AUTH_SECRET=<random 32-byte secret>
NEXTAUTH_URL=https://vendorsync.company.com

# Bootstrap admin (first run only)
BOOTSTRAP_ADMINS=alice@company.com,bob@company.com
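
One possible way to produce values for `VENDORSYNC_SECRET_KEY` and `AUTH_SECRET` is sketched below using only the Python standard library. A Fernet key is the URL-safe base64 encoding of 32 random bytes, so the first helper yields a key in that format. The helper names are hypothetical, and the `AUTH_SECRET` format is an assumption: any sufficiently random string should work there.

```python
import base64
import os
import secrets


def generate_fernet_key() -> str:
    """Return 32 random bytes, URL-safe base64 encoded (the Fernet key format)."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


def generate_auth_secret() -> str:
    """Return 32 random bytes rendered as URL-safe text, for use as an opaque secret."""
    return secrets.token_urlsafe(32)
```

Printing `generate_fernet_key()` and `generate_auth_secret()` once and pasting the results into `.env` is enough. Note that changing the Fernet key later would make previously encrypted fields unreadable.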

System settings (stored in DB, editable via UI)

Key	Default	Description
`defaults.timezone`	`"UTC"`	Timezone for `effective_date` interpretation
`defaults.working_hours`	`{"start":"09:00","end":"18:00"}`	Business hours for escalation scheduling
`escalation.thresholds_days`	`[7, 3, 1]`	Days-before-deadline escalation triggers
`defaults.sla_days_by_tier`	`{"tier_1":14,"tier_2":30,"tier_3":60}`	Fallback SLA by vendor tier
`audit.retention_days`	`1825` (5 years)	How long to retain audit log rows
`auth.auto_provision`	`true`	Auto-create users on first Okta login
`auth.default_role`	`"viewer"`	Default role for auto-provisioned users
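
To make `escalation.thresholds_days` concrete, the sketch below turns a deadline and the default `[7, 3, 1]` into calendar dates and reports which thresholds have already been reached. The function names are hypothetical, and the example deliberately ignores `defaults.timezone` and `defaults.working_hours`, which the real scheduler would presumably also take into account.

```python
from datetime import date, timedelta


def escalation_dates(deadline: date, thresholds_days: list[int]) -> list[date]:
    """Return the date on which each threshold fires, earliest first."""
    return [deadline - timedelta(days=d) for d in sorted(thresholds_days, reverse=True)]


def reached_thresholds(today: date, deadline: date, thresholds_days: list[int]) -> list[int]:
    """Return the thresholds (in days) whose trigger date is today or already past."""
    days_left = (deadline - today).days
    return [d for d in sorted(thresholds_days, reverse=True) if days_left <= d]
```

With a deadline of 31 March 2025, `escalation_dates` gives 24, 28, and 30 March; on 27 March only the 7-day threshold has been reached.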

Getting Started (Local Development)

Prerequisites

Docker Desktop with the Linux engine
Git configured with core.autocrlf = input (Windows only — prevents CRLF contamination)

First run

# 1. Clone the repo
git clone <repo-url>
cd vendorsync

# 2. Copy environment template
cp .env.example .env
# Edit .env — fill in OKTA_*, VENDORSYNC_SECRET_KEY, AUTH_SECRET at minimum

# 3. Build and start all services
docker compose up --build

# 4. Run database migrations (first time only; use a second terminal, since step 3 stays in the foreground)
docker compose exec backend alembic upgrade head

# 5. Open the app
# http://localhost — frontend (Traefik routes it)
# http://localhost/api/docs — FastAPI Swagger UI
# http://localhost:9081/dashboard/ — Traefik dashboard (the trailing slash is required)

Subsequent runs

docker compose up          # Start all services (no rebuild)
docker compose up --build  # Rebuild and start (after Dockerfile or dependency changes)
docker compose down        # Stop all services, keep volumes

Never use docker compose down -v — this destroys vendorsync_postgres_data and vendorsync_redis_data permanently. Always stop without -v.

Hot reload

All source code is volume-mounted into containers. Changes to Python or TypeScript files are live immediately — no rebuild needed:

Backend: Uvicorn --reload watches /app/app for .py changes
Frontend: Next.js Turbopack watches /app/src for .ts/.tsx changes

Running migrations

# Apply pending migrations
docker compose exec backend alembic upgrade head

# Create a new migration after model changes
docker compose exec backend alembic revision --autogenerate -m "describe the change"

Viewing logs

docker compose logs -f backend
docker compose logs -f worker
docker compose logs -f scheduler
docker compose logs -f frontend

Business Rules Reference

Severity levels

Severity	Triggered by
`critical`	Breaking change with deadline < 30 days, vendor `tier_1`, or affects payment-critical apps
`high`	Breaking change with deadline 30–90 days, vendor `tier_2`, or affects multiple critical apps
`medium`	Non-breaking change with deadline > 90 days, or single non-critical app
`low`	Informational, optional updates, vendor `tier_3`

Vendor criticality tiers

Tier	Meaning	Default SLA
`tier_1`	Mission-critical vendors — payment networks, core infrastructure	14 days
`tier_2`	Important integrations — significant operational impact if broken	30 days
`tier_3`	Non-critical vendors — informational or optional integrations	60 days
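
The default SLA column can be read as a simple date calculation, sketched below. This assumes the SLA clock starts on the day the vendor email is received and counts calendar days; neither detail is stated above, and `fallback_deadline` is a hypothetical name.

```python
from datetime import date, timedelta

# Mirrors the documented default of the `defaults.sla_days_by_tier` setting.
DEFAULT_SLA_DAYS_BY_TIER = {"tier_1": 14, "tier_2": 30, "tier_3": 60}


def fallback_deadline(
    received_on: date,
    vendor_tier: str,
    sla_days_by_tier: dict[str, int] = DEFAULT_SLA_DAYS_BY_TIER,
) -> date:
    """Return the SLA deadline for a vendor tier, counted from the receipt date."""
    return received_on + timedelta(days=sla_days_by_tier[vendor_tier])
```

Because the mapping is passed in as a parameter, a value loaded from the system settings table can replace the defaults without changing the function.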

Manual override rules

Force-closing a ticket without normal resolution requires:

admin or change_manager role
close_reason text of at least 20 characters
is_manual_override = true set on the ticket
Audit log entry flagged with override context
Override tickets excluded from SLA compliance metrics
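
The override requirements above could be enforced by a single service function along the following lines. This is a minimal sketch: the `Ticket` dataclass, the `"closed"` status value, the audit entry shape, and the whitespace stripping before the length check are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

OVERRIDE_ROLES = {"admin", "change_manager"}
MIN_CLOSE_REASON_CHARS = 20


class OverrideNotAllowed(Exception):
    """Raised when a force-close request violates the override rules."""


@dataclass
class Ticket:
    id: int
    status: str = "open"
    close_reason: Optional[str] = None
    is_manual_override: bool = False


def force_close(ticket: Ticket, actor_role: str, close_reason: str, audit_log: list) -> Ticket:
    """Close a ticket outside normal resolution, applying every override rule."""
    if actor_role not in OVERRIDE_ROLES:
        raise OverrideNotAllowed(f"role {actor_role!r} may not force-close tickets")
    reason = close_reason.strip()  # assumption: padding does not count toward the minimum
    if len(reason) < MIN_CLOSE_REASON_CHARS:
        raise OverrideNotAllowed(f"close_reason needs at least {MIN_CLOSE_REASON_CHARS} characters")
    ticket.status = "closed"
    ticket.close_reason = reason
    ticket.is_manual_override = True
    # Flag the audit entry so reviewers can tell overrides from normal closures.
    audit_log.append({"ticket_id": ticket.id, "action": "force_close", "override": True, "reason": reason})
    return ticket


def counts_toward_sla_metrics(ticket: Ticket) -> bool:
    """Override tickets are excluded from SLA compliance metrics."""
    return not ticket.is_manual_override
```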

Unknown vendor handling

When the LLM cannot match an email to any known vendor:

source_emails.processing_status = 'unknown'
No ticket is created
Slack alert to triage operators
Email surfaces in the VendorSync triage view (source_emails WHERE processing_status = 'unknown')
Operator classifies manually; can optionally save the classification as a new rule for future emails

Notifications must not block

If Slack, email, or PagerDuty calls fail, the ticket and Jira creation still complete. The failure is logged to notification_log and retried via DLQ. Notification channels are fire-and-forget relative to the core pipeline.
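
A minimal sketch of this non-blocking behavior, assuming each channel is exposed as an async callable: all channels are attempted concurrently, failures are recorded instead of raised, and the caller continues either way. The `notify_all` name, the `Sender` type, and the log entry shape are hypothetical; an in-memory list stands in for `notification_log`, and the DLQ retry is omitted.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("vendorsync.notifications")

# Hypothetical channel signature: takes the message text, returns nothing.
Sender = Callable[[str], Awaitable[None]]


async def notify_all(message: str, channels: dict[str, Sender], notification_log: list) -> None:
    """Send to every channel; record failures but never propagate them."""
    results = await asyncio.gather(
        *(send(message) for send in channels.values()),
        return_exceptions=True,  # one failing channel must not cancel the others
    )
    for name, result in zip(channels, results):
        failed = isinstance(result, BaseException)
        notification_log.append(
            {
                "channel": name,
                "status": "failed" if failed else "sent",
                "error": repr(result) if failed else None,
            }
        )
        if failed:
            logger.warning("notification via %s failed: %r", name, result)
```

Since `notify_all` never raises on a channel error, ticket and Jira creation can await it (or schedule it as a background task) without any risk of a notification outage aborting the pipeline.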

Architectural Principles

These principles guided every design decision in VendorSync. They are not aspirations — they are constraints actively enforced by the codebase.

Separation of concerns — API, services, agents, workers, scheduler, and models are distinct layers with one-way dependency rules. Nothing depends on api or workers.

LLM is for unstructured → structured conversion only — the LLM turns free-text emails into TriageDecision objects. Once data is structured, deterministic code takes over for everything downstream — state machines, SLA math, Jira API calls, deadline monitoring. The LLM never sees a structured decision point.

The VendorSync ticket is the audit truth — Jira is where engineering work happens. VendorSync is where vendor-change handling is tracked, measured, and enforced. They are complementary, not duplicates.

Rules and application descriptions are data, not code — admins edit them through the GUI at runtime. The LLM reads them fresh on every triage call. Adding a new rule requires zero deployments.

Multi-provider LLM from day one — every LLM call routes through LiteLLM. No provider is hardcoded anywhere. Switching from Claude to GPT-4o is a one-row database change.

Async everywhere — FastAPI async routes, async SQLAlchemy, async Redis, async httpx. No blocking I/O anywhere in the hot path.

Idempotent jobs — workers check for existing state before acting. Retrying a failed create_jira_issue job will not create a duplicate issue if the first attempt partially succeeded.
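
The idempotency principle can be illustrated with a sketch of a `create_jira_issue` job that checks both local state and Jira before creating anything. Everything here other than the job name is an assumption: the `FakeJira` class stands in for the real Jira client, and tagging issues with a per-ticket label is one possible way to detect a partially completed earlier attempt.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ticket:
    id: int
    jira_issue_key: Optional[str] = None


@dataclass
class FakeJira:
    """In-memory stand-in for a Jira client, keyed by a per-ticket label."""
    issues: dict = field(default_factory=dict)

    def find_by_label(self, label: str) -> Optional[str]:
        return self.issues.get(label)

    def create_issue(self, label: str) -> str:
        key = f"ENG-{len(self.issues) + 1}"
        self.issues[label] = key
        return key


def create_jira_issue(ticket: Ticket, jira: FakeJira) -> str:
    """Create the Jira issue for a ticket, or reuse the one a previous attempt made."""
    if ticket.jira_issue_key:  # earlier attempt finished completely
        return ticket.jira_issue_key
    label = f"vendorsync-{ticket.id}"
    # Earlier attempt may have created the issue but crashed before saving the key.
    key = jira.find_by_label(label) or jira.create_issue(label)
    ticket.jira_issue_key = key
    return key
```

If the first attempt created the issue but failed before persisting `jira_issue_key`, the retry finds the issue by label and records its key instead of opening a second one.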

Single entry point — Traefik handles all external routing. Backend and frontend containers do not expose ports directly in production. Adding a new service requires only Docker labels.

Agent abstraction — LLM logic is wrapped in named agent classes (TriageAgent) with stable interfaces. The implementation can be swapped, multi-step chains can be added, A/B testing can be layered — callers never change.
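
One way to picture the agent abstraction is a small interface that callers depend on, with implementations swapped behind it. The sketch below is illustrative only: the `Agent` protocol, the fields on `TriageDecision`, and the keyword-based `StubTriageAgent` are assumptions, and the stub replaces the real LLM call so the example runs on its own.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class TriageDecision:
    # Hypothetical fields; the real schema lives in the backend.
    vendor: Optional[str]
    severity: str


class Agent(Protocol):
    """The stable interface callers rely on."""

    async def triage(self, email_text: str) -> TriageDecision: ...


class StubTriageAgent:
    """Deterministic stand-in for an LLM-backed agent, for illustration and tests."""

    async def triage(self, email_text: str) -> TriageDecision:
        severity = "high" if "deprecat" in email_text.lower() else "low"
        return TriageDecision(vendor=None, severity=severity)


async def handle_email(agent: Agent, email_text: str) -> TriageDecision:
    """Caller code: depends only on the interface, not on how triage is done."""
    return await agent.triage(email_text)
```

Replacing `StubTriageAgent` with an LLM-backed class, a multi-step chain, or an A/B wrapper leaves `handle_email` untouched, which is the property the principle describes.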

Documentation

Full internal documentation lives in docs/:

Document	Contents
`docs/ARCHITECTURE.md`	System design, service topology, module boundaries, build sequence
`docs/DATA_MODEL.md`	Complete database schema with field-level documentation
`docs/PROCESSING_FLOW.md`	Step-by-step email processing pipeline with edge cases
`docs/BUSINESS_RULES.md`	Ticket lifecycle, SLA, escalation, override logic
`docs/JIRA_INTEGRATION.md`	Jira API usage, sync pattern, error handling, permissions
`docs/AUTH.md`	Okta OIDC flow, roles, permission matrix, implementation details
`docs/TECH_STACK.md`	Technology choices and rationale
`docs/UI_DESIGN.md`	Frontend design system, component library, screen inventory

These docs are also accessible in-app at /docs — the documentation tab is part of the VendorSync UI itself.

VendorSync is built with FastAPI · Next.js · PostgreSQL · Redis · Traefik · LiteLLM · Okta
