Enterprise-grade vendor change management platform. VendorSync automatically ingests vendor notification emails, uses a Large Language Model to triage them against your application registry and rule set, creates tracked tickets with full SLA enforcement, and opens Jira issues for every affected engineering team — all without a human touching a keyboard.
- What is VendorSync?
- Why VendorSync Exists
- How It Works — End to End
- Architecture Overview
- Service Topology
- Redis Streams — The Job Bus
- The Triage Agent
- Ticket Lifecycle & State Machine
- SLA Enforcement & Escalation
- Jira Integration Deep Dive
- Notification System
- Authentication & Authorization
- Data Model
- Technology Stack
- Project Structure
- Configuration Reference
- Getting Started (Local Development)
- Business Rules Reference
- Architectural Principles
VendorSync is a full-stack internal platform that solves a specific, painful problem at scale: vendor change management.
Every enterprise engineering organization receives a constant stream of emails from third-party vendors — API endpoint changes, certificate rotations, protocol deprecations, breaking schema updates, deadline-driven migration notices. Each of these can break production systems if missed or mishandled. Manually routing these emails, figuring out which internal applications are affected, assigning owners, opening Jira tickets, and tracking deadlines is tedious, error-prone, and doesn't scale.
VendorSync eliminates all of that manual work. It:
- Watches your mailboxes — polls configured email sources (IMAP, Microsoft Exchange, Gmail API) for new vendor notifications
- Reads and understands the email — sends it to an LLM (Claude, GPT-4o, or Gemini) along with your entire application registry and rule set, asking it to decide: which vendor sent this, which of your apps are impacted, how severe it is, who owns it, what the deadline is
- Creates a tracked VendorSync ticket — a first-class record with severity, SLA, owner, and deadline
- Opens Jira issues automatically — one per affected application, pre-filled with full context so engineers never have to ask "what is this about"
- Enforces SLAs — monitors deadlines and fires escalations at D-7, D-3, D-1, opening war rooms and paging on-call engineers if needed
- Closes the loop — syncs Jira status back into VendorSync every 5 minutes; when all Jira issues are Done, the ticket auto-transitions to Resolved for a change manager to confirm
The entire pipeline from email arrival to Jira issues created and notifications sent targets under 20 seconds at the 95th percentile.
The alternative is a spreadsheet, a shared inbox, and hope. Common failure modes VendorSync prevents:
| Failure mode | How VendorSync prevents it |
|---|---|
| Vendor email goes to wrong person or gets lost | Watches dedicated mailboxes, never misses |
| No one knows which apps are affected | LLM reads app descriptions and decides automatically |
| Wrong team gets paged | Rules + app registry route to the right owner every time |
| Deadline slips without anyone noticing | Scheduled deadline monitor, three escalation thresholds, breach detection |
| Jira issues created inconsistently or not at all | Automatic creation with full context on every ticket |
| No audit trail of what happened and when | Immutable audit log on every state change |
| Hard to tell which open changes are at risk | Dashboard with SLA countdowns, severity badges, breach indicators |
┌─────────────────────────────────────────────────────────────────────────────────┐
│ │
│ Vendor sends → Mailbox → [Scheduler polls] → vs:emails:incoming │
│ notification │
│ email │
│ │
│ [Worker] fetches message, stores in source_emails, publishes process_email job │
│ │
│ [Triage Agent] ←── all active rules ──────────────────── DB │
│ │ ←── all applications + descriptions ───── DB │
│ │ ←── email subject + body + attachments │
│ ↓ │
│ TriageDecision { │
│ vendor, topic, effective_date, │
│ affected_application_ids, │
│ severity, owner_team, sla_days, │
│ alert_channel, matched_rule_ids, │
│ reasoning │
│ } │
│ ↓ │
│ [Ticket Engine] │
│ → INSERT tickets (status: open) │
│ → INSERT ticket_applications (one per affected app) │
│ → INSERT ticket_jira_links (one per app with Jira enabled) │
│ → PUBLISH create_jira_issue jobs → vs:jira:create │
│ → PUBLISH notify job → vs:tickets:notify │
│ │
│ [Worker] creates Jira issues, stores issue keys, syncs every 5 min │
│ [Worker] sends Slack/email/PagerDuty notifications │
│ [Scheduler] monitors deadlines every 15 min, escalates at D-7/D-3/D-1 │
│ [Worker] syncs Jira status → auto-transitions VS ticket to Resolved │
│ [Change manager] confirms closure in UI → ticket Closed │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
VendorSync is decomposed into three Python processes sharing one codebase, backed by PostgreSQL and Redis, served through a Traefik reverse proxy, with a Next.js frontend.
┌───────────────────────────────────────────┐
│ Traefik │
│ / → frontend (Next.js) │
│ /api/* → backend (FastAPI) │
└──────────────┬────────────────────────────┘
│
┌──────────────┬─────────┴──────────┬────────────────────┐
▼ ▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐
│ Frontend │ │ Backend │ │ Worker │ │ Scheduler │
│ Next.js │ │ FastAPI │ │ (N │ │APScheduler│
│ App │ │ HTTP API │ │ replicas)│ │ periodic │
│ Router │ │ │ │ Streams │ │ jobs │
└──────────┘ └────┬─────┘ └────┬─────┘ └─────┬─────┘
│ │ │
┌───────┴──────┐ ┌────────┴──────────────────────┘
│ │ │
┌─────▼─────┐ ┌────▼────▼──┐
│ PostgreSQL │ │ Redis │
│ primary │ │ Streams │
│ database │ │ + cache │
└────────────┘ └────────────┘
The Python codebase is organized into strict layers. Nothing depends backwards:
| Module | Responsibility |
|---|---|
| backend.api | HTTP routes only: validates input, calls services, returns responses |
| backend.services | Business logic: pure Python, no FastAPI imports, reusable from both API and workers |
| backend.agents | The Triage Agent and future agent abstractions |
| backend.models | SQLAlchemy ORM models, the source of truth for the DB schema |
| backend.schemas | Pydantic v2 schemas for API input/output validation |
| backend.workers | Redis Streams consumers that call into services and agents |
| backend.scheduler | APScheduler periodic jobs that publish work items to streams |
| backend.llm | LiteLLM wrapper, prompt templates, response parsing |
| backend.ingestion | Email fetching (IMAP / Exchange / Gmail), attachment handling |
| backend.jira | Jira REST client, sync logic, comment helpers |
| backend.auth | Okta OIDC integration, JWT validation, role middleware |
| backend.core | Config, DB session, Redis client, security helpers |
Dependency rule: api, workers, and scheduler may depend on services and agents. services and agents depend on models, llm, ingestion, jira. Nothing depends on api or workers.
VendorSync runs as 7 Docker services in a single Compose stack:
| Service | Image | Purpose |
|---|---|---|
| traefik | traefik:v3 | Reverse proxy, single external entry point, HTTPS via Let's Encrypt in production |
| backend | custom Python | FastAPI HTTP API serving all REST endpoints |
| worker | custom Python (same image) | Redis Streams consumer: ingestion, LLM calls, Jira creation, notifications, sync |
| scheduler | custom Python (same image) | APScheduler: mailbox polling, Jira sync batch, deadline monitor |
| frontend | custom Node | Next.js App Router UI |
| postgres | postgres:17 | Primary relational database, all persistent state |
| redis | redis:7 | Streams job bus + caching |
backend, worker, and scheduler run the same Docker image, differentiated by the command: override in docker-compose.yml. This means a single docker compose build backend rebuilds all three.
| Path | Routed to |
|---|---|
| / and all UI routes | Next.js frontend (port 3000 internally) |
| /api/* | FastAPI backend (port 8000 internally) |
| /health | Backend health check |
| :8080/dashboard | Traefik live dashboard (dev only) |
In production, Traefik handles TLS termination via Let's Encrypt and redirects all HTTP to HTTPS. Service discovery is via Docker labels — no static config files need editing when adding services.
All asynchronous work flows through Redis Streams. There is no Celery, no RabbitMQ, no separate broker — Redis is already required for caching, and Streams on top of it replace all of that at a fraction of the operational overhead.
| Stream | Jobs carried |
|---|---|
| vs:emails:incoming | poll_mailbox (fetch new messages from IMAP/Exchange/Gmail) |
| vs:jira:create | create_jira_issue (one job per app per ticket, with retry) |
| vs:jira:sync | sync_jira_batch (periodic) + sync_jira_single (manual) |
| vs:tickets:notify | send_notifications (Slack, email, PagerDuty fanout) |
| vs:tickets:escalate | escalate_ticket (D-7, D-3, D-1 actions) |
| vs:dlq | Dead-letter queue: exhausted jobs held for admin review |

| Stream | Events pushed |
|---|---|
| vs:ui:events | system.status: queue depths and scheduler heartbeats (every 5s) |
Workers are stateless consumers in a consumer group (vs-workers). The pattern is simple and reliable:
XREADGROUP GROUP vs-workers <worker-id> COUNT 10 BLOCK 5000 STREAMS vs:emails:incoming >
→ process message
→ on success: XACK vs:emails:incoming vs-workers <message-id>
→ on failure: retry (up to 3×); on exhaustion: XADD vs:dlq ...
Multiple worker replicas can run concurrently: within the consumer group, Redis delivers each message to exactly one consumer at a time. Messages left pending by a crashed worker are reclaimed by another worker (for example with XAUTOCLAIM) once they have been idle longer than a configurable timeout.
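
The ack / retry / dead-letter flow can be sketched without a live Redis instance. This is a minimal illustration, not the project's worker code: `process_with_retry`, `ack`, and `dead_letter` are hypothetical names standing in for the XACK and XADD calls shown above, and the sketch assumes "up to 3×" means three attempts in total and that the original entry is acknowledged once it has been parked in the DLQ.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: "retry (up to 3x)" read as three attempts in total


def process_with_retry(
    message_id: str,
    payload: dict,
    handler: Callable[[dict], None],
    ack: Callable[[str], None],
    dead_letter: Callable[[str, dict, str], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> bool:
    """Run the handler; ack on success, park the job in the DLQ on exhaustion."""
    last_error = ""
    for _ in range(max_attempts):
        try:
            handler(payload)
        except Exception as exc:  # a real worker would likely narrow this
            last_error = repr(exc)
        else:
            ack(message_id)  # stands in for XACK <stream> vs-workers <message-id>
            return True
    dead_letter(message_id, payload, last_error)  # stands in for XADD vs:dlq ...
    ack(message_id)  # assumption: ack the original entry after dead-lettering
    return False
```

Passing `ack` and `dead_letter` as callables keeps the retry policy testable in isolation; in a real worker they would wrap the Redis client.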
APScheduler runs in its own process. Its only job is to publish work items into streams on schedule. It never does the work itself. This means the scheduler is lightweight, stateless, and easily restartable — the actual processing always happens in workers through the same code paths.
The Triage Agent is the brain of VendorSync. It is a stable interface over an LLM call — a class that takes an email and returns a fully structured decision. All the complexity of prompt construction, LLM provider routing, response parsing, and validation lives inside it. The rest of the codebase calls triage_agent.decide(email) and gets a TriageDecision back.
    class TriageAgent:
        async def decide(self, email: SourceEmail) -> TriageDecision:
            # 1. Load all active rules from DB (filtered by vendor if known)
            # 2. Load all active applications with their descriptions
            # 3. Construct a structured prompt with the email + context
            # 4. Call LLM (via LiteLLM, provider-agnostic)
            # 5. Parse and validate the JSON response
            # 6. Apply post-LLM severity guardrails (deterministic)
            # 7. Return TriageDecision
            ...

    {
      "vendor_id": "uuid-of-matched-vendor",
      "topic": "URL endpoint change",
      "effective_date": "2027-04-20",
      "affected_application_ids": ["uuid-1", "uuid-2"],
      "severity": "high",
      "owner_team": "Payments",
      "sla_days": 14,
      "alert_channel": "#payments-changes",
      "matched_rule_ids": ["uuid-of-matched-rule"],
      "reasoning": "PaymentService directly calls the Visa authorization endpoint per its description and is critically affected. AccountUpdate uses Visa schemas and is included as a precaution since it likely uses the same base URL infrastructure. Billing is excluded because its description mentions billing operations, not authorization flows."
    }

The LLM receives a single structured prompt containing:
- The raw email (subject + body + attachments as multimodal inputs)
- The full list of vendor names VendorSync tracks
- Every active rule (natural language, with severity / SLA / owner / instruction text)
- Every active application (name + plain-language description + owner team)
No keyword pre-filtering is done before the LLM call. The LLM does the matching. This is intentional — keyword-based pre-filtering introduces false negatives; the LLM is better at semantic matching.
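
Step 5 of the agent (parse and validate the JSON response) could look roughly like the following. This is a sketch under stated assumptions: the project uses Pydantic v2 for schemas, but a plain dataclass is used here to stay standard-library only; `parse_triage_decision` and `ALLOWED_SEVERITIES` are hypothetical names, and `effective_date` is assumed to always be present.

```python
import json
from dataclasses import dataclass
from datetime import date

ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass(frozen=True)
class TriageDecision:
    vendor_id: str
    topic: str
    effective_date: date
    affected_application_ids: list
    severity: str
    owner_team: str
    sla_days: int
    alert_channel: str
    matched_rule_ids: list
    reasoning: str


def parse_triage_decision(raw: str) -> TriageDecision:
    """Parse the LLM's JSON text and reject values outside the expected shape."""
    data = json.loads(raw)
    if data["severity"] not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {data['severity']!r}")
    if not isinstance(data["sla_days"], int) or data["sla_days"] <= 0:
        raise ValueError("sla_days must be a positive integer")
    return TriageDecision(
        vendor_id=data["vendor_id"],
        topic=data["topic"],
        effective_date=date.fromisoformat(data["effective_date"]),
        affected_application_ids=list(data["affected_application_ids"]),
        severity=data["severity"],
        owner_team=data["owner_team"],
        sla_days=data["sla_days"],
        alert_channel=data["alert_channel"],
        matched_rule_ids=list(data["matched_rule_ids"]),
        reasoning=data["reasoning"],
    )
```

Rejecting malformed output at this boundary means the rest of the pipeline only ever sees a fully typed decision.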
After the LLM responds, a deterministic check runs before the TriageDecision is returned:
If the vendor is tier_1 AND the change is classified as breaking AND the LLM assigned low or medium severity → override to high.
This override is written to the audit log with the original LLM severity preserved in llm_reasoning. Because the check runs as deterministic code after the LLM responds, prompt injection in an email cannot switch it off.
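
A minimal sketch of the guardrail rule as a pure function. The names `GuardrailResult` and `apply_severity_guardrail` are hypothetical, and the breaking-change flag is assumed to arrive as a boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailResult:
    severity: str       # severity after the guardrail
    overridden: bool    # True when the LLM value was replaced
    llm_severity: str   # original LLM value, kept for the audit log


def apply_severity_guardrail(vendor_tier: str, is_breaking: bool, llm_severity: str) -> GuardrailResult:
    """tier_1 + breaking + low/medium is raised to high; everything else passes through."""
    if vendor_tier == "tier_1" and is_breaking and llm_severity in {"low", "medium"}:
        return GuardrailResult("high", True, llm_severity)
    return GuardrailResult(llm_severity, False, llm_severity)
```

Note that the rule only raises severity: a critical or high assignment from the LLM is returned unchanged.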
The quality of triage directly depends on how well applications are described. The description field on each application is fed verbatim to the LLM:
"API-based application that exchanges Visa schema files for credit card status updates, polls Reuters rates feed every 2 hours, processes payments in real-time. Calls the Visa authorization v2 endpoint on every transaction."
Rich descriptions = accurate routing. Sparse descriptions = false negatives.
Every vendor change that passes triage becomes a VendorSync ticket — the single source of truth for that change's progress. Tickets have a strictly enforced state machine.
┌─────────────┐
│ open │ ← created by system on email triage
└──────┬──────┘
│ any linked Jira issue → "In Progress"
│ OR owner manually transitions in VS UI
▼
┌─────────────┐
│ in_progress │
└──────┬──────┘
│ ALL linked Jira issues → "Done"
│ AND all non-Jira apps → non_jira_status = complete
▼
┌─────────────┐
│ resolved │ ← awaiting change manager confirmation
└──────┬──────┘
│ change manager clicks "Confirm closure" in VS UI
▼
┌─────────────┐
│ closed │ ← terminal
└─────────────┘
Any non-closed state + effective_date passed → breached (terminal, except admin override to closed)
Any non-closed state + admin force-close → closed (with mandatory reason)
| State | Meaning |
|---|---|
| open | Ticket created, no work started; all Jira issues in "To Do" |
| in_progress | At least one Jira issue has moved to "In Progress" |
| resolved | All Jira issues Done + all non-Jira apps marked complete; awaiting human confirmation |
| closed | Change manager confirmed; change is handled |
| breached | Effective date passed without closure; SLA violated |
For tickets with linked Jira issues, state transitions from open → in_progress → resolved happen automatically based on Jira sync results. Humans do not manually move these states. Only resolved → closed requires a human action.
For apps configured with creates_jira_ticket = false, owners use VS UI buttons to update non_jira_status on their app's ticket row. The combined evaluation (all Jira issues Done + all non-Jira apps complete) triggers resolution.
The change manager who confirmed the last state transition on a ticket cannot also confirm its closure. The backend enforces:
    if closed_by == ticket.last_transitioned_by:
        raise HTTPException(403, "The user who last transitioned this ticket cannot also confirm closure")

Admin role can self-close, but the bypass is flagged in the audit log.
Forbidden and restricted transitions:
- closed → anything: terminal, cannot be reopened; create a new ticket referencing it
- breached → closed: requires admin manual override with a documented reason
- Skipping states (e.g. open → resolved) is forbidden. If Jira sync finds all issues Done on a still-open ticket, two separate transitions are written to the audit log atomically: open → in_progress → resolved.
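
The automatic part of the state machine, including the two-step write when a still-open ticket is found fully Done, can be sketched as a pure function. `evaluate_transitions` is a hypothetical name, and the sketch assumes that a Jira issue already in the done category counts as "work started".

```python
def evaluate_transitions(current, jira_categories, non_jira_statuses):
    """Return the ordered (from, to) transitions implied by the latest sync."""
    if current not in {"open", "in_progress"}:
        return []  # resolved, closed and breached are not moved by sync
    has_work = bool(jira_categories or non_jira_statuses)
    all_done = (
        has_work
        and all(c == "done" for c in jira_categories)
        and all(s == "complete" for s in non_jira_statuses)
    )
    # Assumption: an issue that is In Progress or already Done means work has started.
    started = all_done or any(c in {"in_progress", "done"} for c in jira_categories)

    transitions = []
    if current == "open" and started:
        transitions.append(("open", "in_progress"))
        current = "in_progress"
    if current == "in_progress" and all_done:
        transitions.append(("in_progress", "resolved"))
    return transitions
```

Returning a list of single-step transitions keeps the "no skipped states" rule visible: the caller writes one audit entry per tuple inside the same database transaction.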
VendorSync's scheduler checks all open tickets every 15 minutes. For each ticket with an effective_date, it calculates days remaining and fires precisely-targeted escalations.
| Threshold | Actions triggered |
|---|---|
| D-7 (7 days before deadline) | Comment on all linked Jira issues · Slack DM to ticket owner · Email to owner |
| D-3 (3 days before deadline) | Comment on all linked Jira issues · Slack channel ping · Email to manager |
| D-1 (1 day before deadline) | Comment on all linked Jira issues · War room Slack channel alert · PagerDuty on-call page |
| D-0 (breached) | Ticket status → breached · Slack + email alert to admin + senior management |
Each threshold fires at most once per ticket — enforced via notification_log with a deduplication index on (ticket_id, channel, metadata.escalation_threshold). There is no double-paging.
Escalations stop if the ticket reaches resolved or closed before the threshold fires.
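
The threshold check the deadline monitor performs can be sketched as follows. `due_escalation` is a hypothetical name; the sketch assumes that when several thresholds have already been passed (for example after scheduler downtime), only the tightest one fires, and that `already_fired` is populated from notification_log.

```python
from datetime import date
from typing import Optional, Set

THRESHOLDS_DAYS = (7, 3, 1)  # default escalation thresholds (D-7, D-3, D-1)


def due_escalation(effective_date: date, today: date, status: str, already_fired: Set[int]) -> Optional[int]:
    """Return the threshold (in days) to fire now, or None if nothing is due."""
    if status in {"resolved", "closed", "breached"}:
        return None  # escalations stop once the ticket leaves the active states
    days_left = (effective_date - today).days
    if days_left <= 0:
        return None  # D-0 is handled separately as a breach
    reached = [t for t in THRESHOLDS_DAYS if days_left <= t]
    if not reached:
        return None
    threshold = min(reached)  # assumption: only the tightest reached threshold fires
    return None if threshold in already_fired else threshold
```

For the example deadline of April 20, 2027, a check on April 13 returns 7, and a check on April 18 returns 3 unless D-3 has already been recorded.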
🔔 VendorSync — D-7 reminder
Vendor deadline approaching: April 20, 2027 (in 7 days)
This Jira issue is linked to VendorSync ticket VS-2847.
Vendor change: Visa — URL endpoint change
If this change has been completed, please move this Jira issue to Done.
VendorSync ticket: https://vendorsync.company.com/tickets/VS-2847
If no rule matches a vendor change, SLA is determined by the vendor's criticality tier:
| Vendor tier | Default SLA |
|---|---|
| tier_1 | 14 days |
| tier_2 | 30 days |
| tier_3 | 60 days |
Configurable in system_settings.defaults.sla_days_by_tier.
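
A small sketch of the fallback, assuming a matched rule's SLA (when present) takes precedence over the tier default. `resolve_sla_days` is a hypothetical helper; the default mapping mirrors the table above.

```python
from typing import Mapping, Optional

DEFAULT_SLA_DAYS_BY_TIER = {"tier_1": 14, "tier_2": 30, "tier_3": 60}


def resolve_sla_days(
    rule_sla_days: Optional[int],
    vendor_tier: str,
    defaults: Mapping[str, int] = DEFAULT_SLA_DAYS_BY_TIER,
) -> int:
    """Use the matched rule's SLA if there is one, otherwise the vendor tier default."""
    if rule_sla_days is not None:
        return rule_sla_days
    return defaults[vendor_tier]
```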
VendorSync talks to Jira via direct async HTTP calls (httpx) to the Jira REST API. No SDKs, no proxies. Both Jira Cloud (API v3) and Jira Server/Data Center (API v2) are supported — configured per-instance via jira_config in the database.
When a ticket is triaged and linked applications have Jira enabled, one issue is created per application:
Summary: Visa — URL endpoint change [VS-2847]
Project: PAY (from applications.jira_project_key)
Type: Task (from applications.jira_issue_type)
Due date: 2027-04-20
Assignee: <from applications.jira_default_assignee if set>
Labels: ["vendorsync", "vendor:visa", <app.jira_labels...>]
Description:
═══════════════════════════════════════════════════════
VENDOR CHANGE — Visa URL endpoint update
═══════════════════════════════════════════════════════
Vendor: Visa
Change type: URL endpoint change
Effective date: April 20, 2027 (D-180)
Severity: High
SLA: 14 days
Affected application: PaymentService (Payments team)
Summary: [LLM-generated 2-3 sentence summary]
VendorSync routing reasoning: [LLM reasoning text]
VendorSync ticket: VS-2847
Direct link: https://vendorsync.company.com/tickets/VS-2847
═══════════════════════════════════════════════════════
The engineer working the Jira issue has everything they need without opening VendorSync. The original email is accessible via the VendorSync link if needed.
VendorSync never queries Jira issue-by-issue. The sync is batched:
- Pull all open jira_issue_key values in one SQL query
- Chunk into batches of 100
- For each batch: one JQL search request, key in (PAY-9421, PAY-9422, ...)
- Diff stored status against Jira status category
- Update ticket_jira_links, write audit log entries
- Re-evaluate parent VS ticket status
Because each search request covers up to 100 issues, batch sync needs roughly 100× fewer API calls than per-issue queries and uses far less of the Jira API quota.
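
The chunking and JQL construction steps can be sketched like this. `chunked` and `build_jql` are hypothetical helpers; the key pattern check is an added assumption, included so that only well-formed issue keys are interpolated into the query.

```python
import re
from typing import Iterator, List

BATCH_SIZE = 100
# Assumption: keys look like PROJECT-123 (uppercase project key, numeric suffix).
_ISSUE_KEY = re.compile(r"^[A-Z][A-Z0-9_]*-\d+$")


def chunked(keys: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` keys."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def build_jql(keys: List[str]) -> str:
    """Build a single 'key in (...)' clause for one batch."""
    for key in keys:
        if not _ISSUE_KEY.match(key):
            raise ValueError(f"not a Jira issue key: {key!r}")
    return f"key in ({', '.join(keys)})"
```

With 250 open links this yields three search requests (100, 100, and 50 keys) instead of 250 individual lookups.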
VendorSync uses Jira's status category (standardized across all Jira projects) rather than exact status names (which vary by workflow):
| Jira status category | VendorSync jira_status_category |
|---|---|
| To Do / Open / Backlog | to_do |
| In Progress / In Review / etc. | in_progress |
| Done / Closed / Resolved / etc. | done |
| Anything else | unknown |
The exact Jira status name is preserved in jira_status for display in the UI.
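
A sketch of the mapping applied to one issue from a search response. It assumes the Jira REST payload exposes the category under `fields.status.statusCategory.key` with the keys `new`, `indeterminate`, and `done`; `map_status_category` is a hypothetical helper name.

```python
from typing import Tuple

# Jira status category key -> VendorSync jira_status_category
_CATEGORY_MAP = {"new": "to_do", "indeterminate": "in_progress", "done": "done"}


def map_status_category(issue: dict) -> Tuple[str, str]:
    """Return (jira_status, jira_status_category) for one issue payload."""
    status = issue.get("fields", {}).get("status", {})
    category_key = status.get("statusCategory", {}).get("key", "")
    return status.get("name", ""), _CATEGORY_MAP.get(category_key, "unknown")
```

Falling back to `unknown` for any unrecognized or missing category keeps the sync from guessing a state it cannot verify.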
| Error | Action |
|---|---|
| 400 bad payload | Log full error, alert admin; likely a misconfigured project key or issue type |
| 401 unauthorized | Disable Jira config, urgent admin alert |
| 403 forbidden | Log, admin alert; missing "Create Issues" permission |
| 404 project not found | Log, admin alert; jira_project_key is wrong |
| 5xx / network timeout | Retry × 3 with exponential backoff |
| Exhausted retries | Mark sync_status = orphan, background retry every 15 min |
Jira creation failures never block ticket creation. VendorSync tickets exist and are tracked regardless of Jira state. Orphaned links are surfaced in the admin view.
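
The error table can be sketched as a small classifier plus a backoff schedule. The function names and action labels are hypothetical, a `None` status code stands in for a network timeout, and the one-second base delay is an assumption (the source only specifies three retries with exponential backoff).

```python
from typing import List, Optional


def classify_jira_error(status_code: Optional[int]) -> str:
    """Map an HTTP status (None = network timeout) to the action from the table."""
    if status_code is None or status_code >= 500:
        return "retry"
    if status_code == 401:
        return "disable_config_and_alert"
    # 400, 403, 404 are configuration problems; assumption: other 4xx are treated the same.
    return "log_and_alert"


def backoff_delays(retries: int = 3, base_seconds: float = 1.0) -> List[float]:
    """Delays before each retry, doubling every time."""
    return [base_seconds * 2 ** attempt for attempt in range(retries)]
```

Only transient failures are retried; configuration errors go straight to an admin alert because repeating the request would fail the same way.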
The Jira account used by VendorSync requires:
- Browse Projects on all target projects
- Create Issues on all target projects
- Add Comments on all target projects
VendorSync intentionally does not transition Jira issue statuses, delete issues, or touch issues it didn't create.
Notifications are dispatched asynchronously through the vs:tickets:notify stream, after ticket creation. Severity determines the fanout:
| Severity | Slack channel | Slack DM to owner | Email | PagerDuty |
|---|---|---|---|---|
| critical | ✓ | ✓ | ✓ | ✓ |
| high | ✓ | ✓ | ✓ | — |
| medium | ✓ | — | ✓ | — |
| low | ✓ | — | — | — |
Every notification dispatched is recorded in notification_log. If a notification channel fails (Slack down, SMTP unreachable), the failure is logged, the ticket creation still completes, and retry happens via DLQ replay — notifications never block the primary pipeline.
VendorSync uses Okta as the identity provider. The backend validates JWTs with authlib; the frontend uses Auth.js v5 (NextAuth) with the Okta provider.
User opens VendorSync
→ frontend redirects to Okta login
→ user authenticates (MFA, SSO, etc.)
→ Okta redirects back with auth code
→ Auth.js exchanges code for tokens
→ frontend sends Okta access token in Authorization header
→ backend validates JWT via Okta's JWKS endpoint
→ backend looks up role in users table
→ request proceeds with full user context
Okta identifies who the user is. VendorSync's users table controls what they can do. Role changes don't require Okta admin involvement.
| Role | What they can do |
|---|---|
| admin | Everything: users, config, system settings, manual override close |
| change_manager | Manage rules and registry, confirm closures, manual override |
| owner | Work tickets for their team, mark non-Jira app progress |
| viewer | Read-only: dashboards, ticket lists, audit log |
| Action | Admin | Change Manager | Owner | Viewer |
|---|---|---|---|---|
| View all tickets | ✓ | ✓ | ✓ | ✓ |
| Mark non-Jira app progress (own team) | ✓ | ✓ | ✓ | — |
| Trigger manual Jira sync | ✓ | ✓ | ✓ | — |
| Confirm ticket closure | ✓ | ✓ | — | — |
| Manual override close | ✓ | ✓ | — | — |
| Edit rules | ✓ | ✓ | — | — |
| Edit application registry | ✓ | ✓ | — | — |
| Edit vendors | ✓ | — | — | — |
| Edit email sources / LLM / Jira config | ✓ | — | — | — |
| Manage users and roles | ✓ | — | — | — |
| View audit log | ✓ | ✓ | ✓ | ✓ |
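
The matrix above translates naturally into a lookup table. This is a sketch, not the project's middleware: the action identifiers and `is_allowed` are hypothetical, and the "own team" restriction on non-Jira progress is not modeled here.

```python
ALL_ROLES = {"admin", "change_manager", "owner", "viewer"}
MANAGERS = {"admin", "change_manager"}

# action -> roles allowed to perform it (mirrors the permission matrix)
PERMISSIONS = {
    "view_tickets": ALL_ROLES,
    "mark_non_jira_progress": {"admin", "change_manager", "owner"},
    "trigger_jira_sync": {"admin", "change_manager", "owner"},
    "confirm_closure": MANAGERS,
    "manual_override_close": MANAGERS,
    "edit_rules": MANAGERS,
    "edit_application_registry": MANAGERS,
    "edit_vendors": {"admin"},
    "edit_integration_config": {"admin"},
    "manage_users": {"admin"},
    "view_audit_log": ALL_ROLES,
}


def is_allowed(role: str, action: str) -> bool:
    """Unknown actions and unknown roles are denied by default."""
    return role in PERMISSIONS.get(action, set())
```

Denying by default means a newly added action stays locked until it is explicitly listed.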
When an unknown Okta user hits the system for the first time, VendorSync auto-provisions them with role = viewer. Existing admins are notified via Slack/email: "New user X just logged in — set their role."
Auto-provisioning can be disabled (system_settings.auth.auto_provision = false), requiring admins to pre-create accounts.
On first run, no admins exist. Set the BOOTSTRAP_ADMINS environment variable to comma-separated emails:
    BOOTSTRAP_ADMINS=alice@company.com,bob@company.com

When those users log in for the first time, they receive the admin role automatically. The variable is idempotent: re-applying it will not downgrade existing admins.
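
First-login role assignment, combining auto-provisioning and the bootstrap list, might be sketched as below. `parse_bootstrap_admins` and `role_on_login` are hypothetical helpers; case-insensitive email matching is an assumption, and a return value of `None` represents a rejected login when auto-provisioning is disabled.

```python
from typing import Optional, Set


def parse_bootstrap_admins(raw: str) -> Set[str]:
    """Split the comma-separated BOOTSTRAP_ADMINS value into normalized emails."""
    return {part.strip().lower() for part in raw.split(",") if part.strip()}


def role_on_login(
    email: str,
    existing_role: Optional[str],
    bootstrap_admins: Set[str],
    auto_provision: bool = True,
    default_role: str = "viewer",
) -> Optional[str]:
    """Decide the role for a login; existing users always keep their current role."""
    if existing_role is not None:
        return existing_role  # never changes (or downgrades) an existing user
    if email.strip().lower() in bootstrap_admins:
        return "admin"
    if not auto_provision:
        return None  # account must be pre-created by an admin
    return default_role
```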
VendorSync's database is PostgreSQL 17, managed via Alembic migrations, accessed through SQLAlchemy 2.x async ORM. All tables use UUID primary keys and UTC timestamps.
| Table | Purpose |
|---|---|
| users | Okta-authenticated users with internal roles |
| vendors | External vendors that send change notifications |
| applications | Internal applications, with Jira routing config and owner info |
| application_vendors | Many-to-many: which apps integrate with which vendors |
| rules | Natural-language routing rules consumed by the Triage Agent |
| tickets | The core entity: one per vendor change, with full lifecycle |
| ticket_applications | Many-to-many: which apps are affected by each ticket |
| ticket_jira_links | One row per Jira issue linked to a ticket |
| ticket_notes | User-authored notes on tickets |
| source_emails | Ingested emails with classification metadata |
| email_attachments | Attachment metadata and storage paths |
| audit_log | Immutable append-only record of every state change |
| notification_log | Record of every notification dispatched |

| Table | Purpose |
|---|---|
| email_sources | Configured mailboxes (IMAP / Exchange / Gmail) |
| llm_config | LLM provider config: primary + fallback, with encrypted API keys |
| jira_config | Jira instance config: base URL, API version, auth token |
| system_settings | Key/value store for all system-wide defaults |
Tickets are assigned human-readable IDs (VS-0001, VS-0002, …) via a dedicated Postgres sequence ticket_number_seq. The application reads nextval('ticket_number_seq') at insert time and formats it as VS- + zero-padded integer. Sequence values are never manually assigned.
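
The formatting step is small enough to show in full. `format_ticket_id` is a hypothetical helper, and the four-digit minimum width is inferred from the VS-0001 example.

```python
def format_ticket_id(sequence_value: int, width: int = 4) -> str:
    """Format a sequence value as VS-NNNN, growing past four digits when needed."""
    if sequence_value < 1:
        raise ValueError("sequence values start at 1")
    return f"VS-{sequence_value:0{width}d}"
```

Zero-padding sets only a minimum width, so ticket 12345 becomes VS-12345 rather than being truncated.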
Sensitive fields (email_sources.password_encrypted, llm_config.api_key_encrypted, jira_config.api_token_encrypted) are encrypted with Fernet symmetric encryption using VENDORSYNC_SECRET_KEY from the environment. Raw values are never written to the database. *_encrypted fields are never exposed in API read schemas.
Nothing is ever hard-deleted. All entities use is_active = false flags. Queries filter on is_active = true by default. Historical records are always queryable for audit purposes.
Beyond primary keys, the schema carries targeted indexes for the hottest query paths:
- tickets (status, effective_date): deadline monitor
- tickets (vendor_id, created_at): vendor history views
- source_emails (processing_status, received_at): triage queue
- audit_log (entity_type, entity_id, created_at): per-entity history
- notification_log: expression index on metadata->>'escalation_threshold' for escalation deduplication
- ticket_jira_links (jira_issue_key): Jira key → VS ticket lookup
- rules (vendor_id, is_active): rule loading in triage
Every technology choice in VendorSync is deliberate. The stack is designed for developer ergonomics, operational simplicity, and correctness — not novelty.
| Component | Choice | Why |
|---|---|---|
| Language | Python 3.12+ | Ecosystem depth for LLM, async I/O, data tooling |
| Framework | FastAPI | Native async, automatic OpenAPI docs, Pydantic integration |
| ASGI server | Uvicorn (dev) / Gunicorn + Uvicorn workers (prod) | Production-proven, minimal |
| ORM | SQLAlchemy 2.x async | Industry standard, excellent async support |
| Migrations | Alembic | Pairs with SQLAlchemy, reliable version tracking |
| Job queue | Redis Streams via redis-py async | No additional infrastructure, persistent, debuggable |
| Scheduler | APScheduler | Lightweight in-process, no separate beat service |
| Validation | Pydantic v2 | Fast, type-safe, shared schemas with FastAPI |
| Package manager | uv | 10–100× faster than pip, deterministic lock file |
| Component | Choice | Why |
|---|---|---|
| Abstraction | LiteLLM | Single API across all providers — switch providers via config with zero code changes |
| Default provider | Anthropic Claude | Best instruction following; multimodal for PDF/image attachments |
| Fallback providers | OpenAI, Google Gemini, Azure OpenAI | Configurable at runtime |
| Document parsing | Delegated to LLM | No separate OCR library needed — Claude Sonnet, GPT-4o, Gemini all handle images and PDFs natively |
| Component | Choice | Why |
|---|---|---|
| Framework | Next.js (App Router) | File-based routing, server components, production-ready |
| Language | TypeScript | End-to-end type safety |
| Styling | Tailwind CSS | Utility-first, co-located with markup |
| Components | shadcn/ui | Accessible primitives, fully owned source code |
| Icons | lucide-react | Consistent, tree-shakeable |
| Server state | TanStack Query | Best-in-class caching, background refetch, stale-while-revalidate |
| Forms | react-hook-form + zod | Type-safe validation, reusable schemas |
| Auth | Auth.js v5 (NextAuth) with Okta provider | Full App Router compatibility — server components, middleware, server actions |
| Component | Choice | Why |
|---|---|---|
| Reverse proxy | Traefik v3 | Docker label service discovery, built-in Let's Encrypt, zero config for adding services |
| Database | PostgreSQL 17 | ACID, JSONB, sequences, GIN indexes, industry standard |
| Cache / queue | Redis 7 | Streams + pub/sub + caching in one service |
| Container runtime | Docker Compose | One docker compose up brings up the full stack |
| Protocol | Library |
|---|---|
| IMAP | aioimaplib (async) |
| Microsoft Exchange | exchangelib |
| Gmail API | google-api-python-client |
| POP3 | Not supported — not in the protocol enum |
| Channel | Library / API |
|---|---|
| Slack | slack-sdk Python |
| Email | smtplib (or SendGrid SDK) |
| PagerDuty | Events API v2 (direct HTTP POST) |
vendorsync/
├── backend/
│ ├── app/
│ │ ├── api/ # FastAPI routes (one file per resource)
│ │ ├── models/ # SQLAlchemy ORM models
│ │ ├── schemas/ # Pydantic v2 request/response schemas
│ │ ├── services/ # Business logic (pure Python)
│ │ ├── agents/ # Triage Agent (and future agents)
│ │ ├── workers/ # Redis Streams consumer functions
│ │ ├── scheduler/ # APScheduler periodic job definitions
│ │ ├── llm/ # LiteLLM wrapper, prompt templates
│ │ ├── ingestion/ # Email fetch and attachment handling
│ │ ├── jira/ # Jira REST client and sync logic
│ │ ├── auth/ # Okta JWT validation, role decorators
│ │ ├── core/ # Config, DB session, Redis client
│ │ └── main.py # FastAPI app entry point
│ ├── tests/
│ ├── alembic/ # Database migration scripts
│ ├── pyproject.toml # uv project manifest
│ ├── uv.lock # Deterministic dependency lock
│ └── Dockerfile # Single image for backend + worker + scheduler
├── frontend/
│ ├── src/
│ │ ├── app/ # Next.js App Router pages
│ │ │ ├── layout.tsx # Root layout (sidebar + header)
│ │ │ ├── page.tsx # Dashboard
│ │ │ ├── tickets/[id]/ # Ticket detail
│ │ │ ├── vendors/ # Vendor registry
│ │ │ ├── applications/ # Application registry
│ │ │ ├── rules/ # Rules admin
│ │ │ ├── settings/ # Email / LLM / Jira / users config
│ │ │ ├── audit/ # Audit log viewer
│ │ │ └── docs/ # In-app documentation tab
│ │ ├── components/
│ │ │ ├── ui/ # shadcn/ui primitives
│ │ │ ├── layout/ # Sidebar, Header
│ │ │ ├── tickets/ # Ticket-specific components
│ │ │ └── shared/ # Badges, cards, sync buttons
│ │ ├── hooks/ # TanStack Query hooks, WS hooks
│ │ ├── lib/
│ │ │ ├── api/ # Typed API client functions
│ │ │ ├── auth/ # Auth.js helpers
│ │ │ ├── time.ts # formatDateTime, formatDate, etc.
│ │ │ └── types/ # Shared TypeScript types
│ │ └── styles/
│ │ └── globals.css
│ ├── package.json
│ ├── tsconfig.json
│ └── Dockerfile
├── traefik/
│ └── traefik.yml # Traefik static config
├── docker-compose.yml
├── docker-compose.server.yml # Production server overrides
├── .env.example # All required env vars with comments
├── .gitattributes # LF line endings enforced for all text files
└── docs/ # Architecture, data model, business rules, etc.
All secrets and environment configuration are in .env (never committed — see .env.example).
# Database
DATABASE_URL=postgresql+asyncpg://vendorsync:password@postgres:5432/vendorsync
# Redis
REDIS_URL=redis://redis:6379/0
# Encryption key (Fernet) — used for all encrypted DB fields
VENDORSYNC_SECRET_KEY=<base64-encoded 32-byte key>
# Okta OIDC (backend)
OKTA_DOMAIN=company.okta.com
OKTA_CLIENT_ID=...
OKTA_CLIENT_SECRET=...
OKTA_AUDIENCE=api://vendorsync
OKTA_ISSUER=https://company.okta.com/oauth2/default
# Auth.js (frontend)
AUTH_SECRET=<random 32-byte secret>
NEXTAUTH_URL=https://vendorsync.company.com
# Bootstrap admin (first run only)
BOOTSTRAP_ADMINS=alice@company.com,bob@company.com

| Key | Default | Description |
|---|---|---|
| defaults.timezone | "UTC" | Timezone for effective_date interpretation |
| defaults.working_hours | {"start":"09:00","end":"18:00"} | Business hours for escalation scheduling |
| escalation.thresholds_days | [7, 3, 1] | Days-before-deadline escalation triggers |
| defaults.sla_days_by_tier | {"tier_1":14,"tier_2":30,"tier_3":60} | Fallback SLA by vendor tier |
| audit.retention_days | 1825 (5 years) | How long to retain audit log rows |
| auth.auto_provision | true | Auto-create users on first Okta login |
| auth.default_role | "viewer" | Default role for auto-provisioned users |
- Docker Desktop with the Linux engine
- Git configured with core.autocrlf = input (Windows only; prevents CRLF contamination)
# 1. Clone the repo
git clone <repo-url>
cd vendorsync
# 2. Copy environment template
cp .env.example .env
# Edit .env — fill in OKTA_*, VENDORSYNC_SECRET_KEY, AUTH_SECRET at minimum
# 3. Build and start all services
docker compose up --build
# 4. Run database migrations (first time only)
docker compose exec backend alembic upgrade head
# 5. Open the app
# http://localhost — frontend (Traefik routes it)
# http://localhost/api/docs — FastAPI Swagger UI
# http://localhost:9081/dashboard — Traefik dashboard

docker compose up # Start all services (no rebuild)
docker compose up --build # Rebuild and start (after Dockerfile or dependency changes)
docker compose down # Stop all services, keep volumes

Never use `docker compose down -v` — this destroys `vendorsync_postgres_data` and `vendorsync_redis_data` permanently. Always stop without `-v`.
All source code is volume-mounted into containers. Changes to Python or TypeScript files are live immediately — no rebuild needed:
- Backend: Uvicorn `--reload` watches `/app/app` for `.py` changes
- Frontend: Next.js Turbopack watches `/app/src` for `.ts`/`.tsx` changes
# Apply pending migrations
docker compose exec backend alembic upgrade head
# Create a new migration after model changes
docker compose exec backend alembic revision --autogenerate -m "describe the change"

docker compose logs -f backend
docker compose logs -f worker
docker compose logs -f scheduler
docker compose logs -f frontend

| Severity | Triggered by |
|---|---|
| `critical` | Breaking change with deadline < 30 days, vendor tier_1, or affects payment-critical apps |
| `high` | Breaking change with deadline 30–90 days, vendor tier_2, or affects multiple critical apps |
| `medium` | Non-breaking change with deadline > 90 days, or single non-critical app |
| `low` | Informational, optional updates, vendor tier_3 |

| Tier | Meaning | Default SLA |
|---|---|---|
| `tier_1` | Mission-critical vendors — payment networks, core infrastructure | 14 days |
| `tier_2` | Important integrations — significant operational impact if broken | 30 days |
| `tier_3` | Non-critical vendors — informational or optional integrations | 60 days |

Force-closing a ticket without normal resolution requires:
- `admin` or `change_manager` role
- `close_reason` text of at least 20 characters
- `is_manual_override = true` set on the ticket
- Audit log entry flagged with override context
- Override tickets excluded from SLA compliance metrics
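The override requirements lend themselves to a small guard function. The following is a minimal sketch, not the actual service code: the `Ticket` dataclass, `force_close`, `OverrideError`, the `"closed"` status value, and the shape of the returned audit entry are all assumptions made for illustration. Only the role names, the 20-character minimum, and the `is_manual_override` flag come from the rules listed here.

```python
from dataclasses import dataclass
from typing import Optional

OVERRIDE_ROLES = {"admin", "change_manager"}
MIN_CLOSE_REASON_CHARS = 20


class OverrideError(ValueError):
    """Raised when a force-close request violates the override rules."""


@dataclass
class Ticket:  # hypothetical, simplified stand-in for the real model
    id: int
    status: str = "open"
    is_manual_override: bool = False
    close_reason: Optional[str] = None


def force_close(ticket: Ticket, actor_role: str, close_reason: str) -> dict:
    """Apply a manual override and return the audit entry to persist."""
    if actor_role not in OVERRIDE_ROLES:
        raise OverrideError("force-close requires the admin or change_manager role")
    reason = close_reason.strip()  # assumption: surrounding whitespace does not count
    if len(reason) < MIN_CLOSE_REASON_CHARS:
        raise OverrideError(
            f"close_reason must be at least {MIN_CLOSE_REASON_CHARS} characters"
        )
    ticket.status = "closed"  # assumed status name
    ticket.is_manual_override = True
    ticket.close_reason = reason
    # Audit entry flagged with override context (field names are illustrative).
    return {
        "ticket_id": ticket.id,
        "action": "force_close",
        "override": True,
        "actor_role": actor_role,
        "reason": reason,
    }
```

SLA compliance reporting could then exclude rows where `is_manual_override` is true, matching the last rule in the list.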
When the LLM cannot match an email to any known vendor:
- `source_emails.processing_status = 'unknown'`
- No ticket is created
- Slack alert to triage operators
- Email surfaces in the VS triage view (`source_emails WHERE processing_status = 'unknown'`)
- Operator classifies manually; can optionally save the classification as a new rule for future emails
If Slack, email, or PagerDuty calls fail, the ticket and Jira creation still complete. The failure is logged to notification_log and retried via DLQ. Notification channels are fire-and-forget relative to the core pipeline.
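One way to picture this isolation is a notifier that collects channel failures instead of raising them. The sketch below is an assumption-laden illustration: the in-memory `notification_log` and `dead_letter_queue` lists stand in for the real table and DLQ, and `send_slack`, `send_email`, `notify_all`, and `process_ticket` are hypothetical names.

```python
import asyncio
import logging

logger = logging.getLogger("vendorsync.notifications")

notification_log = []   # stand-in for the notification_log table
dead_letter_queue = []  # stand-in for the DLQ that drives retries


async def send_slack(ticket_id: int) -> None:
    raise ConnectionError("Slack unreachable")  # simulated outage


async def send_email(ticket_id: int) -> None:
    return None  # simulated success


async def notify_all(ticket_id: int) -> None:
    """Send on every channel; record failures rather than raising them."""
    channels = {"slack": send_slack, "email": send_email}
    # return_exceptions=True hands failures back as values, so one broken
    # channel cannot abort the others or the caller.
    results = await asyncio.gather(
        *(send(ticket_id) for send in channels.values()),
        return_exceptions=True,
    )
    for name, result in zip(channels, results):
        if isinstance(result, Exception):
            logger.warning("notification via %s failed: %s", name, result)
            record = {"ticket_id": ticket_id, "channel": name, "error": str(result)}
            notification_log.append(record)
            dead_letter_queue.append(record)


async def process_ticket(ticket_id: int) -> str:
    # Ticket and Jira creation would run here, before any notification.
    await notify_all(ticket_id)
    return "completed"


print(asyncio.run(process_ticket(42)))  # completed
```

The pipeline reports success even though the simulated Slack call failed; the failure survives only as a log row and a DLQ entry awaiting retry.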
These principles guided every design decision in VendorSync. They are not aspirations — they are constraints actively enforced by the codebase.
Separation of concerns — API, services, agents, workers, scheduler, and models are distinct layers with one-way dependency rules. Nothing depends on api or workers.
LLM is for unstructured → structured conversion only — the LLM turns free-text emails into TriageDecision objects. Once data is structured, deterministic code takes over for everything downstream — state machines, SLA math, Jira API calls, deadline monitoring. The LLM never sees a structured decision point.
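To make the boundary concrete, the sketch below validates raw LLM output into a typed object before any deterministic code runs. The fields shown on `TriageDecision` and the `parse_triage_decision` helper are assumptions for illustration; the real schema is defined elsewhere in the codebase. Only the four severity values are taken from this README.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

SEVERITIES = {"critical", "high", "medium", "low"}


@dataclass(frozen=True)
class TriageDecision:  # illustrative fields only
    vendor: str
    severity: str
    affected_apps: Tuple[str, ...]
    deadline: Optional[date]


def parse_triage_decision(raw: str) -> TriageDecision:
    """Turn the LLM's JSON reply into a validated, immutable object."""
    data = json.loads(raw)
    severity = data["severity"]
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    deadline = data.get("deadline")
    return TriageDecision(
        vendor=data["vendor"],
        severity=severity,
        affected_apps=tuple(data.get("affected_apps", [])),
        deadline=date.fromisoformat(deadline) if deadline else None,
    )


decision = parse_triage_decision(
    '{"vendor": "Acme Payments", "severity": "high", '
    '"affected_apps": ["billing-api"], "deadline": "2025-06-30"}'
)
print(decision.severity, decision.deadline)  # high 2025-06-30
```

Everything downstream can then rely on the object's types and never has to interpret free text again.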
The VendorSync ticket is the audit truth — Jira is where engineering work happens. VendorSync is where vendor-change handling is tracked, measured, and enforced. They are complementary, not duplicates.
Rules and application descriptions are data, not code — admins edit them through the GUI at runtime. The LLM reads them fresh on every triage call. Adding a new rule requires zero deployments.
Multi-provider LLM from day one — every LLM call routes through LiteLLM. No provider is hardcoded anywhere. Switching from Claude to GPT-4o is a one-row database change.
Async everywhere — FastAPI async routes, async SQLAlchemy, async Redis, async httpx. No blocking I/O anywhere in the hot path.
Idempotent jobs — workers check for existing state before acting. Retrying a failed create_jira_issue job will not create a duplicate issue if the first attempt partially succeeded.
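The check-before-act pattern can be sketched in a few lines. This is a simplified illustration under stated assumptions: `jira_keys_by_ticket` is an in-memory stand-in for wherever the ticket stores its linked Jira key, and `create_jira_issue_job` takes the issue-creating call as a parameter so the example stays self-contained.

```python
from typing import Callable, Dict, Optional

# Stand-in for the persisted ticket -> Jira key link (hypothetical storage).
jira_keys_by_ticket: Dict[int, Optional[str]] = {101: None}


def create_jira_issue_job(ticket_id: int, create_issue: Callable[[int], str]) -> str:
    """Create the Jira issue once; a retry returns the key already stored."""
    existing = jira_keys_by_ticket.get(ticket_id)
    if existing is not None:
        return existing  # an earlier attempt got this far, so do nothing
    key = create_issue(ticket_id)
    jira_keys_by_ticket[ticket_id] = key  # persist before reporting success
    return key


created = []


def fake_create(ticket_id: int) -> str:
    created.append(ticket_id)
    return f"ENG-{len(created)}"


print(create_jira_issue_job(101, fake_create))  # ENG-1
print(create_jira_issue_job(101, fake_create))  # ENG-1 (the retry is a no-op)
```

This sketch only covers retries that happen after the key was stored. A failure between creating the issue and saving its key would need an additional lookup on the Jira side, which is not shown here.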
Single entry point — Traefik handles all external routing. Backend and frontend containers do not expose ports directly in production. Adding a new service requires only Docker labels.
Agent abstraction — LLM logic is wrapped in named agent classes (TriageAgent) with stable interfaces. The implementation can be swapped, multi-step chains can be added, A/B testing can be layered — callers never change.
Full internal documentation lives in `docs/`:

| Document | Contents |
|---|---|
| `docs/ARCHITECTURE.md` | System design, service topology, module boundaries, build sequence |
| `docs/DATA_MODEL.md` | Complete database schema with field-level documentation |
| `docs/PROCESSING_FLOW.md` | Step-by-step email processing pipeline with edge cases |
| `docs/BUSINESS_RULES.md` | Ticket lifecycle, SLA, escalation, override logic |
| `docs/JIRA_INTEGRATION.md` | Jira API usage, sync pattern, error handling, permissions |
| `docs/AUTH.md` | Okta OIDC flow, roles, permission matrix, implementation details |
| `docs/TECH_STACK.md` | Technology choices and rationale |
| `docs/UI_DESIGN.md` | Frontend design system, component library, screen inventory |

These docs are also accessible in-app at `/docs` — the documentation tab is part of the VendorSync UI itself.
VendorSync is built with FastAPI · Next.js · PostgreSQL · Redis · Traefik · LiteLLM · Okta