aviciot/VendorSync

VendorSync

Enterprise-grade vendor change management platform. VendorSync automatically ingests vendor notification emails, uses a Large Language Model to triage them against your application registry and rule set, creates tracked tickets with full SLA enforcement, and opens Jira issues for every affected engineering team — all without a human touching a keyboard.


What is VendorSync?

VendorSync is a full-stack internal platform that solves a specific, painful problem at scale: vendor change management.

Every enterprise engineering organization receives a constant stream of emails from third-party vendors — API endpoint changes, certificate rotations, protocol deprecations, breaking schema updates, deadline-driven migration notices. Each of these can break production systems if missed or mishandled. Manually routing these emails, figuring out which internal applications are affected, assigning owners, opening Jira tickets, and tracking deadlines is tedious, error-prone, and doesn't scale.

VendorSync eliminates all of that manual work. It:

  1. Watches your mailboxes — polls configured email sources (IMAP, Microsoft Exchange, Gmail API) for new vendor notifications
  2. Reads and understands the email — sends it to an LLM (Claude, GPT-4o, or Gemini) along with your entire application registry and rule set, asking it to decide: which vendor sent this, which of your apps are impacted, how severe it is, who owns it, what the deadline is
  3. Creates a tracked VendorSync ticket — a first-class record with severity, SLA, owner, and deadline
  4. Opens Jira issues automatically — one per affected application, pre-filled with full context so engineers never have to ask "what is this about"
  5. Enforces SLAs — monitors deadlines and fires escalations at D-7, D-3, D-1, opening war rooms and paging on-call engineers if needed
  6. Closes the loop — syncs Jira status back into VendorSync every 5 minutes; when all Jira issues are Done, the ticket auto-transitions to Resolved for a change manager to confirm

The entire pipeline from email arrival to Jira issues created and notifications sent targets under 20 seconds at the 95th percentile.


Why VendorSync Exists

The alternative is a spreadsheet, a shared inbox, and hope. Common failure modes VendorSync prevents:

| Failure mode | How VendorSync prevents it |
| --- | --- |
| Vendor email goes to wrong person or gets lost | Watches dedicated mailboxes, never misses |
| No one knows which apps are affected | LLM reads app descriptions and decides automatically |
| Wrong team gets paged | Rules + app registry route to the right owner every time |
| Deadline slips without anyone noticing | Scheduled deadline monitor, three escalation thresholds, breach detection |
| Jira issues created inconsistently or not at all | Automatic creation with full context on every ticket |
| No audit trail of what happened and when | Immutable audit log on every state change |
| Hard to tell which open changes are at risk | Dashboard with SLA countdowns, severity badges, breach indicators |

How It Works — End to End

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                                                                 │
│  Vendor sends   →  Mailbox  →  [Scheduler polls]  →  vs:emails:incoming        │
│  notification                                                                   │
│  email                                                                          │
│                                                                                 │
│  [Worker] fetches message, stores in source_emails, publishes process_email job │
│                                                                                 │
│  [Triage Agent]  ←── all active rules ──────────────────── DB                  │
│       │          ←── all applications + descriptions ─────  DB                 │
│       │          ←── email subject + body + attachments                        │
│       ↓                                                                         │
│  TriageDecision {                                                               │
│    vendor, topic, effective_date,                                               │
│    affected_application_ids,                                                    │
│    severity, owner_team, sla_days,                                              │
│    alert_channel, matched_rule_ids,                                             │
│    reasoning                                                                    │
│  }                                                                              │
│       ↓                                                                         │
│  [Ticket Engine]                                                                │
│    → INSERT tickets (status: open)                                              │
│    → INSERT ticket_applications (one per affected app)                          │
│    → INSERT ticket_jira_links (one per app with Jira enabled)                   │
│    → PUBLISH create_jira_issue jobs  →  vs:jira:create                         │
│    → PUBLISH notify job              →  vs:tickets:notify                      │
│                                                                                 │
│  [Worker] creates Jira issues, stores issue keys, syncs every 5 min            │
│  [Worker] sends Slack/email/PagerDuty notifications                             │
│  [Scheduler] monitors deadlines every 15 min, escalates at D-7/D-3/D-1        │
│  [Worker] syncs Jira status → auto-transitions VS ticket to Resolved           │
│  [Change manager] confirms closure in UI → ticket Closed                       │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘

Architecture Overview

VendorSync is decomposed into three Python processes sharing one codebase, backed by PostgreSQL and Redis, served through a Traefik reverse proxy, with a Next.js frontend.

                        ┌───────────────────────────────────────────┐
                        │                 Traefik                   │
                        │  /        → frontend (Next.js)            │
                        │  /api/*   → backend  (FastAPI)            │
                        └──────────────┬────────────────────────────┘
                                       │
              ┌──────────────┬─────────┴──────────┬────────────────────┐
              ▼              ▼                     ▼                    ▼
         ┌──────────┐  ┌──────────┐         ┌──────────┐        ┌───────────┐
         │ Frontend │  │ Backend  │         │  Worker  │        │ Scheduler │
         │ Next.js  │  │ FastAPI  │         │  (N     │        │APScheduler│
         │ App      │  │ HTTP API │         │ replicas)│        │ periodic  │
         │ Router   │  │          │         │ Streams  │        │ jobs      │
         └──────────┘  └────┬─────┘         └────┬─────┘        └─────┬─────┘
                            │                    │                     │
                    ┌───────┴──────┐    ┌────────┴──────────────────────┘
                    │              │    │
              ┌─────▼─────┐  ┌────▼────▼──┐
              │ PostgreSQL │  │   Redis    │
              │  primary   │  │  Streams   │
              │  database  │  │  + cache   │
              └────────────┘  └────────────┘

Module boundaries

The Python codebase is organized into strict layers. Nothing depends backwards:

| Module | Responsibility |
| --- | --- |
| backend.api | HTTP routes only: validates input, calls services, returns responses |
| backend.services | Business logic: pure Python, no FastAPI imports, reusable from both API and workers |
| backend.agents | The Triage Agent and future agent abstractions |
| backend.models | SQLAlchemy ORM models, the source of truth for the DB schema |
| backend.schemas | Pydantic v2 schemas for API input/output validation |
| backend.workers | Redis Streams consumers that call into services and agents |
| backend.scheduler | APScheduler periodic jobs that publish work items to streams |
| backend.llm | LiteLLM wrapper, prompt templates, response parsing |
| backend.ingestion | Email fetching (IMAP / Exchange / Gmail), attachment handling |
| backend.jira | Jira REST client, sync logic, comment helpers |
| backend.auth | Okta OIDC integration, JWT validation, role middleware |
| backend.core | Config, DB session, Redis client, security helpers |

Dependency rule: api, workers, and scheduler may depend on services and agents. services and agents depend on models, llm, ingestion, jira. Nothing depends on api or workers.


Service Topology

VendorSync runs as 7 Docker services in a single Compose stack:

| Service | Image | Purpose |
| --- | --- | --- |
| traefik | traefik:v3 | Reverse proxy, single external entry point, HTTPS via Let's Encrypt in production |
| backend | custom Python | FastAPI HTTP API, all REST endpoints |
| worker | custom Python (same image) | Redis Streams consumer: ingestion, LLM calls, Jira creation, notifications, sync |
| scheduler | custom Python (same image) | APScheduler: mailbox polling, Jira sync batch, deadline monitor |
| frontend | custom Node | Next.js App Router UI |
| postgres | postgres:17 | Primary relational database, all persistent state |
| redis | redis:7 | Streams job bus + caching |

backend, worker, and scheduler run the same Docker image, differentiated by the command: override in docker-compose.yml. This means a single docker compose build backend rebuilds all three.

Traefik routing

| Path | Routed to |
| --- | --- |
| / and all UI routes | Next.js frontend (port 3000 internally) |
| /api/* | FastAPI backend (port 8000 internally) |
| /health | Backend health check |
| :8080/dashboard | Traefik live dashboard (dev only) |

In production, Traefik handles TLS termination via Let's Encrypt and redirects all HTTP to HTTPS. Service discovery is via Docker labels — no static config files need editing when adding services.


Redis Streams — The Job Bus

All asynchronous work flows through Redis Streams. There is no Celery, no RabbitMQ, no separate broker — Redis is already required for caching, and Streams on top of it replace all of that at a fraction of the operational overhead.

VendorSync streams

| Stream | Jobs carried |
| --- | --- |
| vs:emails:incoming | poll_mailbox (fetch new messages from IMAP/Exchange/Gmail) |
| vs:jira:create | create_jira_issue (one job per app per ticket, with retry) |
| vs:jira:sync | sync_jira_batch (periodic) + sync_jira_single (manual) |
| vs:tickets:notify | send_notifications (Slack, email, PagerDuty fanout) |
| vs:tickets:escalate | escalate_ticket (D-7, D-3, D-1 actions) |
| vs:dlq | Dead-letter queue: exhausted jobs held for admin review |

UI event stream

| Stream | Events pushed |
| --- | --- |
| vs:ui:events | system.status: queue depths and scheduler heartbeats (every 5s) |

Worker pattern

Workers are stateless consumers in a consumer group (vs-workers). The pattern is simple and reliable:

XREADGROUP GROUP vs-workers <worker-id> COUNT 10 BLOCK 5000 STREAMS vs:emails:incoming >
  → process message
  → on success:  XACK vs:emails:incoming vs-workers <message-id>
  → on failure:  retry (up to 3×); on exhaustion: XADD vs:dlq ...

Multiple worker replicas can run concurrently. Within the consumer group, Redis delivers each message to exactly one consumer. Messages left pending by a crashed worker are reclaimed by another worker once they have been idle for longer than a configurable timeout.
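The consume, acknowledge, and dead-letter cycle above can be sketched as follows. This is a minimal illustration, not VendorSync's actual worker code: it assumes `client` is a redis-py asyncio client (`redis.asyncio.Redis`), and the function name `consume_once`, the handler signature, the extra fields written to the DLQ, the in-process retry, and the decision to acknowledge a message after dead-lettering it are all assumptions.

```python
from typing import Any, Awaitable, Callable, Mapping

STREAM = "vs:emails:incoming"
GROUP = "vs-workers"
DLQ_STREAM = "vs:dlq"
MAX_ATTEMPTS = 3  # assumption: "retry (up to 3x)" read as three attempts in total

Handler = Callable[[Mapping[Any, Any]], Awaitable[None]]


async def consume_once(client: Any, worker_id: str, handler: Handler) -> dict[str, int]:
    """Read one batch from the stream and process it; returns simple counters."""
    counts = {"processed": 0, "dead_lettered": 0}
    # ">" asks for messages not yet delivered to any consumer in the group.
    batches = await client.xreadgroup(
        GROUP, worker_id, {STREAM: ">"}, count=10, block=5000
    )
    for _stream, messages in batches or []:
        for message_id, fields in messages:
            for attempt in range(1, MAX_ATTEMPTS + 1):
                try:
                    await handler(fields)
                    counts["processed"] += 1
                    break
                except Exception as exc:
                    if attempt == MAX_ATTEMPTS:
                        # Retries exhausted: park a copy in the dead-letter stream.
                        await client.xadd(
                            DLQ_STREAM,
                            {**fields, "source_stream": STREAM, "error": str(exc)},
                        )
                        counts["dead_lettered"] += 1
            # Acknowledge in both cases so the entry leaves the pending list.
            await client.xack(STREAM, GROUP, message_id)
    return counts
```

A worker process would call `consume_once` in a loop. Because the function only needs an object with `xreadgroup`, `xack`, and `xadd` coroutines, it can be exercised against a small in-memory fake without a running Redis.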

Scheduler pattern

APScheduler runs in its own process. Its only job is to publish work items into streams on schedule. It never does the work itself. This means the scheduler is lightweight, stateless, and easily restartable — the actual processing always happens in workers through the same code paths.


The Triage Agent

The Triage Agent is the brain of VendorSync. It is a stable interface over an LLM call — a class that takes an email and returns a fully structured decision. All the complexity of prompt construction, LLM provider routing, response parsing, and validation lives inside it. The rest of the codebase calls triage_agent.decide(email) and gets a TriageDecision back.

What the agent does

class TriageAgent:
    async def decide(self, email: SourceEmail) -> TriageDecision:
        # 1. Load all active rules from DB (filtered by vendor if known)
        # 2. Load all active applications with their descriptions
        # 3. Construct a structured prompt with the email + context
        # 4. Call LLM (via LiteLLM — provider-agnostic)
        # 5. Parse and validate the JSON response
        # 6. Apply post-LLM severity guardrails (deterministic)
        # 7. Return TriageDecision
        ...

The TriageDecision output

{
  "vendor_id": "uuid-of-matched-vendor",
  "topic": "URL endpoint change",
  "effective_date": "2027-04-20",
  "affected_application_ids": ["uuid-1", "uuid-2"],
  "severity": "high",
  "owner_team": "Payments",
  "sla_days": 14,
  "alert_channel": "#payments-changes",
  "matched_rule_ids": ["uuid-of-matched-rule"],
  "reasoning": "PaymentService directly calls the Visa authorization endpoint per its description and is critically affected. AccountUpdate uses Visa schemas — included as a precaution since it likely uses the same base URL infrastructure. Billing is excluded — its description mentions billing operations, not authorization flows."
}
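
To make the shape of this payload concrete, here is a hedged sketch of parsing and validating it. The project uses Pydantic v2 for its schemas; this example deliberately sticks to the standard library so it runs without dependencies. The function name `parse_triage_decision` and the specific checks are illustrative assumptions, and the severity set is taken from the four levels used elsewhere in this README.

```python
import json
from dataclasses import dataclass
from datetime import date

SEVERITIES = {"critical", "high", "medium", "low"}


@dataclass(frozen=True)
class TriageDecision:
    vendor_id: str
    topic: str
    effective_date: date
    affected_application_ids: list[str]
    severity: str
    owner_team: str
    sla_days: int
    alert_channel: str
    matched_rule_ids: list[str]
    reasoning: str


def parse_triage_decision(raw: str) -> TriageDecision:
    """Parse the LLM's JSON reply, rejecting unknown severities and bad SLAs."""
    data = json.loads(raw)
    if data["severity"] not in SEVERITIES:
        raise ValueError(f"unknown severity: {data['severity']!r}")
    sla_days = int(data["sla_days"])
    if sla_days <= 0:
        raise ValueError("sla_days must be positive")
    return TriageDecision(
        vendor_id=data["vendor_id"],
        topic=data["topic"],
        # ISO 8601 date string such as "2027-04-20"
        effective_date=date.fromisoformat(data["effective_date"]),
        affected_application_ids=list(data["affected_application_ids"]),
        severity=data["severity"],
        owner_team=data["owner_team"],
        sla_days=sla_days,
        alert_channel=data["alert_channel"],
        matched_rule_ids=list(data["matched_rule_ids"]),
        reasoning=data["reasoning"],
    )
```

A missing key raises `KeyError` and a malformed date raises `ValueError`, so a bad LLM reply fails loudly instead of producing a half-filled ticket.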

What the LLM sees

The LLM receives a single structured prompt containing:

  • The raw email (subject + body + attachments as multimodal inputs)
  • The full list of vendor names VendorSync tracks
  • Every active rule (natural language, with severity / SLA / owner / instruction text)
  • Every active application (name + plain-language description + owner team)

No keyword pre-filtering is done before the LLM call. The LLM does the matching. This is intentional — keyword-based pre-filtering introduces false negatives; the LLM is better at semantic matching.

Severity guardrails (post-LLM, deterministic)

After the LLM responds, a deterministic check runs before the TriageDecision is returned:

If the vendor is tier_1 AND the change is classified as breaking AND the LLM assigned low or medium severity → override to high.

This override is written to the audit log with the original LLM severity preserved in llm_reasoning. Because the check runs in application code after the LLM has responded, text injected into an email cannot switch it off.
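
The rule is small enough to express as a pure function. The sketch below assumes the vendor tier and a boolean "breaking" classification are already available as plain values; the function name and return shape are hypothetical.

```python
def apply_severity_guardrail(
    vendor_tier: str, is_breaking: bool, llm_severity: str
) -> tuple[str, bool]:
    """Return (final_severity, overridden) for the deterministic post-LLM check."""
    if vendor_tier == "tier_1" and is_breaking and llm_severity in {"low", "medium"}:
        return "high", True
    # Everything else passes through unchanged, including "high" and "critical".
    return llm_severity, False
```

Returning the `overridden` flag alongside the severity lets the caller decide whether an audit entry needs to record the original LLM value.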

Application descriptions matter

The quality of triage directly depends on how well applications are described. The description field on each application is fed verbatim to the LLM:

"API-based application that exchanges Visa schema files for credit card status updates, polls Reuters rates feed every 2 hours, processes payments in real-time. Calls the Visa authorization v2 endpoint on every transaction."

Rich descriptions = accurate routing. Sparse descriptions = false negatives.


Ticket Lifecycle & State Machine

Every vendor change that passes triage becomes a VendorSync ticket — the single source of truth for that change's progress. Tickets have a strictly enforced state machine.

                 ┌─────────────┐
                 │    open     │  ← created by system on email triage
                 └──────┬──────┘
                        │  any linked Jira issue → "In Progress"
                        │  OR owner manually transitions in VS UI
                        ▼
                 ┌─────────────┐
                 │ in_progress │
                 └──────┬──────┘
                        │  ALL linked Jira issues → "Done"
                        │  AND all non-Jira apps → non_jira_status = complete
                        ▼
                 ┌─────────────┐
                 │  resolved   │  ← awaiting change manager confirmation
                 └──────┬──────┘
                        │  change manager clicks "Confirm closure" in VS UI
                        ▼
                 ┌─────────────┐
                 │   closed    │  ← terminal
                 └─────────────┘

  Any non-closed state + effective_date passed → breached  (terminal, apart from an admin override close)
  Any non-closed state + admin force-close → closed  (with mandatory reason)

State definitions

| State | Meaning |
| --- | --- |
| open | Ticket created, no work started; all Jira issues in "To Do" |
| in_progress | At least one Jira issue has moved to "In Progress" |
| resolved | All Jira issues Done + all non-Jira apps marked complete; awaiting human confirmation |
| closed | Change manager confirmed; change is handled |
| breached | Effective date passed without closure: SLA violated |

Status is driven by Jira, not by humans

For tickets with linked Jira issues, the state transitions open → in_progress → resolved happen automatically based on Jira sync results. Humans do not manually move these states. Only resolved → closed requires a human action.

For apps configured with creates_jira_ticket = false, owners use VS UI buttons to update non_jira_status on their app's ticket row. The combined evaluation (all Jira issues Done + all non-Jira apps complete) triggers resolution.

Two-person close rule

The change manager who confirmed the last state transition on a ticket cannot also confirm its closure. The backend enforces:

if closed_by == ticket.last_transitioned_by:
    raise HTTPException(403, "The user who last transitioned this ticket cannot also confirm closure")

Admin role can self-close but the bypass is flagged in the audit log.
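
Combining the rule with the admin bypass gives a check along these lines. This is a framework-free sketch under stated assumptions: the `CloseCheck` type and `check_closure` function are hypothetical, and the sketch covers only the two-person rule, not the separate question of which roles may confirm closure at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CloseCheck:
    allowed: bool
    flag_in_audit_log: bool  # True when an admin bypassed the two-person rule


def check_closure(
    closed_by: str, last_transitioned_by: Optional[str], role: str
) -> CloseCheck:
    """Apply the two-person close rule, with the flagged admin bypass."""
    if closed_by != last_transitioned_by:
        return CloseCheck(allowed=True, flag_in_audit_log=False)
    if role == "admin":
        # Same person, but admins may self-close; the bypass must be recorded.
        return CloseCheck(allowed=True, flag_in_audit_log=True)
    return CloseCheck(allowed=False, flag_in_audit_log=False)
```

An API route could translate `allowed=False` into the 403 response shown above.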

Forbidden transitions

  • closed → anything (terminal — cannot be reopened; create a new ticket referencing it)
  • breached → closed requires admin manual override with a documented reason
  • Skipping states (e.g. open → resolved) is forbidden. If Jira sync finds all issues Done on a still-open ticket, two separate transitions are written to the audit log atomically: open → in_progress → resolved.
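
The no-skipping rule can be sketched as a function that turns the latest sync result into an ordered list of single-step transitions. The function name, the input shapes, and the choice to count a Done issue as "work has started" are assumptions made for illustration.

```python
def plan_transitions(
    current: str, jira_categories: list[str], non_jira_statuses: list[str]
) -> list[tuple[str, str]]:
    """Return the ordered (from, to) transitions implied by a sync result."""
    if current not in {"open", "in_progress"}:
        return []  # resolved, closed and breached are never moved by sync
    started = any(c in {"in_progress", "done"} for c in jira_categories)
    all_done = (
        bool(jira_categories or non_jira_statuses)
        and all(c == "done" for c in jira_categories)
        and all(s == "complete" for s in non_jira_statuses)
    )
    steps: list[tuple[str, str]] = []
    state = current
    if state == "open" and (started or all_done):
        steps.append(("open", "in_progress"))
        state = "in_progress"
    if state == "in_progress" and all_done:
        steps.append(("in_progress", "resolved"))
    return steps
```

For a still-open ticket whose issues are all Done, the result is two steps rather than one jump, which matches the two audit log entries described above.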

SLA Enforcement & Escalation

VendorSync's scheduler checks all open tickets every 15 minutes. For each ticket with an effective_date, it calculates the days remaining and fires the matching escalation.

Escalation thresholds

| Threshold | Actions triggered |
| --- | --- |
| D-7 (7 days before deadline) | Comment on all linked Jira issues · Slack DM to ticket owner · Email to owner |
| D-3 (3 days before deadline) | Comment on all linked Jira issues · Slack channel ping · Email to manager |
| D-1 (1 day before deadline) | Comment on all linked Jira issues · War room Slack channel alert · PagerDuty on-call page |
| D-0 (breached) | Ticket status → breached · Slack + email alert to admin + senior management |

Each threshold fires at most once per ticket — enforced via notification_log with a deduplication index on (ticket_id, channel, metadata.escalation_threshold). There is no double-paging.

Escalations stop if the ticket reaches resolved or closed before the threshold fires.
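
A hedged sketch of the threshold selection follows. It assumes that only the tightest threshold reached is fired (a ticket first seen with two days left gets D-3, not D-7 and D-3), that D-0 is reached on the effective date itself, and that the set of already-fired labels stands in for the notification_log lookup. The function name is hypothetical.

```python
from datetime import date
from typing import Optional

THRESHOLDS = (7, 3, 1)  # D-7, D-3, D-1


def due_escalation(
    effective_date: date, today: date, status: str, already_fired: set[str]
) -> Optional[str]:
    """Return the escalation label to fire now, or None if nothing is due."""
    if status in {"resolved", "closed", "breached"}:
        return None  # escalations stop once the ticket has left active states
    days_left = (effective_date - today).days
    if days_left <= 0:
        label = "D-0"
    else:
        reached = [t for t in THRESHOLDS if days_left <= t]
        if not reached:
            return None  # more than 7 days out
        label = f"D-{min(reached)}"
    # Each threshold fires at most once per ticket.
    return None if label in already_fired else label
```

Running this on every 15-minute pass is safe because a label that has already fired is filtered out by the final check.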

Jira comments at each threshold

🔔 VendorSync — D-7 reminder

Vendor deadline approaching: April 20, 2027 (in 7 days)

This Jira issue is linked to VendorSync ticket VS-2847.
Vendor change: Visa — URL endpoint change

If this change has been completed, please move this Jira issue to Done.

VendorSync ticket: https://vendorsync.company.com/tickets/VS-2847

SLA defaults

If no rule matches a vendor change, SLA is determined by the vendor's criticality tier:

| Vendor tier | Default SLA |
| --- | --- |
| tier_1 | 14 days |
| tier_2 | 30 days |
| tier_3 | 60 days |

Configurable in system_settings.defaults.sla_days_by_tier.


Jira Integration Deep Dive

VendorSync talks to Jira via direct async HTTP calls (httpx) to the Jira REST API. No SDKs, no proxies. Both Jira Cloud (API v3) and Jira Server/Data Center (API v2) are supported — configured per-instance via jira_config in the database.

What VendorSync creates in Jira

When a ticket is triaged and linked applications have Jira enabled, one issue is created per application:

Summary:   Visa — URL endpoint change [VS-2847]
Project:   PAY  (from applications.jira_project_key)
Type:      Task  (from applications.jira_issue_type)
Due date:  2027-04-20
Assignee:  <from applications.jira_default_assignee if set>
Labels:    ["vendorsync", "vendor:visa", <app.jira_labels...>]

Description:
═══════════════════════════════════════════════════════
VENDOR CHANGE — Visa URL endpoint update
═══════════════════════════════════════════════════════
Vendor:           Visa
Change type:      URL endpoint change
Effective date:   April 20, 2027 (D-180)
Severity:         High
SLA:              14 days
Affected application: PaymentService (Payments team)

Summary: [LLM-generated 2-3 sentence summary]
VendorSync routing reasoning: [LLM reasoning text]

VendorSync ticket: VS-2847
Direct link: https://vendorsync.company.com/tickets/VS-2847
═══════════════════════════════════════════════════════

The engineer working the Jira issue has everything they need without opening VendorSync. The original email is accessible via the VendorSync link if needed.

Sync loop (every 5 minutes)

VendorSync never queries Jira issue-by-issue. The sync is batched:

  1. Pull all open jira_issue_key values in one SQL query
  2. Chunk into batches of 100
  3. For each batch: one JQL search request — key in (PAY-9421, PAY-9422, ...)
  4. Diff stored status against Jira status category
  5. Update ticket_jira_links, write audit log entries
  6. Re-evaluate parent VS ticket status

With batches of 100, the sync issues up to 100× fewer Jira API requests than per-issue queries, which shortens each sync cycle and conserves API quota.
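
Steps 2 and 3 of the sync loop can be sketched as a generator that chunks issue keys and builds one JQL clause per chunk. The function name is hypothetical, and the HTTP call that would send each clause to Jira's search endpoint is omitted.

```python
from typing import Iterator

BATCH_SIZE = 100


def jql_batches(issue_keys: list[str], batch_size: int = BATCH_SIZE) -> Iterator[str]:
    """Yield one 'key in (...)' JQL clause per chunk of issue keys."""
    for start in range(0, len(issue_keys), batch_size):
        chunk = issue_keys[start:start + batch_size]
        yield f"key in ({', '.join(chunk)})"
```

For 250 open issue keys this yields three clauses (100, 100, and 50 keys), so the sync makes three search requests instead of 250 lookups.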

Status mapping

VendorSync uses Jira's status category (standardized across all Jira projects) rather than exact status names (which vary by workflow):

| Jira status category | VendorSync jira_status_category |
| --- | --- |
| To Do / Open / Backlog | to_do |
| In Progress / In Review / etc. | in_progress |
| Done / Closed / Resolved / etc. | done |
| Anything else | unknown |

The exact Jira status name is preserved in jira_status for display in the UI.
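
As an illustration of this mapping, the sketch below reads the category from an issue payload as returned by the Jira REST API, where it appears under `fields.status.statusCategory.key` with values such as `new`, `indeterminate`, and `done`. The function name and the returned tuple shape are assumptions.

```python
from typing import Any

# Jira statusCategory key -> VendorSync jira_status_category
CATEGORY_MAP = {"new": "to_do", "indeterminate": "in_progress", "done": "done"}


def map_status(issue: dict[str, Any]) -> tuple[str, str]:
    """Return (jira_status, jira_status_category) for one Jira issue payload."""
    status = issue.get("fields", {}).get("status", {})
    name = status.get("name", "")
    category_key = status.get("statusCategory", {}).get("key", "")
    # Any category not listed above falls back to "unknown".
    return name, CATEGORY_MAP.get(category_key, "unknown")
```

Keeping the workflow-specific name separate from the category means a custom status like "In Review" still displays correctly while being treated as in_progress.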

Error handling and retry

| Error | Action |
| --- | --- |
| 400 bad payload | Log full error, alert admin; likely a misconfigured project key or issue type |
| 401 unauthorized | Disable Jira config, urgent admin alert |
| 403 forbidden | Log, admin alert; missing "Create Issues" permission |
| 404 project not found | Log, admin alert; jira_project_key is wrong |
| 5xx / network timeout | Retry × 3 with exponential backoff |
| Exhausted retries | Mark sync_status = orphan, background retry every 15 min |

Jira creation failures never block ticket creation. VendorSync tickets exist and are tracked regardless of Jira state. Orphaned links are surfaced in the admin view.
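
The error table can be read as a small classifier plus a backoff schedule. In this sketch the action labels, the function names, the one-second base delay, and the handling of 4xx codes not listed in the table are all assumptions; only the status codes and the three-retry limit come from the table.

```python
from typing import Optional


def classify_jira_error(status_code: Optional[int]) -> str:
    """Map an HTTP status (None means a network timeout) to a handling action."""
    if status_code is None or status_code >= 500:
        return "retry"
    if status_code == 401:
        return "disable_config_and_alert"
    # 400, 403, 404 (and, by assumption, any other 4xx) are not retried.
    return "log_and_alert"


def backoff_delays(retries: int = 3, base_seconds: float = 1.0) -> list[float]:
    """Exponential backoff: each delay doubles the previous one."""
    return [base_seconds * 2**attempt for attempt in range(retries)]
```

With the defaults, a failing create call would wait 1, 2, and 4 seconds between attempts before the link is marked as an orphan.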

Permissions VendorSync needs in Jira

The Jira account used by VendorSync requires:

  • Browse Projects on all target projects
  • Create Issues on all target projects
  • Add Comments on all target projects

VendorSync intentionally does not transition Jira issue statuses, delete issues, or touch issues it didn't create.


Notification System

Notifications are dispatched asynchronously through the vs:tickets:notify stream, after ticket creation. Severity determines the fanout:

Severity Slack channel Slack DM to owner Email PagerDuty
critical
high
medium
low

Every notification dispatched is recorded in notification_log. If a notification channel fails (Slack down, SMTP unreachable), the failure is logged, the ticket creation still completes, and retry happens via DLQ replay — notifications never block the primary pipeline.


Authentication & Authorization

Authentication — Okta OIDC

VendorSync uses Okta as the identity provider. The backend validates JWTs with authlib; the frontend uses Auth.js v5 (NextAuth) with the Okta provider.

User opens VendorSync
  → frontend redirects to Okta login
  → user authenticates (MFA, SSO, etc.)
  → Okta redirects back with auth code
  → Auth.js exchanges code for tokens
  → frontend sends Okta access token in Authorization header
  → backend validates JWT via Okta's JWKS endpoint
  → backend looks up role in users table
  → request proceeds with full user context

Authorization — internal roles

Okta identifies who the user is. VendorSync's users table controls what they can do. Role changes don't require Okta admin involvement.

| Role | What they can do |
| --- | --- |
| admin | Everything: users, config, system settings, manual override close |
| change_manager | Manage rules and registry, confirm closures, manual override |
| owner | Work tickets for their team, mark non-Jira app progress |
| viewer | Read-only: dashboards, ticket lists, audit log |

Permission matrix

Action Admin Change Manager Owner Viewer
View all tickets
Mark non-Jira app progress (own team)
Trigger manual Jira sync
Confirm ticket closure
Manual override close
Edit rules
Edit application registry
Edit vendors
Edit email sources / LLM / Jira config
Manage users and roles
View audit log

First-time login (auto-provisioning)

When an unknown Okta user hits the system for the first time, VendorSync auto-provisions them with role = viewer. Existing admins are notified via Slack/email: "New user X just logged in — set their role."

Auto-provisioning can be disabled (system_settings.auth.auto_provision = false), requiring admins to pre-create accounts.

Bootstrap admin

On first run, no admins exist. Set the BOOTSTRAP_ADMINS environment variable to comma-separated emails:

BOOTSTRAP_ADMINS=alice@company.com,bob@company.com

When those users log in for the first time, they receive admin role automatically. The variable is idempotent — re-applying it will not downgrade existing admins.
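
The provisioning rules in this section can be sketched as follows. The function names are hypothetical, and two behaviours are assumptions rather than documented facts: email comparison is case-insensitive, and bootstrap admins are provisioned even when auto-provisioning is disabled.

```python
from typing import Optional


def parse_bootstrap_admins(raw: str) -> set[str]:
    """Split the comma-separated BOOTSTRAP_ADMINS value into normalized emails."""
    return {part.strip().lower() for part in raw.split(",") if part.strip()}


def role_on_login(
    email: str,
    existing_role: Optional[str],
    bootstrap_admins: set[str],
    auto_provision: bool = True,
) -> Optional[str]:
    """Return the role to use for this login, or None if access is refused."""
    if existing_role is not None:
        return existing_role  # known user: the variable never changes a stored role
    if email.strip().lower() in bootstrap_admins:
        return "admin"
    return "viewer" if auto_provision else None
```

Because existing roles are returned untouched, re-applying the variable is a no-op for users who already have an account.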


Data Model

VendorSync's database is PostgreSQL 17, managed via Alembic migrations, accessed through SQLAlchemy 2.x async ORM. All tables use UUID primary keys and UTC timestamps.

Core tables

| Table | Purpose |
| --- | --- |
| users | Okta-authenticated users with internal roles |
| vendors | External vendors that send change notifications |
| applications | Internal applications, with Jira routing config and owner info |
| application_vendors | Many-to-many: which apps integrate with which vendors |
| rules | Natural-language routing rules consumed by the Triage Agent |
| tickets | The core entity: one per vendor change, with full lifecycle |
| ticket_applications | Many-to-many: which apps are affected by each ticket |
| ticket_jira_links | One row per Jira issue linked to a ticket |
| ticket_notes | User-authored notes on tickets |
| source_emails | Ingested emails with classification metadata |
| email_attachments | Attachment metadata and storage paths |
| audit_log | Immutable append-only record of every state change |
| notification_log | Record of every notification dispatched |

Configuration tables

| Table | Purpose |
| --- | --- |
| email_sources | Configured mailboxes (IMAP / Exchange / Gmail) |
| llm_config | LLM provider config: primary + fallback, with encrypted API keys |
| jira_config | Jira instance config: base URL, API version, auth token |
| system_settings | Key/value store for all system-wide defaults |

Ticket numbering

Tickets are assigned human-readable IDs (VS-0001, VS-0002, …) via a dedicated Postgres sequence ticket_number_seq. The application reads nextval('ticket_number_seq') at insert time and formats it as VS- + zero-padded integer. Sequence values are never manually assigned.
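
The formatting step might look like the sketch below. The four-digit minimum width is inferred from the VS-0001 example, and the function name is hypothetical; fetching the value from ticket_number_seq is left to the database layer.

```python
def format_ticket_number(sequence_value: int, width: int = 4) -> str:
    """Format a sequence value as a human-readable ticket ID."""
    if sequence_value < 1:
        raise ValueError("sequence values start at 1")
    # Zero-pad to at least `width` digits; larger values simply grow longer.
    return f"VS-{sequence_value:0{width}d}"
```

Zero-padding sets only a minimum width, so IDs keep working unchanged once the sequence passes 9999.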

Encryption at rest

Sensitive fields (email_sources.password_encrypted, llm_config.api_key_encrypted, jira_config.api_token_encrypted) are encrypted with Fernet symmetric encryption using VENDORSYNC_SECRET_KEY from the environment. Raw values are never written to the database. *_encrypted fields are never exposed in API read schemas.

Soft deletes only

Nothing is ever hard-deleted. All entities use is_active = false flags. Queries filter on is_active = true by default. Historical records are always queryable for audit purposes.

Key indexes

Beyond primary keys, the schema carries targeted indexes for the hottest query paths:

  • tickets (status, effective_date) — deadline monitor
  • tickets (vendor_id, created_at) — vendor history views
  • source_emails (processing_status, received_at) — triage queue
  • audit_log (entity_type, entity_id, created_at) — per-entity history
  • notification_log — expression index on metadata->>'escalation_threshold' for escalation deduplication
  • ticket_jira_links (jira_issue_key) — Jira key → VS ticket lookup
  • rules (vendor_id, is_active) — rule loading in triage

Technology Stack

Every technology choice in VendorSync is deliberate. The stack is designed for developer ergonomics, operational simplicity, and correctness — not novelty.

Backend

| Component | Choice | Why |
| --- | --- | --- |
| Language | Python 3.12+ | Ecosystem depth for LLM, async I/O, data tooling |
| Framework | FastAPI | Native async, automatic OpenAPI docs, Pydantic integration |
| ASGI server | Uvicorn (dev) / Gunicorn + Uvicorn workers (prod) | Production-proven, minimal |
| ORM | SQLAlchemy 2.x async | Industry standard, excellent async support |
| Migrations | Alembic | Pairs with SQLAlchemy, reliable version tracking |
| Job queue | Redis Streams via redis-py async | No additional infrastructure, persistent, debuggable |
| Scheduler | APScheduler | Lightweight in-process, no separate beat service |
| Validation | Pydantic v2 | Fast, type-safe, shared schemas with FastAPI |
| Package manager | uv | 10–100× faster than pip, deterministic lock file |

LLM

| Component | Choice | Why |
| --- | --- | --- |
| Abstraction | LiteLLM | Single API across all providers; switch providers via config with zero code changes |
| Default provider | Anthropic Claude | Best instruction following; multimodal for PDF/image attachments |
| Fallback providers | OpenAI, Google Gemini, Azure OpenAI | Configurable at runtime |
| Document parsing | Delegated to LLM | No separate OCR library needed; Claude Sonnet, GPT-4o, Gemini all handle images and PDFs natively |

Frontend

| Component | Choice | Why |
| --- | --- | --- |
| Framework | Next.js (App Router) | File-based routing, server components, production-ready |
| Language | TypeScript | End-to-end type safety |
| Styling | Tailwind CSS | Utility-first, co-located with markup |
| Components | shadcn/ui | Accessible primitives, fully owned source code |
| Icons | lucide-react | Consistent, tree-shakeable |
| Server state | TanStack Query | Best-in-class caching, background refetch, stale-while-revalidate |
| Forms | react-hook-form + zod | Type-safe validation, reusable schemas |
| Auth | Auth.js v5 (NextAuth) with Okta provider | Full App Router compatibility: server components, middleware, server actions |

Infrastructure

| Component | Choice | Why |
| --- | --- | --- |
| Reverse proxy | Traefik v3 | Docker label service discovery, built-in Let's Encrypt, zero config for adding services |
| Database | PostgreSQL 17 | ACID, JSONB, sequences, GIN indexes, industry standard |
| Cache / queue | Redis 7 | Streams + pub/sub + caching in one service |
| Container runtime | Docker Compose | One docker compose up brings up the full stack |

Email ingestion

| Protocol | Library |
| --- | --- |
| IMAP | aioimaplib (async) |
| Microsoft Exchange | exchangelib |
| Gmail API | google-api-python-client |
| POP3 | Not supported (not in the protocol enum) |

Notifications

| Channel | Library / API |
| --- | --- |
| Slack | slack-sdk Python |
| Email | smtplib (or SendGrid SDK) |
| PagerDuty | Events API v2 (direct HTTP POST) |

Project Structure

vendorsync/
├── backend/
│   ├── app/
│   │   ├── api/              # FastAPI routes (one file per resource)
│   │   ├── models/           # SQLAlchemy ORM models
│   │   ├── schemas/          # Pydantic v2 request/response schemas
│   │   ├── services/         # Business logic (pure Python)
│   │   ├── agents/           # Triage Agent (and future agents)
│   │   ├── workers/          # Redis Streams consumer functions
│   │   ├── scheduler/        # APScheduler periodic job definitions
│   │   ├── llm/              # LiteLLM wrapper, prompt templates
│   │   ├── ingestion/        # Email fetch and attachment handling
│   │   ├── jira/             # Jira REST client and sync logic
│   │   ├── auth/             # Okta JWT validation, role decorators
│   │   ├── core/             # Config, DB session, Redis client
│   │   └── main.py           # FastAPI app entry point
│   ├── tests/
│   ├── alembic/              # Database migration scripts
│   ├── pyproject.toml        # uv project manifest
│   ├── uv.lock               # Deterministic dependency lock
│   └── Dockerfile            # Single image for backend + worker + scheduler
├── frontend/
│   ├── src/
│   │   ├── app/              # Next.js App Router pages
│   │   │   ├── layout.tsx    # Root layout (sidebar + header)
│   │   │   ├── page.tsx      # Dashboard
│   │   │   ├── tickets/[id]/ # Ticket detail
│   │   │   ├── vendors/      # Vendor registry
│   │   │   ├── applications/ # Application registry
│   │   │   ├── rules/        # Rules admin
│   │   │   ├── settings/     # Email / LLM / Jira / users config
│   │   │   ├── audit/        # Audit log viewer
│   │   │   └── docs/         # In-app documentation tab
│   │   ├── components/
│   │   │   ├── ui/           # shadcn/ui primitives
│   │   │   ├── layout/       # Sidebar, Header
│   │   │   ├── tickets/      # Ticket-specific components
│   │   │   └── shared/       # Badges, cards, sync buttons
│   │   ├── hooks/            # TanStack Query hooks, WS hooks
│   │   ├── lib/
│   │   │   ├── api/          # Typed API client functions
│   │   │   ├── auth/         # Auth.js helpers
│   │   │   ├── time.ts       # formatDateTime, formatDate, etc.
│   │   │   └── types/        # Shared TypeScript types
│   │   └── styles/
│   │       └── globals.css
│   ├── package.json
│   ├── tsconfig.json
│   └── Dockerfile
├── traefik/
│   └── traefik.yml           # Traefik static config
├── docker-compose.yml
├── docker-compose.server.yml # Production server overrides
├── .env.example              # All required env vars with comments
├── .gitattributes            # LF line endings enforced for all text files
└── docs/                     # Architecture, data model, business rules, etc.

Configuration Reference

All secrets and environment configuration are in .env (never committed — see .env.example).

Required variables

# Database
DATABASE_URL=postgresql+asyncpg://vendorsync:password@postgres:5432/vendorsync

# Redis
REDIS_URL=redis://redis:6379/0

# Encryption key (Fernet) — used for all encrypted DB fields
VENDORSYNC_SECRET_KEY=<base64-encoded 32-byte key>

# Okta OIDC (backend)
OKTA_DOMAIN=company.okta.com
OKTA_CLIENT_ID=...
OKTA_CLIENT_SECRET=...
OKTA_AUDIENCE=api://vendorsync
OKTA_ISSUER=https://company.okta.com/oauth2/default

# Auth.js (frontend)
AUTH_SECRET=<random 32-byte secret>
NEXTAUTH_URL=https://vendorsync.company.com

# Bootstrap admin (first run only)
BOOTSTRAP_ADMINS=alice@company.com,bob@company.com
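
A Fernet key is the URL-safe base64 encoding of 32 random bytes, so a value for VENDORSYNC_SECRET_KEY can be produced with the standard library alone. The helper name below is hypothetical; this is a small sketch, not a script shipped with the repository.

```python
import base64
import os


def generate_fernet_key() -> str:
    """Return 32 random bytes encoded as URL-safe base64 (the Fernet key format)."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


if __name__ == "__main__":
    # Paste the printed line into .env; never commit it.
    print(f"VENDORSYNC_SECRET_KEY={generate_fernet_key()}")
```

If the `cryptography` package is installed, `Fernet.generate_key()` yields a key in the same format.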

System settings (stored in DB, editable via UI)

| Key | Default | Description |
| --- | --- | --- |
| defaults.timezone | "UTC" | Timezone for effective_date interpretation |
| defaults.working_hours | {"start":"09:00","end":"18:00"} | Business hours for escalation scheduling |
| escalation.thresholds_days | [7, 3, 1] | Days-before-deadline escalation triggers |
| defaults.sla_days_by_tier | {"tier_1":14,"tier_2":30,"tier_3":60} | Fallback SLA by vendor tier |
| audit.retention_days | 1825 (5 years) | How long to retain audit log rows |
| auth.auto_provision | true | Auto-create users on first Okta login |
| auth.default_role | "viewer" | Default role for auto-provisioned users |
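
To make the escalation.thresholds_days setting concrete, the sketch below turns a deadline and the default thresholds into the calendar dates on which escalations would fire. The function name is hypothetical, and the sketch deliberately ignores defaults.timezone and defaults.working_hours, which the real scheduler presumably takes into account.

```python
from datetime import date, timedelta


def escalation_dates(deadline: date, thresholds_days: list[int]) -> dict[int, date]:
    """Map each threshold (days before the deadline) to the date it fires on."""
    ordered = sorted(thresholds_days, reverse=True)  # earliest escalation first
    return {days: deadline - timedelta(days=days) for days in ordered}


if __name__ == "__main__":
    for days, fires_on in escalation_dates(date(2025, 3, 31), [7, 3, 1]).items():
        print(f"D-{days}: {fires_on.isoformat()}")
```

For a deadline of 2025-03-31 this gives D-7 on 2025-03-24, D-3 on 2025-03-28, and D-1 on 2025-03-30.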

Getting Started (Local Development)

Prerequisites

  • Docker Desktop with the Linux engine
  • Git configured with core.autocrlf = input (Windows only — prevents CRLF contamination)

First run

# 1. Clone the repo
git clone <repo-url>
cd vendorsync

# 2. Copy environment template
cp .env.example .env
# Edit .env — fill in OKTA_*, VENDORSYNC_SECRET_KEY, AUTH_SECRET at minimum

# 3. Build and start all services
docker compose up --build

# 4. In a second terminal, run database migrations (first time only)
docker compose exec backend alembic upgrade head

# 5. Open the app
# http://localhost — frontend (Traefik routes it)
# http://localhost/api/docs — FastAPI Swagger UI
# http://localhost:9081/dashboard — Traefik dashboard

Subsequent runs

docker compose up          # Start all services (no rebuild)
docker compose up --build  # Rebuild and start (after Dockerfile or dependency changes)
docker compose down        # Stop all services, keep volumes

Never use docker compose down -v — this destroys vendorsync_postgres_data and vendorsync_redis_data permanently. Always stop without -v.

Hot reload

All source code is volume-mounted into containers. Changes to Python or TypeScript files are live immediately — no rebuild needed:

  • Backend: Uvicorn --reload watches /app/app for .py changes
  • Frontend: Next.js Turbopack watches /app/src for .ts/.tsx changes

Running migrations

# Apply pending migrations
docker compose exec backend alembic upgrade head

# Create a new migration after model changes
docker compose exec backend alembic revision --autogenerate -m "describe the change"

Viewing logs

docker compose logs -f backend
docker compose logs -f worker
docker compose logs -f scheduler
docker compose logs -f frontend

Business Rules Reference

Severity levels

| Severity | Triggered by |
| --- | --- |
| critical | Breaking change with deadline < 30 days, vendor tier_1, or affects payment-critical apps |
| high | Breaking change with deadline 30–90 days, vendor tier_2, or affects multiple critical apps |
| medium | Non-breaking change with deadline > 90 days, or single non-critical app |
| low | Informational, optional updates, vendor tier_3 |

Vendor criticality tiers

| Tier | Meaning | Default SLA |
| --- | --- | --- |
| tier_1 | Mission-critical vendors: payment networks, core infrastructure | 14 days |
| tier_2 | Important integrations: significant operational impact if broken | 30 days |
| tier_3 | Non-critical vendors: informational or optional integrations | 60 days |
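
As an illustration of the fallback SLA, the sketch below derives a deadline from a vendor tier when an email states none. It assumes the SLA is counted in calendar days from the date the email was received, which this README does not specify; the function and constant names are hypothetical.

```python
from datetime import date, timedelta

# Mirrors the default value of defaults.sla_days_by_tier.
SLA_DAYS_BY_TIER = {"tier_1": 14, "tier_2": 30, "tier_3": 60}


def fallback_deadline(received_on: date, vendor_tier: str) -> date:
    """Return the fallback deadline for a vendor tier (assumes calendar days)."""
    return received_on + timedelta(days=SLA_DAYS_BY_TIER[vendor_tier])


if __name__ == "__main__":
    print(fallback_deadline(date(2025, 1, 1), "tier_1"))  # 2025-01-15
```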

Manual override rules

Force-closing a ticket without normal resolution requires:

  • admin or change_manager role
  • close_reason text of at least 20 characters
  • is_manual_override = true set on the ticket
  • Audit log entry flagged with override context
  • Override tickets excluded from SLA compliance metrics
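
The override requirements above can be expressed as a single guard function. The sketch below is one possible shape: the `Ticket` dataclass, the `"closed"` status value, the whitespace stripping, and the audit entry layout are all assumptions for illustration, and only the role names, the 20-character minimum, and the `is_manual_override` flag come from the rules listed here.

```python
from dataclasses import dataclass
from typing import Optional

OVERRIDE_ROLES = {"admin", "change_manager"}
MIN_CLOSE_REASON_CHARS = 20


class OverrideRejected(ValueError):
    """Raised when a force-close request violates the override rules."""


@dataclass
class Ticket:
    # Hypothetical minimal ticket; the real model has many more fields.
    id: int
    status: str = "open"
    close_reason: Optional[str] = None
    is_manual_override: bool = False


def force_close(ticket: Ticket, actor_role: str, close_reason: str) -> dict:
    """Force-close a ticket and return the audit entry describing the override."""
    if actor_role not in OVERRIDE_ROLES:
        raise OverrideRejected(f"role {actor_role!r} may not force-close tickets")
    reason = close_reason.strip()
    if len(reason) < MIN_CLOSE_REASON_CHARS:
        raise OverrideRejected(f"close_reason needs at least {MIN_CLOSE_REASON_CHARS} characters")
    ticket.status = "closed"  # assumed status name
    ticket.close_reason = reason
    ticket.is_manual_override = True  # excludes the ticket from SLA compliance metrics
    return {"ticket_id": ticket.id, "action": "force_close", "override": True, "actor_role": actor_role}
```

Returning the audit entry from the same function keeps the state change and its audit record together, so one cannot be written without the other.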

Unknown vendor handling

When the LLM cannot match an email to any known vendor:

  • source_emails.processing_status is set to 'unknown'
  • No ticket is created
  • A Slack alert is sent to triage operators
  • The email surfaces in the VendorSync triage view (source_emails WHERE processing_status = 'unknown')
  • Operator classifies manually; can optionally save the classification as a new rule for future emails

Notifications must not block

If Slack, email, or PagerDuty calls fail, the ticket and Jira creation still complete. The failure is logged to notification_log and retried via DLQ. Notification channels are fire-and-forget relative to the core pipeline.
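
One way to keep notification failures out of the core pipeline is to wrap every send in a guard that records the failure and returns instead of raising. In the sketch below, the `dead_letters` list stands in for both the notification_log table and the DLQ, and all function names are hypothetical.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("notifications")


async def notify_safely(
    channel: str,
    send: Callable[[], Awaitable[None]],
    dead_letters: list,
) -> bool:
    """Attempt one notification; on failure, log it and queue it for retry."""
    try:
        await send()
        return True
    except Exception:
        logger.exception("notification via %s failed", channel)
        dead_letters.append(channel)  # stand-in for a notification_log row plus a DLQ entry
        return False


async def failing_slack() -> None:
    """Simulates an unreachable Slack API."""
    raise ConnectionError("Slack unreachable")


async def process_ticket(dead_letters: list) -> int:
    ticket_id = 42  # stand-in for ticket and Jira issue creation
    await notify_safely("slack", failing_slack, dead_letters)
    return ticket_id  # reached even though the notification failed


if __name__ == "__main__":
    failed: list = []
    print(asyncio.run(process_ticket(failed)), failed)
```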


Architectural Principles

These principles guided every design decision in VendorSync. They are not aspirations — they are constraints actively enforced by the codebase.

Separation of concerns — API, services, agents, workers, scheduler, and models are distinct layers with one-way dependency rules. Nothing depends on api or workers.

LLM is for unstructured → structured conversion only — the LLM turns free-text emails into TriageDecision objects. Once data is structured, deterministic code takes over for everything downstream — state machines, SLA math, Jira API calls, deadline monitoring. The LLM never sees a structured decision point.

The VendorSync ticket is the audit truth — Jira is where engineering work happens. VendorSync is where vendor-change handling is tracked, measured, and enforced. They are complementary, not duplicates.

Rules and application descriptions are data, not code — admins edit them through the GUI at runtime. The LLM reads them fresh on every triage call. Adding a new rule requires zero deployments.

Multi-provider LLM from day one — every LLM call routes through LiteLLM. No provider is hardcoded anywhere. Switching from Claude to GPT-4o is a one-row database change.

Async everywhere — FastAPI async routes, async SQLAlchemy, async Redis, async httpx. No blocking I/O anywhere in the hot path.

Idempotent jobs — workers check for existing state before acting. Retrying a failed create_jira_issue job will not create a duplicate issue if the first attempt partially succeeded.
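
The check-before-act pattern can be sketched as follows. The in-memory `existing_keys` mapping is a stand-in for a database lookup, and the function and parameter names are hypothetical; a real worker would also need a way to detect an issue that was created in Jira but never recorded locally, which this sketch does not cover.

```python
from typing import Callable, Dict, Tuple


def create_jira_issue_once(
    ticket_id: int,
    app_id: int,
    existing_keys: Dict[Tuple[int, int], str],
    create_issue: Callable[[int, int], str],
) -> str:
    """Return the Jira key for (ticket, app), creating the issue only if none is recorded."""
    lookup = (ticket_id, app_id)
    if lookup in existing_keys:
        return existing_keys[lookup]  # a retry lands here and creates nothing
    issue_key = create_issue(ticket_id, app_id)
    existing_keys[lookup] = issue_key
    return issue_key


if __name__ == "__main__":
    calls = []

    def fake_create(ticket_id: int, app_id: int) -> str:
        calls.append((ticket_id, app_id))
        return f"ENG-{len(calls)}"

    store: Dict[Tuple[int, int], str] = {}
    print(create_jira_issue_once(1, 10, store, fake_create))
    print(create_jira_issue_once(1, 10, store, fake_create))  # same key, no second call
```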

Single entry point — Traefik handles all external routing. Backend and frontend containers do not expose ports directly in production. Adding a new service requires only Docker labels.

Agent abstraction — LLM logic is wrapped in named agent classes (TriageAgent) with stable interfaces. The implementation can be swapped, multi-step chains can be added, A/B testing can be layered — callers never change.
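
A rough sketch of this abstraction, under stated assumptions: the `TriageDecision` fields, the `Triager` protocol, and the keyword-based stub are invented for illustration, and the method is synchronous for brevity even though the real agent is presumably async. The point is only that the caller depends on the interface, not on the implementation behind it.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence, Tuple


@dataclass(frozen=True)
class TriageDecision:
    # Hypothetical fields; the real TriageDecision schema is not shown in this README.
    vendor: Optional[str]
    severity: str
    affected_apps: Tuple[str, ...] = ()


class Triager(Protocol):
    """The stable interface callers depend on."""

    def triage(self, email_body: str) -> TriageDecision: ...


class StubTriageAgent:
    """Deterministic stand-in; an LLM-backed agent would expose the same method."""

    def __init__(self, known_vendors: Sequence[str]) -> None:
        self._known_vendors = list(known_vendors)

    def triage(self, email_body: str) -> TriageDecision:
        text = email_body.lower()
        vendor = next((v for v in self._known_vendors if v.lower() in text), None)
        severity = "high" if "deprecat" in text else "low"
        return TriageDecision(vendor=vendor, severity=severity)


def handle_email(agent: Triager, email_body: str) -> TriageDecision:
    """Caller code: unchanged no matter which agent implementation is passed in."""
    return agent.triage(email_body)
```

Swapping `StubTriageAgent` for another class with the same `triage` method requires no change to `handle_email`.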


Documentation

Full internal documentation lives in docs/:

| Document | Contents |
| --- | --- |
| docs/ARCHITECTURE.md | System design, service topology, module boundaries, build sequence |
| docs/DATA_MODEL.md | Complete database schema with field-level documentation |
| docs/PROCESSING_FLOW.md | Step-by-step email processing pipeline with edge cases |
| docs/BUSINESS_RULES.md | Ticket lifecycle, SLA, escalation, override logic |
| docs/JIRA_INTEGRATION.md | Jira API usage, sync pattern, error handling, permissions |
| docs/AUTH.md | Okta OIDC flow, roles, permission matrix, implementation details |
| docs/TECH_STACK.md | Technology choices and rationale |
| docs/UI_DESIGN.md | Frontend design system, component library, screen inventory |

These docs are also accessible in-app at /docs — the documentation tab is part of the VendorSync UI itself.


VendorSync is built with FastAPI · Next.js · PostgreSQL · Redis · Traefik · LiteLLM · Okta
