
feat(eval): Add gate-check structured output and --debug-json instrumentation #64

Closed
oshorefueled wants to merge 130 commits into main from feat/eval-gate-checks

Conversation

Contributor

oshorefueled commented Mar 2, 2026

Why

Previous evaluations used a reasoning field to improve accuracy, but reasoning tokens are generated after the model has already pattern-matched to a conclusion — making the reasoning cosmetic and post-hoc. This PR replaces that with a "tax" approach: the model is forced to emit specific boolean gate checks per violation candidate before it can proceed. Serializing through these gates constrains what the model can plausibly output next, producing more accurate evaluations without a second model pass.

What

  • Requires line, rule_quote, confidence, reasoning, and gate check fields (rule_supports_claim, evidence_exact, context_supports_violation, plausible_non_violation, fix_is_drop_in, fix_preserves_meaning) in LLM structured output for both check and judge modes
  • Adds deterministic violation-filter.ts — computes surface/hide decisions and reason strings from gate check booleans; no extra model call
  • Adds --debug-json CLI flag that writes per-run artifacts to .vectorlint/runs/ containing raw model output, filter decisions, and surfaced violations
  • Updates the built-in directive to reflect the raw-candidates + gate-checks approach
  • Adds tests/evaluations/ with test rules and config for manual evaluation runs
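The deterministic filter described above can be sketched in a few lines. This is a minimal TypeScript sketch, assuming the six boolean gate fields from the schema; the type and function names are illustrative, not the PR's actual exports from violation-filter.ts:

```typescript
// Hypothetical sketch of the deterministic gate filter. Field names mirror
// the PR's schema; decideSurface and its return shape are assumptions.
interface GateChecks {
  rule_supports_claim: boolean;
  evidence_exact: boolean;
  context_supports_violation: boolean;
  plausible_non_violation: boolean;
  fix_is_drop_in: boolean;
  fix_preserves_meaning: boolean;
}

interface FilterDecision {
  index: number;
  surface: boolean;
  reasons: string[];
}

function decideSurface(checks: GateChecks, index: number): FilterDecision {
  const reasons: string[] = [];
  // Positive gates: every one of these must be true for the candidate to surface.
  for (const key of [
    "rule_supports_claim",
    "evidence_exact",
    "context_supports_violation",
    "fix_is_drop_in",
    "fix_preserves_meaning",
  ] as const) {
    if (!checks[key]) reasons.push(`${key}=false`);
  }
  // Negative gate: a plausible non-violation reading hides the candidate.
  if (checks.plausible_non_violation) {
    reasons.push("plausible_non_violation=true");
  }
  return { index, surface: reasons.length === 0, reasons };
}
```

Because the decision is a pure function of the booleans, no second model pass is needed; the reason strings double as the filter_decisions[*].reasons entries in the debug artifact.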

Scope

In scope

  • Structured output schema contract (both check and judge)
  • Deterministic gate filtering logic
  • --debug-json instrumentation
  • Directive wording update

Behavior impact

  • Violation output will change. The gate check fields are part of the structured output the model generates on every run — not just when --debug-json is passed. Because the model has to commit to boolean gate checks per violation candidate during generation, the set of violations it produces will differ from the old approach. Expect changes in what gets surfaced across all output formats (line, json, vale-json, rdjson), even though the output structure is unchanged.
  • The standard output formats (--output json etc.) do not include the new gate fields — they are consumed internally and written to debug artifacts only
  • Models that fail to populate the required gate fields will throw a structured output error
  • --debug-json writes to .vectorlint/runs/ (gitignored); no impact on default runs
  • Directive rewrite changes the system prompt on every evaluation, compounding the change in violation output

Risk

  • Medium — the new required fields change the LLM output contract; any provider or model that doesn't support structured output reliably may fail where it previously succeeded
  • Gate checks are generated on every run but only inspected via --debug-json artifacts; the token cost of producing them is paid even when not debugging

How to test / verify

Checks run

  • npm run test:run — passed
  • npm run lint — passed

Manual verification

```sh
# Smoke test with debug output
npm run dev -- tests/evaluations/TEST_FILE.md \
  --config tests/evaluations/.vectorlint.ini \
  --debug-json --verbose

# Expected: normal findings + a line like:
# [vectorlint] Debug JSON written: .vectorlint/runs/<uuid>.json
# Verify the artifact contains raw_model_output.violations[*].line and gate fields
ls .vectorlint/runs/
```

Expected artifact output

When --debug-json is passed, a file is written to .vectorlint/runs/<model-tag>/<run_id>.json. Example (one violation, truncated):

```json
{
  "run_id": "908b766e-b80a-4caf-a3f9-76a1a80aab8d",
  "timestamp": "2026-02-28T16:24:15.188Z",
  "file": "TEST_FILE.md",
  "model": {
    "provider": "openai",
    "name": "gpt-5.2-2025-12-11",
    "tag": "openai-gpt-5.2-2025-12-11"
  },
  "prompt": {
    "pack": "Test",
    "id": "Consistency",
    "filename": "consistency.md",
    "evaluation_type": "check"
  },
  "raw_model_output": {
    "reasoning": "Reviewed the input for terminology consistency...",
    "violations": [
      {
        "line": 43,
        "quoted_text": "AI-Powered Search",
        "context_before": "* **",
        "context_after": "**\n  Enable AI Search to help developers ask questions...",
        "description": "Potential terminology inconsistency: \"AI-Powered Search\" vs \"AI Search\" for the same feature.",
        "analysis": "The bullet heading names the feature \"AI-Powered Search\", but the description calls it \"AI Search\".",
        "suggestion": "Use one term consistently.",
        "fix": "  Enable AI-Powered Search to help developers ask questions about your product and instantly receive an answer.",
        "rule_quote": "Flag when the same concept is referred to by different terms",
        "checks": {
          "rule_supports_claim": true,
          "evidence_exact": true,
          "context_supports_violation": true,
          "plausible_non_violation": true,
          "fix_is_drop_in": true,
          "fix_preserves_meaning": true
        },
        "check_notes": {
          "rule_supports_claim": "The rule explicitly requires flagging the same concept referred to by different terms.",
          "evidence_exact": "Quoted text exactly matches line 43.",
          "context_supports_violation": "The very next line uses a different name (\"AI Search\") for the likely same feature.",
          "plausible_non_violation": "\"AI-Powered Search\" might be a category label while \"AI Search\" is the product name.",
          "fix_is_drop_in": "Direct substitution without structural changes.",
          "fix_preserves_meaning": "Keeps the intended meaning while aligning terminology."
        },
        "confidence": 0.8
      }
    ]
  },
  "filter_decisions": [
    {
      "index": 0,
      "surface": false,
      "reasons": ["plausible_non_violation=true"]
    }
  ],
  "surfaced_violations": []
}
```

Key things to verify:

  • raw_model_output.violations[*] contains line, rule_quote, checks, check_notes, and confidence
  • filter_decisions[*].reasons lists why each candidate was hidden or surfaced
  • surfaced_violations contains only candidates that passed all gates
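These invariants can be checked mechanically against any artifact. A minimal sketch, assuming only the filter_decisions and surfaced_violations fields from the example above; the artifactLooksConsistent helper is illustrative, not part of the PR, and loading the file from .vectorlint/runs/ is left out:

```typescript
// Illustrative invariant check for a --debug-json artifact (shape trimmed to
// the two fields under review; the helper name is an assumption).
interface Decision {
  index: number;
  surface: boolean;
  reasons: string[];
}

interface ArtifactSlice {
  filter_decisions: Decision[];
  surfaced_violations: unknown[];
}

function artifactLooksConsistent(a: ArtifactSlice): boolean {
  const surfacedCount = a.filter_decisions.filter((d) => d.surface).length;
  return (
    // Every hidden candidate should carry at least one reason string.
    a.filter_decisions.every((d) => d.surface || d.reasons.length > 0) &&
    // surfaced_violations should contain exactly the candidates that passed.
    a.surfaced_violations.length === surfacedCount
  );
}
```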

Rollback

  • Revert to main — no DB/config/env changes
  • Delete .vectorlint/runs/ locally (optional, gitignored)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added --debug-json CLI flag to emit debug artifacts for evaluation runs, showing model outputs and filtering decisions.
    • Enhanced violation reporting with structured quality checks and confidence scores.
    • Added five new evaluation rules for document quality: clarity, consistency, passive voice, readability, and wordiness.
  • Documentation

    • Added comprehensive guides for the new gate-check feature and debug workflow.
    • Added evaluation guides and test rules documentation.

  - Add YAML frontmatter rubric in prompts
  - Use response_format JSON schema for outputs
  - Inject article content as separate user message
  - Parse frontmatter; validate; skip invalid prompts
  - Apply thresholds/severity; compute findings; set exit codes
  - Resolve files via ScanPaths; exclude PromptsPath; .md/.txt
  - Improve verbose logs; full JSON behind --debug-json
  - Add runPromptStructured; keep legacy runPrompt
  - Remove legacy grammar/lint pipeline and formatter
  - Add Vitest, tests for config/scan/evaluation
  - Add yaml/fast-glob deps; test scripts added
  - Add vectorlint.example.ini; git-ignore vectorlint.ini
  - Remove per‑criterion threshold/severity; use score heuristic (0–1 error, 2 warning, 3–4 ok)
  - Treat below‑threshold via overall.severity (error fails CI, warning does not)
  - Compute/print per‑prompt summary: Top/Threshold/Score
  - Update loader, CLI, and reporter accordingly
  - Refresh README to reflect new config
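The score heuristic named in this commit maps cleanly to a small function. This is a sketch of the stated mapping (0–1 error, 2 warning, 3–4 ok); the function name is illustrative:

```typescript
// Sketch of the commit's score heuristic; severityForScore is an assumption,
// only the thresholds come from the commit message.
function severityForScore(score: number): "error" | "warning" | "ok" {
  if (score <= 1) return "error";
  if (score === 2) return "warning";
  return "ok"; // scores of 3–4
}
```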
  - Remove per‑criterion threshold/severity; use score heuristic (0–1 error, 2 warning, 3–4 ok)
  - Evaluate overall pass/fail via weighted score vs top‑level threshold
  - Treat below‑threshold using top‑level severity (error fails CI; warning does not)
  - Update loader, CLI, and reporting for new model
  - Refresh prompt example and README to reflect changes
  - Rename Applicability.ts to PromptMapping.ts
  - Rename APIs to readPromptMappingFromIni/resolvePromptMapping
  - Inject mapping filter into CLI evaluation loop
  - Derive alias from prompt path; add fullPath
  - Update tests and descriptions to “Prompt mapping”
  - Add minimal micromatch type shim for tsc
  - Document mapping and validate command in README
  - Extend example INI with mapping sections
  - Update project vectorlint.ini to new style
  - Preserve default behavior when mapping absent
  - TTY: remove per-row score column; widen message column
  - Collect per-criterion weighted scores and list each on its own line
  - Keep overall Top/Threshold/Score line; add a blank line after it
  - Use per‑violation analysis for row message; suggestions only on issues
  - JSON schema: add criterion-level reasoning: string (required)
  - Update CriteriaResult type to include summary and reasoning
  - Directive: request brief step-by-step reasoning in outputs
* Add eslint-import-resolver-typescript for proper module resolution
* Configure TypeScript resolver with project-specific settings
* Fix unnecessary escape characters in regex patterns (Config.ts)
* Replace empty catch blocks with descriptive comments
* Remove unused variables and fix prefer-const violations
* Add file-specific ESLint rule overrides for different contexts:
  - Allow dev dependencies in config files and tests
  - Allow any types in CLI entry point and provider files
  - Allow any types in YAML parsing modules
* Maintain type safety while preserving external API compatibility
* use kebab case for file names
* add code naming rules to eslint
- Add tsup as dev dependency
- Create tsup.config.ts with ESM output and external deps
- Update build script from tsc to tsup
- Update bin entry to point to dist/index.js
ayo6706 and others added 24 commits December 29, 2025 19:05
- Add smol-toml dependency for TOML file parsing
- Create comprehensive test suite for global config loader functionality
- Add tests for environment variable injection from TOML configuration
- Add tests to verify existing environment variables are not overwritten
- Add tests for graceful handling of missing config files
- Add tests for invalid TOML error handling
- Update init command tests to mock global config module
- Replace .env.vectorlint file generation with global config initialization
- Simplify init command test by removing environment file assertions
- Remove redundant file overwrite protection tests for env files
- Reorganize quick start steps to prioritize initialization command
- Update configuration instructions to reflect global config at ~/.vectorlint/config.toml
- Add reference to vectorlint init command for automatic file generation
- Clarify precedence of project-scoped .env over global configuration
- Simplify LLM provider setup instructions with TOML example
- Update CONFIGURATION.md to document global and project-scoped configuration options
- Improve documentation flow to guide users through initialization before rule creation
- Remove unused zod import from global-config.ts
- Remove unused getGlobalConfigPath import from global-config.test.ts
- Rename mockEnsureGlobalConfig to MOCK_ENSURE_GLOBAL_CONFIG for consistency with constant naming conventions
- Update all references to use the new constant name in init-command.test.ts
- Improve code clarity and reduce unnecessary dependencies
- Remove `getLastUsage()` method and `lastUsage` field from BaseEvaluator
- Move token usage from orchestrator result wrapper to nested evaluation result
- Update `RunPromptEvaluationResultSuccess` to remove top-level usage field
- Modify accuracy evaluator to aggregate token usage from claim extraction and base evaluation
- Update orchestrator to access usage from `result.usage` instead of `r.usage`
- Create `ClaimExtractionResult` interface to return claims with optional usage data
- Ensure token usage is properly accumulated and returned within evaluation results
- Simplifies the result structure by keeping usage data closer to the actual evaluation data
feat(cli): add 'init' command for configuration setup
feat(token-usage): Add token usage tracking and cost calculation
* feat: implement bundled rule packs (presets)

- Add presets/ directory with meta.json registry
- Implement PresetLoader and refactor to RulePackLoader
- User rules shadow presets with same name
- Rename Eval/Prompt terminology to Rule for consistency

* Add pack name to rule naming hierarchy

- Make pack field required in PROMPT_FILE_SCHEMA
- Add buildRuleName() helper for consistent rule name construction
- Update rule names to follow PackName.RuleId.CriterionId pattern
- Apply naming to both subjective and semi-objective evaluation paths
- Update loadRuleFile() to require packName parameter
- Use 'Default' pack for legacy loadRules() function in tests

* bug: fix ESM __dirname compatibility in preset loading

- Add fileURLToPath and dirname imports for ESM compatibility
- Define __filename and __dirname using import.meta.url
- Fixes 'ReferenceError: __dirname is not defined' in built output
- Affects commands.ts and validate-command.ts preset loading

* expand VectorLint preset with three new rules targeting common content quality issues

- add pseudo-advice rule that catches vague guidance telling WHAT without explaining HOW
- add repetition rule that identifies redundant sections wasting reader time
- add directness rule that ensures sections immediately answer their header questions
- group violations by criterion in orchestrator for better multi-criterion rule reporting
- add ai-pattern test fixtures for buzzword and negation pattern detection

* fix: correct presets directory path resolution for dist builds

- Add presets/ to package.json files array for npm publishing
- Fix presetsDir path in commands.ts (../../presets → ../presets)
- Fix presetsDir path in validate-command.ts (../../presets → ../presets)

* feat: make RulesPath optional in config

- Make rulesPath optional in config schema (presets can provide rules)
- Remove duplicate validation from config-loader.ts (schema is now single source of truth)
- Update commands.ts and validate-command.ts to handle optional rulesPath
- Update file-resolver.ts to skip rules exclusion when rulesPath is undefined
- Update rule-pack-loader.ts to accept undefined userRulesPath
- Rename vectorlint.ini.example to .vectorlint.ini.example and update with optional RulesPath instructions

* fix: resolve CLI path and remove dead code in validate-command

- Resolve --evals CLI path to absolute before use (config paths already absolute)
- Remove unreachable duplicate prompts.length === 0 check after try block

* chore: fix lint errors
* docs: update documentation to reflect optional RulesPath

- README.md: Simplify Quick Start to 3 steps using bundled preset
- README.md: Remove manual rule creation step (now uses VectorLint preset)
- README.md: Update terminology from 'Global API keys' to 'LLM provider API keys'
- CONFIGURATION.md: Mark RulesPath as optional in settings table
- CONFIGURATION.md: Update example to show RulesPath commented out
- init-command.ts: Reorder next steps to prioritize API key setup

* docs: use '(none)' instead of 'No' for RulesPath default value
…uators (#39)

* feat: implement content chunking and dedicated scoring logic for evaluators

* refactor: replace `pre` and `post` fields with `quoted_text`, `context_before`, and `context_after`

* feat: add line numbering and per-rule chunking control

- Add line numbering before chunking for accurate LLM line reporting
- Add token usage tracking and aggregation across chunks
- Add 'chunking' frontmatter option to disable chunking per-rule

* refactor: improve chunking and scoring code quality

- Use composite key for violation deduplication (quoted_text + description + analysis)
- Remove unused chunking options (overlapFraction, preserveSentences)
- Rename total_count to violation_count, remove meaningless passed_count
- Add warning for array length mismatch in averageSubjectiveScores
- Fix Zod schema ordering (.optional().default(true))

* fix: correct word counting and improve test mocking

* refactor: use consistent deduplication and word counting

- Use composite key (quoted_text + analysis) for subjective violation deduplication
- Use countWords utility instead of inline split for word counting
- Export countWords from chunking module
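The composite-key deduplication these commits describe (quoted_text + analysis) can be sketched briefly. The helper name and separator are mine; only the key fields come from the commit messages:

```typescript
// Illustrative composite-key dedup over violation candidates; dedupeViolations
// and the NUL separator are assumptions, the key fields are from the commits.
function dedupeViolations<T extends { quoted_text: string; analysis: string }>(
  items: T[]
): T[] {
  const seen = new Set<string>();
  return items.filter((v) => {
    const key = `${v.quoted_text}\u0000${v.analysis}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```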

* refactor: use countWords utility consistently in evaluators

* refactor: remove unused chunk offset calculation

* refactor: make countWords automatically strip line number prefixes

* refactor: replace chunking with evaluateAs property
* refactor(config): Extract CLI version and description to constants

- Move package.json version parsing to CLI_VERSION constant in config/constants.ts
- Add PACKAGE_JSON_SCHEMA to schemas/cli-schemas.ts for reusable validation
- Remove duplicate package.json parsing logic from json-formatter.ts
- Add CLI_DESCRIPTION constant with setup instructions and configuration examples
- Update json-formatter.ts to use CLI_VERSION from constants instead of local parsing
- Centralize version management and CLI metadata for easier maintenance

* refactor(cli): Improve CLI help text and use constants for version/description

- Remove unused --debug-json option from main command
- Fix quote escaping in --config option help text
- Import CLI_VERSION and CLI_DESCRIPTION from constants module
- Add comprehensive help text with usage examples
- Display help automatically when no arguments provided
- Consolidate version and description configuration in one place
- Improve user experience with clearer command documentation

* refactor(providers): Remove debugJson option from all providers

- Remove debugJson configuration option from all provider interfaces (Anthropic, Azure OpenAI, Gemini, OpenAI)
- Remove debugJson property initialization and assignments from provider constructors
- Remove debugJson conditional logging blocks that output full JSON responses
- Remove debugJson from ProviderOptions interface and provider factory configuration
- Remove debugJson from CLI schemas and command registration
- Simplify debug output by relying on existing debug and showPrompt options
- Update all provider tests to reflect removal of debugJson parameter
- Reduces configuration complexity and consolidates debug output handling

* refactor(cli): Rename evals option to rules for consistency

* chore: Clean up unwanted whitespaces

* chore: Update vectorlint description
…k/judge (#49)

BREAKING CHANGE: Rule type terminology has changed
- 'semi-objective' → 'check'
- 'subjective' → 'judge'

Backward compatibility is maintained via Zod schema transform.
Existing configs using old type names will continue to work.

Changes:
- Update EvaluationType enum with new JUDGE/CHECK values
- Add Zod transform to map deprecated values to new canonical values
- Rename internal methods to runJudgeEvaluation/runCheckEvaluation
- Update all VectorLint preset rules to use type: check
- Update CREATING_RULES.md, CONFIGURATION.md, README.md documentation
- Fix missing 'pack' property in test fixtures
- Apply npm audit fixes for glob and js-yaml vulnerabilities
- Bump version to 2.2.0
* chore(constants): Add style guide configuration constants

- Add STYLE_GUIDE_FILENAME constant for default style guide file name
- Add STYLE_GUIDE_TOKEN_WARNING_THRESHOLD constant for token limit warnings
- These constants support the style guide feature for content evaluation

* feat(config-loader): Support zero-config mode with style guide detection

- Import STYLE_GUIDE_FILENAME constant for style guide file detection
- Update resolveConfigPath return type to string | undefined instead of throwing error
- Return undefined when no config file is found instead of throwing ConfigError
- Implement zero-config mode that detects VECTORLINT.md style guide file
- Return default configuration with StyleGuide rule when style guide exists
- Preserve original error handling for cases where neither config nor style guide exists
- Enables users to run vectorlint without explicit configuration file when style guide is present

* feat(style-guide-loader): Add style guide loading with token estimation

- Create new StyleGuideLoader boundary module for loading VECTORLINT.md files
- Implement estimateTokens utility function using 4 characters per token approximation
- Add loadStyleGuide function to read style guide from specified directory
- Include token count warnings when style guide exceeds recommended threshold
- Return StyleGuideResult interface with content, token estimate, and file path
- Handle file not found and read errors gracefully with appropriate warnings
- Enables zero-config mode to automatically detect and load project style guides
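The 4-characters-per-token approximation mentioned above is a one-liner. A minimal sketch; the rounding choice (ceil) is an assumption, the ratio comes from the commit message:

```typescript
// Sketch of the commit's estimateTokens approximation: roughly one token per
// 4 characters. Ceiling rounding is my assumption, not stated in the commit.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```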

* feat(cli): Integrate style guide loading into command execution

- Add style guide loading from VECTORLINT.md file during command initialization
- Pass loaded style guide content to DefaultRequestBuilder for prompt construction
- Implement zero-config mode that creates synthetic style guide prompt when no rules are configured but VECTORLINT.md exists
- Add createStyleGuidePrompt helper function to generate style guide compliance evaluation prompts
- Update request builder to prepend style guide content to prompt body with proper formatting
- Add verbose logging for style guide loading and zero-config mode activation
- Improve error messaging to distinguish between missing rules and missing style guide scenarios

* feat(cli): Add style guide initialization to init command

- Import STYLE_GUIDE_FILENAME constant for style guide file reference
- Add STYLE_GUIDE_TEMPLATE with default writing style, terminology, and tone guidelines
- Add --quick option to create only style guide for zero-config usage
- Add --full option to create both config and style guide files
- Refactor file existence checks to handle both config and style guide files
- Update file creation logic to conditionally write config and style guide based on options
- Enhance success output to display both created files
- Update next steps guidance to include style guide editing instructions
- Improve user experience by supporting multiple initialization workflows

* test(style-guide): Add comprehensive tests for init command and zero-config

- Add init-command.test.ts with tests for default, --quick, and --full modes
- Test backward compatibility with .vectorlint.ini creation
- Test VECTORLINT.md style guide file creation with --quick flag
- Test simultaneous creation of both config files with --full flag
- Add tests for --force flag to overwrite existing files
- Add tests to verify existing files are respected without --force
- Add zero-config.test.ts with tests for style guide detection
- Test default config generation when only VECTORLINT.md exists
- Test preference for .vectorlint.ini when both files are present
- Test error handling when neither configuration file exists
- Ensure StyleGuide rule is applied in zero-config mode

* docs: Add VECTORLINT.md style guide documentation and update quick start

- Add VECTORLINT.md to .gitignore to prevent accidental commits
- Document global style guide feature in CONFIGURATION.md with zero-config and combined modes
- Explain automatic style guide detection and synthetic rule creation
- Add token limit warning for large VECTORLINT.md files
- Restructure README quick start section with numbered steps
- Add zero-config mode instructions with --quick flag
- Clarify full configuration setup and API key configuration steps
- Improve readability with clearer hierarchy and examples

* chore: Add txt file format for VECTORLINT.md

* fix(config-loader,cli): Validate default config and improve style guide prompt

- Add CONFIG_SCHEMA validation to default config in zero-config mode to ensure type safety
- Update style guide prompt body text to clarify evaluation against attached style guide rules
- Fix commander import pattern in tests from deprecated `program` to `Command` constructor
- Improve consistency across all test cases for style guide initialization

* chore: Update config error to include VECTORLINT.md

* chore: clean up

* docs: Clarify style guide setup and remove unnecessary validation

- Update CONFIGURATION.md to clarify that style guide evaluates all content types, not just markdown files
- Add prerequisite note in README.md about credential setup before running checks
- Remove unused zod schema validation from config-loader.ts resolveConfigPath function
- Simplify config path resolution by removing redundant type checking

* docs: Clean up readme

* chore: Update readme

* chore: Remove unwanted comments
* refactor(prompts): rename schema types from subjective/semi-objective to judge/check

- Rename `buildSubjectiveLLMSchema()` to `buildJudgeLLMSchema()`
- Rename `buildSemiObjectiveLLMSchema()` to `buildCheckLLMSchema()`
- Rename `SubjectiveLLMResult` type to `JudgeLLMResult`
- Rename `SemiObjectiveLLMResult` type to `CheckLLMResult`
- Rename `SubjectiveResult` type to `JudgeResult`
- Rename `SemiObjectiveItem` type to `CheckItem`
- Rename `SemiObjectiveResult` type to `CheckResult`
- Rename `isSubjectiveResult()` function to `isJudgeResult()`
- Rename `isSemiObjectiveResult()` function to `isCheckResult()`
- Update `EvaluationResult` union type to use new type names
- Align schema naming with recent rule type terminology changes

* refactor(schema): rename subjective/semi-objective types to judge/check

- Rename SemiObjectiveItem to CheckItem in chunking/merger.ts
- Rename SubjectiveResult to JudgeResult in cli/orchestrator.ts and cli/types.ts
- Rename SemiObjectiveResult to CheckResult in evaluators
- Rename buildSubjectiveLLMSchema to buildJudgeLLMSchema
- Rename buildSemiObjectiveLLMSchema to buildCheckLLMSchema
- Rename SubjectiveLLMResult to JudgeLLMResult
- Rename SemiObjectiveLLMResult to CheckLLMResult
- Rename calculateSubjectiveScore to calculateJudgeScore
- Rename calculateSemiObjectiveScore to calculateCheckScore
- Rename averageSubjectiveScores to averageJudgeScores
- Rename isSubjectiveResult to isJudgeResult
- Update all comments and documentation to use judge/check terminology
- Update test files to reflect new type names
- Ensures consistent naming convention across codebase following recent rule type refactoring

* chore: Rename EvalutionResult to PromptEvaluationResult

* test(scoring-types): Update check evaluation test identifiers

- Rename test prompt id from "test-semi" to "test-check"
- Update prompt metadata id to match new naming convention
- Change prompt name from "Test Semi" to "Test Check"
- Align test identifiers with refactored check/judge type naming
* feat: add fix field support to violation reporting

* feat: add fix field support to Judge and Check schemas

* feat: add fix field support to JSON formatter

* feat(prompts): add fix field to directive loader schema

* feat(scoring): add fix field support to violation reporting and scoring
* refactor: rename style guide to user instructions

- Rename STYLE_GUIDE_FILENAME to USER_INSTRUCTION_FILENAME in constants
- Create user-instruction-loader.ts to replace style-guide-loader.ts
- Update all imports and variable names across codebase
- Remove unnecessary prompt body for VECTORLINT.md evaluation
- VECTORLINT.md content now used directly from system prompt

* refactor: centralize evaluation instructions in directive

- Rewrite directive with structured Role/Task/Instructions format
- Remove redundant INSTRUCTION sections from bundled presets
- Style guide now runs alongside rules (not fallback-only)
- Fix zero-config test expectation for empty runRules array
- Remove obsolete AGENTS.md architectural document

* test: fix imports and expectations after terminology refactor

- Replace STYLE_GUIDE_FILENAME with USER_INSTRUCTION_FILENAME
- Update template content expectations to use 'User Instructions'
- Fix request-builder directive order test expectation

* docs: update test description and terminology in CLAUDE.md

- Fix request-builder test description: 'appends' → 'prepends'
- Update CLAUDE.md terminology: 'Style Guide' → 'User Instructions'
- Remove 'Security & Configuration Tips' section from CLAUDE.md
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…er interface (#62)

* chore: migrate to Vercel AI SDK and update dependencies

- Replace individual AI provider SDKs (@anthropic-ai/sdk, @google/generative-ai, @perplexity-ai/perplexity_ai, openai) with unified @ai-sdk packages
- Add @ai-sdk/anthropic, @ai-sdk/google, @ai-sdk/openai, @ai-sdk/perplexity, and @ai-sdk/azure as dependencies
- Add ai package (^4.0.0) as core dependency for unified AI SDK
- Remove direct openai dependency in favor of @ai-sdk/openai
- Update package-lock.json with new dependency tree and resolved versions
- Bump version to 2.3.0

* feat(providers): migrate to Vercel AI SDK with unified provider interface

- Replace individual provider implementations (OpenAI, Azure, Anthropic, Gemini) with unified VercelAIProvider using @ai-sdk packages
- Migrate Perplexity search provider to use Vercel AI SDK's generateText with boundary validation for source data
- Update provider factory to instantiate models through Vercel AI SDK factory functions instead of provider-specific classes
- Add VercelAIConfig interface and remove provider-specific config types (AzureOpenAIConfig, AnthropicConfig, OpenAIConfig, GeminiConfig)
- Export LLMResult, SearchProvider, PerplexitySearchProvider, TokenUsage utilities, and ProviderType from providers index
- Remove api-client export from boundaries index as it's no longer needed
- Simplify provider instantiation by consolidating configuration handling into VercelAIProvider
- Add Zod schema for Perplexity source boundary validation to safely extract provider-specific fields

* refactor: remove legacy provider implementations and migrate to Vercel AI SDK

- Remove custom provider implementations (Anthropic, OpenAI, Azure OpenAI, Gemini)
- Delete API client boundary layer and response validation schemas
- Remove provider-specific test files (anthropic-e2e, anthropic-provider, openai-provider)
- Consolidate to unified Vercel AI SDK interface for all LLM providers
- Simplify codebase by eliminating duplicate validation and schema logic

* test: migrate Perplexity provider tests to Vercel AI SDK

- Update PerplexitySearchProvider tests to use Vercel AI SDK's generateText instead of native Perplexity SDK
- Replace @perplexity-ai/perplexity_ai mocks with @ai-sdk/perplexity and ai package mocks
- Add API key validation tests and environment variable handling
- Expand test coverage for edge cases including empty queries, missing fields, and result limiting
- Update provider-factory tests to reflect new SDK integration
- Add comprehensive Vercel AI provider tests for unified interface compatibility
- Improve test assertions and error handling validation

* fix(cli): resolve presets directory for both dev and built modes

- Add resolvePresetsDir() function to handle dual path resolution
- Check built mode path first (dist/ → ../presets)
- Fall back to dev mode path if meta.json not found (src/cli/ → ../../presets)
- Update registerMainCommand to use new resolver function
- Fixes preset loading failures when running from different build contexts

* fix(cli): validate presets directory exists before returning path

- Add existence check for meta.json in dev mode presets directory
- Throw descriptive error if presets directory cannot be located
- Improve error messaging to show both build and dev paths checked
- Prevent silent failures when presets directory is misconfigured

* fix(providers): add debug logging for Perplexity source validation failures

* refactor(schemas): export default provider configurations

- Export AZURE_OPENAI_DEFAULT_CONFIG as public constant
- Export ANTHROPIC_DEFAULT_CONFIG as public constant
- Export OPENAI_DEFAULT_CONFIG as public constant
- Export GEMINI_DEFAULT_CONFIG as public constant
- Fix indentation in GLOBAL_CONFIG_SCHEMA object definition
- Enable reuse of default configurations across modules

* feat(providers): add maxTokens support and improve schema conversion

- Add maxTokens configuration option to VercelAIConfig interface
- Support maxTokens parameter for Anthropic provider when configured
- Improve JSON Schema to Zod schema conversion with nullable type handling
- Normalize type arrays and filter out null values for proper type detection
- Replace nullable() with optional() for object properties to match Vercel AI SDK expectations
- Add strict mode support for objects with additionalProperties: false
- Add type casting for Azure OpenAI model and experimental_output to resolve type issues
- Conditionally include temperature and maxTokens in generateText call only when defined

* test: improve mock setup and type safety in provider tests

- Migrate mock declarations to vi.hoisted() for proper factory scope in Perplexity and Vercel AI provider tests
- Consolidate default config imports from env-schemas in env-parser test
- Rename MOCK_RESULTS to MOCK_SOURCES and EXPECTED_RESULTS for clarity on data transformation
- Add non-null assertions (!) for array element access in edge case tests
- Add type assertions for complex mock objects to satisfy TypeScript strict mode
- Improve mock implementation formatting and add afterEach cleanup in debug tests
- Update comments to clarify mock setup patterns and data shape expectations

* fix(providers): remove model defaults and improve enum schema handling

- Remove hardcoded model defaults from provider factory (Azure, Anthropic, OpenAI, Google) to require explicit configuration
- Add type-safe enum schema handling for mixed and non-string enum values using Zod union types
- Fix Perplexity provider tests to use correct field names (text/publishedDate instead of snippet/date)
- Improve test isolation by preserving and restoring process.env in Perplexity provider tests
- Add RequestBuilder type import and remove unsafe type assertions in VercelAIProvider tests
- Add explanatory comment for Azure provider type casting workaround

* fix(providers): improve schema conversion for union and nullable types

- Handle multi-type unions (e.g. ['string', 'number']) by building Zod union schemas
- Properly normalize type arrays by filtering 'null' and handling edge cases
- Support nullable types by tracking nullability separately from type array
- Add test coverage for union type arrays and nullable type handling
- Clarify test expectations for maxResults configuration behavior
- Add explanatory comments for schema validation edge cases
* feat(eval): Add gate-check structured output and --debug-json instrumentation

- Add gate checks (rule_supports_claim, evidence_exact,
  context_supports_violation, plausible_non_violation, fix_is_drop_in,
  fix_preserves_meaning) to both check and judge LLM output schemas
- Require reasoning, confidence, rule_quote, and line fields in
  structured output for improved inspectability
- Add --debug-json CLI flag that writes per-run artifacts to
  .vectorlint/runs/ with raw model output and filter decisions
- Add src/debug/run-artifact.ts and violation-filter.ts for
  deterministic gating and artifact serialization
- Attach raw_model_output to evaluator results for debug access
- Update directive to reflect raw-candidates + gate-checks approach
- Add tests/evaluations/ with test rules and config for manual evals
- Ignore .vectorlint/runs/ in git
- Add tests/evaluations/README.md with instructions for running
  manual evaluations, switching models, and inspecting debug artifacts
- Add docs/pr-description.md for the gate-check PR
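As a rough illustration of the deterministic gating this commit describes, the filter might look like the sketch below. The gate field names come from the commit message; the surrounding types, the 0.75 confidence cutoff, and the reason strings are assumptions, not the actual `src/debug/violation-filter.ts`.

```typescript
// Sketch of the deterministic gate filter described above. Gate names come
// from the commit message; the exact types, threshold, and reason strings
// in src/debug/violation-filter.ts may differ.
type GateChecks = {
  rule_supports_claim: boolean;
  evidence_exact: boolean;
  context_supports_violation: boolean;
  plausible_non_violation: boolean;
  fix_is_drop_in: boolean;
  fix_preserves_meaning: boolean;
};

type ViolationCandidate = {
  line: number;
  rule_quote: string;
  confidence: number;
  fix?: string;
  checks: GateChecks;
};

type FilterDecision = { surface: boolean; reasons: string[] };

function computeFilterDecision(v: ViolationCandidate): FilterDecision {
  const reasons: string[] = [];
  const c = v.checks;

  // Positive gates must all hold; the "plausible non-violation" escape
  // hatch must not.
  if (!c.rule_supports_claim) reasons.push("rule_supports_claim=false");
  if (!c.evidence_exact) reasons.push("evidence_exact=false");
  if (!c.context_supports_violation) reasons.push("context_supports_violation=false");
  if (c.plausible_non_violation) reasons.push("plausible_non_violation=true");

  // Fix gates only apply when a fix was actually proposed.
  const fixEmpty = (v.fix ?? "").trim() === "";
  if (!fixEmpty && !c.fix_is_drop_in) reasons.push("fix_is_drop_in=false");
  if (!fixEmpty && !c.fix_preserves_meaning) reasons.push("fix_preserves_meaning=false");

  if (v.confidence < 0.75) reasons.push("confidence<0.75");

  return { surface: reasons.length === 0, reasons };
}
```

No second model pass is involved: the booleans the model already committed to during generation fully determine the surface/hide decision.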

coderabbitai bot commented Mar 2, 2026

📝 Walkthrough


This PR introduces a gate-check structured output feature enabling deterministic violation filtering and debug artifact generation. It adds GateChecks and GateCheckNotes types to evaluation results, implements a violation-filter module that surfaces/hides violations based on gate boolean flags, exposes raw LLM outputs, and provides a --debug-json CLI flag to write per-run artifacts to .vectorlint/runs/. The directive is updated to request gate checks alongside raw findings.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration<br>`.gitignore` | Adds `.vectorlint/runs/` to ignored paths for debug artifact storage. |
| Documentation<br>`docs/pr-description.md`, `tests/evaluations/README.md`, `tests/evaluations/TEST_FILE.md` | Comprehensive documentation of the gate-check feature, debug artifact structure, evaluation workflows, and a test file template with MDX-style components. |
| CLI & Type Definitions<br>`src/cli/commands.ts`, `src/cli/types.ts`, `src/schemas/cli-schemas.ts` | Adds the `--debug-json` CLI flag, propagates the `debugJson` option through `EvaluationOptions` and `EvaluationContext`, and extends the CLI schema. |
| Core Debug Infrastructure<br>`src/debug/run-artifact.ts`, `src/debug/violation-filter.ts` | Introduces the `DebugRunArtifact` type and `writeDebugRunArtifact()` to persist artifacts; defines the `FilterDecision` type and `computeFilterDecision()` for deterministic gate-based violation filtering with surface/hide logic and reason collection. |
| Orchestration & Model Output<br>`src/cli/orchestrator.ts` | Adds the `debugJson` pathway across the evaluation flow; generates run artifacts with model info, `raw_model_output`, filter decisions, and surfaced violations; writes artifacts and debug logs when `debugJson` is enabled. |
| Schema & Evaluation Types<br>`src/prompts/schema.ts` | Introduces `GateChecks` and `GateCheckNotes` types; extends `JudgeLLMResult`, `CheckLLMResult`, `JudgeResult`, and `CheckResult` to include `line`, `description`, `rule_quote`, `checks`, `check_notes`, `confidence`, and `raw_model_output` in violation objects. |
| Directive & Prompting<br>`src/prompts/directive-loader.ts` | Updates the `DEFAULT_DIRECTIVE` role, instructions, and output format to request a two-output model (raw findings + gate checks) with explicit formatting, failure/pass semantics, and hard constraints. |
| Evaluator Output<br>`src/evaluators/base-evaluator.ts` | Exposes `raw_model_output` in `JudgeResult` and `CheckResult` (single `LLMResult` or an array for multi-chunk); aggregates per-chunk reasoning strings into a single `reasoning` field in `CheckResult`. |
| Violation Aggregation<br>`src/scoring/scorer.ts` | Updates `calculateCheckScore` and `averageJudgeScores` to preserve expanded violation shapes (`line`, `description`, `rule_quote`, `checks`, `check_notes`, `confidence`) and collects raw violations without field reconstruction. |
| Test Infrastructure<br>`tests/evaluations/.vectorlint.ini`, `tests/evaluations/test-rules/Test/*` | Adds evaluation configuration and five test rules (Clarity, Consistency, Passive Voice, Readability, Wordiness), each with YAML metadata and check criteria. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI
    participant Orch as Orchestrator
    participant Eval as Evaluator
    participant Filter as Filter
    participant Writer as Run Artifact Writer

    CLI->>Orch: evaluateFiles({ debugJson: true })
    Orch->>Eval: runEvaluation(prompt, model)
    Eval-->>Orch: CheckResult/JudgeResult + raw_model_output

    rect rgba(100, 150, 255, 0.5)
    Note over Orch,Writer: Debug Artifact Generation (when debugJson=true)
    Orch->>Filter: computeFilterDecision(violation)
    Filter-->>Orch: FilterDecision { surface, reasons[] }
    Orch->>Writer: writeDebugRunArtifact(runId, artifact)
    Writer-->>Orch: artifact file path
    Orch->>CLI: log debug message with path
    end

    CLI-->>CLI: Return aggregated results
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested reviewers

  • ayo6706

Poem

🐰 Gate checks and filters, oh what a sight!
Raw outputs dancing in artifact light,
Debug flags waving through orchestration's flow,
Violations surfaced with reasons to know,
A structured tale of clarity and might! ✨

🚥 Pre-merge checks (✅ 2 passed | ❌ 1 failed)

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.67%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: introducing gate-check structured output and the `--debug-json` instrumentation flag, which are the central features of this pull request. |




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/evaluators/base-evaluator.ts (1)

150-177: ⚠️ Potential issue | 🟠 Major

Capture raw_model_output only when debug collection is enabled.

Lines 150-177 and 202-239 accumulate and return full chunk-level LLM payloads on every run. That adds avoidable memory/serialization overhead in non-debug execution. Please gate raw capture/attachment behind a debug option propagated from the CLI/orchestrator.

Also applies to: 202-239

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluators/base-evaluator.ts` around lines 150 - 177, The code
unconditionally collects and returns chunk-level LLM payloads (rawChunkOutputs
and raw_model_output) which should be gated by a debug flag; add a debug/config
boolean (e.g., this.collectDebug or this.options.debug) and modify the loop
around llmProvider.runPromptStructured so you only push llmResult into
rawChunkOutputs when that flag is true, and change the returned object to
include raw_model_output only when the flag is enabled; apply the same gating to
the similar block referenced (lines ~202-239) so both capture and attachment of
raw_model_output are conditional.
src/prompts/schema.ts (1)

50-50: ⚠️ Potential issue | 🟡 Minor

Harden numeric constraints for line and confidence.

Lines 50 and 142 currently accept any number for `line`, and lines 99 and 191 accept unbounded `confidence`. Use integer/minimum constraints for line numbers and a [0, 1] range for confidence.

Proposed schema hardening
- line: { type: "number" },
+ line: { type: "integer", minimum: 1 },

- confidence: { type: "number" },
+ confidence: { type: "number", minimum: 0, maximum: 1 },

Also applies to: 99-99, 142-142, 191-191

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/prompts/schema.ts` at line 50, The schema currently allows unbounded
numbers for the properties named "line" and "confidence"; update each schema
definition that declares line: { type: "number" } to use an integer with a
sensible minimum (e.g., line: { type: "integer", minimum: 1 }) and update each
confidence declaration to enforce a [0,1] bound (e.g., confidence: { type:
"number", minimum: 0, maximum: 1 }); apply these changes to every occurrence of
the "line" and "confidence" properties found in this file so the validators (the
schema objects that define "line" and "confidence") enforce integer line numbers
and bounded confidence values.
🧹 Nitpick comments (6)
src/prompts/directive-loader.ts (1)

35-40: Minor duplication in hard constraints.

Lines 36-37 repeat constraints already stated in lines 20-21 ("Do NOT invent evidence" and "Every quoted span must be copied exactly"). While repetition in LLM prompts can reinforce important rules, consider whether both occurrences are intentional for emphasis or could be consolidated.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/prompts/directive-loader.ts` around lines 35 - 40, The "Hard constraints"
block contains duplicate rules ("Do NOT invent evidence." and "Every quoted span
must be copied exactly...")—edit the prompt template in directive-loader.ts to
remove the repeated lines under the "Hard constraints" section so each
constraint appears only once (keep the first occurrence for emphasis), update
the surrounding string/template accordingly, and verify any code that references
that template (e.g., the directive or prompt-building variable containing the
"Hard constraints" text) still produces the expected prompt output.
src/cli/orchestrator.ts (2)

757-792: Same recommendation: wrap debug artifact writing in try/catch for Judge path.

Applies the same defensive error handling suggestion as the Check path to prevent evaluation failures from debug artifact I/O issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/orchestrator.ts` around lines 757 - 792, The debug JSON write in the
Judge path (guarded by debugJson) can throw and should be wrapped in a
try/catch: surround the block that computes runId, flat (from result.criteria
and computeFilterDecision), model (getModelInfoFromEnv) and the call to
writeDebugRunArtifact so any I/O/error is caught; on error call
processLogger.warn or console.warn with a clear message including the error
details and continue without failing the evaluation. Ensure the catch only
handles the debug-artifact logic (do not swallow other errors) and preserve the
existing console.warn(`[vectorlint] Debug JSON written: ${filePath}`) on
success.

670-695: Consider handling potential write failures for debug artifacts.

writeDebugRunArtifact performs synchronous file I/O (mkdirSync, writeFileSync) that can throw exceptions (e.g., permission denied, disk full). Since this is in the main evaluation flow, an unhandled exception would crash the evaluation. Consider wrapping in a try/catch to log a warning and continue evaluation gracefully.

🛡️ Proposed defensive wrapper
     if (debugJson) {
+      try {
         const runId = randomUUID();
         const decisions = result.violations.map((v) => computeFilterDecision(v));
         // ... rest of artifact logic ...
         console.warn(`[vectorlint] Debug JSON written: ${filePath}`);
+      } catch (e: unknown) {
+        console.warn(`[vectorlint] Failed to write debug artifact: ${e instanceof Error ? e.message : String(e)}`);
+      }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/orchestrator.ts` around lines 670 - 695, The debug artifact write
block guarded by debugJson should be made exception-safe: wrap the call sequence
that generates runId and calls writeDebugRunArtifact (including uses of
randomUUID(), decisions/surfaced computation, and console.warn) in a try/catch
so that any synchronous file I/O errors do not crash evaluation; on catch, log a
non-fatal warning including the caught error (e.g., via console.warn or
processLogger) and continue without rethrowing so the rest of the evaluation
proceeds normally.
src/debug/violation-filter.ts (1)

28-43: Missing fix_empty in reasons when fix is empty.

The fixEmpty variable is computed (line 30) and used in the surface decision (line 38), but no corresponding reason is added when fix is empty. This asymmetry makes it harder to understand why a violation was not surfaced.

♻️ Proposed fix to add fix_empty reason
   if (typeof v.confidence !== "number" || v.confidence < 0.75) reasons.push("confidence<0.75");
 
   const fixEmpty = (v.fix ?? "").trim() === "";
+  if (fixEmpty) reasons.push("fix_empty");
 
   const surface =
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/debug/violation-filter.ts` around lines 28 - 43, The code computes
fixEmpty but never records it in reasons, so add a symmetric reason when the fix
is empty: after computing const fixEmpty = (v.fix ?? "").trim() === ""; check if
(fixEmpty) and push a descriptive token (e.g., "fix_empty") into the reasons
array; ensure you reference the existing variables used here (fixEmpty, reasons,
v) so the new push occurs before the surface decision that relies on fixEmpty.
src/debug/run-artifact.ts (1)

34-62: Consider documenting or validating runId format for filename safety.

The runId is used directly in the filename (${runId}.json). While UUID format from randomUUID() is safe, if this function is ever called with user-provided IDs, it could lead to path traversal or invalid filenames. Consider either documenting the contract that runId must be a safe filename segment or applying sanitizePathSegment to it as well.

🛡️ Optional defensive sanitization
-  const filePath = path.join(dir, `${runId}.json`);
+  const safeRunId = sanitizePathSegment(runId);
+  const filePath = path.join(dir, `${safeRunId}.json`);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/debug/run-artifact.ts` around lines 34 - 62, The filename uses runId
directly which can allow unsafe segments; in writeDebugRunArtifact either
validate runId against a safe pattern (e.g., UUID or
alphanumeric/dash/underscore) or apply sanitizePathSegment before using it in
the filename and any path joins (e.g., when building filePath and any other path
operations); update the code to use the sanitizedRunId (or assert the contract)
so path traversal/invalid filename risks are eliminated while keeping existing
sanitizePathSegment for artifact.subdir.
src/prompts/schema.ts (1)

59-98: Extract shared gate schemas to avoid drift.

Lines 59-98 and 151-190 duplicate the same checks and check_notes schema blocks. Pull these into shared constants and reuse them in both builders so gate-contract updates happen in one place.

Also applies to: 151-190

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/prompts/schema.ts` around lines 59 - 98, The two identical schema objects
named checks and check_notes should be extracted into shared constants (e.g.,
gateChecksSchema and gateCheckNotesSchema) and reused in both builders instead
of duplicating the inline objects; locate the inline definitions for checks and
check_notes (the objects with additionalProperties:false, properties
{...boolean/string...} and the required arrays) and replace them with references
to the new constants so the properties, additionalProperties and required arrays
are preserved and future changes only need to be made in one place.
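The shared-constant extraction suggested above might look like this sketch. The gate key list follows the review comment; the constant names and the exact schema shapes in `src/prompts/schema.ts` are assumptions:

```typescript
// Sketch: shared gate-check schema constants, reused by both the check and
// judge schema builders. Names and exact shapes are hypothetical.
const GATE_CHECK_KEYS = [
  "rule_supports_claim",
  "evidence_exact",
  "context_supports_violation",
  "plausible_non_violation",
  "fix_is_drop_in",
  "fix_preserves_meaning",
] as const;

// Boolean gate values.
const gateChecksSchema = {
  type: "object",
  additionalProperties: false,
  properties: Object.fromEntries(
    GATE_CHECK_KEYS.map((k) => [k, { type: "boolean" }] as const)
  ),
  required: [...GATE_CHECK_KEYS],
};

// Free-text notes keyed by the same gates.
const gateCheckNotesSchema = {
  type: "object",
  additionalProperties: false,
  properties: Object.fromEntries(
    GATE_CHECK_KEYS.map((k) => [k, { type: "string" }] as const)
  ),
  required: [...GATE_CHECK_KEYS],
};

// Both builders would then reference the constants instead of inlining:
//   checks: gateChecksSchema,
//   check_notes: gateCheckNotesSchema,
```

Deriving both schemas from one key list also guarantees `checks` and `check_notes` can never drift apart.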

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4c55ab and c8af0d9.

📒 Files selected for processing (20)
  • .gitignore
  • docs/pr-description.md
  • src/cli/commands.ts
  • src/cli/orchestrator.ts
  • src/cli/types.ts
  • src/debug/run-artifact.ts
  • src/debug/violation-filter.ts
  • src/evaluators/base-evaluator.ts
  • src/prompts/directive-loader.ts
  • src/prompts/schema.ts
  • src/schemas/cli-schemas.ts
  • src/scoring/scorer.ts
  • tests/evaluations/.vectorlint.ini
  • tests/evaluations/README.md
  • tests/evaluations/TEST_FILE.md
  • tests/evaluations/test-rules/Test/clarity.md
  • tests/evaluations/test-rules/Test/consistency.md
  • tests/evaluations/test-rules/Test/passive-voice.md
  • tests/evaluations/test-rules/Test/readability.md
  • tests/evaluations/test-rules/Test/wordiness.md

Comment on lines +55 to +63
# [vectorlint] Debug JSON written: .vectorlint/runs/<uuid>.json
# Verify the artifact contains raw_model_output.violations[*].line and gate fields
ls .vectorlint/runs/
```

## Expected artifact output

When `--debug-json` is passed, a file is written to `.vectorlint/runs/<model-tag>/<run_id>.json`. Example (one violation, truncated):


⚠️ Potential issue | 🟡 Minor

Align the debug artifact path examples.

Line 55 shows `.vectorlint/runs/<uuid>.json`, while line 62 shows `.vectorlint/runs/<model-tag>/<run_id>.json`. Please make these consistent so verification targets the correct location.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/pr-description.md` around lines 55 - 63, The debug artifact path
examples are inconsistent: replace the shorter `.vectorlint/runs/<uuid>.json`
example with the detailed `.vectorlint/runs/<model-tag>/<run_id>.json` form so
all references match; update the header text, the `ls .vectorlint/runs/` example
(or guidance) and the subsequent "When `--debug-json` is passed..." sentence to
consistently use `.vectorlint/runs/<model-tag>/<run_id>.json` (ensure any
mentions of `<uuid>` are changed to `<model-tag>/<run_id>` and keep the rest of
the example text intact).

@@ -0,0 +1,4 @@
RulesPath = ./builtin

⚠️ Potential issue | 🟠 Major

Fix RulesPath to target the added test rule pack.

Line 1 points to ./builtin, but this PR’s test rules are under ./test-rules/Test. With RunRules = Test, this can prevent the intended rules from being loaded in manual evaluations.

🔧 Proposed fix
-RulesPath = ./builtin
+RulesPath = ./test-rules

Based on learnings: Organize custom rules into subdirectories (packs) within RulesPath.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/evaluations/.vectorlint.ini` at line 1, Update the RulesPath in the
.vectorlint.ini so it points to the test rule pack directory (set RulesPath =
./test-rules) to ensure the pack named by RunRules = Test (located under
./test-rules/Test) is discoverable; modify the RulesPath entry (the RulesPath
setting in .vectorlint.ini) accordingly so the custom rules are loaded during
manual evaluations.

- Fix RulesPath in tests/evaluations/.vectorlint.ini (./builtin →
  ./test-rules) so manual evaluation runs don't abort on missing path
- Guard writeDebugRunArtifact calls in orchestrator with try/catch;
  filesystem errors now warn instead of interrupting result handling
- Fix artifact path examples in pr-description.md to consistently
  use .vectorlint/runs/<model-tag>/<run_id>.json
oshorefueled force-pushed the feat/eval-gate-checks branch from c8af0d9 to a08cd94 on March 2, 2026 at 03:14.
