GitHub - bugsbunny88/modaic-cli

What Modaic is today

Modaic is infrastructure for teams building AI systems with LLMs, often with DSPy. It focuses on:

Managing judges and evaluations for AI systems
- Run LLM judges on your agents' outputs
- Measure how good those judges actually are
- Get confidence scores and calibration so you know when to trust them
Data & eval plumbing around those judges
- Store and version your evaluation data
- Run experiments and compare different prompts/agents/judges
- Keep evaluation runs repeatable and production-grade

Modaic is built for teams developing agents and DSPy programs who care about reliability, determinism, and safety, not just throwing prompts at an API.

Long-term direction

The "reliability layer" for AI systems
- Define what "good" means for an AI system, evaluate it continuously, and attach calibrated confidence scores and safety checks
- Works with DSPy, but not limited to DSPy.
A full evaluation & data-ops platform
- Host and manage: datasets, eval suites, benchmarks, LLM judges + their training/fine-tuning
- Automate: data labeling / relabeling with LLM judges, QA/QC loops on large, messy datasets
Enterprise-grade, self-hostable AI infra
- The same capabilities offered as a self-hosted / on-prem product that runs inside the company's own VPC or hardware and satisfies SOC 2, HIPAA, and other compliance constraints
- Priced as a high-value enterprise license.
A bridge between cutting-edge AI and legacy industries
- Start with AI-native companies as design partners, then take the same evaluation + reliability stack into finance, energy/heavy industry, legal, healthcare, and compliance
- These industries have large volumes of data, heavy regulation, and a need for "deterministic-feeling" AI.

One line:

Modaic is building the evaluation, confidence, and data infrastructure that lets serious teams ship AI agents and DSPy programs they can actually trust, and packaging it so enterprises can run it in production on their own data.

The CLI

modaic-cli is the operator interface for Modaic's reliability workflows. Teams and CI pipelines use it to run evaluations, manage judges, enforce quality gates, and ship programs to the Modaic Hub.

What exists today

| Command group | What it does |
| --- | --- |
| `modaic program` | Load, save, inspect, and push precompiled DSPy programs |
| `modaic hub` | Search the Modaic Hub, authenticate |
| `modaic batch` | Submit, poll, and fetch results from multi-provider batch jobs (OpenAI, Anthropic, Azure, Vertex, Together, Fireworks) |
| `modaic optimize` | Run prompt optimization (GEPA, bootstrap few-shot, vanilla few-shot) |

All commands support --json for machine-readable output. Exit codes are stable and documented (0 success, 1 error, 2 usage, 3 auth, 4 not found, 5 network, 130 interrupt).
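
As an illustration of how a CI script might consume these exit codes, here is a minimal Python sketch. The numeric values come from the list above; the `ExitCode` enum, the `classify` helper, and the choice to treat only network failures as retryable are assumptions made for this example, not part of modaic-cli.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as documented above; the member names are illustrative."""
    SUCCESS = 0
    ERROR = 1
    USAGE = 2
    AUTH = 3
    NOT_FOUND = 4
    NETWORK = 5
    INTERRUPT = 130


# Assumption: only network failures are worth retrying in CI.
RETRYABLE = {ExitCode.NETWORK}


def classify(returncode: int) -> str:
    """Map a process return code to a coarse CI action."""
    try:
        code = ExitCode(returncode)
    except ValueError:
        return "unknown"  # not one of the documented codes
    if code is ExitCode.SUCCESS:
        return "ok"
    if code in RETRYABLE:
        return "retry"
    return "fail"
```

A pipeline could pass the `returncode` from `subprocess.run([...])` to `classify` and branch on the result.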

What we're building next

The CLI is evolving into a reliability-first command surface. Work is tracked in TODO.md and sequenced in four phases:

Phase 0 — Contract hardening. Unified JSON envelope (schema_version, ok, command, data, error, meta) across every command. Structured error payloads with stable codes and recovery hints. Conformance tests.
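
To make the envelope concrete, the sketch below builds one in Python. Only the six top-level field names (schema_version, ok, command, data, error, meta) come from the description above. The field types, the `"1"` version string, and the `code` / `message` / `hint` keys inside `error` are assumptions chosen for illustration and may not match the final contract.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional

SCHEMA_VERSION = "1"  # assumed value; the real version scheme is not specified


@dataclass
class Envelope:
    command: str
    ok: bool
    data: Optional[dict[str, Any]] = None
    error: Optional[dict[str, Any]] = None
    meta: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # Emit keys in the order listed in the Phase 0 description.
        payload = {
            "schema_version": SCHEMA_VERSION,
            "ok": self.ok,
            "command": self.command,
            "data": self.data,
            "error": self.error,
            "meta": self.meta,
        }
        return json.dumps(payload)


def success(command: str, data: dict[str, Any]) -> Envelope:
    return Envelope(command=command, ok=True, data=data)


def failure(command: str, code: str, message: str, hint: str) -> Envelope:
    # Hypothetical error shape: a stable code plus a recovery hint.
    error = {"code": code, "message": message, "hint": hint}
    return Envelope(command=command, ok=False, error=error)
```

Keeping `data` and `error` both present (one of them null) means consumers can parse every response with the same shape and branch on `ok` alone.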

Phase 1 — Reliability commands. First-class gate, eval, judge, experiment, and dataset command groups. gate check enforces deterministic pass/fail thresholds for deployment decisions.
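
The idea behind a deterministic gate can be sketched as a pure function over metrics and thresholds. This is not the planned `gate check` implementation; the `Threshold` type, the `gate_check` function, and the metric names are hypothetical, and the sketch assumes every threshold is a lower bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    minimum: float  # assumption: every gate is a "must be at least" bound


def gate_check(
    metrics: dict[str, float], thresholds: list[Threshold]
) -> tuple[bool, list[str]]:
    """Return (passed, failures). Same inputs always give the same result."""
    failures: list[str] = []
    # Sort so the failure report does not depend on input order.
    for t in sorted(thresholds, key=lambda t: t.metric):
        value = metrics.get(t.metric)
        if value is None:
            failures.append(f"{t.metric}: missing")
        elif value < t.minimum:
            failures.append(f"{t.metric}: {value} < {t.minimum}")
    return (not failures, failures)
```

A missing metric counts as a failure here, so a misconfigured evaluation cannot pass the gate silently.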

Phase 2 — Extension governance. Optional capability modules under modaic x (x optimize, x index). Audit artifact pipeline (.modaic/logs/<run_id>/ with versioned manifests and traces).
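
A rough sketch of what writing an audit artifact could look like follows. The `.modaic/logs/<run_id>/` layout is taken from the description above; the `manifest.json` file name, the manifest fields, and the `write_manifest` helper are assumptions for illustration only.

```python
import json
import uuid
from pathlib import Path
from typing import Optional

MANIFEST_VERSION = 1  # assumed; the real versioning scheme is not specified


def write_manifest(root: Path, command: str, run_id: Optional[str] = None) -> Path:
    """Create <root>/.modaic/logs/<run_id>/manifest.json and return its path."""
    run_id = run_id or uuid.uuid4().hex
    run_dir = root / ".modaic" / "logs" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "manifest_version": MANIFEST_VERSION,
        "run_id": run_id,
        "command": command,
    }
    path = run_dir / "manifest.json"
    # sort_keys keeps the file stable across runs, which helps diffing.
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return path
```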

Phase 3 — Policy and ergonomics. Redaction/retention controls, destructive-action safety flags (--yes, --dry-run), --assist interactive mode, compatibility documentation.
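
One plausible reading of the `--yes` and `--dry-run` safety flags is sketched below using the standard library's `argparse`, so the example runs without extra dependencies (the CLI itself uses Typer). The `example-delete` program, its `target` argument, and the exact flag semantics are hypothetical. The sketch only returns a description of what it would do and never deletes anything.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example-delete")
    parser.add_argument("target", help="name of the thing to delete")
    parser.add_argument("--yes", action="store_true",
                        help="confirm the destructive action")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would happen without doing it")
    return parser


def plan(argv: list[str]) -> str:
    """Decide what a destructive command should do for the given flags."""
    args = build_parser().parse_args(argv)
    if args.dry_run:
        # Assumption: --dry-run wins even when --yes is also given.
        return f"would delete {args.target}"
    if not args.yes:
        return "refused: pass --yes to confirm"
    return f"deleted {args.target}"
```

Refusing by default, rather than prompting, keeps the behavior identical in interactive shells and CI.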

Project structure

src/modaic_cli/
  cli/              CLI command layer (Typer). Thin; delegates to core.
    _app.py           Root app, registers command groups
    _errors.py        Exit codes, error handling decorator
    _output.py        JSON/human output helpers
    _data.py          Dataset loading (JSONL, Parquet, Arrow, HuggingFace, stdin)
    batch.py          Batch processing commands
    hub.py            Hub search and auth
    optimize.py       Optimization commands
    program.py        Program lifecycle commands
  modaic/           Core library
    batch/            Multi-provider async batch processing
    hub/              Hub API client, git sync, push workflows
    auto/             AutoProgram/AutoConfig dynamic class loading
    precompiled/      Precompiled program support
    programs/         Program registry and built-ins
    serializers/      DSPy and JSON serialization
    module_utils/     Introspection, import filtering, pyproject parsing
    exceptions/       Custom exception hierarchy
  gepa/             GEPA optimization integration
  optiglot/         Optiglot optimization framework (evaluator, predictors, optimizers)

Setup

uv sync --all-extras
uv run modaic --help
uv run pytest -q

Key conventions

Python 3.11+, managed with uv
Typer for CLI, Pydantic for validation
Lazy imports for heavy dependencies (dspy, datasets, gepa, torch) to keep startup fast
Strict linting: Ruff
Dual-mode output: every command works in both human-readable and --json mode
Exit codes are a stable contract; do not change their meanings
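
The lazy-import convention above can be illustrated with a small sketch. The `lazy_module` helper is hypothetical and not taken from the codebase, and the standard library's `statistics` module stands in for a heavy dependency such as dspy or torch so the example runs anywhere.

```python
import importlib
from functools import lru_cache
from types import ModuleType


@lru_cache(maxsize=None)
def lazy_module(name: str) -> ModuleType:
    """Import a module on first use and reuse the cached result afterwards."""
    return importlib.import_module(name)


def summarize(values: list[float]) -> float:
    # The import cost is paid here, when the command runs,
    # not when the CLI starts up or prints --help.
    statistics = lazy_module("statistics")
    return statistics.mean(values)
```

A plain `import` statement inside the function body achieves the same effect; the helper just centralizes the pattern.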
