What Modaic is today
Modaic is infrastructure for teams building AI systems with LLMs, often with DSPy. It focuses on:
- Managing judges and evaluations for AI systems
  - Run LLM judges on your agents' outputs
  - Measure how good those judges actually are
  - Get confidence scores and calibration so you know when to trust them
- Data & eval plumbing around those judges
  - Store and version your evaluation data
  - Run experiments and compare different prompts/agents/judges
  - Repeatable and production-grade
Built for teams building agents and DSPy programs who care about reliability, determinism, and safety, not just throwing prompts at an API.
Long-term direction
- The "reliability layer" for AI systems
  - Define what "good" means for an AI system, evaluate it continuously, and attach calibrated confidence scores and safety checks
  - Works with DSPy, but is not limited to it
- A full evaluation & data-ops platform
  - Host and manage: datasets, eval suites, benchmarks, and LLM judges plus their training/fine-tuning
  - Automate: data labeling and relabeling with LLM judges, and QA/QC loops on large, messy datasets
- Enterprise-grade, self-hostable AI infra
  - The same capabilities as a self-hosted / on-prem product that runs inside the company's own VPC or hardware and satisfies SOC 2, HIPAA, and other compliance constraints
  - Priced as a high-value enterprise license
- A bridge between cutting-edge AI and legacy industries
  - Start with AI-native companies as design partners, then take the same evaluation + reliability stack into finance, energy/heavy industry, legal, healthcare, and compliance
  - These are industries with large volumes of data, heavy regulation, and a need for "deterministic-feeling" AI
One line:
Modaic is building the evaluation, confidence, and data infrastructure that lets serious teams ship AI agents and DSPy programs they can actually trust, and package it so enterprises can run it in production on their own data.
modaic-cli is the operator interface for Modaic's reliability workflows. Teams and CI pipelines use it to run evaluations, manage judges, enforce quality gates, and ship programs to the Modaic Hub.
| Command group | What it does |
|---|---|
| `modaic program` | Load, save, inspect, and push precompiled DSPy programs |
| `modaic hub` | Search the Modaic Hub, authenticate |
| `modaic batch` | Submit, poll, and fetch results from multi-provider batch jobs (OpenAI, Anthropic, Azure, Vertex, Together, Fireworks) |
| `modaic optimize` | Run prompt optimization (GEPA, bootstrap few-shot, vanilla few-shot) |
All commands support --json for machine-readable output. Exit codes are stable and documented (0 success, 1 error, 2 usage, 3 auth, 4 not found, 5 network, 130 interrupt).
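For scripts that wrap the CLI, the exit-code contract above could be mirrored in Python roughly as follows. This is only a sketch: the `ExitCode` enum and the `describe` helper are hypothetical names chosen for illustration and are not taken from the modaic-cli source. Only the numeric values and their meanings come from the list above.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Hypothetical enum mirroring the documented exit codes."""

    SUCCESS = 0
    ERROR = 1
    USAGE = 2
    AUTH = 3
    NOT_FOUND = 4
    NETWORK = 5
    INTERRUPT = 130


def describe(returncode: int) -> str:
    """Map a process return code to a lowercase label, or 'unknown'."""
    try:
        return ExitCode(returncode).name.lower()
    except ValueError:
        # Any code outside the documented contract is reported as unknown.
        return "unknown"
```

A CI wrapper could branch on `describe(proc.returncode)` instead of hard-coding integers, which keeps the mapping in one place.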
The CLI is evolving into a reliability-first command surface. Work is tracked in TODO.md and sequenced in four phases:
Phase 0 — Contract hardening. Unified JSON envelope (schema_version, ok, command, data, error, meta) across every command. Structured error payloads with stable codes and recovery hints. Conformance tests.
Phase 1 — Reliability commands. First-class gate, eval, judge, experiment, and dataset command groups. gate check enforces deterministic pass/fail thresholds for deployment decisions.
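The idea behind a deterministic gate can be sketched in a few lines. This is not the planned `gate check` implementation, which the roadmap does not specify; the `gate_check` function, the metric names in the usage note, and the "minimum threshold per metric" rule are all assumptions used to illustrate a pass/fail decision that gives the same answer for the same inputs.

```python
def gate_check(
    metrics: dict[str, float], thresholds: dict[str, float]
) -> tuple[bool, list[str]]:
    """Return (passed, failures) for a set of minimum thresholds.

    A metric fails when it is missing or strictly below its minimum.
    """
    failures: list[str] = []
    for name in sorted(thresholds):  # sorted so the report order is deterministic
        minimum = thresholds[name]
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return (not failures, failures)
```

For example, `gate_check({"accuracy": 0.92}, {"accuracy": 0.9})` passes, while a run that omits a required metric fails rather than passing silently. A CI step could then exit nonzero whenever `passed` is false.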
Phase 2 — Extension governance. Optional capability modules under modaic x (x optimize, x index). Audit artifact pipeline (.modaic/logs/<run_id>/ with versioned manifests and traces).
Phase 3 — Policy and ergonomics. Redaction/retention controls, destructive-action safety flags (--yes, --dry-run), --assist interactive mode, compatibility documentation.

    src/modaic_cli/
      cli/                CLI command layer (Typer). Thin; delegates to core.
        _app.py           Root app, registers command groups
        _errors.py        Exit codes, error handling decorator
        _output.py        JSON/human output helpers
        _data.py          Dataset loading (JSONL, Parquet, Arrow, HuggingFace, stdin)
        batch.py          Batch processing commands
        hub.py            Hub search and auth
        optimize.py       Optimization commands
        program.py        Program lifecycle commands
      modaic/             Core library
        batch/            Multi-provider async batch processing
        hub/              Hub API client, git sync, push workflows
        auto/             AutoProgram/AutoConfig dynamic class loading
        precompiled/      Precompiled program support
        programs/         Program registry and built-ins
        serializers/      DSPy and JSON serialization
        module_utils/     Introspection, import filtering, pyproject parsing
        exceptions/       Custom exception hierarchy
      gepa/               GEPA optimization integration
      optiglot/           Optiglot optimization framework (evaluator, predictors, optimizers)

To install dependencies, check that the CLI runs, and run the tests:

    uv sync --all-extras
    uv run modaic --help
    uv run pytest -q

- Python 3.11+, managed with `uv`
- Typer for CLI, Pydantic for validation
- Lazy imports for heavy dependencies (dspy, datasets, gepa, torch) to keep startup fast
- Strict linting: Ruff
- Dual-mode output: every command works in both human-readable and `--json` mode
- Exit codes are a stable contract; do not change their meanings
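
The lazy-import convention listed above can be illustrated with a function-level import. This is a minimal sketch rather than code from the repository: `load_rows` is a hypothetical helper, and the stdlib `json` module stands in for a heavy dependency such as `datasets` or `torch` so the example runs anywhere.

```python
def load_rows(path: str) -> list[dict]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    # Deferred import: the module is loaded the first time this function runs,
    # not when the CLI starts. Here json stands in for a heavy dependency.
    import json

    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

With this pattern, commands that never touch the heavy code path do not pay its import cost. Python caches the module after the first import, so repeated calls stay cheap.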