Ejentum Eval Pattern: Host-Agnostic Specification

Intent

Given any user prompt, produce two responses from identical LLMs (one baseline, one with Ejentum cognitive scaffold injected) and grade them via an independent blind evaluator. The pattern is model-agnostic, host-agnostic, and requires no Ejentum knowledge from the end user.

Roles

Three roles. Each is a single LLM call. No memory, no multi-turn state.

Role 1: Baseline Producer

Model: gpt-4o (or any capable LLM)
System prompt: system_prompts/baseline_producer.md
Input: the user prompt verbatim
Output: a natural-language response
Tools: none

Role 2: Ejentum Producer

Model: same as baseline (critical for fairness)
System prompt: system_prompts/ejentum_producer.md
Input: the user prompt, prefixed with [COGNITIVE SCAFFOLD] block retrieved from the Ejentum API
Output: a natural-language response
Tools: none (scaffold retrieval happens before the LLM call)

Role 3: Blind Evaluator

Model: different model family from the producers (recommended: Gemini, to avoid same-family bias)
System prompt: system_prompts/blind_evaluator.md
Input: user prompt + Response A + Response B (labels are neutral, evaluator does not know which is which)
Output: structured JSON verdict
Tools: none

Orchestration flow

user_prompt
     |
     +--------- call Ejentum API (mode=reasoning)
     |                |
     |                v
     |          cognitive_scaffold
     |
     +---------> Baseline Producer --------> baseline_response
     |
     +-> inject scaffold -> Ejentum Producer ----> ejentum_response
     |
     v
Blind Evaluator (sees A/B, not baseline/ejentum) ----> evaluation JSON
     |
     v
Final output: { user_message, baseline_response, ejentum_response, evaluation }

Ejentum API call

Endpoint: https://api.ejentum.com/harness/

Method: POST

Headers:

Authorization: Bearer <EJENTUM_API_KEY>
Content-Type: application/json

Body:

{
  "query": "<user's task or your framing of it>",
  "mode": "reasoning"
}

Modes available: reasoning, reasoning-multi, anti-deception, code, code-multi, memory, memory-multi. Default to reasoning for general tasks. Use anti-deception for advice/validation prompts where sycophancy is a risk. Use code for coding tasks.

Response:

[{ "reasoning": "<scaffold string containing [PROCEDURE], [FALSIFICATION TEST], Suppress, Amplify signals, etc.>" }]

Extract the scaffold string from the mode-named key (e.g., response[0].reasoning) and inject it into the augmented producer's context before the user prompt.

Evaluator output schema

{
  "scores": {
    "A": {"specificity": 1-5, "posture": 1-5, "depth": 1-5, "actionability": 1-5, "honesty": 1-5},
    "B": {"specificity": 1-5, "posture": 1-5, "depth": 1-5, "actionability": 1-5, "honesty": 1-5}
  },
  "totals": {"A": <sum>, "B": <sum>},
  "justifications": {
    "specificity": "one sentence comparing A and B",
    "posture": "one sentence comparing A and B",
    "depth": "one sentence comparing A and B",
    "actionability": "one sentence comparing A and B",
    "honesty": "one sentence comparing A and B"
  },
  "verdict": "A" | "B" | "tie",
  "verdict_reason": "one sentence"
}

Fairness requirements

Same producer model for baseline and Ejentum. Only the scaffold differs. Any model difference contaminates the comparison.
Different-family evaluator. If producers are OpenAI, evaluator should be Google / Anthropic / other. Keeps the judge from having the same latent biases.
Blind labels. The evaluator sees Response A and Response B. The mapping to baseline/ejentum is preserved only in the final output formatter, never exposed to the evaluator.
Identical producer system prompts except the scaffold instruction. The baseline and Ejentum system prompts should be byte-identical apart from the block that tells the Ejentum producer to apply the injected scaffold.

Non-requirements

No multi-turn memory needed.
No external databases.
No caching (each run is independent).
No retries / circuit breakers (one-shot; if an API call fails, surface the error).

Extension points

Batch mode: wrap runEval in a loop over an array of prompts. Collect results into an array for export to CSV / JSON.
Scaffolded evaluator: optional second evaluator agent that itself uses Ejentum (anti-deception mode recommended) for sycophancy-resistant judging. Ship as a separate companion, never replacing the independent Raw_Evaluator.
Mode sweeps: run the same prompt with multiple Ejentum modes (reasoning, anti-deception, code) and compare which harness produces the strongest lift per task type.

Intentional non-goals

This pattern does not prove Ejentum is universally better. It measures whether the scaffold produces a visible difference on a specific prompt. Users calibrate on their own tasks.
The evaluator is a data point, not a verdict. When it returns tie, that is a valid and defensible outcome.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ejentum Eval Pattern: Host-Agnostic Specification

Intent

Roles

Role 1: Baseline Producer

Role 2: Ejentum Producer

Role 3: Blind Evaluator

Orchestration flow

Ejentum API call

Evaluator output schema

Fairness requirements

Non-requirements

Extension points

Intentional non-goals

FilesExpand file tree

spec.md

Latest commit

History

spec.md

File metadata and controls

Ejentum Eval Pattern: Host-Agnostic Specification

Intent

Roles

Role 1: Baseline Producer

Role 2: Ejentum Producer

Role 3: Blind Evaluator

Orchestration flow

Ejentum API call

Evaluator output schema

Fairness requirements

Non-requirements

Extension points

Intentional non-goals