How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits #3
flamehaven01 announced in Announcements
In the first STEM-AI write-up, I described what happened after auditing 10 open-source bio/medical AI repositories.
The important lesson was not just that some repositories lacked clinical disclaimers, tests, or governance artifacts.
The more useful lesson was this:
That worked.
But it exposed the next problem.
If an AI system is auditing another AI or bioinformatics repository, how do you trust the auditor?
LLMs drift.
One session can enforce a clinical boundary strictly.
Another can invent a generous middle score for the same boundary case. In normal software review, that is annoying. In medical AI governance, it is a liability.
STEM-AI v1.1.2 is my answer to that problem.
It does not try to make the LLM deterministic by writing a longer prompt.
It binds the audit to a memory contract.
What v1.1.2 adds
STEM-AI v1.1.2 introduces MICA: Memory-Injected Contract Architecture.
The idea is simple:
before the auditor reads the target repository, it must load a fixed audit contract and self-check the rules it is not allowed to bend.
The v1.1.2 layer includes:
memory/mica.yaml -- composition contract
memory/stem-ai.mica.v1.1.2.json -- machine-checkable memory archive
memory/stem-ai-playbook.v1.1.2.md -- session playbook and drift guard
memory/stem-ai-lessons.v1.1.2.md -- historical failure-mode archive
spec/STEM-AI_v1.1.2_CORE.md -- canonical audit spec
The contract pins 18 invariants.
Examples:
T0_HARD_FLOOR cannot be bypassed.
This is not a claim that the LLM becomes perfectly deterministic.
It is a narrower claim:
That is the useful layer.
What "loading the contract" means
MICA is not hidden model memory.
It is also not a claim that the model provider changed the LLM.
In v1.1.2, "loading the contract" means the audit session starts by reading a fixed set of repository files before it is allowed to score the target:
The auditor then performs a pre-execution contract test:
Only after that does the audit proceed.
This does not make the LLM mathematically deterministic.
It makes the audit procedure file-backed, inspectable, and interruptible. If the session cannot load or reconcile the contract files, the correct behavior is to stop before scoring.
That is the difference between "please be consistent" and "execute this versioned contract."
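A minimal sketch of that stop-before-scoring behavior, assuming the five contract files listed earlier sit under the auditor's repository root. The function name `load_contract`, the `ContractLoadError` class, and the choice to record a SHA-256 per file are illustrative assumptions, not part of the STEM-AI spec. The sketch covers only the load step, not the self-check of the invariants.

```python
import hashlib
import json
from pathlib import Path

# Contract files named in the post, relative to the auditor repository root.
CONTRACT_FILES = [
    "memory/mica.yaml",
    "memory/stem-ai.mica.v1.1.2.json",
    "memory/stem-ai-playbook.v1.1.2.md",
    "memory/stem-ai-lessons.v1.1.2.md",
    "spec/STEM-AI_v1.1.2_CORE.md",
]


class ContractLoadError(RuntimeError):
    """Raised when the contract cannot be loaded; the audit must not score."""


def load_contract(root):
    """Read every contract file and return {relative_path: sha256_hex}.

    Hypothetical helper: it raises ContractLoadError instead of returning a
    partial result, so a caller cannot proceed to scoring by accident.
    """
    root = Path(root)
    digests = {}
    for rel in CONTRACT_FILES:
        path = root / rel
        if not path.is_file():
            raise ContractLoadError(f"missing contract file: {rel}")
        data = path.read_bytes()
        if rel.endswith(".json"):
            # The JSON archive must at least parse before it can be trusted.
            try:
                json.loads(data.decode("utf-8"))
            except ValueError as exc:
                raise ContractLoadError(f"unparseable JSON: {rel}") from exc
        digests[rel] = hashlib.sha256(data).hexdigest()
    return digests
```

Recording a digest per file is one way to let a later reviewer confirm which contract version a session actually read; a caller would run `load_contract` first and begin scoring only if it returns without raising.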
The audit workflow
STEM-AI v1.1.2 runs as a structured audit workflow:
In LOCAL_ANALYSIS mode, the auditor is not limited to what the README says.
It can inspect:
The output is intentionally split into two files:
That split matters.
The report explains the reasoning.
The JSON lets another reviewer inspect the score, evidence fields, flags, and integrity checks without trusting the prose.
A real target audit, not a synthetic example
For this v1.1.2 demonstration, I used a real public repository:
artic-network/fieldbioinformatics
The target is not the protagonist of this post.
It is only the specimen used to show the audit workflow against a real bioinformatics codebase.
The local audit produced:
The target snapshot:
{
  "name": "artic-network/fieldbioinformatics",
  "remote": "https://github.com/artic-network/fieldbioinformatics",
  "branch": "master",
  "commit": "8008b4c97c2193a82308ff6f0be507b1d9306e36",
  "file_count": 114
}
This is the important part: the audit did not ask, "Does this README sound trustworthy?"
It asked:
That is where STEM-AI is useful.
The score object
The machine-readable result records the score like this:
{
  "stage_1_readme_intent": 65,
  "stage_2_cross_platform": null,
  "stage_2_repo_local_consistency": 75,
  "stage_2_lane": "STAGE_2R_REPO_LOCAL_CONSISTENCY",
  "stage_2_policy": "External Stage 2 was not collected; LOCAL_ANALYSIS used Stage 2R in the fixed 0.20 Stage 2 slot.",
  "stage_3_code_bio": 55,
  "weights": { "stage_1": 0.4, "stage_2": 0.2, "stage_3": 0.4 },
  "risk_penalty": 0,
  "final_score": 63,
  "formal_tier": "T2 Caution"
}
External Stage 2 is explicitly represented as null for this local-only audit.
That does not mean cross-platform consistency is unimportant.
It means this evidence slice was deliberately scoped to LOCAL_ANALYSIS. Instead of pretending to have social/web evidence, v1.1.2 uses Stage 2R: Repo-Local Consistency.
Stage 2R asks whether the repository's own surfaces agree with each other:
The contract defines the fixed-weight calculation:
The final tier is therefore:
Not because the prose sounded balanced.
Because the contract math forces that result.
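With the weights and stage scores from the score object above, and Stage 2R filling the Stage 2 slot, the arithmetic works out as follows. The symbols $S_1$, $S_{2R}$, $S_3$, and $P$ (risk penalty) are shorthand introduced here, and subtracting the penalty is an assumption; with a penalty of 0 it does not change this result.

```latex
\begin{aligned}
\text{final} &= 0.4\,S_1 + 0.2\,S_{2R} + 0.4\,S_3 - P \\
             &= 0.4(65) + 0.2(75) + 0.4(55) - 0 \\
             &= 26 + 15 + 22 = 63
\end{aligned}
```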
Why the T0 hard floor did not trigger
T0_HARD_FLOOR is the rule that prevents a clinically dangerous repository from escaping rejection through good wording.
In simplified form:
Examples of CA-DIRECT include patient-specific diagnosis, treatment recommendation, triage, risk scoring, or clinical decision support.
The audited repository did not trigger that floor because STEM-AI classified it as:
{
  "clinical_adjacent": true,
  "ca_severity": "CA-INDIRECT",
  "t0_hard_floor": false
}
It produces biological sequence artifacts that may sit near public-health or clinical workflows, but the inspected surface did not make direct autonomous diagnosis or treatment claims. It also has substantive implementation, CI, and domain-specific test definitions.
So the result is not T0.
But it is also not high-trust.
The bounded result is T2 Caution.
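The override behavior can be sketched as below. This is a minimal illustration, assuming the floor is evaluated from the `t0_hard_floor` flag in the classification object. The function name `apply_hard_floor` and the tier label string `"T0"` are placeholders, and the conditions that set the flag are defined by the contract, not by this snippet.

```python
def apply_hard_floor(classification, scored_tier):
    """Return the final tier, letting the T0 floor override the scored tier.

    `classification` mirrors the JSON fields shown above. The scored tier is
    ignored whenever the floor flag is set, so a high weighted score cannot
    lift a floored repository out of T0.
    """
    if classification.get("t0_hard_floor") is True:
        return "T0"  # placeholder label for the rejection tier
    return scored_tier


# Classification recorded for the audited repository.
audited = {
    "clinical_adjacent": True,
    "ca_severity": "CA-INDIRECT",
    "t0_hard_floor": False,
}
print(apply_hard_floor(audited, "T2 Caution"))  # T2 Caution

# Hypothetical counter-case: same scored tier, but the floor flag is set.
floored = dict(audited, ca_severity="CA-DIRECT", t0_hard_floor=True)
print(apply_hard_floor(floored, "T2 Caution"))  # T0
```

Keeping the floor as a separate step after scoring, rather than as a large penalty inside the weighted sum, means no combination of stage scores can outvote it.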
Code-integrity findings
The same JSON records C1-C4 LOCAL_ANALYSIS checks:
{
  "C1_hardcoded_credentials": { "status": "PASS" },
  "C2_dependency_pinning": { "status": "WARN" },
  "C3_dead_or_deprecated_patient_adjacent_paths": { "status": "WARN" },
  "C4_exception_handling_clinical_adjacent_paths": { "status": "WARN" }
}
That is the difference between a general review and a code-path audit.
A text review can say:
A code-path audit can say:
That is a more useful governance object.
It is not a certificate.
It is a map of what a reviewer should trust, distrust, or inspect next.
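As a small illustration of using that JSON as a review map, the snippet below pulls out every integrity check whose status is not PASS. It assumes only the structure shown above; the function name `checks_to_inspect` is hypothetical.

```python
import json

# C1-C4 results copied from the audit JSON shown above.
CODE_INTEGRITY_JSON = """
{
  "C1_hardcoded_credentials": {"status": "PASS"},
  "C2_dependency_pinning": {"status": "WARN"},
  "C3_dead_or_deprecated_patient_adjacent_paths": {"status": "WARN"},
  "C4_exception_handling_clinical_adjacent_paths": {"status": "WARN"}
}
"""


def checks_to_inspect(checks):
    """Return the names of checks whose status is anything other than PASS."""
    return [name for name, result in checks.items()
            if result.get("status") != "PASS"]


checks = json.loads(CODE_INTEGRITY_JSON)
for name in checks_to_inspect(checks):
    print(name, checks[name]["status"])
```

For this audit the loop lists C2, C3, and C4, which is the reviewer's "inspect next" queue.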
A small Python verifier
Here is a small dependency-free Python script that reads the actual audit JSON and verifies the score calculation. It does not need target private code or patient data; it only checks the machine-readable audit result.
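A minimal sketch of such a verifier follows. It is not the author's original script: it assumes the score object has exactly the fields shown in the score section, that a null `stage_2_cross_platform` means the Stage 2R value fills the Stage 2 slot, and that the risk penalty is subtracted from the weighted sum. The function name `verify_score` is hypothetical, and the score object is embedded inline rather than read from the audit file, because the file's path and top-level layout are not given here.

```python
import json

# Score object copied from the machine-readable audit result in this post.
SCORE_JSON = """
{
  "stage_1_readme_intent": 65,
  "stage_2_cross_platform": null,
  "stage_2_repo_local_consistency": 75,
  "stage_3_code_bio": 55,
  "weights": {"stage_1": 0.4, "stage_2": 0.2, "stage_3": 0.4},
  "risk_penalty": 0,
  "final_score": 63,
  "formal_tier": "T2 Caution"
}
"""


def verify_score(score):
    """Recompute the final score and return a list of problems (empty = OK)."""
    problems = []
    weights = score["weights"]
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        problems.append("weights do not sum to 1.0")

    # Assumption: a null external Stage 2 falls back to Stage 2R.
    stage_2 = score["stage_2_cross_platform"]
    if stage_2 is None:
        stage_2 = score["stage_2_repo_local_consistency"]

    weighted = (weights["stage_1"] * score["stage_1_readme_intent"]
                + weights["stage_2"] * stage_2
                + weights["stage_3"] * score["stage_3_code_bio"])
    # Assumption: the penalty is subtracted, then rounded to an integer.
    expected = round(weighted - score["risk_penalty"])
    if expected != score["final_score"]:
        problems.append(
            f"final_score {score['final_score']} != recomputed {expected}")
    return problems


score = json.loads(SCORE_JSON)
problems = verify_score(score)
print("OK" if not problems else problems)
```

The sketch checks only the arithmetic. It does not validate the tier boundaries or compute a digest, since neither rule is spelled out in this post.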
Expected digest:
Why this matters
Bio/medical AI governance is full of language that sounds safe but is hard to verify:
Those phrases are not enough.
STEM-AI asks for observable structure:
v1.1.2 adds another layer:
auditor reality.
The AI auditor itself has to load a memory contract before it scores.
That is what MICA is for.
The final answer is T2 Caution: research reference and supervised non-clinical technical review only. No autonomous clinical decision support.
Not hype.
Not rejection by default.
A bounded trust judgment with evidence paths.
What comes next
The follow-on lane should:
experiment_results.json
For the current demonstration, runtime execution status is recorded as an evidence boundary in the audit JSON. The score itself remains based on the official v1.1.2 LOCAL_ANALYSIS evidence basis: Stage 1 source/README evidence, Stage 2R repo-local consistency, Stage 3 code/bio evidence, and C1-C4 integrity checks.
Final thought
STEM-AI is not a clinical certifier.
It is also not trying to replace scientific review, regulatory review, or domain experts.
Its role is narrower: make the governance conversation start from observable evidence instead of presentation quality.
In practice, that means asking:
That is where I think STEM-AI belongs in AI governance.
Not as the final authority.
As the evidence gate before authority is invoked.
It turns a vague question, "Do we trust this bio/medical AI repository?", into a more reviewable one: