When Control Becomes Authority: Calibration Governance in STEM BIO-AI 1.7.x #5
flamehaven01 announced in Announcements
Control slowly becomes authority when nobody marks the boundary.
That is the calibration problem I kept running into while building STEM BIO-AI.
At first, STEM BIO-AI was centered on the score. It scanned a local bio or medical AI repository, inspected observable repository surfaces, and mapped the repository to a structured review tier.
That was useful.
But it was not enough.
The harder problem was not producing a number. The harder problem was preventing every useful adjacent signal from becoming part of that number.
In a bio/medical AI repository review system, several lanes can look similar if the tool is not careful:
They all matter.
But they should not all have the same authority.
That is the core reason calibration became a governance problem in the `1.7.x` line.
The principle is simple:
But it should not let those inputs silently mutate the official score.
A Short Context for New Readers
STEM BIO-AI is a deterministic evidence-surface scanner for bio and medical AI repositories.
It does not validate biomedical efficacy. It does not certify clinical safety. It does not prove that a model is correct.
It scans observable repository surfaces such as:
The formal score is currently built from three weighted score-bearing stages, plus an explicit credential penalty and clinical cap or hard-floor logic:
The active formula also applies:
- `C1_penalty` when hardcoded credentials are detected
- `score_cap` or `t0_hard_floor` when clinical-adjacent boundary rules require it
Stage 4 exists, but it is a separate replication lane. It reports reproducibility and replication posture without automatically changing the formal score.
That separation is intentional.
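A minimal sketch of that separation, not the released scoring code. The weights are borrowed from the illustrative JSON later in this post and are an assumption, as are the function names, the `ScanResult` shape, and the penalty and cap values. The hard-floor rule is left out because its exact semantics are not stated here.

```python
from dataclasses import dataclass

# Hypothetical weights taken from the illustrative JSON in this post;
# they are not confirmed as the released constants.
WEIGHTS = {"stage_1": 0.30, "stage_2r": 0.25, "stage_3": 0.45}


@dataclass(frozen=True)
class ScanResult:
    final_score: float        # formal score from the score-bearing stages
    replication_score: float  # Stage 4, reported in its own lane


def formal_score(stages, c1_penalty=0.0, score_cap=None):
    """Weighted sum of the score-bearing stages, minus penalty, then capped."""
    score = sum(WEIGHTS[name] * stages[name] for name in WEIGHTS)
    score -= c1_penalty
    if score_cap is not None:
        score = min(score, score_cap)
    return max(score, 0.0)


def scan(stages, stage_4, c1_penalty=0.0, score_cap=None):
    # stage_4 is passed through untouched; it never enters formal_score().
    return ScanResult(formal_score(stages, c1_penalty, score_cap), stage_4)
```

In this sketch, changing `stage_4` changes only `replication_score`. The formal score has no code path through which Stage 4 could reach it.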
What Is Actually Implemented in the Current 1.7.5 State of 1.7.x
Before discussing calibration philosophy, the implementation boundary has to be clear.
In the current `1.7.5` state of the `1.7.x` line, STEM BIO-AI has implemented a real calibration architecture, but it is still mostly a mirror-only and preview-oriented architecture.
This post describes the current released state of the `1.7.x` line as of `v1.7.5`, not a future authoritative-read-through design.
Implemented surfaces include:
- `stem policy list`
- `stem policy explain`
- `stem policy derive`
- `stem policy simulate`
The current named recommendation surface is intentionally narrow:
- `default`
- `strict_clinical_adjacency`
`reproducibility_first` is still a draft posture, not an active release-grade named recommendation.
The important limitation is this:
In other words, `scan --policy <name>` can surface selected profile metadata. `policy derive` and `policy simulate` can show governed preview behavior. But user-provided profile files do not simply become the official scoring authority.
More specifically, local profile files are currently accepted only by `stem policy simulate`, and the CLI rejects them unless the file remains `mirror_only`.
That is not a missing convenience.
That is the boundary being tested before it is allowed to become authority.
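The `mirror_only` gate described above can be sketched as a small validation step. This is an illustration under assumptions: the `read_mode` field name, the `ProfileRejected` exception, and the function name are hypothetical and are not taken from the STEM BIO-AI source.

```python
class ProfileRejected(ValueError):
    """Raised when a local profile asks for more than mirror-only access."""


def load_simulation_profile(profile: dict) -> dict:
    # Hypothetical field name: the post only says the file must remain mirror_only.
    mode = profile.get("read_mode")
    if mode != "mirror_only":
        raise ProfileRejected(
            f"local profiles must declare read_mode='mirror_only', got {mode!r}"
        )
    # Return a copy so the caller cannot mutate the loaded profile in place.
    return dict(profile)
```

A missing `read_mode` is rejected as well, so a profile cannot gain authority by simply omitting the field.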
The Pressure That Causes Drift
One question pushed this design forward:
**If advisory AI becomes more capable, will teams really keep the boundary between formal score and advisory interpretation?**
I do not think the answer is automatically yes.
If an advisory layer becomes helpful, there will always be pressure to let it influence the formal score "just a little."
That is usually how audit systems drift.
The score stops being a stable artifact and starts becoming a moving interpretation layer.
The danger is not that users want control.
The danger is that control slowly becomes authority without anyone noticing.
So the design question is not:
The design question is:
That is where calibration enters.
Calibration Is Not a Tuning Console
The wrong calibration UX looks like this:
    {
      "stage_1_percent": 30,
      "stage_2r_percent": 25,
      "stage_3_percent": 45,
      "ca_no_disclaimer_cap": 61,
      "b2_partial_credit_mode": "looser"
    }

This is editable.
But editable is not the same as governed.
Most researchers, operators, and domain reviewers do not think in raw score constants. They usually know something closer to this:
That is why the current calibration design starts with posture questions, not raw constants.
The goal is not to ask a researcher to become a scoring-engine maintainer.
The goal is to let a researcher express domain posture while keeping the formal scoring boundary visible, versioned, and difficult to mutate accidentally.
The 1–5 Scale Is Input, Not Authority
In the current design, the user-facing intent layer uses a `1–5` scale:
- `1` = minimal emphasis
- `2` = light emphasis
- `3` = moderate emphasis
- `4` = strong emphasis
- `5` = very strong emphasis
The important line is this:
That means the user can express posture in a natural way:
But those answers do not directly become score constants.
They are translated through explicit rules.
The current decision table is intentionally narrow:
- `clinical_strictness >= 4` and `reproducibility_priority <= 3`: `strict_clinical_adjacency`
- `2` or `3`: `default`
- `preview_only` profile delta from bounded deltas only
This table should not be mistaken for an empirically optimized model.
It is a conservative governance rule table.
The current threshold choices are design-steward decisions, not claims of statistical optimality. Their purpose is to keep the translation layer narrow, reviewable, and non-authoritative until a stronger benchmark-backed promotion process exists.
That matters because a calibration system can fail in two opposite ways:
The initial rule table chooses the safer failure mode.
If a posture is clearly within an existing release-grade profile, the system can recommend that profile. If the posture is ambiguous or combines competing priorities, the system falls back to `preview_only`.
For example:
That does not automatically recommend `strict_clinical_adjacency`.
It falls back to `preview_only`, because two strong postures are competing and no release-grade named profile currently resolves that conflict.
A hidden similarity function might produce something that looks more flexible.
But it would also make the governance harder to audit.
A narrow rule table is less magical.
It is also safer.
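The narrow rule table can be sketched as an ordered list of explicit checks. This is a sketch under stated assumptions, not the shipped derivation logic: it uses only two posture axes, and it assumes the `default` rule means "every answer is 2 or 3". The function name `recommend` is hypothetical.

```python
def recommend(clinical_strictness: int, reproducibility_priority: int) -> str:
    """Map 1-5 posture answers to a named recommendation or preview_only."""
    answers = (clinical_strictness, reproducibility_priority)
    if any(not 1 <= a <= 5 for a in answers):
        raise ValueError("answers must be on the 1-5 scale")

    # Rule 1: strong clinical posture without a competing strong posture.
    if clinical_strictness >= 4 and reproducibility_priority <= 3:
        return "strict_clinical_adjacency"

    # Rule 2 (assumed reading of the table): all answers are light or moderate.
    if all(a in (2, 3) for a in answers):
        return "default"

    # No named rule matched: fall back instead of guessing a nearest profile.
    return "preview_only"
```

With these rules, `recommend(5, 5)` returns `preview_only` rather than `strict_clinical_adjacency`, which mirrors the competing-postures fallback described above. Every outcome can be traced to one readable condition, which is the point of preferring a table over a similarity function.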
What the CLI Is Allowed to Do
[Figure: Easy experimentation, hard drift (sandbox and vault)]
The preview workflow can look like this:
or this:
But those flows are not the same as saying:
The first two are governed preview surfaces.
The last one is an untracked tuning console.
The design intentionally supports the first and rejects the shape of the last.
This is the practical meaning of easy experimentation, hard drift.
What Actually Gets Verified
The central claim of this design is not:
The claim is narrower:
That claim can be tested by checking whether the system exposes or blocks the relevant control surfaces.
- `final_score`
- `replication_score` does not change `formal_tier`
- `preview_only` when no named rule matches
This is still not the same as a full empirical benchmark.
But it is a real verification target.
The system can be checked for whether it allows the forbidden mutation path.
That is the level of proof appropriate for this release line: not "the final policy is optimal," but "the policy cannot quietly become authoritative without leaving a trace."
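One way to make such a trace concrete is to fingerprint the profile and attach its identity to the result without touching the score. The sketch below is an assumption about shape, not the actual result schema: `policy_trace`, `profile_id`, `profile_hash`, and `read_mode` are hypothetical field names.

```python
import hashlib
import json


def profile_fingerprint(profile: dict) -> str:
    """SHA-256 over a canonical JSON encoding, independent of key order."""
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def attach_trace(result: dict, profile: dict) -> dict:
    """Return a copy of the result with profile identity recorded beside it."""
    traced = dict(result)
    traced["policy_trace"] = {
        "profile_id": profile["id"],
        "profile_hash": profile_fingerprint(profile),
        "read_mode": profile["read_mode"],
    }
    return traced
```

Because the fingerprint covers the whole profile, any edit to a profile file changes the recorded hash, even when the score itself stays the same.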
That trace is stronger for some surfaces than others. Profile identity, hash, and read mode are already artifact-visible in `1.7.5`. Detector promotion semantics are already versioned and documented, but they are not yet surfaced as first-class per-detector policy metadata in the result object.
The B2 Tightening Example
The clearest scoring example is Stage 3 B2.
B2 is the bias and limitations measurement surface. Earlier scoring behavior allowed a weaker boundary: a simple vocabulary-level signal could still receive partial credit.
That became too permissive.
A repository that mentions "bias" or "limitations" once is not necessarily disclosing a meaningful boundary. It may only be surface signaling.
So the B2 rule became stricter.
The important change is not a marketing claim about benchmark improvement. The important change is a deterministic boundary change:
This is the first place where calibration becomes visible as more than a principle.
The rule change creates a concrete score path difference:
That is the current public claim.
I am not presenting a benchmark-wide before/after score delta here, because that would require a pinned fixture set and published comparison protocol.
Without that, a claimed "T3 became T2" example would be anecdotal at best and misleading at worst.
So the honest evidence level is rule-level impact:
In clinical-adjacent repositories, limitation language is not decoration. It is part of the claim boundary.
A one-word mention does not carry the same weight as a structured limitations section, demographic coverage statement, known failure-mode description, or quantitative subgroup analysis.
This is why calibration cannot be only a UI problem.
If a user asks for a stricter limitations posture, the system should not silently subtract points through a hidden override. It should expose the rule that changed and the reason that rule exists.
That is the difference between a score tweak and a governed scoring rationale.
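To illustrate the difference between a vocabulary-level signal and a structured disclosure, here is a rough sketch. It is not the B2 detector: the regular expressions, the rule name, and the assumption that a keyword mention alone earns no credit under the stricter rule are all hypothetical.

```python
import re

# Vocabulary-level signal: the word appears somewhere in the text.
KEYWORDS = re.compile(r"\b(bias|limitations?)\b", re.IGNORECASE)
# Structured signal: a markdown heading dedicated to limitations.
SECTION = re.compile(
    r"^#{1,6}\s*(known\s+)?limitations\b", re.IGNORECASE | re.MULTILINE
)


def b2_evidence(text: str) -> dict:
    """Report which B2 signals were seen and which rule decided the credit."""
    has_keyword = bool(KEYWORDS.search(text))
    has_section = bool(SECTION.search(text))
    return {
        "vocabulary_signal": has_keyword,
        "structured_section": has_section,
        # Assumed stricter rule: a keyword alone no longer earns credit.
        "earns_credit": has_section,
        "rule": "b2_requires_structured_disclosure",
    }
```

The returned object names the rule that made the decision, so a stricter posture shows up as an explicit rationale and not as a silent point deduction.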
Why Stage 4 Stays Separate
Stage 4 is the place where the strongest counterargument appears.
The counterargument is fair:
My answer is that importance and score authority are not the same thing.
Stage 4 measures replication posture: containers, reproducibility targets, dependency locks, artifact references, seeds, citation surfaces, and similar evidence.
Those signals matter.
But they do not mean the same thing as the formal claim boundary.
A repository can be highly reproducible and still make unsafe or unbounded clinical claims.
A repository can have clean containers and dependency locks while still lacking a clinical-use disclaimer.
A repository can be easy to rerun while still having weak data provenance or shallow limitation language.
If Stage 4 were allowed to lift the formal score too early, reproducibility could start compensating for claim-boundary weakness.
That would be a different scoring philosophy.
It may become valid in the future, but only if the rule is explicit.
For now, Stage 4 is reported as a separate lane because the system is saying:
That is why stronger reproducibility intent currently falls back to `preview_only` instead of becoming a release-grade named profile.
The system is not saying reproducibility is unimportant.
It is saying reproducibility has not yet been granted formal score authority.
Advisory AI Uses the Same Boundary
Advisory AI follows the same rule.
Helpful interpretation is not score authority.
STEM BIO-AI can export provider-neutral advisory packets and validate downstream advisory responses, but the deterministic scanner does not need an external model runtime to produce the formal score.
If an advisory system becomes useful, it may help interpret findings, prioritize review, or explain evidence patterns.
But unless a future release explicitly changes the policy, advisory output remains structurally subordinate to the deterministic score.
That is enough for this article.
The broader advisory boundary is a separate topic.
From Scoring Tool to Audit Workflow
The `1.7.x` transition is best understood as a shift in the questions the tool is expected to answer.
This is why I describe `1.7.x` as an audit-system transition.
The score still matters.
But the system is increasingly designed around the custody of the score: where it came from, what was allowed to influence it, and what was intentionally kept outside it.
What This Still Does Not Do
This boundary is just as important as the implementation.
STEM BIO-AI still does not:
Those are not missing conveniences.
They are boundaries.
A strong repository evidence tier is still an observable repository-surface signal. It is not clinical clearance, regulatory approval, or proof of biomedical validity.
The Next Version Direction
The next important step is not adding more knobs.
It is authoritative policy read-through in parity mode.
That means:
This is not a big-bang rewrite.
It is authority relocation.
The goal is to move score-affecting constants into versioned policy objects without changing the score by accident.
Only after that parity step does it become safe to discuss broader named profiles.
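A parity step of this kind can be sketched as a comparison between the current hardcoded path and a policy-driven path over a fixed set of fixtures. Everything named here is an assumption for illustration: the weights, the `policy["weights"]` layout, and the function names are hypothetical.

```python
import math

# Hypothetical constants standing in for today's hardcoded scoring values.
HARDCODED = {"stage_1": 0.30, "stage_2r": 0.25, "stage_3": 0.45}


def score_hardcoded(stages: dict) -> float:
    return sum(HARDCODED[name] * stages[name] for name in HARDCODED)


def score_from_policy(stages: dict, policy: dict) -> float:
    weights = policy["weights"]
    return sum(weights[name] * stages[name] for name in weights)


def parity_failures(fixtures: dict, policy: dict) -> list:
    """Names of fixtures whose score changes when read through the policy."""
    return [
        name
        for name, stages in fixtures.items()
        if not math.isclose(
            score_hardcoded(stages),
            score_from_policy(stages, policy),
            abs_tol=1e-9,
        )
    ]
```

An empty result means the policy object reproduces the existing scores on those fixtures. Any non-empty result is a score change that needs an explicit governance decision before the policy becomes authoritative.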
Final Position
The calibration problem is not really about giving users more control.
It is about deciding when control becomes authority.
If every useful signal can gradually influence the score, the score stops being an audit artifact.
It becomes a negotiation.
That is what STEM BIO-AI is trying to avoid.
Researchers should be able to express posture.
Operators should be able to simulate alternatives.
Policy stewards should be able to promote changes.
But the formal score should not move unless the governance path says it moved.
That is the difference between a tuning console and an audit system.