Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file modified docs/paper/ordvec_formalization_paper.pdf
Binary file not shown.
95 changes: 85 additions & 10 deletions docs/paper/ordvec_formalization_paper.tex
Original file line number Diff line number Diff line change
Expand Up @@ -103,6 +103,16 @@ \section{Purpose and Scope}
question is whether a specific admission decision can be made with no loss
after passing from \(\Omega\) to \(Z\).

In applications, the chosen ``full'' observation space is already relative.
A neural embedding, sparse lexical vector, chunked document view, or bitmap
signature can itself be treated as an observation map out of a richer source.
The formalization does not require this chain to end in an ontologically
complete \(\Omega\). It fixes the finite observation layer being audited and
asks whether the target under discussion is invariant on the fibers of the next
quotient. The recursion stops at the target: once the decision, score,
comparison, or multi-target signature is fixed, the theorem asks whether the
proposed quotient preserves exactly that behavior.

The Lean repository answers this question in a finite setting. It proves that
under exact factorization assumptions, quotient-form rules are sufficient for
Bayes-optimal admission. It also proves the converse kind of structural
Expand Down Expand Up @@ -414,7 +424,12 @@ \subsection{Cardinality reduction}
\card{A^Z}=\card{A}^{\card{Z}}.
\]
These are \lean{card_fullTargets}, \lean{card_quotientRules}, and
\lean{quotientCompatible_search_space_bound}. Lean also defines
\lean{quotientCompatible_search_space_bound}. On a reachable image, the
generic count
\[
\card{A^{\im(Q)}}=\card{A}^{\card{\im(Q)}}
\]
is checked by \lean{card_quotientImageRules}. Lean also defines
\lean{UnobservedReachableQuotients} for the reachable image buckets not yet
seen in the sample. The induced full-target unobserved-bucket freedom
has the corresponding count
Expand All @@ -434,7 +449,7 @@ \subsection{Cardinality reduction}
full-target uncertainty is exactly the finite set of assignments on reachable
unobserved image buckets.

\subsection{Refinement and products}
\subsection{Refinement and paired targets over one quotient}

If a fine quotient \(Q_f\) determines a coarse quotient \(Q_c\), Lean writes
this as \lean{QuotientRefines}. Theorems
Expand All @@ -446,10 +461,12 @@ \subsection{Refinement and products}
factor through the fine quotient rules out factorization through any coarser
one.

The product theorem \lean{RuleFactorsThrough.prod_iff} proves that a paired
target factors through a quotient exactly when both components factor through
it. The sample-level analogue \lean{SampleConsistent.prod_iff} gives the same
componentwise criterion for finite samples.
The paired-target theorem \lean{RuleFactorsThrough.prod_iff} proves that a
paired target factors through a single quotient exactly when both components
factor through it. The sample-level analogue \lean{SampleConsistent.prod_iff}
gives the same componentwise criterion for finite samples. This is distinct
from the later product-map quotient \(C_q\times C_d\), which handles
query/document-style product domains.

\begin{interpretation}
These theorems are useful for probe design. A refined quotient gives more
Expand Down Expand Up @@ -604,6 +621,7 @@ \subsection{Product quotients, ranking, and kernel refinements}
\begin{leanblock}
\lean{productMap}\quad
\lean{ProductFiberInvariant}\quad
\lean{kernelContainedInTarget_productMap_iff_productFiberInvariant}\\
\lean{ruleFactorsThrough_product_fiberInvariant}\\
\lean{productFiberInvariant_ruleFactorsThrough_of_surjective}\\
\lean{ruleFactorsThrough_productMap_image_iff_productFiberInvariant}
Expand All @@ -612,6 +630,13 @@ \subsection{Product quotients, ranking, and kernel refinements}
constant on product fibers. The surjective and image-restricted converses are
the clean formal versions of working on the actually reached product image
rather than pretending that the full ambient code space is populated.
The finite product-search vocabulary names both observed and reachable
unobserved product buckets:
\begin{leanblock}
\lean{ObservedProductQuotients}\quad
\lean{UnobservedReachableProductQuotients}\quad
\lean{card_productQuotientRules}
\end{leanblock}
The finite search space is then exactly controlled by the component images:
\begin{leanblock}
\lean{productMap_image_subset_product_images}\\
Expand All @@ -635,10 +660,16 @@ \subsection{Product quotients, ranking, and kernel refinements}
\card{A}^{\card{\im(C_q)}\card{\im(C_d)}}
\]
for finite label type \(A\).
This exact product-image equality is a theorem about the full formal product
domain \(\Omega_q\times\Omega_d\). A logged candidate graph, prefiltered
trace, or finite benchmark subset need not realize every independent
query/document pair; such restricted empirical surfaces should be treated as
their own finite observed images.

Finite pair samples get the corresponding falsifier:
\begin{leanblock}
\lean{ProductSampleConsistent}\quad
\lean{productSampleConsistent_iff_sampleConsistent_productMap}\\
\lean{productSampleConsistent_of_ruleFactorsThrough}\\
\lean{productSampleConsistent_iff_exists_productRuleFitsSample}\\
\lean{no_productQuotientTarget_of_sample_collision}
Expand Down Expand Up @@ -667,8 +698,11 @@ \subsection{Product quotients, ranking, and kernel refinements}
Ranking comparisons are handled by a three-field comparison observation:
\begin{leanblock}
\lean{ComparisonObs}\quad
\lean{comparisonObsEquivProd}\quad
\lean{comparisonMap}\quad
\lean{ComparisonFiberInvariant}\\
\lean{comparisonMap_eq_iff}\\
\lean{ComparisonFiberInvariant}\quad
\lean{ruleFactorsThrough_comparison_fiberInvariant}\\
\lean{rankByScore}\quad
\lean{rankByScore_factorsThrough_of_score_factorsThrough}
\end{leanblock}
Expand Down Expand Up @@ -699,6 +733,7 @@ \subsection{Product quotients, ranking, and kernel refinements}
\lean{FinitePairQuotient.lean} remains as a retrieval-facing wrapper around the
generic product API:
\begin{leanblock}
\lean{pairQuotient}\quad
\lean{pairQuotient_eq_iff}\quad
\lean{pairRuleFactorsThrough_same_on_quotient_fibers}\quad
\lean{no_pair_compatible_target_of_sample_collision}
Expand Down Expand Up @@ -1216,6 +1251,26 @@ \subsection{What gets reduced}
unobserved, Boolean extrapolations have size \(2^m\). This is exactly the
``smaller finite set of extrapolations'' established by the Lean search layer.

For pairwise retrieval targets, the haystack is not a document-only bucket
space. It is the reachable product image of compressed query and document
codes. On the full formal product domain this has size
\(\card{\im(C_q)}\card{\im(C_d)}\), and comparison/ranking targets live over
the one-query/two-document image box of size
\(\card{\im(C_q)}\card{\im(C_d)}^2\). On a finite logged sample or prefiltered
trace, the observed image may be smaller, and the same finite-bucket logic
applies to that restricted surface.

For probe families, the reduction is by the realized joint-code image. The
point is not that each bitmap, ordinal, lexical, query-transform, or
context-window probe is injective by itself. The finite question is whether
agreement on the whole joint code preserves the target behavior being audited.

For score-threshold decisions, the margin/profile layer gives a different
kind of reduction: supplied finite error and margin certificates can turn an
approximate quotient score into an exact threshold-decision certificate. Near
the decision boundary, this requires explicit margin evidence; absent that
evidence, the theorem intentionally does not certify equality.

\subsection{How bitmap probes fit}

For small bitmap probes, the constant-weight theorem gives an idealized target
Expand Down Expand Up @@ -1260,6 +1315,16 @@ \section{Non-Claims}
\item It does not prove that any concrete family of production probes is
globally target-sufficient. The window layer proves the finite kernel
calculus for such probe families once the observations and target are fixed.
\item It does not prove top-\(k\) list equality, tie-policy preservation, or
nDCG preservation from the ranking-comparison theorem alone. Those require
additional ordering, tie, or margin assumptions.
\item It does not let product-image equalities for the full formal product
domain be read as claims about arbitrary logged candidate graphs or
prefiltered empirical traces.
\item It does not certify near-threshold score decisions unless the supplied
score-error and margin hypotheses exclude those near ties.
\item It does not prove a continuous Takens-style embedding theorem. The
observation-window layer is a finite-domain target-separation calculus.
\item It does not include randomized tests, Neyman--Pearson optimality,
uniformly most powerful tests, Karlin--Rubin theory, asymptotics, or
measure-theoretic probability.
Expand Down Expand Up @@ -1288,8 +1353,8 @@ \section{Lean Verification Surface}
[\texttt{propext},\ \texttt{Classical.choice},\ \texttt{Quot.sound}].
\]

The root import file and public verification dashboard cover these main
modules and support layers:
The root import file covers the main modules directly, and the public
verification dashboard also checks their lower support layers:
\begin{leanblock}
\lean{FiniteExperiment}\quad
\lean{FiniteQuotientSearch}\quad
Expand Down Expand Up @@ -1323,7 +1388,7 @@ \section{Lean Verification Surface}

\section{Conclusion}

The Lean development proves a finite theorem stack with two complementary
The Lean development proves a finite theorem stack with three complementary
messages.

First, if the Bayes-relevant evidence for a binary admission task factors
Expand All @@ -1338,6 +1403,16 @@ \section{Conclusion}
leave unobserved extrapolation freedom. It narrows the field of possible
global structures without pretending to identify real encoders automatically.

Third, the same contract composes into retrieval-facing structures. Product
quotients express pairwise query/document targets and multi-target signatures;
comparison quotients express score-induced ranking decisions; observation
windows express finite bundles of lossy probes; and margin/profile theorems
turn supplied finite score certificates into exact threshold-decision
consequences. These layers do not remove the empirical burden. They specify
what must be tested: which compressed buckets, product codes, joint probe
codes, and near-boundary margins would falsify or support the claimed quotient
contract.

\begin{thebibliography}{9}

\bibitem{Fisher1922}
Expand Down
Loading