Non-record JEPA-style regression transformer submission: VRS (Void Rescue System) #1513

Open
ikermoel wants to merge 1 commit into openai:main from ikermoel:codex/vrs-nonrecord-submission
@ikermoel commented Apr 9, 2026

Summary

This PR adds a non-record 16MB submission for VRS (Void Rescue System), a JEPA-style regression transformer with a small auxiliary rescue decoder.

The main contribution is not SOTA BPB. The contribution is a new research direction for regression-based transformers:

regression latents can contain useful token information before they are directly decodable, and a tiny jointly learned decoder can correct part of that geometric misalignment.

Core Research Idea

VRS is built around a specific hypothesis about regression decoding.

The Navigator is trained to predict the next token as a continuous embedding vector using MSE, instead of predicting vocabulary logits directly. In this setting, the raw latent can already carry token information but still decode poorly under the shared embedding geometry. In the paper, these ambiguous regions are called voids.

VRS adds a small Rescuer module that maps:

  • v_void -> v_rescued

The system is intentionally split into two roles:

  • Navigator: learns contextual geometry
  • Rescuer: learns lexical / embedding-space correction
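
To make the idea of a "void" concrete, here is a minimal toy sketch, assuming that decoding picks the nearest row of the shared embedding table by Euclidean distance. The 2-D embedding table, the latent values, and the hand-written `toy_rescuer` are all hypothetical illustrations; the Rescuer in this submission is a learned module, not a fixed shift.

```python
import math

def nearest_token(vec, emb_table):
    """Return the index of the embedding row closest to vec (Euclidean distance)."""
    return min(range(len(emb_table)), key=lambda i: math.dist(vec, emb_table[i]))

# Hypothetical 2-D embedding table for a 3-token vocabulary.
emb_table = [
    [1.0, 0.0],   # token 0
    [0.0, 1.0],   # token 1
    [-1.0, 0.0],  # token 2
]
target_id = 1

# Toy "void" latent: it carries signal for token 1 (large second coordinate)
# but still sits slightly closer to token 0, so the raw decode is wrong.
v_void = [0.55, 0.45]

def toy_rescuer(v):
    # Hand-written correction for illustration only; NOT the learned Rescuer.
    return [v[0] - 0.3, v[1] + 0.3]

v_rescued = toy_rescuer(v_void)

print("raw decode:    ", nearest_token(v_void, emb_table))
print("rescued decode:", nearest_token(v_rescued, emb_table))
```

In this toy, the raw latent decodes to token 0 while the corrected latent decodes to the target token 1, which is the kind of geometric misalignment the Rescuer is meant to reduce.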

This submission is meant as a concrete JEPA / regression contribution under Parameter Golf constraints, not as a leaderboard-optimized architecture.

Why this may be interesting for Parameter Golf

  • the challenge README explicitly asks to see JEPA submissions
  • the method stays under the 16MB artifact cap
  • it is stable across 3 separate 10-minute 8xH100 runs
  • it improves over the raw internal regression decode path
  • it also improves over standalone regression-only baselines trained separately under the same budget

Included Files

records/track_non_record_16mb/2026-04-09_VRS_VoidRescueSystem_JEPARegression/

Contents:

  • README.md
  • submission.json
  • train_gpt.py
  • train.log
  • train_seed42.log
  • train_seed1337.log
  • results.tsv
  • vrs-spec.txt

Metrics

Best included run:

  • val_bpb = 1.8658
  • total artifact bytes = 15,980,840

3-seed mean:

  • val_bpb = 1.8667
  • raw-path val_bpb_A = 1.9436
  • peak nn_acc ≈ 0.5051

Regression-only baselines (separate runs, no VRS):

  • val_bpb = 2.0941 to 2.1301

So the gain is not just an internal probe effect; the rescue module also improves over standalone regression training.
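
As a quick check of the arithmetic behind these figures, the snippet below computes the gaps between the reported numbers. It assumes the 16MB cap means 16,000,000 bytes (decimal megabytes), and it compares the baseline range against the 3-seed mean; both choices are assumptions made for illustration.

```python
# Figures as reported in this PR description.
vrs_bpb = 1.8667                       # 3-seed mean val_bpb (rescued path)
raw_path_bpb = 1.9436                  # 3-seed mean val_bpb_A (raw path)
baseline_bpb_range = (2.0941, 2.1301)  # regression-only baselines, no VRS

# Absolute BPB improvements (lower BPB is better).
gain_over_raw = round(raw_path_bpb - vrs_bpb, 4)
gain_over_baselines = tuple(round(b - vrs_bpb, 4) for b in baseline_bpb_range)

# Artifact headroom for the best included run.
ARTIFACT_BYTES = 15_980_840
CAP_BYTES = 16_000_000  # assumption: cap is in decimal megabytes
headroom = CAP_BYTES - ARTIFACT_BYTES

print(gain_over_raw)        # 0.0769
print(gain_over_baselines)  # (0.2274, 0.2634)
print(headroom)             # 19160
```

Under these assumptions, the rescued path is about 0.077 BPB better than the raw path and roughly 0.23 to 0.26 BPB better than the separately trained baselines.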

Note on track choice

This is submitted as a non-record research contribution because the value is the architectural idea and the empirical evidence around regression decoding, not leaderboard SOTA.

@ikermoel changed the title from "Non-record JEPA submission: VRS (Void Rescue System)" to "Non-record regression submission: VRS (Void Rescue System)" on Apr 9, 2026
@ikermoel changed the title from "Non-record regression submission: VRS (Void Rescue System)" to "Non-record JEPA-style regression transformer submission: VRS (Void Rescue System)" on Apr 9, 2026
@MatoTeziTanka

Community Review — Non-record JEPA-style regression transformer: VRS (Void Rescue System)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #1513 ("VRS_VoidRescueSystem_JEPARegression") implements a JEPA-style architecture (JEPAVRS) with a causal transformer "Navigator" (Model A) and a small MLP "Rescuer" (Model B). The submission is clean on all four compliance checks.

## N-gram / hash family bug

No hash tables, prime arrays, context hashing, or XOR-based key lookups of any kind. input_ids[..., 1:] ^ input_ids[..., :-1] is also absent. The submission does not touch n-gram machinery at all.

## Pre-Quant TTT

No test-time training. val_tokens is consumed exclusively inside eval_val() (lines 193–246), which is wrapped entirely in torch.inference_mode() with model.eval(). No optimizer step, no backward pass, and no gradient computation touches val_tokens at any point.

## Score-first-per-chunk TTT (PR #1413 pattern)

Not present, but this is expected for PURE_NEURAL_CLEAN; absence is correct.

## Scored-region SLOT

Not present. The training loop (lines 832–904) uses train_loader (train split only) for gradient computation. Validation runs as a read-only diagnostic branch (if should_validate) that returns before any optimizer step. There is no masking or optimizing of scored regions.

## Training objective

Lines 611–617: training loss is pure MSE in embedding space, with loss_A = F.mse_loss(v_void, target_emb), loss_B = F.mse_loss(v_rescued, target_emb), and target_emb = self.tok_emb(target_ids).detach(). Cross-entropy and logits are only computed in the else (eval) branch (lines 619–631) under model.eval() / torch.inference_mode(). No CE is involved in gradient flow.

## Summary

The model trains entirely on the train split using pure MSE regression toward detached token embeddings. Val tokens are only read under inference_mode for reporting BPB. No auxiliary lookup structures, no...
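
The train/eval split described in this review can be sketched roughly as follows. This is a minimal PyTorch illustration, not the code from train_gpt.py: the `ToyVRS` class, its layer sizes, the single linear layer standing in for the causal transformer Navigator, the unweighted sum of the two losses, and the dot-product logits in the eval branch are all assumptions. Only the `F.mse_loss` calls and the detached `target_emb` follow the lines quoted in the review.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVRS(nn.Module):
    """Hypothetical stand-in for the JEPAVRS model; shapes are illustrative."""

    def __init__(self, vocab_size: int = 32, dim: int = 16):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        # Assumption: one linear layer stands in for the transformer Navigator.
        self.navigator = nn.Linear(dim, dim)
        # Assumption: the Rescuer is a small two-layer MLP.
        self.rescuer = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, input_ids, target_ids):
        v_void = self.navigator(self.tok_emb(input_ids))
        v_rescued = self.rescuer(v_void)
        if self.training:
            # Targets are detached, so no gradient flows through the target side.
            target_emb = self.tok_emb(target_ids).detach()
            loss_A = F.mse_loss(v_void, target_emb)
            loss_B = F.mse_loss(v_rescued, target_emb)
            return loss_A + loss_B  # assumption: unweighted sum
        # Eval branch: logits and cross-entropy are computed for reporting only.
        logits = v_rescued @ self.tok_emb.weight.T  # assumption: dot-product decode
        return F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())

torch.manual_seed(0)
model = ToyVRS()
input_ids = torch.randint(0, 32, (2, 8))
target_ids = torch.randint(0, 32, (2, 8))

# Training step: the MSE loss participates in backpropagation.
model.train()
train_loss = model(input_ids, target_ids)
train_loss.backward()

# Evaluation: no autograd graph is recorded under inference_mode.
model.eval()
with torch.inference_mode():
    eval_ce = model(input_ids, target_ids)
```

In this sketch the cross-entropy value never carries gradients, which mirrors the review's point that CE is confined to the evaluation branch.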

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka, The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
