Generational Garbage Collection with Anti-Cruft Packs #5
Open
vaidas-shopify wants to merge 14 commits into master
Introduce the .anchored sidecar file format that identifies a pack as an "anti-cruft" pack: one containing objects known to be reachable from configured anchor refs. This is the foundation for generational garbage collection, where pinned (old generation) objects are skipped during GC walks.
The .anchored file stores:
- The anchor commit OID from which reachability was proven
- The anchor ref name used for validation
- A pinned timestamp recording when the pack was created
Detection follows the same pattern as .mtimes for cruft packs: during pack discovery in add_packed_git(), the presence of a .anchored file sets the is_anchored bit on the packed_git struct.
Also add .anchored to the list of extensions cleaned up by unlink_pack_path().
Exclude packs with the is_anchored flag from geometric repack, same as cruft packs are already excluded. Anti-cruft packs represent the old generation in generational GC and should not be merged or reorganized by the regular geometric progression.
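The detection and exclusion rules from the two commits above can be modelled in a few lines. This is a minimal sketch, not the patch's C code: `PackInfo`, `discover_packs`, and `geometric_candidates` are hypothetical names, and the only behaviour assumed is what the commit messages state (a sidecar file next to the pack sets a flag, and flagged packs are left out of geometric repack).

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class PackInfo:
    """Hypothetical stand-in for the packed_git fields used here."""
    name: str
    is_cruft: bool = False      # a .mtimes sidecar exists
    is_anchored: bool = False   # a .anchored sidecar exists


def discover_packs(pack_dir: Path) -> list[PackInfo]:
    """Scan pack_dir and flag each pack by the sidecar files beside it."""
    packs = []
    for pack in sorted(pack_dir.glob("*.pack")):
        packs.append(PackInfo(
            name=pack.stem,
            is_cruft=pack.with_suffix(".mtimes").exists(),
            is_anchored=pack.with_suffix(".anchored").exists(),
        ))
    return packs


def geometric_candidates(packs: list[PackInfo]) -> list[PackInfo]:
    """Keep only packs that geometric repack is allowed to merge."""
    return [p for p in packs if not p.is_cruft and not p.is_anchored]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pack_dir = Path(tmp)
        for fname in ("pack-aaa.pack", "pack-aaa.anchored",
                      "pack-bbb.pack", "pack-bbb.mtimes",
                      "pack-ccc.pack"):
            (pack_dir / fname).touch()
        for p in geometric_candidates(discover_packs(pack_dir)):
            print(p.name)
```

In this model only the pack with neither sidecar remains a candidate for the geometric merge sequence.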
Add a new "anti-cruft" maintenance task that incrementally pins objects
reachable from configured anchor refs into anti-cruft packs. This is
the core of generational GC: objects in anti-cruft packs form the "old
generation" and can be skipped during future GC walks.
The task:
- Reads anchor refs from maintenance.anti-cruft.anchor (multi-valued)
- Respects maintenance.anti-cruft.min-age (default: 2.weeks.ago)
- For each anchor ref, finds the last-pinned commit from existing
.anchored packs to avoid re-walking already-pinned history
- Uses rev-list --objects --before=<min-age> to find objects to pin
- Packs them via pack-objects and writes a .anchored sidecar
The auto-condition triggers whenever anchor refs are configured,
making this a no-op when the feature is not in use.
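As an illustration of the enumeration step listed above, the sketch below builds the rev-list command line for one anchor ref. `rev_list_args` is a hypothetical helper, not a function from the patch; the flags (`--objects`, `--before`) and the `^<commit>` exclusion are standard rev-list syntax, and the output of such a command would be fed to pack-objects as the task description says.

```python
from typing import Optional


def rev_list_args(anchor_ref: str,
                  min_age: str = "2.weeks.ago",
                  last_pinned: Optional[str] = None) -> list[str]:
    """Build the rev-list invocation that enumerates objects to pin."""
    args = ["git", "rev-list", "--objects", f"--before={min_age}", anchor_ref]
    if last_pinned is not None:
        # Skip history already covered by earlier .anchored packs.
        args.append(f"^{last_pinned}")
    return args


if __name__ == "__main__":
    # First run for an anchor: nothing pinned yet, walk everything old enough.
    print(" ".join(rev_list_args("refs/heads/main")))
    # Later run: resume from the last pinned commit (placeholder OID).
    print(" ".join(rev_list_args("refs/heads/main", last_pinned="abc1234")))
```

Passing the last pinned commit as an exclusion is what keeps each run incremental: only commits that crossed the min-age threshold since the previous run are walked.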
Add validation that runs at the start of each anti-cruft maintenance task. For each existing anti-cruft pack, verify:
1. The .anchored file can be loaded
2. The anchor ref still exists
3. The recorded anchor commit is an ancestor of the current ref tip
If any check fails, the pack is demoted to a regular pack by removing the .anchored file. This handles force-pushes and ref deletions gracefully: demoted packs re-enter the normal geometric repack pipeline and their objects will be correctly classified by the next GC.
Uses "git merge-base --is-ancestor" for the ancestry check, which is near-constant-time when a commit-graph exists.
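The three checks can be sketched as a single predicate. This is an in-memory model under stated assumptions: `AnchoredMeta` and `is_valid` are hypothetical names, ref tips are supplied as a plain mapping, and the ancestry test is injected so the logic can be exercised without a repository. `git_is_ancestor` shows one way to back it with the real `git merge-base --is-ancestor` command, treating any non-zero exit status as a failed check.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class AnchoredMeta:
    """In-memory model of the three fields the .anchored file records."""
    anchor_oid: str
    anchor_ref: str
    pinned_timestamp: int


def git_is_ancestor(ancestor: str, descendant: str) -> bool:
    """True only when git confirms ancestry (exit status 0)."""
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", ancestor, descendant],
        check=False)
    return result.returncode == 0


def is_valid(meta: Optional[AnchoredMeta],
             ref_tips: Mapping[str, str],
             is_ancestor: Callable[[str, str], bool]) -> bool:
    """Return False if the pack should be demoted to a regular pack."""
    if meta is None:
        return False  # 1. the sidecar could not be loaded
    tip = ref_tips.get(meta.anchor_ref)
    if tip is None:
        return False  # 2. the anchor ref no longer exists
    # 3. the recorded anchor commit must be an ancestor of the ref tip
    return is_ancestor(meta.anchor_oid, tip)
```

A caller would remove the .anchored file for every pack where `is_valid` returns False, which is the demotion described above.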
Add a new "scoped-gc" maintenance task that performs lightweight garbage collection scoped to the young generation only. Anti-cruft packs (the old generation) are passed to repack via --keep-pack, so the reachability walk and object rewrite only cover unpinned objects.
This achieves the key benefit of generational GC: GC cost is proportional to the young generation size (recent objects), not the total repository size. When main's history has been pinned by the anti-cruft task, scoped-gc only walks a few weeks of history.
The task uses "git repack -d -l --cruft --cruft-expiration=<exp>" with --keep-pack for each anti-cruft pack. Unreachable objects in the young generation are moved to cruft packs; expired ones are dropped.
The auto-condition requires both anti-cruft packs and regular packs to exist, making this a no-op before the first anti-cruft run.
Configurable via maintenance.scoped-gc.expiration (default: 2.weeks.ago).
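A short sketch of how the scoped-gc command line and its auto-condition fit together. `should_run` and `scoped_gc_args` are hypothetical helpers; the repack flags are the ones quoted in the commit message, and the pack file names in the demo are placeholders.

```python
from typing import Sequence


def should_run(anchored_packs: Sequence[str],
               regular_packs: Sequence[str]) -> bool:
    """Auto-condition: both generations must exist for scoped-gc to do work."""
    return bool(anchored_packs) and bool(regular_packs)


def scoped_gc_args(anchored_packs: Sequence[str],
                   expiration: str = "2.weeks.ago") -> list[str]:
    """Build the repack invocation, keeping every anti-cruft pack intact."""
    args = ["git", "repack", "-d", "-l", "--cruft",
            f"--cruft-expiration={expiration}"]
    for pack_name in anchored_packs:
        # --keep-pack takes the pack file name without its directory.
        args.append(f"--keep-pack={pack_name}")
    return args


if __name__ == "__main__":
    anchored = ["pack-aaa.pack"]   # placeholder names
    regular = ["pack-ccc.pack"]
    if should_run(anchored, regular):
        print(" ".join(scoped_gc_args(anchored)))
```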
Add the new generational GC tasks to the geometric maintenance strategy schedule:
- anti-cruft: daily (after geometric-repack, pins old objects)
- scoped-gc: weekly (prunes unreachable from young generation)
The resulting geometric strategy schedule is:
hourly: commit-graph
daily: geometric-repack, pack-refs, anti-cruft
weekly: rerere-gc, reflog-expire, worktree-prune, scoped-gc
Tasks execute in enum order, so geometric-repack runs before anti-cruft (which needs consolidated packs to pin efficiently), and scoped-gc runs last (after reflog-expire has made objects unreachable).
Both tasks are no-ops when maintenance.anti-cruft.anchor is not configured, so existing users see no behavioral change.
Add documentation for the new generational GC maintenance tasks and their configuration options:
- maintenance.anti-cruft.anchor
- maintenance.anti-cruft.min-age
- maintenance.anti-cruft.batch-size
- maintenance.scoped-gc.expiration
- maintenance.scoped-gc.grace-period
The task descriptions are added to git-maintenance.adoc and the config entries to config/maintenance.adoc.
d0a0f34 to 0827f33
Anti-cruft packs for the same anchor ref form an incremental chain via ^<last_pinned> exclusion: each pack depends on earlier packs for completeness. When a pack fails validation (e.g., force-push on the anchor ref), demote all packs for that ref with equal or later pinned_timestamp to preserve the closed-set property.
Scoped-gc currently enumerates all objects in the repository even though it only rewrites the young generation. This is because --keep-pack prevents rewriting kept-pack objects but the reachability walk still traverses into them.
Add --kept-pack-boundary (internal, hidden) to pack-objects and repack. When set, the revision walk stops at commits in kept packs (skipping parent processing), and tree/blob traversal skips objects found in kept packs. This is safe because the union of all anti-cruft packs is closed under reachability; cascade demotion ensures this invariant holds after force-pushes.
Scoped-gc passes --kept-pack-boundary when anchored packs exist, bounding both enumeration and rewrite cost to the young generation.
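The cascade demotion rule can be written as a simple filter. This is a sketch of the selection logic only, with hypothetical names (`AnchoredPack`, `cascade_demote`); it assumes the pinned timestamp is comparable as an integer and does not model how the .anchored files are removed.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AnchoredPack:
    """Hypothetical view of one anti-cruft pack and its sidecar fields."""
    name: str
    anchor_ref: str
    pinned_timestamp: int


def cascade_demote(packs: Iterable[AnchoredPack],
                   failed: AnchoredPack) -> list[str]:
    """Names of packs to demote when `failed` does not pass validation.

    Later packs in the chain excluded history via ^<last_pinned>, so they
    are incomplete without the failed pack and must be demoted with it.
    """
    return [p.name for p in packs
            if p.anchor_ref == failed.anchor_ref
            and p.pinned_timestamp >= failed.pinned_timestamp]


if __name__ == "__main__":
    chain = [
        AnchoredPack("pack-1", "refs/heads/main", 100),
        AnchoredPack("pack-2", "refs/heads/main", 200),
        AnchoredPack("pack-3", "refs/heads/main", 300),
        AnchoredPack("pack-r", "refs/heads/release/v1", 250),
    ]
    print(cascade_demote(chain, chain[1]))
```

Packs pinned before the failed one, and packs for other anchor refs, are untouched in this model.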
0827f33 to 6258529
A repository may use both full gc and scoped-gc, so having two independent configs for the same cruft expiration threshold is error-prone. Make scoped-gc fall back to gc.pruneExpire when maintenance.scoped-gc.expiration is unset, so that a single config controls both paths by default.
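The fallback order can be sketched as a lookup over an already-parsed configuration. `scoped_gc_expiration` is a hypothetical helper and the mapping stands in for git's config machinery; keys are matched exactly here for simplicity. The final default of 2.weeks.ago is the documented default of both settings.

```python
from typing import Mapping


def scoped_gc_expiration(config: Mapping[str, str]) -> str:
    """Resolve the cruft expiration used by scoped-gc."""
    for key in ("maintenance.scoped-gc.expiration", "gc.pruneExpire"):
        if key in config:
            return config[key]
    return "2.weeks.ago"


if __name__ == "__main__":
    print(scoped_gc_expiration({"gc.pruneExpire": "1.month.ago"}))
```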
Add trace2 instrumentation to the three anti-cruft generational maintenance tasks to aid debugging and performance analysis:
- anti-cruft: log anchor count, per-anchor regions with objects pinned/skipped/already-packed counts, and batch truncation status
- consolidate-anti-cruft: log group count, per-group regions with pack totals and merge counts
- scoped-gc: log pinning readiness, expiration value, and kept pack count
Generational Garbage Collection with Anti-Cruft Packs
Overview
Traditional Git garbage collection walks all objects in the repository to
determine reachability, then repacks everything. This cost scales with the
total repository size and becomes prohibitive for large repositories.
Generational GC takes a different approach: pin objects that are known to
be reachable into long-lived "anti-cruft" packs (the old generation), then
scope garbage collection to only the unpinned remainder (the young
generation). This bounds GC cost to the size of the young generation, not
the total repository.
....
Traditional GC:
  Walk ALL objects -> classify -> repack everything
  Cost: O(total repository)

Generational GC:
  Pin objects reachable from stable refs -> anti-cruft pack (old gen)
  GC walks only unpinned objects -> classify -> repack remainder
  Cost: O(unpinned objects) << O(total repository)
....
Object classification
Objects in a repository are classified into three tiers:
Tier 1: Anti-cruft packs (old generation)::
Objects reachable from configured anchor refs, identified by an
.anchored sidecar file. GC treats these as reachable without
walking into them.
Tier 2: Regular packs (young generation)::
Objects not yet pinned: recent commits, feature branches, fetched
objects. Managed by geometric repack and subject to scoped-gc
reachability walks.
Tier 3: Cruft packs (unreachable, awaiting expiration)::
Objects proven unreachable during scoped GC. Same .mtimes mechanism
as today. Expired by cruft expiration.
Anti-cruft packs
An anti-cruft pack is a standard packfile accompanied by an
.anchored sidecar file that records the reachability proof.
Anchor refs are user-configured refs that represent stable, long-lived
history:
maintenance.anti-cruft.anchor = refs/heads/main
maintenance.anti-cruft.anchor = refs/heads/release/v1
Only objects reachable from commits older than
maintenance.anti-cruft.min-age (default: 2.weeks.ago) are pinned.
This avoids pinning objects from recent commits that might still be
rewritten.
.anchored file format
Scoped-gc passes --kept-pack-boundary to git repack, which tells
git pack-objects to treat objects in kept (anti-cruft) packs as
traversal boundaries. When the revision walk encounters a commit in a
kept pack, it skips parent processing entirely: no ancestors are queued.
Similarly, when tree or blob traversal encounters an object in a kept
pack, it marks it seen and does not recurse further.
This relies on the closed-set property: the union of all anti-cruft
packs is closed under object reachability. Every object transitively
reachable from an anchored object is guaranteed to be in some anchored
pack, because anti-cruft packs are created by rev-list --objects, which
enumerates the full transitive closure. The ^<last_pinned> exclusion
means individual packs are not self-contained, but the union across all
packs for an anchor ref is. Cascade demotion (see Validation above)
preserves this property after force-pushes.
If main's history is pinned up to 2 weeks ago, the walk only needs to
traverse approximately 2 weeks of commits before hitting pinned objects
and stopping. Both enumeration and writing are bounded by the young
generation size.
....
Full GC walk depth:     entire history (years)
Scoped GC walk depth:   min-age window (weeks)

Full GC enumeration:    all objects
Scoped GC enumeration:  only unpinned objects

Full GC rewrite:        all objects
Scoped GC rewrite:      only unpinned objects
....
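The boundary behaviour can be illustrated on a toy commit graph. This is a simplified model, not the pack-objects implementation: `walk_young` is a hypothetical name, only commits are modelled (trees and blobs would be cut off the same way), and the kept set stands in for "objects found in anti-cruft packs". It relies on the closed-set property described above, since everything behind a kept commit is assumed to be kept as well.

```python
from typing import Iterable, Mapping, Sequence


def walk_young(tips: Iterable[str],
               parents: Mapping[str, Sequence[str]],
               kept: set[str]) -> list[str]:
    """Return the commits a boundary-aware walk actually enumerates."""
    seen: set[str] = set()
    visited: list[str] = []
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        if commit in kept:
            # Boundary: mark as seen, but queue no parents.
            continue
        visited.append(commit)
        stack.extend(parents.get(commit, ()))
    return visited


if __name__ == "__main__":
    # Linear main history c1 <- c2 <- c3 <- c4 <- c5, plus a feature
    # commit f1 branching from c4. c1..c3 are already pinned.
    parents = {"c5": ["c4"], "c4": ["c3"], "c3": ["c2"],
               "c2": ["c1"], "c1": [], "f1": ["c4"]}
    kept = {"c1", "c2", "c3"}
    print(sorted(walk_young(["c5", "f1"], parents, kept)))
```

With an empty kept set the same function visits the whole history, which corresponds to the full GC rows in the comparison above.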
Integration with geometric repack
Geometric repack skips anti-cruft packs the same way it already skips cruft
packs. Anti-cruft packs are excluded from the geometric merge sequence and
survive geometric repack unchanged.
Configuration
maintenance.anti-cruft.anchor::
Multi-valued. Refs to use as anchors for anti-cruft packs. Each
value must be an exact ref name. The anti-cruft task is a no-op
if no anchors are configured. No default.
maintenance.anti-cruft.min-age::
Approxidate. Only pin objects from commits older than this.
Default: 2.weeks.ago.
maintenance.anti-cruft.batch-size::
Integer. Maximum objects to pin per anchor per run. 0 means
unlimited. Default: 0.
maintenance.scoped-gc.expiration::
Approxidate. Cruft expiration threshold for scoped-gc. Passed as
--cruft-expiration to linkgit:git-repack[1]. Default: 2.weeks.ago.
maintenance.scoped-gc.grace-period::
Approxidate. How far behind anti-cruft pinning can lag before
scoped-gc skips its run. The actual cutoff is min-age plus this
value. Default: 1.week.ago.
Steady-state behavior
....
Week 1, Day 1:
  geometric-repack merges small regular packs (same as today)
  anti-cruft task pins objects reachable from main (>2 weeks old)
    -> first anti-cruft pack created

Week 1, Day 2-6:
  geometric-repack continues managing regular packs
  anti-cruft task incrementally pins new objects crossing min-age
    -> anti-cruft packs grow (or new small ones created)

Week 1, Day 7:
  scoped-gc runs:
    -> walks only unpinned objects (recent history)
    -> unreachable unpinned objects -> cruft pack
    -> expired cruft -> dropped
    -> cost: proportional to ~2 weeks of history, not years

Week 2+:
  anti-cruft packs ~ all of main's reachable history
  regular packs    ~ last 2 weeks of objects
  cruft packs      ~ 0-2 weeks of unreachable objects
  scoped-gc cost   ~ constant (bounded by min-age window)
....
Safety properties
The system is safe as long as validation is conservative (demote on any
doubt):
are subject to normal GC, which correctly classifies them.
correctly identifies newly-unreachable objects.
might have been incorrectly retained.
Risks and trade-offs
Delta compression across generations::
Objects in anti-cruft packs are delta-compressed within the pack.
Objects in regular packs cannot delta against anti-cruft pack objects
(different pack-objects invocation). This may increase total size
slightly vs. a single full repack. The cross-pack delta loss is bounded
by the young generation size.
Anti-cruft pack proliferation::
Each anchor ref per run could create a new small anti-cruft pack.
Mitigation: merge small anti-cruft packs using the same geometric
progression, or accumulate into a single pack per anchor ref.
Correctness of the skip optimization::
When scoped-gc keeps anti-cruft packs via --keep-pack, it assumes
every object in those packs is reachable. If validation misses a case
where this is not true, unreachable objects could survive indefinitely.
Mitigation: periodic full GC as a safety net.