Generational Garbage Collection with Anti-Cruft Packs #5

Open
vaidas-shopify wants to merge 14 commits into master from anti-cruft2

Generational Garbage Collection with Anti-Cruft Packs

Overview

Traditional Git garbage collection walks all objects in the repository to
determine reachability, then repacks everything. This cost scales with the
total repository size and becomes prohibitive for large repositories.

Generational GC takes a different approach: pin objects that are known to
be reachable into long-lived "anti-cruft" packs (the old generation), then
scope garbage collection to only the unpinned remainder (the young
generation). This bounds GC cost to the size of the young generation, not
the total repository.

....
Traditional GC:
Walk ALL objects -> classify -> repack everything
Cost: O(total repository)

Generational GC:
Pin objects reachable from stable refs -> anti-cruft pack (old gen)
GC walks only unpinned objects -> classify -> repack remainder
Cost: O(unpinned objects) << O(total repository)
....

Object classification

Objects in a repository are classified into three tiers:

Tier 1: Anti-cruft packs (old generation)::
Objects reachable from configured anchor refs, identified by an
.anchored sidecar file. GC treats these as reachable without
walking into them.

Tier 2: Regular packs (young generation)::
Objects not yet pinned: recent commits, feature branches, fetched
objects. Managed by geometric repack and subject to scoped-gc
reachability walks.

Tier 3: Cruft packs (unreachable, awaiting expiration)::
Objects proven unreachable during scoped GC. Same .mtimes mechanism
as today. Expired by cruft expiration.
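
As a rough illustration of the three tiers, the sketch below classifies a packfile by the sidecar file found next to it. The function name `classify_pack` and the rule that `.anchored` wins when both sidecars are present are assumptions made for this example, not behavior taken from the patches.

```python
from pathlib import Path


def classify_pack(pack_path):
    """Return the tier of a packfile, judged only by its sidecar files.

    Hypothetical helper: checks for a sibling `.anchored` or `.mtimes`
    file sharing the pack's base name.
    """
    pack = Path(pack_path)
    if pack.with_suffix(".anchored").exists():
        return "anti-cruft"  # Tier 1: old generation
    if pack.with_suffix(".mtimes").exists():
        return "cruft"       # Tier 3: unreachable, awaiting expiration
    return "regular"         # Tier 2: young generation
```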

Anti-cruft packs

An anti-cruft pack is a standard packfile accompanied by an .anchored
sidecar file that records the reachability proof.

Anchor refs are user-configured refs that represent stable, long-lived
history:


maintenance.anti-cruft.anchor = refs/heads/main
maintenance.anti-cruft.anchor = refs/heads/release/v1

Only objects reachable from commits older than maintenance.anti-cruft.min-age
(default: 2.weeks.ago) are pinned. This avoids pinning objects from recent
commits that might still be rewritten.

.anchored file format


The `.anchored` file uses the following binary format (version 1):

....
Bytes 0-3:    signature (0x414e4348 = "ANCH", network byte order)
Bytes 4-7:    version (1, network byte order)
Bytes 8-11:   hash_id (1=SHA1, 2=SHA256, network byte order)
Next N bytes: anchor_commit OID (raw, 20 or 32 bytes)
Next 4 bytes: pinned_timestamp (network byte order)
Next:         anchor_ref name (NUL-terminated string)
Next N bytes: trailing checksum of all preceding data
....

The anchor commit and ref are recorded so the pack's reachability guarantee
can be validated: if the anchor ref no longer points to a descendant of the
anchor commit, the pack must be demoted back to a regular pack.
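
To make the layout concrete, here is a minimal Python sketch that writes and reads a version 1 `.anchored` payload. The helper names are hypothetical, and the choice of checksum is an assumption: the format only says "trailing checksum of all preceding data", so the sketch uses the hash selected by `hash_id` (SHA-1 or SHA-256) over every preceding byte.

```python
import hashlib
import struct

SIGNATURE = 0x414E4348  # "ANCH" in network byte order
VERSION = 1
HASHES = {1: ("sha1", 20), 2: ("sha256", 32)}  # hash_id -> (name, OID size)


def encode_anchored(hash_id, anchor_commit, pinned_timestamp, anchor_ref):
    """Serialize the fields in the documented order and append a checksum."""
    name, size = HASHES[hash_id]
    if len(anchor_commit) != size:
        raise ValueError("anchor_commit has the wrong length for hash_id")
    body = struct.pack(">III", SIGNATURE, VERSION, hash_id)
    body += anchor_commit
    body += struct.pack(">I", pinned_timestamp)
    body += anchor_ref.encode("utf-8") + b"\0"
    # Assumption: the checksum uses the same hash algorithm as the OIDs.
    return body + hashlib.new(name, body).digest()


def decode_anchored(data):
    """Parse and verify a payload; raise ValueError on any inconsistency."""
    if len(data) < 12:
        raise ValueError("truncated header")
    signature, version, hash_id = struct.unpack_from(">III", data, 0)
    if signature != SIGNATURE or version != VERSION or hash_id not in HASHES:
        raise ValueError("bad signature, version, or hash_id")
    name, size = HASHES[hash_id]
    if len(data) < 12 + size + 4 + 1 + size:
        raise ValueError("truncated file")
    body, checksum = data[:-size], data[-size:]
    if hashlib.new(name, body).digest() != checksum:
        raise ValueError("checksum mismatch")
    anchor_commit = body[12:12 + size]
    (pinned_timestamp,) = struct.unpack_from(">I", body, 12 + size)
    ref_bytes = body[12 + size + 4:]
    if not ref_bytes.endswith(b"\0"):
        raise ValueError("anchor_ref is not NUL-terminated")
    return {
        "hash_id": hash_id,
        "anchor_commit": anchor_commit,
        "pinned_timestamp": pinned_timestamp,
        "anchor_ref": ref_bytes[:-1].decode("utf-8"),
    }
```

A round trip through `encode_anchored` and `decode_anchored` should return the original fields, and a payload with any flipped byte should be rejected by the checksum test.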

Detection
~~~~~~~~~

Pack discovery checks for the `.anchored` sidecar the same way it checks for
`.mtimes` to set `is_cruft`. The `packed_git` struct carries an `is_anchored`
bit flag.

Anti-cruft packs are included in the multi-pack-index like any other pack.
No MIDX format changes are needed.


The anti-cruft maintenance task
-------------------------------

The `anti-cruft` task incrementally pins objects from regular packs into
anti-cruft packs. For each configured anchor ref, the task:

1. Resolves the ref to a commit.
2. Finds the last-pinned commit from existing `.anchored` files for that ref.
3. Runs `git rev-list --objects --reverse --before=<min-age> <tip> [^<last_pinned>]`
   to enumerate unpinned reachable objects.
4. Feeds the object list to `git pack-objects` to create a new anti-cruft pack.
5. Writes the `.anchored` sidecar, recording the last commit in the output
   as the anchor commit (not the ref tip).
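
Step 3 can be sketched as a small helper that assembles the `git rev-list` command line. The function name `rev_list_args` is hypothetical; the flags are the ones listed in the step above.

```python
def rev_list_args(tip, min_age, last_pinned=None):
    """Build the rev-list invocation that enumerates unpinned objects.

    `last_pinned` is omitted on the first run for an anchor ref, when no
    `.anchored` file exists yet.
    """
    args = ["git", "rev-list", "--objects", "--reverse",
            f"--before={min_age}", tip]
    if last_pinned is not None:
        # Exclude everything already reachable from the previous frontier.
        args.append(f"^{last_pinned}")
    return args
```

The resulting list can be passed to `subprocess.run`, with its output piped to `git pack-objects` as in step 4.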

The incremental walk bounded by `^<last_pinned>` means subsequent runs only
process objects newer than what was previously pinned. The walk cost is
proportional to the new objects, not the entire history.

Batch size control
~~~~~~~~~~~~~~~~~~

The `maintenance.anti-cruft.batch-size` option (default: 0, unlimited) limits
the number of objects pinned per anchor ref per run. With `--reverse`, oldest
objects come first, so truncation pins the oldest batch and defers newer
objects to subsequent runs.

The `.anchored` sidecar records the last commit actually included in the
batch (not the ref tip), so the next run's `^<last_pinned>` exclusion
correctly continues from where the previous batch ended.

This is important for the sliding time window: even without batch truncation,
the anchor commit is always the last commit from the `--before=<min-age>`
bounded output. As time passes and the min-age window advances, newly
eligible commits appear between the recorded frontier and the new cutoff,
and are picked up on the next run.
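
The sliding window can be illustrated with a simplified model that selects commits by date alone. This is only a sketch: the real task excludes already-pinned history by ancestry (`^<last_pinned>`), not by timestamp, and `newly_eligible` is a hypothetical name.

```python
def newly_eligible(commits, frontier_date, now, min_age_seconds):
    """Return OIDs of commits that crossed the min-age cutoff since the
    recorded frontier.

    commits: list of (oid, commit_date) pairs, oldest first.
    frontier_date: commit date of the last-pinned commit.
    """
    cutoff = now - min_age_seconds
    return [oid for oid, date in commits if frontier_date < date <= cutoff]
```

Calling it again with a later `now` and the updated frontier returns the commits that became eligible in between.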

Validation
~~~~~~~~~~

Before creating new anti-cruft packs, the task validates existing ones.
For each anti-cruft pack:

1. Load `.anchored` to get `anchor_ref` and `anchor_commit`.
2. Resolve the anchor ref. If deleted, demote the pack.
3. Check if `anchor_commit` is an ancestor of the current ref tip.
   If not (history was rewritten), demote the pack.

Demotion means deleting the `.anchored` file. The pack becomes a regular
pack and is handled by geometric repack on the next run.
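
A minimal sketch of the per-pack check follows. The helper names are hypothetical. The default callbacks shell out to `git rev-parse` and `git merge-base --is-ancestor`; treating every non-zero exit status as "not an ancestor" is an assumption chosen to match the demote-on-doubt policy.

```python
import subprocess


def git_resolve(ref, cwd="."):
    """Resolve a ref to a commit OID, or return None if it does not exist."""
    result = subprocess.run(
        ["git", "rev-parse", "--verify", "--quiet", f"{ref}^{{commit}}"],
        cwd=cwd, capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else None


def git_is_ancestor(ancestor, descendant, cwd="."):
    """True only if git reports that `ancestor` is an ancestor of `descendant`."""
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", ancestor, descendant], cwd=cwd)
    return result.returncode == 0


def should_demote(anchor_ref, anchor_commit,
                  resolve=git_resolve, is_ancestor=git_is_ancestor):
    """Apply validation steps 2 and 3 to one pack's recorded anchor."""
    tip = resolve(anchor_ref)
    if tip is None:
        return True  # anchor ref was deleted
    return not is_ancestor(anchor_commit, tip)  # history was rewritten
```

The callbacks are injectable so the decision logic can be exercised without a repository.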

Cascade demotion
^^^^^^^^^^^^^^^^

Anti-cruft packs for the same anchor ref form an incremental chain: each
pack is created with `^<last_pinned>` exclusion, so it depends on earlier
packs for objects that were already pinned. If a pack in the middle of
the chain fails validation (e.g., the anchor ref was force-pushed), all
packs for the same anchor ref with equal or later `pinned_timestamp` must
also be demoted. Without cascade demotion, the surviving packs would have
dangling references to objects that are now in the young generation,
breaking the closed-set property.

Validation processes packs grouped by anchor ref in `pinned_timestamp`
order. On the first failure within a group, all remaining packs in that
group are demoted. Different anchor refs are independent — a failure in
one does not affect the other.

After cascade demotion, the anti-cruft task starts pinning from scratch
for that anchor ref. Scoped-gc skips its run until the pinning frontier
passes the `min-age + grace-period` threshold again; regular geometric
repack handles the young generation packs in the meantime.

Validation is cheap: one ancestor check per anti-cruft pack, accelerated to
near-constant time by the commit graph.
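
The grouped validation described above might look like the following sketch. The tuple layout and the name `packs_to_demote` are assumptions for illustration; `is_valid` stands in for the per-pack check.

```python
from collections import defaultdict


def packs_to_demote(packs, is_valid):
    """Return the names of packs to demote, cascading within each anchor ref.

    packs: iterable of (name, anchor_ref, pinned_timestamp) tuples.
    is_valid: callable taking a pack name and returning True or False.
    """
    groups = defaultdict(list)
    for pack in packs:
        groups[pack[1]].append(pack)

    demoted = []
    for group in groups.values():
        group.sort(key=lambda p: p[2])  # validate in pinned_timestamp order
        failed_at = None
        for name, _ref, timestamp in group:
            if not is_valid(name):
                failed_at = timestamp
                break
        if failed_at is not None:
            # Demote the failing pack and every pack with an equal or
            # later pinned_timestamp for the same anchor ref.
            demoted += [n for n, _r, ts in group if ts >= failed_at]
    return sorted(demoted)
```

Packs for other anchor refs are untouched, since each group is processed independently.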


The scoped-gc maintenance task
------------------------------

The `scoped-gc` task performs lightweight garbage collection scoped to the
young generation:

1. Collects all anti-cruft packs and passes them as `--keep-pack` with
   `--kept-pack-boundary` to `git repack`. The reachability walk treats
   kept packs as traversal boundaries: when the walk hits a commit, tree,
   or blob in a kept pack, it stops traversing — it does not process
   parents or recurse into trees. This is safe because of the closed-set
   property (see below).
2. Runs `git repack -d -l --cruft --cruft-expiration=<expiration>` on the
   remaining (unpinned) objects.
3. Reachable unpinned objects are repacked into a new regular pack.
   Unreachable objects go into a cruft pack. Expired unreachable objects
   are dropped.
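
Steps 1 and 2 combine into a single `git repack` invocation, sketched below. The helper name is hypothetical, the order of the options is arbitrary, and `--kept-pack-boundary` is the option introduced by this series rather than an existing `git repack` flag.

```python
def scoped_gc_repack_args(anti_cruft_packs, expiration):
    """Build the repack command line for a scoped-gc run.

    anti_cruft_packs: pack file names (for example "pack-1234.pack").
    expiration: approxidate string passed as --cruft-expiration.
    """
    args = ["git", "repack", "-d", "-l", "--cruft",
            f"--cruft-expiration={expiration}"]
    if anti_cruft_packs:
        # Treat kept packs as traversal boundaries (new in this series).
        args.append("--kept-pack-boundary")
        args += [f"--keep-pack={name}" for name in anti_cruft_packs]
    return args
```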

Readiness check
~~~~~~~~~~~~~~~

Before running, scoped-gc verifies that anti-cruft pinning has caught up
sufficiently. For each configured anchor ref, the commit date of the
last-pinned commit must be newer than the combined threshold of
`maintenance.anti-cruft.min-age` plus `maintenance.scoped-gc.grace-period`
(default: `1.week.ago`). With both defaults, the cutoff is three weeks ago.

If any anchor's pinning frontier is older than this cutoff, scoped-gc is
skipped. The young generation would still be too large for scoped-gc to
save meaningful work over a full repack. This is particularly relevant
when `batch-size` is set and anti-cruft needs multiple runs to fully pin
an anchor's history.
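
The readiness rule reduces to a date comparison per anchor ref. In the sketch below, durations are given in seconds rather than as approxidates, and the function name and the handling of an anchor with no pinned commit (`None`) are assumptions.

```python
def scoped_gc_ready(frontier_dates, now, min_age_seconds, grace_seconds):
    """Return True if every anchor's pinning frontier is recent enough.

    frontier_dates: dict mapping anchor ref -> commit date of its
    last-pinned commit, or None if nothing has been pinned yet.
    """
    if not frontier_dates:
        return False  # no anchors configured: scoped-gc has nothing to do
    cutoff = now - (min_age_seconds + grace_seconds)
    return all(date is not None and date > cutoff
               for date in frontier_dates.values())
```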

Why scoped-gc is cheap

Scoped-gc passes --kept-pack-boundary to git repack, which tells
git pack-objects to treat objects in kept (anti-cruft) packs as
traversal boundaries. When the revision walk encounters a commit in a
kept pack, it skips parent processing entirely — no ancestors are queued.
Similarly, when tree or blob traversal encounters an object in a kept
pack, it marks it seen and does not recurse further.

This relies on the closed-set property: the union of all anti-cruft
packs is closed under object reachability. Every object transitively
reachable from an anchored object is guaranteed to be in some anchored
pack, because anti-cruft packs are created by rev-list --objects which
enumerates the full transitive closure. The ^<last_pinned> exclusion
means individual packs are not self-contained, but the union across all
packs for an anchor ref is. Cascade demotion (see Validation above)
preserves this property after force-pushes.

If main's history is pinned up to 2 weeks ago, the walk only needs to
traverse approximately 2 weeks of commits before hitting pinned objects
and stopping. Both enumeration and writing are bounded by the young
generation size.

....
Full GC walk depth: entire history (years)
Scoped GC walk depth: min-age window (weeks)

Full GC enumeration: all objects
Scoped GC enumeration: only unpinned objects

Full GC rewrite: all objects
Scoped GC rewrite: only unpinned objects
....
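
The effect of the traversal boundary can be shown on a toy object graph. This is a simplified model, not the `pack-objects` implementation: `edges` maps each object to the objects it references (parents and trees), and `kept` stands for the contents of the anti-cruft packs.

```python
def walk(tips, edges, kept=frozenset()):
    """Return the set of non-kept objects reached from `tips`.

    Objects in `kept` are marked seen but never expanded, so nothing
    behind them is visited.
    """
    seen, young = set(), set()
    stack = list(tips)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        if obj in kept:
            continue  # boundary: no parents queued, no tree recursion
        young.add(obj)
        stack.extend(edges.get(obj, ()))
    return young
```

With an empty `kept` set the walk covers the whole graph; with the old generation in `kept` it covers only the objects above the boundary, which is only correct if the kept set is closed under reachability.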

Integration with geometric repack

Geometric repack skips anti-cruft packs the same way it already skips cruft
packs. Anti-cruft packs are excluded from the geometric merge sequence and
survive geometric repack unchanged.

Configuration

maintenance.anti-cruft.anchor::
Multi-valued. Refs to use as anchors for anti-cruft packs. Each
value must be an exact ref name. The anti-cruft task is a no-op
if no anchors are configured. No default.

maintenance.anti-cruft.min-age::
Approxidate. Only pin objects from commits older than this.
Default: 2.weeks.ago.

maintenance.anti-cruft.batch-size::
Integer. Maximum objects to pin per anchor per run. 0 means
unlimited. Default: 0.

maintenance.scoped-gc.expiration::
Approxidate. Cruft expiration threshold for scoped-gc. Passed as
--cruft-expiration to linkgit:git-repack[1].
Default: 2.weeks.ago.

maintenance.scoped-gc.grace-period::
Approxidate. How far behind anti-cruft pinning can lag before
scoped-gc skips its run. The actual cutoff is min-age plus this
value. Default: 1.week.ago.

Steady-state behavior

....
Week 1, Day 1:
geometric-repack merges small regular packs (same as today)
anti-cruft task pins objects reachable from main (>2 weeks old)
-> first anti-cruft pack created

Week 1, Day 2-6:
geometric-repack continues managing regular packs
anti-cruft task incrementally pins new objects crossing min-age
-> anti-cruft packs grow (or new small ones created)

Week 1, Day 7:
scoped-gc runs:
-> walks only unpinned objects (recent history)
-> unreachable unpinned objects -> cruft pack
-> expired cruft -> dropped
-> cost: proportional to ~2 weeks of history, not years

Week 2+:
anti-cruft packs ~ all of main's reachable history
regular packs ~ last 2 weeks of objects
cruft packs ~ 0-2 weeks of unreachable objects
scoped-gc cost ~ constant (bounded by min-age window)
....

Safety properties

The system is safe as long as validation is conservative (demote on any
doubt):

  • A wrongly-demoted anti-cruft pack becomes a regular pack. Its objects
    are subject to normal GC, which correctly classifies them.
  • An anti-cruft pack is never trusted beyond its anchor ref.
  • Force-push on main demotes anti-cruft packs for main. The next GC
    correctly identifies newly-unreachable objects.
  • A periodic full GC can serve as a safety net to catch any objects that
    might have been incorrectly retained.

Risks and trade-offs

Delta compression across generations::
Objects in anti-cruft packs are delta-compressed within the pack.
Objects in regular packs cannot delta against anti-cruft pack objects
(different pack-objects invocation). This may increase total size
slightly vs. a single full repack. The cross-pack delta loss is bounded
by the young generation size.

Anti-cruft pack proliferation::
Each anchor ref per run could create a new small anti-cruft pack.
Mitigation: merge small anti-cruft packs using the same geometric
progression, or accumulate into a single pack per anchor ref.

Correctness of the skip optimization::
When scoped-gc keeps anti-cruft packs via --keep-pack, it assumes
every object in those packs is reachable. If validation misses a case
where this is not true, unreachable objects could survive indefinitely.
Mitigation: periodic full GC as a safety net.

Introduce the .anchored sidecar file format that identifies a pack as
an "anti-cruft" pack — one containing objects known to be reachable
from configured anchor refs. This is the foundation for generational
garbage collection, where pinned (old generation) objects are skipped
during GC walks.

The .anchored file stores:
  - The anchor commit OID from which reachability was proven
  - The anchor ref name used for validation
  - A pinned timestamp recording when the pack was created

Detection follows the same pattern as .mtimes for cruft packs: during
pack discovery in add_packed_git(), the presence of a .anchored file
sets the is_anchored bit on the packed_git struct.

Also add .anchored to the list of extensions cleaned up by
unlink_pack_path().

Exclude packs with the is_anchored flag from geometric repack, same
as cruft packs are already excluded. Anti-cruft packs represent the
old generation in generational GC and should not be merged or
reorganized by the regular geometric progression.

Add a new "anti-cruft" maintenance task that incrementally pins objects
reachable from configured anchor refs into anti-cruft packs. This is
the core of generational GC: objects in anti-cruft packs form the "old
generation" and can be skipped during future GC walks.

The task:
  - Reads anchor refs from maintenance.anti-cruft.anchor (multi-valued)
  - Respects maintenance.anti-cruft.min-age (default: 2.weeks.ago)
  - For each anchor ref, finds the last-pinned commit from existing
    .anchored packs to avoid re-walking already-pinned history
  - Uses rev-list --objects --before=<min-age> to find objects to pin
  - Packs them via pack-objects and writes a .anchored sidecar

The auto-condition triggers whenever anchor refs are configured,
making this a no-op when the feature is not in use.

Add validation that runs at the start of each anti-cruft maintenance
task. For each existing anti-cruft pack, verify:

  1. The .anchored file can be loaded
  2. The anchor ref still exists
  3. The recorded anchor commit is an ancestor of the current ref tip

If any check fails, the pack is demoted to a regular pack by removing
the .anchored file. This handles force-pushes and ref deletions
gracefully — demoted packs re-enter the normal geometric repack
pipeline and their objects will be correctly classified by the next GC.

Uses "git merge-base --is-ancestor" for the ancestry check, which is
near-constant-time when a commit-graph exists.

Add a new "scoped-gc" maintenance task that performs lightweight
garbage collection scoped to the young generation only. Anti-cruft
packs (the old generation) are passed to repack via --keep-pack,
so the reachability walk and object rewrite only cover unpinned
objects.

This achieves the key benefit of generational GC: GC cost is
proportional to the young generation size (recent objects), not the
total repository size. When main's history has been pinned by the
anti-cruft task, scoped-gc only walks a few weeks of history.

The task uses "git repack -d -l --cruft --cruft-expiration=<exp>"
with --keep-pack for each anti-cruft pack. Unreachable objects in
the young generation are moved to cruft packs; expired ones are
dropped.

The auto-condition requires both anti-cruft packs and regular packs
to exist, making this a no-op before the first anti-cruft run.

Configurable via maintenance.scoped-gc.expiration (default: 2.weeks.ago).

Add the new generational GC tasks to the geometric maintenance
strategy schedule:

  - anti-cruft: daily (after geometric-repack, pins old objects)
  - scoped-gc:  weekly (prunes unreachable from young generation)

The resulting geometric strategy schedule is:

  hourly:  commit-graph
  daily:   geometric-repack, pack-refs, anti-cruft
  weekly:  rerere-gc, reflog-expire, worktree-prune, scoped-gc

Tasks execute in enum order, so geometric-repack runs before
anti-cruft (which needs consolidated packs to pin efficiently),
and scoped-gc runs last (after reflog-expire has made objects
unreachable).

Both tasks are no-ops when maintenance.anti-cruft.anchor is not
configured, so existing users see no behavioral change.

Add documentation for the new generational GC maintenance tasks
and their configuration options:

  - maintenance.anti-cruft.anchor
  - maintenance.anti-cruft.min-age
  - maintenance.anti-cruft.batch-size
  - maintenance.scoped-gc.expiration
  - maintenance.scoped-gc.grace-period

The task descriptions are added to git-maintenance.adoc and the
config entries to config/maintenance.adoc.

Anti-cruft packs for the same anchor ref form an incremental chain
via ^<last_pinned> exclusion — each pack depends on earlier packs
for completeness. When a pack fails validation (e.g., force-push on
the anchor ref), demote all packs for that ref with equal or later
pinned_timestamp to preserve the closed-set property.

Scoped-gc currently enumerates all objects in the repository even
though it only rewrites the young generation. This is because
--keep-pack prevents rewriting kept-pack objects but the reachability
walk still traverses into them.

Add --kept-pack-boundary (internal, hidden) to pack-objects and
repack. When set, the revision walk stops at commits in kept packs
(skipping parent processing), and tree/blob traversal skips objects
found in kept packs. This is safe because the union of all anti-cruft
packs is closed under reachability — cascade demotion ensures this
invariant holds after force-pushes.

Scoped-gc passes --kept-pack-boundary when anchored packs exist,
bounding both enumeration and rewrite cost to the young generation.

A repository may use both full gc and scoped-gc, so having two
independent configs for the same cruft expiration threshold is
error-prone. Make scoped-gc fall back to gc.pruneExpire when
maintenance.scoped-gc.expiration is unset, so that a single
config controls both paths by default.

Add trace2 instrumentation to the three anti-cruft generational
maintenance tasks to aid debugging and performance analysis:

- anti-cruft: log anchor count, per-anchor regions with objects
  pinned/skipped/already-packed counts, and batch truncation status
- consolidate-anti-cruft: log group count, per-group regions with
  pack totals and merge counts
- scoped-gc: log pinning readiness, expiration value, and kept pack
  count