feat(sync): local-first multi-machine artifact sync#731
Conversation
|
Feedback is welcome. Still in draft mode since work and testing to this point has been completely agent driven, a combination of GPT-5.5 and Claude 4.8. Next up is manually trying various distributed machine scenarios and seeing how well any of this works in practice. Assuming the idea eventually proves out, I'm happy to split into smaller manageable PR. |
roborev: Combined Review (
|
## Summary - Keep the candidate-window and boundary-session behavior in `internal/postgres/push.go` unchanged for this PR, and batch the PostgreSQL-side comparison reads used to decide whether a candidate session can be skipped. - Implement new batched loaders in `internal/postgres/push_fingerprint.go` for message aggregates, message content hashes, role/time fingerprints, message flags, message system ordinals, token fingerprints, tool-call aggregates, tool-call fingerprints, and usage fingerprints, with chunking inside the helper when session counts exceed `ANY($1)` practicality. - Use the preloaded message and tool-call aggregates on the hot no-op path, and retry any comparison-preload SQL failure in a fresh transaction without the batched preload instead of continuing inside an already-aborted transaction. - Add targeted regression tests in `internal/postgres/push_test.go` and `internal/postgres/push_fingerprint_test.go` to cover the new batch-driven skip decision path and helper behavior with empty inputs. ## Scope - Files changed are `internal/postgres/push.go`, `internal/postgres/push_fingerprint.go`, `internal/postgres/push_test.go`, and `internal/postgres/push_fingerprint_test.go`. - No boundary/windowing semantics, no schema changes, and no changes to PR #731 or broader sync-work areas. ## Notes - A focused PG comparison query-count assertion was not added because the existing harness does not expose a stable helper-call/query metric for this exact path without adding brittle test-only instrumentation. - The review-driven follow-up keeps the existing non-batched fingerprint fallback, but now that fallback only runs from a clean transaction after preload failure instead of on the poisoned transaction that raised the preload error. Fixes #331 Co-authored-by: Rod Boev <rodboev@users.noreply.github.com>
|
Thanks for the review. Both findings were valid and are addressed in 2804870 and f252c35. High — Windows-invalid Note this changes the canonical on-disk HLC string ( Medium — divergent origin sources. Confirmed:
Claude Opus 4.8 reasoning-medium on behalf of maphew |
roborev: Combined Review (
|
|
Thanks again. All three findings were valid and are addressed in 055f3b3. High — local metadata events missing from the replay register ( Medium — remote HLCs not observed by the local clock ( Medium — one unavailable target aborted the rest of the origin (
Claude Opus 4.8 reasoning-medium on behalf of maphew |
roborev: Combined Review (
|
|
Thanks. Both convergence gaps were valid and are fixed in 1d8d24c and 8cac9ff. Medium — usage-only sessions never exported ( Medium — bulk star emitted no metadata events (
Claude Opus 4.8 reasoning-medium on behalf of maphew |
roborev: Combined Review (
|
|
Thanks. Addressed in e77db3a, 4110dce, acbb789, and 6ea6fa8. Medium — Medium — unconditional S3 PUT violates write-once ( Medium — Medium — remote events applied before the HLC advances (
Claude Opus 4.8 reasoning-medium on behalf of maphew |
roborev: Combined Review (
|
6ea6fa8 to
18c0f18
Compare
roborev: Combined Review (
|
|
I will rebase this |
18c0f18 to
b228d18
Compare
roborev: Combined Review (
|
|
I'll continue to work a bit on this to see if I can get it into a state that I'm comfortable with |
b228d18 to
16e5a7b
Compare
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
efb934f to
550372b
Compare
roborev: Combined Review (
|
roborev: Combined Review (
|
eaf6694 to
f944578
Compare
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
|
@wesm thanks for taking a direct interest! It's appreciated. I'm largely away for this week plus a bit. I'm looking forward to exercising this first hand when I'm back : ) |
roborev: Combined Review (
|
roborev: Combined Review (
|
Squashed follow-up changes: - chore: clear golangci-lint modernize and staticcheck debt - fix(postgres): resolve relationship ids to pushed-session identities - fix(artifact): carry persisted signal state across manifest round-trip - fix(postgres): respect session ownership when resolving pushed ids - fix(postgres): repair stale subagent links on incremental push - fix(artifact): import scanned sessions as unscanned for secrets - fix(postgres): reuse legacy-prefixed rows, skip per-session conflicts - fix(artifact): keep non-content state out of the manifest hash - fix(artifact): guard os.Stat result in artifactFileExists - fix(artifact): harden trusted-fleet sync - fix(sync): harden trusted-fleet edge cases - fix(postgres): reject foreign already-prefixed identities - fix(sync): harden artifact sync edge cases - fix(artifact): preserve titles and avoid token reuse - Merge origin/main into docs/local-first-multi-machine-sync - fix(metadata): suppress false conflict noise - fix(sync): preserve metadata convergence - fix(sync): make batch delete metadata retryable - fix(sync): suppress no-op unstar artifacts - fix(sync): make unstar artifact retries recoverable - fix(sync): record metadata state after artifact publish - fix(sync): accept Windows artifact folder paths - fix(sync): distinguish published metadata append failures - fix(sync): repair published metadata state on retry - fix(sync): repair restore metadata retries - fix(sync): verify restore repair wins before retry succeeds - fix(sync): satisfy lint for Windows path check Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
Imported artifact sessions must not retain another machine's local source-file bookkeeping, because local parsers use file_path and file fingerprints as skip and duplicate state. Clear those fields on import and reject malformed artifact references before they can be used as filesystem names.\n\nBulk star now follows the same durable metadata boundary as the other local metadata mutations: an already-starred retry repairs a published local star artifact or emits a replacement when no winning replay state exists. Existing PostgreSQL schemas also get the pinned_messages.source_uuid column through the migration path, not only the create-table path.
Legacy PostgreSQL schemas can already have pinned_messages without source_uuid. Creating the partial source_uuid index in the core DDL referenced the column before the idempotent column migration had a chance to add it, so upgrades failed before reaching the repair path. Move that index into the post-migration index phase so old schemas add the column first while new schemas still get the same index.
1cd7126 to
ebf0e73
Compare
roborev: Combined Review (
|
HTTP artifact POST implies the receiving server will immediately import what it accepted. Remote serve modes and read-only SQLite cannot do that, so accepting the file creates a false success and can leave immutable artifacts behind that will never converge locally. Gate uploads on a writable local SQLite store before touching the artifact tree. Writable local servers still accept peer artifacts and import them through the existing path; remote and read-only servers return not implemented instead of persisting data they cannot apply.
roborev: Combined Review (
|
Artifact sync init needs to publish local curation state that already existed before metadata event recording was enabled. Without that baseline, peers can import sessions but never learn existing display-name overrides, stars, or pins.\n\nLegacy PostgreSQL pins also need their source_uuid filled from the old message set before an ordinal-shifting rewrite deletes and replaces messages. The normal post-insert reconciliation still handles movement and pruning against the new message set.
Baseline curation export is only safe after the local store has observed metadata already present in the target. Otherwise old local rows get freshly stamped and can outrank real peer edits that were published earlier.\n\nRun the baseline init path through an initial exchange/import phase, then emit baseline events only for replay fields that remain uncovered. Normal syncs keep the existing single exchange path, while init still publishes any newly synthesized baseline artifacts through a second exchange.
Baseline initialization has to observe peer metadata before writing local baseline events, but the curation rows being baselined must still be the rows that existed before that observation step. If the snapshot is taken after pre-baseline import, rows introduced by the peer exchange can be mistaken for local pre-feature curation and re-published under the wrong origin.\n\nCapture the baseline snapshot before the pre-baseline exchange/import and emit from that stable snapshot after replay-state coverage is populated. This preserves the peer-observation ordering without allowing import-created rows to become local baseline metadata.
roborev: Combined Review (
|
Implements the local-first multi-machine sync design proposed in #692:
every machine keeps the complete archive and machines converge by
exchanging immutable, content-addressed artifacts over any dumb transport
instead of depending on an always-on PostgreSQL hub. SQLite stays a local,
rebuildable derivation — the live database file never crosses the wire.
Design rationale and the full set of alternatives considered (Automerge,
cr-sqlite, the SQLite session extension, whole-DB replication, raw-file
mirroring) live in
docs/design/local-first-sync.md; user-facing setup isin
docs/artifact-sync.md.What this adds
write-once, content-addressed store under
$AGENTSVIEW_DATA_DIR/artifacts/<origin>/: append-only checkpoints,session manifests, zstd-compressed NDJSON message segments, a metadata
change feed, and an optional raw-source fallback. Serialization is a
pinned forever-contract enforced by golden tests; readers ignore unknown
fields and skip unknown future ops so mixed app versions keep syncing.
HLC timestamps render without
:so metadata filenames are valid onWindows.
name plus a random suffix). Foreign sessions are stored as
origin~nativeIDwithmachine=origin, the same convention SSHremote-sync already uses, so every read path, the UI, and analytics
render them without composite-PK surgery across backends. Server, CLI
folder sync, peer import, and conflict lookup converge on one persisted
origin via
AdoptOrigin.uploads, imports, SSH-pulled, and orphan-preserved sessions all publish;
it is debounced through the existing pg-watch sink loop. Import diffs
checkpoints against
artifact_sync_state, hash-verifies segments, andwrites foreign sessions through the existing
UpsertSession/messagepaths, inheriting FTS5 maintenance, tombstone rejection, and pin
re-attachment. Undelivered segments are recorded as phantoms and retried,
tolerating out-of-order delivery from dumb transports.
tiny HLC-stamped change events replayed deterministically with per-field
last-writer-wins. Concurrent conflicting edits are never silently
dropped: the losing value is logged to
meta_conflictsand surfaced inthe UI as a fork badge. Local edits record their own LWW register and
applied-event marker on write, so a later peer event with a lower order
key can no longer overwrite a newer local edit; replay advances the local
HLC past observed remote events to keep later local edits causally ahead;
and a single not-yet-durable target defers only its own event rather than
aborting the rest of an origin's replay.
syncverb, three interchangeable target shapesbehind a shared
Transportinterface (export -> set-union exchange ->import):
agentsview sync [--init|--watch] <dir>, safe forSyncthing, Dropbox, NFS, or rclone mounts because every file is
immutable temp+rename and single-writer-per-prefix.
agentsview sync https://peer:8080 [--token <t>]exchanges directly over the embedded server's artifact API behind the
existing Bearer-token middleware. A
GET /{origin}/indexrouteenumerates an origin's artifacts so metadata events (not referenced by
the checkpoint) can be pulled;
--tokendefaults to the local authtoken for a fleet sharing one symmetric token.
agentsview sync s3://bucket/prefixagainst anyS3-compatible store (AWS, MinIO, Backblaze B2). Requests are signed with
AWS Signature Version 4 implemented from the standard library, so there
is no AWS SDK dependency; credentials and addressing come from the
standard
AWS_*env vars plusAGENTSVIEW_S3_*overrides.from an origin's latest checkpoint) are reclaimed both on demand
(
agentsview sync gc [--dry-run] [--grace <d>] <dir>) and automaticallyafter a folder sync, over the local store and the shared target together
so set-union cannot re-propagate the deleted files. A grace window
protects slow peers, origins without checkpoints are skipped (never read
as a deletion), and
--gc-grace/--no-gctune or disable the automaticpass.
--watchkeeps any target shape syncing onchange plus a periodic floor through the pg-watch loop, and a peers page
shows each origin's published vs. locally-present session counts,
checkpoint sequence, last-published time, and total conflict count.
Scope, tradeoffs, and limitations
transports have no per-writer identity and the HTTP API uses one shared
token, so any peer can forge any origin's metadata. This is documented as
exactly that; per-peer tokens and origin signatures are the follow-up
before any sharing story.
append-mostly (a grow-only set), and metadata is a small append-only LWW
log, so a general CRDT library would add real cost without solving a
problem this data has.
compressed artifacts); zstd recovers 5-10x and GC reclaims superseded
bulk artifacts behind a grace window.
can reach the same transport; a NAS, bucket, or always-on peer is the
practical rendezvous by convention, not by privileged architecture.
tests and by a MinIO integration test (
make test-minio, run in CI) thatvalidates real S3 interop end to end; it has not been run against AWS itself.
Remote GC on an object store or HTTP peer is that peer's own responsibility —
auto-GC after a non-folder sync only collects the local store.
owner_markerpush design from the merged fix(postgres): preserve source machine on pg push #701/fix(postgres): guard pg push against same-id cross-machine row collision #724; the session-sinkseam is extracted (
drainSessionBatches) so PG is one sink and theartifact exporter can become another. PG read mode returns no
metadata-ledger conflicts (a parity stub in
internal/postgres/metadata.go).The change is additive: upgrading generates an origin id and behaves
exactly as today when sync is not configured. New tables arrive through the
existing idempotent migration path with no dataVersion bump and no resync.
Where to review
internal/artifact/format_test.gointernal/artifact/hlc.go,internal/artifact/replay.go,internal/db/metadata_replay.gointernal/artifact/sync.gointernal/artifact/transport.go,transport_http.go,transport_s3.gointernal/server/huma_routes_artifacts.go,internal/artifact/peer.gointernal/server/metadata_events.goand the frontendSessionBreadcrumb/TrashPagecomponentsinternal/artifact/twoinstance_test.go,internal/server/artifact_http_transport_test.go,internal/e2e/artifact_sync_test.goRelates to #692.
Claude Opus 4.8 reasoning-medium on behalf of maphew