feat: Microsoft Teams ingestion (delegated Graph sync) #398
Two unrelated bug fixes that surfaced while building this were split out into their own PRs to keep this one scoped to Teams ingestion:
Neither depends on this PR; both branch off current |
## What

Several importers built `time.Time` values from epoch timestamps with `time.Unix`/`time.UnixMilli` but **without `.UTC()`**, leaving them in the runner's local zone, while the rest of each importer stores dates in UTC. Any code reading the calendar day (or the Parquet year partition) is then off by one in zones east of UTC.

Fixes:

- `internal/sync/sync.go`: `processBatch` oldest-message date (progress tracking).
- `internal/whatsapp/mapping.go`: message `SentAt`.
- `internal/whatsapp/importer.go`: reaction `createdAt`.

## Why it matters

`TestProcessBatch_OldestDatePropagation` fails on any machine east of UTC (e.g. NZ): the fixture `2024-01-10T12:00:00Z` reads back as Jan 11 local. The tests are correct; the production code was the bug. Adds `TestMapMessageSentAtIsUTC` (asserts the stored zone is UTC, machine-independent).

## Possible later fixes (out of scope here)

The same `time.Unix(...)`-without-`.UTC()` pattern also appears in the embedding-generation status timestamps, but these are **operator-facing status values** round-tripped from unix-int columns (not message dates), so they don't affect partitioning/dedup/cross-system date semantics. Local-time display is arguably fine; normalizing them to UTC would be a consistency-only follow-up. Sites:

- `cmd/msgvault/cmd/embeddings_manage.go`: `StartedAt`, `SeededAt`, `CompletedAt`, `ActivatedAt`.
- `internal/vector/pgvector/backend.go`: `StartedAt`, `CompletedAt`, `ActivatedAt`.
- `internal/vector/sqlitevec/backend.go`: `StartedAt`, `CompletedAt`, `ActivatedAt`.

Left unchanged here to avoid churning working code on a style call; documented so a future pass can decide.

## Scope

Independent of the Teams PR (#398): branched from `main`, touches only `internal/sync` and `internal/whatsapp`.

Co-authored-by: Nat Torkington <njt@users.noreply.github.com>
looking at this
13c4591 to 7e1ecd2
c74f849 to 546a95c
546a95c to 4368638
Squash the Teams ingestion branch into a single commit before rebasing onto origin/main.

The branch adds delegated Microsoft Graph OAuth, Teams source commands, chat/channel import, sync state, hosted-content media handling, daemon scheduling, and the recovery/backfill paths needed to repair already-imported inline media.

After rebasing onto origin/main, Teams messages are also included in the new message_type search/help surface and text-mode message-type allowlists so `message_type:teams` works consistently with the main-branch query changes.

Included branch commits:

- fix(teams): close ingestion review gaps
- fix(teams): migrate legacy raw message ids
- fix(teams): repair legacy id migration references
- fix(teams): make Teams tests portable across CI backends
- fix(teams): keep URL attachments as links
- fix(teams): constrain Graph URL requests
- fix(teams): preserve channel backfill on delta prime errors
- fix(teams): reject stale Graph token scopes
- fix(teams): address roborev ingestion gaps
- fix(teams): refresh mixed attachment stats
- fix(teams): maintain conversation stats
- fix(teams): use timestamp type in stats test

Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
Co-authored-by: Codex <codex@openai.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Teams chat imports should not advance a per-chat cursor when member metadata could not be loaded, because that metadata feeds conversation participants and to-recipient rows. Leaving the cursor unchanged lets the next sync retry the same message window with member data instead of making the partial import permanent.

Teams inline hosted media also needs ownership metadata so edits can replace the current inline attachment set. Mark new inline rows with a Teams source attachment ID and replace Teams-managed inline rows as a group, including legacy unmarked rows, while preserving existing rows if a current hosted image cannot be fetched.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
f8f07f8 to a62d3d3
Teams sync cursors stay retryable only when message persistence returns an error. Dropping message_raw write failures allowed a successful sync to advance past a Graph message whose original payload was never archived.

Surface raw archive failures from message persistence so the sync run is marked failed and the next run retries from the previous successful cursor.

Validation: verified TestRawArchiveFailureFailsImport failed before this change when a real message_raw trigger rejected Teams raw JSON writes.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Teams limited syncs must not page an entire chat or channel before discarding messages, because that defeats scoped recovery runs and risks saving cursors for work the run intentionally skipped.

Stop Graph paging as soon as the per-conversation cap is reached and carry an explicit truncation flag back to the importer so chat cursors and channel delta links remain retryable.

Also ignore content-less Graph attachment objects when setting initial message attachment flags, since those objects do not produce attachment rows.

Validation: reproduced the prior PostgreSQL CI failure locally with the raw-archive regression under MSGVAULT_TEST_DB; verified the full PostgreSQL test lane against a local postgres:16 container.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
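The capped paging described here might look roughly like the sketch below. All names (`pager`, `collect`, `slicePager`) are hypothetical, and the in-memory pager stands in for Graph requests; the sketch only shows stopping at the cap and reporting truncation so the caller can skip saving a cursor.

```go
package main

import "fmt"

// pager is a hypothetical page source: each call returns the next page of
// message IDs and whether further pages exist.
type pager func() (page []string, more bool)

// collect reads pages until limit messages are gathered (limit <= 0 means no
// cap). truncated is true when the cap stopped the read with data possibly
// still unread, in which case the caller should not save a cursor.
func collect(next pager, limit int) (msgs []string, truncated bool) {
	for {
		page, more := next()
		for _, m := range page {
			if limit > 0 && len(msgs) >= limit {
				return msgs, true // cap reached mid-page
			}
			msgs = append(msgs, m)
		}
		if !more {
			return msgs, false
		}
		if limit > 0 && len(msgs) >= limit {
			return msgs, true // stop before requesting another page
		}
	}
}

// slicePager serves pages from memory and counts how many were requested.
func slicePager(pages [][]string, calls *int) pager {
	i := 0
	return func() ([]string, bool) {
		*calls++
		if i >= len(pages) {
			return nil, false
		}
		p := pages[i]
		i++
		return p, i < len(pages)
	}
}

func main() {
	pages := [][]string{{"a", "b"}, {"c", "d"}, {"e"}}
	calls := 0
	msgs, truncated := collect(slicePager(pages, &calls), 3)
	fmt.Println(msgs, truncated, calls) // [a b c] true 2
}
```

With `limit` set to 3, the third page is never requested, which is the property the commit message asks for in scoped recovery runs.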
Thank you! A fair amount of grinding was needed to get this ready, but it should be good now.
What

Sync your own Microsoft Teams 1:1/group/meeting chats and channel messages into msgvault via delegated Microsoft Graph, searchable alongside mail through the existing TUI / FTS / Parquet analytics.

Highlights

- `add-teams` (delegated Graph OAuth) and `sync-teams` (full + incremental, with streamed per-conversation progress) commands; Teams also runs under `serve` scheduled syncs, and the daemon now syncs all source types on an identifier (so Teams + Outlook/IMAP on one address both run).
- to) + `@mention` rows, identity resolution (AAD object id → email dedup, unifying with mail identities), inline images downloaded to content-addressed storage, and shared-file links recorded.
- `lastModifiedDateTime` list filtering (no delegated per-chat delta endpoint exists), channels via `/messages/delta`; per-conversation cursors persisted in `sync_runs.cursor_after`, flushed after each conversation so an interrupted long backfill resumes mid-stream.
- `teams_<email>.json` token with Graph scopes only, so IMAP and Teams can each be used alone or together.

Use

- `Chat.Read`, `ChannelMessage.Read.All`, `Team.ReadBasic.All`, `Channel.ReadBasic.All`, `User.Read`) and grant admin consent.
- `config.toml`:
- `msgvault add-teams you@tenant.com` then `msgvault sync-teams you@tenant.com` (`--no-channels` / `--limit` for scoped runs). Press `a` inside `msgvault tui` to filter to the Teams account.

Notes
🤖 Generated with Claude Code