Add experimental etcd store backend #450
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8dfa03ba9
    for _, kv := range response.Kvs {
        txn.deletes[string(kv.Key)] = struct{}{}
        delete(txn.puts, string(kv.Key))
Use range deletes for event cleanup
When deleting a VM with more than about 127 stored log events, DeleteEvents queues one OpDelete per event, and the surrounding deleteVM transaction also deletes the VM record. etcd rejects transactions that exceed its --max-txn-ops limit (128 by default, see https://etcd.io/docs/v3.5/op-guide/configuration/), so ordinary long-running VMs become undeletable with the etcd backend. Use a single prefix delete, or chunk the operations to stay under the limit.
Fixed in 4945127. DeleteEvents now records a single prefix delete, so VM event cleanup uses one etcd delete operation instead of one operation per event. Added TestDeleteManyEvents with 200 stored events and verified it against real etcd.
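For reference, the chunking alternative mentioned in the review could look roughly like the sketch below. This is not what 4945127 does (the fix uses a single prefix delete); it is only an illustration of splitting keys into batches that each fit under the transaction limit. The names `chunkKeys` and `defaultMaxTxnOps` are hypothetical and do not come from the PR.

```go
package main

// defaultMaxTxnOps mirrors etcd's default --max-txn-ops value (128).
// Hypothetical constant for this sketch; a real deployment may configure
// a different limit.
const defaultMaxTxnOps = 128

// chunkKeys splits keys into consecutive batches of at most maxOps entries,
// so that each batch can be sent as its own transaction. It returns nil when
// there is nothing to split or when maxOps is not positive.
func chunkKeys(keys []string, maxOps int) [][]string {
	if maxOps <= 0 {
		return nil
	}
	var batches [][]string
	for len(keys) > 0 {
		n := maxOps
		if len(keys) < n {
			n = len(keys)
		}
		batches = append(batches, keys[:n])
		keys = keys[n:]
	}
	return batches
}
```

With 200 keys and a limit of 128 this yields two batches of 128 and 72 keys. The batches are no longer applied atomically as one transaction, and a caller would also need to leave room for any other operations in the same transaction (such as the VM record delete), which is one reason a single prefix delete is the simpler option here.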
    response, err := txn.store.client.Get(txn.ctx, physicalPrefix, clientv3.WithPrefix(), clientv3.WithSort(
        clientv3.SortByKey, clientv3.SortAscend,
    ))
Page etcd event queries at the server
When clients request GET /vms/:name/events?limit=... for a VM with large logs, this still fetches and unmarshals every event under the prefix before applying Limit, unlike the badger iterator, which stops after the page. This makes each paginated request O(total log history) and can exhaust memory and inflate latency on the controller; pass the cursor/order/limit into the etcd range instead of slicing after a full scan.
Fixed in 4945127. ListEventsPage now passes cursor, order, and limit into the etcd range query and requests limit+1 keys to preserve NextCursor behavior without scanning the full event history.
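The limit+1 idea can be sketched without etcd as follows. This is a simplified in-memory model, not the code from 4945127: `fetchRange` stands in for a server-side range query over sorted keys, `listPage` is a hypothetical name, and the sketch assumes ascending order and a cursor that is the inclusive start key of the next page (the actual cursor semantics in ListEventsPage may differ).

```go
package main

import "sort"

// fetchRange models a server-side range read: it returns at most max keys
// from sortedKeys, starting at the first key >= start. sortedKeys must be
// sorted in ascending order.
func fetchRange(sortedKeys []string, start string, max int) []string {
	i := sort.SearchStrings(sortedKeys, start)
	end := i + max
	if end > len(sortedKeys) {
		end = len(sortedKeys)
	}
	return sortedKeys[i:end]
}

// listPage requests limit+1 keys. If the extra key comes back, more data
// exists: the extra key is dropped from the page and returned as the cursor
// for the next request. An empty nextCursor means this was the last page.
func listPage(sortedKeys []string, cursor string, limit int) (page []string, nextCursor string) {
	if limit <= 0 {
		return nil, ""
	}
	got := fetchRange(sortedKeys, cursor, limit+1)
	if len(got) > limit {
		return got[:limit], got[limit]
	}
	return got, ""
}
```

Each call reads at most limit+1 keys regardless of how many events exist under the prefix, so the cost of a page no longer grows with the total log history.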
Update after review comments:
Tests run after the fix:
I also installed Tart 2.32.1 and reran
Additional main-branch comparison: I created a separate clean worktree at
Result: it also times out after 10 minutes in this local Tart environment. The same pattern appears: VM-backed integration tests stay in That makes the full integration timeout look environmental on this machine rather than introduced by this PR. The focused controller/store tests and the real-etcd store tests still pass on the PR branch.
Summary
Context
I am not certain whether external feature PRs are actively welcomed here: the repository has no CONTRIBUTING.md, direct push to openai/orchard returned 403 for my account, and the recent merged PR list appears to be mostly maintainer and Dependabot work. I opened this from my fork in case it is useful.
Thank you for creating Orchard, and congratulations on the acquisition. If you want a different shape for this change, or if you would rather see the scheduler leader-election follow-up first, I am happy to implement it.
Tests