Skip to content

Add experimental etcd store backend#450

Open
haasonsaas wants to merge 2 commits into
openai:mainfrom
haasonsaas:feature/etcd-store-backend
Open

Add experimental etcd store backend#450
haasonsaas wants to merge 2 commits into
openai:mainfrom
haasonsaas:feature/etcd-store-backend

Conversation

@haasonsaas

Copy link
Copy Markdown

Summary

  • Adds an experimental etcd-backed implementation of the existing controller store interface.
  • Adds controller store selection via --experimental-store=badger|etcd, --experimental-etcd-endpoints, and --experimental-etcd-key-prefix.
  • Extends store tests so Badger runs by default and etcd runs when ORCHARD_TEST_ETCD_ENDPOINTS is set.

Context
I am not certain whether external feature PRs are actively welcomed here: the repository has no CONTRIBUTING.md, direct push to openai/orchard returned 403 for my account, and the recent merged PR list appears to be mostly maintainer and Dependabot work. I opened this from my fork in case it is useful.

Thank you for creating Orchard, and congratulations on the acquisition. If you want a different shape for this change, or if you would rather see the scheduler leader-election follow-up first, I am happy to implement it.

Tests

  • go test ./internal/controller/store ./internal/controller ./internal/command/controller
  • ORCHARD_TEST_ETCD_ENDPOINTS=localhost:2379 go test ./internal/controller/store, against quay.io/coreos/etcd:v3.6.5 in Docker
  • go test ./... was attempted locally, but the integration tests require tart; this machine does not have tart installed.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8dfa03ba9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +152 to +154
for _, kv := range response.Kvs {
txn.deletes[string(kv.Key)] = struct{}{}
delete(txn.puts, string(kv.Key))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use range deletes for event cleanup

When deleting a VM with more than about 127 stored log events, DeleteEvents queues one OpDelete per event and the surrounding deleteVM transaction also deletes the VM record; etcd rejects transactions beyond its --max-txn-ops limit (128 by default, https://etcd.io/docs/v3.5/op-guide/configuration/), so normal long-running VMs become undeletable with the etcd backend. Use a single prefix delete or chunk operations under the limit.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4945127. DeleteEvents now records a single prefix delete, so VM event cleanup uses one etcd delete operation instead of one operation per event. Added TestDeleteManyEvents with 200 stored events and verified it against real etcd.

Comment on lines +66 to +68
response, err := txn.store.client.Get(txn.ctx, physicalPrefix, clientv3.WithPrefix(), clientv3.WithSort(
clientv3.SortByKey, clientv3.SortAscend,
))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Page etcd event queries at the server

When clients request GET /vms/:name/events?limit=... for a VM with large logs, this still fetches and unmarshals every event under the prefix before applying Limit, unlike the badger iterator that stops after the page. This makes each paginated request O(total log history) and can exhaust memory/latency on the controller; pass the cursor/order/limit into the etcd range instead of slicing after a full scan.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4945127. ListEventsPage now passes cursor, order, and limit into the etcd range query and requests limit+1 keys to preserve NextCursor behavior without scanning the full event history.

@haasonsaas

Copy link
Copy Markdown
Author

Update after review comments:

  • Fixed DeleteEvents for large event histories in 4945127: the etcd backend now uses one prefix delete instead of one delete operation per event.
  • Fixed ListEventsPage in 4945127: the etcd backend now passes cursor, order, and limit to the etcd range query and uses limit+1 to preserve NextCursor behavior.
  • Added TestDeleteManyEvents with 200 events.

Tests run after the fix:

  • go test ./internal/controller/store ./internal/controller ./internal/command/controller
  • ORCHARD_TEST_ETCD_ENDPOINTS=localhost:2379 go test ./internal/controller/store against quay.io/coreos/etcd:v3.6.5

I also installed Tart 2.32.1 and reran go test ./.... It still does not complete on this local machine: internal/tests times out after 10 minutes while VM-backed integration tests remain pending, including TestSpecUpdatePowerStateStopped and TestVMExecWithoutStdin. The focused controller/store tests and the real-etcd store tests pass.

@haasonsaas

Copy link
Copy Markdown
Author

Additional main-branch comparison:

I created a separate clean worktree at origin/main (b898769) and ran:

  • go test ./internal/tests

Result: it also times out after 10 minutes in this local Tart environment. The same pattern appears: VM-backed integration tests stay in pending, including TestSpecUpdatePowerStateStopped, then the package times out while TestVMExecWithoutStdin is running.

That makes the full integration timeout look environmental on this machine rather than introduced by this PR. The focused controller/store tests and the real-etcd store tests still pass on the PR branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant