Skip to content

vtgate: decouple streaming delivery from workload via --mysql-server-use-streaming#20378

Draft
arthurschreiber wants to merge 11 commits into
mainfrom
arthur/mysql-use-streaming
Draft

vtgate: decouple streaming delivery from workload via --mysql-server-use-streaming#20378
arthurschreiber wants to merge 11 commits into
mainfrom
arthur/mysql-use-streaming

Conversation

@arthurschreiber

Copy link
Copy Markdown
Member

Description

Today the MySQL protocol handler only uses StreamExecute for the OLAP workload — every other workload is buffered through Execute. That couples the delivery mode (streaming vs. buffering) to the workload mode, even though the workload should really only drive connection-pool selection and limits.

This adds an opt-in vtgate flag, --mysql-server-use-streaming (default false), that routes ComQuery, ComQueryMulti, and ComStmtExecute through streaming for all workloads. A new mysqlSessionUsesStreaming helper gates the three handlers (flag || workload == OLAP).

To keep workload semantics independent of the delivery mode, the tablet side now follows the workload rather than the RPC:

  • getStreamConn selects the regular (OLTP) pool for an OLTP workload and the stream pool otherwise, so an OLTP query that happens to be streamed still draws from the OLTP pool. Applied at the single getStreamConn chokepoint, covering the direct, consolidator, and schema-query call sites.
  • streamExecute computes its request timeout via a new streamTimeout helper: an OLTP workload gets the same timeout semantics as buffered Execute (query timeout, or the smaller of query and OLTP tx timeout inside a transaction), while other workloads keep the historical behavior of bounding only transactional streams with the OLAP tx timeout. UNSPECIFIED deliberately stays on the OLAP tx timeout to avoid TxTimeoutForWorkload mapping it to OLTP.
  • A non-transactional streamed query is tracked in the query list matching its workload (statelessql for OLTP, olapql otherwise), so shutdown semantics follow the workload too.

This is based on / extracted from #20213, and builds on #20268 (DML on the StreamExecute path). The earlier commits in this series make the streaming path reach parity with Execute before the flag flips it on for everyone:

  • planbuilder: eliminate PlanSelectStream and unify Build / BuildStreaming so streaming and buffered execution plan identically.
  • tabletserver: align Stream() stats with Execute and handle all plan types.
  • a StreamExecute/Execute parity test suite (stream_execute_compat_test.go).

Related Issue(s)

Based on / extracted from #20213. Follows #20268.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

New opt-in vtgate flag --mysql-server-use-streaming, default false — no behavioral change unless explicitly enabled. When enabled, all workloads are delivered via streaming; connection pools and limits still follow the session workload.

AI Disclosure

This PR was written primarily by Claude Code — I provided direction and review.

arthurschreiber and others added 4 commits June 23, 2026 20:24
BuildStreaming is now the full planner that shares all analysis with Build:
it accepts every statement type Build does and returns the real plan types
(Select, Show, OtherRead, SelectLockFunc, ...) instead of the catch-all
PlanSelectStream. Build becomes a thin wrapper that calls BuildStreaming and
overlays only the Build-specific behavior: the result-size LIMIT for
SELECT/UNION.

ANALYZE and EXPLAIN now map to PlanSelect on both paths, so streaming and
buffered planning agree (buffered already used PlanSelect for these).

The write-safety LIMIT for UPDATE/DELETE stays in analyzeUpdate/analyzeDelete,
so it is applied on both the buffered and streaming paths -- it bounds the
transaction size regardless of streaming. Only the result-buffering LIMIT for
SELECT/UNION is Build-only, since streaming reads have nothing to buffer.

PlanSelectStream is removed entirely, fixing ACL error messages and stats
keys that leaked "SelectStream" into user-visible output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
…ypes

Now that BuildStreaming yields real plan types, Stream() handles the plan
types that previously only Execute() did: PlanNextval and PlanShowMigrations
(which carry a nil FullQuery) get dedicated handlers, and the generic SQL path
falls back to the raw query when FullQuery is nil. The consolidator and
schema-name rewrite checks switch from PlanSelectStream to PlanSelect.

Stream() also records the per-table/per-plan QueryEngine stats and the result
histogram that Execute records, so reads on the streaming path are accounted
for the same way. The defer is registered after the DML dispatch so streamDML,
which records its own stats, is not double-counted, and it captures the real
error code via the named return.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Add a suite verifying that the StreamExecute path matches Execute now that
BuildStreaming accepts the full range of statements: DML, DDL, SET, savepoints,
CALL, FLUSH, LOAD and migration statements all run through StreamExecute, ACL
error messages and stats keys no longer leak "SelectStream", PlanNextval and
PlanShowMigrations are handled, and DML on the streaming path honors the
write-safety LIMIT through streamDML.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
…rkload

Today the MySQL protocol handler uses StreamExecute only for the OLAP
workload; every other workload is buffered via Execute. This couples the
delivery mode (streaming vs buffering) to the workload mode, even though
the workload should only drive pool selection and limits.

This adds an opt-in vtgate flag, --mysql-server-use-streaming (default
false), which routes ComQuery, ComQueryMulti, and ComStmtExecute through
streaming for all workloads. A new mysqlSessionUsesStreaming helper gates
the three handlers (flag || workload == OLAP).

To keep workload semantics independent of the delivery mode, the tablet
side now follows the workload rather than the RPC:

- getStreamConn selects the regular (OLTP) connection pool for an OLTP
  workload and the stream pool otherwise, so an OLTP query streamed to the
  client still draws from the OLTP pool. This is applied at the single
  getStreamConn chokepoint, covering the direct, consolidator, and schema
  query call sites.
- streamExecute computes its request timeout via the new streamTimeout
  helper: an OLTP workload gets the same timeout semantics as the buffered
  Execute path (query timeout, or the smaller of query and OLTP tx timeout
  in a transaction), while other workloads keep the historical behavior of
  bounding only transactional streams with the OLAP tx timeout. UNSPECIFIED
  deliberately stays on the OLAP tx timeout to avoid TxTimeoutForWorkload
  mapping it to OLTP.
- a non-transactional streamed query is tracked in the query list matching
  its workload (statelessql for OLTP, olapql otherwise) so shutdown
  semantics follow the workload too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings June 23, 2026 20:30

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@github-actions github-actions Bot added this to the v25.0.0 milestone Jun 23, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jun 23, 2026
@vitess-bot

vitess-bot Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.60000% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.38%. Comparing base (70c7a72) to head (65c6132).
⚠️ Report is 350 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/vttablet/tabletserver/query_executor.go 85.10% 7 Missing ⚠️
go/vt/vtgate/plugin_mysql_server.go 83.33% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20378       +/-   ##
===========================================
+ Coverage   69.67%   83.38%   +13.71%     
===========================================
  Files        1614      409     -1205     
  Lines      216793    71476   -145317     
===========================================
- Hits       151044    59603    -91441     
+ Misses      65749    11873    -53876     
Flag Coverage Δ
partial 83.38% <93.60%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

arthurschreiber and others added 7 commits June 23, 2026 21:19
Flip the flag default to streaming-for-all-workloads so CI exercises the
streaming delivery path across the whole suite. This is an experiment to
surface what breaks under streaming-by-default; not intended to merge as-is.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The StreamExecute result aggregator copied only Fields and Rows, dropping
RowsAffected. A row-returning statement that returns an OK packet — e.g. a
CALL of a stored procedure that performs DML — reported 0 affected rows to
the client instead of the real count, unlike the buffered Execute path.

Accumulate RowsAffected onto the aggregated result so it reaches the client.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
On the buffered Execute path, updateQueryStats runs unconditionally, so the
per-plan QueryExecutions counter is incremented even when a query errors. The
streaming path skipped it on the error path for row-returning statements, so a
failed EXPLAIN/SELECT over StreamExecute was never counted.

Record the stats on the row-returning error path too, leaving TablesUsed unset
so the per-table QueryExecutionsByTable counter stays untouched, matching the
buffered Execute path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Following the RowsAffected fix, the StreamExecute result aggregator still
dropped the remaining OK-packet fields. A row-returning statement that returns
an OK packet — e.g. a CALL of a procedure that performs an INSERT — lost its
InsertID (last_insert_id) and Info message on the way to the client, unlike the
buffered Execute path.

Carry InsertID (mirroring Result.AppendResult, including InsertIDChanged) and
Info onto the aggregated result so they reach the client.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The previous fix recorded per-plan query stats only for failed row-returning
queries on the streaming path. Non-row-returning queries (INSERT/UPDATE/DELETE)
that failed still returned through rollbackExecIfNeeded without recording them,
so a failed streamed DML was never counted in QueryExecutions — unlike the
buffered Execute path, which records stats before any rollback handling.

Hoist the stats recording to the top of the error block so every failed query,
row-returning or not, is counted. TablesUsed stays unset on error, so the
per-table QueryExecutionsByTable counter remains untouched, matching Execute.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The streaming path never called plan.AddStats, so engine.Plan execution
statistics (exec count, exec time, shard queries, rows, errors) — surfaced by
/queryz — were missing for every streamed query, unlike the buffered Execute
path which always records them via setLogStats.

Call plan.AddStats from updateLogStats for both the success and error paths. The
row counts are sourced from the streaming result receiver and also stored on
logStats (previously unset on this path), matching how buffered Execute couples
logStats and plan stats. On error the row counts stay zero and the error count
is one, mirroring Execute.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
TestComQueryMulti validates the workload -> delivery-mode mapping with explicit
OLTP (buffered) and OLAP (streaming) expectations. It implicitly relied on the
--mysql-server-use-streaming default being off to keep its OLTP cases buffered;
with the flag on, the OLTP cases received the streaming-shaped packets and the
callback indexed past its expectation slice and panicked.

Pin the flag off for the duration of the test so it no longer depends on the
mutable global default. The streaming delivery of OLTP queries is already
covered by the olap cases, which exercise the same path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings June 24, 2026 05:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: VTGate Component: VTTablet NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says Type: Feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants