
go/mysql: streaming errors no longer surface as connection loss#20383

Open
timvaillancourt wants to merge 3 commits into
vitessio:mainfrom
timvaillancourt:streamexecute-real-errors-mid-stream


@timvaillancourt timvaillancourt commented Jun 23, 2026

Contributor

Description

When the streaming query handler returns an error after the first row or field packet has been emitted, vtgate currently drops the underlying TCP connection and the client sees ERROR 2013 (HY000): Lost connection to MySQL server during query. Common triggers are KILL QUERY on an OLAP session, planner errors that surface late, and per-shard errors discovered after another shard has already returned rows.

We're at a packet boundary when the streaming callback returns, and MySQL's wire protocol allows an ERR packet in place of the trailing OK/EOF. This PR writes the real ERR there instead of tearing down the connection, across all three streaming paths in go/mysql:

  1. COM_QUERY (execQuery)
  2. multi-statement COM_QUERY (execQueryMulti)
  3. COM_STMT_EXECUTE (handleComStmtExecute)

When an OK packet has already terminated the result (DML / write-only), the original tear-down is preserved, since appending an ERR there would leave it queued as a stale packet for the next command.
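
As a rough illustration of the rule described above, here is a minimal Go sketch. It is not the code in this PR: `endAction`, `onStreamError` and `errPayload` are hypothetical names, and the sketch assumes a protocol 4.1 client and a 5-character SQLSTATE. It shows the two outcomes (write an ERR in place of the trailing OK/EOF, or close the connection) and the layout of an ERR packet payload.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// endAction is a hypothetical name for what the server does when a
// streaming handler fails; it is not an identifier from go/mysql.
type endAction int

const (
	// writeErrPacket: the result is still open, so an ERR replaces OK/EOF.
	writeErrPacket endAction = iota
	// closeConn: an OK already ended the result, so an ERR would be stale.
	closeConn
)

// onStreamError picks the action for an error returned by the handler.
// okTerminated reports whether an OK packet already ended the result.
func onStreamError(okTerminated bool) endAction {
	if okTerminated {
		return closeConn
	}
	return writeErrPacket
}

// errPayload builds a protocol 4.1 ERR packet payload:
// 0xFF, 2-byte little-endian error code, '#', 5-byte SQLSTATE, message.
// The caller is assumed to pass a 5-character sqlState.
func errPayload(code uint16, sqlState, msg string) []byte {
	p := []byte{0xFF, 0, 0, '#'}
	binary.LittleEndian.PutUint16(p[1:3], code)
	p = append(p, sqlState...)
	return append(p, msg...)
}

func main() {
	if onStreamError(false) == writeErrPacket {
		fmt.Printf("% x\n", errPayload(1317, "70100", "context canceled"))
	}
}
```

The sketch leaves out the 4-byte packet header (length and sequence number), which the real connection code adds when it frames the payload.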

The user-visible change is documented in the 25.0.0 changelog: clients that previously branched on ERROR 2013 will now see the real error code (e.g. 1317 / context canceled after KILL QUERY, or planner errors such as "specifying two different database in the query is not supported"), and the connection remains usable for follow-up queries.

cc @arthurschreiber / @maxenglander

Related Issue(s)

Resolves: #20382

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

This PR was written primarily by Claude Code, including this summary.

Copilot AI review requested due to automatic review settings June 23, 2026 22:42
@github-actions github-actions Bot added this to the v25.0.0 milestone Jun 23, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jun 23, 2026
@vitess-bot

vitess-bot Bot commented Jun 23, 2026

Contributor

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@github-actions github-actions Bot added the Component: Documentation docs related issues/PRs label Jun 23, 2026
@timvaillancourt timvaillancourt added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: General Changes throughout the code base and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jun 23, 2026
@timvaillancourt timvaillancourt self-assigned this Jun 23, 2026
When a streaming handler returns an error after the first row or field
packet has been emitted, vtgate previously dropped the connection and
the client saw ERROR 2013 / Lost connection. The result set is always
at a packet boundary when the callback returns, and MySQL's wire
protocol allows an ERR packet in place of OK/EOF, so we can surface
the real error to the client without tearing down the connection.

The fix applies to all three streaming paths in go/mysql: COM_QUERY
(text protocol), multi-statement COM_QUERY, and COM_STMT_EXECUTE
(binary protocol). When an OK packet has already terminated the result
(DML / write-only response), the original tear-down behaviour is
preserved to avoid desynchronising the protocol.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

This comment was marked as outdated.

@timvaillancourt timvaillancourt force-pushed the streamexecute-real-errors-mid-stream branch from 05aa732 to 1b450a6 Compare June 23, 2026 22:44
The drain loop in TestHandleComStmtExecuteSurfacesMidStreamError was
unbounded. A regression that ends the stream with OK/EOF (or leaves
the connection open without further packets) would cause the test to
hang until the CI timeout instead of failing fast. Adds a read
deadline and a 16-packet cap so the test fails immediately.

Addresses vitessio#20383 (comment)

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings June 23, 2026 22:47

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

@timvaillancourt timvaillancourt marked this pull request as ready for review June 23, 2026 22:50
@codecov

codecov Bot commented Jun 23, 2026


Codecov Report

❌ Patch coverage is 83.33333% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.25%. Comparing base (70c7a72) to head (507e791).
⚠️ Report is 350 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| go/mysql/conn.go | 83.33% | 3 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20383       +/-   ##
===========================================
- Coverage   69.67%   69.25%    -0.42%     
===========================================
  Files        1614      172     -1442     
  Lines      216793    24243   -192550     
===========================================
- Hits       151044    16790   -134254     
+ Misses      65749     7453    -58296     
Flag Coverage Δ
partial 69.25% <83.33%> (?)


Comment thread go/mysql/conn_test.go
Comment thread go/mysql/conn_test.go
@arthurschreiber

Member

I think there's a pre-existing issue where FetchNext (go/mysql/streaming_query.go:133) doesn't clear c.fields on the ERR path. This can cause issues when vttablet receives a mid-stream ERR from MySQL. It's neither introduced nor made worse by this PR, just wanted to call it out. Probably should be fixed in a separate / followup PR.

…or tests

Adds a multi-statement test where the first result set succeeds and the
second fails mid-stream, and asserts the follow-up COM_STMT_EXECUTE returns
a real (non-error) result.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

@maxenglander maxenglander left a comment

Collaborator


found a couple things with the help of Codex. i am really unfamiliar with this part of the code so i am going to confess i don't fully understand the comments. sorry to make you do the work of understanding and validating them.

Comment thread go/mysql/conn.go
log.Error("Error after OK-terminated result", slog.String("connection", c.String()), slog.Any("error", err))
return connErr
}
if !c.writeErrorPacketFromErrorAndLog(err) {

Collaborator


an agent generated this suggestion. i validated the suggestion with a test, though i'll admit i still don't fully understand it.

This needs to terminate the previous result set before writing the ERR packet when needsEndPacket is still true.

A case like select rows; error leaves the first result set streamed but not yet EOF/OK-terminated.

Since the second statement fails before its first callback packet, the firstPacket && needsEndPacket block above never runs. Writing the ERR directly here makes the client consume that ERR as the terminator for the first result, so ReadQueryResult returns the error and drops the rows instead of returning the first result with more=true.

I reproduced it with a test shaped like:

  • query: select rows; error
  • first ReadQueryResult: expect rows + more=true
  • second ReadQueryResult: expect the SQL error

Line 1608 (if err != nil) would also be a good placement if GitHub won’t let you comment on 1616.

here's the diff of the test:

diff --git a/go/mysql/conn_test.go b/go/mysql/conn_test.go
index 2b1a62d403..eab6a13451 100644
--- a/go/mysql/conn_test.go
+++ b/go/mysql/conn_test.go
@@ -1121,6 +1121,32 @@ func TestExecQueryMultiStreamErrorAfterFirstResultSet(t *testing.T) {
 	require.True(t, result.Equal(selectRowsResult))
 }
 
+func TestExecQueryMultiErrorBeforeNextResultSetTerminatesPreviousResult(t *testing.T) {
+	listener, sConn, cConn := createSocketPair(t)
+	sConn.multiQuery = true
+	sConn.Capabilities |= CapabilityClientMultiStatements
+	defer func() {
+		listener.Close()
+		sConn.Close()
+		cConn.Close()
+	}()
+
+	require.NoError(t, cConn.WriteComQuery("select rows; error"))
+
+	handler := &testRun{err: sqlerror.NewSQLError(sqlerror.ERQueryInterrupted, sqlerror.SSQueryInterrupted, "context canceled")}
+	res := sConn.handleNextCommand(handler)
+	require.True(t, res, "error before the next result set must not tear down the connection")
+
+	result, more, _, err := cConn.ReadQueryResult(100, true)
+	require.NoError(t, err)
+	require.True(t, result.Equal(selectRowsResult))
+	require.True(t, more, "more results must follow the first result set")
+
+	_, more, _, err = cConn.ReadQueryResult(100, true)
+	require.ErrorContains(t, err, "context canceled")
+	require.False(t, more, "no further results after an error packet")
+}
+
 // TestExecQueryErrorAfterOKDoesNotDesyncProtocol guards against writing an ERR
 // packet after an OK packet has already terminated the result set. Appending a
 // second packet would leave a stale ERR queued for the next command. The

Running it:

go test ./go/mysql -run TestExecQueryMultiErrorBeforeNextResultSetTerminatesPreviousResult -count=1

Result: it failed at the first ReadQueryResult:

  --- FAIL: TestExecQueryMultiErrorBeforeNextResultSetTerminatesPreviousResult (0.00s)
      conn_test.go:1141:
          Received unexpected error:
          context canceled (errno 1317) (sqlstate 70100)
  FAIL
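
The failure above can be modelled without a socket. The following is a deliberately simplified sketch of how a client consumes a multi-statement response; it ignores column-count and field packets, and `kind`, `readResult` and `errQuery` are hypothetical names rather than go/mysql identifiers. It shows why an ERR written straight after the rows is read as the end of the first result.

```go
package main

import (
	"errors"
	"fmt"
)

// kind is a coarse, hypothetical classification of packets on the wire.
type kind int

const (
	// row is a row of the current result set.
	row kind = iota
	// eofMore is an EOF/OK terminator with SERVER_MORE_RESULTS_EXISTS set.
	eofMore
	// errPkt is an ERR packet.
	errPkt
)

var errQuery = errors.New("context canceled (errno 1317)")

// readResult consumes one result from stream the way a client would:
// rows accumulate until a terminator, and an ERR discards them.
func readResult(stream []kind) (rows int, more bool, rest []kind, err error) {
	for i, k := range stream {
		switch k {
		case row:
			rows++
		case eofMore:
			return rows, true, stream[i+1:], nil
		case errPkt:
			return 0, false, stream[i+1:], errQuery
		}
	}
	return rows, false, nil, errors.New("stream ended without a terminator")
}

func main() {
	// ERR written directly after the first statement's rows.
	rows, more, _, err := readResult([]kind{row, row, errPkt})
	fmt.Println("unterminated:", rows, more, err)

	// First result terminated before the ERR for the second statement.
	rows, more, rest, err := readResult([]kind{row, row, eofMore, errPkt})
	fmt.Println("terminated, first read:", rows, more, err)
	_, more, _, err = readResult(rest)
	fmt.Println("terminated, second read:", more, err)
}
```

In this model the second stream matches what the proposed test expects: rows with more=true on the first read, then the SQL error on the second.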

Comment thread go/mysql/conn.go
// An OK packet already terminated the last result; we cannot safely
// append an ERR without desynchronizing the protocol for the next
// command. Tear down the connection instead.
if !needsEndPacket {

Collaborator


an agent generated this suggestion. i validated the suggestion with a test, though i'll admit i still don't fully understand it.

Context: multi-statement batches where the previous result was OK-only, not a row result.

A case like update t set id = 2; error writes the first statement's OK packet with SERVER_MORE_RESULTS_EXISTS, so needsEndPacket is false. But the next statement can still fail before producing its first callback packet. In that case the error belongs to the next result and should be returned to the client, not treated as an error after the final OK result that requires closing the connection.

here's the diff of the test:

diff --git a/go/mysql/conn_test.go b/go/mysql/conn_test.go
@@
+func TestExecQueryMultiErrorBeforeNextResultSetAfterOKResult(t *testing.T) {
+	listener, sConn, cConn := createSocketPair(t)
+	sConn.multiQuery = true
+	sConn.Capabilities |= CapabilityClientMultiStatements
+	defer func() {
+		listener.Close()
+		sConn.Close()
+		cConn.Close()
+	}()
+
+	require.NoError(t, cConn.WriteComQuery("update t set id = 2; error"))
+
+	handler := slowQueryTestHandler{
+		queryResults: map[string]*sqltypes.Result{
+			"update t set id = 2": {RowsAffected: 1},
+		},
+	}
+	res := sConn.handleNextCommand(handler)
+	require.True(t, res, "error before the next result set must not tear down the connection")
+
+	result, more, _, err := cConn.ReadQueryResult(100, true)
+	require.NoError(t, err)
+	require.EqualValues(t, 1, result.RowsAffected)
+	require.True(t, more, "more results must follow the first result set")
+
+	_, more, _, err = cConn.ReadQueryResult(100, true)
+	require.ErrorContains(t, err, "unexpected query")
+	require.False(t, more, "no further results after an error packet")
+}

Running it:

go test ./go/mysql -run TestExecQueryMultiErrorBeforeNextResultSetAfterOKResult -count=1

Result: it failed because handleNextCommand returned false:

--- FAIL: TestExecQueryMultiErrorBeforeNextResultSetAfterOKResult (0.00s)
    conn_test.go:1168:
        Error: Should be true
        Messages: error before the next result set must not tear down the connection
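
The distinction this comment draws can be expressed as a check on the status flags of the last OK packet. The sketch below is only an illustration of the reviewer's suggestion, not code from the PR: `errBelongsToNextResult` is a hypothetical helper, and the only protocol facts it relies on are the values of the `SERVER_STATUS_AUTOCOMMIT` (0x0002) and `SERVER_MORE_RESULTS_EXISTS` (0x0008) status bits.

```go
package main

import "fmt"

const (
	// serverStatusAutocommit is the SERVER_STATUS_AUTOCOMMIT status bit.
	serverStatusAutocommit uint16 = 0x0002
	// serverMoreResultsExists is the SERVER_MORE_RESULTS_EXISTS status bit.
	serverMoreResultsExists uint16 = 0x0008
)

// errBelongsToNextResult reports whether an error raised after an OK
// packet can be sent as an ERR for the following statement. That is only
// the case when the OK announced that more results would follow; after a
// final OK the client is not reading, so an ERR would be left queued.
func errBelongsToNextResult(lastOKStatus uint16) bool {
	return lastOKStatus&serverMoreResultsExists != 0
}

func main() {
	// "update t set id = 2; error": the first OK carries the more-results bit.
	midBatch := serverStatusAutocommit | serverMoreResultsExists
	// A single DML statement: the OK is final.
	final := serverStatusAutocommit

	fmt.Println(errBelongsToNextResult(midBatch), errBelongsToNextResult(final))
}
```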


Labels

Component: Documentation docs related issues/PRs Component: General Changes throughout the code base Type: Enhancement Logical improvement (somewhere between a bug and feature)


Development

Successfully merging this pull request may close these issues.

Bug Report: streaming query errors surface as 2013 / Lost connection instead of the real error

4 participants