go/mysql: streaming errors no longer surface as connection loss#20383
go/mysql: streaming errors no longer surface as connection loss#20383timvaillancourt wants to merge 3 commits into
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
When a streaming handler returns an error after the first row or field packet has been emitted, vtgate previously dropped the connection and the client saw ERROR 2013 / Lost connection. The result set is always at a packet boundary when the callback returns, and MySQL's wire protocol allows an ERR packet in place of OK/EOF, so we can surface the real error to the client without tearing down the connection. The fix applies to all three streaming paths in go/mysql: COM_QUERY (text protocol), multi-statement COM_QUERY, and COM_STMT_EXECUTE (binary protocol). When an OK packet has already terminated the result (DML / write-only response), the original tear-down behaviour is preserved to avoid desynchronising the protocol. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
05aa732 to
1b450a6
Compare
The drain loop in TestHandleComStmtExecuteSurfacesMidStreamError was unbounded. A regression that ends the stream with OK/EOF (or leaves the connection open without further packets) would cause the test to hang until the CI timeout instead of failing fast. Adds a read deadline and a 16-packet cap so the test fails immediately. Addresses vitessio#20383 (comment) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #20383 +/- ##
===========================================
- Coverage 69.67% 69.25% -0.42%
===========================================
Files 1614 172 -1442
Lines 216793 24243 -192550
===========================================
- Hits 151044 16790 -134254
+ Misses 65749 7453 -58296
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
I think there's a pre-existing issue where |
…or tests Adds a multi-statement test where the first result set succeeds and the second fails mid-stream, and asserts the follow-up COM_STMT_EXECUTE returns a real (non-error) result. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
maxenglander
left a comment
There was a problem hiding this comment.
found a couple things with the help of Codex. i am really unfamiliar with this part of the code so i am going to confess i don't fully understand the comments. sorry to make you do the work of understanding and validating them.
| log.Error("Error after OK-terminated result", slog.String("connection", c.String()), slog.Any("error", err)) | ||
| return connErr | ||
| } | ||
| if !c.writeErrorPacketFromErrorAndLog(err) { |
There was a problem hiding this comment.
an agent generated this suggestion. i validated the suggestion with a test, though i'll admit i still don't fully understand it.
This needs to terminate the previous result set before writing the ERR packet when
needsEndPacketis still true.A case like
select rows; errorleaves the first result set streamed but not yet EOF/OK-terminated.Since the second statement fails before its first callback packet, the
firstPacket && needsEndPacketblock above never runs. Writing the ERR directly here makes the client consume that ERR as the terminator for the first result, soReadQueryResultreturns the error and drops the rows instead of returning the first result withmore=true.I reproduced it with a test shaped like:
- query:
select rows; error- first
ReadQueryResult: expect rows +more=true- second
ReadQueryResult: expect the SQL errorLine 1608 (if err != nil) would also be a good placement if GitHub won’t let you comment on 1616.
here's the diff of the test:
diff --git a/go/mysql/conn_test.go b/go/mysql/conn_test.go
index 2b1a62d403..eab6a13451 100644
--- a/go/mysql/conn_test.go
+++ b/go/mysql/conn_test.go
@@ -1121,6 +1121,32 @@ func TestExecQueryMultiStreamErrorAfterFirstResultSet(t *testing.T) {
require.True(t, result.Equal(selectRowsResult))
}
+func TestExecQueryMultiErrorBeforeNextResultSetTerminatesPreviousResult(t *testing.T) {
+ listener, sConn, cConn := createSocketPair(t)
+ sConn.multiQuery = true
+ sConn.Capabilities |= CapabilityClientMultiStatements
+ defer func() {
+ listener.Close()
+ sConn.Close()
+ cConn.Close()
+ }()
+
+ require.NoError(t, cConn.WriteComQuery("select rows; error"))
+
+ handler := &testRun{err: sqlerror.NewSQLError(sqlerror.ERQueryInterrupted, sqlerror.SSQueryInterrupted, "context canceled")}
+ res := sConn.handleNextCommand(handler)
+ require.True(t, res, "error before the next result set must not tear down the connection")
+
+ result, more, _, err := cConn.ReadQueryResult(100, true)
+ require.NoError(t, err)
+ require.True(t, result.Equal(selectRowsResult))
+ require.True(t, more, "more results must follow the first result set")
+
+ _, more, _, err = cConn.ReadQueryResult(100, true)
+ require.ErrorContains(t, err, "context canceled")
+ require.False(t, more, "no further results after an error packet")
+}
+
// TestExecQueryErrorAfterOKDoesNotDesyncProtocol guards against writing an ERR
// packet after an OK packet has already terminated the result set. Appending a
// second packet would leave a stale ERR queued for the next command. The
Running it:
go test ./go/mysql -run TestExecQueryMultiErrorBeforeNextResultSetTerminatesPreviousResult -count=1Result: it failed at the first ReadQueryResult:
--- FAIL: TestExecQueryMultiErrorBeforeNextResultSetTerminatesPreviousResult (0.00s)
conn_test.go:1141:
Received unexpected error:
context canceled (errno 1317) (sqlstate 70100)
FAIL
| // An OK packet already terminated the last result; we cannot safely | ||
| // append an ERR without desynchronizing the protocol for the next | ||
| // command. Tear down the connection instead. | ||
| if !needsEndPacket { |
There was a problem hiding this comment.
an agent generated this suggestion. i validated the suggestion with a test, though i'll admit i still don't fully understand it.
Context: multi-statement batches where the previous result was OK-only, not a row result.
A case like
update t set id = 2; errorwrites the first statement's OK packet withSERVER_MORE_RESULTS_EXISTS, soneedsEndPacketis false. But the next statement can still fail before producing its first callback packet. In that case the error belongs to the next result and should be returned to the client, not treated as an error after the final OK result that requires closing the connection.
here's the diff of the test:
diff --git a/go/mysql/conn_test.go b/go/mysql/conn_test.go
@@
+func TestExecQueryMultiErrorBeforeNextResultSetAfterOKResult(t *testing.T) {
+ listener, sConn, cConn := createSocketPair(t)
+ sConn.multiQuery = true
+ sConn.Capabilities |= CapabilityClientMultiStatements
+ defer func() {
+ listener.Close()
+ sConn.Close()
+ cConn.Close()
+ }()
+
+ require.NoError(t, cConn.WriteComQuery("update t set id = 2; error"))
+
+ handler := slowQueryTestHandler{
+ queryResults: map[string]*sqltypes.Result{
+ "update t set id = 2": {RowsAffected: 1},
+ },
+ }
+ res := sConn.handleNextCommand(handler)
+ require.True(t, res, "error before the next result set must not tear down the connection")
+
+ result, more, _, err := cConn.ReadQueryResult(100, true)
+ require.NoError(t, err)
+ require.EqualValues(t, 1, result.RowsAffected)
+ require.True(t, more, "more results must follow the first result set")
+
+ _, more, _, err = cConn.ReadQueryResult(100, true)
+ require.ErrorContains(t, err, "unexpected query")
+ require.False(t, more, "no further results after an error packet")
+}Running it:
go test ./go/mysql -run TestExecQueryMultiErrorBeforeNextResultSetAfterOKResult -count=1Result: it failed because handleNextCommand returned false:
--- FAIL: TestExecQueryMultiErrorBeforeNextResultSetAfterOKResult (0.00s)
conn_test.go:1168:
Error: Should be true
Messages: error before the next result set must not tear down the connection
Description
When the streaming query handler returns an error after the first row or field packet has been emitted, vtgate today drops the underlying TCP connection and the client sees
ERROR 2013 (HY000): Lost connection to MySQL server during query. Common triggers areKILL QUERYon an OLAP session, planner errors that surface late, and per-shard errors discovered after another shard has already returned rowsWe're at a packet boundary when the streaming callback returns, and MySQL's wire protocol allows an ERR packet in place of the trailing OK/EOF. This PR writes the real ERR there instead of tearing down the connection, across all three streaming paths in
go/mysql:COM_QUERY(execQuery)COM_QUERY(execQueryMulti)COM_STMT_EXECUTE(handleComStmtExecute)When an OK packet has already terminated the result (DML / write-only) the original tear-down is preserved, since appending an ERR there would leave it queued as a stale packet for the next command
The user-visible change is documented in the
25.0.0changelog — clients that previously branched onERROR 2013will now see the real error code (e.g.1317 / context canceledafterKILL QUERY, or planner errors such asspecifying two different database in the query is not supported) and the connection remains usable for follow-up queriescc @arthurschreiber / @maxenglander
Related Issue(s)
Resolves: #20382
Checklist
Deployment Notes
AI Disclosure
This PR was written primarily by Claude Code, including this summary