Orchestrator Version: 3.2.6-17
MySQL Version: 8.0.43 (RDS Source); 8.0.33-25, 8.0.42-33, and 8.0.43-34 (Intermediate/Leaf Replicas)
We are observing a recurring issue where Orchestrator reports GTID:errant on cascading replicas, even though the transactions in question originated from the top-level Source and are simply in the process of propagating through the topology.
The "errant" status appears to be a false positive: Orchestrator seems to poll the replicas more recently than it polls their upstream (the Source or Intermediate Source), so a replica's Executed_Gtid_Set can briefly contain GTIDs that are missing from the upstream snapshot it is compared against.
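To illustrate the suspected race, here is a minimal sketch of a GTID set subtraction (the same idea as MySQL's GTID_SUBTRACT()) applied to two snapshots taken at different moments. It assumes the errant set is computed as "replica set minus upstream set"; that assumption, the helper names, and the snapshot values are illustrative, and only the resulting range matches what we observed.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (first, last) transaction numbers
GtidSet = Dict[str, List[Interval]]


def parse_gtid_set(text: str) -> GtidSet:
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(1, 5), (7, 7)], ...}."""
    result: GtidSet = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        intervals = result.setdefault(uuid.lower(), [])
        for r in ranges:
            first, _, last = r.partition("-")
            intervals.append((int(first), int(last or first)))
    return result


def subtract(a: GtidSet, b: GtidSet) -> GtidSet:
    """Return the GTIDs present in a but not in b."""
    out: GtidSet = {}
    for uuid, intervals in a.items():
        remaining: List[Interval] = []
        for first, last in sorted(intervals):
            cursor = first
            for b_first, b_last in sorted(b.get(uuid, [])):
                if b_last < cursor or b_first > last:
                    continue  # no overlap with the part still uncovered
                if b_first > cursor:
                    remaining.append((cursor, b_first - 1))
                cursor = max(cursor, b_last + 1)
            if cursor <= last:
                remaining.append((cursor, last))
        if remaining:
            out[uuid] = remaining
    return out


UUID = "623335cf-c858-3eb6-a90d-40122a2d21f4"
# Hypothetical snapshots: upstream polled first, replica polled slightly later.
upstream_snapshot = parse_gtid_set(f"{UUID}:1-4335747")
replica_snapshot = parse_gtid_set(f"{UUID}:1-4335796")

# The stale upstream snapshot makes replicated transactions look errant.
apparent_errant = subtract(replica_snapshot, upstream_snapshot)
```

With these inputs `apparent_errant` holds the range 4335748-4335796 under the Source's UUID, even though every transaction came from upstream. Once the upstream snapshot is refreshed to include that range, the same subtraction is empty.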
Replication topology
Source: AWS RDS instance
Intermediate Source: 10.10.10.200
Leaf Replicas: 10.10.10.201, 10.10.10.202, 10.10.10.206
AWS RDS instance [RW] [8.0.43] [GTID]
└── 10.10.10.200 [SR] [8.0.33-25] [GTID] [repl=ok,lag=1,auto_pos]
    ├── 10.10.10.201 [SR] [8.0.43-34] [GTID] [repl=ok,lag=0,auto_pos]
    ├── 10.10.10.206 [SR] [8.0.42-33] [GTID] [repl=ok,lag=0,auto_pos]
    └── 10.10.10.202 [SR] [8.0.33-25] [GTID] [repl=ok,lag=0,auto_pos]
Orchestrator topology
$ orchestrator-client -c topology -i 10.10.10.200
AWS RDS instance:3306 [unknown,invalid,Unknown,rw,nobinlog]
+ 10.10.10.200:3306 [0s,ok,8.0.33-25,ro,ROW,>>,GTID]
  + 10.10.10.201:3306 [0s,ok,8.0.43-34,ro,ROW,>>,GTID:errant]
  + 10.10.10.202:3306 [0s,ok,8.0.33-25,ro,ROW,>>,GTID:errant]
  + 10.10.10.206:3306 [0s,ok,8.0.42-33,ro,ROW,>>,GTID:errant]
Evidence of Metadata Racing
When querying Orchestrator via CLI, the errant range "shifts" rapidly as metadata is refreshed:
$ orchestrator -c which-gtid-errant -i 10.10.10.201:3306
623335cf-c858-3eb6-a90d-40122a2d21f4:4335748-4335796
# Seconds later...
$ orchestrator -c which-gtid-errant -i 10.10.10.201:3306
623335cf-c858-3eb6-a90d-40122a2d21f4:4335820-4335836
Verification of Source Origin
We performed a mysqlbinlog dump on the leaf replicas for the reported "errant" range. The output confirms that the transactions originated from Server ID 1483228424 (the RDS Source), not from local writes on the replicas:
Example from 10.10.10.201
#260224 12:09:48 server id 1483228424 end_log_pos 994222673 CRC32 0x4bf21cdf Query thread_id=9843324 exec_time=0 error_code=0
SET @@session.pseudo_thread_id=9843324/*!*/;
UPDATE db.run ...
All replicas show the exact same thread_id and server id. Within minutes, the Source Executed_Gtid_Set catches up, and the "errant" status clears on its own without intervention.
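The manual check above can be repeated across replicas with a small script. This is a minimal sketch that scans mysqlbinlog text output for event header lines and collects the server ids it finds. The function names are hypothetical, and the regular expression assumes the header layout shown in the excerpt above.

```python
import re
from typing import Iterable, Set

# Matches event headers such as:
#   #260224 12:09:48 server id 1483228424 end_log_pos ...
SERVER_ID_RE = re.compile(
    r"^#\d{6}\s+\d{1,2}:\d{2}:\d{2}\s+server id\s+(\d+)\s"
)


def server_ids_in_dump(lines: Iterable[str]) -> Set[int]:
    """Collect every originating server id seen in mysqlbinlog output."""
    ids: Set[int] = set()
    for line in lines:
        match = SERVER_ID_RE.match(line)
        if match:
            ids.add(int(match.group(1)))
    return ids


def originated_only_from(lines: Iterable[str], expected_server_id: int) -> bool:
    """True if the dump has at least one event and all events share the expected id."""
    ids = server_ids_in_dump(lines)
    return bool(ids) and ids == {expected_server_id}


sample = [
    "#260224 12:09:48 server id 1483228424 end_log_pos 994222673 CRC32 0x4bf21cdf "
    "Query thread_id=9843324 exec_time=0 error_code=0",
    "SET @@session.pseudo_thread_id=9843324/*!*/;",
]
from_rds_source = originated_only_from(sample, 1483228424)
```

For the sample line taken from 10.10.10.201, `from_rds_source` is true. A genuinely errant local write would show that replica's own server id in its header line instead.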
The Problem
Because Orchestrator flags these transient states as GTID:errant, it triggers downstream monitoring alerts. In a cascading topology with high-concurrency writes, the "racing" metadata leads to constant flapping alerts for a cluster that is actually in perfect health.
Attempts to Resolve
Restarted Orchestrator container (version 3.2.6-17) - Issue persists.
Verified super_read_only=1 on all replicas - No local writes are occurring.
Expected Behavior
Orchestrator should perhaps implement a grace period or verify the server_id of the "errant" transactions before flagging them as a divergence, especially in cascading setups where metadata propagation might not be atomic across all layers.
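Until something like this exists in Orchestrator, a grace period can be approximated on the monitoring side. The sketch below is a heuristic of our own, not an Orchestrator feature. It takes the output of which-gtid-errant on each poll and reports a UUID only when its lowest errant transaction number has stayed the same for longer than the grace period. A propagating window keeps moving its lower bound forward, as in the ranges shown earlier, whereas a real errant transaction would stay put. The class name, method names, and the 60-second value are assumptions.

```python
from typing import Dict, List, Tuple


def lowest_errant(errant_set: str) -> Dict[str, int]:
    """Map each UUID in a GTID set string to its lowest transaction number."""
    lowest: Dict[str, int] = {}
    for part in errant_set.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        firsts = [int(r.partition("-")[0]) for r in ranges]
        if firsts:
            lowest[uuid.lower()] = min(firsts)
    return lowest


class ErrantGrace:
    """Suppress errant reports whose lower bound keeps moving between polls."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        # (instance, uuid) -> (lowest errant transaction number, time first seen)
        self._seen: Dict[Tuple[str, str], Tuple[int, float]] = {}

    def observe(self, instance: str, errant_set: str, now: float) -> List[str]:
        """Record one poll and return the UUIDs that have persisted past the grace period."""
        current = lowest_errant(errant_set)
        # Forget UUIDs that are no longer reported as errant for this instance.
        for key in [k for k in self._seen if k[0] == instance and k[1] not in current]:
            del self._seen[key]
        persistent: List[str] = []
        for uuid, low in current.items():
            key = (instance, uuid)
            previous = self._seen.get(key)
            if previous is None or previous[0] != low:
                self._seen[key] = (low, now)  # new or moved: restart the clock
            elif now - previous[1] >= self.grace_seconds:
                persistent.append(uuid)
        return persistent


grace = ErrantGrace(grace_seconds=60)
uuid = "623335cf-c858-3eb6-a90d-40122a2d21f4"
first_poll = grace.observe("10.10.10.201:3306", f"{uuid}:4335748-4335796", now=0)
second_poll = grace.observe("10.10.10.201:3306", f"{uuid}:4335820-4335836", now=10)
```

Both polls return an empty list because the lower bound moved from 4335748 to 4335820, so no alert would fire. The trade-off is that a real errant transaction is reported one grace period late.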
Orchestrator configuration:
{
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "AutoPseudoGTID": false,
  "BackendDB": "sqlite",
  "CandidateInstanceExpireMinutes": 60,
  "Debug": true,
  "DetachLostSlavesAfterMasterFailover": false,
  "DiscoverByShowSlaveHosts": true,
  "ExpiryHostnameResolvesMinutes": 60,
  "HostnameResolveMethod": "none",
  "InstancePollSeconds": 10,
  "MasterFailoverDetachSlaveMasterHost": true,
  "MySQLHostnameResolveMethod": "@@report_host",
  "MySQLTopologySSLSkipVerify": true,
  "ReadOnly": false,
  "ReasonableMaintenanceReplicationLagSeconds": 120,
  "RemoveTextFromHostnameDisplay": ".mydomain.com:3306",
  "SlaveLagQuery": "",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.sqlite3",
  "UnseenInstanceForgetHours": 24,
  "UseSSL": false,
  "UseSuperReadOnly": true,
  "ListenAddress": ":3000",
  "DefaultInstancePort": 3306,
  "AuditToSyslog": false,
  "AuthenticationMethod": "multi",
  "HTTPAuthUser": "***",
  "HTTPAuthPassword": "***",
  "MySQLTopologyUser": "***",
  "MySQLTopologyPassword": "***",
  "PowerAuthUsers": ["***"],
  "URLPrefix": "",
  "StatusEndpoint": "api/status",
  "ClusterNameToAlias": {
    "V-L-DB-Prd-2:3306": "prod_cluster",
    "10.10.10.201:3306": "prod_cluster",
    "V-L-DB-Prd-1:3306": "prod_cluster",
    "10.10.10.200:3306": "prod_cluster"
  },
  "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}'"
  ],
  "PreGracefulTakeoverProcesses": [
    "echo 'Planned takeover about to take place on {failureCluster}. Source will switch to read_only'"
  ],
  "PreFailoverProcesses": [
    "echo 'Will recover from {failureType} on {failureCluster}'"
  ],
  "PostFailoverProcesses": [
    "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >"
  ],
  "PostUnsuccessfulFailoverProcesses": [
    ""
  ],
  "PostMasterFailoverProcesses": [
    "/opt/orchestrator/bin/orchestrator_hook_sql_commands.py --failed-host={failedHost} --failed-port={failedPort} --successor-host={successorHost} --successor-port={successorPort} --config=/opt/orchestrator/etc/orchestrator_hook_sql_commands.yaml --defaults-file=/opt/orchestrator/etc/.my.cnf"
  ],
  "PostIntermediateMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}'"
  ],
  "PostGracefulTakeoverProcesses": [
    "echo 'Planned takeover complete'"
  ]
}
Orchestrator version:
$ podman ps | grep orchestrator
2ada75af3757 localhost/orchestrator:3.2.6-17 /usr/local/orches... 20 hours ago Up 20 hours ago (starting) orchestrator