False Positive "Errant GTID" reported on cascading replicas #88

Orchestrator Version: 3.2.6-17
MySQL Version: 8.0.43 (RDS Source); 8.0.33-25 (Intermediate Source); 8.0.33-25, 8.0.42-33, 8.0.43-34 (Leaf Replicas)

We are observing a recurring issue where Orchestrator reports GTID:errant on cascading replicas, even though the transactions in question originated from the top-level Source and are simply in the process of propagating through the topology.

The "errant" status appears to be a false positive: Orchestrator seems to collect metadata from the replicas faster than the Source (or Intermediate Source) exposes the same transactions in its own Executed_Gtid_Set, so the comparison runs against a stale upstream set.
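
To illustrate the suspected race, here is a minimal Python sketch. It assumes the errant set is essentially "the replica's Executed_Gtid_Set minus the upstream's most recently recorded Executed_Gtid_Set"; Orchestrator's real computation has more steps, so treat this as a simplified model only. The helper names (`parse_gtid_set`, `subtract`) and the snapshot values are made up for illustration, with the upper bounds chosen to reproduce the first range reported further down.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]            # inclusive (first, last) transaction numbers
GtidSet = Dict[str, List[Interval]]   # source UUID -> sorted intervals


def parse_gtid_set(text: str) -> GtidSet:
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: [(1, 5), (7, 7)], ...}."""
    result: GtidSet = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        intervals = result.setdefault(uuid, [])
        for r in ranges:
            first, _, last = r.partition("-")
            intervals.append((int(first), int(last or first)))
        intervals.sort()
    return result


def subtract(a: GtidSet, b: GtidSet) -> GtidSet:
    """Return the GTIDs present in `a` but missing from `b`."""
    out: GtidSet = {}
    for uuid, intervals in a.items():
        remaining: List[Interval] = []
        for first, last in intervals:
            cursor = first
            for b_first, b_last in b.get(uuid, []):
                if b_last < cursor or b_first > last:
                    continue  # no overlap with the part still uncovered
                if b_first > cursor:
                    remaining.append((cursor, b_first - 1))
                cursor = max(cursor, b_last + 1)
            if cursor <= last:
                remaining.append((cursor, last))
        if remaining:
            out[uuid] = remaining
    return out


# Illustrative snapshots (assumed values, not taken from the servers).
UUID = "623335cf-c858-3eb6-a90d-40122a2d21f4"
replica = parse_gtid_set(f"{UUID}:1-4335796")         # polled just now
stale_upstream = parse_gtid_set(f"{UUID}:1-4335747")  # recorded a moment earlier
fresh_upstream = parse_gtid_set(f"{UUID}:1-4335800")  # recorded on the next poll

print(subtract(replica, stale_upstream))  # range 4335748-4335796 looks "errant"
print(subtract(replica, fresh_upstream))  # nothing left once upstream is re-polled
```

Under this model the replica is never ahead of its upstream in reality; it only looks that way because the two sides were sampled at different moments.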

Replication topology

```
Source: AWS RDS instance
Intermediate Source: 10.10.10.200
Leaf Replicas: 10.10.10.201, 10.10.10.202, 10.10.10.206

AWS RDS instance [RW] [8.0.43] [GTID]
 └── 10.10.10.200 [SR] [8.0.33-25] [GTID] [repl=ok,lag=1,auto_pos]
      ├── 10.10.10.201 [SR] [8.0.43-34] [GTID] [repl=ok,lag=0,auto_pos]
      ├── 10.10.10.206 [SR] [8.0.42-33] [GTID] [repl=ok,lag=0,auto_pos]
      └── 10.10.10.202 [SR] [8.0.33-25] [GTID] [repl=ok,lag=0,auto_pos]
```

Orchestrator topology

```
$ orchestrator-client -c topology -i 10.10.10.200
AWS RDS instance:3306 [unknown,invalid,Unknown,rw,nobinlog]
+ 10.10.10.200:3306                                                   [0s,ok,8.0.33-25,ro,ROW,>>,GTID]
  + 10.10.10.201:3306                                                 [0s,ok,8.0.43-34,ro,ROW,>>,GTID:errant]
  + 10.10.10.202:3306                                                 [0s,ok,8.0.33-25,ro,ROW,>>,GTID:errant]
  + 10.10.10.206:3306                                                 [0s,ok,8.0.42-33,ro,ROW,>>,GTID:errant]
```


Evidence of Metadata Racing

When querying Orchestrator via CLI, the errant range "shifts" rapidly as metadata is refreshed:

```
$ orchestrator -c which-gtid-errant -i 10.10.10.201:3306
623335cf-c858-3eb6-a90d-40122a2d21f4:4335748-4335796

# Seconds later...
$ orchestrator -c which-gtid-errant -i 10.10.10.201:3306
623335cf-c858-3eb6-a90d-40122a2d21f4:4335820-4335836
```
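
One way to read the two samples above: a genuinely errant transaction should keep showing up in every sample, whereas transactions that are merely in flight should drop out once the upstream catches up. The sketch below intersects two `which-gtid-errant` outputs to keep only what persists. The function names are hypothetical, and this is a rough check we could run ourselves, not something Orchestrator does.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive transaction-number range


def parse_errant(text: str) -> Dict[str, List[Interval]]:
    """Parse GTID-set text such as 'uuid:10-20:25' into sorted intervals per UUID."""
    parsed: Dict[str, List[Interval]] = {}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        for r in ranges:
            first, _, last = r.partition("-")
            parsed.setdefault(uuid, []).append((int(first), int(last or first)))
    for intervals in parsed.values():
        intervals.sort()
    return parsed


def persistent_errant(earlier: str, later: str) -> Dict[str, List[Interval]]:
    """Return only the GTIDs reported as errant in both samples."""
    a, b = parse_errant(earlier), parse_errant(later)
    result: Dict[str, List[Interval]] = {}
    for uuid in a.keys() & b.keys():
        xs, ys, i, j = a[uuid], b[uuid], 0, 0
        overlap: List[Interval] = []
        while i < len(xs) and j < len(ys):
            lo, hi = max(xs[i][0], ys[j][0]), min(xs[i][1], ys[j][1])
            if lo <= hi:
                overlap.append((lo, hi))
            # Advance whichever interval ends first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
        if overlap:
            result[uuid] = overlap
    return result


first_sample = "623335cf-c858-3eb6-a90d-40122a2d21f4:4335748-4335796"
second_sample = "623335cf-c858-3eb6-a90d-40122a2d21f4:4335820-4335836"
print(persistent_errant(first_sample, second_sample))  # empty: nothing persisted
```

For the two samples captured on 10.10.10.201 the intersection is empty, which is consistent with a sliding window of in-flight transactions rather than a fixed set of local writes.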

Verification of Source Origin

We performed a mysqlbinlog dump on the leaf replicas for the reported "errant" range. The output confirms that the transactions originated from Server ID 1483228424 (the RDS Source), not from local writes on the replicas:

```
# Example from 10.10.10.201
#260224 12:09:48 server id 1483228424  end_log_pos 994222673 CRC32 0x4bf21cdf Query thread_id=9843324 exec_time=0 error_code=0
SET @@session.pseudo_thread_id=9843324/*!*/;
### UPDATE `db`.`run` ...
```

All replicas show the exact same thread_id and server id. Within minutes, the Source Executed_Gtid_Set catches up, and the "errant" status clears on its own without intervention.
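
The manual check above can be scripted. The following is a minimal sketch, assuming event header lines in the format shown in our mysqlbinlog output; the regular expression and the function names are our own and may need adjusting for other mysqlbinlog options.

```python
import re
from typing import Iterable, Set

# Matches event headers such as:
# "#260224 12:09:48 server id 1483228424  end_log_pos 994222673 CRC32 ..."
EVENT_HEADER = re.compile(
    r"^#\d{6}\s+\d{1,2}:\d{2}:\d{2}\s+server id\s+(\d+)\s+end_log_pos\s+\d+"
)


def origin_server_ids(lines: Iterable[str]) -> Set[int]:
    """Collect every server id found in mysqlbinlog event header lines."""
    ids: Set[int] = set()
    for line in lines:
        match = EVENT_HEADER.match(line)
        if match:
            ids.add(int(match.group(1)))
    return ids


def unexpected_origins(lines: Iterable[str], source_server_id: int) -> Set[int]:
    """Return server ids other than the expected source (empty means all clean)."""
    return origin_server_ids(lines) - {source_server_id}


SAMPLE = [
    "# Example from 10.10.10.201",
    "#260224 12:09:48 server id 1483228424  end_log_pos 994222673 CRC32 0x4bf21cdf "
    "Query thread_id=9843324 exec_time=0 error_code=0",
    "SET @@session.pseudo_thread_id=9843324/*!*/;",
    "### UPDATE `db`.`run` ...",
]

print(unexpected_origins(SAMPLE, source_server_id=1483228424))  # empty set
```

An empty result means every event in the dumped range carries the Source's server id, which is what we saw on all three leaf replicas.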

The Problem

Because Orchestrator flags these transient states as GTID:errant, it triggers downstream monitoring alerts. In a cascading topology with high-concurrency writes, the "racing" metadata leads to constant flapping alerts for a cluster that is actually in perfect health.

Attempts to Resolve

- Restarted Orchestrator container (version 3.2.6-17) - Issue persists.
- Verified super_read_only=1 on all replicas - No local writes are occurring.

Expected Behavior

Orchestrator should perhaps implement a grace period or verify the server_id of the "errant" transactions before flagging them as a divergence, especially in cascading setups where metadata propagation might not be atomic across all layers.
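
As a rough idea of what a grace period could look like, here is a small Python sketch of the logic. It is written from the point of view of an external alerting script, since that is where we would apply it as a workaround; the class name, the 60-second value, and the instance keys are all hypothetical and not part of Orchestrator.

```python
from typing import Dict


class ErrantGracePeriod:
    """Suppress errant-GTID alerts until the flag has persisted for a while."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        self._first_seen: Dict[str, float] = {}  # instance -> time flag first seen

    def should_alert(self, instance: str, is_errant: bool, now: float) -> bool:
        """Record one poll result; return True once the grace period has elapsed."""
        if not is_errant:
            # Flag cleared on its own: forget the instance and start over.
            self._first_seen.pop(instance, None)
            return False
        first = self._first_seen.setdefault(instance, now)
        return now - first >= self.grace_seconds


# Example with explicit timestamps (seconds), polling every 10s or so.
grace = ErrantGracePeriod(grace_seconds=60)
print(grace.should_alert("10.10.10.201:3306", True, now=0))    # False: just appeared
print(grace.should_alert("10.10.10.201:3306", False, now=40))  # False: cleared
print(grace.should_alert("10.10.10.201:3306", True, now=50))   # False: timer restarted
print(grace.should_alert("10.10.10.201:3306", True, now=110))  # True: persisted 60s
```

One limitation: this keys on the flag alone. Under sustained writes the flag may never clear even though the ranges keep shifting, so a variant that tracks the errant ranges themselves would likely be more robust.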


Orchestrator configuration:

```
{
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "AutoPseudoGTID": false,
  "BackendDB": "sqlite",
  "CandidateInstanceExpireMinutes": 60,
  "Debug": true,
  "DetachLostSlavesAfterMasterFailover": false,
  "DiscoverByShowSlaveHosts": true,
  "ExpiryHostnameResolvesMinutes": 60,
  "HostnameResolveMethod": "none",
  "InstancePollSeconds": 10,
  "MasterFailoverDetachSlaveMasterHost": true,
  "MySQLHostnameResolveMethod": "@@report_host",
  "MySQLTopologySSLSkipVerify": true,
  "ReadOnly": false,
  "ReasonableMaintenanceReplicationLagSeconds": 120,
  "RemoveTextFromHostnameDisplay": ".mydomain.com:3306",
  "SlaveLagQuery": "",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.sqlite3",
  "UnseenInstanceForgetHours": 24,
  "UseSSL": false,
  "UseSuperReadOnly": true,
  "ListenAddress": ":3000",
  "DefaultInstancePort": 3306,
  "AuditToSyslog": false,
  "AuthenticationMethod": "multi",
  "HTTPAuthUser": "***",
  "HTTPAuthPassword": "***",
  "MySQLTopologyUser": "***",
  "MySQLTopologyPassword": "***",
  "PowerAuthUsers": ["***"],
  "URLPrefix": "",
  "StatusEndpoint": "api/status",
  "ClusterNameToAlias": {
    "V-L-DB-Prd-2:3306": "prod_cluster",
    "10.10.10.201:3306": "prod_cluster",
    "V-L-DB-Prd-1:3306": "prod_cluster",
    "10.10.10.200:3306": "prod_cluster"
  },
  "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}'"
  ],
  "PreGracefulTakeoverProcesses": [
    "echo 'Planned takeover about to take place on {failureCluster}. Source will switch to read_only'"
  ],
  "PreFailoverProcesses": [
    "echo 'Will recover from {failureType} on {failureCluster}'"
  ],
  "PostFailoverProcesses": [
    "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >"
  ],
  "PostUnsuccessfulFailoverProcesses": [
    ""
  ],
  "PostMasterFailoverProcesses": [
    "/opt/orchestrator/bin/orchestrator_hook_sql_commands.py --failed-host={failedHost} --failed-port={failedPort} --successor-host={successorHost} --successor-port={successorPort} --config=/opt/orchestrator/etc/orchestrator_hook_sql_commands.yaml --defaults-file=/opt/orchestrator/etc/.my.cnf"
  ],
  "PostIntermediateMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}'"
  ],
  "PostGracefulTakeoverProcesses": [
    "echo 'Planned takeover complete'"
  ]
}
```

Orchestrator version:

```
$ podman ps | grep orchestrator
2ada75af3757  localhost/orchestrator:3.2.6-17                  /usr/local/orches...  20 hours ago  Up 20 hours ago (starting)                          orchestrator
```
