False Positive "Errant GTID" reported on cascading replicas #88

@ivertonschmidt-oss

Orchestrator Version: 3.2.6-17
MySQL Version: 8.0.43 (RDS Source); 8.0.33-25 (Intermediate Source); 8.0.33-25, 8.0.42-33, and 8.0.43-34 (Leaf Replicas)

We are observing a recurring issue where Orchestrator reports GTID:errant on cascading replicas, even though the transactions in question originated from the top-level Source and are simply in the process of propagating through the topology.

The "errant" status appears to be a false positive caused by a polling race: Orchestrator collects Executed_Gtid_Set from the replicas more recently than from their source (the Source or the Intermediate Source), so a replica briefly appears to hold transactions that its source has not yet reported.

Replication topology

    Source: AWS RDS instance
    Intermediate Source: 10.10.10.200
    Leaf Replicas: 10.10.10.201, 10.10.10.202, 10.10.10.206

AWS RDS instance [RW] [8.0.43] [GTID]
 └── 10.10.10.200 [SR] [8.0.33-25] [GTID] [repl=ok,lag=1,auto_pos]
      └── 10.10.10.201 [SR] [8.0.43-34] [GTID] [repl=ok,lag=0,auto_pos]
      └── 10.10.10.206 [SR] [8.0.42-33] [GTID] [repl=ok,lag=0,auto_pos]
      └── 10.10.10.202 [SR] [8.0.33-25] [GTID] [repl=ok,lag=0,auto_pos]

Orchestrator topology

$ orchestrator-client -c topology -i 10.10.10.200
AWS RDS instance:3306 [unknown,invalid,Unknown,rw,nobinlog]
+ 10.10.10.200:3306                                                   [0s,ok,8.0.33-25,ro,ROW,>>,GTID]
  + 10.10.10.201:3306                                                 [0s,ok,8.0.43-34,ro,ROW,>>,GTID:errant]
  + 10.10.10.202:3306                                                 [0s,ok,8.0.33-25,ro,ROW,>>,GTID:errant]
  + 10.10.10.206:3306                                                 [0s,ok,8.0.42-33,ro,ROW,>>,GTID:errant]

Evidence of Metadata Racing

When querying Orchestrator via CLI, the errant range "shifts" rapidly as metadata is refreshed:

$ orchestrator -c which-gtid-errant -i 10.10.10.201:3306
623335cf-c858-3eb6-a90d-40122a2d21f4:4335748-4335796

# Seconds later...
$ orchestrator -c which-gtid-errant -i 10.10.10.201:3306
623335cf-c858-3eb6-a90d-40122a2d21f4:4335820-4335836
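
To make the suspected race concrete, here is a minimal sketch of an errant-GTID check written as a set difference (replica executed set minus source executed set), in the spirit of MySQL's GTID_SUBTRACT(). It is not Orchestrator's code. The helper names and the snapshot values are hypothetical, chosen only so that the result matches the first range reported above.

```python
# Hypothetical illustration of an errant-GTID check as interval subtraction.
# Not Orchestrator's implementation; snapshot values below are assumed.

def parse_intervals(spec):
    """Parse '1-5:8:10-12' (the part after the UUID) into sorted inclusive tuples."""
    intervals = []
    for part in spec.split(":"):
        start, _, end = part.partition("-")
        intervals.append((int(start), int(end or start)))
    return sorted(intervals)


def subtract(replica, source):
    """Return the intervals of `replica` not covered by `source` (both sorted)."""
    result = []
    for start, end in replica:
        cursor = start
        for s_start, s_end in source:
            if s_end < cursor or s_start > end:
                continue  # no overlap with the remaining part of this interval
            if s_start > cursor:
                result.append((cursor, s_start - 1))
            cursor = max(cursor, s_end + 1)
        if cursor <= end:
            result.append((cursor, end))
    return result


# Replica polled just now; source snapshot taken slightly earlier (assumed values).
replica_now = parse_intervals("1-4335796")
source_stale = parse_intervals("1-4335747")
source_fresh = parse_intervals("1-4335796")

print(subtract(replica_now, source_stale))  # transactions that look "errant"
print(subtract(replica_now, source_fresh))  # empty once the source view catches up
```

With a stale source snapshot, the difference is exactly the set of transactions still in flight, which would explain why the reported range keeps moving forward between polls.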

Verification of Source Origin

We performed a mysqlbinlog dump on the leaf replicas for the reported "errant" range. The output confirms that the transactions originated from Server ID 1483228424 (the RDS Source), not from local writes on the replicas:

Example from 10.10.10.201

#260224 12:09:48 server id 1483228424 end_log_pos 994222673 CRC32 0x4bf21cdf Query thread_id=9843324 exec_time=0 error_code=0
SET @@session.pseudo_thread_id=9843324/*!*/;

UPDATE db.run ...

All replicas show the exact same thread_id and server id. Within minutes, the Source Executed_Gtid_Set catches up, and the "errant" status clears on its own without intervention.
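
The server-id check above can be repeated across replicas with a small script. This is a rough sketch that assumes the decoded mysqlbinlog output is already available as text; the function name `count_server_ids` is hypothetical, and the sample line is the event header quoted above.

```python
import re
from collections import Counter

# Matches the "server id N" field in a mysqlbinlog event header line.
SERVER_ID_RE = re.compile(r"\bserver id (\d+)\b")


def count_server_ids(binlog_text):
    """Count events per originating server id in decoded mysqlbinlog output."""
    counts = Counter()
    for line in binlog_text.splitlines():
        if not line.startswith("#"):
            continue  # only event header lines carry the server id
        match = SERVER_ID_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


sample = (
    "#260224 12:09:48 server id 1483228424 end_log_pos 994222673 "
    "CRC32 0x4bf21cdf Query thread_id=9843324 exec_time=0 error_code=0"
)
print(count_server_ids(sample))
```

If every event in the "errant" range maps to the Source's server id, the transactions did not originate on the replica.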

The Problem

Because Orchestrator flags these transient states as GTID:errant, it triggers downstream monitoring alerts. In a cascading topology with high-concurrency writes, the "racing" metadata leads to constant flapping alerts for a cluster that is actually in perfect health.

Attempts to Resolve

Restarted Orchestrator container (version 3.2.6-17) - Issue persists.

Verified super_read_only=1 on all replicas - No local writes are occurring.

Expected Behavior

Orchestrator should perhaps implement a grace period or verify the server_id of the "errant" transactions before flagging them as a divergence, especially in cascading setups where metadata propagation might not be atomic across all layers.
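
As a rough sketch of the grace-period idea, the logic below only treats an errant interval as confirmed once the same interval has been reported continuously for longer than a configurable window. Everything here is hypothetical (the class name `ErrantDebouncer`, the `grace_seconds` parameter, the tuple shape); it is not an existing Orchestrator feature or setting. It could equally be applied on the monitoring side as a stopgap.

```python
import time


class ErrantDebouncer:
    """Confirm an errant interval only after it persists for `grace_seconds`."""

    def __init__(self, grace_seconds, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock
        self.first_seen = {}  # (instance, interval) -> time first observed

    def confirmed(self, instance, errant_intervals):
        """Record the latest poll for `instance`; return intervals past the grace period."""
        now = self.clock()
        current = {(instance, interval) for interval in errant_intervals}
        # Forget intervals for this instance that disappeared since the last poll.
        stale = [k for k in self.first_seen if k[0] == instance and k not in current]
        for key in stale:
            del self.first_seen[key]
        for key in current:
            self.first_seen.setdefault(key, now)
        return sorted(
            interval
            for (inst, interval) in current
            if now - self.first_seen[(inst, interval)] >= self.grace_seconds
        )


# Intervals are (source_uuid, first_txn, last_txn) tuples taken from each poll.
debouncer = ErrantDebouncer(grace_seconds=30)
uuid = "623335cf-c858-3eb6-a90d-40122a2d21f4"
print(debouncer.confirmed("10.10.10.201:3306", [(uuid, 4335748, 4335796)]))
```

A shifting range like the one shown earlier never stays identical across polls, so it would not be confirmed, while a genuinely errant transaction keeps the same interval and would be flagged after the window elapses. One limitation of this sketch is that a real errant transaction adjacent to an in-flight range would be merged into a moving interval and could be missed.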

Orchestrator configuration:

{
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "AutoPseudoGTID": false,
  "BackendDB": "sqlite",
  "CandidateInstanceExpireMinutes": 60,
  "Debug": true,
  "DetachLostSlavesAfterMasterFailover": false,
  "DiscoverByShowSlaveHosts": true,
  "ExpiryHostnameResolvesMinutes": 60,
  "HostnameResolveMethod": "none",
  "InstancePollSeconds": 10,
  "MasterFailoverDetachSlaveMasterHost": true,
  "MySQLHostnameResolveMethod": "@@report_host",
  "MySQLTopologySSLSkipVerify": true,
  "ReadOnly": false,
  "ReasonableMaintenanceReplicationLagSeconds": 120,
  "RemoveTextFromHostnameDisplay": ".mydomain.com:3306",
  "SlaveLagQuery": "",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.sqlite3",
  "UnseenInstanceForgetHours": 24,
  "UseSSL": false,
  "UseSuperReadOnly": true,
  "ListenAddress": ":3000",
  "DefaultInstancePort": 3306,
  "AuditToSyslog": false,
  "AuthenticationMethod": "multi",
  "HTTPAuthUser": "***",
  "HTTPAuthPassword": "***",
  "MySQLTopologyUser": "***",
  "MySQLTopologyPassword": "***",
  "PowerAuthUsers": ["***"],
  "URLPrefix": "",
  "StatusEndpoint": "api/status",
  "ClusterNameToAlias": {
       "V-L-DB-Prd-2:3306": "prod_cluster",
    "10.10.10.201:3306": "prod_cluster",
    "V-L-DB-Prd-1:3306": "prod_cluster",
    "10.10.10.200:3306": "prod_cluster"
  },
    "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}'"
  ],
  "PreGracefulTakeoverProcesses": [
    "echo 'Planned takeover about to take place on {failureCluster}. Source will switch to read_only'"
  ],
  "PreFailoverProcesses": [
    "echo 'Will recover from {failureType} on {failureCluster}'"
  ],
  "PostFailoverProcesses": [
    "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >"
  ],
  "PostUnsuccessfulFailoverProcesses": [
    ""
  ],
  "PostMasterFailoverProcesses": [
    "/opt/orchestrator/bin/orchestrator_hook_sql_commands.py --failed-host={failedHost} --failed-port={failedPort} --successor-host={successorHost} --successor-port={successorPort} --config=/opt/orchestrator/etc/orchestrator_hook_sql_commands.yaml --defaults-file=/opt/orchestrator/etc/.my.cnf"
  ],
  "PostIntermediateMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}'"
  ],
  "PostGracefulTakeoverProcesses": [
    "echo 'Planned takeover complete'"
  ]
}

Orchestrator version:

$ podman ps | grep orchestrator
2ada75af3757  localhost/orchestrator:3.2.6-17                  /usr/local/orches...  20 hours ago  Up 20 hours ago (starting)                          orchestrator
