Skip to content

Conversation

LoisSotoLopez
Copy link
Contributor

@LoisSotoLopez LoisSotoLopez commented Jul 16, 2025

Proposed Changes

This PR implements the suggested solution for the issue described in discussion #13131

Currently, when a QQ is deleted and re-declared while one of its nodes is dead, this dead node won't be able to reconcile with the new queue.

In this PR we add the list of ra UIds for the cluster to each node queue record, so that when a Rabbit node recovers a queue it will be able to detect the situation described above and properly reconcile .

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

@kjnilsson
Copy link
Contributor

thanks @LoisSotoLopez - this looks like it will do what we discussed a while back. I feel unsure about adding another key to the queue type state for this, mainly becuause we'd have to keep uids and nodes in sync. It would be nicer if nodes turned from a list into a map although even this is a bit controversial and could become a source of bugs. Let me consider it for a day or two.

Copy link
Contributor

@kjnilsson kjnilsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like having two keys with similar information (nodes) that will need to be kept in sync. I think we need to move the current nodes list value to the the #{node() => uid()} map format and handle the two formats in the relevant places, mostly in rabbit_queue_type. we do need a list of the member nodes in a few places but we could add a convenience functio: rabbit_queue_type:nodes/1 that takes a queue record and returns a list of member nodes. internally it could just call rabbit_queue_type:info(Q, [members]) and extract the result from that then update all places where we explicity use get_type_state to extract the member nodes.

In addition I think we need to put the use of nodes as a map behind a feature flag to avoid new queue records with nodes map values being created in a mixed versions cluster.

@LoisSotoLopez
Copy link
Contributor Author

@kjnilsson Thanks for the suggestions. Just wanted to let you know we are working on this. Had some incident we had to take care of this week but I'll be pushing this PR forward next week.

@michaelklishin
Copy link
Collaborator

@LoisSotoLopez we have to ask your employer to sign the Broadcom CLA before we can accept this contribution (or its future finished version).

It is about one page long, nothing particularly unusual or onerous.

@LoisSotoLopez
Copy link
Contributor Author

LoisSotoLopez commented Aug 1, 2025

That commit below is just for showing current progress. Have been struggling to understand why a few of the remaining tests fail.

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch from 1074165 to 35ef780 Compare August 1, 2025 09:50
@michaelklishin
Copy link
Collaborator

@LoisSotoLopez sorry, any updates or feedback on the new CLA? We cannot accept any contributions without a signed CLA at the moment, and this change won't qualify for a "trivial" one, even if those get exempt in the future.

We are trying to make it digital one way or another but there's a risk that the process will stay what it currently is (just a document to sign and email).

@LoisSotoLopez
Copy link
Contributor Author

LoisSotoLopez commented Aug 6, 2025

@LoisSotoLopez sorry, any updates or feedback on the new CLA? We cannot accept any contributions without a signed CLA at the moment, and this change won't qualify for a "trivial" one, even if those get exempt in the future.

We are trying to make it digital one way or another but there's a risk that the process will stay what it currently is (just a document to sign and email).

Yes, sorry about not having this already addressed. We are on summer vacations right now so the people who will be signing it for the whole company won't be able to do it until next week. I would do it myself but can't due to a intellectual property clause on my contract that wasn't there the last time I signed the CLA.

Will try to get it signed asap.

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch from f48962d to 5a5877a Compare August 13, 2025 12:13
@LoisSotoLopez
Copy link
Contributor Author

Hi! The CLA should be signed now. This should be ready to be reviewed again.

@michaelklishin
Copy link
Collaborator

@LoisSotoLopez I'm afraid I do not see any inbound emails with the signed document. Perhaps it was not sent to teamrabbitmq <@> gmail, as the CLA repo suggests but rather straight to opensource <@> broadcom.

@LoisSotoLopez
Copy link
Contributor Author

@LoisSotoLopez I'm afraid I do not see any inbound emails with the signed document. Perhaps it was not sent to teamrabbitmq <@> gmail, as the CLA repo suggests but rather straight to opensource <@> broadcom.

Should be there now. 🫡

@michaelklishin
Copy link
Collaborator

@LoisSotoLopez thank you, we should be all set in terms of the CLA.

I'll defer merging to @kjnilsson. That can take another week or two.

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch from 95cabda to 3293a24 Compare September 29, 2025 13:06
@LoisSotoLopez
Copy link
Contributor Author

Just rebased and squashed this PR. Let me know if there's anything new I can do to move it forward.

@michaelklishin
Copy link
Collaborator

@kjnilsson

@michaelklishin michaelklishin added this to the 4.3.0 milestone Sep 29, 2025
@michaelklishin
Copy link
Collaborator

@LoisSotoLopez given that this changes how member nodes are represented in the queue state, it is too late for us to get it into 4.2.0 and the earliest this PR can be considered is for 4.3.0, the work on which will begin in earnest in November.

@gomoripeti
Copy link
Contributor

It is responsible to not rush something released that changes the queue record.

Because 4.3 is 6 months away I wonder if it would have any benefit to still include that part of the PR in 4.2 which moves get_name from amqqueue to rabbit_amqqueue module. That change is just a refactoring but touches multiple modules while the problematic part is restricted to rabbit_quorum_queue. I'm suggesting this because maybe it avoids some conflicts and makes it easier to backport other changes to 4.2 when the whole of this PR is merged into main but not into 4.2.x. Or if the whole of this PR is not merged for a long time it would make it easier to rebase it to a future main.
Just an idea, what do you think?

Copy link
Contributor

@kjnilsson kjnilsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more changes, this feature is a bit trickier than I had initially expected. I am happy take over if you feel like we're going back and forth too much. Just let me know.

@michaelklishin
Copy link
Collaborator

michaelklishin commented Oct 1, 2025

Test failures are not flakes. They key part is:

{badmatch,[]}

[{rabbit_queue_type,get_nodes,1,[{file,\"rabbit_queue_type.erl\"}, {line,428}]},

{rabbit_amqp_management,queue_topology,1, [{file,\"rabbit_amqp_management.erl\"},{line,474}]},\n 

{rabbit_amqp_management,encode_queue,3, [{file,\"rabbit_amqp_management.erl\"},{line,438}]}

@LoisSotoLopez
Copy link
Contributor Author

There's some test that passes locally but fails in CI. Expect a few tiny commits while I figure out what's going on.

I thought that asserting the right membership status on all nodes would
guarantee that the other status fields would be the expected ones but it
seems like that assumption was wrong.
@LoisSotoLopez
Copy link
Contributor Author

Just one tiny commit was needed :) I think it's now ready for review again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants