Reconcile QQ node dead during delete and redeclare #14241
Conversation
thanks @LoisSotoLopez - this looks like it will do what we discussed a while back. I feel unsure about adding another key to the queue type state for this, mainly because we'd have to keep …
Force-pushed from dac1a44 to 72a48e9
I don't like having two keys with similar information (nodes) that will need to be kept in sync. I think we need to move the current nodes list value to the #{node() => uid()} map format and handle the two formats in the relevant places, mostly in rabbit_queue_type. We do need a list of the member nodes in a few places, but we could add a convenience function, rabbit_queue_type:nodes/1, that takes a queue record and returns a list of member nodes. Internally it could just call rabbit_queue_type:info(Q, [members]) and extract the result from that, then update all the places where we explicitly use get_type_state to extract the member nodes.
In addition, I think we need to put the use of nodes as a map behind a feature flag, to avoid new queue records with nodes map values being created in a mixed-versions cluster.
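A minimal sketch of what the suggested convenience function and format handling could look like, assuming (as the comment above proposes) that rabbit_queue_type:info(Q, [members]) returns a proplist with a `members` key; the module name and the member_nodes/1 helper are illustrative, not the PR's actual code:

```erlang
%% Sketch only, not the PR's final code.
-module(rabbit_queue_type_nodes_sketch).
-export([nodes/1, member_nodes/1]).
-compile({no_auto_import, [nodes/1]}).  %% avoid clashing with the nodes/1 BIF

%% The suggested convenience function: take a queue record and return
%% the member nodes by delegating to rabbit_queue_type:info/2.
-spec nodes(amqqueue:amqqueue()) -> [node()].
nodes(Q) ->
    Info = rabbit_queue_type:info(Q, [members]),
    proplists:get_value(members, Info, []).

%% Hypothetical helper for the transition period, accepting both the
%% legacy list format and the new #{node() => uid()} map format.
member_nodes(Nodes) when is_list(Nodes) -> Nodes;
member_nodes(Nodes) when is_map(Nodes)  -> maps:keys(Nodes).
```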
@kjnilsson Thanks for the suggestions. Just wanted to let you know we are working on this. We had an incident to take care of this week, but I'll be pushing this PR forward next week.
@LoisSotoLopez we have to ask your employer to sign the Broadcom CLA before we can accept this contribution (or its future finished version). It is about one page long, nothing particularly unusual or onerous.
The commit below is just to show current progress. I have been struggling to understand why a few of the remaining tests fail.
Force-pushed from 1074165 to 35ef780
@LoisSotoLopez sorry, any updates or feedback on the new CLA? We cannot accept any contributions without a signed CLA at the moment, and this change won't qualify for a "trivial" one, even if those get exempted in the future. We are trying to make the process digital one way or another, but there's a risk it will stay what it currently is (just a document to sign and email).
Yes, sorry about not having this addressed already. We are on summer vacation right now, so the people who will be signing it for the whole company won't be able to do so until next week. I would do it myself but can't, due to an intellectual property clause in my contract that wasn't there the last time I signed the CLA. Will try to get it signed ASAP.
Force-pushed from f48962d to 5a5877a
Hi! The CLA should be signed now. This should be ready to be reviewed again.
@LoisSotoLopez I'm afraid I do not see any inbound emails with the signed document. Perhaps it was not sent to …
Should be there now. 🫡
@LoisSotoLopez thank you, we should be all set in terms of the CLA. I'll defer merging to @kjnilsson. That can take another week or two.
Force-pushed from 694085c to 95cabda
Co-authored-by: Péter Gömöri <[email protected]>
Force-pushed from 95cabda to 3293a24
Just rebased and squashed this PR. Let me know if there's anything else I can do to move it forward.
@LoisSotoLopez given that this changes how member nodes are represented in the queue state, it is too late for us to get it into …
It is responsible not to rush the release of something that changes the queue record. Since 4.3 is six months away, I wonder whether there would be any benefit in still including in 4.2 the part of the PR that moves …
A few more changes; this feature is a bit trickier than I had initially expected. I am happy to take over if you feel like we're going back and forth too much. Just let me know.
Test failures are not flakes. The key part is: …
There's a test that passes locally but fails in CI. Expect a few tiny commits while I figure out what's going on.
I thought that asserting the right membership status on all nodes would guarantee that the other status fields would have the expected values, but it seems that assumption was wrong.
Just one tiny commit was needed :) I think it's now ready for review again.
Proposed Changes
This PR implements the solution suggested for the issue described in discussion #13131.
Currently, when a QQ is deleted and re-declared while one of its nodes is dead, the dead node won't be able to reconcile with the new queue.
In this PR we add the Ra UIds for the cluster members to each queue record, so that when a RabbitMQ node recovers a queue it can detect the situation described above and reconcile properly.
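Illustratively, the recovery-time check this enables could look like the sketch below; queue_member_uids/1 and local_uid/1 are hypothetical helpers, not the PR's actual functions (the first would read the per-node UId map from the queue record, the second the UId this node still has on disk for the queue's Ra server):

```erlang
%% Sketch only: compare the UId recorded in the (possibly re-declared)
%% queue record against the UId stored locally. A mismatch means the
%% queue was deleted and re-declared while this node was down, so the
%% stale local member must be cleaned up instead of recovered as-is.
recover_member(Q) ->
    ExpectedUId = maps:get(node(), queue_member_uids(Q)),
    case local_uid(Q) of
        ExpectedUId -> recover;          %% same UId: normal recovery
        _StaleUId   -> reset_and_rejoin  %% deleted and re-declared while down
    end.
```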
Types of Changes
Checklist
CONTRIBUTING.md document
Further Comments