Initial multicast support #847
base: master
Conversation
Also pushes on the requisite extensions for us to fill in
This implements IPv4 and IPv6 multicast packet forwarding with three
replication modes (External, Underlay, All) for rack-wide multicast
delivery across VPCs.
Includes:
- M2P (Multicast-to-Physical) mappings with admin-scoped IPv6 underlay
- Per-port multicast group subscriptions for local delivery
- Multicast forwarding table with configurable replication strategies
- Geneve multicast option encoding for delivery mode signaling
- RX path loop prevention (packets marked Underlay skip re-relay)
- TX/RX path integration with flow table and encapsulation
- DTrace probes for multicast delivery observability
- API addition: set_mcast_fwd/clear_mcast_fwd for forwarding table management
- API addition: mcast_subscribe/mcast_unsubscribe for port group membership
- API addition: dump_mcast_fwd for observability
- Testing: XDE integration tests covering all replication modes, validation,
and edge cases
- Testing: oxide-vpc integration tests for Geneve encapsulation and parsing
- Enforce DEFAULT_MULTICAST_VNI (77) for all multicast traffic (groups
are fleet-side/cross-VPC) and validate admin-scoped underlay
addresses (ff04::/16, ff05::/16, ff08::/16).
Thanks for the work so far. I haven't looked at the new multicast integration tests yet, but I have a bunch of questions from what I've looked at thus far.
```
target
download
.DS_STORE
scripts
```
Is this a part of your local setup you're trying to keep out-of-tree?
I had an xde reset thing I was using. I'll remove this.
will remove. I had an xde-reset script in here for testing.
```rust
///
/// Returns VNI, outer IPv6 destination, and replication mode from Geneve
/// options.
pub fn parse_geneve_packet(bytes: &[u8]) -> Result<GeneveInfo, String> {
```
nit: it's probably fine to pull in anyhow as a dependency for opte_test_utils, and use that rather than String as the error type here.
```rust
let mgmt = xde.management_lock.lock();
let mcast_fwd = mgmt.mcast_fwd.read();
```
We really cannot afford to take the management lock during any dataplane operations. The most direct fix for something like this is to go through state.devs.read().
In principle, we need to have a copy of mcast_fwd added to PerEntryState (and some equivalent added to XdeDev), which is updated in the same places as we'd call refresh_maps today.
Some context: we used to have everything behind one RwLock on XdeState. Basically we're attempting to a) have as little contention as possible, even between accesses to the readers counter in each RwLock, and b) lock out access to the central source of truth in an upcall-safe way. Taking an exclusive lock at that level here really hurts.
```rust
// Determine replication strategy from XDE-wide multicast forwarding table
let xde = get_xde_state();
let mgmt = xde.management_lock.lock();
let mcast_fwd = mgmt.mcast_fwd.read();
```
Same comment as above on hitting the (exclusive!) management lock in the datapath.
```rust
if should_relay {
    // Re-acquire locks for underlay forwarding
    let xde = get_xde_state();
    let mgmt = xde.management_lock.lock();
    let mcast_fwd = mgmt.mcast_fwd.read();
```
I don't think that OPTE should ever be acting as a relay for multicast packets. Practically speaking we should always be the leaf node of the multicast replication tree, such that loop detection/prevention aren't on our list of problems to consider.
Is this solving a particular problem? Unconditionally bouncing the packet out of the same interface it arrived on seems very odd to me, particularly when we may have picked the source/dest MAC of the other underlay port and its attached switch.
Yes, I'll update this to work only for the leaf / last hop.
```rust
pub fn write(&self) -> KRwLockWriteGuard<'_, DevMap> {
    self.0.write()
}
```
Unused, but also this is very intentionally read-only to keep our lock invariants safe.
```rust
// Filter by VNI - only deliver to listeners in the same VNI
if el.vni() != ctx.vni {
    continue;
}
```
This would run counter to the global multicast delivery space thing just now, but I think this is one of the places where keying on the external v6 underlay dst would suit well for membership tracking.
Will make the Vni check only in overlay, and not do this in xde for sure.
```rust
let xde = get_xde_state();
let mgmt = xde.management_lock.lock();
let mcast_fwd = mgmt.mcast_fwd.read();
```
Since this is the tx path, if we have a per-CPU copy of mcast_fwd (as mentioned elsewhere) then we can avoid hitting management_lock, and can probably just hold the lock on said copy for the duration of this function.
```rust
// We found forwarding entries, replicate to each next hop
for (next_hop, replication) in next_hops.iter() {
    // Clone packet with headers using pullup
    let pullup_len = (ctx.encap_len as usize)
```
All of these offsets will be identical to the above cases, and they don't need to be recomputed in the for loop. I don't think it's unreasonable to just unconditionally compute pullup_len, geneve_offset and friends, considering they must be used at least once somewhere if we've found a valid M2P entry etc. and successfully processed the packet.
```rust
// Route lookup for next hop to get outer MAC addresses
// Use the actual_outer_dst we determined above
let route_key = RouteKey {
    dst: actual_outer_dst,
    l4_hash: Some(ctx.l4_hash),
};
let Route { src: mac_src, dst: mac_dst, underlay_idx } =
    src_dev.routes.next_hop(route_key, src_dev);

// Fill in outer MAC addresses
let final_pkt = unsafe {
    let mblk = fwd_pkt.unwrap_mblk().as_ptr();
    let rptr = (*mblk).b_rptr;
    ptr::copy(mac_dst.as_ptr(), rptr, 6);
    ptr::copy(mac_src.as_ptr(), rptr.add(6), 6);

    // Note: The outer IPv6 destination was already set correctly in fwd_pkt
    // based on the replication type, and we used the correct address for
    // route lookup, so no need to modify it here.

    MsgBlk::wrap_mblk(mblk).unwrap()
};
```
The only thing I think we need this routing step for is to determine which underlay port(s) we're sending out on based on the unicast next_hop, which gives us our src_mac. The dst_mac associated with that address should probably be disregarded though -- we're sending to a v6 multicast outer, so should have the derived v6 multicast MAC destination to match?
XXX currently sketching out the bare minimum requirements around frame delivery, insertion of Geneve options, etc.
TODO:
- Multicast forwarding table (`Ipv6Addr` -> `BTreeMap<(NextHopV6, Replication)>`).
- Plumb `Mcast2Phys` and the above table via ioctl.