Skip to content

Conversation

@FelixMcFelix
Copy link
Collaborator

@FelixMcFelix FelixMcFelix commented Sep 4, 2025

XXX currently sketching out the bare minimum requirements around frame delivery, insertion of Geneve options, etc.

TODO:

  • Add XDE-wide multicast forwarding table (Ipv6Addr -> BTreeMap<(NextHopV6, Replication)>).
  • Add exceptions in source/dest MAC/L3 addr checking for multicast addreses matching known groups. (#cidr-checking only mcast dst side)
  • Populate Mcast2Phys and above table via ioctl.
  • Some tests might be nice? :)

FelixMcFelix and others added 5 commits September 3, 2025 16:26
Also pushes on the requisite extensions for us to fill in
This implements IPv4 and IPv6 multicast packet forwarding with three
replication modes (External, Underlay, All) for rack-wide multicast
delivery across VPCs.

Includes:
  - M2P (Multicast-to-Physical) mappings with admin-scoped IPv6 underlay
  - Per-port multicast group subscriptions for local delivery
  - Multicast forwarding table with configurable replication strategies
  - Geneve multicast option encoding for delivery mode signaling
  - RX path loop prevention (packets marked Underlay skip re-relay)
  - TX/RX path integration with flow table and encapsulation
  - DTrace probes for multicast delivery observability
  - API addition: set_mcast_fwd/clear_mcast_fwd for forwarding table management
  - API addition: mcast_subscribe/mcast_unsubscribe for port group membership
  - API addition: dump_mcast_fwd for observability
  - Testing: XDE integration tests covering all replication modes, validation,
    and edge cases
  - Testing: oxide-vpc integration tests for Geneve encapsulation and parsing
  - Enforce DEFAULT_MULTICAST_VNI (77) for all multicast traffic (groups
    are fleet-side/cross-VPC) and validate admin-scoped underlay
    addresses (ff04::/16, ff05::/16, ff08::/16).
@zeeshanlakhani zeeshanlakhani self-assigned this Oct 20, 2025
Copy link
Collaborator Author

@FelixMcFelix FelixMcFelix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work so far. I haven't looked at the new multicast integration tests yet, but I have a bunch of questions from what I've looked at thus far.

target
download
.DS_STORE
scripts
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a part of your local setup you're trying to keep out-of-tree?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had an xde reset thing I was using. I'll remove this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove. I had an xde-reset script in here for testing.

///
/// Returns VNI, outer IPv6 destination, and replication mode from Geneve
/// options.
pub fn parse_geneve_packet(bytes: &[u8]) -> Result<GeneveInfo, String> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's probably fine to pull in anyhow as a dependency for opte_test_utils, and use that rather than String as the error type here.

Comment on lines +2110 to +2111
let mgmt = xde.management_lock.lock();
let mcast_fwd = mgmt.mcast_fwd.read();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really cannot afford to take the management lock during any dataplane operations. The most direct fix for something like this is to go through state.devs.read().

In principle, we need to have a copy of mcast_fwd added to PerEntryState (and some equivalent added to XdeDev), which is updated in the same places as we'd call refresh_maps today.

Some context: we used to have everything behind one RwLock on XdeState. Basically we're attempting to a) have as little contention as possible, even between accesses to the readers counter in each Rwlock and b) lockout access to the central source of truth in an upcall-safe way. Taking an exclusive lock at that level here really hurts.

Comment on lines +2394 to +2397
// Determine replication strategy from XDE-wide multicast forwarding table
let xde = get_xde_state();
let mgmt = xde.management_lock.lock();
let mcast_fwd = mgmt.mcast_fwd.read();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above on hitting the (exclusive!) management lock in the datapath.

Comment on lines +2548 to +2552
if should_relay {
// Re-acquire locks for underlay forwarding
let xde = get_xde_state();
let mgmt = xde.management_lock.lock();
let mcast_fwd = mgmt.mcast_fwd.read();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that OPTE should ever be acting as a relay for multicast packets. Practically speaking we should always be the leaf node of the multicast replication tree, such that loop detection/prevention aren't on our list of problems to consider.

Is this solving a particular problem? Unconditionally bouncing the packet out of the same interface it arrived on seems very odd to me, particularly when we may have picked the source/dest MAC of the other underlay port and its attached switch.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll update this to work only for the leaf / last hop.

Comment on lines +214 to +217

pub fn write(&self) -> KRwLockWriteGuard<'_, DevMap> {
self.0.write()
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused, but also this is very intentionally read-only to keep our lock invariants safe.

Comment on lines +2153 to +2156
// Filter by VNI - only deliver to listeners in the same VNI
if el.vni() != ctx.vni {
continue;
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would run counter to the global multicast delivery space thing just now, but I think this is one of the places where keying on the external v6 underlay dst would suit well for membership tracking.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make the Vni check only in overlay, and not do this in xde for sure.

Comment on lines +2223 to +2225
let xde = get_xde_state();
let mgmt = xde.management_lock.lock();
let mcast_fwd = mgmt.mcast_fwd.read();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is the tx path, if we have a per-CPU copy of mcast_fwd (as mentioned elsewhere) then we can avoid hitting management_lock, and can probably just hold the lock on said copy for the duration of this function.

// We found forwarding entries, replicate to each next hop
for (next_hop, replication) in next_hops.iter() {
// Clone packet with headers using pullup
let pullup_len = (ctx.encap_len as usize)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these offsets will be identical to the above cases, nor do they need to be recomputed in the for loop. I don't think it's unreasonable to just unconditionally compute pullup_len, geneve_offset and friends considering they must be used at least once somewhere if we've found a valid M2P entry etc and successfully processed the packet.

Comment on lines +2306 to +2327
// Route lookup for next hop to get outer MAC addresses
// Use the actual_outer_dst we determined above
let route_key = RouteKey {
dst: actual_outer_dst,
l4_hash: Some(ctx.l4_hash),
};
let Route { src: mac_src, dst: mac_dst, underlay_idx } =
src_dev.routes.next_hop(route_key, src_dev);

// Fill in outer MAC addresses
let final_pkt = unsafe {
let mblk = fwd_pkt.unwrap_mblk().as_ptr();
let rptr = (*mblk).b_rptr;
ptr::copy(mac_dst.as_ptr(), rptr, 6);
ptr::copy(mac_src.as_ptr(), rptr.add(6), 6);

// Note: The outer IPv6 destination was already set correctly in fwd_pkt
// based on the replication type, and we used the correct address for
// route lookup, so no need to modify it here.

MsgBlk::wrap_mblk(mblk).unwrap()
};
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I think we need this routing step for is to determine which underlay port(s) we're sending out on based on the unicast next_hop, which gives us our src_mac. The dst_mac associated with that address should probably be disregarded though -- we're sending to a v6 multicast outer, so should have the derived v6 multicast MAC destination to match?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants