Support Configurable Large Messages (> 1 MiB) via Zero-Copy Vectored I/O #198

Open
abhishek10004 wants to merge 4 commits into jacobsa:master from abhishek10004:abhishek/vectored_io

abhishek10004 commented Jun 22, 2026

Overview

Previously, this library had a hardcoded FUSE message buffer size of 1 MiB + pageSize (corresponding to the standard 1 MiB FUSE payload limit). Because of this hardcoded limit, large reads and writes (> 1 MiB) were not supported at all.

This PR adds support for large I/O operations by introducing a configurable MaxMessageSize in MountConfig, allowing daemons to read and write messages larger than 1 MiB.

To prevent the severe performance regressions, memory fragmentation, and garbage collection (GC) pressure that would arise from allocating giant contiguous buffers (e.g., 4 MiB, 8 MiB, or 16 MiB) on the heap for every request, this support is implemented via Vectored I/O:

  1. Large Reads (> 1 MiB) are read from the FUSE device directly into non-contiguous, block-pooled buffers (1 MiB blocks) via the readv system call. The filesystem can then write the read payload directly into these blocks via ReadFileOp.DstBufs (Zero-Copy Vectored Reads).
  2. Large Writes (> 1 MiB) bypass copying the incoming payload into a single contiguous slice, instead exposing the raw non-contiguous block slices directly to the filesystem via WriteFileOp.DataBlocks (Zero-Copy Vectored Writes).

Key Benefits

  • Support for Large I/O (> 1 MiB): Enables high-throughput FUSE operations by removing the hardcoded 1 MiB message size ceiling.
  • Zero-Copy & Low GC Pressure: Avoids massive heap allocations and contiguous memory copies for large transfers by leveraging a thread-safe, block-pooled allocator (BlockPool1M and BlockPool1MPlusPage) and the readv system call.
  • Backward Compatibility: Preserves contiguous buffers (Dst and Data) for reads/writes under 1 MiB or when vectored I/O is disabled, ensuring existing filesystem implementations continue to work out-of-the-box.

Commit-by-Commit Walkthrough

1. Enable Setting FUSE Buffer Sizes Dynamically

Commit: ddc386c

  • Purpose: Lays the foundation for configurable message sizes by allowing the buffer size to be set dynamically prior to mounting, rather than relying on a hardcoded global limit.
  • Key Changes:
    • Added MaxMessageSize uint32 to mount_config.go.
    • Dynamically calculates c.inMessageSize in connection.go from MaxMessageSize; when it is not set, the size defaults to the larger of MaxReadSize and MaxWriteSize plus one page.
    • Configures the FUSE protocol initialization (Init call) to announce this dynamically calculated size as MaxWrite and MaxPages to the FUSE kernel driver.
    • Refactored InMessage in internal/buffer/in_message.go to accept a dynamic allocation size instead of using a global static bufSize.
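
The size calculation described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in connection.go: the helper names inMessageSize and maxPages are hypothetical, the 1 MiB defaults are assumed from the overview, and treating a zero MaxMessageSize as "not configured" is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

const MiB = 1 << 20

// Stand-ins for the library's default payload limits (assumed to be 1 MiB).
const (
	maxReadSize  = MiB
	maxWriteSize = MiB
)

// inMessageSize (hypothetical helper) returns the buffer size for one
// incoming message. Zero is assumed to mean "MaxMessageSize not configured".
func inMessageSize(maxMessageSize uint32, pageSize int) int {
	if maxMessageSize != 0 {
		return int(maxMessageSize)
	}
	largest := maxReadSize
	if maxWriteSize > largest {
		largest = maxWriteSize
	}
	// One extra page leaves room for the request headers.
	return largest + pageSize
}

// maxPages (hypothetical helper) converts a payload limit into a page count,
// rounding up, as a value that could be announced during Init.
func maxPages(payload, pageSize int) int {
	return (payload + pageSize - 1) / pageSize
}

func main() {
	page := os.Getpagesize()
	size := inMessageSize(0, page)
	payload := size - page // mirrors maxPayload := inMessageSize - pageSize
	fmt.Println(size, payload, maxPages(payload, page))
}
```

Whether MaxMessageSize is meant to include the header page is not stated in this description, so the sketch simply uses the configured value as-is.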

2. Refactor Error Handling using standard errors.Is

Commit: c263028

  • Purpose: Cleans up error-handling logic when reading FUSE requests to make it more robust.
  • Key Changes:
    • Refactored Connection.readMessage() in connection.go to use standard errors.Is(err, syscall.ENODEV) and errors.Is(err, syscall.EINTR) checks instead of type-casting to *os.PathError and inspecting inner fields. This ensures compatibility in case errors are wrapped.

3. Implement Zero-Copy Vectored Reads Support

Commit: 72acb69

  • Purpose: Implements the infrastructure for reading FUSE requests from the device directly into non-contiguous blocks, and exposing these blocks to the filesystem via ReadFileOp.DstBufs to support large reads (> 1 MiB) without heap thrashing.
  • Key Changes:
    • Block-Pool Allocator:
      • Defined thread-safe pools BlockPool1M (1 MiB blocks) and BlockPool1MPlusPage (1 MiB + hardware page size) in internal/buffer/in_message.go to recycle buffers and avoid heap thrashing.
      • Refactored InMessage to allocate non-contiguous blocks (blocks [][]byte) rather than a single contiguous slice. Block 0 is always sized 1 MiB + pageSize (holding headers and small payloads), and additional 1 MiB blocks are allocated to satisfy larger message limits.
    • Zero-Copy Syscall (readv):
      • Added internal/buffer/readv.go which implements a wrapper around the SYS_READV syscall, converting block slices to unix.Iovec pointers to perform a single-system-call read into multiple non-contiguous memory segments.
      • Retained backward compatibility on macOS (FuseT) by implementing a contiguous fallback pool (fuseTContiguousPool).
    • Vectored Read API:
      • Added EnableVectoredReads to MountConfig.
      • Added DstBufs [][]byte to fuseops/ops.go. When enabled and the read size is larger than block 0, Dst is set to nil and DstBufs is populated with the block-sliced buffers, allowing the filesystem to write read payloads directly into the FUSE message blocks.
    • Testing:
      • Added over 760 lines of comprehensive unit tests in internal/buffer/in_message_test.go covering block allocations, boundary-spanning data consumption, vector slicing, and pool returns.
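
The block layout described above (a first block of 1 MiB + pageSize followed by 1 MiB blocks) can be sketched as a size calculation. The function blockSizes is hypothetical, and trimming the final block to the remaining byte count is an assumption about how the allocator treats a partial last block.

```go
package main

import "fmt"

const MiB = 1 << 20

// blockSizes (hypothetical helper) returns the length of each block needed to
// hold a message of up to total bytes. Block 0 is always 1 MiB + pageSize;
// later blocks are 1 MiB, with the last one trimmed to what is left.
func blockSizes(total, pageSize int) []int {
	first := MiB + pageSize
	sizes := []int{first}
	for remaining := total - first; remaining > 0; remaining -= MiB {
		n := MiB
		if remaining < n {
			n = remaining
		}
		sizes = append(sizes, n)
	}
	return sizes
}

func main() {
	// A 4 MiB payload plus one 4 KiB page for headers.
	fmt.Println(blockSizes(4*MiB+4096, 4096)) // [1052672 1048576 1048576 1048576]
}
```

Because block 0 alone covers the previous 1 MiB + pageSize limit, messages at or below that size never touch the additional blocks.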

4. Implement Zero-Copy Vectored Writes Support & MemFS Optimization

Commit: 8005ed3

  • Purpose: Adds zero-copy support for incoming large FUSE write requests, bypassing contiguous payload reconstruction, and optimizes the memfs memory filesystem.
  • Key Changes:
    • Vectored Writes API:
      • Added EnableVectoredWrites to MountConfig.
      • Added DataBlocks [][]byte and a TotalSize() int helper method to fuseops/ops.go.
      • When enabled, convertInMessage slices the write payload directly into DataBlocks (using ConsumeVector), completely avoiding copying the payload into a single contiguous slice.
    • MemFS Optimization:
      • Added WriteBlocksAt(blocks [][]byte, off int64) to samples/memfs/inode.go to copy block-by-block directly into the inode's storage slice.
      • Updated WriteFile in samples/memfs/memfs.go to leverage WriteBlocksAt when DataBlocks is populated.
    • Wirelog, Debug, and Test Cleanups:
      • Simplified write size calculations in debug.go and wirelog.go to use the new WriteFileOp.TotalSize() helper method.
      • Added the VectoredWritesTest integration tests in samples/memfs/memfs_test.go to verify vectored writes.
      • Fixed out-of-cache benchmarks in internal/buffer/out_message_test.go by allocating larger arrays (80 MiB) on the heap to successfully defeat the CPU cache.
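
As a rough illustration of the write path in this commit, the sketch below uses local stand-in types rather than fuseops.WriteFileOp and the memfs inode. The field and method names follow this description, but the bodies and the int return value of WriteBlocksAt are assumptions.

```go
package main

import "fmt"

// writeOp is a stand-in for fuseops.WriteFileOp with only the fields used here.
type writeOp struct {
	Data       []byte
	DataBlocks [][]byte
}

// TotalSize reports the payload length regardless of which field is populated.
func (op *writeOp) TotalSize() int {
	n := len(op.Data)
	for _, b := range op.DataBlocks {
		n += len(b)
	}
	return n
}

// inode is a stand-in for the memfs inode; contents is its storage slice.
type inode struct{ contents []byte }

// WriteBlocksAt copies the blocks back to back starting at off, growing the
// storage slice first if the write extends past its current end.
func (in *inode) WriteBlocksAt(blocks [][]byte, off int64) int {
	total := 0
	for _, b := range blocks {
		total += len(b)
	}
	if end := int(off) + total; end > len(in.contents) {
		in.contents = append(in.contents, make([]byte, end-len(in.contents))...)
	}
	n := 0
	for _, b := range blocks {
		n += copy(in.contents[int(off)+n:], b)
	}
	return n
}

func main() {
	op := &writeOp{DataBlocks: [][]byte{[]byte("hello, "), []byte("world")}}
	var in inode
	n := in.WriteBlocksAt(op.DataBlocks, 2)
	fmt.Println(op.TotalSize(), n, len(in.contents)) // 12 12 14
}
```

Each block is copied once, straight into its final position, so no intermediate contiguous slice is built.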

Architectural Design: Why Vectored I/O?

To support message sizes larger than 1 MiB, allocating contiguous buffers dynamically (e.g., a single 8 MiB buffer for an 8 MiB read/write) is highly inefficient due to severe GC pressure and heap fragmentation.

Instead, the library now implements a non-contiguous, block-based architecture:

           +-----------------------+      +-------------------+
InMessage  | Block 0 (1MB + page)  | ---> | Block 1 (1MB)     | ---> ...
           +-----------------------+      +-------------------+
           | Headers | Small data  |      | Large data blocks |
           +-----------------------+      +-------------------+

1. Zero-Copy Reads

When a FUSE read request is received:

  • The daemon uses the readv syscall on Linux to read the request from the /dev/fuse descriptor directly into the pooled blocks, avoiding an extra user-space copy through a contiguous buffer.
  • If the read size is larger than Block 0, the library populates ReadFileOp.DstBufs with these blocks.
  • The filesystem writes the data directly into DstBufs, requiring zero extra allocations or copy operations.

2. Zero-Copy Writes

When a FUSE write request is received:

  • The write payload is read into the pooled blocks.
  • Instead of allocating a single contiguous buffer and copying all blocks into it, the library returns the non-contiguous slices directly in WriteFileOp.DataBlocks.
  • Filesystems optimized for vectored writes (such as the updated memfs using WriteBlocksAt) can consume these blocks directly.

Configuration Reference

Three new fields are introduced in MountConfig to manage the large message and vectored I/O behavior:

| Field | Type | Description |
| --- | --- | --- |
| MaxMessageSize | uint32 | Configures the maximum size of FUSE messages the daemon is prepared to read/write. Setting this larger than 1 MiB enables large reads and writes. |
| EnableVectoredReads | bool | If true, large read operations bypass contiguous buffer allocations in ReadFileOp.Dst and instead populate ReadFileOp.DstBufs. |
| EnableVectoredWrites | bool | If true, large write operations bypass copying payload blocks into a single contiguous slice in WriteFileOp.Data and instead populate WriteFileOp.DataBlocks. |

Important

For MaxMessageSize values greater than 1 MiB, enabling both EnableVectoredReads and EnableVectoredWrites is highly recommended to avoid significant performance regressions due to large heap allocations and contiguous copies.
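
A configuration following this recommendation might look like the sketch below. It uses a local stand-in struct that mirrors only the three fields listed above; the real type is MountConfig, and the warning helper is hypothetical, included only to make the recommendation concrete.

```go
package main

import "fmt"

const MiB = 1 << 20

// mountConfig is a local stand-in for the three new MountConfig fields.
type mountConfig struct {
	MaxMessageSize       uint32
	EnableVectoredReads  bool
	EnableVectoredWrites bool
}

// warning (hypothetical helper) flags the combination the note above advises
// against: large messages without both vectored options enabled.
func (c mountConfig) warning() string {
	if c.MaxMessageSize > MiB && !(c.EnableVectoredReads && c.EnableVectoredWrites) {
		return "MaxMessageSize > 1 MiB without vectored reads and writes"
	}
	return ""
}

func main() {
	cfg := mountConfig{
		MaxMessageSize:       4 * MiB,
		EnableVectoredReads:  true,
		EnableVectoredWrites: true,
	}
	fmt.Printf("%+v warning=%q\n", cfg, cfg.warning())
}
```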


Testing & Verification

  • Unit Tests: internal/buffer/in_message_test.go verifies all edge cases of multi-block message parsing (consuming across block boundaries, shrinking, and block recycling).
  • Integration Tests: samples/memfs/memfs_test.go contains test cases ensuring that MemFS correctly handles vectored write operations when EnableVectoredWrites is enabled.
  • Benchmarks: Benchmarks in internal/buffer/out_message_test.go have been updated and verified to measure reset, growth, and shrink performance accurately.
  • All tests pass successfully.

This introduces support for vectored reads (reading FUSE requests from the device via readv into non-contiguous block buffers, and passing them up to the filesystem via DstBufs).
Includes multi-block allocation infrastructure in InMessage, platform-specific support for FuseT (contiguous fallback), and configuration.
This introduces support for vectored writes, bypassing copying write payload bytes into a single contiguous slice in WriteFileOp.Data, instead providing the raw non-contiguous blocks in WriteFileOp.DataBlocks.
Includes the optimization and implementation in MemFS, along with wirelog, debug, and test updates.
@abhishek10004 abhishek10004 force-pushed the abhishek/vectored_io branch from 1891bb9 to 8005ed3 Compare June 22, 2026 13:15
}

var BlockPool1M = newBlockPool(48, func() []byte {
return make([]byte, MiB)

Author

I've not used mmap here because the pool currently overflows to a sync.Pool, so under heavy parallelism we would not pay the allocation penalty again and again.

In case we take over the memory allocation using mmap and bypass the go runtime, then there would be 2 options:
a) fixed size buffer pool but that would mean constant allocation/deallocation in case parallelism is higher than the configured limits
b) dynamic pool that keeps growing/shrinking but this would be a slightly larger change & would need more testing.
Hence, I've parked it for later, either as a separate commit or a new change.
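
For readers following the thread, the overflow design mentioned above could look roughly like this. The newBlockPool signature matches the call in the diff; the internals (a fixed-capacity channel backed by a sync.Pool) and the Put method are assumptions for illustration, not the PR's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

const MiB = 1 << 20

// blockPool keeps up to cap(free) blocks in a fixed free list and lets any
// excess spill into a sync.Pool, which the garbage collector may empty.
type blockPool struct {
	free     chan []byte
	overflow sync.Pool
}

func newBlockPool(limit int, alloc func() []byte) *blockPool {
	p := &blockPool{free: make(chan []byte, limit)}
	p.overflow.New = func() interface{} { return alloc() }
	return p
}

// Get prefers the fixed free list and falls back to the overflow pool,
// which allocates a fresh block when it is empty.
func (p *blockPool) Get() []byte {
	select {
	case b := <-p.free:
		return b
	default:
		return p.overflow.Get().([]byte)
	}
}

// Put returns a block to the free list, or to the overflow pool when the
// free list is already full.
func (p *blockPool) Put(b []byte) {
	select {
	case p.free <- b:
	default:
		p.overflow.Put(b)
	}
}

func main() {
	pool := newBlockPool(48, func() []byte { return make([]byte, MiB) })
	b := pool.Get()
	fmt.Println(len(b)) // 1048576
	pool.Put(b)
}
```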

vadlakondaswetha left a comment

Yet to review the test cases and samples.

Comment thread connection.go
initOp.MaxReadahead = maxReadahead
initOp.MaxWrite = buffer.MaxWriteSize

maxPayload := c.inMessageSize - buffer.GetPageSize()

Do we anticipate different sizes for reads and writes? If not, can we have just one variable that gives the request size for both?

}

// NewInMessage creates a new InMessage.
func NewInMessage(size int) *InMessage {

Why take a size parameter if it's not used?

Comment thread connection.go
err = nil
continue
}
if errors.Is(err, syscall.ENODEV) {

We are removing a type cast here. Are these changes intentional? If yes, how did they work earlier?

var err error
if fusekernel.IsPlatformFuseT {
n, err = m.ReadSingle(r)
if len(m.blocks) == 1 {

As discussed, please remove all the macOS changes and return a not-supported error when messageSize is bigger. Let's not check in changes that have not been reviewed.

return pageSize
}

type blockPool struct {

Please create a separate file for blockPool.

block := BlockPool1M.Get()
m.borrowedBlocks = append(m.borrowedBlocks, block)
allocSize := MiB
if remaining < allocSize {

what happens if you remove this check?

// Since n doesn't fit in block 0, and block 0 has size 1MB + pageSize,
// n is necessarily larger than 1MB (assuming typical small offset like
// sizeof(ReadIn)). Thus we always allocate directly on the heap.
return make([]byte, n)

In the existing code, if we don't require a buffer we return nil, whereas here we create a new buffer.
Also, I didn't understand in what scenarios it would cross 1 MB.

Comment thread conversions.go
}
// Use part of the incoming message storage as the read buffer.
to.Dst = inMsg.GetFree(int(in.Size))
if config.EnableVectoredReads && int(in.Size) > buffer.MiBPlusPageSize {

Brainstorming a bit here: why do we need to support both vectored and non-vectored reads? Can we just always pass a 2-D array? It would be a minor change on the GCSFuse side. How big a change would it be on the GCSFuse side? I am guessing we can just pick the first block from the array and pass it downstream when messageSize is 1 MB.

Comment thread conversions.go
var buf []byte
var dataBlocks [][]byte

if config.EnableVectoredWrites && inMsg.Len() > uintptr(buffer.MiBPlusPageSize) {

By moving everything to vectored reads/writes we need not do if-else everywhere, and the code becomes much simpler.

It reduces the number of configs too.

// In production, any spanning allocation is larger than 1MB (since block 0
// is 1MB + pageSize and fits all normal headers/payloads). Thus we always
// allocate directly from the heap.
res := make([]byte, n)

Same as reads: why would we overflow here and not earlier?
