Support Configurable Large Messages (> 1 MiB) via Zero-Copy Vectored I/O#198
Support Configurable Large Messages (> 1 MiB) via Zero-Copy Vectored I/O#198abhishek10004 wants to merge 4 commits into
Conversation
This introduces support for vectored reads (reading FUSE requests from the device via readv into non-contiguous block buffers, and passing them up to the filesystem via DstBufs). Includes multi-block allocation infrastructure in InMessage, platform-specific support for FuseT (contiguous fallback), and configuration.
This introduces support for vectored writes, bypassing copying write payload bytes into a single contiguous slice in WriteFileOp.Data, instead providing the raw non-contiguous blocks in WriteFileOp.DataBlocks. Includes the optimization and implementation in MemFS, along with wirelog, debug, and test updates.
1891bb9 to
8005ed3
Compare
| } | ||
|
|
||
| var BlockPool1M = newBlockPool(48, func() []byte { | ||
| return make([]byte, MiB) |
There was a problem hiding this comment.
I've not used mmap here because currently I'm overflowing to a syncPool and hence if there is a lot of parallelism, we would not be paying the allocation penalty again and again.
In case we take over the memory allocation using mmap and bypass the go runtime, then there would be 2 options:
a) fixed size buffer pool but that would mean constant allocation/deallocation in case parallelism is higher than the configured limits
b) dynamic pool that keeps growing/shrinking but this would be a slightly larger change & would need more testing.
Hence, I've parked it for later, either as a separate commit or a new change.
vadlakondaswetha
left a comment
There was a problem hiding this comment.
yet to review the testcases and samples
| initOp.MaxReadahead = maxReadahead | ||
| initOp.MaxWrite = buffer.MaxWriteSize | ||
|
|
||
| maxPayload := c.inMessageSize - buffer.GetPageSize() |
There was a problem hiding this comment.
do we anticipate different sizes for read and write. if not can we have just one variable which tells size of the request for both reads and writes.
| } | ||
|
|
||
| // NewInMessage creates a new InMessage. | ||
| func NewInMessage(size int) *InMessage { |
There was a problem hiding this comment.
why are you taking a size parameter if its not used.
| err = nil | ||
| continue | ||
| } | ||
| if errors.Is(err, syscall.ENODEV) { |
There was a problem hiding this comment.
we are removing a typecasting here? Are these changes intentiontal? If yes, how did they work earlier?
| var err error | ||
| if fusekernel.IsPlatformFuseT { | ||
| n, err = m.ReadSingle(r) | ||
| if len(m.blocks) == 1 { |
There was a problem hiding this comment.
as discussed please remove all changes for MAC and throw not supported exception when messageSize is bigger. Lets not checkin changes which are not reviewed.
| return pageSize | ||
| } | ||
|
|
||
| type blockPool struct { |
There was a problem hiding this comment.
Please create a seperate file for blockPool
| block := BlockPool1M.Get() | ||
| m.borrowedBlocks = append(m.borrowedBlocks, block) | ||
| allocSize := MiB | ||
| if remaining < allocSize { |
There was a problem hiding this comment.
what happens if you remove this check?
| // Since n doesn't fit in block 0, and block 0 has size 1MB + pageSize, | ||
| // n is necessarily larger than 1MB (assuming typical small offset like | ||
| // sizeof(ReadIn)). Thus we always allocate directly on the heap. | ||
| return make([]byte, n) |
There was a problem hiding this comment.
in the existing code, if we dont required buffer, we are returning nil vs here we are creating a new buffer.
Also i didnt understand in what scenarios would it cross 1MB?
| } | ||
| // Use part of the incoming message storage as the read buffer. | ||
| to.Dst = inMsg.GetFree(int(in.Size)) | ||
| if config.EnableVectoredReads && int(in.Size) > buffer.MiBPlusPageSize { |
There was a problem hiding this comment.
Brainstorming a bit here. why do we need to support both vectoredReads and non vectoredReads. Can we just pass 2-D array always. It would be a minor change on the GCSFuse side. How big of a change will it be on GCSFuse side? I am guessing we can just pick the first block from the array and pass it downstream when messageSize is 1MB?
| var buf []byte | ||
| var dataBlocks [][]byte | ||
|
|
||
| if config.EnableVectoredWrites && inMsg.Len() > uintptr(buffer.MiBPlusPageSize) { |
There was a problem hiding this comment.
by moving everything to vectoredReads/writes we need not do if-else every where. the code becomes much simpler.
There was a problem hiding this comment.
reduces the number of configs too.
| // In production, any spanning allocation is larger than 1MB (since block 0 | ||
| // is 1MB + pageSize and fits all normal headers/payloads). Thus we always | ||
| // allocate directly from the heap. | ||
| res := make([]byte, n) |
There was a problem hiding this comment.
Same as reads? why would we overflow here and not earlier?
Overview
Previously, this library had a hardcoded FUSE message buffer size of
1 MiB + pageSize(corresponding to the standard 1 MiB FUSE payload limit). Because of this hardcoded limit, large reads and writes (> 1 MiB) were not supported at all.This PR adds support for large I/O operations by introducing a configurable
MaxMessageSizeinMountConfig, allowing daemons to read and write messages larger than 1 MiB.To prevent the severe performance regressions, memory fragmentation, and garbage collection (GC) pressure that would arise from allocating giant contiguous buffers (e.g., 4 MiB, 8 MiB, or 16 MiB) on the heap for every request, this support is implemented via Vectored I/O:
readvsystem call. The filesystem can then write the read payload directly into these blocks viaReadFileOp.DstBufs(Zero-Copy Vectored Reads).WriteFileOp.DataBlocks(Zero-Copy Vectored Writes).Key Benefits
BlockPool1MandBlockPool1MPlusPage) and thereadvsystem call.DstandData) for reads/writes under 1 MiB or when vectored I/O is disabled, ensuring existing filesystem implementations continue to work out-of-the-box.Commit-by-Commit Walkthrough
1. Enable Setting FUSE Buffer Sizes Dynamically
Commit:
ddc386cMaxMessageSize uint32to mount_config.go.c.inMessageSizein connection.go based onMaxMessageSize(or defaults to the maximum ofMaxReadSize/MaxWriteSize+ 1 page).Initcall) to announce this dynamically calculated size asMaxWriteandMaxPagesto the FUSE kernel driver.InMessagein internal/buffer/in_message.go to accept a dynamic allocation size instead of using a global staticbufSize.2. Refactor Error Handling using standard
errors.IsCommit:
c263028Connection.readMessage()in connection.go to use standarderrors.Is(err, syscall.ENODEV)anderrors.Is(err, syscall.EINTR)checks instead of type-casting to*os.PathErrorand inspecting inner fields. This ensures compatibility in case errors are wrapped.3. Implement Zero-Copy Vectored Reads Support
Commit:
72acb69ReadFileOp.DstBufsto support large reads (> 1 MiB) without heap thrashing.BlockPool1M(1 MiB blocks) andBlockPool1MPlusPage(1 MiB + hardware page size) in internal/buffer/in_message.go to recycle buffers and avoid heap thrashing.InMessageto allocate non-contiguous blocks (blocks [][]byte) rather than a single contiguous slice. Block 0 is always sized1 MiB + pageSize(holding headers and small payloads), and additional 1 MiB blocks are allocated to satisfy larger message limits.readv):SYS_READVsyscall, converting block slices tounix.Iovecpointers to perform a single-system-call read into multiple non-contiguous memory segments.FuseT) by implementing a contiguous fallback pool (fuseTContiguousPool).EnableVectoredReadstoMountConfig.DstBufs [][]byteto fuseops/ops.go. When enabled and the read size is larger than block 0,Dstis set tonilandDstBufsis populated with the block-sliced buffers, allowing the filesystem to write read payloads directly into the FUSE message blocks.4. Implement Zero-Copy Vectored Writes Support & MemFS Optimization
Commit:
8005ed3memfsmemory filesystem.EnableVectoredWritestoMountConfig.DataBlocks [][]byteand aTotalSize() inthelper method to fuseops/ops.go.convertInMessageslices the write payload directly intoDataBlocks(usingConsumeVector), completely avoiding copying the payload into a single contiguous slice.WriteBlocksAt(blocks [][]byte, off int64)to samples/memfs/inode.go to copy block-by-block directly into the inode's storage slice.WriteFilein samples/memfs/memfs.go to leverageWriteBlocksAtwhenDataBlocksis populated.debug.goandwirelog.goto use the newWriteFileOp.TotalSize()helper method.VectoredWritesTestin samples/memfs/memfs_test.go.Architectural Design: Why Vectored I/O?
To support message sizes larger than 1 MiB, allocating contiguous buffers dynamically (e.g., a single 8 MiB buffer for an 8 MiB read/write) is highly inefficient due to severe GC pressure and heap fragmentation.
Instead, the library now implements a non-contiguous, block-based architecture:
1. Zero-Copy Reads
When a FUSE read request is received:
readvsyscall on Linux to read data directly from the/dev/fusedescriptor into the pooled blocks, avoiding any kernel-to-user memory copy.ReadFileOp.DstBufswith these blocks.DstBufs, requiring zero extra allocations or copy operations.2. Zero-Copy Writes
When a FUSE write request is received:
WriteFileOp.DataBlocks.memfsusingWriteBlocksAt) can consume these blocks directly.Configuration Reference
Three new fields are introduced in
MountConfigto manage the large message and vectored I/O behavior:MaxMessageSizeuint32EnableVectoredReadsboolReadFileOp.Dstand instead populateReadFileOp.DstBufs.EnableVectoredWritesboolWriteFileOp.Dataand instead populateWriteFileOp.DataBlocks.Important
For
MaxMessageSizevalues greater than 1 MiB, enabling bothEnableVectoredReadsandEnableVectoredWritesis highly recommended to avoid significant performance regressions due to large heap allocations and contiguous copies.Testing & Verification
internal/buffer/in_message_test.goverifies all edge cases of multi-block message parsing (consuming across block boundaries, shrinking, and block recycling).samples/memfs/memfs_test.gocontains test cases ensuring that MemFS correctly handles vectored write operations whenEnableVectoredWritesis enabled.internal/buffer/out_message_test.gohave been updated and verified to measure reset, growth, and shrink performance accurately.