Make using DataSource faster by using lazy reading #949

Open
jmcarcell wants to merge 4 commits into master from datasource

@jmcarcell
Member

@jmcarcell jmcarcell commented Mar 26, 2026

The problem with DataSource is that everything is loaded all the time, unlike with traditional RDataFrames. I have added lazy reading to ROOTReader and RNTupleReader; the two implementations have to differ because the internals of the readers are different. For ROOTReader it is relatively trivial with a callback: we just read the collection that we need when it is used. For RNTupleReader we create a new ROOT RNTupleReader with a minimal model for each collection, since the complete model is fixed at the beginning. The key difference is that for ROOTReader we have one branch per collection and read per branch, whereas for RNTupleReader we have the full model. Possible questions and comments:

  • If we want to support lazy reading more generally, for example for files in which we are only interested in a few collections, then we would probably need different modes for the readers (at least for the RNTuple one, this has to be known before reading). Currently, the ROOTReader implementation assumes that all reading for an event is done before moving on to the next one, which would have to change.
  • In that case, maybe lazy reading could be the default if performance is not too different from what we have now.
  • If we don't, then lazy reading has to be hidden and only allowed for DataSource, since it does not work in the general case (at least for ROOTReader, as explained above).

Benchmarks will follow later, but I can read a few GB of TTree and RNTuple files in a time close to that of using RDataFrame directly on them. This was tested single-threaded only; I think multithreading should bring a similar speedup, since DataSource creates several independent readers.
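
To illustrate the callback idea described for ROOTReader, here is a minimal, self-contained sketch. It is not the code in this PR: `FakeFile`, `CallbackFrameData`, `CallbackReader` and `readNextEntry` are hypothetical names, and a `std::map` stands in for the per-collection branches of a real file. Like the implementation described above, it assumes that all reads for an entry happen before the reader advances.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the buffers of one collection.
using Buffers = std::vector<int>;

// Hypothetical stand-in for a file with one branch per collection.
struct FakeFile {
  std::map<std::string, std::vector<Buffers>> branches; // name -> data per entry
  int reads = 0;                                        // number of branch reads so far

  Buffers readBranch(const std::string& name, std::size_t entry) {
    assert(branches.count(name) == 1);
    ++reads;
    return branches.at(name).at(entry);
  }
};

// Frame data that asks the reader for a collection only on first access.
class CallbackFrameData {
public:
  using Loader = std::function<Buffers(const std::string&)>;
  explicit CallbackFrameData(Loader loader) : m_loader(std::move(loader)) {}

  const Buffers& get(const std::string& name) {
    auto it = m_cache.find(name);
    if (it == m_cache.end()) {
      // First request: trigger the actual read through the callback.
      it = m_cache.emplace(name, m_loader(name)).first;
    }
    return it->second;
  }

private:
  Loader m_loader;
  std::map<std::string, Buffers> m_cache;
};

// Reader that hands out lazy frame data instead of reading every branch.
// The callback uses the reader's *current* entry, so it is only valid
// until readNextEntry() is called again.
class CallbackReader {
public:
  explicit CallbackReader(FakeFile& file) : m_file(file) {}

  CallbackFrameData readNextEntry() {
    m_current = m_next++;
    return CallbackFrameData(
        [this](const std::string& name) { return m_file.readBranch(name, m_current); });
  }

private:
  FakeFile& m_file;
  std::size_t m_current = 0;
  std::size_t m_next = 0;
};
```

In this sketch, creating the frame data touches no branch at all; a branch is read once, on the first `get` for that collection, and served from the cache afterwards.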

BEGINRELEASENOTES

ENDRELEASENOTES

@tmadlener
Collaborator

When I initially designed the whole Frame infrastructure, I had the following idea for reading data lazily: Construct some form of LazyFrameData that effectively retains a reference to the reader such that it can read the buffers from there when they are requested from the Frame. For ROOT this would effectively mean that the reader does almost nothing and most of the logic for retrieving data from the file would go into this LazyFrameData.

From the looks of it, the callback in your version goes pretty much in that direction, except that the logic still lives in the Reader and that there is the condition of not going to the next entry before all collections have been lazily read. To make this more generic, one would probably have to add

  • a mutex (or another synchronization primitive) to lock the reader from the Frame
  • some bookkeeping information to know which entry in the file belongs to a Frame

Purely to avoid overloading the existing reader with too much functionality, I would be in favor of having the details of lazy or eager reading entirely hidden behind the existing interface of Reader, i.e. no addition of readFrameLazy, but rather new lazy readers (and corresponding lazy frames) that provide the same interface.
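
A rough sketch of what such a LazyFrameData could look like, combining the two ingredients listed above (a lock on the shared source and a stored entry index). All names here (`SharedSource`, `LazyReader`, `readEntry`) are hypothetical and a `std::map` replaces the real file; this is an illustration under those assumptions, not a proposal for the actual podio classes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the buffers of one collection.
using Buffers = std::vector<int>;

// Hypothetical shared handle to the file contents. The mutex serializes
// reads coming from different frames.
class SharedSource {
public:
  explicit SharedSource(std::map<std::string, std::vector<Buffers>> branches)
      : m_branches(std::move(branches)) {}

  Buffers read(const std::string& name, std::size_t entry) {
    std::lock_guard<std::mutex> lock(m_mutex);
    return m_branches.at(name).at(entry);
  }

private:
  std::mutex m_mutex;
  std::map<std::string, std::vector<Buffers>> m_branches;
};

// Frame data that keeps a reference to the source plus the entry it belongs
// to, so it can be read at any time and in any order.
// Note: only the shared source is synchronized; the per-frame cache is not.
class LazyFrameData {
public:
  LazyFrameData(std::shared_ptr<SharedSource> source, std::size_t entry)
      : m_source(std::move(source)), m_entry(entry) {
    assert(m_source != nullptr);
  }

  const Buffers& get(const std::string& name) {
    auto it = m_cache.find(name);
    if (it == m_cache.end()) {
      it = m_cache.emplace(name, m_source->read(name, m_entry)).first;
    }
    return it->second;
  }

private:
  std::shared_ptr<SharedSource> m_source;
  std::size_t m_entry; // bookkeeping: which entry this frame corresponds to
  std::map<std::string, Buffers> m_cache;
};

// The reader itself does almost nothing: it only creates lazy frame data.
class LazyReader {
public:
  explicit LazyReader(std::shared_ptr<SharedSource> source) : m_source(std::move(source)) {}

  LazyFrameData readEntry(std::size_t entry) { return LazyFrameData(m_source, entry); }

private:
  std::shared_ptr<SharedSource> m_source;
};
```

Because each frame remembers its own entry, frames can be created for several entries and read later in any order, which removes the "finish this entry before moving on" condition.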


As a side note, some of the excessive data loading could be avoided by front-loading the choice to users as a quick workaround, because DataSource can read only a subset of all collections. @kjvbrt and I had a brief discussion about automating that on the Python side, but I think we concluded that it would essentially require parsing Python or doing a "dummy" run to collect data names. However, a user might know which collections they want and could provide them as a list.
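
As an illustration of the user-provided list, the selection step could look roughly like the following. `selectCollections` is a hypothetical helper and not part of DataSource; treating an empty request as "read everything" is an assumption made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: keep only the requested collections that are
// actually available in the file, in the order they were requested.
std::vector<std::string> selectCollections(const std::vector<std::string>& available,
                                           const std::vector<std::string>& requested) {
  // Assumption for this sketch: an empty request means "read everything".
  if (requested.empty()) {
    return available;
  }
  std::vector<std::string> selected;
  for (const auto& name : requested) {
    if (std::find(available.begin(), available.end(), name) != available.end()) {
      selected.push_back(name);
    }
  }
  assert(selected.size() <= requested.size());
  return selected;
}
```

Requested names that are not in the file are silently dropped here; a real implementation might prefer to report them.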
