I'm extracting this paper to markdown like this:
import pymupdf
import pymupdf4llm

doc = pymupdf.open(absolute_path)
headers = pymupdf4llm.TocHeaders(doc)
text = pymupdf4llm.to_markdown(doc, hdr_info=headers)
I noticed that the titles in TocHeaders start with a byte order mark (U+FEFF): '\ufeffEffects of open-label placebos across populations and outcomes: an updated systematic review and meta-analysis of randomized controlled trials'.
This prevents the document structure from being recognised, because title.startswith(text) fails.
For a quick fix, you could just strip the BOM in get_header_id: #309
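
To illustrate the idea, here is a minimal sketch that runs without pymupdf4llm. `strip_bom` is a hypothetical helper, not part of the library, and the title and span text are shortened for readability. It assumes the comparison is a plain `str.startswith` on the title, as described above.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark character


def strip_bom(title: str) -> str:
    """Remove leading BOM characters, then surrounding whitespace.

    Hypothetical helper for illustration only. Note that str.strip()
    on its own does not remove U+FEFF, because Python does not treat
    it as whitespace.
    """
    return title.lstrip(BOM).strip()


# Shortened versions of the TOC title and of the text found on the page.
toc_title = "\ufeffEffects of open-label placebos across populations and outcomes"
span_text = "Effects of open-label placebos"

print(toc_title.startswith(span_text))             # False
print(strip_bom(toc_title).startswith(span_text))  # True
```

Since a plain `.strip()` leaves the BOM in place, the title would need an explicit `lstrip("\ufeff")` (or equivalent) before the comparison.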