In RAGLite we need a dependable estimate of the font size of a span. I thought you might be interested in how we obtain it with pdftext:
import re
from typing import Any


def extract_font_size(span: dict[str, Any]) -> float:
    """Extract the font size from a text span."""
    font_size: float = 1.0
    if span["font"]["size"] > 1:  # A value of 1 appears to mean "unknown" in pdftext.
        font_size = span["font"]["size"]
    elif digit_sequences := re.findall(r"\d+", span["font"]["name"] or ""):
        # Fall back to the last number in the font name.
        font_size = float(digit_sequences[-1])
    elif "\n" not in span["text"]:  # Occasionally a span can contain a newline character.
        # Fall back to the bbox extent perpendicular to the text direction.
        if round(span["rotation"]) in (0.0, 180.0, -180.0):
            font_size = span["bbox"][3] - span["bbox"][1]
        elif round(span["rotation"]) in (90.0, -90.0, 270.0, -270.0):
            font_size = span["bbox"][2] - span["bbox"][0]
    return font_size
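To make the last fallback easier to see on its own, here is a minimal sketch of just the bbox step. The helper name `bbox_font_size` and the sample boxes are made up for illustration and are not pdftext output. It assumes the bbox is laid out as `[x0, y0, x1, y1]`, which is what the indexing above implies. It also uses a modulo check instead of the explicit list of angles, so it accepts any multiple of 90 degrees, which is slightly broader than the function above.

```python
def bbox_font_size(bbox: list[float], rotation: float) -> float:
    """Approximate the font size from the bbox extent perpendicular to the text direction."""
    x0, y0, x1, y1 = bbox  # Assumed layout: [x0, y0, x1, y1].
    angle = round(rotation)
    if angle % 180 == 0:  # Horizontal text: use the bbox height.
        return y1 - y0
    if angle % 90 == 0:  # Vertical text: use the bbox width.
        return x1 - x0
    return 1.0  # Any other angle: keep the "unknown" default.


# Hypothetical single-line spans, one horizontal and one rotated by 90 degrees.
horizontal = bbox_font_size([72.0, 100.0, 200.0, 112.0], rotation=0.0)
vertical = bbox_font_size([72.0, 100.0, 84.0, 300.0], rotation=90.0)
skewed = bbox_font_size([72.0, 100.0, 200.0, 112.0], rotation=45.0)
print(horizontal, vertical, skewed)
```

This only gives a sensible estimate for single-line spans, which is why the function above skips the bbox fallback when the span text contains a newline.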