Adds ftfy to DictionaryWordPredictor to fix unicode oddities. #149
soldni wants to merge 3 commits into
When you consider the underlying character stream from the PDF document then it's interesting to note that

Introducing this into a predictor rather than into a parser may cause unexpected alignment issues if I really care about aligning characters to the original document. Examples of this may be when I want to use different x/y tolerances for detecting words (though this doesn't seem a possible concern with current

Does this have consequences for referencing

The overall point may be moot since

Could a different setup be something like:

Fixing ligatures is not so much a model as a detect-and-replace configuration. The dictionary word predictor is a predictor from one perspective because it can be "trained" on the PDF of concern to have a local dictionary and capture words like

Another thought on global token fixing is simply that most(?) models are likely to work better with
Yes, but that's already the case with WordPredictor. A sequence of tokens
I'm starting to think

At a higher level, this seems to boil down to -- what are

Fixing ligatures, in spirit, is doing the same thing the DictWordPredictor is trying to do -- that is, create

Maybe what we should do is rename DictWordPredictor to just a generic WordPredictor. Implementation-wise, it would have separate internal methods for handling the Dict-aspect of forming words, as well as the ligature-transformation.

In the long run, I'm thinking more and more this is a task for an efficient PostEditingModel that scans PDF tokens & outputs proposed edits to form words.

Thoughts?
That's a good point. I think the key difference with "-" is that one is removing characters and in theory can still index into symbols using all Span start/end. So, the individual character indices in symbols can still line up for spans. There is no longer a span that includes the "-". De-hyphenation compresses a span or discards symbols from the Document as not useful to meaning.

To your point, if we are keeping SpanGroup.(whatever_method_reaches_doc_symbols) -> just the original characters, then everything seems OK. My original comment forgot that we tend to build up new items as we go (symbols, then tokens, then "words", which are potentially entirely separate).

Seems OK to skip renaming, etc. for now.
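The alignment point in this thread can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the standard library's `unicodedata.normalize` with NFKC as a stand-in for ligature expansion (not ftfy and not the library's own classes), and the `symbols` string and the plain `(start, end)` tuples are hypothetical.

```python
import unicodedata

# Hypothetical raw character stream, as a PDF parser might emit it.
# "\ufb01" is the single-character "fi" ligature.
symbols = "Veri\ufb01cation of align-\nment"

# Spans are modeled here as plain (start, end) offsets into `symbols`.
ligature_span = (0, 11)     # "Veri" + ligature + "cation" = 11 characters
hyphenated_span = (15, 26)  # "align-\nment"

# Option A: rewrite the symbols themselves. The ligature expands to two
# characters, so every existing offset after it is now off by one.
fixed_symbols = unicodedata.normalize("NFKC", symbols)
misaligned = fixed_symbols[ligature_span[0]:ligature_span[1]]

# Option B: leave `symbols` untouched and store the cleaned text on the word.
# The spans still index the original characters exactly.
start, end = ligature_span
ligature_word = unicodedata.normalize("NFKC", symbols[start:end])

start, end = hyphenated_span
dehyphenated_word = symbols[start:end].replace("-\n", "")

print(repr(misaligned))         # slice of the rewritten stream, cut short
print(repr(ligature_word))      # cleaned word text
print(repr(dehyphenated_word))  # word text with the line-break hyphen dropped
```

In option B the word text is a separate value built on top of the spans, which matches the observation that symbols, tokens, and words are built up as distinct items.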
This PR adds `ftfy` to `DictionaryWordPredictor` to mitigate some issues in character parsing from pdfplumber. In short, it calls `ftfy.fix_text` to replace corrupted or low-frequency characters such as ligatures (e.g. `Veriﬁcation`, where `ﬁ` is a single character) with more common representations.
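As a rough illustration of the behavior described, the sketch below applies `ftfy.fix_text` to a few token strings. The `fix_tokens` helper and the sample tokens are hypothetical and are not taken from this PR's diff. The `unicodedata` fallback is an assumption added only so the snippet runs where ftfy is not installed; it covers the ligature case but is not equivalent to ftfy in general.

```python
import unicodedata

try:
    from ftfy import fix_text  # third-party: pip install ftfy
except ImportError:
    # Assumption: fallback so the sketch still runs without ftfy.
    # NFKC also expands Latin ligatures, but it does not do ftfy's other fixes.
    def fix_text(text: str) -> str:
        return unicodedata.normalize("NFKC", text)


def fix_tokens(tokens):
    """Return cleaned copies of the token texts; the input list is not modified."""
    return [fix_text(token) for token in tokens]


# "\ufb01" and "\ufb02" are the single-character "fi" and "fl" ligatures.
raw_tokens = ["Veri\ufb01cation", "of", "\ufb02ows"]
clean_tokens = fix_tokens(raw_tokens)

print(clean_tokens)
```

Returning new strings rather than editing tokens in place keeps the original characters available for anything that needs to align back to the PDF.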