
Running list of Ex. 21 extraction model improvement ideas #88

@katie-lamb


Overview

This is a running list of ideas we could try to improve the performance of the Ex. 21 extraction model. I moved the "nice to have" straggler items from #78 into this issue. We can experiment with these items after record linkage, once we have a better idea of the remaining budget and performance needs.

- Example of a filing with a "footnotes" section that can be excluded: `103872-0001193125-13-444053`

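
As a rough illustration of the footnote exclusion idea, here is a minimal sketch that drops everything from the first footnote-style heading onward. It assumes the filing text has already been split into lines; the keyword pattern and the `drop_footnote_section` helper are hypothetical and would need tuning against real Ex. 21 filings.

```python
import re

# Hypothetical pattern: a line consisting only of "Footnote(s)" or "Note(s)",
# optionally followed by a colon. Real filings may need more keywords.
FOOTNOTE_PATTERN = re.compile(r"^\s*(footnotes?|notes?)\s*:?\s*$", re.IGNORECASE)


def drop_footnote_section(lines):
    """Return the lines that precede the first footnote-style heading."""
    kept = []
    for line in lines:
        if FOOTNOTE_PATTERN.match(line):
            break  # everything from this heading onward is discarded
        kept.append(line)
    return kept


# Made-up example lines, not taken from an actual filing.
example = [
    "Subsidiary A    Delaware",
    "Subsidiary B    Ontario",
    "Footnotes:",
    "(1) Owned 50% by the registrant.",
]
print(drop_footnote_section(example))
```

Matching only whole-line headings keeps the filter from cutting a table at a subsidiary whose name happens to contain "note", at the cost of missing headings with extra wording.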
### Next steps
- [ ] Use Corpwatch dataset for further validation
- [ ] Nice to have: break out `layoutlm-finetune` into ops
- [ ] Try clustering the final hidden states instead of using the heuristic-based table extractor
- [ ] Exclude anything below "Footnotes" or similar keywords
- [ ] Create a threshold for flagging entity classification failures based on the logits returned by LayoutLM
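
One possible shape for the logit-based threshold item is sketched below. It assumes the per-token logits are available as plain lists of floats, one list per token; the `flag_low_confidence` helper and the `0.6` cutoff are hypothetical placeholders, not values from the current model.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_low_confidence(token_logits, threshold=0.6):
    """Return indices of tokens whose top class probability is below threshold."""
    flagged = []
    for i, logits in enumerate(token_logits):
        if max(softmax(logits)) < threshold:
            flagged.append(i)
    return flagged


# Made-up logits for two tokens over three entity classes.
token_logits = [
    [4.0, 0.5, 0.1],  # one class clearly dominates
    [1.0, 0.9, 0.8],  # near-uniform, so the prediction is uncertain
]
print(flag_low_confidence(token_logits))
```

A filing could then be marked as a likely classification failure when the share of flagged tokens exceeds some cutoff, which would also need to be chosen empirically.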

Status: Icebox
