Conversation

@KhoomeiK KhoomeiK commented Oct 7, 2024

Resiliparse is actively used by some AI labs to extract web data for LLM pre-training, but it has not been publicly benchmarked alongside many other similar web parsing tools. I've added an eval script for Resiliparse as well as its data output. I also added a flag to eval individual parsers separately.

@KhoomeiK KhoomeiK changed the title Benchmarked Resiliparse & added flag to evaluate an individual parser Benchmarked Resiliparse & added flag to evaluate parsers individually Oct 7, 2024
Member

@lopuhin lopuhin left a comment

Thanks a lot for contributing a new extractor @KhoomeiK . I left a few small comments - also if you prefer I could merge your PR as-is and address them in another PR.

Besides that, do you mind also updating the README with the results of the new parser, adding a line at the end of the "Result of packages added after original evaluation" table?

try:
    extractor_module = importlib.import_module(f'extractors.run_{name}')
    extractor_module.main()
except:
Member

I'd rather catch Exception here; see, e.g., the motivation in this (rejected) PEP: https://peps.python.org/pep-0760/#motivation

Suggested change
-except:
+except Exception:
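The rationale behind the suggestion is that a bare `except` catches `BaseException`, which includes `KeyboardInterrupt` and `SystemExit`, so Ctrl-C can be silently swallowed during a long evaluation run. A minimal sketch (function names are hypothetical, not the repository's API):

```python
def run_with_bare_except(fn):
    try:
        fn()
    except:  # noqa: E722 - catches BaseException, including KeyboardInterrupt
        return "swallowed"
    return "ok"


def run_with_exception(fn):
    try:
        fn()
    except Exception:  # lets KeyboardInterrupt / SystemExit propagate
        return "swallowed"
    return "ok"


def raise_interrupt():
    raise KeyboardInterrupt


# The bare except silently swallows the interrupt:
assert run_with_bare_except(raise_interrupt) == "swallowed"

# `except Exception` re-raises it to the caller:
try:
    run_with_exception(raise_interrupt)
    propagated = False
except KeyboardInterrupt:
    propagated = True
assert propagated
```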

        'accuracy={accuracy:.3f} ± {accuracy_std:.3f} '
        .format(name=name, **metrics))
    metrics_by_name[name] = metrics
else:
Member

I think it would be best to refactor the code in a way that does not lead to repeating the reporting. For example, we could pass args.parser to the evaluate function.
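One way to read this suggestion: if `evaluate` takes the optional parser name itself, the single-parser and all-parsers paths share one reporting loop. A hypothetical sketch, assuming `evaluate` and `compute_metrics` as stand-in names (not the repository's actual API):

```python
def compute_metrics(name):
    # stand-in for the real scoring logic
    return {"accuracy": 0.9, "accuracy_std": 0.01}


def evaluate(parser=None):
    # When a parser is given, evaluate only it; otherwise evaluate all.
    names = [parser] if parser else ["resiliparse", "trafilatura"]
    metrics_by_name = {}
    for name in names:
        metrics = compute_metrics(name)
        # Reporting lives in exactly one place, regardless of how many
        # parsers are evaluated.
        print('{name}: accuracy={accuracy:.3f} ± {accuracy_std:.3f}'
              .format(name=name, **metrics))
        metrics_by_name[name] = metrics
    return metrics_by_name


# Single code path handles both cases:
evaluate(parser="resiliparse")
evaluate()
```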
