Conversation

@KhoomeiK KhoomeiK commented Oct 7, 2024

Resiliparse is actively used by some AI labs to extract web data for LLM pre-training, but it has not been publicly benchmarked alongside many other similar web parsing tools. I've added an eval script for Resiliparse as well as its data output. I also added a flag to eval individual parsers separately.

@KhoomeiK KhoomeiK changed the title Benchmarked Resiliparse & added flag to evaluate an individual parser Benchmarked Resiliparse & added flag to evaluate parsers individually Oct 7, 2024
Member

@lopuhin lopuhin left a comment

Thanks a lot for contributing a new extractor @KhoomeiK . I left a few small comments - also if you prefer I could merge your PR as-is and address them in another PR.

Besides that, do you mind also updating the README with the results of the new parser, adding a line at the end of the "Result of packages added after original evaluation" table?

try:
    extractor_module = importlib.import_module(f'extractors.run_{name}')
    extractor_module.main()
except:
Member

I'd rather catch Exception here; see, e.g., the motivation in this (rejected) PEP: https://peps.python.org/pep-0760/#motivation

Suggested change
-except:
+except Exception:
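The rationale behind the suggestion is that a bare `except` catches `BaseException`, which includes `KeyboardInterrupt` and `SystemExit`, so Ctrl-C can be silently swallowed during a long evaluation run. A minimal sketch (function names are hypothetical, not the repository's API):

```python
def run_with_bare_except(fn):
    try:
        fn()
    except:  # noqa: E722 - catches BaseException, including KeyboardInterrupt
        return "swallowed"
    return "ok"


def run_with_exception(fn):
    try:
        fn()
    except Exception:  # lets KeyboardInterrupt / SystemExit propagate
        return "swallowed"
    return "ok"


def raise_interrupt():
    raise KeyboardInterrupt


# The bare except silently swallows the interrupt:
assert run_with_bare_except(raise_interrupt) == "swallowed"

# `except Exception` re-raises it to the caller:
try:
    run_with_exception(raise_interrupt)
    propagated = False
except KeyboardInterrupt:
    propagated = True
assert propagated
```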

        'accuracy={accuracy:.3f} ± {accuracy_std:.3f} '
        .format(name=name, **metrics))
    metrics_by_name[name] = metrics
else:
Member

I think it would be best to refactor the code in a way that does not lead to repeating the reporting. For example, we could pass args.parser to the evaluate function.
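One way to read this suggestion: if `evaluate` takes the optional parser name itself, the single-parser and all-parsers paths share one reporting loop. A hypothetical sketch, assuming `evaluate` and `compute_metrics` as stand-in names (not the repository's actual API):

```python
def compute_metrics(name):
    # stand-in for the real scoring logic
    return {"accuracy": 0.9, "accuracy_std": 0.01}


def evaluate(parser=None):
    # When a parser is given, evaluate only it; otherwise evaluate all.
    names = [parser] if parser else ["resiliparse", "trafilatura"]
    metrics_by_name = {}
    for name in names:
        metrics = compute_metrics(name)
        # Reporting lives in exactly one place, regardless of how many
        # parsers are evaluated.
        print('{name}: accuracy={accuracy:.3f} ± {accuracy_std:.3f}'
              .format(name=name, **metrics))
        metrics_by_name[name] = metrics
    return metrics_by_name


# Single code path handles both cases:
evaluate(parser="resiliparse")
evaluate()
```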
