Books dataset by originalankur · Pull Request #21 · algolia/datasets

originalankur · 2025-07-02T09:46:07Z

Extracted from the Open Library dataset.

Haroenv · 2025-07-02T11:38:26Z

+# Books Dataset
+
+The books.json is a subset from the openlibrary [books datasets](https://openlibrary.org/developers/dumps)
+


We would need to add the CC0 1.0 Universal license here, I think: https://openlibrary.org/help/faq/using#ownership

@Haroenv To the best of my knowledge, the following rules apply under the CC0 1.0 Universal license:

You may use the dataset for commercial purposes.

No need to cite or reference the license.

Attribution is optional, not required.

@Haroenv if you insist, I will add a copy of the license in the folder. Please advise.

Thanks for digging in on the licensing, Ankur. Based on your research I agree with you.

pixelastic · 2025-07-16T11:11:41Z

Hey @originalankur, thanks for the PR.

I had a look at the content of the file, and I'm afraid some of the books might contain sensitive content (at least one suspicious case of doxxing, and mentions of child pornography), that we don't really want in our public list of data.

I cleaned the list and shrank the number of books to ~24k rather than ~33k (which also puts the file size at 49MB, right below the suggested 50MB GitHub limit).
You can find my clean version in the books-clean branch.

Can you pull it in to replace your version, please?
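
For readers who want to run a similar cleanup on their own extract, here is a minimal sketch. It is not the method used for the books-clean branch, which the thread does not describe. It assumes the dataset is a JSON array of objects, uses a hypothetical keyword blocklist (`BLOCKED_TERMS`), and treats the 50MB figure mentioned above as 50 * 1024 * 1024 bytes. All function names are illustrative.

```python
import json

# Hypothetical blocklist: the terms actually used for the cleanup are not
# stated in the thread, so this placeholder is an assumption.
BLOCKED_TERMS = ("example-blocked-term",)

# Size guideline mentioned in the thread, assumed here to mean 50 MiB.
MAX_BYTES = 50 * 1024 * 1024


def is_clean(record, blocked_terms=BLOCKED_TERMS):
    """Return True if no blocked term appears anywhere in the serialized
    record (keys or values), compared case-insensitively."""
    text = json.dumps(record, ensure_ascii=False).lower()
    return not any(term.lower() in text for term in blocked_terms)


def clean_records(records, blocked_terms=BLOCKED_TERMS):
    """Keep only the records that pass the blocklist check."""
    return [record for record in records if is_clean(record, blocked_terms)]


def serialized_size(records):
    """Size in bytes of the records when written out as UTF-8 JSON."""
    return len(json.dumps(records, ensure_ascii=False).encode("utf-8"))


def fits_size_limit(records, max_bytes=MAX_BYTES):
    """Check the serialized size against the assumed size guideline."""
    return serialized_size(records) < max_bytes
```

A keyword filter like this only catches obvious cases; sensitive entries such as the doxxing case mentioned above would most likely still need manual review.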

originalankur · 2025-07-16T11:14:39Z

@pixelastic Thank you for cleaning the data, I should have thought of this. I will update the PR. Thanks Tim.

pixelastic · 2025-08-04T09:27:40Z

Hey @originalankur ping me once you've updated the PR and I'll merge it. Thanks.

books dataset

ad33025

Haroenv reviewed Jul 2, 2025


Haroenv requested review from chuckmeyer and pixelastic July 2, 2025 13:57

originalankur wants to merge 1 commit into algolia:master from originalankur:master

chuckmeyer Jul 21, 2025

4 participants
