
Improve PANGAEA Publication Extraction Logic #55

Merged: khider merged 5 commits into main from pangaea-publications on May 12, 2026
doswal (Collaborator) commented May 12, 2026

fixes #51

Summary

This PR refactors the get_publications() workflow for PangaeaDataset to improve accuracy, robustness, and alignment with the PANGAEA metadata model. It replaces unreliable regex-based parsing with structured metadata access.

Key Changes

  • Structured extraction

    • Use PanDataSet.supplement_to and PanDataSet.relations instead of regex parsing.
  • Dataset citation handling

    • Use PanDataSet.citation directly (no Crossref).
    • Always included with Type = "citation".
  • Selective Crossref usage

    • Applied only to:
      • supplement_to["uri"]
      • relations[i]["uri"]
    • Avoids failures for PANGAEA dataset DOIs.
  • BibTeX correctness

    • Fixed conversion from bibtexparser entries to pybtex.Entry.
    • Proper handling of authors via Person objects.
  • Deduplication

    • DOI-based deduplication within each study.
  • Test stability

    • Added mocking for Crossref:
      @patch("doi2bib.crossref.get_bib")
    • Prevents conflicts with existing requests.get mocks.

Behavior

| Source | Type |
| --- | --- |
| Dataset citation | "citation" |
| supplement_to | "supplement to" |
| relations | relation type |
| Regex parsing | removed |
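The Source/Type mapping above, together with the DOI-based deduplication, can be sketched in plain Python. This is only an illustration under stated assumptions, not the merged implementation: `FakeDataset`, `normalize_doi`, and `collect_publications` are hypothetical names, the stand-in class only mimics the `citation`, `supplement_to`, and `relations` attributes named in this PR, and the `"type"` key on each relation dict is assumed.

```python
from dataclasses import dataclass, field

DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


@dataclass
class FakeDataset:
    """Hypothetical stand-in for the PanDataSet attributes used here."""
    citation: str
    supplement_to: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)


def normalize_doi(uri):
    """Reduce a DOI URI to a bare, lower-case DOI so duplicates compare equal."""
    doi = uri.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def collect_publications(ds):
    """Return (type, reference) pairs following the Source/Type table."""
    # The dataset citation is always included and is never sent to Crossref.
    records = [("citation", ds.citation)]

    candidates = []
    uri = (ds.supplement_to or {}).get("uri")
    if uri:
        candidates.append(("supplement to", uri))
    for rel in ds.relations or []:
        if rel.get("uri"):
            # Assumption: the relation type is stored under the "type" key.
            candidates.append((rel.get("type", "relation"), rel["uri"]))

    seen = set()
    for pub_type, uri in candidates:
        doi = normalize_doi(uri)
        if doi in seen:
            continue  # DOI already recorded for this study
        seen.add(doi)
        records.append((pub_type, doi))
    return records


ds = FakeDataset(
    citation="Example, A (2020): Example dataset. PANGAEA",
    supplement_to={"uri": "https://doi.org/10.1000/ABC"},
    relations=[
        {"type": "References", "uri": "doi:10.1000/abc"},  # same DOI as the supplement
        {"type": "References", "uri": "https://doi.org/10.1000/xyz"},
    ],
)
records = collect_publications(ds)
```

In this sketch the first occurrence of a DOI wins, so a relation that repeats the `supplement_to` DOI is dropped. Only the entries collected from `supplement_to` and `relations` would then be candidates for a Crossref lookup.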

Motivation

  • Regex-based extraction was fragile and inaccurate
  • Crossref is unreliable for dataset DOIs
  • PANGAEA provides structured relationships natively

This change improves correctness, maintainability, and testability.

khider merged commit e1f002b into main on May 12, 2026
1 check passed
doswal deleted the pangaea-publications branch on May 19, 2026, 02:25
Development

Successfully merging this pull request may close these issues.

PANGAEA: get_publications
