diff --git a/README.md b/README.md
index fe27d4e..98c7cac 100644
--- a/README.md
+++ b/README.md
@@ -50,6 +50,7 @@ Can be used to perform:
 * flair - Required if you want to use Flair mentions extractor and for TARS linker and TARS Mentions Extractor.
 * blink - Required if you want to use Blink for linking to Wikipedia pages.
 * gliner - Required if you want to use GLiNER Linker or GLiNER Mentions Extractor.
+* relik - Required if you want to use Relik Linker.
 
 ## Installation
 
@@ -90,7 +91,7 @@ The linguistic approach relies on the idea that mentions will usually be a synta
 ### Linker
 
 The **linker** will link the detected entities to a existing set of labels. Some of the **linkers**, however, are *end-to-end*, i.e. they don't need the **mentions extractor**, as they detect and link the entities at the same time.
-Again, there are 5 **linkers** available currently, 3 of them are *end-to-end* and 2 are not.
+Again, there are 6 **linkers** available currently, 4 of them are *end-to-end* and 2 are not.
 
 | Linker Name | end-to-end | Source Code | Paper |
 |:-----------:|:----------:|----------------------------------------------------------|--------------------------------------------------------------------|
@@ -99,6 +100,7 @@ Again, there are 5 **linkers** available currently, 3 of them are *end-to-end* a
 | SMXM | ✓ | [Source Code](https://github.com/Raldir/Zero-shot-NERC) | [Paper](https://aclanthology.org/2021.acl-long.120/) |
 | TARS | ✓ | [Source Code](https://github.com/flairNLP/flair) | [Paper](https://kishaloyhalder.github.io/pdfs/tars_coling2020.pdf) |
 | GLINER | ✓ | [Source Code](https://github.com/urchade/GLiNER) | [Paper](https://arxiv.org/abs/2311.08526) |
+| RELIK | ✓ | [Source Code](https://github.com/SapienzaNLP/relik) | [Paper](https://arxiv.org/abs/2408.00103) |
 
 ### Relations Extractor
 The **relations extractor** will extract relations among different entities *previously* extracted by a **linker**..
@@ -241,7 +243,7 @@ from zshot import PipelineConfig
 from zshot.linker import LinkerTARS
 from zshot.evaluation.dataset import load_ontonotes_zs
 from zshot.evaluation.zshot_evaluate import evaluate, prettify_evaluate_report
-from zshot.evaluation.metrics.seqeval.seqeval import Seqeval
+from zshot.evaluation.metrics._seqeval._seqeval import Seqeval
 
 ontonotes_zs = load_ontonotes_zs('validation')
 
diff --git a/docs/entity_linking.md b/docs/entity_linking.md
index 0eb68d6..668b1c3 100644
--- a/docs/entity_linking.md
+++ b/docs/entity_linking.md
@@ -2,6 +2,16 @@
 
 The **linker** will link the detected entities to a existing set of labels. Some of the **linkers**, however, are *end-to-end*, i.e. they don't need the **mentions extractor**, as they detect and link the entities at the same time.
 
-There are 5 **linkers** available currently, 3 of them are *end-to-end* and 2 are not.
+There are 6 **linkers** available currently, 4 of them are *end-to-end* and 2 are not.
+
+| Linker Name | end-to-end | Source Code | Paper |
+|:----------------------------------------------------:|:----------:|----------------------------------------------------------|--------------------------------------------------------------------|
+| [Blink](https://ibm.github.io/zshot/blink_linker/) | X | [Source Code](https://github.com/facebookresearch/BLINK) | [Paper](https://arxiv.org/pdf/1911.03814.pdf) |
+| [GENRE](https://ibm.github.io/zshot/genre_linker/) | X | [Source Code](https://github.com/facebookresearch/GENRE) | [Paper](https://arxiv.org/pdf/2010.00904.pdf) |
+| [SMXM](https://ibm.github.io/zshot/smxm_linker/) | ✓ | [Source Code](https://github.com/Raldir/Zero-shot-NERC) | [Paper](https://aclanthology.org/2021.acl-long.120/) |
+| [TARS](https://ibm.github.io/zshot/tars_linker/) | ✓ | [Source Code](https://github.com/flairNLP/flair) | [Paper](https://kishaloyhalder.github.io/pdfs/tars_coling2020.pdf) |
+| [GLINER](https://ibm.github.io/zshot/gliner_linker/) | ✓ | [Source Code](https://github.com/urchade/GLiNER) | [Paper](https://arxiv.org/abs/2311.08526) |
+| [RELIK](https://ibm.github.io/zshot/relik_linker/) | ✓ | [Source Code](https://github.com/SapienzaNLP/relik) | [Paper](https://arxiv.org/abs/2408.00103) |
+
 
 ::: zshot.Linker
\ No newline at end of file
diff --git a/docs/relik_linker.md b/docs/relik_linker.md
new file mode 100644
index 0000000..222c4eb
--- /dev/null
+++ b/docs/relik_linker.md
@@ -0,0 +1,37 @@
+# ReLiK Linker
+ReLiK is a lightweight and fast model for Entity Linking and Relation Extraction. It is composed of two main components: a retriever and a reader. The retriever is responsible for retrieving relevant documents from a large collection, while the reader extracts entities and relations from the retrieved documents. ReLiK can be used with the `from_pretrained` method to load a pre-trained pipeline.
+
+In **Zshot**, we created a linker to use ReLiK. It works both with and without providing entities, and it supports entity descriptions.
+
+This is an *end-to-end* model, so there is no need to use a **mentions extractor** before.
+
+The ReLiK **linker** will use the **entities** specified in the `zshot.PipelineConfig`, if any.
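+
+A minimal usage sketch in a **Zshot** pipeline (the entity names, descriptions and example sentence below are only illustrative):
+
+```python
+import spacy
+
+from zshot import PipelineConfig
+from zshot.linker import LinkerRelik
+from zshot.utils.data_models import Entity
+
+nlp = spacy.blank("en")
+# Example entities: the names and descriptions here are illustrative placeholders
+nlp_config = PipelineConfig(
+    linker=LinkerRelik(),
+    entities=[
+        Entity(name="company", description="The name of a company"),
+        Entity(name="location", description="A physical location, such as a city or a country")
+    ]
+)
+nlp.add_pipe("zshot", config=nlp_config, last=True)
+
+doc = nlp("IBM headquarters are located in Armonk, New York.")
+print(doc.ents)
+```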
+
+- [Paper](https://arxiv.org/abs/2408.00103)
+- [Original Source Code](https://github.com/SapienzaNLP/relik)
+
+::: zshot.linker.LinkerRelik
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index ddd85c6..30fc8da 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -22,6 +22,7 @@ nav:
     - regen.md
     - smxm_linker.md
     - tars_linker.md
+    - relik_linker.md
     - gliner_linker.md
   - Relations Extractor:
     - relation_extractor.md
diff --git a/requirements/test.txt b/requirements/test.txt
index 9b82063..203a0f7 100644
--- a/requirements/test.txt
+++ b/requirements/test.txt
@@ -7,4 +7,5 @@ gliner>=0.2.9
 flake8>=4.0.1
 coverage>=6.4.1
 pydantic==1.9.2
+relik==1.0.5
 IPython
\ No newline at end of file
diff --git a/zshot/linker/__init__.py b/zshot/linker/__init__.py
index 98a6c21..eb69975 100644
--- a/zshot/linker/__init__.py
+++ b/zshot/linker/__init__.py
@@ -4,4 +4,5 @@
 from zshot.linker.linker_smxm import LinkerSMXM  # noqa: F401
 from zshot.linker.linker_tars import LinkerTARS  # noqa: F401
 from zshot.linker.linker_ensemble import LinkerEnsemble  # noqa: F401
+from zshot.linker.linker_relik import LinkerRelik  # noqa: F401
 from zshot.linker.linker_gliner import LinkerGLINER  # noqa: F401
diff --git a/zshot/linker/linker_relik.py b/zshot/linker/linker_relik.py
new file mode 100644
index 0000000..0cac8bc
--- /dev/null
+++ b/zshot/linker/linker_relik.py
@@ -0,0 +1,80 @@
+import contextlib
+import logging
+import pkgutil
+from typing import Iterator, List, Optional, Union
+
+from relik import Relik
+from relik.inference.data.objects import RelikOutput
+from relik.retriever.indexers.document import Document
+from spacy.tokens import Doc
+
+from zshot.config import MODELS_CACHE_PATH
+from zshot.linker.linker import Linker
+from zshot.utils.data_models import Span
+
+logging.getLogger("relik").setLevel(logging.ERROR)
+
+MODEL_NAME = "sapienzanlp/relik-entity-linking-large"
+
+
+class LinkerRelik(Linker):
+    """ Relik linker """
+
+    def __init__(self, model_name=MODEL_NAME):
+        super().__init__()
+
+        if not pkgutil.find_loader("relik"):
+            raise Exception("relik module not installed. You need to install relik in order to use the relik Linker. "
+ "Install it with: pip install relik") + + self.model_name = model_name + self.model = None + # self.device = { + # "retriever_device": self.device, + # "index_device": self.device, + # "reader_device": self.device + # } + + @property + def is_end2end(self) -> bool: + """ relik is end2end """ + return True + + def load_models(self): + """ Load relik model """ + # Remove RELIK print + with contextlib.redirect_stdout(None): + if self.model is None: + if self._entities: + self.model = Relik.from_pretrained(self.model_name, + cache_dir=MODELS_CACHE_PATH, + retriever=None, device=self.device) + else: + self.model = Relik.from_pretrained(self.model_name, + cache_dir=MODELS_CACHE_PATH, device=self.device, + index_device='cpu') + + def predict(self, docs: Iterator[Doc], batch_size: Optional[Union[int, None]] = None) -> List[List[Span]]: + """ + Perform the entity prediction + :param docs: A list of spacy Document + :param batch_size: The batch size + :return: List Spans for each Document in docs + """ + candidates = None + if self._entities: + candidates = [ + Document(text=ent.name, id=i, metadata={'definition': ent.description}) + for i, ent in enumerate(self._entities) + ] + + sentences = [doc.text for doc in docs] + + self.load_models() + span_annotations = [] + for sent in sentences: + relik_out: RelikOutput = self.model(sent, candidates=candidates) + span_annotations.append([Span(start=relik_span.start, end=relik_span.end, label=relik_span.label) + for relik_span in relik_out.spans]) + + return span_annotations diff --git a/zshot/tests/linker/test_gliner_linker.py b/zshot/tests/linker/test_gliner_linker.py index 24ea27a..42cbd77 100644 --- a/zshot/tests/linker/test_gliner_linker.py +++ b/zshot/tests/linker/test_gliner_linker.py @@ -13,7 +13,7 @@ @pytest.fixture(scope="module", autouse=True) def teardown(): - logger.warning("Starting smxm tests") + logger.warning("Starting gliner tests") yield True gc.collect() @@ -25,7 +25,7 @@ def test_gliner_download(): del linker.model, linker -def test_smxm_linker(): +def test_gliner_linker(): nlp = spacy.blank("en") gliner_config = PipelineConfig( linker=LinkerGLINER(), @@ -43,7 +43,7 @@ def test_smxm_linker(): del doc, nlp, gliner_config -def test_smxm_linker_no_entities(): +def test_gliner_linker_no_entities(): nlp = spacy.blank("en") gliner_config = PipelineConfig( linker=LinkerGLINER(), diff --git a/zshot/tests/linker/test_relik_linker.py b/zshot/tests/linker/test_relik_linker.py new file mode 100644 index 0000000..4d51d9a --- /dev/null +++ b/zshot/tests/linker/test_relik_linker.py @@ -0,0 +1,60 @@ +import gc +import logging + +import pytest +import spacy + +from zshot import PipelineConfig, Linker +from zshot.linker import LinkerRelik +from zshot.tests.config import EX_DOCS, EX_ENTITIES + +logger = logging.getLogger(__name__) + + +@pytest.fixture(scope="module", autouse=True) +def teardown(): + logger.warning("Starting relik tests") + yield True + gc.collect() + + +@pytest.mark.skip(reason="Too expensive to run on every commit") +def test_relik_download(): + linker = LinkerRelik() + linker.load_models() + assert isinstance(linker, Linker) + del linker.model, linker + + +@pytest.mark.skip(reason="Too expensive to run on every commit") +def test_relik_linker(): + nlp = spacy.blank("en") + relik_config = PipelineConfig( + linker=LinkerRelik(), + entities=EX_ENTITIES + ) + nlp.add_pipe("zshot", config=relik_config, last=True) + assert "zshot" in nlp.pipe_names + + doc = nlp(EX_DOCS[1]) + assert len(doc.ents) > 0 + del 
+    nlp.remove_pipe('zshot')
+    del doc, nlp, relik_config
+
+
+@pytest.mark.skip(reason="Too expensive to run on every commit")
+def test_relik_linker_no_entities():
+    nlp = spacy.blank("en")
+    relik_config = PipelineConfig(
+        linker=LinkerRelik(),
+        entities=[]
+    )
+    nlp.add_pipe("zshot", config=relik_config, last=True)
+    assert "zshot" in nlp.pipe_names
+
+    doc = nlp(EX_DOCS[1])
+    assert len(doc.ents) == 0
+    del nlp.get_pipe('zshot').linker.model, nlp.get_pipe('zshot').linker
+    nlp.remove_pipe('zshot')
+    del doc, nlp, relik_config
diff --git a/zshot/utils/download_models.py b/zshot/utils/download_models.py
index ce25422..f7ec006 100644
--- a/zshot/utils/download_models.py
+++ b/zshot/utils/download_models.py
@@ -21,6 +21,10 @@ def load_all():
         LinkerGLINER().load_models()
     except RuntimeError:
         pass
+    # try:
+    #     LinkerRelik().load_models()
+    # except RuntimeError:
+    #     pass
     try:
         RelationsExtractorZSRC().load_models()
     except RuntimeError: