This repository was archived by the owner on Feb 21, 2025. It is now read-only.

Developing Regex Annotators

James Baker edited this page Jul 4, 2017 · 3 revisions

This guide will walk you through the development of a regular expression, or regex, annotator in Baleen. Annotators are the Baleen components that extract information and entities from content being passed through the pipeline. Regular expressions are a way of expressing a pattern that might appear in text. For example, we couldn't list every possible e-mail address, but we can describe a pattern that represents all e-mail addresses.

We will be developing a regex annotator to annotate all .com e-mail addresses. That is, any e-mail address that ends in .com will be annotated as a CommsIdentifier.

Configuring Dependencies

As we are developing a new annotator, we need to ensure we have a dependency on the baleen-annotators module, as this will provide many of the base and utility classes that we will use as well as access to other common dependencies. To do this, we need to add the following to our POM file:

<dependency>
    <groupId>uk.gov.dstl.baleen</groupId>
    <artifactId>baleen-annotators</artifactId>
    <version>2.4.0</version>
</dependency>

Creating the Class

To start with, let's create a new Java class called ComEmail which extends AbstractRegexAnnotator. The AbstractRegexAnnotator class provides most of the functionality for us, and we just need to provide a small amount of code to specify what the regular expression is, and what we should do when we find one.

We will create it in the uk.gov.dstl.baleen.annotators.guides package to keep it separate from existing annotators.

package uk.gov.dstl.baleen.annotators.guides;

import java.util.regex.Matcher;

import org.apache.uima.jcas.JCas;

import uk.gov.dstl.baleen.annotators.regex.helpers.AbstractRegexAnnotator;
import uk.gov.dstl.baleen.core.pipelines.orderers.AnalysisEngineAction;
import uk.gov.dstl.baleen.types.common.CommsIdentifier;

public class ComEmail extends AbstractRegexAnnotator<CommsIdentifier> {

	public ComEmail() {
	
	}
	
	@Override
	protected CommsIdentifier create(JCas jCas, Matcher matcher) {
		return null;
	}
	
	@Override
	public AnalysisEngineAction getAction() {
		return new AnalysisEngineAction(null);
	}

}

You will notice in the above code that we have had to tell AbstractRegexAnnotator what type of annotation we will be returning. We have also created a constructor, overridden the create(JCas, Matcher) function, and added a placeholder getAction() function. It is the create(JCas, Matcher) function that we will use to create our annotation and set any annotator-specific properties.

The Constructor

The constructor of AbstractRegexAnnotator, our super constructor, will do most of the hard work for us, so we just need to pass it the correct arguments. There are several constructors available on AbstractRegexAnnotator, but we will use the one that takes:

  • The regular expression
  • A flag to indicate whether the regular expression is case sensitive
  • The confidence to assign to our regular expression

private static final String COM_EMAIL_REGEX = "[A-Z0-9._%+-]+@([A-Z0-9.-]+\\.com)";

public ComEmail() {
	super(COM_EMAIL_REGEX, false, 1.0);
}
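To see what this pattern matches before wiring it into Baleen, it can be exercised with plain java.util.regex. The following is a minimal sketch, not Baleen code: the class name ComEmailRegexDemo and the helper findEmails are hypothetical, and it assumes that passing false for the case sensitivity flag is equivalent to compiling the pattern with Pattern.CASE_INSENSITIVE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone harness for trying out the pattern; not part of Baleen.
class ComEmailRegexDemo {

    // Same pattern string as COM_EMAIL_REGEX in the guide.
    static final String COM_EMAIL_REGEX = "[A-Z0-9._%+-]+@([A-Z0-9.-]+\\.com)";

    // Assumption: a case sensitivity flag of false corresponds to CASE_INSENSITIVE.
    static final Pattern PATTERN = Pattern.compile(COM_EMAIL_REGEX, Pattern.CASE_INSENSITIVE);

    /** Returns every substring of text that the pattern matches, in order of appearance. */
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Contact Alice@Example.com or bob@example.org, not carol@test.co.uk.";
        System.out.println(findEmails(text));
    }
}
```

With plain java.util.regex, only Alice@Example.com is found in the sample text. Note that the pattern has no trailing boundary, so in this harness an input such as dave@example.community yields the match dave@example.com; adding a word boundary (\\b) after com would be one way to tighten it if that matters for your data.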

The create(JCas, Matcher) Function

AbstractRegexAnnotator requires us to override the create(JCas, Matcher) function in order to create the annotation. This function should return an annotation of the correct type, with any properties specific to that type set (for example, the sub type in our case). You do not need to set standard properties like the value and begin/end, as these will be set for you by AbstractRegexAnnotator. In fact, if you do try to set these, they will be overwritten.

The Matcher object provided is built from the pattern passed to the constructor and holds the match for the current entity. This may be useful if you need to set annotation properties based on some logic involving the current match, but it is not needed in this example, so we will ignore it.

@Override
protected CommsIdentifier create(JCas jCas, Matcher matcher) {
	CommsIdentifier ci = new CommsIdentifier(jCas);
	ci.setSubType("email");
	return ci;
}
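Although this example ignores the Matcher, the pattern does contain a capture group around the domain, which shows the kind of match-dependent logic a create(JCas, Matcher) implementation could use. The sketch below uses only java.util.regex; the class MatcherGroupDemo and the helper firstDomain are hypothetical names, and it assumes that the Matcher handed to create(JCas, Matcher) is positioned on the current match so that group(1) can be read in the same way.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone harness showing capture-group access; not part of Baleen.
class MatcherGroupDemo {

    // Same pattern as the guide; case-insensitive matching is an assumption about the flag.
    static final Pattern PATTERN =
            Pattern.compile("[A-Z0-9._%+-]+@([A-Z0-9.-]+\\.com)", Pattern.CASE_INSENSITIVE);

    /** Returns the lower-cased domain (capture group 1) of the first match, or null if none. */
    static String firstDomain(String text) {
        Matcher matcher = PATTERN.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        // group() is the whole match; group(1) is the parenthesised domain part.
        return matcher.group(1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(firstDomain("Mail Alice@Example.com today"));
    }
}
```

Inside an annotator, a value such as this domain could be used to decide which properties to set on the annotation before returning it.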

The getAction() Function

As of Baleen 2.4, all BaleenAnnotators are required to override the getAction() function, which provides pipeline orderers with information about the inputs and outputs of each annotator. This function is called after doInitialise(), so it can take configuration parameters into account when determining what a specific instance's inputs and outputs will be.

In our case, we have no inputs and a single output of CommsIdentifier:

@Override
public AnalysisEngineAction getAction() {
	// Needs imports for java.util.Collections and com.google.common.collect.ImmutableSet (Guava)
	return new AnalysisEngineAction(Collections.emptySet(), ImmutableSet.of(CommsIdentifier.class));
}

Conclusion

And that's it! We now have a fully working annotator that will find e-mail addresses ending in .com. Most of the hard work is done behind the scenes, and we just need to provide a constructor and implement two simple functions, create(JCas, Matcher) and getAction(), to get our annotator up and running.

To really finish it off, we should provide documentation and unit tests - but that is left as an exercise for the reader!
