Solution-driven Named Entity Recognition (NER) for helpful assistants
In my work, I often have to deal with unstructured text data, which can come in many forms such as emails, chat messages, or documents. Extracting structured information from this data is a challenging task, and Named Entity Recognition (NER) is a powerful tool that can help with it.
For more details on NER, I recommend the following:
- Advanced Article: Chapter 7 of the NLTK book
- Introductory Video: spaCy's Entity Recognition Model
Today, I would like to describe the pipeline I use to build a solution-driven NER system. This system is designed to extract structured information from unstructured text data and use it to build helpful assistants. It is based on spaCy's NER model and uses a combination of data collection, data augmentation, and model training to achieve its goals.
Mainly, the pipeline consists of the following steps:
- Data Collection: Combine multiple sources of structured text data into a single dataset used for training (and evaluating) the NER model.
- Data Augmentation: Make the collected prompts and data points more diverse and robust against real-world input.
- Data Set Generation: Combine the augmented prompts and data points into annotated training samples.
- Training: Train a custom NER model using the generated dataset.
Data Collection
The data collection pipeline is more or less straightforward.
To create a data set, we combine a set of text templates with a set of potential entities. The text templates are used to generate synthetic text data, while the entities are used to annotate the text data.
Let's say we want to build an assistant for movies that can look up information about a movie and find nearby theaters that play it.
For this, we would need to extract two types of entities: movie titles and locations. So a potential request by a user could look like this:
What movies are playing at the theater in New York?
In this case, the entities would be:
MOVIE_TITLE = null
LOCATION = New York
and the text template would be:
What movies are playing at the theater in [LOCATION]?
Another example could be:
Is "The Matrix" playing at the theater in San Francisco?
here we have
MOVIE_TITLE = The Matrix
LOCATION = San Francisco
and the template
Is "[MOVIE_TITLE]" playing at the theater in [LOCATION]?
And last but not least, we could have a request like this:
Tell me more about "The Matrix".
with the entities
MOVIE_TITLE = The Matrix
LOCATION = null
and the template
Tell me more about "[MOVIE_TITLE]".
After spending some time collecting these examples, we will end up with three data sets.
For my pipeline, I organized them in individual .csv files, where each row represents a single example.
Continuing with the example above, the data sets would look like this:
movie_titles.csv
MOVIE_TITLE
The Matrix
Pulp Fiction
Parasite
...
locations.csv
LOCATION
New York
San Francisco
Los Angeles
Mexico City
Barcelona
...
and movie_requests.csv
TEMPLATE
What movies are playing at the theater in [LOCATION]?
Is "[MOVIE_TITLE]" playing at the theater in [LOCATION]?
Tell me more about "[MOVIE_TITLE]".
Do you have any information about "[MOVIE_TITLE]"?
Any movies playing in [LOCATION]?
What's playing at the theater in [LOCATION]?
...
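To give a concrete idea of how these files are consumed, here is a minimal sketch of loading them into plain Python lists with the standard csv module (the file and column names follow the example above; the actual pipeline may load them differently):
import csv

def load_column(path, column):
    # read a single column from a CSV file into a list of non-empty strings
    with open(path, newline='', encoding='utf-8') as f:
        return [row[column] for row in csv.DictReader(f) if row[column].strip()]

movie_titles = load_column('movie_titles.csv', 'MOVIE_TITLE')
locations = load_column('locations.csv', 'LOCATION')
prompts = load_column('movie_requests.csv', 'TEMPLATE')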
Amazing, now we have our base data set, which we will continue to work with in the following sections.
But hold on a minute, we missed something quite important along the way.
How do we generate a good amount of these data points?
That's a good question! One way we will make the data more diverse is by using data augmentation techniques in the next section.
But what's a good base for data augmentation?
I noticed through personal experiments that the bigger the base data set, the better the results. In other words, spend more time engineering these templates and entities, and you will be rewarded with a better-performing model.
One way to get started is prompting one of the freely available LLMs like ChatGPT or GPT-3 with a few examples and letting it generate more examples for you.
They're also quite good at translating and creating multilingual data sets.
In terms of data points such as movie titles and locations, you can use APIs like the OMDb API or dr5hn/countries-states-cities-database to get a good starting point.
Keep in mind to also translate your data points to other languages in case you want to support multiple languages.
Data Augmentation
In this section I will give you some insights into the data augmentation process used in my pipeline.
Prompt Augmentation
In my pipeline, I differentiate between two types of data augmentation: augmentation of prompts, and augmentation of data points/samples.
First, I will explain the augmentation of prompts.
One seemingly popular way to augment text data is translating it to other languages and then translating it back to the original language. If you repeat this process for a few iterations, you end up with a more diverse data set. However, the quality of your data set might decrease with this approach, which is why I didn't use it.
Another way to augment text data is replacing certain words with synonyms. But this requires a good understanding of the language you're dealing with to know which words to replace, as well as a big dictionary of synonyms. So I opted against this approach too.
Instead, I decided to collect a big set of templates and data points and then make them more robust against human nature.
See, the main issue when we deal with human input is that humans constantly make mistakes. So instead of improving the variety of samples in our data set, we generate a wide variety of misspelled, faulty samples.
We're not trying to generate coherent text, we're trying to find information in a mess.
To do that, I used the following techniques, which I will explain and show in detail below.
- Removal of punctuation
- Reduction and substitution of words
- Lowercasing
- Prefixing
- Typos
For each augmentation step, I assigned a probability; you can find my default settings here:
punctuation_prob = 10
reduction_prob = 10
substitute_prob = 10
lowercase_prob = 10
prefix_prob = 5
typo_prob = 10
The probabilities add up to 55%, meaning that for every 100 original prompts, roughly 55 augmented variants are generated alongside them.
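Each probability describes the share of prompts a given augmentation is applied to. As a small sketch of that sampling scheme (sample_by_prob is a name of my own, not part of the pipeline code below):
import random

def sample_by_prob(items, prob):
    # return a random subset containing roughly `prob` percent of `items`
    return random.sample(items, len(items) * prob // 100)

# e.g. the prompts whose punctuation will be removed
# selected = sample_by_prob(prompts, punctuation_prob)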
Removal of punctuation
Code:
import random

# your prompts
prompts = [
    'What movies are playing at the theater in [LOCATION]?',
    ...
]

# punctuation characters to consider
punctuation = ['.', ',', '!', '?', ';', ':']

# filter prompts that contain punctuation
punctuated_prompts = list(filter(
    lambda x: any([p in x for p in punctuation]),
    prompts
))

# select N random prompts to remove punctuation from
punctuated_prompts = random.sample(
    punctuated_prompts,
    int((len(punctuated_prompts) * punctuation_prob) // 100)
)

# remove punctuation from the selected prompts
no_punctuated_prompts = []
for prompt in punctuated_prompts:
    p = prompt
    for punc in punctuation:
        p = p.replace(punc, '')
    no_punctuated_prompts.append(p)
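For example, the template "What movies are playing at the theater in [LOCATION]?" simply becomes "What movies are playing at the theater in [LOCATION]".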
Reduction and substitution of words
The reduction part removes a random word from the prompt, while the substitution part swaps two randomly chosen words within the prompt.
To select a random element from a list of objects, I used the following function:
def rng_elem(objs, exclude=None):
    # skip entity tags, whitespace, very short words and an optionally excluded word
    excluded_elems = [exclude, '[', ' ']
    rn = random.choice(objs)
    while rn in excluded_elems or '[' in rn or len(rn) < 2:
        rn = random.choice(objs)
    return rn
Here's the code for the reduction and substitution part:
prompt_backup = prompts.copy()
reduced_prompts = []
substitute_prompts = []
for prompt in prompt_backup:
    split_prompt = prompt.split(' ')
    if len(split_prompt) < 3:
        continue
    rn1 = rng_elem(split_prompt)
    rn2 = rng_elem(split_prompt, rn1)
    # reduction: drop every occurrence of the first random word
    reduced_prompts.append(' '.join([i for i in split_prompt if i != rn1]))
    # substitution: swap the two random words
    replaced_prompt = prompt.replace(rn2, '[[[REPLACEME]]]')
    replaced_prompt = replaced_prompt.replace(rn1, rn2)
    replaced_prompt = replaced_prompt.replace('[[[REPLACEME]]]', rn1)
    substitute_prompts.append(replaced_prompt)
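To give an example: if the random choices happen to land on rn1 = "theater" and rn2 = "playing" for the template 'Is "[MOVIE_TITLE]" playing at the theater in [LOCATION]?', the reduced version is 'Is "[MOVIE_TITLE]" playing at the in [LOCATION]?' and the substituted version is 'Is "[MOVIE_TITLE]" theater at the playing in [LOCATION]?'.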
Lowercasing
This one is straightforward: some prompts will be converted to lowercase, while the entity tags keep their original casing.
Code:
import re

prompt_backup = prompts.copy()
lower_cased_prompts = []
for prompt in prompt_backup:
    # find the entity tags before lowercasing
    results = re.findall(r'(\[[A-Za-z_1]+\])', prompt)
    new_prompt = prompt.lower()
    # replace lowercased tags with the original tags
    for tag in results:
        new_prompt = new_prompt.replace(tag.lower(), tag)
    lower_cased_prompts.append(new_prompt)
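For example, "What movies are playing at the theater in [LOCATION]?" becomes "what movies are playing at the theater in [LOCATION]?", with the [LOCATION] tag restored to its original form.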
Prefixing
Prefixing is a technique where we add a randomly selected phrase to the beginning of the prompt.
This introduces a new tag into the prompt, which will later be replaced when the base data set gets generated. If no set of prefixes is available, the final generation code simply discards the tag and removes it from the prompts.
Code:
prefixed_prompts = []
prefix_token = '[PREFIX]'
for prompt in prompt_backup:
    new_prompt = f'{prefix_token} {prompt}'
    # make sure all entity tags stay in their original (uppercase) form
    results = re.findall(r'(\[[A-Za-z_1]+\])', new_prompt)
    for tag in results:
        new_prompt = new_prompt.replace(tag.lower(), tag)
    prefixed_prompts.append(new_prompt)
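The prefix phrases themselves can be maintained just like the other data points. As a purely hypothetical illustration (this file is not part of the data sets described above), a prefixes.csv could look like this:
PREFIX
Hey,
Quick question:
I was wondering,
Good evening,
...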
Typos
Last but not least, we introduce typos into the prompts.
For this, I use the following method to generate typos:
def introduce_typos(text, typo_prob=10):
    """
    Introduce typos into the input text with a given probability.

    Parameters:
    text (str): The input text.
    typo_prob (float): Probability of introducing a typo. [0-100%]

    Returns:
    str: Text with introduced typos.
    """
    # Define a dictionary of common typos (neighboring keys on a QWERTY keyboard)
    typos = {
        'a': ['q', 's', 'z'],
        'b': ['v', 'g', 'h', 'n'],
        'c': ['x', 'v', 'f'],
        'd': ['s', 'e', 'f', 'c'],
        'e': ['w', 's', 'd', 'r'],
        'f': ['d', 'r', 'g', 'v'],
        'g': ['f', 't', 'h', 'b'],
        'h': ['g', 'y', 'j', 'n'],
        'i': ['u', 'o', 'k', 'j'],
        'j': ['h', 'u', 'k', 'm'],
        'k': ['j', 'i', 'l', 'o'],
        'l': ['k', 'o', 'p'],
        'm': ['n', 'j', 'k'],
        'n': ['b', 'h', 'j', 'm'],
        'o': ['i', 'p', 'l', 'k'],
        'p': ['o', 'l'],
        'q': ['w', 'a'],
        'r': ['e', 't', 'f', 'd'],
        's': ['a', 'w', 'd', 'x'],
        't': ['r', 'y', 'g', 'f'],
        'u': ['y', 'i', 'j', 'h'],
        'v': ['c', 'f', 'g', 'b'],
        'w': ['q', 's', 'e'],
        'x': ['z', 'c', 'd', 's'],
        'y': ['t', 'u', 'h', 'g'],
        'z': ['a', 'x', 's'],
        ' ': [' ']
    }
    # Convert text to lowercase for simplicity
    text = text.lower()
    # Introduce typos with the given probability (typo_prob is given in percent)
    typo_text = ''
    for char in text:
        if random.random() * 100 < typo_prob and char in typos:
            char = random.choice(typos[char])
        typo_text += char
    return typo_text
The augmentation code is quite simple and can be found in the following snippet:
typo_prompts = []
for prompt in prompt_backup:
    typo_prompts.append(introduce_typos(prompt, typo_prob))
This concludes the prompt augmentation part of the data augmentation process.
Data Point Augmentation
The augmentation of data points uses a similar approach, but is not yet as flexible to configure as the template augmentation.
The process is quite simple: we take each data point and apply the following techniques:
- clean sample
- create lemmatized version of sample
- create stemmed version of sample
- create randomly shuffled version by shuffling words in sample
- create randomly removed version by removing a random word from the sample
- add lowercase version of the sample with 50% probability
We're using nltk to stem and lemmatize each sample.
Don't forget to run
import nltk
nltk.download('wordnet')
nltk.download('punkt')
at the beginning of your script.
Apart from this, we use the following helpers in the augmentation process to clean samples, i.e. to remove emojis:
import re

RE_EMOJI = re.compile(r'[\U00010000-\U0010ffff]', flags=re.UNICODE)

def clean_element(elem):
    if isinstance(elem, list):
        elem_out = []
        for record in elem:
            record = RE_EMOJI.sub('', record)
            elem_out.append(record.strip())
        return elem_out
    else:
        elem = RE_EMOJI.sub('', elem)
        return [elem.strip()]

def clean_elements(dataset):
    out = set()
    for elem in filter(bool, dataset):
        out |= set(clean_element(elem))
    return list(out)
clean_element can easily be extended to remove other unwanted characters or symbols, e.g. HTML tags.
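For example, stripping simple HTML tags only needs one more regular expression. This is a sketch of such an extension, not part of the original pipeline (RE_HTML and clean_string are my own names):
RE_HTML = re.compile(r'<[^>]+>')

def clean_string(record):
    # remove emojis and simple HTML tags, then trim surrounding whitespace
    record = RE_EMOJI.sub('', record)
    record = RE_HTML.sub('', record)
    return record.strip()
clean_element would then call clean_string for both the list and the single-string case.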
With this in place, we can now augment the samples.
Code:
import random
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

def augment_samples(
    internal_datasets: list[list[str]],
    samples_per_dataset: int
):
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    datasets = internal_datasets.copy()
    augmented_datasets = []
    for samples in datasets:
        augmented_samples = []
        cleaned_samples = clean_elements(samples)
        for sample in cleaned_samples:
            # lemmatize the sample
            toks = nltk.word_tokenize(sample)
            lemmatized = ' '.join([lemmatizer.lemmatize(t) for t in toks])
            augmented_samples.append(lemmatized)
            # stem the sample
            stemmed = ' '.join([stemmer.stem(t) for t in toks])
            augmented_samples.append(stemmed)
            # randomly shuffle the words in the sample
            s = sample.split(' ')
            s = random.sample(s, len(s))
            augmented_samples.append(' '.join(s))
            # randomly remove a word from the sample
            s = sample.split()
            e = random.choice(list(range(len(s))))
            augmented_samples.append(' '.join(s[:e] + s[e+1:]))
            # add a lowercase version of the sample with 50% probability
            if random.random() < 0.5:
                augmented_samples.append(sample.lower())
        # mix augmented and original samples and cap the data set size
        ds = augmented_samples + samples
        random.shuffle(ds)
        augmented_datasets.append(ds)
    return [s[:samples_per_dataset] for s in augmented_datasets]
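As a usage sketch, augmenting the movie titles and locations from the data collection step could look like this (the target of 500 samples per data set is an arbitrary choice for illustration; movie_titles and locations are the lists loaded earlier):
movie_titles_aug, locations_aug = augment_samples(
    [movie_titles, locations],
    samples_per_dataset=500
)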
This concludes all augmentation techniques used in my pipeline.
Data Set Generation
The final step to generate the data set is to combine the augmented prompts with the augmented data points.
For this we basically iterate over the prompts and data points and replace the tags in the prompts with the data points.
To make use of this process, we need to define a structure in which the final data set will be stored.
Each of the generated samples in the final data set looks like this:
[
  [
    "I'm looking for cinemas in New York that play \"The Matrix\" tonight. Any leads?",
    {
      "entities": [
        [47, 57, "MOVIE_TITLE"],
        [27, 35, "LOCATION"]
      ]
    }
  ]
]
A tuple of the final sample string and the entities that are present in it. Each entity is a 3-element tuple of START, END and TOKEN, where START denotes the starting character index of the named entity of type TOKEN in the sample and END the (exclusive) ending index.
I leave the full implementation of this step to you, as it is quite straightforward and basically just string search and replacement.
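That said, here is a minimal sketch of what such a generation step could look like. generate_sample is a hypothetical helper of my own, not the exact code used in the pipeline; it fills one template left to right and records the character offsets of the inserted entities:
import random
import re

TAG_RE = re.compile(r'\[([A-Z_]+)\]')

def generate_sample(template, fillers):
    """Fill a template left to right and record the character spans of inserted entities."""
    text = template
    entities = []
    match = TAG_RE.search(text)
    while match:
        tag = match.group(1)
        value = random.choice(fillers[tag])
        start = match.start()
        # splice the entity value into the text and remember its span (end index is exclusive)
        text = text[:start] + value + text[match.end():]
        entities.append((start, start + len(value), tag))
        match = TAG_RE.search(text, start + len(value))
    return text, {'entities': entities}

# deterministic example with a single candidate per tag
sample = generate_sample(
    'Is "[MOVIE_TITLE]" playing at the theater in [LOCATION]?',
    {'MOVIE_TITLE': ['The Matrix'], 'LOCATION': ['San Francisco']}
)
# ('Is "The Matrix" playing at the theater in San Francisco?',
#  {'entities': [(4, 14, 'MOVIE_TITLE'), (42, 55, 'LOCATION')]})
With a single candidate per tag, the output is deterministic and matches the structure shown above.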
Finally, we can save the generated data set to files, e.g. train.spacy and dev.spacy.
We do that by using spaCy's DocBin class.
Unfortunately, this results in some samples being discarded due to the nature of the data set generation process.
These samples are usually useless anyway, as they contain errors or are not properly formatted, e.g. all whitespace is missing.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank('en')

def create_training(TRAIN_DATA, name):
    db = DocBin()
    skipped = 0
    total = 0
    for text, annotation in TRAIN_DATA:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annotation['entities']:
            span = doc.char_span(start, end, label=label, alignment_mode='expand')
            if span is None:
                skipped += 1
            else:
                ents.append(span)
        total += len(annotation['entities'])
        try:
            doc.ents = ents
        except ValueError:
            # overlapping spans can't be assigned, skip them
            skipped += 1
        db.add(doc)
    print(f'Skipped {skipped}/{total} entities')
    return db
Assuming we have a list of samples TRAIN_DATA in the format mentioned above, we can now generate the final data set.
# train/validation split of the data set, for simplicity we use 80/20
train_set = TRAIN_DATA[:int(len(TRAIN_DATA) * 0.8)]
valid_set = TRAIN_DATA[int(len(TRAIN_DATA) * 0.8):]
dataset = create_training(train_set, 'train')
dataset.to_disk('train.spacy')
dataset = create_training(valid_set, 'dev')
dataset.to_disk('dev.spacy')
Model Training
As initially mentioned, the whole pipeline is based on the spaCy library. Hence, the final training data set conforms to the spaCy format, which allows us to use the spacy CLI for the training process.
To select a configuration, we can head over to the spaCy documentation and look for the NER training section.
In the Quickstart section, you can find a widget to generate a configuration file for your training process.
I usually select English as the language, ner as the component, CPU under hardware, and accuracy as the optimization target.
This generates a base configuration file that we can further tweak to our needs.
Let's assume we saved it as base_config.cfg. And from our data set generation process we have two files, train.spacy and dev.spacy.
Following the SpaCy documentation, we can now train our model with the following commands:
# generate final config file
python -m spacy init fill-config base_config.cfg config.cfg
# train the model
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
After some time, the model will be trained and saved in the output directory.
Great, now we have a model that can extract named entities from text data!
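As a quick sanity check, we can load the best checkpoint like any other spaCy pipeline (spaCy's training writes model-best and model-last into the output directory) and run it on one of our example requests:
import spacy

nlp = spacy.load('./output/model-best')
doc = nlp('Is "The Matrix" playing at the theater in San Francisco?')
for ent in doc.ents:
    print(ent.text, ent.label_)
# ideally: "The Matrix" as MOVIE_TITLE and "San Francisco" as LOCATION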
Conclusion
spaCy is amazing and a great library for building NER models.
There's quite a lot of work that can be done on the data augmentation side to improve the variety of the data set further.
Sample augmentation is less developed and should be the first point to tackle.
I invite you to experiment with the pipeline and see how it can be improved. It should be a robust starting point for future endeavours.