How to Train your Own Model with NLTK and Stanford NER Tagger? (for English, French, German…) (2024)

This guide shows how to use NER tagging for English and non-English languages with NLTK and the Stanford NER tagger in Python. You can also use it to improve the Stanford NER tagger itself.

First and foremost, a few explanations: Natural Language Processing (NLP) is a field of machine learning that seeks to understand human languages. It’s one of the most difficult challenges Artificial Intelligence has to face. NLP covers many problems, from speech recognition and language generation to information extraction.

NLP provides specific tools to help programmers extract pieces of information from a given corpus. Here is a short list of the most common tasks: tokenization, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition.

NLTK (Natural Language Toolkit) is a wonderful Python package that provides a set of natural language corpora and APIs for an impressive diversity of NLP algorithms. It’s easy to use, complete, and well documented. Of course, it’s free, open source, and community-driven.

Let’s dive into Named Entity Recognition (NER). NER is about locating and classifying named entities in text in order to recognize places, people, dates, values, and organizations. As an example:

Twenty miles east of Reno, Nev., where packs of wild mustangs roam free through the parched landscape, Tesla Gigafactory 1 sprawls near Interstate 80. […] The Gigafactory, whose construction began in June 2014, is not only outrageously large but also on its way to becoming the biggest manufacturing plant on earth. Now 30 percent complete, its square footage already equals about 35 Costco stores. […] (NY Times, November 2017)

This guide will show you how to implement NER tagging for non-English languages using NLTK. Enjoy reading!

At Sicara, I recently had to build algorithms to extract names and organizations from a French corpus. As NLTK ships with the efficient Stanford Named Entity tagger, I thought it would do the work for me, out of the box.

But I was wrong: I forgot my corpus was French, and the Stanford NER tagger is designed for the English language only.

The only way to get it done is to train your own NER model. Use cases:

  • you are working with a non-English corpus (French, German, Dutch…);
  • you want to improve the Stanford English model.

I hope this step-by-step guide will help you.

Let’s start!

Because the Stanford NER tagger is written in Java, you need a Java Virtual Machine installed on your computer.

To do so, install Java JRE 8 or higher. You can install the Java JDK (developer kit) instead if you want, because it contains the JRE. Linux users will find everything they need in the guide How To Install Java with Apt-Get on Ubuntu 16.04. Other users should have a look at the official Java documentation.

Once installed, make sure your $JAVA_HOME environment variable is set:

echo $JAVA_HOME

Mine is /usr/lib/jvm/java-8-oracle. That’s it for Java!

If you haven’t done it yet, create a virtual environment to work on:

mkvirtualenv .venv-ner --python=/usr/bin/python3
workon .venv-ner

Install NLTK:

pip install nltk

Get the Stanford NER Tagger: download the stanford-ner-xxxx-xx-xx.zip file from the ‘Download’ section of the Stanford NLP website.

Unzip it, then move the tagger jar stanford-ner.jar (renamed to ner-tagger.jar) and the gzipped English model english.all.3class.distsim.crf.ser.gz into your application folder:

cd /home/charles/Downloads/
unzip stanford-ner-2017-06-09.zip
mv stanford-ner-2017-06-09/stanford-ner.jar {yourAppFolder}/stanford-ner-tagger/ner-tagger.jar
mv stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz {yourAppFolder}/stanford-ner-tagger/ner-model-english.ser.gz

We now have two files in our stanford-ner-tagger folder:

  • ner-tagger.jar: the NER tagger engine itself;
  • ner-model-english.ser.gz: the NER model trained on an English corpus.

Copy the following ner_english.py script to perform English Named Entity Recognition:
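The original embedded script is not reproduced here, so below is a minimal sketch of what ner_english.py can look like, using NLTK’s `StanfordNERTagger` wrapper. The file paths are assumptions based on the folder layout created above; adjust them to your own structure.

```python
# ner_english.py -- a minimal sketch (paths assume the layout built above)
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

MODEL = 'stanford-ner-tagger/ner-model-english.ser.gz'
JAR = 'stanford-ner-tagger/ner-tagger.jar'


def tag_text(sentence):
    """Tokenize a sentence and tag it with the Stanford English NER model.

    The tagger shells out to Java, so $JAVA_HOME must point at your JVM.
    """
    tagger = StanfordNERTagger(MODEL, JAR, encoding='utf-8')
    return tagger.tag(word_tokenize(sentence))


# Example call (requires the jar and model files to be in place):
# print(tag_text('Twenty miles east of Reno, Nev., Tesla Gigafactory 1 '
#                'sprawls near Interstate 80.'))
```

Note that `word_tokenize` needs the NLTK punkt data (`nltk.download('punkt')`) the first time you use it.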

Run it:

python ner_english.py

Output should be:

[('Twenty', 'O'), ('miles', 'O'), ('east', 'O'), ('of', 'O'), ('Reno', 'ORGANIZATION'), (',', 'O'), ('Nev.', 'LOCATION'), (',', 'O'), ('where', 'O'), ('packs', 'O'), ('of', 'O'), ('wild', 'O'), ('mustangs', 'O'), ('roam', 'O'), ('free', 'O'), ('through', 'O'), ('the', 'O'), ('parched', 'O'), ('landscape', 'O'), (',', 'O'), ('Tesla', 'ORGANIZATION'), ('Gigafactory', 'ORGANIZATION'), ('1', 'ORGANIZATION'), ('sprawls', 'O'), ('near', 'O'), ('Interstate', 'LOCATION'), ('80', 'LOCATION'), ('.', 'O'), ('The', 'O'), ('Gigafactory', 'O'), (',', 'O'), ('whose', 'O'), ('construction', 'O'), ('began', 'O'), ('in', 'O'), ('June', 'DATE'), ('2014', 'DATE'), (',', 'O'), ('is', 'O'), ('not', 'O'), ('only', 'O'), ('outrageously', 'O'), ('large', 'O'), ('but', 'O'), ('also', 'O'), ('on', 'O'), ('its', 'O'), ('way', 'O'), ('to', 'O'), ('becoming', 'O'), ('the', 'O'), ('biggest', 'O'), ('manufacturing', 'O'), ('plant', 'O'), ('on', 'O'), ('earth', 'O'), ('.', 'O'), ('Now', 'O'), ('30', 'PERCENT'), ('percent', 'PERCENT'), ('complete', 'O'), (',', 'O'), ('its', 'O'), ('square', 'O'), ('footage', 'O'), ('already', 'O'), ('equals', 'O'), ('about', 'O'), ('35', 'O'), ('Costco', 'ORGANIZATION'), ('stores', 'O'), ('.', 'O')]

Not bad at all! However, it is not perfect:

  • it does not detect all values, but these can easily be extracted using regular expressions;
  • it does not detect all named entities; to go further, you will have to train a more complete (or dataset-specific) model.
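One thing the tagger leaves to you: the output above is a flat list of (token, label) pairs, while you usually want entity spans. Consecutive tokens sharing a non-O label can be folded together with a small, tagger-independent helper, sketched below:

```python
def group_entities(tagged):
    """Group consecutive (token, label) pairs sharing a non-'O' label
    into (entity_text, label) spans."""
    entities = []
    current_tokens, current_label = [], None
    for token, label in tagged:
        if label == current_label and label != 'O':
            # Same entity continues: accumulate the token.
            current_tokens.append(token)
        else:
            # Entity boundary: flush the previous span if it was an entity.
            if current_label not in (None, 'O'):
                entities.append((' '.join(current_tokens), current_label))
            current_tokens, current_label = [token], label
    if current_label not in (None, 'O'):
        entities.append((' '.join(current_tokens), current_label))
    return entities


sample = [('Tesla', 'ORGANIZATION'), ('Gigafactory', 'ORGANIZATION'),
          ('1', 'ORGANIZATION'), ('sprawls', 'O'), ('near', 'O'),
          ('Interstate', 'LOCATION'), ('80', 'LOCATION'), ('.', 'O')]
print(group_entities(sample))
# → [('Tesla Gigafactory 1', 'ORGANIZATION'), ('Interstate 80', 'LOCATION')]
```

This naive grouping merges any adjacent same-label tokens, so two distinct entities of the same type with no token between them would be merged; for the plain label scheme used here, that is the best you can do.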

Now, you know how to run NER on an English corpus. What about other languages like French?

You need to train your own model. To do so, create a dummy-french-corpus.tsv file in {yourAppFolder}/stanford-ner-tagger/train with the following syntax:

En O
2017 DATE
, O
Une O
intelligence O
artificielle O
est O
en O
mesure O
de O
développer O
par O
elle-même O
Super PERSON
Mario PERSON
Bros PERSON
. O
Sans O
avoir O
eu O
accès O
au O
code O
du O
jeu O
, O
elle O
a O
récrée O
ce O
hit O
des O
consoles O
Nintendo ORGANIZATION
. O
Des O
chercheurs O
de O
l'Institut ORGANIZATION
de ORGANIZATION
Technologie ORGANIZATION
de O
Géorgie LOCATION
, O
aux O
Etats-Unis LOCATION
, O
viennent O
de O
la O
mettre O
à O
l'épreuve O
. O
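Typing such a file token by token is tedious. One way to bootstrap it is to tokenize your raw text and write every token with a default O label, then fix only the entity lines by hand in a text editor. Here is a sketch using a crude regex tokenizer (the file name is just an example; the regex is an assumption tuned to tokens like "l'Institut" and "elle-même" seen above):

```python
import re


def bootstrap_tsv(text, path):
    """Write one token per line with a default 'O' label (tab-separated),
    so only the lines that carry a named entity need hand correction."""
    # Crude tokenizer: words (keeping internal apostrophes/hyphens) or
    # single punctuation marks.
    tokens = re.findall(r"\w+(?:['’-]\w+)*|[^\w\s]", text)
    with open(path, 'w', encoding='utf-8') as f:
        for token in tokens:
            f.write(f"{token}\tO\n")
    return tokens


tokens = bootstrap_tsv("En 2017, une intelligence artificielle est née.",
                       "dummy-french-corpus.tsv")
print(tokens)
# → ['En', '2017', ',', 'une', 'intelligence', 'artificielle', 'est', 'née', '.']
```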

Create a prop.txt file in the same folder too:

trainFile = train/dummy-french-corpus.tsv
serializeTo = dummy-ner-model-french.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true

Train it, using:

cd stanford-ner-tagger/
java -cp ner-tagger.jar -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop train/prop.txt

This should output a dummy-ner-model-french.ser.gz file. Create a new ner_french.py script to use it:
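As above, the embedded script is not reproduced, so here is a minimal sketch: the same structure as ner_english.py, pointed at the freshly trained model (paths are assumptions matching the folder layout above).

```python
# ner_french.py -- a minimal sketch, mirroring ner_english.py
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

MODEL = 'stanford-ner-tagger/dummy-ner-model-french.ser.gz'
JAR = 'stanford-ner-tagger/ner-tagger.jar'


def tag_text(sentence):
    """Tokenize a French sentence and tag it with the trained model."""
    tagger = StanfordNERTagger(MODEL, JAR, encoding='utf-8')
    # word_tokenize accepts a language parameter for its punkt models.
    return tagger.tag(word_tokenize(sentence, language='french'))


# Example call (requires the jar and model files to be in place):
# print(tag_text("En 2017, une intelligence artificielle est en mesure "
#                "de développer par elle-même Super Mario Bros."))
```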

Run it:

python ner_french.py

The output seems to be right:

[('En', 'O'), ('2017', 'DATE'), (',', 'O'), ('une', 'O'), ('intelligence', 'O'), ('artificielle', 'O'), ('est', 'O'), ('en', 'O'), ('mesure', 'O'), ('de', 'O'), ('développer', 'O'), ('par', 'O'), ('elle-même', 'O'), ('Super', 'PERSON'), ('Mario', 'PERSON'), ('Bros.', 'O'), ('Sans', 'O'), ('avoir', 'O'), ('eu', 'O'), ('accès', 'O'), ('au', 'O'), ('code', 'O'), ('du', 'O'), ('jeu', 'O'), (',', 'O'), ('elle', 'O'), ('a', 'O'), ('récrée', 'O'), ('ce', 'O'), ('hit', 'O'), ('des', 'O'), ('consoles', 'O'), ('Nintendo', 'ORGANIZATION'), ('.', 'O'), ('Des', 'O'), ('chercheurs', 'O'), ('de', 'O'), ("l'Institut", 'ORGANIZATION'), ('de', 'ORGANIZATION'), ('Technologie', 'ORGANIZATION'), ('de', 'O'), ('Géorgie', 'LOCATION'), (',', 'O'), ('aux', 'O'), ('Etats-Unis', 'LOCATION'), (',', 'O'), ('viennent', 'O'), ('de', 'O'), ('la', 'O'), ('mettre', 'O'), ('à', 'O'), ("l'épreuve", 'O'), ('.', 'O')]

Congratulations, your model is trained! Of course, as the corpus we trained it on is tiny, it won’t succeed on a different text:

As you can see, none of the named entities have been caught:

[('La', 'O'), ('première', 'O'), ('Falcon', 'O'), ('Heavy', 'O'), ('de', 'O'), ("l'entreprise", 'O'), ('SpaceX', 'O'), (',', 'O'), ('la', 'O'), ('plus', 'O'), ('puissante', 'O'), ('fusée', 'O'), ('américaine', 'O'), ('jamais', 'O'), ('lancée', 'O'), ('depuis', 'O'), ('plus', 'O'), ('de', 'O'), ('quarante', 'O'), ('ans', 'O'), (',', 'O'), ('devrait', 'O'), ('bien', 'O'), ('emporter', 'O'), ('le', 'O'), ('roadster', 'O'), ('de', 'O'), ("l'entrepreneur", 'O'), ('américain', 'O'), (',', 'O'), ('mais', 'O'), ('sur', 'O'), ('une', 'O'), ('orbite', 'O'), ('bien', 'O'), ('différente', 'O'), ('.', 'O'), ('Elon', 'O'), ('Musk', 'O'), ('a', 'O'), ('le', 'O'), ('sens', 'O'), ('du', 'O'), ('spectacle', 'O'), ('.', 'O')]

You will need a bigger dataset to train on.

Two solutions:

  • you face a custom use case (specialized vocabulary or a need for high accuracy), and you write your own corpus.tsv file by labeling a large corpus yourself;
  • you want to perform regular NER, and you use an existing labeled corpus.

I have found this nice dataset (FR, DE, NL) that you can use: https://github.com/EuropeanaNewspapers/ner-corpora

Download the enp_FR.bnf.bio file into your train folder. In prop.txt, set trainFile = train/enp_FR.bnf.bio and serializeTo = trained-ner-model-french.ser.gz, then train your model again (this may take 10 minutes or more):

cd stanford-ner-tagger/
java -cp ner-tagger.jar -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop train/prop.txt

Run ner_french.py again (pointing it at the new model file):

[('La', 'O'), ('première', 'O'), ('Falcon', 'I-PER'), ('Heavy', 'I-PER'), ('de', 'O'), ("l'entreprise", 'O'), ('SpaceX', 'O'), (',', 'O'), ('la', 'O'), ('plus', 'O'), ('puissante', 'O'), ('fusée', 'O'), ('des', 'O'), ('Etats-Unis', 'I-LOC'), ('jamais', 'O'), ('lancée', 'O'), ('depuis', 'O'), ('plus', 'O'), ('de', 'O'), ('quarante', 'O'), ('ans', 'O'), (',', 'O'), ('devrait', 'O'), ('bien', 'O'), ('emporter', 'O'), ('le', 'O'), ('roadster', 'O'), ('de', 'O'), ("l'entrepreneur", 'O'), ('américain', 'O'), (',', 'O'), ('mais', 'O'), ('sur', 'O'), ('une', 'O'), ('orbite', 'O'), ('bien', 'O'), ('différente', 'O'), ('.', 'O'), ('Elon', 'I-PER'), ('Musk', 'I-PER'), ('a', 'O'), ('le', 'O'), ('sens', 'O'), ('du', 'O'), ('spectacle', 'O'), ('.', 'O')]

Now it looks better, though still not perfect!

Note: the output shows ‘I-PER’ instead of ‘PERSON’. The label set depends on how your training corpus is annotated.
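If you want the French model’s BIO-style labels to match the plain labels the English model emits, you can map them after tagging. The mapping below is an assumption covering only the PER/LOC/ORG tags seen in this corpus; extend it if your corpus uses more:

```python
# Assumed mapping from BIO-style tags to the English model's plain labels.
BIO_TO_PLAIN = {'I-PER': 'PERSON', 'B-PER': 'PERSON',
                'I-LOC': 'LOCATION', 'B-LOC': 'LOCATION',
                'I-ORG': 'ORGANIZATION', 'B-ORG': 'ORGANIZATION'}


def normalize_labels(tagged):
    """Replace BIO-style labels (e.g. 'I-PER') with plain labels
    (e.g. 'PERSON'); unknown labels, including 'O', pass through."""
    return [(token, BIO_TO_PLAIN.get(label, label))
            for token, label in tagged]


print(normalize_labels([('Elon', 'I-PER'), ('Musk', 'I-PER'), ('a', 'O')]))
# → [('Elon', 'PERSON'), ('Musk', 'PERSON'), ('a', 'O')]
```

Keep in mind this flattening discards the B-/I- boundary information, which is exactly what lets BIO schemes separate two adjacent entities of the same type.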

After a few hours on the Internet looking for tools or packages that could handle French NER tagging, I had to resign myself: the only software I found is FreeLing, which seems great but is written in C++ and seems rather hard to install.

Neither NLTK, spaCy, nor SciPy handles French NER tagging out of the box. Fortunately, you can train models for new languages, but the respective documentations are really light on that point.

Did you like this article? Feel free to comment or contact me.

  • FreeLing: an NLP tool written in C++ that works for many languages, including English, French, German, Spanish, Russian, Italian, and Norwegian;
  • spaCy: a really good NLP Python package with nice documentation, including a guide on adding a new language to spaCy;
  • NLTK (Natural Language Toolkit): a wonderful Python package that provides a set of natural language corpora and APIs for an impressive diversity of NLP algorithms;
  • Stanford NER tagger: the NER tagger open-sourced by Stanford engineers, usable from NLTK, and used in this tutorial.