Basic natural language processing using spaCy

This section introduces you to some basic tasks in natural language processing.

After reading this section, you should:

  • know some of the key concepts and tasks in natural language processing (NLP)

  • learn how to perform simple NLP tasks in Python using the spaCy library

Getting started

To get started, we import spaCy, one of the many libraries available for natural language processing in Python.

[1]:
import spacy

To perform natural language processing tasks for a given language, we must load a language model that has been trained to perform these tasks for the language in question.

spaCy supports many languages, but provides pre-trained language models for only some of them.

spaCy language models come in different sizes and flavours. We will explore these models and their differences later.

To get acquainted with basic tasks in natural language processing, we will start with a small language model for the English language.

Language models are loaded using spaCy’s load() function, which takes the name of the model as input.

[2]:
# Load the small language model for English and assign it to the variable 'nlp'
nlp = spacy.load('en_core_web_sm')

# Call the variable to examine the object
nlp
[2]:
<spacy.lang.en.English at 0x11fbf8b80>

Calling the variable nlp returns a spaCy Language object that contains a language model for the English language.

Essentially, spaCy’s Language object is a pipeline that uses the language model to perform a number of natural language processing tasks. We will return to these tasks shortly.
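
Since the Language object is a pipeline, we can also inspect the names of the processing components it contains. The sketch below assumes a recent version of spaCy; the exact component names vary between models and versions.

[ ]:
# Print the names of the components in the processing pipeline; these
# typically include a part-of-speech tagger, a dependency parser and a
# named entity recogniser, depending on the model
nlp.pipe_names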

What is a language model?

Most modern language models are based on statistics instead of human-defined rules.

Statistical language models are based on calculating probabilities, e.g.:

  • What is the probability of a given sentence occurring in a language?

  • How likely is a given word to occur next in a sequence of words?

Consider the following sentences from the news articles from the previous sections:

  • From financial exchanges in HIDDEN Manhattan to cloakrooms in Washington and homeless shelters in California, unfamiliar rituals were the order of the day.

  • Security precautions were being taken around the HIDDEN as the deadline for Iraq to withdraw from Kuwait neared.

You can probably make informed guesses about the HIDDEN words based on your knowledge of the English language and the world in general.

Similarly, creating a statistical language model involves observing the occurrence of words in large corpora and calculating their probabilities of occurrence in a given context. The language model can then be trained by making predictions and adjusting the model based on the errors made during prediction.
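
To make this concrete, the following sketch estimates a simple conditional probability by counting adjacent word pairs in a toy corpus. The corpus and the variable names are invented purely for illustration.

[ ]:
from collections import Counter

# A toy corpus, invented purely for illustration
toy_corpus = "the deadline neared . the deadline passed . the day neared ."

# Split the corpus into words and count single words and adjacent pairs
words = toy_corpus.split()
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))

# Estimate the probability that 'deadline' follows 'the' by dividing the
# count of the pair ('the', 'deadline') by the count of 'the'
print(bigrams[('the', 'deadline')] / unigrams['the'])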

How are language models trained?

The spaCy language model for English, for instance, is trained on a corpus called OntoNotes 5.0, which features texts from different genres such as newswire text, broadcast news, broadcast and telephone conversations and blogs.

This allows the corpus to cover linguistic variation in both written and spoken English.

The OntoNotes 5.0 corpus consists of more than just plain text: the annotations include part-of-speech tags, syntactic dependencies and co-references between words.

This allows modelling not just the occurrence of particular words or their sequences, but their grammatical features as well.

Performing basic NLP tasks using spaCy

To process text using the Language object containing the language model for English, we simply call the Language object nlp on some text.

Let’s begin by defining a simple test sentence, a Python string object that is stored under the variable text.

As usual, we can print out the contents by calling the variable.

[3]:
# Assign an example sentence to the variable 'text'
text = "The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday."

# Call the variable to examine the output
text
[3]:
'The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday.'

Passing the variable text to the Language object nlp returns a spaCy Doc object, short for document.

In natural language processing, longer pieces of text are commonly referred to as documents, although in this case our document consists of a single sentence.

This object contains both the input text stored under text and the results of natural language processing using spaCy.

[4]:
# Feed the string object under 'text' to the Language object under 'nlp'
# Store the result under the variable 'doc'
doc = nlp(text)

The Doc object is now stored under the variable doc.

[5]:
# Call the variable to examine the object
doc
[5]:
The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday.

Calling the variable doc returns the contents of the object.

Although the output resembles that of a Python string, the Doc object contains a wealth of information about its linguistic structure, which was generated by passing the text through the spaCy NLP pipeline.

We will now examine the tasks that were performed under the hood after we provided the input sentence to the language model.

Tokenization

What takes place first is a task known as tokenization, which breaks the text down into analytical units in need of further processing.

In most cases, a token corresponds to a word delimited by whitespace, but punctuation marks are also treated as independent tokens. Because computers treat words as sequences of characters, assigning punctuation marks to their own tokens prevents trailing punctuation from attaching to the words that precede them.
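
To see the difference, the sketch below compares naive whitespace splitting with spaCy’s tokeniser for the example sentence: splitting on whitespace alone leaves the comma and the full stop attached to the preceding words.

[ ]:
# Python's built-in split() method breaks the text on whitespace only,
# so the comma and the full stop remain attached to the preceding words
print(text.split()[-6:])

# spaCy's tokeniser places the punctuation marks in tokens of their own
print([token.text for token in doc][-8:])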

The diagram below outlines the tasks that spaCy can perform after a text has been tokenised, such as part-of-speech tagging, syntactic parsing and named entity recognition.

The spaCy pipeline from https://spacy.io/usage/linguistic-features#section-tokenization

A spaCy Doc object consists of a sequence of Token objects, which store the results of various natural language processing tasks.

Let’s print out each Token object stored in the Doc object doc.

[27]:
# Loop over items in the Doc object, using the variable 'token' to refer to items in the list
for token in doc:

    # Print each token
    print(token)
The
Federal
Bureau
of
Investigation
has
been
ordered
to
track
down
as
many
as
3,000
Iraqis
in
this
country
whose
visas
have
expired
,
the
Justice
Department
said
yesterday
.

The output shows one Token per line. As expected, punctuation marks such as ‘.’ and ‘,’ constitute their own Tokens.
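
Because the Doc object behaves like a sequence, individual Tokens can also be retrieved by their index, and the number of Tokens can be counted using Python’s len() function.

[ ]:
# Retrieve the first Token in the Doc by its index
print(doc[0])

# Count the number of Tokens in the Doc
print(len(doc))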

Part-of-speech tagging

Part-of-speech (POS) tagging is a task that involves determining the word class of a token. This is crucial for disambiguation, because different parts of speech may have similar forms.

Consider the example: The sailor dogs the hatch.

The third-person present tense of the verb dog (to fasten something securely) is identical in form to the plural of the noun dog.

To identify the correct word class, we must examine the context in which the word appears.

To access the results of POS tagging, let’s loop over the Doc object doc and print each Token and its associated part-of-speech tag, which is stored under its attribute pos_.

We can access the attributes of an object by inserting the attribute after the object and separating them with a full stop, e.g. token.pos_.

[7]:
# Loop over items in the Doc object, using the variable 'token' to refer to items in the list
for token in doc:

    # Print the token and its POS tag
    print(token, token.pos_)
The DET
Federal PROPN
Bureau PROPN
of ADP
Investigation PROPN
has AUX
been AUX
ordered VERB
to PART
track VERB
down ADP
as ADV
many ADJ
as SCONJ
3,000 NUM
Iraqis PROPN
in ADP
this DET
country NOUN
whose DET
visas NOUN
have AUX
expired VERB
, PUNCT
the DET
Justice PROPN
Department PROPN
said VERB
yesterday NOUN
. PUNCT
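
Returning to the ambiguous example The sailor dogs the hatch, we can pass it through the pipeline and inspect the tags assigned to each Token. Note that a small model is not guaranteed to resolve the ambiguity correctly.

[ ]:
# Process the ambiguous example sentence and print the part-of-speech tag
# assigned to each Token; the tagger has to rely on the surrounding
# context to decide whether 'dogs' is a noun or a verb
for token in nlp("The sailor dogs the hatch."):
    print(token, token.pos_)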

There are many other attributes available for each Token in the Doc object.
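
A few of these are illustrated in the sketch below: the fine-grained tag under tag_, and the Boolean flags is_alpha and is_stop.

[ ]:
# Print the fine-grained part-of-speech tag for each Token, together with
# flags indicating whether the Token consists of alphabetic characters
# and whether it is a stop word
for token in doc:
    print(token, token.tag_, token.is_alpha, token.is_stop)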

Syntactic parsing

Syntactic parsing (or dependency parsing) is the task of defining syntactic dependencies that hold between tokens. This task is closely aligned with traditional grammatical analyses, and consequently the results are often represented using tree diagrams.

The syntactic dependencies identified during parsing are available under the dep_ attribute of a Token.

Let’s print out the syntactic dependencies for each Token in the Doc object.

[10]:
# Loop over items in the Doc object, using the variable 'token' to refer to items in the list
for token in doc:

    # Print the token and its dependency tag
    print(token, token.dep_)
The det
Federal compound
Bureau nsubjpass
of prep
Investigation pobj
has aux
been auxpass
ordered ccomp
to aux
track xcomp
down prt
as advmod
many amod
as quantmod
3,000 nummod
Iraqis dobj
in prep
this det
country pobj
whose poss
visas nsubj
have aux
expired relcl
, punct
the det
Justice compound
Department nsubj
said ROOT
yesterday npadvmod
. punct

Unlike part-of-speech tags that are associated with a single Token, dependency tags indicate a relation that holds between two Tokens.

To better understand the syntactic relations captured by dependency parsing, let’s use some of the additional attributes available for each Token:

  1. i: the position of the Token in the Doc

  2. token: the Token itself

  3. dep_: the dependency tag

  4. head.i and head: the index of the Token that governs the current Token, and the governing Token itself

This illustrates how Python attributes can be used in a flexible manner: the attribute head points to another Token, which naturally has the attribute i that contains its index or position in the Doc. We can combine these two attributes to retrieve this information for any token by referring to .head.i.

[11]:
# Loop over items in the Doc object, using the variable 'token' to refer to items in the list
for token in doc:

    # Print the index of current token, the token itself, the dependency, the head and its index
    print(token.i, token, token.dep_, token.head.i, token.head)
0 The det 2 Bureau
1 Federal compound 2 Bureau
2 Bureau nsubjpass 7 ordered
3 of prep 2 Bureau
4 Investigation pobj 3 of
5 has aux 7 ordered
6 been auxpass 7 ordered
7 ordered ccomp 27 said
8 to aux 9 track
9 track xcomp 7 ordered
10 down prt 9 track
11 as advmod 14 3,000
12 many amod 14 3,000
13 as quantmod 14 3,000
14 3,000 nummod 15 Iraqis
15 Iraqis dobj 9 track
16 in prep 9 track
17 this det 18 country
18 country pobj 16 in
19 whose poss 20 visas
20 visas nsubj 22 expired
21 have aux 22 expired
22 expired relcl 18 country
23 , punct 27 said
24 the det 26 Department
25 Justice compound 26 Department
26 Department nsubj 27 said
27 said ROOT 27 said
28 yesterday npadvmod 27 said
29 . punct 27 said

Although the output above helps to clarify the syntactic dependencies between tokens, these relations are generally much easier to perceive in a diagram.

spaCy provides a tool, displacy, for visualising dependencies. This component of the spaCy library can be imported using the following command.

[12]:
from spacy import displacy

The displacy module has a function named render(), which takes a Doc object as input.

To draw a dependency tree, we provide the Doc object doc to the render() method with two arguments:

  1. style: The value dep instructs displacy to draw a visualisation for syntactic dependencies.

  2. options: This argument takes a Python dictionary as input. We provide a dictionary with the key compact and the Boolean value True to instruct displacy to draw a compact tree diagram. Additional options for formatting the visualisation can be found in the spaCy documentation.

[13]:
displacy.render(doc, style='dep', options={'compact': True})
[displacy renders a dependency tree for the example sentence, drawing labelled arcs from each head Token to the Tokens it governs.]

The syntactic dependencies are visualised using lines that lead from the head Token to the Token governed by that head.

The dependency tags are based on universal dependencies, a framework for describing grammatical features across languages.

If you don’t know what a particular tag means, spaCy provides a function for explaining the tags, explain(), which takes a tag as input (note that the tags are case-sensitive).

[14]:
spacy.explain('pobj')
[14]:
'object of preposition'

Finally, if you are wondering about the underscores _ in the attribute names: spaCy encodes all strings by mapping them to hash values (a numerical representation) for computational efficiency.

Let’s print out the first Token in the Doc (at index 0) and its dep attribute, both with and without the underscore, to examine how this works.

[16]:
print(doc[0], doc[0].dep, doc[0].dep_)
The 415 det

As you can see, the hash value 415 is reserved for the tag corresponding to a determiner (det).

If you want human-readable output for dependency parsing and spaCy returns sequences of numbers, then you most likely forgot to add the underscore to the attribute name.
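
The mapping between hash values and human-readable strings is stored in the vocabulary of the Language object. As a sketch, a numerical value can be converted back into a string by looking it up under nlp.vocab.strings.

[ ]:
# Look up the numerical value of the dependency tag for the first Token
# in the vocabulary of the model; this returns the same string as the
# underscored attribute dep_
print(nlp.vocab.strings[doc[0].dep], doc[0].dep_)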

Sentence segmentation

spaCy also splits Doc objects into sentences, a task known as sentence segmentation.

Sentence segmentation imposes additional structure on larger texts. By determining the boundaries of a sentence, we can constrain tasks such as dependency parsing to individual sentences.

spaCy provides access to the results of sentence segmentation via the attribute sents of a Doc object.

Let’s loop over the sentences contained in the Doc object doc and count them using Python’s enumerate() function.

Using the enumerate() function returns a count that increases with each item in the loop.

We assign this count to the variable number, whereas each sentence is stored under sent. We then print out both at the same time using the print() function.

[17]:
# Loop over sentences in the Doc object and count them using enumerate()
for number, sent in enumerate(doc.sents):

    # Print the sentence number and the sentence
    print(number, sent)
0 The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday.

This only returns a single sentence, but the Doc object could easily hold a longer text with multiple sentences, such as an entire newspaper article.
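
To see the segmentation in action, we can feed the language model a short text containing two sentences. The example text below is loosely based on the news sentences quoted earlier and serves purely as an illustration.

[ ]:
# Feed a short two-sentence text to the Language object and store the
# result under the variable 'doc_2'
doc_2 = nlp("The deadline for Iraq to withdraw from Kuwait neared. Security precautions were being taken in Washington.")

# Loop over the sentences and print each one together with a running number
for number, sent in enumerate(doc_2.sents):
    print(number, sent)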

Noun phrase chunking

In addition to sentences, spaCy can group Tokens into noun chunks, that is, flat noun phrases consisting of a noun and the words that modify it, such as “the Justice Department”.

The noun chunks in a Doc object are available under its noun_chunks attribute.
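
A minimal sketch: loop over the noun chunks in the Doc object and print each one.

[ ]:
# Loop over the noun chunks in the Doc object and print each one
for chunk in doc.noun_chunks:
    print(chunk)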

Lemmatization

A lemma is the base form of a word. Keep in mind that unless explicitly instructed otherwise, a computer cannot tell that singular and plural forms belong to the same word, but treats them as distinct tokens.

If one wants to count the occurrences of words, for instance, a process known as lemmatization is needed to group together the different forms of the same word.

Lemmas are available for each Token under the attribute lemma_.

[18]:
# Loop over items in the Doc object, using the variable 'token' to refer to items in the list
for token in doc:

    # Print the token and its lemma
    print(token, token.lemma_)
The the
Federal Federal
Bureau Bureau
of of
Investigation Investigation
has have
been be
ordered order
to to
track track
down down
as as
many many
as as
3,000 3,000
Iraqis Iraqis
in in
this this
country country
whose whose
visas visa
have have
expired expire
, ,
the the
Justice Justice
Department Department
said say
yesterday yesterday
. .
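
As noted above, lemmatization is useful for counting word occurrences. The sketch below uses Python’s Counter class to count the lemmas in the Doc, skipping punctuation marks.

[ ]:
from collections import Counter

# Count the lemmas of all Tokens in the Doc, ignoring punctuation marks
lemma_counts = Counter(token.lemma_ for token in doc if not token.is_punct)

# Print the five most common lemmas and their counts
print(lemma_counts.most_common(5))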

Named entity recognition (NER)

Named entity recognition (NER) is the task of recognising and classifying entities named in a text.

spaCy can recognise the named entities annotated in the OntoNotes 5 corpus, such as persons, geographic locations and products, to name but a few examples.

Instead of looping over Tokens in the Doc object, we can use the Doc object’s .ents attribute to get the entities.

This returns a tuple of Span objects. A Span is a sequence of Tokens, which is necessary because many named entities span multiple Tokens.

The named entities and their types are stored under the attributes .text and .label_.

Let’s print out the named entities in the Doc object doc.

[19]:
# Loop over the named entities in the Doc object
for ent in doc.ents:

    # Print the named entity and its label
    print(ent.text, ent.label_)
The Federal Bureau of Investigation ORG
as many as 3,000 CARDINAL
Iraqis NORP
the Justice Department ORG
yesterday DATE

The majority of named entities identified in the Doc consist of multiple Tokens, which is why they are represented as Span objects.

We can verify this by accessing the first named entity under doc.ents, which can be found at position 0, because Python starts counting from zero, and feeding this object to Python’s type() function.

[20]:
# Check the type of the object used to store named entities
type(doc.ents[0])
[20]:
spacy.tokens.span.Span

spaCy Span objects have several useful attributes.

Most importantly, the attributes start and end return the indices of Tokens, which determine where the Span starts and ends.

We can examine this in greater detail by printing out the start and end attributes for the first named entity in the document.

[21]:
# Print the named entity and indices of its start and end Tokens
print(doc.ents[0], doc.ents[0].start, doc.ents[0].end)
The Federal Bureau of Investigation 0 5

The named entity starts at index 0 and ends at index 5.

If we print out the Token at index 5 in the Doc object, we will see that it is the Token “has”, which lies outside the named entity.

[22]:
doc[5]
[22]:
has

This means that the index stored under the end attribute does not correspond to the last Token in a Span, but to the first Token that follows the Span.

Put differently, the start attribute marks the Token where the Span starts, whereas the end attribute marks the first Token after the Span has ended.
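
Because the end attribute points to the first Token after the Span, the start and end indices can be used to slice the Doc object and recover the same sequence of Tokens.

[ ]:
# Slice the Doc object using the start and end indices of the first named
# entity; the slice covers the same Tokens as the named entity
print(doc[doc.ents[0].start:doc.ents[0].end])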

We can also render the named entities using displacy, the module used for visualising dependency parses above.

Note that we must pass the string ent to the style argument to indicate that we wish to visualise named entities.

[23]:
displacy.render(doc, style='ent')
[displacy renders the sentence with the named entities highlighted and labelled: The Federal Bureau of Investigation (ORG), as many as 3,000 (CARDINAL), Iraqis (NORP), the Justice Department (ORG) and yesterday (DATE).]

If you don’t recognise a particular tag used for a named entity, you can always ask spaCy for an explanation.

[24]:
spacy.explain('NORP')
[24]:
'Nationalities or religious or political groups'

This section should have given you an idea of some basic natural language processing tasks using spaCy and the kinds of linguistic annotations they produce.

The following section introduces you to evaluating the performance of language models.