Processing diverse languages#
After reading this section, you should:
know how to download and use language models in Stanza, a Python library for processing many languages
know how to interface Stanza with the spaCy natural language processing library
know how to access linguistic annotations produced by Stanza language models via spaCy
Introduction#
Part II introduced basic natural language processing tasks using examples written in the English language.
As a global lingua franca, English is a highly-resourced language in terms of natural language processing. Compared to many other languages, the amount of data – especially human-annotated data – available for English is greater and covers a wider range of domains (Del Gratta et al. 2021).
Unfortunately, the imbalance in resources and research effort has led to a situation where the advances in processing the English language are occasionally claimed to hold for natural language in general.
However, as Bender (2019) has shown, English is not a synonym for natural language: even if one demonstrates that computers can achieve or surpass human-level performance in some natural language processing task for the English language, this does not mean that one has solved this task or problem for natural language as a whole.
To measure progress in the field of natural language processing and to ensure that as many languages as possible can benefit from advances in language technology, it is highly desirable to conduct research on processing languages used across the world.
Stanza – a Python library for processing many languages#
To get started with processing languages other than English, we can use a library named Stanza.
Stanza is a Python library for natural language processing that provides pre-trained language models for many languages (Qi et al. 2020).
Stanza language models are trained on corpora annotated using the Universal Dependencies formalism, which means that these models can perform tasks such as tokenization, part-of-speech tagging, morphological tagging and dependency parsing.
These are essentially the same tasks that we explored using the spaCy natural language processing library in Part II.
Let’s start exploring Stanza by importing the library.
# Import the Stanza library
import stanza
To process a given language, we must first download a Stanza language model using the download() function. The download() function requires a single argument, lang, which defines the language model to be downloaded. To download a language model for a given language, retrieve the two-letter language code (e.g. wo) for the language from the list of available language models and pass the language code as a string object to the lang argument.
For example, the following code would download a model for Wolof, a language spoken in West Africa that belongs to the family of Niger-Congo languages. This model has been trained using the Wolof treebank (Dione 2019).
# Download Stanza language model for Wolof
stanza.download(lang='wo')
For some languages, Stanza provides models that have been trained on different datasets. Stanza refers to models trained on different datasets as packages. By default, Stanza automatically downloads the package whose model was trained on the largest dataset available for the language in question.
To select a model trained on a specific dataset, pass the name of its package as a string object to the package argument. To exemplify, the following command would download a model for Finnish trained on the FinnTreeBank (package: ftb) dataset instead of the default model, which is trained on the Turku Dependency Treebank dataset (package: tdt).
# Download a Stanza language model for Finnish trained using the FinnTreeBank (package 'ftb')
stanza.download(lang='fi', package='ftb')
The package names are provided in the list of language models available for Stanza.
Loading a language model into Stanza#
To load a Stanza language model into Python, we must first create a Pipeline object by initialising an instance of the Pipeline() class from the stanza module. To exemplify this procedure, let’s initialise a pipeline with a language model for Wolof. To load a language model for Wolof into the pipeline, we must provide the string wo to the lang argument of the Pipeline() class.
# Initialise a Stanza pipeline with a language model for Wolof;
# assign model to variable 'nlp_wo'.
nlp_wo = stanza.Pipeline(lang='wo')
# Call the variable to examine the output
nlp_wo
<stanza.pipeline.core.Pipeline at 0x127f94eb0>
Loading a language model into Stanza returns a Pipeline object, which consists of a number of processors that perform various natural language processing tasks.
The output above lists the processors under the heading of the same name, together with the names of the packages used to train these processors.
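If you want to check programmatically which processors a pipeline contains, one option is to inspect its processors attribute, which holds a dictionary mapping processor names to the loaded Processor objects. The sketch below assumes that this attribute is available in your version of Stanza.
# Get the names of the processors loaded into the pipeline;
# the 'processors' attribute maps processor names to Processor objects
list(nlp_wo.processors.keys())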
As we learned in Part II, one might not always need all the linguistic annotations created by a model, which always come with a computational cost. To speed up processing, you can define the processors to be included in the Pipeline object by providing the processors argument with a string object that contains the names of the processors, separated by commas.
For example, creating a Pipeline using the command below would only include the processors for tokenization and part-of-speech tagging into the pipeline.
# Initialise a Stanza pipeline with a language model for Wolof;
# assign model to variable 'nlp_wo'. Only include tokenizer
# and part-of-speech tagger.
nlp_wo = stanza.Pipeline(lang='wo', processors='tokenize, pos')
Processing text using Stanza#
Now that we have initialised a Stanza Pipeline with a language model, we can feed some text in Wolof to the model under nlp_wo as a string object. We store the result under the variable doc_wo.
# Feed text to the model under 'nlp_wo'; store result under the variable 'doc_wo'
doc_wo = nlp_wo("Réew maa ngi lebe turam wi ci dex gi ko peek ci penku ak bëj-gànnaar, te ab balluwaayam bawoo ca Fuuta Jallon ca Ginne, di Dexug Senegaal. Ab kilimaam bu gëwéel la te di bu fendi te yor ñaari jamono: jamonoy nawet (jamonoy taw) ak ju noor (jamonoy fendi).")
# Check the type of the output
type(doc_wo)
stanza.models.common.doc.Document
This returns a Stanza Document object, which contains the linguistic annotations created by passing the text through the pipeline.
The attribute sentences of a Stanza Document object contains a list, where each item contains a single sentence. Thus we can use brackets to access the first item [0] in the list.
# Get the first item in the list of sentences
doc_wo.sentences[0]
[
{
"id": 1,
"text": "Réew",
"lemma": "réew",
"upos": "NOUN",
"xpos": "NOUN",
"head": 4,
"deprel": "nsubj",
"start_char": 0,
"end_char": 4
},
{
"id": 2,
"text": "maa",
"lemma": "a",
"upos": "AUX",
"xpos": "AUX",
"feats": "PronType=Prs",
"head": 4,
"deprel": "aux",
"start_char": 5,
"end_char": 8
},
{
"id": 3,
"text": "ngi",
"lemma": "ngi",
"upos": "AUX",
"xpos": "AUX",
"feats": "Aspect=Prog",
"head": 4,
"deprel": "aux",
"start_char": 9,
"end_char": 12
},
{
"id": 4,
"text": "lebe",
"lemma": "lebe",
"upos": "VERB",
"xpos": "VERB",
"feats": "Mood=Ind|VerbForm=Fin",
"head": 0,
"deprel": "root",
"start_char": 13,
"end_char": 17
},
{
"id": 5,
"text": "turam",
"lemma": "tur",
"upos": "NOUN",
"xpos": "NOUN",
"feats": "Number=Sing|Poss=Yes",
"head": 4,
"deprel": "obj",
"start_char": 18,
"end_char": 23
},
{
"id": 6,
"text": "wi",
"lemma": "bi",
"upos": "DET",
"xpos": "DET",
"feats": "Definite=Def|Deixis=Prox|NounClass=Wol10|Number=Sing|PronType=Art",
"head": 5,
"deprel": "det",
"start_char": 24,
"end_char": 26
},
{
"id": 7,
"text": "ci",
"lemma": "ci",
"upos": "ADP",
"xpos": "PREP",
"head": 8,
"deprel": "case",
"start_char": 27,
"end_char": 29
},
{
"id": 8,
"text": "dex",
"lemma": "dex",
"upos": "NOUN",
"xpos": "NOUN",
"head": 4,
"deprel": "obl",
"start_char": 30,
"end_char": 33
},
{
"id": 9,
"text": "gi",
"lemma": "bi",
"upos": "PRON",
"xpos": "PRON",
"feats": "Definite=Def|Deixis=Prox|NounClass=Wol3|Number=Sing|Person=3|PronType=Rel",
"head": 11,
"deprel": "nsubj",
"start_char": 34,
"end_char": 36
},
{
"id": 10,
"text": "ko",
"lemma": "ko",
"upos": "PRON",
"xpos": "CL",
"feats": "Case=Acc|Number=Sing|Person=3|PronType=Prs",
"head": 11,
"deprel": "obj",
"start_char": 37,
"end_char": 39
},
{
"id": 11,
"text": "peek",
"lemma": "peek",
"upos": "VERB",
"xpos": "VERB",
"feats": "Mood=Ind|VerbForm=Fin",
"head": 8,
"deprel": "acl:relcl",
"start_char": 40,
"end_char": 44
},
{
"id": 12,
"text": "ci",
"lemma": "ci",
"upos": "ADP",
"xpos": "PREP",
"head": 13,
"deprel": "case",
"start_char": 45,
"end_char": 47
},
{
"id": 13,
"text": "penku",
"lemma": "penku",
"upos": "NOUN",
"xpos": "NOUN",
"head": 11,
"deprel": "obl",
"start_char": 48,
"end_char": 53
},
{
"id": 14,
"text": "ak",
"lemma": "ak",
"upos": "CCONJ",
"xpos": "CONJ",
"head": 15,
"deprel": "cc",
"start_char": 54,
"end_char": 56
},
{
"id": 15,
"text": "bëj-gànnaar",
"lemma": "bëj-gànnaar",
"upos": "NOUN",
"xpos": "NOUN",
"head": 13,
"deprel": "conj",
"start_char": 57,
"end_char": 68
},
{
"id": 16,
"text": ",",
"lemma": ",",
"upos": "PUNCT",
"xpos": "COMMA",
"head": 20,
"deprel": "punct",
"start_char": 68,
"end_char": 69
},
{
"id": 17,
"text": "te",
"lemma": "te",
"upos": "CCONJ",
"xpos": "CONJ",
"head": 20,
"deprel": "cc",
"start_char": 70,
"end_char": 72
},
{
"id": 18,
"text": "ab",
"lemma": "ab",
"upos": "DET",
"xpos": "DET",
"feats": "Definite=Ind|NounClass=Wol5|Number=Sing|PronType=Art",
"head": 19,
"deprel": "det",
"start_char": 73,
"end_char": 75
},
{
"id": 19,
"text": "balluwaayam",
"lemma": "balluwaay",
"upos": "NOUN",
"xpos": "NOUN",
"feats": "Number=Sing|Poss=Yes",
"head": 20,
"deprel": "nsubj",
"start_char": 76,
"end_char": 87
},
{
"id": 20,
"text": "bawoo",
"lemma": "bawoo",
"upos": "VERB",
"xpos": "VERB",
"feats": "Mood=Ind|VerbForm=Fin",
"head": 4,
"deprel": "conj",
"start_char": 88,
"end_char": 93
},
{
"id": 21,
"text": "ca",
"lemma": "ca",
"upos": "ADP",
"xpos": "PREP",
"head": 22,
"deprel": "case",
"start_char": 94,
"end_char": 96
},
{
"id": 22,
"text": "Fuuta",
"lemma": "Fuuta",
"upos": "PROPN",
"xpos": "NAME",
"head": 20,
"deprel": "obl",
"start_char": 97,
"end_char": 102
},
{
"id": 23,
"text": "Jallon",
"lemma": "Jallon",
"upos": "PROPN",
"xpos": "NAME",
"head": 22,
"deprel": "flat",
"start_char": 103,
"end_char": 109
},
{
"id": 24,
"text": "ca",
"lemma": "ca",
"upos": "ADP",
"xpos": "PREP",
"head": 25,
"deprel": "case",
"start_char": 110,
"end_char": 112
},
{
"id": 25,
"text": "Ginne",
"lemma": "Ginne",
"upos": "PROPN",
"xpos": "NAME",
"head": 20,
"deprel": "obl",
"start_char": 113,
"end_char": 118
},
{
"id": 26,
"text": ",",
"lemma": ",",
"upos": "PUNCT",
"xpos": "COMMA",
"head": 28,
"deprel": "punct",
"start_char": 118,
"end_char": 119
},
{
"id": 27,
"text": "di",
"lemma": "di",
"upos": "AUX",
"xpos": "COP",
"feats": "Aspect=Imp|Mood=Ind|Tense=Pres|VerbForm=Fin",
"head": 28,
"deprel": "cop",
"start_char": 120,
"end_char": 122
},
{
"id": 28,
"text": "Dexug",
"lemma": "dex",
"upos": "NOUN",
"xpos": "NOUN",
"feats": "Case=Gen|Number=Sing",
"head": 22,
"deprel": "appos",
"start_char": 123,
"end_char": 128
},
{
"id": 29,
"text": "Senegaal",
"lemma": "Senegaal",
"upos": "PROPN",
"xpos": "NAME",
"head": 28,
"deprel": "nmod",
"start_char": 129,
"end_char": 137
},
{
"id": 30,
"text": ".",
"lemma": ".",
"upos": "PUNCT",
"xpos": "PERIOD",
"head": 4,
"deprel": "punct",
"start_char": 137,
"end_char": 138
}
]
Although the output contains both brackets [] and curly braces {}, which Python typically uses for marking lists and dictionaries, respectively, the output is not a list with nested dictionaries, but a Stanza Sentence object.
# Check the type of the first item in the Document object
type(doc_wo.sentences[0])
stanza.models.common.doc.Sentence
The Sentence object contains various attributes and methods for accessing the linguistic annotations created by the language model.
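For example, the words attribute of a Sentence object contains a list of Word objects, whose attributes hold the annotations shown in the output above. The following minimal sketch assumes these attributes and uses them to retrieve the lemma and part-of-speech tag of a single Word.
# Get the first Word [0] in the first Sentence of the Document
word = doc_wo.sentences[0].words[0]

# Print the lemma and the universal part-of-speech tag of the Word
print(word.lemma, word.upos)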
If we wish to interact with the annotations using data structures native to Python, we can use the to_dict() method to cast the annotations into a list of dictionaries, in which each dictionary stands for a single Stanza Token object. The key and value pairs in these dictionaries contain the linguistic annotations for each Token.
# Cast the first Sentence object into a Python dictionary; store under variable 'doc_dict'
doc_dict = doc_wo.sentences[0].to_dict()
# Get the dictionary for the first Token
doc_dict[0]
{'id': 1,
'text': 'Réew',
'lemma': 'réew',
'upos': 'NOUN',
'xpos': 'NOUN',
'head': 4,
'deprel': 'nsubj',
'start_char': 0,
'end_char': 4}
As you can see, the dictionary consists of key and value pairs, which hold the linguistic annotations.
We can retrieve a list of keys available for a Python dictionary using the keys() method.
# Get a list of keys for the first Token in the dictionary 'doc_dict'
doc_dict[0].keys()
dict_keys(['id', 'text', 'lemma', 'upos', 'xpos', 'head', 'deprel', 'start_char', 'end_char'])
Now that we have listed the keys, let’s retrieve the value under the key lemma.
# Get the value under key 'lemma' for the first item [0] in the dictionary 'doc_dict'
doc_dict[0]['lemma']
'réew'
This returns the lemma of the word “réew”, which stands for “country”.
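If we wanted to print selected annotations for every Token in the sentence, we could loop over the list of dictionaries. The sketch below uses the keys listed above; the get() method of a dictionary provides a fallback value, because not every key is present for every Token.
# Loop over the dictionaries for the Tokens in the first sentence
for token in doc_wo.sentences[0].to_dict():

    # Print the text, lemma and part-of-speech tag; use get() with the
    # fallback value '_', because some entries, such as those for
    # multi-word tokens, lack these keys
    print(token['text'], token.get('lemma', '_'), token.get('upos', '_'))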
Processing multiple texts using Stanza#
To process multiple documents with Stanza, the most efficient way is to first collect the documents as string objects into a Python list.
Let’s define a toy example with a couple of example documents in Wolof and store them as string objects into a list under the variable str_docs.
# Define a Python list consisting of two strings
str_docs = ['Lislaam a ngi njëkk a tàbbi ci Senegaal ci diggante VIIIeelu xarnu ak IXeelu xarnu, ña ko fa dugal di ay yaxantukat yu araab-yu-berber.',
'Li ëpp ci gëstu yi ñu def ci wàllug Gëstu-askan (walla demogaraafi) ci Senegaal dafa sukkandiku ci Waññ (recensement) yi ñu jotoon a def ci 1976, 1988 rawati na 2002.']
Next, we create a list of Stanza Document objects using a Python list comprehension. These Document objects are annotated for their linguistic features when they are passed through a Pipeline object. At this stage, we simply cast each string in the list str_docs to a Stanza Document object. We store the result into a list named docs_wo_in.
Before proceeding to create the Document objects, let’s examine how the list comprehension is structured by taking apart its syntax step by step.
A list comprehension resembles the for loop introduced in Part II: it uses the contents of an existing list to create a new list. To begin with, just like lists, list comprehensions are marked using surrounding brackets [].
docs_wo_in = []
Next, on the right-hand side of the for statement, we use the variable doc to refer to the items in the list str_docs that we are looping over.
docs_wo_in = [... for doc in str_docs]
Now that we can refer to the list items using the variable doc, we can define what we do to each item on the left-hand side of the for statement.
docs_wo_in = [stanza.Document([], text=doc) for doc in str_docs]
For each item in the list str_docs, we initialise an empty Document object and pass two inputs to this object:

1. an empty list [] that will be populated with linguistic annotations,
2. the contents of the string variable under doc to the argument text.
# Use a list comprehension to create a Python list with Stanza Document objects.
docs_wo_in = [stanza.Document([], text=doc) for doc in str_docs]
# Call the variable to check the output
docs_wo_in
[[], []]
Don’t let the output fool you here: what looks like two empty Python lists nested within a list are actually Stanza Document objects.
Let’s use the brackets to access and examine the first Document object in the list docs_wo_in.
# Check the type of the first item in the list 'docs_wo_in'
type(docs_wo_in[0])
stanza.models.common.doc.Document
As you can see, the object is indeed a Stanza Document object.
We can verify that our input texts made it into this document by examining the text attribute.
# Check the contents of the 'text' attribute under the
# first Document in the list 'docs_wo_in'
docs_wo_in[0].text
'Lislaam a ngi njëkk a tàbbi ci Senegaal ci diggante VIIIeelu xarnu ak IXeelu xarnu, ña ko fa dugal di ay yaxantukat yu araab-yu-berber.'
Now that we have a list of Stanza Document objects, we can pass them all at once to the language model for annotation.
This can be achieved by simply providing the list as input to the Wolof language model stored under nlp_wo. We then store the annotated Stanza Document objects under the variable docs_wo_out.
# Pass the list of Document objects to the language model 'nlp_wo'
# for annotation.
docs_wo_out = nlp_wo(docs_wo_in)
# Call the variable to check the output
docs_wo_out
[[
[
{
"id": 1,
"text": "Lislaam",
"lemma": "Lislaam",
"upos": "PROPN",
"xpos": "NAME",
"head": 4,
"deprel": "nsubj",
"start_char": 0,
"end_char": 7
},
{
"id": 2,
"text": "a",
"lemma": "a",
"upos": "AUX",
"xpos": "AUX",
"feats": "PronType=Prs",
"head": 4,
"deprel": "aux",
"start_char": 8,
"end_char": 9
},
{
"id": 3,
"text": "ngi",
"lemma": "ngi",
"upos": "AUX",
"xpos": "AUX",
"feats": "Aspect=Prog",
"head": 4,
"deprel": "aux",
"start_char": 10,
"end_char": 13
},
{
"id": 4,
"text": "njëkk",
"lemma": "njëkk",
"upos": "VERB",
"xpos": "VERB",
"feats": "Mood=Ind|VerbForm=Fin",
"head": 0,
"deprel": "root",
"start_char": 14,
"end_char": 19
},
{
"id": 5,
"text": "a",
"lemma": "a",
"upos": "PART",
"xpos": "PART",
"head": 6,
"deprel": "mark",
"start_char": 20,
"end_char": 21
},
{
"id": 6,
"text": "tàbbi",
"lemma": "tàbbi",
"upos": "VERB",
"xpos": "VERB",
"feats": "VerbForm=Inf",
"head": 4,
"deprel": "xcomp",
"start_char": 22,
"end_char": 27
},
{
"id": 7,
"text": "ci",
"lemma": "ci",
"upos": "ADP",
"xpos": "PREP",
"head": 8,
"deprel": "case",
"start_char": 28,
"end_char": 30
},
{
"id": 8,
"text": "Senegaal",
"lemma": "Senegaal",
"upos": "PROPN",
"xpos": "NAME",
"head": 6,
"deprel": "obl",
"start_char": 31,
"end_char": 39
},
{
"id": 9,
"text": "ci",
"lemma": "ci",
"upos": "ADP",
"xpos": "PREP",
"head": 12,
"deprel": "case",
"start_char": 40,
"end_char": 42
},
{
"id": 10,
"text": "diggante",
"lemma": "diggante",
"upos": "NOUN",
"xpos": "NOUN",
"head": 9,
"deprel": "fixed",
"start_char": 43,
"end_char": 51
},
{
"id": 11,
"text": "VIIIeelu",
"lemma": "VIII",
"upos": "NUM",
"xpos": "NUMBER",
"feats": "NumType=Ord",
"head": 12,
"deprel": "nummod",
"start_char": 52,
"end_char": 60
},
{
"id": 12,
"text": "xarnu",
"lemma": "xarnu",
"upos": "NOUN",
"xpos": "NOUN",
"head": 6,
"deprel": "obl",
"start_char": 61,
"end_char": 66
},
{
"id": 13,
"text": "ak",
"lemma": "ak",
"upos": "CCONJ",
"xpos": "CONJ",
"head": 14,
"deprel": "cc",
"start_char": 67,
"end_char": 69
},
{
"id": 14,
"text": "IXeelu",
"lemma": "IX",
"upos": "NUM",
"xpos": "NUMBER",
"feats": "NumType=Ord",
"head": 15,
"deprel": "nummod",
"start_char": 70,
"end_char": 76
},
{
"id": 15,
"text": "xarnu",
"lemma": "xarnu",
"upos": "NOUN",
"xpos": "NOUN",
"head": 12,
"deprel": "conj",
"start_char": 77,
"end_char": 82
},
{
"id": 16,
"text": ",",
"lemma": ",",
"upos": "PUNCT",
"xpos": "COMMA",
"head": 23,
"deprel": "punct",
"start_char": 82,
"end_char": 83
},
{
"id": 17,
"text": "ña",
"lemma": "ba",
"upos": "PRON",
"xpos": "PRON",
"feats": "Definite=Def|Deixis=Remt|NounClass=Wol2|Number=Plur|Person=3|PronType=Rel",
"head": 23,
"deprel": "nsubj",
"start_char": 84,
"end_char": 86
},
{
"id": 18,
"text": "ko",
"lemma": "ko",
"upos": "PRON",
"xpos": "CL",
"feats": "Case=Acc|Number=Sing|Person=3|PronType=Prs",
"head": 20,
"deprel": "obj",
"start_char": 87,
"end_char": 89
},
{
"id": 19,
"text": "fa",
"lemma": "fa",
"upos": "ADV",
"xpos": "ADV",
"feats": "Deixis=Remt|NounClass=Wol11|PronType=Dem",
"head": 20,
"deprel": "advmod",
"start_char": 90,
"end_char": 92
},
{
"id": 20,
"text": "dugal",
"lemma": "dugal",
"upos": "VERB",
"xpos": "VERB",
"feats": "Mood=Ind|VerbForm=Fin",
"head": 17,
"deprel": "acl:relcl",
"start_char": 93,
"end_char": 98
},
{
"id": 21,
"text": "di",
"lemma": "di",
"upos": "AUX",
"xpos": "COP",
"feats": "Aspect=Imp|Mood=Ind|Tense=Pres|VerbForm=Fin",
"head": 23,
"deprel": "cop",
"start_char": 99,
"end_char": 101
},
{
"id": 22,
"text": "ay",
"lemma": "ab",
"upos": "DET",
"xpos": "DET",
"feats": "Definite=Ind|NounClass=Wol8|Number=Plur|PronType=Art",
"head": 23,
"deprel": "det",
"start_char": 102,
"end_char": 104
},
{
"id": 23,
"text": "yaxantukat",
"lemma": "yaxantukat",
"upos": "NOUN",
"xpos": "NOUN",
"head": 4,
"deprel": "conj",
"start_char": 105,
"end_char": 115
},
{
"id": 24,
"text": "yu",
"lemma": "yu",
"upos": "ADP",
"xpos": "PREP",
"head": 25,
"deprel": "case",
"start_char": 116,
"end_char": 118
},
{
"id": 25,
"text": "araab-yu-berber",
"lemma": "araab-yu-berber",
"upos": "NOUN",
"xpos": "NOUN",
"head": 23,
"deprel": "nmod",
"start_char": 119,
"end_char": 134
},
{
"id": 26,
"text": ".",
"lemma": ".",
"upos": "PUNCT",
"xpos": "PERIOD",
"head": 4,
"deprel": "punct",
"start_char": 134,
"end_char": 135
}
]
],
[
[
{
"id": 1,
"text": "Li",
"lemma": "bi",
"upos": "PRON",
"xpos": "PRON",
"feats": "Definite=Def|Deixis=Prox|NounClass=Wol7|Number=Sing|Person=3|PronType=Rel",
"head": 19,
"deprel": "dislocated",
"start_char": 0,
"end_char": 2
},
{
"id": 2,
"text": "ëpp",
"lemma": "ëpp",
"upos": "VERB",
"xpos": "VERB",
"feats": "Mood=Ind|VerbForm=Fin",
"head": 1,
"deprel": "acl:relcl",
"start_char": 3,
"end_char": 6
},
{
"id": 3,
"text": "ci",
"lemma": "ci",
"upos": "ADP",
"xpos": "PREP",
"head": 4,
"deprel": "case",
"start_char": 7,
"end_char": 9
},
{
"id": 4,
"text": "gëstu",
"lemma": "gëstu",
"upos": "NOUN",
"xpos": "NOUN",
"head": 2,
"deprel": "obl",
"start_char": 10,
"end_char": 15
},
{
"id": 5,
"text": "yi",
"lemma": "bi",
"upos": "PRON",
"xpos": "PRON",
"feats": "Definite=Def|Deixis=Prox|NounClass=Wol8|Number=Plur|Person=3|PronType=Rel",
"head": 7,
"deprel": "obj",
"start_char": 16,
"end_char": 18
},
{
"id": 6,
"text": "ñu",
"lemma": "mu",
"upos": "PRON",
"xpos": "PRON",
"feats": "Case=Nom|Number=Plur|Person=3|PronType=Prs",
"head": 7,
"deprel": "nsubj",
"start_char": 19,
"end_char": 21
},
{
"id": 7,
"text": "def",
"lemma": "def",
"upos": "VERB",
"xpos": "VERB",
"feats": "Mood=Ind|VerbForm=Fin",
"head": 4,
"deprel": "acl:relcl",
"start_char": 22,
"end_char": 25
},
{
"id": 8,
"text": "ci",
"lemma": "ci",
"upos": "ADP",
"xpos": "PREP",
"head": 9,
"deprel": "case",
"start_char": 26,
"end_char": 28
},
{
"id": 9,
"text": "wàllug",
"lemma": "wàll",
"upos": "NOUN",
"xpos": "NOUN",
"feats": "Case=Gen|Number=Sing",
"head": 7,
"deprel": "obl",
"start_char": 29,
"end_char": 35
},
{
"id": 10,
"text": "Gëstu-askan",
"lemma": "Gëstu-askan",
"upos": "PROPN",
"xpos": "NAME",
"head": 9,
"deprel": "nmod",
"start_char": 36,
"end_char": 47
},
{
"id": 11,
"text": "(",
"lemma": "(",
"upos": "PUNCT",
"xpos": "PAREN",
"head": 13,
"deprel": "punct",
"start_char": 48,
"end_char": 49
},
{
"id": 12,
"text": "walla",
"lemma": "walla",
"upos": "CCONJ",
"xpos": "CONJ",
"head": 13,
"deprel": "cc",
"start_char": 49,
"end_char": 54
},
{
"id": 13,
"text": "demogaraafi",
"lemma": "demogaraafi",
"upos": "NOUN",
"xpos": "NOUN",
"feats": "Case=Gen|Number=Plur",
"head": 9,
"deprel": "conj",
"start_char": 55,
"end_char": 66
},
{
"id": 14,
"text": ")",
"lemma": ")",
"upos": "PUNCT",
"xpos": "PAREN",
"head": 13,
"deprel": "punct",
"start_char": 66,
"end_char": 67
},
{
"id": 15,
"text": "ci",
"lemma": "ci",
"upos": "ADP",
"xpos": "PREP",
"head": 16,
"deprel": "case",
"start_char": 68,
"end_char": 70
},
{
"id": 16,
"text": "Senegaal",
"lemma": "Senegaal",
"upos": "PROPN",
"xpos": "NAME",
"head": 13,
"deprel": "nmod",
"start_char": 71,
"end_char": 79
},
{
"id": [
17,
18
],
"text": "dafa",
"start_char": 80,
"end_char": 84
},
{
"id": 17,
"text": "da",
"lemma": "da",
"upos": "AUX",
"xpos": "INFL",
"feats": "FocusType=Verb|Mood=Ind",
"head": 19,
"deprel": "aux"
},
{
"id": 18,
"text": "mu",
"lemma": "mu",
"upos": "PRON",
"xpos": "PRON",
"feats": "Case=Nom|Number=Sing|Person=3|PronType=Prs",
"head": 19,
"deprel": "nsubj"
},
{
"id": 19,
"text": "sukkandiku",
"lemma": "sukkandiku",
"upos": "VERB",
"xpos": "VERB",
"feats": "Mood=Ind|VerbForm=Fin",
"head": 0,
"deprel": "root",
"start_char": 85,
"end_char": 95
},
{
"id": 20,
"text": "ci",
"lemma": "ci",
"upos": "ADP",
"xpos": "PREP",
"head": 21,
"deprel": "case",
"start_char": 96,
"end_char": 98
},
{
"id": 21,
"text": "Waññ",
"lemma": "waññ",
"upos": "NOUN",
"xpos": "NOUN",
"head": 19,
"deprel": "obl:appl",
"start_char": 99,
"end_char": 103
},
{
"id": 22,
"text": "(",
"lemma": "(",
"upos": "PUNCT",
"xpos": "PAREN",
"head": 23,
"deprel": "punct",
"start_char": 104,
"end_char": 105
},
{
"id": 23,
"text": "recensement",
"lemma": "recensement",
"upos": "NOUN",
"xpos": "NOUN",
"head": 21,
"deprel": "appos",
"start_char": 105,
"end_char": 116
},
{
"id": 24,
"text": ")",
"lemma": ")",
"upos": "PUNCT",
"xpos": "PAREN",
"head": 23,
"deprel": "punct",
"start_char": 116,
"end_char": 117
},
{
"id": 25,
"text": "yi",
"lemma": "bi",
"upos": "PRON",
"xpos": "PRON",
"feats": "Definite=Def|Deixis=Prox|NounClass=Wol8|Number=Plur|Person=3|PronType=Rel",
"head": 27,
"deprel": "obj",
"start_char": 118,
"end_char": 120
},
{
"id": 26,
"text": "ñu",
"lemma": "mu",
"upos": "PRON",
"xpos": "PRON",
"feats": "Case=Nom|Number=Plur|Person=3|PronType=Prs",
"head": 27,
"deprel": "nsubj",
"start_char": 121,
"end_char": 123
},
{
"id": 27,
"text": "jotoon",
"lemma": "jot",
"upos": "VERB",
"xpos": "VERB",
"feats": "Mood=Ind|Tense=Past|VerbForm=Fin",
"head": 21,
"deprel": "acl:relcl",
"start_char": 124,
"end_char": 130
},
{
"id": 28,
"text": "a",
"lemma": "a",
"upos": "PART",
"xpos": "PART",
"head": 29,
"deprel": "mark",
"start_char": 131,
"end_char": 132
},
{
"id": 29,
"text": "def",
"lemma": "def",
"upos": "VERB",
"xpos": "VERB",
"feats": "VerbForm=Inf",
"head": 27,
"deprel": "xcomp",
"start_char": 133,
"end_char": 136
},
{
"id": 30,
"text": "ci",
"lemma": "ci",
"upos": "ADP",
"xpos": "PREP",
"head": 31,
"deprel": "case",
"start_char": 137,
"end_char": 139
},
{
"id": 31,
"text": "1976",
"lemma": "1976",
"upos": "NUM",
"xpos": "NUMBER",
"feats": "NumType=Card",
"head": 29,
"deprel": "obl",
"start_char": 140,
"end_char": 144
},
{
"id": 32,
"text": ",",
"lemma": ",",
"upos": "PUNCT",
"xpos": "COMMA",
"head": 34,
"deprel": "punct",
"start_char": 144,
"end_char": 145
},
{
"id": 33,
"text": "1988",
"lemma": "1988",
"upos": "NUM",
"xpos": "NUMBER",
"feats": "NumType=Card",
"head": 34,
"deprel": "discourse",
"start_char": 146,
"end_char": 150
},
{
"id": 34,
"text": "rawati",
"lemma": "rawati",
"upos": "VERB",
"xpos": "VERB",
"feats": "Mood=Ind|VerbForm=Fin",
"head": 19,
"deprel": "parataxis",
"start_char": 151,
"end_char": 157
},
{
"id": 35,
"text": "na",
"lemma": "na",
"upos": "AUX",
"xpos": "INFL",
"feats": "Aspect=Perf|Mood=Ind|Number=Sing|Person=3",
"head": 34,
"deprel": "aux",
"start_char": 158,
"end_char": 160
},
{
"id": 36,
"text": "2002",
"lemma": "2002",
"upos": "NUM",
"xpos": "NUMBER",
"feats": "NumType=Card",
"head": 34,
"deprel": "obj",
"start_char": 161,
"end_char": 165
},
{
"id": 37,
"text": ".",
"lemma": ".",
"upos": "PUNCT",
"xpos": "PERIOD",
"head": 19,
"deprel": "punct",
"start_char": 165,
"end_char": 166
}
]
]]
As you can see, passing the Document objects to the language model populates them with linguistic annotations, which can be then explored as introduced above.
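To verify the result, we could, for instance, loop over the annotated Document objects and print basic information about each one. The sketch below assumes the num_words attribute of a Stanza Document object, together with the sentences attribute introduced above.
# Loop over the annotated Document objects in the list 'docs_wo_out'
for doc in docs_wo_out:

    # Print the number of sentences and words in each Document
    print(len(doc.sentences), 'sentences,', doc.num_words, 'words')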
Interfacing Stanza with spaCy#
If you are more familiar with the spaCy library for natural language processing, whose use was covered extensively in Part II, then you will be happy to know that you can also use some of the Stanza language models in spaCy!
This can be achieved using a Python library named spacy-stanza, which interfaces the two libraries.
Given that Stanza currently has more pre-trained language models available than spaCy, the spacy-stanza library considerably increases the number of language models available for spaCy.
There is, however, one major limitation: the language in question must be supported by both Stanza and spaCy.
For example, we cannot use the Stanza language model for Wolof in spaCy, because spaCy does not support the Wolof language.
To start using Stanza language models in spaCy, let’s begin by importing the spacy-stanza library (module name: spacy_stanza).
# Import the spaCy and spacy-stanza libraries
import spacy
import spacy_stanza
This imports both the spaCy and spacy-stanza libraries into Python. To continue, we must ensure that we have the Stanza language model for Finnish available as well.
As shown above, this model can be downloaded using the following command:
# Download a Stanza language model for Finnish
stanza.download(lang='fi')
Because spaCy supports the Finnish language, we can load Stanza language models for Finnish into spaCy using the spacy-stanza library.
This can be achieved using the load_pipeline() function available under the spacy_stanza module. To load a Stanza language model for a given language, you must provide the two-letter code for the language in question (e.g. fi) to the argument name:
# Load a Stanza language model for Finnish into spaCy
nlp_fi = spacy_stanza.load_pipeline(name='fi')
If we examine the resulting object under the variable nlp_fi using Python’s type() function, we will see that the object is indeed a spaCy Language object.
# Check the type of the object under 'nlp_fi'
type(nlp_fi)
spacy.lang.fi.Finnish
Generally, this object behaves just like any other spaCy Language object that we learned to use in Part II.
We can explore its use by processing a few sentences from a recent news article in written Finnish.
We feed the text as a string object to the Language object under nlp_fi and store the result under the variable doc_fi.
# Feed the text to the language model under 'nlp_fi', store result under 'doc_fi'
doc_fi = nlp_fi('Tove Jansson keräsi 148 ääntä eli 18,2% annetuista äänistä. Kirjailija, kuvataiteilija ja pilapiirtäjä tuli kansainvälisesti tunnetuksi satukirjoistaan ja sarjakuvistaan.')
Let’s continue by retrieving sentences from the Doc object, which are available under the attribute sents, as we learned in Part II. The object available under the sents attribute is a Python generator that yields Span objects. To examine them, we must catch the objects into a suitable data structure. In this case, the data structure that best fits our needs is a Python list. Hence we cast the output from the generator object under sents into a list using the list() function.
# Get sentences contained in the Doc object 'doc_fi'.
# Cast the result into list.
sents_fi = list(doc_fi.sents)
# Call the variable to check the output
sents_fi
[Tove Jansson keräsi 148 ääntä eli 18,2% annetuista äänistä.,
Kirjailija, kuvataiteilija ja pilapiirtäjä tuli kansainvälisesti tunnetuksi satukirjoistaan ja sarjakuvistaan.]
We can also use spaCy’s displacy submodule to visualise the syntactic dependencies. To do so for the first sentence under sents_fi, we must first access the first item in the list using brackets [0] as usual. Let’s start by checking the type of this object.
# Check the type of the first item in the list 'sents_fi'
type(sents_fi[0])
spacy.tokens.span.Span
As you can see, the result is a spaCy Span object, which is a sequence of Token objects contained within a Doc object.
We can then call the render() function from the displacy submodule to visualise the syntactic dependencies for the Span object under sents_fi[0].
# Import the displacy submodule
from spacy import displacy
# Use the render function to render the first item [0] in the list 'sents_fi'.
# Pass the argument 'style' with the value 'dep' to visualise syntactic dependencies.
displacy.render(sents_fi[0], style='dep')
Note that spaCy will raise a warning about storing custom attributes when writing the Doc object to disk for visualisation.
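If you are running the code outside a Jupyter Notebook, the render() function can also return the visualisation as markup instead of displaying it. The sketch below writes this markup into a file named parse_tree.html (a hypothetical file name), which can then be opened in a web browser.
# Render the dependencies into markup instead of displaying them;
# the argument 'jupyter=False' makes render() return a string
markup = displacy.render(sents_fi[0], style='dep', jupyter=False)

# Write the markup to disk
with open('parse_tree.html', mode='w', encoding='utf-8') as f:

    f.write(markup)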
We can also examine the linguistic annotations created for individual Token objects within this Span object.
# Loop over each Token object in the Span
for token in sents_fi[0]:

    # Print the token, its lemma, dependency and morphological features
    print(token, token.lemma_, token.dep_, token.morph)
Tove Tove nsubj Case=Nom|Number=Sing
Jansson Jansson flat:name Case=Nom|Number=Sing
keräsi kerätä root Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act
148 148 nummod NumType=Card
ääntä ääni obj Case=Par|Number=Sing
eli eli cc
18,2 18,2 nummod NumType=Card
% % conj
annetuista antaa acl Case=Ela|Number=Plur|PartForm=Past|VerbForm=Part|Voice=Pass
äänistä ääni nmod Case=Ela|Number=Plur
. . punct
The examples above show how we can access the linguistic annotations created by a Stanza language model through spaCy Doc, Span and Token objects.
This section should have given you an idea of how to begin processing diverse languages.
In the following section, we will dive deeper into the Universal Dependencies framework.