{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Processing texts using spaCy\n", "\n", "This section introduces you to basic tasks in natural language processing and how they can be performed using a Python library named spaCy.\n", "\n", "After reading this section, you should:\n", "\n", " - know some of the key concepts and tasks in natural language processing\n", " - know how to perform simple natural language processing tasks using the spaCy library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting started" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('hJ6PJoITa6I', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get started, we import [spaCy](https://spacy.io/), one of the many libraries available for natural language processing in Python." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the spaCy library\n", "import spacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform natural language processing tasks for a given language, we must load a _language model_ that has been trained to perform these tasks for the language in question. \n", "\n", "spaCy supports [many languages](https://spacy.io/usage/models#languages), but provides pre-trained [language models](https://spacy.io/models/) for fewer languages.\n", "\n", "These language models come in different sizes and flavours. We will explore these models and their differences later. \n", "\n", "To get acquainted with basic tasks in natural language processing, we will start with a small language model for the English language.\n", "\n", "Language models are loaded using spaCy's `load()` function, which takes the name of the model as input." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the small language model for English and assign it to the variable 'nlp'\n", "nlp = spacy.load('en_core_web_sm')\n", "\n", "# Call the variable to examine the object\n", "nlp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling the variable `nlp` returns a spaCy [_Language_](https://spacy.io/api/language) object that contains a language model for the English language.\n", "\n", "Esentially, spaCy's *Language* object is a pipeline that uses the language model to perform a number of natural language processing tasks. We will return to these tasks shortly below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is a language model?\n", "\n", "Most modern language models are based on *statistics* instead of human-defined rules. \n", "\n", "Statistical language models are based on probabilities, e.g.: \n", "\n", " - What is the probability of a given sentence occurring in a language? 
\n", " - How likely is a given word to occur in a sequence of words?\n", "\n", "Consider the following sentences from the news articles from the previous sections:\n", "\n", "> From financial exchanges in `HIDDEN` Manhattan to cloakrooms in Washington and homeless shelters in California, unfamiliar rituals were the order of the day.\n", "\n", "> Security precautions were being taken around the `HIDDEN` as the deadline for Iraq to withdraw from Kuwait neared.\n", "\n", "You can probably make informed guesses on the `HIDDEN` words based on your knowledge of the English language and the world in general.\n", "\n", "Similarly, creating a statistical language model involves observing the occurrence of words in large corpora and calculating their probabilities of occurrence in a given context. The language model can then be trained by making predictions and adjusting the model based on the errors made during prediction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How are language models trained?\n", "\n", "The small language model for English, for instance, is trained on a corpus called [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19), which features texts from different *genres* such as newswire text, broadcast news, broadcast and telephone conversations and blogs.\n", "\n", "This allows the corpus to cover linguistic variation in both written and spoken English.\n", "\n", "The OntoNotes 5.0 corpus consists of more than just *plain text*: the annotations include *part-of-speech tags*, *syntactic dependencies* and *co-references* between words.\n", "\n", "This allows modelling not just the occurrence of particular words or their sequences, but their grammatical features as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performing basic NLP tasks using spaCy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('kJjKJ6qlQmM', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To process text using the *Language* object containing the language model for English, we simply call the *Language* object `nlp` on some text.\n", "\n", "Let's begin by defining a simple test sentence, a Python string object that is stored under the variable `text`.\n", "\n", "As usual, we can print out the contents by calling the variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assign an example sentence to the variable 'text'\n", "text = \"The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday.\"\n", "\n", "# Call the variable to examine the output\n", "text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passing the variable `text` to the _Language_ object `nlp` returns a spaCy *Doc* object, short for document.\n", "\n", "In natural language processing, longer pieces of text are commonly referred to as documents, although in this case our document consists of a single sentence.\n", "\n", "This object contains both the input text stored under `text` and the results of natural language processing using spaCy." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feed the string object under 'text' to the Language object under 'nlp'\n", "# Store the result under the variable 'doc'\n", "doc = nlp(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The _Doc_ object is now stored under the variable `doc`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Call the variable to examine the object\n", "doc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling the variable `doc` returns the contents of the object.\n", "\n", "Although the output resembles that of a Python string, the *Doc* object contains a wealth of information about its linguistic structure, which spaCy generated by passing the text through the NLP pipeline.\n", "\n", "We will now examine the tasks that were performed under the hood after we provided the input sentence to the language model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('hVOSVlWnl_k', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What takes place first is a task known as *tokenization*, which breaks the text down into analytical units in need of further processing. \n", "\n", "In most cases, a *token* corresponds to words separated by whitespace, but punctuation marks are also considered as independent tokens. Because computers treat words as sequences of characters, assigning punctuation marks to their own tokens prevents trailing punctuation from attaching to the words that precede them.\n", "\n", "The diagram below the outlines the tasks that spaCy can perform after a text has been tokenised, such as *part-of-speech tagging*, *syntactic parsing* and *named entity recognition*.\n", "\n", "![The spaCy pipeline from https://spacy.io/usage/linguistic-features#section-tokenization](img/spacy_pipeline.png)\n", "\n", "A spaCy _Doc_ object is consists of a sequence of *Token* objects, which store the results of various natural language processing tasks.\n", "\n", "Let's print out each *Token* object stored in the _Doc_ object `doc`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over items in the Doc object, using the variable 'token' to refer to items in the list\n", "for token in doc: \n", " \n", " # Print each token\n", " print(token) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output shows one _Token_ per line. As expected, punctuation marks such as '.' and ',' constitute their own _Tokens_." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part-of-speech tagging" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('F2bLdmzc2X4', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Part-of-speech (POS) tagging is the task of determining the word class of a token. 
This is crucial for *disambiguation*, because words belonging to different parts of speech may share the same form.\n", "\n", "Consider the example: *The sailor dogs the hatch*.\n", "\n", "The present-tense form of the verb *dog* (to fasten something securely) is exactly the same as the plural form of the noun *dog*: *dogs*.\n", "\n", "To identify the correct word class, we must examine the context in which the word appears.\n", "\n", "*spaCy* provides two types of part-of-speech tags, *coarse* and *fine-grained*, which are stored under the attributes `pos_` and `tag_`, respectively.\n", "\n", "We can access the attributes of a Python object by inserting the *attribute* after the *object* and separating them with a full stop, e.g. `token.pos_`.\n", "\n", "To access the results of POS tagging, let's loop over the *Doc* object `doc` and print each *Token* and its coarse and fine-grained part-of-speech tags." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over items in the Doc object, using the variable 'token' to refer to items in the list\n", "for token in doc:\n", "    \n", "    # Print the token and the POS tags\n", "    print(token, token.pos_, token.tag_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The coarse part-of-speech tags available under the `pos_` attribute are based on the [Universal Dependencies](https://universaldependencies.org/u/pos/all.html) tag set.\n", "\n", "The fine-grained part-of-speech tags under `tag_`, in turn, are based on the OntoNotes 5.0 corpus introduced above.\n", "\n", "In contrast to coarse part-of-speech tags, the fine-grained tags also encode [grammatical information](https://spacy.io/api/annotation#pos-en). The tags for verbs, for example, are distinguished by aspect and tense." ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "### Quick exercise\n", "\n", "Define a string object with some text in English. Be creative and think of language use on social media, and feed the string to the language model stored under the variable `nlp`. \n", "\n", "Store the result under the variable `my_text`, e.g.\n", "\n", "`my_text = nlp(\"Your example goes here.\")`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "Next, print the part-of-speech tags for each _Token_ in your _Doc_ object `my_text`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "Is the language model able to classify the parts of speech correctly?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Morphological analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('1iliKkbHF30', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Morphemes constitute the smallest grammatical units that carry meaning.
Two types of morphemes are generally recognised: *free* morphemes, which consist of words that can stand on their own, and *bound* morphemes, which must be attached to other morphemes. In English, bound morphemes include suffixes such as _-s_, which is used to indicate the plural form of a noun.\n", "\n", "Put differently, morphemes shape the external *form* of a word, and these forms are associated with given grammatical *functions*.\n", "\n", "spaCy performs morphological analysis automatically and stores the result under the attribute `morph` of a _Token_ object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over items in the Doc object, using the variable 'token' to refer to items in the list\n", "for token in doc:\n", "\n", "    # Print the token and the results of morphological analysis\n", "    print(token, token.morph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the output shows, not all _Tokens_ have morphological information, because some consist of a single free morpheme.\n", "\n", "To retrieve a particular morphological feature from a _Token_ object, we can use the `get()` method of the `morph` attribute.\n", "\n", "We can use the brackets `[]` to access items in the _Doc_ object.\n", "\n", "The following line retrieves morphological information about aspect for the _Token_ at index 22 in the _Doc_ object, the word *expired*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve morphological information on aspect for the Token at index 22 in the Doc object\n", "doc[22].morph.get('Aspect')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns a list with a single string item _Perf_, which refers to the perfective aspect.\n", "\n", "What if we attempt to retrieve a morphological feature that a _Token_ does not have?\n", "\n", "Let's attempt to retrieve information on aspect for the _Token_ at index 21 in the _Doc_ object, the word *have*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve morphological information on aspect for the Token at index 21 in the Doc object\n", "doc[21].morph.get('Aspect')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns an empty list, as indicated by the brackets `[ ]` with nothing between them.\n", "\n", "To retrieve all the morphological information available for a given _Token_, the best solution is to use the `to_dict()` method of the `morph` attribute.\n", "\n", "This returns a dictionary, a Python data structure consisting of _key_ and _value_ pairs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve all morphological information for the Token at index 21 in the Doc object\n", "# Use the to_dict() method to cast the result into a dictionary\n", "doc[21].morph.to_dict()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Python dictionary is marked by curly brackets `{ }`. Within each key/value pair, the key and value are separated by a colon `:`. In this case, both keys and values consist of string objects.\n", "\n", "The value stored under a key can be accessed by placing the key name in brackets `[ ]` right after the name of the dictionary, as shown below."
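] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the principle, here is a minimal sketch using a small stand-alone dictionary; the dictionary `example_dict` and its contents are made up for this example. The cell that follows then applies the same technique to the output of morphological analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a hypothetical dictionary with two key/value pairs\n", "example_dict = {'Mood': 'Ind', 'Tense': 'Pres'}\n", "\n", "# Get the value stored under the key 'Tense'\n", "example_dict['Tense']"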
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assign morphological information to the dictionary 'morph_dict' \n", "morph_dict = doc[21].morph.to_dict()\n", "\n", "# Get the value corresponding to the key 'Mood'\n", "morph_dict['Mood']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dictionaries are a powerful data structure in Python, which we will frequently use for storing information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Syntactic parsing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('2TBNpxAbh_g', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Syntactic parsing (or *dependency parsing*) is the task of defining syntactic dependencies that hold between tokens. \n", "\n", "The syntactic dependencies are available under the `dep_` attribute of a *Token* object.\n", "\n", "Let's print out the syntactic dependencies for each *Token* in the *Doc* object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over items in the Doc object, using the variable 'token' to refer to items in the list\n", "for token in doc:\n", " \n", " # Print the token and its dependency tag\n", " print(token, token.dep_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike part-of-speech tags that are associated with a single _Token_, dependency tags indicate a relation that holds between two *Tokens*.\n", "\n", "To better understand the syntactic relations captured by dependency parsing, let's use some of the additional attributes available for each *Token*:\n", "\n", " 1. `i`: the position of the *Token* in the *Doc*\n", " 2. `token`: the *Token* itself\n", " 3. `dep_`: a tag for the syntactic relation\n", " 4. `head` and `i`: the *Token* that governs the current *Token* and its index\n", " \n", "This illustrates how Python attributes can be used in a flexible manner: the attribute `head` points to another *Token*, which naturally has the attribute `i` that contains its index or position in the *Doc*. We can combine these two attributes to retrieve this information for any token by referring to `.head.i`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over items in the Doc object, using the variable 'token' to refer to items in the list\n", "for token in doc:\n", " \n", " # Print the index of current token, the token itself, the dependency, the head and its index\n", " print(token.i, token, token.dep_, token.head.i, token.head)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although the output above helps to clarify the syntactic dependencies between tokens, they are generally much easier to perceive using diagrams.\n", "\n", "spaCy provides a [visualisation tool](https://spacy.io/usage/visualizers) for visualising dependencies. This component of the spaCy library, _displacy_, can be imported using the following command." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from spacy import displacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `displacy` module has a function named `render()`, which takes a _Doc_ object as input.\n", "\n", "To draw a dependency tree, we provide the _Doc_ object `doc` to the `render()` function with two arguments:\n", "\n", " 1. `style`: The value `dep` instructs _displacy_ to draw a visualisation for syntactic dependencies.\n", " 2. `options`: This argument takes a Python dictionary as input. We provide a dictionary with the key `compact` and Boolean value `True` to instruct _displacy_ to draw a compact tree diagram. Additional options for formatting the visualisation can be found in spaCy [documentation](https://spacy.io/api/top-level#displacy_options)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "displacy.render(doc, style='dep', options={'compact': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The syntactic dependencies are visualised using lines that lead from the **head** *Token* to the *Token* governed by that head.\n", "\n", "The dependency tags are based on [Universal Dependencies](https://universaldependencies.org/), a framework for describing morphological and syntactic features across languages (for a theoretical discussion of Universal Dependencies, see de Marneffe et al. [2021](https://doi.org/10.1162/coli_a_00402)).\n", "\n", "If you don't know what a particular tag means, spaCy provides a function for explaining the tags, `explain()`, which takes a tag as input (note that the tags are case-sensitive)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "spacy.explain('pobj')" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "### Quick exercise\n", "\n", "Choose a tag that you don't know from the examples above and ask spaCy for an explanation using the `explain()` function.\n", "\n", "Input the tag as a string – don't forget the quotation marks!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, if you wonder about the underscores `_` in the attribute names: spaCy encodes all strings by mapping them to hash values (a numerical representation) for computational efficiency.\n", "\n", "Let's print out the first Token in the Doc `[0]` and its dependencies to examine how this works." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(doc[0], doc[0].dep, doc[0].dep_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the hash value 415 is reserved for the tag corresponding to a determiner (`det`). \n", "\n", "If you want human-readable output for dependency parsing and spaCy returns sequences of numbers, then you most likely forgot to add the underscore to the attribute name." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentence segmentation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('NknDZSRBT7Y', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "spaCy also segments _Doc_ objects into sentences. This task is known as [_sentence segmentation_](https://spacy.io/usage/linguistic-features#section-sbd).\n", "\n", "Sentence segmentation imposes additional structure to larger texts. By determining the boundaries of a sentence, we can constrain tasks such as dependency parsing to individual sentences.\n", "\n", "spaCy provides access to the results of sentence segmentation via the attribute `sents` of a _Doc_ object.\n", "\n", "Let's loop over the sentences contained in the _Doc_ object `doc` and count them using Python's `enumerate()` function.\n", "\n", "Using the `enumerate()` function returns a count that increases with each item in the loop. \n", "\n", "We assign this count to the variable `number`, whereas each sentence is stored under `sent`. We then print out both at the same time using the `print()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over sentences in the Doc object and count them using enumerate()\n", "for number, sent in enumerate(doc.sents):\n", " \n", " # Print the token and its dependency tag\n", " print(number, sent)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This only returns a single sentence, but the _Doc_ object could easily hold a longer text with multiple sentences, such as an entire newspaper article." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lemmatization" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('pvMImQEUlN4', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A lemma is the base form of a word. Keep in mind that unless explicitly instructed, computers cannot tell the difference between singular and plural forms of words, but treat them as distinct tokens, because their forms differ. \n", "\n", "If one wants to count the occurrences of words, for instance, a process known as _lemmatization_ is needed to group together the different forms of the same token.\n", "\n", "Lemmas are available for each _Token_ under the attribute `lemma_`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over items in the Doc object, using the variable 'token' to refer to items in the list\n", "for token in doc:\n", " \n", " # Print the token and its dependency tag\n", " print(token, token.lemma_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Named entity recognition (NER)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('Gn_PjruUtrc', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Named entity recognition (NER) is the task of recognising and classifying entities named in a text. \n", "\n", "spaCy can recognise the [named entities annotated in the OntoNotes 5 corpus](https://spacy.io/api/annotation#named-entities), such as persons, geographic locations and products, to name but a few examples.\n", "\n", "We can use the *Doc* object's `.ents` attribute to get the named entities." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc.ents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns a tuple with the named entities.\n", "\n", "Each item in the tuple is a spaCy *Span* object. *Span* objects can consist of multiple *Token* objects, as many named entities span multiple *Tokens*. \n", "\n", "The named entities and their types are stored under the attributes `.text` and `.label_` of each *Span* object.\n", "\n", "Let's loop over the *Span* objects in the tuple and print out both attributes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over the named entities in the Doc object \n", "for ent in doc.ents:\n", "\n", " # Print the named entity and its label\n", " print(ent.text, ent.label_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the majority of named entities identified in the *Doc* consist of multiple *Tokens*, which is why they are represented as *Span* objects.\n", "\n", "We can verify this by accessing the first named entity under `doc.ents`, which can be found at position `0`, because Python starts counting from zero, and feeding this object to Python's `type()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the type of the object used to store named entities\n", "type(doc.ents[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "spaCy [*Span*](https://spacy.io/api/span) objects contain several useful arguments.\n", "\n", "Most importantly, the attributes `start` and `end` return the indices of _Tokens_, which determine where the _Span_ starts and ends in the *Doc* object.\n", "\n", "We can examine this in greater detail by printing out the `start` and `end` attributes for the first named entity in the document." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the named entity and indices of its start and end Tokens\n", "print(doc.ents[0], doc.ents[0].start, doc.ents[0].end)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The named entity starts at index 0 and ends at index 5 in the *Doc* object.\n", "\n", "If we retrieve the sixth *Token* in the *Doc* object (at index 5), we will see that this corresponds to the *Token* \"has\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc[5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows that the index returned by the `end` attribute does **not** correspond to the *last Token* in the *Span* that contains the named entity, but returns the index of the *first Token* following the *Span*.\n", "\n", "Let's examine this by looping over the slice of the *Doc* object that corresponds to the first named entity. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over a slice of the Doc object that covers the first named entity\n", "for token in doc[doc.ents[0].start: doc.ents[0].end]:\n", " \n", " # Print the Token and its index\n", " print(token, token.i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the `start` attribute means that the *Span* starts here, whereas the `end` attribute means that the *Span* has *ended* here.\n", "\n", "We can also render the named entities using *displacy*, the spaCy module we used for visualising dependency parses above.\n", "\n", "Note that we must pass the string `ent` to the `style` argument to indicate that we wish to visualise named entities." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "displacy.render(doc, style='ent')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you don't recognise a particular tag used for a named entity, you can always ask spaCy for an explanation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "spacy.explain('NORP')" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "### Quick exercise\n", "\n", "Load the example texts in our data directory, feed them to the language model under `nlp` and explore the results.\n", "\n", "How does the quality of optical character recognition affect the results of natural language processing?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Import Path class from pathlib\n", "from pathlib import Path\n", "\n", "# Create a Path object pointing towards the corpus dir\n", "corpus_dir = Path('data')\n", "\n", "# Collect all .txt files\n", "corpus_files = list(corpus_dir.glob('*.txt'))\n", "\n", "# Loop over the files\n", "for corpus_file in corpus_files:\n", " \n", " # Print a status message to separate the results\n", " print(f\"\\n[INFO] Now processing {corpus_file} ...\\n\")\n", " \n", " # Open the file for reading and read the contents (note the .read_text() method at the end)\n", " text = corpus_file.read_text(encoding='utf-8')\n", " \n", " # Feed the text to the spaCy language model / pipeline\n", " text = nlp(text)\n", " \n", " # Try out different things by uncommenting (remove # at the beginning of the line) \n", " # some of the lines below\n", " \n", " # 1. 
Loop over a range of Tokens [130:150] and print out part-of-speech tags and dependencies\n", " # for token in text[130:150]:\n", " \n", " # print(token, token.pos_, token.tag_, token.dep_)\n", " \n", " # 2. Print out the first 20 named entities in the text\n", " # for ent in text.ents[:20]:\n", " \n", " # print(ent.text, ent.label_)\n", " \n", " # 3. Print out the sentences in each text. Enumerate them for clarity.\n", " # for i, sent in enumerate(text.sents):\n", " \n", " # print(i, sent)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section should have given you an idea of some basic natural language processing tasks, how they can be performed using spaCy and what kinds of linguistic annotations they produce.\n", "\n", "The [following section](04_basic_nlp_continued.ipynb) focuses on how to customise the tasks that spaCy performs on an input text. " ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 2 }