{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic natural language processing using spaCy\n", "\n", "This section introduces you to some basic tasks in natural language processing.\n", "\n", "After reading this section, you should:\n", "\n", " - know some of the key concepts and tasks in natural language processing (NLP)\n", " - learn how to perform simple NLP tasks in Python using the spaCy library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting started\n", "\n", "To get started, we import [spaCy](https://spacy.io/), one of the many libraries available for natural language processing in Python." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import spacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform natural language processing tasks for a given language, we must load a _language model_ that has been trained to perform these tasks for the language in question. \n", "\n", "spaCy supports [many languages](https://spacy.io/usage/models#languages), but provides pre-trained [language models](https://spacy.io/models/) for fewer languages.\n", "\n", "spaCy language models come in different sizes and flavours. We will explore these models and their differences later. \n", "\n", "To get acquainted with basic tasks in natural language processing, we will start with a small language model for the English language.\n", "\n", "Language models are loaded using spaCy's `load()` function, which takes the name of the model as input." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the small language model for English and assign it to the variable 'nlp'\n", "nlp = spacy.load('en_core_web_sm')\n", "\n", "# Call the variable to examine the object\n", "nlp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling the variable `nlp` returns a spaCy [_Language_](https://spacy.io/api/language) object that contains a language model for the English language.\n", "\n", "Esentially, spaCy's *Language* object is a pipeline that uses the language model to perform a number of natural language processing tasks. We will return to these tasks shortly below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is a language model?\n", "\n", "Most modern language models are based on *statistics* instead of human-defined rules. \n", "\n", "Statistical language models are based on calculating probabilities, e.g.: \n", "\n", " - What is the probability of a given sentence occurring in a language? \n", " - How likely does a given word occur next in a sequence of words?\n", "\n", "Consider the following sentences from the news articles from the previous sections:\n", "\n", " - From financial exchanges in `HIDDEN` Manhattan to cloakrooms in Washington and homeless shelters in California, unfamiliar rituals were the order of the day.\n", "\n", " - Security precautions were being taken around the `HIDDEN` as the deadline for Iraq to withdraw from Kuwait neared.\n", "\n", "You can probably make informed guesses on the `HIDDEN` words based on your knowledge of the English language and the world in general.\n", "\n", "Similarly, creating a statistical language model involves observing the occurrence of words in large corpora and calculating their probabilities of occurrence in a given context. 
The language model can then be trained by making predictions and adjusting the model based on the errors made during prediction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How are language models trained?\n", "\n", "The spaCy language model for English, for instance, is trained on a corpus called [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19), which features texts from different _genres_ such as newswire text, broadcast news, broadcast and telephone conversations and blogs.\n", "\n", "This allows the corpus to cover linguistic variation in both written and spoken English.\n", "\n", "The OntoNotes 5.0 corpus consists of more than just _plain text_: the annotations include _part-of-speech tags_, _syntactic dependencies_ and _co-references_ between words.\n", "\n", "This allows modelling not just the occurrence of particular words or their sequences, but their grammatical features as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performing basic NLP tasks using spaCy\n", "\n", "To process text using the *Language* object containing the language model for English, we simply call the *Language* object `nlp` on some text.\n", "\n", "Let's begin by defining a simple test sentence, a Python string object that is stored under the variable `text`.\n", "\n", "As usual, we can print out the contents by calling the variable." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday.'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Assign an example sentence to the variable 'text'\n", "text = \"The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday.\"\n", "\n", "# Call the variable to examine the output\n", "text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passing the variable `text` to the _Language_ object `nlp` returns a spaCy _Doc_ object, short for document.\n", "\n", "In natural language processing, longer pieces of text are commonly referred to as documents, although in this case our document consists of a single sentence.\n", "\n", "This object contains both the input text stored under `text` and the results of natural language processing using spaCy." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Feed the string object under 'text' to the Language object under 'nlp'\n", "# Store the result under the variable 'doc'\n", "doc = nlp(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The _Doc_ object is now stored under the variable `doc`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday." 
] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Call the variable to examine the object\n", "doc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling the variable `doc` returns the contents of the object.\n", "\n", "Although the output resembles that of a Python string, the _Doc_ object contains a wealth of information about its linguistic structure, which was generated by passing the text through the spaCy NLP pipeline. \n", "\n", "We will now examine the tasks that were performed under the hood after we provided the input sentence to the language model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization\n", "\n", "What takes place first is a task known as _tokenization_, which breaks the text down into analytical units in need of further processing. \n", "\n", "In most cases, a _token_ corresponds to words separated by whitespace, but punctuation marks are also considered as independent tokens. Because computers treat words as sequences of characters, assigning punctuation marks to their own tokens prevents trailing punctuation from attaching to the words that precede them.\n", "\n", "The diagram below the outlines the tasks that spaCy can perform after a text has been tokenised, such as *part-of-speech tagging*, *syntactic parsing* and *named entity recognition*.\n", "\n", "![The spaCy pipeline from https://spacy.io/usage/linguistic-features#section-tokenization](img/spacy_pipeline.png)\n", "\n", "A spaCy _Doc_ object is consists of a sequence of _Token_ objects, which store the results of various natural language processing tasks.\n", "\n", "Let's print out each *Token* object stored in the _Doc_ object `doc`." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The\n", "Federal\n", "Bureau\n", "of\n", "Investigation\n", "has\n", "been\n", "ordered\n", "to\n", "track\n", "down\n", "as\n", "many\n", "as\n", "3,000\n", "Iraqis\n", "in\n", "this\n", "country\n", "whose\n", "visas\n", "have\n", "expired\n", ",\n", "the\n", "Justice\n", "Department\n", "said\n", "yesterday\n", ".\n" ] } ], "source": [ "# Loop over items in the Doc object, using the variable 'token' to refer to items in the list\n", "for token in doc: \n", " \n", " # Print each token\n", " print(token) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output shows one _Token_ per line. As expected, punctuation marks such as '.' and ',' constitute their own _Tokens_." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part-of-speech tagging\n", "\n", "Part-of-speech (POS) tagging is a task that involves determining the word class of a token. This is crucial for _disambiguation_, because different parts of speech may have similar forms.\n", "\n", "Consider the example: _The sailor dogs the hatch_.\n", "\n", "The present tense of the verb _dog_ (to fasten something with something) is precisely the same as the plural form of the noun _dog_.\n", "\n", "To identify the correct word class, we must examine the context in which the word appears.\n", "\n", "To access the results of POS tagging, let's loop over the _Doc_ object `doc` and print each _Token_ and its associated _part-of-speech tag_, which is stored under its attribute `pos_`.\n", "\n", "We can access the attributes of an object by inserting the _attribute_ after the _object_ and separating them with a full stop, e.g. `token.pos_`." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The DET\n", "Federal PROPN\n", "Bureau PROPN\n", "of ADP\n", "Investigation PROPN\n", "has AUX\n", "been AUX\n", "ordered VERB\n", "to PART\n", "track VERB\n", "down ADP\n", "as ADV\n", "many ADJ\n", "as SCONJ\n", "3,000 NUM\n", "Iraqis PROPN\n", "in ADP\n", "this DET\n", "country NOUN\n", "whose DET\n", "visas NOUN\n", "have AUX\n", "expired VERB\n", ", PUNCT\n", "the DET\n", "Justice PROPN\n", "Department PROPN\n", "said VERB\n", "yesterday NOUN\n", ". PUNCT\n" ] } ], "source": [ "# Loop over items in the Doc object, using the variable 'token' to refer to items in the list\n", "for token in doc:\n", " \n", " # Print the token and its POS tag\n", " print(token, token.pos_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many other [attributes](https://spacy.io/api/token#attributes) available for each _Token_ in the _Doc_ object." ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden" }, "source": [ "### Quick exercise\n", "\n", "Define a string object with some text in English. Be creative and think of language use on social media – and feed the string to the language model stored under the variable `nlp`. \n", "\n", "Store the result under the variable `my_text`, e.g.\n", "\n", "`my_text = nlp(\"Your example goes here.\")`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden" }, "source": [ "Next, print the part of speech tags for each _Token_ in your Doc object `my_text`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden" }, "source": [ "Is the language model able to classify the parts-of-speech correctly?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Syntactic parsing\n", "\n", "Syntactic parsing (or _dependency parsing_) is the task of defining syntactic dependencies that hold between tokens. This task is closely aligned with traditional grammatical analyses, and consequently the results are often represented using tree diagrams.\n", "\n", "The syntactic dependencies identified during parsing are available under the `dep_` attribute of a _Token_.\n", "\n", "Let's print out the syntactic dependencies for each _Token_ in the _Doc_ object." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The det\n", "Federal compound\n", "Bureau nsubjpass\n", "of prep\n", "Investigation pobj\n", "has aux\n", "been auxpass\n", "ordered ccomp\n", "to aux\n", "track xcomp\n", "down prt\n", "as advmod\n", "many amod\n", "as quantmod\n", "3,000 nummod\n", "Iraqis dobj\n", "in prep\n", "this det\n", "country pobj\n", "whose poss\n", "visas nsubj\n", "have aux\n", "expired relcl\n", ", punct\n", "the det\n", "Justice compound\n", "Department nsubj\n", "said ROOT\n", "yesterday npadvmod\n", ". 
punct\n" ] } ], "source": [ "# Loop over items in the Doc object, using the variable 'token' to refer to items in the list\n", "for token in doc:\n", " \n", " # Print the token and its dependency tag\n", " print(token, token.dep_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike part-of-speech tags that are associated with a single _Token_, dependency tags indicate a relation that holds between two _Tokens_.\n", "\n", "To better understand the syntactic relations captured by dependency parsing, let's use some of the additional attributes available for each _Token_:\n", "\n", " 1. `i`: the position of the _Token_ in the _Doc_\n", " 2. `token`: the _Token_ itself\n", " 3. `dep_`: the dependency tag\n", " 4. `head` and `i`: the _Token_ that governs the current _Token_ and its index\n", " \n", "This illustrates how Python attributes can be used in a flexible manner: the attribute `head` points to another _Token_, which naturally has the attribute `i` that contains its index or position in the _Doc_. We can combine these two attributes to retrieve this information for any token by referring to `.head.i`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 The det 2 Bureau\n", "1 Federal compound 2 Bureau\n", "2 Bureau nsubjpass 7 ordered\n", "3 of prep 2 Bureau\n", "4 Investigation pobj 3 of\n", "5 has aux 7 ordered\n", "6 been auxpass 7 ordered\n", "7 ordered ccomp 27 said\n", "8 to aux 9 track\n", "9 track xcomp 7 ordered\n", "10 down prt 9 track\n", "11 as advmod 14 3,000\n", "12 many amod 14 3,000\n", "13 as quantmod 14 3,000\n", "14 3,000 nummod 15 Iraqis\n", "15 Iraqis dobj 9 track\n", "16 in prep 9 track\n", "17 this det 18 country\n", "18 country pobj 16 in\n", "19 whose poss 20 visas\n", "20 visas nsubj 22 expired\n", "21 have aux 22 expired\n", "22 expired relcl 18 country\n", "23 , punct 27 said\n", "24 the det 26 Department\n", "25 Justice compound 26 Department\n", "26 Department nsubj 27 said\n", "27 said ROOT 27 said\n", "28 yesterday npadvmod 27 said\n", "29 . punct 27 said\n" ] } ], "source": [ "# Loop over items in the Doc object, using the variable 'token' to refer to items in the list\n", "for token in doc:\n", " \n", " # Print the index of current token, the token itself, the dependency, the head and its index\n", " print(token.i, token, token.dep_, token.head.i, token.head)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although the output above helps to clarify the syntactic dependencies between tokens, they are generally much easier to perceive using diagrams.\n", "\n", "spaCy provides a [visualisation tool](https://spacy.io/usage/visualizers) for visualising dependencies. This component of the spaCy library, _displacy_, can be imported using the following command." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from spacy import displacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `displacy` module has a method named `render()`, which takes a _Doc_ object as input.\n", "\n", "To draw a dependency tree, we provide the _Doc_ object `doc` to the `render()` method with two arguments:\n", "\n", " 1. `style`: The value `dep` instructs _displacy_ to draw a visualisation for syntactic dependencies.\n", " 2. `options`: This argument takes a Python dictionary as input. 
We provide a dictionary with the key `compact` and Boolean value `True` to instruct _displacy_ to draw a compact tree diagram. Additional options for formatting the visualisation can be found in spaCy [documentation](https://spacy.io/api/top-level#displacy_options)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "[stripped SVG output: a displacy dependency tree for the example sentence, showing each Token with its part-of-speech tag and labelled arcs for the syntactic dependencies]" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(doc, style='dep', options={'compact': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The syntactic dependencies are visualised using lines that lead from the __head__ _Token_ to the _Token_ governed by that head.\n", "\n", "The dependency tags are based on [universal dependencies](https://universaldependencies.org/), a framework for 
describing grammatical features across languages.\n", "\n", "If you don't know what a particular tag means, spaCy provides a function for explaining the tags, `explain()`, which takes a tag as input (note that the tags are case-sensitive)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'object of preposition'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spacy.explain('pobj')" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden" }, "source": [ "### Quick exercise\n", "\n", "Choose a tag that you don't know from the examples above and ask spaCy for an explanation using the `explain()` function.\n", "\n", "Input the tag as a string – don't forget the quotation marks!" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, if you are wondering about the underscores `_` in the attribute names: spaCy encodes all strings by mapping them to hash values (a numerical representation) for computational efficiency.\n", "\n", "Let's print out the first Token in the Doc (`doc[0]`) together with both versions of its dependency attribute to examine how this works." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The 415 det\n" ] } ], "source": [ "print(doc[0], doc[0].dep, doc[0].dep_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the hash value 415 is reserved for the tag corresponding to a determiner (`det`). \n", "\n", "If you want human-readable output for dependency parsing but spaCy returns sequences of numbers, you most likely forgot to add the underscore to the attribute name." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentence segmentation\n", "\n", "spaCy also splits _Doc_ objects into sentences, a task known as [_sentence segmentation_](https://spacy.io/usage/linguistic-features#section-sbd).\n", "\n", "Sentence segmentation imposes additional structure on larger texts. By determining the boundaries of a sentence, we can constrain tasks such as dependency parsing to individual sentences.\n", "\n", "spaCy provides access to the results of sentence segmentation via the attribute `sents` of a _Doc_ object.\n", "\n", "Let's loop over the sentences contained in the _Doc_ object `doc` and count them using Python's `enumerate()` function.\n", "\n", "The `enumerate()` function returns a count that increases with each item in the loop. \n", "\n", "We assign this count to the variable `number`, whereas each sentence is stored under `sent`. We then print out both at the same time using the `print()` function."
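] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because our example _Doc_ contains only one sentence, it may help to first see segmentation applied to a slightly longer input. The following cell is a minimal sketch that uses a made-up two-sentence string; the variable name `longer_doc` is our own choice and not part of spaCy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A made-up two-sentence example to illustrate sentence segmentation\n", "longer_doc = nlp(\"The deadline passed at midnight. Officials said they were prepared.\")\n", "\n", "# Loop over and number the sentences detected by spaCy\n", "for number, sent in enumerate(longer_doc.sents):\n", "    print(number, sent)"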
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday.\n" ] } ], "source": [ "# Loop over sentences in the Doc object and count them using enumerate()\n", "for number, sent in enumerate(doc.sents):\n", " \n", " # Print the token and its dependency tag\n", " print(number, sent)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This only returns a single sentence, but the _Doc_ object could easily hold a longer text with multiple sentences, such as an entire newspaper article." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO Add content related to NP chunking via Doc.noun_chunks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lemmatization\n", "\n", "A lemma is the base form of a word. One must keep in mind that unless explicitly instructed, computers cannot tell the difference between singular and plural forms of words, but treat them as distinct tokens. \n", "\n", "If one wants to count the occurrences of words, for instance, a process known as _lemmatization_ is needed to group together the different forms of the same token.\n", "\n", "Lemmas are available for each _Token_ under the attribute `lemma_`." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The the\n", "Federal Federal\n", "Bureau Bureau\n", "of of\n", "Investigation Investigation\n", "has have\n", "been be\n", "ordered order\n", "to to\n", "track track\n", "down down\n", "as as\n", "many many\n", "as as\n", "3,000 3,000\n", "Iraqis Iraqis\n", "in in\n", "this this\n", "country country\n", "whose whose\n", "visas visa\n", "have have\n", "expired expire\n", ", ,\n", "the the\n", "Justice Justice\n", "Department Department\n", "said say\n", "yesterday yesterday\n", ". .\n" ] } ], "source": [ "# Loop over items in the Doc object, using the variable 'token' to refer to items in the list\n", "for token in doc:\n", " \n", " # Print the token and its dependency tag\n", " print(token, token.lemma_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Named entity recognition (NER)\n", "\n", "Named entity recognition (NER) is the task of recognising and classifying entities named in a text. \n", "\n", "spaCy can recognise the [named entities annotated in the OntoNotes 5 corpus](https://spacy.io/api/annotation#named-entities), such as persons, geographic locations and products, to name but a few examples.\n", "\n", "Instead of looping over *Tokens* in the *Doc* object, we can use the *Doc* object's `.ents` attribute to get the entities. \n", "\n", "This returns a _Span_ object, which is a sequence of _Tokens_, as many named entities span multiple _Tokens_. \n", "\n", "The named entities and their types are stored under the attributes `.text` and `.label_`.\n", "\n", "Let's print out the named entities in the _Doc_ object `doc`." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Federal Bureau of Investigation ORG\n", "as many as 3,000 CARDINAL\n", "Iraqis NORP\n", "the Justice Department ORG\n", "yesterday DATE\n" ] } ], "source": [ "# Loop over the named entities in the Doc object \n", "for ent in doc.ents:\n", "\n", " # Print the named entity and its label\n", " print(ent.text, ent.label_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The majority of named entities identified in the _Doc_ consist of multiple _Tokens_, which is why they are represented as _Span_ objects.\n", "\n", "We can verify this by accessing the first named entity under `doc.ents`, which can be found at position `0`, because Python starts counting from zero, and feeding this object to Python's `type()` function." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "spacy.tokens.span.Span" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the type of the object used to store named entities\n", "type(doc.ents[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "spaCy [_Span_](https://spacy.io/api/span) objects contain several useful arguments.\n", "\n", "Most importantly, the attributes `start` and `end` return the indices of _Tokens_, which determine where the _Span_ starts and ends.\n", "\n", "We can examine this in greater detail by printing out the `start` and `end` attributes for the first named entity in the document." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Federal Bureau of Investigation 0 5\n" ] } ], "source": [ "# Print the named entity and indices of its start and end Tokens\n", "print(doc.ents[0], doc.ents[0].start, doc.ents[0].end)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The named entity starts at index 0 and ends at index 5.\n", "\n", "If we print out the fifth _Token_ in the _Doc_ object, we will see that this corresponds to the Token \"has\"." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "has" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc[5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means that the index returned by the `end` argument does __not__ correspond to the _last Token_ in a _Span_, but returns the index of the _first Token_ following the _Span_.\n", "\n", "Put differently, the `start` means that the _Span_ starts here, whereas the `end` attribute means that the _Span_ has _ended_ here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also render the named entities using _displacy_, the module used for visualising dependency parses above.\n", "\n", "Note that we must pass the string `ent` to the `style` argument to indicate that we wish to visualise named entities. " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " The Federal Bureau of Investigation\n", " ORG\n", "\n", " has been ordered to track down \n", "\n", " as many as 3,000\n", " CARDINAL\n", "\n", " \n", "\n", " Iraqis\n", " NORP\n", "\n", " in this country whose visas have expired, \n", "\n", " the Justice Department\n", " ORG\n", "\n", " said \n", "\n", " yesterday\n", " DATE\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(doc, style='ent')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you don't recognise a particular tag used for a named entity, you can always ask spaCy for an explanation." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Nationalities or religious or political groups'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spacy.explain('NORP')" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden" }, "source": [ "### In-class exercise\n", "\n", "Load the example texts in our data directory, feed them to the language model under `nlp` and explore the results.\n", "\n", "How does the quality of optical character recognition affect the results of natural language processing?" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "nbsphinx": "hidden" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "[INFO] Now processing data/WP_1990-08-10-25A.txt ...\n", "\n", "\n", "[INFO] Now processing data/mod_WP_1990-08-10-25A.txt ...\n", "\n", "\n", "[INFO] Now processing data/NYT_1991-01-16-A15.txt ...\n", "\n", "\n", "[INFO] Now processing data/WP_1991-01-17-A1B.txt ...\n", "\n", "\n", "[INFO] Now processing data/mod_WP_1991-01-17-A1B.txt ...\n", "\n", "\n", "[INFO] Now processing data/mod_NYT_1991-01-16-A15.txt ...\n", "\n" ] } ], "source": [ "# Import Path class from pathlib\n", "from pathlib import Path\n", "\n", "# Create a Path object pointing towards the corpus dir\n", "corpus_dir = Path('data')\n", "\n", "# Collect all .txt files\n", "corpus_files = list(corpus_dir.glob('*.txt'))\n", "\n", "# Loop over the files\n", "for corpus_file in corpus_files:\n", " \n", " # Print a status message to separate the results\n", " print(f\"\\n[INFO] Now processing {corpus_file} ...\\n\")\n", " \n", " # Open the file for reading and read the contents (note the .read() method at the end)\n", " text = open(corpus_file, encoding='utf-8').read()\n", " \n", " # Feed the text to the spaCy language model / pipeline\n", " text = nlp(text)\n", " \n", " # Try out different things by uncommenting (remove #s) some of the lines below\n", " \n", " # 1. Loop over a range of Tokens [130:150] and print out part-of-speech tags and dependencies\n", " # for token in text[130:150]:\n", " \n", " # print(token, token.pos_, token.tag_, token.dep_)\n", " \n", " # 2. Print out the first 20 named entities in the text\n", " # for ent in text.ents[:20]:\n", " \n", " # print(ent.text, ent.label_)\n", " \n", " # 3. Print out the sentences in each text. Enumerate them for clarity.\n", " # for i, sent in enumerate(text.sents):\n", " \n", " # print(i, sent)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section should have given you an idea of some basic natural language processing tasks using spaCy and the kinds of linguistic annotations they produce.\n", "\n", "The [following section](evaluating_nlp.ipynb) introduces you to evaluating the performance of language models. 
" ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }