{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Customising the spaCy pipeline\n", "\n", "This section introduces you to customising the spaCy pipeline, that is, determining just what spaCy does with text and how.\n", "\n", "After reading this section, you should know how to:\n", "\n", " - examine and modify the spaCy pipeline\n", " - process texts efficiently\n", " - add custom attributes to spaCy objects\n", " - save processed texts on disk\n", " - merge noun phrases and named entities\n", "\n", "Let's start by importing the spaCy library and the displacy module for drawing dependency trees." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the spaCy library and the displacy module\n", "from spacy import displacy\n", "import spacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then import a language model for English." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load a small language model for English and assign it to the variable 'nlp'\n", "nlp = spacy.load('en_core_web_sm')\n", "\n", "# Call the variable to examine the object\n", "nlp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modifying spaCy pipelines" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('F4SJJQF49b0', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by examining the spaCy *Language* object in more detail.\n", "\n", "The *Language* object is a essentially pipeline that applies some language model to text by performing the tasks that the model has been trained to do.\n", "\n", "The tasks performed depend on the components present in the pipeline.\n", "\n", "We can examine the components of a pipeline using the `pipeline` attribute of a *Language* object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nlp.pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns a spaCy *SimpleFrozenList* object, which consists of Python _tuples_ with two items: \n", "\n", " 1. component names, e.g. `tagger`, \n", " 2. the actual components that perform different tasks, e.g. `spacy.pipeline.tok2vec.Tok2Vec`.\n", "\n", "Components such as `tagger`, `parser`, `ner` and `lemmatizer` should already be familiar to you from the [previous section](../part_ii/03_basic_nlp.ipynb).\n", "\n", "There are, however, two components present in `nlp.pipeline` that we have not yet encountered. \n", "\n", " - `tok2vec` maps *Tokens* to their numerical representations. We will learn about these representations in [Part III](../part_iii/04_embeddings.ipynb).\n", "\n", " - `attribute_ruler` applies user-defined rules to *Tokens*, such as matches for a given linguistic pattern, and adds this information to the *Token* as an attribute if requested. We will explore the use of matchers in [Part III](../part_iii/03_pattern_matching.ipynb).\n", "\n", "Note also that the list of components under `nlp.pipeline` does not include a *Tokenizer*, because all texts must be tokenized for any kind of processing to take place. Hence the *Tokenizer* is placed under the `tokenizer` attribute of a *Language* object rather than the `pipeline` attribute." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to understand that all pipeline components come with a computational cost. \n", "\n", "If you do not need the output, you should not include a component in the pipeline, because the time needed to process the data will be longer.\n", "\n", "To exclude a component from the pipeline, provide the `exclude` argument with a *string* or a *list* that contain the names of the components to exclude when initialising a *Language* object using the `load()` function. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load a small language model for English, but exclude named entity\n", "# recognition ('ner') and syntactic dependency parsing ('parser'). \n", "nlp = spacy.load('en_core_web_sm', exclude=['ner', 'parser'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's examine the `pipeline` attribute again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Examine the active components under the Language object 'nlp'\n", "nlp.pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the output shows, the `ner` and `parser` components are no longer included in the pipeline.\n", "\n", "A *Language* object also provides a `analyze_pipes()` method for an overview of the pipeline components and their interactions. By setting the attribute `pretty` to `True`, spaCy prints out a table that lists the components and the annotations they produce. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Analyse the pipeline and store the analysis under 'pipe_analysis'\n", "pipe_analysis = nlp.analyze_pipes(pretty=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `analyze_pipes()` method returns a Python *dictionary*, which contains the same information presented in the table above.\n", "\n", "You can use this dictionary to check that no problems are found before processing large volumes of data.\n", "\n", "Problem reports are stored in a dictionary under the key `problems`.\n", "\n", "We can access the values under the `problems` key by placing the name of the key in brackets `[ ]`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Examine the value stored under the key 'problems'\n", "pipe_analysis['problems']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns a dictionary with component names as keys, whose values contain lists of problems.\n", "\n", "In this case, the lists are empty, because no problems exist.\n", "\n", "We can, however, easily write a piece of code that checks if this is indeed the case.\n", "\n", "To do so, we loop over the `pipe_analysis` dictionary, using the `items()` method to fetch the key/value pairs.\n", "\n", "We assign the keys and values to variables `component_name` and `problem_list`, respectively.\n", "\n", "We then use the `assert` statement with the `len()` function and the *comparison operator* `==` to check that the length of the list is 0.\n", "\n", "If this assertion is not true, that is, if the length of `problem_list` is more than 0, which would indicate the presence of a problem, Python will raise an `AssertionError` and stop." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over the key/value pairs in the dictionary. 
Assign the key and\n", "# value pairs to the variables 'component_name' and 'problem_list'.\n", "for component_name, problem_list in pipe_analysis['problems'].items():\n", " \n", " # Use the assert statement to check the list of problems; raise Error if necessary.\n", " assert len(problem_list) == 0, f\"There is a problem with {component_name}: {problem_list}!\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we also print an error message using a *formatted string*. The error message is separated from the assertion by a comma. \n", "\n", "Note that the quotation marks are preceded by the character `f`. By declaring that this string can be formatted, we can insert variables into the string! \n", "\n", "The variable names inserted into the string are surrounded by curly braces `{}`. If an error message is raised, these parts of the string will be populated using the values currently stored under the variables `component_name` and `problem_list`.\n", "\n", "If no problems are encountered, the loop will pass silently." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Processing texts efficiently" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('00yChd449uI', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When working with high volumes of data, processing the data as efficiently as possible is highly desirable.\n", "\n", "To illustrate the best practices of processing texts efficiently using spaCy, let's define a toy example that consists of a Python list with three example sentences from English Wikipedia." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialise the language model again, because we need dependency\n", "# parsing for the following sections.\n", "nlp = spacy.load('en_core_web_sm')\n", "\n", "# Define a list of example sentences\n", "sents = [\"On October 1, 2009, the Obama administration went ahead with a Bush administration program, increasing nuclear weapons production.\", \n", " \"The 'Complex Modernization' initiative expanded two existing nuclear sites to produce new bomb parts.\", \n", " \"The administration built new plutonium pits at the Los Alamos lab in New Mexico and expanded enriched uranium processing at the Y-12 facility in Oak Ridge, Tennessee.\"]\n", "\n", "# Call the variable to examine output\n", "sents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns a list with three sentences.\n", "\n", "spaCy _Language_ objects have a specific method, `pipe()`, for processing texts stored in a Python list.\n", "\n", "The `pipe()` method has been optimised for this purpose, processing texts in _batches_ rather than individually, which makes this method faster than processing each list item separately using a `for` loop.\n", "\n", "The `pipe()` method takes a _list_ as input and returns a Python _generator_ named `pipe`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feed the list of sentences to the pipe() method\n", "docs = nlp.pipe(sents)\n", "\n", "# Call the variable to examine the output\n", "docs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generators are Python objects that contain other objects. 
Unlike a list, however, a generator does not hold all of these objects in memory at once: it yields them one at a time when iterated over. This is why the cell above only shows a generator object rather than the processed *Docs*. \n", "\n", "To retrieve all objects in a generator, we must cast the output into another object type, such as a list. \n", "\n", "You can think of the list as a data structure that is able to collect the generator output for examination." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cast the pipe generator into a list\n", "docs = list(docs)\n", "\n", "# Call the variable to examine the output\n", "docs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us a list of spaCy _Doc_ objects for further processing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding custom attributes to spaCy objects" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('oWsuCwCW29g', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [previous section](../part_ii/03_basic_nlp.ipynb) showed how linguistic annotations produced by spaCy can be accessed through their attributes.\n", "\n", "In addition, spaCy allows adding custom attributes to *Doc*, *Span* and *Token* objects. These attributes can be used, for example, to store additional information about the texts that are being processed.\n", "\n", "If you are working with texts that contain information about language users, you can incorporate this information directly into the spaCy objects.\n", "\n", "To exemplify, custom attributes can be added directly to the *Doc* object using the `set_extension()` method.\n", "\n", "Because these attributes are added to all *Doc* objects instead of individual *Doc* objects, we must first import the generic *Doc* object from spaCy's `tokens` module." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the Doc object from the 'tokens' module in spaCy\n", "from spacy.tokens import Doc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use the `set_extension()` method to add two attributes, `age` and `location`, to the *Doc* object.\n", "\n", "We use the `default` argument to set a default value for both attributes. For this purpose, we use the `None` keyword in Python." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add two custom attributes to the Doc object, 'age' and 'location'\n", "# using the set_extension() method.\n", "Doc.set_extension(\"age\", default=None)\n", "Doc.set_extension(\"location\", default=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `age` and `location` attributes are now added to the *Doc* object.\n", "\n", "Unlike attributes such as `sents` or `heads`, the custom attributes are placed under an attribute that consists of the underscore character `_`, e.g. `Doc._.age`.\n", "\n", "To exemplify how these custom attributes can be set for actual *Doc* objects, let's define a toy example that consists of a Python dictionary.\n", "\n", "The `sents_dict` dictionary contains three keys: `0`, `1` and `2`.
The values under these keys consist of dictionaries with three keys: `age`, `location` and `text`.\n", "\n", "This also exemplifies how Python data structures and formats are often nested within one another: a dictionary can easily include another dictionary, which contain both integer and string objects as keys and values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a dictionary whose values consist of another dictionary\n", "# with three keys: 'age', 'location' and 'text'.\n", "sents_dict = {0: {\"age\": 23, \n", " \"location\": \"Helsinki\", \n", " \"text\": \"The Senate Square is by far the most important landmark in Helsinki.\"\n", " },\n", " 1: {\"age\": 35, \n", " \"location\": \"Tallinn\", \n", " \"text\": \"The Old Town, for sure.\"\n", " },\n", " 2: {\"age\": 58, \n", " \"location\": \"Stockholm\", \n", " \"text\": \"Södermalm is interesting!\"\n", " }\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's loop over the `sents_dict` dictionary to process the examples and add the custom attributes to the resulting *Doc* objects." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set up a placeholder list to hold the processed texts\n", "docs = []\n", "\n", "# Loop over pairs of keys and values in the 'sents_dict' dictionary.\n", "# Note that the key/value pairs are available under the items() method.\n", "# We refer to these keys and values as 'key' and 'data', respectively.\n", "# This means that we used the variable 'data' to refer to the nested\n", "# dictionary.\n", "for key, data in sents_dict.items():\n", " \n", " # Retrieve the value under the key 'text' from the nested dictionary.\n", " # Feed this text to the language model under 'nlp' and assign the \n", " # result to the variable 'doc'.\n", " doc = nlp(data['text'])\n", " \n", " # Retrieve the values for 'age' and 'location' from the nested dictionary.\n", " # Assign these values into the custom attributes defined for the Doc object.\n", " # Note that custom attributes reside under a pseudo attribute consisting of\n", " # an underscore '_'! \n", " doc._.age = data['age']\n", " doc._.location = data['location']\n", " \n", " # Append the current Doc object under 'doc' to the list 'docs'\n", " docs.append(doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This provides a list of *Doc* objects, which is assigned under the variable `docs`.\n", "\n", "Let's loop over the `docs` list and print out the *Doc* and its custom attributes." 
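, "\n", "Before printing the results, one practical note: calling `set_extension()` with a name that has already been registered raises an error, for example when a notebook cell is run twice. The cell below is an optional sketch that guards against this using the `has_extension()` method; alternatively, passing `force=True` to `set_extension()` overwrites an existing extension." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Only register an extension if it does not already exist; this\n", "# avoids errors if the cells above are run more than once.\n", "if not Doc.has_extension('age'):\n", "    Doc.set_extension('age', default=None)\n", "\n", "if not Doc.has_extension('location'):\n", "    Doc.set_extension('location', default=None)"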
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over each Doc object in the list 'docs'\n", "for doc in docs:\n", " \n", " # Print each Doc and the 'age' and 'location' attributes\n", " print(doc, doc._.age, doc._.location)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The custom attributes can be used, for example, to filter the data.\n", "\n", "One efficient way to filter the data is to use a Python *list comprehension*.\n", "\n", "A comprehension evaluates the contents of an existing list and populates a new list based on some criteria.\n", "\n", "A list comprehension is like a `for` loop that is declared on the fly using brackets `[]`, which are used to designate lists in Python.\n", "\n", "In this case, the list comprehension consists of three components:\n", " \n", " - The first reference to `doc` on the left hand-side of the `for` statement defines what is stored in the new list. We store whatever is stored in the original list, that is, a *Doc* object.\n", " - The `for` .. `in` statements operate just like in a `for` loop. We loop over items in the list `docs` and refer to each item using the variable `doc`.\n", " - The third statement beginning with `if` defines a conditional: we only include *Doc* objects whose custom attribute `age` has a value below 40." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use a list comprehension to filter the Docs for those whose\n", "# 'age' attribute has a value under 40.\n", "under_forty = [doc for doc in docs if doc._.get('age') < 40]\n", "\n", "# Call the variable to examine the output\n", "under_forty" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns a list with only two *Doc* objects that fill the designated criteria, that is, their `age` attribute has a value below 40." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing processed texts to disk" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('zKvW8o-1wmk', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When working with high volumes of texts, you should first ensure that the pipeline produces the desired results by using a smaller number of texts. \n", "\n", "Once everything works as desired, you can proceed to process all of the text and save the result, because processing large volumes of text takes time and resources.\n", "\n", "spaCy provides a special object type named *DocBin* for storing *Doc* objects that contain linguistic annotations from spaCy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the DocBin object from the 'tokens' module in spacy\n", "from spacy.tokens import DocBin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a *DocBin* object is easy. To populate the *DocBin* object with *Docs* upon creation, use the `docs` argument to pass a Python generator or list that contains *Doc* objects.\n", "\n", "In this case, we add the three *Docs* stored under the variable `docs` to the *DocBin*." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize a DocBin object and add Docs from 'docs'\n", "docbin = DocBin(docs=docs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have added custom attributes to *Docs*, *Spans*, or *Tokens*, you must also set the `store_user_data` argument to `True`, e.g. `DocBin(docs=docs, store_user_data=True)`. \n", "\n", "We can easily verify that all three *Docs* made it into the *DocBin* by examining the output of its `__len__()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the number of Docs in the DocBin\n", "docbin.__len__()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `add()` method allows adding additional *Doc* objects to the *DocBin* if necessary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define and feed a string object the language model under 'nlp'\n", "# and add the resulting Doc to the DocBin object 'docbin'\n", "docbin.add(nlp(\"Yet another Doc object.\"))\n", "\n", "# Verify that the Doc was added; length should be now 4\n", "docbin.__len__()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have populated a *DocBin* object with your data, you must write the object to a disk for storage.\n", "\n", "This can be achieved using the `to_disk()` method of the *DocBin*.\n", "\n", "The `to_disk()` method takes a single argument, `path`, which defines a path to the file in which the *DocBin* object should be written.\n", "\n", "Let's write the *DocBin* object into a file named `docbin.spacy` in the `data` directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write the DocBin object to disk\n", "docbin.to_disk(path='data/docbin.spacy')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To load a *DocBin* object from disk, you need to first initialise an empty *DocBin* object using `DocBin()` and then use the `from_disk()` method to load the actual *DocBin* object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialise a new DocBin object and use the 'from_disk' method to\n", "# load the data from the disk. Assign the result to the variable\n", "# 'docbin_loaded'.\n", "docbin_loaded = DocBin().from_disk(path='data/docbin.spacy')\n", "\n", "# Call the variable to examine the output\n", "docbin_loaded" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, to access the *Doc* object stored within the *DocBin*, you must use the `get_docs()` method.\n", "\n", "The `get_docs()` method takes a single argument, `vocab`, which takes the vocabulary of a *Language* object as input. \n", "\n", "The vocabulary, which is stored under the `vocab` attribute of a *Language* object, is needed to reconstruct the information stored in the *DocBin*." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the 'get_docs' method to retrieve Doc objects from the DocBin,\n", "# passing the vocabulary under 'nlp.vocab' to reconstruct the data.\n", "# Cast the resulting generator object into a list for examination.\n", "docs_loaded = list(docbin_loaded.get_docs(nlp.vocab))\n", "\n", "# Call the variable to examine the output\n", "docs_loaded" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns a list that contains the four *Doc* objects added to the *DocBin* above.\n", "\n", "To summarise, you should ideally process the texts once, write them to disk and load them for further analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simplifying output for noun phrases and named entities \n", "\n", "### Merging noun phrases\n", "\n", "The [previous section](../part_ii/03_basic_nlp.ipynb) described how tasks such as part-of-speech tagging and dependency parsing involve making predictions about individual *Tokens* and their properties, such as their part-of-speech tags or syntactic dependencies.\n", "\n", "Occasionally, however, it may be more beneficial to operate with larger linguistic units instead of individual *Tokens*, such as noun phrases that consist of multiple *Tokens*.\n", "\n", "spaCy provides access to noun phrases via the attribute `noun_chunks` of a *Doc* object.\n", "\n", "Let's print out the noun chunks in each _Doc_ object contained in the list `docs`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define the first for-loop over the list 'docs'\n", "# The variable 'doc' refers to items in the list\n", "for doc in docs:\n", " \n", " # Loop over each noun chunk in the Doc object\n", " for noun_chunk in doc.noun_chunks:\n", " \n", " # Print noun chunk\n", " print(noun_chunk)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These two `for` loops return several noun phrases that consist of multiple _Tokens_.\n", "\n", "For merging noun phrases into a single *Token*, spaCy provides a function named `merge_noun_tokens` that can be added to the pipeline stored in a *Language* object using the `add_pipe` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add component that merges noun phrases into single Tokens\n", "nlp.add_pipe('merge_noun_chunks')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we do not need to reassign the *Language* object under `nlp` to the same variable to update its contents. The `add_pipe` method adds the component to the *Language* object automatically. \n", "\n", "We can verify that the component was added successfully by examining the `pipeline` attribute under the *Language* object `nlp`. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# List the pipeline components\n", "nlp.pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the final tuple in the list consists of the `merge_noun_chunks` function.\n", "\n", "To examine the consequences of adding this function to the pipeline, let's process the three example sentences again using the `pipe()` method of the _Language_ object `nlp`.\n", "\n", "We overwrite the previous results stored under the same variable by assigning the output to the variable `docs`.\n", "\n", "Note that we also cast the result into a list by wrapping the _Language_ object and the `pipe()` method into a `list()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Apply the Language object 'nlp' to the list of sentences under 'sents'\n", "docs = list(nlp.pipe(sents))\n", "\n", "# Call the variable to examine the output\n", "docs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Superficially, everything remains the same: the list contains three *Doc* objects. \n", "\n", "However, if we loop over the *Tokens* in the first *Doc* object in the list, which can be accessed using brackets at position zero, e.g. `[0]`, we can see that the noun phrases are now merged and tagged using the label `NOUN`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over Tokens in the first Doc object in the list\n", "for token in docs[0]:\n", " \n", " # Print out the Token and its part-of-speech tag\n", " print(token, token.pos_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tagging noun phrases using the label `NOUN` is a reasonable approximation, as their head words are nouns.\n", "\n", "As rendering the syntactic parse using displacy shows, merging the noun phrases simplifies the parse tree." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "displacy.render(docs[0], style='dep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although the noun phrases are now represented by single *Tokens*, the noun chunks are still available under the `noun_chunks` attribute of the *Doc* object.\n", "\n", "As shown below, spaCy stores noun chunks as *Spans*, whose `start` attribute determines the index of the Token where the *Span* starts, while the `end` attribute determines where the *Span* has ended.\n", "\n", "This information is useful, if the syntactic analysis reveals patterns that warrant a closer examination of the noun chunks and their structure." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over the noun chunks in the first Doc object [0] in the list 'docs'\n", "for noun_chunk in docs[0].noun_chunks:\n", " \n", " # Print out noun chunk, its type, the Token where the chunks starts and where it ends\n", " print(noun_chunk, type(noun_chunk), noun_chunk.start, noun_chunk.end)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging named entities\n", "\n", "Named entities can be merged in the same way as noun phrases by providing `merge_entities` to the `add_pipe()` method of the *Language* object.\n", "\n", "Let's start by removing the `merge_noun_chunks` function from the pipeline.\n", "\n", "The `remove_pipe()` method can be used to remove a pipeline component." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Remove the 'merge_noun_chunks' function from the pipeline under 'nlp'\n", "nlp.remove_pipe('merge_noun_chunks')\n", "\n", "# Process the original sentences again\n", "docs = list(nlp.pipe(sents))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The method returns a tuple containing the name of the removed component (in this case, a function) and the component itself.\n", "\n", "We can verify this by calling the `pipeline` attribute of the *Language* object `nlp`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nlp.pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's add the `merge_entities` component to the pipeline under `nlp`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add the 'merge_entities' function to the pipeline\n", "nlp.add_pipe('merge_entities')\n", "\n", "# Process the data again\n", "docs = list(nlp.pipe(sents))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's examine the result by looping over the _Tokens_ in the third _Doc_ object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over Tokens in the third Doc object in the list\n", "for token in docs[2]:\n", " \n", " # Print out the Token and its part-of-speech tag\n", " print(token, token.pos_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Named entities that consist of multiple *Tokens*, as exemplified by place names such as \"Los Alamos\" and \"New Mexico\", have been merged into single *Tokens*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section should have given you an idea of how to tailor spaCy to fit your needs, how to process texts efficiently and how to save the result to disk.\n", "\n", "The [following section](../part_ii/05_evaluating_nlp.ipynb) introduces you to evaluating the performance of language models. " ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" } }, "nbformat": 4, "nbformat_minor": 4 }