{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Word embeddings in spaCy\n", "\n", "The previous [section](04_embeddings.ipynb) introduced the distributional hypothesis, which underlies modern approaches to *distributional semantics* (Boleda [2020](https://doi.org/10.1146/annurev-linguistics-011619-030303)) and the technique of word embeddings, that is, learning numerical representations for words that approximate their meaning.\n", "\n", "We started by exploring the distributional hypothesis by quantifying word occurrences, essentially using word counts as an abstraction mechanism that enabled us to represent linguistic information numerically. \n", "\n", "We then moved to explore the use of a neural network as the abstraction mechanism, learning numerical representations from the data through a proxy task that involved predicting the neighbouring words. \n", "\n", "In this section, we proceed to word embeddings learned from massive volumes of texts, and their use in the spaCy library.\n", "\n", "After reading this section, you should:\n", "\n", " - understand what word embeddings can be used for\n", " - know how to use word embeddings in spaCy\n", " - know how to visualise words in their embedding space\n", " - know how to use contextual word embeddings in spaCy\n", " - know how to add a custom component to the spaCy pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using word embeddings in spaCy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('CHRzdvZX_mw', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "spaCy provides 300-dimensional word embeddings for several languages, which have been learned from large corpora.\n", "\n", "In other words, each word in the model's vocabulary is represented by a list of 300 floating point numbers – a vector – and these vectors are embedded into a 300-dimensional space.\n", "\n", "To explore the use of word vectors in spaCy, let's start by loading a large language model for English, which contains word vectors for 685 000 *Token* objects." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import spacy\n", "import spacy\n", "\n", "# Load a large language model and assign it to the variable 'nlp_lg'\n", "nlp_lg = spacy.load('en_core_web_lg')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's define an example sentence and feed it to the language model under `nlp_lg` for processing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define example sentence\n", "text = \"The Shiba Inu is a dog that is more like a cat.\"\n", "\n", "# Feed example sentence to the language model under 'nlp_lg'\n", "doc = nlp_lg(text)\n", "\n", "# Call the variable to examine the output\n", "doc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us a spaCy *Doc* object.\n", "\n", "Let's examine the word vector for the second *Token* in the *Doc* object (\"Shiba\"), which can be accessed through its attribute `vector`.\n", "\n", "Instead of printing the 300 floating point numbers that constitute the vector, let's limit the output to the first thirty dimensions using `[:30]`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve the second Token in the Doc object at index 1, and \n", "# the first 30 dimensions of its vector representation\n", "doc[1].vector[:30]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These floating point numbers encode information about this *Token*, which the model has learned by observing the word in its context of occurrences.\n", "\n", "Just as explained in the [previous section](04_embeddings.ipynb), we can use [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to measure the similarity of two vectors. \n", "\n", "spaCy implements the measure of cosine similarity in the `similarity()` method, which is available for *Token*, *Span* and *Doc* objects.\n", "\n", "The `similarity()` method can take any of these objects as input and calculate cosine similarity between their vector representations stored under the `vector` attribute.\n", "\n", "For convenience, let's assign the *Tokens* \"dog\" and \"cat\" in our example *Doc* object into the variables `dog` and `cat` and compare their similarity." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assing the fifth and eleventh items in the Doc into their own variables\n", "dog = doc[5]\n", "cat = doc[11]\n", "\n", "# Compare the similarity between Tokens 'dog' and 'cat'\n", "dog.similarity(cat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not surprisingly, the vectors for cats and dogs are very similar (and thus close to each other in the 300-dimensional embedding space), because these words are likely to appear in similar linguistic contexts, as both cats and dogs are common household pets.\n", "\n", "For comparison, let's retrieve the vector representation for \"snake\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feed the string \"snake\" to the language model; store result under 'snake'\n", "snake = nlp_lg(\"snake\")\n", "\n", "# Compare the similarity of 'snake' and 'dog'\n", "snake.similarity(dog)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Turns out the vector for \"snake\" is not that similar to the vector for \"dog\", although both are animals. Presumably, these words occur in different contexts.\n", "\n", "Finally, let's compare the similarity of the vectors for \"car\" and \"snake\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feed the string \"car\" to the language model and calculate similarity to Token 'snake'\n", "snake.similarity(nlp_lg(\"car\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not surprisingly, the vectors for \"car\" and \"snake\" are not very similar at all, as these words are not likely to occur in similar linguistic contexts." ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "### Quick exercise\n", "\n", "Define two words with similar or dissimilar meanings, feed them to the language model and compare their cosine similarity." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Write your code below this line and press Shift and Enter to run the code\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As pointed out above, spaCy also provides word vectors for entire *Doc* objects or *Span* objects within them.\n", "\n", "To move beyond *Tokens*, let's start by examining the *Doc* object. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Call the variable to examine the output\n", "doc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The vector for this *Doc* object is also available under the attribute `vector`.\n", "\n", "Instead of examining the actual vector stored under `vector`, let's retrieve the value stored under its `shape` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve the 'shape' attribute for the vector\n", "doc.vector.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives the length of the vector, which shows that just like the *Token* objects above, the *Doc* object has a 300-dimensional vector that encodes information about its meaning.\n", "\n", "In spaCy, the vector representation for the entire *Doc* is calculated by **averaging the vectors** for each *Token* in the *Doc*.\n", "\n", "The same applies to _Spans_, which can be examined by retrieving the noun phrases in `doc`, which are available under the `noun_chunks` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the noun chunks under the attribute 'noun_chunks'. This returns\n", "# a generator, so we cast the output into a list named 'n_chunks'.\n", "n_chunks = list(doc.noun_chunks)\n", "\n", "# Call the variable to examine the output\n", "n_chunks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the example sentence has several noun phrases. \n", "\n", "Let's examine the shape of the vector for the first noun phrase, \"The Shiba Inu\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the shape of the vector for the first noun chunk in the list\n", "n_chunks[0].vector.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just as the *Doc* object, the *Span* object has a 300-dimensional vector. This vector is also calculated by averaging the vectors for each *Token* in the *Span*.\n", "\n", "We can also use the `similarity()` method to measure cosine similarity between *Span* objects.\n", "\n", "Let's compare the similarity of noun phrases \"The Shiba Inu\" `[0]` and \"a dog\" `[1]`.\n", "\n", "Based on our world knowledge, we know that these noun phrases belong to the same semantic field: as a dog breed, the Shiba Inu is a hyponym of dog.\n", "\n", "For this reason, they should presumably occur in similar contexts and thus their vectors should be close to each other in the embedding space." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compare the similarity of the two noun chunks\n", "n_chunks[0].similarity(n_chunks[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Turns out that the embeddings for the noun phrases \"The Shiba Inu\" and \"a dog\" are about as similar as those of \"dog\" and \"snake\" above!\n", "\n", "To understand why the vectors for these noun phrases are dissimilar, we must dive deeper into the word embeddings and the effects of averaging vectors for linguistic units beyond a single *Token*.\n", "\n", "This effort can be supported using visualisations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualising word embeddings" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('l66QaVT68W8', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "whatlies is an open source library for visualising \"what lies\" in word embeddings, that is, what kinds of information they encode (Warmerdam et al. [2020](https://www.aclweb.org/anthology/2020.nlposs-1.8.pdf)).\n", "\n", "The whatlies library is intended to support the interpretation of high-dimensional word embeddings. In this context, high-dimensional refers to the number of dimensions in the embedding space.\n", "\n", "High-dimensional spaces are notoriously difficult to comprehend, as our experience as embodied beings is strongly grounded into a three-dimensional space.\n", "\n", "Visualisations such as those provided by whatlies may help to alleviate this challenge.\n", "\n", "The whatlies library provides wrappers for language models from various popular natural language processing libraries, including spaCy.\n", "\n", "These wrappers are essentially Python classes that know what to do when provided with an object that contains a language model from a given library.\n", "\n", "We therefore import the `SpacyLanguage` object from whatlies and *wrap* the spaCy *Language* object stored under the variable `nlp_lg` into this object. \n", "\n", "We then assign the result to the variable `language_model`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the wrapper class for spaCy language models\n", "from whatlies.language import SpacyLanguage\n", "\n", "# Wrap the spaCy language model under 'nlp_lg' into the\n", "# whatlies SpacyLanguage class and assign the result \n", "# under the variable 'language_model'\n", "language_model = SpacyLanguage(nlp_lg)\n", "\n", "# Call the variable to examine the output\n", "language_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a *SpacyLanguage* object that wraps a spaCy *Language* object.\n", "\n", "Before we proceed any further, let's take a closer look at the list of noun phrases stored under `n_chunks`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over each noun phrase\n", "for chunk in n_chunks:\n", " \n", " # Loop over each Token in noun phrase\n", " for token in chunk:\n", " \n", " # Print Token attributes 'text', 'oov', 'vector' and separate\n", " # each attribute by a string object containing a tabulator \\t\n", " # sequence for pretty output\n", " print(token.text, '\\t', token.is_oov, '\\t', token.vector[:3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `is_oov` attribute of a *Token* corresponds to **out of vocabulary** and returns `True` or `False` depending on whether the *Token* is included in the vocabulary of the language model or not.\n", "\n", "In this case, all *Tokens* are present in the vocabulary, hence their value is `False`.\n", "\n", "`vector[:3]` returns the first three dimensions in the 300-dimensional word vector.\n", "\n", "Note that just as one might expect, the vector for the indefinite article \"a\" is the same for \"a\" in both \"a dog\" and a \"a cat\". This suggests that the word vectors are *static*, which is an issue to which we will return below.\n", "\n", "However, if a *Token* were out of vocabulary, the values of each dimension would be set to zero.\n", "\n", "Let's examine this by mistyping \"Shiba Inu\" as \"shibainu\", feed this string to the language model under `nlp` and retrieve the values for the first three dimensions of its vector." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feed the string 'shibainu' to the language model and assign\n", "# the result under the variable 'shibainu'\n", "shibainu = nlp_lg(\"shibainu\")\n", "\n", "# Retrieve the first three dimensions of its word vector\n", "shibainu.vector[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first three dimensions are set to zero, suggesting that the word is out of vocabulary.\n", "\n", "We can easily double-check this using the `is_oov` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check if the first item [0] in the Doc object 'shibainu'\n", "# is out of vocabulary\n", "shibainu[0].is_oov" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values of a vector determine its *direction* and *magnitude* in the embedding space, and zero values do not provide information about either.\n", "\n", "This information is crucial, because word embeddings are based on the idea that semantically similar words are close to each other in the embedding space.\n", "\n", "We can use the visualisations in the whatlies library to explore this idea further.\n", "\n", "To examine the embeddings for noun phrases in the `n_chunks` list using whatlies, we must populate the list with string objects rather than *Spans*.\n", "\n", "We therefore define a list comprehension that retrieves the plain text stored under the attribute `text` of a *Span* object and stores this string into a list of the same name, that is, `n_chunks`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over noun chunks, retrieve plain text and store\n", "# the result under the variable 'n_chunks'\n", "n_chunks = [n_chunk.text for n_chunk in n_chunks]\n", "\n", "# Call the variable to examine the output\n", "n_chunks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the noun chunks as Python string objects in a list, we can feed them to the whatlies *SpacyLanguage* object stored under `language_model`.\n", "\n", "The input must be placed in brackets `[ ]` right after the variable name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve embeddings for items in list 'n_chunks'\n", "# and store the result under 'embeddings'\n", "embeddings = language_model[n_chunks]\n", "\n", "# Call the variable to examine the output\n", "embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns a whatlies *EmbSet* object which stores the embeddings for our noun phrases.\n", "\n", "To visualize the embeddings, we can use the `plot()` method of the *EmbSet* object.\n", "\n", "The arguments `kind`, `color`, `x_axis` and `y_axis` instruct whatlies to draw red arrows that plot the direction and magnitude of each vector along dimensions $0$ and $1$ of the 300-dimensional vector space." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "embeddings.plot(kind='arrow', color='red', x_axis=0, y_axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each vector originates at the point $(0, 0)$. We can see that along dimensions $0$ and $1$, the directions and magnitudes (or: length) of the vectors differ considerably.\n", "\n", "Note, however, that the plot above visualises just two dimensions of the 300-dimensional embedding space. Along some other dimensions, the vectors for \"dog\" and \"cat\" may be much closer to each other.\n", "\n", "This allows word vectors to encode representations in a flexible manner: two words may be close to each other along certain dimensions, while differing along others.\n", "\n", "We can also use whatlies to explore the effect of averaging vectors for constructing vector representations for units beyond a single *Token*.\n", "\n", "To do so, let's retrieve embeddings for the indefinite article \"a\", the noun \"dog\" and the noun phrase \"a dog\" and plot the result along the same dimensions as above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feed a list of string objects to the whatlies language object to get an EmbSet object\n", "dog_embeddings = language_model[['a', 'dog', 'a dog']]\n", "\n", "# Plot the EmbSet\n", "dog_embeddings.plot(kind='arrow', color='red', x_axis=0, y_axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Along dimensions $0$ and $1$, the vector for \"a dog\" is positioned right in the middle between the vectors for \"a\" and \"dog\", because the vector for \"a dog\" is an average of the vectors for \"a\" and \"dog\".\n", "\n", "We can easily verify this by getting the values for dimension $0$ from the 300-dimensional vectors for the tokens \"a\" and \"dog\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the embedding for 'a' from the EmbSet object 'dog_embeddings'; use the vector attribute\n", "# and bracket to retrieve the value at index 0. 
Do the same for 'dog'. Assign under variables\n", "# of the same name.\n", "a = dog_embeddings['a'].vector[0]\n", "dog = dog_embeddings['dog'].vector[0]\n", "\n", "# Calculate average value and assign under 'dog_avg'\n", "dog_avg = (a + dog) / 2\n", "\n", "# Call the variable to examine the result\n", "dog_avg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you look at the plot above, you see that this value falls right where the arrow for \"a dog\" points along dimension $0$ on the horizontal axis.\n", "\n", "We can verify this by getting the value for the first dimension in the vector for \"a dog\" from the *EmbSet* object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the value for the first dimension of the vector for 'a dog'\n", "dog_embeddings['a dog'].vector[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This raises the question of whether averaging vectors for individual *Tokens* is a suitable strategy for representing larger linguistic units, because the direction and magnitude of a vector are supposed to capture the \"meaning\" of a word in relation to other words in the model's vocabulary (Boleda [2020](https://doi.org/10.1146/annurev-linguistics-011619-030303)).\n", "\n", "To put it simply, averaging *Token* vectors to represent entire clauses and sentences may dilute the information encoded in the vectors, which also raises the question of whether the indefinite article \"a\" and the noun \"dog\" are equally informative in the noun phrase \"a dog\".\n", "\n", "One should also note that the vector representations are **static**. As we saw above, the vector representation for the indefinite article \"a\" remains the same regardless of the context in which this article occurs. In other words, the particular words in a model's vocabulary, such as \"a\", are always mapped to the same vector representation. The unique words in a model's vocabulary are often described as **lexical types**, whereas their instances in the data are known as **tokens**. \n", "\n", "We know, however, that the same word (or lexical type) may have different meanings, which can be inferred from the context in which it occurs, but this cannot be captured by word embeddings, which model lexical types, not tokens. In other words, although the vector representations are learned by making predictions about co-occurring words, information about the contexts in which the tokens occur is not encoded into the vector representation.\n", "\n", "![](img/type_token.svg)\n", "\n", "This limitation has been addressed by **contextual word embeddings**, which often use a neural network architecture known as the [Transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)). This architecture can encode information about the context in which a given token occurs into the vector representation.\n", "\n", "Some models that build on this architecture include BERT (Devlin et al. [2019](https://www.aclweb.org/anthology/N19-1423/)) and GPT-3 (Brown et al. [2020](https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html)). Both models are massive, featuring billions of parameters, and thus slow and expensive to train. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contextual word embeddings from Transformers" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('fAeW1D37h90', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given the time and resources needed to train a language model using the Transformer architecture from scratch, they are often trained once and then fine-tuned to specific tasks. In this context, fine-tuning refers to training only a part of the network, adapting what the model has already learned to more specific tasks.\n", "\n", "These tasks include, for example, part-of-speech tagging, dependency parsing and the other tasks introduced in [Part II](../part_ii/03_basic_nlp.ipynb).\n", "\n", "spaCy provides Transformer-based language models for English and several other languages, which outperform the \"traditional\" pipelines in terms of accuracy, but are slower to apply.\n", "\n", "Let's start by loading a Transformer-based for the English language and assign this model under the variable `nlp_trf`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load a Transformer-based language model; assing to variable 'nlp'\n", "nlp_trf = spacy.load('en_core_web_trf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the surface, a *Language* object that contains a Transformer-based model looks and works just like any other language model in spaCy.\n", "\n", "However, if we look under the hood of the *Language* object under `nlp_trf` using the `pipeline` attribute (see [Part II](../part_ii/04_basic_nlp_continued.ipynb)), we will see that the first component in the processing pipeline is a *Transformer*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Call the 'pipeline' attribute to examine the processing pipeline\n", "nlp_trf.pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Transformer component generates vector representations that are then used for making predictions about the *Doc* objects and the *Tokens* contained within.\n", "\n", "These include, among others, the standard linguistic annotations in the form of part-of-speech tags, syntactic dependencies, morphological features, named entities and lemmas.\n", "\n", "Let's define an example sentence and feed it to the Transformer-based language model under `nlp_trf` and store the resulting *Doc* object under `example_doc`.\n", "\n", "Note that Python may raise a warning, because it is not generally recommended to use a Transformer model without a graphics processing unit (GPU). This naturally applies to processing large volumes of text – for current purposes, we can safely use the model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feed an example sentence to the model; store output under 'example_doc'\n", "example_doc = nlp_trf(\"Helsinki is the capital of Finland.\")\n", "\n", "# Check the length of the Doc object\n", "example_doc.__len__()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "spaCy stores the vector representations generated by the Transformer into a *TransformerData* object, which can be accessed under the custom attribute `trf_data` of a *Doc* object.\n", "\n", "Remember that spaCy stores custom attributes under a dummy attribute marked by an underscore `_`, which is reserved for user-defined attributes, as explained in [Part II](../part_ii/04_basic_nlp_continued.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the type of the 'trf_data' object using the type() function\n", "type(example_doc._.trf_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output of from the Transformer is contained in the *[TransformerData](https://spacy.io/api/transformer#transformerdata)* object, which we will now explore in greater detail. \n", "\n", "To begin with, the `tensors` attribute of a *TransformerData* object contains a Python list with vector representations generated by the Transformer for individual *Tokens* and the entire *Doc* object.\n", "\n", "The first item in the `tensors` list under index 0 contains the output for individual *Tokens*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the shape of the first item in the list\n", "example_doc._.trf_data.tensors[0].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second item under index 1 holds the output for the entire *Doc*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the shape of the first item in the list\n", "example_doc._.trf_data.tensors[1].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In both cases, the Transformer output is stored in a *tensor*, which is a mathematical term for describing a \"bundle\" of numerical objects (e.g. vectors) and their shape.\n", "\n", "In the case of *Tokens*, we have a batch of 1 that consists of 11 vectors with 768 dimensions each.\n", "\n", "We can access the first ten dimensions of each vector using the expression `[:10]`.\n", "\n", "Note that we need the preceding `[0]` to enter the first \"batch\" of vectors in the tensor." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the first ten dimensions of the tensor\n", "example_doc._.trf_data.tensors[0][0][:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike word embeddings, which leverage information about co-occurring words to *learn* representations for tokens but discard this information afterwards, these embeddings also encode information about the context in which the word occurs!\n", "\n", "But why is a spaCy *Doc* object with 7 *Token* objects represented by 11 vectors?\n", "\n", "In the [previous section](../part_iii/04_embeddings.ipynb) we learned that vocabulary size is a frequent challenge in language modelling. 
Learning representations for every unique word would blow up the size of the model!\n", "\n", "Because Transformers are trained on massive volumes of text, the model's vocabulary must be limited somehow.\n", "\n", "To address this issue, Transformers use more complex tokenizers that identify frequently occurring character sequences in the data and learn embeddings for these sequences instead. These sequences, which are often referred as *subwords*, make up the vocabulary of the Transformer.\n", "\n", "Let's examine how the example *Doc* under `example_doc` was tokenized for input to the Transformer.\n", "\n", "This information is stored under the attribute `tokens` of the *TransformerData* object, which contains a dictionary. We can find the subwords under the key `input_texts`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Access the Transformer tokens under the key 'input_texts'\n", "example_doc._.trf_data.tokens['input_texts']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This provides the tokens provided to the Transformer *by its own tokenizer*. In other words, the Transformer does not use the same tokens as spaCy.\n", "\n", "The input begins and terminates with tokens `` and ``, which mark the beginning and the end of the input sequence. The Transformer tokenizer also uses the character `Ġ` as a prefix to indicate that the token is preceded by a whitespace character.\n", "\n", "For the most part, the Transformer tokens correspond roughly to those produced by spaCy, except for \"Helsinki\".\n", "\n", "Because the token \"Helsinki\" is not present in the Transformer's vocabulary, the token is broken down into three subwords that exist in the vocabulary: `H`, `els` and `inki`. Their vectors are used to construct a representation for the token \"Helsinki\".\n", "\n", "![](img/alignment.svg)\n", "\n", "To map these vectors to *Tokens* in the spaCy *Doc* object, we must retrieve alignment information from the `align` attribute of the *TransformerData* object.\n", "\n", "The `align` attribute can be indexed using the indices of *Token* objects in the *Doc* object. \n", "\n", "To exemplify, we can retrieve the first *Token* \"Helsinki\" in the *Doc* object `doc` using the expression `example_doc[0]`.\n", "\n", "We then use the index of this *Token* in the *Doc* object to retrieve alignment data, which is stored under the `align` attribute.\n", "\n", "More specifically, we need the information stored under the attribute `data`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the first spaCy Token, \"Helsinki\", and its alignment data\n", "example_doc[0], example_doc._.trf_data.align[0].data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `data` attribute contains a NumPy array that identifies which vectors in the list stored under the `tensors` attribute of a *TransformerData* object contain representations for this *Token*.\n", "\n", "In this case, vectors at indices 1, 2 and 3 in the batch of 11 vectors contain the representation for \"Helsinki\"." 
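] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before wrapping this logic into a pipeline component, the following cell sketches how the alignment data can be used by hand: we collect the subword vectors aligned with the *Token* \"Helsinki\" and average them into a single 768-dimensional vector. This is a minimal, illustrative example that mirrors the approach used in the custom pipeline component defined below; the variable names `tensor_ix`, `out_dim`, `subword_vectors` and `helsinki_vector` are our own and not part of the spaCy API." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the indices of the subword vectors ('H', 'els' and 'inki') aligned\n", "# with the first Token in the Doc; flatten the array for indexing\n", "tensor_ix = example_doc._.trf_data.align[0].data.flatten()\n", "\n", "# Reshape the batched Transformer output so that each row holds one subword\n", "# vector, then pick out the rows aligned with the Token \"Helsinki\"\n", "out_dim = example_doc._.trf_data.tensors[0].shape[-1]\n", "subword_vectors = example_doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]\n", "\n", "# Average the subword vectors to get a single 768-dimensional vector; check its shape\n", "helsinki_vector = subword_vectors.mean(axis=0)\n", "helsinki_vector.shape"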
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the contextual embeddings from the Transformer efficiently, we can define a component that retrieves contextual word embeddings for *Docs*, *Spans* and *Tokens* and add this component to the spaCy pipeline.\n", "\n", "This can be achieved by creating a new Python *Class* – a user-defined object with attributes and methods.\n", "\n", "Because the new *Class* will become a component of the spaCy pipeline, we must first import the *Language* object and let spaCy know that we are now defining a new pipeline component." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the Language object under the 'language' module in spaCy,\n", "# and NumPy for calculating cosine similarity.\n", "from spacy.language import Language\n", "import numpy as np\n", "\n", "# We use the @ character to register the following Class definition\n", "# with spaCy under the name 'tensor2attr'.\n", "@Language.factory('tensor2attr')\n", "\n", "# We begin by declaring the class name: Tensor2Attr. The name is \n", "# declared using 'class', followed by the name and a colon.\n", "class Tensor2Attr:\n", " \n", " # We continue by defining the first method of the class, \n", " # __init__(), which is called when this class is used for \n", " # creating a Python object. Custom components in spaCy \n", " # require passing two variables to the __init__() method:\n", " # 'name' and 'nlp'. The variable 'self' refers to any\n", " # object created using this class!\n", " def __init__(self, name, nlp):\n", " \n", " # We do not really do anything with this class, so we\n", " # simply move on using 'pass' when the object is created.\n", " pass\n", "\n", " # The __call__() method is called whenever some other object\n", " # is passed to an object representing this class. Since we know\n", " # that the class is a part of the spaCy pipeline, we already know\n", " # that it will receive Doc objects from the preceding layers.\n", " # We use the variable 'doc' to refer to any object received.\n", " def __call__(self, doc):\n", " \n", " # When an object is received, the class will instantly pass\n", " # the object forward to the 'add_attributes' method. The\n", " # reference to self informs Python that the method belongs\n", " # to this class.\n", " self.add_attributes(doc)\n", " \n", " # After the 'add_attributes' method finishes, the __call__\n", " # method returns the object.\n", " return doc\n", " \n", " # Next, we define the 'add_attributes' method that will modify\n", " # the incoming Doc object by calling a series of methods.\n", " def add_attributes(self, doc):\n", " \n", " # spaCy Doc objects have an attribute named 'user_hooks',\n", " # which allows customising the default attributes of a \n", " # Doc object, such as 'vector'. We use the 'user_hooks'\n", " # attribute to replace the attribute 'vector' with the \n", " # Transformer output, which is retrieved using the \n", " # 'doc_tensor' method defined below.\n", " doc.user_hooks['vector'] = self.doc_tensor\n", " \n", " # We then perform the same for both Spans and Tokens that\n", " # are contained within the Doc object.\n", " doc.user_span_hooks['vector'] = self.span_tensor\n", " doc.user_token_hooks['vector'] = self.token_tensor\n", " \n", " # We also replace the 'similarity' method, because the \n", " # default 'similarity' method looks at the default 'vector'\n", " # attribute, which is empty! 
We must first replace the\n", " # vectors using the 'user_hooks' attribute.\n", " doc.user_hooks['similarity'] = self.get_similarity\n", " doc.user_span_hooks['similarity'] = self.get_similarity\n", " doc.user_token_hooks['similarity'] = self.get_similarity\n", " \n", " # Define a method that takes a Doc object as input and returns \n", " # Transformer output for the entire Doc.\n", " def doc_tensor(self, doc):\n", " \n", " # Return Transformer output for the entire Doc. As noted\n", " # above, this is the last item under the attribute 'tensor'.\n", " # Average the output along axis 0 to handle batched outputs.\n", " return doc._.trf_data.tensors[-1].mean(axis=0)\n", " \n", " # Define a method that takes a Span as input and returns the Transformer \n", " # output.\n", " def span_tensor(self, span):\n", " \n", " # Get alignment information for Span. This is achieved by using\n", " # the 'doc' attribute of Span that refers to the Doc that contains\n", " # this Span. We then use the 'start' and 'end' attributes of a Span\n", " # to retrieve the alignment information. Finally, we flatten the\n", " # resulting array to use it for indexing.\n", " tensor_ix = span.doc._.trf_data.align[span.start: span.end].data.flatten()\n", " \n", " # Fetch Transformer output shape from the final dimension of the output.\n", " # We do this here to maintain compatibility with different Transformers,\n", " # which may output tensors of different shape.\n", " out_dim = span.doc._.trf_data.tensors[0].shape[-1]\n", " \n", " # Get Token tensors under tensors[0]. Reshape batched outputs so that\n", " # each \"row\" in the matrix corresponds to a single token. This is needed\n", " # for matching alignment information under 'tensor_ix' to the Transformer\n", " # output.\n", " tensor = span.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]\n", " \n", " # Average vectors along axis 0 (\"columns\"). This yields a 768-dimensional\n", " # vector for each spaCy Span.\n", " return tensor.mean(axis=0)\n", " \n", " # Define a function that takes a Token as input and returns the Transformer\n", " # output.\n", " def token_tensor(self, token):\n", " \n", " # Get alignment information for Token; flatten array for indexing.\n", " # Again, we use the 'doc' attribute of a Token to get the parent Doc,\n", " # which contains the Transformer output.\n", " tensor_ix = token.doc._.trf_data.align[token.i].data.flatten()\n", " \n", " # Fetch Transformer output shape from the final dimension of the output.\n", " # We do this here to maintain compatibility with different Transformers,\n", " # which may output tensors of different shape.\n", " out_dim = token.doc._.trf_data.tensors[0].shape[-1]\n", " \n", " # Get Token tensors under tensors[0]. Reshape batched outputs so that\n", " # each \"row\" in the matrix corresponds to a single token. This is needed\n", " # for matching alignment information under 'tensor_ix' to the Transformer\n", " # output.\n", " tensor = token.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]\n", "\n", " # Average vectors along axis 0 (columns). 
This yields a 768-dimensional\n", " # vector for each spaCy Token.\n", " return tensor.mean(axis=0)\n", " \n", " # Define a function for calculating cosine similarity between vectors\n", " def get_similarity(self, doc1, doc2):\n", " \n", " # Calculate and return cosine similarity\n", " return np.dot(doc1.vector, doc2.vector) / (doc1.vector_norm * doc2.vector_norm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although the previous cell is relatively long, note that the comments explaining the actions taken took up most of the space!\n", "\n", "With the *Class* `Tensor2Attr` defined, we can now add it to the pipeline by referring to the name we registered with spaCy using `@Language.factory()`, that is, `tensor2attr`, as instructed in [Part II](../part_ii/04_basic_nlp_continued.ipynb). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add the component named 'tensor2attr', which we registered using the\n", "# @Language decorator and its 'factory' method to the pipeline.\n", "nlp_trf.add_pipe('tensor2attr')\n", "\n", "# Call the 'pipeline' attribute to examine the pipeline\n", "nlp_trf.pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output shows that the component named `tensor2attr` was added to the spaCy pipeline.\n", "\n", "This component stores the Transformer-based contextual embeddings for *Docs*, *Spans* and *Tokens* under the `vector` attribute.\n", "\n", "Let's explore contextual embeddings by defining two *Doc* objects and feeding them to the Transformer-based language model under `nlp_trf`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define two example sentences and process them using the Transformer-based\n", "# language model under 'nlp_trf'.\n", "doc_city_trf = nlp_trf(\"Helsinki is the capital of Finland.\")\n", "doc_money_trf = nlp_trf(\"The company is close to bankruptcy because its capital is gone.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The noun \"capital\" has two different meanings in these sentences: in `doc_city_trf`, \"capital\" refers to a city, whereas in `doc_money_trf` the word refers to money.\n", "\n", "The Transformer should encode this difference into the resulting vector based on the context in which the word occurs.\n", "\n", "Let's fetch the *Token* corresponding to \"capital\" in each example and retrieve their vector representations under the `vector` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve vectors for the two Tokens corresponding to \"capital\";\n", "# assign to variables 'city_trf' and 'money_trf'.\n", "city_trf = doc_city_trf[3]\n", "money_trf = doc_money_trf[8]\n", "\n", "# Compare the similarity of the two meanings of 'capital'\n", "city_trf.similarity(money_trf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the vectors for the word \"capital\" are only somewhat similar, because the Transformer also encodes information about their context of occurrence into the vectors, which has allowed it to learn that the same linguistic form may have different meanings in different contexts.\n", "\n", "This stands in stark contrast to the *static* word embeddings available in the large language model for English stored under the variable `nlp_lg`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define two example sentences and process them using the large language model\n", "# under 'nlp_lg'\n", "doc_city_lg = nlp_lg(\"Helsinki is the capital of Finland.\")\n", "doc_money_lg = nlp_lg(\"The company is close to bankruptcy because its capital is gone.\")\n", "\n", "# Retrieve vectors for the two Tokens corresponding to \"capital\";\n", "# assign to variables 'city_lg' and 'money_lg'.\n", "city_lg = doc_city_lg[3]\n", "money_lg = doc_money_lg[8]\n", "\n", "# Compare the similarity of the two meanings of 'capital'\n", "city_lg.similarity(money_lg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the vectors for the word \"capital\" are identical, because the word embeddings do not encode information about the context in which the word occurs. " ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "### Quick exercise\n", "\n", "Define two words with similar forms but different meanings, feed them to the Transformer-based language model under `nlp_trf` and compare their cosine similarity." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Write your code below this line and press Shift and Enter to run the code\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section should have given you a basic understanding of word embeddings and their use in spaCy, and introduced you to the difference between word embeddings and contextual word embeddings.\n", "\n", "In the next [section](06_text_linguistics.ipynb), we proceed to examine the processing of discourse-level annotations." ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" } }, "nbformat": 4, "nbformat_minor": 4 }