{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Manipulating text using Python\n", "\n", "This section introduces you to the very basics of manipulating text in Python.\n", "\n", "After reading this section, you should:\n", "\n", " - understand the difference between rich text, structured text and plain text\n", " - understand the concept of text encoding\n", " - know how to load plain text files into Python and manipulate their content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computers and text\n", "\n", "Computers can store and represent text in different formats. Knowing the distinction between different types of text is crucial for processing them programmatically." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('P-om89HKx80', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is rich text?\n", "\n", "Word processors, such as Microsoft Word, produce [rich text](https://en.wikipedia.org/wiki/Formatted_text), that is, text whose appearance has been formatted or styled in a specific way.\n", "\n", "Rich text allows defining specific visual styles for document elements. Headers, for example, may use a different font than the body text, which may in turn feature *italic* or **bold** fonts for emphasis. Rich text can also include various types of images, tables and other document elements.\n", "\n", "Rich text is the default format for modern what-you-see-is-what-you-get word processors.\n", "\n", "### What is plain text?\n", "\n", "Unlike rich text, [plain text](https://en.wikipedia.org/wiki/Plain_text) does not contain any information about the visual appearance of text, but consists of *characters* only.\n", "\n", "Characters, in this context, refers to letters, numbers, punctuation marks, spaces and line breaks.\n", "\n", "The definition of plain text is fairly loose, but generally the term refers to text which lacks any formatting or style information.\n", "\n", "\n", "### What is structured text?\n", "\n", "Structured text may be thought of as a special case of plain text, which includes character sequences that are used to format the text for display.\n", "\n", "Forms of structured text include text described using mark-up languages such as XML, Markdown or HTML.\n", "\n", "The example below shows a plain text sentence wrapped into HTML tags for paragraphs `

`. \n", "\n", "The opening tag `

` and the closing tag `

` instruct the computer that any content placed between these tags form a paragraph.\n", "\n", "```\n", "

This is an example sentence.

\n", "```\n", "\n", "This information is used for structuring plain text when *rendering* text for display, typically by styling its appearance." ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "If you double-click any content cell in this Jupyter Notebook, you will see the underlying structured text in Markdown.\n", "\n", "Running the cell renders the structured text for visual inspection!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why does this matter?\n", "\n", "If you collect a bunch of texts for a corpus, chances are that some originated in rich or structured format, depending on the medium these texts came from.\n", "\n", "If you collect printed documents that have been digitized using a technique such as [optical character recognition](https://en.wikipedia.org/wiki/Optical_character_recognition) (OCR) and subsequently converted from rich into plain text, the removal of formatting information is likely to introduce errors into the resulting plain text. Working with this kind of \"dirty\" OCR can have an impact on the results of text analysis (Hill & Hengchen [2019](https://doi.org/10.1093/llc/fqz024)).\n", "\n", "If you collect digital documents by scraping discussion forums or websites, you are likely to encounter traces of structured text in the form of markup tags, which may be carried over to plain text during conversion.\n", "\n", "Plain text is by far the most interchangeable format for text, as it is easy to read for computers. This is why programming languages work with plain text, and if you plan to use programming languages to manipulate text, you need to know what plain text is. \n", "\n", "To summarise, when working with plain text, you may need to deal with traces left by conversion from rich or structured text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text encoding\n", "\n", "To be read by computers, plain text needs to be *encoded*. This is achieved using *character encoding*, which maps characters (letters, numbers, punctuation, whitespace ...) to a numerical representation understood by the computer.\n", "\n", "Ideally, we should not have to deal with low-level operations such as character encoding, but practically we do, because there are multiple systems for encoding characters, and these codings are not compatible with each other. This is the source of endless misery and headache when working with plain text.\n", "\n", "There are two character encoding systems that you are likely to encounter: ASCII and Unicode.\n", "\n", "### ASCII\n", "\n", "[ASCII](https://en.wikipedia.org/wiki/ASCII), which stands for American Standard Code for Information Interchange, is a pioneering character encoding system that has provided a foundation for many modern character encoding systems.\n", "\n", "ASCII is still widely used, but is very limited in terms of its character range. If your language happens to include characters such as Γ€ or ΓΆ, you are out of luck with ASCII.\n", "\n", "### Unicode\n", "\n", "[Unicode](https://en.wikipedia.org/wiki/Unicode) is a standard for encoding text in most writing systems used across the world, covering nearly 140 000 characters in modern and historic scripts, symbols and emoji.\n", "\n", "For example, the pizza slice emoji πŸ• has the Unicode \"code\" `U+1F355`, whereas the corresponding code for a whitespace is `U+0020`.\n", "\n", "Unicode can be implemented by different character encodings, such as [UTF-8](https://en.wikipedia.org/wiki/UTF-8), which is defined by the Unicode standard.\n", "\n", "UTF-8 is backwards compatible with ASCII. In other words, the ASCII character encodings form a subset of UTF-8, which makes our life much easier. \n", "\n", "Even if a plain text file has been *encoded* in ASCII, we can *decode* it using UTF-8, but **not vice versa**." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Loading plain text files into Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('ulRFLvNkhHA', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plain text files can be loaded into Python using the `open()` function.\n", "\n", "The first argument to the `open()` function must be a string, which contains a *path* to the file that is being opened.\n", "\n", "In this case, we have a path that points towards a file named `NYT_1991-01-16-A15.txt`, which is located in a directory named `data`. In the definition of the path, the directory and the filename are separated by a backslash `/`.\n", "\n", "To access the file that this path points to, we must provide the path as a string object to the `file` argument of the `open()` function:\n", "\n", "```\n", "open(file='data/NYT_1991-01-16-A15.txt', mode='r', encoding='utf-8')\n", "```\n", "\n", "Before proceeding any further, let's focus on the other arguments provided to the `open()` function.\n", "\n", "By default, Python 3 assumes that the text is encoded using UTF-8, but we can make this explicit using the `encoding` argument. \n", "\n", "The `encoding` argument takes a string as input: we pass the string `utf-8` to the argument to declare that the plain text is encoded in UTF-8.\n", "\n", "We also use the `mode` argument to define that we only want to open the file for *reading*, which is done by passing the string `r` to the argument.\n", "\n", "Finally, we use the `open()` function in combination with the `with` statement, which ensures that the file will be closed after performing whatever we do within the *indented* block of code that follows the `with` statement. This prevents the file from consuming memory and resources after we no longer need it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Open a file and assign it to the variable 'file'\n", "with open(file='data/NYT_1991-01-16-A15.txt', mode='r', encoding='utf-8') as file:\n", " \n", " # The 'with' statement must be followed by an indented code block.\n", " # Here we call the read() method to read the file contents and \n", " # assign the result under the variable 'text'.\n", " text = file.read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the `with` statement and the `open()` function are followed by the statement `as` and a variable named `file`.\n", "\n", "This tells Python to assign whatever is returned by the `open()` function under the variable `file`.\n", "\n", "If we now call the variable `file`, we get a Python `TextIOWrapper` object that contains three arguments: the path to the file under the argument `name` and the `mode` and `encoding` arguments that we specified above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Call the variable to examine the object\n", "file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that in the indented code block following the `with` statement, we called the `read()` method of the `TextIOWrapper` object.\n", "\n", "This method read the contents of the file, which we assigned under the variable `text`.\n", "\n", "However, if we attempt to call the `read()` method for the variable `file` outside the `with` statement, Python will raise an error, because the file has been closed." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Attempt to use the read() method to read the file content\n", "file.read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This behaviour is expected, as we want the file to be closed, so that it does not consume memory or resources now that we no longer need it. This is especially important when working with thousands of files, as every open file will take up memory and resources.\n", "\n", "Let's check the output from applying the `read()` method, which we stored under the variable `text` within the `with` statement.\n", "\n", "The text is fairly long, so let's just take a slice of the text containing the first 500 characters, which can be achieved using brackets `[:500]`.\n", "\n", "As we learned in [Part I](../part_i/02_getting_started_with_python.ipynb), adding brackets directly after the name of a variable allows accessing parts of the object, if the object in question allows this.\n", "\n", "For example, the expression `text[1]` would retrieve the character at position 1 in the string object under the variable `text`.\n", "\n", "Adding the colon `:` as a prefix to the number instructs Python to retrieve all characters contained in the string up to the 500th character." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve the first 500 characters under the variable 'text'\n", "text[:500]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the text is indeed legible, but there are some strange character sequences, such as `\\ufeff` in the very beginning of the text, and the numerous `\\n` sequences occurring throughout the text.\n", "\n", "The `\\ufeff` sequence is simply an explicit declaration (\"signature\") that the file has been encoded using UTF-8. Not all UTF-8 encoded files contain this sequence.\n", "\n", "The `\\n` sequences, in turn, indicate a line change.\n", "\n", "This becomes evident if we use Python's `print()` function to print the first 1000 characters stored in the `text` variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the first 1000 characters under the variable 'text'\n", "print(text[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, Python knows how to interpret `\\n` character sequences and inserts a line break if it encounters this sequence when printing the contents of the string object.\n", "\n", "We can also see that the first few lines of the file contain metadata on the article, such as its name, author and source. This information precedes the body text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Manipulating text" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('v4FY6TXt0PU', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the entire text stored under the variable `text` is a string object, we can use all methods available for manipulating strings in Python.\n", "\n", "Let's use the `replace()` method to replace all line breaks `\"\\n\"` with empty strings `\"\"` and store the result under the variable `processed_text`. \n", "\n", "We then use the `print()` function to print out a slice containing the first 1000 characters using the brackets `[:1000]`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Replace line breaks \\n with empty strings and assign the result to \n", "# the variable 'processed_text'\n", "processed_text = text.replace('\\n', '')\n", "\n", "# Print out the first 1000 characters under the variable 'processed_text'\n", "print(processed_text[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, all of the text is now clumped together. We can, however, still identify the beginning of each paragraph, which are marked by three whitespaces.\n", "\n", "Note that replacing the line breaks also causes the article metadata to form a single paragraph, which is also missing some whitespace characters. For this reason, one must always pay attention to unwanted effects of replacements and other transformations!\n", "\n", "However, if we were only interested in the body text of the article, we can now easily remove the metadata, as we know that it is separated from the body text by three whitespace characters.\n", "\n", "The easiest way to do this is to use the `split()` method to split the string into a list by using three whitespace characters as the separator.\n", "\n", "Let's assign the result under the same variable, that is, `processed_text`, and print out the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the split() method with three whitespaces as a separator. Assign the\n", "# result under the variable 'processed_text'.\n", "processed_text = processed_text.split(sep=' ')\n", "\n", "# Print out the result under 'processed_text'\n", "print(processed_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you examine the output, you will see that the `split()` method returned a list of string objects. Let's quickly verify this by checking the type of the object stored under `processed_text`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the type of the object under 'processed_text'\n", "type(processed_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The metadata is stored in the first item in the list, as the first sequence of three whitespace characters was found where the metadata ends. This is where we first split the string object.\n", "\n", "Let's fetch the first item in the list – remember that Python starts counting from zero, which means that the item we want to access can be found at index `0`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve the string object at index 0 from the list 'processed_text'\n", "processed_text[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to remove the metadata and retain just the body text, we can use the `pop()` method of a list object.\n", "\n", "This method expects an integer as input, which corresponds to the index of an item that we want to remove from the list." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Call the pop() method of the list under 'processed_text' and the\n", "# index of the item to be removed.\n", "processed_text.pop(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are wondering why we do not assign the result into a variable, the answer is because Python lists are *mutable*, that is, they can be manipulated in place.\n", "\n", "In other words, the `pop()` method can modify the list without \"updating\" the variable by reassigning the value under the same variable name.\n", "\n", "Let's check the result by retrieving the first three items in the list `processed_text`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve the first three items in the list 'processed_text'\n", "processed_text[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the first item in the list no longer corresponds to the metadata!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('iabwQKS5lVk', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to convert the list back into a string, we can use the `join()` method of a string object.\n", "\n", "The `join()` method expects an *iterable* as input, that is, something that can be iterated over, such as a Python list or a dictionary.\n", "\n", "This is where things may get a little confusing: the `join()` method must be called on a *string* that will be used to join the items in the iterable!\n", "\n", "In this case, we want to use the original sequence of characters that were used to separate paragraphs of text – a line break and three whitespaces – as the string object that joins the items." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the join() method to join the items in the list 'processed_text' using\n", "# the string object '\\n ' – a line break and three whitespaces. Store the \n", "# result under the variable of the same name.\n", "processed_text = '\\n '.join(processed_text)\n", "\n", "# Check the result by printing the first 1000 characters of the resulting \n", "# string object under 'processed_text'\n", "print(processed_text[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, applying the `join()` method returns a string object with the original paragraph breaks!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('HEinzGt8LG4', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you examine the text closely, you can also see remnants of the digitalisation process: the application of optical character recognition, which was discussed above, has resulted in a mixture of various types of quotation marks, such as `\"`, `β€œ`, `”`, `’’` and `β€˜β€˜` (two single quotation marks), being used in the text.\n", "\n", "If we were interested in retrieving quotes from the body text, it would be good to use the quotation marks consistently. Let's choose `\"` (a single double-quote) as our preferred quotation mark.\n", "\n", "We could replace each quotation mark with this character using the `replace()` method, but applying this method separately for each type of quotation mark would be tedious.\n", "\n", "To make the process more efficient, we can leverage two other Python data structures: *lists* and *tuples*.\n", "\n", "Let's start by defining a list named `pipeline`. We can create and populate a list by simply placing objects within brackets `[]`. Each list item must be separated by a comma (`,`).\n", "\n", "As we saw above, the `replace()` method takes two strings as inputs.\n", "\n", "To combine two strings into a single Python object, the most obvious candidate is a data structure named *tuple*, which consist of finite, ordered lists of items.\n", "\n", "Tuples are marked by parentheses `( )`, and the items in a tuple are also separated by a comma.\n", "\n", "In each tuple, we place the character to be replaced in the first string, and its replacement in the second string." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a list with four tuples, which each consist of two strings: the character\n", "# to be replaced and its replacement.\n", "pipeline = [('β€œ', '\"'), ('´´', '\"'), ('”', '\"'), ('’’', '\"')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This also illustrates how different data structures are often nested in Python: the list consists of tuples, and the tuples consist of string objects.\n", "\n", "We can now perform a `for` loop over each item in the list, which iterates through each item in the order in which they appear in the list.\n", "\n", "Each item in the list consists of a tuple, which contains two strings.\n", "\n", "Note that to enter a `for` loop, Python expects the next line of code to be indented. Press the Tab β†Ή key on your keyboard to move the cursor.\n", "\n", "What happens next is exactly same that we did before with using the `replace()` method, but instead of manually defining the strings that we want to replace, we use the strings contained in the variables `old` and `new`!\n", "\n", "After each loop, we automatically update the string object stored under the variable `processed_text`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over tuples in the list 'pipeline'. Each tuple has two values, which we \n", "# assign to variables 'old' and 'new' on the fly!\n", "for old, new in pipeline:\n", " \n", " # Use the replace() method to replace the string under the variable 'old' \n", " # with the string under the variable new 'new'\n", " processed_text = processed_text.replace(old, new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's examine the output by printing out the string under the variable `processed_text`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the string\n", "print(processed_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the output shows, we could perform a series of replacements by looping over the list of tuples, which defined the patterns to be replaced and their replacements! \n", "\n", "To recap, the syntax for the `for` loop is as follows: declare the beginning of a loop using `for`, followed by a *variable* that is used to refer to items retrieved from the list.\n", "\n", "The list that is being looped over is preceded by `in` and the name of the variable assigned to the entire *list*.\n", "\n", "To better understand how a `for` loop works, let's define only one variable, `our_tuple`, to refer to the items that we fetch from the list." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over the items under the variable 'pipeline'\n", "for our_tuple in pipeline:\n", " \n", " # Print the returned object\n", " print(our_tuple)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This print outs the tuples!\n", "\n", "Python is smart enough to understand that a single variable refers to the single items, or *tuples* in the list, whereas for two items, it must proceed to the *strings* contained within the tuple.\n", "\n", "When writing `for` loops, pay close attention to the items contained in the list!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This should have given you an idea of the basic issues involved in loading and manipulating text using Python. \n", "\n", "The [following section](02_basic_text_processing_continued.ipynb) builds on these techniques to manipulate texts more efficiently." ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" } }, "nbformat": 4, "nbformat_minor": 4 }