{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Manipulating text using Python\n", "\n", "This section introduces you to the very basics of manipulating text in Python.\n", "\n", "After reading this section, you should:\n", "\n", " - understand the difference between rich text, structured text and plain text\n", " - understand the concept of text encoding\n", " - know how to load plain text files into Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computers and text\n", "\n", "Computers can store and represent text in different formats. Knowing the distinction between different types of text is crucial for processing them programmatically.\n", "\n", "### What is rich text?\n", "\n", "Word processors, such as Microsoft Word, produce *[rich text](https://en.wikipedia.org/wiki/Formatted_text)*, that is, text whose appearance has been formatted or styled in a specific way.\n", "\n", "Rich text allows defining specific visual styles for document elements. Headers, for example, may use a different font than the body text, which may in turn feature *italic* or **bold** fonts for emphasis. Rich text can also include various types of images, tables and other document elements.\n", "\n", "Rich text is the default format for modern what-you-see-is-what-you-get word processors.\n", "\n", "### What is plain text?\n", "\n", "Unlike rich text, [plain text](https://en.wikipedia.org/wiki/Plain_text) does not contain any information about the visual appearance of text, but consists of *characters* only.\n", "\n", "Characters, in this context, refers to letters, numbers, punctuation marks, spaces and line breaks.\n", "\n", "The definition of plain text is fairly loose, but generally the term refers to text which lacks any formatting or style information.\n", "\n", "\n", "### What is structured text?\n", "\n", "Structured text may be thought of as a special case of plain text, which includes character sequences that are used to format the text for display.\n", "\n", "Forms of structured text include text described using mark-up languages such as XML, Markdown or HTML.\n", "\n", "The example below shows a plain text sentence wrapped into HTML tags for paragraphs `

`. \n", "\n", "The opening tag `

` and the closing tag `

` instruct the computer that any content placed between these tags form a paragraph.\n", "\n", "``` \n", "

This is an example sentence.

\n", "```\n", "\n", "This information is used for structuring plain text when *rendering* text for display, typically by styling its appearance." ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden" }, "source": [ "If you double-click any content cell in this Jupyter Notebook, you will see the underlying structured text in Markdown.\n", "\n", "Running the cell renders the structured text for visual inspection!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why does this matter?\n", "\n", "If you collect a bunch of texts for a corpus, chances are that some originated in rich or structured format, depending on the medium these texts came from.\n", "\n", "If you collect printed documents that have been digitized using a technique such as [optical character recognition](https://en.wikipedia.org/wiki/Optical_character_recognition) (OCR) and subsequently converted from rich into plain text, the removal of formatting information is likely to introduce errors into the resulting plain text. Working with this kind of \"dirty\" OCR can have an impact on the results of text analysis (Hill & Hengchen [2019](https://doi.org/10.1093/llc/fqz024)).\n", "\n", "If you collect digital documents by scraping discussion forums or websites, you are likely to encounter traces of structured text in the form of markup tags, which may be carried over to plain text during conversion.\n", "\n", "Plain text is by far the most interchangeable format for text, as it is easy to read for computers. This is why programming languages work with plain text, and if you plan to use programming languages to manipulate text, you need to know what plain text is. \n", "\n", "To summarise, when working with plain text, you may need to deal with traces left by conversion from rich or structured text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text encoding\n", "\n", "To be read by computers, plain text needs to be *encoded*. This is achieved using *character encoding*, which maps characters (letters, numbers, punctuation, whitespace ...) to a numerical representation understood by the computer.\n", "\n", "Ideally, we should not have to deal with low-level operations such as character encoding, but practically we do, because there are multiple systems for encoding characters, and these codings are not compatible with each other. This is the source of endless misery and headache when working with plain text.\n", "\n", "There are two character encoding systems that you are likely to encounter: ASCII and Unicode.\n", "\n", "### ASCII\n", "\n", "[ASCII](https://en.wikipedia.org/wiki/ASCII), which stands for American Standard Code for Information Interchange, is a pioneering character encoding system that has provided a foundation for many modern character encoding systems.\n", "\n", "ASCII is still widely used, but is very limited in terms of its character range. If your language happens to include characters such as ä or ö, you are out of luck with ASCII.\n", "\n", "### Unicode\n", "\n", "[Unicode](https://en.wikipedia.org/wiki/Unicode) is a standard for encoding text in most writing systems used across the world, covering nearly 140 000 characters in modern and historic scripts, symbols and emoji.\n", "\n", "For example, the pizza slice emoji 🍕 has the Unicode \"code\" `U+1F355`, whereas the corresponding code for a whitespace is `U+0020`.\n", "\n", "Unicode can be implemented by different character encodings, such as [UTF-8](https://en.wikipedia.org/wiki/UTF-8), which is defined by the Unicode standard.\n", "\n", "UTF-8 is backwards compatible with ASCII. In other words, the ASCII character encodings form a subset of UTF-8, which makes our life much easier. \n", "\n", "Even if a plain text file has been *encoded* in ASCII, we can *decode* it using UTF-8, but **not vice versa**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading plain text files into Python\n", "\n", "Plain text files can be loaded into Python using the `open()` function.\n", "\n", "The first argument to the `open()` function must be a string, which contains a *path* to the file that is being opened.\n", "\n", "In this case, the directory `data` contains a file named `NYT_1991-01-16-A15.txt`. The directory and the filename are separated by the backslash `/`, as shown below." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Open a file and assign it to the variable 'file'\n", "file = open(file='data/NYT_1991-01-16-A15.txt', mode='r', encoding='utf-8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, Python 3 assumes that the text is encoded using UTF-8, but we can make this explicit using the `encoding` argument. \n", "\n", "The `encoding` argument takes a string as its input: we pass `utf-8` to the argument to declare that the plain text is encoded in UTF-8.\n", "\n", "Moreover, we use the `mode` argument to define that we only want to open the file for *reading*, which is done by passing the string `r` to the argument.\n", "\n", "If we call the variable `file`, we see a Python object that contains three arguments: the path to the file under the argument `name` and the `mode` and `encoding` arguments that we specified above." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<_io.TextIOWrapper name='data/NYT_1991-01-16-A15.txt' mode='r' encoding='utf-8'>" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Call the variable to examine the object\n", "file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling the file, however, is not sufficient to inspect its contents. We must use the `read()` method provided by this object to read the contents of the file first.\n", "\n", "We assign the result to the variable `text`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Use the read() method to read the file context and assign the\n", "# result to the variable 'text'\n", "text = file.read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the result of the `read()` method which is now stored under the variable `text`.\n", "\n", "The text is fairly long, so let's just take a slice of the text containing the first 500 characters, which can be achieved using brackets `[:500]`.\n", "\n", "As we learned in [Part I](../part_i/getting_started_with_python.ipynb), adding brackets directly after the name of a variable allows accessing parts of the object, if the object allows this. For example, the expression `text[1]` would retrieve the character at position 1 in the string object under the variable `text`.\n", "\n", "Adding the colon `:` as a prefix to the number instructs Python to retrieve all characters contained in the string up to the 500th one." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\ufeffU.S. TAKING STEPS TO CURB TERRORISM: F.B.I. Is Ordered to Find Iraqis Whose Visas Have Expired\\nBy JAMES BARRON\\nNew York Times (1923-Current file); Jan 16, 1991;\\nProQuest Historical Newspapers: The New York Times with Index pg. A15\\nU.S. TAKING STEPS TO CURB TERRORISM\\nF.B.I. Is Ordered to Find Iraqis Whose Visas Have Expired\\nBy JAMES BARRON\\n The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Call the first 500 characters under the variable 'text'\n", "text[:500]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the text is indeed legible, but there are some strange character sequences, such as `\\ufeff` in the very beginning of the text, and the numerous `\\n` sequences occurring throughout the text.\n", "\n", "The `\\ufeff` sequence is simply an explicit declaration (\"signature\") that the file has been encoded using UTF-8. Not all UTF-8 encoded files contain this sequence.\n", "\n", "The `\\n` sequences, in turn, indicate a line change.\n", "\n", "This becomes evident if we use Python's `print()` function to print the first 1000 characters stored in the `text` variable." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U.S. TAKING STEPS TO CURB TERRORISM: F.B.I. Is Ordered to Find Iraqis Whose Visas Have Expired\n", "By JAMES BARRON\n", "New York Times (1923-Current file); Jan 16, 1991;\n", "ProQuest Historical Newspapers: The New York Times with Index pg. A15\n", "U.S. TAKING STEPS TO CURB TERRORISM\n", "F.B.I. Is Ordered to Find Iraqis Whose Visas Have Expired\n", "By JAMES BARRON\n", " The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday.\n", " The announcement came as security precautions were tightened throughout the United States. From financial exchanges in lower Manhattan to cloakrooms in Washington and homeless shelters in California, unfamiliar rituals were the order of the day. In many cities, identification badges were being given close scrutiny in office buildings that used to be open to anyone.\n", " Concerns about terrorist attack disrupted other daily routines as well. No fast-food deliveries are being all\n" ] } ], "source": [ "# Print the first 1000 characters under the variable 'text'\n", "print(text[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, Python knows how to interpret `\\n` character sequences and inserts a line break if it encounters this sequence." ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden" }, "source": [ "### Quick exercise\n", "\n", "Answer the following questions:\n", "\n", "1. Are both examples above *plain text*?\n", "2. Can you find traces left by the process of digitising the original text?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Manipulating text\n", "\n", "Because the entire text stored under the variable `text` is a Python string object, we can use all methods available for manipulating strings.\n", "\n", "Let's use the `replace()` method to replace all line breaks `\"\\n\"` with empty strings `\"\"` and store the text under the variable `processed_text`. \n", "\n", "Finally, we use the `print()` function to print out a slice containing the first 1000 characters using the brackets `[:1000]`." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U.S. TAKING STEPS TO CURB TERRORISM: F.B.I. Is Ordered to Find Iraqis Whose Visas Have ExpiredBy JAMES BARRONNew York Times (1923-Current file); Jan 16, 1991;ProQuest Historical Newspapers: The New York Times with Index pg. A15U.S. TAKING STEPS TO CURB TERRORISMF.B.I. Is Ordered to Find Iraqis Whose Visas Have ExpiredBy JAMES BARRON The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday. The announcement came as security precautions were tightened throughout the United States. From financial exchanges in lower Manhattan to cloakrooms in Washington and homeless shelters in California, unfamiliar rituals were the order of the day. In many cities, identification badges were being given close scrutiny in office buildings that used to be open to anyone. Concerns about terrorist attack disrupted other daily routines as well. No fast-food deliveries are being allowed at t\n" ] } ], "source": [ "# Replace line breaks \\n with empty strings and assign the result to \n", "# the variable 'processed_text'\n", "processed_text = text.replace(\"\\n\", \"\")\n", "\n", "# Print out the first 1000 characters under the variable 'processed_text'\n", "print(processed_text[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This simple replacement operation allows us to remove erroneous line breaks within the text. \n", "\n", "We can still identify actual line breaks by the indentations at the beginning of each paragraph, which are marked by three whitespaces.\n", "\n", "We can use this information to insert actual line breaks back into the text by replacing sequences of three whitespaces with the line break sequence and three whitespaces." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U.S. TAKING STEPS TO CURB TERRORISM: F.B.I. Is Ordered to Find Iraqis Whose Visas Have ExpiredBy JAMES BARRONNew York Times (1923-Current file); Jan 16, 1991;ProQuest Historical Newspapers: The New York Times with Index pg. A15U.S. TAKING STEPS TO CURB TERRORISMF.B.I. Is Ordered to Find Iraqis Whose Visas Have ExpiredBy JAMES BARRON\n", " The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday.\n", " The announcement came as security precautions were tightened throughout the United States. From financial exchanges in lower Manhattan to cloakrooms in Washington and homeless shelters in California, unfamiliar rituals were the order of the day. In many cities, identification badges were being given close scrutiny in office buildings that used to be open to anyone.\n", " Concerns about terrorist attack disrupted other daily routines as well. No fast-food deliveries are being allowe\n" ] } ], "source": [ "# Replace three whitespaces with a line break and three white spaces\n", "# Store the result under the same variable 'processed_text'\n", "processed_text = processed_text.replace(\" \", \"\\n \")\n", "\n", "# Print out the first 1000 characters under the variable 'processed_text'\n", "print(processed_text[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an extremely tedious way of manipulating the text.\n", "\n", "To make the process more efficient, we can leverage two other Python data structures: *lists* and *tuples*.\n", "\n", "Let's start by defining a list named `pipeline`. We can create and populate a list by simply placing objects within brackets `[]`. Each list item must be separated by a comma (`,`).\n", "\n", "As we saw above, the `replace()` method takes two strings as inputs.\n", "\n", "To combine two strings into a single Python object, the most obvious candidate is a data structure named *tuple*, which consist of finite, ordered lists of items.\n", "\n", "Tuples are marked by parentheses `( )`: items in a tuple are also separated by a comma." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Define a list with two tuples, of which each consist of two strings\n", "pipeline = [(\"\\n\", \"\"), (\" \", \"\\n \")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows how different data structures are often nested in Python: the list consists of tuples, and the tuples consist of string objects.\n", "\n", "We can now perform a `for` loop over each item in the list.\n", "\n", "Each item in the list consists of a tuple, which contains two strings.\n", "\n", "Note that to enter a `for` loop, Python expects the next line to be indented. Press the Tab ↹ key on your keyboard to move the cursor.\n", "\n", "What happens next is exactly same that we did before with using the `replace()` method, but instead of manually defining the strings that we want to replace, we use the strings contained in the variables `old` and `new`!\n", "\n", "After each loop, we automatically store the result in to the variable `updated_text`." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Let's first create a copy of the original text to manipulate during the loop\n", "updated_text = text\n", "\n", "# Loop over tuples in the list 'pipeline'. Each tuple has two values, which we \n", "# assign to variables 'old' and 'new' on the fly!\n", "for old, new in pipeline:\n", " \n", " # Use the replace() method to replace the string under the variable 'old' \n", " # with the string under the variable new 'new'\n", " updated_text = updated_text.replace(old, new)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U.S. TAKING STEPS TO CURB TERRORISM: F.B.I. Is Ordered to Find Iraqis Whose Visas Have ExpiredBy JAMES BARRONNew York Times (1923-Current file); Jan 16, 1991;ProQuest Historical Newspapers: The New York Times with Index pg. A15U.S. TAKING STEPS TO CURB TERRORISMF.B.I. Is Ordered to Find Iraqis Whose Visas Have ExpiredBy JAMES BARRON\n", " The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday.\n", " The announcement came as security precautions were tightened throughout the United States. From financial exchanges in lower Manhattan to cloakrooms in Washington and homeless shelters in California, unfamiliar rituals were the order of the day. In many cities, identification badges were being given close scrutiny in office buildings that used to be open to anyone.\n", " Concerns about terrorist attack disrupted other daily routines as well. No fast-food deliveries are being allowed a\n" ] } ], "source": [ "print(updated_text[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The syntax for the `for` loop is as follows: declare the beginning of a loop using `for`, followed by a *variable* assigned to the items retrieved from the list.\n", "\n", "The list that is being looped over is preceded by `in` and the name of the variable assigned to the entire *list*.\n", "\n", "To better understand how a `for` loop works, let's define only one variable, `our_tuple`, to refer to the items that we fetch from the list." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('\\n', '')\n", "(' ', '\\n ')\n" ] } ], "source": [ "# Loop over the items under the variable 'pipeline'\n", "for our_tuple in pipeline:\n", " \n", " # Print the returned object\n", " print(our_tuple)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This print outs the tuples!\n", "\n", "Python is smart enough to understand that a single variable refers to the single items, or *tuples* in the list, whereas for two items, it must proceed to the *strings* contained within the tuple.\n", "\n", "When writing `for` loops, pay close attention to the items contained in the list!" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden" }, "source": [ "### Quick exercise\n", "\n", "The text stored under `update_text` still contains many traces of conversion from a printed document to digital text.\n", "\n", "Examine, for instance, the variety of quotation marks used in the text (e.g. `”, ’, \"`) and hyphenation (e.g. `- `).\n", "\n", "Create a list named `new_pipeline` and define **tuples** which each contain two **strings**: what is to be replaced and its replacement.\n", "\n", "Then write a `for` loop and use the `replace()` method to apply these patterns to `updated_text`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This should have given you an idea of the basic issues involved in loading and manipulating text using Python. \n", "\n", "The [following section](basic_text_processing_continued.ipynb) builds on these techniques to manipulate texts more efficiently." ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }