{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Manipulating text at scale\n", "\n", "This section introduces you to regular expressions for manipulating text and how to apply the same procedure to several files.\n", "\n", "After reading this section, you should know:\n", "\n", " - how to manipulate multiple text files using Python\n", " - how to define simple patterns using *regular expressions*\n", " - how to save the results" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "## Manipulating text at scale\n", "\n", "Ideally, Python should enable you to manipulate text at scale, that is, to apply the same procedure to ten, hundred or thousand text files *with the same effort*.\n", "\n", "To do so, we must be able to define more flexible patterns than the fixed strings that we used previously with the `replace()` method, while opening and closing files automatically.\n", "\n", "This capability is provided by Python modules for *regular expressions* and *file handling*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular expressions\n", "\n", "[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) are a \"language\" that allows defining *search patterns*.\n", "\n", "These patterns can be used to find or to find and replace patterns in Python string objects.\n", "\n", "As opposed to fixed strings, regular expressions allow defining *wildcard characters* that stand in for any character, *quantifiers* that match sequences of repeated characters, and much more.\n", "\n", "Python allows using regular expressions through its `re` module.\n", "\n", "We can activate this module using the `import` command." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's begin by loading the text file, reading its contents, assigning the last 2000 characters to the variable `extract` and printing out the result." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " guardian of democracy .., also works as a deterrent to ter;;, rorists.\n", " If the United States and all its; might wasn't viewed as looking- out for smaller countries’ inter- ; ests, Rogers said, \"things could get completely out of hand.;- Anybody could do anything/; Terrorism would spread like a cancer.”\t,\n", " Poor people may be the first'-'* to feel repercussions from Iraq's invasion of Kuwait. Al->.\" ready, the price of a tankful of gasoline has increased a dollar— or two at many pumps in the region.\n", " \"This is going to hurt the. poor man and the U.S. car tor.... dustry,” Rogers predicted.\n", " “Even if they ration gas, ;lt won’t matter to rich people*”...;, said Belk, from the VA nursing-,:.’ home. “They will buy it on the black market. But if you don't,7 have much money to begin with;-’; it’s going to make a difference/*—.\n", " Gerald Dunn, a federal gov*” - eminent employee from Alex-* 1 andria, said that he believed'\"\" that most middle- and upper-in-\"“ come families wouldn’t even no-.; — tice the increase in gas prices.-'; Besides, he said, it was a small -price for them to pay while the nation defended a \"country un-' der attack.”\t’’\n", " Saddam Hussein, alternatively described as a madman, bully-”\" and dictator, drew passionate'\"''. repudiation.\n", " \"No way can we let him get' away with this,” said Robert Stout, a vice president at Smithy Braedon, referring to . the Iraqi president’s decision to.... annex Kuwait.\t,\n", " “Saddam Hussein really,,,,.* scares me,\" said Falls Church,,,, music instructor Joseph Moq-, ton. \"I fear we have a madmaq,,,„; out there and he is not going to;;,* stop at Kuwait. These days, it takes is a nuclear weapon; small enough to fit in a suit-.;.; case.”\t—\n", " ”1 just hope they stop him — quick,” said Viola Anderson, a\"-\"\n", "retired hotel employee in tlie\t\n", "District. \"I really feel things could turn terrible, if we donfr stop him soon.”\t>•'«■«\n", "Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.\n", "\n" ] } ], "source": [ "# Define a path to the file, open the file for (r)eading using utf-8 encoding\n", "file = open(file='data/WP_1990-08-10-25A.txt', mode='r', encoding='utf-8')\n", "\n", "# Read the file contents using the .read() method\n", "text = file.read()\n", "\n", "# Get the *last* 2000 characters – note the minus sign before the number\n", "extract = text[-2000:]\n", "\n", "# Print the result\n", "print(extract)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the text has a lot of errors from optical character recognition, mainly in the form of sequences such as `....` and `,,,,`.\n", "\n", "Let's compile our first regular expression that searches for sequences of *two or more* full stops.\n", "\n", "This is done using the `compile()` function from the `re` module.\n", "\n", "The `compile()` function takes a string as an input. \n", "\n", "Note that we attach the prefix `r` to the string. This tells Python to store the string in 'raw' format. This means that the string is stored as it appears." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "re.Pattern" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compile a regular expression and assign it to the variable 'stops'\n", "stops = re.compile(r'\\.{2,}')\n", "\n", "# Let's check the type of the regular expression!\n", "type(stops)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's unpack this regular expression a bit.\n", "\n", "1. The regular expression is defined using a Python string, as indicated by the single quotation marks `' '`.\n", "\n", "2. We need a backslash `\\` in front of our full stop `.`. The backslash tells Python that we are really referring to a full stop, because regular expressions use a full stop as a *wildcard* character that can stand in for *any character*.\n", "\n", "3. The curly brackets `{ }` instruct the regular expression to search for instances of the previous item `\\.` (our actual full stop) that occur two or more times (`2,`). This (hopefully) preserves true uses of a full stop!\n", "\n", "In plain language, we tell the regular expression to search for *occurrences of two or more full stops*. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To apply this regular expression to some text, we will use the `sub()` method of our newly-defined regular expression object `stops`.\n", "\n", "The `sub()` method takes two arguments:\n", "\n", "1. `repl`: A string containing a string that is used to *replace* possible matches.\n", "2. `string`: A string object to be searched for matches.\n", "\n", "The method returns the modified string object.\n", "\n", "Let's apply our regular expression to the string stored under the variable `extract`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " guardian of democracy , also works as a deterrent to ter;;, rorists.\n", " If the United States and all its; might wasn't viewed as looking- out for smaller countries’ inter- ; ests, Rogers said, \"things could get completely out of hand.;- Anybody could do anything/; Terrorism would spread like a cancer.”\t,\n", " Poor people may be the first'-'* to feel repercussions from Iraq's invasion of Kuwait. Al->.\" ready, the price of a tankful of gasoline has increased a dollar— or two at many pumps in the region.\n", " \"This is going to hurt the. poor man and the U.S. car tor dustry,” Rogers predicted.\n", " “Even if they ration gas, ;lt won’t matter to rich people*”;, said Belk, from the VA nursing-,:.’ home. “They will buy it on the black market. But if you don't,7 have much money to begin with;-’; it’s going to make a difference/*—.\n", " Gerald Dunn, a federal gov*” - eminent employee from Alex-* 1 andria, said that he believed'\"\" that most middle- and upper-in-\"“ come families wouldn’t even no-.; — tice the increase in gas prices.-'; Besides, he said, it was a small -price for them to pay while the nation defended a \"country un-' der attack.”\t’’\n", " Saddam Hussein, alternatively described as a madman, bully-”\" and dictator, drew passionate'\"''. repudiation.\n", " \"No way can we let him get' away with this,” said Robert Stout, a vice president at Smithy Braedon, referring to . the Iraqi president’s decision to annex Kuwait.\t,\n", " “Saddam Hussein really,,,,.* scares me,\" said Falls Church,,,, music instructor Joseph Moq-, ton. \"I fear we have a madmaq,,,„; out there and he is not going to;;,* stop at Kuwait. These days, it takes is a nuclear weapon; small enough to fit in a suit-.;.; case.”\t—\n", " ”1 just hope they stop him — quick,” said Viola Anderson, a\"-\"\n", "retired hotel employee in tlie\t\n", "District. \"I really feel things could turn terrible, if we donfr stop him soon.”\t>•'«■«\n", "Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.\n", "\n" ] } ], "source": [ "# Apply the regular expression to the text under 'extract' and save the output\n", "# to the same variable, essentially overwriting the old text.\n", "extract = stops.sub(repl='', string=extract)\n", "\n", "# Print the text to examine the result\n", "print(extract)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the sequences of full stops are gone.\n", "\n", "We can make our regular expression even more powerful by adding alternatives.\n", "\n", "Let's compile another regular expression and store it under the variable `punct`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Compile a regular expression and assign it to the variable 'punct'\n", "punct = re.compile(r'(\\.|,){2,}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's new here are the parentheses `( )` and the vertical bar `|` between them, which separates our actual full stop `\\.` and the comma `,`.\n", "\n", "The characters surrounded by parentheses and separated by a vertical bar mark *alternatives*.\n", "\n", "In plain English, we tell the regular expression to search for *occurrences of two or more full stops or commas*.\n", "\n", "Let's apply our new pattern to the text under `extract`.\n", "\n", "To ensure the pattern works as intended, let's retrieve the original text from the `text` variable and assign it to the variable `extract` to overwrite our previous edits." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " guardian of democracy also works as a deterrent to ter;;, rorists.\n", " If the United States and all its; might wasn't viewed as looking- out for smaller countries’ inter- ; ests, Rogers said, \"things could get completely out of hand.;- Anybody could do anything/; Terrorism would spread like a cancer.”\t,\n", " Poor people may be the first'-'* to feel repercussions from Iraq's invasion of Kuwait. Al->.\" ready, the price of a tankful of gasoline has increased a dollar— or two at many pumps in the region.\n", " \"This is going to hurt the. poor man and the U.S. car tor dustry,” Rogers predicted.\n", " “Even if they ration gas, ;lt won’t matter to rich people*”;, said Belk, from the VA nursing-,:.’ home. “They will buy it on the black market. But if you don't,7 have much money to begin with;-’; it’s going to make a difference/*—.\n", " Gerald Dunn, a federal gov*” - eminent employee from Alex-* 1 andria, said that he believed'\"\" that most middle- and upper-in-\"“ come families wouldn’t even no-.; — tice the increase in gas prices.-'; Besides, he said, it was a small -price for them to pay while the nation defended a \"country un-' der attack.”\t’’\n", " Saddam Hussein, alternatively described as a madman, bully-”\" and dictator, drew passionate'\"''. repudiation.\n", " \"No way can we let him get' away with this,” said Robert Stout, a vice president at Smithy Braedon, referring to . the Iraqi president’s decision to annex Kuwait.\t,\n", " “Saddam Hussein really* scares me,\" said Falls Church music instructor Joseph Moq-, ton. \"I fear we have a madmaq„; out there and he is not going to;;,* stop at Kuwait. These days, it takes is a nuclear weapon; small enough to fit in a suit-.;.; case.”\t—\n", " ”1 just hope they stop him — quick,” said Viola Anderson, a\"-\"\n", "retired hotel employee in tlie\t\n", "District. \"I really feel things could turn terrible, if we donfr stop him soon.”\t>•'«■«\n", "Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.\n", "\n" ] } ], "source": [ "# \"Reset\" the extract variable by taking the last 2000 characters of the original string\n", "extract = text[-2000:]\n", "\n", "# Apply the regular expression\n", "extract = punct.sub(repl='', string=extract)\n", "\n", "# Print out the result\n", "print(extract)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Success! The sequences of full stops and commas can be removed using a single regular expression." ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden" }, "source": [ "### Quick exercise\n", "\n", "Use `re.compile()` to compile a regular expression that matches `”`, `\"\"` and `’’` and store the result under the variable `quotes`.\n", "\n", "Find matching sequences in `snippet` and replace them with `\"`.\n", "\n", "You will need parentheses `( )` and vertical bars `|` to define the alternatives." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The more irregular sequences resulting from optical character recognition errors in `extract`, such as `'-'*`, `->.\"`, `/*—.`, `-\"“` and `'\"''.` are much harder to capture.\n", "\n", "Capturing these patterns would require defining more complex regular expressions, which are harder to write. Their complexity is, however, what makes regular expressions so powerful, but at the same time, learning how to use them takes time and patience.\n", "\n", "It is therefore a good idea to use a service such as [regex101.com](https://www.regex101.com) to learn the basics of regular expressions.\n", "\n", "In practice, coming up with regular expressions that cover as many matches as possible is particularly hard. \n", "\n", "Capturing most of the errors – and perhaps distributing the manipulations over a series of steps in a pipeline – can already help prepare the text for further processing or analysis.\n", "\n", "However, keep in mind that in order to identify patterns for manipulating text programmatically, you should always look at more than one text in your corpus." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Processing multiple files\n", "\n", "Many corpora contain texts in multiple files. \n", "\n", "To make manipulating the text as efficient as possible, we must open the files, read their contents, perform the requested operations and close them *programmatically*.\n", "\n", "This procedure is fairly simple using the `Path` class from Python's `pathlib` module.\n", "\n", "Let's import the class first. Using the command `from` with `import` allows us to import only a part of the `pathlib` module, namely the `Path` class. This is useful if you only need some feature contained in a Python module or library." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Path` class encodes information about *paths* in a *directory structure*.\n", "\n", "What's particularly great about the Path class is that it can automatically infer what kinds of paths your operating system uses. \n", "\n", "Here the problem is that operating systems such as Windows, Linux and Mac OS X have different file system paths.\n", "\n", "Using the `Path` class allows us to avoid a lot of trouble, particularly if we want to code to run on different operating systems.\n", "\n", "Our repository contains a directory named `data`, which contains the text files that we have been working with recently.\n", "\n", "Let's initialise a Path *object* that points towards this directory by providing a string with the directory name to the Path *class*. We assign the object to the variable `corpus_dir`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Create a Path object that points towards the directory 'data' and assign\n", "# the object to the variable 'corpus_dir'\n", "corpus_dir = Path('data')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Path object stored under `corpus_dir` has various useful methods and attributes.\n", "\n", "We can, for instance, easily check if the path is valid using the `exists()` method." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the exists() method to check if the path is valid\n", "corpus_dir.exists()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also check if the path is a directory using the `is_dir()` method." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the exists() method to check if the path points to a directory\n", "corpus_dir.is_dir()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make sure the path does not point towards a file using the `is_file()` method." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the exists() method to check if the path points to a file\n", "corpus_dir.is_file()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we know that the path refered to is indeed a directory, we can use the `glob()` method to collect all text files in the directory.\n", "\n", "`glob` stands for [*global*](https://en.wikipedia.org/wiki/Glob_(programming)), and was first implemented as a program for matching filenames and paths using wildcards.\n", "\n", "The `glob()` method requires one argument, `pattern`, which takes a string as input. This string defines the kinds of files to be collected. The asterisk symbol `*` acts as a wildcard, which can refer to *any sequence of characters* preceding the sequence `.txt`.\n", "\n", "The file identifier `.txt` is a commonly-used suffix for plain text files.\n", "\n", "We also instruct Python to *cast* the result into a list using the `list()` function, so we can easily loop over the files in the list.\n", "\n", "Finally, we store the result under the variable `files` and call the result." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('data/WP_1990-08-10-25A.txt'),\n", " PosixPath('data/NYT_1991-01-16-A15.txt'),\n", " PosixPath('data/WP_1991-01-17-A1B.txt')]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get all files with the suffix .txt in the directory 'corpus_dir' and cast the result into a list\n", "files = list(corpus_dir.glob(pattern='*.txt'))\n", "\n", "# Call the result\n", "files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a list of three Path objects that point towards three text files!\n", "\n", "This allows us to loop over the files using a `for` loop and manipulate text in each file." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WP_1990-08-10-25A.txt *We Don’t Stand for Bullies': Diverse Voices in Area Back Action Against Iraq\n", "Mary Jordan Washingto\n", "NYT_1991-01-16-A15.txt U.S. TAKING STEPS TO CURB TERRORISM: F.B.I. Is Ordered to Find Iraqis Whose Visas Have Expired\n", "By J\n", "WP_1991-01-17-A1B.txt U.S., Allies Launch Massive Air War Against Targets in Iraq and ...\n", "Atkinson, Rick;David S Broder W\n" ] } ], "source": [ "# Begin the loop\n", "for f in files:\n", " \n", " # Open the file at the path stored under the variable 'f' and declare encoding ('utf-8')\n", " file = open(f, encoding=\"utf-8\")\n", " \n", " # The Path object contains the filename under the attribute name. Let's assign the filename to a variable.\n", " filename = f.name\n", " \n", " # Read the file contents\n", " text = file.read()\n", " \n", " # Print the filename and the first 100 characters in the file\n", " print(filename, text[:100])\n", " \n", " # Define a new filename for our modified file, which has the prefix 'mod_'\n", " new_filename = 'mod_' + filename\n", " \n", " # Define a new path. Path will automatically join the directory and filenames for us.\n", " new_path = Path('data', new_filename)\n", " \n", " # We then create a file with the new filename. Note the mode for *writing*\n", " new_file = open(new_path, mode='w+', encoding=\"utf-8\")\n", " \n", " # Apply our regular expression for removing excessive punctuation to the text\n", " modified_text = punct.sub('', text)\n", " \n", " # Write the modified text to the new file\n", " new_file.write(modified_text)\n", " \n", " # Let's close the files\n", " file.close()\n", " new_file.close()" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden" }, "source": [ "If you take a look at the directory [data](data), you should now see three files whose names have the prefix `mod_`. These are the files we just modified and saved." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This should have given you an idea of the some more powerful methods for manipulating text available in Python, such as regular expressions, and how to apply them to multiple files at the same time.\n", "\n", "The [following section](basic_nlp.ipynb) will teach you how to apply basic natural language processing techniques to the data." ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }