{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Manipulating text at scale\n", "\n", "This section introduces you to regular expressions for manipulating text and how to apply the same procedure to several files.\n", "\n", "Ideally, Python should enable you to manipulate text at scale, that is, to apply the same procedure to ten, hundred or thousand text files *with the same effort*.\n", "\n", "To do so, we must be able to define more flexible patterns than the fixed strings that we used previously with the `replace()` method, while opening and closing files automatically.\n", "\n", "This capability is provided by Python modules for *regular expressions* and *file handling*.\n", "\n", "After reading this section, you should know:\n", "\n", " - how to manipulate multiple text files using Python\n", " - how to define simple patterns using *regular expressions*\n", " - how to save the results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular expressions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('seCpHdTA-vs', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) are a \"language\" that allows defining *search patterns*.\n", "\n", "These patterns can be used to find or to find and replace patterns in Python string objects.\n", "\n", "As opposed to fixed strings, regular expressions allow defining *wildcard characters* that stand in for any character, *quantifiers* that match sequences of repeated characters, and much more.\n", "\n", "Python allows using regular expressions through its `re` module.\n", "\n", "We can activate this module using the `import` command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's begin by loading the text file, reading its contents, assigning the last 2000 characters to the variable `extract` and printing out the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a path to the file, open the file for (r)eading using utf-8 encoding\n", "with open(file='data/WP_1990-08-10-25A.txt', mode='r', encoding='utf-8') as file:\n", "\n", " # Read the file contents using the .read() method\n", " text = file.read()\n", "\n", "# Get the *last* 2000 characters – note the minus sign before the number\n", "extract = text[-2000:]\n", "\n", "# Print the result\n", "print(extract)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the text has a lot of errors from optical character recognition, mainly in the form of sequences such as `....` and `,,,,`.\n", "\n", "Let's compile our first regular expression that searches for sequences of *two or more* full stops.\n", "\n", "This is done using the `compile()` function from the `re` module.\n", "\n", "The `compile()` function takes a string as an input. \n", "\n", "Note that we attach the prefix `r` to the string. This tells Python to store the string in 'raw' format. This means that the string is stored as it appears." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compile a regular expression and assign it to the variable 'stops'\n", "stops = re.compile(r'\\.{2,}')\n", "\n", "# Let's check the type of the regular expression!\n", "type(stops)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's unpack this regular expression a bit.\n", "\n", "1. The regular expression is defined using a Python string, as indicated by the single quotation marks `' '`.\n", "\n", "2. We need a backslash `\\` in front of our full stop `.`. The backslash tells Python that we are really referring to a full stop, because regular expressions use a full stop as a *wildcard* character that can stand in for *any character*.\n", "\n", "3. The curly brackets `{ }` instruct the regular expression to search for instances of the previous item `\\.` (our actual full stop) that occur two or more times (`2,`). This (hopefully) preserves true uses of a full stop!\n", "\n", "In plain language, we tell the regular expression to search for *occurrences of two or more full stops*. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To apply this regular expression to some text, we will use the `sub()` method of our newly-defined regular expression object `stops`.\n", "\n", "The `sub()` method takes two arguments:\n", "\n", "1. `repl`: A string containing a string that is used to *replace* possible matches.\n", "2. `string`: A string object to be searched for matches.\n", "\n", "The method returns the modified string object.\n", "\n", "Let's apply our regular expression to the string stored under the variable `extract`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Apply the regular expression to the text under 'extract' and save the output\n", "# to the same variable, essentially overwriting the old text.\n", "extract = stops.sub(repl='', string=extract)\n", "\n", "# Print the text to examine the result\n", "print(extract)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the sequences of full stops are gone.\n", "\n", "We can make our regular expression even more powerful by adding alternatives.\n", "\n", "Let's compile another regular expression and store it under the variable `punct`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compile a regular expression and assign it to the variable 'punct'\n", "punct = re.compile(r'(\\.|,){2,}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's new here are the parentheses `( )` and the vertical bar `|` between them, which separates our actual full stop `\\.` and the comma `,`.\n", "\n", "The characters surrounded by parentheses and separated by a vertical bar mark *alternatives*.\n", "\n", "In plain English, we tell the regular expression to search for *occurrences of two or more full stops or commas*.\n", "\n", "Let's apply our new pattern to the text under `extract`.\n", "\n", "To ensure the pattern works as intended, let's retrieve the original text from the `text` variable and assign it to the variable `extract` to overwrite our previous edits." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# \"Reset\" the extract variable by taking the last 2000 characters of the original string\n", "extract = text[-2000:]\n", "\n", "# Apply the regular expression\n", "extract = punct.sub(repl='', string=extract)\n", "\n", "# Print out the result\n", "print(extract)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Success! The sequences of full stops and commas can be removed using a single regular expression." ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "### Quick exercise\n", "\n", "Use `re.compile()` to compile a regular expression that matches `”`, `\"\"` and `’’` and store the result under the variable `quotes`.\n", "\n", "Find matching sequences in `extract` and replace them with `\"`.\n", "\n", "You will need parentheses `( )` and vertical bars `|` to define the alternatives." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The more irregular sequences resulting from optical character recognition errors in `extract`, such as `'-'*`, `->.\"`, `/*—.`, `-\"“` and `'\"''.` are much harder to capture.\n", "\n", "Capturing these patterns would require defining more complex regular expressions, which are harder to write. Their complexity is, however, what makes regular expressions so powerful, but at the same time, learning how to use them takes time and patience.\n", "\n", "It is therefore a good idea to use a service such as [regex101.com](https://www.regex101.com) to learn the basics of regular expressions.\n", "\n", "In practice, coming up with regular expressions that cover as many matches as possible is particularly hard. \n", "\n", "Capturing most of the errors – and perhaps distributing the manipulations over a series of steps in a pipeline – can already help prepare the text for further processing or analysis.\n", "\n", "However, keep in mind that in order to identify patterns for manipulating text programmatically, you should always look at more than one text in your corpus." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Processing multiple files" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('IwhhNfDYvlI', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many corpora contain texts in multiple files. \n", "\n", "To make manipulating high volumes of text as efficient as possible, we must open the files, read their contents, perform the requested operations and close them *programmatically*.\n", "\n", "This procedure is made fairly simple using the `Path` class from Python's `pathlib` module.\n", "\n", "Let's import the class first. Using the command `from` with `import` allows us to import only a part of the `pathlib` module, namely the `Path` class. This is useful if you only need some feature contained in a Python module or library." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Path` class encodes information about *paths* in a *directory structure*.\n", "\n", "What's particularly great about the Path class is that it can automatically infer what kinds of paths your operating system uses. \n", "\n", "Here the problem is that operating systems such as Windows, Linux and Mac OS X have different file system paths.\n", "\n", "Using the `Path` class allows us to avoid a lot of trouble, particularly if we want our code to run on different operating systems.\n", "\n", "Our repository contains a directory named `data`, which contains the text files that we have been working with recently.\n", "\n", "Let's initialise a Path *object* that points towards this directory by providing a string with the directory name to the Path *class*. We assign the object to the variable `corpus_dir`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a Path object that points towards the directory 'data' and assign\n", "# the object to the variable 'corpus_dir'\n", "corpus_dir = Path('data')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Path object stored under `corpus_dir` has various useful methods and attributes.\n", "\n", "We can, for instance, easily check if the path is valid using the `exists()` method.\n", "\n", "This returns a Boolean value, that is, either *True* or *False*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the exists() method to check if the path is valid\n", "corpus_dir.exists()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also check if the path is a directory using the `is_dir()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the exists() method to check if the path points to a directory\n", "corpus_dir.is_dir()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make sure the path does not point towards a file using the `is_file()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the exists() method to check if the path points to a file\n", "corpus_dir.is_file()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we know that the path points toward a directory, we can use the `glob()` method to collect all text files in the directory.\n", "\n", "`glob` stands for [*global*](https://en.wikipedia.org/wiki/Glob_(programming)), and was first implemented as a program for matching filenames and paths using wildcards.\n", "\n", "The `glob()` method requires one argument, `pattern`, which takes a string as input. This string defines the kinds of files to be collected. The asterisk symbol `*` acts as a wildcard, which can refer to *any sequence of characters* preceding the sequence `.txt`.\n", "\n", "The file identifier `.txt` is a commonly-used suffix for plain text files.\n", "\n", "We also instruct Python to *cast* the result into a list using the `list()` function, so we can easily loop over the files in the list.\n", "\n", "Finally, we store the result under the variable `files` and call the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get all files with the suffix .txt in the directory 'corpus_dir' and cast the result into a list\n", "files = list(corpus_dir.glob(pattern='*.txt'))\n", "\n", "# Call the result\n", "files" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# Run this cell to view a YouTube video related to this topic\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('rM1X6u9-o8A', height=350, width=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a list of three Path objects that point towards three text files!\n", "\n", "This allows us to loop over the files using a `for` loop and manipulate text in each file.\n", "\n", "In the cell below, we iterate over each file defined in the Path object, read and modify its contents, and write them to a new file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop over the list of Path objects under 'files'. Refer to the individual files using\n", "# the variable 'file'.\n", "for file in files:\n", " \n", " # Use the read_text() method of a Path object to read the file contents. Provide \n", " # the value 'utf-8' to the 'encoding' argument to declare the file encoding.\n", " # Store the result under the variable 'text'.\n", " text = file.read_text(encoding='utf-8')\n", " \n", " # Apply the regular expression we defined above to remove excessive punctuation \n", " # from the text. Store the result under the variable 'mod_text'\n", " mod_text = punct.sub('', text)\n", " \n", " # Define a new filename which has the prefix 'mod_' by creating a new string. \n", " # The Path object contains the filename as a string under the attribute 'name'. \n", " # Combine the two strings using the '+' expression.\n", " new_filename = 'mod_' + file.name\n", " \n", " # Define a new Path object that points towards the new file. The Path object \n", " # will automatically join the directory and filename for us.\n", " new_path = Path('data', new_filename)\n", " \n", " # Print a status message using string formatting. By adding the prefix 'f' to \n", " # a string, we can use curly brackets {} to insert a variable within the string. \n", " # Here we add the current file path to the string for printing.\n", " print(f'Writing modified text to {new_path}')\n", " \n", " # Use the write_text() method to write the modified text under 'mod_text' to \n", " # the file using UTF-8 encoding. \n", " new_path.write_text(mod_text, encoding='utf-8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see from the code block above, Path objects provide two convenient methods for working with text files: `read_text()` and `write_text()`.\n", "\n", "These methods can be used to read and write text from files without using the `with` statement, which was introduced in the previous [section](01_basic_text_processing.ipynb). Just as using the `with` statement, the file that the Path points to is closed after the text has been read." ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "If you now take a look at the directory [data](data), you should now see three files whose names have the prefix `mod_`. These are the files we just modified and wrote to disk.\n", "\n", "To keep the data directory clean, run the following cell to delete the modified files.\n", "\n", "Adding the exclamation mark `!` to the beginning of a code cell tells Jupyter that this is a command to the underlying command line interface, which can be used to manipulate the file system.\n", "\n", "In this case, we run the command `rm` to delete all files in the directory `data`, whose filename begins with the characters `mod`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "!rm data/mod*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This should have given you an idea of the some more powerful methods for manipulating text available in Python, such as regular expressions, and how to apply them to multiple files at the same time.\n", "\n", "The [following section](03_basic_nlp.ipynb) will teach you how to apply basic natural language processing techniques to texts." ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 2 }