This page describes the standard text modules in Python 3, and how to use them.
Overview of text processing
Text processing is one of a software developer’s most common tasks. When a user provides input, it needs to be parsed, translated, broken into its component parts, sanitized, and manipulated in countless ways. Python has a broad, high-level toolkit in its standard library that helps you handle all of these tasks and more.
Overview of text processing
string: common string operations
string constants
String formatting
Syntax
Format specification mini-language
Examples
Template strings
Helper functions
Regular expressions
Syntax
re module contents
Regular expression objects
Match objects
Examples
Checking for a pair
Simulating scanf()
search() vs. match()
Making a phonebook
Text munging
Finding all adverbs
Finding all adverbs and their positions
Raw string notation
Writing a tokenizer
difflib: helpers for computing deltas
SequenceMatcher objects
Examples
Differ objects
Examples
A command-line interface to difflib
The modules described below provide a wide range of tools for working with strings and text in general.
string: common string operations
The source code for the string module is located in the file string.py, and it contains the following tools.
For a tutorial about how to use text in Python, see How to extract specific portions of a text file using Python.
string constants
The constants defined in the string module are:
String formatting
The built-in string class provides the ability to do complex variable substitutions and value formatting via the format() method described in PEP 3101. The Formatter class in the string module allows you to create and customize string formatting behaviors using the same implementation as the built-in str.format() method.
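As an illustrative sketch (the subclass name and its fallback string are our own, not part of the module), a Formatter subclass can override get_value() to tolerate missing keyword arguments:

```python
from string import Formatter

class DefaultFormatter(Formatter):
    # Hypothetical subclass: substitute a placeholder string for
    # keyword keys that were not supplied, instead of raising KeyError.
    def get_value(self, key, args, kwargs):
        if isinstance(key, str) and key not in kwargs:
            return '<missing>'
        return super().get_value(key, args, kwargs)

f = DefaultFormatter()
print(f.format('{name} scored {points}', name='Ada'))  # prints "Ada scored <missing>"
```

Positional arguments are passed through to the base class unchanged, so ordinary format strings keep working.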
class string.Formatter
The Formatter class has the following public methods:
Also, the Formatter defines many methods that are intended to be replaced by subclasses:
Syntax
The str.format() method and the Formatter class share the same syntax for format strings (although in the case of Formatter, subclasses can define their own format string syntax).
Format strings contain “replacement fields” surrounded by curly braces {}. Anything that is not contained in braces is considered literal text, which is copied unchanged to the output. If you need to include a brace character in the literal text, it can be escaped by doubling: {{ and }}.
The grammar for a replacement field is as follows:
replacement_field ::= "{" [field_name] ["!" conversion] [":" format_spec] "}"
field_name        ::= arg_name ("." attribute_name | "[" element_index "]")*
arg_name          ::= [identifier | integer]
attribute_name    ::= identifier
element_index     ::= integer | index_string
index_string      ::= <any source character except "]"> +
conversion        ::= "r" | "s" | "a"
format_spec       ::= <described in the next section>
In less formal terms, the replacement field can start with a field_name that specifies the object whose value is to be formatted and inserted into the output instead of the replacement field. The field_name is optionally followed by a conversion field, which is preceded by an exclamation point ‘!’, and a format_spec, which is preceded by a colon ‘:’. These specify a non-default format for the replacement value.
The field_name itself begins with an arg_name that is either a number or a keyword. If it’s a number, it refers to a positional argument, and if it’s a keyword, it refers to a named keyword argument. If the numerical arg_names in a format string are 0, 1, 2, … in sequence, they can all be omitted (not only some) and the numbers 0, 1, 2, … will be automatically inserted in that order. Because arg_name is not quote-delimited, it is not possible to specify arbitrary dictionary keys (e.g., the strings '10' or ':-]') within a format string. The arg_name can be followed by any number of index or attribute expressions. An expression of the form '.name' selects the named attribute using getattr(), while an expression of the form '[index]' does an index lookup using __getitem__().
Changed in version 3.1: The positional argument specifiers can be omitted, so ‘{} {}’ is equivalent to ‘{0} {1}’.
Some simple format string examples:
"First, thou shalt count to {0}"    # References first positional argument
"Bring me a {}"                     # Implicitly references the first positional argument
"From {} to {}"                     # Same as "From {0} to {1}"
"My quest is {name}"                # References keyword argument 'name'
"Weight in tons {0.weight}"         # 'weight' attribute of first positional arg
"Units destroyed: {players[0]}"     # First element of keyword argument 'players'.
The conversion field causes a type coercion before formatting. Normally, the job of formatting a value is done by the __format__() method of the value itself. However, in some cases it is desirable to force a type to be formatted as a string, overriding its own definition of formatting. By converting the value to a string before calling __format__(), the normal formatting logic is bypassed.
Three conversion flags are currently supported: ‘!s’ which calls str() on the value, ‘!r’ which calls repr() and ‘!a’ which calls ascii().
Some examples:
"Harold's a clever {0!s}"       # Calls str() on the argument first
"Bring out the holy {name!r}"   # Calls repr() on the argument first
"More {!a}"                     # Calls ascii() on the argument first
The format_spec field contains a specification of how the value should be presented, including such details as field width, alignment, padding, decimal precision and so on. Each value type can define its own “formatting mini-language” or interpretation of the format_spec.
Most built-in types support a common formatting mini-language, which is described in the next section.
A format_spec field can also include nested replacement fields within it. These nested replacement fields can contain only a field name; conversion flags and format specifications are not allowed. The replacement fields in the format_spec are substituted before the format_spec string is interpreted. This allows the formatting of a value to be dynamically specified.
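For example, a minimal sketch of a dynamically specified format, where the nested field names width and prec are our own choices:

```python
# The width and precision are filled in from the named arguments at run time,
# producing the effective format spec "10.2f".
value = 3.14159
result = '{0:{width}.{prec}f}'.format(value, width=10, prec=2)
print(result)  # prints "      3.14"
```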
Format Specification Mini-Language
“Format specifications” are used within replacement fields contained within a format string to define how individual values are presented (see Syntax). They can also be passed directly to the built-in format() function. Each formattable type may define how the format specification is to be interpreted.
Most built-in types implement the following options for format specifications, although some of the formatting options are only supported by the numeric types.
A general convention is that an empty format string ("") produces the same result as if you had called str() on the value. A non-empty format string often modifies the result.
The general form of a standard format specifier is:
format_spec ::= [[fill]align][sign][#][0][width][,][.precision][type]
fill        ::= <any character>
align       ::= "<" | ">" | "=" | "^"
sign        ::= "+" | "-" | " "
width       ::= integer
precision   ::= integer
type        ::= "b" | "c" | "d" | "e" | "E" | "f" | "F" | "g" | "G" | "n" | "o" | "s" | "x" | "X" | "%"
If a valid align value is specified, it can be preceded by a fill character, which can be any character and defaults to a space if omitted. Note that it is not possible to use { and } as fill characters while using the str.format() method. However, this limitation doesn’t affect the format() function.
The meaning of the various alignment options is as follows:
Note that unless a minimum field width is defined, the field width is always the same size as the data to fill it, so that the alignment option has no meaning in this case.
The sign option is only valid for number types, and can be one of the following:
The ‘#’ option causes the “alternate form” to be used for the conversion. The alternate form is defined differently for different types. This option is only valid for integer, float, complex and Decimal types. For integers, when binary, octal, or hexadecimal output is used, this option adds the respective prefix ‘0b’, ‘0o’, or ‘0x’ to the output value. For floats, complex and Decimal the alternate form causes the result of the conversion to always contain a decimal-point character, even if no digits follow it. Normally, a decimal-point character appears in the result of these conversions only if a digit follows it. Also, for ‘g’ and ‘G’ conversions, trailing zeros are not removed from the result.
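A quick illustration of the alternate form using the built-in format() function:

```python
print(format(255, 'x'))     # prints "ff"
print(format(255, '#x'))    # prints "0xff" - alternate form adds the 0x prefix
print(format(3.0, '.0f'))   # prints "3"    - no decimal point with zero precision
print(format(3.0, '#.0f'))  # prints "3."   - alternate form keeps the decimal point
```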
The ‘,’ option signals the use of a comma for a thousands separator. For a locale aware separator, use the ’n’ integer presentation type instead.
Changed in version 3.1: Added the ‘,’ option (see also PEP 378).
width is a decimal integer defining the minimum field width. If not specified, then the field width is determined by the content.
Preceding the width field by a zero (‘0’) character enables sign-aware zero-padding for numeric types. This is equivalent to a fill character of ‘0’ with an alignment type of ‘=’.
The precision is a decimal number indicating how many digits should be displayed after the decimal point for a floating point value formatted with ‘f’ and ‘F’, or before and after the decimal point for a floating point value formatted with ‘g’ or ‘G’. For non-number types the field indicates the maximum field size: in other words, how many characters will be used from the field content. The precision is not allowed for integer values.
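A short illustration of precision with the built-in format() function:

```python
print(format(3.14159, '.2f'))     # prints "3.14"  - two digits after the point
print(format(3.14159, '.3g'))     # prints "3.14"  - three significant digits
print(format('xylophone', '.5'))  # prints "xylop" - strings truncated to the precision
```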
Finally, the type determines how the data should be presented.
The available string presentation types are:
The available integer presentation types are:
In addition to the above presentation types, integers can be formatted with the floating point presentation types listed below (except ’n’ and None). When doing so, float() is used to convert the integer to a floating point number before formatting.
The available presentation types for floating point and decimal values are:
Examples
This section contains examples of the new format syntax and comparison with the old %-formatting.
In most of the cases the syntax is similar to the old %-formatting, with the addition of the {} and with : used instead of %. For example, ‘%03.2f’ can be translated to ‘{:03.2f}’.
The new format syntax also supports new and different options, shown in the following examples.
Accessing arguments by position:
>>> '{0}, {1}, {2}'.format('a', 'b', 'c')
'a, b, c'
>>> '{}, {}, {}'.format('a', 'b', 'c')  # 3.1+ only
'a, b, c'
>>> '{2}, {1}, {0}'.format('a', 'b', 'c')
'c, b, a'
>>> '{2}, {1}, {0}'.format(*'abc')      # unpacking argument sequence
'c, b, a'
>>> '{0}{1}{0}'.format('abra', 'cad')   # arguments' indices can be repeated
'abracadabra'
Accessing arguments by name:
>>> 'Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W')
'Coordinates: 37.24N, -115.81W'
>>> coord = {'latitude': '37.24N', 'longitude': '-115.81W'}
>>> 'Coordinates: {latitude}, {longitude}'.format(**coord)
'Coordinates: 37.24N, -115.81W'
Accessing arguments’ attributes:
>>> c = 3-5j
>>> ('The complex number {0} is formed from the real part {0.real} '
...  'and the imaginary part {0.imag}.').format(c)
'The complex number (3-5j) is formed from the real part 3.0 and the imaginary part -5.0.'
>>> class Point:
...     def __init__(self, x, y):
...         self.x, self.y = x, y
...     def __str__(self):
...         return 'Point({self.x}, {self.y})'.format(self=self)
...
>>> str(Point(4, 2))
'Point(4, 2)'
Accessing arguments’ items:
>>> coord = (3, 5)
>>> 'X: {0[0]}; Y: {0[1]}'.format(coord)
'X: 3; Y: 5'
Replacing %s and %r:
>>> "repr() shows quotes: {!r}; str() doesn't: {!s}".format('test1', 'test2')
"repr() shows quotes: 'test1'; str() doesn't: test2"
Aligning the text and specifying a width:
>>> '{:<30}'.format('left aligned')
'left aligned                  '
>>> '{:>30}'.format('right aligned')
'                 right aligned'
>>> '{:^30}'.format('centered')
'           centered           '
>>> '{:*^30}'.format('centered')  # use '*' as a fill char
'***********centered***********'
Replacing %+f, %-f, and % f and specifying a sign:
>>> '{:+f}; {:+f}'.format(3.14, -3.14)  # show it always
'+3.140000; -3.140000'
>>> '{: f}; {: f}'.format(3.14, -3.14)  # show a space for positive numbers
' 3.140000; -3.140000'
>>> '{:-f}; {:-f}'.format(3.14, -3.14)  # show only the minus -- same as '{:f}; {:f}'
'3.140000; -3.140000'
Replacing %x and %o and converting the value to different bases:
>>> # format also supports binary numbers
>>> "int: {0:d}; hex: {0:x}; oct: {0:o}; bin: {0:b}".format(42)
'int: 42; hex: 2a; oct: 52; bin: 101010'
>>> # with 0x, 0o, or 0b as prefix:
>>> "int: {0:d}; hex: {0:#x}; oct: {0:#o}; bin: {0:#b}".format(42)
'int: 42; hex: 0x2a; oct: 0o52; bin: 0b101010'
Using the comma as a thousands separator:
>>> '{:,}'.format(1234567890)
'1,234,567,890'
Expressing a percentage:
>>> points = 19
>>> total = 22
>>> 'Correct answers: {:.2%}'.format(points/total)
'Correct answers: 86.36%'
Using type-specific formatting:
>>> import datetime
>>> d = datetime.datetime(2010, 7, 4, 12, 15, 58)
>>> '{:%Y-%m-%d %H:%M:%S}'.format(d)
'2010-07-04 12:15:58'
Nesting arguments and more complex examples:
>>> for align, text in zip('<^>', ['left', 'center', 'right']):
...     '{0:{fill}{align}16}'.format(text, fill=align, align=align)
...
'left<<<<<<<<<<<<'
'^^^^^center^^^^^'
'>>>>>>>>>>>right'
>>> octets = [192, 168, 0, 1]
>>> '{:02X}{:02X}{:02X}{:02X}'.format(*octets)
'C0A80001'
>>> int(_, 16)
3232235521
>>> width = 5
>>> for num in range(5, 12):
...     for base in 'dXob':
...         print('{0:{width}{base}}'.format(num, base=base, width=width), end=' ')
...     print()
...
    5     5     5   101
    6     6     6   110
    7     7     7   111
    8     8    10  1000
    9     9    11  1001
   10     A    12  1010
   11     B    13  1011
Template strings
Templates provide simpler string substitutions as described in PEP 292. Instead of the normal %-based substitutions, Templates support $-based substitutions, using the following rules:
- $$ is an escape; it is replaced with a single $.
- $identifier names a substitution placeholder matching a mapping key of “identifier”. By default, “identifier” must spell a Python identifier. The first non-identifier character after the $ character terminates this placeholder specification.
- ${identifier} is equivalent to $identifier. It is required when valid identifier characters follow the placeholder but are not part of the placeholder, such as “${noun}ification”.
Any other appearance of $ in the string results in a ValueError being raised.
The string module provides a Template class that implements these rules.
The constructor of Template is:
class string.Template(template)
The constructor takes a single argument that is the template string.
The methods of Template are:
Template instances also provide one public data attribute:
Here is an example of how to use a Template:
>>> from string import Template
>>> s = Template('$who likes $what')
>>> s.substitute(who='tim', what='kung pao')
'tim likes kung pao'
>>> d = dict(who='tim')
>>> Template('Give $who $100').substitute(d)
Traceback (most recent call last):
...
ValueError: Invalid placeholder in string: line 1, col 11
>>> Template('$who likes $what').substitute(d)
Traceback (most recent call last):
...
KeyError: 'what'
>>> Template('$who likes $what').safe_substitute(d)
'tim likes $what'
Advanced usage: you can derive subclasses of Template to customize the placeholder syntax, delimiter character, or the entire regular expression used to parse template strings. To do this, you can override these class attributes:
- delimiter – This is the literal string describing a placeholder introducing delimiter. The default value is $. Note that this should not be a regular expression, as the implementation will call re.escape() on this string as needed.
- idpattern – This is the regular expression describing the pattern for non-braced placeholders (the braces are added automatically as appropriate). The default value is the regular expression [_a-z][_a-z0-9]*.
- flags – The regular expression flags that will be applied when compiling the regular expression used for recognizing substitutions. The default value is re.IGNORECASE. Note that re.VERBOSE is always added to the flags, so custom idpatterns must follow conventions for verbose regular expressions.
Alternatively, you can provide the entire regular expression pattern by overriding the class attribute pattern. If you do this, the value must be a regular expression object with four named capturing groups. The capturing groups correspond to the rules given above, along with the invalid placeholder rule:
- escaped – This group matches the escape sequence, e.g., $$, in the default pattern.
- named – This group matches the unbraced placeholder name; it should not include the delimiter in the capturing group.
- braced – This group matches the brace enclosed placeholder name; it should not include either the delimiter or braces in the capturing group.
- invalid – This group matches any other delimiter pattern (usually a single delimiter), and it should appear last in the regular expression.
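As a sketch of the simpler approach, a hypothetical subclass (the class name is ours) that swaps the delimiter for %:

```python
from string import Template

class PercentTemplate(Template):
    # Hypothetical subclass: use % instead of $ as the delimiter.
    # The doubled delimiter %% still escapes to a single literal %.
    delimiter = '%'

t = PercentTemplate('100%% sure, %who?')
print(t.substitute(who='tim'))  # prints "100% sure, tim?"
```

The pattern is rebuilt from the class attributes when the subclass is created, so no other override is needed for a simple delimiter change.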
Helper functions
re: regular expression operations
This module provides regular expression matching operations similar to those found in Perl.
Both patterns and strings to be searched can be Unicode strings and 8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
Regular expressions use the backslash character ('\') to indicate special forms or allow special characters to be used without invoking their special meaning. This collides with Python's usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
The solution is to use Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
It is important to note that most regular expression operations are available as module-level functions and methods on compiled regular expressions. The functions are shortcuts that don’t require you to compile a regex object first, but miss some fine-tuning parameters.
A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. This holds unless A or B contain low precedence operations; boundary conditions between A and B; or have numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here.
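A minimal illustration of concatenation (the pattern names are our own):

```python
import re

digits = r'\d+'
letters = r'[a-z]+'
# The concatenated pattern matches a string whose first part matches
# digits and whose remainder matches letters.
print(re.fullmatch(digits + letters, '123abc') is not None)  # prints "True"
print(re.fullmatch(digits + letters, 'abc123') is not None)  # prints "False"
```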
Regular expressions can contain both special and ordinary characters. Most ordinary characters, like ‘A’, ‘a’, or ‘0’, are the simplest regular expressions; they match themselves. You can concatenate ordinary characters, so last matches the string ’last’. (In the rest of this section, we’ll write RE’s in this special style, usually without quotes, and strings to be matched ‘in single quotes’.)
Some characters, like ‘|’ or ‘(’, are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. Regular expression pattern strings may not contain null bytes, but can specify the null byte using a \number notation such as ‘\x00’.
The special characters are:
The special sequences consist of '\' and a character from the list below. If the ordinary character is not on the list, then the resulting RE will simply match the second character. For example, \$ matches the character '$'.
- Characters can be listed individually, e.g., [amk] will match 'a', 'm', or 'k'.
- Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g., [a\-z]) or if it's placed as the first or last character (e.g., [a-]), it will match a literal '-'.
- Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
- Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether ASCII or LOCALE mode is in force.
- Characters that are not within a range can be matched by complementing the set: if the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it's not the first character in the set.
- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a parenthesis.
>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'

>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'
Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:
\a      \b      \f      \n
\r      \t      \u      \U
\v      \x      \\
(Note that \b is used to represent word boundaries, and means “backspace” only inside character classes.)
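A small sketch of the two meanings of \b:

```python
import re

# Outside a character class, \b asserts a word boundary.
print(re.search(r'\bclass\b', 'no class here') is not None)  # prints "True"
print(re.search(r'\bclass\b', 'subclass') is not None)       # prints "False"
# Inside a character class, \b matches the backspace character (0x08).
print(re.search(r'[\b]', 'a\x08b') is not None)              # prints "True"
```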
‘\u’ and ‘\U’ escape sequences are only recognized in Unicode patterns. In bytes patterns they are not treated specially.
Octal escapes are included in a limited form. If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.
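A short illustration of the distinction:

```python
import re

# '\101' has three octal digits, so it is the octal escape for chr(0o101) == 'A'.
print(re.match(r'\101', 'A') is not None)    # prints "True"
# '\1' is a reference back to group 1, not an octal escape.
print(re.match(r'(.)\1', 'aa') is not None)  # prints "True"
print(re.match(r'(.)\1', 'ab') is not None)  # prints "False"
```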
re module contents
The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form.
Regular expression objects
Compiled regular expression objects support the following methods and attributes:
The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
Match Objects
Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:
>>> pattern = re.compile("d")
>>> pattern.search("dog")     # Match at index 0
<_sre.SRE_Match object; span=(0, 1), match='d'>
>>> pattern.search("dog", 1)  # No match; search doesn't include the "d"

>>> pattern = re.compile("o[gh]")
>>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.fullmatch("ogre")     # No match as not the full string matches.
>>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
<_sre.SRE_Match object; span=(1, 3), match='og'>
match = re.search(pattern, string)
if match:
    process(match)
Match objects support the following methods and attributes:
Checking for a Pair
In this example, we’ll use the following helper function to display match objects a little more gracefully:
m = re.match(r"(\w+) (\w+)", “Isaac Newton, physicist”)»> m.group(0) # The entire match’Isaac Newton’»> m.group(1) # The first parenthesized subgroup.‘Isaac’»> m.group(2) # The second parenthesized subgroup.‘Newton’»> m.group(1, 2) # Multiple arguments give us a tuple.(‘Isaac’, ‘Newton’)
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", “Malcolm Reynolds”)»> m.group(‘first_name’)‘Malcolm’»> m.group(’last_name’)‘Reynolds’
m.group(1)‘Malcolm’»> m.group(2)‘Reynolds’
m = re.match(r"(..)+", “a1b2c3”) # Matches 3 times.»> m.group(1) # Returns only the last match.‘c3’
m = re.match(r"(\d+).(\d+)", “24.1632”)»> m.groups()(‘24’, ‘1632’)
m = re.match(r"(\d+).?(\d+)?", “24”)»> m.groups() # Second group defaults to None.(‘24’, None)»> m.groups(‘0’) # Now, the second group defaults to ‘0’.(‘24’, ‘0’)
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", “Malcolm Reynolds”)»> m.groupdict(){‘first_name’: ‘Malcolm’, ’last_name’: ‘Reynolds’}
m.string[m.start(g):m.end(g)]
>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'
def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())
Suppose you are writing a poker program where a player’s hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value.
To see if a given string is a valid hand, one could do the following:
>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q"))  # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e"))  # Invalid.
>>> displaymatch(valid.match("akt"))    # Invalid.
>>> displaymatch(valid.match("727ak"))  # Valid.
"<Match: '727ak', groups=()>"
That last hand, “727ak”, contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences as such:
>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak"))   # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak"))   # No pairs.
>>> displaymatch(pair.match("354aa"))   # Pair of aces.
"<Match: '354aa', groups=('a',)>"
To find out what card the pair consists of, one could use the group() method of the match object in the following manner:
>>> pair.match("717ak").group(1)
'7'

>>> # Error because re.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    re.match(r".*(.).*\1", "718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'

>>> pair.match("354aa").group(1)
'a'
Simulating scanf()
Python does not currently have an equivalent to scanf(). Regular expressions are generally more powerful, though also more verbose, than scanf() format strings. The table below offers some more-or-less equivalent mappings between scanf() format tokens and regular expressions.

scanf() Token    Regular Expression
%c               .
%5c              .{5}
%d               [-+]?\d+
%e, %E, %f, %g   [-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?
%i               [-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)
%o               [-+]?[0-7]+
%s               \S+
%u               \d+
%x, %X           [-+]?(0[xX])?[\dA-Fa-f]+
To extract the file name and numbers from a string like:
/usr/sbin/sendmail - 0 errors, 4 warnings
…where you would use a scanf() format similar to:
%s - %d errors, %d warnings
…The equivalent regular expression would be:
(\S+) - (\d+) errors, (\d+) warnings
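As a sketch, applying that regular expression to the sample line shown above:

```python
import re

line = '/usr/sbin/sendmail - 0 errors, 4 warnings'
# Each parenthesized group captures one of the scanf()-style fields.
m = re.search(r'(\S+) - (\d+) errors, (\d+) warnings', line)
print(m.groups())  # prints "('/usr/sbin/sendmail', '0', '4')"
```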
search() vs. match()
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
For example:
>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>
Regular expressions beginning with ‘^’ can be used with search() to restrict the match to the beginning of the string:
>>> re.match("c", "abcdef")    # No match
>>> re.search("^c", "abcdef")  # No match
>>> re.search("^a", "abcdef")  # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>
Note, however, that in MULTILINE mode, match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with ‘^’ matches at the beginning of each line.
>>> re.match('X', 'A\nB\nX', re.MULTILINE)    # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
<_sre.SRE_Match object; span=(4, 5), match='X'>
Making a phonebook
split() splits a string into a list delimited by the passed pattern. The method is invaluable for converting textual data into data structures that can be easily read and modified by Python, as demonstrated in the following example that creates a phonebook.
First, here is the input. Normally it would come from a file; here we are using triple-quoted string syntax:
>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way
...
...
... Heather Albrecht: 548.326.4584 919 Park Place"""
The entries are separated by one or more newlines. Now we convert the string into a list with each nonempty line having its own entry:
>>> entries = re.split("\n+", text)
>>> entries
['Ross McFluff: 834.345.1254 155 Elm Street',
 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
 'Frank Burger: 925.541.7625 662 South Dogwood Way',
 'Heather Albrecht: 548.326.4584 919 Park Place']
Finally, split each entry into a list with first name, last name, telephone number, and address. We use the maxsplit parameter of split() because the address has spaces, our splitting pattern, in it:
[re.split(":? “, entry, 3) for entry in entries] [[‘Ross’, ‘McFluff’, ‘834.345.1254’, ‘155 Elm Street’], [‘Ronald’, ‘Heathmore’, ‘892.345.3428’, ‘436 Finley Avenue’], [‘Frank’, ‘Burger’, ‘925.541.7625’, ‘662 South Dogwood Way’], [‘Heather’, ‘Albrecht’, ‘548.326.4584’, ‘919 Park Place’]]
The :? pattern matches the colon after the last name, so that it does not occur in the result list. With a maxsplit of 4, we could separate the house number from the street name:
>>> [re.split(":? ", entry, 4) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
 ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
 ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
 ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
Text munging
sub() replaces every occurrence of a pattern with a string or the result of a function. This example demonstrates using sub() with a function to “munge” text, or randomize the order of all the characters in each word of a sentence except for the first and last characters:
>>> import random
>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
...
>>> text = "Professor Abdolmalek, please report your absences promptly."
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
Finding all adverbs
findall() matches all occurrences of a pattern, not only the first one as search() does. For example, if one was a writer and wanted to find all of the adverbs in some text, he or she might use findall() in the following manner:
>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']
Finding all adverbs and their positions
>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly
Raw string notation
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash (’') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:
re.match(r"\W(.)\1\W", " ff “) <_sre.SRE_Match object; span=(0, 4), match=’ ff ‘> re.match(”\W(.)\1\W", " ff “) <_sre.SRE_Match object; span=(0, 4), match=’ ff ‘>
When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:
re.match(r"\", r"\") <_sre.SRE_Match object; span=(0, 1), match=’\’> re.match("\\", r"\") <_sre.SRE_Match object; span=(0, 1), match=’\'>
Writing a tokenizer
A tokenizer or scanner analyzes a string to categorize groups of characters. This is a useful first step in writing a compiler or interpreter.
The text categories are specified with regular expressions. The technique is to combine those into a single primary regular expression and to loop over successive matches:
import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(s):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',  r':='),           # Assignment operator
        ('END',     r';'),            # Statement terminator
        ('ID',      r'[A-Za-z]+'),    # Identifiers
        ('OP',      r'[+*/-]'),       # Arithmetic operators
        ('NEWLINE', r'\n'),           # Line endings
        ('SKIP',    r'[ \t]'),        # Skip over spaces and tabs
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    get_token = re.compile(tok_regex).match
    line = 1
    pos = line_start = 0
    mo = get_token(s)
    while mo is not None:
        typ = mo.lastgroup
        if typ == 'NEWLINE':
            line_start = pos
            line += 1
        elif typ != 'SKIP':
            val = mo.group(typ)
            if typ == 'ID' and val in keywords:
                typ = val
            yield Token(typ, val, line, mo.start()-line_start)
        pos = mo.end()
        mo = get_token(s, pos)
    if pos != len(s):
        raise RuntimeError('Unexpected character %r on line %d' %
            (s[pos], line))

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)
The tokenizer produces the following output:
Token(typ='IF', value='IF', line=2, column=5)
Token(typ='ID', value='quantity', line=2, column=8)
Token(typ='THEN', value='THEN', line=2, column=17)
Token(typ='ID', value='total', line=3, column=9)
Token(typ='ASSIGN', value=':=', line=3, column=15)
Token(typ='ID', value='total', line=3, column=18)
Token(typ='OP', value='+', line=3, column=24)
Token(typ='ID', value='price', line=3, column=26)
Token(typ='OP', value='*', line=3, column=32)
Token(typ='ID', value='quantity', line=3, column=34)
Token(typ='END', value=';', line=3, column=42)
Token(typ='ID', value='tax', line=4, column=9)
Token(typ='ASSIGN', value=':=', line=4, column=13)
Token(typ='ID', value='price', line=4, column=16)
Token(typ='OP', value='*', line=4, column=22)
Token(typ='NUMBER', value='0.05', line=4, column=24)
Token(typ='END', value=';', line=4, column=28)
Token(typ='ENDIF', value='ENDIF', line=5, column=5)
Token(typ='END', value=';', line=5, column=10)
difflib: helpers for computing deltas
This module provides classes and functions for comparing sequences. It can be used, for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.
SequenceMatcher objects
The SequenceMatcher class has this constructor:
>>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
>>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
>>> for line in context_diff(s1, s2, fromfile='before.py', tofile='after.py'):
...     sys.stdout.write(line)
*** before.py
--- after.py
***************
*** 1,4 ****
! bacon
! eggs
! ham
  guido
--- 1,4 ----
! python
! eggy
! hamster
  guido
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
...              'ore\ntree\nemu\n'.splitlines(keepends=True))
>>> print(''.join(diff), end="")
- one
?  ^
+ ore
?  ^
- two
- three
?  -
+ tree
+ emu
>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
...              'ore\ntree\nemu\n'.splitlines(keepends=True))
>>> diff = list(diff) # materialize the generated delta into a list
>>> print(''.join(restore(diff, 1)), end="")
one
two
three
>>> print(''.join(restore(diff, 2)), end="")
ore
tree
emu
>>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
>>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
>>> for line in unified_diff(s1, s2, fromfile='before.py', tofile='after.py'):
...     sys.stdout.write(line)
--- before.py
+++ after.py
@@ -1,4 +1,4 @@
-bacon
-eggs
-ham
+python
+eggy
+hamster
 guido
SequenceMatcher objects have the following methods:
lambda x: x in " \t"
(SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.)
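As a sketch of this reuse pattern, the hypothetical scenario below scores several candidate strings against one base string; the base is set once with set_seq2() and only the first sequence changes per comparison (all names here are illustrative, not from the library):

```python
from difflib import SequenceMatcher

# Hypothetical scenario: score several candidates against one base string.
# set_seq2() caches the analysis of the base sequence, so we set it once
# and swap only the first sequence for each candidate.
base = "apple pie"
candidates = ["apple tart", "maple pie", "cherry cake"]

s = SequenceMatcher(None)
s.set_seq2(base)              # analyzed once, cached
scores = {}
for cand in candidates:
    s.set_seq1(cand)          # cheap: the base is not re-analyzed
    scores[cand] = s.ratio()

best = max(scores, key=scores.get)
print(best)
```

The same loop with SequenceMatcher(None, cand, base) inside it would redo the expensive second-sequence analysis on every iteration.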
The three methods that return the ratio of matching to total characters can give different results due to differing levels of approximation, although quick_ratio() and real_quick_ratio() are always at least as large as ratio():
s = SequenceMatcher(None, " abcd", “abcd abcd”)»> s.find_longest_match(0, 5, 0, 9)Match(a=0, b=4, size=5)
>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
Match(a=1, b=0, size=4)
>>> s = SequenceMatcher(None, "abxcd", "abcd")
>>> s.get_matching_blocks()
[Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]
The tag values are strings, with these meanings:
>>> a = "qabxcd"
>>> b = "abycdf"
>>> s = SequenceMatcher(None, a, b)
>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
...     print('{:7} a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
...         tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
delete  a[0:1] --> b[0:0]      'q' --> ''
equal   a[1:3] --> b[0:2]     'ab' --> 'ab'
replace a[3:4] --> b[2:3]      'x' --> 'y'
equal   a[4:6] --> b[3:5]     'cd' --> 'cd'
insert  a[6:6] --> b[5:6]       '' --> 'f'
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0
This example compares two strings, considering blanks to be “junk”:
s = SequenceMatcher(lambda x: x == " “, … “private Thread currentThread;”, … “private volatile Thread currentThread;”)
ratio() returns a float in [0, 1], measuring the similarity of the sequences. As a rule of thumb, a ratio() value over 0.6 means the sequences are close matches:
>>> print(round(s.ratio(), 3))
0.866
If you’re only interested in where the sequences match, get_matching_blocks() is handy:
>>> for block in s.get_matching_blocks():
...     print("a[%d] and b[%d] match for %d elements" % block)
a[0] and b[0] match for 8 elements
a[8] and b[17] match for 21 elements
a[29] and b[38] match for 0 elements
Note that the last tuple returned by get_matching_blocks() is always a dummy, (len(a), len(b), 0), and this is the only case in which the last tuple element (number of elements matched) is 0.
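This invariant is easy to check directly; a small sketch (the strings here are illustrative):

```python
from difflib import SequenceMatcher

# The final matching block is always the zero-length sentinel
# (len(a), len(b), 0), which makes pairwise iteration over the
# blocks straightforward.
a, b = "abxcd", "abcd"
blocks = SequenceMatcher(None, a, b).get_matching_blocks()
last = blocks[-1]
assert last == (len(a), len(b), 0)   # Match is a namedtuple, so this compares
assert all(block.size > 0 for block in blocks[:-1])
```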
If you want to know how to change the first sequence into the second, use get_opcodes():
>>> for opcode in s.get_opcodes():
...     print("%6s a[%d:%d] b[%d:%d]" % opcode)
 equal a[0:8] b[0:8]
insert a[8:8] b[8:17]
 equal a[8:29] b[17:38]
Differ objects
Note that Differ-generated deltas make no claim to be minimal diffs. To the contrary, minimal diffs are often counter-intuitive, because they synch up anywhere possible, sometimes accidental matches 100 pages apart. Restricting synch points to contiguous matches preserves some notion of locality, at the occasional cost of producing a longer diff.
The Differ class has this constructor:
Differ objects are used (deltas generated) via a single method:
This example compares two texts. First we set up the texts, sequences of individual single-line strings ending with newlines (such sequences can also be obtained from the readlines() method of file-like objects):
>>> text1 = '''  1. Beautiful is better than ugly.
...   2. Explicit is better than implicit.
...   3. Simple is better than complex.
...   4. Complex is better than complicated.
... '''.splitlines(keepends=True)
>>> len(text1)
4
>>> text1[0][-1]
'\n'
>>> text2 = '''  1. Beautiful is better than ugly.
...   3.   Simple is better than complex.
...   4. Complicated is better than complex.
...   5. Flat is better than nested.
... '''.splitlines(keepends=True)
Next we instantiate a Differ object:
>>> d = Differ()
Note that when instantiating a Differ object we may pass functions to filter out line and character “junk.” See the Differ() constructor for details.
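As a small aside to the running example, difflib also ships two ready-made junk predicates, IS_LINE_JUNK (blank lines and lines containing only a '#') and IS_CHARACTER_JUNK (blanks and tabs), either of which can be passed to the constructor:

```python
from difflib import Differ, IS_CHARACTER_JUNK

# Treat blanks and tabs as junk when synchronizing within lines.
d = Differ(charjunk=IS_CHARACTER_JUNK)
result = list(d.compare(["hello  world\n"], ["hello world\n"]))
for line in result:
    print(line, end="")
```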
Finally, we compare the two:
>>> result = list(d.compare(text1, text2))
result is a list of strings, so let’s pretty-print it:
>>> from pprint import pprint
>>> pprint(result)
['    1. Beautiful is better than ugly.\n',
 '-   2. Explicit is better than implicit.\n',
 '-   3. Simple is better than complex.\n',
 '+   3.   Simple is better than complex.\n',
 '?     ++\n',
 '-   4. Complex is better than complicated.\n',
 '?            ^                     ---- ^\n',
 '+   4. Complicated is better than complex.\n',
 '?           ++++ ^                      ^\n',
 '+   5. Flat is better than nested.\n']
As a single multi-line string it looks like this:
>>> import sys
>>> sys.stdout.writelines(result)
    1. Beautiful is better than ugly.
-   2. Explicit is better than implicit.
-   3. Simple is better than complex.
+   3.   Simple is better than complex.
?     ++
-   4. Complex is better than complicated.
?            ^                     ---- ^
+   4. Complicated is better than complex.
?           ++++ ^                      ^
+   5. Flat is better than nested.
A command-line interface to difflib
This example shows how to use difflib to create a diff-like utility. It is also contained in the Python source distribution, as Tools/scripts/diff.py.
""" Command line interface to difflib.py providing diffs in four formats:
- ndiff: lists every line and highlights interline changes.
- context: highlights clusters of changes in a before/after format.
- unified: highlights clusters of changes in an inline format.
- html: generates side by side comparison with change highlights.
"""
import sys, os, time, difflib, optparse
def main():
    # Configure the option parser
    usage = "usage: %prog [options] fromfile tofile"
    parser = optparse.OptionParser(usage)
    parser.add_option("-c", action="store_true", default=False,
                      help='Produce a context format diff (default)')
    parser.add_option("-u", action="store_true", default=False,
                      help='Produce a unified format diff')
    hlp = 'Produce HTML side by side diff (can use -c and -l in conjunction)'
    parser.add_option("-m", action="store_true", default=False, help=hlp)
    parser.add_option("-n", action="store_true", default=False,
                      help='Produce a ndiff format diff')
    parser.add_option("-l", "--lines", type="int", default=3,
                      help='Set number of context lines (default 3)')
    (options, args) = parser.parse_args()

    if len(args) == 0:
        parser.print_help()
        sys.exit(1)
    if len(args) != 2:
        parser.error("need to specify both a fromfile and tofile")

    n = options.lines
    fromfile, tofile = args  # as specified in the usage string

    # we're passing these as arguments to the diff function
    fromdate = time.ctime(os.stat(fromfile).st_mtime)
    todate = time.ctime(os.stat(tofile).st_mtime)
    with open(fromfile) as fromf, open(tofile) as tof:
        fromlines, tolines = list(fromf), list(tof)

    if options.u:
        diff = difflib.unified_diff(fromlines, tolines, fromfile, tofile,
                                    fromdate, todate, n=n)
    elif options.n:
        diff = difflib.ndiff(fromlines, tolines)
    elif options.m:
        diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile,
                                            tofile, context=options.c,
                                            numlines=n)
    else:
        diff = difflib.context_diff(fromlines, tolines, fromfile, tofile,
                                    fromdate, todate, n=n)

    # we're using writelines because diff is a generator
    sys.stdout.writelines(diff)

if __name__ == '__main__':
    main()
textwrap: text wrapping and filling
The textwrap module provides some convenience functions, and TextWrapper, the class that does all the work. If you’re only wrapping or filling one or two text strings, the convenience functions should be good enough; otherwise, use an instance of TextWrapper for efficiency.
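The efficiency point can be sketched as follows: the module-level helpers build a fresh TextWrapper on every call, while an instance configured once can be reused across many strings (the sample text here is illustrative):

```python
import textwrap

text = ("The textwrap module provides some convenience functions, "
        "as well as TextWrapper, the class that does all the work.")

# One-off use: the convenience function is fine.
print(textwrap.fill(text, width=40))

# Repeated use: configure a TextWrapper once and reuse it.
wrapper = textwrap.TextWrapper(width=40, subsequent_indent="    ")
lines = wrapper.wrap(text)       # list of wrapped lines
```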
The TextWrapper instance attributes (and keyword arguments to the constructor) are as follows:
"\n".join(wrap(text, ...))
>>> textwrap.shorten("Hello world!", width=12)
'Hello world!'
>>> textwrap.shorten("Hello world!", width=11)
'Hello [...]'
>>> textwrap.shorten("Hello world", width=10, placeholder="...")
'Hello...'
def test():
    # end first line with \ to avoid the empty line!
    s = '''\
    hello
      world
    '''
    print(repr(s))          # prints '    hello\n      world\n    '
    print(repr(dedent(s)))  # prints 'hello\n  world\n'
>>> print(indent(s, '+ ', lambda line: True))
+ hello
+ 
+  
+ world
wrapper = TextWrapper(initial_indent="* “)
wrapper = TextWrapper()
wrapper.initial_indent = "* "
TextWrapper also provides some public methods, analogous to the module-level convenience functions:
[...] Dr. Frankenstein's monster [...]
[...] See Spot. See Spot run [...]
unicodedata: Unicode database
This module provides access to the UCD (Unicode Character Database) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 6.3.0.
The module uses the same names and symbols as defined by Unicode Standard Annex #44, “Unicode Character Database”. It defines the following functions:
Also, the module exposes the following constant:
Examples:
>>> import unicodedata
>>> unicodedata.lookup('LEFT CURLY BRACKET')
'{'
>>> unicodedata.name('/')
'SOLIDUS'
>>> unicodedata.decimal('9')
9
>>> unicodedata.decimal('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: not a decimal
>>> unicodedata.category('A')  # 'L'etter, 'u'ppercase
'Lu'
>>> unicodedata.bidirectional('\u0660')  # 'A'rabic, 'N'umber
'AN'
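The module's normalize() function is also worth a short sketch: it converts between the Unicode normal forms, which matters when comparing strings that may encode accents differently:

```python
import unicodedata

# 'é' can be a single code point (NFC) or 'e' plus a combining accent (NFD).
composed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT
assert composed != decomposed                                  # raw comparison fails
assert unicodedata.normalize("NFC", decomposed) == composed    # compose
assert unicodedata.normalize("NFD", composed) == decomposed    # decompose
```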
stringprep: Internet string preparation
When identifying things (such as hostnames) in the Internet, it is often necessary to compare such identifications for “equality”. Exactly how this comparison is executed may depend on the application domain, e.g., whether it should be case-insensitive or not. It may be also necessary to restrict the possible identifications, to allow only identifications consisting of “printable” characters.
RFC 3454 defines a procedure for “preparing” Unicode strings in Internet protocols. Before passing strings onto the wire, they are processed with the preparation procedure, after which they have a certain normalized form. The RFC defines a set of tables, which can be combined into profiles. Each profile must define which tables it uses, and what other optional parts of the stringprep procedure are part of the profile. One example of a stringprep profile is nameprep, which is used for internationalized domain names.
The module stringprep only exposes the tables from RFC 3454. As these tables would be very large to represent as dictionaries or lists, the module uses the Unicode character database internally. The module source code itself was generated using the mkstringprep.py utility.
As a result, these tables are exposed as functions, not as data structures. There are two kinds of tables in the RFC: sets and mappings. For a set, stringprep provides the “characteristic function”, i.e., a function that returns true if the parameter is part of the set. For mappings, it provides the mapping function: given the key, it returns the associated value. Below is a list of all functions available in the module.
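A brief sketch of both kinds of table function:

```python
import stringprep

# Set tables become predicate functions: in_table_c12() reports whether
# a character is a non-ASCII space (table C.1.2 of RFC 3454).
assert stringprep.in_table_c12("\u00a0")    # NO-BREAK SPACE
assert not stringprep.in_table_c12("a")

# Mapping tables become mapping functions: map_table_b2() applies the
# case-folding map used with NFKC (table B.2).
assert stringprep.map_table_b2("A") == "a"
```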
readline: GNU readline interface
The readline module defines many functions to facilitate completion and reading/writing of history files from the Python interpreter. This module can be used directly or via the rlcompleter module. Settings made using this module affect the behaviour of both the interpreter’s interactive prompt and the prompts offered by the built-in input() function.
On Mac OS X the readline module can be implemented using the libedit library instead of GNU readline.
The configuration file for libedit is different from that of GNU readline. If you programmatically load configuration strings, you can check for the text "libedit" in readline.__doc__ to differentiate between GNU readline and libedit.
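A defensive sketch of that check (readline may be missing entirely, e.g. on Windows, so the import is guarded):

```python
try:
    import readline
except ImportError:
    readline = None   # e.g. on Windows

if readline is not None:
    if "libedit" in (readline.__doc__ or ""):
        readline.parse_and_bind("bind ^I rl_complete")  # libedit syntax
    else:
        readline.parse_and_bind("tab: complete")        # GNU readline syntax
```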
The readline module defines the following functions:
readline example
The following example demonstrates how to use the readline module’s history reading and writing functions to automatically load and save a history file named .python_history from the user’s home directory. The code below would normally be executed automatically during interactive sessions from the user’s PYTHONSTARTUP file.
import atexit
import os
import readline

histfile = os.path.join(os.path.expanduser("~"), ".python_history")
try:
    readline.read_history_file(histfile)
except FileNotFoundError:
    pass

atexit.register(readline.write_history_file, histfile)
This code is actually automatically run when Python is run in interactive mode.
The following example extends the code.InteractiveConsole class to support history save/restore.
import atexit
import code
import os
import readline
class HistoryConsole(code.InteractiveConsole):
    def __init__(self, locals=None, filename="<console>",
                 histfile=os.path.expanduser("~/.console-history")):
        code.InteractiveConsole.__init__(self, locals, filename)
        self.init_history(histfile)

    def init_history(self, histfile):
        readline.parse_and_bind("tab: complete")
        if hasattr(readline, "read_history_file"):
            try:
                readline.read_history_file(histfile)
            except FileNotFoundError:
                pass
            atexit.register(self.save_history, histfile)

    def save_history(self, histfile):
        readline.write_history_file(histfile)
rlcompleter: completion function for GNU readline
The rlcompleter module defines a completion function suitable for the readline module by completing valid Python identifiers and keywords.
When this module is imported on a Unix platform with the readline module available, an instance of the Completer class is automatically created and its complete() method is set as the readline completer.
Example:
>>> import rlcompleter
>>> import readline
>>> readline.parse_and_bind("tab: complete")
>>> readline. <TAB PRESSED>
readline.__doc__          readline.get_line_buffer(  readline.read_init_file(
readline.__file__         readline.insert_text(      readline.set_completer(
readline.__name__         readline.parse_and_bind(
>>> readline.
The rlcompleter module is designed for use with Python’s interactive mode. Unless Python is run with the -S option, the module is automatically imported and configured (see Readline configuration).
On platforms without readline, the Completer class defined by this module can still be used for custom purposes.
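One such custom use can be sketched directly: a Completer can be driven without readline by giving it a namespace and calling complete(text, state) with increasing state values until it returns None (the namespace below is illustrative):

```python
import rlcompleter

# Drive a Completer by hand, without any readline involvement.
namespace = {"spam": 1, "spam_and_eggs": 2, "ham": 3}
completer = rlcompleter.Completer(namespace)

matches = []
state = 0
while True:
    match = completer.complete("sp", state)
    if match is None:      # no more completions
        break
    matches.append(match)
    state += 1
print(matches)
```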
Completer objects
Completer objects have the following method:
The modules described in this chapter provide some basic services for manipulation of binary data. Other operations on binary data, specifically concerning file formats and network protocols, are described in the relevant sections.
Some libraries described under Text Processing Services also work with either ASCII-compatible binary formats (for example, re) or all binary data (for example, difflib).
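For instance, re accepts bytes patterns for bytes data (pattern and subject must not mix str and bytes). A small sketch with a hypothetical HTTP request line, parsed without decoding:

```python
import re

# Bytes pattern, bytes subject: the groups come back as bytes too.
data = b"GET /index.html HTTP/1.1\r\n"
m = re.match(rb"(\w+) (\S+) HTTP/(\d\.\d)", data)
method, path, version = m.groups()
print(method, path, version)
```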
Also, see the documentation for Python’s built-in binary data types in Binary Sequence Types — bytes, bytearray, memoryview.