The re.sub function in Python (the re module)

This module provides regular expression matching operations similar to those found in Perl. Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings.
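The page's title refers to re.sub; as a quick orientation, here is a minimal sketch of a substitution (the pattern and input string are illustrative, not from the original text):

```python
import re

# re.sub(pattern, repl, string) returns a copy of string with every
# non-overlapping match of pattern replaced by repl.
print(re.sub(r"\s+", "-", "hello   regex  world"))  # hello-regex-world
```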

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
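The double-escaping problem can be seen directly (a short sketch; the target string is made up):

```python
import re

# The regex that matches one backslash is \\ ; in a regular (non-raw)
# Python string literal each of those backslashes must itself be escaped,
# giving the four-character literal "\\\\".
m = re.search("\\\\", r"C:\temp")
print(repr(m.group()))  # a single backslash character
```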

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
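A quick check of the two literals discussed above:

```python
import re

print(len(r"\n"))  # 2: a backslash and an 'n'
print(len("\n"))   # 1: a newline character

# With a raw string, the pattern reads the same as the regex itself:
assert re.search(r"\\", r"a\b")    # finds the backslash in a\b
```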

It is important to note that most regular expression operations are available as module-level functions and methods. The functions are shortcuts that don’t require you to compile a regex object first, but miss some fine-tuning parameters.
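A sketch of the two equivalent styles (pattern and input are illustrative):

```python
import re

text = "a1 b22 c333"

# Module-level shortcut: no explicit compile step needed.
print(re.findall(r"\d+", text))   # ['1', '22', '333']

# Compiled pattern object: reusable, and its methods accept extra
# fine-tuning arguments such as pos and endpos.
digits = re.compile(r"\d+")
print(digits.findall(text))       # ['1', '22', '333']
```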

See also

The third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and more thorough Unicode support.

7.2.1. Regular Expression Syntax

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. This holds unless A or B contain low precedence operations; boundary conditions between A and B; or have numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here. For details of the theory and implementation of regular expressions, consult the Friedl book referenced above, or almost any textbook about compiler construction.
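The concatenation property can be checked directly (A, B and the test strings are arbitrary choices of mine):

```python
import re

A, B = r"ab*", r"c+d"   # two regular expressions
p, q = "abb", "ccd"     # p matches A, q matches B
assert re.fullmatch(A, p)
assert re.fullmatch(B, q)
# The concatenated string pq matches the concatenated pattern AB:
assert re.fullmatch(A + B, p + q)
```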

A brief explanation of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'. (In the rest of this section, we’ll write REs usually without quotes, and strings to be matched 'in single quotes'.)
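For instance (the strings are illustrative):

```python
import re

# Ordinary characters match themselves; concatenation just builds
# a longer literal match.
assert re.match("last", "last")
assert re.match("last", "lastly")         # match() anchors at the start only
assert re.match("last", "outlast") is None
```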

Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. Regular expression pattern strings may not contain null bytes, but can specify the null byte using the \number notation, e.g., '\x00'.
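For example, a pattern written with the \x00 escape can find an embedded null byte without the pattern text itself containing one (the data string is made up):

```python
import re

data = "a\x00b"               # a string containing a null byte
m = re.search(r"\x00", data)  # the pattern literal has no actual NUL in it
print(m.start())              # 1
```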

Repetition qualifiers (*, +, ?, {m,n}, etc.) cannot be directly nested. This avoids ambiguity with the non-greedy modifier suffix ?, and with other modifiers in other implementations. To apply a second repetition to an inner repetition, parentheses may be used. For example, the expression (?:a{6})* matches any multiple of six 'a' characters.
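Checking the multiple-of-six example (the trailing $ anchor is my addition, so that whole strings are tested):

```python
import re

pat = re.compile(r"(?:a{6})*$")
assert pat.match("")            # zero repetitions is a valid multiple
assert pat.match("a" * 12)      # two repetitions of six
assert not pat.match("a" * 7)   # 7 is not a multiple of six
```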

The special characters are:

'.'

(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
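A short check of the two modes (example strings are mine):

```python
import re

assert re.match(r"a.c", "abc")
assert re.match(r"a.c", "a\nc") is None      # '.' stops at a newline
assert re.match(r"a.c", "a\nc", re.DOTALL)   # unless DOTALL is set
```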

'^'

(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
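For example (illustrative input):

```python
import re

text = "foo\nbar"
assert re.search(r"^bar", text) is None          # only the string start
assert re.search(r"^bar", text, re.MULTILINE)    # also after each newline
```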

'$'

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches 'foo2' normally, but 'foo1' in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
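The behaviours described above, reproduced with the same strings:

```python
import re

assert re.search(r"foo.$", "foo1\nfoo2\n").group() == "foo2"
assert re.search(r"foo.$", "foo1\nfoo2\n", re.MULTILINE).group() == "foo1"
# A lone $ in 'foo\n' yields two empty matches:
assert len(re.findall(r"$", "foo\n")) == 2
```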

'*'

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

'+'

Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

'?'

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.
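The three qualifiers side by side (illustrative strings):

```python
import re

assert re.match(r"ab*", "a").group() == "a"      # '*': zero 'b's is fine
assert re.match(r"ab*", "abbb").group() == "abbb"
assert re.match(r"ab+", "a") is None             # '+': needs at least one 'b'
assert re.match(r"ab?", "abb").group() == "ab"   # '?': at most one 'b'
```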

*?, +?, ??

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.
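The greedy versus non-greedy difference, using that example string:

```python
import re

s = "<a> b <c>"
assert re.match(r"<.*>", s).group() == "<a> b <c>"   # greedy: as much as possible
assert re.match(r"<.*?>", s).group() == "<a>"        # non-greedy: as little as possible
```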
