Split a string on all special characters in Python
Use the re.split() method to split a string on all special characters. The re.split() method takes a pattern and a string and splits the string on each occurrence of the pattern.
import re
my_str = "hellothree.four!five'six"
my_list = re.split(r'[`!@#$%^&*()_+\-=\[\]{};\':"\\|,.\/?~]', my_str)
# 👇️ ['hellothree', 'four', 'five', 'six']
print(my_list)
We used the re.split method to split a string on all occurrences of a special character.
The square brackets are used to indicate a set of characters.
Make sure that all characters you consider special characters are in the set.
You can add or remove characters according to your use case.
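If the set of special characters is long, one way to avoid hand-escaping mistakes is to build the character class programmatically. The sketch below assumes that "special" means the ASCII punctuation in string.punctuation; that choice and the helper name split_on_punctuation are illustrative assumptions, not part of the example above.

```python
import re
import string

# Assumption: "special characters" means ASCII punctuation (string.punctuation).
# re.escape() escapes the characters that have a meaning inside a pattern,
# so the set can be edited without counting backslashes by hand.
SPECIAL_CHARS = string.punctuation
PATTERN = re.compile('[' + re.escape(SPECIAL_CHARS) + ']')


def split_on_punctuation(text):
    """Split text on every single punctuation character (hypothetical helper)."""
    return PATTERN.split(text)


print(split_on_punctuation("four!five'six.seven"))
# ['four', 'five', 'six', 'seven']
```

To add or remove characters, edit SPECIAL_CHARS rather than the pattern itself.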
Alternatively, you can use a regular expression that matches any character that is not a letter, a digit or a space.
import re
my_str = "hellothree.four!five'six"
my_list = re.split(r'[^a-zA-Z0-9\s]', my_str)
# 👇️ ['hellothree', 'four', 'five', 'six']
print(my_list)
The caret ^ at the beginning of the set means "NOT". In other words, match all characters that are NOT lowercase letters a-z, uppercase letters A-Z, digits 0-9 or whitespace \s characters.
You can add any characters that you don't want to match between the square brackets of the regular expression.
You can tweak the regular expression according to your use case. This section of the docs has information regarding what each special character does.
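One tweak that often comes up: when two special characters sit next to each other, a single-character set produces an empty string between them. The sketch below (the sample string is made up for illustration) shows the effect and one possible way to handle it by adding the + quantifier.

```python
import re

text = "one, two!! three"

# One match per special character: adjacent matches leave an empty string.
single = re.split(r'[^a-zA-Z0-9\s]', text)
print(single)
# ['one', ' two', '', ' three']

# "+" treats a run of special characters as one delimiter.
runs = re.split(r'[^a-zA-Z0-9\s]+', text)
print(runs)
# ['one', ' two', ' three']

# Optional cleanup: trim whitespace and drop pieces that end up empty.
cleaned = [part.strip() for part in runs if part.strip()]
print(cleaned)
# ['one', 'two', 'three']
```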
This article describes how to split strings by delimiters, line breaks, regular expressions, and the number of characters in Python.
- Split by delimiter: split()
  - Specify the delimiter: sep
  - Specify the maximum number of splits: maxsplit
- Split from right by delimiter: rsplit()
- Split by line break: splitlines()
- Split by regex: re.split()
  - Split by multiple different delimiters
- Concatenate a list of strings
- Split based on the number of characters: slice
See the following articles for more information on how to concatenate and extract strings.
- Concatenate strings in Python (+ operator, join, etc.)
- Extract a substring from a string in Python (position, regex)
Split by delimiter: split()
Use the split() method to split by delimiter.
- str.split() - Python 3.7.3 documentation
If the argument is omitted, the string is split by whitespace, such as spaces, newlines \n, and tabs \t. Consecutive whitespace is treated as a single separator.
A list of the words is returned.
s_blank = 'one two three\nfour\tfive'
print(s_blank)
# one two three
# four five
print(s_blank.split())
# ['one', 'two', 'three', 'four', 'five']
print(type(s_blank.split()))
# <class 'list'>
Use join(), described below, to concatenate a list into a string.
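The whitespace handling of split() with no argument is different from passing a single space explicitly. A small sketch of the difference, using made-up sample strings:

```python
# Two spaces after "one", and a newline between "two" and "three".
s = 'one  two \n three'

# No argument: any run of whitespace counts as one separator.
no_arg = s.split()
print(no_arg)
# ['one', 'two', 'three']

# Explicit ' ': every single space is a separator, so empty strings
# appear and the newline survives as its own element.
single_space = s.split(' ')
print(single_space)
# ['one', '', 'two', '\n', 'three']

# Leading and trailing whitespace behaves the same way.
print('  a '.split())
# ['a']
print('  a '.split(' '))
# ['', '', 'a', '']
```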
Specify the delimiter: sep
Specify the delimiter with the first parameter, sep.
s_comma = 'one,two,three,four,five'
print(s_comma.split(','))
# ['one', 'two', 'three', 'four', 'five']
print(s_comma.split('three'))
# ['one,two,', ',four,five']
If you want to specify multiple delimiters, use regular expressions as described later.
Specify the maximum number of splits: maxsplit
Specify the maximum number of splits with the second parameter, maxsplit.
If maxsplit is given, at most maxsplit splits are done.
print(s_comma.split(',', 2))
# ['one', 'two', 'three,four,five']
For example, this is useful for deleting the first line from a string.
With sep='\n' and maxsplit=1, you get a list of strings split at the first newline character \n. The second element [1] of this list is the string excluding the first line. As it is also the last element, it can be specified as [-1].
s_lines = 'one\ntwo\nthree\nfour'
print(s_lines)
# one
# two
# three
# four
print(s_lines.split('\n', 1))
# ['one', 'two\nthree\nfour']
print(s_lines.split('\n', 1)[0])
# one
print(s_lines.split('\n', 1)[1])
# two
# three
# four
print(s_lines.split('\n', 1)[-1])
# two
# three
# four
Similarly, to delete the first two lines:
print(s_lines.split('\n', 2)[-1])
# three
# four
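One caveat with the [-1] approach: if the string has fewer lines than the number you want to drop, [-1] still returns the last line instead of an empty string. A minimal sketch of a wrapper that guards against this follows; the function name drop_first_lines is hypothetical, and it assumes lines are separated by LF only.

```python
def drop_first_lines(text, n):
    """Return text without its first n lines (hypothetical helper).

    Assumes lines are separated by the LF character only.
    """
    parts = text.split('\n', n)
    # If fewer than n splits happened, nothing is left after n lines.
    if len(parts) <= n:
        return ''
    return parts[-1]


s_lines = 'one\ntwo\nthree\nfour'

print(drop_first_lines(s_lines, 2))
# three
# four

print(repr(drop_first_lines(s_lines, 10)))
# ''

# Without the guard, the last line would be returned instead.
print(s_lines.split('\n', 10)[-1])
# four
```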
Split from right by delimiter: rsplit()
rsplit() splits from the right of the string.
- str.rsplit() - Python 3.7.3 documentation
The result differs from split() only when the second parameter maxsplit is given.
In the same way as with split(), if you want to delete the last line, use rsplit().
print(s_lines.rsplit('\n', 1))
# ['one\ntwo\nthree', 'four']
print(s_lines.rsplit('\n', 1)[0])
# one
# two
# three
print(s_lines.rsplit('\n', 1)[1])
# four
To delete the last two lines:
print(s_lines.rsplit('\n', 2)[0])
# one
# two
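The same contrast between split() and rsplit() applies to any delimiter, not only newlines. As an illustration, the sketch below uses a made-up file name and splits it at a dot.

```python
filename = 'archive.backup.tar'  # hypothetical file name

# split() with maxsplit=1 cuts at the first dot...
from_left = filename.split('.', 1)
print(from_left)
# ['archive', 'backup.tar']

# ...while rsplit() with maxsplit=1 cuts at the last dot.
from_right = filename.rsplit('.', 1)
print(from_right)
# ['archive.backup', 'tar']

# Without maxsplit both methods return the same list.
print(filename.split('.') == filename.rsplit('.'))
# True
```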
Split by line break: splitlines()
There is also splitlines() for splitting at line boundaries.
- str.splitlines() - Python 3.7.3 documentation
As in the previous examples, split() and rsplit() split on whitespace, including line breaks, by default, and you can also specify a line break with the parameter sep.
However, it is often better to use splitlines().
For example, consider a string that contains both \n (LF, used in Unix-like OSes including macOS) and \r\n (CR + LF, used in Windows).
s_lines_multi = '1 one\n2 two\r\n3 three\n'
print(s_lines_multi)
# 1 one
# 2 two
# 3 three
When split() is applied with no arguments, the string is split not only at line breaks but also at spaces.
print(s_lines_multi.split())
# ['1', 'one', '2', 'two', '3', 'three']
Since only one newline sequence can be specified in sep, the string cannot be split cleanly if newline sequences are mixed. The string is also split at the trailing newline, which leaves an empty string as the last element.
print(s_lines_multi.split('\n'))
# ['1 one', '2 two\r', '3 three', '']
splitlines() splits at various newline characters but not at other whitespace.
print(s_lines_multi.splitlines())
# ['1 one', '2 two', '3 three']
If the first argument, keepends, is set to True, the result includes the newline character at the end of each line.
print(s_lines_multi.splitlines(True))
# ['1 one\n', '2 two\r\n', '3 three\n']
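A few edge cases where splitlines() and split('\n') differ may be worth keeping in mind. The short sketch below uses made-up strings to compare the two on empty input, a trailing newline, a lone carriage return and a blank line.

```python
# Empty string: split('\n') returns one empty element, splitlines() returns none.
print(''.split('\n'))
# ['']
print(''.splitlines())
# []

# Trailing newline: only split('\n') produces a final empty string.
print('one\n'.split('\n'))
# ['one', '']
print('one\n'.splitlines())
# ['one']

# A lone carriage return is also treated as a line boundary by splitlines().
print('one\rtwo'.splitlines())
# ['one', 'two']

# Blank lines in the middle are kept as empty strings.
print('one\n\ntwo'.splitlines())
# ['one', '', 'two']
```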
See the following article for other operations with line breaks.
- Handle line breaks (newlines) in Python
Split by regex: re.split()
split() and rsplit() split only where sep matches exactly.
If you want to split a string at matches of a regular expression (regex) instead of an exact match, use split() of the re module.
- re.split() - Regular expression operations - Python 3.7.3 documentation
In re.split(), specify the regex pattern as the first parameter and the target string as the second parameter.
An example of splitting on runs of consecutive digits is as follows.
import re
s_nums = 'one1two22three333four'
print(re.split(r'\d+', s_nums))
# ['one', 'two', 'three', 'four']
The maximum number of splits can be specified with the third parameter, maxsplit.
print(re.split(r'\d+', s_nums, maxsplit=2))
# ['one', 'two', 'three333four']
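Another documented behavior of re.split() that can be handy: if the pattern is wrapped in a capturing group, the matched delimiters are returned in the list as well. A brief sketch using the same kind of string:

```python
import re

s_nums = 'one1two22three333four'

# Without a group the digits are discarded.
without_group = re.split(r'\d+', s_nums)
print(without_group)
# ['one', 'two', 'three', 'four']

# With a capturing group the matched digits are kept as separate elements.
with_group = re.split(r'(\d+)', s_nums)
print(with_group)
# ['one', '1', 'two', '22', 'three', '333', 'four']
```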
Split by multiple different delimiters
The following two techniques are useful to remember even if you are not familiar with regex.
Enclose characters in [] to match any single one of them. This lets you split a string on multiple different characters.
s_marks = 'one-two+three#four'
print(re.split('[-+#]', s_marks))
# ['one', 'two', 'three', 'four']
If patterns are separated by |, any one of them matches. Each pattern can use regex special characters, but plain strings work as well. This lets you split on multiple different strings.
s_strs = 'oneXXXtwoYYYthreeZZZfour'
print(re.split('XXX|YYY|ZZZ', s_strs))
# ['one', 'two', 'three', 'four']
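When the delimiters come from a list and may contain regex special characters, the | pattern can be assembled with re.escape(). This is a sketch under assumptions: the delimiter list and sample string are made up, and sorting longer delimiters first is one possible way to keep a short delimiter from shadowing a longer one that starts with it.

```python
import re

delimiters = ['->', '-', '+', '#']  # hypothetical delimiter list

# Try longer delimiters first so that '->' wins over '-'.
ordered = sorted(delimiters, key=len, reverse=True)
pattern = '|'.join(re.escape(d) for d in ordered)

s = 'one->two-three+four#five'
parts = re.split(pattern, s)
print(parts)
# ['one', 'two', 'three', 'four', 'five']

# If '-' is tried before '->', the '>' is left behind in the result.
print(re.split('-|->', 'one->two'))
# ['one', '>two']
```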
Concatenate a list of strings
In the previous examples, splitting a string produced a list.
If you want to concatenate a list of strings into one string, use the string method join().
Call join() on the separator string, and pass the list of strings to be concatenated as the argument.
l = ['one', 'two', 'three']
print(','.join(l))
# one,two,three
print('\n'.join(l))
# one
# two
# three
print(''.join(l))
# onetwothree
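Note that join() accepts only strings. If the list holds other types, each element has to be converted first. A minimal sketch with made-up values:

```python
values = ['one', 2, 3.0]

# Passing non-string elements directly raises TypeError.
try:
    ','.join(values)
except TypeError:
    print('join() needs strings')
# join() needs strings

# Convert each element with str() before joining.
joined = ','.join(str(v) for v in values)
print(joined)
# one,2,3.0

# split() and join() with the same separator undo each other.
s_comma = 'one,two,three'
print(','.join(s_comma.split(',')) == s_comma)
# True
```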
See the following article for details of string concatenation.
- Concatenate strings in Python (+ operator, join, etc.)
Split based on the number of characters: slice
Use slice to split strings based on the number of characters.
- How to slice a list, string, tuple in Python
s = 'abcdefghij'
print(s[:5])
# abcde
print(s[5:])
# fghij
The two parts can be collected in a tuple or assigned to separate variables.
- Multiple assignment in Python: Assign multiple values or the same value to multiple variables
s_tuple = s[:5], s[5:]
print(s_tuple)
# ('abcde', 'fghij')
print(type(s_tuple))
# <class 'tuple'>
s_first, s_last = s[:5], s[5:]
print(s_first)
# abcde
print(s_last)
# fghij
Split into three:
s_first, s_second, s_last = s[:3], s[3:6], s[6:]
print(s_first)
# abc
print(s_second)
# def
print(s_last)
# ghij
The number of characters can be obtained with the built-in function len(). It can also be used to split the string into halves.
half = len(s) // 2
print(half)
# 5
s_first, s_last = s[:half], s[half:]
print(s_first)
# abcde
print(s_last)
# fghij
If you want to concatenate strings, use the + operator.
print(s_first + s_last)
# abcdefghij
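The slicing approach generalizes to splitting a string into chunks of a fixed number of characters. The following is a minimal sketch; the function name split_every is hypothetical, and the last chunk is simply shorter when the length is not a multiple of n.

```python
def split_every(text, n):
    """Split text into chunks of n characters (hypothetical helper)."""
    if n <= 0:
        raise ValueError('n must be a positive integer')
    # Step through the string n characters at a time and slice each chunk.
    return [text[i:i + n] for i in range(0, len(text), n)]


s = 'abcdefghij'

print(split_every(s, 3))
# ['abc', 'def', 'ghi', 'j']

print(split_every(s, 5))
# ['abcde', 'fghij']
```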