Using regexes with "\s" and doing simple string.split()'s will also remove other whitespace, like newlines, carriage returns, and tabs. Unless this is desired, here are some examples that collapse only multiple spaces.
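To make the difference concrete, here is a small sketch comparing a "\s+" substitution, a bare split()/join, and a spaces-only pattern. The sample string is made up for illustration and is not one of the benchmark inputs:

```python
import re

# Hypothetical sample: runs of spaces plus a newline and a tab.
sample = 'The   fox jumped\n\tover   the log.'

# \s+ matches ANY run of whitespace, so the newline and tab are lost.
any_whitespace = re.sub(r'\s+', ' ', sample)

# A bare split() also splits on any whitespace.
split_join = ' '.join(sample.split())

# ' {2,}' touches only runs of two or more spaces.
spaces_only = re.sub(r' {2,}', ' ', sample)

print(repr(any_whitespace))
print(repr(split_join))
print(repr(spaces_only))
```

Only the last variant keeps the newline and tab in place.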
I used 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum to get realistic time tests and used random-length extra spaces throughout:
original_string = ''.join(word + (' ' * random.randint(1, 10)) for word in lorem_ipsum.split(' '))
The simple one-liner join (shown as a comment inside proper_join) essentially strips any leading/trailing spaces, whereas proper_join preserves a leading/trailing space (but only ONE ;-).
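As a quick illustration of that edge behavior, the sketch below uses a made-up padded string (an assumption, not part of the benchmark data) and compares the one-liner with a plain double-space replacement loop:

```python
# Hypothetical input with three spaces at each end.
padded = '   leading and   trailing   '

# The one-liner drops empty items, so the edge spaces vanish entirely.
one_liner = ' '.join(item for item in padded.split(' ') if item)

# Replacing double spaces until none remain keeps one space at each edge.
collapsed = padded
while '  ' in collapsed:
    collapsed = collapsed.replace('  ', ' ')

print(repr(one_liner))
print(repr(collapsed))
```

The two results differ only at the edges, which is the gap the beg/end handling in proper_join is meant to close.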
# setup = '''

import re

def while_replace(string):
    # Keep halving runs of spaces until no double space is left
    while '  ' in string:
        string = string.replace('  ', ' ')
    return string

def re_replace(string):
    # Collapse any run of two or more spaces into one
    return re.sub(r' {2,}', ' ', string)

def proper_join(string):
    split_string = string.split(' ')

    # To account for leading/trailing spaces that would simply be removed
    beg = ' ' if not split_string[ 0] else ''
    end = ' ' if not split_string[-1] else ''

    # versus simply ' '.join(item for item in string.split(' ') if item)
    return beg + ' '.join(item for item in split_string if item) + end

original_string = """Lorem ipsum ... no, really, it kept going... malesuada enim feugiat. Integer imperdiet erat."""

assert while_replace(original_string) == re_replace(original_string) == proper_join(original_string)

#'''
# while_replace_test
new_string = original_string[:]
new_string = while_replace(new_string)
assert new_string != original_string

# re_replace_test
new_string = original_string[:]
new_string = re_replace(new_string)
assert new_string != original_string

# proper_join_test
new_string = original_string[:]
new_string = proper_join(new_string)
assert new_string != original_string
NOTE: The "while version" made a copy of the original_string, as I believe once modified on the first run, successive runs would be faster (if only by a bit). As this adds time, I added this string copy to the other two so that the times showed the difference only in the logic.
Keep in mind that the setup on timeit instances is executed only once per timing run, while the main stmt is executed repeatedly. The original way I did this, the while loop worked on the same label, original_string, so on the second iteration there would be nothing left to do. The way it's set up now, calling a function and using two different labels, that isn't a problem. I've added assert statements to all the workers to verify we change something every iteration (for those who may be dubious). E.g., change to this and it breaks:
# while_replace_test
new_string = original_string[:]
new_string = while_replace(new_string)
assert new_string != original_string # will break the 2nd iteration

while '  ' in original_string:
    original_string = original_string.replace('  ', ' ')
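The same failure can be reproduced in isolation. The following is a minimal sketch, assuming a tiny stand-in string named text (hypothetical, not the benchmark data), of a stmt that rebinds the label created by setup, so the second pass finds nothing left to change:

```python
import timeit

# Hypothetical stand-in for the benchmark string; setup runs once.
setup = "text = 'a  b'"

# This stmt rebinds the same label that setup created.
stmt = """\
assert '  ' in text  # holds only on the first pass
text = text.replace('  ', ' ')
"""

try:
    # Two executions of stmt after a single execution of setup.
    timeit.Timer(stmt=stmt, setup=setup).timeit(number=2)
    outcome = 'no error'
except AssertionError:
    outcome = 'AssertionError on a later pass'

print(outcome)
```

Writing to a separate label inside stmt, as the workers above do with new_string, avoids carrying the modified string into the next pass.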
Tests run on a laptop with an i5 processor running Windows 7 (64-bit).
timeit.Timer(stmt = test, setup = setup).repeat(7, 1000)
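For readers who want to rerun this on their own interpreter, here is a rough sketch of turning the repeat() output into the four columns shown in the tables. How the figures were actually derived is an assumption on my part, and the setup and test strings below are simplified stand-ins rather than the real ones:

```python
import timeit

# Assumed stand-ins for the `setup` and `test` strings used above.
setup = "text = 'a' + ' ' * 5 + 'b'"
test = "result = ' '.join(item for item in text.split(' ') if item)"

# 7 repeats of 1000 executions each, mirroring repeat(7, 1000).
times = sorted(timeit.Timer(stmt=test, setup=setup).repeat(7, 1000))

minimum = times[0]
maximum = times[-1]
average = sum(times) / len(times)
median = times[len(times) // 2]  # middle value, since 7 is odd

# 'join_test' is a hypothetical row label for this stand-in.
print('%-20s | %f | %f | %f | %f' % ('join_test', minimum, maximum, average, median))
```

Each value is the total time in seconds for 1000 executions, not a per-call time.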
test_string = 'The fox jumped over\n\t the log.' # trivial
Python 2.7.3, 32-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001066   | 0.001260   | 0.001128   | 0.001092
re_replace_test      | 0.003074   | 0.003941   | 0.003357   | 0.003349
proper_join_test     | 0.002783   | 0.004829   | 0.003554   | 0.003035
Python 2.7.3, 64-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001025   | 0.001079   | 0.001052   | 0.001051
re_replace_test      | 0.003213   | 0.004512   | 0.003656   | 0.003504
proper_join_test     | 0.002760   | 0.006361   | 0.004626   | 0.004600
Python 3.2.3, 32-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001350   | 0.002302   | 0.001639   | 0.001357
re_replace_test      | 0.006797   | 0.008107   | 0.007319   | 0.007440
proper_join_test     | 0.002863   | 0.003356   | 0.003026   | 0.002975
Python 3.3.3, 64-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001444   | 0.001490   | 0.001460   | 0.001459
re_replace_test      | 0.011771   | 0.012598   | 0.012082   | 0.011910
proper_join_test     | 0.003741   | 0.005933   | 0.004341   | 0.004009
test_string = lorem_ipsum
# Thanks to //www.lipsum.com/
# "Generated 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum"
Python 2.7.3, 32-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.342602   | 0.387803   | 0.359319   | 0.356284
re_replace_test      | 0.337571   | 0.359821   | 0.348876   | 0.348006
proper_join_test     | 0.381654   | 0.395349   | 0.388304   | 0.388193
Python 2.7.3, 64-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.227471   | 0.268340   | 0.240884   | 0.236776
re_replace_test      | 0.301516   | 0.325730   | 0.308626   | 0.307852
proper_join_test     | 0.358766   | 0.383736   | 0.370958   | 0.371866
Python 3.2.3, 32-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.438480   | 0.463380   | 0.447953   | 0.446646
re_replace_test      | 0.463729   | 0.490947   | 0.472496   | 0.468778
proper_join_test     | 0.397022   | 0.427817   | 0.406612   | 0.402053
Python 3.3.3, 64-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.284495   | 0.294025   | 0.288735   | 0.289153
re_replace_test      | 0.501351   | 0.525673   | 0.511347   | 0.508467
proper_join_test     | 0.422011   | 0.448736   | 0.436196   | 0.440318
For the trivial string, it would seem that a while-loop is the fastest, followed by the Pythonic string-split/join, and regex pulling up the rear.
For non-trivial strings, it seems there's a bit more to consider. 32-bit 2.7? It's regex to the rescue! 2.7 64-bit? A while loop is best, by a decent margin. 32-bit 3.2, go with the "proper" join. 64-bit 3.3, go for a while loop. Again.
In the end, one can improve performance if/where/when needed, but it's always best to remember the mantra:
- Make It Work
- Make It Right
- Make It Fast
IANAL, YMMV, Caveat Emptor!