Inspired by MrSmith42's nice answer, I came up with a version that doesn't require sorting of the full arguments, just sorting of their Counter()
s. And that sorting is triggered only if the set()
s of the arguments aren't equal. That check should allow rejecting most non-anagrams very quickly, without sorting anything.
The .most_common()
method of a Counter
does the sorting. Counters
are dictionaries that return the number of recurrences of each unique item in an iterable. The default order in which these unique items are listed is unpredictable, so sorting via .most_common()
is required.
I haven't implemented case-insensitivity but that would be easy with .lower()
.
from collections import Counter
def is_anagram(test, original):
if set(test) == set(original):
return Counter(test).most_common() == Counter(original).most_common()
return False
Update: I compared the two solutions so far (mine and MrSmith42's) for speed. The use of Counter
is slower than I thought, slower than sorting the whole list. Also, I realized that in addition to set
equality, anagrams must have len
equality, which is another speed pre-check to use to discard non-anagrams before sorting. So combining the best elements of MrSmith42's and my answers so far, we get
def is_anagram(test, original):
if len(test) == len(original):
if set(test) == set(original):
return sorted(list(test)) == sorted(list(original))
return False
Timing:
from string import letters
from random import choice, sample
from copy import copy
import timeit
# Make a very long random word
long_word = ''.join(choice(letters) for _ in xrange(10**6))
# Shuffle it to make an anagram
long_anagram = ''.join(sample(long_word, len(long_word)))
# A near anagram of the same length but with different character sets
wrong_chars = copy(long_anagram[:-1]) + '!'
# A near anagram but with a different length
wrong_len = copy(long_anagram) + 'a'
# the function works
print is_anagram(long_word, long_anagram)
print is_anagram(long_word, wrong_chars)
print is_anagram(long_word, wrong_len)
# when string are anagrams (or near anagrams, unchecked here) sorting is required and takes the longest
%timeit is_anagram(long_word, long_anagram)
# faster rejection of obviously wrong answers
%timeit is_anagram(long_word, wrong_chars)
%timeit is_anagram(long_word, wrong_len)
True
False
False
1 loops, best of 3: 486 ms per loop
10 loops, best of 3: 40 ms per loop
The slowest run took 8.11 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 264 ns per loop
Hallo
andHolllllah
. Thises aren't annagramms of each other but your algorithm would say they are \$\endgroup\$