11

I want to use difflib.SequenceMatcher to extract longest common substrings from two strings. I'm not sure whether I found a bug or misunderstood the documentation of find_longest_match. This is the point that I find confusing:

In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b.

(https://docs.python.org/3.5/library/difflib.html#difflib.SequenceMatcher.find_longest_match)

Comparing the strings X this is a test and this is a test X, the substring X is in fact a maximal block: it can't be extended (i.e., it is inclusion-maximal). Also, it is the first such maximal block in text A. But it is certainly not a longest common substring. I strongly suspect this is not what find_longest_match is supposed to find.

In fact, in this example, find_longest_match does find a longest common substring:

>>> l = len("X this is a test")
>>> matcher = difflib.SequenceMatcher(None, "X this is a test", "this is a test X")
>>> matcher.find_longest_match(0, l, 0, l)
Match(a=2, b=0, size=14)

However, it seems like with some other strings, I can provoke the "find the first maximal block"-behavious described above (sorry for the long strings, if I shorten them, the example somehow breaks):

>>> s1 = "e-like graph visualization using a spanning tree-driven layout technique with constraints specified by layers and the ordering of groups of nodes within layers. We propose a new method of how the orde"
>>> s2 = "itree graph visualization using a spanning tree-driven layout technique with constraints speci ed by layers and the ordering of groups of nodes within layers. We propose a new method of how the drivin"
>>> matcher = difflib.SequenceMatcher(None, s1, s2)
>>> matcher.find_longest_match(1, 149, 5, 149)
Match(a=1, b=47, size=1)

In this case, it matches the first - in s1[1] to the - in s2[47], and that's it. A longest common substring would probably be something starting with graph visualization using…

Did I find a bug, or is there another possible interpretation of the documentation that describes this behavior?

I'm using Python 3.5.2 on Ubuntu.

1 Answer 1

22

Okay, I figured it out. In case anybody has the same problem: The SequenceMatcher has an autojunk parameter that does weird things:

The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching.

As far as I can tell, the matcher will never find any matches containing any "junk". Not sure why that's useful, but it is enabled by default. That also explains why the example from above breaks when I make the strings shorter. It does however speed up the LCS search considerably.

So, in conclusion: You probably want to pass autojunk=False in the constructor.

2
  • For me the 'find_longest_match()' doesn't give the longest match at all even with 'autojunk=False'. So I had to get all the matchings and pick the largest one with a loop. Commented Aug 30, 2019 at 15:02
  • 4
    Yet, the docs say "Passing None for isjunk is equivalent to passing lambda x: False; in other words, no elements are ignored. ". They should really mention that elements will be ignored by the autojunk algorithm, despite passing None for the first argument.
    – kristianp
    Commented Aug 2, 2023 at 1:54

Not the answer you're looking for? Browse other questions tagged or ask your own question.