115

I'd like to compare 2 strings and keep the matched, splitting off where the comparison fails.

So if I have 2 strings:

string1 = "apples"
string2 = "appleses"

answer = "apples"

Another example, as the string could have more than one word:

string1 = "apple pie available"
string2 = "apple pies"

answer = "apple pie"

I'm sure there is a simple Python way of doing this but I can't work it out, any help and explanation appreciated.

6

20 Answers 20

202

For completeness, difflib in the standard-library provides loads of sequence-comparison utilities. For instance find_longest_match which finds the longest common substring when used on strings. Example use:

from difflib import SequenceMatcher

string1 = "apple pie available"
string2 = "come have some apple pies"

match = SequenceMatcher(None, string1, string2).find_longest_match()

print(match)  # -> Match(a=0, b=15, size=9)
print(string1[match.a:match.a + match.size])  # -> apple pie
print(string2[match.b:match.b + match.size])  # -> apple pie

If you're using a version older than 3.9, you'need to call find_longest_match() with the following arguments:

SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
5
  • 12
    Heads up to those using this on longer strings, you might want to set the kwarg "autojunk" to False when creating the instance of SequenceMatcher.
    – MLP
    Commented Aug 20, 2018 at 3:57
  • 2
    I'll note that there are outstanding bugs in difflib that should prevent its use in real-world scenarios. For example, it seems that the well known 'heuristic' interferes with the completeness of methods such as 'get_matching_blocks'. Commented Oct 27, 2018 at 16:18
  • 12
    Warning: This answer does not find the longest common substring! Despite its name (and the method's documentation), find_longest_match() does not do what its name implies. The class documentation for SequenceMatcher does hint at this, however, saying: This does not yield minimal edit sequences. For example, in some cases, find_longest_match() will claim there are no matches in two strings of length 1000, even though there are matching substrings of length > 500. Commented Mar 19, 2019 at 14:00
  • 11
    man, what turkey wrote that API. Forcing you to put the lengths of the strings in everytime instead of just assume its the ful strings, and the first argument to SequenceMatcher is nearly always going to be None :@
    – CpILL
    Commented Oct 6, 2021 at 4:15
  • 1
    @CpILL default arguments were added on Python 3.9.
    – Francisco
    Commented Mar 29, 2022 at 9:04
54

One might also consider os.path.commonprefix that works on characters and thus can be used for any strings.

import os
common = os.path.commonprefix(['apple pie available', 'apple pies'])
assert common == 'apple pie'

As the function name indicates, this only considers the common prefix of two strings.

3
  • 4
    It doesn't work, when compare string like ['an apple pie available', 'apple pies'].
    – GoTop
    Commented Feb 10, 2019 at 3:04
  • 2
    Clarified answer, it should be clear what this solution does now. The question is a bit vague in that regard. The title suggests "any substring", description and examples indicate "common prefix".
    – jonas
    Commented Jun 29, 2020 at 8:48
  • @famzah You linked to the documentation of os.commonpath this is not the same as the os.commonprefix that is used in the answer. But true, there could be some limitations, just the documentation does not mention any.
    – jonas
    Commented Nov 8, 2020 at 13:00
40
def common_start(sa, sb):
    """ returns the longest common substring from the beginning of sa and sb """
    def _iter():
        for a, b in zip(sa, sb):
            if a == b:
                yield a
            else:
                return

    return ''.join(_iter())
>>> common_start("apple pie available", "apple pies")
'apple pie'

Or a slightly stranger way:

def stop_iter():
    """An easy way to break out of a generator"""
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a if a == b else stop_iter() for a, b in zip(sa, sb))

Which might be more readable as

def terminating(cond):
    """An easy way to break out of a generator"""
    if cond:
        return True
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a for a, b in zip(sa, sb) if terminating(a == b))
6
  • 11
    This solution, as of now, isn't complete. It only compares both strings from the zeroth position. For instance: >>> common_start("XXXXXapple pie available", "apple pies") returns an empty string.
    – Nitin Nain
    Commented Sep 7, 2014 at 19:36
  • 4
    @NitinNain: That was never clarified in the original question. But yes, this solution only finds the common start of strings
    – Eric
    Commented Sep 7, 2014 at 21:56
  • 1
    will this work once PEP479 is in effect? Commented Aug 19, 2015 at 8:36
  • 1
    No - from that document: "There are also examples of generator expressions floating around that rely on a StopIteration raised by the expression, the target or the predicate (rather than by the __next__() call implied in the for loop proper)."
    – Eric
    Commented Aug 19, 2015 at 20:11
  • 1
    @Eric still, from the Python 3.6 release notes, Raising the StopIteration exception inside a generator will now generate a DeprecationWarning. If you run your code with Python3 -W default::DeprecationWarning, the last two examples both raise DeprecationWarnings
    – jpyams
    Commented Dec 20, 2017 at 16:10
11

Fix bugs with the first's answer:

def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        for j in range(len2):
            lcs_temp = 0
            match = ''
            while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
                match += string2[j+lcs_temp]
                lcs_temp += 1
            if len(match) > len(answer):
                answer = match
    return answer

print(longestSubstringFinder("dd apple pie available", "apple pies"))
print(longestSubstringFinder("cov_basic_as_cov_x_gt_y_rna_genes_w1000000", "cov_rna15pcs_as_cov_x_gt_y_rna_genes_w1000000")
print(longestSubstringFinder("bapples", "cappleses"))
print(longestSubstringFinder("apples", "apples"))
0
10

Its called Longest Common Substring problem. Here I present a simple, easy to understand but inefficient solution. It will take a long time to produce correct output for large strings, as the complexity of this algorithm is O(N^2).

def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        match = ""
        for j in range(len2):
            if (i + j < len1 and string1[i + j] == string2[j]):
                match += string2[j]
            else:
                if (len(match) > len(answer)): answer = match
                match = ""
    return answer

print(longestSubstringFinder("apple pie available", "apple pies"))
print(longestSubstringFinder("apples", "appleses"))
print(longestSubstringFinder("bapples", "cappleses"))

Output

apple pie
apples
apples
9
  • 9
    This algorithm is incorrect with given some inputs (e.g. "apple pie...", "apple pie") but works if you switch parameter position. I think there's something wrong with the if statement when you compare i+j < len1
    – REALFREE
    Commented Jul 9, 2014 at 6:47
  • 15
    its totaly wrong. try string1="2193588" , string2="21943588" Commented Feb 14, 2018 at 8:28
  • 5
    this needs to get down votes to get removed ...this is a wrong answer...
    – grepit
    Commented Dec 6, 2018 at 6:10
  • 3
    This doesn't work because it does not consider scenario where you will need to do a "re-matching" for the second string. For instance, in "acdaf" vs "acdacdaf", when starting from "a" of the first string it will match all the way till the "acda" part of the second string, then it will break at c. Then no matter what you can no longer pick up acdaf. Commented Jan 24, 2019 at 22:34
  • 2
    this is a wrong answer, won't even work on equal inputs, which should output one of the inputs
    – mounaim
    Commented Mar 1, 2021 at 23:23
5

The same as Evo's, but with arbitrary number of strings to compare:

def common_start(*strings):
    """ Returns the longest common substring
        from the beginning of the `strings`
    """
    def _iter():
        for z in zip(*strings):
            if z.count(z[0]) == len(z):  # check all elements in `z` are the same
                yield z[0]
            else:
                return

    return ''.join(_iter())
4

The fastest way I've found is to use suffix_trees package:

from suffix_trees import STree

a = ["xxxabcxxx", "adsaabc"]
st = STree.STree(a)
print(st.lcs()) # "abc"
0
2
def matchingString(x,y):
    match=''
    for i in range(0,len(x)):
        for j in range(0,len(y)):
            k=1
            # now applying while condition untill we find a substring match and length of substring is less than length of x and y
            while (i+k <= len(x) and j+k <= len(y) and x[i:i+k]==y[j:j+k]):
                if len(match) <= len(x[i:i+k]):
                   match = x[i:i+k]
                k=k+1
    return match  

print matchingString('apple','ale') #le
print matchingString('apple pie available','apple pies') #apple pie     
2

This script requests you the minimum common substring length and gives all common substrings in two strings. Also, it eliminates shorter substrings that longer substrings include already.

def common_substrings(str1,str2):
    len1,len2=len(str1),len(str2)

    if len1 > len2:
        str1,str2=str2,str1 
        len1,len2=len2,len1
    #short string=str1 and long string=str2

    min_com = int(input('Please enter the minumum common substring length:'))
    
    cs_array=[]
    for i in range(len1,min_com-1,-1):
        for k in range(len1-i+1):
            if (str1[k:i+k] in str2):
                flag=1
                for m in range(len(cs_array)):
                    if str1[k:i+k] in cs_array[m]:
                    #print(str1[k:i+k])
                        flag=0
                        break
                if flag==1:
                    cs_array.append(str1[k:i+k])
    if len(cs_array):
        print(cs_array)
    else:
        print('There is no any common substring according to the parametres given')

common_substrings('ciguliuana','ciguana')
common_substrings('apples','appleses')
common_substrings('apple pie available','apple pies')
1

Try:

import itertools as it
''.join(el[0] for el in it.takewhile(lambda t: t[0] == t[1], zip(string1, string2)))

It does the comparison from the beginning of both strings.

3
  • I'm now wanting python to make it.takewhile a language feature: a for a, b in zip(string1, string2) while a == b
    – Eric
    Commented Sep 10, 2013 at 11:23
  • ''.join(el[0] for el in itertools.takewhile(lambda t: t[0] == t[1], zip("ahello", "hello"))) returns "", which appears to be incorrect. The correct result would be "hello". Commented Jan 26, 2015 at 5:43
  • @AndersonGreen: You are right, it doesn't answer exactly the question, althought his examples only took into account the starting point at first char and I pointed out it in my answer too.
    – Birei
    Commented Jan 26, 2015 at 12:17
1

A Trie data structure would work the best, better than DP. Here is the code.

class TrieNode:
    def __init__(self):
        self.child = [None]*26
        self.endWord = False

class Trie:

    def __init__(self):
        self.root = self.getNewNode()

    def getNewNode(self):
        return TrieNode()

    def insert(self,value):
        root = self.root


        for i,character in enumerate(value):
            index = ord(character) - ord('a')
            if not root.child[index]:
                root.child[index] = self.getNewNode()
            root = root.child[index]

        root.endWord = True


    def search(self,value):
        root = self.root

        for i,character in enumerate(value):
            index = ord(character) - ord('a')
            if not root.child[index]:
                return False
            root = root.child[index]
        return root.endWord

def main(): 

    # Input keys (use only 'a' through 'z' and lower case) 
    keys = ["the","anaswe"] 
    output = ["Not present in trie", 
            "Present in trie"] 

    # Trie object 
    t = Trie() 

    # Construct trie 
    for key in keys: 
        t.insert(key) 

    # Search for different keys 
    print("{} ---- {}".format("the",output[t.search("the")])) 
    print("{} ---- {}".format("these",output[t.search("these")])) 
    print("{} ---- {}".format("their",output[t.search("their")])) 
    print("{} ---- {}".format("thaw",output[t.search("thaw")])) 

if __name__ == '__main__': 
    main() 

Let me know in case of doubts.

1

In case we have a list of words that we need to find all common substrings I check some of the codes above and the best was https://stackoverflow.com/a/42882629/8520109 but it has some bugs for example 'histhome' and 'homehist'. In this case, we should have 'hist' and 'home' as a result. Furthermore, it differs if the order of arguments is changed. So I change the code to find every block of substring and it results a set of common substrings:

main = input().split(" ")    #a string of words separated by space
def longestSubstringFinder(string1, string2):
    '''Find the longest matching word'''
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        for j in range(len2):
            lcs_temp=0
            match=''
            while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
                match += string2[j+lcs_temp]
                lcs_temp+=1         
            if (len(match) > len(answer)):
                answer = match              
    return answer

def listCheck(main):
    '''control the input for finding substring in a list of words'''
    string1 = main[0]
    result = []
    for i in range(1, len(main)):
        string2 = main[i]
        res1 = longestSubstringFinder(string1, string2)
        res2 = longestSubstringFinder(string2, string1)
        result.append(res1)
        result.append(res2)
    result.sort()
    return result

first_answer = listCheck(main)

final_answer  = []


for item1 in first_answer:    #to remove some incorrect match
    string1 = item1
    double_check = True
    for item2 in main:
        string2 = item2
        if longestSubstringFinder(string1, string2) != string1:
            double_check = False
    if double_check:
        final_answer.append(string1)

print(set(final_answer))

main = 'ABACDAQ BACDAQA ACDAQAW XYZCDAQ' #>>> {'CDAQ'}
main = 'homehist histhome' #>>> {'hist', 'home'}

1
def LongestSubString(s1,s2):
    if len(s1)<len(s2) :
        s1,s2 = s2,s1  
    
    maxsub =''
    for i in range(len(s2)):
        for j in range(len(s2),i,-1):
            if s2[i:j] in s1 and j-i>len(maxsub):                
                return  s2[i:j]
2
  • 1
    I recommend adding a return '' at the end, since the in degenerate case, you do not want to return None (as python does by default); you instead want to return the empty string. Commented Dec 10, 2021 at 15:17
  • Where do you update maxsub
    – Coddy
    Commented Aug 15, 2023 at 22:19
0

Returns the first longest common substring:

def compareTwoStrings(string1, string2):
    list1 = list(string1)
    list2 = list(string2)

    match = []
    output = ""
    length = 0

    for i in range(0, len(list1)):

        if list1[i] in list2:
            match.append(list1[i])

            for j in range(i + 1, len(list1)):

                if ''.join(list1[i:j]) in string2:
                    match.append(''.join(list1[i:j]))

                else:
                    continue
        else:
            continue

    for string in match:

        if length < len(list(string)):
            length = len(list(string))
            output = string

        else:
            continue

    return output
0
**Return the comman longest substring** 
def longestSubString(str1, str2):
    longestString = ""
    maxLength = 0
    for i in range(0, len(str1)):
        if str1[i] in str2:
            for j in range(i + 1, len(str1)):
                if str1[i:j] in str2:
                    if(len(str1[i:j]) > maxLength):
                        maxLength = len(str1[i:j])
                        longestString =  str1[i:j]
return longestString
0

This is the classroom problem called 'Longest sequence finder'. I have given some simple code that worked for me, also my inputs are lists of a sequence which can also be a string:

def longest_substring(list1,list2):
    both=[]
    if len(list1)>len(list2):
        small=list2
        big=list1
    else:
        small=list1
        big=list2
    removes=0
    stop=0
    for i in small:
        for j in big:
            if i!=j:
                removes+=1
                if stop==1:
                    break
            elif i==j:
                both.append(i)
                for q in range(removes+1):
                    big.pop(0)
                stop=1
                break
        removes=0
    return both
0

As if this question doesn't have enough answers, here's another option:

from collections import defaultdict
def LongestCommonSubstring(string1, string2):
    match = ""
    matches = defaultdict(list)
    str1, str2 = sorted([string1, string2], key=lambda x: len(x))

    for i in range(len(str1)):
        for k in range(i, len(str1)):
            cur = match + str1[k]
            if cur in str2:
                match = cur
            else:
                match = ""
            
            if match:
                matches[len(match)].append(match)
        
    if not matches:
        return ""

    longest_match = max(matches.keys())
        
    return matches[longest_match][0]

Some example cases:

LongestCommonSubstring("whose car?", "this is my car")
> ' car'
LongestCommonSubstring("apple pies", "apple? forget apple pie!")
> 'apple pie'

-1

This isn't the most efficient way to do it but it's what I could come up with and it works. If anyone can improve it, please do. What it does is it makes a matrix and puts 1 where the characters match. Then it scans the matrix to find the longest diagonal of 1s, keeping track of where it starts and ends. Then it returns the substring of the input string with the start and end positions as arguments.

Note: This only finds one longest common substring. If there's more than one, you could make an array to store the results in and return that Also, it's case sensitive so (Apple pie, apple pie) will return pple pie.

def longestSubstringFinder(str1, str2):
answer = ""

if len(str1) == len(str2):
    if str1==str2:
        return str1
    else:
        longer=str1
        shorter=str2
elif (len(str1) == 0 or len(str2) == 0):
    return ""
elif len(str1)>len(str2):
    longer=str1
    shorter=str2
else:
    longer=str2
    shorter=str1

matrix = numpy.zeros((len(shorter), len(longer)))

for i in range(len(shorter)):
    for j in range(len(longer)):               
        if shorter[i]== longer[j]:
            matrix[i][j]=1

longest=0

start=[-1,-1]
end=[-1,-1]    
for i in range(len(shorter)-1, -1, -1):
    for j in range(len(longer)):
        count=0
        begin = [i,j]
        while matrix[i][j]==1:

            finish=[i,j]
            count=count+1 
            if j==len(longer)-1 or i==len(shorter)-1:
                break
            else:
                j=j+1
                i=i+1

        i = i-count
        if count>longest:
            longest=count
            start=begin
            end=finish
            break

answer=shorter[int(start[0]): int(end[0])+1]
return answer
0
-1

First a helper function adapted from the itertools pairwise recipe to produce substrings.

import itertools
def n_wise(iterable, n = 2):
    '''n = 2 -> (s0,s1), (s1,s2), (s2, s3), ...

    n = 3 -> (s0,s1, s2), (s1,s2, s3), (s2, s3, s4), ...'''
    a = itertools.tee(iterable, n)
    for x, thing in enumerate(a[1:]):
        for _ in range(x+1):
            next(thing, None)
    return zip(*a)

Then a function the iterates over substrings, longest first, and tests for membership. (efficiency not considered)

def foo(s1, s2):
    '''Finds the longest matching substring
    '''
    # the longest matching substring can only be as long as the shortest string
    #which string is shortest?
    shortest, longest = sorted([s1, s2], key = len)
    #iterate over substrings, longest substrings first
    for n in range(len(shortest)+1, 2, -1):
        for sub in n_wise(shortest, n):
            sub = ''.join(sub)
            if sub in longest:
                #return the first one found, it should be the longest
                return sub

s = "fdomainster"
t = "exdomainid"
print(foo(s,t))

>>> 
domain
>>> 
-1
def LongestSubString(s1,s2):
    left = 0
    right =len(s2)
    while(left<right):
        if(s2[left] not in s1):
            left = left+1
        else:
            if(s2[left:right] not in s1):
                right = right - 1
            else:
                return(s2[left:right])

s1 = "pineapple"
s2 = "applc"
print(LongestSubString(s1,s2))

Not the answer you're looking for? Browse other questions tagged or ask your own question.