Find common substring between two strings

Question

I'd like to compare two strings and keep the part that matches, cutting off where the comparison fails.

So if I have 2 strings:

string1 = "apples"
string2 = "appleses"

answer = "apples"

Another example, as the string could have more than one word:

string1 = "apple pie available"
string2 = "apple pies"

answer = "apple pie"

I'm sure there is a simple Python way of doing this, but I can't work it out. Any help and explanation is appreciated.

If you need the common prefix: os.path.commonprefix(['apples', 'appleses']) -> 'apples'
– jfs
Commented Feb 26, 2014 at 21:20
Also check out algorithm implementations on wikibooks: en.wikibooks.org/w/index.php?title=Algorithm_Implementation/…
– tommy.carstensen
Commented Oct 7, 2014 at 11:28
The content of the question does not correspond to what is in the title. The problem described is longest common prefix.
– heorhi
Commented Oct 6, 2018 at 11:19

Francisco · Accepted Answer · 2022-03-29 09:09:24Z

202

For completeness, difflib in the standard library provides many sequence-comparison utilities, for instance find_longest_match(), which finds the longest common substring when used on strings. Example use:

from difflib import SequenceMatcher

string1 = "apple pie available"
string2 = "come have some apple pies"

match = SequenceMatcher(None, string1, string2).find_longest_match()

print(match)  # -> Match(a=0, b=15, size=9)
print(string1[match.a:match.a + match.size])  # -> apple pie
print(string2[match.b:match.b + match.size])  # -> apple pie

If you're using a Python version older than 3.9, you need to call find_longest_match() with the following arguments:

SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
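
As a comment below notes, SequenceMatcher applies an automatic "junk" heuristic to longer sequences, and that heuristic can hide matches. A minimal sketch of a wrapper that switches the heuristic off and passes explicit bounds, so it also runs on versions before 3.9. The helper name longest_match_difflib is made up for this example:

```python
from difflib import SequenceMatcher

def longest_match_difflib(a, b):
    """Longest common substring of a and b via difflib (hypothetical helper)."""
    # autojunk=False disables the "popular element" heuristic.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Explicit bounds work on every Python 3 version, not only 3.9+.
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]

print(longest_match_difflib("apple pie available", "come have some apple pies"))  # apple pie
```

With the heuristic disabled, the matcher may be noticeably slower on long, repetitive inputs.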

edited Mar 29, 2022 at 9:09

Francisco


answered Sep 9, 2016 at 6:00

RickardSjogren


12

Heads up to those using this on longer strings, you might want to set the kwarg "autojunk" to False when creating the instance of SequenceMatcher.
– MLP
Commented Aug 20, 2018 at 3:57
2

I'll note that there are outstanding bugs in difflib that should prevent its use in real-world scenarios. For example, it seems that the well known 'heuristic' interferes with the completeness of methods such as 'get_matching_blocks'.
– W4t3randWind
Commented Oct 27, 2018 at 16:18
12

Warning: This answer does not find the longest common substring! Despite its name (and the method's documentation), find_longest_match() does not do what its name implies. The class documentation for SequenceMatcher does hint at this, however, saying: This does not yield minimal edit sequences. For example, in some cases, find_longest_match() will claim there are no matches in two strings of length 1000, even though there are matching substrings of length > 500.
– Aleksi Torhamo
Commented Mar 19, 2019 at 14:00
11

man, what turkey wrote that API. Forcing you to put the lengths of the strings in every time instead of just assuming it's the full strings, and the first argument to SequenceMatcher is nearly always going to be None :@
– CpILL
Commented Oct 6, 2021 at 4:15
1

@CpILL default arguments were added on Python 3.9.
– Francisco
Commented Mar 29, 2022 at 9:04


jonas · Accepted Answer · 2020-06-29 08:46:41Z

54

One might also consider os.path.commonprefix, which works character by character and can therefore be used for any strings, not just paths.

import os
common = os.path.commonprefix(['apple pie available', 'apple pies'])
assert common == 'apple pie'

As the function name indicates, this only considers the common prefix of two strings.

edited Jun 29, 2020 at 8:46

answered Nov 7, 2018 at 14:07

jonas


4

It doesn't work when comparing strings like ['an apple pie available', 'apple pies'].
– GoTop
Commented Feb 10, 2019 at 3:04
2

Clarified answer, it should be clear what this solution does now. The question is a bit vague in that regard. The title suggests "any substring", description and examples indicate "common prefix".
– jonas
Commented Jun 29, 2020 at 8:48
@famzah You linked to the documentation of os.path.commonpath, which is not the same as the os.path.commonprefix used in the answer. But true, there could be some limitations; the documentation just does not mention any.
– jonas
Commented Nov 8, 2020 at 13:00


Rahul K P · Accepted Answer · 2018-06-19 07:02:02Z

40

def common_start(sa, sb):
    """ returns the longest common substring from the beginning of sa and sb """
    def _iter():
        for a, b in zip(sa, sb):
            if a == b:
                yield a
            else:
                return

    return ''.join(_iter())

>>> common_start("apple pie available", "apple pies")
'apple pie'

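Or a slightly stranger way (note: since Python 3.7, PEP 479 turns a StopIteration raised inside a generator expression into a RuntimeError, so this only works on older versions):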

def stop_iter():
    """An easy way to break out of a generator"""
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a if a == b else stop_iter() for a, b in zip(sa, sb))

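Which might be more readable as follows (this also relies on StopIteration escaping a generator expression, so it fails with RuntimeError on Python 3.7+):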

def terminating(cond):
    """An easy way to break out of a generator"""
    if cond:
        return True
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a for a, b in zip(sa, sb) if terminating(a == b))
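
Because of PEP 479 (discussed in the comments), the two StopIteration-based variants fail on current Python. A sketch of an equivalent that counts the matching positions and slices once, assuming plain str inputs:

```python
def common_start(sa, sb):
    """Return the longest common prefix of sa and sb."""
    length = 0
    for a, b in zip(sa, sb):
        if a != b:
            break  # first mismatch ends the common prefix
        length += 1
    return sa[:length]

print(common_start("apple pie available", "apple pies"))  # apple pie
```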

edited Jun 19, 2018 at 7:02

Rahul K P


answered Sep 10, 2013 at 9:59

Eric


11

This solution, as of now, isn't complete. It only compares both strings from the zeroth position. For instance: >>> common_start("XXXXXapple pie available", "apple pies") returns an empty string.
– Nitin Nain
Commented Sep 7, 2014 at 19:36
4

@NitinNain: That was never clarified in the original question. But yes, this solution only finds the common start of strings
– Eric
Commented Sep 7, 2014 at 21:56
1

will this work once PEP479 is in effect?
– Janus Troelsen
Commented Aug 19, 2015 at 8:36
1

No - from that document: "There are also examples of generator expressions floating around that rely on a StopIteration raised by the expression, the target or the predicate (rather than by the __next__() call implied in the for loop proper)."
– Eric
Commented Aug 19, 2015 at 20:11
1

@Eric still, from the Python 3.6 release notes, Raising the StopIteration exception inside a generator will now generate a DeprecationWarning. If you run your code with Python3 -W default::DeprecationWarning, the last two examples both raise DeprecationWarnings
– jpyams
Commented Dec 20, 2017 at 16:10


Francisco · Accepted Answer · 2022-03-29 08:42:57Z

This fixes the bugs in the first answer:

def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        for j in range(len2):
            lcs_temp = 0
            match = ''
            while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
                match += string2[j+lcs_temp]
                lcs_temp += 1
            if len(match) > len(answer):
                answer = match
    return answer

print(longestSubstringFinder("dd apple pie available", "apple pies"))
print(longestSubstringFinder("cov_basic_as_cov_x_gt_y_rna_genes_w1000000", "cov_rna15pcs_as_cov_x_gt_y_rna_genes_w1000000"))
print(longestSubstringFinder("bapples", "cappleses"))
print(longestSubstringFinder("apples", "apples"))

Francisco · Accepted Answer · 2022-03-29 08:41:37Z

10

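This is called the Longest Common Substring problem. Here I present a simple, easy-to-understand but inefficient solution. It will take a long time to produce output for large strings, as the complexity of this algorithm is O(N^2). As the comments point out, it also returns wrong results for some inputs, for example when both strings are identical.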

def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        match = ""
        for j in range(len2):
            if (i + j < len1 and string1[i + j] == string2[j]):
                match += string2[j]
            else:
                if (len(match) > len(answer)): answer = match
                match = ""
    return answer

print(longestSubstringFinder("apple pie available", "apple pies"))
print(longestSubstringFinder("apples", "appleses"))
print(longestSubstringFinder("bapples", "cappleses"))

Output

apple pie
apples
apples
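
Several comments report inputs on which the loop above gives a wrong answer. For comparison, here is a sketch of the textbook dynamic-programming approach to the longest common substring problem. It is not the answer author's code, and the function name is chosen for this example. For each pair of positions it tracks the length of the common run ending there, which takes O(len(s1) * len(s2)) time:

```python
def longest_common_substring(s1, s2):
    """Return one longest common substring of s1 and s2 (DP sketch)."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match in s1
    # prev[j] holds the length of the common run ending at s1[i-2], s2[j-1]
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the run on the diagonal
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return s1[best_end - best_len:best_end]

print(longest_common_substring("2193588", "21943588"))  # 3588
print(longest_common_substring("apples", "apples"))     # apples
```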

edited Mar 29, 2022 at 8:41

Francisco


answered Sep 10, 2013 at 11:28

thefourtheye


9

This algorithm is incorrect with given some inputs (e.g. "apple pie...", "apple pie") but works if you switch parameter position. I think there's something wrong with the if statement when you compare i+j < len1
– REALFREE
Commented Jul 9, 2014 at 6:47
15

its totaly wrong. try string1="2193588" , string2="21943588"
– Nozar Safari
Commented Feb 14, 2018 at 8:28
5

this needs to get down votes to get removed ...this is a wrong answer...
– grepit
Commented Dec 6, 2018 at 6:10
3

This doesn't work because it does not consider scenario where you will need to do a "re-matching" for the second string. For instance, in "acdaf" vs "acdacdaf", when starting from "a" of the first string it will match all the way till the "acda" part of the second string, then it will break at c. Then no matter what you can no longer pick up acdaf.
– Tamaki Sakura
Commented Jan 24, 2019 at 22:34
2

this is a wrong answer, won't even work on equal inputs, which should output one of the inputs
– mounaim
Commented Mar 1, 2021 at 23:23


Community · Accepted Answer · 2017-05-23 11:47:32Z

5

The same as Evo's, but with arbitrary number of strings to compare:

def common_start(*strings):
    """ Returns the longest common substring
        from the beginning of the `strings`
    """
    def _iter():
        for z in zip(*strings):
            if z.count(z[0]) == len(z):  # check all elements in `z` are the same
                yield z[0]
            else:
                return

    return ''.join(_iter())
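
The same idea can be sketched with itertools.takewhile, treating each tuple produced by zip as a "column" of characters and stopping at the first column that is not uniform. The name common_start_many is only for illustration:

```python
from itertools import takewhile

def common_start_many(*strings):
    """Longest common prefix of any number of strings (illustrative variant)."""
    columns = zip(*strings)  # one tuple of characters per position
    # Keep columns while all characters in the column are identical.
    uniform = takewhile(lambda col: len(set(col)) == 1, columns)
    return ''.join(col[0] for col in uniform)

print(common_start_many("apple pie available", "apple pies", "apple pie"))  # apple pie
```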

edited May 23, 2017 at 11:47

CommunityBot


answered Jul 15, 2016 at 21:42

SergeyR



Andrey · Accepted Answer · 2021-05-14 18:14:39Z

4

The fastest way I've found is to use the suffix_trees package:

from suffix_trees import STree

a = ["xxxabcxxx", "adsaabc"]
st = STree.STree(a)
print(st.lcs()) # "abc"

answered May 14, 2021 at 18:14

Andrey



radhikesh93 · Accepted Answer · 2017-10-15 17:25:33Z

def matchingString(x,y):
    match=''
    for i in range(0,len(x)):
        for j in range(0,len(y)):
            k=1
            # extend the candidate while it stays inside both strings and the slices of x and y are equal
            while (i+k <= len(x) and j+k <= len(y) and x[i:i+k]==y[j:j+k]):
                if len(match) <= len(x[i:i+k]):
                   match = x[i:i+k]
                k=k+1
    return match  

print(matchingString('apple','ale'))  # le
print(matchingString('apple pie available','apple pies'))  # apple pie

serko · Accepted Answer · 2022-03-02 22:41:58Z

This script asks for the minimum common substring length and prints all common substrings of the two strings. It also drops shorter substrings that are already contained in a longer common substring.

def common_substrings(str1,str2):
    len1,len2=len(str1),len(str2)

    if len1 > len2:
        str1,str2=str2,str1 
        len1,len2=len2,len1
    #short string=str1 and long string=str2

    min_com = int(input('Please enter the minimum common substring length:'))
    
    cs_array=[]
    for i in range(len1,min_com-1,-1):
        for k in range(len1-i+1):
            if (str1[k:i+k] in str2):
                flag=1
                for m in range(len(cs_array)):
                    if str1[k:i+k] in cs_array[m]:
                    #print(str1[k:i+k])
                        flag=0
                        break
                if flag==1:
                    cs_array.append(str1[k:i+k])
    if len(cs_array):
        print(cs_array)
    else:
        print('There is no common substring for the given parameters')

common_substrings('ciguliuana','ciguana')
common_substrings('apples','appleses')
common_substrings('apple pie available','apple pies')
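
For use inside other code, a non-interactive variant may be handier. The sketch below follows the same longest-first scan but takes the minimum length as a parameter (min_len is a name introduced here) and returns the list instead of printing it:

```python
def common_substrings(str1, str2, min_len=1):
    """Return common substrings of at least min_len characters, longest first,
    skipping any that are contained in an already reported substring."""
    if len(str1) > len(str2):
        str1, str2 = str2, str1  # scan the shorter string
    found = []
    for size in range(len(str1), min_len - 1, -1):
        for start in range(len(str1) - size + 1):
            candidate = str1[start:start + size]
            if candidate in str2 and not any(candidate in longer for longer in found):
                found.append(candidate)
    return found

print(common_substrings('ciguliuana', 'ciguana', 2))  # ['cigu', 'uana']
print(common_substrings('apples', 'appleses', 3))     # ['apples']
```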

Birei · Accepted Answer · 2013-09-10 10:11:30Z

1

Try:

import itertools as it
''.join(el[0] for el in it.takewhile(lambda t: t[0] == t[1], zip(string1, string2)))

It does the comparison from the beginning of both strings.

edited Sep 10, 2013 at 10:11

answered Sep 10, 2013 at 10:10

Birei


I'm now wanting python to make it.takewhile a language feature: a for a, b in zip(string1, string2) while a == b
– Eric
Commented Sep 10, 2013 at 11:23
''.join(el[0] for el in itertools.takewhile(lambda t: t[0] == t[1], zip("ahello", "hello"))) returns "", which appears to be incorrect. The correct result would be "hello".
– Anderson Green
Commented Jan 26, 2015 at 5:43
@AndersonGreen: You are right, it doesn't answer exactly the question, althought his examples only took into account the starting point at first char and I pointed out it in my answer too.
– Birei
Commented Jan 26, 2015 at 12:17


David García Bodego · Accepted Answer · 2019-11-28 09:55:24Z

A Trie data structure would work the best, better than DP. Here is the code.

class TrieNode:
    def __init__(self):
        self.child = [None]*26
        self.endWord = False

class Trie:

    def __init__(self):
        self.root = self.getNewNode()

    def getNewNode(self):
        return TrieNode()

    def insert(self,value):
        root = self.root


        for i,character in enumerate(value):
            index = ord(character) - ord('a')
            if not root.child[index]:
                root.child[index] = self.getNewNode()
            root = root.child[index]

        root.endWord = True


    def search(self,value):
        root = self.root

        for i,character in enumerate(value):
            index = ord(character) - ord('a')
            if not root.child[index]:
                return False
            root = root.child[index]
        return root.endWord

def main(): 

    # Input keys (use only 'a' through 'z' and lower case) 
    keys = ["the","anaswe"] 
    output = ["Not present in trie", 
            "Present in trie"] 

    # Trie object 
    t = Trie() 

    # Construct trie 
    for key in keys: 
        t.insert(key) 

    # Search for different keys 
    print("{} ---- {}".format("the",output[t.search("the")])) 
    print("{} ---- {}".format("these",output[t.search("these")])) 
    print("{} ---- {}".format("their",output[t.search("their")])) 
    print("{} ---- {}".format("thaw",output[t.search("thaw")])) 

if __name__ == '__main__': 
    main()

Let me know in case of doubts.
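
The trie above only stores and looks up whole words. To connect it to the substring question, one option (an assumption about the intended use, not the answer author's code) is a suffix trie: insert every suffix of the first string, then walk each suffix of the second string down the trie as far as it matches. The sketch uses nested dicts instead of the fixed 26-slot nodes, so any character works. It needs quadratic time and memory, so it only suits short strings:

```python
def longest_common_substring_trie(s1, s2):
    """Longest common substring via a naive suffix trie (illustrative sketch)."""
    # Build a trie of nested dicts holding every suffix of s1.
    root = {}
    for start in range(len(s1)):
        node = root
        for ch in s1[start:]:
            node = node.setdefault(ch, {})

    # Follow each suffix of s2 into the trie and remember the deepest walk.
    best = ""
    for start in range(len(s2)):
        node = root
        depth = 0
        for ch in s2[start:]:
            if ch not in node:
                break
            node = node[ch]
            depth += 1
        if depth > len(best):
            best = s2[start:start + depth]
    return best

print(longest_common_substring_trie("xxxabcxxx", "adsaabc"))  # abc
```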

rahimz · Accepted Answer · 2020-02-25 06:10:20Z

In case we have a list of words and need to find all their common substrings: I checked some of the code above, and the best was https://stackoverflow.com/a/42882629/8520109, but it has some bugs, for example with 'histhome' and 'homehist'. In this case, we should get 'hist' and 'home' as a result. Furthermore, its result differs if the order of the arguments is changed. So I changed the code to find every block of common substring, and it returns a set of common substrings:

main = input().split(" ")    #a string of words separated by space
def longestSubstringFinder(string1, string2):
    '''Find the longest matching word'''
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        for j in range(len2):
            lcs_temp=0
            match=''
            while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
                match += string2[j+lcs_temp]
                lcs_temp+=1         
            if (len(match) > len(answer)):
                answer = match              
    return answer

def listCheck(main):
    '''control the input for finding substring in a list of words'''
    string1 = main[0]
    result = []
    for i in range(1, len(main)):
        string2 = main[i]
        res1 = longestSubstringFinder(string1, string2)
        res2 = longestSubstringFinder(string2, string1)
        result.append(res1)
        result.append(res2)
    result.sort()
    return result

first_answer = listCheck(main)

final_answer  = []


for item1 in first_answer:    #to remove some incorrect match
    string1 = item1
    double_check = True
    for item2 in main:
        string2 = item2
        if longestSubstringFinder(string1, string2) != string1:
            double_check = False
    if double_check:
        final_answer.append(string1)

print(set(final_answer))

main = 'ABACDAQ BACDAQA ACDAQAW XYZCDAQ' #>>> {'CDAQ'}
main = 'homehist histhome' #>>> {'hist', 'home'}

JiPiBi · Accepted Answer · 2020-10-22 22:29:38Z

1

def LongestSubString(s1,s2):
    if len(s1)<len(s2) :
        s1,s2 = s2,s1  
    
    maxsub =''
    for i in range(len(s2)):
        for j in range(len(s2),i,-1):
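            if s2[i:j] in s1 and j-i>len(maxsub):
                maxsub = s2[i:j]  # remember the best match so far
                break  # shorter slices starting at i cannot beat it
    return maxsub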

answered Oct 22, 2020 at 22:29

JiPiBi


1

I recommend adding a return '' at the end, since in the degenerate case you do not want to return None (as Python does by default); you instead want to return the empty string.
– Max Bileschi
Commented Dec 10, 2021 at 15:17
Where do you update maxsub
– Coddy
Commented Aug 15, 2023 at 22:19


modulus · Accepted Answer · 2015-12-22 15:07:10Z

Returns the first longest common substring:

def compareTwoStrings(string1, string2):
    list1 = list(string1)
    list2 = list(string2)

    match = []
    output = ""
    length = 0

    for i in range(0, len(list1)):

        if list1[i] in list2:
            match.append(list1[i])

            for j in range(i + 1, len(list1) + 1):  # + 1 so the slice can include the last character

                if ''.join(list1[i:j]) in string2:
                    match.append(''.join(list1[i:j]))

                else:
                    continue
        else:
            continue

    for string in match:

        if length < len(list(string)):
            length = len(list(string))
            output = string

        else:
            continue

    return output

Jagat Singh · Accepted Answer · 2019-10-13 15:49:17Z

0

Returns the longest common substring:

def longestSubString(str1, str2):
    longestString = ""
    maxLength = 0
    for i in range(0, len(str1)):
        if str1[i] in str2:
            # + 1 so the slice str1[i:j] can reach the end of str1
            for j in range(i + 1, len(str1) + 1):
                if str1[i:j] in str2:
                    if(len(str1[i:j]) > maxLength):
                        maxLength = len(str1[i:j])
                        longestString =  str1[i:j]
    return longestString

edited Oct 13, 2019 at 15:49

answered Oct 13, 2019 at 15:36

Jagat Singh



Bantu Manjunath · Accepted Answer · 2019-11-04 11:41:24Z

This is the classroom problem called 'Longest sequence finder'. I have given some simple code that worked for me, also my inputs are lists of a sequence which can also be a string:

def longest_substring(list1,list2):
    both=[]
    if len(list1)>len(list2):
        small=list2
        big=list1
    else:
        small=list1
        big=list2
    removes=0
    stop=0
    for i in small:
        for j in big:
            if i!=j:
                removes+=1
                if stop==1:
                    break
            elif i==j:
                both.append(i)
                for q in range(removes+1):
                    big.pop(0)
                stop=1
                break
        removes=0
    return both

mr.plow · Accepted Answer · 2020-08-21 22:18:30Z

As if this question doesn't have enough answers, here's another option:

from collections import defaultdict
def LongestCommonSubstring(string1, string2):
    match = ""
    matches = defaultdict(list)
    str1, str2 = sorted([string1, string2], key=lambda x: len(x))

    for i in range(len(str1)):
        for k in range(i, len(str1)):
            cur = match + str1[k]
            if cur in str2:
                match = cur
            else:
                match = ""
            
            if match:
                matches[len(match)].append(match)
        
    if not matches:
        return ""

    longest_match = max(matches.keys())
        
    return matches[longest_match][0]

Some example cases:

LongestCommonSubstring("whose car?", "this is my car")
> ' car'
LongestCommonSubstring("apple pies", "apple? forget apple pie!")
> 'apple pie'

Rali Tsanova · Accepted Answer · 2016-01-09 13:04:55Z

This isn't the most efficient way to do it, but it's what I could come up with and it works. If anyone can improve it, please do. It builds a matrix and puts a 1 wherever the characters match. Then it scans the matrix to find the longest diagonal of 1s, keeping track of where it starts and ends. Finally, it returns the substring of the input string between the start and end positions.

Note: This only finds one longest common substring. If there's more than one, you could make an array to store the results in and return that. Also, it's case sensitive, so (Apple pie, apple pie) will return pple pie.

import numpy

def longestSubstringFinder(str1, str2):
    answer = ""

    if len(str1) == len(str2):
        if str1 == str2:
            return str1
        else:
            longer = str1
            shorter = str2
    elif (len(str1) == 0 or len(str2) == 0):
        return ""
    elif len(str1) > len(str2):
        longer = str1
        shorter = str2
    else:
        longer = str2
        shorter = str1

    # matrix[i][j] is 1 where shorter[i] matches longer[j]
    matrix = numpy.zeros((len(shorter), len(longer)))

    for i in range(len(shorter)):
        for j in range(len(longer)):
            if shorter[i] == longer[j]:
                matrix[i][j] = 1

    longest = 0

    start = [-1, -1]
    end = [-1, -1]
    for i in range(len(shorter)-1, -1, -1):
        for j in range(len(longer)):
            count = 0
            begin = [i, j]
            # walk down the diagonal with separate indices so the
            # loop variables i and j are left untouched
            r, c = i, j
            while matrix[r][c] == 1:
                finish = [r, c]
                count = count + 1
                if c == len(longer)-1 or r == len(shorter)-1:
                    break
                else:
                    r = r + 1
                    c = c + 1

            if count > longest:
                longest = count
                start = begin
                end = finish

    answer = shorter[int(start[0]): int(end[0])+1]
    return answer

wwii · Accepted Answer · 2016-08-12 07:02:59Z

First a helper function adapted from the itertools pairwise recipe to produce substrings.

import itertools
def n_wise(iterable, n = 2):
    '''n = 2 -> (s0,s1), (s1,s2), (s2, s3), ...

    n = 3 -> (s0,s1, s2), (s1,s2, s3), (s2, s3, s4), ...'''
    a = itertools.tee(iterable, n)
    for x, thing in enumerate(a[1:]):
        for _ in range(x+1):
            next(thing, None)
    return zip(*a)

Then a function that iterates over substrings, longest first, and tests for membership (efficiency not considered).

def foo(s1, s2):
    '''Finds the longest matching substring
    '''
    # the longest matching substring can only be as long as the shortest string
    #which string is shortest?
    shortest, longest = sorted([s1, s2], key = len)
    #iterate over substrings, longest substrings first
    for n in range(len(shortest), 0, -1):  # every length from the full string down to 1
        for sub in n_wise(shortest, n):
            sub = ''.join(sub)
            if sub in longest:
                #return the first one found, it should be the longest
                return sub

s = "fdomainster"
t = "exdomainid"
print(foo(s,t))

>>> 
domain
>>>

GoTop · Accepted Answer · 2019-02-10 03:50:27Z

-1

def LongestSubString(s1,s2):
    left = 0
    right =len(s2)
    while(left<right):
        if(s2[left] not in s1):
            left = left+1
        else:
            if(s2[left:right] not in s1):
                right = right - 1
            else:
                return(s2[left:right])

s1 = "pineapple"
s2 = "applc"
print(LongestSubString(s1,s2))

edited Feb 10, 2019 at 3:50

GoTop


answered Oct 27, 2018 at 21:10

user3838498

