Python - difference between two strings

Question

I'd like to store a lot of words in a list. Many of these words are very similar. For example I have word afrykanerskojęzyczny and many of words like afrykanerskojęzycznym, afrykanerskojęzyczni, nieafrykanerskojęzyczni. What is the effective (fast and giving small diff size) solution to find difference between two strings and restore second string from the first one and diff?

What do you mean by "restore the second string from the first one and diff"? — jrd1, Commented Jul 28, 2013 at 1:29
I believe he means "Make the second string the same as the first". — Elias Benevedes, Commented Jul 28, 2013 at 1:39
Are you looking for something like difflib? If so, see, e.g., stackoverflow.com/questions/774316/… — torek, Commented Jul 28, 2013 at 2:35
I want to change one string to the other only on the "add commands". How does one do that with their api? — Charlie Parker, Commented Jul 22, 2022 at 13:45

Lenormju · Accepted Answer · 2023-05-23 06:10:18Z

177

You can use ndiff in the difflib module to do this. It has all the information necessary to convert one string into another string.

A simple example:

import difflib

cases=[('afrykanerskojęzyczny', 'afrykanerskojęzycznym'),
       ('afrykanerskojęzyczni', 'nieafrykanerskojęzyczni'),
       ('afrykanerskojęzycznym', 'afrykanerskojęzyczny'),
       ('nieafrykanerskojęzyczni', 'afrykanerskojęzyczni'),
       ('nieafrynerskojęzyczni', 'afrykanerskojzyczni'),
       ('abcdefg','xac')] 

for a,b in cases:     
    print('{} => {}'.format(a,b))  
    for i,s in enumerate(difflib.ndiff(a, b)):
        if s[0]==' ': continue
        elif s[0]=='-':
            print(u'Delete "{}" from position {}'.format(s[-1],i))
        elif s[0]=='+':
            print(u'Add "{}" to position {}'.format(s[-1],i))    
    print()

prints:

afrykanerskojęzyczny => afrykanerskojęzycznym
Add "m" to position 20

afrykanerskojęzyczni => nieafrykanerskojęzyczni
Add "n" to position 0
Add "i" to position 1
Add "e" to position 2

afrykanerskojęzycznym => afrykanerskojęzyczny
Delete "m" from position 20

nieafrykanerskojęzyczni => afrykanerskojęzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2

nieafrynerskojęzyczni => afrykanerskojzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2
Add "k" to position 7
Add "a" to position 8
Delete "ę" from position 16

abcdefg => xac
Add "x" to position 0
Delete "b" from position 2
Delete "d" from position 4
Delete "e" from position 5
Delete "f" from position 6
Delete "g" from position 7

edited May 23, 2023 at 6:10

Lenormju

4,2692 gold badges9 silver badges25 bronze badges

answered Jul 28, 2013 at 4:25

dawg

102k23 gold badges133 silver badges211 bronze badges

32

+1 Python has so many useful modules. It seems that I learn about a new one each day.
– arshajii
Commented Jul 28, 2013 at 4:29
1

This is stepping through the difference manually; restoring the different between the two strings, of course, is much easier with difflib.restore
– dawg
Commented Jul 28, 2013 at 4:48
Thanks! But i'm not sure if this is memory efficient. list(difflib.ndiff("afrykanerskojęzyczny","nieafrykanerskojęzyczny")) ['+ n', '+ i', '+ e', ' a', ' f', ' r', ' y', ' k', ' a', ' n', ' e', ' r', ' s', ' k', ' o', ' j', ' ę', ' z', ' y', ' c', ' z', ' n', ' y']
– user2626682
Commented Jul 28, 2013 at 11:08
1

ndiff is a generator so it is quite memory efficient. You are calling list on it which turns the individually generated character comparisons into a full list of them. You would only have a few in memory at a time if you did't call list on it.
– dawg
Commented Jul 28, 2013 at 13:54
1

Works on Python 2 as well (for me) I would suggest asking a question with the specific source and specific output. I cannot debug in comments...
– dawg
Commented May 6, 2016 at 17:24

| Show 10 more comments

ewerybody · Accepted Answer · 2018-10-04 08:57:38Z

54

I like the ndiff answer, but if you want to spit it all into a list of only the changes, you could do something like:

import difflib

case_a = 'afrykbnerskojęzyczny'
case_b = 'afrykanerskojęzycznym'

output_list = [li for li in difflib.ndiff(case_a, case_b) if li[0] != ' ']

edited Oct 4, 2018 at 8:57

ewerybody

1,53817 silver badges30 bronze badges

answered Sep 21, 2017 at 16:51

Eric

7296 silver badges5 bronze badges

3

This is just what I was Googling for. One quick note, @Eric, your variables don't match as shown today, 20180905. Either 1) change the last line to output_list = [li for li in list(difflib.ndiff(case_a,case_b)) if li[0] != ' '] or 2) Change the string variables' names as case_a -> a and case_b -> b. Cheers!
– bballdave025
Commented Sep 5, 2018 at 18:03
5

It might also be helpful to show the output of your command: >>> output_list ; #result# ['- b', '+ a', '+ m']
– bballdave025
Commented Sep 5, 2018 at 18:06
2

if not li.startswith(' ') is the eqivalent of if li[0] != ' ' Some may find it more legible. Or even if item.startswith(('-', '+', ))
– dmmfll
Commented Aug 24, 2019 at 13:19
2

@Nathan you can't downvote comments, and while you're correct that lists don't have startswith, you've completely screwed up on the type. li is a string, not a list, and strings have had startswith() since at least 2012. Strings are indexable in the same way as arrays, because a string is essentially a glorified unsigned (w)char[], which in fact is an array. Try running the code - output_list = [li for li in difflib.ndiff(case_a, case_b) if not li.startswith(' ')] works perfectly fine for me on 3.9
– Zoe - Save the data dump ♦
Commented Jul 20, 2021 at 19:34
1

docs for python 3 docs.python.org/3/library/difflib.html
– Charlie Parker
Commented Jul 21, 2022 at 22:31

| Show 1 more comment

perreal · Accepted Answer · 2013-07-28 03:26:07Z

5

You can look into the regex module (the fuzzy section). I don't know if you can get the actual differences, but at least you can specify allowed number of different types of changes like insert, delete, and substitutions:

import regex
sequence = 'afrykanerskojezyczny'
queries = [ 'afrykanerskojezycznym', 'afrykanerskojezyczni', 
            'nieafrykanerskojezyczni' ]
for q in queries:
    m = regex.search(r'(%s){e<=2}'%q, sequence)
    print 'match' if m else 'nomatch'

answered Jul 28, 2013 at 3:26

perreal

96.8k23 gold badges157 silver badges186 bronze badges

Add a comment |

Craig Silverstein · Accepted Answer · 2016-09-21 23:16:56Z

What you are asking for is a specialized form of compression. xdelta3 was designed for this particular kind of compression, and there's a python binding for it, but you could probably get away with using zlib directly. You'd want to use zlib.compressobj and zlib.decompressobj with the zdict parameter set to your "base word", e.g. afrykanerskojęzyczny.

Caveats are zdict is only supported in python 3.3 and higher, and it's easiest to code if you have the same "base word" for all your diffs, which may or may not be what you want.

moo · Accepted Answer · 2022-02-26 07:36:38Z

You might find the tools available in the NLTK library useful for calculating the difference between different words.

nltk.metrics.distance.edit_distance() is a mature (non-standard) library implementation that calculates the Levenshtein distance

A simple example might be:

from nltk.metrics.distance import *

w1 = 'wordone'
w2 = 'wordtwo'
edit_distance(w1, w2)

Out: 3

Additional parameter allow the output to be weighted, depending on the costs of different actions (substitutions/insertions) and different character differences (e.g. less cost for characters closer on the keyboard).

Zoe - Save the data dump · Accepted Answer · 2021-07-20 19:08:30Z

0

The answer to my comment above on the Original Question makes me think this is all he wants:

loopnum = 0
word = 'afrykanerskojęzyczny'
wordlist = ['afrykanerskojęzycznym','afrykanerskojęzyczni','nieafrykanerskojęzyczni']
for i in wordlist:
    wordlist[loopnum] = word
    loopnum += 1

This will do the following:

For every value in wordlist, set that value of the wordlist to the origional code.

All you have to do is put this piece of code where you need to change wordlist, making sure you store the words you need to change in wordlist, and that the original word is correct.

edited Jul 20, 2021 at 19:08

Zoe - Save the data dump♦

28k22 gold badges127 silver badges157 bronze badges

answered Jul 28, 2013 at 4:00

Elias Benevedes

3801 gold badge8 silver badges28 bronze badges

Thanks, but actually I'd like to store words like 'nieafrykanerskojęzyczni' in a memory efficient way, using similarity to 'afrykanerskojęzyczny'.
– user2626682
Commented Jul 28, 2013 at 10:35

Add a comment |

greenkey · Accepted Answer · 2023-03-11 18:04:53Z

I found this problem interesting and even if this is an old question, I wanted to try developing a solution without any external library.

import pytest

@pytest.mark.parametrize(
    "s1,s2,expected",
    [
        ("", "", ["", ""]),
        ("a", "a", ["", ""]),
        ("a", "b", ["a", "b"]),
        ("ac", "bc", ["a", "b"]),
        ("ca", "cb", ["a", "b"]),
        ("this is a message", "this is b message", ["a", "b"]),
        ("this is a huge message", "this is b message", ["a huge", "b"]),
    ],
)
def test_string_diff(s1, s2, expected):
    assert string_diff(s1, s2) == expected


def string_diff(s1: str, s2: str) -> str:
    d1, d2 = [], []
    i1 = i2 = 0
    l1 = len(s1)
    l2 = len(s2)
    while True:
        if i1 >= len(s1) or i2 >= len(s2):
            d1.extend(s1[i1:])
            d2.extend(s2[i2:])
            break
        if s1[i1] != s2[i2]:
            e1 = l1 - i1
            e2 = l2 - i2
            if e1 > e2:
                d1.append(s1[i1])
                i2 -= 1
            elif e1 < e2:
                d2.append(s2[i2])
                i1 -= 1
            else:
                d1.append(s1[i1])
                d2.append(s2[i2])
        i1 += 1
        i2 += 1
    return ["".join(d1), "".join(d2)]

Collectives™ on Stack Overflow

Python - difference between two strings

7 Answers 7

Not the answer you're looking for? Browse other questions tagged
python
string
python-3.x
diff
or ask your own question.

Linked

Hot Network Questions

Collectives™ on Stack Overflow

7 Answers 7

Not the answer you're looking for? Browse other questions tagged pythonstringpython-3.xdiff or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
python
string
python-3.x
diff
or ask your own question.