17

I am currently trying to reconcile “Name” fields from two separate data sources. I have a number of names that are not an exact match but are close enough to be considered matched (examples below). Do you have any ideas of how I can improve the number of automated matches? I am already eliminating middle initials from the match criteria.

enter image description here

Current Match Formula:

=IFERROR(IF(LEFT(SYSTEM A,IF(ISERROR(SEARCH(" ",SYSTEM A)),LEN(SYSTEM A),SEARCH(" ",SYSTEM A)-1))=LEFT(SYSTEM B,IF(ISERROR(SEARCH(" ",SYSTEM B)),LEN(SYSTEM B),SEARCH(" ",SYSTEM B)-1)),"",IF(LEFT(SYSTEM A,FIND(",",SYSTEM A))=LEFT(SYSTEM B,FIND(",",SYSTEM B)),"Last Name Match","RESEARCH")),"RESEARCH")

7 Answers 7

16

You might consider using the Microsoft Fuzzy Lookup Addin.

From MS site:

Overview

The Fuzzy Lookup Add-In for Excel was developed by Microsoft Research and performs fuzzy matching of textual data in Microsoft Excel. It can be used to identify fuzzy duplicate rows within a single table or to fuzzy join similar rows between two different tables. The matching is robust to a wide variety of errors including spelling mistakes, abbreviations, synonyms and added/missing data. For instance, it might detect that the rows “Mr. Andrew Hill”, “Hill, Andrew R.” and “Andy Hill” all refer to the same underlying entity, returning a similarity score along with each match. While the default configuration works well for a wide variety of textual data, such as product names or customer addresses, the matching may also be customized for specific domains or languages.

2
  • I can't install the addon in office due to administrator priviliges required, due to .net framework required. :-(
    – jumpjack
    Commented Sep 14, 2016 at 12:06
  • This is great, but I can't get it to produce more than 10 rows. I've clicked through the config without success. Any tips?
    – bjornte
    Commented Dec 19, 2016 at 10:32
6

I would look into using this list (English section only) to help weed out the common shortenings.

Addition, you might want to consider using a function that will tell you, in exact terms, how "close" two string are. The following code came from here and thanks to smirkingman.

Option Explicit
Public Function Levenshtein(s1 As String, s2 As String)

Dim i As Integer
Dim j As Integer
Dim l1 As Integer
Dim l2 As Integer
Dim d() As Integer
Dim min1 As Integer
Dim min2 As Integer

l1 = Len(s1)
l2 = Len(s2)
ReDim d(l1, l2)
For i = 0 To l1
    d(i, 0) = i
Next
For j = 0 To l2
    d(0, j) = j
Next
For i = 1 To l1
    For j = 1 To l2
        If Mid(s1, i, 1) = Mid(s2, j, 1) Then
            d(i, j) = d(i - 1, j - 1)
        Else
            min1 = d(i - 1, j) + 1
            min2 = d(i, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            min2 = d(i - 1, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            d(i, j) = min1
        End If
    Next
Next
Levenshtein = d(l1, l2)
End Function

What this will do is tell you how many insertions and deletions one must do to one string to get to the other. I would try to keep this number low (and last names should be exact).

5

I have a (long) formula that you can use. It's not as well honed as those above – and only works for surname, rather than a full name – but you might find it useful.

So if you have a header row and want to compare A2 with B2, place this in any other cell on that row (e.g., C2) and copy down to the end.

=IF(A2=B2,"EXACT",IF(SUBSTITUTE(A2,"-"," ")=SUBSTITUTE(B2,"-"," "),"Hyphen",IF(LEN(A2)>LEN(B2),IF(LEN(A2)>LEN(SUBSTITUTE(A2,B2,"")),"Whole String",IF(MID(A2,1,1)=MID(B2,1,1),1,0)+IF(MID(A2,2,1)=MID(B2,2,1),1,0)+IF(MID(A2,3,1)=MID(B2,3,1),1,0)+IF(MID(A2,LEN(A2),1)=MID(B2,LEN(B2),1),1,0)+IF(MID(A2,LEN(A2)-1,1)=MID(B2,LEN(B2)-1,1),1,0)+IF(MID(A2,LEN(A2)-2,1)=MID(B2,LEN(B2)-2,1),1,0)&"°"),IF(LEN(B2)>LEN(SUBSTITUTE(B2,A2,"")),"Whole String",IF(MID(A2,1,1)=MID(B2,1,1),1,0)+IF(MID(A2,2,1)=MID(B2,2,1),1,0)+IF(MID(A2,3,1)=MID(B2,3,1),1,0)+IF(MID(A2,LEN(A2),1)=MID(B2,LEN(B2),1),1,0)+IF(MID(A2,LEN(A2)-1,1)=MID(B2,LEN(B2)-1,1),1,0)+IF(MID(A2,LEN(A2)-2,1)=MID(B2,LEN(B2)-2,1),1,0)&"°"))))

This will return:

  • EXACT – if it's an exact match
  • Hyphen – if it's a pair of double-barrelled names but on has a hyphen and the other a space
  • Whole string – if all of one surname is part of the other (e.g., if a Smith has become a French-Smith)

After that it will give you a degree from 0° to 6° depending on the number of points of comparison between the two. (i.e., 6° compares better).

As I say a bit rough and ready, but hopefully gets you in roughly the right ball-park.

1
  • This is so undervalued at all levels. Very nicely done! Do you by any chance have any updates to this? Commented Oct 25, 2017 at 17:56
2

Was searching for something similar. I found the code below. I hope this helps to next user who comes to this question

Returns 91% for Abracadabra / Abrakadabra, 75% for Hollywood Street/Hollyhood Str, 62% for Florence/France and 0 for Disneyland

I'd say it's close enough to what you wanted :)

Public Function Similarity(ByVal String1 As String, _
    ByVal String2 As String, _
    Optional ByRef RetMatch As String, _
    Optional min_match = 1) As Single
Dim b1() As Byte, b2() As Byte
Dim lngLen1 As Long, lngLen2 As Long
Dim lngResult As Long

If UCase(String1) = UCase(String2) Then
    Similarity = 1
Else:
    lngLen1 = Len(String1)
    lngLen2 = Len(String2)
    If (lngLen1 = 0) Or (lngLen2 = 0) Then
        Similarity = 0
    Else:
        b1() = StrConv(UCase(String1), vbFromUnicode)
        b2() = StrConv(UCase(String2), vbFromUnicode)
        lngResult = Similarity_sub(0, lngLen1 - 1, _
        0, lngLen2 - 1, _
        b1, b2, _
        String1, _
        RetMatch, _
        min_match)
        Erase b1
        Erase b2
        If lngLen1 >= lngLen2 Then
            Similarity = lngResult / lngLen1
        Else
            Similarity = lngResult / lngLen2
        End If
    End If
End If

End Function

Private Function Similarity_sub(ByVal start1 As Long, ByVal end1 As Long, _
                                ByVal start2 As Long, ByVal end2 As Long, _
                                ByRef b1() As Byte, ByRef b2() As Byte, _
                                ByVal FirstString As String, _
                                ByRef RetMatch As String, _
                                ByVal min_match As Long, _
                                Optional recur_level As Integer = 0) As Long
'* CALLED BY: Similarity *(RECURSIVE)

Dim lngCurr1 As Long, lngCurr2 As Long
Dim lngMatchAt1 As Long, lngMatchAt2 As Long
Dim I As Long
Dim lngLongestMatch As Long, lngLocalLongestMatch As Long
Dim strRetMatch1 As String, strRetMatch2 As String

If (start1 > end1) Or (start1 < 0) Or (end1 - start1 + 1 < min_match) _
Or (start2 > end2) Or (start2 < 0) Or (end2 - start2 + 1 < min_match) Then
    Exit Function '(exit if start/end is out of string, or length is too short)
End If

For lngCurr1 = start1 To end1
    For lngCurr2 = start2 To end2
        I = 0
        Do Until b1(lngCurr1 + I) <> b2(lngCurr2 + I)
            I = I + 1
            If I > lngLongestMatch Then
                lngMatchAt1 = lngCurr1
                lngMatchAt2 = lngCurr2
                lngLongestMatch = I
            End If
            If (lngCurr1 + I) > end1 Or (lngCurr2 + I) > end2 Then Exit Do
        Loop
    Next lngCurr2
Next lngCurr1

If lngLongestMatch < min_match Then Exit Function

lngLocalLongestMatch = lngLongestMatch
RetMatch = ""

lngLongestMatch = lngLongestMatch _
+ Similarity_sub(start1, lngMatchAt1 - 1, _
start2, lngMatchAt2 - 1, _
b1, b2, _
FirstString, _
strRetMatch1, _
min_match, _
recur_level + 1)
If strRetMatch1 <> "" Then
    RetMatch = RetMatch & strRetMatch1 & "*"
Else
    RetMatch = RetMatch & IIf(recur_level = 0 _
    And lngLocalLongestMatch > 0 _
    And (lngMatchAt1 > 1 Or lngMatchAt2 > 1) _
    , "*", "")
End If


RetMatch = RetMatch & Mid$(FirstString, lngMatchAt1 + 1, lngLocalLongestMatch)


lngLongestMatch = lngLongestMatch _
+ Similarity_sub(lngMatchAt1 + lngLocalLongestMatch, end1, _
lngMatchAt2 + lngLocalLongestMatch, end2, _
b1, b2, _
FirstString, _
strRetMatch2, _
min_match, _
recur_level + 1)

If strRetMatch2 <> "" Then
    RetMatch = RetMatch & "*" & strRetMatch2
Else
    RetMatch = RetMatch & IIf(recur_level = 0 _
    And lngLocalLongestMatch > 0 _
    And ((lngMatchAt1 + lngLocalLongestMatch < end1) _
    Or (lngMatchAt2 + lngLocalLongestMatch < end2)) _
    , "*", "")
End If

Similarity_sub = lngLongestMatch

End Function
1
  • 1
    you're copying the code from this answer without giving any credits
    – phuclv
    Commented Aug 13, 2019 at 6:26
1

You can use the similarity function (pwrSIMILARITY) to compare the strings and get a percentage match of the two. You can make it case-sensitive or not. You'll need to decide what percentage of a match is "close enough" for your needs.

There a reference page at http://officepowerups.com/help-support/excel-function-reference/excel-text-analyzer/pwrsimilarity/.

But it works pretty well for comparing text in column A against column B.

1

Although my solution does not allow identifying very different strings, it is useful for partial match (substring match), e.g. "this is a string" and "a string" will result as "matching":

just add "*" before and after the string to look for into the table.

Usual formula:

  • vlookup(A1,B1:B10,1,0)
  • cerca.vert(A1;B1:B10;1;0)

becomes

  • vlookup("*" & A1 & "*",B1:B10;1,0)
  • cerca.vert("*" & A1 & "*";B1:B10;1;0)

"&" is the "short version" for concatenate()

1

This code scan column a and column b, if it finds any similarity in both columns it shows in yellow. You can use color filter to get the final value. I have not added that part into code.

Sub item_difference()

Range("A1").Select

last_row_all = Range("A65536").End(xlUp).Row
last_row_new = Range("B65536").End(xlUp).Row

Range("A1:B" & last_row_new).Select
With Selection.Interior
    .Pattern = xlSolid
    .PatternColorIndex = xlAutomatic
    .Color = 65535
    .TintAndShade = 0
    .PatternTintAndShade = 0
End With

For i = 1 To last_row_new
For j = 1 To last_row_all

If Range("A" & i).Value = Range("A" & j).Value Then

Range("A" & i & ":B" & i).Select
With Selection.Interior
    .Pattern = xlSolid
    .PatternColorIndex = xlAutomatic
    .ThemeColor = xlThemeColorDark1
    .TintAndShade = 0
  .PatternTintAndShade = 0
End With

End If
Next j
Next i
End Sub

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .