5
\$\begingroup\$

I'm writing a method that writes the code for a loading routine. Given an object from the database, I want to put its properties in the right control on a userform. I have the list of the properties and I have the names of the controls.

For each property, I want my code to search in the controls' name and find the most similar.

It doesn't have to be perfect, it's just something to avoid to rewrite the code over and over for every userform of each project. If it can guess 75%-80% it's ok.

I wrote the code below. The idea is:

  • Check the presence of each character in the original string in each of the words in the list. If we are able to find it add 1 point to the score else subtract 1 point.
  • Check if the position of the character is the same in both the words (+1/-1)
  • Check if the closest characters - left and right - are the same (both match +1, 1 match 0, 0 match -1)

You can use the function as a worksheet one and you can see the scores in the immediate window.

The code does work. I mean, the results make sense.

For example:

Original string: michele

List to check: marta, elehcim, valerio, txtmichele, miche

Most similar according to the code: miche

Is this the most similar? How good developers approach this problem?

I'd like to have your opinion on the idea and if there is a better way to achieve the goal. The code is a mess but it's just a draft, doesn't matter at the moment.

Thank you for your time!

Public Function GetMostSimilar(toString As String, between As Variant) As String
    
    Dim i                   As Long
    Dim ch                  As String
    Dim o                   As Long
    Dim comparison          As Variant
    Dim positionScore       As Double
    Dim presenceScore       As Double
    Dim am                  As ArrayManipulation
    Dim index               As Long
    Dim bestScore           As Double
    Dim bestComparison      As String
    Dim closeCharatersScore As Double
    Dim score               As Double
    
    ' range to array
    between = between.value
    Set am = New ArrayManipulation
    
    ' a low number
    bestScore = -1000
    
    For o = LBound(between) To UBound(between)
        comparison = GetArrayOfCharacters(CStr(between(o, 1))) ' returns 1 based array
        
        positionScore = 0
        presenceScore = 0
        closeCharatersScore = 0
        
        ' loop in characters
        For i = 1 To Len(toString)
            ch = Mid(toString, i, 1)
            ' array manipulation is an object to do stuff with arrays. In this case find the index of something in an array
            index = am.FindIndex(comparison, ch, 0, , False)
            ' method that check for match in left and right characters of the current character. +- 0.5 for each character depending if match
            closeCharatersScore = closeCharatersScore + GetCloseCharactersScore(CStr(between(o, 1)), index, toString, i)
            If index = -1 Then
                presenceScore = presenceScore - 1
                positionScore = positionScore - 1
            Else
                presenceScore = presenceScore + 1
                positionScore = positionScore + IIf(i = index, 1, -1)
                comparison(index) = vbNullString
            End If
        Next i
        score = positionScore + presenceScore + closeCharatersScore
        Debug.Print between(o, 1) & ": " & score & "| POS: " & positionScore & " | Pres: " & presenceScore & " | Close: " & closeCharatersScore
        If score > bestScore Then
            bestScore = score
            bestComparison = between(o, 1)
        End If
    Next o
    
    GetMostSimilar = bestComparison
    
End Function

Private Function GetCloseCharactersScore(comparison As String, index As Long, toString As String, i As Long) As Double
    
    Dim leftOriginal        As String
    Dim rightOriginal       As String
    Dim leftComparison      As String
    Dim rightComparison     As String
    
    On Error Resume Next
    leftOriginal = Mid(toString, i - 1, 1)
    rightOriginal = Mid(toString, i + 1, 1)
    leftComparison = Mid(comparison, index - 1, 1)
    rightComparison = Mid(comparison, index + 1, 1)
    On Error GoTo 0
    
    GetCloseCharactersScore = IIf(leftOriginal = leftComparison, 0.5, -0.5) + IIf(rightOriginal = rightComparison, 0.5, -0.5)
    
End Function

Private Function GetArrayOfCharacters(str As String) As Variant
    
    Dim i           As Long
    
    ReDim temp(1 To Len(str)) As Variant
    For i = 1 To Len(str)
        temp(i) = Mid(str, i, 1)
    Next i
    
    GetArrayOfCharacters = temp
    
End Function
\$\endgroup\$
4
  • 3
    \$\begingroup\$ A nice algorithm for fuzzy string comparisons is called "Levenshtein Distance" - it counts the number of swaps and insertions required to convert one string to another. If you divide by the string length, then you can get a % similarity between 2 strings. It's pretty fast too - google it for more info or take a look here for a VBA implementation (although the accepted answer isn't necessarily the fastest implementation - but try it and you can see if speed is a problem). Using a well known algorithm can help people understand your code more easily \$\endgroup\$
    – Greedo
    Commented Aug 15, 2020 at 11:40
  • \$\begingroup\$ What is ArrayManipulation? \$\endgroup\$
    – TinMan
    Commented Aug 15, 2020 at 20:24
  • \$\begingroup\$ @TinMan it is an object with various methods and functions to manipulate o work with arrays \$\endgroup\$
    – DT1
    Commented Aug 16, 2020 at 7:48
  • 1
    \$\begingroup\$ It should be included in your post. \$\endgroup\$
    – TinMan
    Commented Aug 16, 2020 at 8:51

1 Answer 1

1
\$\begingroup\$

Very interesting post.

Naming Conventions

The success of your code is dependent on how the controls on the userform are named. Is miche the most similar? In my opinion no. Controls will generally, have a prefix of suffix to identify the control type. For this reason, when comparing a word to a list control control names, the control name that contains a complete match should be taken over a partial match. Along the same lines, Camel and Pascal case naming conventions dictate that a the control name capitalization may need to be altered. Why would you give precedence to miche over Michele?

Using Arrays for String Comparisons

Creating an array for comparison as you shorten the match is very inefficient. Using a variant array to store characters is itself inefficient. (See MSDN: Data type summary) Consider that it takes 10 bytes + the string length to store a string and 16-22 bytes need to be reserved for each element on an array. So it takes 16 bytes of memory to store michele as a string and a minimum of 96 bytes as a variant array of characters.
With all things being equal, you can not write a VBA function that will outperform a native VBA function that is written in C++. Instr(), Instr$(), InstrB(), Mid(), Mid$() and MidB() are insanely fast and will outperform anything you try to replace them with. The Instr() functions can also make text comparisons which will ignore the text case.

There is a small performance benefit to using byte arrays but IMO it is not significant enough to merit extra work.

Use the Right Name for the Job

• toString As String: It is usually obvious how Object.toString is to be used. toString does not indicate its context. Consider Match • between As Variant: This is very confusing considering that you are considering characters between the start and end of a string. Consider MatchList

Miscellaneous

' range to array
between = between.value

This throw an error in my test. I assume that it was added when the OP was preparing the code to post.

GetCloseCharactersScore()

Always handle obvious errors don't escape them. Mid() will throw an "Invalid procedure call or argument" if the Index < 1. Mid() will also return a vbNullString if the Index > Length which might cause a false positive (although I doubt it). You should handle

Private Function GetCloseCharactersScore(comparison As String, index As Long, toString As String, i As Long) As Double
    If index > 1 And i > 1 And index < Len(comparison) And index < Len(toString) Then
        Dim leftOriginal        As String
        Dim rightOriginal       As String
        Dim leftComparison      As String
        Dim rightComparison     As String
    
        leftOriginal = Mid(toString, i - 1, 1)
        rightOriginal = Mid(toString, i + 1, 1)
        leftComparison = Mid(comparison, index - 1, 1)
        rightComparison = Mid(comparison, index + 1, 1)
    
        GetCloseCharactersScore = IIf(leftOriginal = leftComparison, 0.5, -0.5) + IIf(rightOriginal = rightComparison, 0.5, -0.5)
    Else
        GetCloseCharactersScore = -0.5
    End If
End Function

Is There a Better Way?

Again, this will depend on your naming conventions. My version takes the number of letters in the match value (from right to left) found in the comparison / length of the match value * weighted value and minuses the number of unmatched letters * a different weighted value to determine the overall score. The comparison is done right to left because you will seldom see a match where the first characters were truncated, it will almost always be the last. The weighted values will probably need to be adjusted but I think the theory is sound.

Public Function ClosestMatch(Match As String, MatchList As Variant) As String
    Dim n As Long
    Dim Item As Variant
    Dim BestMatch As String
    Dim BestScore As Double
    Dim CurrentScore As Double
    
    For Each Item In MatchList
        CurrentScore = MatchScore(Match, Item)
        If CurrentScore > BestScore Or BestScore = 0 Then
            BestMatch = CurrentScore
            BestMatch = Item
        End If
    Next
    
    ClosestMatch = BestMatch
End Function

Public Function MatchScore(ByVal Match As String, ByVal MatchItem As Variant) As Double
    Const FullMatchWeight As Long = 10
    Const UnmatchedCharacterWeight As Long = -1
    
    Dim n As Long
    Dim Score As Double
    
    For n = Len(Match) To 1 Step -1
        If InStr(1, MatchItem, Left(Match, n) > 0, vbTextCompare) Then
            Score = Len(Match) / n * FullMatchWeight
            Exit For
        End If
    Next
    
    Dim UnmatchedCharacterScore As Double
    UnmatchedCharacterScore = Abs(n - Len(MatchItem)) * UnmatchedCharacterWeight
    MatchScore = Score + UnmatchedCharacterScore
    
End Function
    
\$\endgroup\$

Not the answer you're looking for? Browse other questions tagged or ask your own question.