2

We have a list of keywords associated with a list of documents. The list was created from frequency counts of the words in the document text. We are trying to add a weight to the keywords based on whether or not they appear in the document name or not. For instance, if we had a document called Agency_Solutions.doc, then the keyword agency would sort higher on the list than telephone.

To complicate matters, every document has either the, a, an as it's top keyword, based on counts. Of course all that needs to be excluded; I've set up a VLOOKUP column with 171 'common' words for exclusion.

Here's my problem: if I MATCH(WORD,TITLE,0), Agency does not equal Agency_Solutions (or Agency Solutions; I used SUBSTITUTE to create 'clean' versions of the all the titles) and doesn't get weighted. If I SEARCH(WORD,TITLE) I weight a because a appears in Agency_Solutions. FIND would return identical results to SEARCH in this instance. Rock. Hard place.

I've tried a couple iterations of things, but never get results that identify the keyword as a standalone substring within the document name. Any ideas?

EDIT: Here's some data

Exclusion List (paste into col A)

a
an
is
the
what
when
who

Document, keyword, count (Cols B, C, & D)

Keyboard_and_mouse_problems.txt the 15
Keyboard_and_mouse_problems.txt an  15
Keyboard_and_mouse_problems.txt a   14
Keyboard_and_mouse_problems.txt when    12
Keyboard_and_mouse_problems.txt system  8
Keyboard_and_mouse_problems.txt keyboard    8
Keyboard_and_mouse_problems.txt mouse   8
Keyboard_and_mouse_problems.txt when    9
Keyboard_and_mouse_problems.txt what    9
Keyboard_and_mouse_problems.txt who 8
Keyboard_and_mouse_problems.txt is  8
Keyboard_and_mouse_problems.txt phone   6
Keyboard_and_mouse_problems.txt help    6
Keyboard_and_mouse_problems.txt desk    5
Keyboard_and_mouse_problems.txt cable   4
Keyboard_and_mouse_problems.txt jack    4

Agency_Solutions.txt    X   2
Agency_Solutions.txt    c   1
Agency_Solutions.txt    on  1

Then, my formulae:

Col E   =IFERROR(VLOOKUP(C2,$A$2:$A$225,1,0),"notFound")    Is this in the exclusion list?
Col F   =IFERROR(VLOOKUP(C2,$A$2:$A$225,1,0),"")        Exclude this word
Col G   =IF(F2=C2,0,C2)                     Include this word
Col H   =IF(ISNUMBER(SEARCH(C2,B2)),100,0)          Title Weight
Col I   =IF(G2=0,0,D2+H2)                   Weighted Keywords
Col J   =IF(AND(H2=100,G2=0),"BAD","OK")            OK or Bad calculations

2 Answers 2

1

If you only have spaces between words you can search for the word with spaces either side to avoid partial matches, e.g.

=ISNUMBER(SEARCH(" "&WORD&" "," "&TITLE&" "))

....although that falls over if you have punctuation like _ or . in TITLE.....but you can use any number of SUBSTITUTE functions to replace those with spaces like this

=ISNUMBER(SEARCH(" "&WORD&" "," "&SUBSTITUTE(SUBSTITUTE(TITLE,"_"," "),"."," ")&" "))

3
  • Then we have something like "Agency...." which DOESN'T have a space. :| But good idea; I'd thought of that too. :)
    – dwwilson66
    Commented Feb 19, 2013 at 19:19
  • 1
    Do you mean a title that starts with "Agency....."? My suggestion caters for that because it adds spaces to the start and end of the title too - did you try it? If WORD is Agency and TITLE is Agency appleton then the first formula returns TRUE, but with the same TITLE you'd get FALSE for apple because it's not a full match Commented Feb 19, 2013 at 20:39
  • Y'know, I totally didn't see the spaces on either end of the title. That may work for me. Let me give it a spin. Good call!
    – dwwilson66
    Commented Feb 19, 2013 at 20:41
3

You should give a try to VLOOKUP with wildcards, e.g. if we have apple in A1, =VLOOKUP("*"&A1&"*";B:C;2;0) will match any string that has apple regardless of position. See for more my other answer: https://superuser.com/a/552742/187330

Tip: to avoid case-sensetivity, use LOWER function for comparing.

7
  • Thanks, @Peter L. This question is my next challenge after VLOOKUPing the universe per the post you mention above. My problem now is how to include apple while also EXCLUDING appleton. ...Or, how to exclude apple while INCLUDING appleton. I'm not sure I can do that in the same function.
    – dwwilson66
    Commented Feb 19, 2013 at 18:24
  • Please provide full list (or at least a relevant sample) of your strings array.
    – Peter L.
    Commented Feb 19, 2013 at 18:33
  • @dwwilson66 I think I have a solution, but I'd rather test it on real data instead of generating fake samples.
    – Peter L.
    Commented Feb 19, 2013 at 18:38
  • @dwwilson66 add it as text to original post via edit, just 10-20 strings with mentioned "apples")
    – Peter L.
    Commented Feb 19, 2013 at 19:56
  • Done and done. When all the data & formulae are pasted into the appropriate columns, Col J will display OK or BAD based on what I expect in Col I.
    – dwwilson66
    Commented Feb 19, 2013 at 20:16

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .