We have a list of keywords associated with a list of documents. The list was created from frequency counts of the words in the document text. We are trying to add a weight to the keywords based on whether or not they appear in the document name. For instance, if we had a document called Agency_Solutions.doc, then the keyword agency would sort higher on the list than telephone.
To complicate matters, every document has the, a, or an as its top keyword, based on counts. Of course, all of those need to be excluded; I've set up a VLOOKUP column with 171 'common' words for exclusion.
Here's my problem: if I use MATCH(WORD,TITLE,0), Agency does not equal Agency_Solutions (or Agency Solutions; I used SUBSTITUTE to create 'clean' versions of all the titles) and doesn't get weighted. If I use SEARCH(WORD,TITLE), I weight a because a appears in Agency_Solutions. FIND would return identical results to SEARCH in this instance. Rock. Hard place.
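To make the two failure modes concrete, here is a minimal Python sketch of the same comparisons outside the spreadsheet. The function names are hypothetical, and the lowercasing only approximates how MATCH and SEARCH ignore case (wildcards are not modelled).

```python
def exact_match(word: str, title: str) -> bool:
    """Rough analogue of MATCH(WORD,TITLE,0): the whole title must equal the word."""
    return word.lower() == title.lower()


def substring_match(word: str, title: str) -> bool:
    """Rough analogue of SEARCH(WORD,TITLE): case-insensitive substring test."""
    return word.lower() in title.lower()


title = "Agency_Solutions"

# Exact matching misses a genuine title word.
print(exact_match("agency", title))
# Substring matching finds the title word...
print(substring_match("agency", title))
# ...but also accepts a stop word that is only a fragment of "Agency".
print(substring_match("a", title))
```

Neither test treats the keyword as a standalone word within the title, which is the behaviour I'm after.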
I've tried a couple iterations of things, but never get results that identify the keyword as a standalone substring within the document name. Any ideas?
EDIT: Here's some data
Exclusion List (paste into col A)
a
an
is
the
what
when
who
Document, keyword, count (Cols B, C, & D)
Keyboard_and_mouse_problems.txt the 15
Keyboard_and_mouse_problems.txt an 15
Keyboard_and_mouse_problems.txt a 14
Keyboard_and_mouse_problems.txt when 12
Keyboard_and_mouse_problems.txt system 8
Keyboard_and_mouse_problems.txt keyboard 8
Keyboard_and_mouse_problems.txt mouse 8
Keyboard_and_mouse_problems.txt when 9
Keyboard_and_mouse_problems.txt what 9
Keyboard_and_mouse_problems.txt who 8
Keyboard_and_mouse_problems.txt is 8
Keyboard_and_mouse_problems.txt phone 6
Keyboard_and_mouse_problems.txt help 6
Keyboard_and_mouse_problems.txt desk 5
Keyboard_and_mouse_problems.txt cable 4
Keyboard_and_mouse_problems.txt jack 4
Agency_Solutions.txt X 2
Agency_Solutions.txt c 1
Agency_Solutions.txt on 1
Then, my formulae:
Col E =IFERROR(VLOOKUP(C2,$A$2:$A$225,1,0),"notFound") Is this in the exclusion list?
Col F =IFERROR(VLOOKUP(C2,$A$2:$A$225,1,0),"") Exclude this word
Col G =IF(F2=C2,0,C2) Include this word
Col H =IF(ISNUMBER(SEARCH(C2,B2)),100,0) Title Weight
Col I =IF(G2=0,0,D2+H2) Weighted Keywords
Col J =IF(AND(H2=100,G2=0),"BAD","OK") OK or Bad calculations
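For anyone who would rather reason about this outside Excel, the column logic can be mirrored roughly in Python. This is only a sketch under assumptions: score_row is a hypothetical helper, the exclusion set is the seven-word sample above rather than the full 171 words, and SEARCH is approximated by a case-insensitive substring test.

```python
# Sample exclusion list from the post (the real sheet has 171 entries).
EXCLUSIONS = {"a", "an", "is", "the", "what", "when", "who"}


def score_row(document: str, keyword: str, count: int) -> dict:
    """Hypothetical helper mirroring columns F to J for a single row."""
    word = keyword.lower()
    excluded = word in EXCLUSIONS                        # Col F: word is on the exclusion list
    in_title = word in document.lower()                  # SEARCH approximated as a substring test
    title_weight = 100 if in_title else 0                # Col H
    weighted = 0 if excluded else count + title_weight   # Col I
    status = "BAD" if in_title and excluded else "OK"    # Col J
    return {"keyword": keyword, "weighted": weighted, "status": status}


rows = [
    ("Keyboard_and_mouse_problems.txt", "a", 14),
    ("Keyboard_and_mouse_problems.txt", "keyboard", 8),
    ("Keyboard_and_mouse_problems.txt", "system", 8),
    ("Agency_Solutions.txt", "X", 2),
    ("Agency_Solutions.txt", "c", 1),
    ("Agency_Solutions.txt", "on", 1),
]

for document, keyword, count in rows:
    print(document, score_row(document, keyword, count))
```

In this sketch the stop word a is flagged BAD as expected, but X, c, and on from the Agency_Solutions.txt rows also pick up the title weight (through .txt, Agency, and Solutions) and are still marked OK, because none of them is on the exclusion list.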