3

I have run into a problem while trying to implement fulltext search. To me it seems like math/statistics more than anything. The data pulled from the database is book titles, so the scores returned by the query can have very close values (example: 9.98; 9.97; 9.78, which are all very relevant results) or a wide spread (example: 9.99; 8.2; 2.1, where the first two are relevant and the third is noise). I can't figure out how to manipulate the query result to remove the irrelevant hits. Standard deviation doesn't work, because it filters out good results in my first example, and various normalization methods either omit relevant results or include irrelevant ones. Any thoughts or ideas, please.

Thanks. Victor

1
  • I don't know the exact constraints and use case of your project, but when making a book title search feature, I wonder... is it best for you to worry about deciding what's relevant? A user could pick poor search terms and end up with what they really wanted toward the bottom of the rankings for that particular search. Also, will the results be shown in a paged manner? Maybe it's not worth worrying about outliers, and you could just let your paging mechanism hide the less relevant options without totally keeping the user from finding them.
    – curtisdf
    Commented Jul 10, 2012 at 18:54

1 Answer

1

I was just working on a problem much like this, but with time-based data rather than fulltext. I found the 68-95-99.7 rule, which among other things points out that in a true bell curve about 95% of the results are within 2 standard deviations of the mean. I took this knowledge and decided to throw out 5% of the results as outliers. You could do similarly -- omit the 5% of fulltext results having the lowest relevancy scores.
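Here is a minimal sketch of that 5% cut, assuming the results come back as (title, score) pairs. The function name `drop_lowest_fraction` and the sample data are made up for illustration, not taken from any particular search library.

```python
import math


def drop_lowest_fraction(results, fraction=0.05):
    """Sort (title, score) pairs by score, best first, and drop the
    lowest-scoring `fraction` of them (rounded down)."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    n_drop = math.floor(len(ranked) * fraction)
    return ranked[:len(ranked) - n_drop]


# Hypothetical data: 20 titles with scores 1.0 through 20.0.
sample = [("title %d" % i, float(i)) for i in range(1, 21)]
kept = drop_lowest_fraction(sample)
print(len(kept), "of", len(sample), "results kept")
```

Because the number of dropped rows is rounded down, any result set with fewer than 20 rows loses nothing at 5%, so this only starts to trim on larger result sets.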

Another option might be to choose a certain threshold relevancy score, or a certain minimum number of results you want to show. Or both -- you could display by whichever criteria yields more results.
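A rough sketch of the "whichever yields more results" idea, again assuming (title, score) pairs. The name `filter_results` and the default values for `min_score` and `min_count` are placeholders I picked for illustration; you would have to tune them against your own data.

```python
def filter_results(results, min_score=5.0, min_count=3):
    """Return whichever is longer: the hits scoring at least `min_score`,
    or the top `min_count` hits regardless of score."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    above_threshold = [r for r in ranked if r[1] >= min_score]
    top_n = ranked[:min_count]
    if len(above_threshold) >= len(top_n):
        return above_threshold
    return top_n


# Scores from the wide-spread example in the question; titles are hypothetical.
hits = [("title a", 9.99), ("title b", 8.2), ("title c", 2.1)]
print(filter_results(hits, min_score=5.0, min_count=1))
```

With `min_count=1` the threshold decides and the 2.1 hit is dropped. With `min_count=3` all three hits come back, so the minimum count acts as a floor that can let noise through when few hits clear the threshold.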

1
  • Thanks for the suggestion. This is what I was thinking about too, and exactly where I stumbled. Here's the example: the query for "mark twain stories" returned two hits with scores: "mark twain short stories" (8.87) and "mark twain best short stories" (8.25); stddev for these is .2192, and the second result is outside 2 sigma but inside 3 sigma, as expected :) I can't use 3 sigma, because all the outliers would be included. After days of reading and manipulating data I'm still in the woods.
    Commented Jul 11, 2012 at 11:53
