I have a local environment with MySQL 5.7.19 (on Windows 10 Pro, French) and a prod server with MySQL 5.7.31 (Ubuntu Linux 16.04.5).

The data is synchronised from prod to the local environment. I have a FULLTEXT index on 3 columns and a simple query:

SELECT MATCH (r0_.title, r0_.description, r0_.tag_text)
       AGAINST ('+poulet* +carotte*' IN BOOLEAN MODE) AS sclr_0,
       r0_.id AS id_1, r0_.title AS title_2, r0_.description AS description_3,
       r0_.url AS url_4, r0_.image AS image_5, r0_.slug AS slug_6, r0_.click AS click_7, r0_.tag_text AS tag_text_8, r0_.active AS active_9, r0_.created_at AS created_at_10, r0_.updated_at AS updated_at_11
    FROM recipe r0_
    WHERE r0_.active = 1
    HAVING sclr_0 >= 1
    ORDER BY sclr_0 DESC;

On the local environment: 98 results
On prod: 0 results

Table schema:

CREATE TABLE `recipe` (
  `id` int(11) NOT NULL,
  `blog_id` int(11) NOT NULL,
  `title` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `description` longtext COLLATE utf8mb4_unicode_ci NOT NULL,
  `url` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `image` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `slug` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `click` smallint(6) NOT NULL,
  `created_at` datetime NOT NULL,
  `updated_at` datetime NOT NULL,
  `tag_text` varchar(1000) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `active` tinyint(1) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

ALTER TABLE `recipe`
  ADD PRIMARY KEY (`id`),
  ADD KEY `IDX_DA88B137DAE07E97` (`blog_id`),
  ADD KEY `IDX_DA88B1374B1EFC02` (`active`),
  ADD KEY `IDX_DA88B1378B8E8428` (`created_at`);
ALTER TABLE `recipe` ADD FULLTEXT KEY `IDX_DA88B1372B36786B6DE44026D5841871`
        (`title`,`description`,`tag_text`);

Prod actually has more data, because new recipes have been added there, yet the query returns no results.

  • Pick a result you got on local. On prod, run your query without your current where clause, but replace it with WHERE id = the_id_you_picked_on_local. Check if the score and the value for active match your expectation and/or if the row exists. Add your findings to your question.
    – Solarflare
    Commented Aug 17, 2020 at 8:23
  • For the row with the highest score: on local, 10.884532928466797; on the prod server, 0.19886906445026398. Why this difference with the same data?
    – Ionik
    Commented Aug 17, 2020 at 13:29
  • The score is based on relevance in the complete table (e.g. the more often your search term occurs in other rows, the lower the score), see e.g. here. So if you added a lot of carrot recipes to prod, it may have lowered the score there. In any case, the absolute value does not have much meaning; its main goal is to order the results relative to each other, mainly to do something like: order by score desc limit 20, and even this can raise questions.
    – Solarflare
    Commented Aug 17, 2020 at 14:06
  • Solarflare has the explanation (and it should be made into an Answer). HAVING sclr_0 >= 1 removes "matches" differently when there is a different set of rows.
    – Rick James
    Commented Aug 17, 2020 at 16:21
  • I don't understand why the same set of data doesn't give the same score. I understand that the score is lower when I have more results, but with the same data! The problem with just a limit and ordering by score: if I use a limit of 40 and there are only 25 results, the other rows don't match the query at all...
    – Ionik
    Commented Aug 17, 2020 at 20:23

1 Answer

The relevancy score is calculated based on the content of the complete table:

InnoDB uses a variation of the “term frequency-inverse document frequency” (TF-IDF) weighting system to rank a document's relevance for a given full-text search query. The TF-IDF weighting is based on how frequently a word appears in a document, offset by how frequently the word appears in all documents in the collection. In other words, the more frequently a word appears in a document, and the less frequently the word appears in the document collection, the higher the document is ranked.

"Document" here means a single row, "document collection" means all rows. The manual contains the exact formula, but the important thing is: since you have more recipes on prod compared to local, the score will be different. If you e.g. added more recipes containing carrots, the score will go down; if you added recipes that don't contain your search terms, the score will go up.
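For reference, the formula the MySQL manual gives for a single search word can be written as below. The numbers in the worked example are made up for illustration and are not taken from your tables; for a query with several words, the manual describes the final value as the sum of the per-word values.

```latex
\begin{align*}
\mathrm{IDF}  &= \log_{10}\!\left(\frac{\text{total\_records}}{\text{matching\_records}}\right) \\
\mathrm{rank} &= \mathrm{TF} \times \mathrm{IDF} \times \mathrm{IDF} \\[6pt]
% Hypothetical numbers: the same row (TF = 1) in a table of 1000 rows
\text{10 matching rows:}  \quad \mathrm{IDF} &= \log_{10}(1000/10) = 2,
  & \mathrm{rank} &= 1 \times 2 \times 2 = 4 \\
\text{100 matching rows:} \quad \mathrm{IDF} &= \log_{10}(1000/100) = 1,
  & \mathrm{rank} &= 1 \times 1 \times 1 = 1 \\
\text{500 matching rows:} \quad \mathrm{IDF} &= \log_{10}(1000/500) \approx 0.301,
  & \mathrm{rank} &\approx 0.091
\end{align*}
```

In this made-up example the row itself never changes, yet its score moves from well above 1 to well below 1 purely because other rows containing the word were added.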

This is completely independent of how good the single result on its own actually is! A chicken carrot stew is a good fit to your search, but the absolute score will vary if you also have a recipe for carrot cake in your database or not.

So the absolute value of the score itself is usually not a good criterion to filter on, e.g. with your HAVING sclr_0 >= 1; it is better used as a way to order the results you get, e.g. with order by score desc, usually including a limit.
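A minimal sketch of that idea, using the table and columns from the question: the MATCH ... AGAINST expression in the WHERE clause only keeps rows that match at all, and the score is used purely for ordering. The alias score and the limit of 40 are illustrative choices, not values from the original query.

```sql
-- Sketch: keep every row that matches, rank by score, cap the result count.
-- "score" and LIMIT 40 are assumptions chosen for illustration.
SELECT r0_.id,
       r0_.title,
       MATCH (r0_.title, r0_.description, r0_.tag_text)
           AGAINST ('+poulet* +carotte*' IN BOOLEAN MODE) AS score
FROM recipe r0_
WHERE r0_.active = 1
  -- A bare MATCH ... AGAINST in WHERE means "there is a match" (score > 0).
  AND MATCH (r0_.title, r0_.description, r0_.tag_text)
          AGAINST ('+poulet* +carotte*' IN BOOLEAN MODE)
ORDER BY score DESC
LIMIT 40;
```

Because the WHERE clause already excludes non-matching rows, a limit of 40 returns at most 40 rows, and fewer if only 25 rows match.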

It is unlikely that you will find a good absolute minimum value for your score (except for 0) that would make sense in general:

  • If you find a nice value for now, it may be too high in 2 weeks if carrots become more popular and you add recipes for those (similar to your experience on prod). Or vice versa: if you used a specific value of 1 that gets rid of unwanted, lower-scoring results, they might reappear in 2 weeks if you add carrot-unrelated recipes, not because those unwanted results are suddenly better, but because they became rarer.
  • If you found a nice value that fits searches containing carrots, it may not be a good value for other search terms. If you e.g. search for a frequent ingredient, maybe "sugar", you will still expect results that contain "sugar" even though the absolute value will be low, just because it is used more often than carrots.

But a recipe that uses the word sugar very often (as it may be an important ingredient, maybe a recipe for caramel) will have a higher score than one that only mentions it once ("add some sugar"), so you can use the value to order your results relative to each other.
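To see how the "document collection" differs between the two environments, one option is to run a few counts on both local and prod and compare them. This is only a rough check using the table from the question; the numbers InnoDB uses internally for the score need not match these counts exactly.

```sql
-- Total number of rows ("documents") in the table.
SELECT COUNT(*) AS total_rows
FROM recipe;

-- Rows matching each search term on its own.
SELECT COUNT(*) AS rows_with_poulet
FROM recipe
WHERE MATCH (title, description, tag_text)
      AGAINST ('poulet*' IN BOOLEAN MODE);

SELECT COUNT(*) AS rows_with_carotte
FROM recipe
WHERE MATCH (title, description, tag_text)
      AGAINST ('carotte*' IN BOOLEAN MODE);
```

If the proportion of matching rows is much higher on prod, lower scores there are what the formula would lead you to expect.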

  • Very good explanation. But I don't understand why, with the same data, the result is not identical. If I use where score > 0, is that OK, and will I then get no results that lack the word?
    – Ionik
    Commented Aug 18, 2020 at 8:54
  • Is it possible that the relevance is computed from the whole database in MySQL? The only difference, I think, is that in the other database I have more rows with this word.
    – Ionik
    Commented Aug 18, 2020 at 9:09
  • The relevance is not based on the whole database, but on the content of that fulltext index (e.g. all the data that is in your 3 columns of that table) (I hoped I made that clear, it is what the manual calls "document collection"). The important thing to note is: if you have more rows, it is NOT the same data anymore! >0 always works (it means: "found"), and you can actually use where match(...) against(...) without even saying >0, which just means "there is a match". If you choose any specific value > 0, you will run into the problems described.
    – Solarflare
    Commented Aug 18, 2020 at 11:12