
I'm compiling a database of decently sized documents (anywhere from 20 to 500 characters), and I would like to be able to very quickly (under 1s) search through them.

I started with around 100k rows worth of data, and now have around 4 million. I plan to cap it at around 30 million rows.

At the beginning of the project, I wrote a quick search query using Postgres' full text search and pg_trgm which worked extremely well from an efficiency and accuracy standpoint. At 4M rows, the full text searches are still on the order of milliseconds, but the trigram searches are taking over 2 minutes at times.

Here is the query:

SELECT text
FROM my_table
WHERE text_vector @@ websearch_to_tsquery('simple', 'a string to search')
OR text_vector @@ websearch_to_tsquery('english', 'a string to search')
OR 'a string to search' <<% text
LIMIT 100;

We generate and index the tsvector on insertion, which lets the full text search use text_vector directly.

The reason for adding trigram search is that I want some sort of fuzzy searching. Searching for something like "robots" with full text search works very well, but "robot contests" or "cool robot pics" will leave out a lot of relevant results, since they don't match exactly. I'm less concerned with having 100% accuracy to the query and more concerned with giving a good general picture of the results in the database. I don't care about returning many results either, just the 100 most similar to the query. My thought process is to let the full text search match the first n results and have fuzzy search fill in the next 100 - n.
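
To sketch that idea (illustrative only, reusing the table and search string from above; the CTE names are made up):

-- Full text matches first, then trigram matches top the result up to 100.
WITH fts AS (
    SELECT text
    FROM my_table
    WHERE text_vector @@ websearch_to_tsquery('english', 'a string to search')
    LIMIT 100
),
fuzzy AS (
    SELECT text
    FROM my_table
    WHERE 'a string to search' <<% text
      AND text NOT IN (SELECT text FROM fts)
    LIMIT 100
)
SELECT text FROM fts
UNION ALL
SELECT text FROM fuzzy
LIMIT 100;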

I've added the appropriate indexes for both the full text search and the trigram search, based on Postgres' recommendations. The trigram index is obviously large, and we expect it to keep growing as we add more and more values.

I'm not a Postgres expert, but from my research, I think my main issue may be that the database just isn't using enough compute to use the trigram index quickly. I've tried increasing settings like shared_buffers, work_mem, and effective_cache_size, but for some reason that has no effect. I'm running the database on an instance with 2 CPUs and 4GB of RAM, and this database is the only thing running on it, so I would like to max out the compute.
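
For context, the changes I tried looked roughly like this (the values are illustrative guesses for a 4GB instance, not tuned recommendations):

-- Illustrative values only.
ALTER SYSTEM SET shared_buffers = '1GB';        -- takes effect after a restart
ALTER SYSTEM SET work_mem = '64MB';
ALTER SYSTEM SET effective_cache_size = '3GB';
SELECT pg_reload_conf();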

I'm mainly asking this question to start a discussion amongst people who know more about Postgres than I do, and to see what steps I could take to conduct a fuzzy search on Postgres that prioritizes speed above all. I've also looked into extensions like pgvector to try to achieve the same result, but I think the problem persists: the index is just too big to search fast enough.

My main questions are:

  1. If I just "throw hardware" at the problem, can I get these searches under 1s consistently? I don't mind getting a larger instance, as long as I know that it can handle trigram or vector search well.
  2. Can I achieve what I want to with the instance size (2 CPU, 4GB RAM) I have now? If so, how can I adjust my DB settings to do so?
  3. Is there some simple implementation trick/optimization that I can do that I'm just not aware of?
  4. Are the strings that I'm searching just too large to conduct a trigram/vector search fast enough?
  5. Is Postgres the right tool to use for this problem? If no, are there any better tools that come to mind?

If this is an involved problem, I have a few ideas on how I can speed up the search in a different way. But following the KISS approach, I'd like to try simple similarity/vector search with Postgres before diving into a much more complex solution. I'm relatively new to Postgres, and knowing that it's been around for a while, I'd like to tap into the experienced users and make sure I'm going about this the right way.

EDIT: Below are the definitions of the text and text_vector columns:

text TEXT NOT NULL

text_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', "text")) STORED

And the index I used

CREATE INDEX trgm_idx ON my_table USING GIST (text gist_trgm_ops);

EDIT 2: As requested in the comments, here are some (abridged) examples of matches on 'robot contests' which occur using <<% and not @@. Ideally, I would return these matches using @@ and full text search.

  • robot contests -> AI-generated images are winning art contests
  • robot contests -> The robots are coming to take over since this AI release
  • robit contests -> We won the company's hackathon contest with our new robot

I added the misspelling on purpose. Trigram does a good job with parts of words, but full text does not. I think this is the biggest issue I'd like to solve here.

  • Can you provide a concrete example of something which matches highly against 'robot contests' using <<%, but not using @@?
    – jjanes
    Commented Mar 14, 2023 at 21:21

2 Answers


Trigram indexes don't work well when searching against large strings unless the threshold is set very high, especially with the GiST version of the index. You could try the GIN version instead, but I still wouldn't expect it to be awesome. There are just too many false positives that can't be excluded without visiting them. (In tension with this, only GiST supports the KNN algorithm, which is used when ordering by a distance operator such as <<<->. However, the performance of that algorithm itself falls apart pretty quickly when you are searching against overly long strings.)
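
For reference, the GIN variant uses pg_trgm's gin_trgm_ops opclass (the index name here is arbitrary):

CREATE INDEX trgm_gin_idx ON my_table USING gin (text gin_trgm_ops);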

Based on your examples, for the first two I think what you really want is a tsquery using the OR operator (|) between the terms. For the third one, I think you should implement spell checking as a separate layer of your search. Have a separate dictionary table that stores every word occurring anywhere in your main table; if a query term isn't in that list, you can assume it is misspelled. Then you can search the dictionary table (of short strings) using trigrams for probable matches very efficiently, and either propose to the user that they fix the term, or automatically fix it for them and show the results of the fixed query (possibly with a note that you have done so; that is how Google presents it when it fixes misspellings). The open question is whether you store the dictionary words before they are stemmed or after. And if after, how do you stem the misspelled words to compare them to the stemmed dictionary?
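
A sketch of both ideas (the dictionary table, its contents, and the example terms are invented for illustration):

-- OR the terms: websearch_to_tsquery turns the word OR into the | operator.
SELECT text
FROM my_table
WHERE text_vector @@ websearch_to_tsquery('english', 'robot OR contests');

-- A dictionary of every word in the main table, indexed for trigram
-- lookups; GiST supports KNN ordering, and the strings here are short.
CREATE TABLE dictionary (word text PRIMARY KEY);
CREATE INDEX ON dictionary USING gist (word gist_trgm_ops);

-- Probable corrections for a term that is not in the dictionary.
SELECT word
FROM dictionary
ORDER BY word <-> 'robit'
LIMIT 5;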

As for your other questions, it is going to be hard to scale enough that parallelization alone brings you down from 2 minutes to under 1 second. Or at least, once you've done so, it will be hard to pay the bill. And I am not aware of any other tool that magically solves this. You will need an elaborate system and will have to put the pieces together yourself. Some other tools might make it easier to put those pieces together, but I'm not aware of any that just do it all automatically.

  • Thank you for the thoroughness and recommendations. This answered a lot of my questions. I was definitely looking for a "magic solution" which could make trigram search super fast, either by "hacking" Postgres, or running it on a super powerful machine. Implementing proper search in anything is obviously a hard problem, but I can spend the time and effort to implement something perfect for my use case. Will wait a bit and mark as solved, if nothing else comes up.
    – oseun22
    Commented Mar 17, 2023 at 12:37

Perhaps you just created the wrong index; hard to say, since you don't show your object definitions.

But I'd say that with the trigram indexes, the full-text search is more or less redundant: anything that matches the full-text search should be similar enough. Since you say that you want the 100 most similar entries, you could work with a GiST index and a query as follows:

CREATE INDEX ON my_table USING gist (text gist_trgm_ops);

SELECT text
FROM my_table
WHERE 'a string to search' <<% text
ORDER BY 'a string to search' <<<-> text
LIMIT 100;
  • Appreciate the response. I created the index exactly the way you showed in your response. I will update the question with some object definitions to give more context. A follow up to this, is there a way to use an operator similar to <<<-> but with a similarity threshold? I'd prefer to do the search in a WHERE clause so I could also sort by timestamp or something similar. Alternatively, I could just do a compound query, get about 1000, and then sort from there. Trying this out with my DB, the search is still around 10 seconds. I'm assuming this has to do with my DB settings?
    – oseun22
    Commented Mar 14, 2023 at 17:28
  • It is simple to add a WHERE condition. Sorting the result is trivial with an outer query that uses my query as a subquery. Commented Mar 14, 2023 at 17:35
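
For instance, a sketch of that outer-query idea (the ts timestamp column is invented for illustration):

SELECT text
FROM (SELECT text, ts
      FROM my_table
      WHERE 'a string to search' <<% text
      ORDER BY 'a string to search' <<<-> text
      LIMIT 100) AS best_matches
ORDER BY ts DESC;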
