
I want to speed up search results in my application, but I keep getting the same execution times no matter which method I use. Since it's a Django application, I'll provide both the ORM commands and the generated SQL (the backend is PostgreSQL).

First, I enabled GIN indexing and trigram operations (the btree_gin and pg_trgm extensions) on the database:

(screenshot of the extension setup omitted)
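For reference, the same setup can be done from a Django migration; a minimal sketch (the migration dependency is assumed), using the extension operations that ship with django.contrib.postgres:

```python
from django.contrib.postgres.operations import BtreeGinExtension, TrigramExtension
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('reviews', '0001_initial'),  # assumed previous migration
    ]

    operations = [
        TrigramExtension(),   # CREATE EXTENSION IF NOT EXISTS pg_trgm;
        BtreeGinExtension(),  # CREATE EXTENSION IF NOT EXISTS btree_gin;
    ]
```

Note that the database role running the migration needs the privileges to create extensions.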

Second, I created a table with two varchar columns, first_name and last_name (plus an id field as primary key).

from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

I also filled the table with 952 example records, so that Postgres doesn't skip the index just because the data set is too small.
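For reproducibility, the seeding can be sketched with random surname-like strings (the random_name helper is hypothetical, not from the original code; the bulk_create call is the Django part):

```python
import random
import string


def random_name(rng, length=6):
    # Hypothetical helper: capitalized random lowercase-ASCII strings.
    return "".join(rng.choices(string.ascii_lowercase, k=length)).capitalize()


rng = random.Random(42)  # fixed seed so the data set is reproducible
names = [(random_name(rng), random_name(rng)) for _ in range(952)]

# In the Django shell:
# Author.objects.bulk_create(
#     Author(first_name=f, last_name=l) for f, l in names
# )
```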

Next, I ran the following queries on the non-indexed data.

Simple LIKE query:

In [50]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)

In [51]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..24.28 rows=38 width=16) (actual time=0.011..0.242 rows=56 loops=1)
  Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
  Rows Removed by Filter: 896
Planning Time: 0.042 ms
Execution Time: 0.249 ms

Trigram similar:

In [55]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari

In [56]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..21.90 rows=1 width=16) (actual time=0.582..0.582 rows=0 loops=1)
  Filter: ((last_name)::text % 'ari'::text)
  Rows Removed by Filter: 952
Planning Time: 0.033 ms
Execution Time: 0.591 ms
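The zero rows from `%` versus the 84 rows matched by `SIMILARITY(...) > 0` below come from pg_trgm's threshold: `%` only matches when the similarity reaches pg_trgm.similarity_threshold (0.3 by default). A pure-Python sketch of the documented algorithm (single-word inputs assumed; Postgres pads and splits multi-word strings slightly differently):

```python
def trigrams(s):
    # pg_trgm lowercases and pads with two leading and one trailing space.
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    # Shared trigrams divided by total distinct trigrams in both strings.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


# 'maria' shares only the trigram 'ari' with 'ari', so its similarity
# is 1/9 -- well below the default 0.3 threshold used by the % operator.
```

This explains why the `%` filter removed all 952 rows: no last name in the data set was similar enough to a three-letter query, even though some had non-zero similarity.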

And a fancier query that sorts the results:

In [58]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC

In [59]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=38.24..39.03 rows=317 width=20) (actual time=0.680..0.683 rows=84 loops=1)
  Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
  Sort Method: quicksort  Memory: 31kB
  ->  Seq Scan on reviews_author  (cost=0.00..25.07 rows=317 width=20) (actual time=0.021..0.657 rows=84 loops=1)
        Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
        Rows Removed by Filter: 868
Planning Time: 0.062 ms
Execution Time: 0.693 ms

The next step was to create an index:

from django.contrib.postgres.indexes import GinIndex
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    class Meta:
        indexes = [GinIndex(fields=['last_name'])]

This resulted in the following SQL migration:

./manage.py sqlmigrate reviews 0004
BEGIN;
--
-- Alter field score on review
--
--
-- Create index reviews_aut_last_na_a89a84_gin on field(s) last_name of model author
--
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
COMMIT;

Then I ran the same commands again.

LIKE:

In [60]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)

In [61]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..24.28 rows=38 width=16) (actual time=0.009..0.237 rows=56 loops=1)
  Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
  Rows Removed by Filter: 896
Planning Time: 0.089 ms
Execution Time: 0.244 ms

Trigram similar:

In [62]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari

In [63]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..21.90 rows=1 width=16) (actual time=0.740..0.740 rows=0 loops=1)
  Filter: ((last_name)::text % 'ari'::text)
  Rows Removed by Filter: 952
Planning Time: 0.056 ms
Execution Time: 0.750 ms

And the more complex query:

In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC

In [65]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=38.24..39.03 rows=317 width=20) (actual time=0.659..0.662 rows=84 loops=1)
  Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
  Sort Method: quicksort  Memory: 31kB
  ->  Seq Scan on reviews_author  (cost=0.00..25.07 rows=317 width=20) (actual time=0.024..0.643 rows=84 loops=1)
        Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
        Rows Removed by Filter: 868
Planning Time: 0.052 ms
Execution Time: 0.674 ms

The changes in execution times seem insignificant. For the last query, the scan takes 0.643 ms compared to 0.657 ms before; overall the times differ by about 0.02 milliseconds (and the second query even ran a bit slower). Is there some option I'm missing that should be enabled to help performance? Is the data set too small?

Docs I used:

EDIT: I've added a few hundred thousand records (there are now nearly 259,000) and ran the tests again. First without the index:

In [59]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.018..58.630 rows=846 loops=1)
  Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
  Rows Removed by Filter: 258106
Planning Time: 0.046 ms
Execution Time: 58.662 ms

In [60]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather  (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.555..80.710 rows=698 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Parallel Seq Scan on reviews_author  (cost=0.00..3453.06 rows=152 width=16) (actual time=0.503..78.743 rows=349 loops=2)
        Filter: ((last_name)::text % 'bar'::text)
        Rows Removed by Filter: 129127
Planning Time: 0.039 ms
Execution Time: 80.740 ms

In [61]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.214..168.876 rows=14235 loops=1)
  Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
  Sort Method: quicksort  Memory: 1485kB
  ->  Seq Scan on reviews_author  (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.022..165.806 rows=14235 loops=1)
        Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
        Rows Removed by Filter: 244717
Planning Time: 0.052 ms
Execution Time: 169.319 ms

And with the index:

In [62]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.015..59.366 rows=846 loops=1)
  Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
  Rows Removed by Filter: 258106
Planning Time: 0.072 ms
Execution Time: 59.395 ms

In [63]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather  (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.545..80.337 rows=698 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Parallel Seq Scan on reviews_author  (cost=0.00..3453.06 rows=152 width=16) (actual time=0.292..78.502 rows=349 loops=2)
        Filter: ((last_name)::text % 'bar'::text)
        Rows Removed by Filter: 129127
Planning Time: 0.035 ms
Execution Time: 80.369 ms

In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.191..168.890 rows=14235 loops=1)
  Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
  Sort Method: quicksort  Memory: 1485kB
  ->  Seq Scan on reviews_author  (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.029..165.743 rows=14235 loops=1)
        Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
        Rows Removed by Filter: 244717
Planning Time: 0.054 ms
Execution Time: 169.340 ms

The times are still very similar, and the planner seems to be avoiding the GIN index.

  • Just some ideas: leave the ORM out of this and write raw queries to check if it makes any difference. You could add db_index=True to first_name and last_name in your models.py. Use a tool like django-silk or django-debug-toolbar to inspect the queries properly. What results are you expecting? I have no feeling for how long it should take, but everything in the 0.x second range seems not too bad. Commented Dec 1, 2020 at 11:57
  • I've run those queries by hand in the postgres console and got the same results, so I don't know whether that would help. I also doubt that db_index=True would help, since by default it creates a B-tree index, which is of no use for this kind of text search, but I'll try. The last suggestion seems like a good lead, though. I see now that GitLab's queries take from 150 to 8000 units, so the size of the data might be what makes the difference. Thank you.
    – gonczor
    Commented Dec 1, 2020 at 12:06

1 Answer

CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");

This did not create a trigram index. It created a GIN index on the whole string, using the operators from btree_gin (which you don't seem to be using for any good purpose). To make it a trigram index, it needs to look like this:

CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name" gin_trgm_ops);

But I don't know how to get Django to do that; I'm not a Django user.

  • It did help for the trigram_similar operations. I needed to write a raw SQL migration, so if you want to update your answer you'd need to paste this snippet:

        class Migration(migrations.Migration):
            operations = [
                migrations.RunSQL(
                    sql='CREATE INDEX reviews_authors_last_name_gin_idx ON reviews_author USING gin (last_name gin_trgm_ops);',
                    reverse_sql='DROP INDEX reviews_authors_last_name_gin_idx;',
                )
            ]
    – gonczor
    Commented Dec 1, 2020 at 15:19
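For completeness: newer Django versions (2.2+) can declare the operator class directly on the index, avoiding the raw-SQL migration. A sketch (the index name here is an assumption, but Django does require an explicit name whenever opclasses is given):

```python
from django.contrib.postgres.indexes import GinIndex
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    class Meta:
        indexes = [
            GinIndex(
                name="author_last_name_trgm_idx",  # assumed name
                fields=["last_name"],
                opclasses=["gin_trgm_ops"],
            )
        ]
```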
