4

I have been trying to edit a SQL query in a friend's code in order for it to do a Dot product between two embedding vectors: a query vector and a vector it retrieves from the Postgres database. This database stores paragraphs from different documents, and each row has not only the embedding for the specific paragraph, but the page number, document name, document link and the original paragraph itself.

The original query, which use the Null safe operator to do a comparison between the embeddings as strings, looks like this:

 SELECT content, document, page, line, link 
 FROM documents 
 ORDER BY embedding <=> %(embedding)s 
 LIMIT %(n)s

For reference, the full section of code looks like this:

string = query

# instantiate sentence transformer for embedding
model = SentenceTransformer('all-MiniLM-L6-v2')

# embed the input string
embedding = model.encode(string)

# establish connection to database
conn = psycopg.connect(dbname='rag', user='rag', password='rag', autocommit=True)
conn.execute('CREATE EXTENSION IF NOT EXISTS vector')

# create embedding array for pgvector
embedding_pgarray = "[" + ",".join(map(str, embedding)) + "]"

# Execute a SQL query to find the top 5 vectors closest to the embedding vector
query =  """ 
    SELECT content, document, page, line, link
    FROM documents
    ORDER BY  embedding <=> %(embedding)s
    LIMIT %(n)s
"""

neighbors = conn.execute(query, {'embedding': embedding_pgarray, 'n':n}).fetchall()

# organize query results into dataframe
df = pd.DataFrame()
for neighbor in neighbors:
    content = neighbor[0].strip()
    document = neighbor[1]
    page = neighbor[2]
    #print("Page: ", page)
    line = neighbor[3]
    link = neighbor[4]

    df1 = pd.DataFrame({'Content':content, 'Document':document,'Page':page, 'Line':line, 'Link':link}, index=[0])
    df = pd.concat((df, df1))

df = df.reset_index(drop=True)
return df   

I edited the SQL query to look like this:

query =  """ 
    SELECT content, document, page, line, link
    FROM documents
    ORDER BY DOT_PRODUCT(embedding, embedding) 
    LIMIT %(n)s
"""

When trying to use DOT_PRODUCT(), I seem to get an error saying that this function is not recognized:

UndefinedFunction: function dot_product(vector, vector) does not exist
LINE 4: ORDER BY DOT_PRODUCT(embedding, embedding)
^
HINT: no function matches the given name and argument types. You might need to add explicit type casts.

Note: I should probably also include the code where the object is inserted into the database to show what the columns are:

conn = psycopg.connect(dbname='rag', user='rag', password='rag', autocommit=True)
conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
register_vector(conn)

# replace old data
conn.execute('DROP TABLE IF EXISTS documents')
conn.execute('CREATE TABLE documents (id bigserial PRIMARY KEY, line integer, word text, document text, content text, embedding vector(384), link text, page text)')

model = SentenceTransformer('all-MiniLM-L6-v2')

for i in range(len(context_list)):
    doc_name = []
    line = []
    words = []
    link = []
    page = []
    for j in range(len(context_list[i])):
        doc_name.append(file_names[i])
        line.append(j)
        page.append('N/A')
        words.append(key_words[i])
        link.append(links[i])
    embeddings = model.encode(context_list[i])

    for content, embedding, word, doc_name, line, link, page in zip(context_list[i], embeddings, words, doc_name, line, link, page):
        conn.execute('INSERT INTO documents (line, word, document, content, embedding, link, page) VALUES (%s, %s, %s, %s, %s, %s, %s)', (line, word, doc_name, content, embedding, link, page))

I have tried different variations of the DOT_PRODUCT including VECTOR_COSINE. I've also tried different type casts for the embedding objects. All I can think is that the type of the vector for one of the objects is off.

I have checked the type for both, one is a vector of Float32 objects, the other a vector of int. I am not sure how to change the object type for a vector within a SQL query itself, and even then, I am not sure if that is the problem with DOT_PRODUCT that I am experiencing. I have also made sure I have the right extensions for DOT_PRODUCT, as I believe the extension required for that is vector.

Any ideas or other things I can try?

6
  • If this is a postgresql question, why do we see sql-server tag??? They're 2 completely different product.
    – Eric
    Commented Jul 8 at 17:13
  • 2
    Make sure the extension is in a schema that's in your search path. show search_path; select extname,extnamespace::regnamespace from pg_extension; See this demo. The dot_product seems to only be used as pseudo-code in pgvector examples while the extension doesn't actually seem to ship with a function named dot_product(). Also, see this.
    – Zegarek
    Commented Jul 8 at 17:53
  • Eric - Sorry I thought sql-server was a general tag referring to any sort of sql database issues, not a product, so my mistake.
    – Sean Last
    Commented Jul 9 at 15:30
  • Zegarek - These resources help greatly, thanks! So within extension Pgvector, it appears dot product can only be done with <#>, if I am understanding correctly? Dot_product itself was only pseudo code in pgvector and not an actual function? I just tried <#> and it appears to work
    – Sean Last
    Commented Jul 9 at 15:34
  • @SeanLast correct. It's nowhere in the source as an actual function name exposed in SQL.
    – Zegarek
    Commented Jul 9 at 15:54

1 Answer 1

1

In short: there's no function named dot_product() in pgvector. Use <#>, the inner product distance operator.

You can query system catalogs pg_proc, pg_depend and pg_extension to check if you actually have a specific function, what extension it belongs to and which schema/namespace it's in:
demo at db<>fiddle

select e.extnamespace::regnamespace
     , e.extname
     , p.pronamespace::regnamespace
     , p.oid::regprocedure as "function with argument types"
from pg_catalog.pg_proc      as p
join pg_catalog.pg_depend    as d on (d.objid    = p.oid)
join pg_catalog.pg_extension as e on (d.refobjid = e.oid)
where p.proname ilike 'dot_product'
order by 2,4;

In a standard PostgreSQL installation, as of version 16, this returns no results - there's no built-in dot_product function. This statement you ran:

CREATE EXTENSION IF NOT EXISTS vector

Most likely attempts to add pgvector extension (most likely, because nothing stops you from building a different extension and exposing it under the name vector). Because of if not exists it won't do anything if the extension had already been added, but if it's in a schema that's not in your search_path, it'll act as if it wasn't there unless you either schema-qualify all references to anything from that extension

select some_other_schema.st_astext('point(0 0)'::some_other_schema.geometry);

Or add that schema to your search path:

set search_path='some_other_schema';
select st_astext(st_geomfromtext('point(0 0)'));
1
  • 1
    Thanks, this was the missing piece I was looking for!
    – Sean Last
    Commented Jul 15 at 16:51

Not the answer you're looking for? Browse other questions tagged or ask your own question.