Say I'm designing a tool that saves code snippets either in a PostgreSQL/MySQL database or on the file system, and I want to search through these snippets. Using a search engine like Sphinx doesn't seem practical, because searching code requires exact text matches.

grep and ack have always worked great, but storing things in a database makes a large collection more manageable in certain ways. I'm wondering how the performance of running grep recursively over a tree of directories compares to running a query like SQL's LIKE or MySQL's REGEXP function over an equivalent number of records with TEXT blobs.
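Concretely, I mean comparing something like a recursive grep -r 'memcpy(' snippets/ against queries of roughly this shape (the snippets table and body column here are just placeholders for illustration):

SELECT * FROM snippets WHERE body LIKE '%memcpy(%';    -- portable LIKE
SELECT * FROM snippets WHERE body REGEXP 'memcpy\\(';  -- MySQL REGEXP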


4 Answers


If you have 1M files to grep through, you will (as best I'm aware) go through each one with a regular expression.

For all intents and purposes, you're going to end up doing the same thing over table rows if you mass-query them using a LIKE operator or a regular expression.

In my own experience with grep, however, I seldom look for something that doesn't contain at least one full word, so you might be able to take advantage of a database to reduce the set in which you're searching.

MySQL has native full-text search features, but I'd recommend against them, because using them means you're not using InnoDB.

You can read about those from Postgres here:

http://www.postgresql.org/docs/current/static/textsearch.html

After creating an index on a tsvector column, you can then do your "grep" in two steps: one to immediately find rows that might vaguely qualify, followed by another that applies your true criteria:

select * from docs where tsvcol @@ :tsquery and (regexp at will);
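Spelled out a bit more, as a minimal sketch: assume a docs table with a text column body; the trigger and index names below are illustrative, not anything from the docs.

-- Add and populate a tsvector column, keep it current with the
-- built-in tsvector_update_trigger, and index it with GIN.
ALTER TABLE docs ADD COLUMN tsvcol tsvector;
UPDATE docs SET tsvcol = to_tsvector('english', body);

CREATE TRIGGER docs_tsv_update
    BEFORE INSERT OR UPDATE ON docs FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsvcol, 'pg_catalog.english', body);

CREATE INDEX docs_tsv_idx ON docs USING gin (tsvcol);

-- Step 1 (@@) is answered from the GIN index and narrows the set to rows
-- containing the word "memcpy"; step 2 (~) runs the exact regexp only
-- over those survivors.
SELECT * FROM docs
WHERE  tsvcol @@ to_tsquery('english', 'memcpy')
AND    body ~ 'memcpy\(dst,';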

That will be significantly faster than anything grep can do.


I can't give you hard numbers, but both will take a long time. My guess is that grep will be faster.

But MySQL supports full-text indexing and searching, which will be faster than grep -- again, my guess.
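For example (a sketch only; the table and column names are made up, and on the MySQL versions current at the time FULLTEXT indexes require the MyISAM engine, as the other answer notes):

CREATE TABLE snippets (
    id   INT PRIMARY KEY AUTO_INCREMENT,
    body TEXT,
    FULLTEXT KEY body_ft (body)
) ENGINE = MyISAM;

-- Word-based search answered from the full-text index ...
SELECT * FROM snippets WHERE MATCH(body) AGAINST('memcpy');

-- ... versus the full table scan that a LIKE/REGEXP query (or grep) does:
SELECT * FROM snippets WHERE body REGEXP 'memcpy\\(';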

Also, I don't understand what the problem with Sphinx or Lucene is. Anyway, here's a benchmark for MySQL, Sphinx and Lucene

  • I think what I meant about Sphinx being inappropriate is that searching for a code snippet often involves matching on punctuation characters, which I think Sphinx excludes from its index.
    – dan
    Commented May 8, 2011 at 12:08
  • I see. I'm familiar with Lucene, and I'm sure you can use it for source-code search, though you'll probably have to write your own Analyzer. How can you do it? See How_can_I_index_java_source_files
    – bpgergo
    Commented May 8, 2011 at 13:36

The internet seems to think that grep uses Boyer-Moore, which makes the query time depend additively (not multiplicatively) on the query size. That isn't all that relevant here, though.

I think grep is near-optimal for a one-time search. But in your case you can do better, since you have repeated searches whose structure you can exploit (e.g. by indexing certain common substrings, as bpgergo hints at; one concrete form of this is sketched below).
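One concrete way to do that in Postgres -- my suggestion, not something the poster names -- is the pg_trgm extension, which indexes every three-character substring and can answer substring searches from the index:

-- A sketch assuming a recent Postgres with pg_trgm available;
-- the snippets/body names are placeholders.
CREATE EXTENSION pg_trgm;
CREATE INDEX snippets_trgm_idx ON snippets USING gin (body gin_trgm_ops);

-- The planner can now satisfy this from the trigram index instead of
-- scanning every row (newer versions can do the same for regexps).
SELECT * FROM snippets WHERE body LIKE '%memcpy(%';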

Also, I'm not sure the regular-expression engine you're thinking of using is optimized for queries containing no special characters; you could try it and see.

You may wish to keep all the files you're searching through in memory to avoid hard-disk slowdowns. This should work unless you're searching a staggering amount of text.

  • MySQL also uses an enhanced Boyer-Moore algorithm on search strings (LIKE and MATCH AGAINST) longer than a given limit; see dev.mysql.com/doc/refman/5.5/en/index-btree-hash.html: "If you use ... LIKE '%string%' and string is longer than three characters, MySQL uses the Turbo Boyer-Moore algorithm to initialize the pattern for the string and then uses this pattern to perform the search more quickly."
    – Johan
    Commented May 8, 2011 at 12:30

If you want a full-text index on code, I would recommend Russ Cox's codesearch tools: https://code.google.com/p/codesearch/

The approach is described in "Regular Expression Matching with a Trigram Index, or How Google Code Search Worked": http://swtch.com/~rsc/regexp/regexp4.html
