Real time fulltext search with Sphinx

Adrian Nuta // Sphinxsearch // 2013

Quick intro
Sphinx search
• high performance fulltext search engine
• written in C++
• serving searches since 2001
• can work on any modern architecture
• distributed under GPL2 licence

Why a search engine?
• performance
o a search engine delivers faster searches while using
fewer resources
• quality of search
o built-in FTS in databases doesn’t offer advanced
search options
• independent FTS engines offer speed not
only for fulltext searches, but also for other types, such as
geo or faceted searches

Classic way of indexing in Sphinx
on-disk (classic) method:
• the index is built from a data source
• to update the index you need to reindex the source
• in addition to the main index, a secondary
(delta) index can be used to reindex only the latest
changes
• easy, because indexing doesn’t require changes
in the application, but:
• reindexing, even of a delta, can put pressure
on the data source and the system

Real time indexing in Sphinx
• the index has no data source
• everything that needs to be indexed must be
added to the index by the application
• you can add/update/remove documents at any time
• compared to the classic method, RT requires
changes in the application
• performance is the same or nearly the same as
a classic index
• the only specific requirement:
workers = threads

RealTime index definition
index rt {
    type = rt
    rt_field = title
    rt_field = content
    rt_attr_uint = user_id
    rt_attr_string = title
    rt_attr_json = metadata
}

Schema - Fields
rt_field - fulltext field; the raw text is not stored
Tokenization features:
wildcarding (prefix or infix),
morphology, custom charset definition,
stopwords, synonyms, segmentation, HTML
stripping, paragraph/sentence detection, etc.

Schema - Attributes
• rt_attr_uint & rt_attr_bigint
• rt_attr_bool
• rt_attr_float
• rt_attr_multi & rt_attr_multi64 -
integer set
• rt_attr_timestamp
• rt_attr_string - actual text stored, kept in
memory, used only for display, sorting and
grouping.
• rt_attr_json - full support for JSON
documents

Quick intro to SphinxQL
• our SQL dialect
• any MySQL client can be used to connect to
Sphinx
• MySQL server is not required!
• full document updates are only possible with
SphinxQL
• to enable it, add to the searchd section of the config:
listen = host:port:mysql41
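
Because SphinxQL speaks the MySQL wire protocol, an application can reach searchd with an ordinary MySQL driver. The following is a minimal sketch, not an official client: it assumes the third-party PyMySQL package, and the host 127.0.0.1 and port 9306 are placeholder values that must match whatever the `listen = host:port:mysql41` line declares.

```python
# Sketch: query searchd over SphinxQL with a generic MySQL driver.
# Assumptions: PyMySQL is installed; host/port below are placeholders.

def sphinxql_params(host="127.0.0.1", port=9306):
    """Connection settings that mirror a `listen = host:port:mysql41` line."""
    # No database name is passed; the user value is a placeholder.
    return {"host": host, "port": port, "user": "", "autocommit": True}


def run_query(sql, **overrides):
    """Execute one SphinxQL statement and return all result rows."""
    import pymysql  # imported lazily so this module loads without the driver

    conn = pymysql.connect(**sphinxql_params(**overrides))
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()
```

With a running daemon, a call would look like `run_query("SELECT * FROM rt WHERE MATCH('hello')")`.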

Content insert
mysql> INSERT INTO rt
(id,title,content,user_id,metadata)
VALUES(100,'My title','Some long content to search',10,
'{"image_id":1,"props":[20,30,40]}');

Full content replace
mysql> REPLACE INTO rt
(id,title,content,user_id,metadata)
VALUES(100,'My title','Some long content to search',10,
'{"image_id":1,"props":[20,30,40]}');
• needed for text field, JSON and string attribute
updates
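
Since an RT index is fed by the application, the INSERT and REPLACE statements above usually get generated from application data. The sketch below shows one way to do that; `quote` and `build_write` are hypothetical helper names, and the escaping rule (backslash before backslashes and single quotes) is an assumption about what is sufficient for simple values.

```python
import json


def quote(value):
    """Render a Python value as a SphinxQL literal (sketch, not exhaustive)."""
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, (dict, list)):
        # JSON attributes are sent as a compact JSON string.
        value = json.dumps(value, separators=(",", ":"))
    # Escape backslashes first, then single quotes.
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + text + "'"


def build_write(verb, index, doc):
    """Build an INSERT or REPLACE statement from a dict of column values."""
    if verb not in ("INSERT", "REPLACE"):
        raise ValueError("verb must be INSERT or REPLACE")
    columns = ",".join(doc)
    values = ",".join(quote(v) for v in doc.values())
    return f"{verb} INTO {index} ({columns}) VALUES ({values})"


doc = {
    "id": 100,
    "title": "My title",
    "content": "Some long content to search",
    "user_id": 10,
    "metadata": {"image_id": 1, "props": [20, 30, 40]},
}
print(build_write("REPLACE", "rt", doc))
```

Using REPLACE for every write keeps the application logic simple, because the same statement covers both new documents and full-document updates.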

Updating numerics
• For numeric attributes, including MVA:
mysql> UPDATE rt SET user_id = 10 WHERE id = 100;
• For numeric JSON elements it’s possible to
do in-place updates:
mysql> UPDATE rt SET metadata.image_id = 1234 WHERE id = 100;

Deleting
mysql> DELETE FROM rt WHERE id = 100;
mysql> DELETE FROM rt WHERE user_id > 100;
mysql> TRUNCATE RTINDEX rt;
• TRUNCATE RTINDEX empties the memory shard, deletes all disk shards and
releases the index binlogs

Adding new attributes
mysql> ALTER TABLE rt ADD COLUMN gid INTEGER;
• only for int/bigint/float/bool attributes for
now

Searching
• no difference between searching an RT index and a classic
index
• dict = keywords is required for wildcard search

Relevancy ranking
• built-in rankers:
o proximity_bm25 (default)
o none, matchany, wordcount, fieldmask, bm25
• custom ranker - build your own ranking expression
example:
ranker = proximity_bm25
is the same as
ranker = expr('sum(lcs*user_weight)*1000+bm25')
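
To make the expression above concrete, here is a small worked calculation in Python. It only re-evaluates the formula `sum(lcs*user_weight)*1000+bm25` by hand; the per-field `lcs` values, the weights and the `bm25` number are made-up inputs for illustration, not output from Sphinx.

```python
def proximity_bm25(fields, bm25):
    """Evaluate sum(lcs*user_weight)*1000+bm25.

    fields: list of (lcs, user_weight) pairs, one per fulltext field.
    bm25:   the document-level BM25 value.
    """
    return sum(lcs * weight for lcs, weight in fields) * 1000 + bm25


# Hypothetical document: title has lcs=2 with weight 10,
# content has lcs=1 with weight 1, and bm25 is 650.
print(proximity_bm25([(2, 10), (1, 1)], 650))  # 21650
```

The factor of 1000 means the phrase-proximity part dominates, and bm25 mostly breaks ties between documents with the same proximity score.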

Tokenization settings example
index rt {
    …
    charset_type = utf-8
    dict = keywords
    min_word_len = 2
    min_infix_len = 3
    morphology = stem_en
    enable_star = 1
    …
}

Operators on fulltext fields
• Boolean: hello | world, hello !world
• phrase: "hello world"
• proximity: "hello world"~10
• quorum: "world is a beautiful place"/3
• exact form: =cats and =dogs
• strict order: cats << and << dogs
• zone limit: ZONE:(h2,h4) cats and dogs
• SENTENCE: all SENTENCE words SENTENCE "in
one sentence"
• PARAGRAPH: "this search" PARAGRAPH "is fast"
• selected fields only: @(title,body) hello world
• excluded fields: @!(title,body) hello world
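
Because characters such as `@`, `|`, `!`, `~` and `"` are operators, raw user input passed to a fulltext query can be parsed as syntax. A common precaution is to backslash-escape them first. The sketch below is a hypothetical `escape_match` helper; its character set is modelled on what the API's EscapeString method is understood to handle and should be treated as an assumption, not a definitive list.

```python
import re

# Operator characters that this sketch prefixes with a backslash.
# The exact set is an assumption; adjust it to the Sphinx version in use.
_SPECIAL = re.compile(r'([\\()|\-!@~"&/^$=<])')


def escape_match(text):
    """Backslash-escape query-syntax characters in user-supplied text."""
    return _SPECIAL.sub(r"\\\1", text)


print(escape_match("@title hello"))   # \@title hello
print(escape_match("rock & roll"))    # rock \& roll
```

The escaped text still has to be quoted as a SQL string literal before it is placed inside a SphinxQL statement.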

Using API
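<?php
require("sphinxapi.php");   // client library shipped with Sphinx
$cl = new SphinxClient();
// search the 'rt' index; Query() returns false on failure
$res = $cl->Query('search me now', 'rt');
if ($res === false) {
    die("Query failed: " . $cl->GetLastError());
}
print_r($res);
Official: PHP, Python, Ruby, Java, C
Unofficial: JS (Node.js), Perl, C++, Haskell,
.NET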

Using SphinxQL
mysql> SELECT * FROM rt
WHERE MATCH('"search me fuzzy"~10') AND featured = 1
LIMIT 0,20;
mysql> SELECT * FROM rt
WHERE MATCH('"search me fuzzy"~10 @tag computers') AND featured = 1
GROUP BY user_id ORDER BY title ASC LIMIT 30,60
OPTION field_weights=(title=10,content=1),
ranker=expr('sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25');

Boolean filtering
mysql> SELECT *,
views > 10 OR category = 4 AS cond
FROM rt WHERE
MATCH('"search me proximity"~10') AND
featured = 1 AND cond = 1
GROUP BY user_id ORDER BY title ASC
LIMIT 30,60 OPTION ranker=sph04;

Geo search
mysql> SELECT *, GEODIST(lat,long,0.71147,-1.29153) AS distance
FROM rt WHERE distance < 1000 ORDER BY distance ASC;
mysql> SELECT *, GEODIST(lat,long,40.76439,-73.99976,
{in=degrees,out=miles,method=adaptive}) AS distance
FROM rt WHERE distance < 10 ORDER BY distance ASC;
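
The two queries above use the same point: without options GEODIST takes coordinates in radians and returns meters, while the second form passes degrees and asks for miles. A small sketch of the conversion follows; `geodist_query` is a hypothetical helper, and the attribute names `lat` and `long` are taken from the queries above.

```python
import math


def geodist_query(index, lat_deg, lon_deg, max_meters):
    """Build a radius search using the default radians-in, meters-out form."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        f"SELECT *, GEODIST(lat,long,{lat:.5f},{lon:.5f}) AS distance "
        f"FROM {index} WHERE distance < {max_meters} ORDER BY distance ASC"
    )


# Same point as in the degree-based query above.
print(geodist_query("rt", 40.76439, -73.99976, 1000))
```

Converting in the application keeps the stored attributes in one unit and avoids mixing degree and radian values in the same index.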

Multi-queries
mysql> DELIMITER
mysql> SELECT *,COUNT(*) AS counter FROM rt WHERE MATCH('search me')
GROUP BY property_one ORDER BY counter DESC;
SELECT *,COUNT(*) AS counter FROM rt WHERE MATCH('search me')
GROUP BY property_two ORDER BY counter DESC;
SELECT *,COUNT(*) AS counter FROM rt WHERE MATCH('search me')
GROUP BY property_three ORDER BY counter DESC;

• used for faceting
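
The facet batch above repeats one query with a different GROUP BY attribute each time, so it is easy to generate. A minimal sketch, assuming a hypothetical `facet_queries` helper and the attribute names used above:

```python
def facet_queries(index, match, attrs):
    """Return one grouped query per facet attribute, joined into a batch."""
    # Escape the fulltext expression for use inside a SQL string literal.
    safe = match.replace("\\", "\\\\").replace("'", "\\'")
    return ";".join(
        f"SELECT *,COUNT(*) AS counter FROM {index} WHERE MATCH('{safe}') "
        f"GROUP BY {attr} ORDER BY counter DESC"
        for attr in attrs
    )


batch = facet_queries(
    "rt", "search me", ["property_one", "property_two", "property_three"]
)
print(batch)
```

Sending the statements together lets the daemon see that they share the same fulltext part, which is the point of using a multi-query for facets.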

Internal architecture
Each RT index is a sharded index consisting of:
• one memory shard for latest content
• one or more disk shards

Internal shards management
rt_mem_limit = maximum size of the memory
shard
When the memory shard is full, it is flushed to disk as a new disk
shard.
• OPTIMIZE INDEX rt - merges all disk shards
into one.
o Merging too intensive? Throttle it with rt_merge_iops
and rt_merge_maxiosize
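
The flush-and-merge cycle described above can be pictured with a toy model. This is only an illustration of the idea, not how Sphinx is implemented: `ToyRtIndex` is a hypothetical class, and its limit counts documents, whereas rt_mem_limit is a size in memory.

```python
class ToyRtIndex:
    """Toy model: one memory shard plus a growing list of disk shards."""

    def __init__(self, mem_limit):
        self.mem_limit = mem_limit  # documents, not bytes (simplification)
        self.memory_shard = []
        self.disk_shards = []

    def insert(self, doc):
        self.memory_shard.append(doc)
        if len(self.memory_shard) >= self.mem_limit:
            # Memory shard is full: it becomes a new disk shard.
            self.disk_shards.append(self.memory_shard)
            self.memory_shard = []

    def optimize(self):
        """Merge all disk shards into one, like OPTIMIZE INDEX."""
        if len(self.disk_shards) > 1:
            merged = [doc for shard in self.disk_shards for doc in shard]
            self.disk_shards = [merged]

    def search(self, predicate):
        """A search has to visit every disk shard and the memory shard."""
        shards = self.disk_shards + [self.memory_shard]
        return [doc for shard in shards for doc in shard if predicate(doc)]


idx = ToyRtIndex(mem_limit=2)
for n in range(5):
    idx.insert(n)
print(len(idx.disk_shards), idx.memory_shard)  # 2 [4]
idx.optimize()
print(len(idx.disk_shards))                    # 1
```

The model shows why many small disk shards make each search touch more pieces, and why merging them periodically is worthwhile.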

Binlog support
Sphinx supports binlogs, so the memory shard is
not lost after a crash
• binlog_flush
o similar to innodb_flush_log_at_trx_commit
o 0 - flush and sync every second - fastest, but up to 1 second of writes can be lost
o 1 - flush and sync every transaction - safest, but
slowest
o 2 - flush every transaction, sync every second - best
balance, default mode
• binlog_path
o binlog_path = # an empty value disables logging
Fast RT setup using classic index
• Create a classic index to get the initial data.
• Declare an RT index.
• mysql> ATTACH INDEX classic TO RTINDEX rt
o transforms the classic index into an RT index
o the operation is almost instant
o in essence it is a file rename: the classic index
becomes an RT disk shard

Sphinx uses 1 CPU core per
index
More power?
Distribute!

Distributed RT index
Update on each shard, search on everything
index distributed
{
    type = distributed
    local = rtlocal_one
    local = rtlocal_two
    agent = some.ip:9312:rtremote_one
}
• don’t forget to set dist_threads = x in the searchd section
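
"Update on each shard, search on everything" means the application has to decide which RT index receives each write, while reads go through the distributed index. One simple way to do this is to route by document id, sketched below. The modulo rule is an assumption chosen for illustration; the shard names come from the definition above, and `rtremote_one` lives on another server, so its writes would go over a connection to that host.

```python
# RT indexes that receive writes; reads go to the distributed index instead.
SHARDS = ["rtlocal_one", "rtlocal_two", "rtremote_one"]
SEARCH_INDEX = "distributed"


def shard_for(doc_id, shards=SHARDS):
    """Pick the RT index that owns a document (simple modulo routing)."""
    return shards[doc_id % len(shards)]


def delete_statement(doc_id):
    """Writes are addressed to the owning shard, never to the distributed index."""
    return f"DELETE FROM {shard_for(doc_id)} WHERE id = {doc_id}"


print(shard_for(100))         # rtlocal_two
print(delete_statement(102))  # DELETE FROM rtlocal_one WHERE id = 102
```

A stable routing rule matters because a later REPLACE or DELETE for the same id must reach the shard that holds the document.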

Copy RT index from one server to
another
• just simulate a daemon restart:
• searchd --stopwait
o flushes the memory shard to disk
• Copy all index files to the new server.
• Add the RT index definition to sphinx.conf on the new server.
• Start searchd on the new server.

Questions?
www.sphinxsearch.com
Docs: http://sphinxsearch.com/docs/
Wiki: http://sphinxsearch.com/wiki/
Official blog: http://sphinxsearch.com/blog/
SVN repository: https://code.google.com/p/sphinxsearch/
