Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadhav, Bloomberg L.P.
Never Stop Exploring: 
Pushing the Limits of Solr 
Anirudha Jadhav 
©2014 Bloomberg L.P.
Who am I ? 
• Big Search and Distributed database specialist 
• Built a Search as a Service platform 
• Lead Search Architect @ Bloomberg Vault 
• Credit Derivatives Analytics Engineer @ Bloomberg 
• Master's @ Courant Institute of Mathematical Sciences, New York University
• Passionate about Search, Scuba Diving, Motorcycles and German Shepherds
bloomberg.com/company
Agenda
• Search at Bloomberg
• Goals and Objectives
• A little background
• Factors affecting indexing
• Our tests and benchmarks
• Design for a better NRT indexer
• Future work
• Q/A
Search at Bloomberg
Search at Bloomberg 
• News Search 
• Federated Search 
• Complex re-ranking of search results 
• Archival Search 
• GeoSpatial Search 
• Analytics and Statistics on Search
Objective 
Significantly increase Near Real Time (NRT) indexing throughput 
E.g. building a search application that receives market data
Indexing workflow
Indexing Data Flow in SolrCloud
Indexing Workflow
Consider the sentence:
We were talking about IBM during the fishing trip

Tokenization
A tokenizer splits the stream of characters into a series of tokens.
[We] [were] [talking] [about] [IBM] [during] [the] [fishing] [trip]

Down Casting
Creates tokens by lowercasing all letters and dropping non-letters.
[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip]

Stemming / Lemmatization
Stemming algorithms reduce the words "fishing", "fished", "fish", and "fisher" to the root word "fish".
Lemmatization expands words to their inflected forms (i.e. fishing -> fished, fishes, fish, but not fisher).
[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip]
[talk] [fish]

Stop Word Removal
Remove common stop words ("and", "or", etc.) which introduce noise in the search process.
[talking] [about] [ibm] [fishing] [trip]

Synonym Expansion
Mapping of words based upon a thesaurus (synonyms, acronyms, hypernyms, business rules, etc.).
For example: talk -> chat, IBM -> "big blue", trip -> journey
[talk] [big] [blue] [fish] [journey]
[chat]
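The stages on this slide can be sketched as a toy analysis pipeline. This is a minimal illustration in Python, not how Solr analyzers are implemented: the stop-word list, stem table and synonym map are hypothetical stand-ins chosen to match the slide's example sentence, and for brevity the sketch replaces a token with its stem instead of keeping both forms as the slide's diagram does.

```python
import re

# Hypothetical toy resources; a real analyzer uses far larger ones.
STOP_WORDS = {"we", "were", "during", "the", "and", "or"}
STEMS = {"talking": "talk", "fishing": "fish", "fished": "fish", "fisher": "fish"}
SYNONYMS = {"talk": ["chat"], "ibm": ["big", "blue"], "trip": ["journey"]}


def tokenize(text):
    """Split the character stream into tokens, dropping non-letters."""
    return re.findall(r"[A-Za-z]+", text)


def lowercase(tokens):
    """Down casting: lowercase every token."""
    return [t.lower() for t in tokens]


def stem(tokens):
    """Replace each token with its root form when the toy table knows one."""
    return [STEMS.get(t, t) for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def expand_synonyms(tokens):
    """Keep each token and append its synonyms right after it."""
    expanded = []
    for t in tokens:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    return expanded


def analyze(text):
    """Run the stages in the order the slide lists them."""
    tokens = tokenize(text)
    tokens = lowercase(tokens)
    tokens = stem(tokens)
    tokens = remove_stop_words(tokens)
    return expand_synonyms(tokens)


if __name__ == "__main__":
    print(analyze("We were talking about IBM during the fishing trip"))
```

Each stage is a plain function over a token list, so stages can be reordered or dropped to see how the indexed terms change.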
Designing the Search Index
Designing a good Search Application also involves many aspects of user interaction that directly influence indexing design
• Data Type and Data Distribution
• Server side parameters
• Networking
• Client side parameters
• Query patterns
Factors Affecting Indexing
Data and Distribution of Tokens
Common types of data that we index in a search index
• Textual data (human generated), e.g. messages, news, blogs
• Textual data (machine generated), e.g. logs, tickets
• Numerical data
• Geospatial data
How does this affect search index designs?
• Query speed and indexing speed depend on the size of an index
• Size is dependent on
  • Number of documents in the index
  • Average size of each document
  • Distribution of tokens
  • Index features, e.g. Faceting, Highlighting
Server-side Factors
• Ratio of CPUs to the number of Solr cores running
  • 2 Solr indices per CPU or thread
• Disk space
  • Disk space for Solr index * 2 (head room for merge cycles)
• Memory
  • JVM heap
  • Off heap
  • DocValues
Networking
Cluster design considerations
• Should a cluster span data centers?
  • Latency between data centers
  • Reliability and availability SLAs
• Where does your ZooKeeper ensemble live?
  • How many election members
  • Consider observers to scale ZooKeeper
  • Dynamically promote an observer to election member
Manage concurrent connections on the server
Monitor network latencies for QoS guarantees
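The question of how many election members to run comes down to majority-quorum arithmetic, sketched below in Python. The function names are illustrative only. Observers do not vote, so they are left out of the count: adding them scales reads without changing the quorum size.

```python
def quorum_size(voters):
    """Smallest majority of the voting (election) members."""
    return voters // 2 + 1


def tolerated_failures(voters):
    """How many voting members can fail while a majority remains."""
    return voters - quorum_size(voters)


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7):
        print(f"{n} voters: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

An even number of voters tolerates no more failures than the odd number just below it, which is one reason ensembles are usually sized at 3, 5 or 7 voting members.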
Client-side Factors
• Managing connections and reusing connections
• Which format to use for indexing data
  • javabin
  • csv
  • json
  • xml
• How many simultaneous threads to use
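One way to get a feel for the format question is to serialize the same batch in each text format and compare payload sizes. The sketch below does this in Python for JSON, CSV and Solr's XML `<add>` document shape. The field names (`id`, `ticker`, `price`) are made up for illustration. javabin is omitted because it is the binary format used by the SolrJ Java client. Payload size is only one factor: parsing cost on the server also matters and is not measured here.

```python
import csv
import io
import json
from xml.sax.saxutils import escape, quoteattr


def to_json(docs):
    """Serialize a batch as a JSON array of documents."""
    return json.dumps(docs).encode("utf-8")


def to_csv(docs):
    """Serialize a batch as CSV with a header row of field names."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(docs[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(docs)
    return buf.getvalue().encode("utf-8")


def to_xml(docs):
    """Serialize a batch in the <add><doc><field .../></doc></add> shape."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append(f"<field name={quoteattr(name)}>"
                         f"{escape(str(value))}</field>")
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts).encode("utf-8")


def payload_sizes(docs):
    """Return the encoded size in bytes of the batch in each format."""
    return {
        "json": len(to_json(docs)),
        "csv": len(to_csv(docs)),
        "xml": len(to_xml(docs)),
    }


if __name__ == "__main__":
    # Hypothetical market-data style documents.
    batch = [{"id": str(i), "ticker": "IBM", "price": 100 + i}
             for i in range(1000)]
    for fmt, size in payload_sizes(batch).items():
        print(f"{fmt}: {size} bytes")
```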
Experiments with NRT Indexing
It’s not always efficient to send a single document to Solr for indexing
How do you decide how many documents to send?
Collector: A buffer that collects Solr update documents
• Time Triggers (T)
  • Time-based collector on the client side to batch document payloads to Solr
• Document Size Triggers (S)
  • Document-size-based collector on the client side to batch document payloads to Solr
• Document Number Triggers (N)
  • Document-count-based collector on the client side to batch document payloads to Solr
The collectors are all used simultaneously, in order of priority. The lower-priority collectors act as cut-off backups to safeguard against overflows.
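A client-side collector with all three triggers could look roughly like the following Python sketch. It is an illustration under assumptions, not the implementation from the talk: the class and parameter names are invented, the priority order T, then S, then N is assumed because the slide does not state it, and `flush` is a caller-supplied callback where a real client would send the batch to Solr.

```python
import time


class Collector:
    """Buffers update documents and flushes when a trigger fires."""

    def __init__(self, flush, max_wait_s=0.5, max_bytes=1_000_000,
                 max_docs=1000, clock=time.monotonic):
        self._flush = flush            # callback: flush(batch, reason)
        self._max_wait_s = max_wait_s  # time trigger (T)
        self._max_bytes = max_bytes    # document size trigger (S)
        self._max_docs = max_docs      # document number trigger (N)
        self._clock = clock            # injectable for testing
        self._docs = []
        self._bytes = 0
        self._started = None

    def add(self, doc):
        """Buffer one serialized document (bytes), then check triggers."""
        if not self._docs:
            self._started = self._clock()
        self._docs.append(doc)
        self._bytes += len(doc)
        self._maybe_flush()

    def poll(self):
        """Call periodically so the time trigger fires on a quiet stream."""
        self._maybe_flush()

    def _maybe_flush(self):
        if not self._docs:
            return
        # Assumed priority order: time, then size, then count.
        if self._clock() - self._started >= self._max_wait_s:
            reason = "T"
        elif self._bytes >= self._max_bytes:
            reason = "S"
        elif len(self._docs) >= self._max_docs:
            reason = "N"
        else:
            return
        batch, self._docs, self._bytes = self._docs, [], 0
        self._flush(batch, reason)


if __name__ == "__main__":
    collector = Collector(
        lambda batch, reason: print(f"flush {len(batch)} docs ({reason})"),
        max_docs=3)
    for i in range(7):
        collector.add(f'{{"id": "{i}"}}'.encode("utf-8"))
```

Whichever trigger fires first empties the buffer, so the other two act as upper bounds on latency, payload size and batch length.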
Tests and Benchmarks
Benchmarking Setup 
• Client application sending data to 4-way replicated SolrCloud 
• 5 node Zookeeper ensemble 
• All tests done with a similar dataset ( machine generated text ) 
• We synthesize a high throughput ingest stream, which serves as our input 
• Soft commits set at 1sec
Benchmarking : Time Limit Tests
[Chart: docs/sec against Time Triggers (collection window in ms)]
Benchmarking : Document Limit Tests
[Chart: docs/sec against Document Number Triggers (collection window in number of documents)]
Benchmarking : Byte Limit Tests
[Chart: docs/sec against Document Size Triggers (collection window in bytes)]
Observations 
• On average, we observed a 5x-7x increase in ingestion throughput
• Optimization parameters depend on constantly changing factors
• The tuning variables need to be constantly adjusted for best performance
• How to use this now?
Design for a better NRT indexer
PID Controller
Proportional term (P) - present
Output proportional to the current error value
Integral term (I) - past
Sum of the instantaneous error over time; gives the accumulated offset that should have been corrected previously
Derivative term (D) - future
Calculated from the slope of the error over time (its rate of change), multiplied by the derivative gain
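For reference, the three terms combine in the standard textbook form of a PID controller. The symbols are not from the slides: $e(t)$ is the error between the setpoint $r(t)$ and the measured process variable $y(t)$, $u(t)$ is the controller output, and $K_p$, $K_i$, $K_d$ are the tuning gains.

```latex
e(t) = r(t) - y(t)

u(t) = \underbrace{K_p \, e(t)}_{\text{P: present}}
     + \underbrace{K_i \int_0^{t} e(\tau) \, d\tau}_{\text{I: past}}
     + \underbrace{K_d \, \frac{d e(t)}{d t}}_{\text{D: future}}
```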
PID implementation in the indexer
[Diagram: feedback loop between the client indexer process (indexing threads, PID controller implementation) and SolrCloud. A sampling thread reads the Solr responses to measure the process variable, docs/sec. The control variable is one of the triggers, here Time (T).]
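The loop in this diagram can be sketched in Python as a discrete PID controller that adjusts the time-trigger window until the measured docs/sec reaches a target. Everything concrete here is an assumption for illustration: the gains, the target, the window limits, and above all `simulated_throughput`, a made-up saturating curve standing in for the sampling thread's real measurements. The talk does not say whether the indexer tracks a setpoint or searches for a maximum; this sketch tracks a setpoint.

```python
class PID:
    """Discrete PID controller in positional form."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measured, dt):
        """Return the new control output for one sample of the process."""
        error = self.setpoint - measured
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no slope available on the first sample
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)


def simulated_throughput(window_ms):
    """Hypothetical plant model: docs/sec saturates as the window grows."""
    return 10_000.0 * window_ms / (window_ms + 50.0)


def run(steps=200, target=8_000.0):
    """Drive the time-trigger window toward the target docs/sec."""
    pid = PID(kp=0.005, ki=0.01, kd=0.001, setpoint=target)
    window_ms = 10.0  # initial collection window
    for _ in range(steps):
        measured = simulated_throughput(window_ms)   # process variable
        output = pid.update(measured, dt=1.0)        # one sample per second
        window_ms = min(2_000.0, max(1.0, output))   # clamp control variable
    return window_ms, simulated_throughput(window_ms)


if __name__ == "__main__":
    window, throughput = run()
    print(f"window={window:.1f} ms, throughput={throughput:.0f} docs/sec")
```

In a real indexer the simulated function would be replaced by the docs/sec figure sampled from Solr responses, and the gains would need tuning against the actual cluster.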
Future Work
Future work 
• Perfect the PID indexer 
• Add it to the YCSB benchmarking framework 
• Add other server side parameters on the PID indexer 
• Use the PID indexer along with the YCSB framework to size hardware
Never Stop Exploring: 
Pushing the Limits of Solr 
Anirudha Jadhav , Bloomberg LP 
QUESTIONS ?
