Integrating Spark and Solr-(Timothy Potter, Lucidworks)
•  Indexing from Spark
•  Reading data from Solr
•  Solr data as a Spark SQL DataFrame
•  Interacting with Solr from the Spark shell
•  Document Matching
•  Reading Term vectors from Solr for MLlib
Integrating Solr & Spark
About Me …
•  Solr user since 2010, committer since April 2014, work for
Lucidworks, PMC member ~ May 2015
•  Focus mainly on SolrCloud features … and bin/solr!
✓  Release manager for Lucene / Solr 5.1
•  Co-author of Solr in Action
•  Other contributions include Solr on YARN, Solr Scale
Toolkit, Solr-Storm, and Spark-Solr integration projects on GitHub
About Solr
•  Vibrant, thriving open source community
•  Solr 5.2 just released!
✓  Pluggable authentication and authorization
✓  ~2x indexing performance w/ replication
✓  Field cardinality estimation using HyperLogLog
✓  Rule-based replica placement strategy (rack awareness)
•  Deploy to YARN cluster using Slider

Lucidworks Fusion
Spark Streaming Example: Solr as Sink
JavaReceiverInputDStream<Status> tweets =
TwitterUtils.createStream(jssc, null, filters);
Various transformations / enrichments on each tweet (e.g. sentiment analysis, language detection)
JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    // Convert a twitter4j Status object into a SolrInputDocument
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc = new SolrInputDocument();
      …
      return doc;
    }});
class TwitterToSolrStreamProcessor extends SparkApp.StreamProcessor
SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);
Slide Legend
Provided by Spark
Custom Java / Scala code
Provided by Lucidworks
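The third argument passed to SolrSupport.indexDStreamOfDocs on this slide (100) looks like a batch size, although the slide does not say so. Assuming that reading is right, the idea of sending documents in fixed-size batches rather than one at a time can be sketched in plain Java. BatchingSketch and toBatches are hypothetical names used only for illustration; they are not part of spark-solr.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: group a stream of documents into fixed-size batches.
final class BatchingSketch {

    // Drain the iterator into lists holding at most batchSize elements each.
    static <T> List<List<T>> toBatches(Iterator<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>(batchSize);
        while (docs.hasNext()) {
            current.add(docs.next());
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>(batchSize);
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // final, partially filled batch
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            ids.add("tweet-" + i);
        }
        for (List<String> batch : toBatches(ids.iterator(), 100)) {
            // A real sink would send each batch to Solr in a single update request.
            System.out.println("would send a batch of " + batch.size() + " docs");
        }
    }
}
```

Larger batches mean fewer round trips to Solr, at the cost of holding more documents in memory before each send.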
Spark Streaming Example: Solr as Sink
Direct updates from Spark to shard leaders

Coming Soon! ShardPartitioner
•  Custom partitioning scheme for RDD using Solr’s DocRouter
•  Stream docs directly to each shard leader using metadata from ZooKeeper,
document shard assignment, and ConcurrentUpdateSolrClient
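The partitioning idea can be illustrated without Spark or Solr on the classpath. The sketch below is a simplified stand-in for Solr's DocRouter, not its actual algorithm: it assumes each shard owns an equal slice of a 32-bit hash range and uses String.hashCode() where Solr uses its own hash function. ShardRoutingSketch, shardFor, and groupByShard are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: assign document ids to shards by hash range.
final class ShardRoutingSketch {

    // Map a document id to a shard index in [0, numShards).
    static int shardFor(String docId, int numShards) {
        // Shift the signed 32-bit hash into [0, 2^32) so ranges are easy to compute.
        long unsigned = (long) docId.hashCode() - Integer.MIN_VALUE;
        long rangeSize = (1L << 32) / numShards;
        int shard = (int) (unsigned / rangeSize);
        // The last shard absorbs the remainder when 2^32 is not divisible by numShards.
        return Math.min(shard, numShards - 1);
    }

    // Group ids by target shard, so each group could be streamed to one shard leader.
    static Map<Integer, List<String>> groupByShard(List<String> docIds, int numShards) {
        Map<Integer, List<String>> byShard = new LinkedHashMap<>();
        for (int s = 0; s < numShards; s++) {
            byShard.put(s, new ArrayList<>());
        }
        for (String id : docIds) {
            byShard.get(shardFor(id, numShards)).add(id);
        }
        return byShard;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("tweet-1", "tweet-2", "tweet-3", "tweet-4");
        groupByShard(ids, 2).forEach((shard, docs) ->
            System.out.println("shard " + shard + " <- " + docs));
    }
}
```

Because every document in a group targets the same shard, one client per partition can send updates straight to that shard's leader instead of relying on Solr to forward them.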
SolrRDD: Reading data from Solr into Spark
•  Can execute any query and expose as an RDD
•  SolrRDD produces JavaRDD<SolrDocument>	
•  Use deep-paging if needed (cursorMark)
•  Stream docs from Solr (vs. building lists on the server-side)
•  More parallelism using a range filter on a numeric field (_version_)
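The cursorMark bullet can be illustrated with a small, self-contained simulation. With Solr cursors, the client starts at cursorMark=*, passes back the cursor returned with each page, and stops once the returned cursor equals the one it sent. The in-memory "server" below (fetch) is a hypothetical stand-in that uses the last id seen as the cursor; real Solr cursor values are opaque strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of a cursor-based deep-paging loop.
final class CursorPagingSketch {

    static final class Page {
        final List<String> docs;
        final String nextCursor;

        Page(List<String> docs, String nextCursor) {
            this.docs = docs;
            this.nextCursor = nextCursor;
        }
    }

    // Stand-in for one request: return up to `rows` ids that sort after the cursor.
    static Page fetch(TreeSet<String> index, String cursor, int rows) {
        Iterable<String> remaining =
            "*".equals(cursor) ? index : index.tailSet(cursor, false);
        List<String> docs = new ArrayList<>();
        String next = cursor;
        for (String id : remaining) {
            if (docs.size() == rows) {
                break;
            }
            docs.add(id);
            next = id;
        }
        return new Page(docs, next);
    }

    // Client loop: keep requesting pages until the cursor stops advancing.
    static List<String> readAll(TreeSet<String> index, int rows) {
        List<String> all = new ArrayList<>();
        String cursor = "*";
        while (true) {
            Page page = fetch(index, cursor, rows);
            all.addAll(page.docs);
            if (page.nextCursor.equals(cursor)) {
                break; // no progress means the result set is exhausted
            }
            cursor = page.nextCursor;
        }
        return all;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>(Arrays.asList("d1", "d2", "d3", "d4", "d5"));
        System.out.println(readAll(index, 2));
    }
}
```

Each request only needs the cursor from the previous page, so the server never has to build and skip over a long list of earlier results.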
SolrRDD: Reading data from Solr into Spark
[Diagram: a Solr collection (Shard 1, Shard 2) read into SolrRDD partitions (Partition 1, Partition 2); the driver reads collection metadata, and results are streamed back from Solr]
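The "more parallelism using a range filter" idea can be sketched as splitting a numeric range into sub-ranges, each expressed as a Solr range filter that one task could apply to its own query. How SolrRDD finds the minimum and maximum _version_ values and how it applies the filters is not shown on the slides, so RangeSplitSketch and rangeFilters below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split [min, max] into contiguous range filters.
final class RangeSplitSketch {

    // Build up to `splits` inclusive filters of the form field:[lo TO hi].
    // Assumes max - min + 1 and max + step do not overflow a long.
    static List<String> rangeFilters(String field, long min, long max, int splits) {
        if (splits <= 0 || min > max) {
            throw new IllegalArgumentException("need splits > 0 and min <= max");
        }
        long span = max - min + 1;
        long step = Math.max(1, (span + splits - 1) / splits); // ceiling division
        List<String> filters = new ArrayList<>();
        for (long lo = min; lo <= max; lo += step) {
            long hi = Math.min(max, lo + step - 1);
            filters.add(field + ":[" + lo + " TO " + hi + "]");
        }
        return filters;
    }

    public static void main(String[] args) {
        for (String fq : rangeFilters("_version_", 1, 10, 3)) {
            System.out.println("fq=" + fq);
        }
    }
}
```

The sub-ranges are disjoint and together cover the whole interval, so no document is read twice or skipped.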
Spark SQL
Query Solr, then expose results as a SQL table

Query Solr from the Spark Shell
Interactive data mining with the full power of Solr queries
Document Matching using Stored Queries
•  For each document, determine which of a large set of stored queries match
•  Useful for alerts, alternative flow paths through a stream, etc.
•  Index a micro-batch into an embedded (in-memory) Solr instance and then determine which queries match
•  Matching framework; you have to decide where to load the stored queries from and what to do when matches are found
•  Scale it using Spark … if you need to scale to many queries, check out Luwak
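The matching flow above can be sketched in plain Java. This is a deliberately simplified illustration: each stored query is modeled as a case-insensitive substring predicate over the document text, whereas the approach on the slide indexes each micro-batch into an embedded Solr instance and runs real Solr queries. StoredQueryMatchSketch, termQuery, and match are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: find which stored queries match each document in a batch.
final class StoredQueryMatchSketch {

    // Stand-in for a stored query: matches when the text contains the term.
    static Predicate<String> termQuery(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        return text -> text.toLowerCase(Locale.ROOT).contains(needle);
    }

    // For each document (id -> text), collect the names of all matching queries.
    static Map<String, List<String>> match(Map<String, String> batch,
                                           Map<String, Predicate<String>> queries) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : batch.entrySet()) {
            List<String> hits = new ArrayList<>();
            for (Map.Entry<String, Predicate<String>> query : queries.entrySet()) {
                if (query.getValue().test(doc.getValue())) {
                    hits.add(query.getKey());
                }
            }
            result.put(doc.getKey(), hits);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Predicate<String>> queries = new LinkedHashMap<>();
        queries.put("solr-alert", termQuery("solr"));
        queries.put("spark-alert", termQuery("spark"));

        Map<String, String> batch = new LinkedHashMap<>();
        batch.put("tweet-1", "Indexing from Spark into Solr");
        batch.put("tweet-2", "Nothing relevant here");

        // What to do with the matches (alert, route, tag) is left to the caller.
        match(batch, queries).forEach((id, hits) -> System.out.println(id + " -> " + hits));
    }
}
```

Where the queries are loaded from and what happens on a match are exactly the two decisions the slide leaves to the user of the framework.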
Document Matching using Stored Queries
Stored Queries
Twitter map()
JavaReceiverInputDStream<Status> tweets =
TwitterUtils.createStream(jssc, null, filters);
JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    // Convert a twitter4j Status object into a SolrInputDocument
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc = new SolrInputDocument();
      …
      return doc;
    }});
JavaDStream<SolrInputDocument> enriched =
SolrSupport.filterDocuments(docFilterContext, …);
Get queries
Index docs into an EmbeddedSolrServer, initialized from configs stored in ZooKeeper
Key abstraction to allow you to plug in how to store the queries and what action to take when docs match
Reading Term Vectors from Solr
•  Pull TF/IDF (or just TF) for each term in a field for each document in query
results from Solr
•  Can be used to construct RDD<Vector> which can then be passed to MLLib:
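To make the "term vector" idea concrete, the sketch below turns the terms of one document into a fixed-length term-frequency vector by hashing each term to an index. The fixed length suggests this kind of feature hashing, but that is an assumption; the slides do not describe how the vectors are built. HashedTermFrequencySketch and termFrequencies are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: hashed term-frequency vector for one document.
final class HashedTermFrequencySketch {

    // Count terms into a vector of length numFeatures, indexed by term hash.
    static double[] termFrequencies(List<String> terms, int numFeatures) {
        if (numFeatures <= 0) {
            throw new IllegalArgumentException("numFeatures must be positive");
        }
        double[] vector = new double[numFeatures];
        for (String term : terms) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(term.hashCode(), numFeatures);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("solr", "spark", "solr", "search");
        System.out.println(Arrays.toString(termFrequencies(terms, 8)));
    }
}
```

Different terms can collide on the same index, so a larger numFeatures trades memory for fewer collisions.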

Wrap-up and Q & A
Need more use cases …
Feel free to reach out to me with questions: / @thelabdude

  • 2. •  Indexing from Spark •  Reading data from Solr •  Solr data as a Spark SQL DataFrame •  Interacting with Solr from the Spark shell •  Document Matching •  Reading Term vectors from Solr for MLlib Integrating Solr & Spark
  • 3. About Me … •  Solr user since 2010, committer since April 2014, work for Lucidworks, PMC member ~ May 2015 •  Focus mainly on SolrCloud features … and bin/solr! ✓  Release manager for Lucene / Solr 5.1 •  Co-author of Solr in Action •  Other contributions include Solr on YARN, Solr Scale Toolkit, Solr-Storm, and Spark-Solr integration projects on GitHub
  • 4. About Solr •  Vibrant, thriving open source community •  Solr 5.2 just released! ✓  Pluggable authentication and authorization ✓  ~2x indexing performance w/ replication ✓  Field cardinality estimation using HyperLogLog ✓  Rule-based replica placement strategy (rack awareness) •  Deploy to YARN cluster using Slider
  • 6. Spark Streaming Example: Solr as Sink
    Diagram labels: Twitter, map(), Solr
        ./spark-submit --master MASTER --class com.lucidworks.spark.SparkApp spark-solr-1.0.jar \
            twitter-to-solr -zkHost localhost:2181 -collection social
    class TwitterToSolrStreamProcessor extends SparkApp.StreamProcessor
        JavaReceiverInputDStream<Status> tweets =
          TwitterUtils.createStream(jssc, null, filters);
    Various transformations / enrichments on each tweet (e.g. sentiment analysis, language detection)
        JavaDStream<SolrInputDocument> docs = tweets.map(
          new Function<Status,SolrInputDocument>() {
            // Convert a twitter4j Status object into a SolrInputDocument
            public SolrInputDocument call(Status status) {
              SolrInputDocument doc = new SolrInputDocument();
              …
              return doc;
            }});
        SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);
    Slide Legend: Provided by Spark; Custom Java / Scala code; Provided by Lucidworks
  • 7. Spark Streaming Example: Solr as Sink
        // start receiving a stream of tweets ...
        JavaReceiverInputDStream<Status> tweets =
          TwitterUtils.createStream(jssc, null, filters);
        // map incoming tweets into SolrInputDocument objects for indexing in Solr
        JavaDStream<SolrInputDocument> docs = tweets.map(
          new Function<Status,SolrInputDocument>() {
            public SolrInputDocument call(Status status) {
              SolrInputDocument doc =
                SolrSupport.autoMapToSolrInputDoc("tweet-"+status.getId(), status, null);
              doc.setField("provider_s", "twitter");
              return doc;
            }
          }
        );
        // when ready, send the docs into a SolrCloud cluster
        SolrSupport.indexDStreamOfDocs(zkHost, collection, docs);
  • 8. Direct updates from Spark to shard leaders (diagram labels: server-side, client-side)
  • 9. Coming Soon! ShardPartitioner •  Custom partitioning scheme for RDD using Solr’s DocRouter •  Stream docs directly to each shard leader using metadata from ZooKeeper, document shard assignment, and ConcurrentUpdateSolrClient
        final ShardPartitioner shardPartitioner = new ShardPartitioner(zkHost, collection);
        pairs.partitionBy(shardPartitioner).foreachPartition(
          new VoidFunction<Iterator<Tuple2<String, SolrInputDocument>>>() {
            public void call(Iterator<Tuple2<String, SolrInputDocument>> tupleIter) throws Exception {
              ConcurrentUpdateSolrClient cuss = null;
              while (tupleIter.hasNext()) {
                // ... Initialize ConcurrentUpdateSolrClient once per partition
                cuss.add(doc);
              }
            }
        });
  • 10. SolrRDD: Reading data from Solr into Spark •  Can execute any query and expose as an RDD •  SolrRDD produces JavaRDD<SolrDocument>   •  Use deep-paging if needed (cursorMark) •  Stream docs from Solr (vs. building lists on the server-side) •  More parallelism using a range filter on a numeric field (_version_)
  • 11. SolrRDD: Reading data from Solr into Spark
    Diagram: the Spark Driver App (q=*:*) reads collection metadata from ZooKeeper. The SolrRDD partitions (Partition 1, Partition 2) query the shards of the Solr collection (Shard 1, Shard 2) with q=*:*&rows=1000&distrib=false&cursorMark=*. Results are streamed back from Solr as JavaRDD<SolrDocument>.
  • 12. Spark SQL Query Solr, then expose results as a SQL table
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);
        SolrRDD solrRDD = new SolrRDD(zkHost, collection);
        DataFrame tweets = solrRDD.asTempTable(sqlContext, queryStr, "tweets");
        DataFrame results = sqlContext.sql(
          "SELECT COUNT(type_s) FROM tweets WHERE type_s='echo'");
        JavaRDD<Row> resultsRDD = results.javaRDD();
        List<Long> count = resultsRDD.map(new Function<Row, Long>() { … }).collect();
        System.out.println("# of echos : "+count);
  • 13. Query Solr from the Spark Shell Interactive data mining with the full power of Solr queries
        ADD_JARS=$PROJECT_HOME/target/spark-solr-1.0-SNAPSHOT.jar bin/spark-shell
        import com.lucidworks.spark.SolrRDD;
        var solrRDD = new SolrRDD("localhost:9983","gettingstarted");
        var tweets = solrRDD.query(sc,"*:*");
        var count = tweets.count();
        var tweets = solrRDD.asTempTable(sqlContext, "*:*", "tweets");
        sqlContext.sql("SELECT COUNT(type_s) FROM tweets WHERE type_s='echo'").show();
  • 14. Document Matching using Stored Queries •  For each document, determine which of a large set of stored queries match. •  Useful for alerts, alternative flow paths through a stream, etc. •  Index a micro-batch into an embedded (in-memory) Solr instance and then determine which queries match •  Matching framework; you have to decide where to load the stored queries from and what to do when matches are found •  Scale it using Spark … if you need to scale to many queries, check out Luwak
  • 15. Document Matching using Stored Queries
    Diagram labels: Twitter, map(), Stored Queries, DocFilterContext, ZooKeeper
        JavaReceiverInputDStream<Status> tweets =
          TwitterUtils.createStream(jssc, null, filters);
        JavaDStream<SolrInputDocument> docs = tweets.map(
          new Function<Status,SolrInputDocument>() {
            // Convert a twitter4j Status object into a SolrInputDocument
            public SolrInputDocument call(Status status) {
              SolrInputDocument doc = new SolrInputDocument();
              …
              return doc;
            }});
        JavaDStream<SolrInputDocument> enriched =
          SolrSupport.filterDocuments(docFilterContext, …);
    Callouts: Get queries. Index docs into an EmbeddedSolrServer, initialized from configs stored in ZooKeeper. Key abstraction to allow you to plug in how to store the queries and what action to take when docs match.
    Slide Legend: Provided by Spark; Custom Java / Scala code; Provided by Lucidworks
  • 16. Reading Term Vectors from Solr •  Pull TF/IDF (or just TF) for each term in a field for each document in query results from Solr •  Can be used to construct RDD<Vector> which can then be passed to MLlib:
        SolrRDD solrRDD = new SolrRDD(zkHost, collection);
        JavaRDD<Vector> vectors =
          solrRDD.queryTermVectors(jsc, solrQuery, field, numFeatures);
        vectors.cache();
        KMeansModel clusters =
          KMeans.train(vectors.rdd(), numClusters, numIterations);
        // Evaluate clustering by computing Within Set Sum of Squared Errors
        double WSSSE = clusters.computeCost(vectors.rdd());
  • 17. Wrap-up and Q & A Need more use cases … Feel free to reach out to me with questions: / @thelabdude