SlideShare a Scribd company logo
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
https://github.com/LucidWorks/spark-solr/
•  Indexing from Spark
•  Reading data from Solr
•  Solr data as a Spark SQL DataFrame
•  Interacting with Solr from the Spark shell
•  Document Matching
•  Reading Term vectors from Solr for MLlib
Integrating Solr & Spark
•  Solr user since 2010, committer since April 2014, work for
Lucidworks, PMC member ~ May 2015
•  Focus mainly on SolrCloud features … and bin/solr!
ü  Release manager for Lucene / Solr 5.1
•  Co-author of Solr in Action
•  Other contributions include Solr on YARN, Solr Scale
Toolkit, Solr-Storm, and Spark-Solr integration projects on
github
About Me …
About Solr
•  Vibrant, thriving open source community
•  Solr 5.2 just released!
ü  Pluggable authentication and authorization
ü  ~2x indexing performance w/ replication
ü  Field cardinality estimation using HyperLogLog
ü  Rule-based replica placement strategy (rack awareness)
•  Deploy to YARN cluster using Slider

Recommended for you

Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021

You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL and can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as, the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!

prestosqlprestodbpresto
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns

This document discusses PostgreSQL database architecture patterns for running PostgreSQL at scale when a relational database as a service like Amazon RDS won't meet needs. It describes challenges faced with MySQL, Redshift and Vertica and how PostgreSQL was better suited through techniques like partitioning by date, TOAST compression, foreign data wrappers, and poor man's parallel processing. Key takeaways are that PostgreSQL supported scaling to petabytes of data, sub-second queries across large date ranges, and custom extensions needed while avoiding limitations and expenses of other database options.

Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3

Netflix’s Big Data Platform team manages data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots. In this session, you'll learn: • Some background about big data at Netflix • Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive • How Iceberg maintains table metadata to make queries fast and reliable • The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse • How you can get started using Iceberg Speaker Ryan Blue, Software Engineer, Netflix

apache avroapache hiveapache parquet
Lucidworks Fusion
Spark Streaming Example: Solr as Sink
Twitter
./spark-­‐submit	
  -­‐-­‐master	
  MASTER	
  -­‐-­‐class	
  com.lucidworks.spark.SparkApp	
  spark-­‐solr-­‐1.0.jar	
  	
  
	
  	
  	
  	
  	
  twitter-­‐to-­‐solr	
  -­‐zkHost	
  localhost:2181	
  –collection	
  social	
  
Solr
JavaReceiverInputDStream<Status> tweets =
TwitterUtils.createStream(jssc, null, filters);
Various transformations / enrichments
on each tweet (e.g. sentiment analysis,
language detection)
JavaDStream<SolrInputDocument> docs = tweets.map(
new Function<Status,SolrInputDocument>() {
// Convert a twitter4j Status object into a SolrInputDocument
public SolrInputDocument call(Status status) {
SolrInputDocument doc = new SolrInputDocument();
…
return doc;
}});
map()
class TwitterToSolrStreamProcessor
extends SparkApp.StreamProcessor
SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);
Slide Legend
Provided by Spark
Custom Java / Scala code
Provided by Lucidworks
Spark Streaming Example: Solr as Sink
//	
  start	
  receiving	
  a	
  stream	
  of	
  tweets	
  ...	
  
JavaReceiverInputDStream<Status>	
  tweets	
  =	
  
	
  	
  TwitterUtils.createStream(jssc,	
  null,	
  filters);	
  
	
  
//	
  map	
  incoming	
  tweets	
  into	
  SolrInputDocument	
  objects	
  for	
  indexing	
  in	
  Solr	
  
JavaDStream<SolrInputDocument>	
  docs	
  =	
  tweets.map(	
  
	
  	
  new	
  Function<Status,SolrInputDocument>()	
  {	
  
	
  	
  	
  	
  public	
  SolrInputDocument	
  call(Status	
  status)	
  {	
  
	
  	
  	
  	
  	
  	
  SolrInputDocument	
  doc	
  =	
  
	
  	
  	
  	
  	
  	
  	
  	
  SolrSupport.autoMapToSolrInputDoc("tweet-­‐"+status.getId(),	
  status,	
  null);	
  
	
  	
  	
  	
  	
  	
  doc.setField("provider_s",	
  "twitter");	
  
	
  	
  	
  	
  	
  	
  return	
  doc;	
  
	
  	
  	
  	
  }	
  
	
  	
  }	
  
);	
  
	
  
//	
  when	
  ready,	
  send	
  the	
  docs	
  into	
  a	
  SolrCloud	
  cluster	
  
SolrSupport.indexDStreamOfDocs(zkHost,	
  collection,	
  docs);	
  
Direct updates from Spark to shard leaders
server-sideclient-side

Recommended for you

Near RealTime search @Flipkart
Near RealTime search @FlipkartNear RealTime search @Flipkart
Near RealTime search @Flipkart

This document summarizes Flipkart's approach to building a real-time search index for e-commerce. It discusses the need for real-time search to address issues like showing out of stock products. It describes Flipkart's traffic volume and catalog size. It then covers challenges like high update rates and using microservices. It evaluates Apache SolrCloud and describes Flipkart's approach using a near real-time forward index with an inverted index for filtering to enable real-time sorting, filtering and querying with latency comparable to columnar data. The system processes over 150 signals and 50k updates per second in production.

searchnear realtime searchcap theorem
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale

This document discusses how to achieve scale with MongoDB. It covers optimization tips like schema design, indexing, and monitoring. Vertical scaling involves upgrading hardware like RAM and SSDs. Horizontal scaling involves adding shards to distribute load. The document also discusses how MongoDB scales for large customers through examples of deployments handling high throughput and large datasets.

Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana

This document provides an overview of Logstash, Elasticsearch, and Kibana for log analysis. It discusses how logging is used for troubleshooting, security, and monitoring. It then introduces Logstash as an open-source log collection and parsing tool. Elasticsearch is described as a search and analytics engine that indexes log data from Logstash. Kibana provides a web interface for visualizing and searching logs stored in Elasticsearch. The document concludes with discussing demo, installation, scaling, and deployment considerations for these log analysis tools.

kibanalogstashelasticsearch
Coming Soon! ShardPartitioner
•  Custom partitioning scheme for RDD using Solr’s DocRouter
•  Stream docs directly to each shard leader using metadata from ZooKeeper,
document shard assignment, and ConcurrentUpdateSolrClient
final	
  ShardPartitioner	
  shardPartitioner	
  =	
  new	
  ShardPartitioner(zkHost,	
  collection);	
  
pairs.partitionBy(shardPartitioner).foreachPartition(	
  
	
  	
  new	
  VoidFunction<Iterator<Tuple2<String,	
  SolrInputDocument>>>()	
  {	
  
	
  	
  	
  	
  public	
  void	
  call(Iterator<Tuple2<String,	
  SolrInputDocument>>	
  tupleIter)	
  throws	
  Exception	
  {	
  
	
  	
  	
  	
  	
  	
  ConcurrentUpdateSolrClient	
  cuss	
  =	
  null;	
  
	
  	
  	
  	
  	
  	
  while	
  (tupleIter.hasNext())	
  {	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  //	
  ...	
  Initialize	
  ConcurrentUpdateSolrClient	
  once	
  per	
  partition	
  
	
  	
  	
  	
  	
  	
  	
  cuss.add(doc);	
  
	
  	
  	
  	
  	
  	
  }	
  
	
  	
  	
  }	
  
});	
  
SolrRDD: Reading data from Solr into Spark
•  Can execute any query and expose as an RDD
•  SolrRDD produces JavaRDD<SolrDocument>	
  
•  Use deep-paging if needed (cursorMark)
•  Stream docs from Solr (vs. building lists on the server-side)
•  More parallelism using a range filter on a numeric field (_version_)
SolrRDD: Reading data from Solr into Spark
Shard 1
Shard 2
Solr
Collection
Partition 1
SolrRDD
Partition 2
Spark
Driver
App
q=*:*	
  
ZooKeeper
Read collection metadata
q=*:*&rows=1000&	
  
distrib=false&cursorMark=*	
  
Results streamed back from Solr
JavaRDD<SolrDocument>
Spark SQL
Query Solr, then expose results as a SQL table
JavaSparkContext	
  jsc	
  =	
  new	
  JavaSparkContext(conf);	
  
SQLContext	
  sqlContext	
  =	
  new	
  SQLContext(jsc);	
  
	
  
SolrRDD	
  solrRDD	
  =	
  new	
  SolrRDD(zkHost,	
  collection);	
  
DataFrame	
  tweets	
  =	
  solrRDD.asTempTable(sqlContext,	
  queryStr,	
  "tweets");	
  
DataFrame	
  results	
  =	
  sqlContext.sql(	
  
"SELECT	
  COUNT(type_s)	
  FROM	
  tweets	
  WHERE	
  type_s='echo'");	
  
	
  
JavaRDD<Row>	
  resultsRDD	
  =	
  results.javaRDD();	
  
List<Long>	
  count	
  =	
  resultsRDD.map(new	
  Function<Row,	
  Long>()	
  {	
  …	
  }).collect();	
  
System.out.println("#	
  of	
  echos	
  :	
  "+count);	
  

Recommended for you

Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL

In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey understanding and familiarity of query plans in Spark SQL, and use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain some relevant information that can be useful in order to understand some details about the execution. If you understand the query plan, you can look for the weak spot and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution. The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.

* apache spark

 *big data

 *ai

 *
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink

Flink Forward San Francisco 2022. When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment. by Jun Qin & Karl Friedrich

apache flinkstream processing
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino Project

Trino (formerly known as PrestoSQL) is an open source distributed SQL query engine for running fast analytical queries against data sources of all sizes. Some key updates since being rebranded from PrestoSQL to Trino include new security features, language features like window functions and temporal types, performance improvements through dynamic filtering and partition pruning, and new connectors. Upcoming improvements include support for MERGE statements, MATCH_RECOGNIZE patterns, and materialized view enhancements.

prestotrinohive
Query Solr from the Spark Shell
Interactive data mining with the full power of Solr queries
ADD_JARS=$PROJECT_HOME/target/spark-­‐solr-­‐1.0-­‐SNAPSHOT.jar	
  bin/spark-­‐shell	
  
	
  
import	
  com.lucidworks.spark.SolrRDD;	
  
var	
  solrRDD	
  =	
  new	
  SolrRDD("localhost:9983","gettingstarted");	
  
	
  
var	
  tweets	
  =	
  solrRDD.query(sc,"*:*");	
  
var	
  count	
  =	
  tweets.count();	
  
	
  
var	
  tweets	
  =	
  solrRDD.asTempTable(sqlContext,	
  "*:*",	
  "tweets");	
  
sqlContext.sql("SELECT	
  COUNT(type_s)	
  FROM	
  tweets	
  WHERE	
  type_s='echo'").show();	
  
	
  
Document Matching using Stored Queries
•  For each document, determine which of a large set of stored queries
matches.
•  Useful for alerts, alternative flow paths through a stream, etc
•  Index a micro-batch into an embedded (in-memory) Solr instance and
then determine which queries match
•  Matching framework; you have to decide where to load the stored
queries from and what to do when matches are found
•  Scale it using Spark … need to scale to many queries, checkout Luwak
Document Matching using Stored Queries
Stored Queries
DocFilterContext
Twitter map()
Slide Legend
Provided by Spark
Custom Java / Scala code
Provided by Lucidworks
JavaReceiverInputDStream<Status> tweets =
TwitterUtils.createStream(jssc, null, filters);
JavaDStream<SolrInputDocument> docs = tweets.map(
new Function<Status,SolrInputDocument>() {
// Convert a twitter4j Status object into a SolrInputDocument
public SolrInputDocument call(Status status) {
SolrInputDocument doc = new SolrInputDocument();
…
return doc;
}});
JavaDStream<SolrInputDocument> enriched =
SolrSupport.filterDocuments(docFilterContext, …);
Get queries
Index docs into an
EmbeddedSolrServer
Initialized from configs
stored in ZooKeeper
…
ZooKeeper
Key abstraction to allow
you to plug-in how to
store the queries and
what action to take
when docs match
Reading Term Vectors from Solr
•  Pull TF/IDF (or just TF) for each term in a field for each document in query
results from Solr
•  Can be used to construct RDD<Vector> which can then be passed to MLLib:
SolrRDD	
  solrRDD	
  =	
  new	
  SolrRDD(zkHost,	
  collection);	
  
	
  
JavaRDD<Vector>	
  vectors	
  =	
  	
  
	
  	
  solrRDD.queryTermVectors(jsc,	
  solrQuery,	
  field,	
  numFeatures);	
  
vectors.cache();	
  
	
  
KMeansModel	
  clusters	
  =	
  	
  
	
  	
  KMeans.train(vectors.rdd(),	
  numClusters,	
  numIterations);	
  
	
  
//	
  Evaluate	
  clustering	
  by	
  computing	
  Within	
  Set	
  Sum	
  of	
  Squared	
  Errors	
  
double	
  WSSSE	
  =	
  clusters.computeCost(vectors.rdd());	
  
	
  

Recommended for you

Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide

Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.

parquetnetflixhadoop
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake

Flink Forward San Francisco 2022. Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink. by Scott Sandre & Denny Lee

apache flinkbig datastream processing
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix

The document discusses Netflix's use of Elasticsearch for querying log events. It describes how Netflix evolved from storing logs in files to using Elasticsearch to enable interactive exploration of billions of log events. It also summarizes some of Netflix's best practices for running Elasticsearch at scale, such as automatic sharding and replication, flexible schemas, and extensive monitoring.

netflixosselasticsearch
Wrap-up and Q & A
Need more use cases …
Feel free to reach out to me with questions:
tim.potter@lucidworks.com / @thelabdude

More Related Content

What's hot

Scaling Data Quality @ Netflix
Scaling Data Quality @ NetflixScaling Data Quality @ Netflix
Scaling Data Quality @ Netflix
Michelle Ufford
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache Solr
Trey Grainger
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
Sease
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
StreamNative
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
Corey Huinker
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Near RealTime search @Flipkart
Near RealTime search @FlipkartNear RealTime search @Flipkart
Near RealTime search @Flipkart
Umesh Prasad
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale
MongoDB
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
Avinash Ramineni
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino Project
Martin Traverso
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
Flink Forward
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
Danny Yuan
 
Trino at linkedIn - 2021
Trino at linkedIn - 2021Trino at linkedIn - 2021
Trino at linkedIn - 2021
Akshay Rai
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Sease
 
Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?
Ludovico Caldara
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 

What's hot (20)

Scaling Data Quality @ Netflix
Scaling Data Quality @ NetflixScaling Data Quality @ Netflix
Scaling Data Quality @ Netflix
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache Solr
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Near RealTime search @Flipkart
Near RealTime search @FlipkartNear RealTime search @Flipkart
Near RealTime search @Flipkart
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino Project
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
 
Trino at linkedIn - 2021
Trino at linkedIn - 2021Trino at linkedIn - 2021
Trino at linkedIn - 2021
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
 
Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Viewers also liked

Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
Lucidworks
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Lucidworks
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
Lucidworks
 
Solr As A SparkSQL DataSource
Solr As A SparkSQL DataSourceSolr As A SparkSQL DataSource
Solr As A SparkSQL DataSource
Spark Summit
 
ApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr IntegrationApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr Integration
thelabdude
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
lucenerevolution
 
Batch Applications for the Java Platform
Batch Applications for the Java PlatformBatch Applications for the Java Platform
Batch Applications for the Java Platform
Sivakumar Thyagarajan
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop Ecosystem
Cloudera, Inc.
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
Jake Mannix
 
Container Orchestration @Docker Meetup Hamburg
Container Orchestration @Docker Meetup HamburgContainer Orchestration @Docker Meetup Hamburg
Container Orchestration @Docker Meetup Hamburg
Timo Derstappen
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
thelabdude
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Spark Summit
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
lucenerevolution
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark Streaming
Spark Summit
 
Production Readiness Testing At Salesforce Using Spark MLlib
Production Readiness Testing At Salesforce Using Spark MLlibProduction Readiness Testing At Salesforce Using Spark MLlib
Production Readiness Testing At Salesforce Using Spark MLlib
Spark Summit
 
Containers for Non-Developers
Containers for Non-DevelopersContainers for Non-Developers
Containers for Non-Developers
Amazon Web Services
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Spark Summit
 
Spark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your applicationSpark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your application
Databricks
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
Spark Summit
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Summit
 

Viewers also liked (20)

Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Solr As A SparkSQL DataSource
Solr As A SparkSQL DataSourceSolr As A SparkSQL DataSource
Solr As A SparkSQL DataSource
 
ApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr IntegrationApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr Integration
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Batch Applications for the Java Platform
Batch Applications for the Java PlatformBatch Applications for the Java Platform
Batch Applications for the Java Platform
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop Ecosystem
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Container Orchestration @Docker Meetup Hamburg
Container Orchestration @Docker Meetup HamburgContainer Orchestration @Docker Meetup Hamburg
Container Orchestration @Docker Meetup Hamburg
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark Streaming
 
Production Readiness Testing At Salesforce Using Spark MLlib
Production Readiness Testing At Salesforce Using Spark MLlibProduction Readiness Testing At Salesforce Using Spark MLlib
Production Readiness Testing At Salesforce Using Spark MLlib
 
Containers for Non-Developers
Containers for Non-DevelopersContainers for Non-Developers
Containers for Non-Developers
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
 
Spark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your applicationSpark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your application
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 

Similar to Integrating Spark and Solr-(Timothy Potter, Lucidworks)

Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data Analytics
Lucidworks
 
Solr as a Spark SQL Datasource
Solr as a Spark SQL DatasourceSolr as a Spark SQL Datasource
Solr as a Spark SQL Datasource
Chitturi Kiran
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
Gal Marder
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
Snehal Nagmote
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
Martin Toshev
 
실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재
hkyoon2
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
trihug
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
jlacefie
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
jlacefie
 
Storlets fb session_16_9
Storlets fb session_16_9Storlets fb session_16_9
Storlets fb session_16_9
Eran Rom
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
Fabio Fumarola
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache Solr
Edureka!
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
Russell Spitzer
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Brightstar DB
Brightstar DBBrightstar DB
Brightstar DB
Connected Data World
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Databricks
 
Learning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLLearning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQL
phanleson
 
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
de:code 2017
 
Apache Solr! Enterprise Search Solutions at your Fingertips!
Apache Solr! Enterprise Search Solutions at your Fingertips!Apache Solr! Enterprise Search Solutions at your Fingertips!
Apache Solr! Enterprise Search Solutions at your Fingertips!
Murshed Ahmmad Khan
 

Similar to Integrating Spark and Solr-(Timothy Potter, Lucidworks) (20)

Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data Analytics
 
Solr as a Spark SQL Datasource
Solr as a Spark SQL DatasourceSolr as a Spark SQL Datasource
Solr as a Spark SQL Datasource
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Spark core
Spark coreSpark core
Spark core
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
Storlets fb session_16_9
Storlets fb session_16_9Storlets fb session_16_9
Storlets fb session_16_9
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache Solr
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Brightstar DB
Brightstar DBBrightstar DB
Brightstar DB
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Learning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLLearning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQL
 
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
 
Apache Solr! Enterprise Search Solutions at your Fingertips!
Apache Solr! Enterprise Search Solutions at your Fingertips!Apache Solr! Enterprise Search Solutions at your Fingertips!
Apache Solr! Enterprise Search Solutions at your Fingertips!
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
Milind Agarwal
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
aarusi sexy model
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
gargtinna79
 
iot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptxiot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptx
KiranKumar139571
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
javier ramirez
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
khansayyad1256
 
University of Toronto degree offer diploma Transcript
University of Toronto  degree offer diploma TranscriptUniversity of Toronto  degree offer diploma Transcript
University of Toronto degree offer diploma Transcript
taqyea
 
Australian Catholic University degree offer diploma Transcript
Australian Catholic University  degree offer diploma TranscriptAustralian Catholic University  degree offer diploma Transcript
Australian Catholic University degree offer diploma Transcript
taqyea
 
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeMahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
aashuverma204
 
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeKarol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
bookmybebe1
 
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
bookmybebe1
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
kumkum tuteja$A17
 
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
Amazon Web Services Korea
 
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeMalviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
butwhat24
 
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model SafePitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
vasudha malikmonii$A17
 
Cloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco ProductsCloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco Products
luqmansyauqi2
 
Victoria University degree offer diploma Transcript
Victoria University  degree offer diploma TranscriptVictoria University  degree offer diploma Transcript
Victoria University degree offer diploma Transcript
taqyea
 
LLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptxLLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptx
Jyotishko Biswas
 
NPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension schemeNPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension scheme
ASISHSABAT3
 
University of the Sunshine Coast degree offer diploma Transcript
University of the Sunshine Coast  degree offer diploma TranscriptUniversity of the Sunshine Coast  degree offer diploma Transcript
University of the Sunshine Coast degree offer diploma Transcript
taqyea
 

Recently uploaded (20)

From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
 
iot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptxiot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptx
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
 
University of Toronto degree offer diploma Transcript
University of Toronto  degree offer diploma TranscriptUniversity of Toronto  degree offer diploma Transcript
University of Toronto degree offer diploma Transcript
 
Australian Catholic University degree offer diploma Transcript
Australian Catholic University  degree offer diploma TranscriptAustralian Catholic University  degree offer diploma Transcript
Australian Catholic University degree offer diploma Transcript
 
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeMahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeKarol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
 
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
 
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
 
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeMalviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
 
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model SafePitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
 
Cloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco ProductsCloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco Products
 
Victoria University degree offer diploma Transcript
Victoria University  degree offer diploma TranscriptVictoria University  degree offer diploma Transcript
Victoria University degree offer diploma Transcript
 
LLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptxLLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptx
 
NPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension schemeNPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension scheme
 
University of the Sunshine Coast degree offer diploma Transcript
University of the Sunshine Coast  degree offer diploma TranscriptUniversity of the Sunshine Coast  degree offer diploma Transcript
University of the Sunshine Coast degree offer diploma Transcript
 

Integrating Spark and Solr-(Timothy Potter, Lucidworks)

  • 2. https://github.com/LucidWorks/spark-solr/ •  Indexing from Spark •  Reading data from Solr •  Solr data as a Spark SQL DataFrame •  Interacting with Solr from the Spark shell •  Document Matching •  Reading Term vectors from Solr for MLlib Integrating Solr & Spark
  • 3. •  Solr user since 2010, committer since April 2014, work for Lucidworks, PMC member ~ May 2015 •  Focus mainly on SolrCloud features … and bin/solr! ü  Release manager for Lucene / Solr 5.1 •  Co-author of Solr in Action •  Other contributions include Solr on YARN, Solr Scale Toolkit, Solr-Storm, and Spark-Solr integration projects on github About Me …
  • 4. About Solr •  Vibrant, thriving open source community •  Solr 5.2 just released! ü  Pluggable authentication and authorization ü  ~2x indexing performance w/ replication ü  Field cardinality estimation using HyperLogLog ü  Rule-based replica placement strategy (rack awareness) •  Deploy to YARN cluster using Slider
  • 6. Spark Streaming Example: Solr as Sink Twitter ./spark-­‐submit  -­‐-­‐master  MASTER  -­‐-­‐class  com.lucidworks.spark.SparkApp  spark-­‐solr-­‐1.0.jar              twitter-­‐to-­‐solr  -­‐zkHost  localhost:2181  –collection  social   Solr JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(jssc, null, filters); Various transformations / enrichments on each tweet (e.g. sentiment analysis, language detection) JavaDStream<SolrInputDocument> docs = tweets.map( new Function<Status,SolrInputDocument>() { // Convert a twitter4j Status object into a SolrInputDocument public SolrInputDocument call(Status status) { SolrInputDocument doc = new SolrInputDocument(); … return doc; }}); map() class TwitterToSolrStreamProcessor extends SparkApp.StreamProcessor SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs); Slide Legend Provided by Spark Custom Java / Scala code Provided by Lucidworks
  • 7. Spark Streaming Example: Solr as Sink //  start  receiving  a  stream  of  tweets  ...   JavaReceiverInputDStream<Status>  tweets  =      TwitterUtils.createStream(jssc,  null,  filters);     //  map  incoming  tweets  into  SolrInputDocument  objects  for  indexing  in  Solr   JavaDStream<SolrInputDocument>  docs  =  tweets.map(      new  Function<Status,SolrInputDocument>()  {          public  SolrInputDocument  call(Status  status)  {              SolrInputDocument  doc  =                  SolrSupport.autoMapToSolrInputDoc("tweet-­‐"+status.getId(),  status,  null);              doc.setField("provider_s",  "twitter");              return  doc;          }      }   );     //  when  ready,  send  the  docs  into  a  SolrCloud  cluster   SolrSupport.indexDStreamOfDocs(zkHost,  collection,  docs);  
  • 8. Direct updates from Spark to shard leaders server-sideclient-side
  • 9. Coming Soon! ShardPartitioner •  Custom partitioning scheme for RDD using Solr’s DocRouter •  Stream docs directly to each shard leader using metadata from ZooKeeper, document shard assignment, and ConcurrentUpdateSolrClient final  ShardPartitioner  shardPartitioner  =  new  ShardPartitioner(zkHost,  collection);   pairs.partitionBy(shardPartitioner).foreachPartition(      new  VoidFunction<Iterator<Tuple2<String,  SolrInputDocument>>>()  {          public  void  call(Iterator<Tuple2<String,  SolrInputDocument>>  tupleIter)  throws  Exception  {              ConcurrentUpdateSolrClient  cuss  =  null;              while  (tupleIter.hasNext())  {                    //  ...  Initialize  ConcurrentUpdateSolrClient  once  per  partition                cuss.add(doc);              }        }   });  
  • 10. SolrRDD: Reading data from Solr into Spark •  Can execute any query and expose as an RDD •  SolrRDD produces JavaRDD<SolrDocument>   •  Use deep-paging if needed (cursorMark) •  Stream docs from Solr (vs. building lists on the server-side) •  More parallelism using a range filter on a numeric field (_version_)
  • 11. SolrRDD: Reading data from Solr into Spark Shard 1 Shard 2 Solr Collection Partition 1 SolrRDD Partition 2 Spark Driver App q=*:*   ZooKeeper Read collection metadata q=*:*&rows=1000&   distrib=false&cursorMark=*   Results streamed back from Solr JavaRDD<SolrDocument>
  • 12. Spark SQL Query Solr, then expose results as a SQL table JavaSparkContext  jsc  =  new  JavaSparkContext(conf);   SQLContext  sqlContext  =  new  SQLContext(jsc);     SolrRDD  solrRDD  =  new  SolrRDD(zkHost,  collection);   DataFrame  tweets  =  solrRDD.asTempTable(sqlContext,  queryStr,  "tweets");   DataFrame  results  =  sqlContext.sql(   "SELECT  COUNT(type_s)  FROM  tweets  WHERE  type_s='echo'");     JavaRDD<Row>  resultsRDD  =  results.javaRDD();   List<Long>  count  =  resultsRDD.map(new  Function<Row,  Long>()  {  …  }).collect();   System.out.println("#  of  echos  :  "+count);  
  • 13. Query Solr from the Spark Shell Interactive data mining with the full power of Solr queries ADD_JARS=$PROJECT_HOME/target/spark-­‐solr-­‐1.0-­‐SNAPSHOT.jar  bin/spark-­‐shell     import  com.lucidworks.spark.SolrRDD;   var  solrRDD  =  new  SolrRDD("localhost:9983","gettingstarted");     var  tweets  =  solrRDD.query(sc,"*:*");   var  count  =  tweets.count();     var  tweets  =  solrRDD.asTempTable(sqlContext,  "*:*",  "tweets");   sqlContext.sql("SELECT  COUNT(type_s)  FROM  tweets  WHERE  type_s='echo'").show();    
  • 14. Document Matching using Stored Queries •  For each document, determine which of a large set of stored queries matches. •  Useful for alerts, alternative flow paths through a stream, etc •  Index a micro-batch into an embedded (in-memory) Solr instance and then determine which queries match •  Matching framework; you have to decide where to load the stored queries from and what to do when matches are found •  Scale it using Spark … need to scale to many queries, checkout Luwak
  • 15. Document Matching using Stored Queries Stored Queries DocFilterContext Twitter map() Slide Legend Provided by Spark Custom Java / Scala code Provided by Lucidworks JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(jssc, null, filters); JavaDStream<SolrInputDocument> docs = tweets.map( new Function<Status,SolrInputDocument>() { // Convert a twitter4j Status object into a SolrInputDocument public SolrInputDocument call(Status status) { SolrInputDocument doc = new SolrInputDocument(); … return doc; }}); JavaDStream<SolrInputDocument> enriched = SolrSupport.filterDocuments(docFilterContext, …); Get queries Index docs into an EmbeddedSolrServer Initialized from configs stored in ZooKeeper … ZooKeeper Key abstraction to allow you to plug-in how to store the queries and what action to take when docs match
  • 16. Reading Term Vectors from Solr •  Pull TF/IDF (or just TF) for each term in a field for each document in query results from Solr •  Can be used to construct RDD<Vector> which can then be passed to MLLib: SolrRDD  solrRDD  =  new  SolrRDD(zkHost,  collection);     JavaRDD<Vector>  vectors  =        solrRDD.queryTermVectors(jsc,  solrQuery,  field,  numFeatures);   vectors.cache();     KMeansModel  clusters  =        KMeans.train(vectors.rdd(),  numClusters,  numIterations);     //  Evaluate  clustering  by  computing  Within  Set  Sum  of  Squared  Errors   double  WSSSE  =  clusters.computeCost(vectors.rdd());    
  • 17. Wrap-up and Q & A Need more use cases … Feel free to reach out to me with questions: tim.potter@lucidworks.com / @thelabdude