SQL Analytics for Search Engineers
Timothy Potter
Manager of Smart Data @ Lucidworks / Apache Solr Committer
An ever-expanding list of needs from search engineers
• Better relevancy, less manual tuning
• Bigger scale, less downtime, fixed resources
• Higher QPS, more complex query pipelines
• More bespoke, search-driven applications, faster!
• Trying out new ideas
• Making better decisions with self-service analytics
• Random one-off jobs for this and that
• Use AI everywhere!
The ideal solution …
• Easy to explain to your boss how it works
• Tooling available
• Résumé friendly
• Extensible / customizable / flexible
• Scalable
• People want to feel productive
SQL in Fusion!
Data Ingest = Project Friction
• Bespoke, search-driven applications >
general purpose dashboard tools
• Getting data in continues to be a hassle
/ friction when getting started
• Need something nimble but also fast / scalable
• For every connector, there’s probably
20 SQL / NoSQL data silos

Fusion’s Parallel Bulk Loader
• Get to the fun stuff faster!
• Complement Fusion’s connectors for those dirty
ETL jobs that cause friction in every project
• High performance parallel reads from structured
data sources, including Cassandra, Elastic, HBase,
JDBC, Hadoop, …
• Basic ETL tasks with SQL and/or custom Scala
• ML Model predictions as UDF
• Direct to Solr for optimal speed or send to index
pipelines for optimal flexibility
A foundation built on SparkSQL
• Expose structured data as a DataFrame:
RDD + schema
• 100’s of data sources + formats
• spark-solr translates Solr query results
to a DataFrame
• Highly optimized parallel reads, with
predicate pushdown across a Spark cluster
• Spark optimizes the SQL query plan
• 100’s of built-in functions
Demo: Parallel Bulk Loader
• Read parquet from S3
• Write to a Fusion Index Pipeline
• Advanced transforms with Scala
• Transform with SQL
• Add job dependencies

User Feedback to Improve Relevancy
• MRR is sub-optimal for many queries?
• Want to boost some docs based on user
click behavior (per query)
• Older clicks should age out over time
• Some user actions are more important
than others: click < cart add < purchase
• Sometimes you need to join signals with
other tables, e.g. item metadata
• Hide complex business logic behind UDF
/ UDAF (pluggable)
• Designed for change!
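The bullets above (per-query boosts, aging out old clicks, weighting click < cart add < purchase) can be sketched in a few lines. This is only an illustration of the idea, not Fusion's aggregation logic: the action weights, the 30-day half-life, and the function names are all assumptions made up for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical weights reflecting click < cart add < purchase.
ACTION_WEIGHTS = {"click": 1.0, "cart_add": 3.0, "purchase": 10.0}
HALF_LIFE_DAYS = 30.0  # assumed half-life; tune per application


def decayed_weight(action, event_time, now):
    """Weight of one signal: action weight times exponential time decay."""
    age_days = (now - event_time).total_seconds() / 86400.0
    return ACTION_WEIGHTS[action] * 0.5 ** (age_days / HALF_LIFE_DAYS)


def aggregate_signals(signals, now):
    """Sum decayed weights per (query, doc_id) group."""
    boosts = defaultdict(float)
    for query, doc_id, action, event_time in signals:
        boosts[(query, doc_id)] += decayed_weight(action, event_time, now)
    return dict(boosts)


now = datetime(2018, 10, 1)
signals = [
    ("ipad", "doc1", "click", now),                          # fresh click
    ("ipad", "doc1", "click", now - timedelta(days=30)),     # one half-life old
    ("ipad", "doc2", "purchase", now - timedelta(days=60)),  # two half-lives old
]
boosts = aggregate_signals(signals, now)
```

In SQL the same shape would be a GROUP BY over query and doc_id, with the decay and weighting hidden behind a UDF or UDAF so the business logic can change without touching the job.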
Signal Data Flow in Fusion
Demo: SQL Aggregation
• Join with other tables
• Custom UDAF
• Final output to Solr

Window Functions
WITH sessions AS (
  SELECT *, sum(IF(diff_secs > 30, 1, 0))
    OVER (PARTITION BY clientip ORDER BY ts) session_id
  FROM (
    SELECT *, unix_timestamp(ts) - lag(unix_timestamp(ts))
      OVER (PARTITION BY clientip ORDER BY ts) as diff_secs
    FROM ${inputCollection}
    WHERE clientip IS NOT NULL AND ts IS NOT NULL AND bytes IS NOT NULL
      AND verb IS NOT NULL AND response IS NOT NULL
  )
) SELECT concat_ws('||', clientip, session_id) as id,
  first(clientip) as clientip,
  min(ts) as session_start,
  max(ts) as session_end,
  timediff(max(ts), min(ts), "MILLISECONDS") as session_len_ms_l,
  sum(bytes) as total_bytes_l,
  count(*) as total_requests_l
FROM sessions
GROUP BY clientip, session_id
Lag window function
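To make the lag-window logic easier to follow, here is a rough plain-Python equivalent of what the query computes: order each client's requests by time, compare each request with the previous one, and start a new session when the gap exceeds 30 seconds. The tuple layout and the `sessionize` name are assumptions for this sketch; timestamps are plain epoch seconds.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECS = 30  # same threshold as the diff_secs > 30 test in the query


def sessionize(events):
    """events: iterable of (clientip, ts_secs, bytes). Returns session rows."""
    sessions = []
    ordered = sorted(events, key=itemgetter(0, 1))  # PARTITION BY clientip ORDER BY ts
    for clientip, rows in groupby(ordered, key=itemgetter(0)):
        session_id, prev_ts, current = 0, None, None
        for _, ts, nbytes in rows:
            # lag(): compare with the previous row in the same partition
            if prev_ts is not None and ts - prev_ts > SESSION_GAP_SECS:
                session_id += 1  # running sum of "gap exceeded" flags
                current = None
            if current is None:
                current = {"id": f"{clientip}||{session_id}", "clientip": clientip,
                           "session_start": ts, "session_end": ts,
                           "total_bytes": 0, "total_requests": 0}
                sessions.append(current)
            current["session_end"] = ts
            current["total_bytes"] += nbytes
            current["total_requests"] += 1
            prev_ts = ts
    return sessions


events = [
    ("10.0.0.1", 100, 500), ("10.0.0.1", 110, 700),  # 10s gap: same session
    ("10.0.0.1", 200, 300),                           # 90s gap: new session
    ("10.0.0.2", 105, 50),
]
sessions = sessionize(events)
```

The SQL version does the same work declaratively, which lets Spark distribute the partitions across the cluster instead of looping in one process.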
SQL Aggregations Scalability
• Aggregate 42M signals into 11M groups
(query / doc_id)
• ~18 mins on 3 node EC2 cluster (r3.xlarge)
• Mostly I/O from/to Solr
Why Self-service Analytics?
• Powerful connectors, relevance, speed,
and massive scalability = more mission-critical
datasets finding their way into Fusion
• Don’t be another data silo!
• Let users ask questions of this data
using their tool of choice w/o adding
work for the IT group!
• Aggregations over full-text ranked results
• But it has to be fast else you’re right
back to data warehousing problems
Self-service Analytics
• Fusion SQL is a JDBC service that
supports SQL
• Fusion SQL plugs into Apache
Spark’s query planner to translate
SQL into optimized Solr queries
(streaming expressions and JSON facets)
• Integrate with popular BI tools like
Tableau, PowerBI, and Spotfire +
Notebooks like Apache Zeppelin

Demo: Self-Service Analytics
Self-service Analytics Performance
• Blog performed a comparison of their SQL engine against
common DBs using a count distinct query typical for
dashboards
• 14M logs, 1200 distinct dashboards, 1700 distinct
user_id/dashboard_id pairs
• Replicated the experiment with Fusion on EC2 (m1.xlarge),
single instance of Solr
Fusion: ~900ms
28M rows: ~1.3secs

Self-service Analytics Performance
SELECT m.title as title, agg.aggCount as aggCount
FROM movies m
INNER JOIN (
  SELECT movie_id, COUNT(*) as aggCount
  FROM ratings
  WHERE rating >= 4 GROUP BY movie_id
  ORDER BY aggCount desc LIMIT 10) as agg
ON agg.movie_id =
ORDER BY aggCount DESC
20M rows
Fusion SQL: ~1.1 secs
MySQL: 17 secs (w/ index on movie_id)
MovieLens data: Aggregate 20M ratings
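For readers who want to try the shape of this query on a laptop, the sketch below runs a similar join-plus-aggregation against an in-memory SQLite database. It says nothing about Fusion SQL or MySQL performance. The table layouts and sample rows are invented, and the join column `m.id` is an assumption, since the right-hand side of the ON clause is cut off on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE ratings (movie_id INTEGER, rating REAL);
    INSERT INTO movies VALUES (1, 'Alien'), (2, 'Brazil'), (3, 'Clue');
    INSERT INTO ratings VALUES (1, 5), (1, 4), (1, 2), (2, 4), (3, 3);
""")

# Count ratings >= 4 per movie, keep the top 10, then join to get titles.
top_rated = conn.execute("""
    SELECT m.title AS title, agg.aggCount AS aggCount
    FROM movies m
    INNER JOIN (
        SELECT movie_id, COUNT(*) AS aggCount
        FROM ratings
        WHERE rating >= 4
        GROUP BY movie_id
        ORDER BY aggCount DESC LIMIT 10) AS agg
    ON agg.movie_id = m.id  -- join column is an assumption
    ORDER BY aggCount DESC
""").fetchall()
conn.close()
```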
Experiments
• Run live experiments to try out
new ideas and compare
outcomes between variants
• Built-in metrics: MRR, avg|min|max
response time, CTR …
and you guessed it! SQL
• Bayesian Bandits to
explore/exploit the best
performing variant
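One common way to implement a Bayesian bandit is Thompson sampling over click-through rates; the deck does not say which algorithm Fusion uses, so treat the following as a generic sketch. The variant names, the counts, and the uniform Beta(1, 1) prior are assumptions for illustration.

```python
import random


def choose_variant(stats, rng):
    """Thompson sampling: draw a CTR from each variant's Beta posterior
    and pick the variant with the highest draw.

    stats maps variant name -> (clicks, impressions);
    a uniform Beta(1, 1) prior is assumed.
    """
    draws = {
        name: rng.betavariate(1 + clicks, 1 + impressions - clicks)
        for name, (clicks, impressions) in stats.items()
    }
    return max(draws, key=draws.get)


rng = random.Random(42)  # seeded only to make the sketch repeatable
stats = {"A": (50, 1000), "B": (300, 1000), "C": (1, 10)}
picks = {name: 0 for name in stats}
for _ in range(1000):
    picks[choose_variant(stats, rng)] += 1
```

Variant B has the best observed CTR and gets most of the traffic (exploit), while C, with only 10 impressions, still wins some draws because its posterior is wide (explore).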
Demo: Experiment Metrics
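As a reminder of what two of the built-in metrics measure, here is a small sketch that computes MRR and CTR per variant. The input layout (first clicked rank per query, click and impression counts) and the sample numbers are assumptions for the example, not Fusion's signal schema.

```python
def mean_reciprocal_rank(first_click_ranks):
    """MRR over queries; each entry is the 1-based rank of the first
    clicked result, or None when nothing was clicked (counts as 0)."""
    if not first_click_ranks:
        return 0.0
    return sum(1.0 / r for r in first_click_ranks if r) / len(first_click_ranks)


def click_through_rate(clicks, impressions):
    """Fraction of impressions that led to a click."""
    return clicks / impressions if impressions else 0.0


# Hypothetical per-variant observations.
variants = {
    "control":   {"ranks": [1, 2, None, 4], "clicks": 30, "impressions": 400},
    "treatment": {"ranks": [1, 1, 2, None], "clicks": 45, "impressions": 400},
}
report = {
    name: {"mrr": mean_reciprocal_rank(v["ranks"]),
           "ctr": click_through_rate(v["clicks"], v["impressions"])}
    for name, v in variants.items()
}
```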
Recap
• How to build powerful SQL aggregations with
joins, custom UDF / UDAF, and window functions to
power boosting and recommendations
• Ingesting data from data sources using SQL for
ETL, ML
• Self-service analytics from popular BI visualization
tools
• Measure outcomes between variants in an
experiment using SQL

Top 10 Things you can do with SQL in Fusion
1. Aggregate signals by query / doc / user to compute boost
weights and generate recommendations
2. Ingest & ETL from 100’s of data sources using SparkSQL
3. Use ML models to generate predictions and Lucene text
analysis using UDF functions
4. Join data from multiple Solr collections and data sources
5. Self-service analytics with BI tools like Tableau and PowerBI
6. Hide complex business logic behind UDF / UDAF
7. Use window functions for tasks like sessionization
8. Grouping sets and cubes for advanced analytic reporting
9. Compute KPIs across variants in an experiment
10. Expose complex Solr streaming expressions as simple SQL views
Thank you!
Timothy Potter
Manager Smart Data, Lucidworks
@thelabdude #Activate18 #ActivateSearch

