SQL Analytics for Search Engineers
Timothy Potter
Manager of Smart Data @ Lucidworks / Apache Solr Committer
@thelabdude
#Activate18 #ActivateSearch
An ever-expanding list of needs from search engineers
• Better relevancy, less manual tuning
• Bigger scale, less downtime, fixed resources
• Higher QPS, more complex query pipelines
• More bespoke, search-driven applications, faster!
• Trying out new ideas
• Making better decisions with self-service analytics
• Random one-off jobs for this and that
• Use AI everywhere!
The ideal solution …
• Easy to explain to your boss how it works
• Tooling available
• Résumé friendly
• Extensible / customizable / flexible
• Scalable
• People want to feel productive
SQL in Fusion!
Data Ingest = Project Friction
• Bespoke, search-driven applications > general-purpose dashboard tools
• Getting data in continues to be a hassle and a source of friction when getting started
• Need something nimble but also fast and scalable
• For every connector, there are probably 20 SQL / NoSQL data silos

Fusion’s Parallel Bulk Loader
• Get to the fun stuff faster!
• Complements Fusion’s connectors for those dirty ETL jobs that cause friction in every project
• High-performance parallel reads from structured data sources, including Cassandra, Elastic, HBase, JDBC, Hadoop, …
• Basic ETL tasks with SQL and/or custom Scala
• ML model predictions as UDFs
• Direct to Solr for optimal speed, or send to index pipelines for optimal flexibility
A foundation built on SparkSQL
• Expose structured data as a DataFrame: RDD + schema
• Hundreds of data sources and formats
• spark-solr translates Solr query results to a DataFrame
• Highly optimized parallel reads, with predicate pushdown, across a Spark cluster
• Spark optimizes the SQL query plan
• Hundreds of built-in functions
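To make "parallel reads with predicate pushdown" concrete, here is a minimal pure-Python sketch. It is not Spark or spark-solr code: the partition layout and the names `PARTITIONS`, `read_partition`, and `parallel_read` are hypothetical and exist only for illustration. Each worker applies the filter while reading its own partition, so only matching rows travel back to the caller.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source split into partitions (think shards or files).
PARTITIONS = [
    [{"id": 1, "rating": 5}, {"id": 2, "rating": 2}],
    [{"id": 3, "rating": 4}, {"id": 4, "rating": 1}],
    [{"id": 5, "rating": 3}, {"id": 6, "rating": 5}],
]


def read_partition(rows, predicate):
    """Read one partition, applying the pushed-down filter at the source."""
    return [row for row in rows if predicate(row)]


def parallel_read(partitions, predicate, workers=3):
    """Read all partitions concurrently and concatenate the filtered rows."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves partition order in its results.
        chunks = pool.map(lambda part: read_partition(part, predicate), partitions)
        return [row for chunk in chunks for row in chunk]


if __name__ == "__main__":
    good = parallel_read(PARTITIONS, lambda r: r["rating"] >= 4)
    print([r["id"] for r in good])
```

Without pushdown, every row would be shipped to the caller and filtered there; with it, the amount of data moved shrinks in proportion to the filter's selectivity.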
Demo: Parallel Bulk Loader
Parallel Bulk Loader
• Read Parquet from S3
• Write to a Fusion index pipeline
• Advanced transforms with Scala
• Transform with SQL
• Add job dependencies on-the-fly

User Feedback to Improve Relevancy
• MRR (mean reciprocal rank) is sub-optimal for many queries?
• Want to boost some docs based on user click behavior (per query)
• Older clicks should age out over time
• Some user actions are more important than others: click < cart add < purchase
• Sometimes you need to join signals with other tables, e.g. item metadata
• Hide complex business logic behind a pluggable UDF / UDAF
• Designed for change!
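The requirements above (per-query boosts, aging out of older clicks, heavier weight for stronger actions) can be sketched in a few lines of Python. This is only an illustration, not Fusion's aggregation logic: the weights, the 30-day half-life, the field names, and the function `aggregate_signals` are all assumptions.

```python
from collections import defaultdict

# Assumed weights reflecting click < cart add < purchase; tune for your data.
SIGNAL_WEIGHTS = {"click": 1.0, "cart_add": 2.0, "purchase": 4.0}
HALF_LIFE_DAYS = 30.0  # assumed: a signal loses half its weight every 30 days


def aggregate_signals(signals, now_day):
    """Sum time-decayed, type-weighted signals per (query, doc_id) pair."""
    boosts = defaultdict(float)
    for s in signals:
        age_days = now_day - s["day"]
        decay = 0.5 ** (age_days / HALF_LIFE_DAYS)  # exponential aging
        boosts[(s["query"], s["doc_id"])] += SIGNAL_WEIGHTS[s["type"]] * decay
    return dict(boosts)


if __name__ == "__main__":
    signals = [
        {"query": "ipad", "doc_id": "d1", "type": "click", "day": 60},
        {"query": "ipad", "doc_id": "d1", "type": "purchase", "day": 30},
        {"query": "ipad", "doc_id": "d2", "type": "click", "day": 0},
    ]
    for key, weight in aggregate_signals(signals, now_day=60).items():
        print(key, round(weight, 3))
```

Keeping the weights and the decay in one small function mirrors the idea of hiding business logic behind a UDAF: the rules can change without touching the surrounding query.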
Signal Data Flow in Fusion
Demo: Parallel Bulk Loader
• SQL aggregation
• Join with other tables
• Custom UDAF
• Final output to Solr

Window Functions
WITH sessions AS (
  -- Running count of gaps longer than 30 seconds gives each client a session number
  SELECT *, sum(IF(diff_secs > 30, 1, 0))
            OVER (PARTITION BY clientip ORDER BY ts) AS session_id
  FROM (
    -- Seconds elapsed since the same client's previous request
    SELECT *, unix_timestamp(ts) - lag(unix_timestamp(ts))
              OVER (PARTITION BY clientip ORDER BY ts) AS diff_secs
    FROM ${inputCollection}
    WHERE clientip IS NOT NULL AND ts IS NOT NULL AND bytes IS NOT NULL
      AND verb IS NOT NULL AND response IS NOT NULL
  )
)
SELECT concat_ws('||', clientip, session_id) AS id,
       first(clientip) AS clientip,
       min(ts) AS session_start,
       max(ts) AS session_end,
       timediff(max(ts), min(ts), "MILLISECONDS") AS session_len_ms_l,
       sum(bytes) AS total_bytes_l,
       count(*) AS total_requests_l
FROM sessions
GROUP BY clientip, session_id
Lag window function
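For readers less familiar with window functions, the sessionization logic of the query can be traced in plain Python. This is a sketch under stated assumptions, not the job Fusion runs: timestamps are taken to be epoch seconds, and `sessionize` and its output field names are hypothetical. The `prev_ts` variable plays the role of `lag()`, and the running `session_id` counter plays the role of the windowed `sum()`.

```python
from itertools import groupby
from operator import itemgetter

GAP_SECS = 30  # same threshold as the SQL query


def sessionize(requests):
    """Split each client's requests into sessions wherever the gap exceeds GAP_SECS."""
    sessions = []
    ordered = sorted(requests, key=itemgetter("clientip", "ts"))
    for clientip, rows in groupby(ordered, key=itemgetter("clientip")):
        session_id, prev_ts = 0, None
        for row in rows:
            # lag(): compare with the previous request from the same client
            if prev_ts is not None and row["ts"] - prev_ts > GAP_SECS:
                session_id += 1
            prev_ts = row["ts"]
            sid = f"{clientip}||{session_id}"
            if not sessions or sessions[-1]["id"] != sid:
                sessions.append({"id": sid, "session_start": row["ts"],
                                 "session_end": row["ts"],
                                 "total_bytes": 0, "total_requests": 0})
            current = sessions[-1]
            current["session_end"] = row["ts"]
            current["total_bytes"] += row["bytes"]
            current["total_requests"] += 1
    return sessions


if __name__ == "__main__":
    logs = [
        {"clientip": "a", "ts": 0, "bytes": 100},
        {"clientip": "a", "ts": 10, "bytes": 200},
        {"clientip": "a", "ts": 100, "bytes": 50},
        {"clientip": "b", "ts": 5, "bytes": 10},
    ]
    for s in sessionize(logs):
        print(s)
```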
SQL Aggregations Scalability
• Aggregate 42M signals into 11M groups (query / doc_id)
• ~18 mins on a 3-node EC2 cluster (r3.xlarge)
• Mostly I/O from/to Solr
Why Self-service Analytics?
• Powerful connectors, relevance, speed, and massive scalability = more mission-critical datasets finding their way into Fusion
• Don’t be another data silo!
• Let users ask questions of this data using their tool of choice without adding work for the IT group!
• Aggregations over full-text ranked results
• But it has to be fast, or else you’re right back to data warehousing problems
Self-service Analytics
• Fusion SQL is a service that exposes SQL over JDBC
• Fusion SQL plugs into Apache Spark’s query planner to translate SQL into optimized Solr queries (streaming expressions and JSON facets)
• Integrates with popular BI tools like Tableau, Power BI, and Spotfire, plus notebooks like Apache Zeppelin
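To give a feel for what "translating SQL into JSON facets" means, the sketch below builds a Solr JSON Facet request equivalent to a simple `SELECT field, COUNT(*) ... GROUP BY field ORDER BY COUNT(*) DESC LIMIT n` query. It is a hand-written illustration, not Fusion SQL's actual translation: the function `group_count_to_json_facet` and the facet name `groups` are hypothetical, while the request keys (`query`, `filter`, `limit`, `facet`) and the `terms` facet options follow Solr's JSON Request API.

```python
import json


def group_count_to_json_facet(field, filter_query=None, limit=10):
    """Build a Solr JSON Facet request for COUNT(*) grouped by one field."""
    request = {
        "query": "*:*",
        "limit": 0,  # return facet buckets only, no documents
        "facet": {
            "groups": {  # hypothetical facet name
                "type": "terms",
                "field": field,
                "limit": limit,
                "sort": "count desc",
            }
        },
    }
    if filter_query:
        # The WHERE clause becomes a filter query.
        request["filter"] = [filter_query]
    return request


if __name__ == "__main__":
    # Roughly: SELECT movie_id, COUNT(*) FROM ratings WHERE rating >= 4
    #          GROUP BY movie_id ORDER BY COUNT(*) DESC LIMIT 10
    body = group_count_to_json_facet("movie_id", "rating:[4 TO *]", limit=10)
    print(json.dumps(body, indent=2))
```

Pushing the aggregation into the search engine this way means only the top buckets come back, rather than every matching row.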

Demo: Parallel Bulk Loader
Self-Service Analytics
Self-service Analytics Performance
• A Periscope Data blog post compared their SQL engine against common databases using a count-distinct query typical for dashboards
• 14M logs, 1200 distinct dashboards, 1700 distinct user_id/dashboard_id pairs
• Replicated the experiment with Fusion on EC2 (m1.xlarge), single instance of Solr
• Fusion: ~900 ms
• 28M rows: ~1.3 secs
https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html

Self-service Analytics Performance
SELECT m.title AS title, agg.aggCount AS aggCount
FROM movies m
INNER JOIN (
  SELECT movie_id, COUNT(*) AS aggCount
  FROM ratings
  WHERE rating >= 4
  GROUP BY movie_id
  ORDER BY aggCount DESC
  LIMIT 10
) AS agg ON agg.movie_id = m.id
ORDER BY aggCount DESC
MovieLens data: aggregate 20M ratings (20M rows)
• Fusion SQL: ~1.1 secs
• MySQL: 17 secs (with an index on movie_id)
https://lucidworks.com/2018/08/06/using-tableau-sql-and-search-for-fast-data-visualizations/
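To see what the top-rated-movies query returns, it can be run against a toy dataset with Python's built-in `sqlite3` module. This only demonstrates the query's semantics on three made-up movies; it says nothing about the timings above, and the table contents and the helper names `build_toy_db` and `top_rated` are assumptions for illustration.

```python
import sqlite3

QUERY = """
SELECT m.title AS title, agg.aggCount AS aggCount
FROM movies m
INNER JOIN (
    SELECT movie_id, COUNT(*) AS aggCount
    FROM ratings
    WHERE rating >= 4
    GROUP BY movie_id
    ORDER BY aggCount DESC
    LIMIT 10
) AS agg ON agg.movie_id = m.id
ORDER BY aggCount DESC
"""


def build_toy_db():
    """Create an in-memory database with a few made-up movies and ratings."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating REAL)")
    conn.executemany("INSERT INTO movies VALUES (?, ?)",
                     [(1, "Alien"), (2, "Brazil"), (3, "Clue")])
    conn.executemany("INSERT INTO ratings VALUES (?, ?)",
                     [(1, 5), (1, 4), (1, 2), (2, 4.5), (3, 3)])
    return conn


def top_rated(conn):
    """Return (title, count of ratings >= 4) for the most highly rated movies."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for title, count in top_rated(build_toy_db()):
        print(title, count)
```

The inner query does the expensive work (filter, group, sort, limit) before the join, so the join itself touches at most ten rows.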
Experiments
• Run live experiments to try out new ideas and compare outcomes between variants
• Built-in metrics: MRR, avg|min|max response time, CTR … and, you guessed it, SQL!
• Bayesian bandits to explore/exploit the best-performing variant
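The slides do not say which bandit algorithm Fusion uses. As one common example of a Bayesian bandit, the sketch below implements Thompson sampling with a Beta posterior over each variant's click-through rate. The function names `choose_variant` and `record`, and the (successes, failures) bookkeeping, are assumptions for illustration.

```python
import random


def choose_variant(stats, rng=random):
    """Thompson sampling: draw from each variant's Beta posterior, pick the max."""
    best, best_draw = None, -1.0
    for name, (successes, failures) in stats.items():
        # Beta(1, 1) prior updated with observed successes and failures
        draw = rng.betavariate(successes + 1, failures + 1)
        if draw > best_draw:
            best, best_draw = name, draw
    return best


def record(stats, name, clicked):
    """Update a variant's (successes, failures) counts after one impression."""
    successes, failures = stats[name]
    stats[name] = (successes + 1, failures) if clicked else (successes, failures + 1)


if __name__ == "__main__":
    rng = random.Random(42)
    true_ctr = {"A": 0.10, "B": 0.15}  # made-up rates, unknown to the bandit
    stats = {"A": (0, 0), "B": (0, 0)}
    for _ in range(5000):
        variant = choose_variant(stats, rng)
        record(stats, variant, rng.random() < true_ctr[variant])
    print(stats)
```

Early on the posteriors are wide, so both variants get traffic (explore); as evidence accumulates, the better variant wins most draws (exploit).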
Demo: Parallel Bulk Loader
Experiment Metrics
Recap
• How to build powerful SQL aggregations with joins, custom UDF / UDAF, and window functions to power boosting and recommendations
• Ingesting data from data sources using SQL for ETL and ML
• Self-service analytics from popular BI visualization tools
• Measure outcomes between variants in an experiment using SQL
https://github.com/lucidworks/fusion-spark-bootcamp

Top 10 Things you can do with SQL in Fusion
1. Aggregate signals by query / doc / user to compute boost weights and generate recommendations
2. Ingest & ETL from hundreds of data sources using SparkSQL
3. Use ML models to generate predictions and Lucene text analysis using UDFs
4. Join data from multiple Solr collections and data sources
5. Self-service analytics with BI tools like Tableau and Power BI
6. Hide complex business logic behind UDF / UDAF
7. Use window functions for tasks like sessionization
8. Grouping sets and cubes for advanced analytic reporting
9. Compute KPIs across variants in an experiment
10. Expose complex Solr streaming expressions as simple SQL views
Thank you!
Timothy Potter
Manager Smart Data, Lucidworks
@thelabdude
#Activate18 #ActivateSearch

 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
 


SQL Analytics for Search Engineers - Timothy Potter, Lucidworks

  • 1. SQL Analytics for Search Engineers Timothy Potter Manager of Smart Data @ Lucidworks / Apache Solr Committer @thelabdude #Activate18 #ActivateSearch
  • 2. An ever-expanding list of needs from search engineers • Better relevancy, less manual tuning • Bigger scale, less downtime, fixed resources • Higher QPS, more complex query pipelines • More bespoke, search-driven applications, faster! • Trying out new ideas • Making better decisions with self-service analytics • Random one-off jobs for this and that • Use AI everywhere!
  • 3. The ideal solution … • Easy to explain to your boss how it works • Tooling available • Résumé friendly • Extensible / customizable / flexible • Scalable • People want to feel productive SQL in Fusion!
  • 4. Data Ingest = Project Friction • Bespoke, search-driven applications > general purpose dashboard tools • Getting data in continues to be a hassle / friction when getting started • Need something nimble but also fast / scalable • For every connector, there’s probably 20 SQL / NoSQL data silos
  • 5. Fusion’s Parallel Bulk Loader • Get to the fun stuff faster! • Complement Fusion’s connectors for those dirty ETL jobs that cause friction in every project • High performance parallel reads from structured data sources, including Cassandra, Elastic, HBase, JDBC, Hadoop, … • Basic ETL tasks with SQL and/or custom Scala • ML Model predictions as UDF • Direct to Solr for optimal speed or send to index pipelines for optimal flexibility
  • 6. A foundation built on SparkSQL • Expose structured data as a DataFrame: RDD + schema • 100’s of data sources + formats • spark-solr translates Solr query results to a DataFrame • Highly optimized parallel reads, with predicate pushdown across a Spark cluster • Spark optimizes the SQL query plan • 100’s of built-in functions
  • 7. Demo: Parallel Bulk Loader Parallel Bulk Loader
  • 8. Read parquet from S3 • Write to a Fusion Index Pipeline • Advanced transforms with Scala • Transform with SQL • Add job dependencies On-the-fly
  • 9. User Feedback to Improve Relevancy • MRR is sub-optimal for many queries? • Want to boost some docs based on user click behavior (per query) • Older clicks should age out over time • Some user actions are more important than others: click < cart add < purchase • Sometimes you need to join signals with other tables, e.g. item metadata • Hide complex business logic behind UDF / UDAF (pluggable) • Designed for change!
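The requirements on slide 9 (older clicks age out, and click < cart add < purchase) can be illustrated with a small sketch. This is plain Python, not Fusion's aggregation job: the type weights, the 30-day half-life, and the function names are all assumptions chosen for illustration, and the exponential form is only one possible time-decay function.

```python
import math
from collections import defaultdict

# Hypothetical weights: only the ordering click < cart add < purchase
# comes from the slide; the numbers are made up for illustration.
TYPE_WEIGHTS = {"click": 1.0, "cart_add": 5.0, "purchase": 10.0}


def decayed_weight(signal_type, age_days, half_life_days=30.0):
    """Weight of one signal, halved every `half_life_days` (assumed decay)."""
    return TYPE_WEIGHTS[signal_type] * math.pow(0.5, age_days / half_life_days)


def aggregate_boosts(signals, half_life_days=30.0):
    """Sum decayed weights per (query, doc_id) group.

    `signals` is an iterable of (query, doc_id, signal_type, age_days).
    """
    boosts = defaultdict(float)
    for query, doc_id, signal_type, age_days in signals:
        boosts[(query, doc_id)] += decayed_weight(signal_type, age_days, half_life_days)
    return dict(boosts)
```

With these assumed numbers, a fresh click and a 30-day-old purchase on the same query/doc pair sum to 1.0 + 5.0 = 6.0, while a lone 60-day-old click contributes 0.25. In a SQL job the same logic would sit behind a UDF / UDAF so the business rule can change without touching the query.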
  • 10. Signal Data Flow in Fusion
  • 11. Demo: Parallel Bulk Loader SQL Aggregation
  • 12. Join with other tables • Custom UDAF • Final output to Solr
  • 13. Window Functions
        WITH sessions AS (
          SELECT *,
                 sum(IF(diff_secs > 30, 1, 0))
                   OVER (PARTITION BY clientip ORDER BY ts) session_id
          FROM (
            SELECT *,
                   unix_timestamp(ts) - lag(unix_timestamp(ts))
                     OVER (PARTITION BY clientip ORDER BY ts) as diff_secs
            FROM ${inputCollection}
            WHERE clientip IS NOT NULL AND ts IS NOT NULL
              AND bytes IS NOT NULL AND verb IS NOT NULL
              AND response IS NOT NULL
          )
        )
        SELECT concat_ws('||', clientip, session_id) as id,
               first(clientip) as clientip,
               min(ts) as session_start,
               max(ts) as session_end,
               timediff(max(ts), min(ts), "MILLISECONDS") as session_len_ms_l,
               sum(bytes) as total_bytes_l,
               count(*) as total_requests_l
        FROM sessions
        GROUP BY clientip, session_id
        Lag window function
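To make the lag / running-sum trick on slide 13 easier to follow, here is a rough plain-Python equivalent. It is a sketch, not the Spark job: the function name and the tuple layout `(clientip, ts_epoch_secs, bytes)` are assumptions, and timestamps are taken as epoch seconds rather than Solr date values.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap_secs=30):
    """Group events per client into sessions split on gaps > gap_secs.

    `events` is an iterable of (clientip, ts_epoch_secs, bytes) tuples.
    """
    sessions = []
    # PARTITION BY clientip ORDER BY ts
    ordered = sorted(events, key=itemgetter(0, 1))
    for clientip, rows in groupby(ordered, key=itemgetter(0)):
        session_id = 0
        prev_ts = None   # lag(ts) is NULL for the first row of a partition
        current = None
        for _, ts, nbytes in rows:
            # sum(IF(diff_secs > 30, 1, 0)): each large gap bumps the session id
            if prev_ts is not None and ts - prev_ts > gap_secs:
                session_id += 1
                current = None
            if current is None:
                current = {
                    "id": f"{clientip}||{session_id}",
                    "clientip": clientip,
                    "session_start": ts,
                    "session_end": ts,
                    "total_bytes": 0,
                    "total_requests": 0,
                }
                sessions.append(current)
            current["session_end"] = ts
            current["total_bytes"] += nbytes
            current["total_requests"] += 1
            prev_ts = ts
    return sessions
```

Requests from one client at 0 s, 10 s and 100 s would fall into two sessions here, because only the 90-second gap exceeds the 30-second threshold. The SQL version expresses the same idea declaratively, which lets Spark run it in parallel across partitions.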
  • 14. SQL Aggregations Scalability • Aggregate 42M signals into 11M groups (query / doc_id) • ~18 mins on 3 node EC2 cluster (r3.xlarge) • Mostly I/O from/to Solr
  • 15. Why Self-service Analytics? • Powerful connectors, relevance, speed, and massive scalability = more mission-critical datasets finding their way into Fusion • Don’t be another data silo! • Let users ask questions of this data using their tool of choice w/o adding work for the IT group! • Aggregations over full-text ranked results • But it has to be fast else you’re right back to data warehousing problems
  • 16. Self-service Analytics • Fusion SQL is a JDBC service that supports SQL • Fusion SQL plugs into Apache Spark’s query planner to translate SQL into optimized Solr queries (streaming expressions and JSON facets) • Integrate with popular BI tools like Tableau, PowerBI, and Spotfire + Notebooks like Apache Zeppelin
  • 18. Demo: Parallel Bulk Loader Self-Service Analytics
  • 20. Self-service Analytics Performance • Blog performed a comparison of their SQL engine against common DBs using a count distinct query typical for dashboards • 14M logs, 1200 distinct dashboards, 1700 distinct user_id/dashboard_id pairs • Replicated the experiment with Fusion on EC2 (m1.xlarge), single instance of Solr • Fusion: ~900 ms • 28M rows: ~1.3 secs • https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
  • 21. Self-service Analytics Performance
        SELECT m.title as title, agg.aggCount as aggCount
        FROM movies m
        INNER JOIN (
          SELECT movie_id, COUNT(*) as aggCount
          FROM ratings
          WHERE rating >= 4
          GROUP BY movie_id
          ORDER BY aggCount desc
          LIMIT 10) as agg
        ON agg.movie_id = m.id
        ORDER BY aggCount DESC
        • Movielens data: Aggregate 20M ratings • 20M rows • Fusion SQL: ~1.1 secs • MySQL: 17 secs (w/ index on movie_id) • https://lucidworks.com/2018/08/06/using-tableau-sql-and-search-for-fast-data-visualizations/
  • 22. Experiments • Run live experiments to try out new ideas and compare outcomes between variants • Built-in metrics: MRR, avg | min | max response time, CTR … and you guessed it! SQL • Bayesian Bandits to explore/exploit the best performing variant
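One common reading of "Bayesian Bandits" is Thompson sampling over Beta posteriors, sketched below. The slides do not say which algorithm Fusion actually uses, so the class name, the uniform Beta(1, 1) prior, and the click / no-click reward model are all assumptions made for illustration.

```python
import random


class BetaBandit:
    """Thompson sampling over experiment variants with binary rewards."""

    def __init__(self, variants, seed=None):
        self.rng = random.Random(seed)
        # [alpha, beta] per variant, starting from a uniform Beta(1, 1) prior
        self.stats = {v: [1, 1] for v in variants}

    def choose(self):
        """Draw a plausible CTR for each variant and pick the largest draw."""
        draws = {v: self.rng.betavariate(a, b) for v, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def record(self, variant, success):
        """Update the posterior with one observed outcome (click or no click)."""
        if success:
            self.stats[variant][0] += 1
        else:
            self.stats[variant][1] += 1
```

Each request calls `choose()` to route traffic and `record()` once the outcome is known. Variants with little data have wide posteriors and still get explored, while traffic gradually shifts toward whichever variant keeps producing clicks.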
  • 23. Demo: Parallel Bulk Loader Experiment Metrics
  • 24. Recap • How to build powerful SQL aggregations with joins, custom UDF / UDAF, and window functions to power boosting and recommendations • Ingesting data from data sources using SQL for ETL, ML • Self-service analytics from popular BI visualization tools • Measure outcomes between variants in an experiment using SQL https://github.com/lucidworks/fusion-spark-bootcamp
  • 25. Top 10 Things you can do with SQL in Fusion 1. Aggregate signals by query / doc / user to compute boost weights and generate recommendations 2. Ingest & ETL from 100’s of data sources using SparkSQL 3. Use ML models to generate predictions and Lucene text analysis using UDF functions 4. Join data from multiple Solr collections and data sources 5. Self-service analytics with BI tools like Tableau and PowerBI 6. Hide complex business logic behind UDF / UDAF 7. Use window functions for tasks like sessionization 8. Grouping sets and cubes for advanced analytic reporting 9. Compute KPIs across variants in an experiment 10. Expose complex Solr streaming expressions as simple SQL views
  • 26. Thank you! Timothy Potter Manager Smart Data, Lucidworks @thelabdude #Activate18 #ActivateSearch

Editor's Notes

  1. How are you going to get all this done? In Fusion, we chose SQL as the foundational technology to solve many of these issues.
  2. So I think we’re all pretty clear on the scope of the problem, but what might the ideal solution look like? Audience poll: - How many know SQL and have used it in some fashion in the last year? - How many have integrated some sort of SQL database with search today?
  3. One of the amazing things about App Studio is that you can rapidly build bespoke search applications w/o creating another data silo! Getting data indexed is not the end goal of a project; it is an impediment on most projects that adds friction and distracts us from the important stuff (queries / visualization). Organizations are really good at provisioning data silos. To let people ask new questions of your data, they need access across many data sources. SQL and NoSQL databases are everywhere! Need something nimble to go grab data from multiple places. Connectors are great for complex business apps like SharePoint and Box, but for every SharePoint there are 100 SQL / NoSQL databases in a modern org.
  4. SQL lets Spark create an optimized query plan, which sometimes we know how to optimize further for Solr. Typically built by experts. NoSQL: Cassandra, HBase, Hive, Mongo; S3, HDFS, parquet; Search: Solr, Elastic; RDBMS: JDBC, Redshift, Hive; Azure, Google Analytics.
  5. Ingest data from S3; invoke an ML model to do NLP stuff; do some basic ETL with SQL.
  6. Just a placeholder slide for what is shown in the demo
  7. Spark function reference: https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/index.html
  8. See: https://doc.lucidworks.com/fusion-ai/4.0/user-guide/signals/signals.html
  9. Fusion’s built-in click signal SQL aggregation job; time-decay function; custom UDF (price bucketing); custom SQL job: sessionization of logs with window functions.
  10. Just a placeholder slide for what is shown in the demo
  11. Just a placeholder to show another example of a SQL agg job, this time with a window function to find sessions.
  12. The traditional problem with self-service analytics is speed, flexibility, scalability. A whole that’s greater than the sum of its parts.
  13. Push down the computation of an aggregated query into Solr for maximum performance. Or, pull rows into Spark from Solr to perform most any analytics task.
  14. At step 1, a Fusion data analyst is authenticated by the JDBC/ODBC client application (e.g. Spotfire or Tableau) using Kerberos. Once authenticated, the user’s SQL query is sent to the Fusion SQL Thriftserver over HTTP (step 2 in the diagram). The SQL Thriftserver uses the service principal keytab to validate the incoming user identity using Kerberos (step 3).

      The Fusion SQL Thriftserver is a Spark application with a specific number of CPU cores and memory allocated from the pool of Spark resources. You can scale out the number of Spark worker nodes to increase available memory and CPU resources to the SQL service.

      The Thriftserver sends the query to Spark to be parsed into a logical query plan (step 4). During the query planning stage, Spark sends the logical plan to Fusion’s pushdown strategy component (step 5). The pushdown strategy analyzes the query plan to determine if there is an optimal Solr query / streaming expression that can “push down” aggregations into Solr to improve performance and scalability. For instance, the following SQL query can be translated into a Solr facet query by the Fusion pushdown strategy:

      select count(1) as the_count, movie_id from ratings group by movie_id

      The basic idea behind Fusion’s pushdown strategy is that it is much faster to let Solr facets perform basic aggregations than it is to export raw documents from Solr and have Spark perform the aggregation. If an optimal pushdown query is not possible, then Spark will pull raw documents from Solr and then perform any joins / aggregations needed in Spark. Put simply, the Fusion SQL service tries to translate SQL queries into optimized Solr queries but, failing that, the service simply reads all matching docs for a query into Spark and performs the SQL execution logic across the Spark cluster.

      During pushdown analysis, Fusion calls out to the registered AuthzFilterProvider implementation to get a filter query to perform row-level filtering for the Kerberos authenticated user (step 6). By default there is no row-level security provider, but users can install their own implementation using the Fusion SQL service API.

      Lastly, a distributed Solr query gets executed by Spark to return documents that satisfy the SQL query criteria and row-level security filter (step 7). To leverage the distributed nature of Spark and Solr, Fusion SQL sends a query to all replicas for each shard in a Solr collection. Consequently, you can scale out SQL query performance by adding more Spark and/or Solr resources to your cluster.
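As a rough illustration of the pushdown idea in note 14, the `group by movie_id` count could be expressed as a Solr JSON Facet request along the following lines. This is a sketch of the general shape only: the helper name is hypothetical, the exact request Fusion generates is not shown in the talk, and the `acl_ss` field in the usage note is an invented stand-in for whatever an AuthzFilterProvider might return.

```python
import json


def facet_request_for_group_count(group_field, security_filter=None):
    """Build a Solr JSON request body that counts documents per group_field.

    Mirrors `select count(1), <field> from ... group by <field>`: no rows
    are returned (limit 0), only the terms facet buckets with their counts.
    """
    request = {
        "query": "*:*",
        "limit": 0,
        "facet": {
            group_field: {
                "type": "terms",
                "field": group_field,
                "limit": -1,  # return every bucket, not just the top N
            }
        },
    }
    if security_filter:
        # Row-level security is applied as an extra filter query (assumed shape)
        request["filter"] = [security_filter]
    return request


if __name__ == "__main__":
    print(json.dumps(facet_request_for_group_count("movie_id"), indent=2))
```

Calling `facet_request_for_group_count("movie_id", security_filter="acl_ss:analysts")` adds the filter alongside the facet. The point of the sketch is that Solr returns one bucket per `movie_id` instead of shipping every rating document to Spark for aggregation.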
  15. Show connecting to Fusion SQL from Tableau (or maybe Apache Superset). Build a simple data visualization on-the-fly.
  16. Just a placeholder slide for what is shown in the demo
  17. Avg. time on site / # of interactions per variant. Show results in App Insights.