Business Proprietary & Confidential
SimilarWeb & Tel Aviv University
On Quantum Clustering
Sigalit Bechler
December 1, 2014
• SimilarWeb – a quick introduction
• Quantum Clustering
Agenda
Founded: 2007
Funding: $65M
Offices: 6
Employees: 300
SimilarWeb
Some of our
clients
What We Do
60M WEBSITES DAILY
FOR EVERY WEBSITE:
• TRAFFIC ESTIMATION
• TRAFFIC SOURCES
• AUDIENCE
• INDUSTRY
• CONTENT
We Provide Digital Insights to the Entire World
2M MOBILE APPS DAILY
FOR EVERY MOBILE APP:
• RATING
• ENGAGEMENT
• APP STORE DATA
• CATEGORY
• KEYWORDS
What We Do
60M WEBSITES DAILY
FOR EVERY WEBSITE:
• TRAFFIC METRICS
• TRAFFIC SOURCES
• AUDIENCE
• INDUSTRY
• CONTENT
2M MOBILE APPS DAILY
FOR EVERY MOBILE APP:
• RATING
• ENGAGEMENT
• APP STORE
• CATEGORY
• KEYWORDS
INGEST:
INTERNATIONAL PANEL, CRAWLING,
ISP DATA, LEARNING SET
• 90K events/sec
• 4 TB/day compressed
BATCH & ON-DEMAND PROCESSING:
• 100 TB of I/O per day
• More than 150 machines in the processing cluster alone
• Statistical & machine learning algorithms
We Provide Digital Insights to the Entire World
Quantum clustering
Prof. David Horn and Dr. Assaf Gottlieb.
Phys. Rev. Lett. 88 (2002) 018702
• Unsupervised learning problem - dealing with unlabeled data
• Goal: group together elements that are similar to each other in some sense.
• We usually have an idea of, or a preference for, what this “sense” should be
• Might discover new patterns
Clustering - general overview
[Table columns: label, feature1, feature2, feature3, feature4]
• The user identity is unknown
• Leaving it in for the example
Clustering - general overview
[Table columns: label, feature1, feature2, feature3, feature4; every label is shown as “?”]
• Grouping by gender
Clustering - general overview
[Table columns: label, feature1, feature2, feature3, feature4]
• Grouping by fields of interest
Clustering - general overview
[Table columns: label, feature1, feature2, feature3, feature4]
Quantum Clustering - Motivation
• Relatively easy clustering task
• Still need to set the number of clusters manually
• Very complex clustering task
• Unbiased analysis of X-ray absorption data
Quantum Clustering - Example
Analyzing Big Data with Dynamic Quantum Clustering
M. Weinstein, F. Meirer, A. Hume, Ph. Sciau, G. Shaked, R. Hofstetter, E. Persi, A. Mehta, D. Horn
http://arxiv.org/abs/1310.2700
• Information era - big data
• Massive collection of data
• Strong presence of outliers
• Unknown structures
• Non-trivial patterns
Why is it important?
Quantum Clustering
Distributed computation technologies
Quantum clustering - the potential trick
1. Turn the data points into Gaussians centered on the data points:
$$\varphi(x) = \sum_{i=1}^{n} e^{-\frac{|x - x_i|^2}{2\sigma^2}}$$
2. Plug $\varphi(x)$ into the Schrödinger equation and solve for $V(x)$.
Define the solution for $V$ as the potential transform:
$$V(x) = \frac{\sigma^2}{2\varphi(x)} \nabla^2 \varphi(x)$$
• Single point → Gaussian → $V(x) = \frac{1}{2\sigma^2}(x - x_i)^2$
• Multiple points:
$$V(x) = \frac{1}{2\sigma^2 \varphi(x)} \sum_{i} (x - x_i)^2 e^{-\frac{(x - x_i)^2}{2\sigma^2}} = \frac{1}{2\sigma^2} \cdot \frac{\sum_{i} (x - x_i)^2 e^{-\frac{(x - x_i)^2}{2\sigma^2}}}{\sum_{i=1}^{n} e^{-\frac{|x - x_i|^2}{2\sigma^2}}}$$
3. Move each data point toward a minimum of $V(x)$ by gradient descent on the potential surface.
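The three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `qc_potential` and `qc_descend`, the step size, the iteration count, and the toy two-blob data are all assumptions made here. It uses the multi-point form of V written on the slide, which drops additive constants; those do not move the minima.

```python
import numpy as np


def qc_potential(points, data, sigma):
    """Evaluate phi, V and grad V at `points` (m, d), given `data` (n, d)."""
    diff = points[:, None, :] - data[None, :, :]   # (m, n, d): x - x_i
    sq = np.sum(diff ** 2, axis=2)                 # (m, n): |x - x_i|^2
    w = np.exp(-sq / (2.0 * sigma ** 2))           # Gaussian weight of each data point
    phi = w.sum(axis=1)                            # step 1: sum of Gaussians
    V = (sq * w).sum(axis=1) / (2.0 * sigma ** 2 * phi)   # step 2: potential
    # Analytic gradient of the expression for V above (quotient rule).
    coeff = w * (1.0 + V[:, None] - sq / (2.0 * sigma ** 2))
    grad = (coeff[:, :, None] * diff).sum(axis=1) / (sigma ** 2 * phi[:, None])
    return phi, V, grad


def qc_descend(data, sigma, step=0.1, n_steps=300):
    """Step 3: slide a copy of every data point downhill on V.

    The original data stay fixed and keep defining the potential.
    The fixed step size and iteration count are illustrative assumptions.
    """
    replicas = np.array(data, dtype=float)
    for _ in range(n_steps):
        _, _, grad = qc_potential(replicas, data, sigma)
        replicas -= step * grad
    return replicas


# Toy data (an assumption for illustration): two well-separated 2D blobs.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=4.0, scale=0.3, size=(30, 2)),
])
final = qc_descend(data, sigma=1.0)
print("blob 1 replicas end near:", final[:30].mean(axis=0))
print("blob 2 replicas end near:", final[30:].mean(axis=0))
```

The pairwise arrays make each evaluation cost O(m·n) in time and memory, which is one reason approximate and distributed variants are of interest for large data sets.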
Quantum clustering – reasoning
• Why does it make sense?
• $\frac{\sigma^2}{2}\nabla^2$: models the divergence effects away from the cluster center.
• $V(x)$: the effects that bind points from the same cluster together.
• We look for the minima of $V(x)$ because that is where the divergence effects are minimal (slow changes give a small numerator, high density gives a large denominator):
$$V(x) = \frac{\sigma^2}{2\varphi(x)} \nabla^2 \varphi(x)$$
• SVD may be performed prior to the clustering: $X = U S V^T$, then perform QC on $U$ or $V$
• This addresses the fact that each feature has a different type and scale.
• It enables dimensionality reduction to the directions with the highest variance.
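The SVD preprocessing step can be sketched as follows. The helper name `svd_reduce` is hypothetical, and centering the columns before the decomposition is an assumption made here; the slide only states X = USV^T and that QC is then run on U or V. The sketch keeps the leading columns of U, so every sample gets scale-free coordinates along the highest-variance directions.

```python
import numpy as np


def svd_reduce(X, n_components):
    """Return the first `n_components` columns of U, plus all singular values.

    Rows of X are samples, columns are features. Centering is an assumption
    of this sketch, not something stated in the slides.
    """
    Xc = X - X.mean(axis=0)
    # full_matrices=False gives the compact SVD: U is (n_samples, n_features).
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Singular values come back sorted in descending order, so the leading
    # columns of U correspond to the directions with the highest variance.
    return U[:, :n_components], S


# Features on very different scales (toy data, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 0.1])
U_reduced, singular_values = svd_reduce(X, n_components=2)
print(U_reduced.shape)
print(singular_values)
```

Because the columns of U are orthonormal, the reduced coordinates no longer carry the original units, which is what makes a single σ meaningful across dimensions.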
A topographic map of the probability distribution for the crab data set with σ=1/2, using principal components 2 and 3. There exists only one maximum.
A topographic map of the potential for the crab data set with σ=1/2, using principal components 2 and 3. The four minima are denoted by crossed circles. The contours are set at values V=cE for c=0.2,…,1.
The Crabs Example (from Ripley’s textbook): 4 classes, 50 samples each, d=5
[Figure panels: the data; 3D plot of the potential]
Quantum clustering - summary
• Built-in capability to handle outliers (the divergence part): no need for additional parameters or processes, and no effect on the number of significant clusters
• A cluster may be a line or another shape, not necessarily a point in the feature space.
• The clusters are not defined by geometric or probabilistic considerations alone
• No need to predefine the number of clusters
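One way to see why the number of clusters need not be fixed in advance: once the points have slid down the potential, those that ended up at the same minimum can simply be grouped together, and the number of groups falls out of the data. The sketch below assumes the converged positions are already available; the greedy grouping rule, the helper name `label_converged`, and the tolerance `tol` are assumptions of this sketch, not something taken from the slides.

```python
import numpy as np


def label_converged(points, tol):
    """Group converged points: same label if within `tol` of a group's first member."""
    labels = -np.ones(len(points), dtype=int)
    centers = []
    for i, p in enumerate(points):
        for k, c in enumerate(centers):
            if np.linalg.norm(p - c) < tol:
                labels[i] = k          # joins an existing group
                break
        else:
            centers.append(p)          # no group is close enough: start a new one
            labels[i] = len(centers) - 1
    return labels, np.array(centers)


# Hand-made "converged" positions (toy values, for illustration only).
converged = np.array([
    [0.00, 0.00],
    [0.01, 0.00],
    [5.00, 5.00],
    [5.00, 5.01],
    [0.00, 0.005],
])
labels, centers = label_converged(converged, tol=0.1)
print(labels)
print(len(centers), "clusters found")
```

The tolerance plays the role of a convergence threshold rather than a cluster count; it should be small compared with the distance between distinct minima.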
• An approximate quantum clustering variant exists that improves time complexity.
• Sensitive to small variations in the data density, unlike purely geometric considerations.
• Possible distributed calculation:
• Since all we need is to calculate V and ∇V for every data point, the work for each point can be done separately on a different machine
• Performed exceptionally well at exposing hidden patterns in data from a wide range of fields: finance, online marketing, experimental physics, speech recognition, biological data.
Quantum clustering
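The per-point independence behind the distributed calculation can be illustrated on a single machine. In this sketch a thread pool stands in for separate machines; the names `potential_chunk` and `potential_parallel` and the worker count are assumptions made here. Each worker evaluates V for its own slice of points and needs only read access to the full data set.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def potential_chunk(chunk, data, sigma):
    """Evaluate V at every point of `chunk`, independently of all other chunks."""
    diff = chunk[:, None, :] - data[None, :, :]
    sq = np.sum(diff ** 2, axis=2)
    w = np.exp(-sq / (2.0 * sigma ** 2))
    return (sq * w).sum(axis=1) / (2.0 * sigma ** 2 * w.sum(axis=1))


def potential_parallel(points, data, sigma, n_workers=4):
    """Split the evaluation points into chunks and process them concurrently."""
    chunks = np.array_split(points, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in the order of the input chunks.
        parts = list(pool.map(lambda c: potential_chunk(c, data, sigma), chunks))
    return np.concatenate(parts)


rng = np.random.default_rng(2)
data = rng.normal(size=(500, 3))
V_serial = potential_chunk(data, data, sigma=1.0)
V_parallel = potential_parallel(data, data, sigma=1.0)
print(np.allclose(V_serial, V_parallel))
```

The same split applies to ∇V, so a gradient descent step only requires gathering the updated positions from the workers.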
• Physics may provide an interesting perspective on questions that at first glance have no connection to physics.
• It has been done in scale-space theory
• Sensitive to small variations in the data density
• In bioinformatics for extracting protein structure
• And many more
Quantum clustering
Thank You!

"Quantum clustering - physics inspired clustering algorithm", Sigalit Bechler, Researcher, Similar Web
