"Quantum clustering - physics inspired clustering algorithm", Sigalit Bechler, Researcher, Similar Web
- 1. Business Proprietary & Confidential
SimilarWeb & Tel-Aviv university
On
Quantum Clustering
Sigalit Bechler
December 1, 2014
- 2. Agenda
• SimilarWeb – a quick introduction
• Quantum Clustering
- 4. What We Do
60M WEBSITES DAILY
FOR EVERY WEBSITE:
• TRAFFIC ESTIMATION
• TRAFFIC SOURCES
• AUDIENCE
• INDUSTRY
• CONTENT
2M MOBILE APPS DAILY
FOR EVERY MOBILE APP:
• RATING
• ENGAGEMENT
• APP STORE DATA
• CATEGORY
• KEYWORDS
We Provide Digital Insights to the Entire World
- 5. What We Do
60M WEBSITES DAILY
FOR EVERY WEBSITE:
• TRAFFIC METRICS
• TRAFFIC SOURCES
• AUDIENCE
• INDUSTRY
• CONTENT
2M MOBILE APPS DAILY
FOR EVERY MOBILE APP:
• RATING
• ENGAGEMENT
• APP STORE
• CATEGORY
• KEYWORDS
INGEST: INTERNATIONAL PANEL, CRAWLING, ISP DATA, LEARNING SET
• 90K events/sec
• 4TB/day compressed
BATCH & ON DEMAND PROCESSING:
• 100TB i/o a day
• > 150 machines just in processing cluster
• Statistical & machine learning algorithms
We Provide Digital Insights to the Entire World
- 6. Quantum clustering
Prof. David Horn and Dr. Assaf Gottlieb,
Phys. Rev. Lett. 88 (2002) 018702
- 7. • Unsupervised learning problem - dealing with unlabeled data
• Goal: group together elements that are similar to each other in some sense.
• We usually have an idea or a desire of what this “sense” should be
• Might discover new patterns
Clustering - general overview
[Table columns: label, feature1, feature2, feature3, feature4]
- 8. • The user identity is unknown
• Leaving it in for the example
Clustering - general overview
[Table columns: label, feature1, feature2, feature3, feature4; every label shown as "?"]
- 9. • Grouping by gender
Clustering - general overview
[Table columns: label, feature1, feature2, feature3, feature4]
- 10. • Grouping by fields of interest
Clustering - general overview
[Table columns: label, feature1, feature2, feature3, feature4]
- 11. Quantum Clustering - Motivation
• Relatively easy clustering task
• Still need to set the number of clusters manually.
• Very complex clustering task.
• Unbiased analysis of X-Ray absorption data
- 12. Quantum Clustering - Example
Analyzing Big Data with Dynamic Quantum Clustering
M. Weinstein, F. Meirer, A. Hume, Ph. Sciau, G. Shaked, R. Hofstetter, E. Persi, A. Mehta, D. Horn
http://arxiv.org/abs/1310.2700
- 13. • Information era - big data
• Massive collection of data
• Strong presence of outliers
• Unknown structures
• Non trivial patterns
Why is it important?
[Diagram: Quantum Clustering; Distributed computation technologies]
- 14. Quantum clustering - the potential trick
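1. Turn the data points into Gaussians centered on the data points:
$$\varphi(x) = \sum_{i=1}^{n} e^{-\frac{|x - x_i|^2}{2\sigma^2}}$$
2. Plug $\varphi(x)$ into the Schrödinger equation and solve for $V(x)$.
Define the solution for $V$ as the potential transform (additive constants are dropped throughout, since they do not move the minima):
$$V(x) = \frac{\sigma^2}{2\varphi(x)} \nabla^2 \varphi(x)$$
• Single point → Gaussian → $V(x) = \frac{1}{2\sigma^2}(x - x_i)^2$
• Multi-points:
$$V(x) = \frac{1}{2\sigma^2 \varphi(x)} \sum_i (x - x_i)^2 e^{-\frac{(x - x_i)^2}{2\sigma^2}} = \frac{1}{2\sigma^2 \sum_{i=1}^{n} e^{-\frac{|x - x_i|^2}{2\sigma^2}}} \sum_i (x - x_i)^2 e^{-\frac{(x - x_i)^2}{2\sigma^2}}$$
3. Move each data point toward the minima of $V(x)$ by gradient descent on the potential surface.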
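The three steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `potential`, `grad_potential` and `quantum_cluster`, the step size, the iteration count and the toy two-blob data are all made up here. The potential is the multi-point form of V, and the gradient is its analytic derivative.

```python
import numpy as np

def potential(x, data, sigma):
    """Multi-point V(x) = sum_i |x-x_i|^2 w_i / (2 sigma^2 sum_i w_i)."""
    sq = np.sum((x - data) ** 2, axis=1)             # |x - x_i|^2 for every data point
    # Gaussian weights; subtracting sq.min() rescales numerator and
    # denominator by the same factor, so V is unchanged but underflow is avoided.
    w = np.exp(-(sq - sq.min()) / (2 * sigma ** 2))
    return np.sum(sq * w) / (2 * sigma ** 2 * np.sum(w))

def grad_potential(x, data, sigma):
    """Analytic gradient of the multi-point V at x."""
    diff = x - data                                  # (n, d) rows are x - x_i
    sq = np.sum(diff ** 2, axis=1)
    w = np.exp(-(sq - sq.min()) / (2 * sigma ** 2))
    p = w / np.sum(w)                                # normalised weights
    mean_sq = np.sum(p * sq)                         # weighted mean of |x - x_i|^2
    # grad V = 1/(2 sigma^2) * sum_i p_i (x - x_i) [2 - (sq_i - mean_sq)/sigma^2]
    coef = p * (2.0 - (sq - mean_sq) / sigma ** 2)
    return coef @ diff / (2 * sigma ** 2)

def quantum_cluster(data, sigma, step=0.1, n_iter=200):
    """Step 3: slide a copy of every data point downhill on V.

    The original data stay fixed and define the potential; only the
    copies move. step and n_iter are illustrative defaults (assumption).
    """
    data = np.asarray(data, dtype=float)
    points = data.copy()
    for _ in range(n_iter):
        grads = np.array([grad_potential(p, data, sigma) for p in points])
        points -= step * grads
    return points

# Toy data (assumption): two well-separated Gaussian blobs in 2D.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(30, 2))
data = np.vstack([blob_a, blob_b])
final = quantum_cluster(data, sigma=1.0)
```

After the descent, the rows of `final` that started in the same blob should sit almost on top of each other, near a minimum of V. A fixed step size is the simplest choice here; a line search or adaptive step would be a reasonable alternative.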
- 15. Quantum clustering – reasoning
• Why does it make sense?
• $\frac{\sigma^2}{2}\nabla^2$: models the divergence effects from the cluster center.
• $V(x)$: the effects that bind points from the same cluster together.
• We may say that we are looking for the minima of $V(x)$, since this is where the divergence effects are minimal (slow changes give a small numerator, high density gives a large denominator):
$$V(x) = \frac{\sigma^2}{2\varphi(x)} \nabla^2 \varphi(x)$$
• SVD may be performed prior to the clustering: $X = U S V^T$, then perform QC on $U$ or $V$.
• This addresses the fact that each feature has a different unit and scale.
• It enables dimension reduction to the components with the highest variance.
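The SVD preprocessing step can be sketched as follows. The helper name `svd_preprocess` and the toy matrix are hypothetical, and centering the columns before the decomposition is an assumption made here so that the leading components correspond to the directions of highest variance; the slides only state X = USV^T.

```python
import numpy as np

def svd_preprocess(X, n_components):
    """Return the leading columns of U, the singular values and rows of V^T.

    Quantum clustering would then be run on the rows of the returned U
    instead of on the raw features.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centering (assumption, see lead-in)
    # Thin SVD: Xc = U @ diag(S) @ Vt, singular values sorted in descending order.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components], S[:n_components], Vt[:n_components]

# Toy matrix (assumption): 20 samples, 5 features on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])
U_k, S_k, Vt_k = svd_preprocess(X, n_components=2)
```

The columns of `U_k` are orthonormal, so every retained dimension has the same scale regardless of the units of the original features.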
- 16. The Crabs Example (from Ripley’s textbook), 4 classes, 50 samples each, d=5
[Figure panels: "The data"; "3D Plot of the potential"]
A topographic map of the probability distribution for the crab data set with σ=1/2 using principal components 2 and 3. There exists only one maximum.
A topographic map of the potential for the crab data set with σ=1/2 using principal components 2 and 3. The four minima are denoted by crossed circles. The contours are set at values V=cE for c=0.2,…,1.
- 17. Quantum clustering - summary
• Built-in capability to handle outliers (divergence part): no need for additional parameters or processes, and no effect on the number of significant clusters
• A cluster may be a line or another shape, not necessarily a point in the feature space.
• The clusters are not defined by geometric or probability considerations alone
• No need to pre-define the number of clusters
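One possible way to read off the clusters once the points have settled into the minima is to group points that ended up within a small distance of each other; the number of clusters then follows from the number of distinct minima reached rather than from a preset value. The helper `group_converged` and the tolerance `tol` are hypothetical, and this greedy grouping is an assumption of this sketch, not a step given in the slides.

```python
import numpy as np

def group_converged(points, tol):
    """Greedily assign each converged point to the first centre within tol.

    A point that is not within tol of any existing centre starts a new
    cluster, so the cluster count is an output, not an input.
    """
    points = np.asarray(points, dtype=float)
    labels = np.full(len(points), -1, dtype=int)
    centers = []
    for i, p in enumerate(points):
        for k, c in enumerate(centers):
            if np.linalg.norm(p - c) < tol:
                labels[i] = k
                break
        else:
            # No existing centre is close enough: open a new cluster.
            centers.append(p)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

# Hand-made converged positions (assumption): two groups of nearby points.
converged = np.array([[0.0, 0.0], [0.01, 0.0], [5.0, 5.0], [5.0, 5.01], [0.0, 0.005]])
labels, centers = group_converged(converged, tol=0.1)
```

The tolerance should be small compared with the distance between minima and large compared with the residual spread left by the gradient descent.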
- 18. • An approximate quantum clustering variant exists that improves time complexity.
• Sensitive to small variations in the data density, unlike purely geometric considerations.
• Possible distributed calculation:
• Since all we need is V and 𝛻V for every data point, the calculation for each point can run separately on a different machine.
• Performed exceptionally well at exposing hidden patterns in data from a wide range of fields: finance, on-line marketing, experimental physics, speech recognition, biological data.
Quantum clustering
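The per-point independence of V and 𝛻V can be illustrated on a single machine by splitting the points into chunks and evaluating each chunk in its own worker. In this sketch a thread pool stands in for separate machines; the function names `potential_and_grad` and `parallel_potential`, the chunking scheme and the worker count are assumptions, and every worker is assumed to hold a full copy of the data.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def potential_and_grad(chunk, data, sigma):
    """V and grad V (multi-point form) for every row of chunk."""
    diff = chunk[:, None, :] - data[None, :, :]          # (m, n, d): x - x_i
    sq = np.sum(diff ** 2, axis=2)                       # (m, n): |x - x_i|^2
    # Shift by the row minimum for numerical stability; the ratio is unchanged.
    w = np.exp(-(sq - sq.min(axis=1, keepdims=True)) / (2 * sigma ** 2))
    p = w / w.sum(axis=1, keepdims=True)                 # normalised Gaussian weights
    mean_sq = np.sum(p * sq, axis=1)                     # (m,)
    V = mean_sq / (2 * sigma ** 2)
    coef = p * (2.0 - (sq - mean_sq[:, None]) / sigma ** 2)
    grad = np.einsum('mn,mnd->md', coef, diff) / (2 * sigma ** 2)
    return V, grad

def parallel_potential(points, data, sigma, n_workers=4):
    """Evaluate V and grad V chunk by chunk; chunks do not talk to each other."""
    chunks = np.array_split(points, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda c: potential_and_grad(c, data, sigma), chunks))
    V = np.concatenate([r[0] for r in results])
    grad = np.vstack([r[1] for r in results])
    return V, grad

# Toy data (assumption): 50 random points in 3 dimensions.
rng = np.random.default_rng(2)
data = rng.normal(size=(50, 3))
V_par, grad_par = parallel_potential(data, data, sigma=1.0, n_workers=3)
```

Because each chunk only reads the shared data and writes its own result, the same split would carry over to a map-style job on a cluster, at the cost of shipping the data set to every worker.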
- 19. • Physics may provide an interesting perspective on questions that at first glance have no connection to physics.
• It has been done in scale space theory
• Sensitive to small variations in the data density
• In bio-informatics for extracting protein structure
• And many more
Quantum clustering