Jazzed about Solr: People as a Search Problem - By Joshua Tuberville
- 1. About Solr
People as A Search Problem
Thursday, May 26, 2011
- 2. About Me
• Building websites since 1996, Java since 1997
• Prior web search experience
• Building and scaling eHarmony products since 2002
- 3. What is Jazzed
• Subscription-based dating site
• Incubated by eHarmony
- 4. What is Jazzed
• Create a profile
• Search for others
• View their photos
• Communicate privately
- 8. How is it different?
• Covers broader range of relationships
• Easy to get started
• Real profiles screened by machine and humans
• Fast, effective search-oriented tools
- 9. Jazzed Stats
• Started Fall 2009
• Beta Summer 2010
• Launched October 2010
• 100,000s of Profiles
• 1,000s of Searches Daily
- 10. Jazzed Architecture
• Event-driven SOA
• REST, JSON, EIP, Not-only-SQL
• Technology incubation
- 11. Tech Stack
• Java 6, Spring 3, Jersey 1.1, JMS (AMQP)
• RHEL 4, Oracle 11g, Voldemort 0.81, Solr 1.4.1, NFS
- 14. Not Covered
• Distributed Search
• Caching Strategies
• Data Import
• Analyzers/Tokenizers
- 15. Why Lucene?
• Proven Solid IR library
• Prefer Open Source Solutions
• Not Only SQL
• Flexible Ranking
• Pluggable
- 16. Why Solr
• Performant, Extensible, RESTful Service
• Configuration, Schema, Multicores
• Admin Interface
• Replication, Backups, Monitoring
- 17. Open Source
• Strengthens Engineering Team
• Be a part of a great community
• Not Brochure-ware
- 18. Not Only SQL
• One solution does not fit all
• Prefer availability over consistency
• Horizontal Scaling over Vertical
- 19. Flexible Ranking
• Query Strategies
• Boolean Algebra
• Vector Space Analysis
• Hybrids
• Extensive Function Support
• Index and Query Boosting
- 20. ...Oh My!
• Standard Plugins - Geospatial*, Faceting, Spelling, MoreLikeThis
• Full Text with Highlighted Results
• Client agnostic
- 21. Inevitable Question
• “Does it scale?”
• Solr POC Benchmark
• 10 Million profiles
• >200 queries/sec, under 100 ms at the 90th percentile
• Default tuning until 5 million profiles
- 22. Profile Service
• RESTful Hybrid Data Service
• Public, Private, Attributes
• Event Producer
- 23. Profiles
• Mostly structured
• Categories - Eye Color, Desired Ethnicity
• Dates - Birthdate
• Numbers - Coordinates, Age Range
• Text - Name, Headline
- 24. Inverting People
• Stored as an inverted index
• Index is randomly accessed by term

Term         Documents
MALE         1, 3, 5, 7, 9
FEMALE       2, 4, 6, 8, 10
HAIR_RED     8
HAIR_BLOND   1, 2, 5, 6
EYE_BLUE     1, 2, 3, 10
EYE_BROWN    4, 5, 6, 7, 8, 9
fun          1, 3, 7, 9
funny        2, 4, 6, 10
beach        1, 2, 3, 4, 5, 6, 7, 8
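The inverted index on this slide can be sketched in a few lines of Python. This is only an illustration of the data structure, not how Lucene stores its index on disk; the function names and the four sample profiles are assumptions chosen to agree with part of the slide's table.

```python
from collections import defaultdict


def build_inverted_index(profiles):
    """Map each term to the sorted list of profile ids that contain it."""
    index = defaultdict(set)
    for profile_id, terms in profiles.items():
        for term in terms:
            index[term].add(profile_id)
    return {term: sorted(ids) for term, ids in index.items()}


def match_all(index, terms):
    """Return the ids found in the postings of every term (boolean AND)."""
    postings = [set(index.get(term, ())) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical profiles, consistent with rows of the slide's table.
profiles = {
    1: ["MALE", "HAIR_BLOND", "EYE_BLUE"],
    2: ["FEMALE", "HAIR_BLOND", "EYE_BLUE"],
    8: ["FEMALE", "HAIR_RED", "EYE_BROWN"],
    10: ["FEMALE", "EYE_BLUE"],
}
index = build_inverted_index(profiles)
blue_eyed_women = match_all(index, ["FEMALE", "EYE_BLUE"])
```

Looking up a term is a single dictionary access, which is the "random access by term" property the slide refers to; an AND query is an intersection of postings lists.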
- 25. Schema Design
• Single “Table”
• One-to-many = multi-value fields
• Individual vs Composite Fields
• copyField and have both!
- 26. Field considerations
• Stored or not
• Indexed or not
• Multivalued - desires fields
• Type
- 27. Solr Types Used
The ‘t’ is for Trie
• tdate, tint, tfloat* - birthdate, loginAt
• text - all text
• string - id, non-indexed text
• random - good for random sorts
• enum - for all enumerations
- 28. Data Duplication
• By function - numberPhotos & hasPhotos
• By relationship - hiddenBy & hidden
• By analysis - name & text
- 29. Saving Profiles
• Updating is an in-memory operation
• No partial updates
• Commit means flush index changes
• Autocommit on maxDocs, maxTime or both
- 30. Why Also Voldemort
• Private profiles cannot be stale
• Many fields not searchable or viewable by others
• Isolate queries from fetch by id
- 31. Querying
• Superset of Lucene
• Efficient Range Queries
• Multiple Query Handlers
• Dismax, Boost, Geo
- 32. Recall vs Precision
• Focus on recall when corpus is small
• Precision once it is at critical mass
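For reference, the two measures can be computed as below. This is a minimal sketch; the function name is hypothetical and the sets of "relevant" ids are assumed to come from some manual judgment, which the slides do not describe.

```python
def precision_recall(returned_ids, relevant_ids):
    """Precision: share of returned results that are relevant.
    Recall: share of relevant profiles that were returned."""
    returned, relevant = set(returned_ids), set(relevant_ids)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four results returned, two of the three relevant profiles among them.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 6])
```

With a small corpus a strict query may return nothing at all, which is why recall matters first; once there are many candidates, filtering out weak matches (precision) becomes the bigger concern.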
- 33. Boolean Queries
• Default operator set to AND
• +gender:FEMALE +seeking:MALE +eyeColor:EYE_BLUE +hairColor:(HAIR_RED OR HAIR_BLOND)
• Sort order is important
- 34. Hybrid Queries
• Default operator set to OR
• +gender:FEMALE +seeking:MALE eyeColor:EYE_BLUE hairColor:(HAIR_RED HAIR_BLOND)
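A small helper can assemble both the boolean and the hybrid form of these queries. This is a sketch under assumptions: the function names are hypothetical, values are assumed to be enum-style constants that need no escaping, and multiple values for one field are OR-ed explicitly so the result does not depend on the default operator.

```python
def clause(field, values):
    """Render one field clause; several values are OR-ed inside parentheses."""
    values = list(values)
    if len(values) == 1:
        return f"{field}:{values[0]}"
    return f"{field}:({' OR '.join(values)})"


def build_query(required, optional=None):
    """Required clauses get a '+' prefix; optional ones only affect ranking."""
    parts = [f"+{clause(field, vals)}" for field, vals in required.items()]
    parts += [clause(field, vals) for field, vals in (optional or {}).items()]
    return " ".join(parts)


must = {"gender": ["FEMALE"], "seeking": ["MALE"]}
nice_to_have = {"eyeColor": ["EYE_BLUE"], "hairColor": ["HAIR_RED", "HAIR_BLOND"]}

# Boolean style: every criterion is mandatory.
boolean_query = build_query({**must, **nice_to_have})
# Hybrid style: hard constraints are mandatory, preferences only boost the score.
hybrid_query = build_query(must, nice_to_have)
```

The hybrid form trades precision for recall: a profile that misses a preference still appears, just lower in the ranking.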
- 35. Why you’re lucky if you like redheads
• Inverse Document Frequency (IDF)
• Rarer is favored over more common
• More fields matched = higher ranking

1. Blue-eyed redheads
2. Blue-eyed blonds
3. Redheads
4. Blonds
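The ranking on this slide can be reproduced with a simplified score. The sketch below uses the idf formula from Lucene's classic similarity, 1 + ln(numDocs / (docFreq + 1)), and the document frequencies from the ten-profile inverted index table shown earlier. Summing idf per matched term is a deliberate simplification; the real Lucene score also involves term frequency, norms, and a coordination factor.

```python
import math

NUM_DOCS = 10  # profiles in the earlier inverted index table
DOC_FREQ = {"HAIR_RED": 1, "HAIR_BLOND": 4, "EYE_BLUE": 4}


def idf(term):
    """Classic Lucene idf: rarer terms get a larger weight."""
    return 1.0 + math.log(NUM_DOCS / (DOC_FREQ[term] + 1))


def score(profile_terms, query_terms):
    """Simplified score: sum the idf of every query term the profile matches."""
    return sum(idf(term) for term in query_terms if term in profile_terms)


QUERY = ["EYE_BLUE", "HAIR_RED", "HAIR_BLOND"]
CANDIDATES = {
    "blond": {"HAIR_BLOND"},
    "redhead": {"HAIR_RED"},
    "blue-eyed blond": {"EYE_BLUE", "HAIR_BLOND"},
    "blue-eyed redhead": {"EYE_BLUE", "HAIR_RED"},
}
ranked = sorted(CANDIDATES, key=lambda name: score(CANDIDATES[name], QUERY), reverse=True)
```

Only one profile in ten has red hair, so HAIR_RED carries more weight than HAIR_BLOND, and matching two optional fields beats matching one.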
- 36. Boosting
• Query time by importance
• eyeColor:EYE_BLUE^2 hairColor:HAIR_BLOND
- 38. Filter Fields
• Useful for roles and other lists
• -hidden:(2 4 6)
• -hiddenBy:1

id   hidden
1    2, 4, 6
2    1

id   hiddenBy
1    2
2    1
4    1
6    1
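The hiddenBy table is the hidden table inverted, which is the "data duplication by relationship" idea from the Data Duplication slide. The sketch below derives one from the other and applies the equivalent of -hiddenBy:1 in plain Python; the function names are hypothetical and this is not the Profile Service's actual code.

```python
from collections import defaultdict


def invert_hidden(hidden):
    """Derive hiddenBy (who has hidden me) from hidden (whom I have hidden)."""
    hidden_by = defaultdict(list)
    for owner, targets in sorted(hidden.items()):
        for target in targets:
            hidden_by[target].append(owner)
    return dict(hidden_by)


def visible_to(viewer_id, candidate_ids, hidden_by):
    """Plain-Python equivalent of adding -hiddenBy:<viewer_id> to a query."""
    return [c for c in candidate_ids if viewer_id not in hidden_by.get(c, [])]


hidden = {1: [2, 4, 6], 2: [1]}        # the slide's "hidden" table
hidden_by = invert_hidden(hidden)      # the slide's "hiddenBy" table
results_for_1 = visible_to(1, [2, 3, 4, 5, 6], hidden_by)
```

Storing hiddenBy on each profile keeps the filter clause a single short term, no matter how many profiles the searching member has hidden.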
- 40. Date Math
• Simplifies query preprocessing
• +birthDate:[NOW/DAY+1DAY-36YEAR TO NOW/DAY-25YEAR]
• Between 25 and 35 years old
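The age-to-birthdate translation on this slide can be generated from a minimum and maximum age. This is a sketch; the function name and the default field name are assumptions, and only the output format is taken from the slide.

```python
def age_range_clause(min_age, max_age, field="birthDate"):
    """Build a Solr date-math range for members aged min_age..max_age inclusive.

    The oldest match turns max_age + 1 tomorrow, so the lower bound is
    (today + 1 day - (max_age + 1) years). The youngest match turned
    min_age today, so the upper bound is (today - min_age years).
    """
    oldest_birthdate = f"NOW/DAY+1DAY-{max_age + 1}YEAR"
    youngest_birthdate = f"NOW/DAY-{min_age}YEAR"
    return f"+{field}:[{oldest_birthdate} TO {youngest_birthdate}]"


clause_25_to_35 = age_range_clause(25, 35)
```

Because Solr evaluates NOW on the server, the client never has to compute calendar dates, which is the preprocessing the slide says is simplified.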
- 41. Distance Searching
• lat, lon, distance
• SolrLocal by Patrick O’Leary
• Additional overhead ~90ms per query
• Superseded in Solr 3.1
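The lat, lon, distance inputs feed a great-circle distance check. The sketch below uses the standard haversine formula as an illustration of that kind of filter; it is not SolrLocal's implementation, and the function names, the Earth radius constant, and the sample coordinates are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km(origin, candidates, max_km):
    """Return the names of candidates within max_km of origin, nearest first."""
    lat0, lon0 = origin
    hits = []
    for name, (lat, lon) in candidates.items():
        distance = haversine_km(lat0, lon0, lat, lon)
        if distance <= max_km:
            hits.append((distance, name))
    return [name for _, name in sorted(hits)]


# Hypothetical coordinates for illustration only.
nearby = within_km((34.05, -118.24), {"near": (34.10, -118.30), "far": (40.71, -74.01)}, 50)
```

Computing this per candidate is the kind of work that adds per-query overhead, which is why a cheap bounding-box prefilter is commonly applied before the exact distance check.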
- 42. Testing Queries
• Log queries and ids returned
• Version your search strategies
• Improve one thing at a time
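One way to act on these three points is to log each query with a strategy version and the ids it returned, then diff the results of two strategies for the same search. This is a sketch only; the record layout, the version labels, and the function names are hypothetical.

```python
import json


def log_record(strategy, query, ids):
    """Serialize one search as a JSON line: strategy version, query, ids returned."""
    return json.dumps({"strategy": strategy, "query": query, "ids": list(ids)}, sort_keys=True)


def diff_results(old_ids, new_ids):
    """Show which profiles a strategy change added or dropped for the same search."""
    old, new = set(old_ids), set(new_ids)
    return {"added": sorted(new - old), "dropped": sorted(old - new)}


line = log_record("v1-boolean", "+gender:FEMALE +eyeColor:EYE_BLUE", [2, 10])
change = diff_results([2, 10], [2, 4, 10])
```

Changing one thing per version keeps each diff attributable to a single cause.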
- 43. Geo Service
• Read-mostly service
• Fields - Postal Code, Country, State, Cities, Lat, Lon
• Usage - Registration Validation, City Selection
- 44. Operations
• Servlet container and filesystem
• Jetty 6, 64-bit Java 6 JVM
• 8G heap, -XX:+UseCompressedOops
- 45. Operations
• Active/Passive
• Layer 7 Load balancing
• Nightly snapshots
• Eventually SolrCloud
- 46. Multicore
• Run multiple schemas on the same Solr instance
• Hot-swappable for backwards-compatible changes
• private / public profiles
- 47. Security
• No security provided
• At minimum secure your UpdateHandler
<delete><query>*:*</query></delete>
• Separate Cores
- 48. Future
• Solr 3.1
• Mutual Matching
• Faceting / Guided Search
• Incorporating spelling
• Hierarchies, categories, better ranking models
- 49. Faceting
• Returns counts with query results
• Efficient
• Guides the user toward precision
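Conceptually, a facet is a count of field values across the matching documents. The sketch below mimics that in plain Python to show what the counts mean; it is not how Solr computes facets internally, and the function name and sample documents are assumptions.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many matching documents carry each value of a field.

    Multi-valued fields (lists) contribute one count per value.
    """
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        values = value if isinstance(value, list) else [value]
        counts.update(v for v in values if v is not None)
    return counts.most_common()


# Hypothetical result set for a search.
results = [
    {"id": 2, "hairColor": "HAIR_BLOND", "eyeColor": "EYE_BLUE"},
    {"id": 6, "hairColor": "HAIR_BLOND", "eyeColor": "EYE_BROWN"},
    {"id": 8, "hairColor": "HAIR_RED", "eyeColor": "EYE_BROWN"},
]
hair_facet = facet_counts(results, "hairColor")
```

Showing these counts next to the results tells the user how many profiles each additional filter would leave, which is how faceting guides them toward a more precise search.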
- 50. Thank you
jtuberville@eharmony.com
Twitter: @jtuberville