Beyond Collaborative
Filtering: ML &
Recommendations at
Meetup
Evan Estola
Meetup.com
evan@meetup.com
@estola
My Background
● Machine Learning Engineer
● At Meetup since May 2012
● BS Computer Science
○ Information Retrieval
○ Data Mining
○ Math
■ Linear Algebra
■ Graph Theory
Meetup, what are you?
Why Meetup data is cool
● Real people meeting up
● Every meetup could change someone's life
● No ads, just do the best thing
● Oh and >125 million rsvps by >17 million
members
● ~3 million rsvps in the last 30 days
○ >1/second
Estola meetup big_datacampla_6_14_evan_estola
Tools at Meetup
● Hive - SQL on Hadoop
● Spark - Distributed Scala on Hadoop cluster
● Scala - Recommendations service
● R - Data analysis, Model building
● Python - Scripting, Data organizing
● Java - Backend of our web stack
“Everything is a recommendation”
Collaborative Filtering
● Classic recommendations approach
● Users who like this also like this
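A minimal sketch of the "users who like this also like this" idea, using plain co-occurrence counts. The member names, group names, and the `co_occurrence` and `recommend` helpers are hypothetical illustrations, not Meetup's implementation.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical membership data: member -> set of groups joined.
memberships = {
    "ann": {"hiking", "photography"},
    "bob": {"hiking", "photography", "scala"},
    "cat": {"hiking", "scala"},
    "dan": {"photography"},
}

def co_occurrence(memberships):
    """For each group, count how often every other group shares a member."""
    counts = defaultdict(Counter)
    for groups in memberships.values():
        for a, b in permutations(sorted(groups), 2):
            counts[a][b] += 1
    return counts

def recommend(member, memberships, counts, k=3):
    """Rank groups the member has not joined by summed co-occurrence."""
    joined = memberships[member]
    scores = Counter()
    for group in joined:
        for other, n in counts[group].items():
            if other not in joined:
                scores[other] += n
    return [group for group, _ in scores.most_common(k)]

counts = co_occurrence(memberships)
print(recommend("dan", memberships, counts))
```

A member with no memberships gets an empty list back, which is the cold start weakness in miniature.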
Weaknesses of CF
● Sparsity
● Cold Start
● Coverage
● Diversity
Why Recs at Meetup are hard
● Incomplete Data (topics)
● Cold start
● Asking user for data is hard
● Going to meetups is scary
● Sparsity
○ Location
○ Groups/person
○ Membership: 0.001%
○ Compare to Netflix: 1%
Cleaning data
● Schenectady
● Beverly Hills
● Fake RSVP boosts (+100 guests!)
● Rsvp hogs
Real data is gross
● Preprocessing is critical!
○ missing data
○ outliers
○ log scale
○ bucketing
○ sampling bias
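A rough sketch of three of the preprocessing steps listed above (missing data, outliers, log scale, bucketing) applied to a guest count. The cap of 10 guests, the bucket edges, and all function names are assumptions chosen for illustration.

```python
import math

def clip(value, low, high):
    """Clamp a value into [low, high] to tame outliers."""
    return max(low, min(high, value))

def log_scale(value):
    """Compress a heavy-tailed, non-negative count with log(1 + x)."""
    return math.log1p(value)

def bucket(value, edges):
    """Return the index of the first edge the value falls below,
    or len(edges) if it is at or above every edge."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def preprocess_guest_count(raw, cap=10):
    """Treat missing as 0, clip fake boosts like +100 guests, then log scale."""
    guests = 0 if raw is None else raw
    return log_scale(clip(guests, 0, cap))

print(preprocess_guest_count(100))  # same value as a count of 10
print(bucket(3, [1, 5, 25]))
```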
Supervised Learning/Classification
● “Inferring a function from labeled training
data”
● Joined Meetup group/Didn’t join Meetup
group
Ranking
● Membership << expected error rate
○ Sample to 50/50 join/no-join
● Model output label no longer explicitly true
● Use a classifier that gives you a useful
output
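One way to sample to 50/50 is to downsample the majority class, sketched below. The `(features, label)` tuple layout, the fixed seed, and the `balance_classes` name are assumptions. After balancing, the model's output is no longer a true join probability, so it is best read as a score for ranking candidates against each other.

```python
import random

def balance_classes(examples, seed=0):
    """Downsample the larger class so joins and non-joins are 50/50.

    examples: list of (features, label) pairs, label 1 = joined, 0 = did not.
    """
    rng = random.Random(seed)
    joined = [e for e in examples if e[1] == 1]
    not_joined = [e for e in examples if e[1] == 0]
    n = min(len(joined), len(not_joined))
    sample = rng.sample(joined, n) + rng.sample(not_joined, n)
    rng.shuffle(sample)
    return sample

# Hypothetical data: 5 joins among 1000 member/group pairs.
examples = [({"pair_id": i}, 1 if i < 5 else 0) for i in range(1000)]
balanced = balance_classes(examples)
print(len(balanced), sum(label for _, label in balanced))
```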
Topic Match
State Match
Ensemble Learning
“... use multiple learning algorithms to obtain
better predictive performance than could be
obtained from any of the constituent learning
algorithms”
Ensemble Learning
● Collaborative Filtering on Topics
● Other simple features
Logistic Regression Output
● RsvpScore 0.02
● FbFriends 2.02
● 2ndFbFriends 0.09
● AgeUnmatch -2.40
● GenUnmatch -3.37
● Distance -0.04
● StateMatch 0.54
● CountyMatch 0.41
● ZipScore 0.06
● TopicScore 4.14
● ExtendedTS 0.47
● RelatedTS 0.66
● FbLikeTS 0.78
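To show how weights like these turn into a ranking score, here is a sketch that applies the logistic function to a weighted sum of features. The weights are copied from the slide; the intercept of 0, the feature values, and the two candidate groups are hypothetical. The slide does not give feature scales, so the weights are not directly comparable in magnitude.

```python
import math

# Weights as listed on the slide.
WEIGHTS = {
    "RsvpScore": 0.02, "FbFriends": 2.02, "2ndFbFriends": 0.09,
    "AgeUnmatch": -2.40, "GenUnmatch": -3.37, "Distance": -0.04,
    "StateMatch": 0.54, "CountyMatch": 0.41, "ZipScore": 0.06,
    "TopicScore": 4.14, "ExtendedTS": 0.47, "RelatedTS": 0.66,
    "FbLikeTS": 0.78,
}

def score(features, weights=WEIGHTS, intercept=0.0):
    """Logistic score for one member/group pair; absent features count as 0.

    The intercept is an assumption: the slide does not show one.
    """
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature values for two candidate groups.
nearby_on_topic = {"TopicScore": 0.5, "StateMatch": 1, "Distance": 5}
far_off_topic = {"TopicScore": 0.1, "StateMatch": 1, "Distance": 30}

ranked = sorted(
    [("nearby_on_topic", nearby_on_topic), ("far_off_topic", far_off_topic)],
    key=lambda pair: score(pair[1]),
    reverse=True,
)
print([name for name, _ in ranked])
```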
Facebook Likes
● Lots of information, but how to use?
● Map to topics, let training the model take
care of the rest!
● Bonus: Recommendations server knows
topics, generated topics can be passed in by
request
Mapping FB Likes to Meetup Topics
● Text based?
○ Go(game) vs Go(lang)?
○ Burton?
● Data approach!
○ Grab most popular topics across all members with
the same like
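The data approach can be sketched as a count of topics over all members who share a Like. The member IDs, the sample Likes and topics, and the `topic_counts_by_like` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: member -> Facebook Likes, member -> Meetup topics.
member_likes = {
    "m1": {"Burton"},
    "m2": {"Burton"},
    "m3": {"Go"},
}
member_topics = {
    "m1": {"Snowboarding", "Meeting New People"},
    "m2": {"Meeting New People", "Coffee"},
    "m3": {"Board Games"},
}

def topic_counts_by_like(member_likes, member_topics):
    """For each Like, count topics across all members who have that Like."""
    counts = defaultdict(Counter)
    for member, likes in member_likes.items():
        for like in likes:
            counts[like].update(member_topics.get(member, ()))
    return counts

counts = topic_counts_by_like(member_likes, member_topics)
print(counts["Burton"].most_common(1))
```

No text matching is involved, so an ambiguous name like Go is resolved by whichever topics its likers actually follow.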
Normalization
● Top topics for Burton-Likers
○ Meeting New People, Coffee, bla bla
○ Most popular still dominates
● Normalize based on expected topic
occurrence in sample
Normalization
● For members with a given Like
● Compare percent with each topic to
expected among total population
● Burton:
○ 20% “Meeting New People”
○ 9% “Snowboarding”
● Total population
○ 20% “Meeting New People”
○ 2% “Snowboarding”
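Using the percentages above, one way to compare against the expected rate is a simple ratio. The slides do not say whether a ratio, a difference, or something else was used, so the ratio and the `topic_lift` name are assumptions.

```python
def topic_lift(share_with_like, share_overall):
    """Ratio of a topic's share among likers to its share in the population.

    Topics with no (or zero) population share are skipped.
    """
    return {
        topic: share / share_overall[topic]
        for topic, share in share_with_like.items()
        if share_overall.get(topic)
    }

# Percentages from the slide, as fractions.
burton = {"Meeting New People": 0.20, "Snowboarding": 0.09}
overall = {"Meeting New People": 0.20, "Snowboarding": 0.02}

lift = topic_lift(burton, overall)
best = max(lift, key=lift.get)
print(best, round(lift[best], 2))
```

"Meeting New People" comes out at a ratio of 1 (no more common among Burton likers than anyone else), while "Snowboarding" is about 4.5 times more common, so it rises to the top.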
Processing
● Load FB Like connections, topics into
Hadoop
● Process with Hive to generate top topics for
each like
● Join with member likes to generate top
topics per member
● Add feature to model using FB-Like-Generated-Topics
crossover with groups...
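The talk does this in Hive; the Python sketch below only illustrates the shape of the last two steps. Reading "crossover with groups" as the overlap between a member's generated topics and a group's topics is an assumption, as are the sample data and the function names.

```python
def generated_topics(member_likes, like_top_topics):
    """Join member Likes with per-Like top topics: member -> set of topics."""
    return {
        member: set().union(*(like_top_topics.get(like, set()) for like in likes))
        for member, likes in member_likes.items()
    }

def fb_like_topic_score(member_generated, group_topics):
    """Fraction of a group's topics covered by the member's generated topics."""
    if not group_topics:
        return 0.0
    return len(member_generated & group_topics) / len(group_topics)

# Hypothetical inputs.
member_likes = {"m1": {"Burton"}, "m2": set()}
like_top_topics = {"Burton": {"Snowboarding", "Skiing"}}

topics = generated_topics(member_likes, like_top_topics)
print(fb_like_topic_score(topics["m1"], {"Snowboarding", "Hiking"}))
```

Members without any Likes get an empty topic set and a score of 0, so the feature simply contributes nothing for them.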
Results
● Positive weight
○ Very good sign
○ Captured information about member identity, not just
behavior
● Deploy/Split test
○ 1.5% lift in conversion overall
○ (Only have facebook data for ~10% of members)
Thanks!
Smart people come work with me.
http://www.meetup.com/jobs/
