Estola meetup big_datacampla_6_14_evan_estola

Beyond Collaborative Filtering: ML & Recommendations at Meetup
Evan Estola
Meetup.com
evan@meetup.com
@estola

My Background
● Machine Learning Engineer
● At Meetup since May 2012
● BS Computer Science
○ Information Retrieval
○ Data Mining
○ Math
■ Linear Algebra
■ Graph Theory

Why Meetup data is cool
● Real people meeting up
● Every meetup could change someone's life
● No ads, just do the best thing
● Oh, and >125 million RSVPs by >17 million members
● ~3 million RSVPs in the last 30 days
○ >1/second

Tools at Meetup
● Hive - SQL on Hadoop
● Spark - Distributed Scala on Hadoop cluster
● Scala - Recommendations service
● R - Data analysis, Model building
● Python - Scripting, Data organizing
● Java - Backend of our web stack

“Everything is a recommendation”

Collaborative Filtering
● Classic recommendations approach
● Users who like this also like this
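
A minimal sketch of the "users who like this also like this" idea, using plain co-occurrence counts. The membership data and the function name `also_liked` are made up for illustration; this is not Meetup's implementation.

```python
from collections import Counter

# Hypothetical data: member -> set of groups they joined.
memberships = {
    "ann": {"hiking", "photography", "scala"},
    "bob": {"hiking", "photography"},
    "cat": {"hiking", "scala"},
    "dan": {"photography", "cooking"},
}

def also_liked(memberships, group, top_n=3):
    """Rank other groups by how many members of `group` also joined them."""
    counts = Counter()
    for groups in memberships.values():
        if group in groups:
            counts.update(groups - {group})
    # Sort by count (descending), then by name, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

print(also_liked(memberships, "hiking"))
```

A group nobody has joined yet never appears in these counts, which is the cold start and coverage problem in miniature.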

Weaknesses of CF
● Sparsity
● Cold Start
● Coverage
● Diversity

Why Recs at Meetup are hard
● Incomplete Data (topics)
● Cold start
● Asking user for data is hard
● Going to meetups is scary
● Sparsity
○ Location
○ Groups/person
○ Membership: 0.001%
○ Compare to Netflix: 1%
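
Reading the two percentages as the fill rate of the member-by-group matrix (an interpretation, not something the slide states), the gap can be sketched as below. The counts are hypothetical and chosen only to reproduce the orders of magnitude.

```python
def density(n_interactions, n_users, n_items):
    """Fraction of the user-by-item matrix that has an observed interaction."""
    return n_interactions / (n_users * n_items)

# Hypothetical counts, not real Meetup or Netflix figures.
sparse = density(n_interactions=1_000, n_users=10_000, n_items=10_000)
dense = density(n_interactions=1_000_000, n_users=10_000, n_items=10_000)

print(f"{sparse:.5%} vs {dense:.2%}")
```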

Cleaning data
● Schenectady
● Beverly Hills
● Fake RSVP boosts (+100 guests!)
● RSVP hogs

Real data is gross
● Preprocessing is critical!
○ missing data
○ outliers
○ log scale
○ bucketing
○ sampling bias
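
A small sketch of a few of these steps on two made-up features, a guest count and a member age. The function names, the cap, and the bucket edges are all assumptions for illustration.

```python
import math

def clean_guest_count(guests, cap=10):
    """Fill missing values, clamp outliers, then log-scale a guest count."""
    if guests is None:                  # missing data: treat as no guests
        guests = 0
    guests = min(max(guests, 0), cap)   # outliers: clamp to [0, cap]
    return math.log1p(guests)           # log scale: compress the long tail

def bucket_age(age, edges=(18, 25, 35, 50, 65)):
    """Bucketing: return the index of the bucket that `age` falls into."""
    return sum(age >= edge for edge in edges)

print(clean_guest_count(100), bucket_age(30))
```

With the assumed cap, an RSVP claiming 100 guests ends up with the same feature value as one claiming 10.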

Supervised Learning/Classification
● “Inferring a function from labeled training data”
● Joined Meetup group/Didn’t join Meetup group

Ranking
● Membership << expected error rate
○ Sample to 50/50 join/no-join
● Model output label no longer explicitly true
● Use a classifier that gives you a useful output
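
A rough sketch of the sampling and ranking steps, assuming examples are `(features, label)` pairs with label 1 for "joined". The helper names `balance` and `rank` are hypothetical. After rebalancing, the model's score no longer reads as a true join probability, but it can still be used to order candidates.

```python
import random

def balance(examples, seed=0):
    """Downsample the majority class so the two labels are 50/50."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    small, large = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    rng = random.Random(seed)
    balanced = small + rng.sample(large, len(small))
    rng.shuffle(balanced)
    return balanced

def rank(candidates, score):
    """Order candidates by model score, highest first."""
    return sorted(candidates, key=score, reverse=True)

# Hypothetical data: 5 joins out of 1000 member/group pairs.
examples = [({"id": i}, 1 if i < 5 else 0) for i in range(1000)]
train = balance(examples)
print(len(train), sum(label for _, label in train))
```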

Ensemble Learning
“... use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms”

Ensemble Learning
● Collaborative Filtering on Topics
● Other simple features

Logistic Regression Output
● RsvpScore 0.02
● FbFriends 2.02
● 2ndFbFriends 0.09
● AgeUnmatch -2.40
● GenUnmatch -3.37
● Distance -0.04
● StateMatch 0.54
● CountyMatch 0.41
● ZipScore 0.06
● TopicScore 4.14
● ExtendedTS 0.47
● RelatedTS 0.66
● FbLikeTS 0.78
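
A sketch of how weights like these turn into a score, using the standard logistic function. The weights are copied from the slide; the intercept (assumed 0.0 here), the feature scales, and the example feature values are not given in the talk and are made up.

```python
import math

# Weights copied from the slide. The intercept is not shown, so 0.0 is assumed.
WEIGHTS = {
    "RsvpScore": 0.02, "FbFriends": 2.02, "2ndFbFriends": 0.09,
    "AgeUnmatch": -2.40, "GenUnmatch": -3.37, "Distance": -0.04,
    "StateMatch": 0.54, "CountyMatch": 0.41, "ZipScore": 0.06,
    "TopicScore": 4.14, "ExtendedTS": 0.47, "RelatedTS": 0.66,
    "FbLikeTS": 0.78,
}

def score(features, weights=WEIGHTS, intercept=0.0):
    """Sigmoid of the weighted feature sum; absent features count as 0."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature values for two member/group pairs.
good_match = {"TopicScore": 0.8, "StateMatch": 1, "Distance": 5}
poor_match = {"TopicScore": 0.1, "GenUnmatch": 1, "Distance": 5}

print(score(good_match) > score(poor_match))
```

The sign and size of each weight show which features push a group up or down the ranking, though weights are only directly comparable when features share a scale.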

Facebook Likes
● Lots of information, but how to use?
● Map to topics, let training the model take care of the rest!
● Bonus: Recommendations server knows topics, generated topics can be passed in by request

Mapping FB Likes to Meetup Topics
● Text based?
○ Go(game) vs Go(lang)?
○ Burton?
● Data approach!
○ Grab most popular topics across all members with the same like

Normalization
● Top topics for Burton-Likers
○ Meeting New People, Coffee, bla bla
○ Most popular still dominates
● Normalize based on expected topic occurrence in sample

Normalization
● For members with a given Like
● Compare percent with each topic to expected among total population
● Burton:
○ 20% “Meeting New People”
○ 9% “Snowboarding”
● Total population
○ 20% “Meeting New People”
○ 2% “Snowboarding”
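
One plausible way to do this comparison is a simple ratio of the two percentages (often called lift). The slide does not say whether a ratio or some other comparison was used, so treat this as a sketch; the numbers are the ones from the slide.

```python
def topic_lift(liker_share, population_share):
    """Ratio of a topic's share among likers to its share in the population."""
    return {
        topic: share / population_share[topic]
        for topic, share in liker_share.items()
        if population_share.get(topic, 0) > 0
    }

burton = {"Meeting New People": 0.20, "Snowboarding": 0.09}
population = {"Meeting New People": 0.20, "Snowboarding": 0.02}

lift = topic_lift(burton, population)
ranked = sorted(lift, key=lift.get, reverse=True)
print(ranked)
```

"Meeting New People" comes out near 1.0 (no more common than usual), while "Snowboarding" comes out near 4.5, so it rises to the top for Burton.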

Processing
● Load FB Like connections, topics into Hadoop
● Process with Hive to generate top topics for each like
● Join with member likes to generate top topics per member
● Add feature to model using FB-Like-Generated-Topics crossover with groups...
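
The talk ran these steps in Hive on Hadoop. As an in-memory stand-in, the two joins can be sketched in Python as below; the data, the function names, and the use of raw counts (without the normalization step) are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: member -> FB likes, member -> Meetup topics.
member_likes = {"m1": {"Burton"}, "m2": {"Burton"}, "m3": {"Go"}, "m4": {"Burton"}}
member_topics = {"m1": {"Snowboarding", "Coffee"}, "m2": {"Snowboarding"},
                 "m3": {"Board Games"}}  # m4 is new and has no topics yet

def top_topics_per_like(member_likes, member_topics, n=1):
    """For each like, the n most common topics among members with that like."""
    counts = defaultdict(Counter)
    for member, likes in member_likes.items():
        for like in likes:
            counts[like].update(member_topics.get(member, ()))
    return {
        like: [t for t, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for like, c in counts.items()
    }

def topics_per_member(member_likes, like_topics):
    """Join each member's likes against the per-like topic lists."""
    return {
        member: sorted({t for like in likes for t in like_topics.get(like, [])})
        for member, likes in member_likes.items()
    }

like_topics = top_topics_per_like(member_likes, member_topics)
generated = topics_per_member(member_likes, like_topics)
print(generated["m4"])
```

Here the new member `m4` gets a generated topic from a Like alone, which is the kind of signal that helps with cold start.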

Results
● Positive weight
○ Very good sign
○ Captured information about member identity, not just behavior
● Deploy/Split test
○ 1.5% lift in conversion overall
○ (Only have Facebook data for ~10% of members)

Thanks!
Smart people, come work with me.
http://www.meetup.com/jobs/
