Recommendations

Candidate Generation in a Large Scale Graph Recommendation System: People You May Know

Members use the LinkedIn My Network tab to reconnect with people they know as well as to make new connections that expand their networks. The Grow section of the My Network tab focuses on helping members manage connections and find relevant people to add to their professional network, while the Catch Up section focuses on uncovering reasons to start a conversation with existing connections - like starting a new gig or celebrating a work anniversary. Underpinning and powering both experiences is our “People You May Know” (PYMK) technology.

These various motivations for connection, combined with our more than one billion members, mean that selecting a candidate pool for members to catch up or connect with is a monumental task. In PYMK, a member (called the viewer hereafter) is shown a list of recommendations of other members that they'll likely want to add to their network or reconnect with. It is prohibitively expensive to score and sort the entire set of a billion members. This is where a Candidate Generation (CG) stage is needed: it retrieves a short-list of highly relevant members that are most likely to be shown to a particular viewer. We shared the overall end-to-end view of our large scale recommendation system, PYMK, in a previous blog post. In this post, we go into the details of the CG stage for PYMK.

PYMK's CG consists of multiple retrieval techniques, which can be broadly organized into three categories: graph-based, similarity-based, and heuristics-based. Candidates pooled from these three techniques capture not only network properties but also professional, educational, and profile similarity with the viewer. Next, we'll take a look at each of these three CG techniques.

Figure 1: Candidate retrieval techniques from three broad categories are used in PYMK.

Graph-Based CG

LinkedIn is a professional network; each member is a node in the network and an edge between two members indicates that they are connected. Members in the network vicinity of a viewer could be great candidates that the viewer might want to connect with. Motivated by this, in the graph-based CG technique we perform graph walks over the LinkedIn network to traverse the viewer's network vicinity and generate candidates. Two types of graph walks are performed: (a) N-hop neighbors, a shallow deterministic graph walk that traverses the viewer's closer network vicinity; and (b) Personalized PageRank, a deep probabilistic graph walk that traverses the viewer's farther network vicinity. We share details on these two walks below.

N-hop Neighbors

We generate N-hop (N ≠ 1, since 1-hop neighbors are already connected to the viewer) neighbors as candidates for the viewer. N is capped at 3 to control the volume of generated candidates, i.e., we generate the viewer's 2-hop and 3-hop neighbors as candidates.


Figure 2: N-hop neighbors - For the viewer Alice, her 2-hop and 3-hop neighbors Carol and Dave could be her future potential connections and so are added as PYMK candidates.

Each candidate generated by the N-hop approach has an associated score: the number of common connections between the candidate and the viewer. This common-connection score is used to pick the top-k candidates, which are added to the viewer's final candidate pool.
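
As a rough illustration, here is a minimal sketch of this step, assuming an in-memory adjacency map from member id to the set of connected member ids (the function and variable names are hypothetical, not PYMK's production implementation):

```python
from collections import Counter

def n_hop_candidates(graph, viewer, max_hops=3, k=100):
    """Generate 2..max_hops-hop neighbors of `viewer` and rank them by
    the number of connections they share with the viewer.

    graph: dict mapping member id -> set of directly connected member ids
           (every member appears as a key).
    Returns the top-k candidates as (member_id, common_connections) pairs.
    """
    first_hop = graph.get(viewer, set())

    # Breadth-first expansion up to max_hops, excluding the viewer
    # and their existing (1-hop) connections.
    frontier = set(first_hop)
    visited = {viewer} | first_hop
    candidates = set()
    for _ in range(max_hops - 1):          # hops 2..max_hops
        next_frontier = set()
        for node in frontier:
            next_frontier |= graph.get(node, set())
        next_frontier -= visited
        candidates |= next_frontier
        visited |= next_frontier
        frontier = next_frontier

    # Score each candidate by the number of connections shared with the viewer.
    scores = Counter()
    for c in candidates:
        scores[c] = len(first_hop & graph.get(c, set()))
    return scores.most_common(k)
```

In production this runs as a distributed batch job over the full connection graph rather than an in-memory loop, but the logic is the same.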

Personalized PageRank (PPR)

The N-hop approach can only perform shallow graph walks (recall that N is capped at 3). To run deeper graph walks we leverage the PageRank technique. We compute the PageRank score of a member, say Dave, as follows:

$$PR(\text{Dave}) = \alpha \sum_{v \in N(\text{Dave})} \frac{PR(v)}{\text{degree}(v)} + (1 - \alpha)\,\frac{1}{G}$$

Equation 1: PageRank score of a member, Dave. The first term on the RHS indicates that each neighbor v of Dave propagates its PageRank score PR(v) to Dave's PageRank score, scaled by 1/degree(v). The second term on the RHS is a smoothing term, which can be interpreted as the probability of teleporting to Dave from a random member in the network (G is the total number of members in the network, and α is the damping factor that weights the two terms).
Figure 3: PageRank algorithm - Bob and Carol propagate their PageRank scores to Dave, divided by their degrees. Eve can randomly teleport to Dave with a uniform probability 1/7. Dave's PageRank score is a linear combination of these both.

PageRank scores are estimated iteratively using the power-iteration method.
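
For reference, a toy power-iteration sketch of Equation 1 on an in-memory adjacency map might look like the following (in practice this runs as a large distributed job; the damping factor and iteration count here are illustrative):

```python
def pagerank(graph, alpha=0.85, iterations=30):
    """Estimate PageRank scores by power iteration.

    graph: dict mapping member id -> set of connected member ids
           (every member appears as a key).
    alpha: weight on the propagation term (1 - alpha is the teleport weight).
    """
    members = list(graph)
    g = len(members)                      # G: total number of members
    pr = {m: 1.0 / g for m in members}    # uniform initialization

    for _ in range(iterations):
        new_pr = {m: (1.0 - alpha) / g for m in members}   # teleport / smoothing term
        for v, neighbors in graph.items():
            if not neighbors:
                continue
            share = alpha * pr[v] / len(neighbors)          # alpha * PR(v) / degree(v)
            for u in neighbors:
                new_pr[u] += share
        pr = new_pr
    return pr
```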

To personalize the PageRank scores for a viewer, say Alice, we restrict teleportation to the neighborhood of Alice, i.e., instead of teleporting from a random node in the network, teleportation can only happen from the near-neighborhood of Alice. Personalized PageRank (PPR) scores are given by:

$$PPR_{\text{Alice}}(\text{Dave}) = \alpha \sum_{v \in N(\text{Dave})} \frac{PPR_{\text{Alice}}(v)}{\text{degree}(v)} + (1 - \alpha)\, t_{\text{Alice}}(\text{Dave})$$

Equation 2: Personalized PageRank score of Dave for viewer Alice. Here t_Alice is a teleportation distribution that places its mass only on the near-neighborhood of Alice and is zero elsewhere. Teleportation in PPR is restricted to only the near-neighborhood of Alice; PPR focuses less on the irrelevant parts of Alice's network.

PPR soft-confines the random graph walk to the near-neighborhood of the viewer. Note that in Equation 2 the smoothing term is no longer uniform but personalized, which introduces more variance in PPR scores. Moreover, PPR scores are generated for all (viewer, member) tuples, unlike the non-personalized Equation 1, which is agnostic of the viewer.

For a viewer, PPR scores are generated for all members and the top-k are added as candidates to the final pool. A candidate that is 5-6 hops away from the viewer can still end up with a high PPR score, since PPR propagates scores over the edges recursively. This is very helpful for generating candidate liquidity for under-connected viewers.
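
The personalized variant only changes the teleport term: instead of spreading (1 − α) uniformly over all G members, the teleport mass is spread over the viewer's near-neighborhood. A sketch of this modification, assuming for illustration that "near-neighborhood" means the viewer plus their direct connections:

```python
def personalized_pagerank(graph, viewer, alpha=0.85, iterations=30):
    """PPR for one viewer: teleportation is restricted to the viewer's
    near-neighborhood (illustrated here as the viewer plus their connections)."""
    near = {viewer} | graph.get(viewer, set())
    teleport = {m: (1.0 / len(near) if m in near else 0.0) for m in graph}

    pr = dict(teleport)                   # initialize with the teleport distribution
    for _ in range(iterations):
        new_pr = {m: (1.0 - alpha) * teleport[m] for m in graph}   # personalized smoothing term
        for v, neighbors in graph.items():
            if not neighbors:
                continue
            share = alpha * pr[v] / len(neighbors)
            for u in neighbors:
                new_pr[u] += share
        pr = new_pr
    return pr
```

The top-k members by this score (excluding existing connections) are what get added to the candidate pool.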

A note on Graph Neural Networks (GNNs): Over the last few years, many advancements have been made in GNN embeddings. Ideally, similarity over GNN embeddings should be able to generate candidates in the graph neighborhood of a viewer, i.e., recover both the N-hop neighbors and the PPR candidates, but in practice we haven't seen this yet. In our evaluation, the current state-of-the-art GNN embeddings failed to generate 2- and 3-hop neighbors and suffered from the problem of graph isomorphism. Hence, we rely on explicitly running the N-hop and PPR graph walks to generate candidates.

Similarity-Based CG

A member who has profile information, skills, and employment history similar to the viewer's might also be a good candidate to connect with. Motivated by this, the similarity-based CG technique captures similarities in profile, education, employment, skills, and more by learning embeddings for LinkedIn members and then retrieving the candidates whose embeddings are closest to the viewer's (called embedding-based retrieval). Below we share details on how these embeddings are learnt; the retrieval of the closest embeddings is done via a standard nearest-neighbor search.

Embedding Based Retrieval (EBR)

Member embeddings are learnt via a Two Tower Neural Network.

Figure 4: PYMK Two Towers NN - Member features (such as school, company, geo, activity, etc.) are input to the towers. Each tower is usually a 2-3 layer fully-connected NN. Softmax over the dot-product of the two outputs d(u) and d(v) from the two towers gives the prediction probability of a positive or negative instance, which is then used to compute the cross-entropy loss.

A candidate that was shown to the viewer, i.e., a candidate impressed to a viewer, is treated as a positive instance (we discuss this choice of positive later). This way, the PYMK Two Towers NN in Figure 4 learns to pull the embeddings of the impressed candidates closer to the embedding of the viewer. So, a candidate retrieved for a viewer via the dot-product between their embeddings will have a high chance of being shown to the viewer.
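
A minimal PyTorch-style sketch of such a two-tower architecture (the layer sizes, normalization choice, and feature handling here are assumptions for illustration, not the production model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """A small fully-connected tower that maps member features to an embedding."""
    def __init__(self, input_dim, hidden_dim=256, embedding_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim),
        )

    def forward(self, features):
        return F.normalize(self.net(features), dim=-1)   # unit-norm embeddings

class TwoTowerModel(nn.Module):
    """Viewer tower produces d(u); candidate tower produces d(v)."""
    def __init__(self, viewer_dim, candidate_dim, embedding_dim=128):
        super().__init__()
        self.viewer_tower = Tower(viewer_dim, embedding_dim=embedding_dim)
        self.candidate_tower = Tower(candidate_dim, embedding_dim=embedding_dim)

    def forward(self, viewer_features, candidate_features):
        d_u = self.viewer_tower(viewer_features)
        d_v = self.candidate_tower(candidate_features)
        return d_u, d_v
```

Since both sides of PYMK are members, both towers consume member features (school, company, geo, activity, etc.), as described in Figure 4.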

$$P(v \mid u) = \frac{\exp\big(d(u) \cdot d(v)\big)}{\sum_{w \in M} \exp\big(d(u) \cdot d(w)\big)}$$

Equation 3: Probability that a candidate member v is impressed to viewer u. The denominator is a summation over the entire set of members M.

There are on the order of a billion members, so computing the denominator in Equation 3 is prohibitively costly. We use negative sampling to get around this. We also tried adaptive importance sampling as another solution, but it posed convergence issues and made the learning process unstable.

With a negative sampling loss, selecting the right set of negative instances is critical. We use three techniques to mine negatives and pool them together:

  1. In-batch negatives: use the positive instances of other viewers in a mini-batch as negatives. These usually end up being semi-hard negatives. While this makes negative mining very convenient, it biases the embeddings away from popular members: the positive instances of other viewers are likely popular members or influencers, so using them as negatives pushes the embedding of the current viewer away from these popular members. Moreover, the distribution of negatives mined this way suffers from sampling bias and doesn't match the true distribution of negatives, so we have to rely on corrections such as temperature scaling and logQ correction (a training-loss sketch using these corrections follows this list).
  2. Random negatives: uniformly sample random negatives. These are easy negatives and help us maintain a good recall of the candidates.
  3. Nearest neighbor negatives: generate nearest-neighbors of the in-batch negatives and use them as negatives. To generate the neighbors we use the dot-product over a separate pre-trained embedding. For instance, we could use pre-trained GNN embeddings, so that members in the graph neighborhood of the in-batch negatives are also treated as negatives. Nearest-neighbor negative mining can be tricky to get right, since it might amplify the biases of the in-batch negatives, but it can also be very effective in generating hard negatives, which are crucial in avoiding embedding collapse.
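
To make the in-batch setup concrete, here is a hedged sketch of a sampled-softmax loss with in-batch negatives, temperature scaling, and logQ correction (the `log_q` input, i.e., the log of each candidate's sampling probability, is assumed to be estimated separately, e.g., from impression frequencies):

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(d_u, d_v, log_q, temperature=0.05):
    """Sampled-softmax loss with in-batch negatives.

    d_u: [B, D] viewer embeddings; d_v: [B, D] candidate embeddings.
    Row i of d_v is the positive for row i of d_u; every other row in the
    batch acts as an in-batch negative for row i.
    log_q: [B] log sampling probability of each candidate, for logQ correction.
    """
    logits = d_u @ d_v.T / temperature        # [B, B] similarity matrix
    logits = logits - log_q.unsqueeze(0)      # logQ correction for sampling bias
    labels = torch.arange(d_u.size(0), device=d_u.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```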

Why impressions as positive instances? A viewer can invite an impressed candidate to connect, but using invited candidates as positive instances hurts the recall of our CG: candidates who visit LinkedIn infrequently receive fewer invitations, so using invites as positives would have biased our embeddings against such members even though they might be relevant, ultimately hurting recall. Hence, we chose impressed candidates as the positive instances.

Scoring each candidate for a viewer using the 2-tower model and then selecting the top k based on the score is equivalent to finding the k candidates whose embeddings are closest to the viewer’s embedding. This means that we don’t have to score each candidate in real-time for a viewer. We can simply use a nearest neighbor search to get the k closest candidates. Also, the candidate embeddings can all be pre-computed offline and cached to facilitate an efficient retrieval.


Figure 5: Retrieval of PYMK candidates - The embedding d(u) of the viewer u (the member for whom we need to generate recommendations) is computed online using the freshest features. The embeddings d(v) of candidate members are computed in an offline batch manner and indexed in an inverted store. The online embedding d(u) of the viewer is then used to query approximate nearest neighbors and return candidates.
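
As an illustration, the offline-index / online-query pattern could be sketched with an off-the-shelf library such as FAISS (the exact flat index and random data below are placeholders; the production system uses its own approximate nearest-neighbor infrastructure):

```python
import numpy as np
import faiss

embedding_dim = 128

# d(v): candidate embeddings, computed in an offline batch job (random placeholders here).
candidate_embeddings = np.random.rand(100_000, embedding_dim).astype("float32")

# Build an inner-product index over candidate embeddings (offline step).
index = faiss.IndexFlatIP(embedding_dim)
index.add(candidate_embeddings)

# At request time, compute the viewer embedding d(u) with the freshest features
# and query the index for the k closest candidates (random placeholder here).
viewer_embedding = np.random.rand(1, embedding_dim).astype("float32")
scores, candidate_ids = index.search(viewer_embedding, 100)
```

With unit-norm embeddings, maximum inner product coincides with cosine similarity, so the dot-product score used during training carries over directly to retrieval.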

Similarity-based CG helps generate candidates who are more similar in terms of profile, skills and engagement preferences and hence is quite complementary to the graph-based CG which mainly generates candidates in the network neighborhood.

Heuristics-Based CG

In this technique, candidates for a viewer are generated by simple rules (also called filters). Some of the filters used to generate candidates include: members with whom the viewer recently interacted on the feed, members who recently clicked notifications generated by the viewer's activity, members who searched for the viewer's profile, and more such filters.

Top candidates from each of the above CG techniques are aggregated in a final candidate pool to be passed to rankers.

Final Candidate Pool

Each candidate in the final pool has an associated score (an N-hop neighbor has its number of common connections, a PPR candidate has its PPR score, and a similarity-based candidate has the dot-product between embeddings), which is used as a feature in the rankers. A one-hot encoded categorical feature indicating the CG technique (graph, similarity, or heuristic) used to generate the candidate is also used in the rankers. This one-hot feature acts as a bias term for the CG technique and captures its global popularity (i.e., which CG technique takes prominence in generating candidates across all viewers).
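
For illustration, the per-candidate ranker features described above might be assembled roughly as follows (the names and ordering are hypothetical):

```python
CG_SOURCES = ["graph", "similarity", "heuristic"]

def candidate_features(cg_source, cg_score):
    """Build ranker features for one candidate: its CG score plus a
    one-hot indicator of the technique that generated it."""
    one_hot = [1.0 if s == cg_source else 0.0 for s in CG_SOURCES]
    return [cg_score] + one_hot

# e.g. a graph-based (PPR) candidate with score 0.37 -> [0.37, 1.0, 0.0, 0.0]
```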

This final candidate pool along with the scores is passed to the ranker to generate PYMK recommendations for the viewer.

Offline Evaluation

Recall@k is used as the primary evaluation metric for PYMK's CG stage. We are also interested in retrieving a diverse set of candidates, so entropy over a categorical attribute is used as a secondary sign-post metric to evaluate the diversity of candidates. The categorical attribute could be, for instance, country, if we'd like the candidates to come from different countries.
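
A minimal sketch of these two metrics (hypothetical helpers, not our actual evaluation pipeline):

```python
import math
from collections import Counter

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant members that appear in the top-k retrieved candidates."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def diversity_entropy(candidates, attribute):
    """Entropy of a categorical attribute (e.g. country) over the candidate set.

    `attribute` is a function mapping a candidate to its category value.
    Higher entropy indicates a more diverse candidate set.
    """
    counts = Counter(attribute(c) for c in candidates)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())
```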

We also look at the distribution of CG sources in the candidate pool to identify skewness towards a particular CG source.

Future Considerations

Utilizing GNN embeddings to approximate graph-based CG remains an area of continued exploration for us. Moreover, we have seen disproportionate gains from making parts of the CG system near-real-time (e.g., the ability to generate N-hop neighbors in near-real-time delivered huge gains by bringing in fresh candidates), and we plan to identify and migrate more parts of the system to near-real-time. Finally, we plan to explore whether deep sequential learning can be leveraged to improve candidate generation systems.

Conclusion

To measure the online impact of the aforementioned techniques, we conducted A/B tests, which showed an increase in member engagement and retention. The CG techniques not only helped us increase engagement of frequent members, but also improved retention of infrequent members. Based on this, we consider candidate generation one of the most critical components in determining the quality of a large-scale recommendation system.

In sharing our techniques and insights - such as soft-confining in PPR to provide personalization, impressions as positives in EBR to not favor frequent members, and nearest-neighbor negatives to generate hard-negatives - we hope to help others who are also working towards building high-quality graph recommendation systems.