Migrating from a Monolithic Infrastructure to a Federated Anti-Abuse Defense Ecosystem Using AI

LinkedIn, consistently ranked as one of the most trustworthy social networks, stands committed to providing a secure and reliable platform to its members. While the majority of members use the platform with good intent to find economic opportunity, some users with malicious intent use social engineering practices to perform policy-violating activities.

Dedicated to protecting the member experience, LinkedIn's Trust Engineering team employs a comprehensive set of intelligent mechanisms, meticulously designed to safeguard members from different avenues of threats. Beyond the usual challenges, such as those posed by fake profiles, our defenses also cover activities like account hacking and abusive automated behaviors, such as testing valid login handles from a large list of email addresses.

In our previous blog post, Building Trust and Combating Abuse On Our Platform, we gave an overview of the different abuse vectors and how our dedicated team members built a framework to address ever-evolving threats at scale. To empower our rapidly growing member base, which recently reached the 1 billion mark, and to combat evolving abuse tactics, we're pioneering innovative infrastructure solutions for an enhanced and safer LinkedIn experience over the next decade.

In this blog post, we will walk through the migration from our legacy infrastructure to a federated ecosystem. We will describe how we focused on upholding our defense guardrails while giving our teams the ability to iterate faster and achieve operational excellence. We will also share what we learned from this migration.

Legacy Monolithic Infrastructure

In the early stages of LinkedIn's evolution, our approach to building this infrastructure was largely reactive, addressing each new abuse scenario as it surfaced. This strategy, although expedient at the time, resulted in a landscape where our defense guardrails were built to serve the issue of the moment. It eventually led to the implementation of a monolithic service intricately intertwined with multiple REST.li endpoints and Kafka consumers (for online and nearline defense workflows, respectively), forming the complex network illustrated in Figure 1:

Figure 1: Legacy Anti-Abuse Monolithic Infrastructure

This single monolithic infrastructure was responsible for supporting our anti-abuse systems across different entities such as member, non-member, and content. We leveraged a bevy of technologies to classify each request and determine whether it violated our Professional Community Standards. However, as time unfolded, it became apparent that this reactive methodology, while effective at the time, carried inherent drawbacks that manifested as inefficiencies and scalability limitations, making it more difficult to innovate.
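
To make the shape of the problem concrete, here is a minimal, hypothetical sketch of how scoring logic tends to accumulate inside a monolith; the class and method names below are illustrative assumptions, not LinkedIn's actual code.

    // Illustrative sketch only: one service class hosting every entity's
    // scoring path side by side, with data collection, transformation, and
    // decisioning interleaved in each endpoint.
    import java.util.Map;

    public class MonolithicAbuseScorer {

        enum Decision { ALLOW, CHALLENGE, RESTRICT }

        // One entry point per entity type, all deployed in the same application.
        public Decision scoreMember(Map<String, Object> memberRequest) {
            Map<String, Object> features = collectAndTransform(memberRequest);
            return evaluateRules(features);
        }

        public Decision scoreContent(Map<String, Object> contentRequest) {
            // A near-duplicate of the member path, maintained separately.
            Map<String, Object> features = collectAndTransform(contentRequest);
            return evaluateRules(features);
        }

        // Everything below is shared only by convention, which is part of what
        // made the endpoints hard to evolve independently.
        private Map<String, Object> collectAndTransform(Map<String, Object> request) {
            return request;  // placeholder for feature gathering and enrichment
        }

        private Decision evaluateRules(Map<String, Object> features) {
            return Decision.ALLOW;  // placeholder for rules and ML inference
        }
    }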

This sprawling monolithic architecture, with disparate logic scattered throughout, often resulted in inconsistencies and duplication in critical aspects such as data collection, data transformation, and management of scoring workflows. Though functional, it had become operationally complex and therefore less efficient for the team to support.

The issues contributing to this overhead can be categorized into the following high-level areas of concern:

  • Business Logic Overhead: One of the most prominent issues stemmed from the increasing number of scoring workflows (i.e., decision-making processes), leading to a high density of endpoints and a subsequent surge in operational overhead. The intricacies were further compounded by unrelated REST APIs and their corresponding implementations co-existing within the same application, causing resource contention and affecting system health.
  • Hardware Limitations: Limited in our ability to scale horizontally, the system suffered performance degradation such as more frequent and longer garbage collection (GC) cycles, impacting both CPU utilization and overall system performance. The introduction of new capabilities, including intricate machine learning (ML) models, made horizontal scaling an inadequate long-term solution, and the lack of proper craft practices and comprehensive testing mechanisms compounded the problem.
  • Noisy Neighbors: The limited number of threads within the monolithic application posed a significant constraint, especially when dealing with a multitude of downstream dependencies. The absence of isolation between scoring workflows led to a noisy neighbor scenario, where issues in one workflow affected the performance of others (see the sketch after this list).
  • Operational Excellence & Ownership: As our team grew organically, the ownership lines became blurred. Ownership and on-call management complexities arose because responsibility for the different scoring workflows was distributed across multiple teams and locations, making it challenging to implement a streamlined on-call model.
  • Productivity & Agility Issues: As the codebase grew larger and more complex, making changes or introducing new features became challenging. Operational challenges, including craft issues and the lack of robust testing suites, contributed to periodic service degradations. With all components so tightly intertwined, it became even more difficult for teams to work independently and efficiently. Rolling out new changesets riddled with inefficient coding practices caused frequent rollbacks, delaying the rollout of new defenses and creating bottlenecks in how we approached and addressed new abuse vectors.
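
For the noisy neighbor problem in particular, the missing ingredient was isolation between workflows. The sketch below shows one common way to get it: giving each scoring workflow its own bounded executor (a "bulkhead") so a slow downstream dependency in one workflow cannot exhaust the threads shared by all of them. The names and pool sizes are illustrative assumptions, not our internal APIs.

    import java.util.Map;
    import java.util.concurrent.*;

    public class WorkflowBulkheads {

        private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

        // Each workflow gets a small, dedicated pool with a bounded queue.
        private ExecutorService poolFor(String workflow) {
            return executors.computeIfAbsent(workflow, name ->
                new ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<>(100),
                    new ThreadPoolExecutor.AbortPolicy()));  // fail fast when saturated
        }

        // A saturated pool rejects new tasks instead of starving other workflows.
        public <T> Future<T> submit(String workflow, Callable<T> scoringTask) {
            return poolFor(workflow).submit(scoringTask);
        }
    }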

Driven by the need for enhanced scalability, improved resource management, and a more resilient and efficient ecosystem, we decided to migrate from a monolithic infrastructure to a federated ecosystem. 

Migration Principles

We took this as an opportunity to think holistically about our design and plan for a system that could live for years and handle the ever-evolving avenues of abuse at scale. To address all aspects of platformization, we used the three-pronged approach shown in Figure 2: first, and of highest priority, extract the common functionalities and interfaces into a single library; second, split the scoring decision workflows into smaller clusters with federated ownership; and finally, support newer capabilities focused on improving iteration velocity.

Figure 2: Three-pronged approach

As for CASAL (Centralized Abuse Scoring As a Library), our earlier blog post gave a detailed overview of the library, which extracted the functionalities shared across different scoring workflows.
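
That post covers CASAL in depth, so here is only a hedged sketch of the kind of shared contract such a library can expose, letting every scoring workflow plug into the same lifecycle. The interface below is an assumption for illustration, not CASAL's actual API.

    public interface ScoringWorkflow<R> {

        // Gather the signals needed for this abuse vector.
        Features collect(R request);

        // Normalize and enrich raw signals into rule and model inputs.
        Features transform(Features raw);

        // Run rules and ML inference to reach a decision.
        Decision score(Features features);

        // Marker type for collected/transformed signals.
        interface Features { }

        enum Decision { ALLOW, CHALLENGE, RESTRICT }
    }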

As a team, we brainstormed and identified the top two operating principles for the design and migration:

  1. Federated ownership of the scoring workflows, and
  2. Validation of the scoring workflows, using centralized console management to manage and monitor them.

Federated Design

With these principles in mind, we envisioned an ecosystem that splits the entire monolithic application into smaller clusters to address the different problems mentioned earlier.

Figure 3: Trust Anti-Abuse Scoring Clusters

In Figure 3, we see that CASAL is a foundational block supporting the different scoring workflows, each catering to a specific abuse vector (content moderation, fake profile detection, spam, etc.). We also isolated the online and nearline defenses, given the differing nature of how detection happens and how decisions are enforced.

Our vision was for each anti-abuse vector (fake profiles, content moderation, email washing, member harassment, etc.) to have its own dedicated cluster of workflows integrated with CASAL. With this design, we were also able to de-couple decision making and data collection into separate, isolated systems. Building an Operation Console to author and manage these scoring workflows was also critical as we enhanced CASAL's functionality to improve iteration velocity and added newer capabilities.
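
As a rough illustration of this split, the hypothetical cluster below hosts only one abuse vector's workflows on top of the shared ScoringWorkflow contract sketched earlier, and hands enforcement off to a separate system. All names and types here are assumptions, not our production code.

    import java.util.List;

    public class FakeProfileCluster {

        // Only the workflows for this vector are deployed in this cluster, so
        // traffic spikes here cannot affect, say, content moderation.
        private final List<ScoringWorkflow<MemberSignupEvent>> workflows;

        public FakeProfileCluster(List<ScoringWorkflow<MemberSignupEvent>> workflows) {
            this.workflows = workflows;
        }

        public void onEvent(MemberSignupEvent event) {
            for (ScoringWorkflow<MemberSignupEvent> workflow : workflows) {
                ScoringWorkflow.Decision decision =
                    workflow.score(workflow.transform(workflow.collect(event)));
                // This cluster only produces decisions; enforcement and data
                // management live in separate, isolated systems.
                publishDecision(decision);
            }
        }

        private void publishDecision(ScoringWorkflow.Decision decision) {
            // placeholder: emit the decision for a downstream enforcer
        }

        public record MemberSignupEvent(String memberId) { }
    }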

With this design, we were able to successfully address the problems mentioned earlier:

  • Business Logic Overhead: De-coupling scoring workflows from data management workflows had the huge advantage of ensuring that we don't mix business logic with our defenses. Having separate data management workflows also means they can be reused across different sets of online and nearline workflows.
  • Hardware Limitations & Noisy Neighbors: Splitting the single application into multiple clusters let us dedicate an adequate amount of hardware resources to each based on the traffic and complexity involved, and ensured that an increase in traffic, a new incident, or an attack in one cluster would not impact other scoring workflows.
  • Operational Excellence & Ownership: By aligning with our business needs and our product and AI teams, we were able to split these clusters between different teams to operate and manage on their own, reducing communication overhead.
  • Productivity & Agility Issues: Splitting the codebase for the different scoring workflows also empowered teams to iterate at their own speed, with agility in mind. A sustained focus on a high bar of craft, backed by automated regression testing suites and monitoring, helped us scale these workflows with ease.

Validation

Ensuring a seamless transition of our anti-abuse scoring application from a monolithic infrastructure to a federated ecosystem required meticulous planning, particularly concerning data validation and decision enforcement. We also had to ensure that the migration did not impact our defenses or critical business metrics. The integrity, accuracy, and reliability of data had to be upheld throughout the migration process, with a keen focus on trust and privacy considerations. Any hitches during migration could compromise the member experience on the LinkedIn platform, leading to potentially irreparable reputation damage. Our primary stakeholders, including AI and Infra engineers, incident management engineers, Trust & Safety reviewers, and data scientists, rely heavily on the features, anti-abuse scoring decisions, and A/B testing results generated by our service.

To mitigate potential disruptions and guarantee a smooth migration experience for our clients, we implemented a non-action mode for our workflows. This mode allowed us to perform all necessary tasks except executing actions, that is, enforcing defenses, updating databases, and publishing events to the production Kafka streams used for tracking business metrics, ML model training, and so on. This way, the newly migrated service could process the same production traffic as the monolithic service. By logging detailed defense flow information into Kafka events, we could meticulously compare the results from both services and identify any discrepancies arising during the migration process.
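
A minimal sketch of such a non-action gate is shown below, assuming a single flag that suppresses enforcement, database writes, and production Kafka publishing while still emitting the detailed validation events used for comparison. The names are illustrative, not our actual implementation, and the sketch reuses the hypothetical ScoringWorkflow.Decision type from earlier.

    public class DecisionExecutor {

        private final boolean nonActionMode;

        public DecisionExecutor(boolean nonActionMode) {
            this.nonActionMode = nonActionMode;
        }

        public void execute(String workflow, ScoringWorkflow.Decision decision) {
            // Always record what we would have done, so the migrated service's
            // output can be diffed against the monolith's offline.
            logValidationEvent(workflow, decision);

            if (nonActionMode) {
                return;  // no enforcement, no DB updates, no production events
            }
            enforce(decision);
            updateDatabase(workflow, decision);
            publishProductionEvent(workflow, decision);
        }

        private void logValidationEvent(String workflow, ScoringWorkflow.Decision d) { }
        private void enforce(ScoringWorkflow.Decision d) { }
        private void updateDatabase(String workflow, ScoringWorkflow.Decision d) { }
        private void publishProductionEvent(String workflow, ScoringWorkflow.Decision d) { }
    }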

Following the planning phase, the actual migration began by breaking down the application. Adopting a slow ramp approach, wherein production traffic is gradually shifted from the old monolithic service to the newly migrated services, is considered the safest option. However, this method prolongs the period during which all new capability rollouts must be halted until the migration is fully validated and completed. Given the crucial importance of minimizing this impact, especially in handling new attacks or incidents, we developed automated tools to validate the Kafka streams containing the blueprint of all the data used for decision making.
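
The heart of that tooling is a diff between the two services' outputs. The sketch below shows the idea in its simplest form: given decisions keyed by request ID from the legacy and migrated services (for example, collected from their validation Kafka topics), report every mismatch. The in-memory comparison is an illustrative assumption; the actual tooling validated the Kafka streams themselves.

    import java.util.HashMap;
    import java.util.Map;

    public class DecisionParityChecker {

        // Returns request IDs whose decisions differ, with a short description.
        public Map<String, String> findMismatches(Map<String, String> legacyDecisions,
                                                  Map<String, String> migratedDecisions) {
            Map<String, String> mismatches = new HashMap<>();
            for (Map.Entry<String, String> entry : legacyDecisions.entrySet()) {
                String migrated = migratedDecisions.get(entry.getKey());
                if (migrated == null || !migrated.equals(entry.getValue())) {
                    mismatches.put(entry.getKey(),
                        "legacy=" + entry.getValue() + ", migrated=" + migrated);
                }
            }
            return mismatches;
        }
    }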

With automated tools, comprehensive monitoring dashboards, and regular touchpoints, we kept our stakeholders from different LinkedIn product teams aware of the ramps. This cross-organization communication and the regular syncups were critical to keeping an eye on both defense and business metrics.

After months of dedicated effort, we successfully completed the migration and achieved our goal of a federated infrastructure.

Learnings

Technology migrations are colossal efforts, and our experience has yielded valuable insights that can benefit organizations navigating similar challenges. Here is a distilled summary of the learnings we acquired throughout this complex endeavor:

Seize opportunities for design enhancement: A migration is about more than breaking down the monolith; it presents a unique chance to reimagine and enhance the underlying design. During our migration, we took the opportunity to overhaul our anti-abuse scoring workflows. These improvements not only facilitated the migration process but also positioned our system for the future, enabling advanced features like a canary-based design and validation approach.

Harness the power of built-in metric monitoring: Comprehensive metric monitoring is indispensable during a system migration. Monitoring system metrics alone is insufficient to gauge the performance of the migrated system; it's crucial to leverage built-in monitoring tools and infrastructure to track both system and business metrics. Because our platform integrates ML model inferences and business rules, we had to configure proactive alerts on abnormal behaviors to detect anomalies as early as possible.
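
As a simple example of the kind of proactive check this implies, the sketch below flags the migrated service when a business metric (here, a hypothetical restriction rate) drifts too far from the monolith's baseline. The metric and threshold are assumptions for illustration only.

    public class RestrictionRateAlert {

        private final double tolerance;  // e.g., 0.05 for a 5% relative drift

        public RestrictionRateAlert(double tolerance) {
            this.tolerance = tolerance;
        }

        // True when the observed rate drifts beyond the allowed tolerance.
        public boolean shouldAlert(double baselineRate, double observedRate) {
            if (baselineRate == 0.0) {
                return observedRate > 0.0;
            }
            double relativeDrift = Math.abs(observedRate - baselineRate) / baselineRate;
            return relativeDrift > tolerance;
        }
    }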

Automate, Automate, Automate: Automation is the linchpin of a smooth migration. We developed a plethora of automated solutions to switch service behavior between reactive and passive modes, a practice that later extended well beyond the migration itself and has become an integral part of our standard CI/CD pipelines in Trust. Automation not only streamlines processes but also enhances the predictability and reliability of the migration.

Prioritize effective communication: Beyond technical considerations, effective communication among stakeholders is pivotal to the success of a migration. This involves aligning expectations, identifying and mitigating risks, and engaging stakeholders. Large-scale migrations impact many individuals beyond the immediate engineering team, necessitating well-prepared plans and transparent communication. We observed that clear communication significantly eased the migration process and garnered positive feedback from various stakeholders.

As technology continues to advance in both offensive and defensive capabilities, migrations will remain inevitable. Armed with our learnings, we are committed to evolving our platform, navigating future challenges and leveraging our experiences to build a resilient and adaptive defense system.

Acknowledgements 

The successful evolution of our large-scale online service's tech stack was a colossal undertaking that required the concerted effort of many parties. Specifically, it was made possible by contributions from the many teams and groups within Trust Engineering (Infra & AI), Site Reliability, LinkedIn product teams, Data Science, and Incident Management. We also appreciate our leadership's vision, support, and sustained commitment. This achievement is a testament to the collective dedication and collaboration of everyone involved.