Rebuilding Web Tracking Infrastructure for Scale
Stephen Oakley
Principal Engineer
Marketo
What is Marketo?
Page 3
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
What is Web Tracking at Marketo?
• Ingest web page visits and clicks on customers' websites
• Trigger campaigns in response to web activity
• Trigger real-time personalization of web experience
• Provide lead level analytics for known leads
• Provide aggregate analytics for all lead activity
• Known leads typically account for < 10% of all traffic
Legacy Web Tracking Infrastructure

Legacy Problems
• Throughput limitations – 2 million activities per day
• Processing delays can be on the order of hours
• Large customers cause web server brownouts
• Web reporting does not scale
• Fixed-sized clusters prohibit horizontal scaling
• Brittle infrastructure prevents feature development
The Vision
Orion Initiative
• Increase scale to support IoT for Marketers
• Support billions of marketing activities each day
• Trigger on activities in near real time (< 2 minutes at the 99th percentile)
• Reduce operational costs
• Improve multitenancy and QoS

Requirements
Business Requirements
• 200 MM activities per customer per day
• Near real-time web activity processing (SLA of < 1 minute lag)
• Improve cost efficiency
• Improve flexibility for feature enhancements
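To put the volume requirement in perspective, the short calculation below converts the daily figures into average per-second rates. It is only a back-of-the-envelope sketch: it assumes traffic is spread evenly over the day, which real web traffic is not, so peak rates would be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(activities_per_day: float) -> float:
    """Average sustained rate implied by a daily volume."""
    return activities_per_day / SECONDS_PER_DAY


# 2 million/day is the legacy throughput limit quoted under "Legacy Problems";
# 200 million/day is the per-customer target from this slide.
legacy_rate = per_second(2_000_000)
target_rate = per_second(200_000_000)

print(f"legacy average: {legacy_rate:.1f} activities/s")
print(f"target average: {target_rate:.1f} activities/s")
print(f"increase: {target_rate / legacy_rate:.0f}x")
```

The target is a hundredfold increase over the legacy limit, which is why the technical requirements that follow emphasize horizontal scaling.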
Technical Requirements
• Multitenancy support with brownout protections
• Infrastructure must scale horizontally
• Decouple web processing from downstream processing
• Anonymous leads should cost next to nothing to track
Architecture & Design

Why HBase + Phoenix?
• Horizontally scalable
• Leverages the Hadoop cluster for storage and scaling
• Provides secondary indices for query patterns through Phoenix
• Natural integration with JDBC and Spark JDBC RDDs
Marketo Lambda Architecture
[Architecture diagram. Component labels: Spark Streaming; Consumers; Campaign Triggers; Solr Indexing; Solr; Spark Streaming Indexer; Ingestion Processor; Scala/Tomcat; HBase; Kafka; CRM Sync; Partner APIs; Other Marketing Activities; Web Activity; RTP Activity; Mobile Activity; Marketo UI; Campaign Detail; Lead Detail; Other Clients; CRM Sync; Revenue Cycle Analytics; APIs; Email Report Loader; Web Activity Processor]
Why Spark Streaming?
• Micro-batching provides sink-side efficiencies
• This is especially important with MySQL touchpoints
• Great integration with Kafka
• No strict real-time processing requirements
• Great community and industry adoption
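The "sink-side efficiencies" point can be illustrated without Spark. The sketch below groups events into batches and issues one bulk write per batch, so a sink such as a relational database sees a few large writes instead of one write per event. `CountingSink` and `micro_batches` are hypothetical names made up for this example, and batching by count is a simplification: Spark Streaming forms micro-batches by time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size events."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


class CountingSink:
    """Hypothetical stand-in for a database; it only counts round trips."""

    def __init__(self) -> None:
        self.round_trips = 0
        self.rows = 0

    def write_many(self, rows: List[dict]) -> None:
        self.round_trips += 1  # one bulk write per batch
        self.rows += len(rows)


events = [{"lead_id": i, "type": "visit"} for i in range(1000)]
sink = CountingSink()
for batch in micro_batches(events, batch_size=250):
    sink.write_many(batch)

print(sink.round_trips, sink.rows)
```

Writing the same 1,000 events one at a time would cost 1,000 round trips; here the sink is touched four times.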
Multitenancy
• One topic per customer (sized by volume)
• Traffic storms are isolated to a single customer
• Fairness/throttling is easy to control
• Spark Streaming job consumes from many topics
• Allows us to turn a customer off under error conditions
• See “Elastic Streaming” by Neelesh Shastry (Spark Summit)
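A minimal sketch of the topic-per-customer idea follows. It is not Marketo's implementation: the `TenantRouter` class, the `web-activity.<customer>` naming scheme, and the customer names are all assumptions made for illustration. It shows how a per-customer topic mapping gives a natural kill switch, because turning a customer off is just removing one topic from the set the streaming job subscribes to.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TenantRouter:
    # customer id -> partition count, sized by that customer's volume
    partitions: Dict[str, int]
    disabled: Set[str] = field(default_factory=set)

    @staticmethod
    def topic_for(customer_id: str) -> str:
        """Hypothetical naming scheme: one topic per customer."""
        return f"web-activity.{customer_id}"

    def disable(self, customer_id: str) -> None:
        """Turn a customer off, e.g. under error conditions."""
        self.disabled.add(customer_id)

    def enable(self, customer_id: str) -> None:
        self.disabled.discard(customer_id)

    def active_topics(self) -> List[str]:
        """Topics the streaming job should currently consume from."""
        return sorted(
            self.topic_for(c) for c in self.partitions if c not in self.disabled
        )


router = TenantRouter({"acme": 12, "globex": 2, "initech": 1})
router.disable("globex")
print(router.active_topics())
```

Because each customer's traffic lives in its own topic, a traffic storm from one customer fills only that topic, and the other customers' topics are unaffected.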
Making Spark Streaming Performant
• Coalesce small partitions for the same customer
• Aggressive caching of metadata (mostly from MySQL)
• Heavily leverage Scala future composition for parallelism
• Persist RDDs that are used for multiple outputs (e.g., writes to both Kafka and the Activity Service)
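Two of these techniques can be sketched in plain Python as an analogy. The code below is not Spark or Scala: `coalesce_by_customer` is a hypothetical greedy merge standing in for partition coalescing, and `write_to_all` uses a thread pool as a rough counterpart to composing Scala futures. The batch is materialized once before being handed to both sinks, which mirrors the reason for persisting an RDD that feeds multiple outputs.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

Partition = Tuple[str, List[int]]  # (customer id, events)


def coalesce_by_customer(partitions: List[Partition], target_size: int) -> List[Partition]:
    """Greedily merge small partitions of the same customer, up to target_size events."""
    grouped: Dict[str, List[List[int]]] = defaultdict(list)
    for customer, events in partitions:
        chunks = grouped[customer]
        if chunks and len(chunks[-1]) + len(events) <= target_size:
            chunks[-1] = chunks[-1] + events  # fits: extend the last chunk
        else:
            chunks.append(list(events))       # start a new chunk
    return [(c, chunk) for c, chunks in grouped.items() for chunk in chunks]


def write_to_all(batch: Iterable[int], sinks: List[Callable[[List[int]], str]]) -> List[str]:
    """Compute the batch once, then write it to every sink in parallel."""
    materialized = list(batch)  # analogous to persisting before multiple outputs
    with ThreadPoolExecutor(max_workers=len(sinks)) as pool:
        futures = [pool.submit(sink, materialized) for sink in sinks]
        return [f.result() for f in futures]  # results in submission order


small = [("acme", [1, 2]), ("globex", [3]), ("acme", [4]), ("acme", [5, 6, 7]), ("globex", [8])]
coalesced = coalesce_by_customer(small, target_size=4)

rows = (e for _, chunk in coalesced for e in chunk)  # a one-shot generator
results = write_to_all(
    rows,
    [lambda r: f"kafka:{len(r)}", lambda r: f"activity-service:{len(r)}"],
)
print(coalesced)
print(results)
```

Without the materialization step, the second sink would see an exhausted generator; in Spark the analogous cost is recomputing the lineage once per output.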

Making Anonymous Traffic Cheap
• High costs of web traffic in legacy system
• MySQL storage for all traffic
• Downstream processing of all events (even anonymous)
• V2 only processes and stores known traffic in MySQL
• Defer triggering for anonymous data until promotion
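The deferral idea can be sketched as follows, under stated assumptions: visitors are identified by a cookie id, an in-memory dictionary stands in for whatever store holds anonymous events, and `DeferredTrigger` is a hypothetical name. Anonymous events are only recorded; trigger work happens when the visitor is promoted to a known lead, at which point the backlog is replayed.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class DeferredTrigger:
    def __init__(self, trigger: Callable[[str, dict], None]) -> None:
        self._trigger = trigger
        self._known: Dict[str, str] = {}  # cookie id -> lead id
        self._pending: Dict[str, List[dict]] = defaultdict(list)

    def ingest(self, cookie: str, event: dict) -> None:
        lead_id = self._known.get(cookie)
        if lead_id is None:
            # Anonymous: store cheaply, do no trigger work yet.
            self._pending[cookie].append(event)
        else:
            self._trigger(lead_id, event)

    def promote(self, cookie: str, lead_id: str) -> int:
        """Mark the visitor as a known lead and replay the deferred events."""
        self._known[cookie] = lead_id
        backlog = self._pending.pop(cookie, [])
        for event in backlog:
            self._trigger(lead_id, event)
        return len(backlog)


fired: List[tuple] = []
tracker = DeferredTrigger(lambda lead, ev: fired.append((lead, ev["url"])))

tracker.ingest("cookie-1", {"url": "/pricing"})
tracker.ingest("cookie-1", {"url": "/demo"})
fired_before_promotion = list(fired)  # still empty: nothing triggered yet

replayed = tracker.promote("cookie-1", "lead-42")
tracker.ingest("cookie-1", {"url": "/signup"})  # known now: triggers immediately
print(replayed, fired)
```

Since known leads are typically under 10% of traffic, deferring the work this way keeps the expensive path off the large anonymous majority.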
Impact and Results
• Rolled out to our highest volume customers
• Processing latencies < 30s (at the 99.9th percentile)
• Allowed key customers to scale from ~2MM/day to > 20MM/day
Future Work
• Mitigations of straggler effects on processing delays
• Adding sessionization for web reporting
• Scaling Kafka topics as customers increase volume
• Globally distributed ingestion for a single customer
We’re Hiring!
Http://Marketo.Jobs
Q & A

Recommended for you

Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?

This document discusses streaming data ingestion and processing options. It provides an overview of common streaming architectures including Kafka as an ingestion hub and various streaming engines. Spark Streaming is highlighted as a popular and full-featured option for processing streaming data due to its support for SQL, machine learning, and ease of transition from batch workflows. The document also briefly profiles StreamSets Data Collector as a higher-level tool for building streaming data pipelines.

Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?

The document discusses Apache Hive and Apache Druid for fast SQL on big data. It provides performance benchmarks showing Hive LLAP is faster than Presto and Spark SQL for TPC-DS queries. It describes features of Hive LLAP including in-memory caching, query result caching, and metadata caching. It also discusses new Hive 3 features like materialized views and optimizer improvements. The document then provides an overview of Apache Druid's capabilities for real-time ingestion and querying of streaming data before discussing how Hive and Druid can work together, with Hive able to push down queries to Druid.

hortonworkssqlhadoop
The truth about SQL and Data Warehousing on Hadoop
The truth about SQL and Data Warehousing on HadoopThe truth about SQL and Data Warehousing on Hadoop
The truth about SQL and Data Warehousing on Hadoop

The truth about SQL and Data Warehousing on Hadoop

future of dataapache hadoophadoop summit tokyo

More Related Content

What's hot

Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
DataWorks Summit/Hadoop Summit
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China Mobile
DataWorks Summit
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
DataWorks Summit
 
Data Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystemData Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystem
DataWorks Summit/Hadoop Summit
 
Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even...
Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even...Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even...
Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even...
DataWorks Summit
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
alanfgates
 
Introduction to Apache NiFi 1.10
Introduction to Apache NiFi 1.10Introduction to Apache NiFi 1.10
Introduction to Apache NiFi 1.10
Timothy Spann
 
IoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFiIoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFi
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
DataWorks Summit
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
kgshukla
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
DataWorks Summit
 
Curing the Kafka Blindness – Streams Messaging Manager
Curing the Kafka Blindness – Streams Messaging ManagerCuring the Kafka Blindness – Streams Messaging Manager
Curing the Kafka Blindness – Streams Messaging Manager
DataWorks Summit
 
Lessons learned running a container cloud on YARN
Lessons learned running a container cloud on YARNLessons learned running a container cloud on YARN
Lessons learned running a container cloud on YARN
DataWorks Summit
 
Apache deep learning 101
Apache deep learning 101Apache deep learning 101
Apache deep learning 101
DataWorks Summit
 
MiNiFi 0.0.1 MeetUp talk
MiNiFi 0.0.1 MeetUp talkMiNiFi 0.0.1 MeetUp talk
MiNiFi 0.0.1 MeetUp talk
Joe Percivall
 
Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?
DataWorks Summit/Hadoop Summit
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
DataWorks Summit/Hadoop Summit
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
DataWorks Summit/Hadoop Summit
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
DataWorks Summit
 

What's hot (20)

Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China Mobile
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Data Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystemData Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystem
 
Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even...
Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even...Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even...
Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even...
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
Introduction to Apache NiFi 1.10
Introduction to Apache NiFi 1.10Introduction to Apache NiFi 1.10
Introduction to Apache NiFi 1.10
 
IoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFiIoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFi
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
 
Curing the Kafka Blindness – Streams Messaging Manager
Curing the Kafka Blindness – Streams Messaging ManagerCuring the Kafka Blindness – Streams Messaging Manager
Curing the Kafka Blindness – Streams Messaging Manager
 
Lessons learned running a container cloud on YARN
Lessons learned running a container cloud on YARNLessons learned running a container cloud on YARN
Lessons learned running a container cloud on YARN
 
Apache deep learning 101
Apache deep learning 101Apache deep learning 101
Apache deep learning 101
 
MiNiFi 0.0.1 MeetUp talk
MiNiFi 0.0.1 MeetUp talkMiNiFi 0.0.1 MeetUp talk
MiNiFi 0.0.1 MeetUp talk
 
Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
 








Recently uploaded (20)

DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
 
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALLBLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
 
Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
 
Best Programming Language for Civil Engineers
Best Programming Language for Civil EngineersBest Programming Language for Civil Engineers
Best Programming Language for Civil Engineers
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
 
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
 
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
 

Rebuilding Web Tracking Infrastructure for Scale

  • 1. Rebuilding Web Tracking Infrastructure for Scale. Stephen Oakley, Principal Engineer, Marketo
  • 3. Page 3 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016
      What is Web Tracking at Marketo?
      • Ingest web page visits and clicks on customer’s website
      • Trigger campaigns in response to web activity
      • Trigger real-time personalization of web experience
      • Provide lead level analytics for known leads
      • Provide aggregate analytics for all lead activity
      • Typically known leads < 10 % of all traffic
  • 4. Legacy Web Tracking Infrastructure
  • 5. Legacy Web Tracking Infrastructure
  • 6. Legacy Problems
      • Throughput limitations – 2 million activities per day
      • Processing delays can be on the order of hours
      • Large customers cause web server brownouts
      • Web reporting does not scale
      • Fixed-sized clusters prohibit horizontal scaling
      • Brittle infrastructure prevents feature development
  • 8. Orion Initiative
      • Increase scale to support IoT for Marketers
      • Support billions of marketing activities each day
      • Trigger on activities in near real time (< 2 minutes at the 99th percentile)
      • Reduce operational costs
      • Improve multitenancy and QoS
  • 10. Business Requirements
      • 200 MM activities per customer per day
      • Near real-time web activity processing (SLA of < 1 minute lag)
      • Improve cost efficiency
      • Improve flexibility for feature enhancements
  • 11. Technical Requirements
      • Multitenancy support with brownout protections
      • Infrastructure must scale horizontally
      • Decouple web processing from downstream processing
      • Anonymous leads should cost next to nothing to track
  • 15. Why HBase + Phoenix?
      • Horizontally scalable
      • Leverages the Hadoop cluster for storage and scaling
      • Provides secondary indices for query patterns through Phoenix
      • Natural integration with JDBC and Spark JDBC RDDs
  • 17. Marketo Lambda Architecture (architecture diagram; component labels: Spark Streaming Consumers, Campaign Triggers, Solr Indexing, Solr, Spark Streaming Indexer, Ingestion Processor, Scala/Tomcat, HBase, Kafka, CRM Sync, Partner APIs, Other Marketing Activities, Web Activity, RTP Activity, Mobile Activity, Marketo UI, Campaign Detail, Lead Detail, Other Clients, CRM Sync, Revenue Cycle Analytics, APIs, Email Report Loader, Web Activity Processor)
  • 18. Why Spark Streaming?
      • Micro-batching provides sink-side efficiencies
      • This is especially important with MySQL touchpoints
      • Great integration with Kafka
      • No strict real-time processing requirements
      • Great community and industry adoption
  • 19. Multitenancy
      • One topic per customer (sized by volume)
      • Traffic storms are isolated to a single customer
      • Fairness/throttling is easy to control
      • Spark Streaming job consumes from many topics
      • Allows us to turn a customer off under error conditions
      • See “Elastic Streaming” by Neelesh Shastry – Spark Summit
  • 20. Making Spark Streaming Performant
      • Coalesce small partitions for the same customer
      • Aggressive caching of metadata (mostly from MySQL)
      • Heavily leverage Scala future composition for parallelism
      • Persist RDDs that are used for multiple outputs
      • e.g. write to Kafka and Activity Service
  • 21. Making Anonymous Traffic Cheap
      • High costs of web traffic in legacy system
      • MySQL storage for all traffic
      • Downstream processing of all events (even anonymous)
      • V2 only processes and stores known traffic in MySQL
      • Defer triggering for anonymous data until promotion
  • 22. Impact and Results
      • Rolled out to our highest volume customers
      • Processing latencies < 30s (at 99.9th %)
      • Allowed key customers to scale from ~2 MM/day to > 20 MM/day
  • 23. Future Work
      • Mitigations of straggler effects on processing delays
      • Adding sessionization for web reporting
      • Scaling Kafka topics as customers increase volume
      • Globally distributed ingestion for a single customer
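
The "coalesce small partitions for the same customer" idea from the Making Spark Streaming Performant slide can be illustrated with a small sketch. This is not Marketo's implementation, and it does not use Spark or Kafka APIs; it only shows one plausible greedy grouping under the assumption of one topic per customer. The names `TopicPartition`, `coalesce_by_customer`, `pending` and `target` are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TopicPartition:
    """One Kafka topic partition and its pending record count (hypothetical model)."""
    customer: str   # assumes one topic per customer, as on the Multitenancy slide
    partition: int
    pending: int    # records waiting in the current micro-batch


def coalesce_by_customer(parts: List[TopicPartition], target: int) -> List[List[TopicPartition]]:
    """Greedily pack each customer's small partitions into groups of about `target` records.

    A group never mixes customers, so per-customer isolation is preserved.
    """
    by_customer: Dict[str, List[TopicPartition]] = defaultdict(list)
    for p in parts:
        by_customer[p.customer].append(p)

    groups: List[List[TopicPartition]] = []
    for customer in sorted(by_customer):
        current: List[TopicPartition] = []
        size = 0
        for p in sorted(by_customer[customer], key=lambda x: x.partition):
            # Close the current group when adding this partition would exceed the target.
            if current and size + p.pending > target:
                groups.append(current)
                current, size = [], 0
            current.append(p)
            size += p.pending
        if current:
            groups.append(current)
    return groups
```

With a target of 100 records, a customer whose partitions hold 10, 20 and 80 pending records would end up with two groups, [10, 20] and [80], while a second customer's partitions stay in their own group. Each group would then map to one larger processing partition.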

Editor's Notes

  1. Next phase was when we were ready to validate our newly built event ingestion system. Marketo is a powerful Engagement Marketing Platform. There are several applications that make up the platform, such as ABM, marketing analytics, predictive content, digital ads, and marketing automation. Marketing automation is what we are focusing on today. It enables the marketer to create, automate and measure marketing campaigns across channels. A simple example of an automated campaign or workflow:
      • A user visits your website and fills out a form.
      • Web tracking sees that they spent most of their time looking at pages about Spark Streaming.
      • An email is automatically sent to invite the user to a webinar on Spark Streaming services.
      • If they attend the webinar, their interest is registered in your CRM and a salesperson is asked to contact the user.
     Campaigns can be complex and can reach out to and track customers across channels like web, email, mobile, and social.
  2. Explain what a known vs. anonymous lead is: known is targetable on other channels; anonymous is only web activity. Speak to how the traffic patterns are heavily skewed toward anonymous given our customer base. Talk about how anonymous converts to known. Aggregate analytics include the company web report, landing page reports, etc.
  3. Speak to the pod. Mention that there are many, many pods.
  4. An additional complication is that the same two web servers also serve the MLM app, the SOAP APIs, and the landing pages.
  5. Although the talk isn’t about the project… we have a few slides up front to set the context around what we are working on. If you have been near technology at all in the last couple of years, you know that the world has become very connected. The number of connected devices blows my mind. It’s not just phones anymore… Amazon Dash buttons, coffee makers, propane tanks, garage doors. These devices are sending tens of billions of activities and user interactions every day... Orion is our platform... Our marketing platform ingests the user interactions and processes them into relevant marketing touchpoints. It enables marketers to create marketing campaigns around these activities to build relationships with their customers. Become the fabric for marketers. It’s been a great experience building this.
  6. Here are a few of the requirements:
      • Near real-time processing
      • At least 1 billion activities per customer per day. Customer demands from increasing devices caused us to evaluate next-gen queueing and streaming...
      • Reduction in infrastructure COGS, primarily from expensive enterprise-class filers...
      • Reduction in people COGS through efficiency gained by reducing a tech stack that used too many similar technologies...
      • Multitenant… of course
      • Secure
      • Customer isolation and improved resource management
  7. Architecture requirements driven by the business requirements:
      • Improve utilization over the existing system
      • Lots of customers in the same infrastructure, without starving
      • Encryption from day 1 for safe data storage
      • Aim for horizontal scalability
      • Coming from a standard 3-tier app
      • Radically reduce processing latency
      • Eliminate backlogs
      • Brownout protection
  8. A few words about the architecture. The main goal is to ingest, process and store marketing events.
  9. Detailed overview of the Munchkin FE (MFE) component:
      • Spray.io for MFE
      • Frontend has the simple job of verifying subscription status, collecting metrics and persisting to Kafka
      • Use Avro to allow for schema evolution, strong typing and compact representation in the topic
      • Use Schema Registry to allow the schema to be upgraded by the producer and then automatically picked up by the Spark Streaming component
      • Use the asynchronous API for Kafka to allow high throughput
  10. Detailed overview of the LeadService component:
      • Spray.io for LeadService
      • HBase for cookie and anonymous lead storage
          • Salted table
          • Key structure is subscription-cookie-leadid
          • Secondary index for subscription-lead-createdat
      • MySQL for known lead storage
      • Masterdata for reverse IP information enrichments
  11. Overall view of the system:
      • Describe how there is a Kafka topic per subscription
      • Spark Streaming transforms the raw events into activities by:
          • Enriching with web page metadata from MySQL
          • Lead and reverse IP enrichment from LeadService
      • Persist activities to AS for storage and secondary processing (e.g. triggering and Solr indexing)
      • Push enriched web events to Kafka for the downstream Druid OLAP infrastructure
  12. High-level diagram of our event processor:
      • Enhanced Lambda Architecture
      • Inbound activities written to Ingestion Processor
          • HBase and then Kafka
      • High-volume (e.g. web) activities
          • First written to Kafka, then enriched
      • Spark Streaming applications consume events from Kafka
          • Solr indexing
          • Email reports
          • Campaign processing
      • HBase is used for simple historical queries, and is the system of record
  13. While it is not “true” streaming, this is exactly what we need as an optimization.
  14. Our multitenant Kafka framework coalesces small Kafka partitions into large Spark RDD partitions to improve batch utilization.
      • Several components of the event enrichment require outbound RPC calls. Using async clients, performing the calls in parallel, and then composing the futures pipelines the computation and significantly improves throughput.
      • Caching web assets and cookies for temporal locality. The cache is > 60% of the executor memory.
      • Enriched events are written out to multiple destinations, and being selective about persisting RDDs prevents recomputing expensive transformations (multiple RPC calls or MySQL queries).
  15. Traditionally, both anonymous and known data were treated equally in MLM. This is problematic because anonymous volumes are usually 10-20x higher than known. Additionally, there is very little intrinsic value in performing downstream processing on anonymous data, since you cannot target anonymous leads for campaigns. To improve this, in Munchkin V2 we only allow known traffic to flow to downstream processing. Anonymous data is passed for downstream processing when the lead converts to a known lead via form fillout, API calls, etc.
  16. Reiterates my points on the last slide. I included it in case you want to look at the slides later.
  17. Give a quick overview of the activities architecture. Introduce Kafka in the presentation.
  18. Spend more time on this: purple is our code, teal is standard Spark.
      • SubscriptionRegistry uses ZK
      • OffsetManager is a library; it uses the low-level Kafka consumer API
      • Provisioning framework (Sirius): a new subscription is provisioned to the registry via Oozie
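
The LeadService note above mentions a salted HBase table keyed as subscription-cookie-leadid. The sketch below shows the general idea of salting: a small hash-derived prefix spreads otherwise adjacent keys across buckets so writes do not all hit one region. It is an illustration only. The bucket count, the use of CRC32, and the names `SALT_BUCKETS` and `salted_row_key` are assumptions made here; the talk does not state which hash or how many buckets were used, and Phoenix computes its own salt byte internally.

```python
import zlib

SALT_BUCKETS = 16  # assumed bucket count; not given in the talk


def salted_row_key(subscription: str, cookie: str, lead_id: int,
                   buckets: int = SALT_BUCKETS) -> bytes:
    """Build a row key of the form <salt byte><subscription>-<cookie>-<lead_id>.

    The salt is a stable hash of the logical key modulo the bucket count,
    so the same logical key always lands in the same bucket.
    """
    if not 1 <= buckets <= 256:
        raise ValueError("buckets must fit in a single byte")
    logical = f"{subscription}-{cookie}-{lead_id}".encode("utf-8")
    salt = zlib.crc32(logical) % buckets  # CRC32 is an illustrative choice of hash
    return bytes([salt]) + logical
```

A point lookup can recompute the salt from the full logical key, while a scan over one subscription has to fan out across all buckets, which is the usual trade-off of salting.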