Rebuilding Web Tracking Infrastructure for Scale

•Download as PPTX, PDF•

2 likes•1,145 views

The document discusses Marketo's efforts to rebuild its web tracking infrastructure to greatly increase its capabilities and scale. The legacy system could only handle 2 million activities per day and had problems with delays and lack of flexibility. The new Orion initiative aims to support billions of daily activities with near real-time processing using a distributed architecture with Apache Spark, Kafka, and HBase on Hadoop. The initial results included supporting a key customer's increase from 2 million to over 20 million activities per day with latencies under 30 seconds.

Recommended for you

Data Regions: Modernizing your company's data ecosystem

Modern data ecosystems require new paradigms to address diverse data sources and user needs. Traditional assumptions about data originating from internal systems and a single data warehouse no longer apply. A new model called "Data Regions" establishes multiple environments for different data usage scenarios, including source onboarding, exploration, reporting, analytics and more. By supporting varied access, structures, domains and integrity across regions, Data Regions can address today's complex data challenges and modernize companies' data ecosystems.

•by DataWorks Summit/Hadoop Summit

hadoop summit

Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even...

Learn how Pure Storage engineering manages streaming 190B log events per day and makes use of that deluge of data in our continuous integration (CI) pipeline. Our test infrastructure runs over 70,000 tests per day creating a large triage problem that would require at least 20 triage engineers. Instead, Spark's flexible computing platform allows us to write a single application for both streaming and batch jobs to understand the state of our CI pipeline for our team of 3 triage engineers. Using encoded patterns, Spark indexes log data for real-time reporting (Streaming), uses Machine Learning for performance modeling and prediction (Batch job), and finds previous matches for newly encoded patterns (Batch job). Resource allocation in this mixed environment can be challenging; a containerized Spark cluster deployment, and disaggregated compute and storage layers allow us to programmatically shift compute resources between the streaming and batch applications.. This talk will go over design decisions to meet SLAs of streaming and batching in hardware, data layout, access patterns, and containers strategy. We will also go over the challenges, lessons learned, and best practices for similar data pipelines. Speaker Joshua Robinson, JOSHUA ROBINSON Founding Engineer Pure Storage

•by DataWorks Summit

dataworks summit barcelonadws19pure storage

Hive edw-dataworks summit-eu-april-2017

My talk on building an EDW from Apache Hive, Ranger, Atlas, and other Apache projects. This talk was initially delivered at Dataworks Summit EU 2017

•by alanfgates

apache atlasapache rangerdruid

Page 5
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
Legacy Web Tracking Infrastructure

Page 6
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
Legacy Problems
• Throughput limitations – 2 million activities per day
• Processing delays can be on the order of hours
• Large customers cause web server brownouts
• Web reporting does not scale
• Fixed-sized clusters prohibit horizontal scaling
• Brittle infrastructure prevents feature development

Page 8
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
Orion Initiative
• Increase scale to support IoT for Marketers
• Support billions of marketing activities each day
• Trigger on activities in near real time (< 2 minute @ 99th %)
• Reduce operational costs
• Improve multitenancy and QoS

Recommended for you

Introduction to Apache NiFi 1.10

Introduction to Apache NiFi 1.10 Parameters, Stateless, RetryFlowFile, Backpressure prediction, parquetreader, parquetwriter, postslack, remote inputport in process group. Dec 2019, Timothy Spann, Field Engineer, Data in Motion Princeton Meetup 10-dec-2019 https://www.meetup.com/futureofdata-princeton/events/266496424/ Hosted By PGA Fund at: https://pga.fund/coworking-space/ Princeton Growth Accelerator 5 Independence Way, 4th Floor, Princeton, NJ

•by Timothy Spann

apache nifistatelessnifi 1.10

IoT with Apache MXNet and Apache NiFi and MiniFi

1) The document discusses using Apache MXNet for industrial IoT applications. MiniFi ingests camera images and sensor data at the edge and runs Apache MXNet to recognize objects in images. The data is then stored in Hadoop. 2) It describes using Apache MXNet on edge devices like the Raspberry Pi and Nvidia Jetson TX1 to perform tasks like image recognition from cameras and sensors. 3) The document provides information on setting up Apache MXNet on various IoT devices and edge servers to enable machine learning and deep learning capabilities for industrial IoT applications.

•by DataWorks Summit

artificial intelligencedata sciencedata engineering

HBase Global Indexing to support large-scale data ingestion at Uber

Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.

•by DataWorks Summit

dataworks summit 2019dws19dataworks summit washington dc

Page 10
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
Business Requirements
• 200 MM activities per customer per day
• Near real-time web activity processing (SLA of < 1
minute lag)
• Improve cost efficiency
• Improve flexibility for feature enhancements

Page 11
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
Technical Requirements
• Multitenancy support with brownout protections
• Infrastructure must scale horizontally
• Decouple web processing from downstream processing
• Anonymous leads should cost next to nothing to track

Recommended for you

Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...

The document discusses Hive LLAP (Live Long and Process) as a high performance and cost-effective alternative to traditional Massively Parallel Processing (MPP) databases for querying large datasets on Hadoop. It describes Walmart's implementation of Hive LLAP on their data lake to improve query performance for business users. A proof-of-concept found Hive LLAP queries were up to 50% faster when using 15 nodes instead of 10, and it performed comparably or better than two MPP databases with similar or larger infrastructures. Walmart plans to further evaluate Hive LLAP on newer Hadoop distributions and technologies to improve availability and workload management.

•by DataWorks Summit

dataworks summit 2019dws19dataworks summit washington dc

Pivotal Real Time Data Stream Analytics

This document discusses using Pivotal's Big Data Suite to build a real-time analytics solution for processing taxi trip data streams. It presents an architecture that uses Spring XD for data ingestion, Spark Streaming for in-memory analytics on 10-second windows, Gemfire for fast data retrieval, and Pivotal HD for long-term storage. The solution demonstrates filtering inconsistent data, finding top traffic areas, and available taxis in real-time. The document highlights how the Big Data Suite provides a complete toolset for data-driven enterprises through its optimized Hadoop distribution, in-memory processing, stream processing, and low-latency data stores.

•by kgshukla

pivotal big data suitedata analyticspivotal

Realizing the promise of portable data processing with Apache Beam

The world of big data involves an ever changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere". This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of the Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premise), and give a glimpse at some of the challenges Beam aims to address in the future.

•by DataWorks Summit

hadoopdataworks summit 2017dataworks summit

Page 13
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016

Page 14
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016

Page 15
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
Why Hbase + Phoenix?
• Horizontally scalable
• Leverages the Hadoop cluster for storage and scaling
• Provides secondary indices for query patterns through
Phoenix
• Natural integration with JDBC and Spark JDBC RDDs

Page 16
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016

Recommended for you

Curing the Kafka Blindness – Streams Messaging Manager

Companies who use Kafka today struggle with monitoring and managing Kafka clusters. Kafka is a key backbone of IoT streaming analytics applications. The challenge is understanding what is going on overall in the Kafka cluster including performance, issues and message flows. No open source tool caters to the needs of different users that work with Kafka: DevOps/developers, platform team, and security/governance teams. See how the new Hortonworks Streams Messaging Manager enables users to visualize their entire Kafka environment end-to-end and simplifies Kafka operations. In this session learn how SMM visualizes the intricate details of how Apache Kafka functions in real time while simultaneously surfacing every nuance of tuning, optimizing, and measuring input and output. SMM will assist users to quickly understand and operate Kafka while providing the much-needed transparency that sophisticated and experienced users need to avoid all the pitfalls of running a Kafka cluster. Speaker: Andrew Psaltis, Principal Solution Engineer, Hortonworks

•by DataWorks Summit

iot and streamingapache kafkadevops

Lessons learned running a container cloud on YARN

Apache Hadoop YARN is the resource and application manager for Apache Hadoop. In the past, YARN only supported launching containers as processes. However, as containerization has become extremely popular, more and more users wanted support for launching Docker containers. With recent changes, YARN now supports running Docker containers alongside process containers. Coupled with the newly added support for long-running services on YARN, this allows a host of new possibilities. In this talk, we'll present how to run a container cloud on YARN. Leveraging the support in YARN for Docker and long-running services, we can allow users to easily spin up sets of Docker containers for their applications. These containers can be self contained or wired up to form more complex applications. We will go over some of the lessons we learned as part of our experiences handling issues such as resource management, debugging application failures, running Docker, service discovery, etc. Speaker Billie Rinaldi, Principal Software Engineer I, Hortonworks

•by DataWorks Summit

cloud and operationsclouddata engineering

Apache deep learning 101

In my talk I will discuss and show examples of using Apache Hadoop, Apache Hive, Apache MXNet, Apache OpenNLP, Apache NiFi and Apache Spark for deep learning applications. As part of my talk I will walk through using Apache NXNet Pre-Built Models, MXNet's New Model Server with Apache NiFi, executing MXNet with Apache NiFi and running Apache MXNet on edge nodes utilizing Python and Apache MiniFi. This talk is geared towards Data Engineers interested in the basics of Deep Learning with open source Apache tools in a Big Data environment. I will walk through source code examples available in github and run the code live on an Apache Hadoop / YARN / Apache Spark cluster. This will be an introduction to executing Deep Learning Pipelines in an Apache Big Data environment. My talk at Data Works Summit Sydney was listed in top 7 -> https://hortonworks.com/blog/7-sessions-dataworks-summit-sydney-see/ Also have speak at and run Future of Data Princeton and at Oracle Code NYC. Ref: https://community.hortonworks.com/articles/83100/deep-learning-iot-workflows-with-raspberry-pi-mqtt.html https://community.hortonworks.com/articles/146704/edge-analytics-with-nvidia-jetson-tx1-running-apac.html https://dzone.com/refcardz/introduction-to-tensorflow Speaker Timothy Spann, Solutions Engineer, Hortonworks

•by DataWorks Summit

artificial intelligencedata engineeringdata science

Page 17
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
Marketo Lambda Architecture
Spark Streaming
Consumers
Campaign Triggers
Solr Indexing
Solr
Spark Streaming Indexer
Ingestion Processor
Scala/Tomcat
HBase
Kafka
CRM Sync
Partner APIs
Other Marketing
Activities
Web Activity
RTP Activity
Mobile Activity
Marketo UI
Campaign Detail
Lead Detail
Other Clients
CRM Sync
Revenue Cycle Analylitcs
APIs
Email Report Loader
Web Activity Processor

Page 18
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
Why Spark Streaming?
• Micro-batching provides sink-side efficiencies
• This is especially important with MySQL touchpoints
• Great integration with Kafka
• No strict real-time processing requirements
• Great community and industry adoption

Page 19
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
Multitenancy
• One topic per customer (sized by volume)
• Traffic storms are isolated to a single customer
• Fairness/throttling is easy to control
• Spark Streaming job consumes from many topics
• Allows us to turn a customer off under error conditions
• See “Elastic Streaming” by Neelesh Shastry –
Spark Summit

Page 20
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
Making Spark Streaming Performant
• Coalesce small partitions for the same customer
• Aggressive caching of metadata (mostly from MySQL)
• Heavily leverage Scala future composition for parallelism
• Persist RDDs that are used for multiple outputs
• e.g. write to Kafka and Activity Service

Recommended for you

MiNiFi 0.0.1 MeetUp talk

MiNiFi is a recently started sub-project of Apache NiFi that is a complementary data collection approach which supplements the core tenets of NiFi in dataflow management, focusing on the collection of data at the source of its creation. Simply, MiNiFi agents take the guiding principles of NiFi and pushes them to the edge in a purpose built design and deploy manner. This talk will focus on MiNiFi's features, go over recent developments and prospective plans, and give a live demo of MiNiFi. The config.yml is available here: https://gist.github.com/JPercivall/f337b8abdc9019cab5ff06cb7f6ff09a

•by Joe Percivall

ioatiotnifi

Why is my Hadoop cluster slow?

This document discusses ways to troubleshoot slow Hadoop jobs using metrics, logging, and tracing. It describes how to use the Ambari metrics system and Grafana dashboards to monitor metrics for clusters. It also explains how to leverage Hadoop logs and the YARN Application Timeline Service for logging and correlation across workloads. Finally, it presents Apache Zeppelin and analyzers for Hive, Tez, and YARN as tools for ad-hoc analysis to diagnose issues.

•by DataWorks Summit/Hadoop Summit

apache hadoophadoop clusterhadoop summit tokyo

Apache Hadoop YARN: Past, Present and Future

The document discusses the past, present, and future of Apache Hadoop YARN. It describes how YARN started as a sub-project of Hadoop to improve its resource management capabilities. Today, YARN is central to modern data architectures, providing centralized resource management and scheduling. Going forward, YARN aims to better support containers, simplified APIs, treating services as first-class citizens, and enhance its user experience.

•by DataWorks Summit/Hadoop Summit

hadoop summiths16melb

Page 21
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
Making Anonymous Traffic Cheap
• High costs of web traffic in legacy system
• MySQL storage for all traffic
• Down streaming processing of all events (even anonymous)
• V2 only processes and stores known traffic in MySQL
• Defer triggering for anonymous data until promotion

• Rolled out to our highest volume customers
• Processing latencies < 30s (at 99.9th %)
• Allowed key customers to scale from ~2MM/day to > 20
MM/day
Impact and Results

• Mitigations of straggler effects on processing delays
• Adding sessionization for web reporting
• Scaling Kafka topics as customers increase volume
• Globally distributed ingestion for a single customer
Future Work

Recommended for you

Ingest and Stream Processing - What will you choose?

This document discusses streaming data ingestion and processing options. It provides an overview of common streaming architectures including Kafka as an ingestion hub and various streaming engines. Spark Streaming is highlighted as a popular and full-featured option for processing streaming data due to its support for SQL, machine learning, and ease of transition from batch workflows. The document also briefly profiles StreamSets Data Collector as a higher-level tool for building streaming data pipelines.

•by DataWorks Summit/Hadoop Summit

Fast SQL on Hadoop, Really?

The document discusses Apache Hive and Apache Druid for fast SQL on big data. It provides performance benchmarks showing Hive LLAP is faster than Presto and Spark SQL for TPC-DS queries. It describes features of Hive LLAP including in-memory caching, query result caching, and metadata caching. It also discusses new Hive 3 features like materialized views and optimizer improvements. The document then provides an overview of Apache Druid's capabilities for real-time ingestion and querying of streaming data before discussing how Hive and Druid can work together, with Hive able to push down queries to Druid.

•by DataWorks Summit

hortonworkssqlhadoop

The truth about SQL and Data Warehousing on Hadoop

•by DataWorks Summit/Hadoop Summit

future of dataapache hadoophadoop summit tokyo

What's hot

Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...

DataWorks Summit/Hadoop Summit

The document discusses accelerating Apache Hadoop through high-performance networking and I/O technologies. It describes how technologies like InfiniBand, RoCE, SSDs, and NVMe can benefit big data applications by alleviating bottlenecks. It outlines projects from the High-Performance Big Data project that implement RDMA for Hadoop, Spark, HBase and Memcached to improve performance. Evaluation results demonstrate significant acceleration of HDFS, MapReduce, and other workloads through the high-performance designs.

Practice of large Hadoop cluster in China Mobile

DataWorks Summit

China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. In China Mobile, distributed big data clusters are built by branch companies in each province for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster with scale more than 1600 nodes, on which we collect data from dozens of distributed clusters and make analysis for our business. In this session, we will introduce the architecture of the centralized Hadoop cluster and experience of constructing and tuning this large scale Hadoop cluster. Key points are as follows: 1. About Ambari: we improve Ambari with features like supporting HDFS Federation and Ambari HA , improving its performance and enabling it to support up to 1600 nodes. 2. About HDFS: we build a large HDFS cluster with data up to 60PB, using federation, ViewFS, FairCallQueue. Our best practice of cluster operation and management will also be included. 3. About Flume: We use the reformed Flume to collect data as much as 200TB per day. Speakers Yuxuan Pan, Software Engineer, China Mobile Software Technology Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation

Using Spark Streaming and NiFi for the next generation of ETL in the enterprise

DataWorks Summit

On paper, combining Apache NiFi, Kafka, and Spark Streaming provides a compelling architecture option for building your next generation ETL data pipeline in near real time. What does this look like in enterprise production environment to deploy and operationalized? The newer Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with elegant code samples, but is that the whole story? This session will cover the Royal Bank of Canada’s (RBC) journey of moving away from traditional ETL batch processing with Teradata towards using the Hadoop ecosystem for ingesting data. One of the first systems to leverage this new approach was the Event Standardization Service (ESS). This service provides a centralized “client event” ingestion point for the bank’s internal systems through either a web service or text file daily batch feed. ESS allows down stream reporting applications and end users to query these centralized events. We discuss the drivers and expected benefits of changing the existing event processing. In presenting the integrated solution, we will explore the key components of using NiFi, Kafka, and Spark, then share the good, the bad, and the ugly when trying to adopt these technologies into the enterprise. This session is targeted toward architects and other senior IT staff looking to continue their adoption of open source technology and modernize ingest/ETL processing. Attendees will take away lessons learned and experience in deploying these technologies to make their journey easier. Speakers Darryl Sutton, T4G, Principal Consultant Kenneth Poon, RBC, Director, Data Engineering

Data Regions: Modernizing your company's data ecosystem

DataWorks Summit/Hadoop Summit

Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even...

DataWorks Summit

Hive edw-dataworks summit-eu-april-2017

alanfgates

Introduction to Apache NiFi 1.10

Timothy Spann

IoT with Apache MXNet and Apache NiFi and MiniFi

DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber

DataWorks Summit

Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...

DataWorks Summit

Pivotal Real Time Data Stream Analytics

kgshukla

Realizing the promise of portable data processing with Apache Beam

DataWorks Summit

Curing the Kafka Blindness – Streams Messaging Manager

DataWorks Summit

Lessons learned running a container cloud on YARN

DataWorks Summit

Apache deep learning 101

DataWorks Summit

MiNiFi 0.0.1 MeetUp talk

Joe Percivall

Why is my Hadoop cluster slow?

DataWorks Summit/Hadoop Summit

Apache Hadoop YARN: Past, Present and Future

DataWorks Summit/Hadoop Summit

Ingest and Stream Processing - What will you choose?

DataWorks Summit/Hadoop Summit

Fast SQL on Hadoop, Really?

DataWorks Summit

What's hot (20)

Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...

Practice of large Hadoop cluster in China Mobile

Using Spark Streaming and NiFi for the next generation of ETL in the enterprise

Data Regions: Modernizing your company's data ecosystem

Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even...

Hive edw-dataworks summit-eu-april-2017

Introduction to Apache NiFi 1.10

IoT with Apache MXNet and Apache NiFi and MiniFi

HBase Global Indexing to support large-scale data ingestion at Uber

Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...

Pivotal Real Time Data Stream Analytics

Realizing the promise of portable data processing with Apache Beam

Curing the Kafka Blindness – Streams Messaging Manager

Lessons learned running a container cloud on YARN

Apache deep learning 101

MiNiFi 0.0.1 MeetUp talk

Why is my Hadoop cluster slow?

Apache Hadoop YARN: Past, Present and Future

Ingest and Stream Processing - What will you choose?

Fast SQL on Hadoop, Really?

Viewers also liked

The truth about SQL and Data Warehousing on Hadoop

DataWorks Summit/Hadoop Summit

Comparison of Transactional Libraries for HBase

DataWorks Summit/Hadoop Summit

From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...

DataWorks Summit/Hadoop Summit

This document summarizes Coca-Cola East Japan's journey with Hadoop and data analytics. It discusses: 1) Coca-Cola East Japan's background and data landscape prior to Hadoop, which involved data silos and batch-oriented processing. 2) The phases of Coca-Cola East Japan's Hadoop implementation from a pilot project in 2015 to a production environment in 2016 with 13 nodes storing 20TB of data. 3) Examples of Hadoop projects including vending machine replenishment forecasting and a write-off reporting project to aggregate data from multiple sources. 4) Future plans to improve data collection and establish a true data lake, and develop data-driven decision

SEGA : Growth hacking by Spark ML for Mobile games

DataWorks Summit/Hadoop Summit

The real world use of Big Data to change business

DataWorks Summit/Hadoop Summit

Use case and Live demo : Agile data integration from Legacy system to Hadoop ...

DataWorks Summit/Hadoop Summit

Streamline Hadoop DevOps with Apache Ambari

DataWorks Summit/Hadoop Summit

The document discusses how Apache Ambari can be used to streamline Hadoop DevOps. It describes how Ambari can be used to provision, manage, and monitor Hadoop clusters. It highlights new features in Ambari 2.4 like support for additional services, role-based access control, management packs, and Grafana integration. It also covers how Ambari supports automated deployment and cluster management using blueprints.

Case study of DevOps for Hadoop in Recruit.

DataWorks Summit/Hadoop Summit

Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...

DataWorks Summit/Hadoop Summit

This document compares the performance of Hive and Spark SQL for processing smart meter data from electric utilities. Spark SQL showed significantly better performance than Hive, being able to process a year's worth of data from 10 million smart meters within 30 minutes. The use of ORCFile format and partitioning the data by individual equipment such as transformers improved performance. Spark SQL was well-suited for this near real-time use case due to its distributed in-memory computations.

Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

DataWorks Summit/Hadoop Summit

This document discusses using Apache Spark and Amazon DSSTNE to generate product recommendations at scale. It summarizes that Amazon uses Spark and Zeppelin notebooks to allow data scientists to develop queries in an agile manner. Deep learning jobs are run on GPUs using Amazon ECS, while CPU jobs run on Amazon EMR. DSSTNE is optimized for large sparse neural networks and allows defining networks in a human-readable JSON format to efficiently handle Amazon's large recommendation problems.

#HSTokyo16 Apache Spark Crash Course

DataWorks Summit/Hadoop Summit

This document provides an overview and crash course on Apache Spark and related big data technologies. It discusses the history and components of Spark including Spark Core, SQL, Streaming, and MLlib. It also discusses data sources, challenges of big data, and how Spark addresses them through its in-memory computation model. Finally, it introduces Apache Zeppelin for interactive notebooks and the Hortonworks Data Platform sandbox for experimenting with these technologies.

Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

DataWorks Summit/Hadoop Summit

Yahoo Japan transitioned their Hadoop cluster network architecture over time to address problems and scale needs. They moved from a stack architecture to an L2 fabric to an IP CLOS architecture. The IP CLOS architecture improved scalability, high availability, and reduced operating costs by allowing over 10,000 nodes with 100-200Gbps uplinks per rack and an oversubscription ratio of 1.25:1. This solved problems around switch failures, BUM traffic loads, decommissioning limitations, and scale-out limits they previously faced.

How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...

DataWorks Summit/Hadoop Summit

The NameNode was experiencing high load and instability after being restarted. Graphs showed unknown high load between checkpoints on the NameNode. DataNode logs showed repeated 60000 millisecond timeouts in communication with the NameNode. Thread dumps revealed NameNode server handlers waiting on the same lock, indicating a bottleneck. Source code analysis pointed to repeated block reports from DataNodes to the NameNode as the likely cause of the high load.

Hadoop Summit Tokyo HDP Sandbox Workshop

DataWorks Summit/Hadoop Summit

The document discusses the future of data and modern data applications. It notes that data is growing exponentially and will reach 44 zettabytes by 2020. This growth is driving the need for new data architectures like Apache Hadoop which can handle diverse data types from sources like the internet of things. Hadoop provides distributed storage and processing to enable real-time insights from all available data.

Hadoop Summit Tokyo Apache NiFi Crash Course

DataWorks Summit/Hadoop Summit

This document provides an overview of Apache NiFi and data flow fundamentals. It begins with an introduction to Apache NiFi and outlines the agenda. It then discusses data flow and streaming fundamentals, including challenges in moving data effectively. The document introduces Apache NiFi's architecture and capabilities for addressing these challenges. It also previews a live demo of NiFi and discusses the NiFi community.

Major advancements in Apache Hive towards full support of SQL compliance

DataWorks Summit/Hadoop Summit

Major advancements in Apache Hive towards full support of SQL compliance include: 1) Adding support for SQL2011 keywords and reserved keywords to reduce parser ambiguity issues. 2) Adding support for primary keys and foreign keys to improve query optimization, specifically cardinality estimation for joins. 3) Implementing set operations like INTERSECT and EXCEPT by rewriting them using techniques like grouping, aggregation, and user-defined table functions.

Introduction to Hadoop and Spark (before joining the other talk) and An Overv...

DataWorks Summit/Hadoop Summit

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

DataWorks Summit/Hadoop Summit

This document proposes a container-based sizing framework for Apache Hadoop/Spark clusters that uses a multi-objective genetic algorithm approach. It emulates container execution on different cloud platforms to optimize configuration parameters for minimizing execution time and deployment cost. The framework uses Docker containers with resource constraints to model cluster performance on various public clouds and instance types. Optimization finds Pareto-optimal configurations balancing time and cost across objectives.

Apache Hadoop 3.0 What's new in YARN and MapReduce

DataWorks Summit/Hadoop Summit

This document summarizes a presentation about new features in Apache Hadoop 3.0 related to YARN and MapReduce. It discusses major evolutions like the re-architecture of the YARN Timeline Service (ATS) to address scalability, usability, and reliability limitations. Other evolutions mentioned include improved support for long-running native services in YARN, simplified REST APIs, service discovery via DNS, scheduling enhancements, and making YARN more cloud-friendly with features like dynamic resource configuration and container resizing. The presentation estimates the timeline for Apache Hadoop 3.0 releases with alpha, beta, and general availability targeted throughout 2017.

Data infrastructure architecture for medium size organization: tips for colle...

DataWorks Summit/Hadoop Summit

Viewers also liked (20)

The truth about SQL and Data Warehousing on Hadoop

Comparison of Transactional Libraries for HBase

From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...

SEGA : Growth hacking by Spark ML for Mobile games

The real world use of Big Data to change business

Use case and Live demo : Agile data integration from Legacy system to Hadoop ...

Streamline Hadoop DevOps with Apache Ambari

Case study of DevOps for Hadoop in Recruit.

Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...

Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

#HSTokyo16 Apache Spark Crash Course

Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...

Hadoop Summit Tokyo HDP Sandbox Workshop

Hadoop Summit Tokyo Apache NiFi Crash Course

Major advancements in Apache Hive towards full support of SQL compliance

Introduction to Hadoop and Spark (before joining the other talk) and An Overv...

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

Apache Hadoop 3.0 What's new in YARN and MapReduce

Data infrastructure architecture for medium size organization: tips for colle...

Similar to Rebuilding Web Tracking Infrastructure for Scale

Marketing Automation at Scale: How Marketo Solved Key Data Management Challen...

Continuent

Marketo uses Continuent Tungsten to solve key data management challenges at scale. Tungsten provides high availability, online maintenance, and parallel replication to allow Marketo to process over 600 million MySQL transactions per day across more than 7TB of data without downtime. Tungsten's innovative caching and sharding techniques help replicas keep up with Marketo's high transaction volumes and uneven tenant sizes. The solution has enabled fast failover, rolling maintenance, and scaling to thousands of customers.

Enabling product personalisation using Apache Kafka, Apache Pinot and Trino w...

HostedbyConfluent

Our core banking platform has been built using domain driven design and microservices and whilst this provides many well-known advantages, it also presents some challenges. Data encapsulation results in each application having its own data store and it becomes impossible to query the state of a customer’s relationship in totality to provide the right products. This challenge becomes even harder if we want to personalize products based on aggregate values of a customer’s behavior over potentially large periods of time. In this session, we describe how we overcome this problem to enable dynamic charging and rewards based on customer behavior in a banking scenario. We describe • How we guarantee consistency between our event stream and our OLTP databases using the Outbox pattern. • The design decisions faced when considering the schema designs in Pinot and how we balanced flexibility and latency using Trino • Two patterns for enriching the event stream using Kafka streams and how we dealt with late arriving events and transactions.

Adobe Ask the AEM Community Expert Session Oct 2016

AdobeMarketingCloud

Two large enterprise AEM implementations were presented and compared. Anshul Chhabra from Symantec presented their implementation handling 3.3 billion requests per month. Anil Kalbag from Cisco presented their implementation handling 375 million monthly page views. Both implementations utilized multiple data centers for high availability and disaster recovery. Key architecture decisions around virtual/physical infrastructure, storage, caching, and multi-tenancy were discussed and compared between the two organizations.

Big Kahuna

Ritesh Nayak

The Big Kahuna is a framework that performs massively distributed computation using a grid of stateless transient nodes made up of web browsers, handhelds, and thick clients, making it a truly ubiquitous distributed computing platform without installation or setup overheads. It processes large volumes of highly granular data (<100KB) by leveraging idle time on browsers and local networks. Some key advantages include being zero installation, built on open source technologies, scaling rapidly as more clients are added, and working across devices, browsers, and platforms.

Enabling Telco to Build and Run Modern Applications

Tugdual Grall

This document discusses how MongoDB can help enable businesses to build and run modern applications. It begins with an overview of Tugdual Grall and his background. It then discusses how industries and data have changed, driving the need for a next generation database. The rest of the document provides an overview of MongoDB, including the company, technology, and community. Examples are given of how MongoDB has helped companies in the telecommunications industry achieve a single customer view, improve product catalogs and personalization, and build mobile and open data APIs.

Accelerating a Path to Digital with a Cloud Data Strategy

MongoDB

1) The document discusses accelerating a path to digital transformation with a cloud data strategy. It covers topics like the seismic shifts in organizations and application architectures, and the need to rethink underlying data layers. 2) The presentation discusses building an enterprise data fabric at Royal Bank of Scotland using MongoDB to provide data storage, query, and distribution as a service. This simplified development, reduced costs, and improved velocity. 3) MongoDB was presented as the foundation for cloud data strategies, providing the freedom to run applications anywhere while leveraging the benefits of multiple clouds.

Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop

DataWorks Summit/Hadoop Summit

The document discusses Marketo's migration of their SAAS business analytics platform to Hadoop. It describes their requirements of near real-time processing of 1 billion activities per customer per day at scale. They conducted a technology selection process between various Hadoop components and chose HBase, Kafka and Spark Streaming. The implementation involved building expertise, designing and building their first cluster, implementing security including Kerberos, validation through passive testing, deploying the new system through a migration, and ongoing monitoring, patching and upgrading of the new platform. Challenges included managing expertise retention, Zookeeper performance on VMs, Kerberos integration, and capacity planning for the shared Hadoop cluster.

SUGCON NA 2023 - Crafting Lightning Fast Composable Experiences.pptx

Vasiliy Fomichev

Website performance has become a cornerstone of user experience and brands that incorporate that into all aspects of digital gain a long-term strategic advantage. Join this session to learn about the fascinating realm of how performance optimization can significantly influence your business strategy and user satisfaction. Drawing from real world experience, we will explore the concrete connection between website performance, SEO, loyalty, digital engagement, conversion rates, and revenue. As we trace the evolution of performance, we'll discuss the growing convergence of technology and strategy in the optimization process. Our focus will then shift to the heart of composable architecture, where you will learn invaluable tips on how to implement the experience management tools with personalization and experimentation without sacrificing performance and user experience. We'll provide practical technology recommendations to be aware of as you implement or look to optimize your digital experience using the new Sitecore composable tools. Harness the power of the new modern MarTech to turn performance into a strategic advantage.

Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...

Big Data Spain

Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser...

Kai Wähner

This document provides an overview of streaming analytics and compares different streaming analytics frameworks. It begins with real-world use cases in various industries and then defines what a data stream is. The core components of a streaming analytics processing pipeline are described, including ingestion, preprocessing, and real-time and batch processing. Popular open-source frameworks like Apache Storm and AWS Kinesis are highlighted. The document concludes by noting that both streaming analytics frameworks and products are growing significantly to enable real-time analytics on streaming data.

Deliver Unrivaled End-User Experience With Confidence - How Synthetic Monitor...

DevOps.com

This document discusses synthetic monitoring and how it can help deliver unrivaled experiences. It introduces xSUM, a synthetic monitoring product from xOps that monitors availability and performance of web services from an end-user perspective. The document outlines the benefits of xSUM, how it works, and its architecture which includes integration with the InfluxDB time series database. It concludes with a call for users and the community to get involved.

Understanding the Top Four Use Cases for IoT

VoltDB

Dheeraj Remella, Director of Solutions Architecture at VoltDB explains how to use real-time data to make more informed decisions, decrease event to action latency and limit downtime. He’ll discuss four in-production customer case studies and highlight how implementing VoltDB has dramatically increased competitive advantage for avoiding unplanned downtime, asset tracking, utilities - smart meter management, and fleet management

Using SQL and Salesforce data to build a Product Catalog (or anything) in Con...

ServiceRocket

Using SQL and Salesforce Data to Build a Product Catalog (or Anything) in Con...

ServiceRocket

Curious how you could leverage Atlassian Confluence and a few powerful add-ons to build a dynamic repository within your Confluence wiki? This webinar will walk you through the steps to do so, and even show you how your repository, such as a product catalog, can be integrated with Salesforce.com data. View the full webinar: http://info.servicerocket.com/product-catalog-in-confluence-salesforce-sql-database

Acting on Real-time Behavior: How Peak Games Won Transactions

VoltDB

Webinar Slides: High Volume MySQL HA: SaaS Continuous Operations with Terabyt...

Continuent

Large Number of On-premises Tungsten MySQL Clusters @ Marketo Marketo is a very large marketing automation SaaS provider. Marketo scaled from tens of customers back in 2010 to thousands of enterprise customers today using Tungsten Clustering and several hundreds of MySQL instances. In this webinar, Continuent CEO Eero Teerikorpi discusses some common challenges SaaS providers face, such as having to provide 24/7/365 operations with zero downtime, even during maintenance operations. In addition, SaaS providers need to have an easy, consistent, and cost-effective model to scale. Watch this webinar replay to learn how to guarantee continuous operations for a SaaS provider with billions of daily transactions and terabytes of data using Tungsten MySQL Clusters. AGENDA - Continuent Introduction - How to Guarantee Continuous Operations for a SaaS with Terabytes Data with Tungsten MySQL Clusters - Continuent Tungsten Solutions & Benefits - Key Benefit Highlight: Billions of MySQL Transactions, Very Large Data Volume - Q&A PRESENTER Eero Teerikorpi - founder and CEO, Continuent - is a 7-time serial entrepreneur who has more than 30 years of high-tech management and enterprise software experience. Eero has been in the MySQL marketplace virtually since day one, from the early 2000s. Eero has held top management positions at various cross-Atlantic entities (CEO at Alcom Corporation, President at Capslock, Executive Board Member at Esker S.A.) Eero started his career as a Product Manager at Apple Computer in Finland in the mid-80s. Eero also owns and manages a boutique NOET Vineyards producing high-quality dry-farmed Cabernet Sauvignon. Eero is a former Navy officer and still an avid sailor on San Francisco Bay and around the world. Eero is a very active sportsman: a 4+ tennis player, a rookie golfer, a very careful mountain biker, and an experienced (40+ years) skier, both slalom and cross-country.

JAMStack

Samundra khatri

The document provides an overview of JAMStack, a new approach to building web applications that uses JavaScript, APIs, and markup. It defines JAMStack as using JavaScript in the browser as a runtime, reusable HTTP APIs instead of app-specific databases, and prebuilt markup for delivery. It discusses different types of JAMStack projects including static HTML sites, sites with content from a CMS, web applications, and large websites. It also outlines advantages like improved performance, security, and scalability, as well as considerations for planning a JAMStack project such as managing content, choosing a site generator, automation, and CDNs.

Digital Transformation in Market Data and Trading Platforms

Solace

Digital transformation requires adopting a new approach to doing business that is always sensing and ready to change. The document discusses how traditional market data and trading platforms have inefficient spaghetti connections that are difficult to manage and perform poorly. It then proposes adopting a "market data mesh" architecture using an event mesh to connect systems simply and at scale across hybrid cloud environments. This would create an open and global market data platform. The remainder of the document discusses how Solace technology could be used to build an innovative distributed trading platform for the China Foreign Exchange Trade System to modernize China's financial system and interbank markets.

The role of NoSQL in the Next Generation of Financial Informatics

Aerospike, Inc.

Accelerating a Path to Digital With a Cloud Data Strategy

MongoDB

The document describes a conference on accelerating a path to digital transformation with a cloud data strategy. It provides an agenda for the conference including speakers on executing a cloud data strategy, customer stories from De Persgroep and Toyota Motor Europe, and a session on landing in the cloud with MongoDB Atlas. The document also provides background on the speakers and their companies.

Similar to Rebuilding Web Tracking Infrastructure for Scale (20)

Marketing Automation at Scale: How Marketo Solved Key Data Management Challen...

Enabling product personalisation using Apache Kafka, Apache Pinot and Trino w...

Adobe Ask the AEM Community Expert Session Oct 2016

Big Kahuna

Enabling Telco to Build and Run Modern Applications

Accelerating a Path to Digital with a Cloud Data Strategy

Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop

SUGCON NA 2023 - Crafting Lightning Fast Composable Experiences.pptx

Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...

Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser...

Deliver Unrivaled End-User Experience With Confidence - How Synthetic Monitor...

Understanding the Top Four Use Cases for IoT

Using SQL and Salesforce data to build a Product Catalog (or anything) in Con...

Using SQL and Salesforce Data to Build a Product Catalog (or Anything) in Con...

Acting on Real-time Behavior: How Peak Games Won Transactions

Webinar Slides: High Volume MySQL HA: SaaS Continuous Operations with Terabyt...

JAMStack

Digital Transformation in Market Data and Trading Platforms

The role of NoSQL in the Next Generation of Financial Informatics

Accelerating a Path to Digital With a Cloud Data Strategy

More from DataWorks Summit/Hadoop Summit

Running Apache Spark & Apache Zeppelin in Production

DataWorks Summit/Hadoop Summit

This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.

State of Security: Apache Spark & Apache Zeppelin

DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger

DataWorks Summit/Hadoop Summit

The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include: - The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies. - Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer. - Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance. - An initial minimum viable product will be built this year with the goal of an open, collaborative ecosystem around shared

Enabling Digital Diagnostics with a Data Science Platform

DataWorks Summit/Hadoop Summit

This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.

Revolutionize Text Mining with Spark and Zeppelin

DataWorks Summit/Hadoop Summit

This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.

Double Your Hadoop Performance with Hortonworks SmartSense

DataWorks Summit/Hadoop Summit

This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.

Hadoop Crash Course

DataWorks Summit/Hadoop Summit

The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.

Data Science Crash Course

DataWorks Summit/Hadoop Summit

This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.

Apache Spark Crash Course

DataWorks Summit/Hadoop Summit

This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.

Dataflow with Apache NiFi

DataWorks Summit/Hadoop Summit

This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.

Schema Registry - Set you Data Free

DataWorks Summit/Hadoop Summit

Many Organizations are currently processing various types of data and in different formats. Most often this data will be in free form, As the consumers of this data growing it’s imperative that this free-flowing data needs to adhere to a schema. It will help data consumers to have an expectation of about the type of data they are getting and also they will be able to avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the Data Pipeline a really easy way to integrate and support various systems that use different data formats. SchemaRegistry is a central repository for storing, evolving schemas. It provides an API & tooling to help developers and users to register a schema and consume that schema without having any impact if the schema changed. Users can tag different schemas and versions, register for notifications of schema changes with versions etc. In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.

Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...

DataWorks Summit/Hadoop Summit

There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time. The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.

Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...

DataWorks Summit/Hadoop Summit

DeepLearning is not just a hype - it outperforms state-of-the-art ML algorithms. One by one. In this talk we will show how DeepLearning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different BigData engines like ApacheSpark and ApacheFlink. Key in this talk is the absence of any large training corpus since we are using unsupervised machine learning - a domain current DL research threats step-motherly. As we can see in this demo LSTM networks can learn very complex system behavior - in this case data coming from a physical model simulating bearing vibration data. Once draw back of DeepLearning is that normally a very large labaled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with DeepLearning - no labeled data set is necessary. We are able to detect anomalies and predict braking bearings with 10 fold confidence. All examples and all code will be made publicly available and open sources. Only open source components are used.

Mool - Automated Log Analysis using Data Science and ML

DataWorks Summit/Hadoop Summit

QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means, that QE Automations scenarios need to be detailed around actual use cases, cross-cutting components. The system tests potentially generate large amounts of data on a recurring basis, verifying which is a tedious job. Given the multiple levels of indirection, the false positives of actual defects are higher, and are generally wasteful. At Hortonworks, we’ve designed and implemented Automated Log Analysis System - Mool, using Statistical Data Science and ML. Currently the work in progress has a batch data pipeline with a following ensemble ML pipeline which feeds into the recommendation engine. The system identifies the root cause of test failures, by correlating the failing test cases, with current and historical error records, to identify root cause of errors across multiple components. The system works in unsupervised mode with no perfect model/stable builds/source-code version to refer to. In addition the system provides limited recommendations to file/open past tickets and compares run-profiles with past runs.

How Hadoop Makes the Natixis Pack More Efficient

DataWorks Summit/Hadoop Summit

Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together. This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear: • How and why the business and IT requirements originated • How we leverage the platform to fulfill security and production requirements • How we organize a community to: o Guard all the players, no one gets left on the ground! o Us the platform appropriately (Not every problem is eligible for Big Data and standard databases are not dead) • What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match! DETAILS This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.

HBase in Practice

DataWorks Summit/Hadoop Summit

HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.

The Challenge of Driving Business Value from the Analytics of Things (AOT)

DataWorks Summit/Hadoop Summit

There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases. In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.

Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop

DataWorks Summit/Hadoop Summit

In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotiy's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HSFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, discussing some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open-source and supports a pluggable database backend for distributed metadata, although it currently only support MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.

From Regulatory Process Verification to Predictive Maintenance and Beyond wit...

DataWorks Summit/Hadoop Summit

In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs. Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.

Backup and Disaster Recovery in Hadoop

DataWorks Summit/Hadoop Summit

While you could be tempted assuming data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like: "What happens if the entire datacenter fails?, or "How do I recover into a consistent state of data, so that applications can continue to run?" are not a all trivial to answer for Hadoop. Did you know that HDFS snapshots are handling open files not as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned are not (yet) important. This talk first is introducing you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, to finally show you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee a continuous operation of Hadoop cluster based solutions.

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production

State of Security: Apache Spark & Apache Zeppelin

Unleashing the Power of Apache Atlas with Apache Ranger

Enabling Digital Diagnostics with a Data Science Platform

Revolutionize Text Mining with Spark and Zeppelin

Double Your Hadoop Performance with Hortonworks SmartSense

Hadoop Crash Course

Data Science Crash Course

Apache Spark Crash Course

Dataflow with Apache NiFi

Schema Registry - Set you Data Free

Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...

Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...

Mool - Automated Log Analysis using Data Science and ML

How Hadoop Makes the Natixis Pack More Efficient

HBase in Practice

The Challenge of Driving Business Value from the Analytics of Things (AOT)

Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop

From Regulatory Process Verification to Predictive Maintenance and Beyond wit...

Backup and Disaster Recovery in Hadoop

Recently uploaded

DealBook of Ukraine: 2024 edition

Yevgen Sysoyev

The Increasing Use of the National Research Platform by the CSU Campuses

Larry Smarr

TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In

TrustArc

Six months into 2024, and it is clear the privacy ecosystem takes no days off!! Regulators continue to implement and enforce new regulations, businesses strive to meet requirements, and technology advances like AI have privacy professionals scratching their heads about managing risk. What can we learn about the first six months of data privacy trends and events in 2024? How should this inform your privacy program management for the rest of the year? Join TrustArc, Goodwin, and Snyk privacy experts as they discuss the changes we’ve seen in the first half of 2024 and gain insight into the concrete, actionable steps you can take to up-level your privacy program in the second half of the year. This webinar will review: - Key changes to privacy regulations in 2024 - Key themes in privacy and data governance in 2024 - How to maximize your privacy program in the second half of 2024

What’s New in Teams Calling, Meetings and Devices May 2024

Stephanie Beckett

Best Practices for Effectively Running dbt in Airflow.pdf

Tatiana Al-Chueyr

As a popular open-source library for analytics engineering, dbt is often used in combination with Airflow. Orchestrating and executing dbt models as DAGs ensures an additional layer of control over tasks, observability, and provides a reliable, scalable environment to run dbt models. This webinar will cover a step-by-step guide to Cosmos, an open source package from Astronomer that helps you easily run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through: - Standard ways of running dbt (and when to utilize other methods) - How Cosmos can be used to run and visualize your dbt projects in Airflow - Common challenges and how to address them, including performance, dependency conflicts, and more - How running dbt projects in Airflow helps with cost optimization Webinar given on 9 July 2024

Measuring the Impact of Network Latency at Twitter

ScyllaDB

How RPA Help in the Transportation and Logistics Industry.pptx

SynapseIndia

Implementations of Fused Deposition Modeling in real world

Emerging Tech

The presentation showcases the diverse real-world applications of Fused Deposition Modeling (FDM) across multiple industries: 1. **Manufacturing**: FDM is utilized in manufacturing for rapid prototyping, creating custom tools and fixtures, and producing functional end-use parts. Companies leverage its cost-effectiveness and flexibility to streamline production processes. 2. **Medical**: In the medical field, FDM is used to create patient-specific anatomical models, surgical guides, and prosthetics. Its ability to produce precise and biocompatible parts supports advancements in personalized healthcare solutions. 3. **Education**: FDM plays a crucial role in education by enabling students to learn about design and engineering through hands-on 3D printing projects. It promotes innovation and practical skill development in STEM disciplines. 4. **Science**: Researchers use FDM to prototype equipment for scientific experiments, build custom laboratory tools, and create models for visualization and testing purposes. It facilitates rapid iteration and customization in scientific endeavors. 5. **Automotive**: Automotive manufacturers employ FDM for prototyping vehicle components, tooling for assembly lines, and customized parts. It speeds up the design validation process and enhances efficiency in automotive engineering. 6. **Consumer Electronics**: FDM is utilized in consumer electronics for designing and prototyping product enclosures, casings, and internal components. It enables rapid iteration and customization to meet evolving consumer demands. 7. **Robotics**: Robotics engineers leverage FDM to prototype robot parts, create lightweight and durable components, and customize robot designs for specific applications. It supports innovation and optimization in robotic systems. 8. **Aerospace**: In aerospace, FDM is used to manufacture lightweight parts, complex geometries, and prototypes of aircraft components. It contributes to cost reduction, faster production cycles, and weight savings in aerospace engineering. 9. **Architecture**: Architects utilize FDM for creating detailed architectural models, prototypes of building components, and intricate designs. It aids in visualizing concepts, testing structural integrity, and communicating design ideas effectively. Each industry example demonstrates how FDM enhances innovation, accelerates product development, and addresses specific challenges through advanced manufacturing capabilities.

Comparison Table of DiskWarrior Alternatives.pdf

Andrey Yasko

BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL

Liveplex

Details of description part II: Describing images in practice - Tech Forum 2024

BookNet Canada

This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator. Link to presentation recording and transcript: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/ Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.

Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy

RaminGhanbari2

Best Programming Language for Civil Engineers

Awais Yaseen

The integration of programming into civil engineering is transforming the industry. We can design complex infrastructure projects and analyse large datasets. Imagine revolutionizing the way we build our cities and infrastructure, all by the power of coding. Programming skills are no longer just a bonus—they’re a game changer in this era. Technology is revolutionizing civil engineering by integrating advanced tools and techniques. Programming allows for the automation of repetitive tasks, enhancing the accuracy of designs, simulations, and analyses. With the advent of artificial intelligence and machine learning, engineers can now predict structural behaviors under various conditions, optimize material usage, and improve project planning.

Calgary MuleSoft Meetup APM and IDP .pptx

ishalveerrandhawa1

Quantum Communications Q&A with Gemini LLM

Vijayananda Mohire

論文紹介：A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...

Toru Tamaki

How to Build a Profitable IoT Product.pptx

Adam Dunkels

INDIAN AIR FORCE FIGHTER PLANES LIST.pdf

jackson110191

Observability For You and Me with OpenTelemetry

Eric D. Schabell

Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data. The project is called openTelemetry, but before diving into the specifics, we’ll start with de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, percentile to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we’ll explore the openTelemetry community; its Special Interest Groups (SIGs), repositories, and how to become not only an end-user, but possibly a contributor.We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs. Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of openTelemetry, and know how to take their first steps to an open-source contribution! Key Takeaways: Open source, vendor neutral instrumentation is an exciting new reality as the industry standardizes on openTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve and in order to achieve ubiquity, the project would benefit from growing our contributor community.

Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...

Bert Blevins

Today’s digitally connected world presents a wide range of security challenges for enterprises. Insider security threats are particularly noteworthy because they have the potential to cause significant harm. Unlike external threats, insider risks originate from within the company, making them more subtle and challenging to identify. This blog aims to provide a comprehensive understanding of insider security threats, including their types, examples, effects, and mitigation techniques.

Recently uploaded (20)

DealBook of Ukraine: 2024 edition

The Increasing Use of the National Research Platform by the CSU Campuses

TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In

What’s New in Teams Calling, Meetings and Devices May 2024

Best Practices for Effectively Running dbt in Airflow.pdf

Measuring the Impact of Network Latency at Twitter

How RPA Help in the Transportation and Logistics Industry.pptx

Implementations of Fused Deposition Modeling in real world

Comparison Table of DiskWarrior Alternatives.pdf

BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL

Details of description part II: Describing images in practice - Tech Forum 2024

Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy

Best Programming Language for Civil Engineers

Calgary MuleSoft Meetup APM and IDP .pptx

Quantum Communications Q&A with Gemini LLM

論文紹介：A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...

How to Build a Profitable IoT Product.pptx

INDIAN AIR FORCE FIGHTER PLANES LIST.pdf

Observability For You and Me with OpenTelemetry

Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...

Rebuilding Web Tracking Infrastructure for Scale

1. Rebuilding Web Tracking Infrastructure for Scale Stephen Oakley Principal Engineer Marketo

2. What is Marketo?

3. Page 3 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016 What is Web Tracking at Marketo? • Ingest web page visits and clicks on customer’s website • Trigger campaigns in response to web activity • Trigger real-time personalization of web experience • Provide lead level analytics for known leads • Provide aggregate analytics for all lead activity • Typically known leads < 10 % of all traffic

6. Page 6 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016 Legacy Problems • Throughput limitations – 2 million activities per day • Processing delays can be on the order of hours • Large customers cause web server brownouts • Web reporting does not scale • Fixed-sized clusters prohibit horizontal scaling • Brittle infrastructure prevents feature development

7. The Vision

8. Page 8 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016 Orion Initiative • Increase scale to support IoT for Marketers • Support billions of marketing activities each day • Trigger on activities in near real time (< 2 minute @ 99th %) • Reduce operational costs • Improve multitenancy and QoS

9. Requirements

10. Page 10 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016 Business Requirements • 200 MM activities per customer per day • Near real-time web activity processing (SLA of < 1 minute lag) • Improve cost efficiency • Improve flexibility for feature enhancements

11. Page 11 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016 Technical Requirements • Multitenancy support with brownout protections • Infrastructure must scale horizontally • Decouple web processing from downstream processing • Anonymous leads should cost next to nothing to track

12. Architecture & Design

15. Page 15 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016 Why Hbase + Phoenix? • Horizontally scalable • Leverages the Hadoop cluster for storage and scaling • Provides secondary indices for query patterns through Phoenix • Natural integration with JDBC and Spark JDBC RDDs

17. Page 17 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016 Marketo Lambda Architecture Spark Streaming Consumers Campaign Triggers Solr Indexing Solr Spark Streaming Indexer Ingestion Processor Scala/Tomcat HBase Kafka CRM Sync Partner APIs Other Marketing Activities Web Activity RTP Activity Mobile Activity Marketo UI Campaign Detail Lead Detail Other Clients CRM Sync Revenue Cycle Analylitcs APIs Email Report Loader Web Activity Processor

18. Page 18 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016 Why Spark Streaming? • Micro-batching provides sink-side efficiencies • This is especially important with MySQL touchpoints • Great integration with Kafka • No strict real-time processing requirements • Great community and industry adoption

19. Page 19 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016 Multitenancy • One topic per customer (sized by volume) • Traffic storms are isolated to a single customer • Fairness/throttling is easy to control • Spark Streaming job consumes from many topics • Allows us to turn a customer off under error conditions • See “Elastic Streaming” by Neelesh Shastry – Spark Summit

20. Page 20 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016 Making Spark Streaming Performant • Coalesce small partitions for the same customer • Aggressive caching of metadata (mostly from MySQL) • Heavily leverage Scala future composition for parallelism • Persist RDDs that are used for multiple outputs • e.g. write to Kafka and Activity Service

21. Page 21 Marketo Proprietary and Confidential | © Marketo, Inc. 10/31/2016 Making Anonymous Traffic Cheap • High costs of web traffic in legacy system • MySQL storage for all traffic • Down streaming processing of all events (even anonymous) • V2 only processes and stores known traffic in MySQL • Defer triggering for anonymous data until promotion

22. • Rolled out to our highest volume customers • Processing latencies < 30s (at 99.9th %) • Allowed key customers to scale from ~2MM/day to > 20 MM/day Impact and Results

23. • Mitigations of straggler effects on processing delays • Adding sessionization for web reporting • Scaling Kafka topics as customers increase volume • Globally distributed ingestion for a single customer Future Work

24. We’re Hiring! Http://Marketo.Jobs Q & A

Editor's Notes

Next phase was when we were ready to validate our newly built event ingestion system Marketo is a powerful Engagement Marketing Platform. There are several applications that make up the platform, such as ABM, Marketing analytics, predictive content, Digital Ads, and Marketing Automation. Marketing automation is what we are focusing on today. Marketing Automation enables the marketer to create, automate and measure marketing campaigns across channels. A simple example of an automated campaign or workflow is User visits your website and fills out a form Web tracking sees that they spent most their time looking at pages about spark streaming Automatically Send an email to the user to Invite them to a webinar on spark streaming services If they attend the webinar, register their interests in your crm and request a sales person contacts the user The campaigns can be complex and can reach out and track customers across channels like web, email, mobile, social
Explain what a known vs anonymous lead is Known is targetable on other channels, anonymous is only web activity Speak to how the traffic patterns are heavily skewed toward anonymous given our customer base Talk about how anonymous converts to known. Aggregate analytics include company web report, landing page reports, etc.
Speak to the pod Mention how there are many many pods
An additional complication is the fact that the same two webservers also serve the mlm app, soap apis, and the landing pages
Although the talk isn’t about the project… we have a few slides up front to set the context around what we are working on If you have been near technology at all in the last couple of years you know that the world has become very connected. The number of connected devices blows my mind. It’s not just phones anymore… Amazon dash buttons, coffee makers, propane tanks, garage doors. These devices are sending 10’s of billions of activities and user interactions every day... Orion is our platfor Our marketing platform ingests the user interactions process them into relevant marketing touchpoints Its enables marketers to create marketing campaigns around these activities to build relationships with their customers Become the fabric for marketers Its been a great experience building this
Here are a few of the requirements Near real time processing At least a 1 billion activities per customer per day. customer demands from increasing devices caused us to evaluate next get queueing and streaming... reduction in infrastructure COGS primarily from expensive enterprise class filers... reduction in people COGS by gained efficiency from reducing tech stack from using too many similar technologies ... Multitenant… of course Secure Customer isolation and improved resource management
Arch requirement driven from biz requirement Improve utilization over the existing system Lots of customers in same infra, without starving Encryption from day 1 for safe data storage Aim for horz scalability Coming from standard 3 tier app Radically reduce processing latency Eliminate backlogs Brownout protection
A few words about the architecture Main goal is to inject, process and store marketing events
Details overview of Munchkin FE component Spray.io for MFE Frontend has the simple job of verifying subscription status, collecting metrics and persisting to kafka Use Avro to allow for schema evolution, strong typing and compact representation in topic Use Schema registry to allow the schema to be upgraded by the producer and them automatically picked up by the spark streaming component Use asynchronous API for kafka to allow high throughput.
Details overview of LeadService component Spray.io for leadservice Hbase for Cookie and anonymous lead storage Salted table Key structure is subscription-cookie-leadid Secondary index for subscription-lead-createdat MySQL for known lead storage Masterdata for reverse ip information enrichments
Overall view for the system Describe how there is a Kafka topic per subscription Spark streaming transforms the raw events into activities by Enriching with web page metadata from MySQL Lead and reverse IP enrichment from LeadService Persist activities to AS for storage and secondary processing (e.g. triggering and solr indexing) Push enriched web events to Kafka for the downstream Druid OLAP infrastructure.
High level diagram of our event processor Enhanced Lambda Architecture Inbound activities written to Ingestion Processor Hbase and then Kafka High volume (e.g. web) activities First written to Kafka, then enriched Spark Streaming applications consume events from Kafka Solr Indexing Email Reports Campaign Processing HBase is used for simple historical queries, and is system of record
While it is not “true” streaming, we exactly need this as an optimization
Our multitenant Kafka framework coalesces small kafka paritions into large spark rdd partitions to improve batch utilization Several components of the event enrichment requires outbound RPC calls, using async clients and performing the calls in parallel and then composing the futures pipelines the computation and significantly improves throughput. Caching web assets and cookies for temporal locality Cache is > 60% of the executor memory Enriched events are written out to multiple sources and be selective about persisting RDDS prevents recomputing expensive transformations (multiple RPC calls or MySQL queries)
Traditionally both anonymous and known data was treated equally in MLM. This is problematic because Anonymous volumes are usually 10-20x higher than known. Additionally there is very little intrinsic value in performing downstream processing on anonymous data since you cannot target anonymous leads for Campaigns. To improve this, in Munchkin V2 we only allow known traffic to flow to downstream processing. Anonymous data is passed for downstream processing when the lead converts to a known lead Via form fillout, api calls, etc.
Reiterates my points on the last slide. I included in case you wanted to look at the slides later
Give a quick overview of the activities architecture. Introduce Kafka in the presentation
Spend more time on this – purple is our code , teal is spark standard # SubscriptionRegistry is using ZK # OffsetManager is a library, uses low level kafka consumer API # Provisioning framework – Sirius, a new subscription provisioned to registry via oozie

Rebuilding Web Tracking Infrastructure for Scale

Related slideshows

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Rebuilding Web Tracking Infrastructure for Scale

Similar to Rebuilding Web Tracking Infrastructure for Scale (20)

More from DataWorks Summit/Hadoop Summit

More from DataWorks Summit/Hadoop Summit (20)

Recently uploaded

Recently uploaded (20)

Rebuilding Web Tracking Infrastructure for Scale

Editor's Notes