Patterns of the
Lambda Architecture
Truth and Lies at the Edge of Scale
Flip Kromer — CSC
I’m Flip Kromer, Distinguished Engineer at CSC. If you are a large enterprise company
looking to add Big Data capabilities — especially one involving legacy systems —
we’re a big, stable company that specializes in turning technology into an enterprise-
grade solution.
Pattern Set
This talk will equip you with two things.
One is a set of patterns for how we design high-scale architectures to solve specific solution
cases, now that extra infrastructure is nearly free
Tradeoff Rules
PICK ANY TWO
Along with a set of tradeoff rules along the lines of the pick-any-two trinity but more
sophisticated
Lambda Architecture
So what is the Lambda Architecture? Here’s two examples.

Search w/ Update
[diagram: A Ton of Text → Build Indexes → Historical Index; More Text → Live Indexer → Recent Index; both indexes feed the API]
In this system, we have a whole ton of historical text, with more arriving all the time,
and want to allow immediate real-time search across the whole corpus.
Search w/ Update: Build Main Index
[the same diagram, highlighting the batch Build Indexes → Historical (Main) Index step]
We will use a large periodic batch job to create indexes on the historical data. This
takes a while — far longer than our recency demands allow — so we might as well
have our elephants use clever algorithms and optimally organize the data for rapid
retrieval.
Search w/ Update: Update Recent Index
[the same diagram, highlighting the More Text → Live Indexer → Recent Index path]
Until the next stampede arrives with an updated index, as each new record arrives
we will not only file it with the historical data but also use simple fast indexing to
make it immediately searchable. Merging new records directly would require stuffing
them into the right place in the historical index, which eventually means moving
records around, which demands far too much time and complexity to be workable.
Search w/ Update: Serve Result
[the same diagram, highlighting the API pulling from both indexes]
The system to serve the data just pulls from both indexes in immediate time
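In miniature, the speed and serving layers look like this (a sketch; all names are illustrative, not any real search engine's API):

```python
# Minimal sketch of the speed and serving layers for real-time search.

def live_index(recent_index, doc_id, text):
    """Speed layer: append-only posting lists -- simple fast indexing,
    with no costly reorganization of the big historical index."""
    for term in set(text.lower().split()):
        recent_index.setdefault(term, []).append(doc_id)

def search(term, historical_index, recent_index):
    """Serving layer: pull matches from both indexes in immediate time.
    The next batch rebuild folds recent documents into the historical index."""
    return historical_index.get(term, []) + recent_index.get(term, [])

historical = {"lambda": ["doc1", "doc7"]}   # built by the periodic batch job
recent = {}
live_index(recent, "doc9", "the Lambda architecture")
print(search("lambda", historical, recent))  # ['doc1', 'doc7', 'doc9']
```

Each new record becomes searchable the moment the speed layer files it; the batch job periodically replaces the historical index and the recent index starts over.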

Lambda Architecture
[the same search diagram, annotated with the three layers: Batch, Speed, Serving]
We have a batch layer for the global corpus;
A speed layer for recent results;
and a serving layer for access
Lambda Architecture
[the same search diagram, annotated by what each layer delivers: Global (batch), Relevant (speed), Immediate (serving)]
The batch layer delivers global truth; the speed layer, relevant recent results; the serving layer, immediate access.
Recommender
[diagram: Visitor History feeds Train Recomm'der, producing a History→Alsobuy model; Update Recommendation applies it per visitor to maintain a Visitor:Product table; the Webserver fetches recommendations and updates visitor history]
Another familiar architecture is a high-scale recommender system — “Given that the
user has looked at mod-style dresses and mason jars, show them these knitting
needles”. This diagram shows a recommender, but most machine-learning systems
look like this.
Recommender: Build Model
[the same recommender diagram, highlighting the batch training step]
You have one system process all the examples you’ve ever seen to produce a
predictive model. The trained model it produces can then react immediately to all
future examples as they occur.

Recommender: Applies Model
[the same diagram, highlighting the step that applies the trained model and stores the recommendation]
The trained model it produces can then react immediately to all future examples as
they occur. In this system we’re going to have one system apply the model and
store the recommendation.
Your operations team is better off with two systems that can fail without breaking
the site than with the apply-model step coupled to serving pages.
Recommender: Serves Result
[the same diagram, highlighting the Webserver serving the precomputed recommendation]
So that the web layer can just serve the result without being contaminated by the
recommender system’s code.
Recommender: Lambda layers
[the same diagram, annotated with the three layers: Batch, Speed, Serving]
Again, the same three pieces
Lambda Arch Layers
• Batch layer: Deep Global Truth (throughput)
• Speed layer: Relevant Local Truth (throughput)
• Serving layer: Rapid Retrieval (latency)
The batch and speed layers care about throughput; the serving layer cares about latency.

Lambda Arch: Technology
• Batch layer: Hadoop, Spark, batch DB reports
• Speed layer: Storm+Trident, Spark Streaming, Samza, AMQP, …
• Serving layer: Web APIs, static assets, RPC, …
Lambda Architecture
[diagram: the Batch / Speed / Serving flow, overlaid with λ glyphs]
Where does the name lambda come from? In my head it’s because the flow diagram…
Lambda Architecture
[the same diagram: the forked batch/speed flow]
…looks like the shape of the character for lambda.
Lambda Architecture
λ(v)
• Pure Function on immutable data
But really it means this new mindset of building pure functions (lambdas) on immutable
data.

Ideal Data System
• Capacity -- Can process arbitrarily large amounts of data
• Affordability -- Cheap to run
• Simplicity -- Easy to build, maintain, debug
• Resilience -- Jobs/Processes fail&restart gracefully
• Responsiveness -- Low latency for delivering results
• Justification -- Incorporates all relevant data into result
• Comprehensive -- Answer questions about any subject
• Recency -- Promptly incorporates changes in world
• Accuracy -- Few approximations or avoidable errors
The laziest, and therefore best, knobs are the Capacity/Affordability ones. The pre-big-data
era can be thought of as one where only those two exist. Big Data broke the
handle off the Capacity knob, either because Affordability ramps too fast or because
the speed of light starts threatening resilience, responsiveness or recency.
* _Comprehensive_: complete; including all or nearly all elements or aspects of
something
* _Concise_: giving a lot of information clearly and in a few words; brief but
comprehensive
Ideal Data System
[the same nine-property list]
You would think that what mattered was correctness — justified true belief
Ideal Data System
[the same nine-property list]
When you look at what we actually do, the non-negotiables are that it be manageable
and economical, given that you must process arbitrarily large amounts of data.
Truth is a nice-to-have.

Tradeoff Rules
PICK ANY TWO
Set of tradeoff rules along the lines of the pick-any-two trinity but more sophisticated
At Scale
THIS AND THIS, AND TRY TO BE GOOD
Basically, given big data you have to accommodate any amount of data and produce
static reports or queries that execute within the duration of human patience — so
you must be fast and cheap, sacrificing good.
Patterns
Recommender
[the same recommender diagram: batch training, model application, webserver serving]
The world you’re modeling changes — new sets of products are released, new and
varied customers sign up, changes to the site drive new behavior — but it changes
slowly. So it’s no big deal if the training stage is only run once a week over several
hours.
The first example follows a pretty familiar general form I’ll call “Train / React”. You
have one system process all the examples you’ve ever seen to produce a predictive
model. The trained model it produces can then react immediately to all future
examples as they occur.

Pattern: Train / React
• Model of the world lets you make immediate decisions
• World changes slowly, so we can re-build model at leisure
• Relax: Recency
• Batch layer: Train a machine learning model
• Speed layer: Apply that model
• Examples: most Machine Learning thingies
(Recommender)
Big fat job that only needs to run occasionally; results of the job inform what happens
immediately
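A toy sketch of that shape, with a co-occurrence “alsobuy” model standing in for a real recommender (all names are hypothetical, not from the talk's actual system):

```python
from collections import Counter, defaultdict
from itertools import combinations

def train_alsobuy(purchase_histories):
    """Batch layer: a big, occasional job over every basket ever seen,
    counting which products co-occur in the same purchase history."""
    model = defaultdict(Counter)
    for basket in purchase_histories:
        for a, b in combinations(sorted(set(basket)), 2):
            model[a][b] += 1
            model[b][a] += 1
    return model

def recommend(model, product, k=3):
    """Speed layer: applying the model is an immediate lookup."""
    return [p for p, _ in model[product].most_common(k)]

model = train_alsobuy([
    ["mod dress", "mason jar"],
    ["mod dress", "mason jar", "knitting needles"],
    ["mason jar", "knitting needles"],
])
print(recommend(model, "mod dress", k=1))  # ['mason jar']
```

Retraining weekly is fine because the co-occurrence counts drift slowly, while every page view gets an answer in constant time.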
Search w/ Update
[the same search diagram: batch index build plus live indexer]
Pattern: Baseline / Delta
• Understanding the world takes a long time
• World changes much faster than that, and you care
• Relax: Simplicity, Accuracy
• Batch layer: Process the entire world
• Speed layer: Handle any changes since last big run
• Examples: Real-time search index; Count Distinct; other approximate stream algorithms
In Train / React, the world changes, but slowly; training in batch mode is just fine.
In Baseline / Delta, the world changes so quickly that you can’t run the batch job fast enough.
So you are sacrificing simplicity — there are two systems where there was only one —
and accuracy — the recent records won’t update global normalized frequencies.
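Count Distinct shows why accuracy gets relaxed. A sketch (illustrative code; production systems usually reach for HyperLogLog): the batch layer computes the exact baseline plus a compact Bloom-filter summary, and the speed layer counts only recent items the summary hasn't seen. The filter's false positives are precisely the accuracy we agreed to give up.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: a compact summary of the baseline's distinct items."""
    def __init__(self, bits=1 << 16, hashes=4):
        self.m, self.k = bits, hashes
        self.bitarr = bytearray(bits // 8)

    def _slots(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for s in self._slots(item):
            self.bitarr[s // 8] |= 1 << (s % 8)

    def __contains__(self, item):
        return all(self.bitarr[s // 8] & (1 << (s % 8)) for s in self._slots(item))

def batch_layer(historical):
    """Exact distinct count over the whole corpus, plus the summary."""
    distinct = set(historical)
    summary = Bloom()
    for item in distinct:
        summary.add(item)
    return len(distinct), summary

def speed_estimate(baseline_count, summary, recent):
    """Count recent items not (probably) already in the baseline."""
    new = {item for item in recent if item not in summary}
    return baseline_count + len(new)

count, summary = batch_layer(["a", "b", "c", "a"])
print(speed_estimate(count, summary, ["c", "d", "e"]))  # ≈ 5
```

The next batch run recomputes the exact count, so the speed layer's small error never accumulates.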
Pagerank (Lazy Propagation)
[diagram: Friend Relations → Converge Pagerank → User Pagerank; to answer a request about Bob, retrieve Bob’s Facebook network and his friends’ pageranks, estimate Bob’s pagerank — but don’t bother updating Bob’s friends (or friends of friends, or …) → API]

Recommended for you

Architectural Patterns for Scaling Microservices and APIs - GlueCon 2015
Architectural Patterns for Scaling Microservices and APIs - GlueCon 2015Architectural Patterns for Scaling Microservices and APIs - GlueCon 2015
Architectural Patterns for Scaling Microservices and APIs - GlueCon 2015

This document discusses architectural patterns for scaling microservices and APIs. It introduces the "Scale Cube" model of scaling along the X, Y, and Z axes. X-axis scaling uses techniques like clustering and cloning. Y-axis scaling uses routing based on identifiable variables. Z-axis scaling uses sharding. The document explains that APIs can be designed as facades or endpoints, and the most common patterns involve combining Y-axis routing with X-axis scaled service clusters in a 2-tier structure. It warns that choosing the wrong scaling thresholds or scaling too late can lead to timeouts and performance issues, and stresses the importance of monitoring services.

apiscalabilitymicroservices
ApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr IntegrationApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr Integration

Apache Solr has been adopted by all major Hadoop platform vendors because of its ability to scale horizontally to meet even the most demanding big data search problems. Apache Spark has emerged as the leading platform for real-time big data analytics and machine learning. In this presentation, Timothy Potter presents several common use cases for integrating Solr and Spark. Specifically, Tim covers how to populate Solr from a Spark streaming job as well as how to expose the results of any Solr query as an RDD. The Solr RDD makes efficient use of deep paging cursors and SolrCloud sharding to maximize parallel computation in Spark. After covering basic use cases, Tim digs a little deeper to show how to use MLLib to enrich documents before indexing in Solr, such as sentiment analysis (logistic regression), language detection, and topic modeling (LDA), and document classification.

solrsolrcloudspark
Microservices: next-steps
Microservices: next-stepsMicroservices: next-steps
Microservices: next-steps

Managing microservices at scale is all about the tooling: smart deployment, testing in production, label based routing, critical path verification.

microservicessmart deploymentarchitecture
Pagerank
[diagram: an example graph with a pagerank score on each node]
This next example has an importantly different flavor.
The core way that Google identifies important web pages is the “Pagerank”
algorithm, which basically says “a page is interesting if other interesting pages link to
it”. That’s recursive of course, but the math works out. You can do similar things on a
social network like Twitter to find spammers and superstars, or among college
football teams or World of Warcraft players to prepare a competitive ranking, or
among buyers and sellers in a market to detect fraud.
To define a reputation ranking on, say, Twitter, you simulate a game of multiple rounds.
New Record Appears
[the same graph, with a new, unscored node attached]
Doing this is kinda literally what Hadoop was born to do, and it’s a simple
Hadoop-101 level program.
Acting out all those rounds using every interaction we’ve ever seen takes a fair
amount of time, though, and so a problem comes when we meet a new person.
This new person accrues some reputational jellybeans, and we don’t want to live in
total ignorance of what their score is; and they dispatch some as well, which should
change the scores of those they follow.
Update Using Local
[the same graph: the new node’s score is estimated from its immediate neighbors only: 12÷3 = 4, 24÷5 ≈ 5, total ≈ 9]
Well, we can roughly guess the score of the new node by having their followers pay
out a jellybean share proportional to what they would have gotten in the last
pagerank round.
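That guess is one line of arithmetic (function and argument names are mine, for illustration):

```python
def estimate_rank(followers, last_rank, out_degree):
    """Speed-layer guess: each follower pays the new node the share it
    would have sent in the last batch pagerank round. No propagation
    beyond the immediate neighborhood -- a guess beats a blank stare."""
    return sum(last_rank[f] / out_degree[f] for f in followers)

# The slide's example: followers with rank 12 (3 outbound links)
# and rank 24 (5 outbound links) give the new node 4 + 4.8 ≈ 9.
guess = estimate_rank(
    followers=["alice", "bob"],
    last_rank={"alice": 12, "bob": 24},
    out_degree={"alice": 3, "bob": 5},
)
print(round(guess))  # 9
```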
“A Guess beats a Blank Stare”
* World rate of change not really relevant
* The solution is actually to tell a lie
…Ignoring Correctness
[the same graph; the neighbors’ stale scores are left alone: meh]
But we’re not going to update the neighbors. You’d be concurrently updating an
arbitrary number of outbound nodes, and then of course those nodes’ changes
should rightfully propagate as well — this is why we play the multiple pagerank
rounds in the first place.
What we do instead is lie. Look, planes don’t fall out of the sky if you get someone’s
coolness quotient wrong in the first decimal point.

Batch Updates Graph
(diagram: a graph whose nodes carry counts, updated in batch)
(A Guess beats a Blank)
This has an importantly different flavor
* World rate of change not really relevant
* The solution is actually to tell a lie
Pattern: World/Local
• Understanding the world needs full graph
• You can tell a little white lie reading immediate graph only
• Relaxing: Accuracy, Justification
• Batch layer: uses global graph information
• Speed layer: just reads immediate neighborhood
• Examples: “Whom to follow”, Clustering, anything at 2nd-degree (friend-of-a-friend)
Problem isn’t so much about the volume of data,
it’s about how _far away_ that data is.
You can’t justify doing that second-order query for three reasons:
* time
* compute resources
* computational risk
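A minimal sketch of the World/Local split on a toy follower graph. All names here (`suggest_batch`, `suggest_speed`, the sample graph) are illustrative, not from the talk: the batch layer ranks candidates using friend-of-a-friend counts over the whole graph, while the speed layer tells its little white lie by reading only the neighborhood of the edge that just arrived.

```python
from collections import Counter

graph = {                          # adjacency list: who follows whom
    "alice": {"bob", "carol"},
    "bob": {"carol", "dave"},
    "carol": {"dave"},
    "dave": set(),
}

def suggest_batch(user, graph):
    """Batch layer: global information -- rank every non-neighbor by
    how many of the user's friends follow them (friend-of-a-friend
    counting over the full graph)."""
    friends = graph[user]
    scores = Counter()
    for f in friends:
        for fof in graph[f]:
            if fof != user and fof not in friends:
                scores[fof] += 1
    return [u for u, _ in scores.most_common()]

def suggest_speed(user, graph, new_edge):
    """Speed layer: a little white lie -- look only at the immediate
    neighborhood of the edge that just arrived, no global ranking."""
    src, dst = new_edge
    if src == user:
        # whoever the new friend follows is a cheap, local guess
        return [u for u in graph[dst] if u != user and u not in graph[user]]
    return []

print(suggest_batch("alice", graph))                    # -> ['dave']
print(suggest_speed("alice", graph, ("alice", "bob")))  # -> ['dave']
```

Here the two layers happen to agree; on a larger graph the local guess would merely be plausible, and the next batch pass would correct it.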
Pattern: Guess Beats Blank
• You can’t calculate a good answer quickly
• But Comprehensiveness is a must
• Relaxing: Accuracy, Justification
• Batch layer: finds the correct answer
• Speed layer: makes a reasonable guess
• Examples: Any time the sparse-data case is also the most valuable
In this case, we can’t sacrifice comprehensiveness — for every record that exists, we
must return a relevant answer. So we sacrifice truthfulness — or more precisely, we
sacrifice accuracy and justification.
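On the serving side, the pattern reduces to a fallback: prefer the batch layer's considered answer, and never return a blank. A minimal sketch, with illustrative names (`batch_views`, `global_top`, `recommend` are not from the talk):

```python
# Precomputed, correct answers from the batch layer.
batch_views = {"user_1": ["item_9", "item_4"]}
# The speed layer's reasonable guess: a popularity fallback.
global_top = ["item_7", "item_2"]

def recommend(user_id):
    # Comprehensiveness is a must: never return a blank.
    # Accuracy and justification are what we relax.
    if user_id in batch_views:
        return batch_views[user_id]   # the considered answer
    return global_top                 # a guess beats a blank

print(recommend("user_1"))          # -> ['item_9', 'item_4']
print(recommend("brand_new_user"))  # -> ['item_7', 'item_2']
```

The brand-new user is exactly the sparse-data case the slide calls most valuable: the guess is wrong more often, but a blank is wrong always.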
Marine Corps’ 80% Rule
“Any decision made with
more than 80% of the
necessary information is
hesitation”
— “The Marine Corps Way”
Santamaria & Martino
When there’s lots of data already, the imperfect result in the speed layer doesn’t have a huge effect
When there isn’t much data, it’s overwhelmingly better to fill in with an imperfect result
US Marine Corps: “Any decision made with more than 80% of the necessary information is hesitation”

Security
(diagram: connections feed an Interaction Net store; batch layer: Find Potential Evilness → Connection Counts → Agents of Interest; speed layer: Approximate Streaming Agg → “Agent of Interest?” → Detected Evilnesses → Dashboard)
In security, you have the data-breach type problems — why is someone strip-mining
computers in turn to a server in [name your own semi-friendly country]? — and
Bradley Manning-type problems — why is a GS-5 at a console in Kuwait downloading
every single diplomatic dispatch?
Pattern: Slow Boil/Flash Fire
• Two tempos of data: months vs milliseconds
• Short-term data too much to store
• Long-term data too much to handle immediately
• Often accompanies Baseline / Deltas, Global / Local
• Examples:
• Trending Topics
• Insider Security
Global/Local: Why has a contractor sysadmin in Hawaii accessed powerpoint presos
from every single group within our organization?
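One way to hold both tempos at once is an exponentially decayed counter: a short half-life sees the flash fire, a long one sees the slow boil. The talk names the pattern, not this data structure, so treat this as an illustrative sketch (`DecayedCount` and its parameters are made up):

```python
import math

class DecayedCount:
    """Approximate streaming aggregate: an exponentially decayed
    counter. Old events fade out instead of being stored, so the
    short-term stream never has to be kept in full."""
    def __init__(self, half_life_s):
        self.rate = math.log(2) / half_life_s
        self.value = 0.0
        self.t = 0.0

    def add(self, t, n=1):
        # decay the running value forward to time t, then add the event
        self.value *= math.exp(-self.rate * (t - self.t))
        self.t = t
        self.value += n

fast = DecayedCount(half_life_s=60)          # flash fire: a minute
slow = DecayedCount(half_life_s=30 * 86400)  # slow boil: a month
for t in range(0, 300, 10):                  # one event every 10 s
    fast.add(t)
    slow.add(t)
# `slow` barely decays (~30); `fast` saturates near its steady state,
# so a sudden burst stands out against it immediately.
```

A real insider-threat pipeline would keep one pair of counters per agent and compare the fast value against the slow baseline, which is exactly the Baseline/Deltas pairing the slide mentions.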
Banking, Oversimplified
(diagram: an Event Store holds Transaction Update Records; a batch Reconcile Accounts job produces Account Balances)
(CAP Tradeoffs)

Recommended for you

Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift

In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance and, you’ll hear from a specific customer and their use case to take advantage of fast performance on enormous datasets leveraging economies of scale on the AWS platform.

2015 aws summit san franciscovidhya srinivasan & justin cunninghamcloud
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Spark ETL techniques including Web Scraping, Parquet files, RDD transformations, SparkSQL, DataFrames, building moving averages and more.

dataframerddsparksql
Gluecon Monitoring Microservices and Containers: A Challenge
Gluecon Monitoring Microservices and Containers: A ChallengeGluecon Monitoring Microservices and Containers: A Challenge
Gluecon Monitoring Microservices and Containers: A Challenge

This document discusses the challenges of monitoring microservices and containers. It provides six rules for effective monitoring: 1) spend more time on analysis than data collection, 2) reduce latency of key metrics to under 10 seconds, 3) validate measurement accuracy, 4) make monitoring more available than services monitored, 5) optimize for distributed cloud-native applications, 6) fit metrics to models to understand relationships. It also examines models for infrastructure, flow, and ownership and discusses speed, scale, failures, and testing challenges with microservices.

gluecon
Banking, Oversimplified
(diagram, as before, now labeled: the batch Reconcile Accounts path producing Account Balances is essential and wins over the fast layer; the speed view is nice-to-have)
(CAP Tradeoffs)
Pattern: C-A-P Tradeoffs
• C-A-P tradeoffs:
• Can’t depend on when data will roll in (Justification)
• Can’t live in ignorance (Comprehensiveness)
• Batch layer: The final answer
• Speed layer: Actionable views
• Examples: Security (Authorization vs Auditing), lots of counting problems
(Banking)
Pattern: Out-of-Order
• C-A-P tradeoffs:
• Can’t depend on when data will roll in (Justification)
• Can’t live in ignorance (Comprehensiveness)
• Batch layer: The final answer
• Speed layer: Actionable views
• Examples: Security (Authorization vs Auditing), lots of counting problems
(Banking)
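Counting problems tolerate out-of-order arrival because addition is commutative and associative: the speed layer can fold events in whatever order they roll in and still converge on the batch layer's final answer. A toy sketch (the event list and view names are illustrative):

```python
# (event_time, amount) pairs -- arrival order != event-time order
events = [(3, 10), (1, 5), (4, -2), (2, 7)]

# Speed layer: fold events as they arrive. Actionable immediately,
# but not yet justified -- we can't know what hasn't rolled in.
speed_view = 0
for _, amount in events:
    speed_view += amount

# Batch layer: the final answer, computed over the complete,
# time-ordered event store.
batch_view = sum(amount for _, amount in sorted(events))

print(speed_view, batch_view)  # -> 20 20
```

Once every late event has arrived, the two views agree; until then, the speed view is the actionable approximation and the batch view is the truth it asymptotes to.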
Common Theme
The System Asymptotes
to Truth over time
We keep seeing this common theme — you are building a system that approaches
correctness over time. This leads to a best practice that I’ll call the improver pattern:

Scrape Product Web
• Scrapers: yield partial records
• Unifier: connects all identifiers for a common object
• Resolver: combines partial records into unified record
Entity Resolution
Pattern: Improver
• Improver: function(best guess, {new facts}) ~> new best guess
• Batch layer: f(blank, {all facts}) ~> best possible guess
• Speed layer: f(current best, {new fact}) ~> new best guess
• Batch and speed layer share same code & contract,
asymptote to truth.
The way you build your resolver is such that it can run unchanged in both the batch and speed layers
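The improver contract from the slide can be sketched directly: the batch layer is a fold of every fact into a blank guess, the speed layer is one incremental step from the current best, and both call the same function. Field names and the merge rule here are made up for illustration:

```python
from functools import reduce

def improve(best, facts):
    """improver: function(best guess, {new facts}) ~> new best guess.
    Here the guess is a partially resolved entity record and each
    fact is a partial record from a scraper."""
    merged = dict(best)
    for fact in facts:
        for key, value in fact.items():
            if value is not None:
                merged[key] = value   # newest non-null field wins
    return merged

facts = [{"name": "Acme Anvil"}, {"price": 9.99}, {"price": 10.49}]

# Batch layer: f(blank, {all facts}) ~> best possible guess
batch = reduce(lambda best, fact: improve(best, [fact]), facts, {})

# Speed layer: f(current best, {new fact}) ~> new best guess
speed = improve({"name": "Acme Anvil", "price": 9.99}, [{"price": 10.49}])

# Same code, same contract, asymptoting to the same truth:
print(batch == speed)  # -> True
```

Because the two layers share one function and one contract, a speed-layer answer is never a different kind of truth, just an earlier draft of the batch layer's.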
Two Big Ideas
• Fine-grained control over architectural tradeoffs
• Truth lives at the edge, not the middle
Lets you trade off how quickly, how expensively, how true, how justified
New Paradigm for how, when and where we handle truth

Two Big Ideas
• Fine-grained control over architectural tradeoffs
• Approximate a pure function on all data
• What we do now that architecture is free
• Truth lives at the edge, not the middle
Lets you trade off how quickly, how expensively, how true, how justified
New Paradigm for how, when and where we handle truth
Two Big Ideas
• Fine-grained control over architectural tradeoffs
• Approximate a pure function on all data
• What we do now that architecture is free
• Truth lives at the edge, not the middle
• Data is syndicated forward from arrival to serving
• “Query at write time”
Lets you trade off how quickly, how expensively, how true, how justified
New Paradigm for how, when and where we handle truth
• Lambda architecture isn’t about speed layer / batch layer.
• It's about
• moving truth to the edge, not the center;
• enabling fine-grained tradeoffs against fundamental limits;
• decoupling consumer from infrastructure
• decoupling consumer from asynchrony
• …with profound implications for how you build your teams
λ Arch: Truth, not Plumbing
This way of doing it simplifies architecture:
Local interactions only
Elimination of asynchrony
Which in turn profoundly simplifies development and operations
And allows you to structure team like you do the
Lambda Architecture
for a Dinky Little Blog
So far, talked about a bunch of reasons why you might be led **to** a lambda
architecture
And when there's a new technology people always first ask why they should do it
differently, which is a wise thing to ask and a foolish thing to insist on
But let's look at it from the other end, from what life is like if this were the natural
state of being.
And to do so, let's take the most unjustifiable case for a high scale architecture: a blog
engine

Blog: Traditional Approach
• Familiar with the ORM Rails-style blog:
• Models: User, Article, Comment
• Views:
• /user/:id (user info, links to their articles and comments);
• /articles (list of articles);
• /articles/:id (article content, comments, author info)
User: id 3, name joeman, homepage http://…, photo http://…, bio “…”
Article: id 7, title “The Crisis”, body “These are…”, author_id 3, created_at 2014-08-08
Comment: id 12, body “lol”, article_id 7, author_id 3
(wireframes: the user show page (author name, photo, bio, and “Joe has written 2 Articles” with snippets) and the article show page (title, body, author sidebar, comments), each assembled by the Webserver from the articles, users, and comments tables)
Traditional: Assemble on Read
DB models are sole source of truth
Normalized
Used directly by reader and writer
View is constructed from spare parts at read time


Syndicate on Write
(diagram: Δ article, Δ user, and Δ comment changes from the articles, users, and comments tables flow through Biographer reporters into view fragments for the show pages)
• (…hack hack hack…) /articles/v1/show.json
• (…hack hack hack…) /articles/v2/show.json
Data Engineer: What data model would you like to receive?
Web Engineer: {“title”:”…”, “body”:”…”, …}
Web Engineer: lol um can I also have …
Data Engineer: {“title”:”…”, “body”:”…”, “snippet”:…}
Syndicated Data
• The Data is always _there_
• …but sometimes it’s more perfect than other times.
Syndicated Data
• Reports are cheap, single-concern, and faithful to the view.
• You start thinking like the customer, not the database
• All pages render in O(1):
• Your imagination doesn’t have to fit inside a TCP timeout
• Data is immutable, flows are idempotent:
• Interface change is safe
• Data is always _there_,
• Asynchrony doesn’t affect consumers
• Everything is decoupled:
• Way harder to break everything
One of the worst pains in asses is the query that takes 1500 milliseconds. Needs to
be immediate, usually mission-critical, expensive in all ways


…
…
Changes update models
(diagram: update article, update user, and update comments events produce Δ article, Δ user, and Δ comment deltas; the article, user, and comment models are updated, and a history log is retained)
Models stay the same: User, Article, Comment. Updated directly
Reporters can subscribe to models
On update, reporter receives updated object, and can do anything else it wants.
Typically, it's to create a new report
Reports live in the target domain: faithful to the data consumer. In this case, they look
very close to the information hierarchy of the rendered page
All pages render in O(1). Your imagination is not constrained by the length of a TCP timeout
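The reporter mechanism above can be sketched as a tiny publish/subscribe loop: models publish their updates, single-concern reporters re-render denormalized report fragments, and the read path just fetches a fragment. Everything here (`subscribe`, `publish`, `compact_article_reporter`, the key scheme) is an illustrative sketch, not the talk's actual code:

```python
reports = {}       # fragment store the webserver reads in O(1)
subscribers = {}   # model name -> list of reporter callbacks

def subscribe(model, reporter):
    subscribers.setdefault(model, []).append(reporter)

def publish(model, record):
    # syndicate on write: every interested reporter sees the update
    for reporter in subscribers.get(model, []):
        reporter(record)

def compact_article_reporter(article):
    # single-concern and faithful to the view: just the compact card
    reports[f"compact_article/{article['id']}"] = {
        "title": article["title"],
        "snippet": article["body"][:40],
    }

subscribe("article", compact_article_reporter)
publish("article", {"id": 7, "title": "The Crisis",
                    "body": "These are the times..."})

# Read path: no joins, no assembly -- the fragment is just there.
print(reports["compact_article/7"]["title"])  # -> The Crisis
```

Re-publishing the same record overwrites the same key with the same fragment, which is the idempotence the Syndicated Data slide leans on: flows can be replayed safely and interface changes are just new reporters.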

Recommended for you

AWS re:Invent 2016: Accenture Cloud Platform Serverless Journey (ARC202)
AWS re:Invent 2016: Accenture Cloud Platform Serverless Journey (ARC202)AWS re:Invent 2016: Accenture Cloud Platform Serverless Journey (ARC202)
AWS re:Invent 2016: Accenture Cloud Platform Serverless Journey (ARC202)

Accenture Cloud Platform helps customers manage public and private enterprise cloud resources effectively and securely. In this session, learn how we designed and built new core platform capabilities using a serverless, microservices-based architecture that is based on AWS services such as AWS Lambda and Amazon API Gateway. During our journey, we discovered a number of key benefits, including a dramatic increase in developer velocity, a reduction (to almost zero) of reliance on other teams, reduced costs, greater resilience, and scalability. We describe the (wild) successes we’ve had and the challenges we’ve overcome to create an AWS serverless architecture at scale. Session sponsored by Accenture. AWS Competency Partner

amazon web servicesre:inventaws reinvent
2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix

This document discusses Indix's evolution from its initial Data Platform 1.0 to a new Data Platform 2.0 based on the Lambda Architecture. The Lambda Architecture uses three layers - batch, serving, and speed layers - to process streaming and batch data. This provides robustness, fault tolerance, and the ability to query both real-time and batch processed views. The new system uses technologies like Spark, HBase, and Solr to implement the Lambda Architecture principles.

big datascaldinglambda architecture
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns

The document discusses different cloud data architectures including streaming processing, Lambda architecture, Kappa architecture, and patterns for implementing Lambda architecture on AWS. It provides an overview of each architecture's components and limitations. The key differences between Lambda and Kappa architectures are outlined, with Kappa being based solely on streaming and using a single technology stack. Finally, various AWS services that can be used to implement Lambda architecture patterns are listed.

awsazuredatabricks
Models Trigger Reporters
(diagram: as the article, user, and comment models update, their Δs trigger reporters that maintain report fragments: expanded article, compact article, user’s # articles, expanded user, sidebar user, user’s # comments, compact comment, micro user)
Serve Report Fragments
(diagram: the article show page is served straight from report fragments (expanded article, compact article, user’s # articles, expanded user, sidebar user, user’s # comments, compact comment, micro user), rendering the article title and body, author sidebar, and comments)
(wireframe: the rendered article show page: title, body, author name, photo, and bio sidebar, and comments)
article show rendered:
{
  "title": "Article Title",
  "body": "Article Body Lorem [...]",
  "author": { ... },
  "comments": [
    {"comment_id": 1, "body": "First Post", ...},
    {"comment_id": 2, "body": "lol", ...},
    ...
  ]
}
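A minimal sketch of this read-time assembly, assuming a key-value fragment store kept up to date by the reporters; the store layout and function names here are hypothetical, not from the deck:

```python
# Read-time view assembly from precomputed report fragments.
# fragment_store stands in for a real KV store (Redis, HBase, ...);
# its keys and value shapes are illustrative assumptions.

fragment_store = {
    ("expanded_article", 42): {"title": "Article Title",
                               "body": "Article Body Lorem [...]"},
    ("sidebar_user", 7): {"name": "Author Name", "bio": "Lorem ipsum"},
    ("compact_comments", 42): [
        {"comment_id": 1, "body": "First Post"},
        {"comment_id": 2, "body": "lol"},
    ],
}

def show_article(article_id, author_id):
    """Build the article 'show' view from spare parts at read time."""
    view = dict(fragment_store[("expanded_article", article_id)])
    view["author"] = fragment_store[("sidebar_user", author_id)]
    view["comments"] = fragment_store[("compact_comments", article_id)]
    return view
```

The reader never touches the raw models: it only stitches fragments, which is why the view can be served cheaply even when the underlying models are large.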
[Diagram: the user “show” page is likewise assembled from fragments: the expanded user, the user’s article and comment counts, compact articles, and compact comments with a micro user.]

Reports are Cheap
[Diagram: the same model, delta, and fragment pipeline now feeds many reports cheaply: list articles, show article, list a user’s articles, and show user all share the same precomputed fragments.]
Two Big Ideas
• Fine-grained control over those architectural tradeoffs
• Truth lives at the edge, not the middle
Lets you trade off how quickly, how expensively, how true, and how justified your answers are
A new paradigm for how, when, and where we handle truth
Lambda Architecture
Entity Resolution
Intake
[Diagram: parsers for the Amazon, eBay, and Ma&Pa Electronics feeds, arriving as bulk loads, streams, or RPC callbacks, each emit a common VendorListing record carrying the candidate keys (keywords, manufacturer & model, ASIN) into the Listings store.]
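The intake step might look like the following sketch: one parser per vendor feed, all emitting a common VendorListing record with the candidate resolution keys. Field and function names are assumptions for illustration, not from the deck.

```python
# One parser per vendor feed, all normalizing into a shared VendorListing
# record that carries the candidate resolution keys.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class VendorListing:
    vendor: str
    keywords: List[str]
    mfr_model: Optional[str] = None   # manufacturer & model string
    asin: Optional[str] = None        # Amazon product id, the strongest key

def parse_amazon(raw: dict) -> VendorListing:
    # Amazon listings arrive with an ASIN attached.
    return VendorListing("amazon", raw["title"].lower().split(),
                         raw.get("mfr_model"), raw["asin"])

def parse_mapa(raw: dict) -> VendorListing:
    # A small vendor may only give us free text to key on.
    return VendorListing("ma&pa", raw["description"].lower().split())
```

Whether a feed arrives in bulk, as a stream, or via RPC callback, the same parser produces the same record, so everything downstream sees one shape.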


Batch Layer: Resolve/Unify
[Diagram: the batch path reads all Listings, unifies them into Unified Products via the Product Resolver, and iteratively improves the resolver; the streaming path fetches candidate products for each incoming VendorListing (by keywords, manufacturer & model, or ASIN), resolves it, and updates both the Unified Products and the resolver.]
Cannot have Consistency
[Diagram repeats: the batch Unify Products job and the streaming Resolve & Update path both write Unified Products, so the two views cannot be kept mutually consistent.]
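A toy version of the resolver shared by both paths; the key precedence (ASIN first, then manufacturer & model) is my assumption for illustration, not something the deck specifies.

```python
# Batch unification folds every listing in from scratch; the streaming
# path would call resolve() for one new listing and update in place.

def resolve(listing, products):
    """Return the matching product id, or None if it is a new product."""
    for pid, product in products.items():
        if listing.get("asin") and listing["asin"] in product["asins"]:
            return pid
        if listing.get("mfr_model") and listing["mfr_model"] in product["mfr_models"]:
            return pid
    return None

def unify(listings):
    """Batch path: rebuild unified products from the full listing history."""
    products = {}
    for listing in listings:
        pid = resolve(listing, products)
        if pid is None:
            pid = len(products)
            products[pid] = {"asins": set(), "mfr_models": set(), "listings": []}
        p = products[pid]
        if listing.get("asin"):
            p["asins"].add(listing["asin"])
        if listing.get("mfr_model"):
            p["mfr_models"].add(listing["mfr_model"])
        p["listings"].append(listing)
    return products
```

Rerunning `unify` over the full history is what lets the batch layer correct the mistakes the streaming path inevitably makes, which is also why the two can never be exactly consistent at any given instant.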


Objections
• Three objections:
1. Why hasn't it been done before?
2. Architecture astronaut!
3. I'm not at high scale
• Responses:
1. Chef/Puppet/Docker/etc
2. Chef/Puppet/Docker/etc
3. Shut up
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Objections
• Two APIs? Really?
• Yes. Guilty. That's dumb and must be fixed.
• Spark or Samza, if you're willing to only drink one flavor of Kool-Aid
• EZbake.io, a CSC / 42six project to attack this
• …but we shouldn't be living at the low level anyhow

Objections
• Orchestration: “logical plan” (dataflow graph)
• Optimization/Allocation: “physical plan” (what goes where)
• Resource Projector: instantiates infrastructure (HTTP listeners, Trident streams, Oozie scheduling, ETL flows, cron jobs, etc.)
• Transport Machineries: move data around, fulfilling locality/ordering/etc. guarantees
• Data Processing: UDFs and operators
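The split between logical and physical plan can be sketched as two separate tables wired together by the resource projector; the stage names and resource types below are illustrative, not an actual EZbake.io API.

```python
# Logical plan: the dataflow graph names stages and edges only.
logical_plan = {                        # stage -> downstream stages
    "intake": ["resolve"],
    "resolve": ["unified_products"],
    "unified_products": [],
}

# Physical plan: what each stage runs on (hypothetical assignments).
physical_plan = {                       # stage -> concrete resource
    "intake": "http_listener",
    "resolve": "trident_stream",
    "unified_products": "hbase_table",
}

def project_resources(logical, physical):
    """Resource projector: pair each dataflow edge with its machinery."""
    return [(src, dst, physical[src], physical[dst])
            for src, dsts in logical.items() for dst in dsts]

wiring = project_resources(logical_plan, physical_plan)
```

Keeping the two plans apart is the point: the same logical graph can be re-projected onto cron jobs for the batch path and stream processors for the speed path without touching the UDFs.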

 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOT
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOTAWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOT
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOT
 
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
 
AWS re:Invent 2016: Accenture Cloud Platform Serverless Journey (ARC202)
AWS re:Invent 2016: Accenture Cloud Platform Serverless Journey (ARC202)AWS re:Invent 2016: Accenture Cloud Platform Serverless Journey (ARC202)
AWS re:Invent 2016: Accenture Cloud Platform Serverless Journey (ARC202)
 
2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
 
Five Early Challenges Of Building Streaming Fast Data Applications
Five Early Challenges Of Building Streaming Fast Data ApplicationsFive Early Challenges Of Building Streaming Fast Data Applications
Five Early Challenges Of Building Streaming Fast Data Applications
 
Petabytes and Nanoseconds
Petabytes and NanosecondsPetabytes and Nanoseconds
Petabytes and Nanoseconds
 

Recently uploaded

iot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptxiot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptx
KiranKumar139571
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
gargtinna79
 
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeNoida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
kumkum tuteja$A17
 
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model SafeVasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
nikita dubey$A17
 
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
TARIKU ENDALE
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
javier ramirez
 
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeDaryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
butwhat24
 
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeLaxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
yogita singh$A17
 
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
sapna sharmap11
 
Simon Fraser University degree offer diploma Transcript
Simon Fraser University  degree offer diploma TranscriptSimon Fraser University  degree offer diploma Transcript
Simon Fraser University degree offer diploma Transcript
taqyea
 
Sunshine Coast University diploma
Sunshine Coast University diplomaSunshine Coast University diploma
Sunshine Coast University diploma
cwavvyy
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
GaneshGanesh399816
 
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
bookmybebe1
 
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
Milind Agarwal
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
depikasharma
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
jiya khan$A17
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
aarusi sexy model
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
dipti singh$A17
 
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
Amazon Web Services Korea
 
EGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithmEGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithm
fatimaezzahraboumaiz2
 

Recently uploaded (20)

iot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptxiot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptx
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
 
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeNoida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
 
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model SafeVasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
 
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
 
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeDaryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
 
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeLaxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
 
Simon Fraser University degree offer diploma Transcript
Simon Fraser University  degree offer diploma TranscriptSimon Fraser University  degree offer diploma Transcript
Simon Fraser University degree offer diploma Transcript
 
Sunshine Coast University diploma
Sunshine Coast University diplomaSunshine Coast University diploma
Sunshine Coast University diploma
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
 
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
 
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
 
EGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithmEGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithm
 

Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe

  • 1. Patterns of the Lambda Architecture Truth and Lies at the Edge of Scale Flip Kromer — CSC I’m Flip Kromer, Distinguished Engineer at CSC. If you are a large enterprise company looking to add Big Data capabilities — especially one involving legacy systems — we’re a big, stable company that specializes in turning technology into an enterprise-grade solution.
  • 2. Pattern Set This talk will equip you with two things. One is a set of patterns for how we design high-scale architectures to solve specific classes of problem, now that extra infrastructure is nearly free.
  • 3. Tradeoff Rules PICK ANY TWO Along with a set of tradeoff rules along the lines of the pick-any-two trinity but more sophisticated
  • 4. Lambda Architecture So what is the Lambda Architecture? Here’s two examples.
  • 5. Search w/ Update Build Indexes A Ton of Text Historical Index Live Indexer More Text Recent Index API In this system, we have a whole ton of historical text, with more arriving all the time, and want to allow immediate real-time search across the whole corpus.
  • 6. Search w/ Update Build Indexes A Ton of Text Historical Index Live Indexer More Text Recent Index API Build Main Index We will use a large periodic batch job to create indexes on the historical data. This takes a while — far longer than our recency demands allow — so we might as well have our elephants use clever algorithms and optimally organize the data for rapid retrieval.
  • 7. Search w/ Update Build Indexes A Ton of Text Historical Index Live Indexer More Text Recent Index API Update Recent Index Until the next stampede arrives with an updated index, as each new record arrives we will not only file it with the historical data but also use simple fast indexing to make it immediately searchable. Merging new records directly would require stuffing them into the right place in the historical index, which eventually means moving records around, which demands far too much time and complexity to be workable.
  • 8. Search w/ Update Build Indexes A Ton of Text Historical Index Live Indexer More Text Recent Index API Serve Result The system to serve the data just pulls from both indexes in immediate time
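The serve-from-both-indexes idea can be sketched in a few lines. This is an illustrative sketch, not the actual system from the talk: the indexes are plain dicts mapping a query term to `(doc_id, score)` postings, and all names here are invented.

```python
# Hypothetical sketch: the serving layer fans a query out to both the
# historical (batch-built) index and the recent (speed-layer) index,
# then merges the results so callers never see the split.
def search(query, historical_index, recent_index):
    # Each index maps a term to a list of (doc_id, score) postings.
    hits = {}
    for index in (historical_index, recent_index):
        for doc_id, score in index.get(query, []):
            # If a document appears in both indexes, keep its best score.
            hits[doc_id] = max(hits.get(doc_id, 0.0), score)
    # Return doc ids, best score first.
    return sorted(hits, key=hits.get, reverse=True)
```

The point of the sketch is that the merge is cheap and happens at read time, so neither index ever has to be rewritten in place.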
  • 9. Build Indexes A Ton of Text Historical Index Live Indexer More Text Recent Index API Lambda Architecture Batch Speed Serving We have a batch layer for the global corpus; A speed layer for recent results; and a serving layer for access
  • 10. Build Indexes A Ton of Text Historical Index Live Indexer More Text Recent Index API Lambda Architecture Global Relevant Immediate We have a batch layer for the global corpus; A speed layer for recent results; and a serving layer for access
  • 11. Train Recomm’der Visitor History History Alsobuy Visitor: Product Visitor Alsobuy Update Recommendation Fetch/Update History Visitor: Product History Webserver Recommender Another familiar architecture is a high-scale recommender system — “Given that the user has looked at mod-style dresses and mason jars, show them these knitting needles”. This diagram shows a recommender, but most machine-learning systems look like this.
  • 12. Train Recomm’der Visitor History History Alsobuy Visitor: Product Visitor Alsobuy Update Recommendation Fetch/Update History Visitor: Product History Webserver Recommender Build Model You have one system process all the examples you’ve ever seen to produce a predictive model. The trained model it produces can then react immediately to all future examples as they occur.
  • 13. Train Recomm’der Visitor History History Alsobuy Visitor: Product Visitor Alsobuy Update Recommendation Fetch/Update History Visitor: Product History Webserver Recommender Applies Model The trained model it produces can then react immediately to all future examples as they occur. In this system we’re going to have one system to apply the model and store the recommendation. Your operations team is better off with two systems that can fail without breaking the site than with the apply-model step coupled to serving pages.
  • 16. Lambda Arch Layers • Batch layer Deep Global Truth throughput • Speed layer Relevant Local Truth throughput • Serving layer Rapid Retrieval latency The batch and speed layers care about throughput; the serving layer cares about latency.
  • 17. Lambda Arch: Technology • Batch layer Hadoop, Spark, Batch DB Reports • Speed layer Storm+Trident, Spark Str., Samza, AMQP, … • Serving layer Web APIs, Static Assets, RPC, …
  • 18. Lambda Architecture Batch Speed Serving λ λ Where does the name lambda come from? In my head it’s because the flow diagram…
  • 19. Lambda Architecture Batch Speed Serving λ looks like the shape of the character for lambda
  • 20. Lambda Architecture λ(v) • Pure Function on immutable data But really it means this new mindset of building pure functions (lambdas) on immutable data.
  • 22. Ideal Data System • Capacity -- Can process arbitrarily large amounts of data • Affordability -- Cheap to run • Simplicity -- Easy to build, maintain, debug • Resilience -- Jobs/Processes fail & restart gracefully • Responsiveness -- Low latency for delivering results • Justification -- Incorporates all relevant data into result • Comprehensive -- Answer questions about any subject • Recency -- Promptly incorporates changes in world • Accuracy -- Few approximations or avoidable errors The laziest, and therefore best, knobs are the Capacity/Affordability ones. The pre-big-data era can be thought of as one where only those two exist. Big Data broke the handle off the Capacity knob, either because Affordability ramps too fast or because the speed of light starts threatening resilience, responsiveness or recency. * _Comprehensive_: complete; including all or nearly all elements or aspects of something * _concise_: giving a lot of information clearly and in a few words; brief but comprehensive
  • 23. Ideal Data System • Capacity -- Can process arbitrarily large amounts of data • Affordability -- Cheap to run • Simplicity -- Easy to build, maintain, debug • Resilience -- Jobs/Processes fail & restart gracefully • Responsiveness -- Low latency for delivering results • Justification -- Incorporates all relevant data into result • Comprehensive -- Answer questions about any subject • Recency -- Promptly incorporates changes in world • Accuracy -- Few approximations or avoidable errors You would think that what mattered was correctness — justified true belief
  • 24. Ideal Data System • Capacity -- Can process arbitrarily large amounts of data • Affordability -- Cheap to run • Simplicity -- Easy to build, maintain, debug • Resilience -- Jobs/Processes fail & restart gracefully • Responsiveness -- Low latency for delivering results • Justification -- Incorporates all relevant data into result • Comprehensive -- Answer questions about any subject • Recency -- Promptly incorporates changes in world • Accuracy -- Few approximations or avoidable errors When you look at what we actually do, the non-negotiables are that it be manageable and economic given that you must process arbitrarily large amounts of data. Truth is a nice-to-have.
  • 25. Tradeoff Rules PICK ANY TWO Set of tradeoff rules along the lines of the pick-any-two trinity but more sophisticated
  • 26. At Scale AND THIS THIS AND TRY TO BE GOOD Basically, given big data you have to accommodate any amount of data and produce static reports or queries that execute within the duration of human patience — so you must be fast and cheap, sacrificing good.
  • 28. Train Recomm’der Visitor History History Alsobuy Visitor: Product Visitor Alsobuy Update Recommendation Fetch/Update History Visitor: Product History Webserver Recommender The world you’re modeling changes — new sets of products are released, new and varied customers sign up, changes to the site drive new behavior — but it changes slowly. So it’s no big deal if the training stage is only run once a week over several hours. The first example follows a pretty familiar general form I’ll call “Train / React”.You have one system process all the examples you’ve ever seen to produce a predictive model. The trained model it produces can then react immediately to all future
  • 29. Pattern: Train / React • Model of the world lets you make immediate decisions • World changes slowly, so we can re-build model at leisure • Relax: Recency • Batch layer: Train a machine learning model • Speed layer: Apply that model • Examples: most Machine Learning thingies (Recommender) Big fat job that only needs to run occasionally; results of the job inform what happens immediately
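Train / React can be reduced to a toy sketch. This is purely illustrative — the talk doesn’t prescribe any particular learner — and the “model” here is just co-purchase counts; every name is invented.

```python
# Train / React sketch (all names invented).
# Batch layer: rebuild the model at leisure from the full history.
def train(history):
    # "Model" = co-purchase counts:
    # product -> {other product bought in the same basket: count}.
    model = {}
    for basket in history:
        for a in basket:
            for b in basket:
                if a != b:
                    counts = model.setdefault(a, {})
                    counts[b] = counts.get(b, 0) + 1
    return model

# Speed layer: apply the frozen model to each request as it arrives.
def recommend(model, product, n=3):
    alsobuy = model.get(product, {})
    return sorted(alsobuy, key=alsobuy.get, reverse=True)[:n]
```

The key shape to notice: `train` is slow and runs occasionally; `recommend` is a cheap lookup against its frozen output, so recency is the only thing sacrificed.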
  • 30. Search w/ Update Build Indexes A Ton of Text Historical Index Live Indexer More Text Recent Index API
  • 31. Pattern: Baseline / Delta • Understanding the world takes a long time • World changes much faster than that, and you care • Relax: Simplicity, Accuracy • Batch layer: Process the entire world • Speed layer: Handle any changes since last big run • Examples: Real-time Search index; Count Distinct; other Approximate Stream Algorithms In Train / React, the world changes, but slowly; training in batch mode is just fine. In Baseline / Delta, the world changes so quickly you can’t run the compute job fast enough. So you are sacrificing simplicity — there are two systems where there was only one — and accuracy — the recent records won’t update global normalized frequencies
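Count Distinct, one of the examples above, makes a compact Baseline / Delta illustration. A minimal sketch under invented names — real systems would use an approximate structure like HyperLogLog rather than exact sets, but the baseline-plus-delta shape is the same.

```python
# Baseline / Delta sketch for count-distinct (illustrative names).
class DistinctCounter:
    def __init__(self):
        self.baseline = set()   # rebuilt by the periodic batch job
        self.delta = set()      # everything seen since the last rebuild

    def batch_rebuild(self, all_records):
        # Batch layer: exact and expensive; run occasionally.
        self.baseline = set(all_records)
        self.delta.clear()

    def observe(self, record):
        # Speed layer: cheap, per-record.
        self.delta.add(record)

    def count(self):
        # Serving layer: union of a slightly stale baseline
        # and the fresh delta.
        return len(self.baseline | self.delta)
```

Note the accuracy tradeoff the slide names: between rebuilds the answer is assembled from two structures, and only the batch pass ever makes it exact again.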
  • 32. Pagerank Converge Pagerank Friend Relations User Pagerank Retrieve Bob’s Facebook Ntwk Bob Bob’s Friends’ Pageranks Estimate Bob’s Pagerank But don’t bother updating Bob’s Friends (or friends’ friends or …) API (Lazy Propagation)
  • 33. Pagerank 48 24 42 12 12 6 24 24 42 6 6 6 6 6 6 6 This next example has an importantly different flavor. The core way that Google identifies important web pages is the “Pagerank” algorithm, which basically says “a page is interesting if other interesting pages link to it”. That’s recursive of course but the math works out. You can do similar things on a social network like twitter to find spammers and superstars, or among college football teams or world of warcraft players to prepare a competitive ranking, or among buyers and sellers in a market to detect fraud. To define a reputation ranking on say Twitter you simulate a game of multiple rounds.
  • 34. 48 24 42 12 12 6 24 24 42 6 6 6 6 6 6 6 9 4 - 5 - New Record Appears ? Doing this is kinda literally what Hadoop was born to do, and it’s a simple Hadoop-101 level program. Acting out all those rounds using every interaction we’ve ever seen takes a fair amount of time, though, and so a problem comes when we meet a new person. This new person accrues some reputational jellybeans, and we don’t want to live in total ignorance of what their score is; and they dispatch some as well, which should change the scores of those they follow.
  • 35. 48 24 42 12 12 6 24 24 42 6 6 6 6 6 6 6 9 4 - 5 - Update Using Local 12÷3 = 4 24÷5 ≈ 5 9 Well, we can roughly guess the score of the new node by having their followers pay out a jellybean share proportional to what they would have gotten in the last pagerank round. “A Guess beats a Blank Stare” * World rate of change not really relevant * The solution is actually to tell a lie
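The local guess on this slide fits in one function. A hypothetical sketch mirroring the slide’s arithmetic (12÷3 = 4, 24÷5 ≈ 5, summing to 9); the input names are invented and no global propagation happens.

```python
# "A guess beats a blank" sketch: estimate a new node's pagerank
# purely from its immediate neighborhood, without re-running the
# global batch job. Each existing follower pays the newcomer the
# share it would hand over in one pagerank round.
# follower_scores:     follower -> that follower's current pagerank
# follower_outdegrees: follower -> how many accounts it links to
def guess_pagerank(follower_scores, follower_outdegrees):
    return sum(round(score / follower_outdegrees[who])
               for who, score in follower_scores.items())
```

Crucially, the function reads only the new node’s direct followers — the deliberate little white lie is that none of those followers’ own scores get updated in return.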
  • 36. 48 24 42 12 12 6 24 24 42 6 6 6 6 6 6 6 9 4 - 5 - …Ignoring Correctness meh But we’re not going to update the neighbors.You’d be concurrently updating an arbitrary number of outbound nodes, and then of course those nodes’ changes should rightfully propagate as well — this is why we play the multiple pagerank rounds in the first place. What we do instead is lie. Look, planes don’t fall out of the sky if you get someone’s coolness quotient wrong in the first decimal point.
  • 37. Batch Updates Graph 42 30 36 11 10 6 21 21 36 4 6 6 6 5 5 4 9 3 9 6 (A Guess beats a Blank) This has an importantly different flavor * World rate of change not really relevant * The solution is actually to tell a lie
  • 38. Pattern: World/Local • Understanding the world needs full graph • You can tell a little white lie reading immediate graph only • Relaxing: Accuracy, Justification • Batch layer: uses global graph information • Speed layer: just reads immediate neighborhood • Examples:“Whom to follow”, Clustering, anything at 2nd- degree (friend-of-a-friend) Problem isn’t so much about the volume of data, it’s about how _far away_ that data is. You can’t justify doing that second-order query for three reasons: * time * compute resources * computational risk
  • 39. Pattern: Guess Beats Blank • You can’t calculate a good answer quickly • But Comprehensiveness is a must • Relaxing: Accuracy, Justification • Batch layer: finds the correct answer • Speed layer: makes a reasonable guess • Examples:Any time the sparse-data case is also the most valuable In this case, we can’t sacrifice comprehensiveness — for every record that exists, we must return a relevant answer. So we sacrifice truthfulness — or more precisely, we sacrifice accuracy and justification.
  • 40. Marine Corps’ 80% Rule “Any decision made with more than 80% of the necessary information is hesitation” — “The Marine Corps Way” Santamaria & Martino When there’s lots of data already, the imperfect result in the speed layer doesn’t have a huge effect. When there isn’t much data, it’s overwhelmingly better to fill in with an imperfect result. US Marine Corps: “Any decision made with more than 80% of the necessary information is hesitation”
  • 41. A Guess Beats a Blank • You can’t calculate a good answer quickly • But Comprehensiveness is a must • Relaxing: Accuracy, Justification • Batch layer: finds the correct answer • Speed layer: makes a reasonable guess • Examples:Any time the sparse-data case is also the most valuable In this case, we can’t sacrifice comprehensiveness — for every record that exists, we must return a relevant answer. So we sacrifice truthfulness — or more precisely, we sacrifice accuracy and justification.
  • 42. Security Find Potential Evilness Connection Counts Agents of Interest Store Interaction Net Connect ions Detected Evilnesses Approximate Streaming Agg Agent of Interest? Dashboard In security, you have the data-breach-type problems — why is someone strip-mining computers in turn to a server in [name your own semi-friendly country]? — and Bradley Manning-type problems — why is a GS-5 at a console in Kuwait downloading every single diplomatic dispatch?
  • 43. Pattern: Slow Boil/Flash Fire • Two tempos of data: months vs milliseconds • Short-term data too much to store • Long-term data too much to handle immediately • Often accompanies Baseline / Deltas, Global / Local • Examples: • Trending Topics • Insider Security Global/Local: Why has a contractor sysadmin in Hawaii accessed powerpoint presos from every single group within our organization?
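The two tempos can be sketched as a bounded recent window checked against a slow-moving baseline. This is an invented toy, not the talk’s system: the batch layer is assumed to supply `baseline_rates`, and the speed layer keeps only the last few events.

```python
from collections import Counter, deque

# Slow Boil / Flash Fire sketch (hypothetical names).
class BurstDetector:
    def __init__(self, baseline_rates, window=5, factor=3.0):
        # Slow boil: typical per-window activity, from the batch layer.
        self.baseline = baseline_rates
        # Flash fire: only a short, bounded window of recent events
        # is ever stored — old events simply fall off the deque.
        self.window = deque(maxlen=window)
        self.factor = factor

    def observe(self, agent):
        self.window.append(agent)
        recent = Counter(self.window)
        # Flag activity far above the agent's long-term baseline.
        return recent[agent] > self.factor * self.baseline.get(agent, 0.1)
```

The design point: millisecond-tempo data never needs long-term storage, and month-tempo data never needs to be consulted per-event — each layer keeps only what its tempo can afford.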
  • 45. Banking, Oversimplified Reconcile Accounts Account Balances Event Store Transaction Update Records (nice-to-have vs essential) The batch layer wins over the fast layer (CAP tradeoffs)
  • 46. Pattern: C-A-P Tradeoffs • C-A-P tradeoffs: • Can’t depend on when data will roll in (Justification) • Can’t live in ignorance (Comprehensiveness) • Batch layer: The final answer • Speed layer: Actionable views • Examples: Security (Authorization vs Auditing), lots of counting problems (Banking)
  • 47. Pattern: Out-of-Order • C-A-P tradeoffs: • Can’t depend on when data will roll in (Justification) • Can’t live in ignorance (Comprehensiveness) • Batch layer: The final answer • Speed layer: Actionable views • Examples: Security (Authorization vs Auditing), lots of counting problems (Banking)
  • 48. Common Theme The System Asymptotes to Truth over time We keep seeing this common theme — you are building a system that approaches correctness over time. This leads to a best practice that I’ll call the improver pattern:
  • 50. Entity Resolution • Scrapers: yield partial records • Unifier: connects all identifiers for a common object • Resolver: combines partial records into unified record
  • 51. Pattern: Improver • Improver: function(best guess, {new facts}) ~> new best guess • Batch layer: f(blank, {all facts}) ~> best possible guess • Speed layer: f(current best, {new fact}) ~> new best guess • Batch and speed layer share same code & contract, asymptote to truth. The way you build your resolver is such that it
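The improver contract above can be sketched directly. A minimal illustrative resolver, assuming the simplest possible merge rule (newer non-empty fields win); field names and the merge policy are invented for the example.

```python
# Improver sketch: ONE pure function shared by both layers.
# A "fact" is a partial record; the guess is a merged record.
def improve(best_guess, new_facts):
    merged = dict(best_guess)
    for fact in new_facts:
        for field, value in fact.items():
            if value is not None:
                # Toy merge policy: a newer non-empty field wins.
                merged[field] = value
    return merged

# Batch layer: f(blank, {all facts}) ~> best possible guess.
def batch_resolve(all_facts):
    return improve({}, all_facts)

# Speed layer: f(current best, {new fact}) ~> new best guess.
def speed_resolve(current_best, new_fact):
    return improve(current_best, [new_fact])
```

Because both layers call the same `improve`, folding facts in one batch or one at a time lands on the same record — that shared contract is what lets the system asymptote to truth.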
  • 52. Two Big Ideas • Fine-grained control over architectural tradeoffs • Truth lives at the edge, not the middle Lets you trade off how quickly, how expensively, how true, how justified New Paradigm for how, when and where we handle truth
  • 53. Two Big Ideas • Fine-grained control over architectural tradeoffs • Approximate a pure function on all data • What we do now that architecture is free • Truth lives at the edge, not the middle Lets you trade off how quickly, how expensively, how true, how justified New Paradigm for how, when and where we handle truth
  • 54. Two Big Ideas • Fine-grained control over architectural tradeoffs • Approximate a pure function on all data • What we do now that architecture is free • Truth lives at the edge, not the middle • Data is syndicated forward from arrival to serving • “Query at write time” Lets you trade off how quickly, how expensively, how true, how justified New Paradigm for how, when and where we handle truth
 • 55. • Lambda architecture isn’t about speed layer / batch layer. • It’s about • moving truth to the edge, not the center; • enabling fine-grained tradeoffs against fundamental limits; • decoupling consumer from infrastructure; • decoupling consumer from asynchrony; • …with profound implications for how you build your teams. λ Arch: Truth, not Plumbing. This way of doing it simplifies the architecture: local interactions only; elimination of asynchrony. Which in turn profoundly simplifies development and operations, and allows you to structure your team the way you structure the architecture.
 • 56. Lambda Architecture for a Dinky Little Blog. So far, we’ve talked about a bunch of reasons why you might be led **to** a lambda architecture. And when there’s a new technology, people always first ask why they should do it differently, which is a wise thing to ask and a foolish thing to insist on. But let’s look at it from the other end: what life is like if this were the natural state of being. And to do so, let’s take the most unjustifiable case for a high-scale architecture: a blog engine.
 • 57. Blog: Traditional Approach • The familiar ORM, Rails-style blog: • Models: User, Article, Comment • Views: • /user/:id (user info, links to their articles and comments); • /articles (list of articles); • /articles/:id (article content, comments, author info)
 • 58. Data model: • User: id=3, name=“joeman”, homepage=http://…, photo=http://…, bio=“…” • Article: id=7, title=“The Crisis”, body=“These are…”, author_id=3, created_at=2014-08-08 • Comment: id=12, body=“lol”, article_id=7, author_id=3
 • 59. [Page mockups: a “user show” page (author name, photo, and bio; “Joe has written 2 Articles”, with snippets of “A Post about My Cat” and “Welcome to my blog”) and an “article show” page (article title and body, author sidebar, and comments: “First Post”, “lol”, “No comment”).]
 • 60. Traditional: Assemble on Read. [Diagram: articles, users, and comments tables behind the webserver.] DB models are the sole source of truth: normalized, used directly by reader and writer. The view is constructed from spare parts at read time.
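The assemble-on-read flow can be sketched as follows; plain dicts stand in for the database tables, and the field values echo the model slide. The point is that every render re-does the joins:

```python
# Sketch of "assemble on read": the article page is built from normalized
# models at request time. In-memory dicts stand in for the database;
# the table contents mirror the earlier data-model slide.

users = {3: {"name": "Joe", "photo": "http://…", "bio": "…"}}
articles = {7: {"title": "The Crisis", "body": "These are…", "author_id": 3}}
comments = {12: {"body": "lol", "article_id": 7, "author_id": 3}}

def show_article(article_id):
    """Every request re-does the joins: article -> author -> comments."""
    article = articles[article_id]
    author = users[article["author_id"]]
    article_comments = [
        {"body": c["body"], "author": users[c["author_id"]]["name"]}
        for c in comments.values() if c["article_id"] == article_id
    ]
    return {"title": article["title"], "body": article["body"],
            "author": author["name"], "comments": article_comments}
```

Each page view pays for the joins at read time, which is exactly the cost the syndicate-on-write approach on the next slide moves to write time.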
 • 61. Syndicate on Write. [Dataflow diagram: Δ article, Δ user, and Δ com’t change streams feed Reporters (“Biographers”), which write View Fragments consumed by the show pages; the articles, users, and comments models sit upstream.]
 • 62. Data Engineer: “What data model would you like to receive?” Web Engineer: {“title”:”…”, “body”:”…”, …} • (…hack hack hack…) • /articles/v1/show.json • Web Engineer: “lol um can I also have” {“title”:”…”, “body”:”…”, “snippet”:…} • (…hack hack hack…) • /articles/v2/show.json
  • 63. Syndicated Data • The Data is always _there_ • …but sometimes it’s more perfect than other times.
 • 64. Syndicated Data • Reports are cheap, single-concern, and faithful to the view. • You start thinking like the customer, not the database • All pages render in O(1): • Your imagination doesn’t have to fit inside a TCP timeout • Data is immutable, flows are idempotent: • Interface change is safe • Data is always _there_: • Asynchrony doesn’t affect consumers • Everything is decoupled: • Way harder to break everything. One of the worst pains in the ass is the query that takes 1500 milliseconds: it needs to be immediate, it’s usually mission-critical, and it’s expensive in all ways.
 • 65. • Lambda architecture isn’t about speed layer / batch layer. • It’s about • moving truth to the edge, not the center; • enabling fine-grained tradeoffs against fundamental limits; • decoupling consumer from infrastructure; • decoupling consumer from asynchrony; • …with profound implications for how you build your teams. λ Arch: Truth, not Plumbing. This way of doing it simplifies the architecture: local interactions only; elimination of asynchrony. Which in turn profoundly simplifies development and operations, and allows you to structure your team the way you structure the architecture.
 • 68. Changes Update Models. [Diagram: “update article”, “update user”, and “update comments” events flow as Δ article, Δ user, and Δ com’nt into the models (user, com’nt, article) and into a history log.] Models stay the same: User, Article, Comment, updated directly. Reporters can subscribe to models. On update, a reporter receives the updated object and can do anything else it wants; typically, it creates a new report. Reports live in the target domain, faithful to the data consumer: in this case, they look very close to the information hierarchy of the rendered page. All pages render in O(1). Your imagination is not constrained by the length of a TCP timeout.
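Here is one way the model-triggers-reporter flow might look in code. The subscription mechanics, fragment names, and field lengths are loosely modeled on the slides, not prescribed by the talk:

```python
# Sketch of "syndicate on write": reporters subscribe to model changes and
# rebuild the report fragments a view needs, so reads are O(1) lookups.
# All names and mechanics here are illustrative assumptions.

from collections import defaultdict

subscribers = defaultdict(list)   # model name -> list of reporter functions
fragments = {}                    # fragment key -> pre-rendered report

def reporter(model_name):
    """Decorator: register a reporter for updates to the named model."""
    def register(fn):
        subscribers[model_name].append(fn)
        return fn
    return register

def update(model_name, obj):
    """Writers call this; every subscribed reporter gets the updated object."""
    for fn in subscribers[model_name]:
        fn(obj)

@reporter("article")
def compact_article(article):
    fragments[("compact_article", article["id"])] = {
        "title": article["title"], "snippet": article["body"][:40]}

@reporter("article")
def expanded_article(article):
    fragments[("expanded_article", article["id"])] = dict(article)

# Write side: one model update fans out to every dependent fragment.
update("article", {"id": 7, "title": "The Crisis", "body": "These are…"})

def show_article(article_id):
    # Read side: a single constant-time fragment lookup, no joins.
    return fragments[("expanded_article", article_id)]
```

Note the contrast with assemble-on-read: the joins happen once, at write time, and every page render afterwards is a dictionary lookup.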
 • 69. Models Trigger Reporters. [Diagram: each change (Δ article, Δ user, Δ com’nt) triggers reporters that rebuild view fragments: expanded article, compact article, user’s # articles, expanded user, sidebar user, user’s # comments, compact comment, micro user.]
 • 70. Serve Report Fragments. [Diagram: the “show article” page is assembled from the pre-built fragments (expanded article, compact article, user’s # articles, expanded user, sidebar user, user’s # comments, compact comment, micro user); mockup of the rendered article page with title, body, author sidebar, and comments.]
 • 71. [Mockup of the rendered article page (title, body, author sidebar, comments) alongside the report it is rendered from:] article show rendered { "title":"Article Title", "body":"Article Body Lorem [...]", "author":{ ... }, "comments": [ {"comment_id":1, "body":"First Post",...}, {"comment_id":2, "body":"lol",...}, ... ]}
 • 72. Serve Report Fragments. [Diagram: the “show user” page is assembled from the same pre-built fragments (expanded article, compact article, user’s # articles, expanded user, sidebar user, user’s # comments, compact comment, micro user); mockup of the rendered page.]
 • 73. Reports are Cheap. [Diagram: the full pipeline: model updates trigger reporters, reporters write fragments, and the fragments serve the “list articles”, “show article”, “list user’s articles”, and “show user” pages.]
  • 74. Two Big Ideas • Fine-grained control over those architectural tradeoffs • Truth lives at the edge, not the middle Lets you trade off how quickly, how expensively, how true, how justified New Paradigm for how, when and where we handle truth
 • 80. Cannot Have Consistency. [Diagram: the product pipeline: Fetch Products pulls VendorListings; Unify Products connects key words, mfr & model, and ASIN identifiers; the Product Resolver (Resolve & Update Listings) folds listings into Unified Products.]
 • 82. Objections • Three objections: 1. Why hasn’t it been done before? 2. Architecture astronaut. 3. I’m not at high scale. • Response: 1. Chef/Puppet/Docker/etc. 2. Chef/Puppet/Docker/etc. 3. Shut up.
 • 84. Objections • Two APIs? Really? • Yes. Guilty. That’s dumb and must be fixed. • Spark or Samza, if you’re willing to only drink one flavor of Kool-Aid • EZbake.io, a CSC / 42six project to attack this • …but we shouldn’t be living at the low level anyhow
 • 85. Objections • Orchestration: “logical plan” (dataflow graph) • Optimization/Allocation: “physical plan” (what goes where) • Resource Projector: instantiates infrastructure • HTTP listeners, Trident streams, Oozie scheduling, ETL flows, cron jobs, etc. • Transport Machinery: • moves data around, fulfilling locality/ordering/etc. guarantees • Data Processing: UDFs and operators