Moving Towards a Streaming Architecture

Moving Towards a
Streaming Architecture
Garry Turkington (CTO), Gabriele Modena (Data Scientist)
Improve Digital

Real Time Adtech
SSP
http://improvedigital.com/
@ImproveDigital

Data sources
 A geographically distributed fleet of ad servers push out several types of data,
primarily:
- 1 minute metrics packets
- 15 minute full log files
 Previously used ad hoc system to process the former, Hadoop for the latter

Conceptualizing the data
 We think of event streams when considering the operational system
 But have this sliced into many log files for practical reasons
 Recreating the stream-based view gets us back to what is happening
 And brings data creation and processing closer together

How did we get here
 Need to process more data, faster
- Reporting
- Decision Making
 Evolution of the system to meet scalability requirements
 Stream processing by definition
 It is changing the way we think about data collection and distribution
 Not everything is obvious
 It opened new doors

Let’s take a step back
 Aggregates grow in number and size
 Keep a copy in HDFS
 Run job, then go and make coffee
 Generate new / other aggregates / ad hoc datasets
 Analytics datawarehouse (column store), denormalized data model
 BI, reporting
 Hadoop for logs and event processing (eg. ETL, simulation, learning, optimization)

Incremental processing
 Ingest data every eg. 24 hours (/data/raw/event/d=YYYY-MM-DD)
 Scheduling and partitioning are straightforward
 Schema migrations: Avro + HCatalog
 Fixed length time windows
- What is a good length?
- Ok if it takes 1 hour to process 15 minutes of data
 Need to reprocess data
 It can be made efficient (eg. DataFu Hourglass)
 MapReduce

A more interactive system
We added tools to circumvent MapReduce
- Latency, really
- Still a “batch” mindset
Cloudera Impala (+ Parquet)
- Fast, familiar
Move computation closer to data
- Spark (among other things) for prototyping
- Python/R integration, batch and “streaming”

Outcome
 Multiple overlapping data sources, processing tools
- point-to-point
 Different pipelines for analytics and production models
- Divide
- Where is the “correct” data ?

All in all
 Data collection and distribution is critical
 We are still looking at “yesterday’s” data
 We need a different level of abstraction

Challenges of near-realtime data
 Instructive to ask why near-realtime data analysis is hard
 In Hadoop batch world we’d be happy processing 24hrs of data in 4 hrs
 So why do we flinch to process 24hrs in 24hrs?
 What allows batch workloads to process in sub-linear time:
- Scheduling
- Parallelization
- Partitioning

The streaming view
 Consider these from the streaming perspective:
- Scheduling is challenging because of unknown and variable data rates
- Parallelization breaks our nice logical model of a single data stream
- Partitioning can’t be simple file size/no of blocks
- The method of partitioning the stream is fundamental

Kafka
 We use Kafka (http://kafka.apache.org) for all data we view as streams
 It is resilient, performant and provides a very clear topic-based interface
 Producers write to topics, consumers read from topics
 Topics can be partitioned based on a producer-provided key
 Provides at-least once message delivery semantics
 Messages are persistently stored – topics can be re-read

Samza
 We use Samza (http://samza.incubator.apache.org) for stream processing
 Analogous to MapReduce atop YARN atop HDFS
 We have Samza atop YARN atop Kafka
E L E M O D E N A L E A R N IN G H A D O O P 2
A p a c h e S a m z a
afka for streaming !
arn for resource
anagement and exec!
amzaAPI for
ocessing!
weet spot: second,
inutes
Y A R N
S a m z a A P I
K a fk a

Samza API
A simple call-back API that should be familiar to MapReduce developers:
public interface StreamTask{
void process(IncomingMessageEnvelope envelope,
MessageCollector collector, TaskCoordinator) ;
}
public interface WindowableTask{
void window(MessageCollector collector,
StreamCoordinator coordinator);
}

Kafka/Samza integration
 A Samza job is comprised of multiple tasks
 Each Kafka partition topic is consumed by only one Samza task
 Each YARN container runs multiple Samza tasks
 The output of a task is often written to a different topic
 Overall applications are then constructed by composition of jobs

Stream job output
 You’ve processed the data, now what?
 Need consider how the output of each stream partition will be processed
 A downstream destination with an idempotent interface is ideal
 Otherwise try to have expressions of unique state
 Hardest is when the outputs need applied in sequence

Example: log data
 It’s a common model: server process rolls a log every x minutes
 Each resultant file is ingested into Kafka and read by multiple Samza jobs
 Each Samza task receives a series of potentially interleaved log chunks
 Consider a task receiving data from servers S1, S2 and S3 across time periods
T1..T3
 It’s view may become something like:
S1[T1..T2], S2[T1..T2], S1[T2..T3], S3[T1..T2]…

Example - Streaming aggregation
 Input is a Stream of records with the form (key, value)
 Output is a series of (key, aggregate)
 3 approaches to producing these outputs:
- Each task pushes local counts downstream
- Each task pushes to another stream aggregated by a different job
- Each task uses local state to produce final aggregate values

Comparing stream aggregation
 Option 1: each stream partition outputs local values for each key it sees across a
time window
- Each partial result is an update record (e.g. increment key k at time t by x)
- Downstream consumer has the responsibility of combining partial results
 Option 2 : aggregate partial results in a second stream processing job
- The output becomes a series of state changes
(set the value of key k at time t to x)

Comparing stream aggregation contd
 Option 3 - push state into the state processors
- Samza tasks can hold persistent local state
- Creates a key/value store modelled as a Kafka topic
- Ensure that all records for each key are sent to the same partition
- State processing task can then create unique state aggregates locally

A side effect of Kafka
 A firehose for event data
- Can tap into with Samza
- Can tap into with Spark
 Unifies how we access and publish data
- Streams are defined by application / use case
- These can come from other teams
- Single entry point, multiple outputs

Analytics on streams
Incremental processing pipelines as a basis
Tried to reuse logic
 Not that simple
Conceptualization vs. implementation

Time windows
 Temporal constraints
 When is data ready?
- The key is defining “ready”, which is weaker than “complete”
- Enough to make an informed decision (this varies case by case)
- We can update the dataset to improve a confidence interval
 When are we done processing?
- Tradeoff between timeliness and accuracy

Schema evolution
 Schema changes appear within / across time windows
- Not a solved problem
 Stop streaming, propagate changes, replay or resume
 SAMZA-317
- Avro SerDe
- Schema stream

Challenges
 Data completeness
 Windows length
 Tradeoff between finding answers / satisfying temporal constraints
 Need to adjust our approach (representation and data structures)
- approximate
- probabilistic
- iterations to convergence
 This is a work in progress

Model based outlier detection
Previously:
Batch and real time datasets = different approaches
Now:
Same dataset
Model based anomaly detection
 Reuse knowledge and infrastructure
What is normal?
 How do we set thresholds ?
 Literature review (q-digest, t-digest)

Testing and alerting
 Some features are hard to test
 Functionally correct but unexpected consequences
 unexpected ~ hard to predict
 advertisment is a complex system
 Bugs are outliers we can detect (sort of)

Testing as predictive modelling
 Monitoring (next to testing)
 Make software monitorable
 Think of testing as an experiment
 Establish short, continuous, feedback loops

Implications
 How do we use real time output
- “live” dashboards
- Instrumentation and alerting (predict failure)
- feed optimization models
- back propagating methods into ETL
 How we think about feedback loops
 Make data more consumable

Conclusion
 Streaming is part of the broader system
 We are fine tuning how the two fit together
 Data collection and distribution is important
 publish results follows
 Kafka + Samza
 data and processing model that fits our applications well
 Conceptualization vs. implementation
 Real time != incremental
 A certain degree of uncertainty
 Tradeoffs

Moving Towards a Streaming Architecture

More Related Content

Moving Towards a Streaming Architecture