Logging infrastructure for Microservices using StreamSets Data Collector

Logging infrastructure for MicroServices using StreamSets Data Collector
Logging Infrastructure for microservices using StreamSets
Data Collector
Presenter:
Virag Kothari
Software Engineer at StreamSets

© 2015 StreamSets, Inc. All rights reserved.
About StreamSets
● Headquartered in San Francisco, CA
● Deep expertise in enterprise data management and integration
○ Girish Pancha, CEO (Formerly Chief Product Officer at Informatica)
○ Arvind Prabhakar, CTO (Formerly Director, Engineering for Integration at Cloudera)
○ Team includes Apache PMC members for Flume, Sqoop, Hadoop, Oozie, Hive, Storm

Containerized services
Run batch jobs, application jobs, microservices
Logging is key in dynamic environments
HBase/Cassandra
HDFS/S3
Elasticsearch
Docker Container
Docker Container
Kafka
Application
Flume/Logstash

Challenges
Semi structured logs
Semantic drift
-> Schema changes
-> Malformed records
Infrastructure drift
->New apps with their own log format

StreamSets Data Collector (SDC) Pipeline
Origin
(Log Source)
Processor
Destination
(Kafka)
On
success
Kafka/Write
to File
On error
Application
Docker
container

Handle semantic and infrastructure drift
● Built in transformations
● Scripting support
● Troubleshoot using snapshots
● Rules and alerting

Data at scale
● Streaming/Batch Cluster deployments
● Batch - MapReduce
● Streaming - Spark Streaming on Mesos and Yarn
● Storm, Samza and others?

Cluster pipeline
Kafka
Spark executor
Task Task
SDC SDC
Yarn/Mesos
HDFS/S3
HBase/Cassandra
Hive
Solr

Spark Streaming + Kafka
Direct Approach
One to one mapping between Kafka and RDD partitions
Allocate executors equal to Kafka partitions
Multiple tasks within executor
Kafka partition RDD partition SDC

Spark on Yarn
Client vs Cluster mode
Fault tolerant driver
Jars available through Distributed Cache
Classloader isolation due to conflicting libraries

Spark on Mesos
Mesos not a framework manager
REST endpoint provided by Spark to manage the Mesos framework
No Distributed Cache
Fault-tolerance through pipeline-level retries

Thank you
http://streamsets.com/careers/
We’re hiring...
https://github.com/streamsets

Logging infrastructure for Microservices using StreamSets Data Collector

Related slideshows

More Related Content

Logging infrastructure for Microservices using StreamSets Data Collector