Logging Infrastructure for Microservices Using StreamSets Data Collector
Presenter:
Virag Kothari
Software Engineer at StreamSets
Open-Source Continuous Ingest
© 2015 StreamSets, Inc. All rights reserved.
About StreamSets
● Headquartered in San Francisco, CA
● Deep expertise in enterprise data management and integration
○ Girish Pancha, CEO (Formerly Chief Product Officer at Informatica)
○ Arvind Prabhakar, CTO (Formerly Director, Engineering for Integration at Cloudera)
○ Team includes Apache PMC members for Flume, Sqoop, Hadoop, Oozie, Hive, Storm
Containerized services
Run batch jobs, application jobs, microservices
Logging is key in dynamic environments
[Diagram: Application running in Docker containers; Flume/Logstash; Kafka; stores: HBase/Cassandra, HDFS/S3, Elasticsearch]
Challenges
Semi-structured logs
Semantic drift
-> Schema changes
-> Malformed records
Infrastructure drift
-> New apps with their own log format
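
A minimal sketch of how the drift cases above might be told apart for JSON-formatted log lines. The baseline field set and the `classify` helper are hypothetical names used only for illustration; they are not part of StreamSets Data Collector.

```python
import json

# Hypothetical baseline schema for one application's log records.
EXPECTED_FIELDS = {"timestamp", "level", "message"}


def classify(line, expected=EXPECTED_FIELDS):
    """Classify one raw log line as 'ok', 'schema_change', or 'malformed'."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Not parseable at all: a malformed record.
        return "malformed", line
    if not isinstance(record, dict):
        # Valid JSON, but not an object with named fields.
        return "malformed", line
    fields = set(record)
    if fields != expected:
        # Parseable, but the set of fields has drifted from the baseline.
        return "schema_change", {
            "added": sorted(fields - expected),
            "missing": sorted(expected - fields),
            "record": record,
        }
    return "ok", record
```

A new app with its own log format would show up here as a steady stream of `schema_change` or `malformed` results rather than an occasional one.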
StreamSets Data Collector (SDC) Pipeline
[Diagram: Origin (Log Source) -> Processor -> Destination (Kafka)]
On success: Destination (Kafka)
On error: Kafka/Write to File
Application: runs in a Docker container
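
The origin, processor, destination flow with separate success and error paths can be sketched in plain Python as below. This is an illustration of the routing idea only, not SDC's implementation; `run_pipeline` and `parse_level` are hypothetical names, and lists stand in for Kafka and the error file.

```python
def run_pipeline(origin, processor, destination, error_sink):
    """Send each record through the processor; route failures to the error sink."""
    for record in origin:
        try:
            out = processor(record)
        except Exception as exc:
            # On error: keep the original record together with the reason.
            error_sink.append({"record": record, "error": str(exc)})
        else:
            # On success: hand the processed record to the destination.
            destination.append(out)


def parse_level(line):
    """Hypothetical processor: turn 'LEVEL message' into a dict."""
    level, sep, message = line.partition(" ")
    if not sep or level not in {"INFO", "WARN", "ERROR"}:
        raise ValueError("unrecognized log line")
    return {"level": level, "message": message}
```

For example, feeding `["INFO started", "garbage"]` through `run_pipeline` with `parse_level` puts one parsed record in the destination list and one entry in the error list.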
Handle semantic and infrastructure drift
● Built-in transformations
● Scripting support
● Troubleshoot using snapshots
● Rules and alerting
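
As a rough illustration of the rules-and-alerting idea, a rule can be a threshold over pipeline counters. The function name and the 5% default below are assumptions made for this sketch, not SDC settings.

```python
def error_rate_alert(ok_count, error_count, threshold=0.05):
    """Return True when the share of error records exceeds the threshold."""
    total = ok_count + error_count
    if total == 0:
        # Nothing processed yet, so there is nothing to alert on.
        return False
    return error_count / total > threshold
```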
Data at scale
● Streaming/Batch Cluster deployments
● Batch - MapReduce
● Streaming - Spark Streaming on Mesos and YARN
● Storm, Samza and others?
Cluster pipeline
[Diagram: Kafka -> Spark executor on YARN/Mesos, with one SDC per task -> HDFS/S3, HBase/Cassandra, Hive, Solr]
Spark Streaming + Kafka
Direct Approach
One-to-one mapping between Kafka and RDD partitions
Allocate executors equal to Kafka partitions
Multiple tasks within executor
Kafka partition -> RDD partition -> SDC
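
The one-to-one mapping can be sketched without Spark: each batch gets exactly one offset range per Kafka partition, and each range becomes one RDD partition. The `OffsetRange` tuple here is a plain-Python stand-in that mirrors the fields of Spark's class of the same name; `plan_batch` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for Spark's OffsetRange: one per Kafka partition per batch.
OffsetRange = namedtuple(
    "OffsetRange", "topic partition from_offset until_offset"
)


def plan_batch(last_offsets, latest_offsets, topic):
    """Build one offset range (and so one RDD partition) per Kafka partition."""
    return [
        OffsetRange(topic, p, last_offsets.get(p, 0), latest_offsets[p])
        for p in sorted(latest_offsets)
    ]
```

With three Kafka partitions, `plan_batch` returns three ranges, which is why the slide suggests sizing executors to the Kafka partition count.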
Spark on YARN
Client vs. cluster mode
Fault-tolerant driver
Jars available through Distributed Cache
Classloader isolation due to conflicting libraries
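
A small sketch of how the client/cluster choice and the extra jars surface on the `spark-submit` command line. The helper and the jar and class names in the usage note are hypothetical; the flags themselves (`--master`, `--deploy-mode`, `--class`, `--jars`) are standard `spark-submit` options. How SDC itself launches Spark is not shown in the slides.

```python
def spark_submit_args(app_jar, main_class, deploy_mode="cluster", extra_jars=()):
    """Assemble a spark-submit argument list for a YARN deployment."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    args = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,  # in cluster mode the driver runs inside YARN
        "--class", main_class,
    ]
    if extra_jars:
        # Jars listed here are shipped to the cluster along with the application.
        args += ["--jars", ",".join(extra_jars)]
    args.append(app_jar)
    return args
```

For instance, `spark_submit_args("app.jar", "com.example.Main", extra_jars=["a.jar", "b.jar"])` yields a list that can be passed to `subprocess.run`.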
Spark on Mesos
Mesos is not a framework manager
REST endpoint provided by Spark to manage the Mesos framework
No Distributed Cache
Fault tolerance through pipeline-level retries
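
Pipeline-level retries can be pictured as a wrapper that restarts the pipeline a bounded number of times. The sketch below assumes exponential backoff and a hypothetical `start_pipeline` callable; the slides do not say which retry policy SDC uses.

```python
import time


def run_with_retries(start_pipeline, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call start_pipeline, retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return start_pipeline()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the caller.
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without actually waiting.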
Thank you
We’re hiring...
http://streamsets.com/careers/
https://github.com/streamsets
