This document discusses different approaches for achieving exactly-once semantics when streaming data from Kafka using Spark Streaming. It presents idempotent and transactional approaches. The idempotent approach works for transformations that have a natural unique key, while the transactional approach works for any transformation by committing offsets and results together in a transaction. It also compares receiver-based and direct streaming, noting the pros and cons of each, and how to store offsets to enable exactly-once processing when using the direct approach.
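In the spirit of the transactional approach described above, here is a minimal sketch in which a batch's Kafka offset ranges and its results are committed in a single database transaction, so a replayed batch cannot double-count. The JDBC URL and the `results`/`stream_offsets` tables are hypothetical placeholders.

```scala
import java.sql.DriverManager
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.HasOffsetRanges

def saveTransactionally(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
  stream.foreachRDD { rdd =>
    // Grab the offset ranges from the source RDD before transforming it.
    val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    val counts  = rdd.map(r => (r.value, 1L)).reduceByKey(_ + _).collect()

    val conn = DriverManager.getConnection("jdbc:postgresql://db:5432/app")
    try {
      conn.setAutoCommit(false)

      val res = conn.prepareStatement("INSERT INTO results (word, cnt) VALUES (?, ?)")
      counts.foreach { case (w, c) =>
        res.setString(1, w); res.setLong(2, c); res.addBatch()
      }
      res.executeBatch()

      // Advance each partition's offset only if it still sits where this
      // batch started; on a replay the guard matches no rows.
      val off = conn.prepareStatement(
        "UPDATE stream_offsets SET until_offset = ? " +
        "WHERE topic = ? AND part = ? AND until_offset = ?")
      val moved = offsets.map { o =>
        off.setLong(1, o.untilOffset); off.setString(2, o.topic)
        off.setInt(3, o.partition);    off.setLong(4, o.fromOffset)
        off.executeUpdate()
      }
      if (moved.contains(0)) conn.rollback()  // replayed batch: discard results
      else conn.commit()                      // offsets + results land together
    } finally conn.close()
  }
```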
Real-Time Streaming with Apache Spark Streaming and Apache Storm. Description and comparison of both systems.
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
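For context, since Spark 1.6 the unified memory manager lets execution and storage borrow from a shared region governed by two settings. A minimal sketch of adjusting them (the values shown are the defaults, included only to illustrate the knobs, not as recommendations):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")
  .master("local[*]")
  // Fraction of (heap - 300 MB reserved) shared by execution and storage.
  .config("spark.memory.fraction", "0.6")
  // Portion of that region reserved for storage; cached blocks within it
  // are immune to eviction by execution.
  .config("spark.memory.storageFraction", "0.5")
  .getOrCreate()
```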
Apache Spark has emerged over the past year as the heir apparent to Hadoop MapReduce. Spark can process data in memory at very high speed while still being able to spill to disk if required. Spark’s powerful yet flexible API allows users to write complex applications easily, without worrying about the internal workings or how the data gets processed on the cluster. Spark also comes with an extremely powerful Streaming API for processing data as it is ingested. Spark Streaming integrates with popular ingestion systems such as Apache Flume, Apache Kafka, and Amazon Kinesis, allowing users to process data as it arrives. In this talk, Hari will discuss the basics of Spark Streaming, its API, and its integration with Flume, Kafka, and Kinesis. Hari will also walk through a real-world example of a Spark Streaming application, showing how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application's execution will be presented, which can help in understanding good practices for writing such applications. Finally, Hari will discuss how to write a custom application and a custom receiver to receive data from other systems.
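As a taste of the custom receiver topic, here is a hedged sketch of Spark Streaming's Receiver API: start a background thread in `onStart`, push records with `store()`, and ask the framework to reconnect with `restart()`. The socket source is just a stand-in for "another system".

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive on a separate thread so onStart returns immediately.
    new Thread("line-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = ()  // receive() exits once isStopped becomes true

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)              // hand the record to Spark Streaming
        line = reader.readLine()
      }
      reader.close(); socket.close()
      restart("Source closed, reconnecting")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

// Usage: ssc.receiverStream(new LineReceiver("localhost", 9999))
```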
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues such as sizing executors too small or too large, shuffle blocks exceeding the 2 GB size limit, data skew slowing down jobs, and excessive numbers of stages. The key recommendations are to right-size executors and partitions, increase the number of partitions to keep shuffle blocks within limits, use techniques like key salting to mitigate skew, and favor transformations like reduceByKey over groupByKey to minimize shuffles and memory usage.
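A small sketch of the reduceByKey-over-groupByKey recommendation and of key salting, assuming an existing SparkContext `sc` (as in spark-shell):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// groupByKey shuffles every (key, value) pair and buffers each key's values
// in memory before summing: painful when a key is hot.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey pre-aggregates within each partition, so the shuffle carries
// only one partial sum per key per partition.
val viaReduce = pairs.reduceByKey(_ + _)

// Salting sketch for a skewed key: spread it over 8 sub-keys, aggregate,
// then strip the salt and aggregate once more.
val salted  = pairs.map { case (k, v) => (s"$k#${scala.util.Random.nextInt(8)}", v) }
val partial = salted.reduceByKey(_ + _)
val totals  = partial
  .map { case (saltedKey, v) => (saltedKey.split("#")(0), v) }
  .reduceByKey(_ + _)
```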
The document discusses how Spark can supercharge ETL workflows, running them faster and with less code than traditional Hadoop approaches. It provides examples of using Spark for tasks like sessionization of user clickstream data. Best practices are covered, such as working around JVM issues like long full-GC pauses, along with tips for deployment on EC2. Future Spark improvements, such as SQL support and Java 8 support, are also mentioned.
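A toy version of the sessionization idea, assuming an existing SparkContext `sc` and an assumed (user, timestamp) schema; the real pipeline in the talk is presumably more involved. Clicks are grouped per user, and a new session starts whenever consecutive events are more than 30 minutes apart.

```scala
case class Click(user: String, ts: Long)   // ts = epoch millis; schema assumed

val clicks = sc.parallelize(Seq(
  Click("u1", 0L), Click("u1", 60000L), Click("u1", 7200000L), Click("u2", 0L)))

val gapMs = 30 * 60 * 1000L
val sessions = clicks
  .map(c => (c.user, c.ts))
  .groupByKey()                             // assumes one user's clicks fit in memory
  .mapValues { tss =>
    val sorted = tss.toSeq.sorted
    // Pair each timestamp with a session id that bumps on every large gap.
    sorted.tail.scanLeft((sorted.head, 0)) { case ((prev, id), t) =>
      (t, if (t - prev > gapMs) id + 1 else id)
    }
  }

sessions.collect().foreach(println)
```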
Big Data with Hadoop & Spark Training: http://bit.ly/2L6bZbn
This CloudxLab Introduction to Spark Streaming & Apache Kafka tutorial helps you understand Spark Streaming and Kafka in detail. Below are the topics covered in this tutorial:
1) Spark Streaming - Workflow
2) Use Cases - E-commerce, Real-time Sentiment Analysis & Real-time Fraud Detection
3) Spark Streaming - DStream
4) Word Count Hands-on using Spark Streaming (see the sketch after this list)
5) Spark Streaming - Running Locally vs. Running on a Cluster
6) Introduction to Apache Kafka
7) Apache Kafka Hands-on on CloudxLab
8) Integrating Spark Streaming & Kafka
9) Spark Streaming & Kafka Hands-on
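A minimal word-count sketch in the spirit of topic 4, using the classic socketTextStream source; host, port, and batch interval are arbitrary choices. Run `nc -lk 9999` in a terminal and type text; counts for each 5-second batch are printed.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```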
Streaming SQL allows users to query streaming data using standard SQL. Some key benefits include:
- SQL is a widely used language that lets users focus on what data is needed rather than how to process it. The system can optimize queries to meet quality-of-service needs.
- Streaming queries can join streams with static relations or aggregate streams using windows. Monotonic columns like timestamps help the system make progress while maintaining accuracy.
- Materialized views allow querying recent historical data from streams. This enables applications like dashboards that summarize the last hour of data.
- Streaming SQL provides a unified way to query both streaming and static data sources using a single language, which simplifies development of applications that use both.
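As one concrete illustration of the windowed-aggregation idea, here is a sketch using Spark Structured Streaming as the SQL dialect (an assumption; the description above is not Spark-specific). The built-in `rate` test source emits (timestamp, value) rows.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("streaming-sql-sketch")
  .master("local[2]")
  .getOrCreate()

val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
events.createOrReplaceTempView("events")

// `timestamp` is the monotonic column that lets the engine decide when a
// one-minute window can be finalized.
val counts = spark.sql(
  """SELECT window(timestamp, '1 minute') AS win, COUNT(*) AS cnt
    |FROM events
    |GROUP BY window(timestamp, '1 minute')""".stripMargin)

counts.writeStream.outputMode("update").format("console").start().awaitTermination()
```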
This document provides an introduction to Spark Structured Streaming, a scalable, fault-tolerant stream processing engine built on the Spark SQL engine. Structured Streaming lets users express streaming computations the same way they would express batch computations, and it guarantees end-to-end exactly-once processing. The document also walks through a code example of a word count application using Structured Streaming and discusses the output modes available for writing streaming query results.
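A minimal sketch along the lines of the word count example the document mentions (host and port are placeholders): read lines from a socket, split into words, keep a running count, and print the full result table after every trigger ("complete" output mode).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("structured-word-count")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()

val counts = lines.as[String]
  .flatMap(_.split("\\s+"))
  .groupBy("value")
  .count()

counts.writeStream.outputMode("complete").format("console").start().awaitTermination()
```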
Common patterns and anti-patterns to consider when integrating Kafka, Cassandra and Storm for a real-time streaming analytics platform.
700 Updatable Queries Per Second: Spark as a Real-Time Web Service. Find out how to use Apache Spark with FiloDB for low-latency queries, something you never thought possible with Spark. Scale it down, not just scale it up!
This document discusses two approaches for receiving data from Kafka in Spark Streaming: the receiver-based approach and the direct approach. The receiver-based approach uses Kafka's high-level consumer API and requires a write-ahead log (WAL) to avoid data loss, which yields at-least-once semantics. The direct approach queries Kafka for offsets itself, provides simplified parallelism with a 1:1 mapping between Kafka partitions and RDD partitions, and is more efficient because no WAL is needed; by tracking offsets in checkpoints or storing them alongside the output, it can achieve exactly-once semantics. It also covers how to set up a Spark Streaming application with Kafka, including library dependencies, Kafka consumer properties, topic subscription, and location strategies.
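A hedged sketch of setting up the direct approach with the spark-streaming-kafka-0-10 integration; the broker address, group id, and topic name are placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-direct-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)  // commit offsets ourselves
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,  // spread partitions evenly across executors
  Subscribe[String, String](Seq("events"), kafkaParams)
)

stream.map(_.value).print()
ssc.start()
ssc.awaitTermination()
```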
O'Reilly webcast with Evan Chan and myself on the new SNACK stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB, and Kafka.
This talk discusses three dimensions for evaluating HDFS versus S3: cost, SLAs (availability and durability), and performance. It then provides a deep dive into the challenges of writing to cloud storage with Apache Spark and shares transactional commit benchmarks for Databricks I/O (DBIO) compared to Hadoop.
This document summarizes Tagomori Satoshi's presentation on handling "not so big data" at the YAPC::Asia 2014 conference. It discusses different types of data processing frameworks for various data sizes, from sub-gigabytes up to petabytes. It provides overviews of MapReduce, Spark, Tez, and stream processing frameworks. It also discusses what Hadoop is and how the Hadoop ecosystem has evolved to include these additional frameworks.
Hands-on session on Big Data processing using Apache Spark and the Hadoop Distributed File System. This is the first session in the "Apache Spark Hands-on" series. Topics covered:
+ Introduction to Apache Spark
+ Introduction to RDD (Resilient Distributed Datasets)
+ Loading data into an RDD
+ RDD Operations - Transformations (see the sketch after this list)
+ RDD Operations - Actions
+ Hands-on demos using CloudxLab
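A small sketch of the RDD basics listed above, showing that transformations are lazy and actions trigger computation; the input path is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

val lines     = sc.textFile("hdfs:///data/sample.txt")  // load data into an RDD
val words     = lines.flatMap(_.split("\\s+"))          // transformation: lazy
val longWords = words.filter(_.length > 4)              // transformation: lazy

println(longWords.count())                              // action: triggers the job
longWords.take(5).foreach(println)                      // action: fetch a sample
```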