© 2015 MapR Technologies 1© 2014 MapR Technologies
Overview of Apache Spark Streaming
Agenda
• Why Apache Spark Streaming?
• What is Apache Spark Streaming?
– Key Concepts and Architecture
• How it works by Example
Why Spark Streaming?
• Process time series data:
– Results available in near real time
• Use Cases
– Social network trends
– Website statistics, monitoring
– Fraud detection
– Advertising click monetization
[Diagram: time-stamped data is put into storage and read as data for real-time monitoring]
• Time-stamped data: sensors, system metrics, events, log files
• Stock ticker, user activity
• High volume, high velocity
What is time series data?
• Stuff with timestamps
– Sensor data
– log files
– Phones
Examples: credit card transactions, web user behaviour, social media, log files, geodata, sensors

Why Spark Streaming?
What if?
• You want to analyze data as it arrives
For example, time series data: sensors, clicks, logs, stats
Batch Processing
It's 6:01 and 72 degrees
It's 6:02 and 75 degrees
It's 6:03 and 77 degrees
It's 6:04 and 85 degrees
It's 6:05 and 90 degrees
It's 6:06 and 85 degrees
It's 6:07 and 77 degrees
It's 6:08 and 75 degrees
It was hot at 6:05 yesterday!
Batch processing may be too late for some events
Event Processing
It's 6:05 and 90 degrees
Someone should open a window!
Streaming
It's becoming important to process events as they arrive
What is Spark Streaming?
• Extension of the core Spark API
• Enables scalable, high-throughput, fault-tolerant stream processing of live data
[Diagram: data sources and data sinks]

Stream Processing Architecture
[Diagram components: streaming sources/apps; data ingest into topics (MapR-FS); stream processing; data storage (MapR-DB, MapR-FS); apps]
Key Concepts
• Data Sources:
– File based: HDFS
– Network based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka actors
• Transformations
• Output Operations
Spark Streaming Architecture
• Divide the data stream into batches of X seconds (the batch interval)
– The resulting sequence of RDDs is called a DStream
[Diagram: input data stream → Spark Streaming → DStream of RDD batches, one per batch interval: RDD @ time 1 (data from time 0 to 1), RDD @ time 2 (data from time 1 to 2), RDD @ time 3 (data from time 2 to 3)]
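The batching idea on this slide can be modelled without Spark. The following is a minimal sketch, assuming plain Scala collections stand in for the input stream and for the RDDs. `Event`, `toBatches`, and the sample values are hypothetical names and data chosen for illustration; they are not part of the Spark API.

```scala
// Plain-Scala model of micro-batching; no Spark dependency.
// Event and toBatches are illustrative names, not Spark classes.
final case class Event(timeSec: Double, payload: String)

// Assign each event to the batch covering [k * interval, (k + 1) * interval)
// and return the non-empty batches ordered by batch index k.
def toBatches(events: Seq[Event], intervalSec: Double): Seq[(Int, Seq[Event])] =
  events
    .groupBy(e => math.floor(e.timeSec / intervalSec).toInt)
    .toSeq
    .sortBy(_._1)

val events = Seq(
  Event(0.2, "a"), Event(0.9, "b"), // time 0 to 1
  Event(1.5, "c"),                  // time 1 to 2
  Event(2.1, "d"), Event(2.8, "e")  // time 2 to 3
)

// With a 1-second batch interval the five events fall into three batches.
val batches = toBatches(events, intervalSec = 1.0)
```

One difference worth noting: this model only returns intervals that contain events, whereas Spark Streaming generates an RDD for every batch interval, even when it is empty.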
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• Read-only collection of elements
• Operated on in parallel
• Cached in memory
– Or on disk
• Fault tolerant
Working With RDDs
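[Diagram: RDD → transformations → RDD → action → value]
linesRDD = sc.textFile("SomeFile.txt")
# transformation: returns a new RDD
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
# actions: return a value to the driver
linesWithErrorRDD.count()   # for example 6
linesWithErrorRDD.first()   # the first error line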
Process DStream
• Process using transformations (for example map, reduceByKey, count)
– Each transformation creates new RDDs, and so a new DStream
[Diagram: each RDD of the input DStream (RDD @ time 1, 2, 3, holding data from time 0 to 1, 1 to 2, 2 to 3) is transformed into the RDD for the same batch in the output DStream]
Key Concepts
• Data Sources
• Transformations: create a new DStream
– Standard RDD operations: map, filter, union, reduce, join, …
– Stateful operations: updateStateByKey(function), countByValueAndWindow, …
• Output Operations
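To make the stateful case concrete, here is a minimal plain-Scala sketch of what an operation like updateStateByKey does: it carries per-key state from one batch to the next. The update function has the shape Spark expects (new values for a key, previous state, new state), but `updateCount`, `step`, and the sample batches are hypothetical and run on ordinary collections rather than on DStreams.

```scala
// Plain-Scala model of a stateful DStream operation; no Spark dependency.
// The update function has the shape that updateStateByKey expects:
// (new values for a key in this batch, previous state) => new state.
def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(state.getOrElse(0) + newValues.sum)

// Apply one batch of (key, value) pairs to the state carried from earlier batches.
def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
  val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  val keys = state.keySet ++ grouped.keySet
  keys.flatMap { k =>
    updateCount(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
  }.toMap
}

// Three made-up batches of (sensorId, 1) pairs.
val batchesOfPairs = Seq(
  Seq("pumpA" -> 1, "pumpB" -> 1),
  Seq("pumpA" -> 1),
  Seq("pumpB" -> 1, "pumpB" -> 1)
)

// State after each batch, starting from an empty state.
val states = batchesOfPairs.scanLeft(Map.empty[String, Int])(step).tail
```

In a real Spark Streaming job the state is kept by the framework between batches, which is why stateful operations require a checkpoint directory to be configured on the StreamingContext.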

Spark Streaming Architecture
• Processed results are pushed out in batches
[Diagram: input data stream → Spark Streaming → DStream RDD batches (RDD @ time 1, 2, 3) → Spark → batches of processed results]
Key Concepts
• Data Sources
• Transformations
• Output Operations: trigger computation
– saveAsHadoopFiles – save to HDFS
– saveAsHadoopDataset – save to HBase
– saveAsTextFiles
– foreachRDD – do anything with each batch's RDD
Learning Goals
• How it works by example
Use Case: Time Series Data
[Diagram: oil pump sensor data → Spark Streaming → Spark processing → data for real-time monitoring]

Convert Line of CSV data to Sensor Object
case class Sensor(resid: String, date: String, time: String,
hz: Double, disp: Double, flo: Double, sedPPM: Double,
psi: Double, chlPPM: Double)
def parseSensor(str: String): Sensor = {
val p = str.split(",")
Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
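The parser on this slide throws if a line has too few fields or a non-numeric value. As a hedged sketch of one way to guard against that, the variant below returns an `Option`. `parseSensorOpt` and both sample lines are hypothetical; only the `Sensor` case class and its field order come from the slide.

```scala
import scala.util.Try

// Same case class and field order as on the slide.
case class Sensor(resid: String, date: String, time: String,
  hz: Double, disp: Double, flo: Double, sedPPM: Double,
  psi: Double, chlPPM: Double)

// Hypothetical safe variant: returns None for lines with the wrong
// number of fields or non-numeric values instead of throwing.
def parseSensorOpt(str: String): Option[Sensor] = {
  val p = str.split(",").map(_.trim)
  if (p.length != 9) None
  else Try(
    Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
      p(6).toDouble, p(7).toDouble, p(8).toDouble)
  ).toOption
}

// Made-up sample lines; the values are for illustration only.
val goodLine = "COHUTTA,3/10/14,1:01,10.37,1.73,17.0,0.86,84.0,1.43"
val badLine  = "COHUTTA,3/10/14,1:01,not-a-number"

val parsed  = parseSensorOpt(goodLine) // Some(Sensor(...))
val skipped = parseSensorOpt(badLine)  // None
```

With a parser of this shape, a stream of lines could be cleaned with `flatMap` instead of `map`, so malformed lines are dropped rather than failing the batch.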
Schema
• All events are stored in the data column family (CF), which could be set to expire old data
• Filtered alerts are put in the alerts CF
• Daily summaries are put in the stats CF
Row key
CF data CF alerts CF stats
hz … psi psi … hz_avg … psi_min
COHUTTA_3/10/14_1:01 10.37 84 0
COHUTTA_3/10/14 10 0
Basic Steps for Spark Streaming code
1. Create a DStream
– Apply transformations
– Apply output operations
2. Start receiving data and processing it
– using streamingContext.start()
3. Wait for the processing to be stopped
– using streamingContext.awaitTermination()
Create a DStream
val ssc = new StreamingContext(sparkConf, Seconds(2))
val linesDStream = ssc.textFileStream("/mapr/stream")
DStream: a sequence of RDDs representing a stream of data; each batch is stored in memory as an RDD
[Diagram: linesDStream as a sequence of batches for successive time intervals]

Process DStream
val linesDStream = ssc.textFileStream("directory path")
val sensorDStream = linesDStream.map(parseSensor)
map: new RDDs are created for every batch
[Diagram: each batch RDD of linesDStream is mapped to the corresponding RDD of sensorDStream]
Process DStream
// for each RDD
sensorDStream.foreachRDD { rdd =>
// filter sensor data for low psi
val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
. . .
}
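The body of foreachRDD runs once per batch. A minimal plain-Scala sketch of that behaviour follows, with ordinary sequences standing in for the RDD of each batch interval. `Reading`, `alerts`, and the sample values are hypothetical; only the 5.0 psi threshold is taken from the slide.

```scala
// Plain-Scala model of the per-batch alert filter; no Spark dependency.
// A reduced, hypothetical reading type keeps the sketch short.
final case class Reading(resid: String, time: String, psi: Double)

// Threshold taken from the slide: psi below 5.0 raises an alert.
val LowPsiThreshold = 5.0

def alerts(batch: Seq[Reading]): Seq[Reading] =
  batch.filter(_.psi < LowPsiThreshold)

// Two made-up batches standing in for the RDDs of two batch intervals.
val batch1 = Seq(Reading("COHUTTA", "1:01", 84.0), Reading("COHUTTA", "1:02", 4.2))
val batch2 = Seq(Reading("COHUTTA", "1:03", 80.5))

// Equivalent of running the foreachRDD body once per batch.
val alertsPerBatch = Seq(batch1, batch2).map(alerts)
```

A batch without low-pressure readings simply yields an empty alert set, which is the common case in normal operation.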
DataFrame and SQL Operations
// for each RDD: filter alerts, register them as a table, and query with SQL
sensorDStream.foreachRDD { rdd =>
  . . .
  alertRDD.toDF().registerTempTable("alert")
  // join alert data with pump maintenance info
  val alertViewDF = sqlContext.sql(
    """select s.resid, s.psi, p.pumpType
       from alert s join pump p on s.resid = p.resid
       join maint m on p.resid = m.resid""")
  . . .
}
Save to HBase
// for each RDD: filter alerts, then write them to HBase
sensorDStream.foreachRDD { rdd =>
  . . .
  // convert each alert to a Put object and write it to the HBase alerts column family
  alertRDD.map(Sensor.convertToPutAlert)
    .saveAsHadoopDataset(jobConfig)
}

Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Diagram: each batch of the lines DStream is mapped to a batch of the sensor DStream; the Put objects of each batch are saved to HBase]
Output operation: persist data to external storage
Start Receiving Data
sensorDStream.foreachRDD { rdd =>
. . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
Using HBase as a Source and Sink
[Diagram: the Spark application reads from and writes to the HBase database]
Example: calculate and store summaries (pre-computed, materialized view)
HBase Read and Write
val hBaseRDD = sc.newAPIHadoopRDD(
conf,classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
keyStatsRDD.map { case (k, v) => convertToPut(k, v)
}.saveAsHadoopDataset(jobConfig)
[Diagram: newAPIHadoopRDD scans HBase and returns (row key, Result) pairs; saveAsHadoopDataset writes (key, Put) pairs to HBase]

Read HBase
// Load an RDD of (rowkey, Result) tuples from HBase table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
// get Result
val resultRDD = hBaseRDD.map(tuple => tuple._2)
// transform into an RDD of (RowKey, ColumnValue)s
val keyValueRDD = resultRDD.map(
result => (Bytes.toString(result.getRow()).split(" ")(0),
Bytes.toDouble(result.value)))
// group by rowkey , get statistics for column value
val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list =>
StatCounter(list))
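The grouping and summarising step can be tried on ordinary collections. The sketch below is a plain-Scala model, not Spark code: `Stats` is a hypothetical stand-in exposing a few of the summary values that Spark's StatCounter provides (count, min, max, mean), and the row keys and psi values are made up.

```scala
// Plain-Scala model of groupByKey + StatCounter; no Spark dependency.
// Stats is a hypothetical stand-in for a few StatCounter values.
final case class Stats(count: Long, min: Double, max: Double, mean: Double)

// Summarise a non-empty sequence of values.
def stats(values: Seq[Double]): Stats =
  Stats(values.size.toLong, values.min, values.max, values.sum / values.size)

// Made-up (rowKey, psi) pairs, keyed by pump and day.
val keyValues = Seq(
  "COHUTTA_3/10/14" -> 84.0,
  "COHUTTA_3/10/14" -> 80.0,
  "COHUTTA_3/10/14" -> 76.0,
  "PUMP2_3/10/14"   -> 75.0
)

// Group the values by row key, then summarise each group.
val keyStats: Map[String, Stats] =
  keyValues.groupBy(_._1).map { case (k, kvs) => k -> stats(kvs.map(_._2)) }
```

Each entry of `keyStats` corresponds to one daily summary row of the kind the slides store in the stats column family.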
Write HBase
// save to HBase table CF data
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
// convert psi stats to put and write to hbase table stats column family
keyStatsRDD.map { case (k, v) =>
convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data
• https://www.mapr.com/blog/spark-streaming-hbase
Free HBase On Demand Training
(includes Hive and MapReduce with HBase)
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training

Soon to Come
• Spark On Demand Training
– https://www.mapr.com/services/mapr-academy/
References
• Spark web site: http://spark.apache.org/
• https://databricks.com/
• Spark on MapR:
– http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark - O'Reilly Book
• Apache Spark
Q&A
@mapr maprtech
Engage with us!
MapR
maprtech
mapr-technologies

The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
Larry Smarr
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
Aurora Consulting
 
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
welrejdoall
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
Neo4j
 

Recently uploaded (20)

20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
 
Password Rotation in 2024 is still Relevant
Password Rotation in 2024 is still RelevantPassword Rotation in 2024 is still Relevant
Password Rotation in 2024 is still Relevant
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
 
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
 

Free Code Friday - Spark Streaming with HBase

  • 1. © 2015 MapR Technologies 1 Overview of Apache Spark Streaming
  • 2. © 2015 MapR Technologies 2 Agenda • Why Apache Spark Streaming ? • What is Apache Spark Streaming? – Key Concepts and Architecture • How it works by Example
  • 3. © 2015 MapR Technologies 3 Why Spark Streaming? • Process time series data: – Results in near real time • Use cases – Social network trends – Website statistics, monitoring – Fraud detection – Advertising click monetization • Time-stamped data for real-time monitoring – Sensor, system metrics, events, log files – Stock ticker, user activity – High volume, velocity
  • 4. © 2015 MapR Technologies 4 What is time series data? • Stuff with timestamps – Sensor data – log files – Phones.. Credit Card Transactions Web user behaviour Social media Log files Geodata Sensors
  • 5. © 2015 MapR Technologies 5 Why Spark Streaming ? What If? • You want to analyze data as it arrives? For Example Time Series Data: Sensors, Clicks, Logs, Stats
  • 6. © 2015 MapR Technologies 6 Batch Processing It's 6:01 and 72 degrees It's 6:02 and 75 degrees It's 6:03 and 77 degrees It's 6:04 and 85 degrees It's 6:05 and 90 degrees It's 6:06 and 85 degrees It's 6:07 and 77 degrees It's 6:08 and 75 degrees It was hot at 6:05 yesterday! Batch processing may be too late for some events
  • 7. © 2015 MapR Technologies 7 Event Processing It's 6:05 and 90 degrees Someone should open a window! Streaming It's becoming important to process events as they arrive
  • 8. © 2015 MapR Technologies 8 What is Spark Streaming? • extension of the core Spark API • enables scalable, high-throughput, fault-tolerant stream processing of live data Data Sources Data Sinks
  • 9. © 2015 MapR Technologies 9 Stream Processing Architecture Streaming Sources/Apps MapR-FS Data Ingest Topics MapR-DB Data Storage MapR-FS Apps Stream Processing
  • 10. © 2015 MapR Technologies 10 Key Concepts • Data Sources: – File Based: HDFS – Network Based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka Actor • Transformations • Output Operations MapR-FS Topics
  • 11. © 2015 MapR Technologies 11 Spark Streaming Architecture • Divide data stream into batches of X seconds (the batch interval) – Called DStream = sequence of RDDs [Diagram: the input data stream enters Spark Streaming and leaves as a DStream of RDD batches; data from time 0 to 1 becomes RDD @ time 1, data from time 1 to 2 becomes RDD @ time 2, data from time 2 to 3 becomes RDD @ time 3]
  • 12. © 2015 MapR Technologies 12 Resilient Distributed Datasets (RDD) Spark revolves around RDDs • read only collection of elements
  • 13. © 2015 MapR Technologies 13 Resilient Distributed Datasets (RDD) Spark revolves around RDDs • read only collection of elements • operated on in parallel • Cached in memory – Or on disk • Fault tolerant
  • 14. © 2015 MapR Technologies 14 Working With RDDs • Transformations create a new RDD from an existing one: linesRDD = sc.textFile("SomeFile.txt") then linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line) • Actions return a value: linesWithErrorRDD.count() returns 6 and linesWithErrorRDD.first() returns the first error line
  • 15. © 2015 MapR Technologies 15 Process DStream • Process using transformations (map, reduceByValue, count) – creates new RDDs [Diagram: each RDD of the input DStream (RDD @ time 1, RDD @ time 2, RDD @ time 3) is transformed into a new RDD of the resulting DStream]
  • 16. © 2015 MapR Technologies 16 Key Concepts • Data Sources • Transformations: create new DStream – Standard RDD operations: map, filter, union, reduce, join, … – Stateful operations: updateStateByKey(function), countByValueAndWindow, … • Output Operations
  • 17. © 2015 MapR Technologies 17 Spark Streaming Architecture • processed results are pushed out in batches [Diagram: the input data stream enters Spark Streaming as a DStream of RDD batches (RDD @ time 1, RDD @ time 2, RDD @ time 3); Spark processes them and emits batches of processed results]
  • 18. © 2015 MapR Technologies 18 Key Concepts • Data Sources • Transformations • Output Operations: trigger computation – saveAsHadoopFiles – save to HDFS – saveAsHadoopDataset – save to HBase – saveAsTextFiles – foreachRDD – do anything with each batch of RDDs MapR-DB MapR-FS
  • 19. © 2015 MapR Technologies 19 Learning Goals • How it works by example
  • 20. © 2015 MapR Technologies 20 Use Case: Time Series Data Data for real-time monitoring read Spark Processing Spark Streaming Oil Pump Sensor data
  • 21. © 2015 MapR Technologies 21 Convert Line of CSV data to Sensor Object case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double, flo: Double, sedPPM: Double, psi: Double, chlPPM: Double) def parseSensor(str: String): Sensor = { val p = str.split(",") Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble) }
  • 22. © 2015 MapR Technologies 22 Schema • All events stored, data CF could be set to expire data • Filtered alerts put in alerts CF • Daily summaries put in Stats CF Row key CF data CF alerts CF stats hz … psi psi … hz_avg … psi_min COHUTTA_3/10/14_1:01 10.37 84 0 COHUTTA_3/10/14 10 0
  • 23. © 2015 MapR Technologies 23 Basic Steps for Spark Streaming code These are the basic steps for Spark Streaming code: 1. Create a DStream (a. apply transformations, b. apply output operations) 2. Start receiving data and processing it – using streamingContext.start(). 3. Wait for the processing to be stopped – using streamingContext.awaitTermination().
  • 24. © 2015 MapR Technologies 24 Create a DStream val ssc = new StreamingContext(sparkConf, Seconds(2)) val linesDStream = ssc.textFileStream("/mapr/stream") DStream: a sequence of RDDs representing a stream of data; each batch (time 0-1, time 1-2, ...) is stored in memory as an RDD
  • 25. © 2015 MapR Technologies 25 Process DStream val linesDStream = ssc.textFileStream("directory path") val sensorDStream = linesDStream.map(parseSensor) The map operation creates new sensorDStream RDDs from the linesDStream RDDs for every batch
  • 26. © 2015 MapR Technologies 26 Process DStream // for Each RDD sensorDStream.foreachRDD { rdd => // filter sensor data for low psi val alertRDD = rdd.filter(sensor => sensor.psi < 5.0) . . . }
  • 27. © 2015 MapR Technologies 27 DataFrame and SQL Operations // for Each RDD parse into a sensor object filter sensorDStream.foreachRDD { rdd => . . . alertRDD.toDF().registerTempTable("alert") // join alert data with pump maintenance info val alertViewDF = sqlContext.sql( "select s.resid,s.psi, p.pumpType from alert s join pump p on s.resid = p.resid join maint m on p.resid=m.resid") . . . }
  • 28. © 2015 MapR Technologies 28 Save to HBase // for Each RDD parse into a sensor object filter sensorDStream.foreachRDD { rdd => . . . // convert alert to put object write to HBase alerts rdd.map(Sensor.convertToPutAlert) .saveAsHadoopDataset(jobConfig) }
  • 29. © 2015 MapR Technologies 29 Save to HBase rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig) Output operation: persist data to external storage [Diagram: for every batch, the linesRDD is mapped to a sensorRDD, which is mapped to Put objects that are saved to HBase]
  • 30. © 2015 MapR Technologies 30 Start Receiving Data sensorDStream.foreachRDD { rdd => . . . } // Start the computation ssc.start() // Wait for the computation to terminate ssc.awaitTermination()
  • 31. © 2015 MapR Technologies 31 Using HBase as a Source and Sink read write Spark applicationHBase database EXAMPLE: calculate and store summaries, Pre-Computed, Materialized View
  • 32. © 2015 MapR Technologies 32 HBase HBase Read and Write val hBaseRDD = sc.newAPIHadoopRDD( conf,classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]) keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig) newAPIHadoopRDD Row key Result saveAsHadoopDataset Key Put HBase Scan Result
  • 33. © 2015 MapR Technologies 33 Read HBase // Load an RDD of (rowkey, Result) tuples from HBase table val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]) // get Result val resultRDD = hBaseRDD.map(tuple => tuple._2) // transform into an RDD of (RowKey, ColumnValue)s val keyValueRDD = resultRDD.map( result => (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value))) // group by rowkey , get statistics for column value val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))
  • 34. © 2015 MapR Technologies 34 Write HBase // save to HBase table CF data val jobConfig: JobConf = new JobConf(conf, this.getClass) jobConfig.setOutputFormat(classOf[TableOutputFormat]) jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName) // convert psi stats to put and write to hbase table stats column family keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
  • 35. © 2015 MapR Technologies 35 MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data • https://www.mapr.com/blog/spark-streaming-hbase
  • 36. © 2015 MapR Technologies 36 Free HBase On Demand Training (includes Hive and MapReduce with HBase) • https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
  • 37. © 2015 MapR Technologies 37 Soon to Come • Spark On Demand Training – https://www.mapr.com/services/mapr-academy/
  • 38. © 2015 MapR Technologies 38 References • Spark web site: http://spark.apache.org/ • https://databricks.com/ • Spark on MapR: – http://www.mapr.com/products/apache-spark • Spark SQL and DataFrame Guide • Apache Spark vs. MapReduce – Whiteboard Walkthrough • Learning Spark - O'Reilly Book • Apache Spark
  • 40. © 2015 MapR Technologies 40 Q&A @mapr maprtech Engage with us! MapR maprtech mapr-technologies

Editor's Notes

  1. Spark is really cool… This is NOT a silver bullet… This is another GREAT tool to have… Stop using hammers for putting in screws.
  2. A significant amount of data needs to be processed quickly, in near real time, to gain insights. Common examples include activity stream data from a web or mobile application, time-stamped log data, transactional data, and event streams from sensor or device networks. The stream-based approach applies processing to the data stream as it is generated, which allows near-real-time processing. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data. It is meant for use cases that require a significant amount of data to be processed as soon as it arrives. Example real-time use cases: website monitoring, network monitoring, fraud detection, web clicks, advertising, and Internet of Things sensors.
  3. Having data in log files is not good for real-time processing.
  4. It's becoming important to process events as they arrive.
  5. Spark Streaming brings Spark's API to stream processing, letting you write streaming jobs the same way you write batch jobs. It supports data sources such as HDFS directories, TCP sockets, Twitter, message queues, and distributed stream and log transfer frameworks such as Flume, Kafka, and Amazon Kinesis. Data streams can be processed with Spark's core APIs, DataFrames and SQL, or the machine learning APIs, and can be persisted to a file system, HDFS, MapR-FS, MapR-DB, HBase, or any data source offering a Hadoop OutputFormat or Spark connector. Once built, DStreams offer two types of operations: transformations, which yield a new DStream, and output operations, which write data to an external system.
  6. A data stream is a continuous sequence of records or events. Common examples include activity stream data from a web or mobile application, time-stamped log data, transactional data, and event streams from sensor or device networks. The stream-based approach applies processing to the data stream as it is generated, which allows near-real-time processing. A stream processing architecture typically consists of the following components. Data sources: the origin of the data streams, for example a sensor network, a mobile application, a web client, a server log, or a device on the Internet of Things. Message bus: messaging systems such as Kafka, Flume, ActiveMQ, and RabbitMQ. Stream processing system: the framework that processes the data streams. NoSQL store: storage for the processed data, which must be capable of fast reads and writes; HBase, Cassandra, and MongoDB are popular choices. End applications: dashboards and other applications that use the processed data. DStreams can be created from various input sources, such as Flume, Kafka, or HDFS. Once built, they offer two types of operations: transformations, which yield a new DStream, and output operations, which write data to an external system.
  7. Spark Streaming supports various input sources, including file-based sources and network-based sources such as sockets, the Twitter API stream, Akka actors, message queues, and distributed stream and log transfer frameworks such as Flume, Kafka, and Amazon Kinesis. Spark Streaming provides a set of transformations on DStreams that are similar to those available on RDDs, including map, flatMap, filter, join, and reduceByKey. It also provides operators such as reduce and count, which return a DStream of single-element RDDs. Unlike reduce and count on RDDs, these are not actions: they do not trigger computation, they return another DStream. Stateful transformations maintain state across batches: they use data or intermediate results from previous batches to compute the results of the current batch. They include transformations based on sliding windows and on tracking state across time. updateStateByKey() is used to track state across events for each key. For example, it could be used to maintain running statistics per key, whereas a call such as accessLogDStream.reduceByKey(SUM_REDUCER) only combines values within a single batch. Output operations are the operators that, when invoked, trigger computation on the DStream. print: prints the first 10 elements of each batch to the console and is typically used for debugging and testing. saveAsObjectFiles, saveAsTextFiles, and saveAsHadoopFiles: write each batch to a Hadoop-compatible file system. foreachRDD: applies arbitrary processing to the RDD of each batch of a DStream.
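The stateful behaviour described in this note can be sketched without a cluster. The update function below has the shape that updateStateByKey expects, taking the new values for a key in the current batch and the previous state, and returning the new state. The `step` helper and the pump name NANTAHALLA are assumptions made for illustration; `step` only imitates, in plain Scala, how Spark would carry the per-key state from one batch to the next.

```scala
object RunningCounts {
  // Same shape as the update function passed to updateStateByKey:
  // (new values for a key in this batch, previous state) => new state.
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(state.getOrElse(0) + newValues.sum)

  // Plain-Scala stand-in (hypothetical helper) for one batch step: apply the
  // update function to every key seen so far and carry the state forward.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      updateCount(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Three consecutive batches of (pump, eventCount) pairs; made-up data.
    val batches = Seq(
      Seq("COHUTTA" -> 1, "NANTAHALLA" -> 1),
      Seq("COHUTTA" -> 1),
      Seq("NANTAHALLA" -> 1, "NANTAHALLA" -> 1)
    )
    val finalState = batches.foldLeft(Map.empty[String, Int])(step)
    println(finalState)
  }
}
```

In a real job the same `updateCount` could be passed to updateStateByKey on a DStream of key/value pairs, which also requires a checkpoint directory to be configured.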
  8. Streaming data is continuous and needs to be batched before it can be processed. Spark Streaming divides the data stream into batches of X seconds, where X is the batch interval. The resulting discretized stream (DStream) is represented internally as a sequence of RDDs, one per time step: each RDD contains the records received by the Spark Streaming application during one batch interval.
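As a rough, Spark-free illustration of this batching, the sketch below assigns time-stamped records to batch intervals. The `Event` type and the `toBatches` helper are hypothetical names introduced only for this example; Spark performs the equivalent grouping internally based on arrival time.

```scala
object MicroBatches {
  // One time-stamped record; `millis` is the arrival time in milliseconds.
  final case class Event(millis: Long, payload: String)

  // Assign each event to the batch whose interval contains its arrival time,
  // so batch i covers [i * intervalMs, (i + 1) * intervalMs).
  def toBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.millis / intervalMs)
      .toSeq
      .sortBy(_._1)
}
```

One difference worth keeping in mind: this sketch produces no entry for an interval without events, whereas a DStream still yields an (empty) RDD for that interval.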
  9. Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
  10. An RDD is simply a distributed collection of elements. You can think of it as something like an array or list in a single-machine program, except that it is spread out across multiple nodes in the cluster. In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them. In other words, Spark gives you APIs and functions that let you operate on the whole collection in parallel, using all the nodes.
  11. Actions return a value to the driver program after running a computation on the dataset. The chain of transformations from RDD1 to RDDn is logged and can be replayed in the event of data loss or the failure of a cluster node.
  12. Your Spark application processes the DStream RDDs using Spark transformations such as map, reduce, and join, which create new RDDs. Any operation applied on a DStream translates to operations on the underlying RDDs, which in turn apply the transformation to the elements of each RDD. The processed results of the RDD operations are returned in batches.
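Scala's own lazy collection views behave in a loosely similar way, which makes them a convenient analogy for the split between transformations and actions. This is an analogy only, under the assumption that a view's recorded `filter` stands in for a transformation and forcing it with `toList` stands in for an action; it does not use Spark and says nothing about fault recovery.

```scala
object LazyLineage {
  val lines: Seq[String] = Seq("INFO started", "ERROR disk full", "ERROR timeout")

  // Returns (predicate calls before the action, result of the action,
  // predicate calls after the action).
  def run(): (Int, Int, Int) = {
    var calls = 0
    // "Transformation": only describes the work, nothing is evaluated yet.
    val errors = lines.view.filter { line => calls += 1; line.contains("ERROR") }
    val before = calls
    // "Action": forces the pipeline and returns a plain value.
    val count = errors.toList.length
    (before, count, calls)
  }
}
```

Because the view keeps the recipe rather than the result, forcing it a second time would run the predicate again, which is roughly why an RDD that is reused is usually cached.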
  14. The Spark application processes the RDDs of the DStream using the Spark APIs, and the processed results of the RDD operations are returned in batches. Output operations are similar to RDD actions in that they write data to an external system, but in Spark Streaming they run periodically on each time step, producing output in batches.
  16. These are the basic steps for Spark Streaming code: (1) initialize a Spark StreamingContext object; (2) apply transformations and output operations to DStreams; (3) start receiving data and processing it using streamingContext.start(); (4) wait for the processing to be stopped using streamingContext.awaitTermination().
  17. We want to store every single event in HBase as it comes in. We also want to filter for alarms and store them. Daily Spark processing will store aggregated summary statistics. The Spark Streaming example does the following: reads streaming data, processes the streaming data, and writes the processed data to an HBase table. The non-streaming Spark code does the following: reads the HBase table data written by the streaming code, calculates daily summary statistics, and writes the summary statistics to the stats column family of the HBase table.
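The daily summary step can be pictured with plain Scala collections. The sketch below groups (rowKey, psi) pairs by pump and day and computes simple statistics per group. The `Stats` case class and the helper names are hypothetical, and the row keys are assumed to follow the underscore-separated layout shown on the schema slide (pump, date, time). The slides do this on an RDD with groupByKey and StatCounter instead.

```scala
object DailyStats {
  final case class Stats(count: Long, min: Double, max: Double, mean: Double)

  def summarize(values: Seq[Double]): Stats =
    Stats(values.size.toLong, values.min, values.max, values.sum / values.size)

  // Assumes a row key such as "COHUTTA_3/10/14_1:01" and keeps the first two
  // parts, giving the daily key "COHUTTA_3/10/14". Pump names that contain
  // an underscore would need a different separator.
  def dayKey(rowKey: String): String =
    rowKey.split("_").take(2).mkString("_")

  // Group readings by pump and day, then summarize the psi values per group.
  def dailyPsiStats(readings: Seq[(String, Double)]): Map[String, Stats] =
    readings
      .groupBy { case (rowKey, _) => dayKey(rowKey) }
      .map { case (day, rows) => day -> summarize(rows.map(_._2)) }
}
```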
  18. The oil pump sensor data comes in as comma-separated value (CSV) files dropped in a directory. Spark Streaming will monitor the directory and process any files created in that directory. (As stated before, Spark Streaming supports different streaming data sources; for simplicity, this example uses files.) Below is an example of the CSV file with some sample data: We use a Scala case class to define the Sensor schema corresponding to the sensor data CSV files, and a parseSensor function to parse the comma-separated values into the Sensor case class.
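The parseSensor function on the slides throws if a line is short or contains a non-numeric value. A slightly more defensive variant is sketched below; it is not the deck's code. The field layout follows the Sensor case class from the slides, while returning an `Option` is a choice made for this example.

```scala
import scala.util.Try

object SensorParsing {
  // Field layout follows the Sensor case class shown on the slides.
  final case class Sensor(resid: String, date: String, time: String,
                          hz: Double, disp: Double, flo: Double,
                          sedPPM: Double, psi: Double, chlPPM: Double)

  // Returns None instead of throwing when a line does not have exactly
  // nine fields or when a numeric field cannot be parsed.
  def parseSensor(line: String): Option[Sensor] = {
    val p = line.split(",").map(_.trim)
    if (p.length != 9) None
    else Try {
      Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
             p(6).toDouble, p(7).toDouble, p(8).toDouble)
    }.toOption
  }
}
```

With this shape, mapping a stream of lines with flatMap rather than map should simply drop malformed lines instead of failing the batch.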
  19. The HBase table schema for the streaming data is as follows: a composite row key made of the pump name, date, and time stamp; a column family data with columns corresponding to the input data fields; and a column family alerts with columns corresponding to any filters for alarming values. Note that the data and alerts column families could be set to expire values after a certain amount of time. The schema for the daily statistics summary rollups is as follows: a composite row key made of the pump name and date, and a column family stats.
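The two composite row keys can be built with a pair of small helpers. This is a minimal sketch; the function names are hypothetical, and the underscore separator is taken from the example keys on the schema slide (COHUTTA_3/10/14_1:01 and COHUTTA_3/10/14).

```scala
object RowKeys {
  // Row key for one reading: pump name, date and time joined by underscores,
  // for example "COHUTTA_3/10/14_1:01".
  def readingKey(pump: String, date: String, time: String): String =
    s"${pump}_${date}_$time"

  // Row key for the daily rollup: pump name and date only.
  def dailyKey(pump: String, date: String): String =
    s"${pump}_$date"
}
```

Because HBase stores rows sorted by key, a layout in which every reading key starts with its daily key keeps all readings of one pump and day next to each other, which suits prefix scans.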
  20. These are the basic steps for Spark Streaming code: (1) initialize a Spark StreamingContext object; (2) using this context, create a DStream that represents streaming data from a source; (3) apply transformations and/or output operations to DStreams; (4) start receiving data and processing it using streamingContext.start(); (5) wait for the processing to be stopped using streamingContext.awaitTermination(). We will go through these steps showing code from our use case example. Spark Streaming programs are best run as standalone applications built using Maven or sbt.
  21. First we create a StreamingContext, the main entry point for streaming functionality, with a 2-second batch interval. Using this context, we can create a DStream that represents streaming data from a source. In this example we use the StreamingContext textFileStream(directory) method to create an input stream that monitors a Hadoop-compatible file system for new files and processes any files created in that directory. This ingestion type supports a workflow where new files are written to a landing directory and Spark Streaming is used to detect them, ingest them, and process the data. Only use this ingestion type with files that are moved or copied into a directory. The linesDStream represents the stream of data, where each record is a line of text. Internally, a DStream is a sequence of RDDs, one RDD per batch interval.
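One way to follow the advice about moving files into the landing directory is to write each file in a staging directory first and then move it in a single step. The sketch below uses only the JDK file API and assumes a local or NFS-mounted path (the slides use /mapr/stream) with both directories on the same file system. The `publish` helper and the directory names are hypothetical; for HDFS the equivalent would be a rename through the Hadoop FileSystem API.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object LandingDir {
  // Write the file somewhere the stream is not watching, then move it into
  // the monitored directory in one step so it never appears half-written.
  def publish(staging: Path, landing: Path, name: String, lines: Seq[String]): Path = {
    val tmp = staging.resolve(name)
    Files.write(tmp, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
    // ATOMIC_MOVE fails with an exception if the move cannot be done
    // atomically, for example across different file systems.
    Files.move(tmp, landing.resolve(name), StandardCopyOption.ATOMIC_MOVE)
  }
}
```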
  22. Next we parse the lines of data into Sensor objects with the map operation on the linesDStream. Any operation applied on a DStream translates to operations on the underlying RDDs: the map operation applies the Sensor.parseSensor function to each RDD in the linesDStream, producing the RDDs of Sensor objects that make up the sensorDStream.
  23. Next we use the DStream foreachRDD method to apply processing to each RDD in this DStream. We filter the Sensor objects for low psi to create an RDD of alert Sensor objects.
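The per-batch behaviour of map and filter can be modelled without Spark. In this sketch plain Scala collections stand in for the real classes: the outer Seq plays the role of the DStream (one element per batch interval) and each inner Seq plays the role of an RDD. The Reading type, the two-column input, and the 5.0 psi threshold are assumptions; the notes only say "low psi".

```scala
object LowPsiAlerts {
  // Hypothetical subset of the sensor fields, enough to show the filter.
  final case class Reading(pump: String, psi: Double)

  // Assumed alert threshold; the transcript does not give the real value.
  val LowPsiThreshold: Double = 5.0

  // Parse a "pump,psi" line into a Reading.
  def parse(line: String): Reading = {
    val p = line.split(",").map(_.trim)
    Reading(p(0), p(1).toDouble)
  }

  // Stand-in for a DStream map: the function is applied to every record
  // of every batch, and the batch boundaries are preserved.
  def mapBatches[A, B](batches: Seq[Seq[A]])(f: A => B): Seq[Seq[B]] =
    batches.map(batch => batch.map(f))

  // Stand-in for the filter inside foreachRDD: keep only low-psi readings.
  def alerts(batch: Seq[Reading]): Seq[Reading] =
    batch.filter(_.psi < LowPsiThreshold)
}
```

For example, mapping parse over the batches Seq(Seq("A,80.0", "B,3.5"), Seq("C,4.9")) and then applying alerts to each batch leaves pump B in the first batch and pump C in the second.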
  24. Here we join the alert data with pump vendor and maintenance information (the pump vendor and maintenance information was read in and cached before streaming). Each RDD is converted to a DataFrame, registered as a temporary table, and then queried using SQL, for example: rdd.toDF().registerTempTable("sensor")
  25. Next we use the DStream foreachRDD method to apply processing to each RDD in this DStream. We filter the Sensor objects for low psi to create alerts, then we write the sensor and alert data to HBase by converting them to Put objects and using the PairRDDFunctions saveAsHadoopDataset method, which outputs the RDD to any Hadoop-supported storage system using a Hadoop Configuration object for that storage system (see Hadoop Configuration for HBase above).
  26. The sensorRDD objects are converted to Put objects, then written to HBase.
  27. To start receiving data, we must explicitly call start() on the StreamingContext, then call awaitTermination() to wait for the streaming computation to finish.
  28. The last way to leverage MapReduce in HBase is to use the HBase database as both a source and a sink in our data flow. One example of this use case is to calculate summaries across the HBase data and then store those summaries back in the HBase database. For summary jobs where HBase is used as both a source and a sink, the writes come from the Reducer step.
  29. This shows the input and output for the TableMapper class map method and the TableReducer class reduce method for reading from and writing to HBase. A scan Result object and row key are sent to the Mapper map method one row at a time, and the map method outputs one or more key-value pairs. The Reducer reduce method receives a key and an iterable list of corresponding values, and outputs a Put object.
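The shape of this map and reduce flow can be sketched with plain Scala collections. This does not use the HBase TableMapper or TableReducer classes: Row stands in for a scanned HBase row, Stats stands in for the Put written to the stats column family, and the choice of min, max, and mean psi as the daily statistics is an assumption.

```scala
object DailyStats {
  // Stand-in for one scanned HBase row (hypothetical fields).
  final case class Row(pump: String, date: String, psi: Double)

  // Stand-in for the values written to the stats column family (assumed).
  final case class Stats(min: Double, max: Double, mean: Double)

  // Map step: one row in, one (key, value) pair out.
  // The key is the pump name and date, matching the rollup row key.
  def mapRow(row: Row): (String, Double) =
    (row.pump + "_" + row.date, row.psi)

  // Reduce step: a key plus all values emitted for that key.
  // Assumes values is non-empty, as it is for any key produced by mapRow.
  def reduce(key: String, values: Iterable[Double]): (String, Stats) = {
    val vs = values.toSeq
    (key, Stats(vs.min, vs.max, vs.sum / vs.size))
  }

  // Shuffle stand-in: group the mapped pairs by key, then reduce each group.
  def run(rows: Seq[Row]): Map[String, Stats] =
    rows.map(mapRow).groupBy(_._1).map { case (key, pairs) =>
      reduce(key, pairs.map(_._2))
    }
}
```

In a real job the grouping between the two steps is done by the MapReduce framework, and the reduce output would be a Put keyed by the pump name and date.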
  30. We register the DataFrame as a table. Registering it as a table allows us to use it in subsequent SQL statements. Now we can inspect the data.
  32. This post will help you get started using Apache Spark DataFrames with Scala on the MapR Sandbox. The new Spark DataFrames API is designed to make big data processing on tabular data easier. A Spark DataFrame is a distributed collection of data organized into named columns that provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.