Key-Value RDD
Key-Value RDD
Transformations on Pair RDDs
keys()
Returns an RDD with the keys of each tuple.
>>> var m = sc.parallelize(List((1, 2), (3, 4))).keys
>>> m.collect()
Array[Int] = Array(1, 3)
Key-Value RDD
Transformations on Pair RDDs
values()
>>> var m = sc.parallelize(List((1, 2), (3, 4))).values
>>> m.collect()
Array(2, 4)
Returns an RDD with the values of each tuple.
Key-Value RDD
var rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)));
var rdd1 = rdd.groupByKey()
var vals = rdd1.collect()
for( i <- vals){
for (k <- i.productIterator) {
println("t" + k);
}
}
Transformations on Pair RDDs
groupByKey()
Group values with the same key.
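For reference, collecting the grouped RDD above yields one entry per key, typically printed as (1,CompactBuffer(2)) and (3,CompactBuffer(4, 6)), so the nested loop prints each key followed by its buffer of grouped values.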

Key-Value RDD
Questions - Pair RDD Operations
What will be the result of the following?
var rdd = sc.parallelize(Array(("a", 1), ("b", 1), ("a", 1)));
rdd.groupByKey().mapValues(_.size).collect()
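Answer: groupByKey builds one group per key and mapValues(_.size) then counts the values in each group, so this returns Array((a,2), (b,1)) (the order of the pairs may vary).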
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3,4,5)).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + "," + y}
def mc(x:String, y:String):String = {x + ", " + y}
myrdd.combineByKey(cc, mv, mc).collect()
Array((x,1, 2, 3, 4,5))
Transformations on Pair RDDs
combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None)
Combine values with the same key using a different result type.
Turns RDD[(K, V)] into a result of type RDD[(K, C)]
createCombiner, which turns a V into a C (e.g., creates a one-element list)
mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
mergeCombiners, to combine two C’s into a single one.
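A common use of combineByKey is computing a per-key average by carrying a (sum, count) pair as the combiner type C. A minimal sketch, assuming a small illustrative RDD (the data and variable names below are not from the slides):
var sales = sc.parallelize(List(("a", 10.0), ("b", 4.0), ("a", 20.0)))
var sumCount = sales.combineByKey(
  (v: Double) => (v, 1),                                                      // createCombiner: V => C
  (c: (Double, Int), v: Double) => (c._1 + v, c._2 + 1),                      // mergeValue: fold a V into a C
  (c1: (Double, Int), c2: (Double, Int)) => (c1._1 + c2._1, c1._2 + c2._2)    // mergeCombiners: merge two C's
)
sumCount.mapValues { case (sum, count) => sum / count }.collect()
// expected: Array((a,15.0), (b,4.0)), pair order may vary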
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
Example: combineByKey
1 2 3 "1, 2, 3"
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
Example: combineByKey
(Diagram: the values 1, 2, 3, spread across the two partitions, before any function is applied)

Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
Example: combineByKey
(Diagram: cc is applied to 1, the first value for key "x" in its partition: cc(1) → "1")
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
Example: combineByKey
(Diagram: cc is applied to the first value in each partition: cc(1) → "1" and cc(3) → "3")
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + "," + y.toString}
Example: combineByKey
(Diagram: mv folds the next value of the first partition into its combiner: mv("1", 2) → "1,2"; the second partition still holds cc(3) → "3")
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + "," + y.toString}
def mc(x:String, y:String):String = {x + ", " + y}
Example: combineByKey
(Diagram: mc merges the per-partition combiners: mc("1,2", "3") → "1,2, 3")

Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + "," + y.toString}
def mc(x:String, y:String):String = {x + ", " + y}
myrdd.combineByKey(cc, mv, mc)
Example: combineByKey
(Diagram, assuming the list is split as [1, 2] and [3]: cc(1) → "1", mv("1", 2) → "1,2", cc(3) → "3", mc("1,2", "3") → "1,2, 3")
Key-Value RDD
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + "," + y.toString}
def mc(x:String, y:String):String = {x + ", " + y}
myrdd.combineByKey(cc, mv, mc).collect()(0)._2
String = 1,2, 3
Example: combineByKey
(Diagram, assuming the list is split as [1, 2] and [3]: cc(1) → "1", mv("1", 2) → "1,2", cc(3) → "3", mc("1,2", "3") → "1,2, 3")
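Note that the exact string depends on how parallelize slices List(1,2,3) across the two partitions, since that decides where mv and mc get applied. A minimal sketch to check it yourself (glom is used here only to inspect the partitioning; it is not part of the original slides):
var myrdd = sc.parallelize(List(1, 2, 3), 2).map(("x", _))
myrdd.glom().collect().foreach(p => println(p.mkString(" ")))   // print each partition's contents

def cc(x: Int): String = x.toString                       // createCombiner
def mv(c: String, v: Int): String = c + "," + v           // mergeValue, within a partition
def mc(c1: String, c2: String): String = c1 + ", " + c2   // mergeCombiners, across partitions

println(myrdd.combineByKey(cc, mv, mc).collect()(0)._2)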
Key-Value RDD
Questions - Pair RDD Operations
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1,c2): return c1[0:-1] + c2[1:]
mc(mv(cc(1), 2), cc(3))
Key-Value RDD
Questions - Pair RDD Operations
('[', 1, 2, 3, ']')
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1,c2): return c1[0:-1] + c2[1:]
mc(mv(cc(1), 2), cc(3))
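Tracing it step by step: cc(1) returns ('[', 1, ']'); mv drops the trailing ']' and appends (2, ']'), giving ('[', 1, 2, ']'); cc(3) returns ('[', 3, ']'); finally mc drops the ']' from the first tuple and the '[' from the second, producing ('[', 1, 2, 3, ']').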

Key-Value RDD
Questions - Pair RDD Operations
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1, c2): return c1[0:-1] + c2[1:]
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd.combineByKey(cc,mv, mc).collect()
Key-Value RDD
Questions - Pair RDD Operations
[('a', ('[', 1, 3, ']')), ('b', ('[', 2, ']'))]
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1, c2): return c1[0:-1] + c2[1:]
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd.combineByKey(cc,mv, mc).collect()
Key-Value RDD
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.
Key-Value RDD
>>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5))
>>> sc.parallelize(tmp).sortByKey().collect()
Array((1,3), (2,5), (a,1), (b,2), (d,4))
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.

Key-Value RDD
>>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5))
>>> sc.parallelize(tmp).sortByKey(true, 1).collect()
Array((1,3), (2,5), (a,1), (b,2), (d,4))
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.
Key-Value RDD
>>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5))
>>> sc.parallelize(tmp).sortByKey(ascending=false, numPartitions=2).collect()
Array((d,4), (b,2), (a,1), (2,5), (1,3))
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.
Key-Value RDD
Transformations on Pair RDDs
subtractByKey(other, numPartitions=None)
Return each (key, value) pair in self that has no pair with matching key in other.
>>> var x = sc.parallelize(List(("a", 1), ("b", 4), ("b", 5), ("a", 2)))
>>> var y = sc.parallelize(List(("a", 3), ("c", None)))
>>> x.subtractByKey(y).collect()
Array((b,4), (b,5))
Key-Value RDD
Transformations on Pair RDDs
join(other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.

Key-Value RDD
Transformations on Pair RDDs
join(other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
>>> var x = sc.parallelize(List(("a", 1), ("b", 4), ("c", 5)))
>>> var y = sc.parallelize(List(("a", 2), ("a", 3), ("d", 7)))
>>> x.join(y).collect()
Array((a,(1,2)), (a,(1,3)))
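Only the key a appears in both RDDs, so keys b, c and d are dropped; when a key has several matching values on both sides, join emits one pair per combination.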
Key-Value RDD
Transformations on Pair RDDs
leftOuterJoin(other, numPartitions=None)
Perform a left outer join of self and other.
For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k.
Hash-partitions the resulting RDD into the given number of partitions.
>>> var x = sc.parallelize(List(("a", 1), ("b", 4)))
>>> var y = sc.parallelize(List(("a", 2)))
>>> x.leftOuterJoin(y).collect()
Array((a,(1,Some(2))), (b,(4,None)))
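In the Scala API the value coming from other is wrapped in an Option, so matched keys show up as Some(value) and unmatched keys as None (PySpark simply returns the value or None).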
Key-Value RDD
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.leftOuterJoin(y).collect()
Questions - Pair RDD Operations
What will be the result of the following?
LEFT OUTER JOIN
Key-Value RDD
Questions - Pair RDD Operations
[(1, ('sandeep', 'ryan')), ('2', ('sravani', None))]
What will be the result of the following?
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.leftOuterJoin(y).collect()
LEFT OUTER JOIN

Key-Value RDD
Transformations on Pair RDDs
rightOuterJoin(other, numPartitions=None)
Perform a right outer join of self and other.
For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (v, w)) for v in this, or the pair (k, (None, w)) if no elements in self have key k.
Hash-partitions the resulting RDD into the given number of partitions.
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2)])
>>> y.rightOuterJoin(x).collect()
[('a', (2, 1)), ('b', (None, 4))]
Key-Value RDD
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.rightOuterJoin(y).collect()
Questions - Pair RDD Operations
What will be the result of the following?
RIGHT OUTER JOIN
Key-Value RDD
Questions - Pair RDD Operations
[(1, ('sandeep', 'ryan')), (3, (None, 'giri'))]
What will be the result of the following?
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.rightOuterJoin(y).collect()
RIGHT OUTER JOIN
Key-Value RDD
Transformations on Pair RDDs
>>> var x = sc.parallelize(List(("a", 1), ("b", 4)))
>>> var y = sc.parallelize(List(("a", 2), ("a", 3)))
>>> var cg = x.cogroup(y)
>>> var cgl = cg.collect()
Array((a,(CompactBuffer(1),CompactBuffer(2, 3))),
(b,(CompactBuffer(4),CompactBuffer())))
This is basically the same as:
((a, ([1], [2,3])), (b, ([4], [])))
cogroup(other, numPartitions=None)
For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
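To get the more readable list form shown above, the CompactBuffers can be converted explicitly; a minimal sketch, continuing from cg defined above:
cg.mapValues { case (xs, ys) => (xs.toList, ys.toList) }.collect()
// expected: Array((a,(List(1),List(2, 3))), (b,(List(4),List()))), pair order may vary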

Key-Value RDD
Actions Available on Pair RDDs
countByKey()
Count the number of elements for each key, and return the result to the driver as a map.
>>> var rdd = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1), ('a', 10)))
>>> rdd.countByKey()
Map(a -> 2, a -> 1, b -> 1)
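Note that 'a' (a Char) and "a" (a String) are different keys, which is why a shows up twice in the Map above. With consistent String keys the counts collapse as expected; a minimal sketch:
var rdd2 = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1), ("a", 10)))
rdd2.countByKey()
// expected: Map(a -> 3, b -> 1), entry order may vary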
Key-Value RDD
Actions Available on Pair RDDs
lookup(key)
Return the list of values in the RDD for key. This operation is done efficiently if the RDD has a known partitioner, by only searching the partition that the key maps to.
var lr = sc.parallelize(1 to 1000).map(x => (x, x) )
lr.lookup(42)
Job 24 finished: lookup at <console>:28, took 0.037469 s
WrappedArray(42)
var sorted = lr.sortByKey()
sorted.lookup(42) // fast
Job 21 finished: lookup at <console>:28, took 0.008917 s
ArrayBuffer(42)
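A similar speed-up can be obtained by hash-partitioning the pair RDD explicitly, since lookup then only scans the partition the key hashes to. A minimal sketch (the partition count of 10 is arbitrary):
import org.apache.spark.HashPartitioner
var plr = lr.partitionBy(new HashPartitioner(10)).cache()
plr.lookup(42)   // only the partition that 42 hashes to is searched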
Thank you!

More Related Content

What's hot

Dive into Catalyst
Dive into CatalystDive into Catalyst
Dive into Catalyst
Cheng Lian
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using Scalding
Toni Cebrián
 
Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch? Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch?
DataWorks Summit
 
Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of code
Konrad Malawski
 
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
台灣資料科學年會
 
Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
Chicago Hadoop Users Group
 
R Programming: Export/Output Data In R
R Programming: Export/Output Data In RR Programming: Export/Output Data In R
R Programming: Export/Output Data In R
Rsquared Academy
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
Hugo Gävert
 
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And ProfitJDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
PROIDEA
 
Scoobi - Scala for Startups
Scoobi - Scala for StartupsScoobi - Scala for Startups
Scoobi - Scala for Startups
bmlever
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Data Con LA
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
samthemonad
 
Modern technologies in data science
Modern technologies in data science Modern technologies in data science
Modern technologies in data science
Chucheng Hsieh
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In R
Rsquared Academy
 
Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)
Jonas Bonér
 
D3.js workshop
D3.js workshopD3.js workshop
D3.js workshop
Anton Katunin
 
The Ring programming language version 1.5.2 book - Part 11 of 181
The Ring programming language version 1.5.2 book - Part 11 of 181The Ring programming language version 1.5.2 book - Part 11 of 181
The Ring programming language version 1.5.2 book - Part 11 of 181
Mahmoud Samir Fayed
 
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Comsysto Reply GmbH
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
Databricks
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau
 

What's hot (20)

Dive into Catalyst
Dive into CatalystDive into Catalyst
Dive into Catalyst
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using Scalding
 
Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch? Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch?
 
Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of code
 
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
 
Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
 
R Programming: Export/Output Data In R
R Programming: Export/Output Data In RR Programming: Export/Output Data In R
R Programming: Export/Output Data In R
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And ProfitJDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
 
Scoobi - Scala for Startups
Scoobi - Scala for StartupsScoobi - Scala for Startups
Scoobi - Scala for Startups
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 
Modern technologies in data science
Modern technologies in data science Modern technologies in data science
Modern technologies in data science
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In R
 
Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)
 
D3.js workshop
D3.js workshopD3.js workshop
D3.js workshop
 
The Ring programming language version 1.5.2 book - Part 11 of 181
The Ring programming language version 1.5.2 book - Part 11 of 181The Ring programming language version 1.5.2 book - Part 11 of 181
The Ring programming language version 1.5.2 book - Part 11 of 181
 
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 

Similar to Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutorial | CloudxLab

Super Advanced Python –act1
Super Advanced Python –act1Super Advanced Python –act1
Super Advanced Python –act1
Ke Wei Louis
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
Databricks
 
Operations on rdd
Operations on rddOperations on rdd
Operations on rdd
sparrowAnalytics.com
 
18. Java associative arrays
18. Java associative arrays18. Java associative arrays
18. Java associative arrays
Intro C# Book
 
Erlang for data ops
Erlang for data opsErlang for data ops
Erlang for data ops
mnacos
 
Microsoft Word Practice Exercise Set 2
Microsoft Word   Practice Exercise Set 2Microsoft Word   Practice Exercise Set 2
Microsoft Word Practice Exercise Set 2
rampan
 
CLUSTERGRAM
CLUSTERGRAMCLUSTERGRAM
CLUSTERGRAM
Dr. Volkan OBAN
 
Frsa
FrsaFrsa
Frsa
_111
 
Java Foundations: Maps, Lambda and Stream API
Java Foundations: Maps, Lambda and Stream APIJava Foundations: Maps, Lambda and Stream API
Java Foundations: Maps, Lambda and Stream API
Svetlin Nakov
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Duyhai Doan
 
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov VyacheslavSeminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Vyacheslav Arbuzov
 
Monadologie
MonadologieMonadologie
Monadologie
league
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
Wojciech Pituła
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
Khaled Al-Shamaa
 
A Shiny Example-- R
A Shiny Example-- RA Shiny Example-- R
A Shiny Example-- R
Dr. Volkan OBAN
 
R for you
R for youR for you
R for you
Andreas Chandra
 
Will it Blend? - ScalaSyd February 2015
Will it Blend? - ScalaSyd February 2015Will it Blend? - ScalaSyd February 2015
Will it Blend? - ScalaSyd February 2015
Filippo Vitale
 
Grokking Monads in Scala
Grokking Monads in ScalaGrokking Monads in Scala
Grokking Monads in Scala
Tim Dalton
 
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)
MongoSF
 
Introduction to R programming
Introduction to R programmingIntroduction to R programming
Introduction to R programming
Alberto Labarga
 

Similar to Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutorial | CloudxLab (20)

Super Advanced Python –act1
Super Advanced Python –act1Super Advanced Python –act1
Super Advanced Python –act1
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
Operations on rdd
Operations on rddOperations on rdd
Operations on rdd
 
18. Java associative arrays
18. Java associative arrays18. Java associative arrays
18. Java associative arrays
 
Erlang for data ops
Erlang for data opsErlang for data ops
Erlang for data ops
 
Microsoft Word Practice Exercise Set 2
Microsoft Word   Practice Exercise Set 2Microsoft Word   Practice Exercise Set 2
Microsoft Word Practice Exercise Set 2
 
CLUSTERGRAM
CLUSTERGRAMCLUSTERGRAM
CLUSTERGRAM
 
Frsa
FrsaFrsa
Frsa
 
Java Foundations: Maps, Lambda and Stream API
Java Foundations: Maps, Lambda and Stream APIJava Foundations: Maps, Lambda and Stream API
Java Foundations: Maps, Lambda and Stream API
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov VyacheslavSeminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
 
Monadologie
MonadologieMonadologie
Monadologie
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
A Shiny Example-- R
A Shiny Example-- RA Shiny Example-- R
A Shiny Example-- R
 
R for you
R for youR for you
R for you
 
Will it Blend? - ScalaSyd February 2015
Will it Blend? - ScalaSyd February 2015Will it Blend? - ScalaSyd February 2015
Will it Blend? - ScalaSyd February 2015
 
Grokking Monads in Scala
Grokking Monads in ScalaGrokking Monads in Scala
Grokking Monads in Scala
 
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)
 
Introduction to R programming
Introduction to R programmingIntroduction to R programming
Introduction to R programming
 

More from CloudxLab

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
CloudxLab
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
CloudxLab
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
CloudxLab
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
CloudxLab
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
CloudxLab
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
CloudxLab
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
CloudxLab
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
CloudxLab
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
CloudxLab
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
CloudxLab
 

More from CloudxLab (20)

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 

Recently uploaded

Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
Safe Software
 
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
Toru Tamaki
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
Yevgen Sysoyev
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
RaminGhanbari2
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
jackson110191
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
Neo4j
 
7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf
Enterprise Wired
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
huseindihon
 
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
KAMAL CHOUDHARY
 
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Bert Blevins
 
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Chris Swan
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
Stephanie Beckett
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
Adam Dunkels
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
Larry Smarr
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
Kief Morris
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
shanthidl1
 

Recently uploaded (20)

Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
 
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
 
7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
 
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
 
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
 

Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutorial | CloudxLab

  • 2. Key-Value RDD Transformations on Pair RDDs keys() Returns an RDD with the keys of each tuple. >>> var m = sc.parallelize(List((1, 2), (3, 4))).keys >>> m.collect() Array[Int] = Array(1, 3)
  • 3. Key-Value RDD Transformations on Pair RDDs values() >>> var m = sc.parallelize(List((1, 2), (3, 4))).values >>> m.collect() Array(2, 4) Return an RDD with the values of each tuple.
  • 4. Key-Value RDD var rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6))); var rdd1 = rdd.groupByKey() var vals = rdd1.collect() for( i <- vals){ for (k <- i.productIterator) { println("t" + k); } } Transformations on Pair RDDs groupByKey() Group values with the same key.
  • 5. Key-Value RDD Questions - Set Operations What will be the result of the following? var rdd = sc.parallelize(Array(("a", 1), ("b", 1), ("a", 1))); rdd.groupByKey().mapValues(_.size).collect()
  • 6. Key-Value RDD var myrdd = sc.parallelize(List(1,2,3,4,5)).map(("x", _)) def cc(x:Int):String = x.toString def mv(x:String, y:Int):String = {x + "," + y} def mc(x:String, y:String):String = {x + ", " + y} myrdd1.combineByKey(cc, mv, mc).collect() Array((x,1, 2, 3, 4,5)) Transformations on Pair RDDs combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None) Combine values with the same key using a different result type. Turns RDD[(K, V)] into a result of type RDD[(K, C)] createCombiner, which turns a V into a C (e.g., creates a one-element list) mergeValue, to merge a V into a C (e.g., adds it to the end of a list) mergeCombiners, to combine two C’s into a single one.
  • 7. Key-Value RDD var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _)) Example: combineByKey 1 2 3 "1, 2, 3"
  • 8. Key-Value RDD var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _)) 1 2 3 Example: combineByKey
  • 9. Key-Value RDD var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _)) def cc(x:Int):String = x.toString 1 2 3 "1" cc Example: combineByKey
  • 10. Key-Value RDD var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _)) def cc(x:Int):String = x.toString 1 2 3 "1" cc "3" cc Example: combineByKey
  • 11. Key-Value RDD var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _)) def cc(x:Int):String = x.toString def mv(x:String, y:Int):String = {x + "," + y.toString} 1 2 3 "1" cc "3" cc mv Example: combineByKey "1,2"
  • 12. Key-Value RDD var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _)) def cc(x:Int):String = x.toString def mv(x:String, y:Int):String = {x + "," + y.toString} def mc(x:String, y:String):String = {x + ", " + y} 1 2 3 "1" cc "3" cc mv "1,2" mc "1,2,3" Example: combineByKey
  • 13. Key-Value RDD var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _)) def cc(x:Int):String = x.toString def mv(x:String, y:Int):String = {x + "," + y.toString} def mc(x:String, y:String):String = {x + ", " + y} myrdd.combineByKey(cc, mv, mc) 1 2 3 "1" cc "3" cc mv "1,2" mc "1,2,3" Example: combineByKey
  • 14. Key-Value RDD var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _)) def cc(x:Int):String = x.toString def mv(x:String, y:Int):String = {x + "," + y.toString} def mc(x:String, y:String):String = {x + ", " + y} myrdd.combineByKey(cc, mv, mc).collect()(0)._2 String = 1,2, 3 1 2 3 "1" cc "3" cc mv "1,2" mc "1,2,3" Example: combineByKey
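The walkthrough above builds a string purely to make the three callbacks visible. A more typical use of combineByKey is computing a per-key average in a single pass. The sketch below is our own illustration (the names scores, sumCount and avg are not from the slides) and assumes the same Spark shell with sc available:
    var scores = sc.parallelize(List(("a", 10), ("b", 4), ("a", 20), ("b", 6)))
    // createCombiner: start a (sum, count) pair from the first value seen for a key in a partition
    // mergeValue: fold another value for the same key into the running (sum, count)
    // mergeCombiners: add up (sum, count) pairs coming from different partitions
    var sumCount = scores.combineByKey(
      (v: Int) => (v, 1),
      (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),
      (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2))
    var avg = sumCount.mapValues(c => c._1.toDouble / c._2)
    avg.collect()   // expected: Array((a,15.0), (b,5.0)), order may vary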
  • 15. Key-Value RDD Questions - Set Operations What will be the result of the following? def cc (v): return ("[" , v , "]"); def mv (c, v): return c[0:-1] + (v, "]") def mc(c1,c2): return c1[0:-1] + c2[1:] mc(mv(cc(1), 2), cc(3))
  • 16. Key-Value RDD Questions - Set Operations ('[', 1, 2, 3, ']') What will be the result of the following? def cc (v): return ("[" , v , "]"); def mv (c, v): return c[0:-1] + (v, "]") def mc(c1,c2): return c1[0:-1] + c2[1:] mc(mv(cc(1), 2), cc(3))
  • 17. Key-Value RDD Questions - Set Operations What will be the result of the following? def cc (v): return ("[" , v , "]"); def mv (c, v): return c[0:-1] + (v, "]") def mc(c1, c2): return c1[0:-1] + c2[1:] rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)]) rdd.combineByKey(cc,mv, mc).collect()
  • 18. Key-Value RDD Questions - Set Operations [('a', ('[', 1, 3, ']')), ('b', ('[', 2, ']'))] What will be the result of the following? def cc (v): return ("[" , v , "]"); def mv (c, v): return c[0:-1] + (v, "]") def mc(c1, c2): return c1[0:-1] + c2[1:] rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)]) rdd.combineByKey(cc,mv, mc).collect()
  • 19. Key-Value RDD Transformations on Pair RDDs sortByKey(ascending=true, numPartitions=current partitions) Sorts this RDD, which is assumed to consist of (key, value) pairs.
  • 20. Key-Value RDD >>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)) >>> sc.parallelize(tmp).sortByKey().collect() Array((1,3), (2,5), (a,1), (b,2), (d,4)) Transformations on Pair RDDs sortByKey(ascending=true, numPartitions=current partitions) Sorts this RDD, which is assumed to consist of (key, value) pairs.
  • 21. Key-Value RDD >>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)) >>> sc.parallelize(tmp).sortByKey(true, 1).collect() Array((1,3), (2,5), (a,1), (b,2), (d,4)) Transformations on Pair RDDs sortByKey(ascending=true, numPartitions=current partitions) Sorts this RDD, which is assumed to consist of (key, value) pairs.
  • 22. Key-Value RDD >>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)) >>> sc.parallelize(tmp).sortByKey(ascending=false, numPartitions=2).collect() Array((d,4), (b,2), (a,1), (2,5), (1,3)) Transformations on Pair RDDs sortByKey(ascending=true, numPartitions=current partitions) Sorts this RDD, which is assumed to consist of (key, value) pairs.
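sortByKey only orders by the key. A common follow-up is ordering by value; one way (our own sketch, not from the slides) is sortBy on the value, or the classic swap-sort-swap pattern:
    var counts = sc.parallelize(List(("a", 3), ("b", 1), ("c", 2)))
    // sortBy accepts an arbitrary ordering field; here we sort on the value, descending
    counts.sortBy(_._2, ascending = false).collect()   // expected: Array((a,3), (c,2), (b,1))
    // equivalent: swap key and value, sortByKey, then swap back on the driver
    counts.map(_.swap).sortByKey(false).collect().map(_.swap)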
  • 23. Key-Value RDD Transformations on Pair RDDs subtractByKey(other, numPartitions=None) Return each (key, value) pair in self that has no pair with matching key in other. >>> var x = sc.parallelize(List(("a", 1), ("b", 4), ("b", 5), ("a", 2))) >>> var y = sc.parallelize(List(("a", 3), ("c", None))) >>> x.subtractByKey(y).collect() Array((b,4), (b,5))
  • 24. Key-Value RDD Transformations on Pair RDDs join(other, numPartitions=None) Return an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
  • 25. Key-Value RDD Transformations on Pair RDDs join(other, numPartitions=None) Return an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other. >>> var x = sc.parallelize(List(("a", 1), ("b", 4), ("c", 5))) >>> var y = sc.parallelize(List(("a", 2), ("a", 3), ("d", 7))) >>> x.join(y).collect() Array((a,(1,2)), (a,(1,3)))
  • 26. Key-Value RDD Transformations on Pair RDDs leftOuterJoin(other, numPartitions=None) Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k. Hash-partitions the resulting RDD into the given number of partitions. >>> var x = sc.parallelize(List(("a", 1), ("b", 4))) >>> var y = sc.parallelize(List(("a", 2))) >>> x.leftOuterJoin(y).collect() Array((a,(1,Some(2))), (b,(4,None)))
  • 27. Key-Value RDD x = sc.parallelize( [(1, "sandeep"), ("2", "sravani")]) y = sc.parallelize( [(1, "ryan"), (3, "giri")]) x.leftOuterJoin(y).collect() Questions - Set Operations What will be the result of the following? LEFT OUTER JOIN
  • 28. Key-Value RDD Questions - Set Operations [(1, ('sandeep', 'ryan')), ('2', ('sravani', None))] What will be the result of the following? x = sc.parallelize( [(1, "sandeep"), ("2", "sravani")]) y = sc.parallelize( [(1, "ryan"), (3, "giri")]) x.leftOuterJoin(y).collect() LEFT OUTER JOIN
  • 29. Key-Value RDD Transformations on Pair RDDs rightOuterJoin(other, numPartitions=None) Perform a right outer join of self and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (v, w)) for v in this, or the pair (k, (None, w)) if no elements in self have key k. Hash-partitions the resulting RDD into the given number of partitions. >>> x = sc.parallelize([("a", 1), ("b", 4)]) >>> y = sc.parallelize([("a", 2)]) >>> y.rightOuterJoin(x).collect() [('a', (2, 1)), ('b', (None, 4))]
  • 30. Key-Value RDD x = sc.parallelize( [(1, "sandeep"), ("2", "sravani")]) y = sc.parallelize( [(1, "ryan"), (3, "giri")]) x.rightOuterJoin(y).collect() Questions - Set Operations What will be the result of the following? RIGHT OUTER JOIN
  • 31. Key-Value RDD Questions - Set Operations [(1, ('sandeep', 'ryan')), (3, (None, 'giri'))] What will be the result of the following? x = sc.parallelize( [(1, "sandeep"), ("2", "sravani")]) y = sc.parallelize( [(1, "ryan"), (3, "giri")]) x.rightOuterJoin(y).collect() RIGHT OUTER JOIN
  • 32. Key-Value RDD Transformations on Pair RDDs >>> var x = sc.parallelize(List(("a", 1), ("b", 4))) >>> var y = sc.parallelize(List(("a", 2), ("a", 3))) >>> var cg = x.cogroup(y) >>> var cgl = cg.collect() Array((a,(CompactBuffer(1),CompactBuffer(2, 3))), (b,(CompactBuffer(4),CompactBuffer()))) This is basically the same as: ((a, ([1], [2,3])), (b, ([4], []))) cogroup(other, numPartitions=None) For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
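cogroup is the building block behind the join family. As a rough sketch (our own illustration, not from the slides), an inner join can be reproduced by cogrouping and then pairing every value of x with every value of y for each common key, reusing x and y from the slide above:
    // keys present on only one side (here "b") produce no pairs, because one of the value lists is empty
    var joined = x.cogroup(y).flatMapValues { case (vs, ws) => for (v <- vs; w <- ws) yield (v, w) }
    joined.collect()   // expected: Array((a,(1,2)), (a,(1,3))), i.e. the same result as x.join(y)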
  • 33. Key-Value RDD Actions Available on Pair RDDs countByKey() Count the number of elements for each key, and return the result to the driver as a map. >>> var rdd = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1), ("a", 10))) >>> rdd.countByKey() Map(a -> 3, b -> 1)
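countByKey() brings the whole result back to the driver, so it is only appropriate when the number of distinct keys is small. For large key sets, a distributed count (our own sketch, using reduceByKey, which is not otherwise covered in this deck) keeps the result as an RDD:
    // same counting, but the result stays distributed instead of being collected on the driver
    var keyCounts = rdd.mapValues(_ => 1L).reduceByKey(_ + _)
    keyCounts.collect()   // expected: Array((a,3), (b,1)) for the rdd defined above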
  • 34. Key-Value RDD Actions Available on Pair RDDs lookup(key) Return the list of values in the RDD for key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to. var lr = sc.parallelize(1 to 1000).map(x => (x, x)) lr.lookup(42) Job 24 finished: lookup at <console>:28, took 0.037469 s WrappedArray(42) var sorted = lr.sortByKey() sorted.lookup(42) // fast Job 21 finished: lookup at <console>:28, took 0.008917 s ArrayBuffer(42)
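The speed-up shown above comes from sortByKey attaching a RangePartitioner to the data. Another way to give an RDD a known partitioner (our own sketch, using HashPartitioner) is partitionBy; after that, lookup only has to scan the single partition the key hashes to:
    import org.apache.spark.HashPartitioner
    var lr2 = sc.parallelize(1 to 1000).map(x => (x, x))
    // partitionBy attaches a HashPartitioner; cache() keeps the shuffled data around for repeated lookups
    var partitioned = lr2.partitionBy(new HashPartitioner(8)).cache()
    partitioned.lookup(42)   // expected: a one-element sequence containing 42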