Modern architectures are moving away from a "one size fits all" approach. We are well aware that we need to use the best tool for each job. Given the large selection of options available today, chances are you will end up managing data in MongoDB for your operational workload and using Spark for your high-speed data processing needs.
Description: When we model documents or data structures, there are key aspects that need to be examined not only for functional and architectural purposes, but also to account for the distribution of data across nodes, streaming capabilities, aggregation and queryability options, and how we can integrate data processing software, like Spark, that can benefit from subtle but substantial model changes. A clear example is the choice between embedding and referencing documents, and its implications for high-speed processing.
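To make the embedding-versus-referencing trade-off concrete, here is a minimal sketch in plain Scala. The case classes, field names, and sample values are all illustrative assumptions, not part of the talk: with referenced documents a processing job must first join two collections, while the embedded model hands the processor self-contained records.

```scala
// Hypothetical "stock tick" data, illustrating the two modeling choices.
// All names and values here are made up for illustration.

// Referenced model: symbol metadata lives in a separate collection,
// so processing requires a join (a lookup per document).
case class SymbolMeta(symbol: String, exchange: String)
case class TickRef(symbol: String, price: Double)

// Embedded model: the metadata travels inside each tick document,
// so a processing job can consume one document stream without joining.
case class TickEmbedded(symbol: String, exchange: String, price: Double)

val meta     = Map("MDB" -> SymbolMeta("MDB", "NASDAQ"))
val refTicks = List(TickRef("MDB", 32.5), TickRef("MDB", 33.0))

// Referenced: a join step is needed before any analytics can run.
val joined = refTicks.map(t => TickEmbedded(t.symbol, meta(t.symbol).exchange, t.price))

// Embedded: the same records, ready to process directly.
val embTicks = List(TickEmbedded("MDB", "NASDAQ", 32.5), TickEmbedded("MDB", "NASDAQ", 33.0))

assert(joined == embTicks) // both models carry the same information
```

The embedded form trades some write-side duplication for read-side locality, which is exactly the kind of subtle model change that pays off in high-speed processing.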
Over the course of this talk we will detail the benefits of a good document model for the operational workload, as well as the types of transformations we should incorporate into our document model to suit the high-speed processing capabilities of Spark.
We will look into the different options we have to connect these two systems, how to model for different workloads, which operators we need to be aware of for top performance, and what kind of design and architecture we should put in place to make sure all of these systems work well together.
Over the course of the talk we will showcase different libraries that enable the integration between Spark and MongoDB, such as the MongoDB Hadoop Connector, the Stratio Spark-MongoDB Connector, and the native MongoDB Spark Connector.
By the end of the talk I expect the attendees to have an understanding of:
How to connect their MongoDB clusters with Spark
Which use cases show a net benefit for connecting these two systems
What kind of architecture design should be considered for making the most of Spark + MongoDB
How documents can be modeled for better performance and operational processing of the data sets stored in MongoDB.
The talk is suitable for:
Developers that want to understand how to leverage Spark
Architects that want to integrate their existing MongoDB clusters with real-time, high-speed processing systems
Data scientists that know Spark, are experimenting with it, and want to integrate MongoDB as their persistence layer
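As a first taste of how the two systems get wired together, here is a sketch of the configuration a Spark job hands to the MongoDB Spark Connector. The option keys are the connector's input/output URI settings; the host, database, and collection names are placeholder assumptions, and the surrounding Spark setup (building the SparkConf and context) is omitted.

```scala
// Connection settings for reading/writing a MongoDB collection from Spark.
// Host, database and collection names below are placeholder assumptions.
val mongoHost  = "mongodb://localhost:27017"
val database   = "marketdata"
val collection = "minbars"

// The MongoDB Spark Connector reads these keys from the SparkConf, e.g.:
//   sparkMongoConf.foreach { case (k, v) => sparkConf.set(k, v) }
val sparkMongoConf: Map[String, String] = Map(
  "spark.mongodb.input.uri"  -> s"$mongoHost/$database.$collection",
  "spark.mongodb.output.uri" -> s"$mongoHost/$database.results"
)

assert(sparkMongoConf("spark.mongodb.input.uri") ==
  "mongodb://localhost:27017/marketdata.minbars")
```

With that configuration in place, the connector exposes the collection to Spark as an RDD or DataFrame, which is the starting point for everything the talk covers.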
Spark Stack
The stack is built from these components on top of the Apache Spark core engine:
• Spark SQL: seamless integration with SQL using the DataFrame API; also supports Hive SQL
• Spark Streaming: fast feed data processing API, designed for fault tolerance; bridges streaming with batch processing
• MLlib: Spark's machine learning algorithms trick bag
• GraphX: Spark's graph library
Delivering User Relevancy
• Integrate data from many sources
• Fast-cycle analytics
• Real-time
• Reliable
MongoDB Hadoop Connector
Positive:
• Battle tested
• Integrated with existing Hadoop components
• Supports Hive and Pig
Not so good:
• Not the fastest thing
• Not dedicated to Spark
• Dependent on HDFS
http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
Stratio Spark-MongoDB
val dfFiveMinForMonth = sqlContext.sql(
  """
  SELECT m.Symbol, m.OpenTime as Timestamp, m.Open, m.High, m.Low, m.Close
  FROM
    ...
    FROM minbars) as m
  WHERE unix_timestamp(m.CloseTime, 'yyyy-MM-dd HH:mm') -
        unix_timestamp(m.OpenTime, 'yyyy-MM-dd HH:mm') = 60*4"""
)
What to expect
• We are working on a dedicated Spark connector for MongoDB
• The Stratio connector is great, but:
  – Some operations are actually faster if performed using the Aggregation Framework
• Better integration with the upcoming 3.2 async Java driver
  – Especially for the Spark Streaming support
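The Aggregation Framework point above can be illustrated with plain Scala collections standing in for documents (names and values are made up): pushing the filter-and-aggregate down to the database, as a $match + $group pipeline would, means only a reduced result crosses the wire, instead of the whole collection being shipped to Spark and filtered there.

```scala
// Stand-in for documents in a "minbars" collection (illustrative data).
case class MinBar(symbol: String, close: Double)
val collectionData = List(
  MinBar("MDB", 30.0), MinBar("AAPL", 120.0), MinBar("MDB", 31.0))

// Server-side pushdown (what $match + $group would do in MongoDB):
// only matching documents participate, and a single number leaves the DB.
val matched  = collectionData.filter(_.symbol == "MDB")
val avgClose = matched.map(_.close).sum / matched.size

// Filtering in Spark after a full collection scan yields the same value,
// but the entire collection had to be transferred first.
assert(avgClose == 30.5)
```

The result is identical either way; the difference is how many documents travel between MongoDB and Spark, which is why a dedicated connector that can push work into the Aggregation Framework matters.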
MongoDB Days 2015
5 November 2015, London
https://www.mongodb.com/events/mongodb-days-uk