MapReduce and NoSQL
Exploring the Solution Space
Changing the Game

[Chart: Cost (dollars) vs. Data Volume, comparing Conventional systems with MR + NoSQL]
[Chart: Scale vs. Data Volume, comparing Conventional systems with MR + NoSQL]
Performance Profiles

[Bar chart: MapReduce rated from Good to Bad on Throughput, Bulk Update, Latency, and Seek]
[Bar chart: NoSQL rated from Good to Bad on the same four dimensions]
[Bar chart: MapReduce and NoSQL compared side by side on Throughput, Bulk Update, Latency, and Seek]
Data Goals

- Collect
- Serve
- Analyze
Traditional Approach

[Diagram: Users and Apps collect data into a Transactional System (OLTP), which serves them; data flows on to an Analytical System (OLAP) for analysis]
One Approach

[Diagram: Users and Apps collect into and are served by NoSQL; data is analyzed by Hadoop MapReduce over HDFS]
Analysis Challenges

- Analytical latency: data is always old, and answers can take a long time
- Serving up analytical results
- Higher cost, complexity
- Incremental updates
Analysis Challenges

Word counts so far:

a: 5342
aardvark: 13
an: 4553
anteater: 27
...
yellow: 302
zebra: 19

New document: “The red aardvarks live in holes.”
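The incremental-update problem can be sketched in a few lines of Python (the counts and the tokenization are illustrative): folding one new document into the existing table touches only a handful of keys, whereas a fresh MapReduce job would rescan the whole corpus.

```python
# Existing word-count table (illustrative values from the slide).
counts = {"a": 5342, "aardvark": 13, "an": 4553, "anteater": 27,
          "yellow": 302, "zebra": 19}

# Fold a single new document in incrementally, rather than recounting
# the entire corpus.
new_doc = "The red aardvarks live in holes."
for word in new_doc.lower().rstrip(".").split():
    counts[word] = counts.get(word, 0) + 1
```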
Analysis Challenges

HDFS log files (sources/routers, sources/apps, sources/webservers): MapReduce runs over data from all sources for the week of Jan 13th.
Possible Approach

[Diagram: Users and Apps collect into and are served by NoSQL; Hadoop MapReduce runs analysis directly over the NoSQL store]
Have to be careful to avoid
creating a system that’s bad
at everything
What could go wrong?
Performance Profiles

[Bar chart: MapReduce, NoSQL, and MapReduce on NoSQL compared on Throughput, Bulk Update, Latency, and Seek]
[Bar chart: MapReduce on NoSQL alone, rated from Good to Bad on the same four dimensions]
Best Practices

- Use a NoSQL db that has good throughput; it helps to do local communication
- Isolate MapReduce workers to a subset of your NoSQL nodes so that some are available for fast queries
- If MR output is written back to the NoSQL db, it is immediately available for query
The Interllective

Concept-Based Search

- Patents
- News Articles
- PubMed
- Clinical Trials
- ArXiv Articles
MapReduce and NoSQL
- MongoDB
- Python
- Ruby on Rails
- Hadoop
- Thrift
Feature Vectors
Unstructured Data
MapReduce and NoSQL
www.interllective.com
MapReduce on MongoDB

- Built-in MapReduce (JavaScript)
- mongo-hadoop
- MongoReduce
MongoReduce

[Diagram: a sharded MongoDB cluster: config servers, shards of primaries (P) with secondary replicas (S), and app servers connecting through mongos]
[Diagram: MapReduce workers run against the shards while app servers continue to query through mongos]
[Diagram: a single job runs across the shards]
MongoDB

- Mappers read directly from a single mongod process, not through mongos; reads tend to be local
- The balancer can be turned off to avoid the potential for reading data twice

[Diagram: a Map task reads from its local mongod and writes to HDFS]
MongoReduce

- Only MongoDB primaries do writes; schedule mappers on secondaries
- Intermediate output goes to HDFS
- Final output can go to HDFS or MongoDB
- Mappers can just write to global MongoDB through mongos
What’s Going On?

[Diagram: in normal MapReduce, map output shuffles through reducers r1, r2, r3; with an identity reducer, map tasks write through mongos straight to the shard primaries (P)]
MongoReduce

Instead of specifying an HDFS directory for input, you can submit a MongoDB query and select statement:

q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}

s = {authors: true}

Queries use indexes!
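In pure Python, the effect of that query/select pair can be sketched as follows (the documents are made up; the matching mirrors MongoDB's $in and field-selection semantics):

```python
# Illustrative documents standing in for a sharded articles collection.
docs = [
    {"article_source": "nytimes.com", "authors": ["A. Smith"], "body": "..."},
    {"article_source": "blog.example", "authors": ["B. Jones"], "body": "..."},
    {"article_source": "wsj.com",     "authors": ["C. Lee"],   "body": "..."},
]

# q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}
q_sources = {"nytimes.com", "wsj.com"}

# s = {authors: true} -- only the authors field reaches the mappers
inputs = [{"authors": d["authors"]}
          for d in docs if d["article_source"] in q_sources]
```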
MongoReduce

- If outputting to MongoDB, new collections are automatically sharded, pre-split, and balanced
- Can choose the shard key
- Reducers can choose to call update()
MongoReduce

If writing output to MongoDB, specify an objectId to ensure idempotent writes, i.e. not a random UUID.
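A sketch of why a deterministic _id matters (the helper and the in-memory store are hypothetical; a real job would upsert through mongos): if the output _id is derived from the reduce key, a re-executed reducer overwrites its earlier write instead of inserting a duplicate.

```python
import hashlib

def output_id(key):
    # Deterministic _id derived from the reduce key (not a random UUID).
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

store = {}  # stand-in for the output collection

def write_result(key, value):
    # Re-running the same task produces the same _id, so the write is
    # idempotent: it replaces the earlier document rather than duplicating it.
    store[output_id(key)] = {"_id": output_id(key), "value": value}

write_result("aardvarks", 1)
write_result("aardvarks", 1)  # speculative re-execution of the same task
```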
MapReduce and NoSQL
https://github.com/acordova/MongoReduce
DIY MapRed + NoSQL
YourInputFormat
  YourInputSplit
  YourRecordReader
YourOutputFormat
  YourRecordWriter
  YourOutputCommitter
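The classes above are Hadoop Java interfaces; as a rough structural sketch (Python stand-ins, names mirroring the slide), a DIY connector must describe how the NoSQL keyspace splits, how records are read per split, and how output is written and committed:

```python
class YourInputSplit:
    """One chunk of the NoSQL keyspace, e.g. a shard, tablet, or key range."""
    def __init__(self, start_key, end_key, location):
        self.start_key, self.end_key = start_key, end_key
        self.location = location  # preferred host, so map tasks can run data-local

class YourInputFormat:
    """Turns a table into splits; here, one split per (hypothetical) shard."""
    def get_splits(self, shards):
        return [YourInputSplit(s["lo"], s["hi"], s["host"]) for s in shards]

class YourRecordReader:
    """Iterates key/value records within one split by querying the db."""
    def __init__(self, records):
        self._it = iter(records)
    def __iter__(self):
        return self._it

class YourRecordWriter:
    """Buffers reduce output; the OutputCommitter makes it visible on success."""
    def __init__(self):
        self.pending, self.committed = [], []
    def write(self, key, value):
        self.pending.append((key, value))
    def commit(self):  # YourOutputCommitter's job when the task succeeds
        self.committed, self.pending = self.pending, []
```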
Brisk
MapReduce and NoSQL
Accumulo

- Based on Google's BigTable design
- Uses Apache Hadoop, ZooKeeper, and Thrift
- Features a few novel improvements on the BigTable design:
  - cell-level access labels
  - a server-side programming mechanism called Iterators
MapReduce and Accumulo

- Can do regular ol’ MapReduce just like with MongoDB
- But can use Iterators to achieve a kind of ‘continual MapReduce’
TabletServers

[Diagram: a Master coordinates the TabletServers; app servers read while ingest clients write]
TabletServer

[Diagram sequence: a WordCount table on a TabletServer holds live:142, in:2342, holes:234. An ingest client on an app server receives the document “The red aardvarks live in holes.” and its map() emits aardvarks:1, live:1, in:1, holes:1. After the server-side Reduce’ step, the table reads aardvarks:1, live:143, in:2343, holes:235.]
Accumulo

[Diagram: classic MapReduce (Map tasks shuffling to reducers r1, r2, r3) next to the Accumulo model, where ingest clients run map() and a server-side reduce’() combines values]
Iterators

- row : column family : column qualifier : ts -> value
- can specify which key elements are unique, e.g. row : column family
- can specify a function to execute on values of identical key-portions, e.g. sum(), max(), min()
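A minimal sketch of that iterator behavior in Python (keys and values are illustrative): entries sharing the identical key-portion are folded with the configured function, here sum().

```python
from itertools import groupby

# Multiple raw entries per key, as written by ingest clients over time.
entries = [("holes", 234), ("holes", 1), ("in", 2342), ("in", 1),
           ("live", 142), ("live", 1), ("aardvarks", 1)]

# An iterator scans entries in sorted key order and applies the configured
# function (sum, max, min, ...) to runs of identical key-portions.
entries.sort(key=lambda e: e[0])
combined = {key: sum(v for _, v in group)
            for key, group in groupby(entries, key=lambda e: e[0])}
```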
Key to performance

- The key is when the functions are run
- Rather than an atomic increment (lock, read, +1, write: SLOW), write all values and sum at:
  - read time
  - minor compaction time
  - major compaction time
TabletServer

[Diagram: at read time, a scan over aardvark:1, live:142, live:1, in:2342, in:1, holes:234, holes:1 runs the Reduce’ sum() and returns combined values such as live:143]
[Diagram: at major compaction, the same sum() folds the stored entries down to aardvark:1, live:143, in:2343, holes:235]
Reduce’ (prime)

- A function may not have seen all values for a given key; another may still show up
- So it is more like writing a MapReduce combiner function
‘Continuous’ MapReduce

- Can maintain huge result sets that are always available for query
- Update graph edge weights
- Update feature vector weights
- Statistical counts: normalize after query to get probabilities
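For the statistical-counts case, normalizing at query time might look like this (counts are illustrative): the table only ever stores raw counts, and probabilities are derived per query.

```python
# Raw counts as stored in the table; only these are ever written.
counts = {"aardvarks": 1, "live": 143, "in": 2343, "holes": 235}

# Normalize at read time to get a probability distribution.
total = sum(counts.values())
probs = {word: c / total for word, c in counts.items()}
```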
Accumulo - Latin: to accumulate ... awesomeness
incubator.apache.org/accumulo
wiki.apache.org/incubator/AccumuloProposal
Google Percolator

- A system for incrementally processing updates to a large data set
- Used to create the Google web search index
- Reduced the average age of documents in Google search results by 50%
Google Percolator

- A novel, proprietary system of distributed transactions and notifications built on top of BigTable
Solution Space

- Incremental update, multi-row consistency: Percolator
- Results can’t be broken down (sort): MapReduce
- No multi-row updates: BigTable
- Computation is small: traditional DBMS
Questions
