MapReduce and NoSQL
- 11. Traditional Approach
[Diagram: Apps collect user data into a transactional system (OLTP) that serves users; a separate analytical system (OLAP) analyzes it]
- 12. One Approach
[Diagram: Apps collect into and serve users from a NoSQL store; Hadoop (MapReduce over HDFS) handles analysis]
- 15. Analysis Challenges
Log files in HDFS: sources/routers, sources/apps, sources/webservers
MapReduce over data from all sources for the week of Jan 13th
- 16. One Approach
[Diagram: repeat of slide 12's architecture: Apps and a NoSQL store serve users; Hadoop (MapReduce over HDFS) analyzes]
- 18. Have to be careful to avoid
creating a system that’s bad
at everything
- 21. Performance Profiles
[Chart: Good/Bad ratings of MapReduce, NoSQL, and MapReduce on NoSQL for Throughput, Bulk Update, Latency, and Seek]
- 22. Performance Profiles
[Chart: the same Good/Bad comparison as slide 21, for MapReduce, NoSQL, and MapReduce on NoSQL]
- 24. Best Practices
Use a NoSQL db that has good throughput - it helps to
do local communication
Isolate MapReduce workers to a subset of your NoSQL
nodes so that some are available for fast queries
If MR output is written back to NoSQL db, it is
immediately available for query
THE INTERLLECTIVE
Concept-Based Search
- 28. MongoDB
Python
Ruby on Rails
Hadoop
Thrift
- 34. Config
[Diagram: sharded MongoDB cluster with config servers, shards of replica sets (P = primary, S = secondary), and app servers routing through mongos]
- 35. Config
[Diagram: the same cluster with MR workers attached to the shards alongside the app servers]
- 36. Config
[Diagram: a single job's workers assigned across the shard primaries (P)]
- 37. MongoDB
Mappers read directly from a single mongod process, not through mongos - tends to be local
Balancer can be turned off to avoid the potential for reading data twice
- 38. MongoReduce
Only MongoDb primaries do writes. Schedule mappers on secondaries
Intermediate output goes to HDFS
- 39. MongoReduce
Final output can go to HDFS or MongoDb
- 40. MongoReduce
Mappers can just write to global MongoDb through mongos
- 41. What’s Going On?
[Diagram: map tasks partition output into r1, r2, r3; an identity reducer writes each partition through mongos to the shard primaries (P)]
- 42. MongoReduce
Instead of specifying an HDFS directory for input, can
submit MongoDb query and select statements:
q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}
s = {authors: true}
Queries use indexes!
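The query and select above are mongo-shell syntax; as a rough illustration (not MongoReduce code), here is how the same pair behaves in plain Python, with made-up sample documents and a minimal evaluator for the single $in operator used:

```python
# Sketch of the query (q) and select/projection (s) pair with plain
# dicts; the sample documents are invented for illustration.
docs = [
    {"article_source": "nytimes.com", "authors": ["A. Smith"], "body": "..."},
    {"article_source": "wsj.com", "authors": ["B. Jones"], "body": "..."},
    {"article_source": "example.com", "authors": ["C. Lee"], "body": "..."},
]

q = {"article_source": {"$in": ["nytimes.com", "wsj.com"]}}  # which docs feed the mappers
s = {"authors": True}                                        # which fields are passed along

def matches(doc, query):
    # Minimal evaluator for the single $in operator used here
    field, cond = next(iter(query.items()))
    return doc.get(field) in cond["$in"]

def project(doc, select):
    # Keep only the selected fields
    return {k: doc[k] for k, v in select.items() if v and k in doc}

mapper_input = [project(d, s) for d in docs if matches(d, q)]
print(mapper_input)  # two documents, authors field only
```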
- 43. MongoReduce
If outputting to MongoDb, new collections are
automatically sharded, pre-split, and balanced
Can choose the shard key
Reducers can choose to call update()
- 47. DIY MapRed + NoSQL
YourInputFormat
YourInputSplit
YourRecordReader
YourOutputFormat
YourRecordWriter
YourOutputCommitter
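The classes above are Hadoop's Java extension points; as a loose sketch of the division of labor on the input side (not the real API - all names here are illustrative), a toy Python version:

```python
# Toy sketch of the Hadoop input-side roles; names are illustrative.
class YourInputSplit:
    """Describes one chunk of the NoSQL data, e.g. a shard or key range."""
    def __init__(self, records):
        self.records = records

class YourInputFormat:
    """Decides how to carve the source into splits, one per mapper."""
    def get_splits(self, data, n):
        size = max(1, len(data) // n)
        return [YourInputSplit(data[i:i + size]) for i in range(0, len(data), size)]

class YourRecordReader:
    """Iterates the key/value records inside a single split."""
    def __init__(self, split):
        self.split = split
    def __iter__(self):
        return iter(self.split.records)

# The output side mirrors this: the OutputFormat hands each task a
# RecordWriter, and the OutputCommitter finalizes the job's output.

data = [("k%d" % i, i) for i in range(6)]
splits = YourInputFormat().get_splits(data, 3)
records = [kv for s in splits for kv in YourRecordReader(s)]
```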
- 51. Accumulo
Based on Google's BigTable design
Uses Apache Hadoop, Zookeeper, and Thrift
Features a few novel improvements on the BigTable
design
cell-level access labels
server-side programming mechanism called Iterators
- 53. MapReduce and Accumulo
Can do regular ol’ MapReduce just like w/ MongoDb
But can use Iterators to achieve a kind of ‘continual
MapReduce’
- 55. TabletServer
[Diagram: an ingest client calls map(), and a Reduce' step runs inside the TabletServer; the WordCount table holds live:142, in:2342, holes:234]
- 56. TabletServer
[Diagram: same setup; the ingest client receives a new document, "The red aardvarks live in holes."]
- 57. TabletServer
[Diagram: map() emits aardvarks:1, live:1, in:1, holes:1 for the new document]
- 58. TabletServer
[Diagram: after Reduce', the table holds aardvarks:1, live:143, in:2343, holes:235]
- 59. Accumulo
[Diagram: map() clients send partitions r1, r2, r3 to the tablet servers, where reduce'() runs]
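The word-count flow in slides 55-58 can be sketched as follows (a toy illustration, not Accumulo code; the starting counts are copied from the example table):

```python
# map() emits word:1 entries for an ingested document, and a
# Reduce'-style sum folds them into the stored counts.
from collections import Counter

table = Counter({"live": 142, "in": 2342, "holes": 234})

def map_doc(text):
    # Emit (word, 1) for each word in the document
    for word in text.lower().strip(".").split():
        yield word, 1

def reduce_prime(table, emitted):
    # Fold new partial counts into the stored values
    for word, count in emitted:
        table[word] += count

reduce_prime(table, map_doc("The red aardvarks live in holes."))
print(table["live"], table["in"], table["holes"], table["aardvarks"])  # 143 2343 235 1
```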
- 60. Iterators
row : column family : column qualifier : ts -> value
can specify which key elements are unique, e.g.
row : column family
can specify a function to execute on values of identical
key-portions, e.g.
sum(), max(), min()
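A toy illustration of the idea (not the Accumulo API): entries keyed by row : column family : column qualifier : ts, uniqueness declared on the row : column family prefix, and a function applied per group. All names are made up:

```python
# Group entries on a declared-unique key prefix and collapse their values.
from itertools import groupby

entries = [
    (("doc1", "count", "a", 3), 1),
    (("doc1", "count", "b", 2), 4),
    (("doc2", "count", "a", 1), 7),
]

def unique_portion(entry):
    (row, cf, _cq, _ts), _value = entry
    return (row, cf)  # declared-unique key elements: row : column family

def apply_iterator(entries, fn):
    out = {}
    for key, group in groupby(sorted(entries, key=unique_portion), unique_portion):
        out[key] = fn(v for _k, v in group)
    return out

print(apply_iterator(entries, sum))  # doc1 -> 5, doc2 -> 7; max/min work the same way
```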
- 61. Key to performance
When the functions are run
Rather than atomic increment (lock, read, +1, write: SLOW), write all values and sum at:
read time
minor compaction time
major compaction time
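The contrast can be sketched in a few lines of Python (a toy model, not Accumulo internals):

```python
# "Write cheap, sum later": each update appends a value with no lock or
# read, and sum() runs when the data is read or compacted.
store = {}  # key -> list of written values (never read during writes)

def write(key, value):
    store.setdefault(key, []).append(value)  # append-only, no read-modify-write

def read(key):
    return sum(store[key])  # function applied at read time

def compact(key):
    # Minor/major compaction: collapse values so future reads sum fewer
    store[key] = [sum(store[key])]

write("in", 2342)
write("in", 1)
compact("in")      # store["in"] is now [2343]
print(read("in"))  # 2343 either way: compaction doesn't change query results
```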
- 62. TabletServer
[Diagram: at read time, a scan runs Reduce' with sum() over the stored values, so live:142 and live:1 are returned as live:143]
- 63. TabletServer
[Diagram: a major compaction runs Reduce' with sum(), rewriting the stored values as aardvark:1, live:143, in:2343, holes:235]
- 64. Reduce’ (prime)
Because a function may not have seen all values for a given key - another may show up later
More like writing a MapReduce combiner function
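The combiner-like requirement is that the function give the same answer whether it sees all values at once or partial results of earlier runs; a quick check:

```python
# Reduce' may run on partial groups, so its results must compose.
values = [1, 4, 7, 2]

full = sum(values)                                  # one pass over everything
partial = sum([sum(values[:2]), sum(values[2:])])   # run twice, then on the partials

assert full == partial == 14
# max() and min() compose the same way; a plain mean() would not, so it
# couldn't be used here without carrying counts alongside the sums.
```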
- 65. ‘Continuous’ MapReduce
Can maintain huge result sets that are always available
for query
Update graph edge weights
Update feature vector weights
Statistical counts
normalize after query to get probabilities
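Normalizing at query time can be as simple as the following (toy example, with counts reused from the word-count slides):

```python
# The table stores raw counts; probabilities are computed only when a
# query reads them out, so writes stay cheap.
counts = {"aardvarks": 1, "live": 143, "in": 2343, "holes": 235}

def probabilities(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

probs = probabilities(counts)
assert abs(sum(probs.values()) - 1.0) < 1e-9  # a proper distribution
```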
- 69. Google Percolator
A system for incrementally processing updates to a
large data set
Used to create the Google web search index.
Reduced the average age of documents in Google
search results by 50%.
- 71. Solution Space
Incremental update, multi-row consistency: Percolator
Results can’t be broken down (sort): MapReduce
No multi-row updates: BigTable
Computation is small: Traditional DBMS