MapReduce and NoSQL
Exploring the Solution Space
Changing the Game

[Chart: Cost (dollars) vs. Data Volume, comparing Conventional systems with MR + NoSQL]
[Chart: Scale vs. Data Volume, comparing Conventional systems with MR + NoSQL]
Performance Profiles

[Bar chart: MapReduce rated from Good to Bad on Throughput, Bulk Update, Latency, and Seek]
[Bar chart: NoSQL rated from Good to Bad on the same four dimensions]
[Bar chart: MapReduce and NoSQL compared side by side on Throughput, Bulk Update, Latency, and Seek]
Data Goals

- Collect
- Serve
- Analyze
Traditional Approach

[Diagram: Users and Apps collect data into a Transactional System (OLTP), which serves them; data flows on to an Analytical System (OLAP) for analysis]
One Approach

[Diagram: Users and Apps collect into and are served by NoSQL; data is analyzed by Hadoop MapReduce over HDFS]
Analysis Challenges

- Analytical latency: data is always old, and answers can take a long time
- Serving up analytical results
- Higher cost, complexity
- Incremental updates
Analysis Challenges

Word counts so far:

a: 5342
aardvark: 13
an: 4553
anteater: 27
...
yellow: 302
zebra: 19

New document: “The red aardvarks live in holes.”
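The incremental-update problem can be sketched in a few lines of Python (the counts and the tokenization are illustrative): folding one new document into the existing table touches only a handful of keys, whereas a fresh MapReduce job would rescan the whole corpus.

```python
# Existing word-count table (illustrative values from the slide).
counts = {"a": 5342, "aardvark": 13, "an": 4553, "anteater": 27,
          "yellow": 302, "zebra": 19}

# Fold a single new document in incrementally, rather than recounting
# the entire corpus.
new_doc = "The red aardvarks live in holes."
for word in new_doc.lower().rstrip(".").split():
    counts[word] = counts.get(word, 0) + 1
```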
Analysis Challenges

HDFS log files (sources/routers, sources/apps, sources/webservers): MapReduce runs over data from all sources for the week of Jan 13th.
Possible Approach

[Diagram: Users and Apps collect into and are served by NoSQL; Hadoop MapReduce runs analysis directly over the NoSQL store]
Have to be careful to avoid
creating a system that’s bad
at everything
What could go wrong?
Performance Profiles

[Bar chart: MapReduce, NoSQL, and MapReduce on NoSQL compared on Throughput, Bulk Update, Latency, and Seek]
[Bar chart: MapReduce on NoSQL alone, rated from Good to Bad on the same four dimensions]
Best Practices

- Use a NoSQL db that has good throughput; it helps to do local communication
- Isolate MapReduce workers to a subset of your NoSQL nodes so that some are available for fast queries
- If MR output is written back to the NoSQL db, it is immediately available for query
The Interllective

Concept-Based Search

- Patents
- News Articles
- PubMed
- Clinical Trials
- ArXiv Articles
MapReduce and NoSQL
- MongoDB
- Python
- Ruby on Rails
- Hadoop
- Thrift
Feature Vectors
Unstructured Data
MapReduce and NoSQL
www.interllective.com
MapReduce on MongoDB

- Built-in MapReduce (JavaScript)
- mongo-hadoop
- MongoReduce
MongoReduce

[Diagram: a sharded MongoDB cluster: config servers, shards of primaries (P) with secondary replicas (S), and app servers connecting through mongos]
[Diagram: MapReduce workers run against the shards while app servers continue to query through mongos]
[Diagram: a single job runs across the shards]
MongoDB

- Mappers read directly from a single mongod process, not through mongos; reads tend to be local
- The balancer can be turned off to avoid the potential for reading data twice

[Diagram: a Map task reads from its local mongod and writes to HDFS]
MongoReduce

- Only MongoDB primaries do writes; schedule mappers on secondaries
- Intermediate output goes to HDFS
- Final output can go to HDFS or MongoDB
- Mappers can just write to global MongoDB through mongos
What’s Going On?

[Diagram: in normal MapReduce, map output shuffles through reducers r1, r2, r3; with an identity reducer, map tasks write through mongos straight to the shard primaries (P)]
MongoReduce

Instead of specifying an HDFS directory for input, you can submit a MongoDB query and select statement:

q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}

s = {authors: true}

Queries use indexes!
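In pure Python, the effect of that query/select pair can be sketched as follows (the documents are made up; the matching mirrors MongoDB's $in and field-selection semantics):

```python
# Illustrative documents standing in for a sharded articles collection.
docs = [
    {"article_source": "nytimes.com", "authors": ["A. Smith"], "body": "..."},
    {"article_source": "blog.example", "authors": ["B. Jones"], "body": "..."},
    {"article_source": "wsj.com",     "authors": ["C. Lee"],   "body": "..."},
]

# q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}
q_sources = {"nytimes.com", "wsj.com"}

# s = {authors: true} -- only the authors field reaches the mappers
inputs = [{"authors": d["authors"]}
          for d in docs if d["article_source"] in q_sources]
```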
MongoReduce

- If outputting to MongoDB, new collections are automatically sharded, pre-split, and balanced
- Can choose the shard key
- Reducers can choose to call update()
MongoReduce

If writing output to MongoDB, specify an objectId to ensure idempotent writes, i.e. not a random UUID.
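A sketch of why a deterministic _id matters (the helper and the in-memory store are hypothetical; a real job would upsert through mongos): if the output _id is derived from the reduce key, a re-executed reducer overwrites its earlier write instead of inserting a duplicate.

```python
import hashlib

def output_id(key):
    # Deterministic _id derived from the reduce key (not a random UUID).
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

store = {}  # stand-in for the output collection

def write_result(key, value):
    # Re-running the same task produces the same _id, so the write is
    # idempotent: it replaces the earlier document rather than duplicating it.
    store[output_id(key)] = {"_id": output_id(key), "value": value}

write_result("aardvarks", 1)
write_result("aardvarks", 1)  # speculative re-execution of the same task
```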
MapReduce and NoSQL
https://github.com/acordova/MongoReduce
DIY MapRed + NoSQL
YourInputFormat
  YourInputSplit
  YourRecordReader
YourOutputFormat
  YourRecordWriter
  YourOutputCommitter
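The classes above are Hadoop Java interfaces; as a rough structural sketch (Python stand-ins, names mirroring the slide), a DIY connector must describe how the NoSQL keyspace splits, how records are read per split, and how output is written and committed:

```python
class YourInputSplit:
    """One chunk of the NoSQL keyspace, e.g. a shard, tablet, or key range."""
    def __init__(self, start_key, end_key, location):
        self.start_key, self.end_key = start_key, end_key
        self.location = location  # preferred host, so map tasks can run data-local

class YourInputFormat:
    """Turns a table into splits; here, one split per (hypothetical) shard."""
    def get_splits(self, shards):
        return [YourInputSplit(s["lo"], s["hi"], s["host"]) for s in shards]

class YourRecordReader:
    """Iterates key/value records within one split by querying the db."""
    def __init__(self, records):
        self._it = iter(records)
    def __iter__(self):
        return self._it

class YourRecordWriter:
    """Buffers reduce output; the OutputCommitter makes it visible on success."""
    def __init__(self):
        self.pending, self.committed = [], []
    def write(self, key, value):
        self.pending.append((key, value))
    def commit(self):  # YourOutputCommitter's job when the task succeeds
        self.committed, self.pending = self.pending, []
```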
Brisk
MapReduce and NoSQL
Accumulo

- Based on Google's BigTable design
- Uses Apache Hadoop, ZooKeeper, and Thrift
- Features a few novel improvements on the BigTable design:
  - cell-level access labels
  - a server-side programming mechanism called Iterators
MapReduce and Accumulo

- Can do regular ol’ MapReduce just like with MongoDB
- But can use Iterators to achieve a kind of ‘continual MapReduce’
TabletServers

[Diagram: a Master coordinates the TabletServers; app servers read while ingest clients write]
TabletServer

[Diagram sequence: a WordCount table on a TabletServer holds live:142, in:2342, holes:234. An ingest client on an app server receives the document “The red aardvarks live in holes.” and its map() emits aardvarks:1, live:1, in:1, holes:1. After the server-side Reduce’ step, the table reads aardvarks:1, live:143, in:2343, holes:235.]
Accumulo

[Diagram: classic MapReduce (Map tasks shuffling to reducers r1, r2, r3) next to the Accumulo model, where ingest clients run map() and a server-side reduce’() combines values]
Iterators

- row : column family : column qualifier : ts -> value
- can specify which key elements are unique, e.g. row : column family
- can specify a function to execute on values of identical key-portions, e.g. sum(), max(), min()
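A minimal sketch of that iterator behavior in Python (keys and values are illustrative): entries sharing the identical key-portion are folded with the configured function, here sum().

```python
from itertools import groupby

# Multiple raw entries per key, as written by ingest clients over time.
entries = [("holes", 234), ("holes", 1), ("in", 2342), ("in", 1),
           ("live", 142), ("live", 1), ("aardvarks", 1)]

# An iterator scans entries in sorted key order and applies the configured
# function (sum, max, min, ...) to runs of identical key-portions.
entries.sort(key=lambda e: e[0])
combined = {key: sum(v for _, v in group)
            for key, group in groupby(entries, key=lambda e: e[0])}
```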
Key to performance

- The key is when the functions are run
- Rather than an atomic increment (lock, read, +1, write: SLOW), write all values and sum at:
  - read time
  - minor compaction time
  - major compaction time
TabletServer

[Diagram: at read time, a scan over aardvark:1, live:142, live:1, in:2342, in:1, holes:234, holes:1 runs the Reduce’ sum() and returns combined values such as live:143]
[Diagram: at major compaction, the same sum() folds the stored entries down to aardvark:1, live:143, in:2343, holes:235]
Reduce’ (prime)

- A function may not have seen all values for a given key; another may still show up
- So it is more like writing a MapReduce combiner function
‘Continuous’ MapReduce

- Can maintain huge result sets that are always available for query
- Update graph edge weights
- Update feature vector weights
- Statistical counts: normalize after query to get probabilities
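For the statistical-counts case, normalizing at query time might look like this (counts are illustrative): the table only ever stores raw counts, and probabilities are derived per query.

```python
# Raw counts as stored in the table; only these are ever written.
counts = {"aardvarks": 1, "live": 143, "in": 2343, "holes": 235}

# Normalize at read time to get a probability distribution.
total = sum(counts.values())
probs = {word: c / total for word, c in counts.items()}
```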
Accumulo - Latin: to accumulate ... awesomeness
incubator.apache.org/accumulo
wiki.apache.org/incubator/AccumuloProposal
Google Percolator

- A system for incrementally processing updates to a large data set
- Used to create the Google web search index
- Reduced the average age of documents in Google search results by 50%
Google Percolator

- A novel, proprietary system of distributed transactions and notifications built on top of BigTable
Solution Space

- Incremental update, multi-row consistency: Percolator
- Results can’t be broken down (sort): MapReduce
- No multi-row updates: BigTable
- Computation is small: traditional DBMS
Questions
