MongoDB Basics

Sarang Shravagi
Python Developer, ScaleArc
@_sarangs

Let’s Know Each Other
• Why are you attending?
• Do you code?
• OS?
• Programming Language?
• JSON?
• MongoDB?

Agenda
• SQL and NoSQL Database
• What is MongoDB?
• Hands-On and Assignment
• Design Models
• MongoDB Language Driver
• Disaster Recovery
• Handling BigData

Data Patterns & Storage Needs
• Product Information
• User Information
• Purchase Information
• Product Reviews
• Site Interactions
• Social Graph
• Search Index

SQL to NoSQL
Design Paradigm Shift

SQL Storage
• Was designed when
– Storage and data transfer were costly
– Processing was slow
– Applications were oriented more towards data collection
• Initial adopters were financial institutions

SQL Storage
• Structured
– schema
• Relational
– foreign keys, constraints
• Transactional
– Atomicity, Consistency, Isolation, Durability
• High Availability through robustness
– Minimize failures
• Optimized for Writes
• Typically Scale Up

NoSQL Storage
• Is designed when
– Storage is cheap
– Data transfer is fast
– Much more processing power is available
• Clustering of machines is also possible
– Applications are oriented towards consumption of User Generated Content
– Better on-screen user experience is in demand

NoSQL Storage
• Semi-structured
– Schemaless
• Consistency, Availability, Partition Tolerance
• High Availability through clustering
– expect failures
• Optimized for Reads
• Typically Scale Out

Different Databases
Half Level Deep

SQL: RDBMS
• MySQL, PostgreSQL, Oracle etc.
• Stores data in tables having columns
– Basic (number, text) data types
• Strong query language
• Transparent values
– Query language can read and filter on them
– Relationship between tables based on values
• Suited for user info and transactions

NoSQL: Document
• MongoDB, CouchDB etc.
• Object Oriented data models
– Stores data in document objects having fields
– Basic and compound (list, dict) data types
• SQL like queries
– Can be part of query
• Suited for product info and its reviews

NoSQL: Column Family
• Cassandra, Big Table etc.
• Stores data in columns
– Can be part of query
• SQL like queries
• Suited for search

NoSQL: Graph
• Neo4j
• Stores data in form of nodes and relationships
• Query is in form of traversal
• In-memory
• Suited for social graph

MongoDB is a ___________ database
1. Document
2. Open source
3. High performance
4. Horizontally scalable
5. Full featured

1. Document Database
• Not for .PDF & .DOC files
• A document is essentially an associative array
• Document = JSON object
• Document = PHP Array
• Document = Python Dict
• Document = Ruby Hash
• etc
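
The "document = dict = JSON object" equivalence can be sketched in a few lines of Python. This is only an illustration using the standard library; the field names and values are made up, and no MongoDB server or driver is involved.

```python
import json

# A MongoDB-style document written as a plain Python dict.
# Field names and values here are hypothetical.
document = {
    "title": "Welcome to MongoDB",
    "tags": ["MongoDB", "schema"],                     # compound type: list
    "author": {"name": "sarangs", "city": "Mumbai"},   # compound type: nested dict
}

# The same structure serialised as JSON text and parsed back again.
as_json = json.dumps(document, sort_keys=True)
round_tripped = json.loads(as_json)
```

The dict survives the round trip unchanged, which is why drivers can hand documents to application code as native data structures.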

2. Open Source
• MongoDB is an open source project
• On GitHub
• Licensed under the AGPL
• Started & sponsored by MongoDB Inc (formerly known as 10gen)
• Commercial licenses available
• Contributions welcome

Global Community
• 7,000,000+ MongoDB Downloads
• 150,000+ Online Education Registrants
• 35,000+ MongoDB Management Service (MMS) Users
• 30,000+ MongoDB User Group Members
• 20,000+ MongoDB Days Attendees

3. High Performance
• Written in C++
• Extensive use of memory-mapped files
i.e. read-through write-through memory caching.
• Runs nearly everywhere
• Data serialized as BSON (fast parsing)
• Full support for primary & secondary indexes
• Document model = less work

Performance
• Better Data Locality
• In-Memory Caching
• In-Place Updates
4. Scalability
Auto-Sharding
• Increase capacity as you go
• Commodity and cloud architectures
• Improved operational simplicity and cost visibility

High Availability
• Automated replication and failover
• Multi-data center support
• Improved operational simplicity (e.g., HW swaps)
• Data durability and consistency

Scalability: MongoDB Architecture

5. Full Featured
• Ad Hoc queries
• Real time aggregation
• Rich query capabilities
• Strongly consistent
• Geospatial features
• Support for most programming languages
• Flexible schema

Do More With Your Data
MongoDB
Rich Queries
• Find Paul’s cars
• Find everybody in London with a car built between 1970 and 1980
Geospatial
• Find all of the car owners within 5km of Trafalgar Sq.
Text Search
• Find all the cars described as having leather seats
Aggregation
• Calculate the average value of Paul’s car collection
Map Reduce
• What is the ownership pattern of colors by geography over time? (is purple trending up in China?)
{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: [45.123,47.232],
  cars: [
    { model: 'Bentley',
      year: 1973,
      value: 100000, … },
    { model: 'Rolls Royce',
      year: 1965,
      value: 330000, … }
  ]
}
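
To make the "rich queries" and "aggregation" bullets concrete, here is a minimal pure-Python sketch of what three of those questions mean when applied to documents shaped like the one above. It is not how MongoDB executes them (the server evaluates queries itself, using indexes); the second person and all helper names are hypothetical.

```python
# In-memory stand-in for a collection of owner documents.
people = [
    {"first_name": "Paul", "surname": "Miller", "city": "London",
     "cars": [{"model": "Bentley", "year": 1973, "value": 100000},
              {"model": "Rolls Royce", "year": 1965, "value": 330000}]},
    # Hypothetical second document, added only so the filters have work to do.
    {"first_name": "Asha", "surname": "Rao", "city": "Mumbai",
     "cars": [{"model": "Ambassador", "year": 1975, "value": 8000}]},
]

def cars_of(first_name):
    """Rich query: all cars embedded in the matching owner documents."""
    return [car for p in people if p["first_name"] == first_name
            for car in p["cars"]]

def owners_with_car_between(city, start_year, end_year):
    """Rich query: owners in a city with at least one car in a year range."""
    return [p["first_name"] for p in people
            if p["city"] == city
            and any(start_year <= c["year"] <= end_year for c in p["cars"])]

def average_car_value(first_name):
    """Aggregation: mean value of one owner's cars (None if they have none)."""
    values = [c["value"] for c in cars_of(first_name)]
    return sum(values) / len(values) if values else None
```

Because the cars are embedded in the owner document, each question is answered from a single document per owner, with no join.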

Running MongoDB
$ tar -zxvf mongodb-osx-x86_64-2.6.0.tgz
$ cd mongodb-osx-x86_64-2.6.0/bin
$ mkdir -p /data/db
$ ./mongod

MongoDB: Core Binaries
• mongod
– Database server
• mongo
– Database client shell
• mongos
– Router for Sharding

Getting Help
• For mongo shell
– mongo --help
• Shows options available for running the shell
• Inside mongo shell
– db.help()
• Shows commands available on the object

Database Operations
• Database creation
• Creating/changing collection
• Data insertion
• Data read
• Data update
• Creating indices
• Data deletion
• Dropping collection

MacBook-Pro-:~ $ mongo
MongoDB shell version: 2.6.0
connecting to: test
> db.cms.insert({text: 'Welcome to MongoDB'})
> db.cms.find().pretty()
{
"_id" : ObjectId("51c34130fbd5d7261b4cdb55"),
"text" : "Welcome to MongoDB"
}
Mongo Shell

Diagnostic Tools
• mongostat
• mongoperf
• mongosniff
• mongotop

Import Export Tools
• For objects
– mongodump
– mongorestore
– bsondump
– mongooplog
• For data items
– mongoimport
– mongoexport

Assignment
• Tasks
– assignments.txt
• Data
– students.json

Sarang Shravagi
@_sarangs
Thank You

First step in any application is
Determine your entities

Entities in our Blogging System
• Users (post authors)
• Article
• Comments
• Tags, Category
• Interactions (views, clicks)

In a relational database app
We would start by doing schema design

In a MongoDB based app
We start building our app
and let the schema evolve

Disk seeks and data locality
Seek = 5+ ms, Read = really really fast
[Figure: Post, Author and Comment]

Disk seeks and data locality
[Figure: Post, Author and five Comments]

Real applications are not
built in the shell

MongoDB has native
bindings for over 12
languages

Drivers & Ecosystem
• Drivers: support for the most popular languages and frameworks
– Java, Python, Perl, Ruby
• Frameworks
– Morphia, MEAN Stack

Design schema.. In application code
# Python dictionary (or object)
>>> from datetime import datetime
>>> article = { 'title' : 'Schema design in MongoDB',
                'author' : 'sarangs',
                'section' : 'schema',
                'slug' : 'schema-design-in-mongodb',
                'text' : "Data in MongoDB has a flexible schema. "
                         "So, 2 documents needn't have same structure. "
                         "It allows implicit schema to evolve.",
                'date' : datetime.utcnow(),
                'tags' : ['MongoDB', 'schema'] }
>>> db['articles'].insert(article)

Let’s add a headline image
>>> from bson.binary import Binary
>>> # read the image as bytes, hence the 'rb' mode
>>> img_data = Binary(open('article_img.jpg', 'rb').read())
>>> article = { 'title' : 'Schema evolution in MongoDB',
                'author' : 'mattbates',
                'section' : 'schema',
                'slug' : 'schema-evolution-in-mongodb',
                'text' : 'MongoDB has dynamic schema. For good '
                         'performance, you would need an implicit '
                         'structure and indexes',
                'tags' : ['MongoDB', 'schema', 'migration'],
                'headline_img' : {
                    'img' : img_data,
                    'caption' : 'A sample document at the shell'
                }}

And different types of article
>>> article = { 'title' : 'Favourite web application framework',
                'author' : 'sarangs',
                'section' : 'web-dev',
                'slug' : 'web-app-frameworks',
                'gallery' : [
                    { 'img_url' : 'http://x.com/45rty', 'caption' : 'Flask', ..},
                    ..
                ],
                'tags' : ['Python', 'web'],
              }
>>> db['articles'].insert(article)

Users and profiles
>>> user = {
      'user' : 'sarangs',
      'email' : 'sarang.shravagi@gmail.com',
      'password' : 'sarang',
      'joined' : datetime.utcnow(),
      'location' : { 'city' : 'Mumbai' },
    }
>>> db['users'].insert(user)

Modelling comments (1)
• Two collections – articles and comments
• Use a reference (i.e. foreign key) to link together
• But.. N+1 queries to retrieve article and comments
{
  '_id': ObjectId(..),
  'title': 'Schema design in MongoDB',
  'author': 'mattbates',
  'date': ISODate(..),
  'tags': ['MongoDB', 'schema'],
  'section': 'schema',
  'slug': 'schema-design-in-mongodb',
  'comments': [ ObjectId(..), … ]
}
{
  '_id': ObjectId(..),
  'article_id': 1,
  'text': 'A great article, helped me understand schema design',
  'date': ISODate(..),
  'author': 'johnsmith'
}

Modelling comments (2)
• Single articles collection – embed comments in article documents
• Pros
– Single query, document designed for the access pattern
– Locality (disk, shard)
• Cons
– Comments array is unbounded; documents will grow in size (remember 16MB document limit)
{
  'title': 'Schema design in MongoDB',
  'tags': ['MongoDB', 'schema'],
  …
  'comments': [
    {
      'text': 'A great article, helped me understand schema design',
    },
    …
  ]
}
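
The trade-off between the referenced and the embedded design can be sketched by counting lookups. The class below is a hypothetical in-memory stand-in for a collection (it is not the pymongo API), and the referenced loader deliberately uses the naive one-fetch-per-comment pattern that gives N+1 queries.

```python
class CountingCollection:
    """Tiny in-memory stand-in for a collection that counts lookups."""

    def __init__(self, docs):
        self._docs = {doc["_id"]: doc for doc in docs}
        self.lookups = 0

    def find_one(self, _id):
        self.lookups += 1
        return self._docs.get(_id)


def load_referenced(articles, comments, article_id):
    """Referenced design: 1 lookup for the article + 1 per comment."""
    article = articles.find_one(article_id)
    return article, [comments.find_one(cid) for cid in article["comments"]]


def load_embedded(articles, article_id):
    """Embedded design: the comments arrive with the article."""
    article = articles.find_one(article_id)
    return article, article["comments"]


comment_ids = ["c1", "c2", "c3"]
referenced_articles = CountingCollection(
    [{"_id": "a1", "title": "Schema design in MongoDB", "comments": comment_ids}])
comment_docs = CountingCollection(
    [{"_id": cid, "article_id": "a1", "text": "comment " + cid} for cid in comment_ids])
embedded_articles = CountingCollection(
    [{"_id": "a1", "title": "Schema design in MongoDB",
      "comments": [{"text": "comment " + cid} for cid in comment_ids]}])

_, ref_comments = load_referenced(referenced_articles, comment_docs, "a1")
_, emb_comments = load_embedded(embedded_articles, "a1")
referenced_lookups = referenced_articles.lookups + comment_docs.lookups
embedded_lookups = embedded_articles.lookups
```

With three comments the referenced design costs four lookups and the embedded design one; the gap grows with the number of comments, as does the size of the embedded document.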

Modelling comments (3)
• Another option: hybrid of (1) and (2), embed top x comments (e.g. by date, popularity) into the article document
• Fixed-size (2.4 feature) comments array
• All other comments ‘overflow’ into a comments collection (double write) in buckets
• Pros
– Document size is more fixed – fewer moves
– Single query built
– Full comment history with rich query/aggregation

{
  'title': 'Schema design in MongoDB',
  'tags': ['MongoDB', 'schema'],
  …
  'comments_count': 45,
  'comments_pages': 1,
  'comments': [
    {
      'text': 'A great article, helped me understand schema design',
    },
    …
  ]
}
Total number of comments (comments_count)
• Integer counter updated by update operation as comments added/removed
Number of pages (comments_pages)
• Page is a bucket of 100 comments (see next slide..)
Fixed-size comments array (comments)
• 10 most recent
• Sorted by date on insertion
{
  'article_id': ObjectId(..),
  'page': 1,
  'count': 42,
  'comments': [
    {
      'text': 'A great article, helped me understand schema design',
    },
    …
  ]
}
One comment bucket (page) document containing up to about 100 comments
Array of 100 comment sub-documents
Modelling interactions
• Interactions
– Article views
– Comments
– (Social media sharing)
• Requirements
– Time series
– Pre-aggregated in preparation for analytics

Modelling interactions
• Document per article per day – ‘bucketing’
• Daily counter and hourly sub-document counters for interactions
• Bounded array (24 hours)
• Single query to retrieve daily article interactions; ready-made for graphing and further aggregation
{
  'article_id': ObjectId(..),
  'section': 'schema',
  'daily': { 'views': 45, 'comments': 150 },
  'hourly': {
    0 : { 'views': 10 },
    1 : { 'views': 2 },
    …
    23 : { 'comments': 14, 'views': 10 }
  }
}
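
A small in-memory sketch of the pre-aggregation idea: every interaction bumps one daily counter and one hourly counter inside the same per-day bucket. The function name and the "hourly" key are assumptions made for illustration; in MongoDB the equivalent increments would be applied by the server in a single update.

```python
def record_interaction(bucket, kind, hour):
    """Increment the daily and the hourly counter for one interaction."""
    bucket["daily"][kind] = bucket["daily"].get(kind, 0) + 1
    hour_counters = bucket["hourly"].setdefault(hour, {})
    hour_counters[kind] = hour_counters.get(kind, 0) + 1


# One bucket document per article per day (identifiers omitted here).
bucket = {"daily": {}, "hourly": {}}

record_interaction(bucket, "views", 0)
record_interaction(bucket, "views", 0)
record_interaction(bucket, "views", 23)
record_interaction(bucket, "comments", 23)
```

Reading the day's statistics is then a single document fetch, and the hourly sub-documents can never exceed 24 entries.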

JSON and RESTful API
Real applications are not built at a shell – let’s build a RESTful API.
[Diagram labels: Client-side (e.g. AngularJS), JSON over HTTP(S) REST, Python web app, Pymongo driver, (BSON)]
Examples to follow: Python RESTful API using Flask microframework

myCMS REST endpoints
Method  URI                              Action
GET     /articles                        Retrieve all articles
GET     /articles-by-tag/[tag]           Retrieve all articles by tag
GET     /articles/[article_id]           Retrieve a specific article by article_id
POST    /articles                        Add a new article
GET     /articles/[article_id]/comments  Retrieve all article comments by article_id
POST    /articles/[article_id]/comments  Add a new comment to an article
POST    /users                           Register a user
GET     /users/[username]                Retrieve user’s profile
PUT     /users/[username]                Update a user’s profile

Getting started with the skeleton code
$ git clone http://www.github.com/mattbates/mycms_mongodb
$ cd mycms_mongodb
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
$ mkdir -p data/db
$ mongod --dbpath=data/db --fork --logpath=mongod.log
$ python web.py
[$ deactivate]

RESTful API methods in Python + Flask
import json
from bson import json_util
from flask import abort, jsonify
@app.route('/cms/api/v1.0/articles', methods=['GET'])
def get_articles():
    """Retrieves all articles in the collection
    sorted by date
    """
    # query all articles and return a cursor sorted by date
    cur = db['articles'].find().sort('date')
    if not cur:
        abort(400)
    # iterate the cursor and add docs to a list
    articles = [article for article in cur]
    return jsonify({'articles' : json.dumps(articles, default=json_util.default)})

RESTful API methods in Python + Flask
@app.route('/cms/api/v1.0/articles/<string:article_id>/comments', methods = ['POST'])
def add_comment(article_id):
    """Adds a comment to the specified article and a
    bucket, as well as updating a view counter
    """
    …
    page_id = article['last_comment_id'] // 100
    …
    # push the comment to the latest bucket and $inc the count
    page = db['comments'].find_and_modify(
        { 'article_id' : ObjectId(article_id),
          'page' : page_id},
        { '$inc' : { 'count' : 1 },
          '$push' : {
              'comments' : comment } },
        fields= {'count' : 1},
        upsert=True,
        new=True)

RESTful API methods in Python + Flask
    # $inc the page count if bucket size (100) is exceeded
    if page['count'] > 100:
        db.articles.update(
            { '_id' : ObjectId(article_id),
              'comments_pages': article['comments_pages'] },
            { '$inc': { 'comments_pages': 1 } } )
    # let's also add to the article itself
    # most recent 10 comments only
    res = db['articles'].update(
        {'_id' : ObjectId(article_id)},
        {'$push' : {'comments' : { '$each' : [comment],
                                   '$sort' : {'date' : 1 },
                                   '$slice' : -10}},
         '$inc' : {'comments_count' : 1}})
    …

RESTful API methods in Python + Flask
import datetime
def add_interaction(article_id, type):
    """Record the interaction (view/comment) for the
    specified article into the daily bucket and
    update an hourly counter
    """
    ts = datetime.datetime.utcnow()
    # $inc daily and hourly view counters in day/article stats bucket
    # note the unacknowledged w=0 write concern for performance
    db['interactions'].update(
        { 'article_id' : ObjectId(article_id),
          'date' : datetime.datetime(ts.year, ts.month, ts.day)},
        { '$inc' : {
            'daily.{}'.format(type) : 1,
            'hourly.{}.{}'.format(ts.hour, type) : 1
        }},
        upsert=True,
        w=0)

$ curl -i http://localhost:5000/cms/api/v1.0/articles
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 335
Server: Werkzeug/0.9.4 Python/2.7.5
Date: Thu, 10 Apr 2014 16:00:51 GMT
{
"articles": "[{"title": "Schema design in MongoDB", "text": "Data in MongoDB
has a flexible schema..", "section": "schema", "author": "sarangs", "date":
{"$date": 1397145312505}, "_id": {"$oid": "5346bef5f2610c064a36a793"},
"slug": "schema-design-in-mongodb", "tags": ["MongoDB", "schema"]}]"}
Testing the API – retrieve articles

Testing the API – comment on an article
$ curl -H "Content-Type: application/json" -X POST \
    -d '{"text":"An interesting article and a great read."}' \
    http://localhost:5000/cms/api/v1.0/articles/52ed73a30bd031362b3c6bb3/comments
{
"comment": "{"date": {"$date": 1391639269724}, "text": "An interesting article and a great read."}"
}
Disaster Recovery
Introduction to Replica Sets and
High Availability

Disasters
• Physical Failure
– Hardware
– Network
• Solution
– Replica Sets
• Provide redundant storage for High Availability
– Real time data synchronization
• Automatic failover for zero down time

Multi Replication
• Data can be replicated to multiple places simultaneously
• An odd number of voting members is recommended in a replica set, so that an election can always reach a majority

Single Replication
• If you want only one secondary (or any odd number of secondaries), set up an arbiter so that the number of voting members stays odd

Failover
• When the primary fails, the remaining members vote to elect a new primary
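
Why the member count matters for failover can be sketched with a one-line majority rule: a new primary can only be elected while more than half of the voting members can still reach each other. The function name is hypothetical, and real elections also consider priorities and data freshness, which this sketch ignores.

```python
def can_elect_primary(voting_members, reachable_members):
    """True when the reachable members form a strict majority of the set."""
    return reachable_members > voting_members // 2


# Primary + one secondary: losing either node leaves 1 of 2, no majority.
two_node_survives = can_elect_primary(2, 1)

# Primary + secondary + arbiter: losing one node leaves 2 of 3, a majority.
three_node_survives = can_elect_primary(3, 2)
```

Adding a fourth member does not help over three: both configurations tolerate the loss of only one member, which is why odd sizes (or an arbiter) are the usual choice.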

Handling Big Data
Introduction to Map/Reduce
and Sharding

Large Data Sets
• Problem 1
– Performance
• Queries run slowly
• Solution
– Map/Reduce

Map Reduce
• A way to divide large query computation into
smaller chunks
• May run in multiple processes across multiple
machines
• Think of it as GROUP BY of SQL

Map/Reduce Example
• The map function scans the data and emits the required values

Map/Reduce Example
• The reduce function takes the output of the map function and generates an aggregated value
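
The two phases can be sketched in plain Python. This is a single-process illustration of the idea only (MongoDB's own map/reduce takes JavaScript functions and runs on the server); the helper name map_reduce and the sample articles are made up. The example counts articles per tag, the GROUP BY analogue mentioned earlier.

```python
from collections import defaultdict


def map_reduce(docs, map_fn, reduce_fn):
    """Group the (key, value) pairs emitted by map_fn, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def emit_tags(article):
    """Map: emit (tag, 1) for every tag on the article."""
    return [(tag, 1) for tag in article["tags"]]


def count(tag, ones):
    """Reduce: add up the emitted values for one tag."""
    return sum(ones)


articles = [
    {"title": "Schema design in MongoDB", "tags": ["MongoDB", "schema"]},
    {"title": "Schema evolution in MongoDB", "tags": ["MongoDB", "schema", "migration"]},
    {"title": "Favourite web application framework", "tags": ["Python", "web"]},
]

articles_per_tag = map_reduce(articles, emit_tags, count)
```

Because map only looks at one document and reduce only at one key, both phases can be split across processes or machines.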

Large Data Sets
• Problem 2
– Vertical Scaling of Hardware
• Can’t increase machine size beyond a limit
• Solution
– Sharding

Sharding
• A method for storing data across multiple machines
• Data is partitioned using Shard Keys

Data Partitioning: Range Based
• A range of shard key values stays in a chunk

Data Partitioning: Hash Based
• A hash function on the shard key decides the chunk
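
The two partitioning strategies can be contrasted with a short sketch. The chunk boundaries, the chunk count and the use of CRC32 as the hash are all assumptions chosen for illustration; MongoDB uses its own hash function and manages chunk ranges itself.

```python
import bisect
import zlib

# Hypothetical split points: chunk 0 holds keys < "g", chunk 1 holds
# "g" <= key < "n", chunk 2 holds "n" <= key < "t", chunk 3 holds the rest.
RANGE_BOUNDARIES = ["g", "n", "t"]


def range_chunk(shard_key):
    """Range based: neighbouring keys land in the same chunk."""
    return bisect.bisect_right(RANGE_BOUNDARIES, shard_key)


def hash_chunk(shard_key, num_chunks=4):
    """Hash based: a hash of the key picks the chunk, spreading neighbours out."""
    return zlib.crc32(shard_key.encode("utf-8")) % num_chunks


usernames = ["alice", "bob", "matt", "sarangs", "zoe"]
by_range = {name: range_chunk(name) for name in usernames}
by_hash = {name: hash_chunk(name) for name in usernames}
```

Range partitioning keeps range queries on few chunks but can concentrate writes for adjacent keys; hashing trades that locality for a more even spread.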

Optimizing Shards: Splitting
• In a shard, when size of a chunk increases, the
chunk is divided into two

Optimizing Shards: Balancing
• When the number of chunks in a shard increases, a few chunks are migrated to another shard
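
Balancing can be sketched as repeatedly moving one chunk from the fullest shard to the emptiest until the counts are close. The threshold of one chunk and the function name are assumptions; the real balancer uses its own migration thresholds and moves chunks in the background.

```python
def balance(shards, threshold=1):
    """Move chunks from the fullest to the emptiest shard until the
    difference in chunk counts is at most `threshold`.
    Returns the list of (chunk, from_shard, to_shard) migrations."""
    moves = []
    while True:
        fullest = max(shards, key=lambda name: len(shards[name]))
        emptiest = min(shards, key=lambda name: len(shards[name]))
        if len(shards[fullest]) - len(shards[emptiest]) <= threshold:
            return moves
        chunk = shards[fullest].pop()
        shards[emptiest].append(chunk)
        moves.append((chunk, fullest, emptiest))


# Hypothetical cluster: one overloaded shard and two nearly empty ones.
shards = {
    "shard0": ["c0", "c1", "c2", "c3", "c4", "c5"],
    "shard1": [],
    "shard2": ["c6"],
}
migrations = balance(shards)
```

Only whole chunks move, which is why splitting large chunks first gives the balancer smaller units to work with.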

Schema iteration
New feature in the backlog?
Documents have dynamic schema so we just iterate
the object schema.
>>> user = { 'username': 'matt',
             'first' : 'Matt',
             'last' : 'Bates',
             'preferences': { 'opt_out': True } }
>>> db['users'].save(user)

Online Training at MongoDB
University

For More Information
Resource              Location
MongoDB Downloads     mongodb.com/download
Free Online Training  education.mongodb.com
Webinars and Events   mongodb.com/events
White Papers          mongodb.com/white-papers
Case Studies          mongodb.com/customers
Presentations         mongodb.com/presentations
Documentation         docs.mongodb.org
Additional Info       info@mongodb.com

We've introduced a lot of
concepts here
