Introduction to Azure DocumentDB
- 2. Denny Lee
• Principal Program Manager for Azure DocumentDB
• 20+ years of experience in databases, distributed systems, data
sciences, and software development at Microsoft, Concur, and
Databricks
• Notable Projects:
• Project Isotope: Incubation team for HDInsight
• Yahoo! 24TB cube: Largest SSAS cube in production
@dennylee
- 4. {
"name": "SmugMug",
"permalink": "smugmug",
"homepage_url": "http://www.smugmug.com",
"blog_url": "http://blogs.smugmug.com/",
"category_code": "photo_video",
"products": [
{
"name": "SmugMug",
"permalink": "smugmug"
}
],
"offices": [
{
"description": "",
"address1": "67 E. Evelyn Ave",
"address2": "",
"zip_code": "94041",
"city": "Mountain View",
"state_code": "CA",
"country_code": "USA",
"latitude": 37.390056,
"longitude": -122.067692
}
]
}
Perfect for these
Documents
schema-agnostic JSON store
for
hierarchical and de-normalized data at scale
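To make "hierarchical and de-normalized" concrete, here is a minimal Python sketch that loads a trimmed copy of the SmugMug document above with the standard `json` module and reads nested values directly, with no joins. It does not call any DocumentDB API, and the helper name `office_cities` is hypothetical.

```python
import json

# Trimmed copy of the SmugMug document shown above.
doc = json.loads("""
{
  "name": "SmugMug",
  "category_code": "photo_video",
  "products": [{"name": "SmugMug", "permalink": "smugmug"}],
  "offices": [
    {"city": "Mountain View", "state_code": "CA", "country_code": "USA",
     "latitude": 37.390056, "longitude": -122.067692}
  ]
}
""")


def office_cities(document):
    """Return the city of every office embedded in the document."""
    # .get() with a default tolerates documents that omit "offices",
    # which a schema-agnostic store permits.
    return [office.get("city") for office in document.get("offices", [])]


print(office_cities(doc))
```

Because the offices are embedded in the company document, one read returns everything needed; a document without an `offices` array simply yields an empty list.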
- 13. The 4 Vs of Big Data
• Volume: Exceeds physical limits of vertical scalability
• Variety: Many different formats making integration expensive
• Velocity: Small decision window compared to data change rate
• Variability: Many options or variables confounding analysis
- 14. The 4 Vs of Big Data
Volume, Variety, Velocity, Variability
Mobile Apps, Retail, Learning, Telematics, IoT, Gaming
- 17. Ability to Scale from Day 1
• Bursty
• Unpredictable traffic
Gaming + Social Experience
• Lag-free
• Responsive experiences
Move fast without breaking things
• Iterative development needs
More users, more problems
- 18. • Game scores, guilds and social membership
• Leaderboards by country and social
• Guild management and messaging
• #1 in the Apple App Store for free apps
<10ms 99P query latency
>1M game downloads
~1B requests / day
The Walking Dead, results
- 19. Caches
• Scores are continuously updated
• Write heavy without locality
RDBMS
• Scale-out requires partitioning
• Schema and index management
Other NoSQL Stores
• Longer tail on latencies
• Need to specify secondary indexes for lookups
The right tool for the job?
- 20. Fully managed NoSQL database
Horizontal scaling for TB and RPS
High performance, write optimized
Schema agnostic indexing
+
Azure DocumentDB
The answer for low latency @ massive scale
- 21. Fact: Managing shards is really painful.
Managing shards or partitions
Good news: DocumentDB has done all the heavy lifting.
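For a sense of what "managing shards" involves when you do it by hand, here is a minimal sketch of hash-based routing in Python. It is an illustration only and says nothing about how DocumentDB partitions internally; `shard_for` and `moved_keys` are hypothetical names.

```python
import hashlib


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index with a stable hash."""
    # md5 is used only because it is stable across processes,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_keys(keys, old_count: int, new_count: int):
    """Return the keys whose shard changes when the shard count changes."""
    return [k for k in keys
            if shard_for(k, old_count) != shard_for(k, new_count)]


players = [f"player-{i}" for i in range(1000)]
relocated = moved_keys(players, old_count=4, new_count=5)
print(f"{len(relocated)} of {len(players)} keys move when going from 4 to 5 shards")
```

With naive modulo hashing, adding a single shard relocates a large share of the keys, and that rebalancing work is the kind of chore the slide refers to.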
- 23. Measuring Throughput (Request Units)
• Replica gets a fixed budget of request units
• Request Unit/sec (RU) is the normalized currency (% IOPS, % CPU, % Memory)
• Operations consume request units (RUs):
READ: GET Document
INSERT: POST Documents
REPLACE: PUT Document
Query: POST Documents
…
• Requests get rate limited if they exceed the SLA
• Customers pay for reserved request units by the hour
[Diagram: incoming requests to a replica plotted against Min RU/sec and Max RU/sec, with zones labeled quiescent, no throttling, and rate limit]
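The budget-and-rate-limit behavior described on this slide can be sketched as a per-second counter. This is a simplified model, not the service's actual accounting: the class name `RequestUnitBudget` and the RU costs in `COST` are made-up values chosen only for illustration.

```python
class RequestUnitBudget:
    """Fixed budget of request units (RUs) for each one-second window."""

    def __init__(self, ru_per_second: float):
        self.ru_per_second = ru_per_second
        self.window = None      # the second currently being accounted
        self.consumed = 0.0     # RUs spent so far in that second

    def try_consume(self, cost: float, now: float) -> bool:
        """Charge `cost` RUs at time `now` (seconds); False means rate limited."""
        window = int(now)
        if window != self.window:
            # A new second has started, so the budget is replenished.
            self.window = window
            self.consumed = 0.0
        if self.consumed + cost > self.ru_per_second:
            return False
        self.consumed += cost
        return True


# Hypothetical per-operation costs; real costs depend on the operation.
COST = {"read": 1.0, "insert": 5.0, "query": 10.0}

budget = RequestUnitBudget(ru_per_second=10)
results = [budget.try_consume(COST["insert"], now=0.0) for _ in range(3)]
print(results)                                       # [True, True, False]
print(budget.try_consume(COST["insert"], now=1.2))   # True
```

Two inserts fit inside the 10 RU budget, the third is rejected, and the same request succeeds once the next second begins.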
- 28. Globally Distributed
Azure DocumentDB gives you the ability to circumvent the speed of light!
• High availability and disaster recovery
• Replicate to any number of regions
• Global low latency access
• Dynamically configure write and read regions
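- 29. … with well-defined consistency models!
Consistency levels, from stronger consistency to faster performance: Strong, Bounded Staleness, Session, Eventual
• Total global order: Strong: Yes. Bounded Staleness: Yes (outside of the “staleness window”). Session: No, partial “session” order. Eventual: No.
• Consistent prefix guarantee: Strong: Yes. Bounded Staleness: Yes. Session: Yes. Eventual: Yes.
• Monotonic reads: Strong: Yes. Bounded Staleness: Yes (within region, and across regions outside of the staleness window). Session: Yes (for the given session). Eventual: No.
• Monotonic writes: Strong: Yes. Bounded Staleness: Yes. Session: Yes. Eventual: Yes.
• Read your writes: Strong: Yes. Bounded Staleness: Yes (in the write region). Session: Yes. Eventual: No.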
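One way to use the guarantees above is to pick the weakest (and therefore fastest) level that still provides what an application needs. The sketch below encodes the slide's table in Python, treating a qualified "Yes" (for example, "outside of the staleness window") as a plain yes, which is a simplification. `GUARANTEES` and `weakest_level` are hypothetical names, not part of any SDK.

```python
# Levels ordered from strongest consistency to fastest performance.
LEVELS = ["Strong", "Bounded Staleness", "Session", "Eventual"]

# One row per guarantee, one column per level in LEVELS order.
# Qualified "Yes" entries from the slide are simplified to True.
GUARANTEES = {
    "total_global_order": [True, True, False, False],
    "consistent_prefix":  [True, True, True, True],
    "monotonic_reads":    [True, True, True, False],
    "monotonic_writes":   [True, True, True, True],
    "read_your_writes":   [True, True, True, False],
}


def weakest_level(required):
    """Return the weakest level that offers every guarantee in `required`."""
    # Walk from Eventual toward Strong and stop at the first match.
    for index in reversed(range(len(LEVELS))):
        if all(GUARANTEES[name][index] for name in required):
            return LEVELS[index]
    return LEVELS[0]  # not reached: Strong offers every guarantee


print(weakest_level({"read_your_writes"}))      # Session
print(weakest_level({"total_global_order"}))    # Bounded Staleness
print(weakest_level({"monotonic_writes"}))      # Eventual
```

An application that only needs users to see their own updates can therefore run at the session level instead of paying for strong consistency.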
- 35. Common scenarios
• Retail: Product Catalog, Recommendations, Personalization
• Gaming: User Store, Recommendations, Personalization
• IoT: Event Store, Device Registry, Telemetry Store
• Social: User Behavior, Telemetry, Personalization
- 36. Common scenarios
IoT: Event Store, Device Registry, Telemetry Store
IoT / Sensor Data Challenges:
• Hardware is relatively hard to update
• Different generations of devices
=> different schemas (variety)
• Many sensors emitting telemetry
=> high rate of ingestion (volume + variety)
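The "different generations, different schemas" problem can be sketched in a few lines of Python: readings from two device generations are stored as-is, and the reader tolerates both shapes. The field names `temp_c` and `temperature_f` and the function `read_temperature_c` are hypothetical and are not taken from any real device.

```python
def read_temperature_c(event: dict):
    """Return the temperature in Celsius from either schema, or None."""
    if "temp_c" in event:
        # Hypothetical newer firmware: reports Celsius directly.
        return float(event["temp_c"])
    if "temperature_f" in event:
        # Hypothetical older firmware: reports Fahrenheit.
        return (float(event["temperature_f"]) - 32.0) * 5.0 / 9.0
    return None  # this device does not report temperature


events = [
    {"device_id": "gen1-0042", "temperature_f": 212},
    {"device_id": "gen2-0007", "temp_c": 21.5, "humidity": 0.4},
    {"device_id": "gen2-0008", "humidity": 0.6},
]
for event in events:
    print(event["device_id"], read_temperature_c(event))
```

Since the store does not enforce a schema, old devices never need a firmware update just to keep their data queryable alongside newer ones.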
- 37. Top 5 Automotive Manufacturer in the World
Telematics services include:
• Safety service
• Diagnostic service
• Remote service
Ingest and query 100+ TB of semi-structured data
IoT : Vehicle Telematics
- 38. IoT : Vehicle Telematics
• Ingress API
• Inbound Interface (Web API)
• Raw Event Store (Hot) (DocumentDB)
• Aggregated Event Store (Warm) (DocumentDB)
• Aggregated Event Store (Cold) (Blob Storage)
• Outbound Interface (Web API)
• Message Queue (Event Hubs)
• Stream Processor (Stream Analytics)
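The step from the raw event store to an aggregated store can be sketched as an hourly roll-up per vehicle. This is a plain-Python illustration of the idea, not Stream Analytics code; the event fields (`vehicle_id`, `timestamp`, `speed_kmh`) and the one-hour window are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_hourly(raw_events):
    """Roll raw telemetry up into one summary document per vehicle per hour."""
    buckets = defaultdict(lambda: {"count": 0, "speed_sum": 0.0})
    for event in raw_events:
        # Truncate the epoch timestamp (seconds) to the start of its hour.
        hour_start = event["timestamp"] // 3600 * 3600
        bucket = buckets[(event["vehicle_id"], hour_start)]
        bucket["count"] += 1
        bucket["speed_sum"] += event["speed_kmh"]
    return [
        {
            "vehicle_id": vehicle_id,
            "hour_start": hour_start,
            "event_count": bucket["count"],
            "avg_speed_kmh": bucket["speed_sum"] / bucket["count"],
        }
        for (vehicle_id, hour_start), bucket in sorted(buckets.items())
    ]


raw = [
    {"vehicle_id": "v1", "timestamp": 10, "speed_kmh": 50.0},
    {"vehicle_id": "v1", "timestamp": 20, "speed_kmh": 70.0},
    {"vehicle_id": "v1", "timestamp": 3700, "speed_kmh": 30.0},
]
for summary in aggregate_hourly(raw):
    print(summary)
```

The raw events stay in the hot store for recent lookups, while the much smaller summaries are what the warm and cold tiers would hold.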
- 39. Common scenarios
Social + AdTech Challenges:
• Ingest + Analyze Third Party Data
=> Who dictates schema? (variety)
=> How do you index?
• A lot of social and user data
=> high rate of ingestion (volume + variety)
Social: User Behavior, Telemetry, Personalization
- 40. • Startup - Advanced Marketing Intelligence Platform
• Utilizes deep learning to analyze billions of relational network connections to build a social fingerprint for each user
• Extracts knowledge and cultural insights by analyzing what people choose to follow
Social Analytics + Ad Technology
>1B Social Media Profiles
>50M Tweets per Day
- 41. • Store tweets, geo-location data, and ML results in DocumentDB
• Data from each social media producer has its own schema that evolves independently
• Need to iterate rapidly… no time for managing VMs
Social Analytics + Ad Technology
>1B Social Media Profiles
>50M Tweets per Day
- 42. Before moving to DocumentDB, my developers
would need to come to me to confirm that our
Elasticsearch deployment would support their data or
if I would need to scale things to handle it.
DocumentDB removed me as a bottleneck, which has
been great for me and them.
Stephen Hankinson, CTO, Affinio
Quote
- 50. Data Sciences:
Apache Spark + DocumentDB
Demo
Notebook
View: https://aka.ms/docdb-spark-graph
pyView: https://aka.ms/pydocdb-spark-graph
Code: https://aka.ms/docdb-spark-graph-code
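- 51. Graph Calculations: Degrees, PageRank
What is the most important airport (most flights in / out)?
# Requires: from pyspark.sql.functions import desc
# display() is assumed to be the notebook's display helper; the opening of the call was lost in extraction.
# Top 10 airports by number of incoming flights
display(tripGraph.inDegrees
.sort(desc("inDegree"))
.limit(10))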
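For readers without a Spark cluster at hand, the in-degree calculation can be reproduced on a small edge list in plain Python. This is an illustrative sketch, not the notebook's code: the flight list is made up and `top_in_degrees` is a hypothetical helper.

```python
from collections import Counter


def top_in_degrees(edges, limit=10):
    """Return (airport, in_degree) pairs, highest in-degree first."""
    # Each edge is (source, destination); the in-degree of an airport
    # is the number of edges that arrive at it.
    counts = Counter(dst for _src, dst in edges)
    return counts.most_common(limit)


# Made-up flights for illustration.
flights = [
    ("SEA", "SFO"),
    ("JFK", "SFO"),
    ("SFO", "SEA"),
    ("SEA", "JFK"),
    ("YVR", "SFO"),
]
print(top_in_degrees(flights, limit=1))   # [('SFO', 3)]
```

Note that in-degree counts arrivals only; answering the "in / out" form of the question would also need the out-degree, counted the same way over the source column.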
- 53. Advantages
Blazing Fast IoT Scenarios
[Diagram labels: Flight information, global safety alerts, weather; Data Science Scenarios; Device Notifications; Web / REST API]
- 55. Advantages
Pushdown Predicate Filtering
Data Science Scenarios
[Diagram: the filter {city:SEA} shown against an index tree with nodes locations, headquarter, exports, country (Germany, France, Belgium), and city (Seattle, Paris, Moscow, Athens); matching documents follow]
{city:SEA, dst: POR, ...},
{city:SEA, dst: JFK, ...},
{city:SEA, dst: SFO, ...},
{city:SEA, dst: YVR, ...},
{city:SEA, dst: YUL, ...},
...
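The benefit of pushing the `{city:SEA}` predicate down to the data source can be sketched by counting how many documents cross the wire in each case. This is a toy model in Python, not the Spark connector's API; `FlightSource`, its `scan` method, and the sample documents are all hypothetical.

```python
class FlightSource:
    """Toy data source that counts how many documents it hands to the client."""

    def __init__(self, documents):
        self._documents = list(documents)
        self.transferred = 0

    def scan(self, predicate=None):
        """Return documents; when a predicate is given, filter at the source."""
        rows = [d for d in self._documents
                if predicate is None or predicate(d)]
        self.transferred += len(rows)
        return rows


DOCS = (
    [{"city": "SEA", "dst": dst} for dst in ("POR", "JFK", "SFO", "YVR", "YUL")]
    + [{"city": "JFK", "dst": "SEA"}, {"city": "SFO", "dst": "SEA"},
       {"city": "YVR", "dst": "SEA"}]
)


def is_sea(doc):
    return doc["city"] == "SEA"


# Without pushdown: fetch everything, then filter on the client.
no_pushdown = FlightSource(DOCS)
client_side = [d for d in no_pushdown.scan() if is_sea(d)]

# With pushdown: the source evaluates the predicate and returns matches only.
pushdown = FlightSource(DOCS)
source_side = pushdown.scan(predicate=is_sea)

print(no_pushdown.transferred, pushdown.transferred)   # 8 5
```

Both approaches return the same five Seattle departures, but with pushdown only those five documents are transferred, and an indexed source can find them without scanning the rest.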
- 56. References
Get direct access to the engineering team -> askdocdb@microsoft.com
Resources
• Schema Agnostic Indexing with DocumentDB, VLDB 2015
• Consistency Levels in DocumentDB
• SQL Queries with DocumentDB
• Language Integrated JavaScript queries and transactions with DocumentDB
• Distribute your data globally with DocumentDB
Editor's Notes
- Well nested, multiple properties and values
- Not word documents
- Independently scale storage and throughput. Provisioned throughput guaranteed.
Elastically scale throughput from 100 to 10s of millions of requests/sec
Transparent server side partitioning
Optionally evict old data with TTL
Cheaper than hosted OSS NoSQL databases or DynamoDB
Watch “Predictable performance” module
- Write optimized, SSD-based database engine with low latency access
Synchronous and automatic indexing at sustained ingestion rates
Globally distributed with reads and writes served from local region
Watch “Predictable performance” module
- Scale across any number of Azure regions
Turn-key high availability with transparent failover
Multi-homing
Well-defined consistency models
Watch “Achieve planet scale with DocumentDB: Multi-region replication”
- Rich SQL, JavaScript, MongoDB
Multi-modal: key-values, column family, or documents
No impedance mismatch - JavaScript is the type system
Write business logic entirely in JavaScript with stored procedures and triggers
Integrated multi-document transactions with snapshot isolation
.NET, Java, Node, Python SDKs
- That’s right, you were waiting for the zombies … The Walking Dead is a show about a zombie apocalypse … or ‘walkers’ as they’re referred to in the show.
The Walking Dead No Man’s Land is a mobile game based on this very successful AMC series. No Man’s Land is developed by a game company by the name of Next Games.
Some of you may have heard of Next Games from Scott’s keynote this morning but for those who missed the keynote, I’ll roll the video … > PLAY VIDEO
#### summary of what we just saw
- They will help look at a customer’s social media account (e.g. Nike’s Twitter Account) and provide reporting and analytics on the customer’s followers – including segmenting followers (e.g. you have a high number of soccer moms and athletes following you) and analyzing what kind of content is trending amongst each segment of followers. If links are included in the content – they will scrape the link and provide further analysis on any content found in the link. They are using DocumentDB as their data store for storing and querying the scraped content.