MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
Atlas Data Lake
Technical Deep-Dive
Craig Wilson, Senior Staff Engineer, MongoDB
State of Affairs
Businesses have a humongous amount of data
• IDC predicts that by 2025 global data will reach 175 zettabytes and that 49% of it will reside in the public cloud.
Cloud storage is cost-effective
Cloud storage is hard to operationalize
A New Service Offered by MongoDB Atlas
Access long-term data
Query long-term data
Analyze long-term data
Requirements
Look and act like MongoDB
Access customer’s data securely
Handle queries over vast amounts of data
Handle long-running queries
Efficient use of resources
Emulating MongoDB
Language
Must be able to communicate with our drivers
Written in Go
Implemented a TCP server
Used mongo-go-driver’s wireprotocol package
Used mongo-go-driver's bson package
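To make "communicate with our drivers" concrete: every MongoDB wire protocol message begins with a 16-byte header of four little-endian int32 fields (messageLength, requestID, responseTo, opCode). The service described here is written in Go and relies on mongo-go-driver packages; the Python sketch below only illustrates the framing a TCP server has to read first. The function names are hypothetical and not taken from the talk.

```python
import struct

# Standard wire protocol header: four little-endian int32 fields.
HEADER = struct.Struct("<iiii")
OP_MSG = 2013  # opcode of the OP_MSG message type


def build_header(body_length, request_id, response_to=0, op_code=OP_MSG):
    """Pack a header for a message whose body is body_length bytes long."""
    # messageLength counts the header itself plus the body.
    return HEADER.pack(HEADER.size + body_length, request_id, response_to, op_code)


def parse_header(data):
    """Decode the 16-byte header at the start of a wire protocol message."""
    if len(data) < HEADER.size:
        raise ValueError("need at least 16 bytes for a message header")
    length, request_id, response_to, op_code = HEADER.unpack_from(data)
    return {
        "messageLength": length,
        "requestID": request_id,
        "responseTo": response_to,
        "opCode": op_code,
    }
```

A server loop would read 16 bytes, call parse_header, then read the remaining messageLength - 16 bytes and hand them to a BSON decoder.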
Security
Must have the same security as MongoDB
Users configured in Atlas
Implemented MongoDB’s security model
Require the use of TLS + SNI (Server Name Indication)
Behavior
Must behave like MongoDB
Implemented commands for a read-only server
Used the server’s aggregation engine
Customer’s Data
Security: Customers
Customers have complete control
Provide us with an IAM Role
Configure your buckets
Configure your users in Atlas
Security: Atlas
Atlas controls access to your data
Storage of IAM Role
Temporary Credentials
Configuration
Customers control their data layout
Stores
Databases, Collections
DataSources
[Diagram: Collections, DataSources, and Stores]
Configuration: File Formats
• BSON (gzipped)
• JSON (gzipped)
• Avro (gzipped)
• CSV/TSV (gzipped)
• Parquet
• XLSX
Configuration (S3 Bucket): ent-archive
/archive/customers
  - a-m.json
  - n-z.json
/archive/invoices
  - 2019
    - 1.parquet
    - 2.parquet
  - 2018
    - 1.parquet
  - 2017.json.gz
  - 2016.json.gz
Configuration: Store
s3: {
  name: "ent-archive",
  bucket: "ent-archive",
  region: "us-east-1",
  prefix: "/archive/"
}
Configuration: Data
history: {
  customers: [{
    store: "ent-archive",
    definition: "/customers/*"
  }],
  invoices: [{
    store: "ent-archive",
    definition: "/invoices/{year int}/*"
  }, {
    store: "ent-archive",
    definition: "/invoices/{year int}.json.gz"
  }]
}
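One way to read the definition strings above is as path templates: `*` matches any file, and `{year int}` captures a path segment as an integer field. The Python sketch below shows that reading under stated assumptions. It is not the service's actual matcher; the helper names are hypothetical, `*` is assumed to match any remaining characters, and only the `int` type shown in the slides is handled.

```python
import re

# Only the "int" type appears in the slides; other types are not assumed here.
_CASTS = {"int": int}
_PATTERNS = {"int": r"\d+"}


def compile_definition(definition):
    """Turn a definition such as '/invoices/{year int}/*' into a regex plus casts."""
    casts = {}
    parts = []
    pos = 0
    for m in re.finditer(r"\{(\w+) (\w+)\}|\*", definition):
        # Literal text between placeholders is matched verbatim.
        parts.append(re.escape(definition[pos:m.start()]))
        if m.group(0) == "*":
            parts.append(r".*")  # assumption: wildcard matches anything
        else:
            name, type_name = m.group(1), m.group(2)
            casts[name] = _CASTS[type_name]
            parts.append(f"(?P<{name}>{_PATTERNS[type_name]})")
        pos = m.end()
    parts.append(re.escape(definition[pos:]))
    return re.compile("^" + "".join(parts) + "$"), casts


def match_path(definition, path):
    """Return the fields captured from path, or None if it does not match."""
    regex, casts = compile_definition(definition)
    m = regex.match(path)
    if m is None:
        return None
    return {name: cast(m.group(name)) for name, cast in casts.items()}
```

With the bucket layout shown earlier, both `/invoices/2019/1.parquet` and `/invoices/2017.json.gz` would yield a `year` value, each through a different definition.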
Configuration: Data (Future)
history: {
  invoices: [{
    store: "ent-archive",
    definition: "/invoices/{year int}/*"
  }, {
    store: "ent-archive",
    definition: "/invoices/{year int}.json.gz"
  }, {
    store: "atlas",
    db: "customers",
    collection: "invoices"
  }]
}
Queries
Processing
MQL → Distributed MQL
Parse
Parallelize
Distribute
Architecture
[Diagram: Atlas control, control plane, compute plane, and data plane; three DataLake Frontend and DataLake Agent pairs, each with load balancers]
Architecture
[Diagram: the same planes; one DataLake Frontend and nine DataLake Agents]
Query Example: $limit
Map:
{ $match: { year: { $gt: 2000 } } }
{ $limit: 10 }
Reduce:
{ $limit: 10 }
Original pipeline:
{ $match: { year: { $gt: 2000 } } }
{ $limit: 10 }
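The Map and Reduce stages above can be simulated in a few lines. This is a minimal Python sketch that assumes each partition is a plain list of documents and that partitions are processed independently; the function names are hypothetical. It shows why `$limit` appears in both stages: each partition needs to return at most 10 documents, and the merge step trims the combined result to 10 again.

```python
from itertools import islice


def limit_map(partition, limit):
    """Per-partition work: $match year > 2000, then $limit."""
    matched = (doc for doc in partition if doc["year"] > 2000)
    return list(islice(matched, limit))


def limit_reduce(partial_results, limit):
    """Merge the per-partition results and apply $limit once more."""
    merged = (doc for part in partial_results for doc in part)
    return list(islice(merged, limit))


def run_limit(partitions, limit=10):
    """Run the map stage on every partition, then reduce the results."""
    return limit_reduce([limit_map(p, limit) for p in partitions], limit)
```

Applying the limit in the map stage keeps the amount of data sent to the reduce stage bounded by limit times the number of partitions.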
Query Example: $group
Map:
{ $group: { _id: "$year",
totalAvg_sum: { $sum: "$amount" },
totalAvg_count: { $sum: 1 }
} }
Reduce:
{ $group: { _id: "$_id",
totalAvg_sum: { $sum: "$totalAvg_sum" },
totalAvg_count: { $sum: "$totalAvg_count" }
} }
Finalize:
{ $project: { _id: "$_id", totalAvg: { $divide: ["$totalAvg_sum", "$totalAvg_count"] } } }
Original pipeline:
{ $group: { _id: "$year", totalAvg: { $avg: "$amount" } } }
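An average cannot be merged by averaging per-partition averages, which is why the Map stage above emits a sum and a count instead. The following Python sketch mirrors the three stages, assuming each partition is a list of documents with `year` and `amount` fields; the function names are hypothetical.

```python
from collections import defaultdict


def _new_acc():
    return defaultdict(lambda: {"totalAvg_sum": 0, "totalAvg_count": 0})


def group_map(partition):
    """Per-partition $group: partial sum and count of amount per year."""
    acc = _new_acc()
    for doc in partition:
        group = acc[doc["year"]]
        group["totalAvg_sum"] += doc["amount"]
        group["totalAvg_count"] += 1
    return [{"_id": year, **vals} for year, vals in acc.items()]


def group_reduce(partials):
    """Merge partial results by adding up sums and counts per _id."""
    acc = _new_acc()
    for part in partials:
        for row in part:
            group = acc[row["_id"]]
            group["totalAvg_sum"] += row["totalAvg_sum"]
            group["totalAvg_count"] += row["totalAvg_count"]
    return [{"_id": key, **vals} for key, vals in acc.items()]


def group_finalize(rows):
    """$project: divide the merged sum by the merged count."""
    return [
        {"_id": r["_id"], "totalAvg": r["totalAvg_sum"] / r["totalAvg_count"]}
        for r in rows
    ]
```

For example, if one partition holds a single 2018 invoice of 10 and another holds 2018 invoices of 20 and 60, the merged average is 90 / 3 = 30, whereas averaging the two partition averages (10 and 40) would give 25.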
Future
Future
More supported MongoDB operators.
$out
$merge
Geo operators
Full Text Search
Future
Optimizations
Indexes
Statistics
Future
File Formats
ORC
PDF
Future
Integrations
Atlas
Microsoft Azure
Google Cloud
Hiring
Lots to do
mongodb.com/careers
Craig Wilson
Senior Staff Engineer, MongoDB
Thank You!