Charles Sarrazin, MongoDB
Raiders of the Anti-Patterns:
A journey towards fixing schema mistakes in MongoDB
Charles Sarrazin
Principal Consulting Engineer, Professional Services, Paris, FR
Our Journey
§ Packing
§ Anti-Patterns
§ Fixing schema issues gracefully
§ Conclusion

Our backpack
§ Design Patterns
§ Monitoring tools
§ Log analysis
§ Additional tools
Design Patterns
Representation
§ Attribute
§ Schema Versioning
§ Document Versioning
§ Tree
§ Polymorphism
§ Pre-allocation
Access Frequency
§ Subset
§ Approximation
§ Extended Reference
Grouping
§ Computed
§ Bucket
§ Outlier
Data Modeling
Use Cases
Monitoring tools
For example
• Ops/Cloud Manager
• MongoDB Compass

Log Analysis
mtools
• mlogfilter
• mloginfo
• mplotqueries
Additional tools
• Oplog analysis
• db.currentOp()
• Profiler
• db.collection.explain()
Anti-Patterns
Understanding your data model and identifying mistakes
The Fauna
a.k.a « One Collection Fits All »
or « Schemaless »

The Fauna
Symptoms
§ Slow writes
§ High number of indexes (>20-25)
The Fauna
The Anti-Pattern
§ Access patterns are actually
different based on document type
§ Each document type depends on a
specific index
§ No common access patterns
The Actual Reason
§ While indexes improve reads, they
might negatively impact writes
§ You may only have up to 64
indexes in a single collection
§ If you don’t use Partial or Sparse
indexes, null or absent values will
still be indexed
The Fauna
Takeaways
§ Documents with different access patterns or business logic
should be stored in separate collections
§ You can temporarily rely on Partial Indexes to reduce the
size of indexes and their performance impact
§ Spending just a little time on schema design is important
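The partial-index stopgap can be sketched as follows. This is a minimal illustration, not the speaker's code: the `type` discriminator field, the field names and the `things` collection are hypothetical, while `partialFilterExpression` is the standard `createIndex` option.

```javascript
// Hypothetical document types sharing one collection, each with its own access pattern.
const indexPlans = [
  { type: "order", keys: { customerId: 1, createdAt: -1 } },
  { type: "invoice", keys: { dueDate: 1 } },
];

// Build the arguments for db.things.createIndex(keys, options).
// The partial filter restricts each index to the document type that needs it.
function partialIndexSpec(plan) {
  return {
    keys: plan.keys,
    options: {
      name: `${plan.type}_partial`,
      partialFilterExpression: { type: plan.type },
    },
  };
}

const specs = indexPlans.map(partialIndexSpec);

// In mongosh (assumed collection name "things"):
//   specs.forEach(s => db.things.createIndex(s.keys, s.options));
console.log(JSON.stringify(specs, null, 2));
```

Queries must include the same `type` condition for the planner to consider a partial index, which is one reason this is only a temporary measure before splitting the collection.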
The Squashed Database
a.k.a « Flat documents » or
« The RDBMS schema »

The Squashed Database
Symptoms
§ High IOPS (random reads/writes)
§ Low throughput
§ High yields and/or nReturned
§ High index size
The Squashed Database
The Anti-Pattern
§ Flat documents stored in separate
collections
§ Only using root-level fields and no
hierarchy
The Actual Reason
§ In order to parse a flat document,
MongoDB will read each field
sequentially
§ Normalization also means
redundant data (relations)
§ Data needs to be consolidated
using JOINs ($lookup)
The Squashed Database
Takeaways
§ Simply transposing your data model from an RDBMS to MongoDB
won’t help much with scaling up
§ Consider grouping data from multiple tables in a single collection,
by embedding the relations (1:1, 1:n) when data volume is reasonable
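Embedding a 1:n relation can be sketched as below. The `customers` and `addresses` rows are hypothetical stand-ins for two relational tables; the point is only to show the reshaping step.

```javascript
// Hypothetical rows as they would come out of two relational tables.
const customers = [{ id: 1, name: "Ada" }, { id: 2, name: "Linus" }];
const addresses = [
  { customerId: 1, city: "Paris" },
  { customerId: 1, city: "London" },
  { customerId: 2, city: "Helsinki" },
];

// Embed the 1:n relation: each customer document carries its own addresses,
// so a single read returns everything without a $lookup.
function embedAddresses(customerRows, addressRows) {
  const byCustomer = new Map();
  for (const { customerId, ...address } of addressRows) {
    if (!byCustomer.has(customerId)) byCustomer.set(customerId, []);
    byCustomer.get(customerId).push(address);
  }
  return customerRows.map(({ id, ...rest }) => ({
    _id: id,
    ...rest,
    addresses: byCustomer.get(id) || [],
  }));
}

const customerDocs = embedAddresses(customers, addresses);
console.log(JSON.stringify(customerDocs));
```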
$project the Elephant
a.k.a. « Bloated documents » or
« The $project »

$project the Elephant
Symptoms
§ High read IOPS
§ High cache activity (bytes read into cache)
§ High number of yields when reading a single document
§ Slow indexed queries when reading a single document
§ Result length lower than document size
§ Generally, large documents (200+ KB)
$project the Elephant
The Anti-Pattern
§ Using big documents (>100 KB)
while only projecting a few fields
The Actual Reason
§ Documents are the base level
transfer unit from disk to memory
§ Even when using a single field, the
whole document is loaded from
disk to the WiredTiger cache
$project the Elephant
Takeaways
§ Use smaller documents with more
frequently accessed data
§ Store less frequently accessed data
in another collection
Also known as the Subset Pattern
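A minimal sketch of the Subset Pattern, assuming a hypothetical product document whose `reviews` array is the rarely read, bulky part:

```javascript
// Hypothetical product document: a few hot fields plus a large, rarely read review list.
const product = {
  _id: "p1",
  name: "Fedora",
  price: 49,
  reviews: Array.from({ length: 500 }, (_, i) => ({ n: i, text: "review " + i })),
};

// Split one bloated document into a small "hot" document and separate review documents.
// Only the most recent `keep` reviews stay embedded.
function splitSubset(doc, keep) {
  const { reviews, ...hot } = doc;
  return {
    hot: { ...hot, recentReviews: reviews.slice(-keep) },
    cold: reviews.map((r) => ({ productId: doc._id, ...r })),
  };
}

const { hot, cold } = splitSubset(product, 10);
console.log(hot.recentReviews.length, cold.length);
```

The `hot` document would go to the frequently read collection and the `cold` documents to a separate one, so a read of the product page only pulls the small document into the cache.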
The Single-Person Bridge
a.k.a. « The Auto-Incrementing
Counter » or « SQL in
MongoDB »

The Single-Person Bridge
Symptoms
§ Some updates seem to take a long time
§ MongoDB logs show writeConflicts>0 for these updates
§ The application seems to perform write operations sequentially
The Single-Person Bridge
The Anti-Pattern
§ Simulating a SQL sequence by
using a counter document and
findAndModify
The Actual Reason
§ As WiredTiger uses a document-
level lock, concurrent updates to a
single document will block other
writes to the same document
The Single-Person Bridge
Takeaways
§ Do not try to simulate sequences in MongoDB
§ Instead, rely on ObjectIds, UUIDs or GUIDs
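As a sketch of the alternative, each writer can generate its own identifier instead of queuing on a shared counter document. The `newOrder` helper and the `counters`/`orders` names are hypothetical; `randomUUID` comes from Node's built-in `crypto` module.

```javascript
const { randomUUID } = require("crypto");

// Anti-pattern (for contrast): every insert first updates the same counter document,
// e.g. db.counters.findAndModify({ query: { _id: "orders" }, update: { $inc: { seq: 1 } }, new: true })
// so all writers queue up on that one document.

// Alternative: each writer generates its own identifier, with no shared document.
function newOrder(fields) {
  return { _id: randomUUID(), ...fields };
}

const a = newOrder({ item: "whip" });
const b = newOrder({ item: "hat" });
console.log(a._id !== b._id);
```

Leaving `_id` out entirely also works: the driver then assigns an ObjectId on insert.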
Sorted Monkeys
a.k.a. « Sorted Array Push »

Sorted Monkeys
Symptoms
§ Very high Oplog churn (Oplog GB/Hour)
§ Low Oplog window with default Oplog size
§ Oplog size is very high compared to data size to ensure proper
operations (target Oplog window > 3 days)
Sorted Monkeys
The Anti-Pattern
§ Using $push on big arrays (>20
entries) with:
§ The $sort modifier
§ The $slice modifier
The Actual Reason
§ Oplog operations must be idempotent,
so these updates are recorded in the
Oplog as a $set statement that
replaces the full array.
Sorted Monkeys
Takeaways
§ Only rely on the $slice and $sort modifiers when manipulating
small arrays
§ You can rely on in-memory or application-level sorts for
medium-sized result sets
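An application-level sort can be sketched as follows, assuming a hypothetical `players` document whose `scores` array is appended to with a bare `$push` and sorted only at read time:

```javascript
// Hypothetical document with an array kept in plain insertion order:
// new entries are appended with a bare $push (no $sort, no $slice),
// e.g. db.players.updateOne({ _id: 1 }, { $push: { scores: { value: 70 } } })
const player = {
  _id: 1,
  scores: [{ value: 40 }, { value: 95 }, { value: 70 }, { value: 88 }],
};

// Sort and trim in the application at read time instead of on every write.
// The array is copied first so the original document is left untouched.
function topScores(doc, limit) {
  return [...doc.scores].sort((x, y) => y.value - x.value).slice(0, limit);
}

const best = topScores(player, 3);
console.log(best.map((s) => s.value));
```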
The Tree in the House
a.k.a. « Push until the End »

The Tree in the House
Symptoms
§ Your application worked fine for some period of time
§ After a while, some updates fail with:
Resulting document after update is larger than 16777216
The Tree in the House
The Anti-Pattern
§ Using unbounded arrays for
storing data (e.g. Audit logs for
tracing document updates)
The Actual Reason
§ MongoDB documents are limited
to 16MB
§ Depending on relationship, you
might reach maximum document
size if not careful
The Tree in the House
Takeaways
§ For 1:n relationships, you need to
consider cardinality
§ Differentiate 1 to few (<10k array
elements) from 1 to zillions
§ Consider using the Subset, Outlier
or Bucket patterns
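One way to bound such arrays is the Bucket pattern. The sketch below builds the filter and update for an upsert that fills fixed-size buckets; the `audit` collection, the field names and the cap of 200 entries are assumptions made for illustration.

```javascript
const BUCKET_SIZE = 200; // assumed cap on entries per bucket document

// Build the filter and update for an upsert that appends one audit entry
// to the current, non-full bucket of a document, or starts a new bucket.
function bucketUpsert(docId, entry) {
  return {
    filter: { docId, count: { $lt: BUCKET_SIZE } },
    update: {
      $push: { entries: entry },
      $inc: { count: 1 },
    },
    options: { upsert: true },
  };
}

const op = bucketUpsert("doc-42", { at: "2019-06-18T10:00:00Z", change: "status=shipped" });

// In mongosh (assumed collection name "audit"):
//   db.audit.updateOne(op.filter, op.update, op.options);
console.log(JSON.stringify(op));
```

When every existing bucket for `docId` is full, the filter matches nothing and the upsert creates a fresh bucket, so no single document grows without bound.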
Fixing schema
issues gracefully

Considerations
§ Availability
  § Can your business afford scheduled downtime?
  § Do you need to keep multiple versions of your app online?
§ Performance
  § How does the migration affect performance?
§ Rollback Strategy
  § How do we go back if we run into a problem?
§ Risk
  § What is the impact of a failed migration?
Migration Strategies
§ One-Time
§ Blue/Green
§ Y-Write
§ Read & Upgrade
One-Time
Principles
Pros
§ Fastest migration path
§ Immediate economies of scale
Cons
§ High risk
§ Requires tremendous coordination
§ Complex parallel testing
§ Labor intensive
YOLO!
Blue/Green
Principles
Pros
§ Always available
§ Easy rollback: change router to point to previous version
Cons
§ You need to be able to sync the two DBs
  § Use Change Streams
§ You need double the hardware or resources

Y-Write
Principles
Pros
§ Always available
§ Easy rollback: stop writing to new schema
§ Legacy applications can still read from the old schema
Cons
§ You need to be able to sync the two DBs
§ Write logic needs to be centralized and migrated
before read logic
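The centralized write path of a Y-Write migration might look like the sketch below. The two `Map` objects stand in for the old and new collections, and the flat-to-embedded address change is a hypothetical schema change chosen for illustration.

```javascript
// Hypothetical mapping from the legacy (flat) shape to the new (embedded) shape.
function toNewShape(doc) {
  const { street, city, ...rest } = doc;
  return { ...rest, address: { street, city } };
}

// Centralized write path: every write goes to both schemas.
// The Maps stand in for the old and new collections.
function yWrite(doc, oldStore, newStore) {
  oldStore.set(doc._id, doc);
  newStore.set(doc._id, toNewShape(doc));
}

const oldStore = new Map();
const newStore = new Map();
yWrite({ _id: 1, name: "Marion", street: "1 Main St", city: "Cairo" }, oldStore, newStore);
console.log(oldStore.get(1), newStore.get(1));
```

Rolling back then amounts to dropping the second write, while legacy readers keep using the old store throughout.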
Read & Upgrade
Principles
Pros
§ Always available
§ Good performance
Cons
§ You need to consider schema backward and
forward compatibility
§ Schema upgrade is part of the application logic
§ Requires a deprecation roadmap to remove
legacy code
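A lazy upgrade on read can be sketched as follows. The `schemaVersion` field, the name-splitting change and the `save` callback are all hypothetical; `save` stands in for a driver call such as `replaceOne`.

```javascript
// Hypothetical schema change: v1 stored a single "name" string,
// v2 stores "firstName"/"lastName" and a "schemaVersion" marker.
function upgrade(doc) {
  if ((doc.schemaVersion || 1) >= 2) return { doc, changed: false };
  const [firstName, ...rest] = doc.name.split(" ");
  const { name, ...others } = doc;
  return {
    doc: { ...others, firstName, lastName: rest.join(" "), schemaVersion: 2 },
    changed: true,
  };
}

// Read path: upgrade lazily, and persist only when something changed.
// `save` stands in for a driver call such as replaceOne({ _id: doc._id }, doc).
function readAndUpgrade(raw, save) {
  const { doc, changed } = upgrade(raw);
  if (changed) save(doc);
  return doc;
}

const saved = [];
const v1 = { _id: 7, name: "Henry Jones" };
const result = readAndUpgrade(v1, (d) => saved.push(d));
console.log(result, saved.length);
```

The `upgrade` function is the legacy code that the deprecation roadmap eventually removes, once no version 1 documents remain.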
Ensuring backward compatibility
Do
§ Insert data in existing collections
§ Add new field
§ Create a new collection/database
Don’t
§ Rename/Remove field
§ Remove data
§ Change field type or format
§ Remove/Rename collection/database
Summary
                 Availability  Performance  Risk  Cost
One-Time         ✗✗            ✓            ✗✗    ✓✓
Blue/Green       ✓             ✗            ✓✓    ✗✗
Y-Write          ✓✓            ✓            ✓✓    ✓✓
Read & Upgrade   ✓             ✓✓           ✗     ✓

Key takeaways
Regularly reassess your hypotheses
§ Your access patterns will change over time
§ Check your actual access patterns
Key takeaways
MongoDB provides flexible migration options
§ You can combine both online and offline schema migrations
§ Consider your development lifecycle and your release schedule to
choose your migration strategy
§ Use $jsonSchema to handle schema validation or check migration
status
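Both uses of `$jsonSchema` can be sketched with one schema object. The `people` collection and the field names are hypothetical; `$jsonSchema`, `$nor`, `countDocuments` and `collMod` are standard MongoDB features.

```javascript
// Hypothetical target schema for migrated documents.
const targetSchema = {
  bsonType: "object",
  required: ["firstName", "lastName", "schemaVersion"],
  properties: {
    firstName: { bsonType: "string" },
    lastName: { bsonType: "string" },
    schemaVersion: { bsonType: "number", minimum: 2 },
  },
};

// Validator to attach to the collection once the migration is complete.
const validator = { $jsonSchema: targetSchema };

// Query matching every document that does NOT yet satisfy the target schema.
const notMigrated = { $nor: [{ $jsonSchema: targetSchema }] };

// In mongosh (assumed collection name "people"):
//   db.people.countDocuments(notMigrated)              // remaining documents to migrate
//   db.runCommand({ collMod: "people", validator })    // enforce the schema afterwards
console.log(JSON.stringify(notMigrated));
```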
But more importantly…

…Take some time to think
about your data model!
Thank you for taking our FREE
MongoDB classes at
Register Now!

MongoDB World 2019: Raiders of the Anti-patterns: A Journey Towards Fixing Schema Mistakes in MongoDB


  • 1. Charles Sarrazin, MongoDB Raiders of the Anti-Patterns: A journey towards fixing schema mistakes in MongoDB @csarrazi
  • 2. Charles Sarrazin Principal Consulting Engineer, Professional Services, Paris, FR
  • 3. Our Journey § Packing § Anti-Patterns § Fixing schema issues gracefully § Conclusion
  • 5. Our backpack § Design Patterns § Monitoring tools § Log analysis § Additional tools
  • 6. Design Patterns Representation § Attribute § Schema Versioning § Document Versioning § Tree § Polymorphism § Pre-allocation Access Frequency § Subset § Approximation § Extended Reference Grouping § Computed § Bucket § Outlier
  • 8. Monitoring tools For example • Ops/Cloud Manager • MongoDB Compass
  • 9. Log Analysis mtools • mlogfilter • mloginfo • mplotqueries
  • 10. Additional tools • Oplog analysis • db.currentOp() • Profiler • db.collection.explain()
  • 11. Anti-Patterns Understanding your data model and identifying mistakes
  • 12. The Fauna a.k.a « One Collection Fits All » or « Schemaless »
  • 13. The Fauna Symptoms § Slow writes § High number of indexes (>20-25)
  • 14. The Fauna The Anti-Pattern § Access patterns are actually different based on document type § Each document type depends on a specific index § No common access patterns The Actual Reason § While indexes improve reads, they might negatively impact writes § You may only have up to 64 indexes in a single collection § If you don’t use Partial or Sparse indexes, null or absent values will still be indexed
  • 15. The Fauna Takeaways § Documents with different access patterns or business logic should be stored in separate collections § You can temporarily rely on Partial Indexes to reduce the size of indexes and their performance impact § Spending just a little time on schema design is important
  • 16. The Squashed Database a.k.a « Flat documents » or « The RDBMS schema »
  • 17. The Squashed Database Symptoms § High IOPS (random reads/writes) § Low throughput § High yields and/or nReturned § High index size
  • 18. The Squashed Database The Anti-Pattern § Flat documents stored in separate collections § Only using root-level fields and no hierarchy The Actual Reason § In order to parse a flat document, MongoDB will read each field sequentially § Normalization also means redundant data (relations) § Data needs to be consolidated using JOINs ($lookup)
  • 19. The Squashed Database Takeaways § Simply transposing your data model from a RDBMS to MongoDB won’t be as helpful for scaling up § Consider grouping data from multiple tables in a single collection, by embedding the relations (1:1, 1:n) when data volume is reasonable
  • 20. $project the Elephant a.k.a. « Bloated documents » or « The $project »
  • 21. $project the Elephant Symptoms § High read IOPS § High cache activity (bytes read into cache) § High number of yields when reading a single document § Slow indexed queries when reading a single document § Result length lower than document size § Generally, big document size (> 200+ KB)
  • 22. $project the Elephant The Anti-Pattern § Using big document (>100kb) while only projecting a few fields The Actual Reason § Documents are the base level transfer unit from disk to memory § Even when using a single field, the whole document is loaded from disk to the WiredTiger cache
  • 23. $project the Elephant Takeaways § Use smaller documents with more frequently accessed data § Store less frequently accessed data in another collection Also known as the Subset Pattern
  • 24. The Single-Person Bridge a.k.a. « The Auto-Incrementing Counter » or « SQL in MongoDB »
  • 25. The Single-Person Bridge Symptoms § Some updates seem to take a long time § MongoDB logs show writeConflicts>0 for these updates § The application seems to perform write operations sequentially
  • 26. The Single-Person Bridge The Anti-Pattern § Simulating a SQL sequence by using a counter document and findAndModify The Actual Reason § As WiredTiger uses a document-level lock, concurrent updates to a single document will block other writes to the same document
  • 27. The Single-Person Bridge Takeaways § Do not try to simulate sequences in MongoDB § Instead, rely on ObjectIDs, UUIDs or GUIDs
  • 28. Sorted Monkeys a.k.a. « Sorted Array Push »
  • 29. Sorted Monkeys Symptoms § Very high Oplog churn (Oplog GB/Hour) § Low Oplog window with default Oplog size § Oplog size is very high compared to data size to ensure proper operations (target Oplog window > 3 days)
  • 30. Sorted Monkeys The Anti-Pattern § Using $push on big arrays (>20 entries) with: § The $sort modifier § The $slice modifier The Actual Reason § Oplog operations are idempotent, meaning that these operations are replaced with a $set statement, replacing the full array.
  • 31. Sorted Monkeys Takeaways § Only rely on the $slice and $sort modifiers when manipulating small arrays § You can rely on in-memory or application-level sorts for medium-sized result sets
  • 32. The Tree in the House a.k.a. « Push until the End »
  • 33. The Tree in the House Symptoms § Your application worked fine for some period of time § After a while, some updates fail with: Resulting document after update is larger than 16777216
  • 34. The Tree in the House The Anti-Pattern § Using unbounded arrays for storing data (e.g. Audit logs for tracing document updates) The Actual Reason § MongoDB documents are limited to 16MB § Depending on relationship, you might reach maximum document size if not careful
  • 35. The Tree in the House Takeaways § For 1:n relationships, you need to consider cardinality § Differentiate 1 to few (<10k array elements) from 1 to zillions § Consider using the Subset, Outlier or Bucket patterns
  • 37. Considerations § Availability § Can your business afford scheduled downtime? § Do you need to keep multiple versions of your app online? § Performance § How does the migration affect performance? § Rollback Strategy § How do we go back if we run into a problem? § Risk § What is the impact of a failed migration?
  • 38. Migration Strategies § One-Time § Blue/Green § Y-Write § Read & Upgrade
  • 39. One-Time Principles Pros § Fastest migration path § Immediate economies of scale Cons § High risk § Requires tremendous coordination § Complex parallel testing § Labor intensive YOLO!
  • 40. Blue/Green Principles Pros § Always available § Easy rollback: change router to point to previous version Cons § You need to be able to sync the two DBs § Use ChangeStreams § You need double the hardware or resources
  • 41. Y-Write Principles Pros § Always available § Easy rollback: stop writing to new schema § Legacy applications can still read from the old schema Cons § You need to be able to sync the two DBs § Write logic needs to be centralized and migrated before read logic
  • 42. Read & Upgrade Principles Pros § Always available § Good performance Cons § You need to consider schema backward and forward compatibility § Schema upgrade is part of the application logic § Requires a deprecation roadmap to remove legacy code
  • 43. Ensuring backward compatibility Do § Insert data in existing collections § Add new field § Create a new collection/database Don’t § Rename/Remove field § Remove data § Change field type or format § Remove/Rename collection/database
  • 44. Summary Availability Performance Risk Cost One Time ✗✗ ✓ ✗✗ ✓✓ Blue/Green ✓ ✗ ✓✓ ✗✗ Y-Write ✓✓ ✓ ✓✓ ✓✓ Read & Upgrade ✓ ✓✓ ✗ ✓
  • 46. Key takeaways Regularly reassess your hypotheses § Your access patterns will change over time § Check your actual access patterns
  • 47. Key takeaways MongoDB provides flexible migration options § You can combine both online and offline schema migrations § Consider your development lifecycle and your release schedule to choose your migration strategy § Use $jsonSchema to handle schema validation or check migration status
  • 49. …Take some time to think about your data model!
  • 51. Thank you for taking our FREE MongoDB classes at