Presented virtually at GDG Indy - https://www.meetup.com/indy-gdg/events/269467916/
If you’re thinking about using a document database, it can be intimidating to start. A flexible data model gives you a lot of choices, but which way is the right way? Is a document database even the right tool? In this session we’ll go over the basics of data modeling using JSON. We’ll compare and contrast with traditional RDBMS modeling. Impact on application code will be discussed, as well as some tooling that could be helpful along the way. The examples use the free, open-source Couchbase Server document database, but the principles from this session can also be applied to CosmosDb, Mongo, RavenDb, etc.
4. Who am I?
• Matthew D. Groves
• Developer Advocate for Couchbase
• @mgroves on Twitter
• Podcast and blog: https://crosscuttingconcerns.com
• "I am not an expert, but I am an enthusiast." – Alan Stevens
by @natelovett
9. NoSQL Landscape
• Get by key(s)
• Set by key(s)
• Replace by key(s)
• Delete by key(s)
• Map/Reduce
Document
• Couchbase
• MongoDB
• DynamoDB
• Firestore
25. {
"Name" : "Bob Jones",
"DOB" : "1980-01-29",
"Billing" : [
{
"type" : "visa",
"cardnum" : "5927-2842-2847-3909",
"expiry" : "2020-03"
},
{
"type" : "master",
"cardnum" : "6273-2842-2847-3909",
"expiry" : "2019-11"
}
],
"Connections" : [
{
"CustId" : "XYZ987",
"Relation" : "Brother"
},
{
"CustId" : "PQR823",
"Relation" : "Father"
}
],
"Purchases" : [
{ "id":12, item: "mac", "amt": 2823.52 },
{ "id":19, item: "ipad2", "amt": 623.52 }
]
}
DocumentKey: CBL2016
Customer
CustomerID | Name | DOB
CBL2016 | Bob Jones | 1980-01-29

Billing
CustomerID | Type | Cardnum | Expiry
CBL2016 | visa | 5927… | 2020-03
CBL2016 | master | 6273… | 2019-11

Connections
CustomerID | ConnId | Relation
CBL2016 | XYZ987 | Brother
CBL2016 | SKR007 | Father

Purchases
CustomerID | item | amt
CBL2016 | mac | 2823.52
CBL2016 | ipad2 | 623.52

Contacts
CustomerID | ConnId | Name
CBL2016 | XYZ987 | Joe Smith
CBL2016 | SKR007 | Sam Smith
26. Relationship is one-to-one or one-to-many
Store related data as nested objects
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Purchases" : [
{
"item" : "laptop",
"amount" : 1499.99,
"date" : "2019-03",
},
{
"item" : "phone",
"amount" : 99.99,
"date" : "2018-12"
}
]
}
Modeling your data: Strategies / rules of thumb
27. Relationship is many-to-one or many-to-many
Store related data as separate documents
{
"Name" : "Jane
Smith",
"DOB" : "1990-01-
30",
"Connections" : [
"XYZ987",
"PQR823",
"PQR828"
]
}
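The two strategies above (nesting for one-to-many, separate documents for many-to-many) can be sketched with a plain Python dict standing in for the document store. This is only an illustration under assumptions: the `store` dict, the `customer::jane` key, the `get_connections` helper, and the contents of the connection documents are all hypothetical and not part of any SDK.

```python
# A dict stands in for the document database: key -> JSON-like document.
store = {
    "customer::jane": {
        "Name": "Jane Smith",
        "DOB": "1990-01-30",
        # One-to-many: purchases belong only to Jane, so they are nested.
        "Purchases": [
            {"item": "laptop", "amount": 1499.99, "date": "2019-03"},
            {"item": "phone", "amount": 99.99, "date": "2018-12"},
        ],
        # Many-to-many: connections are other customers, so only keys are stored.
        "Connections": ["XYZ987", "PQR823"],
    },
    # Hypothetical connection documents, stored separately under their own keys.
    "XYZ987": {"Name": "Connection A"},
    "PQR823": {"Name": "Connection B"},
}


def get_connections(store, customer_key):
    """Follow the stored keys to load each connected customer document."""
    customer = store[customer_key]
    return [store[key] for key in customer["Connections"]]


jane = store["customer::jane"]
# Nested data: one read returns the parent and all of its children.
total_spent = sum(p["amount"] for p in jane["Purchases"])
# Referenced data: one extra lookup per connection.
connection_names = [c["Name"] for c in get_connections(store, "customer::jane")]
```

The trade-off shows up in the last two lines: nested purchases come back with the parent in a single read, while referenced connections cost one extra lookup each but can be shared and updated independently.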
30. Data reads are mostly parent fields
Store children as separate documents
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Connections" : [
"XYZ987",
"PQR823",
"PQR828"
]
}
31. Data reads are mostly parent + child fields
Store children as nested objects
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Purchases" : [
{
"item" : "laptop",
"amount" : 1499.99,
"date" : "2019-03",
},
{
"item" : "phone",
"amount" : 99.99,
"date" : "2018-12"
}
]
}
32. Data writes are mostly parent or child (not both)
Store children as separate documents
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Connections" : [
"XYZ987",
"PQR823",
"PQR828"
]
}
33. Data writes are mostly parent and child (both)
Store children as nested objects
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Purchases" : [
{
"item" : "laptop",
"amount" : 1499.99,
"date" : "2019-03",
},
{
"item" : "phone",
"amount" : 99.99,
"date" : "2018-12"
}
]
}
34. If … | Then …
Relationship is one-to-one or one-to-many | Store related data as nested objects
Relationship is many-to-one or many-to-many | Store related data as separate documents
Data reads are mostly parent fields | Store children as separate documents
Data reads are mostly parent + child fields | Store children as nested objects
Data writes are mostly parent or child (not both) | Store children as separate documents
Data writes are mostly parent and child (both) | Store children as nested objects
35. Accessing your data (Couchbase)
[Diagram: Documents can be accessed through Key-Value (CRUD), N1QL (SQL Query), Full Text (Search), Views (JS Query), and Analytics (Query). Other labels in the figure: Indexes (twice), MapReduce, SQL++.]
36. Key/Value
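// Key/value read: fetch a document directly by its key.
public ShoppingCart GetCartById(string id)
{
    return _bucket.Get<ShoppingCart>(id).Value;
}

// Key/value write: insert a new document under a known key.
public void CreateShoppingCart()
{
    _bucket.Insert(new Document<ShoppingCart>
    {
        Id = "shopping-cart-1",
        Content = new ShoppingCart { . . . }
    });
}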
44. Concept: Strategies & Recommendations
Key-Value operations provide the best possible performance
• Create an effective key naming strategy
• Create an optimized data model
Full Text Search is well-suited to text
• Facets / ranges / geography
• Language aware
N1QL queries provide the most flexibility (everything else)
• Query data regardless of how it is modeled
• Good indexing is vital
Accessing your data: Strategies and recommendations
62. Frequently Asked Questions
1. How is Couchbase different than Mongo?
2. Is Couchbase the same thing as CouchDb?
3. How tall are you? Do you play basketball?
4. What is the Couchbase licensing situation?
5. Is Couchbase a Managed Cloud Service (DBaaS)?
63. Managed Cloud Server (DBaaS)
https://www.couchbase.com/products/cloud
64. MongoDB vs Couchbase
• Architecture
• Memory first architecture
• Master-master architecture
• Auto-sharding
• Features
• SQL (N1QL)
• Full Text Search
• Analytics (NoETL)
65. Licensing
Couchbase Server Community
• Source code is Open Source (Apache 2)
• Binary release is one release behind Enterprise (except major versions)
• Free to use in dev/test/qa/prod
• Forum support only
Couchbase Server Enterprise
• Source code is mostly Open Source (Apache 2)
• Some features not available on Community (XDCR TLS, MDS, Rack Zone, etc)
• Free to use in dev/test/qa
• Need commercial license for prod
• Paid support provided
Most developers are probably already familiar with the relational way of modeling data
But if you want the benefits of a non-relational database, you have to think differently about modeling
Spend just a little time on why people are using NoSQL
Talk about how data is modeled differently in JSON
Let’s talk about why SQL is good and why SQL for JSON is needed
Talk about accessing data, since that has an effect on modeling
Maybe we'll get to migrating/syncing data from relational to nosql
SQL (relational) databases are great. They give you a LOT of functionality.
Great set of abstractions (tables, columns, data types, constraints, triggers, SQL, ACID TRANSACTIONS, stored procedures and more) at a highly reasonable cost.
Change is inevitable
One thing RDBMS does not handle well is CHANGE.
Change of schema (both logical and physical), change of hardware, change of capacity.
NoSQL databases, ESPECIALLY ONES DESIGNED TO BE DISTRIBUTED, tend to help solve problems with agility, scalability, performance, and availability.
Let’s talk about what NoSQL is, first.
NoSQL generally refers to databases which lack SQL or don’t use a relational model
Once the SQL language and transactions became optional, a flurry of databases was created, using distinct approaches for common use cases.
Key-value databases simply provide quick access to data for a given KEY.
Wide column databases can store a large number of arbitrary columns in each row.
Graph databases store data and relationships as first class concepts
Document databases aggregate data into a hierarchical structure.
JSON is a means to an end. Document databases use JSON to provide a flexible schema, built-in data types, rich structure, and implicit relationships.
When we look at document databases, they originally came with a minimal set of APIs and features
But as they continue to mature, we’re seeing more features being added
And generally I’m seeing a convergent trend between relational and NoSQL
But anyway, this set of minimal features, lacking a SQL language and tables gives us the buzzword “nosql”
Think of a document database, at its simplest, as a type of key/value store where the value is in a known format.
You write code where you start with a key and ask the database to return the document that corresponds to that key.
The same goes for creating and updating.
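As a rough illustration of that mental model, here is a toy in-memory key/value store with the get / insert / replace / delete operations from the NoSQL landscape slide. The `KeyValueStore` class and its method names are hypothetical and only mimic the shape of such an API; this is not a real client library.

```python
class KeyValueStore:
    """Toy in-memory key/value store; values are JSON-like dicts."""

    def __init__(self):
        self._docs = {}

    def get(self, key):
        # Raises KeyError if the key is missing.
        return self._docs[key]

    def insert(self, key, doc):
        # Insert fails if the key is already taken.
        if key in self._docs:
            raise KeyError(f"{key} already exists")
        self._docs[key] = doc

    def replace(self, key, doc):
        # Replace fails if there is nothing to replace.
        if key not in self._docs:
            raise KeyError(f"{key} does not exist")
        self._docs[key] = doc

    def delete(self, key):
        del self._docs[key]


db = KeyValueStore()
db.insert("CBL2016", {"Name": "Bob Jones", "DOB": "1980-01-29"})
db.replace("CBL2016", {"Name": "Bob Jones", "DOB": "1980-01-29", "Purchases": []})
customer = db.get("CBL2016")
```

Every operation starts from a key, which is why a predictable key naming strategy matters so much later on.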
If you are using a cloud based DBaaS, this is basically what's going on behind the scenes
Elastic scaling
Size your cluster for today
Scale out on demand
Cost effective scaling
Commodity hardware
On premise or on cloud
Scale OUT instead of Scale UP
[example: changing the channel to a soccer game or Game of Thrones, everyone makes the same API request in the same 5 minutes]
[example: TV show lets watchers vote during some period of the week, so you can scale up during that period of time]
[example: black Friday]
Schema flexibility
Easier management of change in the business requirements
Easier management of change in the structure of the data
Sometimes you're pulling together data, integrating from different sources (e.g. ELT) and that flexibility helps
Document database means that you have no rigid schema. You can do whatever the heck you want.
That being said, you SHOULDN’T. You should still have discipline about your data.
If one machine goes down, customers can still use the other.
Or if you need to perform maintenance, upgrade, etc, you don't have to take the whole system down
This is related to scaling
Built-in replication and fail-over
No application downtime when hardware fails
Online maintenance & upgrade
No application downtime
NoSQL systems are optimized for specific access patterns
Low response time for web & mobile user experience
Millisecond latency
Consistently high throughput to handle growth
[perf measures can be subjective – talk about architecture, integrated cache, maybe mention MDS too]
NoSQL is very versatile and can be used for a wide variety of use cases
Including, but NOT limited to these
If you're exploring NoSQL, make sure you have the right project or right use case.
Using a NoSQL database does NOT mean you have to abandon relational databases. Most large websites use a combination.
And it's also worth pointing out that plenty of companies are doing (most) of these use cases with relational as well.
Relational usually is at least mediocre, NoSQL may be BETTER
But usually the catalyst is one of the earlier reasons: performance, flexibility, scale.
Let’s talk about data modeling a bit, because storing data in JSON is different than storing it in tables.
Let’s look at modeling Customer data. This is an example of what a customer might look like.
You might do this as part of a proof of concept, discovery, requirements gathering, planning, etc
There is a rich structure: attributes, potentially sub-attributes (first name and last name)
Relationships: to other data (other customers, to products perhaps)
Value evolution: Maybe we’d start with one purchase, add more as Helen makes more purchases
Structure evolution: Maybe we start with billing information being properties of Helen, then evolve later to be multiple billing options
Let’s see how to represent customer data in JSON.
The primary key (CustomerID) becomes the DocumentKey
Each column name/column value pair becomes a KEY-VALUE pair.
We aren’t in normal form anymore
Rich Structure & Relationships
Billing information is stored as a sub-document
There could be more than a single credit card. So, use an array.
Value evolution
Simply add additional array element or update a value.
Structure evolution
Simply add new key-value pairs
No downtime to add new KV pairs
Applications can validate data
Structure evolution over time.
Relations via Reference
So, finally, you have a JSON document that represents a CUSTOMER.
In a single JSON document, relationships between the data are implicit in the use of sub-structures, arrays, and arrays of sub-structures.
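Value evolution and structure evolution, as described in the notes above, can be sketched with an ordinary Python dict using the customer data from the earlier slide. The starting shape (a single `Billing` object) and the `validate` helper are assumptions made for illustration; how an application validates its documents is up to the application.

```python
customer = {
    "Name": "Bob Jones",
    "DOB": "1980-01-29",
    "Billing": {"type": "visa", "cardnum": "5927-2842-2847-3909", "expiry": "2020-03"},
    "Purchases": [{"id": 12, "item": "mac", "amt": 2823.52}],
}

# Value evolution: simply append another array element.
customer["Purchases"].append({"id": 19, "item": "ipad2", "amt": 623.52})

# Structure evolution: a single billing object becomes an array of options.
if isinstance(customer["Billing"], dict):
    customer["Billing"] = [customer["Billing"]]
customer["Billing"].append(
    {"type": "master", "cardnum": "6273-2842-2847-3909", "expiry": "2019-11"}
)


def validate(doc):
    """Application-side check, since the database enforces no schema."""
    errors = []
    for field in ("Name", "DOB"):
        if field not in doc:
            errors.append(f"missing {field}")
    if not isinstance(doc.get("Billing", []), list):
        errors.append("Billing must be an array")
    return errors


problems = validate(customer)
```

No table had to be altered for either change, but older documents may still carry the old shape, which is why the validation (and the `isinstance` check) lives in application code.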
Hackolade supports Couchbase, Mongo, Elastic, Cassandra, Dynamo, Firebase/Firestore
Erwin supports MongoDB and Couchbase
Idera supports just MongoDB
The way you plan to get data out can also affect the way you model your data
What types of relationships are being modeled?
How are the relationships accessed?
I've mostly been talking about key/value access
Most NoSQL databases will have at least one other way to access data besides key/value.
What does this have to do with modeling? Because modeling doesn't exist in a vacuum. You have to think about how you are going to interact with your data.
I'm going to show you some examples from Couchbase.
In Couchbase, we have N1QL, which is ANSI SQL for JSON
I'm also going to briefly cover FTS today. There are other options I'm not going to cover today, including Analytics and Views/MapReduce.
Just to reiterate on key/value
If you know the key already, it's really simple and extremely fast to access that piece of data.
Since key/value is so fast and easy, it would benefit us to use it as much as possible.
Here are some tips to maximize your key/value usage.
Starting from "matt", you can walk through this chain of documents with ONLY key/value access.
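A minimal sketch of that kind of chain, assuming a made-up key naming scheme: the `user::` / `cart::` / `item::` prefixes, the `user_key` helper, and every document here are hypothetical, since the slide's actual documents are not reproduced in these notes.

```python
# A dict stands in for the document database: key -> JSON-like document.
store = {
    "user::matt": {"name": "Matt", "cart": "cart::matt"},
    "cart::matt": {"items": ["item::1001"]},
    "item::1001": {"title": "keyboard", "price": 49.99},
}


def user_key(username):
    """Derive the document key from something the application already knows."""
    return f"user::{username}"


# Each step is a plain key lookup; no query or index is involved.
user = store[user_key("matt")]
cart = store[user["cart"]]
items = [store[key] for key in cart["items"]]
```

The point is that a predictable key, plus keys stored inside documents, lets the application reach related data without ever leaving key/value access.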
Another thing to consider is whether or not your NoSQL database has a "subdocument" API.
If it does, then you have even more flexibility.
Not all of them do, and some databases may call this something different.
But the idea is this: if you only need "address", then without a subdocument API you'd have to pull the entire document over the wire when doing reads/writes.
With a subdocument API, you can specify just a specific part to read/write.
This can be very helpful if you have large documents, or if you are doing a lot of reads and writes that only need a small portion of the data.
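The idea of addressing part of a document by path can be sketched as follows. The `lookup_in` and `mutate_in` helpers here are hypothetical stand-ins written for this example, and the address data is made up; in a real database the path is evaluated on the server, so only the requested fragment crosses the wire.

```python
def lookup_in(doc, path):
    """Return the value at a dotted path such as 'address.city'."""
    value = doc
    for part in path.split("."):
        value = value[part]
    return value


def mutate_in(doc, path, new_value):
    """Set the value at a dotted path, leaving the rest of the document alone."""
    *parents, last = path.split(".")
    target = doc
    for part in parents:
        target = target[part]
    target[last] = new_value


customer = {
    "name": "Jane Smith",
    "address": {"city": "Indianapolis", "state": "IN"},
    "purchases": [{"item": "laptop", "amount": 1499.99}],
}

city = lookup_in(customer, "address.city")       # read one field
mutate_in(customer, "address.city", "Columbus")  # write one field
```

For modeling, this weakens the usual argument against large nested documents: if reads and writes can target a small part, nesting costs less.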
Firestore: there is a different concept of "subcollection". You have collections that contain documents, and then documents themselves can contain collections. Those documents have their own keys. If you DELETE a parent, then this subcollection will stick around.
But when it comes to modeling considerations, with Firestore you can access these subcollections and look up individual keys in them. This is where we diverge a little bit from JSON modeling. But the idea is that you don't have to get the ENTIRE document, you can target individual pieces.
In firestore, you have a collection of documents (rooms, in this case).
An individual document (roomA) can itself have a collection (messages)
And so on.
You can do this in plain JSON in just about every other document database
You can nest as deep as you want in JSON.
Much like the subdocument API I mentioned before, you can do a similar thing here. If you want to make a change to one message, you can address it directly with "rooms/roomA/messages/message1"
EXCEPT that Firestore treats the documents in the subcollection kinda like separate documents.
Gotcha: "Deleting a document does not delete its subcollections"
So you should be careful when using this, you could end up with something kinda like orphan documents
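The orphan gotcha can be modeled with a dict keyed by document path. This is a simplified model of the behavior described above, not the Firestore client library; the `delete_document` and `delete_recursive` helpers and the message contents are hypothetical.

```python
# Each document lives at its own path; subcollection documents are separate entries.
docs = {
    "rooms/roomA": {"name": "my chat room"},
    "rooms/roomA/messages/message1": {"from": "alex", "msg": "Hello World!"},
    "rooms/roomA/messages/message2": {"from": "sam", "msg": "Hi"},
}


def delete_document(docs, path):
    """Delete only the document at this exact path (subcollections stay)."""
    docs.pop(path, None)


def delete_recursive(docs, path):
    """Delete the document and everything stored beneath its path."""
    for key in [k for k in docs if k == path or k.startswith(path + "/")]:
        del docs[key]


delete_document(docs, "rooms/roomA")
# The messages are still there, now without a parent document.
orphans = sorted(k for k in docs if k.startswith("rooms/roomA/"))

# Cleaning up means deleting the children explicitly.
delete_recursive(docs, "rooms/roomA")
```

With nested JSON in a single document this problem does not arise, because deleting the document deletes everything inside it.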
N1QL is powerful in its flexibility: it's declarative, familiar to developers, supports JOINs, etc.
But note, once we step out of key/value access, we need to involve other processes:
We gotta parse the query, most likely use an index service, and in the end we'll get a bunch of keys to look up the data.
There is overhead involved, but sometimes this is a necessity.
When you're outside of key/value access, you must understand the query plan.
This is true for ANY database, relational or NoSQL.
As an example, here's a Couchbase SQL query. I executed this and it ran in 1.2 seconds. It's using AN index on the TYPE field, but notice the name field in there too.
I can bring up a visualization of the query plan to see which parts are taking up the most time.
In Couchbase, there is an index advisor. It suggested an index for me. After creating the index, the same query went to 146ms (about 8 times faster)
A covering index could make this even faster (note the *)
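Why the second index helps can be sketched with dicts standing in for indexes. This is a deliberate simplification (real indexes are sorted structures maintained by the index service), and the documents and the `build_index` helper are hypothetical.

```python
docs = {
    "hotel::1": {"type": "hotel", "name": "Seaside Inn", "city": "Miami"},
    "hotel::2": {"type": "hotel", "name": "Airport Lodge", "city": "Indianapolis"},
    "airline::1": {"type": "airline", "name": "Sky Air"},
}


def build_index(docs, fields):
    """Map a tuple of field values to the keys of the matching documents."""
    index = {}
    for key, doc in docs.items():
        entry = tuple(doc.get(f) for f in fields)
        index.setdefault(entry, []).append(key)
    return index


# Index on type only: narrows to hotels, but every hotel must be fetched
# so that the name can be checked.
by_type = build_index(docs, ["type"])
fetched = [docs[key] for key in by_type[("hotel",)]]
matches = [d["name"] for d in fetched if d["name"] == "Seaside Inn"]

# Index on (type, name): the index lookup alone identifies the match.
by_type_name = build_index(docs, ["type", "name"])
matching_keys = by_type_name.get(("hotel", "Seaside Inn"), [])
```

A covering index goes one step further: if the index holds every field the query returns, the documents never need to be fetched at all, which is why selecting `*` prevents it.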
This is a search that revolves around text. Things like stemming, language awareness, facets, ranking, etc.
This is, again, a very simple example. I'm searching for the keyword "submarine".
In my application, this query may be limited to a certain search radius, or it may be limited to a certain facet, etc.
But the end results are language aware, ranked matches.
You would use this INSTEAD OF a SQL LIKE, for instance
Build a proof of concept, which will help you see if NoSQL is the right fit
And it will also help you understand the access patterns better
Just to sum up
Also note that in Couchbase, you can combine FTS and N1QL
Migrating or syncing
Because often Couchbase and relational are complementary
Couchbase can be used for engagement, Relational for transactions
Are you going to take the time to clean up the data? Do you need to?
Do you need to enrich or restructure the data to take advantage of JSON?
Also I'm using the term MIGRATION, but it might NOT be the case that you are abandoning a database in favor of another. You might want to sync data, you might want to make a copy of data into a more suitable database for your use case, etc.
Again, proof of concept
KISS: export to CSV and use N1QL to do any ETL that’s required
Export to CSV
Import as documents into a 'staging' bucket
Use N1QL to transform
Insert into new bucket
Align with your data model
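The steps above can be sketched end to end in a few lines. Note the swap: the slide does the transform step with N1QL, while this sketch uses plain Python as a stand-in so it runs anywhere. The CSV text reuses the customer rows from the earlier slide; `load_rows`, `transform`, and the dict used as the new bucket are hypothetical.

```python
import csv
import io

# Stand-ins for CSV exports, one per relational table.
customers_csv = "CustomerID,Name,DOB\nCBL2016,Bob Jones,1980-01-29\n"
purchases_csv = "CustomerID,item,amt\nCBL2016,mac,2823.52\nCBL2016,ipad2,623.52\n"


def load_rows(text):
    """Export/import: parse a CSV export into flat 'staging' documents."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(customers, purchases):
    """Transform: nest each customer's purchases to match the target model."""
    documents = {}
    for row in customers:
        documents[row["CustomerID"]] = {
            "Name": row["Name"],
            "DOB": row["DOB"],
            "Purchases": [],
        }
    for row in purchases:
        documents[row["CustomerID"]]["Purchases"].append(
            {"item": row["item"], "amt": float(row["amt"])}
        )
    return documents


# Insert into the new bucket (a dict here), keyed by the old primary key.
new_bucket = transform(load_rows(customers_csv), load_rows(purchases_csv))
```

The staging documents are still relational in shape (one flat document per row); only the transform step produces documents that take advantage of JSON.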
The modeling step is vital. If you don't model your data to take advantage of JSON, then you are not going to see the advantages of using JSON. Don't treat a JSON database as if it were a relational database.
Basically keep it as simple as you can and plan for failure. Developers often think of the migration process as “One and Done”, but the reality is that data migration is often an ongoing headache that DevOps needs to monitor and manage in a production environment. Make everyone’s life easier by thinking about the long game as much as possible.
Plan for failure
Bad source data
Hardware failure
Resource limitations (proof of concept vs MVP)
Ensure: Interruptible, restartable, logged -> predictable
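One way to get "interruptible, restartable, logged" is to work in batches and record a checkpoint after each one. This is a minimal sketch under assumptions: the `migrate` function, the in-memory `checkpoint` dict (which would be persisted somewhere durable in practice), and the sample rows are all hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("migration")


def migrate(rows, target, checkpoint, batch_size=2):
    """Copy rows into target in batches, recording progress after each batch.

    checkpoint holds the index of the next row to process, so a rerun after
    an interruption resumes instead of starting over.
    """
    start = checkpoint.get("next", 0)
    for i in range(start, len(rows), batch_size):
        for key, doc in rows[i:i + batch_size]:
            target[key] = doc  # upsert, so repeating a batch is harmless
        checkpoint["next"] = min(i + batch_size, len(rows))
        log.info("migrated rows %d-%d", i, checkpoint["next"] - 1)


rows = [(f"doc::{n}", {"n": n}) for n in range(5)]
target, checkpoint = {}, {}

migrate(rows[:3], target, checkpoint)  # simulate a run that stops after 3 rows
migrate(rows, target, checkpoint)      # the rerun resumes at the checkpoint
```

Writing with upsert semantics keeps the process predictable: replaying a batch after a failure leaves the target in the same state.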
From NoSQL to relational
You can also turn this around and use Kafka in the other direction
From relational to NoSQL:
GoldenGate is from Oracle
CData for SSIS and Couchbase
https://github.com/mahurtado/CouchbaseGoldenGateAdapter
https://www.cdata.com/drivers/couchbase
Make it part of your application directly
May or may not be reusable
This is a lot of work, so make sure you have a good reason
Focus on SOA, microservices, application/use case specific
Modeling, Focus, Success Criteria, Review Architecture
consider using a tool like Hackolade to define models rigorously and collaboratively
This is my family
My enormous head barely fits in the picture
Couchbase Cloud is currently in limited beta
Memory first: integrated cache, you don't need to put Redis on top of Couchbase
Master-master: easier scaling, better scaling
Auto-sharding: we call them vBuckets, you don't have to come up with a sharding scheme, it's done by CRC32
N1QL: SQL, Mongo has a more limited query language and it's not SQL-like
Full Text Search: Using the bleve search engine, language aware FTS capabilities built in
Mobile & sync: Mongo has nothing like the offline-first and sync capabilities Couchbase offers
Mongo DOES have a DBaaS cloud provider
Everything I've shown you today is available in Community edition
The only N1QL features I can think of that are not in Community are INFER and the Query Plan Visualizer
You probably don't need the Enterprise features unless you are an enterprise developer.