JSON Data Modeling - July 2018 - Tulsa Techfest
- 3. Where am I?
• Tulsa Tech Fest
• https://grouplings.com/TulsaTechFest
• https://twitter.com/TulsaTechFest
- 4. Who am I?
• Matthew D. Groves
• Developer Advocate for Couchbase
• @mgroves on Twitter
• Podcast and blog: https://crosscuttingconcerns.com
• "I am not an expert, but I am an enthusiast." – Alan Stevens
by @natelovett
- 6. Major Enterprises Across Industries are Adopting NoSQL
• Communications
• Technology
• Travel & Hospitality
• Media & Entertainment
• E-Commerce & Digital Advertising
• Retail & Apparel
• Games & Gaming
• Finance & Business Services
- 8. NoSQL Landscape
Document
• Couchbase
• MongoDB
• DynamoDB
• CosmosDB
Graph
• OrientDB
• Neo4J
• DEX
• GraphBase
Key-Value
• Couchbase
• Riak
• BerkeleyDB
• Redis
Wide Column
• HBase
• Cassandra
• Hypertable
- 9. NoSQL Landscape
• Get by key(s)
• Set by key(s)
• Replace by key(s)
• Delete by key(s)
• Map/Reduce
Document
• Couchbase
• MongoDB
• DynamoDB
• CosmosDB
- 15. Models for Representing Data
Data Concern Relational Model JSON Document Model
Rich Structure
Relationships
Value Evolution
Structure Evolution
- 17. Modeling Data in a Relational World
Billing
Connections
Purchases
Contacts
Customer
- 18. Table: Customer
CustomerID Name DOB
CBL2015 Jane Smith 1990-01-30
Customer DocumentKey: CBL2015
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30"
}
- 19. ©2017 Couchbase Inc. 19
Table: Customer
CustomerID Name DOB
CBL2015 Jane Smith 1990-01-30
Table: Purchases
CustomerID Item Amount Date
CBL2015 laptop 1499.99 2019-03
Customer DocumentKey: CBL2015
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Purchases" : [
{
"item" : "laptop",
"amount" : 1499.99,
"date" : "2019-03"
}
]
}
- 20. Table: Customer
CustomerID Name DOB
CBL2015 Jane Smith 1990-01-30
Table: Purchases
CustomerID Item Amount Date
CBL2015 laptop 1499.99 2019-03
CBL2015 phone 99.99 2018-12
Customer DocumentKey: CBL2015
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Purchases" : [
{
"item" : "laptop",
"amount" : 1499.99,
"date" : "2019-03"
},
{
"item" : "phone",
"amount" : 99.99,
"date" : "2018-12"
}
]
}
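As a rough sketch of the "assembly" step these slides illustrate, the Python below folds rows from the Customer and Purchases tables into one nested document per customer. The row values mirror the slide data; the function name `build_customer_documents` is hypothetical, and no database client is involved.

```python
import json

# Rows as they appear in the Customer and Purchases tables on the slide.
customer_rows = [("CBL2015", "Jane Smith", "1990-01-30")]
purchase_rows = [
    ("CBL2015", "laptop", 1499.99, "2019-03"),
    ("CBL2015", "phone", 99.99, "2018-12"),
]


def build_customer_documents(customers, purchases):
    """Return {document_key: document}, nesting each customer's purchases."""
    documents = {}
    for customer_id, name, dob in customers:
        # The primary key becomes the document key, not a field in the body.
        documents[customer_id] = {"Name": name, "DOB": dob, "Purchases": []}
    for customer_id, item, amount, date in purchases:
        # Each child row becomes one element of the parent's array.
        documents[customer_id]["Purchases"].append(
            {"item": item, "amount": amount, "date": date}
        )
    return documents


docs = build_customer_documents(customer_rows, purchase_rows)
print(json.dumps(docs["CBL2015"], indent=2))
```

Going the other way (document back to rows) is the "disassembly" a relational model requires on every write.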
- 21. Table: Connections
CustomerID ConnId Relation
CBL2015 XYZ987 Brother
CBL2015 SKR007 Father
Customer DocumentKey: CBL201
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Billing" : [
{
"type" : "visa",
"cardnum" : "5827-2842-...",
"expiry" : "2019-03"
}, ...
],
"Connections" : [
{
"ConnId" : "XYZ987",
"Relation" : "Brother"
},
{
"ConnId" : "SKR007",
"Relation" : "Father"
}
]
}
- 22. DocumentKey: CBL2015
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"cardnum" : "5827-2842…",
"expiry" : "2019-03",
"cardType" : "visa",
"Connections" : [
{
"CustId" : "XYZ987",
"Relation" : "Brother"
},
{
"CustId" : "SKR007",
"Relation" : "Father"
}
],
"Purchases" : [
{ "id": 12, "item": "mac", "amt": 2823.52 },
{ "id": 19, "item": "ipad2", "amt": 623.52 }
]
}
Table: Customer
CustomerID Name DOB Cardnum Expiry CardType
CBL2015 Jane Smith 1990-01-30 5827-2842… 2019-03 visa
Table: Connections
CustomerID ConnId Relation
CBL2015 XYZ987 Brother
CBL2015 SKR007 Father
Table: Purchases
CustomerID item amt
CBL2015 mac 2823.52
CBL2015 ipad2 623.52
Table: Contacts
CustomerID ConnId Name
CBL2015 XYZ987 Joe Smith
CBL2015 SKR007 Sam Smith
- 23. DocumentKey: CBL2016
{
"Name" : "Bob Jones",
"DOB" : "1980-01-29",
"Billing" : [
{
"type" : "visa",
"cardnum" : "5927-2842-2847-3909",
"expiry" : "2020-03"
},
{
"type" : "master",
"cardnum" : "6273-2842-2847-3909",
"expiry" : "2019-11"
}
],
"Connections" : [
{
"CustId" : "XYZ987",
"Relation" : "Brother"
},
{
"CustId" : "PQR823",
"Relation" : "Father"
}
],
"Purchases" : [
{ "id": 12, "item": "mac", "amt": 2823.52 },
{ "id": 19, "item": "ipad2", "amt": 623.52 }
]
}
Table: Customer
CustomerID Name DOB
CBL2016 Bob Jones 1980-01-29
Table: Billing
CustomerID Type Cardnum Expiry
CBL2016 visa 5927… 2020-03
CBL2016 master 6273… 2019-11
Table: Connections
CustomerID ConnId Relation
CBL2016 XYZ987 Brother
CBL2016 SKR007 Father
Table: Purchases
CustomerID item amt
CBL2016 mac 2823.52
CBL2016 ipad2 623.52
Table: Contacts
CustomerID ConnId Name
CBL2016 XYZ987 Joe Smith
CBL2016 SKR007 Sam Smith
- 24. Models for Representing Data
Data Concern: Relational Model vs. JSON Document Model
Rich Structure
• Relational model: multiple flat tables; assembly / disassembly
• JSON document model: documents; no (or less) assembly required
Relationships
• Relational model: represented; queried with SQL
• JSON document model: represented; queried…with?
Value Evolution
• Relational model: data can be updated
• JSON document model: data can be updated
Structure Evolution
• Relational model: uniform, rigid, enforced; manual disruptive change
• JSON document model: flexible; dynamic change; increased app responsibility
- 25. Relationship is one-to-one or one-to-many
Store related data as nested objects
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Purchases" : [
{
"item" : "laptop",
"amount" : 1499.99,
"date" : "2019-03"
},
{
"item" : "phone",
"amount" : 99.99,
"date" : "2018-12"
}
]
}
Modeling your data: Strategies / rules of thumb
- 26. Relationship is many-to-one or many-to-many
Store related data as separate documents
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Connections" : [
"XYZ987",
"PQR823",
"PQR828"
]
}
Modeling your data: Strategies / rules of thumb
- 29. Data reads are mostly parent fields
Store children as separate documents
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Connections" : [
"XYZ987",
"PQR823",
"PQR828"
]
}
Modeling your data: Strategies / rules of thumb
- 30. Data reads are mostly parent + child fields
Store children as nested objects
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Purchases" : [
{
"item" : "laptop",
"amount" : 1499.99,
"date" : "2019-03"
},
{
"item" : "phone",
"amount" : 99.99,
"date" : "2018-12"
}
]
}
Modeling your data: Strategies / rules of thumb
- 31. Data writes are mostly parent or child (not both)
Store children as separate documents
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Connections" : [
"XYZ987",
"PQR823",
"PQR828"
]
}
Modeling your data: Strategies / rules of thumb
- 32. Data writes are mostly parent and child (both)
Store children as nested objects
{
"Name" : "Jane Smith",
"DOB" : "1990-01-30",
"Purchases" : [
{
"item" : "laptop",
"amount" : 1499.99,
"date" : "2019-03"
},
{
"item" : "phone",
"amount" : 99.99,
"date" : "2018-12"
}
]
}
Modeling your data: Strategies / rules of thumb
- 33. If … Then …
• Relationship is one-to-one or one-to-many: store related data as nested objects
• Relationship is many-to-one or many-to-many: store related data as separate documents
• Data reads are mostly parent fields: store children as separate documents
• Data reads are mostly parent + child fields: store children as nested objects
• Data writes are mostly parent or child (not both): store children as separate documents
• Data writes are mostly parent and child (both): store children as nested objects
Modeling your data: Strategies / rules of thumb
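One way to see why read patterns drive these rules is to count lookups. The sketch below is only an illustration: a hypothetical `CountingStore` class wraps a plain dict in place of a real document database, with purchases nested in the customer document and connections stored as references to separate documents. The contact names come from the Contacts table earlier in the deck.

```python
class CountingStore:
    """Minimal stand-in for a key-value store that counts get() calls."""

    def __init__(self, documents):
        self._documents = documents
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        return self._documents[key]


documents = {
    "CBL2015": {
        "Name": "Jane Smith",
        # Nested design: children travel with the parent.
        "Purchases": [
            {"item": "laptop", "amount": 1499.99},
            {"item": "phone", "amount": 99.99},
        ],
        # Referenced design: children are keys of separate documents.
        "Connections": ["XYZ987", "SKR007"],
    },
    "XYZ987": {"Name": "Joe Smith"},
    "SKR007": {"Name": "Sam Smith"},
}


def total_spent(store, customer_key):
    """Parent + child read: a single lookup, because purchases are nested."""
    customer = store.get(customer_key)
    return sum(p["amount"] for p in customer["Purchases"])


def connection_names(store, customer_key):
    """Referenced read: one lookup for the parent plus one per connection."""
    customer = store.get(customer_key)
    return [store.get(key)["Name"] for key in customer["Connections"]]
```

If most reads need only "Name", the referenced design keeps the parent document small; if most reads need the purchases too, nesting avoids the extra round trips.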
- 35. Accessing your data (Couchbase)
• Key-Value (CRUD): Documents
• N1QL (Query): Indexes
• Views (Query): MapReduce
• Full Text (Search): Indexes
• Geospatial (Search): MapReduce
- 36. Key/Value
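public ShoppingCart GetCartById(Guid id)
{
    // Key/value read: fetch the document whose key is the cart's id
    return _bucket.Get<ShoppingCart>(id.ToString()).Value;
}

public void CreateShoppingCart()
{
    // Key/value write: insert a new document under a freshly generated key
    _bucket.Insert(new Document<dynamic>
    {
        Id = Guid.NewGuid().ToString(),
        Content = new { . . . }
    });
}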
- 38. Key/Value: Example keys
• author::matt
• author::matt::blogs
• blog::csharp_7_features
• blog::csharp_7_features::comments
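A minimal sketch of how predictable keys like these let you walk from an author to their blog posts and comments with key lookups alone. The dict stands in for a key-value store, and the document contents (name, title, comment text) are invented for illustration; only the key pattern comes from the slide.

```python
# Keys follow the slide's naming scheme; the values are made-up examples.
store = {
    "author::matt": {"name": "Matt"},
    "author::matt::blogs": ["csharp_7_features"],
    "blog::csharp_7_features": {"title": "C# 7 features"},
    "blog::csharp_7_features::comments": ["Nice post!"],
}


def blogs_with_comments(author_id):
    """Follow author -> blog list -> blog -> comments using only key lookups."""
    result = []
    for blog_id in store[f"author::{author_id}::blogs"]:
        blog = store[f"blog::{blog_id}"]
        comments = store[f"blog::{blog_id}::comments"]
        result.append((blog["title"], comments))
    return result


print(blogs_with_comments("matt"))
```

Because each key can be built from a value already in hand, no query or index is needed for this access path.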
- 42. Concept Strategies & Recommendations
Key-Value operations provide the best possible performance
• Create an effective key naming strategy
• Create an optimized data model
Incremental MapReduce (Views) are well suited to aggregation
• Ideal for large data sets
• Data set can be used to create complex view indexes
N1QL queries provide the most flexibility – everything else
• Query data regardless of how it is modeled
• Good indexing is vital
Accessing your data: Strategies and recommendations
- 47. • Batch vs. Incremental
• Single threaded vs. multi-threaded
Migration options: Pick your strategy
- 48. Data migration tools:
Informatica, Looker, Talend, DART, ODBC, CData
BYO-tool
• C# / bash / PowerShell / curl / REST etc
• GoldenGate / DTS / SSIS
• Hadoop, Spark, Kafka, NiFi
• CLI: cbimport, mongoimport, etc
Migration options: Pick your tools
- 49. Migration options: KISS
• CSV:
• Export to CSV
• Import as documents into a 'staging' bucket
• Use N1QL to transform
• Insert into new bucket
• SQL:
• Transform
• Export
• Insert into document database
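The CSV route can be sketched roughly as follows. This is an illustration only: an in-memory string stands in for the exported file, plain lists and dicts stand in for the 'staging' and new buckets, and a Python function takes the place of the N1QL transform step. The column names are assumed to match the Purchases table from the earlier slides.

```python
import csv
import io

# Stand-in for a file exported from the relational Purchases table.
exported_csv = io.StringIO(
    "CustomerID,Item,Amount,Date\n"
    "CBL2015,laptop,1499.99,2019-03\n"
    "CBL2015,phone,99.99,2018-12\n"
)


def import_to_staging(csv_file):
    """Steps 1-2: one flat document per CSV row; every value is still a string."""
    return [dict(row) for row in csv.DictReader(csv_file)]


def transform(staging_docs):
    """Steps 3-4: group rows per customer, convert types, emit target documents."""
    target = {}
    for row in staging_docs:
        doc = target.setdefault(row["CustomerID"], {"Purchases": []})
        doc["Purchases"].append(
            {"item": row["Item"], "amount": float(row["Amount"]), "date": row["Date"]}
        )
    return target


staging = import_to_staging(exported_csv)
new_bucket = transform(staging)
```

Keeping the raw import and the transform as separate steps means the transform can be rerun against staging without exporting again.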
- 50. Migration options: Recommendations
• Align with your data model
• Plan for failure
• Bad source data
• Hardware failure
• Resource limitations
• Ensure: Interruptible, restartable, logged, predictable
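A small sketch of what "interruptible, restartable, logged" can look like in practice. Everything here is hypothetical: `migrate`, the in-memory `checkpoint` dict (which a real job would persist between runs), and the `fail_on` parameter used to simulate a crash.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("migration")


def migrate(rows, target, checkpoint, fail_on=None):
    """Copy rows into target in order, resuming after checkpoint['last_done'].

    rows is a list of (key, document) pairs; target is a dict standing in
    for the new database; fail_on names a key at which to simulate a crash.
    """
    start = checkpoint.get("last_done", -1) + 1
    for index in range(start, len(rows)):
        key, document = rows[index]
        if key == fail_on:
            raise RuntimeError(f"simulated failure at {key}")
        if document is None:
            # Bad source data: log it and move on instead of aborting the run.
            log.warning("skipping %s: no data", key)
        else:
            target[key] = document
        # Record progress only after the row has been handled.
        checkpoint["last_done"] = index
        log.info("processed %s", key)
```

After a failure, calling `migrate` again with the same checkpoint picks up at the first unprocessed row rather than starting over.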
- 51. Sync NoSQL and relational? Automatic Replication
Couchbase → DCP Stream → Producer → Kafka Queue → Consumer → RDBMS
- 52. How can you sync NoSQL and relational?
RDBMS
Handler
Couchbase
GoldenGate
https://github.com/mahurtado/CouchbaseGoldenGateAdapter
- 53. Data Flow with NiFi
https://blog.couchbase.com/nifi-processing-flow-couchbase-server/
- 61. Couchbase Plug
• Go to Couchbase.com to download Couchbase
• Enter to win a $100 gift card here:
https://bit.ly/FEST2018 (use code FEST2018)
- 62. Where do you find us?
• blog.couchbase.com
• @mgroves
• @couchbasedev
- 63. Frequently Asked Questions
1. How is Couchbase different than Mongo?
2. Is Couchbase the same thing as CouchDB?
3. How tall are you? Do you play basketball?
4. What is the Couchbase licensing situation?
5. Is Couchbase a Managed Cloud Service (DBaaS)?
- 65. MongoDB vs Couchbase
• Architecture
• Memory first architecture
• Master-master architecture
• Auto-sharding
• Features
• SQL (N1QL)
• Full Text Search
• Mobile & Sync
- 66. Licensing
Couchbase Server Community
• Open source (Apache 2)
• Binary release is one release behind Enterprise (except major versions)
• Free to use in dev/test/qa/prod
• Forum support only
Couchbase Server Enterprise
• Mostly open source (Apache 2)
• Some features not available on Community (XDCR TLS, MDS, Rack Zone, etc)
• Free to use in dev/test/qa
• Need commercial license for prod
• Paid support provided
Editor's Notes
- Spend just a little time on why people are using NoSQL
Talk about how data is modeled differently in JSON
Let’s talk about why SQL is good and why SQL for JSON is needed
Let’s talk about the exciting stuff happening in the database ecosystem
Including but not limited to the stuff Couchbase is doing
If we have time, we’ll look at how a .NET developer (or Java developer, etc) would interact with SQL for JSON
- What’s also interesting is that we’re seeing the use of NoSQL expand inside many of these companies. Orbitz, the online travel company, is a great example – they started using Couchbase to store their hotel rate data, and now they use Couchbase in many other ways.
Same with eBay: they recently presented at the Couchbase conference with a chart tracking how many instances of various NoSQL databases are in use. We see growth in Cassandra and Mongo, and Couchbase has actually surpassed them within eBay.
- SQL (relational) databases are great. They give you a LOT OF functionality.
Great set of abstractions (tables, columns, data types, constraints, triggers, SQL, ACID TRANSACTIONS, stored procedures and more) at a highly reasonable cost.
Change is inevitable
One thing RDBMS does not handle well is CHANGE.
Change of schema (both logical and physical), change of hardware, change of capacity.
NoSQL databases ESPECIALLY ONES DESIGNED TO BE DISTRIBUTED tend to help solve problems with: agility, scalability, performance, and availability
- Let’s talk about what NoSQL is, first.
NoSQL generally refers to databases which lack SQL or don’t use a relational model
Once the SQL language and transactions became optional, a flurry of databases was created using distinct approaches for common use cases.
Key-value databases simply provide quick access to data for a given KEY.
Wide Column databases can store a large number of arbitrary columns in each row
Graph databases store data and relationships as first class concepts
Document databases aggregate data into a hierarchical structure.
JSON is a means to that end. Document databases provide a flexible schema, built-in data types, rich structure, and implicit relationships using JSON.
- When we look at document databases, they originally came with a
Minimal set of APIs and features
But as they continue to mature, we’re seeing more features being added
And generally I’m seeing a convergent trend between SQL and NoSQL
But anyway, this set of minimal features, lacking a SQL language and tables gives us the buzzword “nosql”
- Elastic scaling
Size your cluster for today
Scale out on demand
Cost effective scaling
Commodity hardware
On premise or on cloud
Scale OUT instead of Scale UP
[example: changing the channel to a soccer game or Game of Thrones, everyone makes the same API request in the same 5 minutes]
[example: TV show lets watchers vote during some period of the week, so you can scale up during that period of time]
[example: black Friday]
- Schema flexibility
Easier management of change in the business requirements
Easier management of change in the structure of the data
Sometimes you're pulling together data, integrating from different sources (e.g. ELT) and that flexibility helps
Document database means that you have no rigid schema. You can do whatever the heck you want.
That being said, you SHOULDN’T. You should still have discipline about your data.
- If one machine goes down, customers can still use the other.
Or if you need to perform maintenance, upgrade, etc, you don't have to take the whole system down
This is related to scaling
Built-in replication and fail-over
No application downtime when hardware fails
Online maintenance & upgrade
No application downtime
- NoSQL systems are optimized for specific access patterns
Low response time for web & mobile user experience
Millisecond latency
Consistently high throughput to handle growth
[perf measures can be subjective – talk about architecture, integrated cache, maybe mention MDS too]
- Let’s talk about data modeling a bit, because storing data in JSON is different than storing it in tables.
- So I want to compare the approaches over 4 key areas.
I’m going to fill in this table, traditional SQL on the left and JSON on the right
- Let’s look at modeling Customer data. This is an example of what a customer might look like
There is a rich structure: attributes, potentially sub-attributes (first name and last name)
Relationships: to other data (other customers, to products perhaps)
Value evolution: Maybe we’d start with one purchase, add more as Helen makes more purchases
Structure evolution: Maybe we start with billing information being properties of Helen, then evolve later to be multiple billing options
- Let’s see how to represent customer data in JSON.
The primary key (CustomerID) becomes the DocumentKey
Each column name and column value becomes a key-value pair.
- We aren’t in normal form anymore
Rich Structure & Relationships
Billing information is stored as a sub-document
There could be more than a single credit card. So, use an array.
- Value evolution
Simply add an additional array element or update a value.
- Structure evolution
Simply add new key-value pairs
No downtime to add new KV pairs
Applications can validate data
Structure evolution over time.
Relations via Reference
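Since the application takes on responsibility for structure, one common pattern is to upgrade old-shaped documents when they are read. The sketch below assumes the two shapes seen on the slides: flat "cardnum" / "expiry" / "cardType" fields in the older document, and a "Billing" array in the newer one. The function name `upgrade_customer` is hypothetical.

```python
def upgrade_customer(document):
    """Return the document in the 'Billing' array shape.

    Documents that already have a Billing array are returned unchanged.
    Older documents are assumed to carry all three flat card fields.
    """
    if "Billing" in document:
        return document
    flat_fields = ("cardnum", "expiry", "cardType")
    # Copy everything except the flat card fields; the input is not mutated.
    upgraded = {k: v for k, v in document.items() if k not in flat_fields}
    upgraded["Billing"] = [
        {
            "type": document["cardType"],
            "cardnum": document["cardnum"],
            "expiry": document["expiry"],
        }
    ]
    return upgraded


old_shape = {
    "Name": "Jane Smith",
    "DOB": "1990-01-30",
    "cardnum": "5827-2842…",
    "expiry": "2019-03",
    "cardType": "visa",
}
new_shape = upgrade_customer(old_shape)
```

No downtime is needed for this kind of change: old and new shapes can coexist in the database while documents are upgraded gradually.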
- So, finally, you have a JSON document that represents a CUSTOMER.
In a single JSON document, relationship between the data is implicit by use of sub-structures and arrays and arrays of sub-structures.
- So I want to compare the approaches over 4 key areas.
I’m going to fill in this table, traditional SQL on the left and JSON on the right
- Hackolade supports couchbase, mongo, elastic, Cassandra, dynamo
Erwin supports mongodb, couchbase soon?
Idera supports mongodb, couchbase soon?
- The way you plan to get data out can also affect the way you model your data
- What types of relationships are being modeled?
How are the relationships accessed?
- Another thing to consider is whether or not your nosql database
has a "subdocument" API
Not all of them do, and some databases may call this something different
But the idea is this: if you only need "address", without a subdocument API
You'd have to pull the entire document over the wire when doing reads/writes
With a subdocument API, you can specify just a specific part to read/write
This can be very helpful if you have large documents, or if you are doing a lot of reads and writes that only need a small portion of the data
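To make the idea concrete, here is a sketch of path-based access on a nested document. The helpers `get_path` and `set_path` are hypothetical and run on a local dict; they are not any SDK's API. A real subdocument API performs the equivalent operation on the server, so only the addressed fragment crosses the wire. The "address" fields are invented for the example.

```python
def get_path(document, path):
    """Read one nested value addressed by a dotted path, e.g. 'address.city'."""
    value = document
    for part in path.split("."):
        value = value[part]
    return value


def set_path(document, path, new_value):
    """Overwrite one nested value in place, leaving the rest of the document alone."""
    *parents, leaf = path.split(".")
    target = document
    for part in parents:
        target = target[part]
    target[leaf] = new_value


customer = {"Name": "Jane Smith", "address": {"city": "Tulsa", "state": "OK"}}
print(get_path(customer, "address.city"))
```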
- I've mostly been talking about key/value access
But document databases have a wide range of ways to access data other than key/value
Other databases like mongo have their own javascript query API, but I find N1QL to be very compelling because I'm used to writing SQL to interact with data
In Couchbase, we have N1QL, which is ANSI SQL for JSON
Most document databases have MapReduce capabilities, including Couchbase. We're kinda leaning away from M/R these days in favor of N1QL, but it can still be useful in some cases
There are other ways to access data like FTS and Geospatial which I'm not going to cover today
- Notice I’m using Guid
That may not be a good idea
This is C#, and you already saw how to do this with Go
Also supported: Java, Node, Python, PHP, C, and many more
- Starting from "matt" you can walk through this chain of documents
Without having to do a query
You can access them with key/value operations only if you'd like
- N1QL is powerful in its flexibility, its declarative nature, its familiarity to developers, JOINs, etc.
Indexing is very important, as it's not as performant as key/value or map/reduce
(Maybe talk about indexing on a SQL table vs indexing on a whole bucket)
- Couchbase 5.0 has introduced some tools for analyzing query performance
So you can see what indexes are being used, where the biggest costs are in the query
And so on.
There are a lot of different types of indexes for N1QL
- This is kinda like a materialized view
It's powerful in that it can be run in parallel, can use JavaScript to do filtering/mapping, great for aggregation.
It's limited in that it can't do anything like a JOIN, can't get input from other views, can only order lexicographically
- Migrating or syncing
Because often Couchbase and relational are complementary
Couchbase can be used for engagement, Relational for transactions
- Are you going to take the time to clean up the data? Do you need to?
Do you need to enrich or restructure the data to take advantage of JSON?
- Duration v resources: how long is it going to take? What tools and resources are available to you?
What’s your biggest constraint – time or resources? Do you need to get the migration done in 1 hr (and have it use as many parallel resources as needed) or do you need to minimize/manage the resource impact on the existing system and it doesn’t matter how long it takes?
- You must obey the claw
Data governance: what are the rules for moving data, auditing, etc?
Do you need to keep track of where the data came from and who is allowed to access it? Many newer systems need to track where sensitive data originated.
- A whole bunch at a time, or one at a time
Single threaded – easier
Multi-threaded – faster, complicated
Is the migration a one-time event, or does it need to happen incrementally (every day, or over a 2-3 month period where the old system and new system operate in parallel)? Do you plan to do the data migration as a single thread (read all the data, write all of the data), or with a multi-threaded or multi-process approach where each thread or process reads some percentage of the data?
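A rough sketch of the multi-threaded option, where each worker reads its own share of the keys. Dicts stand in for the source and target systems, and the function names are hypothetical. With real databases the per-slice work would be network I/O, which is where threads actually help.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(keys, worker_count):
    """Split keys into worker_count interleaved slices."""
    return [keys[i::worker_count] for i in range(worker_count)]


def migrate_slice(source, keys):
    """Read every key in one slice and return the documents to write."""
    return {key: source[key] for key in keys}


def migrate_parallel(source, worker_count=4):
    """Migrate all documents, giving each worker thread one slice of the keys."""
    slices = partition(sorted(source), worker_count)
    target = {}
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        for partial in pool.map(lambda keys: migrate_slice(source, keys), slices):
            # Merge in the calling thread so the target has a single writer.
            target.update(partial)
    return target
```

The single-threaded version is just `migrate_slice(source, sorted(source))`: simpler to reason about and restart, but slower on large data sets.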
- If you're writing your own, Entity Framework can be helpful, because it can do the mapping of aggregate root C# objects for you, which you can then write to a document database
So if you already have EF mappings created, you're part way there.
- KISS: Either export to CSV and use N1QL to do any ETL that’s required (assuming that it’s Simple) or use SQL to do simple ETL on export and then just import into CB.
- Basically keep it as simple as you can and plan for failure. Developers often think of the migration process as “One and Done”, but the reality is that data migration is often an ongoing headache that DevOps needs to monitor and manage in a production environment. Make everyone’s life easier by thinking about the long game as much as possible.
- From NoSQL to relational
- From relational to NoSQL:
GoldenGate is from Oracle
CData for SSIS and Couchbase
https://github.com/mahurtado/CouchbaseGoldenGateAdapter
https://www.cdata.com/drivers/couchbase
- A tool I recently discovered that helps you move, process, transform data all around your enterprise
Called NiFi
Can connect to Couchbase, SQL Server, just about any source or destination
I wrote a blog post showing how to move data from SQL Server to Couchbase using NiFi
I did this as a proof of concept for the Cincinnati Reds
I'm far from a NiFi expert, but I really like what I've seen
It allows you to make a flow that's repeatable, auditable, debuggable, can be easily stopped, started, monitored, notifications, error handling, and more
And it's visual
Great for migration or syncing or both
https://blog.couchbase.com/nifi-processing-flow-couchbase-server/
- Make it part of your application directly
Maybe use Akka
May or may not be reusable
This is a lot of work, so make sure you have a good reason
- Focus on SOA, application/use case specific
- Use Document type, Versionid
Create optimized, understandable keys
Weigh nested, referenced or mixed designs
Add indexes: Simple, Compound, Functional, Partial, Array, Covering, Memory Optimized
- N1QL, Key-value, Views
- Focus, Success Criteria, Review Architecture
consider using a tool like Hackolade to define models rigorously and collaboratively
- All I ask is that you give Couchbase a chance
Free download
You can also take it for a free test drive on the major cloud providers
Also, this is something new for me this year, please go to this URL to enter to win a $100 gift card. It is literally a 1 question survey and it helps me out a lot.
- This is my family
My enormous head barely fits in the picture
- Not yet. We've been talking about it at least as long as I've been with Couchbase.
It's partly a technical problem, may need additional features for multi-tenant.
It's partly (mostly) a business problem. Would this be worth it?
Couchbase IS in the Azure and AWS marketplaces, and there are some wizards to make config easy, but it runs on your VMs.
- Memory first: integrated cache, you don't need to put redis on top of couchbase
Master-master: easier scaling, better scaling
Auto-sharding: we call them vBuckets, you don't have to come up with a sharding scheme, it's done by crc32
N1QL: SQL, mongo has a more limited query language and it's not SQL-like
Full Text Search: Using the bleve search engine, language aware FTS capabilities built in
Mobile & sync: Mongo has nothing like the offline-first and sync capabilities couchbase offers
Mongo DOES have a DBaaS cloud provider
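To illustrate the auto-sharding point in the notes above: hashing the document key decides where it lives, so the application never picks a shard. The sketch below is a simplification, not Couchbase's exact mapping function; the partition count of 1024 and the name `partition_for` are assumptions for illustration.

```python
import zlib

NUM_PARTITIONS = 1024  # assumed partition (vBucket) count for this sketch


def partition_for(key):
    """Map a document key to a partition number using the CRC32 of the key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# The same key always lands in the same partition; different keys spread out.
for key in ("CBL2015", "CBL2016", "author::matt"):
    print(key, partition_for(key))
```

Because the mapping depends only on the key, any client can compute which partition holds a document without asking a coordinator.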
- Everything I've shown you today is available in Community edition
The only N1QL features I can think of that are not in Community are INFER and the Query Plan Visualizer
You probably don't need the Enterprise features unless you are an Enterprise developer.