ToroDB
Open-source, MongoDB-compatible database,
built on top of PostgreSQL
Álvaro Hernández <aht@torodb.com>
ToroDB @NoSQLonSQL
DEMO!
About *8Kdata*
● Research & Development in databases
● Consulting, Training and Support in PostgreSQL
● Founders of PostgreSQL España, 3rd largest PUG in the world (>400 members as of today)
● About myself: CTO at 8Kdata: @ahachete, http://linkd.in/1jhvzQ3
www.8kdata.com
ToroDB in one slide
● Document-oriented, JSON, NoSQL db
● Open source (AGPL)
● MongoDB compatibility (wire protocol level)
● Uses PostgreSQL as a storage backend
Why relational databases: technical perspective
● The document model is very appealing to many, but all document databases started from scratch
● DRY: why not use relational databases? They are proven, durable, concurrent and flexible
● Why not base a document database on a relational one, like PostgreSQL?
ToroDB table structure
ToroDB storage
● Data is stored in tables. No blobs
● JSON documents are split by hierarchy levels into “subdocuments”, which contain no nested structures. Each subdocument level is stored separately
● Subdocuments are classified by “type”. Each “type” maps to a different table
ToroDB storage (II)
● A “structure” table keeps the subdocument “schema”
● Keys in JSON are mapped to attributes, which retain the original name
● Tables are created dynamically and transparently to match the exact types of the documents
ToroDB storage internals
{
  "name": "ToroDB",
  "data": {
    "a": 42, "b": "hello world!"
  },
  "nested": {
    "j": 42,
    "deeper": {
      "a": 21, "b": "hello"
    }
  }
}
ToroDB storage internals
The document is split into the following subdocuments:
{ "name": "ToroDB", "data": {}, "nested": {} }
{ "a": 42, "b": "hello world!"}
{ "j": 42, "deeper": {}}
{ "a": 21, "b": "hello"}
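The split shown above can be sketched in a few lines of Python. This is only an illustration of the idea, not ToroDB's actual code: the function name `split_subdocuments` is hypothetical, arrays are ignored, and the depth-first ordering is an assumption chosen to match the slide.

```python
def split_subdocuments(doc):
    """Return the flat subdocuments of `doc`, parent first, depth-first.

    Nested objects are replaced by an empty {} placeholder in the parent
    and emitted as subdocuments of their own. Arrays are not handled.
    """
    flat = {}
    children = []
    for key, value in doc.items():
        if isinstance(value, dict):
            flat[key] = {}          # placeholder, as on the slide
            children.append(value)
        else:
            flat[key] = value
    result = [flat]
    for child in children:
        result.extend(split_subdocuments(child))
    return result


document = {
    "name": "ToroDB",
    "data": {"a": 42, "b": "hello world!"},
    "nested": {"j": 42, "deeper": {"a": 21, "b": "hello"}},
}

subdocuments = split_subdocuments(document)
for sub in subdocuments:
    print(sub)
```

Note that the second and fourth subdocuments share the same keys and value types, which is why they can land in the same table.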
ToroDB storage internals
select * from demo.t_3;
┌─────┬───────┬────────────────────────────┬────────┐
│ did │ index │            _id             │  name  │
├─────┼───────┼────────────────────────────┼────────┤
│   0 │     ¤ │ x5451a07de7032d23a908576d  │ ToroDB │
└─────┴───────┴────────────────────────────┴────────┘
select * from demo.t_1;
┌─────┬───────┬────┬──────────────┐
│ did │ index │ a  │      b       │
├─────┼───────┼────┼──────────────┤
│   0 │     ¤ │ 42 │ hello world! │
│   0 │     1 │ 21 │ hello        │
└─────┴───────┴────┴──────────────┘
select * from demo.t_2;
┌─────┬───────┬────┐
│ did │ index │ j  │
├─────┼───────┼────┤
│   0 │     ¤ │ 42 │
└─────┴───────┴────┘
ToroDB storage internals
select * from demo.structures;
┌─────┬────────────────────────────────────────────────────────────────────────────┐
│ sid │                                 _structure                                 │
├─────┼────────────────────────────────────────────────────────────────────────────┤
│   0 │ {"t": 3, "data": {"t": 1}, "nested": {"t": 2, "deeper": {"i": 1, "t": 1}}} │
└─────┴────────────────────────────────────────────────────────────────────────────┘
select * from demo.root;
┌─────┬─────┐
│ did │ sid │
├─────┼─────┤
│   0 │   0 │
└─────┴─────┘
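To see how these tables fit together, here is a small Python sketch that rebuilds the document from the rows shown on the slides. It rests on an assumed reading of the `_structure` column: `"t"` is taken to be the number of the table holding that subdocument and `"i"` the value of its `index` column (absent meaning null). The in-memory dictionaries and the `rebuild` function are hypothetical stand-ins for the real tables, not ToroDB's code.

```python
# Rows copied from the slides; None stands for the null shown as ¤.
tables = {
    1: [
        {"did": 0, "index": None, "a": 42, "b": "hello world!"},
        {"did": 0, "index": 1, "a": 21, "b": "hello"},
    ],
    2: [{"did": 0, "index": None, "j": 42}],
    3: [{"did": 0, "index": None,
         "_id": "x5451a07de7032d23a908576d", "name": "ToroDB"}],
}
structures = {
    0: {"t": 3, "data": {"t": 1},
        "nested": {"t": 2, "deeper": {"i": 1, "t": 1}}},
}
root = {0: 0}  # did -> sid


def _rebuild_node(did, node):
    """Fetch the row for this structure node, then recurse into children."""
    index = node.get("i")  # assumed: a missing "i" means index is null
    row = next(r for r in tables[node["t"]]
               if r["did"] == did and r["index"] == index)
    doc = {k: v for k, v in row.items() if k not in ("did", "index")}
    for key, child in node.items():
        if isinstance(child, dict):  # nested subdocument description
            doc[key] = _rebuild_node(did, child)
    return doc


def rebuild(did):
    """Reassemble the full document identified by `did`."""
    return _rebuild_node(did, structures[root[did]])


rebuilt = rebuild(0)
print(rebuilt)
```

The rebuilt document carries the `_id` column from `demo.t_3`, which does not appear in the JSON shown earlier in the deck.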
ToroDB storage and I/O savings
Storage required: 29% - 68% of what MongoDB 2.6 needs
The software
ToroDB is written in Java, compatible with versions 6 and above.
It has been tested on Oracle's VM, but we will also test and verify it on Azul's VM.
It is currently a standalone JAR file but will also be offered as an EAR, to easily deploy to application servers.
Going beyond MongoDB
MongoDB brought the document model and several features that many love. But can we go further than that?
Can't the foundation of relational databases provide a basis for offering new features on a NoSQL, document-like, JSON database?
Going beyond MongoDB
● Avoid schema repetition. Query-by-type
● Cheap single-node durability
● “Clean” reads
● Atomic bulk operations
● Highest concurrency
The schema-less fallacy
{
  "name": "Álvaro",
  "surname": "Hernández",
  "height": 200,
  "hobbies": [
    "PostgreSQL", "triathlon"
  ]
}
metadata → Isn't that... schema?
The schema-less fallacy: BSON
metadata → Isn't that... schema?
{
  "name": (string) "Álvaro",
  "surname": (string) "Hernández",
  "height": (number) 200,
  "hobbies": {
    "0": (string) "PostgreSQL",
    "1": (string) "triathlon"
  }
}
The schema-less fallacy
● It's not schema-less
● It is “attached-schema”
● It carries a non-zero overhead
Schema-attached repetition
{ "a": 1, "b": 2 }
{ "a": 3 }
{ "a": 4, "c": 5 }
{ "a": 6, "b": 7 }
{ "b": 8 }
{ "a": 9, "b": 10 }
{ "a": 11, "b": 12, "j": 13 }
{ "a": 14, "c": 15 }
Counting “document types” in collections of millions: at most, 1000s of different types
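Counting “document types” for the eight sample documents above can be sketched as follows. The definition of a type used here (the sorted set of top-level keys) is a simplifying assumption for illustration; the storage slides say tables match the exact types of the documents, so a real type would also consider value types.

```python
from collections import Counter

documents = [
    {"a": 1, "b": 2},
    {"a": 3},
    {"a": 4, "c": 5},
    {"a": 6, "b": 7},
    {"b": 8},
    {"a": 9, "b": 10},
    {"a": 11, "b": 12, "j": 13},
    {"a": 14, "c": 15},
]


def document_type(doc):
    """Assumed notion of type: the sorted tuple of top-level key names."""
    return tuple(sorted(doc))


type_counts = Counter(document_type(doc) for doc in documents)
for doc_type, count in type_counts.items():
    print(doc_type, count)
```

Eight documents collapse into five types here, and the key names would only need to be stored once per type rather than once per document.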
Schema-attached repetition
How data is stored in schema-less
This is how we store in ToroDB
ToroDB: query “by structure”
● ToroDB is effectively partitioning by type
● Structures (schemas, partitioning types) are cached in ToroDB memory
● Queries only scan a subset of the data.
● Negative queries are served directly from memory.
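The pruning idea in the bullets above might look roughly like this. It is a sketch under assumptions: the cache layout (`cached_types`), the table names and the `plan_query` function are all hypothetical, and a “negative query” is read here as a query on a key that no cached structure contains.

```python
# Hypothetical in-memory cache: table name -> set of attributes it stores.
cached_types = {
    "t_1": {"a", "b"},
    "t_2": {"a"},
    "t_3": {"a", "c"},
    "t_4": {"b"},
    "t_5": {"a", "b", "j"},
}


def tables_to_scan(key):
    """Return only the tables whose type contains the queried key."""
    return sorted(name for name, attrs in cached_types.items() if key in attrs)


def plan_query(key):
    """Decide from the cache alone whether the database must be touched."""
    candidates = tables_to_scan(key)
    if not candidates:
        # No structure has this key: the answer is empty without any I/O.
        return "empty result, answered from the cache"
    return "scan " + ", ".join(candidates)


print(plan_query("c"))        # only one of the five tables qualifies
print(plan_query("missing"))  # no table qualifies
```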
Cheap single-node durability
● Without journaling, MongoDB is neither durable nor crash-safe
● MongoDB requires “j: true” for true single-node durability. But who guarantees its consistent usage? Who uses it by default?
j:true creates I/O storms equivalent to SQL CHECKPOINTs
“Clean” reads
Oh really?
“Clean” reads
http://docs.mongodb.org/manual/reference/write-concern/#read-isolation-behavior
“MongoDB will allow clients to read the results of a write operation before the write operation returns.”
“If the mongod terminates before the journal commits, even if a write returns successfully, queries may have read data that will not exist after the mongod restarts.”
“Other database systems refer to these isolation semantics as read uncommitted.”
“Clean” reads
Thus, MongoDB suffers from dirty reads. Or probably better called “tainted reads”.
What about $snapshot? Nope:
“The snapshot() does not guarantee that the data returned by the query will reflect a single moment in time nor does it provide isolation from insert or delete operations.”
http://docs.mongodb.org/manual/faq/developers/#faq-developers-isolate-cursors
ToroDB: going beyond MongoDB
● Cheap single-node durability
PostgreSQL is 100% durable. Always. And it's cheap (doesn't do I/O storms)
● “Clean” reads
Cursors in ToroDB run in repeatable read, read-only mode:
globalCursorDataSource.setTransactionIsolation("TRANSACTION_REPEATABLE_READ");
globalCursorDataSource.setReadOnly(true);
Atomic operations
● MongoDB has no support for atomic bulk insert/update/delete operations
● Not even with $isolated:
“Prevents a write operation that affects multiple documents from yielding to other reads or writes […] You can ensure that no client sees the changes until the operation completes or errors out. The $isolated isolation operator does not provide “all-or-nothing” atomicity for write operations.”
http://docs.mongodb.org/manual/reference/operator/update/isolated/
High concurrency
● MMAPv1 is still collection-locked
● WiredTiger is document-locked
● But these are still exclusive locks (MMAP). Most relational databases have MVCC, which means readers and writers can work at the same time almost conflict-free
ToroDB: going beyond MongoDB
● Atomic bulk operations
By default, bulk operations in ToroDB are atomic. Use flag ContinueOnError: 1 to perform non-atomic bulk operations
● Highest concurrency
PostgreSQL uses MVCC. Readers and writers do not block each other. Writers block writers only for the same record
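The difference between an atomic bulk operation and a ContinueOnError-style one can be illustrated with a transactional backend. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server; the `docs` table, the `bulk_insert` function and its `continue_on_error` parameter are hypothetical and only mirror the behaviour described on the slide, not ToroDB's implementation.

```python
import sqlite3


def bulk_insert(conn, rows, continue_on_error=False):
    """Insert rows in one transaction; return how many were stored.

    Default: all-or-nothing, any failure rolls back the whole batch.
    continue_on_error=True: failing rows are skipped, the rest is kept.
    """
    inserted = 0
    try:
        for row in rows:
            try:
                conn.execute("INSERT INTO docs (id, name) VALUES (?, ?)", row)
                inserted += 1
            except sqlite3.IntegrityError:
                if not continue_on_error:
                    raise
        conn.commit()
    except sqlite3.IntegrityError:
        conn.rollback()  # undo every row of this batch
        inserted = 0
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, name TEXT)")

# The third row violates the primary key.
rows = [(1, "a"), (2, "b"), (1, "duplicate id"), (3, "c")]

atomic_count = bulk_insert(conn, rows)
rows_after_atomic = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]

partial_count = bulk_insert(conn, rows, continue_on_error=True)
rows_after_partial = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]

print(atomic_count, rows_after_atomic, partial_count, rows_after_partial)
```

In the atomic call nothing is stored because one row fails; in the second call the three valid rows are kept and only the duplicate is skipped.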
ToroDB: Developer Preview
● ToroDB launched in October 2014 as a Developer Preview. Support for CRUD and most of the SELECT API
● github.com/torodb
● RERO (release early, release often) policy. Comments, feedback, patches... greatly appreciated
● AGPLv3
ToroDB: Developer Preview
● Clone the repo, build with Maven
● Or download the JAR: http://maven.torodb.com/jar/com/torodb/torodb/0.20/torodb.jar
● Usage:
java -jar torodb-0.20.jar --help
java -jar torodb-0.20.jar -d dbname -u dbuser -P 27017
Connect with the normal mongo console!
ToroDB: Community Response
ToroDB: Roadmap
● Current Developer Preview is single-node
● Version 1.0:
➔ Expected Q4 2015
➔ Production-ready
➔ MongoDB Replication support
➔ Very high compatibility with Mongo API
ToroDB: Development priorities
#1 Offer a MongoDB-like experience on top of existing IT infrastructure, like relational databases and app servers
#2 Go beyond current MongoDB features, such as ACID and concurrency
#3 Great performance
ToroDB: Experimental research directions
● Use columnar storage (CitusDB)
● Use Postgres-XL as a backend. This requires us to distribute ToroDB's cache (ehcache, Hazelcast)
● Use pg_shard for sharding
Big Data speaking mongo: Vertical ToroDB
What if we use CitusData's cstore to store the JSON documents?
Big Data speaking mongo: Vertical ToroDB
Storage required: 1.17% - 20.26% of what MongoDB 2.6 needs