© 2014 MapR Technologies #NoSQLNow @apachedrill
Drilling on JSON
NoSQL
We don't need no transaction
We don't need no ACID control
No schema in the tables
No limit to the scale out
DBA, leave them JSON alone
Hey DBA, leave them JSON alone
All in all it's just another data in the BASE
All in all it’s just another shard into cloud.
…With apologies to Roger Waters
NoSQL Landscape
Martin Fowler says: “aggregate-oriented”
What you're most likely to access as a unit.
Key Value Store
• Couchbase
• Riak
• Citrusleaf
• Redis
• BerkeleyDB
• Membrain
• ...
Document
• MongoDB
• CouchDB
• RavenDB
• Couchbase
• ...
Graph
• OrientDB
• DEX
• Neo4j
• GraphBase
• ...
Wide Column
• HBase
• Hypertable
• Cassandra
• MapR-DB
• ...
Data landscape is changing
New types of applications
• Social, mobile, Web, “Internet of Things”, Cloud…
• Iterative/Agile in nature
• More users, more data
New data models & data types
• Flexible Schema/Schema less
• Rapidly changing
• Semi-structured/Nested data
{
  "data": [
    {
      "id": "X999_Y999",
      "from": {
        "name": "Tom Brady", "id": "X12"
      },
      "message": "Looking forward to 2014!",
      "actions": [
        {
          "name": "Comment",
          "link": "http://www.facebook.com/X99/posts Y999"
        },
        {
          "name": "Like",
          "link": "http://www.facebook.com/X99/posts Y999"
        }
      ],
      "type": "status",
      "created_time": "2013-08-02T21:27:44+0000",
      "updated_time": "2013-08-02T21:27:44+0000"
    }
  ]
}
JSON
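To make "semi-structured/nested data" concrete, here is a minimal Python sketch that walks a record shaped like the one on this slide. The RAW string is a trimmed, hypothetical copy of the slide's sample, and the summarize helper is an illustrative name, not part of Drill or any library.

```python
import json

# Hypothetical, trimmed copy of the slide's sample record.
RAW = """
{
  "data": [
    {
      "id": "X999_Y999",
      "from": {"name": "Tom Brady", "id": "X12"},
      "message": "Looking forward to 2014!",
      "actions": [
        {"name": "Comment"},
        {"name": "Like"}
      ],
      "type": "status"
    }
  ]
}
"""


def summarize(doc):
    """Pull one nested field, one scalar, and one repeated field per post."""
    rows = []
    for post in doc["data"]:
        rows.append({
            "author": post["from"]["name"],  # nested map
            "message": post["message"],      # plain scalar
            # repeated field: one entry per element of the "actions" array
            "actions": [a["name"] for a in post.get("actions", [])],
        })
    return rows


posts = summarize(json.loads(RAW))
print(posts)
```

The nested map ("from") and the array ("actions") are the two shapes that a flat row-and-column table cannot hold directly.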
APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
40+ contributors
150+ years of experience building databases and distributed systems
Zero to Results in 2 Minutes (3 Commands)
Install:
$ tar xzf apache-drill.tar.gz
Launch shell (embedded mode):
$ apache-drill/bin/sqlline -u jdbc:drill:zk=local
Query:
0: jdbc:drill:zk=local>
SELECT DISTINCT users.name AS name, users.emails.work AS email
FROM dfs.logs.`/data/logs` logs,
     dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND
      logs.errorLevel > 5;
+-------+-----------------+
| name  | email           |
+-------+-----------------+
| john  | john@gmail.com  |
| jack  | jack@yahoo.com  |
| Ronn  | ronn@mapr.com   |
| Pat   | pat@hotmail.com |
...
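For readers who want to see what the query computes, the following Python sketch emulates its semantics on in-memory data: join logs to users on uid = id, keep log entries with errorLevel > 5, and return distinct (name, email) pairs. It does not call Drill; the sample rows and the users_with_errors function are hypothetical stand-ins for the files on the slide.

```python
# Hypothetical stand-ins for /data/logs and /profiles.json.
logs = [
    {"uid": 1, "errorLevel": 7},
    {"uid": 2, "errorLevel": 3},
    {"uid": 1, "errorLevel": 9},
    {"uid": 3, "errorLevel": 6},
]
users = [
    {"id": 1, "name": "john", "emails": {"work": "john@gmail.com"}},
    {"id": 2, "name": "jack", "emails": {"work": "jack@yahoo.com"}},
    {"id": 3, "name": "Pat", "emails": {"work": "pat@hotmail.com"}},
]


def users_with_errors(logs, users, min_level=5):
    """Emulate: SELECT DISTINCT name, emails.work ... WHERE errorLevel > min_level."""
    users_by_id = {user["id"]: user for user in users}
    seen = set()
    result = []
    for entry in logs:
        if entry["errorLevel"] <= min_level:
            continue  # WHERE logs.errorLevel > 5
        user = users_by_id.get(entry["uid"])
        if user is None:
            continue  # inner join: drop log entries with no matching user
        row = (user["name"], user["emails"]["work"])  # nested field access
        if row not in seen:  # DISTINCT
            seen.add(row)
            result.append(row)
    return result


rows = users_with_errors(logs, users)
print(rows)
```

Note that users.emails.work in the SQL corresponds to user["emails"]["work"] here: the query reaches into a nested map without any schema having been declared first.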
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
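The contrast between the two approaches can be sketched in a few lines of Python. This is a toy illustration, not how Drill or the Hive Metastore work internally; DECLARED_SCHEMA and both function names are assumptions made for the example.

```python
# Hypothetical schema that must be declared before any record is read.
DECLARED_SCHEMA = {"name": str, "age": int}


def read_with_declared_schema(record, schema=DECLARED_SCHEMA):
    """Schema declared in advance: reject any record that does not match."""
    if set(record) != set(schema):
        raise ValueError(f"fields {sorted(record)} do not match declared schema")
    for field, expected_type in schema.items():
        if not isinstance(record[field], expected_type):
            raise TypeError(f"{field} is not {expected_type.__name__}")
    return record


def discover_schema(record):
    """Schema on the fly: derive field names and types from the record itself."""
    return {field: type(value).__name__ for field, value in record.items()}


evolving = {"name": "Michael", "hobbies": ["ski", "soccer"]}
print(discover_schema(evolving))
```

A record with a new field (such as hobbies above) fails the declared-schema reader but is described without any prior setup by the on-the-fly reader.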
Self-Describing Data is Ubiquitous
Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)
Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
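One way to see why these records count as self-describing: the set of fields can be recovered from the data alone, even though the two records differ. The sketch below collects dotted field paths across both records; field_paths and merged_schema are hypothetical helper names, and the "[]" suffix for repeated fields is a notation chosen for this example.

```python
def field_paths(value, prefix=""):
    """Yield (dotted path, type name) for every leaf value in a nested record."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from field_paths(child, path)
    elif isinstance(value, list):
        # Arrays are treated as repeated fields, marked with a "[]" suffix.
        for child in value:
            yield from field_paths(child, prefix + "[]")
    else:
        yield prefix, type(value).__name__


def merged_schema(records):
    """Union the field paths of all records; each path maps to the types seen."""
    schema = {}
    for record in records:
        for path, type_name in field_paths(record):
            schema.setdefault(path, set()).add(type_name)
    return schema


records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]
schema = merged_schema(records)
print(sorted(schema))
```

The merged result contains both district and preschool, although each appears in only one record; no schema was declared anywhere.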
Drill’s Data Model is Flexible
Formats: HBase, JSON, BSON, CSV, TSV, Parquet, Avro
Axes: schema-less vs. fixed schema; flat vs. complex (flexibility increases toward schema-less and complex)
RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
Core Modules within a Drillbit
• RPC Endpoint
• SQL Parser
• Optimizer
• Logical Plan
• Physical Plan
• Execution
• Distributed Cache
• Storage Plugins: DFS, HBase, Hive, MongoDB, CouchBase, Cassandra, RDBMS
[Figure: where Apache Drill fits in the data landscape]
Processing in files: MapReduce; Hive, Pig, etc.
File contents: generic file formats; rows/columns in files (tables)
Query: Impala, Tez, Hive
NoSQL: MongoDB, HBase, Cassandra, Riak, Redis
Hadoop: disk & storage (bits, bytes, blocks)
RDBMS: OLTP, EDW
Data structure: no structure; semi-structured & self-describing; highly structured data
Languages: ANSI SQL, SQL++, R, etc.
Cost labels: $100K – $200K / TB, $1K / TB, $10K / TB
Apache Drill
NoSQL NoETL
Drill, Baby, Drill: Self-Service Data Exploration using Apache Drill
Thursday, August 21st, 9:30 AM
Apache Drill


Editor's Notes

  1. With other technologies you have to do this, then this, then this, …
  2. TODO: Add Impala and Splunk logos
  3. Need an example or analogy to explain self-describing data.
  4. All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON documents (with additional types). Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before. If you consider the four data models shown in the 2x2, every model can be represented by the complex, schema-less model (JSON) because it is the most flexible. The reverse does not hold: the flat, fixed-schema model cannot represent any of the other models. Therefore, with any SQL engine except Drill, the data has to be transformed before it is available to queries.
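The asymmetry described in note 4 can be shown with a small Python sketch. A flat row is already a valid (shallow) document, while a nested document needs a transformation step before it fits a rows-and-columns table. The function names and the flattening rules (dotted column names for nested maps, one output row per array element, one level of nesting only) are assumptions chosen for illustration; they are not Drill behavior.

```python
def row_to_document(row):
    """Flat, fixed-schema row -> document: no transformation needed."""
    return dict(row)


def flatten(document):
    """Nested document -> flat rows (handles one level of nesting only).

    Nested maps become dotted column names; arrays are exploded into
    one output row per element (an empty array yields a single None).
    """
    rows = [{}]
    for key, value in document.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                for row in rows:
                    row[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, list):
            rows = [dict(row, **{key: item})
                    for row in rows
                    for item in (value or [None])]
        else:
            for row in rows:
                row[key] = value
    return rows


flat_row = {"Name": "Michael", "Gender": "M", "Age": 6}
nested = {
    "name": {"first": "Michael", "last": "Smith"},
    "hobbies": ["ski", "soccer"],
    "district": "Los Altos",
}
flat_rows = flatten(nested)
print(row_to_document(flat_row))
print(flat_rows)
```

The single nested record turns into two flat rows that repeat the name and district values, which is the kind of up-front transformation the note refers to.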