Drilling on JSON

© 2014 MapR Technologies #NoSQLNow @apachedrill

NoSQL
We don't need no transaction
We don't need no ACID control
No schema in the tables
No limit to the scale out
DBA, leave them JSON alone
Hey DBA, leave them JSON alone
All in all it's just another data in the BASE
All in all it’s just another shard into cloud.
…With apologies to Roger Waters

Martin Fowler says: “aggregate-oriented”
What you're most likely to access as a unit.
NoSQL Landscape
Key Value Store
• Couchbase
• Riak
• Citrusleaf
• Redis
• BerkeleyDB
• Membrain
• ...
Document
• MongoDB
• CouchDB
• RavenDB
• Couchbase
• ...
Graph
• OrientDB
• DEX
• Neo4j
• GraphBase
• ...
Wide Column
• HBase
• Hypertable
• Cassandra
• MapR-DB
• ...

Data landscape is changing
New types of applications
• Social, mobile, Web, “Internet of Things”, Cloud…
• Iterative/Agile in nature
• More users, more data
New data models & data types
• Flexible schema/schema-less
• Rapidly changing
• Semi-structured/Nested data
{
  "data": [
    {
      "id": "X999_Y999",
      "from": {
        "name": "Tom Brady", "id": "X12"
      },
      "message": "Looking forward to 2014!",
      "actions": [
        {
          "name": "Comment",
          "link": "http://www.facebook.com/X99/posts/Y999"
        },
        {
          "name": "Like",
          "link": "http://www.facebook.com/X99/posts/Y999"
        }
      ],
      "type": "status",
      "created_time": "2013-08-02T21:27:44+0000",
      "updated_time": "2013-08-02T21:27:44+0000"
    }
  ]
}
JSON
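A nested document like this one can be read without declaring any schema up front. The sketch below uses only Python's standard `json` module; the helper name `summarize_posts` and the trimmed sample document are illustrative assumptions, not part of the original slides.

```python
import json

# Trimmed copy of the status post, kept inline so the example is self-contained.
raw = """
{
  "data": [
    {
      "id": "X999_Y999",
      "from": {"name": "Tom Brady", "id": "X12"},
      "message": "Looking forward to 2014!",
      "actions": [
        {"name": "Comment"},
        {"name": "Like"}
      ],
      "type": "status"
    }
  ]
}
"""

def summarize_posts(document):
    """Return one (author, message, action names) tuple per post.

    Hypothetical helper: missing keys are tolerated because
    semi-structured records need not all carry the same fields.
    """
    rows = []
    for post in document.get("data", []):
        author = post.get("from", {}).get("name")
        actions = [action.get("name") for action in post.get("actions", [])]
        rows.append((author, post.get("message"), actions))
    return rows

posts = summarize_posts(json.loads(raw))
print(posts)
```

The `.get(...)` calls with defaults are the point of the sketch: a field that is absent from one record yields `None` or an empty list instead of an error.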

APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
• 40+ contributors
• 150+ years of experience building databases and distributed systems

Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
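One way to picture schema discovery on the fly is to derive field names and types from the records as they are read, instead of looking them up in a catalog. The following is a minimal sketch under that assumption; `discover_schema` and the dotted-path naming are hypothetical and are not Drill's actual implementation.

```python
from collections import defaultdict

def discover_schema(records):
    """Map each dotted field path to the set of type names observed.

    Hypothetical helper: nested maps extend the path with ".key",
    and array elements are recorded under "path[]".
    """
    schema = defaultdict(set)

    def visit(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            for item in value:
                visit(item, path + "[]")
        else:
            schema[path].add(type(value).__name__)

    for record in records:
        visit(record, "")
    return dict(schema)

# Two sample records whose fields differ; no schema is declared anywhere.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

schema = discover_schema(records)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

The discovered schema grows as new fields appear, so a record that introduces `preschool` simply adds a path rather than failing validation.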

Self-Describing Data is Ubiquitous
Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)
Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}

Drill’s Data Model is Flexible
Figure: data formats arranged along two flexibility axes, fixed schema to schema-less and flat to complex.
Formats shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro
RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
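The gap between the two table shapes can be illustrated by turning one complex record into flat rows: nested maps become dotted column names, and an array becomes one row per element. This is only a sketch of the idea; `flatten_record` is a hypothetical helper and does not represent how Drill processes data internally.

```python
def flatten_record(record, array_field):
    """Yield one flat row per element of `array_field`.

    Nested maps are expanded into dotted column names. A record whose
    array is missing or empty yields no rows in this sketch.
    """
    def expand(value, prefix, out):
        for key, child in value.items():
            name = f"{prefix}.{key}" if prefix else key
            if isinstance(child, dict):
                expand(child, name, out)
            else:
                out[name] = child
        return out

    scalars = {k: v for k, v in record.items() if k != array_field}
    base = expand(scalars, "", {})
    for element in record.get(array_field, []):
        yield {**base, array_field: element}

michael = {
    "name": {"first": "Michael", "last": "Smith"},
    "hobbies": ["ski", "soccer"],
    "district": "Los Altos",
}

rows = list(flatten_record(michael, "hobbies"))
for row in rows:
    print(row)
```

Each output row has only scalar columns, which is the shape a fixed-schema relational table expects; the complex record keeps the same information in a single nested value.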

Core Modules within a Drillbit
• RPC Endpoint
• SQL Parser
• Optimizer
• Logical Plan
• Physical Plan
• Execution
• Distributed Cache
• Storage Plugins: DFS, HBase, Hive, MongoDB, Couchbase, Cassandra, RDBMS
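The role of storage plugins can be sketched as a common scan interface that lets one execution path read from different sources. Everything in the example below is an assumption made for illustration: `MiniEngine`, `json_lines_scan`, and `csv_scan` are hypothetical names and bear no relation to Drill's real plugin API.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical plugin contract: a zero-argument function yielding dict records.
Scan = Callable[[], Iterable[dict]]

def json_lines_scan(text: str) -> Scan:
    """Plugin sketch for newline-delimited JSON held in a string."""
    return lambda: (json.loads(line) for line in text.splitlines() if line.strip())

def csv_scan(text: str) -> Scan:
    """Plugin sketch for CSV with a header row; all values arrive as strings."""
    return lambda: csv.DictReader(io.StringIO(text))

class MiniEngine:
    """Toy engine: one select path shared by every registered source."""

    def __init__(self) -> None:
        self.plugins: Dict[str, Scan] = {}

    def register(self, name: str, scan: Scan) -> None:
        self.plugins[name] = scan

    def select(self, source: str, columns: List[str],
               where: Callable[[dict], bool] = lambda r: True) -> Iterator[dict]:
        for record in self.plugins[source]():
            if where(record):
                yield {column: record.get(column) for column in columns}

engine = MiniEngine()
engine.register("kids_json", json_lines_scan(
    '{"name": "Michael", "age": 6}\n{"name": "Jennifer", "age": 3}'))
engine.register("kids_csv", csv_scan("name,age\nMichael,6\nJennifer,3\n"))

json_rows = list(engine.select("kids_json", ["name"], lambda r: r["age"] > 4))
csv_rows = list(engine.select("kids_csv", ["name"], lambda r: int(r["age"]) > 4))
print(json_rows, csv_rows)
```

The engine never inspects where records come from, so adding a source means registering another scan function rather than changing the query path.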

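Figure: the data landscape, from unstructured storage to highly structured databases, with Apache Drill.
• Hadoop disk & storage: bits, bytes, blocks
• Processing in files: MapReduce, generic file formats
• Rows/columns in files (tables): Hive, Pig, etc.
• Query: Impala, Tez, Hive
• NoSQL: MongoDB, HBase, Cassandra, Riak, Redis
• RDBMS: OLTP, EDW
• Structure labels: No structure; Semi-structured & self-describing; Highly structured data
• Language labels: ANSI-SQL, SQL++, R, etc.
• Cost labels: $100K – $200K/TB, $1K/TB, $10K/TB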

NoSQL NoETL
Drill, Baby, Drill: Self-Service Data Exploration using Apache Drill
Thursday, August 21st, 9:30 AM
Apache Drill
