© 2016 MapR Technologies
Evolving from RDBMS to NoSQL + SQL
Why Does This Matter?
• 90%+ of the use cases do not deal with “relational” data
• RDBMS data models are more complex than a single table
– One-to-many relationships require multiple tables
– Creating code to persist data takes time and QA
• Inferred (or removed) keys used without actual foreign keys
– Difficult for others to understand relationships
• Transactional tables never look the same as analytics tables
– OLTP -> ETL -> OLAP
– This takes significant time to build
Topics
• Changing Data Models
– Relational Model to JSON Model
• A New Database for JSON Data
– Document Database (OJAI)
• Querying JSON Data and More
– Drill
• Resources
Empowering “as it happens”
businesses by speeding up the
data-to-action cycle
Changing Data Models
180 Tables
NOT SHOWN!
236 tables
to describe 7 kinds of things
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
Searching for Elvis
-- Find discs where Elvis was credited
> SELECT DISTINCT album_id, name FROM
  (SELECT id album_id, artist_id, name, FLATTEN(credit) FROM release) albums
  JOIN
  (SELECT DISTINCT artist_id FROM
    (SELECT id artist_id, FLATTEN(alias) FROM artist
     WHERE name LIKE 'Elvis%Presley')
  ) artists
  USING (artist_id);
Benefits
• Extended relational model allows massive simplification
– On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
– This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today
A New Database for JSON Data
Basics of the API
• http://ojai.github.io/
• Entry point to a table - DocumentStore
– insert()
– insertOrReplace()
– find()
– delete()
– replace()
– update()
– increment()
Working with JSON in Java
• Step 1 – Create instance of JSON Serializer
Gson gson = new Gson();
• Step 2 – Serialize POJO to JSON
String json = gson.toJson(myObject);
• Step 3 – Deserialize JSON into POJO
MyObject myObject = gson.fromJson(json, MyObject.class);
Creating Documents in Java OJAI
• Use static methods on class org.ojai.json.Json
Document doc = Json.newDocument(myObject);
Document doc = Json.newDocument(jsonString);
• Alternatively
– Use builders
– Stream from disk
– Use InputStream
Creating New Documents
• DocumentStore.insert(doc)
Done!
• DocumentStore.insertOrReplace(doc)
Done!
Easy, right?
Updating Existing Documents
• DocumentStore.update(_id, DocumentMutation)
• Mutation methods
– mutation.append(FieldPath, "user visited URL");
– mutation.set("field.name", "What a great example");
– mutation.increment("field", 1);
– mutation.merge("field", Map<String, Object>);
– mutation.setOrReplace(…);
– mutation.delete(field);
Yes, these are atomic.
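To make the mutation idea concrete, here is a minimal sketch in plain Python. It is not the OJAI API: `apply_mutation`, the `(op, path, value)` tuples, and the sample document are hypothetical, and a nested dict stands in for a document. It shows only the shape of the operations listed above (set, increment, append, delete on dotted field paths) and models "atomic" as all-or-nothing by working on a copy.

```python
import copy


def _walk(doc, path):
    """Return (parent_dict, last_key) for a dotted path, creating parents as needed."""
    keys = path.split(".")
    for key in keys[:-1]:
        doc = doc.setdefault(key, {})
    return doc, keys[-1]


def apply_mutation(doc, ops):
    """Apply (op, path, value) tuples to a copy of doc and return the copy.

    Working on a copy models all-or-nothing behaviour: if any operation
    raises, the caller's document is left untouched.
    """
    result = copy.deepcopy(doc)
    for op, path, value in ops:
        parent, key = _walk(result, path)
        if op == "set":
            parent[key] = value
        elif op == "increment":
            parent[key] = parent.get(key, 0) + value
        elif op == "append":
            parent.setdefault(key, []).append(value)
        elif op == "delete":
            parent.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return result


# Hypothetical sample document and mutation
doc = {"_id": "u1", "visits": 1, "history": ["/home"]}
updated = apply_mutation(doc, [
    ("append", "history", "/pricing"),
    ("set", "profile.note", "What a great example"),
    ("increment", "visits", 1),
    ("delete", "legacy", None),
])
print(updated)
```

In a real store the mutation is applied server-side against the stored document; the copy here only stands in for that guarantee.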
Deleting Documents
• DocumentStore.delete(doc);
Done!
• DocumentStore.delete(_id);
Done!
This is easy too, right?
Finding Documents
• DocumentStore.find(QueryCondition);
• Query condition setup:
– qc.is("field", EQUAL, "blue")
    .and().notExists("other.field")
    .or().like("field", "%purple")
    .or().matches("another.field", "regular expression")
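The condition chain above can be read as a set of small predicates combined with AND/OR. The sketch below models that in plain Python over dicts. It is not the OJAI `QueryCondition` API: every function name is hypothetical, and the grouping (equality AND not-exists, OR like) is an assumption, since the slide does not spell out precedence.

```python
import re


def get_field(doc, path):
    """Return (found, value) for a dotted field path."""
    cur = doc
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return False, None
        cur = cur[key]
    return True, cur


def is_equal(path, expected):
    def pred(doc):
        found, value = get_field(doc, path)
        return found and value == expected
    return pred


def not_exists(path):
    return lambda doc: not get_field(doc, path)[0]


def like(path, pattern):
    # Translate SQL LIKE wildcards: % matches any run, _ matches one character.
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c) for c in pattern
    )
    compiled = re.compile(regex, re.DOTALL)

    def pred(doc):
        found, value = get_field(doc, path)
        return found and isinstance(value, str) and compiled.fullmatch(value) is not None
    return pred


def all_of(*preds):
    return lambda doc: all(p(doc) for p in preds)


def any_of(*preds):
    return lambda doc: any(p(doc) for p in preds)


# Assumed grouping: (field == "blue" AND other.field missing) OR field LIKE "%purple"
condition = any_of(
    all_of(is_equal("field", "blue"), not_exists("other.field")),
    like("field", "%purple"),
)

docs = [
    {"_id": 1, "field": "blue"},
    {"_id": 2, "field": "blue", "other": {"field": 1}},
    {"_id": 3, "field": "deep purple"},
    {"_id": 4, "field": "green"},
]
matches = [d["_id"] for d in docs if condition(d)]
print(matches)
```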
Querying JSON Data and More
How to Bring SQL to Non-Relational Data Stores?
Familiarity of SQL
• ANSI SQL semantics
• BI (Tableau, MicroStrategy, etc.)
• Low latency
Agility of NoSQL
• No schema management
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• No transformation
– No silos of data
• Ease of use
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
Drill’s Data Model is Flexible
[Figure: data sources (JSON, BSON, HBase, Parquet, Avro, CSV, TSV) placed on two axes, schema (fixed to dynamic) and structure (flat to complex); flexibility grows toward dynamic, complex data]
RDBMS/SQL-on-Hadoop table
Name | Gender | Age
Michael | M | 6
Jennifer | F | 3
Apache Drill table
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
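As a rough illustration of what "schema discovered on the fly" means for records like the two above, the following Python sketch scans self-describing records and collects the columns and value types it encounters. This is only a model of the idea, not how Drill implements it; `discover_columns` and `project` are hypothetical helpers.

```python
import json

# The two records from the slide, as self-describing JSON
raw = [
    '{"name": {"first": "Michael", "last": "Smith"}, '
    '"hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, '
    '"hobbies": ["sing"], "preschool": "CCLC"}',
]
records = [json.loads(line) for line in raw]


def discover_columns(rows):
    """Map each column seen in the data to the set of Python types observed."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, set()).add(type(value).__name__)
    return columns


def project(rows, columns):
    """Return rows with every discovered column present; missing values become None."""
    return [{col: row.get(col) for col in columns} for row in rows]


columns = discover_columns(records)
table = project(records, columns)
print(columns)
```

No schema was declared anywhere: the column set is the union of what the records themselves contain, and a record that lacks a column simply yields a null for it.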
Enabling “As-It-Happens” Business with Instant Analytics
Traditional approach: Hadoop data -> Data modeling -> Transformation -> Data movement (optional) -> Users
Total time to insight: weeks to months
Exploratory approach: Hadoop data -> Users
Total time to insight: minutes
New business questions; source data evolution
Evolution Towards Self-Service Data Exploration
Approach | Data Modeling and Transformation | Data Visualization
Traditional BI w/ RDBMS | IT-driven | IT-driven
Self-Service BI w/ RDBMS | IT-driven | Self-service
SQL-on-Hadoop | IT-driven | Self-service
Self-Service Data Exploration (zero-day analytics) | Optional | Self-service
Common Use Cases
• Raw Data Exploration
• JSON Analytics
• DWH offload
Sources: Hive, HBase, Files, Directories, {JSON}, Parquet, Text Files, …
Drill Enables ‘SQL-on-Everything’
SELECT * FROM dfs.yelp.`business.json`
Storage plugin instance (dfs)
- DFS (Text, Parquet, JSON)
- HBase/MapR-DB
- Hive Metastore/HCatalog
- Easy API to go beyond Hadoop
Workspace (yelp)
- Sub-directory
- HBase namespace
- Hive database
Table (`business.json`)
- Pathnames
- Hive table
- HBase table
Reuse Existing SQL Tools and Skills
Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data
Security Controls
Access Controls that Scale
PAM Authentication + User Impersonation
Fine-grained row and column level access control with Drill Views: no centralized security repository required
[Diagram: users (U) query Drill View 1 and Drill View 2, which are defined over Files, HBase and Hive]
Granular Security via Drill Views
Raw File (/raw/cards.csv)
Owner: Admins; Permission: Admins
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-7914-3865-4817
John | Boulder | CO | 1374-9735-1794-9711
Data Scientist View (/views/maskedcards.csv), not a physical data copy
Owner: Admins; Permission: Data Scientists
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-1111-1111-1111
John | Boulder | CO | 1374-1111-1111-1111
Business Analyst View
Owner: Admins; Permission: Business Analysts
Name | City | State
Dave | San Jose | CA
John | Boulder | CO
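The two views above are logical: each is a transformation applied when the raw rows are read, not a second copy of the data. A minimal Python sketch of that idea follows; the function names are hypothetical, and in Drill these would be SQL views rather than Python generators.

```python
# Raw rows as shown for /raw/cards.csv
RAW_CARDS = [
    {"Name": "Dave", "City": "San Jose", "State": "CA",
     "Credit Card #": "1374-7914-3865-4817"},
    {"Name": "John", "City": "Boulder", "State": "CO",
     "Credit Card #": "1374-9735-1794-9711"},
]


def mask_card(number):
    """Keep the first group of digits and replace the rest with 1111, as on the slide."""
    first, *rest = number.split("-")
    return "-".join([first] + ["1111"] * len(rest))


def data_scientist_view(rows):
    """Column-level masking: same columns, card number masked on read."""
    for row in rows:
        yield {**row, "Credit Card #": mask_card(row["Credit Card #"])}


def business_analyst_view(rows):
    """Column-level restriction: the card number column is not exposed at all."""
    for row in rows:
        yield {k: v for k, v in row.items() if k != "Credit Card #"}


scientist_rows = list(data_scientist_view(RAW_CARDS))
analyst_rows = list(business_analyst_view(RAW_CARDS))
print(scientist_rows[0])
print(analyst_rows[0])
```

Because both views are computed from the raw rows at read time, a change to the raw file is visible through them immediately and there is nothing to keep in sync.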
Ownership Chaining
Combine Self Service Exploration with Data Governance
Raw File (/raw/cards.csv): John (Owner)
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-7914-3865-4817
John | Boulder | CO | 1374-9735-1794-9711
Data Scientist (/views/V_Scientist): John (Owner), Jane (Read)
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-1111-1111-1111
John | Boulder | CO | 1374-1111-1111-1111
Analyst (/views/V_Analyst): Jane (Owner), Jack (Read)
Name | City | State
Dave | San Jose | CA
John | Boulder | CO
Access path: V_Analyst -> V_Scientist -> RAWFILE
Jack queries the view V_Analyst
• Does Jack have access to V_Analyst? -> YES
• Who is the owner of V_Analyst? -> Jane
• Drill accesses V_Analyst as Jane (Impersonation hop 1)
• Does Jane have access to V_Scientist? -> YES
• Who is the owner of V_Scientist? -> John
• Drill accesses V_Scientist as John (Impersonation hop 2)
• Does John have permissions on raw file? -> YES
• Who is the owner of raw file? -> John
• Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable
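The walk-through above can be sketched as a short loop. This Python model is an illustration only, not Drill's implementation: the `OBJECTS` table, `resolve`, and the `max_hops` parameter name are hypothetical, and permissions are reduced to an owner plus a set of readers.

```python
# Owner, readers and underlying source for each object in the example
OBJECTS = {
    "V_Analyst":   {"owner": "Jane", "readers": {"Jack"}, "source": "V_Scientist"},
    "V_Scientist": {"owner": "John", "readers": {"Jane"}, "source": "RAWFILE"},
    "RAWFILE":     {"owner": "John", "readers": set(),    "source": None},
}


def resolve(user, name, max_hops=3):
    """Walk the ownership chain starting at `name` on behalf of `user`.

    Returns a list of (object, acting_user) pairs, or raises PermissionError
    if access is denied or the chain needs more than `max_hops` impersonations.
    """
    path, hops = [], 0
    while name is not None:
        obj = OBJECTS[name]
        if user != obj["owner"] and user not in obj["readers"]:
            raise PermissionError(f"{user} cannot read {name}")
        if user != obj["owner"]:
            # Impersonation hop: continue down the chain as the object's owner
            hops += 1
            if hops > max_hops:
                raise PermissionError("ownership chain too long")
            user = obj["owner"]
        path.append((name, user))
        name = obj["source"]
    return path


chain = resolve("Jack", "V_Analyst")
print(chain)
```

Jack never needs permissions on V_Scientist or the raw file; each object is read as its owner, and the hop limit bounds how far that delegation can go.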
Security Summary
• Logical
– No physical data copies/silos
• Granular
– Row level and column level security controls
• De-centralized
– User impersonation respecting storage system permissions
– No separate permission repository for granular controls
– Integrated with Hadoop File System permissions and LDAP
• Self-service w/ governance
– If you have access to data, you control who else can access it and how widely
– Audits
Using Drill with Yelp
Business dataset
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true,
      "intimate": false,
      "touristy": false,
      "hipster": false,
      "classy": true,
      "trendy": false,
      "casual": false
    },
    "Good For": {"dessert": false, "latenight": false, "lunch": false,
                 "dinner": true, "breakfast": false, "brunch": false}
  }
}
Zero to Results in 2 minutes
Install
$ tar -xvzf apache-drill-1.9.0.tar.gz
$ cd apache-drill-1.9.0
Launch shell (embedded mode), using either command
$ bin/drill-embedded
$ bin/sqlline -u jdbc:drill:zk=local
Query files and directories
> SELECT state, city, count(*) AS businesses
  FROM dfs.yelp.`business.json`
  GROUP BY state, city
  ORDER BY businesses DESC LIMIT 10;
Results
+------------+------------+-------------+
| state      | city       | businesses  |
+------------+------------+-------------+
| NV         | Las Vegas  | 12021       |
| AZ         | Phoenix    | 7499        |
| AZ         | Scottsdale | 3605        |
| EDH        | Edinburgh  | 2804        |
| AZ         | Mesa       | 2041        |
| AZ         | Tempe      | 2025        |
| NV         | Henderson  | 1914        |
| AZ         | Chandler   | 1637        |
| WI         | Madison    | 1630        |
| AZ         | Glendale   | 1196        |
+------------+------------+-------------+
Directories are implicit partitions
SELECT dir0, SUM(amount)
FROM sales
WHERE dir1 IN ('q1', 'q2')
GROUP BY dir0;
sales
├── 2014
│ ├── q1
│ ├── q2
│ ├── q3
│ └── q4
└── 2015
└── q1
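To show where `dir0` and `dir1` come from, the Python sketch below derives them from each file's path under the `sales` root and then filters on the quarter and sums per year. The file names and amounts are made up for illustration, and `with_dir_columns` is a hypothetical helper, not part of Drill.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def with_dir_columns(root, file_path, row):
    """Attach dir0, dir1, ... columns taken from the directories between root and the file."""
    rel = PurePosixPath(file_path).relative_to(root)
    dirs = {f"dir{i}": part for i, part in enumerate(rel.parts[:-1])}
    return {**row, **dirs}


# Hypothetical files laid out like the tree above, each holding a few rows
FILES = {
    "sales/2014/q1/a.json": [{"amount": 10}, {"amount": 5}],
    "sales/2014/q2/a.json": [{"amount": 7}],
    "sales/2014/q3/a.json": [{"amount": 100}],
    "sales/2015/q1/a.json": [{"amount": 3}],
}

# Keep only quarters q1 and q2 (dir1), then total the amounts per year (dir0)
totals = defaultdict(int)
for path, rows in FILES.items():
    for row in rows:
        record = with_dir_columns("sales", path, row)
        if record["dir1"] in ("q1", "q2"):
            totals[record["dir0"]] += record["amount"]

print(dict(totals))
```

The directory names never appear inside the files; they become columns only because of where each file sits in the tree.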
Intuitive SQL Access to Complex Data
-- It's Friday 10pm in Vegas and we're looking for hummus
> SELECT name, stars, b.hours.Friday friday, categories
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Friday.`open` < '22:00' AND
        b.hours.Friday.`close` > '22:00' AND
        REPEATED_CONTAINS(categories, 'Mediterranean') AND
        city = 'Las Vegas'
  ORDER BY stars DESC
  LIMIT 2;
+------------+------------+------------+------------+
| name | stars | friday | categories |
+------------+------------+------------+------------+
| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+------------+------------+------------+------------+
Query data with any level of nesting
Reviews dataset
{
"votes": {"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything ...",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
ANSI SQL Compatibility
-- Get top cool-rated businesses
> SELECT b.name FROM dfs.yelp.`business.json` b
  WHERE b.business_id IN
    (SELECT r.business_id FROM dfs.yelp.`review.json` r
     GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000
     ORDER BY SUM(r.votes.cool) DESC);
+------------+
| name |
+------------+
| Earl of Sandwich |
| XS Nightclub |
| The Cosmopolitan of Las Vegas |
| Wicked Spoon |
+------------+
Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub-queries, SQL data types)
Logical Views
-- Create a view combining business and reviews datasets
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
SELECT b.name, b.stars, r.votes.funny,
r.votes.useful, r.votes.cool, r.`date`
FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
WHERE r.business_id = b.business_id;
+------------+------------+
| ok | summary |
+------------+------------+
| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
+------------+------------+
> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
+------------+
| Total |
+------------+
| 1125458 |
+------------+
Lightweight file system based views for granular and de-centralized data management
Materialized Views AKA Tables
> ALTER SESSION SET `store.format` = 'parquet';
> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
SELECT b.name, b.stars, r.votes.funny funny,
r.votes.useful useful, r.votes.cool cool, r.`date`
FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
WHERE r.business_id = b.business_id;
+------------+---------------------------+
| Fragment | Number of records written |
+------------+---------------------------+
| 1_0 | 176448 |
| 1_1 | 192439 |
| 1_2 | 198625 |
| 1_3 | 200863 |
| 1_4 | 181420 |
| 1_5 | 175663 |
+------------+---------------------------+
Save analysis results as tables using familiar CTAS syntax
Repeated Values Support
-- Flatten repeated categories
> SELECT name, categories
FROM dfs.yelp.`business.json` LIMIT 3;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Pine Cone Restaurant | ["Restaurants"] |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+------------+------------+
> SELECT name, FLATTEN(categories) AS categories
FROM dfs.yelp.`business.json` LIMIT 5;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Pine Cone Restaurant | Restaurants |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
+------------+------------+
Dynamically flatten repeated and nested data elements as part of SQL queries. No ETL necessary
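What FLATTEN does to a repeated column can be modeled in a few lines of Python: each element of the array produces its own output row, with the other columns repeated. This is a sketch of the semantics using the three rows shown above, not Drill code; `flatten` is a hypothetical helper.

```python
def flatten(rows, column):
    """Yield one output row per element of the repeated `column`."""
    for row in rows:
        for element in row[column]:
            yield {**row, column: element}


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flat = list(flatten(businesses, "categories"))
for row in flat:
    print(row["name"], "|", row["categories"])
```

Three input rows with two, one and two categories become five output rows, which is the shape of the second result table above.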
Checkins dataset
{
"checkin_info":{
"3-4":1,
"13-5":1,
"6-6":1,
"14-5":1,
"14-6":1,
"14-2":1,
"14-3":1,
"19-0":1,
"11-5":1,
"13-2":1,
"11-6":2,
"11-3":1,
"12-6":1,
"6-5":1,
"5-5":1,
"9-2":1,
"9-5":1,
"9-6":1,
"5-2":1,
"7-6":1,
"7-5":1,
"7-4":1,
"17-5":1,
"8-5":1,
"10-2":1,
"10-5":1,
"10-6":1
},
"type":"checkin",
"business_id":"JwUE5GmEO-sH1FuwJgKBlQ"
}
Supports Dynamic / Unknown Columns
> SELECT KVGEN(checkin_info) checkins
FROM dfs.yelp.`checkin.json` LIMIT 1;
+------------+
| checkins |
+------------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},
{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},
{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},
{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},
{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},
{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},
{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
+------------+
> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM
dfs.yelp.`checkin.json` limit 6;
+------------+
| checkins |
+------------+
| {"key":"3-4","value":1} |
| {"key":"13-5","value":1} |
| {"key":"6-6","value":1} |
| {"key":"14-5","value":1} |
| {"key":"14-6","value":1} |
| {"key":"14-2","value":1} |
+------------+
Convert Map with a wide set of dynamic columns into an array of key-value pairs
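The KVGEN step, and the FLATTEN applied on top of it, can be modeled in Python as follows. This is a sketch of the semantics on a shortened copy of the check-in record, not Drill code; `kvgen` and `flatten_kvgen` are hypothetical helpers.

```python
def kvgen(mapping):
    """Turn a map with arbitrary keys into a list of {"key", "value"} pairs."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


def flatten_kvgen(rows, column):
    """Emit one row per key-value pair of the map stored in `column`."""
    for row in rows:
        for pair in kvgen(row[column]):
            yield {"checkins": pair}


# Shortened version of the check-in record shown earlier
checkins = [
    {"checkin_info": {"3-4": 1, "13-5": 1, "6-6": 1, "11-6": 2},
     "type": "checkin",
     "business_id": "JwUE5GmEO-sH1FuwJgKBlQ"},
]

pairs = kvgen(checkins[0]["checkin_info"])
rows = list(flatten_kvgen(checkins, "checkin_info"))
total = sum(row["checkins"]["value"] for row in rows)
print(pairs)
print(total)
```

Once the unknown keys have become ordinary `key` and `value` fields, they can be filtered and aggregated like any other column, which is what makes the wide, dynamic map queryable.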
Resources
Drill is Top-Ranked SQL-on-Hadoop
Source: Gigaom Research, 2015
Key:
• Number indicates company’s relative strength across all vectors
• Size of ball indicates company’s relative strength along individual vector
“Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”
OJAI and MapR-DB
Where to find it…
– The source: https://github.com/ojai/ojai
– The site: http://ojai.github.io/
– Python bindings: https://github.com/mapr-demos/python-bindings
– JavaScript bindings: https://github.com/mapr-demos/js-bindings
Ready to play with your data?
– Download the sandbox: http://maprdb.io
– Examples:
• Java: https://github.com/mapr-demos/maprdb-ojai-101
• Python: https://github.com/mapr-demos/maprdb_python_examples
Drill Walkthrough
• Example queries
• Conversion from relational model to flat JSON model
https://www.mapr.com/blog/drilling-healthy-choices
https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql
Recommendations for Getting Started with Drill
New to Drill?
– Get started with Free MapR On Demand training
– Test Drive Drill on cloud with AWS
– Learn how to use Drill with Hadoop using MapR sandbox
Ready to play with your data?
– Try the “Apache Drill in 10 Minutes” guide on your desktop
– Download Drill for your cluster and start exploration
– Comprehensive tutorials and documentation available
Ask questions
– user@drill.apache.org
@kingmesal
jscott@mapr.com
Engage with us!
kingmesal

More Related Content

Evolving from RDBMS to NoSQL + SQL

  • 1. 1© 2016 MapR Technologies 1© 2016 MapR Technologies Evolving from RDBMS to NoSQL + SQL
  • 2. 2© 2016 MapR Technologies 2 Why Does this Matter • 90%+ of the use cases do not deal with “relational” data • RDBMS data models are more complex than a single table – One-to-many relationships require multiple tables – Creating code to persist data takes time and QA • Inferred (or removed) keys used without actual foreign keys – Difficult for others to understand relationships • Transactional tables never look the same as analytics tables – OLTP -> ETL -> OLAP – This takes significant time to build
  • 3. 3© 2016 MapR Technologies 3 Topics • Changing Data Models – Relations Model to JSON Model • A New Database for JSON Data – Document Database (OJAI) • Querying JSON Data and More – Drill • Resources
  • 4. 4© 2016 MapR Technologies 4 Empowering “as it happens” businesses by speeding up the data-to-action cycle
  • 5. 5© 2016 MapR Technologies 5© 2016 MapR Technologies© 2016 MapR Technologies Changing Data Models
  • 6. 6© 2016 MapR Technologies 6 180 Tables NOT SHOWN!
  • 7. 7© 2016 MapR Technologies 7 236 tables to describe 7 kinds of things
  • 8. 8© 2016 MapR Technologies 8 artist id gid name sort_name begin_date end_date ended type gender area being_area end_area comment list<ipi> list<isni> list<alias> list<release_id> list<recording_id> artist id gid name sort_name begin_date end_date ended type gender area being_area end_area comment list<ipi> list<isni> list<alias>
  • 9. 9© 2016 MapR Technologies 9 artist id gid name sort_name begin_date end_date ended type gender area being_area end_area comment list<ipi> list<isni> list<alias> list<release_id> list<recording_id>
  • 10. 10© 2016 MapR Technologies 10 Searching for Elvis // Find discs where Elvis was credited > SELECT distinct album_id, name FROM (SELECT id album_id, artist_id, name, FLATTEN(credit) FROM release) albums join (SELECT distinct artist_id FROM (SELECT id artist_id, FLATTEN(alias) FROM artist where name like 'Elvis%Presley’) ) artists USING artist_id;
  • 11. 11© 2016 MapR Technologies 11 Benefits • Extended relational model allows massive simplification – On a real example, we see >20x reduction in number of tables • Simplification drives improved introspection – This is good • Apache Drill gives very high performance execution for extended relational problems • You can try this out today
  • 12. 12© 2016 MapR Technologies 12© 2016 MapR Technologies© 2016 MapR Technologies A New Database for JSON Data
  • 13. 13© 2016 MapR Technologies 13 Basics of the API • http://ojai.github.io/ • Entry point to a table - DocumentStore – insert() – insertOrReplace() – find() – delete() – replace() – update() – increment()
  • 14. 14© 2016 MapR Technologies 14 Working with JSON in Java • Step 1 – Create instance of JSON Serializer Gson gson = new Gson(); • Step 2 – Serialize POJO to JSON String json = gson.toJson(myObject); • Step 3 – Deserialize JSON into POJO MyObject myObject = gson.fromJson(json, MyObject.class);
  • 15. 15© 2016 MapR Technologies 15 Creating Documents in Java OJAI • Use static methods on class org.ojai.json.Json Document doc = Json.newDocument(myObject); Document doc = Json.newDocument(jsonString); • Alternatively – Use builders – Stream from disk – Use InputStream
  • 16. 16© 2016 MapR Technologies 16 Creating New Documents • DocumentStore.insert(doc) Done! • DocumentStore.insertOrReplace(doc) Done! Easy right?
  • 17. 17© 2016 MapR Technologies 17 Updating Existing Documents • DocumentStore.update(_id, DocumentMutation) • Mutation methods – mutation.append(FieldPath, “user visited URL”); – mutation.set(“field.name”, “What a great example”); – mutation.increment(“field”, 1); – mutation.merge(“field”, Map<String, Object>); – mutation.setOrReplace(…); – mutation.delete(field); Yes, these are atomic.
  • 18. 18© 2016 MapR Technologies 18 Deleting Documents • DocumentStore.delete(doc); Done! • DocumentStore.delete(_id); Done! This is easy too, right?
  • 19. 19© 2016 MapR Technologies 19 Finding Documents • DocumentStore.find(QueryCondition); • Query condition setup: – qc.is(“field”, EQUAL, “blue”) .and().notExists(“other.field”) .or().like(“field”, “%purple”) .or().matches(“another.field”, “regular expression”)
  • 20. 20© 2016 MapR Technologies 20© 2016 MapR Technologies© 2016 MapR Technologies Querying JSON Data and More
  • 21. 21© 2016 MapR Technologies 21 How to Bring SQL to Non-Relational Data Stores? Familiarity of SQL Agility of NoSQL • ANSI SQL semantics • BI (Tableau, MicroStrategy, etc.) • Low latency • No schema management – HDFS (Parquet, JSON, etc.) – HBase – … • No transformation – No silos of data • Ease of use
  • 22. 22© 2016 MapR Technologies 22 Drill Supports Schema Discovery On-The-Fly • Fixed schema • Leverage schema in centralized repository (Hive Metastore) • Fixed schema, evolving schema or schema-less • Leverage schema in centralized repository or self-describing data 2Schema Discovered On-The-FlySchema Declared In Advance SCHEMA ON WRITE SCHEMA BEFORE READ SCHEMA ON THE FLY
  • 23. 23© 2016 MapR Technologies 23 Drill’s Data Model is Flexible JSON BSON HBase Parquet Avro CSV TSV Dynamic schema Fixed schema Complex Flat Flexibility Name Gender Age Michael M 6 Jennifer F 3 { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos } { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC } RDBMS/SQL-on-Hadoop table Apache Drill table Flexibility
  • 24. 24© 2016 MapR Technologies 24 Enabling “As-It-Happens” Business with Instant Analytics Hadoop data Data modeling Transformation Data movement (optional) Users Hadoop data Users Traditional approach Exploratory approach New Business questionsSource data evolution Total time to insight: weeks to months Total time to insight: minutes
  • 25. 25© 2016 MapR Technologies 25 Evolution Towards Self-Service Data Exploration Data Modeling and Transformation Data Visualization IT-driven IT-driven IT-driven Self-service IT-driven Self-service Optional Self-service Traditional BI w/ RDBMS Self-Service BI w/ RDBMS SQL-on-Hadoop Self-Service Data Exploration Zero-day analytics
  • 26. 26© 2016 MapR Technologies 26 Common Use Cases Raw Data Exploration JSON Analytics DWH offload Hive HBaseFiles Directories … {JSON}, Parquet Text Files …
  • 27. 27© 2016 MapR Technologies 27 - Sub-directory - HBase namespace - Hive database Drill Enables ‘SQL-on-Everything’ SELECT * FROM dfs.yelp.`business.json` Workspace - Pathnames - Hive table - HBase table Table - DFS (Text, Parquet, JSON) - HBase/MapR-DB - Hive Metastore/HCatalog - Easy API to go beyond Hadoop Storage plugin instance
  • 28. 28© 2016 MapR Technologies 28 Reuse Existing SQL Tools and Skills Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data
  • 29. 29© 2016 MapR Technologies 29© 2016 MapR Technologies© 2016 MapR Technologies Security Controls
  • 30. 30© 2016 MapR Technologies 30 Access Controls that Scale PAM Authentication + User Impersonation Fine-grained row and column level access control with Drill Views – no centralized security repository required Files HBase Hive Drill View 1 Drill View 2 UUU U U
  • 31. 31© 2016 MapR Technologies 31 Granular Security via Drill Views Name City State Credit Card # Dave San Jose CA 1374-7914-3865-4817 John Boulder CO 1374-9735-1794-9711 Raw File (/raw/cards.csv) Owner Admins Permission Admins Business Analyst Data Scientist Name City State Credit Card # Dave San Jose CA 1374-1111-1111-1111 John Boulder CO 1374-1111-1111-1111 Data Scientist View (/views/maskedcards.csv) Not a physical data copy Name City State Dave San Jose CA John Boulder CO Business Analyst View Owner Admins Permission Business Analysts Owner Admins Permission Data Scientists
  • 32. 32© 2016 MapR Technologies 32 Ownership Chaining Combine Self Service Exploration with Data Governance Name City State Credit Card # Dave San Jose CA 1374-7914-3865-4817 John Boulder CO 1374-9735-1794-9711 Raw File (/raw/cards.csv) Name City State Credit Card # Dave San Jose CA 1374-1111-1111-1111 John Boulder CO 1374-1111-1111-1111 Data Scientist (/views/V_Scientist) Jane (Read) John (Owner) Name City State Dave San Jose CA John Boulder CO Analyst(/views/V_Analyst) Jack (Read) Jane(Owner) RAWFILEV_ScientistV_Analyst Does Jack have access to V_Analyst? ->YES Who is the owner of V_Analyst? ->Jane Drill accesses V_Analyst as Jane (Impersonation hop 1) Does Jane have access to V_Scientist ? -> YES Who is the owner of V_Scientist? ->John Drill accesses V_Scientist as John (Impersonation hop 2) John(Owner) Does John have permissions on raw file? -> YES Who is the owner of raw file? ->John Drill accesses source file as John (no impersonation here) Jack queries the view V_Analyst *Ownership chain length (# hops) is configurable Ownership chaining Access path
  • 33. 33© 2016 MapR Technologies 33 Security Summary • Logical – No physical data copies/silos • Granular – Row level and column level security controls • De-centralized – User impersonation respecting storage system permissions – No separate permission repository for granular controls – Integrated with Hadoop File System permissions and LDAP • Self-service w/ governance – If you have access to data, you control who and how widely can access it – Audits
  • 34. 34© 2016 MapR Technologies 34© 2016 MapR Technologies© 2016 MapR Technologies Using Drill with Yelp
  • 35. 35© 2016 MapR Technologies 35 Business dataset { "business_id": "4bEjOyTaDG24SY5TxsaUNQ", "full_address": "3655 Las Vegas Blvd SnThe StripnLas Vegas, NV 89109", "hours": { "Monday": {"close": "23:00", "open": "07:00"}, "Tuesday": {"close": "23:00", "open": "07:00"}, "Friday": {"close": "00:00", "open": "07:00"}, "Wednesday": {"close": "23:00", "open": "07:00"}, "Thursday": {"close": "23:00", "open": "07:00"}, "Sunday": {"close": "23:00", "open": "07:00"}, "Saturday": {"close": "00:00", "open": "07:00"} }, "open": true, "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"], "city": "Las Vegas", "review_count": 4084, "name": "Mon Ami Gabi", "neighborhoods": ["The Strip"], "longitude": -115.172588519464, "state": "NV", "stars": 4.0, "attributes": { "Alcohol": "full_bar”, "Noise Level": "average", "Has TV": false, "Attire": "casual", "Ambience": { "romantic": true, "intimate": false, "touristy": false, "hipster": false, "classy": true, "trendy": false, "casual": false }, "Good For": {"dessert": false, "latenight": false, "lunch": false, "dinner": true, "breakfast": false, "brunch": false}, } }
  • 36. 36© 2016 MapR Technologies 36 Zero to Results in 2 minutes $ tar -xvzf apache-drill-1.9.0.tar.gz $ bin/sqlline -u jdbc:drill:zk=local $ bin/drill-embedded > SELECT state, city, count(*) AS businesses FROM dfs.yelp.`business.json` GROUP BY state, city ORDER BY businesses DESC LIMIT 10; +------------+------------+-------------+ | state | city | businesses | +------------+------------+-------------+ | NV | Las Vegas | 12021 | | AZ | Phoenix | 7499 | | AZ | Scottsdale | 3605 | | EDH | Edinburgh | 2804 | | AZ | Mesa | 2041 | | AZ | Tempe | 2025 | | NV | Henderson | 1914 | | AZ | Chandler | 1637 | | WI | Madison | 1630 | | AZ | Glendale | 1196 | +------------+------------+-------------+ Install Query files and directories Results Launch shell (embedded mode)
  • 37. 37© 2016 MapR Technologies 37 Directories are implicit partitions SELECT dir0, SUM(amount) FROM sales GROUP BY dir1 IN (q1, q2) sales ├── 2014 │ ├── q1 │ ├── q2 │ ├── q3 │ └── q4 └── 2015 └── q1
  • 38. 38© 2016 MapR Technologies 38 Intuitive SQL Access to Complex Data // It’s Friday 10pm in Vegas and looking for Hummus > SELECT name, stars, b.hours.Friday friday, categories FROM dfs.yelp.`business.json` b WHERE b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00' AND REPEATED_CONTAINS(categories, 'Mediterranean') AND city = 'Las Vegas' ORDER BY stars DESC LIMIT 2; +------------+------------+------------+------------+ | name | stars | friday | categories | +------------+------------+------------+------------+ | Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] | | Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] | +------------+------------+------------+------------+ Query data with any levels of nesting
  • 39. 39© 2016 MapR Technologies 39 Reviews dataset { "votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything ...", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA" }
  • 40. 40© 2016 MapR Technologies 40 ANSI SQL Compatibility //Get top cool rated businesses  SELECT b.name from dfs.yelp.`business.json` b WHERE b.business_id IN (SELECT r.business_id FROM dfs.yelp.`review.json` r GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000 ORDER BY SUM(r.votes.cool) DESC); +------------+ | name | +------------+ | Earl of Sandwich | | XS Nightclub | | The Cosmopolitan of Las Vegas | | Wicked Spoon | +------------+ Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub- queries, SQL data types)
  • 41. Logical Views // Create a view combining business and reviews datasets > CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS SELECT b.name, b.stars, r.votes.funny, r.votes.useful, r.votes.cool, r.`date` FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r WHERE r.business_id = b.business_id; +------------+------------+ | ok | summary | +------------+------------+ | true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema | +------------+------------+ > SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews; +------------+ | Total | +------------+ | 1125458 | +------------+ Lightweight file-system-based views for granular and decentralized data management
  • 42. Materialized Views AKA Tables > ALTER SESSION SET `store.format` = 'parquet'; > CREATE TABLE dfs.yelp.BusinessReviewsTbl AS SELECT b.name, b.stars, r.votes.funny funny, r.votes.useful useful, r.votes.cool cool, r.`date` FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r WHERE r.business_id = b.business_id; +------------+---------------------------+ | Fragment | Number of records written | +------------+---------------------------+ | 1_0 | 176448 | | 1_1 | 192439 | | 1_2 | 198625 | | 1_3 | 200863 | | 1_4 | 181420 | | 1_5 | 175663 | +------------+---------------------------+ Save analysis results as tables using familiar CTAS syntax
  • 43. Repeated Values Support // Flatten repeated categories > SELECT name, categories FROM dfs.yelp.`business.json` LIMIT 3; +------------+------------+ | name | categories | +------------+------------+ | Eric Goldberg, MD | ["Doctors","Health & Medical"] | | Pine Cone Restaurant | ["Restaurants"] | | Deforest Family Restaurant | ["American (Traditional)","Restaurants"] | +------------+------------+ > SELECT name, FLATTEN(categories) AS categories FROM dfs.yelp.`business.json` LIMIT 5; +------------+------------+ | name | categories | +------------+------------+ | Eric Goldberg, MD | Doctors | | Eric Goldberg, MD | Health & Medical | | Pine Cone Restaurant | Restaurants | | Deforest Family Restaurant | American (Traditional) | | Deforest Family Restaurant | Restaurants | +------------+------------+ Dynamically flatten repeated and nested data elements as part of SQL queries. No ETL necessary
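
What FLATTEN does to a repeated field can be shown with a short Python generator. This is a sketch of the row-expansion behaviour visible in the slide's output, not Drill's implementation; the function name `flatten` is chosen here for illustration, and the records are the three businesses from the slide.

```python
def flatten(records, field):
    """Yield one output row per element of the repeated `field`, copying the other columns."""
    for record in records:
        for value in record.get(field, []):
            # Replace the array with a single element; records whose array
            # is empty or missing produce no rows in this sketch.
            yield {**record, field: value}


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

for row in flatten(businesses, "categories"):
    print(row["name"], "|", row["categories"])
```

Three input records with five category values in total expand into five rows, matching the shape of the second result table on the slide.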
  • 44. Checkins dataset { "checkin_info":{ "3-4":1, "13-5":1, "6-6":1, "14-5":1, "14-6":1, "14-2":1, "14-3":1, "19-0":1, "11-5":1, "13-2":1, "11-6":2, "11-3":1, "12-6":1, "6-5":1, "5-5":1, "9-2":1, "9-5":1, "9-6":1, "5-2":1, "7-6":1, "7-5":1, "7-4":1, "17-5":1, "8-5":1, "10-2":1, "10-5":1, "10-6":1 }, "type":"checkin", "business_id":"JwUE5GmEO-sH1FuwJgKBlQ" }
  • 45. Supports Dynamic / Unknown Columns > SELECT KVGEN(checkin_info) checkins FROM dfs.yelp.`checkin.json` LIMIT 1; +------------+ | checkins | +------------+ | [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] | +------------+ > SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.yelp.`checkin.json` limit 6; +------------+ | checkins | +------------+ | {"key":"3-4","value":1} | | {"key":"13-5","value":1} | | {"key":"6-6","value":1} | | {"key":"14-5","value":1} | | {"key":"14-6","value":1} | | {"key":"14-2","value":1} | +------------+ Convert a map with a wide set of dynamic columns into an array of key-value pairs
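
The KVGEN step can be pictured as turning a map whose keys are not known in advance into a list of key/value records, which can then be expanded one pair per row. The Python below is a sketch of that transformation only; the names `kvgen` and `flatten_kvgen` are illustrative, and the sample record uses a few entries from the checkins slide with a shortened, hypothetical `business_id`.

```python
def kvgen(mapping: dict) -> list:
    """Turn a map with arbitrary keys into a list of {"key": ..., "value": ...} records."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


def flatten_kvgen(records, field):
    """Emit one key/value pair per row, as FLATTEN(KVGEN(field)) does on the slide."""
    for record in records:
        for pair in kvgen(record[field]):
            yield pair


checkins = [
    {"checkin_info": {"3-4": 1, "13-5": 1, "11-6": 2},
     "type": "checkin",
     "business_id": "example-id"},  # hypothetical id
]

pairs = list(flatten_kvgen(checkins, "checkin_info"))
print(pairs)
print(sum(pair["value"] for pair in pairs))
```

Once the dynamic keys are ordinary values in a `key` column, they can be filtered, grouped, and aggregated like any other column.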
  • 46. Resources
  • 47. Drill is Top-Ranked SQL-on-Hadoop Source: Gigaom Research, 2015 Key: • Number indicates company’s relative strength across all vectors • Size of ball indicates company’s relative strength along individual vector “Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”
  • 49. OJAI and MapR-DB Where to find it… – The source: https://github.com/ojai/ojai – The site: http://ojai.github.io/ – Python bindings: https://github.com/mapr-demos/python-bindings – JavaScript bindings: https://github.com/mapr-demos/js-bindings Ready to play with your data? – Download the sandbox: http://maprdb.io – Examples: • Java: https://github.com/mapr-demos/maprdb-ojai-101 • Python: https://github.com/mapr-demos/maprdb_python_examples
  • 50. Drill Walkthrough • Example queries • Conversion from relational model to flat JSON model https://www.mapr.com/blog/drilling-healthy-choices https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql
  • 51. Recommendations for Getting Started with Drill New to Drill? – Get started with free MapR On Demand training – Test-drive Drill in the cloud with AWS – Learn how to use Drill with Hadoop using the MapR sandbox Ready to play with your data? – Try out the “Apache Drill in 10 Minutes” guide on your desktop – Download Drill for your cluster and start exploring – Comprehensive tutorials and documentation available Ask questions – user@drill.apache.org
  • 52. @kingmesal jscott@mapr.com Engage with us! kingmesal

Editor's Notes

  1. Great news, I have 467 slides today …. Hahah… I’m just kidding… I only have 465…
  2. MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented dependability, ease of use, and world-record speed to Hadoop, NoSQL, database, and streaming applications in one unified distribution for Hadoop. MapR is a Hadoop distribution focused on delivering an enterprise-grade big data platform that supports mission-critical and real-time use cases.
  3. The database/datastore landscape is evolving to meet the new requirements. 2009 was the inflection point, with the rise of NoSchema systems in which applications control structure. Developers are being empowered, and they are voting for the agility offered by these systems. In the early days of this revolution we sacrificed the query language, and with it the ability to leverage the knowledge and tools available to millions of people. We’re changing that with a distributed SQL engine. But when we do that, we have to keep in mind that this transition to a NoSchema world happened for a reason, and we don’t want to reintroduce the centralized, DBA-managed schema.
  4. All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON documents (with additional types). Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before. If you consider the four data models shown in the 2x2, all of them can be represented by the complex, no-schema model (JSON) because it is the most flexible. However, no other data model can be represented by the flat, fixed-schema model. Therefore, when using any SQL engine except Drill, the data has to be transformed before it is available to queries.
  5. IT-driven = months of delay, unnecessary work (data is no longer relevant, etc.). The so-what needs to be conveyed. Why does it matter that it’s not needed? 6 months -> 3 months -> 3 months -> day zero. So imagine now what you can get… Data agility is needed for business agility. >>> Stand still during slide, move in at the punchline (why does this matter to YOU)
  6. CSV vs JSON formatting
  7. It’s 10pm in Vegas and I Want Good Hummus!
  8. CSV vs JSON formatting
  9. CSV vs JSON formatting