Evolving from RDBMS to NoSQL + SQL

© 2016 MapR Technologies

Why Does This Matter?
• 90%+ of the use cases do not deal with “relational” data
• RDBMS data models are more complex than a single table
– One-to-many relationships require multiple tables
– Creating code to persist data takes time and QA
• Inferred (or removed) keys used without actual foreign keys
– Difficult for others to understand relationships
• Transactional tables never look the same as analytics tables
– OLTP -> ETL -> OLAP
– This takes significant time to build

Topics
• Changing Data Models
– Relational Model to JSON Model
• A New Database for JSON Data
– Document Database (OJAI)
• Querying JSON Data and More
– Drill
• Resources

Empowering “as it happens” businesses by speeding up the data-to-action cycle

Changing Data Models

180 tables NOT SHOWN!

236 tables to describe 7 kinds of things

artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
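
The list<...> fields above take the place of what would be separate child tables in the relational schema. A minimal Java sketch of that folding step, under stated assumptions: the class ArtistFolding, the AliasRow type and the fold method are hypothetical names chosen for illustration and are not part of MusicBrainz, OJAI or Drill.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: fold rows of a one-to-many child table into nested documents.
final class ArtistFolding {

    // One row of an assumed artist_alias child table (foreign key plus value).
    static final class AliasRow {
        final int artistId;
        final String alias;

        AliasRow(int artistId, String alias) {
            this.artistId = artistId;
            this.alias = alias;
        }
    }

    // Returns one document per artist id; the alias rows become an embedded list.
    static Map<Integer, Map<String, Object>> fold(Map<Integer, String> artistNames,
                                                  List<AliasRow> aliasRows) {
        // Group the child rows by their foreign key first.
        Map<Integer, List<String>> aliasesByArtist = new HashMap<>();
        for (AliasRow row : aliasRows) {
            aliasesByArtist.computeIfAbsent(row.artistId, k -> new ArrayList<>()).add(row.alias);
        }
        Map<Integer, Map<String, Object>> docs = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> artist : artistNames.entrySet()) {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("id", artist.getKey());
            doc.put("name", artist.getValue());
            // Artists without aliases still get an (empty) list.
            doc.put("alias", aliasesByArtist.getOrDefault(artist.getKey(), new ArrayList<>()));
            docs.put(artist.getKey(), doc);
        }
        return docs;
    }
}
```

The same grouping would apply to the other list fields (ipi, isni, release_id, recording_id); only one is shown to keep the sketch short.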

Searching for Elvis
// Find discs where Elvis was credited
> SELECT DISTINCT album_id, name FROM
    (SELECT id album_id, artist_id, name, FLATTEN(credit) FROM release) albums
  JOIN
    (SELECT DISTINCT artist_id FROM
      (SELECT id artist_id, FLATTEN(alias) FROM artist
       WHERE name LIKE 'Elvis%Presley')
    ) artists
  USING (artist_id);

Benefits
• Extended relational model allows massive simplification
– On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
– This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today

A New Database for JSON Data

Basics of the API
• http://ojai.github.io/
• Entry point to a table - DocumentStore
– insert()
– insertOrReplace()
– find()
– delete()
– replace()
– update()
– increment()

Working with JSON in Java
• Step 1 – Create instance of JSON Serializer
Gson gson = new Gson();
• Step 2 – Serialize POJO to JSON
String json = gson.toJson(myObject);
• Step 3 – Deserialize JSON into POJO
MyObject myObject = gson.fromJson(json, MyObject.class);

Creating Documents in Java OJAI
• Use static methods on class org.ojai.json.Json
Document doc = Json.newDocument(myObject);
Document doc = Json.newDocument(jsonString);
• Alternatively
– Use builders
– Stream from disk
– Use InputStream

Creating New Documents
• DocumentStore.insert(doc)
Done!
• DocumentStore.insertOrReplace(doc)
Done!
Easy, right?

Updating Existing Documents
• DocumentStore.update(_id, DocumentMutation)
• Mutation methods
– mutation.append(FieldPath, "user visited URL");
– mutation.set("field.name", "What a great example");
– mutation.increment("field", 1);
– mutation.merge("field", Map<String, Object>);
– mutation.setOrReplace(…);
– mutation.delete(field);
Yes, these are atomic.

Deleting Documents
• DocumentStore.delete(doc);
Done!
• DocumentStore.delete(_id);
Done!
This is easy too, right?

Finding Documents
• DocumentStore.find(QueryCondition);
• Query condition setup:
– qc.is("field", EQUAL, "blue")
    .and().notExists("other.field")
    .or().like("field", "%purple")
    .or().matches("another.field", "regular expression")

Querying JSON Data and More

How to Bring SQL to Non-Relational Data Stores?
Familiarity of SQL
• ANSI SQL semantics
• BI (Tableau, MicroStrategy, etc.)
• Low latency
Agility of NoSQL
• No schema management
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• No transformation
– No silos of data
• Ease of use

Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data

Drill’s Data Model is Flexible
Formats by flexibility (flat to complex structure, fixed to dynamic schema): JSON, BSON, HBase, Parquet, Avro, CSV, TSV

RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}

Enabling “As-It-Happens” Business with Instant Analytics
New business questions
Source data evolution

Traditional approach
Hadoop data -> Data modeling -> Transformation -> Data movement (optional) -> Users
Total time to insight: weeks to months

Exploratory approach
Hadoop data -> Users
Total time to insight: minutes

Evolution Towards Self-Service Data Exploration
                                Data Modeling and Transformation   Data Visualization
Traditional BI w/ RDBMS         IT-driven                          IT-driven
Self-Service BI w/ RDBMS        IT-driven                          Self-service
SQL-on-Hadoop                   IT-driven                          Self-service
Self-Service Data Exploration   Optional                           Self-service
Zero-day analytics

Common Use Cases
• Raw Data Exploration
• JSON Analytics
• DWH offload
Data sources shown: Hive, HBase, Files, Directories, {JSON}, Parquet, Text Files, …

Drill Enables ‘SQL-on-Everything’
SELECT * FROM dfs.yelp.`business.json`
• Storage plugin instance (dfs)
– DFS (Text, Parquet, JSON)
– HBase/MapR-DB
– Hive Metastore/HCatalog
– Easy API to go beyond Hadoop
• Workspace (yelp)
– Sub-directory
– HBase namespace
– Hive database
• Table (`business.json`)
– Pathnames
– Hive table
– HBase table

Reuse Existing SQL Tools and Skills
Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data

Security Controls

Access Controls that Scale
• PAM Authentication + User Impersonation
• Fine-grained row and column level access control with Drill Views; no centralized security repository required
(Diagram: users reach Files, HBase and Hive through Drill View 1 and Drill View 2.)

Granular Security via Drill Views
Raw File (/raw/cards.csv)
Owner: Admins
Permission: Admins
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist View (/views/maskedcards.csv)
Not a physical data copy
Owner: Admins
Permission: Data Scientists
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Business Analyst View
Owner: Admins
Permission: Business Analysts
Name  City      State
Dave  San Jose  CA
John  Boulder   CO
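
The masked values in the Data Scientist view keep the first group of digits and show 1111 for every later group. A minimal Java sketch of that masking rule, assuming dash-separated groups; the class and method names are hypothetical, and in Drill the equivalent expression would live in the view definition rather than in application code.

```java
// Hypothetical sketch of the masking rule visible in the Data Scientist view.
final class CardMask {

    // Keeps the first dash-separated group and replaces each later group with "1111".
    static String mask(String cardNumber) {
        String[] groups = cardNumber.split("-");
        StringBuilder masked = new StringBuilder(groups[0]);
        for (int i = 1; i < groups.length; i++) {
            masked.append("-1111");
        }
        return masked.toString();
    }
}
```

Because the view computes the masked value at query time, the raw file stays the only physical copy of the card numbers.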

Ownership Chaining
Combine Self Service Exploration with Data Governance
Raw File (/raw/cards.csv): John (Owner)
Dave  San Jose  CA  1374-7914-3865-4817
John  Boulder   CO  1374-9735-1794-9711

Data Scientist (/views/V_Scientist): John (Owner), Jane (Read)
Dave  San Jose  CA  1374-1111-1111-1111
John  Boulder   CO  1374-1111-1111-1111

Analyst (/views/V_Analyst): Jane (Owner), Jack (Read)
Name  City      State
Dave  San Jose  CA
John  Boulder   CO

Objects in the chain (ownership chaining and access path): RAWFILE, V_Scientist, V_Analyst

Jack queries the view V_Analyst
• Does Jack have access to V_Analyst? -> YES
– Who is the owner of V_Analyst? -> Jane
– Drill accesses V_Analyst as Jane (Impersonation hop 1)
• Does Jane have access to V_Scientist? -> YES
– Who is the owner of V_Scientist? -> John
– Drill accesses V_Scientist as John (Impersonation hop 2)
• Does John have permissions on raw file? -> YES
– Who is the owner of raw file? -> John
– Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable
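
The walk through the chain can be sketched in Java as follows. This is a simplified model of the checks listed on this slide and not Drill's implementation: the OwnershipChain and Resource classes, the resolve method and the maxHops parameter are all hypothetical names, and permissions are reduced to an owner plus a set of readers.

```java
import java.util.Set;

// Hypothetical model of ownership chaining; not Drill's actual implementation.
final class OwnershipChain {

    // A view or file: its owner, its readers, and the object it is built on (null for a raw file).
    static final class Resource {
        final String name;
        final String owner;
        final Set<String> readers;
        final Resource source;

        Resource(String name, String owner, Set<String> readers, Resource source) {
            this.name = name;
            this.owner = owner;
            this.readers = readers;
            this.source = source;
        }

        boolean canRead(String user) {
            return owner.equals(user) || readers.contains(user);
        }
    }

    // Returns the number of impersonation hops used, or throws if access is denied
    // or the chain needs more than maxHops hops.
    static int resolve(String user, Resource target, int maxHops) {
        int hops = 0;
        String actingAs = user;
        for (Resource r = target; r != null; r = r.source) {
            if (!r.canRead(actingAs)) {
                throw new SecurityException(actingAs + " cannot read " + r.name);
            }
            // A view is expanded as its owner; switching identity costs one hop.
            if (r.source != null && !r.owner.equals(actingAs)) {
                hops++;
                if (hops > maxHops) {
                    throw new SecurityException("ownership chain exceeds " + maxHops + " hops");
                }
                actingAs = r.owner;
            }
        }
        return hops;
    }
}
```

With the objects from the slide (V_Analyst owned by Jane and readable by Jack, V_Scientist owned by John and readable by Jane, the raw file owned by John), resolving Jack's query uses two hops, and the final access to the raw file happens as John without a further hop.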

Security Summary
• Logical
– No physical data copies/silos
• Granular
– Row level and column level security controls
• De-centralized
– User impersonation respecting storage system permissions
– No separate permission repository for granular controls
– Integrated with Hadoop File System permissions and LDAP
• Self-service w/ governance
– If you have access to data, you control who can access it and how widely
– Audits

Using Drill with Yelp

Business dataset
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true,
      "intimate": false,
      "touristy": false,
      "hipster": false,
      "classy": true,
      "trendy": false,
      "casual": false
    },
    "Good For": {"dessert": false, "latenight": false, "lunch": false,
                 "dinner": true, "breakfast": false, "brunch": false}
  }
}

Zero to Results in 2 minutes
Install
$ tar -xvzf apache-drill-1.9.0.tar.gz
Launch shell (embedded mode), using either command
$ bin/sqlline -u jdbc:drill:zk=local
$ bin/drill-embedded
Query files and directories
> SELECT state, city, count(*) AS businesses
  FROM dfs.yelp.`business.json`
  GROUP BY state, city
  ORDER BY businesses DESC LIMIT 10;
Results
+-------+------------+------------+
| state | city       | businesses |
+-------+------------+------------+
| NV    | Las Vegas  | 12021      |
| AZ    | Phoenix    | 7499       |
| AZ    | Scottsdale | 3605       |
| EDH   | Edinburgh  | 2804       |
| AZ    | Mesa       | 2041       |
| AZ    | Tempe      | 2025       |
| NV    | Henderson  | 1914       |
| AZ    | Chandler   | 1637       |
| WI    | Madison    | 1630       |
| AZ    | Glendale   | 1196       |
+-------+------------+------------+

Directories are implicit partitions
SELECT dir0, SUM(amount)
FROM sales
WHERE dir1 IN ('q1', 'q2')
GROUP BY dir0
sales
├── 2014
│   ├── q1
│   ├── q2
│   ├── q3
│   └── q4
└── 2015
    └── q1
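
In the query above, dir0 and dir1 stand for the first and second directory levels below sales. A minimal Java sketch of that mapping, assuming a path relative to the queried directory; the DirColumns class, its of method and the file name used in the usage note are hypothetical, and this only mimics the naming scheme rather than Drill's own code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: derive dir0, dir1, ... values from a file's relative path.
final class DirColumns {

    // Every path element except the last (the file itself) becomes one dirN value.
    static Map<String, String> of(String relativePath) {
        String[] parts = relativePath.split("/");
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < parts.length - 1; i++) {
            columns.put("dir" + i, parts[i]);
        }
        return columns;
    }
}
```

For a made-up file at 2014/q1/data.csv under sales, this yields dir0 = 2014 and dir1 = q1, which is what lets the WHERE clause restrict the scan to the q1 and q2 directories.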

Intuitive SQL Access to Complex Data
// It’s Friday 10pm in Vegas and we’re looking for hummus
> SELECT name, stars, b.hours.Friday friday, categories
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Friday.`open` < '22:00' AND
        b.hours.Friday.`close` > '22:00' AND
        REPEATED_CONTAINS(categories, 'Mediterranean') AND
        city = 'Las Vegas'
  ORDER BY stars DESC
  LIMIT 2;
+------------+------------+------------+------------+
| name | stars | friday | categories |
+------------+------------+------------+------------+
| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+------------+------------+------------+------------+
Query data with any levels of nesting

Reviews dataset
{
"votes": {"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything ...",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}

ANSI SQL Compatibility
// Get top cool-rated businesses
> SELECT b.name FROM dfs.yelp.`business.json` b
  WHERE b.business_id IN
    (SELECT r.business_id FROM dfs.yelp.`review.json` r
     GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000
     ORDER BY SUM(r.votes.cool) DESC);
+------------+
| name |
+------------+
| Earl of Sandwich |
| XS Nightclub |
| The Cosmopolitan of Las Vegas |
| Wicked Spoon |
+------------+
Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub-queries, SQL data types)

Logical Views
//Create a view combining business and reviews datasets
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
SELECT b.name, b.stars, r.votes.funny,
r.votes.useful, r.votes.cool, r.`date`
FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
WHERE r.business_id = b.business_id;
+------------+------------+
| ok | summary |
+------------+------------+
| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
+------------+------------+
> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
+------------+
| Total |
+------------+
| 1125458 |
+------------+
Lightweight file system based views for granular and de-centralized data management

Materialized Views AKA Tables
> ALTER SESSION SET `store.format` = 'parquet';
> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
SELECT b.name, b.stars, r.votes.funny funny,
r.votes.useful useful, r.votes.cool cool, r.`date`
FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
WHERE r.business_id = b.business_id;
+------------+---------------------------+
| Fragment | Number of records written |
+------------+---------------------------+
| 1_0 | 176448 |
| 1_1 | 192439 |
| 1_2 | 198625 |
| 1_3 | 200863 |
| 1_4 | 181420 |
| 1_5 | 175663 |
+------------+---------------------------+
Save analysis results as tables using familiar CTAS syntax

Checkins dataset
{
"checkin_info":{
"3-4":1,
"13-5":1,
"6-6":1,
"14-5":1,
"14-6":1,
"14-2":1,
"14-3":1,
"19-0":1,
"11-5":1,
"13-2":1,
"11-6":2,
"11-3":1,
"12-6":1,
"6-5":1,
"5-5":1,
"9-2":1,
"9-5":1,
"9-6":1,
"5-2":1,
"7-6":1,
"7-5":1,
"7-4":1,
"17-5":1,
"8-5":1,
"10-2":1,
"10-5":1,
"10-6":1
},
"type":"checkin",
"business_id":"JwUE5GmEO-sH1FuwJgKBlQ"
}

Supports Dynamic / Unknown Columns
> SELECT KVGEN(checkin_info) checkins
FROM dfs.yelp.`checkin.json` LIMIT 1;
+------------+
| checkins |
+------------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1}, … ,{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
+------------+
> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM
dfs.yelp.`checkin.json` limit 6;
+------------+
| checkins |
+------------+
| {"key":"3-4","value":1} |
| {"key":"13-5","value":1} |
| {"key":"6-6","value":1} |
| {"key":"14-5","value":1} |
| {"key":"14-6","value":1} |
| {"key":"14-2","value":1} |
+------------+
Convert Map with a wide set of dynamic columns into an array of key-value pairs
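
Conceptually, KVGEN turns one map with arbitrary keys into a list of uniform key/value records, and FLATTEN then emits one row per list element. A minimal Java sketch of the first step, mimicking the output shape shown above rather than Drill's implementation; the KvGenSketch class and its kvgen method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the KVGEN output shape; not Drill's implementation.
final class KvGenSketch {

    // Converts each map entry into a small record with fixed "key" and "value" fields.
    static <V> List<Map<String, Object>> kvgen(Map<String, V> map) {
        List<Map<String, Object>> pairs = new ArrayList<>();
        for (Map.Entry<String, V> entry : map.entrySet()) {
            Map<String, Object> pair = new LinkedHashMap<>();
            pair.put("key", entry.getKey());
            pair.put("value", entry.getValue());
            pairs.add(pair);
        }
        return pairs;
    }
}
```

Once every record has the same two field names, the data can be filtered and aggregated with ordinary SQL even though the original keys (such as "3-4" or "13-5") were not known in advance.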

Resources

Drill is Top-Ranked SQL-on-Hadoop
Source: Gigaom Research, 2015
Key:
• Number indicates company’s relative strength across all vectors
• Size of ball indicates company’s relative strength along individual vector
“Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”

OJAI and MapR-DB
Where to find it…
– The source: https://github.com/ojai/ojai
– The site: http://ojai.github.io/
– Python bindings: https://github.com/mapr-demos/python-bindings
– JavaScript bindings: https://github.com/mapr-demos/js-bindings
Ready to play with your data?
– Download the sandbox: http://maprdb.io
– Examples:
• Java: https://github.com/mapr-demos/maprdb-ojai-101
• Python: https://github.com/mapr-demos/maprdb_python_examples

Drill Walkthrough
• Example queries
– https://www.mapr.com/blog/drilling-healthy-choices
• Conversion from relational model to flat JSON model
– https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql

Recommendations for Getting Started with Drill
New to Drill?
– Get started with Free MapR On Demand training
– Test Drive Drill on cloud with AWS
– Learn how to use Drill with Hadoop using MapR sandbox
Ready to play with your data?
– Try out Apache Drill in 10 mins guide on your desktop
– Download Drill for your cluster and start exploration
– Comprehensive tutorials and documentation available
Ask questions
– user@drill.apache.org

@kingmesal
jscott@mapr.com
Engage with us!
kingmesal
