Evolving from RDBMS to NoSQL + SQL
- 1. © 2016 MapR Technologies
Evolving from RDBMS to NoSQL + SQL
- 2.
Why Does this Matter
• 90%+ of the use cases do not deal with “relational” data
• RDBMS data models are more complex than a single table
– One-to-many relationships require multiple tables
– Creating code to persist data takes time and QA
• Inferred (or removed) keys used without actual foreign keys
– Difficult for others to understand relationships
• Transactional tables never look the same as analytics tables
– OLTP -> ETL -> OLAP
– This takes significant time to build
- 3.
Topics
• Changing Data Models
– Relational Model to JSON Model
• A New Database for JSON Data
– Document Database (OJAI)
• Querying JSON Data and More
– Drill
• Resources
- 4.
Empowering “as it happens”
businesses by speeding up the
data-to-action cycle
- 5.
Changing Data Models
- 7.
236 tables
to describe 7 kinds of things
- 8.
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
- 9.
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
- 10.
Searching for Elvis
// Find discs where Elvis was credited
> SELECT distinct album_id, name FROM
(SELECT id album_id, artist_id, name, FLATTEN(credit) FROM release) albums
join
(SELECT distinct artist_id FROM
(SELECT id artist_id, FLATTEN(alias) FROM artist
where name like 'Elvis%Presley')
) artists
USING (artist_id);
- 11.
Benefits
• Extended relational model allows massive simplification
– On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
– This is good
• Apache Drill gives very high performance execution for extended
relational problems
• You can try this out today
- 12.
A New Database for JSON Data
- 13.
Basics of the API
• http://ojai.github.io/
• Entry point to a table - DocumentStore
– insert()
– insertOrReplace()
– find()
– delete()
– replace()
– update()
– increment()
- 14.
Working with JSON in Java
• Step 1 – Create instance of JSON Serializer
Gson gson = new Gson();
• Step 2 – Serialize POJO to JSON
String json = gson.toJson(myObject);
• Step 3 – Deserialize JSON into POJO
MyObject myObject = gson.fromJson(json, MyObject.class);
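The same three-step round trip can be sketched outside Java. The following is a minimal illustration using only the Python standard library; it is not Gson, and the `Artist` class and its field values are made up for the example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Artist:
    # Hypothetical stand-in for the POJO on the slide; the fields are illustrative.
    id: int
    name: str
    sort_name: str


# Step 1 has no direct equivalent here: the json module needs no serializer instance.
artist = Artist(id=1, name="Example Artist", sort_name="Artist, Example")

# Step 2: serialize the object to a JSON string.
json_text = json.dumps(asdict(artist))

# Step 3: deserialize the JSON string back into an object.
restored = Artist(**json.loads(json_text))

print(json_text)
print(restored == artist)
```

The point of the sketch is only that the object and its JSON form carry the same fields, so no hand-written persistence mapping is needed.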
- 15.
Creating Documents in Java OJAI
• Use static methods on class org.ojai.json.Json
Document doc = Json.newDocument(myObject);
Document doc = Json.newDocument(jsonString);
• Alternatively
– Use builders
– Stream from disk
– Use InputStream
- 16.
Creating New Documents
• DocumentStore.insert(doc)
Done!
• DocumentStore.insertOrReplace(doc)
Done!
Easy right?
- 17.
Updating Existing Documents
• DocumentStore.update(_id, DocumentMutation)
• Mutation methods
– mutation.append(FieldPath, “user visited URL”);
– mutation.set(“field.name”, “What a great example”);
– mutation.increment(“field”, 1);
– mutation.merge(“field”, Map<String, Object>);
– mutation.setOrReplace(…);
– mutation.delete(field);
Yes, these are atomic.
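To make the effect of these mutations concrete, here is a small Python model of what set, increment, append and delete do to a nested document. This is not the OJAI API: the helper names, the dotted-path handling and the sample document are assumptions for illustration, and the sketch does not model atomicity.

```python
def _parent(doc, path, create=False):
    """Walk a dotted path and return (parent_dict, last_key)."""
    *parents, last = path.split(".")
    node = doc
    for key in parents:
        # Either create missing intermediate maps or fail with KeyError.
        node = node.setdefault(key, {}) if create else node[key]
    return node, last


def set_field(doc, path, value):
    node, key = _parent(doc, path, create=True)
    node[key] = value


def increment(doc, path, delta):
    node, key = _parent(doc, path, create=True)
    node[key] = node.get(key, 0) + delta


def append(doc, path, item):
    node, key = _parent(doc, path, create=True)
    node.setdefault(key, []).append(item)


def delete(doc, path):
    try:
        node, key = _parent(doc, path)
    except KeyError:
        return  # Nothing to delete if an intermediate field is missing.
    node.pop(key, None)


# Hypothetical document; only the string values echo the slide.
doc = {"_id": "user001", "visits": 1, "profile": {"nickname": "old", "temp": True}}
set_field(doc, "profile.nickname", "What a great example")
increment(doc, "visits", 1)
append(doc, "history", "user visited URL")
delete(doc, "profile.temp")
print(doc)
```

Each helper changes only the addressed field, which mirrors why a mutation is cheaper than reading, modifying and rewriting the whole document.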
- 18.
Deleting Documents
• DocumentStore.delete(doc);
Done!
• DocumentStore.delete(_id);
Done!
This is easy too, right?
- 19.
Finding Documents
• DocumentStore.find(QueryCondition);
• Query condition setup:
– qc.is(“field”, EQUAL, “blue”)
.and().notExists(“other.field”)
.or().like(“field”, “%purple”)
.or().matches(“another.field”, “regular expression”)
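As a rough model of how such a condition is evaluated against documents, the sketch below builds predicates in plain Python. It is not the OJAI QueryCondition API: the function names are hypothetical, and the grouping of the and/or terms is an assumption, since the slide does not spell out precedence.

```python
import re

_MISSING = object()


def get_path(doc, path):
    """Return the value at a dotted path, or _MISSING if any step is absent."""
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def is_equal(path, value):
    return lambda doc: get_path(doc, path) == value


def not_exists(path):
    return lambda doc: get_path(doc, path) is _MISSING


def like(path, pattern):
    # Translate SQL LIKE wildcards: % is any run of characters, _ is one character.
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern
    )
    compiled = re.compile(regex, re.DOTALL)

    def check(doc):
        value = get_path(doc, path)
        return isinstance(value, str) and compiled.fullmatch(value) is not None

    return check


def all_of(*conditions):
    return lambda doc: all(c(doc) for c in conditions)


def any_of(*conditions):
    return lambda doc: any(c(doc) for c in conditions)


# Hypothetical documents.
docs = [
    {"field": "blue"},
    {"field": "blue", "other": {"field": 1}},
    {"field": "deep purple", "other": {"field": 1}},
    {"field": "red"},
]

# Assumed grouping: (field = blue AND other.field missing) OR field LIKE %purple
condition = any_of(
    all_of(is_equal("field", "blue"), not_exists("other.field")),
    like("field", "%purple"),
)
matches = [d for d in docs if condition(d)]
print(matches)
```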
- 20.
Querying JSON Data and More
- 21.
How to Bring SQL to Non-Relational Data Stores?
Familiarity of SQL
• ANSI SQL semantics
• BI (Tableau, MicroStrategy, etc.)
• Low latency
Agility of NoSQL
• No schema management
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• No transformation
– No silos of data
• Ease of use
- 22.
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
- 23.
Drill’s Data Model is Flexible
Formats shown: JSON, BSON, HBase, Parquet, Avro, CSV, TSV
Axes (flexibility increases along both): Flat to Complex, Fixed schema to Dynamic schema
RDBMS/SQL-on-Hadoop table
Name | Gender | Age
Michael | M | 6
Jennifer | F | 3
Apache Drill table
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}
- 24.
Enabling “As-It-Happens” Business with Instant Analytics
Traditional approach: Hadoop data -> Data modeling -> Transformation -> Data movement (optional) -> Users
Total time to insight: weeks to months
Exploratory approach: Hadoop data -> Users
Total time to insight: minutes
New business questions / Source data evolution
- 25.
Evolution Towards Self-Service Data Exploration
Approach | Data Modeling and Transformation | Data Visualization
Traditional BI w/ RDBMS | IT-driven | IT-driven
Self-Service BI w/ RDBMS | IT-driven | Self-service
SQL-on-Hadoop | IT-driven | Self-service
Self-Service Data Exploration | Optional | Self-service
Zero-day analytics
- 26.
Common Use Cases
Use cases: Raw Data Exploration, JSON Analytics, DWH offload
Data sources shown: Files, Directories, Hive, HBase, {JSON}, Parquet, Text Files, …
- 27.
Drill Enables ‘SQL-on-Everything’
SELECT * FROM dfs.yelp.`business.json`
Storage plugin instance (dfs)
- DFS (Text, Parquet, JSON)
- HBase/MapR-DB
- Hive Metastore/HCatalog
- Easy API to go beyond Hadoop
Workspace (yelp)
- Sub-directory
- HBase namespace
- Hive database
Table (`business.json`)
- Pathnames
- Hive table
- HBase table
- 28.
Reuse Existing SQL Tools and Skills
Leverage SQL-compatible tools
(BI, query builders, etc.) via Drill’s
standard ODBC, JDBC and ANSI
SQL support
Enable business analysts, technical
analysts and data scientists to
explore and analyze large volumes
of real-time data
- 29.
Security Controls
- 30.
Access Controls that Scale
PAM Authentication +
User Impersonation
Fine-grained row and
column level access control
with Drill Views – no
centralized security
repository required
[Diagram: users (U) query Drill View 1 and Drill View 2, which are defined over Files, HBase and Hive]
- 31.
Granular Security via Drill Views
Raw File (/raw/cards.csv)
Owner: Admins, Permission: Admins
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-7914-3865-4817
John | Boulder | CO | 1374-9735-1794-9711
Data Scientist View (/views/maskedcards.csv), not a physical data copy
Owner: Admins, Permission: Data Scientists
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-1111-1111-1111
John | Boulder | CO | 1374-1111-1111-1111
Business Analyst View
Owner: Admins, Permission: Business Analysts
Name | City | State
Dave | San Jose | CA
John | Boulder | CO
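The two views on this slide can be modeled as functions computed over the raw rows at query time, which is why no physical copy exists. The sketch below is plain Python rather than Drill view DDL; the function names are hypothetical and the masking rule (keep the first group, replace the rest with 1111) is inferred from the sample output shown on the slide.

```python
RAW_CARDS = [
    {"name": "Dave", "city": "San Jose", "state": "CA", "card": "1374-7914-3865-4817"},
    {"name": "John", "city": "Boulder", "state": "CO", "card": "1374-9735-1794-9711"},
]


def mask_card(number):
    """Keep the first group of digits and overwrite the remaining groups."""
    first_group = number.split("-")[0]
    return "-".join([first_group, "1111", "1111", "1111"])


def scientist_view(rows):
    # Column-level masking: same rows, card number obscured.
    for row in rows:
        yield {**row, "card": mask_card(row["card"])}


def analyst_view(rows):
    # Column-level restriction: the card column is not exposed at all.
    for row in rows:
        yield {key: value for key, value in row.items() if key != "card"}


for row in scientist_view(RAW_CARDS):
    print(row)
for row in analyst_view(RAW_CARDS):
    print(row)
```

Because both views are generators over `RAW_CARDS`, the raw data stays in one place and each audience only ever sees its own projection.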
- 32.
Ownership Chaining
Combine Self Service Exploration with Data Governance
Raw File (/raw/cards.csv): John (Owner)
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-7914-3865-4817
John | Boulder | CO | 1374-9735-1794-9711
Data Scientist (/views/V_Scientist): John (Owner), Jane (Read)
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-1111-1111-1111
John | Boulder | CO | 1374-1111-1111-1111
Analyst (/views/V_Analyst): Jane (Owner), Jack (Read)
Name | City | State
Dave | San Jose | CA
John | Boulder | CO
Access path: V_Analyst -> V_Scientist -> RAWFILE
Ownership chaining, when Jack queries the view V_Analyst:
Does Jack have access to V_Analyst? -> YES
Who is the owner of V_Analyst? -> Jane
Drill accesses V_Analyst as Jane (Impersonation hop 1)
Does Jane have access to V_Scientist? -> YES
Who is the owner of V_Scientist? -> John
Drill accesses V_Scientist as John (Impersonation hop 2)
Does John have permissions on raw file? -> YES
Who is the owner of raw file? -> John
Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable
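The access checks on this slide follow a simple loop, sketched below in Python. This is a model of the walkthrough, not Drill's implementation: the `OBJECTS` table, the `resolve` function and the `max_hops` parameter (standing in for the configurable chain length) are hypothetical names.

```python
# Each object records its owner, who else may read it, and what it is built on.
OBJECTS = {
    "V_Analyst": {"owner": "Jane", "readers": {"Jack"}, "source": "V_Scientist"},
    "V_Scientist": {"owner": "John", "readers": {"Jane"}, "source": "RAWFILE"},
    "RAWFILE": {"owner": "John", "readers": set(), "source": None},
}


def can_read(user, obj):
    return user == obj["owner"] or user in obj["readers"]


def resolve(user, name, max_hops=3):
    """Return ([(object, user it is accessed as), ...], impersonation hop count)."""
    hops = 0
    trace = []
    current_user = user
    while name is not None:
        obj = OBJECTS[name]
        if not can_read(current_user, obj):
            raise PermissionError(f"{current_user} cannot read {name}")
        if obj["owner"] != current_user:
            # Switching to the owner's identity counts as one impersonation hop.
            hops += 1
            if hops > max_hops:
                raise PermissionError("ownership chain exceeds the configured length")
            current_user = obj["owner"]
        trace.append((name, current_user))
        name = obj["source"]
    return trace, hops


trace, hops = resolve("Jack", "V_Analyst")
print(trace, hops)
```

Jack never needs permissions on V_Scientist or the raw file; a direct call such as `resolve("Jack", "V_Scientist")` raises `PermissionError` in this model.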
- 33.
Security Summary
• Logical
– No physical data copies/silos
• Granular
– Row level and column level security controls
• De-centralized
– User impersonation respecting storage system permissions
– No separate permission repository for granular controls
– Integrated with Hadoop File System permissions and LDAP
• Self-service w/ governance
– If you have access to data, you control who can access it and how widely
– Audits
- 34.
Using Drill with Yelp
- 35.
Business dataset
{
"business_id": "4bEjOyTaDG24SY5TxsaUNQ",
"full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
"hours": {
"Monday": {"close": "23:00", "open": "07:00"},
"Tuesday": {"close": "23:00", "open": "07:00"},
"Friday": {"close": "00:00", "open": "07:00"},
"Wednesday": {"close": "23:00", "open": "07:00"},
"Thursday": {"close": "23:00", "open": "07:00"},
"Sunday": {"close": "23:00", "open": "07:00"},
"Saturday": {"close": "00:00", "open": "07:00"}
},
"open": true,
"categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
"city": "Las Vegas",
"review_count": 4084,
"name": "Mon Ami Gabi",
"neighborhoods": ["The Strip"],
"longitude": -115.172588519464,
"state": "NV",
"stars": 4.0,
"attributes": {
"Alcohol": "full_bar",
"Noise Level": "average",
"Has TV": false,
"Attire": "casual",
"Ambience": {
"romantic": true,
"intimate": false,
"touristy": false,
"hipster": false,
"classy": true,
"trendy": false,
"casual": false
},
"Good For": {"dessert": false, "latenight": false, "lunch": false,
"dinner": true, "breakfast": false, "brunch": false}
}
}
- 36.
Zero to Results in 2 minutes
Install
$ tar -xvzf apache-drill-1.9.0.tar.gz
$ cd apache-drill-1.9.0
Launch shell (embedded mode), using either command
$ bin/sqlline -u jdbc:drill:zk=local
$ bin/drill-embedded
Query files and directories
> SELECT state, city, count(*) AS businesses
FROM dfs.yelp.`business.json`
GROUP BY state, city
ORDER BY businesses DESC LIMIT 10;
Results
+------------+------------+-------------+
| state      | city       | businesses  |
+------------+------------+-------------+
| NV         | Las Vegas  | 12021       |
| AZ         | Phoenix    | 7499        |
| AZ         | Scottsdale | 3605        |
| EDH        | Edinburgh  | 2804        |
| AZ         | Mesa       | 2041        |
| AZ         | Tempe      | 2025        |
| NV         | Henderson  | 1914        |
| AZ         | Chandler   | 1637        |
| WI         | Madison    | 1630        |
| AZ         | Glendale   | 1196        |
+------------+------------+-------------+
- 37.
Directories are implicit partitions
SELECT dir0, SUM(amount)
FROM sales
WHERE dir1 IN ('q1', 'q2')
GROUP BY dir0
sales
├── 2014
│ ├── q1
│ ├── q2
│ ├── q3
│ └── q4
└── 2015
└── q1
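The idea behind dir0 and dir1 is that each directory level under the queried root becomes a column. The Python sketch below models that mapping and then totals amounts per year for the q1 and q2 sub-directories. The file names and amounts are invented for the example, and the helper is a model of the concept rather than Drill code.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical files under the sales/ root, each with a few made-up amounts.
FILES = {
    "2014/q1/orders.csv": [100, 50],
    "2014/q2/orders.csv": [70],
    "2014/q3/orders.csv": [30],
    "2015/q1/orders.csv": [20, 5],
}


def partition_columns(relative_path):
    """Map each directory level to dir0, dir1, ... as implicit columns."""
    directories = PurePosixPath(relative_path).parent.parts
    return {f"dir{level}": name for level, name in enumerate(directories)}


totals = defaultdict(int)
for path, amounts in FILES.items():
    columns = partition_columns(path)
    # Restrict to the first two quarters, then group by year (dir0).
    if columns.get("dir1") in ("q1", "q2"):
        totals[columns["dir0"]] += sum(amounts)

print(dict(totals))
```

Filtering on a directory column means files outside the selected sub-directories (here 2014/q3) never have to be read.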
- 38.
Intuitive SQL Access to Complex Data
// It’s Friday 10pm in Vegas and looking for Hummus
> SELECT name, stars, b.hours.Friday friday, categories
FROM dfs.yelp.`business.json` b
WHERE b.hours.Friday.`open` < '22:00' AND
b.hours.Friday.`close` > '22:00' AND
REPEATED_CONTAINS(categories, 'Mediterranean') AND
city = 'Las Vegas'
ORDER BY stars DESC
LIMIT 2;
+------------+------------+------------+------------+
| name | stars | friday | categories |
+------------+------------+------------+------------+
| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+------------+------------+------------+------------+
Query data with any levels of nesting
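For readers less familiar with nested-field predicates, the same filter can be written against in-memory documents. This Python sketch only mirrors the logic of the query: the first two records echo the result rows on the slide, the third is invented, and `repeated_contains` approximates Drill's REPEATED_CONTAINS with exact list membership.

```python
def repeated_contains(values, keyword):
    """Approximation: true if the keyword is an element of the repeated field."""
    return keyword in values


businesses = [
    {"name": "Olives", "stars": 4.0, "city": "Las Vegas",
     "hours": {"Friday": {"close": "22:30", "open": "11:00"}},
     "categories": ["Mediterranean", "Restaurants"]},
    {"name": "Marrakech Moroccan Restaurant", "stars": 4.0, "city": "Las Vegas",
     "hours": {"Friday": {"close": "23:00", "open": "17:30"}},
     "categories": ["Mediterranean", "Middle Eastern", "Moroccan", "Restaurants"]},
    # Hypothetical record (not from the dataset) that the filter should reject.
    {"name": "Hypothetical Diner", "stars": 4.5, "city": "Las Vegas",
     "hours": {"Friday": {"close": "21:00", "open": "07:00"}},
     "categories": ["Diners", "Restaurants"]},
]


def open_friday_10pm(business):
    friday = business.get("hours", {}).get("Friday")
    if friday is None:
        return False
    # Zero-padded "HH:MM" strings compare correctly as plain strings.
    return friday["open"] < "22:00" < friday["close"]


matches = [
    b for b in businesses
    if b["city"] == "Las Vegas"
    and open_friday_10pm(b)
    and repeated_contains(b["categories"], "Mediterranean")
]
top = sorted(matches, key=lambda b: b["stars"], reverse=True)[:2]
print([b["name"] for b in top])
```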
- 39.
Reviews dataset
{
"votes": {"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything ...",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
- 40.
ANSI SQL Compatibility
//Get top cool rated businesses
SELECT b.name from dfs.yelp.`business.json` b
WHERE b.business_id IN
(SELECT r.business_id FROM dfs.yelp.`review.json` r
GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000 ORDER BY
SUM(r.votes.cool) DESC);
+------------+
| name |
+------------+
| Earl of Sandwich |
| XS Nightclub |
| The Cosmopolitan of Las Vegas |
| Wicked Spoon |
+------------+
Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub-queries, SQL data types)
- 41.
Logical Views
//Create a view combining business and reviews datasets
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
SELECT b.name, b.stars, r.votes.funny,
r.votes.useful, r.votes.cool, r.`date`
FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
WHERE r.business_id = b.business_id;
+------------+------------+
| ok | summary |
+------------+------------+
| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
+------------+------------+
> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
+------------+
| Total |
+------------+
| 1125458 |
+------------+
Lightweight file system based views for granular and de-centralized data management
- 42.
Materialized Views AKA Tables
> ALTER SESSION SET `store.format` = 'parquet';
> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
SELECT b.name, b.stars, r.votes.funny funny,
r.votes.useful useful, r.votes.cool cool, r.`date`
FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
WHERE r.business_id = b.business_id;
+------------+---------------------------+
| Fragment | Number of records written |
+------------+---------------------------+
| 1_0 | 176448 |
| 1_1 | 192439 |
| 1_2 | 198625 |
| 1_3 | 200863 |
| 1_4 | 181420 |
| 1_5 | 175663 |
+------------+---------------------------+
Save analysis results as tables using familiar CTAS syntax
- 43.
Repeated Values Support
// Flatten repeated categories
> SELECT name, categories
FROM dfs.yelp.`business.json` LIMIT 3;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Pine Cone Restaurant | ["Restaurants"] |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+------------+------------+
> SELECT name, FLATTEN(categories) AS categories
FROM dfs.yelp.`business.json` LIMIT 5;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Pine Cone Restaurant | Restaurants |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
+------------+------------+
Dynamically flatten repeated and nested data elements as part of SQL queries. No ETL necessary
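The row expansion that FLATTEN performs in the output above can be modeled in a few lines of Python. This is a sketch of the semantics only, using the three businesses from the slide; the `flatten` helper is a hypothetical name and says nothing about how Drill executes the operation.

```python
businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]


def flatten(rows, column):
    """Emit one output row per element of the repeated column."""
    for row in rows:
        for item in row[column]:
            # Copy the other fields and replace the list with a single element.
            yield {**row, column: item}


flattened = list(flatten(businesses, "categories"))
for row in flattened[:5]:
    print(row["name"], "|", row["categories"])
```

Once each category is its own row, ordinary GROUP BY style aggregation (for example, counting businesses per category) becomes straightforward.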
- 44.
Checkins dataset
{
"checkin_info":{
"3-4":1,
"13-5":1,
"6-6":1,
"14-5":1,
"14-6":1,
"14-2":1,
"14-3":1,
"19-0":1,
"11-5":1,
"13-2":1,
"11-6":2,
"11-3":1,
"12-6":1,
"6-5":1,
"5-5":1,
"9-2":1,
"9-5":1,
"9-6":1,
"5-2":1,
"7-6":1,
"7-5":1,
"7-4":1,
"17-5":1,
"8-5":1,
"10-2":1,
"10-5":1,
"10-6":1
},
"type":"checkin",
"business_id":"JwUE5GmEO-sH1FuwJgKBlQ"
}
- 45.
Supports Dynamic / Unknown Columns
> SELECT KVGEN(checkin_info) checkins
FROM dfs.yelp.`checkin.json` LIMIT 1;
+------------+
| checkins |
+------------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},
{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},
{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},
{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},
{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},
{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},
{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
+------------+
> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM
dfs.yelp.`checkin.json` limit 6;
+------------+
| checkins |
+------------+
| {"key":"3-4","value":1} |
| {"key":"13-5","value":1} |
| {"key":"6-6","value":1} |
| {"key":"14-5","value":1} |
| {"key":"14-6","value":1} |
| {"key":"14-2","value":1} |
+------------+
Convert Map with a wide set of dynamic columns into an array of key-value pairs
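The conversion shown above, a map with unpredictable keys turned into a list of key-value pairs and then flattened into rows, can be modeled as follows. This is a Python sketch of the semantics using a shortened copy of the check-in record; `kvgen` here is a hypothetical helper, not Drill's function.

```python
# Shortened copy of the checkin_info map from the checkins dataset slide.
checkin = {
    "checkin_info": {"3-4": 1, "13-5": 1, "6-6": 1, "14-5": 1, "11-6": 2},
    "type": "checkin",
    "business_id": "JwUE5GmEO-sH1FuwJgKBlQ",
}


def kvgen(mapping):
    """Turn a map with arbitrary keys into a list of {"key", "value"} pairs."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


pairs = kvgen(checkin["checkin_info"])

# Flattening the list yields one row per original map entry.
rows = [{"checkins": pair} for pair in pairs]
for row in rows:
    print(row)

# With keys turned into data, they can be aggregated like any other column.
total_checkins = sum(pair["value"] for pair in pairs)
print(total_checkins)
```

The design point is that the keys stop being column names that must be known in advance and become ordinary values that a query can filter or group on.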
- 46.
Resources
- 47.
Drill is Top-Ranked SQL-on-Hadoop
Source: Gigaom Research, 2015
Key:
• Number indicates company’s relative strength across all vectors
• Size of ball indicates company’s relative strength along individual vector
“Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”
- 49.
OJAI and MapR-DB
Where to find it…
– The source: https://github.com/ojai/ojai
– The site: http://ojai.github.io/
– Python bindings: https://github.com/mapr-demos/python-bindings
– JavaScript bindings: https://github.com/mapr-demos/js-bindings
Ready to play with your data?
– Download the sandbox: http://maprdb.io
– Examples:
• Java: https://github.com/mapr-demos/maprdb-ojai-101
• Python: https://github.com/mapr-demos/maprdb_python_examples
- 50.
Drill Walkthrough
• Example queries
• Conversion from relational model to flat JSON model
https://www.mapr.com/blog/drilling-healthy-choices
https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql
- 51.
Recommendations for Getting Started with Drill
New to Drill?
– Get started with Free MapR On Demand training
– Test Drive Drill on cloud with AWS
– Learn how to use Drill with Hadoop using MapR sandbox
Ready to play with your data?
– Try out the “Apache Drill in 10 minutes” guide on your desktop
– Download Drill for your cluster and start exploration
– Comprehensive tutorials and documentation available
Ask questions
– user@drill.apache.org
- 52.
@kingmesal
jscott@mapr.com
Engage with us!
kingmesal
Editor's Notes
- Great news, I have 467 slides today …. Hahah… I’m just kidding… I only have 465…
- MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented dependability, ease-of-use and world-record speed to Hadoop, NoSQL, database and streaming applications in one unified distribution for Hadoop.
MapR is a Hadoop distribution focused on delivering an enterprise-grade big data platform that supports mission-critical and real-time use cases.
- The database/datastore landscape is evolving to meet the new requirements. 2009 was the inflection point. NoSchema systems in which applications control structure. Developers are being empowered and they are voting for the agility offered by these systems.
In the early days of this revolution we sacrificed the query language, and we eliminated the ability to leverage the knowledge and tools available to millions of people. We’re changing that with a distributed SQL engine. But when we do that, we have to keep in mind that this transition to a NoSchema world happened for a reason, and we don’t want to reintroduce the centralized, DBA-managed schema.
- All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON (with additional types) documents. Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before.
If you consider the four data models shown in the 2x2, all models can be represented by the complex, no schema model (JSON) because it is the most flexible. However, no other data model can be represented by the flat, fixed schema model. Therefore, when using any SQL engine except Drill, the data has to be transformed before it can be available to queries.
- IT-driven = months of delay, unnecessary work (data is no longer relevant, etc.). The so-what needs to be conveyed. Why does it matter that it’s not needed.
6 months -> 3 months -> 3 months -> day zero, So imagine now what you can get…
Data Agility is needed for Business Agility
>>> Stand still during slide, move in at the punchline (why does this matter to YOU)
- CSV vs JSON formatting
- It’s 10pm in Vegas and I Want Good Hummus!