© COPYRIGHT 2016 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
Damon Feldman, Ph.D
@damon.feldman
http://www.marklogic.com/blog/author/dfeldman/
Data Lake, Virtual Database, or Data Hub
How to Choose?
SLIDE: 2
Who am I?
• Solutions Director at MarkLogic
• About 8 years in the Big Data and Data Integration space
• Previously, in OOP, JEE worlds
• Focus on Data Hub and Customer or Person-360° systems
SLIDE: 3
But Why?
• Data Silos
• Usually work well for a single, operational
purpose
• Turn any cross-line-of-business question
into a data integration effort
SLIDE: 4
How about EDW
• For a while, Enterprise Data Warehouses were the go-to solution for silos
• One master schema to rule them
• Data Modeler’s Dream!
• Implementor’s Nightmare!
• BMUF (Big Modeling Up Front)
• Rigid and tightly coupled
SLIDE: 5
Data Incompatibilities
• Three forms of data incompatibilities
• Naming is the simplest
• firstName vs. GIVEN_NAME
• Structural is somewhat harder
• Semantic differences are the most challenging
• Status: {in cart, ordered, shipped, delivered}
• Status: {selected, paid, complete}
Source A (normalized, three tables):
PERSON
- PERS_ID
- DOB
- FNAME
- LNAME
PERS_ADDR_REL
- PERS_ID
- ADDR_ID
ADDRESS
- ADDR_ID
- LINE1
- CITY
- ZIP
- TYPE: {US, UK}
Source B (denormalized, one table):
PERSON
- PERS_ID
- DOB
- FNAME
- LNAME
- ADDR_L1
- ADDR_CITY
- ADDR_ZIP
- ADDR_MAILING_L1
- ADDR_MAILING_ZIP
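The naming and semantic cases above can be sketched in a few lines of Python. This is an illustration only: the common field name `first_name`, the helper names, and in particular the mapping between the two status vocabularies are assumptions made for the example, not something the slides define.

```python
# Naming incompatibility: a pure rename is enough.
# The target name "first_name" is a hypothetical choice for this sketch.
NAME_MAP = {"firstName": "first_name", "GIVEN_NAME": "first_name"}


def rename_fields(record):
    """Rename known source fields to the common name; leave others as-is."""
    return {NAME_MAP.get(key, key): value for key, value in record.items()}


# Semantic incompatibility: the two status vocabularies do not line up
# one-to-one. This mapping is an assumption for illustration; a real one
# needs a business decision for every value.
STATUS_A_TO_B = {
    "in cart": "selected",
    "ordered": "paid",    # assumes that ordering implies payment
    "shipped": "paid",    # system B has no separate "shipped" state
    "delivered": "complete",
}


def to_status_b(status_a):
    """Translate a system-A status into system B's vocabulary (lossy)."""
    return STATUS_A_TO_B[status_a]


print(rename_fields({"GIVEN_NAME": "Sarah"}))
print(rename_fields({"firstName": "Oren"}))
print(to_status_b("ordered"), to_status_b("shipped"))
```

Under this assumed mapping, "ordered" and "shipped" collapse to the same target value, so the translation cannot be reversed. That loss of information is what makes semantic differences harder than renames.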
SLIDE: 6
Three New Approaches
• Data Lakes
• Put it all somewhere else
• Virtual Databases (AKA Federated Databases)
• Pretend it is somewhere else
• Data Hubs
• Put it all somewhere else, Harmonize, and Index it for operational use
And a Framework to understand and choose approaches
SLIDE: 7
A Use Case
Consider a customer churn use case
• Review high-value customers
• ... who are at-risk customers
• ... particularly if they are dropping or cancelling services
• Proactively address their trouble tickets or complaints.
[Diagram: three source systems. Customer Lifetime Value (dollar amounts), Customer Support (angry ticket text such as “Need more…”, “please upgrade…”, “Abysmal…”, “dissatisfied…”), and Order/Change/Drop (services going up or down).]
SLIDE: 8
Data Lakes
• Copy the data to a new infrastructure
• Typically Hadoop, but perhaps MarkLogic or other NoSQL
• Difficult with SQL because there are many sources → Load “as-is”
• Operational Separation
[Diagram: the Support, CLV, and Orders sources are copied into the DATA LAKE, then processed for BI/Analytics. Data is Moved to one place, but still in varied structures.]
SLIDE: 9
Virtual Database
• Query everything in real time
• Transparent to the caller
• True real-time
• Data is not Moved or Harmonized (except in memory during processing)
[Diagram: the Retain/intervene, Churn Analysis, and Reporting consumers issue queries. Query Conversion sends a Query Transform to each of Support, CLV, and Orders, and the results pass through Data Harmonization. Data Remains in source systems.]
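A minimal sketch of the query-time behaviour described on this slide, assuming three in-process functions stand in for the live source systems. Every function name, field name, and value below is hypothetical; a real federation layer would issue network queries instead.

```python
# Hypothetical stand-ins for the source systems; each call would be a
# live query against that system in a real federation layer.
def fetch_support(customer_id):
    return [{"CUST": customer_id, "TICKET_TEXT": "please upgrade"}]


def fetch_clv(customer_id):
    return [{"customerId": customer_id, "lifetimeValue": 17200}]


def fetch_orders(customer_id):
    raise ConnectionError("orders system is down")


def federated_query(customer_id, sources):
    """Fan one logical query out to every source and merge in memory."""
    result = {"customer_id": customer_id}
    for name, (fetch, transform) in sources.items():
        rows = fetch(customer_id)  # load lands on the source, on every query
        result[name] = [transform(row) for row in rows]  # never persisted
    return result


healthy = {
    "support": (fetch_support, lambda r: {"ticket": r["TICKET_TEXT"]}),
    "clv": (fetch_clv, lambda r: {"clv_usd": r["lifetimeValue"]}),
}
print(federated_query("c-1", healthy))

# Adding a source that is down makes the whole query fail.
with_orders = dict(healthy, orders=(fetch_orders, lambda r: r))
try:
    federated_query("c-1", with_orders)
except ConnectionError as err:
    print("federated query failed:", err)
```

Nothing is stored between calls, so the transforms run again on every query, and one unavailable source is enough to break the combined result.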
SLIDE: 10
Data Hubs
• Copy as with a Data Lake
• Harmonize and Index
• Regular structures for analytics, reporting, consumption
• Indexes atop the common structures
[Diagram: the Support, CLV, and Orders sources are copied into the DATA HUB, harmonized, and served to BI/Analytics and other Consumers. Data is Moved to one place and also Harmonized and Indexed.]
SLIDE: 11
Beneath and Beyond the Terms
The terms are useful, but vague, and don’t tell us what works for our next project
Consider all these approaches in terms of:
• Movement
• Harmonization
• Indexing
SLIDE: 12
Data Movement
• Data Movement is copying data to new, physical storage so it can be accessed via
new servers and processes
• Operational Separation
• Organizational Separation
[Diagram: the Orders System shown apart from the Retain / Intervene, Churn Analysis, and Reporting workloads; organizational labels: Sales Department, IT.]
SLIDE: 13
Data Movement and the Three Approaches
• Data Lakes are all but defined by Movement
• Operational and Organizational separation
• Virtual Databases - unique in not Moving data
• Load is pushed to the source systems
• Backup, HA/DR, Security implemented on all source systems
• Data Hubs also Move data
SLIDE: 14
Data Harmonization
• Recall: Three forms of data incompatibility
• Naming
• Structural
• Semantic
Source A (normalized, three tables):
PERSON
- PERS_ID
- DOB
- FNAME
- LNAME
PERS_ADDR_REL
- PERS_ID
- ADDR_ID
ADDRESS
- ADDR_ID
- LINE1
- CITY
- ZIP
- TYPE: {US, UK}
Source B (denormalized, one table):
PERSON
- PERS_ID
- DOB
- FNAME
- LNAME
- ADDR_L1
- ADDR_CITY
- ADDR_ZIP
- ADDR_MAILING_L1
- ADDR_MAILING_ZIP
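The structural case can be sketched as two small adapters that map each layout above into one common shape. The target field names (`first_name`, `addresses`, and so on) and the sample rows are assumptions for illustration; only the source column names come from the slide.

```python
def from_normalized(person, rels, addresses):
    """Harmonize the three-table layout (PERSON, PERS_ADDR_REL, ADDRESS)."""
    address_by_id = {a["ADDR_ID"]: a for a in addresses}
    linked = [
        address_by_id[rel["ADDR_ID"]]
        for rel in rels
        if rel["PERS_ID"] == person["PERS_ID"]
    ]
    return {
        "id": person["PERS_ID"],
        "first_name": person["FNAME"],
        "last_name": person["LNAME"],
        "dob": person["DOB"],
        # TYPE ({US, UK}) is dropped here to keep the sketch small.
        "addresses": [
            {"line1": a["LINE1"], "city": a["CITY"], "zip": a["ZIP"]}
            for a in linked
        ],
    }


def from_flat(person):
    """Harmonize the single-table layout with inline address columns."""
    addresses = [
        {
            "line1": person["ADDR_L1"],
            "city": person["ADDR_CITY"],
            "zip": person["ADDR_ZIP"],
        }
    ]
    if person.get("ADDR_MAILING_L1"):
        # The flat layout has no mailing-city column, so city stays unknown.
        addresses.append(
            {
                "line1": person["ADDR_MAILING_L1"],
                "city": None,
                "zip": person.get("ADDR_MAILING_ZIP"),
            }
        )
    return {
        "id": person["PERS_ID"],
        "first_name": person["FNAME"],
        "last_name": person["LNAME"],
        "dob": person["DOB"],
        "addresses": addresses,
    }


# Made-up sample rows describing the same person in both layouts.
normalized_person = {"PERS_ID": 7, "DOB": "1982-11-20", "FNAME": "Sarah", "LNAME": "Ravnick"}
rels = [{"PERS_ID": 7, "ADDR_ID": 70}]
addresses = [{"ADDR_ID": 70, "LINE1": "1 Main St", "CITY": "Springfield", "ZIP": "00000", "TYPE": "US"}]
flat_person = {
    "PERS_ID": 7, "DOB": "1982-11-20", "FNAME": "Sarah", "LNAME": "Ravnick",
    "ADDR_L1": "1 Main St", "ADDR_CITY": "Springfield", "ADDR_ZIP": "00000",
    "ADDR_MAILING_L1": None, "ADDR_MAILING_ZIP": None,
}

hub_a = from_normalized(normalized_person, rels, addresses)
hub_b = from_flat(flat_person)
print(hub_a == hub_b)
```

Once both sources land in the same shape, downstream consumers no longer need to know which layout a record came from.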
SLIDE: 15
Data Harmonization
• Harmonization is mapping into a common structure for key data elements
• Eventually, data must be consumed, aggregated and analyzed in a common form
Orders System
• $1400 equipment order
• £270/month, 36-month contract
• Exchange Rate: 1.28
Maintenance/trouble tickets
• Network upgrade needed
• Projected cost $3,000

Customer          Expected Net Revenue
Oren Wilkins      $4,280
Sarah Ravnick     $17,200
David Perez       …
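As a worked example of why a common form is needed before aggregating, the figures on this slide can be combined as follows. Two assumptions are made here that the slide does not state: the exchange rate is read as US dollars per pound, and "expected net revenue" is taken as order plus contract value minus the projected upgrade cost. The result is not meant to reproduce any row of the table above.

```python
from decimal import Decimal

USD_PER_GBP = Decimal("1.28")  # exchange rate from the slide, read as USD per GBP


def to_usd(amount, currency):
    """Convert an amount to US dollars; only USD and GBP are handled."""
    if currency == "USD":
        return amount
    if currency == "GBP":
        return amount * USD_PER_GBP
    raise ValueError(f"unsupported currency: {currency}")


equipment = to_usd(Decimal("1400"), "USD")
contract = to_usd(Decimal("270") * 36, "GBP")   # 36 monthly payments
upgrade_cost = to_usd(Decimal("3000"), "USD")

# Assumed formula: revenue from both sources minus the projected cost.
expected_net = equipment + contract - upgrade_cost
print(expected_net)  # 10841.60
```

The sum only makes sense after the pound-denominated contract is converted, which is the harmonization step the slide is pointing at.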
SLIDE: 16
Data Harmonization
• Harmonization is the “value add” in the process
• The earlier the better for maximum use
• Store it
• Index it
• Yet BMUF fails often
• Progressive Harmonization
[Diagram: Progressive Harmonization]
Source record 1 (Person): Fname, Lname, BIRTH, PHYSATTR, PHYSATTR
Source record 2 (Person): Given-name, Family-name, Eye-color, Demographics, DOB
Iteration 1 (Person): Harmonized: Name, Address, DoB; Source: Eye color, Height, Credit Risk
Iteration 2 (Person): Harmonized: Name, Address, DoB, EyeColor, Height; Source: Credit Risk
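Progressive Harmonization can be sketched as an envelope that keeps the raw source next to a harmonized section that grows from one iteration to the next. The envelope keys, the function names, the sample values, and the assumption that `PHYSATTR` is a small dictionary keyed by attribute are all hypothetical; the slide only gives the field names.

```python
def harmonize_v1(source):
    """Iteration 1: harmonize only the name and date of birth."""
    first = source.get("Fname") or source.get("Given-name")
    last = source.get("Lname") or source.get("Family-name")
    return {
        "Name": f"{first} {last}",
        "DoB": source.get("BIRTH") or source.get("DOB"),
    }


def harmonize_v2(source):
    """Iteration 2: everything from iteration 1, plus eye color."""
    harmonized = harmonize_v1(source)
    # Assumes PHYSATTR, when present, is a dict such as {"eye": "brown"}.
    harmonized["EyeColor"] = (
        source.get("Eye-color") or source.get("PHYSATTR", {}).get("eye")
    )
    return harmonized


def envelope(source, harmonize):
    """Wrap a raw record so the original is kept beside the harmonized view."""
    return {"Harmonized": harmonize(source), "Source": source}


# Made-up records in the two source shapes.
src1 = {"Fname": "Oren", "Lname": "Wilkins", "BIRTH": "1975-03-02",
        "PHYSATTR": {"eye": "brown"}}
src2 = {"Given-name": "Sarah", "Family-name": "Ravnick",
        "Eye-color": "green", "DOB": "1982-11-20"}

doc1 = envelope(src1, harmonize_v1)
# Later, re-harmonize from the retained source without reloading anything.
doc1_v2 = envelope(doc1["Source"], harmonize_v2)
doc2_v2 = envelope(src2, harmonize_v2)

print(doc1["Harmonized"])
print(doc1_v2["Harmonized"])
print(doc2_v2["Harmonized"])
```

Because the source is retained in the envelope, a later iteration can promote more fields into the harmonized section without going back to the source systems, which is what avoids modeling everything up front.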
SLIDE: 17
Harmonization and the Approaches
• Data Lakes don’t Harmonize
• Harmonization is pushed downstream, or implicit in the jobs
• Often ETL copies from format to format (particularly in Hadoop)
• Virtual Databases Harmonize in real time
• Each source query and result is harmonized in memory
• Pushes the load to the source systems
• Data Hubs Harmonize and Persist
• Explicit storage and management of Harmonized data
• Governable
[Diagram labels: Data Lake with Job 1 and Job 2; Query over Silo 1 and Silo 2; Data Lake and Data Hub.]
SLIDE: 19
Indexing
“Who Said Databases Weren’t a Good Idea?”
- Ken Krupa, Enterprise CTO, MarkLogic
• Indexing is a decision to make something fast
• Finding, totaling, sorting, grouping, correlating, analyzing
• Sometimes also accessing
• Less obviously
• Caching and memory use
• Reference data usage
SLIDE: 20
Indexing Benefits
• Advance from Batch to Operational
• Micro-service or SOA architectures
• Find the latest address
• A 360° summary record of a customer
• Human Services: reviewing FSA recipients – interactive dashboard
• “Run your business”
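The idea that indexing is a decision to make something fast can be illustrated with a plain in-memory lookup. This is a toy sketch, not how any particular database builds its indexes; the `status` field and its values are invented for the example.

```python
from collections import defaultdict

# Made-up harmonized customer records.
records = [
    {"id": 1, "name": "Oren Wilkins", "status": "at-risk"},
    {"id": 2, "name": "Sarah Ravnick", "status": "stable"},
    {"id": 3, "name": "David Perez", "status": "at-risk"},
]


def scan(rows, field, value):
    """Batch style: read every record on every question."""
    return [row["id"] for row in rows if row[field] == value]


def build_index(rows, field):
    """Pay once at load time to map each value to the matching record ids."""
    index = defaultdict(list)
    for row in rows:
        index[row[field]].append(row["id"])
    return index


status_index = build_index(records, "status")

# Operational style: one dictionary lookup instead of a full pass.
print(status_index.get("at-risk", []))
print(scan(records, "status", "at-risk"))
```

Both calls return the same ids; the difference is that the scan cost grows with the number of records on every request, while the index shifts that cost to load time. The index is only possible because the records already share a common `status` field, which is why indexing builds on harmonization.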
SLIDE: 21
Three Approaches Revisited – Virtual Databases
Issues
• Least-common-denominator Query
• Paradox: more systems = less power
• Coupling to source systems – schema change = broken DB
• Weakest link problem - HA/DR, overload
• Complexity
• Paging, sorting, relevance, dealing with a federated source that is down
Benefit
• Real Time is easy
• May be ok for small or initial systems
SLIDE: 22
Three Approaches Revisited - Data Lakes
Issues
• Still need to Harmonize the data
• Typically in every batch job, ETL (PIG/HIVE) job, query, analysis
• Risk of the “Data Swamp”
• Batch focus
• In-memory helps, but still batch
• Frankenbeast workarounds create more silos, rather than solving the problem
Benefit
• The data is moved
• Storage is cheap
• One team and process to add functionality
SLIDE: 23
Three Approaches Revisited – Data Hubs
Data Hubs - Advantages
• Most powerful solution – all of: Movement, Harmonization, Indexing
• “Run your business”
• Indexing builds on Harmonization
• Harmonization is the value add, so index it!
• Grow by regularizing, not by complicating
• More data sources to the Harmonized form
• Progressive Harmonization to increase the Harmonized data elements
• HA/DR, scale, security, query power, batch efficiency, governance
Tradeoffs
• Dedicated hardware
• Change detection or data push needed for real-time
SLIDE: 24
Data Lake vs Data Hub
“The fact is, you don't put everything into a datastore and
then go looking for something to do.”
- Ted Dunning, MapR Chief Applications Architect
Data Hubs are Operational and “Purpose-driven”
Use case → API → Progressive Harmonization → Data Integration
They do not merely have Harmonized data and Indexes; they are about serving
use cases, with Harmonized data and Indexes to drive them.
SLIDE: 25
Value Over Time
[Chart: ROI plotted against Time, Evolution, Range of Data, with one curve each for Data Lake, Data Hub, and Virtual Database.]
SLIDE: 26
Evaluating MarkLogic with the Three Criteria
SLIDE: 27
MarkLogic Operational Data Hub Pattern
Some say: “A Data Lake and EDW are better together”
Translation: “This Data Lake is not doing a very good job, and never will”
→ MarkLogic brings database/data warehouse functions into the Data Lake, making it “Operational” and a “Data Hub” by virtue of Harmonization and Indexing
→ but not by trying to build a (smaller) EDW
SLIDE: 28
MarkLogic for Operational Data Hubs
• MarkLogic supports all three paradigms
• Our product direction, consulting team, experience are focused on Data Hubs
• MarkLogic is a database
• Allowing an “Operational Data Hub”
• Run your business AND observe your business
• One place for the latest data – address, income, account status, health
• Integrated data for 360° views
SLIDE: 29
MarkLogic ODH Features - Movement
• Ingest data “as-is”
• Native support for JSON, XML, Binary, RDF, Text, SQL, Geo
• Data Loading tools for MPP batch ingest
• Index latent structure in each
• Commodity hardware, commodity disk
• Tiered storage for cost effective storage
SLIDE: 30
Operational Data Hub Pattern in MarkLogic
[Diagram: Operational Data Hub data flow]
Source Systems: RDBMS, Message Bus, Content Feed
INGEST → Staging (raw, as-is data): Source 1 Documents, Source 2 Documents, … Source N Documents
HARMONIZE → Final (harmonized, indexed data): Enveloped Documents (Entity 1), Enveloped Documents (Entity 2), … Enveloped Documents (Entity N)
SERVE → Consuming Applications: Operational Apps, Analysis/BI, Data Feeds
Across the flow: Discovery, Harmonization, Indexes, Query, Services
SLIDE: 31
MarkLogic ODH Features - Harmonization
• Best in class data Transform capabilities
• XSLT, XQuery implemented to spec from the ground up
• JavaScript via V8 engine
• Triggers, data extraction from binaries, MPP processing
• Multi-modal processing of many data formats
• Ontology processing – RDFS, OWL
SLIDE: 32
MarkLogic ODH Features - Indexing
• MarkLogic is built on the “Universal Index”
• Text, document structure, fields, and security in one index
• Columnar range indexes for analysis and SQL processing
• Triple index for RDF, SPARQL and semantic query
• Geospatial index
• Projection operations to expose one structure (e.g. JSON or XML) as SQL or RDF
• Operational vs. purely analytical. You can run your business on MarkLogic
SLIDE: 33
Summary
• Data Lakes and Hubs are on a continuum
• Primarily distinguished by level of indexing
• Virtual databases are a very different animal – and not usually in a good way
• Within each pattern, Movement, Harmonization and Indexing are knobs to turn
• Movement – for isolation and data access
• Harmonization – for micro-services and value-add
• Indexing – for speed and operational use cases
• Consider your goals and requirements, and plan accordingly
SLIDE: 34
More Info
MarkLogic Data Hub Framework (quick start): https://marklogic.github.io/marklogic-data-hub/
MarkLogic Data Hub information: http://www.marklogic.com/solutions/operational-data-hub/
Damon’s blog on data lakes: http://www.marklogic.com/blog/data-lakes-data-hubs-federation-one-best/
Follow Damon on Twitter: https://twitter.com/damonfeldman
