Data Virtualization: Fulfilling the Promise of Data Lakes
Dr. Christian Kurze
Principal Sales Engineer – DACH
ckurze@denodo.com
heiko.klarl@xdi360.com
2
Key questions I want to answer today
• What is Data Virtualization?
• How to leverage Hadoop Data Lakes to support Internet of Things /
Operational Data Store / Offloading / … use cases?
• How to query Hadoop Data Lakes combined with any other structured,
semi-structured and unstructured data sources using a single logical data
lake? What about Cloud?
• How to avoid Data Swamps via a lightweight data governance approach
that helps enterprises maximize the value of their Data Lake?
• How to use a logical data lake/data warehouse to prevent a physical data
lake from becoming a silo?
Agenda
3
Status Quo – Data Integration
Access to all information
Marketing, Sales, Executive, Support
• Access to complete information
• … in an economically meaningful way
• … real-time and in high quality incl.
monitoring, security and audit
Cross-sell / Up-sell, Channel, Warranty, Product, Customer
Database, Apps, Warehouse, Cloud, Big Data, Documents, Apps, NoSQL
• Manual access to legacy systems and
a constant stream of new technologies – IoT, Big
Data, Cloud
• Point-to-point connections
• Projects too slow for new initiatives
– from disparate silos and technologies
The Requirement…
… versus the current architecture
4
„My architecture works fine, but I am not
able to access all my silos.“
- Enterprise Data Architect
• Different locations
• Different technologies
• Different data structures
• Datasets too large to move
• Different APIs and access methods
• Excessive use of ETL to copy data
• Synchronization issues
5
The Solution
Data Virtualization as a Data Abstraction Layer
DATA ABSTRACTION LAYER
Central repository to access all data
Abstracts the underlying technology of
the data sources
Enables the definition of a semantic
data model
Offers a metadata-rich catalog
Multiple access methods:
SQL based
Keyword based search (via index)
RESTful navigation (hyperlinks)
Native support for nested document
structures (XML, JSON, …)
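To make the idea of an abstraction layer concrete, here is a minimal Python sketch. It is not Denodo's API: the names `csv_source`, `json_source`, `AbstractionLayer`, `register`, `catalog` and `query` are invented for illustration, and the two in-line sources merely stand in for real systems. It shows one entry point that hides whether rows come from delimited text or from a JSON document.

```python
import csv
import io
import json
from typing import Callable, Dict, List

Row = Dict[str, object]
Reader = Callable[[], List[Row]]


def csv_source(text: str) -> Reader:
    """Wrap delimited text so it can be read as a list of row dicts."""
    def read() -> List[Row]:
        return [dict(r) for r in csv.DictReader(io.StringIO(text))]
    return read


def json_source(text: str) -> Reader:
    """Wrap a JSON array of objects so it can be read as a list of row dicts."""
    def read() -> List[Row]:
        return list(json.loads(text))
    return read


class AbstractionLayer:
    """Hypothetical single access point over differently shaped sources."""

    def __init__(self) -> None:
        self._views: Dict[str, Reader] = {}

    def register(self, name: str, reader: Reader) -> None:
        self._views[name] = reader

    def catalog(self) -> List[str]:
        # The consumer sees view names, not source technologies.
        return sorted(self._views)

    def query(self, name: str, where: Callable[[Row], bool] = lambda r: True) -> List[Row]:
        # Rows are fetched from the source at query time, not copied in advance.
        return [r for r in self._views[name]() if where(r)]


layer = AbstractionLayer()
layer.register("customer", csv_source("id,name\n1,Acme\n2,Globex\n"))
layer.register("incident", json_source('[{"customer_id": "1", "severity": "high"}]'))

high = layer.query("incident", lambda r: r["severity"] == "high")
```

The consumer calls `query` the same way for both views; only the registration step knows about CSV or JSON.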
6
Modelling in a Data Virtualization Solution
Sources
Combine, Transform & Integrate
Publish
Base View (Source Abstraction)
Client Address, Client Type, Company, Invoicing, Service Usage, Product, Logs, Web, Incidents
Customer, Invoice, Product
Customer Invoicing, Service Usage, Incident
Hadoop, Web Site, REST Web Service, Multidimensional, Salesforce, SQL Server, Oracle
SQL, SOAP, REST, OData,
Message Queues (JMS), etc.
Denodo’s
Information Self Service
Independent of the
access method – all
views use the same
metadata and access
privileges
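The layering on this slide (base views, combined views, published views with shared privileges) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how the Denodo Platform is implemented: the view names echo the slide, while the sample rows, the `PRIVILEGES` table, the roles and the helper names are hypothetical.

```python
import json
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


# Base views: one-to-one abstractions of (hypothetical) source tables.
def bv_client() -> List[Row]:
    return [{"client_id": 1, "name": "Acme"}, {"client_id": 2, "name": "Globex"}]


def bv_invoicing() -> List[Row]:
    return [
        {"client_id": 1, "amount": 120.0},
        {"client_id": 1, "amount": 80.0},
        {"client_id": 2, "amount": 50.0},
    ]


# Combined view: joins and aggregates the base views on demand.
def customer_invoicing() -> List[Row]:
    totals: Dict[object, float] = {}
    for inv in bv_invoicing():
        totals[inv["client_id"]] = totals.get(inv["client_id"], 0.0) + inv["amount"]
    return [
        {"client_id": c["client_id"], "name": c["name"], "total": totals.get(c["client_id"], 0.0)}
        for c in bv_client()
    ]


VIEWS: Dict[str, Callable[[], List[Row]]] = {"customer_invoicing": customer_invoicing}
# Assumed privilege table: role -> views that role may read.
PRIVILEGES: Dict[str, Set[str]] = {"analyst": {"customer_invoicing"}, "guest": set()}


def read_view(role: str, view: str) -> List[Row]:
    """Single enforcement point used by every access method."""
    if view not in PRIVILEGES.get(role, set()):
        raise PermissionError(f"{role} may not read {view}")
    return VIEWS[view]()


# Two publishing styles over the same view and the same privilege check.
def as_rows(role: str, view: str) -> List[Row]:
    return read_view(role, view)


def as_json(role: str, view: str) -> str:
    return json.dumps(read_view(role, view))
```

Because both `as_rows` and `as_json` go through `read_view`, a role that is denied in one access method is denied in the other as well.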
7
Common Data Virtualization Use Cases
Data Virtualization
BIG DATA, CLOUD INTEGRATION
• Advanced Analytics
• Data Warehouse Offloading
• Big Data for Enterprise
• Cloud / SaaS Integration
AGILE BUSINESS INTELLIGENCE
• Logical Data Warehouse
• Virtual Data Marts
• Self-Service BI
• Operational BI / Analytics
SINGLE VIEW APPLICATIONS
• Single Customer View - Call Centers, Portals
• Single Product View - Catalogs
• Single Inventory View - Inventory Reconciliation
• Vertical Specific - Single View of Wells
DATA SERVICES
• Unified Data Services Layer
• Logical Data Abstraction
• Agile Application Development
• Linked Data Services
8
Multiple platforms optimized for different Workloads
Additionally in a hybrid environment: OnPrem vs. Cloud
[Figure: platform landscape]
Categories: Streams, MDM, DWH & Marts, Advanced Analytics (structured), Advanced Analytics (multiple structures)
Platforms: MDM (C/R/U/D; Cust, Prod), NoSQL / Graph DB, Data Lake (Hadoop / Spark / Hive / …), EDW, Mart, DW Appliance
Workloads: Real-time stream processing & decision management; Governed context information; Graph analysis; Investigative analysis, data refinery; Data mining, model development; Traditional query, reporting & analysis
9
Business requires a combination of data
MDM (C/R/U/D; Cust, Prod)
Hadoop
Who are our customers?
What products do we sell?
What are the most popular
navigational paths through
our web site that led to
high-fee products?
Who are our most loyal, low
risk customers that generate
low fees?
What is the online behavior
of our loyal, low risk, low fee
customers so that we can
offer them higher fee
products?
Where do I find this data?
How to combine this data?
How to share it with my
colleagues? What about
their access privileges?
EDW
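One of the questions above, the online behavior of loyal, low-risk, low-fee customers, needs MDM, EDW and Hadoop data in a single query. The Python sketch below imitates that with three separate in-memory SQLite databases attached to one connection. SQLite's `ATTACH` only stands in for a virtualization engine here; the schemas, table names, sample rows and the fee threshold of 50 are all assumptions made for the example.

```python
import sqlite3

# Autocommit mode, so ATTACH is never issued inside an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Three independent databases play the roles of MDM, EDW and the data lake.
conn.execute("ATTACH DATABASE ':memory:' AS mdm")
conn.execute("ATTACH DATABASE ':memory:' AS edw")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE mdm.customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE edw.profile (customer_id INTEGER, loyal INTEGER, risk TEXT, fees REAL)")
conn.execute("CREATE TABLE lake.clicks (customer_id INTEGER, page TEXT)")

conn.executemany("INSERT INTO mdm.customer VALUES (?, ?)",
                 [(1, "Ada"), (2, "Ben"), (3, "Cy")])
conn.executemany("INSERT INTO edw.profile VALUES (?, ?, ?, ?)",
                 [(1, 1, "low", 10.0), (2, 1, "high", 15.0), (3, 0, "low", 400.0)])
conn.executemany("INSERT INTO lake.clicks VALUES (?, ?)",
                 [(1, "/funds"), (1, "/funds"), (1, "/loans"), (2, "/funds"), (3, "/cards")])

# One query spans all three "systems": who the customer is (MDM),
# how loyal, risky and profitable they are (EDW), and what they clicked (lake).
rows = conn.execute(
    """
    SELECT c.name, k.page, COUNT(*) AS visits
    FROM mdm.customer AS c
    JOIN edw.profile AS p ON p.customer_id = c.id
    JOIN lake.clicks AS k ON k.customer_id = c.id
    WHERE p.loyal = 1 AND p.risk = 'low' AND p.fees < 50
    GROUP BY c.name, k.page
    ORDER BY visits DESC, k.page
    """
).fetchall()
```

In this toy data only one customer is loyal, low risk and low fee, so `rows` lists that customer's pages by visit count. The consumer writes one SQL statement and never addresses the three stores separately.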
Big Data Connectivity
Big Data and Cloud Databases Connectivity
■ Hadoop Ecosystem:
■ SQL on Hadoop: Hive, Impala, Presto,…
■ HDFS, Parquet, Avro, CSV…
■ Execution of MapReduce jobs
■ Certified with major Hadoop distributions
■ In-memory platforms: Apache Spark SQL, Presto DB, HANA,…
■ Parallel DWs and Appliances: Vertica, Impala, Teradata, Greenplum,…
■ Cloud databases: Redshift, Snowflake, DynamoDB,…
■ NoSQL (MongoDB, CouchDB, Neo4j, Redis, Oracle NoSQL, Cassandra, etc.)
■ Streaming data (Spark streams, Splunk, IBM Streams, Kafka,…)
10
Enhanced Adapters for Big Data ecosystem
Delimited text files
Sequence files
Map files
Avro files
11
How to provide access by multiple tools and technologies?
DWH MDM Hadoop Appliances NoSQL External
Services
Excel /
MS BI
Tableau Power BI
Composite
Desktop
360 Views Cockpit
Other
Applications
• Complex Security Policies? RBAC?
• Single Sign On (Kerberos)
• Governance / Audit
• Fast Prototyping?
• Automated Processes?
• Manual development of Service Layer?
• Source Changes
• New Attributes and Requirements
• Accounting of source usage
(cloud migration pending)
• Refactoring of sources
• New Sources
12
Marketing
Data Lakes
Research
Logical Data Lake
Finance
Self-Service
Analytics
Operational
Apps
A Single Governed Logical Data Lake
Data Virtualization combines one or more physical data lakes with other enterprise data to create a
“virtual” or “logical” data lake.
Other Data Sources
MDM Cloud Apps
BI/Analytical
Tools
Excel
Reports
DATA VIRTUALIZATION
Semantic
Model
Data
Discovery
Metadata
Catalog
Security
Governance
Denodo Platform Bridges Distinct Data Architectures
• Simplified Architecture
• Single Point of Access
• Lower TCO
• Lower Operational Costs
• Improved Agility
• Improved Flexibility
• Consistency and Integrity
for multiple tools
13
Information Self Service
E/R diagram
1
Click on a view to
navigate to the
details
2
Hover on the
arrows to show
the details of
the PK-FK
relationships
14
Information Self Service
Browse Metadata Catalog
1 Browse and search
virtual databases
2 Browse and search
available views
3 Review metadata
and descriptions
4 Query the view
15
Information Self Service
Search Metadata Catalog
1 Full-text search within view metadata
(name, column names, descriptions)
2 Show additional view
information and query data
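A search over view metadata (name, column names, descriptions) can be approximated as follows. This is a rough Python sketch, not the Denodo catalog: `ViewMetadata`, `search_catalog` and the sample entries are hypothetical, and a case-insensitive substring match takes the place of a real full-text index.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ViewMetadata:
    name: str
    description: str
    # Column name -> column description.
    columns: Dict[str, str] = field(default_factory=dict)


def search_catalog(catalog: List[ViewMetadata], term: str) -> List[str]:
    """Return names of views whose name, description or columns mention the term."""
    needle = term.lower()
    hits: List[str] = []
    for view in catalog:
        haystack = [view.name, view.description, *view.columns.keys(), *view.columns.values()]
        if any(needle in text.lower() for text in haystack):
            hits.append(view.name)
    return hits


catalog = [
    ViewMetadata("customer", "Master record per client",
                 {"client_id": "Unique key", "name": "Legal name"}),
    ViewMetadata("service_usage", "Usage events from web logs",
                 {"client_id": "Owning client", "bytes": "Traffic volume"}),
    ViewMetadata("incident", "Support tickets",
                 {"severity": "Ticket priority"}),
]

matches = search_catalog(catalog, "client")
```

Searching for "client" finds both views that carry a `client_id` column, even though only one of them mentions clients in its description.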
16
Information Self Service
Querying Data
1 Access to the
Denodo catalog
2 Query and filter
for data
3 Click on the green arrows to drill
down into related information
17
Information Self Service
Data Lineage
1 Select Data Lineage
for the View
2 Select column
to see lineage
3 Hover and click the
icons to see details
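Column-level lineage of this kind can be modelled as a graph from each derived column to the columns it is computed from. The Python sketch below walks such a graph down to the source columns. The `LINEAGE` mapping, the view and column names and the function `source_columns` are hypothetical, and the graph is assumed to be acyclic.

```python
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (view name, column name)

# Assumed lineage: each derived column maps to the columns it is computed from.
LINEAGE: Dict[Column, List[Column]] = {
    ("customer_invoicing", "total"): [("invoice", "amount")],
    ("customer_invoicing", "name"): [("customer", "name")],
    ("invoice", "amount"): [("bv_invoicing", "net"), ("bv_invoicing", "tax")],
    ("customer", "name"): [("bv_client", "name")],
}


def source_columns(column: Column) -> List[Column]:
    """Return the base columns a column ultimately depends on, depth-first, without duplicates."""
    parents = LINEAGE.get(column)
    if not parents:
        # No recorded parents: this column comes straight from a source.
        return [column]
    found: List[Column] = []
    for parent in parents:
        for leaf in source_columns(parent):
            if leaf not in found:
                found.append(leaf)
    return found


total_sources = source_columns(("customer_invoicing", "total"))
```

Here `total_sources` resolves the published `total` column through the intermediate `invoice` view to the two base columns it is built from.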
18
Telematics & Predictive Maintenance
Leading Construction Manufacturer
Dealer
Maintenance
Parts Inventory
OSI PI Hadoop Cluster
Tableau: Dealer / Customer Dashboard
19
Business Benefits
• Improved asset performance and proactive maintenance.
• Reduced warranty costs due to proactive maintenance of
parts preventing parts failure.
• Optimized pricing for services and parts among global service
providers.
• New Business Model opportunities based on real-time
analysis of detailed sensor data.
20
How can I get started?
Read the new whitepaper by Rick F. van der Lans
Developing a Bimodal Logical Data Warehouse Architecture
Using Data Virtualization
Register at: http://bit.ly/2frs782
Get Started Today!
Download Denodo Express: www.denodoexpress.com
Access Denodo on AWS:
www.denodo.com/en/denodo-platform/denodo-platform-for-aws
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying and microfilm, without the prior written authorization from Denodo Technologies.
