What is Data Virtualization and Why It Matters to You
Alberto Pan, CTO
Justo Hidalgo, VP Product Management & Consulting
Denodo Technologies
 
Contents
  • Why Data Virtualization?
  • Productivity
  • Distributed Query Optimization
  • Layer Independence
  • Governance
  • Data Quality
  • Architecture
Our Goal:  Serving the Information Barista
GREAT, BUT  WHAT’S THE PROBLEM?
Disjoint Views of Entities – the Elements
  • Customer data is spread over different, heterogeneous data sources.
  • Locating and obtaining the data takes too much effort.
  • Data must be not only extracted but also combined across different applications, interfaces and formats.
  • Sources: log files (.txt/.log files), CRM (MySQL), billing system (REST web service), incidences system (web application), inventory system (MS SQL Server), product catalog (SOAP web service), knowledge base (Internet), product data (CSV).
It Would be So Nice If…
Happy Ending: Single View of Element – Virtual Integration
  • Homogeneous access to all data.
  • Access paths shown: JDBC, ODBC, WS, CSV, XML, Web, flat files.
  • Sources: CRM (MySQL), billing system (REST web service), incidences system (web application), inventory system (MS SQL Server), product catalog (SOAP web service), knowledge base, product data (CSV), log files (.txt/.log files).
BUT, WHY A  DATA VIRTUALIZATION  LAYER ?
DIDN’T WE HAVE ENOUGH WITH ETL, ESB, EAI, WS, …?
 
So, We Went and Asked our Experts
Why a Data Virtualization Layer?
  • Productivity
  • Distributed Query Optimization
  • Physical and Logical Independence
  • Governance
  • Data Quality
PRODUCTIVITY (because time is money)
Productivity…
  • Built-in connectors for data sources.
  • Complex data combination operations do not need to be programmed.
  • Consumers: applications and 3rd-party tools (enterprise applications, BI, portals, dashboards, web applications…).
  [Figure: views (NAME, DESCRIPTION, PRICE), (NAME, DESCRIPTION, PRICE) and (NAME, MANUFACTURER, SCORE) combined through union (U) and join (∞) into (NAME, DESCRIPTION, PRICE, MANUFACTURER, SCORE).]
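A minimal sketch of declaring a combination instead of programming it. It assumes an in-memory SQLite database as a stand-in for the virtual layer; the table names, view names and sample rows are hypothetical, and in a real deployment the base tables would be wrappers over separate sources.

```python
import sqlite3

# Hypothetical stand-ins for two product catalogs and one review source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalog_a (name TEXT, description TEXT, price REAL);
    CREATE TABLE catalog_b (name TEXT, description TEXT, price REAL);
    CREATE TABLE reviews   (name TEXT, manufacturer TEXT, score INTEGER);

    INSERT INTO catalog_a VALUES ('Router X', 'Dual-band router', 79.0);
    INSERT INTO catalog_b VALUES ('Switch Y', '8-port switch', 45.0);
    INSERT INTO reviews VALUES ('Router X', 'Acme', 4), ('Switch Y', 'Globex', 5);

    -- Union of the two catalogs, declared once as a view.
    CREATE VIEW products AS
        SELECT name, description, price FROM catalog_a
        UNION ALL
        SELECT name, description, price FROM catalog_b;

    -- Join of the unified catalog with the reviews.
    CREATE VIEW product_overview AS
        SELECT p.name, p.description, p.price, r.manufacturer, r.score
        FROM products AS p
        JOIN reviews AS r ON r.name = p.name;
""")

# Consumers query the combined view and never see the individual sources.
rows = conn.execute("SELECT * FROM product_overview ORDER BY name").fetchall()
for row in rows:
    print(row)
```

The consuming application issues a single query against `product_overview`; the union and join logic lives in the view definitions only.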
… Productivity…
  • Applications do not need to deal with complex data-related issues, e.g.:
      • swapping of large result sets;
      • caching of costly result sets;
      • management of changes in the sources, which is done in the DV layer and leaves the business layer unaffected.
  • Collaboration and prototyping: virtualization allows rapid prototyping and testing.
… Productivity
  • Uniform access: developers use a single model and API instead of learning a mixture of different APIs. Learning and execution curves are lower for every additional project built on top of the DV layer.
  • Multi-access: a Data Virtualization layer can offer the most appropriate access type for each application (JDBC, Web Service, SharePoint widget…).
DISTRIBUTED QUERY OPTIMIZATION (because customers are waiting)
Distributed Query Optimization…
  • Multiple execution strategies are available.
  • The performance of a distributed join query may vary enormously depending on the join method used (e.g. hash join, merge join, nested loop join…).
  • Even when the join is over the same data views, the optimal method may differ from query to query.
The final executable plan depends on characteristics such as strategies, sources and order.
  [Figure: a logical plan joining BOOK REVIEW with BOOKSTORE A and BOOKSTORE B, next to candidate physical plans that differ in join order and in join method (hash join vs. nested loop join).]
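To illustrate why the choice of join method matters, here is a small sketch of two of the strategies named above. The function names, the `isbn` key and the sample rows are assumptions made for the example; they do not come from any particular product.

```python
def nested_loop_join(left, right, key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    return [
        {**lrow, **rrow}
        for lrow in left
        for rrow in right
        if lrow[key] == rrow[key]
    ]


def hash_join(left, right, key):
    """Build a hash table on one input, then probe it once per row of the other."""
    buckets = {}
    for rrow in right:
        buckets.setdefault(rrow[key], []).append(rrow)
    return [
        {**lrow, **rrow}
        for lrow in left
        for rrow in buckets.get(lrow[key], [])
    ]


# Hypothetical sample data: books from a bookstore and reviews from another source.
books = [{"isbn": "111", "title": "Dune"}, {"isbn": "222", "title": "Emma"}]
reviews = [
    {"isbn": "111", "score": 5},
    {"isbn": "111", "score": 4},
    {"isbn": "333", "score": 2},
]

by_loop = nested_loop_join(books, reviews, "isbn")
by_hash = hash_join(books, reviews, "isbn")
print(by_loop == by_hash)
```

Both functions return the same rows, but the nested loop version compares every pair while the hash version touches each row roughly once. Which one is cheaper in practice depends on input sizes and on what each source can do, which is why an optimizer has to choose per query.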
… Distributed Query Optimization
  • Source query limitations.
  • Push processing to the data sources, e.g. delegate a join to the data source instead of executing it in the DV layer.
  • Materialization: pre-load frequently used data and exploit temporal locality.
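A rough sketch of what delegating a join to the data source buys, assuming both tables live in the same source (an in-memory SQLite database here). The table names, sample rows and the "rows transferred" count are illustrative assumptions, not measurements.

```python
import sqlite3

# Hypothetical source holding both tables involved in the join.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE invoices  (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Luis'), (3, 'Marta');
    INSERT INTO invoices  VALUES (1, 120.0), (1, 30.0), (3, 75.5);
""")


def join_in_layer(conn):
    """Fetch both tables in full and join them outside the source."""
    customers = conn.execute("SELECT id, name FROM customers").fetchall()
    invoices = conn.execute("SELECT customer_id, amount FROM invoices").fetchall()
    names = dict(customers)
    joined = [(names[cid], amount) for cid, amount in invoices if cid in names]
    return sorted(joined), len(customers) + len(invoices)


def join_pushed_down(conn):
    """Send one query so the source performs the join itself."""
    joined = conn.execute(
        "SELECT c.name, i.amount FROM customers AS c "
        "JOIN invoices AS i ON i.customer_id = c.id"
    ).fetchall()
    return sorted(joined), len(joined)


layer_rows, layer_transferred = join_in_layer(source)
pushed_rows, pushed_transferred = join_pushed_down(source)
print(layer_transferred, pushed_transferred)
```

Both strategies return the same result, but the pushed-down version moves only the joined rows out of the source. Push-down is possible only when the source supports the operation, which is where the source query limitations come in.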
 
Physical and Logical Independence…
  • Applications are independent of changes in data source location, implementation (e.g. from a legacy system to a new one) and schema.
  • Examples:
      • A mainframe is replaced by a new system.
      • Customer data now comes from two systems instead of one because of a merger or acquisition.
      • Two applications are re-engineered into a single one.
      • The data schema of a data source changes.
… Physical and Logical Independence…
  • Let each tool do its business!
  • An ESB is good at orchestrating business services.
  • Data Virtualization is good at accessing information repositories, homogenizing them and turning them into services.
  [Figure labels: ESB, DATA VIRTUALIZATION]
… Physical and Logical Independence
  • Changes need to be made in a single place.
  • E.g. the way to determine whether a customer is a ‘VIP’ changes. Many applications use this data field, and some of them (e.g. BRMS systems) may use it many times.
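The ‘VIP’ example can be sketched with a view that owns the business rule. This assumes an in-memory SQLite database as a stand-in for the DV layer; the `customer_profile` view, the `yearly_spend` column and both thresholds are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT, yearly_spend REAL);
    INSERT INTO customer VALUES (1, 'Ana', 12000), (2, 'Luis', 6000);

    -- The VIP rule is defined once, in the view.
    CREATE VIEW customer_profile AS
        SELECT id, name, yearly_spend > 10000 AS is_vip FROM customer;
""")


def vip_names(connection):
    """Application code: it reads is_vip and never encodes the rule itself."""
    cursor = connection.execute(
        "SELECT name FROM customer_profile WHERE is_vip ORDER BY name"
    )
    return [name for (name,) in cursor]


before = vip_names(conn)

# The business rule changes: only the view definition is touched.
conn.executescript("""
    DROP VIEW customer_profile;
    CREATE VIEW customer_profile AS
        SELECT id, name, yearly_spend > 5000 AS is_vip FROM customer;
""")

after = vip_names(conn)
print(before, after)
```

`vip_names` stays unchanged across the rule change, which is the point: every application that reads `is_vip` picks up the new definition without being edited.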
GOVERNANCE (because 24x7 matters)
Governance…
  • Single entry point for data auditing: track data and metadata changes. E.g. which user was the last to modify a certain view?
  • Single point to introspect and query metadata. E.g. what is the schema provided by a given data source?
… Governance…
  • Change impact management. Single point to answer questions like:
      • What are the consequences of a change in a data source?
      • Where does the data used by applications come from?
      • What transformations are applied to source data before it is consumed by applications?
… Governance
  • Single entry point for data monitoring: track the usage of data sources and data services. E.g.:
      • How does the number of concurrent connections to a data source evolve throughout the day?
      • Send me an e-mail alert if at least 10% of the last 100 queries to a data source failed.
  • Security: provide authentication and authorization mechanisms for data access, and data encryption functionality.
  • Protect data sources: limit concurrent queries to a given data source, cache all or part of the data, and limit data replication needs at the data source level.
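The alert rule in the monitoring example ("at least 10% of the last 100 queries failed") can be sketched with a sliding window. The class name, its parameters and the decision to stay silent until the window is full are assumptions made for this illustration; sending the e-mail itself is left out.

```python
from collections import deque


class SourceMonitor:
    """Tracks the outcome of the most recent queries sent to one data source."""

    def __init__(self, window=100, threshold_percent=10):
        self.outcomes = deque(maxlen=window)  # oldest outcome drops out automatically
        self.threshold_percent = threshold_percent

    def record(self, succeeded):
        self.outcomes.append(bool(succeeded))

    def should_alert(self):
        # Assumption: do not alert until a full window has been observed.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        failures = self.outcomes.count(False)
        # Integer arithmetic avoids floating-point rounding at the boundary.
        return failures * 100 >= self.threshold_percent * len(self.outcomes)


monitor = SourceMonitor()
for _ in range(91):
    monitor.record(True)
for _ in range(9):
    monitor.record(False)
alert_at_nine_failures = monitor.should_alert()

monitor.record(False)  # pushes out the oldest success: 10 failures in the window
alert_at_ten_failures = monitor.should_alert()
print(alert_at_nine_failures, alert_at_ten_failures)
```

Because every query passes through the same layer, one such monitor per data source is enough; individual applications do not need their own bookkeeping.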
DATA QUALITY (because reliability matters)
Data Quality
  • Many data quality actions can be applied at this layer, which avoids duplicating them in every data source or application.
…  AND WHAT CAN WE DO WITH THESE PIECES?
Data Virtualization Detailed Architecture…
WRAPPING UP
Denodo Platform 4.6 – Virtualized Data Services in Less Time
  • Improved connectivity with the enterprise ecosystem: sources connectivity; middleware and DQ tools; publish level.
  • Improved productivity and ease of use for the application developer (connectivity, web integration, etc.) and the data management professional (metadata, governance, etc.).
  • Benefits to business: rapid access to real-time data from disparate sources for
      • agile reporting and operational BI / dashboards;
      • customer service operations and customer portals.
  • Web integration becomes “mainstream”.
You might want to start small …
…  but you can get very far with Data Virtualization!
www.denodo.com | info@denodo.com


Why Data Virtualization? An Introduction by Denodo


Editor's Notes

  1. http://www.flickr.com/photos/maxbraun/98688824/
  2. http://dutchamericantranslations.wordpress.com/2010/01/04/matters-of-taste-acronym-or-initialism/
  3. http://www.flickr.com/photos/glenirah/4376553184/
  4. http://www.flickr.com/photos/adikos/4443291195/
  5. Collaboration: self-documenting model, but also actionable. Rapid prototyping platform.
  6. Collaboration: self-documenting model, but also actionable. Rapid prototyping platform.
  7. http://www.flickr.com/photos/laserstars/908946494/
  8. http://www.flickr.com/photos/tudor/458287668/
  9. http://www.flickr.com/photos/totalaldo/508664515/
  10. http://www.flickr.com/photos/heist_mine/4256417595/
  11. http://www.flickr.com/photos/oskay/2157682522/
  12. http://www.flickr.com/photos/stevendepolo/3703145222/
  13. http://www.flickr.com/photos/m-nicolson/2414298534/
  14. http://www.flickr.com/photos/psd/2086641/