Is the traditional data warehouse dead?
James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com
(Data Lake and Data Warehouse – the
best of both worlds)
About Me
• Microsoft, Big Data Evangelist
• In IT for 30 years, worked on many BI and DW projects
• Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
• Been perm employee, contractor, consultant, business owner
• Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
• Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
• Blog at JamesSerra.com
• Former SQL Server MVP
• Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
• Data Warehouse
• Data Lake
• The best of both worlds
• Federated querying
• Patterns
Considering Data Types
• Unstructured: audio, video, images. Meaningless without adding some structure.
• Semi-structured: JSON, XML, sensor data, social media, device data, web logs. Flexible data model structure.
• Structured: CSV, columnar storage (Parquet, ORC). Strict data model structure.
Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types.
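As a small illustration of the structured vs. semi-structured distinction above, here is a minimal PySpark sketch (file paths and column names are hypothetical): the CSV is read against a strict, predeclared schema, while the JSON feed's flexible structure is discovered from the data itself.

```python
# Minimal sketch contrasting structured and semi-structured reads.
# File paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("data-types").getOrCreate()

# Structured: CSV read against a strict, predeclared data model.
sales_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
])
sales = spark.read.schema(sales_schema).option("header", True).csv("/data/sales/*.csv")

# Semi-structured: JSON whose flexible structure is inferred from the files.
clicks = spark.read.json("/data/weblogs/*.json")
clicks.printSchema()  # schema discovered from the data, not declared up front
```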
Two Approaches to getting value out of data: Top-Down + Bottom-Up
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
The top-down path runs Theory -> Hypothesis -> Observation -> Confirmation; the bottom-up path runs Observation -> Pattern -> Hypothesis -> Theory.
Is the traditional data warehouse dead?
Of course you still need a data warehouse
A data warehouse is where you store data from multiple data sources to be used for historical and
trend analysis reporting. It acts as a central repository for many subject areas and contains the "single
version of truth".
Reasons for a data warehouse:
• Reduce stress on production system
• Optimized for read access, sequential disk scans
• Integrate many sources of data
• Keep historical records (no need to save hardcopy reports)
• Restructure/rename tables and fields, model data
• Protect against source system upgrades
• Use Master Data Management, including hierarchies
• No IT involvement needed for users to create reports
• Improve data quality and plug holes in source systems
• One version of the truth
• Easy to create BI solutions on top of it (e.g. SSAS cubes)
Traditional Data Warehousing Uses A Top-Down Approach
A waterfall flow from corporate strategy down to the finished warehouse:
• Understand corporate strategy
• Gather requirements (business requirements, technical requirements) and identify data sources
• Setup infrastructure; install and tune
• Dimension modelling and physical design
• ETL design and ETL development
• Reporting & analytics design and development
• Implement data warehouse
Is the traditional data warehouse dead?
Relational LOB applications feed a dedicated ETL pipeline (e.g. SSIS) into a defined schema, which then serves queries and results.
Traditional business analytics process
1. Start with end-user requirements to identify desired reports
and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract
required data (curation) and transform it to target schema
(‘schema-on-write’)
5. Create reports. Analyze data
All data not immediately required is discarded or archived
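To make step 4's "schema-on-write" concrete, here is a minimal PySpark sketch of the traditional ETL flow (paths, table names, and columns are hypothetical): the target schema is declared up front, and only data that conforms to it is loaded.

```python
# Minimal schema-on-write sketch: the target schema is fixed before any data is loaded,
# and the ETL job conforms source data to it. Paths, table names, and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DateType, DecimalType

spark = SparkSession.builder.appName("schema-on-write-etl").getOrCreate()

target_schema = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("order_date", DateType(), nullable=False),
    StructField("order_total", DecimalType(18, 2), nullable=False),
])

raw = spark.read.option("header", True).csv("/staging/orders/")            # extract
conformed = (raw
    .withColumn("order_date", F.to_date("order_date"))                      # transform to the target schema
    .withColumn("order_total", F.col("order_total").cast("decimal(18,2)"))
    .select([field.name for field in target_schema.fields])
    .dropna())                                                               # rows not fitting the schema are discarded
conformed.write.mode("append").saveAsTable("dw.fact_orders")                 # load (a managed table as a stand-in for the warehouse)
```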
Harness the growing and changing nature of data
• Need to collect any data: structured, streaming, and unstructured (the three V's; Big Data = All Data)
• Challenge is combining transactional data stored in relational databases with less structured data
• Get the right information to the right people at the right time in the right format
New big data thinking: All data has value
Gather data from all sources -> store indefinitely -> analyze -> see results -> iterate.
• All data has potential value
• Data hoarding
• No defined schema; data stored in native format
• Schema is imposed and transformations are done at query time (schema-on-read)
• Apps and users interpret the data as they see fit
The “data lake” Uses A Bottom-Up Approach
• Ingest all data from devices and other sources, regardless of requirements
• Store all data in native format without schema definition
• Do analysis using analytic engines like Hadoop: interactive queries, batch queries, real-time analytics, machine learning, and feeding the data warehouse
Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
Exactly what is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native
format until it is needed.
• Inexpensively store unlimited data
• Collect all data “just in case”
• Store data with no modeling – “Schema on read”
• Complements EDW
• Frees up expensive EDW resources
• Quick user access to data
• ETL Hadoop tools
• Easily scalable
• Place to backup data to
• Place to move older data
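To illustrate the "schema on read" bullet above, here is a minimal PySpark sketch (paths and field names are hypothetical): raw files are landed in the lake as-is, and structure is applied only when someone queries them.

```python
# Minimal schema-on-read sketch: land raw files untouched, apply structure at query time.
# Paths and field names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-lake-schema-on-read").getOrCreate()

# Ingest: copy device logs into the lake in their native JSON format (no modeling up front).
raw_logs = spark.read.text("/landing/devices/2018/01/*.json")
raw_logs.write.mode("append").text("/lake/raw/devices/")

# Later, at query time, each consumer imposes the structure they need.
events = spark.read.json("/lake/raw/devices/")
daily_temps = (events
    .where(F.col("sensor_type") == "temperature")
    .groupBy(F.to_date("event_time").alias("day"))
    .agg(F.avg("reading").alias("avg_reading")))
daily_temps.show()
```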
Is the traditional data warehouse dead?
Needs data governance so your data lake does not turn
into a data swamp!
Is the traditional data warehouse dead?
The real cost of Hadoop
https://www.scribd.com/document/172491475/WinterCorp-Report-Big-Data-What-Does-It-Really-Cost/
Is the traditional data warehouse dead?
A data lake is just a glorified file folder with data files in it – how many end-users can
accurately create reports from it?
• Query performance not as good as relational database
• Complex query support not good due to lack of query optimizer, in-database operators, advanced memory management,
concurrency, dynamic workload management and robust indexing
• Concurrency limitations
• No concept of “hot” and “cold” data storage with different levels of performance to reduce cost
• Not a DBMS so lack of features such as update/delete of data, referential integrity, statistics, ACID compliance, data security
• File based so no granular security definition at the column level
• No metadata stored in HDFS, so another tool is required, adding complexity and slowing performance
• Finding expertise in Hadoop is very difficult
• Super complex, with lots of integration with multiple technologies to make it work
• Many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard
• Lack of master data management tools for Hadoop
• Requires end-users to learn new reporting tools and Hadoop technologies to query the data
• Pace of change is so quick many Hadoop technologies become obsolete, adding risk
• Lack of cost savings: cloud consumption, support, licenses, training, and migration costs
• Need conversion process to convert data to a relational format if a reporting tool requires it
• Some reporting tools don’t work against Hadoop
Is the traditional data warehouse dead?
Current state of a data warehouse
Traditional Approaches
Data sources (CRM, ERP, OLTP, LOB) -> ETL -> data warehouse (star schemas, views and other read-optimized structures) -> BI and analytics (emailed, centrally stored Excel reports and dashboards), with monitoring and telemetry throughout.
• Well manicured, often relational sources
• Known and expected data volume and formats
• Little to no change
• Complex, rigid transformations
• Required extensive monitoring
• Transformed historical data into read structures
• Flat, canned or multi-dimensional access to historical data
• Many reports, multiple versions of the truth
• 24 to 48h delay
Current state of a data warehouse
Traditional Approaches
The same pipeline of data sources (CRM, ERP, OLTP, LOB) -> ETL -> data warehouse (star schemas, views and other read-optimized structures) -> BI and analytics (emailed, centrally stored Excel reports and dashboards) now faces increasing data volume and non-relational data, and over time reporting goes stale:
• Increase in variety of data sources
• Increase in data volume
• Increase in types of data
• Pressure on the ingestion engine
• Complex, rigid transformations can no longer keep pace
• Monitoring is abandoned
• Delay in data, inability to transform volumes, or react to new sources
• Repair, adjust and redesign ETL
• Reports become invalid or unusable
• Delay in preserved reports increases
• Users begin to “innovate” to relieve starvation
Data Lake Transformation (ELT not ETL)
New Approaches
• All data sources are considered
• Leverages the power of on-prem technologies and the cloud for storage and capture
• Native formats, streaming data, big data
• Extract and load, no/minimal transform
• Storage of data in near-native format
• Orchestration becomes possible
• Streaming data accommodation becomes possible
• Refineries transform data on read
• Produce curated data sets to integrate with traditional warehouses
• Users discover published data sets/services using familiar tools
Data sources (CRM, ERP, OLTP, LOB, non-relational data, future data sources) -> extract and load -> data lake -> data refinery process (transform on read), which transforms relevant data into data sets -> data warehouse (star schemas, views, other read-optimized structures) -> BI and analytics, where users discover and consume predictive analytics, data sets and other reports.
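A minimal PySpark sketch of the ELT flow described above: extract and load raw data with no/minimal transform, then "refine" it on read into a curated data set. Connection details, paths, and columns are hypothetical, and a suitable JDBC driver is assumed to be available.

```python
# Minimal ELT sketch: extract and load raw data, then transform on read ("refinery").
# Connection details, paths, and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-refinery").getOrCreate()

# Extract + Load: pull from an operational source and land it in near-native form.
orders_raw = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://erp-host;databaseName=ERP")   # hypothetical source
    .option("dbtable", "dbo.Orders")
    .load())
orders_raw.write.mode("append").parquet("/lake/raw/erp/orders/")

# Transform on read: produce a curated data set for the warehouse and BI tools.
curated = (spark.read.parquet("/lake/raw/erp/orders/")
    .withColumn("order_month", F.date_trunc("month", "OrderDate"))
    .groupBy("order_month", "Region")
    .agg(F.sum("Amount").alias("total_sales")))
curated.write.mode("overwrite").parquet("/lake/curated/sales_by_region_month/")
```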
Data Lake + Data Warehouse Better Together
Data sources feed both, covering the full spectrum of analytics:
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
Modern Data Warehouse
• Ultimate goal
• Supports future data needs
• Data harmonized and analyzed in
the data lake or moved to EDW for
more quality and performance
Data Lake vs. Data Warehouse:
• Schema-on-read vs. schema-on-write
• Physical collection of uncurated data vs. data of common meaning
• System of Insight (unknown data for experimentation / data discovery) vs. System of Record (well-understood data for operational reporting)
• Any type of data vs. a limited set of data types (i.e. relational)
• Skills are limited vs. skills mostly available
• All workloads (batch, interactive, streaming, machine learning) vs. optimized for interactive querying
• Complementary to the DW vs. can be sourced from the data lake
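A minimal sketch of the "can be sourced from the data lake" row above: a curated data set produced in the lake is pushed into the warehouse over JDBC. Connection string, credentials, and table names are hypothetical.

```python
# Minimal sketch: publish a curated data set from the lake into the EDW over JDBC.
# Connection string, credentials, and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-to-edw").getOrCreate()

curated = spark.read.parquet("/lake/curated/sales_by_region_month/")

(curated.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://edw-host;databaseName=EDW")
    .option("dbtable", "dbo.SalesByRegionMonth")
    .option("user", "loader")
    .option("password", "<secret>")
    .mode("append")
    .save())
```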
Data Warehouse
Serving, Security & Compliance
• Business people
• Low latency
• Complex joins
• Interactive ad-hoc query
• High number of users
• Additional security
• Large support for tools
• Dashboards
• Easily create reports (Self-service BI)
• Known questions
Use cases using Hadoop and a DW in combination
Bringing islands of Hadoop data together
Archiving data warehouse data to Hadoop (move)
(Hadoop as cold storage)
Exporting relational data to Hadoop (copy)
(Hadoop as backup/DR, analysis, cloud use)
Importing Hadoop data into data warehouse (copy)
(Hadoop as staging area, sandbox, Data Lake)
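A minimal sketch of the first two use cases above (Hadoop as cold storage or backup): old warehouse rows are read over JDBC and written to cheap lake storage. Connection details, table names, and the retention cutoff are hypothetical.

```python
# Minimal sketch: archive old warehouse rows to the lake ("Hadoop as cold storage").
# Connection details, table names, and the retention cutoff are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dw-archive").getOrCreate()

old_rows = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://edw-host;databaseName=EDW")
    .option("dbtable", "(SELECT * FROM dbo.FactSales WHERE SaleDate < '2013-01-01') AS cold")
    .load())

# Write the cold data to inexpensive lake storage, partitioned by year for later retrieval.
(old_rows
    .withColumn("sale_year", F.year("SaleDate"))
    .write.partitionBy("sale_year")
    .mode("append")
    .parquet("/lake/archive/factsales/"))
# A separate step (not shown) would delete the archived rows from the warehouse to complete the "move".
```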
Reasons you still need a cube/OLAP
• Semantic layer
• Handle many concurrent users
• Aggregating data for performance
• Multidimensional analysis
• No joins or relationships
• Hierarchies, KPIs
• Row-level security
• Advanced time-calculations
• Slowly Changing Dimensions (SCD)
Federated Querying
Other names: Data virtualization, logical data warehouse, data
federation, virtual database, and decentralized data warehouse.
A model that allows a single query to retrieve and combine data as it sits in multiple data sources, so there is no need to use ETL or to learn more than one retrieval technology.
SQL Server and PolyBase
Query relational and non-relational data with T-SQL
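PolyBase itself exposes external data through T-SQL. As a language-neutral illustration of the same federated idea, here is a minimal PySpark sketch that joins warehouse rows (read over JDBC) with raw files sitting in the lake, without copying either side through an ETL pipeline first. Connection details, paths, and column names are hypothetical.

```python
# Minimal federated-query sketch: one query combines data where it sits,
# joining a relational table (via JDBC) with raw files in the lake.
# Connection details, paths, and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("federated-query").getOrCreate()

customers = (spark.read.format("jdbc")                      # relational side
    .option("url", "jdbc:sqlserver://edw-host;databaseName=EDW")
    .option("dbtable", "dbo.DimCustomer")
    .load())

clickstream = spark.read.json("/lake/raw/weblogs/")          # non-relational side

# A single query spanning both sources; no ETL step copied data beforehand.
visits_by_segment = (clickstream
    .join(customers, clickstream.customer_id == customers.CustomerKey)
    .groupBy("CustomerSegment")
    .agg(F.countDistinct("session_id").alias("sessions")))
visits_by_segment.show()
```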
Is the traditional data warehouse dead?
Advanced Analytics
Sources (Social, LOB, Graph, IoT, Image, CRM) flow through four stages:
• Ingest: data orchestration and monitoring with Azure Data Factory
• Store: big data store with Azure Blob Storage and Azure Data Lake
• Prep & Train: Hadoop/Spark and machine learning with Azure Databricks, Azure HDInsight, Azure Machine Learning and Machine Learning Server
• Model & Serve: data warehouse, cloud bursting, and BI + reporting with Azure SQL Data Warehouse and Azure Analysis Services
Ingest, Store, Prep & Train, Model & Serve
• Sources: logs, files and media (unstructured); business/custom apps (structured)
• Ingest: Azure Data Factory
• Store: Azure Data Lake Store
• Prep & Train: Azure Databricks, Azure HDInsight, Data Lake Analytics
• Model & Serve: Azure SQL Data Warehouse (loaded via PolyBase) and Azure Analysis Services, feeding analytical dashboards
Ingest, Store, Prep & Train, Model & Serve
• Sources: logs, files and media (unstructured); business/custom apps (structured)
• Store: Azure Data Lake Store
• Prep & Train: Hortonworks
• Model & Serve: Azure SQL Data Warehouse (via PolyBase) and Azure SQL Database, with Tableau Server for analytical dashboards, operational reports and ad-hoc query
Is the traditional data warehouse dead?
https://aka.ms/ADAG
Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)
More Related Content

What's hot

How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
HostedbyConfluent
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
Kujambu Murugesan
 
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
Adaryl "Bob" Wakefield, MBA
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Dr. Arif Wider
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
Brett VanderPlaats
 
Data engineering design patterns
Data engineering design patternsData engineering design patterns
Data engineering design patterns
Valdas Maksimavičius
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Tristan Baker
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
Amazon Web Services
 
Data Governance and Metadata Management
Data Governance and Metadata ManagementData Governance and Metadata Management
Data Governance and Metadata Management
DATAVERSITY
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
Dmitry Anoshin
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
James Serra
 

What's hot (20)

How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Data engineering design patterns
Data engineering design patternsData engineering design patterns
Data engineering design patterns
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Data Governance and Metadata Management
Data Governance and Metadata ManagementData Governance and Metadata Management
Data Governance and Metadata Management
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 

Similar to Is the traditional data warehouse dead?

Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
James Serra
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
Martin Bém
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
punedevscom
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
David P. Moore
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Amazon Web Services LATAM
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
Moacyr Passador
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Rukmani Gopalan
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
datastack
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
Attunity
 
IT Category Purchasing Managers Opportunity for Savings with Non Relational S...
IT Category Purchasing Managers Opportunity for Savings with Non Relational S...IT Category Purchasing Managers Opportunity for Savings with Non Relational S...
IT Category Purchasing Managers Opportunity for Savings with Non Relational S...
Bill Kohnen
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
Michael Hiskey
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
Ashnikbiz
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Pentaho
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
Milos Milovanovic
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Institute of Contemporary Sciences
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
Skillwise Group
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
Skillwise Group
 

Similar to Is the traditional data warehouse dead? (20)

Data Lake Overview
Data Lake Overview
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Prague data management meetup 2018-03-27
Designing modern dw and data lake
So You Want to Build a Data Lake?
Big Data Practice_Planning_steps_RK
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Data lake-itweekend-sharif university-vahid amiry
Accelerating Big Data Analytics
IT Category Purchasing Managers Opportunity for Savings with Non Relational S...
Meta scale kognitio hadoop webinar
Transform your DBMS to drive engagement innovation with Big Data
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Planing and optimizing data lake architecture
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Datalake Architecture
Skilwise Big data
Skillwise Big Data part 2

More from James Serra

Microsoft Fabric Introduction
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r1)
Power BI Overview, Deployment and Governance
Power BI Overview
Machine Learning and AI
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
Power BI for Big Data and the New Look of Big Data Solutions
How to build your career
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Introduction to Azure Databricks
Azure SQL Database Managed Instance
What’s new in SQL Server 2017
Microsoft Data Platform - What's included
Learning to present and becoming good at it
Microsoft cloud big data strategy
Choosing technologies for a big data solution in the cloud
What's new in SQL Server 2016
Introducing DocumentDB
Introduction to PolyBase


Is the traditional data warehouse dead?

  • 1. Is the traditional data warehouse dead? James Serra Big Data Evangelist Microsoft JamesSerra3@gmail.com (Data Lake and Data Warehouse – the best of both worlds)
  • 2. About Me  Microsoft, Big Data Evangelist  In IT for 30 years, worked on many BI and DW projects  Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer  Been perm employee, contractor, consultant, business owner  Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference  Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions  Blog at JamesSerra.com  Former SQL Server MVP  Author of book “Reporting with Microsoft SQL Server 2012”
  • 3. Agenda  Data Warehouse  Data Lake  The best of both worlds  Federated querying  Patterns
  • 4. Considering Data Types. Unstructured: audio, video, images; meaningless without adding some structure. Semi-structured: JSON, XML, sensor data, social media, device data, web logs; flexible data model structure. Structured: CSV, columnar storage (Parquet, ORC); strict data model structure. Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types
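To make the semi-structured case concrete, here is a minimal T-SQL sketch (OPENJSON is available in SQL Server 2016 and later; the device payload, column names, and JSON paths are invented for illustration) showing a flexible JSON document being given a tabular shape only when it is queried:

```sql
-- Hypothetical IoT payload: the JSON carries its own flexible structure.
DECLARE @device NVARCHAR(MAX) = N'{
  "deviceId": "sensor-042",
  "readings": [
    { "ts": "2018-02-01T08:00:00", "tempC": 21.4 },
    { "ts": "2018-02-01T08:05:00", "tempC": 21.9 }
  ]
}';

-- A relational shape is imposed only at query time.
SELECT d.deviceId, r.ts, r.tempC
FROM OPENJSON(@device)
         WITH (deviceId NVARCHAR(50),
               readings NVARCHAR(MAX) AS JSON) AS d
CROSS APPLY OPENJSON(d.readings)
         WITH (ts    DATETIME2    '$.ts',
               tempC DECIMAL(5,2) '$.tempC') AS r;
```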
  • 5. Two approaches to getting value out of data: Top-Down + Bottoms-Up. Top-down (deductive): theory, hypothesis, observation, confirmation; answers what happened (descriptive analytics) and why did it happen (diagnostic analytics). Bottoms-up (inductive): observation, pattern, hypothesis, theory; answers what will happen (predictive analytics) and how can we make it happen (prescriptive analytics).
  • 8. Of course you still need a data warehouse A data warehouse is where you store data from multiple data sources to be used for historical and trend analysis reporting. It acts as a central repository for many subject areas and contains the "single version of truth". Reasons for a data warehouse:  Reduce stress on production system  Optimized for read access, sequential disk scans  Integrate many sources of data  Keep historical records (no need to save hardcopy reports)  Restructure/rename tables and fields, model data  Protect against source system upgrades  Use Master Data Management, including hierarchies  No IT involvement needed for users to create reports  Improve data quality and plug holes in source systems  One version of the truth  Easy to create BI solutions on top of it (e.g. SSAS cubes)
  • 9. Traditional data warehousing uses a top-down approach: understand corporate strategy; gather requirements (business requirements, technical requirements); set up infrastructure; design (dimension modelling, ETL design, reporting & analytics design); implement the data warehouse (physical design, ETL development, reporting & analytics development, install and tune), drawing on the data sources.
  • 14. Traditional business analytics process (ETL pipeline: relational LOB applications, dedicated ETL tools such as SSIS, a defined schema, queries and results): 1. Start with end-user requirements to identify desired reports and analysis 2. Define corresponding database schema and queries 3. Identify the required data sources 4. Create an Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to the target schema (“schema-on-write”) 5. Create reports and analyze the data. All data not immediately required is discarded or archived.
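As a minimal sketch of what “schema-on-write” looks like in practice (all schema, table, and column names below are hypothetical and the transform is deliberately simplified), the target schema is designed first and only conforming, already-transformed rows are loaded into it:

```sql
-- The target schema is fixed up front, before any data is loaded.
CREATE TABLE dbo.FactSales
(
    OrderDateKey INT            NOT NULL,
    CustomerKey  INT            NOT NULL,
    ProductKey   INT            NOT NULL,
    SalesAmount  DECIMAL(18, 2) NOT NULL
);

-- Simplified transform-and-load step: rows are conformed to the schema as they are written.
INSERT INTO dbo.FactSales (OrderDateKey, CustomerKey, ProductKey, SalesAmount)
SELECT CONVERT(INT, FORMAT(o.OrderDate, 'yyyyMMdd')),   -- surrogate date key
       c.CustomerKey,
       p.ProductKey,
       o.Quantity * o.UnitPrice
FROM   staging.Orders  AS o
JOIN   dbo.DimCustomer AS c ON c.SourceCustomerId = o.CustomerId
JOIN   dbo.DimProduct  AS p ON p.SourceProductId  = o.ProductId;
```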
  • 15. Harness the growing and changing nature of data. Need to collect any data: streaming, structured, unstructured. The challenge is combining transactional data stored in relational databases with less structured data. Big Data = All Data. Get the right information to the right people at the right time in the right format.
  • 17. New big data thinking: all data has value, all data has potential value (data hoarding). Gather data from all sources, store indefinitely, analyze, see results, and iterate. No defined schema; data is stored in native format. Schema is imposed and transformations are done at query time (schema-on-read); apps and users interpret the data as they see fit.
  • 18. The “data lake” uses a bottoms-up approach: ingest all data regardless of requirements, store all data in native format without schema definition, then do analysis using analytic engines like Hadoop (interactive queries, batch queries, machine learning, data warehouse, real-time analytics, devices).
  • 19. Data Analysis Paradigm Shift OLD WAY: Structure -> Ingest -> Analyze NEW WAY: Ingest -> Analyze -> Structure
  • 20. Exactly what is a data lake? A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. • Inexpensively store unlimited data • Collect all data “just in case” • Store data with no modeling – “Schema on read” • Complements EDW • Frees up expensive EDW resources • Quick user access to data • ETL Hadoop tools • Easily scalable • Place to backup data to • Place to move older data
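A hedged sketch of “schema on read” using PolyBase-style external tables (syntax as in SQL Server 2016, APS, and Azure SQL Data Warehouse; the Hadoop endpoint, folder paths, and column names are placeholders): the files stay in the lake in their native format, and a schema is only projected onto them at query time.

```sql
-- Point at the lake; nothing is copied or modeled yet.
CREATE EXTERNAL DATA SOURCE HadoopLake
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.0.0.4:8020');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"'));

-- The schema lives in the query layer, not in the files themselves.
CREATE EXTERNAL TABLE dbo.WebLogsExternal
(
    EventTime  DATETIME2,
    UserId     INT,
    PageUrl    NVARCHAR(400),
    DurationMs INT
)
WITH (LOCATION = '/raw/weblogs/', DATA_SOURCE = HadoopLake, FILE_FORMAT = CsvFormat);

-- Ordinary T-SQL against raw files sitting in the lake.
SELECT TOP (10) * FROM dbo.WebLogsExternal;
```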
  • 22. Needs data governance so your data lake does not turn into a data swamp!
  • 24. The real cost of Hadoop https://www.scribd.com/document/172491475/WinterCorp-Report-Big-Data-What-Does-It-Really-Cost/
  • 26. A data lake is just a glorified file folder with data files in it – how many end-users can accurately create reports from it?
  • 27. • Query performance not as good as a relational database • Complex query support not good due to lack of query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing • Concurrency limitations • No concept of “hot” and “cold” data storage with different levels of performance to reduce cost • Not a DBMS so lack of features such as update/delete of data, referential integrity, statistics, ACID compliance, data security • File based so no granular security definition at the column level • No metadata stored in HDFS, so another tool is required, adding complexity and slowing performance • Finding expertise in Hadoop is very difficult • Super complex, with lots of integration with multiple technologies to make it work • Many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard • Lack of master data management tools for Hadoop • Requires end-users to learn new reporting tools and Hadoop technologies to query the data • Pace of change is so quick that many Hadoop technologies become obsolete, adding risk • Lack of cost savings: cloud consumption, support, licenses, training, and migration costs • Need a conversion process to convert data to a relational format if a reporting tool requires it • Some reporting tools don’t work against Hadoop
  • 29. Current state of a data warehouse: traditional approaches. Data sources (CRM, ERP, OLTP, LOB): well manicured, often relational sources; known and expected data volume and formats; little to no change. ETL: complex, rigid transformations; required extensive monitoring; transformed historical data into read structures. Data warehouse: star schemas, views, other read-optimized structures. BI and analytics: emailed, centrally stored Excel reports and dashboards; flat, canned or multi-dimensional access to historical data; many reports, multiple versions of the truth; 24 to 48h delay. Monitoring and telemetry.
  • 30. Current state of a data warehouse: traditional approaches under pressure. Data sources (CRM, ERP, OLTP, LOB): increase in variety of data sources, increase in data volume, increase in types of data (increasing data volume, non-relational data). ETL: pressure on the ingestion engine; complex, rigid transformations can no longer keep pace; monitoring is abandoned; delay in data, inability to transform volumes, or react to new sources; repair, adjust and redesign ETL (increase in time). Data warehouse: star schemas, views, other read-optimized structures. BI and analytics (emailed, centrally stored Excel reports and dashboards): reports become invalid or unusable; delay in preserved reports increases; users begin to “innovate” to relieve starvation (stale reporting). Monitoring and telemetry.
  • 31. Data Lake Transformation (ELT not ETL): new approaches. All data sources are considered (CRM, ERP, OLTP, LOB data sources; future data sources; non-relational data). Leverages the power of on-prem technologies and the cloud for storage and capture: native formats, streaming data, big data. Extract and load, no/minimal transform: storage of data in near-native format in the data lake; orchestration becomes possible; streaming data accommodation becomes possible. Data refinery process (transform on read): refineries transform data on read; transform relevant data into data sets; produce curated data sets to integrate with traditional warehouses (data warehouse: star schemas, views, other read-optimized structures). BI and analytics: users discover published data sets/services using familiar tools; discover and consume predictive analytics, data sets and other reports.
  • 32. Data Lake + Data Warehouse: better together. From the data sources, answer what happened (descriptive analytics), why did it happen (diagnostic analytics), what will happen (predictive analytics), and how can we make it happen (prescriptive analytics).
  • 33. Modern Data Warehouse • Ultimate goal • Supports future data needs • Data harmonized and analyzed in the data lake or moved to EDW for more quality and performance
  • 34. Data Lake vs. Data Warehouse: schema-on-read vs. schema-on-write; physical collection of uncurated data vs. data of common meaning; System of Insight (unknown data to do experimentation / data discovery) vs. System of Record (well-understood data to do operational reporting); any type of data vs. a limited set of data types (i.e. relational); skills are limited vs. skills mostly available; all workloads (batch, interactive, streaming, machine learning) vs. optimized for interactive querying; complementary to the DW vs. can be sourced from the data lake.
  • 35. Data Warehouse Serving, Security & Compliance • Business people • Low latency • Complex joins • Interactive ad-hoc query • High number of users • Additional security • Large support for tools • Dashboards • Easily create reports (Self-service BI) • Known questions
  • 36. Use cases using Hadoop and a DW in combination Bringing islands of Hadoop data together Archiving data warehouse data to Hadoop (move) (Hadoop as cold storage) Exporting relational data to Hadoop (copy) (Hadoop as backup/DR, analysis, cloud use) Importing Hadoop data into data warehouse (copy) (Hadoop as staging area, sandbox, Data Lake)
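Two of these combinations can be sketched with hedged PolyBase T-SQL, reusing the hypothetical HadoopLake data source and external objects from the earlier sketches (CTAS and CETAS syntax is PDW/APS and Azure SQL Data Warehouse; on SMP SQL Server a SELECT ... INTO could play the import role):

```sql
-- Import (copy) Hadoop data into the warehouse, e.g. as a staging table.
CREATE TABLE dbo.WebLogsStaged
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.WebLogsExternal;

-- Archive (move) cold warehouse data out to Hadoop as files,
-- then drop or truncate the local copy to reclaim expensive EDW space.
CREATE EXTERNAL TABLE dbo.FactSalesArchive
WITH (LOCATION    = '/archive/factsales/2014/',
      DATA_SOURCE = HadoopLake,
      FILE_FORMAT = CsvFormat)
AS SELECT * FROM dbo.FactSales WHERE OrderDateKey < 20150101;
```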
  • 37. Reasons you still need a cube/OLAP • Semantic layer • Handle many concurrent users • Aggregating data for performance • Multidimensional analysis • No joins or relationships • Hierarchies, KPIs • Row-level security • Advanced time calculations • Slowly Changing Dimensions (SCD)
  • 39. Federated Querying Other names: data virtualization, logical data warehouse, data federation, virtual database, and decentralized data warehouse. A model that allows a single query to retrieve and combine data where it sits across multiple data sources, so there is no need to use ETL or to learn more than one retrieval technology
  • 40. SQL Server and PolyBase Query relational and non-relational data with T-SQL
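A minimal sketch of such a federated query, assuming the hypothetical external table and dimension defined in the earlier sketches: a single T-SQL statement joins raw lake data to curated relational data, with PolyBase pushing work down to Hadoop where it can.

```sql
-- One query, two worlds: web logs sitting in Hadoop joined to a relational dimension.
SELECT   c.CustomerName,
         COUNT_BIG(*)      AS PageViews,
         AVG(w.DurationMs) AS AvgDurationMs
FROM     dbo.WebLogsExternal AS w          -- external (Hadoop) table
JOIN     dbo.DimCustomer     AS c          -- local relational table
         ON c.SourceCustomerId = w.UserId
GROUP BY c.CustomerName
ORDER BY PageViews DESC;
```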
  • 42. Advanced Analytics architecture. Data sources: social, LOB, graph, IoT, image, CRM. INGEST: data orchestration and monitoring (Azure Data Factory). STORE: big data store (Azure Blob Storage, Azure Data Lake). PREP & TRAIN: Hadoop/Spark and machine learning (Azure Databricks, Azure HDInsight, Azure Machine Learning, Machine Learning Server). MODEL & SERVE: data warehouse with cloud bursting and BI + reporting (Azure SQL Data Warehouse, Azure Analysis Services).
  • 43. INGEST, STORE, PREP & TRAIN, MODEL & SERVE. Sources: logs, files and media (unstructured) and business/custom apps (structured). INGEST: Azure Data Factory. STORE: Azure Data Lake Store. PREP & TRAIN: Azure Databricks, Azure HDInsight, Data Lake Analytics. MODEL & SERVE: Azure SQL Data Warehouse (loaded via PolyBase), Azure Analysis Services, analytical dashboards.
  • 44. INGEST, STORE, PREP & TRAIN, MODEL & SERVE. Sources: logs, files and media (unstructured) and business/custom apps (structured). STORE: Azure Data Lake Store. PREP & TRAIN: Hortonworks. MODEL & SERVE: Azure SQL Data Warehouse (via PolyBase) and Azure SQL Database, serving analytical dashboards (Tableau Server), operational reports, and ad-hoc query.
  • 47. Q & A ? James Serra, Big Data Evangelist Email me at: JamesSerra3@gmail.com Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)

Editor's Notes

  1. With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that?  No! In the presentation I'll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I'll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I'll put it all together by showing common big data architectures. http://www.jamesserra.com/archive/2017/12/is-the-traditional-data-warehouse-dead/ https://www.slideshare.net/jamserra/big-data-architectures-and-the-data-lake https://www.slideshare.net/jamserra/differentiate-big-data-vs-data-warehouse-use-cases-for-a-cloud-solution
  2. Fluff, but point is I bring real work experience to the session
  3. No. Pay the piper now or later. The real question is… Dump files into the data lake and tell the user to go for it.
  4. Relational databases (RDBMS) generally work with structured data. Non-relational databases (NoSQL) work with semi-structured data. Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types.
  5. Top down starts with descriptive analytics and progresses to prescriptive analytics. Know the questions to ask. Lots of upfront work to get data to where you can use it. Bottoms up starts with predictive analytics. Don’t know the questions to ask. Little work needs to be done to start using data. There are two approaches to doing information management for analytics: Top-down (deductive approach). This is where analytics is done starting with a clear understanding of corporate strategy, where theories and hypotheses are made up front. The right data model is then designed and implemented prior to any data collection. Oftentimes, the top-down approach is good for descriptive and diagnostic analytics. What happened in the past and why did it happen? Bottom-up (inductive approach). This is the approach where data is collected up front before any theories and hypotheses are made. All data is kept so that patterns and conclusions can be derived from the data itself. This type of analysis allows for more advanced analytics such as doing predictive or prescriptive analytics: what will happen and/or how can we make it happen? In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottom-up approach becomes part of the top-down approach.
  6. One version of truth story: different departments using different financial formulas to help calculate bonuses. This leads to reasons to use BI. This is used to convince your boss of the need for a DW. Note that you still want to do some reporting off of the source system (i.e. current inventory counts). It’s important to know upfront if the data warehouse needs to be updated in real-time or very frequently as that is a major architectural decision. JD Edwards has table names like T117
  7. The data warehouse leverages the top-down approach where there is a well-architected information store and enterprise-wide BI solution. Building a data warehouse follows the top-down approach where the company’s corporate strategy is defined first. This is followed by gathering of business and technical requirements for the warehouse. The data warehouse is then implemented by dimension modelling and ETL design followed by the actual development of the warehouse. This is all done prior to any data being collected. It utilizes a rigorous and formalized methodology because a true enterprise data warehouse supports many users/applications within an organization to make better decisions.
  8. Key Points: Businesses can use new data streams to gain a competitive advantage. Microsoft is uniquely equipped to help you manage the growing volume and variety of data: structured, unstructured, and streaming. Talk Track: Does it not seem like every day there is a new kind of data that we need to understand? New data types continue to expand—we need to be prepared to collect that data so that the organization can then go do something with it. Structured data, the type of data we have been working with for years, continues to accelerate. Think how many transactions are occurring across your business. Unstructured data, the typical source of all our big data, takes many forms and originates from various places across the web including social. Streaming data is the data at the heart of the Internet of Things revolution. Just think about how many things in your organization are smart or instrumented and generating data every second. All of this means that data volumes are growing and bringing new capacity challenges. You are also dealing with an enormous opportunity, taking all of this data and putting it to work. In order to take advantage of all this data, you first need a platform that enables you to collect any data—no matter the size or type. The Microsoft data platform is uniquely complete and can help you collect any data using a flexible approach: Collecting data on-premises with SQL Server SQL Server can help you collect and manage structured, unstructured, and streaming data to power all your workloads: OLTP, BI, and Data Warehousing With new in-memory capabilities that are built into SQL Server 2014, you get the benefit of breakthrough speed with your existing hardware and without having to rewrite your apps. If you’ve been considering the cloud, SQL Server provides an on-ramp to help you get started. Using the wizards built into SQL Server Management Studio, extending to the cloud by combining SQL and Microsoft Azure is simple. Capture new data types using the power and flexibility of the Microsoft Azure Cloud Azure is well equipped to provide the flexibility you need to collect and manage any data in the cloud in a way that meets the needs of your business. Big data in Azure: HDInsight: an Apache Hadoop-based analytics solution that allows cluster deployment in minutes, scale up or down as needed, and insights through familiar BI tools. SQL Databases: managed relational SQL Database-as-a-service that offers business-ready capabilities built on SQL Server technology. Blobs: a cloud storage solution offering the simplest way to store large amounts of unstructured text or binary data, such as video, audio, and images. Tables: a NoSQL key/value storage solution that provides simple access to data at a lower cost for applications that do not need robust querying capabilities. Intelligent Systems Service: cloud service that helps enterprises embrace the Internet of Things by securely connecting, managing, and capturing machine-generated data from a variety of sensors and devices to drive improvements in operations and tap into new business opportunities. Machine Learning: if you’re looking to anticipate business challenges or opportunities, or perhaps expand your data practice into data science, Azure’s new Machine Learning service—cloud-based predictive analytics— can help. 
ML Studio is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning. Document DB: a fully managed, highly scalable, NoSQL document database service Azure Stream Analytics: real-time event processing engine that helps uncover insights from devices, sensors, infrastructure, applications, and data Azure Data Factory: enables information production by orchestrating and managing diverse data Azure Event Hubs: a scalable service for collecting data from millions of “things” in seconds Microsoft Analytics Platform System: In the past, to provide users with reliable, trustworthy information, enterprises gathered relational and transactional data in a single data warehouse. But this traditional data warehouse is under pressure, hitting limits amidst massive change. Data volumes are projected to grow tenfold over the next five years. End users want real-time responses and insights. They want to use non-relational data, which now constitutes 85 percent of data growth. They want access to “cloud-born” data, data that was created from growing cloud IT investments. Your enterprise can only cope with these shifts with a modern data warehouse—the Microsoft Analytics Platform System is the answer. The Analytics Platform System brings Microsoft’s massively parallel processing (MPP) data warehouse technology—the SQL Server Parallel Data Warehouse (PDW), together with HDInsight, Microsoft’s 100 percent Apache Hadoop distribution—and delivers it as a turnkey appliance. Now you can collect relational and non-relational data in one appliance. You can have seamless integration of the relational data warehouse and Hadoop with PolyBase.   All of these options give you the flexibility to get the most out of your existing data capture investments while providing a path to a more efficient and optimized data environment that is ready to support new data types.
  9. All data has immediate or potential value. This leads to data hoarding—all data is stored indefinitely. With an unknown future, there is no defined schema. Data is prepared and stored in native format; no upfront transformation or aggregation. Schema is imposed and transformations are done at query time (schema-on-read). Applications and users interpret the data as they see fit.
  10. Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera) http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/ http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/ http://adtmag.com/articles/2014/07/28/gartner-warns-on-data-lakes.aspx http://intellyx.com/2015/01/30/make-sure-your-data-lake-is-both-just-in-case-and-just-in-time/ http://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses http://www.martinsights.com/?p=1088 http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/ http://www.martinsights.com/?p=1082 http://www.martinsights.com/?p=1094 http://www.martinsights.com/?p=1102
  11. Inexpensively store unlimited data; collect all data “just in case”; easy integration of differently-structured data; store data with no modeling – “schema on read”; complements the enterprise data warehouse (EDW); frees up expensive EDW resources, especially for refining data; a Hadoop cluster offers faster ETL processing over SMP solutions; quick user access to data; data exploration to see if data is valuable before writing ETL and schema for a relational database; allows use of Hadoop tools such as ETL and extreme analytics; place to land IoT streaming data; on-line archive or backup for data warehouse data; easily scalable; with Hadoop, high availability built in; allows data to be used many times for different analytic needs and use cases; low-cost storage for raw data, saving space on the EDW
  12. https://www.sqlchick.com/entries/2017/12/30/zones-in-a-data-lake https://www.sqlchick.com/entries/2016/7/31/data-lake-use-cases-and-planning Question: Do you see many companies building data lakes? Raw: raw events are stored for historical reference. Also called the staging layer or landing area. Cleansed: raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize the way files are stored in terms of encoding, format, data types and content (i.e. strings). Also called the conformed layer. Application: business logic is applied to the cleansed data to produce data ready to be consumed by applications (i.e. DW application, advanced analysis process, etc). This is also called by a lot of other names: workspace, trusted, gold, secure, production ready, governed, presentation. Sandbox: optional layer to be used to “play” in. Also called the exploration layer or data science workspace
  13. Question: Do you see many companies building data lakes?
  14. I’m not saying your data warehouse can’t consist of just a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, and Yahoo.  But are you as big as them?  Do you have their resources?  Do you generate data like them?  Do you want a solution that only 1% of the workforce has the skillset for?  Is your IT department radical or is it conservative?
  15. Does a data scientist or analyst think locally or globally? Do they create a model that supports just their use case or do they think more broadly about how this data set can support other use cases? So it may be best to continue to let IT model and refine the data inside a relational data warehouse so that it is suitable for different types of business users.
  16. As far as reporting goes, whether to have users report off of a data lake or via a relational database and/or a cube is a balance between giving users data quickly and having them do the work to join, clean and master data (getting IT out-of-the-way) versus having IT make multiple copies of the data and cleaning, joining and mastering it to make it easier for users to report off of the data but dealing with the delay in waiting for IT to do all this.  The risk in the first case is having users repeating the process to clean/join/master data and cleaning/joining/mastering it wrong and getting different answers to the same question.  Another risk in the first case is slower performance because the data is not laid out efficiently.  Most solutions incorporate both to allow power users or data scientists to access the data quickly via a data lake while allowing all the other users to access the data in a relational database or cube, making self-service BI a reality (as most users would not have the skills to access data in a data lake properly or at all so a cube would be appropriate as it provides a semantic layer among other advantages to make report building very easy – see Why use a SSAS cube?).
  17. http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/ Not saying your EDW can’t consist of a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, Yahoo. But are you as big as them? Do you have their resources? Do you generate data like them? Do you want a solution that only 1% of the workforce has the skillset for? Radical vs. conservative. http://www.wintercorp.com/tcod-report/
  18. Why move relational data to the data lake? Offload processing to refine data to free up the EDW, use low-cost storage for raw data saving space on the EDW, help if ETL jobs on the EDW are taking too long. So you can actually use a data lake for small data – move EDW data to Hadoop, refine it, move it back to the EDW. Cons: rewriting all current ETL to Hadoop, re-training. I believe APS should be used for staging (i.e. “ELT”) in most cases, but there are some good use cases for using a Hadoop Data Lake: - Wanting to offload the data refinement to Hadoop, so the processing and space on the EDW is reduced - Wanting to use some Hadoop technologies/tools to refine/filter data that are not available for APS - Landing zone for unstructured data, as it can ingest large files quickly and provide data redundancy - ELT jobs on the EDW are taking too long, so offload some of them to the Hadoop data lake - There may be cases when you want to move EDW data to Hadoop, refine it, and move it back to the EDW (offload processing, need to use Hadoop tools) - The data lake is a good place for data that you “might” use down the road. You can land it in the data lake and have users use SQL via PolyBase to look at the data and determine if it has value
  19. In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottoms-up approach becomes part of the top-down approach. The top-down approach with the data warehouse utilizes a rigorous and formal approach to designing an enterprise-wide data warehouse that can support the entire enterprise. It usually can answer questions that are backward-facing, like what just happened, or even answer why things happened. The bottoms-up approach with the data lake utilizes an exploratory and informal approach of collecting all data in a single place so that data scientists can do advanced analytics like leveraging Hadoop and machine learning tools. It usually can identify new opportunities, predict future outcomes, etc. In the ideal world, both are leveraged so that they can exploit information in the most valued way, where each works together with the other to grow the business.
  20. An evolution of the three previous scenarios that provides multiple options for the various technologies. Data may be harmonized and analyzed in the data lake or moved out to an EDW when more quality and performance are needed, or when users simply want control. ELT is usually used instead of ETL (see Difference between ETL and ELT). The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data. Hub-and-spoke should be your ultimate goal. See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
  21. HDInsight benefits: cheap, quick to procure. Key goal of slide: highlight the four main use cases for PolyBase. Slide talk track: There are four key scenarios for using PolyBase with a data lake of data normally locked up in Hadoop. PolyBase leverages the APS MPP architecture along with optimizations like push-down computing to query data using Transact-SQL faster than using other Hadoop technologies like Hive. More importantly, you can use the Transact-SQL join syntax between Hadoop data and PDW data without having to import the data into PDW first. PolyBase is a great tool for archiving older or unused data in APS to less expensive storage on a Hadoop cluster. When you do need to access the data for historical purposes, you can easily join it back up with your PDW data using Transact-SQL. There are times when you need to share your PDW data with Hadoop users, and PolyBase makes it easy to copy data to a Hadoop cluster. Using a simple SELECT INTO statement, PolyBase makes it easy to import valuable Hadoop data into PDW without having to use external ETL processes.
  22. Question: Data warehouses and data lakes are now so fast, do we still need cubes?
  23. Question: Do we need a relational database if we create a data lake?  So we don’t have to make another copy of the data?  Hive LLAP, Spark SQL, and Impala are so fast can’t we get away with just having a data lake? In other words, is the traditional data warehouse dead? I’m a little confused with the update to SQL DW that has unlimited columnar storage.  What was the limit before this change?  Does this change the maximum database size of 240TB? The max previously was governed by max db size of 240TB. We now are holding the columnstore data out on blob store and so the amount of that data we can hold is unlimited. We are still bound by the 240TB limit for page data so indexes and heaps are still capped today. Speed: Blob/ADLS/Local, size, versions, query type, data size, driver, front-end tool, concurrency, etc. Storage spaces Azure SQL Database Managed Instance will be added to this.
  24. SMP is one server where each CPU in the server shares the same memory, disk, and network controllers (scale-up). MPP means data is distributed among many independent servers running in parallel and is a shared-nothing architecture, where each server operates self-sufficiently and controls its own memory and disk (scale-out).
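A hedged illustration of the MPP side in Azure SQL Data Warehouse / PDW T-SQL (table and column names are made up): a large fact table is hash-distributed across the compute nodes, while a small dimension is replicated to every node so that joins avoid data movement.

```sql
-- Large fact table: rows are spread across distributions by a hash of CustomerKey.
CREATE TABLE dbo.FactSalesDistributed
(
    OrderDateKey INT            NOT NULL,
    CustomerKey  INT            NOT NULL,
    SalesAmount  DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

-- Small dimension: a full copy is kept on each compute node.
CREATE TABLE dbo.DimDate
(
    DateKey      INT      NOT NULL,
    CalendarYear SMALLINT NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (DateKey));
```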
  25. http://demo.sqlmag.com/scaling-success-sql-server-2016/integrating-big-data-and-sql-server-2016 When it comes to key BI investments, we are making it much easier to manage relational and non-relational data with PolyBase technology that allows you to query Hadoop data and SQL Server relational data through a single T-SQL query. One of the challenges we see with Hadoop is that there are not enough people out there with the Hadoop and MapReduce skillset, and this technology simplifies the skillset needed to manage Hadoop data. This can also work across your on-premises environment or SQL Server running in Azure.
  26. https://blogs.technet.microsoft.com/msuspartner/2017/04/05/data-analytics-partners-navigating-data/
  27. Question: Should SQL Database be considered in the Model & Serve blade, using it as a data mart?
  28. Four Reasons to Migrate Your SQL Server Databases to the Cloud: Security, Agility, Availability, and Reliability Reasons not to move to the cloud: Security concerns (potential for compromised information, issues of privacy when data is stored on a public facility, might be more prone to outside security threats because its high-profile, some providers might not implement the same layers of protection you can achieve in-house) Lack of operational control: Lack of access to servers (i.e. say you are hacked and want to get to security and system log files; if something goes wrong you have no way of controlling how and when a response is carried out; the provider can update software, change configuration settings, and allocate resources without your input or your blessing; you must conform to the environment and standards implemented by the provider) Lack of ownership (an outside agency can get to data easier in the cloud data center that you don’t own vs getting to data in your onsite location that you own.  Or a concern that you share a cloud data center with other companies and someone from another company can be onsite near your servers) Compliance restrictions Regulations (health, financial) Legal restrictions (i.e. data can’t leave your country) Company policies You may be sharing resources on your server, as well as competing for system and network resources Data getting stolen in-flight (i.e. from the cloud data center to the on-prem user)
  29. Question: Where should we do data transformations (data lake, relational database, Databricks, etc.)? Question: What are the cost vs. performance tradeoffs with our products? (Many companies will sacrifice performance to save money.)