Modern data warehouse presentation

David Rice & Tom Bruce
The Modern Data Warehouse
15th January 2019
@snapanalytics
hello@snapanalytics.co.uk
Snap-analytics

Agenda
Topic
01 Introductions
02 Evolution of the Data Warehouse
03 Problems with traditional Data Warehousing
04 Why the Modern Data Platform?
05 Three components of the Modern Data Platform
06 Demo
07 Key takeaways

Introductions
Tom Bruce (Delivery Lead and Co-founder)
Extensive experience designing and delivering enterprise data warehouse and analytics solutions.
Core functional expertise in:
• Finance
• Marketing
Tom has worked with clients including:
• Jaguar Land Rover
• Deutsche Bank
• Carlsberg
David Rice - aka ‘Data Dave’ (CEO and Co-founder)
Over 15 years’ experience in data analytics including:
• Data warehouse design
• ETL (data integration)
• Data Modelling
• Delivering self-service analytics
David has worked with clients including:
• ING Bank
• Barclays Capital
• Jaguar Land Rover

Evolution of the Data Warehouse
Early 1970s: ACNielsen’s ‘Data Mart’
ACNielsen provided ‘Data Marts’ to their clients to help them understand their sales better.
Mid 1970s: Bill Inmon
Bill Inmon begins to define and discuss the term ‘Data Warehouse’.

Early 1980s: MPP Databases
Teradata create the DBC/1012 database.
Goodyear Aerospace build the ‘Goodyear MPP’ supercomputer.
Late 1980s: IBM Article on Data Warehousing
In 1988 IBM published ‘An architecture for a business and information system’ and coined the term “business data warehouse”.

Early 1990s: Ralph Kimball
Ralph Kimball introduces the ‘Red Brick Data Warehouse’.
Mid 1990s: TDWI
‘The Data Warehousing Institute’ (TDWI) is founded.
1996: The Data Warehouse Toolkit
‘The Data Warehouse Toolkit’ is published by Ralph Kimball.

Early 2000s: Data Vault
Dan Linstedt introduces Data Vault modelling.
Late 2000s: ‘Big Data’ & NoSQL
Cloud Computing

Early 2010s: Cloud Data Warehousing
The benefits of Data Warehousing in the cloud were realised as:
• Google launched a Data Warehouse as a service, ‘BigQuery’, in 2011
• Amazon launched Redshift in 2013
• Snowflake Inc. was publicly launched in 2014
• Microsoft launched Azure SQL Data Warehouse in 2016
Late 2010s: Cloud Adoption
DW Automation
Connectivity

Three big problems!
The Data Warehouse
Data Integration (ETL)
Data Modelling

Poor outcomes
60 percent of Big Data
projects will fail
Gartner, 2017

Problems with traditional DW solutions
Initial Set Up
Performance Tuning
Ongoing Maintenance
Scalability
Data Security & Compliance
Flexibility
High Upfront Costs
Resilience

Problems with traditional ETL solutions
Time consuming
Documentation
Inconsistent
Auditability & Lineage
Performance
Inefficient

A new way of thinking
Modern data platform
Modern data platforms like Snowflake are fast to set up and scale up. Low-cost storage and decoupled storage and compute eliminate resource contention. Native JSON support and ‘time travel’ features also provide great benefits.
Data Warehouse automation
Tools like Fivetran improve consistency and significantly reduce development cycles.
Agile data modelling
Data Vault 2.0 enables parallel loading, supports unstructured data, and is built with change in mind.
Combining modern data platforms, data modelling principles and DW automation tools delivers highly agile, highly scalable, performant solutions that can serve the needs of your data scientists and business community alike.

Benefits of Snowflake Data Platform
Multi Cloud Availability
Per Second Pricing
Performance
Data Sharing
Multi Use Cases
Zero Copy Clone
Time Travel
Instant Elasticity

High Level Architecture - Snowflake

Benefits of Fivetran
Streaming Support
ELT Performance
Zero-configuration
SQL Transforms
Rapid Dev
Pre-built Connectors

Fivetran – Salesforce Schema

CitiBike Demo Context
• CitiBike is a bike share program in New York (similar to Boris Bikes in London)
• Users are either annual members or buy short-term passes
• There are numerous stations across the city; users collect a bike from one station and return it to another once they are finished
• CitiBike want a data warehouse that lets them analyse all of the historical trips and join them with external data to give greater insight
• We will see how a modern data platform can be created within minutes to help them achieve this goal

Demo Architecture
[Architecture diagram]
Sources:
• Amazon S3: Citibike Trips (CSV)
• Amazon S3: NYC Weather Data (JSON)
• Azure Blob: Station MD (JSON)
Snowflake layers:
• Staging: Trips, Weather, Station MD
• Transformation: Trips & Weather
• Reporting: Trips View
Arrow label: Direct Load

Loading in Snowflake
• Data is loaded and queried using virtual warehouses, available in sizes from XS (1 server) up to 4XL (128 servers)
• Compute and storage are completely isolated, meaning no resource contention
• Data is processed using massively parallel processing (MPP) compute clusters
• Warehouses can be scaled up with no administration needed
• Bulk data loading can be done from the following sources: [list not preserved]
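The sizing idea can be sketched as a toy model. This is not a Snowflake API: it assumes the server count doubles with each size step, from 1 at XS up to 128 at the largest size (listed as 4X-Large in Snowflake's documentation), and the runtime estimate assumes perfectly linear scaling, which real workloads rarely achieve.

```python
# Toy model of virtual warehouse sizing (illustrative only, not a Snowflake API).
# Assumption: each size step doubles the number of servers, starting at 1 for XS.
SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL"]


def servers_for(size: str) -> int:
    """Return the assumed server count for a warehouse size label."""
    return 2 ** SIZES.index(size)


def ideal_runtime_minutes(baseline_minutes: float, size: str) -> float:
    """Estimate runtime under perfectly linear scaling (an idealisation).

    `baseline_minutes` is the hypothetical runtime on a single XS server.
    """
    return baseline_minutes / servers_for(size)


# Print the assumed size ladder.
for size in SIZES:
    print(f"{size:>3}: {servers_for(size):>3} servers")
```

Under these assumptions, a load that takes 64 minutes on XS would ideally take 8 minutes on L, which is the intuition behind scaling up for a bulk load and scaling back down afterwards.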

SNOWFLAKE DEMO
a) Bulk loading from S3 Stage
b) Scaling up the server

02 – ELT v ETL
• Modern cloud-based solutions mean that we can now use ELT rather than ETL:
  - Effectively unlimited storage and scalable processing power
  - Semi-structured data can be stored as-is, so it can be transformed after loading
• The big advantage of ELT is the extra flexibility it adds:
  - Data can be loaded very quickly
  - Developers can then decide what needs to be transformed, and can quickly change those transformations
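A minimal sketch of the ELT pattern, using Python's built-in sqlite3 module as a stand-in for a cloud warehouse. The table name `stg_trips`, the view name `trips_view` and the sample rows are all hypothetical; the point is only that raw data lands first and the transformation is a SQL object that can be changed later without reloading.

```python
import sqlite3

# Hypothetical sample rows standing in for a raw trips extract.
RAW_TRIPS = [
    ("2019-01-01 08:00:00", "2019-01-01 08:15:00", "Station A", "Station B"),
    ("2019-01-01 09:00:00", "2019-01-01 09:45:00", "Station B", "Station C"),
]


def load_raw(conn, rows):
    """Extract + Load: land the source rows untouched in a staging table."""
    conn.execute(
        "CREATE TABLE stg_trips (started_at TEXT, ended_at TEXT, "
        "start_station TEXT, end_station TEXT)"
    )
    conn.executemany("INSERT INTO stg_trips VALUES (?, ?, ?, ?)", rows)


def transform(conn):
    """Transform: derive a reporting view in SQL after the data is loaded."""
    conn.execute(
        "CREATE VIEW trips_view AS "
        "SELECT start_station, end_station, "
        "(CAST(strftime('%s', ended_at) AS INTEGER) "
        " - CAST(strftime('%s', started_at) AS INTEGER)) / 60 AS duration_minutes "
        "FROM stg_trips"
    )


conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_TRIPS)
transform(conn)
durations = [
    row[0]
    for row in conn.execute(
        "SELECT duration_minutes FROM trips_view ORDER BY duration_minutes"
    )
]
print(durations)
```

Because the staging table keeps the raw rows, changing the business logic only means redefining the view, not re-extracting from the source.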

FIVETRAN DEMO
a) JSON source file
b) Loaded into Azure blob storage
c) Fivetran connector
d) Load
e) Transformation

03 – Semi-structured Data
• Snowflake is able to store semi-structured data (JSON, Avro, ORC & Parquet) natively, enabling ELT
• The VARIANT data type in Snowflake stores this data, with SQL extensions to query it directly
• Transforming JSON data into structured tables in Snowflake is extremely simple
• Snowflake is a combination of both a Data Warehouse and a Data Lake: a ‘Data Lakehouse’
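To illustrate what turning JSON into structured data involves, here is a small sketch in plain Python. It does not use Snowflake; the payload shape, field names and column mapping are invented for the example, and the dotted-path helper is only a rough analogue of path-style access to nested values.

```python
import json

# Hypothetical weather payloads; the field names are invented for this sketch.
RAW_WEATHER = [
    '{"time": "2019-01-01T08:00:00", "city": {"name": "New York"},'
    ' "main": {"temp": 2.5}, "weather": [{"main": "Snow"}]}',
    '{"time": "2019-01-01T09:00:00", "city": {"name": "New York"},'
    ' "main": {"temp": 3.0}, "weather": [{"main": "Clouds"}]}',
]

# Target column name -> dotted path into the raw document (assumed mapping).
COLUMNS = {
    "observed_at": "time",
    "city": "city.name",
    "temp_c": "main.temp",
    "conditions": "weather.0.main",
}


def get_path(document, path, default=None):
    """Walk a dotted path such as 'weather.0.main' through dicts and lists."""
    current = document
    for key in path.split("."):
        try:
            current = current[int(key)] if isinstance(current, list) else current[key]
        except (KeyError, IndexError, ValueError, TypeError):
            return default  # missing fields become NULL-like values
    return current


def to_rows(raw_documents):
    """Project each raw JSON document onto the fixed set of columns."""
    rows = []
    for raw in raw_documents:
        document = json.loads(raw)
        rows.append({col: get_path(document, path) for col, path in COLUMNS.items()})
    return rows


rows = to_rows(RAW_WEATHER)
print(rows[0])
```

Keeping the raw documents and applying the column mapping afterwards means a new field can be exposed by adding one entry to the mapping, without reloading anything.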

WEATHER DATA LOAD
a) Load Weather JSON data from stage
b) View the weather data in raw form
c) Transform the JSON into structured data

04 – Zero-copy Cloning for Dev and Test
• Data often needs to be copied for QA and test environments
• Creating copies of the data and environments takes considerable time, and there is a cost associated with storing the data twice
• Snowflake uses cloning to create copies instantly; a clone does not persist a second copy of the data, it simply references the original data
  - Only new or updated records get stored in the new cloned table
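The idea behind cloning by reference can be sketched with a toy model. This is an illustration of the concept only, not how Snowflake is implemented: it assumes data lives in immutable blocks (called `Partition` here, a hypothetical name), that a clone copies only the list of references, and it covers inserts only.

```python
class Partition:
    """An immutable block of rows; never modified once written."""

    def __init__(self, rows):
        self.rows = tuple(rows)


class Table:
    """Toy table that is just a list of references to partitions."""

    def __init__(self, partitions=()):
        self.partitions = list(partitions)

    def clone(self):
        """Copy only the references (metadata); no row data is duplicated."""
        return Table(self.partitions)

    def insert(self, rows):
        """New data goes into a new partition owned by this table alone."""
        self.partitions.append(Partition(rows))

    def rows(self):
        return [row for partition in self.partitions for row in partition.rows]


def stored_partitions(*tables):
    """Count the distinct partitions physically held across the given tables."""
    return len({partition for table in tables for partition in table.partitions})


prod = Table()
prod.insert([("trip-1", 12), ("trip-2", 30)])
dev = prod.clone()             # instant: shares prod's partition
dev.insert([("trip-3", 7)])    # only this new data costs extra storage
print(len(prod.rows()), len(dev.rows()), stored_partitions(prod, dev))
```

In this model the development copy sees three rows while production still sees two, yet only two partitions are stored in total rather than three.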

05 – Time Travel
• Tables or data are frequently deleted by accident
• Data may be corrupted, or changes may be implemented that adversely affect the data
• Snowflake allows access to historical data (i.e. changed or deleted) at any point within a retention period of up to 90 days (the maximum depends on the Snowflake edition)
• Data can be quickly restored from key points in the past
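A toy model of the time travel idea: every committed version of a table is kept, and a query can ask for the table as of an earlier moment within the retention window. The class and method names are hypothetical and this is not Snowflake's implementation; the default of 90 days simply mirrors the figure on the slide.

```python
from datetime import datetime, timedelta


class VersionedTable:
    """Toy table that keeps every past version for a fixed retention period."""

    def __init__(self, retention_days=90):
        self.retention = timedelta(days=retention_days)
        self._versions = []  # (committed_at, rows) tuples, oldest first

    def commit(self, rows, at):
        """Record a new version of the table as of time `at`."""
        self._versions.append((at, tuple(rows)))

    def as_of(self, when, now):
        """Return the rows as they were at `when`, if still inside retention."""
        if now - when > self.retention:
            raise ValueError("requested time is outside the retention period")
        current = None
        for committed_at, rows in self._versions:
            if committed_at <= when:
                current = rows  # latest version committed at or before `when`
        if current is None:
            raise ValueError("table did not exist at the requested time")
        return list(current)


t0 = datetime(2019, 1, 1)
table = VersionedTable()
table.commit(["trip-1", "trip-2"], at=t0)
table.commit([], at=t0 + timedelta(days=1))  # accidental delete of all rows

# Read the table as it was before the accidental delete.
recovered = table.as_of(t0 + timedelta(hours=12), now=t0 + timedelta(days=2))
print(recovered)
```

Asking for a point older than the retention period raises an error in this sketch, which corresponds to history that is no longer available.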

06 – Reporting Connectivity
• Snowflake connects to many different reporting tools; a few are shown below:

Key takeaways
Maximise the work NOT done
Build for Change
Are you future ready?
