This document discusses the evolution of data warehousing and the modern data platform. It outlines some common problems with traditional data warehousing approaches like long setup times, poor performance and scalability issues. The modern data platform combines cloud-based data warehousing, data modeling principles, and data warehouse automation tools to provide highly scalable and agile solutions. Key components demonstrated are the Snowflake data platform for scalable data storage and processing, Fivetran for automated data integration, and capabilities like cloning data for testing and time travel to access historical data.
1. The Modern Data Warehouse
David Rice & Tom Bruce
15th January 2019
@snapanalytics
hello@snapanalytics.co.uk
Snap-analytics
2. Agenda
Topic
01 Introductions
02 Evolution of the Data Warehouse
03 Problems with traditional Data Warehousing
04 Why the Modern Data Platform?
05 Three components of the Modern Data Platform
06 Demo
07 Key takeaways
3. Introductions
Tom Bruce (Delivery Lead and Co-founder)
Extensive experience designing and delivering enterprise data warehouse and analytics solutions.
Core functional expertise in:
• Finance
• Marketing
Tom has worked with clients including:
• Jaguar Land Rover
• Deutsche Bank
• Carlsberg
David Rice, aka ‘Data Dave’ (CEO and Co-founder)
Over 15 years' experience in data analytics, including:
• Data warehouse design
• ETL (data integration)
• Data Modelling
• Delivering self-service analytics
David has worked with clients including:
• ING Bank
• Barclays Capital
• Jaguar Land Rover
4. Evolution of the Data Warehouse
Early 1970s
ACNielsen’s ‘Data Mart’
ACNielsen provided ‘Data Marts’ to their clients in order to help them understand their sales better.
Mid 1970s
Bill Inmon
Bill Inmon begins to define and discuss the term ‘Data Warehouse’.
5. Evolution of the Data Warehouse
Early 1980s
MPP Databases
Teradata create the DBC/1012 database.
Goodyear Aerospace build the ‘Goodyear MPP’ supercomputer.
Late 1980s
IBM Article on Data Warehousing
In 1988 IBM published ‘An architecture for a business information system’ and coined the term “business data warehouse”.
6. Evolution of the Data Warehouse
Early 1990s
Ralph Kimball
Ralph Kimball introduces the ‘Red Brick Data Warehouse’.
Mid 1990s
TDWI
‘The Data Warehousing Institute’ (TDWI) is founded.
The Data Warehouse Toolkit
1996: ‘The Data Warehouse Toolkit’ is published by Ralph Kimball.
7. Evolution of the Data Warehouse
Early 2000s
Data Vault
Dan Linstedt introduces Data Vault modelling.
Late 2000s
‘Big Data’ & NoSQL
Cloud Computing
8. Evolution of the Data Warehouse
Early 2010s
Cloud Data Warehousing
The benefits of Data Warehousing in the cloud were realised as:
• Google launched a Data Warehouse as a service, ‘BigQuery’, in 2011
• Amazon launched Redshift in 2013
• Snowflake Inc. was publicly launched in 2014
• Microsoft launched Azure SQL Data Warehouse in 2016
Late 2010s
Cloud Adoption
DW Automation
Connectivity
11. Problems with traditional DW solutions
Initial Set Up
Performance Tuning
Ongoing Maintenance
Scalability
Data Security & Compliance
Flexibility
High Upfront Costs
Resilience
12. Problems with traditional ETL solutions
Time consuming
Documentation
Inconsistent
Auditability & Lineage
Performance
Inefficient
13. A new way of thinking
Modern data platform
Modern data platforms like Snowflake are fast to set up and scale up. Low-cost storage and decoupled storage and compute eliminate resource contention. Native JSON support and ‘time travel’ features also provide great benefits.
Data Warehouse automation
Tools like Fivetran improve consistency and significantly reduce development cycles.
Agile data modelling
Data Vault 2.0 enables parallel loading and support for unstructured data, and is built with change in mind.
Combining modern data platforms, data modelling principles and DW automation tools delivers highly agile, highly scalable, performant solutions. This can serve the needs of your data scientists and business community alike.
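One way to picture the "parallel loading" point: Data Vault 2.0 loads commonly derive each key from a hash of the business key, so a hub row and a link row can be built independently without looking each other up. The sketch below is only an illustration under that assumption; the table names, column names and the choice of SHA-256 are hypothetical and not taken from the slides.

```python
import hashlib


def hash_key(*business_keys):
    """Derive a deterministic key from one or more business keys.

    Keys are trimmed and upper-cased first so that the same business
    value always hashes to the same key, whichever loader computes it.
    """
    normalised = "||".join(key.strip().upper() for key in business_keys)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def hub_station(station_id):
    """Build a (hypothetical) station hub row from its business key."""
    return {"station_hk": hash_key(station_id), "station_id": station_id}


def link_trip_station(trip_id, station_id):
    """Build a (hypothetical) trip-to-station link row.

    No lookup against the hub is needed: the link recomputes the same
    station hash, so hub and link loads can run in parallel.
    """
    return {
        "trip_station_hk": hash_key(trip_id, station_id),
        "trip_hk": hash_key(trip_id),
        "station_hk": hash_key(station_id),
    }


hub = hub_station("S-72")
link = link_trip_station("T-1001", " s-72 ")  # messy input, same station
print(hub["station_hk"] == link["station_hk"])
```

Because both loaders agree on the hash, neither has to wait for the other to assign a key.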
18. CitiBike Demo Context
• CitiBike is a bike share program in New York (similar to Boris Bikes in London)
• Users are either annual members or buy short-term passes
• There are numerous stations across the city; users collect a bike from one station and return it to another once they are finished
• CitiBike want a data warehouse that lets them analyse all of the historical trips and join this with external data to give greater insight
• We will see how a modern data platform can be created within minutes to help them achieve this goal
19. Demo Architecture
Sources:
• Amazon S3: Citibike Trips (CSV)
• Amazon S3: NYC Weather Data (JSON)
• Azure Blob: Station MD (JSON)
Snowflake layers:
• Staging: Trips, Weather, Station MD
• Transformation: Trips & Weather
• Reporting: Trips View
Load path shown in the diagram: Direct Load
20. Loading in Snowflake
• Data is loaded and queried using virtual warehouses, available in sizes ranging from XS (1 server) to XXXL (128 servers)
• Compute and storage can be completely isolated, meaning no resource contention
• Data is processed using massively parallel processing (MPP) compute clusters
• Warehouses can be scaled up with no administration needed
• Bulk data loading can be done from the following sources:
22. 02 – ELT v ETL
• Modern cloud-based solutions now mean that we can utilise ELT rather than ETL:
  • Endless storage capabilities and scalable processing power
  • Ability to store semi-structured data, meaning that it can be transformed after loading
• The big advantage of ELT is that it adds extra flexibility:
  • Data can be loaded very quickly
  • Developers can then decide to transform what is necessary, and can quickly change what needs to be transformed
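The load-first, transform-later ordering can be sketched in a few lines. The example below uses Python's built-in `sqlite3` as a local stand-in for a cloud warehouse; the table names, field names and the `json_get` helper are all hypothetical, chosen only to illustrate the pattern.

```python
import json
import sqlite3

# Hypothetical raw trip records, kept exactly as they arrived.
RAW_TRIPS = [
    '{"trip_id": 1, "start_station": "A", "duration_s": 600}',
    '{"trip_id": 2, "start_station": "B", "duration_s": 1500}',
]


def json_get(doc, key):
    """Return one top-level field from a JSON document stored as text."""
    return json.loads(doc).get(key)


def load_then_transform(raw_docs):
    conn = sqlite3.connect(":memory:")
    conn.create_function("json_get", 2, json_get)

    # Extract + Load: land the records untouched in a staging table.
    conn.execute("CREATE TABLE staging_trips (raw TEXT)")
    conn.executemany(
        "INSERT INTO staging_trips (raw) VALUES (?)",
        [(doc,) for doc in raw_docs],
    )

    # Transform: done afterwards, inside the database, and easy to change
    # because the raw data is still available in staging.
    conn.execute(
        """
        CREATE TABLE trips AS
        SELECT json_get(raw, 'trip_id')          AS trip_id,
               json_get(raw, 'start_station')    AS start_station,
               json_get(raw, 'duration_s') / 60.0 AS duration_min
        FROM staging_trips
        """
    )
    return conn


conn = load_then_transform(RAW_TRIPS)
rows = conn.execute(
    "SELECT trip_id, start_station, duration_min FROM trips ORDER BY trip_id"
).fetchall()
print(rows)
```

If the reporting requirement changes, only the transform step is rewritten; nothing has to be re-extracted from the source.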
23. Fivetran Demo
a) JSON source file
b) Loaded into Azure blob storage
c) Fivetran connector
d) Load
e) Transformation
24. 03 – Semi-structured Data
• Snowflake is able to store semi-structured data (JSON, Avro, ORC and Parquet) natively, enabling ELT
• The VARIANT data type in Snowflake stores this data, with SQL extensions to query it directly
• Transforming JSON data into structured tables in Snowflake is extremely simple
• Snowflake combines a Data Warehouse and a Data Lake: a ‘Data Lakehouse’
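To make the "JSON into structured tables" step concrete, here is a minimal Python sketch of flattening a nested record into columns addressed by dotted paths. It is an analogy for what a path expression over a semi-structured column achieves, not Snowflake's implementation, and the weather field names are hypothetical.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, using dotted paths as column names.

    Lists and scalar values are kept as-is; only nested objects are expanded.
    """
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical raw weather document, as it might sit in a staging table.
raw_weather = (
    '{"city": {"name": "New York"},'
    ' "main": {"temp": 281.5, "humidity": 70},'
    ' "time": 1547510400}'
)

row = flatten(json.loads(raw_weather))
print(sorted(row))
```

Each dotted path becomes a candidate column, so the structured table can be defined after the data has already been loaded.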
25. Weather Data Load
a) Load Weather JSON data from stage
b) View the weather data in raw form
c) Transform the JSON into structured data
26. 04 – Zero-copy Cloning for Dev and Test
• Data often needs to be copied for QA and test environments
• Creating copies of the data and environments takes considerable time, and storing the data twice incurs cost
• Snowflake cloning creates copies instantly without persisting a second copy of the data; the clone simply references the original data
• Only new or updated records get stored in the new cloned table
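The idea behind zero-copy cloning can be sketched as copy-on-write over immutable blocks. The toy `Table` class below is a hypothetical illustration of that idea, not Snowflake's storage engine: a clone copies only references, and a write replaces just the affected block in the clone.

```python
class Table:
    """Toy table made of immutable partitions (partition id -> tuple of rows)."""

    def __init__(self, partitions=None):
        self.partitions = dict(partitions or {})

    def clone(self):
        # Only the mapping of references is copied, never the row data.
        return Table(self.partitions)

    def write_partition(self, pid, rows):
        # A write stores a new block for this table only; other tables
        # that reference the old block are unaffected.
        self.partitions[pid] = tuple(rows)

    def shared_with(self, other):
        """Partition ids whose storage is physically shared with `other`."""
        return [
            pid
            for pid, block in self.partitions.items()
            if other.partitions.get(pid) is block
        ]


prod = Table({"p1": (("trip", 1),), "p2": (("trip", 2),)})
dev = prod.clone()  # instant: no rows are copied
dev.write_partition("p2", [("trip", 2), ("trip", 3)])

print(dev.shared_with(prod))
```

After the write, the test copy still shares every untouched partition with production, which is why a clone costs storage only for what changes.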
28. 05 – Time Travel
• Tables or data are frequently deleted by accident
• Data may be corrupted, or changes may be implemented that adversely affect the data
• Snowflake allows access to historical data (i.e. changed or deleted) at any point within a retention period of up to 90 days, depending on edition and configuration
• Data can be quickly recovered as it stood at key points in the past
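A retention-bounded history can be sketched as a list of timestamped versions. The `VersionedTable` class below is a hypothetical illustration of the concept, not how Snowflake implements Time Travel: each commit keeps the previous state, reads can ask for the table as of a past moment, and versions that fall outside the retention window are dropped.

```python
from datetime import datetime, timedelta


class VersionedTable:
    """Toy table that keeps past versions for a fixed retention period."""

    def __init__(self, retention):
        self.retention = retention
        self.versions = []  # (timestamp, rows), oldest first

    def commit(self, rows, at):
        self.versions.append((at, tuple(rows)))
        # Drop versions that were already superseded before the cutoff.
        cutoff = at - self.retention
        while len(self.versions) > 1 and self.versions[1][0] <= cutoff:
            self.versions.pop(0)

    def as_of(self, when):
        """Return the rows as they stood at `when`."""
        candidates = [rows for ts, rows in self.versions if ts <= when]
        if not candidates:
            raise LookupError("no version available at that time")
        return candidates[-1]


day0 = datetime(2019, 1, 1)
trips = VersionedTable(retention=timedelta(days=90))
trips.commit(["trip-1", "trip-2"], at=day0)
trips.commit(["trip-1"], at=day0 + timedelta(days=10))  # accidental delete

recovered = trips.as_of(day0 + timedelta(days=5))
print(recovered)
```

Reading the table as of a moment before the accidental delete returns the lost row, as long as that moment is still inside the retention window.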
Modern Data Platform
Start small and scale up quickly, minimising risk. Support for JSON and data science use cases at massive scale.
Data Warehouse Automation
Note that there are other solutions that solve some of these problems too, such as SQL Data Warehouse, as presented by Kamil, and Databricks (based on Apache Spark), as presented by Niall.
We recommend exploring multiple options, taking a fact-based view and implementing what works for you and your organisation.