Building Lakehouses
on Delta Lake
and SQL Analytics-
A Primer
Franco Patano
Senior Solutions Architect, Databricks
@fpatano
linkedin.com/in/francopatano/
Wayne Dyer
If you believe it will work out, you’ll
see opportunities. If you believe it
won’t, you’ll see obstacles.
Agenda
▪ What is a Lakehouse
▪ Delta Lake Architecture
▪ Delta Engine Optimizations
▪ SQL Analytics
▪ Implementation Example
▪ Frictionless Loading
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

Streaming
Batch
One platform to unify all of
your data, analytics, and AI workloads
Filtered, Cleaned,
Augmented
Silver
Business-level
Aggregates
Gold
Semi-structured
Unstructured
Structured
Raw Ingestion
and History
Bronze
Implementing Lakehouse Architecture with
Delta Lake
Bronze Silver Gold
Data usability
Raw Ingestion,
and History
Filtered, Cleaned,
Augmented
Business-level
Aggregates
AutoLoader
Structured
Streaming
Batch
COPY INTO
Partners
Land data as it is received
Provenance to source
Handle NULLs
Fix bad dates (1970-01-01)
Clean text fields
Demux nested objects
Friendly field names
Analytics Engineering
Business Models
Aggregates for visible dimensions
Business friendly field names
Common logical views
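The Bronze-to-Silver cleanup steps above can be sketched in SQL. This is a minimal illustration; the table and column names (silver.trades, bronze.trades_raw, etc.) are hypothetical, not from the deck:

```sql
-- Hypothetical bronze-to-silver cleanup; names are illustrative.
CREATE OR REPLACE TABLE silver.trades AS
SELECT
  trade_id,
  NULLIF(TRIM(cust_nm), '')             AS customer_name,  -- clean text, friendly name
  CASE WHEN trade_dt = DATE'1970-01-01'
       THEN NULL ELSE trade_dt END      AS trade_date,     -- fix bad epoch dates
  COALESCE(qty, 0)                      AS quantity,       -- handle NULLs
  input_file_name()                     AS source_file     -- provenance to source
FROM bronze.trades_raw;
```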
Table Structure
Stats are only collected on the first 32 ordinal
fields, including fields in nested structures
• You can change this with the table property:
dataSkippingNumIndexedCols
Restructure data accordingly
• Move numericals, keys, and high-cardinality query
predicates to the left; move long strings that are not
distinct enough for stats collection, and date/time stamps,
to the right, past dataSkippingNumIndexedCols
• Long strings are kryptonite to stats collection; move them
past the 32nd position, or past
dataSkippingNumIndexedCols
Column layout: Numerical, Keys, High Cardinality | Long Strings, Date/Time
(split at 32 columns, or dataSkippingNumIndexedCols)
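Changing the stats-collection cutoff is a table property on the Delta table. A minimal sketch, assuming a hypothetical table name:

```sql
-- Collect stats on the first 8 columns only (the default is 32);
-- 'silver.trades' is an illustrative table name.
ALTER TABLE silver.trades
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8');
```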
Optimize and Z-Order
Optimize will bin pack our files for better read performance
Z-Order will organize our data for better data skipping
What fields should you Z-Order by?
Fields that are being joined on, or included in a predicate
• Primary and Foreign Keys on dim and fact tables
• ID fields that are joined to other tables
• High Cardinality fields used in query predicates
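In SQL this is a single statement; the table and column names here are illustrative:

```sql
-- Bin-pack small files and co-locate rows by a join key
-- and a common high-cardinality predicate.
OPTIMIZE silver.trades
ZORDER BY (customer_id, trade_date);
```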

Partitioning and Z-Order effectiveness
High Cardinality Regular Cardinality Low Cardinality
Very Uncommon or Unique Datum
● User or Device ID
● Email Address
● Phone Number
Common Repeatable Data
● People or Object Names
● Street Addresses
● Categories
Repeatable, limited distinct data
● Gender
● Status Flags
● Boolean Values
Measure with SELECT COUNT(DISTINCT x)
Partitioning effectiveness increases toward low cardinality
Z-Order effectiveness increases toward high cardinality
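Before choosing a partition or Z-Order column, profile the candidates. A sketch with hypothetical table and column names:

```sql
-- Cardinality profile; all names are illustrative.
SELECT
  COUNT(DISTINCT user_id)  AS user_id_cardinality,   -- high: Z-Order candidate
  COUNT(DISTINCT category) AS category_cardinality,  -- regular
  COUNT(DISTINCT status)   AS status_cardinality     -- low: partition candidate
FROM silver.events;
```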
Tips for each layer
✓ When ingesting files, land them raw
✓ When streaming, land raw data in Delta
✓ Turn off stats collection
○ set dataSkippingNumIndexedCols to 0
✓ Optimize and Z-Order by merge
join keys between Bronze and
Silver
✓ Restructure columns to account
for data skipping index columns
✓ Use Delta Cache Enabled clusters
○ or enable it on other cluster types (YMMV)
✓ Optimize and Z-Order by join keys
or common High Cardinality query
predicates
✓ Turn up Staleness Limit to align
with your orchestration
✓ Use SQL Analytics for Analysts
Business-level
Aggregates
Filtered, Cleaned,
Augmented
Raw Ingestion,
and History
Bronze Silver Gold
MERGE INTO
JOIN KEYS
MERGE INTO
JOIN KEYS
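The MERGE INTO step between layers can be sketched as below. The upsert runs on the same key the tables were Z-Ordered by, so data skipping prunes files during the join. Table and key names are hypothetical:

```sql
-- Hypothetical upsert from Bronze into Silver on the Z-Order/merge key.
MERGE INTO silver.trades AS s
USING bronze.trade_updates AS b
  ON s.trade_id = b.trade_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```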
Databricks SQL Analytics
Delivering analytics on the freshest data
with data warehouse performance and
data lake economics
• Query your lakehouse with better price / performance
• Simplify discovery and sharing of new insights
• Connect to familiar BI tools, like Tableau or Power BI
• Simplify administration and governance
Why did Databricks Create SQL Analytics?
➔ Customers have standardized on data lakes
as a foundation for modern data analytics
➔ ~41% of queries on Databricks are SQL
➔ SQL Analytics was created to provide these
users with a familiar SQL editor experience

Easy-to-use SQL experience
Enable data analysts to quickly
perform ad-hoc and exploratory
data analysis with a new, easy-to-use
SQL query editor, built-in
visualizations, and dashboards.
Automatic alerts can be triggered
for critical changes, allowing teams
to respond to business needs faster.
Simple administration and governance
Quickly set up SQL / BI
optimized compute with SQL
endpoints. Databricks automatically
determines instance types and
configuration for the best
price/performance. Then, easily
manage usage, perform quick
auditing, and troubleshoot with
query history.
Curated Data
Delta Lake helps build curated data lakes, so
you can store and manage all your data in
one place and standardize your big data
storage with an open format accessible from
various tools.
Curated Data
Structured, Semi-Structured, and Unstructured Data
Filtered, Cleaned,
Augmented
Silver
Raw Ingestion
and History
Bronze
Business-level
Aggregates
Gold
SQL Editor: Based on Redash, the SQL
Native Interface provides a simple and
familiar experience for data analysts to
explore data, query their data lakes,
visualize results, share dashboards,
and set up automatic alerts.
Query Execution: SQL Analytics’s compute
clusters are powered by Photon, a
100% Apache Spark-compatible vectorized
query engine designed to take advantage of
modern CPU architectures for extremely fast
parallel processing of data.
Optimized ODBC/JDBC Drivers
Re-engineered drivers provide lower latency
and less overhead to reduce round trips by
0.25 seconds. Data transfer rate is improved
50%, and metadata retrieval operations
execute up to 10x faster.
Improved Queuing and Load Balancing
SQL Analytics Endpoints extend Delta Lake’s
capabilities to better handle peaks in query
traffic and high cluster utilization.
Additionally, execution is improved for both
short and long queries.
Spot Instance Pricing
By using spot instances, SQL Analytics
Endpoints provide optimal pricing and
reliability with minimal administration.
“Unified Catalog”
The one version of the truth for your
organization. “Unity Catalog” is the data
catalog, governance, and query monitoring
solution that unifies your data in the cloud.
“Delta Sharing” offers Open Data Sharing, for
securely sharing data in the cloud.
Vectorized Execution Engine
Compiler | High Perf. Async IO | IO Caching
Admin Console
SQL Endpoints
Workload Mgt & Queuing | Auto-scaling | Results Caching
Analyst Experience
"Unified Catalog"
Databricks SQL Analytics
DELTA Lake
ODBC/JDBC
Drivers
BI & SQL Client
Connectors
SQL
End Point
Query
Planning
Query
Execution
Performance - The Databricks BI Stack

Better price / performance
Run SQL queries on your lakehouse
and analyze your freshest data
with up to 4x better
price/performance than
traditional cloud data warehouses.
Source: Performance Benchmark with Barcelona Supercomputing Center
Common Use Cases
Collaborative exploratory data
analysis on your data lake
Data-enhanced
applications
Connect existing BI tools and use
one source of truth for all your
data
Respond to business needs faster with a
self-serve experience designed for every
analyst in your organization. Databricks
SQL Analytics provides simple and
secure access to data, the ability to create
or reuse SQL queries to analyze the data
that sits directly on your data lake, and to
quickly mock up and iterate on visualizations
and dashboards that best fit the business.
Build rich and custom data enhanced
applications for your own
organization or your customers.
Simplify development and leverage
the price / performance and scale of
Databricks SQL Analytics, all served
from your data lake.
Maximize existing investments by connecting
your preferred BI tools such as Tableau or
Power BI to your data lake with SQL Analytics
Endpoints. Re-engineered and optimized
connectors ensure fast performance, low
latency, and high user concurrency to your
data lake. Now analysts can use the best tool
for the job on a single source of truth for
your data: your data lake.
Common Governance Models
Enterprise Data Sources
IT or Governed Datasets
GRANT USAGE, SELECT ON
DATABASE CrownJewels TO users
GRANT USAGE, MODIFY ON
DATABASE CrownJewels TO
admin-read-write
GRANT ALL PRIVILEGES ON
CATALOG TO superuser
Department/Business Unit Data
Business Unit Datasets
GRANT USAGE, SELECT ON
DATABASE DepartmentJewels TO
users
GRANT USAGE, MODIFY ON
DATABASE DepartmentJewels TO
department-read-write
User Level Data
Self-Service Datasets
GRANT USAGE, SELECT ON
DATABASE MyJewels TO users
GRANT USAGE, MODIFY ON
DATABASE MyJewels TO
`franco@databricks.com`
Data Security Governance
Catalog
Database
Table/View/Function
SQL objects in Databricks are hierarchical and
privileges are inherited. This means that
granting or denying a privilege on the
CATALOG automatically grants or denies the
privilege to all databases in the catalog.
Similarly, privileges granted on a DATABASE
object are inherited by all objects in that
database.
To perform an action on a database object, a
user must have the USAGE privilege on that
database in addition to the privilege to
perform that action.
GRANT USAGE, SELECT ON CATALOG TO users;
GRANT USAGE, SELECT ON DATABASE D TO users;
GRANT USAGE ON DATABASE D TO users;
GRANT SELECT ON TABLE T TO users;

Grant Access in SQL via Roles
Sync Users + Groups with SCIM
finance read-only
finance-read-write
cs-read-only
cs-read-write
admin-all
Jill, Jon
Jane, Jack
Fred, James, Will
Wilbur
Finance Users
Finance Admin
Jake
Customer Service
Users
Customer Service
Admin
Architect/Admin
GRANT USAGE, SELECT ON DATABASE F
TO finance-read-only
GRANT USAGE, MODIFY ON DATABASE F
TO finance-read-write
GRANT USAGE, SELECT ON DATABASE C
TO cs-read-only
GRANT ALL PRIVILEGES ON CATALOG TO
admin-all
GRANT USAGE, MODIFY ON DATABASE C
TO cs-read-write
Role Based Access Control
Managed
Data Source
Managed
Data Source
Cluster or SQL
Endpoint Managed Catalog
Cross-Workspace
External Tables
SQL access
controls
Audit
log
Defined
Credentials
Other Existing
Data Sources
User Identity
Passthrough
Define Once, Secure Everywhere
Centralized Governance / Unified Security Model:
passthrough and ACLs supported in the same
workspaces for all your data sources
Use exclusively or in tandem with your existing Hive
Metastore
Centralized metastore with integrated fine-grained security across all your workspaces
Databricks Managed Catalog
Announced in Keynote!
Implementation Example with Frictionless ETL
TPC-DI
Data Integration (DI), also known as ETL, is the analysis,
combination, and transformation of data from a variety of
sources and formats into a unified data model
representation. Data Integration is a key element of data
warehousing, lakehousing, application integration, and
business analytics.
http://www.tpc.org/tpcdi/default5.asp

Main Concepts of TPC-DI
TPC-DI uses data integration of a fictitious Retail Brokerage Firm as its model:
● Main Trading System
● Internal Human Resource System
● Internal Customer Relationship Management System
● Externally acquired data
The operations measured use the above model, but are not limited to those of a brokerage firm.
They capture the variety and complexity of typical DI tasks:
● Loading of large volumes of historical data
● Loading of incremental updates
● Execution of a variety of transformation types using various input types and various target types with inter-table
relationships
● Assuring consistency of loaded data
Benchmark is technology agnostic
Why TPC-DI?
Data Generator
• Produces scales of files from GBs to TBs
• Produces CSV, CDC, XML, and Text files
• Has historical and incremental loads
Data Model
• Transformations documented
• Dimensional Model for Analytics
Implementation Reference Architecture
Bronze Silver Gold
OLTP
CDC
Extract Frictionless Load
HR DB
CSV
Frictionless Load
Prospect
List
CSV
Frictionless Load
Financial
Newswire
Multi
Format Frictionless Load
Customers
XML
Frictionless Load
MERGE
INTO
What is Frictionless Loading?
Autoloader
• Load files from cloud object storage
• With notifications
• Structured Streaming
• Trigger Once for Batch equivalency
• Schema Inference
• Schema Hints
• Schema Drift Handling
Delta Lake
• Streaming Source and Sink
• Checkpoint + Transaction Log = Ultimate State Store
• Schema Enforcement and Evolution
• Time Travel
• Optimize for Analytics
• Change Data Feed
Batch mode for migrating existing orchestration, simple code change for near real-time streaming!
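For the batch side, COPY INTO gives a frictionless, idempotent load in plain SQL. A minimal sketch; the landing path and table name are hypothetical:

```sql
-- Illustrative COPY INTO batch load; path and table are hypothetical.
-- COPY INTO is idempotent: files already loaded are skipped on re-run.
COPY INTO bronze.customers
FROM 's3://landing/customers/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```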

Enable everyone to declaratively build robust
data pipelines with testing baked-in. First-class
support for SQL and Python.
Databricks auto-tunes infrastructure and takes
care of orchestration, failures and retries
Define data quality checks and data
documentation in the pipeline
Cost and latency conscious, with support for
streaming vs. batch and full vs. incremental
workloads
A simple way to build and operate ETL data flows to deliver fresh, high quality data
Delta Live Tables (formerly “Delta Pipelines”)
Announced in Keynote
Watch the Breakout from Awez, Delta Live Tables!
Demo Time
Related Talks
WEDNESDAY
03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks
04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn &
Alex Behm, Databricks
04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya,
Plume
THURSDAY
11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics
03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano,
Databricks
FRIDAY
10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast
& Molly Nagamuthu, Databricks
How to get started
On June 1
databricks.com/try

Recommended for you

Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks

Celtra provides a platform for streamlined ad creation and campaign management used by customers including Porsche, Taco Bell, and Fox to create, track, and analyze their digital display advertising. Celtra’s platform processes billions of ad events daily to give analysts fast and easy access to reports and ad hoc analytics. Celtra’s Grega Kešpret leads a technical dive into Celtra’s data-pipeline challenges and explains how it solved them by combining Snowflake’s cloud data warehouse with Spark to get the best of both. Topics include:
- Why Celtra changed its pipeline, materializing session representations to eliminate the need to rerun its pipeline
- How and why it decided to use Snowflake rather than an alternative data warehouse or a home-grown custom solution
- How Snowflake complemented the existing Spark environment with the ability to store and analyze deeply nested data with full consistency
- How Snowflake + Spark enables production and ad hoc analytics on a single repository of data

snowflake celtra analytics spark apache events
Prague data management meetup 2018-03-27

This document discusses different data types and data models. It begins by describing unstructured, semi-structured, and structured data. It then discusses relational and non-relational data models. The document notes that big data can include any of these data types and models. It provides an overview of Microsoft's data management and analytics platform and tools for working with structured, semi-structured, and unstructured data at varying scales. These include offerings like SQL Server, Azure SQL Database, Azure Data Lake Store, Azure Data Lake Analytics, HDInsight and Azure Data Warehouse.

Azure Data Factory ETL Patterns in the Cloud

This document discusses ETL patterns in the cloud using Azure Data Factory. It covers topics like ETL vs ELT, the importance of scale and flexible schemas in cloud ETL, and how Azure Data Factory supports workflows, templates, and integration with on-premises and cloud data. It also provides examples of nightly ETL data flows, handling schema drift, loading dimensional models, and data science scenarios using Azure data services.

microsoft azure data factory

More Related Content

What's hot

Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !

Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra

Delta lake and the delta architecture
Adam Doyle

Free Training: How to Build a Lakehouse
Databricks

Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia

Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra

DW Migration Webinar-March 2022.pptx
Databricks

[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
DataScienceConferenc1

Databricks: A Tool That Empowers You To Do More With Data
Databricks

Future of Data Engineering
C4Media

Spark with Delta Lake
Knoldus Inc.

Data platform modernization with Databricks.pptx
CalvinSim10

Demystifying data engineering
Thang Bui (Bob)

Databricks Platform.pptx
Alex Ivy

Big data architectures and the data lake
James Serra

Modern Data Warehousing with the Microsoft Analytics Platform System
James Serra

Azure Synapse Analytics Overview (r2)
James Serra

Designing a modern data warehouse in azure
Antonios Chatzipavlis

The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks

Data Warehouse or Data Lake, Which Do I Choose?
DATAVERSITY


Similar to Building Lakehouses on Delta Lake with SQL Analytics Primer

SQL Saturday Redmond 2019 ETL Patterns in the Cloud
Mark Kromer

Data Lake Overview
James Serra

Building End-to-End Delta Pipelines on GCP
Databricks

Taming the shrew Power BI
Kellyn Pot'Vin-Gorman

Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Grega Kespret

Prague data management meetup 2018-03-27
Martin Bém

Azure Data Factory ETL Patterns in the Cloud
Mark Kromer

Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Amazon Web Services LATAM

Getting Started with Delta Lake on Databricks
Knoldus Inc.

SQL Analytics Powering Telemetry Analysis at Comcast
Databricks

AnalysisServices
webuploader

Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Trivadis

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel

Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Databricks

Taming the shrew, Optimizing Power BI Options
Kellyn Pot'Vin-Gorman

BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS
TIBCO Spotfire

Introduction to Azure Databricks
James Serra

Azure Data Factory for Azure Data Week
Mark Kromer

Azure Synapse Analytics Overview (r1)
James Serra

Modern data warehouse
Rakesh Jayaram
 


More from Databricks

Data Lakehouse Symposium | Day 1 | Part 1

Data Lakehouse Symposium | Day 1 | Part 2

Data Lakehouse Symposium | Day 2

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Democratizing Data Quality Through a Centralized Platform

Why APM Is Not the Same As ML Monitoring

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Stage Level Scheduling Improving Big Data and AI Integration

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Scaling your Data Pipelines with Apache Spark on Kubernetes

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Sawtooth Windows for Feature Aggregations

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Re-imagine Data Monitoring with whylogs and Spark

Raven: End-to-end Optimization of ML Prediction Queries

Processing Large Datasets for ADAS Applications using Apache Spark

Massive Data Processing in Adobe Using Delta Lake

Machine Learning CI/CD for Email Attack Detection

Jeeves Grows Up: An AI Chatbot for Performance and Quality

Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
 

 

University of Toronto  degree offer diploma TranscriptUniversity of Toronto  degree offer diploma Transcript
University of Toronto degree offer diploma Transcript
 
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
 
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
 
Cloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco ProductsCloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco Products
 
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeDaryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
 
University of the Sunshine Coast degree offer diploma Transcript
University of the Sunshine Coast  degree offer diploma TranscriptUniversity of the Sunshine Coast  degree offer diploma Transcript
University of the Sunshine Coast degree offer diploma Transcript
 
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
 
NPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension schemeNPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension scheme
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
 
[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction
 
AIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on AzureAIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on Azure
 
How We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeachHow We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeach
 
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeNoida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
 
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model SafeVasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
 
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model SafePitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
 
Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...
 
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeKarol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
 

Building Lakehouses on Delta Lake with SQL Analytics Primer

  • 1. Building Lakehouses on Delta Lake and SQL Analytics- A Primer Franco Patano Senior Solutions Architect, Databricks @fpatano linkedin.com/in/francopatano/
  • 2. Wayne Dyer If you believe it will work out, you’ll see opportunities. If you believe it won’t you’ll see obstacles.
  • 3. Agenda ▪ What is Lakehouse ▪ Delta Lake Architecture ▪ Delta Engine Optimizations ▪ SQL Analytics ▪ Implementation Example ▪ Frictionless Loading
  • 4. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  • 5. Streaming Batch One platform to unify all of your data, analytics, and AI workloads Filtered, Cleaned, Augmented Silver Business-level Aggregates Gold Semi-structured Unstructured Structured Raw Ingestion and History Bronze
  • 6. Implementing Lakehouse Architecture with Delta Lake Bronze Silver Gold Data usability Raw Ingestion, and History Filtered, Cleaned, Augmented Business-level Aggregates Auto Loader Structured Streaming Batch COPY INTO Partners Land data as it is received Provenance to source Handle NULLs Fix bad dates (1970-01-01) Clean text fields Demux nested objects Friendly field names Analytics Engineering Business Models Aggregates for visible dimensions Business friendly field names Common logical views
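As a minimal sketch, batch ingestion into a Bronze table with COPY INTO might look like the following (the table name and storage path are hypothetical):

```sql
-- Land files into a Bronze Delta table as received, keeping provenance to source
COPY INTO bronze.trades_raw
FROM (
  SELECT *,
         input_file_name() AS source_file,    -- provenance to source
         current_timestamp() AS ingest_ts
  FROM 's3://my-landing-bucket/trades/'
)
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```

COPY INTO skips files it has already loaded, so the same statement can be scheduled repeatedly without duplicating data.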
  • 7. Table Structure Stats are collected only on the first 32 ordinal fields, including fields in nested structures. • You can change this with the table property dataSkippingNumIndexedCols. • Restructure data accordingly: move numerical columns, keys, and high-cardinality query predicates to the left; move long strings that are not distinct enough for stats collection, and date/timestamps, to the right past dataSkippingNumIndexedCols. • Long strings are kryptonite to stats collection; move these past the 32nd position, or past dataSkippingNumIndexedCols. Numerical, Keys, High Cardinality | Long Strings, Date/Time | 32 columns or dataSkippingNumIndexedCols
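A sketch of adjusting the stats-collection cutoff on a table (the table name and the value 12 are illustrative):

```sql
-- Collect data-skipping stats only on the first 12 columns instead of the default 32
ALTER TABLE silver.trades
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '12');
```

Reorder the schema so keys and high-cardinality predicate columns sit inside that cutoff, and long strings and timestamps fall past it.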
  • 8. Optimize and Z-Order Optimize will bin-pack our files for better read performance. Z-Order will organize our data for better data skipping. What fields should you Z-Order by? Fields that are being joined on, or included in a predicate: • Primary Key and Foreign Keys on dim and fact tables • ID fields that are joined to other tables • High-cardinality fields used in query predicates
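The two operations combine into a single statement; the table and column names below are hypothetical:

```sql
-- Bin-pack small files and co-locate rows on commonly joined / filtered keys
OPTIMIZE gold.fact_trades
ZORDER BY (account_id, trade_date);
```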
  • 9. Partitioning and Z-Order effectiveness High Cardinality Regular Cardinality Low Cardinality Very Uncommon or Unique Datum ● User or Device ID ● Email Address ● Phone Number Common Repeatable Data ● People or Object Names ● Street Addresses ● Categories Repeatable, limited distinct data ● Gender ● Status Flags ● Boolean Values SELECT COUNT(DISTINCT(x)) Partitioning effectiveness Z-Order effectiveness
  • 10. Tips for each layer ✓ When ingesting files, land them raw ✓ When streaming, land in Delta raw ✓ Turn off stats collection ○ dataSkippingNumIndexedCols 0 ✓ Optimize and Z-Order by merge join keys between Bronze and Silver ✓ Restructure columns to account for data skipping index columns ✓ Use Delta Cache enabled clusters ○ or enable it for other types, YMMV ✓ Optimize and Z-Order by join keys or common high-cardinality query predicates ✓ Turn up the staleness limit to align with your orchestration ✓ Use SQL Analytics for analysts Business-level Aggregates Filtered, Cleaned, Augmented Raw Ingestion, and History Bronze Silver Gold MERGE INTO JOIN KEYS MERGE INTO JOIN KEYS
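The merge-by-join-keys pattern between Bronze and Silver can be sketched like this (table and key names are hypothetical):

```sql
-- Upsert the latest Bronze records into the curated Silver table
MERGE INTO silver.customers AS s
USING bronze.customer_updates AS b
ON s.customer_id = b.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Z-Ordering both tables by customer_id keeps this merge fast, since non-matching files can be skipped.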
  • 11. Databricks SQL Analytics Delivering analytics on the freshest data with data warehouse performance and data lake economics • Query your lakehouse with better price / performance • Simplify discovery and sharing of new insights • Connect to familiar BI tools, like Tableau or Power BI • Simplify administration and governance
  • 12. Why did Databricks Create SQL Analytics? ➔ Customers have standardized on data lakes as a foundation for modern data analytics ➔ ~41% of queries on Databricks are SQL ➔ SQL Analytics was created to provide these users with a familiar SQL editor experience
  • 13. Easy to use SQL experience Enable data analysts to quickly perform ad-hoc and exploratory data analysis with a new, easy-to-use SQL query editor, built-in visualizations, and dashboards. Automatic alerts can be triggered for critical changes, allowing analysts to respond to business needs faster.
  • 14. Simple administration and governance Quickly set up SQL/BI-optimized compute with SQL endpoints. Databricks automatically determines instance types and configuration for the best price/performance. Then, easily manage usage, audit activity, and troubleshoot with query history.
  • 15. Curated Data Delta Lake helps build curated data lakes, so you can store and manage all your data in one place and standardize your big data storage with an open format accessible from various tools. Curated Data: Structured, Semi-Structured, and Unstructured Data | Raw Ingestion and History (Bronze) | Filtered, Cleaned, Augmented (Silver) | Business-level Aggregates (Gold)
SQL Editor: Based on Redash, the SQL-native interface provides a simple and familiar experience for data analysts to explore data, query their data lakes, visualize results, share dashboards, and set up automatic alerts.
Query Execution: SQL Analytics' compute clusters are powered by the Photon engine, a 100% Apache Spark-compatible vectorized query engine designed to take advantage of modern CPU architecture for extremely fast parallel processing of data.
Optimized ODBC/JDBC Drivers: Re-engineered drivers provide lower latency and less overhead, reducing round trips by 0.25 seconds. Data transfer rate is improved 50%, and metadata retrieval operations execute up to 10x faster.
Improved Queuing and Load Balancing: SQL Analytics Endpoints extend Delta Lake's capabilities to better handle peaks in query traffic and high cluster utilization. Additionally, execution is improved for both short and long queries.
Spot Instance Pricing: By using spot instances, SQL Analytics Endpoints provide optimal pricing and reliability with minimal administration.
"Unified Catalog": The one version of the truth for your organization. "Unity Catalog" is the data catalog, governance, and query monitoring solution that unifies your data in the cloud. "Delta Sharing" offers open data sharing for securely sharing data in the cloud.
Vectorized Execution Engine | Compiler | High Perf. Async IO | IO Caching | Admin Console | SQL Endpoints | Workload Mgt & Queuing | Auto-scaling | Results Caching | Analyst Experience | "Unified Catalog" | Databricks SQL Analytics
  • 16. DELTA Lake ODBC/JDBC Drivers BI & SQL Client Connectors SQL End Point Query Planning Query Execution Performance - The Databricks BI Stack
  • 17. Better price / performance Run SQL queries on your lakehouse and analyze your freshest data with up to 4x better price/performance than traditional cloud data warehouses. Source: Performance Benchmark with Barcelona Supercomputing Center
  • 18. Common Use Cases Collaborative exploratory data analysis on your data lake: Respond to business needs faster with a self-served experience designed for every analyst in your organization. Databricks SQL Analytics provides simple and secure access to data, the ability to create or reuse SQL queries to analyze the data that sits directly on your data lake, and the means to quickly mock up and iterate on visualizations and dashboards that best fit the business. Data-enhanced applications: Build rich, custom data-enhanced applications for your own organization or your customers. Simplify development and leverage the price/performance and scale of Databricks SQL Analytics, all served from your data lake. Connect existing BI tools and use one source of truth for all your data: Maximize existing investments by connecting your preferred BI tools such as Tableau or Power BI to your data lake with SQL Analytics Endpoints. Re-engineered and optimized connectors ensure fast performance, low latency, and high user concurrency to your data lake. Now analysts can use the best tool for the job on one single source of truth for your data: your data lake.
  • 19. Common Governance Models Enterprise Data Sources IT or Governed Datasets GRANT USAGE, SELECT ON DATABASE CrownJewels TO users GRANT USAGE, MODIFY ON DATABASE CrownJewels TO admin-read-write GRANT ALL PRIVILEGES ON CATALOG TO superuser Department/Business Unit Data Business Unit Datasets GRANT USAGE, SELECT ON DATABASE DepartmentJewels TO users GRANT USAGE, MODIFY ON DATABASE DepartmentJewels TO department-read-write User Level Data Self-Service Datasets GRANT USAGE, SELECT ON DATABASE MyJewels TO users GRANT USAGE, MODIFY ON DATABASE MyJewels TO `franco@databricks.com`
  • 20. Data Security Governance Catalog Database Table/View/Function SQL objects in Databricks are hierarchical and privileges are inherited. This means that granting or denying a privilege on the CATALOG automatically grants or denies the privilege to all databases in the catalog. Similarly, privileges granted on a DATABASE object are inherited by all objects in that database. To perform an action on a database object, a user must have the USAGE privilege on that database in addition to the privilege to perform that action. GRANT USAGE, SELECT ON Catalog TO users GRANT USAGE, SELECT ON Database D TO users GRANT USAGE ON DATABASE D to users; GRANT SELECT ON TABLE T TO users;
  • 21. Grant Access in SQL via Roles Sync Users + Groups with SCIM finance read-only finance-read-write cs-read-only cs-read-write admin-all Jill, Jon Jane, Jack Fred, James, Will Wilbur Finance Users Finance Admin Jake Customer Service Users Customer Service Admin Architect/Admin GRANT USAGE, SELECT ON DATABASE F TO finance-read-only GRANT USAGE, MODIFY ON DATABASE F TO finance-read-write GRANT USAGE, SELECT ON DATABASE C TO cs-read-only GRANT ALL PRIVILEGES ON CATALOG TO admin-all GRANT USAGE, MODIFY ON DATABASE C TO cs-read-write Role Based Access Control
  • 22. Managed Data Source Managed Data Source Cluster or SQL Endpoint Managed Catalog Cross-Workspace External Tables SQL access controls Audit log Defined Credentials Other Existing Data Sources User Identity Passthrough Define Once, Secure Everywhere Centralized Governance/Unified Security Model- passthrough and ACLs supported in the same workspaces for all your data sources Use exclusively or in tandem with your existing Hive Metastore Centralized metastore with integrated fine-grained security across all your workspaces Databricks Managed Catalog Announced in Keynote!
  • 23. Implementation Example with Frictionless ETL
  • 24. TPC-DI Data Integration (DI), also known as ETL, is the analysis, combination, and transformation of data from a variety of sources and formats into a unified data model representation. Data Integration is a key element of data warehousing (and now lakehousing), application integration, and business analytics. http://www.tpc.org/tpcdi/default5.asp
  • 25. Main Concepts of TPC-DI TPC-DI uses the data integration of a fictitious retail brokerage firm as its model: ● Main Trading System ● Internal Human Resource System ● Internal Customer Relationship Management System ● Externally acquired data. Operations measured use the above model, but are not limited to those of a brokerage firm. They capture the variety and complexity of typical DI tasks: ● Loading of large volumes of historical data ● Loading of incremental updates ● Execution of a variety of transformation types using various input types and various target types with inter-table relationships ● Assuring consistency of loaded data. The benchmark is technology agnostic.
  • 26. Why TPC-DI? Data Generator • Produces files at scales from GBs to TBs • Produces CSV, CDC, XML, and text files • Has historical and incremental data. Data Model • Transformations documented • Dimensional model for analytics
  • 27. Implementation Reference Architecture Bronze Silver Gold OLTP CDC Extract Frictionless Load HR DB CSV Frictionless Load Prospect List CSV Frictionless Load Financial Newswire Multi Format Frictionless Load Customers XML Frictionless Load MERGE INTO
  • 28. What is Frictionless Loading? Autoloader • Load files from cloud object storage • With notifications • Structured Streaming • Trigger Once for Batch equivalency • Schema Inference • Schema Hints • Schema Drift Handling Delta Lake • Streaming Source and Sink • Checkpoint + Transaction Log = Ultimate State Store • Schema Enforcement and Evolution • Time Travel • Optimize for Analytics • Change Data Feed Batch mode for migrating existing orchestration, simple code change for near real-time streaming!
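A few of the Delta Lake features listed above, expressed in SQL (the table name is hypothetical):

```sql
-- Enable the Change Data Feed so downstream consumers can read row-level changes
ALTER TABLE silver.trades
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Read changes committed between table versions 2 and 5
SELECT * FROM table_changes('silver.trades', 2, 5);

-- Time travel: query the table as of an earlier version
SELECT * FROM silver.trades VERSION AS OF 2;
```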
  • 29. Enable everyone to declaratively build robust data pipelines with testing baked-in. First-class support for SQL and Python. Databricks auto-tunes infrastructure and takes care of orchestration, failures and retries Define data quality checks and data documentation in the pipeline Cost and latency conscious, with support for streaming vs. batch and full vs. incremental workloads A simple way to build and operate ETL data flows to deliver fresh, high quality data Delta Live Tables (formerly “Delta Pipelines”) Announced in Keynote Watch the Breakout from Awez, Delta Live Tables!
  • 31. Related Talks WEDNESDAY 03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks 04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn & Alex Behm, Databricks 04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya, Plume THURSDAY 11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics 03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano, Databricks FRIDAY 10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast & Molly Nagamuthu, Databricks
  • 32. How to get started On June 1 databricks.com/try