How to Realize an Additional 270% ROI on Snowflake
- 2.
Introduction
How to dramatically increase the ROI of your Snowflake investment by:
● Managing the size of your data warehouse
● Defining and setting limits on query times to prevent runaway queries
● Implementing visibility and telemetry to monitor usage
● Automating the creation, maintenance and management of data aggregates
- 3.
Today's Speakers
Mark Stange-Tregear
VP, Analytics, Rakuten Rewards
@rakutenrewards
Mark is VP, Analytics, at Rakuten Rewards, formerly Ebates. He's been with the company since 2014, and was with the team that sold Ebates to Rakuten in 2015. Mark plays a double role, leading a center-of-excellence analytics group and product managing the enterprise business intelligence stack. Prior to joining Ebates, Mark worked in the residential real estate and online tournament spaces.
Dave Mariani
Chief Strategy Officer, AtScale
@dmariani
Dave is one of the co-founders of AtScale and is currently the Chief Strategy Officer. Prior to AtScale, Dave was VP of Engineering at Klout and at Yahoo!, where he built the world's largest multi-dimensional cube for BI on Hadoop. Dave is a Big Data visionary and serial entrepreneur.
- 4.
In Pursuit of Processing Power - A Timeline
● 2014: SSRS, SQL Server, SSAS, MicroStrategy
● 2015-2017: Hadoop (Cloudera), AtScale, Tableau
● 2018-2019: AtScale, Tableau, Snowflake
- 5.
Why Snowflake?
PROBLEM
● Users "trip" over each other
● Frustration and missed goals
RESULT: Separate compute into discrete clusters
● Dozens of discrete warehouses
● Budget cost to the business unit
● Separate ETL from ad hoc workload
● Horizontally scalable on-demand
- 6. Managing Data Warehouses
MORE WAREHOUSES MORE VISIBILITY MORE CONTROL
HOW MANY?
● At least one warehouse per business team or engineering group (often several)
● Dedicated warehouses for ETL components
● Dedicated warehouses for 3rd party products
CONTROL OF WAREHOUSE SIZE?
● Constantly reviewed for potential down-sizing
● Resizing control centralized with cost management and oversight “team”
IS BIGGER BETTER?
● Not always
● IO intensive workloads can work on smaller clusters
● Aggregations and joins on bigger clusters
- 7.
More Visibility … More Control
Get to know and love:
● "SNOWFLAKE"."ACCOUNT_USAGE"."WAREHOUSE_METERING_HISTORY"
● "SNOWFLAKE"."ACCOUNT_USAGE"."QUERY_HISTORY"
and these are well worth knowing as well…
● "SNOWFLAKE"."ACCOUNT_USAGE"."STORAGE_USAGE"
● "SNOWFLAKE"."ACCOUNT_USAGE"."METERING_DAILY_HISTORY"
● "SNOWFLAKE"."READER_ACCOUNT_USAGE"."WAREHOUSE_METERING_HISTORY"
MAKE MONITORING EASY… AUTOMATED DAILY REPORTS
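The automated daily report can be as simple as a scheduled job that rolls WAREHOUSE_METERING_HISTORY up by warehouse and day. A minimal sketch in Python, assuming the rows have already been fetched from Snowflake (warehouse name, date, credits used are real columns of that view; the sample values and the $2.00-per-credit rate are illustrative assumptions):

```python
from collections import defaultdict

# Sample rows shaped like SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY:
# (warehouse_name, start date, credits_used). Values are illustrative only.
rows = [
    ("ETL_WH",   "2019-06-01", 42.5),
    ("ETL_WH",   "2019-06-01", 10.0),
    ("ADHOC_WH", "2019-06-01", 7.25),
    ("ADHOC_WH", "2019-06-02", 3.75),
]

def daily_credit_report(rows, dollars_per_credit=2.00):
    """Aggregate credits (and estimated cost) per warehouse per day."""
    totals = defaultdict(float)
    for warehouse, day, credits in rows:
        totals[(warehouse, day)] += credits
    return {key: (credits, round(credits * dollars_per_credit, 2))
            for key, credits in totals.items()}

report = daily_credit_report(rows)
for (warehouse, day), (credits, cost) in sorted(report.items()):
    print(f"{day}  {warehouse:10s}  {credits:7.2f} credits  ${cost:8.2f}")
```

Emailing a table like this to each warehouse's owning team every morning keeps cost visibility in front of the people who can act on it.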
- 9. Levers to Pull
1. Warehouse size… typically moving down, but sometimes up
2. Horizontal scaling
3. Move jobs between warehouses
4. Caching
5. Code rewrite
6. Clustering
A note on code rewrite…
● Data modeling is still important
● Snowflake is very powerful, but joins and aggregations cost
● Repetitive joins, repetitive aggregation? Consider creating a "flat" warehouse table
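Lever #1 is worth quantifying: each Snowflake warehouse size consumes double the credits per hour of the size below it, so a single resize step changes compute spend by 2x. A rough cost sketch (the credits-per-hour figures are Snowflake's published rates; the $2.00-per-credit price is an illustrative assumption, since actual rates depend on edition and contract):

```python
# Snowflake credits consumed per hour by warehouse size (each step doubles).
CREDITS_PER_HOUR = {
    "X-Small": 1, "Small": 2, "Medium": 4, "Large": 8,
    "X-Large": 16, "2X-Large": 32, "3X-Large": 64, "4X-Large": 128,
}

def hourly_cost(size, dollars_per_credit=2.00):
    """Estimated compute cost per hour while a warehouse of this size runs.

    dollars_per_credit is an assumed rate for illustration; check your
    Snowflake edition and contract for the real number.
    """
    return CREDITS_PER_HOUR[size] * dollars_per_credit

# Downsizing from 3X-Large to X-Large cuts the hourly rate 4x:
print(hourly_cost("3X-Large"))  # 128.0
print(hourly_cost("X-Large"))   # 32.0
```

The doubling is also why "constantly reviewed for potential down-sizing" pays off: every size you can step down halves the bill for that warehouse's runtime.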
- 10.
Additional Considerations When Managing Snowflake
● Query Performance: How fast can the cloud data warehouse answer a query for one user?
● User Concurrency: How do multiple users running queries affect performance & stability?
● Compute Costs: How do query workloads and configuration impact your monthly bill?
● Semantic Complexity: How difficult is it to write the query to answer the business question?
- 11. The Cloud Analytics Stack
LAYER (FUNCTION) → COMPONENT
● CONSUMPTION (visualization, analysis, reporting) → BI Tools, AI/ML Tools, Applications
● SEMANTIC LAYER (query access, metadata, masking, auditing) → Universal Semantic Layer
● PREPARED DATA (data processing, modeling) → Data Warehouse, File Access Engine
● DATA TRANSFORMATION (ETL, merging, aggregation) → ETL Engine
● RAW DATA (data storage, encryption) → File System (Data Lake)
● Data Catalog (spans all layers)
- 12. The Cloud Analytics Stack
LAYER (FUNCTION) → COMPONENT
● CONSUMPTION (visualization, analysis, reporting) → BI Tools, AI/ML Tools, Applications
● SEMANTIC LAYER (query access, filtering, masking, auditing) → Multi-dimensional Engine, Data Governance Engine, Virtualization Engine
● PREPARED DATA (data processing, modeling) → Data Warehouse, File Access Engine
● DATA TRANSFORMATION (ETL, merging, aggregation) → ETL Engine
● RAW DATA (data storage, encryption) → File System (Data Lake)
● Data Catalog (spans all layers)
- 13. A Semantic Layer is Critical to Success
1. Simplicity
2. Single source of truth
3. Governance for all
- 14.
A Semantic Layer Simplifies & Normalizes Data Access
SELECT
`d_product_manufacturer_id` AS `d_product_manufacturer_id`,
SUM( `Total Ext Sales Price` ) AS `sum_total__ext_sales_price_ok`
FROM
`tpc-ds benchmark model` `TPC-DS Benchmark Model`
WHERE
`I Category` = 'Electronics'
AND `Sold Calendar Year-Week` = 1999
AND `Sold d_customer_gmt_offset` = -5.00
AND `Sold d_month_of_year` = 7
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100;
with ss as (
select
i_manufact_id,sum(ss_ext_sales_price) total_sales
from
store_sales,
date_dim,
customer_address,
item
where
i_manufact_id in (select
i_manufact_id
from
item
where i_category in ('Electronics'))
and ss_item_sk = i_item_sk
and ss_sold_date_sk = d_date_sk
and d_year = 1999
and d_moy = 7
and ss_addr_sk = ca_address_sk
and ca_gmt_offset = -5
group by i_manufact_id),
cs as (
select
i_manufact_id,sum(cs_ext_sales_price) total_sales
from
catalog_sales,
date_dim,
customer_address,
item
where
...
TPC-DS Query #33: What is the monthly sales figure based on extended price for a specific month in a specific year, for manufacturers in a specific category in a given time zone? Group sales by manufacturer identifier and sort output by sales amount, by channel, and give total sales.
AtScale SQL: 398 bytes vs. TPC-DS raw: 1,872 bytes
- 15.
AtScale's TPC-DS 10TB Benchmark (10,000 Scale Factor)
THE TPC-DS 10TB DATASET HAS:
1. Multiple fact tables
2. Large fact tables
3. Large dimensions
- 16.
Benchmark Results: Query Performance
14x Faster
Delivers orders-of-magnitude query improvements that are amplified with high user concurrency.
- 17.
Benchmark Results: Concurrency
Smooths out and mitigates user concurrency challenges without requiring additional compute resources.
Note: Thread group 1 is the average of 5 runs for each of the 20 queries. The 5, 25 & 50 thread groups ran each of the 20 queries 1 time per thread.
- 18.
Benchmark Results: Compute Cost
4x Less Cost
Allows for smaller compute resources and mitigates unpredictable and unbounded costs for on-demand pricing models.
- 20.
Results of TPC-DS 10TB Benchmark Test
Improvement factor with AtScale on Snowflake:
● Query Performance (1): 4x faster
● User Concurrency (2): 14x faster
● Compute Cost (3): 73% cheaper
● Complexity (4): 76% less complex SQL queries
1. Elapsed time for executing 1 query five times
2. Elapsed time executing 1 (x5), 5, 25, 50 queries
3. Compute costs for cluster time (Redshift, Snowflake) or bytes read (BigQuery) for user concurrency test
4. Complexity score for SQL queries for number of: functions, operations, tables, objects & subqueries (AtScale = 258, TPC-DS = 1,057)
Configuration:
● Snowflake: 3X-Large virtual data warehouse (64 credits/hour), $128.00 compute cost per hour
● AtScale on Snowflake: 1X-Large virtual data warehouse (16 credits/hour), $32.00 compute cost per hour
AtScale customers realize an additional
270% ROI on Snowflake
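The derived figures above can be sanity-checked from the raw numbers in the footnotes and configuration rows:

```python
# Complexity (footnote 4): scores count functions, operations, tables,
# objects & subqueries. AtScale = 258, raw TPC-DS = 1,057.
complexity_reduction = 1 - 258 / 1057
print(f"{complexity_reduction:.0%} less complex SQL")  # 76% less complex SQL

# Hourly compute (configuration): $128.00/hr (3X-Large) vs $32.00/hr (1X-Large).
rate_ratio = 128.00 / 32.00
print(f"{rate_ratio:.0f}x lower hourly rate")  # 4x lower hourly rate
```

Note that the 73% "cheaper" figure in the table is measured from actual cluster time during the concurrency test (footnote 3), so it differs slightly from the 4x (75%) hourly-rate ratio computed here.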
- 21.
DEMO
- 22.
Summary: How to Realize an Additional 270% ROI on Snowflake
▵ Download the Snowflake benchmark report at: https://www.atscale.com/snowflakebenchmark
▵ Read the Rakuten Rewards case study at: https://www.atscale.com/rakutenrewards
▵ COMING SOON! Estimate your cost savings using the AtScale calculator
Editor's Notes
- Companies of all sizes have embraced the power, scale and ease of use of Snowflake along with the promise of cost savings. But as some have learned, cloud compute usage can sneak up on you if you aren’t careful.
Today, our experts will discuss how to dramatically increase the ROI of your Snowflake investment by:
- AtScale is built to leverage the efficiencies and performance of the cloud for the data consumer, whether you're on premises, in the cloud, or both.
We connect people to data. We do that without moving data and without complexity, leveraging existing investments in big data platforms, applications and tools.
We also do that consistently, securely and with one set of semantics, and without interrupting existing data usage, so that data workers no longer have to understand how or where data is stored.
Performance
Optimizing performance is difficult, and that's where we focus our energies. AtScale's data warehouse virtualization can cut query times from 5 weeks to 5 seconds, automatically optimizing each time a user queries the database.
Security
Because we haven't copied the data or applied new code or embedded rules, we reduce complexity and maintain consistent data lineage throughout the data lifecycle. AtScale not only leverages existing data security and governance, but applies an additional layer so that data can be ported to new data tools, applications and platforms.
Agility
What's more, we create a simple interface for querying data and building models for data science and analytics workers, with deep integrations with BI and AI/ML tools. For the first time, users (and IT) have visibility into how data is being queried and used throughout the organization (no more data silos).