Building highly efficient data lakes using Apache Hudi (Incubating)
Vinoth Chandar | Sr. Staff Engineer, Uber
Apache®, Apache Hudi, and the Hudi logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
Data Architectures
Lakes, Marts, Silos
Simple… Right?
[Architecture diagram: Database, Events, Service Mesh and External Sources feed DFS/Cloud Storage via Extract-Transform-Load; Tables on storage serve Queries, spanning Real-time/OLTP and Analytics/OLAP]
OK... maybe not that simple...
[Architecture diagram: Database, Events, Service Mesh and External Sources feed a Data Lake on DFS/Cloud Storage via Ingestion (Extract-Load); Raw Tables flow into Derived Tables, with Schemas and Data Audit alongside; Queries span Real-time/OLTP and Analytics/OLAP]
Data Lake Implementation: It’s actually hard...
Requirement #1: Incremental Database Ingestion
High-value data
- User information in RDBMS
- Trip, transaction logs in NoSQL
Replicate CRUD operations
- Strict ordering guarantees
- Zero data loss
Bulk loads don’t scale
- Adds more load to the database
- Wasteful re-writing of data
[Diagram: a MySQL users table (userID int, country string, last_mod long, ...) is replicated into a users table in the Data Lake, carrying inserts, updates and deletes]
Requirement #2: De-Duping Log Events
High-scale time series data
- Several billions/day
- Few millions/sec
- Heavily aggregated
Causes of duplicates
- Client retries/failures/network errors
- At-least-once data pipes
Overcounting problems
- More impressions => more $
- Low fidelity data
[Diagram: impression events (event_id string, datestr string, time long, ...) are produced and replicated into the Data Lake without duplicates]
Requirement #3: Transactional Writes
Atomic publish of data
- Ingestion can fail midway
- Rollback bad data
Consistency Guarantees
- No partial data exposed
- Repeatable queries
Snapshot Isolation
- Time-travel queries
- Concurrent writer/readers
Strong Durability
- No data loss
Requirement #4: Unique Key Constraints
Data model parity
- Enforce upstream primary keys
- 1-1 Mapping w/ source table
- Great data quality!
Transaction Processing
- e.g.: Settling orders, fraud detection
- Lakes are well-suited for large-scale processes
Requirement #5: Faster Derived Data
Multi-stage ETL DAGs
- Very common in batch analytics
- Large amounts of data
Derived/ETL tables
- Keep fresh with new/changed raw data
- Star schema/warehousing
Scaling challenges
- Intelligent recomputations
- Window-based joins
[Diagram: the raw table raw_trips (id string, datestr string, currency string, fare double) is transformed by standardize_fare(row) into the derived table std_trips (id string, datestr string, std_fare double, ...)]
Requirement #6: File Management
Small Files = Big Problem
- Slow queries
- Stress filesystem metadata
Big Files = Large Delays
- 2GB Parquet writing => ~5-10 mins
File Stitching?
- Band-aid for bullet wound
- Consistency?
- Standardization?
Requirement #7: Scalable DFS/Storage RPCs
Ingestion and queries both list the DFS
- List folders/files, take action
- Single threaded vs parallel
Subtle gotchas/differences
- Cloud storage => no append()
- S3 => Eventual consistency
- S3 => rename() = copy()
- Large directory listings
- HDFS NameNode bottlenecks
Requirement #8: Incremental Copy to Data marts
Data Marts
- Specialized, often MPP OLAP databases
- E.g. Redshift, Vertica
Online Serving
- Sync ML features to databases
- Throttling syncing rate
Need to sync Lake => Mart
- Full data refresh often very expensive
- Need for incremental egress
Requirement #9: Legal Requirements/Data Deletions
Strict rules on data retention
- Delete records
- Correct data
- Raw + Derived tables
Need efficient delete()
- “needle in haystack”
- Indexed on write (point-ish lookup)
- Still optimized for scans
- Propagate deleted records downstream
Requirement #10: Late Data Handling
Data often arrives late
- Minutes, Hours, even days
- E.g: credit card txn settlement
Not implicitly complete
- Can lead to large data quality issues
- Trigger recomputation of derived tables
Data arrival tracking
- First class, audit log
- Flexible, rewindable windowing
Apache Hudi
At a glance
Apache Hudi (Incubating)
Overview
● Snapshot isolation between writer & queries
● upsert() support with pluggable indexes
● Atomically publish data with rollback support
● Savepoints for data recovery
● Manages file sizes, layout using statistics
● Async compaction of new & old data
● Timeline metadata to track lineage
Apache Hudi (Incubating)
Storage
● Three logical views on single physical dataset
● Read Optimized View
○ Provides excellent query performance
○ Replaces plain Apache Parquet tables
● Incremental View
○ Change stream to feed downstream jobs/ETLs
● Near-Real time Table
○ Provides queries on real-time data
○ Combination of Apache Parquet & Apache Avro data
Apache Hudi (Incubating)
Queries/Views of data
[Chart: the Read Optimized and Realtime views trade off query cost against data latency]
Hudi: Upserts + Incremental Changes
Incrementalize batch jobs
[Diagram: incoming changes are applied to a Hudi dataset via upsert; outgoing changes are consumed downstream via Hudi Incremental Pull]
upsert(RDD<Record>)
Updates records if they are already present, or inserts them into their corresponding partitions.
RDD<Record> pullDelta(startTs, endTs)
Gets all the records that changed (updated or inserted) between the start and end time. The delta can span any number of partitions.
Apache Hudi @ Uber
Foundation for the vast Data Lake
>1 Trillion
Records/day
10s PB
Entire Data Lake
1000s
Pipelines/Tables
Apache Hudi Data Lake
Meeting the requirements
Data Lake built on Apache Hudi
[Diagram: Database, Events, Service Mesh and External Sources feed the Data Lake on DFS/Cloud Storage via Ingestion (Extract-Load); Hudi upsert()/insert() builds the Raw Tables, Incremental Pull feeds the Derived Tables; Queries span Real-time/OLTP and Analytics/OLAP]
#1: upsert() database changelogs
// Command to extract incrementals using sqoop
bin/sqoop import \
  -Dmapreduce.job.user.classpath.first=true \
  --connect jdbc:mysql://localhost/users \
  --username root \
  --password ******* \
  --table users \
  --as-avrodatafile \
  --target-dir s3:///tmp/sqoop/import-1/users
// Spark Datasource
import com.uber.hoodie.DataSourceWriteOptions._
// Use the Spark datasource (spark-avro) to read the extracted avro files
Dataset<Row> inputDataset =
  spark.read.avro("s3:///tmp/sqoop/import-1/users/*");
// save it as a Hudi dataset
inputDataset.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
  .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")
  .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
  .mode(SaveMode.Append)
  .save("/path/on/dfs");
Step 1: Extract new changes to the users table in MySQL, as Avro data files on DFS
(or)
Use a data integration tool of choice to feed db changelogs to Kafka/event queue
Step 2: Use your favorite datasource to read the extracted data and directly “upsert” the users table on DFS/Hive
(or)
Use the Hudi DeltaStreamer tool
#2: Filter out duplicate events
// DeltaStreamer command to ingest kafka events, dedupe, and write them out
spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hoodie-utilities-bundle-*.jar \
  --props s3://path/to/kafka-source.properties \
  --schemaprovider-class com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
  --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource \
  --source-ordering-field time \
  --target-base-path s3:///hoodie-deltastreamer/impressions \
  --target-table uber.impressions \
  --op BULK_INSERT \
  --filter-dupes
// kafka-source.properties
include=base.properties
# Key fields, for the kafka example
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=datestr
# Schema provider configs
hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest
# Kafka source
hoodie.deltastreamer.source.kafka.topic=impressions
# Kafka props
metadata.broker.list=localhost:9092
auto.offset.reset=smallest
schema.registry.url=http://localhost:8081
#3: Timeline consistency
Atomic multi-row commits
- Mask partial failures using timeline
- Rollback/savepoint support
Timeline
- Special .hoodie folder
- Actions are instantaneous
MVCC based isolation
- Between queries/ingestion
- Between ingestion/compaction
Future
- Unlimited timeline lookback
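For orientation, a rough sketch of what the timeline folder can look like on DFS; the instant timestamps and exact file names below are illustrative examples based on how Hudi names instants, not something shown in this deck:
/path/on/dfs/.hoodie/
  hoodie.properties          // table name, type etc.
  20190601120000.commit      // completed, atomically published commit
  20190601123000.inflight    // commit still in progress - invisible to queries
  20190601110000.clean       // cleanup action
Queries only resolve data files belonging to completed instants, which is how partially failed writes stay hidden until they are rolled back.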
#4: Keyed update/insert() operations
Ingested record tagging
- Merge updates
- Log inserts
- HoodieRecordPayload interface to support complex merges
Pluggable indexing (config sketch below)
- Built-in: Bloom/Range-based, HBase
- Scales with long-term data growth
- Handles data skews
Future
- Support via SQL
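A hedged sketch of choosing the index implementation on the Spark datasource write path shown earlier; hoodie.index.type is the standard writer config key, but treat the exact values (and any HBase connection settings) as assumptions to verify against the configuration docs:
// Same upsert as in #1, but explicitly choosing the index implementation
inputDataset.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
  .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")
  .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
  .option("hoodie.index.type", "BLOOM") // or "HBASE" to use an external HBase index
  .mode(SaveMode.Append)
  .save("/path/on/dfs");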
#5: Incremental ETL/Data Pipelines
// Spark Datasource
import com.uber.hoodie.DataSourceReadOptions._
import com.uber.hoodie.DataSourceWriteOptions._
// Read only the records that changed since 8 AM, via the incremental view
Dataset<Row> hoodieIncViewDF = spark.read().format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
  .option(BEGIN_INSTANTTIME_OPT_KEY(), commitInstantFor8AM)
  .load("s3://tables/raw_trips");
Dataset<Row> stdDF = standardize_fare(hoodieIncViewDF)
// save the standardized records as a Hudi dataset
stdDF.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_trips")
  .option(RECORDKEY_FIELD_OPT_KEY(), "id")
  .option(PARTITIONPATH_FIELD_OPT_KEY(), "datestr")
  .option(PRECOMBINE_FIELD_OPT_KEY(), "time")
  .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
  .mode(SaveMode.Append)
  .save("/path/on/dfs");
Bring Streaming APIs to the Data Lake
Incrementally pull
- Avoid recomputes!
- Orders of magnitude faster
Transform + upsert
- Avoid rewriting all data
Future
- Incr pull on logs
- Watermark APIs
#6: File Sizing & Fast Ingestion
Enforce file size on write
- Pay the cost up front to keep queries healthy
- Set hoodie.parquet.max.file.size & hoodie.parquet.small.file.limit (example below)
- See docs for the full list
Near real-time log ingest
- Asynchronously compact & write columnar data
Future
- Support for split/collapse
- Auto-tune compression ratio etc.
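The two configs named above can be passed straight through the datasource writer; the byte values here are illustrative choices, not recommendations from the deck:
// Target ~128MB base files and treat files under ~100MB as "small",
// so new inserts are packed into them instead of creating more tiny files
inputDataset.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
  .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")
  .option("hoodie.parquet.max.file.size", "134217728")    // 128MB
  .option("hoodie.parquet.small.file.limit", "104857600") // 100MB
  .mode(SaveMode.Append)
  .save("/path/on/dfs");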
#7: Optimized Timeline/FileSystem APIs
Embedded Timeline Server
- 0-listings from Spark executors
- Incremental file-system views on Spark driver
Consistency Guards
- Masks eventual consistency on S3
- No data file renames, in-place writing
- Storage aware “append” usage
- Graceful MVCC design to handle various failures
Future
- Standalone timeline server
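Both behaviors are toggled through writer configs; the two keys below are assumptions based on Hudi's configuration surface rather than anything named in the deck, so verify them against the docs before relying on them:
// Hedged sketch: enable the S3 consistency guard and the embedded timeline server
// (config key names are assumptions to double-check)
inputDataset.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
  .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")
  .option("hoodie.consistency.check.enabled", "true") // wait out S3 eventual consistency on written files
  .option("hoodie.embed.timeline.server", "true")     // serve file-system views from the driver, no listings on executors
  .mode(SaveMode.Append)
  .save("/path/on/dfs");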
#8: Data Dispersal out of Lake
Incremental pull as sync mechanism
- Only copy updated ML features
- Only copy affected data ranges
Decoupled from ETL writing
- Shock absorber between Lake & Mart
- Enables throttling, retrying, rewinding
Future
- Support Lake => Mart in DeltaStreamer tool
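A hedged sketch of this dispersal pattern: pull only the rows that changed since the last sync via the incremental view (same API as in #5), then push that slice to a mart over Spark's plain JDBC sink. The table path, JDBC URL, table name and credentials are placeholders:
// Pull only the records that changed since the last successful sync
Dataset<Row> changedFeatures = spark.read().format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
  .option(BEGIN_INSTANTTIME_OPT_KEY(), lastSyncedInstant)
  .load("s3://tables/ml_features");
// Copy just that slice into the mart; throttling/retries wrap this step
changedFeatures.write().format("jdbc")
  .option("url", "jdbc:postgresql://mart-host:5432/warehouse")
  .option("dbtable", "ml_features")
  .option("user", "****")
  .option("password", "****")
  .mode(SaveMode.Append)
  .save();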
#9: Efficient/Fast Deletes
Soft deletes
- upsert(k, null)
- Propagates seamlessly via incr-pull
Hard deletes
- Using EmptyHoodieRecordPayload
Indexing
- 7-10x faster than using regular joins
Future
- Standardized tooling
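A hedged sketch of a hard delete through the same datasource: upsert just the keys to be removed, with the payload class set to EmptyHoodieRecordPayload so the merge produces no value and the record disappears. The PAYLOAD_CLASS option key and the fully qualified class name are assumptions to verify against the com.uber.hoodie source:
// deleteDF holds only the records (keys + partition fields) to remove
deleteDF.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
  .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")
  .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
  .option(PAYLOAD_CLASS_OPT_KEY(), "com.uber.hoodie.EmptyHoodieRecordPayload") // assumed class path
  .mode(SaveMode.Append)
  .save("/path/on/dfs");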
#10: Safe Reprocessing
Identify late data
- Timeline tracks all write activity
- E.g: obtain bounds on lateness
Adjust incremental pull windows
- Still much more efficient than bulk recomputation
Future
- Support parrival(data, window) APIs in the TimelineServer
- Apache Beam support for composing safe, incremental pipelines
Open Source
Roadmap, community, and the future
Current Status
Where we are at
● Committed to open, vendor neutral data lake standard
● 2+ yrs of OSS community support
● First Apache release imminent
● EMIS Health, Yields.io + more in production
● A bunch more companies trying it out
● Production tested on cloud
● hudi.apache.org/community.html
2019 Roadmap
Key initiatives
Bootstrapping tables into Hudi
- With indexing benefits
- Convenient tooling
Standalone Timeline Server
- Eliminate fs listings for query planning/ingestion
- Track column-level statistics for queries
Smart storage layouts
- Increase file sizes for older data
- Re-clustering data for queries
Thank you
dev@hudi.apache.org
@apachehudi
https://hudi.apache.org
?