This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include:
- Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries.
- Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering.
- The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans.
- Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
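To make these points concrete, here is a minimal sketch of how the listed optimizations are typically switched on from a client session. It assumes a HiveServer2 endpoint at localhost:10000 reachable through the PyHive client; the host, port, and the web_logs table are placeholders rather than anything from the original document.

```python
# Minimal sketch: enabling the optimizations listed above from a Python client.
from pyhive import hive

cursor = hive.connect(host="localhost", port=10000, database="default").cursor()

# Run on Tez instead of MapReduce, with vectorization and the cost-based optimizer.
for setting in [
    "SET hive.execution.engine=tez",
    "SET hive.vectorized.execution.enabled=true",
    "SET hive.cbo.enable=true",
    "SET hive.stats.fetch.column.stats=true",  # let the optimizer use column statistics
]:
    cursor.execute(setting)

# Gather column-level statistics (kept in the metastore, and readable from ORC
# file footers for ORC tables) so the optimizer can pick better plans.
cursor.execute("ANALYZE TABLE web_logs COMPUTE STATISTICS FOR COLUMNS")
```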
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and Nifi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and the relevant tradeoffs between them. In particular, it will discuss how to format your data and which options to use to maximize your read performance, including when and how to use ORC's schema evolution, bloom filters, and predicate pushdown. It will also show how to use the tools that translate ORC files into human-readable formats, such as JSON, and display the rich metadata from the file, including the types stored in the file and the min, max, and count for each column.
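The talk itself demonstrates the Java orc-tools utilities; as a rough Python stand-in, the sketch below reads an ORC file with pyarrow, prints the schema, derives per-column count/min/max, and dumps a few rows as JSON. The file path is a placeholder.

```python
# Rough Python stand-in for the ORC dump/metadata tooling mentioned above.
import json
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.orc as orc

table = orc.ORCFile("example.orc").read()

print(table.schema)  # column names and types stored in the file

for name in table.column_names:
    col = table.column(name)
    try:
        stats = pc.min_max(col)
        lo, hi = stats["min"].as_py(), stats["max"].as_py()
    except pa.ArrowNotImplementedError:
        lo = hi = None  # min/max not defined for this column type
    print(f"{name}: count={len(col)} min={lo} max={hi}")

# Human-readable JSON dump of the first few rows.
print(json.dumps(table.slice(0, 5).to_pylist(), default=str, indent=2))
```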
In Spark SQL, the physical plan provides the fundamental information about how a query executes. The objective of this talk is to build an understanding of and familiarity with query plans in Spark SQL, and to use that knowledge to achieve better performance from Apache Spark queries. We will walk you through the most common operators you might find in a query plan and explain the information they expose about the execution. If you understand the query plan, you can look for weak spots and rewrite the query to achieve a more optimal plan that leads to more efficient execution.
The main content of this talk is based on the Spark source code, but it also reflects real-life queries that we run while processing data. We will show examples of query plans, explain how to interpret them and what information can be taken from them, and describe what happens under the hood when the plan is generated, focusing mainly on the physical planning phase. In general, we want to share what we have learned from both the Spark source code and the real-life queries we run in our daily data processing.
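As a minimal illustration of inspecting these plans, the PySpark snippet below builds a small aggregation and prints its query plans; the column names are invented for the example.

```python
# Build a small aggregation and print its query plans.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

orders = spark.range(1_000_000).withColumn("amount", F.rand() * 100)
totals = orders.groupBy((F.col("id") % 10).alias("bucket")).agg(F.sum("amount").alias("total"))

totals.explain()      # physical plan only
totals.explain(True)  # parsed, analyzed, optimized, and physical plans
```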
The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses:
- RDDs are Spark's fundamental data structure - they are immutable distributed collections that allow transformations like map and filter to be applied.
- RDDs track their lineage or dependency graph to support fault tolerance. Transformations create new RDDs while actions trigger computation.
- Operations on RDDs include narrow transformations like map that don't require data shuffling, and wide transformations like join that do require shuffling.
- The RDD abstraction allows Spark's scheduler to optimize execution through techniques like pipelining and cache reuse.
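A short PySpark sketch of these ideas, assuming a local SparkContext: map is a narrow transformation, reduceByKey is a wide one, and toDebugString shows the lineage before any action runs.

```python
# Narrow vs. wide transformations and lineage, on a local SparkContext.
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-demo")

words = sc.parallelize(["spark", "hive", "tez", "spark", "orc", "hive"])

pairs = words.map(lambda w: (w, 1))              # narrow: no shuffle needed
counts = pairs.reduceByKey(lambda a, b: a + b)   # wide: requires a shuffle

# Nothing has executed yet; transformations only extend the lineage graph.
print(counts.toDebugString().decode())

# The action triggers computation over the lineage built above.
print(counts.collect())
```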
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, DataFrames, and some other high-level topics, and can be used as an introduction to Apache Spark.
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was originally developed by Facebook.
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
Apache Hive is a data warehouse software built on top of Hadoop that allows users to query data stored in various databases and file systems using an SQL-like interface. It provides a way to summarize, query, and analyze large datasets stored in Hadoop distributed file system (HDFS). Hive gives SQL capabilities to analyze data without needing MapReduce programming. Users can build a data warehouse by creating Hive tables, loading data files into HDFS, and then querying and analyzing the data using HiveQL, which Hive then converts into MapReduce jobs.
The document provides an overview of Hive architecture and workflow. It discusses how Hive converts HiveQL queries to MapReduce jobs through its compiler. The compiler includes components like the parser, semantic analyzer, logical and physical plan generators, and logical and physical optimizers. It analyzes sample HiveQL queries and shows the transformations done at each compiler stage to generate logical and physical execution plans consisting of operators and tasks.
Processing Large Data with Apache Spark -- HasGeek
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
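For orientation, here is a hedged sketch of a few shuffle-related settings together with a query that forces a shuffle; the values are illustrative only, not tuning advice.

```python
# A few shuffle-related settings plus a query that triggers a shuffle.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shuffle-demo")
         .config("spark.sql.shuffle.partitions", "64")      # reduce-side partitions for Spark SQL
         .config("spark.shuffle.compress", "true")          # compress shuffle map outputs
         .config("spark.shuffle.service.enabled", "false")  # external shuffle service (cluster-level)
         .getOrCreate())

df = spark.range(10_000_000)
agg = df.groupBy((df.id % 100).alias("k")).count()  # groupBy repartitions by key, i.e. shuffles
agg.explain()  # the Exchange operator in the plan is the shuffle
```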
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Tuning Apache Spark for Large-Scale Workloads (Gaoxiang Liu and Sital Kedia)
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you'll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You'll also learn about Facebook's new efforts towards automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.
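The kinds of parameters such tuning touches can be sketched as below; the values are placeholders, not Facebook's settings.

```python
# Illustrative knobs of the kind large-scale Spark tuning touches.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set("spark.executor.memory", "8g")
        .set("spark.executor.cores", "4")
        .set("spark.executor.memoryOverhead", "2g")   # off-heap overhead per executor
        .set("spark.memory.fraction", "0.6")          # unified execution/storage memory pool
        .set("spark.sql.shuffle.partitions", "2000")  # scale with shuffle data volume
        .set("spark.speculation", "true"))            # re-launch straggler tasks

spark = SparkSession.builder.config(conf=conf).appName("tuning-demo").getOrCreate()
```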
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values for each column in each set of 10,000 rows and in the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren't relevant to the query.
Columnar storage formats like ORC reduce I/O and storage use, but it's just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with columnar storage formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10x to 100x, for structured data processing. We describe how we're adding vectorized query execution to Hive, coupling it with ORC through a vectorized iterator.
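A small PySpark sketch of the ORC behaviours described above, i.e. compression on write, column projection and predicate pushdown on read; the path and columns are invented, and Spark is used here only because it keeps the example short.

```python
# Write ORC with generic compression, then read back with projection and a
# pushed-down predicate.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("category", (F.col("id") % 5).cast("string"))

# zlib on top of ORC's lightweight, type-specific encodings.
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/events_orc")

# Projection reads only the bytes for `id`; the min/max indexes let the reader
# skip row groups that cannot satisfy the filter.
spark.read.orc("/tmp/events_orc").select("id").where(F.col("id") > 999_000).show()
```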
This document summarizes techniques for optimizing Hive queries, including recommendations around data layout, format, joins, and debugging. It discusses partitioning, bucketing, sort order, normalization, text format, sequence files, RCFiles, ORC format, compression, shuffle joins, map joins, sort merge bucket joins, count distinct queries, using explain plans, and dealing with skew.
The document discusses various techniques for optimizing data organization and performance in Hive, including:
- Partitioning data by meaningful columns like customer ID or VIN to improve lookup performance.
- Using the right number and size of buckets to avoid performance issues from too many small files or skewed data distribution.
- Denormalizing data and optimizing JOIN queries through techniques like broadcast joins.
- Storing data in its natural types like numbers instead of strings to enable predicate pushdown and better performance.
- Using temporary tables and in-memory storage to optimize queries involving data reorganization or distinct slices.
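As a sketch of this layout advice expressed as DDL: partition on a date column that queries filter on, bucket on a high-cardinality key to keep file counts bounded, and keep numeric fields numeric so predicates can be pushed down. Table, columns, and bucket count are invented, and PyHive is assumed only as a convenient client.

```python
# Hedged DDL sketch of the partitioning/bucketing/typing advice above.
from pyhive import hive

cursor = hive.connect(host="localhost", port=10000).cursor()

cursor.execute("""
  CREATE TABLE IF NOT EXISTS vehicle_readings (
    vin        STRING,
    reading_ts TIMESTAMP,
    speed_mph  DOUBLE
  )
  PARTITIONED BY (reading_date DATE)
  CLUSTERED BY (vin) INTO 32 BUCKETS
  STORED AS ORC
""")
```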
The document discusses Live Long and Process (LLAP), a new capability in Apache Hive that enables sub-second query performance. LLAP achieves this through caching the hottest data in RAM on each Hadoop node and running queries against this cache via lightweight long-running daemon processes. It allows for 100% SQL compatibility while integrating with existing security and tools. LLAP provides benefits like failure tolerance, concurrency, ACID transactions, and elastic scaling. Performance tests on TPC-DS queries demonstrated sub-second latency for queries even at large data scales and high concurrency levels.
This document summarizes a presentation on using indexes in Hive to accelerate query performance. It describes how indexes provide an alternative view of data to enable faster lookups compared to full data scans. Example queries demonstrating group by and aggregation are rewritten to use an index on the shipdate column. Performance tests on TPC-H data show the indexed queries outperforming the non-indexed versions by an order of magnitude. Future work is needed to expand rewrite rules and integrate indexing fully into Hive's optimizer.
This document discusses benchmarking Hive at Yahoo scale. Some key points:
- Hive is the fastest growing product on Yahoo's Hadoop clusters which process 750k jobs per day across 32500 nodes.
- Benchmarking was done using TPC-H queries on 100GB, 1TB, and 10TB datasets stored in ORC format.
- Significant performance improvements were seen over earlier Hive versions, with 18x speedup over Hive 0.10 on text files for the 100GB dataset.
- Average query time was reduced from 530 seconds to 28 seconds for the 100GB dataset, and from 729 seconds to 172 seconds for the 1TB dataset.
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
The document discusses new features in Apache Hive 0.14 that improve SQL query performance. It introduces a cost-based optimizer that can optimize join orders, enabling faster query times. An example TPC-DS query is shown to demonstrate how the optimizer selects an efficient join order based on statistics about table and column sizes. Faster SQL queries are now possible in Hive through this query optimization capability.
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Hadoop Summit June 2016
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format; depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn't fast or easy to do so. The use cases that we've examined are: reading all of the columns, reading a few of the columns, filtering using a filter predicate, and writing the data. Furthermore, it is important to benchmark on real data rather than synthetic data. We used the GitHub logs data available freely from http://githubarchive.org. We will make all of the benchmark code open source so that our experiments can be replicated.
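A toy version of the four benchmark use cases can be sketched as below; it uses a synthetic DataFrame purely to stay self-contained (whereas the talk stresses benchmarking on real data) and omits Avro since that needs an extra Spark package.

```python
# Tiny, illustrative format benchmark: write, full scan, projection, filter.
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("format-bench").getOrCreate()
df = spark.range(2_000_000).withColumn("payload", F.sha1(F.col("id").cast("string")))

def timed(label, fn):
    start = time.time()
    fn()
    print(f"{label}: {time.time() - start:.2f}s")

for fmt in ["orc", "parquet", "json"]:
    path = f"/tmp/bench_{fmt}"
    timed(f"write {fmt}",      lambda: df.write.mode("overwrite").format(fmt).save(path))
    timed(f"full scan {fmt}",  lambda: spark.read.format(fmt).load(path).count())
    timed(f"one column {fmt}", lambda: spark.read.format(fmt).load(path).select("id").count())
    timed(f"filter {fmt}",     lambda: spark.read.format(fmt).load(path)
                                            .where(F.col("id") > 1_999_000).count())
```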
Сергей Ковалёв (Altoros): Practical Steps to Improve Apache Hive Performance
Сергей Ковалёв (Sergey Kovalev): Solutions Architect and Big Data / High-Performance Computing Expert at Altoros; Minsk.
Talk: "Practical Steps to Improve Apache Hive Performance"
This document discusses the integration of Apache Pig with Apache Tez. Pig provides a procedural scripting language for data processing workflows, while Tez is a framework for executing directed acyclic graphs (DAGs) of tasks. Migrating Pig to use Tez as its execution engine provides benefits like reduced resource usage, improved performance, and container reuse compared to Pig's default MapReduce execution. The document outlines the design changes needed to compile Pig scripts to Tez DAGs and provides examples and performance results. It also discusses ongoing work to achieve full feature parity with MapReduce and further optimize performance.
This document discusses Pig Hive and Cascading, tools for processing large datasets using Hadoop. It provides background on each tool, including that Pig was developed by Yahoo Research in 2006, Hive was developed by Facebook in 2007, and Cascading was authored by Chris Wensel in 2008. It then covers typical use cases for each tool like web analytics processing, mining search logs for synonyms, and building a product recommender. Finally, it discusses how each tool works, mapping queries to MapReduce jobs, and compares features of the tools like philosophy, productivity and data models.
Hadoop & Cloud Storage: Object Store Integration in Production
The document discusses Hadoop integration with cloud storage. It describes the Hadoop-compatible file system architecture, which allows applications to work with different storage systems transparently. Recent enhancements to the S3A connector for Amazon S3 are discussed, including performance improvements and support for encryption. Benchmark results show significant performance gains for Hive queries running on S3A compared to earlier versions. Upcoming work on consistency, output committers, and abstraction layers is outlined to further improve object store integration.
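A hedged sketch of pointing a Spark job at S3 through the S3A connector; the bucket, path, and the `region` column are placeholders, credentials are assumed to come from the environment or an instance profile, and the hadoop-aws jars must be on the classpath.

```python
# Reading an ORC dataset from S3 via the S3A connector.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-demo")
         .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
         .config("spark.hadoop.fs.s3a.connection.maximum", "64")
         .config("spark.hadoop.fs.s3a.fast.upload", "true")  # incremental multipart uploads
         .getOrCreate())

sales = spark.read.orc("s3a://my-bucket/warehouse/sales/")  # hypothetical dataset
sales.groupBy("region").count().show()
```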
Apache ORC has undergone significant improvements since its introduction in 2013 to provide faster, better, and smaller data analytics. Some key improvements include the addition of vectorized readers, columnar storage, predicate pushdown using bloom filters and statistics, improved compression techniques, and optimizations that reduce data size and query execution time. Over the years, ORC has become the native data format for Apache Hive and been adopted by many large companies for analytics workloads.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
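One loading scheme along these lines can be sketched as a partitioned ORC table with a bloom filter on a high-cardinality lookup column; table and column names are invented, and PyHive is assumed only as the client.

```python
# Partitioned ORC table with a bloom filter to support point lookups.
from pyhive import hive

cursor = hive.connect(host="localhost", port=10000).cursor()

cursor.execute("""
  CREATE TABLE IF NOT EXISTS clicks (
    user_id STRING,
    url     STRING,
    ts      TIMESTAMP
  )
  PARTITIONED BY (dt STRING)
  STORED AS ORC
  TBLPROPERTIES (
    'orc.compress' = 'ZLIB',
    'orc.bloom.filter.columns' = 'user_id',
    'orc.bloom.filter.fpp' = '0.05'
  )
""")

# Point lookups on user_id can now skip most row groups via the bloom filter
# and min/max indexes (predicate pushdown).
cursor.execute("SELECT count(*) FROM clicks WHERE dt = '2016-06-01' AND user_id = 'u123'")
print(cursor.fetchall())
```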
Apache Spark in Depth: Core Concepts, Architecture & Internals (Anton Kirillov)
Slides cover core Apache Spark concepts such as RDDs, the DAG, the execution workflow, how stages of tasks are formed, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo that contains Spark application examples and a dockerized Hadoop environment to experiment with.
Presto on Apache Spark: A Tale of Two Computation Engines (Databricks)
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been a long and open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto's state-of-the-art low-latency evaluation with Spark's robust and fault-tolerant execution engine.
This document compares query performance times between Apache Hive versions 0.10 and 0.13 using a benchmark of 50 SQL queries on a 30TB dataset. The results show that Hive 0.13 was over 100 times faster for 6 queries and averaged 52 times faster for all queries compared to Hive 0.10. Significant performance improvements were achieved through optimizations made during the Stinger Initiative involving 145 developers from 44 companies over 13 months.
This document summarizes Richard Xu's presentation on tuning YARN, Hive, and queries on a Hadoop cluster. The initial issue with the cluster was jobs taking hours to finish when they should have taken minutes. Initial tuning focused on cluster configuration best practices and increasing YARN capacity. Further tuning involved limiting user capacity, increasing resources for application masters, and tuning memory settings for MapReduce and Tez. Specific Hive query issues addressed were full table scans, non-deterministic functions, join orders, and data type mismatches. Tools discussed for analysis included Tez visualization and Lipwig. Lessons learned emphasized a holistic tuning approach and understanding data structures and explain plans. Live Long and Process (LLAP) was presented as providing in-memory caching via long-running daemons.
JVM and OS Tuning for Accelerating Spark Applications (Tatsuhiro Chiba)
1) The document discusses optimizing Spark applications through JVM and OS tuning. Tuning aspects covered include JVM heap sizing, garbage collection options, process affinity, and large memory pages.
2) Benchmark results show that after applying these optimizations, execution time was reduced by 30-50% for Kmeans clustering and TPC-H queries compared to the default configuration.
3) Dividing the application across multiple smaller JVMs instead of a single large JVM helped reduce garbage collection overhead and resource contention, improving performance by up to 16%.
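In the same spirit, a Spark job's executor JVMs can be shaped as in the sketch below; the flags and sizes are examples rather than the paper's exact settings, and large pages must also be enabled at the OS level.

```python
# Shape the executor JVMs: GC choice, heap sizing, large pages, and several
# smaller JVMs instead of one big one.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jvm-tuning-demo")
         .config("spark.executor.instances", "8")   # multiple smaller JVMs
         .config("spark.executor.memory", "6g")
         .config("spark.executor.extraJavaOptions",
                 "-XX:+UseParallelGC -XX:+UseLargePages -XX:+AlwaysPreTouch")
         .getOrCreate())
```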
2015-01-17 Lambda Architecture with Apache Spark, NextML Conference (DB Tsai)
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. In Lambda architecture, the system involves three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries, and each comes with its own set of requirements.
The batch layer aims at perfect accuracy by processing all of the available data, an immutable, append-only set of raw data, using a distributed processing system. Output is typically stored in a read-only database, with results completely replacing the existing precomputed views. Apache Hadoop, Pig, and Hive are the de facto batch-processing systems.
In the speed layer, the data is processed in a streaming fashion, and real-time views are produced from the most recent data. As a result, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate as the views the batch layer creates from the full dataset, so they will eventually be replaced by the batch layer's views. Traditionally, Apache Storm is used in this layer.
In the serving layer, the results from the batch layer and the speed layer are stored, and it responds to queries in a low-latency, ad hoc way.
One example of the lambda architecture in a machine learning context is building a fraud detection system. In the speed layer, the incoming streaming data can be used for online learning to update the model learned in the batch layer so that it incorporates recent events. After a while, the model can be rebuilt using the full dataset.
Why Spark for the lambda architecture? Traditionally, different technologies are used in the batch layer and the speed layer. If your batch system is implemented with Apache Pig and your speed layer is implemented with Apache Storm, you have to write and maintain the same logic in SQL and in Java/Scala. This very quickly becomes a maintenance nightmare. With Spark, we have a unified development framework for the batch and speed layers at scale. In this talk, an end-to-end example implemented in Spark will be shown, and we will discuss the development, testing, maintenance, and deployment of a lambda architecture system with Apache Spark.
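A minimal sketch of the "one code base for both layers" point, assuming Structured Streaming (the current Spark streaming API rather than the 2015-era DStreams) and a socket source standing in for a real event stream; the paths and the `account` column are placeholders.

```python
# One function reused by both the batch layer and the speed layer.
from pyspark.sql import SparkSession, DataFrame, functions as F

spark = SparkSession.builder.appName("lambda-demo").getOrCreate()

def suspicious_accounts(events: DataFrame) -> DataFrame:
    """Shared logic: flag accounts with an unusually high number of events."""
    return (events.groupBy("account")
                  .agg(F.count("*").alias("tx_count"))
                  .where(F.col("tx_count") > 100))

# Batch layer: recompute the view over the full, immutable history.
history = spark.read.parquet("/data/transactions/")
suspicious_accounts(history).write.mode("overwrite").parquet("/views/batch/")

# Speed layer: the same function on the live stream fills the gap until the
# next batch run replaces these views.
stream = (spark.readStream.format("socket")
          .option("host", "localhost").option("port", 9999).load()
          .select(F.col("value").alias("account")))
query = (suspicious_accounts(stream)
         .writeStream.outputMode("complete").format("console").start())
query.awaitTermination()
```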
This document describes Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language called HiveQL, which are compiled into map-reduce jobs executed on Hadoop. Hive organizes data into tables partitioned across directories and files in HDFS. It includes a system catalog called Hive Metastore for storing schemas and statistics to optimize queries.
This document provides an overview and comparison of the Avro and Parquet data formats. It begins with introductions to Avro and Parquet, describing their key features and uses. The document then covers Avro and Parquet schemas, file structures, and includes code examples. Finally, it discusses considerations for choosing between Avro and Parquet and shares experiences using the two formats.
This document discusses new features in Apache Hive 2.0, including:
1) Adding procedural SQL capabilities through HPLSQL for writing stored procedures.
2) Improving query performance through LLAP which uses persistent daemons and in-memory caching to enable sub-second queries.
3) Speeding up query planning by using HBase as the metastore instead of a relational database.
4) Enhancements to Hive on Spark such as dynamic partition pruning and vectorized operations.
5) Default use of the cost-based optimizer and continued improvements to statistics collection and estimation.
This document provides an overview of performance improvements in Hive through the use of Apache Tez and other optimizations. Some key points discussed include:
- Hive on Tez replaces MapReduce as the execution engine, providing lower latency for interactive queries and higher throughput for batch queries.
- Tez enables various join optimizations like broadcast joins, dynamically partitioned hash joins, and 1-1 edges between tasks to improve performance.
- Other optimizations in Hive include vectorized query processing, cost-based optimization, faster query startup times, predicate pushdown in ORC files, and statistics gathering from ORC footers.
- Tez allows for pipelining of tasks, dynamic parallelism, split elimination, and
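The broadcast-join idea from the list above can be sketched with Spark's DataFrame API for brevity (in Hive the equivalent is the automatic map join); the tables are synthetic.

```python
# Broadcast join: ship the small table to every task instead of shuffling the big one.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

facts = spark.range(10_000_000).withColumnRenamed("id", "dim_id")
dims = spark.createDataFrame([(i, f"name_{i}") for i in range(100)], ["dim_id", "name"])

# The small dimension table is copied to every task, so the large fact table
# is never shuffled for this join.
facts.join(broadcast(dims), "dim_id").explain()  # shows BroadcastHashJoin
```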
Tez is a data processing framework that allows dataflow jobs to be expressed as directed acyclic graphs (DAGs). It is built on top of YARN for resource management and aims to provide better performance than MapReduce by enabling container reuse, late binding of tasks, and simplifying operations. Tez defines APIs for developers to express DAGs and processing logic to customize jobs.
Tez: Accelerating Data Pipelines - Fifth Elephant (t3rmin4t0r)
This document provides an overview of Tez, an Apache project that provides a framework for executing data processing jobs on Hadoop clusters. Tez allows data processing jobs to be expressed as directed acyclic graphs (DAGs) of tasks and executes these tasks in an optimized manner. It addresses limitations of MapReduce by providing a more flexible execution engine that can optimize performance and resource utilization.
This document discusses interactive querying in Hadoop. It describes how Hive facilitates SQL querying over data stored in HDFS. Hive performance is improved through optimizations like using Tez as the execution engine instead of MapReduce, vectorized queries, and ORC file format. Tez is a dataflow framework that allows expressing queries as directed acyclic graphs (DAGs) of vertices and edges, avoiding the multi-step MapReduce approach and improving latency. The document provides examples of expressing Hive queries in Tez and demonstrates its capabilities.
This document provides an overview of big data processing techniques including batch processing using MapReduce and Hive, iterative batch processing using Spark, stream processing using Apache Storm, and OLAP over big data using Dremel and Druid. It discusses techniques such as MapReduce, Hive, Spark RDDs, and Storm tuples for processing large datasets and compares small versus big data approaches. Example usages and technologies for different processing types are also outlined.
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It allows for the storage and analysis of datasets that are too large for single servers. The document discusses several key Hadoop components including HDFS for storage, MapReduce for processing, HBase for column-oriented storage, Hive for SQL-like queries, Pig for data flows, and Sqoop for data transfer between Hadoop and relational databases. It provides examples of how each component can be used and notes that Hadoop is well-suited for large-scale batch processing of data.
Hive on Tez provides significant performance improvements over Hive on MapReduce by leveraging Apache Tez for query execution. Key features of Hive on Tez include vectorized processing, dynamic partitioned hash joins, and broadcast joins which avoid unnecessary data writes to HDFS. Test results show Hive on Tez queries running up to 100x faster on datasets ranging from terabytes to petabytes in size. Hive on Tez also handles concurrency well, with the ability to run 20 queries concurrently on a 30TB dataset and finish within 27.5 minutes.
Apache Hive is the most widely used SQL interface for Hadoop. As Hadoop usage continues its explosive growth, Hive`s performance and features do not meet the requirements and expectation of many users. This includes answering queries in human time (less than 30 seconds) and support for common analytics operations. The Hive community has risen to the challenge. Work is being done to drive down start up time of a Hive query, extend Hive to work on Tez (a Hadoop execution environment that is much faster than MapReduce), make Hive operators process records at 10x more than their current speed, add support for analytics and windowing functions such as RANK, NTILE, LEAD, LAG, etc., and add support to Hive for standard SQL datatypes. This talk will discuss the design and code changes that have been done as well as look at ongoing work and additional optimizations and features that could be added in the future.
This document discusses improvements to Hive performance and functionality in the Stinger initiative. Stinger includes changes to Hive and a new project called Tez, with two main goals: improve Hive performance by 100x and extend Hive SQL for analytics. Stinger is divided into three phases, with phase 1 focusing on optimizations, phase 2 adding YARN resource management and Hive on Tez, and phase 3 adding a buffer cache and cost-based optimizer. Hive 0.11 delivers performance gains through optimizations like improved map joins and collapsing jobs. It also introduces new technologies like Tez, ORC files, and vectorization. Standard queries now run much faster, with some seeing over 50x speedup. Future work will further reduce query startup
Evan Pollan talks about Bazaarvoice's Hadoop infrastructure for clickstream analytics, as well as an approach to large-scale cardinality analysis using Map/Reduce and HBase.
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0 (Adam Muise)
The document discusses Hadoop 2.2.0 and new features in YARN and MapReduce. Key points include: YARN introduces a new application framework and resource management system that replaces the jobtracker, allowing multiple data processing engines besides MapReduce; MapReduce is now a library that runs on YARN; Tez is introduced as a new data processing framework to improve performance beyond MapReduce.
This document provides an overview of Apache Spark, including:
- A refresher on MapReduce and its processing model
- An introduction to Spark, describing how it differs from MapReduce in addressing some of MapReduce's limitations
- Examples of how Spark can be used, including for iterative algorithms and interactive queries
- Resources for free online training in Hadoop, MapReduce, Hive and using HBase with MapReduce and Hive
This document discusses the Stinger initiative to improve the performance of Apache Hive. Stinger aims to speed up Hive queries by 100x, scale queries from terabytes to petabytes of data, and expand SQL support. Key developments include optimizing Hive to run on Apache Tez, the vectorized query execution engine, cost-based optimization using Optiq, and performance improvements from the ORC file format. The goals of Stinger Phase 3 are to deliver interactive query performance for Hive by integrating these technologies.
This document discusses stream processing and Hazelcast Jet. It defines stream processing as processing big volumes of continuous data with low latency. Some key challenges of stream processing discussed include handling infinite input streams, late arriving events, fault tolerance, and complexity. Hazelcast Jet is presented as a stream processing engine that favors simplicity and speed over other solutions like Spark Streaming. Example applications of Jet are provided.
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster (Yahoo Developer Network)
Apache Hive and its HiveQL interface has become the de facto SQL interface for Hadoop. Apache Hive was originally built for large-scale operational batch processing and it is very effective with reporting, data mining, and data preparation use cases. These usage patterns remain very important but with widespread adoption of Hadoop, the enterprise requirement for Hadoop to become more real time or interactive has increased in importance as well. Enabling Hive to answer human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments. To this end, we have launched the Stinger Initiative, with input and participation from the broader community, to enhance Hive with more SQL and better performance for these human-time use cases. We believe the performance changes we are making today, along with the work being done in Tez will transform Hive into a single tool that Hadoop users can use to do report generation, ad hoc queries, and large batch jobs spanning 10s or 100s of terabytes.
Presenter(s):
Alan Gates, Co-founder, Hortonworks and Apache Pig PMC and Apache HCatalog PPMC Member, Author of "Programming Pig" from O'Reilly Media
Owen O'Malley, Co-founder, Hortonworks, first committer added to Apache Hadoop and Founding Chair of the Apache Hadoop PMC
Big Data Analytics Projects - Real World with Pentaho (Mark Kromer)
This document discusses big data analytics projects and technologies. It provides an overview of Hadoop, MapReduce, YARN, Spark, SQL Server, and Pentaho tools for big data analytics. Specific scenarios discussed include digital marketing analytics using Hadoop, sentiment analysis using MongoDB and SQL Server, and data refinery using Hadoop, MPP databases, and Pentaho. The document also addresses myths and challenges around big data and provides code examples of MapReduce jobs.
Overview of Stinger Interactive Query for Hive (David Kaiser)
This document provides an overview of the Stinger initiative to improve the performance of Hive interactive queries. The Stinger project worked to optimize Hive so that queries return results in seconds instead of minutes or hours by implementing features like Hive on Tez, vectorized processing, predicate pushdown, the ORC file format, and a cost-based optimizer. These optimizations improved Hive performance by over 100 times, allowing interactive use of Hive for the first time on large datasets.
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfjackson110191
These fighter aircraft have uses outside of traditional combat situations. They are essential in defending India's territorial integrity, averting dangers, and delivering aid to those in need during natural calamities. Additionally, the IAF improves its interoperability and fortifies international military alliances by working together and conducting joint exercises with other air forces.
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfNeo4j
Presented at Gartner Data & Analytics, London Maty 2024. BT Group has used the Neo4j Graph Database to enable impressive digital transformation programs over the last 6 years. By re-imagining their operational support systems to adopt self-serve and data lead principles they have substantially reduced the number of applications and complexity of their operations. The result has been a substantial reduction in risk and costs while improving time to value, innovation, and process automation. Join this session to hear their story, the lessons they learned along the way and how their future innovation plans include the exploration of uses of EKG + Generative AI.
An invited talk given by Mark Billinghurst on Research Directions for Cross Reality Interfaces. This was given on July 2nd 2024 as part of the 2024 Summer School on Cross Reality in Hagenberg, Austria (July 1st - 7th)
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Erasmo Purificato
Slide of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends" held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data.
The project is called openTelemetry, but before diving into the specifics, we’ll start with de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, percentile to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we’ll explore the openTelemetry community; its Special Interest Groups (SIGs), repositories, and how to become not only an end-user, but possibly a contributor.We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs.
Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of openTelemetry, and know how to take their first steps to an open-source contribution!
Key Takeaways: Open source, vendor neutral instrumentation is an exciting new reality as the industry standardizes on openTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve and in order to achieve ubiquity, the project would benefit from growing our contributor community.
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc
Six months into 2024, and it is clear the privacy ecosystem takes no days off!! Regulators continue to implement and enforce new regulations, businesses strive to meet requirements, and technology advances like AI have privacy professionals scratching their heads about managing risk.
What can we learn about the first six months of data privacy trends and events in 2024? How should this inform your privacy program management for the rest of the year?
Join TrustArc, Goodwin, and Snyk privacy experts as they discuss the changes we’ve seen in the first half of 2024 and gain insight into the concrete, actionable steps you can take to up-level your privacy program in the second half of the year.
This webinar will review:
- Key changes to privacy regulations in 2024
- Key themes in privacy and data governance in 2024
- How to maximize your privacy program in the second half of 2024
Measuring the Impact of Network Latency at TwitterScyllaDB
Widya Salim and Victor Ma will outline the causal impact analysis, framework, and key learnings used to quantify the impact of reducing Twitter's network latency.
Details of description part II: Describing images in practice - Tech Forum 2024 (BookNet Canada)
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and transcript: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
How RPA Helps in the Transportation and Logistics Industry (SynapseIndia)
Revolutionize your transportation processes with our cutting-edge RPA software. Automate repetitive tasks, reduce costs, and enhance efficiency in the logistics sector with our advanced solutions.
YOUR RELIABLE WEB DESIGN & DEVELOPMENT TEAM — FOR LASTING SUCCESS
WPRiders is a web development company specializing in WordPress and WooCommerce websites and plugins for customers around the world. The company is headquartered in Bucharest, Romania, but our team members are located all over the world. Our customers are primarily from the US and Western Europe, but we have clients from Australia, Canada, and other areas as well.
Some facts about WPRiders and why we are one of the best firms around:
More than 700 five-star reviews! You can check them here.
1500 WordPress projects delivered.
We respond 80% faster than other firms! Data provided by Freshdesk.
We’ve been in business since 2015.
We are located in 7 countries and have 22 team members.
With so many projects delivered, our team knows what works and what doesn’t when it comes to WordPress and WooCommerce.
Our team members are:
- highly experienced developers (employees & contractors with 5-10+ years of experience),
- great designers with an eye for UX/UI and 10+ years of experience,
- project managers with a development background who speak both tech and non-tech,
- QA specialists,
- Conversion Rate Optimisation (CRO) experts.
They are all working together to provide you with the best possible service. We are passionate about WordPress, and we love creating custom solutions that help our clients achieve their goals.
At WPRiders, we are committed to building long-term relationships with our clients. We believe in accountability, in doing the right thing, as well as in transparency and open communication. You can read more about WPRiders on the About us page.
The DealBook is our annual overview of the Ukrainian tech investment industry. This edition comprehensively covers the full year 2023 and the first deals of 2024.
How Social Media Hackers Help You to See Your Wife's Message (HackersList)
In the modern digital era, social media platforms have become integral to our daily lives. These platforms, including Facebook, Instagram, WhatsApp, and Snapchat, offer countless ways to connect, share, and communicate.
7 Most Powerful Solar Storms in the History of Earth (Enterprise Wired)
Solar storms (geomagnetic storms) are streams of accelerated charged particles in the solar environment, moving at high velocities due to coronal mass ejections (CMEs).
Quantum Communications Q&A with the Gemini LLM. These are based on Shannon's noisy channel theorem and discuss how the classical theory applies to the quantum world.
Base optimizations:
Star join, MMR->MR plan reduction, multiple map joins grouped into a single mapper.
Which analytic functions?
Windowing functions with the OVER clause.
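A minimal HiveQL sketch of a windowing function with the OVER clause; the store_sales table and its columns are hypothetical:
  SELECT ss_store_sk,
         ss_sold_date_sk,
         -- rolling sum over the current row and the six preceding rows per store
         SUM(ss_net_paid) OVER (PARTITION BY ss_store_sk
                                ORDER BY ss_sold_date_sk
                                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_sales
  FROM store_sales;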
Advanced optimizations:
Does predicate push down only eliminate ORC stripes? (See the sketch below.)
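A hedged sketch of the settings that push predicates into the ORC reader, so stripes whose min/max statistics cannot match are skipped; the query and table are illustrative:
  SET hive.optimize.ppd = true;           -- push predicates down the operator tree
  SET hive.optimize.index.filter = true;  -- push the filter into the ORC reader for stripe/row-group skipping
  SELECT COUNT(*) FROM store_sales WHERE ss_sold_date_sk = 2451911;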
Performance boosts via YARN
Improvements in shuffle
Tools? BI tools such as Tableau and MicroStrategy.
Hive 0.13 is 100x faster.
Startup time improvements:
- Pre-launch the Application Master and keep containers around; what are the elements of query startup? (See the sketch after this list.)
- Faster metastore lookups.
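A hedged sketch of session warm-up and container-reuse settings that cut query startup time; the specific values are illustrative:
  SET hive.execution.engine = tez;
  SET hive.prewarm.enabled = true;            -- pre-launch containers for the Tez session
  SET hive.prewarm.numcontainers = 10;        -- illustrative container count
  SET tez.am.container.reuse.enabled = true;  -- keep containers around between queries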
Using statistics other than Optiq (the statistics themselves are gathered as sketched after this list):
- Metadata queries
- Estimating the number of reducers
- Map join conversion
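A minimal sketch of gathering the table and column statistics these optimizations rely on; the table name is hypothetical:
  ANALYZE TABLE store_sales COMPUTE STATISTICS;
  ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;
  SET hive.stats.fetch.column.stats = true;   -- let the planner read column stats from the metastore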
Optiq: join reordering
What is Optiq?
50 optimization rules, examples:
- Join reordering rules, filter push down, column pruning.
Should we mention that we generate an AST? (Enabling CBO is sketched below.)
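A hedged sketch of turning on the Optiq-based cost-based optimizer; defaults may differ between Hive releases:
  SET hive.cbo.enable = true;
  SET hive.compute.query.using.stats = true;
  SET hive.stats.fetch.column.stats = true;
  SET hive.stats.fetch.partition.stats = true;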
Ad hoc queries involving multiple views:
Creating views is currently supported; a query on a view is executed by replacing the view with its defining subquery (illustrated below).
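An illustrative example of view expansion; the table and predicate are hypothetical:
  CREATE VIEW recent_sales AS
  SELECT * FROM store_sales WHERE ss_sold_date_sk > 2452000;
  -- At execution time the view reference is replaced by its defining subquery over store_sales:
  SELECT COUNT(*) FROM recent_sales;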
What is a Tez vertex boundary?
What is shuffle+map?
Why is d1 not joined with ss before the first shuffle?
Why is Run 2 slower for non-CBO?
What is bucketing off?
Why higher throughput?
How many contributors now?
No unnecessary writes to HDFS.
Number of processes reduced.
The edges between M and R can be generalized.
On MR:
each mapper sorts partitions of both tables.
In Tez:
a mapper sorts only one table, so the operators don't have to switch between data sources.
Inventory is the bigger table in this case.
Similar to a map join without the need to build a hash table on the client
Works with any level of sub-query nesting
Uses stats to determine if applicable
How it works:
- The broadcast result set is computed in parallel on the cluster
- Join processors are spun up in parallel
- The broadcast set is streamed to the join processors
- The join processors build a hash table
- The other relation is joined against the hash table
Tez handles:
Best parallelism
Best data transfer of the hashed relation
Best scheduling to avoid latencies
Why is the broadcast join better than the map join? (Enabling the conversion is sketched below.)
-- Multiple hash tables can be generated in parallel
-- The in-memory hash table can be more compact than the serialized one built in a local task
-- Subqueries were previously always on the streaming side and were joined with a shuffle join
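A hedged sketch of the settings that let Hive convert shuffle joins into broadcast (map) joins on Tez; the size threshold and the query are illustrative:
  SET hive.auto.convert.join = true;
  SET hive.auto.convert.join.noconditionaltask = true;
  SET hive.auto.convert.join.noconditionaltask.size = 100000000;  -- small-table size threshold in bytes
  SELECT ss.ss_item_sk, d.d_year
  FROM store_sales ss JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk;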
Parallelism:
- Splits of the dimension table are processed in parallel across mappers
Data transfer:
- No HDFS write in between
Scheduling:
- Read from a rack-local replica of the dimension table
Comparing the bucketed map join in MR vs. Tez
The inventory table is already bucketed.
In MR,
the hash map for each bucket is built in a single mapper in sequence, loaded into HDFS, and then joined with store_sales, where the hash table is read as a side file.
In Tez,
the inventory scan runs in parallel across multiple mappers that each process buckets.
------
Kicks in when the large table is bucketed:
- Bucketed table
- Dynamic, as part of query processing
Uses a custom edge to match the partitioning of the smaller table
Allows a hash join in cases where broadcast would be too large
Tez gives us the option of building custom edges and vertex managers:
- Fine-grained control over how the data is replicated and partitioned
- Scheduling and actual data transfer are handled by Tez (a bucketing sketch follows)
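A hedged sketch of a bucketed table and the settings that enable bucket map joins on Tez; the table layout and bucket count are illustrative:
  CREATE TABLE inventory (inv_item_sk INT, inv_warehouse_sk INT, inv_quantity_on_hand INT)
  CLUSTERED BY (inv_item_sk) INTO 32 BUCKETS
  STORED AS ORC;
  SET hive.optimize.bucketmapjoin = true;
  SET hive.convert.join.bucket.mapjoin.tez = true;  -- Tez-specific bucket map join conversion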
Unions are a common operation in decision support queries.
They caused additional no-op stages in MR plans:
- The last stage spins up a multi-input mapper to write the result
- Intermediate unions have to be materialized before additional processing
Tez has a union that handles these cases transparently without any intermediate steps (illustrated below).
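An illustrative UNION ALL that Tez can run without materializing the intermediate result; the tables and columns are hypothetical:
  SELECT * FROM (
    SELECT ss_item_sk AS item_sk, ss_quantity AS quantity FROM store_sales
    UNION ALL
    SELECT cs_item_sk AS item_sk, cs_quantity AS quantity FROM catalog_sales
  ) u;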
Allows the same input to be split and written to different tables or partitions:
- Avoids duplicate scans/processing
- Useful for ETL
- Similar to "splits" in Pig
In MR, a "split" in the operator pipeline has to be written to HDFS and processed by multiple additional MR jobs.
Tez can send the multiple outputs directly to downstream processors (see the sketch below).
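A minimal sketch of Hive's multi-table insert, which appears to be the feature described: one scan of the source feeds several targets; the table and column names are hypothetical:
  FROM store_sales ss
  INSERT OVERWRITE TABLE sales_by_item
    SELECT ss.ss_item_sk, SUM(ss.ss_net_paid) GROUP BY ss.ss_item_sk
  INSERT OVERWRITE TABLE sales_by_store
    SELECT ss.ss_store_sk, SUM(ss.ss_net_paid) GROUP BY ss.ss_store_sk;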
checkcast
TPC-H query 1 and query 6.
Before:
1 TB of TPC-H data compresses to 200 GB of ORC data.
30 TB of TPC-DS data compresses to approximately 6 TB of ORC data.