Apache Hudi

Data Infrastructure and Analytics

San Francisco, CA · 7,160 followers

Open source pioneer of the lakehouse, reimagining batch processing with an incremental framework for low-latency analytics

About us

Open source pioneer of the lakehouse, reimagining old-school batch processing with a powerful new incremental framework for low-latency analytics. Hudi brings database and data warehouse capabilities to the data lake, making it possible to create a unified data lakehouse for ETL, analytics, AI/ML, and more. Apache Hudi is battle-tested at scale, powering some of the largest data lakes on the planet. It provides an open foundation that seamlessly connects to other popular open source tools such as Spark, Presto, Trino, Flink, Hive, and many more. Being an open source table format is not enough: Apache Hudi is also a comprehensive platform of the open services and tools needed to operate your data lakehouse in production at scale. Most importantly, Apache Hudi is a community built by a diverse group of engineers from all around the globe! Hudi is a friendly and inviting open source community that is growing every day. Join the community on GitHub: https://github.com/apache/hudi or find links to mailing lists and Slack channels on the Hudi website: https://hudi.apache.org/

Website
https://hudi.apache.org/
Industry
Data Infrastructure and Analytics
Company size
201-500 employees
Headquarters
San Francisco, CA
Type
Nonprofit
Founded
2016
Specialties
ApacheHudi, DataEngineering, ApacheSpark, ApacheFlink, TrinoDB, Presto, DataAnalytics, DataLakehouse, AWS, GCP, Azure, ChangeDataCapture, and StreamProcessing

Updates

  • Apache Hudi reposted this

    View profile for Soumil S.

    Lead Data Engineer | AWS & Apache Hudi Expert | Spark & AWS Glue Enthusiast | YouTuber

    Unlocking Advanced Analytics with AWS Glue and Apache Hudi

    Discover how you can elevate your data analytics with AWS Glue and Apache Hudi. Ever wondered how to efficiently track and analyze Athena query metrics for enhanced insights? Look no further! Our latest blog post dives deep into setting up Apache Hudi tables to store comprehensive Athena query metrics.

    📊 Here’s a sneak peek at the powerful insights you can derive:
    🔍 Total Data Scanned by WorkGroup in Athena Queries: Understand resource utilization across different workgroups.
    📈 Count of Queries Executed by WorkGroup in Athena: Gauge workload distribution and query frequency.
    📊 Athena Query Performance Analysis by WorkGroup: Evaluate average execution times and optimize query performance.
    🚀 Performance Analysis of Athena Queries: Dive into detailed metrics like data scanned, execution times, and more.
    📋 Database and Table Usage Metrics in Athena Queries: Track usage patterns to optimize database resources.

    This is just the beginning! Uncover more actionable insights and streamline your analytics workflows. Ready to elevate your data analytics game? Check out our detailed blog for a deep dive into AWS Glue, Apache Hudi, and Athena query metrics. 👉 Read the full blog here

    Start making data-driven decisions today! 🌟

    #AWSGlue #ApacheHudi #DataAnalytics #AthenaQueries #BigDataAnalytics #AWS #DataInsights

    Optimizing Analytics: Storing Athena Query Metrics in Hudi for Advanced Analysis using AWS Glue

    Soumil S. on LinkedIn
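
    A minimal sketch of the idea in the post above, assuming the Athena boto3 APIs (list_query_executions / batch_get_query_execution) and standard Hudi Spark write options; the workgroup, table name, and S3 path are placeholders rather than the blog's actual setup:

    # Hypothetical sketch: pull Athena query metrics with boto3 and upsert them
    # into a Hudi table from an AWS Glue / Spark job.
    import boto3
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("athena-metrics-to-hudi").getOrCreate()
    athena = boto3.client("athena")

    # Most recent query executions for one workgroup (placeholder: "primary").
    ids = athena.list_query_executions(WorkGroup="primary")["QueryExecutionIds"]
    executions = athena.batch_get_query_execution(QueryExecutionIds=ids[:50])["QueryExecutions"]

    rows = [
        Row(
            query_execution_id=q["QueryExecutionId"],
            workgroup=q.get("WorkGroup"),
            state=q["Status"]["State"],
            submission_time=str(q["Status"]["SubmissionDateTime"]),
            data_scanned_bytes=q.get("Statistics", {}).get("DataScannedInBytes"),
            total_execution_ms=q.get("Statistics", {}).get("TotalExecutionTimeInMillis"),
            database=q.get("QueryExecutionContext", {}).get("Database"),
        )
        for q in executions
    ]
    df = spark.createDataFrame(rows)

    hudi_options = {
        "hoodie.table.name": "athena_query_metrics",                 # placeholder
        "hoodie.datasource.write.recordkey.field": "query_execution_id",
        "hoodie.datasource.write.precombine.field": "submission_time",
        "hoodie.datasource.write.partitionpath.field": "workgroup",
        "hoodie.datasource.write.operation": "upsert",
    }

    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("s3://my-bucket/hudi/athena_query_metrics/"))           # placeholder path

    Once the metrics land in the Hudi table, the per-workgroup numbers called out in the post (data scanned, query counts, average execution time) become simple GROUP BY queries over the workgroup partition.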

  • Apache Hudi reposted this

    View profile for Soumil S.

    Lead Data Engineer | AWS & Apache Hudi Expert | Spark & AWS Glue Enthusiast | YouTuber

    📊 Impressive Growth in Data Lakes Powered by Apache Hudi

    In just one year, our data landscape has seen remarkable growth. From a modest 30TB in July 2023, our data lakes have expanded to a substantial 130TB by July 2024. This represents a staggering accumulation of 100TB of data in just twelve months, underscoring our commitment to handling large-scale data with efficiency and scalability.

    🔧 Infrastructure Overview
    - Apache Hudi: The backbone of our data management strategy, ensuring efficient and reliable data ingestion and processing.
    - AWS Glue: Serving as our Hive metastore, crucial for cataloging and organizing data within Hudi tables (see the catalog sync sketch below).
    - AWS Glue Jobs: Powering over 4,000 jobs daily, with an average of 11,000 jobs per month, to manage our Hudi operations seamlessly.
    - Storage: Leveraging S3, the cornerstone of our data storage solution, for its scalability, durability, and cost-effectiveness.
    - Athena: Enabling ad-hoc querying capabilities, providing quick insights into our vast dataset.
    - AWS DMS: Facilitating data ingestion from SQL Server and PostgreSQL, ensuring a smooth flow of data into our Hudi-powered environment.
    - DynamoDB Streams: Integrating real-time data from DynamoDB, enriching our datasets with up-to-date information.

    🚀 Future Directions
    Currently, we are exploring DataHub for enhanced metadata management, aiming to further streamline our data operations and ensure comprehensive governance as our data volumes continue to grow.

    🔗 Learn More
    Curious about how we optimized our job targets and migrated to a serverless architecture with templated Glue jobs? Check out our detailed journey and insights on our blog: https://lnkd.in/eJtP3fH6
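
    For the AWS Glue catalog piece above, here is a minimal, hypothetical sketch of a Glue/Spark job writing a Hudi table and syncing it into the Glue Data Catalog (acting as the Hive metastore). Database/table names and the S3 path are placeholders, and the exact sync settings depend on the Hudi version and how the Glue job is configured:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-glue-sync").getOrCreate()
    df = spark.createDataFrame(
        [("o-1001", "2024-07-01 10:00:00", 42.5)],
        ["order_id", "updated_at", "amount"],
    )

    hudi_options = {
        "hoodie.table.name": "orders",                        # placeholder
        "hoodie.datasource.write.recordkey.field": "order_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",
        # Catalog sync, assuming the Glue Data Catalog is configured as the metastore
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": "analytics",  # placeholder
        "hoodie.datasource.hive_sync.table": "orders",        # placeholder
    }

    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("s3://my-bucket/hudi/orders/"))                  # placeholder path

    Once synced, the same table is immediately visible to Athena for the ad-hoc querying mentioned above.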

  • Apache Hudi reposted this

    View profile for Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

    Concurrency Control in the Lakehouse.

    Open table formats such as Apache Hudi, Apache Iceberg & Delta Lake support concurrent access to data by multiple transactions. This is one of the most important problems tackled by a lakehouse architecture as opposed to plain data lakes. To provide transactional guarantees across concurrent reads & writes, there needs to be a mechanism to coordinate them. Concurrency control defines how different writers/readers coordinate access to a table.

    Iceberg uses optimistic concurrency control to deal with concurrent writes. Transactions operate without taking locks up front, updating data freely; at commit time they check for conflicts, and if one is found, the conflicting transaction is rolled back.

    Apache Hudi differentiates itself from other formats by distinctly separating its processes into three categories:
    - writer processes (handling user upserts & deletes)
    - table services (managing data and metadata for optimization & bookkeeping)
    - readers (executing queries)

    It ensures snapshot isolation among these categories, so each operates on a consistent snapshot of the table. Hudi uses:
    ✅ optimistic concurrency control (OCC) among writers
    ✅ a lock-free, non-blocking approach using Multiversion Concurrency Control (MVCC) between writers and table services, as well as among the table services themselves

    With MVCC, transactions see a consistent past state of the database, allowing read access without blocking, regardless of subsequent data changes.

    On top of that, with v1.0, Hudi has introduced Non-Blocking Concurrency Control (NBCC), which allows multiple writers to write simultaneously, with conflicts resolved later at query time or via compaction. NBCC offers the same set of guarantees as OCC but without explicit locks for serializing the writes.

    Reading link in comments. #dataengineering #softwareengineering
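
    A hedged configuration sketch for the OCC piece above: enabling multi-writer optimistic concurrency control through Hudi's Spark write options. The property names follow recent Hudi documentation but should be verified against the version in use; the in-process lock provider shown here only serializes writers inside a single JVM, so truly distributed writers would swap in an external provider (ZooKeeper-, DynamoDB-based, etc.):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-occ-demo").getOrCreate()
    df = spark.createDataFrame([("k1", "2024-07-01 10:00:00", 1)],
                               ["key", "updated_at", "value"])

    occ_options = {
        "hoodie.table.name": "occ_demo",                      # placeholder
        "hoodie.datasource.write.recordkey.field": "key",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",
        # Optimistic concurrency control between writers
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        # Single-JVM lock provider, for illustration only; use a ZooKeeper- or
        # DynamoDB-based provider for concurrent writers in separate processes.
        "hoodie.write.lock.provider":
            "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
    }

    (df.write.format("hudi")
       .options(**occ_options)
       .mode("append")
       .save("s3://my-bucket/hudi/occ_demo/"))                # placeholder path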

  • View organization page for Apache Hudi

    7,160 followers

    New HFile Reader in Apache Hudi 0.15.0.

    The metadata table in Hudi hosts various index types containing table metadata. These indexes are stored as "partitions" in the metadata table. Internally, the metadata table is a Hudi Merge-on-Read (MoR) table. The indexes currently available are:
    - files index: Stored as the files partition. Contains file information such as file name, size, etc.
    - column_stats index: Stored as the column_stats partition. Contains statistics for columns, such as min and max values, total values, null counts, size, etc.
    - bloom_filter index: Stored as the bloom_filter partition in the metadata table. Stores bloom filters of all data files centrally to avoid scanning the footers of individual files.
    - record_index: Stored as the record_index partition in the metadata table. Contains the mapping of record key to location.

    For efficient lookups & needle-in-a-haystack queries, compute engines must avoid scanning the entire index, as large datasets can have index sizes in the TBs. This is why the HFile format was chosen as the preferred file format for the metadata table. The HFile format is great for point & range lookups and allows the required metadata to be accessed quickly without scanning large portions of the table.

    So, what's happening?
    ✅ Apache Hudi 0.15.0 introduces a new "HFile Reader", implemented in Java, for accessing the metadata table & faster lookups.
    ✅ This reader is independent of HBase or Hadoop dependencies.
    ✅ The Hudi connector (for engines like Trino, Presto) will use this HFile reader to support reading different indexes from the metadata table.
    ✅ Technically, readers & writers are language-independent (Rust, C++) as long as they align with the defined HFile format spec.

    📄 HFile Format Spec: https://lnkd.in/dciGntpz
    📙 Release notes: https://lnkd.in/dgUTP6_a
    📱 Chat about it in our Slack: https://lnkd.in/gZSZzdX4

    #dataengineering #softwareengineering
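
    A hedged sketch of turning these metadata-table index partitions on from a Spark writer and using them at read time. The property names follow the Hudi 0.14/0.15 documentation and should be checked against your version; the table path and names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-metadata-indexes").getOrCreate()
    df = spark.createDataFrame([("k1", "2024-07-01 10:00:00", 1)],
                               ["key", "updated_at", "value"])

    write_options = {
        "hoodie.table.name": "indexed_table",                 # placeholder
        "hoodie.datasource.write.recordkey.field": "key",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",
        # Metadata table and its index partitions
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.index.column.stats.enable": "true",  # column_stats partition
        "hoodie.metadata.index.bloom.filter.enable": "true",  # bloom_filter partition
        "hoodie.metadata.record.index.enable": "true",        # record_index partition
    }

    (df.write.format("hudi")
       .options(**write_options)
       .mode("append")
       .save("s3://my-bucket/hudi/indexed_table/"))           # placeholder path

    # Read side: let the engine prune files using stats from the metadata table
    snapshot = (spark.read.format("hudi")
                .option("hoodie.metadata.enable", "true")
                .option("hoodie.enable.data.skipping", "true")
                .load("s3://my-bucket/hudi/indexed_table/"))
    snapshot.where("key = 'k1'").show()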

  • Apache Hudi reposted this

    View profile for Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

    5 Strategies for Optimal Query Performance in a Lakehouse.

    One aspect that is particularly tricky to master when dealing with large amounts of data is "performance". Querying larger data volumes demands tuning & optimization, because queries that are fast with a performant engine today may not stay fast over time. Over time:
    - Your query patterns may change
    - New query patterns may be added
    - Queries may slow down because of unorganized/small files

    It is therefore important to be aware of techniques that can help prune & organize data effectively: the less data you read, the faster your queries can be. Here are 5 methods to consider for performance in a lakehouse with table formats like Apache Hudi, Apache Iceberg & Delta Lake (a write-side sketch follows below).

    ✅ Partitioning: Organize and store related data together when writing to storage.
    ✅ File Sizing: Combine smaller files into a larger, more optimal size so queries read fewer files.
    ✅ Linear Sorting: Arrange data in order based on a range of values to improve query efficiency.
    ✅ Metadata Index: Use statistics from file formats, Bloom filters, etc. to enhance data retrieval with structures like indexes (collated into a metadata table). Iceberg & Delta have no indexes, but file-level stats are aggregated in the manifest files/transaction log.
    ✅ Z-order/Hilbert curves: Position related data points closely within storage to enhance read operations involving multiple dimensions.

    #dataengineering #softwareengineering
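
    As a rough illustration of the partitioning and file-sizing points (sorting and Z-ordering are covered by the clustering sketch further down this feed), here is a hedged Hudi write-option sketch. The size thresholds are illustrative values rather than recommendations, and the field names are placeholders:

    file_layout_options = {
        "hoodie.table.name": "events",                              # placeholder
        "hoodie.datasource.write.recordkey.field": "event_id",
        "hoodie.datasource.write.precombine.field": "event_ts",
        # Partitioning: co-locate related data under one partition path
        "hoodie.datasource.write.partitionpath.field": "event_date",
        # File sizing: bin-pack small files toward a target size on write
        "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),     # ~120 MB target
        "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # files under ~100 MB get packed
    }
    # Passed to a Hudi write via .options(**file_layout_options), as in the earlier sketches.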

  • Apache Hudi reposted this

    View organization page for Presto Foundation

    1,899 followers

    Awesome to see what Apache Hudi with #Presto can do 👏

    View profile for Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

    Query Optimization with Clustering.

    Some time back, in a closed-group workshop, I presented how the 'clustering' table service in Apache Hudi makes a huge impact on overall query performance. To highlight the difference, I ran the same query using #Presto once before clustering & once after, on a 1 TB TPC-DS dataset.

    Query:
    SELECT i_item_id,
           AVG(ss_quantity) agg1,
           AVG(ss_list_price) agg2,
           AVG(ss_coupon_amt) agg3,
           AVG(ss_sales_price) agg4
    FROM one_tb.dip_1tb_clustered_workshop
    WHERE ca_location_type = 'condo'
    GROUP BY i_item_id

    As you can see from the Presto UI, the unclustered table scanned 2.62 billion rows, whereas the clustered one scanned just 847 million records. That's a significant difference 🚀

    Now imagine a production environment, where multiple analysts & engineers run similar queries on certain predicates over & over. For such cases, clustering is extremely beneficial & therefore important to run. One of the cool things about Hudi is that there are different deployment modes for clustering (config sketch below):
    ✅ The simplest one is INLINE, meaning clustering happens sequentially after each ingestion batch; suitable for batch jobs or when ingestion latency is not a concern.
    ✅ ASYNC is another, more flexible mode, where clustering happens asynchronously in the background alongside ingestion, so ingestion latency is not impacted.

    Detailed reading in comments. #dataengineering #softwareengineering
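
    A hedged sketch of those two clustering deployment modes expressed as Hudi write options. Property names follow the Hudi clustering docs but should be verified for the version in use; the sort columns simply mirror the predicate/grouping columns of the query above, and the size thresholds are illustrative:

    # Inline clustering: runs synchronously after every N commits of the ingestion job
    inline_clustering = {
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.inline.max.commits": "4",
        "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
        "hoodie.clustering.plan.strategy.sort.columns": "ca_location_type,i_item_id",
    }

    # Async clustering: a background/separate process rewrites files, so ingestion
    # latency is not impacted
    async_clustering = {
        "hoodie.clustering.async.enabled": "true",
        "hoodie.clustering.async.max.commits": "4",
        "hoodie.clustering.plan.strategy.sort.columns": "ca_location_type,i_item_id",
    }
    # Either dict is merged into the usual Hudi write options via .options(**...).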

  • Apache Hudi reposted this

    View organization page for Unity Catalog

    3,124 followers

    REMINDER 🚨 The next #UnityCatalog Community Meetup is tomorrow, July 11th at 8:30AM PST / 11:30AM EST!

    Agenda:
    🌟 Opening & Welcome
    🌟 Apache Hudi/UC demo with Kyle Weller from Onehouse
    🌟 Localhost setup with Matthew Powers, CFA from Databricks
    🌟 Roadmap
    🌟 Q&A, discussion

    Register here ➡ https://lnkd.in/eAhFjG-r

    #opensource #linuxfoundation #oss #lfaidata

    LFX Meetings

    zoom-lfx.platform.linuxfoundation.org

  • Apache Hudi reposted this

    The Amazon Web Services (AWS) team shows how to get started with Apache XTable on AWS and how to use it in a batch pipeline orchestrated with Amazon Managed Workflows for Apache Airflow (MWAA). This is an amazing hands-on blog for anyone looking at orchestrating metadata translation between lakehouse table formats (Apache Hudi, Apache Iceberg & Delta Lake) with XTable. Blog: https://lnkd.in/daGfR9mx Credits: Matthias Rudolph, Stephen Said #dataengineering #softwareengineering

    Run Apache XTable on Amazon MWAA to translate open table formats | Amazon Web Services

    aws.amazon.com
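
    A hypothetical MWAA DAG sketch of the batch-orchestration idea (not the AWS blog's actual code): a scheduled task that invokes Apache XTable to translate Hudi table metadata into Iceberg/Delta metadata. The jar path, dataset-config path, and schedule are placeholders; see the blog for the real packaging and config format:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="xtable_metadata_translation",
        start_date=datetime(2024, 7, 1),
        schedule_interval="@hourly",   # e.g. translate after each ingestion window
        catchup=False,
    ) as dag:
        run_xtable = BashOperator(
            task_id="run_xtable_sync",
            # Placeholder command: run the XTable utilities jar against a dataset
            # config listing the source (Hudi) table paths and the target formats.
            bash_command=(
                "java -jar /usr/local/airflow/jars/xtable-utilities-bundled.jar "
                "--datasetConfig /usr/local/airflow/dags/xtable_config.yaml"
            ),
        )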

  • View organization page for Apache Hudi

    7,160 followers

    Apache Hudi 0.15.0 introduces two new abstractions. These abstractions enhance integration with query engines like Trino, which uses its own native file system APIs.

    1. HoodieStorage Abstraction
    - Hadoop-independent file system & storage APIs for readers/writers
    - extensible with the Hadoop FileSystem and TrinoFileSystem (for the Trino-Hudi connector)

    2. HoodieIOFactory Abstraction
    - provides APIs to create readers/writers for I/O without needing to depend on Hadoop classes

    ✅ With this new approach, the `hudi-common` module (the core implementation of the Hudi spec) & the core reader logic are Hadoop-independent. This simplifies integration with engines like Trino, making it easier and more maintainable by allowing custom storage and I/O factory implementations to be plugged in.

    Read the details in the release notes - https://lnkd.in/dnybdngQ

    #dataengineering #softwareengineering
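
    To make the decoupling concrete, here is a purely illustrative sketch of the pattern (in Python, not Hudi's actual Java interfaces): storage access and reader creation sit behind small abstractions, so an engine like Trino can plug in its own file system without pulling in Hadoop classes. All names below are hypothetical:

    from abc import ABC, abstractmethod

    class StorageAbstraction(ABC):
        """Engine-agnostic file access (stand-in for the HoodieStorage idea)."""
        @abstractmethod
        def open(self, path: str) -> bytes: ...
        @abstractmethod
        def list(self, prefix: str) -> list[str]: ...

    class TrinoBackedStorage(StorageAbstraction):
        """Hypothetical adapter that would delegate to an engine's native file system."""
        def open(self, path: str) -> bytes:
            raise NotImplementedError("delegate to the engine's file system APIs")
        def list(self, prefix: str) -> list[str]:
            raise NotImplementedError

    class IOFactoryAbstraction:
        """Stand-in for the HoodieIOFactory idea: builds readers from a storage impl."""
        def __init__(self, storage: StorageAbstraction):
            self.storage = storage
        def reader_for(self, path: str):
            # The core reader logic only ever sees the abstraction, never Hadoop classes.
            return self.storage.open(path)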

