Hive Integration: HBase and RCFile
John Sichi and Yongqiang He, Facebook
Session Agenda
  - HBase Integration (John Sichi)
  - RCFile Integration (Yongqiang He)
HBase: Facebook Warehouse Use Case
Goal: reduce latency on dimension data availability.
  - Dimension data: HBase, updated continuously
  - Fact data: partitioned RCFiles, loaded periodically
  - Hive queries run over both (sketched below)
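A minimal HiveQL sketch of this layout; the table and column names here are hypothetical (not from the talk), and the STORED AS RCFILE shorthand is assumed to be available:

  -- Fact data: a partitioned table stored as RCFile
  CREATE TABLE page_views(userid INT, url STRING)
  PARTITIONED BY (ds STRING)
  STORED AS RCFILE;

  -- Dimension data lives in an HBase-backed table (DDL on the next slide)
  -- and is updated continuously; Hive queries can then join the two:
  SELECT u.name, COUNT(1)
  FROM page_views pv JOIN users u ON (pv.userid = u.userid)
  WHERE pv.ds = '2010-06-28'
  GROUP BY u.name;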
HBase: Storage Handler

CREATE TABLE users(userid INT, name STRING, email STRING, notes STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,small:email,large:notes"
)
TBLPROPERTIES ("hbase.table.name" = "user_list");

Supports INSERT, SELECT, JOIN, GROUP BY, UNION, etc. (usage sketch below).
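A hedged usage sketch against that table (the staging table name is hypothetical); writes go through the storage handler as row-level puts, and reads scan the underlying HBase table:

  -- Periodic load from a native Hive staging table into HBase
  INSERT OVERWRITE TABLE users
  SELECT userid, name, email, notes FROM users_staging;

  -- Reads work like any other Hive table
  SELECT name, email FROM users WHERE userid = 12345;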
HBase: Integration Status
Testing at scale on a 20-node test cluster:
  - Bulk-loaded 6TB of gzip-compressed data from Hive into HBase in about 30 hours
  - Incremental-loaded from Hive into HBase at 30GB/hr (with write-ahead logging disabled)
  - Full-table scan queries: currently 5x slower than against native Hive tables (no tuning or optimization yet)
HBase: Integration Roadmap
  - Retest against HBase trunk with larger (30TB) data
  - Try out new features for accelerating incremental load: bulk load into a table with existing data, multiputs, deferred logging
  - Support for "virtual partitions" based on timestamps
  - Support for deletion
  - Push down filters
  - Index join? Optimize scans?
RCFile: Why Columnar Storage
  - Better compression: lightweight schemes such as RLE and bitmap encoding
  - Savings across CPU, memory, and storage
  - Columnar operators: cache-conscious execution (as in MonetDB)
RCFile: Why RCFile
  - Huge data volumes: reduce the storage space required
  - Ad-hoc workloads: a trade-off between storage space and speed (data performance)
  - Can we get both with no application changes? Reduce storage space and accelerate performance for arbitrary applications
RCFile: Pros
  - Works with column pruning: only the needed columns are touched at runtime
  - Lazy decompression: in the query below, only col1 and col2 are touched, and col2 is decompressed only when a block contains a col1 value greater than 30
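The slide's example restated as a runnable HiveQL sketch (assuming tbl_col_10 is an RCFile table whose columns include col1 and col2):

  -- Column pruning: only the col1 and col2 column groups are read.
  -- Lazy decompression: col2 blocks are decompressed only where the
  -- col1 predicate matched at least one value.
  SELECT col1, col2
  FROM tbl_col_10
  WHERE col1 > 30;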
RCFile: Cons
  - Row construction is the main overhead: each column's data is stored separately, possibly in a different order, so rows must be reassembled in memory by the RCFile reader
  - This can be really painful; there is a lot of room to improve here (contrast the two queries below)
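To illustrate the cost, assuming the same hypothetical RCFile table as above, a wide projection pays the full row-construction price while a narrow one avoids most of it:

  -- Reads and reassembles every column group in each row group: expensive
  SELECT * FROM tbl_col_10;

  -- Touches a single column group, so almost no row construction: cheap
  SELECT col1 FROM tbl_col_10;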
RCFile: Facebook Deployment
  - Default file format in the Facebook cluster
  - 20% space savings on average
  - We are transforming old data to the new format (conversion sketch below)
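A hedged sketch of that conversion in HiveQL; the table names are hypothetical, and the hive.default.fileformat setting and the STORED AS RCFILE shorthand are assumed to exist in the deployed Hive version:

  -- Make RCFile the default format for newly created tables
  SET hive.default.fileformat=RCFile;

  -- Rewrite an existing partition's data into the new format
  CREATE TABLE clicks_rc(userid INT, url STRING)
  PARTITIONED BY (ds STRING)
  STORED AS RCFILE;

  INSERT OVERWRITE TABLE clicks_rc PARTITION (ds='2010-06-28')
  SELECT userid, url FROM clicks WHERE ds = '2010-06-28';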
RCFile: Future Work
  - Support built-in indexing (e.g., Bloom filters)
  - More cache-conscious columnar operators
  - Push predicates down to the file reader
Questions?
[email_address]
[email_address]
