Parquet
Columnar storage for the people
Julien Le Dem (@J_), processing tools lead, analytics infrastructure at Twitter
Nong Li (nong@cloudera.com), software engineer, Cloudera Impala
http://parquet.io
Outline
• Context from various companies
• Early results
• Format deep-dive
Twitter Context
• Twitter's data:
  • 200M+ monthly active users generating and consuming 400M+ tweets a day
  • 100TB+ a day of compressed data
  • Scale is huge: instrumentation, user graph, derived data, ...
• Analytics infrastructure:
  • Several 1K+ node Hadoop clusters
  • Log collection pipeline
  • Processing tools
[Slide image: "The Parquet Planers" by Gustave Caillebotte]
Twitter's use case
• Logs available on HDFS
• Thrift is used to store logs
• Example: one schema has 87 columns, up to 7 levels of nesting.
struct LogEvent {
  1: optional logbase.LogBase log_base
  2: optional i64 event_value
  3: optional string context
  4: optional string referring_event
  ...
  18: optional EventNamespace event_namespace
  19: optional list<Item> items
  20: optional map<AssociationType,Association> associations
  21: optional MobileDetails mobile_details
  22: optional WidgetDetails widget_details
  23: optional map<ExternalService,string> external_ids
}

struct LogBase {
  1: string transaction_id,
  2: string ip_address,
  ...
  15: optional string country,
  16: optional string pid,
}
Goal
To have state-of-the-art columnar storage available across the Hadoop platform:
• Hadoop is very reliable for big, long-running queries, but it is also IO heavy.
• Incrementally take advantage of columnar storage in existing frameworks.
• Not tied to any framework in particular.
Columnar Storage
• Limits IO to only the data that is needed.
• Saves space: a columnar layout compresses better.
• Enables better scans: load only the columns that need to be accessed.
• Enables vectorized execution engines.
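As an illustration of why this matters, here is a minimal, self-contained Java sketch (hypothetical class and field names, not from the original deck) contrasting a row layout with a column layout: summing one field over the columnar arrays touches only that field's contiguous values, which is what enables the reduced IO and vectorized execution listed above.

import java.util.Arrays;
import java.util.List;

public class ColumnarVsRowDemo {
    // Row-oriented representation: each record keeps all of its fields together.
    static class Event {
        long userId;
        String country;
        long eventValue;
        Event(long userId, String country, long eventValue) {
            this.userId = userId;
            this.country = country;
            this.eventValue = eventValue;
        }
    }

    // Column-oriented representation: each field is stored contiguously.
    static class EventColumns {
        long[] userId;
        String[] country;
        long[] eventValue;
    }

    // Row scan: touches every field of every record just to sum one field.
    static long sumRowWise(List<Event> rows) {
        long sum = 0;
        for (Event e : rows) sum += e.eventValue;
        return sum;
    }

    // Columnar scan: reads only the eventValue column; the other columns are
    // never touched, and the tight loop over a primitive array is easy to vectorize.
    static long sumColumnWise(EventColumns cols) {
        long sum = 0;
        for (long v : cols.eventValue) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Event> rows = Arrays.asList(
                new Event(1, "FR", 10), new Event(2, "US", 20), new Event(3, "JP", 30));
        EventColumns cols = new EventColumns();
        cols.userId = new long[] {1, 2, 3};
        cols.country = new String[] {"FR", "US", "JP"};
        cols.eventValue = new long[] {10, 20, 30};
        System.out.println(sumRowWise(rows));    // 60
        System.out.println(sumColumnWise(cols)); // 60
    }
}

Parquet applies the same idea at the file level: each column's values are stored contiguously on disk, so a scan can skip the columns it does not need.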
Collaboration between Twitter and Cloudera:
• Common file format definition:
  • Language independent
  • Formally specified
• Implementation in Java for Map/Reduce:
  • https://github.com/Parquet/parquet-mr
• C++ and code generation in Cloudera Impala:
  • https://github.com/cloudera/impala
Results in Impala: TPC-H lineitem table @ 1TB scale factor
[Chart: data size, in GB]
Impala query times on the TPC-H lineitem table
[Chart: query times in seconds; data size in GB]
Criteo: The Context
• ~20B Events per day
• ~60 Columns per Log
• Heavy analytic workload
• BI Analysts using Hive and RCFile
• Frequent Schema Modifications
⇒ A perfect use case for Parquet + Hive!
Parquet + Hive: Basic Reqs
• MapRed Compatibility due to Hive
• Correctly Handle Different Schemas in Parquet Files
• Read Only The Columns Used by Query
• Interoperability with Other Execution Engines (e.g. Pig, Impala, etc.)
• Optimize Amount of Data Read by each Mapper
Parquet + Hive: Early User Experience
Relative performance of Hive+Parquet vs. ORC and RCFile (LZO compressed):
[Chart: relative query performance]
Twitter: Initial results
Data converted: similar to access logs, 30 columns.
Original format: Thrift binary in block-compressed files.
Space saving: 28% using the same compression algorithm.
Scan + assembly time compared to the original:
• One column: 10%
• All columns: 114%
[Chart: space used, Thrift vs. Parquet]
[Chart: scan time by number of columns read (1 to 30), Thrift vs. Parquet]
Additional gains with dictionary encoding
13 out of the 30 columns are suitable for dictionary encoding: they represent 27% of the raw data but only 3% of the compressed data.
Space saving: another 52% using the same compression algorithm (on top of the original columnar storage gains).
Scan + assembly time compared to plain Parquet:
• All 13 columns: 48% (of the already faster columnar scan)
[Chart: space used, Parquet compressed (LZO) vs. Parquet dictionary uncompressed vs. Parquet dictionary compressed (LZO)]
[Chart: scan time by number of columns read (1 to 13), Parquet compressed vs. Parquet dictionary compressed]
Format
[Diagram: a file contains multiple row groups; each row group holds one column chunk per column (column a, b, c), and each column chunk is divided into pages (page 0, 1, 2)]
• Row group: a group of rows in columnar format.
  • Max size buffered in memory while writing.
  • One (or more) per split while reading.
  • Roughly: 50 MB < row group < 1 GB
• Column chunk: the data for one column in a row group.
  • Column chunks can be read independently for efficient scans.
• Page: unit of access in a column chunk.
  • Should be big enough for compression to be efficient.
  • Minimum size to read to access a single record (when index pages are available).
  • Roughly: 8 KB < page < 1 MB
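As a rough sketch of how these two knobs show up when writing from Java (not from the original deck), the following assumes parquet-mr's Avro binding and its AvroParquetWriter builder API; builder method names and defaults vary across parquet-mr versions, and the schema, file path, and sizes are made up for the example.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriteWithSizes {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaBuilder.record("Event").fields()
                .requiredLong("event_value")
                .optionalString("context")
                .endRecord();

        // Row group and page sizes are writer settings; the values here follow
        // the rough guidance on the slide (tens of MB per row group, ~1 MB pages).
        try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("events.parquet"))
                             .withSchema(schema)
                             .withRowGroupSize(128 * 1024 * 1024) // buffered in memory while writing
                             .withPageSize(1024 * 1024)           // unit of access within a column chunk
                             .withCompressionCodec(CompressionCodecName.SNAPPY)
                             .build()) {
            GenericRecord r = new GenericData.Record(schema);
            r.put("event_value", 42L);
            r.put("context", "home_timeline");
            writer.write(r);
        }
    }
}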
Format
Layout: row groups in columnar format. A footer contains the column chunk offsets and the schema.
Language independent: well-defined format, with Hadoop and Cloudera Impala support.
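To make the footer concrete, here is a small sketch (not from the original deck) that prints the schema plus the row groups and column chunk offsets recorded in the footer; it assumes parquet-mr's ParquetFileReader.readFooter helper and the org.apache.parquet metadata classes, which may be deprecated or packaged differently depending on the version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class InspectFooter {
    public static void main(String[] args) throws Exception {
        // The footer holds the schema plus, for every row group, the offset and
        // size of each column chunk, so a reader can seek directly to the
        // columns a query needs.
        ParquetMetadata footer =
                ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));

        System.out.println(footer.getFileMetaData().getSchema());
        for (BlockMetaData rowGroup : footer.getBlocks()) {
            System.out.println("row group: " + rowGroup.getRowCount() + " rows");
            for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
                System.out.println("  column " + chunk.getPath()
                        + " starts at " + chunk.getFirstDataPageOffset()
                        + ", " + chunk.getTotalSize() + " bytes");
            }
        }
    }
}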
Nested record shredding/assembly
• Algorithm borrowed from Google Dremel's column IO.
• Each cell is encoded as a triplet: repetition level, definition level, value.
• Level values are bounded by the depth of the schema and can be stored in a compact form.

Schema:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
}

Record:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80

Column          Max rep. level   Max def. level
DocId           0                0
Links.Backward  1                2
Links.Forward   1                2

Column          Value   R   D
DocId           20      0   0
Links.Backward  10      0   2
Links.Backward  30      1   2
Links.Forward   80      0   2
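A self-contained sketch of the shredding side for the Links.Backward column above (hypothetical Document/Links classes, not Parquet's actual writer, and not from the original deck): with Links optional and Backward repeated, the max definition level is 2 and the max repetition level is 1, and the code reproduces the (R, D) values in the table.

import java.util.List;

public class ShredLinksBackward {
    // A stripped-down version of the Document schema above.
    static class Links { List<Long> backward; List<Long> forward; }
    static class Document { long docId; Links links; }

    // Emit (repetition level, definition level, value) triplets for the
    // Links.Backward column. Max def level = 2, max rep level = 1.
    static void shred(Document doc) {
        if (doc.links == null) {
            emit(0, 0, null);                        // Links missing
        } else if (doc.links.backward == null || doc.links.backward.isEmpty()) {
            emit(0, 1, null);                        // Links present, Backward empty
        } else {
            emit(0, 2, doc.links.backward.get(0));   // first value: new record => R = 0
            for (int i = 1; i < doc.links.backward.size(); i++) {
                emit(1, 2, doc.links.backward.get(i)); // repeated within the same record => R = 1
            }
        }
    }

    static void emit(int r, int d, Long value) {
        System.out.println("R=" + r + " D=" + d + " value=" + value);
    }

    public static void main(String[] args) {
        Document doc = new Document();
        doc.docId = 20;
        doc.links = new Links();
        doc.links.backward = List.of(10L, 30L);
        doc.links.forward = List.of(80L);
        shred(doc);
        // Prints R=0 D=2 value=10 and R=1 D=2 value=30,
        // matching the Links.Backward rows in the table above.
    }
}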
Differences between Parquet and ORC nesting support
Parquet:
• Repetition/definition levels capture the structure => one column per leaf in the schema.
• Array<int> is one column.
• Nullity/repetition of an inner node is stored in each of its children => one column independently of nesting, with some redundancy.
ORC:
• An extra column for each map or list records its size => one column per node in the schema.
• Array<int> is two columns: array size and content.
• An extra column per nesting level.
Iteration on fully assembled records
• To integrate with existing row-based engines (Hive, Pig, M/R).
• Aware of dictionary encoding: enables optimizations.
• Assembles a projection for any subset of the columns: only those are loaded from disk.
[Diagram: the same Document record assembled from different column projections: DocId only, Links.Backward only, Links.Forward only, and Backward + Forward together]
Iteration on columns
• To implement a column-based execution engine.
• Iteration on triplets: repetition level, definition level, value.
• Repetition level = 0 indicates a new record.
• Encoded or decoded values: computing aggregations on integers is faster than on strings.
[Diagram: a column read as a stream of (R, D, V) triplets with values A, B, C, D over rows 0-3; R = 1 means the value belongs to the same row as the previous one, and D below the max definition level (here D < 1) means null]
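A self-contained sketch of consuming such a triplet stream (hypothetical Triplet type and values, max definition level of 1 as in the diagram, not from the original deck): repetition level 0 starts a new record, and a definition level below the maximum means the value is null for that record.

import java.util.List;

public class TripletScan {
    // One cell of a column: repetition level, definition level, value.
    record Triplet(int r, int d, Integer value) {}

    // Walk the column and report record boundaries and nulls.
    // maxDef is the column's maximum definition level (1 in the diagram above).
    static void scan(List<Triplet> column, int maxDef) {
        int row = -1;
        for (Triplet t : column) {
            if (t.r() == 0) row++;                 // repetition level 0 => new record
            if (t.d() < maxDef) {
                System.out.println("row " + row + ": null");
            } else {
                System.out.println("row " + row + ": " + t.value());
            }
        }
    }

    public static void main(String[] args) {
        // Aggregations can run directly over the (possibly dictionary-encoded)
        // integer values without materializing full records.
        scan(List.of(
                new Triplet(0, 1, 7),    // row 0: 7
                new Triplet(0, 0, null), // row 1: null
                new Triplet(0, 1, 3),    // row 2: 3
                new Triplet(1, 1, 4)     // row 2 again: repeated value 4
        ), 1);
    }
}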
APIs
• Schema definition and record materialization:
  • Hadoop does not have a notion of schema; however, Impala, Pig, Hive, Thrift, Avro, and Protocol Buffers do.
  • Event-based, SAX-style record materialization layer. No double conversion.
• Integration with existing type systems and processing frameworks:
  • Impala
  • Pig
  • Thrift and Scrooge for M/R, Cascading, and Scalding
  • Cascading tuples
  • Avro
  • Hive
Encodings
• Bit packing:
  • Small integers encoded in the minimum number of bits required.
  • Useful for repetition levels, definition levels, and dictionary keys.
  • Example: the values 1 3 2 0 0 2 2 0 packed as two bits each: 01|11|10|00 00|10|10|00
• Run-length encoding:
  • Used in combination with bit packing.
  • Cheap compression.
  • Works well for the definition levels of sparse columns.
  • Example: the run 1 1 1 1 1 1 1 1 encoded as the pair (8, 1).
• Dictionary encoding:
  • Useful for columns with few (< 50,000) distinct values.
• Extensible:
  • Defining new encodings is supported by the format.
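A self-contained sketch of the two examples above (not from the original deck), packing small integers into two bits each and run-length encoding a run of identical values; it illustrates the idea, not Parquet's actual encoder implementation.

public class EncodingSketch {
    // Pack values that fit in `bitWidth` bits into a bit string,
    // e.g. {1,3,2,0,0,2,2,0} at 2 bits -> "01|11|10|00 00|10|10|00".
    static String bitPack(int[] values, int bitWidth) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            String bits = Integer.toBinaryString(values[i]);
            while (bits.length() < bitWidth) bits = "0" + bits;
            out.append(bits);
            if (i < values.length - 1) out.append((i + 1) % 4 == 0 ? " " : "|");
        }
        return out.toString();
    }

    // Run-length encode a single run of identical values: {1,1,1,1,1,1,1,1} -> "(8, 1)".
    static String rleRun(int[] run) {
        return "(" + run.length + ", " + run[0] + ")";
    }

    public static void main(String[] args) {
        System.out.println(bitPack(new int[] {1, 3, 2, 0, 0, 2, 2, 0}, 2)); // 01|11|10|00 00|10|10|00
        System.out.println(rleRun(new int[] {1, 1, 1, 1, 1, 1, 1, 1}));     // (8, 1)
    }
}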
Contributors
Main contributors:
• Julien Le Dem (Twitter): Format, Core, Pig, Thrift integration, Encodings
• Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): Format, Impala
• Jonathan Coveney, Alex Levenson, Aniket Mokashi (Twitter): Encodings
• Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
• Dmitriy Ryaboy (Twitter): Format, Thrift and Scrooge Cascading integration
• Tom White (Cloudera): Avro integration
• Avi Bryant (Stripe): Cascading tuples integration
Future
• Indices for random access (lookup by ID).
• More encodings.
• Extensibility
• Statistics pages (max, min, ...)
How to contribute
Questions? Ideas?
Contribute at: github.com/Parquet
Come talk to us:
Twitter booth #26
Cloudera booth #45
http://parquet.io
