Parquet Hadoop Summit 2013
- 1. Parquet
Columnar storage for the people
Julien Le Dem (@J_), processing tools lead, analytics infrastructure at Twitter
Nong Li (nong@cloudera.com), software engineer, Cloudera Impala
http://parquet.io
- 2. Outline
• Context from various companies
• Early results
• Format deep-dive
http://parquet.io
- 3. Twitter Context
• Twitter’s data
• 200M+ monthly active users generating and consuming 400M+ tweets a day.
• 100TB+ a day of compressed data
• Scale is huge: Instrumentation, User graph, Derived data, ...
• Analytics infrastructure:
• Several 1K+ node Hadoop clusters
• Log collection pipeline
• Processing tools
http://parquet.io
[Image: The Parquet Planers, Gustave Caillebotte]
- 4. Twitter's use case
• Logs available on HDFS
• Thrift to store logs
• Example: one schema has 87 columns and up to 7 levels of nesting.
struct LogEvent {
1: optional logbase.LogBase log_base
2: optional i64 event_value
3: optional string context
4: optional string referring_event
...
18: optional EventNamespace event_namespace
19: optional list<Item> items
20: optional map<AssociationType,Association> associations
21: optional MobileDetails mobile_details
22: optional WidgetDetails widget_details
23: optional map<ExternalService,string> external_ids
}
struct LogBase {
1: string transaction_id,
2: string ip_address,
...
15: optional string country,
16: optional string pid,
}
http://parquet.io
- 5. Goal
To have state-of-the-art columnar storage available across the Hadoop platform.
• Hadoop is very reliable for big, long-running queries, but it is also IO-heavy.
• Incrementally take advantage of columnar storage in existing frameworks.
• Not tied to any framework in particular.
http://parquet.io
- 6. Columnar Storage
• Limits the IO to only the data that is needed.
• Saves space: columnar layout compresses better.
• Enables better scans: load only the columns that need to be accessed.
• Enables vectorized execution engines.
http://parquet.io
- 7. Collaboration between Twitter and Cloudera:
• Common file format definition:
• Language independent
• Formally specified.
• Implementation in Java for Map/Reduce:
• https://github.com/Parquet/parquet-mr
• C++ and code generation in Cloudera Impala:
• https://github.com/cloudera/impala
http://parquet.io
- 10. Criteo: The Context
• ~20B Events per day
• ~60 Columns per Log
• Heavy analytic workload
• BI Analysts using Hive and RCFile
• Frequent Schema Modifications
=> A perfect use case for Parquet + Hive!
http://parquet.io
- 11. Parquet + Hive: Basic Requirements
• MapReduce compatibility (required by Hive)
• Correctly handle different schemas in Parquet files
• Read only the columns used by the query
• Interoperability with other execution engines (e.g. Pig, Impala)
• Optimize the amount of data read by each mapper
http://parquet.io
- 12. Parquet + Hive: Early User Experience
[Chart: relative performance of Hive + Parquet vs. ORC and RCFile, LZO compressed]
http://parquet.io
- 13. Twitter: Initial results
Space saving: 28% using the same compression algorithm.
Scan + assembly time compared to the original:
• One column: 10%
• All columns: 114%
Data converted: similar to access logs, 30 columns.
Original format: Thrift binary in block-compressed files.
http://parquet.io
[Charts: Space, Thrift vs. Parquet; Scan time for 1 to 30 columns, Thrift vs. Parquet]
- 14. Additional gains with dictionary encoding
13 out of the 30 columns are suitable for dictionary encoding:
they represent 27% of the raw data but only 3% of the compressed data.
Space saving: another 52% using the same compression algorithm
(on top of the original columnar storage gains).
Scan + assembly time compared to plain Parquet:
• All 13 columns: 48% (of the already faster columnar scan)
http://parquet.io
[Charts: Space per column, Parquet compressed (LZO) vs. Parquet dictionary uncompressed vs. Parquet dictionary compressed (LZO); Scan time for 1 to 13 columns, Parquet compressed vs. Parquet dictionary compressed, relative to Thrift]
- 15. Format
[Diagram: a row group contains one column chunk per column (Column a, b, c); each column chunk is divided into pages (Page 0, 1, 2)]
• Row group: A group of rows in columnar format.
• Max size buffered in memory while writing.
• One (or more) per split while reading.
• roughly: 50MB < row group < 1 GB
• Column chunk: The data for one column in a row group.
• Column chunks can be read independently for efficient scans.
• Page: Unit of access in a column chunk.
• Should be big enough for compression to be efficient.
• Minimum size to read to access a single record (when index pages are available).
• roughly: 8KB < page < 1MB (both sizes are configurable when writing; see the sketch below)
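As an illustration of how these sizes are set in practice, here is a minimal sketch (Java, parquet-mr) of configuring the row group ("block") and page sizes when writing from a MapReduce job. The setBlockSize/setPageSize helpers and the parquet.block.size / parquet.page.size property names are assumptions about the parquet-mr Hadoop API; verify them against the version you use.

import org.apache.hadoop.mapreduce.Job;
import parquet.hadoop.ParquetOutputFormat;

public class ParquetSizeConfig {
  public static void configure(Job job) {
    // Row group: buffered in memory while writing, one or more per split while reading.
    ParquetOutputFormat.setBlockSize(job, 256 * 1024 * 1024); // 256 MB row groups
    // Page: unit of access within a column chunk, big enough to compress well.
    ParquetOutputFormat.setPageSize(job, 1024 * 1024);        // 1 MB pages
    // Equivalent configuration properties (assumed names):
    //   parquet.block.size, parquet.page.size
  }
}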
http://parquet.io
- 16. Format
Layout: row groups in columnar format. A footer contains the column chunk offsets and the schema.
Language independent: well-defined format, supported by Hadoop and Cloudera Impala.
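To make the layout concrete, a minimal sketch (Java, parquet-mr) of reading the footer, which holds the schema and the per-row-group column chunk metadata. ParquetFileReader.readFooter and the metadata classes are assumed from the parquet-mr API of the time; treat this as illustrative, not authoritative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.metadata.BlockMetaData;
import parquet.hadoop.metadata.ParquetMetadata;

public class FooterDump {
  public static void main(String[] args) throws Exception {
    ParquetMetadata footer =
        ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
    // The schema is stored once, in the footer.
    System.out.println(footer.getFileMetaData().getSchema());
    // Each row group lists its column chunks and where they start in the file.
    for (BlockMetaData rowGroup : footer.getBlocks()) {
      System.out.println("row group: " + rowGroup.getRowCount() + " rows");
    }
  }
}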
http://parquet.io
- 17. Nested record shredding/assembly
• Algorithm borrowed from Google Dremel's column IO
• Each cell is encoded as a triplet: repetition level, definition level, value.
• Level values are bounded by the depth of the schema and stored in a compact form.
Columns          Max rep. level   Max def. level
DocId            0                0
Links.Backward   1                2
Links.Forward    1                2

Column           Value   R   D
DocId            20      0   0
Links.Backward   10      0   2
Links.Backward   30      1   2
Links.Forward    80      0   2

Schema:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
}

Record:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80

[Diagram: the Document schema tree (DocId; Links with Backward and Forward) and the example record with values 20, 10, 30, 80]
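A minimal sketch (Java, parquet-mr) tying the schema above to the level table: the schema string is parsed and each leaf column is asked for its maximum repetition and definition levels. MessageTypeParser and the getMax*Level methods are assumed from the parquet-mr schema API; check them against your version.

import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

public class LevelsDemo {
  public static void main(String[] args) {
    MessageType schema = MessageTypeParser.parseMessageType(
        "message Document {\n" +
        "  required int64 DocId;\n" +
        "  optional group Links {\n" +
        "    repeated int64 Backward;\n" +
        "    repeated int64 Forward;\n" +
        "  }\n" +
        "}");
    // Expected from the table above: DocId -> (0, 0), Links.Backward -> (1, 2)
    System.out.println(schema.getMaxRepetitionLevel("DocId"));            // 0
    System.out.println(schema.getMaxDefinitionLevel("DocId"));            // 0
    System.out.println(schema.getMaxRepetitionLevel("Links", "Backward")); // 1
    System.out.println(schema.getMaxDefinitionLevel("Links", "Backward")); // 2
  }
}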
http://parquet.io
- 18. Differences between Parquet and ORC nesting support
Parquet:
• Repetition/Definition levels capture the structure.
=> one column per Leaf in the schema.
• Array<int> is one column.
• Nullity/repetition of an inner node is stored in each of its children
• => One column independently of nesting with some redundancy.
ORC:
• An extra column for each Map or List to record their size.
=> one column per Node in the schema.
• Array<int> is two columns: array size and content.
• => An extra column per nesting level.
[Diagram: Document schema tree with leaf columns DocId, Links.Backward, Links.Forward]
http://parquet.io
- 19. Iteration on fully assembled records
• To integrate with existing row-based engines (Hive, Pig, M/R); see the read-loop sketch at the end of this slide.
• Aware of dictionary encoding: enables optimizations.
• Assembles a projection for any subset of the columns: only those are loaded from disk.
http://parquet.io
[Diagram: the same record assembled under different projections (DocId only: 20; Links.Backward only: 10, 30; Links.Forward only: 80; Backward and Forward: 10, 30, 80), and row-wise iteration over records a1..a3, b1..b3]
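A minimal sketch (Java, parquet-mr) of the read loop over fully assembled records, using the example Group object model. The ParquetReader and GroupReadSupport names are assumptions about the parquet-mr API of the time; the point is the shape of the iteration, with only the projected columns being read from disk.

import org.apache.hadoop.fs.Path;
import parquet.example.data.Group;
import parquet.hadoop.ParquetReader;
import parquet.hadoop.example.GroupReadSupport;

public class RecordScan {
  public static void main(String[] args) throws Exception {
    ParquetReader<Group> reader =
        new ParquetReader<Group>(new Path(args[0]), new GroupReadSupport());
    Group record;
    // Each call assembles the next record from the projected columns only.
    while ((record = reader.read()) != null) {
      System.out.println(record);
    }
    reader.close();
  }
}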
- 20. Iteration on columns
• To implement a column-based execution engine; see the aggregation sketch at the end of this slide.
• Iteration on triplets: repetition level, definition level, value.
• Repetition level = 0 indicates a new record.
• Encoded or decoded values: computing aggregations on integers is faster than on strings.
http://parquet.io
[Example: a column read as triplets (R, D, V) with values A, B, C, D spread over rows 0 to 3; R = 1 means the value belongs to the same row as the previous triplet, and D below the max definition level (here 1) means null]
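An illustrative sketch (Java) of what column-wise iteration over triplets can look like for an aggregation. TripletIterator is a hypothetical interface, not the parquet-mr API; it only mirrors the triplet model described above.

// Hypothetical triplet iterator: one (R, D, V) entry at a time for a single column.
interface TripletIterator {
  boolean hasNext();
  void next();            // advance to the next triplet
  int repetitionLevel();  // R
  int definitionLevel();  // D
  long valueAsLong();     // decoded value (when defined)
}

class ColumnAggregator {
  // Sum an int64 column and count records by watching repetition level 0.
  static long[] sumAndCount(TripletIterator col, int maxDefinitionLevel) {
    long sum = 0, records = 0;
    while (col.hasNext()) {
      col.next();
      if (col.repetitionLevel() == 0) records++;          // a new record starts here
      if (col.definitionLevel() == maxDefinitionLevel) {  // value is present (not null)
        sum += col.valueAsLong();                         // aggregate on integers, not strings
      }
    }
    return new long[] { sum, records };
  }
}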
- 21. APIs
• Schema definition and record materialization:
• Hadoop does not have a notion of schema; however, Impala, Pig, Hive, Thrift, Avro, and Protocol Buffers do.
• Event-based, SAX-style record materialization layer. No double conversion (see the converter sketch below).
• Integration with existing type systems and processing frameworks:
• Impala
• Pig
• Thrift and Scrooge for M/R, Cascading and Scalding
• Cascading tuples
• Avro
• Hive
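A minimal sketch (Java, parquet-mr) of the event-based, SAX-style materialization idea: Parquet drives start()/end() and typed add*() callbacks while assembling a record, so a framework can build its own objects directly, with no intermediate row format. The parquet.io.api converter classes are assumed from the parquet-mr API of the time, and the two-column record builder is a hypothetical example.

import parquet.io.api.Binary;
import parquet.io.api.Converter;
import parquet.io.api.GroupConverter;
import parquet.io.api.PrimitiveConverter;

// Builds a tab-separated record for a flat two-column schema (int64, string).
class SimpleRecordConverter extends GroupConverter {
  final StringBuilder current = new StringBuilder();

  private final Converter[] fields = {
    new PrimitiveConverter() {   // column 0: int64
      @Override public void addLong(long v) { current.append(v).append('\t'); }
    },
    new PrimitiveConverter() {   // column 1: binary/string
      @Override public void addBinary(Binary v) { current.append(v.toStringUsingUTF8()); }
    }
  };

  @Override public Converter getConverter(int fieldIndex) { return fields[fieldIndex]; }
  @Override public void start() { current.setLength(0); }  // a record begins
  @Override public void end()   { /* record is complete; hand it to the engine */ }
}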
http://parquet.io
- 22. Encodings
• Bit packing:
• Small integers encoded in the minimum number of bits required (see the packing sketch at the end of this slide)
• Useful for repetition levels, definition levels, and dictionary keys
• Run length encoding:
• Used in combination with bit packing
• Cheap compression
• Works well for the definition levels of sparse columns
• Dictionary encoding:
• Useful for columns with few (< 50,000) distinct values
• Extensible:
• Defining new encodings is supported by the format
http://parquet.io
Bit packing example: the values 1 3 2 0 0 2 2 0 packed on 2 bits each: 01|11|10|00 00|10|10|00
Run length encoding example: the value 1 repeated 8 times (1 1 1 1 1 1 1 1) encoded as the pair (8, 1)
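A minimal sketch (Java) of the bit-packing example above (the values 1 3 2 0 0 2 2 0 packed on 2 bits each). This is illustrative only: the real parquet-mr encoders define their own bit ordering and combine bit packing with run length encoding.

public class BitPackDemo {
  // Pack values (each < 2^bitWidth) into bytes, filling the most significant
  // bits of each byte first (matches the 01|11|10|00 reading on the slide;
  // the on-disk format may order bits differently).
  static byte[] pack(int[] values, int bitWidth) {
    byte[] out = new byte[(values.length * bitWidth + 7) / 8];
    int bitPos = 0;
    for (int v : values) {
      for (int b = bitWidth - 1; b >= 0; b--) {
        if (((v >> b) & 1) != 0) {
          out[bitPos / 8] |= (byte) (0x80 >> (bitPos % 8));
        }
        bitPos++;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] packed = pack(new int[] {1, 3, 2, 0, 0, 2, 2, 0}, 2);
    // 8 values * 2 bits = 2 bytes instead of 32 bytes of int32.
    System.out.printf("%02x %02x%n", packed[0], packed[1]); // prints: 78 28
  }
}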
- 23. Contributors
Main contributors:
• Julien Le Dem (Twitter): Format, Core, Pig, Thrift integration, Encodings
• Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): Format, Impala
• Jonathan Coveney, Alex Levenson, Aniket Mokashi (Twitter): Encodings
• Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
• Dmitriy Ryaboy (Twitter): Format, Thrift and Scrooge Cascading integration
• Tom White (Cloudera): Avro integration
• Avi Bryant (Stripe): Cascading tuples integration
http://parquet.io
- 24. Future
• Indices for random access (lookup by ID).
• More encodings.
• Extensibility
• Statistics pages (max, min, ...)
http://parquet.io