Spark and S3 with Ryan Blue
- 2. Contents.
● Big Data at Netflix.
● HDFS, S3, and rename.
● Output committers.
● S3 multipart committer.
● Results & future work.
- 4. Big Data at Netflix.
● 500B to 1T daily events
● 60+ PB data warehouse
● 5 PB read daily
● 300 TB written daily
- 6. ● Need to update YARN? Deploy a new cluster.
● Reconfigure NodeManagers? Deploy a new cluster.
● Add temporary capacity? Deploy a new cluster.
● Lost the NameNode? Deploy a new cluster. Quickly.
Netflix clusters are expendable.
● GENIE is a job submission service that selects clusters.
● METACAT is a cluster-independent metastore.
● S3 is used for all data storage.
Expendable clusters require architectural changes.
● A distributed object store (masquerading as a FileSystem).
● Rename results in a copy and delete.
● File output committers rename every output file.
S3 != HDFS.
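To illustrate that last point (a minimal sketch, not Netflix's code; the bucket and keys are made up, using the AWS SDK for Java v1): a "rename" against S3 decomposes into a copy plus a delete, so its cost grows with the data size instead of being a constant-time metadata change.

  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.AmazonS3ClientBuilder;

  public class S3RenameSketch {
    public static void main(String[] args) {
      AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
      // There is no native rename: copy every byte to the new key...
      s3.copyObject("bucket", "tmp/task-4-attempt-0",
                    "bucket", "job-1/task-4");
      // ...then delete the source. Time is proportional to object size.
      s3.deleteObject("bucket", "tmp/task-4-attempt-0");
    }
  }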
- 11. We tend to think of tasks like this:
[Diagram: four tasks running in parallel, each reading its input and writing its output.]
- 13. So reality is closer to this:
Photo credit: Hamish Darby via Flickr – https://www.flickr.com/photos/ybrad/6245422027
Under CC BY 2.0 License – http://creativecommons.org/licenses/by/2.0
- 14. ● Spark might lose communication with an executor.
● YARN might preempt executors.
● Speculative execution may start duplicate tasks.
● Tasks may run slowly, but still try to commit.
Anything can happen.
- 15. In practice, execution is this:
Photo credit: Alethe via Wikimedia Commons – https://en.wikipedia.org/wiki/Egg-and-spoon_race#/media/File:Egg_%26_spoon_finish_line.jpg
Under CC BY SA 3.0 License – http://creativecommons.org/licenses/by-sa/3.0
- 16. ● Task attempts write different files in parallel.
● One attempt per task commits.
● Once all tasks commit, the job commits.
Committers clean up the mess.
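For context, that protocol is the surface of Hadoop's org.apache.hadoop.mapreduce.OutputCommitter API; a skeleton sketch (the method names are the real API, the bodies here are placeholders, not the file committer's actual logic):

  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.OutputCommitter;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;

  // Each attempt writes to its own location; commitTask publishes exactly
  // one attempt per task; commitJob publishes the job's output only after
  // every task has committed.
  public class SketchCommitter extends OutputCommitter {
    @Override public void setupJob(JobContext job) { /* create staging dirs */ }
    @Override public void setupTask(TaskAttemptContext att) { /* per-attempt setup */ }
    @Override public boolean needsTaskCommit(TaskAttemptContext att) { return true; }
    @Override public void commitTask(TaskAttemptContext att) { /* publish this attempt */ }
    @Override public void abortTask(TaskAttemptContext att) { /* discard attempt files */ }
    @Override public void commitJob(JobContext job) { /* publish all task outputs */ }
  }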
- 18. ● One (and only one) copy of each task’s output.
● All output from a job is available, or none is.
Committer guarantees*:
*Not really guarantees.
- 19. ● DirectOutputCommitter writes directly to the final location.
○ Concurrent attempts clobber one another.
○ Job failures leave partial outputs behind.
○ Removed in Spark 2.0.0 – SPARK-10063
Why commit outputs?
- 21. ● Attempts: write to a unique file for each task/attempt.
● Task commit: rename attempt file to task file location.
○ mv /tmp/task-4-attempt-0 /tmp/job-1/task-4
● Job commit: rename job output to final location.
○ mv /tmp/job-1 /final/path/output
● This is (roughly) what FileOutputCommitter does.
● Move/rename is not a metadata operation in S3!
A simplified file committer.
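Those two renames map onto the Hadoop FileSystem API roughly like this (a sketch reusing the paths from the mv examples above):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RenameCommitSketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // Task commit: promote the attempt's file to the task location.
      // Metadata-only on HDFS; a full copy-and-delete on S3.
      fs.rename(new Path("/tmp/task-4-attempt-0"), new Path("/tmp/job-1/task-4"));
      // Job commit: move the job directory to the final output path.
      fs.rename(new Path("/tmp/job-1"), new Path("/final/path/output"));
    }
  }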
- 22. ● Incremental file upload API.
● Upload file blocks as available.
● Notify S3 when finished adding blocks.
S3 multipart uploads.
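A sketch of that API with the AWS SDK for Java v1 (bucket, key, and file paths are illustrative): initiate an upload, add parts as blocks become available, then complete it. Until the complete call, the object is not visible.

  import java.io.File;
  import java.util.ArrayList;
  import java.util.List;
  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.AmazonS3ClientBuilder;
  import com.amazonaws.services.s3.model.*;

  public class MultipartUploadSketch {
    public static void main(String[] args) {
      AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

      // 1. Initiate: S3 returns an upload ID that ties the parts together.
      String uploadId = s3.initiateMultipartUpload(
          new InitiateMultipartUploadRequest("bucket", "job-1/task-4")).getUploadId();

      // 2. Upload blocks as they become available; keep each part's ETag.
      List<PartETag> etags = new ArrayList<>();
      File block = new File("/local/block-1");
      etags.add(s3.uploadPart(new UploadPartRequest()
          .withBucketName("bucket").withKey("job-1/task-4")
          .withUploadId(uploadId).withPartNumber(1)
          .withFile(block).withPartSize(block.length())).getPartETag());

      // 3. Notify S3 the object is finished; only now does it appear.
      s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
          "bucket", "job-1/task-4", uploadId, etags));
    }
  }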
- 24. ● Attempts: write to a unique file on local disk.
● Task commit:
○ Upload data to S3 using the multipart API.
○ Serialize the final request and commit that to HDFS.
● Job commit:
○ Read the task outputs to get the final requests.
○ Use the pending requests to notify S3 that the files are finished.
Multipart upload committer.
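The key observation: everything S3 needs to finish a file fits in a tiny record, so task commit can stage it in HDFS and job commit only replays requests. A simplified sketch (illustrative names, not the real committer's classes; see the s3committer repo later in the deck for the actual code):

  import java.io.Serializable;
  import java.util.ArrayList;
  import java.util.List;
  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
  import com.amazonaws.services.s3.model.PartETag;

  public class MultipartCommitSketch {
    // Everything needed to finish one file: a few hundred bytes.
    // Task commit (on executors) uploads the attempt's local file with the
    // multipart API but never calls complete; instead it writes this record
    // to HDFS (e.g. via ObjectOutputStream over FileSystem.create()).
    static class PendingUpload implements Serializable {
      String bucket, key, uploadId;
      int[] partNumbers;
      String[] etags;
    }

    // Job commit (on the driver): read each task's record from HDFS and
    // send the final complete request. Metadata-only, so it is fast.
    static void commitJob(AmazonS3 s3, List<PendingUpload> pending) {
      for (PendingUpload u : pending) {
        List<PartETag> parts = new ArrayList<>();
        for (int i = 0; i < u.partNumbers.length; i++) {
          parts.add(new PartETag(u.partNumbers[i], u.etags[i]));
        }
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
            u.bucket, u.key, u.uploadId, parts));
      }
    }
  }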
- 25. ● Will abort uploads to roll back.
● Will delete files to clean up partial commits.
● Failures during job commit can leave extra data.
(but it’s no worse than the file committer).
Failure cases.
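A sketch of the rollback path (the helper name is illustrative, not the committer's actual method): aborting a pending upload discards its parts, so a failed or superseded attempt leaves nothing visible behind.

  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;

  public class RollbackSketch {
    // Abort discards all uploaded parts for the upload ID; the object
    // never becomes visible in the bucket.
    static void abortUpload(AmazonS3 s3, String bucket, String key, String uploadId) {
      s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId));
    }
  }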
- 26. ● Metadata-only job commit (unlike the file committer).
● Provides reliable writes to S3 (unlike the direct committer).
● Distributed content uploads, lightweight job commit.
● Shortened job commit times by hours.
Results.
- 27. ● Released single-directory and partitioned committers.
○ https://github.com/rdblue/s3committer
● Will be included in S3A – HADOOP-13786
○ Huge thanks to Steve Loughran!
S3 multipart committer.
- 28. ● Short term, finish integrating into Hadoop.
● Long term, the Hive table layout needs to be replaced.
Future work.