Spark and S3 with Ryan Blue
- 2. Contents.
● Big Data at Netflix.
● HDFS, S3, and rename.
● Output committers.
● S3 multipart committer.
● Results & future work.
- 4. Big Data at Netflix.
● 500B to 1T daily events
● 60+ PB data warehouse
● 5 PB read daily
● 300 TB written daily
- 6. ● Need to update YARN? Deploy a new cluster.
● Reconfigure NodeManagers? Deploy a new cluster.
● Add temporary capacity? Deploy a new cluster.
● Lost the NameNode? Deploy a new cluster. Quickly.
Netflix clusters are expendable.
● GENIE is a job submission service that selects clusters.
● METACAT is a cluster-independent metastore.
● S3 is used for all data storage.
Expendable clusters require architectural changes.
● A distributed object store (masquerading as a FileSystem).
● Rename results in a copy and delete.
● File output committers rename every output file.
S3 != HDFS.
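To illustrate that last point (a minimal sketch, not Netflix's code; the bucket and keys are made up, using the AWS SDK for Java v1): a "rename" against S3 decomposes into a copy plus a delete, so its cost grows with the data size instead of being a constant-time metadata change.

  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.AmazonS3ClientBuilder;

  public class S3RenameSketch {
    public static void main(String[] args) {
      AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
      // There is no native rename: copy every byte to the new key...
      s3.copyObject("bucket", "tmp/task-4-attempt-0",
                    "bucket", "job-1/task-4");
      // ...then delete the source. Time is proportional to object size.
      s3.deleteObject("bucket", "tmp/task-4-attempt-0");
    }
  }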
- 11. We tend to think of tasks like this:
[Diagram: four tasks running in parallel, each reading its input and writing its output.]
- 13. So reality is closer to this:
Photo credit: Hamish Darby via Flickr – https://www.flickr.com/photos/ybrad/6245422027
Under CC BY 2.0 License – http://creativecommons.org/licenses/by/2.0
- 14. ● Spark might lose communication with an executor.
● YARN might preempt executors.
● Speculative execution may start duplicate tasks.
● Tasks may run slowly, but still try to commit.
Anything can happen.
- 15. In practice, execution is this:
Photo credit: Alethe via Wikimedia Commons – https://en.wikipedia.org/wiki/Egg-and-spoon_race#/media/File:Egg_%26_spoon_finish_line.jpg
Under CC BY SA 3.0 License – http://creativecommons.org/licenses/by-sa/3.0
- 16. ● Task attempts write different files in parallel.
● One attempt per task commits.
● Once all tasks commit, the job commits.
Committers clean up the mess.
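For context, that protocol is the surface of Hadoop's org.apache.hadoop.mapreduce.OutputCommitter API; a skeleton sketch (the method names are the real API, the bodies here are placeholders, not the file committer's actual logic):

  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.OutputCommitter;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;

  // Each attempt writes to its own location; commitTask publishes exactly
  // one attempt per task; commitJob publishes the job's output only after
  // every task has committed.
  public class SketchCommitter extends OutputCommitter {
    @Override public void setupJob(JobContext job) { /* create staging dirs */ }
    @Override public void setupTask(TaskAttemptContext att) { /* per-attempt setup */ }
    @Override public boolean needsTaskCommit(TaskAttemptContext att) { return true; }
    @Override public void commitTask(TaskAttemptContext att) { /* publish this attempt */ }
    @Override public void abortTask(TaskAttemptContext att) { /* discard attempt files */ }
    @Override public void commitJob(JobContext job) { /* publish all task outputs */ }
  }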
- 18. ● One (and only one) copy of each task’s output.
● All output from a job is available, or none is.
Committer guarantees*:
*Not really guarantees.
- 19. ● DirectOutputCommitter writes directly to the final location.
○ Concurrent attempts clobber one another.
○ Job failures leave partial outputs behind.
○ Removed in Spark 2.0.0 – SPARK-10063
Why commit outputs?
- 21. ● Attempts: write to a unique file for each task/attempt.
● Task commit: rename attempt file to task file location.
○ mv /tmp/task-4-attempt-0 /tmp/job-1/task-4
● Job commit: rename job output to final location.
○ mv /tmp/job-1 /final/path/output
● This is (roughly) what FileOutputCommitter does.
● Move/rename is not a metadata operation in S3!
A simplified file committer.
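Those two renames map onto the Hadoop FileSystem API roughly like this (a sketch reusing the paths from the mv examples above):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RenameCommitSketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // Task commit: promote the attempt's file to the task location.
      // Metadata-only on HDFS; a full copy-and-delete on S3.
      fs.rename(new Path("/tmp/task-4-attempt-0"), new Path("/tmp/job-1/task-4"));
      // Job commit: move the job directory to the final output path.
      fs.rename(new Path("/tmp/job-1"), new Path("/final/path/output"));
    }
  }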
- 22. ● Incremental file upload API.
● Upload file blocks as available.
● Notify S3 when finished adding blocks.
S3 multipart uploads.
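A sketch of that API with the AWS SDK for Java v1 (bucket, key, and file paths are illustrative): initiate an upload, add parts as blocks become available, then complete it. Until the complete call, the object is not visible.

  import java.io.File;
  import java.util.ArrayList;
  import java.util.List;
  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.AmazonS3ClientBuilder;
  import com.amazonaws.services.s3.model.*;

  public class MultipartUploadSketch {
    public static void main(String[] args) {
      AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

      // 1. Initiate: S3 returns an upload ID that ties the parts together.
      String uploadId = s3.initiateMultipartUpload(
          new InitiateMultipartUploadRequest("bucket", "job-1/task-4")).getUploadId();

      // 2. Upload blocks as they become available; keep each part's ETag.
      List<PartETag> etags = new ArrayList<>();
      File block = new File("/local/block-1");
      etags.add(s3.uploadPart(new UploadPartRequest()
          .withBucketName("bucket").withKey("job-1/task-4")
          .withUploadId(uploadId).withPartNumber(1)
          .withFile(block).withPartSize(block.length())).getPartETag());

      // 3. Notify S3 the object is finished; only now does it appear.
      s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
          "bucket", "job-1/task-4", uploadId, etags));
    }
  }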
- 24. ● Attempts: write to a unique file on local disk.
● Task commit:
○ Upload data to S3 using the multipart API.
○ Serialize the final request and commit that to HDFS.
● Job commit:
○ Read the task outputs to get the final requests.
○ Use the pending requests to notify S3 that the files are finished.
Multipart upload committer.
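The key observation: everything S3 needs to finish a file fits in a tiny record, so task commit can stage it in HDFS and job commit only replays requests. A simplified sketch (illustrative names, not the real committer's classes; see the s3committer repo later in the deck for the actual code):

  import java.io.Serializable;
  import java.util.ArrayList;
  import java.util.List;
  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
  import com.amazonaws.services.s3.model.PartETag;

  public class MultipartCommitSketch {
    // Everything needed to finish one file: a few hundred bytes.
    // Task commit (on executors) uploads the attempt's local file with the
    // multipart API but never calls complete; instead it writes this record
    // to HDFS (e.g. via ObjectOutputStream over FileSystem.create()).
    static class PendingUpload implements Serializable {
      String bucket, key, uploadId;
      int[] partNumbers;
      String[] etags;
    }

    // Job commit (on the driver): read each task's record from HDFS and
    // send the final complete request. Metadata-only, so it is fast.
    static void commitJob(AmazonS3 s3, List<PendingUpload> pending) {
      for (PendingUpload u : pending) {
        List<PartETag> parts = new ArrayList<>();
        for (int i = 0; i < u.partNumbers.length; i++) {
          parts.add(new PartETag(u.partNumbers[i], u.etags[i]));
        }
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
            u.bucket, u.key, u.uploadId, parts));
      }
    }
  }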
- 25. ● Will abort uploads to roll back.
● Will delete files to clean up partial commits.
● Failures during job commit can leave extra data.
(but it’s no worse than the file committer).
Failure cases.
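A sketch of the rollback path (the helper name is illustrative, not the committer's actual method): aborting a pending upload discards its parts, so a failed or superseded attempt leaves nothing visible behind.

  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;

  public class RollbackSketch {
    // Abort discards all uploaded parts for the upload ID; the object
    // never becomes visible in the bucket.
    static void abortUpload(AmazonS3 s3, String bucket, String key, String uploadId) {
      s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId));
    }
  }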
- 26. ● Metadata-only job commit (unlike the file committer).
● Provides reliable writes to S3 (unlike the direct committer).
● Distributed content uploads, lightweight job commit.
● Shortened job commit times by hours.
Results.
- 27. ● Released single-directory and partitioned committers.
○ https://github.com/rdblue/s3committer
● Will be included in S3A – HADOOP-13786
○ Huge thanks to Steve Loughran!
S3 multipart committer.
- 28. ● Short term, finish integrating into Hadoop.
● Long term, the Hive table layout needs to be replaced.
Future work.