Tangram is a state-of-art resource allocator and distributed scheduling framework for Spark at Facebook with hierarchical queues and a resource based container abstraction. We support scheduling and resource management for a significant portion of Facebook's data warehouse and machine learning workloads that equates to running millions of jobs across several clusters with tens of thousands of machines. In this talk, we will describe Tangram's architecture, discuss Facebook's need for a custom scheduler, and explain how Tangram schedules Spark workloads at scale. We will specifically focus on several important features around improving Spark's efficiency, usability and reliability: 1. IO-rebalancer (Tetris) Support 2. User-Fairness Queueing 3. Heuristic-Based Backfill Scheduling Optimizations.
Report
Share
Report
Share
1 of 23
Download to read offline
More Related Content
Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
5. What is Tangram?
The scheduling platform for
• reliably running various batch workloads
• with efficient heterogenous resource management
• at scale
5#UnifiedAnalytics #SparkAISummit
6. Tangram Scheduling Targets
• Single jobs: adhoc/periodic
• Batch jobs: adhoc/periodic, malleable
• Gang jobs: adhoc/periodic, rigid
• Long-running jobs: steady and regular; e.g. online training
6#UnifiedAnalytics #SparkAISummit
7. Why Tangram?
• Various workload characteristics
– ML
– Apache Spark
– Apache Giraph
– Single jobs
• Customized scheduling policies
• Scalability
– Fleet size: hundreds of thousands worker nodes
– Job scheduling throughput: hundreds of millions jobs per day
7#UnifiedAnalytics #SparkAISummit
8. Overview
• What is Tangram?
8#UnifiedAnalytics #SparkAISummit
Admin
Job Manager
DB
ML
Resource
Manager
Master
Agent AgentAgent
Single
Job
Gang Job
ML Elastic
Scheduler
1
2
3
4
5
6
SQL query
Giraph
Spark
10. Agent
• Report schedulable resources and runtime usage
• Health check reports
• Detect labels
• Launch/Kill Containers
• Container recovery
• Resource isolation with cgroup v2
10#UnifiedAnalytics #SparkAISummit
11. Failure Recovery
• Agent failure
– Scan the recovery directory and recover the running containers
• RM failure
– Both agent and client hold off communication to the RM until the new
master shows up
– Client sync session info to the new master to help it build the states
– Agents add them to the new master
11#UnifiedAnalytics #SparkAISummit
12. Scheduling Policies
• Hierarchical queue structure
• Jobs to be queued on leaves
• Queue configs:
– min/max resources
– Policy:
• FIFO
• Dominant Resource Fairness (DRF)
• User fairness
• Global
• …
12#UnifiedAnalytics #SparkAISummit
/
ads feed
pipelines interactive
Job
DRF
DRF DRF
User FairnessFIFO
20%80%
50% 50%
user1 user2
50% 50%
FIFO FIFO
Job
Job Job
13. Scheduling Policies
• Jobs ordered by priority, submission time within queue
• Gang job as first class in scheduling and resource allocation
• Lookahead scheduling for better throughput and utilization
• Job starvation prevention
13#UnifiedAnalytics #SparkAISummit
Gang 200 Gang 20 Single Gang 4 Single
15. Resource Allocation
15#UnifiedAnalytics #SparkAISummit
Prefetched
Host Cache
• Bypass the
steps of
host
filtering
and
scoring
• Speedup
allocation
process
Host Filtering
• Hard &
Soft
constraints
• Resource
constraint
• Label
constraint
• Job affinity
Host Scoring
and Ordering
• Packing
efficiency
• Host
healthiness
• Data
locality
Commit
Allocation
• Book
keeping
resources
• Update
cluster &
queue
parameters
16. Constraint-based Scheduling
• Machine type
• Datacenter
• Region
• CPU architecture
• Host prefix
• …
16#UnifiedAnalytics #SparkAISummit
Merged host pool - type 1 & 2
Job
Job
Job
Host 1
Host 2
Host 3
Host 4
Host 5
Labeled with
{”type”:”2”}
Labeled with
{”type”:”1”}
Job Job
Job constraint:
type=2
Job constraint:
type=1
Queue
17. Preemption
• Guarantee resource availability SLO within and across queues
• Identify the starving jobs and overallocated jobs
• Minimize preemption cost: two-phase protocol
– Only candidates appearing in both phases will be preempted
– Resource Manager notifies client with preemption intent s.t. necessary action can
be taken, e.g. checkpointing
17#UnifiedAnalytics #SparkAISummit
18. Cross Datacenter Scheduling
• The growing demand of computation and storage for Hive tables
spans across data centers
• Stranded capacity with imbalanced load
• Poor data locality and waste of network bandwidth
• Slow reaction to recover from crisis and disaster
18#UnifiedAnalytics #SparkAISummit
19. Cross Datacenter Scheduling
• Dispatcher Proxy
– Monitors resource consumption
across data centers
– Decides the Resource Manager
for scheduling jobs
– Provides location hints to the
Resource Manager for
enforcement
• Planner
– Decides where the data will be
replaced based on utilization and
available resources
19#UnifiedAnalytics #SparkAISummit
Datacenter 1 Datacenter 2 Datacenter 3
Resource Manager
1
Resource Manager
2
Dispatcher
Job
20. Cross Datacenter Scheduling
• Dispatcher Proxy
– Monitors resource consumption
across data centers
– Decides the Resource Manager
for scheduling jobs
– Provides location hints to the
Resource Manager for
enforcement
• Planner
– Decides where the data will be
replaced based on utilization and
available resources
20#UnifiedAnalytics #SparkAISummit
Datacenter 1 Datacenter 2 Datacenter 3
Resource Manager
1
Resource Manager
2
Dispatcher
Job
Job constraint:
datacenter=1
21. Cross Datacenter Scheduling
• Dispatcher Proxy
– Monitors resource consumption
across data centers
– Decides the Resource Manager
for scheduling jobs
– Provides location hints to the
Resource Manager for
enforcement
• Planner
– Decides where the data will be
replaced based on utilization and
available resources
21#UnifiedAnalytics #SparkAISummit
Datacenter 1 Datacenter 2 Datacenter 3
Resource Manager
1
Resource Manager
2
Dispatcher
Job
Job constraint:
datacenter=1
Table DataTable Data
22. Future Work
• Mix workloads managed by one resource manager
• Run batch workloads with off-peak resources from online services
• Automatic resource tuning for high utilization
• We’re hiring! Contact: rjian@fb.com
22#UnifiedAnalytics #SparkAISummit
23. DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT