Tangram: Distributed Scheduling Framework for Apache Spark at Facebook


Rui Jian, Hao Lin, Facebook Inc.
rjian@fb.com, hlin@fb.com

About Us
• Rui Jian
– Software Engineer at Facebook (Data Warehouse & Graph Indexing)
– Master of Computer Science (Shanghai Jiao Tong University)
• Hao Lin
– Research Scientist at Facebook (Data Warehouse Batch Scheduling)
– PhD in Parallel Computing (Purdue ECE)

Agenda
• Overview
• Tangram Architecture
• Scheduling Policies & Resource Allocation
• Future work

What is Tangram?
The scheduling platform for
• reliably running various batch workloads
• with efficient heterogeneous resource management
• at scale

Tangram Scheduling Targets
• Single jobs: adhoc/periodic
• Batch jobs: adhoc/periodic, malleable
• Gang jobs: adhoc/periodic, rigid
• Long-running jobs: steady and regular; e.g. online training

Why Tangram?
• Various workload characteristics
– ML
– Apache Spark
– Apache Giraph
– Single jobs
• Customized scheduling policies
• Scalability
– Fleet size: hundreds of thousands of worker nodes
– Job scheduling throughput: hundreds of millions of jobs per day

Overview
[Architecture diagram. Components shown: Admin, Job Manager, DB, Resource Manager (Master), Scheduler, and Agents. Workloads shown: SQL query, Giraph, Spark, ML, ML Elastic, Single Job, Gang Job. Interactions are numbered 1 to 6.]

Client Library
• Job management
• Request/Release resources
• Resource grant
• Preemption notification
• Launch containers
• Container status change event
[Diagram: Application with Tangram client, Resource Manager, and Agent; interactions are numbered 1 to 6.]

Agent
• Report schedulable resources and runtime usage
• Health check reports
• Detect labels
• Launch/Kill Containers
• Container recovery
• Resource isolation with cgroup v2

Failure Recovery
• Agent failure
– Scan the recovery directory and recover the running containers
• RM failure
– Both agents and clients hold off communication with the RM until the new master comes up
– Clients sync session info to the new master to help it rebuild its state
– Agents register themselves with the new master

Scheduling Policies
• Hierarchical queue structure
• Jobs to be queued on leaves
• Queue configs:
– min/max resources
– Policy:
• FIFO
• Dominant Resource Fairness (DRF)
• User fairness
• Global
• …
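[Diagram: example queue hierarchy rooted at "/". Queue names shown: ads, feed, pipelines, interactive, user1, user2. Policies shown on queues: DRF, User Fairness, FIFO. Resource shares shown: 20%/80% and 50%/50%. Jobs sit on the leaf queues.]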
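To make the DRF policy named above concrete, here is a minimal sketch of Dominant Resource Fairness between two queues. It is an illustration only, not Tangram's implementation: the helper names (`dominant_share`, `drf_allocate`), the cluster capacity, and the per-task demands are all assumptions chosen for the example.

```python
from typing import Dict

Resources = Dict[str, float]

def dominant_share(used: Resources, capacity: Resources) -> float:
    """Largest fraction of any single resource that a queue currently holds."""
    return max(used[r] / capacity[r] for r in capacity)

def drf_allocate(capacity: Resources, demands: Dict[str, Resources]) -> Dict[str, int]:
    """Grant one task at a time to the queue with the smallest dominant share."""
    used = {q: {r: 0.0 for r in capacity} for q in demands}
    free = dict(capacity)
    tasks = {q: 0 for q in demands}
    active = set(demands)
    while active:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        q = min(sorted(active), key=lambda name: dominant_share(used[name], capacity))
        need = demands[q]
        if any(need[r] > free[r] for r in capacity):
            active.discard(q)  # this queue's next task no longer fits
            continue
        for r in capacity:
            free[r] -= need[r]
            used[q][r] += need[r]
        tasks[q] += 1
    return tasks

# Hypothetical cluster and per-task demands (not from the talk).
capacity = {"cpu": 9.0, "mem_gb": 18.0}
demands = {
    "ads": {"cpu": 1.0, "mem_gb": 4.0},   # memory-heavy tasks
    "feed": {"cpu": 3.0, "mem_gb": 1.0},  # CPU-heavy tasks
}
print(drf_allocate(capacity, demands))  # {'ads': 3, 'feed': 2}
```

Both queues end with a dominant share of 2/3: "ads" holds 12 of 18 GB of memory and "feed" holds 6 of 9 CPUs.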

Scheduling Policies
• Jobs ordered by priority, submission time within queue
• Gang job as first class in scheduling and resource allocation
• Lookahead scheduling for better throughput and utilization
• Job starvation prevention
[Diagram: example queue order: Gang 200, Gang 20, Single, Gang 4, Single.]
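The ordering, gang, lookahead, and starvation points above can be sketched as a single scheduling pass. This is a hypothetical illustration: the `Job` fields, the `lookahead` and `max_skips` parameters, and the rule "stop looking past a job that has been skipped too often" are assumptions, since the slides do not describe Tangram's actual mechanism.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Job:
    name: str
    priority: int      # assumption: larger value = more important
    submit_time: int
    hosts_needed: int  # gang size; 1 for a single job
    skipped: int = 0   # how many passes have looked past this job

def schedule_pass(queue: List[Job], free_hosts: int,
                  lookahead: int = 3, max_skips: int = 2) -> List[str]:
    """One pass over the queue; returns the names of the jobs started."""
    pending = sorted(queue, key=lambda j: (-j.priority, j.submit_time))
    started: List[str] = []
    blocked_seen = 0
    for job in pending:
        if job.hosts_needed <= free_hosts:
            free_hosts -= job.hosts_needed  # a gang job gets all its hosts at once
            started.append(job.name)
            continue
        # The job does not fit: look past it, within limits.
        job.skipped += 1
        blocked_seen += 1
        if job.skipped > max_skips or blocked_seen > lookahead:
            break  # protect a starving job, or the lookahead window is used up
    return started

def make_queue() -> List[Job]:
    """The queue from the slide, all at the same priority."""
    return [
        Job("gang-200", 0, 0, 200),
        Job("gang-20", 0, 1, 20),
        Job("single-a", 0, 2, 1),
        Job("gang-4", 0, 3, 4),
        Job("single-b", 0, 4, 1),
    ]

print(schedule_pass(make_queue(), free_hosts=30))               # lookahead on
print(schedule_pass(make_queue(), free_hosts=30, lookahead=0))  # strict order
```

With 30 free hosts, strict ordering starts nothing because the 200-host gang blocks the head of the queue, while lookahead lets the smaller jobs behind it run. Once the blocked gang job has been skipped more than `max_skips` times, later passes stop at it so that freed hosts can accumulate for it.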

Resource Allocation
• Fine-grained resource specification:
– {cpuMilliCores: 3000, memoryBytes: 200GB}
• Constraints:
– “dataCenter = dc1 & type in [1,2] & kernelVersion > 4.10”
• Job Affinity:
– inSameDatacenter
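As a rough illustration of how a resource specification and a constraint like the ones above could be checked against a host, consider the sketch below. The clause representation, the `matches` and `fits` helpers, the host labels, and the reading of 200GB as 200 * 10**9 bytes are all assumptions; the talk does not show Tangram's constraint grammar or evaluator.

```python
from typing import Dict, List, Tuple

# A constraint is a list of (label, operator, operand) clauses joined by "&".
Clause = Tuple[str, str, object]

def version_tuple(version: str) -> Tuple[int, ...]:
    """Compare versions numerically, so that 4.9 < 4.10."""
    return tuple(int(part) for part in version.split("."))

def matches(labels: Dict[str, str], clauses: List[Clause]) -> bool:
    """True if the host labels satisfy every clause."""
    for key, op, operand in clauses:
        value = labels.get(key)
        if value is None:
            return False
        if op == "=":
            ok = value == operand
        elif op == "in":
            ok = value in operand
        elif op == ">":
            ok = version_tuple(value) > version_tuple(operand)
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True

def fits(free: Dict[str, int], request: Dict[str, int]) -> bool:
    """True if the host has at least the requested amount of every resource."""
    return all(free.get(name, 0) >= amount for name, amount in request.items())

# "dataCenter = dc1 & type in [1,2] & kernelVersion > 4.10"
constraint: List[Clause] = [
    ("dataCenter", "=", "dc1"),
    ("type", "in", ["1", "2"]),
    ("kernelVersion", ">", "4.10"),
]
request = {"cpuMilliCores": 3000, "memoryBytes": 200 * 10**9}

# Hypothetical hosts.
host_a = {"dataCenter": "dc1", "type": "2", "kernelVersion": "4.16"}
host_b = {"dataCenter": "dc1", "type": "3", "kernelVersion": "4.16"}
host_c = {"dataCenter": "dc1", "type": "1", "kernelVersion": "4.9"}
free_a = {"cpuMilliCores": 16000, "memoryBytes": 256 * 10**9}

print(matches(host_a, constraint) and fits(free_a, request))
```

Note that the kernel version is compared component by component rather than as a string or a float; otherwise 4.9 would wrongly rank above 4.10.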

Resource Allocation
• Prefetched Host Cache
– Bypass the steps of host filtering and scoring
– Speed up the allocation process
• Host Filtering
– Hard & soft constraints
– Resource constraint
– Label constraint
– Job affinity
• Host Scoring and Ordering
– Packing efficiency
– Host healthiness
– Data locality
• Commit Allocation
– Bookkeeping of resources
– Update cluster & queue parameters

Constraint-based Scheduling
• Machine type
• Datacenter
• Region
• CPU architecture
• Host prefix
• …
[Diagram: a queue of jobs with constraints type=1 and type=2 is scheduled onto a merged host pool (types 1 and 2) of Host 1 to Host 5; hosts are labeled {"type":"1"} or {"type":"2"}.]

Preemption
• Guarantee resource availability SLO within and across queues
• Identify the starving jobs and overallocated jobs
• Minimize preemption cost: two-phase protocol
– Only candidates appearing in both phases will be preempted
– Resource Manager notifies the client of preemption intent so that necessary action can be taken, e.g. checkpointing
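One way to read the two-phase protocol is sketched below: candidates are identified and notified in the first phase, re-evaluated in the second, and only jobs flagged both times are preempted. The candidate rule (usage above a guaranteed amount), the function names, and the numbers are assumptions; the slides do not say how Tangram selects candidates or how long it waits between phases.

```python
from typing import Callable, Dict, Set

def overallocated(usage: Dict[str, int], guaranteed: Dict[str, int]) -> Set[str]:
    """Assumed candidate rule: a job using more than its guaranteed share."""
    return {job for job, used in usage.items() if used > guaranteed.get(job, 0)}

def two_phase_preempt(usage_phase1: Dict[str, int],
                      usage_phase2: Dict[str, int],
                      guaranteed: Dict[str, int],
                      notify: Callable[[str], None]) -> Set[str]:
    """Return the jobs to preempt: those that are candidates in both phases."""
    phase1 = overallocated(usage_phase1, guaranteed)
    for job in sorted(phase1):
        notify(job)  # preemption intent, so the client can e.g. checkpoint
    # ... a grace period would elapse here before usage is sampled again ...
    phase2 = overallocated(usage_phase2, guaranteed)
    return phase1 & phase2

# Hypothetical usage snapshots, in arbitrary resource units.
guaranteed = {"job-a": 10, "job-b": 10, "job-c": 10}
usage_phase1 = {"job-a": 15, "job-b": 12, "job-c": 8}
usage_phase2 = {"job-a": 15, "job-b": 9, "job-c": 11}

notified = []
victims = two_phase_preempt(usage_phase1, usage_phase2, guaranteed, notified.append)
print(notified, victims)  # ['job-a', 'job-b'] {'job-a'}
```

Here job-b shrinks back under its guarantee after being notified and job-c only becomes overallocated later, so neither is preempted in this round.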

Cross Datacenter Scheduling
• The growing demand of computation and storage for Hive tables
spans across data centers
• Stranded capacity with imbalanced load
• Poor data locality and waste of network bandwidth
• Slow reaction to recover from crisis and disaster

• Dispatcher Proxy
– Monitors resource consumption across data centers
– Decides the Resource Manager for scheduling jobs
– Provides location hints to the Resource Manager for enforcement
• Planner
– Decides where the data will be replaced based on utilization and available resources
[Diagram: a Job enters the Dispatcher, which sits in front of Resource Manager 1 and Resource Manager 2; Datacenter 1, Datacenter 2, and Datacenter 3 are shown.]

[Later builds of the same diagram add the label "Job constraint: datacenter=1" on the job and "Table Data" labels.]
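The dispatcher's role described above (pick a Resource Manager and attach a location hint) might look roughly like the sketch below. The routing rule, the `max_util` threshold, the utilization figures, and the mapping of data centers to Resource Managers are all assumptions made for illustration; the talk does not specify them.

```python
from typing import Dict, List, Tuple

def dispatch(data_dcs: List[str],
             utilization: Dict[str, float],
             rm_of: Dict[str, str],
             max_util: float = 0.9) -> Tuple[str, Dict[str, str]]:
    """Return (resource_manager, location_hint) for a job.

    Assumed rule: prefer a data center that holds the job's table data and is
    below max_util; otherwise fall back to the least utilized data center.
    """
    local = [dc for dc in data_dcs if utilization[dc] < max_util]
    pool = local or list(utilization)
    dc = min(sorted(pool), key=lambda d: utilization[d])
    return rm_of[dc], {"datacenter": dc}

# Hypothetical cluster state.
utilization = {"dc1": 0.6, "dc2": 0.4, "dc3": 0.95}
rm_of = {"dc1": "rm1", "dc2": "rm1", "dc3": "rm2"}

print(dispatch(["dc1"], utilization, rm_of))  # ('rm1', {'datacenter': 'dc1'})
print(dispatch(["dc3"], utilization, rm_of))  # ('rm1', {'datacenter': 'dc2'})
```

The returned hint plays the part of the "Job constraint: datacenter=1" label in the diagram: the chosen Resource Manager would enforce it as a constraint when placing the job.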

Future Work
• Mix workloads managed by one resource manager
• Run batch workloads with off-peak resources from online services
• Automatic resource tuning for high utilization
• We’re hiring! Contact: rjian@fb.com
