This document discusses dynamic community detection for e-commerce data using Spark Streaming and GraphX. It presents an approach for processing streaming graph data to perform community detection in real-time. Key points include using GraphX to merge small incremental graphs into a large stock graph, developing incremental algorithms like JV and UMG that make local updates to communities based on modularity optimization, and monitoring communities over time to trigger rebuilds if the modularity drops below a threshold. This dynamic approach allows for more sophisticated analysis of streaming e-commerce data compared to static community detection.
This document provides guidelines for developing databases and writing SQL code. It includes recommendations for naming conventions, variables, SELECT statements, cursors, wildcard characters, joins, batches, stored procedures, views, data types, indexes, and more. The guidelines suggest using more efficient techniques such as derived tables and ANSI joins, avoiding cursors, and avoiding wildcards at the beginning of strings. It also recommends measuring performance and optimizing for queries over updates.
A comparison of different solutions for full-text search in web applications using PostgreSQL and other technologies. Presented at the PostgreSQL Conference West in Seattle, October 2009.
This document covers the entire C language thoroughly. It is aimed at students and professionals who would like to learn C or brush up their knowledge with a quick recap.
C language supports a character set of 256 characters including lowercase and uppercase English alphabets (a-z and A-Z), digits (0-9), and special symbols like mathematical, logical, and punctuation symbols. Every character has a corresponding ASCII value. A C program is provided that prints all the characters in the C character set along with their ASCII values to demonstrate the set of characters supported.
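As a rough illustration of the program the summary describes (the original code is not reproduced here, so this is a minimal sketch that walks only the printable ASCII range):

```c
#include <stdio.h>

/* Minimal sketch: print each character in the C character set
   together with its ASCII value. Only printable characters
   (codes 32-126) are shown so the output stays readable. */
int main(void)
{
    for (int ch = 32; ch <= 126; ch++) {
        printf("ASCII %3d : %c\n", ch, ch);
    }
    return 0;
}
```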
Programs transform input data into output data using programming languages that support different data types and operations on those types. A data type specifies a set of values and operations on those values and is used to declare variables, return values, and function parameters. Identifiers refer to data types, variables, and functions and have specific naming rules. Common built-in data types include integers, characters, floating points, pointers, arrays, strings, and structures.
The document discusses the MySQL query optimizer. It begins by explaining how the optimizer works, including how it analyzes statistics and determines optimal join orders and access methods. It then describes how the optimizer trace can provide insight into why a particular execution plan was selected. The remainder of the document details the various phases the optimizer goes through, including logical transformations and cost-based optimizations such as range analysis and join order selection.
The document discusses various primitive data types including integer, floating point, decimal, boolean, character, and string types. It covers the implementation and design considerations of these types in different programming languages such as C, C++, Java, and C#. Enumeration types are also introduced as user-defined ordinal types.
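As a brief, hypothetical illustration (in C, not taken from the document) of an enumeration used as a user-defined ordinal type, where each enumerator maps to an integer and values can be compared and ordered:

```c
#include <stdio.h>

/* Hypothetical example: an enumeration as a user-defined ordinal type.
   MON is 0, TUE is 1, and so on, so enumerators order and compare
   like integers. */
enum weekday { MON, TUE, WED, THU, FRI, SAT, SUN };

int main(void)
{
    enum weekday today = WED;

    if (today < SAT) {
        printf("Day %d is a working day\n", today);  /* prints "Day 2 ..." */
    }
    return 0;
}
```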
This presentation covers the fundamentals of SQL tuning, such as SQL Processing, the Optimizer and Execution Plan, Accessing Tables, Performance Improvement Considerations, and Partitioning Techniques. Presented by Alphalogic Inc: https://www.alphalogicinc.com/
Python provides data structures such as lists, tuples, and dictionaries, as well as regular expressions. This slide deck includes programming examples in Python.
A dictionary in Python is an unordered collection of key-value pairs where keys must be unique and immutable. It allows storing and accessing data values using keys. Keys can be of any immutable data type while values can be of any data type. Dictionaries can be created using curly braces {} or the dict() constructor and elements can be added, accessed, updated, or removed using keys. Common dictionary methods include copy(), clear(), pop(), get(), keys(), items(), and update().
Postgres expert Bruce Momjian discusses common table expressions (CTEs) and how they allow queries to be more imperative, enabling looping and the processing of hierarchical structures normally associated only with imperative languages.
More sample code can be found at the address below. https://github.com/KennethanCeyer/pycon-kr-2018
Data Structures in Python, the second part of the Introduction to Python series. Upcoming parts will cover functions and OOP concepts in Python.
Data Definition Language (DDL), Data Manipulation Language (DML), Transaction Control Language (TCL), Data Control Language (DCL), and SQL constraints.
Nested structures in C allow one structure to be defined within another. This document demonstrates a nested structure with an address structure defined within an emp structure. It declares variables of each structure type, assigns values to their members, and prints the member values. The address structure contains phone, city, and pin members, while the emp structure contains name, emp_no, and salary members.
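A sketch of the kind of program the summary describes follows; the member names come from the summary, while the exact field types and sample values are assumptions:

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the nested structure described above: an address
   structure embedded inside an emp structure. Field types and
   the sample values are assumptions; only the member names are
   given in the summary. */
struct address {
    char phone[15];
    char city[30];
    int  pin;
};

struct emp {
    char name[30];
    int  emp_no;
    float salary;
    struct address addr;   /* nested structure member */
};

int main(void)
{
    struct emp e;

    strcpy(e.name, "Ravi");
    e.emp_no = 101;
    e.salary = 45000.0f;
    strcpy(e.addr.phone, "9876543210");
    strcpy(e.addr.city, "Pune");
    e.addr.pin = 411001;

    printf("Name: %s, Emp no: %d, Salary: %.2f\n", e.name, e.emp_no, e.salary);
    printf("Phone: %s, City: %s, PIN: %d\n", e.addr.phone, e.addr.city, e.addr.pin);
    return 0;
}
```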
The document defines and describes various data types in the C programming language. It discusses integer data types like char, short int, int, long int; floating point data types like float, double, long double; void data type; and derived data types like arrays, pointers, structures, unions, enumerated data types, and user-defined data types using typedef. Each data type is explained along with its size, range of values it can hold, and examples.
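The sizes and ranges such a document tabulates can be inspected directly with sizeof and the macros in <limits.h> and <float.h>. The following is a small illustrative sketch, not code from the document; the printed values are implementation-defined:

```c
#include <stdio.h>
#include <limits.h>
#include <float.h>

/* Illustrative sketch: report the size and range of a few basic
   types. The exact values are implementation-defined, which is
   why <limits.h> and <float.h> expose them as macros. */
int main(void)
{
    printf("char      : %zu byte(s), range %d to %d\n",
           sizeof(char), CHAR_MIN, CHAR_MAX);
    printf("short int : %zu byte(s), range %d to %d\n",
           sizeof(short), SHRT_MIN, SHRT_MAX);
    printf("int       : %zu byte(s), range %d to %d\n",
           sizeof(int), INT_MIN, INT_MAX);
    printf("long int  : %zu byte(s), range %ld to %ld\n",
           sizeof(long), LONG_MIN, LONG_MAX);
    printf("float     : %zu byte(s), approx range %e to %e\n",
           sizeof(float), FLT_MIN, FLT_MAX);
    printf("double    : %zu byte(s), approx range %e to %e\n",
           sizeof(double), DBL_MIN, DBL_MAX);
    return 0;
}
```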
This document discusses Netflix's use of Spark and Spark Streaming. Key points include:
- Netflix uses Spark on its Berkeley Data Analytics Stack (BDAS) to enable rapid experimentation for algorithm engineers and to provide business value through more A/B tests.
- Use cases for Spark at Netflix include feature selection, feature generation, model training, and metric evaluation on large datasets with many users.
- Netflix BDAS provides notebooks, access to the Netflix ecosystem and services, and faster computation and scaling. It allows for ad-hoc experimentation and "time machine" functionality.
- Netflix processes over 450 billion events per day through its streaming data pipeline, which collects, moves, and processes events at cloud scale.
This document discusses two key problems with modularity-based community detection in networks: incomparability and resolution limit. The modularity measure cannot reliably distinguish between networks with genuine communities and networks with no communities. Additionally, the size of communities detected depends on the overall network size, so smaller true communities may not be detected in large networks. The document provides examples of networks where modularity fails to identify the true partition into communities.
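For reference, the modularity measure being discussed is the standard definition (not reproduced from the document itself):

```latex
Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

Here A is the adjacency matrix, k_i the degree of node i, m the total number of edges, and delta(c_i, c_j) equals 1 when nodes i and j are assigned to the same community. The resolution limit arises because the expected-edges term k_i k_j / 2m scales with the global edge count m, so whether merging two small communities increases Q depends on the size of the whole network rather than on the communities themselves.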
This document summarizes an analysis of complex networks using open source software tools. It provides an overview of network graph analysis and statistical and visual measures used to assess network patterns. It then demonstrates these concepts through case studies of the Miles Davis album collaboration network, the Boston Red Sox player network, and a GDELT news event network. The document concludes that network graph analysis is a powerful technique for understanding relationships in connected data.
This document describes Hadoop and Hive on AWS. It explains that Hadoop is a framework for processing large amounts of data in a distributed fashion across multiple nodes, using HDFS for storage and MapReduce for processing. It also describes how AWS offers services such as EC2, S3, and EMR that make it easy to deploy Hadoop clusters in the cloud elastically and at scale. Finally, it introduces Hive as a SQL layer on top of Hadoop that facilitates large-scale data analysis.