Dynamic Community Detection for Large-scale e-Commerce data
with Spark Streaming and GraphX
Ming Huang
Meng Zhang, Bin Wei
GuangYuan Huang, Jinkui Shi
Community Detection
Scenarios
•  VIP Customer
•  Reputation Escalator
•  Fraud Seller
•  …
Algorithms
•  LPA
•  GN
•  Fast Unfolding
•  …
How to make it Dynamic?
Static Communities Streaming Data
Make sophisticated, real-time decisions
Definition & Solution
Dynamic Community Detection
1.  Decide New Node’s community
2.  Update Graph Physical Topology
3.  Update the affected communities and modularity
Spark Streaming + GraphX → Streaming Graph

Streaming Graph
[Figure: successive streaming batches are each merged into the Stock Graph]
Models and Algorithms
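Quick Overview of Fast Unfolding
Modularity:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
$$Q = \sum_{i}^{c} Q_i = \sum_{i}^{c} \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 \right]$$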
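As a rough illustration of the per-community form of modularity (Q as a sum of terms Σin/2m − (Σtot/2m)², one per community), the sketch below computes Q from community summaries in plain Scala, without Spark. The `CommunityStat` class, the `Modularity` object, and their field names are assumptions made for this example and are not taken from the slides. It also assumes that Σin counts each internal edge twice (once per endpoint), which is what the matrix form of Q implies.

```scala
// Hypothetical per-community summary (names are illustrative only).
// sigmaIn: internal edges counted twice, once per endpoint.
// sigmaTot: sum of the degrees of the community's nodes.
final case class CommunityStat(sigmaIn: Double, sigmaTot: Double)

object Modularity {
  // Q_c = sigmaIn / 2m - (sigmaTot / 2m)^2, where m is the number of edges.
  def ofCommunity(c: CommunityStat, m: Double): Double = {
    val twoM = 2.0 * m
    c.sigmaIn / twoM - math.pow(c.sigmaTot / twoM, 2)
  }

  // Q is the sum of the per-community terms.
  def total(cs: Seq[CommunityStat], m: Double): Double =
    cs.map(ofCommunity(_, m)).sum
}
```

For two triangles joined by a single edge (m = 7), each triangle has sigmaIn = 6 and sigmaTot = 7, so `Modularity.total(Seq(CommunityStat(6.0, 7.0), CommunityStat(6.0, 7.0)), 7.0)` evaluates to 5/14, roughly 0.357.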
Incremental Algorithms
•  JV: Join & Vote (Streaming with RDD)
•  UMG: Union & Modularity Greedy (Streaming with Graph)

JV
[Figure: incEdgeRDD (edges from new node D to A, B, C) is joined with stockCommunityRDD (A in C1, B in C2, C in C2); the vote assigns D to C2]
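The Join & Vote idea can be sketched on plain Scala collections: join the incoming edges against the stock community table, then let each new node take the community that most of its known neighbors belong to. This is only one plausible reading of the JV slide. The `JoinVote` object, the string ids, and the tie-breaking rule (smaller community id wins) are assumptions for illustration; the talk itself operates on RDDs.

```scala
object JoinVote {
  // incEdges: (newNode, existingNeighbor) pairs from the incoming batch.
  // stockCommunity: known node -> community id.
  def assign(incEdges: Seq[(String, String)],
             stockCommunity: Map[String, String]): Map[String, String] =
    incEdges
      // join: replace each known neighbor by its community, drop unknown neighbors
      .flatMap { case (node, nbr) => stockCommunity.get(nbr).map(c => (node, c)) }
      .groupBy(_._1)
      .map { case (node, votes) =>
        val counts = votes.groupBy(_._2).map { case (c, vs) => (c, vs.size) }
        // vote: most frequent neighbor community; ties go to the smaller id (assumed rule)
        val best = counts.toSeq.sortBy { case (c, n) => (-n, c) }.head._1
        node -> best
      }
}
```

With edges D-A, D-B, D-C and communities A in C1, B in C2, C in C2, `JoinVote.assign` maps D to C2, matching the example in the figure.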
UMG 1 - Union
(C1 or C2)?
   newGraph = stockGraph.union(incGraph)
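UMG 2 - findBestCommunity
gain1 = G(node(d), community(1))
gain2 = G(node(d), community(2))
   // For each vertex of the incremental graph, collect its neighbors in the merged graph
   val incVertexWithNeighbors = newGraph.mapReduceTriplets[Array[VertexData]](
     collectNeighborFunc, _ ++ _,
     Some((incGraph.vertices, EdgeDirection.Either)))
   // Assign each new vertex to the neighboring community with the largest gain
   // (the `incVertexWithNeighbors.map` receiver is assumed; it is lost in the slide text)
   val idCommunity = incVertexWithNeighbors.map {
     case (vid, neighbors) => (vid, findBestCommunity(neighbors))
   }.cache()
$$C_i = \arg\max_{C_j} G(node_i, C_j)$$
$$\Delta Q = \left[ \frac{\Sigma_{in} + k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]$$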
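To make the greedy step concrete, here is a minimal sketch of the gain computation in plain Scala. It assumes the usual Fast Unfolding gain: the modularity of the community after absorbing the node, minus the modularity of the community and of the isolated node before the move. The `UmgGain` object, the `Candidate` class, and `bestCommunity` are hypothetical names for this example; the slides' own `findBestCommunity` takes a neighbor array and its body is not shown.

```scala
object UmgGain {
  // Candidate community as seen from the new node (illustrative fields):
  // sigmaIn and sigmaTot as in the modularity formula,
  // kIn = total weight of the node's edges into this community.
  final case class Candidate(id: String, sigmaIn: Double, sigmaTot: Double, kIn: Double)

  // Gain of moving an isolated node of degree k into candidate c; m = number of edges.
  def gain(c: Candidate, k: Double, m: Double): Double = {
    val twoM = 2.0 * m
    val after = (c.sigmaIn + c.kIn) / twoM - math.pow((c.sigmaTot + k) / twoM, 2)
    val before = c.sigmaIn / twoM - math.pow(c.sigmaTot / twoM, 2) - math.pow(k / twoM, 2)
    after - before
  }

  // Pick the candidate with the largest gain, if there is any candidate at all.
  def bestCommunity(cs: Seq[Candidate], k: Double, m: Double): Option[String] =
    if (cs.isEmpty) None else Some(cs.maxBy(gain(_, k, m)).id)
}
```

Algebraically the gain reduces to kIn/2m − 2·Σtot·k/(2m)², so among communities of equal Σtot the one holding more of the node's edges wins.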
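UMG 3 - updateCommunities
   val newCommunityRdd = idCommunity.updateCommunities(communityRdd)
   // Total modularity is the sum of the per-community terms
   // (the `newCommunityRdd.map(community` prefix is assumed; it is lost in the slide text)
   val newModularity = newCommunityRdd.map(community => community.modularity).reduce(_ + _)
$$Q = \sum_{i}^{c} Q_i = \sum_{i}^{c} \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 \right]$$
(Q1, Q2)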

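Flow Example Code
   import org.apache.spark.SparkConf
   import org.apache.spark.streaming.{Seconds, StreamingContext}

   val conf = new SparkConf().setMaster(……).setAppName(……)
   val ssc = new StreamingContext(conf, Seconds(60))

   // Build the stock graph from the historical edges and hand it to the streaming model
   val totalGraph = initGraph(totalEdgesRdd)
   val streamingFU = new StreamingFU().setTotalGraph(totalGraph)

   // One RDD of new edges per 60-second batch
   val onlineDataFlow = getDataFlow(ssc.sparkContext)
   val edgeStreamRDD = ssc.queueStream(onlineDataFlow, true)

   edgeStreamRDD.foreachRDD { incEdgeRdd =>
     val incGraph = buildIncGraph(incEdgeRdd)
     val (communityInfoRDD, modularity) = streamingFU.trainOn(incGraph)
     outputToHBase(communityInfoRDD)
     outputToHBase(modularity)
   }

   ssc.start()
   ssc.awaitTermination()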
Experiment Results
Autonomous Systems Graphs
Stanford Large Network Dataset Collection (as-733)
Modularity Trend – AS

Online Trading Graph
Buyer Seller
Modularity Trend – OT
Streaming Graph → Better Result
Key Points
•  Operator
   •  Merge small graph into large graph
•  Model
   •  Local changes
   •  Index or summary
•  Algorithm
   •  Delicate formula
   •  Commutative law & Associative law
   •  In parallel & incrementally
Complex GraphX Operators

Graph Union Operator
GRAPH(G) ∪ GRAPH(H) = GRAPH(G ∪ H)
[GraphX] Complex Operators between Graphs: Union
   newGraph = stockGraph.union(incGraph)
Complex GraphX Operators
•  Union of Graphs (G ∪ H)
•  Intersection of Graphs (G ∩ H)
•  Graph Join
•  Difference of Graphs (G – H)
•  Graph Complement
•  Line Graph (L(G))
Issues: Complex Operators between Graphs
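The set semantics behind a graph union operator can be sketched on a small in-memory graph. The `SimpleGraph` class below is a hypothetical stand-in written for this example: it is not a GraphX type, and it makes no claim about how the proposed `union` operator on GraphX graphs is implemented. It treats edges as undirected and ignores vertex and edge attributes, which a real operator would have to merge.

```scala
// Hypothetical in-memory graph: a vertex-id set plus an undirected edge set.
final case class SimpleGraph(vertices: Set[Long], edges: Set[(Long, Long)]) {
  // G ∪ H: union of the vertex sets and of the edge sets.
  def union(that: SimpleGraph): SimpleGraph =
    SimpleGraph(vertices ++ that.vertices, edges ++ that.edges)

  // G ∩ H: vertices and edges present in both graphs.
  def intersect(that: SimpleGraph): SimpleGraph =
    SimpleGraph(vertices & that.vertices, edges & that.edges)
}

object SimpleGraph {
  // Store each undirected edge with the smaller id first, so (a, b) and (b, a) coincide.
  def fromEdges(es: Seq[(Long, Long)]): SimpleGraph = {
    val norm = es.map { case (a, b) => if (a <= b) (a, b) else (b, a) }.toSet
    SimpleGraph(norm.flatMap { case (a, b) => Set(a, b) }, norm)
  }
}
```

Because set union is commutative and associative, incremental batches merged this way give the same graph in whatever order or grouping they are merged, which is the property that lets the merge run in parallel and incrementally.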
Streaming Optimization
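Monitoring and Correction
[Figure: pipeline with stages Data Loading, Streaming-FU, Modularity Threshold Checking and FastUnfolding; schedule labels: [Streaming], [Hourly Monitoring], [Daily Running]]

commRDDTable
communityID   communityInfo
community1    (in1, tot1, degree1, modularity1)
……            ……

modularityTable
mTime         mValue
timestamp1    totalModularity1
……            ……
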
Streaming Resource Allocation
•  Driver-Memory: 20G
•  Executors: 100
•  Core: 2
•  Executor-Memory: 20G
Not Enough for Peak Period!
Streaming Buffer
Modularity Correction Buffer
Resource Peak Buffer
Conclusion
•  Streaming Graph
•  Complex Operators will help
•  Daily Rebuild & Threshold Check
•  Costs more memory and time
•  Open Question: checkpoint with Streaming or Graph?
Acknowledgements
1.  Limits of community detection
2.  Community Detection
3.  Social Network Analysis
4.  Community detection in complex networks using Extremal Optimization

Q & A
Agenda
•  Dynamic Community Detection
•  Streaming Graph
•  Models and Algorithms
•  Complex GraphX Operators
•  Streaming Optimization
•  Conclusion
Static vs. Dynamic
Static Model Dynamic Model


Dynamic Community Detection for Large-scale e-Commerce data with Spark Streaming and GraphX-(Ming Huang, Taobao)

  • 1. Dynamic Community Detection for Large-scale e-Commerce data with Spark Streaming and GraphX. Ming Huang, Meng Zhang, Bin Wei, GuangYuan Huang, Jinkui Shi
  • 2. Community Detection. Scenarios: VIP Customer, Reputation Escalator, Fraud Seller, … Algorithms: LPA, GN, Fast Unfolding, …
  • 3. How to make it Dynamic? Static Communities; Streaming Data; Make sophisticated, real-time decisions
  • 4. Definition & Solution. Dynamic Community Detection: 1. Decide new node's community; 2. Update graph physical topology; 3. Update the affected communities and modularity. Spark Streaming + GraphX → Streaming Graph (REAL-TIME)
  • 7. Quick Overview of Fast Unfolding. Modularity: $Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$, equivalently $Q = \sum_{i}^{c} Q_i = \sum_{i}^{c} \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 \right]$
  • 8. Incremental Algorithms. JV: Join & Vote (Streaming with RDD); UMG: Union & Modularity Greedy (Streaming with Graph)
  • 9. JV. [Figure: incEdgeRDD (edges from new node D to A, B, C) is joined with stockCommunityRDD (A in C1, B in C2, C in C2); the vote assigns D to C2]
  • 10. UMG 1 - Union. (C1 or C2)? newGraph = stockGraph.union(incGraph)
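  • 11. UMG 2 - findBestCommunity. gain1 = G(node(d), community(1)); gain2 = G(node(d), community(2))
       // For each vertex of the incremental graph, collect its neighbors in the merged graph
       val incVertexWithNeighbors = newGraph.mapReduceTriplets[Array[VertexData]](
         collectNeighborFunc, _ ++ _,
         Some((incGraph.vertices, EdgeDirection.Either)))
       // (the `incVertexWithNeighbors.map` receiver is assumed; it is lost in the slide text)
       val idCommunity = incVertexWithNeighbors.map {
         case (vid, neighbors) => (vid, findBestCommunity(neighbors))
       }.cache()
    $C_i = \arg\max_{C_j} G(node_i, C_j)$
    $\Delta Q = \left[ \frac{\Sigma_{in} + k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]$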
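  • 12. UMG 3 - updateCommunities
       val newCommunityRdd = idCommunity.updateCommunities(communityRdd)
       // (the `newCommunityRdd.map(community` prefix is assumed; it is lost in the slide text)
       val newModularity = newCommunityRdd.map(community => community.modularity).reduce(_ + _)
    $Q = \sum_{i}^{c} Q_i = \sum_{i}^{c} \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 \right]$, with per-community terms (Q1, Q2) for C1 and C2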
  • 13. Flow Example Code
       val conf = new SparkConf().setMaster(……).setAppName(……)
       val ssc = new StreamingContext(conf, Seconds(60))

       val totalGraph = initGraph(totalEdgesRdd)
       val streamingFU = new StreamingFU().setTotalGraph(totalGraph)

       val onlineDataFlow = getDataFlow(ssc.sparkContext)
       val edgeStreamRDD = ssc.queueStream(onlineDataFlow, true)

       edgeStreamRDD.foreachRDD { incEdgeRdd =>
         val incGraph = buildIncGraph(incEdgeRdd)
         val (communityInfoRDD, modularity) = streamingFU.trainOn(incGraph)
         outputToHBase(communityInfoRDD)
         outputToHBase(modularity)
       }

       ssc.start()
       ssc.awaitTermination()
  • 15. Autonomous Systems Graphs. Stanford Large Network Dataset Collection (as-733)
  • 18. Modularity Trend – OT. Streaming Graph → Better Result
  • 19. Key Points. Operator: merge small graph into large graph. Model: local changes; index or summary. Algorithm: delicate formula; commutative law & associative law; in parallel & incrementally
  • 21. Graph Union Operator. GRAPH(G) ∪ GRAPH(H) = GRAPH(G ∪ H). [GraphX] Complex Operators between Graphs: Union. newGraph = stockGraph.union(incGraph)
  • 22. Complex GraphX Operators. Union of Graphs (G ∪ H); Intersection of Graphs (G ∩ H); Graph Join; Difference of Graphs (G – H); Graph Complement; Line Graph (L(G)). Issues: Complex Operators between Graphs
  • 24. Monitoring and Correction. [Figure: pipeline with stages Data Loading, Streaming-FU, Modularity Threshold Checking and FastUnfolding; schedule labels: [Hourly Monitoring], [Streaming], [Daily Running]]
       commRDDTable
       communityID   communityInfo
       community1    (in1, tot1, degree1, modularity1)
       ……            ……
       modularityTable
       mTime         mValue
       timestamp1    totalModularity1
       ……            ……
  • 25. Streaming Resource Allocation. Driver-Memory: 20G; Executors: 100; Core: 2; Executor-Memory: 20G. Not enough for peak period!
  • 27. Conclusion. Streaming Graph; Complex Operators will help; Daily Rebuild & Threshold Check; Costs more memory and time; Open Question: checkpoint with Streaming or Graph?
  • 28. Acknowledgements. 1. Limits of community detection; 2. Community Detection; 3. Social Network Analysis; 4. Community detection in complex networks using Extremal Optimization
  • 29. Q & A
  • 30. Agenda. Dynamic Community Detection; Streaming Graph; Models and Algorithms; Complex GraphX Operators; Streaming Optimization; Conclusion
  • 31. Static vs. Dynamic Static Model Dynamic Model