Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
- 1. How graphs became just another big data primitive
Ted Willke
Cloud Platforms Group / Big Data Solutions
- 3. So, how did graphs become just another useful big data primitive?
They DIDN’T.
- 4. Reduce the tool drag for graph analytics
-- Vision (early 2012)
Set off in the right direction
- 6. User Interest
Wide on Analytics, E2E on Graph, Deep on Graph, Wide on Analytics
- 8. Popular Big Data (Structure) Primitives
Which one is best? It depends… and it’s probably not just one.
Key-Value Document Column Tabular Graph
- 9. Popular Big Data (Structure) Primitives: Key-Value
Basic dictionary.
Very fast.
Very easy.
No/minimal structure.
Java, PIQL, Lua, XML, XQuery,…
- 10. Popular Big Data (Structure) Primitives: Document
Key(s), metadata, hierarchy, document structure
XML, BSON, JSON…
Java, C, C++, REST, Clojure, Scala…
- 11. Popular Big Data (Structure) Primitives: Column
Key:col_val, Key:col_val…
Great for “do this to everything in this column”
Not so much for multiple columns, specific keys
Hadoop, Zookeeper, Java, Python,…
- 12. Popular Big Data (Structure) Primitives: Tabular
Old-school RDBMS
Collection of tables + relations that join them
*SQL*
- 13. Popular Big Data (Structure) Primitives: Graph
Nodes, edges, properties of nodes and edges
Java, Clojure, Lisp, Ruby, C, C++, Scala, REST,…
- 14. How we use the primitives
Model: Key-Value, Document, Graph, Column, SQL
Access: Sync (I/O), Async (Bus), Off-line (Queue)
Implementation: API (Remote), LIB (Local)
- 16. Data workflow example
Ingest & Clean
Engineer Features
Structure Model
Train Model
Query & Analyze
Learn
Visualize
- 17. Data Representation
Personal Learning Knowledge Graph
Graph? Columnar? Tabular??
Nodes:
Task Level
-name: "10th Grade"
-value: 10
Learning Task
-name: "Matrix Multiplication"
-task_id: 101
-description: "Demonstrate how to multiply two matrices"
-type: "homework"
Subject
-name: "Linear Algebra"
-subject_id: 100
Task Outcome
-score: 0.8
-num_correct: 8
-num_attempts: 2
Learning Plan
-plan_id: 1
-num_tasks: 5
-expected_time: 5h
Learning Goal
-goal_id: 9
-description: "Achieve above average proficiency in all Linear Algebra course tasks"
Proficiency
-name: "Above Average"
Edge labels: has_associated (appears twice), has_result, contains, implemented_by, evaluated_by, summarized_by, has_prerequisite
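The "Graph? Columnar? Tabular??" question can be made concrete with a minimal Python sketch that holds a few of the slide's nodes as a property graph and then derives tabular and columnar views from the same data. The node properties are taken from the slide; the node keys, the helper function names, and above all the endpoints chosen for each edge are assumptions, since the diagram's layout did not survive extraction.

```python
# Property-graph view. Node properties come from the slide; the node keys
# ("task", "plan", ...) and the edge endpoints are ASSUMPTIONS for illustration.
nodes = {
    "task": {"label": "Learning Task", "name": "Matrix Multiplication",
             "task_id": 101, "type": "homework"},
    "subject": {"label": "Subject", "name": "Linear Algebra", "subject_id": 100},
    "outcome": {"label": "Task Outcome", "score": 0.8,
                "num_correct": 8, "num_attempts": 2},
    "plan": {"label": "Learning Plan", "plan_id": 1,
             "num_tasks": 5, "expected_time": "5h"},
}
edges = [
    ("plan", "contains", "task"),           # assumed endpoints
    ("subject", "has_associated", "task"),  # assumed endpoints
    ("task", "has_result", "outcome"),      # assumed endpoints
]


def neighbors(node_id, edge_label):
    """Follow outgoing edges with the given label."""
    return [dst for src, label, dst in edges
            if src == node_id and label == edge_label]


def outcomes_for_plan(plan_node):
    """Graph-style query: plan -contains-> task -has_result-> outcome."""
    return [nodes[o]
            for t in neighbors(plan_node, "contains")
            for o in neighbors(t, "has_result")]


def to_rows(node_label):
    """Tabular view: one row (dict) per node carrying the given label."""
    rows = []
    for node_id, props in nodes.items():
        if props["label"] == node_label:
            row = {"id": node_id}
            row.update({k: v for k, v in props.items() if k != "label"})
            rows.append(row)
    return rows


def to_columns(rows):
    """Columnar view: one list of values per property name."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


outcome_rows = to_rows("Task Outcome")
outcome_columns = to_columns(outcome_rows)
```

The traversal reads naturally on the graph view, while an aggregate such as an average score would read more naturally on `outcome_columns["score"]`, which is the slide's point that the best primitive depends on the question being asked.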
- 18.
Input from another model (segment/cluster)
Engineer features (avg, ratios)
Build graph with features from frame
Run a graph-based classifier (e.g., loopy belief propagation, LBP)
Pull results back to frame to get model performance statistics
- 20. Tooling mash-up!
Ingest & Clean
Engineer Features
Structure Model
Train Model
Query & Analyze
Learn
Visualize
Pig/MR
PySpark
ETL Tools?
Pig/MR
PySpark
Java, Scala
Giraph
GraphX
(Java, Scala…)
Mahout
MLlib
??
*SQL*
BI tools
PySpark…
- 21. Tools are not used in isolation either. How can we cope with this?
- 23. Unification with Apache Spark
Image Source: Databricks
•In-memory structures (RDDs) support both table and graph abstractions
•Batch processing and Spark streaming
Spark stack (diagram):
Spark: RDDs, Transformations, and Actions
Spark Streaming (real-time): DStreams: Streams of RDDs
Spark SQL: SchemaRDDs
MLlib (machine learning): RDD-Based Matrices
GraphX (graph processing/machine learning): RDD-Based Graphs
- 24. GraphX API for Spark
Image Source: GraphX project
•Graph processing engine on Spark
•Supports Pregel-style vertex programming
•View same data as either graphs or collections
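To illustrate what "Pregel-style vertex programming" means, here is a minimal single-machine sketch in plain Python. It is not the GraphX API: the function `pregel` and its argument layout are hypothetical, although the roles of the three callbacks (a vertex program, a message sender, and a message merger) loosely mirror the ones a Pregel-style system asks the user to supply. As a simplification, messages are generated over every edge in each superstep rather than only from vertices that were active in the previous one. The example computes connected components by propagating the smallest vertex id.

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iters=20):
    """Run supersteps until no messages are sent (hypothetical helper).

    vertices: dict of vertex id -> value; edges: list of (src, dst) pairs.
    """
    # Superstep 0: every vertex processes the initial message.
    state = {v: vprog(v, value, initial_msg) for v, value in vertices.items()}
    for _ in range(max_iters):
        inbox = {}
        # Message phase: each edge may emit (target, message) pairs.
        for src, dst in edges:
            for target, msg in send_msg(src, state[src], dst, state[dst]):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # converged: nobody has anything new to say
        # Compute phase: only vertices that received a message are updated.
        for v, msg in inbox.items():
            state[v] = vprog(v, state[v], msg)
    return state


def cc_vprog(vertex_id, value, msg):
    """Keep the smallest component id seen so far."""
    return min(value, msg)


def cc_send(src, src_value, dst, dst_value):
    """Push the smaller id across the edge, in whichever direction helps."""
    if src_value < dst_value:
        yield dst, src_value
    elif dst_value < src_value:
        yield src, dst_value


vertices = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}  # each vertex starts in its own component
edges = [(1, 2), (2, 3), (4, 5)]
components = pregel(vertices, edges, float("inf"), cc_vprog, cc_send, min)
```

After the run, vertices 1, 2, and 3 share component id 1, and vertices 4 and 5 share component id 4.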
- 25. Python bindings for Spark (GraphX)
[Architecture diagram. Boxes: Client, Server, Worker, Python, JVM, Py4J, Akka, Pipes, Files. Flows: Serialized Python Functions, Results. Labels: “Transformations”, “Actions”, “Operations”.]
- 27. Python bindings for Spark GraphX
Coming soon to Apache!
Vertex
•Transformations: filter, mapValues, diff
•Actions: aggregateUsingIndex
•Join Operations: innerJoin, leftJoin
Edge
•Transformations: filter, mapValues, reverse
•Join Operations: innerJoin
Graph
•Property Operators: mapVertices, mapEdges, mapTriplets
•Structural Operators: subgraph, reverse, mask, groupEdges
•Join Operations: joinVertices, outerJoinVertices
•Neighborhood Aggregation: mapReduceTriplets
•Analytics: ALS, SVDPlusPlus, TriangleCount, PageRank, ConnectedComponents, ShortestPaths
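The neighborhood-aggregation operator in the list above can be sketched in plain Python to show the idea: a map function turns each edge triplet (source, edge, destination) into messages, and a reduce function combines the messages that arrive at the same vertex. The function `map_reduce_triplets` below is a hypothetical stand-in written for this illustration, not the Python binding the slide announces, and the sample data (ages and "follows" edges) is invented.

```python
def map_reduce_triplets(vertices, edges, map_func, reduce_func):
    """Aggregate messages per vertex (hypothetical helper, not a Spark call).

    vertices: dict of vertex id -> attribute
    edges: list of (src, dst, edge_attribute)
    """
    aggregated = {}
    for src, dst, attr in edges:
        triplet = (src, vertices[src], dst, vertices[dst], attr)
        for target, msg in map_func(triplet):
            if target in aggregated:
                aggregated[target] = reduce_func(aggregated[target], msg)
            else:
                aggregated[target] = msg
    return aggregated


def send_follower_age(triplet):
    """Map step: tell the followed vertex about one follower and their age."""
    src, src_age, dst, dst_age, attr = triplet
    yield dst, (1, src_age)


def add_pairs(a, b):
    """Reduce step: sum follower counts and follower ages."""
    return (a[0] + b[0], a[1] + b[1])


ages = {1: 30, 2: 40, 3: 50}                                  # invented sample data
follows = [(1, 3, "follows"), (2, 3, "follows"), (3, 1, "follows")]

totals = map_reduce_triplets(ages, follows, send_follower_age, add_pairs)
avg_follower_age = {v: total / count for v, (count, total) in totals.items()}
```

Vertices that receive no messages (vertex 2 here) simply do not appear in the result, which is why operators such as outerJoinVertices are useful for merging the aggregate back into the full vertex set.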
- 28. Direction #1: Spark
•Feature engineering
•Model training
•Limited language binding (Python, R getting better)
•Lacks transactions and model serving
- 29. Lacks transactions and model serving... or does it?
Image Source: Crankshaw, D., et al., “The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox,” Cornell University Library Archive, retrieved November 2014
Extending BDAS with Velox:
A UC Berkeley AMPlab project (sponsored in part by Intel)
- 31. Unification within the In-Memory Database (IMDB)
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
•Index data structure for graph traversal
•Prototyped in SAP HANA distributed columnar IMDB
•Lays foundation for complex graph query and algorithms
- 35. Graph Analytics in Relational Databases?
•Store graph as a set of nodes and a set of edges
•Relational algebra captures all basic graph operations
•Iterative algorithms captured as driver program that calls stored procedures
Source: ISTC for Big Data, Alekh Jindal, “Graph Analytics: The New Use Case for Relational Databases,” blog
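The three bullets on this slide can be sketched with Python's built-in sqlite3 module: the graph lives in an edge table, one traversal step is a relational join, and a small driver loop repeats that step until nothing changes. This is only an illustration under stated assumptions. SQLite has no stored procedures, so the driver issues a plain SQL statement each iteration instead, and the table names, column names, and the function `shortest_hops` are all hypothetical.

```python
import sqlite3


def shortest_hops(edge_list, source):
    """Hop distance from `source` to every reachable node, computed in SQL."""
    conn = sqlite3.connect(":memory:")
    # The graph is stored relationally: edges as rows, distances as rows.
    conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
    conn.execute("CREATE TABLE dist (node INTEGER PRIMARY KEY, hops INTEGER)")
    conn.executemany("INSERT INTO edges VALUES (?, ?)", edge_list)
    conn.execute("INSERT INTO dist VALUES (?, 0)", (source,))

    hops = 0
    while True:  # driver program: repeat one relational step until a fixpoint
        before = conn.execute("SELECT COUNT(*) FROM dist").fetchone()[0]
        # One traversal step: join the current frontier with the edge table
        # and keep only destinations that have not been reached yet.
        conn.execute(
            """
            INSERT INTO dist (node, hops)
            SELECT DISTINCT e.dst, ?
            FROM dist AS d JOIN edges AS e ON e.src = d.node
            WHERE d.hops = ? AND e.dst NOT IN (SELECT node FROM dist)
            """,
            (hops + 1, hops),
        )
        after = conn.execute("SELECT COUNT(*) FROM dist").fetchone()[0]
        if after == before:
            break  # no new nodes were reached, so the traversal is complete
        hops += 1

    result = dict(conn.execute("SELECT node, hops FROM dist"))
    conn.close()
    return result


distances = shortest_hops([(1, 2), (2, 3), (1, 3), (3, 4), (5, 6)], source=1)
```

Nodes 5 and 6 never appear in `distances` because they are unreachable from node 1. In a full relational database the loop body would typically be a stored procedure, as the slide describes, with the driver only checking for convergence.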
- 36. Graph Analytics in Relational Databases?
Relational and graphical analysis – better together!
Source: ISTC for Big Data, Alekh Jindal, “Graph Analytics: The New Use Case for Relational Databases,” blog
- 38. Future Vision – BigDAWG
Applications, Visualization, Languages
BQL – BigDAWG Query Language & Compiler
“Narrow waist” provides portability
Real Time Database
Historical / Analytics Databases
Spill
Stream
Analytics Libraries
Hardware Platforms
- 39. Future Vision – BigDAWG
Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT
Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching
Languages, e.g., Julia, R, MLbase, GraphLab
BQL – BigDAWG Query Language & Compiler
“Narrow waist” provides portability
Real Time DBMSs: S-Store
Historical / Analytics DBMSs: SciDB, TupleWare, TileDB, MyriaX
Spill
Stream
Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages
Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon
- 40. Direction #2: Relational DB
•Feature engineering
•Transactions and model serving
•Performant model training?
•Just another Spark behind *QL?
- 42. Takeaway from both:
Do all of the parallel distributed processing in one place and work with it through one UI!
- 43. Pressing forward with the Intel Analytics Toolkit
Intel Analytics Toolkit
Unified UIs across the workflow
Easier feature & model creation
End-to-end graph pipeline
Fully scalable throughout
Multiple data primitives
Optimized for IA
Stack (diagram, top to bottom):
Python Libraries, 3rd Party GUIs/SDKs, Viz Tools, Future Libraries, BI Connectors, Query Interfaces, ...
“DATA SCIENCE” REST API
DATA WRANGLING: Graph Construction Tools, Useful String Manipulation, Useful Math Operators
MACHINE LEARNING AND STATISTICS: Graphical Algorithms, Classical Algorithms
APACHE SPARK
APACHE HADOOP
FILESYSTEMS AND NOSQL STORAGE
HW PLATFORM
- 47. If we are successful...
graph will become just another big data primitive!
- 49. How graphs became just another big data primitive
Graph-shaped data is used in product recommendation systems, social network analysis, network threat detection, image de-noising, and many other important applications. A growing number of these applications will benefit from parallel distributed processing for graph feature engineering, model training, and model serving. But today’s graph tools are riddled with limitations and shortcomings, such as a lack of language bindings, streaming support, and seamless integration with other popular data services. In this talk, we’ll argue that the key to doing more with graphs is doing less with specialized systems and more with systems already good at handling data of other shapes. We’ll examine some practical data science workflows to further motivate this argument, and we’ll talk about some of the things that Intel is doing with the open source community and industry to make graphs just another big data primitive.