Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF

How graphs became just another big data primitive
Ted Willke
Cloud Platforms Group / Big Data Solutions

So, how did graphs become just another useful big data primitive?
They DIDN’T.

Reduce the tool drag for graph analytics
-- Vision (early 2012)
Set off in the right direction

A complete graph analytics solution
-- July 2013

User Interest: Wide on Analytics, E2E on Graph, Deep on Graph, Wide on Analytics

Learning #1: Don’t ignore what’s popular!

Popular Big Data (Structure) Primitives
Which one is best? It depends… and it’s probably not just one.
Key-Value Document Column Tabular Graph

Key-Value
Basic dictionary.
Very fast.
Very easy.
No/minimal structure.
Java, PIQL, Lua, XML, XQuery,…

Document
Key(s), metadata, hierarchy, document structure
XML, BSON, JSON…
Java, C, C++, REST, Clojure, Scala…

Column
Key:col_val, Key:col_val…
Great for “do this to everything in this column”
Not so much for multiple columns, specific keys
Hadoop, Zookeeper, Java, Python,…

Tabular
Old-school RDBMS
Collection of tables + relations that join them
*SQL*

Graph
Nodes, edges, properties of nodes and edges
Java, Clojure, Lisp, Ruby, C, C++, Scala, REST,…
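The five primitives above can be compared by holding one tiny dataset in each shape. This is a minimal sketch using only the Python standard library; the people, the "follows" relationship, and the `since` property are hypothetical and stand in for whatever a real store would hold.

```python
import json
import sqlite3

# Hypothetical toy data: two people and one "follows" relationship.
people = [
    {"id": "u1", "name": "Ann", "age": 31},
    {"id": "u2", "name": "Bob", "age": 27},
]
follows = [("u1", "u2")]

# Key-value: an opaque value behind each key; the store sees no structure.
key_value = {p["id"]: json.dumps(p) for p in people}

# Document: the value keeps its hierarchy and can embed related data.
documents = {
    p["id"]: {**p, "follows": [dst for src, dst in follows if src == p["id"]]}
    for p in people
}

# Column: one array per column, aligned by position.
columns = {
    "id": [p["id"] for p in people],
    "name": [p["name"] for p in people],
    "age": [p["age"] for p in people],
}
# "Do this to everything in this column" touches a single array.
average_age = sum(columns["age"]) / len(columns["age"])

# Tabular: tables plus a relation that joins them, queried with SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT, age INTEGER)")
db.execute("CREATE TABLE follows (src TEXT, dst TEXT)")
db.executemany("INSERT INTO person VALUES (:id, :name, :age)", people)
db.executemany("INSERT INTO follows VALUES (?, ?)", follows)
followed_names = [
    row[0]
    for row in db.execute(
        "SELECT p.name FROM follows f JOIN person p ON p.id = f.dst WHERE f.src = ?",
        ("u1",),
    )
]

# Graph: nodes, edges, and properties on both.
nodes = {p["id"]: {"name": p["name"], "age": p["age"]} for p in people}
edges = [{"src": s, "dst": d, "label": "follows", "since": 2014} for s, d in follows]
neighbors_of_u1 = [nodes[e["dst"]]["name"] for e in edges if e["src"] == "u1"]
```

The same question ("who does u1 follow?") is a join in the tabular shape and an edge scan in the graph shape, which is one reason the best primitive depends on the workload.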

How we use the primitives
Model: Key-Value, Document, Column, SQL, Graph
Access: Sync (I/O), Async (Bus), Off-line (Queue)
Implementation: API (Remote), LIB (Local)

How are these primitives put to use?

Data workflow example
Ingest & Clean
Engineer Features
Structure Model
Train Model
Query & Analyze
Learn
Visualize

Data Representation
Personal Learning Knowledge Graph
Nodes:
Task Level
-name: "10th Grade"
-value: 10
Learning Task
-name: "Matrix Multiplication"
-task_id: 101
-description: "Demonstrate how to multiply two matrices"
-type: "homework"
Subject
-name: "Linear Algebra"
-subject_id: 100
Task Outcome
-score: 0.8
-num_correct: 8
-num_attempts: 2
Learning Plan
-plan_id: 1
-num_tasks: 5
-expected_time: 5h
Learning Goal
-goal_id: 9
-description: "Achieve above average proficiency in all Linear Algebra course tasks"
Proficiency
-name: "Above Average"
Edges: has_associated (x2), has_result, contains, implemented_by, evaluated_by, summarized_by, has_prerequisite
Graph? Columnar? Tabular??
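One way to read the "Graph? Columnar? Tabular??" question is to hold the knowledge graph in plain Python and view it two ways. In this sketch the node properties are copied from the slide, but the dictionary keys, the `kind` field, the `follow` helper, and above all the edge endpoints are assumptions: the slide's layout did not survive, so which node each edge connects is a guess.

```python
# Node properties are copied from the slide; the dictionary keys and "kind" are hypothetical.
nodes = {
    "level": {"kind": "Task Level", "name": "10th Grade", "value": 10},
    "task": {
        "kind": "Learning Task",
        "name": "Matrix Multiplication",
        "task_id": 101,
        "description": "Demonstrate how to multiply two matrices",
        "type": "homework",
    },
    "subject": {"kind": "Subject", "name": "Linear Algebra", "subject_id": 100},
    "outcome": {"kind": "Task Outcome", "score": 0.8, "num_correct": 8, "num_attempts": 2},
    "plan": {"kind": "Learning Plan", "plan_id": 1, "num_tasks": 5, "expected_time": "5h"},
    "goal": {
        "kind": "Learning Goal",
        "goal_id": 9,
        "description": "Achieve above average proficiency in all Linear Algebra course tasks",
    },
    "proficiency": {"kind": "Proficiency", "name": "Above Average"},
}

# ASSUMPTION: the edge labels come from the slide, the endpoints are guesses.
edges = [
    ("plan", "contains", "task"),
    ("task", "has_result", "outcome"),
    ("outcome", "summarized_by", "proficiency"),
]


def follow(start, *labels):
    """Walk from `start` along the given edge labels and return the node keys reached."""
    frontier = [start]
    for label in labels:
        frontier = [dst for src, lab, dst in edges if lab == label and src in frontier]
    return frontier


# Graph view: which outcomes hang off the tasks in the plan?
outcome_keys = follow("plan", "contains", "has_result")
scores = [nodes[key]["score"] for key in outcome_keys]

# Tabular view of the same data: one row per edge, joined to the node kinds.
edge_table = [(nodes[s]["kind"], label, nodes[d]["kind"]) for s, label, d in edges]
```

The traversal reads naturally as a graph, while the edge table is what a relational or columnar engine would scan, so the same data plausibly lives in more than one primitive.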

Input from another model (segment/cluster)
Engineer features (avg, ratios)
Build graph with features from frame
Run a graph-based classifier (e.g. loopy belief propagation, LBP)
Pull results back to frame to get model performance stats

Learning #2: The primitives are not used in isolation.
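The frame-to-graph-and-back workflow above can be sketched end to end in plain Python. Everything here is illustrative: the customer rows, the links, and the held-out labels are hypothetical, and the classifier is a simple weighted label propagation swapped in for LBP to keep the example short.

```python
# Hypothetical frame: one row per customer; "label" is known for some rows only.
frame = [
    {"id": "a", "spend": 120.0, "visits": 4, "label": 1},
    {"id": "b", "spend": 90.0, "visits": 3, "label": None},
    {"id": "c", "spend": 10.0, "visits": 5, "label": 0},
    {"id": "d", "spend": 15.0, "visits": 5, "label": None},
]
links = [("a", "b"), ("b", "d"), ("c", "d")]  # hypothetical relationships
held_out = {"b": 1, "d": 0}  # hypothetical ground truth, used only for scoring

# 1. Engineer a feature in the frame (a ratio).
for row in frame:
    row["spend_per_visit"] = row["spend"] / row["visits"]

# 2. Build a graph with features from the frame: edge weight = feature similarity.
feature = {row["id"]: row["spend_per_visit"] for row in frame}
neighbors = {row["id"]: [] for row in frame}
for u, v in links:
    weight = 1.0 / (1.0 + abs(feature[u] - feature[v]))
    neighbors[u].append((v, weight))
    neighbors[v].append((u, weight))

# 3. Run a graph-based classifier (weighted label propagation, NOT LBP).
score = {
    row["id"]: 0.5 if row["label"] is None else float(row["label"]) for row in frame
}
unlabeled = [row["id"] for row in frame if row["label"] is None]
for _ in range(20):
    for v in unlabeled:
        total = sum(w for _, w in neighbors[v])
        score[v] = sum(w * score[u] for u, w in neighbors[v]) / total

# 4. Pull results back to the frame and compute a model performance stat.
for row in frame:
    row["predicted"] = int(score[row["id"]] >= 0.5)
scored = [row for row in frame if row["id"] in held_out]
accuracy = sum(row["predicted"] == held_out[row["id"]] for row in scored) / len(scored)
```

Steps 1 and 4 are table operations and steps 2 and 3 are graph operations, which is the point of Learning #2: one workflow crosses primitives more than once.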

Ingest & Clean
Engineer Features
Structure Model
Train Model
Query & Analyze
Learn
Visualize
Pig/MR
PySpark
ETL Tools?
Pig/MR
PySpark
Java, Scala
Giraph
GraphX
(Java, Scala…)
Mahout
MLlib
??
*SQL*
BI tools
PySpark…
Tooling mash-up!

Tools are not used in isolation either. How can we cope with this?

Direction #1: Unify primitives and processing on a workflow-oriented engine

Unification with Apache Spark
•In-memory structures (RDDs) support both table and graph abstractions
•Batch processing and Spark Streaming
Spark: RDDs, Transformations, and Actions
Spark Streaming (real-time): DStreams (streams of RDDs)
Spark SQL: SchemaRDDs
MLlib (machine learning): RDD-Based Matrices
GraphX (graph processing/machine learning): RDD-Based Graphs
Image Source: Databricks

Image Source: GraphX project
•Graph processing engine on Spark
•Supports Pregel-style vertex programming
•View same data as either graphs or collections
GraphX API for Spark
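Pregel-style vertex programming can be illustrated without Spark. The sketch below is plain Python, not the GraphX API: the `pregel` function and its `initial`, `send`, `merge`, and `update` parameters are hypothetical names chosen for this example. It keeps the data as two collections (a vertex table and an edge table) and runs supersteps over them to label connected components.

```python
# The same data viewed as collections: a vertex table and an edge table.
vertices = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
edges = [(1, 2), (2, 3), (4, 5)]


def pregel(vertex_ids, edges, initial, send, merge, update, max_supersteps=50):
    """Run vertex-centric supersteps until no messages are produced."""
    state = {v: initial(v) for v in vertex_ids}
    active = set(vertex_ids)  # every vertex may send in the first superstep
    for _ in range(max_supersteps):
        inbox = {}
        for src, dst in edges:
            for a, b in ((src, dst), (dst, src)):  # treat edges as undirected
                if a in active:
                    msg = send(state[a], state[b])
                    if msg is not None:
                        inbox[b] = msg if b not in inbox else merge(inbox[b], msg)
        if not inbox:
            break
        active = set()
        for v, msg in inbox.items():
            new_value = update(state[v], msg)
            if new_value != state[v]:
                state[v] = new_value
                active.add(v)  # only changed vertices send next time
    return state


# Connected components: each vertex converges to the smallest id it can reach.
components = pregel(
    vertices,
    edges,
    initial=lambda v: v,
    send=lambda mine, theirs: mine if mine < theirs else None,
    merge=min,
    update=min,
)
```

Here `components` maps vertices 1, 2, 3 to component 1 and vertices 4, 5 to component 4. Because the result is an ordinary dictionary keyed by vertex id, it can be joined straight back onto the vertex table.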

Python bindings for Spark (GraphX)
[Architecture diagram; labels: Client, Server, Python, JVM, Py4J, Files, Akka, Python Worker, Pipes, Serialized Python Functions, Results, “Transformations”, “Actions”, “Operations”]

Python bindings for Spark GraphX

Python bindings for Spark GraphX
Coming soon to Apache!
Vertex
•Transformations: filter, mapValues, diff
•Actions: aggregateUsingIndex
•Join Operations: innerJoin, leftJoin
Edge
•Transformations: filter, mapValues, reverse
•Join Operations: innerJoin
Graph
•Property Operators: mapVertices, mapEdges, mapTriplets
•Structural Operators: subgraph, reverse, mask, groupEdges
•Join Operations: joinVertices, outerJoinVertices
•Neighborhood Aggregation: mapReduceTriplets
•Analytics: ALS, SVDPlusPlus, TriangleCount, PageRank, ConnectedComponents, ShortestPaths
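The neighborhood aggregation operator in the list above follows a map-then-reduce pattern over edge triplets: a map function sees each edge together with its two endpoint vertices and emits messages, and a reduce function combines the messages that arrive at the same vertex. The slides do not show the Python binding's signatures, so the sketch below is a plain-Python imitation of those semantics; `map_reduce_triplets` and the triplet field names are hypothetical, and the ages and "follows" edges are made-up data.

```python
def map_reduce_triplets(vertices, edges, map_fn, reduce_fn):
    """Imitate neighborhood aggregation: map over edge triplets, reduce per vertex."""
    result = {}
    for src, dst, attr in edges:
        triplet = {
            "src_id": src,
            "src_attr": vertices[src],
            "dst_id": dst,
            "dst_attr": vertices[dst],
            "attr": attr,
        }
        for vertex_id, msg in map_fn(triplet):
            if vertex_id in result:
                result[vertex_id] = reduce_fn(result[vertex_id], msg)
            else:
                result[vertex_id] = msg
    return result


# Hypothetical data: vertex attribute = age, edge = "src follows dst".
ages = {1: 31, 2: 27, 3: 45}
follows = [(1, 2, "follows"), (3, 2, "follows"), (2, 3, "follows")]

# Send (1, follower_age) to the followed vertex, then sum both components.
follower_stats = map_reduce_triplets(
    ages,
    follows,
    map_fn=lambda t: [(t["dst_id"], (1, t["src_attr"]))],
    reduce_fn=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
average_follower_age = {v: total / count for v, (count, total) in follower_stats.items()}
```

Vertices that receive no messages (vertex 1 here) are simply absent from the result, which is why the join operations in the list matter for merging aggregates back into the graph.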

Direction #1: Spark
•Feature engineering
•Model training
•Limited language binding (Python, R getting better)
•Lacks transactions and model serving

Lacks transactions and model serving... or does it?
Image Source: Crankshaw, D., et al., “The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox,” Cornell University Library Archive, retrieved November 2014
Extending BDAS with Velox:
A UC Berkeley AMPlab project (sponsored in part by Intel)

Direction #2: Unify primitives and processing in relational database

Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Unification within the In-Memory Database (IMDB)
•Index data structure for graph traversal
•Prototyped in SAP HANA distributed columnar IMDB
•Lays foundation for complex graph query and algorithms
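To make the idea of a traversal index over columnar data concrete, here is a minimal sketch that builds a compressed sparse row (CSR) style index over two edge columns and uses it for breadth-first traversal. CSR is a generic technique swapped in for illustration; the slides do not describe the index used in the SAP HANA prototype, and the edge columns and function names below are hypothetical.

```python
from collections import deque

# Hypothetical edge table stored as two aligned columns.
src_col = [0, 0, 1, 2, 2, 3]
dst_col = [1, 2, 3, 3, 4, 4]
num_vertices = 5


def build_csr(src, dst, n):
    """Return (offsets, targets): neighbors of v are targets[offsets[v]:offsets[v + 1]]."""
    order = sorted(range(len(src)), key=lambda i: src[i])  # stable sort by source
    offsets = [0] * (n + 1)
    for s in src:
        offsets[s + 1] += 1  # count the out-degree of each source
    for v in range(n):
        offsets[v + 1] += offsets[v]  # prefix sum turns counts into offsets
    targets = [dst[i] for i in order]
    return offsets, targets


def bfs(offsets, targets, start):
    """Breadth-first traversal that returns the hop distance to each reached vertex."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in targets[offsets[v]:offsets[v + 1]]:
            if u not in depth:
                depth[u] = depth[v] + 1
                queue.append(u)
    return depth


offsets, targets = build_csr(src_col, dst_col, num_vertices)
depths = bfs(offsets, targets, start=0)
```

With the index in place, each neighbor lookup is a contiguous slice of one column instead of a scan over the whole edge table.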

Graph Traversal

Graph Indexing

Graph Traversal Results

•Store graph as a set of nodes and a set of edges
•Relational algebra captures all basic graph operations
•Iterative algorithms captured as driver program that calls stored procedures
Graph Analytics in Relational Databases?
Source: ISTC for Big Data, Alekh Jindal, “Graph Analytics: The New Use Case for Relational Databases,” blog
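The three bullets on this slide can be sketched with Python's built-in `sqlite3` module: the graph is a node table and an edge table, one PageRank step is a single relational statement, and a small driver loop repeats it. SQLite has no stored procedures, so the driver issues plain SQL instead; the schema, the `out_degree` view, the sample edges, and the choice of PageRank are all assumptions made for this example rather than anything taken from the cited blog.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE node (id INTEGER PRIMARY KEY, rank REAL, next_rank REAL);
    CREATE TABLE edge (src INTEGER, dst INTEGER);
    CREATE VIEW out_degree AS SELECT src, COUNT(*) AS n FROM edge GROUP BY src;
    """
)
node_ids = [1, 2, 3]
db.executemany(
    "INSERT INTO node (id, rank) VALUES (?, ?)",
    [(i, 1.0 / len(node_ids)) for i in node_ids],
)
db.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (1, 3), (2, 3), (3, 1)])

# One PageRank step as a relational statement: join edges to source ranks,
# divide by out-degree, and sum per destination node.
STEP = """
    UPDATE node SET next_rank = :base + :damping * COALESCE((
        SELECT SUM(s.rank / d.n)
        FROM edge e
        JOIN node s ON s.id = e.src
        JOIN out_degree d ON d.src = e.src
        WHERE e.dst = node.id
    ), 0.0)
"""

# Driver program: repeat the step, then swap the new ranks in.
damping = 0.85
params = {"base": (1 - damping) / len(node_ids), "damping": damping}
for _ in range(50):
    db.execute(STEP, params)
    db.execute("UPDATE node SET rank = next_rank")

ranks = dict(db.execute("SELECT id, rank FROM node"))
```

Every node in this toy graph has at least one outgoing edge, so the ranks keep summing to 1; a graph with dangling nodes would need an extra redistribution term that this sketch leaves out.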

Source: ISTC for Big Data, Alekh Jindal, “Graph Analytics: The New Use Case for Relational Databases,” blog
Graph Analytics in Relational Databases?
Relational and graphical analysis – better together!

Source: ISTC for Big Data, Alekh Jindal
Expressing Graph in SQL
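The SQL shown on the original slide is not part of this text, so the following is only a hedged illustration of the general idea, using Python's built-in `sqlite3` module and a hypothetical `edge(src, dst)` table: two-hop neighbors are a self-join, directed triangles are a three-way self-join, and reachability is a recursive common table expression.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
db.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (2, 3), (3, 1), (3, 4)])

# Two-hop neighbors of vertex 1: join the edge table to itself.
two_hop = [
    row[0]
    for row in db.execute(
        "SELECT DISTINCT b.dst FROM edge a JOIN edge b ON b.src = a.dst "
        "WHERE a.src = ? ORDER BY b.dst",
        (1,),
    )
]

# Directed triangles: a three-way self-join, counting each cycle once by
# requiring the first vertex to be the smallest of the three.
triangles = db.execute(
    """
    SELECT COUNT(*)
    FROM edge a
    JOIN edge b ON b.src = a.dst
    JOIN edge c ON c.src = b.dst AND c.dst = a.src
    WHERE a.src < a.dst AND a.src < b.dst
    """
).fetchone()[0]

# Reachability from vertex 1: a recursive query; UNION removes duplicates,
# so the recursion terminates even though the graph contains a cycle.
reachable = [
    row[0]
    for row in db.execute(
        """
        WITH RECURSIVE reach(id) AS (
            SELECT ?
            UNION
            SELECT e.dst FROM edge e JOIN reach r ON e.src = r.id
        )
        SELECT id FROM reach ORDER BY id
        """,
        (1,),
    )
]
```

Each query stays inside ordinary relational operators, which is what lets the same engine serve both the graph questions and the surrounding tabular analysis.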

Future Vision – BigDAWG
Real Time Database
BQL – BigDAWG Query Language & Compiler
Analytics Libraries
Hardware Platforms
Applications, Visualization, Languages
“Narrow waist” provides portability
Historical / Analytics Databases
Spill
Stream

Future Vision – BigDAWG
Real Time DBMSs
BQL – BigDAWG Query Language & Compiler
Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching
Languages, e.g, Julia, R, MLbase, GraphLab
SciDB
Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages
TupleWare
Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon
Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT
TileDB
S-Store
“Narrow waist” provides portability
MyriaX
Historical / Analytics DBMSs
Spill
Stream

Direction #2: Relational DB
•Feature engineering
•Transactions and model serving
•Performant model training?
•Just another Spark behind *QL?

Which direction do you favor?
Will the lines blur?

Takeaway from both:
Do all of the parallel distributed processing in one place and work with it through one UI!

Pressing forward with the Intel Analytics Toolkit
Stack layers: HW PLATFORM; FILESYSTEMS AND NOSQL STORAGE; APACHE HADOOP; APACHE SPARK; DATA WRANGLING; MACHINE LEARNING AND STATISTICS; “DATA SCIENCE” REST API
Components: Graphical Algorithms, Classical Algorithms, Graph Construction Tools, Useful String Manipulation, Useful Math Operators
On top of the API: Python Libraries, 3rd Party GUIs/SDKs, Viz Tools, Future Libraries, BI Connectors, Query Interfaces, ...
Intel Analytics Toolkit
Unified UIs across the workflow
Easier feature & model creation
End-to-end graph pipeline
Fully scalable throughout
Multiple data primitives
Optimized for IA

Analyzing the Semantic Web
Reputations: Neutral, Good, Bad, Suspect

Unified programming environment: DEMO

If we are successful...
graphs will become just another big data primitive!


Graph-shaped data is used in product recommendation systems, social network analysis, network threat detection, image de-noising, and many other important applications. A growing number of these applications will benefit from parallel distributed processing for graph feature engineering, model training, and model serving. But today’s graph tools are riddled with limitations and shortcomings, such as a lack of language bindings, streaming support, and seamless integration with other popular data services. In this talk, we’ll argue that the key to doing more with graphs is doing less with specialized systems and more with systems already good at handling data of other shapes. We’ll examine some practical data science workflows to further motivate this argument, and we’ll talk about some of the things that Intel is doing with the open source community and industry to make graphs just another big data primitive.
