On-boarding with JanusGraph Performance

On-boarding with JanusGraph Performance
June 17, 2017
Chin Huang, chhuang@us.ibm.com;github:chinhuang007
Yi-Hong Wang, yh.wang@us.ibm.com;github:yhwang
Ted Chang, htchang@ibm.com;github:tedhtchang
JanusGraph:@JanusGraph

Agenda
Overview – Onboarding with graph performance
JanusGraph performance evaluation scenarios
• Bulk loader performance
• Data import performance
• Query performance
Lessons learned
Q&A

Onboarding with graph performance
Exciting era with many new technologies!!
Onboarding users/developers to graph databases
• Typical focus areas: features and benefits, ease of use, suitability,
extensibility, APIs…
• Performance is one of the most important differentiators for any
application
• Is performance just for system testing?!
• Performance and scalability are key considerations for design,
development, and operations
Journey to JanusGraph with a performance mind!
• Check out graph structures and traversals
• Evaluate reads and writes in high volume
• Can JanusGraph scale out for future data/user growth?
• Look for bottlenecks and provide improvements

JanusGraph performance test environment
Server spec
• Physical servers: x3650 M5, 2 sockets x 14 cores, 384 GB (12 x 32G) memory
• CPU: Intel Xeon Processor E5-2690 v4 14C 2.6GHz 35MB Cache 2400MHz
• Network interface: Emulex VFA5.2 ML2 Dual Port 10GbE SFP+ Adapter
• Disk: 720 GB SSD, RAID 5
• Operating system: Ubuntu 16.04.2 LTS
Existing tools
• jMeter - load testing tool
• nmon, nmon analyser - system performance monitor and analyze tool
• VisualVM - all-in-one Java troubleshooting/profiling tool
• GCeasy - garbage collection log analysis tool
Home grown tools
• Graph schema loader, data generator, batch importer, batch requester

JanusGraph performance tool - Graph schema loader
Enable the graph model creation via the
gremlin console or embedded in java
• Use JSON to describe your graph model
• Support:
• Property
• Vertex
• Edge
• Index
Benefit: Create schema on-the-fly
without single line of code!
https://github.com/yhwang/janusgraph-utils

Bulk load performance – Use case and data
Data Migration
OneTimeBulkLoader
Batch Update
IncrementalBulkLoader
011110100101100101Gryo:
{“id”: 1, “label”:…}GraphSON:
1:person:marko:29Script:
Supported Formats
• OneTimeBulkLoader
• 128GB GraphSON file
• 31 million vertices
• 38 million edges
• 3277 propertyKeys
• 5 vertex labels
• 3 edge labels
• 78.9 properties per edge
• 18.7 properties per vertex

Bulk load performance – Topology
HDFS
• Spark - 1.6.1
• Standalone Cluster
• 2 worker nodes
• 8 executors per node
• 8 cores per executor
• 2GB per executor
• Hadoop - 2.7.2
• Use HDFS to store the GraphSON file
• Cassandra - 2.1.17
• 2-node cluster
• Tinkerpop3 – 3.2.3
• GraphComputer
• JanusGraph – 0.1.1
• JanusGraphBulkLoaderVertexProgram
• Astyanax persistence provider
Worker Node
Cassandra
BulkLoader
+ + HDFS client
Executor
X 8
Worker Node
Executor X 8
Cassandra
Cluster Master

Bulk load performance – results
• Vertex:
• 31,594,277
• 19 mins
• 495 records/sec per core
• Edge:
• 38,322,731
• 24.8 mins
• 461.8 records/sec per core
0
10
20
30
40
50
60
70
80
90
100
CPU(%)
Node1 Node2
0
10
20
30
40
50
60
70
80
90
100
CPU(%)
Node1 Node2

Data import performance – use case and data
Synthetic Data
Small Medium Large XLarge 10x Properties 50x Properties 100x Properties
Vertices(Million) 0.3 3 30 30 3 3 3
Edges (Million) 0.3 3 30 300 3 3 3
PropertyKeys 7 7 7 7 70 350 700
Vertex labels 3 3 3 3 3 3 3
Edge labels 2 2 2 2 2 2 2
Public Data
Wikimedia votes Higgs Twitter Panama Papers
Vertices(Million) 0.007 0.456 1.04
Edges (Million) 0.1 16 1.53
PropertyKeys 0 2 22
Vertex labels 1 1 5
Edge labels 1 4 261

Data import performance – topology and configuration
All-In-One-Node
Cassandra
Batch Importer
+
CSV Data Generator
+
JanusGraph configuration:
storage.backend=astyanax
ids.block-size = 500000
storage.buffer-size = 2560
storage.batch-loading = true
schema.default = none
BatchImporter configuration:
commit size = 100
worker target size = 10000
Schema Loader
+

Data import performance tooling- Graph data generator
A Java application
• Vertices and edges labels
• Number of vertices and edges
• Number of properties and data types
• Native and mixed index
• Relations patterns
• Super-nodes
• Generate graph-db schema in JSON
• Generate datamap JSON for BatchImporter
https://github.ibm.com/htchang/JanusGraphBench

Data import performance tooling - Graph data batch importer
Java application to Import CSV data into JanusGraph
Features:
• Multiple Threads
• Worker record size
• Commit size
• Import schema
• Import CSV to JanusGraph with configurable data mapping
https://github.com/sdmonov/JanusGraphBatchImporter

Data import performance – results
0.2 1
10 10
0.2 2
25
241
55
70
73 73
0
10
20
30
40
50
60
70
80
0
50
100
150
200
250
300
CPU%
ImportTime(min)
Size of DB
Batch Import Time V.S. # of Records
Vertex Import(min) Edges Import(min) CPU%
1,648
319
35 10
788
450
133
52
70
80
90 90
0
10
20
30
40
50
60
70
80
90
100
0
200
400
600
800
1,000
1,200
1,400
1,600
1,800
medium (8mil) 10x Properties(80mil) 50x Properties(400mil) 100x Properties(800mil)
CPU%
records/sec
Size of DB
Insert rate V.S. # of Properties Per Record
Vertex/sec/core Edges/sec/core CPU%

Data query performance – use case and data
Flight search
• All flights from airport A to airport B on a given date and time
• # of stops: non-stop, one-stop, two-stop…
Data spec
• 600+ airports, 350K+ flight schedules
Performance analysis
• How many requests per second can JanusGraph handle?
• Can JanusGraph scale with future volume growth?

Data query performance - Topology and configuration
JanusG
raph
Server
ElasticSearchCassandra
Storage Backend Node Index Backend Node
JanusGraph Node 1 JanusGraph Node 2
JanusG
raph
Server
JanusG
raph
Server
JanusG
raph
Server
JanusG
raph
Server
JanusG
raph
Server
Load Driver Node
jMeter (thread groups)
REST Calls, http post
• JanusGraph server with REST
• 1 or 10 instances per server
• Astyanax persistence provider
• threadPoolBoss: 2
• threadPoolWorker: 20
• Java heap: -Xms512m -Xmx8G
• Concurrent threads (users): 1, 5, 10, 20,
40, 100, 200
• Think time: 0 ms
• Run duration: 5 minutes
• Multiple test configurations
• 10 instances on 1 node
• 20 instances on 2 nodes
• 30 instances on 3 nodes
. . . . . . . .. .

Data query performance – Non-stop flights (one level deep traversals)
Response Time
Concurrent threads Concurrent threads
TPS
milliseconds
transactions
Performs well regardless number of instances and nodes

Data query performance – One-stop flights (two levels deep traversals)
People would like to see more than just non stop flights…
Response Time
TPS
milliseconds
transactions

Data query performance – Two-stop flights (three levels deep traversals)
The query gets complicated because we need to operate and filter on
multiple vertices and edges.
Response Time
TPS
milliseconds
transactions

Lessons Learned
Model your graph database for performance
• Data is yours. Design the data model for your use cases!
• What kind of queries you want to support? How many levels deep into a traversal?
• Consider denormalization…
• Design and use indexes, graph indexes and vertex-centric indexes in JanusGraph, for better performance, but
not over-use indexes
• It is recommended to create the complete data model before inserting content
Use batch commits with caution
• Batch commits allow multiple transactions to be committed together. The batch size affects performance and the
optimal size depends on the characteristics of data.
• Need to handle conflicts for inserts and updates in a multi-threads/multi-clients implementation
• Make sure the commit is completed and closed

Lessons Learned
Fine-tune for your workloads and systems
• JanusGraph supports storage and index backends therefore tune your backends!
• JanusGraph server configurations, such as threadPoolBoss and threadPoolWorker
• JVM configurations, such as Xms (initial and minimum Java heap size) and Xmx (maximum Java heap size)
You don’t want to see the annoying java.lang.OutOfMemoryError exceptions  But at the same time an
oversized Xmx has negative impact on performance due to long and slower GCs.
• Use multiple threads and/or instances to your system’s capacity
• Next step… consider cloud and auto-scaling
• Be thorough and be patient because it will take a few iterations
• Just like a fine-tuned instrument, you will enjoy the beautiful music for a long time!

Compose for JanusGraph
What is it?
• Compose is an open-source database hosting provider
• Supports backups, monitoring, performance tuning, and a full-suite of deployment management tools backed
by a 24x7 support and operations team
• Offers JanusGraph technology with Scylla database
• https://www.compose.com/janusgraph

Thank you for keeping performance in mind !!
Chin Huang, chhuang@us.ibm.com;github:chinhuang007
Yi-Hong Wang, yh.wang@ibm.com;github:yhwang
Ted Chang, htchang@ibm.com;github:tedhtchang
What’s next?
The journey continues…
• Find ways to improve JanusGraph performance
• Join us if you are interested in graph performance
• Work with us if you have graph datasets
• Talk to us if you have any comments or suggestions

On-boarding with JanusGraph Performance

Related slideshows

More Related Content

On-boarding with JanusGraph Performance