This document discusses graph databases and lessons learned from building an application using Neo4j. It defines what a graph database is, describes how Neo4j works and common usage patterns. It then outlines several key lessons learned, such as using unique relationship types, caching statistics, representing history as event nodes, modeling objects as nodes not relationships, and connecting data through relationships rather than properties.
Clustering is often an essential first step in data mining, used to reduce redundancy or to define data categories. Hierarchical clustering, a widely used clustering technique, offers a richer representation by suggesting potential group structures. However, parallelizing such an algorithm is challenging because the hierarchical tree construction exhibits inherent data dependencies. In this paper, we design a parallel implementation of single-linkage hierarchical clustering by formulating it as a minimum spanning tree problem. We further show that Spark is a natural fit for parallelizing the single-linkage clustering algorithm due to its natural expression of iterative processes. Our algorithm can be deployed easily in Amazon's cloud environment, and a thorough performance evaluation on Amazon EC2 verifies that the scalability of our algorithm is sustained as the datasets scale up.
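The reduction the abstract describes can be illustrated in a few lines of plain Python (a minimal sketch of the idea, not the paper's Spark implementation): build the minimum spanning tree of the complete distance graph, then cut the k-1 longest MST edges; the resulting connected components are exactly the single-linkage clusters at that level.

```python
import math

def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns MST edges as (distance, i, j) tuples."""
    n = len(points)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n      # (distance to tree, parent vertex)
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        # pick the non-tree vertex closest to the tree
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] != -1:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    return edges

def single_linkage(points, k):
    """Keep the n-k shortest MST edges (i.e. cut the k-1 longest);
    connected components of what remains are the k clusters."""
    kept = sorted(prim_mst(points))[: len(points) - k]
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]
```

For example, `single_linkage([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` separates the two obvious pairs into two clusters.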
This document summarizes a student's submission of Peer Assessment 1 for a course on reproducible research. The analysis involves loading step count data, then calculating and visualizing: [1] the total number of steps taken each day, [2] the average steps by 5-minute interval across all days, and [3] whether there are differences in activity levels between weekdays and weekends. Code is shown for each step of loading data, performing calculations, and creating visualizations using R Markdown, knitr, dplyr and ggplot2.
Mack Hardy (@mackaffinity) from Affinity Bridge (@affinitybridge) discusses server-side mapping tools for Drupal: using PostGIS as a spatial backend, generating tiles, and managing large sets of geodata and displaying them in the Drupal CMS.
This document provides an overview of Spark concepts and techniques for machine learning, including naive Bayes classification, word2vec, k-means clustering, and semi-supervised learning. It discusses using RDD operations like map, reduceByKey, and treeAggregate for counting word frequencies. It also covers configuring PySpark memory and using the EM algorithm to incorporate unlabeled data into naive Bayes classification.
This document provides an overview of Spark SQL and its architecture. Spark SQL allows users to run SQL queries over SchemaRDDs, which are RDDs with a schema and column names. It introduces a SQL-like query abstraction over RDDs and allows querying data in a declarative manner. The Spark SQL component consists of Catalyst, a logical query optimizer, and execution engines for different data sources. It can integrate with data sources like Parquet, JSON, and Cassandra.
Title: Neo4j: The World's Leading Graph DB Speaker: George Eleftheriadis (https://gr.linkedin.com/in/george-eleftheriadis-4526ba51/) Date: Monday, April 18, 2016 Event: https://meetup.com/Athens-Big-Data/events/229812890/
NoSQL Graph Databases - Why, When, and Where We Should Use Them. Graph DB - The New Era of Understanding Data
This document summarizes machine learning concepts in Spark. It introduces Spark, its components including SparkContext, Resilient Distributed Datasets (RDDs), and common transformations and actions. Transformations like map, filter, join, and groupByKey are covered. Actions like collect, count, reduce are also discussed. A word count example in Spark using transformations and actions is provided to illustrate how to analyze text data in Spark.
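The word count example mentioned above can be mimicked in plain Python to show the same dataflow (a sketch of the logic only; in Spark these stages would be the RDD `flatMap`, `map`, and `reduceByKey` calls):

```python
from itertools import groupby

def word_count(lines):
    # flatMap: split each line into words
    words = [w for line in lines for w in line.split()]
    # map: pair each word with an initial count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: group pairs by word, then sum the counts in each group
    pairs.sort(key=lambda kv: kv[0])
    return {k: sum(c for _, c in grp) for k, grp in groupby(pairs, key=lambda kv: kv[0])}
```

For example, `word_count(["to be or not to be"])` yields `{"be": 2, "not": 1, "or": 1, "to": 2}`.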
Why is R popular for data science? The document covers its key packages for each step of the data mining process. Examples with R code are included.
The document summarizes key Spark API operations including transformations like map, filter, flatMap, groupBy, and actions like collect, count, and reduce. It provides visual diagrams and examples to illustrate how each operation works, the inputs and outputs, and whether the operation is narrow or wide.
R is a powerful language for data analysis and visualization. Some key advantages of R include its data-centric approach, large collection of packages, and powerful data visualization capabilities like ggplot2. The document discusses various R concepts like its functional programming style, object-oriented programming using S3 classes, and non-standard evaluation. It also provides examples of how to access R functions and libraries from Python using rpy2.
R is a language and environment for statistical computing and graphics. It is based on S, an earlier language developed at Bell Labs. R features include being cross-platform, open source, having a package-based repository, strong graphics capabilities, and active user and developer communities. Useful URLs and books for learning R are provided. Instructions for installing R and RStudio on different platforms are given. R can be used for a wide range of statistical analyses and data visualization.
The document discusses Spark GraphX and Pregel algorithms for graph processing. It introduces GraphX and how it represents graphs as RDDs. It then covers the feedback vertex set algorithm for finding cycles in a graph and how it can be parallelized in GraphX. Finally, it discusses Pregel and how it allows for large-scale distributed graph computations through message passing between vertices.
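The Pregel model described above can be sketched in a few lines of plain Python (a toy illustration of the superstep/message-passing idea under stated assumptions, not GraphX's actual `Pregel` API): in each superstep every vertex sends its value to its neighbours, then adopts the maximum of the messages it received, halting when nothing changes.

```python
def pregel_max(edges, values, max_supersteps=20):
    """Toy Pregel-style computation: propagate the maximum vertex value.
    edges: undirected (u, v) pairs; values: {vertex: initial value}.
    Converges to the component-wide maximum at every vertex."""
    values = dict(values)
    for _ in range(max_supersteps):
        # message phase: each vertex sends its value along every incident edge
        inbox = {}
        for u, v in edges:
            for src, dst in ((u, v), (v, u)):
                if dst not in inbox or values[src] > inbox[dst]:
                    inbox[dst] = values[src]
        # compute phase: each vertex merges incoming messages into its state
        new_values = {v: max(val, inbox.get(v, val)) for v, val in values.items()}
        if new_values == values:   # all vertices vote to halt
            break
        values = new_values
    return values
```

On the path graph `1-2-3` with values `{1: 3, 2: 6, 3: 2}`, every vertex ends up holding 6 after two supersteps.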
The document discusses the MapReduce framework in Hadoop for processing large amounts of structured and unstructured data in parallel across clusters. It describes how MapReduce works by splitting input, mapping tasks, shuffling, and reducing results. It also explains the HDFS architecture with NameNode, DataNodes, and block replication. Finally, it outlines the overall Hadoop architecture including JobClient, JobTracker, TaskTracker, and their roles in managing jobs.
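The split/map/shuffle/reduce flow described above can be simulated in-memory (a minimal single-process sketch, not Hadoop itself; the `mapper`/`reducer` signatures follow the classic MapReduce contract):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer, n_splits=2):
    # Split: partition the input, as HDFS splits a file into blocks
    splits = [records[i::n_splits] for i in range(n_splits)]
    # Map: each map task emits (key, value) pairs for its split
    emitted = [kv for split in splits for rec in split for kv in mapper(rec)]
    # Shuffle: group all intermediate values by key
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    # Reduce: one reducer call per key produces the final output
    return {key: reducer(key, vals) for key, vals in groups.items()}
```

For example, the classic max-temperature-per-year job is `map_reduce(readings, lambda rec: [rec], lambda year, temps: max(temps))`.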
Overview of the Doradus database open source project and the Cassandra database on which it is based. This presentation was given to the Orange County Big Data Meetup group on July 16, 2014.
The most popular batch processing framework is Apache Hadoop's MapReduce, a Java-based system for processing large datasets in parallel. It reads data from HDFS and divides the dataset into smaller pieces.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It consists of HDFS for distributed storage and MapReduce for distributed processing. HDFS stores large files across multiple machines and provides high throughput access to application data. MapReduce allows processing of large datasets in parallel by splitting the work into independent tasks called maps and reduces. Companies use Hadoop for applications like log analysis, data warehousing, machine learning, and scientific computing on large datasets.
Rattle is Free (as in Libre) Open Source Software and the source code is available from the Bitbucket repository. We give you the freedom to review the code, use it for whatever purpose you like, and to extend it however you like, without restriction, except that if you distribute your changes you must also distribute your source code. Rattle - the R Analytical Tool To Learn Easily - is a popular GUI for data mining using R. It presents statistical and visual summaries of data, transforms data so it can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets. One of its most important features (in my view) is that all of your interactions through the graphical user interface are captured as an R script that can be readily executed in R independently of the Rattle interface. Rattle clocks between 10,000 and 20,000 installations per month from the RStudio CRAN node (one of over 100 nodes) and has been downloaded several million times overall.
Big Data with Hadoop & Spark Training: http://bit.ly/2L4rPmM This CloudxLab Basics of RDD tutorial helps you to understand Basics of RDD in detail. Below are the topics covered in this tutorial: 1) What is RDD - Resilient Distributed Datasets 2) Creating RDD in Scala 3) RDD Operations - Transformations & Actions 4) RDD Transformations - map() & filter() 5) RDD Actions - take() & saveAsTextFile() 6) Lazy Evaluation & Instant Evaluation 7) Lineage Graph 8) flatMap and Union 9) Scala Transformations - Union 10) Scala Actions - saveAsTextFile(), collect(), take() and count() 11) More Actions - reduce() 12) Can We Use reduce() for Computing Average? 13) Solving Problems with Spark 14) Compute Average and Standard Deviation with Spark 15) Pick Random Samples From a Dataset using Spark
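Item 12 in the list above asks whether `reduce()` alone can compute an average: it cannot, because averaging is not associative. The standard fix, sketched here in plain Python (mirroring what one would do with Spark's `aggregate()`, not CloudxLab's actual code), is to reduce over an associative triple of (count, sum, sum of squares), from which both the mean and the standard deviation fall out:

```python
import math
from functools import reduce

def mean_stddev(xs):
    """Single-pass mean and (population) standard deviation via an
    associative combine: carry (count, sum, sum of squares)."""
    n, s, sq = reduce(
        lambda acc, x: (acc[0] + 1, acc[1] + x, acc[2] + x * x),
        xs,
        (0, 0.0, 0.0),
    )
    mean = s / n
    return mean, math.sqrt(sq / n - mean * mean)
```

Because the combine step is associative, partitions can be reduced independently and merged, which is exactly why this shape parallelizes in Spark while a naive running average does not.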
This document provides an agenda for a session on reporting and analytics options in MongoDB, including Map Reduce, the Aggregation Framework, and examples using geospatial and text search features. It discusses building reports in an application, tuning aggregation pipelines with explain plans, and computing aggregations on the fly or pre-computing and storing them. The next session will cover operational topics like scaling out, high availability, production preparation, and sizing.
Talk given at ClojureD conference, Berlin. Apache Spark is an engine for efficiently processing large amounts of data. We show how to apply the elegance of Clojure to Spark - fully exploiting the REPL and dynamic typing. There will be live coding using our gorillalabs/sparkling API. In the presentation, we will of course introduce the core concepts of Spark, like resilient distributed datasets (RDDs). And you will learn how the Spark concepts resemble those well known from Clojure, like persistent data structures and functional programming. Finally, we will provide some Do's and Don'ts for you to kick off your Spark program based upon our experience. About Paulus Esterhazy and Christian Betz: Having been a LISP hacker for several years, and a Java guy for some more, Chris turned to Clojure for production code in 2011. He's been Project Lead, Software Architect, and VP Tech in the meantime, interested in AI and data visualization. Now, working on the heart of data-driven marketing for Performance Media in Hamburg, he turned to Apache Spark for some Big Data jobs. Chris released the API wrapper 'chrisbetz/sparkling' to fully exploit the power of his compute cluster. Paulus is a philosophy PhD turned software engineer with an interest in functional programming and a penchant for hammock-driven development. He currently works as Senior Web Developer at Red Pineapple Media in Berlin.
This document summarizes Jeff Thompson's contributions of Spark API visual diagrams to the Spark community under an open source license. It also describes how Databricks further developed the diagrams and commissioned Adam Breindel for this work. The document briefly mentions Databricks' background and products.
The document discusses using MapReduce and NoSQL databases like MongoDB and Accumulo to solve challenges of analyzing large datasets by allowing distributed processing and incremental updates compared to traditional analytical systems. It provides examples of using MapReduce on MongoDB and Accumulo to perform analytics and maintain running aggregates or results. The document also discusses tradeoffs between different approaches and best practices for optimizing performance when using MapReduce and NoSQL databases together.
This document provides an introduction to Apache Spark, including its core components, architecture, and programming model. Some key points: - Spark uses Resilient Distributed Datasets (RDDs) as its fundamental data structure, which are immutable distributed collections that allow in-memory computing across a cluster. - RDDs support transformations like map, filter, reduce, and actions like collect that return results. Transformations are lazy while actions trigger computation. - Spark's execution model involves a driver program that coordinates tasks on worker nodes using an optimized scheduler. - Spark SQL, MLlib, GraphX, and Spark Streaming extend the core Spark API for structured data, machine learning, graph processing, and stream processing
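The lazy-transformation/eager-action distinction called out above can be illustrated with Python generators (an analogy only; real RDDs additionally record lineage for fault tolerance and distribute the work):

```python
def build_pipeline(data):
    # "Transformations" only describe the computation; nothing runs yet.
    doubled = (x * 2 for x in data)              # analogous to rdd.map(lambda x: x * 2)
    kept = (x for x in doubled if x % 4 == 0)    # analogous to rdd.filter(...)
    return kept

pipeline = build_pipeline(range(5))   # lazy: no elements have been processed
result = list(pipeline)               # the "action" forces evaluation, like collect()
```

Here `result` is `[0, 4, 8]`: nothing is computed until `list()` (the action) pulls values through the whole chained pipeline, just as a Spark job only runs when an action is invoked.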
Presented at Gartner Data & Analytics, London, May 2024. BT Group has used the Neo4j graph database to enable impressive digital transformation programs over the last six years. By re-imagining their operational support systems to adopt self-serve and data-led principles, they have substantially reduced the number of applications and the complexity of their operations. The result has been a substantial reduction in risk and costs while improving time to value, innovation, and process automation. Join this session to hear their story, the lessons they learned along the way, and how their future innovation plans include exploring uses of EKG + generative AI.
Gursev Pirge, PhD Senior Data Scientist - JohnSnowLabs
Tomaz Bratanic Graph ML and GenAI Expert - Neo4j
Katja Glaß, OpenStudyBuilder Community Manager - Katja Glaß Consulting; Marius Conjeaud, Principal Consultant - Neo4j
Dmitrii Kamaev, PhD Senior Product Owner - QIAGEN
Workshop - Graph Application Architecture. Take part in this hands-on workshop led by Neo4j experts, who will guide you through discovering contextual intelligence. Using a real-world dataset, we will build a graph solution step by step: from designing the graph data model to running queries and visualizing the data. The approach is applicable to many use cases and industries.