This document provides an overview of graph databases and Neo4j. It defines what a graph is mathematically and in the context of databases. It describes the key components of Neo4j including nodes, relationships, properties, labels, paths, traversals, and indexes. It also discusses the Cypher query language, performance advantages of Neo4j over SQL databases, and basic requirements and licensing options.
Introduction to Spark Datasets - Functional and relational together at last
Spark Datasets are an evolution of Spark DataFrames that let us apply both functional and relational transformations to big data with the speed of Spark.
Beyond Wordcount with Spark Datasets (and scaling) - Nike PDX Jan 2018
The document discusses Apache Spark Datasets and how they compare to RDDs and DataFrames. Some key points:
- Datasets provide better performance than RDDs due to a smarter optimizer, more efficient storage formats, and faster serialization. They also offer simplicity advantages over RDDs for things like windowed operations and multi-column aggregates.
- Datasets allow mixing of functional and relational styles more easily than RDDs or DataFrames. The optimizer has more information from Datasets' schemas and can perform optimizations like partial aggregation.
- Datasets address some of the limitations of DataFrames, making it easier to write UDFs and handle iterative algorithms. They provide a typed API, compared to the untyped API of DataFrames.
- Spark ML pipelines involve estimators that are trained on datasets to produce immutable transformers.
- A transformer must define transformSchema() to validate the input schema, transform() to do the work, and copy() for cloning.
- Configurable transformers take parameters like inputCol and outputCol to allow configuration for meta algorithms.
- Estimators are similar but fit() returns a model instead of directly transforming.
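The estimator/transformer pattern in the bullets above can be sketched in plain Python. This is an illustrative stand-in, not the real Spark ML API; the method names (transform_schema, transform, copy, fit) simply mirror the description:

```python
# Plain-Python sketch of the estimator/transformer pattern described
# above. Names mirror Spark ML's shape, but this is NOT the real API.

class WordCountTransformer:
    """A 'transformer', configurable via inputCol/outputCol-style params."""
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def transform_schema(self, schema):
        # Validate the input schema and describe the output schema.
        if self.input_col not in schema:
            raise ValueError(f"missing column: {self.input_col}")
        return schema + [self.output_col]

    def transform(self, rows):
        # Do the work: add a word-count column (rows are dicts here).
        return [{**r, self.output_col: len(r[self.input_col].split())}
                for r in rows]

    def copy(self):
        return WordCountTransformer(self.input_col, self.output_col)


class MeanModel:
    """The immutable 'model' a fitted estimator produces."""
    def __init__(self, col, mean):
        self.col, self.mean = col, mean

    def transform(self, rows):
        return [{**r, "centered": r[self.col] - self.mean} for r in rows]


class MeanEstimator:
    """An 'estimator': fit() is trained on a dataset and returns a model."""
    def __init__(self, input_col):
        self.input_col = input_col

    def fit(self, rows):
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return MeanModel(self.input_col, mean)
```

The key point the bullets make is visible here: the estimator itself never transforms data; fitting it produces a separate, immutable model that does.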
This session will cover our own and the community's experiences scaling Spark jobs to large datasets, and the resulting best practices, along with code snippets to illustrate them.
The planned topics are:
Using Spark counters for performance investigation
Spark collects a large number of statistics about our code, but how often do we really look at them? We will cover how to investigate performance issues and figure out where to best spend our time using both counters and the UI.
Working with Key/Value Data
Replacing groupByKey for awesomeness
groupByKey makes it too easy to accidentally collect individual records that are too large to process. We will talk about how to replace it in different common cases with more memory-efficient operations.
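The difference can be sketched in plain Python (a conceptual model, not Spark itself): a groupByKey-style operation materializes every value for a key before anything is reduced, while a reduceByKey-style combine folds each value into a running result as it arrives.

```python
from collections import defaultdict

# Conceptual sketch (plain Python, not Spark): why combining per key
# is more memory-friendly than grouping per key.

def group_by_key(pairs):
    # Materializes EVERY value per key before any reduction --
    # one very hot key can blow up memory.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, combine):
    # Folds each value into a running result -- only one value per
    # key is held at a time, no matter how skewed the data is.
    acc = {}
    for k, v in pairs:
        acc[k] = combine(acc[k], v) if k in acc else v
    return acc

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]
# Both approaches yield the same per-key sums, but the second never
# builds the full [1, 3, 4] list for key "a".
```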
Effective caching & checkpointing
Being able to reuse previously computed RDDs without recomputing can substantially reduce execution time. Choosing when to cache, checkpoint, or what storage level to use can have a huge performance impact.
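The trade-off can be illustrated with a toy example (plain Python, not Spark's persist/checkpoint machinery): once a result is cached, repeated reuse no longer pays the recomputation cost.

```python
import functools

# Toy illustration of the reuse-vs-recompute trade-off. In Spark the
# analogous choice is rdd.cache()/persist() vs recomputing lineage.

calls = {"n": 0}

def expensive(x):
    calls["n"] += 1          # stands in for an expensive recomputation
    return x * x

# Wrapping with a cache means repeated use computes the result once.
cached_expensive = functools.lru_cache(maxsize=None)(expensive)
```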
Considerations for noisy clusters
Functional transformations with Spark Datasets
How to keep some of the benefits of Spark’s DataFrames while retaining the ability to work with arbitrary Scala code
Big Data Processing using Apache Spark and Clojure
Talk given at ClojureD conference, Berlin
Apache Spark is an engine for efficiently processing large amounts of data. We show how to apply the elegance of Clojure to Spark - fully exploiting the REPL and dynamic typing. There will be live coding using our gorillalabs/sparkling API.
In the presentation, we will of course introduce the core concepts of Spark, like resilient distributed datasets (RDDs). You will also learn how Spark’s concepts resemble ones well known from Clojure, like persistent data structures and functional programming.
Finally, we will provide some Do’s and Don’ts for you to kick off your Spark program based upon our experience.
About Paulus Esterhazy and Christian Betz
A Lisp hacker for several years, and a Java guy for some more, Chris turned to Clojure for production code in 2011. He has been Project Lead, Software Architect, and VP Tech along the way, and is interested in AI and data visualization.
Now, working on the heart of data driven marketing for Performance Media in Hamburg, he turned to Apache Spark for some Big Data jobs. Chris released the API-wrapper ‘chrisbetz/sparkling’ to fully exploit the power of his compute cluster.
Paulus Esterhazy
Paulus is a philosophy PhD turned software engineer with an interest in functional programming and a penchant for hammock-driven development.
He currently works as Senior Web Developer at Red Pineapple Media in Berlin.
A super fast introduction to Spark and glance at BEAM
Apache Spark is one of the most popular general-purpose distributed systems, with built-in libraries to support everything from ML to SQL. Spark has APIs across languages including Scala, Java, Python, and R -- with more third-party language support (like Julia & C#). Apache BEAM is a cross-platform tool for building on top of different distributed systems, but it is in its early stages. This talk will introduce the core concepts of Apache Spark and look to the potential future of Apache BEAM.
Apache Spark has two core abstractions for representing distributed data and computations. This talk will introduce the basics of RDDs and Spark DataFrames & Datasets, and Spark’s method for achieving resiliency. Since it’s a big data talk, we will include the almost-required wordcount example, and end the Spark part with follow-up pointers on Spark’s new ML APIs. For folks who are interested, we’ll then talk a bit about portability, and how Apache BEAM aims to improve portability (as well as its unique approach to cross-language support).
Slides from Holden's talk at https://www.meetup.com/Wellington-Data-Scaling-Chats/events/mdcsdpyxcbxb/
MongoDB is a document-oriented NoSQL database that uses a document-data model. It provides horizontal scaling with auto-sharding and replication. MongoDB can store documents in collections without a defined schema and support dynamic queries and indexing. RealNetworks uses MongoDB with Scala and other technologies for an Android app to send notifications to devices with installed RealNetworks applications at scale.
Scaling with apache spark (a lesson in unintended consequences) strange loo...
This document discusses scaling Apache Spark applications and some of the unintended consequences that can arise. It covers Spark's core abstractions of RDDs and DataFrames for distributed data and computation. It explains how Spark's lazy evaluation model and use of deterministic partitioning can impact reusing data and operations like groupByKey. It also discusses challenges that can arise from Spark's support for arbitrary functions and working with non-JVM languages like Python.
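Two of the ideas in that summary, lazy evaluation and deterministic partitioning, can be sketched in plain Python (a conceptual model, not Spark itself):

```python
# Plain-Python sketch of two Spark ideas mentioned above.

def lazy_map(f, data):
    # Nothing runs until the result is consumed -- like an RDD
    # transformation, which only executes when an action is called.
    for x in data:
        yield f(x)

def partition_for(key, num_partitions):
    # Deterministic partitioning: the same key always lands in the
    # same partition, which is what makes shuffle output reusable.
    return hash(key) % num_partitions
```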
Streaming ML on Spark: Deprecated, experimental and internal APIs galore!
Slides from: https://www.meetup.com/Sydney-Apache-Spark-User-Group/events/246892684/
Welcome to the first Sydney Spark Meetup in 2018!
We are very glad to have visiting Apache Spark committer Holden Karau give a talk on streaming machine learning. Title: Streaming ML w/Spark (and why it's a bit painful today & #workingonit)
Apache Spark is one of the most popular distributed systems, and it has built in libraries for both machine learning and streaming. This talk will cover Spark's two streaming libraries, look at the future, and how to make streaming ML work today (for both serving and prediction). If you aren't familiar with Spark, that's ok! We'll spend the first ~5 minutes covering just enough to get through the rest of the talk, and for those of you already familiar you can spend those ~5 minutes downloading the sample code :)
About Holden:
Holden is a transgender Canadian open source developer advocate @ Google with a focus on Apache Spark, BEAM, and related "big data" tools. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that's a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal.
A couple of us will be at the doors of 60 Margaret St to let people in until 6.10pm.
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
This document provides a summary of a presentation on scaling Apache Spark. It discusses techniques for reusing RDDs through caching, persistence levels and checkpointing. It also covers best practices for working with key-value data to avoid problems from groupByKey, and using Spark SQL and accumulators. Finally, it previews bringing code generation to Spark ML to improve performance.
This document discusses extending Spark ML pipelines with custom estimators and transformers. It begins with an overview of Spark ML and the pipeline API. Then it demonstrates how to build a simple hardcoded word count transformer and configurable transformer. It discusses important aspects like transforming the input schema, parameters, and model fitting. The document provides guidance on configuration, persistence, serving models, and resources for learning more about custom Spark ML components.
Java Performance Tips (SoCal Code Camp San Diego 2014)
Slides for my presentation at SoCal Code Camp, June 29, 2014 (http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=68942cd0-6714-4753-a218-20d4b48da07d)
The document discusses programming techniques for the semantic web including LITEQ, a language for integrating RDF types and queries into programming languages. LITEQ allows programmers to navigate schemas, define types aligned with programming languages, and retrieve typed instances. The document also presents SchemEX, an index for efficiently searching RDF data sources in the linked open data cloud based on their schemas.
These are the slides for the session I presented at SoCal Code Camp San Diego on July 27, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6b28337d-6eae-4003-a664-5ed719f43533
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=8cdfd955-2cd4-44a2-ad08-5353e079685a
Keanu Reeves dropped out of high school at age 17 to become an actor. He had early roles in commercials and television shows in the 1980s before gaining fame for his role in the Matrix trilogy of films. Some key facts about Reeves: he was chosen as one of the 50 most beautiful people by People magazine and ranked 23rd in Empire magazine's top 100 movie stars list. He also plays bass guitar, and film roles inspired his hobbies of horseback riding and surfing.
This document proposes a method for recommending new bands to users based on bands they already like. It involves taking a band a user likes, finding other bands that are connected or related to that band through various sources like being on the same record label, having collaborated, or been reviewed together. These connected bands would then be recommended to the user, ranked by the strength of their aggregate connections. However, the method is critiqued for only considering directly connected bands, ignoring bands' quality, popularity, and how users' tastes may evolve over time.
This document discusses approaches to music recommendation using the Million Song Dataset. It examines using k-means clustering on listening histories to group users and songs, and user-based collaborative filtering to find similar users and make recommendations. K-means achieved a mean average precision of 0.01008 using multiple centroids and modified metadata. Collaborative filtering on 1,000 users achieved 0.00822 precision, improving to 0.1127 on 110,000 users. Future work could include ensemble techniques, additional metadata, and distributed k-means.
Graph Adoption at Gamesys - Toby O'Rourke @ GraphConnect SF 2013
Gamesys is a major online gaming company that handles billions in wagers annually. They built an internal social network graph database using Neo4j to model relationships between players and incentivize referrals. This helped reduce customer acquisition costs. Neo4j provided stable performance with Spring Data for building the application and Cypher for querying and analytics. The graph structure was well-suited to model complex game economies and detect fraud.
This document summarizes a presentation about the graph database Neo4j. The presentation included an agenda that covered graphs and their power, how graphs change data views, and real-time recommendations with graphs. It introduced the presenters and discussed how data relationships unlock value. It described how Neo4j allows modeling data as a graph to unlock this value through relationship-based queries, evolution of applications, and high performance at scale. Examples showed how Neo4j outperforms relational and NoSQL databases when relationships are important. The presentation concluded with examples of how Neo4j customers have benefited.
The document discusses how audio fingerprinting and identification works, as used by apps like Shazam. It covers topics like Fourier transforms, spectrograms, acoustic fingerprints, time-invariant hashing, storing and matching fingerprints to identify songs. The presenter then demonstrates generating fingerprints from songs in a database and using audio input to identify matches in real-time.
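The time-invariant hashing idea can be sketched as follows: pair each spectrogram peak with a few nearby later peaks and hash (freq1, freq2, time delta), which does not depend on where in the track a clip starts. A toy sketch under that assumption (real systems like the one described add amplitude filtering, windowed FFTs, and offset-alignment scoring):

```python
from collections import defaultdict

# Toy peak-pair fingerprinting. peaks: list of (time, freq), by time.

def fingerprints(peaks, fanout=3):
    # Hash each peak against the next few peaks; the (f1, f2, dt)
    # triple is invariant to where the clip starts in the song.
    prints = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fanout]:
            prints.append(((f1, f2, t2 - t1), t1))
    return prints

def build_index(song_peaks):
    # Store every fingerprint hash -> (song, time) for matching.
    index = defaultdict(list)
    for song, peaks in song_peaks.items():
        for h, t in fingerprints(peaks):
            index[h].append((song, t))
    return index

def identify(clip_peaks, index):
    # Vote for the song whose stored hashes best match the clip's.
    votes = defaultdict(int)
    for h, _ in fingerprints(clip_peaks):
        for song, _ in index.get(h, []):
            votes[song] += 1
    return max(votes, key=votes.get) if votes else None
```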
Graph technologies allow modeling of complex relationships and connections through nodes and edges. There are three main layers of graph technologies: graph databases to store graph data, graph analysis frameworks to analyze large graphs, and graph visualization solutions to interact with graphs. Popular tools in each layer include Neo4j and Titan for databases, Giraph and GraphX for analysis, and Gephi and Cytoscape for visualization. Graph technologies are gaining more attention due to their ability to extract insights from connected data.
William Lyon presented on Neo4j 3.0 which introduces a new storage engine allowing unlimited graph size, new language drivers for easier application development, and improved operability for deploying Neo4j in the cloud, containers, and on premises. Key features include the new Bolt binary protocol, Java stored procedures, and an upgraded Cypher query engine with a new cost-based optimizer.
This introduction to graph databases is specifically designed for Enterprise Architects who need to map business requirements to architectural components like graph databases. It explains how and why graphs matter for Enterprise Architecture and reviews the architectural differences between relational and graph models.
This document provides an overview of Neo4j, a graph database management system. It discusses how Neo4j stores data as nodes and relationships, allowing for fast querying of connected data. Traditional relational databases struggle with complex relationships, while NoSQL databases don't support relationships at all. Neo4j addresses these issues through its native graph storage and processing capabilities. The document highlights key Neo4j features like scalability, high performance, and its Cypher query language.
Music Information Retrieval: Overview and Current Trends 2008
The document provides an overview of music information retrieval (MIR), including its applications, history, and techniques. MIR aims to extract semantic information from music to help organize and search large digital music collections. Key points include that MIR techniques analyze low-level audio features and integrate top-down information to determine higher-level attributes like genre, emotion, and similarity. This facilitates applications like music recommendation, identification, and discovery.
These webinar slides are an introduction to Neo4j and Graph Databases. They discuss the primary use cases for Graph Databases and the properties of Neo4j which make those use cases possible. They also cover the high-level steps of modeling, importing, and querying your data using Cypher and touch on RDBMS to Graph.
Intro to Graph Databases Using Tinkerpop, TitanDB, and Gremlin
A quick overview of the history, motivation, and uses of graph modeling and graph databases in various industries. Covers a brief introduction to graph databases with an emphasis on the Tinkerpop stack and Gremlin query language. These concepts are then solidified through a hands-on lab modeling a blog engine using Titan and Gremlin.
See more at http://allthingsgraphed.com.
Working With a Real-World Dataset in Neo4j: Import and Modeling
This webinar will cover how to work with a real-world dataset in Neo4j, with a focus on how to build a graph from an existing dataset (in this case a series of JSON files). We will explore how to import the data into Neo4j efficiently - both for an initial import and when scaling writes for your graph application. We will demonstrate different approaches for data import (neo4j-import, LOAD CSV, and using the official Neo4j drivers), and discuss when it makes sense to use each import technique. If you've ever asked these questions, then this webinar is for you!
- How do I design a property graph model for my domain?
- How do I use the official Neo4j drivers?
- How can I deal with concurrent writes to Neo4j?
- How can I import JSON into Neo4j?
This document provides an overview of graph databases and their use cases. It begins with definitions of graphs and graph databases. It then gives examples of how graph databases can be used for social networking, network management, and other domains where data is interconnected. It provides Cypher examples for creating and querying graph patterns in a social networking and IT network management scenario. Finally, it discusses the graph database ecosystem and how graphs can be deployed for both online transaction processing and batch processing use cases.
This document discusses graph databases and the graph database Neo4j. It provides an introduction to graph databases, explaining that they are well-suited for storing relationships and sparse data. It then discusses Neo4j and its Cypher query language. Examples using GraphGists are provided and use cases and resources for getting started with Neo4j are listed.
This document provides an overview of graph databases. It discusses how graph data is naturally represented as nodes connected by edges, unlike relational databases which require joins. Graph databases allow for fast traversal of connected data and enable querying connected subgraphs. Popular graph database models include property graphs and RDF triple stores. Neo4j is introduced as a widely used graph database management system that uses labels, properties, relationships, and Cypher query language.
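The fast-traversal point can be made concrete: with data stored as adjacency lists (each node pointing directly at its neighbors), finding a connection is a walk rather than a series of relational joins. A plain-Python sketch, not a real graph database engine, with a made-up social graph:

```python
from collections import deque

# Illustrative adjacency-list graph; traversal is pointer-chasing,
# not join computation.
graph = {
    "Alice": ["Bob"],
    "Bob": ["Carol", "Dave"],
    "Carol": [],
    "Dave": ["Erin"],
    "Erin": [],
}

def shortest_path(graph, start, goal):
    # Breadth-first search: each step just follows stored edges.
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None
```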
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
This document discusses property graphs and how they are represented and queried using Morpheus, a graph query engine for Apache Spark.
Morpheus allows querying property graphs using Cypher and represents property graphs using DataFrames, with node and relationship data stored in tables. It integrates with various data sources and supports federated queries across multiple property graphs. The document provides examples of loading property graph data from sources like JSON, SQL databases and Neo4j, creating graph projections, running analytical queries, and recommending businesses based on graph algorithms.
OQGraph 3 is a graph computation engine for MariaDB that allows for representing graphs and hierarchies using plain SQL. It stores graph data in tables but operates differently than typical storage engines, focusing on graph computations rather than data storage and retrieval. Key features include improved performance over previous versions using Judy arrays, and the ability to handle larger graphs by holding only the bitmap array in memory. It represents graph data using nodes and edges stored in tables and allows querying to find paths and perform other graph algorithms.
Druid is an analytics-focused, distributed, scale-out data store. Existing Druid clusters have scaled to petabytes of data and trillions of events, ingesting millions of events every second. Up until version 0.10, Druid could only be queried in a JSON-based language that many users found unfamiliar.
Enter Apache Calcite. It includes an industry-standard SQL parser, validator, and JDBC driver, as well as a cost-based relational optimizer. Calcite bills itself as “the foundation for your next high-performance database” and is used by Hive, Drill, and a variety of other projects. Druid uses Calcite to power Druid SQL, a standards-based query API that vaults Druid out of the NoSQL world and into the SQL world.
Gian Merlino offers an overview of Druid SQL and explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects.
This document discusses processing large graphs. It introduces graph processing with MapReduce and Apache Giraph. MapReduce algorithms for finding triangles and connected components in graphs are described. The limitations of MapReduce for graph processing are discussed. Alternative graph processing technologies including Neo4j, a graph database, are presented. A movie recommendation use case is demonstrated using Neo4j to find similar users and recommend unseen movies.
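For comparison with the MapReduce/Giraph approach mentioned above, connected components has a compact single-machine formulation using union-find, which merges the endpoints of every edge into one component label. A sketch (illustrative, not the distributed algorithm from the document):

```python
# Connected components via union-find. The iterative MapReduce and
# Giraph versions described above converge to the same grouping.

def connected_components(nodes, edges):
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Union the two endpoints of every edge.
    for a, b in edges:
        parent[find(a)] = find(b)

    # Group nodes by their final root.
    comps = {}
    for n in nodes:
        comps.setdefault(find(n), set()).add(n)
    return list(comps.values())
```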
The presentation gives brief information about graph databases and their usage today. It then covers the popular graph database Neo4j and its Cypher query language, which is used to query the graph.
New Features in Neo4j 3.4 / 3.3 - Graph Algorithms, Spatial, Date-Time & Visu...
Highlighting the progress in Neo4j 3.3 and 3.4, especially Neo4j Desktop, Graph Algorithms, NLP, Date-Time, Geospatial, and performance.
Also featuring the new visualization tool Neo4j Bloom.
Neo4j is an open source graph database that uses nodes, relationships, and properties to store and query data. It supports ACID transactions and is high performance. Cypher is Neo4j's query language that allows matching patterns of nodes and relationships. A graph database model uses nodes connected by relationships, unlike a relational database that uses tables and rows.
This document provides an overview of using graphs and hierarchies in SQL databases with OQGRAPH. It discusses how trees and graphs differ, examples of each, and some of the challenges of representing them in relational databases. It then introduces OQGRAPH as a storage engine that can perform graph computations directly in SQL. Key features of OQGRAPH like inserting edges, performing path queries, and joining to other tables are demonstrated. Later versions provide additional optimizations and the ability to use an existing table as the source of edges.
This document discusses Spark, an open-source cluster computing framework. It begins with an introduction to distributed computing problems related to processing large datasets. It then provides an overview of Spark, including its core abstraction of resilient distributed datasets (RDDs) and how Spark builds on the MapReduce model. The rest of the document demonstrates Spark concepts like transformations and actions on RDDs and the use of key-value pairs. It also discusses SparkSQL and shows examples of finding the most retweeted tweet using core Spark and SparkSQL.
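The two examples named in that summary have the same map/reduce shape, which can be sketched in plain Python standing in for Spark transformations (the tweet data here is made up for illustration):

```python
from collections import Counter
from functools import reduce

# Wordcount in the functional style the talk builds on.
lines = ["to be or not", "to be"]

# "flatMap" the lines into words, then fold counts per key.
words = (w for line in lines for w in line.split())
counts = reduce(lambda acc, w: acc.update([w]) or acc, words, Counter())

# "Most retweeted tweet" is the same shape: map each record to
# (tweet, retweet_count), then take the max by value.
tweets = [("t1", 5), ("t2", 12), ("t3", 7)]
most_retweeted = max(tweets, key=lambda kv: kv[1])
```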
Streaming machine learning is being integrated in Spark 2.1+, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.
Introducing Apache Spark's Data Frames and Dataset APIs workshop series - Holden Karau
This session of the workshop introduces Spark SQL along with DataFrames and Datasets. Datasets give us the ability to easily intermix relational and functional-style programming. So that we can explore the new Dataset API, this iteration will be focused on Scala.
Spark ML for custom models - FOSDEM HPC 2017 - Holden Karau
Beyond shuffling - Scala Days Berlin 2016 - Holden Karau
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Holden Karau
Slides from: https://www.meetup.com/Sydney-Apache-Spark-User-Group/events/246892684/
Welcome to the first Sydney Spark Meetup in 2018!
We are very glad to have an visiting Apache Spark committer Holden Karau to give a talk on streaming machine learning. Title: Streaming ML w/Spark (and why it's a bit painful today & #workingonit)
Apache Spark is one of the most popular distributed systems, and it has built in libraries for both machine learning and streaming. This talk will cover Spark's two streaming libraries, look at the future, and how to make streaming ML work today (for both serving and prediction). If you aren't familiar with Spark, that's ok! We'll spend the first ~5 minutes covering just enough to get through the rest of the talk, and for those of you already familiar you can spend those ~5 minutes downloading the sample code :)
About Holden:
Holden is a transgender Canadian open source developer advocate @ Google with a focus on Apache Spark, BEAM, and related "big data" tools. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that's a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal.
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Holden Karau
This document provides a summary of a presentation on scaling Apache Spark. It discusses techniques for reusing RDDs through caching, persistence levels and checkpointing. It also covers best practices for working with key-value data to avoid problems from groupByKey, and using Spark SQL and accumulators. Finally, it previews bringing code generation to Spark ML to improve performance.
Introduction to and Extending Spark MLHolden Karau
This document discusses extending Spark ML pipelines with custom estimators and transformers. It begins with an overview of Spark ML and the pipeline API. Then it demonstrates how to build a simple hardcoded word count transformer and configurable transformer. It discusses important aspects like transforming the input schema, parameters, and model fitting. The document provides guidance on configuration, persistence, serving models, and resources for learning more about custom Spark ML components.
Java Performance Tips (So Code Camp San Diego 2014)Kai Chan
Slides for my presentation at SoCal Code Camp, June 29, 2014 (http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=68942cd0-6714-4753-a218-20d4b48da07d)
The document discusses programming techniques for the semantic web including LITEQ, a language for integrating RDF types and queries into programming languages. LITEQ allows programmers to navigate schemas, define types aligned with programming languages, and retrieve typed instances. The document also presents SchemEX, an index for efficiently searching RDF data sources in the linked open data cloud based on their schemas.
Search Engine-Building with Lucene and SolrKai Chan
These are the slides for the session I presented at SoCal Code Camp San Diego on July 27, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6b28337d-6eae-4003-a664-5ed719f43533
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Kai Chan
These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=8cdfd955-2cd4-44a2-ad08-5353e079685a
This document proposes a method for recommending new bands to users based on bands they already like. It involves taking a band a user likes, finding other bands that are connected or related to that band through various sources like being on the same record label, having collaborated, or been reviewed together. These connected bands would then be recommended to the user, ranked by the strength of their aggregate connections. However, the method is critiqued for only considering directly connected bands, ignoring bands' quality, popularity, and how users' tastes may evolve over time.
This document discusses approaches to music recommendation using the Million Song Dataset. It examines using k-means clustering on listening histories to group users and songs, and user-based collaborative filtering to find similar users and make recommendations. K-means achieved a mean average precision of 0.01008 using multiple centroids and modified metadata. Collaborative filtering on 1,000 users achieved 0.00822 precision, improving to 0.1127 on 110,000 users. Future work could include ensemble techniques, additional metadata, and distributed k-means.
Graph Adoption at Gamesys - Toby O'Rourke @ GraphConnect SF 2013Neo4j
Gamesys is a major online gaming company that handles billions in wagers annually. They built an internal social network graph database using Neo4j to model relationships between players and incentivize referrals. This helped reduce customer acquisition costs. Neo4j provided stable performance with Spring Data for building the application and Cypher for querying and analytics. The graph structure was well-suited to model complex game economies and detect fraud.
This document summarizes a presentation about the graph database Neo4j. The presentation included an agenda that covered graphs and their power, how graphs change data views, and real-time recommendations with graphs. It introduced the presenters and discussed how data relationships unlock value. It described how Neo4j allows modeling data as a graph to unlock this value through relationship-based queries, evolution of applications, and high performance at scale. Examples showed how Neo4j outperforms relational and NoSQL databases when relationships are important. The presentation concluded with examples of how Neo4j customers have benefited.
The document discusses how audio fingerprinting and identification works, as used by apps like Shazam. It covers topics like Fourier transforms, spectrograms, acoustic fingerprints, time-invariant hashing, storing and matching fingerprints to identify songs. The presenter then demonstrates generating fingerprints from songs in a database and using audio input to identify matches in real-time.
Introduction to the graph technologies landscapeLinkurious
Graph technologies allow modeling of complex relationships and connections through nodes and edges. There are three main layers of graph technologies: graph databases to store graph data, graph analysis frameworks to analyze large graphs, and graph visualization solutions to interact with graphs. Popular tools in each layer include Neo4j and Titan for databases, Giraph and GraphX for analysis, and Gephi and Cytoscape for visualization. Graph technologies are gaining more attention due to their ability to extract insights from connected data.
William Lyon presented on Neo4j 3.0 which introduces a new storage engine allowing unlimited graph size, new language drivers for easier application development, and improved operability for deploying Neo4j in the cloud, containers, and on premises. Key features include the new Bolt binary protocol, Java stored procedures, and an upgraded Cypher query engine with a new cost-based optimizer.
This introduction to graph databases is specifically designed for Enterprise Architects who need to map business requirements to architectural components like graph databases. It explains how and why graphs matter for Enterprise Architecture and reviews the architectural differences between relational and graph models.
This document provides an overview of Neo4j, a graph database management system. It discusses how Neo4j stores data as nodes and relationships, allowing for fast querying of connected data. Traditional relational databases struggle with complex relationships, while NoSQL databases don't support relationships at all. Neo4j addresses these issues through its native graph storage and processing capabilities. The document highlights key Neo4j features like scalability, high performance, and its Cypher query language.
Music Information Retrieval: Overview and Current Trends 2008Rui Pedro Paiva
The document provides an overview of music information retrieval (MIR), including its applications, history, and techniques. MIR aims to extract semantic information from music to help organize and search large digital music collections. Key points include that MIR techniques analyze low-level audio features and integrate top-down information to determine higher-level attributes like genre, emotion, and similarity. This facilitates applications like music recommendation, identification, and discovery.
These webinar slides are an introduction to Neo4j and Graph Databases. They discuss the primary use cases for Graph Databases and the properties of Neo4j which make those use cases possible. They also cover the high-level steps of modeling, importing, and querying your data using Cypher and touch on RDBMS to Graph.
Intro to Graph Databases Using Tinkerpop, TitanDB, and GremlinCaleb Jones
A quick overview of the history, motivation, and uses of graph modeling and graph databases in various industries. Covers a brief introduction to graph databases with an emphasis on the Tinkerpop stack and Gremlin query language. These concepts are then solidified through a hands-on lab modeling a blog engine using Titan and Gremlin.
See more at http://allthingsgraphed.com.
Working With a Real-World Dataset in Neo4j: Import and ModelingNeo4j
This webinar will cover how to work with a real-world dataset in Neo4j, with a focus on how to build a graph from an existing dataset (in this case a series of JSON files). We will explore how to performantly import the data into Neo4j - both in the case of an initial import and scaling writes for your graph application. We will demonstrate different approaches for data import (neo4j-import, LOAD CSV, and using the official Neo4j drivers), and discuss when it makes sense to use each import technique. If you've ever asked these questions, then this webinar is for you!
- How do I design a property graph model for my domain?
- How do I use the official Neo4j drivers?
- How can I deal with concurrent writes to Neo4j?
- How can I import JSON into Neo4j?
This document provides an overview of graph databases and their use cases. It begins with definitions of graphs and graph databases. It then gives examples of how graph databases can be used for social networking, network management, and other domains where data is interconnected. It provides Cypher examples for creating and querying graph patterns in a social networking and IT network management scenario. Finally, it discusses the graph database ecosystem and how graphs can be deployed for both online transaction processing and batch processing use cases.
This document discusses graph databases and the graph database Neo4j. It provides an introduction to graph databases, explaining that they are well-suited for storing relationships and sparse data. It then discusses Neo4j and its Cypher query language. Examples using GraphGists are provided and use cases and resources for getting started with Neo4j are listed.
This document provides an overview of graph databases. It discusses how graph data is naturally represented as nodes connected by edges, unlike relational databases which require joins. Graph databases allow for fast traversal of connected data and enable querying connected subgraphs. Popular graph database models include property graphs and RDF triple stores. Neo4j is introduced as a widely used graph database management system that uses labels, properties, relationships, and Cypher query language.
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Databricks
This document discusses property graphs and how they are represented and queried using Morpheus, a graph query engine for Apache Spark.
Morpheus allows querying property graphs using Cypher and represents property graphs using DataFrames, with node and relationship data stored in tables. It integrates with various data sources and supports federated queries across multiple property graphs. The document provides examples of loading property graph data from sources like JSON, SQL databases and Neo4j, creating graph projections, running analytical queries, and recommending businesses based on graph algorithms.
OQGraph 3 is a graph computation engine for MariaDB that allows for representing graphs and hierarchies using plain SQL. It stores graph data in tables but operates differently than typical storage engines by focusing on graph computations rather than data storage and retrieval. Key features include improved performance over previous versions using Judy arrays and ability to handle larger graphs by only holding the bitmap array in memory. It represents graph data using nodes and edges stored in tables and allows querying to find paths and perform other graph algorithms.
NoSQL no more: SQL on Druid with Apache Calcitegianmerlino
Druid is an analytics-focused, distributed, scale-out data store. Existing Druid clusters have scaled to petabytes of data and trillions of events, ingesting millions of events every second. Up until version 0.10, Druid could only be queried in a JSON-based language that many users found unfamiliar.
Enter Apache Calcite. It includes an industry-standard SQL parser, validator, and JDBC driver, as well as a cost-based relational optimizer. Calcite bills itself as “the foundation for your next high-performance database” and is used by Hive, Drill, and a variety of other projects. Druid uses Calcite to power Druid SQL, a standards-based query API that vaults Druid out of the NoSQL world and into the SQL world.
Gian Merlino offers an overview of Druid SQL and explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects.
This document discusses processing large graphs. It introduces graph processing with MapReduce and Apache Giraph. MapReduce algorithms for finding triangles and connected components in graphs are described. The limitations of MapReduce for graph processing are discussed. Alternative graph processing technologies including Neo4j, a graph database, are presented. A movie recommendation use case is demonstrated using Neo4j to find similar users and recommend unseen movies.
Getting started with Graph Databases & Neo4jSuroor Wijdan
The presentation gives a brief information about Graph Databases and its usage in today's scenario. Moving on the presentation talks about the popular Graph DB Neo4j and its Cypher Query Language i.e., used to query the graph.
New Features in Neo4j 3.4 / 3.3 - Graph Algorithms, Spatial, Date-Time & Visu...jexp
Highlighting the progress in Neo4j 3.3 and 3.4 especially
Neo4j Desktop, Graph Algorithms, NLP, Date-Time, Geospatial, and performance.
Also featuring the new visualization tool Neo4j Bloom.
Neo4j is an open source graph database that uses nodes, relationships, and properties to store and query data. It supports ACID transactions and is high performance. Cypher is Neo4j's query language that allows matching patterns of nodes and relationships. A graph database model uses nodes connected by relationships, unlike a relational database that uses tables and rows.
This document provides an overview of using graphs and hierarchies in SQL databases with OQGRAPH. It discusses how trees and graphs differ, examples of each, and some of the challenges of representing them in relational databases. It then introduces OQGRAPH as a storage engine that can perform graph computations directly in SQL. Key features of OQGRAPH like inserting edges, performing path queries, and joining to other tables are demonstrated. Later versions provide additional optimizations and the ability to use an existing table as the source of edges.
This document discusses Spark, an open-source cluster computing framework. It begins with an introduction to distributed computing problems related to processing large datasets. It then provides an overview of Spark, including its core abstraction of resilient distributed datasets (RDDs) and how Spark builds on the MapReduce model. The rest of the document demonstrates Spark concepts like transformations and actions on RDDs and the use of key-value pairs. It also discusses SparkSQL and shows examples of finding the most retweeted tweet using core Spark and SparkSQL.
Scaling Search at Lendingkart discusses how Lendingkart scaled their search capabilities to handle large increases in data volume. They initially tried scaling databases vertically and horizontally, but searches were still slow at 8 seconds. They implemented ElasticSearch for its near real-time search, high scalability, and out-of-the-box functionality. Logstash was used to seed data from MySQL and MongoDB into ElasticSearch. Custom analyzers and mappings were developed. Searches then reduced to 230ms and aggregations to 200ms, allowing the business to scale as transactional data grew 3000% and leads 250%.
This document provides an introduction and overview of Neo4j, a graph database. It discusses trends in big data, NoSQL databases, and different types of NoSQL databases like key-value stores, column family databases, and document databases. It then defines what a graph and graph database are, and introduces Neo4j as a native graph database that uses a property graph model. It outlines some of Neo4j's features and provides examples of how it can be used to represent social network, spatial, and interconnected data.
This document discusses using graphs and graph databases for machine learning. It provides an overview of graph analytics algorithms that can be used to solve problems with graph data, including recommendations, fraud detection, and network analysis. It also discusses using graph embeddings and graph neural networks for tasks like node classification and link prediction. Finally, it discusses how graphs can be used for machine learning infrastructure and metadata tasks like data provenance, audit trails, and privacy.
This document discusses GraphQL and DGraph with GO. It begins by introducing GraphQL and some popular GraphQL implementations in GO like graphql-go. It then discusses DGraph, describing it as a distributed, high performance graph database written in GO. It provides examples of using the DGraph GO client to perform CRUD operations, querying for single and multiple objects, committing transactions, and more.
Anatomy of Data Frame API : A deep dive into Spark Data Frame APIdatamantra
In this presentation, we discuss about internals of spark data frame API. All the code discussed in this presentation available at https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
5. What is a Graph in math
● represent a connected set of objects
● graph:
○ vertex (node/points)
○ edge (arc/line/relationship/arrow) - undirected
○ attribute (property) - on node/relationship
● types:
○ pair: G = (V, E)
○ digraph: D = (V, A)
○ mixed: G = (V, E, A)
V = {1, 2, 3, 4, 5, 6}
E = {{1, 2}, {1, 5}, {2, 3}, {2, 5}, {3, 4}, {4, 5}, {4, 6}}
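As a quick illustration (a Python sketch, not part of the original deck), the example graph above can be stored directly as its vertex and edge sets:

```python
# The undirected example graph G = (V, E) from the slide.
V = {1, 2, 3, 4, 5, 6}
E = {frozenset(e) for e in [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]}

def neighbors(v):
    """Vertices adjacent to v (undirected: order inside an edge is irrelevant)."""
    return {w for e in E for w in e if v in e and w != v}

print(neighbors(2))  # vertices connected to 2: {1, 3, 5}
```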
6. What is a Graph database
● stores data in a graph, built for storing and retrieving vast networks of data
● shines when storing richly-connected data
● consists of nodes, connected by relationships
○ A Graph —records data in→ Nodes —which have→ Properties
○ Nodes —are organized by→ Rels —which also have→ Properties
○ Nodes —are grouped by→ Labels —into→ Sets
○ A Traversal —navigates→ a Graph; it —identifies→ Paths —which order→ Nodes
○ An Index —maps from→ Properties —to either→ Nodes or Rels
○ A Graph Database —manages a→ Graph and —also manages related→ Indexes
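The component relationships above can be sketched as a toy property-graph model. This is a hypothetical Python illustration, not Neo4j's actual API; the `Node`/`Rel` classes and the Alice/Bob data are invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    props: dict = field(default_factory=dict)   # nodes have properties
    labels: set = field(default_factory=set)    # labels group nodes into sets

@dataclass
class Rel:
    start: Node
    end: Node
    rel_type: str                               # relationships are typed...
    props: dict = field(default_factory=dict)   # ...and also have properties

# A tiny graph: (Alice)-[:FRIEND {since: 2010}]->(Bob)
alice = Node({"name": "Alice"}, {"Person"})
bob = Node({"name": "Bob"}, {"Person"})
friendship = Rel(alice, bob, "FRIEND", {"since": 2010})
```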
8. Graph Traversal
A Traversal —navigates→ a Graph; it —identifies→ Paths —which order→ Nodes
Example: what music do my friends like that I don’t yet own?
Example: if this power supply goes down, what web services are affected?
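The first question ("what music do my friends like that I don't yet own") is a short traversal. A minimal Python sketch, where the `friends`, `likes`, and `owns` data are invented for the example:

```python
# friends-of relationships and likes, as adjacency dicts (hypothetical data)
friends = {"me": ["ann", "bob"]}
likes = {"ann": {"Jazz Album", "Rock Album"}, "bob": {"Folk Album"}}
owns = {"me": {"Rock Album"}}

def music_my_friends_like(person):
    """Traverse person -> friends -> liked music, dropping what person owns."""
    liked = set()
    for friend in friends.get(person, []):
        liked |= likes.get(friend, set())
    return liked - owns.get(person, set())

print(music_my_friends_like("me"))  # the albums friends like that "me" doesn't own
```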
9. Graph Index
An Index —maps from→ Properties —to either→ Nodes or Rels
Example: find the Account for username master-of-graphs
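That lookup is essentially a map from a property value to a node. A Python sketch of the idea (the `accounts` data is invented for the example):

```python
# An index maps from property values to nodes, avoiding a full graph scan.
accounts = [
    {"username": "master-of-graphs", "id": 1},
    {"username": "other-user", "id": 2},
]

# build the index once: property value -> node
by_username = {node["username"]: node for node in accounts}

# O(1) lookup instead of scanning every node
account = by_username["master-of-graphs"]
print(account["id"])  # 1
```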
13. A Graph Database elaborates a Key-Value Store
(figure legend: K* = key, V* = value)
14. A Graph Database relates Column-Family
● BigTable databases are an evolution of key-value stores, using "families" to allow grouping of rows
● stored in a graph, the families could become hierarchical, and the relationships among the data become explicit
15. A Graph Database navigates a Document Store
(figure legend: D = Document, S = Subdocument, V = Value, D2/S2 = reference)
18. Neo4j features
● intuitive, using a graph model for data representation
● reliable, fully transactional, upholds ACID
● durable and fast, using a custom disk-based, native storage engine
● massively scalable, up to several billion nodes/relationships/properties
● highly available, when distributed across multiple machines
● expressive, with a powerful, human-readable declarative graph query language
● fast, with a powerful traversal framework for high-speed graph queries
● embeddable, with a few small jars
● simple, accessible through a convenient REST API or an object-oriented Java API
● indexes are based on Apache Lucene; supports secondary indexes
● in commercial development since 2003 and in production for over 7 years
● cross-platform; simple set-up; well documented; open source
● GPL for Community, AGPL for Enterprise
19. Neo4j requirements
● CPU - Intel Core i3/i7
● Memory - 2GB .. 16/32GB
● Disk - 10GB SATA .. SSD w/ SATA
● Filesystem - ext4 .. ext4/ZFS
● Software - Oracle Java 7
20. Neo4j license
● Neo4j Community
○ open-source, high-performance
○ fully ACID transactional graph database
● Neo4j Enterprise
○ High-Performance Cache (up to 10x faster)
○ horizontal scalability with Neo4j Clustering (predictable scalability)
○ high availability and online backups
○ cache-based sharding (shard your graph in memory)
○ Advanced Monitoring (operational metrics)
○ certified for Windows and Linux
○ email/phone support (10x5, 24x7 hours)
○ subscriptions:
■ Personal (up to 3 devs, $100k annual revenue) = FREE
■ Startups (<$10M funding, <$5M annual revenue) = $12k
■ Business (medium, to Global 2000) = Contact Sales
21. ● for the simple friends-of-friends query, Neo4j is 60% faster than MySQL
● for friends of friends of friends, Neo4j is 180 times faster
● for the depth-four query, Neo4j is 1,135 times faster
● and MySQL simply chokes on the depth-five query
Neo4j vs. MySQL
22. Neo4j: Nodes
● fundamental units that form a graph
● can have key/value-style properties
● nodes and relationships can be indexed
by {key, value} pairs
● represent entities
23. Neo4j: Relationships #1/2
● connect entities and structure the domain
● allow for finding related data
● are always directed (outgoing or incoming)
● are equally well traversed in either direction
● a node can have relationships to itself
● have a relationship type (label)
25. Neo4j: Properties
● nodes and relationships can have properties
● are key-value pairs
○ the key is a string
○ values can be either a primitive or an array of
one primitive type
■ boolean, String, int, int[], etc.
■ as defined by the Java Language Specification
● represent entity attributes, relationship qualities,
and metadata
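The property-graph model described above can be sketched with a couple of illustrative Python classes (hypothetical names, not the Neo4j Java API):

```python
# Nodes and relationships both carry key/value properties:
# keys are strings, values are primitives or arrays of one primitive type.
class Entity:
    def __init__(self, **properties):
        self.properties = dict(properties)

class Node(Entity):
    pass

class Relationship(Entity):
    def __init__(self, start, rel_type, end, **properties):
        super().__init__(**properties)
        self.start, self.type, self.end = start, rel_type, end

alice = Node(name="Alice", age=30, tags=["dev", "dba"])  # array of one type
bob = Node(name="Bob")
knows = Relationship(alice, "KNOWS", bob, since=2010)    # rel property

print(alice.properties["name"], knows.type, knows.properties["since"])
```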
26. Neo4j: Labels
● used to group nodes into sets
● a node can have any number of labels, including none
● can be added and removed at runtime
● can be used to mark temporary states for nodes
● names are case-sensitive
● CamelCase (by convention)
27. Neo4j: Paths
● a path is one or more nodes with connecting relationships
● shortest path
● a path of length one
28. Neo4j: Traversal
● Traversal Framework out of the box
● means visiting nodes, following relationships according to rules
● in most cases only a subgraph is visited
● callback-based traversal API
○ you can specify the traversal rules
● traverses breadth- or depth-first
● open Java API
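A rough Python sketch of a callback-based traversal in this spirit; the `traverse` helper and the evaluator signature are assumptions for illustration, not the real framework API:

```python
# The caller supplies an evaluator deciding (include this node?, keep
# expanding from it?), plus the traversal order (breadth- vs depth-first).
from collections import deque

def traverse(graph, start, evaluator, breadth_first=True):
    frontier = deque([start])
    visited = {start}
    result = []
    while frontier:
        node = frontier.popleft() if breadth_first else frontier.pop()
        include, expand = evaluator(node)
        if include:
            result.append(node)
        if expand:
            for neighbor in graph.get(node, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    frontier.append(neighbor)
    return result

g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
# Evaluator: include every node except the start, always keep expanding.
found = traverse(g, "a", lambda n: (n != "a", True))
print(found)  # breadth-first: both neighbors of "a" before "d"
```

Because the evaluator can stop expansion early, only the relevant subgraph is visited.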
29. Neo4j: graph algorithms
● A* (uses the A* algorithm to find the cheapest path between two nodes)
● Dijkstra (dijkstra: uses Dijkstra's algorithm to find the cheapest path
between two nodes)
● PathWithLength (all paths of a certain length (depth) between two nodes)
● Shortest paths (shortestPath, the default: find all the shortest paths
between two nodes)
● All simple paths (allSimplePaths: find all simple paths between two
nodes, without loops)
● All paths (allPaths: find all available paths between two nodes)
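The "cheapest path" idea behind the dijkstra entry can be sketched with a textbook Dijkstra in Python; the weighted-edge data is hypothetical:

```python
# Dijkstra over a tiny weighted graph: node -> list of (neighbor, weight).
import heapq

def dijkstra(graph, source, target):
    """Return (cost, path) of the cheapest path from source to target."""
    queue = [(0, source, [source])]
    best = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in best and best[node] <= cost:
            continue  # already reached this node more cheaply
        best[node] = cost
        for neighbor, weight in graph.get(node, []):
            heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []

g = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 1), ("d", 5)],
    "c": [("d", 1)],
    "d": [],
}
cost, path = dijkstra(g, "a", "d")
print(cost, path)  # 3 ['a', 'b', 'c', 'd']
```

A* works the same way but additionally guides the search with a heuristic estimate of the remaining cost.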
31. ● introduced in Neo4j 2.0
● eventually available (populated in the background, not
immediately available for querying)
○ comes online once fully populated
○ on failed status, drop and recreate the index
● can be created on a label (for a given property)
● both nodes and rels can be indexed
● node_auto_indexing=false,
node_keys_indexable
Neo4j: Index
32. Neo4j: Constraints
● can help you keep your data clean
● specify the rules for what your data should
look like
● unique constraints are the only available
constraint type
33. ● single server instance
○ nodes = 2^35 (~34 billion)
○ relationships = 2^35 (~34 billion)
○ labels = 2^31 (~2 billion)
○ properties = 2^36 to 2^38 depending on
property types (maximum ~274 billion, always
at least ~68 billion)
○ relationship types = 2^15 (~ 32’000)
Neo4j: Data Size
34. ● powerful graph query language
● relatively simple
● declarative grammar (say what you want, not how)
● a humane query language
● self-explanatory (based on English prose and neat iconography)
● written in Scala
● pattern matching (borrows expression approaches from SPARQL)
● aggregation, ordering, limits
● create, update, delete
● structure and most keywords inspired by SQL
● changing rather rapidly (CYPHER 1.9 START ...)
Cypher Query Language
“Makes the simple things easy, and the complex things possible”
37. Cypher: START / RETURN
“It all starts with the START”
Michael Hunger, Cypher webinar, Sep 2012
● designates the start points
● START is optional (in Neo4j >= 2.0)
Examples:
● START <lookup> RETURN <expression>
● START n=node(0) RETURN n
● START n=node(*) RETURN n.name
38. Cypher: MATCH
● primary way of getting data from the database
● START <lookup> MATCH <pattern> RETURN <expr>
● OPTIONAL MATCH <lookup> RETURN <expr>
Examples:
● MATCH (n) RETURN count(n)
● MATCH (actor:Actor) RETURN actor.name;
● START me=node(0) MATCH (me)--(f) RETURN f.name
● MATCH (n)-[r]->(m) RETURN n AS FROM, r AS `->`, m AS TO
40. Cypher: WHERE
● filters the results
● MATCH <pattern> WHERE <condition> RETURN <expr>
Examples:
● WHERE n.name =~ "(?i)John.*"
● WHERE NOT ...
● WHERE type(rel) =~ "Perso.*"
41. Cypher: RETURN
● creates the result table
● any query can return data
● can be nodes, relationships, or properties on these
● RETURN DISTINCT <expression> AS x
● RETURN aggregate(expr) as alias
● RETURN nodes, rels, properties
● RETURN expressions of funcs and operators
● RETURN aggregation funcs on the above
42. Cypher: etc
● CASE / WHEN / ELSE
● ORDER BY node.key, node2.key, .. ASC|DESC
● LIMIT / SKIP
● WITH (WITH count(*) as c)
● UNION / UNION ALL (combining results from multiple queries)
● USING INDEX/SCAN
● MERGE / SET / DELETE / REMOVE / FOREACH
● Expressions
● Operators
● Comments
● Functions: ALL, ANY, LENGTH, {Math}, {String}, ...
43. ● any updating query will run in a transaction
● ACID
● “it is very important to finish each transaction”
● write lock on node/rel:
○ adding, changing or removing a property on a node/rel
● write lock on node:
○ creating or deleting a node
● write lock on the relationship and both its nodes:
○ creating or deleting a relationship
Cypher: Transactions
45. ● SELECT *
FROM Person
WHERE name="Valentin" AND age > 30
● START person=node:Person(name="Valentin")
WHERE person.age > 30
RETURN person
Cypher: back to SQL #1/5
46. Cypher: back to SQL #2/5
● SELECT "Email".*
FROM Person
JOIN "Email" ON "Person".id = "Email".person_id
WHERE "Person".name = "Benedikt"
● START person=node:Person(name="Benedikt")
MATCH person-[:email]->email
RETURN email
47. Cypher: back to SQL #3/5
● show me all people that are both actors and
directors
● SELECT name FROM Person
WHERE
person_id IN (SELECT person_id FROM Actor) AND
person_id IN (SELECT person_id FROM Director)
● START person=node:Person("name:*")
WHERE (person)-[:ACTS_IN]->()
AND (person)-[:DIRECTED]->()
RETURN person.name
48. Cypher: back to SQL #4/5
● show me all Tom Hanks’s co-actors
● SELECT DISTINCT co_actor.name FROM Person tom
JOIN Actor a1 ON tom.person_id = a1.person_id
JOIN Actor a2 ON a1.movie_id = a2.movie_id
JOIN Person co_actor ON co_actor.person_id = a2.person_id
WHERE tom.name = "Tom Hanks"
● START tom=node:Person(name="Tom Hanks")
MATCH tom-[:ACTS_IN]->movie,
co_actor-[:ACTS_IN]->movie
RETURN DISTINCT co_actor.name
49. Cypher: back to SQL #5/5
● show me all Lucy’s favorite directors
● SELECT dir.name, count(*) FROM Person lucy
JOIN Actor ON lucy.person_id = Actor.person_id
JOIN Director ON Actor.movie_id = Director.movie_id
JOIN Person dir ON Director.person_id = dir.person_id
WHERE lucy.name = "Lucy Liu"
GROUP BY dir.name
ORDER BY count(*) DESC
● START lucy=node:Person(name="Lucy Liu")
MATCH lucy-[:ACTS_IN]->movie,
director-[:DIRECTED]->movie
RETURN director.name, count(*)
ORDER BY count(*) DESC
50. START
lucy = node:Person(name="Lucy Liu"),
kevin = node:Person(name="Kevin Bacon")
MATCH
p = shortestPath( lucy-[:ACTS_IN*]-kevin )
RETURN
EXTRACT(n in NODES(p):
COALESCE(n.name?, n.title?))
Cypher: back to SQL #6/5
52. Neo4j: Security
● does not deal with data encryption
explicitly
● all means built into Java can be used
● an encrypted datastore can be used
● webadmin over HTTPS
53. ● manipulates data stored in RDF format
● focused on matching triple sets
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
?person a foaf:Person.
?person foaf:name ?name.
?person foaf:mbox ?email.
}
SPARQL
54. ● graph traversal language
● scripting language
● Pipe & Filter (similar to jQuery)
● works across different graph databases
● based on Groovy (limited to the JVM)
● not as stable in Neo4j
● XPath-like
● ./outE[label="family"]/inV/@name
● g.v(1).out('likes').in('likes').out('likes').groupCount(m)
● g.V.as('x').out.groupCount(m).loop('x'){c++ < 1000}
● g.v(1).in('LOVE_OF').out('SOME_IN').has('title','abc').back(2)
Gremlin
55. Neo4j and PHP
● everyman/neo4jphp (on packagist.org)
○ PHP wrapper for Neo4j using the REST interface
○ follows the PSR-0 autoloading standard
○ basic wrappers for all components
○ last update: a month ago
○ supports Gremlin
● Neo4j-PHP OGM
○ Object Graph Mapper, inspired by Doctrine
○ based on DoctrineCommon
○ borrows significantly from the DoctrineORM design
○ uses annotations on classes
○ MIT licence
● Neo4j PHP REST API client
○ uses the Neo4j REST API
○ node create/find/delete
○ relationship create/list/filter
56. High Availability with Neo4j
● in HA there is a single master and zero or more slaves
● slaves synchronize with the master to preserve
consistency
● the master writes to a slave before a transaction completes
57. Demo
Neo4j.org Example Datasets:
● DrWho (nodes=1'060; rels=2'286)
● Cineasts Movies & Actors (nodes=64'069; rels=121'778)
● Hubway Data Challenge (nodes=554'674; rels=2'011'904)
GraphGist:
● JIRA and neo4j
● PHP and neo4j
● Kant in neo4j
65. ● GrapheneDB - hosted service based on Neo4j
● AllegroGraph - Closed Source, Commercial, RDF-QuadStore
○ graph database built around the W3C spec for the Resource
Description Framework
○ supports SPARQL, RDFS++, and Prolog
● Sones GraphDB - Closed Source, .NET focused; built by the German
company sones
● Virtuoso - Closed Source, RDF focused
● InfiniteGraph - goal is to create a graph database with "virtually
unlimited scalability"
● FlockDB
Analogues
67. ● best used for graph-style data:
rich or complex,
structured, dense,
deep graphs with unlimited depth and cycles,
with weighted connections,
interconnected data
● quickly add new functionality without impacting
existing deployments
● schema-less, forcing you to re-think your entire approach to data
● not a silver bullet for all problems
Conclusion