Since the NoSQL concept burst onto the market, graph databases have traditionally been designed to be used from Java or C. With some honorable exceptions, there isn't an easy way to manage graph databases from Python. In this talk, I will introduce some of the tools you can use today to work with these new and challenging databases from our favorite language, Python.
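As a taste of what such tooling looks like, here is a minimal sketch using the official neo4j Python driver; the URI, credentials, and data are illustrative assumptions, and the talk itself may cover different libraries.

```python
# Minimal sketch: querying a Neo4j graph from Python.
# Assumes a local Neo4j server and the official driver (pip install neo4j);
# URI, credentials, and data are illustrative placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Find people known by Alice, one hop away.
    result = session.run(
        "MATCH (a:Person {name: $name})-[:KNOWS]->(b:Person) RETURN b.name AS friend",
        name="Alice",
    )
    for record in result:
        print(record["friend"])

driver.close()
```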
Data Day Texas 2017: Scaling Data Science at Stitch Fix
At Stitch Fix we have a lot of Data Scientists, around eighty at last count. One reason I think we have so many is that we do things differently. To get their work done, Data Scientists have access to whatever resources they need (within reason), because they're responsible for their work end to end; they collaborate with their business partners on objectives and then prototype, iterate, productionize, monitor and debug everything and anything required to get the desired output. They're full data-stack data scientists!
The teams in the organization do a variety of different tasks:
- Clothing recommendations for clients.
- Clothes reordering recommendations.
- Time series analysis & forecasting of inventory, client segments, etc.
- Warehouse worker path routing.
- NLP.
… and more!
They're also quite prolific at what they do -- we are approaching 4500 job definitions at last count. So one might wonder: how have we enabled them to get their jobs done without getting in each other's way?
This is where the Data Platform team comes into play. With the goal of lowering the cognitive overhead and engineering effort required on the part of the Data Scientist, the Data Platform team tries to provide abstractions and infrastructure to help the Data Scientists. The relationship is a collaborative partnership, where the Data Scientist is free to make their own decisions and thus choose the way they do their work; the onus then falls on the Data Platform team to convince Data Scientists to use their tools, and the easiest way to do that is by designing the tools well.
In regard to scaling Data Science, the Data Platform team has helped establish patterns and infrastructure that alleviate contention on:
- Access to Data
- Access to Compute Resources:
  - Ad-hoc compute (think prototype, iterate, workspace)
  - Production compute (think where things are executed once they're needed regularly)
For the talk (and this post) I focused only on how we reduced contention on Access to Data and Access to Ad-hoc Compute to enable Data Science to scale at Stitch Fix. With that, I invite you to take a look through the slides.
H2O Deep Water - Making Deep Learning Accessible to Everyone
Deep Water is H2O's integration with multiple open source deep learning libraries such as TensorFlow, MXNet and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all H2O properties in scalability, ease of use, and deployment. In this talk, I will go through the motivation and benefits of Deep Water. After that, I will demonstrate how to build and deploy deep learning models with or without programming experience using H2O's R/Python/Flow (Web) interfaces.
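For a flavour of the Python interface, here is a hedged sketch using H2O's core deep learning estimator as a stand-in for the Deep Water backends (whose API may differ); the dataset path and column names are placeholders.

```python
# Sketch of the H2O Python workflow; uses the core H2ODeepLearningEstimator
# as a stand-in for Deep Water's GPU-backed estimators. The path and column
# names are illustrative placeholders.
import h2o
from h2o.estimators import H2ODeepLearningEstimator

h2o.init()  # starts or connects to a local H2O cluster

frame = h2o.import_file("data.csv")          # placeholder dataset
train, valid = frame.split_frame([0.8], seed=42)

model = H2ODeepLearningEstimator(hidden=[64, 64], epochs=10)
model.train(x=["f1", "f2", "f3"], y="label",
            training_frame=train, validation_frame=valid)
print(model.model_performance(valid))
```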
Jo-fai (or Joe) is a data scientist at H2O.ai. Before joining H2O, he was in the business intelligence team at Virgin Media in the UK, where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab in the US as a data science evangelist, promoting products via blogging and giving talks at meetups. Joe has a background in water engineering. Before his data science journey, he was an EngD research engineer at the STREAM Industrial Doctorate Centre, working on machine learning techniques for drainage design optimization. Prior to that, he was an asset management consultant specializing in data mining and constrained optimization for the utilities sector in the UK and abroad. He also holds an MSc in Environmental Management and a BEng in Civil Engineering.
The document compares Neo4j, Titan, and Cassandra graph databases. It provides details on each database such as Neo4j using the Cypher query language, Cassandra being highly distributed and able to scale linearly, and Titan running on Cassandra or HBase but not supporting Cypher queries. It also gives a 15 point comparison of Cassandra vs Neo4j and examples of querying the same data in Gremlin, Cypher, and SQL. The conclusion recommends a graph database like Neo4j for recommendation queries and only using Titan for very large graphs or high loads.
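To make the Gremlin/Cypher/SQL comparison concrete, here is a hedged sketch of the kind of "friends of friends" query such comparisons typically use, with the Cypher and SQL variants as strings; the schema is an assumption, not the document's actual example.

```python
# Illustrative "friends of friends" query in two styles. The schema is a
# placeholder; the document's 15-point comparison uses its own examples.

# Cypher: the traversal is the query.
cypher = """
MATCH (me:Person {name: $name})-[:KNOWS]->()-[:KNOWS]->(fof)
WHERE fof.name <> $name
RETURN DISTINCT fof.name
"""

# SQL: the same traversal needs a self-join per hop.
sql = """
SELECT DISTINCT p2.name
FROM knows k1
JOIN knows k2 ON k1.friend_id = k2.person_id
JOIN person p1 ON k1.person_id = p1.id
JOIN person p2 ON k2.friend_id = p2.id
WHERE p1.name = %s AND p2.name <> %s
"""

print(cypher, sql)
```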
LDQL: A Query Language for the Web of Linked Data - Olaf Hartig
I used this slideset to present our research paper at the 14th Int. Semantic Web Conference (ISWC 2015). Find a preprint of the paper here:
http://olafhartig.de/files/HartigPerez_ISWC2015_Preprint.pdf
Designing and Building a Graph Database Application - Ian Robinson (Neo Techn...
Presented at JAX London
In this session we'll look at some of the design and implementation strategies you can employ when building a Neo4j-based graph database solution, including architectural choices, data modelling, and testing.
How Graph Databases efficiently store, manage and query connected data at s...
Graph Databases try to make it easy for developers to leverage huge amounts of connected information for everything from routing to recommendations. Doing that poses a number of challenges on the implementation side. In this talk we want to look at the different storage, query and consistency approaches that are used behind the scenes. We'll check out current and future solutions used in Neo4j and other graph databases for addressing global consistency, query and storage optimization, indexing and more, and see which papers and research database developers take inspiration from.
Microservices, containers, and machine learning - Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets; we then run machine learning on a Spark cluster to find insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
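As a rough illustration of one of the two approaches named (Word2Vec), here is a minimal gensim sketch; the toy corpus is a placeholder, not the Spark-based pipeline Exsto actually uses.

```python
# Toy Word2Vec sketch (gensim), standing in for the Spark-based
# implementation the talk describes; the corpus is a placeholder.
from gensim.models import Word2Vec

sentences = [
    ["spark", "streaming", "micro", "batch"],
    ["spark", "sql", "dataframe", "query"],
    ["graphx", "pregel", "graph", "algorithm"],
]

model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, epochs=50)
print(model.wv.most_similar("spark", topn=2))
```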
A super fast introduction to Spark and glance at BEAM
Apache Spark is one of the most popular general purpose distributed systems, with built-in libraries to support everything from ML to SQL. Spark has APIs across languages including Scala, Java, Python, and R -- with more 3rd party language support (like Julia & C#). Apache BEAM is a cross-platform tool for building on top of different distributed systems, but it's in its early stages. This talk will introduce the core concepts of Apache Spark, and look to the potential future of Apache BEAM.
Apache Spark has two core abstractions for representing distributed data and computations. This talk will introduce the basics of RDDs and Spark DataFrames & Datasets, and Spark's method for achieving resiliency. Since it's a big data talk, we will include the almost-required wordcount example, and end the Spark part with follow-up pointers on Spark's new ML APIs. For folks who are interested, we'll then talk a bit about portability, and how Apache BEAM aims to improve portability (as well as its unique approach to cross-language support).
Slides from Holden's talk at https://www.meetup.com/Wellington-Data-Scaling-Chats/events/mdcsdpyxcbxb/
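Since the abstract mentions the "almost required" wordcount, here is what that example typically looks like in PySpark; a generic sketch assuming a local Spark installation and an input.txt file, not Holden's actual slide code.

```python
# The customary wordcount, RDD-style. Assumes a local Spark install and
# an input.txt in the working directory; placeholders, not the talk's code.
from operator import add
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

counts = (
    sc.textFile("input.txt")
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(add)
)
for word, count in counts.take(10):
    print(word, count)

sc.stop()
```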
This document provides an overview of GraphDB and Neo4j. It discusses why graphs are useful for modeling connected data and common use cases. It also summarizes Neo4j's transactional graph database capabilities, performance advantages, and deployment options. Key topics covered include causal clustering, query planning, and driver and tooling support for developers.
Treasure Data is a cloud-based big data analytics company based in Silicon Valley with about 20 employees. The document discusses Treasure Data's services and architecture, which includes collecting data from various sources using Fluentd, storing the data in a columnar format on AWS S3, and performing analytics using Hadoop and SQL queries. Treasure Data aims to simplify big data adoption through its fully-managed platform and quick setup process. Example customers discussed were able to see results within 2 weeks of signing up.
This document summarizes an introduction to data analysis in Python using Wakari. It discusses why Python is a good language for data analysis, highlighting key Python packages like NumPy, Pandas, Matplotlib and IPython. It also introduces Wakari, a browser-based Python environment for collaborative data analysis and reproducible research. Wakari allows sharing of code, notebooks and data through a web link. The document recommends several talks at the PyData conference on efficient computing, machine learning and interactive plotting.
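For a small taste of the packages highlighted (NumPy and Pandas), here is a minimal snippet; the data is made up for illustration.

```python
# Tiny taste of NumPy + Pandas; the data is made up for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Austin", "Toronto", "Austin", "Toronto"],
    "temp": [35.0, 21.5, 33.2, 19.8],
})
# Group-by aggregation: mean and max temperature per city.
print(df.groupby("city")["temp"].agg(["mean", "max"]))
```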
This document discusses persistent graphs in Python with Neo4j. It begins by explaining the limitations of relational databases and how graph databases like Neo4j focus on modeling complex relationships through nodes and edges. It then provides an overview of Neo4j, describing it as an open source graph database that is stable, actively developed, and can handle billions of nodes and relationships to model complex data.
PyCon India 2012: Rapid development of website search in python
The document discusses developing website search capabilities in Python. It provides an overview of typical search engine components like indexing, analyzing, and searching. It then compares two Python search libraries - Pylucene and Whoosh. Benchmark tests on indexing, committing, and searching a 1GB dataset showed Whoosh to outperform Pylucene in speed. The document recommends designing search as an independent, pluggable component and considers Whoosh and Pylucene as good options for rapid development and integration into Python web projects.
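As a hedged sketch of the Whoosh side of that comparison (a generic index-and-search example, not the talk's 1GB benchmark code):

```python
# Minimal Whoosh index-and-search sketch; the documents are placeholders,
# not the benchmark dataset from the talk.
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

schema = Schema(path=ID(stored=True), content=TEXT)
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(path="/a", content="rapid website search in python")
writer.add_document(path="/b", content="whoosh is a pure python search library")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("python search")
    for hit in searcher.search(query):
        print(hit["path"])
```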
This document summarizes a talk about making Django and NoSQL databases like MongoDB play nicely together. Currently, Django's ORM is optimized for SQL databases and makes assumptions that don't always apply to NoSQL databases. The talk proposes some changes to address this, including having the Query object do less database-specific work and pushing more of that down to the individual database compilers. This would make the Query more agnostic and allow the compilers to generate queries optimized for their specific databases. An example backend for MongoDB would be built to demonstrate this approach.
This document discusses building knowledge graphs using DIG (Distributed Information Graphs) to integrate heterogeneous data sources. It describes the steps involved, including data acquisition, feature extraction, mapping to an ontology, entity resolution, graph construction, and deployment. As a use case, DIG has been used to build a knowledge graph from over 100 million web pages related to human trafficking to help law enforcement identify victims and prosecute traffickers.
This document compares relational and non-relational databases. It discusses how in 2003 the main databases were relational, but by 2010 non-relational databases grew popular in the "NoSQL movement". However, the document argues that there are no truly new database designs and that relational and non-relational databases can be combined. It advises to choose a database based on the specific problem and features needed rather than general classifications. The document provides examples of which types of databases fit certain data and access needs.
This document describes Bubbles, a Python framework for data processing and quality probing. Bubbles focuses on representing data objects and defining operations that can be performed on those objects. Key aspects include:
- Data objects define the structure and representations of data without enforcing a specific storage format.
- Operations can be performed on data objects and are dispatched dynamically based on the objects' representations.
- A context stores available operations and handles dispatching.
- Stores provide interfaces to load and save objects from formats like SQL, CSV, etc.
- Pipelines allow sequencing operations to transform and process objects from source to target stores.
- The framework includes common operations for filtering, joining, and aggregating.
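To make the dispatch-by-representation idea above concrete, here is a toy model of the design in plain Python; all names in it are hypothetical illustrations, not Bubbles' actual API.

```python
# Toy model of dispatch-by-representation, in the spirit of Bubbles;
# every name here is hypothetical, not the framework's real API.

class DataObject:
    """Carries data plus the representations it offers (e.g. 'sql', 'rows')."""
    def __init__(self, data, representations):
        self.data = data
        self.representations = representations

OPERATIONS = {}  # (operation name, representation) -> implementation

def operation(name, representation):
    def register(fn):
        OPERATIONS[(name, representation)] = fn
        return fn
    return register

def dispatch(name, obj, *args):
    """Pick the implementation matching one of the object's representations."""
    for rep in obj.representations:
        fn = OPERATIONS.get((name, rep))
        if fn:
            return fn(obj, *args)
    raise TypeError(f"no '{name}' implementation for {obj.representations}")

@operation("filter", "rows")
def filter_rows(obj, predicate):
    return DataObject([r for r in obj.data if predicate(r)], ["rows"])

rows = DataObject([{"x": 1}, {"x": 5}], ["rows"])
print(dispatch("filter", rows, lambda r: r["x"] > 2).data)  # [{'x': 5}]
```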
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Search engines frequently miss the mark when it comes to understanding user intent. This talk will walk through some of the key building blocks necessary to turn a search engine into a dynamically-learning "intent engine", able to interpret and search on meaning, not just keywords. We will walk through CareerBuilder's semantic search architecture, including semantic autocomplete, query and document interpretation, probabilistic query parsing, automatic taxonomy discovery, keyword disambiguation, and personalization based upon user context/behavior. We will also see how to leverage an inverted index (Lucene/Solr) as a knowledge graph that can be used as a dynamic ontology to extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships.
As an example, most search engines completely miss the mark at parsing a query like (Senior Java Developer Portland, OR Hadoop). We will show how to dynamically understand that "senior" designates an experience level, that "java developer" is a job title related to "software engineering", that "portland, or" is a city with a specific geographical boundary (as opposed to a keyword followed by a boolean operator), and that "hadoop" is the skill "Apache Hadoop", which is also related to other terms like "hbase", "hive", and "map/reduce". We will discuss how to train the search engine to parse the query into this intended understanding and how to reflect this understanding to the end user to provide an insightful, augmented search experience.
Topics: Semantic Search, Apache Solr, Finite State Transducers, Probabilistic Query Parsing, Bayes Theorem, Augmented Search, Recommendations, Query Disambiguation, NLP, Knowledge Graphs
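As a toy rendering of the parsing example above, here is a dictionary-lookup sketch; the real system described is probabilistic and knowledge-graph-backed, and every entry below is a made-up assumption.

```python
# Toy intent tagging for "Senior Java Developer Portland, OR Hadoop".
# A lookup-table sketch only; the talk's approach is probabilistic and
# graph-backed, and these entries are illustrative.
TAXONOMY = {
    "senior": ("experience_level", "senior"),
    "java developer": ("job_title", "related to software engineering"),
    "portland, or": ("city", "Portland, OR (geo boundary)"),
    "hadoop": ("skill", "Apache Hadoop; related: hbase, hive, map/reduce"),
}

def tag(query):
    text = query.lower()
    found = []
    # Greedily match the longest known phrases first.
    for phrase in sorted(TAXONOMY, key=len, reverse=True):
        if phrase in text:
            found.append((phrase,) + TAXONOMY[phrase])
            text = text.replace(phrase, " ")
    return found

for phrase, kind, meaning in tag("Senior Java Developer Portland, OR Hadoop"):
    print(f"{phrase!r} -> {kind}: {meaning}")
```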
Kovelinas is a painter born in 1978 in Lithuania. He is known for his Divas Powerpoint series of paintings. The document provides brief biographical information about the Lithuanian painter Kovelinas and references one of his art series but does not provide any additional context or details.
This document summarizes different market structures: pure competition, pure monopoly, monopolistic competition, oligopoly, and collusive oligopoly. It describes key characteristics of each like perfect competition having many firms and monopoly having a single firm. It also discusses profit maximization under these structures and provides examples.
Mobisfera is a mobile marketing agency based in Barcelona. We imagine, design and develop solutions for mobile devices. We are also committed to the Internet of Things (IoT) and wearable technology.
The document outlines the stages and funding of the Dementias Platform UK project. It establishes cohort(s) in stage 1 with £6M funding and establishes an imaging platform in stage 2 with another £6M. An additional £36M in capital funding will go toward imaging, stem cells, and informatics. It lists the director as John Gallacher from Oxford and describes the 14 work packages and informatics network leads. Simon Lovestone is the informatics network lead, and various informatics sub-networks and their leads are listed. It provides a conceptual model for the imaging informatics component with a central XNAT hub and nodes at various research centers.
Fastrack is a sub-brand of Titan that was established in 1998 and focuses on watches and accessories. The document discusses Fastrack's digital marketing campaign objectives of engaging 1000 students at top private universities in Bangladesh by 2016 through social media campaigns on platforms like Facebook and YouTube. The campaigns aim to raise brand awareness and engage customers among the target 20-25 year old male and female demographic interested in style, fashion and experiencing new things.
The document discusses three paths to designing digital experiences: structural, community, and customer. It advocates writing an experience brief to define goals and mapping the customer journey. The presentation provides recommendations for libraries to focus on the customer experience by asking questions, emphasizing conversation, and staging experiences on their website. The overall message is that experience design improves the ordinary interactions people have with an organization online.
The document presents a list of Spanish and Italian authors, including brief biographies and short excerpts from their most famous works. Among the authors are Juan Ramón Jiménez, Rafael Alberti, Miguel de Cervantes, Gustavo Adolfo Bécquer, and William Shakespeare. The document provides basic information about the lives and works of these important writers.
The document provides an introduction to hidden Markov models (HMM) and their applications. It begins with an overview of HMM and its advantages for modeling sequential data. It then describes the basic concepts of Markov models, including their graphical representation, definitions, and algorithms for calculating sequence and state probabilities. The document introduces HMM and the hidden aspect, which is the state transition information that cannot be directly observed. It provides the formal definition of HMM and describes the three main problems in HMM: model evaluation, decoding, and training. It focuses on explaining the forward algorithm for efficient model evaluation in linear time complexity. The document uses examples throughout to illustrate key concepts such as Markov models, HMM, and the forward algorithm.
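To pin down the forward algorithm the document focuses on, here is the standard textbook recursion in Python; the two-state model parameters are made up for illustration.

```python
# Standard HMM forward algorithm: computes P(observations | model)
# in O(T * N^2) time. The two-state toy model below is made up.
import numpy as np

A = np.array([[0.7, 0.3],      # state transition probabilities a_ij
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],      # emission probabilities b_j(o)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])      # initial state distribution

def forward(obs):
    alpha = pi * B[:, obs[0]]          # initialization: alpha_1(j) = pi_j * b_j(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # recursion: sum over previous states
    return alpha.sum()                 # termination: P(O | model)

print(forward([0, 1, 0]))  # likelihood of the observation sequence
```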
This Nielsen report summarizes the results of a global survey of over 28,000 online consumers in 56 countries regarding their multi-screen media usage. The survey found that watching video on computers has become as popular as watching TV among online users. Reported online and mobile video viewing is rising, with over half of global online consumers watching videos on mobile phones monthly. Smartphone ownership is up significantly since 2010 and tablets are also gaining popularity globally. The report concludes that portable devices will continue affecting media consumption as their adoption increases.
The document summarizes a presentation on using R and Hadoop together. It includes:
1) An outline of topics to be covered including why use MapReduce and R, options for combining R and Hadoop, an overview of RHadoop, a step-by-step example, and advanced RHadoop features.
2) Code examples from Jonathan Seidman showing how to analyze airline on-time data using different R and Hadoop options - naked streaming, Hive, RHIPE, and RHadoop.
3) The analysis calculates average departure delays by year, month and airline using each method.
My talk at August's joint meeting of Chicago's R and Hadoop user groups providing an introduction to using R with Hadoop. It starts with a quick introduction to and overview of available options, then focuses on using RHadoop's rmr library to perform an analysis on the publicly-available 'airline' data set.
Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of ...
This document discusses scalability issues with publishing, exchanging, and consuming large RDF datasets on the semantic web. It proposes an integrated solution called Binary RDF that includes (1) a binary serialization format for efficient publication and exchange of RDF data, and (2) basic data structures for direct consumption without decompression. Preliminary results show Binary RDF in the form of HDT can provide a compact representation of RDF and support direct pattern matching queries during consumption. Further work is needed to fully understand RDF structure and apply it to innovative dictionary and triple indexes.
Roberto García presented on exploring linked data. He discussed how semantic data is fine for computers but difficult for people to interact with. He proposed automatically generating user interfaces from ontologies and datasets, including overview menus, faceted browsing, and interaction patterns to allow users to build queries without knowledge of SPARQL or dataset structure. He demonstrated examples of his approach applied to DBPedia and LinkedMDB data.
This document summarizes Rodrigo Dias Arruda Senra's 2012 doctoral thesis defense at the University of Campinas. The thesis studied how to organize digital information for sharing across heterogeneous systems and proposed three main contributions: 1) SciFrame, a conceptual framework for scientific digital data processing; 2) database descriptors to enable loose coupling between applications and database management systems; and 3) organographs, a method for explicitly organizing information based on tasks.
This document discusses using R Shiny and related tools to create cloud-based spatial data analytics applications. It describes a case study of an app called VectorPoint created to analyze spatial disease data from Peru. The app allows users to collect field data via smartphones, calculate disease probabilities on a map, and track inspections. R Shiny allows rapid prototyping by combining R code and interactive web interfaces. While powerful for prototyping, R Shiny has limitations like requiring an online connection and not being optimized for speed.
This is part of an introductory course to Big Data Tools for Artificial Intelligence. These slides introduce students to the use of Apache Pig as an ETL tool over Hadoop.
Scalable Hadoop with succinct Python: the best of both worlds
The document discusses using Python with Hadoop frameworks. It outlines some of the benefits of Hadoop like scalability and schema flexibility, and benefits of Python like succinct code and many data science libraries. It then reviews several projects that aim to bridge Python and Hadoop, including mrjob for MapReduce jobs, Pydoop for faster MapReduce, Pig for higher-level data flows, Snakebite for a Python HDFS client, and PySpark for working with Spark. However, it notes that Python support is often an afterthought or fringe project compared to the native Java support, and lacks commercial backing or cohesive APIs.
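For a flavour of the mrjob option mentioned above, here is the canonical wordcount job; a generic sketch that runs locally or, with -r hadoop, on a cluster.

```python
# Canonical mrjob wordcount; run locally with
#   python wordcount.py input.txt
# or on Hadoop with -r hadoop. A generic sketch, not the talk's code.
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```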
This document summarizes an introductory webinar on building an enterprise knowledge graph from RDF data using TigerGraph. It introduces RDF and knowledge graphs, demonstrates loading DBpedia data into a TigerGraph graph database using a universal schema, and provides examples of queries to extract information from the graph such as related people, publishers by location, and related topics for a given predicate. The webinar encourages attendees to learn more about graph databases and TigerGraph through additional resources and future webinar episodes.
Programmers love Python because of how fast and easy it is to use. Python cuts development time in half with its simple-to-read syntax and easy compilation feature. Debugging your programs is a breeze in Python with its built-in debugger. Python continues to be a favourite option for data scientists, who use it for building machine learning applications and other scientific computations.
Python has evolved into the most preferred language for data analytics, and the increasing search trends on Python also indicate that Python is the next "Big Thing" and a must for professionals in the data analytics domain.
This document discusses Grails integration with Neo4j graph databases. It begins with an introduction to graph databases and Neo4j. It then covers the Grails Neo4j plugin which allows using Neo4j as the persistence layer for Grails domain classes. Finally, it addresses some challenges in mapping the Grails domain model to the Neo4j nodespace and potential solutions.
This document proposes a mapping between the Web Ontology Language (OWL) and the OpenAPI Specification (OAS) to generate REST APIs from OWL ontologies. It describes a mapping method, discusses related work, and details the mapping's coverage of OWL constructs. While some constructs like complex boolean restrictions are not supported, the mapping specification and implementation aim to make ontology knowledge graphs accessible via RESTful APIs in accordance with FAIR principles. Future work includes enhancing path/schema naming and adding metadata annotations.
introduction to Neo4j (Tabriz Software Open Talks) - Farzin Bagheri
This document provides an overview of Neo4j, a graph database. It begins with definitions of relational and NoSQL databases, categorizing NoSQL into key-value, document, column-oriented, and graph databases. Graph databases are explained to contain nodes, relationships, and properties. Neo4j is introduced as an example graph database, with Cypher listed as its query language. Examples of using Cypher to create nodes and relationships are provided. Finally, potential uses of Neo4j are listed, including social networks, network analysis, recommendations, and more.
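A hedged sketch of the kind of Cypher create-and-read examples described, driven from Python; the connection details and data are illustrative assumptions.

```python
# Creating nodes and a relationship in Cypher, then reading them back.
# Connection details and data are illustrative placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run(
        "CREATE (a:Person {name: $a})-[:FRIENDS_WITH]->(b:Person {name: $b})",
        a="Ada", b="Grace",
    )
    for record in session.run(
        "MATCH (a:Person)-[:FRIENDS_WITH]->(b:Person) RETURN a.name, b.name"
    ):
        print(record["a.name"], "->", record["b.name"])

driver.close()
```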
Presto as a Service - Tips for operation and monitoring - Taro L. Saito
- Presto as a Service in Treasure Data involves deploying Presto using blue-green deployments with no downtime and automatic error recovery of failed queries.
- Monitoring Presto involves using its JSON API to view queries and query plans as well as collecting Presto metrics with Fluentd and detecting anomalies.
- Benchmarking compares query performance between Presto versions by running predefined query sets and aggregating the results.
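As a minimal sketch of the JSON-API monitoring idea (Presto coordinators expose query information at /v1/query; the host and port are placeholders, and Treasure Data's actual tooling is more involved):

```python
# Poll a Presto coordinator's JSON API for running/finished queries.
# Host/port are placeholders; the monitoring described feeds such
# metrics into Fluentd for anomaly detection.
import requests

resp = requests.get("http://presto-coordinator:8080/v1/query", timeout=10)
resp.raise_for_status()

for q in resp.json():
    print(q["queryId"], q["state"])
```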
The document discusses how search engines are incorporating knowledge graphs and rich snippets to provide more detailed information to users. It describes Google's Knowledge Graph and how search engines like Bing are implementing similar features. The document then outlines how the Schema.org standard and modules like Schema.org and Rich Snippets for Drupal can help structure Drupal content to be understood by search engines and displayed as rich snippets in search results. Integrating these can provide benefits like a consistent search experience across public and private Drupal content.
PyCon Colombia 2020: Python for Data Analysis: Past, Present, and Future - Wes McKinney
Wes McKinney gave a presentation on the past, present, and future of Python for data analysis. He discussed the origins and development of pandas over the past 12 years from the first open source release in 2009 to the current state. Key points included pandas receiving its first formal funding in 2019, its large community of contributors, and factors driving Python's growth for data science like its package ecosystem and education. McKinney also addressed early concerns about Python and looked to the future, highlighting projects like Apache Arrow that aim to improve performance and interoperability.
The document discusses graph databases and their advantages over traditional relational databases. It covers the NoSQL movement, graph databases, use cases for graph databases like social networks and semantic web applications. It provides an overview of graph database technologies like Neo4j and DEX and examples of querying and modeling data in a graph database using Neo4j.rb.
The document discusses the Spark ecosystem. It provides an overview of Spark, a cluster computing framework developed at UC Berkeley, including its core components like Resilient Distributed Datasets (RDDs) and projects like Shark. Spark aims to improve on Hadoop and MapReduce by allowing more interactive queries and streaming data analysis through its use of RDDs to cache data in memory across clusters.
This document describes a course on big data analytics. The course aims to provide an overview of big data storage, retrieval, and processing technologies. It will cover tools for storing and analyzing large datasets as well as challenges in big data system design and analytics. Students will learn to build distributed systems with Apache Hadoop, write MapReduce applications, and develop applications using Hive, Pig, and Spark. Course units will introduce big data concepts, Hadoop, MapReduce programming, Hive, Pig, and Spark. Upon completing the course, students will be able to develop applications on Hadoop and with related big data tools.
The document discusses Big Data, MapReduce, Hadoop, and Pydoop. It provides an overview of MapReduce and how it works, describing the map and reduce functions. It also describes Hadoop, the popular open-source implementation of MapReduce, including its architecture and core components like HDFS and how tasks are executed in a distributed manner. Finally, it briefly introduces Pydoop as a way to use Python with Hadoop.
The seminar presents the emerging topic of the Web of Data within the context of the Semantic Web. It examines the difficulties encountered in accessing the enormous amount of information currently available on the Web and the advantages of an approach based on the interactive construction of queries.
Similar to Graph Databases in Python (PyCon Canada 2012)
This document discusses bringing the educational app Dr. Glearning to Firefox OS. It describes Dr. Glearning's history and functionality, challenges in porting from Android/iOS to Firefox OS, and solutions considered. The authors decided a hosted web app approach using standard web technologies like jQuery Mobile worked best to overcome restrictions of the packaged app model and enable third party API access. They demonstrated a working hosted version of Dr. Glearning for Firefox OS.
This document summarizes a study on the neutralization of the consonant /l/ by /r/ in the Andalusian dialect of Spain. The study analyzed the productions of 4 speakers from Huelva and Seville using 180 words in fast and slow readings. The authors found that neutralization occurs most frequently in trisyllabic words beginning with "al-", especially in coda position. Neutralization occurred in 29% of the productions, with an average of 20 cases per speaker. The study concludes that this...
Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"Javier de la Rosa
Presentation of the proceeding article "Hybrid Page Layout Analysis via Tab-Stop Detection" by Ray Smith to the Page Segmentation Competition hold on ICDAR 2009.
Mejora de un problema combinatorio sobre vectores ordenadosJavier de la Rosa
Este documento presenta un problema combinatorio sobre vectores ordenados y propone una solución eficiente en tiempo y espacio. Se analizan dos formas de modelar el problema como un árbol y como un grafo, pero la solución propuesta genera el conjunto de soluciones como la traspuesta de una matriz, aprovechando patrones en los vectores de entrada. El algoritmo propuesto genera todas las soluciones en tiempo lineal respecto al número total de elementos.
Quantum Communications Q&A with Gemini LLM. These are based on Shannon's Noisy channel Theorem and offers how the classical theory applies to the quantum world.
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsMydbops
This presentation, delivered at the Postgres Bangalore (PGBLR) Meetup-2 on June 29th, 2024, dives deep into connection pooling for PostgreSQL databases. Aakash M, a PostgreSQL Tech Lead at Mydbops, explores the challenges of managing numerous connections and explains how connection pooling optimizes performance and resource utilization.
Key Takeaways:
* Understand why connection pooling is essential for high-traffic applications
* Explore various connection poolers available for PostgreSQL, including pgbouncer
* Learn the configuration options and functionalities of pgbouncer
* Discover best practices for monitoring and troubleshooting connection pooling setups
* Gain insights into real-world use cases and considerations for production environments
This presentation is ideal for:
* Database administrators (DBAs)
* Developers working with PostgreSQL
* DevOps engineers
* Anyone interested in optimizing PostgreSQL performance
Contact info@mydbops.com for PostgreSQL Managed, Consulting and Remote DBA Services
Transcript: Details of description part II: Describing images in practice - T...BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and slides: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxSynapseIndia
Your comprehensive guide to RPA in healthcare for 2024. Explore the benefits, use cases, and emerging trends of robotic process automation. Understand the challenges and prepare for the future of healthcare automation
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfNeo4j
Presented at Gartner Data & Analytics, London Maty 2024. BT Group has used the Neo4j Graph Database to enable impressive digital transformation programs over the last 6 years. By re-imagining their operational support systems to adopt self-serve and data lead principles they have substantially reduced the number of applications and complexity of their operations. The result has been a substantial reduction in risk and costs while improving time to value, innovation, and process automation. Join this session to hear their story, the lessons they learned along the way and how their future innovation plans include the exploration of uses of EKG + Generative AI.
The Rise of Supernetwork Data Intensive ComputingLarry Smarr
Invited Remote Lecture to SC21
The International Conference for High Performance Computing, Networking, Storage, and Analysis
St. Louis, Missouri
November 18, 2021
Blockchain technology is transforming industries and reshaping the way we conduct business, manage data, and secure transactions. Whether you're new to blockchain or looking to deepen your knowledge, our guidebook, "Blockchain for Dummies", is your ultimate resource.
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Erasmo Purificato
Slide of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends" held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data.
The project is called openTelemetry, but before diving into the specifics, we’ll start with de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, percentile to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we’ll explore the openTelemetry community; its Special Interest Groups (SIGs), repositories, and how to become not only an end-user, but possibly a contributor.We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs.
Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of openTelemetry, and know how to take their first steps to an open-source contribution!
Key Takeaways: Open source, vendor neutral instrumentation is an exciting new reality as the industry standardizes on openTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve and in order to achieve ubiquity, the project would benefit from growing our contributor community.
Kief Morris rethinks the infrastructure code delivery lifecycle, advocating for a shift towards composable infrastructure systems. We should shift to designing around deployable components rather than code modules, use more useful levels of abstraction, and drive design and deployment from applications rather than bottom-up, monolithic architecture and delivery.
1. GRAPH DATABASES IN PYTHON
Javier de la Rosa
@versae
The CulturePlex Lab
Western University, London, ON
PyCon Canada 2012
2. WHO I AM
● Javier de la Rosa
● versae
● Computer Scientist and Humanist
● CulturePlex Lab
3. FIRST OF ALL
“You do not really understand something unless you can explain it to your grandmother”
– (Frequently attributed to) Richard Feynman
4. DATABASES (in the last 30 years)
● Data in tables, rows and columns
● Pretty basic mechanism to make connections:
– Primary keys, foreign keys, and... that's all
● Relational, ahem, really?
5. DATABASES (in the last 30 years)
● Rigid data schemas
– Have you ever tried to make a schema migration?
● Relational Algebra and SQL
– Terrible for highly interconnected data
– JOINs can take a lifetime to finish (a bit overdramatized)
6. NoSQL, Not Only SQL
● Document
– MongoDB, CouchDB, etc.
● Key-value stores
– Redis, Riak, Voldemort, Dynamo, etc.
● Big Tables
– Cassandra, HBase, etc.
● Graph
– Neo4j, OrientDB, HyperGraphDB, Titan, etc.
● Analytic
– Hadoop
● Other
– Objectivity/DB, ZODB, etc.
7. DATABASES LANDSCAPE
Source: 451Research, https://451research.com/report-long?icid=2289
8. WHO IS USING GRAPHS?
● Mozilla with Pancake and Pacer
– https://wiki.mozilla.org/Pancake & http://pangloss.github.com/pacer/
● Twitter with FlockDB
– https://github.com/twitter/flockdb
● Facebook with Open Graph
– https://developers.facebook.com/docs/opengraph/
● Google with Knowledge Graph
– http://www.google.ca/insidesearch/.../knowledge.html
9. WHY GRAPHS?
● Data is getting more and more connected
– From text documents, to wikis, to ontologies, to folksonomies, etc.
● And more semi-structured
– Think about the decentralization of content generation
● And more complex
– Social networks, semantic trending, etc.
Source: Neo Technology, http://www.slideshare.net/emileifrem/neo4j-the-benefits-of-graph-databases-oscon-2009
10. A FEW OF THE CURRENT USES
● Social Networking and Recommendations
● Network and Cloud Management
● Master Data Management
● Geospatial
● Bioinformatics
● Content Management and Security and Access Control
Source: Mashable, http://mashable.com/2012/09/26/graph-databases/
11. AND WHY ELSE?
● Because graphs are cool!
[Figure: portrait of Leonhard Euler]
12. WHAT IS A GRAPH?
● G = (V, E), where
– G is a graph
– V is a set of vertices
– E is a set of edges
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_(mathematics)
13. WHAT IS A GRAPH?
● G = (V, E)
– Graph, aka network, diagram, etc.
– Vertex, aka point, dot, node, element, etc.
– Edge, aka relationship, arc, line, link, etc.
● Basically, “a graph states that something is related to something else”
– Svetlana Sicular, Research Director at Gartner
Source: Gartner, http://blogs.gartner.com/svetlana-sicular/think-graph/
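To make G = (V, E) concrete, here is a minimal sketch in plain Python; no database involved, and the names are purely illustrative:

V = {"alice", "bob", "carol"}             # vertices
E = {("alice", "bob"), ("bob", "carol")}  # edges as pairs of vertices

# Derive an adjacency view from E: who is connected to whom.
adjacency = {v: set() for v in V}
for tail, head in E:
    adjacency[tail].add(head)
    adjacency[head].add(tail)  # undirected, so record both directions

print(adjacency["bob"])  # {'alice', 'carol'}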
14. TYPES OF GRAPH
[Figure: an undirected graph and a digraph]
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_(mathematics)
15. TYPES OF GRAPH
[Figure: a multigraph and a hypergraph]
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_(mathematics)
16. SOME GRAPHS EVEN HAVE A NAME
● Complete graphs
[Figure: the complete graphs K3, K5 and K8]
Source: Wikipedia, http://en.wikipedia.org/wiki/Gallery_of_named_graphs
17. SOME GRAPHS EVEN HAVE A NAME
● Stars
[Figure: the star graphs S3, S4, S5 and S6]
Source: Wikipedia, http://en.wikipedia.org/wiki/Gallery_of_named_graphs
18. SOME GRAPHS EVEN HAVE A NAME
● Snarks
[Figure: the second Blanuša snark, the Szekeres snark, and the double-star snark]
Source: Wikipedia, http://en.wikipedia.org/wiki/Gallery_of_named_graphs
19. THINGS CAN GET COMPLICATED...
[Figure: the local McLaughlin graph]
Source: Wikipedia, http://en.wikipedia.org/wiki/Gallery_of_named_graphs
20. WAIT A SEC,
21. DON'T WORRY
● Just one more type: the Property Graph
[Figure: an example graph with numbered nodes and edges]
22. THE PROPERTY GRAPH
● Directed, attributed and multi-relational
[Figure: an example property graph; person nodes “Name: Javi”, “Name: David” and “Name: John”, and a book node (“Title: The Art of Computer Programming”, “Price: $135”); “Knows” edges carrying “Since: 2009” and “Since: 1990” properties, and “Likes” edges pointing to the book]
23. THE PROPERTY GRAPH
● A set of nodes, and each node has:
– A unique identifier.
– A set of outgoing edges.
– A set of incoming edges.
– A collection of properties defined by a map from key to value.
● A set of relationships, and each relationship has:
– A unique identifier.
– An outgoing tail vertex.
– An incoming head vertex.
– And a collection of properties defined by a map from key to value.
Source: TinkerPop, https://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph
24. IN SHORT
● A Property Graph is composed of:
– A set of nodes
– A set of relationships
– Properties and ids on both
● Sometimes, nodes and relationships can be typed
– In Blueprints and Neo4j, a label denotes the type of relationship between its two nodes.
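To ground that definition, here is a minimal, database-agnostic sketch of the property graph model in Python; the class and attribute names are mine, not part of any of the libraries discussed later:

import itertools

class PropertyGraph:
    """Toy property graph: identified nodes and relationships, both with properties."""

    def __init__(self):
        self._ids = itertools.count(1)   # unique identifiers for nodes and relationships
        self.nodes = {}                  # id -> properties (map from key to value)
        self.relationships = {}          # id -> (tail id, head id, label, properties)

    def add_node(self, **properties):
        node_id = next(self._ids)
        self.nodes[node_id] = properties
        return node_id

    def add_relationship(self, tail, head, label, **properties):
        rel_id = next(self._ids)
        self.relationships[rel_id] = (tail, head, label, properties)
        return rel_id

g = PropertyGraph()
javi = g.add_node(name="Javi")
david = g.add_node(name="David")
g.add_relationship(javi, david, "Knows", since=1990)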
25. GRAPH DATABASES
● A graph database uses graph structures with nodes, edges, and properties to represent and store data
– ...but there is not an easy way to visualize this
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_database
26. HOW DOES IT LOOK IN PYTHON?
27. HOW DOES IT LOOK IN PYTHON?
# Let's create a graph
>>> silvester = g.nodes.create(name="Silvester")
28. HOW DOES IT LOOK IN PYTHON?
# Let's create a graph
>>> silvester = g.nodes.create(name="Silvester")
[Figure: a node labeled “Name: Silvester”]
29. HOW DOES IT LOOK IN PYTHON?
# Let's create a graph
>>> silvester = g.nodes.create(name="Silvester")
>>> arnold = g.nodes.create(name="Arnold")
[Figure: a node labeled “Name: Silvester”]
30. HOW DOES IT LOOK IN PYTHON?
# Let's create a graph
>>> silvester = g.nodes.create(name="Silvester")
>>> arnold = g.nodes.create(name="Arnold")
[Figure: two nodes, “Name: Silvester” and “Name: Arnold”]
31. HOW DOES IT LOOK IN PYTHON?
# Let's create a graph
>>> silvester = g.nodes.create(name="Silvester")
>>> arnold = g.nodes.create(name="Arnold")
>>> punch = arnold.punches(silvester)
[Figure: two nodes, “Name: Silvester” and “Name: Arnold”]
32. HOW DOES IT LOOK IN PYTHON?
# Let's create a graph
>>> silvester = g.nodes.create(name="Silvester")
>>> arnold = g.nodes.create(name="Arnold")
>>> punch = arnold.punches(silvester)
[Figure: “Arnold” -punches-> “Silvester”]
33. HOW DOES IT LOOK IN PYTHON?
[Figure: “Arnold” -punches-> “Silvester”]
34. HOW DOES IT LOOK IN PYTHON?
>>> chuck = g.nodes.create(name="Chuck")
[Figure: “Arnold” -punches-> “Silvester”]
35. HOW DOES IT LOOK IN PYTHON?
>>> chuck = g.nodes.create(name="Chuck")
[Figure: “Arnold” -punches-> “Silvester”, plus a new node “Name: Chuck”]
36. HOW DOES IT LOOK IN PYTHON?
>>> chuck.dropkicks(silvester)
>>> chuck.dropkicks(arnold)
[Figure: “Arnold” -punches-> “Silvester”, plus the node “Name: Chuck”]
37. HOW DOES IT LOOK IN PYTHON?
>>> chuck.dropkicks(silvester)
>>> chuck.dropkicks(arnold)
[Figure: “Arnold” -punches-> “Silvester”; “Chuck” -dropkicks-> “Silvester” and -dropkicks-> “Arnold”]
39. GRAPH DATABASES LANDSCAPE
And more:
– AffinityDB
– YarcData uRiKA
– Apache Giraph
– Cassovary
– StigDB
– NuvolaBase
– Pegasus
– Microsoft Trinity
– Sherlock
– And so on
41. GREMLIN, BLUEPRINTS, WAT?
Let me introduce you to the TinkerPop stack
Source: TinkerPop, http://www.tinkerpop.com/
42. BLUEPRINTS AND REXSTER
● Blueprints is a property graph model interface
● Rexster is a server that exposes any Blueprints graph through REST
Source: TinkerPop, http://www.tinkerpop.com/
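Because Rexster speaks plain HTTP, you can already poke at it from Python with nothing but the requests library. A rough sketch, assuming a local Rexster instance on its default port exposing a graph named "tinkergraph"; the URL, port, graph name, and response layout are all assumptions that depend on your configuration:

import requests

REXSTER = "http://localhost:8182"  # Rexster's default port (assumption)

# Ask the server which graphs it exposes.
print(requests.get(REXSTER + "/graphs").json())

# Fetch the vertices of one of them.
response = requests.get(REXSTER + "/graphs/tinkergraph/vertices").json()
for vertex in response["results"]:
    print(vertex)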
43. AND WHAT ABOUT PYTHON?
● Options to connect to a Blueprints Graph Database
[Figure: graph databases (OrientDB, Neo4j, DEX, Titan) expose the Blueprints API; Rexster serves it over REST; the Python clients bulbflow, python-blueprints and pyblueprints talk to Rexster]
44. BULBFLOW
● Create
>>> alice = g.vertices.create(name="Alice")
>>> bob = g.vertices.create(name="Bob")
>>> g.edges.create(alice, "knows", bob)
● Get
>>> alice = g.vertices.get(1)
>>> bob = g.vertices.get(2)
● Update
>>> alice.age = 21
>>> alice.save()
● Delete
>>> alice.delete()
Source: Bulbflow, http://bulbflow.com/docs/
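The g object in the snippets above is a Bulbs Graph. A minimal setup against a local Neo4j Server, following the bulbflow quickstart; the default URL is an assumption and may differ in your deployment:

from bulbs.neo4jserver import Graph

g = Graph()  # defaults to http://localhost:7474/db/data/ (assumption)

alice = g.vertices.create(name="Alice")
bob = g.vertices.create(name="Bob")
g.edges.create(alice, "knows", bob)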
45. PYBLUEPRINTS
● Create
>>> alice = g.addVertex()
>>> alice.setProperty("name", "Alice")
>>> bob = g.addVertex()
>>> bob.setProperty("name", "Bob")
>>> g.addEdge(alice, bob, "knows")
● Get
>>> alice = g.getVertex(1)
>>> bob = g.getVertex(2)
● Update
>>> alice.setProperty("age", 21)
● Delete
>>> g.removeVertex(alice.getId())
Source: PyBlueprints, https://github.com/escalant3/pyblueprints
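Likewise, the PyBlueprints g has to be constructed first. A sketch for the Neo4j backend, per the project's README; treat the import path and server URL as assumptions for your setup:

from pyblueprints.neo4j import Neo4jGraph

g = Neo4jGraph("http://localhost:7474/db/data")  # assumed local Neo4j server

alice = g.addVertex()
alice.setProperty("name", "Alice")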
46. BUT NEO4J HAS ITS OWN CLIENTS!
● REST clients for Neo4j
[Figure: the same Blueprints/Rexster stack as before, plus neo4j-rest-client and py2neo speaking REST directly to Neo4j]
47. HOW CAN I LOOK THINGS UP?
● An index is a data structure that supports the fast lookup of elements by some key/value pair
Source: TinkerPop, https://github.com/tinkerpop/blueprints/wiki/Graph-Indices
48. INDICES
● In the Python bindings, indices behave much like a dict
– bulbflow
# bulbflow creates automatic indices to make basic lookups easier
>>> nodes = g.vertices.index.lookup(name="Alice")
>>> for node in nodes:
...:     print node
– PyBlueprints
>>> index = g.getIndex("names", "vertex")
>>> index.put("name", alice.getProperty("name"), alice)
>>> nodes = index.get("name", "Alice")
>>> for node in nodes:
...:     print node
49. INDICES
● Some graph databases provide full-text queries
– bulbflow
>>> nodes = g.vertices.index.query(name="ali*")
>>> for node in nodes:
...:     print node
– PyBlueprints
>>> index = g.getIndex("names", "vertex")
>>> nodes = index.query("name", "ali*")
>>> for node in nodes:
...:     print node
50. ...MORE COMPLEX SEARCHES?
“Without traversals [FlockDB] is only a persisted graph. But not a graph database.”
– Alex Popescu
Source: myNoSQL, http://nosql.mypopescu.com/
51. LET'S TRAVERSE THE GRAPH!
● “A graph traversal is the problem of visiting all the nodes in a graph in a particular manner”
– A* search
– Alpha-beta pruning
– Breadth-First Search (BFS) (see the sketch below)
– Depth-First Search (DFS)
– Dijkstra's algorithm
– The Floyd-Warshall algorithm
– Etc.
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_traversal
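For intuition about what a traversal engine does under the hood, here is the simplest of those algorithms, breadth-first search, over a plain adjacency dict; pure Python, no database involved:

from collections import deque

def bfs(adjacency, start):
    """Yield nodes reachable from start, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        yield node
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)

adjacency = {"alice": ["bob"], "bob": ["carol"], "carol": []}
print(list(bfs(adjacency, "alice")))  # ['alice', 'bob', 'carol']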
52. NEO4J TRAVERSAL API
● Python-embedded (native Neo4j Python binding)
>>> traverser = gdb.traversal().relationships('knows').traverse(alice)
# The graph is traversed as you loop through the result
>>> for node in traverser.nodes:
...:     print node
● neo4j-rest-client
>>> traverser = alice.traverse(types=[client.All.knows])
# The graph is traversed as you loop through the result
>>> for node in traverser:
...:     print node
53. BLUEPRINTS GREMLIN
● Gremlin is a domain-specific language for traversing property graphs
– Defines how to do a query based on the graph structure
>>> gremlin = g.extensions.GremlinPlugin.execute_script
>>> params = {'alice_id': alice.id}
>>> script = "g.v(alice_id).out('knows')"
>>> node = gremlin(script=script, params=params)
>>> node == bob
Source: TinkerPop Gremlin, https://github.com/tinkerpop/gremlin/wiki
Source: Marko Rodríguez, The Graph Traversal Programming Pattern, http://www.slideshare.net/slidarko/graph-windycitydb2010
54. NEO4J CYPHER QUERY LANGUAGE
● Declarative graph query language
– Expressive and efficient querying
– Focused on expressing what to retrieve from a graph
– Inspired by SQL
– Pattern-matching expressions from SPARQL
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_database
55. NEO4J CYPHER QUERY LANGUAGE
● Declarative graph query language
– Expressive and efficient querying
– Focused on expressing what to retrieve from a graph
– Inspired by SQL
– Pattern-matching expressions from SPARQL
[Figure: nodes 1 and 2 joined by an edge labeled “label”]
(1) -[:label]- (2)
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_database
56. NEO4J CYPHER QUERY LANGUAGE
● Declarative graph query language
– Expressive and efficient querying
– Focused on expressing what to retrieve from a graph
– Inspired by SQL
– Pattern-matching expressions from SPARQL
[Figure: nodes 1 and 2 joined by an edge labeled “label”]
START n=(1), m=(2)
MATCH n-[r:label]-m
RETURN r
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_database
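Cypher can also be sent from Python. A sketch using neo4j-rest-client, which grew a query() helper for exactly this; the method name, the server URL, and the example pattern are assumptions to check against the version you have installed:

from neo4jrestclient.client import GraphDatabase

gdb = GraphDatabase("http://localhost:7474/db/data")  # assumed local server

# Old-style Cypher, matching the syntax of the era: node(*) means "all nodes".
q = "START n=node(*) MATCH n-[r:knows]->m RETURN n, type(r), m"
for row in gdb.query(q):
    print(row)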
57. PY2NEO CYPHER HELPERS
● Get or create elements
>>> g.get_or_create_relationships(
...:     (bob, "WORKS WITH", carol, {"since": 2004}),
...:     (alice, "DISLIKES!", carol, {"reason": "youth"}),
...:     (bob, "WORKS WITH", dave, {"since": 2009}))
● Get counts
>>> nodes_count = g.get_node_count()
>>> rels_count = g.get_relationship_count()
● Delete
>>> g.delete()
Source: py2neo, http://py2neo.org/
59. LET'S PLAY!
● Deploy Neo4j on Heroku or Amazon
● Use one of the available clients
60. NEO4J HEROKU ADD-ON
● Create a Heroku app and add the Neo4j add-on
$ heroku apps:create pyconca
$ heroku addons:add neo4j --app pyconca
$ xdg-open `heroku config:get NEO4J_URL --app pyconca`
$ export NEO4J_URL=`heroku config:get NEO4J_URL --app pyconca`
● Create a virtualenv with neo4j-rest-client
$ mkvirtualenv --no-site-packages pyconca
$ workon pyconca
$ pip install ipython neo4jrestclient
$ ipython
61. NEO4J HEROKU ADD-ON
● Run IPython and that's it!
>>> import os
>>> NEO4J_URL = os.environ["NEO4J_URL"]
>>> from neo4jrestclient import client
>>> gdb = client.GraphDatabase(NEO4J_URL + "/db/data")
>>> gdb.url
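Once connected, the same client can create data. These calls follow the neo4jrestclient documentation; the property values are just examples, and the outgoing() filter signature is an assumption to verify against your installed version:

# Continuing from the gdb created above:
alice = gdb.nodes.create(name="Alice")
bob = gdb.nodes.create(name="Bob")
alice.relationships.create("knows", bob, since=2012)

# Read the relationship back from Alice's outgoing edges.
for rel in alice.relationships.outgoing(types=["knows"]):
    print(rel.start["name"], "knows", rel.end["name"])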
63. THANKS!
Questions?
Javier de la Rosa
@versae
The CulturePlex Lab
Western University, London, ON
PyCon Canada 2012
64. APPENDIX: DATA MODELS
● neo4django
– https://github.com/scholrly/neo4django
● neomodel
– https://github.com/robinedwards/neomodel
● bulbflow models
– http://bulbflow.com/quickstart/#models
65. APPENDIX: VISUALIZE YOUR GRAPH
● Export somehow to .gexf for Gephi
– http://gephi.org/
● Use D3.js
– http://d3js.org/
● Use sigma.js
– http://sigmajs.org/
● Take a look at Max De Marzi's work
– http://maxdemarzi.com/category/visualization/
● Use Sylva (for newbies)
– http://www.sylvadb.com/