With the torrent of data available to us on the Internet, it has become increasingly difficult to separate the signal from the noise. We set out on a journey with a simple directive: figure out a way to discover emerging technology trends. Through a series of experiments, trials, and pivots, we found our answer in the power of graph databases. We built our "Emerging Tech Radar" with graph databases at the center of our discovery platform. Using a mix of NoSQL databases and open source libraries, we built a scalable information digestion platform that touches on multiple topics, including NLP, named entity extraction, data cleansing, Cypher queries, multiple visualizations, and polymorphic persistence.
"Academics in an ivory tower" conjures images of people toiling away, nicely insulated from many of the concerns of reality. While this has its advantages, anyone who has tried to use a project written for a research paper under a deadline can attest that it doesn't always result in useful code. While completing my PhD, I found an Apache project that fit well with the work I was doing, so I rolled up my sleeves to write some code to make it more useful for solving my own problems. I've since had the opportunity to join the project's PMC, and now, as a faculty member, I continue to find value in encouraging my own students to contribute to Apache projects. I'll discuss how academics and Apache projects can find mutual benefit in close collaboration.
Originally presented at DataDay Texas in Austin, this presentation shows how a graph database such as Neo4j can be used for common natural language processing tasks, such as building a word adjacency graph, mining word associations, summarization, keyword extraction, and content recommendation.
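As a flavor of the first of those tasks, here is a minimal sketch of building a word adjacency graph with the official neo4j Python driver (5.x API), assuming a local instance; the connection details are placeholders and the Cypher is illustrative rather than taken from the talk:

```python
# A minimal word adjacency graph in Neo4j: MERGE each consecutive word pair
# and count co-occurrences on the edge. Connection details are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_adjacencies(tx, tokens):
    for w1, w2 in zip(tokens, tokens[1:]):
        tx.run(
            "MERGE (a:Word {value: $w1}) "
            "MERGE (b:Word {value: $w2}) "
            "MERGE (a)-[r:NEXT]->(b) "
            "ON CREATE SET r.count = 1 "
            "ON MATCH SET r.count = r.count + 1",
            w1=w1, w2=w2,
        )

with driver.session() as session:
    tokens = "the quick brown fox jumps over the lazy dog".split()
    session.execute_write(add_adjacencies, tokens)
driver.close()
```

Once the adjacency counts are in place, word associations and keyword extraction reduce to Cypher traversals over the weighted :NEXT edges.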
It is widely known that the discovery, development, and commercialization of new classes of drugs can take 10-15 years and more than $5 billion in R&D investment, only to see less than 5% of the drugs make it to market. AstraZeneca is a global, innovation-driven biopharmaceutical business that focuses on the discovery, development, and commercialization of prescription medicines for some of the world’s most serious diseases. Our scientists have been able to improve our success rate over the past 5 years by moving to a data-driven approach (the “5R”) to help develop better drugs faster, choose the right treatment for a patient, and run safer clinical trials. However, our scientists are still unable to make these decisions with all of the available scientific information at their fingertips. Data is scattered across our company as well as external public databases, every new technology requires a different data processing pipeline, and new data arrives at an increasing pace. It is often repeated that a new scientific paper appears every 30 seconds, which makes it impossible for any individual expert to keep up to date with the pace of scientific discovery. To help our scientists integrate all of this information and make targeted decisions, we have used Spark on Azure Databricks to build a knowledge graph of biological insights and facts. The graph powers a recommendation system that enables any AZ scientist to generate novel target hypotheses for any disease, leveraging all of our data. In this talk, I will describe the applications of our knowledge graph and focus on the Spark pipelines we built to quickly assemble and create projections of the graph from hundreds of sources. I will also describe the NLP pipelines we have built (leveraging spaCy, BioBERT, or Snorkel) to reliably extract meaningful relations between entities and add them to our knowledge graph.
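To make the relation-extraction step concrete, here is a hedged sketch in the spirit of the spaCy part of such a pipeline; it simply treats entity pairs that co-occur in a sentence as candidate edges, which is illustrative and not AstraZeneca's actual pipeline (a biomedical model such as scispaCy would be a better fit than the small English model assumed here):

```python
# Candidate relation extraction with spaCy: entity pairs that co-occur in a
# sentence become candidate knowledge-graph edges for later validation.
import itertools
import spacy

nlp = spacy.load("en_core_web_sm")  # assumption; a biomedical model fits better

def candidate_relations(text):
    """Yield co-occurring entity pairs per sentence as candidate graph edges."""
    doc = nlp(text)
    for sent in doc.sents:
        for a, b in itertools.combinations(sent.ents, 2):
            yield (a.text, a.label_, b.text, b.label_, sent.text)

for rel in candidate_relations("Acme Corp partnered with Globex in London."):
    print(rel)
```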
This document outlines a project exploring the use of Python and R for business applications. It provides brief descriptions of Python and R, noting their uses in scientific computing, big data, automation, web scraping, visualization, and more. Potential business applications are mentioned but not described. The document discusses success factors such as continuing to learn the syntax of Python and R, defining requirements, and investigating applications. It proposes taking what was learned about the languages and identifying a realistic business problem and solution to develop using Python, R, or both. Next steps include meeting with a professor, exploring solutions, and coordinating with teammates to develop materials showcasing Python and R.
Slides from Trey's opening presentation for the South Big Data Hub's Text Data Analysis Panel on December 8th, 2016. Trey provided a quick introduction to Apache Solr, described how companies are using Solr to power relevant search in industry, and provided a glimpse of where the industry is heading with regard to implementing more intelligent and relevant semantic search.
This talk describes how to implement conceptual search (semantic search) within a modern search engine using the word2vec algorithm to learn concepts. We also cover how to auto-tune the search engine parameters using black box optimization techniques, and the problems of feedback loops encountered when building machine learning systems that modify the user behavior used to train the system.
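As a rough illustration of the concept-learning step, the sketch below trains word2vec with gensim on a toy corpus and expands a query term into its nearest neighbors; the corpus, parameters, and query term are all illustrative assumptions:

```python
# Conceptual search building block: learn embeddings, then expand a query
# term into related concepts via nearest neighbors in vector space.
from gensim.models import Word2Vec

corpus = [
    ["data", "science", "machine", "learning"],
    ["machine", "learning", "predictive", "modeling"],
    ["search", "engine", "relevance", "ranking"],
]  # in practice: millions of tokenized documents

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=50)
print(model.wv.most_similar("machine", topn=5))
```

In a real engine, the expanded terms (or the vectors themselves) would be folded back into the query rather than shown to the user.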
This presentation is for beginners in R and is geared toward its use in psychometrics (academic, credentialing, and psychological exam development).
Presented at the Open Source Connections Haystack relevance conference: 904Labs' "Interleaving: from Evaluation to Self-Learning". 904Labs is the first to commercialize online learning to rank, a state-of-the-art approach to self-learning search ranking that automatically takes customers' behavior into account to personalize search results.
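The abstract doesn't spell out 904Labs' exact interleaving variant; the sketch below shows team-draft interleaving, the textbook algorithm for comparing two rankers on live traffic, purely as an assumed illustration:

```python
# Team-draft interleaving: merge two rankings, remembering which ranker
# ("team") contributed each result; clicks are then credited per team.
import random

def team_draft_interleave(ranking_a, ranking_b):
    interleaved, teams = [], {}
    a, b = list(ranking_a), list(ranking_b)
    while a or b:
        # Flip a coin for which team picks first this round.
        order = ["A", "B"] if random.random() < 0.5 else ["B", "A"]
        for team in order:
            pool = a if team == "A" else b
            while pool and pool[0] in teams:
                pool.pop(0)          # skip docs already placed
            if pool:
                doc = pool.pop(0)
                teams[doc] = team
                interleaved.append(doc)
    return interleaved, teams

docs, teams = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"])
print(docs, teams)  # the ranker whose docs attract more clicks wins
```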
This document proposes extensions to user behavior modeling for web application prefetching. It discusses using n-gram and n-gram+ techniques to predict the next actions users will take based on sequential patterns in their historical requests and responses. Relations between actions are defined to identify dependencies between tokens in requests. An algorithm is proposed to assign actions to endpoints, tokenize requests/responses, identify action relations through n-gram statistics, and predict/prefetch future actions by filling token values. This predictive modeling could help prefetch dependent resources to reduce latency.
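A minimal sketch of the n-gram prediction step might look like the following, with hypothetical action names; the relation-identification and token-filling parts described above would sit on top of it:

```python
# Bigram next-action model: count which action follows which in historical
# sessions, then prefetch the most likely follow-up.
from collections import Counter, defaultdict

def train_bigrams(sessions):
    model = defaultdict(Counter)
    for actions in sessions:
        for prev, nxt in zip(actions, actions[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, action):
    """Most frequent follow-up action, or None if the action is unseen."""
    followers = model.get(action)
    return followers.most_common(1)[0][0] if followers else None

sessions = [
    ["login", "list_orders", "view_order"],
    ["login", "list_orders", "view_order"],
    ["login", "search"],
]
model = train_bigrams(sessions)
print(predict_next(model, "list_orders"))  # -> "view_order"
```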
This presentation was given at one of the DSATL Meetups in March 2018, in partnership with the Southern Data Science Conference 2018 (www.southerndatascience.com).
This document discusses principles for applying continuous delivery practices to machine learning models. It begins with background on the speaker and their company Indix, which builds location and product-aware software using machine learning. The document then outlines four principles for continuous delivery of machine learning: 1) Automating training, evaluation, and prediction pipelines using tools like Go-CD; 2) Using source code and artifact repositories to improve reproducibility; 3) Deploying models as containers for microservices; and 4) Performing A/B testing using request shadowing rather than multi-armed bandits. Examples and diagrams are provided for each principle.
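For the fourth principle, a bare-bones version of request shadowing might look like this sketch; the endpoints are hypothetical, and a production setup would fire the shadow call asynchronously rather than in line with the request:

```python
# Request shadowing: the production model answers every request, while a
# copy goes to the candidate model purely so the two outputs can be compared
# offline. Endpoint URLs are hypothetical.
import logging
import requests

logging.basicConfig(level=logging.INFO)

PROD_URL = "http://models.internal/prod/predict"       # assumed endpoint
CANDIDATE_URL = "http://models.internal/next/predict"  # assumed endpoint

def predict_with_shadow(payload):
    prod = requests.post(PROD_URL, json=payload, timeout=1.0).json()
    try:
        shadow = requests.post(CANDIDATE_URL, json=payload, timeout=1.0).json()
        logging.info("shadow_compare payload=%s prod=%s shadow=%s",
                     payload, prod, shadow)
    except requests.RequestException:
        logging.warning("shadow model unavailable; serving prod result only")
    return prod  # the caller only ever sees the production model's answer
```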
This document provides a summary of Rangarajan Chari's background and experience as a data scientist and machine learning specialist with skills in neural networks, natural language processing, and big data technologies. Chari has worked on projects involving text classification, face recognition, and troubleshooting techniques for vehicles. The summary also lists his education, including a PhD program in artificial intelligence and master's degrees in computer science, math, and physics.
Presentation of the Semantic Knowledge Graph research paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics (Montreal, Canada - October 18th, 2016). Abstract—This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendation systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
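To ground the idea, here is a toy sketch of how an edge between two terms can materialize from intersecting postings lists and be scored against the background corpus; the z-score form below is one common formulation of such a significance score, not necessarily the paper's exact weighting:

```python
# Edges materialize from intersecting postings lists; the score compares the
# observed co-occurrence count with what the background rate predicts.
import math

def postings(index, term):
    """Set of doc ids containing the term (the inverted index lookup)."""
    return index.get(term, set())

def relatedness(index, a, b, n_docs):
    fg = postings(index, a)               # foreground: docs matching term a
    k = len(fg & postings(index, b))      # co-occurrences (the edge)
    p = len(postings(index, b)) / n_docs  # background probability of term b
    if not fg or p in (0.0, 1.0):
        return 0.0
    return (k - len(fg) * p) / math.sqrt(len(fg) * p * (1 - p))

index = {"data": {1, 2, 3}, "science": {2, 3, 4}, "golf": {5}}
print(relatedness(index, "data", "science", n_docs=5))  # positive: related
print(relatedness(index, "data", "golf", n_docs=5))     # negative: unrelated
```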
ExperTwin is a Knowledge Advantage Machine (KAM) that is able to collect data from your areas of interest and present it in time, in context, and in place within the worker's workspace. This research paper describes how workers can benefit from having a personal net of crawlers (as Google does) collecting and organizing up-to-date data relevant to their areas of interest and delivering it to their workspace.
The document provides a general introduction to artificial intelligence (AI), machine learning (ML), deep learning (DL), and data science (DS). It defines each term and describes their relationships. Key points include:
- AI is the ability of computers to mimic human cognition and intelligence.
- ML is an approach to achieve AI by having computers learn from data without being explicitly programmed.
- DL uses neural networks for ML, especially with unstructured data like images and text.
- DS involves extracting insights from data through scientific methods. It is a multidisciplinary field that uses techniques from ML, DL, and statistics.
The presentation gives a brief overview of graph databases and their usage in today's scenarios. It then covers the popular graph database Neo4j and its Cypher query language, which is used to query the graph.
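By way of illustration, a small Cypher query run through the official neo4j Python driver might look like this; the movie graph is Neo4j's standard tutorial dataset, and the connection details are placeholders:

```python
# Pattern matching with Cypher: find actors in movies released after a year.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released > $year
RETURN p.name AS actor, m.title AS movie
LIMIT 5
"""

with driver.session() as session:
    for record in session.run(query, year=2000):
        print(record["actor"], "-", record["movie"])
driver.close()
```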
What if, instead of returning documents, a query could return the other keywords most related to it: for example, a search for "data science" bringing back results like "machine learning", "predictive modeling", "artificial neural networks", and so on? Solr’s Semantic Knowledge Graph does just that. It leverages the inverted index to automatically model the significance of relationships between every term in the inverted index (even across multiple fields), allowing real-time traversal and ranking of any relationship within your documents. Use cases for the Semantic Knowledge Graph include disambiguating the multiple meanings of terms (does "driver" mean truck driver, printer driver, a type of golf club, etc.?), searching on vectors of related keywords to form a conceptual search (versus just a text match), powering recommendation algorithms, ranking lists of keywords by conceptual cohesion to reduce noise, summarizing documents by extracting their most significant terms, and numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. In this talk, we'll do a deep dive into the internals of how the Semantic Knowledge Graph works and walk you through getting up and running with an example dataset to explore the meaningful relationships hidden within your data.
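For a sense of what invoking it looks like, Solr exposes the Semantic Knowledge Graph through the JSON Facet API's relatedness() aggregate (Solr 7+); in the hedged example below, the collection and field names are assumptions:

```python
# Ask Solr for the terms most related to "data science": facet over terms,
# sorted by relatedness of each term to the foreground vs. background set.
import json
import requests

body = {
    "query": 'body:"data science"',
    "limit": 0,
    "params": {"fore": 'body:"data science"', "back": "*:*"},
    "facet": {
        "related_terms": {
            "type": "terms",
            "field": "body",
            "limit": 10,
            "sort": "r desc",
            "facet": {"r": "relatedness($fore,$back)"},
        }
    },
}

resp = requests.post("http://localhost:8983/solr/techproducts/query",
                     json=body, timeout=10)
print(json.dumps(resp.json()["facets"], indent=2))
```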
The slides are from my talk giving a general introduction to artificial intelligence, machine learning, deep learning, and data science.
Applying graph analytics to data stored in relational databases can provide tremendous value in many application domains. We discuss the importance of leveraging these analyses and the challenges in enabling them. We present a tool, called GraphGen, that allows users to visually explore and rapidly analyze (using NetworkX) different graph structures present in their databases.
This document discusses GraphGen, a tool for conducting graph analytics over relational databases. It begins by introducing graph analytics and its applications. It then discusses the current state of graph analytics, which is fragmented with no single solution. Most organizations store data relationally and have "hidden" graphs that can be extracted. GraphGen provides a declarative language to define nodes and edges to extract these graphs without ETL. It supports various interfaces like Java, Python, and a web application to enable graph analytics over relational data in an intuitive way.
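The kind of "hidden" graph extraction GraphGen declaratively automates can be written out by hand for comparison; the sketch below projects a co-authorship graph out of two assumed relational tables into NetworkX:

```python
# Hand-rolled version of a GraphGen-style extraction: authors become nodes,
# co-authoring a paper becomes an edge. Schema is assumed:
#   authors(id, name), writes(author_id, paper_id)
import sqlite3
import networkx as nx

conn = sqlite3.connect("papers.db")  # assumed database

edges = conn.execute("""
    SELECT w1.author_id, w2.author_id
    FROM writes w1 JOIN writes w2
      ON w1.paper_id = w2.paper_id AND w1.author_id < w2.author_id
""").fetchall()

G = nx.Graph()
G.add_edges_from(edges)
print(nx.number_connected_components(G))  # analyze the extracted graph
```

GraphGen's point is that the self-join above (and the ETL around it) is exactly the boilerplate its declarative node/edge definitions eliminate.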
This document discusses providing a modern interface for data science on Postgres and Greenplum databases. It introduces Ibis, a Python library that provides a DataFrame abstraction for SQL systems. Ibis allows defining complex data pipelines and transformations using deferred expressions, providing type checking before execution. The document argues that Ibis could be enhanced to support user-defined functions, saving results to tables, and data science modeling abstractions to provide a full-featured interface for data scientists on SQL databases.
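A brief, hedged illustration of that style (the connection details and orders table are assumptions, and the exact API differs across Ibis versions):

```python
# Ibis defers execution: the expression below is built and type-checked
# first, then compiled to SQL and run on Postgres only when asked.
import ibis

con = ibis.postgres.connect(
    host="localhost", user="analyst", password="secret", database="shop"
)

orders = con.table("orders")
expr = (
    orders.filter(orders.amount > 0)
          .group_by(orders.customer_id)
          .aggregate(total=orders.amount.sum())
)

print(ibis.to_sql(expr))  # inspect the generated SQL before running anything
df = expr.execute()       # executes on Postgres, returns a pandas DataFrame
```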