Sebastopol, California, United States
10K followers
500+ connections
About
Activity
-
To quote Hamel H., "Its the least sexiest but most important topic" - cleaning, curating and looking at your data. Daniel van Strien and I spent…
Liked by Paco Nathan
-
The growing utilization of finite resources is causing environmental and societal challenges to escalate worldwide. Join us 7/18 for an exciting…
Liked by Paco Nathan
-
I love this presentation by my colleague Kathe Todd-Brown. Such a great description of how collaboration and communities go hand in hand…
Liked by Paco Nathan
Experience & Education
Publications
-
Entity Resolved Knowledge Graphs: A Tutorial
Neo4j
Using the Python API for Senzing to run entity resolution on three datasets about businesses in the Las Vegas metro area: SafeGraph, WHISGARD wage compliance from US Dept of Labor, PPP loans from US Chamber of Commerce. We build a knowledge graph in Neo4j from the results, then use Jupyter, Pandas, Seaborn, PyVis to compare the before/after of resolving duplicate records.
-
Latent Space
Derwen
A f*ck-around-and-find-out whodunit tale of neo-noir gore and messy flip-the-script cli-fi about artificial intelligence, animism, national security liberals, insurrection, climate guilt, weaponized media, advanced mathematics, conspiracism, global cyberwar, overlapping polycrisis, and the strangest of bedfellows.
-
NLP Entity Linking for Medical Transcripts
Manning
In this liveProject, you’re a data scientist at a healthcare provider that deals with large volumes of incoming text. Your task is to analyze a large dataset containing medical transcriptions. Leveraging technologies including pandas, the IBM Project Debater API, and Seaborn, you’ll explore a Kaggle dataset, segment text data into known categories, and extract key points.
You’ll finish by building an interactive data visualization dashboard for analysis in the open-source framework Streamlit. When you’re done, you’ll have leveled up your NLP toolbox with skills that are highly sought not only in healthcare but in law, customer support, market intelligence, media, and many other fields.
-
2022 AI in Healthcare Survey Report
Gradient Flow
Applications of AI in Healthcare pose a number of challenges and considerations which differ substantially from other business verticals. We conducted an industry survey specifically about AI in healthcare, to understand more about current trends and issues. A total of 321 respondents from 41 countries participated in the survey. A quarter of all respondents (27%) held Technical Leadership roles. This survey was conducted in collaboration with John Snow Labs.
-
Recommender Systems Best Practices
NVIDIA
Building, deploying, and optimizing recommender systems that effectively engage users and impact business value, including revenue, is hard. Data scientists, machine learning engineers, and leads within global e-commerce, media, and on-demand domains have successfully designed, built, and deployed recommendation systems that impact business value. Download this paper to get insights, best practices, and advice from expert interviews and uncover how recommender systems teams handle preprocessing, feature engineering, training models, evaluating models, selecting appropriate technologies to integrate, interoperability with open source, and more. Learn insights from leaders and technical experts at global companies such as The New York Times, Tencent, Meituan, NVIDIA, and more.
-
2021 NLP Survey Report
Gradient Flow
Our 2021 NLP Industry Survey report is informed by several important contrasts: organizations with years of history deploying NLP applications in production compared to those which are exploring NLP, responses from Technical Leaders versus general practitioners, and company size. We draw insights and indicate trends based on those contrasts. This survey was conducted in collaboration with John Snow Labs.
-
Graph Thinking
Knowledge Graph Conference
Graph Thinking, as a cognitive framework for approaching complex analytics problems which can be solved with graph technologies – with analogies from learning theory, about how people organize knowledge in graph-like cognitive structures as they progress from novice to expert in a given field.
-
Model Monitoring Enables Robust Machine Learning Applications
Gradient Flow
Key features of ML monitoring solutions, why companies need a holistic MLOps platform that includes model monitoring, and challenges companies face in making that happen.
-
Hardware > Software > Process: Data Science in a Post-Moore's Law World
Manning
Learn why hardware innovations demand rethinking how data teams build analytics and ML applications.
-
2021 AI in Healthcare Survey Report
Gradient Flow
Applications of AI in Healthcare pose a number of challenges and considerations which differ substantially from other business verticals. We conducted an industry survey specifically about AI in healthcare, to understand more about current trends and issues. A total of 373 respondents from 49 countries participated in the survey. A quarter of all respondents (27%) held Technical Leadership roles. This survey was conducted in collaboration with John Snow Labs.
-
Operationalizing AI
O'Reilly Media
Across industry sectors, both management and leaders see a yawning gap between the promised and delivered impact of data science projects and wonder why the discrepancy exists. It's simple, really. Companies rely on highly skilled and expensive data scientists to help them build predictive capabilities into their products and workflows, but they often think the data science team alone can lead the change.
This report examines issues from several conversations the authors held with data science teams across industries, as well as those issues they've witnessed in their own experience as builders and leaders. Among their findings, the authors agreed that to shorten the production process, lower overhead, and reduce risk, organizations need a comprehensive understanding of how to build AI in a repeatable fashion.
-
2020 NLP Survey Report
Gradient Flow
The Natural Language Processing (NLP) Industry Survey was an online survey which ran for 41 days (July 5 to August 14, 2020). A total of 571 respondents from more than 50 countries completed the survey. A quarter of all respondents hold technical leadership roles. Respondents were recruited via social media, online advertising, the Gradient Flow Newsletter, and through industry partners and contacts. This survey was sponsored by John Snow Labs.
-
Intro to RLlib: Example Environments
Anyscale
RLlib is an open-source library in Python, based on Ray, which is used for reinforcement learning (RL). This article provides a hands-on introduction to RLlib and reinforcement learning by working step-by-step through sample code. The material in this article, which comes from Anyscale Academy, provides a complement to the RLlib documentation.
-
Visualizing Geospatial Data in Python
Towards Data Science
Open source tools and techniques for visualizing data on custom maps in Python.
-
Rich Search and Discovery for Research Datasets
SAGE Publishing
This ground-breaking book explores how automating the search for and discovery of datasets can help tackle irreproducibility in social science.
-
Agile AI
O'Reilly Media
As more companies work to adopt AI for business processes, project costs and failure rates are on the rise. Why? No standard practice exists for implementing AI in business applications, and many organizations don’t have the skills, processes, and tools to mitigate risk.
-
Fifty Years of Data Management and Beyond
O'Reilly Media
Every decade since the 1960s, researchers at companies like IBM, Amazon, and many others have introduced major new frameworks and techniques to handle rising data management problems. This concise ebook explains how these new systems helped data science evolve quickly—from hierarchical and relational databases to big data and cloud computing to streaming and graph data.
-
A landscape diagram for Python data
IBM Data Science Community
What are the open source libraries in Python which are popularly used in data science work, and how do they fit together?
-
AI Adoption in the Enterprise
O'Reilly Media
While O’Reilly has identified several trends among enterprise companies for adopting artificial intelligence, we decided to drill down further to learn just how businesses worldwide are planning and prioritizing this work. In a recent survey, we asked respondents about revenue-bearing AI projects their organizations have in production. How might their AI adoption patterns change over the course of the next year?
-
Evolving Data Infrastructure
O'Reilly Media
How are companies using or exploring AI, big data, and the cloud for advanced analytics and automation? In an O’Reilly survey conducted in October 2018, more than 3,200 companies throughout the world—located primarily in North America, Europe, and Asia—revealed their choices of tools, technologies, and practices for pursuing sophisticated cloud-based data solutions.
-
The State of Machine Learning Adoption in the Enterprise
O'Reilly Media
While the use of machine learning (ML) in production started near the turn of the century, it’s taken roughly 20 years for the practice to become mainstream throughout industry. With this report, you’ll learn how more than 11,000 data specialists responded to a recent O’Reilly survey about their organization’s approach—or intended approach—to machine learning.
-
Building Data Science Teams
O'Reilly Media
Imagine cooking a stew with a single ingredient or growing a country garden with a single type of flower. One-dimensional efforts like these yield bland and boring results. Now imagine staffing a data science team with only PhDs in machine learning. In spite of the impressive pedigree, the result would be similar: bland, boring, and, possibly worse, ineffective.
But if not just data people, then who?
-
Introduction to Apache Spark
O'Reilly Media
With its ability to perform fast, in-memory cluster computing, Apache Spark is emerging as a favorite technology for analytics on large datasets. This video workshop from Paco Nathan (host of the Just Enough Math workshop) provides developers with an introduction to Spark and its core APIs. By working with hands-on technical exercises, you’ll get up to speed on how to use Spark for data exploration, analysis, and building big data applications in Python, Java, or Scala.
-
Just Enough Math
O'Reilly Media
The webcast introduces advanced math for business people, "just enough" to take advantage of open source frameworks, including graph theory, abstract algebra, optimization, Bayesian statistics, and more advanced areas of linear algebra. These are needed for supply chain optimization, pricing models, and anti-fraud, especially given the increased data rates coming from the Internet of Things.
-
Intro to Apache Spark workshop
Databricks
Authored a full-day, hands-on workshop introducing Apache Spark, led team + partners to deliver instruction worldwide.
-
Whitepaper: Agricultural Systems + Data Outlook
The Data Guild
How can data be leveraged to make food production and distribution systems more responsive, resilient, and efficient? An ecosystem of agricultural data has been quietly evolving, and is rapidly becoming a vital component of global food security. The data rates and variety are vast: remote sensing via small satellites, sensor networks in the fields, tractors-as-drones, and more. Many issues implied by this category of data, however, are quite subtle and in some cases counterintuitive. Given that this field is relatively new and not particularly organized yet, key learnings may be adapted from other sectors where large-scale data and analytics have already played a transformational role: finance, intelligence, e-commerce, telecom, energy, etc.
-
Enterprise Data Workflows with Cascading
O'Reilly Media
Despite its growing use in the enterprise, building applications for Hadoop is notoriously difficult. But there is a solution. This hands-on book introduces you to Cascading, the framework that enables you to build powerful data processing applications on Hadoop without having to spend months learning the intricacies of MapReduce.
Whether you’re a developer, data scientist, or system/IT administrator, you’ll quickly learn Cascading’s streamlined approach to data processing, data filtering, and workflow optimization, using sample apps based on Java, Scala, and Clojure. Companies such as Etsy, Razorfish, TeleNav, and Twitter already use Cascading for mission-critical applications. This book shows you how this framework can help your organization extract meaningful information from large amounts of distributed data.
-
What "Countermeasures" Really Means
O'Reilly Media
Building a case for use of risk metrics in determining reasonable countermeasures to network security attacks. Introduction to "OpenSIMS" open source project.
-
The Corporate Body: Liber 118 U.S. 394
Signum Press
Review of "corporate metabolism" metaphor.
-
Corporate Metabolism
Tripzine
An extensive analysis of the structure and function of the "corporate organism".
-
Jackson Wins, Feds Lose
Wired
Coverage of federal court case in Steve Jackson Games vs. US Secret Service.
Projects
-
SofLiM4KG
The Software Lifecycle Management for KG workshop (SofLiM4KG) aims to collect experiences in successful and abandoned knowledge graph projects from this perspective to (a) carve out the specifics in knowledge graph engineering that pose challenges beyond software engineering practices, (b) to establish best practices and anti-patterns for the community, and (c) build the foundations for the systematic investigation of the connection to software engineering, as well as qualitative and quantitative studies in project management of knowledge graphs.
This project originated at Dagstuhl 24061, in Feb 2024.
-
TextGraphs
Using LLMs to boost the performance of NLP tasks in KG construction, introducing use of a "lemma graph" (linguistic provenance) for graph levels of detail, and exploring topological transforms to enhance graph ML capabilities. This research surveys and evaluates the open source model capabilities for named entity recognition, entity linking, relation extraction, and graph of relations.
-
MkRefs
MkDocs plugin to generate "semantic reference" materials as Markdown pages, from a knowledge graph.
-
kglab
Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, RDFlib, pySHACL, NetworkX, iGraph, PyVis, pslpython, pyarrow, etc.
-
disparity_filter
- Present
Implements a disparity filter in Python, based on graphs in NetworkX, to extract the multiscale backbone of a complex weighted network (Serrano, et al., 2009)
-
PyTextRank
- Present
Python implementation of TextRank for text document NLP parsing and extractive summarization, based atop spaCy, datasketch, NetworkX. Graph algorithms for advanced NLP and preparing text data to use in deep learning, etc.
-
Ray tutorial
An introductory tutorial about leveraging Ray core features for distributed patterns.
-
richcontext.scholapi
Rich Context API integrations for federating metadata discovery and exchange across multiple scholarly infrastructure providers.
-
Apache Spark Developer Certification
Authored exam, assisted on Databricks+O'Reilly Media partnership and publicity, led team executing on proctoring, evaluations, analysis, exam iteration, etc.
-
Exelixi
Exelixi is a distributed framework based on Apache Mesos, mostly implemented in Python using gevent for high-performance concurrency. It is intended to run cluster computing jobs (partitioned batch jobs, which include some messaging) in pure Python. By default, it runs genetic algorithms at scale.
-
Cascading Pattern
Pattern sub-project for http://Cascading.org/ which uses flows as containers for machine learning models, importing PMML model descriptions from R, SAS, Weka, RapidMiner, KNIME, SQL Server, etc.
-
Cascading for the Impatient
An introduction to programming with the Cascading API for MapReduce workflow orchestration. We start with the simplest possible Cascading app, a file copy, and progress up to a full implementation of TF-IDF in Cascading. Also showing best practices and test-driven development features for working with data at scale.
-
Cascading + City of Palo Alto open data
An example of a "Big Data" application, based on Cascading, which leverages City of Palo Alto open data... find a shady spot on a hot day, to walk and take a phone call.
Honors & Awards
-
Top 30 People in Big Data and Analytics
Innovation Enterprise
http://www.kdnuggets.com/2015/02/top-30-people-big-data-analytics.html
-
NISOD Excellence Award
Austin Community College
Awarded as an adjunct professor at ACC, for developing a network security program for the Continuing Education department. https://www.nisod.org/forms/past_ea_recipients/
More activity by Paco
-
Happy second birthday to Open Source Science Initiative (OSSci)! We’ve been busy laying the groundwork, looking forward to the next phase.
Liked by Paco Nathan
-
Ok, this is kind of mind-blowing. I've been telling people to keep an eye on WASM and the reason why may not be as obvious to some of you, but a few…
Liked by Paco Nathan
-
🖥️ Nvidia reigns supreme in AI chips, but the game is changing. From environmental impacts to on-device AI, the industry faces new challenges. As…
Liked by Paco Nathan
-
We just pushed an updated version of our website: https://kuzudb.com/. The updates include some quotes from members of the community or people who…
Liked by Paco Nathan
-
Excellent article. Moreover, Google Maps has been getting especially aggressive with rerouting while in transit. One could almost call the results…
Shared by Paco Nathan