A Data Science Pipeline for Real Companies
Comcast’s Approach to Multi-datacenter, Cloud and On-premise Machine Learning
“Comcast brings together
the best in media and
technology. We drive
innovation to create the
world's best entertainment
and online experiences.”
High Speed Internet
Video
Home Automation
Digital Voice
Xfinity Mobile
Content
$84b (2017)
29m Customers
A machine learning and data science pipeline for real companies

Use Cases: It’s All About The Customer
• Predictive network analysis
• Customer premise self-healing
• Comcast network self-healing
• Trouble-ticket prioritization
• Customer self-help (voice and text flows)
• Customer Retention
Our starting point!
[Diagram labels: Internal Data Centers; Cloud Based Infrastructure; E1, E2, E3; Predictions; Next Big Thing]
• Where’s the data?
• How do I access it?
• What tools do I have?
• Where can I find information about data?
Our Challenges
• Security
• Diversity of Skills
• Discoverability
Guiding Principles
• FAST: Provide frameworks and capabilities that allow for rapid deployment.
• SIMPLE & TRANSPARENT: Develop capabilities to promote self-service and ease of access to data.
• CONSISTENT & SECURE: Provide a universal security framework to govern all data under the Big Data Domain.
• FULLY AUTOMATED: Provide a robust operational model allowing for playback, data quality, and self-healing.
Product Vision
Gather, organize, and make sense of Comcast data, and make it universally accessible to empower, enable, and transform Comcast into an insight-driven organization.

Approach
• Avoid religious wars where possible
• Use whatever framework makes sense for the business problem at hand
• Focus on federated access to curated data
• Focus on Common APIs for Ingest, Egress and Machine Learning
• Focus on metadata and discoverability for:
– Enterprise Data
– Enterprise Features
– Trained Models
• Enterprise Portal
• Containerized scoring endpoints that accommodate multiple frameworks and models
Shameless Lyft: Uber’s Michelangelo becomes Comcast’s Da Vinci
• Focus on Art AND Science (and a smattering of creativity)
• Common APIs usable from multiple frameworks using Python or Scala
• Metadata is Key
• About data
• About features
• About trained models
Focus on a Common Approach to Features and Models
[Architecture diagram. Labels: Ingest API; Egress API (Federation Layer); Feature Store; Model Store; Apache Atlas; Portal; On Premise; Cloud; tools such as Presto and Alluxio; <Your Favorite Framework Here>; Scala; Python; Open ML API; Training; Deployment; Containers; Client]
Open ML API
• Reads and writes features and feature metadata
• Reads and writes model metadata
• Integrates with a common portal searchable by any user
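
The three responsibilities above can be illustrated with a small in-memory stand-in. This is only a sketch: the class name `OpenMLClient`, its method names, and the example feature and model names are hypothetical and do not come from the talk, and a real implementation would sit in front of the Feature Store, Model Store, and portal rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class OpenMLClient:
    """Hypothetical in-memory stand-in for the Open ML API."""
    features: dict = field(default_factory=dict)        # name -> {"values", "metadata"}
    model_metadata: dict = field(default_factory=dict)  # name -> metadata dict

    def write_feature(self, name, values, metadata):
        self.features[name] = {"values": list(values), "metadata": dict(metadata)}

    def read_feature(self, name):
        return self.features[name]

    def write_model_metadata(self, name, metadata):
        self.model_metadata[name] = dict(metadata)

    def read_model_metadata(self, name):
        return self.model_metadata[name]

    def search(self, term):
        """Portal-style search over names and metadata values."""
        term = term.lower()
        hits = []
        for name, entry in self.features.items():
            text = " ".join([name, *map(str, entry["metadata"].values())])
            if term in text.lower():
                hits.append(("feature", name))
        for name, meta in self.model_metadata.items():
            text = " ".join([name, *map(str, meta.values())])
            if term in text.lower():
                hits.append(("model", name))
        return sorted(hits)


client = OpenMLClient()
client.write_feature("modem_uptime_hours", [12.5, 40.0], {"owner": "network-team"})
client.write_model_metadata("ticket_priority", {"owner": "care-team", "task": "classification"})
found = client.search("team")
```

Here `found` lists both the feature and the model, since each has an owner whose name contains the search term.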
DEMO
The Data Science Pipeline
DX Alpha
Goals
• Develop a system to manage features and models running in Spark
– Based on Uber’s Michelangelo
• Make it easier to build and deploy data transformations and ML models
• Enhance sharing of code across data science teams
• Support a variety of data science toolkits
Data Science Pipeline Components
• Feature Store
– Standardized approach to define data transformations
– A feature is a single attribute or column in a data frame
– A feature table is a set of features combined with metadata
– The transformation definition is separated from the “context” in which it is applied
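
One way to picture the separation between a transformation definition and its context is sketched below. Plain Python rows stand in for a Spark DataFrame, and the names `Feature`, `FeatureTable`, and `apply` are assumptions made for illustration, not the pipeline's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]  # stand-in for one row of a data frame


@dataclass(frozen=True)
class Feature:
    """A single derived column: a name plus a pure function of a row."""
    name: str
    transform: Callable[[Row], float]


@dataclass(frozen=True)
class FeatureTable:
    """A set of features combined with metadata."""
    features: List[Feature]
    metadata: Dict[str, str]

    def apply(self, rows: List[Row]) -> List[Row]:
        """Bind the definition to a context (here, a list of rows)."""
        return [{f.name: f.transform(r) for f in self.features} for r in rows]


# The definition carries no reference to any particular data set.
table = FeatureTable(
    features=[
        Feature("error_rate", lambda r: r["errors"] / r["requests"]),
        Feature("is_busy", lambda r: float(r["requests"] > 100)),
    ],
    metadata={"owner": "example-team", "description": "toy traffic features"},
)

# The context is supplied only when the table is applied.
batch = [{"errors": 5, "requests": 50}, {"errors": 2, "requests": 200}]
result = table.apply(batch)
```

The same `table` object could be applied to a different batch, or to a stream, without changing its definition.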

Data Science Pipeline Components
• Model store
– Standardized approach for defining models
– A model is defined by train, predict, and evaluate functions, a hyperparameter set, and associated metadata
– The definitions of the train, predict, and evaluate functions are separated from their application
– A model may be associated with one or more trained instances, prediction data frames, and evaluation metrics
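
A minimal sketch of such a model definition follows, assuming a toy "predict the mean" model. The `ModelDefinition` class, its `fit` helper, and the `shrink` hyperparameter are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ModelDefinition:
    """Train/predict/evaluate functions plus hyperparameters and metadata."""
    name: str
    train: Callable[[List[float], Dict[str, Any]], Any]
    predict: Callable[[Any, List[float]], List[float]]
    evaluate: Callable[[List[float], List[float]], float]
    hyperparameters: Dict[str, Any]
    metadata: Dict[str, str] = field(default_factory=dict)
    trained_instances: List[Any] = field(default_factory=list)

    def fit(self, data: List[float]) -> Any:
        """Apply the train function and record the trained instance."""
        instance = self.train(data, self.hyperparameters)
        self.trained_instances.append(instance)
        return instance


def train_mean(data, hp):
    return hp["shrink"] * sum(data) / len(data)


def predict_constant(instance, inputs):
    return [instance for _ in inputs]


def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


baseline = ModelDefinition(
    name="baseline_mean",
    train=train_mean,
    predict=predict_constant,
    evaluate=mean_absolute_error,
    hyperparameters={"shrink": 1.0},
    metadata={"owner": "example-team"},
)

instance = baseline.fit([1.0, 2.0, 3.0])
predictions = baseline.predict(instance, [0.0, 0.0])
score = baseline.evaluate(predictions, [1.0, 3.0])
```

Calling `fit` again with different data would add a second trained instance to the same definition, which matches the one-to-many relationship described above.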
Data Science Pipeline Components
• Job Scheduler/Runner
– Handle streaming, scheduled, and one-time jobs
– Support interdependencies between jobs
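
Interdependencies between jobs can be resolved with a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the job names are invented, and a production scheduler such as Airflow would handle this ordering itself.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on (hypothetical job names).
dependencies = {
    "build_features": {"ingest"},
    "train_model": {"build_features"},
    "evaluate_model": {"train_model"},
    "publish_report": {"evaluate_model", "build_features"},
}


def run_order(deps):
    """Return an order in which every job runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
```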
Data Science Pipeline Components
• File system
– Store executable objects such as jar files and notebooks
– Store data frames
– Store trained models and other runtime artifacts
Development Approach
• Build on top of Databricks and Spark
• Start with a “thin slice” proof of concept
– Demonstrate basic end to end run from data exploration to model evaluation
• Iterate to improve usability and tooling

What do we need to know about Feature Tables?
• Descriptive Information
– What data transformation does it perform?
– Who owns this feature table?
– Description of input/output
• Build/deployment information
– Where’s the code? What’s the current version?
– What artifacts have been deployed to the production environment?
• Run information
– What jobs are running or have been run?
– What’s the status of these jobs?
– What data sets or streams are being produced and how do I access them?
– Are there performance metrics or summary statistics available?
What do we need to know about Models?
• Descriptive Information
– What does it do? Classification? Regression?
– Who owns this model?
– Is it supervised or unsupervised? What type of labels are required?
– What features does it use?
• Build/deployment information
– Where’s the code? What’s the current version?
– What artifacts have been deployed to the production environment?
• Run information
– What training / prediction / evaluation jobs are running or have run?
– What’s the status of these jobs?
– What data sets or streams are being produced and how do I access them?
– How well is the model performing? What criteria are being used to assess this?
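
The questions above map naturally onto a metadata record with descriptive, build/deployment, and run sections. The field names, URIs, and values below are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobRun:
    job_id: str
    kind: str        # "training", "prediction", or "evaluation"
    status: str      # e.g. "running", "succeeded", "failed"
    output_uri: str  # where the produced data set or stream can be read


@dataclass
class ModelRecord:
    # Descriptive information
    name: str
    owner: str
    task: str                 # "classification", "regression", ...
    supervised: bool
    features: List[str]
    # Build/deployment information
    code_uri: str
    version: str
    deployed_artifacts: List[str] = field(default_factory=list)
    # Run information
    runs: List[JobRun] = field(default_factory=list)

    def active_runs(self) -> List[JobRun]:
        """Answer 'what jobs are running?' for this model."""
        return [r for r in self.runs if r.status == "running"]


record = ModelRecord(
    name="ticket_priority",
    owner="example-team",
    task="classification",
    supervised=True,
    features=["error_rate", "is_busy"],
    code_uri="git://example/ticket_priority",  # placeholder location
    version="0.1.0",
)
record.runs.append(JobRun("job-1", "training", "succeeded", "/tmp/example/train"))
record.runs.append(JobRun("job-2", "prediction", "running", "/tmp/example/predict"))
running_ids = [r.job_id for r in record.active_runs()]
```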
What actions do we need to perform?
• Data exploration and development
• Packaging, versioning, and deployment of ML code
• Job scheduling and monitoring
• Storage/discovery/retrieval of job results
– Data frames
– Metrics
– Trained models
• Discovery of and interaction with Features and Models
What Technologies Already Do This?
• Data exploration and development
– Databricks notebooks, local IDE
• Packaging, versioning, and deployment of code and metadata
– GitHub / Jenkins / Mortar
– Document store for metadata (MongoDB, Cassandra, etc.)
• Job scheduling and monitoring
– Airflow, Databricks Jobs API
• Storage of job results
– DBFS / S3, need to define standard file structure
• Discovery of and interaction with Features and Models
– Finding – Elasticsearch or existing Thin Slice API
– Reading/processing data frame artifacts – Spark, Databricks notebooks
– Retrieving/viewing performance metrics - ???
– Monitoring model performance over time - ???
– Algebraic composition of features and models - ???
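
The slide notes that a standard file structure for job results still needs to be defined. One possible convention is sketched below; the root prefix, the directory order, and the `artifact_path` helper are all assumptions, not a layout described in the talk.

```python
from pathlib import PurePosixPath

# Hypothetical root; in practice this might be a DBFS mount or an S3 prefix.
ROOT = PurePosixPath("/pipeline")

ALLOWED_KINDS = {"features", "models"}


def artifact_path(kind, name, version, run_id, filename):
    """Build <root>/<kind>/<name>/v<version>/<run_id>/<filename>."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown artifact kind: {kind!r}")
    return ROOT / kind / name / f"v{version}" / run_id / filename


metrics_path = artifact_path("models", "ticket_priority", 3, "run-0001", "metrics.json")
```

Keeping the version and run identifier in the path lets data frames, metrics, and trained models from the same run be found together.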

Open questions
• How do we abstract file system details and other constants?
• How do we standardize ETL from other systems within Comcast?
• How do we support human-labeling of data sets?
• What other tools (H2O, R, etc.) do we need to support?
• Are there other ways we need to interact with features and models?
• How do we integrate AutoML?
• Other technologies that may be useful? Databricks Delta? Amazon Sagemaker?
Architecture V2: Pipelines
• Pipeline Segments (same as Spark ML)
– Transformers
– Estimators
• Pipeline: linear sequence of Pipeline Segments (same as Spark ML)
– Transformation pipelines contain only Transformers
– Estimation pipelines contain one or more Transformers and end with an Estimator
[Diagram: a Transformation Pipeline and an Estimation Pipeline, drawn as chains of nodes labeled D, T, and E]
Architecture V2: Pipelines
• A pipeline is just a function
• It does not produce anything until supplied a specific DataFrame as input
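
The idea that a pipeline is just a function can be shown with plain function composition. In this sketch a list of dictionaries stands in for a Spark DataFrame, and `make_pipeline`, `add_total`, and `keep_positive` are hypothetical names.

```python
from functools import reduce
from typing import Callable, Dict, List

DataFrame = List[Dict[str, float]]            # stand-in for a Spark DataFrame
Transformer = Callable[[DataFrame], DataFrame]


def make_pipeline(*segments: Transformer) -> Transformer:
    """Compose segments left to right into a single function."""
    def run(df: DataFrame) -> DataFrame:
        return reduce(lambda acc, segment: segment(acc), segments, df)
    return run


def add_total(df: DataFrame) -> DataFrame:
    return [{**row, "total": row["a"] + row["b"]} for row in df]


def keep_positive(df: DataFrame) -> DataFrame:
    return [row for row in df if row["total"] > 0]


# Building the pipeline computes nothing.
pipeline = make_pipeline(add_total, keep_positive)

# Output exists only once a specific input is supplied.
output = pipeline([{"a": 1, "b": 2}, {"a": -5, "b": 1}])
```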
Architecture V2: Workflows
• A Workflow is a directed (acyclic?) graph of Pipelines
– DataSources load data (from disk, streams, etc.) into a DataFrame
– Connectors merge the output of multiple DataSources into a single DataFrame
– Pipelines process the DataFrames
– DataSinks receive the output of the last pipeline
• Workflow rules
– Workflows must end in a single Pipeline node
– An Estimation Pipeline may only appear as the last node in a Workflow
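
The two workflow rules can be checked mechanically. The sketch below assumes the graph is acyclic (the slide leaves that open), leaves DataSinks out of the graph, and uses invented node kinds and a hypothetical `validate_workflow` function. It relies on `graphlib` from the standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

PIPELINE_KINDS = {"transformation_pipeline", "estimation_pipeline"}


def validate_workflow(kinds, edges):
    """kinds: node -> kind; edges: (upstream, downstream) pairs.

    Returns a list of rule violations (empty when the workflow is valid).
    """
    problems = []
    predecessors = {node: set() for node in kinds}
    has_successor = set()
    for upstream, downstream in edges:
        predecessors[downstream].add(upstream)
        has_successor.add(upstream)

    # Assumption: workflows are DAGs.
    try:
        list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        problems.append("workflow contains a cycle")

    # Rule 1: the workflow ends in exactly one Pipeline node.
    terminals = [n for n in kinds if n not in has_successor]
    if len(terminals) != 1 or kinds[terminals[0]] not in PIPELINE_KINDS:
        problems.append("workflow must end in a single Pipeline node")

    # Rule 2: an estimation pipeline may only be the last node.
    for node, kind in kinds.items():
        if kind == "estimation_pipeline" and node in has_successor:
            problems.append(f"estimation pipeline {node!r} is not the last node")
    return problems


kinds = {
    "d1": "source", "d2": "source", "c": "connector",
    "pt": "transformation_pipeline", "pe": "estimation_pipeline",
}
good = validate_workflow(kinds, [("d1", "c"), ("d2", "c"), ("c", "pt"), ("pt", "pe")])
bad = validate_workflow(kinds, [("d1", "c"), ("d2", "c"), ("c", "pe"), ("pe", "pt")])
```

In the second call the estimation pipeline feeds a transformation pipeline, so the check reports a violation of the second rule.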

Architecture V2: Workflows
[Diagram: a Transformation Workflow and a Training Workflow, drawn as graphs of nodes labeled D, C, T, PT, and PE]
Architecture V2: Data Sources
• Potential sources of data
– HTTP request
– Persistent store (avro, parquet, EDW, …)
– Kafka topic
– Others?
• Connectors could handle complex logic such as combining HTTP data with other sources before feeding into a Workflow
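
As an illustration of a connector that combines HTTP data with another source, the sketch below inner-joins the rows from several sources on a shared key. The `join_connector` function, the join semantics, and the sample rows are assumptions; the talk does not specify how connectors merge data.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
DataSource = Callable[[], List[Row]]


def join_connector(key: str, *sources: DataSource) -> DataSource:
    """Return a DataSource that inner-joins all sources on `key`."""
    def load() -> List[Row]:
        merged = {row[key]: dict(row) for row in sources[0]()}
        for source in sources[1:]:
            incoming = {row[key]: row for row in source()}
            merged = {
                k: {**row, **incoming[k]}
                for k, row in merged.items()
                if k in incoming
            }
        return list(merged.values())
    return load


def http_request_source() -> List[Row]:
    # Stand-in for data arriving in an HTTP request.
    return [{"account": "a1", "symptom": "slow"}]


def persistent_store_source() -> List[Row]:
    # Stand-in for rows read from a persistent store.
    return [{"account": "a1", "plan": "gig"}, {"account": "a2", "plan": "basic"}]


combined = join_connector("account", http_request_source, persistent_store_source)
rows = combined()
```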
Architecture V2
• Components
– Pipeline Segment Store
• Code catalog of available transforms, estimators, and pipelines
• Searchable by description, tags, and maybe by schema?
– “Find me a feature of type x that is tagged y”
– Workflow Store
• Stores DAGs (maybe Neo4j or other graph DB?)
• Integrates with Databricks to run DAGs as jobs
• Periodic graph analysis to optimize Workflows
– Data Source Store?
• Separate system or subset of Workflow Store?
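
A query such as “Find me a feature of type x that is tagged y” could be served by a simple catalog filter. The entry fields and the `find_segments` function below are hypothetical; a real Pipeline Segment Store would more likely sit on a search index or document store.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class SegmentEntry:
    name: str
    kind: str            # "transformer", "estimator", or "pipeline"
    output_type: str     # type of the column the segment produces
    tags: FrozenSet[str]
    description: str = ""


def find_segments(
    catalog: List[SegmentEntry],
    kind: Optional[str] = None,
    output_type: Optional[str] = None,
    tag: Optional[str] = None,
) -> List[SegmentEntry]:
    """Return entries matching every criterion that was supplied."""
    return [
        entry for entry in catalog
        if (kind is None or entry.kind == kind)
        and (output_type is None or entry.output_type == output_type)
        and (tag is None or tag in entry.tags)
    ]


catalog = [
    SegmentEntry("error_rate", "transformer", "double", frozenset({"network"})),
    SegmentEntry("plan_name", "transformer", "string", frozenset({"billing"})),
    SegmentEntry("churn_model", "estimator", "double", frozenset({"retention"})),
]

matches = [e.name for e in find_segments(catalog, output_type="double", tag="network")]
```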

Mark Billinghurst
 


A machine learning and data science pipeline for real companies

  • 1. A Data Science Pipeline for Real Companies: Comcast’s Approach to Multi-datacenter, Cloud and On-premise Machine Learning
  • 2. “Comcast brings together the best in media and technology. We drive innovation to create the world's best entertainment and online experiences.” High Speed Internet, Video, Home Automation, Digital Voice, Xfinity Mobile, Content. $84b (2017), 29m Customers
  • 5. Use Cases: It’s All About The Customer • Predictive network analysis • Customer premise self-healing • Comcast network self-healing • Trouble-ticket prioritization • Customer self-help (voice and text flows) • Customer Retention
  • 6. Our starting point! Internal Data Centers; Cloud Based Infrastructure; E1, E2, E3; Predictions; Next Big Thing. Where’s the data? How do I access it? What tools do I have? Where can I find information about data?
  • 11. Our Challenges: Security, Diversity of Skills, Discoverability
  • 12. Product Vision: Gather, organize, and make sense of Comcast data, and make it universally accessible to empower, enable, and transform Comcast into an insight-driven organization. Guiding Principles • FAST: Provide frameworks and capabilities that allow for rapid deployment. • SIMPLE & TRANSPARENT: Develop capabilities to promote self-service and ease of access to data. • CONSISTENT & SECURE: Provide a universal security framework to govern all data under the Big Data Domain. • FULLY AUTOMATED: Provide a robust operational model allowing for playback, data quality, and self-healing.
  • 13. Approach • Avoid religious wars where possible • Whatever framework makes sense for the business problem at hand • Focus on federated access to curated data • Focus on Common APIs for Ingest, Egress and Machine Learning • Focus on metadata and discoverability for: • Enterprise Data • Enterprise Features • Trained Models • Enterprise Portal • Containerized scoring endpoints that accommodate multiple frameworks and models
  • 14. Focus on a Common Approach to Features and Models. Shameless Lyft: Uber’s Michelangelo becomes Comcast’s Da Vinci • Focus on Art AND Science (and a smattering of creativity) • Common APIs usable from multiple frameworks using Python or Scala • Metadata is Key • About data • About features • About trained models
  • 15. [Architecture diagram labels: ATLAS; Ingest API; Egress API (Federation Layer); Feature Store; Model Store; On Premise; Cloud; Portal; <Your Favorite Framework Here>; APACHE; tools such as Presto and Alluxio; Scala; Python; Open ML API; Training; Deployment; Container (x3); Client]
  • 16. Open ML API • Reads and writes features and feature metadata • Reads and writes model metadata • Integrates with a common portal searchable by any user
  • 18. The Data Science Pipeline DX Alpha
  • 19. Goals • Develop a system to manage features and models running in Spark – Based on Uber’s Michelangelo • Make it easier to build and deploy data transformations and ML models • Enhance sharing of code across data science teams • Support a variety of data science toolkits
  • 20. Data Science Pipeline Components • Feature Store – Standardized approach to define data transformations – A feature is a single attribute or column in a data frame – A feature table is a set of features combined with metadata – The transformation definition is separated from the “context” in which it is applied
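One way to picture the separation of a transformation's definition from the context in which it is applied is the plain-Python sketch below. It is an illustration only, not Comcast's implementation: the `Feature` and `FeatureTable` classes, the `error_rate` feature, and the use of lists of dicts in place of Spark data frames are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, float]  # stand-in for one row of a data frame (assumption)


@dataclass(frozen=True)
class Feature:
    """A single derived column: a name, a transformation, and a description."""
    name: str
    transform: Callable[[Row], float]  # the definition, independent of any data set
    description: str = ""


@dataclass
class FeatureTable:
    """A set of features combined with table-level metadata."""
    name: str
    owner: str
    features: List[Feature] = field(default_factory=list)

    def apply(self, rows: List[Row]) -> List[Row]:
        # The "context" (which rows to transform) is supplied only at call time.
        return [{f.name: f.transform(row) for f in self.features} for row in rows]


# Hypothetical feature: fraction of failed requests, guarding against zero traffic.
error_rate = Feature(
    name="error_rate",
    transform=lambda r: r["errors"] / r["requests"] if r["requests"] else 0.0,
    description="Fraction of requests that failed",
)
table = FeatureTable(name="modem_health", owner="example-team", features=[error_rate])

# The same definition is reused in two different contexts.
training_rows = [{"errors": 5, "requests": 100}]
scoring_rows = [{"errors": 0, "requests": 0}]
print(table.apply(training_rows))
print(table.apply(scoring_rows))
```

Because the table holds only definitions, the same object can be applied to a historical batch for training and to fresh rows for scoring without any change to the feature code.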
  • 21. Data Science Pipeline Components • Model store – Standardized approach for defining models – A model is defined by train, predict, and evaluate functions, a hyperparameter set, and associated metadata – The definitions of the train, predict, and evaluate functions are separated from their application – A model may be associated with one or more trained instances, prediction data frames, and evaluation metrics
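The model-store idea above can be sketched in plain Python as a definition object that carries train, predict, and evaluate functions plus hyperparameters and metadata, with trained instances kept separately. Everything here is hypothetical: the class names, the `fit` and `score` helpers, and the toy mean-predicting model are assumptions chosen to keep the example small, not the talk's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Example = Tuple[float, float]  # (feature value, label); a toy stand-in for a data frame row


@dataclass
class ModelDefinition:
    """Functions plus hyperparameters and metadata; holds no trained state."""
    name: str
    train: Callable[[List[Example], Dict[str, Any]], Any]
    predict: Callable[[Any, float], float]
    evaluate: Callable[[List[float], List[float]], float]
    hyperparameters: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class TrainedInstance:
    """One application of a definition to a specific training set."""
    definition: ModelDefinition
    state: Any
    metrics: Dict[str, float] = field(default_factory=dict)


def train_mean(examples: List[Example], hp: Dict[str, Any]) -> float:
    # Toy "training": a (optionally prior-weighted) mean of the labels.
    labels = [label for _, label in examples]
    weight = hp.get("prior_weight", 0.0)
    return (hp.get("prior", 0.0) * weight + sum(labels)) / (weight + len(labels))


def predict_mean(state: float, x: float) -> float:
    return state  # ignores the input; always predicts the learned mean


def mean_absolute_error(predictions: List[float], labels: List[float]) -> float:
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)


def fit(definition: ModelDefinition, examples: List[Example]) -> TrainedInstance:
    state = definition.train(examples, definition.hyperparameters)
    return TrainedInstance(definition=definition, state=state)


def score(instance: TrainedInstance, examples: List[Example]) -> float:
    d = instance.definition
    predictions = [d.predict(instance.state, x) for x, _ in examples]
    metric = d.evaluate(predictions, [y for _, y in examples])
    instance.metrics["mae"] = metric
    return metric


baseline = ModelDefinition(
    name="mean_baseline",
    train=train_mean,
    predict=predict_mean,
    evaluate=mean_absolute_error,
    hyperparameters={"prior": 0.0, "prior_weight": 0.0},
    metadata={"task": "regression", "owner": "example-team"},
)
examples = [(0.1, 1.0), (0.9, 3.0)]
instance = fit(baseline, examples)
mae = score(instance, examples)
```

A single `ModelDefinition` can yield many `TrainedInstance` objects, which matches the slide's point that one model may be associated with several trained instances and evaluation metrics.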
  • 22. Data Science Pipeline Components • Job Scheduler/Runner – Handle streaming, scheduled, and one-time jobs – Support interdependencies between jobs
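Support for interdependencies between jobs usually comes down to ordering a dependency graph. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) to derive a run order; the job names and the `execution_order` helper are hypothetical, and a real scheduler such as Airflow would add retries, triggers, and monitoring on top of this.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Each job maps to the set of jobs that must finish before it starts (hypothetical names).
jobs: Dict[str, Set[str]] = {
    "build_features": set(),
    "train_model": {"build_features"},
    "evaluate_model": {"train_model"},
    "publish_metrics": {"evaluate_model"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return one valid run order; raises CycleError if jobs depend on each other in a loop."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(jobs)
print(order)

# A circular dependency is rejected rather than silently scheduled.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```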
  • 23. Data Science Pipeline Components • File system – Store executable objects such as jar files and notebooks – Store data frames – Store trained models and other runtime artifacts
  • 24. Development Approach • Build on top of Databricks and Spark • Start with a “thin slice” proof of concept – Demonstrate basic end to end run from data exploration to model evaluation • Iterate to improve usability and tooling
  • 25. What do we need to know about Feature Tables? – Descriptive Information • What data transformation does it perform? • Who owns this feature table? • Description of input/output – Build/deployment information • Where’s the code? What’s the current version? • What artifacts have been deployed to the production environment? – Run information • What jobs are running or have been run? • What’s the status of these jobs? • What data sets or streams are being produced and how do I access them? • Are there performance metrics or summary statistics available?
  • 26. What do we need to know about Models? • Descriptive Information – What does it do? Classification? Regression? – Who owns this model? – Is it supervised or unsupervised? What type of labels are required? – What features does it use? • Build/deployment information – Where’s the code? What’s the current version? – What artifacts have been deployed to the production environment? • Run information – What training / prediction / evaluation jobs are running or have run? – What’s the status of these jobs? – What data sets or streams are being produced and how do I access them? – How well is the model performing? What criteria are being used to assess this?
  • 27. What actions do we need to perform? • Data exploration and development • Packaging, versioning, and deployment of ML code • Job scheduling and monitoring • Storage/discovery/retrieval of job results – Data frames – Metrics – Trained models • Discovery of and interaction with Features and Models
  • 28. What Technologies Already Do This? • Data exploration and development – Databricks notebooks, local IDE • Packaging, versioning, and deployment of code and metadata – GitHub / Jenkins / Mortar – Document store for metadata (MongoDB, Cassandra, etc.) • Job scheduling and monitoring – Airflow, Databricks Jobs API • Storage of job results – DBFS / S3, need to define standard file structure • Discovery of and interaction with Features and Models – Finding – Elasticsearch or existing Thin Slice API – Reading/processing data frame artifacts – Spark, Databricks notebooks – Retrieving/viewing performance metrics - ??? – Monitoring model performance over time - ??? – Algebraic composition of features and models - ???
  • 29. Open questions • How do we abstract file system details and other constants? • How do we standardize ETL from other systems within Comcast? • How do we support human-labeling of data sets? • What other tools (H2O, R, etc.) do we need to support? • Are there other ways we need to interact with features and models? • How do we integrate AutoML? • Other technologies that may be useful? Databricks Delta? Amazon SageMaker?
  • 30. Architecture V2: Pipelines • Pipeline Segments (same as Spark ML) – Transformers – Estimators • Pipeline: linear sequence of Pipeline Segments (same as Spark ML) – Transformation pipelines contain only Transformers – Estimation pipelines contain one or more Transformers and end with an Estimator • [Diagram: a Transformation Pipeline and an Estimation Pipeline, drawn as chains of nodes labeled T, E, and D]
  • 31. Architecture V2: Pipelines • A pipeline is just a function • It does not produce anything until it is supplied with a specific DataFrame as input
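The statement that a pipeline is just a function can be illustrated with ordinary function composition. This is a minimal sketch in plain Python, not Spark ML: the `Frame` type (a list of dicts standing in for a DataFrame), the `pipeline` helper, and both transformers are assumptions made for illustration.

```python
from functools import reduce
from typing import Callable, Dict, List

Frame = List[Dict[str, float]]          # stand-in for a DataFrame (assumption)
Transformer = Callable[[Frame], Frame]  # a pipeline segment that maps frame -> frame


def pipeline(*segments: Transformer) -> Transformer:
    """Compose segments left to right; nothing runs until a frame is supplied."""
    def run(frame: Frame) -> Frame:
        return reduce(lambda acc, segment: segment(acc), segments, frame)
    return run


def drop_negative(frame: Frame) -> Frame:
    # Hypothetical cleaning step: discard rows with a negative latency reading.
    return [row for row in frame if row["latency_ms"] >= 0]


def to_seconds(frame: Frame) -> Frame:
    # Hypothetical feature step: add a derived column in seconds.
    return [{**row, "latency_s": row["latency_ms"] / 1000} for row in frame]


# Defining the pipeline computes nothing; it only builds a function.
clean = pipeline(drop_negative, to_seconds)

# Work happens only when a concrete frame is passed in.
result = clean([{"latency_ms": 250.0}, {"latency_ms": -1.0}])
print(result)
```

Since `clean` is itself a `Transformer`, it can be passed to `pipeline` again, so pipelines nest without special handling.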
  • 32. Architecture V2: Workflows • A Workflow is a directed (acyclic?) graph of Pipelines – DataSources load data (from disk, streams, etc.) into a DataFrame – Connectors merge the output of multiple DataSources into a single DataFrame – Pipelines process the DataFrames – DataSinks receive the output of the last pipeline • Workflow rules – Workflows must end in a single Pipeline node – An Estimator Pipeline may only appear as the last node in a Workflow
  • 33. Architecture V2: Workflows [Diagram: a Transformation Workflow and a Training Workflow, drawn as graphs of nodes labeled D, C, PT, and PE]
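The workflow rules stated in the slides (a workflow ends in a single Pipeline node, and an Estimator Pipeline may appear only as the last node) can be checked mechanically. The sketch below is one hypothetical encoding: node kinds are plain strings, DataSinks are left out of the graph for brevity, and the cycle check assumes the graph is meant to be acyclic, which the slides leave as an open question.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

PIPELINE_KINDS = {"transform_pipeline", "estimator_pipeline"}


def validate_workflow(kinds: Dict[str, str], edges: Dict[str, Set[str]]) -> List[str]:
    """Return rule violations (an empty list means the workflow is valid).

    kinds maps node name -> kind; edges maps node name -> downstream node names.
    """
    problems: List[str] = []

    # graphlib expects node -> predecessors, so invert the downstream edges.
    predecessors: Dict[str, Set[str]] = {node: set() for node in kinds}
    for source, targets in edges.items():
        for target in targets:
            predecessors.setdefault(target, set()).add(source)
    try:
        list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        problems.append("workflow contains a cycle")

    # Rule: the workflow must end in exactly one node, and it must be a pipeline.
    terminals = [node for node in kinds if not edges.get(node)]
    if len(terminals) != 1 or kinds[terminals[0]] not in PIPELINE_KINDS:
        problems.append("workflow must end in a single pipeline node")

    # Rule: an estimator pipeline may not feed anything downstream.
    for node, kind in kinds.items():
        if kind == "estimator_pipeline" and edges.get(node):
            problems.append(f"estimator pipeline '{node}' is not the last node")
    return problems


# Training workflow: two sources -> connector -> transform pipeline -> estimator pipeline.
kinds = {"d1": "source", "d2": "source", "c": "connector",
         "pt": "transform_pipeline", "pe": "estimator_pipeline"}
edges = {"d1": {"c"}, "d2": {"c"}, "c": {"pt"}, "pt": {"pe"}}
valid_result = validate_workflow(kinds, edges)

# Invalid variant: a transform pipeline placed after the estimator pipeline.
bad_kinds = {**kinds, "pt2": "transform_pipeline"}
bad_edges = {**edges, "pe": {"pt2"}}
invalid_result = validate_workflow(bad_kinds, bad_edges)
```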
  • 34. Architecture V2: Data Sources • Potential sources of data – HTTP request – Persistent store (Avro, Parquet, EDW, …) – Kafka topic – Others? • Connectors could handle complex logic such as combining HTTP data with other sources before feeding into a Workflow
  • 35. Architecture V2 • Components – Pipeline Segment Store • Code catalog of available transforms, estimators, and pipelines • Searchable by description, tags, and maybe by schema? – “Find me a feature of type x that is tagged y” – Workflow Store • Stores DAGs (maybe Neo4j or other graph DB?) • Integrates with Databricks to run DAGs as jobs • Periodic graph analysis to optimize Workflows – Data Source Store? • Separate system or subset of Workflow Store?
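A query such as “Find me a feature of type x that is tagged y” could be served by a small in-memory catalog like the one sketched here. The `SegmentEntry` fields, the `SegmentStore` class, and the sample entries are all hypothetical; a production catalog would more likely sit on a search index or document store than on a Python list.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class SegmentEntry:
    """Catalog record for a transform, estimator, or pipeline (hypothetical schema)."""
    name: str
    kind: str          # "transformer", "estimator", or "pipeline"
    output_type: str   # e.g. "double", "string"
    tags: FrozenSet[str] = frozenset()
    description: str = ""


class SegmentStore:
    """In-memory code catalog searchable by output type, tag, and description text."""

    def __init__(self) -> None:
        self._entries: List[SegmentEntry] = []

    def register(self, entry: SegmentEntry) -> None:
        self._entries.append(entry)

    def find(self, output_type: Optional[str] = None, tag: Optional[str] = None,
             text: Optional[str] = None) -> List[SegmentEntry]:
        # Every supplied filter must match; omitted filters are ignored.
        matches = []
        for entry in self._entries:
            if output_type is not None and entry.output_type != output_type:
                continue
            if tag is not None and tag not in entry.tags:
                continue
            if text is not None and text.lower() not in entry.description.lower():
                continue
            matches.append(entry)
        return matches


store = SegmentStore()
store.register(SegmentEntry("error_rate", "transformer", "double",
                            frozenset({"network", "health"}), "Fraction of failed requests"))
store.register(SegmentEntry("region_code", "transformer", "string",
                            frozenset({"network"}), "Normalized service region"))
store.register(SegmentEntry("churn_classifier", "estimator", "double",
                            frozenset({"retention"}), "Predicts customer churn"))

# "Find me a feature of type double that is tagged network."
hits = [entry.name for entry in store.find(output_type="double", tag="network")]
```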

Editor's Notes

  1. We are a technology, entertainment and media company focused on delivering the best customer experience.