The purpose of this presentation is to show what end-to-end machine learning looks like in a real-world enterprise. It is aimed at aspiring data scientists whose ML courses or education focused mostly on algorithms rather than the end-to-end pipeline.
The architecture and components shown in Slide 11 will be discussed in detail in a series of LinkedIn posts over the next few months.
To get updates, follow me on LinkedIn or search/follow the hashtag #end2endDS. Posts will begin in August 2019 and continue through September 2019.
1. End to End Machine Learning for Aspiring Data Scientist
- Srivatsan Srinivasan
https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
2. Before you proceed.. Stop.. Read.. Proceed on your own terms
This presentation is not a complaint about online courses and academia; it highlights the gap between what these courses cover and what enterprises need.
Doing data science has its own set of challenges and multiple failure points. Some of the information I will be sharing on LinkedIn will cover those failure points in detail, and how to overcome them.
If you aspire to a career in data science, this presentation and the series of posts I will be sharing over the next few months will take you through the end-to-end machine learning cycle in a typical organization.
-> Use this information to fill in the skills that can get you closer to industry needs.
-> Use this content to define a strategy for landing a job in the enterprise world.
You can search for the posts using the hashtag #end2endDS on LinkedIn, or follow me on LinkedIn to get updates as I post.
https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
Content on this topic will be posted between 29th July and 27th September, 2019. The frequency will depend purely on the bandwidth I have; on average you can expect one, at most two, posts a week.
I will also summarize the key takeaways in articles, and update this presentation over time.
Not every data scientist needs to be an expert in the entire ML pipeline, but it is good to know the process.
- Happy Learning
6. If you see the "Data Science Hierarchy of Needs" below as hill climbing, academia puts you on top of the hill; in the real world, you understand that the climb itself is the most difficult part.
Image Source: Hackernoon
7. Education (Courses/Academics) vs Enterprise

Education: Focus on model accuracy and the usage of algorithms.
Enterprise: Focus on deployment/integration; balance between accuracy and explainability.

Education: Focus on increasing model complexity for better accuracy.
Enterprise: Keep it simple as much as possible, for as long as possible.

Education: Data mostly comes in a single file or a few files.
Enterprise: Data comes from multiple enterprise systems and needs to be integrated, cross-referenced, and summarized.

Education: Data size is typically small to medium.
Enterprise: Data size ranges from medium to very large.

Education: Data is typically 80% clean.
Enterprise: Data is 80% noisy.

Education: Limited tools.
Enterprise: More tools + DevOps + cloud + other craps.

Education: Work at a decent pace.
Enterprise: Agile (not now, don't make me talk).
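To make the "data from multiple enterprise systems" point concrete, here is a minimal pandas sketch. The system names, keys, and values are invented for illustration; real enterprise integration involves many more sources and reconciliation rules.

```python
import pandas as pd

# Hypothetical extracts from two enterprise systems: a CRM and a billing system.
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["smb", "enterprise", "smb"],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "amount": [100.0, 250.0, 75.0],
})

# Cross-reference the systems on a shared key, then summarize per customer.
merged = crm.merge(billing, on="customer_id", how="left")
summary = (
    merged.groupby(["customer_id", "segment"], as_index=False)["amount"]
    .sum(min_count=1)  # keep NaN for customers with no billing rows
)
print(summary)
```

Even this toy version surfaces typical enterprise issues: customers missing from one system, one-to-many relationships, and the need to decide how gaps are represented downstream.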
8. For most online courses:
Data Science = ML Code + Some Data Analysis
In reality:
Data Science = ML Code + Data Analysis + Data Collection + Data Engineering + Software Engineering + DevOps + BI Engineering + Product Management
Note: If you are coming from a premier institute that covers all of this reality, please feel free to exit the presentation.
9. The 5 biggest challenges for enterprises deploying ML solutions
• Data collection
• Deploying and reproducing the model in production
• Model monitoring
• Keeping the model relevant by adapting to changing business scenarios
• Communicating and interpreting model output to various stakeholders
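As one concrete angle on model monitoring, a common drift check is the Population Stability Index (PSI), which compares a feature's distribution at training time against its live distribution. The sketch below is a from-scratch illustration with invented data and an arbitrary bin count; production systems typically use purpose-built monitoring tooling rather than hand-rolled checks.

```python
import math

def psi(expected, actual, bins=5):
    """Population Stability Index between a baseline and a live sample.

    Both inputs are lists of numbers; bin edges come from the baseline.
    A small epsilon avoids log(0) for empty bins.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def bucket(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        total = len(xs)
        return [max(c / total, 1e-6) for c in counts]

    e, a = bucket(expected), bucket(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.1 * i for i in range(100)]   # training-time feature values
identical = list(baseline)
shifted = [x + 5.0 for x in baseline]      # simulated drift in production

print(psi(baseline, identical))  # no drift: index stays at zero
print(psi(baseline, shifted))    # large value: alert-worthy drift
```

A monitoring job would run a check like this per feature on a schedule and raise an alert (or trigger re-calibration) when the index crosses a threshold the team has agreed on.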
11. Components of an End to End Machine Learning Pipeline in the Real World

Problem Definition -> Business Understanding -> Data Understanding -> Model Integration and SLA Understanding

Model Training (iterative; some steps might be optional on a case-by-case basis):
Data Collection -> Data Analysis/Cleaning -> Data Validation/Anomaly Detection -> Data Organization and Transformation ->
Feature Engineering -> Model Training -> Model Evaluation and Validation -> Model Explanation (Local and Global) -> Model Deployment

Post deployment: Model Monitoring -> Data Drift Analysis -> Model Drift Analysis -> Model Re-calibration
(some steps might be optional on a case-by-case basis)

Cross-cutting layers: Health Dashboard, Reports & Alerts; Model Management and Governance; Data Management;
Model and Application Logging; Pipeline Orchestrator; Infrastructure/DevOps/Automation
12. ML Components and Skills/Role Mapping

Component | Primary Responsibility | Secondary Responsibility
Problem Definition | Business Owner, AI Champion | Product Owner
Business Understanding | Product Owner, Business Owner, AI Champion | ML Engineer
Data Understanding | Data Engineer, ML Engineer, Product Owner | Business Owner/Analyst
Model Integration and SLA Understanding | ML Engineer, Data Engineer, Software Engineer | Business Owner, Product Owner
Data Collection | Data Engineer, Data Analyst |
Data Analysis/Cleaning | Data Engineer, Data Analyst |
Data Organization/Transformation | Data Engineer, ML Engineer | Data Analyst
Data Validation/Anomaly Detection | Data Analyst, Data Engineer |
Feature Engineering | ML Engineer | Data Engineer
Model Training | ML Engineer |
Model Evaluation/Validation | ML Engineer | Business Owner, Model Governance Team
Model Monitoring | Operations Engineer, ML Engineer | BI Engineer
Model Deployment | Software Engineer, Data Engineer, ML Engineer |
Data Drift/Model Drift | Operations Engineer, ML Engineer | BI Engineer, ML Engineer
Dashboard/Reports | BI Engineer | Business Owner, Product Owner

Note: Depending on the size of the ML project, one person might play multiple roles, or multiple people might be required for a single role.
Some roles might also be part-time, and some components can be built as a capability that is leveraged across projects.
13. Most of the role definitions in the previous slide can be found online, so let me talk about the AI
Champion, as not much is written about it….
The AI Champion (the Head of Analytics, or sometimes the CAO himself) is responsible for driving intelligent insights, backed
by data science capability, within the enterprise. He also owns the resulting ROI or impact numbers from delivering
intelligent solutions. He leads the data science team by developing policies and strategies and propagating a culture of
experimentation and research. He and his team are also responsible for working with business stakeholders in
planning, identifying, prioritizing, and implementing AI use cases.
You can find more details here: https://www.linkedin.com/pulse/identifying-prioritizing-artificial-intelligence-use-cases-srivatsan
This role might be more relevant in mid-to-large organizations that have multiple use cases to deliver; the AI
Champion helps the enterprise prioritize use cases that are a good fit for AI and can generate substantial business value.
14. A Few Components of End to End ML, Explained
(I will cover more details on each in my LinkedIn posts)
15. Data Collection
• Data is typically collected and centralized from a variety of sources into a Data Lake, a Data Warehouse, or another
enterprise data ecosystem
• Data is sourced from high-volume transactional systems like ERP and sales systems, or from high-velocity IoT devices, POS
systems, etc.
• Data takes a variety of shapes - structured, semi-structured, and unstructured sources
• Data takes a variety of forms - batch, streaming, API, alternate data, etc.
• While ingesting data is one part of the puzzle, data also needs to be cataloged, secured, and governed
Further Reading: https://www.linkedin.com/pulse/think-data-first-before-being-ai-srivatsan-srinivasan
"Define an efficient data strategy that is simple to implement and helps accelerate your AI strategy"
16. Data Analysis and Validation
Inspect and clean data to discover useful information that can further help in modeling an AI-driven intelligent solution.
The purpose of data analysis and validation is to understand:
• What are the characteristics of my data, and what does it look like?
• Are there any outliers or errors in the data?
• How do the independent variables relate to the target variable?
• Base statistics from the analysis phase are compared against production inference data to identify whether the data has
evolved (drifted) away from the underlying assumptions the model was trained on
Further Reading: https://www.linkedin.com/pulse/tensorflow-extended-tfx-data-analysis-validation-drift-srinivasan/
"Understanding your data is the key step to insight"
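The drift check in the last bullet can be sketched with the Population Stability Index (PSI), a common drift metric: capture quantile bins from the training data, then compare production bin proportions against them. This is an illustrative sketch, not the TFX approach from the further-reading link; the common rule of thumb that PSI > 0.2 signals significant drift is also just a heuristic.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline (training) sample
    and a production sample, using quantile bins from the baseline.
    Rule of thumb: PSI > 0.2 suggests significant drift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip production values into the baseline range so every value
    # falls inside a bin.
    actual = np.clip(actual, edges[0], edges[-1])
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Small floor avoids log(0) for empty bins.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 5000)          # baseline captured at training time
prod_stable = rng.normal(0, 1, 5000)    # production data, same distribution
prod_shifted = rng.normal(1, 1, 5000)   # simulated drift (mean shift)

print(f"stable:  PSI = {psi(train, prod_stable):.3f}")
print(f"shifted: PSI = {psi(train, prod_shifted):.3f}")
```

In production you would persist the baseline bin edges and fractions at training time and run only the comparison against incoming inference data.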
17. Data Organization and Transformation
Data collected from source systems into the data ecosystem is typically at a granular level not directly consumable by an ML model,
and sources are spread across multiple domains. Take marketing as an example: data might be spread across customer,
product, transaction, and loyalty systems. Data organization and transformation make data consumable for ML models
and also make it accessible for self-service.
Raw data, typically in terabytes, is cleansed and aggregated into a form that can be fed into the model directly. This is where most
of the heavy lifting happens, in close collaboration between business, data engineers, ML engineers, and data analysts.
[Diagram: Integrate -> Explore -> Aggregate (roughly 60% of the effort; Data Engineers and Data Analysts) reduces
Raw Data (TB-PB) to Model Input Data (MB-GB); Model -> Deploy -> Monitor (roughly 40%; ML Engineers, Data Engineers,
and Software Engineers) turns it into Insight (KB)]
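The aggregation step above can be sketched in a few lines of pandas: granular transaction rows become one feature row per customer. Column names and features here are hypothetical, purely for illustration.

```python
import pandas as pd

# Toy stand-in for granular, transaction-level source data
# (in reality this would come from multiple enterprise systems).
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [20.0, 35.0, 5.0, 12.0, 8.0, 99.0],
    "channel":     ["web", "store", "web", "web", "store", "web"],
})

# Granular rows -> one row per customer, summarizing behavior into
# features a model can consume directly.
features = transactions.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_basket=("amount", "mean"),
    n_orders=("amount", "count"),
    web_share=("channel", lambda s: (s == "web").mean()),
).reset_index()

print(features)
```

At enterprise scale the same shape of logic would run in a distributed engine such as Spark rather than in-memory pandas.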
18. Model Deployment
A few key things to remember while deploying models to production or integrating models with business processes:
Further Reading: https://www.linkedin.com/pulse/ml-model-deployment-considerations-srivatsan-srinivasan/
https://www.linkedin.com/pulse/integrating-machine-learning-models-within-matured-srinivasan/
• Training/deployment skew - models developed on historical sources might have to be deployed in a streaming
flow or at the edge of the network/devices
• Not everything can be Flask'ed or exposed as a service. The deployment scenario varies based on the technology in the
business process, the inference SLA, etc.
• Keep the model pipeline as simple as possible; avoid spaghetti pipeline code
• Provision for experimentation with new models when implementing the deployment framework -
champion/challenger or A/B-testing-based model deployment and analysis
• Training/deployment skew - features that are hard to compute at inference time, or features that were forward-
computed at training time (this may not sound sensible, but trust me, I have seen enterprises make such
mistakes)
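The champion/challenger provision above can be sketched as deterministic traffic routing: a stable hash of the request id sends a fixed fraction of traffic to the challenger model. This is a minimal illustration, not a prescribed framework; the model names and percentage are placeholders.

```python
import hashlib

def route(request_id: str, challenger_pct: float = 0.10) -> str:
    """Deterministically assign a request to 'champion' or 'challenger'.
    The same id always gets the same model, which keeps the experiment
    consistent across retries and makes results reproducible."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "challenger" if bucket < challenger_pct else "champion"

# Roughly challenger_pct of traffic lands on the challenger.
assignments = [route(f"req-{i}") for i in range(10_000)]
share = assignments.count("challenger") / len(assignments)
print(f"challenger share: {share:.3f}")
```

The same bucketing idea extends to A/B tests with more than two arms by slicing [0, 1] into more intervals.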
19. Model Monitoring
Machine learning today is essential for running some of our critical business processes. ML is deployed in decision making,
substituting or replacing humans, and needs to be monitored continuously while it is making decisions.
Ongoing monitoring of ML models is essential to evaluate whether the assumptions the model was developed on have not
drifted and whether it is performing as intended.
A model can drift due to changes in business assumptions, changes or issues with data, or market conditions that need
adjustment, among other causes. Ongoing monitoring highlights scenarios where a model might need re-calibration. For some
business processes that can be yearly; for others it can be as frequent as daily.
Plan for monitoring models continuously -> alert on drift in data, concept, or model. Business today evolves rapidly, and the
assumptions models are trained on quickly become invalidated. You want to know before your model starts
making wrong predictions.
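One simple form of such an alert: track accuracy over a rolling window of labeled outcomes and flag when it falls below a threshold. This is an illustrative sketch; the window size, threshold, and class names are assumptions, and real monitoring would combine this with the data-drift checks described earlier.

```python
from collections import deque

class AccuracyMonitor:
    """Rolling-window accuracy tracker that flags when a deployed
    model may need re-calibration (thresholds are illustrative)."""

    def __init__(self, window=500, threshold=0.80):
        self.outcomes = deque(maxlen=window)  # True/False per prediction
        self.threshold = threshold

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    @property
    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_recalibration(self):
        # Only alert once the window is full, to avoid noisy cold starts.
        return (len(self.outcomes) == self.outcomes.maxlen
                and self.accuracy < self.threshold)

monitor = AccuracyMonitor(window=100, threshold=0.80)
for i in range(100):
    # Simulated feedback: the model is right 75% of the time.
    monitor.record(1, 1 if i % 4 else 0)
print(monitor.needs_recalibration())
```

In practice the alert would feed the health dashboard and trigger the re-calibration step in the pipeline from slide 11.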
20. Other key components to succeed in enterprise machine learning
• A structured and modularized code base
• Experiment tracking for reproducibility
• Version control of ML code, data, and experiment results
• DevOps for both infrastructure and model deployment
• An orchestrator for data and model pipelines
• Logging runtime-critical deployment info and making it searchable
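The experiment-tracking and versioning bullets boil down to one habit: record the parameters, metrics, code version, and a data fingerprint for every run. A toy sketch of that idea (a real project would use a tool such as MLflow; the file path and field names here are made up):

```python
import hashlib
import json
import os
import tempfile
import time

def log_experiment(path, params, metrics, code_version, data_bytes):
    """Append one run record to a JSON-lines log so any result can be
    traced back to the exact code, parameters, and data that produced it."""
    record = {
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
        "code_version": code_version,            # e.g. a git commit hash
        "data_hash": hashlib.sha256(data_bytes).hexdigest(),
    }
    with open(path, "a") as f:                   # append-only run log
        f.write(json.dumps(record) + "\n")
    return record

log_path = os.path.join(tempfile.gettempdir(), "runs.jsonl")
run = log_experiment(
    log_path,
    params={"lr": 0.01, "n_estimators": 200},
    metrics={"auc": 0.87},
    code_version="abc1234",
    data_bytes=b"training-data-snapshot",
)
print(run["data_hash"][:12])
```

Hashing the data (or its snapshot reference) is what makes a run reproducible: matching code version plus matching data hash means the result can be regenerated.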
22. Food for thought #1 - Various points of failure in the ML lifecycle
The machine learning cycle is not complete after deployment. Models need to be monitored continuously, and you must be
prepared for failure at any point of the pre- and post-modeling exercise:
• Failure during experimentation. This is the ideal case, since you figure out the problem early.
• Failure during development by not thinking about real-world inference scenarios, e.g. using features that are hard or
impossible to compute during inference
• Failure post deployment, where some models did not generate the business value they were supposed to
• Failure post deployment to keep up with an ever-changing data landscape. These models need frequent re-calibration
or some form of continuous learning
• Failure to use the right performance metrics. Think about what your business needs to succeed, not what the model needs
Further Reading -
Reasons why ML projects fail: https://www.linkedin.com/pulse/top-reasons-why-artificial-intelligence-projects-fail-srinivasan/
23. Food for thought #2 - Infrastructure
Further Reading - https://www.linkedin.com/pulse/accelerating-artificial-intelligence-initiatives-srivatsan-srinivasan/
Enterprises hiring artificial intelligence and machine learning experts without the right infrastructure and tools is like
"hiring astronauts to drive a bullock cart".
Building data science capability within an enterprise must be thought through from the ground up, right from the selection of
the silicon chip. Data engineering and ML processes are typically compute- and memory-intensive, and on large datasets the
infrastructure has to be planned from the ground up.
A data scientist typically performs hundreds of iterations to arrive at the right algorithm, hyperparameters, and metrics. Not
having the right infrastructure can derail an enterprise getting into machine learning.
Plan for infrastructure with the right kind of hardware (GPU, CPU, HPC, etc.), technologies (Hadoop, Kubernetes, etc.), and tools
(Spark ML, TensorFlow, scikit-learn, etc.) that can distribute ML/DL pipelines for faster hypothesis testing and value generation.
The cloud is a very good alternative for accelerating the ML journey: you can spin up compute on demand and tear it down
when not needed.
24. Food for thought #3 - Cloud for AI/ML
Further Reading - https://www.linkedin.com/pulse/artificial-intelligence-google-cloud-platform-srivatsan-srinivasan/
https://www.linkedin.com/pulse/data-analytics-google-cloud-platform-srivatsan-srinivasan/
The cloud is a key component of the AI/ML journey, especially for enterprises that need the agility to meet the huge compute
demand of running ML jobs.
Key benefits the cloud provides:
• Scale - instant access to hundreds of compute instances
• Speed - easy availability of specialized devices (GPU/TPU) that can help accelerate AI development
• Cloud AI APIs - a quick jump start into complex activities rather than building from scratch. For cases like speech-to-text or
language translation, an enterprise may also lack the data to build models as accurate as those available in the cloud
• Cloud AutoML - train high-quality models specific to business needs with citizen data scientists or even business users
• Cloud bursting - with advances in hybrid cloud, start small in the local data center and use the cloud to scale AI compute
25. Food for thought #4 - Stay simple as long as possible
If you fit simple models and accuracy is low, do you immediately jump to complex models?
Try these two steps before moving to trendy and complex algorithms:
• Follow your model output -> listen to what your algorithm's metrics say. Drill down into misclassification scenarios and see
if you can find any interesting patterns
• Be curious and creative with your data -> try to find patterns or relationships in the data that can influence
your model's outcome. A lot can be solved by proper EDA and feature engineering
If you are still not meeting performance targets, move to complex models in increments. The steps above remain
relevant and can feed your complex models to enhance the decision boundary.
In some critical business processes, 84% performance from a simple model can be better than 86% from a complex one.
26. Food for thought #5 - Data Science and Agile
There is a lot of misconception about using Agile for data science. Data science outcomes depend on continuous experimentation,
whereas Agile focuses on early and continuous delivery throughout the development lifecycle.
First, remember that Agile is a set of guiding principles, not a set-in-stone methodology. Agile can be tailored to one's
unique data science needs.
Here is one way of doing data science the Agile way, especially the machine learning part:
• Don't set strict deliverables at the end of every sprint
• Use daily/weekly meetings to surface blockers only, not to collect daily status
• As soon as you have a working model with decent accuracy (say, every sprint or two), put it in private beta mode. Private beta
(or dark) mode is where the model generates output but that output is not actioned on. This lets you monitor the model against
real-world data and test its reliability
• Keep updating the private beta as you build models with better performance
• Launch the private beta model to a small percentage of live traffic and collect feedback based on end-user responses
• Keep increasing the volume of transactions sent to the model at frequent intervals until all traffic is diverted and the
feedback/outcome target is met
In the real world, there are scenarios where an ML model does not deliver the same value seen during the training/evaluation
phase. Agile delivery keeps machine learning projects value- and outcome-focused and helps achieve project
objectives in a timely manner.
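The ramp-up described in the bullets above can be sketched as a staged schedule: start in dark mode, then divert a growing share of live traffic to the new model, advancing a stage only when the feedback target is met. The percentages and stage logic here are illustrative, not a prescribed schedule.

```python
# Stage 0 is private beta / dark mode: the model predicts but is never
# actioned on. Later stages divert real traffic (shares are illustrative).
RAMP_SCHEDULE = [0.0, 0.01, 0.05, 0.25, 0.50, 1.0]

def traffic_share(stage: int) -> float:
    """Share of live traffic actioned by the new model at a given stage."""
    return RAMP_SCHEDULE[min(stage, len(RAMP_SCHEDULE) - 1)]

def next_stage(stage: int, feedback_ok: bool) -> int:
    """Advance only when the feedback/outcome target at the current
    stage is met; otherwise hold (or investigate) at the same stage."""
    if feedback_ok and stage < len(RAMP_SCHEDULE) - 1:
        return stage + 1
    return stage

stage = 0
for ok in [True, True, False, True]:   # simulated review outcomes
    stage = next_stage(stage, ok)
print(f"stage {stage}: {traffic_share(stage):.0%} of live traffic")
```

A failed review simply holds the rollout where it is; combined with the monitoring alerts from slide 19, it could also trigger a rollback to an earlier stage.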
27. Food for thought #6 - Myth vs Fact

Myth: If your tabular data is big, switch to deep learning; traditional ML will not work.
Fact: Traditional ML algorithms can scale to large datasets. There are distributed frameworks that can train your model on
large datasets and are very effective at learning from them. Choose technology based on your business and data needs.

Myth: Machine learning will eventually replace the existing rules in legacy systems.
Fact: Think of ML initially as a technology for complementing your legacy rules. You can reduce the complexity of rules by
introducing an ML solution. ML can eventually replace them, but it is always better to have some deterministic rules
complementing your probabilistic ML models.

Myth: Machine learning is the new "magic wand" for making your business processes smart and intelligent.
Fact: Do not take a non-ML problem and try to fit ML into it. Use ML when you believe it will add value to the business
process. You can also make your business processes smart with advanced analytics or statistical techniques.

Myth: AutoML will replace and automate data science work.
Fact: Data science is more than what AutoML can currently do. AutoML will be an assistant to data scientists, taking care of
the boring parts and letting them focus on delivering business value.

Further Reading on AutoML - https://www.linkedin.com/pulse/fear-data-scientist-called-autophobia-srivatsan-srinivasan/
28. To Summarize
• Plan to invest in the right infrastructure (GPU, CPU, Cloud) to accelerate the model development process
• Only 20% or less of the actual pipeline is ML code
29. Thank you, and stay tuned on LinkedIn for more info
on the End to End Data Science Pipeline
Follow or search the hashtag #end2endDS on
LinkedIn to get updates