Containers & AI - Beauty and the Beast !?! @MLCon - 27.6.2024
- 5. “By 2028, the adoption of AI will culminate in over 50% of cloud compute resources devoted to AI workloads, up from less than 10% in 2023.” (Gartner®, 2023)
- 6. OK … so what about this AI thingy? 🤔
- 12. AI Technology Layers
● Data Gathering: collect and prepare data for training AI models
● Data Processing: clean and structure data for effective learning
● Machine Learning: algorithms learn from data patterns
● Deep Learning: complex patterns learned using neural networks
● Language Models: large models that understand and generate language
● Chatbot Applications: interactive systems using language models
● ChatGPT: a conversational agent powered by language models
● API / UI
How to manage? 🤔
- 13. … a lot of Data and Math for an Infrastructure guy 🧐
… how does such data get computed? 🖥 💽
- 14. Credits to Andrej Karpathy 👉 Awesome Intro to LLMs
[1hr Talk] Intro to Large Language Models
- 16. Training them is more involved.
Think of it like compressing the internet.
Credits to Andrej Karpathy 👏
- 17. How does it work?
Credits to Andrej Karpathy 👏
Little is known in full detail….
● Billions of parameters are dispersed through the network
● We know how to iteratively adjust them to make it better at prediction
● We can measure that this works, but we don’t really know how the billions of
parameters collaborate to do it.
They build and maintain some kind of knowledge database,
but it is a bit strange and imperfect:
Recent viral example: “reversal curse”
Q: “Who is Tom Cruise’s mother”?
A: Mary Lee Pfeiffer ✅
Q: “Who is Mary Lee Pfeiffer’s son?”
A: I don’t know ❌
⇒ Think of LLMs as mostly inscrutable artifacts,
develop correspondingly sophisticated evaluations
- 18. Summary: how to train your ChatGPT
Credits to Andrej Karpathy 👏
Stage 1: Pretraining (repeated every ∼year)
1. Download ∼10TB of text.
2. Get a cluster of ∼6,000 GPUs.
3. Compress the text into a neural network; pay ∼$2M, wait ∼12 days.
4. Obtain a base model.
Stage 2: Finetuning (repeated every ∼week)
1. Write labeling instructions.
2. Hire people (or use scale.ai!), collect 100K high-quality ideal Q&A responses and/or comparisons.
3. Finetune the base model on this data, wait ∼1 day.
4. Obtain an assistant model.
5. Run a lot of evaluations.
6. Deploy.
7. Monitor, collect misbehaviors, go to step 1.
- 23. Based on Adel Zaalouk (@ZaNetworker) drawings from the CNCF Cloud Native AI white paper 🙏
[Diagram] Nested fields: Artificial Intelligence (AI) ⊃ Machine Learning (ML) ⊃ Deep Learning (DL); Data Science overlaps them, spanning Math & Statistics, Exploratory Data Analysis (EDA), and Visualization.
- 28. Infrastructure Layer: Standardization with Kubernetes
[Diagram] Kubernetes standardizes the infrastructure layer across Data Center I (IT Space I / II / III), Data Center II, Data Center III, Edge, and Cloud Providers:
● Data centers: App Services, Backend Services, DB Services, Analytics, Observability
● Cloud providers: Caches, AI / ML, DDoS Protect, Managed Services
● Edge: Real Time Analysis, Intelligent Edge Devices, Smart Automation, Data Processing
- 29. Kube for AI ⇔ Kube for Applications
● Kube is the de facto standard “cloud operating system”
● API abstraction layer for multiple types of network, storage, and compute resources
● Standard interfaces support DevOps best practices like GitOps
● A variety of cloud providers and services are consumable via standard APIs
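As a minimal sketch of that API abstraction, a hypothetical Deployment requesting a GPU through the standard Kubernetes resource model (the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed; all names and the image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference                # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: server
          image: ghcr.io/example/llm-server:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1     # GPUs are scheduled like any other resource
```

The same manifest applies on any conformant cluster, on-prem or in the cloud, which is exactly the portability argument of this slide.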
- 37. Cloud Native!
[Diagram] AI Platform 🤔 under Platform Engineering:
● ML lifecycle: Data Preparation → Feature Store → Model Development → ML Train / Tune → Model Storage → Model Serving → repeat the process (ML Engineer)
● Platform: Orchestration / Scheduling, managed by the Platform Engineer
● Application side: Application Development → Application QA → Application Rollout via CI / CD
● Cloud Native AI Apps consume the AI platform; the platform gets managed by Platform Engineering
- 38. Cloud Native AI - ©CNCF White Paper
[Diagram] Layered platform view with personas:
● Infrastructure: cloud or on-prem; hardware accelerators: CPU, GPU, NPU, TPU, DPU (Hardware Architect, SRE / Operations)
● Platform: Orchestration / Scheduling (Platform Engineer)
● ML Lifecycle (AI/ML/LLM Ops): Data Prep, Model Training, Model Serving, Perf / Scale, Observe, CI / CD (Data/ML/AI Engineer)
● Workloads: models, applications, … (Data Scientist / Developer)
○ Predictive: Classification, Object Detection, Clustering, Forecasting, …
○ Generative: RAGs, LLMs, Vector DBs, LVMs, …
● Context: Artificial Intelligence (AI) ⊃ Machine Learning (ML) ⊃ Deep Learning (DL); Data Science spans Math & Statistics, Exploratory Data Analysis (EDA), and Visualization
- 43. Options to Adapt the Frameworks
● Focused Tools (local/on-prem)
○ Examples: MLflow, Backyard AI, Ollama, Hugging Face TGI
○ Scope: specific functionalities within the ML lifecycle
○ Open source: yes; Scalability: moderate; Setup & management: simpler
○ Portability: mostly machine-based; Vendor lock-in: no
● Managed Platforms
○ Examples: AWS SageMaker, scale.ai
○ Scope: managed MLOps service
○ Open source: no; Scalability: depends on cloud provider; Setup & management: simpler
○ Portability: mostly cloud; Vendor lock-in: yes (to the specific cloud provider)
● “Kube Native”
○ Examples: Kubeflow / KServe (Hugging Face TGI / LocalAI)
○ Scope: end-to-end MLOps platform
○ Open source: yes; Scalability: high; Setup & management: complex
○ Portability: everywhere; Vendor lock-in: no
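As a sketch of the “Kube Native” option, a minimal KServe InferenceService that serves a model from object storage (model name, namespace, and storage URI are placeholders; assumes KServe is installed on the cluster):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris              # placeholder name
  namespace: models               # placeholder namespace
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn             # model runtime selected by format
      storageUri: gs://example-bucket/models/iris   # placeholder model location
```

KServe resolves the model format to a serving runtime and exposes a standard inference endpoint, so the same declarative pattern works on any Kubernetes cluster.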
- 46. KubeFlow ⁉
The Beauty 👸:
● Incubating CNCF project
● Serves as an AI platform with multi-tenancy
● Popularity: 13.7k ⭐, a good chance of long-term maintenance
● Alternatives like MLflow / KServe are integrated
The Beast 👾:
● Mostly vendor-specific installer instructions 🥴
○ No maintained automated installer for generic Kubernetes
○ Helm chart issue #3173
● Dependency “hell”
○ A lot of different 3rd-party dependency constraints
○ Hard to adapt to existing company defaults
● Only supports EOL Kubernetes <= 1.26❗
○ Usability in production is then questionable
- 47. … but a new KubeFlow release is around the corner
https://github.com/kubeflow/manifests/releases/tag/v1.9.0-rc.1
- 49. ⚙ Separate Model Training / Model Usage Example (Vanilla Setup)
[Diagram] Infrastructure layer across [Cloud] Data Center I, Data Center II, Data Center III, Edge, and Cloud Providers:
● Training side ([Cloud] Data Center I): GPU / TPU powered services based on Argo CD, scaled for training; the trained model gets exported
● Serving side: AI Model Serving fed by data delivery and model export; the [AI] Application Service / application consumes it (Local AI)
● Edge: Real Time Analysis, Intelligent Edge Devices, Smart Automation, Data Processing
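The “based on Argo CD” part of this setup can be sketched as a hypothetical Argo CD Application that continuously syncs model-serving manifests from Git (repo URL, path, and namespaces are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-model-serving          # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ai-platform.git   # placeholder repo
    targetRevision: main
    path: serving                 # placeholder path to serving manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: model-serving      # placeholder target namespace
  syncPolicy:
    automated:
      prune: true                 # delete resources removed from Git
      selfHeal: true              # revert manual drift on the cluster
```

With this GitOps pattern, promoting a newly exported model is a Git commit; Argo CD reconciles the cluster to match.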
- 51. Kubeflow | Katib Architecture for Hyperparameter Tuning (aka optimization run)
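For context, a minimal Katib Experiment sketch tuning a single hyperparameter with random search (experiment name, metric name, parameter range, and training image are placeholders; assumes Katib is installed in the cluster):

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-demo        # placeholder name
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy # placeholder metric logged by each trial
  algorithm:
    algorithmName: random         # random search over the feasible space
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: learning rate for the trial
        reference: lr
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: training-container
                image: ghcr.io/example/trainer:latest   # placeholder training image
                command:
                  - "python"
                  - "train.py"
                  - "--lr=${trialParameters.learningRate}"
```

Katib runs each trial as a Job with a sampled `lr`, reads the reported metric, and keeps searching until the goal or `maxTrialCount` is reached.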