Containers & AI - Beauty and the Beast !?! @MLCon - 27.6.2024
- 5. “By 2028, the adoption of AI will culminate in over 50% of cloud compute resources devoted to AI workloads, up from less than 10% in 2023.” (Gartner®, 2023)
- 6. OK … so what about this AI thingy? 🤔
- 12. AI Technology Layers
● Data Gathering: collect and prepare data for training AI models
● Data Processing: clean and structure data for effective learning
● Machine Learning: algorithms learn from data patterns
● Deep Learning: complex patterns learned using neural networks
● Language Models: large models that understand and generate language
● Chatbot Applications: interactive systems using language models
● ChatGPT: a conversational agent powered by language models
● API / UI
How to manage? 🤔
- 13. … a lot of Data and Math for an Infrastructure guy 🧐
… how does such data get computed? 🖥 💽
- 14. Credits to Andrej Karpathy 👉 Awesome Intro to LLMs
[1hr Talk] Intro to Large Language Models
- 16. Training them is more involved.
Think of it like compressing the internet.
Credits to Andrej Karpathy 👏
- 17. How does it work?
Credits to Andrej Karpathy 👏
Little is known in full detail….
● Billions of parameters are dispersed through the network
● We know how to iteratively adjust them to make it better at prediction
● We can measure that this works, but we don’t really know how the billions of
parameters collaborate to do it.
They build and maintain some kind of knowledge database,
but it is a bit strange and imperfect:
Recent viral example: “reversal curse”
Q: “Who is Tom Cruise’s mother”?
A: Mary Lee Pfeiffer ✅
Q: “Who is Mary Lee Pfeiffer’s son?”
A: I don’t know ❌
⇒ Think of LLMs as mostly inscrutable artifacts,
develop correspondingly sophisticated evaluations
- 18. Summary: how to train your ChatGPT
Credits to Andrej Karpathy 👏
Stage 1: Pretraining (repeated every ∼year)
1. Download ∼10TB of text.
2. Get a cluster of ∼6,000 GPUs.
3. Compress the text into a neural network; pay ∼$2M, wait ∼12 days.
4. Obtain a base model.
Stage 2: Finetuning (repeated every ∼week)
1. Write labeling instructions.
2. Hire people (or use scale.ai!), collect 100K high-quality ideal Q&A responses and/or comparisons.
3. Finetune the base model on this data, wait ∼1 day.
4. Obtain an assistant model.
5. Run a lot of evaluations.
6. Deploy.
7. Monitor, collect misbehaviors, go to step 1.
- 23. Based on Adel Zaalouk (@ZaNetworker) drawings from the CNCF Cloud Native AI white paper 🙏
[Diagram] Nested fields: Artificial Intelligence (AI) ⊃ Machine Learning (ML) ⊃ Deep Learning (DL); Data Science overlaps them, spanning Math & Statistics, Exploratory Data Analysis (EDA), and Visualization.
- 28. Infrastructure Layer: Standardization with Kubernetes
[Diagram] Kubernetes standardizes the infrastructure layer across Data Center I (IT Space I / II / III), Data Center II, Data Center III, Edge, and Cloud Providers:
● Data centers: App Services, Backend Services, DB Services, Analytics, Observability
● Cloud providers: Caches, AI / ML, DDoS Protect, Managed Services
● Edge: Real Time Analysis, Intelligent Edge Devices, Smart Automation, Data Processing
- 29. Kube for AI ⇔ Kube for Applications
● Kube is the de facto standard “cloud operating system”
● API abstraction layer for multiple types of network, storage, and compute resources
● Standard interfaces support DevOps best practices like GitOps
● A variety of cloud providers and services are consumable via standard APIs
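As a minimal sketch of that API abstraction, a hypothetical Deployment requesting a GPU through the standard Kubernetes resource model (the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed; all names and the image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference                # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: server
          image: ghcr.io/example/llm-server:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1     # GPUs are scheduled like any other resource
```

The same manifest applies on any conformant cluster, on-prem or in the cloud, which is exactly the portability argument of this slide.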
- 37. Cloud Native!
[Diagram] AI Platform 🤔 under Platform Engineering:
● ML lifecycle: Data Preparation → Feature Store → Model Development → ML Train / Tune → Model Storage → Model Serving → repeat the process (ML Engineer)
● Platform: Orchestration / Scheduling, managed by the Platform Engineer
● Application side: Application Development → Application QA → Application Rollout via CI / CD
● Cloud Native AI Apps consume the AI platform; the platform gets managed by Platform Engineering
- 38. Cloud Native AI - ©CNCF White Paper
[Diagram] Layered platform view with personas:
● Infrastructure: cloud or on-prem; hardware accelerators: CPU, GPU, NPU, TPU, DPU (Hardware Architect, SRE / Operations)
● Platform: Orchestration / Scheduling (Platform Engineer)
● ML Lifecycle (AI/ML/LLM Ops): Data Prep, Model Training, Model Serving, Perf / Scale, Observe, CI / CD (Data/ML/AI Engineer)
● Workloads: models, applications, … (Data Scientist / Developer)
○ Predictive: Classification, Object Detection, Clustering, Forecasting, …
○ Generative: RAGs, LLMs, Vector DBs, LVMs, …
● Context: Artificial Intelligence (AI) ⊃ Machine Learning (ML) ⊃ Deep Learning (DL); Data Science spans Math & Statistics, Exploratory Data Analysis (EDA), and Visualization
- 43. Options to Adapt the Frameworks
● Focused Tools (local/on-prem)
○ Examples: MLflow, Backyard AI, Ollama, Hugging Face TGI
○ Scope: specific functionalities within the ML lifecycle
○ Open source: yes; Scalability: moderate; Setup & management: simpler
○ Portability: mostly machine-based; Vendor lock-in: no
● Managed Platforms
○ Examples: AWS SageMaker, scale.ai
○ Scope: managed MLOps service
○ Open source: no; Scalability: depends on cloud provider; Setup & management: simpler
○ Portability: mostly cloud; Vendor lock-in: yes (to the specific cloud provider)
● “Kube Native”
○ Examples: Kubeflow / KServe (Hugging Face TGI / LocalAI)
○ Scope: end-to-end MLOps platform
○ Open source: yes; Scalability: high; Setup & management: complex
○ Portability: everywhere; Vendor lock-in: no
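As a sketch of the “Kube Native” option, a minimal KServe InferenceService that serves a model from object storage (model name, namespace, and storage URI are placeholders; assumes KServe is installed on the cluster):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris              # placeholder name
  namespace: models               # placeholder namespace
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn             # model runtime selected by format
      storageUri: gs://example-bucket/models/iris   # placeholder model location
```

KServe resolves the model format to a serving runtime and exposes a standard inference endpoint, so the same declarative pattern works on any Kubernetes cluster.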
- 46. KubeFlow ⁉
The Beauty 👸:
● Incubating CNCF project
● Serves as an AI platform with multi-tenancy
● Popularity: 13.7k ⭐, a good chance of long-term maintenance
● Alternatives like MLflow / KServe are integrated
The Beast 👾:
● Mostly vendor-specific installer instructions 🥴
○ No maintained automated installer for generic Kubernetes
○ Helm chart issue #3173
● Dependency “hell”
○ A lot of different 3rd-party dependency constraints
○ Hard to adapt to existing company defaults
● Only supports EOL Kubernetes <= 1.26❗
○ Usability in production is then questionable
- 47. … but a new KubeFlow release is around the corner
https://github.com/kubeflow/manifests/releases/tag/v1.9.0-rc.1
- 49. ⚙ Separate Model Training / Model Usage Example (Vanilla Setup)
[Diagram] Infrastructure layer across [Cloud] Data Center I, Data Center II, Data Center III, Edge, and Cloud Providers:
● Training side ([Cloud] Data Center I): GPU / TPU powered services based on Argo CD, scaled for training; the trained model gets exported
● Serving side: AI Model Serving fed by data delivery and model export; the [AI] Application Service / application consumes it (Local AI)
● Edge: Real Time Analysis, Intelligent Edge Devices, Smart Automation, Data Processing
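The “based on Argo CD” part of this setup can be sketched as a hypothetical Argo CD Application that continuously syncs model-serving manifests from Git (repo URL, path, and namespaces are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-model-serving          # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ai-platform.git   # placeholder repo
    targetRevision: main
    path: serving                 # placeholder path to serving manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: model-serving      # placeholder target namespace
  syncPolicy:
    automated:
      prune: true                 # delete resources removed from Git
      selfHeal: true              # revert manual drift on the cluster
```

With this GitOps pattern, promoting a newly exported model is a Git commit; Argo CD reconciles the cluster to match.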
- 51. Kubeflow | Katib Architecture for Hyperparameter Tuning (aka optimization run)
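For context, a minimal Katib Experiment sketch tuning a single hyperparameter with random search (experiment name, metric name, parameter range, and training image are placeholders; assumes Katib is installed in the cluster):

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-demo        # placeholder name
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy # placeholder metric logged by each trial
  algorithm:
    algorithmName: random         # random search over the feasible space
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: learning rate for the trial
        reference: lr
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: training-container
                image: ghcr.io/example/trainer:latest   # placeholder training image
                command:
                  - "python"
                  - "train.py"
                  - "--lr=${trialParameters.learningRate}"
```

Katib runs each trial as a Job with a sampled `lr`, reads the reported metric, and keeps searching until the goal or `maxTrialCount` is reached.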