Architectures of AI systems
Engineering for Big Data & AI
HCMC, Sep 6th 2019
Herve Roussel, herve@quod.ai
What is Data Engineering?
Is this data engineering?
UploadData.java
upload_data.py
cat console.log | grep "ERROR" > errors.log
Is this data engineering?
Data engineering?
Event data
Program
Transformed data
Backend vs Data?
cat console.log | grep "ERROR" > errors.log
Is this data engineering?
Event data
Transform
Transformed data
What is Big Data Engineering?
Where is Big Data?
How to query news feed?
SELECT *
FROM posts
INNER JOIN friends
WHERE ...
ORDER BY posts.timestamp DESC
Notify? Web, mobile?
Who can see this?
Racist? Vulgar?
Is this a face? Who’s this? Friend? Celebrity?
Courtney likes. Is that good or bad?
Paddy commented. Is that good or bad?
Chris posted. Is that good or bad?
Anybody tagged?
What rank in feed?
Copyright violation?
Is Big Data just for big companies?
300K QPS [R]
6K QPS [W]
As of JULY 8, 2013
1B+ QPM [P]
250M+ QPM [R]
400M LOC [P]
1.8 TB per year [P]
Data Engineering
Event data
Program
Augmented data
Event data
Transform
Augmented data
Big Data Engineering + AI
Pipeline (Transform)
Source (Event data)
Sink (Augmented data)
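The source, pipeline, and sink named on this slide can be sketched as three small functions chained together. This is only an illustration: the function names, the in-memory list used as a sink, and the `is_error` field are assumptions, not part of the talk.

```python
from typing import Iterable, Iterator, List

Event = dict


def source(events: Iterable[Event]) -> Iterator[Event]:
    """Source: yields raw event data one record at a time."""
    for event in events:
        yield event


def transform(events: Iterable[Event]) -> Iterator[Event]:
    """Pipeline step: yields a copy of each event with one extra field."""
    for event in events:
        augmented = dict(event)
        # Hypothetical augmentation: flag messages that report an error.
        augmented["is_error"] = "ERROR" in event.get("message", "")
        yield augmented


def sink(events: Iterable[Event], store: List[Event]) -> None:
    """Sink: persists augmented data (here, an in-memory list)."""
    store.extend(events)


def run_pipeline(raw_events: Iterable[Event]) -> List[Event]:
    store: List[Event] = []
    sink(transform(source(raw_events)), store)
    return store


augmented_store = run_pipeline([
    {"id": 1, "message": "ERROR disk full"},
    {"id": 2, "message": "INFO started"},
])
```

The original events are never modified; the sink only receives augmented copies, which makes it easier to re-run the pipeline later.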
What is a source?
Synchronous (10-100 ms)
Where is data coming from?
Main data
Event source
Why split?
Asynchronous (3-5 s)
What’s in an event data?
Post
{
  id: 12345,
  content: "hello world",
  created_at: …,
  updated_at: …,
  author_id: 67890,
  …
}
PostCreatedEvent
{
  story_id: 12345,
  type: "story_posted",
  …
}
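One way to read the two records above: the post is written to the main data store, and a much smaller event that only references the post by id is appended to the event source. The sketch below uses the field names from the slide; the `create_post` helper and the two in-memory stores are hypothetical stand-ins.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Post:
    id: int
    content: str
    author_id: int
    created_at: datetime
    updated_at: datetime


def post_created_event(post: Post) -> dict:
    """Build the small event record that references the post by id."""
    return {"story_id": post.id, "type": "story_posted"}


main_data = {}     # stand-in for the main database (written synchronously)
event_source = []  # stand-in for the event source (consumed asynchronously)


def create_post(post_id: int, content: str, author_id: int) -> Post:
    now = datetime.now(timezone.utc)
    post = Post(post_id, content, author_id, now, now)
    main_data[post.id] = post                      # main data
    event_source.append(post_created_event(post))  # event data
    return post


post = create_post(12345, "hello world", 67890)
```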
Job 1
Job 2
Scheduler
What’s batch processing?
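As a rough illustration of batch processing, the sketch below has a scheduler that, on each run, hands every job the events accumulated since the previous run. The class name, the cursor mechanism, and the two jobs are assumptions made for the example.

```python
def count_by_type(events):
    """Job 1: count events per type in the batch."""
    counts = {}
    for event in events:
        counts[event["type"]] = counts.get(event["type"], 0) + 1
    return counts


def collect_user_ids(events):
    """Job 2: list the distinct user ids seen in the batch, sorted."""
    return sorted({event["user_id"] for event in events})


class BatchScheduler:
    def __init__(self, jobs):
        self.jobs = jobs
        self.cursor = 0  # index of the first event not yet processed

    def run_once(self, event_log):
        """Run every job on the events added since the last run."""
        batch = event_log[self.cursor:]
        self.cursor = len(event_log)
        return [job(batch) for job in self.jobs]


event_log = [
    {"type": "story_posted", "user_id": 2},
    {"type": "story_liked", "user_id": 1},
    {"type": "story_posted", "user_id": 1},
]
scheduler = BatchScheduler([count_by_type, collect_user_ids])
first_run = scheduler.run_once(event_log)
second_run = scheduler.run_once(event_log)  # nothing new since the first run
```

An event that arrives just after a run waits for the next one, so the delay between runs becomes the latency of the results.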
Which DB for event source?
● Volume?
● Velocity? QPS reads? QPS writes?
● Latency?
● Cost? Storage & R/W
● How to write?
○ Integrity?
○ Consistency?
○ Durability?
○ Version?
● How to read?
○ Random access or sequential?
○ Full text search?
○ Geo distance?
How to store events?
              MySQL   MongoDB   JSON on S3 (or GCS)
30 GB         OK      Good      Very good
10K WPS       OK      Good      Very good
1K RPS        OK      Good      Very good
Range read    OK      Good      Very good
Cost          $$      $$$       $
                  MySQL   MongoDB
30 GB             OK      Good
10K WPS           OK      Good
1K RPS            OK      Good
Sequential read   OK      Good
Cost              $$      $$$
How to store events?
Who wants to become architect?
Job 1
Job 2
Scheduler
What’s the problem with batch?
LATENCY
How to process real-time?
Stream processing
How can 2 processes talk?
QUEUE
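A minimal sketch of two workers talking through a queue. Threads and Python's in-process `queue.Queue` stand in for separate processes and a real message queue; the sentinel convention and the upper-casing "work" are assumptions chosen to keep the example small.

```python
import queue
import threading

SENTINEL = None  # tells the consumer that no more events are coming


def producer(q, events):
    """Writes events to the queue without waiting for them to be processed."""
    for event in events:
        q.put(event)
    q.put(SENTINEL)


def consumer(q, results):
    """Reads events in arrival order and processes each one."""
    while True:
        event = q.get()
        if event is SENTINEL:
            break
        results.append(event.upper())  # placeholder for real processing


def run(events):
    q = queue.Queue()
    results = []
    t_consumer = threading.Thread(target=consumer, args=(q, results))
    t_producer = threading.Thread(target=producer, args=(q, events))
    t_consumer.start()
    t_producer.start()
    t_producer.join()
    t_consumer.join()
    return results


processed = run(["story_posted", "story_liked"])
```

The producer never calls the consumer directly, so either side can be slowed down, restarted, or replaced without changing the other.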
Why not use database?
                  Importance   MySQL              Kafka               Redis
10K WPS           1.0          5                  10                  10
1K RPS            1.0          5                  10                  10
Sequential read   1.0          10 (with B-tree)   10                  10 (using Lists)
Order guarantee   0.2          10                 0                   10
Durability        0.1          10                 5 (but perf. hit)   0
Deployability     0.5          10                 5                   7.5
Score                          5.6 / 10           6.6 / 10            7.15 / 10
Why not database?
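The scores in the comparison above appear to be importance-weighted sums of the ratings. The sketch below recomputes them. The divisor of 5 is not stated on the slide; it is inferred because it reproduces the three published scores.

```python
# Importance weights and ratings copied from the comparison table.
WEIGHTS = {
    "10K WPS": 1.0,
    "1K RPS": 1.0,
    "Sequential read": 1.0,
    "Order guarantee": 0.2,
    "Durability": 0.1,
    "Deployability": 0.5,
}

RATINGS = {
    "MySQL": {"10K WPS": 5, "1K RPS": 5, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 10, "Deployability": 10},
    "Kafka": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 0, "Durability": 5, "Deployability": 5},
    "Redis": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 0, "Deployability": 7.5},
}

DIVISOR = 5  # assumption: inferred from the slide's scores, not stated there


def score(ratings):
    """Importance-weighted sum of ratings, scaled by the inferred divisor."""
    weighted = sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)
    return round(weighted / DIVISOR, 2)


scores = {system: score(ratings) for system, ratings in RATINGS.items()}
```

Changing a weight (for example, raising the importance of durability) can reorder the candidates, which is the point of making the weights explicit.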
What is a transform?
Transforms
Source
Sink
Functional vs OOP
Librarian.startShift()
Catalog.open()
Library.close()
Books.create()
Operations on things
Add more things
find(book)
assign(book)
Things with operations
Add more operations
remove(book)
load_cover(book)
Functional vs OOP
find_similar(vid_uploaded)
transcribe_captions(vid_uploaded)
Things with operations
Add more operations
alert_subscribers(vid_uploaded)
generate_thumbnails(vid_uploaded)
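In the functional style, each operation is an independent function that takes the same `vid_uploaded` event, so adding an operation means adding a function rather than changing a class. The function names below come from the slide; their bodies are placeholders, and the `TRANSFORMS` list and `apply_all` helper are hypothetical.

```python
def find_similar(vid_uploaded):
    # Placeholder body: a real version would query a similarity index.
    return {"video_id": vid_uploaded["id"], "action": "find_similar"}


def transcribe_captions(vid_uploaded):
    # Placeholder body: a real version would run speech-to-text.
    return {"video_id": vid_uploaded["id"], "action": "transcribe_captions"}


def alert_subscribers(vid_uploaded):
    # Placeholder body: a real version would send notifications.
    return {"video_id": vid_uploaded["id"], "action": "alert_subscribers"}


def generate_thumbnails(vid_uploaded):
    # Placeholder body: a real version would extract video frames.
    return {"video_id": vid_uploaded["id"], "action": "generate_thumbnails"}


# Adding a new operation only requires appending one more function here.
TRANSFORMS = [find_similar, transcribe_captions,
              alert_subscribers, generate_thumbnails]


def apply_all(event, transforms=TRANSFORMS):
    """Run every transform on the same event without mutating it."""
    return [transform(event) for transform in transforms]


event = {"id": 42, "type": "vid_uploaded"}
results = apply_all(event)
```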
What’s supporting data?
Transform
Supporting data
event
{
  id: 12345,
  type: "story_posted",
  user_id: 67890,
  coordinates: [ 10.76, 106.66 ]
}
Friends or city DB
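A transform often needs supporting data that is not in the event itself. The sketch below augments the event with the nearest city from a small lookup table. The city table, its approximate coordinates, the `city` field, and the reading of `coordinates` as [latitude, longitude] are all assumptions for illustration.

```python
import math

# Hypothetical supporting data: (name, latitude, longitude), approximate.
CITY_DB = [
    ("Ho Chi Minh City", 10.78, 106.70),
    ("Hanoi", 21.03, 105.85),
    ("Da Nang", 16.05, 108.20),
]


def nearest_city(lat, lon):
    """Return the closest city by straight-line distance in degrees.

    This is a crude measure, adequate only for a small illustrative table.
    """
    name, _, _ = min(
        CITY_DB, key=lambda city: math.hypot(city[1] - lat, city[2] - lon)
    )
    return name


def enrich(event):
    """Return a copy of the event augmented with a city name."""
    lat, lon = event["coordinates"]  # assumed order: [latitude, longitude]
    augmented = dict(event)
    augmented["city"] = nearest_city(lat, lon)
    return augmented


event = {
    "id": 12345,
    "type": "story_posted",
    "user_id": 67890,
    "coordinates": [10.76, 106.66],
}
augmented = enrich(event)
```

Keeping the lookup table local to the pipeline avoids one external call per event, which matters at pipeline volumes.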
Who uses ext. supporting data?
API vs Pipeline: availability?
Requests in thread
Long running
API vs Pipeline: performance?
100ms
⇓
10ms
100 ms × 300,000 = 30,000 s ≈ 8.3 h
⇓
10 ms × 300,000 = 3,000 s = 50 min
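The arithmetic on this slide can be checked with a few lines, using the slide's figure of 300,000 items and assuming they are processed one after another with a fixed latency per item. The helper name is hypothetical.

```python
def total_seconds(n_items, latency_ms):
    """Total sequential processing time, converting milliseconds to seconds."""
    return n_items * latency_ms / 1000


N_ITEMS = 300_000

slow_hours = total_seconds(N_ITEMS, 100) / 3600  # 30,000 s, about 8.3 hours
fast_minutes = total_seconds(N_ITEMS, 10) / 60   # 3,000 s, 50 minutes
```

A 90 ms saving per call that is barely noticeable in an API becomes the difference between most of a working day and under an hour in a pipeline.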
Where is the data coming from?
Is this a face? Who’s this? Friend? Celebrity?
Data pipelines & AI
Transform
AI model
How can 2 processes talk?
Transform
AI model
What is a sink?
Which DB to sink to?
What to do with the sink?
Write Read
Data scientist
Sales
What are the read use cases?
Aggregation: Give me a summary report of last month’s activity
Full text search: Give me posts that contain the words Donald Trump, Trump or President
Bulk data, filtered: Give me all posts by female, age 18-35
ACID
Denormalization: good or bad?
What is BCNF?
What’s distributed data systems?
Why re-run the pipeline?
Transform
AI model
Transform v2
Idempotency & backfill
f(f(x)) = f(x)
POST "/BankAccount/AddFunds"
{ value: 1000, token: "TX123" }
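The token in the request above is what makes a retry or a backfill safe: applying the same request twice leaves the same state as applying it once. A minimal sketch, assuming the server remembers which tokens it has already processed; the class and method names are hypothetical.

```python
class BankAccount:
    def __init__(self, balance=0):
        self.balance = balance
        self._seen_tokens = set()  # tokens of requests already applied

    def add_funds(self, value, token):
        """Apply the deposit once per token; a replayed token is a no-op."""
        if token in self._seen_tokens:
            return self.balance
        self._seen_tokens.add(token)
        self.balance += value
        return self.balance


account = BankAccount()
first = account.add_funds(1000, "TX123")
replay = account.add_funds(1000, "TX123")  # same request sent again
```

Without the token check, replaying the event log during a backfill would credit the account a second time.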
Another reason for backfill?
What if the AI model improves?
Transform
AI model v2
AI systems ≠ traditional systems?
93.2%
Probabilistic
Deterministic
Store output of model v1 or v2?
AI Model v1
( accuracy: 83.1% )
AI Model v2
( accuracy: ?? )
What have we learned?
Source: Uber Engineering
[DE] Collect data
[DE] Process data
[DS] Build DL model
[BE/FE] Use DL model in app
[DA] Validate DL model
Which NFR for Big Data?
• Scalability
• Availability
• Interoperability
• Portability
• Modifiability
• Maintainability
• Testability
• Usability
• Buildability
• Deployability
• Ease of Development
• Performance
• Security
• Localization
• Legal
• Reusability
• Supportability
• Monitorability
Main data
+
Materialized view
Event data
⇓
Pipeline
⇓
Augmented data
What have we learned?
Want to learn more about
AI & Big Data?
We’re hiring:
● Big Data Engineer, in training (Java)
● Big Data Engineer (Java)
● Data Scientist (Python)
http://bit.ly/quod-ai-join
Herve Roussel, herve@quod.ai

Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big Data & AI