Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big Data & AI

Architectures of AI systems
Engineering for Big Data & AI
HCMC, Sep 6th 2019
Herve Roussel (herve@quod.ai)

Is this data engineering?
UploadData.java
upload_data.py

cat console.log | grep "ERROR" > errors.log

Data engineering?
Event data → Program → Transformed data

cat console.log | grep "ERROR" > errors.log
Event data
Transform
Transformed data
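
The shell one-liner is already a small transform: lines go in, a filtered subset comes out. A minimal Python sketch of the same idea, assuming the log is available as an iterable of lines. The function name filter_errors and the sample lines are made up for illustration.

```python
from typing import Iterable, Iterator


def filter_errors(lines: Iterable[str], needle: str = "ERROR") -> Iterator[str]:
    """Yield only the lines that contain `needle`, like `grep "ERROR"`."""
    for line in lines:
        if needle in line:
            yield line


# Hypothetical sample input standing in for console.log.
console_log = [
    "INFO  server started",
    "ERROR disk full",
    "INFO  request served",
    "ERROR timeout talking to db",
]

# Event data -> Transform -> Transformed data
errors = list(filter_errors(console_log))
print(errors)
```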

What is Big Data Engineering?

How to query news feed?
SELECT
*
FROM posts
INNER JOIN friends
WHERE ...
ORDER BY
posts.timestamp DESC
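
A runnable sketch of this kind of query, using an in-memory SQLite database. The schema, the join condition and the WHERE clause are assumptions made for illustration; the slide leaves them elided, and this is not meant as the real feed query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: posts written by authors, friends as (user, friend) pairs.
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER,
                    content TEXT, timestamp INTEGER);
CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
INSERT INTO posts VALUES (1, 20, 'hello', 100),
                         (2, 30, 'lunch', 200),
                         (3, 40, 'not a friend', 300);
INSERT INTO friends VALUES (10, 20), (10, 30);
""")


def news_feed(viewer_id):
    """Return (id, content) of posts by the viewer's friends, newest first."""
    rows = conn.execute(
        """
        SELECT posts.id, posts.content
        FROM posts
        INNER JOIN friends ON friends.friend_id = posts.author_id
        WHERE friends.user_id = ?
        ORDER BY posts.timestamp DESC
        """,
        (viewer_id,),
    )
    return rows.fetchall()


feed = news_feed(10)
print(feed)
```

A single query like this answers "which posts, in which order", but none of the questions on the next slide (notification, moderation, faces, ranking).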

Notify? Web, mobile?
Who can see this?
Racist? Vulgar?
Is this a face? Who’s this? Friend? Celebrity?
Courtney likes. Is that good or bad?
Paddy commented. Is that good or bad?
Chris posted. Is that good or bad?
Anybody tagged?
What rank in feed?

Is Big Data just for big companies?
300K QPS [R]
6K QPS [W]
As of JULY 8, 2013
1B+ QPM [P]
250M+ QPM [R]
400M LOC [P]
1.8 TB per year [P]

Data Engineering
Event data → Program → Augmented data

Event data
Transform
Augmented data
Big Data Engineering + AI

Pipeline (Transform)
Source (Event data)
Sink (Augmented data)

Synchronous (10-100 ms)
Where is data coming from?
Main data
Event source
Why split?
Asynchronous (3-5 s)

What’s in an event?
Post
{
id: 12345,
content: "hello world",
created_at: …
updated_at: …
author_id: 67890,
…
}
PostCreatedEvent
{
story_id: 12345,
type: "story_posted"
…
}
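
A minimal sketch of the split between main data and event data, assuming an in-memory dict as the main store and a queue as the event source. Both stand-ins, and the function create_post, are hypothetical; only the field names come from the slide.

```python
import queue
from dataclasses import dataclass


@dataclass
class Post:
    id: int
    content: str
    author_id: int


@dataclass
class PostCreatedEvent:
    story_id: int
    type: str = "story_posted"


posts_table = {}             # stand-in for the main data store
event_queue = queue.Queue()  # stand-in for the event source


def create_post(post: Post) -> None:
    # Synchronous path: the user waits for this write.
    posts_table[post.id] = post
    # Asynchronous path: a small event is queued for the pipeline to pick up later.
    event_queue.put(PostCreatedEvent(story_id=post.id))


create_post(Post(id=12345, content="hello world", author_id=67890))
event = event_queue.get_nowait()
print(event)
```

The event carries only a reference (story_id) and a type, so the pipeline can look up the full post when it needs it.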

Job 1
Job 2
Scheduler
What’s batch processing?
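
A toy sketch of batch processing, assuming the scheduler simply runs Job 1 and then Job 2 over everything accumulated since the last run. The job names and event types are invented for illustration.

```python
from collections import Counter


def job_count_by_type(events):
    """Job 1: count events per type."""
    return Counter(e["type"] for e in events)


def job_top_type(counts):
    """Job 2: depends on Job 1's output; pick the most frequent type."""
    return counts.most_common(1)[0][0]


def run_batch(events):
    """Scheduler stand-in: run the jobs in dependency order on the whole batch."""
    counts = job_count_by_type(events)
    return job_top_type(counts)


# Events accumulated since the previous run (hypothetical data).
accumulated = [
    {"type": "story_posted"},
    {"type": "story_liked"},
    {"type": "story_posted"},
]
top = run_batch(accumulated)
print(top)
```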

● Volume?
● Velocity? QPS reads? QPS writes?
● Latency?
● Cost? Storage & R/W
● How to write?
○ Integrity?
○ Consistency?
○ Durability?
○ Version?
● How to read?
○ Random access or sequential?
○ Full text search?
○ Geo distance?
How to store events?

              MySQL   MongoDB   JSON on S3 (or GCS)
30 GB         OK      Good      Very good
10K WPS       OK      Good      Very good
1K RPS        OK      Good      Very good
Range read    OK      Good      Very good
Cost          $$      $$$       $
                  MySQL   MongoDB
30 GB             OK      Good
10K WPS           OK      Good
1K RPS            OK      Good
Sequential read   OK      Good
Cost              $$      $$$
How to store events?

Who wants to become architect?

Job 1
Job 2
Scheduler
What’s the problem with batch?
LATENCY

How to process real-time?
Stream processing

                  Importance   MySQL              Kafka               Redis
10K WPS           1.0          5                  10                  10
1K RPS            1.0          5                  10                  10
Sequential read   1.0          10 (with B-TREE)   10                  10 (using Lists)
Order guarantee   0.2          10                 0                   10
Durability        0.1          10                 5 (but perf. hit)   0
Deployability     0.5          10                 5                   7.5
Score                          5.6 / 10           6.6 / 10            7.15 / 10
Why not database?
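
The scores in the table can be reproduced as a weighted sum of the ratings. The divisor of 5 is inferred from the numbers (it is what makes the totals match); the slide does not state it.

```python
# Importance weights and 0-10 ratings, copied from the table.
weights = {
    "10K WPS": 1.0,
    "1K RPS": 1.0,
    "Sequential read": 1.0,
    "Order guarantee": 0.2,
    "Durability": 0.1,
    "Deployability": 0.5,
}
ratings = {
    "MySQL": {"10K WPS": 5, "1K RPS": 5, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 10, "Deployability": 10},
    "Kafka": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 0, "Durability": 5, "Deployability": 5},
    "Redis": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 0, "Deployability": 7.5},
}

DIVISOR = 5  # assumption: inferred from the slide's totals, not stated there


def score(name):
    """Weighted sum of ratings, scaled to match the slide's "/ 10" scores."""
    weighted = sum(weights[c] * ratings[name][c] for c in weights)
    return round(weighted / DIVISOR, 2)


scores = {name: score(name) for name in ratings}
print(scores)
```

Changing a weight (for example, raising Durability) changes the ranking, which is the point of writing the importance down explicitly.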

Functional vs OOP
Operations on things (add more things):
Librarian.startShift()
Catalog.open()
Library.close()
Books.create()
Things with operations (add more operations):
find(book)
assign(book)
remove(book)
load_cover(book)

Functional vs OOP
find_similar(vid_uploaded)
transcribe_captions(vid_uploaded)
alert_subscribers(vid_uploaded)
generate_thumbnails(vid_uploaded)
Things with operations
Add more operations
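
A small sketch of the functional style on this slide: each operation is a plain function that takes the upload event, so adding an operation means adding one function to a list. The function bodies are placeholders and the event fields are invented.

```python
def find_similar(vid_uploaded):
    return f"similar videos for {vid_uploaded['id']}"


def transcribe_captions(vid_uploaded):
    return f"captions for {vid_uploaded['id']}"


def alert_subscribers(vid_uploaded):
    return f"alerts sent for {vid_uploaded['id']}"


def generate_thumbnails(vid_uploaded):
    return f"thumbnails for {vid_uploaded['id']}"


# Adding a new operation only touches this list; the event itself is unchanged.
OPERATIONS = [find_similar, transcribe_captions, alert_subscribers, generate_thumbnails]


def handle(vid_uploaded):
    """Apply every registered operation to the same event."""
    return {op.__name__: op(vid_uploaded) for op in OPERATIONS}


results = handle({"id": "vid-1", "type": "vid_uploaded"})
print(results)
```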

What’s supporting data?
Transform
Supporting data
event
{
id: 12345,
type: "story_posted",
user_id: 67890,
coordinates: [ 10.76, 106.66 ]
}
Friends or city DB
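
A minimal sketch of a transform that uses supporting data: the event's coordinates are resolved to a city name through a lookup table. The tiny city table, its approximate coordinates and the nearest-point rule are assumptions for illustration; a real city DB would use proper geo distance.

```python
import math

# Hypothetical city DB (supporting data); coordinates are approximate.
CITIES = {
    "Ho Chi Minh City": (10.76, 106.66),
    "Hanoi": (21.03, 105.85),
}


def nearest_city(lat, lon):
    """Pick the city whose coordinates are closest (straight-line, in degrees)."""
    return min(CITIES, key=lambda name: math.dist((lat, lon), CITIES[name]))


def augment(event):
    """Return a copy of the event with a `city` field added."""
    lat, lon = event["coordinates"]
    return {**event, "city": nearest_city(lat, lon)}


event = {
    "id": 12345,
    "type": "story_posted",
    "user_id": 67890,
    "coordinates": [10.76, 106.66],
}
augmented = augment(event)
print(augmented["city"])
```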

Who uses ext. supporting data?

API vs Pipeline: availability?
API: requests in thread
Pipeline: long running

API vs Pipeline: performance?
100ms
⇓
10ms
100 ms × 300,000 = 30,000 s ≈ 8.3 h
⇓
10 ms × 300,000 = 3,000 s = 50 min
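
Worked out exactly for 300,000 sequential calls, the totals are about 8.3 hours at 100 ms per call and 50 minutes at 10 ms per call. The sketch assumes calls run one after another with no parallelism.

```python
def total_seconds(latency_ms, n_events):
    """Total wall-clock time if every event costs `latency_ms` and runs sequentially."""
    return latency_ms * n_events / 1000


N_EVENTS = 300_000

slow_hours = total_seconds(100, N_EVENTS) / 3600  # 30,000 s expressed in hours
fast_minutes = total_seconds(10, N_EVENTS) / 60   # 3,000 s expressed in minutes

print(f"100 ms per event: {slow_hours:.2f} h")
print(f"10 ms per event:  {fast_minutes:.0f} min")
```

A latency difference that is barely noticeable for one API request becomes hours once it is multiplied by every event in the pipeline.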

Where is the data coming from?
Is this a face? Who’s this? Friend? Celebrity?

Data pipelines & AI
Transform
AI model

How can 2 processes talk?
Transform
AI model

What to do with the sink?
Write Read
Data scientist
Sales

What are the read use cases?
Aggregation: "Give me a summary report of last month’s activity"
Full text search: "Give me posts that contain the words Donald Trump, Trump or President"
Bulk data, filtered: "Give me all posts by female users, age 18-35"
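
The three read patterns, sketched over a small in-memory list. The sample posts and field names are invented, and substring matching only stands in for a real full text index; in practice each pattern tends to be served by a different kind of store.

```python
from collections import Counter

# Hypothetical augmented data sitting in the sink.
posts = [
    {"id": 1, "text": "President Trump spoke today", "gender": "female", "age": 29, "month": "2019-08"},
    {"id": 2, "text": "Pho for lunch", "gender": "male", "age": 41, "month": "2019-08"},
    {"id": 3, "text": "New phone day", "gender": "female", "age": 22, "month": "2019-07"},
]


def posts_per_month(posts):
    """Aggregation: summary counts per month."""
    return Counter(p["month"] for p in posts)


def search(posts, words):
    """Full text search stand-in: ids of posts containing any of the words."""
    lowered = [w.lower() for w in words]
    return [p["id"] for p in posts if any(w in p["text"].lower() for w in lowered)]


def bulk_filter(posts, gender, min_age, max_age):
    """Bulk data, filtered: ids of posts by authors matching the criteria."""
    return [p["id"] for p in posts
            if p["gender"] == gender and min_age <= p["age"] <= max_age]


summary = posts_per_month(posts)
hits = search(posts, ["Donald Trump", "Trump", "President"])
bulk = bulk_filter(posts, "female", 18, 35)
print(summary, hits, bulk)
```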

What are distributed data systems?

Why re-run the pipeline?
Transform
AI model
Transform v2

Idempotency & backfill
f(f(x)) = f(x)
POST "/BankAccount/AddFunds"
{ value: 1000, token: TX123 }
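
A minimal sketch of how a token can make AddFunds idempotent: a request whose token was already processed is ignored, so re-running the pipeline (a backfill) does not add the funds twice. The in-memory balance, the set of seen tokens and the function name are assumptions for illustration.

```python
balances = {"acct-1": 0}
seen_tokens = set()  # tokens of requests already applied


def add_funds(account, value, token):
    """Apply the deposit once per token; replays return the balance unchanged."""
    if token in seen_tokens:
        return balances[account]
    seen_tokens.add(token)
    balances[account] += value
    return balances[account]


first = add_funds("acct-1", 1000, "TX123")
replay = add_funds("acct-1", 1000, "TX123")  # same request delivered again
print(first, replay)
```

Without the token check the replay would leave the balance at 2000, so the operation could not be re-run safely.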

What if the AI model improves?
Transform
AI model v2

AI systems ≠ traditional systems?
93.2%
Probabilistic
Deterministic

Store output of model v1 or v2?
AI Model v1
( accuracy: 83.1% )
AI Model v2
( accuracy: ?? )
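
One possible answer, sketched under assumptions: store each output together with the model version that produced it, so v1 and v2 results can sit side by side and be compared before v2 is trusted. The storage layout, labels and confidence values below are invented for illustration.

```python
# (event_id, model_version) -> model output; a dict stands in for the sink.
predictions = {}


def store_prediction(event_id, model_version, label, confidence):
    """Keep outputs keyed by model version instead of overwriting them."""
    predictions[(event_id, model_version)] = {"label": label, "confidence": confidence}


def outputs_for(event_id):
    """All stored outputs for one event, grouped by model version."""
    return {version: out
            for (eid, version), out in predictions.items()
            if eid == event_id}


# Hypothetical outputs for the same event from both models.
store_prediction(12345, "v1", "face", 0.81)
store_prediction(12345, "v2", "face", 0.93)

both = outputs_for(12345)
print(both)
```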

Source: Uber Engineering
[DE] Collect data
[DE] Process data
[DS] Build DL model
[BE/FE] Use DL model in app
[DA] Validate DL model

Which NFR for Big Data?
• Scalability
• Availability
• Interoperability
• Portability
• Modifiability
• Maintainability
• Testability
• Usability
• Buildability
• Deployability
• Ease of Development
• Performance
• Security
• Localization
• Legal
• Reusability
• Supportability
• Monitorability


Main data
+
Materialized view
Event data
⇓
Pipeline
⇓
Augmented data
What have we learned?

Want to learn more about AI & Big Data?
We’re hiring:
● Big Data Engineer, in training (Java)
● Big Data Engineer (Java)
● Data Scientist (Python)
http://bit.ly/quod-ai-join
Herve Roussel (herve@quod.ai)
