Stream Processing in SmartNews #jawsdays
- 5. What is SmartNews?
• News Discovery App for Mobile
• Launched in 2012
• 15M+ downloads worldwide
https://www.smartnews.com/en/
- 14. Index System
• Crawler
• collect news articles & social signals
• Analyzer
• extract title, content, thumbnail...
• classify topics (sports, politics, technology...)
• Indexer
• upload article metadata into CloudSearch
- 15. Feedback System
• API Tracker
• receive user's activity log from mobile app
• Spark Streaming
• generate various metrics for news ranking
• store metrics in DynamoDB
- 21. Data & Its Numbers
• User activities
• ~100 GB per day (compressed)
• 60+ record types
• User demographics, configurations, etc.
• 15M+ records
• Article metadata
• 100K+ records per day
- 24. Kinesis Libraries
• Kinesis Producer Library (KPL)
• put records into a stream
• asynchronous architecture (buffer records)
• Kinesis Client Library (KCL)
• consume and process data from a stream
• handle complex tasks associated with distributed computing
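The "buffer records" idea behind the KPL can be pictured with a small standard-library sketch. This is not the KPL API: `BufferedProducer`, `max_records`, and the `send_batch` callback are hypothetical names, and the sketch only models batching, not the background threads, timers, retries, or record aggregation of the real library.

```python
from typing import Callable, List


class BufferedProducer:
    """Collects records in memory and hands them off in batches.

    Hypothetical illustration only; it models buffering, nothing else.
    """

    def __init__(self, send_batch: Callable[[List[bytes]], None], max_records: int = 3):
        self._send_batch = send_batch      # stand-in for "write a batch to the stream"
        self._max_records = max_records    # flush threshold (assumed value)
        self._buffer: List[bytes] = []

    def put(self, record: bytes) -> None:
        """Buffer one record; flush automatically once the buffer is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self._max_records:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as a single batch."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send_batch(batch)


sent: List[List[bytes]] = []
producer = BufferedProducer(sent.append, max_records=3)
for i in range(7):
    producer.put(f"event-{i}".encode())
producer.flush()  # push out the final partial batch
print([len(batch) for batch in sent])
```

The caller pays for one downstream call per batch rather than one per record, which is the trade the buffered design makes: higher throughput in exchange for a short delay before a record leaves the process.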
- 25. KPL/KCL Monitoring
• KPL and KCL publish custom CloudWatch metrics
• Key Metrics for KPL
• UserRecordsReceived, UserRecordsPending
• AllErrors
• Key Metrics for KCL
• RecordsProcessed
• MillisBehindLatest
• RecordProcessor.processRecords.Time
https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kpl.html
https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kcl.html
- 27. Feedback System
Generate Metrics by User Clusters for Ranking Articles
[Architecture diagram. Components: API Gateway, Search API, Amazon CloudSearch, API Tracker, Kinesis Stream, Amazon S3, Hive / Spark, DynamoDB, Offline ETL / Machine Learning, Push Notification. Data labels: User Clusters, User Feedback, Article Metadata, Metrics by Cluster]
- 28. Why Metrics by Cluster?
Consider Each User's Interests
Ensure Diversity to Avoid the Filter Bubble
https://en.wikipedia.org/wiki/Filter_bubble
[Diagram: User, GET /news/sports, API, Amazon CloudSearch (Article Inventory), DynamoDB (Metrics by User Cluster)]
User: userId: 1000, gender: Male, age: 36, location: San Francisco, US, interests: Baseball
Article                    raw score    weight    score
San Francisco Giants       3.5        × 1       = 3.5
New York Yankees           6.2        × 0.6     = 3.72
FIFA World Cup             20.4       × 0.2     = 4.08
U.S. Open Championships    8.4        × 0.2     = 1.68
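The scoring example on this slide can be sketched in a few lines, assuming the final score is simply the raw score multiplied by a weight looked up for the user's cluster. The function name `rank_articles` and the default weight of 0.0 for unknown articles are assumptions made for illustration, not details from the talk.

```python
# Raw article scores and per-cluster weights, taken from the slide's example.
raw_scores = {
    "San Francisco Giants": 3.5,
    "New York Yankees": 6.2,
    "FIFA World Cup": 20.4,
    "U.S. Open Championships": 8.4,
}
cluster_weights = {
    "San Francisco Giants": 1.0,
    "New York Yankees": 0.6,
    "FIFA World Cup": 0.2,
    "U.S. Open Championships": 0.2,
}


def rank_articles(raw_scores, weights):
    """Return (article, weighted score) pairs, highest score first.

    Assumes score = raw score * weight; articles without a weight get 0.0.
    """
    scored = {
        article: round(raw * weights.get(article, 0.0), 2)
        for article, raw in raw_scores.items()
    }
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)


for article, score in rank_articles(raw_scores, cluster_weights):
    print(f"{article}: {score}")
```

Weighting by cluster rather than filtering keeps a globally popular article in the list for a baseball fan at a reduced score, which is one way to read the slide's point about diversity.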
- 29. Input Data by Fluentd
• Forwarder (running on each instance)
• archive events to S3
• forward events to aggregators
• Aggregator (HA Configuration※)
• put events into Kinesis Stream
• alert and report (not mentioned here)
※ http://docs.fluentd.org/articles/high-availability
- 30. Example Configurations
# Forwarder
<source>
  @type tail
  tag smartnews.user_activity
  ...
</source>
<match smartnews.user_activity>
  @type copy
  <store>
    @type s3
    ...
  </store>
  <store>
    @type forward
    ...
  </store>
</match>

# Aggregator
<source>
  @type forward
  ...
</source>
<match smartnews.user_activity>
  @type copy
  <store>
    @type kinesis
    ...
  </store>
  <store>
    ...
  </store>
</match>
http://docs.fluentd.org/articles/kinesis-stream
- 31. Offline ETL Flow
Transform Text Files into Columnar Files
Various Machine Learning Tasks
[Diagram components: API, RDS, Amazon S3, Hive on EMR, Amazon S3, Spark on EMR, Airflow (Manage Workflow)]
Activities (sample record):
{
  "timestamp": 1453161447,
  "userId": 1234,
  "platform": "ios",
  "edition": "ja_JP",
  "action": "viewArticle",
  "data": {
    "articleId": 1234,
    "duration": 30.2
  }
}
Users (sample rows):
userId, age, gender, location, …
1234, 28, M, Tokyo, …
1235, 32, F, Nagano, …
1240, 18, F, Kyoto, …
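As a rough in-memory stand-in for the text-to-columnar step, the sketch below flattens JSON activity records and pivots them into one list per column. The real job runs on Hive on EMR and the slide does not name the columnar file format, so this only illustrates the change of layout. The function `to_columns`, the `data_` prefix for nested fields, and the second sample record are assumptions.

```python
import json


def to_columns(json_lines):
    """Flatten each JSON record and pivot rows into per-column lists."""
    rows = []
    for line in json_lines:
        record = json.loads(line)
        # Lift nested "data" fields to top-level columns (naming is an assumption).
        data = record.pop("data", {})
        record.update({f"data_{key}": value for key, value in data.items()})
        rows.append(record)
    names = sorted({name for row in rows for name in row})
    # Missing fields become None so every column has one entry per row.
    return {name: [row.get(name) for row in rows] for name in names}


lines = [
    # First record mirrors the slide; the second is invented for illustration.
    '{"timestamp": 1453161447, "userId": 1234, "platform": "ios", '
    '"edition": "ja_JP", "action": "viewArticle", '
    '"data": {"articleId": 1234, "duration": 30.2}}',
    '{"timestamp": 1453161450, "userId": 1235, "platform": "android", '
    '"edition": "ja_JP", "action": "viewArticle", '
    '"data": {"articleId": 1234, "duration": 12.0}}',
]
columns = to_columns(lines)
print(columns["userId"])
print(columns["data_duration"])
```

Storing values column by column lets a query that only needs `userId` and `data_duration` skip every other field, which is the usual motivation for converting raw logs before analysis.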
- 32. Airflow: Workflow Engine
Execute Task A -> Task B -> Task C, D
5 * * * * app hive -f query_1.hql
15 * * * * app hive -f query_2.hql
30 * * * * app hive -f query_3.hql
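The cron entries on this slide start each query at a fixed minute, whereas a workflow engine starts a task once the tasks it depends on have finished. The sketch below shows that dependency-driven ordering for the Task A -> Task B -> Task C, D example using only Python's standard library (`graphlib`, Python 3.9+). It is not Airflow's API; `run_in_order` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on: A -> B -> (C, D).
dependencies = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"B"}}


def run_in_order(dependencies, run):
    """Run every task only after all of its upstream tasks have completed."""
    order = []
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    while sorter.is_active():
        # Everything returned here has no unfinished dependencies,
        # so a real engine could run these tasks in parallel.
        for task in sorted(sorter.get_ready()):
            run(task)
            order.append(task)
            sorter.done(task)
    return order


executed = run_in_order(dependencies, lambda task: print("running", task))
```

With explicit dependencies, a slow upstream task delays its downstream tasks instead of letting them start on stale or missing input, which fixed start times cannot guarantee.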
- 33. Spark Streaming
[Diagram: Kinesis Stream (Shard 1, Shard 2, Shard 3) feeds DStream 1, DStream 2, DStream 3, each a series of RDDs. A Minutely RDD is combined with the Pre Computed RDD to produce Minutely Metrics by User Cluster (Female, Male, Teen, ...), which are written to DynamoDB.]
Split Streams into Minutely RDD
Join Minutely RDD on PreComputed RDD
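The two steps on this slide, splitting the stream into minute-sized batches and joining each batch with precomputed data, can be mimicked in plain Python. This is not Spark code: `minutely_metrics`, the event tuples, and the user-to-cluster table are all invented for illustration, and the "metric" here is just a view count.

```python
from collections import Counter

# Precomputed lookup (hypothetical data): userId -> user cluster.
user_clusters = {1: "Female", 2: "Male", 3: "Teen"}

# (epoch seconds, userId, articleId); values are invented for illustration.
events = [
    (1453161447, 1, "a1"),
    (1453161450, 2, "a1"),
    (1453161455, 3, "a2"),
    (1453161460, 1, "a1"),
    (1453161481, 1, "a2"),
]


def minutely_metrics(events, user_clusters):
    """Count views per (minute, cluster, article)."""
    counts = Counter()
    for timestamp, user_id, article_id in events:
        cluster = user_clusters.get(user_id)  # the "join" against precomputed data
        if cluster is None:
            continue  # skip users that have no cluster assignment
        minute = timestamp - timestamp % 60   # start of the one-minute bucket
        counts[(minute, cluster, article_id)] += 1
    return counts


for key, views in sorted(minutely_metrics(events, user_clusters).items()):
    print(key, views)
```

Keeping the cluster table precomputed means the per-minute work is a lookup and a count, so the streaming side stays cheap while the heavier clustering runs offline.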
- 35. Integrate with CloudWatch
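import org.apache.spark.SparkConf
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted, StreamingListenerBatchStarted}

// putMetricToCloudWatch(name: String, value: Double) is a helper not shown on the slide.
class CloudWatchRelay(conf: SparkConf) extends StreamingListener {
  // Emit a counter each time a batch starts.
  override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {
    putMetricToCloudWatch("BatchStarted", 1.0)
  }

  // Emit throughput and delay metrics each time a batch completes.
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    putMetricToCloudWatch("BatchCompleted", 1.0)
    putMetricToCloudWatch("BatchRecordsProcessed", info.numRecords.toDouble)
    // The delay fields are Option[Long] in milliseconds; publish only when present.
    info.processingDelay.foreach(delay => putMetricToCloudWatch("ProcessingDelay", delay.toDouble))
    info.schedulingDelay.foreach(delay => putMetricToCloudWatch("SchedulingDelay", delay.toDouble))
    info.totalDelay.foreach(delay => putMetricToCloudWatch("TotalDelay", delay.toDouble))
  }
}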
Set an alert on SchedulingDelay
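Scheduling delay is the time a batch waits before its processing starts, so a value that stays high suggests batches are queuing faster than they are processed. A minimal sketch of the kind of rule such an alert might apply, loosely modeled on an alarm that fires after several consecutive datapoints breach a threshold, is shown below. The function `should_alert` and the 5000 ms threshold and 3-period window are assumptions, not values from the talk.

```python
def should_alert(delays_ms, threshold_ms, periods):
    """True when the last `periods` datapoints all exceed the threshold."""
    recent = delays_ms[-periods:]
    return len(recent) == periods and all(delay > threshold_ms for delay in recent)


# Hypothetical per-batch SchedulingDelay values in milliseconds.
steady = [120, 90, 6000, 150, 110]         # one spike, then recovery
falling_behind = [120, 6000, 7500, 9100]   # delay keeps growing

print(should_alert(steady, threshold_ms=5000, periods=3))
print(should_alert(falling_behind, threshold_ms=5000, periods=3))
```

Requiring several consecutive breaches avoids paging on a single slow batch while still catching a job that is steadily falling behind.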
- 37. Summary
• Fast & stable stream processing is crucial for SmartNews
• lifetime of news is very short
• process events as fast as possible
• Kinesis Stream plays an important role
• one-click provision & scale
• empowers engineers to do trial & error
- 40. See Also
• The Platform Supporting Web Mining at SmartNews
• Integrating Stream Processing and Offline Processing
• Building a Sustainable Data Platform on AWS
• AWS meetup "Apache Spark on EMR"
- 42. PipelineDB
• OSS & enterprise streaming SQL database
• PostgreSQL compatible
• connect to Chartio 😍
• join stream to normal PostgreSQL table
• support probabilistic data structures
• e.g. HyperLogLog
https://www.pipelinedb.com/
http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
- 44. Continuous View
-- Calculate unique users seen per media each day
-- Using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
  SELECT
    day(arrival_timestamp),
    substring(url from '.*://([^/]*)') AS hostname,
    COUNT(DISTINCT user_id::integer)
  FROM activity_stream GROUP BY day, hostname;

-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
  SELECT COUNT(*) FROM imps_stream;

-- What are the 90th, 95th, 99th percentiles of request latency?
-- percentile_cont takes fractions between 0 and 1
CREATE CONTINUOUS VIEW latency AS
  SELECT
    percentile_cont(array[0.90, 0.95, 0.99])
      WITHIN GROUP (ORDER BY latency::integer)
  FROM latency_stream;
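The first view above counts distinct users in constant space with HyperLogLog. A compact sketch of that idea follows, assuming a 64-bit hash taken from SHA-1 and 2^12 registers. It is a textbook-style illustration, not PipelineDB's implementation; the class name and parameters are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using a fixed number of small registers."""

    def __init__(self, p=12):
        self.p = p                 # number of index bits (assumed value)
        self.m = 1 << p            # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        index = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = HyperLogLog()
for _ in range(2):                  # duplicates must not change the estimate
    for user_id in range(10000):
        hll.add(f"user-{user_id}")
print(hll.count())
```

Memory stays at 4096 registers no matter how many users are added, and the estimate is typically within a few percent of the true count, which is the trade-off the view's comment refers to.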
- 45. Dashboard in Chartio
1. Build a query
(drag & drop or SQL)
2. Add steps
(filter, sort, modify)
3. Choose a visualization
(table, graph)