Failsafe Mechanism for Yahoo Homepage
Using Apache Storm & Apache Traffic Server
Pushkar Sachdeva
Kit Chan
Failsafe
“A fail-safe or fail-secure device is one that, in the event of a
specific type of failure, responds in a way that will cause no
harm, or at least a minimum of harm, to other devices or to
personnel”
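Overall Architecture
[Diagram: Browser; AWS side with ELB, EC2 ATS, and S3; Yahoo side with Property ATS, Property Serving Stack, and Crawler on Storm. Annotations: Auto activate Failsafe, Switch traffic to AWS. Legend: Offstage Data Flow; Online Request Flow (Normal Operation); Online Request Flow (Failsafe Mode).]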
AWS Failsafe Stack Architecture
[Diagram: components are Elastic Load Balancer, Security Group, VPC, ATS EC2 instances in Availability Zone #1 and Availability Zone #2, S3 Bucket, and CloudWatch. Crawled data arrives from Yahoo. Regions: US West (Oregon), US East (North Virginia), Ireland, Singapore, with S3 replication across regions.]
EC2 Instance - ATS
● Instance (Amazon Linux)
○ t2.large - burstable
○ 2 vCPUs / 8 GB RAM / 1 Gbps network
● Apache Traffic Server
○ For caching
■ Negative caching enabled
■ Ramdisk used
○ Health Check/S3 Authentication plugin
○ Lua plugin
■ Query Parameters Sorting
■ Simple Device Detection
■ Error handling
● CloudWatch Log Agent/Monitoring Scripts
● Autoscaling based on # of incoming requests
● Deployment Mechanism using Terraform / Packer
[Diagram: Amazon Linux instance running ATS with a 4 GB ramdisk cache, the CloudWatch agent, and the CloudWatch monitoring scripts]
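Lua script example - sorting query parameters
function do_remap()
  local query = ts.client_request.get_uri_args()
  if (query ~= nil and query ~= '') then
    local result = {}
    local i = 1
    -- Split on '&'; the pattern can yield empty matches, so skip them
    for value in query:gmatch '([^&]*)' do
      if (value ~= '') then
        result[i] = value
        i = i + 1
      end
    end
    -- Normalize parameter order and write the query string back
    table.sort(result)
    local sorted_query = table.concat(result, '&')
    ts.client_request.set_uri_args(sorted_query)
  end
end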
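The effect of the Lua remap above can be illustrated outside ATS. The following is a minimal Python sketch, not the plugin itself; `sort_query_params` is a hypothetical name, and the idea that sorting helps requests differing only in parameter order share one cached object is an assumption about the script's purpose.

```python
def sort_query_params(query: str) -> str:
    """Return the query string with its '&'-separated parameters sorted.

    Mirrors the Lua script: empty segments are dropped, the remaining
    'key=value' strings are sorted as plain strings and re-joined.
    """
    parts = [p for p in query.split("&") if p]
    parts.sort()
    return "&".join(parts)


if __name__ == "__main__":
    # Two orderings of the same parameters normalize to the same string.
    print(sort_query_params("q=1&a=2"))
    print(sort_query_params("a=2&&q=1"))
```

Both calls produce the same normalized string, which is what a cache keyed on the full URL would need in order to treat them as one entry.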
CloudWatch Log Agent Conf
# /etc/awslogs/awslogs.conf
# Custom ATS log enabled and in /usr/local/var/log/trafficserver/mon
[monlog]
datetime_format = %Y-%m-%d %H:%M:%S
file = /usr/local/var/log/trafficserver/mon.*
buffer_duration = 5000
log_stream_name = {instance_id}
initial_position = start_of_file
log_group_name = monlog
Perl Script calling CloudWatch Monitoring Lib
+ if ($report_chr) {
+ my $result = `/usr/local/bin/traffic_line -r proxy.node.cache_hit_ratio_avg_10s`;
+ add_metric('CacheHitRatio', 'Percent', 100 * $result);
+ }
+ if ($report_tef) {
+ my $connect_failed = `/usr/local/bin/traffic_line -r proxy.node.http.transaction_frac_avg_10s.errors.connect_failed`;
+ my $aborts = `/usr/local/bin/traffic_line -r proxy.node.http.transaction_frac_avg_10s.errors.aborts`;
+ my $possible_aborts = `/usr/local/bin/traffic_line -r proxy.node.http.transaction_frac_avg_10s.errors.possible_aborts`;
+ my $pre_accept_hangups = `/usr/local/bin/traffic_line -r proxy.node.http.transaction_frac_avg_10s.errors.pre_accept_hangups`;
+ my $early_hangups = `/usr/local/bin/traffic_line -r proxy.node.http.transaction_frac_avg_10s.errors.early_hangups`;
+ my $empty_hangups = `/usr/local/bin/traffic_line -r proxy.node.http.transaction_frac_avg_10s.errors.empty_hangups`;
+ my $other = `/usr/local/bin/traffic_line -r proxy.node.http.transaction_frac_avg_10s.errors.other`;
+ add_metric('TransErrorFraction', 'Percent', 100 * ($connect_failed + $aborts + $possible_aborts + $pre_accept_hangups + $early_hangups + $empty_hangups + $other));
+ }
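The TransErrorFraction metric above is the sum of seven per-category error fractions scaled to a percentage. A minimal Python sketch of that arithmetic follows, assuming each `traffic_line` reading arrives as a numeric string with a trailing newline (as Perl backticks would return it). `error_percent` and the sample values are hypothetical and not part of the monitoring scripts.

```python
def error_percent(readings):
    """Sum fractional error readings (numeric strings) and scale to percent."""
    return 100 * sum(float(r.strip()) for r in readings)


if __name__ == "__main__":
    # Made-up readings for the seven error categories, in the slide's order:
    # connect_failed, aborts, possible_aborts, pre_accept_hangups,
    # early_hangups, empty_hangups, other.
    sample = ["0.01\n", "0.0\n", "0.02\n", "0.0\n", "0.0\n", "0.0\n", "0.0\n"]
    print(round(error_percent(sample), 6))
```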
CloudWatch Dashboard
AWS Autoscaling - Terraform Configuration File
resource "aws_autoscaling_group" "fsfb_base_load" {
  availability_zones        = ["${split(",", var.zones)}"]
  name                      = "${var.env}_fsfb_base_load-${}"
  load_balancers            = ["${}"]
  max_size                  = 8
  min_size                  = 2
  health_check_grace_period = 180
  health_check_type         = "ELB"
  desired_capacity          = 2
  launch_configuration      = "${}"
  force_delete              = true
  wait_for_elb_capacity     = 2
  lifecycle {
    create_before_destroy = true
  }
}
AWS Autoscaling - Terraform Configuration File (Cont’d)
resource "aws_autoscaling_policy" "fsfb_scale_out_med" {
  name                   = "${var.env}_fsfb_scale_out_med"
  scaling_adjustment     = 8
  adjustment_type        = "ExactCapacity"
  cooldown               = 300
  autoscaling_group_name = "${}"
}
AWS Autoscaling - Terraform Configuration File (Cont’d)
resource "aws_cloudwatch_metric_alarm" "fsfb_upper_medium_rps" {
  alarm_name          = "${var.env}_fsfb_upper_medium_rps"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  period              = "60"
  metric_name         = "RequestCount"
  namespace           = "AWS/ELB"
  statistic           = "Sum"
  threshold           = "75000"
  dimensions {
    LoadBalancerName = "${}"
  }
  alarm_description = "This metric monitors medium elb traffic"
  alarm_actions     = ["${aws_autoscaling_policy.fsfb_scale_out_med.arn}", "${var.sns_email_topic}"]
}
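Taken together, the alarm and policy above say: if the ELB RequestCount summed over one 60-second period reaches 75000 (an average of 1250 requests per second), set the group to exactly 8 instances. A minimal Python sketch of that decision follows. `capacity_after_alarm` and `average_rps` are hypothetical helpers, the default values are copied from the configuration, and the 300-second cooldown is not modeled.

```python
def average_rps(request_count_sum: int, period_s: int = 60) -> float:
    """Average requests per second implied by a Sum statistic over one period."""
    return request_count_sum / period_s


def capacity_after_alarm(request_count_sum: int, current: int,
                         threshold: int = 75000, exact_capacity: int = 8,
                         min_size: int = 2, max_size: int = 8) -> int:
    """Return the group size after evaluating one alarm period.

    GreaterThanOrEqualToThreshold triggers the ExactCapacity policy, which
    sets the size to a fixed value (kept within min_size..max_size).
    Otherwise the current size is left unchanged.
    """
    if request_count_sum >= threshold:
        return max(min_size, min(max_size, exact_capacity))
    return current


if __name__ == "__main__":
    print(average_rps(75000))              # threshold expressed as requests per second
    print(capacity_after_alarm(80000, 2))  # alarm fires
    print(capacity_after_alarm(10000, 2))  # alarm does not fire
```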
Escalate Plugin in Apache Traffic Server (ATS)
● ATS is a proxy server that sits between the user and the origin server
● ‘Escalate’ is an ATS plugin that fetches content from failsafe servers when the
origin server fails to provide a ‘good’ response.
[Diagram: User → ATS → Origin Server]
Escalate Plugin in ATS (Continued)
● ‘Escalate’ is a remap plugin - map @pparam=some_label
● Loads global configuration with ‘label’ definitions
● Sample ‘label’ definition -
"some_label" : {
  "enable" : 1,
  "response" : {
    "500" : {
      "mode" : "url",
      "url" : "$h/$d/$p$x"
    }
  }
}
Escalate Plugin in ATS (Continued)
● Runs in ‘READ_RESPONSE_HDR_HOOK’
● Uses ‘TSHttpTxnRedirectUrlSet’ to fetch content from failsafe servers
if (EscalateLabel::ACTION_URL == entry->second.mode) {
  std::string content;
  MyExpander expander(txn, entry->second.url);
  // Expand the label's URL template for this transaction
  if (!expander(entry->second.url, config->get_device_type_header(), config->get_default_device_type())) {
    TSError("[" PLUGIN_TAG "] invalid expansion");
    TSDebug(PLUGIN_TAG, "invalid expansion");
    goto finish;
  }
  expander.swap(content);
  url_str = TSstrdup(content.c_str());
  length = content.size();
  if (url_str) {
    TSHttpTxnRedirectUrlSet(txn, url_str, length); // Transfers ownership
  }
}
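The label lookup and URL expansion described on the Escalate slides can be sketched in Python. This is an illustration under stated assumptions, not the plugin's implementation: the slides do not define what `$h`, `$d`, `$p`, and `$x` stand for, so the values passed below (a host, a device type, a path, and a suffix) are hypothetical, as are the function name `escalation_url` and the host `failsafe.example`.

```python
import json
from string import Template
from typing import Mapping, Optional

# The sample label definition from the slide, wrapped in an object.
LABELS = json.loads("""
{
  "some_label": {
    "enable": 1,
    "response": {
      "500": {"mode": "url", "url": "$h/$d/$p$x"}
    }
  }
}
""")


def escalation_url(label: str, status: int,
                   variables: Mapping[str, str]) -> Optional[str]:
    """Return the expanded failsafe URL for a response status, or None.

    None means the label is unknown or disabled, or no 'url' rule is
    configured for this status, so the origin response is left alone.
    """
    entry = LABELS.get(label)
    if not entry or not entry.get("enable"):
        return None
    rule = entry["response"].get(str(status))
    if rule is None or rule.get("mode") != "url":
        return None
    # string.Template uses the same $name placeholder syntax as the sample.
    return Template(rule["url"]).substitute(variables)


if __name__ == "__main__":
    # Hypothetical variable values; their real meaning is not given in the slides.
    values = {"h": "failsafe.example", "d": "desktop", "p": "index.html", "x": ";a=2"}
    print(escalation_url("some_label", 500, values))
    print(escalation_url("some_label", 200, values))
```

In the plugin, the expanded URL is what gets handed to TSHttpTxnRedirectUrlSet; in this sketch it is simply returned.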
Apache Storm Crawler
● Based on scalable Apache Storm platform
● Topology
● Spouts
● Bolts
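Apache Storm Crawler (Continued)
Simplified Topology
[Diagram: node labels as extracted from the slide are Cron Feeder, Changelog Feeder, IndexUrl Config Fetcher, Url Fetcher, Memory Storage Writer, Response Processor, Response Uploader, Custom Event Queue Updater, and Custom Event Queue Feeder]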
Apache Storm Crawler (Continued)
● Crawls content for desktop, smartphone and tablet
● Supports domain level configuration for request headers, query params and
output storage.
● Failsafe url path mapping example -
Mapping: http://{failsafe_host}/{original_domain}/{device}/{path};{sorted_query_params_as_matrix_params}
S3 file path: plan-201628138.html;a=2;q=1
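The mapping above can be sketched as a small Python function. This is a minimal illustration, assuming the query parameters are sorted as plain `key=value` strings and each is appended as a `;`-prefixed matrix parameter. `failsafe_url`, the host `failsafe.example`, and the sample request URL are hypothetical, and omitting the trailing `;` when there is no query string is a guess.

```python
from urllib.parse import urlsplit


def failsafe_url(failsafe_host: str, original_url: str, device: str) -> str:
    """Build http://{failsafe_host}/{original_domain}/{device}/{path};{params}."""
    parts = urlsplit(original_url)
    # Sort the raw 'key=value' segments and turn them into matrix parameters.
    params = sorted(p for p in parts.query.split("&") if p)
    matrix = "".join(";" + p for p in params)
    path = parts.path.lstrip("/")
    return f"http://{failsafe_host}/{parts.netloc}/{device}/{path}{matrix}"


if __name__ == "__main__":
    print(failsafe_url(
        "failsafe.example",
        "http://www.example.com/news/plan-201628138.html?q=1&a=2",
        "desktop",
    ))
```

The sample call ends in `plan-201628138.html;a=2;q=1`, matching the shape of the S3 file path shown on the slide.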
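High Level Architecture
[Diagram: User, Proxy Router, Proxy Cache, Origin Server, Failsafe Crawler, and AWS storage, with numbered steps along three flows: Offline Crawler Request Flow, User Request Flow, and Optional Request Flow to fetch failsafe content. A PUT step is marked in the diagram.]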
Benefits
● No manual intervention needed to serve failsafe content
● Granular control
● More relevant content is shown to user
● Failsafe content is cached in proxy layer
Pitfalls/Limitations
● Lagging Crawler
● Handling additional Crawler traffic
● Bucket specific experience
● Malformed Page
Future on Resiliency - multi-cloud for failsafe
● Additional Cloud Vendor
○ E.g. Google Cloud Platform
○ S3 vs Google Cloud Storage
○ EC2/ELB vs Google Compute Engine
○ CloudWatch vs Stackdriver
● Changes in Apache Storm Crawler
○ Can use Apache jclouds to create objects in either S3 or Google Cloud Storage
● Changes in deployment using Terraform / configuration using Chef
○ GCP & AWS are supported
● Route 53 can be used to do failover to GCP
Future on Resiliency
● Speculative Retry
void SpeculativeRetryPlugin::handleInputComplete()
{
  orig_url_ = transaction_.getClientRequest().getUrl().getUrlString();
  // fetch original request
  sendFetchRequest(orig_url_, false);
  // start a timer which would give a callback after 'time_' msecs
  Async::execute<AsyncTimer>(this, new AsyncTimer(AsyncTimer::TYPE_ONE_OFF, time_), getMutex());
}
void SpeculativeRetryPlugin::handleAsyncComplete(AsyncTimer &async_timer)
{
  async_timer.cancel();
  // active_fetch_ keeps track of whether we have received the response of the original request yet
  // if not, initiate a retry request
  if (!active_fetch_) {
    sendFetchRequest(orig_url_, true);
  }
}
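The speculative retry idea above (send the original request, start a timer, and send a second request if no response has arrived when the timer fires) can be sketched with Python's asyncio. This is an illustration only: `speculative_fetch` and the `fetch(url, is_retry)` callable are hypothetical, and returning whichever response completes first is an assumption, since the slide does not show how the plugin picks between the two responses.

```python
import asyncio


async def speculative_fetch(fetch, url: str, delay_s: float):
    """Fetch url; if no response arrives within delay_s, race a retry against it.

    fetch is a hypothetical coroutine function taking (url, is_retry).
    """
    original = asyncio.ensure_future(fetch(url, False))
    # Wait for the original response, but only up to the timer interval.
    done, _ = await asyncio.wait({original}, timeout=delay_s)
    if done:
        return original.result()
    # Timer fired with no response yet: start the retry and take the first finisher.
    retry = asyncio.ensure_future(fetch(url, True))
    done, pending = await asyncio.wait({original, retry},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()


async def _demo_fetch(url: str, is_retry: bool) -> str:
    # Simulated origin: the first attempt is slow, the retry is fast.
    await asyncio.sleep(0.01 if is_retry else 0.3)
    return "retry response" if is_retry else "original response"


if __name__ == "__main__":
    print(asyncio.run(speculative_fetch(_demo_fetch, "http://example.com/", 0.05)))
```

The trade-off is extra load on the backend in exchange for lower tail latency, so the timer interval would need tuning against real response-time data.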
Thank you. Questions?

Pushkar Sachdeva and Kit Chan, 05/2016