Autoscaled Distributed Automation using AWS at Selenium London MeetUp
- 2. WHAT DO I GET?
• SeleniumGridScaler = Selenium Grid + AWS + Autoscaling
• Distributed automation (DA) shortens the UI automation run time to a few minutes
• Faster feedback cycle
• A few Jenkins jobs run the automation, instead of a few hundred
• Cost effective and reliable
• Enables Continuous Integration / Continuous Deployment
- 3. AGENDA
• Setting up
• Making the Grid stable
• Grid topologies
• Cost saving
• Reporting / Dashboard
- 6. PROBLEM DESCRIPTION
• Hundreds of Jenkins jobs are needed to run all the tests (monolithic apps)
• The lack of a system that runs hundreds of UI automation tests reliably, quickly, scalably and cost effectively is a blocker for CI / CD
• No intelligent automation report to narrow down failures quickly!
- 7. SOLUTION
• Run all UI automation scenarios within the time taken by the slowest test case
• Cost effective, scalable and reliable
• Teams focussing on automation
• Note: This is not about cross-browser test coverage, but about using the grid for parallel test execution
- 12. SETTING UP: SELENIUM GRID HUB SETUP
• c3.4xlarge (16 CPU / 30 GB RAM / high network bandwidth) for thousands of tests
• c3.large (2 CPU / 3.75 GB RAM / enhanced networking) for a few hundred tests
• The hub needs enough network bandwidth, but low CPU / memory is fine
• AMI that bootstraps the SeleniumGridScaler jar, which acts as a hub that can autoscale
• https://github.com/mhardin/SeleniumGridScaler
- 13. SETTING UP: SELENIUMGRIDSCALER - HUB
• Open source
• Acts as an intelligent hub
• Autoscales grid nodes depending on the number of tests
• Optimized termination of nodes when not in use
• Ad hoc launch of new nodes is also possible
• Talks to AWS through EC2
• Nodes are bootstrapped to attach themselves to the hub
• Supports Windows nodes on AWS as well
- 14. SETTING UP: SELENIUM GRID NODE SETUP
• c3.xlarge
• Can run a maximum of 24 Firefox instances
• Fewer Chrome instances can be run (~15)
• A node created from the AMI has bootstrap code that attaches it to the hub
- 15. SETTING UP: SELENIUMGRIDSCALER - NODE
• To have your own node AMI:
• Either obtain the node AMI, or create an AWS instance, bootstrap it, create an AMI from it and reference it in the hub config
• The hub creates nodes based on a config: AMI ID, subnet, security group, node type, etc.
- 16. SELENIUM NODE BOOTSTRAP CODE
[root@ip-10-2-12-167 ~]# more /home/grid/grid/grid_start_node.sh
#!/bin/sh
PATH=/sbin:/usr/sbin:/bin:/usr/bin
cd /home/grid/grid
# Print an error and abort (the script calls die, so it has to be defined)
die() { echo "$*" >&2; exit 1; }
# Read this instance's ID from the EC2 instance metadata service
EC2_INSTANCE_ID="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)" || die "wget instance-id has failed: $?"
export EC2_INSTANCE_ID
# Pull down the user data, which will be a zip file containing necessary information
export NODE_TEMPLATE="/home/grid/grid/nodeConfigTemplate.json"
curl http://169.254.169.254/latest/user-data -o /home/grid/grid/data.zip
# Now, unzip the data downloaded from the userdata
# (note: this target directory differs from /home/grid/grid used elsewhere in the script)
unzip -o /home/grid/grid/data.zip -d /home/grid/ubuntu/grid
# Replace the instance ID in the node config file
sed "s/<INSTANCE_ID>/$EC2_INSTANCE_ID/g" "$NODE_TEMPLATE" > /home/grid/grid/nodeConfig.json
# Finally, run the java process in a virtual display (Xvfb) so browsers can run
xvfb-run --auto-servernum --server-args='-screen 0, 1600x1200x24' java -jar /home/grid/grid/selenium-server-node.jar -role node -nodeConfig /home/grid/grid/nodeConfig.json -Dwebdriver.chrome.driver="/home/grid/grid/chromedriver" -log /home/grid/grid/grid.log &
- 17. MAKING THE GRID STABLE: TIMEOUTS
• Timeouts in the JSON config
• "timeout": 240000 (ms)
• "browserTimeout": 390000 (ms)
• browserTimeout has to be larger than both 'timeout' and the webDriver timeout
• On the command line, browserTimeout is specified in seconds
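The ordering rule for the timeouts can be checked before a node is started. The following is a minimal sketch, not part of the talk: the helper names `timeout_problems` and `ms_to_cli_seconds` are hypothetical, and the 300000 ms webDriver (client) timeout is an assumed value used only for illustration.

```ruby
require 'json'

# Hypothetical helper: returns a list of violations of the rule
# "browserTimeout > timeout and browserTimeout > webDriver timeout".
# All values are in milliseconds, as in the JSON config on the slide.
def timeout_problems(config, webdriver_timeout_ms)
  timeout = config.fetch('timeout')
  browser_timeout = config.fetch('browserTimeout')
  problems = []
  problems << 'browserTimeout must exceed timeout' unless browser_timeout > timeout
  unless browser_timeout > webdriver_timeout_ms
    problems << 'browserTimeout must exceed the webDriver timeout'
  end
  problems
end

# Hypothetical helper: converts a millisecond value from the JSON config
# into whole seconds for use on the command line (rounded up).
def ms_to_cli_seconds(milliseconds)
  (milliseconds / 1000.0).ceil
end

node_config = JSON.parse('{"timeout": 240000, "browserTimeout": 390000}')
assumed_webdriver_timeout_ms = 300_000 # assumption, not from the slides

puts timeout_problems(node_config, assumed_webdriver_timeout_ms).inspect
puts ms_to_cli_seconds(node_config['browserTimeout'])
```

With the values from the slide the list of problems is empty, and the command-line equivalent of 390000 ms is 390 seconds.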
- 18. MAKING THE GRID STABLE: TIMEOUTS
• If a browser instance hangs (for whatever reason), it takes 3 hours (the HTTP client socket timeout) for that slot to become free
• This causes the Jenkins job to time out
• Solution:
• Fix the particular test scenario causing the issue
• Add a cron job to kill any browser instance that has been running for more than 10 minutes
• Make this part of your Chef knife plugin
• Ref: selenium repo, PR: 227 / fixed in 285
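The cron job that kills long-running browsers could be built around a small filter like the one below. This is a sketch under assumptions, not the script used in the talk: the helper name `stale_browser_pids` and the process names `firefox` and `chrome` are hypothetical choices, and the input is assumed to be the output of `ps -eo pid=,etimes=,comm=` (the `etimes` column, elapsed seconds, is available in the Linux procps `ps`).

```ruby
MAX_AGE_SECONDS = 10 * 60                 # 10 minutes, as on the slide
BROWSER_NAMES = %w[firefox chrome].freeze # assumed process names

# Hypothetical helper: given text in the format "PID ELAPSED_SECONDS COMMAND"
# (one process per line), returns the PIDs of browser processes that have
# been running for longer than max_age seconds.
def stale_browser_pids(ps_output, max_age: MAX_AGE_SECONDS, names: BROWSER_NAMES)
  pids = ps_output.each_line.map do |line|
    pid, elapsed, command = line.split(' ', 3)
    next if command.nil?
    next unless names.any? { |name| command.strip.start_with?(name) }
    pid.to_i if elapsed.to_i > max_age
  end
  pids.compact
end

# Sample input instead of a live process table, so the sketch is safe to run.
sample = <<~PS
  101 700 firefox
  102  30 firefox
  103 900 chrome
  104 5000 java
PS
puts stale_browser_pids(sample).inspect

# In the cron-run script the same helper would be fed the live process table:
#   stale_browser_pids(`ps -eo pid=,etimes=,comm=`).each { |pid| Process.kill('KILL', pid) }
```

The 10-minute threshold should stay above the duration of the slowest legitimate test, otherwise healthy sessions are killed as well.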
- 19. MAKING THE GRID STABLE: AWS - SUBNET
• The whole grid setup should be in the same AWS subnet
• Using multiple subnets results in lots of FORWARDING_TO_NODE_FAILED errors
- 20. MAKING THE GRID STABLE: AWS - IP ADDRESS
• The subnet you are using should have enough free IP addresses
• Otherwise it will be a blocker for autoscaling the grid nodes
- 21. MAKING THE GRID STABLE: AWS - HUB BANDWIDTH
• webDriver object creation consumes bandwidth in the range of 6 Gbits / 5 min on the hub for 250+ tests in parallel
• c3.4xlarge bandwidth is "High"
• c3.large can also be used for smaller apps
- 22. MAKING THE GRID STABLE: AWS - HUB / NODE MEMORY
• Fine tune your JVM options:
• -Xms
• -Xmx
• -DPOOL_MAX
- 23. MAKING THE GRID STABLE: AWS - RESTARTING HUB
• The hub becomes unstable after running thousands of tests
• Automate restarting the hub after every 2000+ tests or at the end of your test job
- 24. MAKING THE GRID STABLE: AWS - JENKINS EXECUTOR CPU
• The Jenkins executor, which runs hundreds of tests in parallel, needs enough CPU power
• c3.8xlarge when running 250+ tests in parallel
- 25. MAKING THE GRID STABLE: HUB QUEUING POLICY
• Don't rely too much on Selenium Grid's queuing policy
• If your average test execution time is greater than the webDriver timeout, tests will time out at webDriver creation itself
- 26. MAKING THE GRID STABLE: UPDATE BROWSERS
• Update the browsers on the node and create a new node AMI
• Necessary browser settings:
require 'selenium-webdriver'
profile = Selenium::WebDriver::Firefox::Profile.new
# Disable Firefox automatic updates
profile['app.update.auto'] = false
profile['app.update.enabled'] = false
profile['app.update.service.enabled'] = false
# Let scripts run for up to 60 seconds before Firefox treats them as unresponsive
profile['dom.max_script_run_time'] = 60
profile['dom.max_chrome_script_run_time'] = 60
profile['focusmanager.testmode'] = true
profile['accept_untrusted_certs'] = true
profile['assume_untrusted_certificate_issuer'] = false
- 28. GRID TOPOLOGIES
• To be cost efficient, decide what you want before selecting the topology!
• I want to release code to production ...
1. Every CL (change list)
2. Once a day
3. Once a week
4. Whenever I want (on demand!)
• Based on the above answers, do I want to run all UI automation
5. Every CL?
6. Every 2 hours?
7. Four times a day?
- 29. GRID TOPOLOGY - 1
• Parallel execution for small projects
• 1 executor - 1 hub - 14 nodes
• e.g. a c3.8xlarge can execute 250+ tests in parallel
• Test run would finish in ~5 mins
[Diagram: executor (c3.8xlarge), hub (c3.large), nodes (c3.xlarge)]
- 30. GRID TOPOLOGY - 2
• Suitable for medium-size projects (500+ tests)
• Adding one more executor (2 executors - 1 hub - 28 nodes) could double the number of tests run in parallel, while still taking only ~5 mins
[Diagram: executors (c3.8xlarge), hub (c3.4xlarge), nodes (c3.xlarge)]
- 31. GRID TOPOLOGY - 3
• Takes twice as long as the previous topology, but at half the cost! (1 executor - 1 hub - 14 nodes)
• Suitable for medium-size projects
• The job runs sequentially
• Test run would finish in ~10 mins
[Diagram: executor (c3.8xlarge), hub (c3.4xlarge), nodes (c3.xlarge)]
- 32. GRID TOPOLOGY
• One more job? Probably NOT, as hub network traffic would make it unstable, especially during webDriver creation
• c3.8xlarge network bandwidth limit is 10 Gbit
[Diagram: executors (c3.8xlarge), hub (c3.4xlarge), nodes (c3.xlarge)]
- 33. GRID TOPOLOGY - 4
• Use two hubs to double the tests (1000+)
• But the speed is the same as topology 2 (~5 mins)
• Double the cost
[Diagram: executor tier (c3.8xlarge), two hubs (c3.4xlarge each), nodes (c3.xlarge)]
- 35. COST SAVING: OPTIMAL USE OF GRID NODES
• Running 250+ tests on a grid setup with 250 slots takes around 5 mins
• Nodes idle for the remaining 55 mins of the hour, which AWS has already billed
• Even during the 5-minute run, only a small minority of the tests take around 4 mins; the majority complete in less than 1 min
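The idle time can be expressed as a utilization figure. This is a rough sketch, assuming per-hour billing as implied by the slide (5 mins of use, 55 mins idle but billed); the helper name `billed_utilization` is hypothetical.

```ruby
# Hypothetical helper: fraction of the billed time that is actually spent
# running tests, assuming usage is billed in whole units of
# billing_unit_minutes (60 here, per the slide's hourly billing).
def billed_utilization(run_minutes, billing_unit_minutes: 60)
  raise ArgumentError, 'run_minutes must be positive' unless run_minutes.positive?

  billed_units = (run_minutes.to_f / billing_unit_minutes).ceil
  run_minutes.to_f / (billed_units * billing_unit_minutes)
end

puts format('%.1f%% of the billed hour is used', billed_utilization(5) * 100)
```

A 5-minute run uses roughly 8% of the hour that has been paid for, which is the motivation for filling the remaining time with further batches of tests.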
- 37. COST SAVING: BATCH PROCESSING
• On a c3.8xlarge, 250 tests can be started in one go before all 32 CPUs reach 100%
• Start 250 cases
• Then, every ~50 seconds or so, start a batch of 100 tests; repeat until all tests have been executed
• Fine tune the delay according to your observations
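The batching rule can be written down as a launch schedule. The following is a minimal sketch using the numbers from the slide (250 tests first, then 100 tests every 50 seconds) as defaults; the helper name `batch_schedule` is hypothetical, and it only computes when each batch starts, not how the tests are dispatched.

```ruby
# Hypothetical helper: returns [start_offset_in_seconds, number_of_tests]
# pairs. The first batch has `initial` tests; every `interval` seconds
# afterwards another batch of up to `batch` tests starts, until all
# tests have been scheduled.
def batch_schedule(total_tests, initial: 250, batch: 100, interval: 50)
  schedule = []
  remaining = total_tests
  offset = 0
  size = initial
  while remaining > 0
    count = [size, remaining].min
    schedule << [offset, count]
    remaining -= count
    offset += interval
    size = batch
  end
  schedule
end

plan = batch_schedule(5000)
puts "#{plan.length} batches, the last one starts after #{plan.last[0]} seconds"
```

The offset of the last batch is only a lower bound on the total run time, since the tests in that batch still have to finish and the delay is meant to be tuned to the observed load on the executor.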
- 38. COST SAVING: GRID TOPOLOGY - BATCH PROCESSING
• Cost-saving topology: 1 executor - 1 hub - 16 nodes
• The job runs sequentially
• Can run any number of tests
• Can run 5000 UI tests within ~1 hr 10 mins
[Diagram: executor (c3.8xlarge), hub (c3.4xlarge), nodes (c3.xlarge)]
- 39. COST SAVING: COMPARING AWS COST VS DATA CENTRE
• 1 medium box (~$8000 per month)
• 1 large box (~$10000 per month)
• 1 VM (~$2000 per month)
• Total AWS cost for 2 batch processing topologies:
• ~$2400 per month (fully autoscaled, runs 9500+ UI tests)
• Frequency: 9-11 times a day
- 40. COST SAVING: AUTOSCALING OF GRID NODES
• SeleniumGridScaler autoscales the grid nodes
• It creates AWS nodes on demand based on a configuration file and the number of tests to run
• Optimized termination of nodes
- 42. COST SAVING: LARGE VS SMALLER NODE TYPES
• c3.xlarge = $0.21 per hour (can run 24 Firefox instances)
• t2.micro = $0.013 per hour
• 16 t2.micro for the price of 1 c3.xlarge = 16 Firefox instances
Conclusion:
• I would prefer c3.xlarge, as it gives more value
• I would not have to use 15 extra IP addresses
But this always depends on what you observe in your own setup!
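The comparison comes down to cost per Firefox slot. A small sketch of the arithmetic, using the prices quoted on the slide and assuming one Firefox instance per t2.micro (which is what "16 t2.micro = 16 Firefox" implies); the `NodeType` struct is a hypothetical name introduced for illustration.

```ruby
# Hypothetical value object: an instance type with its hourly price (USD)
# and the number of Firefox instances it can run in parallel.
NodeType = Struct.new(:name, :hourly_usd, :firefox_slots) do
  # Hourly cost of a single Firefox slot on this instance type.
  def cost_per_slot
    hourly_usd / firefox_slots
  end
end

c3_xlarge = NodeType.new('c3.xlarge', 0.21, 24)
t2_micro  = NodeType.new('t2.micro', 0.013, 1) # assumption: 1 Firefox per t2.micro

# How many t2.micro instances fit into the hourly price of one c3.xlarge.
micros_per_xlarge = (c3_xlarge.hourly_usd / t2_micro.hourly_usd).floor

[c3_xlarge, t2_micro].each do |node|
  puts format('%-10s $%.5f per Firefox slot per hour', node.name, node.cost_per_slot)
end
puts "#{micros_per_xlarge} t2.micro instances for the price of one c3.xlarge"
```

Under these assumptions the c3.xlarge is cheaper per slot and needs one IP address instead of sixteen; the figures change if a smaller instance can reliably host more than one browser.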
- 43. COST SAVING: STOPPING THE HUB
• Shut down the hub when not in use
• Benefit: you pay AWS far less for a stopped instance than for a running one
• Automate these stop and start tasks
- 50. REPORTING / DASHBOARD: INTELLIGENCE REPORTING
Automate the decision whether a failure is a bug or an automation issue
• Use OCR to read failed-screenshot images to get error messages not captured by automation
• Use JavaScript errors in the browser console
• Use logs (Splunk) to get exceptions specific to the test
• Follow good logging practices for automation failures
- 51. FEW WORDS
• A few differences in the Expedia-specific SeleniumGridScaler
• https://github.com/ambirag/SeleniumGridScaler, branch: SeleniumGridScalerExp
• Dockerised!