Building services on AWS in China region
DevOps Engineer at Sproutling (Mattel’s subsidiary)
AWS Solutions Architect
Systems Engineering background
roman@naumenko.ca
@naumenko_roman
Sproutling
IoT company
The product is an innovative baby monitor that learns patterns and notifies parents
when the baby is likely to wake up, whether room conditions are ideal, or when anything unusual happens
We’re hosted in AWS
Golang shop
We had a plan
At the end of December we got an AWS account in China + one in us-west-2
We wanted infra as code, microservices, and immutable EC2
No single point of failure (“destroy availability zone” test)
Auto-managed SSL on the frontends
Docker images as artifacts
It was obvious that complicated things won’t fit into the budget
We got battle-tested tools: terraform, docker
Great tools: packer, serverspec
Fast and versatile: nomad, traefik
Services in US regions
AWS services in CN region
Services in China region...
Are you ready for production in 2 months?
Sources
Short description of the differences for each service in the CN region:
http://docs.amazonaws.cn/en_us/aws/latest/userguide/services.html
Namespaces (naming is hard!):
http://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html
FAQ:
https://www.amazonaws.cn/en/about-aws/china/faqs/
Updates:
https://www.amazonaws.cn/en/new/
Endpoints for cn-north-1:
http://docs.amazonaws.cn/en_us/general/latest/gr/rande.html#cnnorth_region
ELB
Only “old school” type ELB available (no Application Load Balancers)
Possible to have “static” IPs for an ELB (ask aws support how)
No IPv6, no Route53 alias records
SSL certificates are pretty straightforward:
create an IAM cert resource aws_iam_server_certificate
reference its ARN as aws_iam_server_certificate.name.arn
Set Route53 records as CNAMEs (via terraform_remote_state)
What it means: a good frontend HTTP router/proxy is required (haproxy/nginx type)
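A minimal Terraform sketch of the wiring above, using 0.8-era syntax; the resource names, cert file paths, state bucket, and variables are illustrative assumptions, not code from the deck:

# Upload the certificate into IAM (no ACM in cn-north-1)
resource "aws_iam_server_certificate" "frontend" {
  name             = "frontend-cert"
  certificate_body = "${file("certs/frontend.crt")}"
  private_key      = "${file("certs/frontend.key")}"
}

resource "aws_elb" "frontend" {
  name    = "frontend"
  subnets = ["${var.public_subnet_ids}"]

  listener {
    lb_port            = 443
    lb_protocol        = "https"
    instance_port      = 80
    instance_protocol  = "http"
    # classic ELB takes the IAM certificate by ARN
    ssl_certificate_id = "${aws_iam_server_certificate.frontend.arn}"
  }
}

# Route53 lives outside the China partition, so the DNS run reads the
# cn state via terraform_remote_state and points a CNAME at the ELB
data "terraform_remote_state" "cn" {
  backend = "s3"
  config {
    bucket = "tfstate-example"
    key    = "cn-north-1/frontend.tfstate"
    region = "us-west-2"
  }
}

resource "aws_route53_record" "frontend" {
  zone_id = "${var.zone_id}"
  name    = "app.example.cn"
  type    = "CNAME"
  ttl     = 300
  records = ["${data.terraform_remote_state.cn.elb_dns_name}"]
}

The remote-state hop is why networking changes become the 2-step apply mentioned in the notes.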
AMIs
Most “vendor” AMIs (say, OpenVPN) are absent in the CN region
We build our AMIs with Packer+serverspec for tests, same set of configs!
Package updates are very slow (and almost impossible to disable in Ubuntu cloud-init scripts). Use local mirrors: aliyun.com, mirrors.163.com, or make your own in S3
Third-party packages are very, very slow to download (upload ’em all to a regional, publicly accessible S3 bucket)
NTP is not enabled by default (Ubuntu, seriously?)
What it means: set up tooling and resources to build AMIs consistently in different regions
AMI (continued)
One way of dealing with regional specifics:
Amazon and Ubuntu AMIs are updated frequently; you can rely on them
What it means: standardize on one or two Linux distros. Build base AMI.
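One way to keep the base AMI consistent across partitions is to resolve it at plan time; a sketch assuming Terraform’s aws_ami data source and a per-partition owner-ID variable (Canonical publishes under a different account ID in the China partition):

data "aws_ami" "ubuntu" {
  most_recent = true
  # Canonical's owner ID differs per partition, so keep it as a variable
  owners      = ["${var.canonical_owner_id}"]

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-*"]
  }
}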
VPN for VPC
We wanted VPCs connected. Amazingly, no “out of the box” solution exists
So we took AWS guide: “Connecting Multiple VPCs with EC2 Instances over
SSL” and terraformed the hell out of it
https://aws.amazon.com/articles/0639686206802544
Very simple OpenVPN setup that relies on a pre-shared secret key and Route53 names
150 lines of Terraform + a 60-line user-data script: saves thousands of $$ on “enterprise” solutions :)
Can work as HA (multiple “remote” addresses) - omg, the HA! Many more thousands of $$
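A sketch of the ephemeral OpenVPN endpoint that guide describes; the instance name, user-data script path, DNS name, and zone are assumptions for illustration:

resource "aws_instance" "vpn" {
  ami               = "${data.aws_ami.ubuntu.id}"
  instance_type     = "t2.small"
  subnet_id         = "${var.public_subnet_id}"
  # the instance forwards traffic between VPCs, so disable src/dst checks
  source_dest_check = false
  user_data         = "${file("user-data/openvpn.sh")}"
}

resource "aws_route53_record" "vpn" {
  zone_id = "${var.zone_id}"
  # the peer side finds this endpoint by name, not by a hardcoded IP,
  # which is what makes the instance safely replaceable
  name    = "vpn.cn.example.internal"
  type    = "A"
  ttl     = 60
  records = ["${aws_instance.vpn.public_ip}"]
}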
VPN for VPC (continued)
Only a few services need to talk to the other VPC (service discovery, etc.)
So it requires a careful setup of firewall rules (maybe even special subnets for multi-VPC routing)
The basic setup will have very high and unpredictable latency
The final setup will be more complicated: we’re getting a leased line from the cn region to ap-southeast-1
What it means: set up something simple to connect multiple VPCs across regions; it’s always possible to make it more advanced/robust/HA/restricted later
VPN for users
For users we configured Pritunl: https://pritunl.com
OpenVPN would work too; however, Pritunl provided 100% self-service for users, and configuration was minimal
We actually used only a limited set of features (basically SSO)
It can connect VPCs too, but we didn’t have time to experiment with cn
Each region has its own VPN URL for obvious reasons (latency)
What it means: VPN for users should be 100% self-service, otherwise plan for helpdesk duties :)
VPC/networking
NAT gateway is absent in the cn region; we had to set up “old-school” NAT instances
For a VPC with a NAT GW, use nat_gateway_id (sketch below):
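A sketch of that route in Terraform, with illustrative resource names; the same default route points at nat_gateway_id where the managed gateway exists, and at a NAT instance in cn-north-1:

resource "aws_route" "private_default" {
  route_table_id         = "${aws_route_table.private.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${aws_nat_gateway.main.id}"
  # in cn-north-1, swap this for: instance_id = "${aws_instance.nat.id}"
}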
Service discovery
The very first service to provision; everything else relies on consul
Very easy to operate the cluster (~30 min to manually upgrade a 3-node cluster)
Making use of consul agents requires more work
We utilized basic nagios checks (omg!)
Service discovery (system healthchecks)
apt-get -y install nagios-plugins-basic
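Those Nagios plugins slot straight into Consul agent script checks, since Nagios exit codes (0/1/2) map directly to passing/warning/critical. A sketch of one check definition, shown in HCL for readability (era-appropriate Consul agents took the same fields as JSON); the ID, thresholds, and interval are assumptions:

check {
  id       = "disk-root"
  name     = "Disk usage on /"
  # check_disk ships in nagios-plugins-basic, installed above
  script   = "/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /"
  interval = "60s"
}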
Service discovery (service’s healthchecks)
Healthchecks are part of service definition
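In a Nomad task, for example, the check sits inside the service stanza and is registered in Consul automatically; a sketch with assumed names, including the traefik frontend.rule key pair mentioned in the notes:

service {
  name = "api"
  port = "http"
  # one key pair is enough to publish the service on the traefik frontend
  tags = ["traefik.enable=true", "traefik.frontend.rule=Host:api.example.cn"]

  check {
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"
  }
}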
CI/CD between regions
Had to set up our own docker registry (managed services in China didn’t work for us)
Upload speed to cn is okay; problems start when docker image updates exceed 100-300 MB - then pushes take an unpredictable time, from 5 to 30 minutes
The default Travis docker is incompatible with the registry; we had to upgrade it:
script:
- echo "Upgrade docker to 17.03.0-ce or higher, default doesn't work with our priv registry"
- sudo apt-get install docker-engine
“Persistent” services
Services like Kafka, Zookeeper, Cassandra, Elasticsearch, and some others
Traded the speed and simplicity of provisioning (basically terraform + user-data scripts) for the complexity of configuration management
Future work:
More specific AMIs (since vendor-specific ones are absent in cn anyway)
Basic configuration management via consul-template (we want to reload or reconfigure services upon changes in the KV store); see the sketch below
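A sketch of that consul-template direction, assuming an illustrative Kafka config rendered from Consul KV and a restart command; none of the paths are from the deck:

template {
  source      = "/etc/consul-template/server.properties.ctmpl"
  destination = "/etc/kafka/server.properties"
  # re-rendered whenever the watched KV keys change, then the service is bounced
  command     = "systemctl restart kafka"
}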
2.5 months later: what worked well
Terraform + cloudformation for rolling updates (see the sketch after this list)
Nomad for scheduling services + consul + traefik for http routing
EC2 services: ELB, ASG. No disk encryption :(
S3 (although features are much more limited)
IAM, with limitations - no MFA for example
ECR registry
RDS (postgres)
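“Terraform + cloudformation for rolling updates” refers to the known workaround of wrapping the ASG in a CloudFormation stack managed by Terraform, so that UpdatePolicy/AutoScalingRollingUpdate handles AMI rollouts Terraform itself couldn’t do. A sketch with assumed names and sizes:

resource "aws_cloudformation_stack" "api_asg" {
  name = "api-asg"

  template_body = <<EOF
{
  "Resources": {
    "ASG": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "LaunchConfigurationName": "${aws_launch_configuration.api.name}",
        "MinSize": "2",
        "MaxSize": "4",
        "VPCZoneIdentifier": ["${var.private_subnet_id}"]
      },
      "UpdatePolicy": {
        "AutoScalingRollingUpdate": {
          "MinInstancesInService": "1",
          "MaxBatchSize": "1",
          "PauseTime": "PT5M"
        }
      }
    }
  }
}
EOF
}

Changing the launch configuration (e.g. a new AMI) makes CloudFormation replace instances one at a time instead of all at once.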
Some challenges
Challenges (continued)
Latency, GFW (The Great Firewall of China)
Lack of mirrored packages and distros on the other side
Steep learning curve for the tools and services
Docker registry
“Cheburashka” servers and manually created resources
Not cloud-friendly apps, aka “zookeepers”
Patterns
Try the most basic configuration that provides the required service (AWS Certificate Manager, then IAM certificates, then ACME’s Let’s Encrypt)
Use tools resourcefully, for example:
Zookeeper-dependent services can use zookeeper.service.consul
All services are monitored by a service running in the scheduler + Route53 checks
The docker registry is installed from a tar image downloaded from S3 (pulling it any other way is too slow)
Infra code “on reflexes”
I just got a call from PMO
They want all microservices configured and deployed manually?!
Microservices
Service discovery in consul - everything is registered there. Even Zookeeper!
Jobs that define services are run by nomad
Job configs (JSON-ish) are generated in travis from templates
Travis preserves history
Deployed on every merge to master
Microservices (continued)
Example of a nomad job: Docker images are tagged by commit SHA, and you can run the job manually at any time (a sketch follows below).
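A sketch of such a job, with an assumed registry URL, service name, and resources; @COMMIT_SHA@ stands in for the placeholder Travis substitutes from the templates before submitting the job:

job "api" {
  datacenters = ["cn-north-1"]
  type        = "service"

  group "api" {
    count = 2

    task "api" {
      driver = "docker"

      config {
        # image tag is the commit SHA, filled in by Travis before "nomad run"
        image = "registry.example.cn/api:@COMMIT_SHA@"
        port_map {
          http = 8080
        }
      }

      service {
        name = "api"
        port = "http"

        check {
          type     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 200
        memory = 256
        network {
          mbits = 10
          port "http" {}
        }
      }
    }
  }
}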
Patterns
Always verify assumptions about China with AWS reps/architects
China’s SNS service has neither SMS nor push notifications
S3 doesn’t have regional replication
Latency and packet loss will be worse than you hope for
Get somebody “on the ground” in China - you’ll need them (for everything from getting a local phone number to registering with various hosted services)
Cheburashka server is waiting for maintenance
Not in the account managed by terraform
Questions?
roman@naumenko.ca
@naumenko_roman

Editor's Notes

  1. Hey, I'm a reasonable guy. But I've just experienced some very unreasonable things. And I would like to share them with you. Because it was pretty exciting to build something on the other side of the world, where neither Google nor Facebook can get anything built. And then get it interconnected with this side of the world and deliver some value for the business. A few words about who I am
  2. What is Sproutling, and what does the company develop? I joined 5 months ago.
  3. When I joined the company, I got an AWS account with leftovers from a “datacenter” scheduler. Boy, that scheduler was rooted in every AWS service it could find! It took some effort to clean up. Meanwhile we were getting the main account in China (mention ICP Recordal and the process of opening an account)
  4. It was pretty obvious things had to be simple, automated, and controlled end-to-end for the China region. We got some awesome tools and started to cook stuff: Terraform & docker to set up basic things in AWS and get builds going
  5. Packer, serverspec to get server images prepared and tested
  6. Consul for service discovery, Nomad for scheduling containers, Traefik for load balancing
  7. So it was easy to get everything going, although we had to pivot many times. We had to get an MVP with the simplest solution that would work in the region. It seemed like there shouldn’t be a lot of problems deploying services on AWS
  8. 75 or so services, with many, many features available in AWS. So we felt good about China - AWS has our back, right?
  9. Well, it turns out...
  10. Not only are there many more services in US regions, they are also all interconnected. The CN region is a “special” one: IAM won’t work, there is a separate AWS console, everything is separated. Treat it as another cloud provider. Meanwhile, the PMO from the mothership descended on our office and gave us a deadline. So what does Jack say about getting ready for production in 2 months?
  11. Useful links to bookmark before you start building
  12. Let’s start with the Internet-facing services: load balancers. They worked pretty well
  13. App load balancers are a big deal for ECS, but they’re not available either. Static IPs are useful for ICP Recordal updates. SSL certs worked great, though not as convenient as AWS Certificate Manager (with auto-renewal and all the nice features). Making records in Route53 is tricky since it’s another (global) region, so we simply import state from cn and make the necessary updates. The downside is the 2-step process when networking changes. It’s possible to set up providers and permissions in terraform, but in practice it didn’t work in 0.8.x versions. Now AMIs
  14. The absence of AMIs is generally a problem: they are impossible to copy across, and building servers is slow if no local repos are used. How to deal with it in terraform?
  15. A bunch of checks to make Packer runs fast, plus region-specific settings. Now let’s talk about VPN for VPC (connectivity for servers)
  16. OpenVPN instances are completely ephemeral. Setup in the cn region requires a 2-step apply to update Route53 records. OpenVPN will retry indefinitely until it connects to the other side. VPN for VPC continues: future work
  17. There are other solutions to set up a VPN. Connecting private DCs will be even more complicated. There is no out-of-the-box VPN. What about VPN for users?
  18. Standard VPC/networking setup for AWS. Just create a terraform module, provision, and forget. It won’t change much
  19. Nothing special here except tweaking terraform modules. These setups are battle-tested and easy to set up. Next: service discovery.
  20. It turns out this works pretty great with consul (of course, we discovered we needed checks after hell broke loose). Defining a check takes literally a few lines of code. When did you last use nagios?
  21. System health checks are essentially configuration-free. Every service that runs in nomad has its own healthchecks registered in consul. Next: healthchecks for services
  22. Also notice how easy it is to publish a service on the frontend http router. Remember Apache configs? Nginx settings? All that is now just one key pair (frontend.rule). Next: CI/CD
  23. This part is still in the works because configuring repositories is not trivial. Lots of boilerplate code and silly template edits with sed. Next: persistent services and challenges
  24. This was by far the most challenging part. Talk about contemporary cloud-readiness for apps. Given this, what patterns should we apply? Next: what worked well
  25. But there were some challenges as well. Next: pic about challenges
  26. Challenges were of 3 types: access (distros and AMIs, slow and unreliable network), a steep learning curve, and provisioning and support of hosted services (zookeepers). Next: challenges continued
  27. When you usually work in AWS, you don’t have to think about choices. Even though the names are sometimes obscure, the choice of building blocks is great
  28. So everything was getting assembled, tested, and kinda ready - and we got a call... Next: pic of the call from PMO
  29. Well, it wasn’t exactly a call - rather a bunch of meetings, but I indeed heard “configured and deployed manually”. Next: how we set up microservices
  30. Next: example of nomad job
  31. Next: patterns and best practices
  32. Next: cheburashka servers
  33. Don’t create little cheburashka servers - even if it’s very tempting leverage good tools and aws best practices Next: end and questions