If you're going to build services in AWS China, learn from our experience.
Slides from meetup:
https://www.meetup.com/SF-DevOps-for-Startups/events/238642366/
Building services on AWS in China region
2. DevOps Engineer at Sproutling (Mattel’s subsidiary)
AWS Solutions Architect
Systems Engineering background
roman@naumenko.ca
@naumenko_roman
3. Sproutling
IoT company
The product is an innovative baby monitor that learns patterns and notifies parents
when the baby is likely to wake up, about ideal room conditions, or anything unusual
We’re hosted in AWS
Golang shop
4. We had a plan
At the end of December we got an AWS account in China + one in us-west-2
We wanted infra as code and microservices, immutable EC2
No single point of failure (“destroy availability zone” test)
Auto-managed SSL on the frontends
Docker images as artifacts
It was obvious that complicated things won’t fit into the budget
12. Sources
Short description of the differences for each service in the CN region:
http://docs.amazonaws.cn/en_us/aws/latest/userguide/services.html
Namespaces (naming is hard!):
http://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html
FAQ:
https://www.amazonaws.cn/en/about-aws/china/faqs/
Updates:
https://www.amazonaws.cn/en/new/
Endpoints for cn-north-1:
http://docs.amazonaws.cn/en_us/general/latest/gr/rande.html#cnnorth_region
13. ELB
Only “old school” type ELB available (no Application Load Balancers)
Possible to have “static” IPs for an ELB (ask aws support how)
No IPv6, no Route53 alias records
SSL certificates are pretty straightforward:
create an IAM cert resource aws_iam_server_certificate
reference its ARN as aws_iam_server_certificate.name.arn
Set Route53 records as CNAMEs (via terraform_remote_state)
What it means: a good frontend HTTP router/proxy is required (haproxy/nginx type)
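The steps above can be sketched in Terraform (modern syntax; resource names, file paths, and the subnet variable are hypothetical — ACM is not available in cn-north-1, so the certificate is uploaded as an IAM server certificate and attached to a classic ELB listener):

```hcl
# Upload the certificate as an IAM server certificate, since ACM
# is not available in the CN region.
resource "aws_iam_server_certificate" "frontend" {
  name             = "frontend-cert"
  certificate_body = file("certs/frontend.pem")
  private_key      = file("certs/frontend.key")
}

# Classic ELB (no ALB in cn-north-1) terminating SSL with that cert.
resource "aws_elb" "frontend" {
  name    = "frontend"
  subnets = var.public_subnet_ids

  listener {
    lb_port            = 443
    lb_protocol        = "https"
    instance_port      = 80
    instance_protocol  = "http"
    ssl_certificate_id = aws_iam_server_certificate.frontend.arn
  }
}
```

The Route53 CNAME then points at the ELB's DNS name from a separate run against the global partition.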
14. AMIs
Most vendor AMIs (say, OpenVPN) are absent in the CN region
We build our AMIs with Packer+serverspec for tests, same set of configs!
Very slow package updates (almost impossible to disable in Ubuntu cloud-init
scripts). Use local mirrors: mirrors.aliyun.com, mirrors.163.com, or make your own in S3
Third-party packages are very, very slow to download (upload them all to a
regional, publicly accessible S3 bucket)
NTP is not enabled by default (Ubuntu, seriously?)
What it means: set up tooling and resources to build AMIs consistently across
regions
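A sketch of the per-region mirror tweak, written in current Packer HCL2 syntax for readability (the talk predates HCL2 templates; variables and the AMI name are hypothetical):

```hcl
# Build the same base AMI in every region, pointing apt at a local
# mirror when building in cn-north-1 so package installs aren't slow.
source "amazon-ebs" "base" {
  region        = var.region
  source_ami    = var.ubuntu_ami # Ubuntu cloud image for that region
  instance_type = "t2.micro"
  ssh_username  = "ubuntu"
  ami_name      = "base-{{timestamp}}"
}

build {
  sources = ["source.amazon-ebs.base"]

  provisioner "shell" {
    inline = [
      # In cn-north-1, swap the default Ubuntu archive for a regional mirror.
      "sudo sed -i 's|archive.ubuntu.com|mirrors.aliyun.com|g' /etc/apt/sources.list",
      "sudo apt-get update",
    ]
  }
}
```

The same template plus serverspec tests keeps the AMIs identical across regions except for the mirror settings.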
15. AMI (continued)
One way of dealing with specifics
Amazon and Ubuntu AMIs are updated frequently, you can rely on them
What it means: standardize on one or two Linux distros. Build base AMI.
16. VPN for VPC
We wanted VPCs connected. Amazingly, no “out of the box” solution exists
So we took AWS guide: “Connecting Multiple VPCs with EC2 Instances over
SSL” and terraformed the hell out of it
https://aws.amazon.com/articles/0639686206802544
Very simple OpenVPN setup that relies on pre-shared secret key and Route53
names
150 lines of Terraform + a 60-line user-data script: saves thousands of $$ on
“enterprise” solutions :)
Can work as HA (multiple “remote” addresses) - omg, the HA! Many more
thousands of $$
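A minimal Terraform sketch of one side of that link, in modern syntax (resource names, the user-data template, and variables are hypothetical; the real setup followed the AWS article above):

```hcl
# Ephemeral OpenVPN instance: configured entirely from user-data with a
# pre-shared static key, found by the other side via its Route53 name.
resource "aws_instance" "vpn" {
  ami               = var.vpn_ami
  instance_type     = "t2.small"
  subnet_id         = var.public_subnet_id
  source_dest_check = false # the instance forwards traffic for the whole VPC

  user_data = templatefile("openvpn.sh.tpl", {
    remote_host = "vpn.other-region.example.internal" # Route53 name of the peer
    static_key  = var.openvpn_static_key              # pre-shared secret
  })
}

# Send traffic destined for the peer VPC through the VPN instance.
resource "aws_route" "to_peer_vpc" {
  route_table_id         = var.private_route_table_id
  destination_cidr_block = var.peer_vpc_cidr
  instance_id            = aws_instance.vpn.id
}
```

Because OpenVPN retries indefinitely and the instance is stateless, either side can be destroyed and recreated without manual intervention.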
17. VPN for VPC (continued)
Only a few services need to talk to another VPC (service discovery, etc.)
So it requires careful setup of fw rules (maybe even special subnets for multi-
VPC routing)
Basic setup will have a very high and unpredictable latency
The final setup will be more complicated: we’re getting a leased line from cn
region to ap-southeast-1
What it means: set up something simple to connect multiple VPCs across regions; it’s
always possible to make it more advanced/robust/HA/restricted later
18. VPN for users
For users we configured Pritunl: https://pritunl.com
OpenVPN would work too, however Pritunl provided 100% self-service for
users, and configuration was minimal
We actually used only a limited set of features (basically SSO)
It can connect VPCs too, but we didn’t have time to experiment with cn.
Each region has own VPN URL for obvious reasons (latency)
What it means: VPN for users should be 100% self-service, otherwise count on
helpdesk duties :)
19. VPC/networking
NAT gateway is absent in the cn region, had to set up “old-school” NAT instances
For VPN with NAT GW, use nat_gateway_id:
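The snippet the slide refers to would look roughly like this (resource names assumed; in cn-north-1 the equivalent route points at a NAT instance instead):

```hcl
# Route a private subnet's outbound traffic through the NAT gateway.
# In regions without NAT GW (cn-north-1 at the time), replace
# nat_gateway_id with the NAT instance's instance_id.
resource "aws_route" "private_out" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.main.id
}
```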
20. Service discovery
The very first service to provision; everything else relies on Consul
Very easy to operate the cluster (~30 min to manually upgrade a 3-node cluster)
Making use of consul agents requires more work
We reused basic Nagios checks (omg!)
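A Consul agent service definition really is only a few lines; a hypothetical example in Consul's HCL form (service name, port, and endpoint are placeholders):

```hcl
# Registers the service in Consul and attaches an HTTP health check;
# the agent runs the check locally and gossips the result to the cluster.
service {
  name = "api"
  port = 8080

  check {
    http     = "http://localhost:8080/health"
    interval = "10s"
    timeout  = "2s"
  }
}
```

Once registered, the service is resolvable as `api.service.consul` and unhealthy instances drop out of DNS automatically.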
23. CI/CD between regions
Had to set up our own Docker registry (managed services in China didn’t work for us)
Upload speed to cn is okay (problems start when a Docker image update is
>100-300 MB) - you’ll wait an unpredictable time, from 5 to 30 minutes
Default Travis Docker is incompatible with the registry, had to update it
script:
- echo "Upgrade docker to 17.03.0-ce or higher, default doesn't work with our priv registry"
- sudo apt-get update
- sudo apt-get install -y docker-engine
24. “Persistent” services
Services like Kafka, Zookeeper, Cassandra, Elasticsearch, and some others
Traded speed and simplicity of provisioning (basically terraform + user-data
scripts) for complexity of configuration management
Future work:
More specific AMIs (since vendor-specific ones are absent in cn anyway)
Basic configuration management via consul-template (want to reload or
reconfigure service upon changes in KV)
25. 2.5 months later: what worked well
Terraform + cloudformation for rolling updates
Nomad for scheduling services + consul + traefik for http routing
EC2 services: ELB, ASG. No disk encryption :(
S3 (although features are much more limited)
IAM, with limitations - no MFA for example
ECR registry
RDS (postgres)
27. Challenges (continued)
Latency, GFW (The Great Firewall of China)
Lack of mirrored packages and distros on the other side
Steep learning curve for the tools and services
Docker registry
“Cheburashka” servers and manually created resources
Non-cloud-friendly apps, aka “zookeepers”
28. Patterns
Try the basic configuration that provides the required service (AWS Certificate Manager,
IAM certificates, and then ACME’s Let’s Encrypt)
Use tools resourcefully, for example:
Zookeeper-dependent services can use zookeeper.service.consul
All services are monitored by a service running in the scheduler + Route53 checks
Docker registry is installed with tar image downloaded from S3 (otherwise too slow)
Infra code “on reflexes”
29. I just got a call from PMO
They want all
microservices configured
and deployed manually?!
30. Microservices
Service discovery in consul - everything is registered there. Even Zookeeper!
Jobs that define services are run by Nomad
Job configs (JSON-ish) are created in Travis from templates
Travis preserves history
Deployed on every merge to master
32. Patterns
Always verify assumptions about China with AWS reps/architects
China’s SNS service has neither SMS nor push notifications
S3 doesn’t have cross-region replication
Latency and packet loss will be worse than you hope for
Get somebody “on the ground” in China, you’ll need them (from having a local
phone number to registering at different hosted services)
33. Cheburashka server is waiting for maintenance
Not in the account managed by terraform
Hey, I'm a reasonable guy.
But I've just experienced some very unreasonable things.
And I would like to share them with you. Because it was pretty exciting to build something on the other side of the world. Where neither Google nor Facebook can get anything built.
And then get it inter-connected with this side of the world and deliver some value for business
A few words about who I am
What is Sproutling and what does the company develop?
I’ve joined 5 months ago.
When I joined the company, I got an AWS account with leftovers from a “datacenter” scheduler
Boy that scheduler was rooted in every AWS service it could find! Took some effort to clean it up.
And meanwhile we were getting a main account in China (mention ICP Recordal and process of opening account)
It was pretty obvious things had to be simple, automated, and controlled end-to-end for the china region
We got some awesome tools and started to cook stuff
Terraform & Docker to set up basic things in AWS and get builds going
Packer, serverspec to get server images prepared and tested
Consul for services discovery
Nomad for scheduling containers
Traefik for load-balancing
So it was easy to get everything going, although we had to pivot many times
We had to get MVP with the most simple solution that would work in the region
It seemed like there shouldn’t be a lot of problems to deploy services on AWS
75 or so services with many, many features available in AWS
So we felt good about China - AWS got our back, right?
Well, it turns out...
Not only are there many more services in US regions, but they are all interconnected.
The CN region is a “special” one. IAM won’t work. Special AWS console. Everything is separated. Treat it as another cloud provider. Meanwhile, the PMO from the mothership descended on our office and gave us a deadline
So what does Jack say about getting ready for production in 2 months?
Useful links to bookmark before start building
Lets start with the Internet facing services: load-balancers
They worked pretty well
App load balancers are a big deal for ECS, but they’re not available either
Static IPs are useful for ICP Recordal updates
SSL certs worked great, but not as convenient as AWS Certificate Manager (with auto-renewal and all the nice features)
Making records in Route53 is tricky since it lives in another (global) region.
So we simply import state from cn and make the necessary updates. The downside is that it’s a 2-step process when networking changes. It’s possible to set up providers and permissions in Terraform, but in practice it didn’t work in 0.8.x versions.
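That import step is a `terraform_remote_state` data source; a hypothetical sketch in modern syntax (bucket, key, zone, and output names are placeholders):

```hcl
# Read the cn-north-1 stack's state, then point a Route53 CNAME in the
# global partition at the ELB created there.
data "terraform_remote_state" "cn" {
  backend = "s3"
  config = {
    bucket = "tfstate-cn"
    key    = "frontend.tfstate"
    region = "us-west-2" # state kept outside cn so the global run can read it
  }
}

resource "aws_route53_record" "frontend" {
  zone_id = var.zone_id
  name    = "api.example.com"
  type    = "CNAME"
  ttl     = 300
  records = [data.terraform_remote_state.cn.outputs.elb_dns_name]
}
```

Hence the 2-step apply: the cn stack first, to refresh its outputs, then the global Route53 stack.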
Now AMIs
The absence of an AMI is generally a problem
Impossible to copy them in from other regions
Building servers is slow if no local repos are used
How to deal with it in terraform?
A bunch of checks to make Packer runs fast, plus region-specific settings
Now lets talk about VPN for VPC (connectivity for servers)
OpenVPN instances are completely ephemeral
Setup in the cn region requires a 2-step apply to update Route53 records
OpenVPN will retry indefinitely until it connects to the other side
VPC continues: future work
There are other solutions to setup VPN
Connecting private DCs will be even more complicated
There is no VPN
What about VPN for users?
Standard VPC/networking setup for AWS
Just create a terraform module, provision and forget
Won’t change much
Nothing special here except for tweaking Terraform modules. These setups are battle-tested and easy to set up.
Next: service discovery.
Turns out it works pretty great with Consul (of course we discovered we needed checks after hell broke loose)
Defining a check takes literally a few lines of code.
When did you use nagios last time?
System health checks are essentially configuration-free
Every service that runs in Nomad has its own health checks registered in Consul
Next: healthchecks for services
Also notice how easy it is to publish a service on the frontend HTTP router
Remember Apache configs? Nginx settings?
All of that is just one key/value pair (frontend.rule)
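A hypothetical Nomad job sketch showing that key/value pair: the single Consul tag is all Traefik needs to start routing HTTP traffic (job name, image, and hostname are placeholders; syntax matches Nomad of that era):

```hcl
job "api" {
  datacenters = ["cn-north-1"]

  group "api" {
    task "api" {
      driver = "docker"

      config {
        image = "registry.example.com/api:latest"
      }

      resources {
        network {
          port "http" {}
        }
      }

      # Registering in Consul with one Traefik tag replaces pages of
      # Apache/Nginx vhost configuration.
      service {
        name = "api"
        port = "http"
        tags = ["traefik.frontend.rule=Host:api.example.com"]
      }
    }
  }
}
```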
Next: CI/CD
This part is still in the works because configuring repositories is not trivial.
Lots of boilerplate code and silly template edits with sed.
Next: persistent services and challenges
This was by far the most challenging part.
Talk about contemporary cloud-readiness for apps
Given this, what patterns to apply?
Next: what worked well
But there were some challenges as well
Next: pic about challenges
Challenges were of 3 types:
Access - distros and AMIs, slow and unreliable network
Steep learning curve
Provisioning and support of hosted services (zookeepers)
Next: challenges continues
When you usually work in AWS, you don’t have to think about choices
Even though the names are obscure sometimes, the choice of building blocks is great
So everything was getting assembled, tested and kinda ready and we got a call...
Next: pic call from PMO
Well, it wasn’t exactly a call - rather a bunch of meetings, but I indeed heard “configured and deployed manually”
Next: How we set up microservices
Next: example of nomad job
Next: patterns and best practices
Next: cheburashka servers
Don’t create little cheburashka servers - even if it’s very tempting
Leverage good tools and AWS best practices
Next: end and questions