Building services on AWS in China region
DevOps Engineer at Sproutling (Mattel’s subsidiary)
AWS Solutions Architect
Systems Engineering background
roman@naumenko.ca
@naumenko_roman
Sproutling
IoT company
The product is an innovative baby monitor that learns patterns and notifies parents
when the baby is likely to wake up, whether room conditions are ideal, or when anything unusual happens
We’re hosted in AWS
Golang shop
We had a plan
At the end of December we got an AWS account in China + one in us-west-2
We wanted infra as code, microservices, and immutable EC2
No single point of failure (“destroy availability zone” test)
Auto-managed SSL on the frontends
Docker images as artifacts
It was obvious that complicated things won’t fit into the budget
We got battle-tested tools: terraform, docker
Great tools: packer, serverspec
Fast and versatile: nomad, traefik
Services in US regions
AWS services in CN region
Services in China region...
Are you ready for production in 2 months?
Sources
Short description of the differences for each service in the CN region:
http://docs.amazonaws.cn/en_us/aws/latest/userguide/services.html
Namespaces (naming is hard!):
http://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html
FAQ:
https://www.amazonaws.cn/en/about-aws/china/faqs/
Updates:
https://www.amazonaws.cn/en/new/
Endpoints for cn-north-1:
http://docs.amazonaws.cn/en_us/general/latest/gr/rande.html#cnnorth_region
ELB
Only “old school” type ELB available (no Application Load Balancers)
Possible to have “static” IPs for an ELB (ask aws support how)
No IPv6, no Route53 alias records
SSL certificates are pretty straightforward:
create an IAM cert resource aws_iam_server_certificate
reference its ARN as aws_iam_server_certificate.name.arn
Set Route53 records as CNAMEs (via terraform_remote_state)
What it means: a good frontend HTTP router/proxy is required (haproxy/nginx type)
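A minimal Terraform sketch of the wiring above, using 0.8-era syntax; the resource names, cert file paths, state bucket, and variables are illustrative assumptions, not code from the deck:

# Upload the certificate into IAM (no ACM in cn-north-1)
resource "aws_iam_server_certificate" "frontend" {
  name             = "frontend-cert"
  certificate_body = "${file("certs/frontend.crt")}"
  private_key      = "${file("certs/frontend.key")}"
}

resource "aws_elb" "frontend" {
  name    = "frontend"
  subnets = ["${var.public_subnet_ids}"]

  listener {
    lb_port            = 443
    lb_protocol        = "https"
    instance_port      = 80
    instance_protocol  = "http"
    # classic ELB takes the IAM certificate by ARN
    ssl_certificate_id = "${aws_iam_server_certificate.frontend.arn}"
  }
}

# Route53 lives outside the China partition, so the DNS run reads the
# cn state via terraform_remote_state and points a CNAME at the ELB
data "terraform_remote_state" "cn" {
  backend = "s3"
  config {
    bucket = "tfstate-example"
    key    = "cn-north-1/frontend.tfstate"
    region = "us-west-2"
  }
}

resource "aws_route53_record" "frontend" {
  zone_id = "${var.zone_id}"
  name    = "app.example.cn"
  type    = "CNAME"
  ttl     = 300
  records = ["${data.terraform_remote_state.cn.elb_dns_name}"]
}

The remote-state hop is why networking changes become the 2-step apply mentioned in the notes.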
AMIs
Most “vendor” AMIs (say, OpenVPN) are absent in the CN region
We build our AMIs with Packer+serverspec for tests, same set of configs!
Package updates are very slow (and almost impossible to disable in Ubuntu cloud-init scripts). Use local mirrors: aliyun.com, mirrors.163.com, or make your own in S3
Third-party packages are very, very slow to download (upload ’em all to a regional, publicly accessible S3 bucket)
NTP is not enabled by default (Ubuntu, seriously?)
What it means: set up tooling and resources to build AMIs consistently in different regions
AMI (continued)
One way of dealing with regional specifics:
Amazon and Ubuntu AMIs are updated frequently; you can rely on them
What it means: standardize on one or two Linux distros. Build base AMI.
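One way to keep the base AMI consistent across partitions is to resolve it at plan time; a sketch assuming Terraform’s aws_ami data source and a per-partition owner-ID variable (Canonical publishes under a different account ID in the China partition):

data "aws_ami" "ubuntu" {
  most_recent = true
  # Canonical's owner ID differs per partition, so keep it as a variable
  owners      = ["${var.canonical_owner_id}"]

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-*"]
  }
}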
VPN for VPC
We wanted VPCs connected. Amazingly, no “out of the box” solution exists
So we took AWS guide: “Connecting Multiple VPCs with EC2 Instances over
SSL” and terraformed the hell out of it
https://aws.amazon.com/articles/0639686206802544
Very simple OpenVPN setup that relies on a pre-shared secret key and Route53 names
150 lines of Terraform + a 60-line user-data script: saves thousands of $$ on “enterprise” solutions :)
Can work as HA (multiple “remote” addresses) - omg, the HA! Many more thousands of $$
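A sketch of the ephemeral OpenVPN endpoint that guide describes; the instance name, user-data script path, DNS name, and zone are assumptions for illustration:

resource "aws_instance" "vpn" {
  ami               = "${data.aws_ami.ubuntu.id}"
  instance_type     = "t2.small"
  subnet_id         = "${var.public_subnet_id}"
  # the instance forwards traffic between VPCs, so disable src/dst checks
  source_dest_check = false
  user_data         = "${file("user-data/openvpn.sh")}"
}

resource "aws_route53_record" "vpn" {
  zone_id = "${var.zone_id}"
  # the peer side finds this endpoint by name, not by a hardcoded IP,
  # which is what makes the instance safely replaceable
  name    = "vpn.cn.example.internal"
  type    = "A"
  ttl     = 60
  records = ["${aws_instance.vpn.public_ip}"]
}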
VPN for VPC (continued)
Only a few services need to talk to the other VPC (service discovery, etc.)
So it requires a careful setup of firewall rules (maybe even special subnets for multi-VPC routing)
The basic setup will have very high and unpredictable latency
The final setup will be more complicated: we’re getting a leased line from the cn region to ap-southeast-1
What it means: set up something simple to connect multiple VPCs across regions; it’s always possible to make it more advanced/robust/HA/restricted later
VPN for users
For users we configured Pritunl: https://pritunl.com
OpenVPN would work too; however, Pritunl provided 100% self-service for users, and configuration was minimal
We actually used only a limited set of features (basically SSO)
It can connect VPCs too, but we didn’t have time to experiment with cn
Each region has its own VPN URL for obvious reasons (latency)
What it means: VPN for users should be 100% self-service, otherwise plan for helpdesk duties :)
VPC/networking
NAT gateway is absent in the cn region; we had to set up “old-school” NAT instances
For a VPC with a NAT GW, use nat_gateway_id (sketch below):
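A sketch of that route in Terraform, with illustrative resource names; the same default route points at nat_gateway_id where the managed gateway exists, and at a NAT instance in cn-north-1:

resource "aws_route" "private_default" {
  route_table_id         = "${aws_route_table.private.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${aws_nat_gateway.main.id}"
  # in cn-north-1, swap this for: instance_id = "${aws_instance.nat.id}"
}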
Service discovery
The very first service to provision; everything else relies on consul
Very easy to operate the cluster (~30 min to manually upgrade a 3-node cluster)
Making use of consul agents requires more work
We utilized basic nagios checks (omg!)
Service discovery (system healthchecks)
apt-get -y install nagios-plugins-basic
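Those Nagios plugins slot straight into Consul agent script checks, since Nagios exit codes (0/1/2) map directly to passing/warning/critical. A sketch of one check definition, shown in HCL for readability (era-appropriate Consul agents took the same fields as JSON); the ID, thresholds, and interval are assumptions:

check {
  id       = "disk-root"
  name     = "Disk usage on /"
  # check_disk ships in nagios-plugins-basic, installed above
  script   = "/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /"
  interval = "60s"
}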
Service discovery (service’s healthchecks)
Healthchecks are part of service definition
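In a Nomad task, for example, the check sits inside the service stanza and is registered in Consul automatically; a sketch with assumed names, including the traefik frontend.rule key pair mentioned in the notes:

service {
  name = "api"
  port = "http"
  # one key pair is enough to publish the service on the traefik frontend
  tags = ["traefik.enable=true", "traefik.frontend.rule=Host:api.example.cn"]

  check {
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"
  }
}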
CI/CD between regions
Had to set up our own docker registry (managed services in China didn’t work for us)
Upload speed to cn is okay; problems start when docker image updates exceed 100-300 MB - then pushes take an unpredictable time, from 5 to 30 minutes
The default Travis docker is incompatible with the registry; we had to upgrade it:
script:
- echo "Upgrade docker to 17.03.0-ce or higher, default doesn't work with our priv registry"
- sudo apt-get install docker-engine
“Persistent” services
Services like Kafka, Zookeeper, Cassandra, Elasticsearch, and some others
Traded the speed and simplicity of provisioning (basically terraform + user-data scripts) for the complexity of configuration management
Future work:
More specific AMIs (since vendor-specific ones are absent in cn anyway)
Basic configuration management via consul-template (we want to reload or reconfigure services upon changes in the KV store); see the sketch below
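A sketch of that consul-template direction, assuming an illustrative Kafka config rendered from Consul KV and a restart command; none of the paths are from the deck:

template {
  source      = "/etc/consul-template/server.properties.ctmpl"
  destination = "/etc/kafka/server.properties"
  # re-rendered whenever the watched KV keys change, then the service is bounced
  command     = "systemctl restart kafka"
}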
2.5 months later: what worked well
Terraform + cloudformation for rolling updates (see the sketch after this list)
Nomad for scheduling services + consul + traefik for http routing
EC2 services: ELB, ASG. No disk encryption :(
S3 (although features are much more limited)
IAM, with limitations - no MFA for example
ECR registry
RDS (postgres)
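“Terraform + cloudformation for rolling updates” refers to the known workaround of wrapping the ASG in a CloudFormation stack managed by Terraform, so that UpdatePolicy/AutoScalingRollingUpdate handles AMI rollouts Terraform itself couldn’t do. A sketch with assumed names and sizes:

resource "aws_cloudformation_stack" "api_asg" {
  name = "api-asg"

  template_body = <<EOF
{
  "Resources": {
    "ASG": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "LaunchConfigurationName": "${aws_launch_configuration.api.name}",
        "MinSize": "2",
        "MaxSize": "4",
        "VPCZoneIdentifier": ["${var.private_subnet_id}"]
      },
      "UpdatePolicy": {
        "AutoScalingRollingUpdate": {
          "MinInstancesInService": "1",
          "MaxBatchSize": "1",
          "PauseTime": "PT5M"
        }
      }
    }
  }
}
EOF
}

Changing the launch configuration (e.g. a new AMI) makes CloudFormation replace instances one at a time instead of all at once.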
Some challenges
Challenges (continued)
Latency, GFW (The Great Firewall of China)
Lack of mirrored packages and distros on the other side
Steep learning curve for the tools and services
Docker registry
“Cheburashka” servers and manually created resources
Not cloud-friendly apps, aka “zookeepers”
Patterns
Try the most basic configuration that provides the required service (AWS Certificate Manager, then IAM certificates, then ACME’s Let’s Encrypt)
Use tools resourcefully, for example:
Zookeeper-dependent services can use zookeeper.service.consul
All services are monitored by a service running in the scheduler + Route53 checks
The docker registry is installed from a tar image downloaded from S3 (pulling it any other way is too slow)
Infra code “on reflexes”
I just got a call from PMO
They want all microservices configured and deployed manually?!
Microservices
Service discovery in consul - everything is registered there. Even Zookeeper!
Jobs that define services are run by nomad
Job configs (JSON-ish) are generated in travis from templates
Travis preserves history
Deployed on every merge to master
Microservices (continued)
Example of a nomad job: Docker images are tagged by commit SHA, and you can run the job manually at any time (a sketch follows below).
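A sketch of such a job, with an assumed registry URL, service name, and resources; @COMMIT_SHA@ stands in for the placeholder Travis substitutes from the templates before submitting the job:

job "api" {
  datacenters = ["cn-north-1"]
  type        = "service"

  group "api" {
    count = 2

    task "api" {
      driver = "docker"

      config {
        # image tag is the commit SHA, filled in by Travis before "nomad run"
        image = "registry.example.cn/api:@COMMIT_SHA@"
        port_map {
          http = 8080
        }
      }

      service {
        name = "api"
        port = "http"

        check {
          type     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 200
        memory = 256
        network {
          mbits = 10
          port "http" {}
        }
      }
    }
  }
}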
Patterns
Always verify assumptions about China with AWS reps/architects
China’s SNS service has neither SMS nor push notifications
S3 doesn’t have regional replication
Latency and packet loss will be worse than you hope for
Get somebody “on the ground” in China - you’ll need them (for everything from getting a local phone number to registering with various hosted services)
Cheburashka server is waiting for maintenance
Not in the account managed by terraform
Questions?
roman@naumenko.ca
@naumenko_roman

Editor's Notes

  1. Hey, I'm a reasonable guy. But I've just experienced some very unreasonable things. And I would like to share them with you. Because it was pretty exciting to build something on the other side of the world, where neither Google nor Facebook can get anything built. And then get it interconnected with this side of the world and deliver some value for the business. A few words about who I am
  2. What is Sproutling, and what does the company develop? I joined 5 months ago.
  3. When I joined the company, I got an AWS account with leftovers from a “datacenter” scheduler. Boy, that scheduler was rooted in every AWS service it could find! It took some effort to clean up. Meanwhile we were getting the main account in China (mention ICP Recordal and the process of opening an account)
  4. It was pretty obvious things had to be simple, automated, and controlled end-to-end for the China region. We got some awesome tools and started to cook stuff: Terraform & docker to set up basic things in AWS and get builds going
  5. Packer, serverspec to get server images prepared and tested
  6. Consul for service discovery, Nomad for scheduling containers, Traefik for load balancing
  7. So it was easy to get everything going, although we had to pivot many times. We had to get an MVP with the simplest solution that would work in the region. It seemed like there shouldn’t be a lot of problems deploying services on AWS
  8. 75 or so services, with many, many features available in AWS. So we felt good about China - AWS has our back, right?
  9. Well, it turns out...
  10. Not only are there many more services in US regions, they are also all interconnected. The CN region is a “special” one: IAM won’t work, there is a separate AWS console, everything is separated. Treat it as another cloud provider. Meanwhile, the PMO from the mothership descended on our office and gave us a deadline. So what does Jack say about getting ready for production in 2 months?
  11. Useful links to bookmark before you start building
  12. Let’s start with the Internet-facing services: load balancers. They worked pretty well
  13. App load balancers are a big deal for ECS, but they’re not available either. Static IPs are useful for ICP Recordal updates. SSL certs worked great, though not as convenient as AWS Certificate Manager (with auto-renewal and all the nice features). Making records in Route53 is tricky since it’s another (global) region, so we simply import state from cn and make the necessary updates. The downside is the 2-step process when networking changes. It’s possible to set up providers and permissions in terraform, but in practice it didn’t work in 0.8.x versions. Now AMIs
  14. The absence of AMIs is generally a problem: they are impossible to copy across, and building servers is slow if no local repos are used. How to deal with it in terraform?
  15. A bunch of checks to make Packer runs fast, plus region-specific settings. Now let’s talk about VPN for VPC (connectivity for servers)
  16. OpenVPN instances are completely ephemeral. Setup in the cn region requires a 2-step apply to update Route53 records. OpenVPN will retry indefinitely until it connects to the other side. VPN for VPC continues: future work
  17. There are other solutions to set up a VPN. Connecting private DCs will be even more complicated. There is no out-of-the-box VPN. What about VPN for users?
  18. Standard VPC/networking setup for AWS. Just create a terraform module, provision, and forget. It won’t change much
  19. Nothing special here except tweaking terraform modules. These setups are battle-tested and easy to set up. Next: service discovery.
  20. It turns out this works pretty great with consul (of course, we discovered we needed checks after hell broke loose). Defining a check takes literally a few lines of code. When did you last use nagios?
  21. System health checks are essentially configuration-free. Every service that runs in nomad has its own healthchecks registered in consul. Next: healthchecks for services
  22. Also notice how easy it is to publish a service on the frontend http router. Remember Apache configs? Nginx settings? All that is now just one key pair (frontend.rule). Next: CI/CD
  23. This part is still in the works because configuring repositories is not trivial. Lots of boilerplate code and silly template edits with sed. Next: persistent services and challenges
  24. This was by far the most challenging part. Talk about contemporary cloud-readiness for apps. Given this, what patterns should we apply? Next: what worked well
  25. But there were some challenges as well. Next: pic about challenges
  26. Challenges were of 3 types: access (distros and AMIs, slow and unreliable network), a steep learning curve, and provisioning and support of hosted services (zookeepers). Next: challenges continued
  27. When you usually work in AWS, you don’t have to think about choices. Even though the names are sometimes obscure, the choice of building blocks is great
  28. So everything was getting assembled, tested, and kinda ready - and we got a call... Next: pic of the call from PMO
  29. Well, it wasn’t exactly a call - rather a bunch of meetings, but I indeed heard “configured and deployed manually”. Next: how we set up microservices
  30. Next: example of nomad job
  31. Next: patterns and best practices
  32. Next: cheburashka servers
  33. Don’t create little cheburashka servers - even if it’s very tempting leverage good tools and aws best practices Next: end and questions