THE FOLLOWING CONTAINS CONFIDENTIAL INFORMATION.
DO NOT DISTRIBUTE WITHOUT PERMISSION.
Kubernetes Navigation Stories
DevOpsStage 2019
Roman Chepurnyi
Director of Infrastructure Engineering at thredUP
Senior Engineering Manager at Hotwire
Oleksii Asiutin
Staff Software Engineer at thredUP
Senior Software Engineer at Toptal
3
4
ThredUP Technology
100% in k8s since mid-2018
● 70 Software Engineers
● 5 Infrastructure Engineers
● 50 applications
● 100 EC2 nodes
Stack
● Node.js, React
● Ruby, .NET, Java
● RabbitMQ, SQS
● Redis
● MySQL Aurora
5
CNCF Case study
https://www.cncf.io/thredup-case-study/
6
ThredUP Infrastructure
7
Life after Kubernetes migration
● Fixing shortcuts and gaps
○ IAM
○ Secrets management
● Developers experience
○ Staging environment
○ Local development
● Infrastructure optimization
○ Auto-scaling
○ Spot Instances
○ Security
○ Networking
8
Authentication
Hey Infra team, I need access to the k8s cluster
Oh my
9
Auth mechanisms
Signed certificates for everyone
OpenID Connect
aws-iam-authenticator
?
10
AWS-IAM-Authenticator
[Diagram] The client runs the aws-iam-authenticator binary and sends a token to the API servers; a webhook hands the token to the aws-iam-authenticator DaemonSet running on the master nodes for verification.
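Roughly, the client-side exchange looks like this (a sketch; the cluster ID and role ARN are placeholders):

$ aws-iam-authenticator token -i prod -r arn:aws:iam::000000000001:role/k8s-read-only
# prints a short-lived, pre-signed STS token that kubectl sends to the API server;
# the webhook passes it to the aws-iam-authenticator DaemonSet, which verifies it with AWS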
11
AWS-IAM-Authenticator – kubeconfig
Cluster prod – Role read-only
Cluster stage – Role developer
Cluster dev – Role admin
kubeconfig
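A minimal kubeconfig user entry for this setup might look like the following sketch (cluster name and role ARN are placeholders; newer clusters use the v1beta1 exec API):

users:
  - name: prod-read-only
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1alpha1
        command: aws-iam-authenticator
        args:
          - "token"
          - "-i"
          - "prod"
          - "-r"
          - "arn:aws:iam::000000000001:role/k8s-read-only"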
12
AWS-IAM-Authenticator – kubeconfig generation
[Diagram] A kubeconfig generation service maps IAM identities (e.g. john-smith, lara-jones) and IAM user groups to per-user kubeconfigs: a dev and a dev lead each receive a kubeconfig with the appropriate roles for the prod, stage, and dev clusters; the infra team manages the group mapping.
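On the cluster side, the authenticator config maps IAM roles to Kubernetes groups, roughly like this (a sketch in aws-auth/mapRoles style; the ARN, username template and group name are hypothetical):

mapRoles:
  - rolearn: arn:aws:iam::000000000001:role/k8s-developer
    username: developer:{{SessionName}}
    groups:
      - developers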
13
Secrets Management
14
Hashicorp Vault
https://www.vaultproject.io/
[Diagram] k8s Pod: an init container (https://github.com/cruise-automation/daytona) fetches the app secrets and writes them to a shared in-memory volume, which the app container reads.
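A minimal pod sketch of this pattern (image names are placeholders; only the volume wiring matters here):

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  volumes:
    - name: vault-secrets
      emptyDir:
        medium: Memory            # the shared in-memory volume
  initContainers:
    - name: secrets-fetcher       # e.g. daytona, pulls app secrets from Vault
      image: secrets-fetcher:latest
      volumeMounts:
        - name: vault-secrets
          mountPath: /home/vault
  containers:
    - name: app
      image: app:latest
      volumeMounts:
        - name: vault-secrets
          mountPath: /home/vault
          readOnly: true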
15
SOPS – Secrets OPerationS
https://github.com/mozilla/sops
Supported formats:
YAML
JSON
.env
# secrets.production.yaml
app_secrets:
  db_username: cart_service
  db_password: supersecret
16
SOPS – Encryption
$ sops -e --kms <AWS-KMS-ARN> secrets.production.yaml
app_secrets:
  db_username: ENC[AES256_GCM,data:KuhPWLhijVc/9wa6,iv:V7YS/QglsuYwpmBcTZjOwFz8p10yt+qOcRgg+/OL4Uo=,tag:jchhWABpUVYK4kpRKlrYPQ==,type:str]
  db_password: ENC[AES256_GCM,data:TWjWb4up6nx+gSk=,iv:VoI9vnYrIdYxjTmSsqFzbXZ9z8LsZp4ud8LgVocxGAs=,tag:PVNKEAq3OvWGiUSmM3aHpw==,type:str]
sops:
  kms:
    - arn: AWS-KMS-ARN
      created_at: '2019-09-26T09:00:30Z'
      enc: |
        AQICAHhGGWsaRwq5wtMieLutm2hnsC2WqAifhQ6HgfjDUdbvpQE5pwGLIOabNseXxCnNWo0YAAAAfjB8BgkqhkiG
        9w0BBwagbzBtAgEAMGgGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMhPJ/IHKNPgmqzN8vAgEQgDvTzDYH71MH
        x5nGWHjzNjpNDjnTw3pgS8IPf26qVhcdrO7Uv1g7yjKsJIVdcD00/hSNCgg6+KgulNgHmw==
  gcp_kms: []
17
SOPS – helm template and secrets storage
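The slide showed a helm template that turns the sops-decrypted values into a Kubernetes Secret; a minimal sketch of such a template, reusing the app_secrets keys from the example above (the template path and key names are illustrative):

# templates/secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Release.Name }}-secrets
type: Opaque
stringData:
  DB_USERNAME: {{ .Values.app_secrets.db_username | quote }}
  DB_PASSWORD: {{ .Values.app_secrets.db_password | quote }}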
18
SOPS – Deployment
$ sops -d -i ./helm/env/secrets.production.yaml
$ helm upgrade --install --wait --timeout 600 \
    -f ./helm/env/secrets.production.yaml \
    -f ./helm/env/production.yaml \
    app_name ./helm/app_name
19
Staging Environments
● prepare persistent layer (weekly DB snapshot)
● store helm charts config override
  - values
  - tags
  - secrets
  - dependencies
● setup CD deploy
[Diagram] services in the staging environment: web-client, content, checkout, backend
20
Staging Environment
$ git checkout -b devopsstage
$ git push -u origin devopsstage
wait 4-5 min
use https://devopsstage.threduptest.com/
21
Local development
When your service has a lot of dependencies (MySQL, Redis, RabbitMQ and 5 other services)
22
Local Development
macbook: Thredup $ git clone git@github.com:thredup/node-proxy.git
Cloning into 'node-proxy'...
...
macbook: Thredup $ cd node-proxy/
macbook: node-proxy (master) $ npm install
added 6 packages from 8 contributors and audited 6 packages in 0.595s
found 0 vulnerabilities
macbook: node-proxy (master) $ npm test
> proxy@1.0.0 test ~/Thredup/node-proxy
...
macbook: node-proxy (master) $ npm start
> proxy@1.0.0 start
> node server.js
23
Local Development with Docker
macbook: Thredup $ docker run -it -v ${PWD}:/app -p 3000:3000 \
    node:12-alpine sh
/ $ apk add --no-cache mysql-dev
/ $ npm install
/ $ npm test
/ $ npm start
> proxy@1.0.0 start
> node server.js
24
Local Development with Docker Compose
version: "3.7"
services:
  web:
    image: node:12-alpine
    volumes:
      - ./:/app
    ports:
      - "3000"
    environment:
      REDIS_HOST: "127.0.0.1"
  mysql:
    image: ...
    ...
  redis:
    image: ...
25
Local Development with Docker Compose
macbook: Thredup $ docker-compose up -d
…
macbook: Thredup $ docker-compose exec web sh
/ $ npm install
/ $ npm test
/ $ npm start
> proxy@1.0.0 start
> node server.js
26
Local Development with Docker Compose
And then you need another service as a dependency ;-)
...and another one
…
docker-compose.yaml: ~330 lines
MySQL DB: ~25 GB
27
Local Development with Docker Compose
And you need to keep it
UP TO DATE
28
Dynamic Staging Env
Local development - Telepresence
https://www.telepresence.io/
[Diagram] A local copy of Service A is swapped in place of the in-cluster Service A and talks to Services B, C, and D running in the cluster.
29
Local development with Telepresence
macbook: Thredup $ telepresence --swap-deployment \
    deployment-name \
    --expose 3000 \
    --method container \
    --docker-run --rm -it -v ${PWD}:/app \
    000000000001.dkr.ecr.us-east-1.amazonaws.com/cart:latest
...
...
/ $ npm install
/ $ npm test
/ $ npm start
> proxy@1.0.0 start
> node server.js
30
Horizontal Pod Autoscaling (HPA)
● Do not over-provision
● Be ready for traffic spikes
metrics:
  - type: External
    external:
      metricName: trace.rack.request.hits
      metricSelector:
        matchLabels:
          env: production
          service: some-service
      targetAverageValue: 10
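For context, that metrics block sits inside an autoscaling/v2beta1 HorizontalPodAutoscaler; a sketch with hypothetical names and replica bounds:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: some-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: some-service
  minReplicas: 2        # hypothetical bounds
  maxReplicas: 20
  metrics:
    # the External metric block from this slide goes here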
31
HPA lessons learned
● offender pods: request 1 core but use 3+ cores on start
● response time spikes
● autoscaling pattern
32
HPA lessons learned
add warmup script
update deployment strategy
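One possible shape of the warmup-plus-strategy fix, as a sketch (probe endpoint, port, and numbers are illustrative, not our exact settings): surge-only rolling updates, and a readiness probe that holds traffic until the warmup has finished.

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0          # never dip below desired capacity during a rollout
  template:
    spec:
      containers:
        - name: app
          readinessProbe:        # keeps traffic away until warmup completes
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 5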
33
Cluster autoscaler
● overflow capacity in production
● utilize spot instances
34
Spot instances and AZRebalance
● spot termination handling works: https://github.com/mumoshu/kube-spot-termination-notice-handler
● except when the instance is terminated by an AZRebalance (Availability Zone rebalancing) event
Terminating EC2 instance: i-0e685dc2a84b65f63
Cause: At 2019-07-18T06:09:59Z instances were launched to balance instances in zones us-east-1a us-east-1e with other zones resulting in more than desired number of instances in the group. At 2019-07-18T06:11:30Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 4 to 3. At 2019-07-18T06:11:30Z instance i-0e685dc2a84b65f63 was selected for termination.
35
Spot instances and AZRebalance
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-10-12T16:28:23Z
  generation: 2
  name: m4xlarge
spec:
  image: 405610825889/harden-k8s-x.14-debian-stretch-amd64-hvm-ebs-2019-08-16
  machineType: m4.2xlarge
  maxPrice: "0.20"
  maxSize: 30
  minSize: 5
  role: Node
  rootVolumeSize: 100
  subnets:
    - us-east-1a
    - us-east-1c
    - us-east-1e
  suspendProcesses:
    - AZRebalance
36
Container vulnerability scan
https://github.com/arminc/clair-scanner
https://snyk.io/blog/top-ten-most-popular-docker-images-each-contain-at-least-30-vulnerabilities/
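A typical clair-scanner run, roughly (a sketch; the Clair endpoint, callback IP, threshold, and image are placeholders):

$ clair-scanner --clair=http://clair:6060 --ip=10.0.0.5 \
    --threshold=High --report=report.json \
    000000000001.dkr.ecr.us-east-1.amazonaws.com/cart:latest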
37
Container runtime security
https://falco.org
https://snyk.io/blog/top-ten-most-popular-docker-images-each-contain-at-least-30-vulnerabilities/
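Falco rules are plain YAML; a minimal sketch of the kind of runtime rule we mean (name, condition, and message are illustrative):

- rule: Shell spawned in container
  desc: Detect a shell started inside a container
  condition: container.id != host and proc.name = bash
  output: "Shell started in a container (user=%user.name container=%container.name command=%proc.cmdline)"
  priority: WARNING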
38
Service Mesh
39
Service Mesh
● Visibility
● Simple configuration
● Security (policies, mTLS)
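For example, with Istio, mesh-wide mTLS can be turned on with a single resource; a sketch using the newer PeerAuthentication API (Istio versions from that time would use the MeshPolicy equivalent):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system      # applies mesh-wide from the root namespace
spec:
  mtls:
    mode: STRICT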
40
What’s next
● Finish Istio rollout
● More security
● Knative builds
● Have fun!
THANK YOU
https://www.thredup.com/devopsstage-2019

Editor's Notes

  1. [Roman] Let me introduce Oleksii, a staff engineer at thredUP. Oleksii is an infrastructure enthusiast, a co-organizer of a monthly DevOps digest on dou.ua; he likes sports cars and runs an Instagram account dedicated to cooking. [Olek] Thank you, Roman. Roman is the Director of our distributed Infrastructure team. I'd say Roman is a leader who manages us in a way that lets us bring innovations into our company platform. Before thredUP, Roman worked at one of the biggest hotel discount aggregators, Hotwire. He lives in California. Roman is as confident navigating Kubernetes as he is navigating a sailing boat in San Francisco Bay on weekends. Great to have Roman at the helm! I know it personally.
  2. Switching to case studies. Think about how to do it.
  3. [Olek] In: access Mid: danger of shared root key Out: granular permissions Okay, that was a brief introduction; it's over, and now come the navigation stories themselves. As Roman told us, one day you wake up and realize you have migrated your infra to k8s, and yeah, it's cool. But during the migration you cut corners, and now it's probably time to review and fill some gaps. Lots of us have been in this situation: Hey Infra team, I need access to a Kubernetes cluster. Really? What are you going to do there? When we created our k8s clusters we used a shared admin certificate inside the team. And in the early stages we also gave it to engineers who asked for access. Okay, here it is, but please use it carefully. Aha. And then, you know... Guys, checkout is down, where is our checkout service? Guys? Oh, I might have deleted it on prod instead of dev, ouch. So we need to organize users into groups and give them granular permissions per cluster.
  4. [Olek] In: granular access Mid: certs, openid - no Out: aws-auth - yes, review For authorization we use RBAC; it's the de facto standard for k8s now. With it we can create user groups and separate permissions. We reviewed multiple authentication mechanisms for users. We started with a shared root certificate, as I said before, and realized it would be hard to create a separate certificate for each user (mainly because k8s does not support certificate revocation). After that we reviewed the OpenID Connect mechanism. It works fine and it's good, but the downside for us was that our single sign-on provider did not support user groups with OpenID, so it's possible to authorize a user but you cannot get their groups, and we need those for our ACLs. Finally we settled on the tool that is nowadays called aws-iam-authenticator; back when we implemented auth in k8s it was named heptio-authenticator. Nowadays it's the default auth method in AWS EKS, and GCE and Azure have similar tools for their platforms. Let's briefly review how it works.
  5. [Olek] Our kubectl auth config uses tokens, which are generated by the client-side aws-auth binary. The token is generated based on your AWS credentials and contains a cluster name and a role. For simplicity, let's assume a role represents a user group here. So if you're a cluster admin you specify the admin role; if you're a read-only user you have a different one. Then you send your API token to the k8s server; the server has a webhook configured to talk to a daemonset, which checks whether the user is allowed to use the role from the token. If everything is okay, the user is successfully authenticated and the proper user groups are assigned for their session.
  6. [Olek] So basically, for each of our clusters a user needs the proper IAM role ARN. For example, for the prod cluster a user can get read-only permissions, while for the development cluster the same user has the admin role. And maintaining their local kubeconfig is not actually something our engineers should have to care about.
  7. [Olek] So we created a service that generates a kubeconfig based on the user's AWS IAM credentials. For now, a user executes a one-liner shell script in a terminal, and then the engineer has a cronjob installed which generates or re-generates the kubeconfig periodically. Why did we implement this as a cronjob? From time to time we update our group hierarchy, adding or removing users from groups, and with a cronjob these changes are deployed to users' machines automatically. [CONCLUSION] What did our engineers get from it? Everyone has kubectl with a kubeconfig fully managed by the infrastructure team. And the infra team has visibility and control in terms of identity and access management. In that way we applied IAM best practices to k8s auth management.
  8. [Olek] So here is our secrets management evolution path. It looks a little bit strange at first glance, but let me explain why that is.
  9. [Olek] We set up Hashicorp Vault; we love it, it's super cool and gives you everything you need: secrets management, a good security level, infra perks. Here is how we work with it: we have an init container which grabs all the necessary secrets and puts them into a shared volume, then the main service container reads them from the volume and initializes its env vars with the secret values. There is even an open-source project for the init container called Daytona. We have Vault set up in our clusters and it can be used by our engineers, but in fact it didn't get good adoption. Maybe it's because engineers didn't have enough time to dive into it, maybe it's because of our not-so-good guides. We succeeded in setting it up but failed at spreading it and getting our colleagues to use it. Our engineers did not add secrets to Vault and did not use it. So we started to investigate further.
  10. [Olek] And we stopped at the SOPS project, which stands for Secrets OPerationS. It's a simple and flexible tool for managing secrets. What it does is text file encryption and decryption, with support for YAML, JSON and .env formats. It supports the AWS, GCP and Azure key management systems as well as plain old PGP encryption. Here is an example of a YAML file containing database credentials for a service.
  11. [Olek] And here is how this file looks after encryption: we get key-by-key encryption instead of encrypting the whole file.
  12. [Olek] We deploy our services with the helm package manager, and for a helm release we specify both unencrypted values with the generic release configuration and sops-encrypted values which are used to create secrets. You can see an example of a helm template for secret creation.
  13. [Olek] So to deploy a service, all you need to do is decrypt your GitHub-stored secrets first and then run a helm release. This solution got good adoption among our engineers and turned out to be more popular than Vault. It might also be simpler; sops turned out to be more developer-friendly at thredUP. That way we moved from fault-intolerant, de-synchronized, unmanageable manual secret creation to a fully predictable and monitored secret management solution, filling one more migration gap.
  14. All helm charts are available. We can use them to run an on-demand staging setup.
  15. Advantages: 1) always up to date with the latest code and data; 2) scalable.
  16. [Olek] Okay, so we just told you how we manage dynamic stagings so engineers can present the results of their work to coworkers. But where do engineers spend most of their working hours? It's local development: when you write code on your laptop, run the tests and do debugging.
  17. [Olek] And when we talk about local development, the most basic workflow is just to clone a git repo, install dependencies and run the service (let's assume it's a web application). Here is an example of doing it that way with Node.js. BUT it's not that simple in the real world, right?
  18. [Olek] When you install a service it might have native extensions in its dependencies, and in that case you might need to install specific libraries on your machine. That's okay if there is a good guide on how to do it and if its libraries don't conflict with another service's libraries, and another's, and, because we have this trendy microservices architecture, yet another service's libraries. It becomes cumbersome to set this up on a local machine, and ... it's good we have such a thing as Docker. So you create a Docker container from the Node.js image, mapping your codebase and the ports you work with, install all the necessary libraries and do the same stuff you did locally. And everything is fine, you are good to go. Not really.
  19. [Olek] So we moved from literally OS-native development to containerized development; what's next? It's probably convenient to use docker-compose to set up service dependencies. Usually it's a database, a caching layer, queues, workers.
  20. [Olek] Then you run it, it works, it’s convenient to use it locally.
  21. [Olek] Until your docker-compose file becomes 300+ lines long and your local database is 25 GB heavy.
  22. [Olek] Why is it hard and inconvenient? Because you have to keep your local env up to date, and because it consumes a lot of resources (we do have powerful laptops, but even they have problems with resource consumption from time to time). And if you have some issues with some service it's hard to