HashiConf Seattle 2019: Inversion of Control with Consul
- 1. Inversion of Control with Consul
Pierre Souchay, leading Discovery Team @Criteo (SDKs + Consul)
Twitter: @vizionr / GitHub: pierresouchay
Dealing with 240k+ services, 38k Consul nodes in 9 DCs
1st external contributor to Consul
Author of consul-templaterb
- 2. Today’s dishes
• Starters
• History of Consul at Criteo
• Entrées
• Inversion of Control explained
• Cheese
• Real World Examples
• Sweets
• How it changes infrastructure
- 4. Criteo infrastructure growth
More servers every year:
• DCs: 12 (9 prod)
• Servers: 38,000+
• Services: 3,400
• Instances: 240k+
• HTTP req/s: 3M+
• BigData: 180+ PB
• Kafka msg/s: 8M+
- 10. 2015 - Mesos
• Containers
• Frequent changes
• Many services/machine
• Different Provisioning
- 13. Sounds almost good enough but…
1. Provisioning time is an issue globally (F5)
2. Database polling shows its limits
3. Services both in containers and on machines?
4. More latency introduced by new load-balancers
- 15. Consul to discover everything
• No SPOF
• Multi-DC support
• Service oriented
• Real-time updates
• Toolbox (KV, locks)
• DNS integration!
• Works at the IP level
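As a sketch of what service-oriented, real-time discovery looks like in practice: a client asks Consul's `/v1/health/service/<name>?passing` endpoint for healthy instances and load-balances across the returned addresses. The response shape below follows Consul's public HTTP API; the filtering helper itself is illustrative, not Criteo's actual SDK code.

```python
def healthy_endpoints(health_response):
    """Extract (address, port) pairs for instances whose checks all pass.

    `health_response` is the parsed JSON body of a call to
    /v1/health/service/<name> (a list of Node/Service/Checks entries).
    """
    endpoints = []
    for entry in health_response:
        if all(check["Status"] == "passing" for check in entry["Checks"]):
            svc = entry["Service"]
            # Service.Address may be empty; fall back to the node address.
            address = svc["Address"] or entry["Node"]["Address"]
            endpoints.append((address, svc["Port"]))
    return endpoints
```

With `?passing` Consul already filters server-side; doing the check client-side as well keeps the helper usable on unfiltered responses.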
- 17. Step 2: re-implement our libraries
• HTTP client-side LB
• Database access
• Kafka
• Memcached/Couchbase
- 19. Step 3 was harder
• Watching changes on many services
→ cpu/net: idx/service: #3899 (and many more)
• Leader gets saturated
→ discovery_max_stale: #3920
• DNS issues on big services
→ DNS fixes: #3940, #3948, #4071,
• 800 MB/s to watch changes
→ consul-templaterb: now 12 kB/s
• Weights in services / meta in services
→ #3881 / #4047 / #4468
- 24. What did we learn about our users?
• They love their services → put configuration into their systems
• They want predictability → give them tools to investigate
• They love business semantics → focus on semantics, ignore tools
• They want it fast and magic → magic is better than As A Service
- 25. Can we go further?
Can we change the way we create infrastructure and tools?
- 27. Inversion of Control
• Decoupling system concerns using a framework
• You express the semantics of your needs
• Someone will provide what you need (and much more)
• Broader than Dependency Injection
- 29. Consul exposes lots of stuff
• List all services
• Filter services using tags
• Notifications in real time
• Provide configuration settings (KV / service data)
- 30. Let’s use those features
• Expose searchable semantics using tags
• Provide configuration hints with business semantics as meta
• Tools observe, react & provision
• Consul is like an infra DB
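The "observe, react" part typically relies on Consul blocking queries: a tool long-polls an endpoint with the last seen `X-Consul-Index` and only wakes up when something changed. Below is a minimal sketch of the index bookkeeping between polls, following the rules in Consul's API documentation; the HTTP call itself is left as a comment, and the function name is illustrative.

```python
def next_wait_index(previous, returned):
    """Sanitize the X-Consul-Index to send with the next blocking query.

    Per Consul's API docs, the index must be reset to 0 when it is
    missing, less than 1, or goes backwards; otherwise reuse it as-is.
    """
    if returned is None or returned < 1 or returned < previous:
        return 0
    return returned

# Watch-loop skeleton (pseudocode in comments):
#   index = 0
#   while True:
#       # GET /v1/catalog/services?index=<index>&wait=5m
#       # blocks until a change happens or the wait elapses
#       index = next_wait_index(index, returned_x_consul_index)
#       # ...react: regenerate config, provision, alert, etc.
```

This long-poll pattern is what lets tools like consul-templaterb react in near real time without polling a database.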
- 34. Why is meta so cool?
Direct configuration:
• alert_* => automatic alerts
• vip_* => VAAS configuration
• swagger_* => you saw it
…and information:
• version
• start
• team
• OWNERS...
…automatically cleaned up
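Each consumer only needs to slice out the keys under its own prefix. A hypothetical helper (the prefix convention comes from the talk; the function itself is illustrative):

```python
def meta_by_prefix(meta, prefix):
    """Return the sub-configuration a tool owns, keyed without its prefix.

    E.g. with prefix 'alert_', the meta key 'alert_channel' becomes
    'channel'; keys belonging to other tools are ignored.
    """
    return {key[len(prefix):]: value
            for key, value in meta.items()
            if key.startswith(prefix)}
```

Because the meta lives on the service registration itself, it disappears with the service, which is what makes the automatic cleanup essentially free.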
- 35. Consumers of those meta can be…
OPT-IN
• VAAS (network load balancing)
• Swagger repository
OPT-OUT
• Chaos Monkey
• Security Scanner
• Automatic Alerting
• App Watcher
• Version Scanner
- 36. More meta, more power, more services
• Node meta + service meta: 2 layers
• Metrics can re-use it as well (e.g. Prometheus/Consul integration)
• Templates are re-usable
• The same meta can be re-used for new tools
• It gets easier and easier
- 37. Isn’t K/V the right place instead of meta?
Most of the time… no:
• Cardinality is hard to get right: a service is NOT a monolith
• Cleanup is just too hard
• It means a bit more work on the consumer side, because of cardinality
- 38. Semantics, not tool oriented
alert_channel = "myteam"
alert_ratio = "0.5"
alert_on_call_group = "myteam-alerts"
alert_depends_on = "database1,serviceX"
alert_criticity = "business,high"
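An alerting consumer can translate those keys into its own rule format without the service owner ever touching the alerting tool. A sketch using the meta keys from the slide (the output structure, defaults, and function name are hypothetical):

```python
def alert_spec(service_name, meta):
    """Turn a service's alert_* meta into a tool-agnostic alert definition.

    Only the alert_* key names come from the talk; defaults for missing
    keys are illustrative choices.
    """
    return {
        "service": service_name,
        "channel": meta.get("alert_channel", "default"),
        "min_healthy_ratio": float(meta.get("alert_ratio", "1.0")),
        "on_call_group": meta.get("alert_on_call_group"),
        # comma-separated lists; drop empty fragments
        "depends_on": [s for s in meta.get("alert_depends_on", "").split(",") if s],
        "criticity": [s for s in meta.get("alert_criticity", "").split(",") if s],
    }
```

The point of the slide is exactly this shape: the service declares business semantics ("alert my team when fewer than half the instances are healthy"), and a template turns it into whichever tool's config is in fashion.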
- 40. Automatic metering / alerts
• Templates of consul-templaterb generating Prometheus alerts
• Provide 100% alerting coverage of Criteo for free
• Also provide metrics such as availability for all of Criteo
• App availability per version/OS/rack, using meta!
• Re-use those meta in all metrics
- 41. VAAS
• Provides all networking for Criteo
• Serving more than 4M HTTP req/s
• Shares semantics across several load-balancers
• HAProxy
• F5
• Provisioning of much more than reverse proxies
• DNS (including Geo-DNS)
• TLS
• Real-time creation of services (less than 1 minute)
- 42. Services Scanner
• Detect old applications
• Detect invalid ownership
• Old security groups
• Deprecated users…
- 43. Consul-UI / Consul-Timeline
• Live logs and real-time status updates for all services
• History of changes for all services
• http://github.com/criteo/consul-templaterb/
- 44. And much more…
• Swagger browser (catalog of all JSON APIs in Criteo)
• Chaos Monkey
• Resource tracking systems
• Latency monitoring between machines
• Security Scanner looks for new services to scan
- 46. Removes configuration from hidden places
• If you are providing a new cross-service system, you probably don’t need a git repository for its configuration
• Everything is transparent and open to everybody
• Information is where it needs to be: on the service itself
• Eases onboarding of newcomers
- 47. Cleanup is not a hard problem anymore
• Systems live and die, consumers react
• Ops synchronization is not needed anymore
- 48. Helps innovation
• Real decoupling
• You can start your new project on your laptop
• Templating systems create your configs easily
• No migration costs anymore: we don’t configure tools
• Semantics are better than YAML config files