How to build continuous processing for a 24/7 real-time data streaming platform?
- 2. Who am I?
• Big Data DevOps Engineer at Getindata
• Smart City Consultant at Almine
• Editor at Antyweb
• Focused on infrastructure, cloud, Internet of Things and Big Data
- 3. Bullet points
• Infrastructure as Code (IaC)
• Documentation, documentation, documentation
• Monitor wisely
• Log analytics
• SRE principles with a DevOps mindset
• CI/CD
• Automate tasks
• Ask "why?", not "how?"
• Stories, issues, curiosities
- 4. Infrastructure as Code
• Code everything: infrastructure defined as code is readable for everyone.
• Do not rely on local copies of scripts; code repositories were invented for a reason, and that was a great day for IT.
• Write cloud-ready scripts – all major public cloud vendors provide their own tools for deploying infrastructure.
- 5. Infrastructure as Code in practice
• Amazon Web Services
• CloudFormation
• Google Cloud Platform
• Deployment Manager
• Microsoft Azure
• Azure Templates
• Azure DevOps
Or use one tool for all environments, like Terraform.
- 6. Exercise: Infrastructure as Code
Google Cloud Platform & Deployment Manager
Source:
https://github.com/GoogleCloudPlatform/deploymentmanager-samples/tree/master/examples/v2

git clone https://github.com/GoogleCloudPlatform/deploymentmanager-samples.git
cd deploymentmanager-samples/examples/v2
# Deployment names must be lowercase with hyphens (no underscores);
# GCP zone ids use a trailing hyphen, e.g. europe-west3-b.
NAME="gid-gke-example-20200108"
ZONE="europe-west3-b"
gcloud deployment-manager deployments create ${NAME} --template cluster.py --properties zone:${ZONE}
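Instead of passing the template on the command line, the same deployment can be driven from a config file. A minimal sketch (the resource name is hypothetical; the available properties depend on the schema of `cluster.py` in the samples repo):

```yaml
# config.yaml – imports the cluster.py template and sets its zone property
imports:
  - path: cluster.py

resources:
  - name: gid-gke-example        # hypothetical resource name
    type: cluster.py
    properties:
      zone: europe-west3-b
```

Deploy it with `gcloud deployment-manager deployments create gid-gke-example --config config.yaml`.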
- 7. Documentation
• Add a README unless you want to be disliked by your team.
• You will forget about important things faster than you think.
• Add descriptions and comments to your tasks and merge requests. Make work easier, not harder.
- 8. Monitoring
• Scraping metrics is done everywhere.
• Collect information about CPU utilisation, disk space usage, amount of free RAM, etc.
• With multiple tools available, which one should you choose?
• Do not forget to keep learning about new tools.
• Understand which metrics are valuable and provide information about the status of data pipelines or data ingestion.
- 9. Prometheus’ stories
• Use service discovery; it's great.
• Discover where Flink JobManagers (JM) and TaskManagers (TM) expose their metrics.
• How do you provide HA?
• Think about long-term storage like Thanos, M3 or Cortex.
• Do you need archived data?
• Monitor Prometheus itself, even though it is a monitoring tool.
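The service-discovery idea above can be sketched as a minimal Prometheus scrape job. Assumptions: Flink's PrometheusReporter is enabled on its default port 9249, the JM/TM pods run in Kubernetes, and they carry an `app: flink` label – adjust all three to your setup:

```yaml
scrape_configs:
  - job_name: flink
    kubernetes_sd_configs:
      - role: pod              # discover JM and TM pods instead of hardcoding addresses
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: flink
        action: keep           # scrape only pods labelled app=flink (assumed label)
      - source_labels: [__address__]
        regex: (.+):\d+
        replacement: $1:9249   # PrometheusReporter default port (assumption)
        target_label: __address__
```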
- 11. How to monitor and visualize metrics?
All available services for monitoring visualisation are quite similar.
• Think of it as part of a complex solution that has to be compatible with your metrics exporters and log exporters.
• Think about security, ease of use for the operations team, and adding any custom modules that may be required.
• Simpler visualisation = more readable (often, not always).
• Understand the value of your metrics.
- 12. Don’t forget about alerts
Alerts signify that a human needs to take action
immediately in response to something that is
either happening or about to happen, in order to
improve the situation.
Do not overuse alerts; some issues should be fixed by automation scripts.
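As a sketch of an alert worth paging on, here is a minimal Prometheus alerting rule. The metric name assumes Kafka's JMX metrics are exported to Prometheus; your exporter may name it differently:

```yaml
groups:
  - name: streaming-platform
    rules:
      - alert: KafkaUnderReplicatedPartitions
        # Metric name depends on your JMX exporter configuration (assumption)
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 10m               # give transient blips (or automation) time to self-heal
        labels:
          severity: page
        annotations:
          summary: "Kafka has had under-replicated partitions for more than 10 minutes"
```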
- 14. Exercise: One to rule them all
• Demo from Alerta's team: https://try.alerta.io/
git clone https://github.com/alerta/docker-alerta
cd docker-alerta
docker-compose up
- 15. Log analytics
• Discover what is inside your log files.
• Useful for operations team and for developers to understand what
happened with their applications.
• Read log files in a dedicated tool; do not use less or tail when you have several machines to check.
• You can take wise actions based on the log content when you see
what is going on with your services.
- 18. Make it simple with Loki
Like Prometheus, but for logs!
• Write simple queries with LogQL, which is similar to PromQL.
Examples:
• {instance=~"kafka-[23]",name="kafka"} != "kafka.server:type=ReplicaManager"
• topk(10, sum(rate({region="us-east1"}[5m])) by (name))
• Ingest log files with Promtail, Fluentd or Fluent Bit.
• Relabel log files if needed.
• Designed for clusters in Kubernetes.
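The relabeling mentioned above is done in Promtail's scrape config. A trimmed sketch (a real config also needs a `clients` section and a `__path__` mapping for the log files; the label names mirror the ones used in the LogQL examples, which is an assumption about your setup):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                # discover pods the same way Prometheus does
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: name       # expose the pod's app label as the `name` label queried in LogQL
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace  # make the namespace queryable too
```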
- 19. Our experience with Loki
• Two environments: development and production.
• Migration from an ELK stack that didn't provide good enough performance and had issues with scraping log files.
• Loki in Grafana: metrics and logs can be verified in one tool.
• A stable solution that provides all metrics and enables counting interesting values from jobs' logs.
- 21. Exercise: Glance at Loki
Install Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
- 22. Exercise: Glance at Loki
helm repo add loki https://grafana.github.io/loki/charts
helm repo update
helm upgrade --install loki --namespace=loki-stack loki/loki-stack
helm install loki-grafana stable/grafana --namespace=loki-stack
kubectl get secret --namespace <YOUR-NAMESPACE> loki-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
kubectl port-forward --namespace <YOUR-NAMESPACE> service/loki-grafana 3000:80
Go to: http://localhost:3000
Add Loki as data source.
- 23. SRE & DevOps
If a human operator needs to touch your system during normal
operations, you have a bug. The definition of normal changes as your
systems grow.
Carla Geisser, Google SRE
- 24. DevOps vs. SRE
• DevOps: Reduce organization silos
  SRE: Share ownership with developers by using the same tools and techniques across the stack
• DevOps: Accept failure as normal
  SRE: Have a formula for balancing accidents and failures against new releases
• DevOps: Implement gradual change
  SRE: Encourage moving quickly by reducing costs of failure
• DevOps: Leverage tooling & automation
  SRE: Encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system
• DevOps: Measure everything
  SRE: Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.
- 25. CICD pipelines
Besides black art, there is only automation and mechanization.
Federico García Lorca (1898–1936), Spanish poet and playwright
(Diagram source: AWS)
- 26. Improve, commit, test, deploy
• Define which applications or jobs can be
deployed automatically to the production
environment.
• Test everything.
• Remember that CI tools are really useful.
• Teach others how they should use automation
tools.
• Discuss, improve, make
- 27. Automate and drink coffee
• Automate boring stuff.
• Make Everything as Code.
• Run tested Ansible playbooks
and forget about manual
changes.
• More well-thought-out automation, fewer problems.
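A minimal sketch of the kind of tested, idempotent playbook meant above (the inventory group and package names are hypothetical; adjust to your platform):

```yaml
# site.yml – hypothetical playbook; run it repeatedly instead of making manual changes
- hosts: kafka_brokers           # inventory group name is an assumption
  become: true
  tasks:
    - name: Ensure the JMX exporter directory exists
      ansible.builtin.file:
        path: /opt/jmx_exporter
        state: directory
        mode: "0755"
    - name: Ensure chrony is installed   # idempotent: no-op when already true
      ansible.builtin.package:
        name: chrony
        state: present
```

Because every task declares a desired state rather than a command to run, re-running the playbook after a manual drift simply restores the machine to the coded configuration.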
- 28. Useful tools
• Ansible
• Jenkins
• Rundeck
  • Automate all ops tasks.
  • Run Ansible from one place where you can set up variables easily.
  • It supports LDAP.
- 29. Test your Ansible
• Use Molecule.
• Molecule provides support for testing with multiple instances, operating systems and distributions, virtualization providers, test frameworks and testing scenarios.
• Use Tox.
• Tox is a generic virtualenv management and test command line tool. Tox can be used in conjunction with Factors and Molecule to perform scenario tests.
• Use them in your CI tool.
• Use them in your CI tool.
- 30. Molecule tests
Prerequisites: a working Docker installation, plus molecule and tox (installed via pip).
Scenarios are a test suite for your newly created role.
What is inside a scenario directory?
• Dockerfile.j2
• INSTALL.rst – contains instructions on what additional software or setup steps are required.
• molecule.yml – here you add dependencies, etc.
• playbook.yml – it will be invoked by Molecule.
• tests – here you can add specific tests.
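A trimmed molecule.yml of the kind described above, tying the scenario pieces together (the driver, image and verifier choices are assumptions; adjust them to your role):

```yaml
# molecule/default/molecule.yml – hypothetical minimal scenario config
dependency:
  name: galaxy           # pull role dependencies from Ansible Galaxy
driver:
  name: docker           # requires the working Docker installation noted above
platforms:
  - name: instance
    image: centos:7      # test image is an assumption
provisioner:
  name: ansible          # runs playbook.yml against the instance
verifier:
  name: testinfra        # executes the tests in the tests directory
```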