How to build
a continuously processing
platform for 24/7 real-time
data streaming?
Author: Albert Lewandowski
• Big Data DevOps Engineer at GetInData
• Smart City Consultant at Almine
• Editor at Antyweb
• Focused on infrastructure, cloud, Internet of
Things and Big Data
Who am I?
Bullet points
• Infrastructure as Code (IaC)
• Documentation, documentation, documentation
• Monitor wisely
• Log analytics
• SRE principles with a DevOps mindset
• CI/CD
• Automate tasks
• Ask "why?", not "how?"
• Stories, issues, curiosities
Infrastructure as Code
• Code everything: it means readable
infrastructure for everyone.
• Do not rely on local copies of scripts;
someone invented code repositories, and it
was a great day for IT.
• Cloud-ready scripts: all public cloud vendors
provide their own tools for deploying
infrastructure.
Infrastructure as Code in
practice
• Amazon Web Services
• CloudFormation
• Google Cloud Platform
• Deployment Manager
• Microsoft Azure
• Azure Templates
• Azure DevOps
Or use one tool for all environments, like
Terraform.
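As a minimal sketch of the one-tool-for-all-clouds approach, a Terraform configuration might look like this (the project ID, bucket name, and region are illustrative placeholders, not from the talk):

```hcl
# Hypothetical Terraform sketch: project, names, and region are placeholders.
provider "google" {
  project = "my-project-id"
  region  = "europe-west3"
}

# A storage bucket declared as code: reviewable, versioned, reproducible.
resource "google_storage_bucket" "logs" {
  name     = "my-project-logs"
  location = "EU"
}
```

The same workflow (`terraform plan`, `terraform apply`) works against AWS and Azure by swapping the provider block.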
Exercise: Infrastructure as
Code
Google Cloud Platform & Deployment Manager
Source: https://github.com/GoogleCloudPlatform/deploymentmanager-samples/tree/master/examples/v2
git clone https://github.com/GoogleCloudPlatform/deploymentmanager-samples.git
cd deploymentmanager-samples/examples/v2
NAME="gid-gke-example-20200108"
ZONE="europe-west3-b"
gcloud deployment-manager deployments create ${NAME} --template cluster.py --properties zone:${ZONE}
Documentation
• Add a README if you do not
want to be disliked by your team.
• You will forget about
important things faster than
you think.
• Add descriptions and comments
to your tasks and merge
requests. Make work easier,
not harder.
Monitoring
• Scraping metrics is done everywhere.
• Get information about CPU utilisation, disk space usage,
the amount of free RAM, etc.
• There are multiple tools available, so which one should you choose?
• Do not forget to keep learning about new tools.
• Understand which metrics are valuable and provide information
about the status of data pipelines or data ingestion.
Prometheus’ stories
• Use service discovery; it’s great.
• Discover where the Flink JobManager (JM) and
TaskManagers (TM) expose their metrics.
• How to provide HA?
• Think of using long-term storage such as Thanos,
M3, or Cortex.
• Do you need archived data?
• Monitor Prometheus, even though it is itself a
monitoring tool.
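To make the service-discovery point concrete, here is a hypothetical `prometheus.yml` fragment that discovers pods through Kubernetes instead of hard-coding targets (the `app=flink` label is an assumption about how the pods are labelled):

```yaml
# Hypothetical prometheus.yml fragment: find Flink JM/TM pods dynamically
# via Kubernetes service discovery, so targets follow the cluster.
scrape_configs:
  - job_name: flink
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods carrying the (assumed) app=flink label.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: flink
        action: keep
```

When JobManager or TaskManager pods are rescheduled, Prometheus picks up the new endpoints without a config change.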
How to monitor and visualize
metrics?
All available services for monitoring visualisation
are quite similar.
• Think of it as part of a complex solution
that has to be compatible with your metrics
and log exporters.
• Think of security, ease of use for the operations
team, and adding any custom modules that may be required.
• Simpler visualisation = more readable (often, not
always).
• Understand the value of your metrics.
Don’t forget about alerts
Alerts signify that a human needs to take action
immediately, in response to something that is
either happening or about to happen, in order to
improve the situation.
Do not overuse alerts; some issues should be
fixed by automation scripts.
Solutions designed for alerts
• AlertManager
• Built-in alerts in Grafana
• Alerta
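As a sketch of the "alert only when a human must act" rule, here is a hypothetical Prometheus alerting rule that AlertManager would route (the metric names are standard node-exporter metrics; the thresholds are illustrative):

```yaml
# Hypothetical alerting rule: fire only after the condition persists,
# so transient blips are left to automation rather than paging a human.
groups:
  - name: disk
    rules:
      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }}"
```

The `for: 15m` clause is the anti-overuse knob: anything a cleanup script can fix within that window never reaches a person.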
Exercise: One to rule them all
• Demo from Alerta’s team: https://try.alerta.io/
git clone https://github.com/alerta/docker-alerta
docker-compose up
Log analytics
• Discover what is inside your log files.
• Useful for the operations team, and for developers to understand what
happened with their applications.
• Read log files in a dedicated tool; do not use less or tail when you have
several machines to check.
• You can take wise actions based on log content when you see
what is going on with your services.
Log analytics tools
• Elastic stack:
• ELK
• EFK
• Loki + Promtail + Grafana
Make it simple with Loki
Like Prometheus, but for logs!
• Write simple queries in LogQL, which is similar
to PromQL.
Examples:
• {instance=~"kafka-[23]",name="kafka"} !=
kafka.server:type=ReplicaManager
• topk(10,sum(rate({region="us-east1"}[5m])) by (name))
• Ingest log files with Promtail, Fluentd, or
Fluent Bit.
• Relabel log streams if needed.
• Designed for clusters in Kubernetes.
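To show the ingestion side, here is a hypothetical Promtail scrape config that tails local files and attaches labels before pushing to Loki (the job name, labels, and path are illustrative):

```yaml
# Hypothetical promtail config fragment: tail /var/log and label the
# stream so it can be queried with LogQL selectors like {job="varlogs"}.
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log
```

The labels chosen here become the index in Loki, so keep them few and low-cardinality.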
Our experience with Loki
• Two environments: development and production stages.
• Migration from an ELK stack that did not provide good enough
performance and had issues with scraping log files.
• Loki in Grafana: metrics and logs can be verified in one tool.
• A stable solution that provides all metrics and enables counting
interesting values from jobs’ logs.
Exercise: Glance at Loki
Install Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
Exercise: Glance at Loki
helm repo add loki https://grafana.github.io/loki/charts
helm repo update
helm upgrade --install loki --namespace=loki-stack loki/loki-stack
helm install loki-grafana stable/grafana --namespace=loki-stack
kubectl get secret --namespace <YOUR-NAMESPACE> loki-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
kubectl port-forward --namespace <YOUR-NAMESPACE> service/loki-grafana 3000:80
Go to: http://localhost:3000
Add Loki as data source.
SRE & DevOps
If a human operator needs to touch your system during normal
operations, you have a bug. The definition of normal changes as your
systems grow.
Carla Geisser, Google SRE
DevOps vs. SRE
• DevOps: Reduce organization silos.
SRE: Share ownership with developers by using the
same tools and techniques across the stack.
• DevOps: Accept failure as normal.
SRE: Have a formula for balancing accidents and
failures against new releases.
• DevOps: Implement gradual change.
SRE: Encourage moving quickly by reducing the
cost of failure.
• DevOps: Leverage tooling & automation.
SRE: Encourages "automating this year's job away"
and minimizing manual systems work to focus on
efforts that bring long-term value to the system.
• DevOps: Measure everything.
SRE: Believes that operations is a software
problem, and defines CI/CD pipelines.
Besides black art, there is only automation and mechanization.
Federico García Lorca (1898–1936), Spanish poet and playwright
Source: AWS
Improve, commit, test, deploy
• Define which applications or jobs can be
deployed automatically to the production
environment.
• Test everything.
• Remember that CI tools are really useful.
• Teach others how they should use automation
tools.
• Discuss, improve, make.
Automate and drink coffee
• Automate the boring stuff.
• Make Everything as Code.
• Run tested Ansible playbooks
and forget about manual
changes.
• More well-thought-out automation,
fewer problems.
Useful tools
• Ansible
• Jenkins
• Rundeck
• Automate all ops tasks.
• Run Ansible from one place where you can set up variables easily.
• It supports LDAP.
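As a minimal sketch of the "tested playbooks instead of manual changes" idea, here is an illustrative Ansible playbook (the host group, package, and service names are assumptions, not from the talk):

```yaml
# Hypothetical playbook: install and start a metrics exporter on a host
# group. Running it twice is safe: Ansible only changes what differs.
- hosts: monitoring
  become: true
  tasks:
    - name: Install the node exporter package
      ansible.builtin.package:
        name: prometheus-node-exporter
        state: present

    - name: Ensure the exporter is running and enabled on boot
      ansible.builtin.service:
        name: prometheus-node-exporter
        state: started
        enabled: true
```

Idempotence is the point: the playbook describes a desired state, so it can run from Rundeck or CI on every change without side effects.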
Test your Ansible
• Use Molecule.
• Molecule provides support for testing with multiple instances,
operating systems and distributions, virtualization providers, test
frameworks, and testing scenarios.
• Use Tox.
• Tox is a generic virtualenv management and test command-line
tool. Tox can be used in conjunction with factors and Molecule to
perform scenario tests.
• Use them in your CI tool.
Molecule tests
Prerequisites: a working Docker installation, plus
Molecule and Tox installed via pip.
Scenarios are the test suite for your newly created role.
What is inside a scenario directory?
• Dockerfile.j2
• INSTALL.rst – contains instructions on what
additional software or setup steps are needed.
• molecule.yml – here you add dependencies, etc.
• playbook.yml – it will be invoked by Molecule.
• tests – here you can add specific tests.
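Tying the scenario files together, a hypothetical `molecule.yml` for a role tested in a Docker container might look like this (the image and verifier choice are illustrative assumptions):

```yaml
# Hypothetical molecule.yml: spin up a throwaway Ubuntu container,
# converge the role with Ansible, then verify with Testinfra.
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: instance
    image: ubuntu:20.04
provisioner:
  name: ansible
verifier:
  name: testinfra
```

Running `molecule test` then walks the full cycle: create, converge, verify, destroy — exactly what you want wired into a CI job.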
Q&A
Feel free to ask me about anything!
Thank you for your attention!