This document discusses zero downtime architectures. It defines zero downtime as services being available to end users at all times. It identifies sources of planned and unplanned downtime. It proposes concepts like independent application groups, redundant infrastructure within and between datacenters, and replicating data between datacenters to reduce downtime. It provides examples of implementing high availability for networks, applications, and databases. It also discusses development guidelines and monitoring to support zero downtime operations.
This document provides guidance on scaling Apache Kafka clusters and tuning performance. It discusses expanding Kafka clusters horizontally across inexpensive servers for increased throughput and CPU utilization. Key aspects that impact performance like disk layout, OS tuning, Java settings, broker and topic monitoring, client tuning, and anticipating problems are covered. Application performance can be improved through configuration of batch size, compression, and request handling, while consumer performance relies on partitioning, fetch settings, and avoiding perpetual rebalances.
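The producer and consumer knobs named above can be sketched as plain config maps. The parameter names below follow Kafka's documented client configuration keys, but the numeric values are illustrative assumptions, not recommendations from the talk:

```python
# Hypothetical tuning values for illustration only; the keys are standard
# Kafka client config names, the numbers are placeholder assumptions.
producer_config = {
    "batch.size": 64 * 1024,    # bytes per partition batch; larger batches amortize request overhead
    "linger.ms": 20,            # wait up to 20 ms to fill a batch before sending
    "compression.type": "lz4",  # compress whole batches; trades a little CPU for bandwidth
    "acks": "all",              # durability over latency: wait for all in-sync replicas
}

consumer_config = {
    "fetch.min.bytes": 1024,         # let the broker accumulate data before answering a fetch
    "max.poll.interval.ms": 300000,  # raise when processing is slow, to avoid perpetual rebalances
}
```

Raising `linger.ms` and `batch.size` together is the usual lever for throughput; `max.poll.interval.ms` is the one most often involved in the "perpetual rebalance" failure mode the summary mentions.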
Microsoft technologies form the backbone of many enterprise IT infrastructures. Whether you are running Microsoft Exchange, SharePoint, SQL Server or Active Directory, chances are you rely upon these services for your mission-critical needs. Solutions architects and IT professionals will get an overview of the common Microsoft workloads running on AWS, including approaches for server migrations, design and deployment of infrastructure services, and maintenance and monitoring of those services once they are in production.
The document discusses using Oracle Database with Amazon Web Services. It outlines Amazon EC2, which allows users to provision virtual machines in Amazon's data centers, and Amazon S3 for storing and retrieving data. It then provides steps for deploying Oracle Database Express Edition on EC2, backing up databases to S3 using Oracle Recovery Manager, and storing database files and backups in S3 for cost effective storage.
Building Cloud-Native App Series - Part 7 of 11
Microservices Architecture Series
Containers Docker Kind Kubernetes Istio
- Pods
- ReplicaSet
- Deployment (Canary, Blue-Green)
- Ingress
- Service
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...
Amazon Web Services (AWS) is gaining popularity, and for good reasons. The Amazon Relational Database Service (RDS) is getting a lot of attention, also for very good reasons. It is quite a compelling idea to have on-demand data services that do not require hiring DBA staff. The expectation is set that everything works like magic and will satisfy all of your enterprise database availability needs.
If you want to build high-volume, business-critical applications, possibly with geographically distributed audiences, you really want to think twice about using RDS. Continuent customers have a large number of deployments in AWS running MySQL on EC2 instances, and they choose to rely upon Tungsten Clustering to provide high availability (HA) and disaster recovery (DR). We also support multi-site/multi-master operations and offer true zero-downtime MySQL operations.
AGENDA
- How does RDS handle failover? (Hint: Not very quickly)
- How does RDS handle read scaling? (Hint: Not very well)
- Can you do zero-downtime maintenance with RDS? (Hint: No)
- Is RDS cheaper? (Hint: No, not really)
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...
With the rapid advancement of cloud computing techniques and the growing maturity of Infrastructure as Code (IaC) in the DevOps space, attaining zero downtime while deploying the latest updates to applications and databases has become the new norm in the IT industry. The Blue-Green deployment model is a DevOps technique that helps achieve zero downtime by seamlessly switching to the new version (green) once it is fully ready, while the original version (blue) is still running. Making duplicate copies of an entire cloud infrastructure at a rapid pace is only possible when it is written as code rather than manually configured. Terraform is a declarative language for writing infrastructure as code. In this session:
1. We will provision a new AWS VPC, public and private subnets, and security groups using Terraform code.
2. We will then create a DocumentDB cluster from scratch and create a new DB from a snapshot by running the Terraform.
3. We will then demonstrate the creation of an AWS Fargate-based ECS cluster using a Terraform script, and then run a Java-based microservice on it that uses Amazon DocumentDB as the NoSQL database.
4. We will then simulate the Blue-Green deployment model by creating an AWS Lambda function which deploys the latest updates and rolls back to the older version with zero downtime.
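The essence of the blue-green cutover described above can be sketched as a small simulation. All names here (`BlueGreenRouter`, `switch`, the health check) are invented for illustration and are not part of the session's Terraform or Lambda code:

```python
# Minimal blue-green switch simulation: two identical environments,
# traffic points at one, the other receives the new version.
class BlueGreenRouter:
    def __init__(self):
        self.environments = {"blue": "v1", "green": None}
        self.active = "blue"

    def idle(self):
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version):
        # Deploy the new version to the idle environment only.
        self.environments[self.idle()] = version

    def health_check(self, env):
        # Stand-in for a real readiness probe.
        return self.environments[env] is not None

    def switch(self):
        # Cut traffic over only when the idle environment is ready;
        # the old environment stays up for instant rollback.
        target = self.idle()
        if not self.health_check(target):
            raise RuntimeError("idle environment not ready; keeping current version")
        self.active = target

router = BlueGreenRouter()
router.deploy("v2")  # green now runs v2; blue still serves traffic
router.switch()      # zero-downtime cutover; blue keeps v1 for rollback
```

The key property is that rollback is just another `switch()` call: the previous environment is never torn down during the cutover.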
As a Service: Cloud Foundry on OpenStack - Lessons Learnt
According to the OpenStack user survey, Cloud Foundry is the 2nd most popular workload on OpenStack. You want to deploy Cloud Foundry on OpenStack, or already have. What's next?
Cloud Foundry continues to evolve with revolutionary changes, e.g. the move from bosh-micro to bosh-init, the new eCPI, the move to Diego, etc.
The same is true of OpenStack, e.g. the change from Keystone v2 to v3, the upgrade from Liberty to Mitaka, network plugin changes, etc. Both the IaaS and PaaS layers change frequently. How do you perform in-place updates, upgrades, and operational tasks without impacting the user experience at either layer?
In this talk we discuss the lessons learnt operating hybrid Cloud Foundry deployments on top of OpenStack over the last two years, and how we used the underlying technologies to operate them seamlessly.
Multi-master, multi-region MySQL deployment in Amazon AWS
MySQL data rules the cloud, but recent experience shows us that there's no substitute for maintaining copies of data, across availability zones and regions, when it comes to Amazon Web Services (AWS) data resilience.
In this webinar, we discuss the multi-master capabilities of Continuent Tungsten to help you build and manage systems that spread data across multiple sites. We cover important topics such as setting up large scale topologies, handling failures, and how to handle data privacy issues like removing personally identifiable information or handling privacy law restrictions on data movement. We will conclude with a live demonstration of a distributed MySQL solution with Continuent Tungsten clusters working across multiple AWS availability zones and regions.
This document summarizes Denis Gundarev's presentation on how to build a Citrix infrastructure in the Amazon Web Services (AWS) cloud. The presentation covered:
- An overview of AWS services like EC2, S3, VPC, RDS, and how to monitor with CloudWatch
- Common Citrix deployment architectures on AWS like using NetScaler and AutoScaling
- Limitations of running Citrix on AWS like lack of capacity management and client OS support
- Guidelines for deploying Citrix on AWS like starting simple, proper sizing, and careful VPC planning
Migrating Oracle workloads to Azure requires understanding the workload and hardware requirements. It is important to analyze the workload using the Automatic Workload Repository (AWR) report to accurately size infrastructure needs. The right virtual machine series and storage options must be selected to meet the identified input/output and capacity needs. Rather than moving existing hardware, the focus should be migrating the Oracle workload to take advantage of cloud capabilities while ensuring performance and high availability.
Container Orchestration with Docker Swarm and Kubernetes
This presentation covers the basics of what container orchestration is providing pros and cons of Docker Swarm, Kubernetes and Amazon ECS and outlining the terms and tools you will need to successfully use them.
AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global A...
Building and evolving a pervasive, global service requires a multi-disciplined approach that balances requirements with service availability, latency, data replication, compute capacity, and efficiency. In this session, we’ll follow the Netflix journey of failure, innovation, and ubiquity. We'll review the many facets of globalization and then delve deep into the architectural patterns that enable seamless, multi-region traffic management; reliable, fast data propagation; and efficient service infrastructure. The patterns presented will be broadly applicable to internet services with global aspirations.
The document discusses continuous delivery and zero downtime deployment. It describes automating the full deployment process from development to production multiple times a day without any downtime. This is achieved through continuous integration, maintaining deployment scripts and packages, managing database changes, and techniques like feature toggles, blue/green deployments, and state management. The goal is to enable fast and reliable releases of new features and fixes to users.
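The feature-toggle technique mentioned above can be sketched in a few lines: ship dormant code to production, then enable it per user or percentage without a redeploy. The toggle name, structure, and bucketing scheme below are illustrative assumptions, not the document's implementation:

```python
import zlib

# Hypothetical toggle registry; in practice this would live in a config
# service or database so it can change without a deployment.
TOGGLES = {"new_checkout": {"enabled": True, "percent": 10}}

def is_enabled(feature, user_id):
    toggle = TOGGLES.get(feature)
    if not toggle or not toggle["enabled"]:
        return False
    # Deterministic bucketing: the same user always gets the same answer,
    # so a gradual percentage rollout doesn't flicker between versions.
    bucket = zlib.crc32(f"{feature}:{user_id}".encode()) % 100
    return bucket < toggle["percent"]
```

Because the new code path is guarded rather than branched at deploy time, turning a bad feature off is a config change, not a rollback.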
The document discusses moving towards zero downtime. It summarizes Vision's solutions which aim to [1] eliminate planned and unplanned downtime, [2] quickly recover data to any point in time with zero data loss, and [3] improve service levels and comply with regulations. Vision addresses these issues through data replication, virtualized failover, application protection, and cluster integration capabilities.
The document discusses potential issues with using MTBF/MTTF as the primary reliability metric for the defense and aerospace industries. It argues that MTBF/MTTF provides an incomplete view of reliability across the entire product lifecycle and can result in overly optimistic assessments. The document proposes using an alternative metric called Bx/Lx, which specifies the life point where no more than a certain percentage (like 10%) of failures have occurred. This provides a more comprehensive view of reliability focused on early failures. Overall, the document advocates updating reliability metrics and practices to better reflect physical failure mechanisms.
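A Bx life is the time by which no more than x% of a population has failed. Under a Weibull life distribution (a common assumption the document does not itself specify), it can be computed directly from the scale and shape parameters; the numbers below are illustrative:

```python
import math

def b_life(x, eta, beta):
    """Time by which a fraction x of units has failed, for a Weibull
    distribution with scale eta and shape beta (illustrative parameters).
    Solves F(t) = 1 - exp(-(t/eta)**beta) = x for t."""
    return eta * (-math.log(1.0 - x)) ** (1.0 / beta)

# B10 life: no more than 10% of units have failed by this point.
b10 = b_life(0.10, eta=10000, beta=1.5)

# With beta = 1 (constant failure rate, where MTTF equals eta), the B10
# life is only about 10.5% of the MTTF -- which is the document's point:
# MTBF/MTTF says little about early-life failures.
b10_exponential = b_life(0.10, eta=10000, beta=1.0)
```

The comparison makes the argument concrete: a part with a 10,000-hour MTTF can still see 10% of units fail within roughly the first 1,050 hours.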
Explains what troubleshooting is, what skills are involved, and clears up some common misconceptions. Originally designed with IT Helpdesks in mind, but it could apply to any kind of troubleshooting.
=========================
Wrote this a VERY long time ago! I always meant to revisit/revamp it, but never quite got round to it. But people seem to get value from it, so I'll leave it up :)
The New Simple: Predictive Analytics for the Mainstream
The document summarizes a presentation on predictive analytics given by Mike Watschke from SAP. The presentation covered:
- SAP's predictive analytics solution which automates data preparation, modeling, and deployment tasks.
- Questions from the analyst about whether SAP sees in-memory technology as critical, whether it has its own data preparation technology or partners, and whether the capability works with cloud, Hadoop, and streaming data.
- Upcoming topics for future briefing room presentations, including business intelligence/analytics in March, big data in April, and cloud in May.
The document discusses techniques for achieving zero downtime deployments. It begins with an introduction and overview before covering specific methods such as blue-green deployments, canary releases, and rolling deployments. It also provides details on tools that can be used and considerations for deploying to web servers and databases. The document advocates combining different techniques into a hybrid 1/10/100 approach for deploying code changes to environments in a phased manner to minimize risk.
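One plausible reading of the hybrid 1/10/100 approach is deploying in progressively larger waves: one host, then ten, then the rest, with verification between waves. The function below is a sketch of that interpretation, not the document's tooling:

```python
def rollout_phases(hosts, phase_sizes=(1, 10, 100)):
    """Split a host list into progressively larger deployment waves.
    A guess at the 1/10/100 idea: 1 host (canary), then 10, then 100,
    then everything remaining; stop and roll back if any wave fails."""
    phases, start = [], 0
    for size in phase_sizes:
        if start >= len(hosts):
            break
        phases.append(hosts[start:start + size])
        start += size
    if start < len(hosts):
        phases.append(hosts[start:])  # final wave: all remaining hosts
    return phases

# 150 hypothetical web servers split into 4 waves: 1, 10, 100, 39.
waves = rollout_phases([f"web{i:03d}" for i in range(150)])
```

Each wave gives a checkpoint where monitoring can catch a bad release while its blast radius is still small.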
Obstacles encountered by teams are logged on obstacle boards at three levels - team, management, and executive. At the team level, the Scrum Master tries to resolve obstacles and logs them on a physical board. Unresolved obstacles are escalated to the management level board where managers work to find solutions. Obstacles that cannot be resolved by management are escalated to the executive level board where executives are responsible for resolving or dismissing them.
The document discusses strategies for deploying and releasing applications, including creating a release strategy, release plans, and managing the test and release process. It recommends stakeholders meet to define responsibilities, environments, deployment tools, and other factors. The release strategy should describe the deployment pipeline and processes for testing, approvals, and moving builds between environments. The release plan details automated steps for initial deployment, rollbacks, upgrades, and other lifecycle events. Tools can help model and manage moving builds through approval gates to different test stages and production.
This document discusses implementing reliability strategies and engineering. It begins by explaining the importance of reliability in fields like aviation, defense, and energy where failure could lead to dangerous situations. It then discusses mechanical reliability and common failure modes. Reliability engineering is introduced as the study of reliability and life-cycle management. Several high-profile system failures are listed to emphasize the need for reliability in design. The document outlines various areas of reliability engineering and provides definitions of key terms. It gives examples of reliability calculations and discusses maintainability, availability, and quality. Analytical reliability techniques are also summarized, along with key points and steps to implement a reliability strategy.
Abuse of the word "reliability" has been annoying me: reliability is not linked to the submission date of a document, nor to training programs. Yes, these procedures can help improve reliability in an indirect way, but when you base your reliability program solely on them, you are not doing reliability anymore.
So I decided to express my anger in a peaceful way, and I hope it can be positive too.
To that end, I'll start writing a post I'll call "Real Reliability" to bust the myths around reliability, and I'll start with my first enemy: MTBF.
This is for everyone who is fed up with the wrong usage of "reliability".
10 Things an Operations Supervisor can do Today to Improve Reliability
Continuing the series that started with maintenance technicians and supervisors, if you are new to the position of Operations Supervisor, what are some of the things you can begin working on immediately to improve reliability within the area you work?
The document discusses the importance of including equipment operators in Reliability Centered Maintenance (RCM) analysis. Operators play a key role by identifying important failure modes that others may overlook related to equipment operation. They can provide valuable details in failure effect statements about process impacts. Operators also help determine downtime from failures and identify mitigation tasks, such as process monitoring, that are effective at improving reliability. The document argues that excluding operators results in an incomplete RCM analysis.
You sometimes wonder: is reliability the same as availability? Here's a sample showing two ways to calculate availability. (They are not the same, but at times we think so.)
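Two common ways to calculate availability, sketched below with illustrative numbers (the document does not give specific figures): inherent availability, built from MTBF and MTTR, counts only repair time as downtime; operational availability, built from observed hours, counts all downtime whatever the cause. The same asset can score very differently on the two.

```python
def inherent_availability(mtbf, mttr):
    """Design-level availability: only active repair time counts as downtime."""
    return mtbf / (mtbf + mttr)

def operational_availability(uptime, total_time):
    """Observed availability: all downtime counts (logistics, waiting, repair)."""
    return uptime / total_time

# Same hypothetical asset, two answers (hours):
a_i = inherent_availability(mtbf=500, mttr=5)                  # ~0.990
a_o = operational_availability(uptime=8400, total_time=8760)   # ~0.959
```

Inherent availability flatters the asset because delays waiting for parts, crew, or scheduling never appear in MTTR; operational availability is what the plant actually experiences.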
The document discusses software availability and resiliency. It defines availability as the percentage of time a system is up and running. High availability systems aim for 99.999% uptime or less than 5 minutes of downtime per year. The document advocates for a reactive, message-driven approach to building resilient systems that can withstand failures through isolation, asynchronous communication, failure management techniques like circuit breakers and supervisors, and redundancy. The goal is to design systems that can continue processing transactions even when failures occur.
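The circuit-breaker technique mentioned above can be sketched minimally: after enough consecutive failures the circuit opens and further calls fail fast, protecting both caller and the struggling dependency; after a timeout, one trial call is allowed through. This is an illustrative sketch with invented names, not any particular library's API:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after max_failures consecutive
    failures, fails fast while open, allows one trial call (half-open)
    after reset_timeout seconds."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

Failing fast is what turns a slow, cascading dependency failure into a quick, contained one, which is the isolation property the reactive approach relies on.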
The Seven Deadly Sins in Measuring Asset Reliability
Most companies don’t measure mean time between failures (MTBF), even though it’s the most basic measurement that quantifies reliability. MTBF is the average time an asset functions before it fails. So, why don’t they measure MTBF? Let’s define reliability first before we go any further.
Reliability: The ability of an item to perform a required function under stated conditions for a stated period of time
So why don’t we measure mean time between failures? This article discusses that question.
Draft comparison of electronic reliability prediction methodologies
A draft version of the paper that was eventually published as J.A. Jones & J.A. Hayes, "A comparison of electronic-reliability prediction models", IEEE Transactions on Reliability, June 1999, Volume 48, Number 2, pp. 127-134.
Provided with the kind permission of the author, J.A. Jones.
MTBF is often misused and can be misleading. It is calculated as the average time between failures of a system, but does not represent the actual duration of failure-free periods. A better metric is reliability (R(t)), which shows the probability that a system will operate at a given time. Additionally, the document notes that MTBF is intended for repairable systems, while MTTF is a more accurate term for non-repairable systems, as it is calculated the same way as MTBF under certain assumptions about repair times and part lifetime distributions.
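The point that MTBF is not a failure-free period falls out of the reliability function directly. Under the constant-failure-rate (exponential) assumption that MTBF/MTTF arithmetic relies on, R(t) = exp(-t / MTBF), and surviving one full MTBF interval happens only about 37% of the time:

```python
import math

def reliability(t, mtbf):
    """Probability a unit is still working at time t, assuming a constant
    failure rate (exponential life distribution) -- the usual assumption
    under which MTBF/MTTF arithmetic holds."""
    return math.exp(-t / mtbf)

# Operating for one full MTBF interval is survived only ~37% of the time,
# which is why MTBF must not be read as a guaranteed failure-free period.
r = reliability(t=1000, mtbf=1000)  # exp(-1) ~= 0.368
```

In other words, a 1,000-hour MTBF means most units (about 63%) fail before the 1,000-hour mark, not after it.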
Tracker Lifetime Cost: MTBF, Lifetime and Other Events
Solar trackers are the foundation of a utility-scale solar plant and their reliability affects energy production, uptime, and O&M costs; significantly impacting the economics of a project. In the near future it will become increasingly important for solar asset owners and investors to take tracker reliability into consideration. For tracker vendors, providing proven reliability and overall bankability of their systems will be a critical differentiator moving forward.
Efficient Reliability Demonstration Tests - by Guangbin Yang
This document discusses efficient reliability demonstration tests that can reduce sample sizes and test times compared to conventional methods. It presents principles for test time reduction using degradation measurements during testing. Methods are provided for calculating optimal test plans that minimize costs while meeting reliability requirements and risk constraints. Decision rules are given for terminating tests early based on degradation measurements and risk estimates. An example application demonstrates how the approach can significantly reduce testing costs.
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
TubeMogul grew from a few servers to over two thousand servers handling over one trillion HTTP requests a month, each processed in less than 50 ms. To keep up with this fast growth, the SRE team had to implement an efficient continuous delivery infrastructure that allowed over 10,000 Puppet deployments and 8,500 application deployments in 2014. In this presentation, we will cover the nuts and bolts of the TubeMogul operations engineering team and how they overcame these challenges.
This document discusses various patterns for scaling web applications. It begins by describing single machine, two-tier, and multi-tier web application architectures. It then covers message bus and service-oriented architectures. The document discusses scaling up approaches like identifying bottlenecks, utilizing bottlenecks, adjusting non-bottlenecks, elevating bottlenecks, and reviewing processes. Specific scaling techniques covered include caching, threading and queues, and using content delivery networks.
This document discusses security features in Apache Kafka including SSL for encryption, SASL/Kerberos for authentication, authorization controls using an authorizer, and securing Zookeeper. It provides details on how these security components work, such as how SSL establishes an encrypted channel and SASL performs authentication. The authorizer implementation stores ACLs in Zookeeper and caches them for performance. Securing Zookeeper involves setting ACLs on Zookeeper nodes and migrating security configurations. Future plans include moving more functionality to the broker side and adding new authorization features.
Building Event-Driven Systems with Apache Kafka — Brian Ritchie
Event-driven systems provide simplified integration, easy notifications, inherent scalability and improved fault tolerance. In this session we'll cover the basics of building event-driven systems and then dive into utilizing Apache Kafka for the infrastructure. Kafka is a fast, scalable, fault-tolerant publish/subscribe messaging system developed at LinkedIn. We will cover the architecture of Kafka and demonstrate code that utilizes this infrastructure including C#, Spark, ELK and more.
Sample code: https://github.com/dotnetpowered/StreamProcessingSample
Integrating Hybrid Cloud Database-as-a-Service with Cloud Foundry’s Service ... — VMware Tanzu
SpringOne Platform 2016
Speaker: Lenley Hensarling; SVP Strategy, EnterpriseDB
Enterprises want to enable continuous delivery and deployment of their digital products while also having the necessary security, robustness, monitoring, and management of the infrastructure. EnterpriseDB is integrating its Cloud Management provisioning capability with the Cloud Foundry Service Broker to allow data services and DBA groups to create templates for robust highly available PostgreSQL deployments while not impeding the speed and agility of the developer groups they serve. We’ll discuss how database provisioning through EDB’s Cloud Management can provide responsible DevOps models for the enterprise.
This document provides guidance on scaling Apache Kafka clusters and tuning performance. It discusses expanding Kafka clusters horizontally across inexpensive servers for increased throughput and CPU utilization. Key aspects that impact performance like disk layout, OS tuning, Java settings, broker and topic monitoring, client tuning, and anticipating problems are covered. Application performance can be improved through configuration of batch size, compression, and request handling, while consumer performance relies on partitioning, fetch settings, and avoiding perpetual rebalances.
Microsoft technologies form the backbone of many Enterprise IT Infrastructures. Whether you are running Microsoft Exchange, Sharepoint, SQL Server or Active Directory; chances are you rely upon you these services for your mission critical needs. Solutions Architects and IT professionals will get an overview of the common Microsoft workloads running on AWS including approaches for server migrations, design and deployment of infrastructure services and maintenance and monitoring of those services once they are in production.
Using Oracle Database with Amazon Web Servicesguest484c12
The document discusses using Oracle Database with Amazon Web Services. It outlines Amazon EC2, which allows users to provision virtual machines in Amazon's data centers, and Amazon S3 for storing and retrieving data. It then provides steps for deploying Oracle Database Express Edition on EC2, backing up databases to S3 using Oracle Recovery Manager, and storing database files and backups in S3 for cost effective storage.
Building Cloud-Native App Series - Part 7 of 11
Microservices Architecture Series
Containers Docker Kind Kubernetes Istio
- Pods
- ReplicaSet
- Deployment (Canary, Blue-Green)
- Ingress
- Service
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...Continuent
Amazon Web Services (AWS) are gaining popularity, and for good reasons. The Amazon Relational Database Service (AWS RDS) is getting a lot of attention, also for very good reasons. It is quite a compelling idea to have on-demand data services that do not require hiring DBA staff. The expectation is set that everything works like magic and will satisfy all of your enterprise database availability needs.
If you want to build high-volume, business-critical applications, possibly with geographically-distributed audiences, you really want to think twice about using RDS. Continuent customers have a large number deployments in AWS running MySQL on AWS EC2 instances and they choose to rely upon Tungsten Clustering to provide high availability (HA) and disaster recovery (DR). We also support multi-site/multi-master operations and offer true zero-downtime MySQL operations.
AGENDA
- How does RDS handle failover? (Hint: Not very quickly)
- How does RDS handle read scaling? (Hint: Not very well)
- Can you do zero-downtime maintenance with RDS? (Hint: No)
- Is RDS cheaper? (Hint: No, not really)
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...Data Con LA
With the rapid advancements in the cloud computing techniques and growing maturity of Infrastructure as Code (IaC) in the DevOps space, attaining a zero downtime while deploying the latest updates to the applications and databases, has become a new norm in the IT industry. Blue-Green deployment model is a DevOps technique which helps to achieve a zero downtime by seamlessly switching to new version(green) when it is fully ready while the original version (blue) is still running. Making duplicate copies of entire cloud infrastructure at a rapid pace is only possible when it is written as code, rather than manually configured. Terraform is a declarative programming language which helps in writing infrastructure as code.In this session -1.We will provision a new AWS VPC, public and private subnets and security groups using Terraform code.2. We will then create a Document DB cluster from scratch and create a new DB using snapshot, by running the Terraform.3. We will then demonstrate the creation of an AWS Fargate based ECS cluster using Terraform script and then run a java based micro service on it, that uses AWS Document DB as the NoSQL database. 4. We will then simulate the Blue Green deployment model by creating an AWS Lambda function which deploys the latest updates and rolls back to the older version with zero downtime.
As a Service: Cloud Foundry on OpenStack - Lessons LearntAnimesh Singh
According to OpenStack users survey, Cloud Foundry is the 2nd most popular workload on OpenStack. You want to deploy Cloud Foundry on OpenStack or already have. What's next?
Cloud Foundry continues to evolve with revolutionary changes, e.g move from bosh-micro to bosh-init, using the new eCPI, move to Diego etc.
Same with OpenStack, e.g changes from Keystone v2 to v3, from Liberty to Mitaka, network plugins changes etc. Both IaaS and PaaS layers are changing frequently. How do you do in-place updates/upgrades/operational tasks without impacting user experience at both the layers?
In this talk will discuss our lessons learnt operating hybrid Cloud Foundry deployments on top of OpenStack over the last two years and how we used underlying technologies to seamlessly operate them
Multi-master, multi-region MySQL deployment in Amazon AWSContinuent
MySQL data rules the cloud, but recent experience shows us that there's no substitute for maintaining copies of data, across availability zones and regions, when it comes to Amazon Web Services (AWS) data resilience.
In this webinar, we discuss the multi-master capabilities of Continuent Tungsten to help you build and manage systems that spread data across multiple sites. We cover important topics such as setting up large scale topologies, handling failures, and how to handle data privacy issues like removing personally identifiable information or handling privacy law restrictions on data movement. We will conclude with a live demonstration of a distributed MySQL solution with Continuent Tungsten clusters working across multiple AWS availability zones and regions.
How to build a Citrix infrastructure on AWSDenis Gundarev
This document summarizes Denis Gundarev's presentation on how to build a Citrix infrastructure in the Amazon Web Services (AWS) cloud. The presentation covered:
- An overview of AWS services like EC2, S3, VPC, RDS, and how to monitor with CloudWatch
- Common Citrix deployment architectures on AWS like using NetScaler and AutoScaling
- Limitations of running Citrix on AWS like lack of capacity management and client OS support
- Guidelines for deploying Citrix on AWS like starting simple, proper sizing, and careful VPC planning
Migrating Oracle workloads to Azure requires understanding the workload and hardware requirements. It is important to analyze the workload using the Automatic Workload Repository (AWR) report to accurately size infrastructure needs. The right virtual machine series and storage options must be selected to meet the identified input/output and capacity needs. Rather than moving existing hardware, the focus should be migrating the Oracle workload to take advantage of cloud capabilities while ensuring performance and high availability.
Container Orchestration with Docker Swarm and KubernetesWill Hall
This presentation covers the basics of what container orchestration is providing pros and cons of Docker Swarm, Kubernetes and Amazon ECS and outlining the terms and tools you will need to successfully use them.
AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global A...Amazon Web Services
Building and evolving a pervasive, global service requires a multi-disciplined approach that balances requirements with service availability, latency, data replication, compute capacity, and efficiency. In this session, we’ll follow the Netflix journey of failure, innovation, and ubiquity. We'll review the many facets of globalization and then delve deep into the architectural patterns that enable seamless, multi-region traffic management; reliable, fast data propagation; and efficient service infrastructure. The patterns presented will be broadly applicable to internet services with global aspirations.
The document discusses continuous delivery and zero downtime deployment. It describes automating the full deployment process from development to production multiple times a day without any downtime. This is achieved through continuous integration, maintaining deployment scripts and packages, managing database changes, and techniques like feature toggles, blue/green deployments, and state management. The goal is to enable fast and reliable releases of new features and fixes to users.
The document discusses moving towards zero downtime. It summarizes Vision's solutions which aim to [1] eliminate planned and unplanned downtime, [2] quickly recover data to any point in time with zero data loss, and [3] improve service levels and comply with regulations. Vision addresses these issues through data replication, virtualized failover, application protection, and cluster integration capabilities.
The document discusses potential issues with using MTBF/MTTF as the primary reliability metric for the defense and aerospace industries. It argues that MTBF/MTTF provides an incomplete view of reliability across the entire product lifecycle and can result in overly optimistic assessments. The document proposes using an alternative metric called Bx/Lx, which specifies the life point where no more than a certain percentage (like 10%) of failures have occurred. This provides a more comprehensive view of reliability focused on early failures. Overall, the document advocates updating reliability metrics and practices to better reflect physical failure mechanisms.
Explains what troubleshooting is, what skills are involved, and clears up some common misconceptions. Originally designed with IT Helpdesks in mind, but it could apply to any kind of troubleshooting.
=========================
Wrote this a VERY long time ago! I always meant to revisit/revamp it, but never quite got round to it. But people seem to get value from it, so I'll leave it up :)
The New Simple: Predictive Analytics for the Mainstream - Inside Analysis
The document summarizes a presentation on predictive analytics given by Mike Watschke from SAP. The presentation covered:
- SAP's predictive analytics solution which automates data preparation, modeling, and deployment tasks.
- Questions from the analyst about whether SAP sees in-memory technology as critical, whether it has its own data preparation technology or partners, and whether the capability works with cloud, Hadoop, and streaming data.
- Upcoming topics for future briefing room presentations, including business intelligence/analytics in March, big data in April, and cloud in May.
The document discusses techniques for achieving zero downtime deployments. It begins with an introduction and overview before covering specific methods such as blue-green deployments, canary releases, and rolling deployments. It also provides details on tools that can be used and considerations for deploying to web servers and databases. The document advocates combining different techniques into a hybrid 1/10/100 approach for deploying code changes to environments in a phased manner to minimize risk.
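The blue-green technique mentioned above can be illustrated with a toy router (a sketch under assumed names, not any specific tool's API):

```python
# Blue/green sketch: two identical environments; the router flips all
# traffic atomically, and the idle color stays warm for instant rollback.

class BlueGreenRouter:
    def __init__(self):
        self.environments = {"blue": "v1.0", "green": None}
        self.live = "blue"

    def deploy(self, version):
        idle = "green" if self.live == "blue" else "blue"
        self.environments[idle] = version   # install and smoke-test the idle side
        self.live = idle                    # atomic cutover, no downtime

    def serving(self):
        return self.environments[self.live]

router = BlueGreenRouter()
router.deploy("v1.1")   # green goes live, blue keeps v1.0 for rollback
router.deploy("v1.2")   # next release reuses the now-idle blue side
```

The canary and rolling variants differ mainly in how much traffic sees the new version before the full cutover.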
Obstacles encountered by teams are logged on obstacle boards at three levels - team, management, and executive. At the team level, the Scrum Master tries to resolve obstacles and logs them on a physical board. Unresolved obstacles are escalated to the management level board where managers work to find solutions. Obstacles that cannot be resolved by management are escalated to the executive level board where executives are responsible for resolving or dismissing them.
The document discusses strategies for deploying and releasing applications, including creating a release strategy, release plans, and managing the test and release process. It recommends stakeholders meet to define responsibilities, environments, deployment tools, and other factors. The release strategy should describe the deployment pipeline and processes for testing, approvals, and moving builds between environments. The release plan details automated steps for initial deployment, rollbacks, upgrades, and other lifecycle events. Tools can help model and manage moving builds through approval gates to different test stages and production.
Unit 9: Implementing the Reliability Strategy - Charlton Inao
This document discusses implementing reliability strategies and engineering. It begins by explaining the importance of reliability in fields like aviation, defense, and energy where failure could lead to dangerous situations. It then discusses mechanical reliability and common failure modes. Reliability engineering is introduced as the study of reliability and life-cycle management. Several high-profile system failures are listed to emphasize the need for reliability in design. The document outlines various areas of reliability engineering and provides definitions of key terms. It gives examples of reliability calculations and discusses maintainability, availability, and quality. Analytical reliability techniques are also summarized, along with key points and steps to implement a reliability strategy.
Misuse of the word "Reliability" has long annoyed me. Reliability is not about the submission date of a document or about training programs; yes, those procedures can help improve reliability in an indirect way, but if you base your reliability program solely on them, you are not doing reliability anymore.
So I decided to express my anger in a peaceful way, and I hope it can be a positive one too.
I'll start writing a series of posts called "Real Reliability" to bust the myths around reliability, beginning with my first enemy: "MTBF".
This is for everyone who is fed up with the wrong usage of "Reliability".
10 Things an Operations Supervisor can do Today to Improve Reliability - Ricky Smith CMRP, CMRT
Continuing the series that started with maintenance technicians and supervisors: if you are new to the position of Operations Supervisor, what are some of the things you can begin working on immediately to improve reliability within the area where you work?
The document discusses the importance of including equipment operators in Reliability Centered Maintenance (RCM) analysis. Operators play a key role by identifying important failure modes that others may overlook related to equipment operation. They can provide valuable details in failure effect statements about process impacts. Operators also help determine downtime from failures and identify mitigation tasks, such as process monitoring, that are effective at improving reliability. The document argues that excluding operators results in an incomplete RCM analysis.
You sometimes wonder: is reliability the same as availability? Here's a sample showing two ways to calculate availability. (They are not the same, though at times we treat them as if they were.)
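A minimal illustration of two such calculations (all figures invented for the example):

```python
# Two common ways to calculate availability; they answer different
# questions and rarely give the same number.

mtbf = 1000.0   # mean time between failures, hours
mttr = 2.0      # mean time to repair, hours

# 1) Inherent availability, derived from reliability figures:
a_inherent = mtbf / (mtbf + mttr)                              # ~0.998

# 2) Operational availability, from observed uptime over a year:
total_hours = 24 * 365.0                                       # 8760 h
downtime_hours = 30.0                  # includes planned maintenance
a_operational = (total_hours - downtime_hours) / total_hours   # ~0.9966
```

The first figure reflects how the hardware behaves; the second is what end users actually experience, planned downtime included.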
The document discusses software availability and resiliency. It defines availability as the percentage of time a system is up and running. High availability systems aim for 99.999% uptime or less than 5 minutes of downtime per year. The document advocates for a reactive, message-driven approach to building resilient systems that can withstand failures through isolation, asynchronous communication, failure management techniques like circuit breakers and supervisors, and redundancy. The goal is to design systems that can continue processing transactions even when failures occur.
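One of the failure-management techniques named above, the circuit breaker, can be sketched as follows (a simplified model, not any particular library's implementation):

```python
# Minimal circuit breaker: after repeated failures it opens and fails
# fast instead of hammering a dependency that is already down.

class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.state = "closed"

    def call(self, func):
        if self.state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "open"   # stop sending traffic downstream
            raise
        self.failures = 0             # any success resets the count
        return result

breaker = CircuitBreaker(threshold=2)

def flaky():
    raise IOError("backend down")

for _ in range(2):
    try:
        breaker.call(flaky)
    except IOError:
        pass
# A third call would no longer reach the backend at all.
```

Real implementations also add a half-open state that periodically probes whether the dependency has recovered.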
Most companies don’t measure mean time between failures (MTBF), even though it’s the most basic measurement that quantifies reliability. MTBF is the average time an asset functions before it fails. So, why don’t they measure MTBF? Let’s define reliability first before we go any further.
Reliability: The ability of an item to perform a required function under stated conditions for a stated period of time
So why don't we measure Mean Time Between Failures? This article discusses this issue.
Draft comparison of electronic reliability prediction methodologies - Accendo Reliability
A draft version of the paper that was eventually published as “J.A.Jones & J.A.Hayes, ”A comparison of electronic-reliability prediction models”, IEEE Transactions on reliability, June 1999, Volume 48, Number 2, pp 127-134”
Provided with the kind permission of the author, J.A. Jones
MTBF is often misused and can be misleading. It is calculated as the average time between failures of a system, but does not represent the actual duration of failure-free periods. A better metric is reliability (R(t)), which shows the probability that a system will operate at a given time. Additionally, the document notes that MTBF is intended for repairable systems, while MTTF is a more accurate term for non-repairable systems, as it is calculated the same way as MTBF under certain assumptions about repair times and part lifetime distributions.
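The point about R(t) can be made concrete with the exponential model (this assumes a constant failure rate; the MTBF figure is illustrative):

```python
# Why MTBF is not a guaranteed failure-free period: under a constant
# failure rate, R(t) = exp(-t / MTBF), so only about 37% of units are
# still working at t = MTBF.

import math

def reliability(t, mtbf):
    """Probability that a unit survives to time t (exponential model)."""
    return math.exp(-t / mtbf)

mtbf = 50000.0                   # hours, illustrative figure
b10 = -mtbf * math.log(0.9)      # Bx-style life point: 10% of units failed

print(reliability(mtbf, mtbf))   # ~0.368: most units fail before t = MTBF
print(b10)                       # ~5268 hours, far earlier than the MTBF
```

The B10 line shows why a Bx/Lx metric focuses attention on early failures in a way a bare MTBF number cannot.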
Solar trackers are the foundation of a utility-scale solar plant and their reliability affects energy production, uptime, and O&M costs; significantly impacting the economics of a project. In the near future it will become increasingly important for solar asset owners and investors to take tracker reliability into consideration. For tracker vendors, providing proven reliability and overall bankability of their systems will be a critical differentiator moving forward.
This document discusses efficient reliability demonstration tests that can reduce sample sizes and test times compared to conventional methods. It presents principles for test time reduction using degradation measurements during testing. Methods are provided for calculating optimal test plans that minimize costs while meeting reliability requirements and risk constraints. Decision rules are given for terminating tests early based on degradation measurements and risk estimates. An example application demonstrates how the approach can significantly reduce testing costs.
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month - Nicolas Brousse
TubeMogul grew from a few servers to over two thousand servers, handling over one trillion HTTP requests a month, each processed in less than 50 ms. To keep up with this fast growth, the SRE team had to implement an efficient Continuous Delivery infrastructure that supported over 10,000 Puppet deployments and 8,500 application deployments in 2014. In this presentation, we will cover the nuts and bolts of the TubeMogul operations engineering team and how they overcame these challenges.
Design patterns for scaling web applications - Ivan Dimitrov
This document discusses various patterns for scaling web applications. It begins by describing single machine, two-tier, and multi-tier web application architectures. It then covers message bus and service-oriented architectures. The document discusses scaling up approaches like identifying bottlenecks, utilizing bottlenecks, adjusting non-bottlenecks, elevating bottlenecks, and reviewing processes. Specific scaling techniques covered include caching, threading and queues, and using content delivery networks.
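Caching, one of the scaling techniques listed above, might look like this read-through cache with a TTL (all names are hypothetical, not from the document):

```python
# Read-through cache with a TTL: hot keys are served from memory,
# expired or missing keys fall back to the slow backend.

import time

class TTLCache:
    def __init__(self, ttl_seconds, loader):
        self.ttl = ttl_seconds
        self.loader = loader          # fallback to the slow backend
        self.store = {}               # key -> (value, expiry timestamp)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self.loader(key)      # e.g. a database query
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value

cache = TTLCache(ttl_seconds=60, loader=lambda k: k.upper())
cache.get("product-1")   # miss: hits the backend
cache.get("product-1")   # hit: served from memory
```

The TTL trades freshness against backend load; a CDN applies the same idea at the network edge.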
This document discusses challenges related to virtual machine (VM) migration in cloud computing. It provides background on cloud computing and virtual machines. Key issues discussed include automated service provisioning, VM migration for server consolidation and energy management, and security challenges. The document also covers the motivation for VM migration when workload increases trigger changes in resource requirements. Methods for VM migration discussed include memory, network, and device migration techniques. Performance evaluation results of migration are presented. Migration across data centers introduces additional challenges such as increased latency. Proposed solutions discussed include encryption for security and redirection approaches to handle the increased latency.
The document discusses implementing service-oriented architecture (SOA) using web services in C++. It provides reasons for using C++, such as performance advantages and tight control over memory and CPU. It then discusses how a native web services stack can help integrate legacy C++ systems and provide new capabilities without rewriting code. The stack should support web service standards, code generation from WSDL, portability, low memory usage, security, handling binary data, interoperability, and asynchronous communication. It presents the WSF/C++ stack as fulfilling these requirements through support for standards, security, low-level control, and interoperability testing.
Rohit Yadav - The future of the CloudStack Virtual Router - ShapeBlue
Rohit Yadav presented ideas for improving the future of CloudStack virtual routers. Currently, VR upgrades cause downtime as the old VR is destroyed and a new one provisioned. Rohit proposed keeping the existing VRs and applying live patches to eliminate downtime. He also suggested containerizing non-core services and providing core services directly from the hypervisor kernel to reduce the VR footprint. Finally, Rohit discussed refactoring CloudStack's network types and implementing a network designer to support complex topologies.
Slow things down to make them go faster [FOSDEM 2022] - Jimmy Angelakos
Talk from FOSDEM 2022
It's easy to get misled into overconfidence based on the performance of powerful servers, given today's monster core counts and RAM sizes. However, the reality of high concurrency usage is often disappointing, with less throughput than one would expect. Because of its internals and its multi-process architecture, PostgreSQL is very particular about how it likes to deal with high concurrency and in some cases it can slow down to the point where it looks like it's not performing as it should. In this talk we'll take a look at potential pitfalls when you throw a lot of work at your database. Specifically, very high concurrency and resource contention can cause problems with lock waits in Postgres. Very high transaction rates can also cause problems of a different nature. Finally, we will be looking at ways to mitigate these by examining our queries and connection parameters, leveraging connection pooling and replication, or adapting the workload.
Topics:
1. Understand what we mean by high concurrency.
2. Understand ACID & MVCC in Postgres.
3. Understand how high concurrency affects Postgres performance.
4. Understand how locks/latches affect Postgres performance.
5. Understand how high transaction rates can affect Postgres.
6. Mitigation strategies for high concurrency scenarios.
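Mitigation point 6 often comes down to capping concurrent sessions. A toy connection pool illustrating the idea (a sketch only; production setups would use PgBouncer or similar):

```python
# Toy connection pool: cap concurrent database sessions and make
# extra clients wait instead of overloading Postgres.

import queue

class ConnectionPool:
    def __init__(self, size, connect):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())   # pre-open a fixed number of sessions

    def acquire(self, timeout=5.0):
        # Blocks rather than opening connection N+1; raises queue.Empty
        # if no connection frees up within the timeout.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=2, connect=lambda: object())
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)            # a waiting client could now proceed
```

Keeping the pool small is exactly the "slow things down to go faster" idea: fewer active backends means less lock and latch contention.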
- Dynomite is a framework that makes non-distributed databases distributed by adding a proxy layer, auto-sharding, replication across datacenters, and more.
- Netflix uses Dynomite to provide high availability and scalability for several internal microservices and tools like Conductor, an orchestration engine that uses Redis.
- Conductor allows defining workflows as code and executing them in a distributed and scalable way using Dynomite and Dyno Queues.
If you need to build highly performant, mission-critical, microservice-based systems following DevOps best practices, you should definitely check out Service Fabric!
Service Fabric is one of the most interesting services Azure offers today. It provides unique capabilities that outperform competing products.
We are seeing global companies start to use Service Fabric for their mission critical solutions.
In this talk we explore the current state of Service Fabric and dive deeper to highlight best practices and design patterns.
We will cover the following topics:
• Service Fabric Core Concepts
• Cluster Planning and Management
• Stateless Services
• Stateful Services
• Actor Model
• Availability and reliability
• Scalability and performance
• Diagnostics and Monitoring
• Containers
• Testing
• IoT
Live broadcast on https://www.youtube.com/watch?v=Zuxfhpab6xo
SDN & NFV Introduction - Open Source Data Center Networking - Thomas Graf
This document introduces software defined networking (SDN) and network functions virtualization (NFV) concepts. It discusses challenges with traditional networking and how SDN and NFV address these by decoupling the control and data planes, centralizing network intelligence, and abstracting the underlying network infrastructure. It then provides examples of open source SDN technologies like OpenDaylight, Open vSwitch, and OpenStack that can be used to build programmable software-defined networks and virtualized network functions.
RedisConf17 - Dynomite - Making Non-distributed Databases Distributed - Redis Labs
Dynomite is a framework that makes non-distributed databases distributed by adding a proxy layer, auto-sharding, replication across datacenters, and more. It is used at Netflix to power various services by sitting on top of Redis and providing high availability, scalability, and tunable consistency. Conductor is a workflow orchestration engine used by Netflix that stores workflow definitions and state in Dynomite to allow reusable and controllable workflow processes.
VMworld 2013: How to Replace WebSphere Application Server (WAS) with tc Server - VMworld
VMworld 2013
Kaushik Bhattacharya, Pivotal
Michel Bond, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2020/12/parallelizing-machine-learning-applications-in-the-cloud-with-kubernetes-a-case-study-a-presentation-from-amd/
For more information about edge AI and computer vision, please visit:
https://www.edge-ai-vision.com
Rajy Meeyakhan Rawther, PMTS Software Architect in the Machine Learning Software Engineering group at AMD, presents the “Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A Case Study” tutorial at the September 2020 Embedded Vision Summit.
In this talk, Rawther presents techniques for obtaining the best inference performance when deploying machine learning applications in the cloud. With the increasing use of AI in applications ranging from image classification/object detection to natural language processing, it is vital to deploy AI applications in ways that are scalable and efficient. Much work has focused on how to distribute DNN training for parallel execution using machine learning frameworks (TensorFlow, MXNet, PyTorch and others). There has been less work on scaling and deploying trained models on multi-processor systems.
Rawther presents a case study analysis of scaling an image classification application in the cloud using multiple Kubernetes pods. She explores the factors and bottlenecks affecting performance and examine techniques for building a scalable application pipeline.
Prometheus and Docker (Docker Galway, November 2015) - Brian Brazil
Brian Brazil is an engineer passionate about reliable systems who has worked at Google SRE and Boxever. He discusses Prometheus, an open source monitoring system he helped create. Prometheus offers inclusive monitoring of services, is manageable and reliable, integrates easily with other tools, and provides powerful querying and dashboards. It is efficient, scalable, and helps provide visibility into systems through its data model and labeling.
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT - OpenStack
Audience: Advanced
About: Real world lessons and war stories about Catalyst IT’s experience in rolling out an OpenStack based public cloud in New Zealand.
This presentation will provide tips and advice that may save you a lot of time, money and nights of sleep if you are planning to run OpenStack in the future. It may also bring some insights to people that are already running OpenStack in production.
Topics covered will include: selection of hardware for optimal costs, techniques that drive quality and service levels up, common deployment mistakes, in place upgrades, how to identify the maturity level of each project and decide what is ready for production, and much more!
Speaker Bio: Bruno Lago – Entrepreneur, Catalyst IT Limited
Bruno Lago is a solutions architect that has been involved with the Catalyst Cloud (New Zealand’s first public cloud based on OpenStack) from its inception. He is passionate about open source software, cloud computing and disruptive technologies.
OpenStack Australia Day - Sydney 2016
https://events.aptira.com/openstack-australia-day-sydney-2016/
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop) - Apache Apex
This presentation will introduce usage of Apache Apex for Time Series & Data Ingestion Service by General Electric Internet of things Predix platform. Apache Apex is a native Hadoop data in motion platform that is being used by customers for both streaming as well as batch processing. Common use cases include ingestion into Hadoop, streaming analytics, ETL, database off-loads, alerts and monitoring, machine model scoring, etc.
Abstract: Predix is a General Electric platform for the Internet of Things. It helps users develop applications that connect industrial machines with people through data and analytics for better business outcomes. Predix offers a catalog of services that provide core capabilities required by industrial internet applications. We will deep dive into the Predix Time Series and Data Ingestion services, leveraging the fast, scalable, highly performant, and fault-tolerant capabilities of Apache Apex.
Speakers:
- Venkatesh Sivasubramanian, Sr Staff Software Engineer, GE Predix & Committer of Apache Apex
- Pramod Immaneni, PPMC member of Apache Apex, and DataTorrent Architect
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La... - InfluxData
In this InfluxDays NYC 2019 session, Richard Laskey from the Wayfair Storefront team will share their monitoring best practices using InfluxEnterprise. These efforts are critical and help improve the user experience by driving forward site-wide improvements, establishing best practices, and driving change through many different teams.
Enter the world of cloud computing and software development with PaaS: what it takes to create a production-ready application with Heroku, and how to run it.
Scaling Connections in PostgreSQL - Postgres Bangalore (PGBLR) Meetup-2 - Mydbops
This presentation, delivered at the Postgres Bangalore (PGBLR) Meetup-2 on June 29th, 2024, dives deep into connection pooling for PostgreSQL databases. Aakash M, a PostgreSQL Tech Lead at Mydbops, explores the challenges of managing numerous connections and explains how connection pooling optimizes performance and resource utilization.
Key Takeaways:
* Understand why connection pooling is essential for high-traffic applications
* Explore various connection poolers available for PostgreSQL, including pgbouncer
* Learn the configuration options and functionalities of pgbouncer
* Discover best practices for monitoring and troubleshooting connection pooling setups
* Gain insights into real-world use cases and considerations for production environments
This presentation is ideal for:
* Database administrators (DBAs)
* Developers working with PostgreSQL
* DevOps engineers
* Anyone interested in optimizing PostgreSQL performance
Contact info@mydbops.com for PostgreSQL Managed, Consulting and Remote DBA Services
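The PgBouncer options such talks cover typically live in a small ini file. A minimal sketch, with all values illustrative rather than recommendations:

```ini
; Minimal pgbouncer.ini sketch -- all values illustrative
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling returns server connections between transactions
pool_mode = transaction
; many cheap client connections multiplexed onto few server connections
max_client_conn = 1000
default_pool_size = 20
```

The essential ratio is max_client_conn to default_pool_size: thousands of idle clients can share a few dozen real Postgres backends.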
3. Zero Downtime Architectures
● Based on a customer project with the classic JEE application stack
● Classic web applications with server side code
● HTTP based APIs
● Goals, Concepts and Implementation Techniques
● Constraints and limitations
● Development guidelines
● How these concepts can be applied to the new cutting edge technologies
● Single page JavaScript based apps
● Mobile clients
● REST APIs
● Node.js
● NoSQL stores
4. Zero Downtime Architecture?
● My database server has 99.999% uptime
● We have Tomcat cluster
● Redundant power supply
● Second Datacenter
● Load Balancer
● Distribute routes over OSPF
● Deploy my application online
● Second ISP
● Session Replication
● Monitoring
● Data Replication
● Auto restarts
5. Zero Downtime architecture: our definition
The services from the end user point of view
could be always available
6. Our Vision
Identify all sources of downtime and remove
them all
http://www.meteleco.com/wp-content/uploads/2011/09/p360.jpg
7. When could we have a downtime (unplanned)?
● Human errors
● Server node has crashed
● Power supply is broken, RAM Chip burned out, OS just crashed
● Server Software just crashed
● IO errors, software bug, tablespace full
● Network is unavailable
● Router crashed, Uplink down
● Datacenter is down
● Uplinks down (the notorious excavator :-) )
● Flood/Fire
● Air conditioning broken
● Hit by a nuke (not so often :-) )
8. When could we need a downtime (planned)?
● Replace a hardware part
● Replace a router/switch
● Firmware upgrade
● Upgrade/exchange the storage
● Configuration of the connection pool
● Configuration of the cluster
● Upgrade the cluster software
● Recover from a logical data error
● Upgrade the database software
● Deploy a new version of our software
● Move the application to another data center
9. How can we avoid downtime
● Redundancy
● Hardware, network
● Uplinks
● Datacenters
● Software
● Monitoring
● Detect exhausted resources before the application notices it
● Detect a failed node and replace it
● Software design
● Idempotent service calls
● Backwards compatibility
● Live releases
● Scalability
● Scale on more load
● Protect from attacks (e.g. DDoS)
10. Requirements for a Zero Downtime Architecture:
handling of events of failure or maintenance
Event/Application category | Online applications | Batch jobs
Failure or maintenance of an internet uplink/router/switch | Yes | Yes
Failure or maintenance of a firewall node, load balancer node or a network component | Yes | Yes
Failure or maintenance of a webserver node | Yes | N/A
Failure or maintenance of an application server node | Yes | partly (will be restarted)
Failure or maintenance of a database node | Yes | partly
Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
New application deployment | Yes | Yes
Upgrade of operating system | Yes | Yes
Upgrade of an arbitrary middleware software | Yes | Yes
Upgrade of database software | Yes | Yes
Overload of processing nodes | Yes | Yes
Failure of a single JVM | Yes | No
Failure of a node due to a leak of system resources | Yes | No
11. Our goals and constraints
● Reduce downtime to 0
● Keep the costs low
● No expensive proprietary hardware
● Minimize the potential application changes/rewrites
http://www.signwarehouse.com/blog/how-to-keep-fixed-costs-low/
12. Our Concepts 1/4
● Independent Applications or Application Groups
● One application (group) = one IP address
● Communication between applications exclusively over this IP address!
http://www.binaryguys.de/media/catalog/product/cache/1/image/313x313/9df78eab33525d08d6e5fb8d27136e95/3/6/36.noplacelikelocalhost_1_4.jpg
14. Our Concepts 3/4
● Reduce the downtime within a datacenter to 0
● High available network
● Redundant firewalls and load balancers
● Web server farms
● Application server clusters with session replication
● Oracle RAC Cluster
● Downtime free application deployments
15. Our Concepts 4/4
● Replicate the data on both datacenters
● and make the applications switchable
17. Concepts: Internet traffic, BGP (Border Gateway Protocol) 1/2
● Every datacenter has fully redundant uplinks
● Own provider independent IP address range (assigned by RIPE)
– Hard to get at the moment (but not impossible)
● Propagate these addresses to the rest of the internet through both ISPs using BGP
– Both DCs announce our addresses
– The network path of one announcement can be preferred (for cost reasons)
● Switching internet traffic
– Gracefully, by changing the preferences of the announcements (not a single TCP session is lost)
– In case of disaster the backup route is propagated automatically within seconds to minutes (depending on the internet distance)
● Protects us from connectivity problems between our ISPs and our customers' ISPs
(Diagram: both datacenters announcing the 10.8.8.0/24 range)
18. Concepts: Internet traffic, use DNS ? 2/2
● We don't use DNS for switching
● A datacenter switch based on DNS could take up to months to reach all customers and
their software (e.g. JVMs caching DNS entries, default behaviour)
● No need to restart browsers, applications and proxies on the customer side. The customer doesn't see any change at all (except that the route to us has changed)
● DNS is good for load balancing but not for High Availability!
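The JVM DNS caching mentioned above can be verified and adjusted programmatically. The following sketch caps the JVM's positive DNS cache; `networkaddress.cache.ttl` is a standard java.security networking property, while the 60-second value is only an illustration:

```java
import java.security.Security;

public class DnsCacheTtl {
    public static void main(String[] args) {
        // Many JVMs cache successful DNS lookups for a long time (forever
        // when a security manager is installed), which is why a DNS-based
        // datacenter switch may never reach long-running JVMs.
        // Capping the TTL makes the JVM re-resolve after 60 seconds.
        Security.setProperty("networkaddress.cache.ttl", "60");
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
        // prints 60
    }
}
```

Even with a lowered TTL, intermediate resolvers and proxies keep their own caches, which is why the slide prefers routing-based switching over DNS.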
19. Concepts: Internal traffic
● OSPF (Open Shortest Path First) protocol for dynamic routing
● Deals with redundant paths completely transparently
● Can also do load balancing
● The second level firewalls (in front of the load balancers) announce the address to the rest of the routers
● To switch the processing of a service, its firewall just has to announce the route (could also be a /32) with a higher priority; after a second the traffic goes through the new route
● Could also be used for an unattended switch of the whole datacenter
– Just announce the same IPs from both sites with different priorities
– If one datacenter dies there are only announcements from the other one
(Diagram: the same service IP 10.8.8.23 announced from both datacenters)
20. Our Concepts
● Independent Applications or Application Groups
● Independent internet and internal network traffic
● Reduce Downtime within a DC
● Replicate the data between the DCs and make the application switchable
21. Zero Downtime within a datacenter
● High available network
– Redundant switches, again using Spanning Tree Protocol
● Redundant firewalls, routers, load balancers
– Active/Passive clusters
– VRRP protocol implemented by keepalived
– iptables with conntrackd
● Apache web server farms
– Managed by the load balancer
● Application server cluster
– Weblogic cluster with session replication, automatic retries and restarts
● Oracle RAC database cluster
● Deployment without downtime
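The VRRP failover mentioned above is typically a short keepalived fragment like the following sketch (interface name, router id, priority and virtual IP are illustrative):

```
vrrp_instance FW_VIP {
    state MASTER          # the hot node; the standby is configured as BACKUP
    interface eth0
    virtual_router_id 51
    priority 150          # the standby uses a lower priority
    advert_int 1
    virtual_ipaddress {
        10.8.8.1/24       # the firewall address that fails over
    }
}
```

If the MASTER stops sending VRRP advertisements, the BACKUP node takes over the virtual IP within a few seconds; combined with conntrackd's state synchronization, established connections survive the failover.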
22. Failover within one datacenter: Apache plugin (mod_wl)
Session ID Format: sessionid!primary_server_id!secondary_server_id
Source: http://egeneration.beasys.com/wls/docs100/cluster/wwimages/cluster-06-1-2.gif
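The session ID format above can be taken apart with a few lines of Java; the concrete ID value below is made up for illustration (real WebLogic IDs carry hashed server identifiers):

```java
public class SessionIdParser {
    public static void main(String[] args) {
        // mod_wl routes each request to the primary server encoded in the
        // session ID and fails over to the secondary if the primary is down.
        String sessionId = "AxyzRANDOM!-1234567!7654321"; // illustrative value
        String[] parts = sessionId.split("!");
        System.out.println("primary=" + parts[1] + ", secondary=" + parts[2]);
        // prints primary=-1234567, secondary=7654321
    }
}
```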
23. Development guidelines (HTTPSession)
● If you need a session then you most probably want to replicate it
● Example (weblogic.xml)
● Generally all requests of one session go to the same application instance
● When it fails (answers with 50x, dies or does not answer within a given period) the backup instance takes over
● The session attributes are only replicated to the backup node when HTTPSession.setAttribute was called. HTTPSession.getAttribute("foo").changeSomething() will not be replicated!
● Every attribute stored in the HTTPSession must be serializable!
● The ServletContext will not be replicated in any case.
● If you implement caches they will probably have different contents on every node (unless we use a 3rd party cluster aware cache). Probably the best practice is not to rely on the data being present and to declare the cache transient
● Keep the session small in size and reattach attributes regularly.
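The weblogic.xml example referred to above could look like the following sketch (the descriptor elements are standard WebLogic ones; `replicated_if_clustered` enables in-memory session replication only when the server actually runs in a cluster):

```xml
<weblogic-web-app xmlns="http://www.bea.com/ns/weblogic/90">
    <session-descriptor>
        <!-- Replicate HTTP sessions in memory across the cluster -->
        <persistent-store-type>replicated_if_clustered</persistent-store-type>
    </session-descriptor>
</weblogic-web-app>
```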
24. Development guidelines (cluster handling)
● Return proper HTTP return codes to the client
● Common practice is to return a well formed error page with HTTP code 200
● That is a good practice only if you are sure that the cluster is incapable of recovering from the error (example: a missing page will be missing on the other node too)
● But an exhausted resource (like heap or a datasource) could be available on the other node
● It is hard to implement this yourself, therefore Weblogic offers you help:
– You can bind the number of execution threads to a datasource capacity
– Shut down the node if an OutOfMemoryError occurs, but use it with extreme care!
● Design for idempotence
● Make all your methods idempotent as far as possible.
● For those that cannot be idempotent (e.g. sendMoney(Money money, Account account)) prevent re-execution:
– By using a ticketing service
– By declaring it as not idempotent:
<LocationMatch /pathto/yourservlet>
SetHandler weblogic-handler
Idempotent OFF
</LocationMatch>
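The ticketing-service idea above can be sketched as follows. This is a minimal in-memory sketch with invented names; a real ticketing service in this architecture would persist consumed tickets in the shared database so that retries on another node are also detected:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Makes a non-idempotent operation safe against automatic retries: each
// business request carries a unique ticket; a replayed ticket is detected
// and the operation is skipped.
public class TicketingService {
    private final Set<String> usedTickets = ConcurrentHashMap.newKeySet();

    // Returns true if the ticket was fresh and the operation may run,
    // false if this is a replay of an already-executed request.
    public boolean tryConsume(String ticket) {
        return usedTickets.add(ticket);
    }

    public static void main(String[] args) {
        TicketingService tickets = new TicketingService();
        if (tickets.tryConsume("tx-4711")) {
            System.out.println("executing sendMoney");
        }
        // A retried request with the same ticket is ignored.
        System.out.println(tickets.tryConsume("tx-4711")); // prints false
    }
}
```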
25. Development guidelines (Datasources)
● Don't build your own connection pools; take them from the application server via JNDI lookup
● As we are using Oracle RAC, the datasource must be a multipool consisting of one single datasource per RAC node
– One can take one of the single datasources out of the multipool (online)
– Load balancing is guaranteed
– The pool can be reconfigured online
● Example Spring config:
● Example without Spring:
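The Spring example could be sketched like this (bean id and JNDI name are illustrative; `JndiObjectFactoryBean` is the standard Spring class for JNDI lookups). Without Spring, the same lookup is a plain `new InitialContext().lookup("jdbc/MyMultiPool")` cast to `javax.sql.DataSource`:

```xml
<!-- Look up the multipool datasource managed by the application server -->
<bean id="dataSource" class="org.springframework.jndi.JndiObjectFactoryBean">
    <property name="jndiName" value="jdbc/MyMultiPool"/>
    <!-- Resolve relative to the web app's java:comp/env naming context -->
    <property name="resourceRef" value="true"/>
</bean>
```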
26. Basic monitoring
● Different possibilities for monitoring on Weblogic
● Standard admin console
– Threads (stuck, in use, etc), JVM (heap size, usage etc.), online thread dumps
– Connection pools statistics
– Transaction manager statistics
– Application statistics (per servlet), WorkManager statistics
● Diagnostic console
– Online monitoring only
– All attributes exposed by Weblogic Mbeans can be monitored
– Demo: diagnostics console
● Diagnostic images
– On demand, on shutdown, regularly
– Useful for problem analysis (especially for post-crash analysis)
– For analysing resource leaks: Demo: analyse a connection leak and a stuck thread
● SNMP and diagnostic modules
– All MBean attributes can be monitored by SNMP
– Gauge, string, counter monitors, log filters, attribute changes
– Collected metrics, watches and notifications
27. Zero downtime deployment
● 2 clusters within one datacenter
● Managed by the Apache LB (a simple script based on the session ID)
● Both are active during normal operations
● Before we deploy the new release we switch cluster 1 off
– Old sessions still go to both cluster 1 and 2
– New sessions go to cluster 2 only
● When all sessions on cluster 1 have expired we deploy the new version
● Test it
● If everything is OK, we put cluster 1 back into the Apache load balancer
● Now we take cluster 2 off
– Until all its sessions expire
– Then the same procedure as above
● Then we deploy in the second datacenter
28. Our Concepts
● Independent Applications or Application Groups
● Independent internet and internal network traffic
● Reduce/avoid Downtime within a DC
● Replicate the data between the DCs and make
the application switchable
29. Our requirements again
Event/Application category | Online applications | Batch jobs
Failure or maintenance of an internet uplink/router/switch | Yes | Yes
Failure or maintenance of a firewall node, load balancer node or a network component | Yes | Yes
Failure or maintenance of a webserver node | Yes | N/A
Failure or maintenance of an application server node | Yes | partly (will be restarted)
Failure or maintenance of a database node | Yes | partly
Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
New application deployment | Yes | Yes
Upgrade of operating system | Yes | Yes
Upgrade of an arbitrary middleware software | Yes | Yes
Upgrade of database software | Yes | Yes
Overload of processing nodes | Yes | Yes
Failure of a single JVM | Yes | No
Failure of a node due to a leak of system resources | Yes | No
30. Replicate the data between the DCs
● Bidirectional data replication between DCs
● Oracle Streams / Oracle GoldenGate
http://docs.oracle.com/cd/E11882_01/server.112/e10705/man_gen_rep.htm#STREP013
32. Application groups
● One or more applications without hard dependencies to or from other applications
● Why application groups?
– Switching many applications at once leads to long downtimes and higher risk
– Switching a single one is not possible if there are hard dependencies on the database level to other applications
● Identify groups of applications that are critically dependent on each other but not on other applications outside the group
● Always switch such groups at once
● The bigger the group, the longer the downtime
– A single application in the category HA will be able to switch without any downtime, just delayed requests
● A dependency is critical (hard) if it leads to issues (editing the same record in different DCs will definitely be problematic, reading data for reporting is not)
– Must be identified on a case-by-case basis
35. Example of a switch procedure of an application group
36. Applications: Limitations
Limitation/Categories
No bulk transactions
No DB sequences
No file based sequences
No shared file system storage
Use a central batch system
All new releases have to be compatible with the previous release.
Stick to the infrastructure
37. Our Concepts
● Independent Applications or Application Groups
● Independent internet and internal network traffic
● Reduce/avoid Downtime within a DC
● Replicate the data between the DCs and make
the application switchable
38. Our requirements once again
Event/Application category | Online applications | Batch jobs
Failure or maintenance of an internet uplink/router/switch | Yes | Yes
Failure or maintenance of a firewall node, load balancer node or a network component | Yes | Yes
Failure or maintenance of a webserver node | Yes | N/A
Failure or maintenance of an application server node | Yes | partly (will be restarted)
Failure or maintenance of a database node | Yes | partly
Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
New application deployment | Yes | Yes
Upgrade of operating system | Yes | Yes
Upgrade of an arbitrary middleware software | Yes | Yes
Upgrade of database software | Yes | Yes
Overload of processing nodes | Yes | Yes
Failure of a single JVM | Yes | No
Failure of a node due to a leak of system resources | Yes | No
40. Modern Architectures: Application Layer
● Web apps
● Completely independent of the backend
● Using only REST APIs
● 90% of the state is locally managed (supported by frameworks like AngularJS and BackboneJS)
● Must be compatible with different versions of the REST API (at least 2 versions)
● If websockets are used it is more tricky, see backend.
● New mobile versions managed by app stores
● Good to have an upgrade reminder (to limit the supported versions)
● REST API must be versioned and backwards compatible
● Messaging over message clouds is transparent. HA is managed by the vendors
● Stateful services
● e.g. OAuth v1/v2
– Normally handled by DB persistence
41. Session Replication
● Less needed than with server side applications
● Frameworks like AngularJS, BackboneJS, Ember etc. manage their own sessions, routings etc.
● But still needed
● Weblogic: no change
● Tomcat: possibly with a JDBC store
● Jetty: with Terracotta
● Node.js: secure (digitally signed) sessions stored in cookies
– Senchalabs Connect
– Mozilla node-client-sessions
● https://hacks.mozilla.org/2012/12/using-secure-client-side-sessions-to-build-simple-and-scalable-node-js-applications-a-node-js-holiday-season-part-3/
42. Backend: Bidirectional Data Replication
● Elasticsearch
● Currently no cross cluster replication
● But it is on their roadmap
● CouchDB
● Very flexible replication, whether within one or across several datacenters
● Bidirectional replication is possible
● MongoDB
● One-directional replication is possible and mature
● Bidirectional is not possible at the moment
● A workaround would be: one MongoDB per app and strict separation of the apps
● Hadoop HDFS
● Currently no cross cluster replication available
● e.g. Facebook wrote their own replication for Hive
● Will possibly arrive soon with Apache Falcon http://falcon.incubator.apache.org/
44. Some pictures on this presentation were purchased from iStockphoto LP. The price paid applies for the use of the pictures within the
scope of a standard license, which includes among other things, online publications including websites up to a maximum image size of
800 x 600 pixels (video: 640 x 480 pixels).
Some icons from https://www.iconfinder.com/ are used under the Creative Commons public domain license from the following authors:
Artbees, Neurovit and Pixel Mixer (http://pixel-mixer.com)
All other trademarks mentioned herein are the property of their respective owners.
47. Key features
● 2 datacenters
● Both active (both datacenters active but probably different applications running on them)
● Independent uplinks
● Redundant interconnect
● Applications are deployed and running on both
● Application cluster in every datacenter
● Session replication within every datacenter
● Cross replication between the 2 datacenters
● e.g. with Weblogic Cluster
● Bidirectional database replication
● e.g. 2 independent Oracle RAC in each datacenter
● Replication over streams/Golden Gate
● Monitoring of all critical resources
● Hardware nodes
● Connection pools
● JVM heaps
● Application switch
48. Concepts: other network components
● Firewalls
● First level firewalls
– Cisco routers
– Stateless firewalls
– Not very restrictive
● Second level firewalls (in front of the application load balancers)
– Should be stateful
– Based on Linux/iptables with conntrackd (for failover)
– Stateful, with connection tracking
– Very restrictive
– Rate limiting of new connections (DoS or Slashdot effect)
● All firewalls will be/are in active/hot-standby mode.
● On a controlled failover (both are running and we switch them) not a single TCP connection should be affected (except small delays)
● In the disaster case it takes some seconds until the cluster software detects the crash of the node and initiates the failover. No TCP connections should be lost, but there is a very small risk.
49. Example of a switch procedure of an application group
● Preparation steps
● Check the health of the replication processes.
● Stop all batch applications (by stopping the job scheduling system). If the time pressure for the switch is high, just kill all running jobs (they should be restartable anyway).
● Switch off the keepalive feature on all httpd servers
● Switching steps
● Change the firewall rules on the second layer firewalls so that any new connection request (SYN flag set) is dropped.
● Wait until the data is synchronized on both sides (e.g. by monitoring a heartbeat table) and no more httpd processes are active.
● Switch the application traffic to the other DC (by changing the routing of their IP addresses).
● Clean up (remove the dropping of SYN packets on the "old" site etc.)
● This procedure is repeated per application group until all applications are running
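The "drop new connections" step above can be a single iptables rule; the VIP address is illustrative (taken from the routing example earlier), and the rule matches only packets with the SYN flag, so established connections continue to drain:

```
# During the switch: drop NEW connection attempts to the application VIP,
# let established connections finish. Remove the rule in the cleanup step.
iptables -I FORWARD 1 -d 10.8.8.23 -p tcp --syn -j DROP
```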
50. Application clusters (Weblogic)
● Features of Weblogic that we use
● mod_wl
– Manages the stickiness and failover to backup nodes
– Automatic retry of failed requests
● On time-outs
● On response header 50x
● Multipools
– Gracefully remove a database node out of the pool
– Gracefully change parameters of connection pools
– Guaranteed balance of connections between database nodes
● Binding execution threads to connection pools
● Auto shutdown (+ restart) of nodes on OutOfMemoryError
● Session replication (also over both DCs)
● Thread monitoring (detect dead or long running threads etc.)
● Diagnostic images and alarms
52. Deployment of connection pools
● One datasource per Oracle RAC node
● Set the initial capacity to a value that will be sufficient for the usual load for the application
– Creation of new connections is expensive
● Set the max capacity to a value that will be sufficient in a high load scenario
– The overall number of connections should match the limit of connections on the database side
● Set JDBC parameters in the connection pool and not globally (e.g. v8compatibility=true)
● Check connections on reserve
● You can set DB session parameters in the init SQL property (e.g. alter session set NLS_SORT='GERMAN')
● Enable 2-phase commit only if you need it (it is expensive)
● Prepared statement caching does not bring much performance (at least for Oracle databases) but costs open cursors in the database (per connection!), so don't use it unless you have a very good reason to do it.
● One Multipool containing all single datasources for one database
● Strategy: load balancing
Editor's Notes
reduce downtime to 0
keep the costs low
use linux
use x64 hw
SW Licenses as low as possible
Minimize changes of applications