Zero Downtime Architectures
Alexander Penev
ByteSource Technology Consulting GmbH
Neubaugasse 43
1070, Vienna
Austria
whoami
Alexander Penev
Email: alexander.penev@bytesource.net
Twitter: @apenev
@ByteSourceNet
JEE, Databases, Linux, TCP/IP
Fan of (automatic) testing, TDD, ADD, BDD, ...
Like to design highly available and scalable systems :-)
Zero Downtime Architectures
● Based on a customer project with the classic JEE application stack
● Classic web applications with server-side code
● HTTP-based APIs
● Goals, Concepts and Implementation Techniques
● Constraints and limitations
● Development guidelines
● How these concepts can be applied to new cutting-edge technologies
● Single-page JavaScript apps
● Mobile clients
● REST APIs
● Node.js
● NoSQL stores
Zero Downtime Architecture?
● My database server has 99.999% uptime
● We have Tomcat cluster
● Redundant power supply
● Second Datacenter
● Load Balancer
● Distribute routes over OSPF
● Deploy my application online
● Second ISP
● Session Replication
● Monitoring
● Data Replication
● Auto restarts

Zero Downtime architecture: our definition
From the end user's point of view, the services are always available
Our Vision
Identify all sources of downtime and remove all of them
http://www.meteleco.com/wp-content/uploads/2011/09/p360.jpg
When could we have a downtime (unplanned)?
● Human errors
● Server node has crashed
● Power supply is broken, RAM Chip burned out, OS just crashed
● Server Software just crashed
● IO errors, software bug, tablespace full
● Network is unavailable
● Router crashed, Uplink down
● Datacenter is down
● Uplinks down (the notorious excavator :-) )
● Flood/Fire
● Air conditioning broken
● Hit by a nuke (not so often :-) )
When could we need a downtime (planned)?
● Replace a hardware part
● Replace a router/switch
● Firmware upgrade
● Upgrade/exchange the storage
● Configuration of the connection pool
● Configuration of the cluster
● Upgrade the cluster software
● Recover from a logical data error
● Upgrade the database software
● Deploy a new version of our software
● Move the application to another data center

How can we avoid downtime
● Redundancy
● Hardware, network
● Uplinks
● Datacenters
● Software
● Monitoring
● Detect exhausted resources before the application notices it
● Detect a failed node and replace it
● Software design
● Idempotent service calls
● Backwards compatibility
● Live releases
● Scalability
● Scale on more load
● Protect from attacks (e.g. DDoS)
Requirements for a Zero Downtime Architecture: handling of events of failure or maintenance
Event / application category | Online applications | Batch jobs
Failure or maintenance of an internet uplink/router/switch | Yes | Yes
Failure or maintenance of a firewall node, load balancer node or a network component | Yes | Yes
Failure or maintenance of a webserver node | Yes | N/A
Failure or maintenance of an application server node | Yes | partly (will be restarted)
Failure or maintenance of a database node | Yes | partly
Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
New application deployment | Yes | Yes
Upgrade of operating system | Yes | Yes
Upgrade of an arbitrary middleware software | Yes | Yes
Upgrade of database software | Yes | Yes
Overload of processing nodes | Yes | Yes
Failure of a single JVM | Yes | No
Failure of a node due to leak of system resources | Yes | No
Our goals and constraints
● Reduce downtime to 0
● Keep the costs low
● No expensive proprietary hardware
● Minimize the potential application changes/rewrites
http://www.signwarehouse.com/blog/how-to-keep-fixed-costs-low/
Our Concepts 1/4
● Independent Applications or Application Groups
● One application (group) = one IP address
● Applications communicate with each other exclusively over this IP address!
http://www.binaryguys.de/media/catalog/product/cache/1/image/313x313/9df78eab33525d08d6e5fb8d27136e95/3/6/36.noplacelikelocalhost_1_4.jpg

Our Concepts 2/4
Treat the internet and internal traffic
independently
Our Concepts 3/4
● Reduce the downtime within a datacenter to 0
● High available network
● Redundant firewalls and load balancers
● Web server farms
● Application server clusters with session replication
● Oracle RAC Cluster
● Downtime free application deployments
Our Concepts 4/4
● Replicate the data on both datacenters
● and make the applications switchable
Implementation: Network (Layer 2)

Concepts: Internet traffic, BGP (Border Gateway Protocol) 1/2
● Every datacenter has fully redundant uplinks
● Own provider-independent IP address range (assigned by RIPE)
● Hard to get at the moment (but not impossible)
● Propagate these addresses to the rest of the internet through both ISPs using BGP
● Both DCs announce our addresses
● The network path of one announcement can be preferred (e.g. for cost reasons)
● Switch of internet traffic
● Gracefully, by changing the preferences of the announcements
– Not a single TCP session is lost
● In case of disaster the backup route is propagated automatically within seconds to minutes (depending on the internet distance)
● Protects us from connectivity problems between our ISPs and our customers' ISPs
(Figure: both datacenters send a BGP announcement for 10.8.8.0/24)
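How "preferring" one announcement works can be illustrated with a deliberately simplified model of a remote router choosing between our two announcements. This is a sketch, not BGP: it assumes the remote side compares only AS-path length (real best-path selection looks at LOCAL_PREF and other attributes first), and it assumes AS-path prepending as the way to make one DC less attractive; the slides do not say which mechanism was used. The AS numbers come from the documentation range and are placeholders. Requires Java 16+ (records).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Simplified model: a remote router picks the announcement with the shortest AS path.
// Real BGP best-path selection also compares LOCAL_PREF, origin, MED, etc.
final class BgpPathSelection {
    record Announcement(String datacenter, String prefix, List<Integer> asPath) {}

    // AS path as seen behind an ISP: the ISP's AS first, then our AS (repeated if we prepend).
    static List<Integer> pathVia(int ispAs, int ownAs, int prepends) {
        List<Integer> path = new ArrayList<>();
        path.add(ispAs);
        for (int i = 0; i <= prepends; i++) {
            path.add(ownAs);
        }
        return path;
    }

    // Shortest AS path wins; the datacenter name only breaks ties deterministically.
    static Optional<Announcement> best(Collection<Announcement> received) {
        return received.stream()
                .min(Comparator.comparingInt((Announcement a) -> a.asPath().size())
                        .thenComparing(Announcement::datacenter));
    }

    public static void main(String[] args) {
        int ownAs = 64500; // documentation-range AS numbers, placeholders only
        String prefix = "10.8.8.0/24";
        // Normal operation: DC2 prepends twice, so DC1 is preferred.
        Announcement dc1 = new Announcement("DC1", prefix, pathVia(64496, ownAs, 0));
        Announcement dc2 = new Announcement("DC2", prefix, pathVia(64497, ownAs, 2));
        System.out.println(best(List.of(dc1, dc2)).orElseThrow().datacenter()); // DC1

        // Graceful switch: move the prepending from DC2 to DC1; both routes stay announced.
        dc1 = new Announcement("DC1", prefix, pathVia(64496, ownAs, 2));
        dc2 = new Announcement("DC2", prefix, pathVia(64497, ownAs, 0));
        System.out.println(best(List.of(dc1, dc2)).orElseThrow().datacenter()); // DC2

        // Disaster: DC2's announcement is withdrawn, only DC1's remains.
        System.out.println(best(List.of(dc1)).orElseThrow().datacenter()); // DC1
    }
}
```

Because both announcements stay in the routing system during a graceful switch, packets simply start taking the other path; nothing on the client side has to change.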
Concepts: Internet traffic, use DNS? 2/2
● We don't use DNS for switching
● A datacenter switch based on DNS could take up to months to reach all customers and
their software (e.g. JVMs caching DNS entries, which is default behaviour)
● No need to restart browsers, applications and proxies on the customer site. The customer
doesn't see any change at all (except that the route to us has changed)
● DNS is good for load balancing but not for high availability!
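The JVM's part of the DNS problem is its own cache of name lookups, controlled by the standard `networkaddress.cache.ttl` security property (seconds; -1 means cache forever). Historically the JDK cached forever when a SecurityManager was installed; current OpenJDK builds default to 30 seconds otherwise. The sketch only shows how a client you control could shorten the TTL; the value 60 is an arbitrary assumption, and it does nothing about the many customer JVMs, proxies and browsers you do not control, which is exactly why DNS is not used for switching here.

```java
import java.security.Security;

// JVM-wide DNS cache settings. These are standard JDK security properties,
// but the JDK may read them only once, so set them before the first name lookup.
final class DnsCacheSettings {
    static final String POSITIVE_TTL = "networkaddress.cache.ttl";
    static final String NEGATIVE_TTL = "networkaddress.cache.negative.ttl";

    static void configure(int positiveSeconds, int negativeSeconds) {
        Security.setProperty(POSITIVE_TTL, Integer.toString(positiveSeconds));
        Security.setProperty(NEGATIVE_TTL, Integer.toString(negativeSeconds));
    }

    public static void main(String[] args) {
        // null means the property is not set explicitly and the JDK default applies
        System.out.println("before: " + Security.getProperty(POSITIVE_TTL));
        configure(60, 10); // assumption: 60 s is short enough for this client
        System.out.println("after: " + Security.getProperty(POSITIVE_TTL)); // after: 60
    }
}
```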
Concepts: Internal traffic
● OSPF (Open Shortest Path First) protocol for dynamic routing
● Deals with redundant paths completely transparently
● Can also do load balancing
● The second-level firewalls (in front of the load balancers) announce the address to
the rest of the routers
● To switch the processing of a service, its firewall just has to announce the route (possibly a /32)
with a higher priority; after a second the traffic goes through the new route.
● Can also be used for an unattended switch of the whole datacenter
● Just announce the same IPs from both sites with different priorities
● If one datacenter dies, only the announcements from the other one remain
(Figure: both sites announce 10.8.8.23)
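Both switching variants rely on ordinary route selection once the routes are installed: the most specific prefix wins (so a /32 overrides a /24), and among equally specific routes the lowest cost wins, which is what a "higher priority" announcement means in OSPF terms. The sketch models only that table lookup, not OSPF itself (no link-state database, no SPF calculation, no equal-cost multipath); next-hop names like `fw-dc1` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Route lookup as a router performs it: longest prefix first, then lowest cost.
// Not an OSPF implementation.
final class RouteSelection {
    record Route(int network, int prefixLength, int cost, String nextHop) {
        static Route of(String cidr, int cost, String nextHop) {
            String[] parts = cidr.split("/");
            int length = Integer.parseInt(parts[1]);
            return new Route(toInt(parts[0]) & mask(length), length, cost, nextHop);
        }

        boolean matches(int address) {
            return (address & mask(prefixLength)) == network;
        }
    }

    static int mask(int length) {
        return length == 0 ? 0 : -1 << (32 - length); // Java shifts are mod 32, so /0 is special
    }

    static int toInt(String dotted) {
        int value = 0;
        for (String octet : dotted.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }

    static Optional<Route> select(List<Route> table, String address) {
        int addr = toInt(address);
        return table.stream()
                .filter(r -> r.matches(addr))
                .min(Comparator.comparingInt(Route::prefixLength).reversed()
                        .thenComparingInt(Route::cost));
    }

    public static void main(String[] args) {
        List<Route> table = new ArrayList<>();
        table.add(Route.of("10.8.8.0/24", 10, "fw-dc1")); // DC1 announces with the lower cost
        table.add(Route.of("10.8.8.0/24", 20, "fw-dc2"));
        System.out.println(select(table, "10.8.8.23").orElseThrow().nextHop()); // fw-dc1

        // Switch one service: DC2's firewall announces a more specific /32 for it.
        table.add(Route.of("10.8.8.23/32", 20, "fw-dc2"));
        System.out.println(select(table, "10.8.8.23").orElseThrow().nextHop()); // fw-dc2
        System.out.println(select(table, "10.8.8.24").orElseThrow().nextHop()); // fw-dc1
    }
}
```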
Our Concepts
● Independent Applications or Application Groups
● Independent Internet and internal network traffic
● Reduce Downtime within a DC
● Replicate the data between the DCs and make the application switchable

Zero Downtime within a datacenter
● Highly available network
● Redundant switches
– Again using the Spanning Tree Protocol
● Redundant firewalls, routers, load balancers
– Active/passive clusters
– VRRP protocol implemented by keepalived
– iptables with conntrackd
● Apache web server farms
● Managed by load balancer
● Application server cluster
● Weblogic cluster
● With session replication
● Automatic retries and restarts
● Oracle RAC database cluster
● Deployment without downtime
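The active/passive firewall and load balancer pairs use VRRP to decide which node owns the shared virtual IP. The election rule itself is small (RFC 5798): among the nodes still advertising, the highest priority becomes master, and a tie goes to the higher primary IP address. The sketch models only that rule; advertisement timers, preemption and the actual keepalived configuration are left out, and node names, addresses and priorities are assumptions.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Models the outcome of a VRRP master election only (no timers, no preemption settings).
final class VrrpElection {
    record Node(String name, int priority, String primaryIp, boolean alive) {}

    static long ipToLong(String dotted) {
        long value = 0;
        for (String octet : dotted.split("\\.")) {
            value = (value << 8) | Long.parseLong(octet);
        }
        return value;
    }

    // Highest priority wins; on equal priority the higher primary IP address wins.
    static Optional<Node> master(List<Node> nodes) {
        return nodes.stream()
                .filter(Node::alive)
                .max(Comparator.comparingInt(Node::priority)
                        .thenComparingLong(n -> ipToLong(n.primaryIp())));
    }

    public static void main(String[] args) {
        Node lb1 = new Node("lb1", 150, "10.8.0.2", true);
        Node lb2 = new Node("lb2", 100, "10.8.0.3", true);
        System.out.println(master(List.of(lb1, lb2)).orElseThrow().name()); // lb1

        // lb1 fails or is taken down for maintenance: lb2 takes over the virtual IP.
        Node lb1Down = new Node("lb1", 150, "10.8.0.2", false);
        System.out.println(master(List.of(lb1Down, lb2)).orElseThrow().name()); // lb2
    }
}
```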
Failover within one datacenter: Apache plugin (mod_wl)
Session ID format: sessionid!primary_server_id!secondary_server_id
Source: http://egeneration.beasys.com/wls/docs100/cluster/wwimages/cluster-06-1-2.gif
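The session ID format is what lets the plugin fail over without shared state in the web tier: the cookie itself names the primary server and the secondary (replica) server. The routing decision can be sketched as below. This is a simplified model under assumptions: server IDs are plain strings here (the real plugin uses WebLogic's own server identifiers and also handles retries and timeouts), and `StickyFailoverRouter` is a hypothetical name.

```java
import java.util.List;
import java.util.Set;

// Simplified routing for cookies of the form sessionid!primary_server_id!secondary_server_id.
final class StickyFailoverRouter {
    private final Set<String> aliveServers;
    private final List<String> clusterServers;

    StickyFailoverRouter(Set<String> aliveServers, List<String> clusterServers) {
        this.aliveServers = aliveServers;
        this.clusterServers = clusterServers;
    }

    String route(String sessionCookie) {
        String[] parts = sessionCookie.split("!");
        // 1. Sticky: the primary server holds the session.
        if (parts.length > 1 && aliveServers.contains(parts[1])) {
            return parts[1];
        }
        // 2. Failover: the secondary server holds the replicated copy.
        if (parts.length > 2 && aliveServers.contains(parts[2])) {
            return parts[2];
        }
        // 3. New session, or both holders are gone (session state is lost): any live member.
        return clusterServers.stream()
                .filter(aliveServers::contains)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no live server in the cluster"));
    }

    public static void main(String[] args) {
        List<String> cluster = List.of("ms1", "ms2", "ms3");
        StickyFailoverRouter allUp = new StickyFailoverRouter(Set.of("ms1", "ms2", "ms3"), cluster);
        System.out.println(allUp.route("a1b2c3!ms1!ms2")); // ms1
        StickyFailoverRouter ms1Down = new StickyFailoverRouter(Set.of("ms2", "ms3"), cluster);
        System.out.println(ms1Down.route("a1b2c3!ms1!ms2")); // ms2
    }
}
```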
Development guidelines (HttpSession)
● If you need a session, then you most probably want to replicate it
● Example (weblogic.xml)
● Generally, all requests of one session go to the same application instance
● When it fails (answers with a 50x, dies, or does not answer within a given period), the backup instance takes over
● Session attributes are only replicated to the backup node when HttpSession.setAttribute
is called. session.getAttribute("foo").changeSomething() will not be replicated!
● Every attribute stored in the HttpSession must be serializable!
● The ServletContext is not replicated in any case.
● If you implement caches, they will probably have different contents on every node (unless you
use a third-party cluster-aware cache). The best practice is probably not to rely on the data being
present and to declare the cache transient
● Keep the session small and regularly reattach modified attributes (call setAttribute again).
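The setAttribute rule is the one most easily violated, so here is a self-contained model of it. `ReplicatedSession` is a hypothetical stand-in, not the Servlet API: it copies an attribute to the backup (by serializing it) only when `setAttribute` is called, which is the behaviour the slide describes. Taking `Serializable` as the parameter type also turns the "must be serializable" rule into a compile-time check; the real `HttpSession.setAttribute` accepts any `Object`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a replicated session: only setAttribute pushes a copy to the backup.
final class ReplicatedSession {
    private final Map<String, Object> primary = new HashMap<>();
    private final Map<String, Object> backup = new HashMap<>();

    void setAttribute(String name, Serializable value) {
        primary.put(name, value);
        backup.put(name, copyViaSerialization(value)); // the backup only sees serialized snapshots
    }

    Object getAttribute(String name) { return primary.get(name); }

    Object getBackupAttribute(String name) { return backup.get(name); }

    private static Object copyViaSerialization(Serializable value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("attribute cannot be replicated", e);
        }
    }
}

final class Cart implements Serializable {
    private static final long serialVersionUID = 1L;
    final List<String> items = new ArrayList<>();
}

final class SessionReplicationDemo {
    public static void main(String[] args) {
        ReplicatedSession session = new ReplicatedSession();
        session.setAttribute("cart", new Cart());

        // Pitfall: mutating the object obtained via getAttribute is NOT replicated.
        ((Cart) session.getAttribute("cart")).items.add("book");
        System.out.println(((Cart) session.getBackupAttribute("cart")).items.size()); // 0

        // Fix: reattach the attribute after the change.
        session.setAttribute("cart", (Cart) session.getAttribute("cart"));
        System.out.println(((Cart) session.getBackupAttribute("cart")).items.size()); // 1
    }
}
```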
Development guidelines (cluster handling)
● Return proper HTTP status codes to the client
● A common practice is to return a well-formed error page with HTTP code 200
● This is fine only if you are sure that the cluster cannot recover from the error (example: a
missing page will be missing on the other node too)
● But an exhausted resource (like heap or a datasource) may still be available on the other node, so the request should fail over
● This is hard to implement, therefore Weblogic offers you help:
● You can bind the number of execution threads to a datasource capacity
● Shut down the node if an OutOfMemoryError occurs, but use this with extreme care!
● Design for idempotence
● Make all your methods idempotent as far as possible.
● For those that cannot be idempotent (e.g. sendMoney(Money money, Account account)) prevent re-execution:
– By using a ticketing service
– By declaring it as not idempotent:
<LocationMatch /pathto/yourservlet>
    SetHandler weblogic-handler
    Idempotent OFF
</LocationMatch>
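The ticketing approach can be sketched like this: the client obtains a ticket before calling the non-idempotent operation and sends it with every attempt, so a retry triggered by a failover returns the stored result instead of transferring the money twice. `MoneyTransferService` and its methods are hypothetical, and the in-memory map is for illustration only; in a cluster the ticket store must be shared by all nodes (for example a database table with a unique key on the ticket), otherwise a retry landing on another node would execute again.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical ticket-based guard for a non-idempotent operation.
// In-memory state for illustration only; a real cluster needs a shared, persistent ticket store.
final class MoneyTransferService {
    private final Map<String, String> receiptsByTicket = new HashMap<>();
    private final Map<String, BigDecimal> balances = new HashMap<>();
    private int executedTransfers = 0;

    String issueTicket() {
        return UUID.randomUUID().toString();
    }

    synchronized String sendMoney(String ticket, BigDecimal amount, String account) {
        String earlier = receiptsByTicket.get(ticket);
        if (earlier != null) {
            return earlier; // retry of an already executed request: do not execute again
        }
        balances.merge(account, amount, BigDecimal::add);
        executedTransfers++;
        String receipt = "transfer-" + executedTransfers;
        receiptsByTicket.put(ticket, receipt);
        return receipt;
    }

    synchronized BigDecimal balance(String account) {
        return balances.getOrDefault(account, BigDecimal.ZERO);
    }

    public static void main(String[] args) {
        MoneyTransferService service = new MoneyTransferService();
        String ticket = service.issueTicket();
        service.sendMoney(ticket, new BigDecimal("100.00"), "ACC-1");
        service.sendMoney(ticket, new BigDecimal("100.00"), "ACC-1"); // e.g. resent after a node failure
        System.out.println(service.balance("ACC-1")); // 100.00
    }
}
```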

Development guidelines (Datasources)
● Don't build your own connection pools; take them from the application server by JNDI lookup
● As we are using Oracle RAC, the datasource must be a multipool consisting of single datasources, one per RAC node
– One can take one of the single datasources out of the multipool (online)
– Load balancing is guaranteed
– The pool can be reconfigured online
● Example Spring config:
● Example without Spring:
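Without Spring, a plain JNDI lookup is enough, because the multipool configuration lives entirely in the application server. A minimal sketch, assuming the multipool is bound under the hypothetical name `jdbc/MyAppMultiPool`; outside an application server no JNDI provider is configured, so the lookup fails, which the sketch turns into an `IllegalStateException`.

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

// Looks up the container-managed (multi)pool instead of building a private connection pool.
final class DataSourceLocator {
    static final String MULTIPOOL_JNDI_NAME = "jdbc/MyAppMultiPool"; // hypothetical name

    static DataSource lookup(String jndiName) {
        try {
            InitialContext context = new InitialContext();
            try {
                return (DataSource) context.lookup(jndiName);
            } finally {
                context.close();
            }
        } catch (NamingException e) {
            throw new IllegalStateException("No DataSource bound under " + jndiName, e);
        }
    }

    // Borrow a connection and always hand it back to the pool (try-with-resources closes it).
    static boolean isReachable(DataSource dataSource) {
        try (Connection connection = dataSource.getConnection()) {
            return connection.isValid(2); // timeout in seconds
        } catch (SQLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        try {
            DataSource dataSource = lookup(MULTIPOOL_JNDI_NAME);
            System.out.println("database reachable: " + isReachable(dataSource));
        } catch (IllegalStateException e) {
            System.out.println("not running inside an application server: " + e.getMessage());
        }
    }
}
```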
Basic monitoring
● Different possibilities for monitoring on Weblogic
● Standard admin console
– Threads (stuck, in use, etc), JVM (heap size, usage etc.), online thread dumps
– Connection pools statistics
– Transaction manager statistics
– Application statistics (per servlet), WorkManager statistics
● Diagnostic console
– Online monitoring only
– All attributes exposed by Weblogic MBeans can be monitored
– Demo: diagnostics console
● Diagnostic images
– On demand, on shutdown, regularly
– Useful for problem analysis (especially for after crash analysis)
– For analysing resource leaks: Demo: analyse a connection leak and a stuck thread
● SNMP and diagnostic modules
– All MBean attributes can be monitored by SNMP
– Gauge, string, counter monitors, log filters, attribute changes
– Collected metrics, watches and notifications
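Weblogic exposes its own MBeans for thread pools, connection pools and transactions. As a portable illustration of the same idea (spot exhausted resources before the application notices), the sketch polls only the standard JVM platform MXBeans from `java.lang.management`; it is not the Weblogic diagnostics API, and the 90% heap threshold is an arbitrary assumption.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

// Polls standard platform MXBeans; a monitoring agent could expose the result via JMX or SNMP.
final class JvmHealthCheck {
    // Fraction of the maximum heap in use, or -1 if the JVM reports no maximum.
    static double heapUsageRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();
        return max <= 0 ? -1.0 : (double) heap.getUsed() / max;
    }

    // Threads deadlocked on monitors or ownable synchronizers (a typical "stuck" symptom).
    static int deadlockedThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    static boolean healthy(double maxHeapRatio) {
        double ratio = heapUsageRatio();
        boolean heapOk = ratio < 0 || ratio < maxHeapRatio;
        return heapOk && deadlockedThreadCount() == 0;
    }

    public static void main(String[] args) {
        System.out.printf("heap usage ratio: %.2f (-1 = no maximum), live threads: %d, deadlocked: %d, healthy: %b%n",
                heapUsageRatio(),
                ManagementFactory.getThreadMXBean().getThreadCount(),
                deadlockedThreadCount(),
                healthy(0.9)); // assumption: alert above 90% heap usage
    }
}
```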
Zero downtime deployment
● 2 clusters within one datacenter
● Managed by the Apache LB
● (simple script based on the session ID)
● Both are active during normal operations
● Before we deploy the new release, we
switch off cluster 1 for new sessions
● Existing sessions still go to cluster 1 or 2
● New sessions go to cluster 2 only
● When all sessions of cluster 1 have expired, we deploy
the new version
● Test it
● If everything is OK, we put it back
into the Apache load balancer
● Now we take cluster 2 off
● Until all its sessions expire
● The same procedure as above
● Then we deploy on the second datacenter
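The load balancer side of this procedure comes down to three rules: existing sessions stay on the cluster they are bound to, new sessions go only to clusters that are not being drained, and a drained cluster can be redeployed once no session is bound to it. A minimal sketch under assumptions: the slides use a simple script keyed on the session ID, whereas this model keeps an explicit session-to-cluster map and uses round robin for new sessions; all names are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical model of the balancer logic used for draining one of two clusters.
final class TwoClusterBalancer {
    enum State { ACTIVE, DRAINING }

    private final Map<String, State> clusters = new LinkedHashMap<>();
    private final Map<String, String> sessionToCluster = new HashMap<>();
    private int nextIndex = 0;

    TwoClusterBalancer(String... clusterNames) {
        for (String name : clusterNames) {
            clusters.put(name, State.ACTIVE);
        }
    }

    void setState(String cluster, State state) {
        clusters.put(cluster, state);
    }

    String route(String sessionId) {
        String bound = sessionToCluster.get(sessionId);
        if (bound != null) {
            return bound; // existing sessions keep going to their cluster, even while it drains
        }
        List<String> active = clusters.entrySet().stream()
                .filter(e -> e.getValue() == State.ACTIVE)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        if (active.isEmpty()) {
            throw new IllegalStateException("no active cluster for new sessions");
        }
        String chosen = active.get(Math.floorMod(nextIndex++, active.size())); // round robin
        sessionToCluster.put(sessionId, chosen);
        return chosen;
    }

    void sessionExpired(String sessionId) {
        sessionToCluster.remove(sessionId);
    }

    // True when no session is bound to the cluster any more, i.e. it is safe to redeploy it.
    boolean drained(String cluster) {
        return !sessionToCluster.containsValue(cluster);
    }

    public static void main(String[] args) {
        TwoClusterBalancer lb = new TwoClusterBalancer("cluster1", "cluster2");
        lb.route("s1");                                // new session, lands on cluster1
        lb.setState("cluster1", State.DRAINING);       // before deploying to cluster 1
        System.out.println(lb.route("s1"));            // cluster1 (existing session)
        System.out.println(lb.route("s2"));            // cluster2 (new session)
        lb.sessionExpired("s1");
        System.out.println(lb.drained("cluster1"));    // true: deploy, test, set ACTIVE again
    }
}
```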
Our Concepts
● Independent Applications or Application Groups
● Independent Internet and internal network traffic
● Reduce/avoid Downtime within a DC
● Replicate the data between the DCs and make the application switchable

Our requirements again (see the table under "Requirements for a Zero Downtime Architecture" above)
Replicate the data between the DCs
● Bidirectional data replication between DCs
● Oracle Streams/GoldenGate
http://docs.oracle.com/cd/E11882_01/server.112/e10705/man_gen_rep.htm#STREP013
Cross Cluster replication: 2 clusters in 2 datacenters
Application groups
● One or more applications without hard dependencies to or from other
applications
● Why application groups?
● Switching many applications at once leads to long downtimes and higher risk
● Switching a single one is not possible if it has hard dependencies at database level on
other applications
● Identify groups of applications that critically depend on each other but not on
applications outside the group
● Always switch such a group at once
● The bigger the group, the longer the downtime
– A single application in the HA category will be able to switch without any downtime, just with delayed
requests
● A dependency is critical (hard) if it leads to issues (editing the same record in different
DCs will definitely be problematic, reading data for reporting is not)
– Must be identified case by case
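Finding the application groups is, at its core, a connected-components problem: applications linked by a hard dependency (in either direction) must end up in the same group, while soft dependencies such as read-only reporting access are ignored. A minimal union-find sketch, assuming each dependency has already been classified as hard or soft case by case, as the slide requires; the application names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Groups applications into the connected components of the hard-dependency graph (union-find).
final class ApplicationGroups {
    record Dependency(String from, String to, boolean hard) {}

    static List<Set<String>> groups(Collection<String> applications, Collection<Dependency> dependencies) {
        Map<String, String> parent = new HashMap<>();
        for (String app : applications) {
            parent.put(app, app);
        }
        for (Dependency d : dependencies) {
            parent.putIfAbsent(d.from(), d.from());
            parent.putIfAbsent(d.to(), d.to());
            if (d.hard()) {
                parent.put(find(parent, d.from()), find(parent, d.to())); // union of both groups
            }
        }
        Map<String, Set<String>> byRoot = new TreeMap<>();
        for (String app : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(parent, app), root -> new TreeSet<>()).add(app);
        }
        return new ArrayList<>(byRoot.values());
    }

    private static String find(Map<String, String> parent, String app) {
        while (!parent.get(app).equals(app)) {
            parent.put(app, parent.get(parent.get(app))); // path halving keeps the trees shallow
            app = parent.get(app);
        }
        return app;
    }

    public static void main(String[] args) {
        List<String> apps = List.of("portal", "crm", "billing", "reporting", "archive");
        List<Dependency> deps = List.of(
                new Dependency("portal", "crm", true),          // both edit customer records
                new Dependency("billing", "crm", true),
                new Dependency("reporting", "billing", false)); // read-only reporting: soft
        // three groups: {billing, crm, portal}, {reporting}, {archive}
        System.out.println(groups(apps, deps));
    }
}
```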

Identify application groups
Switch application by application
Example of a switch procedure of an application group
Applications: Limitations
Limitation/Categories
No bulk transactions
No DB sequences
No file-based sequences
No shared file system storage
Use a central batch system
All new releases have to be compatible with the previous release.
Stick to the infrastructure
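The ban on DB and file-based sequences follows from the bidirectional replication: two independent sequences, one per datacenter, hand out the same numbers, and the replicated rows then collide. One common replacement, shown here only as an assumption since the slides do not say what the project used instead, is a key that is unique without coordination, e.g. a random UUID tagged with the datacenter that created it. The `DC1`/`DC2` tags are hypothetical.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

final class ReplicationSafeIds {
    // What goes wrong: each datacenter has its own sequence, and both start at 1.
    static final class LocalSequence {
        private final AtomicLong value = new AtomicLong();

        long next() {
            return value.incrementAndGet();
        }
    }

    // Coordination-free alternative: unique across datacenters without a shared counter.
    static String newId(String datacenter) {
        return datacenter + "-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        LocalSequence dc1 = new LocalSequence();
        LocalSequence dc2 = new LocalSequence();
        System.out.println(dc1.next() == dc2.next());          // true: both DCs generated key 1
        System.out.println(newId("DC1").equals(newId("DC2"))); // false
    }
}
```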

Our Concepts
● Independent applications or application groups
● Independent Internet and internal network traffic
● Reduce/avoid downtime within a DC
● Replicate the data between the DCs and make the applications switchable
Our requirements once again
Event / application category | Online applications | Batch jobs
--- | --- | ---
Failure or maintenance of an internet uplink/router/switch | Yes | Yes
Failure or maintenance of a firewall node, load balancer node or network component | Yes | Yes
Failure or maintenance of a webserver node | Yes | N/A
Failure or maintenance of an application server node | Yes | Partly (jobs will be restarted)
Failure or maintenance of a database node | Yes | Partly
Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
New application deployment | Yes | Yes
Upgrade of the operating system | Yes | Yes
Upgrade of arbitrary middleware software | Yes | Yes
Upgrade of database software | Yes | Yes
Overload of processing nodes | Yes | Yes
Failure of a single JVM | Yes | No
Failure of a node due to a leak of system resources | Yes | No
Modern Architectures: how do the concepts fit?
Modern Architectures: Application Layer
● Web apps
● Completely independent of the backend
● Use only REST APIs
● 90% of the state is managed locally (supported by frameworks like AngularJS and BackboneJS)
● Must be compatible with different versions of the REST API (at least 2 versions)
● If WebSockets are used, it gets trickier (see backend)
● New mobile app versions are distributed through the app stores
● An upgrade reminder is good to have (to limit the number of supported versions)
● The REST API must be versioned and backwards compatible
● Messages over message clouds are transparent; HA is managed by the vendors
● Stateful services
● e.g. OAuth v1/v2
– Normally via DB persistence
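The requirement that clients cope with at least two versions of the REST API can be met on the server side by keeping both versions routable at the same time. The following is a minimal sketch using only the Python standard library. The `/api/v1/...` and `/api/v2/...` path scheme, the user resource and its fields are assumptions made for illustration, not part of the original architecture.

```python
import re

# Hypothetical in-memory resource; v2 split "name" into two fields.
USERS = {"42": {"first": "Ada", "last": "Lovelace"}}


def get_user_v1(user_id: str) -> dict:
    u = USERS[user_id]
    return {"id": user_id, "name": f"{u['first']} {u['last']}"}


def get_user_v2(user_id: str) -> dict:
    u = USERS[user_id]
    return {"id": user_id, "firstName": u["first"], "lastName": u["last"]}


# Both versions stay routable, so old clients (e.g. mobile apps that
# have not been updated yet) keep working next to new ones.
HANDLERS = {"v1": get_user_v1, "v2": get_user_v2}
ROUTE = re.compile(r"^/api/(v\d+)/users/(\w+)$")


def dispatch(path: str) -> tuple[int, dict]:
    """Return (HTTP status, response body) for a versioned request path."""
    m = ROUTE.match(path)
    if not m:
        return 404, {"error": "not found"}
    version, user_id = m.groups()
    handler = HANDLERS.get(version)
    if handler is None:
        # Retired or unknown version: tell the client which ones still work.
        return 410, {"error": "unsupported API version",
                     "supported": sorted(HANDLERS)}
    if user_id not in USERS:
        return 404, {"error": "unknown user"}
    return 200, handler(user_id)
```

Answering retired versions with 410 is one possible hook for the upgrade reminder mentioned above; the status code is a design choice of this sketch, not something the slides prescribe.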

Session Replication
● Less needed than with server-side applications
● Frameworks like AngularJS, BackboneJS, Ember etc. manage their own sessions, routing etc.
● But still needed
● Weblogic: no change
● Tomcat, possibly with a JDBC store
● Jetty with Terracotta
● Node.js: secure (digitally signed) sessions stored in cookies
– Senchalabs Connect
– Mozilla/node-client-sessions
● https://hacks.mozilla.org/2012/12/using-secure-client-side-sessions-to-build-simple-and-scalable-node-js-applications-a-node-js-holiday-season-part-3/
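The Node.js libraries listed above keep the session in a cookie and sign it so that the server can detect tampering. The following Python sketch illustrates only that principle with the standard `hmac` module. It is not the API of Connect or node-client-sessions, and the key is a placeholder. The sketch signs but does not encrypt the payload, so the client can read its own session data.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical placeholder: every node in both DCs would share this key.
SECRET_KEY = b"replace-with-a-long-random-secret"


def encode_session(data: dict, max_age: int = 3600,
                   now: Optional[float] = None) -> str:
    """Serialize the session with an expiry time and append an HMAC-SHA256 signature."""
    now = time.time() if now is None else now
    payload = json.dumps({"exp": int(now) + max_age, "data": data},
                         separators=(",", ":"), sort_keys=True).encode()
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(sig).decode())


def decode_session(cookie: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the session data, or None if the cookie is malformed, tampered with or expired."""
    now = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = cookie.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong number of parts or invalid base64 (binascii.Error)
        return None
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return None
    body = json.loads(payload)
    if body["exp"] < now:
        return None
    return body["data"]
```

Because any node can verify the cookie with the shared key, this state needs no replication between nodes or datacenters; the trade-off is that such a session cannot be revoked on the server before it expires.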
Backend: Bidirectional Data Replication
● Elasticsearch
● Currently no cross-cluster replication
● But it is on their roadmap
● CouchDB
● Very flexible replication, both within one datacenter and across several
● Bidirectional replication is possible
● MongoDB
● One-directional replication is possible and mature
● Bidirectional replication is not possible at the moment
● Workaround: one MongoDB per application and strict separation of the applications
● Hadoop HDFS
● Currently no cross-cluster replication available
● e.g. Facebook wrote their own replication for Hive
● Will possibly arrive soon with Apache Falcon http://falcon.incubator.apache.org/
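The MongoDB workaround avoids bidirectional replication by letting each application group write to exactly one database, whose primary lives in one datacenter, while the other datacenter only receives a one-directional replica. The registry below is a hypothetical Python sketch of that single-writer rule. The class, its methods and the DC names are illustrative only. Switching a group just moves its home DC once replication has caught up.

```python
from dataclasses import dataclass, field

DATACENTERS = ("DC1", "DC2")  # hypothetical names


@dataclass
class WriteOwnership:
    """Hypothetical registry: application group -> DC holding its writable primary."""
    home_dc: dict[str, str] = field(default_factory=dict)

    def write_target(self, app_group: str) -> str:
        # All writes of a group go to one DC, so replication only has to
        # flow one way and conflicting concurrent writes cannot occur.
        return self.home_dc[app_group]

    def check_write(self, app_group: str, dc: str) -> None:
        # Reject writes arriving in the DC that only holds the replica.
        if self.home_dc[app_group] != dc:
            raise PermissionError(f"{app_group} is read-only in {dc}")

    def switch(self, app_group: str, new_dc: str) -> None:
        # Call only after replication has caught up; reversing the actual
        # replication direction for this group happens outside this sketch.
        if new_dc not in DATACENTERS:
            raise ValueError(f"unknown datacenter: {new_dc}")
        self.home_dc[app_group] = new_dc
```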
Questions?
Thank you for your attention !
Some pictures on this presentation were purchased from iStockphoto LP. The price paid applies for the use of the pictures within the
scope of a standard license, which includes among other things, online publications including websites up to a maximum image size of
800 x 600 pixels (video: 640 x 480 pixels).
Some icons from https://www.iconfinder.com/ are used under the Creative Commons public domain license from the following authors:
Artbees, Neurovit and Pixel Mixer (http://pixel-mixer.com)
All other trademarks mentioned herein are the property of their respective owners.

Backup slides
Big picture example architecture
Key features
● 2 datacenters
● Both active (both datacenters are active, but probably with different applications running on them)
● Independent uplinks
● Redundant interconnect
● Applications are deployed and running on both
● Application cluster in every datacenter
● Session replication within every datacenter
● Cross replication between the 2 datacenters
● e.g. with Weblogic Cluster
● Bidirectional database replication
● e.g. 2 independent Oracle RACs, one in each datacenter
● Replication via Oracle Streams/GoldenGate
● Monitoring of all critical resources
● Hardware nodes
● Connection pools
● JVM heaps
● Application switch
Concepts: other network components
● Firewalls
● First-level firewalls
– Cisco routers
– Stateless firewalls
– Not very restrictive
● Second-level firewalls (in front of the application load balancers)
– Stateful, with connection tracking
– Based on Linux/iptables with conntrackd (for failover)
– Very restrictive
– Rate limiting of new connections (against DoS attacks or the Slashdot effect)
● All firewalls are (or will be) in active/hot-standby mode.
● On a controlled failover (both nodes are running and we switch them over), not a single TCP connection should be affected (apart from small delays).
● In a disaster case, it takes a few seconds until the cluster software detects the crash of the node and initiates the failover. No TCP connections should be lost, but there is a very small risk that some are.
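In this setup the rate limiting of new connections lives in the iptables rules of the second-level firewalls, whose `limit` match works with a token bucket. The Python sketch below models only that token-bucket idea, so the behaviour of a rate plus a burst allowance is easy to reason about. The rate and burst values are hypothetical, and the sketch is not a substitute for the firewall configuration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Admit on average `rate` new connections per second,
    with short bursts of up to `burst` connections."""
    rate: float          # tokens refilled per second
    burst: int           # bucket capacity
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = float(self.burst)  # start with a full bucket

    def allow_new_connection(self, now: float) -> bool:
        # Refill according to the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # a firewall would drop this SYN
```

Already established connections are not counted here: only new connection attempts consume tokens, which matches the slide's focus on limiting new connections.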

Example of a switch procedure for an application group
● Preparation steps
● Check the health of the replication processes.
● Stop all batch applications (by stopping the job scheduling system). If the time pressure for the switch is high, just kill all running jobs (they should be restartable anyway, also in the current setup).
● Switch off the keepalive feature on all httpd servers.
● Switching steps
● Change the firewall rules on the second-level firewalls so that all new connection requests (packets with the SYN flag set) are dropped.
● Wait until the data is synchronized on both sides (e.g. by monitoring a heartbeat table) and no httpd processes are active any more.
● Switch the application traffic to the other DC (by changing the routing of its IP addresses).
● Clean up (remove the dropping of SYN packets on the "old" site etc.).
● This procedure is repeated per application group until all applications are running in the other DC
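The step "wait until the data is synchronized" can be sketched as a small polling loop. The Python sketch below rests on stated assumptions. The three callbacks are hypothetical hooks that would be backed by real queries, for example inserting a new row into the heartbeat table on the source database and reading the newest replicated row on the target. It also assumes that replication applies changes in commit order, so that seeing the new heartbeat on the target implies all earlier commits have arrived. `clock` and `sleep` are injectable only to make the loop testable.

```python
import time
from typing import Callable


def wait_until_synchronized(
    write_source_heartbeat: Callable[[], int],   # hypothetical: writes a new heartbeat, returns its value
    read_target_heartbeat: Callable[[], int],    # hypothetical: newest heartbeat seen in the other DC
    active_httpd_processes: Callable[[], int],   # hypothetical: number of busy httpd processes
    timeout: float = 60.0,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once in-flight requests are done and the target DC has
    replicated a heartbeat written after them; False on timeout."""
    deadline = clock() + timeout
    # 1. New SYNs are already dropped; wait for running requests to finish.
    while active_httpd_processes() > 0:
        if clock() >= deadline:
            return False
        sleep(poll_interval)
    # 2. Write a fresh heartbeat after the last application write ...
    wanted = write_source_heartbeat()
    # 3. ... and wait until replication has delivered it to the other DC.
    while read_target_heartbeat() < wanted:
        if clock() >= deadline:
            return False
        sleep(poll_interval)
    return True
```

Only when this returns True would the traffic be rerouted to the other DC; on False the SYN-drop rules can be removed again and the switch aborted.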
Application clusters (Weblogic)
● Features of Weblogic that we use
● mod_wl
– Manages the stickiness and failover to backup nodes
– Automatic retry of failed requests
● On time-outs
● On 50x response status codes
● Multipools
– Gracefully remove a database node from the pool
– Gracefully change parameters of connection pools
– Guaranteed balance of connections between database nodes
● Binding execution threads to connection pools
● Auto shutdown (+ restart) of nodes on OutOfMemoryError
● Session replication (also across both DCs)
● Thread monitoring (detect dead or long-running threads etc.)
● Diagnostic images and alarms
Apache plugin failover
Source: e-docs.bea.com
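The retry behaviour listed for mod_wl (fail over from the sticky primary to a backup node on time-outs and on 50x responses) can be modelled in a few lines. This Python sketch is a simplified illustration, not the plug-in's implementation. `send` is a hypothetical callback that forwards the request to one node. Restricting retries to idempotent HTTP methods is an assumption of the sketch, added because retrying, say, a POST could execute it twice.

```python
from typing import Callable, Sequence


class UpstreamTimeout(Exception):
    """Raised by `send` when a node does not answer in time."""


# Assumption of this sketch: only these methods are safe to repeat.
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}


def proxy_with_failover(method: str, nodes: Sequence[str],
                        send: Callable[[str], int]) -> tuple[str, int]:
    """Try the sticky primary (nodes[0]) first, then the backup nodes.
    Returns (node that produced the final answer, HTTP status)."""
    if not nodes:
        raise ValueError("no nodes configured")
    result = (nodes[0], 502)
    for node in nodes:
        try:
            status = send(node)
        except UpstreamTimeout:
            status = 504  # report a time-out like a gateway timeout
        result = (node, status)
        if status < 500:
            return result  # success or client error: nothing to retry
        if method.upper() not in IDEMPOTENT_METHODS:
            return result  # a retry could execute the request twice
    return result  # all nodes failed; pass on the last error
```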
Deployment of connection pools
● One datasource per Oracle RAC node
● Set the initial capacity to a value that is sufficient for the usual load of the application
– Creating new connections is expensive
● Set the max capacity to a value that is sufficient in a high-load scenario
– The overall number of connections should match the connection limit on the database side
● Set JDBC parameters in the connection pool and not globally (e.g. v8compatibility=true)
● Check connections on reserve
● You can set DB session parameters in the init SQL property (e.g. alter session set NLS_SORT='GERMAN')
● Enable two-phase commit only if you need it (it is expensive)
● Prepared statement caching does not bring much of a performance gain (at least for Oracle databases), but it costs open cursors in the database (per connection!), so don't use it unless you have a very good reason to.
● One Multipool containing all individual datasources for one database
● Strategy: load balancing
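Two of these guidelines lend themselves to a quick check: the max capacities must fit within the database's connection limit, and the Multipool spreads requests over the RAC nodes. The Python sketch below assumes that every application server has the same datasource configuration. It models the load-balancing strategy as plain round-robin over the enabled datasources, which is an assumption; the names and numbers are hypothetical. Disabling a datasource stands in for gracefully removing a RAC node from the pool.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """One datasource per Oracle RAC node (hypothetical values)."""
    rac_node: str
    initial_capacity: int   # enough for the usual load
    max_capacity: int       # enough for a high-load scenario
    enabled: bool = True    # False = node gracefully removed from the pool


def fits_db_limit(app_servers: int, datasources: list[DataSource],
                  sessions_per_rac_node: int) -> bool:
    """Each application server may open up to max_capacity connections to
    each RAC node, so the sum over all servers must stay within the limit."""
    return all(app_servers * ds.max_capacity <= sessions_per_rac_node
               for ds in datasources)


@dataclass
class MultiPool:
    members: list[DataSource]
    _next: int = 0

    def pick(self) -> DataSource:
        """Round-robin over the enabled datasources."""
        active = [ds for ds in self.members if ds.enabled]
        if not active:
            raise RuntimeError("no datasource available")
        ds = active[self._next % len(active)]
        self._next += 1
        return ds
```

With four application servers and a max capacity of 40, each RAC node must accept at least 160 sessions from this application alone.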

VMworld
 
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
Edge AI and Vision Alliance
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
Brian Brazil
 
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITThings You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
OpenStack
 
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
Apache Apex
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
InfluxData
 
Introduction to PaaS and Heroku
Introduction to PaaS and HerokuIntroduction to PaaS and Heroku
Introduction to PaaS and Heroku
Tapio Rautonen
 

Similar to Zero Downtime JEE Architectures (20)

USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Design patterns for scaling web applications
Design patterns for scaling web applicationsDesign patterns for scaling web applications
Design patterns for scaling web applications
 
Challenges in Cloud Computing – VM Migration
Challenges in Cloud Computing – VM MigrationChallenges in Cloud Computing – VM Migration
Challenges in Cloud Computing – VM Migration
 
Boyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experienceBoyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experience
 
Cpp In Soa
Cpp In SoaCpp In Soa
Cpp In Soa
 
Rohit Yadav - The future of the CloudStack Virtual Router
Rohit Yadav - The future of the CloudStack Virtual RouterRohit Yadav - The future of the CloudStack Virtual Router
Rohit Yadav - The future of the CloudStack Virtual Router
 
Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]
 
Dynomite @ RedisConf 2017
Dynomite @ RedisConf 2017Dynomite @ RedisConf 2017
Dynomite @ RedisConf 2017
 
Tokyo azure meetup #12 service fabric internals
Tokyo azure meetup #12   service fabric internalsTokyo azure meetup #12   service fabric internals
Tokyo azure meetup #12 service fabric internals
 
SDN & NFV Introduction - Open Source Data Center Networking
SDN & NFV Introduction - Open Source Data Center NetworkingSDN & NFV Introduction - Open Source Data Center Networking
SDN & NFV Introduction - Open Source Data Center Networking
 
Network Virtualization & Software-defined Networking
Network Virtualization & Software-defined NetworkingNetwork Virtualization & Software-defined Networking
Network Virtualization & Software-defined Networking
 
Node.js Presentation
Node.js PresentationNode.js Presentation
Node.js Presentation
 
RedisConf17 - Dynomite - Making Non-distributed Databases Distributed
RedisConf17 - Dynomite - Making Non-distributed Databases DistributedRedisConf17 - Dynomite - Making Non-distributed Databases Distributed
RedisConf17 - Dynomite - Making Non-distributed Databases Distributed
 
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
 
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITThings You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
 
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
 
Introduction to PaaS and Heroku
Introduction to PaaS and HerokuIntroduction to PaaS and Heroku
Introduction to PaaS and Heroku
 

Recently uploaded

Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Mydbops
 
20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf
Sally Laouacheria
 
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
SynapseIndia
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
huseindihon
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
The Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive ComputingThe Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive Computing
Larry Smarr
 
Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024
BookNet Canada
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
ScyllaDB
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
Mark Billinghurst
 
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Chris Swan
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
ScyllaDB
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
BookNet Canada
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
Matthew Sinclair
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
welrejdoall
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
jackson110191
 
Best Programming Language for Civil Engineers
Best Programming Language for Civil EngineersBest Programming Language for Civil Engineers
Best Programming Language for Civil Engineers
Awais Yaseen
 

Recently uploaded (20)

Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
 
20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf
 
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
The Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive ComputingThe Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive Computing
 
Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
 
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
 
Best Programming Language for Civil Engineers
Best Programming Language for Civil EngineersBest Programming Language for Civil Engineers
Best Programming Language for Civil Engineers
 

Zero Downtime JEE Architectures

  • 1. Zero Downtime Architectures Alexander Penev ByteSource Technology Consulting GmbH Neubaugasse 43 1070, Vienna Austria
  • 2. whoami Alexander Penev Email: alexander.penev@bytesource.net Twitter: @apenev @ByteSourceNet JEE, Databases, Linux, TCP/IP Fan of (automatic) testing, TDD, ADD, BDD… Like to design highly available and scalable systems :-)
  • 3. Zero Downtime Architectures
    ● Based on a customer project with the classic JEE application stack
      – Classic web applications with server-side code
      – HTTP-based APIs
    ● Goals, concepts and implementation techniques
    ● Constraints and limitations
    ● Development guidelines
    ● How these concepts can be applied to new cutting-edge technologies
      – Single-page JavaScript-based apps
      – Mobile clients
      – REST APIs
      – Node.js
      – NoSQL stores
  • 4. Zero Downtime Architecture?
    ● My database server has 99.999% uptime
    ● We have a Tomcat cluster
    ● Redundant power supply
    ● Second datacenter
    ● Load balancer
    ● Distribute routes over OSPF
    ● Deploy my application online
    ● Second ISP
    ● Session replication
    ● Monitoring
    ● Data replication
    ● Auto restarts
  • 5. Zero Downtime Architecture: our definition
    ● From the end user's point of view, the services are always available
  • 6. Our Vision
    ● Identify all sources of downtime and remove them all
    (Image source: http://www.meteleco.com/wp-content/uploads/2011/09/p360.jpg)
  • 7. When could we have a downtime (unplanned)?
    ● Human errors
    ● Server node has crashed
      – Power supply is broken, RAM chip burned out, OS just crashed
    ● Server software just crashed
      – IO errors, software bug, tablespace full
    ● Network is unavailable
      – Router crashed, uplink down
    ● Datacenter is down
      – Uplinks down (the notorious excavator :-) )
      – Flood/fire
      – Air conditioning broken
      – Hit by a nuke (not so often :-) )
  • 8. When could we need a downtime (planned)?
    ● Replace a hardware part
    ● Replace a router/switch
    ● Firmware upgrade
    ● Upgrade/exchange the storage
    ● Configuration of the connection pool
    ● Configuration of the cluster
    ● Upgrade the cluster software
    ● Recover from a logical data error
    ● Upgrade the database software
    ● Deploy a new version of our software
    ● Move the application to another datacenter
  • 9. How can we avoid downtime?
    ● Redundancy
      – Hardware, network
      – Uplinks
      – Datacenters
      – Software
    ● Monitoring
      – Detect exhausted resources before the application notices it
      – Detect a failed node and replace it
    ● Software design
      – Idempotent service calls
      – Backwards compatibility
      – Live releases
    ● Scalability
      – Scale on more load
      – Protect from attacks (e.g. DDoS)
  • 10. Requirements for a Zero Downtime Architecture: handling of events of failure or maintenance
    Event / application category | Online applications | Batch jobs
    Failure or maintenance of an internet uplink/router/switch | Yes | Yes
    Failure or maintenance of a firewall node, load balancer node or a network component | Yes | Yes
    Failure or maintenance of a web server node | Yes | N/A
    Failure or maintenance of an application server node | Yes | Partly (will be restarted)
    Failure or maintenance of a database node | Yes | Partly
    Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
    Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
    New application deployment | Yes | Yes
    Upgrade of operating system | Yes | Yes
    Upgrade of an arbitrary middleware software | Yes | Yes
    Upgrade of database software | Yes | Yes
    Overload of processing nodes | Yes | Yes
    Failure of a single JVM | Yes | No
    Failure of a node due to leak of system resources | Yes | No
  • 11. Our goals and constraints
    ● Reduce downtime to 0
    ● Keep the costs low
      – No expensive proprietary hardware
    ● Minimize the potential application changes/rewrites
    (Image source: http://www.signwarehouse.com/blog/how-to-keep-fixed-costs-low/)
  • 12. Our Concepts 1/4
    ● Independent applications or application groups
    ● One application (group) = one IP address
    ● Communication between applications goes exclusively over this IP address!
    (Image source: http://www.binaryguys.de/media/catalog/product/cache/1/image/313x313/9df78eab33525d08d6e5fb8d27136e95/3/6/36.noplacelikelocalhost_1_4.jpg)
  • 13. Our Concepts 2/4 Treat the internet and internal traffic independently
  • 14. Our Concepts 3/4
    ● Reduce the downtime within a datacenter to 0
      – Highly available network
      – Redundant firewalls and load balancers
      – Web server farms
      – Application server clusters with session replication
      – Oracle RAC cluster
      – Downtime-free application deployments
  • 15. Our Concepts 4/4
    ● Replicate the data between both datacenters and make the applications switchable
  • 17. Concepts: Internet traffic, BGP (Border Gateway Protocol) 1/2
    ● Every datacenter has fully redundant uplinks
    ● Own provider-independent IP address range (assigned by RIPE)
      – Hard to get at the moment (but not impossible)
    ● Propagate these addresses to the rest of the internet through both ISPs using BGP
      – Both DCs announce our addresses
      – The network path of one announcement can be preferred (for cost reasons)
    ● Switch of internet traffic
      – Gracefully, by changing the preferences of the announcements: not a single TCP session is lost
      – In case of disaster, the backup route is propagated automatically within seconds to minutes (depending on the internet distance)
    ● Protects us from connectivity problems between our ISPs and our customers' ISPs
    (Figure: both datacenters announce the same prefix, 10.8.8.0/24)
  • 18. Concepts: Internet traffic, use DNS? 2/2
    ● We don't use DNS for switching
    ● A datacenter switch based on DNS could take up to months to reach all customers and their software (e.g. JVMs caching DNS entries, which is the default behaviour)
    ● No need to restart browsers, applications and proxies on the customer site; the customer doesn't see any change at all (except that the route to us has changed)
    ● DNS is good for load balancing but not for high availability!
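The DNS caching problem is easiest to see from the client side. The following Java sketch reads and overrides the JVM-wide DNS cache TTL through the standard `networkaddress.cache.ttl` security property. The 30-second value is an arbitrary assumption. Depending on the JVM version and on whether a security manager is installed, successful lookups may be cached for a long time or even indefinitely. This setting lives in the customer's JVM, outside the operator's control, which is why the slides rely on routing rather than DNS.

```java
import java.security.Security;

// Reads and sets the JVM-wide DNS cache TTL via the standard "networkaddress.cache.ttl" security property.
class DnsCacheSettings {

    // Returns the configured TTL in seconds, or null if the JVM's built-in default applies.
    static String currentTtl() {
        return Security.getProperty("networkaddress.cache.ttl");
    }

    // Assumed example: cap cached positive lookups at the given number of seconds.
    // Should run before the first name lookup so the InetAddress cache picks it up.
    static void setTtlSeconds(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }

    public static void main(String[] args) {
        System.out.println("TTL before: " + currentTtl());
        setTtlSeconds(30);
        System.out.println("TTL after: " + currentTtl());
    }
}
```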
  • 19. Concepts: Internal traffic
    ● OSPF (Open Shortest Path First) protocol for dynamic routing
      – Deals with redundant paths completely transparently
      – Can also do load balancing
    ● The second-level firewalls (in front of the load balancers) announce the addresses to the rest of the routers
    ● To switch the processing of a service, its firewall just has to announce the route (which can also be a /32) with a higher priority; after a second the traffic goes through the new route
    ● Can also be used for an unattended switch of the whole datacenter
      – Just announce the same IPs from both sites with different priorities
      – If one datacenter dies, only the announcements from the other one remain
    (Figure: both datacenters announce the service address 10.8.8.23)
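A simplified model of how a router chooses between competing announcements may help here. The sketch below uses a hypothetical `RouteSelection` class and is not real router code. It applies longest-prefix match first and then prefers the lowest cost. This is how a more specific /32, or a better-priority announcement from the other datacenter, pulls the traffic over. Real OSPF details such as cumulative path cost and external route types are deliberately left out.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical model of route selection between announcements from two datacenters (not real OSPF code).
class RouteSelection {

    // A route: network address, prefix length, cost (lower = preferred) and where it leads.
    record Route(int network, int prefixLength, int cost, String via) {
        boolean matches(int address) {
            // Java masks shift distances to 5 bits, so a /0 prefix needs special handling.
            int mask = prefixLength == 0 ? 0 : -1 << (32 - prefixLength);
            return (address & mask) == (network & mask);
        }
    }

    // Converts dotted IPv4 notation to an int.
    static int ip(String dotted) {
        String[] p = dotted.split("\\.");
        return (Integer.parseInt(p[0]) << 24) | (Integer.parseInt(p[1]) << 16)
                | (Integer.parseInt(p[2]) << 8) | Integer.parseInt(p[3]);
    }

    // Longest prefix wins; among equally specific routes the lowest cost wins.
    static Optional<Route> select(List<Route> table, int address) {
        return table.stream()
                .filter(r -> r.matches(address))
                .max(Comparator.comparingInt(Route::prefixLength)
                        .thenComparing(Comparator.comparingInt(Route::cost).reversed()));
    }

    public static void main(String[] args) {
        int service = ip("10.8.8.23");
        Route dc1 = new Route(ip("10.8.8.0"), 24, 10, "DC1");
        Route dc2 = new Route(ip("10.8.8.0"), 24, 20, "DC2");
        Route dc2Host = new Route(service, 32, 20, "DC2");
        // Same prefix: the lower cost (DC1) wins.
        System.out.println(select(List.of(dc1, dc2), service).map(Route::via).orElse("unreachable"));
        // DC2 announces a /32 for the service: the more specific route wins.
        System.out.println(select(List.of(dc1, dc2, dc2Host), service).map(Route::via).orElse("unreachable"));
        // DC1 is gone: only DC2's announcement remains.
        System.out.println(select(List.of(dc2), service).map(Route::via).orElse("unreachable"));
    }
}
```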
  • 20. Our Concepts
    ● Independent applications or application groups
    ● Independent internet and internal network traffic
    ● Reduce downtime within a DC
    ● Replicate the data between the DCs and make the applications switchable
  • 21. Zero Downtime within a datacenter
    ● Highly available network
      – Redundant switches using the Spanning Tree Protocol
      – Redundant firewalls, routers and load balancers
        – Active/passive clusters
        – VRRP protocol implemented by keepalived
        – iptables with conntrackd
    ● Apache web server farms
      – Managed by the load balancer
    ● Application server cluster
      – WebLogic cluster
      – With session replication, automatic retries and restarts
    ● Oracle RAC database cluster
    ● Deployment without downtime
  • 22. Failover within one datacenter: Apache plugin (mod_wl)
    ● Session ID format: sessionid!primary_server_id!secondary_server_id
    (Source: http://egeneration.beasys.com/wls/docs100/cluster/wwimages/cluster-06-1-2.gif)
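The sticky failover encoded in the session ID format can be modelled in a few lines. This is a simplified sketch with hypothetical class and server names, not the plugin's actual implementation. Requests go to the primary server named in the cookie. When the primary is gone, they go to the secondary, which holds the session replica.

```java
import java.util.Optional;
import java.util.Set;

// Simplified model of primary/secondary failover driven by the session cookie (hypothetical, not mod_wl code).
class StickyRouting {

    // Format from the slide: sessionid!primary_server_id!secondary_server_id
    record SessionCookie(String sessionId, String primary, Optional<String> secondary) {
        static SessionCookie parse(String value) {
            String[] parts = value.split("!");
            if (parts.length < 2) {
                throw new IllegalArgumentException("Unexpected session cookie: " + value);
            }
            return new SessionCookie(parts[0], parts[1],
                    parts.length > 2 ? Optional.of(parts[2]) : Optional.empty());
        }
    }

    // Primary first; the secondary holds the session replica. Empty means the session state is lost
    // and any server would have to start a new session.
    static Optional<String> target(String cookieValue, Set<String> aliveServers) {
        SessionCookie cookie = SessionCookie.parse(cookieValue);
        if (aliveServers.contains(cookie.primary())) {
            return Optional.of(cookie.primary());
        }
        return cookie.secondary().filter(aliveServers::contains);
    }

    public static void main(String[] args) {
        System.out.println(target("a1b2c3!srv1!srv2", Set.of("srv1", "srv2")));
        System.out.println(target("a1b2c3!srv1!srv2", Set.of("srv2", "srv3")));
    }
}
```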
  • 23. Development guidelines (HttpSession)
    ● If you need a session, then you most probably want to replicate it
      – Example (weblogic.xml)
    ● Generally, all requests of one session go to the same application instance
    ● When that instance fails (answers with a 50x, dies, or does not answer within a given period), the backup instance takes over
    ● Session attributes are only replicated to the backup node when HttpSession.setAttribute is called. HttpSession.getAttribute("foo").changeSomething() will not be replicated!
    ● Every attribute stored in the HttpSession must be serializable!
    ● The ServletContext is never replicated.
    ● If you implement caches, they will probably have different contents on every node (unless a third-party cluster-aware cache is used). The best practice is probably not to rely on the data being present, and to declare the cache transient
    ● Keep the session small and regularly reattach modified attributes (call setAttribute again)
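The "replicate only on setAttribute" rule is easy to break by accident. The following self-contained simulation makes the effect visible. It uses a hypothetical `ReplicatedSession` class and does not use the servlet or WebLogic APIs. It serializes an attribute into a "backup" map only when setAttribute is called. An in-place change to an object returned by getAttribute therefore never reaches the backup, and non-serializable values are rejected up front.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulation only: mimics "replicate on setAttribute" semantics; not the servlet or WebLogic API.
class ReplicatedSession {
    private final Map<String, Object> primary = new HashMap<>();   // live objects on the primary node
    private final Map<String, byte[]> backup = new HashMap<>();    // serialized copies on the secondary

    void setAttribute(String name, Object value) {
        byte[] copy = serialize(value);   // fails for non-serializable values, before anything is stored
        primary.put(name, value);
        backup.put(name, copy);           // replication happens only here
    }

    Object getAttribute(String name) {
        return primary.get(name);
    }

    // What the secondary node would hand out after a failover.
    Object backupAttribute(String name) {
        byte[] bytes = backup.get(name);
        return bytes == null ? null : deserialize(bytes);
    }

    private static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {   // NotSerializableException is an IOException
            throw new IllegalArgumentException("Session attribute is not serializable", e);
        }
        return bytes.toByteArray();
    }

    private static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        ReplicatedSession session = new ReplicatedSession();
        session.setAttribute("cart", new ArrayList<String>());
        ((List<String>) session.getAttribute("cart")).add("book");   // in-place change: NOT replicated
        System.out.println(session.backupAttribute("cart"));          // prints []
        session.setAttribute("cart", session.getAttribute("cart"));   // set again: replicated
        System.out.println(session.backupAttribute("cart"));          // prints [book]
    }
}
```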
  • 24. Development guidelines (cluster handling)
    ● Return proper HTTP status codes to the client
      – A common practice is to return a well-formed error page with HTTP status 200
      – That is fine only if you are sure the cluster cannot recover from the error (example: a missing page will be missing on the other node too)
      – But an exhausted resource (like heap or a datasource) may still be available on the other node, so report such failures with a 5xx code to let the cluster fail over
      – This is hard to implement, therefore WebLogic offers some help:
        – You can bind the number of execution threads to a datasource's capacity
        – You can shut down the node if an OutOfMemoryError occurs, but use this with extreme care!
    ● Design for idempotence
      – Make all your methods idempotent as far as possible
      – For those that cannot be idempotent (e.g. sendMoney(Money money, Account account)), prevent re-execution:
        – By using a ticketing service
        – By declaring them as not idempotent:
          <LocationMatch /pathto/yourservlet>
              SetHandler weblogic-handler
              Idempotent OFF
          </LocationMatch>
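A ticketing service for non-idempotent calls such as sendMoney can be sketched as follows. The class and method names are hypothetical. The in-memory map is only for illustration: in the setup described here, tickets would have to live in the shared (replicated) database so that a retry landing on another node still finds them.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical ticketing service: a non-idempotent operation runs at most once per ticket.
// In-memory only for illustration; a real deployment would keep tickets in the replicated database.
class TicketingService {
    private final Map<String, Object> results = new ConcurrentHashMap<>();

    // The client obtains a ticket before submitting the non-idempotent request.
    String newTicket() {
        return UUID.randomUUID().toString();
    }

    // A retry with the same ticket (e.g. after a failover) returns the first result instead of re-executing.
    // ConcurrentHashMap.computeIfAbsent runs the operation at most once per key.
    @SuppressWarnings("unchecked")
    <T> T executeOnce(String ticket, Supplier<T> operation) {
        return (T) results.computeIfAbsent(ticket, t -> operation.get());
    }

    public static void main(String[] args) {
        TicketingService tickets = new TicketingService();
        String ticket = tickets.newTicket();
        // Hypothetical stand-in for sendMoney(money, account): returns a transaction id.
        String first = tickets.executeOnce(ticket, () -> "tx-" + UUID.randomUUID());
        String retry = tickets.executeOnce(ticket, () -> "tx-" + UUID.randomUUID());
        System.out.println(first.equals(retry));   // true: the retry did not transfer money twice
    }
}
```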
  • 25. Development guidelines (Datasources)
    ● Don't build your own connection pools; take them from the application server via JNDI lookup
    ● As we are using Oracle RAC, the datasource must be a multipool consisting of single datasources, one per RAC node
      – One of the single datasources can be taken out of the multipool (online)
      – Load balancing is guaranteed
      – The pool can be reconfigured online
    ● Example Spring config:
    ● Example without Spring:
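The multipool behaviour can be modelled as below: round-robin across one member per RAC node, with members taken out and put back online. This is a hypothetical sketch that uses member names instead of real `javax.sql.DataSource` objects. In the application itself, the datasource would be obtained from the application server via JNDI, as the slide recommends.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of a multipool: one member datasource per RAC node, used round-robin.
// Member names stand in for real javax.sql.DataSource objects obtained via JNDI.
class MultiPool {
    private final List<String> members;
    private final Set<String> offline = ConcurrentHashMap.newKeySet();
    private final AtomicInteger counter = new AtomicInteger();

    MultiPool(List<String> members) {
        this.members = List.copyOf(members);
    }

    // Take a member out (e.g. for RAC node maintenance) and put it back, without a restart.
    void takeOffline(String member) { offline.add(member); }
    void bringOnline(String member) { offline.remove(member); }

    // Round-robin over the members that are currently online.
    String pick() {
        int n = members.size();
        for (int attempt = 0; attempt < n; attempt++) {
            String candidate = members.get(Math.floorMod(counter.getAndIncrement(), n));
            if (!offline.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("No datasource member is online");
    }

    public static void main(String[] args) {
        MultiPool pool = new MultiPool(List.of("rac-node-1", "rac-node-2"));
        System.out.println(pool.pick() + ", " + pool.pick());
        pool.takeOffline("rac-node-1");
        System.out.println(pool.pick() + ", " + pool.pick());
    }
}
```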
  • 26. Basic monitoring
    ● Different possibilities for monitoring on WebLogic
    ● Standard admin console
      – Threads (stuck, in use, etc.), JVM (heap size, usage etc.), online thread dumps
      – Connection pool statistics
      – Transaction manager statistics
      – Application statistics (per servlet), WorkManager statistics
    ● Diagnostic console
      – Online monitoring only
      – All attributes exposed by WebLogic MBeans can be monitored
      – Demo: diagnostics console
    ● Diagnostic images
      – On demand, on shutdown, regularly
      – Useful for problem analysis (especially after-crash analysis)
      – For analysing resource leaks. Demo: analyse a connection leak and a stuck thread
    ● SNMP and diagnostic modules
      – All MBean attributes can be monitored by SNMP
      – Gauge, string and counter monitors, log filters, attribute changes
      – Collected metrics, watches and notifications
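The WebLogic consoles read their values from MBeans. As a minimal sketch of the same mechanism, the following code reads the standard JVM platform MBeans through `java.lang.management`: heap usage, live threads and deadlocked threads. WebLogic's own MBeans for thread pools and datasources are not covered, and the 85% threshold is an assumed example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

// Reads standard JVM platform MBeans; WebLogic-specific MBeans are not covered by this sketch.
class HeapWatch {

    // Fraction of the maximum heap in use, or -1 if the JVM reports no defined maximum.
    static double heapUsageRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getMax() <= 0 ? -1 : (double) heap.getUsed() / heap.getMax();
    }

    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    // Number of threads deadlocked on monitors or ownable synchronizers (0 if none).
    static int deadlockedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        double threshold = 0.85;   // assumed alerting threshold
        double ratio = heapUsageRatio();
        System.out.printf("heap=%.2f threads=%d deadlocked=%d%n", ratio, liveThreads(), deadlockedThreads());
        if (ratio > threshold || deadlockedThreads() > 0) {
            System.out.println("WARN: node degrading; alert before the application notices");
        }
    }
}
```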
  • 27. Zero downtime deployment
    ● 2 clusters within one datacenter
      – Managed by the Apache load balancer (a simple script based on the session ID)
      – Both are active during normal operations
    ● Before we deploy the new release, we switch off cluster 1
      – Old sessions go to both cluster 1 and 2
      – New sessions go to cluster 2 only
    ● When all sessions of cluster 1 have expired, we deploy the new version
      – Test it
      – If everything is OK, we put it back into the Apache load balancer
    ● Now we take cluster 2 off
      – Until all its sessions expire
      – The same procedure as above
    ● Then we deploy on the second datacenter
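The session-based switch between the two clusters can be expressed as a small routing model. The `SessionDrainRouter` class below is a hypothetical sketch of the decision the Apache-side script makes, not the original script. Existing sessions stay on their cluster, and new sessions go only to clusters that accept them. A drained cluster is ready for deployment once its session count reaches zero.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the session-based switch between two clusters (not the original Apache script).
class SessionDrainRouter {
    private final Map<String, Boolean> acceptsNew = new LinkedHashMap<>();
    private final Map<String, Integer> activeSessions = new HashMap<>();
    private int counter = 0;

    SessionDrainRouter(String... clusters) {
        for (String cluster : clusters) {
            acceptsNew.put(cluster, true);
            activeSessions.put(cluster, 0);
        }
    }

    // Switch a cluster off for new sessions; its existing sessions keep being served there.
    void drain(String cluster)  { acceptsNew.put(cluster, false); }

    // Put a cluster back into the balancer after deployment and testing.
    void enable(String cluster) { acceptsNew.put(cluster, true); }

    // Requests carrying a session stick to its cluster; requests without one start a new session.
    String route(String sessionCluster) {
        return sessionCluster != null ? sessionCluster : startSession();
    }

    String startSession() {
        List<String> open = new ArrayList<>();
        acceptsNew.forEach((cluster, accepts) -> {
            if (accepts) open.add(cluster);
        });
        if (open.isEmpty()) throw new IllegalStateException("No cluster accepts new sessions");
        String chosen = open.get(Math.floorMod(counter++, open.size()));
        activeSessions.merge(chosen, 1, Integer::sum);
        return chosen;
    }

    void sessionExpired(String cluster) { activeSessions.merge(cluster, -1, Integer::sum); }

    // Deploy only once no new sessions arrive and all old ones have expired.
    boolean readyForDeployment(String cluster) {
        return !acceptsNew.get(cluster) && activeSessions.get(cluster) == 0;
    }

    public static void main(String[] args) {
        SessionDrainRouter router = new SessionDrainRouter("cluster1", "cluster2");
        String oldSession = router.route(null);          // cluster1
        router.route(null);                              // cluster2
        router.drain("cluster1");
        System.out.println(router.route(null));          // new session: cluster2
        System.out.println(router.route(oldSession));    // old session stays on cluster1
        router.sessionExpired("cluster1");
        System.out.println(router.readyForDeployment("cluster1"));   // true
    }
}
```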
  • 28. Our Concepts
    ● Independent applications or application groups
    ● Independent internet and internal network traffic
    ● Reduce/avoid downtime within a DC
    ● Replicate the data between the DCs and make the applications switchable
  • 29. Our requirements again
    Event / application category | Online applications | Batch jobs
    Failure or maintenance of an internet uplink/router/switch | Yes | Yes
    Failure or maintenance of a firewall node, load balancer node or a network component | Yes | Yes
    Failure or maintenance of a web server node | Yes | N/A
    Failure or maintenance of an application server node | Yes | Partly (will be restarted)
    Failure or maintenance of a database node | Yes | Partly
    Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
    Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
    New application deployment | Yes | Yes
    Upgrade of operating system | Yes | Yes
    Upgrade of an arbitrary middleware software | Yes | Yes
    Upgrade of database software | Yes | Yes
    Overload of processing nodes | Yes | Yes
    Failure of a single JVM | Yes | No
    Failure of a node due to leak of system resources | Yes | No
  • 30. Replicate the data between the DCs
    ● Bidirectional data replication between DCs
    ● Oracle Streams/GoldenGate
    (http://docs.oracle.com/cd/E11882_01/server.112/e10705/man_gen_rep.htm#STREP013)
  • 31. Cross Cluster replication: 2 clusters in 2 datacenters
  • 32. Application groups
    ● One or more applications without hard dependencies to or from other applications
    ● Why application groups?
      – Switching many applications at once leads to long downtimes and higher risk
      – Switching a single application is not possible if it has hard dependencies on database level to other applications
    ● Identify groups of applications that are critically dependent on each other but not on applications outside the group
    ● Always switch such groups at once
    ● The bigger the group, the longer the downtime
      – A single application in the HA category will be able to switch without any downtime, just delayed requests
    ● A dependency is critical (hard) if it leads to issues (editing the same record in different DCs will definitely be problematic, reading data for reporting is not)
      – Must be identified on a case-by-case basis
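Finding application groups amounts to computing connected components over the hard dependencies. Below is a minimal union-find sketch with hypothetical application names. Soft dependencies, such as read-only reporting access, are assumed to be filtered out by the caller before the edge list is built.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: applications connected by hard (critical) dependencies must switch together.
// Soft dependencies (e.g. read-only reporting access) are assumed to be filtered out by the caller.
class ApplicationGroups {

    static Collection<Set<String>> groups(Collection<String> apps, List<String[]> hardDependencies) {
        Map<String, String> parent = new HashMap<>();
        for (String app : apps) parent.put(app, app);
        for (String[] dep : hardDependencies) {
            parent.putIfAbsent(dep[0], dep[0]);
            parent.putIfAbsent(dep[1], dep[1]);
            union(parent, dep[0], dep[1]);
        }
        // Collect each application under the representative (root) of its group.
        Map<String, Set<String>> byRoot = new TreeMap<>();
        for (String app : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(parent, app), k -> new TreeSet<>()).add(app);
        }
        return byRoot.values();
    }

    // Follows parent links to the root, compressing the path on the way back.
    private static String find(Map<String, String> parent, String app) {
        String root = app;
        while (!parent.get(root).equals(root)) root = parent.get(root);
        while (!parent.get(app).equals(root)) {
            String next = parent.get(app);
            parent.put(app, root);
            app = next;
        }
        return root;
    }

    private static void union(Map<String, String> parent, String a, String b) {
        String ra = find(parent, a);
        String rb = find(parent, b);
        if (!ra.equals(rb)) parent.put(ra, rb);
    }

    public static void main(String[] args) {
        List<String> apps = List.of("billing", "crm", "portal", "reporting");
        List<String[]> hard = List.of(new String[] {"billing", "crm"}, new String[] {"crm", "portal"});
        System.out.println(groups(apps, hard));
    }
}
```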
  • 34. Switch application by application
  • 35. Example of a switch procedure of an application group
  • 36. Applications: Limitations
    ● No bulk transactions
    ● No DB sequences
    ● No file-based sequences
    ● No shared file system storage
    ● Use a central batch system
    ● All new releases have to be compatible with the previous release
    ● Stick to the infrastructure
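The slides do not say what replaces DB sequences. One common option, shown here only as an assumption, is to generate keys that need no shared counter, for example random UUIDs. Rows inserted in both datacenters then cannot collide once bidirectional replication merges them. The datacenter prefix is a hypothetical convention for traceability.

```java
import java.util.UUID;

// Sketch (assumption, not from the slides): keys generated without any shared sequence.
final class ReplicationSafeIds {
    private final String datacenter;   // hypothetical tag, e.g. "dc1", only for traceability

    ReplicationSafeIds(String datacenter) {
        this.datacenter = datacenter;
    }

    // A random (version 4) UUID needs no counter shared between sites, so rows inserted in both
    // datacenters will not collide (with overwhelming probability) when replication merges them.
    String nextId() {
        return datacenter + "-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        ReplicationSafeIds dc1 = new ReplicationSafeIds("dc1");
        ReplicationSafeIds dc2 = new ReplicationSafeIds("dc2");
        System.out.println(dc1.nextId());
        System.out.println(dc2.nextId());
    }
}
```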
  • 37. Our Concepts ● Independent Applications or Application Groups ● Independent Internet and internal network traffic ● Reduce/avoid Downtime within a DC ● Replicate the data between the DCs and make the application switchable
  • 38. Our requirements once again (same table as on slide 29)
  • 39. Modern Architectures: how do the concepts fit?
  • 40. Modern Architectures: Application Layer ● Web apps ● Completely independent of the backend ● Using only REST APIs ● 90% of the state is managed locally (supported by frameworks like AngularJS and BackboneJS) ● Must be compatible with different versions of the REST API (at least 2 versions) ● If WebSockets are used, it gets trickier; see backend ● New mobile versions are managed by app stores ● Good to have an upgrade reminder (to limit the number of supported versions) ● REST API must be versioned and backwards compatible ● Messages over message clouds are transparent; HA is managed by the vendors ● Stateful services ● e.g. OAuth v1/v2 – Normally via DB persistence
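Serving "at least 2 versions" of a REST API can be as simple as keeping the previous response shape alive next to the current one until the upgrade reminders have done their work. A minimal Java sketch, assuming URL-based versioning (`/api/vN/...`); the customer resource, the JSON shapes and the choice of 410 for retired versions are all hypothetical:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Serves the current and the previous version of a hypothetical customer resource. */
final class VersionedCustomerApi {
    static final class Response {
        final int status;
        final String body;
        Response(int status, String body) { this.status = status; this.body = body; }
    }

    static final int CURRENT_VERSION = 2;
    static final int OLDEST_SUPPORTED = CURRENT_VERSION - 1; // keep exactly one previous version

    private static final Pattern PATH = Pattern.compile("/api/v(\\d{1,6})/customers/(\\d{1,18})");

    // Hypothetical in-memory data: id -> {first name, last name}.
    private static final Map<Long, String[]> CUSTOMERS = Map.of(42L, new String[] {"Ada", "Lovelace"});

    static Response handle(String path) {
        Matcher m = PATH.matcher(path);
        if (!m.matches()) return new Response(404, "{\"error\":\"not found\"}");
        int version = Integer.parseInt(m.group(1));
        if (version > CURRENT_VERSION) return new Response(404, "{\"error\":\"unknown API version\"}");
        if (version < OLDEST_SUPPORTED) {
            return new Response(410, "{\"error\":\"API version retired, please upgrade\"}");
        }
        String[] c = CUSTOMERS.get(Long.parseLong(m.group(2)));
        if (c == null) return new Response(404, "{\"error\":\"not found\"}");
        if (version == 1) {
            // v1 shape: a single "name" field, left unchanged for older clients.
            return new Response(200, "{\"name\":\"" + c[0] + " " + c[1] + "\"}");
        }
        // v2 shape: the name is split into two fields.
        return new Response(200, "{\"firstName\":\"" + c[0] + "\",\"lastName\":\"" + c[1] + "\"}");
    }
}
```

During a rolling deployment, old and new web app or mobile versions hit the same backend, so a release can drop v1 only once no supported client still calls it. Header- or media-type-based versioning would work just as well; the URL scheme is simply the easiest to route on load balancers.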
  • 41. Session Replication ● Less needed than with server-side applications ● Frameworks like AngularJS, BackboneJS, Ember etc. manage their own sessions, routing etc. ● but still needed ● Weblogic: no change ● Tomcat, possibly with a JDBC store ● Jetty with Terracotta ● Node.js: secure (digitally signed) sessions stored in cookies – Senchalabs Connect – Mozilla/node-client-sessions ● https://hacks.mozilla.org/2012/12/using-secure-client-side-sessions-to-build-simple-and-scalable-node-js-applications-a-node-js-holiday-season-part-3/
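Signed cookie sessions remove the need to replicate sessions at all: the client carries the state, and any node in either datacenter that knows the secret key can verify it. The Node.js libraries above do this in JavaScript; as an illustration of the same idea in Java (not a port of those libraries), here is a minimal HMAC-SHA256 sketch using only the JDK. The class name and the cookie format `payload.signature` are assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.security.InvalidKeyException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Optional;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Stores session data in a cookie as base64url(payload) + "." + base64url(HMAC-SHA256). */
final class SignedCookieSession {
    private static final String ALG = "HmacSHA256";
    private final SecretKeySpec key;

    /** The same secret must be configured on every node in both datacenters. */
    SignedCookieSession(byte[] secret) {
        this.key = new SecretKeySpec(secret, ALG);
    }

    private byte[] mac(String payload) {
        try {
            Mac mac = Mac.getInstance(ALG);
            mac.init(key);
            return mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException | InvalidKeyException e) {
            throw new IllegalStateException(e);
        }
    }

    String encode(String sessionJson) {
        Base64.Encoder b64 = Base64.getUrlEncoder().withoutPadding();
        String payload = b64.encodeToString(sessionJson.getBytes(StandardCharsets.UTF_8));
        return payload + "." + b64.encodeToString(mac(payload));
    }

    /** Returns the session data if the signature is valid, otherwise empty. */
    Optional<String> decode(String cookieValue) {
        int dot = cookieValue.indexOf('.');
        if (dot < 0) return Optional.empty();
        String payload = cookieValue.substring(0, dot);
        byte[] given;
        try {
            given = Base64.getUrlDecoder().decode(cookieValue.substring(dot + 1));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
        // Constant-time comparison, so timing does not reveal how many bytes matched.
        if (!MessageDigest.isEqual(mac(payload), given)) return Optional.empty();
        return Optional.of(new String(Base64.getUrlDecoder().decode(payload), StandardCharsets.UTF_8));
    }
}
```

This only signs: the client can still read the payload, and there is no expiry. Sensitive data would additionally need encryption, and a timestamp inside the payload would be needed to expire sessions.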
  • 42. Backend: Bidirectional Data Replication ● Elasticsearch ● Currently no cross-cluster replication ● But it is on their roadmap ● CouchDB ● Very flexible replication, whether within one or across several datacenters ● Bidirectional replication is possible ● MongoDB ● One-directional replication is possible and mature ● Bidirectional replication is not possible at the moment ● Workaround: one MongoDB per application and strict separation of the applications ● Hadoop HDFS ● Currently no cross-cluster replication available ● e.g. Facebook wrote their own replication for Hive ● May arrive soon with Apache Falcon http://falcon.incubator.apache.org/
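In CouchDB, bidirectional replication is simply two continuous one-way replications, one in each direction. A minimal Java sketch that triggers them through CouchDB's `POST /_replicate` endpoint with `java.net.http` (Java 11+); the host names, the database name and the absence of authentication are hypothetical:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

final class CouchBidirectionalReplication {
    /** JSON body for POST /_replicate: a continuous one-way replication. */
    static String body(String source, String target) {
        return "{\"source\":\"" + source + "\",\"target\":\"" + target + "\",\"continuous\":true}";
    }

    /** One request per direction: A -> B on server A and B -> A on server B. */
    static List<HttpRequest> requests(String couchA, String couchB, String db) {
        String a = couchA + "/" + db;
        String b = couchB + "/" + db;
        return List.of(
                post(couchA + "/_replicate", body(a, b)),
                post(couchB + "/_replicate", body(b, a)));
    }

    private static HttpRequest post(String url, String json) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical hosts: one CouchDB per datacenter; credentials omitted.
        HttpClient client = HttpClient.newHttpClient();
        for (HttpRequest r : requests("http://couch-dc1:5984", "http://couch-dc2:5984", "orders")) {
            HttpResponse<String> resp = client.send(r, HttpResponse.BodyHandlers.ofString());
            System.out.println(r.uri() + " -> " + resp.statusCode());
        }
    }
}
```

Replications started via `_replicate` are transient and do not survive a server restart; for a permanent setup, the same source/target pairs are usually stored as documents in the `_replicator` database. Concurrent edits of one document in both DCs still end up as CouchDB conflicts that the application has to resolve.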
  • 43. Questions? Thank you for your attention!
  • 44. Some pictures on this presentation were purchased from iStockphoto LP. The price paid applies for the use of the pictures within the scope of a standard license, which includes among other things, online publications including websites up to a maximum image size of 800 x 600 pixels (video: 640 x 480 pixels). Some icons from https://www.iconfinder.com/ are used under the Creative Commons public domain license from the following authors: Artbees, Neurovit and Pixel Mixer (http://pixel-mixer.com) All other trademarks mentioned herein are the property of their respective owners.
  • 46. Big picture example architecture
  • 47. Key features ● 2 datacenters ● Both active (both datacenters are active, but probably with different applications running on each) ● Independent uplinks ● Redundant interconnect ● Applications are deployed and running on both ● Application cluster in every datacenter ● Session replication within every datacenter ● Cross replication between the 2 datacenters ● e.g. with Weblogic Cluster ● Bidirectional database replication ● e.g. 2 independent Oracle RAC clusters, one in each datacenter ● Replication over Oracle Streams/GoldenGate ● Monitoring of all critical resources ● Hardware nodes ● Connection pools ● JVM heaps ● Application switch
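For the "JVM heaps" item, the JDK itself exposes current heap figures through `java.lang.management`. A minimal sketch of a heap check that a monitoring agent could run or query over JMX; the class name and the 80%/90% thresholds are hypothetical:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapCheck {
    // Hypothetical alert thresholds, as fractions of the maximum heap size.
    static final double WARNING = 0.80;
    static final double CRITICAL = 0.90;

    /** Fraction of the maximum heap in use, or -1 if the JVM reports no maximum. */
    static double usedFraction(MemoryUsage heap) {
        long max = heap.getMax(); // -1 when undefined
        return max <= 0 ? -1 : (double) heap.getUsed() / max;
    }

    static String status(double fraction) {
        if (fraction < 0) return "UNKNOWN";
        if (fraction >= CRITICAL) return "CRITICAL";
        if (fraction >= WARNING) return "WARNING";
        return "OK";
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used %d of %d bytes: %s%n",
                heap.getUsed(), heap.getMax(), status(usedFraction(heap)));
    }
}
```

Raw heap usage includes garbage that has not been collected yet, so short spikes are normal; alerting on usage that stays high after garbage collections gives fewer false alarms.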
  • 48. Concepts: other network components ● Firewalls ● First-level firewalls – Cisco routers – Stateless firewalls – Not very restrictive ● Second-level firewalls (in front of the application load balancers) – Stateful, with connection tracking – Based on Linux/iptables with conntrackd (for failover) – Very restrictive – Rate limiting of new connections (DoS or Slashdot effect) ● All firewalls are (or will be) in active/hot-standby mode ● On a controlled failover (both nodes are running and we switch them), not a single TCP connection should be affected (apart from small delays) ● In a disaster case, it takes a few seconds until the cluster software detects the crash of the node and initiates the failover. No TCP connections should be lost, but there is a very small risk
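On the second-level firewalls, the rate limiting of new connections is done by iptables itself, whose `limit` match is documented as a token bucket filter. Purely to illustrate that policy (this is not firewall code), here is a token bucket in Java: tokens refill at a fixed rate up to a burst size, and each new connection consumes one. The class name, rate and burst values are hypothetical; time is passed in explicitly so the behaviour is deterministic:

```java
/** Token bucket: refills at ratePerSecond up to burst; each accepted new connection costs one token. */
final class TokenBucket {
    private final double ratePerSecond;
    private final double burst;
    private double tokens;
    private long lastNanos;

    TokenBucket(double ratePerSecond, double burst, long nowNanos) {
        this.ratePerSecond = ratePerSecond;
        this.burst = burst;
        this.tokens = burst; // start full, as after an idle period
        this.lastNanos = nowNanos;
    }

    /** Returns true if a new connection (a SYN packet) arriving at nowNanos is accepted. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastNanos) / 1e9;
        tokens = Math.min(burst, tokens + elapsedSeconds * ratePerSecond);
        lastNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false; // over the limit: the firewall would drop the SYN
    }
}
```

The burst lets a short spike of legitimate users through, while a sustained flood (DoS or a Slashdot effect) is capped at the refill rate. Established connections are not affected, because only new connection attempts consume tokens.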
  • 49. Example of a switch procedure of an application group ● Preparation steps ● Check the health of the replication processes ● Stop all batch applications (by stopping the job scheduling system). If the time pressure for the switch is high, just kill all running jobs (they have to be restartable anyway, already today) ● Switch off the keepalive feature on all httpd servers ● Switching steps ● Change the rules on the second-level firewalls so that all new connection requests (packets with the SYN flag set) are dropped ● Wait until the data is synchronized on both sides (e.g. by monitoring a heartbeat table) and no httpd processes are active any more ● Switch the application traffic to the other DC (by changing the routing of its IP addresses) ● Clean up (remove the dropping of SYN packets on the “old” site etc.) ● This procedure is repeated per application group until all applications are running
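The "wait until the data is synchronized" step can be reduced to one check: after the old site has drained its traffic, write a final heartbeat value there and wait until the target site sees it. This works under the assumption that the replication applies changes in commit order, so everything committed before the heartbeat has arrived too. A minimal Java sketch; the class name is hypothetical, and the reader function stands in for something like a JDBC query against a (hypothetical) heartbeat table on the target database:

```java
import java.time.Duration;
import java.util.function.LongSupplier;

final class ReplicationBarrier {
    /**
     * Polls the target site until its replicated heartbeat is at least writtenSeq,
     * the value written on the old site after its traffic was drained.
     * Returns false on timeout or interruption, in which case the switch should not proceed.
     */
    static boolean awaitReplicated(long writtenSeq, LongSupplier readTargetHeartbeat,
                                   Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (readTargetHeartbeat.getAsLong() >= writtenSeq) return true;
            if (System.nanoTime() - deadline >= 0) return false;
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

Only when this returns true should the routing of the application's IP addresses be moved; on a timeout, the safe reaction is to remove the SYN-dropping rule on the old site and abort the switch.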
  • 50. Application clusters (Weblogic) ● Features of Weblogic that we use ● mod_wl – Manages the stickiness and failover to backup nodes – Automatic retry of failed requests ● On timeouts ● On HTTP 50x responses ● Multipools – Gracefully remove a database node from the pool – Gracefully change parameters of connection pools – Guaranteed balance of connections between database nodes ● Binding execution threads to connection pools ● Automatic shutdown (+ restart) of nodes on OutOfMemoryError ● Session replication (also across both DCs) ● Thread monitoring (detecting dead or long-running threads etc.) ● Diagnostic images and alarms
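mod_wl performs the retry inside the web server plug-in. To make its decision logic explicit, here is a minimal Java sketch, not the plug-in's actual implementation: try the sticky primary first, move on to the next node on a 5xx response or a timeout (modelled as a `RuntimeException`), and return anything below 500 immediately. The class, the `Result` type and the node names are hypothetical:

```java
import java.util.List;
import java.util.function.Function;

final class FailoverClient {
    /** Hypothetical minimal response type. */
    static final class Result {
        final int status;
        final String body;
        Result(int status, String body) { this.status = status; this.body = body; }
    }

    /**
     * Sends the request to the nodes in order (sticky primary first, then backups).
     * A 5xx response or a RuntimeException (standing in for a timeout) triggers a retry
     * on the next node. Only safe for idempotent requests.
     */
    static Result call(List<String> nodes, Function<String, Result> send) {
        Result lastServerError = null;
        RuntimeException lastFailure = null;
        for (String node : nodes) {
            try {
                Result r = send.apply(node);
                if (r.status < 500) return r; // success or client error: do not retry
                lastServerError = r;
            } catch (RuntimeException e) {
                lastFailure = e;
            }
        }
        if (lastServerError != null) return lastServerError;
        if (lastFailure != null) throw lastFailure;
        throw new IllegalArgumentException("no nodes given");
    }
}
```

The retry on timeouts is the delicate part: a request that timed out may already have been executed on the first node, so automatic retries are only harmless for idempotent requests.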
  • 52. Deployment of connection pools ● One datasource per Oracle RAC node ● Set the initial capacity to a value sufficient for the usual load of the application – Creating new connections is expensive ● Set the max capacity to a value sufficient for a high-load scenario – The overall number of connections must fit within the connection limit on the database side ● Set JDBC parameters in the connection pool and not globally (e.g. v8compatibility=true) ● Check connections on reserve ● You can set DB session parameters in the init SQL property (e.g. alter session set NLS_SORT='GERMAN') ● Enable two-phase commit only if you need it (expensive) ● Prepared statement caching does not bring much performance (at least for Oracle databases) but costs open cursors in the database (per connection!), so don't use it unless you have a very good reason ● One Multipool containing all single datasources for one database ● Strategy: load balancing
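The Multipool "load balancing" strategy, together with gracefully removing a database node, can be pictured as round-robin over the per-node datasources while skipping disabled ones. Weblogic implements this internally; the following is only a minimal Java sketch of the selection logic, with hypothetical class and datasource names:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Round-robin over one datasource per RAC node; disabled nodes get no new work. */
final class MultiPool {
    private final List<String> dataSources;
    private final Set<String> disabled = ConcurrentHashMap.newKeySet();
    private final AtomicInteger next = new AtomicInteger();

    MultiPool(List<String> dataSources) {
        this.dataSources = List.copyOf(dataSources);
    }

    /** Graceful removal: new requests avoid the node, work already running there can finish. */
    void disable(String dataSource) { disabled.add(dataSource); }

    void enable(String dataSource) { disabled.remove(dataSource); }

    /** Picks the next enabled datasource in round-robin order. */
    String pick() {
        int n = dataSources.size();
        for (int i = 0; i < n; i++) {
            // floorMod keeps the index valid even after the counter overflows.
            String ds = dataSources.get(Math.floorMod(next.getAndIncrement(), n));
            if (!disabled.contains(ds)) return ds;
        }
        throw new IllegalStateException("no datasource available");
    }
}
```

For the sizing rule above: every managed server holds its own pools, so the worst case on each database node is roughly the number of managed servers times the max capacity of that node's datasource, and that product has to stay below the node's connection limit.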

Editor's Notes

  1. Reduce downtime to 0; keep the costs low; use Linux; use x64 hardware; keep software licenses as low as possible; minimize changes to applications