What to Do When AWS Goes Offline?

Amazon’s services are not completely ironclad, and applications built on the platform remain vulnerable to the occasional outage. When one happens, responsibility is shared between AWS and the developers of the affected web applications. AWS designs its services with redundancy in mind: much of the physical infrastructure is spread across multiple availability zones, with backup databases to cover for instances that fail or are taken down for maintenance. Although this mitigates data loss, it is still advisable to prepare for the next incident.

Active-Active and Active-Passive Failover via Route 53

Active-Active is the general term for a setup in which every node serves traffic at all times. When one node fails, its inbound traffic is redirected to a still-functional node or balanced across all remaining active nodes. In AWS’ case, this can be applied at different scales, including both regions and availability zones. Running multiple full copies of an application’s infrastructure in multiple availability zones is more expensive, but it is well worth the effort to keep the application up when an instance does go down. By comparison, Active-Passive is a cheaper approach in which fully redundant backup nodes receive traffic only when the primary node goes offline. The trade-off is that the standby capacity sits idle until it is needed, and failing over to it typically takes longer than rebalancing in an Active-Active setup.
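The difference between the two methods can be illustrated with a small sketch. This is a toy model rather than anything AWS-specific: the Node class, the standby flag, and both function names are hypothetical and exist only to show which nodes receive traffic in each mode.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a region, availability zone, or instance."""
    name: str
    healthy: bool = True
    standby: bool = False  # True only for an Active-Passive backup node


def active_active_targets(nodes):
    """Every healthy node serves traffic; a failed node simply drops out."""
    return [n.name for n in nodes if n.healthy]


def active_passive_targets(nodes):
    """Primaries serve traffic; standbys are used only if no primary is healthy."""
    primaries = [n.name for n in nodes if n.healthy and not n.standby]
    if primaries:
        return primaries
    return [n.name for n in nodes if n.healthy and n.standby]


if __name__ == "__main__":
    fleet = [Node("primary"), Node("backup", standby=True)]
    print(active_passive_targets(fleet))  # traffic goes to the primary
    fleet[0].healthy = False              # simulate an outage
    print(active_passive_targets(fleet))  # traffic moves to the backup
```

In the Active-Active case the surviving nodes must have enough spare capacity to absorb the failed node's share, which is where the extra cost comes from.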

To accomplish this, AWS provides Route 53, a scalable cloud Domain Name System (DNS) service with additional features such as health checks and traffic policies. Once a health check marks an endpoint as unhealthy, Route 53 stops returning it in DNS responses and directs traffic to the remaining healthy endpoints. Because resolvers cache DNS answers, a short record TTL helps the change take effect quickly. Health checks are configured under the Health checks section of the Route 53 console. There, users choose what to monitor, how frequently to run the check, how many consecutive checks must fail before action is taken, whether to show latency graphs, and which regions the checks originate from. With the health checks in place, routing policies can then be set up to redirect traffic to alternate endpoints. The next time an outage occurs, traffic will be redirected automatically with minimal oversight.
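The same settings can also be expressed as request payloads rather than console clicks. The sketch below only builds the dictionaries that the Route 53 API expects for a health check and a failover record. The helper names, the /health path, and the default values are assumptions chosen for illustration, and nothing here contacts AWS.

```python
def health_check_config(domain, path="/health", interval=30, failure_threshold=3):
    """Build a HealthCheckConfig dict for an HTTPS endpoint check."""
    if interval not in (10, 30):
        raise ValueError("Route 53 accepts a request interval of 10 or 30 seconds")
    if not 1 <= failure_threshold <= 10:
        raise ValueError("failure threshold must be between 1 and 10")
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": domain,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": interval,            # seconds between checks
        "FailureThreshold": failure_threshold,  # consecutive failures before unhealthy
    }


def failover_record(name, ip, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover A record (PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,  # distinguishes records that share a name
        "Failover": role,
        "TTL": ttl,               # kept short so failover propagates quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# With boto3 installed and credentials configured, these would be passed to
# route53.create_health_check(CallerReference=..., HealthCheckConfig=...) and
# route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch={"Changes": [...]}).
```

A typical Active-Passive setup would use one PRIMARY record tied to a health check and one SECONDARY record pointing at the standby endpoint.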

Multi-Region Asset Replication

Another method is automatic asset replication. While it is enabled, new objects written to a targeted S3 bucket are automatically copied to a designated bucket in a different AWS region. In the S3 console, users can select a bucket and open its property controls. Versioning must be enabled on both the source and destination buckets before cross-region replication can be configured. Users will still need to adjust how the application references static content so that it can fall back to the replica. While it is possible to simply change the asset hostname, a more thorough solution is recommended. If all else fails, keep local backups of all available application data in anticipation of needing to do a full restore.
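For teams that script their infrastructure, the replication setup can be sketched as the two configuration payloads involved. The helper names, the rule ID, and the choice not to replicate delete markers are assumptions for illustration; the IAM role and both buckets are assumed to exist already, and nothing here contacts AWS.

```python
def versioning_config():
    """Versioning must be enabled on both buckets before replication is allowed."""
    return {"Status": "Enabled"}


def replication_config(role_arn, destination_bucket, rule_id="replicate-all-assets"):
    """Build a ReplicationConfiguration that copies every new object.

    role_arn is an IAM role that S3 can assume to perform the copy;
    destination_bucket is the name of a bucket in another region.
    """
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches every key
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


# With boto3 installed and credentials configured, these would be passed to
# s3.put_bucket_versioning(Bucket=..., VersioningConfiguration=versioning_config())
# and s3.put_bucket_replication(Bucket=..., ReplicationConfiguration=replication_config(...)).
```

Objects that were already in the source bucket before the rule was created are not covered by this kind of rule and would need to be copied separately.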

Develop a Backup Plan and Stress Test

The safest route is to treat an outage as an inevitability and assume that a failure will happen at some point. Along with establishing the necessary policies for rerouting traffic, it’s good to have a plan for when the service does fail. Writing test cases for likely failures, and designing the protocols and actions to take around them, will leave the organization far better prepared. Each outage has the potential to damage an application’s state, logins, and configurations, so it is important to be able to restore as much as possible to the last known state before the incident.
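One way to make such a test case concrete is a restore drill: take a backup, deliberately damage the live state, restore, and confirm the result matches what was saved. The sketch below is a minimal in-memory version. The function names and the dictionary-shaped state are hypothetical, and a real drill would run against actual backups and a staging environment.

```python
import copy
import hashlib
import json


def fingerprint(state):
    """Stable SHA-256 digest of a JSON-serializable state."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def take_backup(state):
    """Store a deep copy of the state together with its digest."""
    return {"data": copy.deepcopy(state), "digest": fingerprint(state)}


def restore(backup):
    """Return the backed-up state, refusing to restore a corrupted backup."""
    if fingerprint(backup["data"]) != backup["digest"]:
        raise ValueError("backup is corrupted: digest mismatch")
    return copy.deepcopy(backup["data"])


def run_restore_drill(state, damage):
    """Back up, apply a simulated failure, restore, and report whether it matched."""
    expected = fingerprint(state)
    backup = take_backup(state)
    damage(state)  # simulate the outage corrupting live state
    restored = restore(backup)
    return fingerprint(restored) == expected


if __name__ == "__main__":
    live = {"sessions": ["alice", "bob"], "config": {"feature_x": True}}
    print(run_restore_drill(live, lambda s: s["sessions"].clear()))
```

Running a drill like this on a schedule gives early warning when a backup can no longer be restored.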

Built-in Monitoring Tools

AWS comes with a number of tools for monitoring the health of the services an application is built on. Use the entire suite rather than a single tool: Amazon CloudWatch alone can only go so far, because it runs inside the same platform as the services it monitors. An external monitoring service can provide greater insight into how and when a service went down, which helps minimize the data lost during an outage and informs preparation for future ones. When building on the Amazon cloud, it is also helpful to review in advance the current status and status history of Amazon services across regions, as listed on the AWS Health Dashboard.
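An external check does not need to be elaborate to be useful. The following is a minimal sketch of a probe that could run from outside AWS, using only the Python standard library. The function names and the result format are assumptions, and the fetch parameter exists so the probe can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Default fetcher: return the HTTP status code of a GET request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses still carry a status code


def probe(url, fetch=http_status):
    """Check one endpoint and record status, latency, and any error."""
    started = time.monotonic()
    status, error = None, None
    try:
        status = fetch(url)
    except Exception as exc:  # DNS failures, timeouts, refused connections
        error = repr(exc)
    return {
        "url": url,
        "ok": status is not None and 200 <= status < 400,
        "status": status,
        "error": error,
        "latency_s": time.monotonic() - started,
    }


if __name__ == "__main__":
    # Offline demonstration with a stand-in fetcher instead of a real request.
    print(probe("https://app.example.com/health", fetch=lambda url: 200))
```

Logging these results over time, from a host that does not depend on the same region, shows when an outage started and how long it lasted.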

Well-Architected Framework

Following the guidelines set out by the AWS Well-Architected Framework can help make an AWS architecture much more reliable. It provides best-practice guidance across six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. A fundamentally well-designed structure helps an application from the moment it is conceptualized, and we can help build it. Check out our offering on the AWS Marketplace if building a reliable application on AWS is a major concern.

An Ounce of Prevention

Losing connection to an application on AWS is the least of your concerns. The larger risk is data that can be lost in the cloud, and everyone’s concerns would ease if such a loss were made less consequential. Amazon does provide redundancies in the event of damage, but designers must also structure their data flows accordingly to take full advantage of them.

Dolan Cleary

I am a recent graduate from the University of Wisconsin - Stout and am now working with AllCode as a web technician, currently within the marketing department.
