
AWS S3 Outage & Lessons in Tech Responsibility From Smokey the Bear

Earlier this week, AWS S3 had to fight its way back to life in the US-EAST-1 Region. Multiple Availability Zones were in the throes of recovery, and potentially hundreds of thousands of websites and applications had trouble retrieving objects from the popular object storage platform.

Who’s at fault when IT fails?

From data breaches caused by third-party vulnerabilities (the CloudPets breach is the latest example in a string of very public attacks) to the cloud going down, as it did on February 28th during the AWS S3 outage, the question will come up.

Many AWS customers impacted by the outage were quick to explain that any disruption to their business was not their fault. True or not, any disruption in service reflects on your business, and pointing fingers will not please unhappy customers.

To answer the question: organizations are responsible when IT fails, because they are responsible for ensuring their infrastructure is resilient in the event of an outage. It's like what Smokey the Bear means when he tells us, “Only you can prevent forest fires.” In the case of AWS, though, Jeff Bezos is Smokey, and he’s pointing his finger at all of those AWS users telling their customers it was AWS’s fault.

The Cloud Will Go Down.

There is no special text in the terms and conditions; this is a fact. While cloud providers design their infrastructure to be as resilient as possible, they warn users to design with the intention of surviving partial service outages. In addition to the recent S3 outage, AWS has suffered DDoS attacks before and been forced to reboot EC2 hosts to patch security vulnerabilities.

Help your organization avoid the fallout when the cloud goes down by designing a resilient infrastructure.

How do we design a resilient infrastructure? I posed this question to Steve Haines, a former lead architect at Disney and now Principal Software Architect at Turbonomic. Here’s the transcript of that discussion.

Q&A with Disney Tech Veteran: A Developer’s Reaction to the AWS Outage

ERIC: What does it mean to think about designing across regions inside the public cloud?

STEVE: Designing an application to run across multiple AWS regions is not a trivial task. While you can deploy stateless services or micro-services to multiple regions and then configure Route53 (Amazon’s DNS Service) to point to Elastic Load Balancers (ELBs) in each region, that doesn't completely solve the problem.
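As a rough illustration of the Route53 setup described above, the sketch below builds the change batch that a call such as boto3's `change_resource_record_sets` expects, with one latency-based alias record per region. It only constructs the data and makes no AWS calls. The domain, the ELB DNS names, the hosted zone IDs, and the function name `build_latency_records` are all placeholders chosen for illustration, not values from the interview.

```python
from typing import Dict, List


def build_latency_records(domain: str, elbs: List[Dict[str, str]]) -> Dict:
    """Build a Route 53 change batch with one latency-based alias record per region.

    Each entry in `elbs` needs three keys: `region`, `dns_name`, and
    `hosted_zone_id` (the load balancer's own canonical hosted zone ID,
    not the ID of the zone that holds your domain).
    """
    changes = []
    for elb in elbs:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "A",
                # SetIdentifier must be unique among records sharing a name and type.
                "SetIdentifier": f"{domain}-{elb['region']}",
                # Region turns this into a latency-based routing record.
                "Region": elb["region"],
                "AliasTarget": {
                    "HostedZoneId": elb["hosted_zone_id"],
                    "DNSName": elb["dns_name"],
                    # Let Route 53 skip this record if the load balancer is unhealthy.
                    "EvaluateTargetHealth": True,
                },
            },
        })
    return {"Comment": "Latency-based routing across regions", "Changes": changes}


# Placeholder values for illustration only.
ELBS = [
    {"region": "us-east-1", "dns_name": "east-elb.example.com.", "hosted_zone_id": "ZEXAMPLEEAST"},
    {"region": "us-west-2", "dns_name": "west-elb.example.com.", "hosted_zone_id": "ZEXAMPLEWEST"},
]

batch = build_latency_records("app.example.com.", ELBS)
for change in batch["Changes"]:
    record = change["ResourceRecordSet"]
    print(record["Region"], "->", record["AliasTarget"]["DNSName"])
```

To apply a batch like this, you would pass it as `ChangeBatch` to `change_resource_record_sets` along with the `HostedZoneId` of your own zone. As Steve notes, DNS routing covers only the stateless tier; data still has to be available in each region.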

First, it's crucial to consider the cost of redundancy. How many regions, and how many availability zones (AZs) in each region, do we want to deploy to? Judging from historical outages, you’re probably safe with two regions, but you do not want to keep a full copy of your application deployed in another region just for disaster recovery: you want to use it and distribute workloads across those regions!

For some use cases this will be easy, but for others you will need to design your application so that it is close to the resources it needs to access. If you design your application with failure in mind and to run in multiple regions then you can manage the cost because both regions will be running your workloads.
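The cost trade-off can be made concrete with a little arithmetic. The sketch below is a simplified model, not a sizing tool: it assumes load is spread evenly across regions and that the surviving regions must absorb the full peak when one region is lost. The function names and the figure of 1,000 requests per second are illustrative assumptions.

```python
def per_region_capacity(peak_load: float, regions: int, tolerated_failures: int = 1) -> float:
    """Capacity each region needs so the survivors can carry the full peak load."""
    surviving = regions - tolerated_failures
    if surviving < 1:
        raise ValueError("need at least one surviving region")
    return peak_load / surviving


def overprovision_ratio(regions: int, tolerated_failures: int = 1) -> float:
    """Total provisioned capacity divided by peak load."""
    surviving = regions - tolerated_failures
    if surviving < 1:
        raise ValueError("need at least one surviving region")
    return regions / surviving


PEAK = 1000.0  # hypothetical peak load in requests per second

for n in (2, 3, 4):
    print(
        f"{n} regions: {per_region_capacity(PEAK, n):.0f} req/s each, "
        f"{overprovision_ratio(n):.2f}x total capacity"
    )
```

Under these assumptions, two regions each need enough capacity for the whole peak (2x in total), while three regions need 1.5x. Either way, running real traffic in every region means that capacity is doing work rather than sitting idle as a standby.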

ERIC: That seems to be a bit of the cost of doing business for design and resiliency, but what is the impact below the presentation layers? It feels like that is the sort of “low hanging fruit” as we know it, but there is much more to the application architecture than that, right?

STEVE: Exactly! That leads to the next challenge: resources, such as databases and files. While AWS provides users Multi-AZ database replication free of charge for databases running behind RDS, users are still paying for storage, IOPS, etc. However, this model changes if a user wants to replicate across regions. For example, Oracle provides a product called GoldenGate for performing cross-region replication, which is a great tool but can significantly impact your IT budget.

Alternatively, you can consider one of Amazon’s native offerings, Aurora, which supports cross-region replication out-of-the-box, but that needs to be a design decision you make when you’re building or refactoring your application. And if you store files in S3, be sure that you enable cross-region replication. It will cost you more, but it will ensure that files stored in one region remain available in the event of a regional outage.
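Replication only helps if the application knows to read from the replica when the primary region is unreachable. The sketch below shows one way that fallback might look, using stand-in fetch functions rather than real S3 calls so it runs offline. The region names, the object key, and the helper `read_with_fallback` are hypothetical. In practice each fetcher would wrap something like boto3's `get_object` against the bucket in that region.

```python
from typing import Callable, Sequence, Tuple

Fetcher = Callable[[str], bytes]


def read_with_fallback(key: str, sources: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each (name, fetcher) pair in order and return the first successful read."""
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch(key)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all sources failed for {key!r}: " + "; ".join(errors))


# Stand-ins for the primary bucket and its cross-region replica.
REPLICA_OBJECTS = {"logo.png": b"fake image bytes"}


def fetch_primary(key: str) -> bytes:
    # Simulates the primary region being unavailable.
    raise ConnectionError("us-east-1 unavailable")


def fetch_replica(key: str) -> bytes:
    return REPLICA_OBJECTS[key]


SOURCES = [("us-east-1", fetch_primary), ("us-west-2", fetch_replica)]

served_from, body = read_with_fallback("logo.png", SOURCES)
print(f"served from {served_from}: {len(body)} bytes")
```

Two caveats apply to the real service: S3 replication requires versioning to be enabled on both buckets, and it is asynchronous, so a very recent write may not have reached the replica yet.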

ERIC: Sounds like we already have some challenges in front of us just porting our designs to cloud platforms, but when we're leaning into the cloud as a first-class destination for our apps, we already have to think about big outages. We do disaster recovery testing on-premises because that's something we can control. How do we do that type of testing out in the public cloud?

STEVE: Good question. It’s important to remember that while designing an application to run in a cross-region capacity is one thing, having the confidence that it will work when you lose a region is another beast altogether!

This is where I’ll defer to Netflix’s practice of designing for failure and regularly testing failure scenarios. They have a “Simian Army” that simulates various failure scenarios in production and ensures that everything continues to work. Its members include Chaos Gorilla, which takes out an entire Availability Zone, and Chaos Kong, which takes out a whole region, so that Netflix can confirm it continues to function. That practice is one of the reasons they were able to sustain the previous full region outage. If you’re serious about running across regions then you need to regularly validate that it works!
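Before running a drill against live infrastructure, it can help to rehearse the idea on paper. The sketch below is a toy offline model, not Netflix's tooling: given a map of which services are deployed in which regions, it "kills" each region in turn and reports the services that would have no surviving deployment. The service names and the deployment map are invented for illustration.

```python
from typing import Dict, List, Set

# Hypothetical inventory: service name -> regions where it is deployed.
DEPLOYMENTS: Dict[str, Set[str]] = {
    "web": {"us-east-1", "us-west-2"},
    "api": {"us-east-1", "us-west-2"},
    "reports": {"us-east-1"},
}


def services_lost(deployments: Dict[str, Set[str]], failed_region: str) -> List[str]:
    """Return the services with no deployment left once `failed_region` is gone."""
    return sorted(
        name
        for name, regions in deployments.items()
        if not (regions - {failed_region})
    )


def run_region_drill(deployments: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Simulate losing each region in turn and record which services go dark."""
    all_regions = set().union(*deployments.values())
    return {region: services_lost(deployments, region) for region in sorted(all_regions)}


report = run_region_drill(DEPLOYMENTS)
for region, lost in report.items():
    status = ", ".join(lost) if lost else "nothing"
    print(f"lose {region}: {status} goes down")
```

A model like this only catches gaps in what has been deployed where. It says nothing about data replication lag, DNS failover time, or capacity, which is why the drill against real systems still matters.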

But maybe we should think bigger than cross-region – what if we could design across clouds for the ultimate protection?

ERIC: Thanks for the background and advice, Steve. Good food for thought for all of us in the IT industry. I’m sure a lot of people will be having this discussion in the coming weeks after the recent outage.

By Eric Wright and Steve Haines


Eric Wright

Before joining Turbonomic as the principal solutions engineer, Eric Wright served as a systems architect at Raymond James in Toronto. As a result of his work, Eric was named a VMware vExpert and Cisco Champion with a background in virtualization, OpenStack, business continuity, PowerShell scripting and systems automation. He’s worked in many industries, including financial services, health services and engineering firms. As the author behind, a technology and virtualization blog, Eric is also a regular contributor to community-driven technology groups such as the vBrownBag community, and he leads the VMUG organization in Toronto, Canada. He is an author at Pluralsight, the leading provider of online training for tech and creative professionals. Eric’s latest course is “Introduction to OpenStack” you can check it out at

Steven Haines

Steven Haines is a Principal Software Architect at Turbonomic and a frequent contributor to about:virtualization. He has spent the better part of the past 8 years working at Disney in various architecture roles, most recently focusing on cloud architectures, deployments, and migrations. Steven has written two Java programming books, a performance analysis book, more than 500 articles, and more than a dozen white papers on performance, scalability, and cloud-based architectures. Check out Steven’s website at


Established in 2009, CloudTweaks is recognized as one of the leading authorities in cloud connected technology information and consultancy services.
