It was a busy time of the year – just a month after presenting at AWS CloudDay Warsaw, I was back on stage at AWS CloudDay Prague 2024 where I delivered session APP306 – Resilient architectures at scale: Real-life use cases from Amazon. The event was held at O2 Universum on the 23rd of October. This was my first time speaking in Prague and the atmosphere was very nice – it was great to connect with the Czech tech community.
I started by looking at the scale at which Amazon operates, sharing Amazon Prime Day 2023 statistics: 375 million items purchased and $12.7 billion in global sales, with Amazon SQS handling 86 million messages per second at peak, Amazon Aurora processing 318 billion transactions, and Amazon DynamoDB reaching 126 million requests per second at peak. At this scale, resilience becomes critical. Using the Amazon.com product detail page as an example, I showed how microservices enable resilience and scale. From there we explored cell-based architectures – a design pattern where a service is split into multiple independent deployment stacks called “cells” that share nothing, which reduces the blast radius of failures.
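To make the blast-radius idea concrete, here is a minimal sketch of how a thin routing layer might pin each customer to one cell. It is an illustration under assumptions, not how Amazon implements it: the cell names, the `cell_for` and `blast_radius` helpers, and the choice of hashing a customer ID with SHA-256 are all hypothetical.

```python
import hashlib
from typing import Iterable, List

# Hypothetical cell names; each cell is an independent, share-nothing stack.
CELLS: List[str] = ["cell-0", "cell-1", "cell-2", "cell-3"]


def cell_for(partition_key: str, cells: List[str] = CELLS) -> str:
    """Deterministically map a partition key (e.g. a customer ID) to one cell."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def blast_radius(failed_cell: str, keys: Iterable[str]) -> float:
    """Fraction of keys that would be affected if `failed_cell` went down."""
    keys = list(keys)
    affected = sum(1 for key in keys if cell_for(key) == failed_cell)
    return affected / len(keys)


# Made-up customer IDs, purely for illustration.
customers = [f"customer-{i}" for i in range(10_000)]
print(f"Share of customers hit by losing cell-0: {blast_radius('cell-0', customers):.1%}")
```

With four cells, losing one affects roughly a quarter of the customers instead of all of them. Note that simple modulo hashing reshuffles most keys when the number of cells changes, so a real router would more likely use an explicit mapping table or consistent hashing.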
We then covered two real-world use cases. First, how Prime Video improved availability by deploying cells in each AWS Region, using Amazon Route 53 for cellular traffic routing with round-robin and geo-proximity policies, combined with calculated health checks based on Amazon CloudWatch alarms – achieving 99.9996% availability. Second, how Amazon Music implemented fault isolation using AWS Fargate on Amazon ECS for cell routing, with a two-layer mapping strategy based on device type and event tier, organizing workloads into supercells.
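The Prime Video routing idea can be sketched as a small simulation: each cell reports several alarm states, a calculated check treats the cell as healthy when enough of those alarms are OK, and traffic rotates across healthy cells only. This is a simplified model under assumptions and does not call Route 53 or CloudWatch; the `Cell` class, the `threshold` field, and the `route_round_robin` function are hypothetical names, and the fail-open fallback is a design choice of this sketch.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import List


@dataclass
class Cell:
    name: str
    alarms_ok: List[bool]  # True means the underlying alarm is in the OK state
    threshold: int         # minimum number of OK alarms for the cell to count as healthy

    def healthy(self) -> bool:
        """Calculated check: healthy when at least `threshold` alarms are OK."""
        return sum(self.alarms_ok) >= self.threshold


def route_round_robin(cells: List[Cell], n_requests: int) -> List[str]:
    """Assign requests to healthy cells in rotation.

    If no cell is healthy, fall back to all cells (fail open) rather than
    dropping traffic entirely.
    """
    healthy = [cell.name for cell in cells if cell.healthy()]
    candidates = healthy or [cell.name for cell in cells]
    rotation = cycle(candidates)
    return [next(rotation) for _ in range(n_requests)]


cells = [
    Cell("cell-a", [True, True, True], threshold=2),
    Cell("cell-b", [True, False, False], threshold=2),  # only one alarm OK: unhealthy
    Cell("cell-c", [True, True, False], threshold=2),
]
print(route_round_robin(cells, 4))
```

In this example `cell-b` is skipped because only one of its three alarms is OK, so requests alternate between `cell-a` and `cell-c`.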
It was great to see so many people interested in resilience patterns and cell-based architectures!
The slides from the presentation can be found here.