Design Resilient Architecture Part-01

DevOps & Cloud Engineer with 3.5+ years of AWS experience managing multi-cluster Kubernetes platforms and 70+ microservices. Specialized in Terraform (IaC), CI/CD automation, and GitOps (ArgoCD), building secure, highly available, and cost-optimized cloud-native platforms with strong expertise in AWS architecture and security hardening.

🧠 Quick Story: What is happening here?

  • You have Hadoop (big, distributed ETL workload).

  • You have LOTS of EC2 instances (50 per AZ!).

  • You want high availability (no single hardware failure should kill too many instances).


🔥 Which Placement Group Should You Pick?

✅ Partition Placement Group

Why?

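  • It divides EC2 instances into logical partitions (up to 7 per AZ).

  • Each partition sits on its own set of racks, with separate network and power, so no two partitions share hardware.

  • So, failures are isolated → if the hardware behind one partition fails, instances in the other partitions keep running!

  • Designed for large distributed, replicated systems like Hadoop, Kafka, Cassandra.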
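
To make the "failures are isolated" point concrete, here is a minimal Python sketch that estimates the worst-case loss when one partition's hardware fails. It assumes instances are spread evenly (round-robin) across partitions, which is a simplification: EC2 tries to distribute evenly but does not guarantee it. The function names are made up for illustration.

```python
from __future__ import annotations

from collections import Counter

MAX_PARTITIONS_PER_AZ = 7  # EC2 limit for partition placement groups


def assign_partitions(instance_count: int, partition_count: int) -> list[int]:
    """Return a 1-based partition number for each instance (round-robin assumption)."""
    if not 1 <= partition_count <= MAX_PARTITIONS_PER_AZ:
        raise ValueError("partition_count must be between 1 and 7 per AZ")
    return [(i % partition_count) + 1 for i in range(instance_count)]


def worst_case_loss(instance_count: int, partition_count: int) -> int:
    """Instances lost if the hardware behind the fullest partition fails."""
    sizes = Counter(assign_partitions(instance_count, partition_count))
    return max(sizes.values(), default=0)


if __name__ == "__main__":
    lost = worst_case_loss(50, 7)
    print(f"Worst case: {lost} of 50 instances lost")  # Worst case: 8 of 50 instances lost
```

With 50 instances in one AZ and 7 partitions, a single rack-level failure takes out roughly one seventh of the fleet instead of all of it.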


❌ Wrong Options (and why)

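| Option | Why Wrong? |
|---|---|
| Cluster Placement Group | Packs instances close together in a single AZ for low-latency HPC (High Performance Computing) workloads; it does not isolate hardware failures. |
| Spread Placement Group | Only for a small number of instances (max 7 running instances per AZ per group), not 50+ instances. |
| Both Spread and Partition | No, Spread does not fit here because of its instance limit. Only Partition is correct. |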

🎯 Final Shortcut to Remember

Hadoop = Partition Placement Group.

Simple one-line memory rule:

"Distributed Big Data ➡️ Partition Placement Group"


Here's a funny and super easy shortcut for remembering Cluster vs Spread vs Partition: 🚀


🎯 1. Cluster Placement Group

"Cluster = Close"

  • All instances packed closely together.

  • For: Super fast, low-latency communication (like HPC apps, tightly-coupled systems).

🧠 Memory trick:

"Cluster means close together like a tight friend group."


🎯 2. Spread Placement Group

"Spread = Separate"

  • Instances are kept far apart, each on distinct hardware.

  • For: Protecting a small number of critical instances (max 7 running instances per AZ per group).

🧠 Memory trick:

"Spread means spread out like introverts at a party."


🎯 3. Partition Placement Group

"Partition = Pieces"

  • Big groups divided into pieces (partitions).

  • Each partition is isolated → good for Hadoop, Kafka, Big Data systems.

🧠 Memory trick:

"Partition means split into pieces for massive distributed workloads."


🔥 Final One-Liner Summary:

Cluster = Close together 🔥
Spread = Separate far away 🏃‍♂️🏃‍♀️
Partition = Pieces for Big Data 🏗️
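
The three memory rules can be written down as a tiny decision helper. This is only a sketch of the rule of thumb from this post, not an official AWS selection algorithm; the function name and its parameters are hypothetical. The returned strings match the strategy names EC2 uses (`cluster`, `spread`, `partition`).

```python
from typing import Optional

SPREAD_LIMIT_PER_AZ = 7  # max running instances per AZ in one spread group


def pick_placement_strategy(instances_per_az: int,
                            needs_low_latency: bool,
                            needs_failure_isolation: bool) -> Optional[str]:
    """Apply the Close / Separate / Pieces rule of thumb."""
    if needs_low_latency and not needs_failure_isolation:
        return "cluster"      # Close: pack instances together
    if needs_failure_isolation and instances_per_az <= SPREAD_LIMIT_PER_AZ:
        return "spread"       # Separate: a few critical instances
    if needs_failure_isolation:
        return "partition"    # Pieces: large distributed workloads
    return None               # no placement group needed


if __name__ == "__main__":
    # The Hadoop scenario: 50 instances per AZ, isolation matters most.
    print(pick_placement_strategy(50, False, True))  # partition
```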


🎯 What the company needs:

  • Failover to AWS quickly if their on-premises data center fails.

  • Minimal downtime (least waiting time).

  • Same data on-prem and AWS (uniform data).


❌ Why your selected option was wrong:

You selected a solution that uses AWS CloudFormation triggered by Lambda.
👉 Problem:
CloudFormation takes time to create EC2 instances, an ALB, etc. (could be minutes).
That is not fast failover ❌, and it increases downtime.


✅ Correct (Best) Solution:

  • Route 53 failover record
    (With a health check on the primary, Route 53 detects the failure and automatically switches traffic to AWS.)

  • Already running EC2 servers behind an Application Load Balancer in an Auto Scaling group
    (Not created after the failure; they are already ready!)

  • AWS Storage Gateway
    (Keeps on-prem and AWS data in sync.)

👉 So there is no provisioning delay ✅. Failover only takes as long as the health check and the DNS TTL need to pick up the switch.
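
A rough back-of-the-envelope comparison shows why pre-running servers win. The sketch below assumes failover time is roughly health check interval × failure threshold, plus the DNS TTL, plus any provisioning time. Route 53 health checks support 10 or 30 second request intervals with a configurable failure threshold; the 60 second TTL and the 600 second provisioning time are made-up placeholder values, not measurements.

```python
def detection_seconds(check_interval_s: int, failure_threshold: int) -> int:
    """Time for consecutive failed health checks to mark the primary unhealthy."""
    return check_interval_s * failure_threshold


def failover_seconds(check_interval_s: int,
                     failure_threshold: int,
                     record_ttl_s: int,
                     provisioning_s: int = 0) -> int:
    """Rough upper bound: detect the failure, build anything missing, wait out the TTL."""
    return (detection_seconds(check_interval_s, failure_threshold)
            + provisioning_s
            + record_ttl_s)


if __name__ == "__main__":
    # Warm standby: servers already running behind the ALB.
    warm = failover_seconds(10, 3, 60)
    # Cold start: hypothetical 600 s for CloudFormation to build the stack.
    cold = failover_seconds(10, 3, 60, provisioning_s=600)
    print(f"warm standby: ~{warm} s, provision on failure: ~{cold} s")
    # warm standby: ~90 s, provision on failure: ~690 s
```

Whatever the real numbers are in your environment, the provisioning term is the one you can remove completely by keeping the AWS side running.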


🧠 Easy memory trick:

Failover = Pre-Running + Load Balancer + Storage Sync 🚀
(Never launch servers during disaster. They must be already active.)


📢 Simple Final Thought:

If the question asks for LEAST downtime, avoid any solution where new resources are created on the fly, such as CloudFormation stacks launched by Lambda triggers.
Pre-built infrastructure wins every time! 🏆

🎯 What’s happening in the question:

  • Videos are saved on local EBS volumes attached to each EC2 instance.

  • When users log in, the Load Balancer sends them to different instances.

  • Problem: Each instance has different videos on its own EBS.
    (That's why users see a random subset of their videos each time.)


✅ Best solution:

You need shared storage that all EC2 instances can access together.
Two best options:

1. Amazon S3 (Best for Object Storage like videos)

  • Upload all videos to S3.

  • Modify app to read/write videos directly from S3.

  • ✅ S3 is scalable, reliable, and all instances can access it anytime.

2. Amazon EFS (Shared File System)

  • Mount EFS to all EC2 instances.

  • Migrate existing videos from EBS to EFS.

  • App can access videos like normal files but from shared storage.
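
The difference between per-instance storage and shared storage can be simulated in a few lines of Python. This is a toy model, not AWS code: a private `set` stands in for an instance-local EBS volume, one `set` shared by all instances stands in for S3 or EFS, and the load balancer is assumed to be plain round-robin. All class and function names are hypothetical.

```python
from itertools import cycle


class AppInstance:
    """One app server; keeps videos locally unless given a shared store."""

    def __init__(self, name, shared_store=None):
        self.name = name
        # A private set models an instance-local EBS volume.
        # Passing the same set to every instance models S3 or EFS.
        self.store = shared_store if shared_store is not None else set()

    def upload(self, video):
        self.store.add(video)

    def list_videos(self):
        return sorted(self.store)


def simulate(instances, videos):
    """Route each upload round-robin, then report what each instance can see."""
    balancer = cycle(instances)
    for video in videos:
        next(balancer).upload(video)
    return {inst.name: inst.list_videos() for inst in instances}


VIDEOS = [f"video-{n}.mp4" for n in range(1, 7)]

# Local storage: every instance ends up with a different subset.
local_view = simulate([AppInstance(f"i-{k}") for k in range(3)], VIDEOS)

# Shared storage: every instance sees all six videos.
shared = set()
shared_view = simulate([AppInstance(f"i-{k}", shared) for k in range(3)], VIDEOS)

if __name__ == "__main__":
    print("local :", local_view)
    print("shared:", shared_view)
```

In the local run each instance holds only two of the six videos, which is exactly the "random subset" symptom from the question.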


❌ Why other options are wrong:

| Option | Why Wrong |
|---|---|
| S3 Glacier Deep Archive | It's for cold storage (retrieval takes hours, meant for backups and archives), not for active videos. ❌ |
| Amazon RDS | Relational database (for structured data like users, orders); not good for big videos. ❌ |
| DynamoDB | NoSQL database (for key-value or document data, with items capped at 400 KB); not meant for storing big video files. ❌ |

🧠 Easy memory trick:

For storing videos, images, files ➔ Use S3 or EFS, NOT databases.


📢 Simple Final Thought:

Block storage (EBS) is normally tied to a single server.
Shared storage (S3 or EFS) is visible to all servers.

When users need to access the same content regardless of server, always think S3 or EFS first! 🚀

🏬 What’s happening in the question:

A retail company wants to test a blue-green deployment for its global app within 48 hours, just before a major sales event (Thanksgiving).

💡 Most users access the app via mobile devices, which often cache DNS records, so DNS-based changes are slow to reach them.
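
Here is a small Python sketch of why cached DNS gets in the way of a quick switch. It models a client that keeps a resolved answer until its TTL expires. The 300 second TTL is an arbitrary example value, and real mobile clients may hold records even longer than the TTL says; the class name is made up for illustration.

```python
class CachingClient:
    """A client that reuses a DNS answer until its TTL runs out."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._cached = None
        self._expires_at = 0

    def resolve(self, now_s, authoritative_answer):
        """Return the cached answer, refreshing it only after expiry."""
        if self._cached is None or now_s >= self._expires_at:
            self._cached = authoritative_answer
            self._expires_at = now_s + self.ttl_s
        return self._cached


if __name__ == "__main__":
    client = CachingClient(ttl_s=300)
    print(client.resolve(0, "blue"))     # blue  (first lookup)
    # At t=10 s the DNS record is switched to green...
    print(client.resolve(10, "green"))   # blue  (still cached)
    print(client.resolve(300, "green"))  # green (cache expired)
```

For almost five minutes after the switch, this client keeps hitting the old deployment even though DNS already points at the new one.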


✅ Best answer:

Use AWS Global Accelerator to distribute a portion of traffic to a particular deployment


✅ Why it's correct:

  • AWS Global Accelerator gives clients static anycast IP addresses, so you can shift traffic between deployments with endpoint weights and traffic dials, without waiting for DNS changes to propagate.

  • Perfect for mobile clients, which cache DNS and may not update quickly.

  • Provides global coverage and near-instant routing updates.

  • Ideal for testing a new "green" version while keeping the "blue" live.
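
To see how "a portion of traffic" works out in numbers, here is a minimal sketch of proportional weighting. Global Accelerator endpoint weights range from 0 to 255, and traffic is split in proportion to each weight over the sum of weights. The helper below is illustrative only, the endpoint names `blue` and `green` are placeholders, and rejecting an all-zero set of weights is a simplification of this sketch.

```python
from __future__ import annotations


def traffic_share(weights: dict[str, int]) -> dict[str, float]:
    """Fraction of traffic each endpoint receives for the given weights."""
    for name, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name!r} must be between 0 and 255")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one endpoint needs a non-zero weight")
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # Send a quarter of the traffic to the new "green" deployment.
    print(traffic_share({"blue": 192, "green": 64}))
    # {'blue': 0.75, 'green': 0.25}
```

Raising the green weight step by step (and finally setting blue to 0) gives a gradual cutover that does not depend on any client refreshing its DNS cache.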


❌ Why the other options are wrong:

| Option | Why It's Not Ideal |
|---|---|
| Amazon Route 53 (DNS weighted routing) | ❌ DNS caching delays updates, so users might still hit the old version even after routing changes. |
| Elastic Load Balancer (ELB) | ⚠️ ALBs can do blue-green via weighted target groups, but only within a single Region, which is not great for global users. |
| AWS CodeDeploy | ❌ Used for application deployment, not for routing user traffic between blue/green environments. |

🔍 Blue-Green Deployment Options: Visual Comparison

| Feature / Option | 🌐 AWS Global Accelerator | 🔁 Elastic Load Balancer (ELB) | 🌍 Amazon Route 53 (DNS) | 🧩 AWS CodeDeploy |
|---|---|---|---|---|
| DNS Caching Impact | ❌ Not affected | ❌ Not affected | ✅ Affected | ❌ Not relevant |
| Traffic Control | ✅ Endpoint weights + traffic dials | ✅ Weighted target groups | ✅ Weighted routing | ❌ Deployment only |
| Multi-Region Support | ✅ Yes (Global) | ❌ Region-bound | ✅ Yes | ❌ Not a routing tool |
| Switch Speed | ⚡ Instant | ⚡ Fast | 🕒 Slower due to caching | 🚫 Not applicable |
| Best Use Case | ✅ Global blue-green rollout | 🟡 Good for single-region | ❌ Not reliable for quick switch | ❌ Deploys app code, not routes |