
SAP-C02 Study Guide & Reviewer

Comprehensive Study Materials & Key Concepts

AWS Certified Solutions Architect - Professional (SAP-C02)

Comprehensive Study Guide

Complete Learning Path for Certification Success

Overview

This study guide provides a structured learning path from fundamentals to exam readiness for the AWS Certified Solutions Architect - Professional (SAP-C02) certification. Designed for complete novices, it teaches all concepts progressively while focusing exclusively on exam-relevant content. Extensive diagrams and visual aids are integrated throughout to enhance understanding and retention.

Target Audience: Complete beginners with little to no AWS experience who need to learn everything from scratch.

Study Approach: Self-sufficient textbook replacement - you should NOT need external resources to understand concepts. Everything is explained from first principles with extensive examples and visual diagrams.

About This Certification

Exam Details:

  • Exam Code: SAP-C02
  • Questions: 75 (65 scored + 10 unscored)
  • Duration: 180 minutes (3 hours)
  • Passing Score: 750/1000
  • Question Types: Multiple choice (1 correct) and multiple response (2+ correct)
  • Exam Cost: $300 USD

Target Candidate:

  • 2+ years of hands-on experience with AWS
  • Experience designing distributed applications and systems on AWS
  • Ability to evaluate cloud application requirements and make architectural recommendations
  • Capability to provide guidance on architectural design across multiple applications and projects
  • Understanding of AWS Well-Architected Framework

What This Exam Tests:

  • Designing solutions for organizational complexity (multi-account, networking, security)
  • Designing new solutions (deployment strategies, business continuity, performance)
  • Continuous improvement of existing solutions (operational excellence, security, reliability)
  • Accelerating workload migration and modernization (assessment, migration strategies, modernization)

Section Organization

Study Sections (read in order):

  • Overview (this section) - How to use the guide and study plan
  • Fundamentals - Section 0: Essential AWS background and prerequisites
  • Domain 1: Organizational Complexity - Section 1: detailed content (26% of exam)
  • Domain 2: New Solutions - Section 2: detailed content (29% of exam)
  • Domain 3: Continuous Improvement - Section 3: detailed content (25% of exam)
  • Domain 4: Migration & Modernization - Section 4: detailed content (20% of exam)
  • Integration - Integration & cross-domain scenarios
  • Study strategies - Study techniques & test-taking strategies
  • Final checklist - Final week preparation checklist
  • Appendices - Quick reference tables, glossary, resources
  • diagrams/ - Folder containing all Mermaid diagram files (.mmd)

Study Plan Overview

Total Time: 8-12 weeks (2-3 hours daily for complete novices)

Week-by-Week Breakdown:

  • Week 1-2: Fundamentals & Domain 1 Part 1 (sections 01-02, first half)

    • Focus: AWS basics, networking fundamentals, VPC concepts
    • Practice: Domain 1 Bundle 1 (aim for 60%+)
  • Week 3-4: Domain 1 Part 2 & Domain 2 Part 1 (file 02 second half, file 03 first half)

    • Focus: Security controls, resilience, multi-account, deployment strategies
    • Practice: Domain 1 Bundle 2, Domain 2 Bundle 1 (aim for 65%+)
  • Week 5-6: Domain 2 Part 2 & Domain 3 (file 03 second half, file 04)

    • Focus: Business continuity, performance, operational excellence, security improvements
    • Practice: Domain 2 Bundle 2, Domain 3 Bundle 1 (aim for 70%+)
  • Week 7-8: Domain 4 & Integration (sections 05-06)

    • Focus: Migration strategies, modernization, cross-domain scenarios
    • Practice: Domain 4 Bundle 1, Full Practice Test 1 (aim for 70%+)
  • Week 9-10: Practice & Review

    • Complete all remaining practice tests
    • Review flagged topics from chapters
    • Focus on weak areas identified in practice tests
    • Practice: Full Practice Tests 2 & 3 (aim for 75%+)
  • Week 11-12: Final Preparation

    • Review study strategies (section 07)
    • Complete final checklist (section 08)
    • Review cheat sheets daily
    • Light review only - no new learning
    • Practice: Service-focused bundles for weak areas

Learning Approach

Four-Phase Learning Method:

  1. Read & Understand (First Pass)

    • Read each section thoroughly
    • Study all diagrams carefully
    • Take notes on ⭐ Must Know items
    • Don't rush - understanding is more important than speed
    • Expect 3-4 hours per domain chapter
  2. Practice & Apply (Second Pass)

    • Complete practice exercises after each section
    • Take domain-focused practice tests
    • Review explanations for ALL questions (even correct ones)
    • Identify patterns in question types
    • Return to chapter sections for topics you missed
  3. Review & Reinforce (Third Pass)

    • Review chapter summaries and quick reference cards
    • Focus on ⭐ Must Know items
    • Practice with full-length tests
    • Time yourself to build exam stamina
    • Review diagrams to reinforce mental models
  4. Master & Refine (Final Pass)

    • Take all remaining practice tests
    • Review only flagged/weak areas
    • Memorize critical numbers and limits
    • Practice decision frameworks
    • Build confidence through repetition

Progress Tracking

Use checkboxes to track your completion:

Chapter Completion:

  • Chapter 0: Fundamentals (section 01)
  • Chapter 1: Domain 1 - Organizational Complexity (section 02)
  • Chapter 2: Domain 2 - New Solutions (section 03)
  • Chapter 3: Domain 3 - Continuous Improvement (section 04)
  • Chapter 4: Domain 4 - Migration & Modernization (section 05)
  • Integration & Cross-Domain Scenarios (section 06)
  • Study Strategies (section 07)
  • Final Checklist (section 08)

Practice Test Milestones:

  • Domain 1 Bundle 1: Score ___% (target: 60%+)
  • Domain 1 Bundle 2: Score ___% (target: 65%+)
  • Domain 2 Bundle 1: Score ___% (target: 65%+)
  • Domain 2 Bundle 2: Score ___% (target: 70%+)
  • Domain 3 Bundle 1: Score ___% (target: 70%+)
  • Domain 3 Bundle 2: Score ___% (target: 70%+)
  • Domain 4 Bundle 1: Score ___% (target: 70%+)
  • Full Practice Test 1: Score ___% (target: 70%+)
  • Full Practice Test 2: Score ___% (target: 75%+)
  • Full Practice Test 3: Score ___% (target: 75%+)

Self-Assessment Criteria:
For each chapter, you should be able to:

  • Explain key concepts in your own words
  • Draw basic architecture diagrams from memory
  • Make service selection decisions using decision frameworks
  • Identify common pitfalls and anti-patterns
  • Score 70%+ on related practice questions

Legend & Visual Markers

Throughout this guide, you'll see these markers:

  • ⭐ Must Know: Critical information for exam success - memorize this
  • 💡 Tip: Helpful insight, shortcut, or memory aid
  • ⚠️ Warning: Common mistake or misconception to avoid
  • 🔗 Connection: Links to related topics in other chapters
  • 📝 Practice: Hands-on exercise or scenario to work through
  • 🎯 Exam Focus: Frequently tested concept or question pattern
  • 📊 Diagram: Visual representation available (see diagrams folder)

How to Use Diagrams

Diagram Integration:

  • Every complex concept has at least one diagram
  • Diagrams are embedded in the text using Mermaid syntax
  • Each diagram is also saved as a separate .mmd file in diagrams/ folder
  • Detailed written explanations accompany every diagram

Diagram Types You'll Encounter:

  • Architecture Diagrams: Show how services connect and interact
  • Sequence Diagrams: Illustrate step-by-step processes and flows
  • Decision Trees: Help you choose between options
  • State Diagrams: Show lifecycle and transitions
  • Comparison Tables: Side-by-side feature comparisons

How to Study with Diagrams:

  1. Read the text explanation first to understand the concept
  2. Study the diagram to see the visual representation
  3. Read the detailed diagram explanation that follows
  4. Try to redraw the diagram from memory
  5. Use diagrams as quick reference during review

Study Tips for Success

For Complete Novices:

  • Don't skip the Fundamentals chapter (section 01) - it builds essential background
  • Take your time - understanding is more important than speed
  • Use analogies provided to relate technical concepts to everyday experiences
  • Practice drawing diagrams to reinforce understanding
  • Don't hesitate to re-read sections multiple times

Active Learning Techniques:

  • Teach Someone: Explain concepts out loud as if teaching a friend
  • Draw Diagrams: Recreate architecture diagrams from memory
  • Write Scenarios: Create your own question scenarios
  • Compare Options: Use comparison tables to understand trade-offs
  • Practice Decisions: Work through decision trees for different scenarios

Memory Aids:

  • Use mnemonics provided in the guide
  • Create your own acronyms for lists
  • Associate services with their use cases
  • Group related services together mentally
  • Review ⭐ Must Know items daily in final weeks

Time Management:

  • Study in focused 45-60 minute blocks
  • Take 10-15 minute breaks between blocks
  • Review previous day's material before starting new content
  • End each session by noting what to study next
  • Don't cram - consistent daily study is more effective

When You're Ready for the Exam

You're ready when:

  • You score 75%+ consistently on full practice tests
  • You can explain all ⭐ Must Know concepts without notes
  • You recognize question patterns instantly
  • You make service selection decisions quickly using frameworks
  • You understand WHY answers are correct, not just WHAT they are
  • You can draw key architecture diagrams from memory
  • You complete practice tests within time limits comfortably

Final Week Preparation:

  • Review file 08 (Final Checklist) daily
  • Take one full practice test every other day
  • Review only weak areas - no new learning
  • Get adequate sleep (8 hours)
  • Stay confident - trust your preparation

Additional Resources

Practice Test Bundles (included with this guide):

  • Difficulty-Based: 5 bundles (beginner, intermediate, advanced)
  • Full Practice: 3 complete 65-question exams
  • Domain-Focused: 8 bundles (2 per domain)
  • Service-Focused: 10 bundles (networking, security, compute, etc.)

Cheat Sheets (included with this guide):

  • Quick reference for final week review
  • Located in:
  • 5-6 pages of condensed key takeaways
  • Perfect for daily review in final 2 weeks

Official AWS Resources (optional supplements):

  • AWS Well-Architected Framework documentation
  • AWS service documentation (for deep dives)
  • AWS Architecture Center (real-world patterns)
  • AWS Whitepapers (best practices)

How to Navigate This Guide

Sequential Reading (Recommended for Novices):

  • Start with file 01 (Fundamentals)
  • Progress through sections 02-05 (Domain chapters) in order
  • Complete file 06 (Integration) after domain chapters
  • Review sections 07-08 (Strategies & Checklist) in final weeks
  • Use file 99 (Appendices) as quick reference throughout

Topic-Based Reading (For Experienced Users):

  • Use the table of contents in each section
  • Jump to specific sections as needed
  • Cross-reference using 🔗 Connection markers
  • Review diagrams for quick visual understanding

Review Mode (For Final Preparation):

  • Read chapter summaries only
  • Review all ⭐ Must Know items
  • Study decision frameworks and comparison tables
  • Practice with diagrams
  • Use file 99 (Appendices) for quick facts

Important Notes

About Exam Content:

  • This guide covers 100% of exam topics from the official exam guide
  • Content is based on 655 practice questions covering all domains
  • All technical details verified with official AWS documentation
  • Scenarios reflect real-world professional architect responsibilities

About Question Practice:

  • Practice questions are essential - reading alone is not enough
  • Review explanations for ALL questions, even ones you get correct
  • Understand WHY wrong answers are wrong
  • Look for patterns in how questions are structured
  • Time yourself on full practice tests to build exam stamina

About Updates:

  • AWS services evolve - always verify critical details with official docs
  • This guide reflects exam content as of October 2025
  • Focus on concepts and patterns, not just specific features
  • Understand the "why" behind architectural decisions

Getting Started

Your First Steps:

  1. Read this overview completely (you're almost done!)
  2. Review the study plan and adjust timeline to your schedule
  3. Set up a study space free from distractions
  4. Open file 01 (Fundamentals) and begin Chapter 0
  5. Take notes and mark ⭐ Must Know items as you read
  6. Complete practice exercises after each major section
  7. Track your progress using the checklists above

Remember:

  • Quality over speed - understanding is more important than rushing
  • Consistent daily study beats cramming
  • Practice questions are essential for success
  • Visual diagrams enhance retention - use them actively
  • You can do this - thousands have passed with proper preparation

Ready to begin? Open file 01_fundamentals to start Chapter 0!

Good luck on your certification journey! 🚀


Chapter 0: Essential AWS Background & Prerequisites

What You Need to Know First

This certification assumes you understand certain foundational concepts. This chapter builds that foundation from scratch, assuming you're a complete novice. If you're already familiar with AWS basics, you can skim this chapter, but don't skip it entirely - it establishes the mental models used throughout the guide.

Prerequisites Checklist

Before diving into professional-level architecture, you should understand:

  • Basic Cloud Computing Concepts - What cloud computing is and why organizations use it
  • Networking Fundamentals - IP addresses, subnets, routing, DNS basics
  • Security Principles - Authentication, authorization, encryption concepts
  • High Availability Concepts - Redundancy, failover, disaster recovery basics
  • Basic AWS Services - EC2, S3, VPC, IAM at a conceptual level

If you're missing any: Don't worry! This chapter will teach you everything you need. Take your time and work through each section carefully.


Section 1: Cloud Computing Fundamentals

What is Cloud Computing?

Simple Definition: Cloud computing means using computers, storage, and services over the internet instead of owning and maintaining your own physical servers.

Why it exists: Traditionally, companies had to buy servers, set up data centers, hire staff to maintain them, and predict future capacity needs years in advance. This was expensive, inflexible, and risky. Cloud computing solves these problems by letting you rent computing resources on-demand, paying only for what you use.

Real-world analogy: Think of cloud computing like electricity. You don't build your own power plant - you plug into the grid and pay for what you use. Similarly, you don't build your own data center - you use AWS's infrastructure and pay for what you consume.

How it works (Detailed step-by-step):

  1. AWS builds massive data centers around the world with thousands of servers, storage systems, and networking equipment. These facilities have redundant power, cooling, security, and internet connections.

  2. AWS virtualizes the hardware using software that divides physical servers into many virtual machines (VMs). This means one physical server can run dozens of isolated virtual servers for different customers.

  3. You request resources through AWS's web interface or APIs. For example, you might request "I need 2 virtual servers with 4GB RAM each, running Linux, in the US East region."

  4. AWS provisions your resources in seconds. The virtualization software carves out the requested resources from available physical hardware and makes them available to you.

  5. You use the resources to run your applications, store data, or provide services to your customers. You access everything over the internet.

  6. AWS meters your usage and bills you based on what you consume - compute hours, storage gigabytes, data transfer, etc.

  7. You can scale up or down instantly. Need more servers? Request them. Don't need them anymore? Shut them down and stop paying.

Why this matters for the exam: The SAP-C02 exam tests your ability to design solutions that leverage cloud advantages - elasticity, pay-per-use, global reach, and managed services. Understanding WHY cloud exists helps you make better architectural decisions.

Cloud Service Models

There are three main service models in cloud computing. Understanding these helps you choose the right AWS services for different scenarios.

Infrastructure as a Service (IaaS)

What it is: You rent virtual servers, storage, and networking. You're responsible for everything else - operating system, applications, data, security configurations.

AWS Examples: Amazon EC2 (virtual servers), Amazon EBS (block storage), Amazon VPC (networking)

When to use:

  • You need full control over the operating system and software stack
  • You're migrating existing applications that need specific OS configurations
  • You have specialized software that requires specific system-level access

Real-world analogy: Renting an unfurnished apartment. You get the space and utilities, but you bring your own furniture, decorations, and appliances.

Platform as a Service (PaaS)

What it is: AWS manages the infrastructure AND the platform (OS, runtime, middleware). You just deploy your application code and data.

AWS Examples: AWS Elastic Beanstalk (application platform), AWS Lambda (serverless functions), Amazon RDS (managed databases)

When to use:

  • You want to focus on application development, not infrastructure management
  • You need automatic scaling and high availability without manual configuration
  • You want AWS to handle patching, updates, and maintenance

Real-world analogy: Renting a furnished apartment. The furniture and appliances are provided and maintained. You just move in and live there.

Software as a Service (SaaS)

What it is: Complete applications delivered over the internet. You just use the software - AWS manages everything.

AWS Examples: Amazon WorkSpaces (virtual desktops), Amazon Chime (communications), AWS managed services

When to use:

  • You need standard business applications without customization
  • You want zero infrastructure or platform management
  • You need quick deployment with minimal setup

Real-world analogy: Staying in a hotel. Everything is provided and managed. You just show up and use the services.

⭐ Must Know: The exam frequently tests whether you understand when to use IaaS vs PaaS. Generally, PaaS is preferred for operational efficiency, but IaaS is needed when you require specific control or have legacy application requirements.

AWS Global Infrastructure

Understanding AWS's physical infrastructure is critical for designing resilient, performant, and compliant solutions.

Regions

What it is: A Region is a physical geographic area where AWS has multiple data centers. Each Region is completely independent and isolated from other Regions.

Why it exists:

  • Data sovereignty: Some countries require data to stay within their borders
  • Latency: Placing resources closer to users reduces response times
  • Disaster recovery: Regions are far apart, so a natural disaster in one won't affect others
  • Service availability: New AWS services often launch in specific Regions first

How it works:

  1. AWS selects geographic locations based on customer demand, connectivity, and regulatory requirements
  2. Each Region has a unique identifier (e.g., us-east-1, eu-west-1, ap-southeast-2)
  3. Regions are connected by AWS's private global network backbone
  4. You explicitly choose which Region(s) to deploy resources in
  5. Data doesn't leave a Region unless you explicitly configure replication or transfer

Examples of Regions:

  • us-east-1 (N. Virginia): Oldest and largest AWS Region, most services available
  • eu-west-1 (Ireland): Primary European Region for many customers
  • ap-southeast-1 (Singapore): Serves Asia-Pacific customers
  • sa-east-1 (São Paulo): Serves South American customers

When to choose a Region:

  • ✅ Choose based on: User location (latency), compliance requirements, service availability, cost
  • ❌ Don't choose based on: Random selection, always using us-east-1 by default

⭐ Must Know: As of 2025, AWS has 30+ Regions globally. Not all services are available in all Regions. Always verify service availability in your target Region.

Availability Zones (AZs)

What it is: An Availability Zone is one or more discrete data centers within a Region, each with redundant power, networking, and connectivity.

Why it exists: To provide high availability and fault tolerance within a Region. If one AZ fails (power outage, network issue, natural disaster), applications in other AZs continue running.

Real-world analogy: Think of a Region as a city, and Availability Zones as different neighborhoods in that city. Each neighborhood has its own power grid and infrastructure. If one neighborhood loses power, the others keep running.

How it works (Detailed):

  1. Each Region has multiple AZs (minimum 3, typically 3-6)
  2. AZs are physically separated by meaningful distances (miles/kilometers apart)
  3. Each AZ has independent power sources, cooling, and physical security
  4. AZs are connected by high-speed, low-latency private fiber networks
  5. You deploy resources across multiple AZs for redundancy
  6. AWS automatically handles failover for many managed services

Architecture Example:

Region: us-east-1
├── AZ: us-east-1a (Data Center Group 1)
├── AZ: us-east-1b (Data Center Group 2)
├── AZ: us-east-1c (Data Center Group 3)
├── AZ: us-east-1d (Data Center Group 4)
├── AZ: us-east-1e (Data Center Group 5)
└── AZ: us-east-1f (Data Center Group 6)

Detailed Example 1: Multi-AZ Web Application

Imagine you're running an e-commerce website. Here's how Multi-AZ deployment works:

  1. Setup: You deploy web servers in us-east-1a, us-east-1b, and us-east-1c
  2. Normal Operation: A load balancer distributes traffic across all three AZs. Each AZ handles roughly 33% of requests.
  3. Failure Scenario: At 2 PM, us-east-1a experiences a power failure
  4. Automatic Response:
    • The load balancer detects failed health checks from us-east-1a servers
    • Within 30 seconds, it stops sending traffic to us-east-1a
    • Traffic is redistributed to us-east-1b and us-east-1c (now 50% each)
    • Users experience no downtime - they're automatically routed to healthy AZs
  5. Recovery: When us-east-1a comes back online, the load balancer automatically includes it again
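The redistribution in the failure scenario above can be illustrated with a toy model (a sketch only, not actual Elastic Load Balancing behavior): traffic is split evenly across whichever AZs currently pass health checks.

```python
def distribute_traffic(azs: dict[str, bool]) -> dict[str, float]:
    """Split traffic evenly across healthy AZs (toy load-balancer model)."""
    healthy = [az for az, ok in azs.items() if ok]
    share = 1.0 / len(healthy)
    return {az: (share if ok else 0.0) for az, ok in azs.items()}

# Normal operation: all three AZs healthy, each carries roughly a third
print(distribute_traffic({"us-east-1a": True, "us-east-1b": True, "us-east-1c": True}))

# us-east-1a fails its health checks: its share drops to 0, the rest split 50/50
print(distribute_traffic({"us-east-1a": False, "us-east-1b": True, "us-east-1c": True}))
```

When us-east-1a recovers and passes health checks again, re-running the function with all AZs marked healthy restores the even three-way split, mirroring step 5 above.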

Detailed Example 2: Multi-AZ Database

For a database using Amazon RDS Multi-AZ:

  1. Setup: Primary database in us-east-1a, synchronous standby in us-east-1b
  2. Normal Operation: All reads and writes go to the primary. Every transaction is synchronously replicated to the standby (happens in milliseconds).
  3. Failure Scenario: The primary database instance fails
  4. Automatic Failover:
    • RDS detects the failure within 60 seconds
    • Promotes the standby in us-east-1b to primary
    • Updates DNS to point to the new primary
    • Total downtime: 1-2 minutes
    • Zero data loss (synchronous replication)
  5. Recovery: RDS automatically creates a new standby in another AZ

⭐ Must Know:

  • Each AZ is identified by a letter suffix (a, b, c, etc.)
  • AZ identifiers are mapped randomly per AWS account (your us-east-1a might be different from another account's us-east-1a)
  • Always deploy across at least 2 AZs for high availability
  • Some services (like RDS Multi-AZ) automatically handle AZ failover

💡 Tip: For production workloads, use at least 3 AZs. This allows you to maintain availability even if one AZ fails and another is undergoing maintenance.

Edge Locations and CloudFront

What it is: Edge Locations are AWS data centers specifically designed for content delivery. They're part of Amazon CloudFront, AWS's Content Delivery Network (CDN).

Why it exists: To deliver content (web pages, videos, files) to users with low latency by caching content geographically close to them.

How it works:

  1. You upload your content (website, images, videos) to an origin server (like S3 or EC2)
  2. You configure CloudFront to distribute this content
  3. CloudFront copies your content to Edge Locations around the world
  4. When a user requests content, CloudFront serves it from the nearest Edge Location
  5. If content isn't cached, CloudFront fetches it from the origin, caches it, and serves it

Real-world analogy: Think of Edge Locations like local convenience stores. Instead of driving to a distant warehouse (origin server) every time you need milk, you go to the nearby store that stocks popular items. The store occasionally restocks from the warehouse, but most purchases are served locally.

Detailed Example: Global Website Delivery

Scenario: You have a website hosted on EC2 in us-east-1, but users worldwide access it.

Without CloudFront:

  • User in Tokyo requests your website
  • Request travels across the internet to us-east-1 (Virginia)
  • Round-trip time: 200-300ms
  • Every image, CSS file, JavaScript file requires a separate round trip
  • Total page load: 3-5 seconds

With CloudFront:

  • User in Tokyo requests your website
  • Request goes to nearest Edge Location (Tokyo)
  • Edge Location has cached content: serves immediately (10-20ms)
  • Edge Location doesn't have content: fetches from us-east-1 once, caches it, serves it
  • Subsequent requests from Tokyo users: served from cache
  • Total page load: 0.5-1 second
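The cache-hit/cache-miss behavior described above is a read-through cache, which can be sketched in a few lines (a toy model, not CloudFront's actual implementation; the paths and content are hypothetical):

```python
# Toy origin server and edge cache illustrating read-through caching
ORIGIN = {"/index.html": "<html>...</html>"}

edge_cache: dict[str, str] = {}

def fetch(path: str) -> tuple[str, str]:
    """Return (content, source): 'edge' on a cache hit, 'origin' on a miss."""
    if path in edge_cache:
        return edge_cache[path], "edge"   # fast: served from the Edge Location
    content = ORIGIN[path]                # slow: round trip to the origin
    edge_cache[path] = content            # cache for subsequent requests
    return content, "origin"

_, src1 = fetch("/index.html")  # first Tokyo request: miss, fetched from origin
_, src2 = fetch("/index.html")  # second request: hit, served from the edge
print(src1, src2)  # origin edge
```

Only the first request pays the long trip to the origin; every later request for the same object is answered locally, which is exactly why the Tokyo page load drops from seconds to fractions of a second.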

⭐ Must Know:

  • AWS has 400+ Edge Locations globally (more than Regions)
  • Edge Locations are read-only caches (you can't deploy applications there)
  • CloudFront is the primary service using Edge Locations
  • Other services using Edge Locations: Route 53 (DNS), AWS WAF (web firewall), Lambda@Edge

🎯 Exam Focus: Questions often test whether you understand when to use CloudFront for performance optimization, especially for global user bases.


Section 2: Networking Fundamentals

Understanding networking is absolutely critical for the SAP-C02 exam. Domain 1 (26% of the exam) heavily focuses on network architecture. This section builds your networking foundation from scratch.

IP Addresses and CIDR Notation

What it is: An IP address is a unique identifier for a device on a network, like a phone number for computers. CIDR (Classless Inter-Domain Routing) is a way to specify ranges of IP addresses.

Why it exists: Networks need a way to identify and route traffic to specific devices. CIDR provides a flexible way to allocate IP addresses efficiently.

Real-world analogy: Think of IP addresses like street addresses. Just as "123 Main Street" uniquely identifies a house, "10.0.1.5" uniquely identifies a computer on a network.

How IP Addresses Work:

An IPv4 address consists of 4 numbers (0-255) separated by dots:

  • Example: 192.168.1.10
  • Each number is called an "octet" (8 bits)
  • Total: 32 bits (4 octets × 8 bits)

CIDR Notation Explained:

CIDR notation looks like: 10.0.0.0/16

  • The /16 is the "prefix length" - it tells you how many bits are fixed
  • Remaining bits can vary, defining the range of addresses

Detailed Example 1: Understanding /16

10.0.0.0/16 means:

  • First 16 bits are fixed: 10.0
  • Last 16 bits can vary: 0.0 to 255.255
  • Address range: 10.0.0.0 to 10.0.255.255
  • Total addresses: 2^16 = 65,536 addresses

Detailed Example 2: Understanding /24

192.168.1.0/24 means:

  • First 24 bits are fixed: 192.168.1
  • Last 8 bits can vary: 0 to 255
  • Address range: 192.168.1.0 to 192.168.1.255
  • Total addresses: 2^8 = 256 addresses

Detailed Example 3: Understanding /28

10.0.1.0/28 means:

  • First 28 bits are fixed
  • Last 4 bits can vary: 0 to 15
  • Address range: 10.0.1.0 to 10.0.1.15
  • Total addresses: 2^4 = 16 addresses

Common CIDR Blocks (Memorize These):

CIDR   Addresses    Typical Use
/32    1            Single host
/28    16           Very small subnet
/24    256          Small subnet (common)
/20    4,096        Medium subnet
/16    65,536       Large network
/8     16,777,216   Huge network

⭐ Must Know:

  • Smaller prefix (like /16) = MORE addresses
  • Larger prefix (like /28) = FEWER addresses
  • AWS VPCs can be /16 to /28
  • AWS subnets can also be /16 to /28 (a subnet's CIDR must fit within its VPC's block)

💡 Tip: Quick calculation - if CIDR is /X, you have 2^(32-X) addresses. For /24: 2^(32-24) = 2^8 = 256 addresses.
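The quick calculation above can be checked with Python's standard ipaddress module, which computes the same 2^(32-X) counts directly:

```python
import ipaddress

def addresses_in_cidr(cidr: str) -> int:
    """Number of addresses in a CIDR block: 2^(32 - prefix_length)."""
    return ipaddress.ip_network(cidr).num_addresses

print(addresses_in_cidr("10.0.0.0/16"))     # 65536
print(addresses_in_cidr("192.168.1.0/24"))  # 256
print(addresses_in_cidr("10.0.1.0/28"))     # 16
```

These match the three detailed examples worked through above, so the module is a handy way to sanity-check subnet plans before committing to them.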

Private vs Public IP Addresses

What it is: IP addresses are divided into "private" (used internally) and "public" (used on the internet).

Why it exists: The internet has a limited number of IPv4 addresses (about 4 billion). Private addresses allow organizations to use the same address ranges internally without conflicts, while public addresses are globally unique.

Private IP Ranges (RFC 1918):

  • 10.0.0.0/8 (10.0.0.0 to 10.255.255.255) - 16 million addresses
  • 172.16.0.0/12 (172.16.0.0 to 172.31.255.255) - 1 million addresses
  • 192.168.0.0/16 (192.168.0.0 to 192.168.255.255) - 65,536 addresses

Public IP Addresses:

  • All other IPv4 addresses
  • Globally unique and routable on the internet
  • Must be assigned by your ISP or cloud provider
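The RFC 1918 ranges listed above are built into Python's ipaddress module; a minimal sketch for classifying an address (note that `is_private` also returns True for other reserved ranges such as loopback, not just RFC 1918):

```python
import ipaddress

def classify(ip: str) -> str:
    """Label an IPv4 address as 'private' or 'public'.

    is_private is True for the RFC 1918 ranges (10/8, 172.16/12,
    192.168/16) and for other reserved ranges such as loopback.
    """
    return "private" if ipaddress.ip_address(ip).is_private else "public"

print(classify("10.0.1.50"))     # private
print(classify("192.168.1.10"))  # private
print(classify("54.123.45.67"))  # public
```

The sample addresses are the same ones used in the web server example below, illustrating which side of the private/public divide each falls on.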

How They Work Together:

  1. Internal Communication: Devices use private IPs to talk to each other within your network
  2. Internet Access: A NAT (Network Address Translation) device translates private IPs to public IPs
  3. Incoming Traffic: Public IPs are used to reach your services from the internet

Detailed Example: Web Server Architecture

Scenario: You're running a web application on AWS.

Setup:

  • Web server has private IP: 10.0.1.50
  • Web server has public IP: 54.123.45.67 (assigned by AWS)
  • Database has private IP: 10.0.2.100 (no public IP)

User Access Flow:

  1. User types www.example.com in browser
  2. DNS resolves to public IP: 54.123.45.67
  3. User's browser connects to 54.123.45.67
  4. AWS routes traffic to web server's private IP: 10.0.1.50
  5. Web server processes request

Database Access Flow:

  1. Web server needs data from database
  2. Web server connects to database's private IP: 10.0.2.100
  3. Traffic stays within AWS's private network (fast, secure)
  4. Database responds to web server
  5. Database is NOT accessible from internet (no public IP)

⭐ Must Know:

  • AWS VPCs always use private IP ranges
  • Public IPs are optional and assigned separately
  • Databases and internal services should NEVER have public IPs (security best practice)
  • Use NAT Gateway for private instances to access internet

Subnets

What it is: A subnet is a subdivision of a network. It's a range of IP addresses within a larger network.

Why it exists: Subnets allow you to organize and secure your network by grouping related resources together and controlling traffic between groups.

Real-world analogy: Think of a subnet like a floor in an office building. The building (VPC) has multiple floors (subnets). Each floor has its own set of offices (IP addresses). You can control who can move between floors (routing and security).

How Subnets Work in AWS:

  1. You create a VPC with a CIDR block (e.g., 10.0.0.0/16)
  2. You divide the VPC into subnets (e.g., 10.0.1.0/24, 10.0.2.0/24)
  3. Each subnet exists in ONE Availability Zone
  4. You launch resources (EC2, RDS, etc.) into specific subnets
  5. You control traffic between subnets using route tables and security groups

Detailed Example: Three-Tier Application

Scenario: You're designing a web application with web servers, application servers, and databases.

VPC: 10.0.0.0/16 (65,536 addresses)

Subnets:

  • Public Subnet 1 (us-east-1a): 10.0.1.0/24 (256 addresses)

    • Web servers that need internet access
    • Has route to Internet Gateway
  • Public Subnet 2 (us-east-1b): 10.0.2.0/24 (256 addresses)

    • Web servers in different AZ for high availability
    • Has route to Internet Gateway
  • Private Subnet 1 (us-east-1a): 10.0.11.0/24 (256 addresses)

    • Application servers (no direct internet access)
    • Has route to NAT Gateway for outbound internet
  • Private Subnet 2 (us-east-1b): 10.0.12.0/24 (256 addresses)

    • Application servers in different AZ
    • Has route to NAT Gateway for outbound internet
  • Database Subnet 1 (us-east-1a): 10.0.21.0/24 (256 addresses)

    • Database servers (completely isolated)
    • No internet access at all
  • Database Subnet 2 (us-east-1b): 10.0.22.0/24 (256 addresses)

    • Database servers in different AZ
    • No internet access at all

Traffic Flow:

  1. Internet → Public Subnet (web servers)
  2. Public Subnet → Private Subnet (app servers)
  3. Private Subnet → Database Subnet (databases)
  4. Database Subnet → Private Subnet (responses)
  5. Private Subnet → Public Subnet (responses)
  6. Public Subnet → Internet (responses)

⭐ Must Know:

  • Public subnet = has route to Internet Gateway
  • Private subnet = no direct internet route (may have NAT Gateway)
  • Each subnet is in exactly ONE Availability Zone
  • Subnets cannot span multiple AZs
  • Always create subnets in multiple AZs for high availability

šŸ’” Tip: Use a consistent IP addressing scheme. For example:

  • 10.0.1.x - Public subnets in AZ-a
  • 10.0.2.x - Public subnets in AZ-b
  • 10.0.11.x - Private subnets in AZ-a
  • 10.0.12.x - Private subnets in AZ-b
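
The addressing scheme above can be checked with Python's standard ipaddress module; the CIDR math (and AWS's five reserved addresses per subnet) works out as follows. This is a quick sketch for verifying subnet plans, not AWS tooling:

```python
import ipaddress

# The VPC block from the three-tier example
vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)             # 65536

# Carve the VPC into /24 subnets (256 addresses each)
subnets = list(vpc.subnets(new_prefix=24))
print(subnets[1])                    # 10.0.1.0/24  - public subnet, AZ-a
print(subnets[11])                   # 10.0.11.0/24 - private subnet, AZ-a

# AWS reserves 5 addresses in every subnet (network address, VPC router,
# DNS, future use, broadcast), so a /24 leaves 251 usable addresses
print(subnets[1].num_addresses - 5)  # 251
```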

Routing

What it is: Routing is the process of determining how network traffic gets from source to destination. Route tables contain rules (routes) that direct traffic.

Why it exists: Networks need to know where to send packets. Without routing, traffic wouldn't know how to reach its destination.

Real-world analogy: Routing is like GPS directions. When you want to go somewhere, GPS tells you which roads to take. Similarly, route tables tell network packets which path to take.

How Route Tables Work:

  1. Every subnet has a route table (either custom or main)
  2. Route table contains routes - rules that say "if destination is X, send to Y"
  3. Most specific route wins - if multiple routes match, the one with longest prefix is used
  4. Local route is automatic - traffic within VPC is always routed locally

Route Table Structure:

Destination       Target          Meaning
10.0.0.0/16      local           Traffic to any IP in VPC stays in VPC
0.0.0.0/0        igw-xxx         All other traffic goes to Internet Gateway
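
The "most specific route wins" rule can be sketched with Python's standard ipaddress module. The route table below mirrors the public-subnet example; the igw identifier is illustrative:

```python
import ipaddress

# Hypothetical route table: destination CIDR -> target
routes = {
    "10.0.0.0/16": "local",
    "0.0.0.0/0": "igw-12345678",
}

def lookup(dest_ip):
    """Return the target of the most specific (longest-prefix) matching route."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [(ipaddress.ip_network(cidr), target)
               for cidr, target in routes.items()
               if ip in ipaddress.ip_network(cidr)]
    # Longest prefix wins: pick the match with the largest prefix length
    net, target = max(matches, key=lambda m: m[0].prefixlen)
    return target

print(lookup("10.0.1.50"))  # local - both routes match, /16 is more specific
print(lookup("8.8.8.8"))    # igw-12345678 - only the default route matches
```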

Detailed Example 1: Public Subnet Route Table

Destination       Target          Explanation
10.0.0.0/16      local           VPC internal traffic
0.0.0.0/0        igw-12345678    Internet traffic

How it works:

  • Packet to 10.0.1.50: Matches 10.0.0.0/16 → stays in VPC (local)
  • Packet to 8.8.8.8: Matches 0.0.0.0/0 → goes to Internet Gateway
  • Packet to 54.123.45.67: Matches 0.0.0.0/0 → goes to Internet Gateway

Detailed Example 2: Private Subnet Route Table

Destination       Target          Explanation
10.0.0.0/16      local           VPC internal traffic
0.0.0.0/0        nat-87654321    Internet traffic via NAT

How it works:

  • Packet to 10.0.2.100: Matches 10.0.0.0/16 → stays in VPC (local)
  • Packet to 8.8.8.8: Matches 0.0.0.0/0 → goes to NAT Gateway
  • NAT Gateway translates private IP to public IP and forwards to Internet Gateway
  • Response comes back through NAT Gateway, translated back to private IP

Detailed Example 3: Database Subnet Route Table

Destination       Target          Explanation
10.0.0.0/16      local           VPC internal traffic only

How it works:

  • Packet to 10.0.1.50: Matches 10.0.0.0/16 → stays in VPC (local)
  • Packet to 8.8.8.8: No matching route → dropped (no internet access)
  • This is intentional for security - databases shouldn't access internet

⭐ Must Know:

  • 0.0.0.0/0 means "all IP addresses" (default route)
  • Local route is automatically added and cannot be deleted
  • Most specific route wins (longest prefix match)
  • Public subnet = route to Internet Gateway (igw-xxx)
  • Private subnet = route to NAT Gateway (nat-xxx) for internet access
  • Isolated subnet = no route to internet at all

šŸŽÆ Exam Focus: Questions often test whether you understand the difference between public and private subnets based on their route tables, not just their names.

Internet Gateway and NAT Gateway

These are critical components for internet connectivity in AWS VPCs.

Internet Gateway (IGW)

What it is: An Internet Gateway is a horizontally scaled, redundant, and highly available VPC component that allows communication between your VPC and the internet.

Why it exists: VPCs are isolated by default. An Internet Gateway provides a target for internet-routable traffic and performs NAT for instances with public IP addresses.

How it works:

  1. You create an Internet Gateway
  2. You attach it to your VPC (one IGW per VPC)
  3. You add a route in your subnet's route table pointing 0.0.0.0/0 to the IGW
  4. You assign public IPs to instances in that subnet
  5. IGW performs 1:1 NAT between private and public IPs

Detailed Example: Web Server Internet Access

Setup:

  • VPC: 10.0.0.0/16
  • Public Subnet: 10.0.1.0/24
  • Web Server: Private IP 10.0.1.50, Public IP 54.123.45.67
  • Internet Gateway: igw-12345678
  • Route Table: 0.0.0.0/0 → igw-12345678

Outbound Traffic (Web Server → Internet):

  1. Web server sends packet: Source 10.0.1.50, Destination 8.8.8.8
  2. Route table directs packet to Internet Gateway
  3. IGW translates: Source 10.0.1.50 → 54.123.45.67 (public IP)
  4. Packet leaves AWS: Source 54.123.45.67, Destination 8.8.8.8
  5. Internet sees traffic from 54.123.45.67

Inbound Traffic (Internet → Web Server):

  1. Internet sends packet: Source 8.8.8.8, Destination 54.123.45.67
  2. Packet arrives at AWS
  3. IGW translates: Destination 54.123.45.67 → 10.0.1.50 (private IP)
  4. Packet delivered to web server: Source 8.8.8.8, Destination 10.0.1.50

⭐ Must Know:

  • One Internet Gateway per VPC
  • IGW is highly available (AWS managed)
  • IGW performs 1:1 NAT for instances with public IPs
  • No bandwidth constraints; it scales automatically with your traffic
  • Free (no charges for IGW itself)

NAT Gateway

What it is: A NAT (Network Address Translation) Gateway allows instances in private subnets to access the internet while preventing the internet from initiating connections to those instances.

Why it exists: Private instances (like application servers or batch processing servers) sometimes need to download updates, access external APIs, or send data to external services, but they shouldn't be directly accessible from the internet for security reasons.

Real-world analogy: Think of a NAT Gateway like a security guard at a gated community. Residents (private instances) can leave to go shopping (access internet), but random people from outside (internet) can't come in uninvited.

How it works:

  1. You create a NAT Gateway in a PUBLIC subnet
  2. You assign an Elastic IP (public IP) to the NAT Gateway
  3. You add a route in PRIVATE subnet's route table: 0.0.0.0/0 → NAT Gateway
  4. Private instances send internet-bound traffic to NAT Gateway
  5. NAT Gateway translates source IP to its own public IP and forwards to Internet Gateway
  6. Responses come back to NAT Gateway, which forwards to original private instance

Detailed Example: Application Server Downloading Updates

Setup:

  • VPC: 10.0.0.0/16
  • Public Subnet: 10.0.1.0/24 (has Internet Gateway route)
  • Private Subnet: 10.0.11.0/24 (has NAT Gateway route)
  • NAT Gateway: In public subnet, Elastic IP 54.200.100.50
  • App Server: In private subnet, Private IP 10.0.11.20 (no public IP)

Outbound Traffic Flow:

  1. App server needs to download updates from updates.example.com (93.184.216.34)
  2. App server sends packet: Source 10.0.11.20, Destination 93.184.216.34
  3. Private subnet route table directs to NAT Gateway
  4. NAT Gateway receives packet in public subnet
  5. NAT Gateway translates: Source 10.0.11.20 → 54.200.100.50 (NAT's public IP)
  6. NAT Gateway sends to Internet Gateway
  7. Internet Gateway forwards to internet
  8. Update server sees request from 54.200.100.50

Response Traffic Flow:

  1. Update server responds: Source 93.184.216.34, Destination 54.200.100.50
  2. Internet Gateway receives response
  3. Internet Gateway forwards to NAT Gateway
  4. NAT Gateway translates: Destination 54.200.100.50 → 10.0.11.20
  5. NAT Gateway forwards to app server in private subnet
  6. App server receives updates

Inbound Traffic (Blocked):

  1. Attacker tries to connect to app server from internet
  2. Attacker doesn't know private IP 10.0.11.20 (not exposed)
  3. Even if attacker knew it, NAT Gateway only allows OUTBOUND connections
  4. Inbound connection attempts are dropped
  5. App server remains secure
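
The three flows above can be modeled as a tiny translation table. This is an illustrative sketch of NAT behavior, not AWS internals; IPs and ports come from the example, and the port-allocation scheme is invented for simplicity:

```python
# Minimal sketch of a NAT translation table (illustration only)
NAT_PUBLIC_IP = "54.200.100.50"

translations = {}  # public port -> (private ip, private port)

def outbound(src_ip, src_port, dst_ip, dst_port):
    """Private instance -> internet: rewrite the source to NAT's public IP."""
    public_port = 50000 + len(translations)  # simplistic port allocation
    translations[public_port] = (src_ip, src_port)
    return (NAT_PUBLIC_IP, public_port, dst_ip, dst_port)

def inbound(src_ip, src_port, dst_ip, dst_port):
    """Internet -> NAT: only replies to an existing outbound flow pass."""
    if dst_ip == NAT_PUBLIC_IP and dst_port in translations:
        priv_ip, priv_port = translations[dst_port]
        return (src_ip, src_port, priv_ip, priv_port)
    return None  # unsolicited inbound connections are dropped

# App server 10.0.11.20 fetches updates from 93.184.216.34
pkt = outbound("10.0.11.20", 44321, "93.184.216.34", 443)
reply = inbound("93.184.216.34", 443, pkt[0], pkt[1])
print(reply)  # ('93.184.216.34', 443, '10.0.11.20', 44321)

# An attacker probing the NAT's public IP has no matching flow
print(inbound("198.51.100.9", 1234, NAT_PUBLIC_IP, 9999))  # None
```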

⭐ Must Know:

  • NAT Gateway must be in a PUBLIC subnet
  • NAT Gateway needs an Elastic IP (static public IP)
  • Private subnets route 0.0.0.0/0 to NAT Gateway
  • NAT Gateway is highly available within ONE AZ
  • For multi-AZ, create one NAT Gateway per AZ
  • NAT Gateway charges: ~$0.045/hour + ~$0.045/GB processed (us-east-1 pricing; varies by Region)

šŸ’” Tip: NAT Gateway vs NAT Instance:

  • NAT Gateway: AWS managed, highly available, scales automatically, preferred
  • NAT Instance: EC2 instance you manage, can be cheaper for low traffic, legacy approach

āš ļø Warning: Common mistake - putting NAT Gateway in private subnet. It MUST be in public subnet with Internet Gateway route.

šŸŽÆ Exam Focus: Questions often test whether you understand:

  • NAT Gateway must be in public subnet
  • Need one NAT Gateway per AZ for high availability
  • NAT Gateway allows outbound only, not inbound
  • Cost optimization: Single NAT Gateway vs one per AZ

Section 3: Security Fundamentals

Security is woven throughout the SAP-C02 exam. Understanding these fundamentals is essential for every domain.

Authentication vs Authorization

What they are:

  • Authentication: Proving who you are (identity verification)
  • Authorization: Determining what you're allowed to do (permission checking)

Why they exist: Systems need to verify identity before granting access, then control what authenticated users can do.

Real-world analogy:

  • Authentication: Showing your ID at airport security (proving you're you)
  • Authorization: Your boarding pass determines which plane you can board (what you can access)

How they work together:

  1. User provides credentials (username/password, access keys, etc.)
  2. System authenticates: "Is this really who they claim to be?"
  3. If authenticated, system checks authorization: "What is this user allowed to do?"
  4. System grants or denies access based on permissions

Detailed Example: AWS Console Login

Authentication Phase:

  1. You navigate to AWS Console
  2. You enter email and password
  3. AWS verifies credentials against IAM database
  4. If MFA enabled, you provide second factor (code from app)
  5. AWS confirms: "Yes, this is really user John Smith"

Authorization Phase:

  1. AWS checks IAM policies attached to your user
  2. You try to launch an EC2 instance
  3. AWS checks: "Does John Smith have ec2:RunInstances permission?"
  4. If yes: Instance launches
  5. If no: "Access Denied" error
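
The two phases can be sketched as a toy access-control check. The user store, password, and permission set are invented for illustration:

```python
# Toy authenticate-then-authorize flow (illustration only)
USERS = {"john": {"password": "s3cret", "permissions": {"ec2:RunInstances"}}}

def authenticate(username, password):
    """WHO are you? Verify the claimed identity."""
    user = USERS.get(username)
    return user is not None and user["password"] == password

def authorize(username, action):
    """WHAT can you do? Check permissions for the authenticated user."""
    return action in USERS[username]["permissions"]

def request(username, password, action):
    if not authenticate(username, password):
        return "AuthenticationFailed"
    if not authorize(username, action):
        return "AccessDenied"
    return "Allowed"

print(request("john", "s3cret", "ec2:RunInstances"))  # Allowed
print(request("john", "s3cret", "s3:DeleteBucket"))   # AccessDenied
print(request("john", "wrong", "ec2:RunInstances"))   # AuthenticationFailed
```

Note that the password check always runs first: authorization is never evaluated for an unauthenticated caller.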

⭐ Must Know:

  • Authentication = WHO you are
  • Authorization = WHAT you can do
  • In AWS, IAM handles both
  • Always authenticate first, then authorize

Encryption Basics

What it is: Encryption transforms readable data (plaintext) into unreadable data (ciphertext) using a mathematical algorithm and a key.

Why it exists: To protect data confidentiality. Even if someone intercepts encrypted data, they can't read it without the decryption key.

Real-world analogy: Encryption is like a locked safe. The data is inside the safe (encrypted), and only someone with the key can open it and read the contents.

Encryption at Rest

What it is: Encrypting data while it's stored (on disk, in database, in S3, etc.).

Why it exists: To protect data if physical storage is stolen or accessed by unauthorized parties.

How it works:

  1. You write data to storage
  2. Encryption software intercepts the write
  3. Data is encrypted using a key
  4. Encrypted data is written to disk
  5. When reading, data is automatically decrypted using the key

Detailed Example: S3 Bucket Encryption

Setup:

  • S3 bucket with encryption enabled
  • Encryption key managed by AWS KMS
  • You upload a file: customer_data.csv

Upload Process:

  1. You upload customer_data.csv (plaintext)
  2. S3 receives the file
  3. S3 requests encryption key from KMS
  4. KMS provides data encryption key (DEK)
  5. S3 encrypts file using DEK
  6. S3 stores encrypted file on disk
  7. S3 stores encrypted DEK with the file

Download Process:

  1. You request customer_data.csv
  2. S3 retrieves encrypted file and encrypted DEK
  3. S3 sends encrypted DEK to KMS
  4. KMS decrypts DEK (you must have permission)
  5. S3 uses DEK to decrypt file
  6. S3 sends plaintext file to you

If Disk is Stolen:

  • Thief has encrypted file
  • Thief has encrypted DEK
  • Thief CANNOT decrypt DEK (needs KMS access)
  • Thief CANNOT read file (needs DEK)
  • Data remains protected
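
The DEK/KEK relationship above (envelope encryption) can be sketched in Python. The XOR "cipher" is a stand-in for real cryptography, used only to show the key hierarchy; nothing here reflects actual KMS internals:

```python
import secrets

def xor(data, key):
    """Toy 'cipher' for illustration only - NOT real encryption."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# KMS holds the key-encryption key (KEK); it never leaves KMS
KEK = secrets.token_bytes(32)

def kms_generate_data_key():
    """Return (plaintext DEK, DEK encrypted under the KEK)."""
    dek = secrets.token_bytes(32)
    return dek, xor(dek, KEK)

def kms_decrypt_data_key(encrypted_dek):
    """Only callers with KMS permission could invoke this."""
    return xor(encrypted_dek, KEK)

# Upload: encrypt the object with the DEK, store ciphertext + encrypted DEK
dek, encrypted_dek = kms_generate_data_key()
ciphertext = xor(b"customer_data.csv contents", dek)

# Download: ask KMS to decrypt the DEK, then decrypt the object
plaintext = xor(ciphertext, kms_decrypt_data_key(encrypted_dek))
print(plaintext)  # b'customer_data.csv contents'
```

A thief with the disk holds only `ciphertext` and `encrypted_dek`; without access to the KEK inside KMS, neither can be recovered.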

⭐ Must Know:

  • Encryption at rest protects stored data
  • AWS KMS manages encryption keys
  • Most AWS services support encryption at rest
  • Encryption/decryption is transparent to applications

Encryption in Transit

What it is: Encrypting data while it's moving across networks (between client and server, between services, etc.).

Why it exists: To protect data from interception during transmission. Without encryption, network traffic can be captured and read.

How it works:

  1. Client and server establish encrypted connection (TLS/SSL)
  2. They exchange encryption keys securely
  3. All data sent is encrypted before transmission
  4. Receiving side decrypts data
  5. Connection remains encrypted for entire session

Detailed Example: HTTPS Website Connection

Setup:

  • Web server with SSL/TLS certificate
  • User accessing website from browser

Connection Process:

  1. User types https://example.com in browser
  2. Browser connects to server on port 443 (HTTPS)
  3. Server sends SSL certificate (contains public key)
  4. Browser verifies certificate is valid and trusted
  5. Browser generates session key
  6. Browser encrypts session key with server's public key
  7. Server decrypts session key with its private key
  8. Both sides now have shared session key

Data Transfer:

  1. User submits form with credit card number
  2. Browser encrypts data with session key
  3. Encrypted data travels across internet
  4. Even if intercepted, attacker sees gibberish
  5. Server receives encrypted data
  6. Server decrypts with session key
  7. Server processes plaintext credit card number

Without HTTPS (HTTP):

  1. User submits form with credit card number
  2. Data travels in plaintext
  3. Anyone on network path can read it
  4. Credit card number exposed
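
In Python, the standard ssl module applies these protections by default for client connections; the default context performs the certificate validation from step 4 of the connection process above:

```python
import ssl

# Default client-side context: verifies the server's certificate chain
# and hostname, matching the browser behavior described above
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
print(ctx.check_hostname)                    # True

# Optionally refuse legacy protocol versions
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
```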

⭐ Must Know:

  • Encryption in transit protects data during transmission
  • TLS/SSL is the standard protocol
  • HTTPS = HTTP + TLS/SSL
  • AWS services support encryption in transit
  • Always use HTTPS for sensitive data

šŸ’” Tip: Remember the difference:

  • At Rest: Data sitting on disk (like a parked car)
  • In Transit: Data moving across network (like a car driving)

Principle of Least Privilege

What it is: Granting users and services only the minimum permissions they need to perform their job, nothing more.

Why it exists: To minimize damage if credentials are compromised. If an account only has limited permissions, an attacker who steals those credentials can only do limited damage.

Real-world analogy: A hotel housekeeper gets a key that opens guest rooms but not the safe or manager's office. If the key is lost, the damage is limited to guest rooms, not the entire hotel.

How it works:

  1. Identify what actions a user/service needs to perform
  2. Grant ONLY those specific permissions
  3. Deny everything else by default
  4. Regularly review and remove unused permissions
  5. Use temporary credentials when possible

Detailed Example 1: Application Server Permissions

Bad Approach (Too Permissive):

{
  "Effect": "Allow",
  "Action": "*",
  "Resource": "*"
}
  • Grants ALL permissions on ALL resources
  • If compromised, attacker can delete everything
  • Violates least privilege

Good Approach (Least Privilege):

{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject"
  ],
  "Resource": "arn:aws:s3:::my-app-bucket/*"
}
  • Grants ONLY read/write to specific S3 bucket
  • Cannot delete bucket, cannot access other buckets
  • If compromised, damage is limited

Detailed Example 2: Developer Access

Scenario: Developer needs to test application in development environment.

Bad Approach:

  • Give developer admin access to production account
  • Developer can accidentally delete production resources
  • Security risk if developer's laptop is compromised

Good Approach:

  • Give developer admin access ONLY to dev account
  • Give developer read-only access to production (for troubleshooting)
  • Use separate AWS accounts for dev and production
  • Developer can't accidentally harm production

Detailed Example 3: Lambda Function Permissions

Scenario: Lambda function needs to read from S3 and write to DynamoDB.

Bad Approach:

{
  "Effect": "Allow",
  "Action": [
    "s3:*",
    "dynamodb:*"
  ],
  "Resource": "*"
}
  • Can delete S3 buckets
  • Can delete DynamoDB tables
  • Can access all S3 buckets and DynamoDB tables

Good Approach:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::input-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["dynamodb:PutItem"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/OutputTable"
    }
  ]
}
  • Can ONLY read from specific S3 bucket
  • Can ONLY write to specific DynamoDB table
  • Cannot delete anything
  • Cannot access other resources
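
A quick way to catch the "bad approach" above is to scan policies for wildcards. This is a minimal sketch, not a replacement for IAM Access Analyzer; the sample policies are invented for illustration:

```python
import json

def flag_wildcards(policy_json):
    """Return (field, values) pairs where Action or Resource uses '*'."""
    policy = json.loads(policy_json)
    statements = policy["Statement"]
    if isinstance(statements, dict):  # a single statement is also legal
        statements = [statements]
    findings = []
    for stmt in statements:
        for field in ("Action", "Resource"):
            values = stmt.get(field, [])
            if isinstance(values, str):
                values = [values]
            if any(v == "*" or v.endswith(":*") for v in values):
                findings.append((field, values))
    return findings

too_broad = ('{"Version": "2012-10-17", "Statement": '
             '{"Effect": "Allow", "Action": "*", "Resource": "*"}}')
print(flag_wildcards(too_broad))  # [('Action', ['*']), ('Resource', ['*'])]

scoped = ('{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", '
          '"Action": ["s3:GetObject"], '
          '"Resource": "arn:aws:s3:::my-app-bucket/*"}]}')
print(flag_wildcards(scoped))     # [] - no service-wide wildcards
```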

⭐ Must Know:

  • Start with no permissions, add only what's needed
  • Be specific: exact actions, exact resources
  • Avoid wildcards (*) when possible
  • Regularly audit and remove unused permissions
  • Use IAM Access Analyzer to identify overly permissive policies

šŸŽÆ Exam Focus: Questions often present scenarios where you must choose the most restrictive policy that still allows required functionality.

āš ļø Warning: Common mistake - granting broad permissions "just to make it work." Always take time to identify exact permissions needed.


Section 4: High Availability and Resilience Fundamentals

High availability and resilience are core themes throughout the SAP-C02 exam, especially in Domain 1 (Task 1.3) and Domain 2 (Task 2.2).

What is High Availability?

What it is: High availability means a system remains operational and accessible even when components fail. It's measured as a percentage of uptime.

Why it exists: Systems fail - hardware breaks, software crashes, networks disconnect, data centers lose power. High availability ensures business continuity despite these failures.

Real-world analogy: Think of high availability like having spare tires in your car. If one tire goes flat, you can replace it and keep driving. The journey continues despite the failure.

Availability Percentages:

Availability                      Downtime per Year   Downtime per Month   Downtime per Week
99% (Two nines)                   3.65 days           7.2 hours            1.68 hours
99.9% (Three nines)               8.76 hours          43.2 minutes         10.1 minutes
99.95% (Three and a half nines)   4.38 hours          21.6 minutes         5.04 minutes
99.99% (Four nines)               52.56 minutes       4.32 minutes         1.01 minutes
99.999% (Five nines)              5.26 minutes        25.9 seconds         6.05 seconds
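
The downtime figures follow directly from the availability percentage; a one-line calculation reproduces the table:

```python
def downtime(availability_pct, period_hours):
    """Hours of allowed downtime in a period at a given availability."""
    return period_hours * (1 - availability_pct / 100)

YEAR = 365 * 24  # 8760 hours

print(f"{downtime(99.9, YEAR):.2f} hours/year")       # 8.76 hours/year
print(f"{downtime(99.99, YEAR) * 60:.2f} min/year")   # 52.56 min/year
print(f"{downtime(99.999, YEAR) * 60:.2f} min/year")  # 5.26 min/year
```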

How to achieve high availability:

  1. Eliminate single points of failure: Every component should have a backup
  2. Detect failures quickly: Monitor health and respond automatically
  3. Fail over automatically: Switch to backup without manual intervention
  4. Distribute across failure domains: Use multiple AZs, Regions
  5. Design for failure: Assume everything will fail eventually

Detailed Example: Highly Available Web Application

Scenario: E-commerce website that must stay online 99.99% of the time (52 minutes downtime per year).

Architecture:

  • Load Balancer in multiple AZs
  • Web servers in 3 Availability Zones
  • Application servers in 3 Availability Zones
  • Database with Multi-AZ failover
  • Auto Scaling to replace failed instances

Normal Operation:

  • Load balancer distributes traffic across all web servers
  • Each web server can handle requests independently
  • If one server fails, the remaining servers absorb its share of the traffic

Failure Scenario 1: Single Web Server Fails:

  1. Web server in AZ-A crashes at 2:00 PM
  2. Load balancer detects failure via health checks (30 seconds)
  3. Load balancer stops sending traffic to failed server
  4. Traffic redistributed to healthy servers in AZ-B and AZ-C
  5. Auto Scaling detects missing capacity
  6. Auto Scaling launches replacement server in AZ-A (2 minutes)
  7. New server passes health checks and receives traffic
  8. Total impact: 30 seconds of reduced capacity, zero downtime

Failure Scenario 2: Entire Availability Zone Fails:

  1. AZ-A loses power at 3:00 PM
  2. All servers in AZ-A become unreachable
  3. Load balancer detects failures (30 seconds)
  4. Load balancer redirects ALL traffic to AZ-B and AZ-C
  5. Servers in AZ-B and AZ-C handle increased load
  6. Auto Scaling launches additional servers in AZ-B and AZ-C
  7. Database automatically fails over from AZ-A to AZ-B (1-2 minutes)
  8. Total impact: 1-2 minutes of degraded performance, zero downtime

Why this achieves 99.99%:

  • No single point of failure
  • Automatic detection and failover
  • Multiple AZs provide redundancy
  • Auto Scaling maintains capacity
  • Downtime limited to failover time (1-2 minutes)

⭐ Must Know:

  • 99.9% = 8.76 hours downtime per year (acceptable for many apps)
  • 99.99% = 52 minutes downtime per year (required for critical apps)
  • 99.999% = 5 minutes downtime per year (very expensive to achieve)
  • Multi-AZ deployment is minimum for high availability
  • Multi-Region deployment for highest availability

Redundancy

What it is: Having backup components that can take over when primary components fail.

Why it exists: Single components fail. Redundancy ensures service continues when failures occur.

Types of Redundancy:

  1. Active-Active: All components handle traffic simultaneously

    • Example: Multiple web servers behind load balancer
    • Benefit: Full capacity always available, instant failover
    • Cost: Higher (paying for all components)
  2. Active-Passive: Primary handles traffic, backup stands by

    • Example: RDS Multi-AZ (primary active, standby passive)
    • Benefit: Lower cost than active-active
    • Cost: Failover takes 1-2 minutes
  3. N+1 Redundancy: N components needed, N+1 deployed

    • Example: Need 4 servers for capacity, deploy 5
    • Benefit: Can lose one component without impact
    • Cost: Moderate (one extra component)

Detailed Example: Load Balancer Redundancy

Setup:

  • Application Load Balancer (ALB) in 3 AZs
  • 2 web servers per AZ (6 total)
  • Each server can handle 1,000 requests/second
  • Normal load: 4,000 requests/second

Redundancy Analysis:

  • Capacity needed: 4,000 req/sec Ć· 1,000 req/sec = 4 servers
  • Capacity deployed: 6 servers
  • Redundancy: N+2 (need 4, have 6)
  • Can lose: 2 servers without impact

Failure Scenario:

  1. 2 servers fail simultaneously
  2. Remaining 4 servers handle 4,000 req/sec
  3. No performance degradation
  4. Auto Scaling launches 2 replacement servers
  5. Redundancy restored to N+2
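
The N+k arithmetic above generalizes to a short helper (a sketch; the function name and parameters are illustrative):

```python
def redundancy_level(load_rps, per_server_rps, deployed):
    """Return k in 'N+k': how many servers can fail without losing capacity."""
    needed = -(-load_rps // per_server_rps)  # ceiling division
    return deployed - needed

# Need 4 servers for 4,000 req/sec; 6 deployed -> N+2
print(redundancy_level(4000, 1000, 6))  # 2 - can lose 2 servers
print(redundancy_level(4000, 1000, 4))  # 0 - no headroom at all
```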

⭐ Must Know:

  • Always deploy redundant components
  • Distribute redundancy across AZs
  • Active-Active preferred for web/app tiers
  • Active-Passive acceptable for databases
  • Test failover regularly

Fault Tolerance vs High Availability

Fault Tolerance: System continues operating WITHOUT ANY DEGRADATION when components fail.

  • Example: RAID array - if one disk fails, system continues at full speed
  • Expensive: Requires duplicate everything
  • Zero downtime, zero performance impact

High Availability: System continues operating with MINIMAL DEGRADATION when components fail.

  • Example: Multi-AZ deployment - if one AZ fails, system continues with reduced capacity
  • Cost-effective: Shared resources with automatic failover
  • Brief downtime (seconds to minutes) during failover

Comparison:

Aspect               Fault Tolerance            High Availability
Downtime             Zero                       Seconds to minutes
Performance Impact   None                       Possible degradation
Cost                 Very high                  Moderate
Complexity           High                       Moderate
Use Case             Mission-critical systems   Most applications
AWS Example          S3 (11 9's durability)     RDS Multi-AZ

⭐ Must Know:

  • Most applications need high availability, not fault tolerance
  • Fault tolerance is expensive and complex
  • Exam questions usually ask for high availability solutions
  • S3 is fault tolerant (automatically handles failures)
  • EC2 is not fault tolerant (you must design for HA)

Disaster Recovery Concepts

What it is: Disaster recovery (DR) is the process of restoring systems and data after a catastrophic failure.

Why it exists: Despite high availability measures, disasters can still occur - entire data centers can fail, regions can become unavailable, data can be corrupted or deleted.

Key Metrics:

  1. RTO (Recovery Time Objective): How long can you be down?

    • Example: "We can tolerate 4 hours of downtime"
    • Determines DR strategy and cost
  2. RPO (Recovery Point Objective): How much data can you lose?

    • Example: "We can lose maximum 1 hour of data"
    • Determines backup frequency

Detailed Example: Understanding RTO and RPO

Scenario: Online banking application

Business Requirements:

  • RTO: 1 hour (bank can be offline maximum 1 hour)
  • RPO: 5 minutes (can lose maximum 5 minutes of transactions)

What this means:

  • If disaster occurs at 2:00 PM, system must be back online by 3:00 PM
  • Data must be restored to at least 1:55 PM state
  • Any transactions between 1:55 PM and 2:00 PM may be lost

Architecture to meet requirements:

  • For RTO (1 hour):

    • Warm standby in another Region
    • Pre-deployed infrastructure ready to scale up
    • Automated failover procedures
    • Regular DR drills to ensure 1-hour recovery
  • For RPO (5 minutes):

    • Database replication every 5 minutes
    • Transaction logs backed up continuously
    • Point-in-time recovery capability
    • Can restore to any point within last 5 minutes
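
The RTO/RPO check for the banking scenario can be expressed with Python's datetime arithmetic (the timestamps are the example's values):

```python
from datetime import datetime, timedelta

RPO = timedelta(minutes=5)   # max tolerable data loss
RTO = timedelta(hours=1)     # max tolerable downtime

disaster = datetime(2024, 1, 1, 14, 0)        # disaster strikes at 2:00 PM
last_replica = datetime(2024, 1, 1, 13, 57)   # most recent replicated state
back_online = datetime(2024, 1, 1, 14, 45)    # service restored

data_loss = disaster - last_replica   # 3 minutes of transactions lost
outage = back_online - disaster       # 45 minutes offline

print(data_loss <= RPO)  # True - within the 5-minute RPO
print(outage <= RTO)     # True - within the 1-hour RTO
```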

Cost Implications:

  • Tighter RTO = More expensive (need standby resources)
  • Tighter RPO = More expensive (more frequent backups/replication)
  • Rough rule of thumb: tightening RTO from 24 hours to 1 hour can raise DR costs by an order of magnitude
  • Tightening RPO from 24 hours to minutes requires continuous replication rather than daily backups

⭐ Must Know:

  • RTO = Time to recover (how long down)
  • RPO = Data loss tolerance (how much data lost)
  • Tighter RTO/RPO = Higher cost
  • Business requirements drive RTO/RPO
  • DR strategy must meet both RTO and RPO

šŸ’” Tip: Remember the difference:

  • RTO: "How long until we're back?" (TIME)
  • RPO: "How much data can we lose?" (DATA)

šŸŽÆ Exam Focus: Questions often give you RTO/RPO requirements and ask you to choose an appropriate DR strategy.


Section 5: Mental Model - How AWS Works

This section ties everything together into a cohesive mental model. Understanding this will help you make better architectural decisions throughout the exam.

The AWS Mental Model

šŸ“Š AWS Global Infrastructure Overview:

graph TB
    subgraph "AWS Global Infrastructure"
        subgraph "Region: us-east-1 (N. Virginia)"
            subgraph "AZ: us-east-1a"
                DC1A[Data Center 1]
                DC1B[Data Center 2]
            end
            subgraph "AZ: us-east-1b"
                DC2A[Data Center 3]
                DC2B[Data Center 4]
            end
            subgraph "AZ: us-east-1c"
                DC3A[Data Center 5]
                DC3B[Data Center 6]
            end
        end
        
        subgraph "Region: eu-west-1 (Ireland)"
            subgraph "AZ: eu-west-1a"
                DC4[Data Centers]
            end
            subgraph "AZ: eu-west-1b"
                DC5[Data Centers]
            end
            subgraph "AZ: eu-west-1c"
                DC6[Data Centers]
            end
        end
        
        subgraph "Edge Locations (400+)"
            EDGE1[Tokyo Edge]
            EDGE2[London Edge]
            EDGE3[Sydney Edge]
            EDGE4[SĆ£o Paulo Edge]
        end
    end
    
    DC1A -.High-Speed Network.-> DC2A
    DC2A -.High-Speed Network.-> DC3A
    DC1A -.High-Speed Network.-> DC3A
    
    DC4 -.High-Speed Network.-> DC5
    DC5 -.High-Speed Network.-> DC6
    
    DC1A -.AWS Backbone.-> DC4
    DC2A -.AWS Backbone.-> DC5
    
    EDGE1 -.Content Delivery.-> DC1A
    EDGE2 -.Content Delivery.-> DC4
    EDGE3 -.Content Delivery.-> DC1A
    EDGE4 -.Content Delivery.-> DC4
    
    style DC1A fill:#c8e6c9
    style DC2A fill:#c8e6c9
    style DC3A fill:#c8e6c9
    style DC4 fill:#fff3e0
    style DC5 fill:#fff3e0
    style DC6 fill:#fff3e0
    style EDGE1 fill:#e1f5fe
    style EDGE2 fill:#e1f5fe
    style EDGE3 fill:#e1f5fe
    style EDGE4 fill:#e1f5fe

See: diagrams/01_fundamentals_global_infrastructure.mmd

Diagram Explanation (Detailed):

This diagram shows the hierarchical structure of AWS's global infrastructure. At the highest level, AWS operates in multiple geographic Regions (shown here are us-east-1 in Virginia and eu-west-1 in Ireland, but there are 30+ Regions globally). Each Region is completely independent and isolated from other Regions, which provides fault isolation - a disaster in one Region doesn't affect others.

Within each Region, you see multiple Availability Zones (AZs). The diagram shows three AZs per Region (the real us-east-1 actually has six, us-east-1a through us-east-1f). Each AZ consists of one or more discrete data centers with redundant power, networking, and connectivity. The green boxes (DC1A, DC1B, etc.) represent individual data center facilities. AZs within a Region are connected by high-speed, low-latency private fiber networks (shown as dotted lines), allowing you to replicate data and fail over between AZs quickly.

The orange boxes represent data centers in the eu-west-1 Region. Notice the "AWS Backbone" connections between Regions - this is AWS's private global network that connects all Regions together. This network is separate from the public internet and provides faster, more reliable connectivity for cross-region replication and data transfer.

The blue boxes at the bottom represent Edge Locations, which are part of AWS's Content Delivery Network (CloudFront). There are 400+ Edge Locations globally, far more than Regions. Edge Locations cache content close to end users for low-latency delivery. The dotted lines show how Edge Locations connect back to origin Regions to fetch content that isn't cached.

Key Takeaways from this Diagram:

  1. Regions are isolated: Failure in one Region doesn't affect others
  2. AZs provide redundancy: Deploy across multiple AZs for high availability
  3. Edge Locations improve performance: Cache content close to users
  4. AWS Backbone is private: Cross-region traffic doesn't use public internet
  5. Hierarchical structure: Region → AZ → Data Center → Your Resources

Complete VPC Architecture

šŸ“Š VPC Architecture with Multi-AZ Deployment:

graph TB
    INTERNET[Internet]
    
    subgraph "VPC: 10.0.0.0/16"
        IGW[Internet Gateway]
        
        subgraph "Availability Zone A"
            subgraph "Public Subnet: 10.0.1.0/24"
                WEB1[Web Server<br/>10.0.1.10<br/>Public: 54.1.2.3]
                NAT1[NAT Gateway<br/>10.0.1.50<br/>EIP: 54.1.2.100]
            end
            
            subgraph "Private Subnet: 10.0.11.0/24"
                APP1[App Server<br/>10.0.11.20<br/>No Public IP]
            end
            
            subgraph "Database Subnet: 10.0.21.0/24"
                DB1[Database<br/>10.0.21.30<br/>No Public IP]
            end
        end
        
        subgraph "Availability Zone B"
            subgraph "Public Subnet: 10.0.2.0/24"
                WEB2[Web Server<br/>10.0.2.10<br/>Public: 54.1.2.4]
                NAT2[NAT Gateway<br/>10.0.2.50<br/>EIP: 54.1.2.101]
            end
            
            subgraph "Private Subnet: 10.0.12.0/24"
                APP2[App Server<br/>10.0.12.20<br/>No Public IP]
            end
            
            subgraph "Database Subnet: 10.0.22.0/24"
                DB2[Database<br/>10.0.22.30<br/>No Public IP]
            end
        end
    end
    
    INTERNET <-->|Public Traffic| IGW
    IGW <--> WEB1
    IGW <--> WEB2
    IGW <--> NAT1
    IGW <--> NAT2
    
    WEB1 <--> APP1
    WEB2 <--> APP2
    APP1 <--> DB1
    APP2 <--> DB2
    
    APP1 -.Outbound Only.-> NAT1
    APP2 -.Outbound Only.-> NAT2
    
    DB1 <-.Replication.-> DB2
    
    style INTERNET fill:#ffebee
    style IGW fill:#e1f5fe
    style WEB1 fill:#c8e6c9
    style WEB2 fill:#c8e6c9
    style NAT1 fill:#fff3e0
    style NAT2 fill:#fff3e0
    style APP1 fill:#f3e5f5
    style APP2 fill:#f3e5f5
    style DB1 fill:#e8f5e9
    style DB2 fill:#e8f5e9

See: diagrams/01_fundamentals_vpc_architecture.mmd

Diagram Explanation (Detailed):

This diagram shows a complete, production-ready VPC architecture following AWS best practices. Let's walk through each component and understand why it's designed this way.

VPC Structure (10.0.0.0/16):
The entire VPC uses the 10.0.0.0/16 CIDR block, giving us 65,536 IP addresses to work with. This is a private IP range that's not routable on the public internet. The VPC spans multiple Availability Zones (AZ-A and AZ-B) for high availability.

Internet Gateway (Blue):
The Internet Gateway (IGW) is the entry/exit point for internet traffic. It's attached to the VPC and provides a target for internet-routable traffic. The IGW is highly available by design - AWS manages its redundancy. All public subnets route their internet-bound traffic (0.0.0.0/0) to this IGW.

Public Subnets (Green - Web Servers):
Public subnets (10.0.1.0/24 in AZ-A and 10.0.2.0/24 in AZ-B) contain resources that need to be directly accessible from the internet. The web servers (WEB1 and WEB2) each have TWO IP addresses: a private IP (10.0.1.10 and 10.0.2.10) for internal VPC communication, and a public IP (54.1.2.3 and 54.1.2.4) for internet access. The Internet Gateway performs 1:1 NAT between these addresses. These subnets are "public" because their route table has a route sending 0.0.0.0/0 traffic to the IGW.

NAT Gateways (Orange):
Each public subnet also contains a NAT Gateway (NAT1 and NAT2). These are critical for allowing private subnet resources to access the internet for updates, API calls, etc., while preventing inbound connections from the internet. Each NAT Gateway has an Elastic IP (EIP) - a static public IP address. Notice we have ONE NAT Gateway per AZ - this is important for high availability. If we only had one NAT Gateway and its AZ failed, private subnets in other AZs couldn't access the internet.

Private Subnets (Purple - Application Servers):
Private subnets (10.0.11.0/24 in AZ-A and 10.0.12.0/24 in AZ-B) contain application servers that don't need direct internet access. APP1 and APP2 have ONLY private IPs (10.0.11.20 and 10.0.12.20) - no public IPs. Their route table sends internet-bound traffic (0.0.0.0/0) to their AZ's NAT Gateway, not directly to the IGW. This means they can initiate outbound connections (like downloading updates) but cannot receive inbound connections from the internet.

Database Subnets (Light Green - Databases):
Database subnets (10.0.21.0/24 in AZ-A and 10.0.22.0/24 in AZ-B) are the most isolated. DB1 and DB2 have ONLY private IPs and their route tables have NO route to the internet at all - not even through NAT Gateway. They can only communicate within the VPC. This is the most secure configuration for databases. The dotted line between DB1 and DB2 represents database replication for high availability.

Traffic Flows:

  1. Internet → Web Server: User request comes from internet, hits IGW, IGW translates public IP to private IP, traffic reaches web server
  2. Web Server → App Server: Web server sends request to app server's private IP, stays within VPC (fast, secure)
  3. App Server → Database: App server queries database using private IP, stays within VPC
  4. App Server → Internet: App server needs to call external API, sends to NAT Gateway, NAT translates to its EIP, forwards to IGW, reaches internet
  5. Database Replication: DB1 and DB2 replicate data using private IPs within VPC
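All five flows above are decided by route tables using longest-prefix matching. A minimal Python sketch of that lookup, using the standard `ipaddress` module - the targets `igw-1` and `nat-1` are hypothetical placeholders, not real AWS identifiers:

```python
import ipaddress

def resolve(route_table, dest_ip):
    """Return the target of the most specific (longest-prefix) matching route."""
    dest = ipaddress.ip_address(dest_ip)
    matches = [(net, target) for net, target in route_table
               if dest in ipaddress.ip_network(net)]
    if not matches:
        return None  # no route: traffic is dropped
    # Longest prefix wins, as in a real VPC route table
    return max(matches, key=lambda m: ipaddress.ip_network(m[0]).prefixlen)[1]

# Hypothetical route tables mirroring the three subnet tiers in the diagram
public_rt   = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "igw-1")]
private_rt  = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-1")]
isolated_rt = [("10.0.0.0/16", "local")]  # database tier: no internet route

print(resolve(public_rt,   "8.8.8.8"))     # igw-1 (internet-bound)
print(resolve(private_rt,  "8.8.8.8"))     # nat-1 (outbound via NAT)
print(resolve(private_rt,  "10.0.21.30"))  # local (stays within the VPC)
print(resolve(isolated_rt, "8.8.8.8"))     # None  (no route: dropped)
```

Note how the same destination (8.8.8.8) resolves differently per tier - that difference alone is what makes a subnet "public", "private", or "isolated".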

Why This Design:

  • Security: Databases have no internet access, app servers have outbound only, only web servers are publicly accessible
  • High Availability: Resources in both AZs - if one AZ fails, the other continues serving traffic
  • Performance: Internal traffic uses private IPs (low latency, no internet routing)
  • Cost Optimization: NAT Gateway charges apply, but they are a necessary cost of keeping private subnets off the internet
  • Scalability: Can add more subnets and AZs as needed

Key Takeaways:

  1. Three-tier architecture: Web (public), App (private), Database (isolated)
  2. Multi-AZ deployment: Every tier has resources in multiple AZs
  3. Defense in depth: Multiple layers of security (public/private subnets, security groups, NACLs)
  4. NAT Gateway per AZ: Ensures high availability for outbound internet access
  5. No public IPs for internal resources: Databases and app servers are not internet-accessible

This architecture is the foundation for most AWS applications and appears frequently in exam scenarios.


Section 6: AWS Service Categories Overview

Understanding how AWS services are categorized helps you choose the right service for each scenario. This section provides a high-level overview - detailed coverage comes in domain chapters.

Compute Services

Purpose: Run application code and workloads.

Key Services:

  1. Amazon EC2 (Elastic Compute Cloud)

    • Virtual servers in the cloud
    • Full control over OS and configuration
    • Use when: You need specific OS, custom software, or full control
  2. AWS Lambda

    • Serverless compute - run code without managing servers
    • Pay only when code runs
    • Use when: Event-driven workloads, microservices, short-running tasks
  3. Amazon ECS/EKS (Container Services)

    • Run Docker containers
    • ECS = AWS-native, EKS = Kubernetes
    • Use when: Containerized applications, microservices architecture
  4. AWS Elastic Beanstalk

    • Platform-as-a-Service (PaaS)
    • Deploy code, AWS manages infrastructure
    • Use when: Want to focus on code, not infrastructure

⭐ Must Know: EC2 = IaaS (you manage), Lambda = Serverless (AWS manages), Beanstalk = PaaS (middle ground)

Storage Services

Purpose: Store and retrieve data.

Key Services:

  1. Amazon S3 (Simple Storage Service)

    • Object storage for files, images, backups
    • 11 9's durability (99.999999999%)
    • Use when: Storing files, static website hosting, data lakes
  2. Amazon EBS (Elastic Block Store)

    • Block storage for EC2 instances (like hard drives)
    • Persistent storage that survives instance termination
    • Use when: Database storage, application data for EC2
  3. Amazon EFS (Elastic File System)

    • Shared file system accessible from multiple EC2 instances
    • NFS protocol
    • Use when: Shared storage needed across multiple servers
  4. Amazon FSx

    • Managed file systems (Windows File Server, Lustre)
    • Use when: Windows applications, high-performance computing

⭐ Must Know: S3 = Objects (files), EBS = Blocks (disks), EFS = Shared files

Database Services

Purpose: Store and query structured data.

Key Services:

  1. Amazon RDS (Relational Database Service)

    • Managed relational databases (MySQL, PostgreSQL, Oracle, SQL Server)
    • AWS handles backups, patching, failover
    • Use when: Traditional relational database needs
  2. Amazon Aurora

    • AWS-built relational database (MySQL/PostgreSQL compatible)
    • Up to 5x the throughput of MySQL and 3x the throughput of PostgreSQL
    • Use when: Need high performance relational database
  3. Amazon DynamoDB

    • NoSQL database, key-value and document
    • Millisecond latency at any scale
    • Use when: Need high-scale, low-latency NoSQL
  4. Amazon ElastiCache

    • In-memory cache (Redis, Memcached)
    • Microsecond latency
    • Use when: Need to cache frequently accessed data

⭐ Must Know: RDS/Aurora = Relational (SQL), DynamoDB = NoSQL, ElastiCache = Cache

Networking Services

Purpose: Connect resources and control traffic.

Key Services:

  1. Amazon VPC (Virtual Private Cloud)

    • Isolated network for your resources
    • Full control over IP addressing, subnets, routing
    • Use when: Almost always (every EC2-based resource lives in a VPC)
  2. Elastic Load Balancing (ELB)

    • Distribute traffic across multiple targets
    • Types: ALB (HTTP/HTTPS), NLB (TCP/UDP), GWLB (Gateway Load Balancer)
    • Use when: Need to distribute traffic, achieve high availability
  3. Amazon Route 53

    • DNS service
    • Route users to applications
    • Use when: Need domain name management, DNS routing
  4. AWS Direct Connect

    • Dedicated network connection from on-premises to AWS
    • More reliable than internet VPN
    • Use when: Need consistent network performance, high bandwidth

⭐ Must Know: VPC = Network foundation, ELB = Traffic distribution, Route 53 = DNS

Security Services

Purpose: Protect resources and data.

Key Services:

  1. AWS IAM (Identity and Access Management)

    • Control who can access what
    • Users, roles, policies
    • Use when: Always (controls all AWS access)
  2. AWS KMS (Key Management Service)

    • Manage encryption keys
    • Encrypt data at rest
    • Use when: Need to encrypt sensitive data
  3. AWS Security Hub

    • Centralized security findings
    • Aggregates alerts from multiple services
    • Use when: Need unified security view
  4. Amazon GuardDuty

    • Threat detection service
    • Monitors for malicious activity
    • Use when: Need automated threat detection

⭐ Must Know: IAM = Access control, KMS = Encryption, Security Hub = Central monitoring

Management Services

Purpose: Monitor, manage, and automate AWS resources.

Key Services:

  1. AWS CloudFormation

    • Infrastructure as Code (IaC)
    • Define infrastructure in templates
    • Use when: Need repeatable, automated deployments
  2. Amazon CloudWatch

    • Monitoring and observability
    • Metrics, logs, alarms
    • Use when: Need to monitor resources and applications
  3. AWS CloudTrail

    • API activity logging
    • Who did what, when
    • Use when: Need audit trail, compliance, security analysis
  4. AWS Systems Manager

    • Operational management
    • Patch management, automation, parameter store
    • Use when: Need to manage EC2 instances at scale

⭐ Must Know: CloudFormation = IaC, CloudWatch = Monitoring, CloudTrail = Audit logs


Chapter Summary

What We Covered

This chapter built your foundational understanding of AWS and cloud computing:

āœ… Cloud Computing Fundamentals

  • What cloud computing is and why it exists
  • Service models: IaaS, PaaS, SaaS
  • Benefits: Elasticity, pay-per-use, global reach

āœ… AWS Global Infrastructure

  • Regions: Geographic areas with multiple AZs
  • Availability Zones: Isolated data centers within Regions
  • Edge Locations: Content delivery network nodes
  • How to use them for high availability and low latency

āœ… Networking Fundamentals

  • IP addresses and CIDR notation
  • Private vs public IP addresses
  • Subnets and network segmentation
  • Routing and route tables
  • Internet Gateway and NAT Gateway

āœ… Security Fundamentals

  • Authentication vs authorization
  • Encryption at rest and in transit
  • Principle of least privilege
  • Defense in depth

āœ… High Availability and Resilience

  • What high availability means (uptime percentages)
  • Redundancy strategies (active-active, active-passive)
  • Fault tolerance vs high availability
  • Disaster recovery concepts (RTO, RPO)

āœ… AWS Service Categories

  • Compute: EC2, Lambda, ECS, Beanstalk
  • Storage: S3, EBS, EFS
  • Database: RDS, Aurora, DynamoDB
  • Networking: VPC, ELB, Route 53
  • Security: IAM, KMS, Security Hub
  • Management: CloudFormation, CloudWatch, CloudTrail

Critical Takeaways

  1. AWS Regions and AZs: Always deploy across multiple AZs for high availability. Each AZ is isolated but connected by high-speed networks.

  2. VPC Architecture: Public subnets have Internet Gateway routes, private subnets use NAT Gateway, database subnets have no internet access.

  3. High Availability: Achieved through redundancy across AZs, automatic failover, and eliminating single points of failure.

  4. Security Layers: Use multiple layers - network (VPC, security groups), identity (IAM), encryption (KMS), monitoring (CloudTrail, GuardDuty).

  5. Service Selection: Choose IaaS (EC2) for control, PaaS (Beanstalk) for simplicity, Serverless (Lambda) for event-driven workloads.

Self-Assessment Checklist

Before moving to Domain chapters, ensure you can:

  • Explain the difference between Regions, AZs, and Edge Locations
  • Calculate IP address ranges from CIDR notation (e.g., /24 = 256 addresses)
  • Describe the difference between public and private subnets
  • Explain how Internet Gateway and NAT Gateway work
  • Understand authentication vs authorization
  • Calculate availability percentages and downtime
  • Explain RTO vs RPO
  • Describe the difference between fault tolerance and high availability
  • List key AWS services in each category (compute, storage, database, networking)
  • Draw a basic multi-AZ VPC architecture from memory

Practice Questions

Before proceeding to Domain 1, test your understanding:

Try these questions from your practice test bundles:

  • Beginner Bundle 1: Questions 1-10 (should cover fundamentals)
  • Expected score: 80%+ to proceed confidently

If you scored below 80%:

  • Review sections where you struggled
  • Re-read the diagram explanations
  • Try drawing the VPC architecture diagram from memory
  • Review the ⭐ Must Know items

Quick Reference Card

Copy this to your notes for quick review:

AWS Infrastructure:

  • Region: Geographic area with multiple AZs
  • AZ: One or more data centers with redundant power/networking
  • Edge Location: CDN cache point (400+ globally)

IP Addressing:

  • /32 = 1 address
  • /24 = 256 addresses
  • /16 = 65,536 addresses
  • Private ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
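These counts can be verified with Python's standard `ipaddress` module (keep in mind that inside a VPC subnet, AWS reserves 5 addresses, so a /24 actually yields 251 usable hosts):

```python
import ipaddress

# Address counts per prefix length
for cidr in ("10.0.0.0/32", "10.0.0.0/24", "10.0.0.0/16"):
    net = ipaddress.ip_network(cidr)
    print(f"{cidr} -> {net.num_addresses} addresses")
# /32 -> 1, /24 -> 256, /16 -> 65536

# The three RFC 1918 private ranges
for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"):
    assert ipaddress.ip_network(cidr).is_private
```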

Subnet Types:

  • Public: Route to Internet Gateway
  • Private: Route to NAT Gateway
  • Isolated: No internet route

Availability:

  • 99.9% = 8.76 hours downtime/year
  • 99.99% = 52 minutes downtime/year
  • 99.999% = 5 minutes downtime/year
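The downtime figures above follow from one line of arithmetic - yearly minutes times the unavailable fraction:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for availability in (99.9, 99.99, 99.999):
    downtime_min = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability}% -> {downtime_min:.1f} min/year (~{downtime_min / 60:.2f} h)")
# 99.9%  -> 525.6 min (~8.76 h)
# 99.99% -> 52.6 min
# 99.999% -> 5.3 min
```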

DR Metrics:

  • RTO: How long to recover (time)
  • RPO: How much data loss (data)
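One way to internalize RPO: with periodic backups, the worst-case data loss equals the backup interval. A small, hypothetical sanity check:

```python
def meets_rpo(backup_interval_hours, rpo_hours):
    """Worst case, the most recent backup is one full interval old."""
    return backup_interval_hours <= rpo_hours

print(meets_rpo(24, 4))  # False: daily backups can lose up to 24h of data
print(meets_rpo(1, 4))   # True: hourly backups lose at most 1h
```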

Service Categories:

  • Compute: EC2, Lambda, ECS
  • Storage: S3, EBS, EFS
  • Database: RDS, Aurora, DynamoDB
  • Network: VPC, ELB, Route 53
  • Security: IAM, KMS, GuardDuty

Next Steps: You're now ready to dive into Domain 1 (Design Solutions for Organizational Complexity). Open file 02_domain_1_organizational_complexity to continue.

šŸ’” Tip: Keep this fundamentals chapter bookmarked. You'll reference these concepts throughout your study.


Chapter 1: Design Solutions for Organizational Complexity

Domain Weight: 26% of exam (highest weight)

Chapter Overview

This domain focuses on designing AWS solutions for large, complex organizations with multiple accounts, teams, and requirements. You'll learn how to architect network connectivity across complex environments, implement security controls at scale, design resilient architectures, manage multi-account structures, and optimize costs across the organization.

What you'll learn:

  • Architect network connectivity strategies for multi-VPC and hybrid environments
  • Prescribe security controls for enterprise-scale deployments
  • Design reliable and resilient architectures with appropriate RTO/RPO
  • Design and manage multi-account AWS environments
  • Determine cost optimization and visibility strategies

Time to complete: 12-15 hours (this is the largest domain)

Prerequisites: Chapter 0 (Fundamentals) - especially networking and security sections

Exam Weight: 26% (approximately 17 questions on the actual exam)


Task 1.1: Architect Network Connectivity Strategies

This task covers designing network architectures for complex organizations with multiple VPCs, hybrid cloud requirements, and global presence.

Introduction

The problem: Organizations grow complex over time. They have multiple VPCs for different applications, teams, or environments. They have on-premises data centers that need to connect to AWS. They have users in multiple geographic locations. They need to segment networks for security and compliance. Traditional point-to-point connections don't scale.

The solution: AWS provides multiple services for network connectivity - VPC Peering, Transit Gateway, PrivateLink, Direct Connect, VPN, and Route 53 Resolver. Each solves specific connectivity challenges. The key is understanding when to use each service and how to combine them for optimal architecture.

Why it's tested: Network architecture is fundamental to every AWS solution. Poor network design leads to security vulnerabilities, performance issues, high costs, and operational complexity. As a Solutions Architect Professional, you must design scalable, secure, and cost-effective network architectures.


Core Concepts

VPC Peering

What it is: VPC Peering creates a direct network connection between two VPCs, allowing resources in each VPC to communicate using private IP addresses as if they were in the same network.

Why it exists: Organizations often have multiple VPCs for different purposes (production, development, different applications, different teams). VPC Peering allows these VPCs to communicate securely without going through the internet.

Real-world analogy: Think of VPC Peering like building a private bridge between two islands. Instead of taking a boat through public waters (internet), you can drive directly across the bridge (private connection).

How it works (Detailed step-by-step):

  1. You create a peering connection request from VPC-A to VPC-B. This can be in the same account or different accounts, same region or different regions.

  2. The owner of VPC-B accepts the peering request. Until accepted, no traffic can flow.

  3. AWS establishes a network connection between the VPCs. This connection uses AWS's private network backbone, not the public internet.

  4. You update route tables in both VPCs. In VPC-A, you add a route: "To reach VPC-B's CIDR (10.1.0.0/16), send traffic to the peering connection." In VPC-B, you add the reverse route.

  5. You update security groups to allow traffic from the peer VPC's CIDR block. Security groups are stateful, so return traffic is allowed automatically - typically you only need an inbound rule on the destination side.

  6. Traffic flows directly between VPCs using private IP addresses. No NAT, no internet gateway, no public IPs needed.

  7. AWS handles the routing automatically once route tables are configured. Traffic takes the most direct path through AWS's network.

Detailed Example 1: Development and Production VPC Peering

Scenario: You have a production VPC and a development VPC. Developers need to access a shared database in production for testing, but you want to keep the environments separate.

Setup:

  • Production VPC: 10.0.0.0/16 (us-east-1)
  • Development VPC: 10.1.0.0/16 (us-east-1)
  • Production Database: 10.0.50.100
  • Development App Server: 10.1.10.50

Implementation Steps:

  1. Create Peering Connection:

    • From Development VPC, create peering request to Production VPC
    • Production VPC owner accepts the request
    • Peering connection status: Active
  2. Update Route Tables:

    • Development VPC route table: Add route 10.0.0.0/16 → pcx-12345678 (peering connection)
    • Production VPC route table: Add route 10.1.0.0/16 → pcx-12345678
  3. Update Security Groups:

    • Production database security group: Allow inbound port 3306 from 10.1.0.0/16
    • Development app server security group: Allow outbound port 3306 to 10.0.0.0/16 (security groups allow all outbound by default, so this matters only if outbound rules have been restricted)
  4. Test Connectivity:

    • From development app server (10.1.10.50), connect to database at 10.0.50.100
    • Traffic flows: App Server → Dev VPC route table → Peering connection → Prod VPC route table → Database
    • Connection succeeds using private IPs

Benefits:

  • No internet exposure (database not accessible from internet)
  • Low latency (direct connection through AWS backbone)
  • No data transfer charges within same region
  • Simple to set up and manage

Detailed Example 2: Cross-Region VPC Peering

Scenario: You have a web application in us-east-1 and a data analytics platform in eu-west-1. The analytics platform needs to access application data.

Setup:

  • Application VPC: 10.0.0.0/16 (us-east-1)
  • Analytics VPC: 10.2.0.0/16 (eu-west-1)
  • Application Database: 10.0.50.100 (us-east-1)
  • Analytics Server: 10.2.10.50 (eu-west-1)

Implementation:

  1. Create Inter-Region Peering:

    • From Analytics VPC (eu-west-1), create peering request to Application VPC (us-east-1)
    • Application VPC owner accepts
    • Peering connection spans regions
  2. Update Route Tables:

    • Analytics VPC: Add route 10.0.0.0/16 → pcx-87654321
    • Application VPC: Add route 10.2.0.0/16 → pcx-87654321
  3. Configure Security:

    • Application database security group: Allow port 3306 from 10.2.0.0/16
    • Analytics server security group: Allow outbound to 10.0.0.0/16
  4. Data Transfer:

    • Analytics server queries database across regions
    • Traffic uses AWS global backbone (not internet)
    • Latency: ~80-100ms (us-east-1 to eu-west-1)
    • Data transfer charges apply: $0.02/GB

Considerations:

  • Cross-region peering incurs data transfer charges
  • Higher latency than same-region peering
  • Useful for disaster recovery, global applications
  • Encryption in transit automatically enabled

Detailed Example 3: Multiple VPC Peering (Hub-and-Spoke)

Scenario: You have 4 VPCs - 1 shared services VPC and 3 application VPCs. All applications need to access shared services (Active Directory, monitoring, logging).

Setup:

  • Shared Services VPC: 10.0.0.0/16 (hub)
  • App1 VPC: 10.1.0.0/16 (spoke)
  • App2 VPC: 10.2.0.0/16 (spoke)
  • App3 VPC: 10.3.0.0/16 (spoke)

Implementation:

  1. Create Peering Connections:

    • Shared Services ↔ App1: pcx-111
    • Shared Services ↔ App2: pcx-222
    • Shared Services ↔ App3: pcx-333
    • Total: 3 peering connections
  2. Update Route Tables:

    • Shared Services VPC: Routes to 10.1.0.0/16, 10.2.0.0/16, 10.3.0.0/16
    • App1 VPC: Route to 10.0.0.0/16 only
    • App2 VPC: Route to 10.0.0.0/16 only
    • App3 VPC: Route to 10.0.0.0/16 only
  3. Traffic Patterns:

    • App1 can reach Shared Services āœ…
    • App2 can reach Shared Services āœ…
    • App3 can reach Shared Services āœ…
    • App1 CANNOT reach App2 āŒ (no direct peering)
    • App2 CANNOT reach App3 āŒ (no direct peering)

Important Limitation:
VPC Peering is NOT transitive. Even though App1 peers with Shared Services, and App2 peers with Shared Services, App1 and App2 cannot communicate through Shared Services. If you need full mesh connectivity, you'd need to peer every VPC with every other VPC (N*(N-1)/2 connections for N VPCs).
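The full-mesh formula is worth internalizing - a quick calculation shows how fast peering stops scaling compared to one Transit Gateway attachment per VPC:

```python
def full_mesh_peerings(n_vpcs):
    """Direct peering connections for every-VPC-to-every-VPC connectivity: N*(N-1)/2."""
    return n_vpcs * (n_vpcs - 1) // 2

for n in (4, 10, 20, 50):
    print(f"{n} VPCs: {full_mesh_peerings(n)} peering connections vs {n} TGW attachments")
# 4 -> 6, 10 -> 45, 20 -> 190, 50 -> 1225
```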

⭐ Must Know (Critical Facts):

  • VPC Peering is NOT transitive: If VPC-A peers with VPC-B, and VPC-B peers with VPC-C, VPC-A cannot reach VPC-C through VPC-B. You must create a direct peering connection between VPC-A and VPC-C.

  • CIDR blocks cannot overlap: You cannot peer VPCs with overlapping IP ranges. If VPC-A is 10.0.0.0/16 and VPC-B is 10.0.0.0/16, peering will fail. Plan your IP addressing carefully.

  • One peering connection per VPC pair: You can only have one active peering connection between any two VPCs. You cannot create multiple peering connections for redundancy.

  • Maximum 125 peering connections per VPC: The default quota is 50, adjustable up to a hard maximum of 125 - which shows VPC Peering doesn't scale to hundreds of VPCs.

  • Cross-region peering supported: You can peer VPCs in different regions, but data transfer charges apply ($0.02/GB).
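The overlap rule is easy to pre-check before attempting a peering connection, again with the standard `ipaddress` module:

```python
import ipaddress

def can_peer(cidr_a, cidr_b):
    """VPC Peering requires non-overlapping CIDR blocks."""
    return not ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))

print(can_peer("10.0.0.0/16", "10.1.0.0/16"))  # True  - distinct ranges
print(can_peer("10.0.0.0/16", "10.0.0.0/16"))  # False - identical ranges
print(can_peer("10.0.0.0/8",  "10.1.0.0/16"))  # False - /16 nested inside /8
```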

When to use (Comprehensive):

  • āœ… Use when: You have a small number of VPCs (2-10) that need to communicate

    • Example: Production and development VPCs, or application and shared services VPCs
    • Reason: Simple to set up, no additional cost (same region), low latency
  • āœ… Use when: You need the lowest possible latency between VPCs

    • Example: Real-time data processing between VPCs
    • Reason: Direct connection through AWS backbone, no intermediate hops
  • āœ… Use when: You want to avoid data transfer charges (same region)

    • Example: Frequent data synchronization between VPCs in same region
    • Reason: No charges for data transfer within same region via peering
  • āŒ Don't use when: You have many VPCs (10+) that need full mesh connectivity

    • Problem: N*(N-1)/2 peering connections needed (10 VPCs = 45 connections, 20 VPCs = 190 connections)
    • Better solution: Use Transit Gateway instead
  • āŒ Don't use when: You need transitive routing

    • Problem: VPC Peering doesn't support transitive routing
    • Better solution: Use Transit Gateway for hub-and-spoke with transitive routing
  • āŒ Don't use when: VPCs have overlapping CIDR blocks

    • Problem: Peering requires non-overlapping IP ranges
    • Better solution: Re-IP one VPC, or use NAT/proxy solutions

Limitations & Constraints:

  • No transitive routing: Cannot route through a peered VPC to reach another VPC
  • No overlapping CIDRs: VPCs must have unique, non-overlapping IP ranges
  • No edge-to-edge routing: Cannot route to internet gateway, VPN, or Direct Connect in peer VPC
  • Maximum 125 peering connections per VPC: Doesn't scale to large numbers of VPCs
  • DNS resolution: Must enable DNS resolution for peering to resolve private DNS names across VPCs
  • Security group references: Can reference security groups in peered VPC (same region only)

šŸ’” Tips for Understanding:

  • Think of VPC Peering as a "direct cable" between two VPCs - simple, fast, but not scalable to many VPCs
  • Remember: "Peering is NOT transitive" - this is the most commonly tested limitation
  • For 2-5 VPCs: Peering is perfect. For 10+ VPCs: Consider Transit Gateway
  • Always check for CIDR overlap before attempting to peer VPCs

āš ļø Common Mistakes & Misconceptions:

  • Mistake 1: Assuming transitive routing works

    • Why it's wrong: VPC Peering explicitly does not support transitive routing
    • Correct understanding: Each VPC pair needs a direct peering connection for communication
  • Mistake 2: Forgetting to update route tables after creating peering

    • Why it's wrong: Peering connection alone doesn't enable traffic - route tables must be updated
    • Correct understanding: Three steps required: (1) Create peering, (2) Update route tables, (3) Update security groups
  • Mistake 3: Trying to peer VPCs with overlapping CIDRs

    • Why it's wrong: AWS cannot route traffic when IP ranges overlap
    • Correct understanding: Plan IP addressing carefully from the start to avoid overlaps

šŸ”— Connections to Other Topics:

  • Relates to Transit Gateway (Task 1.1) because: Transit Gateway solves VPC Peering's scalability limitations
  • Builds on VPC Fundamentals (Chapter 0) by: Extending VPC connectivity beyond a single VPC
  • Often used with PrivateLink (Task 1.1) to: Provide service-level connectivity instead of network-level

Troubleshooting Common Issues:

  • Issue 1: Peering connection created but traffic not flowing

    • Solution: Check route tables in both VPCs - routes must point to peering connection
    • Solution: Check security groups - must allow traffic from peer VPC CIDR
    • Solution: Check NACLs - must allow traffic (often forgotten)
  • Issue 2: Cannot create peering connection

    • Solution: Check for CIDR overlap - VPCs must have non-overlapping IP ranges
    • Solution: Check peering connection limit - maximum 125 per VPC
    • Solution: Verify IAM permissions - need ec2:CreateVpcPeeringConnection permission

AWS Transit Gateway

What it is: AWS Transit Gateway is a network transit hub that connects VPCs, on-premises networks, and remote offices through a single gateway. It acts as a cloud router, simplifying network architecture and enabling transitive routing.

Why it exists: As organizations grow, they accumulate many VPCs (10, 20, 50+). Using VPC Peering for full mesh connectivity becomes unmanageable - 50 VPCs would require 1,225 peering connections. Transit Gateway solves this by providing a central hub where all networks connect once, and the hub handles routing between them.

Real-world analogy: Think of Transit Gateway like a major airport hub. Instead of having direct flights between every pair of cities (like VPC Peering), all flights go through the hub airport. You fly from City A to the hub, then from the hub to City B. The hub handles all the routing complexity.

How it works (Detailed step-by-step):

  1. You create a Transit Gateway in a region. It's a highly available, scalable service managed by AWS.

  2. You attach VPCs to the Transit Gateway. Each VPC gets a "Transit Gateway attachment" which connects it to the hub.

  3. You attach on-premises networks via VPN or Direct Connect. These also become attachments to the Transit Gateway.

  4. You configure route tables in the Transit Gateway. These control which attachments can communicate with which other attachments.

  5. You update VPC route tables to send traffic destined for other networks to the Transit Gateway attachment.

  6. Transit Gateway routes traffic between attachments based on its route tables. It supports transitive routing - VPC-A can reach VPC-B through the Transit Gateway, and VPC-B can reach on-premises through the same Transit Gateway.

  7. Traffic flows through the hub. All inter-VPC and hybrid traffic goes through Transit Gateway, which provides centralized routing, monitoring, and control.
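The transitive-routing difference from VPC Peering can be sketched as a reachability check. This is a simplified model (names like "shared" and "app1" are hypothetical, and it ignores per-attachment route tables, which can further restrict traffic): with peering, only direct links connect; with a Transit Gateway, any two attachments can reach each other through the hub.

```python
from itertools import combinations

# Hub-and-spoke peering from the earlier example: spokes peer only with shared services
peering_links = {("shared", "app1"), ("shared", "app2"), ("shared", "app3")}

def peered(a, b):
    # VPC Peering is NOT transitive: only a direct link provides connectivity
    return (a, b) in peering_links or (b, a) in peering_links

tgw_attachments = {"shared", "app1", "app2", "app3"}

def tgw_reachable(a, b):
    # Transit Gateway routes transitively between any attached networks
    return a in tgw_attachments and b in tgw_attachments

for a, b in combinations(["shared", "app1", "app2", "app3"], 2):
    print(f"{a}<->{b}: peering={peered(a, b)}, tgw={tgw_reachable(a, b)}")
# app1<->app2 (and the other spoke pairs) are False with peering but True with TGW
```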

šŸ“Š Transit Gateway Architecture:

graph TB
    subgraph "Transit Gateway: tgw-12345"
        TGW[Transit Gateway<br/>Central Hub]
    end
    
    subgraph "VPC Attachments"
        VPC1[VPC 1<br/>10.0.0.0/16<br/>Production]
        VPC2[VPC 2<br/>10.1.0.0/16<br/>Development]
        VPC3[VPC 3<br/>10.2.0.0/16<br/>Shared Services]
        VPC4[VPC 4<br/>10.3.0.0/16<br/>Analytics]
    end
    
    subgraph "On-Premises"
        DC[Data Center<br/>192.168.0.0/16]
        VPN[VPN Connection]
        DX[Direct Connect]
    end
    
    subgraph "Other Regions"
        TGW2[Transit Gateway<br/>eu-west-1]
    end
    
    VPC1 <--> TGW
    VPC2 <--> TGW
    VPC3 <--> TGW
    VPC4 <--> TGW
    
    DC --> VPN
    DC --> DX
    VPN --> TGW
    DX --> TGW
    
    TGW <-.Peering.-> TGW2
    
    style TGW fill:#e1f5fe
    style VPC1 fill:#c8e6c9
    style VPC2 fill:#fff3e0
    style VPC3 fill:#f3e5f5
    style VPC4 fill:#ffe0b2
    style DC fill:#ffebee
    style VPN fill:#e8eaf6
    style DX fill:#e8eaf6
    style TGW2 fill:#e1f5fe

See: diagrams/02_domain_1_transit_gateway.mmd

Diagram Explanation (Detailed):

This diagram illustrates a complete Transit Gateway deployment serving as the central networking hub for an organization. Let's examine each component and understand how they interact.

Transit Gateway (Blue, Center): The Transit Gateway (tgw-12345) sits at the center as the network hub. It's a regional service that's highly available across multiple Availability Zones automatically. Think of it as a virtual router that AWS manages for you. It has its own route tables (separate from VPC route tables) that control traffic flow between attachments.

VPC Attachments (Colored Boxes, Top): Four VPCs are attached to the Transit Gateway, each representing a different environment or application. VPC 1 (green) is Production with CIDR 10.0.0.0/16, VPC 2 (orange) is Development with 10.1.0.0/16, VPC 3 (purple) is Shared Services with 10.2.0.0/16, and VPC 4 (light orange) is Analytics with 10.3.0.0/16. Each VPC connects to the Transit Gateway through a "Transit Gateway attachment" - this is a logical connection that appears as an elastic network interface in the VPC. The bidirectional arrows show that traffic can flow in both directions.

Key Benefit - Transitive Routing: Unlike VPC Peering, Transit Gateway supports transitive routing. This means VPC 1 can communicate with VPC 2 through the Transit Gateway, VPC 2 can communicate with VPC 3, and VPC 1 can also communicate with VPC 3 - all through the same hub. You don't need direct connections between every VPC pair. With 4 VPCs, you only need 4 attachments (one per VPC) instead of 6 peering connections for full mesh.

On-Premises Connectivity (Red, Bottom Left): The diagram shows a corporate data center (192.168.0.0/16) connecting to AWS through two methods: VPN Connection and Direct Connect. Both terminate at the Transit Gateway. The VPN provides encrypted connectivity over the internet (good for backup or low-bandwidth needs), while Direct Connect provides a dedicated, high-bandwidth connection (good for primary connectivity). Having both provides redundancy - if Direct Connect fails, traffic automatically fails over to VPN.

Hybrid Cloud Routing: Here's where Transit Gateway really shines. The on-premises data center can reach ALL four VPCs through a single connection to the Transit Gateway. Without Transit Gateway, you'd need separate VPN or Direct Connect connections to each VPC, or complex routing through a "transit VPC." Transit Gateway simplifies this dramatically.

Inter-Region Connectivity (Blue, Bottom Right): The dotted line shows Transit Gateway Peering to another Transit Gateway in eu-west-1. This allows VPCs in us-east-1 to communicate with VPCs in eu-west-1 through the Transit Gateway hub. This is useful for global applications, disaster recovery, or multi-region architectures. Transit Gateway Peering uses AWS's global backbone network, not the public internet.

Traffic Flow Example: Let's trace a request from VPC 1 (Production) to the on-premises data center:

  1. Application in VPC 1 sends packet to 192.168.1.100 (on-premises server)
  2. VPC 1 route table has route: 192.168.0.0/16 → Transit Gateway attachment
  3. Packet arrives at Transit Gateway
  4. Transit Gateway route table has route: 192.168.0.0/16 → VPN/Direct Connect attachment
  5. Packet is forwarded to on-premises via Direct Connect (primary) or VPN (backup)
  6. On-premises server receives packet
  7. Response follows reverse path back to VPC 1
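Each hop in the trace above is a longest-prefix-match lookup against a route table. Here is a minimal Python sketch of that decision, using the CIDRs and next hops from the example (the `lookup` helper and attachment names are illustrative, not an AWS API):

```python
import ipaddress

def lookup(route_table, dest_ip):
    """Return the next hop for dest_ip using longest-prefix match."""
    matches = [(net, hop) for net, hop in route_table
               if ipaddress.ip_address(dest_ip) in ipaddress.ip_network(net)]
    if not matches:
        return None  # no route: the packet is dropped
    # The most specific (longest) prefix wins
    return max(matches, key=lambda m: ipaddress.ip_network(m[0]).prefixlen)[1]

# VPC 1's route table: local traffic stays in the VPC, on-premises goes to the TGW
vpc1_routes = [("10.0.0.0/16", "local"),
               ("192.168.0.0/16", "tgw-attachment")]

# Transit Gateway route table: on-premises prefixes point at the DX/VPN attachment
tgw_routes = [("10.0.0.0/16", "vpc1-attachment"),
              ("192.168.0.0/16", "dx-attachment")]

hop1 = lookup(vpc1_routes, "192.168.1.100")  # "tgw-attachment"
hop2 = lookup(tgw_routes, "192.168.1.100")   # "dx-attachment"
print(hop1, hop2)
```

The same two lookups happen in reverse for the response, which is why both the VPC route table and the Transit Gateway route table must have matching routes.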

Centralized Management: All routing decisions happen at the Transit Gateway. You can implement network segmentation by using multiple Transit Gateway route tables. For example, you might have a "Production" route table that allows Production VPC to reach Shared Services and on-premises, but NOT Development. Development gets its own route table that allows it to reach Shared Services but NOT Production or on-premises. This provides security isolation while maintaining connectivity where needed.

Scalability: This architecture scales easily. Need to add VPC 5, 6, 7? Just attach them to the Transit Gateway and update route tables. Need to add a second data center? Attach it to the Transit Gateway. Need to connect to a partner network? Attach it. The hub-and-spoke model scales to thousands of attachments.

Detailed Example 1: Enterprise Multi-VPC Architecture

Scenario: Large enterprise with 25 VPCs across different business units, plus 3 on-premises data centers. They need full mesh connectivity between all VPCs and hybrid connectivity to all data centers.

Without Transit Gateway (VPC Peering approach):

  • VPC-to-VPC connections needed: 25 * 24 / 2 = 300 peering connections
  • Each VPC needs VPN/Direct Connect to each data center: 25 * 3 = 75 connections
  • Total connections to manage: 375
  • Route table entries per VPC: 24 (for other VPCs) + 3 (for data centers) = 27
  • Operational complexity: Extremely high
  • Cost: High (many VPN connections)

With Transit Gateway:

  • VPC attachments: 25 (one per VPC)
  • On-premises attachments: 3 (one per data center)
  • Total attachments: 28
  • Route table entries per VPC: 1 (everything goes to Transit Gateway)
  • Transit Gateway handles all routing
  • Operational complexity: Low
  • Cost: Lower (fewer connections, centralized management)
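The connection-count arithmetic above generalizes: a full mesh grows quadratically (n(n-1)/2 peering connections), while hub-and-spoke grows linearly (one attachment per network). A quick Python check of the numbers in this example:

```python
def full_mesh_connections(vpcs, data_centers):
    """VPC Peering approach: every VPC pair, plus every VPC-to-DC link."""
    return vpcs * (vpcs - 1) // 2 + vpcs * data_centers

def hub_and_spoke_attachments(vpcs, data_centers):
    """Transit Gateway approach: one attachment per VPC and per data center."""
    return vpcs + data_centers

mesh = full_mesh_connections(25, 3)      # 300 + 75 = 375
hub = hub_and_spoke_attachments(25, 3)   # 25 + 3 = 28
print(mesh, hub, f"{1 - hub / mesh:.0%} reduction")
```

Try changing `vpcs` to 50: the mesh count roughly quadruples while the attachment count only doubles, which is why Transit Gateway wins at scale.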

Implementation:

  1. Create Transit Gateway in us-east-1
  2. Attach all 25 VPCs to Transit Gateway
  3. Attach 3 data centers via Direct Connect or VPN
  4. Configure Transit Gateway route table:
    • Routes for all 25 VPC CIDRs
    • Routes for all 3 data center CIDRs
    • Enable route propagation for automatic route updates
  5. Update VPC route tables:
    • Single route: 0.0.0.0/0 → Transit Gateway (or specific routes for other VPCs and on-premises)
  6. Test connectivity: Any VPC can reach any other VPC and any data center

Benefits Realized:

  • 375 connections reduced to 28 attachments (93% reduction)
  • Centralized routing and monitoring
  • Easy to add new VPCs or data centers
  • Consistent security policies
  • Simplified troubleshooting

Detailed Example 2: Network Segmentation with Multiple Route Tables

Scenario: Organization needs to isolate Production, Development, and Shared Services environments while allowing specific connectivity patterns.

Requirements:

  • Production VPCs can reach Shared Services and on-premises
  • Development VPCs can reach Shared Services only (NOT Production or on-premises)
  • Shared Services can reach everything
  • On-premises can reach Production and Shared Services (NOT Development)

Setup:

  • Transit Gateway with 3 route tables: Production-RT, Development-RT, SharedServices-RT
  • 5 Production VPCs
  • 3 Development VPCs
  • 1 Shared Services VPC
  • 1 On-premises attachment

Route Table Configuration:

Production-RT (associated with Production VPC attachments):

  • Routes to: Other Production VPCs, Shared Services VPC, On-premises
  • Does NOT route to: Development VPCs

Development-RT (associated with Development VPC attachments):

  • Routes to: Other Development VPCs, Shared Services VPC
  • Does NOT route to: Production VPCs, On-premises

SharedServices-RT (associated with Shared Services VPC attachment):

  • Routes to: All Production VPCs, All Development VPCs, On-premises
  • Can reach everything (provides shared services to all)

On-Premises-RT (associated with on-premises attachment):

  • Routes to: All Production VPCs, Shared Services VPC
  • Does NOT route to: Development VPCs

Traffic Flow Examples:

  1. Production VPC → Shared Services: ✅ Allowed

    • Production-RT has route to Shared Services
    • Traffic flows through Transit Gateway
  2. Production VPC → Development VPC: ❌ Blocked

    • Production-RT has no route to Development VPCs
    • Traffic is dropped at Transit Gateway
  3. Development VPC → On-premises: ❌ Blocked

    • Development-RT has no route to on-premises
    • Prevents developers from accessing production data
  4. Shared Services → Production VPC: ✅ Allowed

    • SharedServices-RT has routes to all Production VPCs
    • Monitoring and logging services can reach production

Security Benefits:

  • Network-level isolation between environments
  • Prevents accidental or malicious access from Development to Production
  • Centralized enforcement of connectivity policies
  • Audit trail of all routing decisions
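The segmentation logic above reduces to a simple rule: a flow is allowed only if the route table associated with the source attachment contains a route to the destination. A toy model in Python (the attachment and route-table names mirror this example and are illustrative):

```python
# Which Transit Gateway route table each attachment is associated with
association = {"prod-vpc": "Production-RT",
               "dev-vpc": "Development-RT",
               "shared-vpc": "SharedServices-RT",
               "onprem": "OnPremises-RT"}

# Destinations each route table has routes to
routes = {"Production-RT":     {"prod-vpc", "shared-vpc", "onprem"},
          "Development-RT":    {"dev-vpc", "shared-vpc"},
          "SharedServices-RT": {"prod-vpc", "dev-vpc", "onprem"},
          "OnPremises-RT":     {"prod-vpc", "shared-vpc"}}

def allowed(src, dst):
    """Traffic flows only if the source's route table has a route to dst."""
    return dst in routes[association[src]]

print(allowed("prod-vpc", "shared-vpc"))  # True  - allowed
print(allowed("prod-vpc", "dev-vpc"))     # False - dropped at the TGW
print(allowed("dev-vpc", "onprem"))       # False - devs can't reach on-prem
```

Note the asymmetry is possible: what Development can reach is controlled entirely by Development-RT, independent of what can reach Development.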

Detailed Example 3: Multi-Region Architecture with Transit Gateway Peering

Scenario: Global application with primary region in us-east-1 and disaster recovery region in eu-west-1. Need connectivity between regions for data replication and failover.

Setup:

  • Transit Gateway in us-east-1 (TGW-US)
  • Transit Gateway in eu-west-1 (TGW-EU)
  • 10 VPCs in us-east-1
  • 10 VPCs in eu-west-1 (DR replicas)
  • Transit Gateway Peering between TGW-US and TGW-EU

Implementation:

  1. Create Transit Gateways in both regions

  2. Attach VPCs to their regional Transit Gateway

  3. Create Transit Gateway Peering:

    • From TGW-US, create peering request to TGW-EU
    • Accept peering in TGW-EU
    • Peering connection established
  4. Configure Routing:

    • TGW-US route table: Add routes for eu-west-1 VPC CIDRs → TGW Peering
    • TGW-EU route table: Add routes for us-east-1 VPC CIDRs → TGW Peering
  5. Cross-Region Traffic:

    • VPC in us-east-1 can reach VPC in eu-west-1
    • Traffic flows: VPC → TGW-US → TGW Peering → TGW-EU → VPC
    • Uses AWS global backbone (not internet)
    • Latency: ~80-100ms (us-east-1 to eu-west-1)

Use Cases:

  • Database replication between regions
  • Disaster recovery failover
  • Global application deployment
  • Data analytics across regions

Cost Considerations:

  • Transit Gateway Peering: $0.05/GB for data transfer
  • Lower than internet-based transfer
  • Higher than same-region Transit Gateway ($0.02/GB)
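A back-of-envelope comparison using the per-GB rates quoted above (the rates are this guide's list prices; always verify current AWS pricing for your regions):

```python
SAME_REGION_PER_GB = 0.02   # TGW data processing, same region (assumed list price)
CROSS_REGION_PER_GB = 0.05  # TGW peering data transfer (rate from this guide)

gb_per_month = 2000  # e.g., ~2 TB of replication traffic per month

same_region = gb_per_month * SAME_REGION_PER_GB
cross_region = gb_per_month * CROSS_REGION_PER_GB
print(f"same-region: ${same_region:.2f}/month, cross-region: ${cross_region:.2f}/month")
```

For chatty cross-region replication, this per-GB difference dominates the bill, so batch and compress replication traffic where you can.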

⭐ Must Know (Critical Facts):

  • Transit Gateway supports transitive routing: Unlike VPC Peering, you can route through Transit Gateway to reach other networks. This is the key advantage.

  • Maximum 5,000 attachments per Transit Gateway: Scales to thousands of VPCs and connections. This quota is fixed and cannot be increased.

  • Supports multiple route tables: Use different route tables for network segmentation and security isolation.

  • Regional service: Each Transit Gateway is regional, but you can peer Transit Gateways across regions.

  • Bandwidth: Up to 50 Gbps per VPC attachment. VPN attachments are limited to 1.25 Gbps per tunnel; use ECMP across multiple tunnels for higher aggregate throughput.

  • Pricing: $0.05/hour per attachment + $0.02/GB data processed (same region). Cross-region peering: $0.05/GB.

When to use (Comprehensive):

  • ✅ Use when: You have many VPCs (10+) that need to communicate

    • Example: Enterprise with 50 VPCs across multiple business units
    • Reason: Scales better than VPC Peering, centralized management
  • ✅ Use when: You need transitive routing

    • Example: VPCs need to reach on-premises through a central hub
    • Reason: Transit Gateway supports transitive routing, VPC Peering doesn't
  • ✅ Use when: You need network segmentation with centralized control

    • Example: Isolate Production from Development while allowing Shared Services access
    • Reason: Multiple route tables provide flexible segmentation
  • ✅ Use when: You have hybrid cloud requirements

    • Example: Multiple VPCs need to reach on-premises data centers
    • Reason: Single connection point for all VPCs to on-premises
  • ✅ Use when: You need to connect multiple AWS accounts

    • Example: Organization with 20 AWS accounts, each with multiple VPCs
    • Reason: Transit Gateway can be shared across accounts using AWS Resource Access Manager
  • āŒ Don't use when: You only have 2-3 VPCs

    • Problem: Transit Gateway costs more than VPC Peering for small deployments
    • Better solution: Use VPC Peering for simplicity and lower cost
  • āŒ Don't use when: You need the absolute lowest latency

    • Problem: Transit Gateway adds a small hop (microseconds), VPC Peering is direct
    • Better solution: Use VPC Peering for latency-sensitive applications

Limitations & Constraints:

  • Regional service: Each Transit Gateway operates in one region (can peer across regions)
  • Maximum 5,000 attachments: Fixed quota per Transit Gateway (cannot be increased)
  • Maximum 50 Gbps per attachment: Bandwidth limit per VPC or VPN connection
  • Route table limit: 10,000 routes per route table
  • Peering: Transit Gateways can peer across regions and across accounts; peering attachments support static routes only (no dynamic route propagation)
  • No edge-to-edge routing: Other attachments cannot use a VPC's internet gateway directly; centralized internet egress requires a dedicated egress VPC whose NAT gateway translates the traffic

💡 Tips for Understanding:

  • Think of Transit Gateway as a "cloud router" - it routes traffic between all attached networks
  • Remember: Transit Gateway DOES support transitive routing (unlike VPC Peering)
  • For large deployments (10+ VPCs), Transit Gateway is almost always the right choice
  • Use multiple route tables for security segmentation
  • Transit Gateway Peering works across regions and across accounts, but peered route tables require static routes (no propagation)

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Forgetting to update VPC route tables after attaching to Transit Gateway

    • Why it's wrong: Attachment alone doesn't enable traffic - VPC route tables must point to Transit Gateway
    • Correct understanding: Three steps: (1) Create Transit Gateway, (2) Attach VPCs, (3) Update VPC route tables
  • Mistake 2: Assuming all attachments can communicate by default

    • Why it's wrong: Transit Gateway route tables control which attachments can reach which others
    • Correct understanding: Must configure Transit Gateway route tables to allow desired traffic flows
  • Mistake 3: Using Transit Gateway for only 2-3 VPCs

    • Why it's wrong: Transit Gateway costs more than VPC Peering for small deployments
    • Correct understanding: Transit Gateway is cost-effective at scale (10+ VPCs), use VPC Peering for small deployments

🔗 Connections to Other Topics:

  • Relates to VPC Peering (Task 1.1) because: Transit Gateway solves VPC Peering's scalability and transitive routing limitations
  • Builds on VPC Fundamentals (Chapter 0) by: Providing centralized routing for multiple VPCs
  • Often used with Direct Connect (Task 1.1) to: Provide hybrid connectivity to on-premises
  • Integrates with AWS Resource Access Manager (Task 1.4) to: Share Transit Gateway across AWS accounts

Troubleshooting Common Issues:

  • Issue 1: Transit Gateway attached but traffic not flowing

    • Solution: Check VPC route tables - must have routes pointing to Transit Gateway attachment
    • Solution: Check Transit Gateway route tables - must have routes for destination networks
    • Solution: Check security groups and NACLs - must allow traffic
  • Issue 2: Some VPCs can communicate, others cannot

    • Solution: Check Transit Gateway route table associations - ensure VPCs are associated with correct route table
    • Solution: Check route propagation settings - may need to enable for automatic route updates
    • Solution: Verify CIDR blocks don't overlap - overlapping CIDRs cause routing issues
  • Issue 3: High data transfer costs

    • Solution: Review traffic patterns - ensure traffic that should stay local isn't going through Transit Gateway
    • Solution: Consider VPC Peering for high-volume, low-latency connections between specific VPCs
    • Solution: Use VPC endpoints for AWS services to avoid Transit Gateway data processing charges

AWS PrivateLink

What it is: AWS PrivateLink provides private connectivity between VPCs, AWS services, and on-premises networks without exposing traffic to the public internet. It enables you to access services as if they were in your own VPC.

Why it exists: Sometimes you don't need full network-level connectivity (like VPC Peering or Transit Gateway provides). You just need to access a specific service - maybe an API, a database endpoint, or an AWS service. PrivateLink provides service-level connectivity without opening up entire networks to each other.

Real-world analogy: Think of PrivateLink like a private phone line between two offices. Instead of connecting the entire office networks together (VPC Peering), you just have a dedicated line for specific communication. The rest of the networks remain isolated.

How it works (Detailed step-by-step):

  1. Service provider creates an endpoint service (also called VPC Endpoint Service). This exposes their application or service through a Network Load Balancer.

  2. Service consumer creates a VPC endpoint (also called Interface Endpoint) in their VPC. This creates an elastic network interface (ENI) with a private IP address.

  3. AWS establishes a private connection between the consumer's VPC endpoint and the provider's endpoint service. This connection uses AWS's private network, not the internet.

  4. Consumer accesses the service using the private IP address of the VPC endpoint. Traffic never leaves AWS's network.

  5. Provider's service receives requests through the Network Load Balancer, processes them, and sends responses back through the same private connection.

  6. No route table changes needed in most cases. The VPC endpoint appears as a local resource in the consumer's VPC.

Detailed Example 1: SaaS Application Access via PrivateLink

Scenario: Your company uses a third-party SaaS application for customer analytics. The SaaS provider offers PrivateLink connectivity. You want to access their API from your VPCs without going over the internet.

Setup:

  • Your VPC: 10.0.0.0/16 (us-east-1)
  • SaaS Provider's Endpoint Service: com.amazonaws.vpce.us-east-1.vpce-svc-123456
  • Your Application Servers: 10.0.10.0/24 subnet

Implementation:

  1. SaaS Provider Setup (already done by provider):

    • Provider deploys their API behind a Network Load Balancer
    • Provider creates VPC Endpoint Service
    • Provider shares service name with you
  2. Your Setup:

    • Create Interface VPC Endpoint in your VPC
    • Specify provider's service name
    • Select subnets where endpoint should be created (10.0.10.0/24)
    • Select security group (allow HTTPS from your app servers)
  3. AWS Creates ENI:

    • Elastic Network Interface created in your subnet
    • Private IP assigned: 10.0.10.100
    • DNS name created: vpce-abc123.execute-api.us-east-1.vpce.amazonaws.com
  4. Your Application Accesses Service:

    • App server makes HTTPS request to vpce-abc123.execute-api.us-east-1.vpce.amazonaws.com
    • DNS resolves to 10.0.10.100 (private IP in your VPC)
    • Traffic goes to VPC endpoint ENI
    • AWS routes traffic privately to provider's service
    • Provider's API processes request and responds
    • Response comes back through same private path

Benefits:

  • No internet exposure (traffic stays on AWS network)
  • Lower latency (direct private connection)
  • Better security (no public IPs needed)
  • Simplified network architecture (no NAT Gateway needed for this traffic)
  • Provider can't see your VPC structure (only sees requests)

Detailed Example 2: Accessing AWS Services via PrivateLink

Scenario: You have EC2 instances in private subnets that need to access Amazon S3 and DynamoDB. You want to avoid sending traffic through NAT Gateway (costs money and adds latency).

Setup:

  • VPC: 10.0.0.0/16
  • Private Subnet: 10.0.10.0/24 (no internet route)
  • EC2 Instances: Need to access S3 and DynamoDB

Implementation:

  1. Create Gateway VPC Endpoint for S3:

    • Type: Gateway Endpoint (free, no ENI)
    • Service: com.amazonaws.us-east-1.s3
    • Route table: Automatically adds route for S3 prefix list → VPC endpoint
  2. Create Gateway VPC Endpoint for DynamoDB:

    • Type: Gateway Endpoint (free, no ENI)
    • Service: com.amazonaws.us-east-1.dynamodb
    • Route table: Automatically adds route for DynamoDB prefix list → VPC endpoint
  3. EC2 Instances Access Services:

    • Instance makes request to S3: aws s3 ls s3://my-bucket
    • Route table directs S3 traffic to Gateway Endpoint
    • Traffic goes directly to S3 via AWS private network
    • No NAT Gateway, no internet gateway, no public IPs
    • Same for DynamoDB requests

Cost Savings:

  • Without VPC Endpoints: Traffic goes through NAT Gateway
    • NAT Gateway: $0.045/hour + $0.045/GB processed
    • For 1TB/month: ~$33 (730 hours) + $45 (data) ≈ $78/month
  • With VPC Endpoints: Gateway Endpoints are free
    • Cost: $0/month
    • Savings: ~$78/month per NAT Gateway
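The comparison above can be reproduced with simple arithmetic, using the rates quoted in this section ($0.045/hour and $0.045/GB, assuming a 730-hour month; verify current pricing for your region):

```python
NAT_HOURLY = 0.045       # $/hour per NAT Gateway (assumed list price)
NAT_PER_GB = 0.045       # $/GB processed by the NAT Gateway
HOURS_PER_MONTH = 730

def nat_monthly_cost(gb_processed):
    """Monthly NAT Gateway cost: fixed hourly charge plus per-GB processing."""
    return NAT_HOURLY * HOURS_PER_MONTH + NAT_PER_GB * gb_processed

def gateway_endpoint_monthly_cost(gb_processed):
    """Gateway Endpoints for S3/DynamoDB have no hourly or per-GB charge."""
    return 0.0

cost = nat_monthly_cost(1000)  # 1 TB/month through the NAT Gateway
print(f"NAT Gateway: ${cost:.2f}/month, Gateway Endpoint: $0.00/month")
```

Note the hourly charge applies even at zero traffic, so the savings exist regardless of volume.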

Performance Benefits:

  • Lower latency (direct connection, no NAT hop)
  • Higher throughput (no NAT Gateway bandwidth limits)
  • More reliable (no NAT Gateway as potential failure point)

Detailed Example 3: Multi-Account Service Sharing

Scenario: You have a central shared services account with a REST API that multiple application accounts need to access. You want to provide private access without VPC Peering.

Setup:

  • Shared Services Account: API behind Network Load Balancer
  • Application Account 1: VPC 10.1.0.0/16
  • Application Account 2: VPC 10.2.0.0/16
  • Application Account 3: VPC 10.3.0.0/16

Implementation:

  1. Shared Services Account (Service Provider):

    • Deploy API application
    • Create Network Load Balancer in front of API
    • Create VPC Endpoint Service
    • Configure acceptance: Require acceptance for connections (security)
    • Whitelist: Add Application Account IDs to allowed principals
  2. Application Accounts (Service Consumers):

    • Each account creates Interface VPC Endpoint
    • Specify Shared Services' endpoint service name
    • Request connection
  3. Shared Services Accepts Connections:

    • Review connection requests
    • Accept requests from known accounts
    • Reject unknown or unauthorized requests
  4. Applications Access API:

    • Each application uses VPC endpoint DNS name
    • Traffic flows privately through PrivateLink
    • No VPC Peering needed
    • No Transit Gateway needed
    • Each account's network remains isolated

Security Benefits:

  • Network isolation maintained (no full VPC connectivity)
  • Service provider controls who can connect
  • Service consumer controls which subnets have access
  • All traffic encrypted in transit
  • No internet exposure
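The acceptance step above is essentially an allow-list check on the requesting AWS account. A toy model of the provider's side (the class, account IDs, and statuses are illustrative, not the AWS API):

```python
class EndpointService:
    """Models a VPC Endpoint Service that requires connection acceptance."""

    def __init__(self, allowed_principals):
        self.allowed = set(allowed_principals)
        self.connections = []

    def request_connection(self, account_id):
        """Accept requests from allow-listed accounts, reject everything else."""
        status = "accepted" if account_id in self.allowed else "rejected"
        self.connections.append((account_id, status))
        return status

svc = EndpointService(allowed_principals={"111111111111", "222222222222"})
print(svc.request_connection("111111111111"))  # accepted - on the allow list
print(svc.request_connection("999999999999"))  # rejected - unknown account
```

In practice the allow list is the endpoint service's "allowed principals" configuration, and manual acceptance adds a second human check on top of it.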

⭐ Must Know (Critical Facts):

  • Two types of VPC Endpoints: Gateway Endpoints (S3, DynamoDB, free) and Interface Endpoints (all other services, charged)

  • Gateway Endpoints are free: No hourly charge, no data processing charge. Always use for S3 and DynamoDB.

  • Interface Endpoints cost money: $0.01/hour per AZ + $0.01/GB processed. Consider cost vs NAT Gateway.

  • PrivateLink uses ENIs: Interface Endpoints create elastic network interfaces in your subnets with private IPs.

  • DNS resolution: VPC endpoints have DNS names that resolve to private IPs in your VPC.

  • Security groups apply: Interface Endpoints have security groups that control access.

When to use (Comprehensive):

  • ✅ Use when: You need to access AWS services from private subnets

    • Example: EC2 instances accessing S3, DynamoDB, or other AWS services
    • Reason: Avoid NAT Gateway costs, improve security and performance
  • ✅ Use when: You need service-level connectivity, not network-level

    • Example: Accessing a specific API or service in another VPC
    • Reason: More secure than VPC Peering (doesn't expose entire network)
  • ✅ Use when: You're providing a service to multiple customers/accounts

    • Example: SaaS provider offering private connectivity to customers
    • Reason: Scalable, secure, doesn't require VPC Peering with each customer
  • ✅ Use when: You need to access third-party SaaS applications privately

    • Example: Accessing Salesforce, Datadog, or other SaaS via PrivateLink
    • Reason: Better security and performance than internet-based access
  • āŒ Don't use when: You need full network connectivity between VPCs

    • Problem: PrivateLink is service-level, not network-level
    • Better solution: Use VPC Peering or Transit Gateway
  • āŒ Don't use when: Cost is primary concern and NAT Gateway is cheaper

    • Problem: Interface Endpoints cost $0.01/hour per AZ + data processing
    • Better solution: Calculate costs - for low traffic, NAT Gateway might be cheaper

Limitations & Constraints:

  • Interface Endpoints: Maximum 255 per VPC (soft limit)
  • Gateway Endpoints: Only for S3 and DynamoDB
  • Regional service: VPC endpoints are regional, can't access services in other regions
  • DNS resolution: Must enable DNS resolution in VPC for endpoint DNS names to work
  • Security groups: Interface Endpoints require security group configuration
  • Endpoint policies: Can restrict which resources/actions are accessible through endpoint

💡 Tips for Understanding:

  • Gateway Endpoints (S3, DynamoDB) = Free, use route tables
  • Interface Endpoints (everything else) = Paid, use ENIs
  • PrivateLink = Service-level connectivity (not network-level)
  • Always use VPC endpoints for S3/DynamoDB from private subnets (free and faster)

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Thinking PrivateLink provides full network connectivity

    • Why it's wrong: PrivateLink is service-level, not network-level
    • Correct understanding: Use PrivateLink for specific services, VPC Peering/Transit Gateway for full network connectivity
  • Mistake 2: Not using Gateway Endpoints for S3 and DynamoDB

    • Why it's wrong: Wastes money on NAT Gateway for traffic that could be free
    • Correct understanding: Always create Gateway Endpoints for S3 and DynamoDB in VPCs with private subnets
  • Mistake 3: Forgetting to configure security groups for Interface Endpoints

    • Why it's wrong: Traffic will be blocked even though endpoint is created
    • Correct understanding: Interface Endpoints need security groups that allow traffic from your resources

🔗 Connections to Other Topics:

  • Relates to VPC Peering (Task 1.1) because: PrivateLink provides service-level alternative to network-level peering
  • Builds on VPC Fundamentals (Chapter 0) by: Extending VPC connectivity to services without internet exposure
  • Often used with Multi-Account Strategy (Task 1.4) to: Share services across accounts securely

AWS Direct Connect

What it is: AWS Direct Connect is a dedicated network connection from your on-premises data center to AWS. It provides a private, high-bandwidth, low-latency connection that doesn't use the public internet.

Why it exists: Internet connections are unpredictable - latency varies, bandwidth is shared, and security is a concern. For enterprises with significant AWS usage or strict requirements, Direct Connect provides consistent network performance and enhanced security.

Real-world analogy: Think of Direct Connect like having a private highway between your office and AWS. Instead of driving on public roads with traffic (internet), you have a dedicated lane that's always fast and reliable.

How it works (Detailed):

  1. You order a Direct Connect connection through AWS Console. You choose a Direct Connect location (AWS facility) and connection speed (1 Gbps, 10 Gbps, or 100 Gbps for dedicated connections; lower speeds are available through hosted connections).

  2. AWS provisions a port at the Direct Connect location. This is a physical network port in an AWS-managed facility.

  3. You establish physical connectivity from your data center to the Direct Connect location. This is typically done through a telecommunications provider (cross-connect).

  4. You create a Virtual Interface (VIF) on the Direct Connect connection. VIFs are logical connections that carry traffic:

    • Private VIF: Connects to VPCs via Virtual Private Gateway or Transit Gateway
    • Public VIF: Connects to AWS public services (S3, DynamoDB, etc.)
    • Transit VIF: Connects to Transit Gateway for multi-VPC access
  5. You configure BGP (Border Gateway Protocol) to exchange routes between your network and AWS.

  6. Traffic flows over the dedicated connection. Your on-premises resources can access AWS resources with consistent performance.

Detailed Example: Enterprise Hybrid Cloud with Direct Connect

Scenario: Large enterprise with on-premises data center needs to migrate 500TB of data to AWS and maintain ongoing hybrid connectivity for applications.

Requirements:

  • High bandwidth (10 Gbps)
  • Low latency (<10ms)
  • Consistent performance
  • Access to multiple VPCs
  • Redundancy for high availability

Implementation:

  1. Order Direct Connect:

    • Location: Equinix DC2 (near your data center)
    • Speed: 10 Gbps
    • AWS provisions port
  2. Establish Physical Connection:

    • Contract with telecom provider for cross-connect
    • Provider runs fiber from your data center to Equinix DC2
    • Provider connects to AWS port
    • Physical layer established
  3. Create Transit VIF:

    • Connect Direct Connect to Transit Gateway
    • Configure BGP: Advertise on-premises routes (192.168.0.0/16) to AWS
    • AWS advertises VPC routes (10.0.0.0/8) to on-premises
    • BGP session established
  4. Configure Transit Gateway:

    • Attach Direct Connect via Transit VIF
    • Attach 20 VPCs to Transit Gateway
    • Configure route tables for on-premises access
  5. Test Connectivity:

    • From on-premises server, ping VPC resources
    • Latency: 5-8ms (excellent)
    • Bandwidth: 10 Gbps available
    • Jitter: <1ms (very stable)
  6. Begin Data Migration:

    • Transfer 500TB over Direct Connect
    • Speed: ~10 Gbps sustained
    • Time: ~5 days (vs months over internet)
    • No internet bandwidth costs

Benefits Realized:

  • Consistent performance (no internet variability)
  • Lower latency (direct connection)
  • Cost savings (no data transfer out charges for Direct Connect)
  • Enhanced security (private connection)
  • Reliable for production workloads
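The "~5 days for 500TB" figure in the migration step can be checked with simple arithmetic: convert bytes to bits and divide by the line rate. This is the ideal lower bound and ignores protocol overhead, so real transfers take somewhat longer:

```python
def transfer_days(terabytes, gbps):
    """Ideal transfer time for a given data size and sustained line rate."""
    bits = terabytes * 1e12 * 8       # decimal terabytes -> bits
    seconds = bits / (gbps * 1e9)     # line rate in bits per second
    return seconds / 86400            # seconds -> days

print(f"{transfer_days(500, 10):.1f} days")  # ~4.6 days at 10 Gbps
print(f"{transfer_days(500, 1):.0f} days")   # ~46 days at 1 Gbps
```

Run the same numbers at typical internet upload speeds (say 0.1 Gbps sustained) and the answer stretches past a year, which is the motivation for Direct Connect (or Snowball-style offline transfer) for bulk migrations.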

Detailed Example: Direct Connect with Failover to VPN

Scenario: Company needs high availability for hybrid connectivity. Primary connection via Direct Connect, backup via VPN.

Setup:

  • Primary: Direct Connect (10 Gbps)
  • Backup: Site-to-Site VPN (1.25 Gbps max)
  • On-premises: 192.168.0.0/16
  • AWS VPCs: 10.0.0.0/8 (via Transit Gateway)

Implementation:

  1. Configure Direct Connect (Primary):

    • Create Direct Connect connection
    • Create Transit VIF to Transit Gateway
    • BGP: Advertise routes with AS path prepending for preference
  2. Configure VPN (Backup):

    • Create Site-to-Site VPN to Transit Gateway
    • BGP: Advertise same routes with longer AS path (lower preference)
    • VPN tunnels established over internet
  3. BGP Configuration:

    • Direct Connect routes: AS path length 1 (preferred)
    • VPN routes: AS path length 3 (backup)
    • AWS prefers shorter AS path (Direct Connect)
  4. Normal Operation:

    • All traffic flows over Direct Connect
    • VPN tunnels stay up but carry no traffic
    • Monitoring shows Direct Connect active
  5. Failover Scenario:

    • Direct Connect fails (fiber cut, equipment failure)
    • BGP detects failure (30-90 seconds)
    • AWS removes Direct Connect routes
    • VPN routes become active
    • Traffic automatically fails over to VPN
    • Downtime: 30-90 seconds (BGP convergence time)
  6. Recovery:

    • Direct Connect restored
    • BGP re-establishes
    • Direct Connect routes preferred again
    • Traffic fails back to Direct Connect

High Availability Achieved:

  • Primary path: Direct Connect (high bandwidth, low latency)
  • Backup path: VPN (lower bandwidth, higher latency, but available)
  • Automatic failover (no manual intervention)
  • Minimal downtime during failures
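The failover behavior above boils down to BGP preferring the shortest AS path among routes that are still advertised. A simplified model (real BGP best-path selection compares many more attributes; the route dictionaries here are illustrative):

```python
def best_path(candidate_routes):
    """Pick the available route with the shortest AS path, as BGP would."""
    up = [r for r in candidate_routes if r["up"]]
    if not up:
        return None  # no path at all: destination unreachable
    return min(up, key=lambda r: r["as_path_len"])["via"]

routes = [
    {"via": "direct-connect", "as_path_len": 1, "up": True},  # preferred path
    {"via": "vpn",            "as_path_len": 3, "up": True},  # prepended backup
]

print(best_path(routes))   # direct-connect carries traffic in normal operation

routes[0]["up"] = False    # fiber cut: BGP withdraws the Direct Connect routes
print(best_path(routes))   # traffic automatically fails over to vpn
```

AS path prepending (advertising the VPN routes with an artificially longer path) is what makes the VPN a standby rather than an equal-cost path.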

⭐ Must Know:

  • Direct Connect is NOT encrypted by default: You must use a VPN over Direct Connect, MACsec (on supported dedicated connections), or application-level encryption for security
  • Speeds: 1 Gbps, 10 Gbps, or 100 Gbps dedicated; 50 Mbps-10 Gbps via hosted connections
  • Pricing: Port hours + data transfer out (data transfer in is free)
  • Setup time: 1-4 weeks (physical provisioning takes time)
  • BGP required: Must configure BGP for routing

When to use:

  • ✅ High bandwidth requirements (>1 Gbps sustained)
  • ✅ Consistent performance needed (latency, jitter)
  • ✅ Large data transfers (hundreds of TB)
  • ✅ Hybrid applications with frequent AWS communication
  • ❌ Don't use for temporary or low-bandwidth needs (use VPN instead)

AWS Site-to-Site VPN

What it is: AWS Site-to-Site VPN creates an encrypted connection between your on-premises network and AWS over the internet. It's a quick, cost-effective way to establish hybrid connectivity.

Why it exists: Not every organization needs Direct Connect's bandwidth and cost. VPN provides secure hybrid connectivity using existing internet connections, with fast setup and lower cost.

Real-world analogy: VPN is like using a secure tunnel through public roads. You're still using the internet (public roads), but your traffic is encrypted and protected (tunnel).

How it works:

  1. You create a Customer Gateway in AWS, representing your on-premises VPN device
  2. You create a Virtual Private Gateway (VGW) and attach it to your VPC, or use Transit Gateway
  3. You create a Site-to-Site VPN connection between Customer Gateway and VGW/Transit Gateway
  4. AWS provisions two VPN tunnels for redundancy (active/active or active/passive)
  5. You configure your on-premises VPN device with tunnel parameters
  6. IPsec tunnels establish over the internet
  7. Traffic flows encrypted through the tunnels

Detailed Example: Quick Hybrid Connectivity with VPN

Scenario: Startup needs to connect on-premises office to AWS quickly for development and testing.

Requirements:

  • Fast setup (days, not weeks)
  • Low cost
  • Secure connectivity
  • Moderate bandwidth (100-200 Mbps)

Implementation:

  1. On-Premises Setup:

    • VPN device: Cisco ASA, pfSense, or AWS-compatible router
    • Public IP: 203.0.113.10
    • Internal network: 192.168.0.0/16
  2. AWS Setup:

    • Create Customer Gateway: IP 203.0.113.10
    • Create Virtual Private Gateway, attach to VPC
    • Create Site-to-Site VPN connection
    • Download configuration file for your VPN device
  3. Configure VPN Device:

    • Import AWS configuration
    • Configure IPsec parameters (encryption, authentication)
    • Configure BGP or static routes
    • Bring up tunnels
  4. Verify Connectivity:

    • Tunnel 1: UP (primary)
    • Tunnel 2: UP (backup)
    • Ping test: On-premises to VPC successful
    • Latency: 20-50ms (depends on internet)

Cost:

  • VPN connection: $0.05/hour = $36/month
  • Data transfer: $0.09/GB out
  • Total for 100GB/month: $36 + $9 = $45/month
  • Much cheaper than Direct Connect ($300+/month)
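The cost arithmetic above can be checked with a short helper. The rates are the example's figures (roughly us-east-1 list prices); actual rates vary by region, so treat them as parameters rather than constants.

```python
def monthly_vpn_cost(hours=720, gb_out=100,
                     hourly_rate=0.05, per_gb_out=0.09):
    """Estimate monthly Site-to-Site VPN cost: connection hours plus data
    transfer out. Rates are the example's figures; check current AWS pricing."""
    return hours * hourly_rate + gb_out * per_gb_out

print(round(monthly_vpn_cost(), 2))          # 45.0 -> $36 connection + $9 data out
print(round(monthly_vpn_cost(gb_out=0), 2))  # 36.0 -> connection charge alone
```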

Limitations:

  • Bandwidth: Max 1.25 Gbps per tunnel (internet dependent)
  • Latency: Variable (depends on internet path)
  • Reliability: Depends on internet connection quality

⭐ Must Know:

  • Two tunnels for redundancy: AWS always provisions two tunnels
  • Encrypted by default: IPsec encryption included
  • Maximum 1.25 Gbps per tunnel: Bandwidth limited
  • Pricing: $0.05/hour per connection + data transfer
  • Setup time: Minutes to hours (vs weeks for Direct Connect)

When to use:

  • āœ… Quick setup needed (days, not weeks)
  • āœ… Lower bandwidth requirements (<1 Gbps)
  • āœ… Cost-sensitive deployments
  • āœ… Backup for Direct Connect
  • āŒ Don't use for high-bandwidth, latency-sensitive applications (use Direct Connect)

šŸŽÆ Exam Focus: Questions often ask you to choose between Direct Connect and VPN based on requirements (bandwidth, latency, cost, setup time).

Task 1.1 Summary

Key Takeaways:

  1. VPC Peering: Direct connection between two VPCs, NOT transitive, use for 2-10 VPCs
  2. Transit Gateway: Central hub for many VPCs, supports transitive routing, use for 10+ VPCs
  3. PrivateLink: Service-level connectivity, use for accessing specific services privately
  4. Direct Connect: Dedicated connection, high bandwidth, consistent performance, use for enterprise hybrid cloud
  5. Site-to-Site VPN: Encrypted over internet, quick setup, lower cost, use for moderate bandwidth needs

Decision Framework:

  • 2-5 VPCs need to communicate → VPC Peering
  • 10+ VPCs need to communicate → Transit Gateway
  • Access a specific service in another VPC → PrivateLink
  • High bandwidth to on-premises (>1 Gbps) → Direct Connect
  • Quick hybrid connectivity (<1 Gbps) → Site-to-Site VPN
  • Access AWS services from private subnets → VPC Endpoints (PrivateLink)

Task 1.2: Prescribe Security Controls

This task covers implementing security controls at scale for enterprise AWS environments, including IAM, encryption, monitoring, and compliance.

Introduction

The problem: Enterprise security is complex. You have hundreds of AWS accounts, thousands of users, sensitive data across multiple services, compliance requirements (HIPAA, PCI-DSS, GDPR), and sophisticated threats. Traditional perimeter security doesn't work in the cloud. You need defense in depth, least privilege access, encryption everywhere, and continuous monitoring.

The solution: AWS provides comprehensive security services - IAM for access control, KMS for encryption, CloudTrail for auditing, Security Hub for centralized monitoring, GuardDuty for threat detection. The key is implementing these services correctly and at scale.

Why it's tested: Security is the #1 priority in AWS. Poor security leads to data breaches, compliance violations, and business disruption. As a Solutions Architect Professional, you must design secure architectures that protect data, control access, and meet compliance requirements.


Core Concepts

IAM (Identity and Access Management)

What it is: IAM controls who can access AWS resources (authentication) and what they can do (authorization). It's the foundation of AWS security.

Why it exists: Without access control, anyone could access your AWS resources. IAM ensures only authorized users and services can perform specific actions on specific resources.

Key Components:

  1. Users: Individual people with AWS Console or API access
  2. Groups: Collections of users with shared permissions
  3. Roles: Temporary credentials for services or federated users
  4. Policies: JSON documents defining permissions

How IAM Works (Detailed):

  1. Principal (user, role, or service) makes a request to AWS
  2. AWS authenticates the principal (verifies identity)
  3. AWS evaluates policies attached to the principal
  4. AWS checks resource policies (if any) on the target resource
  5. AWS applies permission boundaries (if configured)
  6. AWS makes decision: Allow or Deny
  7. If allowed, action is performed; if denied, error returned
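The decision logic in steps 3-7 can be sketched as a simplified evaluator. This ignores resource policies, permission boundaries, SCPs, and session policies for brevity, but it captures the two rules the exam tests constantly: explicit Deny always wins, and the default is an implicit Deny.

```python
from fnmatch import fnmatchcase  # IAM-style "*" wildcard matching

def evaluate(statements, action, resource):
    """Simplified IAM decision: explicit Deny wins, then explicit Allow,
    otherwise the implicit default is Deny."""
    decision = "Deny"  # implicit default
    for s in statements:
        matches = (any(fnmatchcase(action, a) for a in s["Action"]) and
                   any(fnmatchcase(resource, r) for r in s["Resource"]))
        if not matches:
            continue
        if s["Effect"] == "Deny":
            return "Deny"  # explicit deny overrides any allow
        decision = "Allow"
    return decision

policy = [
    {"Effect": "Allow", "Action": ["s3:*"],
     "Resource": ["arn:aws:s3:::app-bucket/*"]},
    {"Effect": "Deny", "Action": ["s3:DeleteObject"], "Resource": ["*"]},
]

print(evaluate(policy, "s3:GetObject", "arn:aws:s3:::app-bucket/report.csv"))    # Allow
print(evaluate(policy, "s3:DeleteObject", "arn:aws:s3:::app-bucket/report.csv")) # Deny
print(evaluate(policy, "ec2:RunInstances", "*"))                                 # Deny (implicit)
```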

Detailed Example: Least Privilege IAM Policy

Scenario: Developer needs to deploy Lambda functions and read CloudWatch logs, but shouldn't access production databases or delete resources.

Bad Policy (Too Permissive):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "*",
    "Resource": "*"
  }]
}

Problem: Grants all permissions on all resources. Developer could delete production databases, modify IAM policies, or access sensitive data.

Good Policy (Least Privilege):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LambdaDeployment",
      "Effect": "Allow",
      "Action": [
        "lambda:CreateFunction",
        "lambda:UpdateFunctionCode",
        "lambda:UpdateFunctionConfiguration",
        "lambda:GetFunction",
        "lambda:ListFunctions"
      ],
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:dev-*"
    },
    {
      "Sid": "CloudWatchLogsRead",
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
        "logs:GetLogEvents",
        "logs:FilterLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/dev-*"
    },
    {
      "Sid": "IAMPassRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/lambda-dev-execution-role"
    }
  ]
}

Benefits:

  • Only Lambda functions starting with "dev-" can be modified
  • Only CloudWatch logs for dev Lambda functions can be read
  • Can only pass specific execution role to Lambda
  • Cannot delete resources, access production, or modify IAM

⭐ Must Know:

  • Explicit Deny always wins: If any policy denies an action, it's denied regardless of allows
  • Default is Deny: If no policy explicitly allows an action, it's denied
  • Least Privilege: Grant minimum permissions needed, nothing more
  • Use Roles, not Users: For applications and services, always use IAM roles
  • MFA for sensitive operations: Require multi-factor authentication for critical actions

AWS KMS (Key Management Service)

What it is: KMS manages encryption keys used to encrypt data at rest and in transit. It provides centralized key management with audit trails.

Why it exists: Encryption is essential for data protection, but managing encryption keys is complex. KMS handles key generation, rotation, access control, and auditing.

How KMS Works:

  1. You create a KMS key (formerly called Customer Master Key)
  2. You define key policy controlling who can use the key
  3. Service requests encryption (e.g., S3, RDS, EBS)
  4. KMS generates data encryption key (DEK) using your KMS key
  5. Service encrypts data with DEK
  6. Service stores encrypted DEK with the data
  7. For decryption, service sends encrypted DEK to KMS
  8. KMS decrypts DEK (if caller has permission) and returns it
  9. Service decrypts data with DEK
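The envelope-encryption flow above can be sketched in a few lines. This is a toy: a SHA-256 XOR keystream stands in for AES, so do not use it for real data. What it does show is the structure that matters for the exam: the data key encrypts the data, the KMS key encrypts the data key, and the KMS key itself never leaves its boundary.

```python
import hashlib
import secrets

def xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher (SHA-256 counter keystream) standing in for AES."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# The "KMS key" stays in this scope, mirroring keys never leaving KMS HSMs.
kms_key = secrets.token_bytes(32)

def generate_data_key():
    """Like kms:GenerateDataKey: returns a plaintext DEK plus the same DEK
    encrypted under the KMS key."""
    dek = secrets.token_bytes(32)
    return dek, xor_stream(kms_key, dek)

def decrypt_data_key(encrypted_dek):
    """Like kms:Decrypt (the kms:Decrypt permission check is omitted here)."""
    return xor_stream(kms_key, encrypted_dek)

# Encrypt: data encrypted with the DEK; the encrypted DEK is stored with it.
dek, encrypted_dek = generate_data_key()
ciphertext = xor_stream(dek, b"sensitive customer record")
stored = (ciphertext, encrypted_dek)  # what S3 persists; plaintext DEK is discarded

# Decrypt: send the encrypted DEK back to "KMS", recover it, decrypt the data.
recovered_dek = decrypt_data_key(stored[1])
print(xor_stream(recovered_dek, stored[0]))  # b'sensitive customer record'
```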

Detailed Example: S3 Encryption with KMS

Scenario: Store sensitive customer data in S3 with encryption, ensuring only authorized applications can decrypt.

Implementation:

  1. Create KMS Key:

    • Key type: Symmetric
    • Key policy: Allow S3 service and specific IAM roles
    • Enable automatic key rotation (yearly)
  2. Configure S3 Bucket:

    • Enable default encryption: SSE-KMS
    • Specify KMS key ARN
    • All new objects automatically encrypted
  3. Upload Object:

    • Application uploads file to S3
    • S3 requests data key from KMS
    • KMS generates 256-bit AES key
    • S3 encrypts file with data key
    • S3 encrypts data key with KMS key
    • S3 stores encrypted file + encrypted data key
  4. Download Object:

    • Application requests file from S3
    • S3 retrieves encrypted file + encrypted data key
    • S3 sends encrypted data key to KMS
    • KMS checks if caller has kms:Decrypt permission
    • If authorized, KMS decrypts data key and returns it
    • S3 decrypts file with data key
    • S3 returns plaintext file to application

Security Benefits:

  • Data encrypted at rest (protects against disk theft)
  • Centralized key management (one place to control access)
  • Audit trail (CloudTrail logs all KMS API calls)
  • Key rotation (automatic yearly rotation)
  • Access control (key policy + IAM policies)

⭐ Must Know:

  • KMS keys never leave KMS: Keys are stored in hardware security modules (HSMs)
  • Envelope encryption: Data encrypted with data key, data key encrypted with KMS key
  • Key policies: Control who can use keys (separate from IAM policies)
  • Automatic rotation: Enable for yearly key rotation (old keys still work for decryption)
  • Regional service: KMS keys are regional, must create keys in each region

AWS CloudTrail

What it is: CloudTrail records all API calls made in your AWS account, providing a complete audit trail of who did what, when, and from where.

Why it exists: For security, compliance, and troubleshooting, you need to know what actions were taken in your AWS account. CloudTrail provides this visibility.

How CloudTrail Works:

  1. User or service makes API call (e.g., create EC2 instance)
  2. CloudTrail captures event with details: who, what, when, where, result
  3. CloudTrail writes event to S3 (encrypted, immutable)
  4. CloudTrail sends to CloudWatch Logs (optional, for real-time monitoring)
  5. CloudTrail sends to EventBridge (optional, for automated responses)
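Each captured event is a JSON record, and an investigation boils down to pulling the who/what/when/where fields out of it. The record below is heavily abbreviated (real CloudTrail records carry many more fields), but the field names shown (`eventTime`, `eventName`, `userIdentity`, `sourceIPAddress`, `errorCode`) are the actual ones.

```python
import json

# Abbreviated CloudTrail record; real records contain many more fields.
record = json.loads("""
{
  "eventTime": "2024-10-08T14:32:15Z",
  "eventName": "DeleteDBInstance",
  "userIdentity": {"arn": "arn:aws:iam::123456789012:user/john.smith"},
  "sourceIPAddress": "203.0.113.45",
  "errorCode": null
}
""")

def summarize(event):
    """Answer the who / what / when / where questions for one record."""
    return {
        "who": event["userIdentity"]["arn"],
        "what": event["eventName"],
        "when": event["eventTime"],
        "where": event["sourceIPAddress"],
        "succeeded": event.get("errorCode") is None,  # errorCode set on failures
    }

print(summarize(record)["who"])  # arn:aws:iam::123456789012:user/john.smith
```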

Detailed Example: Security Incident Investigation

Scenario: Production database was deleted. Need to find who did it and when.

Investigation Steps:

  1. Query CloudTrail:

    • Event: DeleteDBInstance
    • Time: 2024-10-08 14:32:15 UTC
    • User: arn:aws:iam::123456789012:user/john.smith
    • Source IP: 203.0.113.45
    • Result: Success
  2. Analyze Context:

    • Check if IP is expected (company VPN range)
    • Check if user should have delete permissions
    • Check if MFA was used (should be required for delete)
    • Check for other suspicious activity from same user
  3. Take Action:

    • Disable user's access immediately
    • Restore database from backup
    • Review IAM policies (why did user have delete permission?)
    • Implement MFA requirement for destructive operations
    • Set up CloudWatch alarm for future delete operations

Prevention:

  • Require MFA for sensitive operations
  • Implement least privilege (users shouldn't have delete permissions)
  • Set up real-time alerts for critical operations
  • Regular access reviews

⭐ Must Know:

  • CloudTrail is regional: Must enable in each region (or use organization trail)
  • 90-day retention: Events stored 90 days in CloudTrail console (longer in S3)
  • Tamper-evident logs: Enable log file integrity validation, and use S3 Object Lock if logs must be truly immutable once written
  • Management events vs Data events: Management events (API calls) free, data events (S3 object access) charged
  • Multi-account: Use AWS Organizations to enable CloudTrail across all accounts

AWS Security Hub

What it is: Security Hub provides a centralized view of security findings from multiple AWS services and third-party tools. It aggregates, organizes, and prioritizes security alerts.

Why it exists: Large AWS environments generate thousands of security findings from GuardDuty, Inspector, Macie, Config, and third-party tools. Security Hub consolidates these into a single dashboard with prioritization.

How Security Hub Works:

  1. Enable Security Hub in your account
  2. Enable security standards (AWS Foundational Security Best Practices, CIS AWS Foundations Benchmark, PCI-DSS)
  3. Integrate services: GuardDuty, Inspector, Macie, Config, Firewall Manager
  4. Security Hub collects findings from all sources
  5. Security Hub normalizes findings into standard format (AWS Security Finding Format)
  6. Security Hub assigns severity (Critical, High, Medium, Low, Informational)
  7. Security Hub generates insights (patterns across findings)
  8. You review and remediate findings

Detailed Example: Centralized Security Monitoring

Scenario: Enterprise with 50 AWS accounts needs centralized security monitoring and compliance reporting.

Implementation:

  1. Enable Security Hub in master account

  2. Invite member accounts (50 accounts)

  3. Enable standards:

    • AWS Foundational Security Best Practices
    • CIS AWS Foundations Benchmark v1.4.0
    • PCI-DSS v3.2.1
  4. Integrate Services:

    • GuardDuty: Threat detection
    • Inspector: Vulnerability scanning
    • Macie: Data discovery and protection
    • Config: Configuration compliance
    • IAM Access Analyzer: Unintended access
  5. Review Findings:

    • Critical: 15 findings (immediate action)
    • High: 127 findings (prioritize)
    • Medium: 543 findings (schedule remediation)
    • Low: 1,234 findings (review periodically)
  6. Automated Remediation:

    • EventBridge rule: Security Hub finding → Lambda function
    • Lambda automatically remediates common issues:
      • S3 bucket public access → Block public access
      • Security group 0.0.0.0/0 → Restrict to company IP range
      • Unencrypted EBS volume → Enable encryption
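The EventBridge-to-Lambda pattern in step 6 can be sketched as a handler that maps finding types to remediation actions. The `detail.findings` event shape and the `Severity.Label` field come from the AWS Security Finding Format, but the generator IDs and action names in the dispatch table are illustrative; a real handler would call the relevant AWS APIs (e.g. via boto3) instead of returning strings.

```python
# Illustrative dispatch table: finding type -> remediation action name.
REMEDIATIONS = {
    "s3-bucket-public-access": "block_public_access",
    "sg-open-to-world": "restrict_security_group",
    "ebs-unencrypted": "enable_ebs_encryption",
}

def handler(event, context=None):
    """Lambda-style handler for Security Hub findings via EventBridge:
    remediate only CRITICAL/HIGH findings we have a playbook for."""
    actions = []
    for finding in event["detail"]["findings"]:
        action = REMEDIATIONS.get(finding.get("GeneratorId", ""))
        if action and finding["Severity"]["Label"] in ("CRITICAL", "HIGH"):
            actions.append((finding["Id"], action))
    return actions

sample = {"detail": {"findings": [{
    "Id": "finding-001",
    "GeneratorId": "s3-bucket-public-access",
    "Severity": {"Label": "CRITICAL"},
}]}}
print(handler(sample))  # [('finding-001', 'block_public_access')]
```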

Benefits:

  • Single pane of glass for security across 50 accounts
  • Automated compliance reporting
  • Prioritized findings (focus on critical issues first)
  • Automated remediation for common issues
  • Continuous compliance monitoring

⭐ Must Know:

  • Aggregates findings: From GuardDuty, Inspector, Macie, Config, and 50+ partner products
  • Security standards: Built-in compliance frameworks (CIS, PCI-DSS, etc.)
  • Automated remediation: Integrate with EventBridge and Lambda for auto-remediation
  • Multi-account: Master-member model for centralized monitoring
  • Pricing: $0.0010 per security check per month + $0.00003 per finding ingested

šŸŽÆ Exam Focus: Questions often test understanding of which security service to use for specific scenarios (IAM for access control, KMS for encryption, CloudTrail for auditing, Security Hub for centralized monitoring).

Task 1.2 Summary

Key Security Services:

  • IAM: Access control (who can do what)
  • KMS: Encryption key management
  • CloudTrail: API audit logging
  • Security Hub: Centralized security monitoring
  • GuardDuty: Threat detection

Security Best Practices:

  1. Implement least privilege access
  2. Enable MFA for sensitive operations
  3. Encrypt data at rest and in transit
  4. Enable CloudTrail in all regions
  5. Use Security Hub for centralized monitoring
  6. Automate security responses with EventBridge + Lambda

Task 1.3: Design Reliable and Resilient Architectures

Key Concepts

RTO and RPO

RTO (Recovery Time Objective): Maximum acceptable downtime

  • Example: "System must be back online within 4 hours"
  • Drives DR strategy selection and cost

RPO (Recovery Point Objective): Maximum acceptable data loss

  • Example: "Can lose maximum 15 minutes of data"
  • Drives backup frequency and replication strategy

Relationship to DR Strategies:

  • Backup & Restore: RTO hours to days, RPO hours, lowest cost. Use for non-critical systems.
  • Pilot Light: RTO tens of minutes, RPO minutes, low cost. Use for cost-sensitive workloads with moderate RTO.
  • Warm Standby: RTO minutes, RPO seconds, medium cost. Use for business-critical applications.
  • Multi-Site Active-Active: RTO seconds, RPO none, highest cost. Use for mission-critical, zero-downtime workloads.
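The mapping from objectives to strategy can be expressed as a selection function: pick the cheapest strategy whose RTO/RPO still meets the targets. The numeric thresholds below are illustrative cutoffs, not official figures.

```python
def dr_strategy(rto_minutes, rpo_minutes):
    """Cheapest DR strategy meeting the given RTO/RPO targets (minutes).
    Thresholds are illustrative; tighter targets always cost more."""
    if rto_minutes < 1 and rpo_minutes == 0:
        return "Multi-Site Active-Active"   # seconds RTO, no data loss
    if rto_minutes <= 15 and rpo_minutes <= 1:
        return "Warm Standby"               # minutes RTO, seconds RPO
    if rto_minutes <= 60 and rpo_minutes <= 30:
        return "Pilot Light"                # tens of minutes RTO
    return "Backup & Restore"               # hours-to-days RTO

print(dr_strategy(rto_minutes=24 * 60, rpo_minutes=12 * 60))  # Backup & Restore
print(dr_strategy(rto_minutes=30, rpo_minutes=5))             # Pilot Light
print(dr_strategy(rto_minutes=0.5, rpo_minutes=0))            # Multi-Site Active-Active
```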

Disaster Recovery Strategies

1. Backup and Restore (Lowest Cost, Highest RTO/RPO):

  • What: Regular backups to S3, restore when needed
  • RTO: Hours to days (time to provision infrastructure + restore data)
  • RPO: Hours (backup frequency)
  • Cost: Very low (only pay for S3 storage)
  • Use when: Non-critical systems, cost is primary concern
  • Example: Development environments, internal tools

2. Pilot Light (Low Cost, Moderate RTO/RPO):

  • What: Minimal infrastructure always running (database replication), scale up during disaster
  • RTO: 10s of minutes (time to scale up infrastructure)
  • RPO: Minutes (continuous replication)
  • Cost: Low (minimal infrastructure + replication)
  • Use when: Cost-sensitive but need faster recovery than backup/restore
  • Example: E-commerce site with moderate traffic

3. Warm Standby (Medium Cost, Low RTO/RPO):

  • What: Scaled-down version of full environment always running, scale up during disaster
  • RTO: Minutes (time to scale up to full capacity)
  • RPO: Seconds (real-time replication)
  • Cost: Medium (running infrastructure at reduced capacity)
  • Use when: Business-critical applications, can tolerate brief downtime
  • Example: Financial services applications

4. Multi-Site Active-Active (Highest Cost, Lowest RTO/RPO):

  • What: Full production environment in multiple regions, both serving traffic
  • RTO: Seconds (automatic failover)
  • RPO: None (synchronous replication)
  • Cost: Highest (full infrastructure in multiple regions)
  • Use when: Mission-critical, zero downtime tolerance
  • Example: Global banking systems, healthcare platforms

⭐ Must Know:

  • Tighter RTO/RPO = Higher cost
  • Business requirements drive DR strategy selection
  • Test DR procedures regularly (quarterly minimum)
  • Document runbooks for failover procedures

High Availability Patterns

Multi-AZ Deployment:

  • Deploy resources across multiple Availability Zones
  • Load balancer distributes traffic
  • Automatic failover if AZ fails
  • Use for: All production workloads

Multi-Region Deployment:

  • Deploy resources across multiple AWS Regions
  • Route 53 routes traffic based on health/latency/geolocation
  • Protects against region-wide failures
  • Use for: Global applications, highest availability requirements

Auto Scaling:

  • Automatically adjust capacity based on demand
  • Replace failed instances automatically
  • Maintain desired capacity
  • Use for: Variable workloads, automatic recovery

Database High Availability:

  • RDS Multi-AZ: Synchronous replication, automatic failover (1-2 minutes)
  • Aurora: Multi-AZ by default, 6 copies across 3 AZs
  • DynamoDB: Multi-AZ by default, global tables for multi-region

⭐ Must Know:

  • Always deploy across multiple AZs for production
  • Use Auto Scaling for automatic recovery
  • RDS Multi-AZ provides automatic failover
  • Aurora is more resilient than standard RDS

Task 1.4: Design Multi-Account AWS Environment

Key Concepts

AWS Organizations

What it is: Service for centrally managing multiple AWS accounts. Provides consolidated billing, policy-based management, and account organization.

Why it exists: Enterprises have many AWS accounts (50, 100, 500+) for different teams, applications, and environments. Organizations provides centralized management.

Key Features:

  • Organizational Units (OUs): Group accounts hierarchically
  • Service Control Policies (SCPs): Set permission guardrails across accounts
  • Consolidated Billing: Single bill for all accounts, volume discounts
  • Account Creation: Programmatically create new accounts

Typical OU Structure:

Root
ā”œā”€ā”€ Security OU
│   ā”œā”€ā”€ Security Tooling Account
│   └── Log Archive Account
ā”œā”€ā”€ Infrastructure OU
│   ā”œā”€ā”€ Network Account
│   └── Shared Services Account
ā”œā”€ā”€ Production OU
│   ā”œā”€ā”€ Prod App 1 Account
│   └── Prod App 2 Account
ā”œā”€ā”€ Development OU
│   ā”œā”€ā”€ Dev App 1 Account
│   └── Dev App 2 Account
└── Sandbox OU
    ā”œā”€ā”€ Sandbox 1 Account
    └── Sandbox 2 Account

Service Control Policies (SCPs):

  • Permission boundaries applied to OUs or accounts
  • Define maximum permissions (what's allowed)
  • Cannot grant permissions (only restrict)
  • Applied hierarchically (parent OU policies affect child accounts)

Example SCP - Prevent Region Usage:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Action": "*",
    "Resource": "*",
    "Condition": {
      "StringNotEquals": {
        "aws:RequestedRegion": ["us-east-1", "us-west-2"]
      }
    }
  }]
}

Result: Accounts in this OU can only use us-east-1 and us-west-2 regions.
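The effect of this SCP can be sketched as a predicate: the Deny statement fires whenever `aws:RequestedRegion` fails the `StringNotEquals` check. Remember this only caps permissions; even when the SCP allows the region, the principal still needs an IAM policy that grants the action.

```python
def scp_allows(requested_region, allowed_regions=("us-east-1", "us-west-2")):
    """Simplified evaluation of the Deny-with-StringNotEquals SCP above:
    the Deny matches whenever the requested region is outside the list."""
    deny_fires = requested_region not in allowed_regions  # StringNotEquals
    return not deny_fires

print(scp_allows("us-east-1"))  # True  - request proceeds to IAM evaluation
print(scp_allows("eu-west-1"))  # False - blocked by the SCP guardrail
```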

⭐ Must Know:

  • SCPs don't grant permissions, only restrict
  • SCPs apply to all users and roles in member accounts, including the root user (the management account itself is not restricted)
  • Use OUs to group accounts by function, environment, or team
  • Consolidated billing provides volume discounts

AWS Control Tower

What it is: Service that automates setup of multi-account AWS environment based on best practices. Provides guardrails, account factory, and dashboard.

Why it exists: Setting up Organizations, OUs, SCPs, logging, and security manually is complex and error-prone. Control Tower automates this with best practices built-in.

Key Features:

  • Landing Zone: Pre-configured multi-account environment
  • Guardrails: Preventive (SCPs) and detective (Config rules) controls
  • Account Factory: Automated account provisioning
  • Dashboard: Centralized compliance view

Guardrails:

  • Mandatory: Always enforced (e.g., disallow public S3 buckets)
  • Strongly Recommended: Best practices (e.g., enable CloudTrail)
  • Elective: Optional controls (e.g., disallow specific instance types)

⭐ Must Know:

  • Control Tower uses Organizations, Config, CloudTrail, and other services
  • Guardrails enforce security and compliance policies
  • Account Factory automates account creation with baseline configuration
  • Use for new multi-account setups (not for existing complex environments)

Cross-Account Access Patterns

1. IAM Roles (Preferred):

  • Create role in target account
  • Grant permissions in role policy
  • Trust policy allows source account to assume role
  • Users in source account assume role to access target account

2. Resource-Based Policies:

  • Attach policy to resource (S3 bucket, Lambda function)
  • Policy grants access to principals from other accounts
  • No role assumption needed

3. AWS Resource Access Manager (RAM):

  • Share resources across accounts (Transit Gateway, subnets, Route 53 Resolver rules)
  • Centralized resource management
  • No resource duplication

⭐ Must Know:

  • Use IAM roles for cross-account access (most common)
  • Resource-based policies for specific resources (S3, Lambda)
  • RAM for sharing network resources (Transit Gateway, subnets)

Task 1.5: Cost Optimization and Visibility

Key Concepts

Cost Monitoring Tools

AWS Cost Explorer:

  • Visualize and analyze costs
  • Filter by service, account, tag, region
  • Forecast future costs
  • Identify cost trends

AWS Budgets:

  • Set custom cost and usage budgets
  • Alerts when exceeding thresholds
  • Can trigger automated actions (stop instances, send SNS notification)

AWS Cost and Usage Report:

  • Most detailed cost data
  • Hourly granularity
  • Delivered to S3 for analysis
  • Use with Athena or QuickSight for custom reports

AWS Trusted Advisor:

  • Real-time recommendations
  • Cost optimization checks (idle resources, underutilized instances)
  • Security, performance, fault tolerance checks

⭐ Must Know:

  • Cost Explorer for visualization and analysis
  • Budgets for alerts and automated actions
  • Cost and Usage Report for detailed analysis
  • Trusted Advisor for recommendations

Purchasing Options

On-Demand Instances:

  • Pay by the hour/second
  • No commitment
  • Highest cost per hour
  • Use for: Unpredictable workloads, short-term needs

Reserved Instances (RIs):

  • 1 or 3-year commitment
  • Up to 75% discount vs On-Demand
  • Types: Standard (highest discount, no flexibility), Convertible (lower discount, can change instance type)
  • Use for: Steady-state workloads, predictable usage

Savings Plans:

  • 1 or 3-year commitment to spend amount ($/hour)
  • Up to 72% discount vs On-Demand
  • More flexible than RIs (applies to Lambda, Fargate, EC2)
  • Use for: Consistent compute usage across services

Spot Instances:

  • Use spare EC2 capacity at the current Spot price (the old bidding model was retired; you simply pay the market rate)
  • Up to 90% discount vs On-Demand
  • Can be interrupted with 2-minute warning
  • Use for: Fault-tolerant, flexible workloads (batch processing, big data)
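The trade-off between the four options can be made concrete with a rough cost comparison. The discount factors below are the headline "up to" figures from the text, and the On-Demand rate is illustrative; real discounts depend on instance family, term length, and payment option.

```python
def monthly_compute_cost(hours, on_demand_rate, option="on_demand"):
    """Rough monthly EC2 cost per purchasing option. Discount factors are
    the headline 'up to' figures; actual discounts vary widely."""
    discounts = {
        "on_demand": 0.0,
        "reserved": 0.72,      # toward the 'up to 75%' for 3-year Standard RIs
        "savings_plan": 0.66,  # 'up to 72%'; assume a smaller 1-year plan here
        "spot": 0.90,          # 'up to 90%', but interruptible
    }
    return hours * on_demand_rate * (1 - discounts[option])

rate = 0.096  # illustrative On-Demand hourly rate for a general-purpose instance
for option in ("on_demand", "reserved", "savings_plan", "spot"):
    print(option, round(monthly_compute_cost(730, rate, option), 2))
```

Running this shows why architects layer the options: cover the steady baseline with RIs or Savings Plans, burst with On-Demand, and push fault-tolerant batch work to Spot.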

⭐ Must Know:

  • RIs and Savings Plans require commitment (1 or 3 years)
  • Spot Instances can be interrupted (not for critical workloads)
  • Savings Plans more flexible than RIs
  • Combine purchasing options for optimal cost

Cost Optimization Strategies

1. Right-Sizing:

  • Match instance types to actual usage
  • Use AWS Compute Optimizer for recommendations
  • Downsize over-provisioned resources

2. Auto Scaling:

  • Scale capacity based on demand
  • Reduce costs during low-usage periods
  • Maintain performance during peaks

3. Storage Optimization:

  • Use S3 Intelligent-Tiering for automatic tiering
  • Lifecycle policies to move data to cheaper storage classes
  • Delete unused EBS volumes and snapshots

4. Reserved Capacity:

  • Purchase RIs or Savings Plans for steady workloads
  • Analyze usage patterns with Cost Explorer
  • Start with 1-year commitment, move to 3-year for higher discount

5. Tagging Strategy:

  • Tag all resources with cost allocation tags
  • Track costs by project, team, environment
  • Enable tag-based budgets and alerts
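Once resources carry cost allocation tags, chargeback is a roll-up over billing line items. The sketch below groups Cost and Usage Report-style line items by a tag key; the field names (`cost`, `tags`) are illustrative, not the actual CUR column names.

```python
from collections import defaultdict

def costs_by_tag(line_items, tag_key):
    """Roll up billing line items by a cost allocation tag for chargeback.
    Untagged spend is surfaced separately so it can be chased down."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item["tags"].get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)

items = [
    {"cost": 120.0, "tags": {"team": "payments", "env": "prod"}},
    {"cost": 45.0,  "tags": {"team": "payments", "env": "dev"}},
    {"cost": 30.0,  "tags": {"env": "prod"}},  # missing team tag
]
print(costs_by_tag(items, "team"))  # {'payments': 165.0, 'untagged': 30.0}
```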

⭐ Must Know:

  • Right-sizing can save 20-40% on compute costs
  • Auto Scaling reduces costs during low usage
  • S3 Intelligent-Tiering automatically optimizes storage costs
  • Tagging enables cost allocation and chargeback

Chapter 1 Summary

What We Covered

This chapter covered Domain 1: Design Solutions for Organizational Complexity (26% of exam).

āœ… Task 1.1: Network Connectivity

  • VPC Peering: Direct connection, not transitive, 2-10 VPCs
  • Transit Gateway: Central hub, transitive routing, 10+ VPCs
  • PrivateLink: Service-level connectivity, private access
  • Direct Connect: Dedicated connection, high bandwidth, consistent performance
  • Site-to-Site VPN: Encrypted over internet, quick setup, moderate bandwidth

āœ… Task 1.2: Security Controls

  • IAM: Access control, least privilege, roles over users
  • KMS: Encryption key management, envelope encryption
  • CloudTrail: API audit logging, immutable logs
  • Security Hub: Centralized security monitoring, compliance frameworks
  • GuardDuty: Threat detection, machine learning-based

āœ… Task 1.3: Reliable and Resilient Architectures

  • RTO/RPO: Define recovery objectives
  • DR Strategies: Backup/Restore, Pilot Light, Warm Standby, Multi-Site
  • High Availability: Multi-AZ, Multi-Region, Auto Scaling
  • Database HA: RDS Multi-AZ, Aurora, DynamoDB Global Tables

āœ… Task 1.4: Multi-Account Environment

  • AWS Organizations: Centralized management, consolidated billing
  • Service Control Policies: Permission guardrails
  • Control Tower: Automated multi-account setup
  • Cross-Account Access: IAM roles, resource policies, RAM

āœ… Task 1.5: Cost Optimization

  • Monitoring: Cost Explorer, Budgets, Cost and Usage Report
  • Purchasing Options: On-Demand, RIs, Savings Plans, Spot
  • Optimization: Right-sizing, Auto Scaling, storage tiering, tagging

Critical Takeaways

  1. Network Architecture: Use Transit Gateway for 10+ VPCs, VPC Peering for 2-10 VPCs, PrivateLink for service-level access

  2. Security Layers: Implement defense in depth - network (VPC, security groups), identity (IAM), encryption (KMS), monitoring (CloudTrail, Security Hub)

  3. High Availability: Always deploy across multiple AZs, use Auto Scaling, implement appropriate DR strategy based on RTO/RPO

  4. Multi-Account Strategy: Use Organizations for centralized management, SCPs for guardrails, Control Tower for automated setup

  5. Cost Optimization: Right-size resources, use Auto Scaling, purchase RIs/Savings Plans for steady workloads, implement tagging strategy

Self-Assessment Checklist

Before moving to Domain 2, ensure you can:

  • Explain when to use VPC Peering vs Transit Gateway vs PrivateLink
  • Describe how Direct Connect and VPN differ and when to use each
  • Design IAM policies following least privilege principle
  • Explain how KMS envelope encryption works
  • Choose appropriate DR strategy based on RTO/RPO requirements
  • Design multi-AZ and multi-region architectures
  • Explain AWS Organizations structure and SCPs
  • Describe cross-account access patterns
  • Choose appropriate EC2 purchasing options based on workload
  • Implement cost allocation using tags

Practice Questions

Test your Domain 1 knowledge:

  • Domain 1 Bundle 1: Questions 1-50 (target: 70%+)
  • Domain 1 Bundle 2: Questions 1-50 (target: 75%+)
  • Domain 1 Bundle 3: Questions 1-50 (target: 75%+)

If you scored below 70%:

  • Review sections where you struggled
  • Focus on ⭐ Must Know items
  • Practice drawing architecture diagrams
  • Review decision frameworks (when to use which service)

Quick Reference Card

Network Connectivity Decision Matrix:

  • 2-5 VPCs → VPC Peering
  • 10+ VPCs → Transit Gateway
  • Service access → PrivateLink
  • High bandwidth hybrid → Direct Connect
  • Quick hybrid setup → Site-to-Site VPN

DR Strategy Selection:

  • RTO days, RPO hours → Backup & Restore
  • RTO minutes, RPO minutes → Pilot Light
  • RTO minutes, RPO seconds → Warm Standby
  • RTO seconds, RPO none → Multi-Site Active-Active

Security Services:

  • Access Control: IAM
  • Encryption: KMS
  • Audit Logging: CloudTrail
  • Centralized Monitoring: Security Hub
  • Threat Detection: GuardDuty

Cost Optimization:

  • Monitoring: Cost Explorer, Budgets
  • Steady Workloads: RIs, Savings Plans
  • Variable Workloads: Auto Scaling
  • Fault-Tolerant: Spot Instances
  • Storage: S3 Intelligent-Tiering

Next Steps: You've completed Domain 1 (26% of exam). Continue to Domain 2 (Design for New Solutions) in file 03_domain_2_new_solutions.

šŸ’” Tip: Domain 1 is the largest domain. Take a break, review your notes, and practice with Domain 1 bundles before moving forward.


Chapter 2: Design for New Solutions

Domain Weight: 29% of exam (highest weight)

Chapter Overview

This domain focuses on designing new AWS solutions from scratch. You'll learn deployment strategies, business continuity planning, security controls, reliability patterns, performance optimization, and cost optimization for new applications.

What you'll learn:

  • Design deployment strategies using IaC, CI/CD, and automation
  • Ensure business continuity with appropriate DR strategies
  • Determine security controls based on requirements
  • Design solutions meeting reliability requirements
  • Meet performance objectives through proper architecture
  • Optimize costs while meeting solution goals

Time to complete: 12-15 hours (largest domain by weight)

Prerequisites: Chapter 0 (Fundamentals), Chapter 1 (Domain 1)

Exam Weight: 29% (approximately 19 questions on the actual exam)


Task 2.1: Design Deployment Strategy

Key Concepts

Infrastructure as Code (IaC)

AWS CloudFormation:

  • Define infrastructure in JSON or YAML templates
  • Version control infrastructure
  • Repeatable deployments
  • Rollback on failure

Key Features:

  • Stacks: Collection of resources managed as single unit
  • Change Sets: Preview changes before applying
  • Stack Sets: Deploy across multiple accounts/regions
  • Drift Detection: Identify manual changes

Example Template Structure:

AWSTemplateFormatVersion: '2010-09-09'
Description: Web application infrastructure

Parameters:
  InstanceType:
    Type: String
    Default: t3.micro
    AllowedValues: [t3.micro, t3.small, t3.medium]

Resources:
  WebServerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTP
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceType
      ImageId: ami-0c55b159cbfafe1f0  # example only; AMI IDs are region-specific
      SecurityGroupIds:
        - !Ref WebServerSecurityGroup

Outputs:
  WebServerPublicIP:
    Description: Public IP of web server
    Value: !GetAtt WebServer.PublicIp

⭐ Must Know:

  • CloudFormation is declarative (describe desired state)
  • Supports rollback on failure
  • Use parameters for flexibility
  • Use outputs to export values
  • Stack Sets for multi-account/region deployment
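Drift detection compares the template's desired state against the live resource configuration. The core idea can be sketched as a property diff; this is a conceptual illustration only (the real service inspects live resources through their service APIs):

```python
def detect_drift(template: dict, live: dict) -> dict:
    """Return properties whose live value differs from the template.

    Illustrative sketch of the idea behind CloudFormation drift detection.
    """
    return {
        key: {"expected": expected, "actual": live.get(key)}
        for key, expected in template.items()
        if live.get(key) != expected
    }

desired = {"InstanceType": "t3.micro", "Port": 80}
actual = {"InstanceType": "t3.small", "Port": 80}  # changed manually out-of-band
print(detect_drift(desired, actual))
# {'InstanceType': {'expected': 't3.micro', 'actual': 't3.small'}}
```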

CI/CD Pipelines

AWS CodePipeline:

  • Orchestrates build, test, and deploy stages
  • Integrates with CodeCommit, CodeBuild, CodeDeploy
  • Supports third-party tools (GitHub, Jenkins)

Typical Pipeline Stages:

  1. Source: Code commit triggers pipeline (CodeCommit, GitHub)
  2. Build: Compile code, run unit tests (CodeBuild)
  3. Test: Run integration tests, security scans
  4. Deploy to Staging: Deploy to staging environment
  5. Manual Approval: Human review before production
  6. Deploy to Production: Deploy to production environment

Deployment Strategies:

1. All-at-Once:

  • Deploy to all instances simultaneously
  • Fastest deployment
  • Downtime during deployment
  • Use for: Development environments, non-critical applications

2. Rolling:

  • Deploy to instances in batches
  • Maintain partial capacity during deployment
  • No downtime
  • Use for: Production applications, can tolerate reduced capacity

3. Blue/Green:

  • Deploy to new environment (green)
  • Switch traffic from old (blue) to new (green)
  • Instant rollback (switch back to blue)
  • Use for: Zero-downtime deployments, easy rollback needed

4. Canary:

  • Deploy to small subset of instances first
  • Monitor metrics, gradually increase traffic
  • Rollback if issues detected
  • Use for: Risk-averse deployments, gradual rollout

⭐ Must Know:

  • Blue/Green provides instant rollback
  • Canary reduces risk with gradual rollout
  • Rolling maintains capacity during deployment
  • All-at-Once is fastest but has downtime
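The canary logic above — shift traffic in steps, watch metrics, roll back on trouble — can be sketched as a simple loop. Names and thresholds here are my own; in AWS this behavior comes from CodeDeploy traffic shifting or Route 53 weighted records:

```python
def canary_rollout(error_rate_by_step, traffic_steps=(5, 25, 50, 100), threshold=0.05):
    """Shift traffic to the new version step by step; roll back on error spikes.

    error_rate_by_step: observed error rate at each traffic step.
    Returns the final traffic percentage on the new version (0 = rolled back).
    """
    current = 0
    for pct, error_rate in zip(traffic_steps, error_rate_by_step):
        if error_rate > threshold:
            return 0  # rollback: send all traffic back to the old version
        current = pct
    return current

print(canary_rollout([0.01, 0.02, 0.01, 0.01]))  # 100 (full rollout)
print(canary_rollout([0.01, 0.20]))              # 0 (rolled back at 25%)
```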

Task 2.2: Design for Business Continuity

Key Concepts

Multi-Region Architectures

Active-Passive:

  • Primary region serves all traffic
  • Secondary region on standby
  • Failover when primary fails
  • RTO: Minutes (DNS propagation + scaling)
  • RPO: Seconds to minutes (replication lag)

Active-Active:

  • Both regions serve traffic simultaneously
  • Route 53 distributes traffic (latency/geolocation routing)
  • No failover needed (automatic)
  • RTO: Seconds (automatic)
  • RPO: None (synchronous or near-synchronous replication)

Implementation Pattern:

Primary Region (us-east-1):
├── Application Load Balancer
├── EC2 Auto Scaling Group
├── RDS Primary (Multi-AZ)
└── S3 Bucket (Cross-Region Replication enabled)

Secondary Region (us-west-2):
├── Application Load Balancer
├── EC2 Auto Scaling Group (scaled down or off)
├── RDS Read Replica (can be promoted)
└── S3 Bucket (replication target)

Route 53:
├── Health Checks (monitor primary region)
└── Failover Routing (primary → secondary on failure)
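The failover-routing behavior can be sketched as follows. The class and endpoint names are my own; note that Route 53's real health checks probe from multiple locations, and the default failure threshold of 3 consecutive failed checks is the one detail borrowed from the actual service:

```python
class HealthCheck:
    """Mark an endpoint unhealthy after N consecutive failed probes."""

    def __init__(self, failure_threshold: int = 3):  # Route 53's default is 3
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, probe_ok: bool) -> None:
        self.consecutive_failures = 0 if probe_ok else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold

def resolve(check: HealthCheck) -> str:
    """Failover routing: answer with primary while healthy, else secondary."""
    return "primary.us-east-1" if check.healthy else "secondary.us-west-2"

hc = HealthCheck()
for ok in [True, False, False, False]:  # primary starts failing
    hc.record(ok)
print(resolve(hc))  # secondary.us-west-2
```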

⭐ Must Know:

  • Active-Passive: Lower cost, higher RTO
  • Active-Active: Higher cost, lower RTO
  • Use Route 53 health checks for automatic failover
  • Cross-Region Replication for S3, RDS Read Replicas for databases

Database Replication Strategies

RDS Multi-AZ:

  • Synchronous replication within region
  • Automatic failover (1-2 minutes)
  • Same endpoint (no application changes)
  • Use for: High availability within region

RDS Read Replicas:

  • Asynchronous replication
  • Can be in different region
  • Can be promoted to standalone database
  • Use for: Read scaling, disaster recovery

Aurora Global Database:

  • Primary region + up to 5 secondary regions
  • Replication lag < 1 second
  • Promote secondary region in < 1 minute
  • Use for: Global applications, lowest RPO/RTO

DynamoDB Global Tables:

  • Multi-region, multi-active replication
  • Automatic conflict resolution
  • Sub-second replication
  • Use for: Global applications, NoSQL workloads

⭐ Must Know:

  • RDS Multi-AZ for HA within region
  • RDS Read Replicas for cross-region DR
  • Aurora Global Database for global applications
  • DynamoDB Global Tables for multi-region NoSQL

Task 2.3: Security Controls

Key Concepts

Defense in Depth

Layer 1: Network Security:

  • VPC with private subnets
  • Security groups (stateful firewall)
  • Network ACLs (stateless firewall)
  • AWS WAF (web application firewall)

Layer 2: Identity and Access:

  • IAM with least privilege
  • MFA for sensitive operations
  • IAM roles for applications
  • Temporary credentials (STS)

Layer 3: Data Protection:

  • Encryption at rest (KMS)
  • Encryption in transit (TLS/SSL)
  • S3 bucket policies
  • Database encryption

Layer 4: Monitoring and Detection:

  • CloudTrail for audit logs
  • GuardDuty for threat detection
  • Security Hub for centralized monitoring
  • CloudWatch for metrics and alarms

⭐ Must Know:

  • Implement multiple security layers
  • Network security alone is insufficient
  • Encrypt sensitive data at rest and in transit
  • Monitor and detect threats continuously

Compliance and Governance

AWS Config:

  • Track resource configuration changes
  • Evaluate compliance with rules
  • Automated remediation
  • Configuration history

AWS Systems Manager:

  • Patch management
  • Configuration management
  • Parameter Store (secrets management)
  • Session Manager (secure access)

AWS Secrets Manager:

  • Rotate secrets automatically
  • Integrate with RDS, Redshift
  • Fine-grained access control
  • Audit secret access

⭐ Must Know:

  • Config for compliance monitoring
  • Systems Manager for operational management
  • Secrets Manager for automatic rotation
  • Use Parameter Store for non-sensitive configuration

Task 2.4: Reliability Requirements

Key Concepts

Loosely Coupled Architectures

Message Queues (SQS):

  • Decouple components
  • Buffer requests during spikes
  • Retry failed messages
  • Use for: Asynchronous processing, decoupling

Pub/Sub (SNS):

  • Fan-out messages to multiple subscribers
  • Push notifications
  • Email, SMS, HTTP endpoints
  • Use for: Event notifications, fan-out patterns

Event-Driven (EventBridge):

  • Route events between services
  • Filter and transform events
  • Schedule events
  • Use for: Event-driven architectures, integrations

Orchestration (Step Functions):

  • Coordinate multiple services
  • Visual workflow
  • Error handling and retry
  • Use for: Complex workflows, long-running processes

⭐ Must Know:

  • SQS for decoupling and buffering
  • SNS for fan-out and notifications
  • EventBridge for event routing
  • Step Functions for orchestration
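The decoupling pattern SQS provides can be demonstrated with an in-process queue: producer and consumer share only the queue, so either side can fail, restart, or scale independently. Here `queue.Queue` stands in for SQS (with boto3, `send_message`/`receive_message`/`delete_message` would play these roles against the real service):

```python
import queue
import threading

buffer: queue.Queue = queue.Queue()
processed = []

def producer():
    for order_id in range(5):
        buffer.put({"order_id": order_id})  # fire-and-forget, like sqs.send_message

def consumer():
    while True:
        msg = buffer.get()                  # like sqs.receive_message (long polling)
        if msg is None:                     # sentinel to stop this sketch
            break
        processed.append(msg["order_id"])   # on success, delete_message in real SQS
        buffer.task_done()

t = threading.Thread(target=consumer)
t.start()
producer()
buffer.put(None)
t.join()
print(processed)  # [0, 1, 2, 3, 4]
```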

Auto Scaling Patterns

Target Tracking:

  • Maintain metric at target value (e.g., 70% CPU)
  • Automatically adjusts capacity
  • Use for: Most common use case

Step Scaling:

  • Add/remove capacity based on thresholds
  • Multiple steps for different levels
  • Use for: Predictable scaling patterns

Scheduled Scaling:

  • Scale based on time/date
  • Predictable traffic patterns
  • Use for: Known traffic patterns (business hours)

Predictive Scaling:

  • Machine learning predicts future load
  • Proactively scales before demand
  • Use for: Recurring patterns, optimize costs

⭐ Must Know:

  • Target Tracking for most use cases
  • Scheduled Scaling for predictable patterns
  • Predictive Scaling for ML-based optimization
  • Combine multiple scaling policies
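Target tracking's core math is "size the fleet so the metric lands at the target." A sketch of that calculation (the real policy also applies cooldowns, instance warm-up, and the group's min/max bounds):

```python
import math

def target_tracking_capacity(total_load: float, target_cpu: float = 70.0) -> int:
    """Capacity needed so per-instance CPU sits at or below the target.

    total_load: aggregate load in CPU-percent units across the fleet
    (illustrative units, not a real CloudWatch metric).
    """
    return max(math.ceil(total_load / target_cpu), 1)

print(target_tracking_capacity(350))  # 5 instances → ~70% CPU each
print(target_tracking_capacity(40))   # 1 instance is enough
```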

Task 2.5: Performance Objectives

Key Concepts

Caching Strategies

CloudFront (CDN):

  • Cache static content at edge locations
  • Reduce latency for global users
  • Offload origin servers
  • Use for: Static content, global distribution

ElastiCache:

  • In-memory cache (Redis, Memcached)
  • Sub-millisecond latency
  • Session storage, database caching
  • Use for: Frequently accessed data, session management

DAX (DynamoDB Accelerator):

  • In-memory cache for DynamoDB
  • Microsecond latency
  • Fully managed
  • Use for: DynamoDB read-heavy workloads

⭐ Must Know:

  • CloudFront for static content caching
  • ElastiCache for application caching
  • DAX specifically for DynamoDB
  • Caching reduces latency and cost
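The usual way application caching is wired in is the cache-aside (lazy loading) pattern: check the cache, fall back to the database on a miss, then populate the cache. A minimal sketch, with a dict and a fake lookup standing in for ElastiCache and the database:

```python
cache: dict = {}
db_reads = 0

def fake_db_lookup(key: str) -> str:
    global db_reads
    db_reads += 1                  # each DB read is the slow, expensive path
    return f"value-for-{key}"

def get(key: str) -> str:
    if key in cache:               # cache hit: fast path
        return cache[key]
    value = fake_db_lookup(key)    # cache miss: go to the database
    cache[key] = value             # populate so the next read is a hit
    return value

get("user:42"); get("user:42"); get("user:42")
print(db_reads)  # 1 -- only the first read touched the database
```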

Database Performance

Read Replicas:

  • Offload read traffic from primary
  • Scale read capacity horizontally
  • Can be in different regions
  • Use for: Read-heavy workloads

Connection Pooling (RDS Proxy):

  • Manage database connections
  • Reduce connection overhead
  • Improve failover time
  • Use for: Serverless, high connection count

Partitioning:

  • DynamoDB: Partition key design
  • RDS: Sharding across multiple databases
  • Use for: Scale beyond single database limits

⭐ Must Know:

  • Read Replicas for read scaling
  • RDS Proxy for connection management
  • Proper partition key design critical for DynamoDB
  • Consider Aurora for better performance than RDS

Task 2.6: Cost Optimization

Key Concepts

Compute Cost Optimization

Right-Sizing:

  • Match instance types to workload
  • Use Compute Optimizer recommendations
  • Monitor utilization metrics
  • Savings: 20-40% typical

Graviton Instances:

  • ARM-based processors
  • Up to 40% better price-performance than comparable x86 instances
  • Support for many workloads
  • Use for: Compatible workloads, cost optimization

Lambda vs EC2:

  • Lambda: Pay per request, no idle costs
  • EC2: Pay per hour, idle costs
  • Use Lambda for: Sporadic, event-driven workloads
  • Use EC2 for: Consistent, long-running workloads

⭐ Must Know:

  • Right-sizing saves 20-40% on compute
  • Graviton provides up to 40% better price-performance
  • Lambda eliminates idle costs for sporadic workloads
  • Use Compute Optimizer for recommendations

Storage Cost Optimization

S3 Storage Classes:

  • S3 Standard: Frequent access, highest cost
  • S3 Intelligent-Tiering: Automatic tiering, no retrieval fees
  • S3 Standard-IA: Infrequent access, lower storage cost
  • S3 One Zone-IA: Single AZ, lowest IA cost
  • S3 Glacier: Archive, very low cost, minutes-to-hours retrieval
  • S3 Glacier Deep Archive: Lowest cost, 12-hour retrieval

Lifecycle Policies:

  • Automatically transition objects between storage classes
  • Delete old objects
  • Example: Standard → IA after 30 days → Glacier after 90 days → Delete after 365 days
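The example lifecycle rule above can be self-tested as an age-to-class mapping. Illustrative only — real lifecycle policies are JSON rules attached to the bucket, and the class names below follow the S3 API constants:

```python
def storage_class_for_age(age_days: int) -> str:
    """Which class the example policy puts an object in, by object age.

    Mirrors: Standard → Standard-IA at 30 days → Glacier at 90 days
    → deleted (expired) at 365 days.
    """
    if age_days >= 365:
        return "EXPIRED"
    if age_days >= 90:
        return "GLACIER"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"

print(storage_class_for_age(10))   # STANDARD
print(storage_class_for_age(45))   # STANDARD_IA
print(storage_class_for_age(200))  # GLACIER
print(storage_class_for_age(400))  # EXPIRED
```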

⭐ Must Know:

  • S3 Intelligent-Tiering for unknown access patterns
  • Lifecycle policies for automatic cost optimization
  • Glacier for long-term archive
  • Delete unused EBS volumes and snapshots

Chapter 2 Summary

What We Covered

✅ Task 2.1: Deployment Strategy

  • Infrastructure as Code (CloudFormation)
  • CI/CD pipelines (CodePipeline)
  • Deployment strategies (Blue/Green, Canary, Rolling)

✅ Task 2.2: Business Continuity

  • Multi-region architectures (Active-Passive, Active-Active)
  • Database replication (RDS, Aurora, DynamoDB)
  • Disaster recovery testing

✅ Task 2.3: Security Controls

  • Defense in depth (network, identity, data, monitoring)
  • Compliance (Config, Systems Manager, Secrets Manager)
  • Encryption strategies

✅ Task 2.4: Reliability

  • Loosely coupled architectures (SQS, SNS, EventBridge)
  • Auto Scaling patterns
  • High availability patterns

✅ Task 2.5: Performance

  • Caching strategies (CloudFront, ElastiCache, DAX)
  • Database performance (Read Replicas, RDS Proxy)
  • Content delivery optimization

✅ Task 2.6: Cost Optimization

  • Compute optimization (right-sizing, Graviton, Lambda)
  • Storage optimization (S3 storage classes, lifecycle policies)
  • Monitoring and analysis

Critical Takeaways

  1. Deployment: Use IaC for repeatable deployments, Blue/Green for zero-downtime, Canary for risk reduction

  2. Business Continuity: Active-Active for lowest RTO, Aurora Global Database for global applications

  3. Security: Implement defense in depth, encrypt data at rest and in transit, monitor continuously

  4. Reliability: Decouple with SQS/SNS, use Auto Scaling, implement multi-AZ deployments

  5. Performance: Cache aggressively (CloudFront, ElastiCache), use Read Replicas for read scaling

  6. Cost: Right-size resources, use S3 Intelligent-Tiering, leverage Graviton instances

Self-Assessment Checklist

  • Explain CloudFormation template structure
  • Describe Blue/Green vs Canary deployment
  • Design multi-region Active-Active architecture
  • Implement defense in depth security
  • Choose appropriate caching strategy
  • Optimize costs using right-sizing and storage classes

Practice Questions

  • Domain 2 Bundle 1: Questions 1-50 (target: 70%+)
  • Domain 2 Bundle 2: Questions 1-50 (target: 75%+)
  • Domain 2 Bundle 3: Questions 1-50 (target: 75%+)

Next Steps: Continue to Domain 3 (Continuous Improvement) in file 04_domain_3_continuous_improvement.


Chapter 3: Continuous Improvement for Existing Solutions

Domain Weight: 25% of exam

Chapter Overview

This domain focuses on improving existing AWS solutions. You'll learn how to enhance operational excellence, security, performance, reliability, and cost efficiency of running systems.

What you'll learn:

  • Improve operational excellence through automation and monitoring
  • Enhance security posture of existing systems
  • Optimize performance of running applications
  • Increase reliability and resilience
  • Identify and implement cost optimizations

Time to complete: 10-12 hours

Prerequisites: Chapters 0-2 (Fundamentals, Domain 1, Domain 2)

Exam Weight: 25% (approximately 16 questions on the actual exam)


Task 3.1: Improve Operational Excellence

Key Concepts

Monitoring and Observability

CloudWatch Metrics:

  • Standard metrics (CPU, network, disk)
  • Custom metrics (application-specific)
  • Metric math (combine metrics)
  • Use for: Performance monitoring, capacity planning

CloudWatch Logs:

  • Centralized log aggregation
  • Log Insights for querying
  • Metric filters (create metrics from logs)
  • Use for: Application logging, troubleshooting

CloudWatch Alarms:

  • Threshold-based alerts
  • Composite alarms (multiple conditions)
  • Actions (SNS, Auto Scaling, EC2 actions)
  • Use for: Proactive alerting, automated responses

AWS X-Ray:

  • Distributed tracing
  • Service map visualization
  • Performance bottleneck identification
  • Use for: Microservices debugging, performance analysis

⭐ Must Know:

  • CloudWatch for metrics and logs
  • X-Ray for distributed tracing
  • Use metric filters to create metrics from logs
  • Composite alarms for complex conditions
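A metric filter turns matching log events into a numeric metric that an alarm can watch. The idea in miniature — here a regex stands in for CloudWatch's own filter-pattern syntax:

```python
import re

def error_metric_from_logs(log_lines: list[str]) -> int:
    """Count log events matching a filter pattern, like a metric filter
    publishing an ErrorCount metric datapoint."""
    pattern = re.compile(r"\bERROR\b")
    return sum(1 for line in log_lines if pattern.search(line))

logs = [
    "2024-01-01 INFO request served",
    "2024-01-01 ERROR upstream timeout",
    "2024-01-01 ERROR db connection refused",
]
print(error_metric_from_logs(logs))  # 2 -- the value an alarm could threshold on
```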

Automation and Remediation

EventBridge Rules:

  • Event-driven automation
  • Schedule-based automation
  • Filter and route events
  • Use for: Automated responses, scheduled tasks

Systems Manager Automation:

  • Runbooks for common tasks
  • Automated patching
  • Configuration management
  • Use for: Operational tasks, compliance

Lambda for Automation:

  • Event-driven functions
  • Automated remediation
  • Custom automation logic
  • Use for: Custom automation, integrations

Example Automation Pattern:

CloudWatch Alarm (High CPU) 
  → EventBridge Rule 
  → Lambda Function 
  → Auto Scaling (add instances)
  → SNS Notification (alert team)
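The Lambda step in the pattern above is just event-in, decision-out. A hedged sketch of such a handler (event fields and return shape are my own; in a real deployment the two actions would be boto3 calls to Auto Scaling and SNS):

```python
def remediation_handler(event: dict) -> dict:
    """On a high-CPU alarm event, decide to add capacity and notify the team."""
    alarm = event.get("alarmName", "")
    if event.get("state") == "ALARM" and "HighCPU" in alarm:
        return {"action": "scale_out", "add_instances": 2, "notify": "ops-team"}
    return {"action": "none"}  # ignore OK/INSUFFICIENT_DATA transitions

event = {"alarmName": "HighCPU-web-tier", "state": "ALARM"}
print(remediation_handler(event))  # {'action': 'scale_out', ...}
```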

⭐ Must Know:

  • EventBridge for event-driven automation
  • Systems Manager for operational automation
  • Lambda for custom automation logic
  • Automate common operational tasks

CI/CD Improvements

Automated Testing:

  • Unit tests in build stage
  • Integration tests in test stage
  • Security scanning (SAST, DAST)
  • Use for: Quality assurance, security

Deployment Automation:

  • Blue/Green deployments
  • Canary deployments with automatic rollback
  • Feature flags for gradual rollout
  • Use for: Safe deployments, quick rollback

Infrastructure Testing:

  • CloudFormation drift detection
  • Config rules for compliance
  • Automated remediation
  • Use for: Infrastructure compliance, drift prevention

⭐ Must Know:

  • Automate testing in CI/CD pipeline
  • Use Blue/Green or Canary for safe deployments
  • Implement automatic rollback on failures
  • Test infrastructure compliance continuously

Task 3.2: Improve Security

Key Concepts

Security Posture Assessment

AWS Security Hub:

  • Centralized security findings
  • Security standards (CIS, PCI-DSS)
  • Automated compliance checks
  • Use for: Security posture assessment, compliance

IAM Access Analyzer:

  • Identify unintended access
  • External access to resources
  • Unused access (last accessed information)
  • Use for: Access review, least privilege

AWS Trusted Advisor:

  • Security checks (open ports, IAM usage)
  • Cost optimization
  • Performance recommendations
  • Use for: Best practice recommendations

⭐ Must Know:

  • Security Hub for centralized security monitoring
  • IAM Access Analyzer for access review
  • Trusted Advisor for security recommendations
  • Regular security assessments

Secrets Management

AWS Secrets Manager:

  • Automatic secret rotation
  • Integration with RDS, Redshift
  • Fine-grained access control
  • Use for: Database credentials, API keys

Systems Manager Parameter Store:

  • Hierarchical parameter storage
  • Encryption with KMS
  • Version history
  • Use for: Configuration parameters, non-rotating secrets

Best Practices:

  • Never hardcode secrets in code
  • Rotate secrets regularly (automatic with Secrets Manager)
  • Use IAM roles for applications
  • Audit secret access with CloudTrail

⭐ Must Know:

  • Secrets Manager for automatic rotation
  • Parameter Store for configuration
  • Never hardcode secrets
  • Rotate secrets regularly

Patch Management

Systems Manager Patch Manager:

  • Automated patching
  • Patch baselines (which patches to apply)
  • Maintenance windows (when to patch)
  • Use for: OS and application patching

Patching Strategy:

  1. Development: Patch immediately, test applications
  2. Staging: Patch after dev validation
  3. Production: Patch after staging validation, use maintenance windows

⭐ Must Know:

  • Automate patching with Systems Manager
  • Test patches in non-production first
  • Use maintenance windows for production
  • Monitor patch compliance

Task 3.3: Improve Performance

Key Concepts

Performance Monitoring

CloudWatch Metrics:

  • CPU utilization
  • Network throughput
  • Disk I/O
  • Application-specific metrics

Application Performance Monitoring (APM):

  • X-Ray for distributed tracing
  • CloudWatch Application Insights
  • Third-party APM tools (New Relic, Datadog)

Database Performance:

  • RDS Performance Insights
  • DynamoDB metrics (throttling, latency)
  • ElastiCache metrics (hit rate, evictions)

⭐ Must Know:

  • Monitor key performance metrics
  • Use X-Ray for microservices performance
  • RDS Performance Insights for database optimization
  • Identify bottlenecks before they impact users

Performance Optimization Techniques

Caching:

  • Add CloudFront for static content
  • Add ElastiCache for database queries
  • Add DAX for DynamoDB reads
  • Impact: 10-100x latency reduction

Database Optimization:

  • Add Read Replicas for read scaling
  • Optimize queries (indexes, query plans)
  • Upgrade instance type
  • Impact: 2-10x performance improvement

Compute Optimization:

  • Upgrade to newer instance types (Graviton)
  • Use placement groups for low latency
  • Optimize application code
  • Impact: 20-50% performance improvement

⭐ Must Know:

  • Caching provides biggest performance gains
  • Read Replicas for read-heavy workloads
  • Newer instance types often faster and cheaper
  • Optimize application code first (often cheapest)
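Why caching gives the biggest gains falls out of simple weighted-average math: effective latency = hit_rate × cache latency + (1 − hit_rate) × origin latency. The figures below are illustrative, not measured AWS numbers:

```python
def effective_latency_ms(hit_rate: float, cache_ms: float, origin_ms: float) -> float:
    """Average read latency with a cache in front of a slower origin."""
    return hit_rate * cache_ms + (1 - hit_rate) * origin_ms

# 90% hit rate, 0.5 ms cache reads, 20 ms database reads:
print(round(effective_latency_ms(0.90, 0.5, 20.0), 2))  # 2.45 ms vs 20 ms uncached
```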

Task 3.4: Improve Reliability

Key Concepts

Identifying Single Points of Failure

Common SPOFs:

  • Single AZ deployment
  • Single NAT Gateway
  • Single database instance
  • Single region deployment

Remediation:

  • Deploy across multiple AZs
  • Add NAT Gateway per AZ
  • Enable RDS Multi-AZ
  • Implement multi-region for critical systems

⭐ Must Know:

  • Identify and eliminate SPOFs
  • Multi-AZ for high availability
  • Multi-region for highest availability
  • Test failover procedures

Chaos Engineering

Principles:

  • Intentionally inject failures
  • Test system resilience
  • Identify weaknesses before production incidents
  • Build confidence in system reliability

AWS Fault Injection Simulator (FIS):

  • Managed chaos engineering service
  • Pre-built fault injection scenarios
  • Safe guardrails (stop conditions)
  • Use for: Resilience testing, DR validation

Common Experiments:

  • Terminate EC2 instances
  • Throttle API calls
  • Inject network latency
  • Fail over database

⭐ Must Know:

  • Test failure scenarios regularly
  • Use FIS for controlled chaos engineering
  • Validate DR procedures through testing
  • Build confidence in system resilience

Task 3.5: Identify Cost Optimizations

Key Concepts

Cost Analysis

Cost Explorer:

  • Visualize spending trends
  • Filter by service, account, tag
  • Forecast future costs
  • Use for: Cost analysis, trend identification

Cost Anomaly Detection:

  • Machine learning detects unusual spending
  • Automatic alerts
  • Root cause analysis
  • Use for: Unexpected cost spikes

Cost Allocation Tags:

  • Tag resources by project, team, environment
  • Track costs by business unit
  • Chargeback/showback
  • Use for: Cost attribution, accountability

⭐ Must Know:

  • Use Cost Explorer for analysis
  • Enable Cost Anomaly Detection
  • Implement tagging strategy
  • Regular cost reviews

Optimization Opportunities

Compute:

  • Right-size over-provisioned instances (20-40% savings)
  • Use Spot Instances for fault-tolerant workloads (up to 90% savings)
  • Purchase RIs/Savings Plans for steady workloads (up to 75% savings)
  • Migrate to Graviton instances (up to 40% better price-performance)

Storage:

  • Delete unused EBS volumes and snapshots
  • Use S3 Intelligent-Tiering (automatic cost optimization)
  • Implement S3 lifecycle policies
  • Use EFS Intelligent-Tiering

Database:

  • Right-size database instances
  • Use Aurora Serverless for variable workloads
  • Delete unused RDS snapshots
  • Use DynamoDB On-Demand for unpredictable workloads

Network:

  • Use VPC Endpoints to avoid NAT Gateway costs
  • Optimize data transfer (keep traffic within region)
  • Use CloudFront to reduce origin load

⭐ Must Know:

  • Right-sizing provides 20-40% savings
  • Spot Instances for up to 90% savings
  • S3 Intelligent-Tiering for automatic optimization
  • VPC Endpoints eliminate NAT Gateway costs
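The compute levers above can be combined into a rough cost estimate. This sketch uses the headline discounts from this section (up to 90% for Spot, up to 75% for RIs/Savings Plans) as illustrative best cases — actual discounts vary by instance family, term, and market:

```python
def optimized_monthly_cost(on_demand_cost: float, spot_fraction: float,
                           commit_fraction: float) -> float:
    """Estimated monthly compute cost after moving spend off on-demand.

    spot_fraction / commit_fraction: shares of spend moved to each pricing
    model; the remainder stays on-demand.
    """
    assert spot_fraction + commit_fraction <= 1.0
    remaining = 1.0 - spot_fraction - commit_fraction
    return on_demand_cost * (spot_fraction * 0.10      # Spot pays ~10% (up to 90% off)
                             + commit_fraction * 0.25  # commitments pay ~25% (up to 75% off)
                             + remaining * 1.0)

# $10,000/month: 30% to Spot, 50% to Savings Plans, 20% stays on-demand:
print(round(optimized_monthly_cost(10_000, 0.30, 0.50), 2))  # 3550.0
```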

Chapter 3 Summary

What We Covered

✅ Task 3.1: Operational Excellence

  • Monitoring (CloudWatch, X-Ray)
  • Automation (EventBridge, Systems Manager, Lambda)
  • CI/CD improvements

✅ Task 3.2: Security

  • Security posture assessment (Security Hub, IAM Access Analyzer)
  • Secrets management (Secrets Manager, Parameter Store)
  • Patch management (Systems Manager)

✅ Task 3.3: Performance

  • Performance monitoring (CloudWatch, X-Ray, Performance Insights)
  • Optimization techniques (caching, database, compute)

✅ Task 3.4: Reliability

  • Eliminate single points of failure
  • Chaos engineering (FIS)
  • Failover testing

✅ Task 3.5: Cost Optimization

  • Cost analysis (Cost Explorer, Anomaly Detection)
  • Optimization opportunities (compute, storage, database, network)

Critical Takeaways

  1. Operational Excellence: Automate monitoring, alerting, and remediation

  2. Security: Continuous assessment, secrets management, automated patching

  3. Performance: Monitor continuously, cache aggressively, optimize databases

  4. Reliability: Eliminate SPOFs, test failures, implement multi-AZ/multi-region

  5. Cost: Right-size resources, use Spot/RIs, implement tagging, regular reviews

Self-Assessment Checklist

  • Set up CloudWatch monitoring and alarms
  • Implement automated remediation with EventBridge + Lambda
  • Use Secrets Manager for credential rotation
  • Identify and eliminate single points of failure
  • Implement caching for performance improvement
  • Right-size resources for cost optimization

Practice Questions

  • Domain 3 Bundle 1: Questions 1-50 (target: 70%+)
  • Domain 3 Bundle 2: Questions 1-50 (target: 75%+)

Next Steps: Continue to Domain 4 (Migration & Modernization) in file 05_domain_4_migration_modernization.


Chapter 4: Accelerate Workload Migration and Modernization

Domain Weight: 20% of exam

Chapter Overview

This domain focuses on migrating existing workloads to AWS and modernizing applications. You'll learn migration strategies, tools, and modernization patterns.

What you'll learn:

  • Select workloads for migration
  • Determine optimal migration approach
  • Design new architectures for migrated workloads
  • Identify modernization opportunities

Time to complete: 8-10 hours

Prerequisites: Chapters 0-3

Exam Weight: 20% (approximately 13 questions on the actual exam)


Task 4.1: Select Workloads for Migration

Key Concepts

The 7 Rs of Migration

1. Retire:

  • Decommission applications no longer needed
  • Savings: Eliminate costs entirely
  • Use for: Redundant, unused applications

2. Retain:

  • Keep on-premises (not ready for migration)
  • Reasons: Compliance, latency, cost
  • Use for: Applications requiring on-premises

3. Rehost (Lift and Shift):

  • Move to AWS without changes
  • Speed: Fastest migration
  • Use for: Quick migration, minimal risk

4. Relocate:

  • Move to AWS with minimal changes (VMware Cloud on AWS)
  • Speed: Fast, automated
  • Use for: VMware environments

5. Repurchase:

  • Replace with SaaS
  • Example: Exchange → Microsoft 365
  • Use for: Standard business applications

6. Replatform (Lift, Tinker, and Shift):

  • Minor optimizations during migration
  • Example: Self-managed database → RDS
  • Use for: Gain cloud benefits with minimal changes

7. Refactor/Re-architect:

  • Redesign application for cloud-native
  • Example: Monolith → microservices
  • Use for: Maximum cloud benefits, long-term

⭐ Must Know:

  • Rehost: Fastest, least benefit
  • Replatform: Balance of speed and benefit
  • Refactor: Slowest, most benefit
  • Choose based on business priorities
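"Choose based on business priorities" can be drilled as a lookup. A deliberately rough decision aid (the priority labels are my own shorthand; real assessments also weigh dependencies, licensing, and compliance):

```python
def suggest_migration_strategy(priority: str) -> str:
    """Map a dominant business priority to one of the 7 Rs (illustrative)."""
    return {
        "speed": "Rehost (lift and shift)",
        "balance": "Replatform (lift, tinker, and shift)",
        "cloud_native": "Refactor / Re-architect",
        "vmware": "Relocate (VMware Cloud on AWS)",
        "saas_available": "Repurchase",
        "unused": "Retire",
        "must_stay_onprem": "Retain",
    }[priority]

print(suggest_migration_strategy("balance"))  # Replatform (lift, tinker, and shift)
```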

Migration Assessment

AWS Migration Hub:

  • Centralized migration tracking
  • Discover on-premises resources
  • Track migration progress
  • Use for: Migration planning and tracking

AWS Application Discovery Service:

  • Discover on-premises applications
  • Map dependencies
  • Collect utilization data
  • Use for: Migration planning

Migration Evaluator:

  • Analyze on-premises environment
  • Create business case for migration
  • TCO analysis
  • Use for: Business case development

⭐ Must Know:

  • Discovery before migration
  • Map dependencies
  • Create business case
  • Track migration progress

Task 4.2: Determine Migration Approach

Key Concepts

Data Migration Tools

AWS DataSync:

  • Automated data transfer
  • On-premises to AWS (S3, EFS, FSx)
  • Up to 10 Gbps per agent
  • Use for: Large-scale file migrations

AWS Transfer Family:

  • SFTP, FTPS, FTP to S3
  • Managed service
  • Existing workflows
  • Use for: Partner file transfers

AWS Snow Family:

  • Snowcone: 8 TB, edge computing
  • Snowball Edge: 80 TB, compute capable
  • Snowmobile: 100 PB, exabyte-scale
  • Use for: Offline data transfer, limited bandwidth

Database Migration Service (DMS):

  • Migrate databases to AWS
  • Homogeneous (Oracle → Oracle) or heterogeneous (Oracle → Aurora)
  • Continuous replication
  • Use for: Database migrations

⭐ Must Know:

  • DataSync for file migrations
  • Snow Family for offline transfer
  • DMS for database migrations
  • Choose based on data size and bandwidth
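"Choose based on data size and bandwidth" comes down to transfer-time arithmetic, a common exam calculation: if online transfer takes weeks or months, ship Snow Family devices instead. Numbers here are idealized (no protocol overhead beyond the utilization factor):

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 0.8) -> float:
    """Days to move data over the network at a given link speed."""
    bits = data_tb * 1e12 * 8                       # TB → bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400

# 500 TB over a 1 Gbps link at 80% utilization:
print(round(transfer_days(500, 1.0), 1))  # ~57.9 days -> ship a Snowball instead
```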

Application Migration Tools

AWS Application Migration Service (MGN):

  • Automated lift-and-shift
  • Continuous replication
  • Minimal downtime
  • Use for: Server migrations

AWS Server Migration Service (SMS):

  • Automated VM migration (legacy, use MGN instead)
  • Incremental replication
  • Use for: Legacy migrations (prefer MGN)

CloudEndure Migration:

  • Continuous replication
  • Automated conversion
  • Use for: Large-scale migrations (now part of MGN)

⭐ Must Know:

  • MGN for server migrations (preferred)
  • Continuous replication minimizes downtime
  • Automated conversion to AWS formats
  • Test migrations before cutover

Task 4.3: Determine New Architecture

Key Concepts

Compute Modernization

EC2 → Containers:

  • Package applications in containers
  • Use ECS or EKS
  • Benefits: Portability, efficiency, scalability

EC2 → Serverless:

  • Migrate to Lambda
  • Event-driven architecture
  • Benefits: No server management, pay per use

Monolith → Microservices:

  • Break into smaller services
  • Independent deployment
  • Benefits: Scalability, agility, resilience

⭐ Must Know:

  • Containers for portability
  • Serverless for event-driven workloads
  • Microservices for scalability
  • Choose based on application characteristics

Storage Modernization

File Servers → EFS/FSx:

  • Managed file systems
  • Elastic capacity
  • Benefits: No management, high availability

Block Storage → EBS/S3:

  • EBS for databases, applications
  • S3 for objects, backups
  • Benefits: Durability, scalability

Tape Backups → S3 Glacier:

  • Cloud-based archive
  • Lower cost than tape
  • Benefits: Durability, accessibility

⭐ Must Know:

  • EFS for shared file storage
  • FSx for Windows/Lustre workloads
  • S3 for object storage
  • Glacier for archive

Database Modernization

Self-Managed → RDS/Aurora:

  • Managed database service
  • Automated backups, patching
  • Benefits: Reduced management, high availability

Relational → NoSQL:

  • DynamoDB for key-value
  • DocumentDB for documents
  • Benefits: Scale, performance, flexibility

Commercial → Open Source:

  • Oracle → Aurora PostgreSQL
  • SQL Server → Aurora PostgreSQL (Babelfish eases the migration)
  • Benefits: Cost savings, no licensing

⭐ Must Know:

  • RDS/Aurora for managed relational
  • DynamoDB for NoSQL
  • Aurora for high performance
  • Consider licensing costs

Task 4.4: Modernization Opportunities

Key Concepts

Serverless Adoption

Lambda Functions:

  • Event-driven compute
  • No server management
  • Pay per request
  • Use for: APIs, data processing, automation

API Gateway:

  • Managed API service
  • Throttling, caching, authentication
  • Use for: RESTful APIs, WebSocket APIs

Step Functions:

  • Orchestrate Lambda functions
  • Visual workflows
  • Use for: Complex workflows, long-running processes

⭐ Must Know:

  • Lambda for event-driven workloads
  • API Gateway for APIs
  • Step Functions for orchestration
  • Serverless reduces operational overhead

Container Adoption

Amazon ECS:

  • AWS-native container orchestration
  • Fargate for serverless containers
  • Use for: Docker containers, AWS-native

Amazon EKS:

  • Managed Kubernetes
  • Compatible with standard Kubernetes
  • Use for: Kubernetes workloads, portability

AWS Fargate:

  • Serverless compute for containers
  • No EC2 management
  • Use for: Simplified container operations

⭐ Must Know:

  • ECS for AWS-native containers
  • EKS for Kubernetes
  • Fargate for serverless containers
  • Choose based on orchestration preference

Decoupling and Integration

SQS for Decoupling:

  • Message queues
  • Decouple components
  • Benefits: Resilience, scalability

SNS for Fan-Out:

  • Pub/sub messaging
  • Multiple subscribers
  • Benefits: Event distribution

EventBridge for Events:

  • Event bus
  • Route events between services
  • Benefits: Event-driven architecture

⭐ Must Know:

  • SQS for asynchronous processing
  • SNS for notifications
  • EventBridge for event routing
  • Decoupling improves resilience
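
The fan-out and buffering ideas above can be sketched in-process with standard-library queues (class and variable names are hypothetical; against AWS you would use real SNS topics and SQS queues via boto3):

```python
from queue import Queue

class Topic:
    """SNS-style topic: every published message is copied to all subscribers."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def publish(self, message):
        # Fan-out: each subscriber queue gets its own copy, so consumers
        # process independently, at their own pace, without knowing
        # about each other.
        for q in self.subscribers:
            q.put(message)

orders = Topic()
billing, shipping = Queue(), Queue()   # SQS-style buffers decoupling consumers
orders.subscribe(billing)
orders.subscribe(shipping)

orders.publish({"order_id": 42})
```

If the shipping consumer is slow or down, messages simply wait in its buffer; the billing path is unaffected. That isolation is the resilience benefit of decoupling.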

Chapter 4 Summary

What We Covered

✅ Task 4.1: Select Workloads

  • 7 Rs of migration (Retire, Retain, Rehost, Relocate, Repurchase, Replatform, Refactor)
  • Migration assessment (Migration Hub, Application Discovery Service)

✅ Task 4.2: Migration Approach

  • Data migration (DataSync, Snow Family, DMS)
  • Application migration (MGN, SMS)

✅ Task 4.3: New Architecture

  • Compute modernization (Containers, Serverless, Microservices)
  • Storage modernization (EFS, FSx, S3)
  • Database modernization (RDS, Aurora, DynamoDB)

✅ Task 4.4: Modernization

  • Serverless adoption (Lambda, API Gateway, Step Functions)
  • Container adoption (ECS, EKS, Fargate)
  • Decoupling (SQS, SNS, EventBridge)

Critical Takeaways

  1. Migration Strategy: Choose based on business priorities (speed vs optimization)

  2. Data Migration: DataSync for files, DMS for databases, Snow for offline

  3. Application Migration: MGN for lift-and-shift, minimize downtime

  4. Modernization: Containers for portability, Serverless for event-driven, Microservices for scale

  5. Decoupling: SQS/SNS/EventBridge for resilient architectures

Self-Assessment Checklist

  • Understand 7 Rs of migration
  • Choose appropriate migration tools
  • Design modernized architectures
  • Identify decoupling opportunities

Practice Questions

  • Domain 4 Bundle 1: Questions 1-50 (target: 70%+)

Next Steps: Continue to Integration chapter in file 06_integration.


Integration & Cross-Domain Scenarios

Overview

This chapter ties together concepts from all four domains, showing how they integrate in real-world scenarios.


Cross-Domain Scenario 1: Global E-Commerce Platform

Requirements:

  • Global user base (Domain 1: Network)
  • 99.99% availability (Domain 1: Reliability)
  • PCI-DSS compliance (Domain 1: Security)
  • Zero-downtime deployments (Domain 2: Deployment)
  • Auto-scaling for traffic spikes (Domain 2: Performance)
  • Cost optimization (Domain 1: Cost)

Architecture:

Global:
├── Route 53 (latency-based routing)
├── CloudFront (edge caching)
└── WAF + Shield (web exploit filtering, DDoS protection)

Multi-Region (us-east-1, eu-west-1, ap-southeast-1):
├── Application Load Balancer
├── ECS Fargate (auto-scaling)
├── Aurora Global Database
├── ElastiCache Redis (session storage)
└── S3 (cross-region replication)

Security:
├── IAM roles (least privilege)
├── KMS (encryption at rest)
├── TLS (encryption in transit)
├── Security Hub (compliance monitoring)
└── GuardDuty (threat detection)

Deployment:
├── CodePipeline (CI/CD)
├── Blue/Green deployment
└── Automated rollback

Monitoring:
├── CloudWatch (metrics, logs, alarms)
├── X-Ray (distributed tracing)
└── CloudTrail (audit logs)

Key Integration Points:

  • Route 53 + CloudFront (Domain 1 + Domain 2)
  • Aurora Global Database + Multi-Region (Domain 1 + Domain 2)
  • Security Hub + GuardDuty (Domain 1 + Domain 3)
  • CodePipeline + Blue/Green (Domain 2 + Domain 3)

Cross-Domain Scenario 2: Enterprise Hybrid Cloud

Requirements:

  • 50 AWS accounts (Domain 1: Multi-account)
  • Hybrid connectivity (Domain 1: Network)
  • Centralized security (Domain 1: Security)
  • Migrate 500 applications (Domain 4: Migration)
  • Continuous improvement (Domain 3: Operations)

Architecture:

Multi-Account:
├── AWS Organizations (50 accounts)
├── Control Tower (guardrails)
├── Service Control Policies
└── Consolidated billing

Network:
├── Transit Gateway (hub)
├── Direct Connect (10 Gbps)
├── VPN (backup)
└── Route 53 Resolver (hybrid DNS)

Security:
├── Security Hub (centralized)
├── GuardDuty (all accounts)
├── CloudTrail (organization trail)
└── Config (compliance)

Migration:
├── Migration Hub (tracking)
├── Application Discovery Service
├── MGN (server migration)
└── DMS (database migration)

Operations:
├── Systems Manager (patch management)
├── CloudWatch (centralized monitoring)
└── EventBridge (automated remediation)

Key Integration Points:

  • Organizations + Transit Gateway (Domain 1)
  • Security Hub + Multi-Account (Domain 1 + Domain 3)
  • Migration Hub + MGN (Domain 4)
  • Systems Manager + Multi-Account (Domain 3)

Common Integration Patterns

Pattern 1: Multi-Region Active-Active

Services Integrated:

  • Route 53 (global routing)
  • CloudFront (edge caching)
  • Aurora Global Database (data replication)
  • DynamoDB Global Tables (NoSQL replication)
  • S3 Cross-Region Replication (object storage)

Use Case: Global applications requiring low latency worldwide

Pattern 2: Event-Driven Architecture

Services Integrated:

  • EventBridge (event routing)
  • Lambda (event processing)
  • SQS (buffering)
  • SNS (notifications)
  • Step Functions (orchestration)

Use Case: Decoupled, scalable applications

Pattern 3: Data Lake Architecture

Services Integrated:

  • S3 (data storage)
  • Glue (ETL)
  • Athena (querying)
  • QuickSight (visualization)
  • Lake Formation (governance)

Use Case: Big data analytics, business intelligence

Pattern 4: Microservices Architecture

Services Integrated:

  • ECS/EKS (container orchestration)
  • API Gateway (API management)
  • Lambda (serverless functions)
  • DynamoDB (database)
  • ElastiCache (caching)

Use Case: Scalable, independently deployable services


Chapter Summary

Key Takeaways:

  1. Real-world solutions integrate multiple domains
  2. Network, security, and reliability are foundational
  3. Deployment and operations are continuous
  4. Migration and modernization are ongoing processes

Practice:

  • Full Practice Test 1 (65 questions, target: 75%+)
  • Full Practice Test 2 (65 questions, target: 75%+)
  • Full Practice Test 3 (65 questions, target: 75%+)

Next Steps: Review study strategies in file 07_study_strategies.


Study Strategies & Test-Taking Techniques

Effective Study Techniques

The 3-Pass Method

Pass 1: Understanding (Weeks 1-6)

  • Read each chapter thoroughly
  • Take notes on ⭐ Must Know items
  • Complete practice exercises
  • Focus on comprehension, not speed

Pass 2: Application (Weeks 7-8)

  • Review chapter summaries
  • Focus on decision frameworks
  • Practice full-length tests
  • Identify weak areas

Pass 3: Reinforcement (Weeks 9-10)

  • Review flagged items
  • Memorize key facts
  • Final practice tests
  • Build confidence

Active Learning Techniques

1. Teach Someone

  • Explain concepts out loud
  • Use analogies and examples
  • Identify gaps in understanding

2. Draw Diagrams

  • Recreate architectures from memory
  • Label components and connections
  • Explain data flows

3. Write Scenarios

  • Create your own question scenarios
  • Identify requirements and constraints
  • Choose appropriate solutions

4. Compare Options

  • Use comparison tables
  • Understand trade-offs
  • Know when to use each service

Test-Taking Strategies

Time Management

Exam Details:

  • Total time: 180 minutes (3 hours)
  • Total questions: 75 (65 scored + 10 unscored)
  • Time per question: ~2.4 minutes

Strategy:

  • First pass (120 min): Answer all questions you know
  • Second pass (40 min): Tackle flagged questions
  • Final pass (20 min): Review marked answers
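
The pacing above is worth verifying once so the numbers stick. A quick arithmetic check (a study aid, nothing more):

```python
TOTAL_MINUTES = 180
QUESTIONS = 75

# Average pace across all questions.
per_question = TOTAL_MINUTES / QUESTIONS   # 2.4 minutes each

# The three passes should account for the whole exam window.
passes = {"first": 120, "second": 40, "final": 20}

print(per_question)          # 2.4
print(sum(passes.values()))  # 180
```

Note that 2.4 minutes is an average: easy questions answered in under a minute bank time for the long multi-response scenarios.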

Question Analysis Method

Step 1: Read the scenario (30 seconds)

  • Identify: Company type, current situation, problem
  • Note: Key requirements and constraints

Step 2: Identify constraints (15 seconds)

  • Cost requirements (minimize cost, cost-effective)
  • Performance needs (low latency, high throughput)
  • Compliance requirements (HIPAA, PCI-DSS, GDPR)
  • Operational overhead (minimize management)

Step 3: Eliminate wrong answers (30 seconds)

  • Remove options that violate constraints
  • Eliminate technically incorrect options
  • Remove options that don't meet requirements

Step 4: Choose best answer (30 seconds)

  • Select option that best meets ALL requirements
  • Consider trade-offs
  • Choose most AWS-recommended solution

Handling Difficult Questions

When stuck:

  1. Eliminate obviously wrong answers
  2. Look for constraint keywords
  3. Choose most commonly recommended solution
  4. Flag and move on if unsure

Never:

  • Spend more than 3 minutes on one question initially
  • Leave questions unanswered (no penalty for guessing)
  • Second-guess yourself excessively

Memory Aids

Service Selection Mnemonics

Compute: "ELEF"

  • EC2: Full control
  • Lambda: Serverless
  • ECS: Containers
  • Fargate: Serverless containers

Storage: "SEEG"

  • S3: Objects
  • EBS: Blocks
  • EFS: Shared files
  • Glacier: Archive

Database: "RADE"

  • RDS: Relational
  • Aurora: High-performance relational
  • DynamoDB: NoSQL
  • ElastiCache: Cache

Decision Frameworks

Network Connectivity:

  • 2-5 VPCs → VPC Peering
  • 10+ VPCs → Transit Gateway
  • Service access → PrivateLink
  • High bandwidth hybrid → Direct Connect
  • Quick hybrid → VPN

DR Strategy:

  • Days RTO → Backup & Restore
  • Minutes RTO → Pilot Light or Warm Standby
  • Seconds RTO → Multi-Site Active-Active

Migration Strategy:

  • Fast → Rehost
  • Balanced → Replatform
  • Optimized → Refactor
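
The three decision frameworks above are effectively lookup tables, and rehearsing them that way makes exam recall faster. A sketch (thresholds and labels taken from the lists above; function names are invented for this study guide, not an API):

```python
def network_choice(vpcs=None, need=None):
    """Pick a connectivity option from the network decision framework."""
    if need == "service-access":
        return "PrivateLink"
    if need == "hybrid-high-bandwidth":
        return "Direct Connect"
    if need == "hybrid-quick":
        return "VPN"
    # VPC-to-VPC: small meshes peer directly, larger ones hub through TGW.
    return "VPC Peering" if vpcs is not None and vpcs <= 5 else "Transit Gateway"

def dr_strategy(rto):
    """Map an RTO requirement to a DR strategy."""
    return {"days": "Backup & Restore",
            "minutes": "Pilot Light or Warm Standby",
            "seconds": "Multi-Site Active-Active"}[rto]

def migration_strategy(priority):
    """Map the business priority (speed vs optimization) to one of the Rs."""
    return {"fast": "Rehost",
            "balanced": "Replatform",
            "optimized": "Refactor"}[priority]

print(network_choice(vpcs=12))     # Transit Gateway
print(dr_strategy("seconds"))      # Multi-Site Active-Active
print(migration_strategy("fast"))  # Rehost
```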

Common Exam Patterns

Pattern 1: "Most Cost-Effective"

Keywords: minimize cost, cost-effective, lowest cost

Approach:

  • Eliminate expensive options (Direct Connect, large instances)
  • Consider Spot Instances, RIs, Savings Plans
  • Look for managed services (reduce operational cost)
  • Choose S3 over EBS, Lambda over EC2 (when appropriate)

Pattern 2: "Minimize Operational Overhead"

Keywords: least operational overhead, minimize management, fully managed

Approach:

  • Choose managed services over self-managed
  • RDS over EC2 database
  • Lambda over EC2
  • Fargate over ECS on EC2

Pattern 3: "High Availability"

Keywords: highly available, fault-tolerant, 99.99% uptime

Approach:

  • Multi-AZ deployment
  • Auto Scaling
  • Load balancing
  • RDS Multi-AZ or Aurora

Pattern 4: "Lowest Latency"

Keywords: minimize latency, low latency, fastest response

Approach:

  • CloudFront for global users
  • ElastiCache for database queries
  • Read Replicas in multiple regions
  • Placement groups for HPC

Final Week Strategy

Day 7: Full Practice Test 1

  • Simulate exam conditions
  • Time yourself (180 minutes)
  • Target: 70%+
  • Review ALL explanations (even correct answers)

Day 6: Review Mistakes

  • Identify weak areas
  • Review relevant chapters
  • Focus on ⭐ Must Know items
  • Create summary notes

Day 5: Full Practice Test 2

  • Simulate exam conditions
  • Target: 75%+
  • Note improvement areas

Day 4: Targeted Review

  • Focus on weak domains
  • Review decision frameworks
  • Practice drawing architectures

Day 3: Domain-Focused Tests

  • Take domain-specific tests for weak areas
  • Review explanations thoroughly

Day 2: Full Practice Test 3

  • Final full-length test
  • Target: 75%+
  • Build confidence

Day 1: Light Review

  • Review cheat sheet (30 minutes)
  • Skim chapter summaries (1 hour)
  • Review flagged items (30 minutes)
  • Relax, get 8 hours sleep

Exam Day Tips

Morning Routine

  • Light review of cheat sheet (30 min max)
  • Eat a good breakfast
  • Arrive 30 minutes early
  • Bring ID and confirmation

Brain Dump Strategy

When exam starts, immediately write down:

  • 7 Rs of migration
  • DR strategies (Backup/Restore, Pilot Light, Warm Standby, Multi-Site)
  • Network decision matrix
  • Key service limits

During Exam

  • Read questions carefully (don't skim)
  • Identify constraints first
  • Eliminate wrong answers
  • Flag difficult questions
  • Manage time (2.4 min per question)
  • Review flagged questions

Stay Calm

  • Don't panic if questions seem hard
  • Trust your preparation
  • Use elimination strategy
  • Make educated guesses (no penalty)

Next Steps: Review final checklist in file 08_final_checklist.


Final Week Checklist

7 Days Before Exam

Knowledge Audit

Domain 1: Organizational Complexity (26%)

  • VPC Peering vs Transit Gateway vs PrivateLink
  • Direct Connect vs VPN
  • IAM policies and least privilege
  • KMS encryption and key management
  • RTO/RPO and DR strategies
  • Multi-account with Organizations
  • Cost optimization strategies

Domain 2: New Solutions (29%)

  • CloudFormation and IaC
  • CI/CD and deployment strategies
  • Multi-region architectures
  • Database replication strategies
  • Defense in depth security
  • Caching strategies
  • Cost optimization for new solutions

Domain 3: Continuous Improvement (25%)

  • CloudWatch monitoring and alarms
  • Automated remediation
  • Security posture assessment
  • Performance optimization
  • Eliminating single points of failure
  • Cost analysis and optimization

Domain 4: Migration & Modernization (20%)

  • 7 Rs of migration
  • Migration tools (DataSync, DMS, MGN)
  • Modernization patterns
  • Serverless and container adoption

If you checked fewer than 80%: Review those specific chapters

Practice Test Marathon

  • Day 7: Full Practice Test 1 (target: 70%+)
  • Day 6: Review mistakes, study weak areas
  • Day 5: Full Practice Test 2 (target: 75%+)
  • Day 4: Review mistakes, focus on patterns
  • Day 3: Domain-focused tests for weak domains
  • Day 2: Full Practice Test 3 (target: 75%+)
  • Day 1: Review cheat sheet, relax

Day Before Exam

Final Review (2-3 hours max)

Hour 1: Cheat Sheet Review

  • Review all domain cheat sheets
  • Focus on decision frameworks
  • Memorize key facts

Hour 2: Chapter Summaries

  • Skim all chapter summaries
  • Review ⭐ Must Know items
  • Review quick reference cards

Hour 3: Flagged Items

  • Review your flagged topics
  • Clarify any remaining confusion
  • Build confidence

Don't: Try to learn new topics

Mental Preparation

  • Get 8 hours sleep
  • Prepare exam day materials (ID, confirmation)
  • Review testing center policies
  • Set multiple alarms
  • Plan route to testing center

Exam Day

Morning Routine

  • Light review of cheat sheet (30 min)
  • Eat a good breakfast
  • Arrive 30 minutes early
  • Bring valid ID
  • Bring exam confirmation

Brain Dump Strategy

When exam starts, immediately write down on provided materials:

Network Decision Matrix:

  • 2-5 VPCs → VPC Peering
  • 10+ VPCs → Transit Gateway
  • Service access → PrivateLink
  • High bandwidth → Direct Connect
  • Quick setup → VPN

DR Strategies:

  • Backup & Restore: Days RTO, Hours RPO
  • Pilot Light: Minutes RTO, Minutes RPO
  • Warm Standby: Minutes RTO, Seconds RPO
  • Multi-Site: Seconds RTO, near-zero RPO

7 Rs of Migration:

  • Retire, Retain, Rehost, Relocate, Repurchase, Replatform, Refactor

Key Service Limits:

  • VPC Peering: 125 per VPC
  • Transit Gateway: 5,000 attachments
  • VPN: 1.25 Gbps per tunnel

During Exam

Time Management:

  • 180 minutes for 75 questions
  • ~2.4 minutes per question
  • First pass: Answer known questions (120 min)
  • Second pass: Tackle flagged questions (40 min)
  • Final pass: Review marked answers (20 min)

Question Strategy:

  1. Read scenario carefully
  2. Identify constraints (cost, performance, compliance)
  3. Eliminate wrong answers
  4. Choose best answer
  5. Flag if unsure, move on

Stay Calm:

  • Don't panic if questions seem hard
  • Trust your preparation
  • Use elimination strategy
  • Make educated guesses (no penalty)
  • Don't second-guess excessively

You're Ready When...

  • You score 75%+ consistently on full practice tests
  • You can explain key concepts without notes
  • You recognize question patterns instantly
  • You make decisions quickly using frameworks
  • You understand WHY answers are correct
  • You can draw key architectures from memory
  • You complete practice tests within time limits

Remember

Trust Your Preparation

  • You've studied thoroughly
  • You've practiced extensively
  • You know the material

Manage Your Time

  • Don't spend too long on one question
  • Flag and move on if stuck
  • Review flagged questions at end

Read Carefully

  • Identify constraints
  • Look for keywords
  • Eliminate wrong answers

Don't Overthink

  • First instinct often correct
  • Don't second-guess excessively
  • Trust your knowledge

After the Exam

Pass or Fail:

  • Results available immediately (preliminary)
  • Official results within 5 business days
  • Passing score: 750/1000

If You Pass:

  • Congratulations! 🎉
  • Certificate available in 5 business days
  • Valid for 3 years
  • Consider next certification (DevOps, Security, etc.)

If You Don't Pass:

  • Review score report (shows domain performance)
  • Identify weak areas
  • Study those domains thoroughly
  • Retake after 14 days
  • Many people pass on second attempt

Good luck on your exam! You've got this! 🚀


Appendices

Appendix A: Quick Reference Tables

Network Connectivity Decision Matrix

Scenario | Solution | Key Benefit
2-5 VPCs need to communicate | VPC Peering | Simple, low cost
10+ VPCs need to communicate | Transit Gateway | Scalable, transitive routing
Access specific service in another VPC | PrivateLink | Service-level security
High bandwidth to on-premises (>1 Gbps) | Direct Connect | Consistent performance
Quick hybrid connectivity (<1 Gbps) | Site-to-Site VPN | Fast setup, encrypted
Access AWS services from private subnets | VPC Endpoints | No NAT Gateway cost

DR Strategy Selection

RTO | RPO | Strategy | Cost | Use Case
Days | Hours | Backup & Restore | Lowest | Non-critical systems
Minutes | Minutes | Pilot Light | Low | Cost-sensitive, moderate RTO
Minutes | Seconds | Warm Standby | Medium | Business-critical
Seconds | Near-zero | Multi-Site Active-Active | Highest | Mission-critical

Migration Strategy (7 Rs)

Strategy | Speed | Benefit | Use Case
Retire | Instant | Eliminate cost | Unused applications
Retain | N/A | Keep on-premises | Not ready for cloud
Rehost | Fast | Quick migration | Lift and shift
Relocate | Fast | VMware migration | VMware environments
Repurchase | Medium | SaaS benefits | Standard business apps
Replatform | Medium | Some cloud benefits | Balance speed/benefit
Refactor | Slow | Maximum cloud benefits | Long-term optimization

Compute Service Selection

Service | Management | Use Case | Cost Model
EC2 | You manage | Full control needed | Per hour
Lambda | AWS manages | Event-driven, sporadic | Per request
ECS | You manage cluster | Docker containers | Per hour (EC2)
Fargate | AWS manages | Serverless containers | Per vCPU/memory
Elastic Beanstalk | AWS manages | PaaS, focus on code | Per hour (underlying EC2)

Storage Service Selection

Service | Type | Use Case | Durability
S3 | Object | Files, backups, data lakes | 11 9's
EBS | Block | EC2 instance storage | 99.8-99.9%
EFS | File | Shared file system | 11 9's
FSx | File | Windows/Lustre workloads | 11 9's
Glacier | Archive | Long-term archive | 11 9's

Database Service Selection

Service | Type | Use Case | Scaling
RDS | Relational | Traditional SQL databases | Vertical
Aurora | Relational | High-performance SQL | Horizontal reads
DynamoDB | NoSQL | Key-value, document | Horizontal
ElastiCache | Cache | In-memory caching | Horizontal
DocumentDB | NoSQL | MongoDB-compatible | Horizontal
Neptune | Graph | Graph databases | Horizontal

Appendix B: Key Service Limits

Networking Limits

Service | Limit | Notes
VPC Peering | 125 connections per VPC | Default is 50 (soft); 125 is the hard maximum
Transit Gateway | 5,000 attachments | Soft limit
VPN | 1.25 Gbps per tunnel | Hard limit; aggregate multiple tunnels (ECMP via Transit Gateway) for more
Direct Connect | 1/10/100 Gbps per dedicated connection | Multiple connections can be aggregated (LAG)
VPC Endpoints | 255 per VPC | Soft limit

Compute Limits

Service | Limit | Notes
EC2 instances | 20 On-Demand per region (legacy default) | Soft limit; now managed as vCPU-based quotas per instance family
Lambda concurrent executions | 1,000 per region | Soft limit, can be increased
ECS tasks | 1,000 per cluster | Soft limit

Storage Limits

Service | Limit | Notes
S3 buckets | 100 per account | Soft limit, can be increased
S3 object size | 5 TB | Hard limit
EBS volume size | 16 TiB (gp2/gp3/io1); 64 TiB (io2 Block Express) | Hard limit
EFS file system size | Unlimited | Elastic

Appendix C: Pricing Quick Reference

Compute Pricing (Approximate)

Service | Pricing Model | Example Cost
EC2 t3.medium | Per hour | $0.0416/hour
Lambda | Per request + duration | $0.20 per 1M requests + $0.0000166667/GB-second
Fargate | Per vCPU/memory | $0.04048/vCPU-hour + $0.004445/GB-hour
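
The Lambda figures above become more intuitive with a worked estimate. A hypothetical workload (5M requests/month, 200 ms average duration at 512 MB), ignoring the free tier:

```python
REQ_PRICE = 0.20 / 1_000_000   # $ per request
GB_SECOND = 0.0000166667       # $ per GB-second of execution

requests = 5_000_000
duration_s = 0.2               # 200 ms average
memory_gb = 0.5                # 512 MB

compute_cost = requests * duration_s * memory_gb * GB_SECOND  # ~$8.33
request_cost = requests * REQ_PRICE                           # $1.00
total = round(compute_cost + request_cost, 2)
print(total)   # 9.33
```

Roughly $9/month for five million invocations is why Lambda wins "most cost-effective" questions for sporadic, event-driven workloads that would otherwise keep an EC2 instance running 24/7.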

Storage Pricing (Approximate)

Service | Pricing Model | Example Cost
S3 Standard | Per GB/month | $0.023/GB
S3 Intelligent-Tiering | Per GB/month | $0.023/GB (frequent tier) + monitoring fee
S3 Glacier | Per GB/month | $0.004/GB
EBS gp3 | Per GB/month | $0.08/GB

Data Transfer Pricing

Transfer Type | Cost
Data IN to AWS | Free
Data OUT to internet | $0.09/GB (first 10 TB)
Data between AZs | $0.01/GB each direction
Data between regions | $0.02/GB
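
A quick worked example with the rates above (hypothetical monthly volumes: 2 TB of internet egress plus 500 GB of cross-AZ traffic). Note the cross-AZ rate applies in each direction, so a round trip is billed twice:

```python
EGRESS = 0.09     # $/GB out to internet, first 10 TB tier
CROSS_AZ = 0.01   # $/GB in EACH direction

egress_gb = 2000
cross_az_gb = 500  # volume sent one way, then returned

cost = round(egress_gb * EGRESS + cross_az_gb * CROSS_AZ * 2, 2)
print(cost)   # 190.0
```

Data transfer is a common hidden cost on the exam: keeping chatty components in the same AZ, or fronting egress-heavy content with CloudFront, are the standard levers for reducing it.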

Appendix D: Glossary

AZ (Availability Zone): One or more discrete data centers within a Region with redundant power, networking, and connectivity.

CIDR (Classless Inter-Domain Routing): Method for allocating IP addresses and routing (e.g., 10.0.0.0/16).

DR (Disaster Recovery): Process of restoring systems and data after a catastrophic failure.

IAM (Identity and Access Management): Service for controlling access to AWS resources.

IaC (Infrastructure as Code): Managing infrastructure through code rather than manual processes.

KMS (Key Management Service): Service for managing encryption keys.

NAT (Network Address Translation): Translating private IP addresses to public IP addresses.

RPO (Recovery Point Objective): Maximum acceptable data loss (time).

RTO (Recovery Time Objective): Maximum acceptable downtime (time).

SCP (Service Control Policy): Permission boundaries in AWS Organizations.

VPC (Virtual Private Cloud): Isolated network in AWS.


Appendix E: Additional Resources

Official AWS Resources

Practice Resources

  • Practice Test Bundles: Included with this study guide

    • Difficulty-based: 5 bundles
    • Full practice: 3 bundles
    • Domain-focused: 8 bundles
    • Service-focused: 10 bundles
  • Cheat Sheets: Included with this study guide

    • Located in:
    • Quick reference for final week review

Community Resources


Final Words


Remember


Stay Confident

  • Believe in yourself
  • Trust your knowledge
  • Stay calm during the exam

Good Luck!

You've put in the work. You're prepared. You've got this! 🚀


End of Study Guide