Comprehensive Study Materials & Key Concepts
Complete Learning Path for Certification Success
This study guide provides a structured learning path from fundamentals to exam readiness for the AWS Certified DevOps Engineer - Professional (DOP-C02) certification. Designed for both novices and experienced professionals, it teaches all concepts progressively while focusing exclusively on exam-relevant content. Extensive diagrams and visual aids are integrated throughout to enhance understanding and retention.
Target Audience: DevOps engineers, system administrators, and cloud professionals seeking professional-level AWS certification. Assumes basic AWS knowledge but builds comprehensive understanding from the ground up.
Exam Details:
Total Time: 8-12 weeks (2-3 hours daily)
Use checkboxes to track completion:
Throughout this guide, you'll see these visual markers:
This guide integrates with comprehensive practice test bundles:
Difficulty-Based Practice:
Full Practice Tests (3 bundles):
Domain-Focused Tests (12 bundles):
Service-Focused Tests (12 bundles):
While this guide is comprehensive, hands-on practice is essential:
Recommended Lab Exercises:
AWS Free Tier Usage:
Supplement this guide with official AWS documentation:
You're ready for the exam when you can:
Hour 1: New content reading and note-taking
Hour 2: Diagram study and hands-on practice
Hour 3: Practice questions and review
This comprehensive study guide contains everything you need to pass the DOP-C02 exam. The content is extensive and detailed, designed to build deep understanding rather than surface-level memorization.
Your journey starts with Chapter 0: Fundamentals. Take your time, be thorough, and trust the process. Thousands of professionals have successfully used structured approaches like this to achieve AWS certification.
Good luck on your certification journey!
Last Updated: October 2024
Guide Version: 1.0
Exam Version: DOP-C02
The AWS Certified DevOps Engineer - Professional (DOP-C02) exam assumes you have solid foundational knowledge in several key areas. This chapter ensures you have the essential background needed to understand the advanced DevOps concepts covered in the exam.
Prerequisites Checklist:
If you're missing any prerequisites: This chapter provides a comprehensive primer, but consider additional AWS fundamentals training if you're completely new to AWS.
What it is: DevOps is a cultural and technical movement that emphasizes collaboration between development and operations teams to deliver software faster, more reliably, and with higher quality.
Why it matters for this exam: The DOP-C02 exam tests your ability to implement DevOps practices using AWS services. Understanding the underlying principles helps you choose the right tools and approaches.
Real-world analogy: Think of DevOps like a modern assembly line where developers and operations work together seamlessly, rather than throwing work "over the wall" between departments.
Key DevOps Principles:
💡 Tip: Every question on the DOP-C02 exam relates back to these core principles. When in doubt, choose the answer that best embodies DevOps practices.
What it is: The SDLC is the process of planning, creating, testing, and deploying software applications. In DevOps, this process is highly automated and iterative.
Why it exists: Without a structured approach to software development, teams create inconsistent, unreliable software with unpredictable delivery timelines.
Traditional vs. DevOps SDLC:
Traditional SDLC (Waterfall):
DevOps SDLC (Agile/Continuous):
✅ Must Know: The exam heavily focuses on automating steps 3-6 (Build, Test, Release, Deploy) using AWS services.
The Challenge: Modern applications require dozens of AWS services working together. Understanding how they integrate is crucial for the exam.
The Solution: AWS provides a comprehensive set of services that cover every aspect of the DevOps lifecycle, from source code management to production monitoring.
📊 AWS DevOps Ecosystem Diagram:
graph TB
subgraph "Source & Planning"
CC[CodeCommit]
GH[GitHub]
BB[Bitbucket]
end
subgraph "Build & Test"
CB[CodeBuild]
CA[CodeArtifact]
ECR[ECR]
end
subgraph "Deploy & Release"
CP[CodePipeline]
CD[CodeDeploy]
CF[CloudFormation]
CDK[CDK]
end
subgraph "Infrastructure"
EC2[EC2]
ECS[ECS]
EKS[EKS]
LMB[Lambda]
EB[Elastic Beanstalk]
end
subgraph "Monitor & Operate"
CW[CloudWatch]
XR[X-Ray]
CT[CloudTrail]
EB2[EventBridge]
end
subgraph "Security & Compliance"
IAM[IAM]
SM[Secrets Manager]
SH[Security Hub]
GD[GuardDuty]
CFG[Config]
end
CC --> CP
GH --> CP
BB --> CP
CP --> CB
CB --> CA
CB --> ECR
CP --> CD
CD --> EC2
CD --> ECS
CD --> EKS
CD --> LMB
CF --> EC2
CF --> ECS
CDK --> CF
EC2 --> CW
ECS --> CW
EKS --> CW
LMB --> CW
CW --> EB2
EB2 --> LMB
IAM --> EC2
IAM --> ECS
IAM --> EKS
IAM --> LMB
SM --> CB
SM --> CD
style CP fill:#ff9999
style CB fill:#99ccff
style CD fill:#99ff99
style CW fill:#ffcc99
style IAM fill:#cc99ff
See: diagrams/01_fundamentals_aws_devops_ecosystem.mmd
Diagram Explanation:
This diagram shows the complete AWS DevOps ecosystem and how services integrate. The flow starts with source code repositories (CodeCommit, GitHub, Bitbucket) feeding into CodePipeline (red), which orchestrates the entire process. CodeBuild (blue) handles compilation and testing, pulling dependencies from CodeArtifact and pushing container images to ECR. CodeDeploy (green) manages deployments to various compute platforms (EC2, ECS, EKS, Lambda). CloudFormation and CDK handle infrastructure provisioning. CloudWatch (orange) provides monitoring across all services, with EventBridge enabling event-driven automation. IAM (purple) secures everything, while Secrets Manager handles sensitive data. This interconnected ecosystem enables fully automated DevOps workflows.
Amazon EC2 (Elastic Compute Cloud):
Amazon ECS (Elastic Container Service):
Amazon EKS (Elastic Kubernetes Service):
AWS Lambda:
✅ Must Know: Each compute service has different deployment strategies and monitoring approaches. The exam tests your ability to choose the right service for specific scenarios.
Amazon S3 (Simple Storage Service):
Amazon EBS (Elastic Block Store):
Amazon EFS (Elastic File System):
Amazon RDS (Relational Database Service):
Amazon DynamoDB:
What it is: A logically isolated section of the AWS cloud where you can launch AWS resources in a virtual network that you define.
Why it exists: Applications need secure, isolated network environments with controlled access. VPC provides the foundation for all AWS networking.
Real-world analogy: Think of a VPC like a private office building where you control who can enter, which floors they can access, and how they move between rooms.
How it works (Detailed step-by-step):
📊 VPC Architecture Diagram:
graph TB
subgraph "VPC (10.0.0.0/16)"
subgraph "Availability Zone A"
PubA[Public Subnet<br/>10.0.1.0/24]
PrivA[Private Subnet<br/>10.0.3.0/24]
end
subgraph "Availability Zone B"
PubB[Public Subnet<br/>10.0.2.0/24]
PrivB[Private Subnet<br/>10.0.4.0/24]
end
IGW[Internet Gateway]
NATGW[NAT Gateway]
PubRT[Public Route Table]
PrivRT[Private Route Table]
end
Internet[Internet] --> IGW
IGW --> PubA
IGW --> PubB
PubA --> NATGW
NATGW --> PrivA
NATGW --> PrivB
PubRT --> PubA
PubRT --> PubB
PrivRT --> PrivA
PrivRT --> PrivB
style PubA fill:#e1f5fe
style PubB fill:#e1f5fe
style PrivA fill:#fff3e0
style PrivB fill:#fff3e0
style IGW fill:#c8e6c9
style NATGW fill:#f3e5f5
See: diagrams/01_fundamentals_vpc_architecture.mmd
Diagram Explanation:
This diagram shows a typical multi-AZ VPC setup. The VPC spans multiple Availability Zones for high availability. Public subnets (blue) have direct internet access through the Internet Gateway (green) and typically host load balancers, bastion hosts, or NAT gateways. Private subnets (orange) contain application servers and databases, accessing the internet through the NAT Gateway (purple) in the public subnet. Route tables control traffic flow - public route tables direct internet traffic to the IGW, while private route tables direct internet traffic to the NAT Gateway. This architecture provides security (private resources aren't directly accessible from internet) while enabling necessary outbound connectivity.
Key Networking Concepts:
Security Groups:
Network ACLs (Access Control Lists):
Route Tables:
✅ Must Know: Security groups are stateful (return traffic automatically allowed), while NACLs are stateless (must explicitly allow return traffic). This distinction appears frequently on the exam.
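To make the distinction concrete, here is a minimal CloudFormation sketch (resource names such as AppVpc and PublicNacl are placeholders, not defined in this guide): the security group needs only an inbound rule because return traffic is allowed automatically, while the network ACL needs an explicit outbound rule for ephemeral ports.
WebSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Allow HTTPS in; return traffic is allowed automatically (stateful)
    VpcId: !Ref AppVpc
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 443
        ToPort: 443
        CidrIp: 0.0.0.0/0
InboundHttpsNaclEntry:
  Type: AWS::EC2::NetworkAclEntry
  Properties:
    NetworkAclId: !Ref PublicNacl
    RuleNumber: 100
    Protocol: 6                 # TCP
    RuleAction: allow
    CidrBlock: 0.0.0.0/0
    PortRange: {From: 443, To: 443}
OutboundEphemeralNaclEntry:     # stateless: return traffic must be allowed explicitly
  Type: AWS::EC2::NetworkAclEntry
  Properties:
    NetworkAclId: !Ref PublicNacl
    RuleNumber: 100
    Egress: true
    Protocol: 6
    RuleAction: allow
    CidrBlock: 0.0.0.0/0
    PortRange: {From: 1024, To: 65535}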
Application Load Balancer (ALB):
Network Load Balancer (NLB):
Classic Load Balancer (CLB):
What IAM is: AWS Identity and Access Management is a web service that helps you securely control access to AWS resources.
Why it's critical for DevOps: Every automated process, every service, and every user needs appropriate permissions. IAM is the foundation of AWS security.
Real-world analogy: IAM is like a sophisticated key card system in a large office building, where different cards provide access to different floors, rooms, and resources based on job requirements.
Core IAM Components:
Users:
Groups:
Roles:
Policies:
📊 IAM Hierarchy Diagram:
graph TD
subgraph "AWS Account"
subgraph "IAM Users"
U1[Developer User]
U2[Admin User]
U3[Auditor User]
end
subgraph "IAM Groups"
G1[Developers Group]
G2[Administrators Group]
G3[Auditors Group]
end
subgraph "IAM Roles"
R1[EC2 Instance Role]
R2[Lambda Execution Role]
R3[Cross-Account Role]
end
subgraph "IAM Policies"
P1[S3 Read Policy]
P2[EC2 Full Access]
P3[CloudWatch Logs]
end
subgraph "AWS Services"
EC2[EC2 Instance]
LMB[Lambda Function]
CB[CodeBuild Project]
end
end
U1 --> G1
U2 --> G2
U3 --> G3
G1 --> P1
G1 --> P3
G2 --> P2
G3 --> P1
R1 --> P1
R1 --> P3
R2 --> P3
R3 --> P1
EC2 --> R1
LMB --> R2
CB --> R2
style U1 fill:#e1f5fe
style U2 fill:#e1f5fe
style U3 fill:#e1f5fe
style G1 fill:#fff3e0
style G2 fill:#fff3e0
style G3 fill:#fff3e0
style R1 fill:#f3e5f5
style R2 fill:#f3e5f5
style R3 fill:#f3e5f5
style P1 fill:#e8f5e9
style P2 fill:#e8f5e9
style P3 fill:#e8f5e9
See: diagrams/01_fundamentals_iam_hierarchy.mmd
Diagram Explanation:
This diagram illustrates the IAM hierarchy and relationships. Users (blue) represent individual identities that can be organized into Groups (orange) for easier management. Groups and individual users can have Policies (green) attached that define their permissions. Roles (purple) provide temporary credentials and are used by AWS services like EC2 instances, Lambda functions, and CodeBuild projects. The arrows show how permissions flow - users inherit permissions from their groups, and services assume roles to get the permissions they need. This structure enables the principle of least privilege while maintaining manageable access control.
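As a small illustration of how a service assumes a role, the following CloudFormation sketch defines a CodeBuild service role: the trust policy names the service principal that may assume the role, and the inline policy grants only what the build needs (ArtifactBucket is an assumed resource name, not one defined in this guide).
BuildServiceRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:          # trust policy: who may assume this role
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: codebuild.amazonaws.com
          Action: sts:AssumeRole
    Policies:                          # permissions policy: what the role may do
      - PolicyName: minimal-build-permissions
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - logs:CreateLogGroup
                - logs:CreateLogStream
                - logs:PutLogEvents
              Resource: "*"
            - Effect: Allow
              Action: s3:GetObject
              Resource: !Sub "${ArtifactBucket.Arn}/*"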
Continuous Integration (CI):
Continuous Delivery (CD):
Continuous Deployment:
📊 CI/CD Pipeline Flow Diagram:
graph LR
subgraph "Developer Workflow"
DEV[Developer]
CODE[Write Code]
COMMIT[Commit & Push]
end
subgraph "Continuous Integration"
TRIGGER[Pipeline Trigger]
BUILD[Build Application]
UNITTEST[Unit Tests]
INTEGRATION[Integration Tests]
SECURITY[Security Scans]
ARTIFACT[Create Artifacts]
end
subgraph "Continuous Delivery"
DEPLOY_DEV[Deploy to Dev]
SMOKE[Smoke Tests]
DEPLOY_STAGE[Deploy to Staging]
E2E[End-to-End Tests]
APPROVAL[Manual Approval]
DEPLOY_PROD[Deploy to Production]
end
subgraph "Continuous Monitoring"
MONITOR[Monitor Application]
ALERT[Alerts & Notifications]
FEEDBACK[Feedback Loop]
end
DEV --> CODE
CODE --> COMMIT
COMMIT --> TRIGGER
TRIGGER --> BUILD
BUILD --> UNITTEST
UNITTEST --> INTEGRATION
INTEGRATION --> SECURITY
SECURITY --> ARTIFACT
ARTIFACT --> DEPLOY_DEV
DEPLOY_DEV --> SMOKE
SMOKE --> DEPLOY_STAGE
DEPLOY_STAGE --> E2E
E2E --> APPROVAL
APPROVAL --> DEPLOY_PROD
DEPLOY_PROD --> MONITOR
MONITOR --> ALERT
ALERT --> FEEDBACK
FEEDBACK --> DEV
style BUILD fill:#99ccff
style UNITTEST fill:#99ccff
style INTEGRATION fill:#99ccff
style DEPLOY_DEV fill:#99ff99
style DEPLOY_STAGE fill:#99ff99
style DEPLOY_PROD fill:#99ff99
style MONITOR fill:#ffcc99
See: diagrams/01_fundamentals_cicd_pipeline_flow.mmd
Diagram Explanation:
This diagram shows a complete CI/CD pipeline flow. It starts with developers writing and committing code, which triggers the Continuous Integration phase (blue) where the application is built, tested, and scanned for security issues. Artifacts are created and deployed through multiple environments in the Continuous Delivery phase (green), with automated and manual testing at each stage. The final deployment to production is followed by Continuous Monitoring (orange) that provides feedback to developers. This creates a complete feedback loop that enables rapid, reliable software delivery.
Build Automation:
Test Automation:
Deployment Automation:
Infrastructure as Code (IaC):
✅ Must Know: The exam focuses heavily on implementing these concepts using AWS services. Understanding the principles helps you choose the right AWS tools for each scenario.
AWS CodeCommit:
GitHub Integration:
Bitbucket Integration:
AWS CodeBuild:
Build Specifications (buildspec.yml):
version: 0.2
phases:
pre_build:
commands:
- echo Logging in to Amazon ECR...
- aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
build:
commands:
- echo Build started on `date`
- echo Building the Docker image...
- docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG .
- docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
post_build:
commands:
- echo Build completed on `date`
- echo Pushing the Docker image...
- docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
artifacts:
files:
- '**/*'
AWS CodeArtifact:
Amazon ECR (Elastic Container Registry):
AWS CodeDeploy:
AWS CodePipeline:
What it is: AWS service that provides a common language for describing and provisioning all infrastructure resources in your cloud environment.
Why it exists: Manual infrastructure provisioning is error-prone, inconsistent, and doesn't scale. CloudFormation enables infrastructure to be version-controlled, tested, and automated.
Real-world analogy: CloudFormation is like architectural blueprints for buildings - you design once, then can build identical structures repeatedly with confidence.
How it works:
Key Concepts:
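For reference, here is a minimal CloudFormation template showing the typical sections (Parameters, Resources, Outputs); the bucket and export names are illustrative only.
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal example stack - a versioned S3 bucket named per environment
Parameters:
  Environment:
    Type: String
    AllowedValues: [dev, staging, prod]
    Default: dev
Resources:
  ArtifactBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub "example-artifacts-${Environment}-${AWS::AccountId}"
      VersioningConfiguration:
        Status: Enabled
Outputs:
  ArtifactBucketName:
    Value: !Ref ArtifactBucket
    Export:
      Name: !Sub "${AWS::StackName}-artifact-bucket"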
What it is: Software development framework for defining cloud infrastructure using familiar programming languages.
Why it exists: CloudFormation templates can become complex and hard to maintain. CDK allows developers to use programming constructs like loops, conditions, and functions.
Supported languages: TypeScript, JavaScript, Python, Java, C#, Go
Key advantages:
✅ Must Know: CDK synthesizes to CloudFormation templates, so understanding both is important for the exam.
Metrics:
Logs:
Traces:
📊 Observability Stack Diagram:
graph TB
subgraph "Application Layer"
APP1[Web Application]
APP2[API Service]
APP3[Background Jobs]
end
subgraph "Observability Services"
CW[CloudWatch Metrics]
CWL[CloudWatch Logs]
XR[X-Ray Tracing]
end
subgraph "Analysis & Alerting"
CWD[CloudWatch Dashboards]
CWA[CloudWatch Alarms]
CWI[CloudWatch Insights]
SNS[SNS Notifications]
end
subgraph "Response & Automation"
EB[EventBridge]
LMB[Lambda Functions]
SSM[Systems Manager]
end
APP1 --> CW
APP1 --> CWL
APP1 --> XR
APP2 --> CW
APP2 --> CWL
APP2 --> XR
APP3 --> CW
APP3 --> CWL
CW --> CWD
CW --> CWA
CWL --> CWI
XR --> CWD
CWA --> SNS
CWA --> EB
EB --> LMB
LMB --> SSM
style CW fill:#ffcc99
style CWL fill:#ffcc99
style XR fill:#ffcc99
style CWA fill:#ff9999
style EB fill:#99ff99
See: diagrams/01_fundamentals_observability_stack.mmd
Diagram Explanation:
This diagram shows how observability works in AWS. Applications send metrics, logs, and traces to CloudWatch services (orange). CloudWatch Metrics feeds dashboards and alarms, while CloudWatch Logs enables detailed analysis through Insights. Alarms (red) can trigger notifications via SNS or events via EventBridge (green), which can invoke Lambda functions for automated responses or Systems Manager for remediation actions. This creates a complete observability and response system.
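A hedged sketch of the alarm piece of this stack: a CloudWatch alarm on ALB 5xx errors that notifies an SNS topic. AppLoadBalancer and OpsNotificationTopic are assumed to be defined elsewhere in the same template.
HighErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Too many 5xx responses from the load balancer
    Namespace: AWS/ApplicationELB
    MetricName: HTTPCode_Target_5XX_Count
    Dimensions:
      - Name: LoadBalancer
        Value: !GetAtt AppLoadBalancer.LoadBalancerFullName
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 3
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref OpsNotificationTopic      # SNS topic; an EventBridge target could drive automated remediation instead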
Proactive vs Reactive Monitoring:
SLIs, SLOs, and SLAs:
The Four Golden Signals:
Shift Left Security:
DevSecOps Principles:
AWS IAM (Identity and Access Management):
AWS Secrets Manager:
AWS Systems Manager Parameter Store:
AWS Config:
✅ Must Know: Security is integrated throughout all DevOps processes. The exam tests your ability to implement security controls at every stage of the pipeline.
Now that we've covered the individual components, let's understand how they work together in a complete DevOps ecosystem.
📊 Complete DevOps Workflow Diagram:
graph TB
subgraph "Development"
DEV[Developer]
IDE[IDE/Editor]
GIT[Git Repository]
end
subgraph "CI/CD Pipeline"
TRIGGER[Pipeline Trigger]
BUILD[Build & Test]
SECURITY[Security Scan]
ARTIFACT[Artifact Storage]
DEPLOY[Deploy]
end
subgraph "Infrastructure"
IaC[Infrastructure as Code]
COMPUTE[Compute Resources]
NETWORK[Networking]
STORAGE[Storage]
end
subgraph "Operations"
MONITOR[Monitoring]
LOGS[Logging]
ALERTS[Alerting]
INCIDENT[Incident Response]
end
subgraph "Security & Compliance"
IAM[Identity & Access]
SECRETS[Secrets Management]
COMPLIANCE[Compliance Monitoring]
AUDIT[Audit Logging]
end
DEV --> IDE
IDE --> GIT
GIT --> TRIGGER
TRIGGER --> BUILD
BUILD --> SECURITY
SECURITY --> ARTIFACT
ARTIFACT --> DEPLOY
IaC --> COMPUTE
IaC --> NETWORK
IaC --> STORAGE
DEPLOY --> COMPUTE
COMPUTE --> MONITOR
COMPUTE --> LOGS
MONITOR --> ALERTS
ALERTS --> INCIDENT
INCIDENT --> DEV
IAM --> COMPUTE
IAM --> DEPLOY
SECRETS --> BUILD
SECRETS --> DEPLOY
COMPLIANCE --> IaC
AUDIT --> GIT
AUDIT --> DEPLOY
style BUILD fill:#99ccff
style DEPLOY fill:#99ff99
style MONITOR fill:#ffcc99
style IAM fill:#cc99ff
See: diagrams/01_fundamentals_complete_devops_workflow.mmd
Diagram Explanation:
This comprehensive diagram shows how all DevOps components work together. Developers use IDEs to write code that's stored in Git repositories. Code changes trigger CI/CD pipelines that build, test, scan for security issues, store artifacts, and deploy applications. Infrastructure as Code provisions the underlying compute, network, and storage resources. Operations teams monitor applications and infrastructure, with logs and alerts feeding into incident response that creates feedback loops back to developers. Security and compliance are integrated throughout - IAM controls access, secrets are managed securely, compliance is monitored continuously, and audit logs track all activities. This creates a complete, secure, automated DevOps ecosystem.
| Term | Definition | Example |
|---|---|---|
| Artifact | A deployable unit produced by the build process | JAR file, Docker image, ZIP package |
| Blue/Green Deployment | Deployment strategy using two identical environments | Switch traffic from blue to green environment |
| Canary Deployment | Gradual deployment to a subset of users | Deploy to 10% of servers, then 50%, then 100% |
| CI/CD | Continuous Integration/Continuous Delivery | Automated pipeline from code to production |
| Container | Lightweight, portable application package | Docker container with app and dependencies |
| GitOps | Using Git as single source of truth for infrastructure | Infrastructure changes via Git pull requests |
| IaC | Infrastructure as Code | CloudFormation template, CDK code |
| Immutable Infrastructure | Infrastructure that is replaced, not modified | New AMI for each deployment |
| Microservices | Architecture of small, independent services | Each service has its own database and API |
| Pipeline | Automated sequence of stages for software delivery | Source → Build → Test → Deploy |
| Rollback | Reverting to a previous version after deployment | Return to last known good version |
| Serverless | Computing without managing servers | AWS Lambda functions |
| Stack | Collection of AWS resources managed together | CloudFormation stack with VPC, EC2, RDS |
Before proceeding to Domain 1, ensure you understand these fundamental concepts:
Scenario: You're tasked with designing a basic web application infrastructure that needs to be highly available, secure, and automatically deployable.
Requirements:
Your Task: Using the concepts learned in this chapter, sketch out:
Solution Approach (don't peek until you've tried!):
Congratulations! You now have the foundational knowledge needed to tackle the DOP-C02 exam domains. In the next chapter, we'll dive deep into Domain 1: SDLC Automation, where you'll learn to implement sophisticated CI/CD pipelines using AWS services.
Chapter 1 Preview: You'll learn to build complete CI/CD pipelines with CodePipeline, implement automated testing strategies, manage artifacts with CodeArtifact and ECR, and master deployment strategies for different compute platforms.
Ready to continue? Proceed to Chapter 1: SDLC Automation when you've completed the self-assessment checklist above.
Remember: This foundational knowledge is crucial for success on the exam. Take time to ensure you're comfortable with these concepts before moving forward.
What you'll learn:
Time to complete: 12-15 hours
Prerequisites: Chapter 0 (Fundamentals)
Exam weight: 22% (approximately 14-15 questions)
Domain Tasks Covered:
The problem: Manual software delivery processes are slow, error-prone, and don't scale. Teams spend more time on deployment mechanics than building features, leading to infrequent releases and higher risk of production issues.
The solution: Automated CI/CD pipelines that handle the entire software delivery process from source code to production deployment, with built-in quality gates, security checks, and rollback capabilities.
Why it's tested: CI/CD automation is fundamental to DevOps practices. The exam tests your ability to design pipelines that are secure, scalable, and maintainable across different AWS services and deployment targets.
What it is: AWS CodePipeline is a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates.
Why it exists: Organizations need a way to orchestrate complex software delivery workflows that span multiple tools, environments, and teams. CodePipeline provides a visual interface and robust automation for these workflows.
Real-world analogy: Think of CodePipeline like an assembly line in a factory - raw materials (source code) enter at one end, go through various processing stations (build, test, deploy), and emerge as finished products (deployed applications) at the other end.
How it works (Detailed step-by-step):
📊 CodePipeline Architecture Diagram:
graph LR
subgraph "Source Stage"
CC[CodeCommit Repository]
GH[GitHub Repository]
S3[S3 Bucket]
end
subgraph "Build Stage"
CB[CodeBuild Project]
SPEC[buildspec.yml]
ENV[Build Environment]
end
subgraph "Test Stage"
UT[Unit Tests]
IT[Integration Tests]
SEC[Security Scans]
end
subgraph "Deploy Stage"
CD[CodeDeploy]
CF[CloudFormation]
ECS[ECS Deploy]
LMB[Lambda Deploy]
end
subgraph "Approval Stage"
MAN[Manual Approval]
AUTO[Automated Gates]
end
subgraph "Production Stage"
PROD[Production Deploy]
SMOKE[Smoke Tests]
MONITOR[Monitoring]
end
CC --> CB
GH --> CB
S3 --> CB
CB --> SPEC
SPEC --> ENV
ENV --> UT
UT --> IT
IT --> SEC
SEC --> MAN
MAN --> AUTO
AUTO --> CD
AUTO --> CF
AUTO --> ECS
AUTO --> LMB
CD --> PROD
CF --> PROD
ECS --> PROD
LMB --> PROD
PROD --> SMOKE
SMOKE --> MONITOR
style CB fill:#99ccff
style UT fill:#99ccff
style IT fill:#99ccff
style SEC fill:#ff9999
style MAN fill:#ffcc99
style PROD fill:#99ff99
See: diagrams/02_domain1_codepipeline_architecture.mmd
Diagram Explanation:
This diagram illustrates a comprehensive CodePipeline architecture. The Source Stage can pull from multiple repository types (CodeCommit, GitHub, S3). The Build Stage uses CodeBuild with buildspec.yml configuration files to create consistent build environments (blue). The Test Stage runs multiple types of automated tests, including security scans (red) that can fail the pipeline if issues are found. The Approval Stage includes both manual approvals and automated gates (orange) that control progression to production. The Deploy Stage supports multiple deployment targets (CodeDeploy for EC2, CloudFormation for infrastructure, ECS for containers, Lambda for serverless). Finally, the Production Stage (green) includes deployment, smoke testing, and monitoring setup. Artifacts flow between stages, enabling traceability and rollback capabilities.
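One common way to define such a pipeline is as a CloudFormation resource. The sketch below is simplified and assumes the service roles, artifact bucket, CodeBuild project, repository, and stack names already exist; all of those names are placeholders.
AppPipeline:
  Type: AWS::CodePipeline::Pipeline
  Properties:
    RoleArn: !GetAtt PipelineServiceRole.Arn
    ArtifactStore:
      Type: S3
      Location: !Ref ArtifactBucket
    Stages:
      - Name: Source
        Actions:
          - Name: Source
            ActionTypeId: {Category: Source, Owner: AWS, Provider: CodeCommit, Version: "1"}
            Configuration: {RepositoryName: my-app, BranchName: main}
            OutputArtifacts: [{Name: SourceOutput}]
      - Name: Build
        Actions:
          - Name: BuildAndTest
            ActionTypeId: {Category: Build, Owner: AWS, Provider: CodeBuild, Version: "1"}
            Configuration:
              ProjectName: !Ref BuildProject
            InputArtifacts: [{Name: SourceOutput}]
            OutputArtifacts: [{Name: BuildOutput}]
      - Name: Approval
        Actions:
          - Name: ManualApproval
            ActionTypeId: {Category: Approval, Owner: AWS, Provider: Manual, Version: "1"}
      - Name: Deploy
        Actions:
          - Name: DeployStack
            ActionTypeId: {Category: Deploy, Owner: AWS, Provider: CloudFormation, Version: "1"}
            Configuration:
              ActionMode: CREATE_UPDATE
              StackName: my-app-stack
              TemplatePath: BuildOutput::template.yaml
              RoleArn: !GetAtt CfnDeployRole.Arn
              Capabilities: CAPABILITY_NAMED_IAM
            InputArtifacts: [{Name: BuildOutput}]
Each action passes named artifacts to the next stage, which is what gives the pipeline its traceability and rollback capabilities.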
Detailed Example 1: Multi-Environment Web Application Pipeline
Consider a three-tier web application with a React frontend, Node.js API, and PostgreSQL database. The pipeline starts when developers push code to the main branch in CodeCommit. The Source stage detects the change and triggers the Build stage, which uses CodeBuild to install dependencies, run unit tests, build the React application, and create deployment artifacts. The artifacts include a Docker image for the API, static files for the frontend, and CloudFormation templates for infrastructure. The Test stage deploys to a temporary environment, runs integration tests against the API, performs security scans using tools like OWASP ZAP, and validates the frontend with automated browser tests. If all tests pass, the pipeline proceeds to a Manual Approval stage where the product owner reviews the changes. Upon approval, the Deploy stage uses CloudFormation to update the staging environment infrastructure, CodeDeploy to deploy the API to EC2 instances behind an Application Load Balancer, and S3/CloudFront to deploy the frontend. Finally, smoke tests verify the staging deployment before the pipeline waits for another approval to deploy to production using the same process.
Detailed Example 2: Microservices Pipeline with Parallel Builds
A microservices architecture with five independent services requires a sophisticated pipeline design. Each service has its own repository, but changes to shared libraries trigger builds for all dependent services. The pipeline uses CodePipeline's parallel execution capabilities to build multiple services simultaneously. The Source stage monitors multiple repositories using CloudWatch Events and Lambda functions to determine which services need rebuilding based on dependency graphs. The Build stage runs parallel CodeBuild projects, each with service-specific buildspec.yml files that handle different technology stacks (Java Spring Boot, Python Flask, Go microservices). Each build produces Docker images tagged with the commit SHA and pushes them to separate ECR repositories. The Test stage runs service-specific unit tests in parallel, then performs integration testing by deploying all services to a test environment and running end-to-end test suites. Contract testing ensures API compatibility between services. The Deploy stage uses a blue/green deployment strategy, deploying all services to a new ECS cluster while keeping the old cluster running. Traffic is gradually shifted using Application Load Balancer weighted routing, with automatic rollback if error rates exceed thresholds.
Detailed Example 3: Infrastructure-as-Code Pipeline
An infrastructure pipeline manages AWS resources across multiple accounts and regions. The pipeline starts when infrastructure engineers commit CloudFormation templates or CDK code to a dedicated infrastructure repository. The Source stage pulls the latest templates and validates their syntax. The Build stage uses CodeBuild to run CDK synthesis (if applicable), CloudFormation template validation, and security scanning using tools like cfn-nag to identify potential security issues. The Test stage deploys infrastructure to a sandbox account, runs compliance checks using AWS Config rules, and validates that resources are created correctly using custom Lambda functions. The pipeline includes multiple deployment stages for different environments: development, staging, and production, each in separate AWS accounts. Each deployment stage uses CloudFormation StackSets to deploy across multiple regions simultaneously. The pipeline includes drift detection that runs daily to identify manual changes to infrastructure and can automatically remediate or alert on drift. Rollback capabilities ensure that failed deployments can be quickly reverted to the last known good state.
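A minimal buildspec sketch for the validation step described above, assuming templates live in a cloudformation/ folder; the file paths are illustrative and the tools (cfn-lint, cfn-nag) are the ones named in the example.
version: 0.2
phases:
  install:
    commands:
      - pip install cfn-lint
      - gem install cfn-nag
  build:
    commands:
      - echo "Validating CloudFormation templates..."
      - aws cloudformation validate-template --template-body file://cloudformation/network.yaml
      - cfn-lint cloudformation/*.yaml
      - cfn_nag_scan --input-path cloudformation/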
✅ Must Know (Critical Facts):
When to use (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
What it is: AWS CodeBuild is a fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy.
Why it exists: Traditional build servers require infrastructure management, scaling, and maintenance. CodeBuild provides on-demand, scalable build capacity without the overhead of managing build infrastructure.
Real-world analogy: CodeBuild is like a construction crew that you can hire on-demand - they bring all their own tools, work on your project, and you only pay for the time they spend working.
How it works (Detailed step-by-step):
📊 CodeBuild Execution Flow Diagram:
graph TB
subgraph "Build Project Configuration"
ENV[Build Environment]
COMPUTE[Compute Type]
RUNTIME[Runtime Version]
SPEC[buildspec.yml]
end
subgraph "Build Phases"
INSTALL[install phase]
PREBUILD[pre_build phase]
BUILD[build phase]
POSTBUILD[post_build phase]
end
subgraph "Build Environment"
CONTAINER[Docker Container]
TOOLS[Build Tools]
DEPS[Dependencies]
CACHE[Build Cache]
end
subgraph "Outputs"
ARTIFACTS[Build Artifacts]
LOGS[CloudWatch Logs]
REPORTS[Test Reports]
METRICS[Build Metrics]
end
ENV --> CONTAINER
COMPUTE --> CONTAINER
RUNTIME --> TOOLS
SPEC --> INSTALL
INSTALL --> PREBUILD
PREBUILD --> BUILD
BUILD --> POSTBUILD
CONTAINER --> TOOLS
TOOLS --> DEPS
DEPS --> CACHE
POSTBUILD --> ARTIFACTS
POSTBUILD --> LOGS
POSTBUILD --> REPORTS
POSTBUILD --> METRICS
style INSTALL fill:#e1f5fe
style PREBUILD fill:#e1f5fe
style BUILD fill:#99ccff
style POSTBUILD fill:#e1f5fe
style ARTIFACTS fill:#99ff99
style LOGS fill:#ffcc99
See: diagrams/02_domain1_codebuild_execution_flow.mmd
Diagram Explanation:
This diagram shows the complete CodeBuild execution flow. The Build Project Configuration defines the environment settings, compute resources, runtime versions, and build specifications. The Build Phases (light blue) execute sequentially - install phase sets up dependencies, pre_build phase handles authentication and preparation, build phase (dark blue) performs the actual compilation/testing, and post_build phase handles artifact creation and cleanup. The Build Environment provides an isolated Docker container with necessary tools, dependencies, and optional build cache for performance. The Outputs include build artifacts (green), CloudWatch logs (orange), test reports, and build metrics that provide visibility into the build process.
Build Specification (buildspec.yml) Deep Dive:
Complete buildspec.yml Example:
version: 0.2
# Environment variables available to all phases
env:
variables:
NODE_ENV: production
API_URL: https://api.example.com
parameter-store:
DATABASE_URL: /myapp/database/url
API_KEY: /myapp/api/key
secrets-manager:
DOCKER_HUB_PASSWORD: prod/dockerhub:password
GITHUB_TOKEN: prod/github:token
# Build phases executed in sequence
phases:
install:
# Install runtime versions and package managers
runtime-versions:
nodejs: 18
python: 3.9
docker: 20
commands:
- echo Installing dependencies...
- npm install -g yarn
- pip install --upgrade pip
pre_build:
# Authentication, setup, and preparation
commands:
- echo Logging in to Amazon ECR...
- aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
- echo Logging in to Docker Hub...
- echo $DOCKER_HUB_PASSWORD | docker login --username $DOCKER_HUB_USERNAME --password-stdin
- echo Setting up test database...
- docker run -d --name test-db -p 5432:5432 -e POSTGRES_PASSWORD=test postgres:13
build:
# Main build and test execution
commands:
- echo Build started on `date`
- echo Installing application dependencies...
- yarn install --frozen-lockfile
- echo Running unit tests...
- yarn test --coverage --ci
- echo Running security audit...
- yarn audit --audit-level moderate
- echo Building application...
- yarn build
- echo Building Docker image...
- docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG .
- docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
post_build:
# Artifact creation and cleanup
commands:
- echo Build completed on `date`
- echo Pushing Docker image to ECR...
- docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
- echo Creating deployment artifacts...
- printf '[{"name":"web-app","imageUri":"%s"}]' $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG > imagedefinitions.json
- echo Cleaning up test resources...
- docker stop test-db && docker rm test-db
# Artifacts to be uploaded to S3
artifacts:
files:
- imagedefinitions.json
- appspec.yml
- scripts/**/*
- cloudformation/**/*
name: myapp-$(date +%Y-%m-%d-%H-%M-%S)
# Test reports for CodeBuild console
reports:
jest-reports:
files:
- coverage/lcov.info
- junit.xml
file-format: JUNITXML
base-directory: coverage
# Build caching for performance
cache:
paths:
- node_modules/**/*
- ~/.cache/pip/**/*
- /root/.docker/**/*
Build Environment Options:
Compute Types:
Operating Systems:
Runtime Versions:
Detailed Example 4: Multi-Language Microservices Build
A complex microservices project with Java Spring Boot backend, React frontend, and Python data processing service requires a sophisticated build configuration. The buildspec.yml uses multiple runtime versions (Java 11, Node.js 18, Python 3.9) and parallel build processes. The install phase sets up all three runtime environments and installs package managers (Maven, Yarn, pip). The pre_build phase authenticates with multiple registries (ECR for private images, Docker Hub for base images, npm registry for private packages) and starts test dependencies (PostgreSQL database, Redis cache, Elasticsearch for search). The build phase runs builds in parallel using background processes - Maven builds the Java service while Yarn builds the React app and pip installs Python dependencies. Each service runs its own test suite, with integration tests running after all unit tests complete. Security scanning runs in parallel using multiple tools (OWASP dependency check for Java, npm audit for Node.js, safety for Python). The post_build phase creates separate Docker images for each service, pushes them to ECR with appropriate tags, and creates deployment artifacts including ECS task definitions, Kubernetes manifests, and CloudFormation templates for infrastructure updates.
✅ Must Know (Critical Facts):
When to use (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
What it is: The integration between source code repositories and CI/CD pipelines, enabling automated pipeline triggers when code changes occur.
Why it exists: Manual pipeline triggers don't scale and introduce human error. Automated triggers ensure that every code change goes through the same quality gates and deployment process.
Real-world analogy: Source integration is like a motion sensor that automatically turns on lights when someone enters a room - it responds immediately to changes without manual intervention.
Integration Patterns:
AWS CodeCommit Integration:
GitHub Integration:
Bitbucket Integration:
📊 Source Integration Architecture Diagram:
graph TB
subgraph "Source Repositories"
CC[CodeCommit]
GH[GitHub]
BB[Bitbucket]
S3[S3 Bucket]
end
subgraph "Event Processing"
CWE[CloudWatch Events]
WEBHOOK[Webhooks]
POLLING[Polling]
end
subgraph "Pipeline Triggers"
CP[CodePipeline]
FILTER[Branch/Path Filters]
APPROVAL[Auto/Manual Trigger]
end
subgraph "Build Initiation"
CB[CodeBuild]
ARTIFACT[Source Artifacts]
ENV[Build Environment]
end
CC --> CWE
GH --> WEBHOOK
BB --> WEBHOOK
S3 --> POLLING
CWE --> CP
WEBHOOK --> CP
POLLING --> CP
CP --> FILTER
FILTER --> APPROVAL
APPROVAL --> CB
CB --> ARTIFACT
ARTIFACT --> ENV
style CC fill:#ff9999
style GH fill:#99ccff
style BB fill:#99ccff
style CP fill:#99ff99
style CB fill:#ffcc99
See: diagrams/02_domain1_source_integration_architecture.mmd
Diagram Explanation:
This diagram shows how different source repositories integrate with AWS CI/CD services. CodeCommit (red) uses native CloudWatch Events for real-time pipeline triggers. GitHub and Bitbucket (blue) use webhooks to notify CodePipeline of changes. S3 can be used as a source with polling-based triggers. All sources feed into CodePipeline (green) which applies branch and path filters before triggering builds. CodeBuild (orange) receives source artifacts and creates build environments. This architecture enables automated, event-driven CI/CD workflows that respond immediately to code changes.
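For example, the CodeCommit-to-CodePipeline trigger shown in this diagram is typically an EventBridge (CloudWatch Events) rule similar to the sketch below; AppRepo, AppPipeline, and EventsInvokePipelineRole are assumed resources defined elsewhere.
PipelineTriggerRule:
  Type: AWS::Events::Rule
  Properties:
    Description: Start the pipeline when the main branch of the repository is updated
    EventPattern:
      source: ["aws.codecommit"]
      detail-type: ["CodeCommit Repository State Change"]
      resources:
        - !GetAtt AppRepo.Arn
      detail:
        event: ["referenceCreated", "referenceUpdated"]
        referenceType: ["branch"]
        referenceName: ["main"]
    Targets:
      - Id: start-pipeline
        Arn: !Sub "arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:${AppPipeline}"
        RoleArn: !GetAtt EventsInvokePipelineRole.Arn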
Advanced Integration Patterns:
Multi-Repository Pipelines:
# Example: Pipeline triggered by changes to multiple repositories
Source:
- Repository: main-app
Branch: main
Trigger: immediate
- Repository: shared-library
Branch: main
Trigger: downstream
- Repository: infrastructure
Branch: main
Trigger: conditional
Branch-Based Workflows:
Monorepo vs Microrepo Strategies:
Monorepo Approach:
Microrepo Approach:
Detailed Example 5: Multi-Branch Pipeline Strategy
A large enterprise application uses a sophisticated branching strategy with different pipeline behaviors for each branch type. The main branch triggers a full production pipeline with comprehensive testing, security scanning, and multi-environment deployment. Feature branches trigger lightweight pipelines that create temporary environments for testing, run unit tests and basic integration tests, but skip expensive security scans and performance tests. The develop branch triggers a staging pipeline that deploys to a shared development environment and runs the full test suite including end-to-end tests. Release branches trigger a release candidate pipeline that creates release artifacts, generates release notes, and deploys to a pre-production environment that mirrors production. Hotfix branches trigger an expedited pipeline that skips some non-critical tests but includes all security checks and deploys directly to production after approval. Each branch type has different approval requirements - feature branches require code review, develop branch requires automated test passage, release branches require QA approval, and hotfix branches require both security team and operations team approval.
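One way to implement branch-specific behavior for GitHub sources is a CodeBuild webhook filter, sketched below for feature branches; the repository URL, buildspec file name, and project settings are placeholders, and a real setup would pair this with separate projects or pipelines per branch type.
FeatureBranchBuild:
  Type: AWS::CodeBuild::Project
  Properties:
    Name: feature-branch-ci
    ServiceRole: !GetAtt BuildServiceRole.Arn
    Source:
      Type: GITHUB
      Location: https://github.com/example-org/example-app.git
      BuildSpec: buildspec-feature.yml
    Artifacts:
      Type: NO_ARTIFACTS
    Environment:
      Type: LINUX_CONTAINER
      ComputeType: BUILD_GENERAL1_SMALL
      Image: aws/codebuild/standard:7.0
    Triggers:
      Webhook: true
      FilterGroups:
        - - Type: EVENT
            Pattern: PUSH
          - Type: HEAD_REF
            Pattern: ^refs/heads/feature/.*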
✅ Must Know (Critical Facts):
When to use (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
Situation: A large enterprise needs to deploy applications across development, staging, and production accounts with proper governance and security controls.
Challenge: Each environment is in a separate AWS account with different IAM policies, VPCs, and security requirements. The pipeline must deploy consistently across all environments while maintaining security boundaries.
Solution: Design a cross-account pipeline using IAM roles and centralized artifact storage.
📊 Multi-Account Pipeline Architecture:
graph TB
subgraph "Shared Services Account"
CP[CodePipeline]
CB[CodeBuild]
S3[Artifact Store]
ECR[Container Registry]
end
subgraph "Development Account"
DEV_ROLE[Dev Deployment Role]
DEV_VPC[Dev VPC]
DEV_ECS[Dev ECS Cluster]
end
subgraph "Staging Account"
STAGE_ROLE[Staging Deployment Role]
STAGE_VPC[Staging VPC]
STAGE_ECS[Staging ECS Cluster]
end
subgraph "Production Account"
PROD_ROLE[Prod Deployment Role]
PROD_VPC[Production VPC]
PROD_ECS[Production ECS Cluster]
end
CP --> CB
CB --> S3
CB --> ECR
CP --> DEV_ROLE
DEV_ROLE --> DEV_VPC
DEV_VPC --> DEV_ECS
CP --> STAGE_ROLE
STAGE_ROLE --> STAGE_VPC
STAGE_VPC --> STAGE_ECS
CP --> PROD_ROLE
PROD_ROLE --> PROD_VPC
PROD_VPC --> PROD_ECS
style CP fill:#99ccff
style DEV_ROLE fill:#99ff99
style STAGE_ROLE fill:#ffcc99
style PROD_ROLE fill:#ff9999
See: diagrams/02_domain1_multi_account_pipeline.mmd
Implementation Details:
Why this works: This architecture maintains security boundaries while enabling centralized pipeline management. Each account controls its own resources and policies, but the pipeline can deploy consistently across all environments.
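A sketch of the deployment role that would live in each target account; the account ID, bucket name, and the exact deploy permissions are placeholders and would differ per environment. The central artifact bucket and its KMS key also need resource policies granting the target accounts read access.
# Created in each target account (dev/staging/prod); assumed by the pipeline account
CrossAccountDeployRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: pipeline-deploy-role
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            AWS: arn:aws:iam::111111111111:root   # shared-services account ID (placeholder)
          Action: sts:AssumeRole
    Policies:
      - PolicyName: deploy-permissions
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - ecs:DescribeServices
                - ecs:RegisterTaskDefinition
                - ecs:UpdateService
              Resource: "*"
            - Effect: Allow
              Action: s3:GetObject
              Resource: arn:aws:s3:::shared-artifact-bucket/*   # central artifact store (placeholder)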
What it is: The secure storage, access, and rotation of sensitive information (passwords, API keys, certificates) used in CI/CD pipelines.
Why it exists: CI/CD pipelines need access to sensitive information to deploy applications, but hardcoding secrets in code or configuration files creates security vulnerabilities and compliance issues.
Real-world analogy: Secrets management is like a secure vault in a bank - authorized personnel can access what they need when they need it, but everything is logged, controlled, and regularly audited.
AWS Secrets Management Services:
AWS Secrets Manager:
AWS Systems Manager Parameter Store:
Comparison Table:
| Feature | Secrets Manager | Parameter Store |
|---|---|---|
| Automatic Rotation | ✅ Built-in rotation for AWS services | ❌ Manual rotation required |
| Cost | Higher cost per secret | Free tier, lower cost |
| Secret Size | Up to 64KB | Up to 4KB (standard), 8KB (advanced) |
| Versioning | ✅ Automatic versioning | ❌ Manual versioning |
| Cross-Region | ✅ Cross-region replication | ❌ Region-specific |
| Integration | Native AWS service integration | Broader application integration |
| 🎯 Exam tip | Use for rotating secrets | Use for configuration data |
Pipeline Integration Patterns:
CodeBuild Integration Example:
version: 0.2
env:
# Regular environment variables
variables:
NODE_ENV: production
# Parameter Store integration
parameter-store:
DATABASE_HOST: /myapp/prod/database/host
API_ENDPOINT: /myapp/prod/api/endpoint
# Secrets Manager integration
secrets-manager:
DATABASE_PASSWORD: prod/myapp/database:password
API_KEY: prod/myapp/external:api_key
DOCKER_HUB_TOKEN: prod/myapp/dockerhub:token
phases:
pre_build:
commands:
- echo "Database host is $DATABASE_HOST"
- echo "Authenticating with external API..."
- curl -H "Authorization: Bearer $API_KEY" $API_ENDPOINT/health
- echo "Logging into Docker Hub..."
- echo $DOCKER_HUB_TOKEN | docker login --username myuser --password-stdin
build:
commands:
- echo "Building application with production configuration..."
- npm run build
- docker build -t myapp:$CODEBUILD_BUILD_NUMBER .
CodeDeploy Integration Example:
# appspec.yml
version: 0.0
os: linux
files:
- source: /
destination: /var/www/html
hooks:
BeforeInstall:
- location: scripts/install_dependencies.sh
timeout: 300
ApplicationStart:
- location: scripts/start_server.sh
timeout: 300
# scripts/start_server.sh
#!/bin/bash
# Retrieve secrets from Parameter Store
export DATABASE_URL=$(aws ssm get-parameter --name "/myapp/prod/database/url" --with-decryption --query "Parameter.Value" --output text)
export API_KEY=$(aws secretsmanager get-secret-value --secret-id "prod/myapp/api-key" --query "SecretString" --output text)
# Start application with secrets
node server.js
Security Best Practices:
Principle of Least Privilege:
Encryption and Transit Security:
Rotation Strategies:
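As a sketch of automatic rotation in CloudFormation (the rotation Lambda is assumed to exist, for example one of the AWS-provided RDS rotation functions; names and the rotation interval are illustrative):
DatabaseSecret:
  Type: AWS::SecretsManager::Secret
  Properties:
    Name: prod/myapp/database
    GenerateSecretString:
      SecretStringTemplate: '{"username": "appuser"}'
      GenerateStringKey: password
      ExcludeCharacters: '"@/\'
DatabaseSecretRotation:
  Type: AWS::SecretsManager::RotationSchedule
  Properties:
    SecretId: !Ref DatabaseSecret
    RotationLambdaARN: !GetAtt RotationFunction.Arn   # rotation Lambda (assumed to exist)
    RotationRules:
      AutomaticallyAfterDays: 30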
Detailed Example 6: Comprehensive Secrets Management Strategy
A microservices application requires access to multiple types of secrets across different environments. The architecture uses a layered approach to secrets management. Application configuration (non-sensitive) is stored in Parameter Store with hierarchical paths like /myapp/prod/config/api-timeout and /myapp/staging/config/api-timeout. Database credentials are stored in Secrets Manager with automatic rotation enabled, using separate secrets for each environment and service. Third-party API keys are stored in Secrets Manager with manual rotation procedures documented and tested quarterly. Container registry credentials are stored in Secrets Manager and accessed during build time through CodeBuild environment variables. The pipeline uses different IAM roles for each environment, with production roles having additional approval requirements and audit logging. Secrets are retrieved just-in-time during deployment, never stored in intermediate artifacts or logs. The system includes monitoring for secret access patterns, alerting on unusual access attempts, and automatic rotation failure notifications. Cross-region replication ensures secrets are available in disaster recovery scenarios.
✅ Must Know (Critical Facts):
When to use (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
The problem: Manual testing is slow, inconsistent, and doesn't scale with modern development practices. Teams either skip testing to meet deadlines or spend excessive time on manual validation, both leading to quality issues in production.
The solution: Automated testing integrated throughout the CI/CD pipeline, providing fast feedback, consistent validation, and confidence in deployments. Different types of tests run at appropriate stages to catch issues early while maintaining pipeline speed.
Why it's tested: Testing automation is critical for DevOps success. The exam tests your ability to implement comprehensive testing strategies that balance speed, coverage, and reliability across different types of applications and deployment targets.
What it is: A strategic approach to organizing automated tests that balances speed, cost, and confidence by using different types of tests at different levels of the application stack.
Why it exists: Not all tests are created equal - some are fast and cheap to run, others are slow and expensive. The test pyramid helps optimize testing strategy for maximum effectiveness.
Real-world analogy: The test pyramid is like a quality control system in manufacturing - quick checks happen frequently on the assembly line, while comprehensive inspections happen less frequently but catch different types of issues.
📊 Test Pyramid Diagram:
graph TB
subgraph "Test Pyramid"
subgraph "UI Tests (Few)"
E2E[End-to-End Tests]
UI[UI Integration Tests]
BROWSER[Cross-Browser Tests]
end
subgraph "Integration Tests (Some)"
API[API Integration Tests]
DB[Database Tests]
SERVICE[Service Integration]
CONTRACT[Contract Tests]
end
subgraph "Unit Tests (Many)"
UNIT[Unit Tests]
COMPONENT[Component Tests]
MOCK[Mock Tests]
PURE[Pure Function Tests]
end
end
subgraph "Test Characteristics"
FAST[Fast Execution<br/>Low Cost<br/>High Frequency]
MEDIUM[Medium Speed<br/>Medium Cost<br/>Medium Frequency]
SLOW[Slow Execution<br/>High Cost<br/>Low Frequency]
end
UNIT --> FAST
API --> MEDIUM
E2E --> SLOW
style UNIT fill:#99ff99
style API fill:#ffcc99
style E2E fill:#ff9999
style FAST fill:#99ff99
style MEDIUM fill:#ffcc99
style SLOW fill:#ff9999
See: diagrams/02_domain1_test_pyramid.mmd
Diagram Explanation:
The test pyramid shows the optimal distribution of automated tests. Unit Tests (green) form the foundation - they're fast, cheap, and should be numerous. They test individual functions and components in isolation. Integration Tests (orange) are in the middle - they test how components work together and are slower but catch different types of issues. UI/End-to-End Tests (red) are at the top - they're slow and expensive but test the complete user experience. The characteristics on the right show the trade-offs: as you move up the pyramid, tests become slower and more expensive but provide different types of confidence. The goal is to catch most issues with fast, cheap tests while using slower tests for scenarios that can't be tested at lower levels.
Test Types and Implementation:
Unit Tests:
Integration Tests:
End-to-End Tests:
Security Tests:
Testing Implementation in CodeBuild:
Comprehensive Testing buildspec.yml Example:
version: 0.2
env:
variables:
NODE_ENV: test
COVERAGE_THRESHOLD: 80
parameter-store:
TEST_DATABASE_URL: /myapp/test/database/url
EXTERNAL_API_URL: /myapp/test/external-api/url
secrets-manager:
TEST_API_KEY: test/myapp/external-api:key
phases:
install:
runtime-versions:
nodejs: 18
python: 3.9
commands:
- echo Installing test dependencies...
- npm install
- pip install pytest pytest-cov safety bandit
pre_build:
commands:
- echo Setting up test environment...
- docker run -d --name test-db -p 5432:5432 -e POSTGRES_PASSWORD=test postgres:13
- docker run -d --name test-redis -p 6379:6379 redis:6-alpine
- sleep 10 # Wait for services to start
- echo Running database migrations...
- npm run migrate:test
build:
commands:
# Unit Tests
- echo "=== Running Unit Tests ==="
- npm run test:unit -- --coverage --ci --watchAll=false
- echo "Unit test coverage:"
- npm run coverage:report
# Integration Tests
- echo "=== Running Integration Tests ==="
- npm run test:integration -- --ci --watchAll=false
# Security Tests
- echo "=== Running Security Scans ==="
- npm audit --audit-level moderate
- echo "Running SAST scan..."
- bandit -r ./src -f json -o bandit-report.json || true
- echo "Running dependency vulnerability scan..."
- safety check --json --output safety-report.json || true
# API Tests
- echo "=== Running API Tests ==="
- npm start & # Start application in background
- sleep 15 # Wait for application to start
- newman run tests/api/postman-collection.json --environment tests/api/test-environment.json --reporters cli,json --reporter-json-export api-test-results.json
# Performance Tests
- echo "=== Running Performance Tests ==="
- artillery run tests/performance/load-test.yml --output performance-results.json
# Build Application
- echo "=== Building Application ==="
- npm run build
- docker build -t myapp:test .
post_build:
commands:
- echo "=== Test Results Summary ==="
- node scripts/generate-test-summary.js
- echo "=== Cleanup ==="
- docker stop test-db test-redis
- docker rm test-db test-redis
- echo "Build completed on `date`"
# Test Reports for CodeBuild Console
reports:
unit-test-reports:
files:
- coverage/lcov.info
- junit.xml
file-format: JUNITXML
base-directory: coverage
integration-test-reports:
files:
- integration-test-results.xml
file-format: JUNITXML
security-reports:
files:
- bandit-report.json
- safety-report.json
file-format: CUCUMBERJSON
api-test-reports:
files:
- api-test-results.json
file-format: CUCUMBERJSON
# Artifacts for downstream stages
artifacts:
files:
- build/**/*
- docker-compose.yml
- appspec.yml
- scripts/**/*
secondary-artifacts:
test-reports:
files:
- coverage/**/*
- test-results/**/*
- security-reports/**/*
name: test-reports-$(date +%Y-%m-%d-%H-%M-%S)
Test Failure Handling Strategies:
Fail Fast Approach:
Test Result Analysis:
// scripts/generate-test-summary.js
const fs = require('fs');
// Parse test results from various tools
const unitTestResults = JSON.parse(fs.readFileSync('coverage/coverage-summary.json'));
const securityResults = JSON.parse(fs.readFileSync('bandit-report.json'));
const apiTestResults = JSON.parse(fs.readFileSync('api-test-results.json'));
// Generate summary
const summary = {
unitTests: {
passed: unitTestResults.total.lines.pct >= process.env.COVERAGE_THRESHOLD,
coverage: unitTestResults.total.lines.pct,
threshold: process.env.COVERAGE_THRESHOLD
},
securityScan: {
criticalIssues: securityResults.results.filter(r => r.issue_severity === 'HIGH').length,
mediumIssues: securityResults.results.filter(r => r.issue_severity === 'MEDIUM').length
},
apiTests: {
passed: apiTestResults.run.failures.length === 0,
totalTests: apiTestResults.run.stats.tests.total,
failures: apiTestResults.run.failures.length
}
};
// Determine overall build status
const buildPassed = summary.unitTests.passed &&
summary.securityScan.criticalIssues === 0 &&
summary.apiTests.passed;
console.log('=== TEST SUMMARY ===');
console.log(`Unit Tests: ${summary.unitTests.passed ? 'PASS' : 'FAIL'} (${summary.unitTests.coverage}% coverage)`);
console.log(`Security Scan: ${summary.securityScan.criticalIssues === 0 ? 'PASS' : 'FAIL'} (${summary.securityScan.criticalIssues} critical issues)`);
console.log(`API Tests: ${summary.apiTests.passed ? 'PASS' : 'FAIL'} (${summary.apiTests.failures}/${summary.apiTests.totalTests} failed)`);
console.log(`Overall Build: ${buildPassed ? 'PASS' : 'FAIL'}`);
// Exit with appropriate code
process.exit(buildPassed ? 0 : 1);
Advanced Testing Patterns:
Parallel Test Execution:
# Parallel testing in CodeBuild
build:
commands:
# Start multiple test suites in parallel
- npm run test:unit &
- npm run test:integration &
- npm run test:security &
- npm run test:performance &
# Wait for all background jobs to complete
- wait
# Check results from all test suites
- node scripts/check-all-test-results.js
Test Environment Management:
# Dynamic test environment creation
pre_build:
commands:
# Create isolated test environment
- export TEST_ENV_ID=$(date +%s)
- export TEST_DB_NAME="testdb_${TEST_ENV_ID}"
- export TEST_REDIS_PORT=$((6379 + ${TEST_ENV_ID} % 1000))
# Start services with unique identifiers
- docker run -d --name ${TEST_DB_NAME} -p 5432:5432 -e POSTGRES_DB=${TEST_DB_NAME} postgres:13
- docker run -d --name test-redis-${TEST_ENV_ID} -p ${TEST_REDIS_PORT}:6379 redis:6-alpine
# Update application configuration
- sed -i "s/DATABASE_NAME=.*/DATABASE_NAME=${TEST_DB_NAME}/" .env.test
- sed -i "s/REDIS_PORT=.*/REDIS_PORT=${TEST_REDIS_PORT}/" .env.test
Contract Testing Implementation:
# Contract testing for microservices
build:
commands:
# Consumer contract tests
- echo "Running consumer contract tests..."
- npm run test:pact:consumer
# Publish contracts to Pact Broker
- npm run pact:publish
# Provider contract verification
- echo "Running provider contract verification..."
- npm run test:pact:provider
# Check contract compatibility
- npm run pact:can-i-deploy --pacticipant myservice --version $CODEBUILD_BUILD_NUMBER
Detailed Example 7: Microservices Testing Strategy
A microservices architecture with 12 services requires a sophisticated testing approach that balances speed and coverage. Each service has its own repository and pipeline, but integration testing requires coordination across services. The testing strategy uses a layered approach: Unit tests run in parallel for all services, taking 2-3 minutes total. Component tests run each service in isolation with mocked dependencies, validating API contracts and business logic. Integration tests deploy multiple services to a shared test environment, running critical user journeys that span services. Contract tests ensure API compatibility between services using Pact, with consumer-driven contracts verified on both sides. End-to-end tests run against a full environment replica, testing complete user workflows including authentication, payment processing, and notification delivery. Security tests run at multiple levels - static analysis during build, dynamic scanning against running services, and penetration testing in staging. Performance tests validate individual service performance and system-wide load handling. The pipeline uses intelligent test selection - only services with changes run full test suites, while dependent services run contract verification and smoke tests. Test results are aggregated across all services, with a central dashboard showing overall system health and test coverage metrics.
✅ Must Know (Critical Facts):
When to use (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
The problem: Modern applications consist of multiple components (code, containers, dependencies, configuration) that need to be versioned, stored securely, and distributed efficiently across environments. Manual artifact management leads to inconsistencies, security vulnerabilities, and deployment failures.
The solution: Automated artifact management using AWS services that provide secure storage, versioning, lifecycle management, and efficient distribution of all application components.
Why it's tested: Artifact management is fundamental to reliable deployments. The exam tests your ability to design artifact strategies that ensure consistency, security, and efficiency across the entire software delivery lifecycle.
What are Artifacts: Artifacts are the deployable outputs of your build process - compiled code, container images, configuration files, infrastructure templates, and dependencies that together form a complete application deployment.
Why Artifact Management Matters: Without proper artifact management, you can't guarantee that what you tested is what you deploy, leading to "it works on my machine" problems and deployment inconsistencies.
Real-world analogy: Artifact management is like a sophisticated warehouse system - everything is catalogued, versioned, secured, and can be quickly retrieved when needed for shipping (deployment).
Artifact Types and Storage:
Application Artifacts:
Container Artifacts:
Infrastructure Artifacts:
Dependency Artifacts:
What it is: AWS CodeArtifact is a fully managed artifact repository service that makes it easy for organizations to securely store, publish, and share software packages used in their software development process.
Why it exists: Organizations need a secure, scalable way to manage internal packages and control access to external dependencies. Public repositories like npm, PyPI, and Maven Central don't provide the security, compliance, and governance controls enterprises require.
Real-world analogy: CodeArtifact is like a private library system for your organization - you can store your own books (internal packages), control access to external books (public packages), and ensure everything meets your quality and security standards.
How it works (Detailed step-by-step):
📊 CodeArtifact Architecture Diagram:
graph TB
subgraph "CodeArtifact Domain"
subgraph "Internal Repositories"
NPM_INTERNAL[npm-internal]
PIP_INTERNAL[pip-internal]
MAVEN_INTERNAL[maven-internal]
NUGET_INTERNAL[nuget-internal]
end
subgraph "Upstream Repositories"
NPM_PUBLIC[npm-public]
PIP_PUBLIC[pip-public]
MAVEN_PUBLIC[maven-public]
end
end
subgraph "External Sources"
NPMJS[npmjs.com]
PYPI[pypi.org]
MAVEN_CENTRAL[Maven Central]
end
subgraph "Consumers"
DEV[Developer Workstation]
CB[CodeBuild]
EC2[EC2 Instance]
LAMBDA[Lambda Function]
end
NPM_PUBLIC --> NPMJS
PIP_PUBLIC --> PYPI
MAVEN_PUBLIC --> MAVEN_CENTRAL
NPM_INTERNAL --> NPM_PUBLIC
PIP_INTERNAL --> PIP_PUBLIC
MAVEN_INTERNAL --> MAVEN_PUBLIC
DEV --> NPM_INTERNAL
DEV --> PIP_INTERNAL
CB --> NPM_INTERNAL
CB --> MAVEN_INTERNAL
EC2 --> PIP_INTERNAL
LAMBDA --> NPM_INTERNAL
style NPM_INTERNAL fill:#99ccff
style PIP_INTERNAL fill:#99ccff
style MAVEN_INTERNAL fill:#99ccff
style NUGET_INTERNAL fill:#99ccff
style NPM_PUBLIC fill:#ffcc99
style PIP_PUBLIC fill:#ffcc99
style MAVEN_PUBLIC fill:#ffcc99
See: diagrams/02_domain1_codeartifact_architecture.mmd
Diagram Explanation:
This diagram shows a complete CodeArtifact setup within a domain. Internal repositories (blue) store organization-specific packages and are configured with upstream repositories (orange) that proxy external package sources. When consumers request packages, CodeArtifact first checks internal repositories, then upstream repositories, and finally fetches from external sources if needed. This creates a secure, controlled pipeline for both internal and external dependencies. Different consumers (developers, CodeBuild, EC2, Lambda) can access appropriate repositories based on IAM permissions.
Repository Configuration Examples:
npm Repository Setup:
# Create domain and repository
aws codeartifact create-domain --domain mycompany
aws codeartifact create-repository --domain mycompany --repository npm-internal --format npm
# Create upstream repository for external packages
aws codeartifact create-repository --domain mycompany --repository npm-public --format npm
aws codeartifact associate-external-connection --domain mycompany --repository npm-public --external-connection public:npmjs
# Configure npm-internal to use npm-public as its upstream
aws codeartifact update-repository --domain mycompany --repository npm-internal --upstreams repositoryName=npm-public
# Attach a repository permissions policy (see the policy example below)
aws codeartifact put-repository-permissions-policy --domain mycompany --repository npm-internal --policy-document file://npm-policy.json
# Configure npm client
aws codeartifact login --tool npm --domain mycompany --repository npm-internal
Python Repository Setup:
# Create Python repositories: public proxy first, then internal with the proxy as upstream
aws codeartifact create-repository --domain mycompany --repository pip-public --format pypi
aws codeartifact associate-external-connection --domain mycompany --repository pip-public --external-connection public:pypi
aws codeartifact create-repository --domain mycompany --repository pip-internal --format pypi --upstreams repositoryName=pip-public
# Configure pip client
aws codeartifact login --tool pip --domain mycompany --repository pip-internal
Maven Repository Setup:
# Create Maven repositories: public proxy first, then internal with the proxy as upstream
aws codeartifact create-repository --domain mycompany --repository maven-public --format maven
aws codeartifact associate-external-connection --domain mycompany --repository maven-public --external-connection public:maven-central
aws codeartifact create-repository --domain mycompany --repository maven-internal --format maven --upstreams repositoryName=maven-public
# Maven has no "login" helper; fetch an auth token and reference it in settings.xml
export CODEARTIFACT_AUTH_TOKEN=$(aws codeartifact get-authorization-token --domain mycompany --query authorizationToken --output text)
# Add a <server> entry with this token and the maven-internal repository endpoint to ~/.m2/settings.xml
CodeBuild Integration Example:
version: 0.2
env:
variables:
CODEARTIFACT_DOMAIN: mycompany
CODEARTIFACT_REPOSITORY: npm-internal
phases:
install:
runtime-versions:
nodejs: 18
commands:
- echo Configuring CodeArtifact...
- aws codeartifact login --tool npm --domain $CODEARTIFACT_DOMAIN --repository $CODEARTIFACT_REPOSITORY
pre_build:
commands:
- echo Installing dependencies from CodeArtifact...
- npm ci
- echo Publishing internal package...
- npm version patch
- npm publish
build:
commands:
- echo Building application...
- npm run build
- echo Creating deployment package...
- zip -r deployment.zip build/ package.json
Access Control and Security:
Repository Permissions Policy Example (resource-based policy):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/CodeBuildServiceRole"
},
"Action": [
"codeartifact:ReadFromRepository",
"codeartifact:GetPackageVersionReadme",
"codeartifact:GetPackageVersionAsset",
"codeartifact:ListPackageVersions"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/DeveloperRole"
},
"Action": [
"codeartifact:PublishPackageVersion",
"codeartifact:PutPackageMetadata"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"codeartifact:namespace": "mycompany"
}
}
}
]
}
Package Lifecycle Management:
# Control package origin (allow direct publishing, block pulling this package from upstream)
aws codeartifact put-package-origin-configuration \
--domain mycompany \
--repository npm-internal \
--format npm \
--package mypackage \
--restrictions publish=ALLOW,upstream=BLOCK
# Delete old package versions
aws codeartifact delete-package-versions \
--domain mycompany \
--repository npm-internal \
--format npm \
--package mypackage \
--versions 1.0.0 1.0.1
Detailed Example 8: Enterprise Package Management Strategy
A large enterprise with multiple development teams needs centralized package management across different technology stacks. The organization creates separate CodeArtifact domains for different business units (finance, marketing, engineering) with cross-domain sharing for common packages. Each domain contains multiple repositories: internal repositories for proprietary packages, upstream repositories for external dependencies, and shared repositories for cross-team collaboration. The npm-internal repository stores React components, utility libraries, and microservice SDKs developed internally. The pip-internal repository contains Python data processing libraries, ML models, and API clients. The maven-internal repository holds Java enterprise libraries, Spring Boot starters, and integration frameworks. Access control is implemented using IAM policies that restrict package publishing to specific teams while allowing read access across the organization. Automated scanning checks all packages for security vulnerabilities before allowing publication. Lifecycle policies automatically clean up old package versions while preserving release versions. The system includes monitoring for package usage, dependency analysis, and license compliance reporting. Integration with CI/CD pipelines ensures all builds use approved package versions and automatically publish new internal packages upon successful testing.
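Cross-team and cross-account sharing of this kind is granted through resource policies on the repository (or domain). A minimal sketch that lets a second account (210987654321 is a placeholder) pull packages from npm-internal:
# Allow a second account to read packages from the shared repository
aws codeartifact put-repository-permissions-policy --domain mycompany --repository npm-internal --policy-document '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "CrossAccountRead",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::210987654321:root"},
    "Action": [
      "codeartifact:ReadFromRepository",
      "codeartifact:GetPackageVersionAsset",
      "codeartifact:ListPackageVersions"
    ],
    "Resource": "*"
  }]
}'
The consuming account's principals additionally need codeartifact:GetAuthorizationToken on the domain and sts:GetServiceBearerToken to obtain repository credentials.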
✅ Must Know (Critical Facts):
When to use (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
What it is: Amazon Elastic Container Registry (ECR) is a fully managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images.
Why it exists: Container-based applications need a secure, scalable registry for storing and distributing container images. Public registries like Docker Hub don't provide the security, compliance, and integration features required for enterprise applications.
Real-world analogy: ECR is like a secure warehouse for shipping containers - each container (Docker image) is catalogued, secured, and can be quickly shipped (deployed) to any destination (compute environment).
How it works (Detailed step-by-step):
📊 ECR Workflow Diagram:
graph TB
subgraph "Development"
DEV[Developer]
DOCKERFILE[Dockerfile]
BUILD[Docker Build]
end
subgraph "CI/CD Pipeline"
CB[CodeBuild]
AUTH[ECR Authentication]
PUSH[Docker Push]
end
subgraph "ECR Repository"
REPO[ECR Repository]
SCAN[Image Scanning]
LIFECYCLE[Lifecycle Policy]
REPLICATION[Cross-Region Replication]
end
subgraph "Deployment Targets"
ECS[ECS Service]
EKS[EKS Cluster]
EC2[EC2 Instance]
LAMBDA[Lambda Container]
end
subgraph "Security & Compliance"
VULN[Vulnerability Reports]
POLICY[Repository Policies]
AUDIT[Access Logging]
end
DEV --> DOCKERFILE
DOCKERFILE --> BUILD
BUILD --> CB
CB --> AUTH
AUTH --> PUSH
PUSH --> REPO
REPO --> SCAN
SCAN --> VULN
REPO --> LIFECYCLE
REPO --> REPLICATION
REPO --> ECS
REPO --> EKS
REPO --> EC2
REPO --> LAMBDA
REPO --> POLICY
REPO --> AUDIT
style BUILD fill:#99ccff
style REPO fill:#99ff99
style SCAN fill:#ff9999
style ECS fill:#ffcc99
style EKS fill:#ffcc99
See: diagrams/02_domain1_ecr_workflow.mmd
Diagram Explanation:
This diagram shows the complete ECR workflow from development to deployment. Developers create Dockerfiles and build images (blue), which are processed through CI/CD pipelines using CodeBuild. After ECR authentication, images are pushed to ECR repositories (green). ECR automatically scans images for vulnerabilities (red) and applies lifecycle policies for retention management. Images can be replicated across regions and deployed to various compute platforms (orange) including ECS, EKS, EC2, and Lambda. Security and compliance features provide vulnerability reports, access policies, and audit logging throughout the process.
ECR Repository Management:
Repository Creation and Configuration:
# Create ECR repository
aws ecr create-repository --repository-name myapp/web --region us-east-1
# Configure repository policy for cross-account access
aws ecr set-repository-policy --repository-name myapp/web --policy-text '{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowCrossAccountPull",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:root"
},
"Action": [
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"ecr:BatchCheckLayerAvailability"
]
}
]
}'
# Enable image scanning
aws ecr put-image-scanning-configuration --repository-name myapp/web --image-scanning-configuration scanOnPush=true
# Configure lifecycle policy
aws ecr put-lifecycle-policy --repository-name myapp/web --lifecycle-policy-text '{
"rules": [
{
"rulePriority": 1,
"description": "Keep last 10 production images",
"selection": {
"tagStatus": "tagged",
"tagPrefixList": ["prod"],
"countType": "imageCountMoreThan",
"countNumber": 10
},
"action": {
"type": "expire"
}
},
{
"rulePriority": 2,
"description": "Delete untagged images older than 1 day",
"selection": {
"tagStatus": "untagged",
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 1
},
"action": {
"type": "expire"
}
}
]
}'
Docker Image Build and Push Process:
# Build multi-stage Docker image
docker build -t myapp:latest .
# Tag image for ECR
docker tag myapp:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp/web:latest
docker tag myapp:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp/web:v1.2.3
# Authenticate Docker with ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
# Push images to ECR
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp/web:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp/web:v1.2.3
CodeBuild Integration for Container Builds:
version: 0.2
env:
variables:
AWS_DEFAULT_REGION: us-east-1
AWS_ACCOUNT_ID: 123456789012
IMAGE_REPO_NAME: myapp/web
IMAGE_TAG: latest
phases:
pre_build:
commands:
- echo Logging in to Amazon ECR...
- aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
- echo Setting image tag with build number...
- IMAGE_TAG=$CODEBUILD_BUILD_NUMBER
build:
commands:
- echo Build started on `date`
- echo Building the Docker image...
- docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG .
- docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
- docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:latest
post_build:
commands:
- echo Build completed on `date`
- echo Pushing the Docker images...
- docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
- docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:latest
- echo Writing image definitions file...
- printf '[{"name":"web-container","imageUri":"%s"}]' $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG > imagedefinitions.json
artifacts:
files:
- imagedefinitions.json
- appspec.yml
Image Scanning and Security:
Vulnerability Scanning Configuration:
# Enable enhanced scanning (Inspector integration)
aws ecr put-registry-scanning-configuration --scan-type ENHANCED --rules '[
{
"scanFrequency": "SCAN_ON_PUSH",
"repositoryFilters": [
{
"filter": "*",
"filterType": "WILDCARD"
}
]
}
]'
# Get scan results
aws ecr describe-image-scan-findings --repository-name myapp/web --image-id imageTag=v1.2.3
# Note: scan-finding retention is not configured through ECR; basic scan findings are kept by ECR
# and enhanced findings are managed by Amazon Inspector. In a pipeline, wait for the scan to
# complete before evaluating the findings:
aws ecr wait image-scan-complete --repository-name myapp/web --image-id imageTag=v1.2.3
Cross-Region Replication:
# Configure replication to multiple regions
aws ecr put-replication-configuration --replication-configuration '{
"rules": [
{
"destinations": [
{
"region": "us-west-2",
"registryId": "123456789012"
},
{
"region": "eu-west-1",
"registryId": "123456789012"
}
],
"repositoryFilters": [
{
"filter": "myapp/*",
"filterType": "PREFIX_MATCH"
}
]
}
]
}'
Advanced ECR Features:
Pull Through Cache:
# Create pull through cache rule for Docker Hub
# (a Docker Hub upstream also requires --credential-arn referencing a Secrets Manager secret with Docker Hub credentials)
aws ecr create-pull-through-cache-rule --ecr-repository-prefix docker-hub --upstream-registry-url registry-1.docker.io
# Pull image through cache
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/docker-hub/library/nginx:latest
Immutable Image Tags:
# Enable image tag immutability
aws ecr put-image-tag-mutability --repository-name myapp/web --image-tag-mutability IMMUTABLE
Repository Templates:
# Create repository creation template
aws ecr create-repository-creation-template --prefix myapp/ --description "Template for myapp repositories" --image-tag-mutability IMMUTABLE --lifecycle-policy '{
"rules": [
{
"rulePriority": 1,
"description": "Keep last 5 images",
"selection": {
"tagStatus": "any",
"countType": "imageCountMoreThan",
"countNumber": 5
},
"action": {
"type": "expire"
}
}
]
}'
Detailed Example 9: Multi-Environment Container Strategy
A microservices application with 15 services requires a sophisticated container management strategy across development, staging, and production environments. Each service has its own ECR repository with consistent naming conventions (service-name/environment). The build process creates multi-architecture images supporting both x86 and ARM architectures for cost optimization. Images are tagged with multiple identifiers: commit SHA for traceability, semantic version for releases, and environment-specific tags for deployment. The lifecycle policy retains the last 10 production images, 5 staging images, and 3 development images, while automatically cleaning up untagged images after 24 hours. Security scanning is enabled for all repositories with enhanced scanning for production images. Critical vulnerabilities block deployment to production, while medium vulnerabilities generate alerts but allow deployment with approval. Cross-region replication ensures images are available in disaster recovery regions. The deployment process uses immutable tags for production to prevent accidental overwrites. Monitoring tracks image pull metrics, scan results, and repository usage across all environments. Cost optimization includes using pull-through cache for base images and implementing repository cleanup automation.
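The per-environment retention described above maps directly onto tag-prefix lifecycle rules. A sketch of the policy document (applied with put-lifecycle-policy as shown earlier; the tag prefixes and counts follow this example and should be adapted to your tagging scheme):
{
  "rules": [
    {"rulePriority": 1, "description": "Keep last 10 prod images",
     "selection": {"tagStatus": "tagged", "tagPrefixList": ["prod"], "countType": "imageCountMoreThan", "countNumber": 10},
     "action": {"type": "expire"}},
    {"rulePriority": 2, "description": "Keep last 5 staging images",
     "selection": {"tagStatus": "tagged", "tagPrefixList": ["staging"], "countType": "imageCountMoreThan", "countNumber": 5},
     "action": {"type": "expire"}},
    {"rulePriority": 3, "description": "Keep last 3 dev images",
     "selection": {"tagStatus": "tagged", "tagPrefixList": ["dev"], "countType": "imageCountMoreThan", "countNumber": 3},
     "action": {"type": "expire"}},
    {"rulePriority": 4, "description": "Delete untagged images after 1 day",
     "selection": {"tagStatus": "untagged", "countType": "sinceImagePushed", "countUnit": "days", "countNumber": 1},
     "action": {"type": "expire"}}
  ]
}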
✅ Must Know (Critical Facts):
When to use (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
The problem: Traditional deployment approaches (taking systems offline, replacing all components at once) create significant risk, downtime, and user impact. Modern applications require deployment strategies that minimize risk, enable quick rollback, and maintain high availability.
The solution: Advanced deployment strategies that gradually introduce changes, validate functionality at each step, and provide automatic rollback capabilities when issues are detected.
Why it's tested: Deployment strategy choice significantly impacts application availability, user experience, and operational risk. The exam tests your ability to choose and implement appropriate deployment strategies for different scenarios and platforms.
What are Deployment Strategies: Systematic approaches to releasing new versions of applications that balance speed, risk, and availability requirements.
Why Multiple Strategies Exist: Different applications, environments, and business requirements need different approaches to change management and risk mitigation.
Real-world analogy: Deployment strategies are like different approaches to renovating a busy restaurant - you might close completely for major changes (recreate), renovate one section at a time (rolling), or open a second location and gradually move customers (blue/green).
Deployment Strategy Categories:
Basic Strategies:
Advanced Strategies:
What it is: Blue/Green deployment is a technique that reduces downtime and risk by running two identical production environments called Blue and Green, with only one serving production traffic at any time.
Why it exists: Traditional deployments require downtime and carry risk of deployment failures affecting users. Blue/Green enables zero-downtime deployments with instant rollback capabilities.
Real-world analogy: Blue/Green deployment is like having two identical stages in a theater - while one stage performs for the audience, the other is being set up for the next show. When ready, you simply switch the audience to the new stage.
How it works (Detailed step-by-step):
📊 Blue/Green Deployment Flow Diagram:
graph TB
subgraph "Load Balancer"
LB[Application Load Balancer]
TG_BLUE[Blue Target Group]
TG_GREEN[Green Target Group]
end
subgraph "Blue Environment (Current Production)"
BLUE_ASG[Blue Auto Scaling Group]
BLUE_EC2_1[EC2 Instance 1]
BLUE_EC2_2[EC2 Instance 2]
BLUE_EC2_3[EC2 Instance 3]
end
subgraph "Green Environment (New Version)"
GREEN_ASG[Green Auto Scaling Group]
GREEN_EC2_1[EC2 Instance 1]
GREEN_EC2_2[EC2 Instance 2]
GREEN_EC2_3[EC2 Instance 3]
end
subgraph "Shared Resources"
RDS[RDS Database]
CACHE[ElastiCache]
S3[S3 Storage]
end
LB --> TG_BLUE
LB -.-> TG_GREEN
TG_BLUE --> BLUE_ASG
TG_GREEN --> GREEN_ASG
BLUE_ASG --> BLUE_EC2_1
BLUE_ASG --> BLUE_EC2_2
BLUE_ASG --> BLUE_EC2_3
GREEN_ASG --> GREEN_EC2_1
GREEN_ASG --> GREEN_EC2_2
GREEN_ASG --> GREEN_EC2_3
BLUE_EC2_1 --> RDS
BLUE_EC2_2 --> RDS
BLUE_EC2_3 --> RDS
GREEN_EC2_1 -.-> RDS
GREEN_EC2_2 -.-> RDS
GREEN_EC2_3 -.-> RDS
BLUE_EC2_1 --> CACHE
GREEN_EC2_1 -.-> CACHE
BLUE_EC2_1 --> S3
GREEN_EC2_1 -.-> S3
style TG_BLUE fill:#99ccff
style TG_GREEN fill:#99ff99
style BLUE_ASG fill:#99ccff
style GREEN_ASG fill:#99ff99
style RDS fill:#ffcc99
See: diagrams/02_domain1_blue_green_deployment.mmd
Diagram Explanation:
This diagram shows a Blue/Green deployment setup using AWS services. The Application Load Balancer routes traffic to either the Blue Target Group (current production, blue) or Green Target Group (new version, green). Each environment has its own Auto Scaling Group with EC2 instances. Shared resources like RDS database, ElastiCache, and S3 storage are accessed by both environments. During normal operation, traffic flows to Blue (solid lines). During deployment, Green is prepared and tested (dotted lines), then traffic is switched from Blue to Green. This enables zero-downtime deployments with instant rollback capability.
AWS Implementation with CodeDeploy:
CodeDeploy Blue/Green Configuration:
# appspec.yml for Blue/Green deployment
version: 0.0
os: linux
files:
- source: /
destination: /var/www/html
hooks:
BeforeInstall:
- location: scripts/install_dependencies.sh
timeout: 300
ApplicationStart:
- location: scripts/start_server.sh
timeout: 300
ApplicationStop:
- location: scripts/stop_server.sh
timeout: 300
ValidateService:
- location: scripts/validate_service.sh
timeout: 300
CodeDeploy Blue/Green Deployment Group Settings (excerpt):
{
"deploymentConfigName": "CodeDeployDefault.AllAtOnce",
"blueGreenDeploymentConfiguration": {
"terminateBlueInstancesOnDeploymentSuccess": {
"action": "TERMINATE",
"terminationWaitTimeInMinutes": 5
},
"deploymentReadyOption": {
"actionOnTimeout": "CONTINUE_DEPLOYMENT"
},
"greenFleetProvisioningOption": {
"action": "COPY_AUTO_SCALING_GROUP"
}
}
}
ECS Task Definition for Blue/Green Deployment (registered as a new revision before traffic shifting):
{
"family": "myapp-task",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "512",
"containerDefinitions": [
{
"name": "web-container",
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest",
"portMappings": [
{
"containerPort": 80,
"protocol": "tcp"
}
],
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
]
}
Detailed Example 10: Enterprise Blue/Green Strategy
A large e-commerce platform implements Blue/Green deployment across multiple services and environments. The architecture uses separate Auto Scaling Groups for Blue and Green environments, each with identical configurations but different AMI versions. The Application Load Balancer uses weighted target groups to control traffic distribution - initially 100% to Blue, 0% to Green. During deployment, the new version is deployed to Green environment and undergoes comprehensive testing including health checks, smoke tests, and limited user acceptance testing using a small percentage of traffic (5%). If tests pass, traffic is gradually shifted from Blue to Green over a 30-minute period (50%, 75%, 100%) while monitoring key metrics like error rates, response times, and business KPIs. If any metric exceeds thresholds, automatic rollback occurs by shifting traffic back to Blue within 2 minutes. The database layer uses read replicas to handle increased load during traffic shifts, and application state is managed through external session stores (ElastiCache) to ensure user sessions persist across environment switches. After successful deployment, the Blue environment remains available for 24 hours before being terminated, providing a safety net for any delayed issues. The entire process is automated through CodePipeline with manual approval gates for production deployments.
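The automatic rollback behavior in this example is typically configured on the CodeDeploy deployment group by pairing CloudWatch alarms with auto-rollback events. A sketch (application, deployment group, and alarm names are placeholders):
aws deploy update-deployment-group \
  --application-name ecommerce-web \
  --current-deployment-group-name prod-blue-green \
  --auto-rollback-configuration '{"enabled": true, "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]}' \
  --alarm-configuration '{"enabled": true, "alarms": [{"name": "prod-5xx-error-rate"}, {"name": "prod-p99-latency"}]}'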
✅ Must Know (Critical Facts):
When to use (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
What it is: Canary deployment is a technique that reduces the risk of introducing new software versions by slowly rolling out changes to a small subset of users before making them available to everyone.
Why it exists: Even with comprehensive testing, production environments can reveal issues not caught in testing. Canary deployments limit the blast radius of potential issues while gathering real-world feedback.
Real-world analogy: Canary deployment is like testing a new recipe on a few customers before adding it to the full menu - you get real feedback with limited risk.
How it works (Detailed step-by-step):
📊 Canary Deployment Progression Diagram:
graph TB
subgraph "Stage 1: Initial Canary (5%)"
LB1[Load Balancer]
PROD1[Production v1.0<br/>95% Traffic]
CANARY1[Canary v2.0<br/>5% Traffic]
LB1 --> PROD1
LB1 --> CANARY1
end
subgraph "Stage 2: Increased Canary (25%)"
LB2[Load Balancer]
PROD2[Production v1.0<br/>75% Traffic]
CANARY2[Canary v2.0<br/>25% Traffic]
LB2 --> PROD2
LB2 --> CANARY2
end
subgraph "Stage 3: Majority Canary (75%)"
LB3[Load Balancer]
PROD3[Production v1.0<br/>25% Traffic]
CANARY3[Canary v2.0<br/>75% Traffic]
LB3 --> PROD3
LB3 --> CANARY3
end
subgraph "Stage 4: Full Rollout (100%)"
LB4[Load Balancer]
PROD4[Production v2.0<br/>100% Traffic]
LB4 --> PROD4
end
CANARY1 --> CANARY2
CANARY2 --> CANARY3
CANARY3 --> PROD4
style CANARY1 fill:#ffcc99
style CANARY2 fill:#ffcc99
style CANARY3 fill:#ffcc99
style PROD4 fill:#99ff99
See: diagrams/02_domain1_canary_deployment_progression.mmd
Diagram Explanation:
This diagram shows the progression of a canary deployment through four stages. Stage 1 introduces the canary version (orange) to 5% of traffic while 95% remains on the current production version. If metrics are healthy, Stage 2 increases canary traffic to 25%. Stage 3 shifts majority traffic (75%) to the canary version. Finally, Stage 4 completes the rollout with 100% traffic on the new version (green). At each stage, metrics are monitored and the deployment can be halted or rolled back if issues are detected.
AWS Lambda Canary Deployment:
Lambda Alias Configuration:
# Create Lambda function versions
aws lambda publish-version --function-name myapp-api --description "Version 1.0"
aws lambda publish-version --function-name myapp-api --description "Version 2.0"
# Create alias with traffic shifting
aws lambda create-alias --function-name myapp-api --name PROD --function-version 1 --routing-config '{
"AdditionalVersionWeights": {
"2": 0.05
}
}'
# Gradually increase canary traffic
aws lambda update-alias --function-name myapp-api --name PROD --routing-config '{
"AdditionalVersionWeights": {
"2": 0.25
}
}'
# Complete rollout
aws lambda update-alias --function-name myapp-api --name PROD --function-version 2
CodeDeploy Lambda Canary Configuration:
# appspec.yml for Lambda canary deployment
version: 0.0
Resources:
- myLambdaFunction:
Type: AWS::Lambda::Function
Properties:
Name: "myapp-api"
Alias: "PROD"
CurrentVersion: "1"
TargetVersion: "2"
Hooks:
- BeforeAllowTraffic: "validateFunction"
- AfterAllowTraffic: "validateDeployment"
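CodeDeploy also provides predefined Lambda traffic-shifting configurations (for example, CodeDeployDefault.LambdaCanary10Percent5Minutes and CodeDeployDefault.LambdaLinear10PercentEvery1Minute), so the weights usually do not need to be scripted by hand. A sketch of pointing an existing deployment group at one of them (application and group names are placeholders):
aws deploy update-deployment-group \
  --application-name myapp-api \
  --current-deployment-group-name prod \
  --deployment-config-name CodeDeployDefault.LambdaCanary10Percent5Minutes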
ECS Canary with Application Load Balancer:
{
"serviceName": "myapp-service",
"cluster": "production",
"taskDefinition": "myapp-task:2",
"desiredCount": 10,
"deploymentConfiguration": {
"maximumPercent": 200,
"minimumHealthyPercent": 50,
"deploymentCircuitBreaker": {
"enable": true,
"rollback": true
}
},
"loadBalancers": [
{
"targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myapp-canary/1234567890123456",
"containerName": "web-container",
"containerPort": 80
}
]
}
Canary Monitoring and Automation:
# Automated canary analysis script (illustrative; metric names, ARNs, and thresholds are placeholders)
import time
import boto3
from datetime import datetime, timedelta
def analyze_canary_metrics():
cloudwatch = boto3.client('cloudwatch')
# Define metrics to monitor
metrics = [
{'name': 'ErrorRate', 'threshold': 0.01}, # 1% error rate
{'name': 'ResponseTime', 'threshold': 500}, # 500ms response time
{'name': 'ThroughputDrop', 'threshold': 0.1} # 10% throughput drop
]
# Get metrics for canary and production versions
end_time = datetime.utcnow()
start_time = end_time - timedelta(minutes=15)
canary_healthy = True
for metric in metrics:
# Get canary metrics
canary_response = cloudwatch.get_metric_statistics(
Namespace='AWS/ApplicationELB',
MetricName=metric['name'],
Dimensions=[
{'Name': 'TargetGroup', 'Value': 'myapp-canary'}
],
StartTime=start_time,
EndTime=end_time,
Period=300,
Statistics=['Average']
)
# Get production metrics for comparison
prod_response = cloudwatch.get_metric_statistics(
Namespace='AWS/ApplicationELB',
MetricName=metric['name'],
Dimensions=[
{'Name': 'TargetGroup', 'Value': 'myapp-production'}
],
StartTime=start_time,
EndTime=end_time,
Period=300,
Statistics=['Average']
)
# Analyze metrics
if canary_response['Datapoints'] and prod_response['Datapoints']:
canary_value = canary_response['Datapoints'][-1]['Average']
prod_value = prod_response['Datapoints'][-1]['Average']
if metric['name'] == 'ThroughputDrop':
if (prod_value - canary_value) / prod_value > metric['threshold']:
canary_healthy = False
print(f"Canary throughput drop detected: {canary_value} vs {prod_value}")
else:
if canary_value > metric['threshold']:
canary_healthy = False
print(f"Canary {metric['name']} threshold exceeded: {canary_value}")
return canary_healthy
def update_canary_traffic(percentage):
"""Update canary traffic percentage"""
elbv2 = boto3.client('elbv2')
# Update target group weights
elbv2.modify_listener(
ListenerArn='arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/myapp-alb/1234567890123456/1234567890123456',
DefaultActions=[
{
'Type': 'forward',
'ForwardConfig': {
'TargetGroups': [
{
'TargetGroupArn': 'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myapp-production/1234567890123456',
'Weight': 100 - percentage
},
{
'TargetGroupArn': 'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myapp-canary/1234567890123456',
'Weight': percentage
}
]
}
}
]
)
# Automated canary progression
def automated_canary_deployment():
stages = [5, 25, 50, 75, 100]
for stage in stages:
print(f"Deploying canary to {stage}% of traffic")
update_canary_traffic(stage)
# Wait for metrics to stabilize
time.sleep(600) # 10 minutes
# Analyze metrics
if not analyze_canary_metrics():
print("Canary metrics unhealthy, rolling back")
update_canary_traffic(0) # Rollback
return False
print(f"Stage {stage}% successful, proceeding")
print("Canary deployment completed successfully")
return True
What it is: Rolling update is a deployment strategy that gradually replaces instances of the previous version with instances of the new version, maintaining service availability throughout the process.
Why it exists: Rolling updates provide a balance between deployment speed and risk mitigation, requiring fewer resources than Blue/Green while providing better availability than recreate deployments.
Real-world analogy: Rolling update is like renovating a hotel one floor at a time - guests can still stay in the hotel while renovations happen, and you don't need to build a second hotel.
How it works (Detailed step-by-step):
ECS Rolling Update Configuration:
{
"serviceName": "myapp-service",
"cluster": "production",
"taskDefinition": "myapp-task:2",
"desiredCount": 10,
"deploymentConfiguration": {
"maximumPercent": 150,
"minimumHealthyPercent": 75,
"deploymentCircuitBreaker": {
"enable": true,
"rollback": true
}
},
"healthCheckGracePeriodSeconds": 300
}
Auto Scaling Group Rolling Update:
{
"AutoScalingGroupName": "myapp-asg",
"LaunchTemplate": {
"LaunchTemplateName": "myapp-template",
"Version": "2"
},
"MinSize": 3,
"MaxSize": 6,
"DesiredCapacity": 3,
"DefaultCooldown": 300,
"HealthCheckType": "ELB",
"HealthCheckGracePeriod": 300
}
CloudFormation Rolling Update Policy:
AutoScalingGroup:
Type: AWS::AutoScaling::AutoScalingGroup
Properties:
LaunchTemplate:
LaunchTemplateId: !Ref LaunchTemplate
Version: !GetAtt LaunchTemplate.LatestVersionNumber
MinSize: 3
MaxSize: 6
DesiredCapacity: 3
TargetGroupARNs:
- !Ref TargetGroup
HealthCheckType: ELB
HealthCheckGracePeriod: 300
UpdatePolicy:
AutoScalingRollingUpdate:
MinInstancesInService: 2
MaxBatchSize: 1
PauseTime: PT5M
WaitOnResourceSignals: true
SuspendProcesses:
- HealthCheck
- ReplaceUnhealthy
- AZRebalance
- AlarmNotification
- ScheduledActions
Detailed Example 11: Kubernetes Rolling Update Strategy
A microservices application running on Amazon EKS implements sophisticated rolling update strategies for different service types. Critical services use a conservative approach with maxUnavailable: 1 and maxSurge: 1, ensuring only one pod is updated at a time while maintaining full capacity. Less critical services use maxUnavailable: 25% and maxSurge: 25% for faster updates. Each service has comprehensive readiness and liveness probes - readiness probes check application startup and dependency availability, while liveness probes detect application hangs or deadlocks. The update process includes pre-stop hooks that gracefully drain connections and post-start hooks that warm up caches and connections. Rolling updates are coordinated with Horizontal Pod Autoscaler (HPA) to prevent scaling conflicts during deployments. The system includes automated rollback triggers based on error rate thresholds, response time degradation, and failed readiness checks. Service mesh (Istio) provides additional traffic management capabilities, enabling fine-grained traffic splitting and circuit breaking during updates. Monitoring includes deployment progress tracking, pod restart counts, and service-level indicators (SLIs) that trigger alerts if deployment health degrades. The entire process is automated through GitOps workflows that validate changes, run tests, and coordinate updates across dependent services.
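A minimal Kubernetes Deployment manifest illustrating the conservative rolling-update settings and probes described above (the image URI, port, and health-check paths are placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # never drop more than one pod below desired capacity
      maxSurge: 1         # add at most one extra pod during the rollout
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: payments
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/payments:v1.2.3
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]   # allow in-flight connections to drain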
✅ Must Know (Critical Facts):
When to use (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
This comprehensive chapter covered the four major task areas of SDLC Automation, which represents 22% of the DOP-C02 exam:
✅ Task 1.1 - CI/CD Pipeline Implementation:
✅ Task 1.2 - Automated Testing Integration:
✅ Task 1.3 - Artifact Management:
✅ Task 1.4 - Deployment Strategies:
Pipeline Orchestration: CodePipeline serves as the central orchestrator, integrating multiple AWS services and third-party tools into cohesive delivery workflows.
Security Integration: Security must be integrated throughout the pipeline - from secrets management to vulnerability scanning to access controls.
Testing Strategy: Implement a balanced testing approach using the test pyramid - many fast unit tests, some integration tests, few slow end-to-end tests.
Artifact Lifecycle: Proper artifact management ensures consistency between environments and enables reliable deployments and rollbacks.
Deployment Risk Management: Choose deployment strategies based on risk tolerance, resource constraints, and availability requirements.
Monitoring and Observability: Comprehensive monitoring is essential for detecting issues, triggering rollbacks, and maintaining system health.
Automation Over Manual Processes: Every manual step in the delivery process is an opportunity for error and inconsistency.
Test yourself before moving on to Domain 2:
Try these from your practice test bundles:
If you scored below target:
Copy this summary to your notes for quick review:
Key Services:
Key Concepts:
Decision Points:
🎯 Pattern 1: Pipeline Design Questions
🎯 Pattern 2: Testing Strategy Questions
🎯 Pattern 3: Deployment Strategy Selection
🎯 Pattern 4: Artifact Management Questions
🎯 Pattern 5: Troubleshooting Questions
Congratulations! You've mastered SDLC Automation, the largest domain on the DOP-C02 exam. You now understand how to design, implement, and manage sophisticated CI/CD pipelines that deliver applications reliably and securely.
Chapter 2 Preview: In the next chapter, we'll dive into Domain 2: Configuration Management and Infrastructure as Code (17% of exam). You'll learn to:
Ready to continue? Proceed to Chapter 2: Configuration Management and IaC when you've completed the self-assessment checklist above and achieved target scores on practice tests.
Remember: SDLC Automation is fundamental to DevOps success. The concepts you've learned here will be referenced throughout the remaining domains, so ensure you're comfortable with these foundations before moving forward.
What you'll learn:
Time to complete: 10-12 hours
Prerequisites: Chapter 0 (Fundamentals), Chapter 1 (SDLC Automation)
Exam weight: 17% (approximately 11-12 questions)
Domain Tasks Covered:
The problem: Manual infrastructure provisioning is slow, error-prone, inconsistent across environments, and doesn't scale. Teams spend excessive time on repetitive infrastructure tasks instead of focusing on business value, and infrastructure changes lack proper version control and testing.
The solution: Infrastructure as Code (IaC) treats infrastructure like software - version controlled, tested, reviewed, and deployed through automated processes. This enables consistent, repeatable, and scalable infrastructure management.
Why it's tested: IaC is fundamental to modern cloud operations and DevOps practices. The exam tests your ability to design, implement, and manage infrastructure using code, ensuring consistency, security, and operational excellence at scale.
What is Infrastructure as Code: IaC is the practice of managing and provisioning computing infrastructure through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools.
Why IaC Matters: IaC enables version control, testing, code review, and automation for infrastructure, bringing software development best practices to infrastructure management.
Real-world analogy: IaC is like architectural blueprints for buildings - once you have the blueprint, you can build identical structures repeatedly, make controlled modifications, and ensure everything meets building codes and standards.
IaC Benefits:
IaC Challenges:
What it is: AWS CloudFormation is a service that gives developers and businesses an easy way to create a collection of related AWS and third-party resources and provision them in an orderly and predictable fashion.
Why it exists: Managing AWS resources manually through the console or CLI doesn't scale and leads to configuration drift. CloudFormation provides declarative infrastructure management with dependency resolution, rollback capabilities, and change management.
Real-world analogy: CloudFormation is like a master chef's recipe - it specifies exactly what ingredients (resources) you need, in what quantities (properties), and the precise steps (dependencies) to create a perfect dish (infrastructure) every time.
How it works (Detailed step-by-step):
📊 CloudFormation Architecture Diagram:
graph TB
subgraph "Development"
DEV[Developer]
TEMPLATE[CloudFormation Template]
PARAMS[Parameters File]
VALIDATE[Template Validation]
end
subgraph "CloudFormation Service"
CF[CloudFormation Engine]
CHANGESET[Change Sets]
STACK[Stack Management]
EVENTS[Stack Events]
end
subgraph "AWS Resources"
VPC[VPC]
EC2[EC2 Instances]
RDS[RDS Database]
ALB[Load Balancer]
S3[S3 Buckets]
IAM[IAM Roles]
end
subgraph "Monitoring & Management"
CW[CloudWatch]
CT[CloudTrail]
CONFIG[AWS Config]
DRIFT[Drift Detection]
end
DEV --> TEMPLATE
TEMPLATE --> PARAMS
PARAMS --> VALIDATE
VALIDATE --> CF
CF --> CHANGESET
CHANGESET --> STACK
STACK --> EVENTS
CF --> VPC
CF --> EC2
CF --> RDS
CF --> ALB
CF --> S3
CF --> IAM
STACK --> CW
STACK --> CT
STACK --> CONFIG
STACK --> DRIFT
style TEMPLATE fill:#99ccff
style CF fill:#99ff99
style STACK fill:#99ff99
style VPC fill:#ffcc99
style EC2 fill:#ffcc99
style RDS fill:#ffcc99
See: diagrams/03_domain2_cloudformation_architecture.mmd
Diagram Explanation:
This diagram shows the complete CloudFormation workflow. Developers create templates (blue) and parameter files, which are validated before submission to the CloudFormation service (green). The CloudFormation engine processes templates, creates change sets for updates, and manages stack lifecycle. The service provisions AWS resources (orange) including VPC, EC2, RDS, ALB, S3, and IAM components. Monitoring and management services provide visibility into stack operations, track changes, and detect configuration drift. This architecture enables declarative infrastructure management with full lifecycle control.
CloudFormation Template Structure:
Complete Template Example:
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Multi-tier web application infrastructure with auto scaling and RDS'
# Input parameters for customization
Parameters:
Environment:
Type: String
Default: dev
AllowedValues: [dev, staging, prod]
Description: Environment name for resource tagging
InstanceType:
Type: String
Default: t3.micro
AllowedValues: [t3.micro, t3.small, t3.medium, t3.large]
Description: EC2 instance type for web servers
DBPassword:
Type: String
NoEcho: true
MinLength: 8
MaxLength: 41
Description: Database password (8-41 characters)
KeyPairName:
Type: AWS::EC2::KeyPair::KeyName
Description: EC2 Key Pair for SSH access
# Conditional logic based on environment
Conditions:
IsProduction: !Equals [!Ref Environment, prod]
IsNotProduction: !Not [!Equals [!Ref Environment, prod]]
# Mappings for environment-specific values
Mappings:
EnvironmentMap:
dev:
DBInstanceClass: db.t3.micro
MinSize: 1
MaxSize: 2
staging:
DBInstanceClass: db.t3.small
MinSize: 2
MaxSize: 4
prod:
DBInstanceClass: db.t3.medium
MinSize: 3
MaxSize: 10
# AWS resources to create
Resources:
# VPC and Networking
VPC:
Type: AWS::EC2::VPC
Properties:
CidrBlock: 10.0.0.0/16
EnableDnsHostnames: true
EnableDnsSupport: true
Tags:
- Key: Name
Value: !Sub '${Environment}-vpc'
- Key: Environment
Value: !Ref Environment
PublicSubnet1:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
CidrBlock: 10.0.1.0/24
AvailabilityZone: !Select [0, !GetAZs '']
MapPublicIpOnLaunch: true
Tags:
- Key: Name
Value: !Sub '${Environment}-public-subnet-1'
PublicSubnet2:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
CidrBlock: 10.0.2.0/24
AvailabilityZone: !Select [1, !GetAZs '']
MapPublicIpOnLaunch: true
Tags:
- Key: Name
Value: !Sub '${Environment}-public-subnet-2'
PrivateSubnet1:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
CidrBlock: 10.0.3.0/24
AvailabilityZone: !Select [0, !GetAZs '']
Tags:
- Key: Name
Value: !Sub '${Environment}-private-subnet-1'
PrivateSubnet2:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
CidrBlock: 10.0.4.0/24
AvailabilityZone: !Select [1, !GetAZs '']
Tags:
- Key: Name
Value: !Sub '${Environment}-private-subnet-2'
# Internet Gateway and Routing
InternetGateway:
Type: AWS::EC2::InternetGateway
Properties:
Tags:
- Key: Name
Value: !Sub '${Environment}-igw'
AttachGateway:
Type: AWS::EC2::VPCGatewayAttachment
Properties:
VpcId: !Ref VPC
InternetGatewayId: !Ref InternetGateway
PublicRouteTable:
Type: AWS::EC2::RouteTable
Properties:
VpcId: !Ref VPC
Tags:
- Key: Name
Value: !Sub '${Environment}-public-rt'
PublicRoute:
Type: AWS::EC2::Route
DependsOn: AttachGateway
Properties:
RouteTableId: !Ref PublicRouteTable
DestinationCidrBlock: 0.0.0.0/0
GatewayId: !Ref InternetGateway
PublicSubnetRouteTableAssociation1:
Type: AWS::EC2::SubnetRouteTableAssociation
Properties:
SubnetId: !Ref PublicSubnet1
RouteTableId: !Ref PublicRouteTable
PublicSubnetRouteTableAssociation2:
Type: AWS::EC2::SubnetRouteTableAssociation
Properties:
SubnetId: !Ref PublicSubnet2
RouteTableId: !Ref PublicRouteTable
# NAT Gateway for private subnet internet access
NATGateway:
Type: AWS::EC2::NatGateway
Properties:
AllocationId: !GetAtt EIPForNAT.AllocationId
SubnetId: !Ref PublicSubnet1
Tags:
- Key: Name
Value: !Sub '${Environment}-nat-gateway'
EIPForNAT:
Type: AWS::EC2::EIP
DependsOn: AttachGateway
Properties:
Domain: vpc
PrivateRouteTable:
Type: AWS::EC2::RouteTable
Properties:
VpcId: !Ref VPC
Tags:
- Key: Name
Value: !Sub '${Environment}-private-rt'
PrivateRoute:
Type: AWS::EC2::Route
Properties:
RouteTableId: !Ref PrivateRouteTable
DestinationCidrBlock: 0.0.0.0/0
NatGatewayId: !Ref NATGateway
PrivateSubnetRouteTableAssociation1:
Type: AWS::EC2::SubnetRouteTableAssociation
Properties:
SubnetId: !Ref PrivateSubnet1
RouteTableId: !Ref PrivateRouteTable
PrivateSubnetRouteTableAssociation2:
Type: AWS::EC2::SubnetRouteTableAssociation
Properties:
SubnetId: !Ref PrivateSubnet2
RouteTableId: !Ref PrivateRouteTable
# Security Groups
WebServerSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Security group for web servers
VpcId: !Ref VPC
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 80
ToPort: 80
SourceSecurityGroupId: !Ref LoadBalancerSecurityGroup
- IpProtocol: tcp
FromPort: 443
ToPort: 443
SourceSecurityGroupId: !Ref LoadBalancerSecurityGroup
- IpProtocol: tcp
FromPort: 22
ToPort: 22
CidrIp: 10.0.0.0/16
Tags:
- Key: Name
Value: !Sub '${Environment}-web-sg'
LoadBalancerSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Security group for load balancer
VpcId: !Ref VPC
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 80
ToPort: 80
CidrIp: 0.0.0.0/0
- IpProtocol: tcp
FromPort: 443
ToPort: 443
CidrIp: 0.0.0.0/0
Tags:
- Key: Name
Value: !Sub '${Environment}-alb-sg'
DatabaseSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Security group for RDS database
VpcId: !Ref VPC
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 3306
ToPort: 3306
SourceSecurityGroupId: !Ref WebServerSecurityGroup
Tags:
- Key: Name
Value: !Sub '${Environment}-db-sg'
# Application Load Balancer
ApplicationLoadBalancer:
Type: AWS::ElasticLoadBalancingV2::LoadBalancer
Properties:
Name: !Sub '${Environment}-alb'
Scheme: internet-facing
Type: application
Subnets:
- !Ref PublicSubnet1
- !Ref PublicSubnet2
SecurityGroups:
- !Ref LoadBalancerSecurityGroup
Tags:
- Key: Name
Value: !Sub '${Environment}-alb'
TargetGroup:
Type: AWS::ElasticLoadBalancingV2::TargetGroup
Properties:
Name: !Sub '${Environment}-tg'
Port: 80
Protocol: HTTP
VpcId: !Ref VPC
HealthCheckPath: /health
HealthCheckProtocol: HTTP
HealthCheckIntervalSeconds: 30
HealthCheckTimeoutSeconds: 5
HealthyThresholdCount: 2
UnhealthyThresholdCount: 3
Tags:
- Key: Name
Value: !Sub '${Environment}-tg'
Listener:
Type: AWS::ElasticLoadBalancingV2::Listener
Properties:
DefaultActions:
- Type: forward
TargetGroupArn: !Ref TargetGroup
LoadBalancerArn: !Ref ApplicationLoadBalancer
Port: 80
Protocol: HTTP
# Launch Template and Auto Scaling Group
LaunchTemplate:
Type: AWS::EC2::LaunchTemplate
Properties:
LaunchTemplateName: !Sub '${Environment}-launch-template'
LaunchTemplateData:
ImageId: ami-0abcdef1234567890 # Amazon Linux 2 AMI
InstanceType: !Ref InstanceType
KeyName: !Ref KeyPairName
SecurityGroupIds:
- !Ref WebServerSecurityGroup
IamInstanceProfile:
Arn: !GetAtt InstanceProfile.Arn
UserData:
Fn::Base64: !Sub |
#!/bin/bash
yum update -y
yum install -y httpd
systemctl start httpd
systemctl enable httpd
echo "<h1>Hello from ${Environment} environment</h1>" > /var/www/html/index.html
echo "OK" > /var/www/html/health
AutoScalingGroup:
Type: AWS::AutoScaling::AutoScalingGroup
Properties:
AutoScalingGroupName: !Sub '${Environment}-asg'
LaunchTemplate:
LaunchTemplateId: !Ref LaunchTemplate
Version: !GetAtt LaunchTemplate.LatestVersionNumber
MinSize: !FindInMap [EnvironmentMap, !Ref Environment, MinSize]
MaxSize: !FindInMap [EnvironmentMap, !Ref Environment, MaxSize]
DesiredCapacity: !FindInMap [EnvironmentMap, !Ref Environment, MinSize]
VPCZoneIdentifier:
- !Ref PrivateSubnet1
- !Ref PrivateSubnet2
TargetGroupARNs:
- !Ref TargetGroup
HealthCheckType: ELB
HealthCheckGracePeriod: 300
Tags:
- Key: Name
Value: !Sub '${Environment}-web-server'
PropagateAtLaunch: true
- Key: Environment
Value: !Ref Environment
PropagateAtLaunch: true
# IAM Role for EC2 instances
InstanceRole:
Type: AWS::IAM::Role
Properties:
RoleName: !Sub '${Environment}-ec2-role'
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: ec2.amazonaws.com
Action: sts:AssumeRole
ManagedPolicyArns:
- arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
- arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Tags:
- Key: Environment
Value: !Ref Environment
InstanceProfile:
Type: AWS::IAM::InstanceProfile
Properties:
InstanceProfileName: !Sub '${Environment}-ec2-profile'
Roles:
- !Ref InstanceRole
# RDS Database
DBSubnetGroup:
Type: AWS::RDS::DBSubnetGroup
Properties:
DBSubnetGroupName: !Sub '${Environment}-db-subnet-group'
DBSubnetGroupDescription: Subnet group for RDS database
SubnetIds:
- !Ref PrivateSubnet1
- !Ref PrivateSubnet2
Tags:
- Key: Name
Value: !Sub '${Environment}-db-subnet-group'
Database:
Type: AWS::RDS::DBInstance
DeletionPolicy: !If [IsProduction, Snapshot, Delete]
Properties:
DBInstanceIdentifier: !Sub '${Environment}-database'
DBInstanceClass: !FindInMap [EnvironmentMap, !Ref Environment, DBInstanceClass]
Engine: mysql
EngineVersion: '8.0'
AllocatedStorage: 20
StorageType: gp2
StorageEncrypted: true
MasterUsername: admin
MasterUserPassword: !Ref DBPassword
DBSubnetGroupName: !Ref DBSubnetGroup
VPCSecurityGroups:
- !Ref DatabaseSecurityGroup
BackupRetentionPeriod: !If [IsProduction, 7, 1]
MultiAZ: !If [IsProduction, true, false]
PubliclyAccessible: false
Tags:
- Key: Name
Value: !Sub '${Environment}-database'
- Key: Environment
Value: !Ref Environment
# Output values for use by other stacks or applications
Outputs:
VPCId:
Description: VPC ID
Value: !Ref VPC
Export:
Name: !Sub '${Environment}-vpc-id'
PublicSubnet1Id:
Description: Public Subnet 1 ID
Value: !Ref PublicSubnet1
Export:
Name: !Sub '${Environment}-public-subnet-1-id'
PublicSubnet2Id:
Description: Public Subnet 2 ID
Value: !Ref PublicSubnet2
Export:
Name: !Sub '${Environment}-public-subnet-2-id'
PrivateSubnet1Id:
Description: Private Subnet 1 ID
Value: !Ref PrivateSubnet1
Export:
Name: !Sub '${Environment}-private-subnet-1-id'
PrivateSubnet2Id:
Description: Private Subnet 2 ID
Value: !Ref PrivateSubnet2
Export:
Name: !Sub '${Environment}-private-subnet-2-id'
LoadBalancerDNS:
Description: Load Balancer DNS Name
Value: !GetAtt ApplicationLoadBalancer.DNSName
Export:
Name: !Sub '${Environment}-alb-dns'
DatabaseEndpoint:
Description: RDS Database Endpoint
Value: !GetAtt Database.Endpoint.Address
Export:
Name: !Sub '${Environment}-db-endpoint'
This comprehensive template demonstrates advanced CloudFormation features including parameters, conditions, mappings, intrinsic functions, cross-stack references, and proper resource dependencies.
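Deploying the template is a single CLI call per environment; because the template creates IAM roles with explicit names, the CAPABILITY_NAMED_IAM capability must be acknowledged. A sketch (file name, stack name, and parameter values are placeholders):
aws cloudformation deploy \
  --template-file infrastructure.yaml \
  --stack-name prod-webapp \
  --parameter-overrides Environment=prod InstanceType=t3.medium KeyPairName=prod-keypair DBPassword=$DB_PASSWORD \
  --capabilities CAPABILITY_NAMED_IAM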
Basic Template Structure (annotated reference):
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Template description'
# Input parameters for customization
Parameters:
EnvironmentName:
Type: String
Default: dev
AllowedValues: [dev, staging, prod]
Description: Environment name
# Conditional logic
Conditions:
IsProduction: !Equals [!Ref EnvironmentName, prod]
# Reusable mappings
Mappings:
RegionMap:
us-east-1:
AMI: ami-0c55b159cbfafe1f0
us-west-2:
AMI: ami-0d1cd67c26f5fca19
# Resources to create
Resources:
MyBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: !Sub '${EnvironmentName}-my-bucket'
VersioningConfiguration:
Status: !If [IsProduction, Enabled, Suspended]
# Output values
Outputs:
BucketName:
Value: !Ref MyBucket
Export:
Name: !Sub '${EnvironmentName}-bucket-name'
Detailed Example 1: Multi-Tier Web Application Infrastructure
A company needs to deploy a three-tier web application (web servers, application servers, database) across two Availability Zones with high availability. Using CloudFormation, they create a template that defines: (1) A VPC with public and private subnets in two AZs, (2) An Application Load Balancer in public subnets, (3) Auto Scaling groups for web and app tiers in private subnets, (4) An RDS Multi-AZ database in private subnets, (5) Security groups controlling traffic flow between tiers, (6) IAM roles for EC2 instances to access other AWS services. The template uses parameters for environment-specific values (instance types, database size) and conditions to enable Multi-AZ only in production. When deployed, CloudFormation automatically creates resources in the correct order: VPC first, then subnets, then security groups, then load balancer, then Auto Scaling groups, and finally the database. If any resource fails to create, CloudFormation automatically rolls back all changes, ensuring the infrastructure remains in a consistent state. The entire infrastructure can be replicated to other regions or accounts by simply deploying the same template with different parameters.
Detailed Example 2: Cross-Stack References for Microservices
An organization with a microservices architecture uses CloudFormation to manage infrastructure. They create a "network stack" that provisions the VPC, subnets, and shared networking resources. This stack exports values like VPC ID and subnet IDs using CloudFormation Outputs with Export names. Each microservice team then creates their own application stacks that import these network values using the !ImportValue function. For example, a payment service stack imports the VPC ID and private subnet IDs to launch its EC2 instances in the correct network. This approach provides: (1) Separation of concerns - network team manages networking, app teams manage applications, (2) Reusability - multiple services use the same network infrastructure, (3) Dependency management - CloudFormation prevents deletion of the network stack while dependent stacks exist, (4) Consistency - all services use the same network configuration. When the network team needs to add a new subnet, they update the network stack and export the new subnet ID, then application teams can update their stacks to use it without coordination overhead.
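A sketch of how a consuming stack imports the exported network values (the export names follow the Outputs shown in the template above; the instance resource and AMI are illustrative):
Resources:
  PaymentServiceInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t3.micro
      ImageId: ami-0abcdef1234567890        # placeholder AMI
      SubnetId: !ImportValue prod-private-subnet-1-id
      Tags:
        - Key: Name
          Value: payment-service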
Detailed Example 3: Nested Stacks for Complex Architectures
A large enterprise needs to deploy a complex application with dozens of resources. Creating a single monolithic template would be difficult to maintain and test. Instead, they use nested stacks to break the infrastructure into logical components: (1) A root stack that orchestrates the deployment, (2) A network nested stack for VPC and networking, (3) A security nested stack for IAM roles and security groups, (4) A compute nested stack for EC2 and Auto Scaling, (5) A database nested stack for RDS and ElastiCache, (6) A monitoring nested stack for CloudWatch dashboards and alarms. Each nested stack is a separate template file stored in S3. The root stack references these templates using AWS::CloudFormation::Stack resources and passes parameters between them. This modular approach enables: (1) Team specialization - different teams own different nested stacks, (2) Reusability - nested stacks can be used across multiple applications, (3) Testing - each component can be tested independently, (4) Maintenance - updates to one component don't require changing others, (5) Version control - each nested stack has its own version history. When deploying, CloudFormation creates nested stacks in parallel where possible, speeding up deployment time.
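A sketch of how the root stack wires in nested stacks and passes outputs between them (template URLs and output names are placeholders):
Resources:
  NetworkStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/mycompany-templates/network.yaml
      Parameters:
        Environment: !Ref Environment
      TimeoutInMinutes: 30
  ComputeStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/mycompany-templates/compute.yaml
      Parameters:
        VpcId: !GetAtt NetworkStack.Outputs.VPCId
        PrivateSubnetIds: !GetAtt NetworkStack.Outputs.PrivateSubnetIds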
✅ Must Know (Critical Facts):
When to use CloudFormation (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
Troubleshooting Common Issues:
What it is: AWS CDK is an open-source software development framework for defining cloud infrastructure using familiar programming languages (TypeScript, Python, Java, C#, Go) and provisioning it through CloudFormation.
Why it exists: While CloudFormation templates are powerful, they can be verbose and lack programming constructs like loops, conditionals, and functions. CDK allows developers to use their existing programming skills and tools to define infrastructure with less code and more flexibility.
Real-world analogy: If CloudFormation is like writing assembly language, CDK is like writing in a high-level programming language - you get the same result but with more abstraction, reusability, and developer productivity.
How it works (Detailed step-by-step):
- Run cdk synth to convert CDK code into CloudFormation templates
📊 CDK Architecture Diagram:
graph TB
subgraph "Developer Environment"
DEV[Developer]
CODE[CDK Code<br/>TypeScript/Python/Java]
IDE[IDE with IntelliSense]
TEST[Unit Tests]
end
subgraph "CDK CLI"
SYNTH[cdk synth]
DIFF[cdk diff]
DEPLOY[cdk deploy]
BOOTSTRAP[cdk bootstrap]
end
subgraph "CDK Framework"
CONSTRUCTS[Construct Library]
L1[L1: CloudFormation Resources]
L2[L2: Curated Constructs]
L3[L3: Patterns]
ASSETS[Asset Bundling]
end
subgraph "AWS"
S3[S3 CDK Assets Bucket]
CFN[CloudFormation]
RESOURCES[AWS Resources]
end
DEV -->|Writes| CODE
CODE -->|Uses| CONSTRUCTS
CODE -->|Runs| TEST
IDE -->|Autocomplete| CONSTRUCTS
CODE -->|Execute| SYNTH
SYNTH -->|Generates| CFN_TEMPLATE[CloudFormation Template]
CODE -->|Execute| DIFF
DIFF -->|Shows Changes| DEV
CODE -->|Execute| DEPLOY
DEPLOY -->|Uploads| ASSETS
ASSETS -->|Store| S3
DEPLOY -->|Submits| CFN_TEMPLATE
CFN_TEMPLATE -->|Deploys via| CFN
CFN -->|Creates| RESOURCES
BOOTSTRAP -->|Creates| S3
CONSTRUCTS -->|Contains| L1
CONSTRUCTS -->|Contains| L2
CONSTRUCTS -->|Contains| L3
style CODE fill:#e1f5fe
style CONSTRUCTS fill:#c8e6c9
style CFN fill:#ff9900
See: diagrams/03_domain2_cdk_architecture.mmd
Diagram Explanation (Detailed):
The diagram shows the complete CDK development and deployment workflow. Developers write infrastructure code using familiar programming languages (TypeScript, Python, Java, C#, Go) in their IDE with full IntelliSense support and autocomplete for AWS resources. The CDK Construct Library (green) provides three levels of abstractions: L1 constructs (direct CloudFormation resources), L2 constructs (curated resources with sensible defaults), and L3 constructs (opinionated patterns combining multiple resources). Developers can write unit tests for their infrastructure code just like application code. The CDK CLI provides commands for the entire workflow: cdk bootstrap creates the S3 bucket for storing assets, cdk synth converts CDK code into CloudFormation templates, cdk diff shows what will change before deployment, and cdk deploy uploads assets to S3 and deploys the generated CloudFormation template. The Asset Bundling component automatically packages Lambda functions, Docker images, and other files, uploading them to the CDK assets bucket. Finally, CloudFormation (orange) provisions the actual AWS resources based on the generated template. This architecture combines the power of programming languages with the reliability of CloudFormation's declarative infrastructure management.
CDK Construct Levels:
Level 1 (L1) - CloudFormation Resources:
// L1 Construct - verbose; you specify every property explicitly (1:1 mapping to CloudFormation)
const bucket = new s3.CfnBucket(this, 'MyBucket', {
bucketName: 'my-bucket-name',
versioningConfiguration: {
status: 'Enabled'
},
publicAccessBlockConfiguration: {
blockPublicAcls: true,
blockPublicPolicy: true,
ignorePublicAcls: true,
restrictPublicBuckets: true
}
});
Level 2 (L2) - Curated Constructs:
// L2 Construct - concise, sensible defaults
const bucket = new s3.Bucket(this, 'MyBucket', {
versioned: true,
encryption: s3.BucketEncryption.S3_MANAGED,
blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
removalPolicy: RemovalPolicy.RETAIN
});
// Helper methods available
bucket.grantRead(lambdaFunction);
Level 3 (L3) - Patterns:
// L3 Pattern - entire architecture in few lines
const apiLambdaDynamoDB = new ApiGatewayToLambdaToDynamoDB(this, 'ApiLambdaDynamoDBPattern', {
lambdaFunctionProps: {
runtime: lambda.Runtime.NODEJS_18_X,
handler: 'index.handler',
code: lambda.Code.fromAsset('lambda')
},
dynamoTableProps: {
partitionKey: { name: 'id', type: dynamodb.AttributeType.STRING }
}
});
Detailed Example 1: Serverless API with CDK
A startup needs to build a serverless REST API with Lambda functions, API Gateway, and DynamoDB. Using CDK with TypeScript, they write: (1) Define a DynamoDB table with partition key and GSI in 5 lines of code, (2) Create Lambda functions with automatic bundling of dependencies, (3) Set up API Gateway with Lambda integrations and CORS configuration, (4) Grant Lambda functions permissions to access DynamoDB using .grantReadWriteData() method, (5) Add CloudWatch alarms for Lambda errors and DynamoDB throttling. The entire infrastructure is defined in about 100 lines of TypeScript code compared to 500+ lines of CloudFormation YAML. CDK automatically handles: asset bundling (zipping Lambda code), IAM role creation, CloudWatch log groups, and resource dependencies. When they run cdk deploy, CDK synthesizes the code into CloudFormation templates, uploads Lambda code to S3, and deploys the stack. The team can write unit tests for their infrastructure using Jest, testing that the Lambda function has the correct environment variables and IAM permissions before deployment. When they need to add a new API endpoint, they simply add a new Lambda function and API Gateway route in code, and CDK handles all the underlying CloudFormation changes.
Detailed Example 2: Multi-Stack Application with CDK
An enterprise application consists of networking, security, compute, and database layers that need to be deployed independently. Using CDK, they create multiple stack classes: (1) NetworkStack creates VPC, subnets, NAT gateways, and exports VPC ID and subnet IDs, (2) SecurityStack creates IAM roles, security groups, and KMS keys, importing VPC ID from NetworkStack, (3) ComputeStack creates Auto Scaling groups and load balancers, importing networking and security resources, (4) DatabaseStack creates RDS instances, importing VPC and security group information. The main CDK app instantiates these stacks in order, passing dependencies between them. CDK automatically creates CloudFormation exports and imports for cross-stack references. Each stack can be deployed, updated, or destroyed independently. The team uses CDK Aspects to automatically add tags to all resources, enforce encryption, and validate security configurations across all stacks. When deploying to multiple environments (dev, staging, prod), they use CDK context to pass environment-specific configuration, and CDK generates separate CloudFormation stacks for each environment with appropriate resource naming and sizing.
Detailed Example 3: Custom Constructs for Organizational Standards
A large organization wants to enforce infrastructure standards across all teams. They create a custom CDK construct library with reusable components: (1) SecureS3Bucket construct that enforces encryption, versioning, and access logging, (2) MonitoredLambdaFunction construct that automatically creates CloudWatch alarms and dashboards, (3) CompliantRDSDatabase construct that enforces Multi-AZ, encryption, and backup policies. These custom constructs are published to an internal npm registry. Development teams install the construct library and use these pre-built components instead of creating resources from scratch. This ensures: (1) Consistency - all S3 buckets have the same security configuration, (2) Compliance - security policies are enforced at the infrastructure level, (3) Productivity - teams don't need to remember all security requirements, (4) Maintainability - security team can update constructs and teams get improvements automatically. The custom constructs use CDK Aspects to validate that resources meet organizational policies before deployment, failing the build if non-compliant configurations are detected.
✅ Must Know (Critical Facts):
- Run cdk bootstrap once per account/region to create the S3 bucket and IAM roles for CDK deployments
- cdk synth generates CloudFormation templates without deploying - use this to review generated templates
- cdk diff shows what will change before deployment - always review diffs in production
When to use CDK (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
- Use cdk watch during development for rapid iteration with automatic deployment
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
Troubleshooting Common Issues:
What it is: AWS SAM is an open-source framework for building serverless applications that extends CloudFormation with simplified syntax for defining serverless resources like Lambda functions, API Gateway APIs, and DynamoDB tables.
Why it exists: While CloudFormation can define serverless resources, the syntax is verbose and requires deep knowledge of all resource properties. SAM provides shorthand syntax and built-in best practices specifically for serverless applications, making it faster to build and deploy serverless solutions.
Real-world analogy: If CloudFormation is like writing detailed construction blueprints, SAM is like using pre-fabricated building modules - you get the same result but with less effort and built-in quality standards.
How it works (Detailed step-by-step):
- Use sam local commands to test Lambda functions locally with Docker
- Run sam build to prepare the application for deployment (install dependencies, compile code)
- Run sam package to upload code and assets to S3
- Run sam deploy to create/update the CloudFormation stack
📊 SAM Architecture Diagram:
graph TB
subgraph "Development"
DEV[Developer]
SAM_TEMPLATE[SAM Template<br/>template.yaml]
CODE[Lambda Code]
LOCAL[sam local start-api]
DOCKER[Docker Container]
end
subgraph "SAM CLI"
BUILD[sam build]
PACKAGE[sam package]
DEPLOY[sam deploy]
VALIDATE[sam validate]
end
subgraph "SAM Transform"
TRANSFORM[AWS::Serverless Transform]
EXPAND[Expand SAM Resources]
CFN_GEN[Generate CloudFormation]
end
subgraph "AWS"
S3[S3 Deployment Bucket]
CFN[CloudFormation]
LAMBDA[Lambda Functions]
APIGW[API Gateway]
DDB[DynamoDB Tables]
EVENTS[EventBridge Rules]
end
DEV -->|Writes| SAM_TEMPLATE
DEV -->|Writes| CODE
SAM_TEMPLATE -->|Test Locally| LOCAL
LOCAL -->|Runs in| DOCKER
SAM_TEMPLATE -->|Execute| BUILD
BUILD -->|Bundles| CODE
BUILD -->|Output| PACKAGE
PACKAGE -->|Uploads to| S3
PACKAGE -->|Execute| DEPLOY
DEPLOY -->|Transforms via| TRANSFORM
TRANSFORM -->|Expands| EXPAND
EXPAND -->|Generates| CFN_GEN
CFN_GEN -->|Deploys via| CFN
CFN -->|Creates| LAMBDA
CFN -->|Creates| APIGW
CFN -->|Creates| DDB
CFN -->|Creates| EVENTS
style SAM_TEMPLATE fill:#e1f5fe
style TRANSFORM fill:#c8e6c9
style CFN fill:#ff9900
See: diagrams/03_domain2_sam_architecture.mmd
Diagram Explanation (Detailed):
The diagram illustrates the complete SAM development and deployment workflow. Developers write SAM templates using simplified syntax that's much more concise than CloudFormation. For example, defining a Lambda function with API Gateway integration takes 10-15 lines in SAM versus 100+ lines in CloudFormation. The SAM CLI provides local testing capabilities - sam local start-api runs API Gateway and Lambda functions locally in Docker containers, allowing developers to test without deploying to AWS. The sam build command prepares the application by installing dependencies (npm install, pip install) and compiling code if needed. The sam package command uploads Lambda code and other assets to an S3 deployment bucket. During deployment, the SAM Transform (green) is the key component - it's a CloudFormation macro (AWS::Serverless-2016-10-31) that expands SAM's simplified syntax into full CloudFormation resources. For example, a single AWS::Serverless::Function resource expands into Lambda function, IAM role, CloudWatch log group, and potentially API Gateway resources. The generated CloudFormation template is then deployed through the standard CloudFormation service (orange), which creates all the AWS resources. This architecture combines the simplicity of SAM's syntax with the power and reliability of CloudFormation's deployment engine.
SAM Template Structure:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31 # SAM transform
Description: Serverless API application
# Global configuration for all functions
Globals:
Function:
Timeout: 30
Runtime: python3.11
Environment:
Variables:
TABLE_NAME: !Ref UsersTable
Resources:
# Lambda function with API Gateway integration
GetUserFunction:
Type: AWS::Serverless::Function
Properties:
CodeUri: src/
Handler: app.get_user
Events:
GetUser:
Type: Api
Properties:
Path: /users/{id}
Method: get
Policies:
- DynamoDBReadPolicy:
TableName: !Ref UsersTable
# DynamoDB table
UsersTable:
Type: AWS::Serverless::SimpleTable
Properties:
PrimaryKey:
Name: userId
Type: String
Outputs:
ApiUrl:
Description: API Gateway endpoint URL
Value: !Sub 'https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod/'
SAM vs CloudFormation Comparison:
| Feature | SAM | CloudFormation |
|---|---|---|
| Lambda Function | 10-15 lines | 50-100 lines |
| API Gateway | Implicit creation | Explicit resources |
| IAM Roles | Policy templates | Full role definitions |
| Local Testing | Built-in with Docker | Not available |
| Deployment | sam deploy | aws cloudformation deploy |
| Best Practices | Built-in defaults | Manual configuration |
| Learning Curve | Easier for serverless | Steeper, more flexible |
| Use Case | Serverless applications | Any AWS infrastructure |
Detailed Example 1: REST API with CRUD Operations
A team needs to build a REST API for managing user data with Lambda and DynamoDB. Using SAM, they create a template with: (1) A DynamoDB table using AWS::Serverless::SimpleTable (3 lines instead of 20), (2) Four Lambda functions (GET, POST, PUT, DELETE) each defined with AWS::Serverless::Function, (3) API Gateway endpoints automatically created using Events property on each function, (4) IAM permissions granted using SAM policy templates like DynamoDBCrudPolicy. The entire application is defined in about 80 lines of YAML. They use sam local start-api to test the API locally - SAM starts a local API Gateway and Lambda runtime in Docker, allowing them to test CRUD operations without deploying to AWS. When ready, they run sam build to install Python dependencies, then sam deploy --guided which prompts for stack name, region, and other parameters. SAM automatically creates an S3 bucket for deployment artifacts, uploads Lambda code, transforms the template into CloudFormation, and deploys the stack. The team can see the API endpoint URL in the outputs and immediately start testing. When they need to add a new endpoint, they simply add a new function and event in the SAM template and redeploy.
Detailed Example 2: Event-Driven Processing Pipeline
A company needs to process uploaded files: when a file is uploaded to S3, trigger a Lambda function to process it, store results in DynamoDB, and send notifications via SNS. Using SAM, they define: (1) An S3 bucket with AWS::S3::Bucket, (2) A Lambda function with an S3 event trigger using Events property, (3) A DynamoDB table for storing results, (4) An SNS topic for notifications, (5) IAM permissions using SAM policy templates (S3ReadPolicy, DynamoDBWritePolicy, SNSPublishMessagePolicy). SAM automatically creates the S3 bucket notification configuration, Lambda permissions for S3 to invoke the function, and all necessary IAM roles. The team uses SAM's Globals section to set common properties like timeout and memory size for all functions. They use sam local invoke with sample S3 events to test the Lambda function locally before deployment. When deployed, SAM creates all resources and configures the event-driven pipeline automatically. The team can monitor the pipeline using CloudWatch Logs, which SAM automatically creates for each Lambda function.
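A trimmed-down sketch of the processing function from this example (logical IDs, handler, and resource names are assumptions; note that SAM S3 events require the bucket to be defined in the same template):
```yaml
ProcessUploadFunction:
  Type: AWS::Serverless::Function
  Properties:
    Runtime: python3.11
    Handler: app.handler
    CodeUri: src/
    Policies:
      - DynamoDBWritePolicy:
          TableName: !Ref ResultsTable
      - SNSPublishMessagePolicy:
          TopicName: !GetAtt NotificationsTopic.TopicName
    Events:
      FileUploaded:
        Type: S3
        Properties:
          Bucket: !Ref UploadBucket   # bucket defined elsewhere in this template
          Events: s3:ObjectCreated:*
```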
Detailed Example 3: Scheduled Data Processing
An organization needs to run a Lambda function every hour to aggregate data from multiple sources and generate reports. Using SAM, they create: (1) A Lambda function defined with AWS::Serverless::Function, (2) A schedule event using Events property with Schedule type and cron expression, (3) An S3 bucket for storing generated reports, (4) Environment variables for configuration. SAM automatically creates the EventBridge rule, Lambda permissions for EventBridge to invoke the function, and CloudWatch log group. They use SAM's Layers property to include shared libraries and dependencies. The template includes a SAM policy template (S3CrudPolicy) to grant the function access to the S3 bucket. They use sam logs command to tail CloudWatch Logs in real-time during testing. When they need to change the schedule, they simply update the cron expression in the template and redeploy - SAM handles updating the EventBridge rule automatically.
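A minimal sketch of the scheduled function (the cron expression, logical IDs, layer, and bucket are illustrative):
```yaml
HourlyReportFunction:
  Type: AWS::Serverless::Function
  Properties:
    Runtime: python3.11
    Handler: app.handler
    CodeUri: src/
    Layers:
      - !Ref SharedLibsLayer   # assumed Lambda layer with shared dependencies
    Policies:
      - S3CrudPolicy:
          BucketName: !Ref ReportsBucket
    Events:
      HourlyRun:
        Type: Schedule
        Properties:
          Schedule: cron(0 * * * ? *)   # top of every hour
```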
✅ Must Know (Critical Facts):
- The Transform: AWS::Serverless-2016-10-31 line is required in every SAM template - it tells CloudFormation to process SAM syntax
- sam local commands require Docker to be installed and running
- sam sync provides rapid deployment for development (skips CloudFormation change sets)
When to use SAM (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
- Use sam validate to check template syntax before deployment
- Use sam local generate-event to create sample events for testing
- Use sam sync --watch during development for rapid iteration
⚠️ Common Mistakes & Misconceptions:
- Omitting the Transform: AWS::Serverless-2016-10-31 line at the top of SAM templates
- Not reviewing the output of sam package to see the transformed CloudFormation template and understand what resources are created
🔗 Connections to Other Topics:
Troubleshooting Common Issues:
- sam local commands fail with Docker errors - confirm Docker is installed and running before using local testing
- Template errors - run sam validate to check the template syntax
What it is: CloudFormation StackSets extends CloudFormation to enable you to create, update, or delete stacks across multiple accounts and AWS Regions with a single operation.
Why it exists: Organizations with multiple AWS accounts need to deploy the same infrastructure (security baselines, networking, compliance controls) across all accounts. Manually deploying stacks to each account is time-consuming and error-prone. StackSets automates multi-account, multi-region deployments.
Real-world analogy: StackSets is like a franchise headquarters sending standardized store layouts to all franchise locations - one design is replicated consistently across many locations.
How it works (Detailed step-by-step):
📊 StackSets Architecture Diagram:
graph TB
subgraph "Administrator Account"
ADMIN[Administrator]
STACKSET[StackSet Definition]
TEMPLATE[CloudFormation Template]
ADMIN_ROLE[AWSCloudFormationStackSetAdministrationRole]
end
subgraph "Target Account 1"
EXEC_ROLE1[AWSCloudFormationStackSetExecutionRole]
STACK1[Stack Instance]
RESOURCES1[AWS Resources]
end
subgraph "Target Account 2"
EXEC_ROLE2[AWSCloudFormationStackSetExecutionRole]
STACK2[Stack Instance]
RESOURCES2[AWS Resources]
end
subgraph "Target Account 3"
EXEC_ROLE3[AWSCloudFormationStackSetExecutionRole]
STACK3[Stack Instance]
RESOURCES3[AWS Resources]
end
subgraph "AWS Organizations"
ORG[Organization]
OU1[Organizational Unit 1]
OU2[Organizational Unit 2]
end
ADMIN -->|Creates| STACKSET
TEMPLATE -->|Defines| STACKSET
STACKSET -->|Uses| ADMIN_ROLE
ADMIN_ROLE -->|Assumes| EXEC_ROLE1
ADMIN_ROLE -->|Assumes| EXEC_ROLE2
ADMIN_ROLE -->|Assumes| EXEC_ROLE3
EXEC_ROLE1 -->|Creates| STACK1
EXEC_ROLE2 -->|Creates| STACK2
EXEC_ROLE3 -->|Creates| STACK3
STACK1 -->|Provisions| RESOURCES1
STACK2 -->|Provisions| RESOURCES2
STACK3 -->|Provisions| RESOURCES3
STACKSET -.Target.-> OU1
STACKSET -.Target.-> OU2
OU1 -.Contains.-> EXEC_ROLE1
OU2 -.Contains.-> EXEC_ROLE2
OU2 -.Contains.-> EXEC_ROLE3
style STACKSET fill:#ff9900
style ADMIN_ROLE fill:#c8e6c9
style EXEC_ROLE1 fill:#e1f5fe
style EXEC_ROLE2 fill:#e1f5fe
style EXEC_ROLE3 fill:#e1f5fe
See: diagrams/03_domain2_stacksets_architecture.mmd
Diagram Explanation (Detailed):
The diagram shows how StackSets enables centralized multi-account deployments. An administrator in the Administrator Account (typically the management account in AWS Organizations) creates a StackSet definition with a CloudFormation template. The StackSet uses the AWSCloudFormationStackSetAdministrationRole (green) which has permissions to assume execution roles in target accounts. In each target account, the AWSCloudFormationStackSetExecutionRole (blue) has permissions to create CloudFormation stacks and provision AWS resources. When the administrator deploys the StackSet, it automatically creates stack instances in all specified target accounts and regions. The StackSet can target accounts explicitly or use AWS Organizations integration to target entire Organizational Units (OUs). For example, targeting the "Production OU" automatically deploys to all accounts in that OU, including accounts added in the future. Each stack instance is an independent CloudFormation stack that can be managed individually if needed, but updates to the StackSet template are automatically propagated to all instances. This architecture enables centralized governance while maintaining account isolation - the administrator account can deploy infrastructure but doesn't have direct access to resources in target accounts.
StackSets Permission Models:
Self-Managed Permissions:
Service-Managed Permissions (with AWS Organizations):
Detailed Example 1: Security Baseline Across Organization
A large enterprise with 50 AWS accounts needs to deploy a security baseline to all accounts. The baseline includes: (1) CloudTrail enabled with logs sent to central S3 bucket, (2) AWS Config enabled with required rules, (3) GuardDuty enabled and findings sent to Security Hub, (4) IAM password policy enforced, (5) S3 bucket public access blocked by default. They create a StackSet in the management account with service-managed permissions. The StackSet targets the root of the organization, automatically deploying to all existing accounts and any new accounts created in the future. They configure the StackSet with: (1) Maximum concurrent accounts: 10 (deploy to 10 accounts at a time), (2) Failure tolerance: 2 (continue if up to 2 accounts fail), (3) Region concurrency: Sequential (deploy to one region at a time). When they update the security baseline (e.g., add a new Config rule), they update the StackSet template and the change automatically propagates to all 50 accounts. They use StackSet drift detection to identify accounts where security controls have been manually modified, then remediate the drift by updating the stack instances.
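One way to express a StackSet like this in a template is the AWS::CloudFormation::StackSet resource type. The names, OU ID, and template URL below are placeholders; the operation preferences mirror the settings described above:
```yaml
SecurityBaselineStackSet:
  Type: AWS::CloudFormation::StackSet
  Properties:
    StackSetName: security-baseline
    PermissionModel: SERVICE_MANAGED
    AutoDeployment:
      Enabled: true                  # new accounts in the target OU receive the stack automatically
      RetainStacksOnAccountRemoval: false
    OperationPreferences:
      MaxConcurrentCount: 10         # deploy to 10 accounts at a time
      FailureToleranceCount: 2       # continue if up to 2 accounts fail
      RegionConcurrencyType: SEQUENTIAL
    StackInstancesGroup:
      - DeploymentTargets:
          OrganizationalUnitIds:
            - ou-xxxx-example        # placeholder OU or root ID
        Regions:
          - us-east-1
    TemplateURL: https://s3.amazonaws.com/example-templates/security-baseline.yaml
```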
Detailed Example 2: Multi-Region Networking Infrastructure
A company needs to deploy identical VPC infrastructure across 5 regions in 10 accounts (50 total stacks). The VPC template includes: (1) VPC with public and private subnets, (2) NAT gateways for private subnet internet access, (3) VPC Flow Logs for network monitoring, (4) Transit Gateway attachments for inter-VPC connectivity. They create a StackSet with self-managed permissions (accounts are in different organizations). The StackSet is configured to deploy to all 5 regions simultaneously in each account, but only 3 accounts at a time to avoid hitting service limits. They use StackSet parameters to customize CIDR blocks for each account, ensuring no IP address conflicts. When they need to add a new region, they simply add the region to the StackSet deployment targets and it automatically creates VPCs in all accounts in that region. They use StackSet operations history to track all deployments and updates, providing an audit trail for compliance.
Detailed Example 3: Automated Account Provisioning
An organization uses AWS Control Tower for account provisioning. They create StackSets that automatically deploy to new accounts: (1) A "Logging StackSet" that creates CloudWatch log groups and metric filters, (2) A "Monitoring StackSet" that creates CloudWatch dashboards and alarms, (3) A "Backup StackSet" that configures AWS Backup plans. These StackSets target specific OUs (e.g., "Production OU", "Development OU") with service-managed permissions. When Control Tower provisions a new account and places it in an OU, the StackSets automatically deploy within minutes, ensuring the account has all required infrastructure before developers start using it. They use StackSet automatic deployment to enable this behavior - new accounts in target OUs automatically receive stack instances. This eliminates manual account setup and ensures consistency across all accounts.
✅ Must Know (Critical Facts):
When to use StackSets (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
Troubleshooting Common Issues:
[Content continues with AWS Organizations, Control Tower, and multi-account automation topics...]
The problem: Organizations with multiple AWS accounts face challenges in maintaining consistent security, compliance, and operational standards. Manual account provisioning is slow, accounts drift from standards over time, and enforcing policies across accounts is difficult.
The solution: Multi-account automation uses AWS Organizations, Control Tower, and Infrastructure as Code to centrally manage account creation, apply security baselines, enforce policies, and maintain compliance across all accounts automatically.
Why it's tested: The exam heavily tests multi-account strategies because most enterprises use multiple AWS accounts for isolation, security, and organizational boundaries. DevOps engineers must understand how to automate account management at scale.
What it is: AWS Organizations is a service that enables you to consolidate multiple AWS accounts into an organization that you create and centrally manage. It provides policy-based management for multiple AWS accounts.
Why it exists: Managing many AWS accounts individually is operationally complex. Organizations provides centralized billing, policy enforcement, and account management, reducing overhead and improving governance.
Real-world analogy: AWS Organizations is like a corporate headquarters managing multiple branch offices - the headquarters sets policies, manages budgets, and ensures all branches follow company standards.
How it works (Detailed step-by-step):
📊 AWS Organizations Architecture Diagram:
graph TB
subgraph "Management Account"
MGMT[Management Account]
ORG[Organization Root]
BILLING[Consolidated Billing]
POLICIES[Policy Management]
end
subgraph "Organizational Units"
PROD_OU[Production OU]
DEV_OU[Development OU]
SECURITY_OU[Security OU]
end
subgraph "Production Accounts"
PROD1[Prod Account 1]
PROD2[Prod Account 2]
PROD3[Prod Account 3]
end
subgraph "Development Accounts"
DEV1[Dev Account 1]
DEV2[Dev Account 2]
end
subgraph "Security Accounts"
LOG[Log Archive Account]
AUDIT[Audit Account]
end
subgraph "Service Control Policies"
SCP_PROD[Production SCP<br/>Restrict Regions]
SCP_DEV[Development SCP<br/>Allow All]
SCP_SEC[Security SCP<br/>Prevent Deletion]
end
MGMT -->|Creates| ORG
ORG -->|Contains| PROD_OU
ORG -->|Contains| DEV_OU
ORG -->|Contains| SECURITY_OU
PROD_OU -->|Contains| PROD1
PROD_OU -->|Contains| PROD2
PROD_OU -->|Contains| PROD3
DEV_OU -->|Contains| DEV1
DEV_OU -->|Contains| DEV2
SECURITY_OU -->|Contains| LOG
SECURITY_OU -->|Contains| AUDIT
SCP_PROD -.Applies to.-> PROD_OU
SCP_DEV -.Applies to.-> DEV_OU
SCP_SEC -.Applies to.-> SECURITY_OU
BILLING -->|Aggregates| PROD1
BILLING -->|Aggregates| PROD2
BILLING -->|Aggregates| PROD3
BILLING -->|Aggregates| DEV1
BILLING -->|Aggregates| DEV2
BILLING -->|Aggregates| LOG
BILLING -->|Aggregates| AUDIT
style MGMT fill:#ff9900
style ORG fill:#c8e6c9
style PROD_OU fill:#e1f5fe
style DEV_OU fill:#e1f5fe
style SECURITY_OU fill:#e1f5fe
See: diagrams/03_domain2_organizations_architecture.mmd
Diagram Explanation (Detailed):
The diagram illustrates a typical AWS Organizations structure. At the top is the Management Account (orange), which is the account that created the organization and has full administrative control. The Organization Root (green) is the parent container for all accounts and OUs. Organizational Units (blue) group accounts by purpose - Production OU for production workloads, Development OU for development/testing, and Security OU for centralized logging and auditing. Each OU contains multiple member accounts. Service Control Policies (SCPs) are attached to OUs and inherited by all accounts within them. For example, the Production SCP might restrict deployments to specific AWS regions, while the Development SCP allows all services for experimentation. The Security SCP prevents deletion of CloudTrail logs and Config rules. Consolidated Billing aggregates all charges from member accounts to the management account, providing volume discounts and simplified payment. This hierarchical structure enables centralized governance while maintaining account isolation - each account has its own resources and IAM users, but the organization enforces policies across all accounts.
Service Control Policies (SCPs):
What SCPs Do:
SCP Evaluation Logic:
Effective Permissions = IAM Permissions ∩ SCP Permissions
Example:
- IAM Policy allows: s3:*, ec2:*, lambda:*
- SCP allows: s3:*, ec2:*
- Effective permissions: s3:*, ec2:* (lambda:* is denied by SCP)
Common SCP Patterns:
1. Deny List Strategy (Default):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Action": [
"ec2:TerminateInstances",
"rds:DeleteDBInstance"
],
"Resource": "*",
"Condition": {
"StringNotEquals": {
"aws:PrincipalOrgID": "o-exampleorgid"
}
}
}
]
}
2. Allow List Strategy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:*",
"ec2:*",
"lambda:*",
"dynamodb:*"
],
"Resource": "*"
}
]
}
3. Region Restriction:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Action": "*",
"Resource": "*",
"Condition": {
"StringNotEquals": {
"aws:RequestedRegion": [
"us-east-1",
"us-west-2"
]
},
"ArnNotLike": {
"aws:PrincipalArn": "arn:aws:iam::*:role/OrganizationAccountAccessRole"
}
}
}
]
}
Detailed Example 1: Multi-Environment Organization Structure
A company organizes its 30 AWS accounts using Organizations with this structure: (1) Management Account for billing and organization management, (2) Security OU containing Log Archive and Audit accounts, (3) Production OU with 10 production accounts, (4) Staging OU with 5 staging accounts, (5) Development OU with 12 development accounts, (6) Sandbox OU with 2 accounts for experimentation. They apply different SCPs to each OU: Production SCP restricts to us-east-1 and us-west-2 regions only, requires MFA for sensitive operations, and prevents deletion of CloudTrail logs. Development SCP allows all regions but denies expensive instance types (p3, p4 instances). Sandbox SCP allows everything but limits spending using AWS Budgets. Security OU SCP prevents any modifications to logging and security services. They enable AWS CloudTrail organization trail to log all API calls across all accounts to the Log Archive account. They use AWS Config aggregator in the Audit account to view compliance status across all accounts. Consolidated billing provides volume discounts on EC2 Reserved Instances that benefit all accounts.
Detailed Example 2: Automated Account Provisioning
An organization needs to provision new AWS accounts quickly for new projects. They create an automated workflow: (1) Developer submits account request through ServiceNow ticket, (2) Lambda function triggered by ServiceNow webhook, (3) Lambda calls Organizations API to create new account, (4) Account is automatically placed in appropriate OU based on environment (dev/staging/prod), (5) SCPs are automatically applied based on OU, (6) StackSets automatically deploy security baseline (CloudTrail, Config, GuardDuty), (7) IAM Identity Center provisions SSO access for the team, (8) Account details are sent to requester via email. The entire process takes 10 minutes instead of days of manual work. The Lambda function uses Organizations APIs: CreateAccount to create the account, MoveAccount to place it in the correct OU, and TagResource to add metadata tags. They use EventBridge to trigger additional automation when account creation completes, such as creating VPCs and setting up networking.
Detailed Example 3: Compliance Enforcement with SCPs
A financial services company must comply with regulations requiring: (1) All data must stay in specific regions, (2) Encryption must be enabled for all data at rest, (3) CloudTrail logs cannot be deleted, (4) Root user cannot be used. They implement these requirements using SCPs: Region restriction SCP denies all actions outside us-east-1 and us-west-2. Encryption enforcement SCP denies creation of S3 buckets without encryption, EBS volumes without encryption, and RDS instances without encryption. CloudTrail protection SCP denies StopLogging, DeleteTrail, and PutEventSelectors actions. Root user restriction SCP denies all actions when principal is root user. These SCPs are attached to the organization root, applying to all accounts. Even if a developer has AdministratorAccess IAM policy, they cannot violate these restrictions because SCPs override IAM policies. The security team monitors SCP violations using CloudTrail and alerts on any denied actions, investigating potential compliance issues.
✅ Must Know (Critical Facts):
When to use Organizations (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
Troubleshooting Common Issues:
What it is: AWS Control Tower is a service that automates the setup of a secure, multi-account AWS environment based on AWS best practices. It provides an easy way to set up and govern a new AWS multi-account environment.
Why it exists: Setting up a well-architected multi-account environment manually is complex and time-consuming. Control Tower automates this process, implementing AWS best practices for account structure, security baselines, and governance out of the box.
Real-world analogy: Control Tower is like a "smart home starter kit" - instead of buying and configuring each smart device individually, you get a pre-configured system that works together seamlessly from day one.
How it works (Detailed step-by-step):
📊 Control Tower Architecture Diagram:
graph TB
subgraph "Management Account"
CT[Control Tower]
DASHBOARD[Control Tower Dashboard]
AF[Account Factory]
end
subgraph "Security OU"
LOG[Log Archive Account<br/>CloudTrail, Config Logs]
AUDIT[Audit Account<br/>Security Hub, SNS]
end
subgraph "Sandbox OU"
SANDBOX1[Sandbox Account 1]
SANDBOX2[Sandbox Account 2]
end
subgraph "Custom OU - Production"
PROD1[Production Account 1]
PROD2[Production Account 2]
end
subgraph "Guardrails"
PREVENTIVE[Preventive Guardrails<br/>SCPs]
DETECTIVE[Detective Guardrails<br/>Config Rules]
MANDATORY[Mandatory]
STRONGLY_REC[Strongly Recommended]
ELECTIVE[Elective]
end
subgraph "Account Baseline"
CLOUDTRAIL[CloudTrail Enabled]
CONFIG[Config Enabled]
GUARDDUTY[GuardDuty Enabled]
IAM_CENTER[IAM Identity Center]
end
CT -->|Creates| LOG
CT -->|Creates| AUDIT
CT -->|Manages| SANDBOX1
CT -->|Manages| SANDBOX2
CT -->|Manages| PROD1
CT -->|Manages| PROD2
AF -->|Provisions| SANDBOX1
AF -->|Provisions| PROD1
PREVENTIVE -->|Applies to| SANDBOX1
PREVENTIVE -->|Applies to| PROD1
DETECTIVE -->|Monitors| SANDBOX1
DETECTIVE -->|Monitors| PROD1
MANDATORY -.Includes.-> PREVENTIVE
MANDATORY -.Includes.-> DETECTIVE
STRONGLY_REC -.Includes.-> PREVENTIVE
STRONGLY_REC -.Includes.-> DETECTIVE
ELECTIVE -.Includes.-> DETECTIVE
CLOUDTRAIL -->|Logs to| LOG
CONFIG -->|Logs to| LOG
GUARDDUTY -->|Findings to| AUDIT
DASHBOARD -->|Shows| PREVENTIVE
DASHBOARD -->|Shows| DETECTIVE
style CT fill:#ff9900
style LOG fill:#c8e6c9
style AUDIT fill:#c8e6c9
style PREVENTIVE fill:#ffebee
style DETECTIVE fill:#e1f5fe
See: diagrams/03_domain2_control_tower_architecture.mmd
Diagram Explanation (Detailed):
The diagram shows Control Tower's comprehensive multi-account governance architecture. Control Tower (orange) runs in the management account and orchestrates the entire environment. It automatically creates two core accounts in the Security OU: the Log Archive account (green) receives all CloudTrail and Config logs from all accounts, and the Audit account (green) aggregates security findings from Security Hub and sends notifications via SNS. Control Tower creates a Sandbox OU for experimentation and allows creation of custom OUs like Production. The Account Factory component enables self-service account provisioning - users request accounts through Service Catalog, and Control Tower automatically provisions them with the baseline configuration. Guardrails are the key governance mechanism: Preventive guardrails (red) use SCPs to prevent actions (e.g., prevent disabling CloudTrail), while Detective guardrails (blue) use Config rules to detect non-compliance (e.g., detect unencrypted S3 buckets). Guardrails are categorized as Mandatory (must be enabled), Strongly Recommended (AWS best practices), or Elective (optional based on requirements). The Account Baseline ensures every provisioned account has CloudTrail, Config, and GuardDuty enabled automatically. The Control Tower Dashboard provides a single pane of glass showing compliance status across all accounts and guardrails.
Control Tower Guardrails:
Preventive Guardrails (SCPs):
Detective Guardrails (Config Rules):
Guardrail Categories:
| Category | Description | Can Disable? | Example |
|---|---|---|---|
| Mandatory | Must be enabled, AWS best practices | No | Disallow changes to CloudTrail |
| Strongly Recommended | AWS recommends enabling | Yes | Detect unencrypted EBS volumes |
| Elective | Optional, based on requirements | Yes | Disallow internet connection through RDP |
Account Factory:
What it provides:
Account Factory Workflow:
Detailed Example 1: Enterprise Landing Zone Setup
A large enterprise with no existing multi-account structure decides to implement AWS best practices. They set up Control Tower in their management account: (1) Control Tower creates the landing zone with Log Archive and Audit accounts, (2) They enable all mandatory and strongly recommended guardrails, (3) They create custom OUs for Production, Staging, and Development, (4) They customize Account Factory to include company-specific tags and VPC configuration, (5) They integrate IAM Identity Center with their corporate Active Directory for SSO. Within 2 hours, they have a fully functional, well-architected multi-account environment. Development teams can now request new accounts through Service Catalog - accounts are provisioned in 15 minutes with all security baselines automatically applied. The security team uses the Control Tower dashboard to monitor compliance across all accounts, seeing real-time guardrail violations. When a developer accidentally creates an unencrypted S3 bucket, a detective guardrail flags it immediately, and the security team remediates it.
Detailed Example 2: Account Factory Customization
A company needs to customize the baseline configuration for new accounts beyond what Control Tower provides by default. They use Account Factory Customization (AFC) to: (1) Automatically create a VPC with specific CIDR blocks in each new account, (2) Deploy a standard set of IAM roles for cross-account access, (3) Configure AWS Backup plans for all resources, (4) Set up CloudWatch dashboards and alarms, (5) Deploy security tools like AWS Systems Manager Session Manager. They create a CloudFormation template with these resources and configure AFC to deploy it to all new accounts. They also create a Lambda function that runs after account creation to: (1) Tag all resources with cost center and owner information, (2) Create a budget alert, (3) Send welcome email to account owner with access instructions. Now when a team requests a new account, it's provisioned with all company standards automatically, reducing setup time from days to minutes.
Detailed Example 3: Guardrail Compliance Monitoring
A financial services company must maintain strict compliance with regulatory requirements. They use Control Tower guardrails to enforce and monitor compliance: (1) Enable mandatory guardrails to prevent disabling CloudTrail and Config, (2) Enable strongly recommended guardrails for encryption and public access prevention, (3) Create custom detective guardrails using Config rules for company-specific requirements (e.g., all EC2 instances must have specific tags), (4) Configure SNS notifications in the Audit account to alert security team of guardrail violations. The security team reviews the Control Tower dashboard daily, which shows: (1) Number of accounts in compliance, (2) Active guardrail violations by account and type, (3) Drift detection for accounts that have been modified outside Control Tower. When a violation is detected (e.g., someone creates an unencrypted RDS database), the security team receives an SNS notification, investigates using CloudTrail logs in the Log Archive account, and remediates by either fixing the resource or updating the guardrail if the violation was intentional and approved.
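A custom detective guardrail of the kind described here is, under the hood, an AWS Config rule. A minimal sketch using the managed REQUIRED_TAGS rule (the rule name and tag keys are assumptions):
```yaml
Ec2RequiredTagsRule:
  Type: AWS::Config::ConfigRule
  Properties:
    ConfigRuleName: ec2-required-tags
    Scope:
      ComplianceResourceTypes:
        - AWS::EC2::Instance
    Source:
      Owner: AWS
      SourceIdentifier: REQUIRED_TAGS   # AWS managed rule
    InputParameters:
      tag1Key: CostCenter
      tag2Key: Owner
```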
✅ Must Know (Critical Facts):
When to use Control Tower (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
Troubleshooting Common Issues:
[Content continues with Systems Manager, Lambda automation, and Step Functions...]
The problem: Managing infrastructure at scale requires automating repetitive tasks, maintaining configuration compliance, patching systems, and responding to events. Manual operations don't scale and lead to configuration drift, security vulnerabilities, and operational overhead.
The solution: AWS provides multiple automation services that work together to automate complex operational tasks: Systems Manager for fleet management and automation, Lambda for event-driven automation, and Step Functions for orchestrating multi-step workflows.
Why it's tested: The exam tests your ability to design and implement automation solutions that reduce operational overhead, maintain compliance, and respond to events automatically. This is core to DevOps practices.
What it is: AWS Systems Manager is a collection of capabilities for managing AWS resources and on-premises servers at scale. The Automation capability specifically allows you to automate common maintenance and deployment tasks.
Why it exists: Managing large fleets of EC2 instances and other resources manually is time-consuming and error-prone. Systems Manager provides centralized management, automation, and compliance capabilities.
Real-world analogy: Systems Manager is like a fleet management system for vehicles - it tracks all vehicles (instances), schedules maintenance (patches), monitors health (compliance), and can remotely control them (run commands).
How it works (Detailed step-by-step):
📊 Systems Manager Architecture Diagram:
graph TB
subgraph "Systems Manager Console"
CONSOLE[Systems Manager Console]
FLEET[Fleet Manager]
COMPLIANCE[Compliance Dashboard]
end
subgraph "Systems Manager Capabilities"
RUN_CMD[Run Command]
AUTOMATION[Automation]
STATE_MGR[State Manager]
PATCH_MGR[Patch Manager]
SESSION_MGR[Session Manager]
PARAM_STORE[Parameter Store]
INVENTORY[Inventory]
end
subgraph "EC2 Instances"
INSTANCE1[EC2 Instance 1<br/>SSM Agent]
INSTANCE2[EC2 Instance 2<br/>SSM Agent]
INSTANCE3[EC2 Instance 3<br/>SSM Agent]
end
subgraph "On-Premises"
ONPREM1[Server 1<br/>SSM Agent]
ONPREM2[Server 2<br/>SSM Agent]
end
subgraph "Automation Triggers"
EVENTBRIDGE[EventBridge]
CLOUDWATCH[CloudWatch Alarms]
LAMBDA[Lambda Functions]
MANUAL[Manual Execution]
end
CONSOLE -->|Manages| RUN_CMD
CONSOLE -->|Manages| AUTOMATION
CONSOLE -->|Manages| STATE_MGR
CONSOLE -->|Manages| PATCH_MGR
RUN_CMD -->|Executes on| INSTANCE1
RUN_CMD -->|Executes on| INSTANCE2
RUN_CMD -->|Executes on| ONPREM1
AUTOMATION -->|Orchestrates| RUN_CMD
AUTOMATION -->|Uses| PARAM_STORE
STATE_MGR -->|Maintains| INSTANCE1
STATE_MGR -->|Maintains| INSTANCE2
PATCH_MGR -->|Patches| INSTANCE1
PATCH_MGR -->|Patches| INSTANCE2
PATCH_MGR -->|Patches| INSTANCE3
SESSION_MGR -->|Connects to| INSTANCE1
INVENTORY -->|Collects from| INSTANCE1
INVENTORY -->|Collects from| INSTANCE2
INVENTORY -->|Collects from| ONPREM1
EVENTBRIDGE -->|Triggers| AUTOMATION
CLOUDWATCH -->|Triggers| AUTOMATION
LAMBDA -->|Invokes| AUTOMATION
MANUAL -->|Starts| AUTOMATION
FLEET -->|Views| INSTANCE1
FLEET -->|Views| INSTANCE2
COMPLIANCE -->|Monitors| PATCH_MGR
COMPLIANCE -->|Monitors| STATE_MGR
style CONSOLE fill:#ff9900
style AUTOMATION fill:#c8e6c9
style INSTANCE1 fill:#e1f5fe
style INSTANCE2 fill:#e1f5fe
style INSTANCE3 fill:#e1f5fe
See: diagrams/03_domain2_systems_manager_architecture.mmd
Diagram Explanation (Detailed):
The diagram illustrates Systems Manager's comprehensive fleet management architecture. The Systems Manager Console (orange) provides a unified interface for managing all capabilities. At the core are the Systems Manager capabilities: Run Command executes commands across fleets, Automation orchestrates multi-step workflows, State Manager maintains desired configuration, Patch Manager automates patching, Session Manager provides secure shell access, Parameter Store stores configuration data, and Inventory collects metadata. All EC2 instances and on-premises servers (blue) run the SSM Agent, which communicates with Systems Manager over HTTPS (no inbound ports required). The agent registers instances as "managed instances" that can be targeted by Systems Manager operations. Automation can be triggered multiple ways: EventBridge rules for event-driven automation, CloudWatch alarms for metric-based automation, Lambda functions for custom logic, or manual execution. Fleet Manager provides a visual interface to view and manage all instances, while the Compliance Dashboard shows patch compliance and configuration compliance across the fleet. Parameter Store integrates with automation workflows to provide configuration values and secrets. This architecture enables centralized management of thousands of instances without requiring SSH access or bastion hosts.
Systems Manager Automation Documents:
What they are: JSON or YAML documents that define automation workflows with multiple steps. Each step can execute different actions like running commands, invoking Lambda functions, or creating AWS resources.
Common Automation Actions:
- aws:runCommand: Execute commands on instances
- aws:executeAwsApi: Call any AWS API
- aws:invokeLambdaFunction: Invoke a Lambda function
- aws:createStack: Create a CloudFormation stack
- aws:sleep: Wait for a specified duration
- aws:waitForAwsResourceProperty: Wait for a resource to reach a desired state
- aws:branch: Conditional branching based on previous step results
- aws:executeScript: Run Python or PowerShell scripts
Example Automation Document:
schemaVersion: '0.3'
description: 'Automated EC2 instance patching and restart'
parameters:
InstanceId:
type: String
description: 'EC2 instance to patch'
mainSteps:
- name: CreateSnapshot
action: 'aws:executeAwsApi'
inputs:
Service: ec2
Api: CreateSnapshots   # multi-volume snapshot of all EBS volumes attached to the instance
InstanceSpecification:
InstanceId: '{{ InstanceId }}'
Description: 'Pre-patch snapshot'
outputs:
- Name: SnapshotId
Selector: '$.Snapshots[0].SnapshotId'
Type: String
- name: WaitForSnapshot
action: 'aws:waitForAwsResourceProperty'
inputs:
Service: ec2
Api: DescribeSnapshots
SnapshotIds:
- '{{ CreateSnapshot.SnapshotId }}'
PropertySelector: '$.Snapshots[0].State'
DesiredValues:
- completed
- name: InstallPatches
action: 'aws:runCommand'
inputs:
DocumentName: 'AWS-RunPatchBaseline'
InstanceIds:
- '{{ InstanceId }}'
Parameters:
Operation: Install
- name: RebootInstance
action: 'aws:executeAwsApi'
inputs:
Service: ec2
Api: RebootInstances
InstanceIds:
- '{{ InstanceId }}'
- name: WaitForReboot
action: 'aws:sleep'
inputs:
Duration: PT5M
- name: VerifyPatches
action: 'aws:runCommand'
inputs:
DocumentName: 'AWS-RunPatchBaseline'
InstanceIds:
- '{{ InstanceId }}'
Parameters:
Operation: Scan
Detailed Example 1: Automated Patch Management
A company with 500 EC2 instances needs to patch systems monthly while minimizing downtime. They implement automated patching using Systems Manager: (1) Create patch baselines defining which patches to install (security patches, critical updates), (2) Create maintenance windows for different application tiers (database tier: Sunday 2-4 AM, app tier: Sunday 4-6 AM, web tier: Sunday 6-8 AM), (3) Configure Patch Manager to install patches during maintenance windows, (4) Set up State Manager associations to scan for missing patches daily, (5) Create CloudWatch dashboard showing patch compliance across fleet. The automation workflow: (1) Maintenance window opens, (2) Patch Manager creates EBS snapshots of instances, (3) Patches are installed using Run Command, (4) Instances are rebooted if required, (5) Post-patch health checks verify applications are running, (6) If health checks fail, automation rolls back to snapshot, (7) Compliance data is updated in Systems Manager. The security team reviews the compliance dashboard weekly, seeing which instances are compliant, which have missing patches, and which failed patching. This reduces patching time from 2 days of manual work to 6 hours of automated execution.
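A condensed sketch of one maintenance window from this workflow, expressed in CloudFormation (the window name, schedule, tag key, and concurrency values are illustrative assumptions):
```yaml
WebTierPatchWindow:
  Type: AWS::SSM::MaintenanceWindow
  Properties:
    Name: web-tier-patching
    Schedule: cron(0 6 ? * SUN *)   # Sundays 06:00 UTC
    Duration: 2
    Cutoff: 1
    AllowUnassociatedTargets: false

WebTierTargets:
  Type: AWS::SSM::MaintenanceWindowTarget
  Properties:
    WindowId: !Ref WebTierPatchWindow
    ResourceType: INSTANCE
    Targets:
      - Key: tag:Tier
        Values: [web]

WebTierPatchTask:
  Type: AWS::SSM::MaintenanceWindowTask
  Properties:
    WindowId: !Ref WebTierPatchWindow
    TaskType: RUN_COMMAND
    TaskArn: AWS-RunPatchBaseline
    Priority: 1
    MaxConcurrency: '10%'
    MaxErrors: '5%'
    Targets:
      - Key: WindowTargetIds
        Values: [!Ref WebTierTargets]
    TaskInvocationParameters:
      MaintenanceWindowRunCommandParameters:
        Parameters:
          Operation: [Install]
```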
Detailed Example 2: Configuration Drift Remediation
An organization needs to ensure all EC2 instances maintain specific security configurations: (1) CloudWatch agent installed and running, (2) Specific security groups attached, (3) IMDSv2 enabled, (4) SSM Agent updated to latest version. They use State Manager to maintain these configurations: (1) Create State Manager association targeting all instances with tag "Environment:Production", (2) Association runs every 30 minutes, (3) Association document checks each configuration item, (4) If drift detected, automation remediates automatically. For example, if someone manually stops the CloudWatch agent, State Manager detects this within 30 minutes and restarts it. They also create an EventBridge rule that triggers automation when new instances are launched - the automation immediately applies the baseline configuration. They use Systems Manager Compliance to view configuration compliance across all instances, seeing which instances are compliant and which have drifted. When drift is detected, they use CloudTrail to identify who made the manual change and provide training on proper change management procedures.
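A minimal State Manager association for one of these checks, expressed in CloudFormation. The association name, schedule, and tag values mirror the example; using the AWS-ConfigureAWSPackage managed document to keep the CloudWatch agent installed is an assumption:
```yaml
CloudWatchAgentAssociation:
  Type: AWS::SSM::Association
  Properties:
    AssociationName: ensure-cloudwatch-agent
    Name: AWS-ConfigureAWSPackage   # AWS managed document
    ScheduleExpression: rate(30 minutes)
    Targets:
      - Key: tag:Environment
        Values:
          - Production
    Parameters:
      action:
        - Install
      name:
        - AmazonCloudWatchAgent
```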
Detailed Example 3: Automated Incident Response
A security team needs to automatically respond to security findings. They create automation workflows: (1) GuardDuty detects suspicious activity (e.g., cryptocurrency mining), (2) EventBridge rule triggers Systems Manager automation, (3) Automation document executes response steps: (a) Isolate instance by changing security group to deny all traffic, (b) Create EBS snapshot for forensics, (c) Create memory dump using Run Command, (d) Tag instance with "SecurityIncident:True", (e) Send SNS notification to security team, (f) Create Systems Manager OpsItem for tracking. The automation completes in under 2 minutes, containing the threat before it spreads. The security team reviews the OpsItem, analyzes the memory dump and snapshot, determines root cause, and decides whether to terminate the instance or remediate it. They use Systems Manager Session Manager to access the isolated instance for investigation without opening SSH ports. This automated response reduces incident response time from 30 minutes (manual) to 2 minutes (automated).
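A sketch of the trigger wiring for this kind of response (the runbook name, IAM role, and severity threshold are assumptions):
```yaml
GuardDutyIsolationRule:
  Type: AWS::Events::Rule
  Properties:
    Description: Run the isolation runbook on high-severity GuardDuty findings
    EventPattern:
      source:
        - aws.guardduty
      detail-type:
        - GuardDuty Finding
      detail:
        severity:
          - numeric: [">=", 7]
    Targets:
      - Id: IsolateInstanceRunbook
        Arn: !Sub arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/IsolateInstance:$DEFAULT
        RoleArn: !GetAtt EventBridgeAutomationRole.Arn   # role allowed to start the automation
```
In practice you would also add an input transformer on the target to map the finding's instance ID into the runbook's InstanceId parameter.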
✅ Must Know (Critical Facts):
When to use Systems Manager (Comprehensive):
Limitations & Constraints:
💡 Tips for Understanding:
⚠️ Common Mistakes & Misconceptions:
🔗 Connections to Other Topics:
Troubleshooting Common Issues:
This chapter provided comprehensive coverage of Configuration Management and Infrastructure as Code, focusing on the tools and practices for managing AWS infrastructure at scale.
Section 1: Infrastructure as Code and Reusable Components
Section 2: Multi-Account Automation
Section 3: Complex Automation Solutions
Infrastructure as Code: Use CloudFormation for declarative infrastructure, CDK for programmatic infrastructure, and SAM for serverless applications. Choose based on team skills and use case.
Multi-Account Strategy: AWS Organizations provides the foundation, Control Tower automates setup and governance, StackSets deploy infrastructure across accounts.
Automation at Scale: Systems Manager manages fleets of instances, automation documents orchestrate complex workflows, State Manager maintains configuration compliance.
Governance and Compliance: SCPs enforce permission boundaries, guardrails detect non-compliance, centralized logging enables auditing.
Reusability: Create reusable CloudFormation modules, CDK constructs, and Service Catalog products to standardize infrastructure.
Test yourself before moving on:
Try these from your practice test bundles:
If you scored below 70%:
Key Services:
Key Concepts:
Decision Points:
Next Chapter: Chapter 3 - Resilient Cloud Solutions (High availability, scalability, disaster recovery)
What you'll learn:
Time to complete: 8-10 hours
Prerequisites: Chapter 0 (Fundamentals), Chapter 1 (SDLC Automation), Chapter 2 (Configuration Management)
Exam weight: 15% (approximately 10 questions)
Domain Tasks Covered:
The problem: Applications fail when infrastructure components break, causing downtime, lost revenue, and poor user experience. Single points of failure, lack of redundancy, and manual failover processes lead to extended outages.
The solution: High availability (HA) architectures use redundancy, automatic failover, and geographic distribution to ensure applications remain operational even when components fail. AWS provides multiple services and patterns to achieve HA.
Why it's tested: The exam heavily tests HA design because it's fundamental to production systems. DevOps engineers must understand how to architect resilient systems that meet SLA requirements.
What is High Availability: HA is the ability of a system to remain operational and accessible with minimal downtime, typically measured as a percentage of uptime (e.g., 99.99% = 52 minutes downtime per year).
Why HA Matters: Downtime costs money, damages reputation, and impacts user experience. Modern applications require near-continuous availability.
Real-world analogy: HA is like having backup power generators, redundant internet connections, and multiple data centers for a hospital - if one system fails, others immediately take over to ensure continuous operation.
HA Principles:
Availability Tiers:
| Availability | Downtime/Year | Downtime/Month | Use Case |
|---|---|---|---|
| 99% | 3.65 days | 7.2 hours | Non-critical applications |
| 99.9% | 8.76 hours | 43.2 minutes | Standard business applications |
| 99.95% | 4.38 hours | 21.6 minutes | Important business applications |
| 99.99% | 52.56 minutes | 4.32 minutes | Critical business applications |
| 99.999% | 5.26 minutes | 25.9 seconds | Mission-critical applications |
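The downtime figures follow directly from the availability percentage: allowed downtime = (1 − availability) × period length. For example, at 99.99% availability, (1 − 0.9999) × 365 days × 24 hours × 60 minutes ≈ 52.56 minutes per year, and the same factor over a 30-day month gives about 4.32 minutes, matching the table above.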
What it is: Deploying application components across multiple Availability Zones (AZs) within a single AWS Region. Each AZ is a physically separate data center with independent power, cooling, and networking.
Why it exists: Single data centers can experience failures (power outages, network issues, natural disasters). Multi-AZ provides fault tolerance at the data center level while maintaining low latency between AZs.
Real-world analogy: Multi-AZ is like having multiple bank branches in the same city - if one branch has a problem, customers can go to another branch nearby without significant inconvenience.
How it works (Detailed step-by-step):
📊 Multi-AZ Architecture Diagram:
graph TB
subgraph "Region: us-east-1"
subgraph "Availability Zone 1a"
ALB1[Application Load Balancer]
APP1[App Server 1]
APP2[App Server 2]
RDS_PRIMARY[RDS Primary]
CACHE1[ElastiCache Node 1]
end
subgraph "Availability Zone 1b"
APP3[App Server 3]
APP4[App Server 4]
RDS_STANDBY[RDS Standby]
CACHE2[ElastiCache Node 2]
end
subgraph "Availability Zone 1c"
APP5[App Server 5]
APP6[App Server 6]
CACHE3[ElastiCache Node 3]
end
end
USERS[Users] -->|HTTPS| ALB1
ALB1 -->|Health Check| APP1
ALB1 -->|Health Check| APP2
ALB1 -->|Health Check| APP3
ALB1 -->|Health Check| APP4
ALB1 -->|Health Check| APP5
ALB1 -->|Health Check| APP6
APP1 -->|Read/Write| RDS_PRIMARY
APP2 -->|Read/Write| RDS_PRIMARY
APP3 -->|Read/Write| RDS_PRIMARY
APP4 -->|Read/Write| RDS_PRIMARY
APP5 -->|Read/Write| RDS_PRIMARY
APP6 -->|Read/Write| RDS_PRIMARY
RDS_PRIMARY -.Synchronous Replication.-> RDS_STANDBY
APP1 -->|Cache| CACHE1
APP3 -->|Cache| CACHE2
APP5 -->|Cache| CACHE3
CACHE1 -.Replication.-> CACHE2
CACHE2 -.Replication.-> CACHE3
style ALB1 fill:#ff9900
style RDS_PRIMARY fill:#c8e6c9
style RDS_STANDBY fill:#fff3e0
style APP1 fill:#e1f5fe
style APP3 fill:#e1f5fe
style APP5 fill:#e1f5fe
See: diagrams/04_domain3_multi_az_architecture.mmd
Diagram Explanation (Detailed):
The diagram shows a comprehensive Multi-AZ architecture across three Availability Zones in the us-east-1 region. Users connect to the Application Load Balancer (orange), which is automatically deployed across all AZs by AWS. The ALB continuously performs health checks on application servers in all three AZs - if a server fails health checks, the ALB stops sending traffic to it. Application servers (blue) are distributed evenly across AZs using an Auto Scaling group with balanced AZ distribution. The RDS Primary database (green) in AZ-1a handles all read and write operations and synchronously replicates every transaction to the RDS Standby (yellow) in AZ-1b. This synchronous replication ensures zero data loss during failover. If AZ-1a experiences a complete failure, RDS automatically promotes the Standby to Primary within 1-2 minutes, and the ALB continues routing traffic to healthy app servers in AZ-1b and AZ-1c. ElastiCache nodes are distributed across all three AZs with replication enabled, providing cache availability even if an entire AZ fails. This architecture can tolerate the complete loss of any single AZ without service interruption.
Detailed Example 1: E-Commerce Application Multi-AZ Design
An e-commerce company needs 99.99% availability (52 minutes downtime per year). They implement Multi-AZ architecture: (1) Application Load Balancer automatically spans all three AZs in us-east-1, (2) Auto Scaling group launches EC2 instances evenly across three AZs with minimum 6 instances (2 per AZ), (3) RDS MySQL database configured with Multi-AZ deployment - primary in us-east-1a, standby in us-east-1b with synchronous replication, (4) ElastiCache Redis cluster with cluster mode enabled, distributing shards across three AZs, (5) EFS file system for shared storage, automatically replicated across AZs, (6) CloudWatch alarms monitor ALB target health and RDS failover events. During Black Friday, one AZ experiences a power failure. The ALB immediately stops routing traffic to instances in the failed AZ, distributing load across the remaining two AZs. The Auto Scaling group launches replacement instances in healthy AZs within 5 minutes. Users experience no downtime - the only impact is slightly higher latency as remaining instances handle increased load. Total customer-facing downtime: 0 minutes. The company's monitoring shows the incident, but customers never noticed.
Detailed Example 2: RDS Multi-AZ Failover Scenario
A financial services application uses RDS PostgreSQL Multi-AZ for transaction processing. The database handles 10,000 transactions per minute. At 2:15 PM, the primary database instance in us-east-1a experiences a hardware failure. Here's what happens: (1) At 2:15:00, RDS detects the primary instance is unresponsive (health checks fail), (2) At 2:15:05, RDS initiates automatic failover to the standby in us-east-1b, (3) At 2:15:10, RDS promotes the standby to primary, (4) At 2:15:15, RDS updates the DNS record for the database endpoint to point to the new primary, (5) At 2:15:45, application servers reconnect to the database (DNS TTL expires), (6) At 2:16:00, RDS begins creating a new standby instance, (7) Normal operations resume as soon as applications reconnect; full Multi-AZ protection returns once the new standby finishes provisioning a few minutes later. Total failover time: 45 seconds. Because the application uses connection pooling with automatic retry logic, most transactions complete successfully. Only transactions in-flight during the 45-second window need to be retried. The application's error rate spikes briefly from 0.01% to 2% during failover, then returns to normal. No data is lost because synchronous replication ensures the standby had all committed transactions.
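The example relies on connection pooling with automatic retry. A minimal, driver-agnostic sketch of that retry logic (the exception type and the operation are placeholders - in practice you would catch your database driver's connection error):

import time

def with_failover_retry(operation, attempts=5, base_delay=1.0):
    """Retry a database operation with exponential backoff so that work
    in flight during the ~30-60 second Multi-AZ failover window succeeds."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:  # placeholder: catch your driver's OperationalError here
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # back off 1s, 2s, 4s, 8s, ...

# Hypothetical usage: with_failover_retry(lambda: cursor.execute(insert_sql))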
Detailed Example 3: Eliminating Single Points of Failure
A SaaS company performs an architecture review to identify single points of failure (SPOFs). They discover: (1) NAT Gateway in single AZ - if AZ fails, private subnets lose internet access, (2) Application Load Balancer in only two AZs - not using all available AZs, (3) ElastiCache Redis in single node - no failover capability, (4) EBS volumes for application state - not replicated across AZs. They remediate: (1) Deploy NAT Gateway in each AZ, update route tables so each private subnet uses NAT Gateway in same AZ, (2) Ensure ALB spans all three AZs by creating subnets in third AZ, (3) Convert ElastiCache to cluster mode with replication across three AZs, (4) Migrate application state from EBS to DynamoDB (automatically replicated across AZs) or EFS (automatically Multi-AZ). They implement automated testing: Lambda function runs weekly, simulates AZ failure by blocking traffic to one AZ, verifies application continues operating. This "chaos engineering" approach ensures HA architecture works as designed. After remediation, they achieve 99.99% availability, meeting their SLA commitments.
ā Must Know (Critical Facts):
When to use Multi-AZ (Comprehensive):
Limitations & Constraints:
š” Tips for Understanding:
ā ļø Common Mistakes & Misconceptions:
š Connections to Other Topics:
Troubleshooting Common Issues:
What it is: Deploying application components across multiple AWS Regions (geographic locations). Each region is completely independent with its own set of Availability Zones.
Why it exists: Regional failures (though rare) can occur due to natural disasters, widespread network issues, or service disruptions. Multi-Region provides the highest level of availability and enables global application deployment for reduced latency.
Real-world analogy: Multi-Region is like having bank branches in different countries - if one country experiences problems, operations continue in other countries independently.
How it works (Detailed step-by-step):
š Multi-Region Architecture Diagram:
graph TB
subgraph "Users"
USER_US[Users in US]
USER_EU[Users in Europe]
USER_ASIA[Users in Asia]
end
subgraph "Route 53"
R53[Route 53<br/>Geolocation/Latency Routing]
HEALTH[Health Checks]
end
subgraph "Region: us-east-1 (Primary)"
ALB_US[Application Load Balancer]
APP_US[Application Servers]
RDS_US[RDS Primary]
S3_US[S3 Bucket]
DYNAMO_US[DynamoDB Global Table]
end
subgraph "Region: eu-west-1 (Secondary)"
ALB_EU[Application Load Balancer]
APP_EU[Application Servers]
RDS_EU[RDS Read Replica]
S3_EU[S3 Bucket]
DYNAMO_EU[DynamoDB Global Table]
end
subgraph "Region: ap-southeast-1 (Secondary)"
ALB_ASIA[Application Load Balancer]
APP_ASIA[Application Servers]
RDS_ASIA[RDS Read Replica]
S3_ASIA[S3 Bucket]
DYNAMO_ASIA[DynamoDB Global Table]
end
USER_US -->|DNS Query| R53
USER_EU -->|DNS Query| R53
USER_ASIA -->|DNS Query| R53
R53 -->|Routes to| ALB_US
R53 -->|Routes to| ALB_EU
R53 -->|Routes to| ALB_ASIA
HEALTH -->|Monitors| ALB_US
HEALTH -->|Monitors| ALB_EU
HEALTH -->|Monitors| ALB_ASIA
ALB_US -->|Distributes| APP_US
ALB_EU -->|Distributes| APP_EU
ALB_ASIA -->|Distributes| APP_ASIA
APP_US -->|Read/Write| RDS_US
APP_EU -->|Read| RDS_EU
APP_ASIA -->|Read| RDS_ASIA
RDS_US -.Async Replication.-> RDS_EU
RDS_US -.Async Replication.-> RDS_ASIA
S3_US -.Cross-Region Replication.-> S3_EU
S3_US -.Cross-Region Replication.-> S3_ASIA
DYNAMO_US <-.Bi-directional Replication.-> DYNAMO_EU
DYNAMO_EU <-.Bi-directional Replication.-> DYNAMO_ASIA
DYNAMO_ASIA <-.Bi-directional Replication.-> DYNAMO_US
style R53 fill:#ff9900
style ALB_US fill:#c8e6c9
style ALB_EU fill:#e1f5fe
style ALB_ASIA fill:#e1f5fe
See: diagrams/04_domain3_multi_region_architecture.mmd
Diagram Explanation (Detailed):
The diagram illustrates a global Multi-Region architecture spanning three regions: us-east-1 (primary), eu-west-1, and ap-southeast-1. Users from different geographic locations query Route 53 (orange), which uses geolocation or latency-based routing to direct them to the nearest region for optimal performance. Route 53 Health Checks continuously monitor the health of Application Load Balancers in each region - if a region becomes unhealthy, Route 53 automatically routes traffic to healthy regions. Each region has a complete application stack: ALB, application servers, database, and storage. The us-east-1 region hosts the RDS Primary database (green) that handles all write operations. RDS Read Replicas in eu-west-1 and ap-southeast-1 (blue) asynchronously replicate data from the primary, serving read traffic in their regions. S3 buckets use Cross-Region Replication to automatically copy objects between regions, ensuring data availability globally. DynamoDB Global Tables provide bi-directional replication between all three regions, allowing writes in any region with automatic conflict resolution. This architecture provides both high availability (survives regional failures) and low latency (users connect to nearest region). If us-east-1 fails completely, Route 53 stops routing traffic there, and one of the read replicas can be promoted to primary to restore write capability.
Multi-Region Patterns:
1. Active-Passive (Disaster Recovery):
2. Active-Active (High Availability):
3. Active-Read (Hybrid):
Detailed Example 1: Global SaaS Application
A SaaS company serves customers globally and needs low latency worldwide. They implement active-active Multi-Region architecture: (1) Deploy application in us-east-1, eu-west-1, and ap-southeast-1, (2) Use DynamoDB Global Tables for user data - writes in any region replicate to others within seconds, (3) Use Aurora Global Database for transactional data - primary in us-east-1, read replicas in other regions with <1 second replication lag, (4) Use S3 with Cross-Region Replication for user-uploaded files, (5) Use Route 53 latency-based routing to direct users to nearest region, (6) Use CloudFront for static assets with origins in all regions. A user in London connects to eu-west-1 (30ms latency vs 100ms to us-east-1). They can read and write data locally - writes to DynamoDB replicate globally, writes to Aurora go to us-east-1 but with optimized network path. When us-east-1 experiences an outage, Route 53 health checks detect the failure within 30 seconds and stop routing traffic there. The company promotes the Aurora read replica in eu-west-1 to primary (takes 1 minute), and operations continue with eu-west-1 as the new primary. Users in US are automatically routed to eu-west-1 or ap-southeast-1, experiencing slightly higher latency but no service interruption. Total customer-facing downtime: <2 minutes.
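Step (5) relies on Route 53 latency-based routing. A minimal boto3 sketch of one latency record pointing at the eu-west-1 ALB (the hosted zone IDs, record name, and ALB DNS name are placeholders; you create one such record per region):

import boto3

route53 = boto3.client('route53')

route53.change_resource_record_sets(
    HostedZoneId='Z0000000000EXAMPLE',  # placeholder: your public hosted zone
    ChangeBatch={'Changes': [{
        'Action': 'UPSERT',
        'ResourceRecordSet': {
            'Name': 'app.example.com',
            'Type': 'A',
            'SetIdentifier': 'eu-west-1',   # one record per region
            'Region': 'eu-west-1',          # latency-based routing
            'AliasTarget': {
                'HostedZoneId': 'Z0000000000ALB',  # placeholder: the ALB's hosted zone ID
                'DNSName': 'my-app-alb.eu-west-1.elb.amazonaws.com',
                'EvaluateTargetHealth': True,      # combine with health checks for failover
            },
        },
    }]},
)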
Detailed Example 2: Disaster Recovery with Pilot Light
A financial services company needs disaster recovery for regulatory compliance but wants to minimize costs. They implement pilot light DR strategy: (1) Primary region us-east-1 runs full application stack, (2) DR region us-west-2 has minimal infrastructure: RDS read replica receiving continuous replication, S3 bucket receiving cross-region replication, AMIs and CloudFormation templates ready to deploy, (3) Route 53 health checks monitor primary region, (4) Automated runbook in Systems Manager for DR failover. During a regional outage: (1) Route 53 health checks fail, triggering SNS notification, (2) On-call engineer reviews situation and initiates DR runbook, (3) Systems Manager automation executes: (a) Promote RDS read replica to primary in us-west-2, (b) Deploy CloudFormation stack creating ALB, Auto Scaling group, and application servers, (c) Update Route 53 to point to us-west-2 ALB, (d) Verify application health, (4) Total recovery time: 15 minutes (RTO), (5) Data loss: <5 minutes of transactions (RPO). This approach costs 20% of active-active but provides acceptable RTO/RPO for their requirements.
Detailed Example 3: Content Delivery with Multi-Region
A media streaming company needs to deliver video content globally with low latency. They implement Multi-Region content delivery: (1) Store master video files in S3 in us-east-1, (2) Use S3 Cross-Region Replication to replicate to eu-west-1, ap-southeast-1, and sa-east-1, (3) Use CloudFront with multiple regional origins - each region's S3 bucket is an origin, (4) Configure CloudFront origin failover - if primary origin fails, automatically use secondary, (5) Use Lambda@Edge to route requests to nearest origin based on viewer location, (6) Store metadata in DynamoDB Global Tables for fast access worldwide. A user in Brazil requests a video: (1) CloudFront edge location in São Paulo receives request, (2) Lambda@Edge determines nearest origin is sa-east-1, (3) If video exists in sa-east-1 S3, serve from there (10ms latency), (4) If not, fetch from us-east-1 (150ms latency) and cache in CloudFront, (5) Subsequent requests from Brazil serve from CloudFront cache (5ms latency). This architecture provides 99.99% availability and <50ms latency for 95% of users globally.
ā Must Know (Critical Facts):
When to use Multi-Region (Comprehensive):
Limitations & Constraints:
š” Tips for Understanding:
ā ļø Common Mistakes & Misconceptions:
š Connections to Other Topics:
Troubleshooting Common Issues:
The problem: Applications experience variable load - traffic spikes during peak hours, seasonal variations, and unpredictable growth. Fixed-capacity infrastructure is either over-provisioned (wasting money) or under-provisioned (causing performance issues).
The solution: Scalable architectures automatically adjust capacity based on demand using Auto Scaling, serverless technologies, and managed services. This ensures performance during peak load while minimizing costs during low load.
Why it's tested: The exam tests your ability to design systems that scale automatically, handle traffic spikes, and optimize costs through elastic capacity.
[Content continues with Auto Scaling, serverless scaling, and database scaling patterns...]
What it is: Automatically adjusting the number of compute resources based on demand using metrics like CPU utilization, request count, or custom metrics.
Why it exists: Manual scaling is slow and reactive. Auto Scaling proactively adjusts capacity, ensuring performance during spikes and cost optimization during low demand.
Real-world analogy: Auto Scaling is like a restaurant automatically adjusting staff based on customer count - more servers during dinner rush, fewer during slow periods.
Auto Scaling Types:
1. Target Tracking Scaling:
2. Step Scaling:
3. Scheduled Scaling:
4. Predictive Scaling:
Detailed Example 1: E-Commerce Auto Scaling
An e-commerce site experiences predictable daily patterns and unpredictable spikes. They implement comprehensive Auto Scaling: (1) Base capacity: 10 instances minimum (handle baseline traffic), (2) Target tracking: Maintain 60% CPU utilization, (3) Scheduled scaling: Increase minimum to 20 instances weekdays 8 AM-8 PM, (4) Predictive scaling: Enabled with 30 days historical data, (5) Step scaling for extreme spikes: Add 10 instances if request count >10,000/minute. During a flash sale: (1) 9:55 AM: Predictive scaling increases capacity to 25 instances (forecasts spike), (2) 10:00 AM: Flash sale starts, traffic jumps 5x, (3) 10:01 AM: Target tracking adds 15 more instances (CPU hits 80%), (4) 10:02 AM: Step scaling adds 10 instances (requests >10,000/min), (5) 10:05 AM: 50 instances running, CPU stabilizes at 60%, (6) 11:00 AM: Sale ends, traffic drops, (7) 11:15 AM: Auto Scaling terminates excess instances, (8) 11:30 AM: Back to 20 instances (scheduled minimum). Cost optimization: Pay for 50 instances for 1 hour instead of provisioning 50 instances 24/7.
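A minimal boto3 sketch of two of the policies described above - the 60% CPU target-tracking policy and the weekday scheduled minimum (the Auto Scaling group and policy names are assumptions):

import boto3

autoscaling = boto3.client('autoscaling')

# Target tracking: keep average CPU at roughly 60%
autoscaling.put_scaling_policy(
    AutoScalingGroupName='ecommerce-web-asg',
    PolicyName='cpu-target-60',
    PolicyType='TargetTrackingScaling',
    TargetTrackingConfiguration={
        'PredefinedMetricSpecification': {'PredefinedMetricType': 'ASGAverageCPUUtilization'},
        'TargetValue': 60.0,
    },
)

# Scheduled scaling: raise the weekday minimum to 20 instances at 8 AM
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName='ecommerce-web-asg',
    ScheduledActionName='weekday-morning-floor',
    Recurrence='0 8 * * 1-5',  # cron expression, evaluated in UTC unless TimeZone is set
    MinSize=20,
)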
Detailed Example 2: Serverless Auto Scaling
A mobile app backend uses Lambda, API Gateway, and DynamoDB. All components scale automatically: (1) API Gateway: Handles 10,000 requests/second automatically, no configuration needed, (2) Lambda: Scales to 1,000 concurrent executions (account limit), each execution handles one request, (3) DynamoDB: Uses on-demand capacity mode, automatically scales to handle any traffic. During app launch: (1) Normal traffic: 100 requests/second, 100 Lambda executions, DynamoDB handles easily, (2) App featured in App Store: Traffic jumps to 5,000 requests/second, (3) API Gateway handles increased load automatically, (4) Lambda scales to 1,000 concurrent executions within seconds, (5) DynamoDB on-demand mode scales automatically, (6) No configuration changes needed, (7) Cost: Pay only for actual usage. The team monitors CloudWatch metrics: Lambda concurrent executions, API Gateway 4xx/5xx errors, DynamoDB throttling. They set CloudWatch alarms: Alert if Lambda concurrent executions >800 (approaching limit), alert if DynamoDB throttling >0 (capacity issue). This serverless architecture scales from 0 to millions of requests without manual intervention.
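A minimal boto3 sketch of the concurrency alarm mentioned above (the SNS topic ARN is a placeholder):

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alert when account-wide concurrent executions approach the 1,000 default limit
cloudwatch.put_metric_alarm(
    AlarmName='lambda-concurrency-approaching-limit',
    Namespace='AWS/Lambda',
    MetricName='ConcurrentExecutions',
    Statistic='Maximum',
    Period=60,
    EvaluationPeriods=3,
    Threshold=800,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:ops-alerts'],  # placeholder topic
)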
ā Must Know (Critical Facts):
When to use Auto Scaling (Comprehensive):
The problem: Data loss, corruption, and disasters can destroy business operations. Manual backup and recovery processes are slow, error-prone, and often untested.
The solution: Automated backup, recovery, and disaster recovery processes ensure data protection and business continuity with defined RTO (Recovery Time Objective) and RPO (Recovery Point Objective).
Why it's tested: The exam tests your ability to design and implement automated recovery solutions that meet business requirements for data protection and availability.
Recovery Time Objective (RTO): Maximum acceptable time to restore service after disruption.
Recovery Point Objective (RPO): Maximum acceptable data loss measured in time.
RTO/RPO Trade-offs:
Disaster Recovery Strategies (ordered by RTO/RPO):
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours-Days | Hours | Lowest | Restore from backups stored in S3/Glacier |
| Pilot Light | 10-30 min | Minutes | Low | Minimal infrastructure running, scale up during DR |
| Warm Standby | Minutes | Seconds | Medium | Scaled-down version running, scale up during DR |
| Multi-Site Active-Active | Seconds | None | Highest | Full capacity in multiple regions |
Detailed Example 1: Automated Backup Strategy
A SaaS company implements comprehensive automated backups: (1) RDS databases: Automated backups enabled, 7-day retention, snapshots every 6 hours, (2) DynamoDB: Point-in-time recovery enabled (35-day retention), (3) EBS volumes: AWS Backup creates daily snapshots, 30-day retention, (4) S3 data: Versioning enabled, lifecycle policy moves old versions to Glacier after 90 days, (5) EC2 AMIs: AWS Backup creates weekly AMIs, 4-week retention. They use AWS Backup for centralized management: (1) Create backup plan with retention policies, (2) Assign resources using tags (Environment:Production), (3) Backup vault with encryption and access controls, (4) Cross-region backup copy to us-west-2 for DR, (5) Backup compliance reporting shows coverage. During a data corruption incident: (1) Developer accidentally deletes production table, (2) DBA identifies issue within 10 minutes, (3) Uses DynamoDB point-in-time recovery to restore table to 5 minutes before deletion, (4) Recovery completes in 15 minutes, (5) Data loss: 5 minutes (RPO met), (6) Downtime: 15 minutes (RTO met). Total cost: $500/month for backups vs. potential $100K+ loss from data corruption.
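The point-in-time recovery step in the incident above boils down to a single API call. A hedged boto3 sketch (table names and the restore timestamp are hypothetical):

import boto3
from datetime import datetime, timezone

dynamodb = boto3.client('dynamodb')

# Restore the table to a state 5 minutes before the accidental deletion
dynamodb.restore_table_to_point_in_time(
    SourceTableName='orders',
    TargetTableName='orders-restored',
    RestoreDateTime=datetime(2024, 10, 1, 14, 10, tzinfo=timezone.utc),
)
# The restored data lands in a new table; the application is repointed
# (or the data copied back) once the restore completes.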
Detailed Example 2: Disaster Recovery Testing
A financial services company must test DR annually for compliance. They implement automated DR testing: (1) Create DR runbook in Systems Manager, (2) Runbook steps: (a) Promote RDS read replica in DR region, (b) Deploy CloudFormation stack for application tier, (c) Update Route 53 to point to DR region, (d) Run smoke tests, (e) Generate DR test report, (3) Schedule quarterly DR tests using EventBridge, (4) Lambda function triggers runbook, monitors progress, sends notifications. During DR test: (1) EventBridge triggers Lambda at 2 AM Sunday, (2) Lambda executes DR runbook, (3) RDS read replica promoted to primary (2 minutes), (4) CloudFormation deploys application stack (10 minutes), (5) Route 53 updated to DR region (1 minute), (6) Smoke tests verify functionality (5 minutes), (7) Lambda generates report: RTO achieved: 13 minutes (target: 15 minutes), RPO: 30 seconds (target: 5 minutes), (8) After test, Lambda executes rollback runbook, (9) Environment restored to normal. This automated testing ensures DR procedures work and meet RTO/RPO targets without manual effort.
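The Lambda that drives the quarterly test mostly starts the runbook and watches its status. A minimal sketch (the runbook name and its parameters are assumptions):

import boto3

ssm = boto3.client('ssm')

# Kick off the DR runbook (triggered by the EventBridge schedule)
execution_id = ssm.start_automation_execution(
    DocumentName='DR-Failover-Runbook',           # hypothetical document name
    Parameters={'TargetRegion': ['us-west-2']},   # hypothetical parameter
)['AutomationExecutionId']

# Later, check whether the execution completed within the RTO target
status = ssm.get_automation_execution(AutomationExecutionId=execution_id)
print(status['AutomationExecution']['AutomationExecutionStatus'])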
ā Must Know (Critical Facts):
Section 1: High Availability Solutions
Section 2: Scalable Solutions
Section 3: Automated Recovery
Try these from your practice test bundles:
Next Chapter: Chapter 4 - Monitoring and Logging (CloudWatch, X-Ray, log aggregation)
What you'll learn:
Time to complete: 8-10 hours
Prerequisites: Chapter 0 (Fundamentals), Chapter 1 (SDLC Automation)
Domain Weight: 15% of exam (approximately 10 questions)
The problem: Without proper monitoring and logging, you're flying blind - you can't detect issues, troubleshoot problems, or optimize performance. Applications fail, users complain, and you have no data to understand what went wrong.
The solution: Amazon CloudWatch provides centralized monitoring and logging for all AWS resources and applications. It collects metrics, logs, and events, allowing you to visualize, analyze, and respond to system behavior in real-time.
Why it's tested: Monitoring and logging are fundamental to DevOps practices. The exam tests your ability to design comprehensive monitoring solutions, aggregate logs from multiple sources, create meaningful metrics, and automate responses to system events.
What it is: CloudWatch Logs is a centralized log management service that collects, stores, and analyzes log data from AWS services, applications, and on-premises servers.
Why it exists: Applications and infrastructure generate massive amounts of log data across distributed systems. Without centralization, logs are scattered across hundreds of servers, making troubleshooting nearly impossible. CloudWatch Logs solves this by providing a single place to store, search, and analyze all your logs.
Real-world analogy: Think of CloudWatch Logs like a library's card catalog system. Instead of searching through thousands of books scattered across multiple buildings, you have one centralized index that tells you exactly where to find what you need. The logs are the books, and CloudWatch Logs is the catalog system that organizes and indexes them.
How it works (Detailed step-by-step):
š CloudWatch Logs Architecture Diagram:
graph TB
subgraph "Log Sources"
EC2[EC2 Instances<br/>CloudWatch Agent]
Lambda[Lambda Functions<br/>Automatic Logging]
ECS[ECS Containers<br/>awslogs Driver]
RDS[RDS Databases<br/>Slow Query Logs]
VPC[VPC Flow Logs]
CT[CloudTrail<br/>API Logs]
end
subgraph "CloudWatch Logs"
LG1[Log Group: /aws/ec2/webservers]
LG2[Log Group: /aws/lambda/functions]
LG3[Log Group: /aws/ecs/applications]
subgraph "Log Group 1"
LS1[Log Stream: i-abc123]
LS2[Log Stream: i-def456]
end
end
subgraph "Processing & Analysis"
MF[Metric Filters<br/>Extract Metrics]
SF[Subscription Filters<br/>Stream to Kinesis/Lambda]
LI[Logs Insights<br/>SQL-like Queries]
end
subgraph "Storage & Retention"
S3[S3 Export<br/>Long-term Archive]
KMS[KMS Encryption<br/>At Rest]
end
EC2 -->|PutLogEvents API| LG1
Lambda -->|Automatic| LG2
ECS -->|awslogs driver| LG3
RDS -->|Export| LG1
VPC -->|Flow Logs| LG1
CT -->|API Calls| LG1
LG1 --> LS1
LG1 --> LS2
LG1 --> MF
LG2 --> SF
LG3 --> LI
LG1 --> S3
LG1 --> KMS
style EC2 fill:#e1f5fe
style Lambda fill:#f3e5f5
style ECS fill:#fff3e0
style LG1 fill:#c8e6c9
style MF fill:#ffccbc
style S3 fill:#e8f5e9
See: diagrams/05_domain4_cloudwatch_logs_architecture.mmd
Diagram Explanation (detailed):
This diagram shows the complete CloudWatch Logs architecture from log generation to storage and analysis. At the top, we have multiple log sources: EC2 instances running the CloudWatch agent, Lambda functions with automatic logging, ECS containers using the awslogs log driver, RDS databases exporting slow query logs, VPC Flow Logs capturing network traffic, and CloudTrail recording API calls. Each source sends logs to CloudWatch Logs using the PutLogEvents API (or automatic integration for managed services).
In the middle layer, logs are organized into Log Groups (logical containers) like "/aws/ec2/webservers" for web server logs. Within each log group, individual Log Streams represent specific sources (like instance i-abc123). This hierarchical organization makes it easy to find and query related logs.
The processing layer shows three key capabilities: Metric Filters extract custom metrics from log patterns (like counting error messages), Subscription Filters stream logs in real-time to Kinesis or Lambda for processing, and Logs Insights provides SQL-like query capabilities for ad-hoc analysis.
Finally, the storage layer shows S3 export for long-term archival and KMS encryption for securing logs at rest. This architecture enables centralized logging, real-time processing, and long-term retention while maintaining security and compliance.
Detailed Example 1: Web Server Access Log Monitoring
Imagine you're running a fleet of 50 web servers behind an Application Load Balancer. Each server generates access logs showing every HTTP request. Here's how CloudWatch Logs handles this:
This setup gives you centralized visibility into all web server activity, automatic error detection, and powerful query capabilities - all without manually SSH-ing into servers.
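The automatic error detection piece is typically a metric filter on the access-log log group. A minimal boto3 sketch that counts HTTP 5xx responses (the log group name matches the diagram above; the space-delimited log format is an assumption):

import boto3

logs = boto3.client('logs')

logs.put_metric_filter(
    logGroupName='/aws/ec2/webservers',
    filterName='5xx-errors',
    # Space-delimited pattern: match any access-log line whose status code starts with 5
    filterPattern='[ip, id, user, timestamp, request, status=5*, size]',
    metricTransformations=[{
        'metricName': 'Http5xxCount',
        'metricNamespace': 'Custom/WebServers',
        'metricValue': '1',
        'defaultValue': 0,
    }],
)
# A CloudWatch alarm on Custom/WebServers Http5xxCount then pages the on-call engineer.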
Detailed Example 2: Lambda Function Error Tracking
You have 20 Lambda functions processing orders in an e-commerce application. Here's how CloudWatch Logs helps:
This approach provides automatic error tracking, real-time notifications, and compliance-friendly retention without any manual log management.
Detailed Example 3: Multi-Account Log Aggregation
Your organization has 50 AWS accounts (dev, test, prod for multiple teams). You need centralized logging:
This architecture provides organization-wide log visibility, centralized security monitoring, and cost-effective long-term storage.
ā Must Know (Critical Facts):
When to use (Comprehensive):
Limitations & Constraints:
š” Tips for Understanding:
ā ļø Common Mistakes & Misconceptions:
Mistake 1: Not setting retention policies and accumulating years of logs
Mistake 2: Creating one log stream per application instead of per instance
Mistake 3: Sending all logs to CloudWatch without filtering
š Connections to Other Topics:
Troubleshooting Common Issues:
Issue 1: Logs not appearing in CloudWatch
Issue 2: "ThrottlingException" errors
Issue 3: High CloudWatch Logs costs
What it is: CloudWatch Metrics is a time-series database that collects and stores numerical measurements (metrics) about your AWS resources and applications. Metrics represent data points over time, like CPU utilization, request count, or error rate.
Why it exists: You can't improve what you don't measure. Applications and infrastructure generate thousands of performance indicators, but without structured collection and visualization, this data is useless. CloudWatch Metrics provides a standardized way to collect, store, and analyze performance data across all AWS services.
Real-world analogy: Think of CloudWatch Metrics like a car's dashboard. The speedometer, fuel gauge, and temperature gauge are all metrics that help you understand your car's performance. CloudWatch Metrics is the dashboard for your AWS infrastructure, showing you CPU "speed", memory "fuel level", and error "temperature".
How it works (Detailed step-by-step):
š CloudWatch Metrics Architecture Diagram:
graph TB
subgraph "Metric Sources"
EC2M[EC2 Instances<br/>CPUUtilization<br/>NetworkIn/Out]
LambdaM[Lambda Functions<br/>Invocations<br/>Duration<br/>Errors]
ALBM[Application Load Balancer<br/>RequestCount<br/>TargetResponseTime]
RDSM[RDS Databases<br/>DatabaseConnections<br/>ReadLatency]
CustomM[Custom Application<br/>OrdersProcessed<br/>PaymentErrors]
end
subgraph "CloudWatch Metrics"
NS1[Namespace: AWS/EC2]
NS2[Namespace: AWS/Lambda]
NS3[Namespace: Custom/MyApp]
subgraph "Metric Details"
M1[Metric: CPUUtilization<br/>Dimensions: InstanceId<br/>Unit: Percent]
M2[Metric: Invocations<br/>Dimensions: FunctionName<br/>Unit: Count]
end
end
subgraph "Aggregation & Statistics"
AGG[Statistics:<br/>Average, Sum, Min, Max<br/>SampleCount, Percentiles]
PERIOD[Periods:<br/>1 min, 5 min, 1 hour]
end
subgraph "Visualization & Alerting"
DASH[CloudWatch Dashboards<br/>Graphs & Widgets]
ALARM[CloudWatch Alarms<br/>Threshold Evaluation]
INSIGHT[Metric Insights<br/>SQL Queries]
end
subgraph "Actions"
SNS[SNS Notifications]
ASG[Auto Scaling Actions]
LAMBDA[Lambda Functions]
end
EC2M -->|PutMetricData| NS1
LambdaM -->|Automatic| NS2
ALBM -->|Automatic| NS1
RDSM -->|Automatic| NS1
CustomM -->|PutMetricData| NS3
NS1 --> M1
NS2 --> M2
M1 --> AGG
M2 --> AGG
AGG --> PERIOD
PERIOD --> DASH
PERIOD --> ALARM
PERIOD --> INSIGHT
ALARM --> SNS
ALARM --> ASG
ALARM --> LAMBDA
style EC2M fill:#e1f5fe
style LambdaM fill:#f3e5f5
style NS1 fill:#c8e6c9
style ALARM fill:#ffccbc
style SNS fill:#fff3e0
See: diagrams/05_domain4_cloudwatch_metrics_architecture.mmd
Diagram Explanation (detailed):
This diagram illustrates the complete CloudWatch Metrics workflow from data collection to action. At the top, we have various metric sources: EC2 instances publishing CPU and network metrics, Lambda functions publishing invocation and error metrics, ALB publishing request metrics, RDS publishing database metrics, and custom applications publishing business metrics.
All metrics flow into CloudWatch Metrics and are organized by namespace (AWS/EC2, AWS/Lambda, Custom/MyApp). Within each namespace, individual metrics are identified by name and dimensions. For example, CPUUtilization in the AWS/EC2 namespace with dimension InstanceId=i-abc123 represents CPU usage for a specific instance.
The aggregation layer shows how CloudWatch calculates statistics (Average, Sum, Min, Max) over time periods (1 minute, 5 minutes, 1 hour). This aggregation is crucial because raw data points are too granular for analysis.
The visualization layer shows three ways to use metrics: Dashboards for visual monitoring, Alarms for threshold-based alerting, and Metric Insights for SQL-based analysis. When alarms trigger, they can send SNS notifications, trigger Auto Scaling actions, or invoke Lambda functions for automated remediation.
Detailed Example 1: EC2 Auto Scaling Based on Custom Metrics
You're running a video processing application on EC2 instances. CPU utilization isn't a good scaling metric here because work arrives as queued encoding jobs - the backlog of pending jobs, not instantaneous CPU, tells you how much capacity you actually need. Here's how you use custom metrics:
This approach provides application-aware scaling that's more efficient than CPU-based scaling, potentially saving 40-60% on compute costs.
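The scaling signal in this scenario is usually a custom backlog metric published by each instance (or by a small scheduled job). A minimal sketch; the namespace, metric name, and dimension are assumptions:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish the number of encoding jobs waiting on this instance
cloudwatch.put_metric_data(
    Namespace='Custom/VideoEncoder',
    MetricData=[{
        'MetricName': 'PendingJobsPerInstance',
        'Dimensions': [{'Name': 'AutoScalingGroupName', 'Value': 'encoder-asg'}],
        'Value': 12,
        'Unit': 'Count',
    }],
)
# A target tracking policy with a CustomizedMetricSpecification on this metric
# then scales the group to keep the backlog per instance near a chosen target.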
Detailed Example 2: Multi-Dimensional Metric Analysis
Your e-commerce application processes orders across multiple regions and payment methods. You need detailed insights:
This multi-dimensional approach provides deep insights into application behavior without creating hundreds of separate metrics.
Detailed Example 3: High-Resolution Metrics for Latency Monitoring
Your API needs to maintain 99th percentile latency under 100ms. Standard 1-minute metrics aren't granular enough:
This setup provides near-real-time latency monitoring and faster incident detection, crucial for maintaining SLAs.
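A hedged boto3 sketch of both halves: publishing latency as a 1-second high-resolution metric and alarming on the p99 over 10-second periods (the namespace, names, and SNS topic are assumptions):

import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish latency samples at 1-second resolution
cloudwatch.put_metric_data(
    Namespace='Custom/API',
    MetricData=[{
        'MetricName': 'RequestLatency',
        'Value': 87.0,
        'Unit': 'Milliseconds',
        'StorageResolution': 1,   # 1 = high resolution; 60 = standard resolution
    }],
)

# Alarm on the 99th percentile over 10-second periods (requires high-resolution data)
cloudwatch.put_metric_alarm(
    AlarmName='api-p99-latency',
    Namespace='Custom/API',
    MetricName='RequestLatency',
    ExtendedStatistic='p99',
    Period=10,
    EvaluationPeriods=3,
    Threshold=100,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:oncall'],  # placeholder topic
)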
ā Must Know (Critical Facts):
When to use (Comprehensive):
What it is: AWS X-Ray is a distributed tracing service that helps you analyze and debug distributed applications by tracking requests as they travel through multiple services.
Why it exists: Modern applications are built using microservices - dozens or hundreds of services working together. When a request fails or is slow, it's nearly impossible to know which service caused the problem without distributed tracing. X-Ray solves this by tracking requests end-to-end across all services.
Real-world analogy: Imagine tracking a package through the postal system. X-Ray is like the tracking number that shows you every step: picked up from sender, arrived at sorting facility, loaded on truck, out for delivery, delivered. Without it, you'd have no idea where delays occurred.
How it works:
ā Must Know:
What it is: CloudWatch Alarms monitor metrics and trigger actions when thresholds are breached.
Key Concepts:
Alarm Actions:
ā Must Know:
What it is: CloudWatch Logs Insights is a query language for analyzing log data using SQL-like syntax.
Query Capabilities:
Example Query:
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errorCount by bin(5m)
| sort errorCount desc
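The same query can be run programmatically, which is handy for scheduled reports. A minimal boto3 sketch (the log group name is an assumption):

import time
import boto3

logs = boto3.client('logs')

query_id = logs.start_query(
    logGroupName='/aws/ec2/webservers',
    startTime=int(time.time()) - 3600,   # last hour
    endTime=int(time.time()),
    queryString='fields @timestamp, @message '
                '| filter @message like /ERROR/ '
                '| stats count() as errorCount by bin(5m) '
                '| sort errorCount desc',
)['queryId']

# Poll until the query finishes, then print each result row
while True:
    resp = logs.get_query_results(queryId=query_id)
    if resp['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)
for row in resp['results']:
    print(row)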
ā Must Know:
Try these from your practice test bundles:
Next Chapter: Chapter 5 - Incident and Event Response (EventBridge, Systems Manager, troubleshooting)
Limitations & Constraints:
š” Tips for Understanding:
ā ļø Common Mistakes & Misconceptions:
Mistake 1: Publishing too many unique metric combinations (dimensions)
Mistake 2: Using Average statistic for everything
Mistake 3: Not using metric math to create derived metrics
š Connections to Other Topics:
Troubleshooting Common Issues:
Issue 1: Metrics not appearing in CloudWatch
Issue 2: Metrics delayed or missing data points
Issue 3: Unexpected CloudWatch costs
What it is: CloudWatch Alarms monitor metrics and trigger actions when thresholds are breached. Alarms have three states: OK (within threshold), ALARM (breached threshold), and INSUFFICIENT_DATA (not enough data to evaluate).
Why it exists: Humans can't watch dashboards 24/7. Alarms provide automated monitoring that detects issues immediately and triggers automated responses, reducing mean time to detection (MTTD) and mean time to resolution (MTTR).
Real-world analogy: CloudWatch Alarms are like smoke detectors in your home. They continuously monitor for danger (smoke/fire), and when detected, they trigger an alarm (sound) and can trigger automated actions (call fire department, activate sprinklers). You don't need to constantly check for fires - the alarm does it for you.
How it works (Detailed step-by-step):
š CloudWatch Alarms Architecture Diagram:
graph TB
subgraph "Metric Sources"
M1[EC2 CPUUtilization]
M2[ALB TargetResponseTime]
M3[Lambda Errors]
M4[Custom: QueueDepth]
end
subgraph "CloudWatch Alarms"
A1[Alarm: High CPU<br/>Threshold: > 80%<br/>Periods: 3 of 3]
A2[Alarm: High Latency<br/>Threshold: > 500ms<br/>Periods: 2 of 3]
A3[Alarm: Lambda Errors<br/>Threshold: > 10<br/>Periods: 1 of 1]
A4[Composite Alarm<br/>A1 AND A2]
end
subgraph "Alarm States"
OK[OK State<br/>Within Threshold]
ALARM[ALARM State<br/>Breached Threshold]
INSUF[INSUFFICIENT_DATA<br/>Not Enough Data]
end
subgraph "Actions"
SNS[SNS Topic<br/>Email/SMS/Lambda]
ASG[Auto Scaling<br/>Add Instances]
LAMBDA[Lambda Function<br/>Custom Remediation]
SSM[Systems Manager<br/>Run Automation]
EC2[EC2 Action<br/>Stop/Reboot/Terminate]
end
M1 --> A1
M2 --> A2
M3 --> A3
A1 --> A4
A2 --> A4
A1 --> OK
A1 --> ALARM
A1 --> INSUF
ALARM --> SNS
ALARM --> ASG
ALARM --> LAMBDA
ALARM --> SSM
ALARM --> EC2
style M1 fill:#e1f5fe
style A1 fill:#fff3e0
style ALARM fill:#ffccbc
style SNS fill:#c8e6c9
See: diagrams/05_domain4_cloudwatch_alarms_architecture.mmd
Diagram Explanation (detailed):
This diagram shows the complete CloudWatch Alarms workflow from metric monitoring to action execution. At the top, we have various metric sources: EC2 CPU utilization, ALB response time, Lambda errors, and custom application metrics like queue depth. Each metric feeds into one or more CloudWatch Alarms.
The alarms layer shows different alarm configurations. Alarm A1 monitors CPU and triggers when it exceeds 80% for 3 consecutive periods. Alarm A2 monitors latency and triggers when it exceeds 500ms for 2 out of 3 periods. Alarm A3 monitors Lambda errors with immediate triggering (1 of 1 period). The Composite Alarm (A4) combines A1 and A2 using AND logic, triggering only when both CPU is high AND latency is high.
Each alarm can be in one of three states: OK (metric within threshold), ALARM (threshold breached), or INSUFFICIENT_DATA (not enough data points to evaluate). State transitions trigger configured actions.
The actions layer shows five types of responses: SNS notifications (email, SMS, or Lambda), Auto Scaling actions (add/remove instances), Lambda functions (custom remediation), Systems Manager automation (run runbooks), and EC2 actions (stop, reboot, terminate, or recover instances). This architecture enables automated incident response without human intervention.
Detailed Example 1: Multi-Tier Application Monitoring
You're running a three-tier web application (web, app, database). Here's a comprehensive alarm strategy:
Web Tier Alarms:
Application Tier Alarms:
Database Tier Alarms:
Composite Alarms:
Cost Optimization:
This multi-layered approach provides comprehensive monitoring with automated responses at each tier, reducing MTTR from hours to minutes.
Detailed Example 2: Anomaly Detection for Variable Workloads
Your application has highly variable traffic patterns (10x difference between peak and off-peak). Static thresholds don't work:
Problem with Static Thresholds:
Anomaly Detection Solution:
Configuration:
Benefits:
Use Cases:
This approach provides intelligent monitoring that adapts to your application's behavior, dramatically reducing false positives.
Detailed Example 3: Automated Incident Response with Composite Alarms
You need sophisticated incident response that considers multiple signals before taking action:
Individual Alarms:
Composite Alarm Logic:
Tiered Response:
Benefits:
Implementation:
{
"AlarmName": "Critical-Application-Degradation",
"AlarmRule": "(ALARM(High-CPU) OR ALARM(High-Memory)) AND (ALARM(High-Latency) OR ALARM(High-Errors))",
"ActionsEnabled": true,
"AlarmActions": [
"arn:aws:sns:us-east-1:123456789012:pagerduty-critical",
"arn:aws:sns:us-east-1:123456789012:automated-remediation"
]
}
Note that composite alarms can only notify SNS topics (or create Systems Manager OpsItems/incidents) - they cannot invoke Auto Scaling or EC2 actions directly - so the scaling policy and the investigation Lambda are triggered by subscriptions on the automated-remediation topic.
This sophisticated approach provides context-aware incident response that considers multiple signals before taking action, reducing false positives and improving response quality.
ā Must Know (Critical Facts):
When to use (Comprehensive):
Limitations & Constraints:
š” Tips for Understanding:
ā ļø Common Mistakes & Misconceptions:
Mistake 1: Setting evaluation periods too short (1 period)
Mistake 2: Not configuring "Treat Missing Data"
Mistake 3: Creating separate alarms instead of composite alarms
š Connections to Other Topics:
Troubleshooting Common Issues:
Issue 1: Alarm not triggering despite metric breaching threshold
Issue 2: Too many false alarms
Issue 3: Alarm stuck in INSUFFICIENT_DATA
What you'll learn:
Time to complete: 6-8 hours
Prerequisites: Chapter 0 (Fundamentals), Chapter 2 (Configuration Management), Chapter 4 (Monitoring)
Domain Weight: 14% of exam (approximately 9 questions)
The problem: Traditional polling-based systems waste resources checking for changes that rarely happen. When events do occur, delays in detection lead to slow response times. Manual intervention for routine events is time-consuming and error-prone.
The solution: Amazon EventBridge enables event-driven architectures where systems react to events in real-time. Events trigger automated workflows, eliminating polling overhead and enabling instant response to changes.
Why it's tested: Event-driven architecture is fundamental to modern DevOps. The exam tests your ability to design event-driven workflows, integrate multiple event sources, and automate responses to operational events.
What it is: EventBridge is a serverless event bus service that connects applications using events. It receives events from AWS services, custom applications, and SaaS providers, then routes them to targets based on rules.
Why it exists: Applications need to communicate and react to changes, but tight coupling creates brittle systems. EventBridge provides loose coupling through event-driven communication, where producers don't need to know about consumers.
Real-world analogy: EventBridge is like a newspaper delivery system. Publishers (event sources) create newspapers (events) and give them to the delivery service (EventBridge). The delivery service routes newspapers to subscribers (targets) based on their interests (rules), without publishers knowing who the subscribers are.
How it works:
ā Must Know:
Detailed Example 1: Automated EC2 Instance Remediation
Your EC2 instances occasionally fail status checks due to network issues. Here's how EventBridge automates recovery:
This automation reduces mean time to recovery (MTTR) from hours to minutes and eliminates manual intervention for common issues.
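One common shape for this automation: a CloudWatch alarm on StatusCheckFailed publishes its state change to EventBridge, an EventBridge rule invokes a Lambda function, and the function runs the AWS-managed AWS-RestartEC2Instance runbook. A sketch of that function, assuming the alarm name embeds the instance ID:

import boto3

ssm = boto3.client('ssm')

def lambda_handler(event, context):
    # EventBridge 'CloudWatch Alarm State Change' events carry the alarm name
    alarm_name = event['detail']['alarmName']
    # Assumption: alarms are named like 'status-check-failed-i-0abc123def456'
    instance_id = alarm_name.split('status-check-failed-')[-1]

    # Run the AWS-managed runbook that stops and restarts the instance
    ssm.start_automation_execution(
        DocumentName='AWS-RestartEC2Instance',
        Parameters={'InstanceId': [instance_id]},
    )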
Detailed Example 2: Multi-Account Security Event Aggregation
You have 50 AWS accounts and need centralized security event monitoring:
This architecture provides real-time security monitoring across all accounts with automated response capabilities.
What it is: Systems Manager Automation executes predefined runbooks to perform common operational tasks like patching, backup, or incident response.
Key Capabilities:
ā Must Know:
CodePipeline Failures:
CloudFormation Failures:
Auto Scaling Failures:
ā Must Know:
Try these from your practice test bundles:
Next Chapter: Chapter 6 - Security and Compliance (IAM, encryption, security automation)
Detailed Example 3: Cross-Account Security Event Aggregation (Expanded)
Your organization has 50 AWS accounts across development, staging, and production environments. Security events need to be aggregated and responded to centrally:
Architecture Design:
Event Sources in Each Account:
Local EventBridge Rules (in each of 50 accounts):
{
"source": ["aws.guardduty", "aws.securityhub", "aws.config"],
"detail-type": ["GuardDuty Finding", "Security Hub Findings - Imported", "Config Rules Compliance Change"],
"detail": {
"severity": ["HIGH", "CRITICAL"]
}
}
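Forwarding rules like the one above are created in every member account and point at the central bus. A minimal boto3 sketch (account IDs, bus name, and the IAM role are placeholders; the central bus also needs a resource policy allowing PutEvents from the organization). Note that GuardDuty actually reports severity as a number, so a production pattern would match a numeric range rather than the string labels shown here:

import json
import boto3

events = boto3.client('events')

EVENT_PATTERN = {
    "source": ["aws.guardduty", "aws.securityhub", "aws.config"],
    "detail-type": ["GuardDuty Finding", "Security Hub Findings - Imported",
                    "Config Rules Compliance Change"],
    "detail": {"severity": ["HIGH", "CRITICAL"]},
}

events.put_rule(
    Name='forward-security-findings',
    EventPattern=json.dumps(EVENT_PATTERN),
    State='ENABLED',
)

events.put_targets(
    Rule='forward-security-findings',
    Targets=[{
        'Id': 'central-security-bus',
        'Arn': 'arn:aws:events:us-east-1:999999999999:event-bus/central-security',
        'RoleArn': 'arn:aws:iam::123456789012:role/EventBridgeCrossAccountRole',
    }],
)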
Central Security Account Event Bus:
Event Processing Rules (in central account):
Automated Response Lambda Function:
def lambda_handler(event, context):
    # EventBridge delivers the GuardDuty finding in event['detail']
    detail = event['detail']
    finding_type = detail['type']
    account_id = event['account']
    severity = detail['severity']

    # Pull the affected resource identifiers out of the finding
    resource = detail.get('resource', {})
    instance_id = resource.get('instanceDetails', {}).get('instanceId')
    access_key_id = resource.get('accessKeyDetails', {}).get('accessKeyId')

    if finding_type == 'UnauthorizedAccess:EC2/SSHBruteForce':
        # Isolate compromised instance (helper functions defined elsewhere in this module)
        isolate_instance(account_id, instance_id)
        # Revoke suspicious IAM credentials, if the finding includes an access key
        if access_key_id:
            revoke_credentials(account_id, access_key_id)
        # Create forensic snapshot
        create_snapshot(account_id, instance_id)

    # Notify security team
    send_alert(severity, finding_type, account_id)
Metrics and Dashboards:
Benefits:
Cost Optimization:
This architecture provides enterprise-grade security monitoring and automated response at minimal cost, demonstrating the power of event-driven architectures.
What it is: Systems Manager Automation documents (also called runbooks) are JSON or YAML documents that define a series of steps to perform operational tasks. Each step can execute AWS API calls, run scripts, invoke Lambda functions, or pause for approval.
Why it exists: Operational tasks like patching, backup, or incident response involve multiple steps across multiple services. Manual execution is error-prone, slow, and doesn't scale. Automation documents codify operational procedures, ensuring consistent execution every time.
Real-world analogy: Automation documents are like recipes in a cookbook. A recipe lists ingredients (parameters) and step-by-step instructions (actions) to create a dish (desired outcome). Anyone following the recipe gets consistent results, and you can share recipes with others. Similarly, automation documents ensure consistent operational procedures across teams.
How it works (Detailed step-by-step):
š Systems Manager Automation Architecture Diagram:
graph TB
subgraph "Trigger Sources"
EB[EventBridge Rule<br/>Scheduled or Event-driven]
MANUAL[Manual Execution<br/>Console or CLI]
LAMBDA[Lambda Function<br/>Programmatic]
CW[CloudWatch Alarm<br/>Metric-based]
end
subgraph "Automation Document"
PARAMS[Input Parameters<br/>InstanceId, AMI, Tags]
subgraph "Execution Steps"
S1[Step 1: Describe Instance<br/>API: ec2:DescribeInstances]
S2[Step 2: Create Snapshot<br/>API: ec2:CreateSnapshot]
S3[Step 3: Wait for Snapshot<br/>API: ec2:DescribeSnapshots]
S4[Step 4: Approval<br/>Pause for Human]
S5[Step 5: Terminate Instance<br/>API: ec2:TerminateInstances]
end
OUTPUTS[Outputs<br/>SnapshotId, Status]
end
subgraph "Execution Control"
RATE[Rate Control<br/>Concurrency Limits]
TARGETS[Target Selection<br/>Tags, Resource Groups]
APPROVAL[Approval Workflow<br/>SNS Notification]
end
subgraph "Monitoring"
CWLOGS[CloudWatch Logs<br/>Execution History]
METRICS[CloudWatch Metrics<br/>Success/Failure Rate]
OPSCENTER[OpsCenter<br/>OpsItems for Failures]
end
EB --> PARAMS
MANUAL --> PARAMS
LAMBDA --> PARAMS
CW --> PARAMS
PARAMS --> S1
S1 --> S2
S2 --> S3
S3 --> S4
S4 --> S5
S5 --> OUTPUTS
PARAMS --> RATE
PARAMS --> TARGETS
S4 --> APPROVAL
S1 --> CWLOGS
S2 --> CWLOGS
S5 --> METRICS
S5 --> OPSCENTER
style EB fill:#e1f5fe
style S1 fill:#fff3e0
style S4 fill:#ffccbc
style CWLOGS fill:#c8e6c9
See: diagrams/06_domain5_systems_manager_automation.mmd
Diagram Explanation (detailed):
This diagram illustrates the complete Systems Manager Automation workflow from trigger to completion. At the top, we have four trigger sources: EventBridge rules (scheduled or event-driven), manual execution (console or CLI), Lambda functions (programmatic), and CloudWatch alarms (metric-based). Any of these can initiate an automation execution.
The automation document section shows the structure of a runbook. It starts with input parameters (InstanceId, AMI, Tags) that customize the execution. The execution steps section shows a typical workflow: Step 1 describes the instance, Step 2 creates a snapshot, Step 3 waits for snapshot completion, Step 4 pauses for human approval, and Step 5 terminates the instance. Each step makes specific AWS API calls.
The execution control section shows three key capabilities: rate control (limit concurrent executions to avoid overwhelming systems), target selection (use tags or resource groups to select multiple targets), and approval workflow (send SNS notification and wait for approval).
The monitoring section shows how executions are tracked: CloudWatch Logs stores execution history, CloudWatch Metrics track success/failure rates, and OpsCenter creates OpsItems for failed executions. This comprehensive monitoring ensures visibility into all automation activities.
Detailed Example 1: Automated EC2 Incident Response
You receive a GuardDuty alert that an EC2 instance is compromised (cryptocurrency mining detected). Here's an automated response runbook:
Automation Document: "Isolate-Compromised-Instance"
schemaVersion: '0.3'
description: 'Isolate compromised EC2 instance and create forensic snapshot'
parameters:
  InstanceId:
    type: String
    description: 'ID of compromised instance'
  NotificationTopic:
    type: String
    description: 'SNS topic for notifications'
  # Assumption: the caller (e.g. the Lambda that processes the GuardDuty finding)
  # extracts these values from the finding when an access key is involved.
  CompromisedUserName:
    type: String
    description: 'IAM user whose access key appears in the finding (may be empty)'
    default: ''
  CompromisedAccessKeyId:
    type: String
    description: 'Access key ID reported in the finding (may be empty)'
    default: ''
mainSteps:
  - name: GetInstanceDetails
    action: 'aws:executeAwsApi'
    inputs:
      Service: ec2
      Api: DescribeInstances
      InstanceIds:
        - '{{ InstanceId }}'
    outputs:
      - Name: SubnetId
        Selector: '$.Reservations[0].Instances[0].SubnetId'
      - Name: VpcId
        Selector: '$.Reservations[0].Instances[0].VpcId'
      - Name: VolumeId
        # First attached EBS volume; adjust if the instance has multiple volumes
        Selector: '$.Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId'
      - Name: SecurityGroups
        Selector: '$.Reservations[0].Instances[0].SecurityGroups'
  - name: CreateForensicSnapshot
    action: 'aws:executeAwsApi'
    inputs:
      Service: ec2
      Api: CreateSnapshot
      VolumeId: '{{ GetInstanceDetails.VolumeId }}'
      Description: 'Forensic snapshot of compromised instance {{ InstanceId }}'
      TagSpecifications:
        - ResourceType: snapshot
          Tags:
            - Key: Purpose
              Value: Forensics
            - Key: IncidentId
              Value: '{{ automation:EXECUTION_ID }}'
    outputs:
      - Name: SnapshotId
        Selector: '$.SnapshotId'
  - name: CreateIsolationSecurityGroup
    action: 'aws:executeAwsApi'
    inputs:
      Service: ec2
      Api: CreateSecurityGroup
      GroupName: 'isolation-{{ automation:EXECUTION_ID }}'
      Description: 'Isolation security group - no inbound/outbound'
      VpcId: '{{ GetInstanceDetails.VpcId }}'
    outputs:
      - Name: GroupId
        Selector: '$.GroupId'
  - name: AttachIsolationSecurityGroup
    action: 'aws:executeAwsApi'
    inputs:
      Service: ec2
      Api: ModifyInstanceAttribute
      InstanceId: '{{ InstanceId }}'
      Groups:
        - '{{ CreateIsolationSecurityGroup.GroupId }}'
  - name: RevokeIAMCredentials
    action: 'aws:executeAwsApi'
    # Continue even if the finding did not include an access key
    onFailure: Continue
    inputs:
      Service: iam
      Api: DeleteAccessKey
      UserName: '{{ CompromisedUserName }}'
      AccessKeyId: '{{ CompromisedAccessKeyId }}'
  - name: SendNotification
    action: 'aws:publish'
    inputs:
      TopicArn: '{{ NotificationTopic }}'
      Message: |
        Compromised instance {{ InstanceId }} has been isolated.
        - Forensic snapshot: {{ CreateForensicSnapshot.SnapshotId }}
        - Isolation security group: {{ CreateIsolationSecurityGroup.GroupId }}
        - IAM credentials revoked
        - Instance is now isolated for investigation
  - name: ApprovalForTermination
    action: 'aws:approve'
    inputs:
      NotificationArn: '{{ NotificationTopic }}'
      Message: 'Approve termination of compromised instance {{ InstanceId }}?'
      MinRequiredApprovals: 1
      Approvers:
        - 'arn:aws:iam::123456789012:role/SecurityTeamRole'
  - name: TerminateInstance
    action: 'aws:executeAwsApi'
    inputs:
      Service: ec2
      Api: TerminateInstances
      InstanceIds:
        - '{{ InstanceId }}'
Execution Flow:
Benefits:
This automation transforms incident response from a manual, error-prone process taking hours into an automated, consistent process taking minutes.
What you'll learn:
Time to complete: 10-12 hours
Prerequisites: Chapter 0 (Fundamentals), Chapter 2 (Multi-Account Management)
Domain Weight: 17% of exam (approximately 11 questions)
The problem: Managing permissions for thousands of users, hundreds of applications, and dozens of AWS accounts is complex and error-prone. Overly permissive policies create security risks, while overly restrictive policies break applications. Manual permission management doesn't scale.
The solution: AWS IAM provides fine-grained access control with policies, roles, and identity federation. Combined with AWS Organizations and IAM Identity Center, you can manage permissions at scale across multiple accounts while maintaining security.
Why it's tested: IAM is the foundation of AWS security. The exam tests your ability to design least-privilege access, implement identity federation, manage cross-account access, and automate credential management.
What it is: IAM policies are JSON documents that define permissions (what actions are allowed on which resources). Roles are identities that can be assumed by users, applications, or services to obtain temporary credentials.
Why it exists: Hardcoded credentials are a security nightmare - they're leaked, shared, and never rotated. IAM roles provide temporary credentials that automatically expire, eliminating the need for long-term credentials in applications.
Real-world analogy: IAM roles are like hotel key cards. You check in (assume role), get a key card (temporary credentials) that works for your stay (session duration), and the card automatically stops working when you check out (credentials expire). You never get a permanent key that could be copied or lost.
How it works:
ā Must Know:
Detailed Example 1: Cross-Account CI/CD Pipeline
Your CI/CD pipeline in account A needs to deploy to production in account B:
This approach eliminates the need for long-term credentials and provides clear audit trail of cross-account access.
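At its core, the cross-account hop is one sts:AssumeRole call. A minimal boto3 sketch (account IDs and role names are placeholders):

import boto3

sts = boto3.client('sts')

# The pipeline role in account A assumes the deployment role in production account B
creds = sts.assume_role(
    RoleArn='arn:aws:iam::222222222222:role/ProdDeploymentRole',
    RoleSessionName='codepipeline-deploy',
    DurationSeconds=3600,
)['Credentials']

# Use the temporary credentials to act inside account B
cloudformation = boto3.client(
    'cloudformation',
    aws_access_key_id=creds['AccessKeyId'],
    aws_secret_access_key=creds['SecretAccessKey'],
    aws_session_token=creds['SessionToken'],
    region_name='us-east-1',
)
# The credentials expire automatically, and every AssumeRole call is logged in CloudTrail.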
Detailed Example 2: Attribute-Based Access Control (ABAC)
You have 50 development teams, each with their own resources. Traditional role-per-team doesn't scale:
This approach reduces policy management from 50 policies to 1 policy, dramatically simplifying IAM management.
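The "one policy for 50 teams" idea hinges on matching the caller's principal tag against the resource tag. A hedged sketch of such a policy created with boto3 (the EC2 actions shown are just one example; the policy name is an assumption):

import json
import boto3

iam = boto3.client('iam')

# A principal may act only on resources tagged with its own Team value
abac_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["ec2:StartInstances", "ec2:StopInstances", "ec2:TerminateInstances"],
        "Resource": "*",
        "Condition": {
            "StringEquals": {"ec2:ResourceTag/Team": "${aws:PrincipalTag/Team}"}
        }
    }]
}

iam.create_policy(
    PolicyName='abac-team-ec2-access',
    PolicyDocument=json.dumps(abac_policy),
)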
What it is: Security Hub aggregates security findings from multiple AWS services (GuardDuty, Inspector, Macie, IAM Access Analyzer) and third-party tools into a single dashboard.
Key Capabilities:
ā Must Know:
What it is: GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior using machine learning.
Detection Capabilities:
ā Must Know:
What it is: KMS is a managed service for creating and controlling encryption keys used to encrypt data.
Key Concepts:
ā Must Know:
Detailed Example: Multi-Account Encryption Strategy
Your organization has 50 accounts and needs centralized key management:
This centralized approach provides consistent encryption, simplified key management, and comprehensive audit trail.
What it is: Config continuously monitors and records AWS resource configurations and evaluates them against desired configurations.
Key Capabilities:
ā Must Know:
What it is: CloudTrail records all API calls made in your AWS account, providing audit trail for security analysis and compliance.
Key Capabilities:
ā Must Know:
Try these from your practice test bundles:
Next Chapter: Chapter 7 - Integration and Cross-Domain Scenarios
Detailed Example 3: Zero Trust IAM Architecture
Your organization is implementing zero trust security principles. Here's a comprehensive IAM strategy:
Principles:
Implementation:
Identity Foundation:
Service-to-Service Authentication:
Permission Boundaries:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:*",
"dynamodb:*",
"lambda:*"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"aws:RequestedRegion": ["us-east-1", "us-west-2"]
}
}
},
{
"Effect": "Deny",
"Action": [
"iam:*",
"organizations:*",
"account:*"
],
"Resource": "*"
}
]
}
Attribute-Based Access Control (ABAC):
Session Policies:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:*",
"Resource": "arn:aws:s3:::project-${aws:PrincipalTag/Project}/*"
}
]
}
Continuous Verification:
Credential Rotation:
Monitoring and Alerting:
Benefits:
This zero trust architecture provides defense in depth, assuming breach and verifying every request, dramatically reducing the impact of compromised credentials.
What it is: AWS KMS is a managed service that makes it easy to create and control cryptographic keys used to encrypt data. KMS uses Hardware Security Modules (HSMs) to protect keys and integrates with most AWS services for seamless encryption.
Why it exists: Encryption is critical for data security, but key management is complex and error-prone. Storing keys alongside encrypted data defeats the purpose. KMS provides secure key storage, automatic rotation, audit logging, and fine-grained access control, making encryption practical and secure.
Real-world analogy: KMS is like a bank's safe deposit box system. You don't store your valuables (data) and keys in the same place. The bank (KMS) stores your keys in a secure vault (HSM), and you need proper identification (IAM permissions) to access them. The bank keeps a log of every access (CloudTrail), and you can set rules for who can access your box (key policies).
How it works (Detailed step-by-step):
š KMS Envelope Encryption Diagram:
sequenceDiagram
participant App as Application
participant KMS as AWS KMS
participant HSM as Hardware Security Module
participant S3 as Amazon S3
Note over App,S3: Encryption Process
App->>KMS: GenerateDataKey(CMK-ID)
KMS->>HSM: Generate DEK
HSM-->>KMS: Plaintext DEK + Encrypted DEK
KMS-->>App: Plaintext DEK + Encrypted DEK
App->>App: Encrypt data with Plaintext DEK
App->>App: Discard Plaintext DEK from memory
App->>S3: Store Encrypted Data + Encrypted DEK
Note over App,S3: Decryption Process
App->>S3: Retrieve Encrypted Data + Encrypted DEK
S3-->>App: Encrypted Data + Encrypted DEK
App->>KMS: Decrypt(Encrypted DEK)
KMS->>HSM: Decrypt DEK with CMK
HSM-->>KMS: Plaintext DEK
KMS-->>App: Plaintext DEK
App->>App: Decrypt data with Plaintext DEK
App->>App: Discard Plaintext DEK from memory
See: diagrams/07_domain6_kms_envelope_encryption.mmd
Diagram Explanation (detailed):
This sequence diagram illustrates KMS envelope encryption, a two-layer encryption approach that's both secure and performant. The process is divided into encryption and decryption phases.
Encryption Phase: The application requests a data encryption key (DEK) from KMS by calling GenerateDataKey with the CMK ID. KMS instructs the HSM to generate a new DEK. The HSM returns both a plaintext DEK (for immediate use) and an encrypted DEK (encrypted with the CMK). KMS passes both to the application. The application uses the plaintext DEK to encrypt the data (fast symmetric encryption), then immediately discards the plaintext DEK from memory for security. Finally, the application stores both the encrypted data and the encrypted DEK in S3.
Decryption Phase: The application retrieves both the encrypted data and encrypted DEK from S3. It sends the encrypted DEK to KMS for decryption. KMS uses the HSM to decrypt the DEK with the CMK, returning the plaintext DEK to the application. The application uses the plaintext DEK to decrypt the data, then immediately discards the plaintext DEK from memory.
Why Envelope Encryption: This approach provides security (CMK never leaves HSM) and performance (data encrypted with fast symmetric encryption, not slow API calls). The CMK encrypts DEKs, and DEKs encrypt data - hence "envelope" encryption.
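A minimal client-side sketch of the flow in the diagram, assuming boto3, the third-party cryptography package, and a hypothetical key alias; in production the AWS Encryption SDK or a service's built-in SSE-KMS integration performs these steps for you:

```python
import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")
KEY_ID = "alias/confidential-data"  # hypothetical CMK alias

def encrypt(plaintext: bytes) -> dict:
    # 1. Ask KMS for a data key (DEK) generated under the CMK.
    data_key = kms.generate_data_key(KeyId=KEY_ID, KeySpec="AES_256")
    # 2. Encrypt locally with the plaintext DEK (fast symmetric crypto, no API call per byte).
    nonce = os.urandom(12)
    ciphertext = AESGCM(data_key["Plaintext"]).encrypt(nonce, plaintext, None)
    # 3. Keep only the encrypted DEK; the plaintext DEK goes out of scope here.
    return {
        "ciphertext": ciphertext,
        "nonce": nonce,
        "encrypted_dek": data_key["CiphertextBlob"],
    }

def decrypt(envelope: dict) -> bytes:
    # 4. Ask KMS to unwrap the DEK, then decrypt locally and discard the plaintext DEK.
    dek = kms.decrypt(CiphertextBlob=envelope["encrypted_dek"])["Plaintext"]
    return AESGCM(dek).decrypt(envelope["nonce"], envelope["ciphertext"], None)

envelope = encrypt(b"customer record")
assert decrypt(envelope) == b"customer record"
```

Note that the CMK itself is never returned by either call; only data keys leave KMS, which is the property the envelope model depends on.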
Detailed Example 1: Multi-Account Encryption Strategy
Your organization has 50 AWS accounts and needs centralized key management with granular access control:
Architecture:
Central Key Account (Account ID: 111111111111):
alias/public-data - Public information
alias/internal-data - Internal use only
alias/confidential-data - Confidential business data
alias/restricted-data - PII, PHI, financial data
Key Policies (example for confidential-data):
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Enable IAM policies",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::111111111111:root"
},
"Action": "kms:*",
"Resource": "*"
},
{
"Sid": "Allow production accounts to encrypt",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::222222222222:root",
"arn:aws:iam::333333333333:root"
]
},
"Action": [
"kms:Encrypt",
"kms:GenerateDataKey",
"kms:DescribeKey"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"kms:ViaService": [
"s3.us-east-1.amazonaws.com",
"rds.us-east-1.amazonaws.com"
]
}
}
},
{
"Sid": "Allow production accounts to decrypt",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::222222222222:root",
"arn:aws:iam::333333333333:root"
]
},
"Action": [
"kms:Decrypt",
"kms:DescribeKey"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"kms:ViaService": [
"s3.us-east-1.amazonaws.com",
"rds.us-east-1.amazonaws.com"
]
}
}
},
{
"Sid": "Deny development accounts",
"Effect": "Deny",
"Principal": {
"AWS": "*"
},
"Action": "kms:*",
"Resource": "*",
"Condition": {
"StringLike": {
"aws:PrincipalArn": "arn:aws:iam::*:role/dev-*"
}
}
}
]
}
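To illustrate the kms:ViaService condition above, a hedged sketch of a principal in production account 222222222222 writing an S3 object encrypted under the central key (bucket name and object key are placeholders; the caller's own IAM policy must also grant the matching KMS actions for cross-account use):

```python
import boto3

s3 = boto3.client("s3")  # credentials from a role in production account 222222222222

# Key ARN from the central key account example; cross-account references must use the full ARN.
CENTRAL_KEY_ARN = "arn:aws:kms:us-east-1:111111111111:key/12345678-1234-1234-1234-123456789012"

s3.put_object(
    Bucket="prod-confidential-reports",  # hypothetical bucket
    Key="2024/q3-summary.csv",
    Body=b"...",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId=CENTRAL_KEY_ARN,
)
```

Because S3 calls KMS on the caller's behalf, the request satisfies the kms:ViaService condition for s3.us-east-1.amazonaws.com; a direct kms:Encrypt call from the same role would be denied.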
Service Integration:
{
"Rules": [
{
"ApplyServerSideEncryptionByDefault": {
"SSEAlgorithm": "aws:kms",
"KMSMasterKeyID": "arn:aws:kms:us-east-1:111111111111:key/12345678-1234-1234-1234-123456789012"
},
"BucketKeyEnabled": true
}
]
}
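Applying that default-encryption configuration to a bucket could look like the following boto3 sketch (bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="prod-confidential-reports",  # hypothetical bucket name
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111111111111:key/12345678-1234-1234-1234-123456789012",
                },
                # S3 Bucket Keys reuse data keys per bucket, cutting KMS request costs.
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```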
Key Rotation:
Monitoring and Auditing:
Cost Optimization:
Disaster Recovery:
Benefits:
Cost Breakdown (for 50 accounts):
This centralized approach provides enterprise-grade encryption with minimal cost and operational overhead.
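For the Key Rotation and Monitoring and Auditing items above, a minimal sketch (run with credentials in the central key account; key ARN taken from the example) that enables automatic annual rotation and verifies it, something you could wrap in a scheduled compliance check:

```python
import boto3

kms = boto3.client("kms")
KEY_ARN = "arn:aws:kms:us-east-1:111111111111:key/12345678-1234-1234-1234-123456789012"

# Enable automatic annual rotation of the key material; the key ARN and ciphertexts stay valid.
kms.enable_key_rotation(KeyId=KEY_ARN)

# Verify rotation is on - a simple check suitable for a scheduled audit.
status = kms.get_key_rotation_status(KeyId=KEY_ARN)
print("Rotation enabled:", status["KeyRotationEnabled"])
```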
This chapter brings together concepts from all six domains to show how they work together in real-world scenarios. The exam frequently tests your ability to combine services and concepts across domains.
Business Requirement: Deploy a microservices application across multiple AWS accounts with automated testing, security scanning, and comprehensive monitoring.
Architecture Components:
Source Control (Domain 1):
Build & Test (Domain 1):
Infrastructure as Code (Domain 2):
Deployment (Domain 1 + 3):
Security (Domain 6):
Monitoring (Domain 4):
Incident Response (Domain 5):
Integration Points:
✅ Key Exam Concepts:
Business Requirement: Ensure application availability even if an entire AWS region fails, with automated failover and minimal data loss.
Architecture Components:
Primary Region (us-east-1):
Secondary Region (us-west-2):
Failover Automation:
Data Synchronization:
Monitoring & Testing:
✅ Key Exam Concepts:
Business Requirement: Maintain PCI-DSS compliance across 50 AWS accounts with automated detection and remediation of non-compliant resources.
Architecture Components:
Multi-Account Structure (Domain 2):
Compliance Monitoring (Domain 6):
Automated Remediation (Domain 5):
Audit and Reporting (Domain 4 + 6):
Encryption (Domain 6):
✅ Key Exam Concepts:
Try these from your practice test bundles:
Next Chapter: Chapter 8 - Study Strategies and Test-Taking Techniques
Pass 1: Understanding (Weeks 1-6)
Pass 2: Application (Weeks 7-8)
Pass 3: Reinforcement (Weeks 9-10)
1. Teach Someone
2. Draw Diagrams
3. Write Scenarios
4. Compare Options
Mnemonics for CI/CD Pipeline Stages:
IAM Policy Evaluation:
High Availability Patterns:
Security Services:
Total Time: 180 minutes (3 hours)
Total Questions: 75 (65 scored + 10 unscored)
Time per Question: ~2.4 minutes
Strategy:
Time Allocation by Question Type:
Step 1: Read the Scenario (30 seconds)
Step 2: Identify Constraints (15 seconds)
Step 3: Eliminate Wrong Answers (30 seconds)
Step 4: Choose Best Answer (45 seconds)
When Stuck:
Common Traps:
Cost Keywords:
Performance Keywords:
Operational Keywords:
Security Keywords:
Beginner Tests (Weeks 1-4):
Intermediate Tests (Weeks 5-7):
Advanced Tests (Weeks 8-9):
Full Practice Tests (Week 10):
For each wrong answer:
Understand WHY you got it wrong
Review related chapter section
Create flashcard or note
Test yourself again
Day 7: Full Practice Test 1
Day 6: Review Weak Domains
Day 5: Full Practice Test 2
Day 4: Focused Review
Day 3: Full Practice Test 3
Day 2: Light Review
Day 1: Rest and Relax
3 Hours Before Exam:
At Test Center:
First 5 Minutes:
Throughout Exam:
Last 30 Minutes:
When the exam starts, immediately write down your memorized mnemonics, key formulas, and decision keywords on scratch paper.
This frees up mental space and reduces anxiety.
Good luck on your exam!
Next Chapter: Chapter 9 - Final Week Checklist
Go through this comprehensive checklist. If you can confidently answer "Yes" to 80%+ of items, you're ready.
CI/CD Pipelines:
Automated Testing:
Artifact Management:
Deployment Strategies:
Infrastructure as Code:
Multi-Account Automation:
Automation Solutions:
High Availability:
Scalability:
Disaster Recovery:
CloudWatch:
X-Ray:
Log Aggregation:
EventBridge:
Systems Manager:
Troubleshooting:
IAM:
Encryption:
Security Services:
Network Security:
Day 7 (Sunday): Full Practice Test 1
Day 6 (Monday): Review and Study
Day 5 (Tuesday): Full Practice Test 2
Day 4 (Wednesday): Focused Review
Day 3 (Thursday): Full Practice Test 3
Day 2 (Friday): Light Review
Day 1 (Saturday): Rest
Required:
Optional (check test center policies):
Not Allowed:
3 Hours Before Exam:
1 Hour Before Exam:
At Test Center:
First 90 Minutes:
Next 60 Minutes:
Final 30 Minutes:
For Each Question:
Common Mistakes to Avoid:
Knowledge:
Skills:
Mindset:
Celebrate!
Next Steps:
Don't Give Up!
Improvement Plan:
Good luck! You're ready to pass the AWS Certified DevOps Engineer - Professional exam!
Next: Appendices (Quick reference tables, glossary, resources)
| Service | Purpose | When to Use | Key Features |
|---|---|---|---|
| CodePipeline | Orchestration | Multi-stage workflows | Source, Build, Test, Deploy stages |
| CodeBuild | Build & Test | Compile code, run tests | Custom build environments, Docker support |
| CodeDeploy | Deployment | Deploy to EC2, Lambda, ECS | Blue/green, canary, rolling deployments |
| CodeArtifact | Artifact Repository | Store packages | npm, Maven, PyPI support |
| CodeCommit | Source Control | Git repositories | Fully managed, integrated with AWS |
| Tool | Language | Best For | Learning Curve |
|---|---|---|---|
| CloudFormation | JSON/YAML | AWS-native, declarative | Medium |
| CDK | TypeScript, Python, Java | Programmatic, reusable constructs | High |
| SAM | YAML | Serverless applications | Low |
| Terraform | HCL | Multi-cloud | Medium |
| Strategy | Downtime | Rollback Speed | Cost | Use Case |
|---|---|---|---|---|
| In-Place | Yes | Slow | Low | Non-critical apps |
| Blue/Green | No | Fast | High | Production apps |
| Canary | No | Fast | Medium | Risk mitigation |
| Rolling | Partial | Medium | Low | Gradual updates |
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup/Restore | Hours | Hours | Low | Low |
| Pilot Light | 10-30 min | Minutes | Medium | Medium |
| Warm Standby | Minutes | Seconds | High | Medium |
| Active-Active | Seconds | None | Very High | High |
| Service | Purpose | Data Type | Retention | Cost |
|---|---|---|---|---|
| CloudWatch Logs | Log storage | Text logs | Configurable | $0.50/GB ingested + $0.03/GB-month stored |
| CloudWatch Metrics | Metrics | Time-series | 15 months | $0.10/metric |
| X-Ray | Tracing | Traces | 30 days | $5/million traces |
| CloudTrail | Audit logs | API calls | 90 days (console) | $2/100K events |
RTO (Recovery Time Objective): Maximum acceptable downtime
RPO (Recovery Point Objective): Maximum acceptable data loss
Choosing DR Strategy:
CloudWatch Logs Cost:
Lambda Cost:
Target Tracking:
Step Scaling:
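To contrast the Target Tracking and Step Scaling items above, a boto3 sketch of both policy types on a hypothetical Auto Scaling group (the step policy still needs a CloudWatch alarm attached to it before it does anything):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target tracking: keep average CPU near 50%; AWS creates and manages the alarms for you.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",  # hypothetical ASG name
    PolicyName="keep-cpu-at-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)

# Step scaling: you define explicit capacity adjustments per alarm-breach range yourself.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="scale-out-on-cpu-steps",
    PolicyType="StepScaling",
    AdjustmentType="ChangeInCapacity",
    StepAdjustments=[
        {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 20, "ScalingAdjustment": 1},
        {"MetricIntervalLowerBound": 20, "ScalingAdjustment": 3},
    ],
)
```

The exam pattern to remember: target tracking is the lower-operational-overhead choice because it manages the alarms; step scaling gives finer control at the cost of maintaining alarms and thresholds yourself.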
ABAC (Attribute-Based Access Control): Access control based on attributes (tags) rather than explicit permissions
AMI (Amazon Machine Image): Pre-configured virtual machine image used to launch EC2 instances
API Gateway: Managed service for creating, publishing, and managing APIs
ASG (Auto Scaling Group): Collection of EC2 instances managed as a group for scaling
Blue/Green Deployment: Deployment strategy with two identical environments (blue=current, green=new)
Buildspec: YAML file defining build commands and settings for CodeBuild
Canary Deployment: Gradual deployment strategy releasing to small percentage of users first
CDK (Cloud Development Kit): Framework for defining cloud infrastructure using programming languages
CloudFormation: Infrastructure as Code service using JSON/YAML templates
CMK (Customer Master Key): Encryption key in KMS used to encrypt data keys
DLQ (Dead Letter Queue): Queue for messages that failed processing
Drift Detection: Identifying resources that have been modified outside of CloudFormation
ECS (Elastic Container Service): Container orchestration service
EKS (Elastic Kubernetes Service): Managed Kubernetes service
EventBridge: Serverless event bus for event-driven architectures
Fargate: Serverless compute engine for containers
GuardDuty: Threat detection service using machine learning
IAM (Identity and Access Management): Service for managing access to AWS resources
IaC (Infrastructure as Code): Managing infrastructure through code rather than manual processes
KMS (Key Management Service): Managed service for encryption keys
Lambda: Serverless compute service for running code without managing servers
Multi-AZ: Deploying resources across multiple Availability Zones for high availability
Multi-Region: Deploying resources across multiple AWS regions for disaster recovery
OpsCenter: Systems Manager capability for managing operational issues
Parameter Store: Secure storage for configuration data and secrets
Pilot Light: DR strategy with minimal resources running, ready to scale up
RTO (Recovery Time Objective): Maximum acceptable downtime
RPO (Recovery Point Objective): Maximum acceptable data loss
Runbook: Automated workflow for operational tasks
SAM (Serverless Application Model): Framework for building serverless applications
SCP (Service Control Policy): Policy in AWS Organizations that sets permission guardrails
StackSet: CloudFormation feature for deploying stacks across multiple accounts/regions
STS (Security Token Service): Service for requesting temporary credentials
Target Tracking: Auto Scaling policy that maintains a target metric value
VPC (Virtual Private Cloud): Isolated virtual network in AWS
Warm Standby: DR strategy with scaled-down version of production running
X-Ray: Distributed tracing service for analyzing application performance
Documentation:
Training:
Exam Resources:
Forums and Communities:
Practice:
Books:
Blogs:
Pattern 1: "Most cost-effective solution"
Pattern 2: "Least operational overhead"
Pattern 3: "Highest availability"
Pattern 4: "Secure solution"
Pattern 5: "Troubleshooting failure"
Congratulations on completing this comprehensive study guide. You've covered all six DOP-C02 domains: SDLC automation, configuration management and IaC, resilient cloud solutions, monitoring and logging, incident and event response, and security and compliance.
Good luck on your AWS Certified DevOps Engineer - Professional exam!
End of Study Guide