AZ-305: Designing Microsoft Azure Infrastructure Solutions - Comprehensive Study Guide

Complete Learning Path for Certification Success

Overview

This study guide provides a structured learning path from fundamentals to exam readiness for the AZ-305: Designing Microsoft Azure Infrastructure Solutions certification. Designed for complete novices and those transitioning to cloud architecture, it teaches all concepts progressively while focusing exclusively on exam-relevant content. Extensive diagrams and visual aids are integrated throughout to enhance understanding and retention.

What is AZ-305?

The AZ-305 certification validates your expertise in designing cloud and hybrid solutions that run on Microsoft Azure. As an Azure Solutions Architect Expert, you'll demonstrate advanced skills in:

  • Identity, Governance, and Monitoring - Securing and managing Azure environments
  • Data Storage Solutions - Designing scalable, reliable data architectures
  • Business Continuity - Ensuring high availability and disaster recovery
  • Infrastructure Solutions - Architecting compute, networking, and application solutions

Target Audience: Solution architects, cloud engineers, and IT professionals designing Azure infrastructure solutions.

Prerequisites: One of the following associate-level certifications:

  • Azure Administrator Associate (AZ-104)
  • Azure Developer Associate (AZ-204)

Section Organization

Study Sections (in order):

  • Overview (this section) - How to use the guide and study plan
  • 01_fundamentals - Section 0: Essential Azure architecture fundamentals and Well-Architected Framework
  • 02_domain_1_identity_governance_monitoring - Section 1: Identity, Governance, and Monitoring Solutions (25-30% of exam)
  • 03_domain_2_data_storage - Section 2: Data Storage Solutions (20-25% of exam)
  • 04_domain_3_business_continuity - Section 3: Business Continuity Solutions (15-20% of exam)
  • 05_domain_4_infrastructure - Section 4: Infrastructure Solutions (30-35% of exam)
  • 06_integration - Integration & cross-domain scenarios
  • 07_study_strategies - Study techniques & test-taking strategies
  • 08_final_checklist - Final week preparation checklist
  • 99_appendices - Quick reference tables, glossary, resources
  • diagrams/ - Folder containing all Mermaid diagram files (.mmd)

Study Plan Overview

Total Time: 6-10 weeks at 2-3 hours daily (the full plan below spans 10 weeks; experienced readers can compress the early weeks)

  • Week 1-2: Fundamentals & Well-Architected Framework + Domain 1 (Identity, Governance, Monitoring)

    • Files: 01_fundamentals, 02_domain_1_identity_governance_monitoring
    • Focus: Azure architecture principles, Microsoft Entra ID, RBAC, governance
  • Week 3-4: Domain 2 (Data Storage Solutions)

    • File: 03_domain_2_data_storage
    • Focus: SQL databases, Cosmos DB, storage accounts, data integration
  • Week 5-6: Domains 3-4 (Business Continuity & Infrastructure)

    • Files: 04_domain_3_business_continuity, 05_domain_4_infrastructure
    • Focus: Backup/DR, high availability, compute, networking, migrations
  • Week 7-8: Integration & Cross-domain scenarios

    • File: 06_integration
    • Focus: Complex architectures combining multiple domains
  • Week 9: Practice & Review

    • Use practice test bundles in ``
    • Target: 70%+ on practice tests
  • Week 10: Final Prep

    • Files: 07_study_strategies, 08_final_checklist
    • Final review, test strategies, mental preparation

Learning Approach

  1. Read: Study each section thoroughly with focus on understanding WHY and HOW
  2. Visualize: Study all diagrams - they are essential for understanding architecture patterns
  3. Highlight: Mark ⭐ items as must-know concepts
  4. Practice: Complete exercises after each section
  5. Test: Use practice questions to validate understanding (aim for 80%+)
  6. Review: Revisit marked sections and weak areas

Progress Tracking

Use checkboxes to track completion:

Week 1-2: Fundamentals & Identity/Governance/Monitoring

  • 01_fundamentals completed
  • Chapter exercises done
  • 02_domain_1_identity_governance_monitoring completed
  • Domain 1 practice questions passed (80%+)
  • Self-assessment checklist completed

Week 3-4: Data Storage

  • 03_domain_2_data_storage completed
  • Chapter exercises done
  • Domain 2 practice questions passed (80%+)
  • Self-assessment checklist completed

Week 5-6: Business Continuity & Infrastructure

  • 04_domain_3_business_continuity completed
  • 05_domain_4_infrastructure completed
  • Chapter exercises done
  • Domains 3-4 practice questions passed (80%+)
  • Self-assessment checklists completed

Week 7-8: Integration

  • 06_integration completed
  • Cross-domain scenarios practiced
  • Full practice test passed (75%+)

Week 9: Practice

  • Practice Test Bundle 1 (target: 70%+)
  • Review mistakes and weak areas
  • Practice Test Bundle 2 (target: 75%+)
  • Practice Test Bundle 3 (target: 80%+)

Week 10: Final Prep

  • 07_study_strategies reviewed
  • 08_final_checklist completed
  • Cheat sheet memorized
  • Ready for exam!

Legend

  • ⭐ Must Know: Critical for exam - memorize this
  • 💡 Tip: Helpful insight or shortcut
  • ⚠️ Warning: Common mistake to avoid
  • 🔗 Connection: Related to other topics
  • 📝 Practice: Hands-on exercise
  • 🎯 Exam Focus: Frequently tested concept
  • 📊 Diagram: Visual representation available

How to Navigate

  1. Sequential Study: Go through files in order (01 → 02 → 03... → 99)

    • Each file builds on previous chapters
    • Don't skip fundamentals even if experienced
  2. Self-Contained Chapters: Each domain chapter is comprehensive

    • Can be studied independently after fundamentals
    • Cross-references guide you to related topics
  3. Quick Reference: Use 99_appendices during study

    • Service comparison tables
    • Decision frameworks
    • Glossary of terms
  4. Final Week: Return to 08_final_checklist

    • Knowledge audit
    • Practice test marathon
    • Exam day preparation

Exam Details

Exam Information:

  • Passing Score: 700 (out of 1000)
  • Duration: 120 minutes (150 minutes for non-native English speakers)
  • Question Format:
    • Case studies (scenario-based questions)
    • Multiple choice
    • Drag-and-drop
    • Hot area (select regions on image)
  • Number of Questions: 40-60 questions
  • Cost: $165 USD (varies by region)

Skills Measured:

  1. Design Identity, Governance, and Monitoring Solutions (25-30%)
  2. Design Data Storage Solutions (20-25%)
  3. Design Business Continuity Solutions (15-20%)
  4. Design Infrastructure Solutions (30-35%)

What Makes This Guide Different

Comprehensive for Novices:

  • Assumes minimal prior Azure knowledge (but requires AZ-104 or AZ-204)
  • Explains WHY services exist and HOW they work
  • Multiple detailed examples for every concept (3+ examples per topic)
  • Real-world analogies for complex concepts

Self-Sufficient Learning:

  • No need for external resources - everything explained here
  • 120-200 visual diagrams with detailed explanations
  • Every diagram has 200-400 words of explanation
  • Covers ALL exam objectives comprehensively

Exam-Focused:

  • Based on official Microsoft exam guide
  • Includes insights from 900+ practice questions
  • Decision frameworks for architecture choices
  • Common traps and how to avoid them

Visual Learning Priority:

  • Every complex concept has multiple diagrams
  • Architecture diagrams for all design patterns
  • Sequence diagrams for all processes
  • Decision trees for service selection

Study Tips

Active Learning:

  1. Don't just read - draw your own diagrams
  2. Explain concepts - teach someone or explain out loud
  3. Build scenarios - create your own architecture problems
  4. Compare options - understand tradeoffs between services

Effective Memorization:

  1. Use the diagrams - visual memory is powerful
  2. Create mnemonics - for lists and decision criteria
  3. Practice recall - test yourself without looking
  4. Spaced repetition - review material multiple times at spaced intervals

Avoid Common Mistakes:

  1. Don't skip fundamentals - they're the foundation for everything
  2. Don't just memorize - understand the reasoning
  3. Don't ignore diagrams - they're 50% of learning
  4. Don't cram - consistent daily study is better

Support Resources

Official Microsoft Resources:

  • Microsoft Learn - free AZ-305 learning paths and the official study guide
  • Azure Architecture Center - reference architectures and design best practices

Practice Materials (included):

  • Practice test bundles in ``
  • Cheat sheets in ``

Community:

  • Microsoft Tech Community
  • Azure Architecture Discord/Slack channels
  • Reddit: r/AzureCertification

How This Guide Was Built

This comprehensive study guide was created by:

  1. Analyzing 900+ Practice Questions: Identified frequently tested concepts and common patterns
  2. Mapping Learning Dependencies: Built a logical progression from basics to advanced
  3. Verifying with Official Docs: Used Microsoft Docs MCP to ensure accuracy
  4. Creating Visual Aids: Generated 120-200 diagrams for visual learning
  5. Adding Real-World Context: Included practical scenarios and decision frameworks

Ready to Begin?

Start with Fundamentals to build your foundation in Azure architecture principles and the Well-Architected Framework. This foundation is critical for everything that follows.

Remember:

  • Quality over speed - understand deeply
  • Practice consistently - 2-3 hours daily
  • Use visual aids - diagrams are your friends
  • Test regularly - practice questions reveal gaps

You can do this! With dedication and the right approach, you'll master Azure architecture and pass AZ-305.


Last Updated: October 2025
Based on exam skills measured as of October 18, 2024


Chapter 0: Essential Azure Architecture Fundamentals

What You Need to Know First

The AZ-305 certification assumes you have completed either AZ-104 (Azure Administrator) or AZ-204 (Azure Developer) and understand:

  • Basic Azure concepts - Resources, resource groups, subscriptions
  • Azure Portal navigation - Creating and managing resources
  • Core Azure services - VMs, Storage, Networking basics
  • Identity basics - Microsoft Entra ID (formerly Azure AD), users, groups
  • Basic ARM templates or Bicep - Infrastructure as Code fundamentals

If you're missing any: Review your AZ-104 or AZ-204 materials before proceeding. This guide builds on that foundation.

Introduction: What is Azure Architecture?

What it is: Azure architecture is the design and structure of how cloud services, resources, and components are organized and interconnected to deliver business solutions. It's like being the architect of a building - you don't just pile bricks randomly; you create blueprints that ensure the building is stable, secure, efficient, and serves its purpose.

Why it matters for AZ-305: As a Solutions Architect, you're not implementing solutions (that's the administrator's job) - you're DESIGNING them. You must make high-level decisions about which services to use, how they connect, how data flows, security boundaries, cost optimization, and disaster recovery strategies.

Real-world analogy: Think of designing a shopping mall:

  • Architect (you): Designs the layout, decides where stores go, plans emergency exits, ensures structural integrity
  • Construction crew (administrators/developers): Builds according to your plans
  • Shoppers (end users): Use the finished product

As an Azure Solutions Architect, you create the "blueprint" that others will build and users will consume.

Core Concepts Foundation

The Azure Well-Architected Framework

What it is: The Azure Well-Architected Framework is a set of five guiding principles (pillars) that help you design reliable, secure, efficient, and cost-effective cloud solutions. It's Microsoft's official design philosophy for Azure workloads.

Why it exists: Without a framework, architects might focus only on functionality and ignore security, or optimize for cost while sacrificing reliability. The Well-Architected Framework ensures you consider ALL critical aspects when designing solutions. It prevents costly redesigns and security breaches by incorporating best practices from the start.

Real-world analogy: Building a house - you wouldn't just focus on making it look good (performance) while ignoring the foundation (reliability), locks on doors (security), energy efficiency (cost), or ease of maintenance (operational excellence). You need to balance all aspects.

The Five Pillars:

  1. Reliability: Ensures your workload can recover from failures and continue functioning
  2. Security: Protects your applications and data from threats
  3. Cost Optimization: Maximizes value while minimizing unnecessary expenses
  4. Operational Excellence: Enables efficient operations and continuous improvement
  5. Performance Efficiency: Uses resources efficiently to meet requirements

How it works (The Design Process):

  1. Assess current state: Understand business requirements, constraints, and existing architecture
  2. Apply pillar principles: For each pillar, apply specific design principles and best practices
  3. Make tradeoff decisions: Balance conflicting requirements (e.g., security vs. cost)
  4. Document design: Create architecture diagrams, decision records, and deployment plans
  5. Review and iterate: Continuously assess and improve the architecture

📊 Well-Architected Framework Overview Diagram:

graph TB
    subgraph "Azure Well-Architected Framework"
        WAF[Well-Architected<br/>Framework]

        WAF --> REL[Reliability<br/>🔄]
        WAF --> SEC[Security<br/>🔒]
        WAF --> COST[Cost Optimization<br/>💰]
        WAF --> OPS[Operational Excellence<br/>⚙️]
        WAF --> PERF[Performance Efficiency<br/>⚡]
    end

    subgraph "Reliability Pillar"
        REL --> REL1[Resiliency<br/>Handle failures gracefully]
        REL --> REL2[Availability<br/>Minimize downtime]
        REL --> REL3[Recovery<br/>Restore from disasters]
    end

    subgraph "Security Pillar"
        SEC --> SEC1[Confidentiality<br/>Protect data privacy]
        SEC --> SEC2[Integrity<br/>Prevent tampering]
        SEC --> SEC3[Availability<br/>Prevent DoS]
    end

    subgraph "Cost Optimization"
        COST --> COST1[Plan & Estimate<br/>Budget appropriately]
        COST --> COST2[Monitor & Optimize<br/>Reduce waste]
        COST --> COST3[Right-size<br/>Match capacity to demand]
    end

    subgraph "Operational Excellence"
        OPS --> OPS1[DevOps Practices<br/>Automate operations]
        OPS --> OPS2[Monitoring<br/>Observe system health]
        OPS --> OPS3[Safe Deployments<br/>Minimize risk]
    end

    subgraph "Performance Efficiency"
        PERF --> PERF1[Scale<br/>Grow with demand]
        PERF --> PERF2[Optimize<br/>Improve efficiency]
        PERF --> PERF3[Test<br/>Validate performance]
    end

    style WAF fill:#e1f5fe
    style REL fill:#fff3e0
    style SEC fill:#f3e5f5
    style COST fill:#e8f5e9
    style OPS fill:#fce4ec
    style PERF fill:#f3e5f5

See: diagrams/01_fundamentals_well_architected_framework.mmd

Diagram Explanation (Understanding the Framework):

The central Well-Architected Framework connects to five pillars, each representing a critical design consideration. These pillars are NOT independent - they interact and sometimes conflict, requiring you to make tradeoff decisions.

Reliability Pillar (top left): Focuses on keeping systems running despite failures. Resiliency ensures graceful degradation when components fail (like having backup power in a hospital). Availability minimizes planned and unplanned downtime (like a 24/7 convenience store). Recovery enables restoration after major disasters (like having fire insurance and rebuild plans).

Security Pillar (top right): Protects against threats through defense-in-depth. Confidentiality prevents unauthorized data access (like medical records). Integrity ensures data isn't tampered with (like sealed evidence). Availability (from security perspective) prevents denial-of-service attacks that make systems unusable.

Cost Optimization (center left): Ensures you don't overspend. Planning involves budgeting and cost estimation before building. Monitoring tracks actual spend and identifies waste. Right-sizing matches resources to actual needs (don't rent a warehouse when you need a closet).

Operational Excellence (center right): Streamlines day-to-day operations. DevOps practices automate repetitive tasks (like automatic backups). Monitoring provides visibility into system health (like a car dashboard). Safe deployments minimize risk when releasing changes (like testing parachutes before jumping).

Performance Efficiency (bottom): Ensures systems perform well. Scaling allows growth as demand increases (like adding lanes to a highway). Optimization improves efficiency of existing resources (like tuning an engine). Testing validates performance meets requirements (like crash testing cars).

Must Know (Critical Facts):

  • All five pillars are equally important - neglecting one creates risk
  • Tradeoffs are necessary - improving one pillar often negatively impacts another (e.g., better security might increase cost)
  • The framework is a guide, not a checklist - apply principles thoughtfully based on your specific context
  • AZ-305 exam frequently tests - understanding these tradeoffs and when to prioritize each pillar
  • Document your tradeoff decisions - explain WHY you chose to prioritize certain pillars over others

💡 Tips for Understanding:

  • Think "balance" - like balancing a budget, you optimize across competing goals
  • Use decision matrices - list requirements, score each option against all five pillars
  • Consider failure scenarios - for reliability, always ask "what if this fails?"
  • Calculate TCO (Total Cost of Ownership) - not just Azure costs, but operational costs too

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Focusing only on cost optimization and ignoring security/reliability
    • Why it's wrong: Cheap solutions that get breached or fail cost MORE in the long run
    • Correct understanding: Find the right balance - sometimes spending more upfront saves money later
  • Mistake 2: Thinking the pillars are independent
    • Why it's wrong: Decisions impact multiple pillars simultaneously
    • Correct understanding: Every architectural decision creates tradeoffs across pillars
  • Mistake 3: Applying the same pattern to every workload
    • Why it's wrong: A financial trading system needs different priorities than a blog
    • Correct understanding: Adjust pillar priorities based on business requirements

Azure Resource Hierarchy

What it is: Azure's resource hierarchy is a multi-level organizational structure for managing cloud resources. It consists of four levels: Management Groups → Subscriptions → Resource Groups → Resources. This hierarchy determines how access control, policies, and billing are applied and inherited.

Why it exists: Without hierarchy, managing thousands of resources across multiple teams and departments would be chaos. Imagine a large corporation trying to manage security and billing without any organizational structure - it would be impossible to ensure compliance or track costs effectively.

The hierarchy solves several critical problems:

  1. Governance at scale - Apply policies once at high levels instead of individually to thousands of resources
  2. Security boundaries - Separate production from development, or one department from another
  3. Cost management - Track and allocate costs by department, project, or environment
  4. Delegation - Give teams autonomy within their boundaries while maintaining enterprise control

Real-world analogy: Think of a large corporation's organizational chart:

  • Management Groups = Corporate divisions (e.g., North America Division, Europe Division)
  • Subscriptions = Departments within divisions (e.g., Finance Department, IT Department)
  • Resource Groups = Projects or teams within departments (e.g., ERP Implementation Project)
  • Resources = Individual assets or tools (e.g., specific servers, databases)

Just as corporate policies flow from the top down (all divisions must follow compliance rules), Azure policies and access controls cascade through the hierarchy.

How it works (Detailed step-by-step):

  1. Start with a Microsoft Entra Tenant: Your organization's identity directory (like the company headquarters). This is the root of everything Azure.

  2. Start from the Root Management Group: Created automatically for your tenant, this is the top of your hierarchy (like the CEO level). Policies here affect EVERYTHING below.

  3. Organize with Management Groups (levels 2-6): Create a structure matching your organization. Common patterns:

    • Geographic: North America MG, Europe MG, Asia MG
    • Business Unit: Finance MG, Marketing MG, Engineering MG
    • Environment: Production MG, Non-Production MG
    • Hybrid: Mix approaches (e.g., BU at level 1, then environment at level 2)
  4. Assign Subscriptions: Place subscriptions under appropriate management groups. Subscriptions inherit all policies from parent MGs. Each subscription is a billing boundary and contains resource groups.

  5. Create Resource Groups: Within subscriptions, group related resources. RGs typically align with:

    • Application lifecycle: Resources deployed/deleted together
    • Team ownership: Resources managed by same team
    • Environment: Dev RG, Test RG, Prod RG within a subscription
  6. Deploy Resources: Actual Azure services (VMs, databases, etc.) go into resource groups.
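
To make steps 3-4 concrete, here is a minimal ARM resource sketch for creating a management group under a parent - a sketch only, with hypothetical names (mg-production, mg-corp); management groups deploy at tenant scope:

{
  "type": "Microsoft.Management/managementGroups",
  "apiVersion": "2021-04-01",
  "name": "mg-production",
  "properties": {
    "displayName": "Production",
    "details": {
      "parent": {
        "id": "/providers/Microsoft.Management/managementGroups/mg-corp"
      }
    }
  }
}

Subscriptions are then moved under the new group (step 4), and everything below inherits its policies.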

📊 Azure Resource Hierarchy Diagram:

graph TD
    TENANT[Microsoft Entra Tenant<br/>contoso.onmicrosoft.com]

    TENANT --> ROOT[Root Management Group<br/>Tenant Root Group<br/><br/>🔐 Enterprise Policies Applied Here]

    ROOT --> MG1[Management Group<br/>Production<br/><br/>📋 Policy: Allowed Regions = US, EU]
    ROOT --> MG2[Management Group<br/>Non-Production<br/><br/>📋 Policy: Auto-shutdown VMs at night]

    MG1 --> SUB1[Subscription<br/>Prod-Finance<br/>💰 $50k/month budget<br/><br/>🔐 RBAC: Finance team = Contributor]
    MG1 --> SUB2[Subscription<br/>Prod-Engineering<br/>💰 $100k/month budget]

    MG2 --> SUB3[Subscription<br/>Dev-Engineering<br/>💰 $10k/month budget]

    SUB1 --> RG1[Resource Group<br/>rg-finance-erp-prod<br/>Location: East US<br/><br/>🏷️ Tags: dept=finance, env=prod]
    SUB1 --> RG2[Resource Group<br/>rg-finance-analytics-prod<br/>Location: East US]

    SUB3 --> RG3[Resource Group<br/>rg-engineering-webapp-dev<br/>Location: West US]

    RG1 --> RES1[Azure SQL Database<br/>sql-erp-prod-001]
    RG1 --> RES2[App Service<br/>app-erp-frontend-prod]
    RG1 --> RES3[Storage Account<br/>sterpprod001]

    RG3 --> RES4[Virtual Machine<br/>vm-webapp-dev-001]
    RG3 --> RES5[Virtual Network<br/>vnet-webapp-dev]

    style TENANT fill:#e1f5fe
    style ROOT fill:#fff3e0
    style MG1 fill:#e8f5e9
    style MG2 fill:#e8f5e9
    style SUB1 fill:#f3e5f5
    style SUB2 fill:#f3e5f5
    style SUB3 fill:#f3e5f5
    style RG1 fill:#fce4ec
    style RG2 fill:#fce4ec
    style RG3 fill:#fce4ec
    style RES1 fill:#e0f2f1
    style RES2 fill:#e0f2f1
    style RES3 fill:#e0f2f1
    style RES4 fill:#e0f2f1
    style RES5 fill:#e0f2f1

See: diagrams/01_fundamentals_resource_hierarchy.mmd

Diagram Explanation (Understanding the Hierarchy):

This diagram shows a realistic organizational structure for a company called Contoso. Let's trace how governance flows from top to bottom:

Level 1 - Tenant: The Microsoft Entra Tenant (contoso.onmicrosoft.com) is the identity foundation. All users, groups, and service principals authenticate here. Only ONE tenant can be associated with resources in this hierarchy.

Level 2 - Root Management Group: Automatically created and named after your tenant. This is where you apply enterprise-wide policies that must affect ALL Azure resources. Examples: "All resources must have required tags" or "All resources must be in allowed regions only". Be very careful here - mistakes affect everything.

Level 3 - Management Groups (Production vs Non-Production): In this example, resources are separated by environment at the top level. The Production MG has a policy restricting deployments to US and EU regions only (for compliance). The Non-Production MG has a policy to auto-shutdown VMs at night to save costs. Notice how policies are DIFFERENT at this level because needs differ.

Level 4 - Subscriptions (Department and Environment Specific):

  • Prod-Finance subscription ($50k/month budget): For finance team's production workloads. Finance team has Contributor access (can create/manage resources but not assign permissions).
  • Prod-Engineering subscription ($100k/month budget): Engineering's production environment.
  • Dev-Engineering subscription ($10k/month budget): Lower budget for non-production work.

Each subscription is a BILLING BOUNDARY - you get separate invoices. It's also a SCALE BOUNDARY - each subscription has limits (e.g., max 25,000 VMs).

Level 5 - Resource Groups (Lifecycle and Ownership):

  • rg-finance-erp-prod: Contains all resources for the ERP production application. Named descriptively (rg = resource group, finance = department, erp = app, prod = environment). Tagged for cost tracking and compliance.
  • rg-finance-analytics-prod: Separate RG for analytics workload - different lifecycle and team.
  • rg-engineering-webapp-dev: Dev environment resources for web application.

Resource groups are DEPLOYMENT BOUNDARIES - resources in an RG are typically deployed together, managed together, and deleted together.

Level 6 - Resources (Actual Azure Services):
Individual services like SQL databases, App Services, VMs, storage accounts. These are the actual compute, storage, and networking services you consume. Each inherits policies and access controls from all levels above.

Policy and Access Inheritance Flow:
Imagine a user trying to deploy a VM in rg-finance-erp-prod in the Brazil region:

  1. ✅ Root MG: No blocking policy
  2. ❌ Production MG: Policy says "Allowed Regions = US, EU only" - DEPLOYMENT BLOCKED
  3. User cannot proceed - policy violation

This shows how governance cascades from top to bottom, enforcing compliance automatically.
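
The "Allowed Regions" rule in this walkthrough could be written as an Azure Policy definition. A minimal sketch with region names of my choosing (the built-in "Allowed locations" policy additionally parameterizes the list and excludes global resources):

{
  "properties": {
    "displayName": "Allowed regions - US and EU only",
    "mode": "Indexed",
    "policyRule": {
      "if": {
        "not": {
          "field": "location",
          "in": ["eastus", "westus", "westeurope", "northeurope"]
        }
      },
      "then": {
        "effect": "deny"
      }
    }
  }
}

Assigned at the Production MG, this denies any deployment outside the listed regions - exactly the check that blocked the Brazil VM above.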

Must Know (Critical Facts):

  • Management Group depth limit: Maximum 6 levels (not including root or subscription level)
  • Single tenant rule: All subscriptions in a hierarchy trust the SAME Microsoft Entra tenant
  • Policy inheritance: Cannot be overridden by child resources - parent policies always apply
  • RBAC inheritance: Permissions granted at higher levels flow down (Owner at MG = Owner on all subscriptions below)
  • Resource Group location: RG has a location, but resources inside can be in different regions
  • Lifecycle linkage: Deleting a resource group DELETES ALL resources inside (permanent!)
  • Subscription limits: Each subscription can have only ONE parent management group

Detailed Example 1: Setting up a Multi-National Corporation

Scenario: Contoso Corp operates in North America, Europe, and Asia. They have strict data residency requirements (EU data must stay in EU) and different teams managing each region.

Architecture design:

Step 1 - Management Group Structure (Hierarchical):

Root Management Group (Tenant Root)
└── Corp (Level 1 - Corporate policies)
    ├── North America (Level 2 - Geographic)
    │   ├── Prod-NA (Level 3 - Environment)
    │   └── Dev-NA (Level 3 - Environment)
    ├── Europe (Level 2 - Geographic)
    │   ├── Prod-EU (Level 3 - Environment)
    │   └── Dev-EU (Level 3 - Environment)
    └── Asia (Level 2 - Geographic)
        ├── Prod-Asia (Level 3 - Environment)
        └── Dev-Asia (Level 3 - Environment)

Step 2 - Policy Application:

  • Root MG: Require tags (CostCenter, Owner, Environment) on ALL resources
  • Corp MG: Enable Azure Defender for all subscriptions, enforce TLS 1.2+
  • Europe MG: GDPR compliance - restrict resources to EU regions only (West Europe, North Europe)
  • North America MG: Allow only US regions (East US, West US, Central US)
  • Prod MGs: Disable public IP addresses on VMs, require encryption at rest
  • Dev MGs: Auto-shutdown VMs from 6 PM to 8 AM to save costs
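
The Root MG tagging rule above could look like the following policy definition sketch - shown for a single CostCenter tag, whereas real deployments usually parameterize the tag name:

{
  "properties": {
    "displayName": "Require a CostCenter tag on all resources",
    "mode": "Indexed",
    "policyRule": {
      "if": {
        "field": "tags['CostCenter']",
        "exists": "false"
      },
      "then": {
        "effect": "deny"
      }
    }
  }
}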

Step 3 - Subscription Assignment:

  • Prod-EU MG contains subscriptions: "EU-Finance-Prod", "EU-Engineering-Prod"
  • Dev-NA MG contains subscriptions: "NA-Engineering-Dev", "NA-Testing-Dev"

Step 4 - RBAC Assignment:

  • Europe MG: EU-Admins group = Contributor (can manage all EU resources)
  • Prod-EU MG: EU-Prod-Readers group = Reader (can view but not modify production)
  • EU-Finance-Prod subscription: Finance-Team group = Contributor
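
Role assignments are themselves Azure resources. A hedged ARM sketch of the last assignment - Contributor for the Finance-Team group at subscription scope (the principalId is a placeholder; b24988ac-6180-42a0-ab88-20f7382dd24c is the built-in Contributor role ID):

{
  "type": "Microsoft.Authorization/roleAssignments",
  "apiVersion": "2022-04-01",
  "name": "[guid(subscription().id, 'finance-team-contributor')]",
  "properties": {
    "roleDefinitionId": "[subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'b24988ac-6180-42a0-ab88-20f7382dd24c')]",
    "principalId": "<finance-team-group-object-id>",
    "principalType": "Group"
  }
}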

What happens:

  • An EU admin tries to deploy a VM in Brazil → ❌ Blocked by Europe MG policy (only EU regions)
  • A developer tries to deploy a VM without tags → ❌ Blocked by Root MG policy (tags required)
  • Finance team deploys a resource to EU-Finance-Prod → ✅ Succeeds (all policies satisfied, has permissions)
  • A VM in Dev-NA RG automatically shuts down at 6 PM → ✅ Auto-shutdown policy from Dev-NA MG
  • Auditor views all EU resources → ✅ EU-Prod-Readers group has Reader permission via RBAC

Detailed Example 2: Startup Growing to Enterprise

Scenario: TechStartup begins with one subscription and grows to need governance as they scale from 5 to 500 employees.

Phase 1 - Startup (Flat Structure):

  • 1 subscription: "Default Subscription"
  • 3 resource groups: "Dev", "Test", "Prod"
  • All developers have Contributor on subscription
  • No policies, no management groups
  • Problems: No cost control, security risks (everyone can access prod), compliance issues

Phase 2 - Growth (Basic Hierarchy):

Root MG
└── TechStartup
    ├── Production (MG)
    │   └── Prod-Main (Subscription)
    └── Non-Production (MG)
        ├── Dev-Main (Subscription)
        └── Test-Main (Subscription)
  • Moved resources to environment-specific subscriptions
  • Policies added:
    • Production MG: No public endpoints, require MFA for access
    • Non-Production MG: Auto-tag resources, auto-shutdown
  • RBAC refined:
    • Developers: Contributor on Dev/Test, Reader on Prod
    • Ops Team: Contributor on Prod
  • Benefits: Clear separation, cost control, improved security

Phase 3 - Enterprise (Advanced Hierarchy):

Root MG
└── TechStartup
    ├── Platform (MG - Shared services)
    │   ├── Identity-Sub
    │   ├── Networking-Sub
    │   └── Monitoring-Sub
    ├── Workloads (MG - Applications)
    │   ├── CustomerPortal (MG)
    │   │   ├── CustomerPortal-Prod (Sub)
    │   │   └── CustomerPortal-Dev (Sub)
    │   └── InternalTools (MG)
    │       ├── InternalTools-Prod (Sub)
    │       └── InternalTools-Dev (Sub)
    └── Sandbox (MG - Experimentation)
        └── Innovation-Sub
  • Separated platform/infrastructure from workloads
  • Each application has its own management group
  • Sandbox for experimentation without affecting production
  • Policies layered:
    • Platform MG: Stricter security (private endpoints only)
    • Workloads MG: Standard policies
    • Sandbox MG: Relaxed (allow public access for testing)
  • Cost management: Budgets per subscription, auto-alerts
  • Benefits: Scalable governance, clear ownership, flexibility

Why this works: As organizations grow, their hierarchy evolves from simple (few subscriptions) to complex (many MGs and subscriptions). The key is to start simple and add structure as needed, always aligning with business requirements.

Detailed Example 3: Troubleshooting Access Issues Using Hierarchy

Scenario: User Alice cannot deploy a storage account in a resource group, even though she has "Contributor" role.

Investigation process:

Step 1 - Check Resource Group RBAC:

  • Alice has Contributor on RG ✅
  • Contributor can create resources ✅

Step 2 - Check Subscription Policies:

  • Subscription has policy: "Storage accounts must use private endpoints only"
  • Alice's deployment template includes public endpoint ❌
  • Root cause found: Policy violation

Step 3 - Check if Policy Can Be Changed:

  • Policy inherited from Management Group above subscription
  • Alice doesn't have permission to modify MG policies ❌

Resolution:

  • Option A: Modify deployment to use private endpoint ✅
  • Option B: Request exception from governance team (if justified)
  • Option C: Use a different subscription without the policy (if allowed)

Lesson: RBAC permissions alone don't guarantee success - policies at higher levels can block actions even with appropriate roles.
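
If the Activity Log is exported to Log Analytics, step 2 of this investigation can be confirmed with a query along these lines - a sketch, noting that the AzureActivity table is populated only after a diagnostic setting forwards the Activity Log, and RequestDisallowedByPolicy is the error code Azure emits for policy denials:

AzureActivity
| where Properties contains "RequestDisallowedByPolicy"
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup
| order by TimeGenerated desc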

When to use Management Groups vs Subscriptions vs Resource Groups:

Use Management Groups when:

  • ✅ Need to apply policies to many subscriptions
  • ✅ Organizing by business unit or geography
  • ✅ Need hierarchical governance structure
  • ✅ Managing enterprise-wide compliance

Use Subscriptions when:

  • ✅ Need billing separation (different cost centers)
  • ✅ Need to delegate ownership to a team
  • ✅ Hit subscription limits (need more resources)
  • ✅ Isolating environments (prod vs dev)

Use Resource Groups when:

  • ✅ Grouping resources with same lifecycle
  • ✅ Resources managed by same team
  • ✅ Deploying related resources together
  • ✅ Sharing configuration or deployment templates

Limitations & Constraints:

  • Management Groups: Max 10,000 per tenant, max 6 levels deep, cannot be nested under subscriptions
  • Subscriptions: One parent management group only; moving a subscription between management groups takes up to 30 minutes to propagate
  • Resource Groups: Cannot be nested; deletion is permanent and deletes all resources; max 800 deployments per RG (rolling history)
  • Resources: Subject to subscription quotas and limits (e.g., max VMs, storage accounts)

💡 Tips for Understanding:

  • Draw your hierarchy - visualize before implementing
  • Name consistently - use naming conventions (e.g., "mg-prod", "sub-finance-prod", "rg-app-env-region")
  • Think inheritance - permissions and policies flow downward, never upward
  • Plan for growth - design hierarchy to accommodate future expansion

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Creating too many management groups too early
    • Why it's wrong: Adds complexity without benefit for small organizations
    • Correct understanding: Start simple (2-3 MGs), add structure as you scale
  • Mistake 2: Applying restrictive policies at Root MG without testing
    • Why it's wrong: Can block all deployments across the entire organization
    • Correct understanding: Test policies in lower MGs first, then promote to Root
  • Mistake 3: Treating resource groups like folders
    • Why it's wrong: RGs are lifecycle boundaries, not just organizational containers
    • Correct understanding: Resources with same lifecycle go in same RG
  • Mistake 4: Mixing production and dev resources in same subscription
    • Why it's wrong: Security risk, cost confusion, policy conflicts
    • Correct understanding: Separate environments with different subscriptions

🔗 Connections to Other Topics:

  • Relates to Azure Policy (Domain 1) because: Policies use the hierarchy for scope and inheritance
  • Builds on RBAC (Domain 1) by: Providing structure for permission delegation
  • Often used with Resource Tagging (Domain 1) to: Organize and track costs across the hierarchy
  • Critical for Cost Management because: Hierarchy determines billing aggregation and budget allocation

Mental Model: How Everything Fits Together

The Well-Architected Framework provides the WHAT (principles to follow), while the Resource Hierarchy provides the WHERE (structure to apply them).

Think of it this way:

  1. Framework = Philosophy: The "why" and "what" of good design
  2. Hierarchy = Organization: The "where" governance is applied
  3. Policies + RBAC = Enforcement: The "how" governance is implemented

When designing any Azure solution:

  1. Start with business requirements - what does the business need?
  2. Apply WAF principles - which pillars are most critical?
  3. Design hierarchy - how should resources be organized?
  4. Implement governance - use policies and RBAC to enforce rules
  5. Deploy resources - create actual services within the structure

📊 Complete Ecosystem Diagram:

graph TB
    subgraph "Azure Architecture Design Process"
        BR[Business Requirements<br/>💼<br/>What do we need to achieve?]

        BR --> WAF[Apply Well-Architected Framework<br/>⚖️<br/>Balance 5 pillars]

        WAF --> HIER[Design Resource Hierarchy<br/>🏗️<br/>Organize MG, Subs, RGs]

        HIER --> GOV[Implement Governance<br/>🔐<br/>Policies + RBAC]

        GOV --> DEPLOY[Deploy Resources<br/>☁️<br/>VMs, DBs, Networks, Apps]

        DEPLOY --> MON[Monitor & Optimize<br/>📊<br/>Continuous improvement]

        MON -.Feedback.-> WAF
    end

    subgraph "Well-Architected Framework Pillars"
        WAF --> P1[Reliability]
        WAF --> P2[Security]
        WAF --> P3[Cost Optimization]
        WAF --> P4[Operational Excellence]
        WAF --> P5[Performance Efficiency]
    end

    subgraph "Resource Hierarchy Levels"
        HIER --> H1[Management Groups<br/>Governance scope]
        HIER --> H2[Subscriptions<br/>Billing & scale boundary]
        HIER --> H3[Resource Groups<br/>Lifecycle boundary]
        HIER --> H4[Resources<br/>Actual services]
    end

    subgraph "Governance Implementation"
        GOV --> G1[Azure Policy<br/>What's allowed/required]
        GOV --> G2[RBAC<br/>Who can do what]
        GOV --> G3[Tagging<br/>Organize & track costs]
        GOV --> G4[Budgets & Alerts<br/>Control spending]
    end

    style BR fill:#e1f5fe
    style WAF fill:#fff3e0
    style HIER fill:#e8f5e9
    style GOV fill:#f3e5f5
    style DEPLOY fill:#fce4ec
    style MON fill:#e0f2f1

See: diagrams/01_fundamentals_complete_ecosystem.mmd

Diagram Explanation: This shows the complete Azure architecture design workflow from requirements to deployment and continuous improvement.

Phase 1 - Business Requirements: Everything starts here. Understand what the business needs: performance targets, security requirements, budget constraints, compliance needs, availability SLAs. Document these clearly - they drive all subsequent decisions.

Phase 2 - Apply Well-Architected Framework: Evaluate requirements against all five pillars. For a financial trading platform, Reliability and Performance might be top priorities. For a public website, Cost Optimization and Security might lead. Make explicit tradeoff decisions and document why.

Phase 3 - Design Resource Hierarchy: Based on the organization structure and requirements, design management groups, subscriptions, and resource group strategy. Consider factors like: team ownership, environment separation, geographic requirements, billing separation needs.

Phase 4 - Implement Governance: Translate requirements into concrete policies and access controls. Use Azure Policy to enforce compliance (e.g., "all data must be encrypted"). Use RBAC to control who can do what. Apply tags for cost tracking and organization. Set budgets to prevent overspending.

Phase 5 - Deploy Resources: Within the governed structure, deploy actual Azure services - virtual machines, databases, networking components, applications. These resources automatically inherit governance from the hierarchy.

Phase 6 - Monitor & Optimize: Continuously monitor performance, costs, security, and reliability. Use Azure Monitor, Cost Management, Security Center. Feed insights back to the Well-Architected Framework assessment - did your design achieve the goals? What needs adjustment?

The feedback loop (Monitor → WAF) represents continuous improvement - architecture is never "done", it evolves with business needs and Azure platform improvements.

Terminology Guide

  • Azure Resource - A manageable item available through Azure. Example: Virtual Machine, Storage Account, SQL Database
  • Resource Group - Logical container for resources with a shared lifecycle. Example: all resources for a web application (web app, database, storage)
  • Subscription - Billing and management boundary for resources. Example: Production subscription, Development subscription
  • Management Group - Container for organizing subscriptions with inherited governance. Example: Corporate MG, Production MG, Finance MG
  • Microsoft Entra Tenant - Identity directory for an organization in Azure. Example: contoso.onmicrosoft.com
  • Azure Policy - Service for enforcing organizational standards. Example: "All VMs must use managed disks"
  • RBAC (Role-Based Access Control) - Authorization system for Azure resources. Example: assign the "Contributor" role to developers
  • Well-Architected Framework - Design principles for Azure solutions. Example: the five pillars - Reliability, Security, Cost, Ops, Performance
  • Pillar - Core principle of the Well-Architected Framework. Example: the Security pillar focuses on protecting data and systems
  • Tradeoff - Compromise where improving one aspect degrades another. Example: higher security (more encryption) increases cost
  • Governance - Enforcement of organizational standards and policies. Example: using policies and RBAC to control resource creation
  • Landing Zone - Pre-configured environment for workload deployment. Example: a production landing zone with networking, identity, and governance pre-configured
  • Azure Region - Geographic location containing Azure datacenters. Example: East US, West Europe, Southeast Asia
  • Resource Provider - Service that supplies Azure resources. Example: Microsoft.Compute (VMs), Microsoft.Storage (storage)
  • ARM Template / Bicep - Infrastructure as Code for deploying Azure resources. Example: a JSON or Bicep file defining all resources for an application

📝 Practice Exercise 1: Applying the Well-Architected Framework

Scenario: You're designing a customer-facing e-commerce website for a startup with limited budget but high growth potential.

Requirements:

  • Must handle Black Friday traffic spikes (10x normal load)
  • Process credit card payments (PCI DSS compliance)
  • Limited budget: $5,000/month
  • Small team (2 developers, 1 ops engineer)

Task: For each pillar, identify specific design decisions:

  1. Reliability: How will you handle failures and traffic spikes?
  2. Security: How will you protect customer payment data?
  3. Cost Optimization: How will you stay within budget?
  4. Operational Excellence: How will a small team manage the system?
  5. Performance Efficiency: How will you scale for Black Friday?

Sample Solution:

  • Reliability: Use Azure App Service with auto-scaling, Azure Front Door for global distribution, Azure SQL Database with geo-replication
  • Security: Use Azure Key Vault for secrets, private endpoints, PCI DSS compliant payment gateway (Stripe/PayPal), encryption at rest and in transit
  • Cost: Start with lower-tier App Service, scale up only when needed, use Azure reservations for predictable costs, implement auto-shutdown for dev/test
  • Operational Excellence: Use GitHub Actions for CI/CD automation, Azure Monitor for centralized logging, managed services (App Service, SQL DB) to reduce operational burden
  • Performance: Configure auto-scale rules for App Service (scale to 20 instances during Black Friday), use Azure CDN for static content, implement caching (Azure Cache for Redis) - see the autoscale sketch below
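
The auto-scale decision above could be captured declaratively. A hedged ARM sketch of an autoscale setting for the App Service plan (resource URIs are placeholders and the 70% CPU threshold is illustrative):

{
  "type": "Microsoft.Insights/autoscaleSettings",
  "apiVersion": "2022-10-01",
  "name": "webapp-autoscale",
  "location": "eastus",
  "properties": {
    "enabled": true,
    "targetResourceUri": "<app-service-plan-resource-id>",
    "profiles": [
      {
        "name": "scale-on-cpu",
        "capacity": { "minimum": "2", "maximum": "20", "default": "2" },
        "rules": [
          {
            "metricTrigger": {
              "metricName": "CpuPercentage",
              "metricResourceUri": "<app-service-plan-resource-id>",
              "timeGrain": "PT1M",
              "statistic": "Average",
              "timeWindow": "PT10M",
              "timeAggregation": "Average",
              "operator": "GreaterThan",
              "threshold": 70
            },
            "scaleAction": {
              "direction": "Increase",
              "type": "ChangeCount",
              "value": "2",
              "cooldown": "PT5M"
            }
          }
        ]
      }
    ]
  }
}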

Tradeoff decisions made:

  • ⚖️ Reliability vs Cost: Chose geo-replication for database (higher reliability) despite extra cost - justified by customer trust requirements
  • ⚖️ Operational Excellence vs Cost: Used managed services (more expensive) instead of VMs (cheaper) - justified by small team size
  • ⚖️ Performance vs Cost: Auto-scale only when needed rather than always running at max capacity

📝 Practice Exercise 2: Designing a Resource Hierarchy

Scenario: Medium-sized company "FabriFiber" with:

  • 3 departments: Finance, Marketing, Engineering
  • 2 environments: Production, Non-Production (Dev + Test)
  • 2 geographic locations: US, Europe
  • Compliance requirement: EU data must stay in EU

Task: Design a management group and subscription structure. Draw the hierarchy and justify your choices.

Sample Solution:

Root Management Group
└── FabriFiber (Corporate)
    ├── Production
    │   ├── US-Production
    │   │   ├── Sub: Finance-Prod-US
    │   │   ├── Sub: Marketing-Prod-US
    │   │   └── Sub: Engineering-Prod-US
    │   └── EU-Production
    │       ├── Sub: Finance-Prod-EU
    │       └── Sub: Engineering-Prod-EU
    └── Non-Production
        ├── US-NonProd
        │   └── Sub: Shared-Dev-US
        └── EU-NonProd
            └── Sub: Shared-Dev-EU

Justification:

  • Level 1 (FabriFiber): Apply company-wide policies (e.g., require tagging, enable Defender)
  • Level 2 (Prod/NonProd): Separate governance - Prod has stricter security
  • Level 3 (Geography): Enforce data residency - EU MG restricts to EU regions only
  • Subscriptions: Separate by department in production for cost tracking; shared subscription in non-prod to save costs

Policies applied:

  • Root: Require tags (Department, Environment, CostCenter)
  • Production MG: No public IPs, require encryption, require MFA
  • EU-Production MG: Restrict to EU regions only (GDPR compliance)
  • Non-Production MG: Auto-shutdown VMs 6 PM - 8 AM
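
The EU restriction could be enforced by assigning the built-in "Allowed locations" policy at the EU-Production MG. A sketch, assuming the well-known definition ID e56962a6-4747-49cd-b67b-bf8b01975c4c (verify against the current built-in catalog before relying on it):

{
  "type": "Microsoft.Authorization/policyAssignments",
  "apiVersion": "2022-06-01",
  "name": "allowed-locations-eu",
  "properties": {
    "displayName": "EU regions only",
    "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/e56962a6-4747-49cd-b67b-bf8b01975c4c",
    "parameters": {
      "listOfAllowedLocations": {
        "value": ["westeurope", "northeurope"]
      }
    }
  }
}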

Check Your Understanding

  • Can you explain the five Well-Architected Framework pillars and give an example of each?
  • Can you describe when you'd need to make a tradeoff between pillars?
  • Do you understand the four levels of Azure resource hierarchy?
  • Can you explain how policies and RBAC inherit through the hierarchy?
  • Can you design a simple management group structure for an organization?
  • Can you identify which level of hierarchy to use for different governance needs?

If you answered "no" to any, review the relevant sections above before proceeding to Domain chapters.


Next: Proceed to 02_domain_1_identity_governance_monitoring to dive deep into identity solutions, governance strategies, and monitoring architectures.


Chapter 1: Design Identity, Governance, and Monitoring Solutions (25-30% of exam)

Chapter Overview

What you'll learn:

  • How to design comprehensive logging and monitoring solutions for Azure workloads
  • Authentication and authorization strategies using Microsoft Entra ID (formerly Azure AD)
  • Governance frameworks for managing Azure resources at scale
  • Security patterns for protecting identities and controlling access

Time to complete: 12-16 hours

Prerequisites: Chapter 0 (Fundamentals) - Understanding of Well-Architected Framework and Azure resource hierarchy


Section 1: Design Solutions for Logging and Monitoring

Introduction

The problem: In traditional on-premises environments, you might have servers, applications, and network devices spread across multiple locations with no unified way to see what's happening. When something breaks, IT teams spend hours (or days) correlating logs from different systems to find the root cause. Performance issues go undetected until users complain. Security breaches might be discovered months after they occur.

The solution: Azure Monitor provides a unified, cloud-native platform that collects, analyzes, and acts on telemetry from all your Azure resources, on-premises systems, and even other clouds. It gives you a single pane of glass to observe everything happening in your environment, with the ability to set up intelligent alerts, create visual dashboards, and even automate responses to issues.

Why it's tested: The AZ-305 exam heavily tests your ability to design monitoring architectures because observability is critical for:

  • Reliability: Detecting and responding to failures before they impact users
  • Security: Identifying anomalous behavior and potential breaches
  • Cost optimization: Understanding resource utilization to eliminate waste
  • Performance: Tracking metrics to ensure SLAs are met
  • Compliance: Maintaining audit trails for regulatory requirements

Core Concepts

Azure Monitor - The Foundation

What it is: Azure Monitor is the central platform service in Azure that collects, analyzes, stores, and visualizes telemetry data from all your resources. It's like having a universal sensor system that monitors everything from CPU usage on VMs to application errors in your code, and from network traffic patterns to user behavior analytics.

Why it exists: Without centralized monitoring, organizations face:

  • Blind spots: Can't see what's happening across distributed systems
  • Delayed detection: Problems discovered too late
  • Manual correlation: Hours wasted connecting dots between different data sources
  • Reactive operations: Fixing problems after they impact users instead of preventing them

Real-world analogy: Think of Azure Monitor like a modern car's dashboard and diagnostic system:

  • Gauges and displays (Metrics): Show real-time status - speed, fuel, engine temp
  • Warning lights (Alerts): Notify you when something needs attention
  • Black box recorder (Logs): Records detailed history of everything that happened
  • Diagnostic scanner (Insights): Analyzes patterns to predict future problems
  • Maintenance scheduler (Automation): Automatically takes action based on conditions

How it works (Detailed step-by-step):

  1. Data Collection: Azure Monitor automatically starts collecting platform metrics (CPU, memory, network) from Azure resources the moment they're created. For deeper insights, you configure diagnostic settings to send resource logs, and install the Azure Monitor Agent on VMs to collect guest-level and custom data.

  2. Data Ingestion: All collected data flows into Azure Monitor's data stores. Metrics go to the metrics database (time-series data optimized for fast queries and dashboards). Logs go to Log Analytics workspaces (document-based storage optimized for complex queries and analysis).

  3. Data Organization: Data is categorized and tagged with metadata (resource ID, subscription, region, custom tags). This metadata enables filtering, grouping, and correlation across different data sources.

  4. Analysis and Querying: You can query data using Kusto Query Language (KQL) in Log Analytics. Metrics can be visualized in Azure Metrics Explorer. Pre-built insights provide automatic analysis for common scenarios (VMs, containers, applications).

  5. Alerting and Actions: Alert rules continuously evaluate your data. When conditions are met (e.g., CPU > 80% for 5 minutes), alerts fire. Action groups define what happens next - send email, trigger webhook, run automation runbook, create ITSM ticket.

  6. Visualization: Data is presented through Azure dashboards, Workbooks (dynamic reports), Power BI, or Grafana for visualization and reporting.

  7. Automation and Response: Azure Monitor integrates with Azure Automation, Logic Apps, and Functions to automatically remediate issues (e.g., scale out VMs when CPU is high, restart services that crash).
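
As a taste of step 4, here is a typical KQL query against a Log Analytics workspace - average CPU per computer in 5-minute buckets over the last hour, using the standard Perf table populated by the Azure Monitor Agent:

Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where TimeGenerated > ago(1h)
| summarize AvgCpu = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| order by TimeGenerated desc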

📊 Azure Monitor Architecture Diagram:

graph TB
    subgraph "Data Sources"
        A[Azure Resources<br/>VMs, Databases, Storage]
        B[Applications<br/>App Insights SDK]
        C[Guest OS & Apps<br/>Azure Monitor Agent]
        D[On-premises<br/>Arc-enabled servers]
    end

    subgraph "Azure Monitor Data Platform"
        E[Metrics Database<br/>Time-series data]
        F[Log Analytics<br/>Workspace Logs]
    end

    subgraph "Analysis & Insights"
        G[Metrics Explorer<br/>Real-time charts]
        H[Log Analytics<br/>KQL Queries]
        I[Application Insights<br/>APM]
        J[VM Insights<br/>Performance]
        K[Container Insights<br/>AKS Monitoring]
    end

    subgraph "Actions & Responses"
        L[Alert Rules<br/>Conditions]
        M[Action Groups<br/>Email, SMS, Webhook]
        N[Automation<br/>Auto-remediation]
        O[Dashboards<br/>Visualization]
    end

    A -->|Platform Metrics| E
    A -->|Resource Logs| F
    B -->|Telemetry| F
    B -->|Metrics| E
    C -->|Custom Logs/Metrics| E
    C -->|Custom Logs/Metrics| F
    D -->|Logs/Metrics| E
    D -->|Logs/Metrics| F

    E --> G
    F --> H
    F --> I
    F --> J
    F --> K

    G --> L
    H --> L
    L --> M
    M --> N
    E --> O
    F --> O

    style A fill:#e1f5fe
    style B fill:#e1f5fe
    style C fill:#e1f5fe
    style D fill:#e1f5fe
    style E fill:#fff3e0
    style F fill:#fff3e0
    style G fill:#f3e5f5
    style H fill:#f3e5f5
    style I fill:#f3e5f5
    style J fill:#f3e5f5
    style K fill:#f3e5f5
    style L fill:#e8f5e9
    style M fill:#e8f5e9
    style N fill:#e8f5e9
    style O fill:#e8f5e9

See: diagrams/02_domain_1_azure_monitor_architecture.mmd

Diagram Explanation (detailed):

The Azure Monitor architecture consists of four main layers:

Data Sources Layer (Blue): This is where monitoring data originates. Azure resources automatically emit platform metrics (CPU, memory, disk, network) without any configuration. Applications instrumented with Application Insights SDK send detailed telemetry including request traces, exceptions, and dependencies. Guest operating systems and applications on VMs require the Azure Monitor Agent to be installed to send custom metrics and logs. On-premises servers can be monitored after enabling Azure Arc, which extends Azure management to any infrastructure.

Data Platform Layer (Orange): All collected data flows into two primary stores. The Metrics Database stores time-series numeric data optimized for fast retrieval and real-time charting - perfect for dashboards and quick performance checks. Log Analytics Workspaces store log data (text, JSON, XML) in a document database optimized for complex queries - ideal for troubleshooting, security analysis, and compliance auditing. These two stores are complementary, not redundant.

Analysis Layer (Purple): Multiple tools help you make sense of the data. Metrics Explorer provides real-time charts without writing queries. Log Analytics uses KQL (Kusto Query Language) for powerful ad-hoc analysis across billions of log records. Application Insights automatically analyzes application performance and user behavior. VM Insights provides performance analysis specifically for virtual machines. Container Insights does the same for Kubernetes workloads.

Action Layer (Green): Alert rules continuously evaluate metrics and logs against conditions you define. When thresholds are breached, Action Groups determine what happens - notifications (email, SMS, push, voice), integrations (webhooks, ITSM, Event Hubs), or automated responses (Azure Functions, Logic Apps, Automation Runbooks). Dashboards bring everything together in customizable visual representations shared across teams.

The key takeaway: Data flows from left to right, from sources → storage → analysis → action. Each layer is modular - you can use just Metrics Explorer for simple scenarios or build complex solutions with custom queries, machine learning anomaly detection, and automated remediation.
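
The resource-log arrows in this flow are turned on per resource through a diagnostic setting. A minimal ARM sketch routing all logs and metrics to a workspace (the workspace path is a placeholder; diagnostic settings deploy as extension resources scoped to the monitored resource):

{
  "type": "Microsoft.Insights/diagnosticSettings",
  "apiVersion": "2021-05-01-preview",
  "name": "send-to-central-logs",
  "properties": {
    "workspaceId": "/subscriptions/<sub-id>/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/central-logs",
    "logs": [
      { "categoryGroup": "allLogs", "enabled": true }
    ],
    "metrics": [
      { "category": "AllMetrics", "enabled": true }
    ]
  }
}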

Detailed Example 1: E-commerce Website Monitoring

Situation: You're architecting monitoring for an e-commerce platform running on Azure App Service with an Azure SQL Database backend, Storage Accounts for product images, and Application Gateway for load balancing.

Requirements:

  • Track application performance (response times, failures)
  • Monitor database performance and blocking queries
  • Alert when checkout failures exceed 1% of transactions
  • Dashboard showing real-time business metrics (orders/minute, revenue)
  • Audit trail for compliance (who accessed what data)

Solution Architecture:

  1. Application Insights for the App Service:

    • Install the App Insights SDK in your application code
    • Automatically tracks HTTP requests, dependencies (SQL calls), exceptions
    • Custom telemetry for business metrics (track each successful checkout)
    • Distributed tracing shows full request flow: App Gateway → App Service → SQL Database → Storage
  2. Diagnostic Settings for Azure SQL Database:

    • Enable diagnostic settings to send QueryStoreRuntimeStatistics logs to Log Analytics
    • Send SQLInsights to track blocking queries, deadlocks, and timeouts
    • Metrics for DTU/CPU percentage go to Azure Monitor Metrics automatically
  3. Storage Account Monitoring:

    • Platform metrics track blob operations (transactions, latency, availability)
    • Enable Storage Analytics logging for detailed access logs (who downloaded which images)
    • Set diagnostic settings to send logs to Log Analytics for long-term retention
  4. Alert Configuration:

    • Log-based alert: Query Application Insights for checkout failures
    requests
    | where name == "POST /checkout"
    | summarize FailureRate = 100.0 * countif(success == false) / count() by bin(timestamp, 5m)
    | where FailureRate > 1
    
    • Metric alert: SQL Database DTU > 80% for 5 minutes
    • Action group sends SMS to on-call engineer and creates a PagerDuty incident
  5. Dashboard Creation:

    • Workbook combining Application Insights and SQL metrics
    • Real-time tile showing orders per minute (from custom telemetry)
    • Revenue chart (calculated from order amounts in Application Insights)
    • P95 response time across all endpoints
    • Error rate by API endpoint

What happens: When a customer places an order, Application Insights tracks the entire flow. If the database starts slowing down due to blocking queries, the SQL diagnostic logs capture the blocking chain. The alert rule detects elevated DTU and notifies the team. The dashboard shows the impact - increased response times and dropping order rate. Engineers query Log Analytics to identify the specific query causing blocks and optimize it. The compliance team later audits storage logs to prove only authorized services accessed customer data.
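
To make the troubleshooting step concrete, the "query Log Analytics to identify the blocking query" action could look like the sketch below. This assumes the Blocks log category was included in the diagnostic settings; AzureDiagnostics columns are dynamically typed and vary by category, so treat the shape as illustrative:

// KQL sketch: hunt for recent SQL blocking events routed via diagnostic settings
AzureDiagnostics
| where TimeGenerated > ago(1h)
| where ResourceProvider == "MICROSOFT.SQL" and Category == "Blocks"
| order by TimeGenerated desc
| take 50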

Detailed Example 2: Multi-Region VM Monitoring

Situation: A global financial services company runs 500 VMs across three Azure regions (East US, West Europe, Southeast Asia) hosting their trading platform. They need comprehensive monitoring with minimal manual configuration.

Requirements:

  • Monitor performance metrics (CPU, memory, disk, network) for all VMs
  • Track running processes and identify unauthorized software
  • Map dependencies between VMs (which VMs talk to which)
  • Alert on anomalies (sudden CPU spikes, unusual network patterns)
  • Centralized logging for security auditing

Solution Architecture:

  1. Azure Monitor Agent (AMA) Deployment:

    • Use Azure Policy to automatically install AMA on all VMs (existing and new)
    • Data Collection Rules (DCRs) define what data to collect from each VM
    • DCR associates VMs with Log Analytics Workspace based on region
  2. VM Insights Configuration:

    • Enable VM Insights at scale using Azure Policy
    • Deploys Dependency Agent automatically (maps VM connections)
    • Collects performance counters every 60 seconds: Processor (% Processor Time), Memory (Available MBytes), Logical Disk (Free Megabytes), Network Adapter (Bytes Sent/Received)
  3. Log Collection Strategy:

    • System logs: Windows Event Logs (Application, Security, System) or Linux Syslog
    • Performance counters: As defined above, stored in Perf table
    • Process and dependencies: VMProcess, VMComputer, and VMConnection tables populated by VM Insights (the legacy Service Map solution used ServiceMapProcess_CL and ServiceMapComputer_CL)
  4. Data Collection Rules (DCR) Design:

    // Simplified DCR example
    {
      "dataSources": {
        "performanceCounters": [
          {"counterSpecifiers": ["\\Processor(*)\\% Processor Time"], "samplingFrequency": 60},
          {"counterSpecifiers": ["\\Memory\\Available MBytes"], "samplingFrequency": 60}
        ],
        "windowsEventLogs": [
          {"streams": ["Microsoft-Event"], "xPathQueries": ["Security!*", "Application!*[System[(Level=1 or Level=2 or Level=3)]]"]}
        ]
      },
      "destinations": {
        "logAnalytics": [{"workspaceResourceId": "/subscriptions/.../workspaces/central-logs"}]
      }
    }
    
  5. Dependency Mapping:

    • VM Insights creates visual maps showing how VMs communicate
    • Identifies: Which VMs accept connections from the internet (potential attack surface), which VMs have no communication (candidates for decommissioning), network bottlenecks between regions
  6. Anomaly Detection Alerts:

    • Metric alerts with Dynamic Thresholds use machine learning (Smart Detection provides similar anomaly detection for Application Insights)
    • Learns normal CPU patterns for each VM (CPU typically 20-30% during trading hours, 5-10% after hours)
    • Alerts when a VM deviates significantly (e.g., 80% CPU at 2 AM - possible crypto mining malware)

What happens: A VM in Southeast Asia gets compromised and starts bitcoin mining at 3 AM local time. The CPU jumps to 100%. The dynamic-threshold alert, having learned that the VM typically uses 8% CPU at that hour, fires an anomaly alert within 10 minutes. The SOC team receives the alert, checks the VM Insights dependency map, and sees unusual outbound connections to unknown IPs. They query Log Analytics for recent process starts on that VM, identify the malicious process, and isolate the VM. The entire detection-to-response cycle takes 20 minutes instead of hours or days.
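
The "query Log Analytics for recent process starts" step translates to a short KQL query. A sketch assuming VM Insights populates the VMProcess table; the computer name is a hypothetical placeholder:

// KQL sketch: list processes first seen recently on a suspect VM
VMProcess
| where TimeGenerated > ago(2h)
| where Computer == "sea-trade-vm042"
| summarize FirstSeen = min(TimeGenerated) by ExecutableName, CommandLine
| order by FirstSeen desc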

Detailed Example 3: Hybrid Monitoring with Azure Arc

Situation: A manufacturing company has 200 production servers in their factory (on-premises) and 50 VMs in Azure. They want unified monitoring across both environments with identical capabilities.

Requirements:

  • Monitor on-premises servers same as Azure VMs
  • Centralized Log Analytics for all servers
  • Use Azure Monitor alerts for on-premises systems
  • Consistent update management across hybrid environment

Solution Architecture:

  1. Azure Arc Deployment:

    • Install Azure Arc Connected Machine agent on each on-premises server
    • Servers now appear in Azure Portal as Azure Arc-enabled servers
    • Can be managed with Azure Policy, monitored with Azure Monitor, protected with Microsoft Defender
  2. Unified Monitoring Configuration:

    • Same Azure Monitor Agent (AMA) on both Azure VMs and Arc-enabled servers
    • Same Data Collection Rules (DCRs) apply to both
    • Same Log Analytics Workspace receives data from both
  3. Monitoring Strategy:

    • Performance monitoring: Identical performance counters from all servers
    • Log collection: Application logs, security logs, system logs from all servers
    • Update assessment: Azure Update Management assesses both Azure and on-premises
    • Alerts work identically: High CPU on-premises server triggers same alert as Azure VM

What happens: A production machine in the factory starts overheating, causing its monitoring sensors to flood the on-premises server with telemetry. The server's CPU spikes and disk fills up with log files. Because it's Arc-enabled and monitored by Azure Monitor, an alert fires in Azure (same as for Azure VMs). The alert triggers an Automation Runbook that remotely clears old logs and restarts the telemetry service. The entire remediation is automated and handled from Azure, despite the server being on-premises.
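
A quick way to verify unified coverage is a Heartbeat query; the ComputerEnvironment column distinguishes Azure VMs ("Azure") from Arc-enabled servers ("Non-Azure"). A minimal sketch:

// KQL sketch: confirm both Azure and on-premises servers report to one workspace
Heartbeat
| where TimeGenerated > ago(15m)
| summarize LastHeartbeat = max(TimeGenerated) by Computer, ComputerEnvironment, OSType
| order by ComputerEnvironment asc, Computer asc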

Must Know (Critical Facts):

  • Azure Monitor is always on: Platform metrics are collected automatically without configuration; you pay only for ingestion and retention of logs sent to Log Analytics
  • Diagnostic Settings are resource-specific: Each Azure resource needs its own diagnostic setting configured to send logs to Log Analytics; there's no "inherit from subscription" option
  • Metrics vs Logs distinction: Metrics are numeric time-series data (fast, cheap, real-time); Logs are text/JSON records (rich detail, queryable, more expensive to store)
  • Log Analytics Workspace is regional: Data stored in a workspace stays in that region (compliance implication); you can have multiple workspaces for data residency requirements
  • Retention limits: Metrics are retained for 93 days; Log Analytics default is 30 days (configurable up to 730 days at additional cost)
  • Application Insights classic is retired: Always use workspace-based Application Insights (data stored in Log Analytics workspace, not separate Application Insights storage)

When to use (Comprehensive):

  • ✅ Use Azure Monitor Metrics when: You need real-time visualization (dashboards showing current state), fast alert response (< 1 minute), low-cost monitoring of numeric values (CPU, memory, request count)
  • ✅ Use Log Analytics when: You need to query and correlate data from multiple sources, perform complex analysis (root cause investigation), retain data for compliance (years), create custom reports with KQL
  • ✅ Use Application Insights when: Monitoring web applications and services, need distributed tracing across microservices, want automatic performance anomaly detection, require end-user monitoring (Real User Monitoring)
  • ✅ Use VM Insights when: Managing multiple VMs, need dependency mapping between servers, want pre-built performance analysis workbooks
  • ✅ Use Container Insights when: Running Kubernetes (AKS, Arc-enabled Kubernetes, AKS on Azure Stack HCI), need container-level metrics and logs
  • ❌ Don't use Azure Monitor when: You need to monitor application business logic (use Application Insights instead), you need network packet inspection (use Network Watcher), you only need flow logs for NSG (use NSG Flow Logs directly)
  • ❌ Don't use Log Analytics when: You only need current metric values for dashboards (use Metrics Explorer - it's faster and cheaper), you need real-time streaming analytics (use Azure Stream Analytics), you need to store raw logs indefinitely at low cost (archive to Storage Account)

Limitations & Constraints:

  • Log Analytics query timeout: Queries must complete within 10 minutes (for interactive queries) or 30 minutes (for alert queries). Workaround: Optimize queries with early time filters and summarization, or use summary rules to pre-aggregate frequently accessed data
  • Log Analytics ingestion limit: 30 MB/sec per workspace (about 2.5 TB/day). Workaround: Use multiple workspaces or configure sampling in Application Insights
  • Metrics retention: Only 93 days for platform metrics. Workaround: Use diagnostic settings to route important platform metrics to Log Analytics, which can retain them for years
  • Alert rule limits: 5000 metric alert rules and 512 log alert rules per subscription. Workaround: Use dynamic thresholds or combine multiple conditions in a single alert rule
  • Data Collection Rule limits: 10 DCRs per subscription. Workaround: Design reusable DCRs that apply to multiple resource groups using tags

💡 Tips for Understanding:

  • Remember the pipeline: Source → Collection → Storage → Analysis → Action. Every monitoring solution follows this flow
  • Think in layers: Platform metrics (free, automatic) → Resource logs (configured per resource) → Guest OS/App data (requires agent)
  • Metrics are for "now", Logs are for "why": Use metrics to know that there's a problem (high CPU). Use logs to know why (which process, what was it doing)
  • Workspace design is critical: Decide workspace strategy early (single central, per-environment, per-region) because migration is painful

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: "I enabled diagnostic settings on my subscription, so all resources are monitored"

    • Why it's wrong: Diagnostic settings are per-resource, not inherited. You must enable on each resource or use Azure Policy to automate
    • Correct understanding: Use Azure Policy with DeployIfNotExists effect to automatically create diagnostic settings when resources are created
  • Mistake 2: "I'll just send all logs to Log Analytics for long-term storage"

    • Why it's wrong: Log Analytics retention beyond 30 days costs $0.10/GB/month. Storing 1 TB for a year costs $1,200
    • Correct understanding: Keep 30-90 days in Log Analytics for active querying, archive older logs to Storage Account ($0.002/GB/month = $24/year for 1 TB)
  • Mistake 3: "Azure Monitor Agent and Log Analytics Agent are the same"

    • Why it's wrong: Log Analytics Agent (MMA) is deprecated. Azure Monitor Agent (AMA) is the replacement with different configuration model
    • Correct understanding: Migrate to AMA before MMA retirement (August 2024). AMA uses Data Collection Rules (DCRs) for flexible, scalable configuration
  • Mistake 4: "I can't monitor on-premises servers with Azure Monitor"

    • Why it's wrong: Azure Arc extends Azure management to any server anywhere (on-premises, other clouds, edge)
    • Correct understanding: Arc-enabled servers have identical monitoring capabilities as Azure VMs using same Azure Monitor Agent and Data Collection Rules

🔗 Connections to Other Topics:

  • Relates to Security (Domain 1.2) because: Diagnostic logs provide audit trail for compliance; Log Analytics integrates with Microsoft Sentinel for SIEM; identity anomalies detected through sign-in log analysis
  • Builds on Azure Resource Hierarchy (Fundamentals) by: Workspaces and monitoring scope can be aligned to management groups, subscriptions, or resource groups; multiple resource groups can share a single workspace; tagging enables flexible log queries
  • Often used with High Availability (Domain 3.2) to: Detect failures early with alerts; implement automated recovery with Azure Automation; prove SLA compliance with availability metrics
  • Connects to Cost Optimization (Well-Architected Framework) through: Identifying underutilized resources with metrics; rightsizing VMs based on performance data; log retention optimization

Troubleshooting Common Issues:

  • Issue 1: Diagnostic settings configured, but logs not appearing in Log Analytics

    • Solution: Check 1) Diagnostic setting uses correct workspace resource ID, 2) Selected log categories are actually generating data, 3) Wait 5-15 minutes for first logs to appear, 4) Verify no resource locks preventing log export
  • Issue 2: Log Analytics queries timeout or are very slow

    • Solution: 1) Add a time range filter at the beginning of the query (| where TimeGenerated > ago(1h)), 2) Filter rows early, then use summarize so later operators process fewer rows, 3) Create summary rules for frequently queried data, 4) Limit the columns returned with the project operator (see the sketch below)
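
Putting those fixes together, a before/after sketch on the Perf table (counter names follow the Windows convention used earlier in this chapter):

// Slow: unbounded time range, late filtering, all columns returned
// Perf | where CounterValue > 90 | where CounterName == "% Processor Time"

// Fast: time filter first, narrow early, aggregate, project only what you need
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| project Computer, TimeGenerated, AvgCPU
| order by AvgCPU desc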

Log Analytics Workspace - The Data Foundation

What it is: A Log Analytics Workspace is a unique environment in Azure Monitor where log data is collected, stored, and queried. It's a container with its own data retention settings, access controls, and query capabilities. Each workspace is essentially a database optimized for storing and analyzing billions of log records from diverse sources.

Why it exists: Before Log Analytics workspaces, organizations had logs scattered across multiple systems - Windows Event Viewer, Linux syslog, application log files, network device syslogs. Correlating these for troubleshooting required:

  • Manually collecting logs from each system
  • Writing custom scripts to parse different formats
  • No unified query language
  • No long-term retention strategy

Log Analytics Workspaces solve this by providing a centralized repository with a powerful query language (KQL) and built-in connectors to hundreds of data sources.

Real-world analogy: Think of a Log Analytics Workspace like a centralized library:

  • Books (logs) from many publishers (Azure resources, apps, servers) are collected in one location
  • Catalog system (schema) organizes books by topic (tables like Heartbeat, Perf, Event)
  • Search system (KQL) lets you find exactly what you need across millions of books
  • Retention policy determines how long books are kept on-site vs archived off-site
  • Access control determines who can read which books

How it works (Detailed):

  1. Workspace Creation: You create a workspace in a specific Azure region (data will reside there for compliance). You choose a pricing tier: Pay-as-you-go ($2.76/GB) or Commitment Tier (discounts starting at 100GB/day).

  2. Data Ingestion: Diagnostic settings from Azure resources, Azure Monitor Agent from VMs, and Application Insights send data to the workspace. Each data stream goes into specific tables (AzureDiagnostics, Perf, Syslog, ContainerLog, etc.). Custom data can go to custom tables (name ends with _CL).

  3. Schema Management: Workspace maintains schema for all tables. Common columns like TimeGenerated, ResourceId, Type exist in all tables. Each table has specific columns for its data type (Perf has CounterName, CounterValue; Syslog has Facility, SeverityLevel).

  4. Data Processing: Transformation rules can parse, filter, or enrich data during ingestion. For example, extract JSON from raw text logs, add custom tags based on content, or filter out noisy logs before storage (saving cost).

  5. Storage & Retention: Data is actively queryable for the retention period (default 30 days, max 730 days). After that, it can be automatically archived to cheaper storage or deleted. Archived data can still be restored for querying if needed (within 12 days).

  6. Querying: Users write KQL queries against workspace tables. Queries can span multiple tables, join data sources, perform aggregations, and create visualizations (see the sketch after this list). Query results can be exported, pinned to dashboards, or used in alert rules.

  7. Access Control: Azure RBAC determines who can query the workspace. Table-level RBAC restricts access to sensitive tables (like SecurityEvent). Resource-context access allows users to see only logs from resources they have permission to.
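
Because every table sits behind the same KQL engine, a single query can correlate data across sources. A minimal sketch joining agent heartbeats with performance data (standard Heartbeat and Perf schemas assumed):

// KQL sketch: pair each reporting computer with its recent average CPU
Heartbeat
| where TimeGenerated > ago(30m)
| summarize LastSeen = max(TimeGenerated) by Computer
| join kind=inner (
    Perf
    | where TimeGenerated > ago(30m)
    | where CounterName == "% Processor Time"
    | summarize AvgCPU = avg(CounterValue) by Computer
) on Computer
| project Computer, LastSeen, AvgCPU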

📊 Log Analytics Workspace Architecture Diagram:

graph TB
    subgraph "Data Sources"
        A[Diagnostic Settings<br/>Azure Resources]
        B[Azure Monitor Agent<br/>VMs & Arc Servers]
        C[Application Insights<br/>Apps & Services]
        D[Custom Sources<br/>REST API, Data Collector]
    end

    subgraph "Log Analytics Workspace"
        E[Ingestion Pipeline<br/>Parsing & Transformation]
        F[Standard Tables<br/>Perf, Syslog, Event]
        G[Azure Tables<br/>AzureDiagnostics, AzureMetrics]
        H[Custom Tables<br/>*_CL Tables]
        I[Analytics Engine<br/>KQL Query Processor]
    end

    subgraph "Storage Tiers"
        J[Interactive Analytics<br/>0-30 days]
        K[Long-term Retention<br/>31-730 days]
        L[Archive Storage<br/>Low-cost archival]
    end

    subgraph "Consumption"
        M[Log Analytics UI<br/>Interactive Queries]
        N[Workbooks<br/>Custom Reports]
        O[Alert Rules<br/>KQL-based Alerts]
        P[Power BI<br/>Business Intelligence]
    end

    A --> E
    B --> E
    C --> E
    D --> E

    E --> F
    E --> G
    E --> H

    F --> I
    G --> I
    H --> I

    I --> J
    J --> K
    K --> L

    I --> M
    I --> N
    I --> O
    I --> P

    style A fill:#e1f5fe
    style B fill:#e1f5fe
    style C fill:#e1f5fe
    style D fill:#e1f5fe
    style E fill:#fff3e0
    style F fill:#f3e5f5
    style G fill:#f3e5f5
    style H fill:#f3e5f5
    style I fill:#fff3e0
    style J fill:#e8f5e9
    style K fill:#e8f5e9
    style L fill:#e8f5e9
    style M fill:#c8e6c9
    style N fill:#c8e6c9
    style O fill:#c8e6c9
    style P fill:#c8e6c9

See: diagrams/02_domain_1_log_analytics_workspace.mmd

Diagram Explanation: Log Analytics Workspace architecture shows the complete data flow from ingestion to consumption. Data sources (blue) send logs via different mechanisms - diagnostic settings for Azure services, agents for VMs, SDKs for applications, or custom APIs for third-party data. The ingestion pipeline (orange) parses incoming data and applies transformation rules. Data is organized into tables (purple) - standard tables for common data types, Azure-specific tables for platform data, and custom tables for specialized scenarios. The KQL query engine provides unified access across all tables. Storage tiers (green) implement hot (fast, expensive) to cold (slow, cheap) data lifecycle. Consumption layer (light green) offers multiple ways to access and analyze data - interactive queries, custom reports, alerts, and BI dashboards.

Application Insights - Application Performance Management

What it is: Application Insights is an Application Performance Management (APM) service within Azure Monitor that provides deep insights into your web applications and services. It automatically captures telemetry about requests, dependencies, exceptions, metrics, and user behavior without requiring significant code changes.

Why it exists: Traditional monitoring only tells you that your application is running (or not). It doesn't answer critical questions like: Which pages are slow? Which database queries are blocking users? What's the user journey when the app fails? Where do users abandon the checkout process? Application Insights answers these by providing distributed tracing, performance profiling, and user analytics.

Real-world analogy: Imagine Application Insights as a black box recorder + surveillance system for your app:

  • Flight data recorder: Captures every request, dependency call, and exception with precise timestamps
  • Security cameras: Tracks user flows through your application (which pages, which actions, where they drop off)
  • Performance analyzer: Identifies slow components and bottlenecks automatically
  • Crash investigation tool: Provides complete stack traces and context when errors occur

Detailed Example: Microservices Distributed Tracing:

Situation: E-commerce platform with microservices architecture:

  • Frontend (React SPA) → API Gateway → Order Service → Payment Service → Inventory Service → Shipping Service
  • Each service is independent, some in different regions
  • Need end-to-end visibility when orders fail

Solution with Application Insights:

  1. Instrument Each Service: Install Application Insights SDK in each microservice, configure same Application Insights resource

  2. Automatic Distributed Tracing: Application Insights automatically propagates correlation IDs across services using W3C Trace Context headers

  3. End-to-End Transaction View: When a user places an order, one transaction ID tracks the complete flow (a KQL sketch for retrieving such a transaction follows this list):

    Request to /api/order (200 OK, 850ms total)
    ├─ Dependency: POST payment-service.com/charge (200 OK, 340ms)
    │  └─ Dependency: SQL query SELECT * FROM PaymentMethods (180ms)
    ├─ Dependency: POST inventory-service.com/reserve (500 Error, 120ms)
    │  └─ Exception: InsufficientStockException
    └─ Dependency: POST shipping-service.com/calculate (200 OK, 85ms)
    
  4. Automatic Failure Analysis: When inventory service fails, Application Insights shows exactly: which customer, what they ordered, which specific product failed, the complete stack trace, related logs from that service
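
Behind that transaction view is a shared operation_Id propagated across services. Retrieving one transaction yourself is a single KQL query; a sketch using the classic Application Insights table names (the correlation ID value is hypothetical):

// KQL sketch: stitch one distributed transaction together by its correlation ID
union requests, dependencies, exceptions
| where operation_Id == "4bf92f3577b34da6"
| project timestamp, itemType, name, resultCode, duration
| order by timestamp asc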

Must Know - Application Insights:

  • Workspace-based is mandatory: Classic Application Insights retired; always use workspace-based (data stored in Log Analytics workspace)
  • Sampling reduces costs: Adaptive sampling automatically reduces telemetry volume during high traffic while preserving statistical accuracy (default: max 5 events/sec)
  • Correlation IDs are automatic: Distributed tracing works out-of-box with supported frameworks (ASP.NET, Node.js, Java Spring Boot)
  • Smart Detection uses ML: Automatically detects anomalies in failure rates, response time degradation, memory leaks without configuration
  • Availability tests are global: Can test your app from 16 Azure regions worldwide, simulating user access patterns

Section 2: Design Authentication and Authorization Solutions

Introduction

The problem: Identity is the new perimeter. In the cloud era, resources are accessed from anywhere, by anyone, on any device. Traditional network-based security (firewalls, VPNs) is insufficient. How do you ensure only the right people access the right resources with the right permissions? How do you prevent password-based attacks when passwords are the weakest link? How do you audit who did what, when?

The solution: Microsoft Entra ID (formerly Azure AD) provides comprehensive identity and access management with modern authentication protocols, conditional policies that adapt to risk, and fine-grained authorization models. It's a cloud-native identity platform that replaces traditional Active Directory for cloud scenarios while still integrating with on-premises AD when needed.

Why it's tested: AZ-305 heavily tests identity architecture because:

  • Zero Trust requires strong identity: "Never trust, always verify" starts with identity
  • Compliance mandates: MFA, access reviews, privileged access management are regulatory requirements
  • Security breaches mostly involve identity: 80%+ of breaches involve compromised credentials
  • Complex scenarios: Hybrid identity, B2B collaboration, application integration all require architectural decisions

Core Concepts

Microsoft Entra ID - Cloud Identity Platform

What it is: Microsoft Entra ID is a cloud-based identity and access management service that handles authentication (proving who you are) and authorization (what you're allowed to do). It's the backbone of identity for Azure, Microsoft 365, and thousands of SaaS applications.

Why it exists: Traditional Active Directory was designed for on-premises networks with domain-joined devices. It doesn't natively support:

  • Cloud applications (SaaS apps need federated authentication)
  • Modern authentication protocols (OAuth 2.0, OpenID Connect, SAML)
  • Mobile devices (BYOD scenarios)
  • Conditional access (risk-based authentication)
  • External users (B2B collaboration)

Microsoft Entra ID solves these modern identity challenges while providing optional synchronization with on-premises AD for hybrid scenarios.

Real-world analogy: Think of Entra ID like a modern digital identity system:

  • Passport office (Entra ID): Central authority that verifies and issues identities
  • Passport (identity): Your verified digital identity with claims (name, email, group memberships)
  • Border checkpoints (Conditional Access): Check not just passport but also health status, travel history, device security before allowing entry
  • Customs declaration (authorization): What you're allowed to bring in (permissions, roles)

📊 Microsoft Entra ID Architecture Diagram:

graph TB
    subgraph "Identity Sources"
        A[Cloud-only Users<br/>Created in Entra ID]
        B[Synchronized Users<br/>From on-prem AD]
        C[Guest Users<br/>B2B Collaboration]
        D[Service Principals<br/>App Identities]
    end

    subgraph "Microsoft Entra ID Core"
        E[Authentication<br/>MFA, Passwordless]
        F[Directory Services<br/>Users, Groups, Devices]
        G[Application Gallery<br/>6000+ SaaS Apps]
        H[Identity Protection<br/>Risk Detection]
    end

    subgraph "Access Control"
        I[Conditional Access<br/>Policy Engine]
        J[Privileged Identity<br/>Management PIM]
        K[Entitlement<br/>Management]
        L[Access Reviews<br/>Recertification]
    end

    subgraph "Protected Resources"
        M[Azure Resources<br/>VMs, Storage, etc]
        N[Microsoft 365<br/>Exchange, SharePoint]
        O[SaaS Applications<br/>Salesforce, Workday]
        P[Custom Apps<br/>Your Applications]
    end

    A --> E
    B --> E
    C --> E
    D --> E

    E --> F
    F --> I
    H --> I

    I --> J
    I --> K
    I --> L

    I --> M
    I --> N
    I --> O
    I --> P

    F --> G
    G --> O

    style A fill:#e1f5fe
    style B fill:#e1f5fe
    style C fill:#e1f5fe
    style D fill:#e1f5fe
    style E fill:#fff3e0
    style F fill:#fff3e0
    style G fill:#fff3e0
    style H fill:#ffebee
    style I fill:#f3e5f5
    style J fill:#e8f5e9
    style K fill:#e8f5e9
    style L fill:#e8f5e9
    style M fill:#c8e6c9
    style N fill:#c8e6c9
    style O fill:#c8e6c9
    style P fill:#c8e6c9

See: diagrams/02_domain_1_entra_id_architecture.mmd

Diagram Explanation: Entra ID architecture has four main layers. Identity Sources (blue) include cloud-only users created directly in Entra ID, synchronized users from on-premises AD via Entra Connect, guest users invited for B2B collaboration, and service principals representing applications and managed identities. Entra ID Core (orange) provides authentication services with MFA and passwordless options, directory services for managing objects, application gallery with pre-integrated SaaS apps, and identity protection for risk-based decisions. Access Control (purple) implements Zero Trust through Conditional Access policies that evaluate signals before granting access, PIM for just-in-time privileged access, Entitlement Management for access packages, and Access Reviews for periodic recertification. Protected Resources (green) include Azure services using RBAC, Microsoft 365 workloads, third-party SaaS applications via federation, and custom applications integrated via OAuth/OIDC or SAML.

Conditional Access - Dynamic Policy Engine

What it is: Conditional Access is the policy engine in Microsoft Entra ID that enforces access controls based on real-time signals like user identity, device state, location, application sensitivity, and risk level. It's Zero Trust in action - never trust, always verify.

Why it exists: Static access control (username + password) is insufficient because:

  • Context matters: Accessing payroll from corporate network on managed device is different than from coffee shop on personal phone
  • Risk is dynamic: User behavior can indicate compromise (impossible travel, anonymous IP usage)
  • Compliance requires adaptive controls: Regulations mandate stronger authentication for sensitive data

Conditional Access evaluates all these signals and applies appropriate controls (require MFA, block access, limit actions) in real-time.

How it works (Detailed step-by-step):

  1. User attempts access: User tries to sign in to an application (Azure Portal, Microsoft 365, SaaS app)

  2. Identity verification: User authenticates with username/password or passwordless method

  3. Signal collection: Entra ID collects signals:

    • User/Group membership: Is user in "Finance Team" group?
    • Location: IP address geolocation (country, city, ASN)
    • Device state: Is device Entra joined? Compliant with Intune policies?
    • Application: What app are they accessing? Sensitivity level?
    • Risk level: Any suspicious activity? Anonymous IP? Atypical travel?
    • Session: First-time access? New device?
  4. Policy evaluation: All Conditional Access policies are evaluated. Policies have:

    • Assignments: Who this policy applies to (users, groups, roles)
    • Conditions: When policy triggers (locations, device platforms, client apps, risk levels)
    • Access controls: What to enforce (grant with MFA, block, session limits)
  5. Decision aggregation: If multiple policies apply:

    • Grant controls accumulate → Access is granted only when the controls of every applicable policy are satisfied
    • Any policy blocks → Access blocked (block always wins)
    • Report-only policies → Logged but not enforced (for testing)
  6. Control enforcement: Selected controls are enforced:

    • Grant controls: Require MFA, require compliant device, require password change, require approved client app
    • Session controls: Limit session duration, restrict download/print (with Defender for Cloud Apps), continuous access evaluation
  7. Continuous evaluation: Access token issued for limited time. If user/device state changes (device becomes non-compliant, location changes to risky region), access is revoked immediately

Detailed Example: Financial Services Conditional Access Architecture:

Situation: Investment bank needs to protect sensitive financial data while allowing flexible work (remote, BYOD). Different data sensitivity requires different controls.

Requirements:

  • Employees accessing Office 365 from anywhere: Require MFA
  • Access to trading platform: Require managed/compliant device + MFA
  • Admin access to Azure: Require PAW (Privileged Access Workstation) + phishing-resistant MFA
  • Third-party vendors: Block access to sensitive apps, allow only specific apps
  • Detect and block risky sign-ins automatically

Conditional Access Policy Design:

Policy 1: Baseline MFA for All Users (a Microsoft Graph JSON sketch of this policy follows the policy list)

  • Assignments: All users
  • Conditions: Any location, any device
  • Controls: Require MFA (excluding emergency access accounts)

Policy 2: Trading Platform Protection

  • Assignments: Trading team
  • Application: Trading application
  • Conditions: Any location
  • Controls: Require device to be Entra hybrid joined AND compliant with Intune policies AND MFA

Policy 3: Admin Protection

  • Assignments: Azure admin roles (Global Admin, Security Admin, etc.)
  • Applications: Azure management
  • Conditions: Any location
  • Controls: Require FIDO2 security key (phishing-resistant) AND compliant device

Policy 4: High-Risk Block

  • Assignments: All users
  • Conditions: Sign-in risk is High
  • Controls: Block access (forces password reset to regain access)

Policy 5: Vendor Access Restriction

  • Assignments: B2B guest users (vendors)
  • Conditions: Any location
  • Controls: Grant access only to approved vendor apps + Require MFA
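
For reference, Policy 1 corresponds roughly to this Microsoft Graph conditionalAccessPolicy object. A hedged sketch: the exclusion group ID is a placeholder, and the state is set to report-only for safe rollout:

{
  "displayName": "Policy 1 - Baseline MFA for All Users",
  "state": "enabledForReportingButNotEnforced",
  "conditions": {
    "users": {
      "includeUsers": ["All"],
      "excludeGroups": ["<emergency-access-group-id>"]
    },
    "applications": { "includeApplications": ["All"] },
    "clientAppTypes": ["all"]
  },
  "grantControls": {
    "operator": "OR",
    "builtInControls": ["mfa"]
  }
}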

What happens:

  • Employee logs in from home laptop to Outlook (corporate network not required, Policy 1 applies → MFA required)
  • Trader tries to access trading platform from personal iPad (Policy 2 applies → blocked because iPad is not Entra joined)
  • Global Admin accesses Azure Portal from managed PAW (Policy 3 applies → must use FIDO2 key)
  • User's account is compromised in a phishing attack; the attacker tries to log in from a TOR network (Policy 4 detects high risk → blocked immediately)
  • Vendor partner accesses approved collaboration tool (Policy 5 allows with MFA → cannot access internal financial systems)

Must Know - Conditional Access:

  • Evaluation is real-time: Policies evaluated on every sign-in attempt; changing a policy affects next sign-in
  • AND logic within a policy: All configured conditions in a policy must match for it to apply (multiple values inside one condition are OR). When several policies apply to the same sign-in, the user must satisfy the controls of all of them
  • Report-only mode for testing: Deploy policies in report-only to see impact before enforcement
  • Blocking is absolute: If any policy blocks access, user cannot access even if other policies grant
  • Continuous Access Evaluation (CAE): Revokes access in near real-time when critical events occur (user disabled, password changed, location changed to risky region)
  • Requires Entra ID P1 or P2: P1 for basic Conditional Access, P2 for risk-based policies with Identity Protection

Privileged Identity Management (PIM) - Just-in-Time Access

What it is: PIM is a service in Entra ID that enables just-in-time (JIT) privileged access to Azure resources, Entra ID roles, and Microsoft 365 services. Instead of permanent admin rights, users request time-limited elevation when needed.

Why it exists: Standing privileged access is a security risk because:

  • Broad attack surface: Compromised admin account = full access
  • Insider threat: Admins have access even when not needed
  • Compliance violations: Cannot prove minimal necessary access
  • Audit difficulties: Who used admin rights when and for what?

PIM implements "least privilege" and "just enough administration" by making privileged access temporary, audited, and approved.

How it works (Detailed):

  1. Role Discovery: PIM scans Entra ID and Azure subscriptions to identify current role assignments (who has what admin rights)

  2. Convert to Eligible: Permanent assignments are converted to eligible assignments (user doesn't have access until activated)

  3. Activation Request: When a user needs privileged access (see the Graph sketch after this list):

    • User requests activation in PIM portal
    • Specifies duration (1-24 hours based on role settings)
    • Provides business justification
    • Completes MFA if required
    • Gets approval if required (from designated approvers)
  4. Time-Limited Access: Once approved:

    • Role is activated for specified duration
    • User has actual permissions in Entra ID/Azure
    • All actions are logged with justification
    • Notification sent to security team
  5. Automatic Expiration: When time expires:

    • Role assignment automatically removed
    • User no longer has elevated permissions
    • No manual cleanup needed
  6. Access Reviews: Periodic reviews ensure:

    • Only necessary people are eligible for roles
    • Remove eligibility for users who no longer need it
    • Reviewers receive notifications to approve/deny continued access
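
Step 3's activation request can also be scripted. A hedged sketch of the Microsoft Graph payload (POST /roleManagement/directory/roleAssignmentScheduleRequests) for an Entra ID role; Azure resource roles use the equivalent ARM API instead, and all IDs and the justification below are placeholders:

{
  "action": "selfActivate",
  "principalId": "<your-user-object-id>",
  "roleDefinitionId": "<role-definition-id>",
  "directoryScopeId": "/",
  "justification": "Deploying release, ticket INC-4821",
  "scheduleInfo": {
    "startDateTime": "2024-03-01T09:00:00Z",
    "expiration": { "type": "afterDuration", "duration": "PT8H" }
  }
}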

Detailed Example: Cloud Operations Team PIM Design:

Situation: Cloud platform team manages Azure subscriptions for entire organization (500+ subscriptions). Team has 15 engineers who occasionally need Owner or Contributor access. Want to minimize standing privileges.

PIM Configuration:

  1. Role Settings for Owner Role:

    • Activation maximum duration: 8 hours
    • Require justification on activation: Yes
    • Require ticket number on activation: Yes
    • Require MFA on activation: Yes
    • Require approval: Yes (from Cloud Architect)
    • Notification: Alert security team on activation
  2. Role Settings for Contributor Role:

    • Activation maximum duration: 12 hours
    • Require justification: Yes
    • Require MFA: Yes
    • Require approval: No (self-service)
    • Notification: Email to manager
  3. Eligible Assignment:

    • 15 engineers made eligible for Contributor on their assigned subscriptions
    • 3 senior engineers made eligible for Owner on production subscriptions
    • No permanent assignments
  4. Access Review Schedule:

    • Quarterly review of all eligible assignments
    • Manager reviews: Does engineer still need this role?
    • Automatically remove if not confirmed

What happens:

  • Engineer needs to deploy a new application (requires Contributor)
  • Requests activation in PIM portal, enters ticket number and justification
  • Completes MFA challenge
  • Access granted immediately (no approval needed for Contributor)
  • Has 12 hours to complete deployment
  • All deployment actions logged with PIM activation context
  • After 12 hours, Contributor automatically removed
  • If engineer tries to deploy again, must request activation again

Benefits realized:

  • Zero standing admin access (attack surface minimized)
  • Complete audit trail (who activated when, for what reason)
  • Compliance proof (access reviews demonstrate least privilege)
  • Reduced risk (if account compromised, attacker has no privileges without activation)

Must Know - PIM:

  • Requires Entra ID P2: PIM is P2-only feature
  • Roles vs Groups: Can make role-assignable groups, then use PIM for group membership (simpler at scale)
  • Azure PIM vs Entra PIM: Separate PIM workflows for Azure resource roles vs Entra ID admin roles
  • Permanent still possible: Can still have permanent assignments, but PIM tracks them and flags for review
  • Alerts for anomalies: PIM alerts on excessive activations, suspicious patterns, assignments outside policy

Section 3: Design Governance Solutions

Introduction

The problem: Without governance, Azure environments become chaotic. Subscriptions proliferate with inconsistent naming. Resources are deployed without cost controls or security standards. Compliance audits reveal violations because no one tracked who had what access. Teams fight over budget overruns because there's no cost attribution. The cloud promise of agility turns into cloud sprawl.

The solution: Azure governance provides the guardrails and controls needed to manage Azure at scale while maintaining agility. Through management groups, policies, cost management, and compliance frameworks, you can enforce standards, control spending, and prove regulatory compliance without slowing down development teams.

Why it's tested: AZ-305 tests governance extensively because:

  • Enterprise scale: Managing hundreds of subscriptions requires hierarchy and automation
  • Compliance mandates: Regulatory requirements (SOC 2, HIPAA, PCI-DSS) require enforceable controls
  • Cost control: Cloud bills can spiral without governance
  • Security baseline: Policy-driven security ensures consistent protection

Core Concepts

Management Groups - Hierarchical Organization

What it is: Management groups provide a hierarchy above subscriptions to organize and apply governance at scale. They're containers that group subscriptions, allowing you to apply policies, RBAC, and budgets to multiple subscriptions simultaneously.

Why it exists: Without management groups:

  • Repetitive configuration: Must apply same policies to each subscription individually
  • No organizational structure: Cannot reflect business units or geography in Azure hierarchy
  • Limited inheritance: Cannot cascade settings from parent to child subscriptions
  • Poor compliance visibility: Cannot see compliance across organizational boundaries

Management groups solve this by creating an organizational tree where governance rolls down from parent to child.

Real-world analogy: Think of management groups like a corporate org chart:

  • CEO (root management group): Top-level policies apply to entire company
  • Divisions (child management groups): Finance, Engineering, Sales each with division-specific policies
  • Departments (nested management groups): Within Engineering: Frontend, Backend, Infrastructure
  • Teams (subscriptions): Individual teams within departments with team-specific resources

How it works (Detailed):

  1. Hierarchy Creation: Management group tree created under tenant root (maximum 6 levels deep). Each level represents logical grouping (environment, business unit, geography, compliance boundary)

  2. Subscription Assignment: Subscriptions placed in appropriate management groups. One subscription can only be in one management group at a time

  3. Policy Assignment: Azure Policies assigned at management group level automatically apply to all child management groups and subscriptions (inheritance)

  4. RBAC Assignment: Role assignments at management group give access to all resources in child subscriptions

  5. Inheritance Flow: Settings flow down (parent to child) but cannot flow up. Child can add more restrictions but cannot loosen parent restrictions

Detailed Example: Enterprise Management Group Design:

Situation: Global manufacturing company with presence in US, Europe, Asia. Multiple business units (Manufacturing, Sales, IT). Each has production, development, and sandbox environments. Need to enforce different compliance standards by region and environment.

Management Group Structure:

Tenant Root
├── Corporate (MG)
│   ├── IT Department (MG)
│   │   ├── Production (MG)
│   │   │   ├── IT-Prod-Subscription-1
│   │   │   └── IT-Prod-Subscription-2
│   │   ├── Development (MG)
│   │   │   └── IT-Dev-Subscription-1
│   │   └── Sandbox (MG)
│   │       └── IT-Sandbox-Subscription-1
│   ├── Manufacturing (MG)
│   │   ├── US-Manufacturing (MG)
│   │   │   └── US-Mfg-Prod-Subscription-1
│   │   ├── EU-Manufacturing (MG)
│   │   │   └── EU-Mfg-Prod-Subscription-1
│   │   └── APAC-Manufacturing (MG)
│   │       └── APAC-Mfg-Prod-Subscription-1
│   └── Sales (MG)
│       ├── Sales-Prod (MG)
│       └── Sales-Dev (MG)
└── Decommissioned (MG)
    └── Legacy-Subscription-1

Governance Applied at Each Level:

Tenant Root Level:

  • Policy: Require tags (CostCenter, Owner, Environment)
  • Policy: Allowed Azure regions (East US, West Europe, Southeast Asia only)
  • Policy: Require TLS 1.2 minimum for all services
  • RBAC: Security team as Reader on all subscriptions

Corporate MG:

  • Policy: Enable Azure Defender on all subscriptions
  • Policy: Enforce diagnostic settings to central Log Analytics
  • Budget: $500K monthly cap with alerts at 80%, 100%, 120%

Production MG (under IT):

  • Policy: Require encryption at rest for all storage
  • Policy: No delete locks can be removed
  • Policy: Require backup for all VMs
  • RBAC: IT Ops team as Contributor

EU-Manufacturing MG:

  • Policy: Data must stay in West Europe region only (override parent)
  • Policy: Require GDPR compliance tags
  • Policy: Enable Azure Policy Regulatory Compliance dashboard for GDPR

What happens:

  • IT creates a storage account in IT-Prod-Subscription-1
  • Inherits from Tenant Root: Must have tags, must be in allowed region, must use TLS 1.2
  • Inherits from Corporate: Will have Azure Defender enabled, diagnostic settings configured
  • Inherits from Production: Must be encrypted at rest, protected by delete lock, backup configured
  • Any violation → Deployment blocked or flagged as non-compliant

Benefits realized:

  • Consistency: All production environments have same security baseline
  • Compliance: EU data stays in EU (GDPR), policies prove compliance
  • Cost control: Budgets at business unit level show who spent what
  • Delegation: Developers can't deploy to production, but have Contributor in sandbox

Must Know - Management Groups:

  • Maximum depth: 6 levels (not including tenant root and subscription level)
  • Inheritance is cumulative: Child subscriptions inherit all policies from all parent management groups (cannot escape)
  • One subscription, one MG: Subscription can only be in one management group at a time (but can be moved)
  • Tenant Root is automatic: Every tenant has a root management group; cannot be deleted
  • RBAC is additive: User with Reader at root + Contributor at child = Contributor on that child
  • Policy merge: If parent and child have different allowed regions, child must satisfy both (intersection, not union)

Azure Policy - Governance Automation

What it is: Azure Policy is a service that creates, assigns, and manages policies to enforce rules and effects over your Azure resources. Policies can audit resources for compliance, deny non-compliant deployments, or automatically remediate resources to comply with standards.

Why it exists: Manual governance doesn't scale and isn't enforceable:

  • Human error: Administrators forget to enable encryption, miss required tags
  • No prevention: Can't block non-compliant deployments before they happen
  • Audit burden: Must manually check thousands of resources for compliance
  • Drift: Resources compliant at creation drift over time

Azure Policy provides automated, continuous governance with preventive and detective controls.

How it works (Detailed):

  1. Policy Definition: A JSON document (see the sketch after this list) defining:

    • Rule: Condition to evaluate (if storage account && encryption disabled)
    • Effect: What to do when condition is true (Deny, Audit, Append, Modify, DeployIfNotExists)
    • Parameters: Variables to make policy reusable (allowed regions, required tags)
  2. Policy Assignment: Policy definition applied to a scope (management group, subscription, resource group). Assignment can:

    • Include or exclude specific resources
    • Set parameter values (e.g., allowed regions = "East US, West US")
    • Define non-compliance messages
  3. Evaluation: Policy engine evaluates resources:

    • On create/update: Before resource deployment (for Deny/Append/Modify effects)
    • Scheduled scan: Every 24 hours for all resources (for Audit/AuditIfNotExists)
    • On-demand: Manual compliance evaluation triggered
  4. Compliance Reporting: Dashboard shows:

    • Overall compliance percentage
    • Non-compliant resources by policy
    • Exempt resources and reasons
    • Compliance over time (trending)
  5. Remediation: For DeployIfNotExists and Modify policies:

    • Manual remediation task remediates existing non-compliant resources
    • Managed identity required to perform remediation actions
    • Can schedule automatic remediation
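
Steps 1 and 2 in concrete form: a sketch of the canonical "allowed locations" definition, showing the rule, a Deny effect, and a reusable parameter:

{
  "properties": {
    "displayName": "Allowed locations",
    "policyType": "Custom",
    "mode": "Indexed",
    "parameters": {
      "allowedLocations": {
        "type": "Array",
        "metadata": {
          "displayName": "Allowed locations",
          "description": "Regions where resources may be deployed"
        }
      }
    },
    "policyRule": {
      "if": {
        "not": { "field": "location", "in": "[parameters('allowedLocations')]" }
      },
      "then": { "effect": "deny" }
    }
  }
}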

Detailed Example: Comprehensive Policy Strategy:

Situation: Financial services company needs to prove SOC 2 compliance. Requirements include encryption, logging, network security, identity controls. Want to prevent violations, not just detect them.

Policy Initiative (Bundle of Policies):

Initiative: SOC 2 Compliance

  1. Encryption Policies:

    • Require encryption at rest for storage accounts - Effect: Deny - Blocks unencrypted storage accounts
    • Require TDE for SQL databases - Effect: DeployIfNotExists - Automatically enables TDE if missing
    • Require disk encryption for VMs - Effect: Audit - Flags VMs without disk encryption
  2. Logging Policies:

    • Require diagnostic settings for all resources - Effect: DeployIfNotExists - Automatically configures logging to central workspace
    • Require retention of 365 days for logs - Effect: Modify - Changes retention to meet compliance requirement
    • Deny deletion of diagnostic settings - Effect: Deny - Prevents removing logging configuration
  3. Network Security Policies:

    • Require NSG on all subnets - Effect: Deny - Blocks subnet creation without NSG
    • Deny public IP on production VMs - Effect: Deny - Prevents public exposure
    • Require private endpoints for PaaS - Effect: Audit - Flags public PaaS endpoints
  4. Identity & Access Policies:

    • Require MFA for admin accounts - Effect: Audit - Reports admins without MFA (can't enforce through policy, must use Conditional Access)
    • Require service principal expiration - Effect: Audit - Finds service principals without expiration date
    • Deny classic Azure resources - Effect: Deny - Blocks old deployment model

Assignment Strategy:

  • Initiative assigned at Corporate MG: All subscriptions inherit
  • Exemptions: Dev/Test subscriptions exempted from "Deny public IP" (need for testing)
  • Parameters: Allowed regions set to "East US 2" (primary data center)
  • Remediation: Weekly automated task remediates non-compliant diagnostic settings

What happens:

  • Developer tries to create unencrypted storage account → Denied immediately (can't deploy)
  • Someone creates SQL database without TDE → Automatically remediated within 15 minutes (DeployIfNotExists triggers)
  • Existing VM discovered without disk encryption → Flagged in compliance dashboard (Audit effect)
  • Compliance team shows dashboard to auditors → 99.8% compliant with SOC 2 policy initiative

Must Know - Azure Policy:

  • Policy vs Initiative: Policy is single rule; Initiative (Policy Set) is bundle of related policies
  • Effects are cumulative: If policy inherited from parent MG denies, child cannot override (can only add more restrictions)
  • Evaluation is eventual: New policies take 10-30 minutes to evaluate existing resources
  • Deny happens at deployment: Blocks ARM deployment before resource created
  • Audit doesn't prevent: Just flags non-compliance in dashboard, doesn't block
  • DeployIfNotExists needs identity: Managed identity required for remediation (must grant appropriate RBAC)
  • Custom policies: Can write custom policies in JSON for organization-specific requirements

Cost Management - FinOps in Azure

What it is: Cost Management + Billing provides tools to analyze, monitor, and optimize cloud spending. It includes cost analysis, budgets, alerts, recommendations, and integration with third-party FinOps tools.

Why it exists: Cloud costs are variable and can spiral out of control without visibility and controls:

  • Unpredictable bills: Don't know what's being spent until bill arrives
  • No accountability: Can't attribute costs to teams/projects/customers
  • Waste: Idle resources, oversized VMs, inefficient services running
  • No optimization: Not using reservations, savings plans, or spot instances

Cost Management provides transparency and control to maximize cloud value while minimizing waste.

How it works (Detailed):

  1. Cost Analysis: Interactive tool showing:

    • Actual costs (what's already spent)
    • Forecasted costs (prediction based on trends)
    • Breakdown by service, resource group, tag, location
    • Filtering, grouping, time ranges
  2. Budgets: Set spending limits with alerts (see the ARM sketch after this list):

    • Monthly, quarterly, annual budgets
    • Alert at thresholds (50%, 80%, 100%, 110%)
    • Notifications via email, webhook, Action Group
    • Can trigger automation (e.g., scale down at 100%)
  3. Cost Allocation: Tags enable showback/chargeback:

    • Tag resources with CostCenter, Project, Environment
    • Generate reports by tag
    • Export to billing systems for chargeback
  4. Recommendations: Azure Advisor suggests:

    • Rightsizing: Reduce VM size based on utilization
    • Reservations: Buy 1-year or 3-year commitments for predictable workloads
    • Idle resources: Delete unattached disks, unused public IPs
    • Savings plans: Commit to an hourly compute spend for up to 65% discount (reservations reach up to 72% for specific services)
  5. Exports: Scheduled exports to storage:

    • Daily cost details
    • Monthly invoice
    • Format: CSV or Parquet
    • Automate ingestion into BI tools
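
Step 2's budgets can be deployed as code rather than clicked together. A hedged sketch of a Microsoft.Consumption/budgets ARM resource; the API version, amount, and email are illustrative:

{
  "type": "Microsoft.Consumption/budgets",
  "apiVersion": "2021-10-01",
  "name": "team-b-prod-monthly",
  "properties": {
    "category": "Cost",
    "amount": 500000,
    "timeGrain": "Monthly",
    "timePeriod": { "startDate": "2024-01-01T00:00:00Z" },
    "notifications": {
      "Actual80Percent": {
        "enabled": true,
        "operator": "GreaterThan",
        "threshold": 80,
        "contactEmails": ["team-b-lead@contoso.com"]
      }
    }
  }
}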

Detailed Example: Multi-Team FinOps Architecture:

Situation: SaaS company with 50 product teams sharing Azure environment. CFO wants each team to own their cloud spend. Need cost visibility, controls, and optimization.

FinOps Implementation:

1. Tagging Strategy:

  • Required tags enforced by Azure Policy:
    • CostCenter: Finance code for chargeback
    • Team: Product team name
    • Environment: Prod, Dev, Test
    • Project: Specific project or customer
  • Policy denies resource creation without these tags

2. Budget Structure:

Corporate Budget: $2M/month
├── Production Budget: $1.5M/month (alert at 90%)
│   ├── Team-A-Prod: $300K/month
│   ├── Team-B-Prod: $500K/month
│   └── Team-C-Prod: $200K/month
└── Non-Production Budget: $500K/month (alert at 80%)
    ├── Team-A-Dev: $100K/month
    ├── Team-B-Dev: $150K/month
    └── Shared-Test: $50K/month

3. Alert Configuration:

  • 80% budget → Email to team lead
  • 100% budget → Email to team + manager + webhook to Slack
  • 110% budget → Trigger Azure Automation runbook:
    • Deallocate non-production VMs
    • Scale down dev app services to free tier
    • Send warning to team

4. Cost Optimization Automation:

  • Idle Resource Cleanup: Logic App runs weekly
    • Query: Find VMs stopped for > 7 days
    • Action: Snapshot disk, delete VM, notify owner
  • Rightsizing Recommendations: Power Automate flow
    • Get Azure Advisor recommendations
    • Create ServiceNow tickets for teams
    • Track implementation and savings

5. Chargeback Reports:

  • Monthly export: Costs grouped by CostCenter tag
  • Power BI dashboard showing:
    • Cost per team (stacked by service type)
    • Trend over 12 months
    • Budget vs actual
    • Top 10 expensive resources per team
  • Finance team uses for internal billing

What happens:

  • Team B deploys new microservices architecture in production
  • Costs jump 30% in first week
  • Budget alert at 80% ($400K spent of $500K budget)
  • Team lead gets email notification
  • Checks Cost Analysis filtered by Team-B tag
  • Discovers oversized VM SKU (Standard_D32s instead of D8s)
  • Rightsizes VM, cost drops 60%
  • Month ends at $480K (under budget)
  • Finance generates chargeback report showing Team B costs by service
  • Team B budget adjusted up for next month based on new workload

Must Know - Cost Management:

  • Cost Analysis is free: No charge to use cost analysis tools
  • Budgets are alerts only: Budgets don't prevent spending, just notify (must integrate with automation for enforcement)
  • Tags are key: Can't allocate costs without tags; enforce tagging with policy
  • Costs are delayed: 8-24 hours lag for costs to appear in portal (not real-time)
  • Reservations vs Savings Plans: Reservations are tied to a specific service, SKU, and region (deepest discount); Savings Plans apply flexibly to any eligible compute
  • Export for long-term: Cost Analysis UI only shows 13 months; export to storage for longer retention
  • Forecast accuracy: Machine learning forecast improves with more data; less accurate for new subscriptions

Chapter Summary

What We Covered

Monitoring & Logging Solutions:

  • Azure Monitor architecture and data flow
  • Log Analytics Workspace design strategies
  • Application Insights for distributed tracing
  • Alert rules and action groups
  • VM Insights and Container Insights

Authentication & Authorization:

  • Microsoft Entra ID as cloud identity platform
  • Conditional Access for Zero Trust enforcement
  • Privileged Identity Management (PIM) for JIT access
  • Hybrid identity with Entra Connect
  • Azure RBAC and role assignment strategies

Governance Solutions:

  • Management group hierarchy design
  • Azure Policy for compliance automation
  • Cost Management and FinOps practices
  • Tagging strategies for cost allocation
  • Budgets and automated cost controls

Critical Takeaways

  1. Monitoring is layered: Platform metrics (free) → Resource logs (configured) → Guest/App data (requires agent)
  2. Zero Trust starts with identity: Conditional Access + PIM + MFA is foundation
  3. Governance must be inherited: Management groups + policies enforce at scale
  4. Tags enable everything: Without tags, can't allocate costs or query logs effectively
  5. Cost optimization is continuous: Not one-time; requires ongoing review and automation

Self-Assessment Checklist

Test yourself before moving on:

  • I can design a multi-region monitoring solution with Log Analytics workspaces
  • I can explain when to use Metrics vs Logs vs Application Insights
  • I can design Conditional Access policies for different security scenarios
  • I can architect PIM for just-in-time administrative access
  • I can create management group hierarchy for enterprise governance
  • I can write Azure Policies with appropriate effects (Deny, Audit, DeployIfNotExists)
  • I can design cost allocation strategy using tags and budgets
  • I understand which features require P1 vs P2 Entra ID licenses

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-50 (comprehensive domain coverage)
  • Identity & Security Bundle: Questions 1-50 (focused on auth/authz)
  • Governance & Compliance Bundle: Questions 1-50 (focused on policies/costs)

Expected score: 75%+ to proceed

If you scored below 75%:

  • Review sections: [Conditional Access decision flows, Policy effects comparison, Management group inheritance]
  • Focus on: [Hybrid identity scenarios, Cost optimization strategies, Monitoring workspace design]

Quick Reference Card

Key Services:

  • Azure Monitor: Unified observability platform (metrics + logs)
  • Log Analytics: Queryable log storage with KQL
  • Application Insights: APM with distributed tracing
  • Microsoft Entra ID: Cloud identity and access management
  • Conditional Access: Real-time Zero Trust policy engine
  • PIM: Just-in-time privileged access
  • Azure Policy: Automated compliance enforcement
  • Cost Management: FinOps and spend optimization

Key Concepts:

  • Metrics vs Logs: Numeric time-series vs rich text/JSON records
  • Workspace design: Single central vs per-region vs per-environment
  • Conditional Access AND logic: All conditions must match within policy
  • Policy inheritance: Child cannot loosen parent restrictions
  • Tagging: Foundation for cost allocation and log queries

Decision Points:

  • Monitoring workspace strategy → Consider compliance, cost, query performance
  • Conditional Access enforcement → Report-only → Test → Enforce
  • PIM vs permanent roles → High privilege = PIM, Service accounts = permanent (with policy restrictions)
  • Management group depth → Business units = 2-3 levels, Regional = 3-4 levels
  • Policy effect selection → Preventive = Deny, Detective = Audit, Remediation = DeployIfNotExists/Modify

Next Chapter: Proceed to 03_domain_2_data_storage to learn about designing data storage solutions for relational, semi-structured, and unstructured data, plus data integration architectures.


Chapter 2: Design Data Storage Solutions (20-25% of exam)

Chapter Overview

What you'll learn:

  • Designing relational database solutions (Azure SQL, PostgreSQL, MySQL)
  • Architecting semi-structured and unstructured data storage (Blob, Data Lake, Cosmos DB)
  • Data integration patterns (Data Factory, Synapse, Event Hubs, Stream Analytics)
  • Storage performance, scalability, and cost optimization strategies

Time to complete: 10-14 hours

Prerequisites: Chapter 0 (Fundamentals) and Chapter 1 (Identity/Governance/Monitoring)


Section 1: Design Data Storage Solutions for Relational Data

Introduction

The problem: Applications need to store structured data with relationships (customers, orders, products). Traditional on-premises SQL servers are expensive to maintain, difficult to scale, and lack cloud-native features like automatic backups, geo-replication, and elastic scalability.

The solution: Azure provides multiple managed relational database services (Azure SQL Database, SQL Managed Instance, PostgreSQL, MySQL) that eliminate infrastructure management while providing enterprise features like high availability, automated backups, intelligent performance optimization, and global distribution.

Why it's tested: AZ-305 heavily tests database architecture because:

  • Workload fit: Must choose right database service for the workload
  • Performance vs cost: Balance performance requirements with budget constraints
  • Scalability patterns: Design for growth without re-architecture
  • HA/DR: Ensure business continuity for critical data

Core Concepts

Azure SQL Database - Cloud-Native Relational Database

What it is: Azure SQL Database is a fully managed Platform-as-a-Service (PaaS) relational database based on the latest stable version of Microsoft SQL Server. It offers serverless and auto-scaling compute options and includes built-in intelligence for performance optimization.

Why it exists: Organizations want SQL Server capabilities without managing servers, storage, backups, patching, or high availability infrastructure. Azure SQL Database provides all SQL Server features (T-SQL, stored procedures, indexes) with zero infrastructure management.

Real-world analogy: Azure SQL Database is like having a personal database admin team:

  • Server provisioning (automatic): No server setup, just create database
  • Patching (automatic): Updates applied with zero downtime
  • Backups (automatic): Point-in-time restore up to 35 days
  • Scaling (automatic): Resources increase/decrease based on load
  • Optimization (automatic): AI tunes queries and indexes

How it works (Detailed step-by-step):

  1. Provisioning: Create logical SQL server (container) and databases within it. Choose compute tier (vCore or DTU), service tier (General Purpose, Business Critical, Hyperscale), and compute generation.

  2. Compute Models:

    • vCore model: Choose # of cores (2-80 vCores) and memory (explicitly defined). Best for predictable workloads needing specific resources
    • DTU model: Database Transaction Units bundle compute, memory, IO. Choose performance level (10-4000 DTUs). Simpler but less granular
    • Serverless: Auto-pause during inactivity, auto-resume on connection. Pay per second of compute used. Perfect for dev/test or intermittent workloads
  3. Service Tiers:

    • General Purpose: Standard SSD storage, 99.99% SLA, read replicas for scale-out. Best for most workloads
    • Business Critical: Local SSD storage, 99.995% SLA, 1 read replica included, faster recovery. For mission-critical apps
    • Hyperscale: Up to 100TB database, fast backups, rapid scale-out. For largest databases needing massive scale
  4. Storage Architecture:

    • Data files in Azure Storage (remote for General Purpose, local SSD for Business Critical)
    • Transaction log replicated to 3 secondary nodes
    • Backups automatic to geo-redundant storage
  5. Connections: Applications connect via TDS protocol (same as on-prem SQL Server). Use connection pooling for efficiency. Private endpoints for network isolation.
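
Because the wire protocol is unchanged, existing SQL Server client code works as-is. A minimal connectivity sketch in Python, assuming pyodbc, ODBC Driver 18, and placeholder server/database/credential values:

# Minimal connection sketch (server, database, and credentials are placeholders).
import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:contoso-sql.database.windows.net,1433;"   # logical server FQDN
    "Database=OrdersDB;"
    "Uid=appuser;Pwd=<password>;"        # prefer Entra ID auth / Key Vault in production
    "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
)

# Reuse connections (pooling) instead of opening one per request.
with pyodbc.connect(conn_str) as conn:
    row = conn.execute("SELECT TOP 1 OrderId, Total FROM dbo.Orders ORDER BY OrderId DESC").fetchone()
    print(row.OrderId, row.Total)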

📊 Azure SQL Database Architecture Diagram:

graph TB
    subgraph "Client Layer"
        A[Application<br/>Connection Pool]
        B[Azure Data Studio<br/>Management]
    end

    subgraph "Azure SQL Database"
        C[Logical Server<br/>Firewall, AAD Auth]
        D[Database 1<br/>General Purpose]
        E[Database 2<br/>Business Critical]
        F[Elastic Pool<br/>Shared Resources]
    end

    subgraph "Compute Tiers"
        G[Serverless<br/>Auto-pause/resume]
        H[Provisioned vCore<br/>Dedicated resources]
        I[DTU-based<br/>Bundled resources]
    end

    subgraph "Storage & HA"
        J[Premium Storage<br/>Zone-redundant]
        K[Read Replicas<br/>Scale-out reads]
        L[Geo-Replica<br/>DR in another region]
    end

    subgraph "Management"
        M[Automatic Backups<br/>PITR 7-35 days]
        N[Automatic Tuning<br/>AI optimization]
        O[Threat Detection<br/>Security alerts]
    end

    A --> C
    B --> C
    C --> D
    C --> E
    C --> F

    D --> G
    E --> H
    F --> I

    D --> J
    E --> K
    E --> L

    J --> M
    K --> N
    L --> O

    style A fill:#e1f5fe
    style B fill:#e1f5fe
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#f3e5f5
    style F fill:#f3e5f5
    style G fill:#e8f5e9
    style H fill:#e8f5e9
    style I fill:#e8f5e9
    style J fill:#c8e6c9
    style K fill:#c8e6c9
    style L fill:#c8e6c9
    style M fill:#ffebee
    style N fill:#ffebee
    style O fill:#ffebee

See: diagrams/03_domain_2_azure_sql_database.mmd

Diagram Explanation: Azure SQL Database architecture shows logical server as the management boundary containing multiple databases. Clients connect through the logical server which handles firewall rules and authentication. Compute tiers offer flexibility - serverless for variable workloads, provisioned vCore for predictable performance, DTU for simplicity. Storage is always replicated for HA with options for read replicas and geo-replication. Management features like automatic backups, AI-driven tuning, and threat detection operate continuously without manual intervention.

Detailed Example 1: E-commerce Database Design:

Situation: Online retailer needs database for product catalog (100GB), orders (300GB/year growth), and customer data (50GB). Peak traffic 10,000 orders/hour during sales, normal 1,000/hour. Budget-conscious but need 99.99% availability.

Solution Architecture:

1. Service Tier Selection:

  • General Purpose tier - 99.99% SLA sufficient for e-commerce
  • vCore model - 8 vCores (for query parallelism) with 24GB memory
  • Zone-redundant - Protection against datacenter failures
  • Storage: Start with 500GB, autogrow enabled (up to 4TB)

2. Database Structure:

  • ProductCatalog DB (separate): Read-heavy, can use read-only replica for scale
  • Orders DB: Write-heavy, needs high IOPS
  • CustomerData DB: Medium load, contains PII (needs encryption, masking)

3. Scalability Pattern:

  • Use Elastic Pool for ProductCatalog replicas (cost sharing)
  • Serverless for analytics database (queried only during business hours)
  • Read scale-out for ProductCatalog: Route read queries to geo-secondary

4. Performance Optimization:

  • Columnstore indexes on Orders history for analytics
  • In-memory OLTP for shopping cart table (high concurrency; note this feature requires the Business Critical/Premium tier)
  • Query Performance Insight to identify slow queries
  • Automatic tuning enabled: auto-create/drop indexes

5. Security & Compliance:

  • Always Encrypted for credit card fields
  • Dynamic data masking for customer phone/email (non-admins see masked)
  • Auditing enabled: All access to CustomerData logged to Log Analytics
  • Private endpoints: No public internet access

6. Disaster Recovery:

  • Active geo-replication to secondary region
  • Failover group with read-write listener endpoint (auto-failover)
  • Auto-failover policy: Fail over if primary unavailable >1 hour
  • RPO: 5 seconds (minimal data loss)
  • RTO: 30 seconds (fast recovery)

What happens:

  • Black Friday sale: Traffic spikes 10x
  • Scale-up triggers: Automation increases vCores from 8 to 32 within minutes (provisioned compute doesn't autoscale on its own; connections drop briefly during the scale operation)
  • Read-only queries routed to geo-secondary replica (reduces primary load)
  • Shopping cart operations use in-memory tables (sub-millisecond response)
  • Primary region fails due to Azure outage
  • Failover group detects failure, promotes secondary to primary (30 sec)
  • Applications reconnect automatically (connection string has failover endpoint)
  • Order processing continues with <5 sec data loss
  • Post-sale: Usage drops, autoscale reduces to 8 vCores (cost optimization)

Must Know - Azure SQL Database:

  • General Purpose vs Business Critical: GP uses remote storage (lower cost, 99.99% SLA), BC uses local SSD (higher cost, 99.995% SLA, faster)
  • Serverless saves cost: Auto-pauses after 1 hour of inactivity, only pay for storage while paused. Resume takes roughly a minute on first connection
  • Hyperscale is different: Separate compute and storage, can scale storage to 100TB independently, fast backups via snapshots
  • Geo-replication is async: 5-second RPO means up to 5 seconds of data loss if primary fails
  • Elastic pools share resources: Multiple databases share vCores/DTUs, cost-effective when databases have complementary usage patterns
  • TDE is automatic: Transparent Data Encryption enabled by default, encryption at rest with service-managed keys

Section 2: Design Data Storage Solutions for Semi-Structured and Unstructured Data

Introduction

The problem: Modern applications generate massive amounts of non-relational data: JSON documents, log files, images, videos, IoT telemetry. Relational databases are expensive and inefficient for this data. Traditional file storage lacks global distribution, scalability, and rich querying.

The solution: Azure provides specialized storage for each data type: Blob Storage for unstructured data (files, images), Data Lake Storage for analytics, Cosmos DB for globally distributed semi-structured data, Table Storage for NoSQL key-value pairs.

Why it's tested: AZ-305 tests storage architecture because:

  • Cost optimization: Choosing wrong storage costs 10x more
  • Performance: Each storage type optimized for specific access patterns
  • Global distribution: Data locality impacts latency and compliance
  • Integration: Storage must integrate with analytics and processing pipelines

Core Concepts

Azure Blob Storage - Object Storage at Scale

What it is: Blob (Binary Large Object) Storage is Azure's object storage service for unstructured data. It's massively scalable (exabytes), globally available, and cost-optimized with multiple access tiers (Hot, Cool, Cold, Archive).

Why it exists: Files don't fit in databases. Traditional file servers don't scale globally. Need storage that:

  • Scales to billions of files without performance degradation
  • Replicates globally for low-latency access
  • Costs pennies per GB with cheaper archival tiers
  • Integrates with analytics tools (Data Factory, Synapse, Databricks)

Blob Storage solves all these while providing HTTP/REST access from any platform.

How it works (Detailed):

  1. Storage Account: Container for blobs with unique namespace (<account>.blob.core.windows.net). Choose performance tier (Standard or Premium), redundancy (LRS, ZRS, GRS), and access tier.

  2. Containers: Like folders, organize blobs. Set access level (private, blob-level public, container-level public). Apply retention policies and legal holds.

  3. Blob Types:

    • Block blobs: For text and binary files. Optimized for upload/download. Most common type.
    • Append blobs: For append operations (logs). Cannot modify existing data, only append.
    • Page blobs: For random read/write (VHD disks for VMs). Optimized for IaaS.
  4. Access Tiers (cost vs latency tradeoff):

    • Hot ($0.018/GB/month): Frequently accessed, low latency, highest storage cost but lowest access cost
    • Cool ($0.01/GB/month): Infrequently accessed (30+ days), 30-day min retention
    • Cold ($0.004/GB/month): Rarely accessed (90+ days), 90-day min retention
    • Archive ($0.001/GB/month): Rarely accessed (180+ days), 180-day min, hours to rehydrate
  5. Lifecycle Management: Automated tier transitions and deletion:

    {
      "rules": [{
        "name": "moveToArchive",
        "type": "Lifecycle",
        "definition": {
          "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["logs/"]},
          "actions": {
            "baseBlob": {
              "tierToCool": {"daysAfterModificationGreaterThan": 30},
              "tierToArchive": {"daysAfterModificationGreaterThan": 90},
              "delete": {"daysAfterModificationGreaterThan": 730}
            }
          }
        }
      }]
    }
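
The same policy can be applied programmatically. A hedged sketch using the azure-mgmt-storage Python SDK with placeholder subscription, resource group, and account names (the rules object is the same JSON shown above; the portal and CLI also accept it directly):

# Hedged sketch: applying the lifecycle policy above via the management SDK.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

lifecycle = {
    "policy": {
        "rules": [{
            "enabled": True,
            "name": "moveToArchive",
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["logs/"]},
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                        "delete": {"daysAfterModificationGreaterThan": 730}
                    }
                }
            }
        }]
    }
}

# Each storage account has a single policy object, always named "default".
client.management_policies.create_or_update(
    "<resource-group>", "<storage-account>", "default", lifecycle
)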

Detailed Example: Media Streaming Platform:

Situation: Video streaming service stores user-uploaded videos (1 PB total). New videos accessed frequently (week 1), occasionally (week 2-12), rarely after 3 months. Must retain 7 years for legal compliance. Need global distribution for low-latency streaming.

Solution Architecture:

1. Storage Strategy:

  • Hot tier: New uploads (last 7 days), frequently watched content
  • Cool tier: Videos 8-90 days old
  • Archive tier: Videos >90 days old (legal hold)
  • Lifecycle policy: Auto-transition between tiers based on last access time

2. Performance Optimization:

  • Azure CDN: Cache frequently accessed videos at edge locations globally
  • Multiple renditions: Store each quality (4K, 1080p, 720p, 480p) as a separate blob; enable blob versioning to protect against accidental overwrites
  • Blob index tags: Metadata for fast filtering (genre, rating, upload date)
  • Concurrent uploads: Use block blob multi-part upload for large files

3. Security & Access:

  • Private endpoints: Blob storage not accessible from internet
  • SAS tokens: Generate time-limited signed URLs for video streaming (see the sketch after this example)
  • Soft delete: 30-day retention for accidentally deleted blobs
  • Immutable storage: Archive tier with legal hold (cannot delete/modify)

4. Cost Optimization:

  • Hot tier (7 days): 100 TB × $0.018 = $1,800/month
  • Cool tier (83 days): 300 TB × $0.01 = $3,000/month
  • Archive tier (6+ years): 600 TB × $0.001 = $600/month
  • Total storage cost: $5,400/month for 1 PB (vs $18,000 if all Hot)
  • Savings: 70% reduction vs single-tier approach
5. Rehydration Strategy (when user requests archived video):

  • Standard rehydration: up to 15 hours, lower cost
  • High-priority rehydration: under 1 hour, higher cost
  • Pre-emptive rehydration: ML predicts requests, rehydrates proactively
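
The SAS-token bullet above translates into a few lines of SDK code. A minimal sketch with azure-storage-blob; the account, container, blob, and key names are placeholders:

# Sketch: time-limited read-only SAS URL for one video blob.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_blob_sas, BlobSasPermissions

sas = generate_blob_sas(
    account_name="mediastore",
    container_name="videos",
    blob_name="product-demo-1080p.mp4",
    account_key="<account-key>",                  # or use a user delegation key (Entra ID)
    permission=BlobSasPermissions(read=True),     # read-only: can stream, not modify/delete
    expiry=datetime.now(timezone.utc) + timedelta(hours=2),
)
url = f"https://mediastore.blob.core.windows.net/videos/product-demo-1080p.mp4?{sas}"
# Hand `url` to the player; the link expires automatically after 2 hours.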

Must Know - Blob Storage:

  • Access tiers can be set per blob: Not just account-level, each blob can have a different tier (see the sketch after this list)
  • Changing tiers has costs: Moving from Archive to Hot costs retrieval fees + early deletion fees if <180 days
  • Page blobs are different: Only support Hot tier, used for VHD disks, fixed size (up to 8TB)
  • Hierarchical namespace = Data Lake: Enable HNS on storage account to get Data Lake Gen2 capabilities (folders, ACLs, faster operations)
  • Immutable storage with legal hold: Once set, even owner cannot delete (for regulatory compliance)
  • Snapshot vs Versioning: Snapshots are manual point-in-time copies, Versioning auto-tracks all changes
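
Per-blob tiering (the first bullet above) is a one-line SDK call. A minimal sketch with azure-storage-blob and placeholder names:

# Sketch: moving a single blob to Archive without touching the rest of the account.
from azure.storage.blob import BlobClient

blob = BlobClient(
    account_url="https://mediastore.blob.core.windows.net",
    container_name="videos",
    blob_name="old-demo.mp4",
    credential="<account-key>",
)
blob.set_standard_blob_tier("Archive")   # changes only this blob; others keep their tier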

Azure Cosmos DB - Globally Distributed NoSQL

What it is: Cosmos DB is a fully managed NoSQL database designed for global distribution, elastic scale, and single-digit millisecond response times. It supports multiple data models (document, key-value, graph, columnar) and APIs (SQL, MongoDB, Cassandra, Gremlin, Table).

Why it exists: Modern apps need:

  • Global distribution: Users worldwide expect local latency
  • Elastic scale: Workload grows 100x overnight (viral content, Black Friday)
  • Multi-model: Different parts of app need different data models
  • Guaranteed performance: SLA-backed latency, availability, throughput

Cosmos DB is built from the ground up for these scenarios. It's a planet-scale database.

Real-world analogy: Cosmos DB is like a global franchise with local stores:

  • Headquarters (primary region): Writes go here first
  • Local branches (read regions): Reads served locally for speed
  • Instant replication: Changes at HQ sync to all branches in <1 second
  • Regional autonomy (multi-region writes): Each branch can accept writes
  • Consistency choices: You decide trade-off between speed and accuracy

How it works (Detailed):

  1. Resource Model:

    • Account: Top-level container, globally unique, spans regions
    • Database: Container for containers (tables)
    • Container: Holds items (documents), defines partition key
    • Item: Actual data (JSON document, row, vertex, etc.)
  2. Partition Key Design (CRITICAL for performance):

    • Logical partition: All items with same partition key value
    • Physical partition: Cosmos DB manages distribution across servers
    • Choose key with:
      • High cardinality (many distinct values)
      • Even distribution (no hot partitions)
      • Query efficiency (filter on partition key = fast)
  3. Request Units (RUs): Normalized cost of operations

    • 1 RU = read 1 KB document by ID + partition key
    • Write 1 KB = ~5 RUs
    • Query without partition key = variable RUs (depends on result size)
    • Provision RUs per second at container or database level
  4. Consistency Levels (five choices):

    • Strong: Linearizability, reads always see latest write (highest latency)
    • Bounded Staleness: Reads lag by max K versions or T seconds
    • Session: Consistent within client session (read your writes)
    • Consistent Prefix: Reads never see out-of-order writes
    • Eventual: Lowest latency, reads may see stale data
  5. Global Distribution:

    • Add/remove regions with one click
    • Data automatically replicates to all regions
    • Choose read regions (reads served locally)
    • Choose write regions (multi-region writes for low latency)
    • Automatic failover if region offline

Detailed Example: E-commerce Product Catalog:

Situation: Global e-commerce platform with users in US, Europe, Asia. Product catalog (10M products), user sessions (1M concurrent), shopping carts. Need sub-10ms reads globally, 99.999% availability.

Solution Architecture:

1. Cosmos DB Configuration:

  • API: SQL API (familiar to SQL developers, rich queries)
  • Regions: East US (primary), West Europe, Southeast Asia (all read+write)
  • Consistency: Session (read your own writes, good enough for shopping)
  • Throughput: Autoscale 10,000-100,000 RU/s

2. Container Design:

Products Container:

{
  "id": "product-12345",
  "productId": "12345",
  "category": "electronics",  // Partition key
  "name": "Laptop XYZ",
  "price": 1299.99,
  "stock": 45,
  "region": "us",
  "ttl": 3600  // Cache product for 1 hour
}
  • Partition key: category (even distribution, query filter)
  • Indexing: All properties (flexible queries)
  • TTL: 1 hour (products re-loaded from master source)

Shopping Carts Container:

{
  "id": "cart-user789",
  "userId": "user789",  // Partition key
  "items": [
    {"productId": "12345", "quantity": 2},
    {"productId": "67890", "quantity": 1}
  ],
  "createdAt": "2025-01-15T10:30:00Z",
  "ttl": 86400  // Delete abandoned carts after 24 hours
}
  • Partition key: userId (each user's cart in one partition)
  • TTL: 24 hours (auto-cleanup abandoned carts)

3. Multi-Region Strategy:

  • US users: Write/read from East US (5ms latency)
  • Europe users: Write/read from West Europe (4ms latency)
  • Asia users: Write/read from Southeast Asia (6ms latency)
  • Conflict resolution: Last-Write-Wins (product updates use timestamp)

4. Cost Optimization:

  • Autoscale RUs: 10K RU/s baseline (off-peak), 100K RU/s peak (flash sales)
  • Serverless option: For dev/test environments
  • Analytical store: Export to Synapse for reporting (cheaper than querying Cosmos)

5. Performance Patterns:

  • Point reads (by ID + partition key): 1 RU, ~5ms
  • Query within partition: 5-50 RUs, ~10-50ms
  • Cross-partition query: 100+ RUs, ~100+ ms (avoid if possible)
  • Bulk operations: Use Cosmos DB bulk executor for batch inserts
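
In SDK terms, the difference between these patterns is whether you supply the partition key. A minimal sketch with the azure-cosmos Python package and placeholder account values:

# Sketch contrasting a cheap point read with a partition-scoped query.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("shop").get_container_client("products")

# Point read: id + partition key -> ~1 RU for a 1 KB item.
laptop = container.read_item(item="product-12345", partition_key="electronics")

# Partition-scoped query: filters on the partition key, no cross-partition fan-out.
cheap_items = list(container.query_items(
    query="SELECT * FROM c WHERE c.category = @cat AND c.price < 100",
    parameters=[{"name": "@cat", "value": "electronics"}],
    partition_key="electronics",   # omit this and the query fans out across all partitions
))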

What happens:

  • User in Tokyo browses electronics category
  • Query routed to Southeast Asia region (lowest latency)
  • Products returned in 6ms (within partition query)
  • User adds laptop to cart (partition key = userId)
  • Cart write to Southeast Asia (3ms)
  • Cart replicates to all regions (within 100ms)
  • User moves to checkout, switches to payment service (different DB)
  • Cart auto-deleted after 24 hours if not checked out (TTL)
  • Black Friday: Traffic 50x normal
  • Autoscale increases RUs from 10K to 100K automatically
  • Cost increases proportionally but performance maintained

Must Know - Cosmos DB:

  • Partition key cannot be changed: Choose wisely during design, can't change after container creation
  • Provisioned vs Serverless: Provisioned for predictable workload (cheaper at scale), Serverless for variable (pay per request)
  • Multi-region writes have tradeoff: Lower latency but potential conflicts (need conflict resolution policy)
  • Analytical store is separate: Columnar storage for analytics, doesn't consume RUs, synced automatically
  • Change feed enables event-driven: Track all changes in real-time, trigger functions, sync to other systems
  • Consistency default is Session: Balances performance and data correctness for most scenarios

Section 3: Design Data Integration Solutions

Introduction

The problem: Modern enterprises have data everywhere: on-premises databases, cloud databases, SaaS applications, IoT devices, streaming sources. Need to move, transform, and analyze this data without building complex ETL pipelines.

The solution: Azure provides comprehensive data integration services: Data Factory for orchestration, Synapse Analytics for big data and warehousing, Event Hubs for streaming, Stream Analytics for real-time processing.

Why it's tested: AZ-305 tests data integration because:

  • Complexity: Integrating disparate sources requires architectural decisions
  • Scale: Petabytes of data need efficient pipelines
  • Real-time: Business decisions require streaming analytics
  • Cost: Wrong architecture costs millions in compute/storage

Core Concepts

Azure Data Factory - Cloud ETL/ELT

What it is: Data Factory is a cloud-based data integration service that creates automated data pipelines to move and transform data from various sources to destinations. It's code-free orchestration for ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) workflows.

Why it exists: Building custom data pipelines requires coding, error handling, retry logic, monitoring, scheduling. Data Factory provides visual designer for complex workflows with 90+ connectors to data sources without writing code.

Real-world analogy: Data Factory is like an automated shipping logistics system:

  • Pipelines = Shipping routes (source to destination)
  • Activities = Individual steps (pick up, sort, deliver)
  • Integration runtime = Trucks/planes (compute that moves data)
  • Triggers = Shipping schedule (when to run pipeline)
  • Parameters = Shipping labels (dynamic values)

Detailed Example: Data Warehouse Loading:

Situation: Retail company has sales data in on-premises SQL Server (1000 stores), product catalog in Oracle (headquarters), customer reviews in MongoDB Atlas. Need to load all into Azure Synapse for analytics, run nightly.

Solution with Data Factory:

1. Integration Runtimes:

  • Self-hosted IR: On-premises, connects to SQL Server behind firewall
  • Azure IR: Cloud-based, connects to Oracle (public endpoint), MongoDB Atlas
  • Managed VNet IR: For private endpoint connections (secure)

2. Pipeline Design:

Pipeline: NightlyDataWarehouseLoad (runs at 2 AM daily)
├─ Activity: CopySalesData (Parallel, 1000 stores)
│  ├─ Source: SQL Server (on-prem via Self-hosted IR)
│  ├─ Sink: Synapse staging tables
│  └─ Settings: Incremental copy (only new records since last run)
├─ Activity: CopyProductCatalog
│  ├─ Source: Oracle (via Azure IR)
│  ├─ Sink: Synapse staging tables
│  └─ Settings: Full copy (small dataset, 100K products)
├─ Activity: CopyCustomerReviews
│  ├─ Source: MongoDB Atlas (via Azure IR)
│  ├─ Sink: Synapse staging tables
│  └─ Settings: Change data capture (only modified documents)
└─ Activity: RunTransformationStoredProc
   ├─ Execute Synapse stored procedure
   └─ Transform staging → fact/dimension tables (star schema)

3. Incremental Copy Pattern (for 1000 SQL stores):

  • Watermark column: LastModifiedDate in each table
  • Lookup activity: Get max LastModifiedDate from previous run (stored in metadata table)
  • Copy activity: WHERE LastModifiedDate > @{watermark_value}
  • Update watermark: Store new max LastModifiedDate for next run
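
ADF expresses this with Lookup and Copy activities, but the logic is easier to see in plain code. A conceptual Python sketch with pyodbc; the connection string and the etl.Watermarks metadata table are hypothetical:

# Conceptual sketch of the watermark pattern that the pipeline implements.
import pyodbc

conn = pyodbc.connect("<source-connection-string>")

# 1. Lookup: read the watermark stored by the previous run.
old_wm = conn.execute(
    "SELECT WatermarkValue FROM etl.Watermarks WHERE TableName = 'Sales'"
).fetchone()[0]

# 2. Copy: pull only rows changed since the last run.
rows = conn.execute(
    "SELECT * FROM dbo.Sales WHERE LastModifiedDate > ?", old_wm
).fetchall()
# ... write `rows` to the staging sink here ...

# 3. Update watermark: persist the new high-water mark for the next run.
new_wm = conn.execute("SELECT MAX(LastModifiedDate) FROM dbo.Sales").fetchone()[0]
conn.execute(
    "UPDATE etl.Watermarks SET WatermarkValue = ? WHERE TableName = 'Sales'", new_wm
)
conn.commit()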

4. Error Handling:

  • Retry policy: 3 attempts with 30-second delay
  • Fault tolerance: Skip bad rows, log to storage account
  • Alerts: Send email if pipeline fails, create incident in ServiceNow
  • Monitoring: Integration with Azure Monitor, custom dashboards

5. Cost Optimization:

  • DIU (Data Integration Units): Start with auto (2-256), tune based on performance
  • Parallel copies: Set to 4-32 based on source/sink throughput
  • Compression: Enable during copy (reduces transfer time)
  • Reserved capacity: For predictable workloads (60% discount)

What happens:

  • 2 AM: Trigger fires, pipeline starts
  • Self-hosted IR connects to 1000 stores in parallel
  • Incremental copy pulls only yesterday's sales (5M rows vs 5B total)
  • Azure IR connects to Oracle, full copies 100K products (fast, small dataset)
  • MongoDB Atlas change streams provide only modified reviews
  • All data staged in Synapse within 20 minutes
  • Stored procedure runs, transforms to star schema (10 minutes)
  • Fact/Dimension tables ready for BI tools by 2:30 AM
  • Cost: $20 for pipeline run (vs $500 for full copy)

Must Know - Data Factory:

  • Integration Runtime is key: Self-hosted for on-prem, Azure for cloud, Managed VNet for private endpoints
  • Mapping Data Flows vs Copy Activity: Copy for simple moves (fast, cheap), Data Flows for complex transformations (Spark-based, more expensive)
  • Incremental copy saves cost: Use watermarks (timestamp) or CDC (change data capture), don't copy everything every time
  • Parallel copies: Configure the parallelCopies setting (the service auto-tunes it by default); setting it too high can overwhelm source/sink
  • Linked services are reusable: Define connection once, use in multiple pipelines (parameterize for flexibility)

Chapter Summary

What We Covered

Relational Data Solutions:

  • Azure SQL Database service tiers and compute models
  • Hyperscale for massive databases
  • SQL Managed Instance for lift-and-shift
  • PostgreSQL and MySQL for open-source workloads

Semi-Structured & Unstructured Data:

  • Blob Storage access tiers (Hot, Cool, Cold, Archive)
  • Data Lake Storage Gen2 for big data analytics
  • Cosmos DB multi-model global distribution
  • Cosmos DB partition key design and consistency levels

Data Integration:

  • Data Factory pipelines and integration runtimes
  • Incremental copy patterns with watermarks
  • Synapse Analytics for data warehousing
  • Event Hubs and Stream Analytics for real-time streaming

Critical Takeaways

  1. Right database for workload: SQL for relational, Cosmos for global NoSQL, Synapse for analytics
  2. Storage tiers save money: Archive tier is 50x cheaper than Hot (use lifecycle policies)
  3. Cosmos partition key is critical: Determines performance and scale, can't change after creation
  4. Data integration needs runtime: Self-hosted for on-prem, Azure for cloud, consider network security
  5. Incremental patterns save cost: Don't copy everything, use watermarks or change data capture

Self-Assessment Checklist

  • I can choose between Azure SQL tiers (General Purpose vs Business Critical vs Hyperscale)
  • I understand vCore vs DTU vs Serverless models
  • I can design Blob Storage lifecycle policies for cost optimization
  • I can design Cosmos DB partition keys for even distribution
  • I understand Cosmos DB consistency levels and trade-offs
  • I can design Data Factory pipelines with error handling
  • I know when to use Data Factory vs Synapse vs Event Hubs
  • I understand incremental copy patterns for large datasets

Practice Questions

Try these from practice test bundles:

  • Domain 2 Bundle 1: Questions 1-50 (relational + non-relational)
  • Data Platform Bundle: Questions 1-50 (comprehensive storage)
  • Data Integration Bundle: Questions 1-50 (pipelines and streaming)

Expected score: 75%+ to proceed


Next Chapter: Proceed to 04_domain_3_business_continuity to learn about backup, disaster recovery, and high availability solutions.


Chapter 3: Design Business Continuity Solutions (15-20% of exam)

Chapter Overview

What you'll learn:

  • Backup and disaster recovery strategies for Azure workloads
  • RPO and RTO requirements and how to meet them
  • High availability patterns across compute, data, and networking
  • Multi-region architectures for global resilience

Time to complete: 8-12 hours

Prerequisites: Chapters 0-2 (Fundamentals, Identity/Governance, Data Storage)


Section 1: Design Solutions for Backup and Disaster Recovery

Introduction

The problem: Disasters happen: datacenter outages, regional failures, ransomware attacks, human errors (accidental deletions). Without backup and DR strategy, businesses lose data permanently or face days/weeks of downtime costing millions in lost revenue and reputation damage.

The solution: Azure provides comprehensive BCDR services: Azure Backup for automated backups, Azure Site Recovery for replication and failover, geo-redundant storage for data durability. Combined with proper RPO/RTO planning, ensures business continuity.

Why it's tested: AZ-305 heavily tests BCDR because:

  • Business impact: Downtime = lost revenue, damaged reputation
  • Compliance: Regulations require specific RPO/RTO guarantees
  • Cost vs resilience: More resilience costs more, must balance
  • Complexity: Multi-tier applications need coordinated recovery

Core Concepts

Azure Backup - Managed Backup Service

What it is: Azure Backup is a one-click backup solution for Azure VMs, SQL/SAP databases, file shares, and on-premises workloads. It provides application-consistent backups, long-term retention, and central management.

Why it exists: Manual backups are unreliable, error-prone, and don't scale:

  • Administrators forget to backup
  • Backups stored on same infrastructure (lost if datacenter fails)
  • No application consistency (database backups mid-transaction)
  • No central visibility across thousands of resources

Azure Backup automates everything with policy-driven protection.

How it works (Detailed):

  1. Recovery Services Vault: Central repository for backup data, stores backup copies in geo-redundant Azure storage by default

  2. Backup Policies: Define schedule (daily, weekly), retention (7-9999 days), snapshot type (crash-consistent or app-consistent)

  3. Application-Consistent Backups:

    • VMs: Uses VSS (Windows) or fsfreeze (Linux) to quiesce applications
    • SQL: Coordinates with SQL Server for transactional consistency
    • SAP HANA: Uses backint interface for application consistency
  4. Incremental Backups: First backup is full, subsequent are incremental (only changed blocks), reduces time and cost

  5. Retention: Follows 3-2-1 rule automatically:

    • 3 copies of data (production + 2 backups)
    • 2 different storage types (local snapshot + vault)
    • 1 off-site (geo-redundant storage)

Detailed Example: Multi-Tier Application Backup:

Situation: E-commerce application with web tier (20 VMs), app tier (50 VMs), SQL databases (5 servers), file shares (product images). Need comprehensive backup with 4-hour RPO.

Solution Architecture:

1. Recovery Services Vault Configuration:

  • Vault: Production-Backup-Vault (West Europe)
  • Storage redundancy: GRS (replicates to North Europe for regional disaster)
  • Soft delete: 14 days (protection against accidental deletion/ransomware)
  • Security: Multi-user authorization (requires approval to disable backup)

2. Backup Policies per Workload:

Web/App VMs:

  • Policy: Daily backup at 2 AM
  • Retention: 30 days daily, 12 weeks weekly, 12 months monthly
  • Snapshot tier: First snapshot retained locally (instant recovery <5 mins)
  • Encryption: Automatic (data encrypted at rest and in transit)

SQL Databases:

  • Full backup: Weekly (Sunday 2 AM)
  • Differential backup: Daily (2 AM)
  • Transaction log backup: Every 15 minutes (RPO = 15 minutes)
  • Retention: 35 days short-term, 10 years long-term (compliance)

File Shares (Azure Files):

  • Snapshot-based backup: 4x daily (every 6 hours)
  • Retention: 30 days
  • Restore granularity: Individual files or entire share
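
A useful rule of thumb behind these policies: worst-case RPO equals the backup interval, since data written immediately after one backup is unprotected until the next completes, and retention determines how many restore points exist at once. A small illustrative sketch using the intervals above:

# Worst-case RPO is the longest gap between backups.
from datetime import timedelta

policies = {
    "VM daily backup": timedelta(days=1),
    "SQL transaction log": timedelta(minutes=15),
    "Azure Files snapshots": timedelta(hours=6),
}

for name, interval in policies.items():
    print(f"{name}: worst-case RPO = {interval}")

# Retention math for the VM policy: 30 daily + 12 weekly + 12 monthly
# recovery points = 54 restore points available at any given time.
print("VM restore points:", 30 + 12 + 12)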

3. Recovery Scenarios:

Scenario A: Accidental VM Deletion:

  • Detection: Within 14 days (soft delete window)
  • Recovery: Restore VM from any recovery point in last 30 days
  • RTO: 15 minutes (from instant snapshot) or 2 hours (from vault)
  • RPO: Max 24 hours (last daily backup)

Scenario B: SQL Database Corruption:

  • Detection: Immediate (application errors)
  • Recovery: Point-in-time restore to 5 minutes before corruption
  • RTO: 30 minutes (database restore to new server)
  • RPO: Max 15 minutes (transaction log frequency)

Scenario C: Ransomware Attack:

  • Detection: Alerts from Microsoft Defender detect encryption activity
  • Recovery: Restore all VMs and databases from 48 hours ago (before infection)
  • RTO: 4 hours (parallel restore of 70 VMs + 5 SQL servers)
  • RPO: 48 hours data loss (acceptable for ransomware scenario)
  • Protection: Soft delete prevents attacker from deleting backups

4. Cost Optimization:

  • Instant restore snapshots: Free for first 2 days, then $0.05/GB/month
  • Vault storage: $0.05/GB/month (GRS)
  • Data transfer: Free (restore within same region)
  • Total: ~$15K/month for 50TB backup (vs $100K for on-prem backup infrastructure)

Must Know - Azure Backup:

  • Application-consistent is default for VMs: VSS ensures databases in VM are consistent
  • Incremental backups after first full: Only changed blocks backed up (saves time and money)
  • Soft delete prevents ransomware: Even if attacker has admin rights, can't delete backups for 14 days
  • Cross-region restore: Can restore from GRS vault to any Azure region (for regional disaster)
  • MARS agent for on-prem: Azure Backup agent for Windows servers, backs up files/folders
  • Operational backup for blobs: Continuous protection (no schedule), local recovery points

Azure Site Recovery - Disaster Recovery as a Service

What it is: Azure Site Recovery (ASR) provides automated replication and failover for VMs (Azure, on-prem, other clouds) to Azure. It orchestrates multi-tier application recovery with minimal downtime.

Why it exists: Building DR infrastructure is expensive:

  • Duplicate datacenter costs millions
  • Keeping DR site updated is complex
  • Failover testing disrupts production
  • No built-in orchestration (manual runbooks)

ASR eliminates DR infrastructure costs, provides automated orchestration, and enables non-disruptive testing.

How it works (Detailed):

  1. Replication: Continuous replication of VMs to Azure:

    • Azure to Azure: Replicate to paired region
    • VMware to Azure: Replication via Configuration Server
    • Hyper-V to Azure: Direct replication or via System Center VMM
    • Physical to Azure: Mobility service on each server
  2. Replication Process:

    • Initial replication: Full VM disk copy (can take hours)
    • Delta replication: Continuous, tracks changed blocks every 5 minutes
    • Recovery points: Created every 5 minutes (RPO = 5 minutes minimum)
    • App-consistent snapshots: Created hourly by default (application quiescence)
  3. Recovery Plan: Orchestrates failover of multi-tier apps:

    • Groups: VMs organized by tier (database → app → web)
    • Sequencing: Tiers fail over in order (database first, web last)
    • Manual actions: Pause for manual steps (e.g., reconfigure load balancer)
    • Azure Automation: Run scripts before/after failover (automation runbooks)
  4. Failover Types:

    • Test failover: Isolated test in virtual network (no impact to production)
    • Planned failover: Graceful shutdown → replication → failover (zero data loss)
    • Unplanned failover: Production down, immediate failover (potential data loss = RPO)
  5. Failback: Reverse replication from Azure back to on-prem after disaster resolved

Detailed Example: Regional DR for Mission-Critical App:

Situation: Financial trading platform in East US, 99.99% availability SLA (52 mins downtime/year). Need DR in West US with RPO <5 minutes, RTO <30 minutes.

Solution Architecture:

1. ASR Configuration:

  • Primary site: East US (20 VMs: 5 SQL, 10 app servers, 5 web servers)
  • DR site: West US (ASR creates replicas)
  • Replication: Continuous (every 5 minutes)
  • App-consistent snapshots: Every 1 hour

2. Recovery Plan Design:

Recovery Plan: Trading-Platform-DR
├─ Group 1: Database Tier (5 SQL VMs)
│  ├─ Pre-action: Run script to update DNS (database endpoint)
│  ├─ VMs: sql-01, sql-02, sql-03, sql-04, sql-05 (sequential boot)
│  └─ Post-action: Wait 5 mins (database startup)
├─ Group 2: Application Tier (10 app VMs)
│  ├─ VMs: app-01 to app-10 (parallel boot)
│  └─ Post-action: Health check script (verify connectivity to database)
└─ Group 3: Web Tier (5 web VMs)
   ├─ VMs: web-01 to web-05 (parallel boot)
   └─ Post-action: Update Traffic Manager to West US endpoint

3. Network Configuration:

  • Virtual Network: Pre-created in West US (same IP scheme as East US)
  • Network Security Groups: Replicated with VMs
  • ExpressRoute: Exists to both regions (users can connect to either)
  • Traffic Manager: Routes users to healthy region (automatic failover)

4. Disaster Scenario & Response:

T+0: East US region experiences Azure outage (total datacenter failure)
T+2 mins: Monitoring detects outage, pages on-call team
T+5 mins: Team validates primary is down, initiates unplanned failover
T+10 mins: ASR fails over Group 1 (databases) to West US from latest recovery point
T+15 mins: Group 1 online, health check passes, Group 2 starts
T+20 mins: Group 2 online, Group 3 starts
T+25 mins: Group 3 online, Traffic Manager updated
T+28 mins: Trading platform fully operational in West US
T+30 mins: Users reconnect, trading resumes

Total RTO: 28 minutes (within 30-minute SLA)
Total RPO: 5 minutes (last recovery point before failure)
Data loss: 5 minutes of trades (acceptable per business continuity plan)

5. Cost Optimization:

  • DR VMs: Not running, only replicated disks ($0.12/GB/month)
  • Storage: Standard SSD in DR (upgrade to Premium on failover)
  • Compute: Pay only when failed over (no cost for idle DR)
  • Total: $5K/month for DR readiness (vs $50K/month for active DR datacenter)

Must Know - Site Recovery:

  • RPO minimum is 5 minutes: Can't get lower than 5-minute recovery points with ASR
  • Test failover doesn't impact production: Creates isolated VMs in test network
  • Recovery plan is key: Must design multi-tier sequence for proper application recovery
  • Failback requires re-protection: After failover to Azure, must configure reverse replication to failback
  • Mobility service required: Must be installed on each VM (Azure VMs get it automatically)
  • Churn limit: 54 MB/s per VM max (disk change rate); exceeding it causes replication lag

Section 2: Design for High Availability

Introduction

The problem: Single points of failure cause outages. One server crashes = application down. One region fails = total outage. Load balancer overwhelmed = degraded performance. High availability requires redundancy at every layer.

The solution: Azure provides availability zones (datacenter redundancy), availability sets (rack redundancy), load balancing (traffic distribution), autoscaling (capacity redundancy). Combine these for 99.99-99.999% uptime.

Why it's tested: AZ-305 tests HA architecture because:

  • SLA requirements: Different patterns achieve different SLAs
  • Cost tradeoffs: Zone redundancy costs more than single-zone
  • Design decisions: Which tier needs HA vs can tolerate downtime
  • Composite SLAs: Understanding how component SLAs multiply

Availability Zones - Datacenter-Level Redundancy

What it is: Availability Zones are physically separate datacenters within an Azure region (minimum 3 zones per region). Each zone has independent power, cooling, networking. Deploying across zones protects against datacenter failures.

Why it exists: Traditional HA with availability sets only protects against rack/server failure within same datacenter. Datacenter-level failures (fire, flood, power outage) take down all VMs. Availability Zones provide resilience against entire datacenter failure.

How it works:

  1. Zone-Redundant Resources: Services automatically replicated across zones:

    • Zone-redundant storage (ZRS): 3 copies across 3 zones
    • Zone-redundant VPN Gateway: Active in multiple zones
    • Zone-redundant Load Balancer Standard: Traffic across zones
  2. Zonal Resources: Pinned to specific zone:

    • VM in Zone 1, Zone 2, Zone 3 (you manage distribution)
    • Managed disk in same zone as VM
    • Public IP Standard with zone assignment
  3. Application Pattern:

    • Deploy VMs across all 3 zones (3 web VMs: 1 per zone)
    • Use zone-redundant load balancer (distributes traffic to healthy zones)
    • Use zone-redundant storage (data accessible even if zone down)

Detailed Example: Zone-Redundant Web Application:

Situation: SaaS application needs 99.99% SLA (52 mins downtime/year). Web tier (stateless), app tier (stateless), database tier (SQL Database).

Solution Architecture:

1. Web Tier (3 VMs, 1 per zone):

  • VM in Zone 1: web-vm-z1
  • VM in Zone 2: web-vm-z2
  • VM in Zone 3: web-vm-z3
  • Load Balancer: Standard (zone-redundant) with health probe
  • Traffic distribution: Round-robin across healthy zones

2. App Tier (6 VMs, 2 per zone for capacity):

  • Zone 1: app-vm-z1-1, app-vm-z1-2
  • Zone 2: app-vm-z2-1, app-vm-z2-2
  • Zone 3: app-vm-z3-1, app-vm-z3-2
  • Internal Load Balancer: Zone-redundant
  • Autoscale: Min 2 per zone, max 10 per zone (scales within zone first)

3. Data Tier (Azure SQL Business Critical):

  • Zone redundancy: Enabled (automatic 3-zone replication)
  • Read replicas: In all 3 zones
  • Failover: Automatic to healthy zone (no manual intervention)

4. Failure Scenario:

Zone 1 Datacenter Failure:

  • Load balancer detects web-vm-z1 unhealthy (health probe fails)
  • Traffic redirected to web-vm-z2 and web-vm-z3 (capacity reduced 33%)
  • App tier loses 2 VMs, autoscale deploys 2 more in Zone 2 and Zone 3
  • SQL automatically fails over to Zone 2 replica (transparent to app, <30 sec)
  • Impact: No downtime, slight performance degradation (33% less web capacity)
  • SLA maintained: 99.99% (52 mins/year budget not consumed)

5. SLA Calculation:

  • Single VM: 99.9% (8.76 hours downtime/year)
  • Availability Set: 99.95% (4.38 hours downtime/year)
  • Availability Zone: 99.99% (52.56 minutes downtime/year)
  • Multi-region: 99.999% (5.26 minutes downtime/year)
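
These SLA tiers combine differently depending on architecture: components in series multiply (availability drops), while redundant paths compound (availability rises). A short worked sketch:

# Composite SLA math: serial dependencies multiply, redundancy compounds.
def serial(*slas):
    """App is up only if every dependency is up."""
    result = 1.0
    for s in slas:
        result *= s
    return result

def parallel(sla, instances):
    """App is up if at least one redundant instance is up."""
    return 1 - (1 - sla) ** instances

web, app, sql = 0.9999, 0.9999, 0.9995
print(f"Serial 3-tier: {serial(web, app, sql):.5%}")   # ~99.93%, worse than any single tier

# Two independent regions behind Traffic Manager (ignoring Traffic Manager's own SLA):
print(f"Two regions:   {parallel(0.9999, 2):.6%}")      # ~99.999999%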

Must Know - Availability Zones:

  • Not all regions have zones: Only select regions have 3+ physical zones (check Azure region map)
  • Zone-redundant vs Zonal: Zone-redundant = Azure manages across zones, Zonal = you pin to specific zone
  • Latency between zones: <2ms (within region), can run synchronous replication
  • Cost: Zone-redundant resources cost ~20% more than single-zone (worth it for HA)
  • Load Balancer must be Standard: Basic load balancer doesn't support zones
  • SQL zone redundancy is opt-in: Supported on Business Critical and General Purpose tiers, but it must be explicitly enabled

Chapter Summary

What We Covered

Backup and Disaster Recovery:

  • Azure Backup for automated, application-consistent backups
  • Site Recovery for VM replication and orchestrated failover
  • RPO/RTO requirements and how to meet them
  • Soft delete and immutable backups for ransomware protection

High Availability:

  • Availability Zones for datacenter redundancy (99.99% SLA)
  • Zone-redundant services (Load Balancer, Storage, SQL)
  • Composite SLAs and how they multiply
  • Multi-tier HA architecture patterns

Critical Takeaways

  1. RPO = data loss, RTO = downtime: Different workloads need different levels
  2. Zones provide datacenter redundancy: 99.99% vs 99.95% for availability sets
  3. Backup isn't DR: Backup = restore data, DR = failover applications
  4. Test DR regularly: Unused DR plans fail when needed, ASR test failover is free
  5. SLAs multiply: 99.9% web × 99.9% database = 99.8% composite (not 99.9%)

Self-Assessment Checklist

  • I can calculate RPO/RTO for different backup strategies
  • I understand difference between Azure Backup and Site Recovery
  • I can design zone-redundant architecture for 99.99% SLA
  • I know when to use zones vs availability sets vs regions
  • I can calculate composite SLA for multi-tier application
  • I understand soft delete and immutable backup for ransomware protection

Practice Questions

Try these from practice test bundles:

  • Domain 3 Bundle 1: Questions 1-50 (backup and HA)
  • Backup & Recovery Bundle: Questions 1-50 (BCDR focused)
  • High Availability Bundle: Questions 1-50 (zones and failover)

Expected score: 75%+ to proceed


Next Chapter: Proceed to 05_domain_4_infrastructure to learn about compute, networking, application architecture, and migration solutions.


Chapter 4: Design Infrastructure Solutions (30-35% of exam)

Chapter Overview

What you'll learn:

  • Compute solutions: VMs, containers, serverless architectures
  • Application architecture: Messaging, events, API management
  • Migration strategies and Azure Migrate
  • Network solutions: Connectivity, security, load balancing

Time to complete: 15-20 hours

Prerequisites: Chapters 0-3 (Fundamentals, Identity/Governance, Data Storage, Business Continuity)


Section 1: Design Compute Solutions

Introduction

The problem: Applications need compute resources to run, but choosing the wrong compute model leads to: overpaying for idle resources (always-on VMs when serverless would work), poor scaling (VMs that can't handle traffic spikes), operational overhead (managing OS patches instead of focusing on code), or vendor lock-in (proprietary code that can't migrate).

The solution: Azure offers a spectrum of compute options: VMs for full control, containers for portability, serverless for automatic scaling and pay-per-execution. Each has specific use cases, cost models, and management requirements.

Why it's tested: AZ-305 heavily tests compute decisions because:

  • Cost impact: Wrong compute choice = 10x cost difference
  • Performance: Compute sizing directly affects application performance
  • Operations: Different models have vastly different operational overhead
  • Modernization: Many solutions involve migrating from legacy compute to modern patterns

Core Concepts

Azure Virtual Machines - Full Control Compute

What it is: Azure VMs are on-demand, scalable compute resources providing full control over operating system, runtime, and configuration. You choose VM size (CPU/memory), OS (Windows/Linux), disk type (SSD/HDD), and networking.

Why it exists: Despite cloud-native alternatives, VMs are essential for:

  • Legacy applications: Applications requiring specific OS versions or configurations
  • Full control: Need to install kernel modules, drivers, or low-level software
  • Licensing: Bring-your-own-license (BYOL) for Windows Server, SQL Server
  • Compliance: Regulations requiring specific OS hardening or configurations

Azure VMs provide IaaS flexibility while eliminating datacenter management.

How it works (Detailed):

  1. VM Families: Optimized for different workloads:

    • General Purpose (D-series): Balanced CPU/memory, web servers, dev/test
    • Compute Optimized (F-series): High CPU-to-memory, batch processing, analytics
    • Memory Optimized (E-series): High memory-to-CPU, databases, caching
    • Storage Optimized (L-series): High disk throughput, big data, SQL, NoSQL
    • GPU (N-series): GPU acceleration, AI/ML training, rendering
  2. High Availability Options:

    Availability Sets (rack-level redundancy):

    • Fault Domains: Physical server racks (max 3 per availability set)
    • Update Domains: Logical grouping for planned maintenance (max 20)
    • VMs distributed across domains to avoid single rack failure
    • SLA: 99.95% (4.38 hours downtime/year)

    Availability Zones (datacenter-level redundancy):

    • Physically separate datacenters in region (minimum 3 zones)
    • Each zone has independent power, cooling, networking
    • Deploy VMs across zones for datacenter failure protection
    • SLA: 99.99% (52 minutes downtime/year)

    Virtual Machine Scale Sets (auto-scaling):

    • Automatically create/delete VMs based on demand or schedule
    • Supports up to 1,000 VM instances (standard), 600 (custom images)
    • Integrates with Azure Load Balancer and Application Gateway
    • Flexible orchestration: Mix VM sizes, availability zones in single scale set
  3. Disk Options:

    • Ultra Disk: <1ms latency, 160,000+ IOPS, mission-critical workloads ($$$)
    • Premium SSD: <5ms latency, 20,000 IOPS, production databases ($$)
    • Standard SSD: <10ms latency, 6,000 IOPS, web servers, dev/test ($)
    • Standard HDD: ~15ms latency, 500 IOPS, backups, archives (¢)

Detailed Example: Multi-Tier Application on VMs:

Situation: Financial services company migrating 3-tier application to Azure. Web tier (5 VMs), app tier (10 VMs), database tier (2 SQL Servers in Always On). Need 99.99% SLA, compliance requires full OS control.

Solution Architecture:

1. Web Tier (DMZ, public-facing):

  • VM Size: Standard_D4s_v5 (4 vCPU, 16GB RAM)
  • Count: 5 VMs across 3 availability zones (2-2-1 distribution)
  • Disk: 128GB Premium SSD (OS), 256GB Standard SSD (logs)
  • Load Balancer: Azure Load Balancer Standard (zone-redundant)
  • Autoscale: VMSS scales 3-10 instances based on CPU >70%

2. Application Tier (business logic):

  • VM Size: Standard_E8s_v5 (8 vCPU, 64GB RAM, memory-optimized)
  • Count: 10 VMs in availability set (distributed across 3 fault domains, 5 update domains)
  • Disk: 128GB Premium SSD (OS), 512GB Premium SSD (app cache)
  • Internal Load Balancer: Distributes traffic from web tier
  • Proximity Placement Group: Low latency between app and database tier

3. Database Tier (SQL Server Always On):

  • VM Size: Standard_E32s_v5 (32 vCPU, 256GB RAM)
  • Count: 2 VMs (primary + secondary in different availability zones)
  • Disk: 128GB Premium SSD (OS), 4TB Ultra Disk (data), 2TB Premium SSD (logs)
  • SQL Always On: Synchronous replication between zones
  • Witness: Cloud Witness in Azure Storage for quorum

4. High Availability Configuration:

Zone Failure Scenario (Zone 1 down):

  • Web tier: 2 VMs in Zone 1 down, traffic routes to Zones 2 & 3 (3 VMs remain)
  • App tier: Availability set protects, VMs in other fault domains handle load
  • Database: Primary in Zone 1 fails, automatic failover to secondary in Zone 2 (<30 sec)
  • Impact: No downtime, slight performance degradation during failover

5. Cost Optimization:

  • Reserved Instances: 3-year reservation for always-on VMs = 60% savings
  • Spot VMs: Non-production dev/test environments = 90% savings
  • Right-sizing: Monitoring showed web VMs at 30% CPU, downsize to D2s_v5 = 50% savings
  • Total: $45K/month for production, $8K/month for dev/test (vs $120K on-premises)

Must Know - Virtual Machines:

  • Availability Set vs Zone: Set = rack redundancy (99.95%), Zone = datacenter redundancy (99.99%)
  • Fault domains max 3: Region limitation, can't have more than 3 fault domains per availability set
  • Update domains for maintenance: Azure updates one domain at a time, 30-min recovery between
  • VMSS Flexible vs Uniform: Flexible = mix sizes/zones (recommended), Uniform = identical VMs (legacy)
  • Disk caching: OS disk (Read/Write), data disk (ReadOnly for databases, None for logs)
  • Proximity Placement Group: Forces VMs in same datacenter for <1ms latency (use for database clusters)

Azure Kubernetes Service - Container Orchestration

What it is: AKS is a managed Kubernetes service that automates container orchestration: deployment, scaling, networking, and lifecycle management. Microsoft manages the Kubernetes control plane (masters), you manage worker nodes and applications.

Why it exists: Containers solve application portability, but managing clusters manually is complex:

  • Kubernetes complexity: Installing, upgrading, securing Kubernetes is difficult
  • Control plane management: Master nodes require high availability, backups
  • Integration: Need to integrate with load balancers, storage, identity, monitoring
  • Operational overhead: Patching nodes, scaling clusters, disaster recovery

AKS eliminates control plane management (free), automates operations, integrates with Azure services.

How it works (Detailed):

  1. AKS Architecture:

    • Control Plane: Managed by Microsoft (free), runs in Azure infrastructure

      • API Server: Entry point for kubectl commands
      • Scheduler: Assigns pods to nodes based on resources
      • Controller Manager: Maintains desired state (deployments, replica sets)
      • etcd: Distributed key-value store for cluster state
    • Node Pools: Groups of VMs running containers

      • System node pool: Runs Kubernetes system pods (CoreDNS, metrics-server)
      • User node pools: Run application workloads
      • Can have multiple node pools with different VM sizes, OS (Linux/Windows)
  2. Scaling Mechanisms:

    Horizontal Pod Autoscaler (HPA):

    • Scales number of pod replicas based on metrics
    • Metrics: CPU, memory, custom (queue length, HTTP requests/sec)
    • Check interval: 15 seconds (configurable)
    • Example: Scale from 3 to 10 pods when CPU >70% (see the sketch after this list)

    Cluster Autoscaler:

    • Automatically adds/removes nodes based on pod resource requests
    • Scale out: Pod can't be scheduled (insufficient resources) → add node
    • Scale in: Node underutilized for >10 minutes → drain and remove node
    • Works per node pool, respects min/max node count

    Vertical Pod Autoscaler (VPA):

    • Adjusts CPU/memory requests for pods based on usage
    • Prevents over-provisioning (pods requesting too much) or under-provisioning
  3. Networking Options:

    Kubenet (basic, default):

    • Nodes get IPs from VNet subnet
    • Pods get IPs from separate CIDR (10.244.0.0/16)
    • User-Defined Routes (UDR) created automatically
    • Limitation: Pods not directly accessible from VNet (need NodePort/LoadBalancer)

    Azure CNI (advanced):

    • Both nodes and pods get IPs from VNet subnet
    • Pods directly accessible from VNet (no NAT)
    • Supports Network Policies (Calico, Azure Network Policy)
    • Limitation: Requires larger subnet (1 IP per pod, plan for scale)
  4. High Availability:

    • Control plane: Automatically deployed across zones (in supported regions)
    • Node pools: Deploy across availability zones for 99.99% SLA
    • Multi-region: Use Azure Traffic Manager or Front Door for global availability
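
The HPA decision mentioned in step 2 follows a documented formula: desiredReplicas = ceil(currentReplicas × currentMetric ÷ targetMetric), clamped to the configured min/max bounds. A small sketch:

# The HPA scaling rule: desired replicas scale with the ratio of
# observed metric to target, rounded up and clamped to bounds.
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas, max_replicas):
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 10 pods averaging 95% CPU against a 70% target -> scale out to 14 pods.
print(hpa_desired_replicas(10, current_metric=95, target_metric=70,
                           min_replicas=10, max_replicas=50))   # 14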

Detailed Example: Microservices Platform on AKS:

Situation: E-commerce company running 20 microservices (Node.js, Python, .NET), 500+ pods. Need autoscaling, CI/CD, zero-downtime deployments. Traffic varies 10x (Black Friday spikes).

Solution Architecture:

1. AKS Cluster Configuration:

  • Region: East US (primary), West US (DR)
  • Kubernetes version: 1.28 (N-1 version, stable)
  • Network plugin: Azure CNI (pods need VNet connectivity)
  • Network policy: Azure Network Policy (pod-to-pod security)
  • Subnet: /20 (4,096 IPs for nodes and pods)
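
A rough subnet-sizing check for Azure CNI (assuming the common default of 30 pods per node and one surge node during upgrades): each node consumes one IP plus one per pod, so at the cluster maximum of 43 nodes (3 system + 30 general + 10 memory-intensive):

  required IPs ≈ (43 + 1 surge) × (1 node IP + 30 pod IPs) = 44 × 31 = 1,364

comfortably within the /20's 4,096 addresses, leaving headroom for growth.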

2. Node Pools Design:

System Node Pool (Kubernetes services):

  • VM Size: Standard_D4s_v5 (4 vCPU, 16GB RAM)
  • Count: 3 nodes across 3 availability zones (1 per zone)
  • Autoscale: Disabled (always 3 nodes for system stability)
  • Taints: CriticalAddonsOnly=true:NoSchedule (system pods only)

General Workload Pool (web services, APIs):

  • VM Size: Standard_D8s_v5 (8 vCPU, 32GB RAM)
  • Count: Min 6, Max 30 (autoscale enabled)
  • Zones: Distributed across 3 zones (2-2-2 initial)
  • Autoscale: Based on CPU/memory, pod pending time
  • Labels: workload=general

Memory-Intensive Pool (data processing):

  • VM Size: Standard_E16s_v5 (16 vCPU, 128GB RAM, memory-optimized)
  • Count: Min 3, Max 10
  • Labels: workload=memory-intensive
  • Node affinity: Pods with >32GB memory request schedule here

3. Application Deployment:

Frontend Service (React SPA):

  • Deployment: 10 replica pods (3 per zone initially)
  • HPA: Min 10, Max 50, target CPU 70%
  • Resources: Request 500m CPU, 1GB RAM; Limit 1000m CPU, 2GB RAM
  • Ingress: NGINX Ingress Controller with TLS termination
  • Service: LoadBalancer type, Azure Load Balancer Standard

Order Service (Node.js API):

  • Deployment: 5 replicas
  • HPA: Min 5, Max 30, custom metric (HTTP requests/sec >1000)
  • Resources: Request 1 CPU, 2GB RAM
  • Service: ClusterIP (internal only, accessed via ingress)

Payment Service (PCI compliance):

  • Deployment: 3 replicas on dedicated node pool (PCI-compliant VMs)
  • Node selector: compliance=pci
  • Network policy: Only allow traffic from order-service namespace
  • Secret: Payment gateway keys from Azure Key Vault (CSI driver)

4. Scaling Scenario (Black Friday traffic spike):

T+0: Normal load, 500 pods, 15 nodes
T+1 hour: Traffic increases 3x

  • HPA detects CPU >70% on frontend pods
  • Scales frontend from 10 to 25 replicas
  • Order service scales from 5 to 15 replicas

T+1.5 hours: Pods pending (not enough node capacity)

  • Cluster Autoscaler detects pending pods
  • Adds 10 nodes across 3 zones (distributed evenly)
  • Pods scheduled on new nodes within 5 minutes

T+4 hours: Peak traffic, 10x normal

  • 50 frontend pods (HPA max reached)
  • 30 order service pods
  • 45 nodes total (30 from autoscale)
  • All zones healthy, load distributed

T+12 hours: Traffic returns to normal

  • HPA scales down pods to min replicas (10-minute cooldown)
  • Cluster Autoscaler waits 10 minutes, then drains underutilized nodes
  • Returns to 15 nodes after 30 minutes

5. Cost Optimization:

  • Spot node pools: Non-critical batch workloads = 80% savings ($12K/month)
  • Cluster autoscaler: Only pay for needed capacity (avg 20 nodes vs 45 peak)
  • Reserved Instances: System pool + min user pool nodes = 40% savings
  • Total: $25K/month average (vs $65K for always-on peak capacity)

Must Know - Azure Kubernetes Service:

  • Control plane is free: Only pay for worker node VMs, not Kubernetes masters
  • System node pool required: Must have at least one system node pool; the last one can't be deleted
  • Kubenet vs Azure CNI: Kubenet = less IPs, pods not in VNet; CNI = pods in VNet, need more IPs
  • Cluster autoscaler complements HPA: Use both together - HPA scales pods, CA scales nodes
  • Upgrade strategy: Drain nodes gracefully, upgrade one at a time (surge upgrades available)
  • 99.95% vs 99.99% SLA: 99.95% = free, 99.99% = Uptime SLA ($73/month), requires zones

Azure Functions - Serverless Compute

What it is: Azure Functions is a serverless compute service for event-driven code execution. You write functions (code), Azure handles infrastructure: provisioning servers, scaling, patching, load balancing. Pay only for execution time (per 100ms).

Why it exists: Traditional compute requires managing capacity for peak load:

  • Over-provisioning: Pay for idle servers during low traffic
  • Under-provisioning: Performance degrades during spikes
  • Operational overhead: Patching, scaling, monitoring infrastructure
  • Slow scale-out: Provisioning new VM instances takes minutes, so spikes outpace capacity

Functions eliminate infrastructure management, auto-scale from zero to thousands, charge per execution.

How it works (Detailed):

  1. Hosting Plans:

    Consumption Plan (true serverless):

    • Auto-scale from 0 to 200 instances
    • Timeout: 5 minutes default, 10 minutes max
    • Memory: 1.5GB per instance
    • Pricing: $0.20 per million executions + $0.000016/GB-sec
    • Use case: Infrequent workloads, true pay-per-use

    Premium Plan (pre-warmed serverless):

    • Pre-warmed instances eliminate cold start (instances already running)
    • Timeout: 30 minutes default, unlimited possible
    • Memory: 3.5GB-14GB per instance
    • VNet integration, private endpoints
    • Pricing: $0.174/hour per instance (EP1) + execution charges
    • Use case: Latency-sensitive, VNet connectivity, longer execution

    Dedicated Plan (App Service):

    • Run on dedicated App Service Plan VMs
    • Always-on, no cold start
    • Full control over scaling rules
    • Pricing: App Service Plan cost (no per-execution charge)
    • Use case: Existing App Service Plan with capacity, predictable cost
  2. Triggers and Bindings:

    Triggers (what starts function):

    • HTTP Trigger: API endpoint, webhook
    • Timer Trigger: NCRONTAB schedule with six fields, seconds first (0 0 0 * * * = midnight daily; see the sketch after this list)
    • Queue Trigger: Azure Storage Queue, Service Bus Queue
    • Blob Trigger: New blob in Storage Account
    • Event Grid Trigger: Event Grid event
    • Cosmos DB Trigger: Change feed (new/updated documents)

    Bindings (input/output without code):

    • Input binding: Read data (blob, table, Cosmos DB) - no SDK code needed
    • Output binding: Write data (queue, blob, database) - declarative config
    • Example: HTTP trigger → read blob (input) → write to Cosmos DB (output)
  3. Scaling Behavior:

    Consumption Plan:

    • Scale controller monitors queue length, CPU, memory every 10 seconds
    • Scale out: Adds instances (max 200) when load increases
    • Scale in: Removes instances after 5 minutes idle
    • Cold start: 1-3 seconds (C#), 0.5-2 seconds (Node.js/Python)

    Premium Plan:

    • Always-on instances (min 1) eliminate cold start
    • Scales same as Consumption beyond always-on count
    • Pre-warming: New instances start within 1 second
  4. Deployment Slots:

    • Consumption: 2 slots (production + 1 staging)
    • Premium: 3 slots total
    • Swap slots for zero-downtime deployments
    • Slot settings stay with slot (not swapped): connection strings, app settings
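
A minimal timer-triggered sketch in the in-process model used by the snippets below (the schedule and logging are illustrative):

using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class NightlyCleanup
{
    // NCRONTAB has six fields (seconds first): "0 0 0 * * *" = 00:00:00 UTC daily
    [FunctionName("NightlyCleanup")]
    public static void Run(
        [TimerTrigger("0 0 0 * * *")] TimerInfo timer, ILogger log)
    {
        if (timer.IsPastDue)
            log.LogWarning("Timer is running late"); // host was down or busy

        log.LogInformation($"Cleanup ran at {DateTime.UtcNow:o}");
    }
}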

Detailed Example: Event-Driven Order Processing:

Situation: E-commerce platform processes 100K orders/day, spikes to 1M during sales. Order flow: HTTP → Queue → Processing → Database. Need auto-scaling, <100ms API response, cost-effective.

Solution Architecture:

1. Function App Configuration:

  • Plan: Premium Plan EP1 (1 always-on instance)
  • Region: East US (zone-redundant)
  • Runtime: .NET 8 Isolated (best performance)
  • Always-on instances: 1 (eliminates cold start)
  • Max scale-out: 100 instances

2. Function Design:

OrderSubmissionFunction (HTTP trigger):

[FunctionName("OrderSubmission")]
public async Task<IActionResult> Run(
    [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req,
    [Queue("orders")] IAsyncCollector<Order> orderQueue)
{
    var order = await req.ReadFromJsonAsync<Order>();
    order.Id = Guid.NewGuid();
    order.Status = "Pending";

    await orderQueue.AddAsync(order); // Output binding to queue

    return new OkObjectResult(new { orderId = order.Id });
}
  • Trigger: HTTP POST to /api/orders
  • Output binding: Writes to Azure Storage Queue (orders)
  • Response time: <50ms (just writes to queue, returns)
  • Scaling: Consumption plan, scales per HTTP request load

OrderProcessingFunction (Queue trigger):

[FunctionName("OrderProcessing")]
public async Task Run(
    [QueueTrigger("orders")] Order order,
    [CosmosDB("OrderDB", "Orders", CreateIfNotExists = true)] IAsyncCollector<Order> ordersOut)
{
    // Business logic: validate payment, check inventory, etc.
    await ValidatePayment(order);
    await CheckInventory(order);

    order.Status = "Confirmed";
    order.ProcessedAt = DateTime.UtcNow;

    await ordersOut.AddAsync(order); // Output binding to Cosmos DB
}
  • Trigger: Azure Storage Queue (orders)
  • Input binding: Order message from queue
  • Output binding: Writes to Cosmos DB (OrderDB/Orders collection)
  • Scaling: Based on queue length (1 instance per 16 messages)

NotificationFunction (Cosmos DB trigger):

[FunctionName("NotificationFunction")]
public async Task Run(
    [CosmosDBTrigger("OrderDB", "Orders",
        ConnectionStringSetting = "CosmosDBConnection",
        LeaseCollectionName = "leases",
        CreateLeaseCollectionIfNotExists = true)]
        IReadOnlyList<Document> orders)
{
    foreach (var order in orders)
    {
        await SendEmailNotification(order);
        await SendSMSNotification(order);
    }
}
  • Trigger: Cosmos DB change feed (new orders)
  • Batching: Processes up to 100 orders per invocation
  • Scaling: One instance per 10K documents/sec throughput

3. Scaling Scenario (Flash sale - 1M orders in 1 hour):

T+0: Normal load (100 orders/minute)

  • HTTP function: 1 instance (always-on)
  • Queue processor: 2 instances (32 messages in queue)
  • Notification: 1 instance

T+5 mins: Traffic spike begins (10K orders/minute)

  • HTTP function: Scales to 20 instances (handles 500 req/sec per instance)
  • Queue depth: 5,000 messages (backlog building)
  • Queue processor: Scales to 40 instances (16 messages per instance)
  • Notification: Scales to 10 instances (batching 100 orders)

T+30 mins: Peak traffic (30K orders/minute = 500/sec)

  • HTTP function: 50 instances (API still responds <100ms)
  • Queue depth: Stable at 8,000 (processing keeps up)
  • Queue processor: 100 instances (max scale-out reached)
  • Notification: 30 instances (processing change feed)

T+1 hour: Traffic subsides

  • Scale controller begins scale-in (5-minute idle threshold)
  • After 10 minutes: Returns to 5 instances (HTTP), 5 (queue), 2 (notification)
  • After 30 minutes: Back to normal (1-2-1 instances)

4. Cost Analysis:

Normal traffic (100K orders/day):

  • Premium Plan: 1 always-on EP1 instance = $125/month
  • Executions: 300K/day × 30 = 9M executions = $1.80/month
  • Compute: ~0.01 GB-sec per execution (100ms at ~0.1GB) × 9M = $1.44/month
  • Total: ~$130/month
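
The consumption-style metering behind these numbers (the per-execution memory and duration are assumptions):

  cost = executions × $0.20/million + executions × (GB × seconds) × $0.000016/GB-sec
       = 9M × $0.20/M + 9M × (0.1 GB × 0.1 s) × $0.000016
       = $1.80 + 90,000 GB-sec × $0.000016 = $1.80 + $1.44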

Sale day (1M orders in 1 hour, then normal):

  • Premium Plan: Same $125/month (base cost)
  • Executions: 3M (sale) + 9M (normal) = 12M = $2.40/month
  • Compute: Peak usage adds ~$50 for sale hour
  • Total: ~$180/month (only $50 extra for 10x spike)

vs Always-On VMs (sized for peak):

  • 50 D4s_v5 VMs (to handle peak) = $13,000/month
  • Savings: 99% reduction ($180 vs $13,000)

Must Know - Azure Functions:

  • Consumption cold start: 1-3 seconds, eliminated with Premium Plan always-on instances
  • Execution timeout: Consumption 10 min max, Premium 30 min default (unlimited possible), Dedicated unlimited
  • Durable Functions: Orchestrate long-running workflows with checkpointing (survive restarts)
  • Queue scaling: 1 instance per 16 messages (Storage Queue), customizable for Service Bus
  • VNET integration: Requires Premium or Dedicated plan, Consumption doesn't support
  • Deployment slots: Consumption = 2 slots, Premium = 3, staging → production swap for zero downtime

Section 2: Design Application Architecture

Introduction

The problem: Modern applications are distributed across multiple services: APIs, microservices, data processors, third-party integrations. Communication between these components creates challenges: tight coupling (services directly call each other, failures cascade), lost messages (network failures drop requests), traffic spikes overwhelm services, no audit trail of events.

The solution: Azure provides messaging and event services to decouple components: Service Bus for reliable messaging with transactions, Event Grid for reactive event routing, Event Hubs for high-throughput streaming. API Management centralizes API governance and security.

Why it's tested: AZ-305 tests application architecture because:

  • Resilience: Decoupled services survive component failures
  • Scalability: Message queues buffer traffic spikes
  • Integration: Most solutions integrate multiple services/systems
  • Modernization: Legacy monoliths transition to event-driven microservices

Core Concepts

Azure Service Bus - Enterprise Messaging

What it is: Service Bus is an enterprise message broker supporting queues (point-to-point) and topics (publish-subscribe). It provides guaranteed delivery, transactions, dead-letter queues, message sessions for ordering, and duplicate detection.

Why it exists: Direct service-to-service communication is fragile:

  • Tight coupling: Service A calls Service B directly, if B is down, A fails
  • Lost messages: Network failure = lost request (no retry, no persistence)
  • No load leveling: Traffic spike overwhelms receiver (no buffer)
  • No transactions: Can't ensure exactly-once delivery across services

Service Bus decouples sender/receiver, guarantees delivery, provides transactional messaging.

How it works (Detailed):

  1. Queues (point-to-point):

    • One sender → Queue → One receiver
    • Message stays in queue until receiver processes (pull model)
    • FIFO ordering: Messages processed in send order (with sessions)
    • Competing consumers: Multiple receivers pull from same queue (load distribution)
  2. Topics and Subscriptions (pub-sub):

    • Publisher → Topic → Multiple subscriptions → Multiple receivers
    • Each subscription gets copy of message
    • Filters: Subscriptions filter messages (SQL filter, correlation filter)
    • Example: OrderCreated event → Email subscription, Analytics subscription, Shipping subscription
  3. Features:

    Dead-Letter Queue:

    • Messages that can't be delivered go to DLQ
    • Reasons: Exceeded retry count, message expired (TTL), filter evaluation failed
    • Can inspect and reprocess DLQ messages

    Message Sessions:

    • Ensures FIFO ordering for related messages (same SessionId)
    • Example: All messages for Order #12345 processed in order (see the SDK sketch after this list)

    Transactions:

    • Receive message → Process → Send response (atomic operation)
    • If processing fails, message returns to queue (retry)

    Duplicate Detection:

    • Detects and discards duplicate messages within configurable window (30 sec - 7 days)
    • Based on MessageId or custom property
  4. Tiers:

    • Basic: Queues only, no topics, 256KB message size, $0.05/million ops
    • Standard: Queues + topics, 256KB messages, $0.05/million ops
    • Premium: Dedicated resources, 100MB messages, VNet integration, geo-DR, $668/month for 1 messaging unit
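
A minimal sketch of sessions and duplicate detection using the Azure.Messaging.ServiceBus SDK (queue name, connection string, and payload are placeholders):

using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient("<connection-string>");

// Sender: SessionId groups related messages for FIFO delivery;
// MessageId is the key duplicate detection deduplicates on.
ServiceBusSender sender = client.CreateSender("orders-payment");
await sender.SendMessageAsync(
    new ServiceBusMessage(BinaryData.FromString("{\"orderId\":\"12345\"}"))
    {
        SessionId = "12345",
        MessageId = "12345-payment"
    });

// Receiver: accepting a session locks it, so one consumer processes
// that order's messages in order.
ServiceBusSessionReceiver receiver =
    await client.AcceptNextSessionAsync("orders-payment");
ServiceBusReceivedMessage msg = await receiver.ReceiveMessageAsync();
try
{
    // ... process payment ...
    await receiver.CompleteMessageAsync(msg);  // removes message from queue
}
catch
{
    await receiver.AbandonMessageAsync(msg);   // redelivered; DLQ after max retries
}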

Detailed Example: Order Processing with Service Bus:

Situation: Retail application processes orders through multiple steps: payment, inventory, shipping. Need guaranteed delivery, transaction support, message ordering per order. Handle 10K orders/hour, peak 50K.

Solution Architecture:

1. Service Bus Configuration:

  • Tier: Premium (1 messaging unit)
  • Queues:
    • orders-payment (payment processing)
    • orders-inventory (inventory check)
    • orders-shipping (shipping label generation)
  • Topic: order-events (for notifications)
  • Subscriptions:
    • email-notifications (filter: eventType = 'OrderConfirmed')
    • analytics (all events)
    • customer-history (SQL filter on customerId)

2. Message Flow:

Step 1: Order Submission

API Gateway receives order → Send to orders-payment queue
Message: {
  "orderId": "12345",
  "customerId": "C-789",
  "amount": 299.99,
  "sessionId": "12345"  // All order 12345 messages have same session
}

Step 2: Payment Processing

Payment Service:
  - Pull message from orders-payment queue (with session lock)
  - Process payment with payment gateway
  - If success:
      * Complete message (remove from queue)
      * Send to orders-inventory queue
  - If failure:
      * Abandon message (returns to queue for retry)
      * If retry count > 5, message goes to DLQ

Step 3: Inventory Check

Inventory Service:
  - Pull message from orders-inventory queue (same session)
  - Check inventory availability
  - If available:
      * Complete message
      * Send to orders-shipping queue
      * Publish OrderConfirmed event to order-events topic
  - If out of stock:
      * Publish OrderCancelled event to topic
      * Complete message (no retry, business logic decision)

Step 4: Shipping

Shipping Service:
  - Pull message from orders-shipping queue
  - Generate shipping label
  - Complete message
  - Publish OrderShipped event to topic

Step 5: Notifications

Email Service:
  - Subscribe to order-events topic (email-notifications subscription)
  - Filter: eventType = 'OrderConfirmed' OR 'OrderShipped'
  - Send email to customer

3. Failure Scenarios:

Payment Gateway Down:

  • Payment service can't process, abandons message
  • Message returns to queue, retries after 1 minute (exponential backoff)
  • After 5 retries (5 minutes), message moves to DLQ
  • Ops team alerted, manually processes DLQ when gateway recovers

Inventory Service Crashes Mid-Processing:

  • Message lock expires (1 minute default)
  • Message becomes available again (automatic retry)
  • New instance picks up message, processes successfully
  • Result: Effectively exactly-once processing (at-least-once delivery + duplicate detection + transactions)

4. Scaling:

  • Normal load (10K orders/hour): 1 messaging unit handles 1,000 msg/sec
  • Peak load (50K orders/hour): Auto-scale to 3 messaging units (3,000 msg/sec)
  • Cost: 1 MU = $668/month, 3 MU during peak = $2,004/month (auto-scale per hour)

Must Know - Service Bus:

  • Queues vs Topics: Queue = 1 receiver, Topic = multiple subscribers (pub-sub)
  • Sessions = ordering: Without sessions, messages processed in any order; with sessions (same ID) = FIFO
  • Premium for VNet: Only Premium tier supports VNet integration and geo-DR
  • Message size: Basic/Standard = 256KB max, Premium = 100MB max
  • Dead-letter queue automatic: Every queue/subscription has DLQ, messages auto-moved on failure
  • Duplicate detection window: 30 seconds to 7 days, prevents reprocessing same message

API Management - API Gateway

What it is: API Management is a fully managed API gateway that sits between clients and backend services. It provides: API security (authentication, rate limiting), transformation (request/response modification), caching, analytics, and developer portal.

Why it exists: Exposing APIs directly to clients creates problems:

  • Security: Each API implements auth differently (inconsistent, error-prone)
  • Rate limiting: No protection against DDoS or abusive clients
  • Versioning: Breaking changes require updating all clients
  • Monitoring: No central visibility into API usage, errors
  • Documentation: APIs documented separately (hard to discover)

API Management centralizes these concerns in a single gateway.

How it works (Detailed):

  1. Components:

    API Gateway:

    • Entry point for all API requests
    • Executes policies: authentication, rate limiting, transformation
    • Routes requests to backend services
    • Deployed in Azure (managed) or self-hosted (container)

    Management Plane:

    • Configure APIs, products, policies via Azure Portal/CLI/ARM
    • Define rate limits, caching rules, transformations

    Developer Portal:

    • Self-service portal for API consumers
    • Browse API catalog, view documentation (OpenAPI)
    • Generate API keys, test APIs interactively
  2. Policies (applied at different scopes):

    Inbound (before backend call):

    • validate-jwt: Verify OAuth token from Azure AD
    • rate-limit: Max 100 requests per minute per API key
    • set-header: Add correlation ID header
    • cache-lookup: Check cache before backend call

    Backend (modify backend request):

    • set-backend-service: Route to different backend based on condition
    • retry: Retry backend call on failure (exponential backoff)

    Outbound (modify response):

    • cache-store: Cache response for 10 minutes
    • json-to-xml: Convert JSON response to XML
    • set-header: Add cache headers, CORS headers

    On-Error:

    • log-to-eventhub: Log errors to Event Hub for analysis
    • return-response: Return custom error message
  3. Tiers:

    • Consumption: Serverless, auto-scale, $3.50 per million calls + $0.0014/GB
    • Developer: 1 unit, no SLA, $50/month (dev/test only)
    • Basic: 2 units, 99.95% SLA, $158/month
    • Standard: 4 units, 99.95% SLA, $800/month
    • Premium: Multi-region, VNet, 99.99% SLA, $2,800/month per region
  4. Products and Subscriptions:

    • Product: Bundle of APIs with usage quota, terms of use
    • Subscription: API key granting access to product
    • Example: "Starter" product (5 APIs, 1K req/day), "Enterprise" product (all APIs, unlimited)

Detailed Example: Multi-Backend API Gateway:

Situation: SaaS company exposes APIs for: User management (on-prem), Orders (Azure Functions), Analytics (AKS). Need unified API, OAuth security, rate limiting, caching. Support 100K external developers.

Solution Architecture:

1. API Management Configuration:

  • Tier: Premium (2 units, multi-region)
  • Regions: East US (primary), West Europe (failover + low latency)
  • VNet Integration: Enabled (access on-prem user service via VPN)
  • Custom domain: api.company.com (SSL cert from Key Vault)

2. Backend Services:

  • Users API: On-premises .NET service (via VPN Gateway)
  • Orders API: Azure Functions (Consumption plan)
  • Analytics API: AKS cluster (internal load balancer)

3. API Design:

Users API (/api/users):

<policies>
  <inbound>
    <!-- Authenticate with Azure AD OAuth -->
    <validate-jwt header-name="Authorization"
                  failed-validation-httpcode="401">
      <openid-config url="https://login.microsoftonline.com/common/.well-known/openid-configuration" />
      <audiences>
        <audience>api://company-api</audience>
      </audiences>
    </validate-jwt>

    <!-- Rate limit: 100 req/min per subscription -->
    <rate-limit calls="100" renewal-period="60" />

    <!-- Add correlation ID for tracing -->
    <set-header name="X-Correlation-ID" exists-action="override">
      <value>@(Guid.NewGuid().ToString())</value>
    </set-header>

    <!-- CORS policy runs in the inbound section -->
    <cors allow-credentials="true">
      <allowed-origins>
        <origin>https://app.company.com</origin>
      </allowed-origins>
      <allowed-methods>
        <method>GET</method>
        <method>POST</method>
      </allowed-methods>
      <allowed-headers>
        <header>*</header>
      </allowed-headers>
    </cors>

    <!-- Check cache before backend call -->
    <cache-lookup vary-by-developer="false" vary-by-developer-groups="false" />
  </inbound>

  <backend>
    <!-- Route to on-prem service via VPN -->
    <set-backend-service base-url="http://10.0.1.10/users-api" />

    <!-- Retry 5xx responses (3 times, exponential backoff) -->
    <retry condition="@(context.Response.StatusCode >= 500)"
           count="3" interval="1" delta="1" max-interval="4">
      <forward-request />
    </retry>
  </backend>

  <outbound>
    <!-- Cache GET responses for 5 minutes -->
    <cache-store duration="300" />
  </outbound>
</policies>

Orders API (/api/orders):

<policies>
  <inbound>
    <validate-jwt header-name="Authorization" .../>
    <rate-limit calls="1000" renewal-period="60" />

    <!-- Check cache before backend call -->
    <cache-lookup vary-by-developer="true" vary-by-developer-groups="false" />
  </inbound>

  <backend>
    <!-- Route to Azure Functions -->
    <set-backend-service base-url="https://orders-func.azurewebsites.net" />
  </backend>

  <outbound>
    <cache-store duration="60" /> <!-- 1 minute cache -->

    <!-- Transform: Remove internal fields -->
    <set-body>@{
      var response = context.Response.Body.As<JObject>();
      response.Remove("internalProcessingId");
      response.Remove("debugInfo");
      return response.ToString();
    }</set-body>
  </outbound>
</policies>

Analytics API (/api/analytics):

<policies>
  <inbound>
    <validate-jwt header-name="Authorization" .../>

    <!-- Throttle by IP: max 10 concurrent requests -->
    <rate-limit-by-key calls="10"
                       renewal-period="60"
                       counter-key="@(context.Request.IpAddress)" />
  </inbound>

  <backend>
    <!-- Route to AKS internal LB -->
    <set-backend-service base-url="http://10.0.2.50/analytics-api" />
  </backend>

  <on-error>
    <!-- Log errors to Application Insights -->
    <trace source="error-logger" severity="error">
      <message>@(context.LastError.Message)</message>
    </trace>

    <!-- Return custom error -->
    <return-response>
      <set-status code="500" reason="Internal Server Error" />
      <set-body>@{
        return new JObject(
          new JProperty("error", "An error occurred"),
          new JProperty("correlationId", context.RequestId)
        ).ToString();
      }</set-body>
    </return-response>
  </on-error>
</policies>

4. Products:

Free Tier:

  • Access: Users API (read-only), Orders API (own orders)
  • Rate limit: 100 requests/hour
  • No SLA, community support

Professional Tier ($99/month):

  • Access: All APIs
  • Rate limit: 10,000 requests/hour
  • 99.9% SLA, email support

Enterprise Tier (custom pricing):

  • Access: All APIs + premium features
  • Rate limit: Unlimited (custom quota)
  • 99.99% SLA, dedicated support

5. Traffic Scenario:

Developer makes API call:

  1. Request: GET https://api.company.com/api/orders/12345
  2. APIM validates JWT token (Azure AD OAuth)
  3. Check subscription key, rate limit (within quota)
  4. Cache lookup (miss, order data frequently changes)
  5. Route to Azure Functions backend
  6. Backend returns order data
  7. Transform response (remove internal fields)
  8. Cache response for 1 minute
  9. Return to client (120ms total latency)

6. Multi-Region Failover:

  • East US region down: Traffic Manager detects unhealthy APIM endpoint
  • Redirects traffic to West Europe region (APIM Premium multi-region)
  • Latency increases 50ms for US clients (cross-region)
  • RTO: <1 minute (automatic DNS failover)

7. Cost:

  • Premium tier (2 units, 2 regions): $5,600/month base
  • API calls: 50M/month × $0.0035 per 10K = $17.50/month
  • Egress: 500GB × $0.087/GB = $43.50/month
  • Total: ~$5,700/month

Must Know - API Management:

  • Policies execute in order: Inbound → Backend → Outbound → On-Error (if failure)
  • Consumption tier limitations: No VNet support, no multi-region, cold start possible
  • Premium required for: VNet integration, multi-region deployment, >99.95% SLA
  • Caching scope: Developer-specific (vary-by-developer), product-specific, or global
  • Subscription keys: Primary + secondary (rotate without downtime)
  • Named values: Store secrets (connection strings, API keys) referenced in policies

Section 3: Design Network Solutions

Introduction

The problem: Applications need network connectivity: users accessing websites, services communicating, hybrid connections to on-premises. Poor network design causes: security vulnerabilities (exposed services), performance issues (cross-region latency), high costs (unnecessary traffic through expensive paths), compliance failures (data leaving region).

The solution: Azure provides comprehensive networking: VNets for isolation, VPN/ExpressRoute for hybrid connectivity, load balancers for traffic distribution, Azure Firewall for security. Proper network architecture ensures security, performance, and compliance.

Why it's tested: Network design is fundamental to AZ-305 because:

  • Security foundation: Network controls are first line of defense
  • Performance impact: Network topology affects latency, throughput
  • Hybrid scenarios: Most enterprises have on-premises connectivity needs
  • Cost optimization: Traffic routing significantly impacts egress costs

Core Concepts

Virtual Network (VNet) - Network Isolation

What it is: Azure VNet is a logically isolated network in Azure, providing IP addressing, subnets, routing, and connectivity for Azure resources. VNets are region-specific and can be connected via peering or VPN.

Why it exists: Azure resources need network-level isolation:

  • Security: Prevent unauthorized access between resources
  • IP management: Control IP addressing for resources
  • Hybrid integration: Connect to on-premises via VPN/ExpressRoute
  • Service endpoints: Private connectivity to Azure services (Storage, SQL)

VNets provide private networking similar to on-premises datacenter networks.

How it works (Detailed):

  1. VNet Components:

    Address Space:

    • CIDR notation: 10.0.0.0/16 (65,536 IPs)
    • Private IP ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
    • Can have multiple address spaces (10.0.0.0/16 + 172.16.0.0/16)

    Subnets:

    • Divide VNet address space: 10.0.1.0/24 (256 IPs, 251 usable)
    • Azure reserves 5 IPs per subnet (.0, .1, .2, .3, .255)
    • Special subnets: GatewaySubnet (VPN Gateway), AzureFirewallSubnet

    Network Security Groups (NSG):

    • Firewall rules for subnets or NICs
    • Inbound/outbound rules with priority (100-4096, lower = higher priority)
    • Default rules: Allow VNet-to-VNet, deny Internet inbound, allow outbound

    Route Tables:

    • Custom routes to override Azure system routes
    • Example: Force Internet traffic through firewall (0.0.0.0/0 → Firewall IP)
  2. VNet Peering:

    • Connects two VNets (same region or cross-region)
    • Traffic uses Azure backbone (not Internet)
    • Transitive: Not supported by default (A↔B, B↔C doesn't mean A↔C)
    • Use case: Hub-spoke topology (hub has shared services, spokes peer to hub)
  3. Service Endpoints and Private Endpoints:

    Service Endpoints:

    • Extend VNet private address space to Azure services
    • Traffic to Storage/SQL stays on Azure backbone
    • Service still has public IP, but restricted to VNet
    • Free, no bandwidth charges

    Private Endpoints:

    • Injects private IP for Azure service into VNet
    • Service accessible only from VNet (no public IP)
    • Supports DNS integration (privatelink.blob.core.windows.net)
    • Cost: $0.01/hour per endpoint (~$7/month)

Detailed Example: Hub-Spoke Network Topology:

Situation: Enterprise with 10 application teams, each needs isolated VNet. Shared services: Active Directory, DNS, monitoring. On-premises connectivity via ExpressRoute. Need central security control, prevent team-to-team direct access.

Solution Architecture:

1. VNet Design:

Hub VNet (10.0.0.0/16):

  • Region: East US
  • Subnets:
    • GatewaySubnet: 10.0.0.0/27 (ExpressRoute Gateway)
    • AzureFirewallSubnet: 10.0.1.0/26 (Azure Firewall)
    • SharedServices: 10.0.2.0/24 (AD, DNS, Jump box)
    • Management: 10.0.3.0/24 (monitoring tools)

Spoke VNets (application teams):

  • Team A VNet: 10.1.0.0/16 (East US)
  • Team B VNet: 10.2.0.0/16 (East US)
  • Team C VNet: 10.3.0.0/16 (West US - different region)
  • ... (10 total spoke VNets)

2. Peering Configuration:

  • Hub ↔ Team A: Peering with "Use Remote Gateways" (Team A uses hub's ExpressRoute)
  • Hub ↔ Team B: Peering with "Use Remote Gateways"
  • Hub ↔ Team C: Global VNet Peering (cross-region, $0.035/GB)
  • No spoke-to-spoke peering: Team A can't directly access Team B

3. Traffic Flow:

Team A to On-Premises:

  1. Team A VM (10.1.1.5) sends packet to on-prem (192.168.1.10)
  2. Peering routes to Hub VNet
  3. ExpressRoute Gateway forwards to on-premises
  4. Latency: 10ms (Azure backbone + ExpressRoute)

Team A to Team B (blocked):

  1. Team A VM tries to access Team B (10.2.1.5)
  2. No peering between spokes, packet dropped
  3. Result: Isolation enforced

Team A to Team B via Firewall (allowed):

  1. Route table in Team A: 10.2.0.0/16 → Firewall (10.0.1.4)
  2. Firewall receives packet, applies rules
  3. Allow rule: Source=TeamA, Dest=TeamB, Service=HTTPS (443)
  4. Firewall forwards to Team B via Hub
  5. Latency: 5ms (firewall inspection)

4. Azure Firewall Rules:

Network Rules (L3/L4):

Rule 1: Team A to Team B HTTPS
  Source: 10.1.0.0/16
  Destination: 10.2.0.0/16
  Protocol: TCP
  Port: 443
  Action: Allow

Rule 2: All teams to Shared Services
  Source: 10.1.0.0/16, 10.2.0.0/16, 10.3.0.0/16
  Destination: 10.0.2.0/24
  Protocol: Any
  Action: Allow

Application Rules (L7, FQDN):

Rule 3: Allow Azure DevOps
  Source: 10.0.0.0/8
  Target FQDN: dev.azure.com, *.visualstudio.com
  Protocol: HTTPS
  Action: Allow

Rule 4: Block Social Media
  Source: 10.0.0.0/8
  Target FQDN: *.facebook.com, *.twitter.com
  Action: Deny

5. Cost:

  • Hub VNet: Free
  • Spoke VNets (10): Free
  • VNet Peering: $0.01/GB (intra-region), $0.035/GB (cross-region to Team C)
  • Azure Firewall: $1.25/hour = $912.50/month + $0.016/GB processed
  • ExpressRoute Gateway: ~$262/month (gateway compute; ExpressRoute circuit billed separately)
  • Total: ~$1,500/month for network infrastructure

Must Know - Virtual Networks:

  • 5 IPs reserved per subnet: .0 (network), .1 (gateway), .2/.3 (DNS), .255 (broadcast)
  • VNet peering not transitive: A↔B + B↔C ≠ A↔C (need routes or third peering)
  • Service endpoint vs Private endpoint: Service endpoint = service keeps its public IP (access restricted to VNet), Private endpoint = private IP in your VNet
  • Global peering cost: Cross-region peering = $0.035/GB (intra-region = $0.01/GB)
  • NSG stateful: Allow inbound HTTP (80) automatically allows outbound response
  • GatewaySubnet name required: VPN/ExpressRoute Gateway must be in subnet named "GatewaySubnet"

Application Gateway - Layer 7 Load Balancer

What it is: Application Gateway is a web traffic load balancer (Layer 7) that routes HTTP/HTTPS traffic based on URL path, host headers, and supports SSL termination, WAF, autoscaling, and zone redundancy.

Why it exists: Traditional load balancers work at Layer 4 (TCP/UDP) - no awareness of HTTP:

  • Can't route by URL: Can't send /api → backend1, /images → backend2
  • No SSL termination: Each backend handles SSL (CPU overhead)
  • No WAF: No protection against SQL injection, XSS attacks
  • No host-based routing: Can't route multiple domains to different backends

Application Gateway provides HTTP-aware routing, security, and optimization.

How it works (Detailed):

  1. Components:

    Frontend:

    • Public IP or Private IP (internal load balancer)
    • SSL certificate (for HTTPS termination)
    • Listener: IP + port + protocol (HTTP/HTTPS)

    Routing Rules:

    • Path-based: /api → backend-api, /images → backend-storage
    • Multi-site: contoso.com → backend1, fabrikam.com → backend2
    • Priority-based: Process rules in priority order (1-20000)

    Backend Pools:

    • VMs, VMSS, App Service, AKS, IP addresses
    • Health probe checks backend health (HTTP GET every 30 sec)

    HTTP Settings:

    • Backend port (80, 443, custom)
    • Protocol (HTTP or HTTPS)
    • Cookie-based affinity (session stickiness)
    • Request timeout (1-86400 seconds)
  2. Web Application Firewall (WAF):

    • OWASP Core Rule Set: Protection against top 10 vulnerabilities
    • Bot protection: Block malicious bots (scrapers, scanners)
    • Custom rules: Block by IP, geo-location, rate limit
    • Modes: Detection (log only) or Prevention (block)
  3. SSL Termination and End-to-End SSL:

    • SSL termination: Gateway decrypts, sends HTTP to backend (offloads backend CPU)
    • End-to-end SSL: Gateway decrypts, re-encrypts, sends HTTPS to backend (for compliance)
    • Certificate management: Store certs in Key Vault, auto-rotate
  4. Autoscaling:

    • V2 SKU supports autoscaling (V1 requires manual scaling)
    • Min instances: 2 (zone-redundant), Max: 125
    • Scale based on request rate, CPU, connection count

Detailed Example: Multi-Site Application Gateway:

Situation: Company hosts 3 customer-facing websites: www.contoso.com (e-commerce), api.contoso.com (REST API), admin.contoso.com (admin portal). Need SSL termination, WAF protection, autoscaling. 100K requests/day normal, 1M during sales.

Solution Architecture:

1. Application Gateway Configuration:

  • SKU: WAF_v2 (autoscaling + WAF)
  • Tier: WAF_v2
  • Capacity: Min 2, Max 10 (autoscale)
  • Zones: Zone-redundant (instances across 3 zones)
  • VNet: 10.0.0.0/16, Subnet: 10.0.1.0/24 (Application Gateway subnet)

2. Frontend Configuration:

Public IP: Static (appgw-pip)
Listeners:

  • Listener 1: www.contoso.com, Port 443 (HTTPS), SSL cert from Key Vault
  • Listener 2: api.contoso.com, Port 443 (HTTPS), SSL cert from Key Vault
  • Listener 3: admin.contoso.com, Port 443 (HTTPS), SSL cert from Key Vault
  • Listener 4: HTTP → HTTPS redirect (port 80 → 443)

3. Backend Pools:

www-backend:

  • 5 VMs in VMSS (Standard_D4s_v5)
  • Health probe: GET /health, 200 OK expected
  • HTTP Settings: Port 80 (HTTP), cookie affinity enabled

api-backend:

  • AKS cluster (10.1.0.0/16)
  • Internal load balancer: 10.1.1.100
  • Health probe: GET /api/health
  • HTTP Settings: Port 443 (HTTPS), custom host header: api-internal

admin-backend:

  • App Service (admin-app.azurewebsites.net)
  • Health probe: GET /
  • HTTP Settings: Port 443 (HTTPS), pick hostname from backend

4. Routing Rules:

Rule 1: www.contoso.com

Listener: www.contoso.com (HTTPS)
Backend pool: www-backend
HTTP settings: Port 80, cookie affinity
Priority: 100

Rule 2: api.contoso.com with path routing

Listener: api.contoso.com (HTTPS)
Path-based:
  - /api/v1/* → api-backend (v1 settings)
  - /api/v2/* → api-v2-backend (v2 settings)
Default: api-backend
Priority: 200

Rule 3: admin.contoso.com with IP restriction

Listener: admin.contoso.com (HTTPS)
Backend pool: admin-backend
HTTP settings: Port 443, HTTPS
WAF policy: Admin-WAF (allow only corporate IP ranges)
Priority: 300

5. WAF Configuration:

www.contoso.com Policy:

  • Mode: Prevention
  • Rule set: OWASP 3.2
  • Custom rules:
    • Rate limit: 100 requests/minute per IP
    • Geo-blocking: Block countries except US, CA, MX
    • Bot protection: Block known malicious bots

api.contoso.com Policy:

  • Mode: Prevention
  • Rule set: OWASP 3.2
  • Custom rules:
    • API key validation: Require X-API-Key header
    • Rate limit: 1000 requests/minute per API key (higher for API traffic)

admin.contoso.com Policy:

  • Mode: Prevention
  • Rule set: OWASP 3.2
  • Custom rules:
    • IP allowlist: Only corporate IP ranges (203.0.113.0/24)
    • MFA required: Check for MFA claim in JWT token

6. Traffic Flow:

User Request to www.contoso.com:

  1. User browser: GET https://www.contoso.com/products
  2. DNS resolves to Application Gateway public IP (20.10.5.100)
  3. SSL termination at gateway (decrypts using cert from Key Vault)
  4. WAF inspects request (OWASP rules, rate limit, geo-location)
  5. Routing rule matches listener (www.contoso.com)
  6. Forwards to www-backend pool (HTTP port 80)
  7. Health probe ensures backend healthy
  8. Round-robin to one of 5 VMs (or based on session cookie)
  9. Backend returns HTML
  10. Gateway returns to user (re-encrypts with SSL)
  11. Total latency: 120ms (20ms SSL, 50ms WAF, 50ms backend)

7. Autoscaling Scenario:

Normal traffic (100K requests/day):

  • 2 gateway instances (min capacity)
  • CPU: 30%, handling 70 requests/sec
  • Cost: $0.443/hour × 2 instances × 730 hours = $646/month

Sale traffic (1M requests/day for 1 week):

  • Traffic spikes to 700 requests/sec
  • CPU > 80%, autoscale triggers
  • Scales to 8 instances (~90 requests/sec per instance)
  • After 1 week, traffic returns to normal, scales down to 2 instances
  • Cost: $646 (base) + ~$447 (6 extra instances × $0.443/hour × 168 hours) = ~$1,093 for the month

8. SSL Certificate Management:

  • Certificates stored in Azure Key Vault
  • Application Gateway uses managed identity to access Key Vault
  • Auto-renewal: Key Vault auto-renews cert 30 days before expiry
  • Application Gateway polls Key Vault every 4 hours, picks up new cert
  • No downtime during certificate rotation

Must Know - Application Gateway:

  • V2 vs V1: V2 = autoscaling + zone redundancy + WAF, V1 = manual scaling (deprecated)
  • WAF modes: Detection = log only, Prevention = block attacks
  • SSL termination vs End-to-end SSL: Termination = decrypt only, End-to-end = decrypt + re-encrypt to backend
  • Path-based routing: Single listener, route by URL (/api → backend1, /images → backend2)
  • Multi-site routing: Multiple listeners, route by hostname (site1.com → backend1, site2.com → backend2)
  • Health probe required: Unhealthy backends removed from pool until healthy again

Chapter Summary

What We Covered

Compute Solutions:

  • Azure VMs with availability sets (99.95%) and zones (99.99%)
  • AKS for container orchestration with HPA and cluster autoscaler
  • Azure Functions serverless with Consumption/Premium plans
  • VMSS for autoscaling VM fleets

Application Architecture:

  • Service Bus for reliable messaging (queues, topics, sessions)
  • Event Grid for event routing (pub-sub, event sources)
  • Event Hubs for streaming telemetry (millions of events/sec)
  • API Management for API gateway (policies, products, developer portal)

Network Solutions:

  • VNet architecture with hub-spoke topology
  • VPN Gateway and ExpressRoute for hybrid connectivity
  • Application Gateway for Layer 7 load balancing and WAF
  • Azure Firewall for network security and FQDN filtering

Critical Takeaways

  1. Compute choice = cost/performance tradeoff: VMs = full control + cost, Containers = portability, Serverless = auto-scale + pay-per-use
  2. Availability Zones = datacenter redundancy: 99.99% SLA requires zones (vs 99.95% for sets; see the worked composite-SLA example below)
  3. Messaging decouples services: Service Bus = guaranteed delivery, Event Grid = reactive events, Event Hubs = streaming
  4. Hub-spoke network = centralized control: Shared services in hub, isolated workloads in spokes, firewall for security
  5. Application Gateway = web traffic LB: Layer 7 routing, SSL termination, WAF, autoscaling
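
A worked composite-SLA calculation for takeaway 2 (the tier SLAs are illustrative): serially dependent tiers multiply, so chaining Front Door (99.99%), zone-redundant AKS (99.99%), and SQL Business Critical (99.99%) gives

  0.9999 × 0.9999 × 0.9999 ≈ 0.9997 = 99.97%

which already misses a 99.99% target. Redundant paths compose in parallel instead: two independent regions at 99.97% each yield 1 − (1 − 0.9997)² ≈ 99.99999%, which is why exam answers favor multi-region designs for strict SLA requirements.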

Self-Assessment Checklist

  • I can choose between VMs, AKS, and Functions based on requirements
  • I understand difference between availability sets and zones
  • I can design AKS architecture with node pools and autoscaling
  • I know when to use Service Bus vs Event Grid vs Event Hubs
  • I can design hub-spoke network with VNet peering
  • I understand Application Gateway routing rules and WAF policies
  • I can calculate composite SLAs for multi-tier solutions

Practice Questions

Try these from practice test bundles:

  • Domain 4 Bundle 1: Questions 1-100 (compute and containers)
  • Domain 4 Bundle 2: Questions 101-200 (messaging and networking)
  • Infrastructure Solutions Bundle: Questions 1-150 (comprehensive)

Expected score: 80%+ to proceed


Next Chapter: Proceed to Integration to learn about cross-domain scenarios combining identity, data, compute, and networking solutions.


Chapter 5: Integration - Cross-Domain Scenarios

Chapter Overview

What you'll learn:

  • How identity, data, compute, and networking integrate in real solutions
  • Multi-service architectures combining Azure services
  • Decision frameworks for choosing between alternatives
  • Common architectural patterns for the AZ-305 exam

Time to complete: 4-6 hours

Prerequisites: Chapters 1-4 (all domains)


Integration Scenario 1: Secure Multi-Tier Web Application

Business Requirements

Global e-commerce platform serving 5M users across 3 continents. Requirements:

  • Performance: <200ms page load globally
  • Security: PCI DSS compliance, zero-trust architecture
  • Availability: 99.99% uptime SLA
  • Scale: Handle 10x traffic during Black Friday
  • Cost: Optimize for $50K/month budget

Architecture Design

Identity & Security Layer (Domain 1):

  • Microsoft Entra ID: Centralized authentication for admins and APIs
  • Conditional Access: Require MFA + compliant device for admin access
  • PIM: Just-in-time elevation for production access (max 4 hours)
  • Key Vault: Store SSL certs, connection strings, API keys
  • Managed Identity: VM/AKS access to Key Vault without credentials

Data Layer (Domain 2):

  • Cosmos DB: Product catalog (globally distributed, multi-region writes)
  • Azure SQL Business Critical: Order database (zone-redundant, 99.99% SLA)
  • Azure Cache for Redis: Session state, product cache (Premium tier, zone-redundant)
  • Blob Storage: Product images (Hot tier, CDN-enabled)

Compute Layer (Domain 4):

  • AKS: Microservices (catalog, cart, checkout) - autoscale 10-30 nodes
  • Azure Functions: Order processing, inventory updates (Premium plan, VNet-integrated)
  • VM Scale Sets: Legacy order fulfillment system (can't containerize yet)

Networking Layer (Domain 4):

  • Azure Front Door: Global load balancer, WAF, SSL termination
  • VNet: Hub-spoke topology (hub = shared services, spokes = app environments)
  • Private Endpoints: SQL, Storage, Key Vault (no public access)
  • Azure Firewall: Egress filtering for AKS to external APIs

Business Continuity (Domain 3):

  • Availability Zones: All services zone-redundant (3 zones)
  • Geo-replication: Cosmos DB in 3 regions (US, EU, Asia), SQL geo-replica in paired region
  • Backup: SQL PITR (35 days), VM backups (30 days), blob soft delete (14 days)
  • DR: Active-active across regions, Front Door routes to healthy region

Traffic Flow

User Request (New York user buying product):

  1. Browser → Front Door (global entry point)

    • DNS: app.contoso.com → Front Door anycast IP
    • TLS termination at Front Door (cert from Key Vault)
    • WAF inspects request (OWASP rules, bot protection)
    • Routes to East US region (closest to user, <50ms)
  2. Front Door → AKS Ingress (internal load balancer)

    • Private endpoint connection (no public IP on AKS)
    • NGINX Ingress Controller receives request
    • Routes to catalog-service pod (based on URL /api/products)
  3. Catalog Service → Cosmos DB (read product data)

    • Service uses Managed Identity to authenticate
    • Connects via Private Endpoint (10.0.3.5)
    • Cosmos DB returns product from East US read region (<10ms)
    • Service caches result in Redis for 5 minutes
  4. User Adds to Cart → Cart Service

    • Front Door routes /api/cart to cart-service
    • Service reads session from Redis (sub-millisecond)
    • Updates cart, writes to Redis with 30-minute expiry
  5. User Checks Out → Checkout Service

    • Checkout service creates order in Azure SQL (via Private Endpoint)
    • SQL Business Critical: synchronous replication across 3 zones
    • Transaction committed, order ID returned (<100ms)
  6. Order Created → Functions Trigger (async processing)

    • Checkout service publishes OrderCreated event to Service Bus topic
    • Azure Function (Premium plan, pre-warmed) subscribes to topic
    • Function processes payment via external gateway
    • Updates inventory in Cosmos DB
    • Sends confirmation email via Logic Apps

Cross-Domain Integration Points

Identity ↔ Compute:

  • AKS pods use Managed Identity (workload identity) to access Key Vault
  • No secrets in code or environment variables (see the sketch below)
  • PIM grants temporary admin access to AKS cluster (kubectl access)
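
A minimal sketch of the no-secrets pattern (the vault URL and secret name are hypothetical; DefaultAzureCredential resolves the pod's workload identity at runtime, or a developer login locally):

using System;
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

// No connection string or key anywhere in code or config:
var client = new SecretClient(
    new Uri("https://contoso-vault.vault.azure.net"),  // hypothetical vault
    new DefaultAzureCredential());

KeyVaultSecret secret = await client.GetSecretAsync("payment-gateway-key");
Console.WriteLine($"Retrieved secret version {secret.Properties.Version}");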

Data ↔ Networking:

  • Private Endpoints inject SQL, Cosmos DB into VNet (10.0.3.0/24 subnet)
  • NSG rules: Only AKS subnet can access data subnet
  • No public internet access to data layer (zero-trust)

Compute ↔ Business Continuity:

  • AKS nodes in 3 availability zones (zone-redundant node pools)
  • Functions auto-scale across zones (Premium plan zone-redundancy)
  • If Zone 1 fails: AKS scheduler moves pods to Zone 2/3, Functions scale out in healthy zones

Networking ↔ Security:

  • Azure Firewall controls AKS egress (allow only approved APIs)
  • Application rules: *.stripe.com (payment), *.sendgrid.net (email)
  • Network rules: Block all outbound except approved FQDNs
  • Threat intelligence feed blocks malicious IPs

Cost Optimization

Before Optimization ($85K/month):

  • AKS: 30 D8s_v5 nodes always-on = $20K
  • Cosmos DB: 50K RU/s provisioned = $30K
  • SQL: Business Critical 16 vCore always-on = $15K
  • Front Door: $5K
  • Other services: $15K

After Optimization ($48K/month):

  • AKS: Autoscale 10-30 nodes (avg 15) = $10K (-50%)
  • Cosmos DB: Autoscale 10K-50K RU/s (avg 25K) = $15K (-50%)
  • SQL: Reserved Instance 3-year = $7K (-53%)
  • Functions: Premium plan with consumption pricing = $2K
  • Front Door: Same $5K
  • Other: $9K
  • Savings: 44% reduction ($37K/month saved)

Optimizations Applied:

  1. AKS cluster autoscaler (min 10, max 30 nodes)
  2. Cosmos DB autoscale (provision for avg, not peak)
  3. SQL Reserved Instances for predictable workload
  4. Azure Hybrid Benefit for Windows VMs (save 40%)
  5. Spot VMs for batch processing (save 90%)

Integration Scenario 2: Enterprise Hybrid Integration

Business Requirements

Financial services company with on-premises datacenter + Azure. Requirements:

  • Hybrid: Keep core banking on-prem (compliance), new apps in Azure
  • Connectivity: Private, reliable connection (not Internet)
  • Security: On-prem AD integration, no public endpoints
  • DR: Azure as DR site for on-prem workloads
  • Data sync: Replicate on-prem SQL to Azure (analytics)

Architecture Design

Hybrid Connectivity (Domain 4):

  • ExpressRoute: 10 Gbps circuit to on-prem datacenter (Boston)
  • VPN Gateway: Backup path (if ExpressRoute fails)
  • Route Server: BGP peering between ExpressRoute and Azure Firewall
  • Private DNS: On-prem DNS forwards to Azure Private DNS zones

Identity Integration (Domain 1):

  • Entra Connect: Sync on-prem AD users to Entra ID (hybrid identities)
  • Passthrough Authentication: Users authenticate against on-prem AD (SSO)
  • Conditional Access: Cloud policy applied to on-prem apps (via App Proxy)
  • RBAC: Entra ID groups grant access to Azure resources

Data Replication (Domain 2):

  • SQL Data Sync: Bidirectional sync on-prem ↔ Azure SQL (trading data)
  • Azure Data Factory: ETL pipelines (on-prem SQL → Synapse for analytics)
  • Self-hosted Integration Runtime: Runs in on-prem datacenter, connects to ADF
  • Azure File Sync: Replicate file server to Azure Files (disaster recovery)

DR Architecture (Domain 3):

  • Azure Site Recovery: Replicate on-prem VMware VMs to Azure
  • Recovery Plan: Automated failover (database → app → web sequence)
  • Test Failover: Quarterly DR drills (no impact to production)
  • RPO: 5 minutes (ASR replication frequency), RTO: 30 minutes (automated failover)

Traffic Flow

User Accessing On-Prem App from Azure:

  1. Azure VM → ExpressRoute Gateway

    • VM in Azure (10.1.1.5) needs to access on-prem app (192.168.1.100)
    • Route table: 192.168.0.0/16 → ExpressRoute Gateway (10.0.0.1)
    • Traffic goes through ExpressRoute (private connection, no Internet)
  2. ExpressRoute → On-Prem Datacenter

    • Traffic arrives at on-prem edge router (via ExpressRoute circuit)
    • On-prem firewall allows Azure subnet (10.1.0.0/16)
    • Routes to application server (192.168.1.100)
    • Latency: 15ms (Boston to Azure East US)

Data Sync Flow (On-Prem SQL → Azure Synapse):

  1. ADF Pipeline Trigger (nightly at 2 AM)

    • Managed Identity authenticates ADF to Azure resources
    • Pipeline executes Copy Activity (on-prem SQL → Synapse)
  2. Self-Hosted IR Connects to On-Prem SQL

    • Integration Runtime runs in on-prem datacenter (VM)
    • Connects to SQL Server using SQL authentication (creds in Key Vault)
    • Reads data: SELECT * FROM Trades WHERE LastModified > @LastRun
  3. IR Transfers to Azure

    • Data compressed and encrypted during transfer
    • Goes through ExpressRoute (private connection)
    • Lands in Synapse staging (Blob Storage)
  4. Synapse Loads Data

    • PolyBase loads from Blob to Synapse tables
    • Transforms data (joins, aggregations)
    • Final tables ready for Power BI reporting

DR Failover Scenario (On-Prem Datacenter Down):

T+0: Boston datacenter power outage (confirmed total failure)

T+2 mins: Operations team initiates unplanned failover via Azure Portal

T+5 mins: ASR Recovery Plan executes:

  • Group 1: SQL Server VMs fail over to Azure (from latest recovery point)
  • Waits 5 minutes for SQL to start

T+10 mins: Group 2: Application server VMs fail over

  • Post-action script updates connection strings (point to Azure SQL VMs)

T+15 mins: Group 3: Web server VMs fail over

  • Post-action script updates Traffic Manager (route users to Azure)

T+20 mins: All VMs online in Azure, health checks pass

T+25 mins: Traffic Manager DNS TTL expires, users routed to Azure

T+30 mins: Full DR failover complete, business operational

RPO: 5 minutes (lost trades between last replication and failure)
RTO: 30 minutes (time to full operation in Azure)

Cross-Domain Integration

Identity ↔ Hybrid Connectivity:

  • Entra Connect sync uses ExpressRoute (not Internet) for AD replication
  • Conditional Access policies apply to on-prem apps via Application Proxy
  • MFA required even when accessing on-prem resources (zero-trust)

Data ↔ DR:

  • ASR replicates on-prem VMs, includes SQL databases
  • File Sync ensures file server data in Azure (DR ready)
  • Data Factory pipelines use Self-hosted IR for on-prem access

Networking ↔ Security:

  • ExpressRoute uses Microsoft peering (for Azure PaaS) + Private peering (for VNets)
  • NSGs in Azure allow only on-prem IPs (192.168.0.0/16)
  • On-prem firewall allows only Azure subnets (10.0.0.0/8)

Common Decision Frameworks

Compute Choice Decision Tree

Start: What type of workload?

Web Application:

  • Stateless, containerized → AKS (scalable, portable)
  • Stateless, simple → App Service (PaaS, less management)
  • Stateful, legacy → VMs (full control, lift-and-shift)

Event-Driven Processing:

  • Short duration (<10 min), unpredictable → Functions Consumption (pay-per-use)
  • Consistent load, needs VNet → Functions Premium (pre-warmed, VNet)
  • Long-running workflows → Logic Apps or Durable Functions

Batch Processing:

  • Large-scale, parallel → Azure Batch (1000s of VMs)
  • Scheduled ETL → Data Factory (orchestration)
  • Docker-based → AKS with Jobs

Messaging Choice Decision Tree

Start: What communication pattern?

Request-Response:

  • Synchronous, low latency → Direct HTTP call (with retry)
  • Async, guaranteed delivery → Service Bus Queue (transactions)

Event Broadcasting:

  • Many subscribers, filtering → Service Bus Topic (pub-sub)
  • Simple routing, serverless → Event Grid (reactive)

Streaming Data:

  • High throughput (>1M events/sec) → Event Hubs (streaming)
  • IoT devices → IoT Hub (device management)

Integration:

  • Enterprise integration → Logic Apps (connectors)
  • API composition → API Management (gateway)

Network Choice Decision Tree

Start: What connectivity need?

Hybrid On-Prem to Azure:

  • High bandwidth, mission-critical → ExpressRoute (private, 10 Gbps)
  • Backup or low bandwidth → VPN Gateway (encrypted Internet)
  • Both for redundancy → ExpressRoute + VPN (failover)

Load Balancing:

  • HTTP/HTTPS, global → Azure Front Door (anycast, WAF)
  • HTTP/HTTPS, regional → Application Gateway (Layer 7)
  • TCP/UDP, regional → Load Balancer (Layer 4)
  • DNS-based → Traffic Manager (DNS routing)

Security:

  • Network firewall → Azure Firewall (FQDN filtering)
  • Web application firewall → Front Door or App Gateway WAF
  • DDoS protection → DDoS Protection Standard

Chapter Summary

What We Covered

Multi-Service Integration:

  • Secure e-commerce platform combining identity, data, compute, networking
  • Hybrid enterprise integration with on-prem connectivity
  • DR failover orchestration across on-prem and Azure

Cross-Domain Concepts:

  • Identity integrates with compute (Managed Identity, PIM)
  • Data layer secured by networking (Private Endpoints, NSGs)
  • Compute highly available through zones and scaling
  • Networking enables hybrid connectivity (ExpressRoute, VPN)

Decision Frameworks:

  • Compute choice tree (VMs vs AKS vs Functions)
  • Messaging pattern selection (queues vs topics vs events)
  • Network design decisions (ExpressRoute vs VPN, load balancer types)

Key Integration Patterns

  1. Zero-Trust Architecture: No public endpoints, Managed Identity, Private Endpoints, NSGs
  2. Hub-Spoke Hybrid: ExpressRoute in hub, shared services, spoke VNets for apps
  3. Multi-Region Active-Active: Cosmos multi-region writes, Front Door global LB, SQL geo-replica
  4. Event-Driven Microservices: Service Bus topics, Functions subscriptions, async processing
  5. Hybrid DR: ASR replication, automated failover, ExpressRoute backup path

Self-Assessment Checklist

  • I can design multi-tier architecture combining identity, data, compute, networking
  • I understand how Managed Identity eliminates secrets in applications
  • I can design hybrid connectivity with ExpressRoute and VPN Gateway
  • I know when to use Private Endpoints vs Service Endpoints
  • I can create DR plan with RPO/RTO using ASR and geo-replication
  • I understand cross-domain security (zero-trust, NSGs, firewalls)

Practice Approach

Integration questions test multiple domains:

  • "Design secure hybrid app with DR" = Identity + Networking + Compute + BCDR
  • "Optimize cost for global e-commerce" = All domains + cost decisions
  • "Implement zero-trust architecture" = Identity + Networking + Security

Study Strategy:

  1. Review each domain chapter (1-4)
  2. Understand how services integrate (this chapter)
  3. Practice scenario-based questions (100+ from bundles)
  4. Focus on decision justification (why this service, not that one)

Next Chapter: Proceed to Study strategies for test-taking techniques, time management, and exam-day strategies.


Chapter 6: Study Strategies and Test-Taking Techniques

Chapter Overview

What you'll learn:

  • Effective study schedule for AZ-305 preparation
  • Test-taking strategies for scenario-based questions
  • Time management during the 180-minute exam
  • Common traps and how to avoid them
  • Anxiety management and exam-day preparation

Time to complete: 2-3 hours


Section 1: Study Planning

Recommended Study Schedule

Total Preparation Time: 80-120 hours over 6-8 weeks

Week 1-2: Foundation (20-30 hours)

  • Section 0: Azure Fundamentals (if new to Azure)
  • Section 1: Identity, Governance, Monitoring (25-30% of exam)
  • Practice: 50 questions from Domain 1 bundle
  • Goal: 70%+ on practice questions

Week 3-4: Data & Business Continuity (20-30 hours)

  • Section 2: Data Storage Solutions (20-25% of exam)
  • Section 3: Business Continuity (15-20% of exam)
  • Practice: 100 questions from Domains 2 & 3 bundles
  • Goal: 75%+ on practice questions

Week 5-6: Infrastructure (30-40 hours)

  • Section 4: Infrastructure Solutions (30-35% of exam - largest)
  • Deep dive: Networking, compute, application architecture
  • Practice: 150 questions from Domain 4 bundle
  • Goal: 80%+ on practice questions

Week 7: Integration & Review (15-20 hours)

  • Section 5: Integration scenarios
  • Review wrong answers from all practice tests
  • Create personal cheat sheet of weak areas
  • Goal: Identify knowledge gaps

Week 8: Final Prep (15-20 hours)

  • Section 6: Final checklist review
  • Full-length practice exam (180 minutes, 50-60 questions)
  • Review missed questions thoroughly
  • Goal: 85%+ on full practice exam

Daily Study Routine

Optimal Study Session: 2-3 hours per day

Session Structure (Pomodoro technique):

  1. Study (25 min): Read chapter section, take notes
  2. Break (5 min): Stand up, stretch, water
  3. Study (25 min): Continue reading or review diagrams
  4. Break (5 min): Short walk
  5. Practice (25 min): Answer 10-15 practice questions
  6. Review (15 min): Analyze wrong answers, add to notes
  7. Long Break (15 min): Complete mental break

Why This Works:

  • Prevents burnout (frequent breaks maintain focus)
  • Active recall (practice questions > passive reading)
  • Immediate feedback (review wrong answers while fresh)
  • Spaced repetition (revisit topics across weeks)

Learning Techniques

Active Recall (most effective):

  • After reading section, close book and write what you remember
  • Explain concept to someone else (or rubber duck)
  • Create flashcards for key facts (e.g., "NSG max rules?" = "1000")

Elaborative Interrogation (why it works):

  • Ask "why" for every concept: "Why use Premium Functions over Consumption?"
  • Answer: "Premium eliminates cold start, supports VNet, longer timeout"
  • Connects concepts (understanding > memorization)

Interleaved Practice (mix topics):

  • Don't study Domain 1 for 3 days straight
  • Mix: Identity (1 hour) → Data (1 hour) → Practice (1 hour)
  • Forces brain to discriminate between concepts
  • Research on interleaving consistently shows better retention than blocked (single-topic) practice

Spaced Repetition:

  • Review weak topics every 3 days, then weekly
  • Use practice question results to prioritize
  • Example: Scored 60% on VNet peering → Review in 3 days, 1 week, 2 weeks

Section 2: Test-Taking Strategies

Question Types and Approaches

Type 1: Direct Knowledge Questions (20% of exam)

Example:
"What is the maximum number of VMs in a single availability set?"

  • A) 50
  • B) 100
  • C) 200
  • D) 1000

Strategy:

  • These are straightforward, test memorization
  • Know limits cold: Availability Set = 200 VMs, 3 fault domains, 20 update domains
  • If unsure, eliminate obviously wrong answers
  • Time: 30 seconds max per question

Common Limits to Memorize:

  • Availability Set: 200 VMs, 3 FD, 20 UD
  • VNet: 65,536 IPs per VNet, 500 VNets per subscription
  • NSG: 1,000 rules per NSG, 5,000 NSGs per subscription
  • Function timeout: Consumption 10 min, Premium 30 min default (unlimited possible)
  • Service Bus message: Basic/Standard 256 KB, Premium 100 MB

Type 2: Scenario-Based Questions (60% of exam)

Example:
"A company has a web application that must handle traffic spikes during sales events. The application consists of stateless web servers and a SQL database. The company needs to minimize costs during normal operation while ensuring the application can scale to 10x capacity during sales. What should you recommend?"

Options:

  • A) VMs with manual scaling
  • B) VM Scale Sets with autoscale + Azure SQL elastic pool
  • C) AKS with cluster autoscaler + Azure SQL Business Critical
  • D) App Service with autoscale + Cosmos DB

Strategy:

Step 1: Identify Requirements (30 seconds)

  • Underline key words: "traffic spikes", "minimize costs", "scale to 10x", "stateless"
  • Requirements:
    • ✅ Autoscaling (handle spikes)
    • ✅ Cost optimization (minimize during normal)
    • ✅ 10x scale capability
    • ✅ Works with stateless apps

Step 2: Eliminate Wrong Answers (30 seconds)

  • A) Manual scaling ❌ (can't handle spikes automatically)
  • D) Cosmos DB ❌ (overkill for SQL database, expensive)

Step 3: Compare Remaining (30 seconds)

  • B) VMSS + Elastic Pool: ✅ Autoscale, ✅ Cost-effective, ✅ SQL compatible
  • C) AKS + Business Critical: ✅ Autoscale, ❌ More expensive (AKS overhead, BC tier)

Step 4: Choose Best Fit (30 seconds)

  • Answer: B
  • Reasoning: VMSS autoscales compute (cost-effective), elastic pool shares database resources across apps (cost-effective), matches SQL requirement
  • AKS is overcomplicated for stateless web servers (unless already containerized)

Time: 2 minutes max per scenario question

Type 3: Best Practice Questions (20% of exam)

Example:
"A company wants to ensure that administrative access to Azure VMs requires approval and is time-limited. What should you implement?"

Options:

  • A) Conditional Access with MFA
  • B) Azure AD Privileged Identity Management (PIM)
  • C) Azure Policy with deny effect
  • D) Just-in-Time VM access

Strategy:

Step 1: Identify Best Practice Pattern

  • Key words: "approval", "time-limited", "administrative access"
  • This is about least privilege + just-in-time access

Step 2: Map to Azure Services

  • Conditional Access = Access control (device, location, MFA) ❌ (no approval workflow)
  • PIM = Just-in-time elevation with approval ✅
  • Azure Policy = Governance (prevent/audit) ❌ (not access control)
  • JIT VM access = Network-level (RDP/SSH port opening) ❌ (not role-based)

Step 3: Choose Best Match

  • Answer: B (PIM)
  • PIM provides: Approval workflow, time-limited activation, audit trail
  • Defender for Cloud JIT is for network access, not role elevation

Common Best Practices:

  • Least privilege: Use PIM for admin access
  • Defense in depth: Multiple security layers (NSG + Firewall + WAF)
  • Zero trust: Verify explicitly, use least privilege, assume breach
  • Encryption: Data at rest (Storage encryption) + in transit (TLS)
  • High availability: Zones (99.99%) > Availability Sets (99.95%)

Time Management

Exam Duration: 180 minutes (3 hours)
Total Questions: 50-60 questions (varies)
Time per Question: ~3 minutes average

Recommended Pacing:

First Pass (90 minutes): Answer all questions

  • Direct knowledge (~12 questions): 30 sec each = 6 min
  • Scenarios (~36 questions): 2 min each = 72 min
  • Best practice (~12 questions): 1 min each = 12 min
  • Total: 90 minutes (for a 60-question exam)

Mark for Review: Flag uncertain questions (aim for <15 flagged)

Second Pass (60 minutes): Review flagged questions

  • 15 flagged questions × 4 minutes each = 60 min
  • Re-read scenario carefully, eliminate wrong answers
  • Trust your gut if still unsure (first instinct often correct)

Final Pass (30 minutes): Quality check

  • Verify no questions skipped
  • Review case study questions (if any)
  • Double-check calculations (e.g., SLA composite math)
  • Don't change answers unless you find clear mistake

Common Traps and How to Avoid

Trap 1: Keyword Distraction

  • Question: "Need highly available database"
  • Trap: See "database" → choose SQL Database
  • Reality: Cosmos DB might be better (globally distributed)
  • Avoid: Read ALL requirements, not just keywords

Trap 2: Over-Engineering

  • Question: "Host simple static website"
  • Trap: Design complex AKS + CDN + App Gateway
  • Reality: Blob Storage static website + CDN = simplest
  • Avoid: Choose simplest solution that meets ALL requirements

Trap 3: Under-Engineering

  • Question: "Need 99.99% SLA for web app"
  • Trap: Single region with availability set (99.95%)
  • Reality: Must use availability zones (99.99%)
  • Avoid: Calculate exact SLA requirements

Trap 4: Cost Ignorance

  • Question: "Minimize costs for dev/test environment"
  • Trap: Provision Production-tier services
  • Reality: Use lower SKUs, autoscale, Reserved Instances
  • Avoid: Consider cost implications of every choice

Trap 5: Compliance Blindness

  • Question: "PCI DSS compliant payment processing"
  • Trap: Public endpoints, no encryption
  • Reality: Private endpoints, encryption at rest/transit, audit logs
  • Avoid: Map compliance to technical controls

Handling Uncertainty

When You Don't Know the Answer:

Strategy 1: Eliminate Obviously Wrong

  • Remove answers with disqualifying terms
  • Example: "Minimize cost" → Eliminate "Premium tier" options

Strategy 2: Use Constraints

  • Question mentions "VNet integration" → Eliminate services without VNet support
  • Question needs "Windows containers" → Eliminate Linux-only options

Strategy 3: Reason by Analogy

  • "This is like the VPN Gateway question (where ExpressRoute was answer)"
  • Apply same pattern: High bandwidth + mission-critical = ExpressRoute

Strategy 4: Trust Patterns

  • Hybrid connectivity = ExpressRoute or VPN
  • Global distribution = Front Door or Traffic Manager or Cosmos DB
  • Event-driven = Functions or Logic Apps or Event Grid
  • Messaging = Service Bus or Event Hubs

Strategy 5: Educated Guess

  • After elimination, if 2 answers remain, choose more specific one
  • "Application Gateway" is more specific than "Load Balancer" for HTTP
  • Specific often correct for scenario questions

Section 3: Exam Day Preparation

Week Before Exam

7 Days Before:

  • Complete final full-length practice exam
  • Score should be 85%+ (if not, reschedule exam)
  • Review all missed questions thoroughly

3 Days Before:

  • Review personal cheat sheet (weak areas)
  • No new material (consolidation only)
  • Get adequate sleep (7-8 hours per night)

1 Day Before:

  • Light review only (2-3 hours max)
  • Prepare exam logistics: ID, confirmation email, test center location
  • Avoid heavy study (causes anxiety, fatigue)
  • Relax: Exercise, hobby, early dinner
  • Sleep 8 hours minimum

Exam Day Routine

Morning (for afternoon exam):

  • Light breakfast (avoid heavy, sugary foods)
  • Review cheat sheet (30 minutes, no more)
  • Arrive test center 30 minutes early

During Exam:

  • First 5 minutes: Breathe deeply, read instructions carefully
  • Brain dump: Write key facts on provided notepad (SLA calculations, limits)
  • Pacing: Check time every 15 questions (should have 135 min remaining after Q15)
  • Breaks: Microsoft exams allow bathroom breaks (time keeps running)
    • Take break at 90-minute mark if needed (clear head, stretch)

Mental State:

  • Anxiety spike: Breathe 4-7-8 (inhale 4 sec, hold 7 sec, exhale 8 sec)
  • Confusion: Skip question, flag for review, move on (don't spiral)
  • Fatigue: Stand up (if allowed), stretch neck, blink eyes rapidly
  • Confidence: Remember, 700/1000 passes (you don't need 100%)

After Exam

Immediate:

  • Results shown immediately (Pass/Fail + score)
  • If fail: Note weak areas from score report
  • If pass: Celebrate! Certificate available in 24 hours

Within 24 Hours:

  • Download certificate from Microsoft Learn profile
  • Share badge on LinkedIn (optional)
  • Plan next certification (AZ-104, AZ-400, or SC-300)

Section 4: Resource Recommendations

Official Microsoft Resources

Microsoft Learn Paths (Free):

  • "AZ-305: Design identity, governance, and monitoring solutions"
  • "AZ-305: Design data storage solutions"
  • "AZ-305: Design business continuity solutions"
  • "AZ-305: Design infrastructure solutions"
  • Time: 40-60 hours total
  • Value: Official curriculum, interactive sandboxes

Microsoft Docs:

  • Azure documentation (learn.microsoft.com/azure) - service limits, SLAs, pricing details
  • Azure Architecture Center - reference architectures and design patterns

Practice Tests

Official Practice Test (Microsoft):

  • 50 questions, timed, similar difficulty to real exam
  • Cost: $99 (often bundled with exam)
  • Value: Most accurate predictor of readiness

MeasureUp:

  • 120+ questions, detailed explanations
  • Cost: $99-$129
  • Value: Good question variety, challenging

Whizlabs:

  • 300+ questions across multiple practice tests
  • Cost: $19-$29 (frequent sales)
  • Value: Budget-friendly, decent quality

Study Groups and Communities

Microsoft Q&A:

  • learn.microsoft.com/answers - official forum, questions tagged by Azure service

Reddit:

  • r/AzureCertification: Exam experiences, study tips
  • r/AZURE: Technical discussions, real-world scenarios

Discord/Slack:

  • "Azure Certification Study Group" Discord
  • Share resources, study together, motivation

Hands-On Labs

Azure Free Tier (12 months free):

  • 55+ always-free services
  • $200 credit for first 30 days
  • Practice: Build architectures from study guide

Microsoft Learn Sandbox:

  • Temporary Azure subscription (4 hours)
  • Pre-configured scenarios
  • No credit card required

GitHub Repositories:

  • Azure Quickstart Templates: ARM/Bicep examples
  • Azure Architecture: Sample reference architectures

Chapter Summary

Key Takeaways

Study Planning:

  • 80-120 hours over 6-8 weeks
  • Interleaved practice (mix topics)
  • Spaced repetition (review weak areas 3 days, 1 week, 2 weeks)
  • Active recall (practice questions > passive reading)

Test-Taking:

  • 3 minutes average per question (pace accordingly)
  • Eliminate wrong answers (narrow to 2, then choose)
  • Flag uncertain questions (review in second pass)
  • Trust first instinct (don't change unless clear error)

Common Patterns:

  • High availability = Availability Zones (99.99%)
  • Hybrid connectivity = ExpressRoute (high bandwidth) or VPN (backup)
  • Global distribution = Front Door (HTTP) or Traffic Manager (DNS)
  • Event-driven = Service Bus (reliable) or Event Grid (reactive)
  • Cost optimization = Autoscale, Reserved Instances, right-sizing

Exam Day:

  • Arrive 30 min early, well-rested (8 hours sleep)
  • Brain dump key facts on notepad (first 5 minutes)
  • Take break at 90 min if needed (clear head)
  • Manage anxiety with breathing (4-7-8 technique)

Final Reminders

  1. 700/1000 passes: You don't need perfection (70% correct)
  2. Scenario-based: Understand WHY, not just WHAT
  3. Hands-on helps: Build architectures in Azure (reinforces theory)
  4. Review wrong answers: Each mistake is learning opportunity
  5. Stay calm: Anxiety hurts performance, confidence helps

Success Indicators

You're ready when:

  • ✅ Scoring 85%+ on full-length practice exams consistently
  • ✅ Can explain WHY you chose answer (not just guessing)
  • ✅ Built 3+ architectures hands-on (hybrid, multi-tier, DR)
  • ✅ Reviewed ALL study guide chapters and diagrams
  • ✅ Feeling confident (not overconfident, not anxious)

Reschedule if:

  • ❌ Scoring <75% on practice exams
  • ❌ Guessing on >40% of questions
  • ❌ Haven't completed hands-on labs
  • ❌ Extreme anxiety about exam
  • Better to delay 2 weeks than fail and retake

Next Chapter: Proceed to Final checklist for the comprehensive final week review checklist covering all exam domains.


Chapter 7: Final Week Checklist

Overview

This comprehensive checklist covers every key concept tested on AZ-305. Use this during your final week of preparation to ensure no gaps in knowledge.

How to use:

  • ☐ Check each box as you verify you understand the concept
  • ❌ Mark items you're weak on, review those sections
  • Target: 95%+ boxes checked before exam day
  • Review unchecked items 24-48 hours before exam

Domain 1: Identity, Governance, and Monitoring (25-30%)

Design Monitoring Solutions

Azure Monitor:

  • ☐ I understand Metrics vs Logs (Metrics = time-series, Logs = text-based query)
  • ☐ I know how diagnostic settings route logs (Storage, Log Analytics, Event Hub)
  • ☐ I can design Log Analytics Workspace topology (single vs multiple workspaces)
  • ☐ I understand workspace retention (30-730 days, cost implications)
  • ☐ I know when to use Application Insights vs Azure Monitor (APM vs infrastructure)

Application Insights:

  • ☐ I understand Application Map (visualize dependencies)
  • ☐ I know Live Metrics vs Metrics Explorer (real-time vs historical)
  • ☐ I can configure availability tests (URL ping, multi-step web test)
  • ☐ I understand sampling (reduce telemetry volume, 3 types: adaptive, fixed, ingestion)

Alerting:

  • ☐ I can design action groups (email, SMS, webhook, Logic App, Function)
  • ☐ I understand alert rules (metric, log, activity log)
  • ☐ I know smart detection vs metric alerts (ML-based vs threshold)

Design Authentication and Authorization

Microsoft Entra ID:

  • ☐ I understand cloud-only vs synchronized vs guest identities
  • ☐ I know Entra Connect sync methods (Password Hash Sync, Pass-through Auth, Federation)
  • ☐ I can design for B2B (guest users) vs B2C (customer identity)
  • ☐ I understand External Identities (B2B collaboration, B2B direct connect, B2C)

Conditional Access:

  • ☐ I can design policies: Assignments (who/what) + Conditions (where/how) + Controls (grant/block)
  • ☐ I know common policies: MFA for all, block legacy auth, require compliant device
  • ☐ I understand report-only mode (test policies without enforcement)
  • ☐ I know named locations (trusted IPs, geo-location)

Privileged Identity Management (PIM):

  • ☐ I understand just-in-time elevation (activate roles temporarily)
  • ☐ I know approval workflow (require approval for privileged roles)
  • ☐ I can configure eligible vs active assignments (eligible = JIT, active = permanent)
  • ☐ I understand access reviews (periodic recertification of role assignments)

Managed Identities:

  • ☐ I know system-assigned vs user-assigned (lifecycle, multi-resource)
  • ☐ I understand use cases: VM/Function → Key Vault, AKS → ACR
  • I can design for no secrets in code (replace connection strings with Managed Identity; see the sketch after this list)
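
To make the "no secrets" item above concrete, here is a minimal sketch in Python, assuming the azure-identity and azure-keyvault-secrets packages; the vault URL and secret name are hypothetical:

```python
# Sketch: fetch a secret via Managed Identity instead of embedding it in code.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# On a VM, Function, or AKS workload, DefaultAzureCredential resolves to the
# resource's Managed Identity -- no password or key appears in code or config.
credential = DefaultAzureCredential()
client = SecretClient(
    vault_url="https://contoso-vault.vault.azure.net",  # hypothetical vault
    credential=credential,
)
conn_string = client.get_secret("sql-connection-string").value  # hypothetical name
```

The identity still needs permission to read secrets (for example, the Key Vault Secrets User role); the point is that authorization moves to Entra ID instead of a stored credential.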

Design Governance

Management Groups:

  • ☐ I understand hierarchy: Tenant Root → Management Groups → Subscriptions → Resource Groups
  • ☐ I know 10,000 management group limit per tenant
  • ☐ I can design multi-subscription governance (departments, environments)

Azure Policy:

  • ☐ I understand policy vs initiative (single rule vs bundle)
  • ☐ I know effects: Deny, Audit, Append, DeployIfNotExists, AuditIfNotExists, Modify
  • I can design compliance enforcement (require tags, deny public IP, enforce encryption; see the sketch after this list)
  • ☐ I understand policy assignment scope (management group, subscription, resource group)
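
To make the effect mechanics concrete, here is a minimal "require an environment tag" rule, written as a Python dict purely for illustration (real policy definitions are JSON with the same structure):

```python
# Illustrative policy definition body: deny resources without an "environment" tag.
require_env_tag = {
    "mode": "Indexed",  # evaluate only resource types that support tags and location
    "policyRule": {
        "if": {"field": "tags['environment']", "exists": "false"},
        "then": {"effect": "deny"},  # switch to "audit" to report without blocking
    },
}
```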

Azure Blueprints:

  • ☐ I know blueprints vs ARM templates (versioned, reusable, includes RBAC/Policy)
  • ☐ I understand artifacts: Resource Groups, ARM templates, Policies, RBAC
  • ☐ I can design for environment deployment (dev, test, prod blueprints)

Cost Management:

  • ☐ I understand budgets and alerts (spending thresholds)
  • ☐ I know cost allocation (tags, resource groups, subscriptions)
  • ☐ I can design for FinOps (showback, chargeback, optimization)

Domain 2: Data Storage (20-25%)

Design Relational Data Solutions

Azure SQL Database:

  • ☐ I understand tiers: General Purpose (balanced), Business Critical (low latency), Hyperscale (100TB+)
  • ☐ I know compute: Serverless (auto-pause), Provisioned (dedicated), DTU vs vCore
  • ☐ I can calculate costs: vCore = CPU + memory separately, DTU = bundled
  • I understand zone redundancy (Business Critical, 99.995% SLA)
  • ☐ I know elastic pools (share resources across databases)
  • ☐ I understand backup: PITR (7-35 days), LTR (up to 10 years)

Azure SQL Managed Instance:

  • ☐ I know when to use: Full SQL Server compatibility, lift-and-shift
  • ☐ I understand VNet injection (private IP, on-prem connectivity)
  • ☐ I know limitations: 100 databases per instance, 4 TB storage per DB

Azure Cosmos DB:

  • ☐ I understand APIs: Core (SQL), MongoDB, Cassandra, Gremlin, Table
  • ☐ I know consistency levels: Strong, Bounded Staleness, Session, Consistent Prefix, Eventual
  • I can design partition key (high cardinality, evenly distributed; see the sketch after this list)
  • ☐ I understand global distribution (multi-region writes, multi-region reads)
  • ☐ I know capacity modes: Provisioned (RU/s), Serverless (pay-per-request), Autoscale
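
A minimal sketch of the partition-key item with the azure-cosmos Python SDK; the account URL, database, container, and key path are hypothetical:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(
    url="https://contoso-cosmos.documents.azure.com",  # hypothetical account
    credential="<account-key>",
)
db = client.create_database_if_not_exists("shop")
container = db.create_container_if_not_exists(
    id="orders",
    # /customerId has high cardinality, spreads writes evenly, and keeps the
    # common query ("orders for this customer") within a single partition.
    partition_key=PartitionKey(path="/customerId"),
)
```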

Design Non-Relational Data Solutions

Azure Blob Storage:

  • ☐ I understand tiers: Hot (frequent), Cool (30 days), Cold (90 days), Archive (180 days)
  • ☐ I know lifecycle management (auto-tier based on age)
  • ☐ I can design for redundancy: LRS, ZRS, GRS, RA-GRS, GZRS, RA-GZRS
  • ☐ I understand blob types: Block (files), Append (logs), Page (VHD disks)

Azure Files:

  • ☐ I know when to use: Lift-and-shift file shares, hybrid scenarios
  • ☐ I understand tiers: Premium (SSD), Transaction Optimized, Hot, Cool
  • ☐ I can design Azure File Sync (on-prem → Azure Files replication)
  • ☐ I know identity-based auth (Entra Domain Services, AD DS)

Azure Table Storage:

  • ☐ I understand NoSQL key-value store (partition key + row key)
  • ☐ I know when to use: Simple key-value, low cost, legacy apps
  • ☐ I understand limitations: No relationships, basic queries only

Design Data Integration

Azure Data Factory:

  • ☐ I understand pipelines (activities, data flow, integration runtime)
  • ☐ I know copy activity (80+ connectors, incremental load)
  • ☐ I can design hybrid integration (self-hosted IR for on-prem)
  • ☐ I understand mapping data flows (visual transformation, no code)

Azure Synapse Analytics:

  • ☐ I know dedicated SQL pool (data warehouse, provisioned)
  • ☐ I understand serverless SQL pool (on-demand queries, pay-per-query)
  • ☐ I can design for big data (Spark pools, data lake integration)

Domain 3: Business Continuity (15-20%)

Design Backup and Disaster Recovery

Azure Backup:

  • ☐ I understand Recovery Services Vault (backup storage, geo-redundant by default)
  • ☐ I know workload types: VMs, SQL, SAP HANA, File Shares, on-prem (MARS agent)
  • ☐ I can calculate RPO/RTO: Daily backup = 24hr RPO, restore = 2hr RTO
  • ☐ I understand soft delete (14 days, ransomware protection)
  • I know cross-region restore (restore from a GRS vault into the paired secondary region)

Azure Site Recovery:

  • ☐ I understand replication: Azure → Azure, VMware → Azure, Hyper-V → Azure
  • ☐ I know recovery plans (multi-tier orchestration, scripts, manual actions)
  • ☐ I can design for RPO/RTO: 5-min RPO (replication), 30-min RTO (failover)
  • ☐ I understand test failover (isolated test, no production impact)
  • ☐ I know failback process (re-protect, reverse replication)

Backup Strategy:

  • ☐ I understand 3-2-1 rule: 3 copies, 2 media types, 1 off-site
  • ☐ I know immutable backups (prevent deletion, ransomware protection)
  • ☐ I can design retention: Short-term (7-35 days), Long-term (1-10 years)

Design High Availability

Availability Zones:

  • ☐ I understand 3 zones per region (separate datacenters)
  • ☐ I know SLA: Zones = 99.99%, Availability Set = 99.95%, Single VM = 99.9%
  • ☐ I can design zone-redundant services: Load Balancer, Storage (ZRS), SQL Business Critical
  • ☐ I understand cross-zone latency: <2ms (synchronous replication possible)

Availability Sets:

  • ☐ I know fault domains (rack separation, max 3)
  • ☐ I understand update domains (planned maintenance, max 20)
  • ☐ I can design distribution: 5 VMs across 3 FD = 2-2-1 distribution

SLA Calculations:

  • ☐ I can calculate composite SLA: 99.9% web × 99.9% DB = 99.8%
  • ☐ I understand parallel redundancy: 1 - (0.001 × 0.001) = 99.9999% (two 99.9% paths)
  • I know downtime per SLA: 99.9% = 8.76 hrs/year, 99.99% = 52.6 min/year, 99.999% = 5.26 min/year (reproduced in the helper after this list)
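
The same arithmetic as a small Python helper (illustrative only; it reproduces the figures above):

```python
HOURS_PER_YEAR = 365 * 24

def serial(*slas):
    """Services in series: every component must be up, so multiply."""
    result = 1.0
    for s in slas:
        result *= s
    return result

def parallel(*slas):
    """Redundant paths: the system is down only if all paths fail at once."""
    downtime = 1.0
    for s in slas:
        downtime *= (1 - s)
    return 1 - downtime

print(f"{serial(0.999, 0.999):.4%}")     # 99.8001% - web tier x database tier
print(f"{parallel(0.999, 0.999):.4%}")   # 99.9999% - two redundant 99.9% paths
print(f"{(1 - 0.999) * HOURS_PER_YEAR:.2f} h/yr")          # 8.76 hours
print(f"{(1 - 0.9999) * HOURS_PER_YEAR * 60:.1f} min/yr")  # 52.6 minutes
```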

Domain 4: Infrastructure (30-35%)

Design Compute Solutions

Virtual Machines:

  • ☐ I understand VM families: D (general), F (compute), E (memory), L (storage), N (GPU)
  • ☐ I know disk types: Ultra (sub-ms), Premium SSD (5ms), Standard SSD (10ms), HDD (15ms)
  • ☐ I can design VMSS (autoscale, max 1000 instances standard, 600 custom image)
  • ☐ I understand flexible vs uniform orchestration (flexible = mix sizes/zones, recommended)
  • ☐ I know proximity placement group (force same datacenter, <1ms latency)

Azure Kubernetes Service:

  • ☐ I understand control plane (managed) vs node pools (customer-managed)
  • ☐ I know system pool (required, Kubernetes services) vs user pools (applications)
  • ☐ I can design autoscaling: HPA (pod scale) + Cluster Autoscaler (node scale)
  • ☐ I understand networking: Kubenet (pods not in VNet) vs Azure CNI (pods in VNet)
  • I know SLAs: free tier has no financially backed SLA; Uptime SLA gives 99.95%, or 99.99% with Availability Zones

Azure Functions:

  • ☐ I understand plans: Consumption (serverless, 10min timeout), Premium (pre-warmed, 30min), Dedicated (App Service, unlimited)
  • ☐ I know cold start: Consumption 1-3sec, Premium 0sec (always-on)
  • ☐ I can design triggers: HTTP, Timer, Queue, Blob, Event Grid, Cosmos DB
  • ☐ I understand bindings: Input (read), Output (write), declarative (no SDK code)
  • ☐ I know deployment slots: Consumption 2, Premium 3, swap for zero downtime

Design Application Architecture

Service Bus:

  • I understand queues (point-to-point) vs topics (pub-sub); see the send/receive sketch after this list
  • ☐ I know features: Dead-letter, Sessions (FIFO), Transactions, Duplicate detection
  • ☐ I can design for guaranteed delivery (at-least-once, with duplicate detection = exactly-once)
  • ☐ I understand tiers: Basic (queues), Standard (topics), Premium (VNet, 100MB messages)
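
A minimal send/receive sketch with the azure-servicebus Python package; the connection string and queue name are hypothetical:

```python
from azure.servicebus import ServiceBusClient, ServiceBusMessage

client = ServiceBusClient.from_connection_string("<connection-string>")

# Point-to-point queue: each message is delivered to one competing consumer.
with client.get_queue_sender("orders") as sender:
    sender.send_messages(ServiceBusMessage("order-123 created"))

with client.get_queue_receiver("orders", max_wait_time=5) as receiver:
    for msg in receiver:
        print(str(msg))
        receiver.complete_message(msg)  # settle it, or the lock expires and it redelivers
```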

Event Grid:

  • ☐ I know event sources: Storage, VMs, Service Bus, custom topics
  • ☐ I understand event handlers: Functions, Logic Apps, Webhooks, Event Hubs
  • ☐ I can design filtering (event type, subject, advanced)
  • ☐ I know delivery guarantee: At-least-once with retry (24hr max)

Event Hubs:

  • ☐ I understand streaming (millions events/sec, append-only log)
  • ☐ I know partitions (parallel processing, ordering per partition)
  • ☐ I can design capture (auto-archive to Blob/Data Lake)
  • ☐ I understand tiers: Basic (1MB/s), Standard (20MB/s), Premium (120MB/s), Dedicated (dedicated capacity)

API Management:

  • ☐ I understand components: Gateway (runtime), Management plane (config), Developer portal (docs)
  • ☐ I know policies: Inbound (before backend), Backend (modify request), Outbound (modify response)
  • ☐ I can design products (bundle APIs, usage quota, terms)
  • ☐ I understand tiers: Consumption (serverless), Developer (no SLA), Basic/Standard, Premium (multi-region, VNet)

Design Network Solutions

Virtual Networks:

  • I understand address space (CIDR, RFC 1918 private ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
  • I know 5 reserved IPs per subnet (.0, .1, .2, .3, .255; worked example after this list)
  • ☐ I can design VNet peering (not transitive, same or cross-region)
  • ☐ I understand service endpoints (VNet → Azure services, free) vs private endpoints (private IP, $7/month)
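
A quick worked example of the reserved-address rule, using only Python's standard library:

```python
import ipaddress

subnet = ipaddress.ip_network("10.0.1.0/24")
# Azure reserves 5 addresses per subnet: network (.0), default gateway (.1),
# two for Azure DNS (.2, .3), and broadcast (.255).
usable = subnet.num_addresses - 5
print(usable)  # 251
```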

Hybrid Connectivity:

  • ☐ I understand VPN Gateway: Site-to-site (on-prem ↔ Azure), Point-to-site (client ↔ Azure), VNet-to-VNet
  • ☐ I know ExpressRoute: Private connection, 50Mbps-100Gbps, Microsoft peering (PaaS) + Private peering (VNets)
  • ☐ I can design for HA: ExpressRoute + VPN (failover), dual ExpressRoute circuits
  • ☐ I understand Route Server (BGP peering, route exchange)

Load Balancing:

  • ☐ I know Load Balancer (Layer 4, TCP/UDP, regional, 99.99% SLA)
  • ☐ I understand Application Gateway (Layer 7, HTTP/HTTPS, WAF, URL/path routing, SSL termination)
  • ☐ I can design with Front Door (global, anycast, WAF, routing rules)
  • ☐ I know Traffic Manager (DNS-based, global, routing methods: Priority, Weighted, Performance, Geographic)

Network Security:

  • ☐ I understand NSG (stateful, allow/deny rules, priority 100-4096)
  • ☐ I know Azure Firewall (FQDN filtering, threat intelligence, DNAT/SNAT)
  • ☐ I can design with WAF (OWASP rules, bot protection, custom rules, on Front Door or App Gateway)
  • ☐ I understand DDoS Protection (Basic free, Standard $3K/month, auto-mitigation)

Integration Scenarios

Common Patterns

Zero-Trust Architecture:

  • ☐ I can design: No public endpoints + Private Endpoints + NSGs + Managed Identity + Conditional Access
  • ☐ I understand: Verify explicitly, least privilege, assume breach

Hub-Spoke Network:

  • ☐ I can design: Hub (shared services, ExpressRoute, Firewall) + Spokes (isolated apps, peer to hub)
  • ☐ I understand: Centralized control, spoke-to-spoke via Firewall, no transitive peering

Multi-Region Active-Active:

  • ☐ I can design: Front Door (global LB) + Cosmos DB (multi-region writes) + SQL geo-replica + Traffic Manager
  • ☐ I understand: Read-write in all regions, automatic failover, <200ms global latency

Event-Driven Microservices:

  • ☐ I can design: Functions (compute) + Service Bus (messaging) + Cosmos DB (state) + API Management (gateway)
  • ☐ I understand: Async communication, decoupled services, auto-scale

Hybrid DR:

  • ☐ I can design: ExpressRoute (primary) + VPN (backup) + ASR (replication) + Recovery Plan (orchestration)
  • ☐ I understand: RPO 5min, RTO 30min, automated failover

Decision Frameworks

Compute Choice:

  • ☐ VMs: Full control, lift-and-shift, legacy apps, bring-your-own-license
  • ☐ AKS: Containers, microservices, portability, DevOps
  • ☐ Functions: Event-driven, serverless, short execution, pay-per-use
  • ☐ App Service: Web apps, PaaS, integrated deployment, built-in autoscale

Messaging Choice:

  • ☐ Service Bus: Guaranteed delivery, transactions, FIFO (sessions), enterprise messaging
  • ☐ Event Grid: Reactive events, pub-sub, serverless, simple routing
  • ☐ Event Hubs: Streaming, high throughput (millions/sec), analytics, IoT
  • ☐ Storage Queue: Simple queue, cheap, eventual consistency

Networking Choice:

  • ☐ ExpressRoute: High bandwidth, mission-critical, predictable latency, private
  • ☐ VPN: Backup, low bandwidth, encrypted Internet, cost-effective
  • ☐ Application Gateway: Layer 7, WAF, path/multi-site routing, regional
  • ☐ Front Door: Global, anycast, WAF, geo-routing, CDN

Must-Know Limits and Numbers

Service Limits

  • ☐ VNet: 65,536 IPs, 500 VNets per subscription
  • ☐ NSG: 1,000 rules, 5,000 NSGs per subscription
  • ☐ Availability Set: 200 VMs, 3 fault domains, 20 update domains
  • ☐ VMSS: 1,000 instances (standard), 600 (custom image)
  • ☐ Function timeout: Consumption 10min, Premium 30min default (unlimited possible)
  • ☐ Service Bus: Basic/Standard 256KB, Premium 100MB messages
  • ☐ API Management: Consumption no VNet, Premium multi-region

SLAs

  • ☐ Single VM (Premium SSD): 99.9%
  • ☐ Availability Set: 99.95%
  • ☐ Availability Zone: 99.99%
  • ☐ Multi-region: 99.999% (if designed correctly)
  • ☐ Composite: 99.9% × 99.9% = 99.8% (multiply for serial, parallel formula different)

Pricing Factors

  • ☐ Compute: vCores + memory (per hour)
  • ☐ Storage: Capacity (GB/month) + Operations (per transaction) + Egress (per GB)
  • ☐ Networking: VNet peering ($/GB), ExpressRoute (port fee + data transfer)
  • ☐ Cost optimization: Reserved Instances (40-60% off), Spot VMs (up to 90% off), Autoscale (right-size)

Final Exam Day Checklist

24 Hours Before

  • ☐ Review this entire checklist, focus on unchecked items
  • ☐ Review personal cheat sheet (weak areas)
  • ☐ No new material (consolidation only)
  • ☐ Prepare logistics: ID, confirmation email, test center directions
  • ☐ Sleep 8+ hours

Exam Morning

  • ☐ Light breakfast (avoid heavy/sugary)
  • ☐ Quick review: SLA calculations, service limits, decision frameworks (30 min max)
  • ☐ Arrive test center 30 minutes early
  • ☐ Bathroom break before exam starts

During Exam

  • ☐ First 5 minutes: Breathe, brain dump key facts on notepad
  • ☐ Pace: 3 min/question average, check time every 15 questions
  • ☐ Strategy: Eliminate wrong answers, flag uncertain (max 15), trust first instinct
  • ☐ Anxiety: 4-7-8 breathing if stress spikes
  • ☐ Take break at 90 min if needed (time keeps running)

After Exam

  • ☐ Results immediate (Pass/Fail + score)
  • ☐ If pass: Certificate in 24 hours, update LinkedIn
  • ☐ If fail: Review score report, identify weak areas, reschedule 2-4 weeks out

Confidence Check

You're ready when you can answer YES to all:

  • ☐ I checked 95%+ of boxes in this checklist
  • ☐ I scored 85%+ on full-length practice exam
  • ☐ I can explain WHY I choose answers (not just memorizing)
  • ☐ I built 3+ architectures hands-on in Azure
  • ☐ I feel confident (not overconfident, not anxious)

If ANY answer is NO:

  • Review that section thoroughly
  • Take another practice exam
  • Consider rescheduling 1-2 weeks (better delay than fail)

Good luck on your AZ-305 exam! You've got this! 🎯


Appendices: Quick Reference Tables and Glossary

Appendix A: Service Comparison Tables

Compute Services Comparison

| Feature | Virtual Machines | AKS | Azure Functions | App Service |
|---|---|---|---|---|
| Management | IaaS (full control) | Managed control plane | Serverless | PaaS |
| Scaling | VMSS (manual/auto) | HPA + Cluster Autoscaler | Auto (0-200 instances) | Built-in autoscale |
| OS Control | Full | Node level | None | Limited |
| Pricing Model | Per hour (VM size) | Per node (VM) | Per execution + GB-sec | Per hour (plan) |
| Cold Start | None (always-on) | None | 1-3 sec (Consumption) | None |
| Max Timeout | Unlimited | Unlimited | 10 min (Consumption) | Unlimited |
| VNet Support | Yes | Yes | Premium/Dedicated only | Yes |
| Use Case | Legacy apps, full control | Containers, microservices | Event-driven, serverless | Web apps, APIs |

Database Services Comparison

| Feature | Azure SQL | Cosmos DB | PostgreSQL | MySQL |
|---|---|---|---|---|
| Type | Relational (SQL) | NoSQL (multi-model) | Relational (SQL) | Relational (SQL) |
| Global Distribution | Geo-replica (read) | Multi-region write | Read replicas | Read replicas |
| Consistency | Strong | 5 levels (Strong to Eventual) | Strong | Strong |
| Max Storage | 4 TB (MI), 100 TB (Hyperscale) | Unlimited | 64 TB | 64 TB |
| APIs | T-SQL | SQL, MongoDB, Cassandra, Gremlin, Table | PostgreSQL | MySQL |
| Zone Redundancy | Business Critical | Yes (built-in) | Yes | Yes |
| Pricing | vCore or DTU | RU/s (provisioned or serverless) | vCore | vCore |
| Use Case | OLTP, relational | Globally distributed, NoSQL | Open-source relational | Open-source relational |

Messaging Services Comparison

| Feature | Service Bus | Event Grid | Event Hubs | Storage Queue |
|---|---|---|---|---|
| Pattern | Queue + Pub-sub | Pub-sub (reactive) | Streaming | Queue |
| Message Size | 256 KB (Std), 100 MB (Premium) | 1 MB | 1 MB | 64 KB |
| Ordering | Sessions (FIFO) | No guarantee | Per partition | No guarantee |
| Retention | 7-90 days | 24 hours | 1-90 days | 7 days |
| Throughput | Moderate | 10M events/sec | Millions events/sec | Moderate |
| Transactions | Yes | No | No | No |
| Dead-Letter | Yes | Yes (to Storage blob) | No | No (manual poison queue) |
| Use Case | Enterprise messaging | Reactive events, serverless | Streaming, analytics, IoT | Simple queue, cheap |

Load Balancing Services Comparison

| Feature | Load Balancer | Application Gateway | Front Door | Traffic Manager |
|---|---|---|---|---|
| Layer | Layer 4 (TCP/UDP) | Layer 7 (HTTP/HTTPS) | Layer 7 (HTTP/HTTPS) | DNS (Layer 7) |
| Scope | Regional | Regional | Global | Global |
| Protocol | TCP, UDP | HTTP, HTTPS, WebSocket | HTTP, HTTPS | Any (DNS) |
| SSL Termination | No | Yes | Yes | No |
| WAF | No | Yes | Yes | No |
| Path Routing | No | Yes | Yes | No |
| Health Probes | TCP, HTTP | HTTP, HTTPS | HTTP, HTTPS | HTTP, HTTPS, TCP |
| SLA | 99.99% (Standard) | 99.95% (v2) | 99.99% | 99.99% |
| Pricing | $0.025/hour + data | $0.443/hour (v2) | $0.36/hour + data | $0.54/M queries |
| Use Case | TCP/UDP apps, regional | Web apps, WAF, regional | Global web apps, CDN | DNS failover, geo-routing |

Appendix B: Service Limits and Quotas

Networking Limits

| Resource | Default Limit | Maximum Limit |
|---|---|---|
| VNets per subscription | 500 | 1,000 (support request) |
| IP addresses per VNet | 65,536 | 65,536 (hard limit) |
| Subnets per VNet | 3,000 | 3,000 |
| VNet peerings per VNet | 500 | 500 |
| NSG rules per NSG | 1,000 | 1,000 |
| NSGs per subscription | 5,000 | 5,000 |
| Route table entries | 400 | 400 |
| Public IPs (Standard) per subscription | 1,000 | Contact support |
| VPN Gateway connections | 30 (High Perf), 100 (VpnGw4) | 100 |

Compute Limits

| Resource | Default Limit | Maximum Limit |
|---|---|---|
| VMs per availability set | 200 | 200 |
| Fault domains per availability set | 2-3 (region-dependent) | 3 |
| Update domains per availability set | 5 | 20 |
| VMSS instances (standard) | 100 | 1,000 |
| VMSS instances (custom image) | 100 | 600 |
| AKS nodes per cluster | 100 | 5,000 (support request) |
| AKS pods per node | 110 (kubenet), 30 (Azure CNI) | 250 |
| Function Consumption instances | 200 | 200 (per region) |
| Function Premium instances | 100 | 100 (per plan) |

Storage Limits

| Resource | Default Limit | Maximum Limit |
|---|---|---|
| Storage accounts per subscription | 250 | 500 (support request) |
| Max storage account capacity | 5 PB | 5 PB |
| Blob size (Block blob) | 190.7 TB | 190.7 TB |
| Blob size (Page blob) | 8 TB | 8 TB |
| File share size (Standard) | 5 TB | 100 TB (large file shares) |
| File share size (Premium) | 100 TB | 100 TB |
| IOPS per storage account (Standard) | 20,000 | 20,000 |
| IOPS per storage account (Premium) | 100,000 | 100,000 |

Database Limits

| Resource | Default Limit | Maximum Limit |
|---|---|---|
| Azure SQL databases per server | 500 | 5,000 (support request) |
| Azure SQL DB size (General Purpose) | 4 TB | 4 TB |
| Azure SQL DB size (Hyperscale) | 100 TB | 100 TB |
| SQL Managed Instance databases | 100 | 100 |
| Cosmos DB containers per account | Unlimited | Unlimited |
| Cosmos DB max RU/s per container | 1,000,000 | 1,000,000 |
| Cosmos DB max storage per container | Unlimited | Unlimited |

Appendix C: SLA Reference

Compute SLAs

| Service | Configuration | SLA | Downtime/Year |
|---|---|---|---|
| Virtual Machine | Single instance, Premium SSD | 99.9% | 8.76 hours |
| Virtual Machine | Availability Set | 99.95% | 4.38 hours |
| Virtual Machine | Availability Zones | 99.99% | 52.6 minutes |
| AKS | Without Uptime SLA | None | N/A |
| AKS | With Uptime SLA | 99.95% | 4.38 hours |
| AKS | With Uptime SLA + Zones | 99.99% | 52.6 minutes |
| Azure Functions | Consumption/Premium | 99.95% | 4.38 hours |
| App Service | Free/Shared tier | None | N/A |
| App Service | Basic/Standard/Premium | 99.95% | 4.38 hours |

Data SLAs

| Service | Configuration | SLA | Downtime/Year |
|---|---|---|---|
| Azure SQL Database | Single DB, no zones | 99.99% | 52.6 minutes |
| Azure SQL Database | Zone-redundant (Business Critical) | 99.995% | 26.3 minutes |
| Cosmos DB | Single region | 99.99% | 52.6 minutes |
| Cosmos DB | Multi-region (read) | 99.999% | 5.26 minutes |
| Cosmos DB | Multi-region (write) | 99.999% | 5.26 minutes |
| Blob Storage | LRS/ZRS | 99.9% (read and write) | 8.76 hours |
| Blob Storage | RA-GRS/RA-GZRS | 99.99% (read), 99.9% (write) | 52.6 min / 8.76 h |

Network SLAs

| Service | Configuration | SLA | Downtime/Year |
|---|---|---|---|
| Load Balancer | Standard SKU | 99.99% | 52.6 minutes |
| Application Gateway | V2 SKU | 99.95% | 4.38 hours |
| Front Door | Standard/Premium | 99.99% | 52.6 minutes |
| Traffic Manager | Any | 99.99% | 52.6 minutes |
| VPN Gateway | Basic | 99.9% | 8.76 hours |
| VPN Gateway | VpnGw1-5 | 99.95% | 4.38 hours |
| ExpressRoute | Any | 99.95% | 4.38 hours |
| Azure Firewall | Any | 99.95% (single zone), 99.99% (multi-zone) | 4.38 h / 52.6 min |

Appendix D: Pricing Quick Reference

Compute Pricing (East US, Linux, approximate)

| Service | Configuration | Hourly | Monthly | Notes |
|---|---|---|---|---|
| VM | D4s_v5 (4 vCPU, 16 GB) | $0.19 | $140 | General purpose |
| VM | E8s_v5 (8 vCPU, 64 GB) | $0.50 | $365 | Memory optimized |
| VM | F4s_v2 (4 vCPU, 8 GB) | $0.17 | $125 | Compute optimized |
| VMSS | Same as VM pricing | - | - | No additional charge |
| AKS | Control plane free | $0 | $0 | Only pay for nodes (VMs) |
| AKS | Uptime SLA | $0.10/hour | $73 | Optional add-on |
| Functions | Consumption | $0.20/M executions | - | + $0.000016/GB-sec |
| Functions | Premium EP1 | $0.24 | $175 | Per instance hour |
| App Service | B1 (Basic) | $0.075 | $55 | 1 core, 1.75 GB |
| App Service | P1v3 (Premium) | $0.25 | $182 | 2 cores, 8 GB |

Storage Pricing (approximate)

| Service | Tier/Type | Price (approx.) | Notes |
|---|---|---|---|
| Blob Storage | Hot | $0.018/GB/month | Frequent access |
| Blob Storage | Cool | $0.01/GB/month | 30-day min storage |
| Blob Storage | Archive | $0.002/GB/month | 180-day min, hours to retrieve |
| Azure Files | Premium | $0.20/GB/month | Provisioned, SSD |
| Azure Files | Transaction Optimized | $0.03/GB/month | Hot tier |
| SQL Database | GP 4 vCore | $0.70/hour | ~$511/month |
| SQL Database | BC 4 vCore | $2.29/hour | ~$1,672/month |
| Cosmos DB | Provisioned | $0.008/hour per 100 RU/s | $0.06/GB storage |
| Cosmos DB | Serverless | $0.285/M RU | $0.285/GB storage |

Network Pricing (approximate)

| Service | Type | Price | Notes |
|---|---|---|---|
| VNet Peering | Same region | $0.01/GB | Both directions |
| VNet Peering | Cross-region | $0.035/GB | Both directions |
| VPN Gateway | VpnGw1 | $0.36/hour | ~$262/month |
| VPN Gateway | VpnGw2 | $0.50/hour | ~$365/month |
| ExpressRoute | 50 Mbps | $55/month | Port fee + data transfer |
| ExpressRoute | 1 Gbps | $1,235/month | Port fee + data transfer |
| Load Balancer | Standard | $0.025/hour | + $0.005/GB processed |
| Application Gateway | WAF_v2 | $0.443/hour | + $0.008/CU |
| Front Door | Standard | $0.36/hour | + data transfer |

Appendix E: Common Acronyms and Terms

Identity and Security

  • AAD: Azure Active Directory (now Microsoft Entra ID)
  • CA: Conditional Access
  • PIM: Privileged Identity Management
  • MFA: Multi-Factor Authentication
  • RBAC: Role-Based Access Control
  • MI: Managed Identity (System-assigned or User-assigned)
  • B2B: Business-to-Business (guest user access)
  • B2C: Business-to-Consumer (customer identity)
  • SSO: Single Sign-On
  • SAML: Security Assertion Markup Language
  • OAuth: Open Authorization (token-based auth)
  • OIDC: OpenID Connect

Networking

  • VNet: Virtual Network
  • NSG: Network Security Group
  • ASG: Application Security Group
  • UDR: User-Defined Route
  • BGP: Border Gateway Protocol
  • VWAN: Virtual WAN
  • ER: ExpressRoute
  • S2S: Site-to-Site (VPN)
  • P2S: Point-to-Site (VPN)
  • CIDR: Classless Inter-Domain Routing
  • NAT: Network Address Translation
  • DNAT: Destination NAT
  • SNAT: Source NAT

Compute

  • VM: Virtual Machine
  • VMSS: Virtual Machine Scale Set
  • AKS: Azure Kubernetes Service
  • ACI: Azure Container Instances
  • ACR: Azure Container Registry
  • HPA: Horizontal Pod Autoscaler (Kubernetes)
  • CA: Cluster Autoscaler (Kubernetes)
  • CNI: Container Network Interface

Data

  • BCDR: Business Continuity and Disaster Recovery
  • RPO: Recovery Point Objective (max data loss)
  • RTO: Recovery Time Objective (max downtime)
  • PITR: Point-In-Time Restore
  • LTR: Long-Term Retention
  • ASR: Azure Site Recovery
  • GRS: Geo-Redundant Storage
  • RA-GRS: Read-Access Geo-Redundant Storage
  • ZRS: Zone-Redundant Storage
  • LRS: Locally Redundant Storage
  • GZRS: Geo-Zone-Redundant Storage

Monitoring

  • LAW: Log Analytics Workspace
  • KQL: Kusto Query Language
  • APM: Application Performance Management
  • SIEM: Security Information and Event Management

General

  • IaaS: Infrastructure as a Service (VMs)
  • PaaS: Platform as a Service (App Service, SQL DB)
  • SaaS: Software as a Service (Office 365)
  • FaaS: Function as a Service (Azure Functions)
  • SKU: Stock Keeping Unit (service tier/size)
  • ARM: Azure Resource Manager
  • RG: Resource Group
  • MG: Management Group

Appendix F: Well-Architected Framework Pillars

Cost Optimization

Key Principles:

  • Right-size resources (don't overprovision)
  • Use autoscaling (pay for what you use)
  • Reserved Instances (40-60% savings for predictable workloads)
  • Spot VMs (up to 90% savings for interruptible workloads)
  • Storage tiers (Hot → Cool → Archive based on access patterns)

Common Patterns:

  • Dev/Test: Use lower SKUs, auto-shutdown VMs after hours
  • Production: Reserved Instances for baseline, autoscale for peaks
  • Data: Lifecycle management for blobs (auto-tier based on age)

Operational Excellence

Key Principles:

  • Infrastructure as Code (ARM, Bicep, Terraform)
  • CI/CD pipelines (Azure DevOps, GitHub Actions)
  • Monitoring and alerting (Azure Monitor, Application Insights)
  • Automated testing (unit, integration, load)

Common Patterns:

  • GitOps: Infrastructure code in Git, automated deployment
  • Blue-Green: Deploy to staging, swap to production (zero downtime)
  • Canary: Gradual rollout (5% → 25% → 100% traffic)

Performance Efficiency

Key Principles:

  • Choose right compute (VMs vs AKS vs Functions based on workload)
  • Use caching (Redis, CDN, Application Gateway cache)
  • Database optimization (indexing, partitioning, read replicas)
  • Network optimization (VNet peering, ExpressRoute, global distribution)

Common Patterns:

  • Caching layer: Redis in front of database (sub-ms reads)
  • CDN: Static content at edge (images, videos, files)
  • Autoscaling: Scale out under load, scale in when idle
  • Global distribution: Front Door + Cosmos DB multi-region writes

Reliability

Key Principles:

  • Availability Zones (99.99% SLA for datacenter failure)
  • Multi-region (99.999% SLA for regional disaster)
  • Backups and disaster recovery (RPO/RTO requirements)
  • Health monitoring and auto-remediation

Common Patterns:

  • Active-Passive: Primary region + DR replica (ASR, geo-replica)
  • Active-Active: Front Door routes to healthy region, multi-region writes
  • Circuit breaker: Fail fast, retry with exponential backoff (sketch after this list)
  • Health endpoints: /health probe for load balancers
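
A minimal illustration of the retry pattern named above (generic Python, no Azure SDK assumed):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # fail fast: give up after the final attempt
            # Delays grow 0.5s, 1s, 2s, ...; jitter avoids synchronized retries.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```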

Security

Key Principles:

  • Zero Trust: Verify explicitly, least privilege, assume breach
  • Defense in depth: Multiple security layers (NSG + Firewall + WAF)
  • Encryption: At rest (Storage, SQL) + in transit (TLS)
  • Identity and access: Entra ID, Conditional Access, PIM, Managed Identity

Common Patterns:

  • No public endpoints: Private Endpoints for all PaaS services
  • Network segmentation: Hub-spoke, subnets with NSGs
  • Secrets management: Key Vault, never in code or config
  • Just-in-time access: PIM for admins, JIT VM access for RDP/SSH

Appendix G: Exam Day Brain Dump Template

Write this on your notepad in first 5 minutes of exam:

SLA Calculations

  • Single VM (Premium SSD): 99.9%
  • Availability Set: 99.95%
  • Availability Zone: 99.99%
  • Composite (serial): Multiply (99.9% × 99.9% = 99.8%)
  • Composite (parallel): 1 - (downtime × downtime) = 1 - (0.001 × 0.001) = 99.9999%

Service Limits

  • VNet: 65,536 IPs, 5 reserved per subnet
  • Availability Set: 200 VMs, 3 FD, 20 UD
  • VMSS: 1,000 instances standard, 600 custom
  • NSG: 1,000 rules
  • Function timeout: 10min (Consumption), 30min (Premium)
  • Service Bus: 256KB (Std), 100MB (Premium)

Compute Decision Tree

  • Legacy app → VMs
  • Containers → AKS
  • Event-driven → Functions
  • Web app → App Service

Messaging Decision Tree

  • Guaranteed delivery + transactions → Service Bus
  • Reactive events → Event Grid
  • Streaming (high throughput) → Event Hubs
  • Simple queue (cheap) → Storage Queue

Network Decision Tree

  • High bandwidth hybrid → ExpressRoute
  • Encrypted hybrid → VPN
  • Layer 7 regional LB → Application Gateway
  • Layer 7 global LB → Front Door
  • Layer 4 LB → Load Balancer
  • DNS failover → Traffic Manager

End of Appendices

This concludes the comprehensive AZ-305 study guide. Review all chapters, practice extensively, and trust your preparation!

Good luck! 🚀