
AZ-400 Study Guide

Complete Exam Preparation Guide

AZ-400: Designing and Implementing Microsoft DevOps Solutions - Comprehensive Study Guide

Complete Learning Path for Certification Success

Overview

This study guide provides a structured learning path from fundamentals to exam readiness for the Microsoft Certified: DevOps Engineer Expert certification. Designed for novices, it teaches all concepts progressively while focusing exclusively on exam-relevant content. Extensive diagrams and visual aids are integrated throughout to enhance understanding and retention.

Section Organization

Study Sections (in order):

  • Overview (this section) - How to use the guide and study plan
  • 01_fundamentals - Section 0: Essential background and prerequisites
  • 02_domain1_processes_communications - Section 1: Design and Implement Processes and Communications (12.5% of exam)
  • 03_domain2_source_control - Section 2: Design and Implement Source Control Strategy (12.5% of exam)
  • 04_domain3_build_release_pipelines - Section 3: Design and Implement Build and Release Pipelines (52.5% of exam)
  • 05_domain4_security_compliance - Section 4: Develop Security and Compliance Plan (12.5% of exam)
  • 06_domain5_instrumentation - Section 5: Implement Instrumentation Strategy (7.5% of exam)
  • 07_integration - Integration & cross-domain scenarios
  • 08_study_strategies - Study techniques & test-taking strategies
  • 09_final_checklist - Final week preparation checklist
  • 99_appendices - Quick reference tables, glossary, resources
  • diagrams/ - Folder containing all Mermaid diagram files (.mmd)

Study Plan Overview

Total Time: 6-10 weeks (2-3 hours daily)

Week 1-2: Fundamentals & Domain 1 (sections 01-02)

  • DevOps foundations, Agile principles, Azure basics
  • Work tracking, metrics, dashboards, collaboration

Week 3: Domain 2 (section 03)

  • Branching strategies, Git operations, repository management

Week 4-6: Domain 3 (section 04)

  • Package management, testing strategies
  • Pipeline design, deployment strategies
  • Infrastructure as Code, pipeline maintenance

Week 7: Domains 4-5 (sections 05-06)

  • Security, compliance, authentication
  • Monitoring, instrumentation, metrics analysis

Week 8: Integration & Cross-domain scenarios (section 07)

  • End-to-end DevOps workflows
  • Real-world integration patterns

Week 9: Practice & Review

  • Use practice test bundles
  • Identify weak areas and review

Week 10: Final Prep (sections 08-09)

  • Study strategies, exam techniques
  • Final week checklist and review

Learning Approach

  1. Read: Study each section thoroughly, taking notes
  2. Visualize: Study all diagrams to understand architecture and flows
  3. Highlight: Mark ⭐ items as must-know concepts
  4. Practice: Complete exercises after each section
  5. Test: Use practice questions to validate understanding
  6. Review: Revisit marked sections and weak areas

Progress Tracking

Use checkboxes to track completion:

  • Section completed and notes taken
  • All diagrams reviewed and understood
  • Exercises completed
  • Practice questions passed (80%+)
  • Self-assessment checklist completed

Legend

  • ⭐ Must Know: Critical for exam success
  • 💡 Tip: Helpful insight or shortcut
  • ⚠️ Warning: Common mistake to avoid
  • 🔗 Connection: Related to other topics
  • 📝 Practice: Hands-on exercise
  • 🎯 Exam Focus: Frequently tested concept
  • 📊 Diagram: Visual representation available

How to Navigate

  • Study sections sequentially (01 → 02 → 03 → ... → 09)
  • Each section is self-contained but builds on previous chapters
  • Use 99_appendices as a quick reference during study
  • Return to 09_final_checklist in your last week before the exam
  • Review the diagrams/ folder for all visual aids

Exam Details

Exam Information:

  • Exam Code: AZ-400
  • Duration: 120 minutes
  • Question Count: 40-60 questions
  • Passing Score: 700/1000
  • Question Types: Multiple choice, multiple answer, case studies, drag-and-drop

Domain Weight Distribution:

  1. Design and Implement Processes and Communications: 10-15%
  2. Design and Implement Source Control Strategy: 10-15%
  3. Design and Implement Build and Release Pipelines: 50-55%
  4. Develop Security and Compliance Plan: 10-15%
  5. Implement Instrumentation Strategy: 5-10%

Prerequisites:

  • Azure Administrator (AZ-104) OR Azure Developer (AZ-204) certification recommended
  • Hands-on experience with Azure DevOps or GitHub
  • Basic understanding of Agile methodologies
  • Familiarity with Azure services and Git

Study Resources Included

Practice Test Bundles:

  • 6 difficulty-based bundles (beginner, intermediate, advanced)
  • 3 full practice tests (exam simulation)
  • 8 domain-focused bundles
  • 5 service-focused bundles

Cheat Sheets:

  • Quick reference for final review
  • Essential commands and configurations
  • Critical concepts summary

Tips for Success

  1. Follow the sequence: Don't skip chapters, concepts build on each other
  2. Practice regularly: Use practice test bundles throughout your study
  3. Understand, don't memorize: Focus on WHY and HOW, not just WHAT
  4. Use diagrams: Visual learning enhances retention significantly
  5. Track weak areas: Review and strengthen before moving forward
  6. Simulate exam conditions: Take full practice tests under time pressure
  7. Review mistakes: Learn from every wrong answer in practice tests

Getting Started

Begin with Fundamentals to build your foundation. Take your time with each chapter, ensuring you understand concepts before moving forward. This guide is designed to be comprehensive and self-sufficient - you should not need external resources to pass the exam.

Good luck on your DevOps Engineer Expert certification journey!


Chapter 0: Essential Background and DevOps Foundations

What You Need to Know First

This certification assumes you understand:

  • Azure Fundamentals - Basic Azure concepts, services, and portal navigation
  • Version Control Basics - Understanding of source code management and Git fundamentals
  • Agile Methodology - Basic knowledge of Agile, Scrum, and iterative development
  • Software Development Lifecycle - Understanding of development, testing, and deployment phases
  • Cloud Computing Basics - Understanding of cloud service models (IaaS, PaaS, SaaS)

If you're missing any: Consider reviewing Azure Fundamentals (AZ-900) materials or taking introductory courses in Git and Agile methodologies before proceeding.

Core Concepts Foundation

What is DevOps?

What it is: DevOps is a cultural and technical movement that combines software development (Dev) and IT operations (Ops) - together with quality assurance - into a unified culture and set of processes for delivering software efficiently and reliably.

Why it matters: Traditional software development had separate development and operations teams working in isolation, leading to slow releases, communication gaps, and deployment failures. DevOps breaks down these silos to enable faster, more reliable software delivery.

Real-world analogy: Think of DevOps like a relay race where the baton (your code) is passed seamlessly between runners (teams). In traditional development, runners would stop, hand over documentation about the baton, and the next runner would have to figure out how to carry it. In DevOps, everyone trains together, knows the process, and the handoff is smooth and automatic.

Key points:

  • DevOps is both a culture (collaboration, shared responsibility) and a set of practices (automation, monitoring)
  • The goal is to shorten the development lifecycle while delivering features, fixes, and updates frequently
  • DevOps emphasizes automation, continuous improvement, and fast feedback loops
  • Success requires buy-in from all teams - developers, operations, QA, security, and management

💡 Tip: DevOps isn't a tool or a single role - it's a philosophy. You can't "install DevOps," but you can adopt DevOps practices and culture.

The DevOps Lifecycle

What it is: The DevOps lifecycle represents the continuous flow of activities from planning to monitoring in software delivery. Unlike traditional waterfall development with discrete phases, DevOps creates a continuous loop of improvement.

Why it exists: Software delivery is not a one-time event but a continuous process. Applications need updates, bug fixes, new features, and security patches throughout their lifetime. The DevOps lifecycle provides a framework for managing this continuous delivery.

Real-world analogy: The DevOps lifecycle is like a circular assembly line where each completed product immediately informs improvements to the next iteration. Feedback from customers (monitoring) directly influences what gets built next (planning), creating a continuous improvement loop.

How it works (Detailed step-by-step):

  1. Plan: Teams define what to build, prioritize features, and create work items

    • WHY: Without planning, development lacks direction and may build wrong features
    • TOOLS: Azure Boards, GitHub Projects, JIRA
    • OUTPUT: User stories, tasks, sprint plans
  2. Develop: Developers write code following best practices and standards

    • WHY: Quality code is the foundation of reliable software
    • TOOLS: Visual Studio, VS Code, Git
    • OUTPUT: Source code, unit tests
  3. Build: Code is compiled, packaged, and prepared for deployment

    • WHY: Ensures code can be transformed into runnable applications
    • TOOLS: Azure Pipelines, GitHub Actions, Maven, npm
    • OUTPUT: Build artifacts (compiled binaries, containers, packages)
  4. Test: Automated tests validate functionality, performance, and security

    • WHY: Catches bugs before they reach production, reducing risk
    • TOOLS: pytest, JUnit, Selenium, OWASP ZAP
    • OUTPUT: Test results, code coverage reports
  5. Release: Approved builds are deployed to staging and production environments

    • WHY: Makes software available to end users
    • TOOLS: Azure Pipelines, GitHub Actions, Kubernetes
    • OUTPUT: Running application in target environment
  6. Deploy: Application is installed and configured in target environments

    • WHY: Puts software into operation
    • TOOLS: ARM templates, Terraform, Ansible
    • OUTPUT: Configured infrastructure and applications
  7. Operate: Application runs in production, serving real users

    • WHY: This is where value is delivered to customers
    • TOOLS: Azure App Service, Kubernetes, VMs
    • OUTPUT: Live application serving traffic
  8. Monitor: Telemetry and logs track application health and user behavior

    • WHY: Provides insights for improvement and detects issues early
    • TOOLS: Azure Monitor, Application Insights, Log Analytics
    • OUTPUT: Metrics, logs, alerts, dashboards
    • FEEDBACK LOOP: Insights from monitoring feed back into planning

📊 DevOps Lifecycle Diagram:

graph TB
    Plan[1. Plan<br/>Define features & priorities] --> Develop[2. Develop<br/>Write code & tests]
    Develop --> Build[3. Build<br/>Compile & package]
    Build --> Test[4. Test<br/>Validate quality]
    Test --> Release[5. Release<br/>Approve for deployment]
    Release --> Deploy[6. Deploy<br/>Install to environment]
    Deploy --> Operate[7. Operate<br/>Run in production]
    Operate --> Monitor[8. Monitor<br/>Track performance]
    Monitor -.Feedback.-> Plan
    
    style Plan fill:#e3f2fd
    style Develop fill:#f3e5f5
    style Build fill:#fff3e0
    style Test fill:#e8f5e9
    style Release fill:#fce4ec
    style Deploy fill:#e0f2f1
    style Operate fill:#e1f5fe
    style Monitor fill:#f9fbe7

See: diagrams/01_fundamentals_devops_lifecycle.mmd

Diagram Explanation:
The DevOps lifecycle diagram illustrates the eight continuous phases that form the foundation of modern software delivery. Starting with Plan (blue), teams use tools like Azure Boards or GitHub Projects to define user stories and prioritize work based on business value and customer feedback. The Develop phase (purple) represents developers writing code in their chosen IDE, creating unit tests, and committing changes to version control systems like Git.

The Build phase (orange) takes source code and transforms it into deployable artifacts through compilation, dependency resolution, and packaging - Azure Pipelines or GitHub Actions automate this process triggered by code commits. Test (green) represents the critical quality gates where automated tests (unit, integration, security) run against builds to catch defects early before they reach production.

Once tests pass, the Release phase (pink) manages approvals and gates, determining which builds are ready for deployment to various environments. Deploy (teal) executes the actual installation and configuration of applications to target environments using Infrastructure as Code (IaC) tools like ARM templates or Terraform. The Operate phase (light blue) represents the running application serving real users and generating business value.

Finally, Monitor (yellow-green) continuously collects telemetry, logs, and metrics about application performance, user behavior, and system health using tools like Azure Monitor and Application Insights. The critical feedback arrow from Monitor back to Plan represents the continuous improvement loop - insights from production inform what features to build next, what issues to fix, and how to optimize performance. This circular flow means DevOps never stops; each iteration builds on lessons learned from the previous deployment.

Continuous Integration (CI)

What it is: Continuous Integration is the practice of automatically building and testing code every time a team member commits changes to version control. Every code commit to the main branch triggers an automated build process that compiles the code, runs tests, and validates quality.

Why it exists: Before CI, developers would work in isolation for days or weeks, then try to merge their changes together. This led to "integration hell" - massive merge conflicts, broken builds, and bugs that were hard to trace. CI solves this by integrating code frequently (multiple times per day), catching conflicts and issues immediately when they're easier to fix.

Real-world analogy: CI is like checking your bank account balance daily versus once a month. Daily checks let you catch errors immediately (a duplicate charge today), while monthly checks mean discovering problems weeks later when you can't remember the transactions (which code change broke the build?).

How it works (Detailed step-by-step):

  1. Developer commits code: A developer finishes a feature or bug fix and pushes changes to a shared Git repository (GitHub, Azure Repos)

    • WHY: Version control provides a single source of truth for code
    • WHAT HAPPENS: Git records the changes, who made them, and when
  2. Trigger fires: The commit triggers a webhook or polling mechanism that notifies the CI system

    • WHY: Automation ensures every change is validated without manual intervention
    • WHAT HAPPENS: Azure Pipelines or GitHub Actions receives notification of new code
  3. Build process starts: CI system checks out the code and begins building

    • WHY: Validates that code can be compiled into a runnable application
    • WHAT HAPPENS: Dependencies are installed, code is compiled, artifacts are created
  4. Automated tests run: The build includes running the test suite (unit tests, integration tests, linting)

    • WHY: Ensures new code doesn't break existing functionality
    • WHAT HAPPENS: Tests execute, results are captured, code coverage is calculated
  5. Results reported: The CI system reports success or failure to the team

    • WHY: Fast feedback allows developers to fix issues immediately
    • WHAT HAPPENS: Notifications sent via email, Slack, or pull request status updates
  6. Artifacts published (if successful): Compiled code is packaged and stored for deployment

    • WHY: Creates a deployable unit that can be released to environments
    • WHAT HAPPENS: Docker images, npm packages, or binaries are pushed to registries

📊 CI Process Flow Diagram:

sequenceDiagram
    participant Dev as Developer
    participant Git as Git Repository
    participant CI as CI System<br/>(Azure Pipelines/GitHub Actions)
    participant Tests as Test Suite
    participant Artifact as Artifact Registry

    Dev->>Git: 1. Push code commit
    Git->>CI: 2. Trigger webhook
    CI->>Git: 3. Clone repository
    CI->>CI: 4. Install dependencies
    CI->>CI: 5. Compile/Build code
    CI->>Tests: 6. Run automated tests
    Tests-->>CI: 7. Test results
    alt Tests Pass
        CI->>Artifact: 8a. Publish build artifact
        CI->>Dev: 9a. ✅ Success notification
    else Tests Fail
        CI->>Dev: 8b. ❌ Failure notification
        Dev->>Dev: 9b. Fix issues
    end

See: diagrams/01_fundamentals_ci_flow.mmd

Diagram Explanation:
This sequence diagram shows the automated CI workflow from code commit to artifact publication. When a Developer pushes code to the Git Repository (step 1), Git immediately sends a webhook notification to the CI System like Azure Pipelines or GitHub Actions (step 2). The CI system responds by cloning the latest code from the repository (step 3), then installs all necessary dependencies like npm packages or NuGet libraries (step 4).

Next, the CI system compiles or builds the code (step 5) - for compiled languages this means creating binaries, for interpreted languages it might mean bundling and minification. The build is then passed to the Test Suite (step 6) where automated tests execute. The test results (step 7) determine the next steps: if tests pass (green path), the CI system publishes the build artifact to a registry like Azure Artifacts, Docker Hub, or npm (step 8a) and sends a success notification to the developer (step 9a). If tests fail (red path), the developer receives a failure notification immediately (step 8b) and can fix the issues before they affect others (step 9b). This entire process typically completes in minutes, providing rapid feedback to developers.

Detailed Example 1: Web Application CI Scenario

Imagine you're developing an e-commerce web application using React for frontend and Node.js for backend. A developer named Sarah completes a new feature that adds a shopping cart widget to the product page. She commits her code changes to the main branch in GitHub at 10:00 AM. Within seconds, GitHub sends a webhook to Azure Pipelines, which has been configured to trigger on any commit to main.

Azure Pipelines spins up a build agent (a clean virtual machine) and clones the repository. It runs npm install to download all dependencies listed in package.json - this includes React, Express, testing libraries, and dozens of other packages. Next, it runs npm run build which compiles the React code using Webpack, minifies JavaScript and CSS, and creates optimized bundles for production.

The build then executes npm test, running Jest unit tests that verify Sarah's shopping cart logic handles edge cases (empty cart, max quantity limits, price calculations). It also runs Cypress end-to-end tests that simulate a user adding items to cart in a real browser. All 247 tests pass in 3 minutes. Azure Pipelines then runs npm run lint to check code style (ESLint) - all checks pass. Finally, it creates a Docker image of the application, tags it with the commit SHA, and pushes it to Azure Container Registry. Sarah receives a Slack notification at 10:04 AM: "✅ Build #2847 succeeded - your changes are ready for deployment." The entire process took 4 minutes from commit to deployable artifact.
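
A minimal Azure Pipelines YAML sketch of a CI flow like the one Sarah's team uses - the Node version, image name, and service connection are illustrative assumptions, not the company's actual configuration:

trigger:
  branches:
    include:
      - main

pool:
  vmImage: 'ubuntu-latest'          # Microsoft-hosted agent: a fresh VM for every build

steps:
  - task: NodeTool@0
    displayName: 'Install Node.js'
    inputs:
      versionSpec: '20.x'

  - script: npm ci
    displayName: 'Install dependencies'

  - script: npm run build
    displayName: 'Build production bundles'

  - script: npm test
    displayName: 'Run unit and end-to-end tests'

  - script: npm run lint
    displayName: 'Check code style'

  - task: Docker@2
    displayName: 'Build and push image tagged with the commit SHA'
    inputs:
      command: buildAndPush
      repository: 'shop/web'                 # hypothetical image name
      containerRegistry: 'acr-connection'    # hypothetical service connection to Azure Container Registry
      tags: |
        $(Build.SourceVersion)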

Detailed Example 2: Microservices CI with Multiple Languages

Consider a financial services company with a microservices architecture: payment service (Java), user service (Python), and notification service (C#). Each service has its own Git repository and CI pipeline, but they need to work together. When a developer commits to the payment service repository, Azure Pipelines triggers a build specific to Java: it runs mvn clean install to compile Java code with Maven, execute JUnit tests, perform static code analysis with SonarQube, and scan for vulnerabilities with OWASP Dependency-Check.

For the Python service, the CI pipeline runs pip install -r requirements.txt, executes pytest for unit tests, runs pylint for code quality, and uses safety to check for insecure dependencies. The C# service uses dotnet build, dotnet test with xUnit, and runs security scanning with Microsoft Security Code Analysis. Each pipeline is tailored to its language but follows the same principles: build, test, scan, publish. When all three services pass their individual CI pipelines, an integration test pipeline triggers that deploys all three services to a test environment and runs end-to-end API tests to verify they communicate correctly. This multi-language, multi-service CI approach ensures each component is validated individually and as part of the whole system.

Detailed Example 3: CI Catching a Critical Bug

A developer named Mike is working on a database migration feature for a SaaS application. He writes code to add a new column to the users table and updates the data access layer. He runs tests locally on his machine - everything passes. Confident, he commits the code at 2:00 PM. The CI pipeline triggers and starts building. During the automated test phase, integration tests that use a real PostgreSQL database (via Docker container) discover that Mike's migration script fails against PostgreSQL 14 (his local machine runs PostgreSQL 13). The test "user_migration_adds_column_correctly" fails.

The CI system immediately sends Mike an email and updates the GitHub pull request with a red X. The build log shows: "ERROR: column 'user_preferences' of relation 'users' already exists." Mike realizes his migration doesn't check if the column exists before adding it. He adds IF NOT EXISTS to the SQL, commits again at 2:15 PM. The CI pipeline reruns - this time all tests pass, including additional tests that run migrations twice to verify idempotency. Without CI, this bug would have been discovered only when deploying to staging (maybe days later), potentially causing database corruption and requiring manual rollback. CI caught it in 15 minutes, before any environment was affected.
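
Integration tests like the ones that caught Mike's bug usually run against a throwaway database that the CI system starts for each build. A hedged GitHub Actions sketch, assuming Node-based migration and test scripts (the script names and connection string are placeholders):

name: integration-tests

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:14          # pin the same major version as production, not a developer laptop
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Run migrations twice to verify they are idempotent (assumed script name)
      - run: npm run migrate && npm run migrate
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
      - run: npm test
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres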

Must Know (Critical Facts):

  • CI must run on every commit to the main/shared branch, not just before releases - this ensures continuous validation
  • CI builds must be fast (ideally under 10 minutes) to provide rapid feedback without blocking developers
  • CI systems must use clean environments (fresh VMs or containers) for each build to avoid "works on my machine" issues
  • All tests must be automated in CI - manual testing doesn't scale and defeats the purpose of continuous integration
  • CI should fail fast - run quickest tests first (unit tests) before slower ones (integration tests) to save time
  • Broken builds must be fixed immediately - a failing CI pipeline should be the team's top priority, not something to "fix later"

When to use (Comprehensive):

  • ✅ Use CI when: Working on a team project where multiple developers commit code regularly - prevents integration conflicts
  • ✅ Use CI when: Building production software that requires reliability - automated testing catches bugs before users see them
  • ✅ Use CI when: You need to maintain code quality standards - CI can enforce linting, code coverage, and security checks automatically
  • ✅ Use CI when: Deploying frequently (daily or multiple times per day) - CI ensures every build is potentially deployable
  • ❌ Don't skip CI when: "It's just a small change" - even one-line changes can break builds, always validate through CI
  • ❌ Don't skip CI when: "We're in a hurry" - skipping CI to save time often creates bigger problems that waste more time fixing

Continuous Delivery (CD)

What it is: Continuous Delivery is the practice of automatically building, testing, and preparing code changes for release to production. CD extends CI by ensuring that every successful build is automatically deployed to staging/testing environments and is always ready to be deployed to production at the click of a button. It's about keeping your software in a deployable state at all times.

Why it exists: Traditional software releases were risky, manual, and infrequent (quarterly or yearly). Teams spent weeks preparing for releases, writing deployment documents, and coordinating downtime windows. Continuous Delivery eliminates this friction by automating the entire release process, making deployment a low-risk, routine event that can happen anytime.

Real-world analogy: Think of CD like having pre-packed bags ready for a trip. Without CD, you pack frantically before each trip (deployment), often forgetting things. With CD, your bags are always packed and ready - you just grab them and go. The packing process (testing and preparation) happens automatically after each shopping trip (code commit).

How it works (Detailed step-by-step):

  1. CI completes successfully: Continuous Integration has built the code and run all automated tests

    • WHY: CD builds on CI - you can't deliver untested code
    • WHAT HAPPENS: Build artifacts are ready, all quality gates passed
  2. Deployment to testing environment: Artifacts are automatically deployed to a staging/QA environment

    • WHY: Validates that deployment process works and app functions in production-like environment
    • WHAT HAPPENS: Infrastructure is provisioned (if needed), app is configured, health checks run
  3. Automated acceptance tests: Additional tests run in the staging environment

    • WHY: Verifies the application works correctly after deployment, including integration with real services
    • WHAT HAPPENS: UI tests, API tests, performance tests execute against deployed app
  4. Manual approval gates (optional): Stakeholders review and approve for production

    • WHY: Some organizations require human sign-off before production changes
    • WHAT HAPPENS: Product owners or change advisory boards review and approve
  5. Production-ready state: Application is ready to deploy to production anytime

    • WHY: The goal is to always have deployable software, reducing time-to-market
    • WHAT HAPPENS: Artifact is tagged as production-ready, deployment can trigger on demand

Key Difference - Continuous Delivery vs Continuous Deployment:

  • Continuous Delivery: Deployments to production require manual approval (human decides when to deploy)
  • Continuous Deployment: Deployments to production are fully automated (every change that passes tests goes to production automatically)

📊 Continuous Delivery Pipeline Diagram:

graph LR
    A[Code Commit] --> B[CI Build & Test]
    B -->|Success| C[Deploy to DEV]
    C --> D[Automated Tests DEV]
    D -->|Pass| E[Deploy to QA/Staging]
    E --> F[Integration Tests]
    F --> G[Performance Tests]
    G --> H{Manual Approval Gate}
    H -->|Approved| I[Ready for Production]
    H -->|Rejected| J[Back to Development]
    I -.Manual Trigger.-> K[Deploy to Production]
    
    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e9
    style D fill:#fff3e0
    style E fill:#f1f8e9
    style F fill:#fce4ec
    style G fill:#f3e5f5
    style H fill:#ffebee
    style I fill:#e0f2f1
    style K fill:#c8e6c9

See: diagrams/01_fundamentals_cd_pipeline.mmd

Diagram Explanation:
This flowchart illustrates a complete Continuous Delivery pipeline from code commit to production-ready state. The journey begins when a developer commits code (A, blue), triggering the CI Build & Test phase (B, purple) where code is compiled and unit tests run. Upon success, the green path activates: the build automatically deploys to the DEV environment (C, light green) for developer validation.

Automated tests run in DEV (D, orange) to verify basic functionality. When those pass, the pipeline progresses to deploying to QA/Staging environment (E, yellow-green), which mirrors production infrastructure. Here, comprehensive testing occurs: Integration Tests (F, pink) validate that all services work together, and Performance Tests (G, purple) ensure the application meets speed and scalability requirements under load.

After all automated validations pass, the pipeline reaches a Manual Approval Gate (H, red) where designated approvers (product managers, tech leads, or change boards) review test results and business readiness. If approved, the build enters "Ready for Production" state (I, teal) - it's fully tested and can be deployed to production anytime via manual trigger. If rejected, feedback loops back to Development (J). The final production deployment (K, green) happens when someone clicks "Deploy" - this could be immediately after approval or scheduled for a specific time. Notice the dashed line to production indicates this step is manual (the key difference from Continuous Deployment where it would be automatic).
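
In Azure Pipelines YAML, the staged flow above is typically modeled with stages and deployment jobs that target environments; the manual approval gate is configured as an approval check on the environment in the Azure DevOps UI rather than in the YAML itself. A minimal sketch with placeholder names:

stages:
  - stage: Build
    jobs:
      - job: BuildAndTest
        steps:
          - script: echo "compile, run unit tests, publish artifact"

  - stage: DeployQA
    dependsOn: Build
    jobs:
      - deployment: DeployToQA
        environment: qa                  # hypothetical environment name
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "deploy to QA, run integration and performance tests"

  - stage: DeployProd
    dependsOn: DeployQA
    jobs:
      - deployment: DeployToProd
        environment: production          # approvals and checks on this environment enforce the manual gate
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "blue-green deployment to production"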

Detailed Example 1: E-commerce CD Pipeline

Consider an online retail company that processes millions of orders daily. They've implemented CD for their order processing service. When a developer commits a bug fix to improve order confirmation emails, the CI pipeline builds and tests the code (2 minutes). The CD pipeline automatically deploys the new version to the DEV environment where developers can manually verify the email formatting looks correct (5 minutes).

Next, it auto-deploys to the QA environment where automated Selenium tests verify the complete order flow: add to cart, checkout, payment, and confirmation email. These tests include checking that emails contain order numbers, item details, and delivery estimates (10 minutes). Then performance tests simulate 10,000 concurrent orders to ensure the fix doesn't impact throughput (15 minutes).

All tests pass, and the pipeline notifies the product manager via Microsoft Teams. She reviews the test results dashboard showing 100% test pass rate, 0.3% CPU increase (acceptable), and previews the new email template. She clicks "Approve for Production" in Azure DevOps. The system marks this build as "Production Ready" and tags it in the container registry.

That evening during off-peak hours (2 AM), the deployment engineer clicks "Deploy to Production" in Azure DevOps. The CD pipeline executes: it pulls the approved container image, performs a blue-green deployment (standing up new instances alongside old ones), runs smoke tests against the new version, and gradually shifts traffic from old to new. The entire production deployment takes 20 minutes with zero downtime. By morning, all order confirmations use the improved template, and customers notice the clearer messaging.

Detailed Example 2: Banking Application with Compliance Gates

A financial institution has strict regulatory requirements for their mobile banking app. Their CD pipeline includes compliance validation steps. When developers complete a new feature for balance transfers, the standard CI/CD flow begins: build, unit tests, deploy to DEV. But this pipeline includes additional compliance gates.

After DEV deployment, the pipeline triggers automated security scans: SAST (static analysis) with SonarQube checks for SQL injection vulnerabilities, DAST (dynamic analysis) with OWASP ZAP tests the running application for XSS attacks, and dependency scanning verifies no packages have known CVEs. The pipeline also runs accessibility tests (WCAG 2.1 compliance for screen readers) and data encryption validation (ensuring PII is encrypted at rest and in transit).

Only if all security scans pass does the build proceed to the QA environment. Here, automated test scripts validated by the compliance team run scenarios like: transaction limits, audit log generation, and session timeout enforcement. The pipeline generates a compliance report documenting all tests and their results.

The report goes to three approval groups in sequence: (1) QA Lead reviews test coverage, (2) Security Officer reviews vulnerability scans, (3) Compliance Officer verifies regulatory requirements. Each approver has 24 hours to review; if no response, the request auto-escalates to their manager. Once all three approve, the build reaches "Production Ready" status.

Production deployment happens only during approved change windows (Tuesday/Thursday nights, 10 PM - 2 AM). The deployment includes an automatic rollback trigger: if error rates exceed 0.1% or response times increase by 20%, the system automatically reverts to the previous version and pages the on-call engineer. This heavily gated CD pipeline ensures that the balance transfer feature meets all regulatory requirements before reaching customers' mobile devices.

Must Know (Critical Facts):

  • CD requires comprehensive automated testing - without it, you can't confidently deploy to production automatically
  • Infrastructure as Code (IaC) is essential for CD - environments must be provisioned consistently using automated scripts (ARM templates, Terraform)
  • CD pipelines must include rollback capabilities - if deployment fails, automatic reversion to previous version prevents downtime
  • Environment parity is critical - DEV, QA, and Production should be as similar as possible to catch environment-specific issues early
  • CD reduces deployment risk - frequent small changes are less risky than infrequent large changes
  • Deployment and release are separate - CD can deploy code to production without making it visible to users (using feature flags)

Infrastructure as Code (IaC)

What it is: Infrastructure as Code is the practice of managing and provisioning infrastructure (servers, networks, databases) through machine-readable definition files rather than manual configuration or interactive configuration tools. Instead of clicking through Azure Portal to create resources, you write declarative code that describes what infrastructure you want, and tools automatically create it.

Why it exists: Manual infrastructure setup is error-prone, inconsistent, and doesn't scale. Two engineers manually creating "identical" environments will inevitably create slight differences (different OS patch levels, configuration settings, installed software). These differences cause the dreaded "works in dev, breaks in production" scenarios. IaC solves this by codifying infrastructure, making it repeatable, version-controlled, and testable.

Real-world analogy: IaC is like having a detailed recipe for cooking versus following verbal instructions. Verbal instructions ("add some salt, cook until it looks done") produce inconsistent results. A recipe with exact measurements (2 tsp salt, 20 minutes at 350°F) produces the same dish every time. Your infrastructure definition files are the exact recipe that produces identical environments consistently.

How IaC works: You write template files (ARM templates, Bicep, Terraform HCL) that declare desired infrastructure state. Tools read these templates and make API calls to cloud providers to create/update resources to match the declaration.

IaC Tools for Azure:

  • ARM Templates (JSON): Native Azure format, verbose but powerful
  • Bicep: Modern Azure DSL, compiles to ARM, cleaner syntax
  • Terraform: Multi-cloud, uses HCL language, maintains state files

Must Know: IaC templates should be stored in version control (Git) alongside application code - this enables infrastructure versioning, code reviews, and rollback capabilities.
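
As one hedged example of treating infrastructure as code in a pipeline, the step below uses the AzureCLI task to deploy a Bicep template stored in the repository; the service connection, resource group, file path, and parameter are assumptions for illustration:

steps:
  - task: AzureCLI@2
    displayName: 'Deploy infrastructure from Bicep'
    inputs:
      azureSubscription: 'my-azure-service-connection'   # hypothetical service connection
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        az deployment group create \
          --resource-group rg-demo \
          --template-file infra/main.bicep \
          --parameters environment=qa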

Version Control and Git Fundamentals

What it is: Version control systems track changes to code over time, allowing multiple developers to collaborate, review history, and revert changes. Git is the most popular distributed version control system used in DevOps.

Why it matters for DevOps: DevOps requires collaboration and automation. Git provides the foundation for both - teams collaborate through branches and pull requests, while CI/CD pipelines trigger from Git events (commits, merges, tags).

Key Git Concepts:

  • Repository (Repo): Storage location for code and history
  • Commit: Snapshot of code changes with message and author
  • Branch: Parallel version of code for isolated development
  • Merge: Combining changes from one branch into another
  • Pull Request (PR): Request to merge changes with code review
  • Clone: Copy of repository on local machine
  • Push: Upload local commits to remote repository
  • Pull: Download remote changes to local repository

Agile and DevOps Culture

What is Agile: Iterative approach to software development with short cycles (sprints), frequent feedback, and adaptability to change. Common frameworks: Scrum, Kanban.

How DevOps Extends Agile: Agile focuses on development team collaboration; DevOps extends this to include operations, creating end-to-end ownership from code to production.

DevOps Cultural Principles:

  1. Collaboration: Breaking down silos between Dev, Ops, QA, Security
  2. Automation: Eliminate manual, repetitive tasks
  3. Continuous Improvement: Learn from failures, optimize processes
  4. Customer Focus: Deliver value to end users quickly
  5. Shared Responsibility: Developers own code in production, ops involved in development
  6. Fail Fast: Detect and fix issues quickly, learn from failures

Azure DevOps vs GitHub

Azure DevOps Services:

  • Azure Boards: Work item tracking, Kanban boards, backlogs
  • Azure Repos: Git repositories, pull requests, branch policies
  • Azure Pipelines: CI/CD automation, YAML and classic pipelines
  • Azure Test Plans: Manual and exploratory testing tools
  • Azure Artifacts: Package management (NuGet, npm, Maven, Python)

GitHub Services:

  • GitHub Repositories: Git hosting with advanced features
  • GitHub Issues & Projects: Work tracking and project boards
  • GitHub Actions: CI/CD workflows using YAML
  • GitHub Packages: Package registry (npm, Docker, Maven)
  • GitHub Advanced Security: Code scanning, secret scanning, Dependabot

When to Choose:

  • Azure DevOps: Enterprise features, Test Plans, tight Azure integration
  • GitHub: Open source projects, public collaboration, simpler interface
  • Both: Can integrate (for example, GitHub repos with Azure Pipelines) - see the trigger comparison sketch after this list
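
To make the platform comparison concrete, here is roughly the same CI trigger expressed on each platform; the repository layout and step contents are placeholder assumptions.

Azure Pipelines (azure-pipelines.yml):

trigger:
  branches:
    include:
      - main

pool:
  vmImage: 'ubuntu-latest'

steps:
  - script: echo "build and test"

GitHub Actions (.github/workflows/ci.yml):

name: ci

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "build and test"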

Terminology Guide

Key terms (an annotated pipeline skeleton follows this list):

  • Artifact: Build output (compiled code, packages, containers). Example: Docker image, NuGet package, WAR file
  • Pipeline: Automated sequence of stages for build/deploy. Example: CI pipeline, Release pipeline
  • Agent/Runner: Machine that executes pipeline tasks. Example: Microsoft-hosted agent, self-hosted runner
  • Stage: Logical phase in a pipeline (Build, Test, Deploy). Example: Build stage runs compilation and unit tests
  • Job: Collection of steps executed on a single agent. Example: Build job with 5 steps
  • Task/Step: Individual action in a pipeline. Example: npm install task, Docker build step
  • Trigger: Event that starts a pipeline. Example: Commit to main, pull request, schedule
  • Gate: Approval or validation before proceeding. Example: Manual approval, security scan gate
  • Environment: Deployment target (Dev, QA, Prod). Example: Production environment with 10 VMs
  • Service Connection: Authenticated connection to an external service. Example: Azure subscription connection
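
A hedged Azure Pipelines skeleton showing where several of these terms appear in YAML (all names and paths are placeholders):

trigger:                              # Trigger: a commit to main starts the pipeline
  branches:
    include:
      - main

stages:
  - stage: Build                      # Stage: logical phase of the pipeline
    jobs:
      - job: BuildJob                 # Job: a collection of steps run on one agent
        pool:
          vmImage: 'ubuntu-latest'    # Agent: Microsoft-hosted machine that executes the job
        steps:
          - script: npm ci && npm test                           # Task/Step: individual action
          - publish: '$(System.DefaultWorkingDirectory)/dist'    # Artifact: build output published for later stages
            artifact: web-app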

Mental Model: DevOps Ecosystem

📊 DevOps Ecosystem Overview:

graph TB
    subgraph "Planning & Collaboration"
        A[Azure Boards/<br/>GitHub Issues]
    end
    
    subgraph "Source Control"
        B[Azure Repos/<br/>GitHub]
    end
    
    subgraph "CI/CD Automation"
        C[Azure Pipelines/<br/>GitHub Actions]
    end
    
    subgraph "Testing"
        D[Test Plans/<br/>Test Frameworks]
    end
    
    subgraph "Package Management"
        E[Azure Artifacts/<br/>GitHub Packages]
    end
    
    subgraph "Infrastructure"
        F[ARM/Bicep/<br/>Terraform]
    end
    
    subgraph "Deployment Targets"
        G[Azure App Service<br/>Kubernetes<br/>VMs]
    end
    
    subgraph "Monitoring"
        H[Azure Monitor/<br/>App Insights]
    end
    
    A -->|Work Items| B
    B -->|Code Commits| C
    C -->|Run Tests| D
    C -->|Publish| E
    C -->|Provision| F
    F -->|Deploy To| G
    C -->|Deploy| G
    G -->|Telemetry| H
    H -->|Feedback| A
    
    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#fff3e0
    style D fill:#e8f5e9
    style E fill:#fce4ec
    style F fill:#e0f2f1
    style G fill:#e1f5fe
    style H fill:#f9fbe7

See: diagrams/01_fundamentals_ecosystem.mmd

Diagram Explanation: This ecosystem diagram shows how DevOps tools and practices interconnect. Planning & Collaboration (blue) tools like Azure Boards or GitHub Issues track work items (features, bugs, tasks). When work begins, developers create branches in Source Control (purple) using Azure Repos or GitHub. Code commits trigger CI/CD Automation (orange) via Azure Pipelines or GitHub Actions, which orchestrates the entire delivery process.

Pipelines execute Testing (green) using various frameworks (JUnit, pytest, Selenium) to validate quality. Successful builds publish artifacts to Package Management (pink) systems like Azure Artifacts or GitHub Packages for versioning and distribution. Pipelines also execute Infrastructure (teal) provisioning using ARM templates, Bicep, or Terraform to create or update cloud resources.

Applications deploy to Deployment Targets (light blue) like Azure App Service for web apps, Kubernetes for containers, or VMs for traditional applications. Running applications send telemetry to Monitoring (yellow-green) systems like Azure Monitor and Application Insights, collecting metrics, logs, and traces. Crucially, monitoring insights feed back into Planning (feedback arrow), creating a continuous improvement loop where production data informs what to build next. This circular flow represents the never-ending DevOps lifecycle.

Chapter Summary

What We Covered

  • DevOps Definition: Integration of development and operations with automation and culture shift
  • DevOps Lifecycle: 8-phase continuous loop (Plan → Develop → Build → Test → Release → Deploy → Operate → Monitor)
  • Continuous Integration: Automated build and test on every commit, catching issues early
  • Continuous Delivery: Automated deployment to staging, manual to production
  • Infrastructure as Code: Defining infrastructure through code for consistency and automation
  • Version Control: Git fundamentals for collaboration and pipeline triggers
  • DevOps Culture: Collaboration, automation, continuous improvement principles
  • Tool Platforms: Azure DevOps vs GitHub services and when to use each

Critical Takeaways

  1. DevOps is culture + practices + tools - not just automation, but collaboration and shared ownership
  2. CI/CD reduces risk - frequent small deployments are safer than infrequent large releases
  3. Automation is essential - manual processes don't scale and introduce errors
  4. Infrastructure as Code - treat infrastructure like application code (version control, testing, automation)
  5. Feedback loops are critical - monitoring must inform planning for continuous improvement

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the 8 phases of the DevOps lifecycle and how they connect
  • I understand the difference between CI, CD, and Continuous Deployment
  • I can describe how CI catches bugs earlier in the development cycle
  • I know why Infrastructure as Code is necessary for consistent environments
  • I understand the basic Git workflow (commit, push, pull, branch, merge)
  • I can list the core Azure DevOps and GitHub services
  • I understand DevOps cultural principles (collaboration, automation, shared responsibility)

Quick Reference Card

Key Concepts:

  • CI: Build + test on every commit
  • CD: Automated deployment to staging, ready for production
  • IaC: Infrastructure defined in code (ARM, Bicep, Terraform)
  • Pipeline: Automated build → test → deploy workflow

Tools:

  • Azure DevOps: Boards, Repos, Pipelines, Artifacts, Test Plans
  • GitHub: Issues, Repos, Actions, Packages, Advanced Security

Cultural Pillars:

  • Collaboration over silos
  • Automation over manual processes
  • Continuous improvement over "it works, don't touch it"
  • Fast feedback over delayed reviews

📝 Practice: Before proceeding to Domain 1, ensure you can explain CI/CD to someone unfamiliar with DevOps using simple analogies. If you can't, review the CI and CD sections again.


Next Chapter: 02_domain1_processes_communications - Design and Implement Processes and Communications (Work tracking, metrics, collaboration)


Chapter 1: Design and Implement Processes and Communications (12.5% of exam)

Chapter Overview

What you'll learn:

  • Design and implement traceability and flow of work using Azure Boards and GitHub
  • Design and implement metrics, dashboards, and KQL queries for DevOps insights
  • Configure collaboration and communication tools (wikis, documentation, integrations)

Time to complete: 8-10 hours
Prerequisites: Chapter 0 (Fundamentals) - Understanding of DevOps lifecycle and version control

Exam Weight: 10-15% (this guide uses the 12.5% midpoint)
This domain focuses on the planning and collaboration aspects of DevOps, ensuring teams can track work effectively, measure progress with meaningful metrics, and communicate efficiently.


Section 1: Traceability and Work Flow

Introduction

The problem: Without proper work tracking and traceability, teams lose visibility into what's being worked on, why changes are made, and how work progresses from idea to production. This leads to missed requirements, duplicated effort, and inability to understand the impact of code changes.

The solution: Implement end-to-end traceability systems that connect planning (work items) to execution (code commits, builds, deployments) and results (monitoring data). Azure Boards and GitHub provide comprehensive work tracking with deep integration into the DevOps lifecycle.

Why it's tested: The AZ-400 exam emphasizes the ability to design workflow systems that provide visibility, enable collaboration, and support data-driven decision making. Understanding how to configure work tracking and establish traceability is fundamental to effective DevOps implementation.

Core Concepts

Work Item Tracking with Azure Boards

What it is: Azure Boards is a work tracking system that uses customizable work items to plan, track, and discuss work across teams. It supports Agile methodologies (Scrum, Kanban) and provides visualization through boards, backlogs, and dashboards.

Why it exists: Traditional project management tools (spreadsheets, email threads) don't integrate with development workflows. Azure Boards solves this by embedding work tracking directly into the DevOps toolchain, linking planning to code, builds, and deployments for complete traceability.

Real-world analogy: Azure Boards is like a digital task board in a restaurant kitchen where each ticket (work item) represents an order. The ticket moves from "New Order" to "Cooking" to "Quality Check" to "Ready to Serve." Anyone in the kitchen can see all orders, their status, and who's working on what - and each ticket links to the recipe (code), ingredients used (commits), and customer feedback (monitoring).

How it works (Detailed step-by-step):

  1. Create work items: Teams create work items representing features, user stories, tasks, bugs, or issues

    • WHY: Captures what needs to be done in a structured, trackable format
    • WHAT HAPPENS: Work item gets unique ID, assignment, priority, and iteration path
    • TOOLS: Azure Boards web interface, CLI, Visual Studio, VS Code extension
  2. Organize in backlogs: Work items are prioritized and organized in product and sprint backlogs

    • WHY: Ensures team works on highest-value items first
    • WHAT HAPPENS: Product Owner/Scrum Master orders items by business value and dependencies
    • VIEWS: Product Backlog (all work), Sprint Backlog (current sprint work)
  3. Visualize on Kanban board: Work items appear as cards on configurable board columns

    • WHY: Provides visual status of work in progress, identifies bottlenecks
    • WHAT HAPPENS: Cards move across columns (New → Active → Resolved → Closed) as work progresses
    • WIP LIMITS: Can set maximum cards per column to prevent overload
  4. Link to code: Developers reference work items in commits, pull requests, and branches

    • WHY: Creates traceability from requirement to implementation
    • WHAT HAPPENS: Commit message "Fixes AB#123" creates link from code to work item
    • SYNTAX: AB#{ID} in commit messages, PR descriptions, or branch names
  5. Track progress: Automated state transitions and burndown charts show sprint/iteration progress

    • WHY: Enables data-driven decisions about velocity and capacity
    • WHAT HAPPENS: Work items automatically update as linked PRs merge or builds complete
    • METRICS: Velocity (story points completed), Burndown (work remaining over time)

📊 Azure Boards Workflow Diagram:

graph LR
    A[Product Owner<br/>Creates Work Item] --> B[Backlog<br/>Prioritization]
    B --> C[Sprint Planning<br/>Assign to Sprint]
    C --> D[Developer<br/>Creates Branch]
    D --> E[Code + Commits<br/>Link to Work Item]
    E --> F[Pull Request<br/>Code Review]
    F --> G[Merge to Main<br/>Auto-update Work Item]
    G --> H[CI/CD Pipeline<br/>Build & Deploy]
    H --> I[Work Item State<br/>Closed/Resolved]
    
    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#fff3e0
    style D fill:#e8f5e9
    style E fill:#fce4ec
    style F fill:#f1f8e9
    style G fill:#e0f2f1
    style H fill:#e1f5fe
    style I fill:#c8e6c9

See: diagrams/02_domain1_azure_boards_workflow.mmd

Diagram Explanation:
This workflow diagram illustrates the complete lifecycle of work tracking in Azure Boards from inception to completion. The Product Owner (blue) creates work items representing features or user stories based on business requirements or user feedback. These items enter the Backlog (purple) where they're prioritized by business value, dependencies, and team capacity.

During Sprint Planning (orange), the team selects high-priority items from the backlog and assigns them to the current sprint, estimating effort in story points. A Developer (green) picks a work item and creates a feature branch (e.g., "feature/add-shopping-cart-AB123") from the main branch. As they write code and make Commits (pink), they reference the work item ID using "AB#123" syntax - this creates automatic links from commits to work items visible in both Git history and the work item's Development section.

When code is complete, the developer creates a Pull Request (light green) for code review, again mentioning "AB#123" in the PR description to maintain traceability. After approval and Merge to Main (teal), Azure Boards can automatically transition the work item state (e.g., from "Active" to "Resolved") based on keywords like "Fixes AB#123" in the merge commit. The merge triggers the CI/CD Pipeline (light blue) which builds, tests, and deploys the code. Finally, when deployment succeeds and validation passes, the Work Item State (green) updates to "Closed," completing the traceability loop. Every step is tracked, creating a complete audit trail from requirement to production.

GitHub Projects Integration

What it is: GitHub Projects is a native project management tool built directly into GitHub repositories and organizations that provides kanban boards, tables, and roadmaps for tracking work.

Why it exists: Teams need lightweight, code-centric project management without switching between separate tools. GitHub Projects integrates work tracking directly where code lives, reducing context switching and improving developer productivity.

Real-world analogy: Like a digital whiteboard next to your desk where you can move sticky notes representing tasks, but this whiteboard automatically updates when code changes happen and is visible to your entire distributed team.

How it works (Detailed step-by-step):

  1. Create a project: From repository or organization settings, create a new Project and choose view type (Board, Table, or Roadmap)

    • WHY: Different views suit different workflows - Board for kanban, Table for detailed tracking, Roadmap for timeline planning
    • WHAT HAPPENS: GitHub creates a customizable workspace with default fields (Status, Assignees, Labels)
    • OPTIONS: Can be repository-level (single repo) or organization-level (multiple repos)
  2. Add issues and PRs: Drag issues/PRs from repositories into the project or create new draft issues directly in the project

    • WHY: Centralizes work from multiple repositories into unified view
    • WHAT HAPPENS: Items maintain bidirectional sync - changes in project update the issue, changes in issue update project
    • AUTOMATION: Can set up workflows to auto-add issues with specific labels
  3. Customize fields: Add custom fields like Priority, Sprint, Team, Story Points, or custom statuses

    • WHY: Adapts to your team's specific workflow and tracking needs
    • WHAT HAPPENS: Fields appear on all project items, can be used for filtering and grouping
    • DATA TYPES: Single select, text, number, date, iteration (sprints)
  4. Automate workflows: Set up built-in automations for common actions (e.g., "Auto-archive items when closed")

    • WHY: Reduces manual board maintenance, ensures consistent state management
    • WHAT HAPPENS: When triggers occur (item closed, PR merged), actions execute automatically (move to Done, set status)
    • EXAMPLES: "Move to In Progress when PR created", "Set status to Done when issue closed"
  5. Track progress: Use insights, charts, and filters to monitor velocity, burndown, and completion rates

    • WHY: Provides data-driven visibility into team performance and sprint progress
    • WHAT HAPPENS: GitHub analyzes project data and generates visual reports
    • METRICS: Items completed over time, open vs closed trends, distribution by assignee

📊 GitHub Projects Architecture Diagram:

graph TB
    subgraph "Organization Level"
        ORG[Organization Project<br/>Cross-Repo View]
    end
    
    subgraph "Repository A"
        ISSUE1[Issue #45<br/>Add Login Feature]
        PR1[PR #46<br/>Implement OAuth]
    end
    
    subgraph "Repository B"
        ISSUE2[Issue #12<br/>Fix API Bug]
        PR2[PR #13<br/>Update Endpoint]
    end
    
    subgraph "Project Board"
        TODO[📋 To Do]
        PROGRESS[🔄 In Progress]
        REVIEW[👀 In Review]
        DONE[✅ Done]
    end
    
    ORG --> |aggregates| TODO
    ORG --> |aggregates| PROGRESS
    ORG --> |aggregates| REVIEW
    ORG --> |aggregates| DONE
    
    ISSUE1 --> TODO
    PR1 --> PROGRESS
    ISSUE2 --> REVIEW
    PR2 --> DONE
    
    AUTO[Automation:<br/>PR created → In Progress<br/>PR merged → Done]
    AUTO -.triggers.-> PROGRESS
    AUTO -.triggers.-> DONE
    
    style ORG fill:#e3f2fd
    style TODO fill:#fff3e0
    style PROGRESS fill:#e1f5fe
    style REVIEW fill:#f3e5f5
    style DONE fill:#c8e6c9
    style AUTO fill:#ffebee

See: diagrams/02_domain1_github_projects_architecture.mmd

Diagram Explanation:
This architecture diagram shows how GitHub Projects creates a unified view across multiple repositories at the organization level. The Organization Project (blue) acts as an aggregation layer that pulls issues and pull requests from multiple repositories into a single project board.

In Repository A, developers create Issue #45 for a new login feature and PR #46 to implement OAuth authentication. In Repository B, there's Issue #12 for an API bug and PR #13 to fix it. All these items are automatically or manually added to the organization-level project.

The Project Board has four columns representing workflow states: To Do (orange - work not started), In Progress (light blue - active development), In Review (purple - code review stage), and Done (green - completed work). Items move across columns as work progresses. Issue #45 sits in To Do waiting for someone to pick it up. PR #46 is In Progress as someone actively codes. Issue #12 is In Review as the fix undergoes code review. PR #13 is in Done because it merged successfully.

The Automation box (red) shows automated workflows that trigger state changes. When a developer creates a PR, automation moves the linked issue to "In Progress." When a PR merges, automation moves it to "Done" and can auto-close linked issues. This reduces manual board maintenance and ensures the project always reflects current work state. All changes bidirectionally sync - updating an item's status in the project updates the actual issue/PR, and vice versa.

Detailed Example 1: Sprint Planning with GitHub Projects
Your team starts a 2-week sprint. The product owner has prioritized 15 issues in the backlog labeled "sprint-12." You create a new GitHub Project called "Sprint 12" with custom fields: Priority (High/Medium/Low), Story Points (1-13), and Iteration (Sprint 12). Using automation rules, you configure "Auto-add items with label:sprint-12" which automatically populates the project with all 15 issues.

During sprint planning, the team reviews each issue in Table view, assigns story points based on complexity, sets priorities, and assigns developers. The team's capacity is 50 story points, and the auto-calculated sum shows 48 points - perfect fit. You switch to Board view for daily standups where developers move cards as they work.

When developer Sarah picks up "Add shopping cart," she creates a branch and a draft PR. The automation "When PR created → Move to In Progress" triggers, automatically moving the issue card. After code review and merge, "When PR merged → Move to Done" automation triggers, and the issue closes automatically via "Closes #45" in the PR description. By sprint end, the Insights tab shows a burndown chart with 46/48 points completed, a velocity chart showing improvement over the last sprint, and a distribution chart showing a balanced workload across team members. The project becomes a historical record of the sprint.

Detailed Example 2: Cross-Repository Dependency Tracking
Your microservices architecture has 8 repositories, and you're implementing a feature that touches 4 of them: API Gateway (repo A), Auth Service (repo B), User Service (repo C), and Database Migrations (repo D). You create an organization-level project called "SSO Implementation" to track all related work across repositories.

In each repository, developers create issues: API-123 in repo A, AUTH-45 in repo B, USER-67 in repo C, DB-89 in repo D. You add all issues to the project and create a custom "Dependency" field to track blockers. USER-67 depends on AUTH-45 (can't update user profiles until auth is ready), and API-123 depends on all others (gateway integrates everything).

You set up View filters: "Group by: Repository" shows work per service, "Group by: Assignee" shows work per developer, "Group by: Status" shows overall progress. As AUTH-45 completes, you update its status to Done, and USER-67's assignee gets notified via GitHub notifications that their blocker is cleared. The Roadmap view (timeline) shows all 4 issues with their target completion dates, making dependencies visual. When all issues move to Done, you know the cross-repo feature is complete. This single project replaces what would otherwise require tracking in separate tools or spreadsheets, keeping all information in the context of the code.

Must Know (Critical Facts):

  • GitHub Projects can span multiple repositories at the organization level, enabling cross-repo work tracking in a single view - critical for microservices and multi-repo workflows
  • Bidirectional sync means updates anywhere propagate everywhere - changing status in project updates the issue, closing issue updates project automatically
  • Automation workflows reduce manual maintenance - configure once, then items auto-move between columns based on triggers (PR created, issue closed, label added)
  • Custom fields enable workflow customization - add Sprint, Story Points, Priority, Team, or any metadata your process requires
  • Built-in insights and charts provide metrics without external tools - burndown, velocity, cumulative flow all generated from project data

When to use (Comprehensive):

  • ✅ Use when: Your team works primarily in GitHub and wants integrated work tracking - eliminates context switching between code and project management tools
  • ✅ Use when: Managing cross-repository initiatives - organization projects aggregate issues/PRs from multiple repos into unified view
  • ✅ Use when: You need lightweight, flexible project management - simpler than Azure Boards, perfect for teams that don't need heavy process
  • ✅ Use when: Tracking sprints for small to medium teams (up to 20-30 developers) - custom iterations, story points, and burndown tracking built-in
  • ✅ Use when: Open source projects need transparent work tracking - projects can be public, showing community what's being worked on
  • ❌ Don't use when: You need complex work item hierarchies (Epics > Features > Stories > Tasks) - GitHub Projects has flat structure; use Azure Boards instead
  • ❌ Don't use when: Require advanced reporting and custom queries beyond built-in insights - Azure Boards has more powerful Analytics and OData queries
  • ❌ Don't use when: Team already invested in Azure DevOps Boards with established processes - migration overhead not worth it unless consolidating tools

Limitations & Constraints:

  • Flat hierarchy: No native Epic > Story > Task relationships - must use labels or custom fields to simulate hierarchy
  • Limited reporting: Built-in insights are basic; can't create complex custom reports or export to PowerBI like Azure Boards
  • Organization projects require GitHub Team or Enterprise: Free tier only supports repository-level projects
  • 25,000 items per project: Large enterprises may hit this limit on long-running projects (workaround: archive old items, create new projects quarterly)
  • No time tracking: No native support for time estimates or logged hours (use custom fields as workaround)

💡 Tips for Understanding:

  • Think of Projects as "living spreadsheets connected to code" - like Excel/Google Sheets but rows are issues/PRs that update when code changes
  • Automation rules are "if-this-then-that" for project management - "if PR created, then move to In Progress" reduces manual drag-and-drop
  • Organization vs Repository projects: Repo projects track one codebase, Org projects track entire product spanning multiple repos

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Creating separate projects for each repository in a multi-repo product
    • Why it's wrong: Forces team to check multiple boards, loses cross-repo visibility
    • Correct understanding: Use ONE organization-level project that aggregates issues from ALL relevant repositories for unified tracking
  • Mistake 2: Manually moving every item between columns instead of setting up automation
    • Why it's wrong: Wastes time on mechanical updates, boards become stale when people forget to update
    • Correct understanding: Configure automation rules once (PR created→In Progress, Issue closed→Done) and let GitHub maintain board state automatically
  • Mistake 3: Trying to replicate Azure Boards' complex hierarchy in GitHub Projects
    • Why it's wrong: GitHub Projects doesn't support nested work items; forcing it creates confusion
    • Correct understanding: Keep it flat with labels/custom fields, or use Azure Boards if you need Epic>Feature>Story>Task hierarchy

🔗 Connections to Other Topics:

  • Relates to Azure Boards Integration (covered next) because: Teams often use BOTH - GitHub Projects for day-to-day dev work, Azure Boards for higher-level program management
  • Builds on GitHub Flow by: Providing visual tracking layer on top of branch→PR→merge workflow
  • Often used with GitHub Actions to: Trigger automation not just in projects but also CI/CD pipelines (e.g., deploy when item moves to "Ready for Release")

Section 2: DevOps Metrics and Dashboards

Introduction

The problem: Without measurable metrics, teams can't identify bottlenecks, track improvement, or make data-driven decisions about their development process.
The solution: DevOps dashboards aggregate key metrics (cycle time, lead time, velocity, deployment frequency) to provide visibility into team performance and process health.
Why it's tested: The AZ-400 exam emphasizes metric-driven continuous improvement - designing appropriate metrics falls under Domain 1 (Design and Implement Processes and Communications), which accounts for 12.5% of the exam.

Core Concepts

Cycle Time vs Lead Time

What it is: Two critical flow metrics that measure how fast work moves through your development pipeline, but they measure different parts of the process.

Why it exists: Teams need to distinguish between total delivery time (lead time - customer perspective) and actual work time (cycle time - team efficiency). Understanding both helps identify where delays occur and whether problems are in planning (long lead time) or execution (long cycle time).

Real-world analogy: Ordering a pizza: Lead time is from when you place the order to when it arrives at your door (total customer wait). Cycle time is from when the kitchen starts preparing your pizza to when it comes out of the oven (actual work time). If lead time is 60 minutes but cycle time is 15 minutes, most delay is in the queue, not preparation.

How it works (Detailed step-by-step):

  1. Work item created: Timer for Lead Time starts immediately when issue/user story is created in backlog

    • WHY: Measures total time from customer request to delivery - the customer's perspective
    • WHAT HAPPENS: Issue gets timestamp in "Created Date" field
    • EXAMPLE: User story "Add payment gateway" created on Jan 1, lead time timer starts
  2. Work item moves to Active/In Progress: Timer for Cycle Time starts when team begins active work

    • WHY: Measures only actual development time, excluding wait time in backlog
    • WHAT HAPPENS: State changes from "New" to "Active" or "In Progress", cycle time clock starts
    • EXAMPLE: Developer picks up story on Jan 10, cycle time timer starts (lead time already at 9 days)
  3. Work item completed: Both timers stop when work item reaches "Done/Closed" state

    • WHY: Marks end of both development work and total delivery process
    • WHAT HAPPENS: Final state transition recorded, both metrics calculated
    • EXAMPLE: PR merges and deploys on Jan 15 - Cycle time = 5 days, Lead time = 14 days
  4. Reactivation handling: If work item reopens, cycle time aggregates active periods

    • WHY: Bug fixes after deployment shouldn't inflate original cycle time
    • WHAT HAPPENS: New cycle time period starts, total cycle time = sum of all active periods
    • EXAMPLE: Bug reopened Jan 20, fixed Jan 22 - Total cycle time = 5 + 2 = 7 days
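
To make the arithmetic concrete, here is a small Python sketch (illustrative only - not an Azure Boards API client) that computes lead time and total cycle time from a work item's state-change history. It sums every Active-to-Closed period, which is how the reactivation rule in step 4 works, and the sample data mirrors Detailed Example 1 below (created Jan 1, started Jan 12, closed Jan 18).

# Illustrative only: compute lead time and total cycle time from a state history.
# Each entry is (timestamp, new_state); reactivations simply add more Active periods.
from datetime import date, timedelta

def lead_and_cycle(history: list[tuple[date, str]]) -> tuple[timedelta, timedelta]:
    created = history[0][0]                     # lead time starts at creation
    closed = max(t for t, s in history if s == "Closed")
    cycle = timedelta()
    active_since = None
    for ts, state in history:
        if state == "Active" and active_since is None:
            active_since = ts                   # cycle time clock starts (or restarts)
        elif state == "Closed" and active_since is not None:
            cycle += ts - active_since          # sum each Active -> Closed period
            active_since = None
    return closed - created, cycle

history = [
    (date(2024, 1, 1), "New"),      # user story created - lead time starts
    (date(2024, 1, 12), "Active"),  # developer picks it up - cycle time starts
    (date(2024, 1, 18), "Closed"),  # merged and deployed - both timers stop
]
lead, cycle = lead_and_cycle(history)
print(lead.days, cycle.days)  # 17 days lead time, 6 days cycle time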

📊 Lead Time vs Cycle Time Diagram:

graph LR
    A[📝 Work Item<br/>Created] -->|Waiting in Backlog<br/>9 days| B[🚀 Work Started<br/>Active/In Progress]
    B -->|Development<br/>5 days| C[✅ Work Item<br/>Completed]
    
    A -.Lead Time: 14 days.-> C
    B -.Cycle Time: 5 days.-> C
    
    D[📊 Metrics] --> E[Lead Time:<br/>Customer perspective<br/>Total delivery time]
    D --> F[Cycle Time:<br/>Team efficiency<br/>Actual work time]
    
    style A fill:#fff3e0
    style B fill:#e1f5fe
    style C fill:#c8e6c9
    style D fill:#f3e5f5
    style E fill:#ffebee
    style F fill:#e8f5e9

See: diagrams/02_domain1_lead_cycle_time_comparison.mmd

Diagram Explanation:
This diagram illustrates the critical difference between Lead Time and Cycle Time metrics in DevOps workflows. The timeline starts when a Work Item is Created (orange) - this could be a user story, bug, or feature request. At this moment, the Lead Time clock starts ticking because from the customer's perspective, they're waiting for this functionality.

The work item sits in the backlog for 9 days - prioritization meetings happen, dependencies clear, team capacity becomes available. During this Waiting in Backlog period, lead time continues accumulating, but cycle time hasn't started yet because no active development is occurring. This waiting period often reveals process inefficiencies: oversized backlogs, unclear priorities, or capacity constraints.

When Work Started (blue) - a developer picks up the item and moves it to "Active" or "In Progress" - the Cycle Time clock starts. Now the team is actively coding, testing, and reviewing. This development phase takes 5 days, during which both lead time and cycle time increase together. The cycle time measures pure team efficiency: how fast can developers deliver once they start working?

Finally, the Work Item Completed (green) when the code merges and deploys. Both timers stop. The Lead Time = 14 days (9 waiting + 5 working) represents what the customer experienced - two weeks from request to delivery. The Cycle Time = 5 days represents team efficiency - when focused, the team delivers in a week.

The Metrics section (purple) summarizes: Lead Time (red) is the customer perspective showing total delivery time, while Cycle Time (green) is team efficiency showing actual work time. If lead time is much higher than cycle time (like 14 vs 5), the problem isn't team speed - it's backlogs, prioritization, or wait time. If cycle time is high, the team's execution needs improvement through automation, better practices, or removing impediments.

Detailed Example 1: E-commerce Feature Delivery
Your team receives a request for a new feature: "Add wishlist functionality." On January 1, Product Owner creates User Story #456 in Azure Boards - Lead Time starts at Day 0. The story sits in the "New" state while the PO writes acceptance criteria, talks to stakeholders, and prioritizes against other work. During sprint planning on January 10, the team estimates it at 8 story points and adds to current sprint, moving it to "Approved" state - lead time is now at 9 days, but cycle time hasn't started because no development occurred yet.

On January 12, developer Maria picks up the story, creates branch feature/wishlist-456, and moves the work item to "Active" - Cycle Time starts at Day 0. She spends 3 days implementing the frontend wishlist UI, 1 day on backend API, and 1 day writing tests. On January 17, she creates a PR - cycle time is 5 days. Code review takes 1 day, PR merges on January 18, and automated deployment to production completes. Work item moves to "Closed" - Cycle Time = 6 days (Jan 12-18), Lead Time = 17 days (Jan 1-18).

Analysis: The 11-day gap (17 lead - 6 cycle) represents wait time before development started. To improve customer satisfaction (lead time), the team needs to reduce backlog size or prioritize faster, not necessarily work faster (cycle time already good at 6 days).

Detailed Example 2: Bug Fix with Reactivation
A critical bug report "#789 - Payment fails for international cards" is created on March 1 in "New" state - Lead Time starts. On March 2, it's triaged as P0 (highest priority) and moved to "Active" - Cycle Time starts. Developer fixes the validation logic in 2 hours and deploys on March 2 - Cycle Time = 1 day, Lead Time = 1 day. Both timers stop when bug moves to "Closed."

On March 5, the bug is reported again - it only fixed US cards, not all international cards. The bug reopens to "Active" state - Cycle Time restarts (lead time continues from original creation). Developer spends March 5-6 implementing comprehensive international card support and deploys. Bug closes again on March 6 - Second cycle period = 2 days.

Final Metrics: Lead Time = 5 days (March 1-6 total), Total Cycle Time = 1 + 2 = 3 days (sum of both active periods). This shows the bug required 5 days to truly deliver from customer perspective, with 3 days of actual work spread across two attempts. The reactivation reveals incomplete initial fix - a process improvement opportunity for better testing before closure.

Must Know (Critical Facts):

  • Lead Time = Created Date → Closed Date (total customer wait time) - starts the moment work item is created in backlog
  • Cycle Time = First Active → Closed Date (actual work time) - starts when team begins active development, excludes backlog wait time
  • For reactivated work items, cycle time SUMS all active periods - if a bug is fixed twice, cycle time = first active period + second active period
  • Large gap between lead and cycle time indicates backlog/prioritization problems, not team efficiency issues - if lead=20 days but cycle=3 days, 17 days were waiting
  • Lower cycle time = healthier team process - faster delivery once work starts indicates good practices, automation, and flow

When to use (Comprehensive):

  • ✅ Use Lead Time when: Measuring customer satisfaction and SLA compliance - shows total time from request to delivery, what customer experiences
  • ✅ Use Lead Time when: Identifying backlog bottlenecks - high lead time with low cycle time means work waits too long before starting
  • ✅ Use Cycle Time when: Measuring team efficiency and process health - how fast can team deliver once focused on a task
  • ✅ Use Cycle Time when: Comparing team performance over time - stable/decreasing cycle time indicates improving practices and automation
  • ✅ Use BOTH when: Comprehensive flow analysis - lead time shows customer impact, cycle time shows team performance, together they reveal where problems are
  • ❌ Don't use Cycle Time for: SLA commitments to customers - customers don't care when you started, only total delivery time (use lead time)
  • ❌ Don't use Lead Time for: Team performance evaluation - includes wait time outside team's control; use cycle time for team efficiency

Limitations & Constraints:

  • Requires consistent state discipline: Teams must reliably move items to "Active/In Progress" when starting work - inconsistent state changes produce invalid metrics
  • Work Item Type dependency: Different types (Bug, User Story, Task) may have different workflows - configure widgets separately for each type
  • Reactivation complexity: Understanding "total cycle time" vs "cycle time" requires knowing if items were reopened - not always obvious from charts
  • Outliers skew averages: One item with 90-day cycle time among items averaging 3 days significantly impacts mean - use median or remove outliers
  • Doesn't measure quality: Fast cycle time with high defect rate is worse than slower cycle time with clean code - combine with quality metrics

💡 Tips for Understanding:

  • Think of lead time as "customer waiting time" - includes everything from "I want this" to "I have this" regardless of what caused delays
  • Think of cycle time as "team working time" - only counts when team actively develops, excludes waiting in backlog or for dependencies
  • Lead Time - Cycle Time = Wait Time - this gap reveals process inefficiencies like backlog bloat, unclear priorities, or capacity constraints

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Using cycle time to measure customer satisfaction or SLA compliance
    • Why it's wrong: Cycle time excludes backlog wait time, underestimating total customer wait; customer cares about lead time
    • Correct understanding: Use lead time for customer-facing metrics and SLAs; cycle time is internal team efficiency metric
  • Mistake 2: Blaming team for high lead time when cycle time is low
    • Why it's wrong: If lead=15 days, cycle=3 days, the 12-day gap is backlog/prioritization, not team speed
    • Correct understanding: Large lead-cycle gap indicates process problems (too much WIP, poor prioritization), not execution problems
  • Mistake 3: Focusing only on lowering cycle time without considering quality
    • Why it's wrong: Rushing work to reduce cycle time increases defects, leading to reactivations that ultimately increase total time
    • Correct understanding: Balance cycle time with code quality, test coverage, and defect rates - sustainable speed, not reckless speed

🔗 Connections to Other Topics:

  • Relates to Cumulative Flow Diagram (CFD) because: CFD visualizes work in each state over time; the horizontal width between "Active" and "Done" bands represents cycle time
  • Builds on Kanban Board Column Configuration by: State categories (New, Active, Resolved, Closed) determine when cycle time starts/stops
  • Often used with Sprint Velocity to: Combine time metrics (lead/cycle time) with throughput metrics (velocity) for comprehensive team health view

Cumulative Flow Diagram (CFD)

What it is: A stacked area chart that visualizes the distribution of work items across different workflow states over time, showing work in progress, throughput, and bottlenecks at a glance.

Why it exists: Teams need to visualize flow health and identify process problems quickly. A CFD shows not just point-in-time status, but trends: is work piling up in code review? Is "To Do" growing faster than "Done"? Are we delivering consistently?

Real-world analogy: Like watching a multi-lane highway from above with traffic cameras. Each lane (New, Active, Review, Done) is a colored band. Wide bands = lots of cars (work items) in that lane. If one lane keeps getting wider while others stay stable, there's a traffic jam (bottleneck) that needs fixing.

How it works (Detailed step-by-step):

  1. Horizontal axis = Time: Shows date range (typically 30, 60, or 90 days rolling window)

    • WHY: Reveals trends over time, not just current snapshot
    • WHAT HAPPENS: Chart updates daily as work items move through states
    • CONFIGURATION: Can set custom date ranges like "Last Sprint" or "Last Quarter"
  2. Vertical axis = Work Item Count: Total number of items across all states at each point in time

    • WHY: Shows total WIP (Work in Progress) - higher vertical distance = more concurrent work
    • WHAT HAPPENS: Each day's count stacks all states (New + Active + Review + Done)
    • HEALTHY PATTERN: Relatively stable total height (controlled WIP)
  3. Colored bands = Workflow states: Each color represents items in a specific state (bottom to top: Done, Review, Active, New)

    • WHY: Visual pattern recognition - see which states accumulate items
    • WHAT HAPPENS: Band width = count in that state; band growing = items accumulating
    • INTERPRETATION: Growing top bands (early states) = input faster than output = trouble
  4. Band transitions reveal flow: Smooth parallel bands = healthy flow; diverging/converging bands = bottlenecks or capacity changes

    • WHY: Visual flow analysis without complex calculations
    • WHAT HAPPENS: "Active" band widening while "Review" stays narrow = code review bottleneck
    • ACTION: Team adjusts (add reviewers, pair programming, automate tests)
  5. Arrival rate vs Departure rate: Slope of top edge (New) vs slope of bottom edge (Done) indicates balance

    • WHY: If New grows faster than Done, backlog accumulates indefinitely
    • WHAT HAPPENS: Parallel slopes = sustainable; diverging slopes = unsustainable growth
    • EXAMPLE: New growing at 10 items/day, Done at 6 items/day = 4-item daily backlog growth
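
The same analysis a CFD shows visually can be run on the raw data behind it (daily work-item counts per state). The Python sketch below uses made-up numbers to flag a widening band and to estimate average lead time with Little's Law (Lead Time = WIP / Throughput, mentioned under Connections later); the 1.5x threshold is an arbitrary illustration, not a standard.

# Illustrative CFD analysis over an 8-day sample (numbers invented for this example).
daily_counts = {
    "To Do":  [12, 12, 13, 11, 12, 12, 13, 12],   # stable band
    "Active": [10, 12, 14, 16, 18, 20, 22, 25],   # widening band -> bottleneck
    "Review": [5, 5, 6, 5, 5, 6, 5, 6],           # stable band
}
done_cumulative = [0, 3, 6, 9, 12, 15, 18, 21]     # bottom band: total items completed

def widening_bands(counts: dict[str, list[int]], threshold: float = 1.5) -> list[str]:
    """States whose latest count grew past `threshold` times their starting count."""
    return [state for state, c in counts.items() if c[-1] >= threshold * c[0]]

days = len(done_cumulative) - 1
throughput = (done_cumulative[-1] - done_cumulative[0]) / days          # items per day
avg_wip = sum(sum(day) for day in zip(*daily_counts.values())) / len(done_cumulative)
lead_time_estimate = avg_wip / throughput                               # Little's Law

print(widening_bands(daily_counts))                        # ['Active']
print(round(throughput, 1), round(lead_time_estimate, 1))  # 3.0 items/day, ~11.5 days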

📊 Cumulative Flow Diagram Example:

graph TD
    subgraph "CFD Visualization (30 Days)"
        A[Day 1] --> B[Day 15] --> C[Day 30]
        
        D["✅ Done<br/>(Growing Steadily)"] 
        E["👀 Review<br/>(Stable)"]
        F["⚙️ Active<br/>(Bottleneck - Growing)"]
        G["📋 To Do<br/>(Stable)"]
        
        D -.Band 1: Green, Bottom.-> D
        E -.Band 2: Purple.-> E
        F -.Band 3: Blue - WIDENING.-> F
        G -.Band 4: Orange, Top.-> G
    end
    
    H{Analysis} --> I[Bottleneck in Active<br/>Too many items in development]
    H --> J[Review capacity adequate<br/>Band stable]
    H --> K[Done rate steady<br/>Consistent throughput]
    
    L[Actions] --> M[Add pair programming<br/>Reduce WIP limits]
    L --> N[Break down large stories<br/>Improve flow]
    
    style D fill:#c8e6c9
    style E fill:#f3e5f5
    style F fill:#e1f5fe
    style G fill:#fff3e0
    style H fill:#ffebee
    style I fill:#ffe0b2
    style J fill:#e0f2f1
    style K fill:#e8eaf6

See: diagrams/02_domain1_cfd_example.mmd

Diagram Explanation:
The Cumulative Flow Diagram shows work distribution over a 30-day period from Day 1 to Day 30 with four workflow states stacked vertically. At the bottom, the Done band (green) shows steadily growing completion - items continuously move to Done, indicating healthy delivery. The slope of this band represents throughput rate.

Above Done, the Review band (purple) remains relatively stable in width throughout the period. This stable band indicates that code review capacity matches the flow - items don't pile up waiting for review. The team has adequate reviewers, or review is efficiently automated, preventing this from becoming a bottleneck.

The Active band (blue) is the problem area - notice it's WIDENING from Day 1 to Day 30. This expanding band shows items accumulating in active development. On Day 1, maybe 10 items were in Active; by Day 30, it's grown to 25 items. This is a bottleneck: work enters Active faster than it exits to Review. Possible causes: stories too large, developers context-switching, insufficient pair programming, or too high WIP limits.

At the top, the To Do band (orange) remains stable, indicating backlog is controlled. New work enters at roughly the same rate as work moves to Active, preventing backlog explosion. If this band were growing, it would signal prioritization problems or excessive commitments.

The Analysis section (red) identifies: (1) Bottleneck in Active due to widening band - too many concurrent items slow everything, (2) Review capacity is adequate since that band is stable, (3) Done rate is steady showing consistent team throughput despite the Active bottleneck.

Actions to improve: (1) Implement pair programming to increase Active capacity and knowledge sharing, (2) Reduce WIP limits to prevent too many concurrent items in Active - maybe limit to 1 item per developer instead of 2-3, (3) Break down large stories that sit in Active for weeks into smaller deliverable chunks that flow faster.

Detailed Example 1: Identifying Code Review Bottleneck
Your team's CFD for the past 60 days shows a concerning pattern. The "In Review" band starts narrow (5 items) in Week 1 but progressively widens to 25 items by Week 8. Meanwhile, the "Done" band's slope (delivery rate) flattens from 10 items/week to 4 items/week. The team lead examines the data: 25 PRs waiting for review, but only 3 team members designated as reviewers.

Root Cause: Reviewer capacity bottleneck. Only 3 of 10 developers review code, creating a queue. Solution: The team implements "reviewer rotation" - every developer reviews at least 2 PRs per week, distributing the load. They also add automated code quality gates (linting, security scanning) to catch issues before human review. After 3 weeks, the CFD shows "In Review" band narrowing back to 6-8 items, and "Done" slope returning to 9-10 items/week. The visual CFD made the invisible bottleneck obvious and measurable.

Detailed Example 2: Detecting Unsustainable Work Input Rate
An e-commerce team's CFD reveals troubling divergence. The top edge (New + Active + Review) rises at a steep slope of +15 items/week, while the bottom edge (Done) rises at only +8 items/week. Over 8 weeks, this 7-item/week gap accumulates to 56 extra items in the system. Total WIP grows from 40 items (Week 1) to 96 items (Week 8). Lead time increases from 12 days to 35 days because items wait longer in each state.

Root Cause: Product Owner adding work faster than team capacity. Analysis: Arrival rate (15/week) exceeds departure rate (8/week) by 87%. This is mathematically unsustainable - the backlog will grow infinitely. Solution: Product Owner implements strict WIP limit of 50 total items. When backlog reaches 50, no new items added until items complete. This forces prioritization: only truly important work enters. Within 4 weeks, CFD shows parallel top and bottom edges (balanced arrival/departure), total WIP stabilizes at 45 items, and lead time drops back to 14 days. The CFD's diverging bands visually proved the system was overloaded.

Detailed Example 3: Seasonal Capacity Variation
A mobile app team's CFD shows unusual pattern: every 3-4 weeks, the "Active" band suddenly narrows and the "Done" band's slope steepens sharply for 3-5 days, then returns to normal. Investigating, the team discovers this correlates with their biweekly "hackathon days" where developers focus solely on finishing in-progress work without starting new items. During these days, WIP drops from 30 to 18, and completion rate jumps from 3/day to 8/day.

Insight: Multitasking and frequent context switching during normal weeks significantly reduce throughput. When developers focus (hackathon days), they're 2.5x more productive. Solution: Team adopts WIP limits permanently - max 1 item per developer - mimicking hackathon focus daily. New CFD shows consistently narrow "Active" band and steeper "Done" slope. Average lead time drops from 18 to 9 days. The CFD's pattern revealed that their normal "busy" state was actually less productive than their focused "hackathon" state.

Must Know (Critical Facts):

  • Widening bands = bottleneck - if a colored band expands over time, work is accumulating in that state faster than it's exiting
  • Parallel top and bottom edges = sustainable flow - arrival rate matches departure rate; diverging edges = unsustainable, backlog will grow infinitely
  • Horizontal width between bands = average time in state - wide gap between "Active" and "Review" means long cycle time in Active state
  • Steep "Done" slope = high throughput - the steeper the bottom edge rises, the faster team completes work
  • CFD shows trends, not just snapshots - single day doesn't matter; patterns over weeks/months reveal true process health

When to use (Comprehensive):

  • ✅ Use when: Identifying process bottlenecks visually - widening bands immediately show which states accumulate work
  • ✅ Use when: Monitoring WIP (Work in Progress) trends - total vertical height shows if WIP is controlled or growing
  • ✅ Use when: Assessing flow stability - smooth parallel bands = predictable flow; erratic bands = unpredictable process
  • ✅ Use when: Validating process changes - implement WIP limits or add capacity, then watch CFD to measure impact
  • ✅ Use when: Executive reporting on team health - visual, easy to understand for non-technical stakeholders
  • ❌ Don't use when: Need individual work item details - CFD shows aggregates; drill into boards/queries for specific items
  • ❌ Don't use when: Comparing different work item types - mixing bugs, stories, tasks in one CFD obscures patterns; use separate CFDs per type

Limitations & Constraints:

  • Requires consistent workflow states: Changing column names or adding states mid-period breaks CFD continuity - plan state changes carefully
  • Work item type mixing obscures insights: CFD combining bugs (fast) and features (slow) shows muddled patterns - use filtered CFDs per type
  • Doesn't show quality or value: High throughput (steep Done slope) with high defect rate or low-value features is bad; combine CFD with quality metrics
  • Arrival spikes look like bottlenecks: Sudden batch of 50 new items widens top band, mimicking bottleneck; differentiate planned spikes from flow problems
  • Historical accuracy: CFD relies on state history; if items were batch-updated retroactively, CFD shows false patterns

💡 Tips for Understanding:

  • Think of CFD as a "time-lapse of water flowing through pipes" - wide sections in a pipe = clogs/bottlenecks where water accumulates
  • Top edge rising = work arriving; bottom edge rising = work departing - if top rises faster, you're drowning in work
  • Horizontal distance between same-color band edges = average duration in that state - measuring left-to-right gives time in state

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Thinking a widening top band (To Do/New) is always bad
    • Why it's wrong: Controlled backlog growth before a major release or planned sprint is intentional; unsustainable growth is the problem
    • Correct understanding: Check if top edge slope exceeds bottom edge slope over weeks; parallel = okay, diverging = problem
  • Mistake 2: Trying to eliminate all variation and make bands perfectly flat
    • Why it's wrong: Real work has natural variation; perfectly flat bands are unrealistic and not necessary for healthy flow
    • Correct understanding: Aim for generally parallel bands with controlled variation; minor fluctuations are normal and healthy
  • Mistake 3: Using CFD for individual developer performance evaluation
    • Why it's wrong: CFD shows team-level flow, not individual contribution; penalizing individuals for team bottlenecks is unfair
    • Correct understanding: CFD diagnoses system-level process issues; use it for process improvement, not blame assignment

🔗 Connections to Other Topics:

  • Relates to Little's Law (WIP = Throughput × Lead Time) because: CFD visually represents this formula - vertical height (WIP), bottom slope (Throughput), horizontal distance (Lead Time)
  • Builds on Kanban Board Configuration by: CFD bands map directly to board columns; optimizing CFD requires adjusting board WIP limits and column definitions
  • Often used with Lead/Cycle Time widgets to: CFD shows aggregate flow; lead/cycle time widgets show distribution and outliers for same period

Section 3: Collaboration and Documentation

Introduction

The problem: Tribal knowledge trapped in developers' heads, outdated documentation in separate tools, and poor communication between distributed teams slow down onboarding and decision-making.
The solution: Integrated documentation tools (wikis, Markdown, Mermaid diagrams) and communication integrations (webhooks, Teams) keep knowledge accessible where code lives.
Why it's tested: Process documentation and team collaboration fall under Domain 1 (Design and Implement Processes and Communications, 12.5% of the exam) - critical for DevOps culture and efficiency.

Core Concepts

Project Documentation with Wikis and Markdown

What it is: Built-in wiki systems in both Azure DevOps and GitHub that use Markdown formatting to create, version, and maintain project documentation directly alongside code repositories.

Why it exists: Teams need documentation to live close to code with the same version control, branching, and review processes. Separate wikis (Confluence, SharePoint) become outdated because updating them is a separate workflow. Integrated wikis get updated in the same PR that changes code.

Real-world analogy: Like having your car's owner manual stored in the glove compartment instead of on a shelf at home. When you need to check tire pressure or change oil, the instructions are right there in the car, always the correct version for your specific model year.

How it works (Detailed step-by-step):

  1. Wiki creation: Azure DevOps creates wiki from repository folder (typically /docs) or as separate wiki, GitHub uses repository root or /docs folder

    • WHY: Keeps documentation versioned with code; when code changes, docs can update in same commit/PR
    • WHAT HAPPENS: Wiki auto-renders Markdown files as HTML with navigation sidebar
    • EXAMPLE: Create /docs/architecture in the repo and it appears as an "Architecture" page in the wiki
  2. Markdown formatting: Write docs in Markdown with headers, lists, code blocks, tables, links, images

    • WHY: Plain text format, version-controllable, renders beautifully, works everywhere (GitHub, Azure DevOps, VS Code)
    • WHAT HAPPENS: # Header becomes <h1>, **bold** becomes bold, code blocks get syntax highlighting
    • SYNTAX: ## Section, - bullet, ```python for code blocks, [link](url), ![image](path)
  3. Wiki structure: Organize pages hierarchically with table of contents auto-generated from headers or folder structure

    • WHY: Large projects need organized docs (getting started, architecture, API reference, deployment)
    • WHAT HAPPENS: Folder structure becomes navigation tree; Azure DevOps generates sidebar from Markdown headers
    • EXAMPLE: /docs/getting-started, /docs/api/authentication, /docs/deployment/azure
  4. Versioning and branches: Wiki content versions with code; different branches can have different wiki versions

    • WHY: Documentation for v1.0 differs from v2.0; each release branch has accurate docs
    • WHAT HAPPENS: Checking out release/1.0 branch shows v1.0 docs; main branch shows latest docs
    • BENEFIT: Users on old versions see docs matching their version, not latest-only docs
  5. Collaborative editing: Wiki edits go through pull request review like code changes

    • WHY: Prevents incorrect documentation from being published; maintains quality standards
    • WHAT HAPPENS: Edit wiki file → commit to branch → create PR → review → merge to main → wiki updates
    • APPROVAL: Technical writers or subject matter experts review docs before merging

📊 Wiki Documentation Workflow Diagram:

sequenceDiagram
    participant Dev as Developer
    participant Branch as Feature Branch
    participant Docs as Wiki/Docs Folder
    participant PR as Pull Request
    participant Review as Reviewer
    participant Main as Main Branch
    participant Wiki as Published Wiki
    
    Dev->>Branch: Create feature branch
    Dev->>Branch: Implement code changes
    Dev->>Docs: Update relevant .md docs
    Note over Docs: /docs/api/new-endpoint<br/>/docs/deployment/config
    
    Dev->>PR: Create Pull Request
    PR->>Review: Request review (code + docs)
    Review->>Review: Review code correctness
    Review->>Review: Review docs accuracy
    
    alt Docs need updates
        Review->>Dev: Request doc changes
        Dev->>Branch: Update documentation
        Dev->>PR: Push updated docs
    end
    
    Review->>PR: Approve PR
    PR->>Main: Merge to main branch
    Main->>Wiki: Auto-publish updated wiki
    
    Note over Wiki: Wiki now reflects<br/>latest code + docs
    
    style Dev fill:#e3f2fd
    style Branch fill:#f3e5f5
    style Docs fill:#fff3e0
    style PR fill:#e1f5fe
    style Review fill:#f3e5f5
    style Main fill:#c8e6c9
    style Wiki fill:#e8f5e9

See: diagrams/02_domain1_wiki_documentation_workflow.mmd

Diagram Explanation:
This sequence diagram shows the integrated workflow for maintaining documentation alongside code changes. A Developer (blue) starts by creating a Feature Branch (purple) to implement a new API endpoint. As they write code, they recognize the need to document the new endpoint and configuration changes.

The developer updates relevant Markdown files in the /docs folder (orange): they create /docs/api/new-endpoint explaining the new API with request/response examples, and update /docs/deployment/config to document the new configuration parameters required. These documentation changes are committed to the same feature branch as the code - keeping code and docs in sync.

When the developer creates a Pull Request (light blue), both code and documentation are included. The Reviewer (purple) performs a comprehensive review: they check that the code works correctly AND that the documentation accurately describes the new functionality. This dual review ensures docs don't lag behind code changes.

If documentation needs updates (alt flow), the reviewer requests changes: "Add error handling examples to API docs" or "Clarify the config parameter defaults." The developer updates documentation in the branch and pushes to the PR. This review loop continues until both code and docs meet quality standards.

After approval, the PR merges to Main Branch (green), and the Published Wiki (light green) auto-updates. Now when teammates or users access the wiki, they see documentation that exactly matches the current codebase. If someone checks out the feature branch before merge, they see docs for that branch's code version.

The key insight: Documentation updates flow through the same quality gates (branching, PR, review, merge) as code changes. This prevents the common problem where code gets reviewed rigorously but docs are added as an afterthought and become outdated. By treating docs as code, teams maintain accuracy and reduce knowledge silos.

Detailed Example 1: API Documentation with Code Generation
Your team builds a REST API in Azure. You set up the wiki in the /docs folder of your repository. When a developer adds a new endpoint POST /api/orders, they:

  1. Write the code: Implement OrdersController with validation, business logic, data access
  2. Write OpenAPI spec: Add Swagger/OpenAPI annotations describing request/response schemas
  3. Generate API docs: Run a tool that converts the OpenAPI spec to Markdown and saves it to /docs/api/orders (a sketch of such a tool appears after this example)
  4. Add usage examples: Manually write sample requests with curl, C#, Python in the same Markdown file
  5. Create PR: Include code, OpenAPI annotations, and generated + manually enhanced docs

The PR reviewer checks: (1) Code quality, (2) OpenAPI spec accuracy, (3) Generated docs completeness, (4) Example code works correctly. After merge, the wiki automatically displays the new API endpoint documentation. Three months later, when the endpoint changes, the developer updates code, OpenAPI spec, regenerates Markdown, updates examples, and creates PR - docs stay in sync through the same workflow.

Benefit: Documentation isn't a separate task done later; it's part of definition of done for every feature. Teams following this pattern have 90%+ accurate docs because updating docs is as natural as updating code.
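
Step 3's generation step (flagged above) could be as small as the following Python sketch, which reads a hypothetical openapi.json and writes a Markdown page to docs/api/orders.md. Real teams typically use an off-the-shelf generator; this only shows the shape of the transformation.

# Illustrative only: turn a minimal OpenAPI document into a Markdown API page.
import json
from pathlib import Path

def openapi_to_markdown(spec_path: str, out_path: str) -> None:
    spec = json.loads(Path(spec_path).read_text())
    lines = [f"# {spec['info']['title']} API", ""]
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            lines.append(f"## {method.upper()} {path}")
            lines.append("")
            lines.append(op.get("summary", "_No summary provided._"))
            lines.append("")
            for code, resp in op.get("responses", {}).items():
                lines.append(f"- **{code}**: {resp.get('description', '')}")
            lines.append("")
    Path(out_path).write_text("\n".join(lines))

# Example: generate the orders page from the spec checked in next to the code.
openapi_to_markdown("openapi.json", "docs/api/orders.md")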

Detailed Example 2: Architectural Decision Records (ADRs) in Wiki
An enterprise team adopts the practice of documenting major architectural decisions as ADRs in the wiki. When the architect proposes using a microservices pattern instead of a monolith, they:

  1. Create ADR file: /docs/adrs/003-microservices-architecture (numbered sequentially)

  2. Follow ADR template:

    • Status: Proposed / Accepted / Deprecated
    • Context: Why this decision is needed (scalability problems with monolith)
    • Decision: What was decided (migrate to microservices using Kubernetes)
    • Consequences: Positive (better scalability, team autonomy) and Negative (complexity, operational overhead)
    • Alternatives considered: Vertical scaling monolith, serverless functions (and why rejected)
  3. Create PR for ADR: Team reviews the architectural decision like they review code

  4. Discussion in PR comments: Team debates trade-offs, suggests alternatives, asks questions

  5. Approval and merge: When consensus reached, ADR merges and becomes official architectural guideline

Six months later, a new developer joins and wonders "Why microservices?" They browse /docs/adrs/ in wiki and find ADR-003 explaining the exact reasoning, context, and trade-offs. When a different team proposes serverless, they reference ADR-003 showing it was already considered and why microservices was chosen instead. The wiki becomes institutional memory that survives team turnover.

Detailed Example 3: Release Notes Auto-Generation from Git History
Your team wants release notes generated from commits. You set up automation:

  1. Conventional commits: Developers write commits following convention: feat:, fix:, docs:, refactor:

    • feat: Add shopping cart persistence to database
    • fix: Resolve payment gateway timeout error
    • docs: Update API authentication examples
  2. Git history parsing: The CI pipeline runs a script that reads commits between the last release tag and the current commit (a sketch of such a script appears after this example)

  3. Categorize changes: Script groups commits by type (Features, Bug Fixes, Documentation, etc.)

  4. Generate Markdown: Script creates /docs/releases/v2.5.0:

    # Release v2.5.0 (2024-10-15)
    
    ## Features
    - Add shopping cart persistence to database (#142)
    - Implement guest checkout flow (#155)
    
    ## Bug Fixes
    - Resolve payment gateway timeout error (#148)
    - Fix mobile UI rendering on iOS (#151)
    
    ## Documentation
    - Update API authentication examples (#143)
    
  5. Commit to release branch: Automation commits generated release notes to release branch

  6. Wiki displays release notes: /docs/releases/ folder appears in wiki with all historical releases

Benefit: Release notes are always complete and accurate because they're generated from actual commits, not manually written (and forgotten) after the fact. Product managers, support teams, and customers can see exactly what changed in each release by browsing the wiki. The automation ensures no release ships without documentation.
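
A hedged sketch of the script described in steps 2-4: it reads git log between the previous tag and HEAD, groups commit subjects by conventional-commit prefix, and writes a Markdown file under docs/releases/. The tag names in the usage comment are placeholders.

# Illustrative release-notes generator: conventional commits -> Markdown.
import subprocess
from collections import defaultdict
from pathlib import Path

SECTIONS = {"feat": "Features", "fix": "Bug Fixes", "docs": "Documentation"}

def generate_release_notes(prev_tag: str, version: str, out_dir: str = "docs/releases") -> Path:
    # Commit subjects since the previous release tag.
    log = subprocess.run(
        ["git", "log", f"{prev_tag}..HEAD", "--pretty=%s"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    grouped = defaultdict(list)
    for subject in log:
        prefix, _, rest = subject.partition(":")
        section = SECTIONS.get(prefix.strip().split("(")[0])  # handles "feat(cart): ..."
        if section:
            grouped[section].append(rest.strip())

    lines = [f"# Release {version}", ""]
    for section in SECTIONS.values():
        if grouped[section]:
            lines += [f"## {section}", *[f"- {msg}" for msg in grouped[section]], ""]

    out = Path(out_dir) / f"{version}.md"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines))
    return out

# Example: generate_release_notes("v2.4.0", "v2.5.0")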

Must Know (Critical Facts):

  • Wiki content is version-controlled with code - different branches can have different wiki versions, ensuring docs match code version
  • Markdown is the standard format - GitHub Flavored Markdown (GFM) is widely supported, plain text, version-controllable, and renders everywhere
  • Wiki updates go through PR review - documentation changes should be reviewed like code to maintain accuracy and quality
  • Azure DevOps supports wiki as code (repository-based) - create wiki from /docs folder or dedicated wiki repository with full Git features
  • Automated documentation generation from code - OpenAPI specs, code comments (JSDoc, XML docs), and git history can auto-generate Markdown documentation

When to use (Comprehensive):

  • ✅ Use repository-based wiki when: Docs must version with code - API docs, config guides, deployment instructions that change with each release
  • ✅ Use wiki for: Getting started guides, architecture docs, ADRs, troubleshooting guides - knowledge that should live close to code
  • ✅ Use Markdown for: All technical documentation - universally supported, works in PRs, wikis, README files, even when repository is cloned offline
  • ✅ Automate doc generation when: Source of truth exists elsewhere - OpenAPI specs, code annotations, Git commits can generate Markdown
  • ✅ Use wiki for: Cross-linking between docs and code - link to specific files/lines in repo from wiki, link to wiki pages from code comments
  • ❌ Don't use wiki for: Long-form prose or books - complex documentation with heavy formatting is better in dedicated docs platforms (DocFX, MkDocs, GitBook)
  • ❌ Don't use Markdown for: Pixel-perfect design documents - mockups, detailed UI specifications need tools like Figma, not Markdown

Limitations & Constraints:

  • Markdown formatting limitations: No complex tables, limited styling, no interactive elements - keep docs simple
  • GitHub wiki limitations: GitHub wikis are separate Git repos, not tied to code branches like Azure DevOps repository wikis
  • Search across wiki: Basic keyword search; advanced semantic search requires external tools
  • Access control granularity: Wiki permissions usually match repository permissions; can't easily make some wiki pages public and others private
  • Large binary files (images/PDFs) bloat repo: Store large assets in Azure Blob/S3, link from Markdown rather than committing to wiki repo

💡 Tips for Understanding:

  • Think of wiki as "living documentation" - it grows and changes with code, never gets stale because it's maintained with same discipline
  • Markdown is "docs as code" - same version control, branching, reviewing workflows apply to both documentation and source code
  • Wiki sidebar = table of contents - organize Markdown files in logical folder structure to create intuitive navigation

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Creating separate wiki outside repository and manually keeping it in sync
    • Why it's wrong: Separate wikis quickly become outdated; developers forget to update them after code changes
    • Correct understanding: Use repository-based wiki or /docs folder so documentation updates happen in the same PR as code changes
  • Mistake 2: Writing documentation after features are complete instead of during development
    • Why it's wrong: Docs become an afterthought, often incomplete or missing; developers forget implementation details by the time they write docs
    • Correct understanding: Treat documentation as part of definition of done; PRs that add features without docs should be rejected
  • Mistake 3: Only subject matter experts can edit wiki
    • Why it's wrong: Creates bottleneck; developers who implement features know them best but can't document them
    • Correct understanding: Any team member can edit wiki via PR; reviewers ensure accuracy and quality before merging

🔗 Connections to Other Topics:

  • Relates to Pull Request Workflows because: Wiki edits flow through same PR → review → merge process as code changes
  • Builds on Branching Strategies by: Different branches have different wiki versions matching their code state
  • Often used with CI/CD Pipelines to: Auto-generate documentation (API specs, release notes) and commit to wiki during builds

Chapter Summary

What We Covered

Work Tracking and Traceability:

  • Azure Boards work item tracking with automated workflows and traceability through AB# references
  • GitHub Projects for lightweight, cross-repository project management with automation
  • Integration patterns between Azure Boards and GitHub for hybrid workflows

DevOps Metrics and Dashboards:

  • Lead Time vs Cycle Time: customer waiting time vs team working time
  • Cumulative Flow Diagrams for visual bottleneck detection and flow analysis
  • Dashboard design principles for different audiences (personal, team, executive)

Documentation and Collaboration:

  • Repository-based wikis with Markdown for versioned documentation
  • Docs-as-code workflow: documentation changes flow through PR review
  • Automated documentation generation from code, specs, and Git history

Critical Takeaways

  1. Traceability is bidirectional: Work items link to commits/PRs, commits reference work items via AB#{ID}, creating complete audit trail from requirement to deployment

  2. Lead Time measures customer impact, Cycle Time measures team efficiency: Large gap between them indicates process problems (backlog, prioritization), not execution problems

  3. Cumulative Flow Diagrams reveal bottlenecks visually: Widening bands show accumulating work in specific states; parallel top/bottom edges indicate sustainable flow

  4. Documentation must version with code: Repository-based wikis ensure docs stay synchronized with code through same branching, PR review, and merge workflows

  5. Metrics drive continuous improvement: Track cycle time, lead time, velocity, and deployment frequency to identify trends and validate process changes

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between Lead Time and Cycle Time and when to use each metric
  • I can read a Cumulative Flow Diagram and identify bottlenecks from widening bands
  • I understand how AB#{ID} syntax creates traceability between Azure Boards work items and code
  • I can describe GitHub Projects automation workflows and when to use organization vs repository projects
  • I know how to set up repository-based wiki and why docs should be reviewed in PRs
  • I can explain what parallel vs diverging CFD edges indicate about system sustainability
  • I understand how to interpret cycle time widget scatter plots and what outliers represent

Practice Questions

Try these from your practice test bundles:

  • Domain 1 Bundle 1: Questions 1-20 (Work tracking and metrics)
  • Expected score: 70%+ to proceed confidently

If you scored below 70%:

  • Review sections: Work item traceability, CFD interpretation, Lead/Cycle time differences
  • Focus on: Understanding metric meanings, not just definitions - practice applying concepts to scenarios

Common Exam Question Patterns

Pattern 1: Metric Selection

  • How to recognize: "Which metric should you use to measure..." or "Your team wants to track..."
  • What they're testing: Understanding when to use lead time (customer SLA) vs cycle time (team efficiency) vs CFD (bottlenecks)
  • How to answer: Match metric to purpose - customer satisfaction = lead time, team performance = cycle time, process health = CFD

Pattern 2: Tool Integration

  • How to recognize: "Connect Azure Boards with GitHub..." or "Synchronize work items..."
  • What they're testing: Knowledge of Azure Boards + GitHub integration, AB#{ID} linking, automation rules
  • How to answer: Choose solutions that maintain bidirectional sync and automated state transitions

Pattern 3: Documentation Strategy

  • How to recognize: "Best way to maintain API documentation..." or "Keep docs synchronized with code..."
  • What they're testing: Understanding repository-based wikis, Markdown, docs-as-code workflow
  • How to answer: Select wiki approaches that version with code and use PR review for doc changes

Quick Reference Card

Key Work Tracking Concepts:

  • Azure Boards: Hierarchical work items (Epic→Feature→User Story→Task), AB#{ID} linking, state-based workflows
  • GitHub Projects: Flat structure with custom fields, cross-repo aggregation at org level, automation workflows
  • Traceability: AB#{ID} in commits/PRs auto-links to work items; "Fixes AB#123" auto-closes on merge

Key Metrics Formulas:

  • Lead Time = Date Created → Date Closed (customer perspective, total delivery time)
  • Cycle Time = First Active → Date Closed (team efficiency, actual work time)
  • Lead Time - Cycle Time = Wait Time in backlog (process inefficiency indicator)

CFD Interpretation:

  • Widening band = Bottleneck in that state (work accumulating)
  • Parallel edges = Sustainable flow (arrival rate = departure rate)
  • Diverging edges = Unsustainable (backlog growing infinitely)
  • Steep bottom slope = High throughput (fast completion rate)

Documentation Best Practices:

  • Repository wiki: Docs version with code, accessible from same repo
  • Markdown standard: GitHub Flavored Markdown, syntax highlighting, tables supported
  • PR review for docs: Documentation changes reviewed like code for accuracy
  • Auto-generation: OpenAPI specs → Markdown API docs, Git commits → Release notes

Decision Points:

  • Choose Azure Boards when: Need hierarchical work items (Epic→Story→Task), advanced reporting, complex queries
  • Choose GitHub Projects when: Need lightweight tracking, cross-repo visibility, GitHub-native workflow
  • Choose Lead Time when: Measuring customer SLA, delivery commitments, end-to-end time
  • Choose Cycle Time when: Measuring team efficiency, comparing sprints, identifying process improvements
  • Choose CFD when: Visualizing bottlenecks, monitoring WIP trends, validating process changes

What's Next

In Chapter 3: Design and Implement a Source Control Strategy, you'll learn:

  • Advanced branching strategies (trunk-based, GitFlow, release flow)
  • Branch policies and protection rules
  • Repository management at scale (Git LFS, Scalar, monorepos vs multirepos)
  • Git recovery operations and sensitive data removal

These source control concepts build on the work tracking and metrics you've learned - every commit will link to work items, every branch will follow team standards, and every merge will update your flow metrics.


Additional Detailed Examples and Scenarios

Example 1: Implementing GitHub Flow with Branch Policies

Scenario: Your team of 10 developers is transitioning from a chaotic branching model to GitHub Flow. You need to implement branch policies to ensure code quality and prevent direct commits to main.

Step-by-Step Implementation:

  1. Configure Branch Protection Rules (GitHub) - via the UI as follows (a REST API alternative is sketched after the "Why This Works" list):

    • Navigate to repository → Settings → Branches → Add rule
    • Branch name pattern: main
    • Enable: Require pull request before merging
    • Required approvals: 2
    • Dismiss stale reviews when new commits are pushed: Yes
    • Require review from Code Owners: Yes (if CODEOWNERS file exists)
    • Enable: Require status checks to pass before merging
    • Required status checks: build, test, security-scan
    • Require branches to be up to date before merging: Yes
    • Enable: Require signed commits (optional, for high-security environments)
    • Enable: Include administrators (even admins must follow rules)
  2. Create CODEOWNERS File (optional but recommended):

# .github/CODEOWNERS
# Global owners (review all changes)
* @team-leads

# Frontend code
/src/frontend/** @frontend-team

# Backend code
/src/backend/** @backend-team

# Infrastructure code
/infrastructure/** @devops-team

# Security-sensitive files
/src/auth/** @security-team
  3. Set Up Status Checks (GitHub Actions workflow):
# .github/workflows/pr-checks.yml
name: PR Checks

on:
  pull_request:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Install dependencies
        run: npm ci
      - name: Build
        run: npm run build
  
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Install dependencies
        run: npm ci
      - name: Run tests
        run: npm test
      - name: Upload coverage
        uses: codecov/codecov-action@v3
  
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          severity: 'CRITICAL,HIGH'
  4. Developer Workflow:
    • Developer creates feature branch: git checkout -b feature/add-login
    • Makes changes and commits: git commit -m "Add login functionality"
    • Pushes branch: git push origin feature/add-login
    • Creates pull request on GitHub
    • Automated checks run (build, test, security-scan)
    • Two team members review and approve
    • Developer merges PR (or uses auto-merge if all checks pass)
    • Branch is automatically deleted after merge
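
💡 Command-line equivalent: the same workflow can be driven with git and the GitHub CLI (gh). This is a minimal sketch - the branch name, commit message, and use of auto-merge are illustrative, and it assumes gh is installed and authenticated:

# Start from an up-to-date main
git checkout main
git pull origin main

# Create the feature branch and commit the change
git checkout -b feature/add-login
git add .
git commit -m "Add login functionality"
git push -u origin feature/add-login

# Open the pull request against main
gh pr create --base main --title "Add login functionality" --body "Adds the login flow"

# Optionally enable auto-merge: the PR merges once the required checks pass
# and two approvals are in place, then the source branch is deleted
gh pr merge --auto --squash --delete-branch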

Why This Works:

  • Quality Gates: Automated checks catch issues before merge
  • Code Review: Two approvals ensure knowledge sharing and quality
  • CODEOWNERS: Right people review relevant changes
  • Up-to-Date Requirement: Prevents merge conflicts
  • Signed Commits: Ensures commit authenticity (optional)

📊 GitHub Flow with Branch Protection Diagram:

graph LR
    A[Developer: Create Feature Branch] --> B[Developer: Make Changes]
    B --> C[Developer: Push Branch]
    C --> D[GitHub: Create Pull Request]
    D --> E[GitHub Actions: Run Checks]
    E --> F{All Checks Pass?}
    F -->|No| G[Developer: Fix Issues]
    G --> B
    F -->|Yes| H[Reviewers: Review Code]
    H --> I{2 Approvals?}
    I -->|No| J[Reviewers: Request Changes]
    J --> B
    I -->|Yes| K[Developer: Merge PR]
    K --> L[GitHub: Delete Branch]
    L --> M[Main Branch Updated]

    style A fill:#e1f5fe
    style M fill:#c8e6c9
    style F fill:#fff3e0
    style I fill:#fff3e0

See: diagrams/02_domain1_github_flow_branch_protection.mmd

Example 2: Configuring Azure Boards and GitHub Integration

Scenario: Your organization uses Azure Boards for work tracking and GitHub for source control. You need to link commits, pull requests, and builds to work items for full traceability.

Step-by-Step Implementation:

  1. Install Azure Boards App in GitHub:

    • In GitHub Marketplace, find and install the Azure Boards app
    • Choose your GitHub organization and select which repositories the app can access
  2. Connect Azure Boards to GitHub:

    • Azure DevOps → Project Settings → GitHub connections
    • Click "Connect your GitHub account"
    • Authorize Azure DevOps
    • Select repositories to connect
  3. Link Commits to Work Items (see the command sketch after this list):

    • In commit message, reference work item: git commit -m "Add login feature AB#123"
    • Format: AB#{work-item-id} or Fixes AB#{work-item-id}
    • Commit appears in work item's Development section
  4. Link Pull Requests to Work Items:

    • In PR description, reference work item: Fixes AB#123
    • PR appears in work item's Development section
    • Work item status can auto-update when PR is merged
  5. Configure Auto-Linking Rules:

    • Azure Boards → Project Settings → GitHub connections → Select connection
    • Enable: "Automatically create links for mentions in commit messages"
    • Enable: "Automatically transition work items to 'Resolved' when PR is merged"
  6. View Traceability:

    • Open work item in Azure Boards
    • Development section shows: Commits, Pull Requests, Builds
    • Click any link to view details in GitHub or Azure Pipelines
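
💡 Command-line sketch of the linking steps above (AB#123 is an example work item ID; the sketch assumes the GitHub CLI is installed and authenticated):

# The AB#123 token in the commit message links the commit to the work item
git checkout -b feature/add-login
git commit -am "Add login feature AB#123"
git push -u origin feature/add-login

# "Fixes AB#123" in the PR description links the PR and lets the work item
# auto-transition when the PR is merged (if that rule is enabled)
gh pr create --base main --title "Add login feature" --body "Implements login. Fixes AB#123"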

Benefits:

  • Full Traceability: See all code changes related to a work item
  • Automated Updates: Work items auto-update when PRs merge
  • Cross-Platform: Works with GitHub and Azure DevOps together
  • Audit Trail: Complete history of work item → code → deployment

📊 Azure Boards and GitHub Integration Diagram:

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub
    participant AB as Azure Boards
    participant AP as Azure Pipelines

    Dev->>AB: 1. Create Work Item (AB#123)
    AB->>Dev: 2. Work Item Created
    Dev->>GH: 3. Create Branch (feature/AB#123)
    Dev->>GH: 4. Commit with "AB#123"
    GH->>AB: 5. Link Commit to Work Item
    AB->>AB: 6. Update Development Section
    Dev->>GH: 7. Create PR with "Fixes AB#123"
    GH->>AB: 8. Link PR to Work Item
    GH->>AP: 9. Trigger Build
    AP->>AB: 10. Link Build to Work Item
    Dev->>GH: 11. Merge PR
    GH->>AB: 12. Auto-Transition Work Item to Resolved
    AP->>AB: 13. Link Deployment to Work Item

    Note over Dev,AB: Full traceability:<br/>Work Item → Code → Build → Deployment

See: diagrams/02_domain1_azure_boards_github_integration.mmd

Example 3: Creating Effective Dashboards with Key Metrics

Scenario: Your DevOps team needs a dashboard to monitor pipeline health, deployment frequency, and lead time. The dashboard should be visible to the entire team and update in real-time.

Step-by-Step Implementation:

  1. Create Azure DevOps Dashboard:

    • Azure DevOps → Overview → Dashboards → New Dashboard
    • Name: "DevOps Metrics Dashboard"
    • Visibility: Team (or Public for organization-wide visibility)
  2. Add Widgets:

Widget 1: Build Success Rate:

  • Add widget: Chart for Build History
  • Configuration:
    • Build pipeline: Select your main pipeline
    • Period: Last 30 days
    • Chart type: Stacked area
    • Metrics: Success rate, Failure rate
  • Shows: Trend of build success over time

Widget 2: Deployment Frequency:

  • Add widget: Query Results
  • Configuration:
    • Query: Custom query counting deployments per week
    • Visualization: Bar chart
  • Shows: How often you deploy to production

Widget 3: Lead Time:

  • Add widget: Lead Time
  • Configuration:
    • Work item type: User Story
    • Time period: Last 90 days
  • Shows: Average time from work item creation to completion

Widget 4: Cycle Time:

  • Add widget: Cycle Time
  • Configuration:
    • Work item type: User Story
    • Time period: Last 90 days
  • Shows: Average time from work start to completion

Widget 5: Cumulative Flow Diagram:

  • Add widget: Cumulative Flow Diagram
  • Configuration:
    • Work item types: User Story, Bug
    • Time period: Last 60 days
    • States: New, Active, Resolved, Closed
  • Shows: Work in progress and bottlenecks

Widget 6: Test Results Trend:

  • Add widget: Test Results Trend
  • Configuration:
    • Build pipeline: Select your main pipeline
    • Period: Last 30 days
  • Shows: Test pass rate over time
  3. Configure Auto-Refresh:

    • Dashboard settings → Auto-refresh: Every 5 minutes
    • Ensures dashboard always shows current data
  4. Share Dashboard:

    • Copy dashboard URL
    • Share with team via email or Teams
    • Display on TV in team area (optional)

Dashboard Layout Example:

+------------------+------------------+------------------+
| Build Success    | Deployment       | Lead Time        |
| Rate (30 days)   | Frequency        | (90 days)        |
|                  | (per week)       |                  |
+------------------+------------------+------------------+
| Cycle Time       | Test Results     | Active Bugs      |
| (90 days)        | Trend (30 days)  | (by priority)    |
+------------------+------------------+------------------+
| Cumulative Flow Diagram (60 days)                     |
| Shows: Work in progress, bottlenecks, flow            |
+-------------------------------------------------------+

Key Metrics to Track:

  • Build Success Rate: Should be >90% (if lower, investigate flaky tests or infrastructure issues)
  • Deployment Frequency: Higher is better (daily deployments indicate mature CI/CD)
  • Lead Time: Lower is better (measures end-to-end delivery speed)
  • Cycle Time: Lower is better (measures team efficiency)
  • Test Pass Rate: Should be >95% (if lower, improve test quality)

📊 DevOps Metrics Dashboard Diagram:

graph TB
    subgraph "DevOps Metrics Dashboard"
        subgraph "Row 1: Velocity Metrics"
            M1[Build Success Rate<br/>Target: >90%<br/>Current: 94%]
            M2[Deployment Frequency<br/>Target: Daily<br/>Current: 3x/week]
            M3[Lead Time<br/>Target: <7 days<br/>Current: 5.2 days]
        end
        
        subgraph "Row 2: Quality Metrics"
            M4[Cycle Time<br/>Target: <3 days<br/>Current: 2.8 days]
            M5[Test Pass Rate<br/>Target: >95%<br/>Current: 97%]
            M6[Active Bugs<br/>Critical: 2<br/>High: 5<br/>Medium: 12]
        end
        
        subgraph "Row 3: Flow Visualization"
            M7[Cumulative Flow Diagram<br/>Shows work in progress<br/>Identifies bottlenecks]
        end
    end

    style M1 fill:#c8e6c9
    style M2 fill:#fff3e0
    style M3 fill:#c8e6c9
    style M4 fill:#c8e6c9
    style M5 fill:#c8e6c9
    style M6 fill:#ffebee

See: diagrams/02_domain1_devops_metrics_dashboard.mmd


Chapter 3: Design and Implement a Source Control Strategy (12.5% of exam)

Chapter Overview

What you'll learn:

  • Branching strategies: trunk-based development, GitFlow, GitHub Flow, release flow
  • Branch policies and protection rules for quality gates
  • Pull request workflows with required reviewers and status checks
  • Repository management at scale: Git LFS, Scalar, monorepos vs multirepos
  • Git recovery operations and sensitive data removal techniques

Time to complete: 6-8 hours
Prerequisites: Chapter 2 (Fundamentals and DevOps Principles)


Section 1: Branching Strategies

Introduction

The problem: Without a defined branching strategy, teams create chaos - conflicting changes, broken builds, unclear release process, difficulty tracking what's in production.
The solution: Structured branching strategies provide clear rules for when to branch, how to merge, and how to release, enabling team collaboration at scale.
Why it's tested: 12.5% of AZ-400 exam focuses on source control strategy - branch management is foundation of DevOps collaboration.

Core Concepts

Trunk-Based Development

What it is: A branching strategy where all developers commit directly to a single main branch (trunk) or use very short-lived feature branches (< 24 hours) that merge quickly to main.

Why it exists: Long-lived feature branches create merge conflicts, delay integration, and hide problems. Trunk-based development forces continuous integration - developers integrate code daily, conflicts are small and manageable, feedback is immediate.

Real-world analogy: Like a highway with one main lane where everyone drives. Instead of building separate roads that later need to connect (merge conflicts), everyone stays on the main highway. If you need to make a quick stop, you pull into a rest area briefly (short branch) then merge back immediately.

How it works (Detailed step-by-step):

  1. Single main branch: Team maintains one "trunk" (usually main or master) that is always deployable

    • WHY: Eliminates branch management overhead, makes CI/CD straightforward
    • WHAT HAPPENS: All production releases come from trunk; trunk stays in releasable state
    • DISCIPLINE: Requires robust testing, feature flags, and CI automation
  2. Small, frequent commits: Developers commit working code to trunk multiple times per day

    • WHY: Reduces integration conflicts; smaller changes are easier to review and revert
    • WHAT HAPPENS: Each commit triggers automated tests; failing commits must be fixed immediately or reverted
    • PRACTICE: "Commit early, commit often" - even if feature isn't complete, commit working incremental changes
  3. Short-lived feature branches (optional): If using branches, they live less than 24 hours and merge quickly

    • WHY: Prevents branches from diverging; conflicts remain small and manageable
    • WHAT HAPPENS: Developer creates branch in morning, commits several times, creates PR, merges same day
    • RULE: If branch lives >24 hours, it's not trunk-based development
  4. Feature flags for incomplete work: Use feature toggles to hide incomplete features in production while code is in trunk

    • WHY: Allows committing incomplete code without exposing it to users
    • WHAT HAPPENS: if (featureFlag.enabled("newCheckout")) { /* new code */ } else { /* old code */ }
    • DEPLOYMENT: Feature goes live by flipping flag, not deploying new code
  5. Automated quality gates: Comprehensive CI pipeline runs on every commit - tests, linting, security scans

    • WHY: Trunk must always be deployable; automation prevents broken code from entering
    • WHAT HAPPENS: Commit → CI runs tests → If pass, merge; If fail, auto-revert or block merge
    • SPEED: Fast test suite (<10 min) enables multiple daily integrations

📊 Trunk-Based Development Flow Diagram:

graph TD
    A[Developer Workstation] -->|1. Pull latest trunk| B[Local Main Branch]
    B -->|2. Create short branch<br/>feature/quick-fix| C[Feature Branch<br/>&lt;24 hours]
    C -->|3. Multiple commits<br/>2-3 hours work| C
    C -->|4. Push branch| D[Remote Repository]
    D -->|5. Create PR| E[Pull Request]
    E -->|6. Automated CI| F{CI Pipeline}
    F -->|Tests pass| G[Code Review]
    F -->|Tests fail| H[Fix or Revert]
    H -->|Fix commits| C
    G -->|Approved| I[Merge to Trunk]
    I -->|7. Deploy| J[Production<br/>via feature flags]
    
    K[Feature Flags] -.Control visibility.-> J
    
    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#fff3e0
    style D fill:#e1f5fe
    style E fill:#f3e5f5
    style F fill:#ffe0b2
    style G fill:#f3e5f5
    style I fill:#c8e6c9
    style J fill:#e8f5e9
    style K fill:#ffebee

See: diagrams/03_domain2_trunk_based_development.mmd

Diagram Explanation:
This diagram illustrates the trunk-based development workflow for a single feature. The process starts at the Developer Workstation (blue) where a developer pulls the latest code from the Local Main Branch (purple) - this is the trunk, always up-to-date with remote main.

The developer creates a Short-Lived Feature Branch (orange) named feature/quick-fix with the discipline that it must merge within 24 hours. Over 2-3 hours, they make multiple commits to this branch - each commit represents incremental progress on the fix. This is shorter than traditional feature branches that might live for days or weeks.

After pushing the branch to the Remote Repository (light blue), the developer creates a Pull Request (purple) for code review. The CI Pipeline (orange) immediately triggers, running all automated tests, linting, and security scans. If tests fail, the developer must Fix or Revert - either push additional commits to fix the issue or abandon the branch entirely. No broken code enters trunk.

If tests pass, the PR enters Code Review (purple) where teammates review the changes. After approval, the code Merges to Trunk (green), and the automated deployment pipeline pushes to Production (light green). However, if the feature isn't complete, Feature Flags (red) control its visibility - the code deploys but remains hidden behind a toggle until ready.

The key principles: (1) Branches live <24 hours, (2) Trunk is always deployable, (3) CI blocks broken code, (4) Feature flags decouple deployment from release. This enables continuous integration while maintaining production stability.

Detailed Example 1: E-commerce Checkout Refactor
Your team needs to refactor the checkout flow for better performance. Traditional branching would create a long-lived feature/checkout-refactor branch, work for 2 weeks, then merge - causing massive conflicts. With trunk-based development, you:

Day 1: Create feature flag checkout_v2_enabled = false. Commit to trunk. Deploy to production (flag is off, users see old checkout).

Day 2: Create short branch, implement new payment validation, commit behind flag if (checkout_v2_enabled), create PR, merge same day. Code in production but not active.

Day 3-5: Repeat - each day, small branch for cart calculation logic, order submission, confirmation page. Each merges to trunk daily. All code deployed but hidden.

Day 6: All refactoring complete. In production, flip checkout_v2_enabled = true for 10% of users (canary). Monitor metrics.

Day 7: No issues detected. Flip to 100%. Refactor complete without a single merge conflict because changes integrated daily.

Contrast: Traditional feature branch would have 2 weeks of code divergence, 100+ file conflicts on merge, 2-3 days resolving conflicts, high risk of breaking production. Trunk-based had zero conflicts, continuous validation, and controlled rollout.

Detailed Example 2: Hotfix for Production Bug
Production bug discovered: payment processor returns error for amounts >$1000. With trunk-based development:

  1. 10:00 AM: Bug reported. Developer pulls latest trunk (already up-to-date, no drift).
  2. 10:15 AM: Creates hotfix/payment-limit branch, fixes validation logic in 30 minutes.
  3. 10:45 AM: Pushes branch, creates PR. CI runs full test suite in 8 minutes.
  4. 10:53 AM: Tests pass. Senior dev reviews code in 5 minutes, approves.
  5. 11:00 AM: Merges to trunk. Automated deployment triggers.
  6. 11:10 AM: Fix deployed to production. Bug resolved in 70 minutes.

Why this was fast: (1) Trunk was current - no time wasted syncing branches, (2) CI was fast - optimized for trunk-based workflow, (3) No complex merge process - direct to trunk, (4) Automated deployment - no manual release process.

Contrast: In GitFlow, this same hotfix branches from main but must merge back to both main and develop, pass staging validation, and follow the formal release process before deployment - adding hours or days of delay.

Must Know (Critical Facts):

  • Trunk-based development requires feature flags to hide incomplete work in production - deploy code continuously even if feature isn't user-ready
  • Branches must live <24 hours for true trunk-based development - longer branches are feature branching, not trunk-based
  • Trunk must always be deployable - every commit on trunk should be production-ready, achievable through rigorous CI and rollback capabilities
  • Works best with automated testing and CI - manual testing too slow for multiple daily integrations; automation is mandatory
  • Used by high-performing teams - Google, Facebook, Netflix use trunk-based for rapid delivery with stability

When to use (Comprehensive):

  • ✅ Use when: Team has strong CI/CD and automated testing - trunk-based demands fast, comprehensive test automation to maintain trunk quality
  • ✅ Use when: Need rapid delivery - multiple daily deployments, quick feedback loops, minimal integration delays
  • ✅ Use when: Team practices continuous integration culture - developers comfortable with frequent commits, small changes, feature flags
  • ✅ Use when: Monolithic or microservices architectures - works well for both because each service can have its own trunk
  • ✅ Use when: Want to minimize merge conflicts - daily integration prevents long-lived divergence that causes conflicts
  • ❌ Don't use when: Team lacks automated testing - manual testing can't keep pace with continuous integration; trunk will break frequently
  • ❌ Don't use when: Regulatory requirements mandate separate release branches - some industries require auditable release branches with explicit promotion
  • ❌ Don't use when: Team is new to DevOps practices - trunk-based is advanced; start with feature branching, graduate to trunk-based

Limitations & Constraints:

  • Requires cultural shift: Developers must change habits - smaller commits, feature flags, accepting incomplete code in trunk
  • Feature flag management overhead: Flags accumulate; must clean up old flags to avoid technical debt
  • Testing must be fast: Slow test suites (>10 min) bottleneck multiple daily integrations; requires investment in test optimization
  • Not suitable for open source: Contributors can't commit to trunk; need PR-based workflow with feature branches
  • Difficult with large, slow build processes: If build takes 30+ minutes, multiple daily integrations become impractical

💡 Tips for Understanding:

  • Think "integrate daily, release when ready" - trunk-based separates integration (continuous) from release (controlled via flags)
  • Feature flags are the secret weapon - they let you deploy incomplete code safely, avoiding long-lived branches
  • Trunk-based is a destination, not starting point - mature teams with good CI/CD graduate to trunk-based; don't force it prematurely

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Using trunk-based but creating branches that live for days or weeks
    • Why it's wrong: Defeats the purpose - conflicts still accumulate, integration still delayed
    • Correct understanding: If branch lives >1 day, revert to trunk and use feature flags instead, or admit you're not doing trunk-based
  • Mistake 2: Committing broken code to trunk "because we'll fix it later"
    • Why it's wrong: Breaks trunk deployment; blocks other developers; defeats "always deployable trunk" principle
    • Correct understanding: Trunk must always pass CI; if code isn't ready, use feature flags or revert commit immediately
  • Mistake 3: Thinking trunk-based means no code review
    • Why it's wrong: Fast integration doesn't mean skipping review - PRs still required, just processed quickly
    • Correct understanding: Trunk-based requires lightweight but rigorous review; optimize for speed (automate checks) but maintain quality

🔗 Connections to Other Topics:

  • Relates to Feature Flags (Domain 3) because: Trunk-based relies on feature toggles to deploy incomplete code safely
  • Builds on CI/CD Pipelines (Domain 3) by: Automated testing and deployment enable multiple daily integrations to trunk
  • Often used with Pull Request Workflows to: Even in trunk-based, PRs provide code review before merging to trunk

Section 2: Branch Policies and Pull Requests

Introduction

The problem: Without enforced quality gates, developers can push broken code directly to important branches, bypassing review, skipping tests, causing production incidents.
The solution: Branch policies enforce automated and manual checks before code merges - required reviewers, passing builds, resolved comments, and more.
Why it's tested: AZ-400 exam heavily tests branch protection configuration - knowing what policies enforce which quality gates is critical.

Core Concepts

Branch Protection Rules and Policies

What it is: Configurable rules in Azure DevOps (branch policies) and GitHub (branch protection rules) that enforce quality standards before allowing merges to protected branches like main or release/*.

Why it exists: Teams need programmatic enforcement of standards - relying on developer discipline alone fails at scale. Branch policies make quality gates automatic and consistent.

Real-world analogy: Like airport security checkpoints. You can't board a plane (merge to main) without passing through security (branch policies) - ID check (code review), metal detector (automated tests), baggage scan (security scans). No exceptions, automated enforcement.

How it works (Detailed step-by-step):

  1. Protect critical branches: Configure policies on main, release/*, or any important branches

    • WHY: Prevents direct pushes that bypass quality gates; all changes must go through PR workflow
    • WHAT HAPPENS: Git push to protected branch is rejected; developer must create PR instead
    • CONFIGURATION: Azure DevOps → Repos → Branches → ... → Branch Policies; GitHub → Settings → Branches → Add rule
  2. Require pull requests: Policy enforces that all merges happen via PR, never direct push

    • WHY: PRs provide review, discussion, automated checks, and audit trail
    • WHAT HAPPENS: git push origin main fails with "branch protected"; must push to feature branch, create PR
    • BYPASS: Only admins can bypass (should be rare, logged for audit)
  3. Require minimum reviewers: Policy demands X approvals before PR can merge (typically 1-2)

    • WHY: Ensures peer review catches issues; shares knowledge across team
    • WHAT HAPPENS: PR shows "Needs 2 approvals" status; can't merge until satisfied
    • RESET ON CHANGE: New commits invalidate approvals, requiring re-review
  4. Require build validation: Policy requires CI build to pass before merge

    • WHY: Automated tests validate changes don't break functionality
    • WHAT HAPPENS: PR creates status check; if build fails, "Merge" button disabled
    • CONFIGURATION: Link build pipeline to branch policy; can require multiple builds (unit tests, integration tests)
  5. Require linked work items: Policy enforces PR must link to work item (user story/bug)

    • WHY: Maintains traceability from requirement to implementation
    • WHAT HAPPENS: PR without AB#{ID} link can't merge; developer must associate work item
    • AUDIT: Work item shows all commits/PRs that implemented it
  6. Comment resolution: Policy requires all PR comments resolved before merge

    • WHY: Ensures reviewer feedback is addressed, not ignored
    • WHAT HAPPENS: Unresolved comments block merge; author must resolve or discuss each
    • WORKFLOW: Reviewer marks "Won't fix" for non-blocking comments

📊 Branch Policy Enforcement Diagram:

stateDiagram-v2
    [*] --> FeatureBranch: Developer creates branch
    FeatureBranch --> PRCreated: Push + Create PR
    PRCreated --> BuildRunning: Automatic CI trigger
    BuildRunning --> BuildFailed: Tests fail
    BuildRunning --> BuildPassed: Tests pass
    BuildFailed --> FeatureBranch: Fix code, push again
    BuildPassed --> CodeReview: Request reviewers
    CodeReview --> ChangesRequested: Reviewer requests changes
    CodeReview --> Approved: Reviewers approve
    ChangesRequested --> FeatureBranch: Update code
    Approved --> CommentCheck: Check comment resolution
    CommentCheck --> CommentsUnresolved: Comments pending
    CommentCheck --> AllResolved: All resolved
    CommentsUnresolved --> CodeReview: Resolve comments
    AllResolved --> WorkItemCheck: Check work item link
    WorkItemCheck --> NoWorkItem: No AB# link
    WorkItemCheck --> WorkItemLinked: AB# present
    NoWorkItem --> PRCreated: Add work item link
    WorkItemLinked --> MergeReady: All policies passed ✓
    MergeReady --> Merged: Merge to main
    Merged --> [*]: Branch deleted

See: diagrams/03_domain2_branch_policy_enforcement.mmd

Diagram Explanation:
This state diagram shows the complete journey of a pull request through branch policy enforcement gates. A developer starts by creating a Feature Branch and makes code changes. After pushing changes, they Create PR, which immediately triggers the Build Running state where automated CI executes.

If the Build Fails, the PR cannot proceed - it returns to Feature Branch state where the developer fixes code and pushes again, restarting the cycle. Only when Build Passes can the PR move to Code Review state.

During Code Review, reviewers examine the changes. If they find issues, the state becomes Changes Requested, sending the PR back to Feature Branch for updates. When reviewers Approve (meeting the minimum reviewer count), the PR advances to Comment Check.

At Comment Check, the system verifies all review comments are resolved. If Comments Unresolved, the PR returns to Code Review to resolve them. When All Resolved, it proceeds to Work Item Check.

Work Item Check validates that the PR links to a work item via AB# syntax. If No Work Item, the developer must add the link, returning to PR Created state. With Work Item Linked, all policies are satisfied - the PR reaches Merge Ready state where the merge button becomes active.

Finally, the PR Merges to main, the feature branch is deleted, and the workflow completes. Every gate must pass - skip one, and merge is blocked. This ensures consistent quality enforcement without relying on developer memory or discipline.

Detailed Example 1: Implementing Branch Policies for Main Branch
Your organization wants to protect main branch. You configure Azure DevOps branch policies:

  1. Require pull request reviews: Minimum 2 reviewers, reset approvals on new commits

    • REASON: Critical code needs multiple perspectives; new commits might introduce issues, need re-review
    • IMPLEMENTATION: Branch Policies → Check "Require a minimum number of reviewers" → Set to 2 → Enable "Reset votes when source branch is updated"
  2. Build validation: Require "PR-CI" pipeline to pass

    • REASON: Automated tests must pass before human review time is spent
    • IMPLEMENTATION: Branch Policies → Add build validation → Select "PR-CI" pipeline → Build expiration: Immediately
  3. Check for linked work items: Require PR associates with work item

    • REASON: Maintains traceability, prevents random code changes without planning
    • IMPLEMENTATION: Branch Policies → Check "Check for linked work items" → Required
  4. Check for comment resolution: All comments must be resolved or marked "Won't fix"

    • REASON: Ensures reviewer feedback isn't ignored
    • IMPLEMENTATION: Branch Policies → Check "Check for comment resolution" → Required

Result: Developer creates PR for main. PR shows 4 status checks:

  • ❌ Build: Pending (waiting for CI)
  • ❌ Reviewers: 0/2 (needs 2 approvals)
  • ❌ Work Items: None linked
  • ❌ Comments: N/A (no comments yet)

Developer adds AB#456 link → Work Items: ✓. Build completes → Build: ✓. Two teammates approve → Reviewers: 2/2 ✓. No comments → Comments: ✓. All green, merge button enabled.
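
💡 The same four policies can be scripted with the Azure DevOps CLI (the azure-devops extension for az). This is a hedged sketch - the organization URL, project, repository ID, and build definition ID are placeholders, and flag behavior (for example --valid-duration) should be confirmed against the CLI documentation for your version:

# Placeholders - substitute your own values
ORG=https://dev.azure.com/contoso
PROJECT=MyProject
REPO_ID=<repository-guid>

# 1. Require 2 reviewers and reset votes when the source branch is updated
az repos policy approver-count create \
  --org $ORG --project $PROJECT --repository-id $REPO_ID --branch main \
  --blocking true --enabled true --minimum-approver-count 2 \
  --reset-on-source-push true --creator-vote-counts false --allow-downvotes false

# 2. Build validation using the "PR-CI" pipeline (definition ID 42 is a placeholder)
az repos policy build create \
  --org $ORG --project $PROJECT --repository-id $REPO_ID --branch main \
  --blocking true --enabled true --build-definition-id 42 --display-name "PR-CI" \
  --manual-queue-only false --queue-on-source-update-only true --valid-duration 0

# 3. Require linked work items
az repos policy work-item-linking create \
  --org $ORG --project $PROJECT --repository-id $REPO_ID --branch main \
  --blocking true --enabled true

# 4. Require comment resolution
az repos policy comment-required create \
  --org $ORG --project $PROJECT --repository-id $REPO_ID --branch main \
  --blocking true --enabled true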

Detailed Example 2: Configuring GitHub Branch Protection
Your team uses GitHub. You protect main branch with these rules:

  1. Navigate: Settings → Branches → Add branch protection rule → Branch name pattern: main
  2. Require pull request reviews before merging: Check this box
    • Required approving reviews: 2
    • Dismiss stale PR approvals when new commits are pushed: ✓
    • Require review from code owners: ✓ (for critical paths)
  3. Require status checks to pass: Check this box
    • Add status checks: "CI Build", "CodeQL Analysis", "Dependency Check"
    • Require branches to be up to date before merging: ✓
  4. Require conversation resolution before merging: Check this box
  5. Include administrators: Leave unchecked (admins can bypass for emergencies)
  6. Save changes

Result: Developer tries git push origin main → Rejected: "Cannot push to protected branch". Creates PR instead. PR requires: (1) 2 approvals, (2) CI Build ✓, (3) CodeQL ✓, (4) Dependencies ✓, (5) Comments resolved. Only when all satisfied can merge occur.
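
💡 The same protection rule can also be applied through the GitHub REST API via the gh CLI. This is a sketch, not the only approach - it targets the "update branch protection" endpoint, and the exact payload schema should be checked against current GitHub documentation:

# protection.json mirrors the settings chosen in the UI above
cat > protection.json <<'EOF'
{
  "required_pull_request_reviews": {
    "required_approving_review_count": 2,
    "dismiss_stale_reviews": true,
    "require_code_owner_reviews": true
  },
  "required_status_checks": {
    "strict": true,
    "contexts": ["CI Build", "CodeQL Analysis", "Dependency Check"]
  },
  "required_conversation_resolution": true,
  "enforce_admins": false,
  "restrictions": null
}
EOF

# OWNER/REPO are placeholders for your organization and repository
gh api -X PUT repos/OWNER/REPO/branches/main/protection --input protection.json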

Must Know (Critical Facts):

  • Branch policies prevent direct pushes - protected branches reject git push; all changes via PR
  • Build validation runs on PR, not after merge - catches issues before bad code enters protected branch
  • Approvals reset on new commits - when developer pushes updates, previous approvals invalidate (configurable)
  • Policies apply to everyone except bypass permissions - even admins follow policies unless explicitly granted bypass
  • Work item linking enables traceability - requirement → commit → PR → deployment all connected via AB#{ID}

Decision Framework: When to Use Branch Policies

When choosing which branch policies to implement:

📊 Branch Policy Decision Tree:

graph TD
    A[Start: Analyze Branch Requirements] --> B{Critical branch?}
    B -->|Yes - main/master/release| C[Enable ALL core policies]
    B -->|No - feature/topic branch| D{Team size > 5?}
    
    C --> E[Required Reviewers: 2+]
    C --> F[Build Validation: Required]
    C --> G[Work Item Linking: Required]
    C --> H[Comment Resolution: Required]
    
    D -->|Yes| I[Require 1 reviewer minimum]
    D -->|No| J[Optional policies only]
    
    I --> K[Build validation recommended]
    J --> L[Work item linking optional]
    
    style C fill:#c8e6c9
    style E fill:#fff3e0
    style F fill:#fff3e0
    style G fill:#fff3e0
    style H fill:#fff3e0

See: diagrams/03_domain2_branch_policy_decision.mmd

Decision Logic Explained:
For critical branches (main, master, release), always enable the complete policy suite to ensure code quality and traceability. This includes minimum 2 reviewers (prevents single person approving their own questionable code), build validation (catches breaking changes before merge), work item linking (maintains audit trail), and comment resolution (ensures feedback is addressed). For team branches with 5+ members, require at least 1 reviewer and strongly recommend build validation to catch integration issues early. For small teams or personal feature branches, policies can be optional to avoid slowing down exploratory work, but work item linking helps track feature development.

🎯 Exam Focus: Questions often test understanding of when to require vs. recommend policies

  • Look for keywords: "critical production code" → All policies required
  • "fast-moving development team" → Selective policies (build validation + reviewers)
  • "compliance requirements" → Work item linking + audit policies
  • Common trap: Applying heavy policies to all branches slows development unnecessarily

Section 3: Git Workflow Strategies

Introduction

The problem: Teams struggle with merge conflicts, integration issues, and release coordination when using Git without clear workflow patterns.
The solution: Adopt proven branching strategies (trunk-based, GitFlow, feature branch) that match team size, release cadence, and risk tolerance.
Why it's tested: DevOps engineers must design workflows that balance speed with stability (15% of Domain 2 questions).

Core Concepts

Trunk-Based Development

What it is: A source control workflow where developers collaborate on code in a single branch (trunk/main) with very short-lived feature branches (hours to 1-2 days maximum) that merge frequently.

Why it exists: Traditional long-lived feature branches cause massive merge conflicts and integration headaches. Trunk-based development emerged from Google, Facebook, and other tech giants to enable continuous integration and rapid deployment. The core principle: integrate often to avoid integration hell.

Real-world analogy: Like a highway where all cars (developers) stay in the main lanes and only briefly exit for quick stops (feature work), then immediately merge back. Contrast with GitFlow which is like having separate roads for each type of vehicle that rarely intersect.

How it works (Detailed step-by-step; a consolidated command sketch follows the list):

  1. Developer creates short-lived feature branch from main: git checkout -b feature/add-login-button main (morning)
  2. Developer makes small, focused changes over 2-4 hours, commits frequently: git commit -m "Add login button UI"
  3. Developer pulls latest main to ensure up-to-date: git pull origin main --rebase (keeps history clean)
  4. Developer runs local tests, fixes any integration issues discovered: npm test (catches problems before PR)
  5. Developer creates PR to main, automated CI runs, reviewer approves (same day): PR open → CI green → Approve → Merge
  6. Feature branch is immediately deleted after merge: git branch -d feature/add-login-button
  7. Main branch is always deployable, releases can happen any time via feature flags or direct deploy
  8. If feature incomplete, use feature toggles to hide incomplete work in production while keeping code integrated
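
💡 The eight steps above, condensed into one same-day command sequence (branch name, commit messages, and test command are illustrative; the PR steps assume the GitHub CLI):

# Morning: branch from an up-to-date main
git checkout main && git pull origin main
git checkout -b feature/add-login-button

# A few hours of small, focused commits
git commit -am "Add login button UI"
git commit -am "Add login button unit tests"

# Before the PR: rebase onto the latest main and run tests locally
git pull --rebase origin main
npm test

# Same day: push, open the PR, and let CI, review, and auto-merge finish the job
git push -u origin feature/add-login-button
gh pr create --base main --title "Add login button" --body "Small same-day change"
gh pr merge --auto --squash --delete-branch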

📊 Trunk-Based Development Diagram:

sequenceDiagram
    participant Dev as Developer
    participant FB as Feature Branch
    participant Main as Main Branch
    participant CI as CI/CD Pipeline
    participant Prod as Production

    Note over Dev,Main: Morning: Start Work
    Dev->>Main: Pull latest
    Dev->>FB: Create short-lived branch
    
    Note over Dev,FB: 2-4 hours: Development
    Dev->>FB: Make small changes
    Dev->>FB: Commit frequently
    
    Note over FB,Main: Same Day: Integration
    Dev->>Main: Pull latest (rebase)
    Dev->>FB: Merge main changes
    Dev->>CI: Create PR
    
    CI->>FB: Run automated tests
    CI-->>Dev: Tests pass ✓
    
    Note over Dev,Prod: Same Day: Deployment
    Dev->>Main: Merge PR (approved)
    FB->>Main: Delete feature branch
    CI->>Prod: Deploy (or feature flag)
    
    style Main fill:#c8e6c9
    style FB fill:#fff3e0
    style CI fill:#e1f5fe
    style Prod fill:#f3e5f5

See: diagrams/03_domain2_trunk_based_sequence.mmd

Diagram Explanation (detailed):
This sequence diagram shows a complete trunk-based development cycle from start to finish. In the morning, the developer pulls the latest code from the main branch to ensure they're working with current code, then creates a very short-lived feature branch. Over the next 2-4 hours (not days!), they make focused changes and commit frequently to avoid losing work. The same day, before creating a PR, they pull main again and rebase their changes on top (this prevents merge conflicts by integrating latest changes before the PR). The PR triggers CI pipeline which runs all automated tests. Since the changes are small and frequently integrated, tests typically pass quickly. Once approved, the code merges to main and the feature branch is immediately deleted. The main branch remains deployable at all times - either deploy immediately or use feature flags to hide incomplete features. This rapid cycle (hours, not days) prevents integration problems and enables continuous deployment.

Detailed Example 1: E-commerce Team Using Trunk-Based Development
Your e-commerce platform team of 15 developers needs to deploy multiple times daily during Black Friday preparation. Here's how trunk-based development works: Monday 9 AM, Sarah pulls main and creates feature/cart-discount-badge. By 11 AM, she's added the discount badge UI component, written unit tests, and committed 4 times. She pulls main again (3 other developers merged since 9 AM), rebases her branch, runs tests locally - all pass. She creates PR at 11:30 AM. CI pipeline runs: unit tests ✓, integration tests ✓, security scan ✓. Mike reviews at 12 PM, approves with minor comment about CSS naming. Sarah fixes, pushes update, CI re-runs, Mike re-approves. Merge completes at 12:15 PM. The discount badge code is now in main, but wrapped in feature flag discount_badge_enabled=false so it's hidden from users. At 2 PM, product team enables flag for 5% of users to test. At 4 PM, enabled for 100%. The badge is live. Total time: feature branch lived 3 hours. No merge conflicts because changes were small and frequently integrated.

Detailed Example 2: Trunk-Based with Feature Flags for Long Features
Your team needs to build a complete checkout redesign that will take 2 weeks. Old approach: long-lived feature/checkout-redesign branch → massive merge conflicts. Trunk-based approach: Day 1, add feature flag new_checkout_enabled=false to main. Days 1-10, developers create small branches that merge same day: feature/checkout-step1-ui (4 hours), feature/checkout-validation-logic (6 hours), feature/checkout-payment-integration (1 day, split into 3 PRs). Each PR adds code to main wrapped in if (feature_flag.new_checkout_enabled) checks. Old checkout still works because flag is false. By Day 10, entire new checkout is in main but hidden. Day 11-12, QA tests by enabling flag in staging. Day 13, enable for 10% users. Day 14, enable for 100%, remove old checkout code. Result: no merge conflicts (integrated daily), reduced risk (gradual rollout), faster feedback (QA started Day 11, not Day 14).

Detailed Example 3: Handling Hotfixes in Trunk-Based Development
Production bug discovered Friday 3 PM: payment processing fails for Safari users. Trunk-based hotfix flow: (1) Developer pulls main, creates hotfix/safari-payment-fix, (2) Fixes bug in 30 minutes, adds test that reproduces issue, (3) Creates PR with [HOTFIX] label, (4) Automated tests run + required reviewer notified, (5) Reviewer approves in 10 minutes (small, obvious fix), (6) Merge to main at 4 PM, (7) CI auto-deploys to production (main is always deployable), (8) Fix live by 4:15 PM. Total time: 1 hour 15 minutes from discovery to production. If using GitFlow: the hotfix branch would still come from main, but it must merge back to both main and develop and pass through the formal release and approval process before deploying → 3-4 hours minimum.

Must Know (Critical Facts):

  • Feature branches live hours to 1-2 days maximum - if branch lives longer, you're not doing trunk-based (merge conflicts increase exponentially with time)
  • Main branch is always deployable - every commit to main must pass all tests and be production-ready (use feature flags for incomplete work)
  • Small, frequent integrations prevent merge conflicts - merging 50 lines daily is easier than merging 5,000 lines weekly
  • Feature flags enable continuous integration of incomplete features - code merged but hidden from users until ready
  • Requires strong CI/CD pipeline and automated testing - can't deploy constantly without confidence in automated tests
  • Best for teams with mature DevOps practices - trunk-based is advanced; requires discipline, automation, and cultural buy-in

When to use (Comprehensive):

  • ✅ Use when: Team does continuous deployment or multiple daily releases (trunk-based enables rapid deployment)
  • ✅ Use when: Team has strong automated testing (>80% code coverage, comprehensive integration tests)
  • ✅ Use when: Team size is 3-100+ developers (scales well because no long-lived branch conflicts)
  • ✅ Use when: Feature flags/toggles infrastructure exists (enables hiding incomplete features)
  • ✅ Use when: Team values quick feedback over perfect isolation (find integration issues early)
  • ❌ Don't use when: Team lacks automated tests (trunk-based without tests = production bugs)
  • ❌ Don't use when: Compliance requires formal release approval cycles (trunk-based assumes always deployable)
  • ❌ Don't use when: Team is new to Git/DevOps (requires discipline; start with feature branch workflow)
  • ❌ Don't use when: Product has scheduled releases (monthly/quarterly) with no interim deployments (GitFlow better fit)

Limitations & Constraints:

  • Requires feature flag infrastructure - can't hide incomplete features without flags; need LaunchDarkly, ConfigCat, or custom solution
  • Demands high developer discipline - one developer pushing broken code affects entire team immediately
  • Needs fast CI/CD pipeline - if tests take 30 minutes, developers wait 30 minutes per merge (productivity killer)
  • May conflict with compliance requirements - some regulations require isolated release branches with formal approvals
  • Difficult with unstable dependencies - if external APIs frequently break, constant integration exposes team to instability

💡 Tips for Understanding:

  • Think "integrate early, integrate often" - trunk-based is opposite of "develop in isolation for weeks then merge"
  • Feature flags are the key - they decouple "merge to main" from "release to users"; essential concept
  • Main = Production - mentally treat main branch as if it's already in production; forces quality mindset
  • Small PRs are easier to review - 50-line PR gets reviewed in 10 minutes; 5,000-line PR takes hours and gets rubber-stamped

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: "Trunk-based means no branches at all"
    • Why it's wrong: Confusion from "single trunk" terminology; developers still create feature branches
    • Correct understanding: Feature branches exist but are short-lived (hours/1-2 days); they're not long-lived development branches
  • Mistake 2: "Can't use trunk-based without feature flags"
    • Why it's wrong: Feature flags help but aren't strictly required for small, complete features
    • Correct understanding: Feature flags enable trunk-based for large features (2+ days); small features (hours) can merge directly
  • Mistake 3: "Trunk-based is always better than GitFlow"
    • Why it's wrong: Different workflows for different contexts; one-size-doesn't-fit-all
    • Correct understanding: Trunk-based for continuous deployment teams; GitFlow for scheduled release teams; neither is universally "better"

🔗 Connections to Other Topics:

  • Relates to CI/CD pipelines because: Trunk-based requires automated testing and deployment (you'll learn pipeline design in Domain 3)
  • Builds on branch policies by: Using required status checks to prevent broken code from merging to always-deployable main
  • Often used with feature flags to: Decouple deployment from release; deploy dark features to production safely (instrumentation topic in Domain 5)

Troubleshooting Common Issues:

  • Issue 1: Merge conflicts despite using trunk-based → Developers not pulling/rebasing main frequently enough; enforce "pull main before PR" rule
  • Issue 2: Main branch breaks frequently → Automated tests insufficient; add more integration tests and require status checks before merge
  • Issue 3: Feature branches living 1+ weeks → Team not breaking work small enough; coach on vertical slicing (thin end-to-end features)
  • Issue 4: Feature flags accumulating technical debt → No cleanup process; establish "flag retirement" policy (remove flags 2 weeks after 100% rollout)

GitFlow Workflow

What it is: A structured branching model with dedicated branch types (main/master, develop, feature, release, hotfix) designed for projects with scheduled release cycles and the need to support multiple production versions simultaneously.

Why it exists: Created by Vincent Driessen in 2010 to solve the problem of coordinating parallel development, managing scheduled releases, and supporting production hotfixes without disrupting ongoing development. Before GitFlow, teams struggled with "when do we stop adding features and start stabilizing for release?" GitFlow provides clear answers through its branching structure.

Real-world analogy: Like a manufacturing assembly line with different stations - features are built in parallel (feature branches), assembled on the main line (develop), sent to quality control for final checks (release branch), shipped to customers (master/main), and if a defect is found, a recall process fixes it (hotfix branch). Each station has a specific purpose and clear handoff points.

How it works (Detailed step-by-step; a git command sketch follows the list):

  1. Long-lived branches exist: main (production code only), develop (integration branch for next release)
  2. Feature development: Developer creates feature/user-authentication from develop, works for days/weeks, merges back to develop when complete
  3. Release preparation: When enough features in develop, create release/v2.0 from develop for final testing and bug fixes
  4. Release fixes: Bug fixes go to release/v2.0, also merged back to develop to keep it updated
  5. Production deployment: When release/v2.0 is stable, merge to main, tag as v2.0, deploy to production
  6. Hotfix process: Critical bug in production → create hotfix/payment-bug from main, fix, merge to both main and develop
  7. Parallel development continues: While release branch is being stabilized, new features continue merging to develop for next release
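
💡 A minimal git sketch of the feature-to-develop part of this flow (branch names are illustrative):

# Branch the feature from develop
git checkout develop && git pull origin develop
git checkout -b feature/user-authentication

# ...days or weeks of commits on the feature branch...

# Merge the finished feature back to develop with an explicit merge commit
git checkout develop && git pull origin develop
git merge --no-ff feature/user-authentication
git push origin develop
git branch -d feature/user-authentication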

📊 GitFlow Architecture Diagram:

graph TB
    subgraph "Long-Lived Branches"
        M[main/master<br/>Production Code]
        D[develop<br/>Next Release]
    end
    
    subgraph "Short-Lived Branches"
        F1[feature/login]
        F2[feature/dashboard]
        R[release/v2.0]
        H[hotfix/bug-123]
    end
    
    D -->|Create| F1
    D -->|Create| F2
    F1 -->|Merge when complete| D
    F2 -->|Merge when complete| D
    
    D -->|Create when ready| R
    R -->|Bug fixes| R
    R -->|Merge when stable| M
    R -->|Merge fixes back| D
    
    M -->|Critical bug| H
    H -->|Merge fix| M
    H -->|Merge fix| D
    
    M -->|Tag| TAG[v2.0 Tag]
    
    style M fill:#c8e6c9
    style D fill:#e1f5fe
    style F1 fill:#fff3e0
    style F2 fill:#fff3e0
    style R fill:#f3e5f5
    style H fill:#ffebee

See: diagrams/03_domain2_gitflow_architecture.mmd

Diagram Explanation (detailed):
GitFlow maintains two permanent branches: main (green) contains only production-ready code, and develop (blue) serves as the integration branch for the next release. Feature branches (orange) like feature/login and feature/dashboard are created from develop and can live for days or weeks while developers work on complete features. When a feature is done, it merges back to develop. When enough features accumulate in develop and it's time for a release, a release/v2.0 branch (purple) is created from develop. This release branch is where final testing, documentation, and minor bug fixes occur - no new features allowed. Once the release branch is stable, it merges to both main (becoming production) and back to develop (ensuring bug fixes aren't lost). The main branch is tagged with version number for traceability. If a critical production bug is discovered, a hotfix branch (red) is created from main, the fix is applied, and then merged to both main (immediate production fix) and develop (prevent bug in next release). This structure allows parallel work: new features can continue in develop while a release is being stabilized.

Detailed Example 1: Software Company with Quarterly Releases
Your SaaS company releases new versions quarterly. Current state: v1.5 in production, v1.6 in development. January: Developers create feature branches from develop: feature/export-pdf, feature/dark-mode, feature/api-v2. Over 6 weeks, these features are completed and merged to develop. Mid-February: Product decides v1.6 has enough features, time to release. QA creates release/1.6 from develop. Meanwhile, developers continue creating feature branches from develop for v1.7. QA finds 5 bugs in release/1.6 branch - fixes are committed to release/1.6 and also merged back to develop. March 1: release/1.6 is stable, merged to main, tagged v1.6, deployed to production. March 15: Customer reports critical data loss bug. Developer creates hotfix/1.6.1-data-loss from main, fixes it, merges to both main (becomes v1.6.1 in production) and develop (prevents bug in v1.7). Development continues on develop for v1.7 release in June. Result: Structured release process with clear separation between "next release" and "current production."

Detailed Example 2: GitFlow for Multi-Version Support
Your enterprise software supports 3 versions: v3.0 (current), v2.5 (legacy support), v1.0 (critical fixes only). GitFlow adaptation: Maintain main-v3, main-v2, main-v1 branches (one per supported version), plus develop for next release (v3.1). Customer on v2.5 reports security bug. Flow: (1) Create hotfix/v2.5-security from main-v2, (2) Fix bug, test, (3) Merge to main-v2, deploy to v2.5 customers, (4) Cherry-pick fix to main-v3 (current version needs fix too), (5) Merge fix to develop (v3.1 needs it). For new features: All features go to develop, when ready create release/3.1, stabilize, merge to new main-v3 (becomes current), old main-v3 becomes main-v3-archived. This allows supporting multiple versions while developing new features.

Detailed Example 3: GitFlow Release Branch Workflow Detail
Release day approaches for v2.0. State: 50 features merged to develop over 3 months. Actions: (1) March 1, 9 AM: Release manager creates release/2.0 from develop, (2) CI pipeline deploys release/2.0 to staging environment, (3) QA tests for 2 weeks, logs 12 bugs in Azure Boards, (4) Developers fix bugs by creating small branches from release/2.0: bugfix/login-crash, bugfix/export-timeout, each merges back to release/2.0 AND develop, (5) March 14: All bugs fixed, QA approves, (6) March 15: Merge release/2.0 to main, tag as v2.0.0, deploy to production, (7) March 16: Monitor production, no issues, (8) March 17: Delete release/2.0 branch (no longer needed), (9) Development continues on develop for v2.1, already has 15 new features merged during the 2-week release stabilization. Clean separation of release stabilization from ongoing development.
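
💡 The release and hotfix flows above, sketched as git commands (version numbers and branch names are illustrative; note the double merge back to develop in both flows):

# Release: stabilize on a release branch, then merge to BOTH main and develop
git checkout -b release/2.0 develop
git push -u origin release/2.0
# ...bug fixes land on release/2.0 during the stabilization window...

git checkout main && git merge --no-ff release/2.0
git tag -a v2.0.0 -m "Release 2.0.0"
git checkout develop && git merge --no-ff release/2.0
git push origin main develop --tags
git push origin --delete release/2.0

# Hotfix: branch from main, fix, then merge to BOTH main and develop
git checkout -b hotfix/payment-bug main
# ...fix and commit...
git checkout main && git merge --no-ff hotfix/payment-bug
git tag -a v2.0.1 -m "Hotfix 2.0.1"
git checkout develop && git merge --no-ff hotfix/payment-bug
git push origin main develop --tags
git branch -d hotfix/payment-bug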

Must Know (Critical Facts):

  • Two long-lived branches required: main (production), develop (next release) - if you only have one long-lived branch, it's not GitFlow
  • Release branches freeze feature development: Once release/X created, no new features - only bug fixes allowed
  • Hotfixes merge to both main and develop: Critical to prevent bug reappearing in next release; easy to forget under pressure
  • Tags mark production releases: main is tagged (v1.0, v2.0) to identify what's deployed when
  • Best for scheduled releases: GitFlow shines with monthly/quarterly release cycles, struggles with continuous deployment
  • Feature branches can be long-lived: Unlike trunk-based (hours), GitFlow features can live weeks/months

When to use (Comprehensive):

  • ✅ Use when: Scheduled release cycles (monthly, quarterly, annually) with clear "code freeze" dates
  • ✅ Use when: Multiple production versions need simultaneous support (v2.x for enterprise, v3.x for cloud)
  • ✅ Use when: Formal QA/UAT required before production deployment (release branch is QA environment)
  • ✅ Use when: Distributed teams working on large features that take weeks to complete
  • ✅ Use when: Business requires production hotfixes without disrupting development (hotfix branches)
  • ✅ Use when: Compliance requires release traceability and approval gates (release branches provide checkpoints)
  • ❌ Don't use when: Team does continuous deployment (multiple times daily) - GitFlow adds unnecessary overhead
  • ❌ Don't use when: Team is small (1-5 developers) - GitFlow complexity outweighs benefits, use feature branch workflow instead
  • ❌ Don't use when: Features must be deployed independently - use trunk-based with feature flags instead
  • ❌ Don't use when: Team lacks Git expertise - GitFlow merging complexity (release to main AND develop) causes errors

Limitations & Constraints:

  • Complex merge patterns prone to errors: Forgetting to merge hotfix to develop or release fixes to develop causes bugs to reappear
  • Long-lived branches increase merge conflicts: Feature branches that live weeks accumulate conflicts when merging to develop
  • Slows down hotfix deployment: Hotfix requires merging to main, develop, possibly release branch - extra steps during outages
  • Not compatible with continuous deployment: Having code in develop that's not in main contradicts "main always deployable"
  • Overhead for small teams: Managing 5 branch types (main, develop, feature, release, hotfix) is complex for 3-person teams

💡 Tips for Understanding:

  • Think "railroad switching yard": develop is the staging area, release is the final inspection track, main is the departure platform
  • Remember the merge pattern: Feature → develop, develop → release, release → main AND develop (the double-merge is key)
  • Hotfix is the emergency override: Only branch type that can bypass develop and go straight to main, but must still merge to develop after
  • Use tags religiously: Without tags on main, you can't identify what version is deployed when

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: "GitFlow means long-lived feature branches are okay"
    • Why it's wrong: Even in GitFlow, feature branches should be merged reasonably quickly (1-2 weeks max); month-long branches cause integration problems
    • Correct understanding: GitFlow tolerates longer feature branches than trunk-based, but still favors frequent integration to develop
  • Mistake 2: "Merge release to main, then delete it - done!"
    • Why it's wrong: Forgetting to merge release back to develop means bug fixes are lost in next release
    • Correct understanding: Release branches must merge to BOTH main and develop before deletion
  • Mistake 3: "Hotfix from develop is fine"
    • Why it's wrong: develop might have untested features; hotfix must be from stable main branch
    • Correct understanding: Hotfixes ALWAYS branch from main to ensure only production-tested code is included
  • Mistake 4: "We use GitFlow because we have feature branches"
    • Why it's wrong: Feature branches alone don't make it GitFlow; must have develop, release, and hotfix branches too
    • Correct understanding: GitFlow is a complete system with 5 branch types; just using feature branches is "feature branch workflow"

🔗 Connections to Other Topics:

  • Relates to Azure Pipelines because: Different pipelines trigger for each branch type (CI for develop, QA for release, prod for main) - covered in Domain 3
  • Builds on branch policies by: Each branch type (main, develop, release) needs different policies (main strictest, develop moderate, feature minimal)
  • Often used with semantic versioning to: Tag releases (v1.0.0, v2.0.0) and determine breaking changes vs. patches (instrumentation in Domain 5)

Troubleshooting Common Issues:

  • Issue 1: Merge conflicts when merging release to main → Release branch diverged too far from main; merge main to release branch weekly during stabilization
  • Issue 2: Bug fixed in hotfix reappears in next release → Hotfix wasn't merged to develop; add checklist "Hotfix merged to: [ ] main [ ] develop"
  • Issue 3: Features in develop not ready for release → Don't merge incomplete features to develop; use feature branches until fully done
  • Issue 4: Too many active release branches → Limit to 1 active release at a time; finish v2.0 before starting v2.1 release branch

Comparison Table: Branching Strategies

Feature | Trunk-Based Development | GitFlow | Feature Branch Workflow
Use case | Continuous deployment, rapid iteration | Scheduled releases, formal QA | Simple projects, small teams
Main branch | Always deployable, directly to prod | Production-ready code only | Integration branch
Feature branch lifespan | Hours to 1-2 days | Days to weeks | Days to weeks
Release mechanism | Deploy main anytime, use feature flags | Dedicated release branches | Tag main or create release branch
Hotfix process | Fix in main, deploy (1 step) | Hotfix branch → main + develop (3 steps) | Fix in main, tag
Pros | Fast deployment; no merge conflicts; simple structure | Clear release process; multiple version support; formal QA stage | Easy to learn; flexible; good for small teams
Cons | Requires feature flags; needs strong CI/CD; high discipline needed | Complex merges; slower hotfixes; not for continuous deployment | Can cause merge conflicts; no formal release process; scales poorly
🎯 Exam tip | Look for: "continuous deployment", "multiple deploys/day", "fast iteration" | Look for: "quarterly releases", "support multiple versions", "formal approval" | Look for: "small team", "simple process", "getting started"

Practical Scenarios

Scenario 1: Choosing Strategy for E-commerce Platform

  • Situation: E-commerce site, Black Friday approaching, need to deploy fixes hourly, team of 20 developers
  • Challenge: Must balance rapid deployments with stability
  • Solution: Trunk-based development with feature flags
  • Why this works: Hourly deploys require always-deployable main branch (trunk-based strength). Feature flags let team hide incomplete Black Friday features while deploying bug fixes. Team size (20) is manageable with trunk-based if strong CI/CD exists. GitFlow would be too slow (release branches delay deployment).

📊 Solution Architecture:

graph LR
    A[Developer] -->|Create branch| B[feature/fix-cart]
    B -->|4 hours work| C[PR to main]
    C -->|CI tests pass| D[Merge to main]
    D -->|Auto-deploy| E[Production]
    E -->|Feature flag OFF| F[Hidden from users]
    F -->|Black Friday| G[Enable flag]
    G -->|Gradual rollout| H[100% users]
    
    style D fill:#c8e6c9
    style E fill:#e1f5fe
    style G fill:#fff3e0

See: diagrams/03_domain2_scenario_ecommerce.mmd

Scenario 2: Enterprise SaaS with Compliance Requirements

  • Situation: Healthcare SaaS, quarterly releases, HIPAA compliance requires release approvals, 18-month support for old versions
  • Challenge: Regulatory approvals slow deployment, must support v2.x (cloud) and v1.x (on-premise) simultaneously
  • Solution: GitFlow with multiple main branches (main-v2, main-v1)
  • Why this works: Quarterly releases fit GitFlow's scheduled release model. Release branches provide approval checkpoints for compliance. Multiple main branches support parallel versions. Hotfix branches allow emergency patches to specific versions without disrupting development.

Scenario 3: Startup Rapid Prototyping

  • Situation: 5-person startup, building MVP, no formal QA, deploying when features ready
  • Challenge: Need simplicity, can't handle complex branching overhead
  • Solution: Feature branch workflow (simple: main + feature branches)
  • Why this works: Small team doesn't need GitFlow complexity. No scheduled releases, deploy when ready. Feature branches provide isolation for experimentation. PR to main provides lightweight review. Can evolve to trunk-based or GitFlow as team grows.

Section 3: Code Review and Pull Request Best Practices

Introduction

The problem: Code reviews are often inconsistent, delayed, or superficial, leading to bugs slipping through and knowledge silos forming in teams.
The solution: Implement structured pull request workflows with clear guidelines, automated checks, and effective review practices.
Why it's tested: Code review is the primary quality gate in modern development (20% of Domain 2 questions test PR workflows).

Core Concepts

Effective Pull Request Structure

What it is: A systematic approach to creating pull requests that are easy to review, understand, and approve quickly while maintaining high code quality standards.

Why it exists: Large, complex PRs (500+ line changes) take hours to review and often get rubber-stamped without thorough inspection. Small, well-structured PRs get reviewed in 10-15 minutes with better quality outcomes. The problem: developers create massive PRs; the solution: enforce size limits and clear structure.

Real-world analogy: Like proofreading documents - reviewing a 2-page memo takes 5 minutes and catches most errors, while reviewing a 100-page report takes hours and errors slip through due to reviewer fatigue.

How it works (Detailed step-by-step):

  1. Developer keeps changes small: Aim for <250 lines changed per PR (research shows optimal review effectiveness)
  2. PR includes clear description: What changed, why it changed, how to test, any breaking changes
  3. Self-review first: Developer reviews own PR before requesting review, catches obvious issues
  4. Link to work item: PR references Azure Boards item (AB#123) or GitHub issue (#456) for traceability
  5. Add appropriate reviewers: Include code owners, domain experts, minimum 1-2 reviewers
  6. Respond to feedback promptly: Answer questions, make requested changes, resolve conversations
  7. Keep PR updated: Rebase/merge main frequently to avoid conflicts, re-request review after changes (see the command sketch after this list)
  8. Complete only when approved: All conversations resolved, all checks pass, required approvals received
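
A short command sketch for step 7, keeping the PR branch current with main (branch handling is illustrative; teams that avoid rewriting branch history can use the merge alternative):

# Update the feature branch with the latest main
git fetch origin
git rebase origin/main                  # replay the branch's commits on top of main
# resolve any conflicts, then:
git push --force-with-lease             # needed after a rebase because commit SHAs changed

# Alternative that preserves history (no force push required):
git merge origin/main && git push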

📊 Effective PR Workflow:

sequenceDiagram
    participant Dev as Developer
    participant PR as Pull Request
    participant CI as CI Pipeline
    participant Rev1 as Reviewer 1
    participant Rev2 as Reviewer 2
    participant Main as Main Branch

    Dev->>Dev: Self-review code
    Dev->>PR: Create PR (< 250 lines)
    PR->>CI: Trigger automated checks
    PR->>Rev1: Notify reviewer 1
    PR->>Rev2: Notify reviewer 2
    
    CI->>PR: Build ✓, Tests ✓, Lint ✓
    
    Rev1->>PR: Review code, add comments
    Rev2->>PR: Review code, add comments
    
    Dev->>PR: Address feedback
    Dev->>PR: Resolve conversations
    
    PR->>CI: Re-run checks
    CI->>PR: All checks pass ✓
    
    Rev1->>PR: Approve
    Rev2->>PR: Approve
    
    PR->>Main: Merge (squash/rebase)
    PR->>PR: Delete feature branch
    
    style CI fill:#e1f5fe
    style Main fill:#c8e6c9
    style PR fill:#fff3e0

See: diagrams/03_domain2_pr_workflow.mmd

Diagram Explanation (detailed):
The effective PR workflow begins with developer self-review - before creating the PR, the developer reviews their own code to catch obvious issues (typos, console.log statements, unused imports). This saves reviewer time. When creating the PR, the developer ensures it's under 250 lines (large PRs get poor reviews). The PR automatically triggers CI pipeline for automated checks (build, tests, linting, security scans) and notifies assigned reviewers. Both automated (CI) and human (reviewers) validation happen in parallel. Reviewers examine code and add comments/questions. Developer addresses feedback by making changes and explicitly resolving conversations (not ignoring them). After changes, CI re-runs to ensure fixes didn't break anything. When all conversations are resolved and checks pass, reviewers approve. Only then can the PR merge to main using squash or rebase strategy to keep history clean. Finally, the feature branch is automatically deleted to prevent clutter.

Detailed Example 1: Small PR vs Large PR Review Quality
Scenario A (Small PR): Developer creates PR with 150 lines changed - adds new API endpoint. Description: "Add GET /api/users/:id endpoint. Returns user by ID. Related to AB#789." Reviewer clicks PR, sees concise changes in 3 files (route, controller, test). Reviews in 12 minutes, spots issue: "Missing error handling for invalid user ID." Developer fixes in 5 minutes, reviewer re-approves. Total time: 20 minutes, caught 1 bug.

Scenario B (Large PR): Developer creates PR with 1,200 lines changed - refactors entire API layer. Description: "API refactoring." Reviewer clicks PR, sees 45 files changed, overwhelmed. Skims for 30 minutes, approves with "LGTM" comment despite not fully understanding changes. Merges. Production deploys. 3 bugs discovered in production because reviewer missed: (1) broken error handling, (2) race condition, (3) memory leak. Total time: 30 minutes review + 4 hours fixing production bugs. Lesson: Small PRs get better reviews.

Detailed Example 2: PR Description Best Practices
Bad PR description: "Fixed stuff. Updated code. See changes." - Reviewer has no context, must read entire codebase to understand.

Good PR description template:

## What changed
- Added retry logic to payment API client
- Increased timeout from 5s to 30s
- Added exponential backoff (max 3 retries)

## Why
Payment API occasionally returns 503 under load. Current implementation fails immediately.
Customer transactions lost. Business impact: $50K/month failed orders.

## How to test
1. Run: npm test -- payment.test.js
2. Manual test: Simulate API timeout (see test/README)
3. Verify: Logs show retry attempts

## Breaking changes
None - backward compatible

## Related work item
Fixes AB#1234

Result: Reviewer understands context immediately, knows what to focus on, can test changes. Review is faster and more effective.

Must Know (Critical Facts):

  • Optimal PR size is <250 lines changed - research shows review effectiveness drops sharply beyond this; larger PRs get superficial reviews
  • PR descriptions should answer: What, Why, How to test - context speeds up reviews and improves quality
  • Self-review before requesting review - catches 30-40% of issues before wasting reviewer time
  • Resolve all conversations explicitly - don't leave reviewer comments unanswered; shows respect for their time
  • Keep PRs updated with main - rebase/merge main daily to avoid merge conflicts at completion time
  • Link to work items for traceability - every PR should reference Azure Boards item (AB#123) or GitHub issue (#456)

When to use (Comprehensive):

  • ✅ Use small PRs (<250 lines) when: Feature can be split into incremental changes (most of the time)
  • ✅ Use detailed descriptions when: Changes are complex or affect critical systems (always for prod code)
  • ✅ Use draft PRs when: Want early feedback before code is complete (collaboration, architectural decisions)
  • ✅ Use PR templates when: Team needs consistency (auto-populate description format)
  • ✅ Use codeowners when: Specific people must review certain paths (security team for auth/, platform team for infra/); see the CODEOWNERS sketch after this list
  • ❌ Don't create huge PRs (>500 lines) when: Can be split (refactoring can be incremental)
  • ❌ Don't skip description when: Convenient but reviewer needs context (always write description)
  • ❌ Don't merge without resolving conversations when: Reviewer raised valid concerns (address feedback)
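
A minimal CODEOWNERS sketch for the code-owners item above (GitHub syntax; paths and team names are illustrative, and Azure Repos achieves a similar effect with required-reviewer branch policies scoped to paths):

# .github/CODEOWNERS - owners of matching paths are auto-requested as PR reviewers
# (the last matching pattern takes precedence)
*              @myorg/dev-leads
/src/auth/     @myorg/security-team
/infra/        @myorg/platform-team
*.sql          @myorg/dba-team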

Limitations & Constraints:

  • Not all changes can be split to <250 lines - database migrations, dependency upgrades, and auto-generated code sometimes require large PRs
  • Small PRs increase overhead - creating 10 small PRs instead of 1 large PR means 10x CI runs, 10x review cycles
  • Description quality varies - enforcing templates helps but can't force good explanations
  • Review fatigue still occurs - even small PRs cause fatigue if reviewing 20/day

💡 Tips for Understanding:

  • Think "reviewable chunks" - would YOU want to review this PR after a long day?
  • Description is documentation - in 6 months, this PR description explains why the change was made
  • Self-review is quality gate #1 - review your own PR first, pretend you're the reviewer
  • Conversations are asynchronous - write clear comments, don't assume instant back-and-forth

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: "Big features need big PRs"
    • Why it's wrong: Confuses feature size with PR size; big features should be multiple small PRs
    • Correct understanding: Break big feature into incremental PRs, use feature flags to hide incomplete work
  • Mistake 2: "PR description wastes time, code is self-explanatory"
    • Why it's wrong: Code shows WHAT changed, not WHY; reviewer needs context to evaluate correctness
    • Correct understanding: Good descriptions save net time (5 min to write, saves 30 min reviewer investigation)
  • Mistake 3: "Approve PR even if I don't understand it - don't want to block the team"
    • Why it's wrong: Rubber-stamping defeats the purpose of code review; bugs reach production
    • Correct understanding: If you don't understand, ask questions; good PRs should be understandable

🔗 Connections to Other Topics:

  • Relates to branch policies because: Required reviewers, build validation, work item linking enforce PR quality (Domain 2, Section 1)
  • Builds on CI/CD pipelines by: PR triggers automated tests, security scans, code quality checks before human review (Domain 3)
  • Often used with code coverage metrics to: Ensure new code is tested; PR shows coverage delta (instrumentation in Domain 5)

Troubleshooting Common Issues:

  • Issue 1: PRs sit unreviewed for days → Add automatic reviewer assignment (codeowners), set SLA reminders in Azure DevOps
  • Issue 2: Reviewers always approve without comments → Set a team norm for substantive reviews (at least one comment or question before approving), educate on review value
  • Issue 3: PR conflicts with main at merge time → Require branch to be up-to-date before merge (GitHub branch protection), use auto-merge to rebase
  • Issue 4: Reviewers don't have context → Enforce PR template with description sections (What/Why/How to test)

Section 4: Git Advanced Operations

Introduction

The problem: Teams struggle with advanced Git scenarios - resolving conflicts, recovering lost work, cleaning up history, managing large repositories.
The solution: Master Git's powerful features (rebase, cherry-pick, reflog, bisect) to handle complex situations efficiently.
Why it's tested: DevOps engineers must troubleshoot Git issues and guide teams (15% of Domain 2 questions).

Core Concepts

Git Rebase for Clean History

What it is: An alternative to merge that rewrites commit history by replaying commits from one branch onto another, creating a linear history instead of merge commits.

Why it exists: git merge creates merge commits that clutter history with "Merged feature/X into main" messages. For frequently-integrated branches, history becomes a tangled web. Rebase solves this by making history linear and readable.

Real-world analogy: Merge is like combining two separate document timelines with a note "Combined documents here." Rebase is like rewriting the second document as if it was always part of the first document's timeline - cleaner, but changes history.

How it works (Detailed step-by-step):

  1. Developer has feature branch with commits F1, F2, F3 based on main commit M1
  2. Meanwhile, main has new commits M2, M3 (other developers merged work)
  3. Developer runs: git checkout feature/login then git rebase main
  4. Git temporarily removes F1, F2, F3 from history
  5. Git resets the feature branch pointer to M3 (the latest main commit), which becomes the new base
  6. Git replays F1 onto M3 (creating F1'), resolves conflicts if any
  7. Git replays F2 onto F1' (creating F2'), resolves conflicts if any
  8. Git replays F3 onto F2' (creating F3'), resolves conflicts if any
  9. Result: Linear history: M1 → M2 → M3 → F1' → F2' → F3'
  10. Original F1, F2, F3 are abandoned (exist in reflog for 30 days)
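
The command sequence for the rebase flow above, plus an interactive cleanup before opening the PR (branch names are illustrative):

# Rebase the local feature branch onto the latest main
git checkout feature/login
git fetch origin
git rebase origin/main                  # replays F1, F2, F3 as F1', F2', F3' on the new base

# If a replayed commit conflicts: fix the files, then
git add .
git rebase --continue                   # or: git rebase --abort to return to the pre-rebase state

# Optional cleanup: squash/reword the branch's last 3 commits before the PR
git rebase -i HEAD~3

# Pushing a rebased branch requires a (safe) force push because the SHAs changed
git push --force-with-lease origin feature/login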

Must Know (Critical Facts):

  • Rebase rewrites history - commits get new SHAs; never rebase public/shared branches (others have old SHAs)
  • Use for local branches - rebase feature branches before creating PR to keep history clean
  • Interactive rebase (git rebase -i) - lets you squash, reword, reorder commits; clean up before PR
  • Golden rule: never rebase public branches - if commit is pushed to shared branch (main, develop), don't rebase it
  • Conflicts resolved per commit - rebase may require resolving conflicts multiple times (once per commit being replayed)
  • Rebase vs merge trade-off - rebase gives clean history but changes history; merge preserves exact history but adds noise

Chapter Summary

What We Covered

  • Branch Policies: Protection mechanisms, required reviewers, build validation, work item linking, comment resolution
  • Workflow Strategies: Trunk-based development, GitFlow, feature branch workflow - when to use each
  • Pull Request Best Practices: Small PRs (<250 lines), effective descriptions, review workflows, code quality gates
  • Git Advanced Operations: Rebase for clean history, conflict resolution, history management

Critical Takeaways

  1. Branch policies prevent bad code from reaching protected branches - use required reviewers (2+), build validation, work item linking on main/master
  2. Trunk-based development enables continuous deployment - short-lived branches (hours to 1-2 days), main always deployable, requires feature flags
  3. GitFlow suits scheduled releases - develop branch for next release, release branches for stabilization, hotfix branches for production fixes
  4. Small PRs get better reviews - <250 lines optimal, detailed descriptions required, self-review before requesting review
  5. Rebase creates clean history - but rewrites commits, never rebase public/shared branches

Self-Assessment Checklist

Test yourself before moving on:

  • I can explain the difference between trunk-based development and GitFlow
  • I can configure branch policies in Azure DevOps (required reviewers, build validation, work item linking)
  • I understand when to use rebase vs merge
  • I can describe characteristics of an effective pull request (<250 lines, good description)
  • I know the "golden rule of rebase" (never rebase public branches)
  • I can explain how hotfixes work in GitFlow (branch from main, merge to main AND develop)
  • I understand how branch policies enforce code quality

Practice Questions

Try these from your practice test bundles:

  • Domain 2 Bundle 1: Questions 1-15 (Branch policies and workflows)
  • Domain 2 Bundle 2: Questions 16-30 (Pull requests and code review)
  • Expected score: 70%+ to proceed

If you scored below 70%:

  • Review sections: Branch Policies (Section 1), Workflow Strategies (Section 2)
  • Focus on: Decision frameworks for choosing branching strategies, when to use each branch policy
  • Re-read: Common mistakes and misconceptions in each section

Quick Reference Card

[One-page summary of chapter - copy to your notes]

Key Concepts:

  • Branch Policy: Protection rule on branch requiring conditions before merge (reviewers, build, work items)
  • Trunk-Based: Main branch always deployable, short-lived feature branches (hours/days), continuous deployment
  • GitFlow: Structured workflow with main, develop, feature, release, hotfix branches for scheduled releases
  • PR Best Practices: <250 lines, What/Why/How description, self-review, linked work item
  • Rebase: Replay commits onto different base, creates linear history, rewrites SHAs

Decision Points:

  • Need continuous deployment? → Trunk-based + feature flags
  • Scheduled releases (quarterly)? → GitFlow
  • Small team getting started? → Feature branch workflow
  • Protect critical branch? → Enable all policies (reviewers, build, work items, comments)
  • Clean up history before PR? → Interactive rebase (git rebase -i)

Commands:

  • Branch policies: Azure DevOps → Repos → Branches → branch → "..." → Branch policies
  • Trunk-based rebase: git pull origin main --rebase before PR
  • GitFlow hotfix: Branch from main, merge to main AND develop
  • Interactive rebase: git rebase -i HEAD~3 (squash last 3 commits)
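
The branch policies above can also be scripted instead of configured through the UI; a sketch assuming the Azure DevOps CLI extension (az devops) is installed and signed in, with placeholder organization, project, and IDs (verify parameter names against your CLI version):

# Require 2 reviewers on main
az repos policy approver-count create \
  --org https://dev.azure.com/myorg --project MyProject \
  --repository-id <repo-guid> --branch main \
  --minimum-approver-count 2 --creator-vote-counts false \
  --allow-downvotes false --reset-on-source-push true \
  --blocking true --enabled true

# Require a passing PR build (build validation) on main
az repos policy build create \
  --org https://dev.azure.com/myorg --project MyProject \
  --repository-id <repo-guid> --branch main \
  --build-definition-id <pipeline-id> --display-name "PR validation" \
  --manual-queue-only false --queue-on-source-update-only true --valid-duration 720 \
  --blocking true --enabled true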


Chapter 3: Design and Implement Build and Release Pipelines (52.5% of exam)

Chapter Overview

What you'll learn:

  • Azure Pipelines fundamentals (YAML, agents, triggers, jobs, stages)
  • CI/CD pipeline design patterns and best practices
  • Build strategies, artifact management, and package feeds
  • Deployment strategies (blue-green, canary, ring-based, feature flags)
  • Container and Kubernetes deployments (ACR, AKS)
  • Pipeline optimization (caching, parallelization, templates)

Time to complete: 12-16 hours (largest domain, most exam weight)
Prerequisites: Chapters 1-2 (Fundamentals, Source Control)


Section 1: Azure Pipelines Fundamentals

Introduction

The problem: Teams struggle with manual builds, inconsistent deployments, and lack of automation, leading to slow delivery and production bugs.
The solution: Azure Pipelines automates build, test, and deployment processes with YAML-based configuration-as-code.
Why it's tested: Azure Pipelines is the core of DevOps automation (30% of Domain 3 questions, 15% of entire exam).

Core Concepts

YAML Pipeline Structure

What it is: Azure Pipelines uses YAML (YAML Ain't Markup Language) files to define CI/CD pipelines as code, with a hierarchical structure of stages, jobs, and steps that execute automation tasks.

Why it exists: Before YAML, pipelines were configured through UI (Classic pipelines), which had problems: not version-controlled, hard to replicate, no code review for pipeline changes, prone to drift. YAML pipelines solve this by treating pipeline configuration as code - versioned, reviewed, reusable, consistent.

Real-world analogy: YAML pipeline is like a recipe book checked into source control. Classic pipeline is like verbal instructions passed between cooks - inconsistent and forgotten. With YAML, every team member has the exact same recipe, can suggest improvements via PR, and changes are tracked in Git history.

How it works (Detailed step-by-step):

  1. Developer creates azure-pipelines.yml in repository root (or any path, specified in pipeline settings)
  2. File defines hierarchy: trigger → stages → jobs → steps (each level contains the next)
  3. Trigger section: Specifies when pipeline runs (push to main, PR, schedule, manual)
  4. Stages section (optional): Logical divisions like Build, Test, Deploy (can run sequentially or parallel with dependencies)
  5. Jobs section: Units of work that run on agents (can run parallel within a stage)
  6. Steps section: Individual tasks executed sequentially (script, task, checkout)
  7. Azure DevOps parses YAML on trigger: Validates syntax, expands templates, queues jobs
  8. Agent pool allocates agent: Pulls job from queue, executes steps in order
  9. Results reported back: Logs, artifacts, test results stored in Azure DevOps
  10. Pipeline shows in UI: Visual representation of stages/jobs/steps with status

📊 YAML Pipeline Structure Diagram:

graph TD
    A[azure-pipelines.yml] --> B[Trigger: push to main]
    A --> C[Variables: Build config]
    A --> D[Stages]
    
    D --> E[Stage: Build]
    D --> F[Stage: Test]
    D --> G[Stage: Deploy]
    
    E --> H[Job: BuildJob]
    H --> I[Step: Install dependencies]
    H --> J[Step: Compile code]
    H --> K[Step: Publish artifact]
    
    F --> L[Job: UnitTests]
    F --> M[Job: IntegrationTests]
    
    G --> N[Job: DeployToDev]
    G --> O[Job: DeployToStaging]
    
    style A fill:#e1f5fe
    style E fill:#c8e6c9
    style F fill:#fff3e0
    style G fill:#f3e5f5
    
    I -.Sequential.-> J
    J -.Sequential.-> K
    
    L -.Parallel.-> M
    N -.Dependent.-> O

See: diagrams/04_domain3_yaml_pipeline_structure.mmd

Diagram Explanation (detailed):
The YAML pipeline starts with a single file (azure-pipelines.yml, blue) containing all configuration. At the top level, you define triggers (when pipeline runs), variables (configuration values), and stages (logical divisions). The pipeline flows through three stages sequentially: Build (green), Test (orange), Deploy (purple). Within the Build stage, a single job (BuildJob) contains three steps that run sequentially on the same agent: install dependencies → compile code → publish artifact (the arrows show sequential execution). The Test stage has two jobs (UnitTests and IntegrationTests) that run in parallel on separate agents to speed up testing. The Deploy stage has two jobs where DeployToStaging depends on DeployToDev completing successfully (dependent execution). This hierarchical structure (trigger → stages → jobs → steps) provides flexibility: parallel where possible (jobs within stage), sequential where necessary (steps within job, stages with dependencies).

Detailed Example 1: Simple Node.js CI Pipeline
Your Node.js app needs automated testing on every commit to main. Here's the YAML:

# azure-pipelines.yml
trigger:
  branches:
    include:
    - main
  paths:
    include:
    - src/*
    - tests/*

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: Build
  jobs:
  - job: BuildJob
    steps:
    - task: NodeTool@0
      inputs:
        versionSpec: '18.x'
      displayName: 'Install Node.js'
    
    - script: npm ci
      displayName: 'Install dependencies'
    
    - script: npm run build
      displayName: 'Build application'
    
    - task: PublishBuildArtifacts@1
      inputs:
        PathtoPublish: 'dist'
        ArtifactName: 'webapp'
      displayName: 'Publish artifact'

- stage: Test
  dependsOn: Build
  jobs:
  - job: UnitTest
    steps:
    - script: npm ci
      displayName: 'Install dependencies'
    
    - script: npm test -- --coverage
      displayName: 'Run unit tests'
    
    - task: PublishTestResults@2
      inputs:
        testResultsFormat: 'JUnit'
        testResultsFiles: '**/junit.xml'
      condition: always()
    
    - task: PublishCodeCoverageResults@1
      inputs:
        codeCoverageTool: 'Cobertura'
        summaryFileLocation: '**/coverage/cobertura-coverage.xml'

Breakdown: Pipeline triggers on push to main, but only if src/ or tests/ changed (path filter saves agent time). Runs on Microsoft-hosted Ubuntu agent (vmImage). Build stage installs Node 18, runs npm ci (faster than npm install in CI), builds app, publishes dist folder as artifact named 'webapp'. Test stage depends on Build (runs after), reinstalls dependencies (fresh job, clean environment), runs tests with coverage, publishes results (visible in Azure DevOps Tests tab), publishes coverage report (condition: always() means publish even if tests fail). Result: Every commit to main triggers → build → test → artifact + results available in 5-10 minutes.

Detailed Example 2: GitHub Actions Workflow with Matrix Strategy

You're building a Node.js library that needs to support multiple Node versions (14, 16, 18) and run on both Linux and Windows. Matrix strategy runs the same job with different variable combinations in parallel.

Workflow file (.github/workflows/test.yml):

name: Test Suite

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest]
        node: [14, 16, 18]
    runs-on: ${{ matrix.os }}
    
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup Node ${{ matrix.node }}
      uses: actions/setup-node@v3
      with:
        node-version: ${{ matrix.node }}
    
    - name: Install dependencies
      run: npm ci
    
    - name: Run tests
      run: npm test
    
    - name: Upload coverage (Ubuntu + Node 18 only)
      if: matrix.os == 'ubuntu-latest' && matrix.node == 18
      uses: codecov/codecov-action@v3

What happens: GitHub creates 6 parallel jobs (2 OS × 3 Node versions = 6 combinations). Each job checks out code, installs the specific Node version from matrix, runs tests. Coverage only uploads once (from Ubuntu+Node 18 to avoid duplicates). All 6 jobs must pass for PR to be mergeable. If Node 16 on Windows fails but others pass, you know it's a Windows+16 specific issue. Result: Comprehensive compatibility testing across environments, completed in parallel (same time as 1 job, not 6× longer).

Detailed Example 3: Self-Hosted Runner with Specific Capabilities

Your company has specialized build requirements: access to private NuGet feed (internal network), requires specific SDK versions not available on Microsoft-hosted agents, needs access to on-premises database for integration tests. Solution: Set up self-hosted agent/runner.

Azure DevOps agent setup:

# Download and configure agent on your build server
./config.sh --url https://dev.azure.com/yourorg --auth pat --token YOUR_PAT
./run.sh

# Add to agent pool "OnPremises"
# Configure capabilities in Azure DevOps: custom.sdk=specialized, custom.network=internal

Pipeline configuration:

pool:
  name: OnPremises
  demands:
  - custom.sdk -equals specialized
  - custom.network -equals internal

steps:
- script: dotnet restore --source http://internal-nuget.company.local/feed
  displayName: 'Restore from internal feed'

- script: dotnet build
  displayName: 'Build with specialized SDK'

- script: dotnet test --settings integration.runsettings
  displayName: 'Run integration tests'
  env:
    DB_CONNECTION: $(OnPremDbConnection)

Breakdown: Self-hosted agent runs on your infrastructure (Windows Server in your datacenter), has network access to internal resources, pre-configured with specialized SDK. Pipeline requests agents from "OnPremises" pool with specific capabilities (demands). Agent evaluates: "Do I have custom.sdk=specialized? Yes. Do I have custom.network=internal? Yes. I can run this job." Restore pulls packages from internal feed (http://internal-nuget.company.local), build uses the specialized SDK installed on agent, tests connect to on-premises database using secure variable. Result: Build succeeds where Microsoft-hosted agents would fail (no access to internal network). Trade-off: You maintain the agent infrastructure (updates, security patches, scaling).

Must Know (Pipeline Design Critical Facts):

  • Triggers determine when: Push triggers (CI), PR triggers (validation), scheduled triggers (nightly builds), manual triggers (on-demand)
  • Agents determine where: Microsoft-hosted (clean, scalable, paid per minute), self-hosted (your infrastructure, persistent, free compute)
  • Stages create structure: Build → Test → Deploy (sequential or parallel with dependsOn)
  • Jobs are execution units: Run on single agent, can run in parallel across multiple agents
  • Steps are actions: Tasks (pre-built), scripts (inline code), checkout (get source code)
  • YAML is the standard: Azure Pipelines moving from Classic to YAML (exam focuses on YAML), GitHub Actions is YAML-only

When to use (Pipeline Design Decisions):

  • ✅ Use GitHub Actions when: Your code is on GitHub, you want marketplace actions, simpler syntax, free minutes for public repos
  • ✅ Use Azure Pipelines when: Code on Azure Repos, need enterprise features (retention policies, compliance), integration with Azure Boards/Test Plans, more free parallel jobs for private projects
  • ✅ Use Microsoft-hosted agents when: Standard build needs, want clean environment every run, don't want infrastructure maintenance
  • ✅ Use Self-hosted agents when: Need access to internal resources, require specific software/hardware, want faster builds (agent pools with pre-cached dependencies), reduce costs for heavy usage
  • ❌ Don't use Classic pipelines for new projects: Microsoft is investing in YAML, Classic lacks features like templates and multi-stage support
  • ❌ Don't use Microsoft-hosted agents when: Need to access on-premises systems, require specific hardware (GPU for ML), need persistent state between runs

Limitations & Constraints:

  • Microsoft-hosted agent limits: job time is capped (Azure Pipelines: 360 minutes per job with paid parallel jobs, 60 minutes per job on the free tier for private projects; GitHub-hosted runners: 6 hours per job), no incoming connections allowed
  • Self-hosted agent requirements: Must maintain agent software updates, need network connectivity to Azure DevOps/GitHub, responsible for security hardening
  • Parallel job limits: 1 free parallel job for private projects (Azure Pipelines), need to purchase more or wait in queue
  • YAML complexity: Multi-stage pipelines can become large (1000+ lines), hard to debug (no breakpoints, must rely on logging)

💡 Tips for Understanding:

  • Think "trigger → agent → stages → jobs → steps": That's the execution hierarchy, each level adds control
  • Stages = deployment gates: Use stages to separate environments (build → staging → production), add approvals between stages
  • Jobs = parallelism: Multiple jobs in same stage run in parallel (unless dependsOn specified), speeds up execution
  • Steps = actual work: If it's not a step, it doesn't run (checkout, build, test, publish, deploy)

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Thinking self-hosted agents are always faster
    • Why it's wrong: First run is slow (downloads dependencies), only faster on subsequent runs if you don't clean (but then you risk dirty state bugs)
    • Correct understanding: Self-hosted is for access/cost reasons, not always speed. If speed is goal, use caching on Microsoft-hosted agents
  • Mistake 2: Putting everything in one stage
    • Why it's wrong: Can't add approvals mid-pipeline, can't selectively redeploy, harder to visualize progress
    • Correct understanding: Use stages to separate logical phases (Build, Test, Deploy-Staging, Deploy-Prod), enables better control and visibility
  • Mistake 3: Not using templates for common patterns
    • Why it's wrong: Copy-paste YAML across pipelines, changes require updating 20 files, errors multiply
    • Correct understanding: Create template files for common jobs (build-node.yml, deploy-webapp.yml), reference them, change once applies everywhere (see the sketch below)
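
A minimal template sketch for Mistake 3 above (file names and the nodeVersion parameter are illustrative):

# templates/build-node.yml - reusable steps template
parameters:
- name: nodeVersion
  type: string
  default: '18.x'

steps:
- task: NodeTool@0
  inputs:
    versionSpec: ${{ parameters.nodeVersion }}
  displayName: 'Install Node.js'
- script: npm ci && npm run build
  displayName: 'Install and build'

# azure-pipelines.yml - any pipeline that needs the shared build steps
steps:
- template: templates/build-node.yml
  parameters:
    nodeVersion: '20.x'

Changing the template file once updates every pipeline that references it.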

🔗 Connections to Other Topics:

  • Relates to Source Control Strategy because: Trigger rules reference branches, PR validation integrates with branch policies, pipeline-as-code stored in repository
  • Builds on Package Management by: Pipelines publish artifacts to feeds, restore dependencies from feeds, versioning strategies implemented in pipeline variables
  • Often used with Deployment Strategies to: Trigger blue-green swaps, control canary releases, implement progressive rollouts
  • Integrates with Security through: Service connections for authentication, secure variables for secrets, pipeline permissions aligned with Azure RBAC

Troubleshooting Common Issues:

  • Issue 1: "Pipeline triggers on every branch, not just main"
    • Problem: Trigger section missing or has wildcard
    • Solution: Add explicit trigger: trigger: branches: include: [main]
  • Issue 2: "Job fails with 'No hosted parallelism has been purchased'"
    • Problem: Free tier limit reached (1 parallel job), another pipeline is running
    • Solution: Wait for other pipeline to finish, or purchase parallel jobs, or use self-hosted agent
  • Issue 3: "Self-hosted agent not picking up jobs"
    • Problem: Agent offline, capabilities don't match demands, or pool permissions
    • Solution: Check agent is running (./run.sh), verify capabilities match pipeline demands, ensure pipeline has permission to access agent pool

Section 2: Package Management Strategy

Introduction

The problem: Applications depend on external libraries (packages). Without centralized management, developers pull packages from public internet (security risk), version conflicts arise (dependency hell), no control over what enters codebase (compliance nightmare). Teams waste time troubleshooting "works on my machine" issues caused by different package versions.

The solution: Implement package management strategy using Azure Artifacts or GitHub Packages. Create feeds (package repositories) for different purposes (development, production, upstream caching). Define versioning standards (SemVer for releases, CalVer for time-based). Control package lifecycle (publishing, promotion, retention). Result: Consistent dependencies across all environments, security scanning of packages, faster builds with upstream caching.

Why it's tested: Package management is fundamental to modern DevOps (20% of Domain 3). Exam tests: Choosing between Azure Artifacts and GitHub Packages, designing feed structures, implementing versioning strategies, configuring upstream sources, managing package retention.

Core Concepts

Package Feeds and Views

What it is: A package feed is a repository that stores packages (NuGet, npm, Maven, Python). Views are filtered subsets of a feed that show only packages meeting certain criteria (e.g., "Release" view shows only non-prerelease packages, "Latest" view shows only latest versions).

Why it exists: Organizations need to separate package maturity levels (development packages shouldn't mix with production-approved packages), control what developers can consume (some packages may have vulnerabilities), improve performance (upstream sources cache external packages locally). Feed views solve this by creating logical partitions without duplicating storage.

Real-world analogy: Think of a feed like a warehouse with different sections. The warehouse holds all inventory (all package versions), but you create "sections" (views) for different customers: "Retail Section" (stable products only), "Wholesale Section" (bulk items), "Clearance Section" (old versions). Same warehouse, different access points.

How it works (Detailed step-by-step):

  1. Feed creation: You create a feed in Azure Artifacts named "MyCompanyPackages" with visibility set to Organization (private) or Public
  2. Package publishing: CI pipeline publishes NuGet package "MyLibrary 1.0.0-beta" to feed, it enters the "@Local" view (all packages go here)
  3. View configuration: You create view "@Prerelease" (shows packages with version suffix like -beta, -alpha) and "@Release" (shows only stable versions without suffix)
  4. Package promotion: After testing, you promote "MyLibrary 1.0.0" to @Release view (now visible to production pipelines)
  5. Consumer configuration: Production pipelines configure feed URL with @Release view: https://pkgs.dev.azure.com/myorg/_packaging/MyCompanyPackages@Release/nuget/v3/index.json
  6. Restore process: Pipeline runs dotnet restore, connects to feed, sees only packages in @Release view (beta packages hidden), downloads approved packages only
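
The npm equivalent of the consumer configuration in step 5 looks like the following sketch (the NuGet version appears later in this section); the organization and feed names are the illustrative ones used above, and the exact URL should be copied from the feed's "Connect to feed" page:

# .npmrc in the consuming project - resolve all packages through the @Release view
registry=https://pkgs.dev.azure.com/myorg/_packaging/MyCompanyPackages@Release/npm/registry/
always-auth=true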

📊 Package Feed Architecture Diagram:

graph TB
    subgraph "Azure Artifacts Feed: MyCompanyPackages"
        LOCAL["@Local View<br/>(All Packages)"]
        PRERELEASE["@Prerelease View<br/>(Beta/Alpha packages)"]
        RELEASE["@Release View<br/>(Stable only)"]
        
        LOCAL --> PRERELEASE
        LOCAL --> RELEASE
    end
    
    subgraph "Upstream Sources"
        NUGET["nuget.org<br/>(Public NuGet)"]
        NPM["npmjs.com<br/>(Public npm)"]
    end
    
    subgraph "Consumers"
        DEV["Dev Pipelines"] --> LOCAL
        TEST["Test Pipelines"] --> PRERELEASE
        PROD["Prod Pipelines"] --> RELEASE
    end
    
    CI["CI Pipeline"] -->|Publish| LOCAL
    LOCAL -->|Cache| NUGET
    LOCAL -->|Cache| NPM
    
    style LOCAL fill:#e3f2fd
    style PRERELEASE fill:#fff3e0
    style RELEASE fill:#c8e6c9
    style PROD fill:#f3e5f5

See: diagrams/04_domain3_package_feed_architecture.mmd

Diagram Explanation (comprehensive breakdown):
This diagram illustrates a complete Azure Artifacts feed architecture with views and upstream sources. At the center is the MyCompanyPackages feed containing three views. The @Local view (blue) is the entry point where ALL packages land when published by CI Pipeline - it contains every version including prereleases, betas, and stable releases. From @Local, packages can be visible in two filtered views: @Prerelease view (orange) automatically shows packages with version suffixes (-beta, -alpha, -rc) for testing environments, and @Release view (green) shows only stable packages without suffixes for production use.

On the left, Upstream Sources (nuget.org and npmjs.com) are configured as package origins - when a pipeline requests a package not in the feed, Azure Artifacts fetches it from upstream and caches it in @Local view, so subsequent requests are instant (no internet call). On the right, Consumers show different pipeline types connecting to appropriate views: Dev Pipelines use @Local (can access all packages including experiments), Test Pipelines use @Prerelease (validate beta packages before release), Prod Pipelines use @Release (only approved stable packages).

Flow: Developer commits code → CI Pipeline builds and publishes "MyLib 1.2.0-beta" → Package enters @Local and @Prerelease views → Test pipeline tests beta → If tests pass, developer promotes package to version "1.2.0" (removes suffix) → Package now visible in @Release view → Production pipeline can consume it. Upstream caching means if pipeline requests "Newtonsoft.Json 13.0.1" (external package), feed checks @Local, doesn't find it, fetches from nuget.org, caches in @Local, returns to pipeline. Next pipeline requesting same package gets it from cache instantly.

Result: Security (all packages flow through your feed, can be scanned), Performance (upstream caching eliminates internet calls), Control (views ensure environments get appropriate package maturity levels).

Detailed Example 1: Publishing npm Package to GitHub Packages

You're building a shared React component library used across multiple projects in your organization. You want to publish it to GitHub Packages so other teams can consume it. GitHub Packages is free for public repos, tightly integrated with GitHub repositories.

Package.json configuration:

{
  "name": "@myorg/component-library",
  "version": "2.1.0",
  "repository": {
    "type": "git",
    "url": "https://github.com/myorg/component-library.git"
  },
  "publishConfig": {
    "registry": "https://npm.pkg.github.com/@myorg"
  }
}

GitHub Actions workflow (.github/workflows/publish.yml):

name: Publish Package

on:
  release:
    types: [published]

jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    
    steps:
    - uses: actions/checkout@v3
    
    - uses: actions/setup-node@v3
      with:
        node-version: '18'
        registry-url: 'https://npm.pkg.github.com'
        scope: '@myorg'
    
    - run: npm ci
    
    - run: npm run build
    
    - run: npm run test
    
    - run: npm publish
      env:
        NODE_AUTH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Consumer configuration (in other repos, .npmrc file):

@myorg:registry=https://npm.pkg.github.com
//npm.pkg.github.com/:_authToken=${NPM_TOKEN}

Breakdown: Package name must be scoped (@myorg/component-library) for GitHub Packages. publishConfig tells npm to publish to GitHub registry, not public npmjs.com. Workflow triggers on GitHub Release creation (when you tag v2.1.0 and create release). Job needs permissions: packages:write to push package, contents:read to checkout code. Setup-node action configures npm to authenticate with GitHub Packages using built-in GITHUB_TOKEN (no manual secret needed). npm ci installs deps (clean install), npm run build compiles TypeScript to JavaScript, npm run test validates package works, npm publish pushes package to GitHub Packages at https://npm.pkg.github.com/@myorg/component-library. Other teams consuming this package create .npmrc file telling npm where to find @myorg packages (GitHub, not npmjs), authenticate with personal access token (NPM_TOKEN secret in their repo), run npm install @myorg/component-library@2.1.0, package downloads from GitHub Packages. Result: Internal package stays within GitHub ecosystem, automatically linked to source code repository, includes a free storage allowance for private repos (500 MB on the free plan; Azure Artifacts' free tier is 2 GB), version tied to Git tags.

Detailed Example 2: Azure Artifacts with Upstream Sources

Your company builds .NET applications. You want all teams to restore NuGet packages through Azure Artifacts (for security scanning and caching), but don't want to manually copy every public package. Solution: Configure upstream sources.

Azure Artifacts feed setup (via Azure DevOps UI):

  1. Create feed "CompanyNuGet" with visibility: Organization
  2. Add upstream source: nuget.org (public NuGet gallery)
  3. Create view "@Approved" for packages that passed security scan

Pipeline configuration (azure-pipelines.yml):

steps:
- task: NuGetAuthenticate@1
  displayName: 'Authenticate with Azure Artifacts'

- task: DotNetCoreCLI@2
  displayName: 'Restore packages'
  inputs:
    command: 'restore'
    projects: '**/*.csproj'
    feedsToUse: 'select'
    vstsFeed: 'CompanyNuGet@Approved'
    includeNuGetOrg: false

- task: DotNetCoreCLI@2
  displayName: 'Build solution'
  inputs:
    command: 'build'
    projects: '**/*.sln'

NuGet.config (in repository):

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <clear />
    <add key="CompanyNuGet" value="https://pkgs.dev.azure.com/myorg/_packaging/CompanyNuGet@Approved/nuget/v3/index.json" />
  </packageSources>
</configuration>

What happens on first package request:

  1. Pipeline runs dotnet restore for project requiring "Newtonsoft.Json 13.0.3"
  2. Restore checks Azure Artifacts feed "CompanyNuGet@Approved" (configured in NuGet.config)
  3. Feed doesn't have Newtonsoft.Json 13.0.3 locally, checks upstream source (nuget.org)
  4. Fetches package from nuget.org (30 MB download, 2-3 seconds)
  5. Caches package in feed @Local view (security scan can now run on it)
  6. Returns package to pipeline (restore completes)
  7. Pipeline builds project successfully

What happens on subsequent requests (same package):

  1. Different pipeline or developer runs dotnet restore for same package
  2. Restore checks Azure Artifacts feed "CompanyNuGet@Approved"
  3. Feed finds Newtonsoft.Json 13.0.3 in local cache (already fetched before)
  4. Returns package immediately (0.2 seconds, no internet call)
  5. Restore completes 10X faster

Security benefit: InfoSec team configures Azure Defender for DevOps to scan all packages in CompanyNuGet feed. If Newtonsoft.Json 13.0.3 has vulnerability, alert triggers, package can be blocked from @Approved view. All projects automatically prevented from using vulnerable package. Result: Upstream caching improves build speed (cached packages are instant), security scanning protects codebase (all packages flow through scanning), compliance audit (track exactly which packages entered organization).

Detailed Example 3: Semantic Versioning (SemVer) Strategy

Your team maintains internal libraries consumed by 50+ microservices. You need versioning strategy that communicates breaking changes clearly so consumers know when updates are safe vs risky. Solution: Semantic Versioning (SemVer): MAJOR.MINOR.PATCH.

Versioning rules implementation:

  • MAJOR version (1.0.0 → 2.0.0): Breaking changes (API contracts change, require consumer code updates)
  • MINOR version (1.1.0 → 1.2.0): New features added, backward compatible (consumers can update without code changes)
  • PATCH version (1.1.0 → 1.1.1): Bug fixes only, backward compatible (always safe to update)

Example scenario - Library evolution:

v1.0.0 (Initial release):
  - UserService.GetUser(id) returns User object
  
v1.1.0 (Added feature, MINOR bump):
  - Added UserService.GetUserByEmail(email) method
  - GetUser(id) still works exactly as before
  - Consumers can update from 1.0.0 → 1.1.0 safely
  
v1.1.1 (Bug fix, PATCH bump):
  - Fixed null reference bug in GetUser
  - No API changes
  - Consumers should update 1.1.0 → 1.1.1 (bug fix)
  
v2.0.0 (Breaking change, MAJOR bump):
  - Changed GetUser(id) return type from User to Task<User> (async)
  - Consumers must update code: await GetUser(id)
  - Update 1.1.1 → 2.0.0 requires code changes

Pipeline implementation (automatically bump version):

variables:
  majorVersion: 2
  minorVersion: 3
  patchVersion: $[counter(variables['minorVersion'], 0)]
  packageVersion: $(majorVersion).$(minorVersion).$(patchVersion)

steps:
- script: dotnet pack -p:PackageVersion=$(packageVersion)
  displayName: 'Create package with version $(packageVersion)'

- task: NuGetCommand@2
  inputs:
    command: 'push'
    packagesToPush: '**/*.nupkg'
    nuGetFeedType: 'internal'
    publishVstsFeed: 'MyFeed'

Consumer package.json dependency configurations:

{
  "dependencies": {
    "@mycompany/lib-stable": "1.1.1",
    "@mycompany/lib-minor-updates": "^1.1.0",
    "@mycompany/lib-patch-only": "~1.1.0",
    "@mycompany/lib-bleeding-edge": "*"
  }
}

Version range meanings:

  • "1.1.1" (exact): Only version 1.1.1, no automatic updates (maximum stability, miss bug fixes)
  • "^1.1.0" (caret): 1.1.0 to <2.0.0 (accept minor and patch updates, no breaking changes)
  • "~1.1.0" (tilde): 1.1.0 to <1.2.0 (accept patch updates only, very conservative)
  • "*" (wildcard): Any version (dangerous, could get breaking changes)

Result: SemVer provides contract between library maintainer and consumers. MAJOR bump signals "read changelog, expect code changes", MINOR bump signals "safe to update, new features available", PATCH bump signals "bug fixes, update recommended". Automated counter in pipeline ensures each build gets unique version (2.3.0, 2.3.1, 2.3.2...). Consumers use version ranges to control update aggressiveness (^ for active development, ~ for production stability).

Must Know (Package Management Critical Facts):

  • Azure Artifacts: Integrated with Azure DevOps, supports NuGet/npm/Maven/Python/Universal packages, 2GB free then paid, feed views for maturity levels, upstream sources for caching
  • GitHub Packages: Integrated with GitHub, supports npm/NuGet/Maven/Docker/RubyGems, free for public repos, scoped to repository or organization, automatic linking to source code
  • Feed views: Filter packages by maturity (@Local = all, @Prerelease = beta/alpha, @Release = stable only), configure pipelines to use different views by environment
  • Upstream sources: Cache external packages (nuget.org, npmjs.com) in your feed, improves speed (cached packages instant), enables security scanning before use, tracks external dependencies
  • SemVer (Semantic Versioning): MAJOR.MINOR.PATCH format, MAJOR = breaking changes, MINOR = new features (compatible), PATCH = bug fixes, enables safe dependency updates
  • Package promotion: Move package between views (test in @Prerelease, promote to @Release after validation), ensures production only gets approved packages

When to use (Package Management Decisions):

  • ✅ Use Azure Artifacts when: Already using Azure DevOps, need universal package support, require feed views for package promotion, want built-in upstream caching, organization-wide package sharing
  • ✅ Use GitHub Packages when: Code is on GitHub, want tight source-code integration, only need npm/NuGet/Docker, public repos (free), prefer simpler setup
  • ✅ Use Feed views when: Need to separate package maturity (dev/staging/prod), want to promote packages through stages, require approval process before production use
  • ✅ Use Upstream sources when: Consume public packages (npm, NuGet), want security scanning of external dependencies, need faster builds (caching), must audit all packages
  • ✅ Use SemVer when: Publishing libraries/packages consumed by other teams, need to communicate breaking vs safe changes, want predictable update behavior
  • ❌ Don't use Multiple feeds for same package type: Creates confusion (which feed has latest?), duplication, use views instead
  • ❌ Don't use Public internet package sources directly in production pipelines: Security risk (packages not scanned), reliability risk (npmjs.com downtime breaks your build), use upstream sources

Limitations & Constraints:

  • Azure Artifacts storage: 2GB free per organization, then $2/GB/month, includes all package types (NuGet, npm, Maven combined)
  • GitHub Packages storage: 500MB free for private repos, 1GB bandwidth/month, then $0.25/GB storage, unlimited for public repos
  • Feed retention: Default 30 days for older package versions, configure retention policy to auto-delete, keeps storage costs down
  • Package size limits: Azure Artifacts 500MB per package file, GitHub Packages 5GB per package, Docker images count toward limits
  • Upstream caching delay: First request hits internet (slow), subsequent requests use cache (fast), cache updates every 3-6 hours

💡 Tips for Understanding:

  • Think "Pipeline Publishes → Feed Stores → Pipeline Consumes": That's the package lifecycle in three steps
  • Views are filters, not copies: @Release view doesn't duplicate packages, just shows subset of @Local, saves storage
  • Upstream sources = performance + security: Cache external packages locally (speed), scan them (security), track them (compliance)
  • Version ranges are contracts: ^ means "trust minor updates", ~ means "trust patches only", exact means "never change"

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Publishing every build to feed (including failed/test builds)
    • Why it's wrong: Feed fills with junk versions, developers confused which version to use, storage costs increase
    • Correct understanding: Only publish from main/release branches, or from successful PR validation, use build number for uniqueness
  • Mistake 2: Not using views, giving everyone access to @Local
    • Why it's wrong: Production pipelines can accidentally use beta packages, no promotion workflow, can't distinguish stable from experimental
    • Correct understanding: Dev uses @Local, Test uses @Prerelease, Prod uses @Release, each environment gets appropriate maturity
  • Mistake 3: Bypassing feed by allowing pipelines to hit public internet
    • Why it's wrong: No security scanning (vulnerable packages enter codebase), no audit trail (can't see what was consumed), build breaks if npmjs.com is down
    • Correct understanding: Configure NuGet.config/npmrc to only use your feed, feed uses upstream sources to fetch external packages, everything flows through security

🔗 Connections to Other Topics:

  • Relates to Pipeline Design because: Pipelines publish packages (build artifacts), restore packages (dependencies), version derived from build number
  • Builds on Source Control by: Package version tied to git tags/commits, package.json/csproj stored in repository, semantic versioning reflects code changes
  • Often used with Security to: Scan packages for vulnerabilities, authenticate access to feeds (PATs, service principals), manage package credentials in Key Vault
  • Integrates with Deployment through: Packages are deployment units (NuGet → NuGet deploy, Docker → Kubernetes deploy), package version determines what gets deployed

Troubleshooting Common Issues:

  • Issue 1: "Package restore fails with 401 Unauthorized"
    • Problem: Pipeline not authenticated to feed, or personal access token expired
    • Solution: Use NuGetAuthenticate@1 task (Azure Pipelines) or GITHUB_TOKEN (GitHub Actions), regenerate PAT if expired
  • Issue 2: "Published package doesn't appear in @Release view"
    • Problem: Package version has prerelease suffix (-beta), @Release view filters those out
    • Solution: Remove suffix from version (2.1.0-beta → 2.1.0) or add package to @Release view manually (promote)
  • Issue 3: "Upstream source package fetch is very slow"
    • Problem: First fetch always hits internet (can't cache what you don't have), large package (500MB) takes time
    • Solution: This is expected behavior (first fetch slow, subsequent fast), or pre-populate feed by manually uploading common packages

Section 3: Testing Strategy for Pipelines

Introduction

The problem: Code defects discovered in production are 100X more expensive to fix than defects found during development. Without automated testing in pipelines, manual testing creates bottleneck (QA team overwhelmed), inconsistency (tests skipped under pressure), late detection (bugs found weeks after coding). Teams ship broken code, customers experience failures, reputation damaged.

The solution: Implement comprehensive testing strategy in CI/CD pipelines. Run tests automatically on every commit (shift-left testing). Create quality gates (pipelines fail if tests fail, broken code never merges). Layer tests by type (unit tests for logic, integration tests for APIs, load tests for performance). Measure code coverage (ensure tests actually exercise code). Result: Defects caught in minutes not weeks, consistent quality enforcement, faster development cycles (confident deploys).

Why it's tested: Testing strategy is critical DevOps practice (15% of Domain 3). Exam tests: Designing quality gates, implementing test pyramid (unit/integration/e2e), configuring test tasks and agents, analyzing code coverage, managing flaky tests.

Core Concepts

Quality Gates and Release Gates

What it is: Quality gates are automated checks in CI pipelines that must pass before code can merge (e.g., "80% code coverage required", "0 critical bugs", "all tests pass"). Release gates are approval checkpoints in CD pipelines before deploying to environments (e.g., "security scan passed", "manual approval from manager", "incident count is low").

Why it exists: Prevents quality degradation by enforcing standards automatically. Without gates, developers can merge failing code (pressure to ship fast), deploy to production during incidents (should wait for stability), skip security scans (convenience over safety). Gates act as guardrails: pipeline stops if quality standards not met, human judgment required for critical deploys, compliance requirements enforced programmatically.

Real-world analogy: Like TSA security at airport. Quality gates are the metal detector and X-ray (automated checks, everyone must pass, no exceptions). Release gates are the customs officer (human review at specific checkpoints, judgment call on suspicious items). You can't board (deploy) until you pass both.

How it works (Detailed step-by-step):

  1. Quality gate definition: In Azure Pipelines, you add task PublishTestResults@2, configure branch policy requiring test runs to pass with 80% pass rate, set code coverage threshold in SonarQube quality gate
  2. PR creation: Developer creates pull request to merge feature branch → main
  3. Pipeline trigger: PR validation pipeline triggers automatically (branch policy enforcement)
  4. Test execution: Pipeline runs unit tests (500 tests), integration tests (50 tests), publishes results to Azure DevOps
  5. Quality gate evaluation: System checks: Did 80% of tests pass? (Yes, 540/550 passed). Is code coverage ≥60%? (Yes, 65%). Are there critical bugs? (No). All gates passed
  6. PR status update: Azure DevOps marks PR as "Build succeeded, requirements met", green checkmark appears, merge button enabled
  7. If gates fail: Suppose only 70% tests passed. Pipeline publishes results, branch policy evaluates, gate fails, PR marked "Build failed - testing requirements not met", merge button disabled, developer must fix tests before merging
  8. Release gate (deployment): After merge, release pipeline triggers for staging deploy. Before deployment, release gate checks: (A) Work items in "testing" state count < 5 (currently 3, pass), (B) Security scan completed in last 24hr (yes, pass), (C) Manual approval from @SecurityTeam (pending, pipeline pauses). Security team reviews, approves, gate passes, deployment proceeds to staging
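Release gates like step 8 are modeled in YAML pipelines by targeting an environment whose approvals and checks (manual approval, Azure Monitor query, business hours) are configured in the Azure DevOps portal; a minimal sketch (environment name is illustrative):

stages:
- stage: DeployStaging
  jobs:
  - deployment: Deploy
    environment: 'staging'   # approvals and checks configured on this environment run before the job starts
    strategy:
      runOnce:
        deploy:
          steps:
          - script: echo "Deploying - all environment checks have passed"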

📊 Quality Gates and Release Gates Flow Diagram:

graph TD
    subgraph "CI Pipeline - Quality Gates"
        CODE[Code Commit] --> BUILD[Build Code]
        BUILD --> UNIT[Run Unit Tests]
        UNIT --> INT[Run Integration Tests]
        INT --> COV[Code Coverage Analysis]
        COV --> QG{Quality Gates<br/>Pass?}
        QG -->|Yes| MERGE[Allow Merge]
        QG -->|No| BLOCK[Block Merge]
    end
    
    subgraph "CD Pipeline - Release Gates"
        MERGE --> DEPLOY_START[Start Release]
        DEPLOY_START --> SEC{Security Scan<br/>Passed?}
        SEC -->|Yes| INC{Incident Count<br/>Low?}
        SEC -->|No| REJECT1[Deployment Blocked]
        INC -->|Yes| APPROVAL{Manual<br/>Approval?}
        INC -->|No| REJECT2[Deployment Blocked]
        APPROVAL -->|Approved| DEPLOY[Deploy to Staging]
        APPROVAL -->|Rejected| REJECT3[Deployment Blocked]
    end
    
    style QG fill:#fff3e0
    style SEC fill:#fff3e0
    style INC fill:#fff3e0
    style APPROVAL fill:#fff3e0
    style MERGE fill:#c8e6c9
    style DEPLOY fill:#c8e6c9
    style BLOCK fill:#ffcdd2
    style REJECT1 fill:#ffcdd2
    style REJECT2 fill:#ffcdd2
    style REJECT3 fill:#ffcdd2

See: diagrams/04_domain3_quality_release_gates.mmd

Diagram Explanation: This diagram shows the two-phase gating system in modern DevOps pipelines. The top section illustrates Quality Gates in the CI Pipeline (Continuous Integration), which act as automated checks preventing bad code from merging. Flow starts with Code Commit, which triggers Build Code step (compilation). If build succeeds, pipeline runs Unit Tests (500+ fast tests checking individual functions), then Integration Tests (50-100 medium-speed tests validating API interactions). Next, Code Coverage Analysis measures what percentage of code is exercised by tests. All results feed into Quality Gates decision point (orange diamond): System evaluates "Did ≥80% of tests pass?", "Is code coverage ≥60%?", "Are there critical bugs?". If ALL conditions met, Quality Gate passes (green arrow) → Allow Merge (code can enter main branch). If ANY condition fails, Quality Gate fails (red arrow) → Block Merge (PR cannot merge, developer must fix). This prevents broken code from entering codebase. The bottom section shows Release Gates in CD Pipeline (Continuous Deployment), which are approval checkpoints before environment deployment. After successful merge, Start Release initiates deployment pipeline. First release gate: Security Scan check (orange diamond) - "Did vulnerability scan complete in last 24 hours? Were critical CVEs found?". If scan failed or has critical vulnerabilities → Deployment Blocked (red). If passed → continue to next gate: Incident Count check - "Are there <5 active incidents in production?". If incident count high (system unstable) → Deployment Blocked (wise to wait for stability). If low → continue to Manual Approval gate: Security team or manager must explicitly approve deploy (human judgment for production changes). Rejected → Deployment Blocked. Approved → Deploy to Staging (green box, deployment proceeds). Result: Quality gates enforce technical standards automatically (tests, coverage), Release gates enforce operational safety (security, stability, human oversight). Together they prevent both code defects (quality) and risky deploys (operational risk).

Detailed Example 1: Branch Policy Quality Gate

You want to ensure all code merged to main branch meets quality standards: all tests pass, code coverage ≥60%, security scan clean. Manual review too slow (20 PRs/day). Solution: Configure branch policies as quality gates.

Azure DevOps branch policy configuration (via UI → Repos → Branches → main → Branch Policies):

# Build validation
Require build validation:
  Build pipeline: CI-Pipeline
  Trigger: Automatic (when PR updated)
  Policy requirement: Required (blocks PR if build fails)
  Build expiration: Immediately (re-run on every commit)
  Display name: "CI Validation"

# Status checks
Require status checks to pass:
  - SonarQube Quality Gate: Required
  - Security Scan: Required
  - Code Coverage ≥60%: Required

# Code reviewers
Require minimum number of reviewers: 2
Reset votes when source branch updated: Yes
Allow requestors to approve changes: No

Pipeline with quality checks (azure-pipelines.yml):

trigger:
  branches:
    include: [main]

pr:
  branches:
    include: [main]

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: DotNetCoreCLI@2
  displayName: 'Restore packages'
  inputs:
    command: 'restore'

# SonarQube analysis requires a Prepare step before the build
# (service connection name and project key/name are examples)
- task: SonarQubePrepare@5
  displayName: 'Prepare SonarQube analysis'
  inputs:
    SonarQube: 'SonarQubeConnection'
    scannerMode: 'MSBuild'
    projectKey: 'my-project'
    projectName: 'My Project'

- task: DotNetCoreCLI@2
  displayName: 'Build solution'
  inputs:
    command: 'build'

- task: DotNetCoreCLI@2
  displayName: 'Run unit tests'
  inputs:
    command: 'test'
    arguments: '--collect:"XPlat Code Coverage"'
    publishTestResults: true

- task: PublishCodeCoverageResults@1
  displayName: 'Publish code coverage'
  inputs:
    codeCoverageTool: 'Cobertura'
    summaryFileLocation: '**/coverage.cobertura.xml'
    failIfCoverageEmpty: true

- task: SonarQubeAnalyze@5
  displayName: 'Run SonarQube analysis'

- task: SonarQubePublish@5
  displayName: 'Publish Quality Gate result'
  inputs:
    pollingTimeoutSec: '300'

Developer workflow:

  1. Create PR: Developer creates PR from feature/login → main
  2. Auto-trigger: Branch policy automatically triggers CI-Pipeline (no manual action)
  3. Pipeline runs: Restores packages, builds code, runs 500 unit tests (490 pass, 10 fail), calculates coverage (58%), runs SonarQube (finds 2 critical bugs)
  4. Quality gate evaluation:
    • Test pass rate: 490/500 = 98% ✅ (meets ≥80% requirement)
    • Code coverage: 58% ❌ (below 60% requirement)
    • SonarQube bugs: 2 critical ❌ (must be 0)
  5. PR status: Azure DevOps shows "Build failed - Quality gates not met", lists failures: "Code coverage below threshold (58% < 60%)", "SonarQube found 2 critical bugs". Merge button disabled (gray, can't click)
  6. Developer fixes: Adds tests to increase coverage to 62%, fixes 2 critical bugs, commits changes to feature branch
  7. Auto re-run: Branch policy detects new commit, re-triggers CI-Pipeline automatically
  8. Second evaluation:
    • Test pass rate: 510/510 = 100% ✅
    • Code coverage: 62% ✅ (meets 60%)
    • SonarQube bugs: 0 critical ✅
  9. PR approved: Status changes to "Build succeeded - Requirements met", green checkmark appears, merge button enabled
  10. Human review: Still needs 2 code reviewers (separate from quality gates), they approve, developer clicks merge, code enters main branch

Result: Quality standards enforced automatically (no human oversight needed for technical checks), broken code physically cannot merge (disabled button), developers get immediate feedback (5 minutes after commit, not 2 days later from QA), consistent quality (same standards for all PRs, no exceptions).

Test Pyramid Strategy

What it is: Test pyramid is a testing strategy that balances test coverage with execution speed by organizing tests into three layers: Unit tests at the base (70% of tests, fast, cheap), Integration tests in the middle (20% of tests, medium speed), End-to-End tests at the top (10% of tests, slow, expensive). More tests at bottom (fast), fewer at top (slow).

Why it exists: Running only E2E tests is too slow (1 hour to get feedback, developers context-switched to other work, expensive infrastructure). Running only unit tests misses integration bugs (database connection failures, API contract mismatches). Pyramid balances coverage (all types tested) with speed (most tests are fast unit tests giving quick feedback).

Real-world analogy: Building inspection process. Unit tests are like checking individual bricks (is each brick solid? cracked?). Integration tests are like checking walls (do bricks bond together? mortar correct?). E2E tests are like checking whole building (does roof not leak when it rains? do doors open?). You check thousands of bricks (fast, cheap), hundreds of walls (medium effort), final building once (slow, expensive). Same pyramid shape.

How it works (Detailed step-by-step):

  1. Unit tests (70% of suite): Test individual functions/methods in isolation. Example: calculateTax(income) function - test with income=50000 (expect tax=7500), income=0 (expect tax=0), income=-1000 (expect exception). Run in milliseconds (no database, no network), mocked dependencies. Pipeline runs 3500 unit tests in 2 minutes
  2. Integration tests (20% of suite): Test component interactions. Example: API test - POST /api/users with valid data, verify database has new user, GET /api/users/123 returns correct data. Requires real database (Docker container spun up), API server running. Pipeline runs 1000 integration tests in 10 minutes
  3. End-to-End tests (10% of suite): Test complete user workflows. Example: E2E test for "user signup" - Open browser, navigate to /signup, fill form, click submit, verify confirmation email sent, click email link, verify account activated, login, verify dashboard loads. Requires full environment (browser, web server, API, database, email service). Pipeline runs 500 E2E tests in 45 minutes
  4. Pipeline execution strategy: Every commit triggers unit tests (2 min feedback). PR validation runs unit + integration (12 min feedback). Nightly build runs all tests including E2E (60 min, overnight). Deploy to staging runs smoke tests subset of E2E (10 min, critical paths only)
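One way to layer this execution strategy in a single pipeline definition is to gate the slower suites on Build.Reason and a nightly schedule; a sketch (test project patterns are placeholders):

trigger:
  branches:
    include: [main]

schedules:
- cron: '0 2 * * *'            # nightly at 02:00 UTC for the full E2E suite
  displayName: Nightly full test run
  branches:
    include: [main]
  always: true

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: DotNetCoreCLI@2
  displayName: 'Unit tests (every run)'
  inputs:
    command: 'test'
    projects: '**/*UnitTests.csproj'

- task: DotNetCoreCLI@2
  displayName: 'Integration tests (PR validation only)'
  condition: eq(variables['Build.Reason'], 'PullRequest')
  inputs:
    command: 'test'
    projects: '**/*IntegrationTests.csproj'

- task: DotNetCoreCLI@2
  displayName: 'E2E tests (nightly schedule only)'
  condition: eq(variables['Build.Reason'], 'Schedule')
  inputs:
    command: 'test'
    projects: '**/*E2ETests.csproj'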

📊 Test Pyramid Strategy Diagram:

graph TB
    subgraph "Test Pyramid - DevOps Strategy"
        E2E["End-to-End Tests<br/>10-15% of tests<br/>Slow, Expensive<br/>Full user workflows"]
        INT["Integration Tests<br/>20-30% of tests<br/>Medium speed<br/>API/Service interactions"]
        UNIT["Unit Tests<br/>50-70% of tests<br/>Fast, Cheap<br/>Individual functions/methods"]
        
        E2E --> INT
        INT --> UNIT
    end
    
    subgraph "Pipeline Execution"
        UNIT --> FAST["Run in parallel<br/>1-5 minutes<br/>Every commit"]
        INT --> MEDIUM["Run selectively<br/>5-15 minutes<br/>PR validation"]
        E2E --> SLOW["Run nightly<br/>30-60 minutes<br/>Scheduled/Pre-deploy"]
    end
    
    style UNIT fill:#c8e6c9
    style INT fill:#fff3e0
    style E2E fill:#ffcdd2
    style FAST fill:#e8f5e9
    style MEDIUM fill:#fff9c4
    style SLOW fill:#ffebee

See: diagrams/04_domain3_test_pyramid.mmd

Diagram Explanation: The test pyramid (top section) visualizes the ideal distribution of test types in a DevOps pipeline, shaped like a pyramid to represent both quantity and execution characteristics. At the base (widest part, green) are Unit Tests comprising 50-70% of the test suite - these are fast (milliseconds each) and cheap (no infrastructure needed) because they test individual functions/methods in isolation with mocked dependencies. Example: Testing a calculateDiscount(price, percentage) function with various inputs (price=100, percentage=10 → expect 90). Moving up the pyramid, Integration Tests (middle layer, orange) make up 20-30% of tests - medium speed (seconds each) and moderate cost (need database, message queues) because they validate how components interact. Example: Testing that when API receives POST /orders, it correctly writes to database AND sends message to queue AND returns 201 status. At the top (smallest section, red) are End-to-End Tests comprising only 10-15% of suite - slow (minutes each) and expensive (require full environment: browser, multiple services, database, external APIs) because they test complete user workflows. Example: Selenium test that opens browser, logs in, adds item to cart, checks out, verifies order confirmation email. The pyramid shape ensures most tests are fast (quick feedback loop) while still having coverage of integration and user scenarios. The bottom section (Pipeline Execution) maps each test layer to execution strategy. Unit Tests (green) run in parallel across multiple agents, complete in 1-5 minutes, trigger on every commit (continuous feedback). Integration Tests (orange) run selectively (only on PR validation or when integration code changes), complete in 5-15 minutes, provide medium-latency feedback. End-to-End Tests (red) run nightly on schedule or pre-deployment only (not every commit - too slow), complete in 30-60 minutes, validate system health before releases. Result: Developers get feedback in 2 minutes from unit tests (90% of bugs caught here), 10 minutes from integration tests (API contract issues), full validation overnight (E2E catches UI/workflow bugs). Anti-pattern (inverted pyramid): Having 70% E2E tests and 10% unit tests → 2 hour feedback loop, flaky tests, expensive infrastructure, developers wait hours for results. Correct pyramid: Most tests fast and stable (unit), fewer tests slower but broader (integration), fewest tests slowest but comprehensive (E2E).

Detailed Example 2: Code Coverage Analysis with Thresholds

Your team ships critical financial software. Management requires proof that code is adequately tested before production deployment. Solution: Implement code coverage analysis with enforced thresholds.

Pipeline configuration with coverage (azure-pipelines.yml):

steps:
- task: DotNetCoreCLI@2
  displayName: 'Run tests with coverage'
  inputs:
    command: 'test'
    projects: '**/*Tests.csproj'
    arguments: '--configuration Release --collect:"XPlat Code Coverage" --results-directory $(Agent.TempDirectory)'

- task: PublishCodeCoverageResults@1
  displayName: 'Publish coverage results'
  inputs:
    codeCoverageTool: 'Cobertura'
    summaryFileLocation: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'
    failIfCoverageEmpty: true
    
- task: BuildQualityChecks@8
  displayName: 'Check coverage threshold'
  inputs:
    checkCoverage: true
    coverageFailOption: 'build'
    coverageType: 'lines'
    coverageThreshold: '75'

What coverage measures (example):

public class PaymentProcessor {
    public decimal CalculateFee(decimal amount, string customerType) {
        if (amount < 0) {
            throw new ArgumentException("Amount cannot be negative");  // Line 3
        }
        
        decimal baseFee = amount * 0.029m;  // Line 6
        
        if (customerType == "Premium") {  // Line 8
            baseFee = baseFee * 0.5m;  // Line 9 - 50% discount for premium
        } else if (customerType == "Enterprise") {  // Line 10
            baseFee = 0;  // Line 11 - no fees for enterprise
        }
        
        return baseFee;  // Line 14
    }
}

// Test coverage scenario 1 (poor coverage - 57%)
[Test]
public void CalculateFee_StandardCustomer_ReturnsFee() {
    var processor = new PaymentProcessor();
    var fee = processor.CalculateFee(100, "Standard");
    Assert.AreEqual(2.90m, fee);
}
// Lines executed: 6, 8 (false branch), 10 (false branch), 14
// Lines NOT executed: 3, 9, 11 (exception, Premium, and Enterprise paths never tested)
// Coverage: 4 of 7 lines = 57% ❌ Below 75% threshold

// Test coverage scenario 2 (good coverage - 100%)
[Test]
public void CalculateFee_NegativeAmount_ThrowsException() {
    var processor = new PaymentProcessor();
    Assert.Throws<ArgumentException>(() => processor.CalculateFee(-10, "Standard"));
}
// Covers line 3 ✓

[Test]
public void CalculateFee_StandardCustomer_ReturnsFee() {
    var processor = new PaymentProcessor();
    var fee = processor.CalculateFee(100, "Standard");
    Assert.AreEqual(2.90m, fee);
}
// Covers lines 6, 8 (false), 10 (false), 14 ✓

[Test]
public void CalculateFee_PremiumCustomer_Returns50PercentDiscount() {
    var processor = new PaymentProcessor();
    var fee = processor.CalculateFee(100, "Premium");
    Assert.AreEqual(1.45m, fee);
}
// Covers lines 6, 8 (true), 9, 14 ✓

[Test]
public void CalculateFee_EnterpriseCustomer_ReturnsZeroFee() {
    var processor = new PaymentProcessor();
    var fee = processor.CalculateFee(100, "Enterprise");
    Assert.AreEqual(0m, fee);
}
// Covers lines 6, 8 (false), 10 (true), 11, 14 ✓

// All lines executed at least once = 100% coverage ✅

Pipeline execution result:

  • Tests run, coverage tool instruments code (tracks which lines execute)
  • Test 1 (negative amount) executes: line 3 (exception path, remaining lines skipped)
  • Test 2 (standard customer) executes: lines 6, 8, 10, 14
  • Test 3 executes: lines 6, 8, 9, 14 (Premium path)
  • Test 4 executes: lines 6, 8, 10, 11, 14 (Enterprise path)
  • Coverage report generated: 7/7 lines covered = 100%
  • BuildQualityChecks task evaluates: 100% ≥ 75% threshold → PASS ✅
  • Coverage report published to Azure DevOps (visual charts, file-by-file breakdown)
  • If coverage was 70%, task would fail pipeline with error: "Coverage 70% is below required threshold 75%"

Result: Code coverage ensures tests actually exercise code paths (not just dummy tests that run but don't check anything). Threshold enforcement prevents coverage regression (can't merge PR that drops coverage from 80% to 70%). Visual reports identify untested code (red highlighting in Azure DevOps shows which exact lines have no tests). Management has audit trail (code is 75%+ tested, compliance requirement met).

Must Know (Testing Strategy Critical Facts):

  • Quality gates: Automated checks in CI pipeline (test pass rate, code coverage, security scan) that must pass before code can merge, enforced via branch policies
  • Release gates: Approval checkpoints in CD pipeline before deployment (manual approval, incident check, security validation), can pause pipeline until conditions met
  • Test pyramid: 70% unit tests (fast, isolated), 20% integration tests (API/database), 10% E2E tests (full workflows), optimize for speed while maintaining coverage
  • Code coverage: Percentage of code executed by tests (line coverage, branch coverage), enforced with thresholds (e.g., ≥75% required), published with PublishCodeCoverageResults@1 task
  • Shift-left testing: Run tests early in pipeline (every commit), not late (pre-deploy), find bugs in minutes not days, cheaper to fix (context still fresh)
  • Test execution: Unit tests run on every commit (fast feedback), integration on PR validation (medium), E2E nightly or pre-deploy (slow but comprehensive)

When to use (Testing Strategy Decisions):

  • ✅ Use Quality gates when: Need to enforce coding standards automatically (test coverage, linting, security), want to prevent bad code from merging, require compliance audit trail
  • ✅ Use Release gates when: Deploying to production (need approvals), want conditional deployment (only if no active incidents), need to pause for manual validation
  • ✅ Use Unit tests when: Testing business logic, calculations, data transformations (pure functions), want fast feedback (milliseconds), can mock dependencies
  • ✅ Use Integration tests when: Testing API contracts, database interactions, message queue operations (component boundaries), can accept slower feedback (seconds)
  • ✅ Use E2E tests when: Testing critical user journeys (login, checkout), need browser automation, can tolerate slow execution (minutes), run nightly or pre-deploy
  • ❌ Don't use Only E2E tests: Too slow (hours for feedback), flaky (UI changes break tests), expensive (infrastructure costs), hard to debug
  • ❌ Don't use 100% coverage as goal: Diminishing returns (last 10% is very hard), focus on critical paths, 70-80% is pragmatic target

Limitations & Constraints:

  • Code coverage accuracy: Measures lines executed, not quality of assertions (can have 100% coverage with useless tests that don't validate anything)
  • Flaky tests: E2E tests especially prone to random failures (timing issues, network glitches), require retry logic and good logging
  • Test data management: Integration/E2E tests need test databases (Docker containers, cleanup between runs), data setup overhead
  • Pipeline duration: The full test suite takes significant time (e.g., 3,500 unit + 1,000 integration + 500 E2E tests ≈ an hour of run time), so balance coverage against speed by layering execution

💡 Tips for Understanding:

  • Quality gates = automated, Release gates = human judgment: Quality gates check facts (test pass?), Release gates check context (is now a good time to deploy?)
  • Think bottom-up in pyramid: Start with unit tests (foundation), add integration (middle), finish with E2E (top), not inverted
  • Coverage percentage is a metric, not a goal: 75% coverage with good assertions beats 100% coverage with dummy tests

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Aiming for 100% code coverage
    • Why it's wrong: Last 10% is edge cases and error paths (hard to test), diminishing returns, false sense of security (coverage doesn't mean quality)
    • Correct understanding: Target 70-80% coverage, focus on critical paths (payment processing, user auth), accept that some code (logging, simple getters) doesn't need tests
  • Mistake 2: Running all tests on every commit
    • Why it's wrong: Developer waits 60 minutes for E2E tests, context switched to other work, productivity loss
    • Correct understanding: Commit triggers unit tests (2 min), PR triggers unit+integration (12 min), nightly runs all including E2E (60 min), layer feedback by speed
  • Mistake 3: No retry logic for flaky tests
    • Why it's wrong: Single network timeout fails entire E2E suite, developer re-runs pipeline, wastes time investigating false failures
    • Correct understanding: Configure test retry (3 attempts) for E2E tests, mark consistently failing tests for investigation, accept some flakiness in UI tests
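For Mistake 3, retry behavior can be declared on the test step itself; a minimal sketch (assembly pattern is a placeholder):

steps:
- task: VSTest@2
  displayName: 'Run E2E tests with retry'
  retryCountOnTaskFailure: 2        # re-run the whole task up to 2 more times if it fails
  inputs:
    testSelector: 'testAssemblies'
    testAssemblyVer2: '**/*E2E*.dll'
    rerunFailedTests: true          # re-run only the failed tests within the run
    rerunMaxAttempts: 3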

🔗 Connections to Other Topics:

  • Relates to Pipeline Design because: Tests are pipeline steps, quality gates enforced via branch policies, test execution determines pipeline duration
  • Builds on Source Control by: Branch policies trigger test pipelines, PR validation runs tests, test failures block merge
  • Often used with Security to: Include security tests (SAST, dependency scanning) in quality gates, validate no vulnerabilities before merge
  • Integrates with Deployment through: Release gates pause deployment for validation, smoke tests run after deploy to verify health

Troubleshooting Common Issues:

  • Issue 1: "Code coverage dropped from 75% to 60% but I only added new code"
    • Problem: Added 100 new lines of code but only 25 lines have tests (25% coverage), overall percentage dropped
    • Solution: Write tests for new code before merging (TDD approach), or add tests to bring new code to ≥75% coverage
  • Issue 2: "E2E tests pass locally but fail in pipeline"
    • Problem: Different environment (pipeline uses Ubuntu, local is Windows), timing issues (pipeline slower), test data mismatch
    • Solution: Run tests in Docker container locally (matches pipeline), add explicit waits (not sleep, use wait-for conditions), ensure test data seeded consistently
  • Issue 3: "Quality gate blocks PR but all tests passed"
    • Problem: SonarQube found code smells or security hotspots (not test failures), coverage below threshold, security scan found vulnerability
    • Solution: Check ALL quality gate conditions (not just tests), review SonarQube report, fix code quality issues or suppress false positives

Section 4: Deployment Strategies

Introduction

The problem: Traditional deployments cause downtime (application offline while deploying new version), high risk (if deployment fails, entire system down), no rollback plan (manually revert changes, takes hours), customer impact (users see errors during deployment). Friday night deploys become emergency events (team on call, stressful, error-prone).

The solution: Implement advanced deployment strategies that enable zero-downtime deployments. Blue-green: Run two identical environments, switch traffic instantly, rollback in seconds. Canary: Deploy to small subset of users first, monitor metrics, gradually increase if healthy, rollback if issues. Feature flags: Deploy code to production but keep features disabled, enable for specific users, instant rollback by toggling flag. Ring deployment: Progressive rollout starting with internal users, then early adopters, finally everyone. Result: Deployments happen during business hours (low risk), instant rollbacks (flip switch), gradual validation (catch issues early), happy customers (no downtime).

Why it's tested: Deployment strategies are core DevOps skill (20% of Domain 3). Exam tests: Blue-green vs canary differences, slot swap configuration, feature flag implementation with Azure App Configuration, ring deployment design, minimizing downtime techniques.

Core Concepts

Blue-Green Deployment

What it is: Blue-green deployment maintains two identical production environments: "Blue" (current version serving users) and "Green" (new version being prepared). Deploy new version to Green environment, run smoke tests, when ready switch router/load balancer from Blue → Green instantly. If issues found, switch back Blue ← Green instantly. Old Blue environment remains running as fallback.

Why it exists: Eliminates deployment downtime (users never see "under maintenance" page), enables instant rollback (flip switch back to Blue if Green has problems), reduces deployment risk (new version fully tested in production-like environment before users see it), allows validation before cutover (smoke tests, performance tests on Green before switching traffic).

Real-world analogy: Theater with two stages. Stage Blue has actors performing current play (audience watching). Stage Green has different actors rehearsing new play (no audience yet). When new play is ready, rotate theater (audience now sees stage Green), old play on stage Blue ready if need to rotate back. Audience never sees empty stage (no downtime).

How it works (Detailed step-by-step):

  1. Initial state: Blue environment runs v1.5 of application, serving 100% of production traffic via load balancer, Green environment is idle (turned off or minimal resources)
  2. Deployment start: Pipeline deploys v1.6 to Green environment, infrastructure provisioned (VMs, containers start), application deployed, database migrations run (non-breaking changes only)
  3. Green validation: Pipeline runs smoke tests against Green (HTTP 200 on /health endpoint? User login works? Payment processing functional?), runs performance tests (response time <500ms? throughput ≥1000 req/sec?), manual validation optional (QA team spot-checks Green)
  4. Traffic cutover: Pipeline executes swap: Load balancer configuration updated (route traffic from Blue → Green), DNS update propagated (5 min TTL), users gradually shift to Green as DNS caches expire, after 10 minutes 100% traffic on Green v1.6
  5. Monitoring period: Pipeline monitors Green for 30 minutes (error rate <1%? latency normal? CPU usage stable?), Blue v1.5 still running (ready for rollback), logs compared (any spike in errors?)
  6. Success scenario: Green stable after 30 min, Blue environment scaled down (save costs) or kept at minimum for quick rollback, deployment marked successful
  7. Rollback scenario: Green shows 5% error rate at minute 15, pipeline executes immediate rollback: Load balancer flips back (Blue now receives traffic), DNS updated (point to Blue), Green investigated (logs analyzed, fix deployed later), users see <1 minute impact (only during flip)

📊 Blue-Green Deployment Flow Diagram:

graph TD
    subgraph "Blue-Green Deployment Flow"
        START[Start Deployment] --> DEPLOY_GREEN[Deploy v1.6 to Green]
        DEPLOY_GREEN --> SMOKE[Run Smoke Tests on Green]
        SMOKE --> HEALTH{Green<br/>Healthy?}
        HEALTH -->|No| ROLLBACK1[Keep Blue Active]
        HEALTH -->|Yes| SWITCH[Switch Load Balancer<br/>Blue → Green]
        SWITCH --> MONITOR[Monitor Green for 30 min]
        MONITOR --> CHECK{Error Rate<br/>Normal?}
        CHECK -->|No| ROLLBACK2[Immediate Rollback<br/>Green → Blue]
        CHECK -->|Yes| SUCCESS[Deployment Complete<br/>Scale Down Blue]
    end
    
    subgraph "Environment State"
        BLUE[Blue Environment<br/>v1.5 - Active]
        GREEN[Green Environment<br/>v1.6 - Standby]
        LB[Load Balancer]
        
        LB -->|100% Traffic| BLUE
        LB -.->|After Switch| GREEN
    end
    
    style HEALTH fill:#fff3e0
    style CHECK fill:#fff3e0
    style SUCCESS fill:#c8e6c9
    style ROLLBACK1 fill:#ffcdd2
    style ROLLBACK2 fill:#ffcdd2
    style GREEN fill:#e8f5e9
    style BLUE fill:#e3f2fd

See: diagrams/04_domain3_blue_green_deployment.mmd

Diagram Explanation: The blue-green deployment process flows through distinct validation stages before cutover. Starting at top-left, Start Deployment initiates the process by triggering pipeline to Deploy v1.6 to Green environment (new version) while Blue environment v1.5 continues serving production traffic. Once Green deployment completes, pipeline executes Run Smoke Tests on Green to validate basic functionality (health endpoints, critical paths, database connectivity). Results feed into Green Healthy? decision (orange diamond). If health checks fail (database migration issue, missing configuration, service won't start) → Keep Blue Active (red box, deployment aborted, Green torn down or fixed, users unaffected because Blue still serving). If healthy → Switch Load Balancer (Blue → Green) executes traffic cutover - load balancer configuration updated to route requests to Green, DNS records updated, traffic shifts from Blue to Green over ~5-10 minutes as DNS caches expire. After switch, Monitor Green for 30 min observes error rates, latency, CPU/memory, comparing to baseline. After monitoring window, Error Rate Normal? decision evaluates metrics. If error spike detected (5% error rate vs 1% baseline, latency 2x normal, memory leak) → Immediate Rollback (Green → Blue) - load balancer flips back instantly (<30 seconds), users return to stable Blue v1.5, Green stays up for debugging. If metrics normal → Deployment Complete, Scale Down Blue (green box success state) - Blue environment scaled to minimum or terminated (save costs), Green becomes new production, next deployment will flip roles (Green becomes old, new Blue provisioned). Bottom section shows Environment State: Blue environment (light blue box) running v1.5, Green environment (light green box) running v1.6, both connected to Load Balancer. Initially LB sends 100% traffic to Blue (solid arrow), after successful switch LB sends 100% to Green (dashed arrow shows new path). Key advantage: Blue remains running during monitoring period (instant rollback capability), users never experience downtime (switch is instant at LB level), full validation happens before traffic shift (smoke tests on Green with real production data/config).

Detailed Example 1: Azure App Service Deployment Slots (Blue-Green)

Azure App Service natively supports blue-green deployments through deployment slots. You have production slot (Blue) and staging slot (Green), can swap them instantly. This is Azure's built-in implementation of blue-green pattern.

Azure App Service setup:

# Create App Service with staging slot
az appservice plan create --name myAppPlan --resource-group myRG --sku S1
az webapp create --name myWebApp --resource-group myRG --plan myAppPlan
az webapp deployment slot create --name myWebApp --resource-group myRG --slot staging

Azure Pipeline for slot deployment (azure-pipelines.yml):

stages:
- stage: DeployToStaging
  jobs:
  - deployment: DeployStaging
    environment: 'staging'
    strategy:
      runOnce:
        deploy:
          steps:
          - task: AzureWebApp@1
            inputs:
              azureSubscription: 'MyAzureConnection'
              appName: 'myWebApp'
              deployToSlotOrASE: true
              resourceGroupName: 'myRG'
              slotName: 'staging'
              package: '$(Pipeline.Workspace)/drop/**/*.zip'
          
          - task: AzureAppServiceManage@0
            displayName: 'Start staging slot'
            inputs:
              azureSubscription: 'MyAzureConnection'
              action: 'Start Azure App Service'
              webAppName: 'myWebApp'
              specifySlotOrASE: true
              resourceGroupName: 'myRG'
              slot: 'staging'

- stage: ValidateStaging
  dependsOn: DeployToStaging
  jobs:
  - job: SmokeTests
    steps:
    - task: PowerShell@2
      displayName: 'Run smoke tests against staging'
      inputs:
        targetType: 'inline'
        script: |
          $response = Invoke-WebRequest -Uri 'https://mywebapp-staging.azurewebsites.net/health' -UseBasicParsing
          if ($response.StatusCode -ne 200) {
            Write-Error "Health check failed"
            exit 1
          }
          
          $loginResponse = Invoke-WebRequest -Uri 'https://mywebapp-staging.azurewebsites.net/api/login' `
            -Method POST `
            -Body '{"username":"test","password":"test123"}' `
            -ContentType 'application/json'
          
          if ($loginResponse.StatusCode -ne 200) {
            Write-Error "Login test failed"
            exit 1
          }
          
          Write-Host "Smoke tests passed"

- stage: SwapToProduction
  dependsOn: ValidateStaging
  jobs:
  - deployment: SwapSlots
    environment: 'production'
    strategy:
      runOnce:
        deploy:
          steps:
          - task: AzureAppServiceManage@0
            displayName: 'Swap staging to production'
            inputs:
              azureSubscription: 'MyAzureConnection'
              action: 'Swap Slots'
              webAppName: 'myWebApp'
              resourceGroupName: 'myRG'
              sourceSlot: 'staging'
              targetSlot: 'production'
              
          - task: PowerShell@2
            displayName: 'Monitor production for 5 minutes'
            inputs:
              targetType: 'inline'
              script: |
                $endTime = (Get-Date).AddMinutes(5)
                while ((Get-Date) -lt $endTime) {
                  $response = Invoke-WebRequest -Uri 'https://mywebapp.azurewebsites.net/health' -UseBasicParsing
                  if ($response.StatusCode -ne 200) {
                    Write-Error "Production health check failed - Initiating rollback"
                    exit 1
                  }
                  Write-Host "Health check passed - $(Get-Date)"
                  Start-Sleep -Seconds 30
                }
                Write-Host "Monitoring complete - Deployment successful"

- stage: Rollback
  dependsOn: SwapToProduction
  condition: failed()
  jobs:
  - deployment: RollbackSwap
    environment: 'production-rollback'
    strategy:
      runOnce:
        deploy:
          steps:
          - task: AzureAppServiceManage@0
            displayName: 'Rollback - Swap staging and production again'
            inputs:
              azureSubscription: 'MyAzureConnection'
              action: 'Swap Slots'
              webAppName: 'myWebApp'
              resourceGroupName: 'myRG'
              # A swap is symmetric: the previous version now sits in the staging
              # slot, so swapping the same pair again restores it to production.
              sourceSlot: 'staging'
              targetSlot: 'production'

What happens during slot swap:

  1. Before swap: Production slot serves mywebapp.azurewebsites.net (v1.5), Staging slot serves mywebapp-staging.azurewebsites.net (v1.6)
  2. Pipeline deploy: New version v1.6 deployed to staging slot, application starts there using the staging slot's own configuration (slot-specific app settings/connection strings)
  3. Smoke tests: Pipeline hits staging URL (mywebapp-staging.azurewebsites.net), validates /health returns 200, validates /api/login works, all tests pass
  4. Swap execution: Azure swaps virtual directories and configurations, production slot now runs v1.6 code, staging slot now runs v1.5 code, swap completes in 5-10 seconds (zero downtime)
  5. Production URLs: mywebapp.azurewebsites.net now serves v1.6 (users see new version), mywebapp-staging.azurewebsites.net now serves v1.5 (old version on staging)
  6. Monitoring: Pipeline checks production health every 30 seconds for 5 minutes, if any check fails → condition: failed() triggers, Rollback stage executes
  7. Rollback (if needed): Another swap executed (production ↔ staging), production returns to v1.5, users see <10 second impact, staging has v1.6 for debugging

Result: Zero-downtime deployment (swap is instant), instant rollback capability (another swap), staging slot validates code with production configuration before going live, all built into Azure App Service (no custom load balancer config needed).

Canary Deployment

What it is: Canary deployment gradually rolls out new version to small percentage of users first (5% canary), monitors metrics (error rate, latency, CPU), if healthy increases percentage (25%, 50%, 100%), if unhealthy stops rollout and rollback. Named after "canary in coal mine" (early warning system).

Why it exists: Blue-green switches 100% traffic at once (big bang, high risk if bug only appears at scale). Canary reduces blast radius (only 5% of users affected by bugs), enables real-world validation (actual user traffic, not synthetic tests), provides early warning (metrics anomaly in canary traffic stops rollout before impacting everyone), allows A/B comparison (canary metrics vs production baseline).

Real-world analogy: Restaurant testing new menu item. Don't serve new dish to entire restaurant (what if everyone hates it?). Instead: Offer new dish to one table (5%), watch their reaction, if they love it offer to five tables (25%), if still positive offer to everyone (100%). If first table complains, remove dish, only one table impacted.

How it works (Detailed step-by-step):

  1. Initial state: Production runs v2.9 serving 100% traffic, metrics baseline established (1% error rate, 200ms avg latency, 45% CPU)
  2. Canary deployment: Pipeline deploys v3.0 to canary environment (separate pods/VMs/containers), load balancer configured to route 5% traffic to canary, 95% still goes to production v2.9
  3. Canary monitoring (10 minutes): Azure Monitor collects metrics from canary and production, compares: Canary error rate 1.2% vs Production 1% (within tolerance), Canary latency 205ms vs Production 200ms (within 10% threshold), CPU usage normal
  4. First increase: Metrics healthy, pipeline increases canary to 25% traffic, monitors for 15 minutes, metrics still healthy
  5. Second increase: Pipeline increases to 50% traffic, monitors for 15 minutes, metrics still healthy
  6. Full rollout: Pipeline increases to 100% traffic, all users now on v3.0, production v2.9 decommissioned

Rollback scenario:

  1. Canary at 5%: v3.0 deployed to canary, 5% traffic routed
  2. Anomaly detected (minute 5): Canary error rate jumps to 8% (vs 1% baseline), latency spikes to 800ms (vs 200ms baseline), Azure Monitor triggers alert
  3. Automatic rollback: Pipeline receives alert, stops canary traffic (0% to canary), routes 100% back to production v2.9, only 5% of users saw errors for 5 minutes
  4. Investigation: Dev team analyzes canary logs, finds database query timeout (new feature hits slow endpoint), fix deployed to staging, re-tested, deployed as new canary next day

📊 Canary Deployment Sequence:

sequenceDiagram
    participant Pipeline
    participant Canary as Canary (5%)
    participant Production as Production (95%)
    participant Monitor as Azure Monitor
    
    Pipeline->>Canary: Deploy v2.0 to canary
    Pipeline->>Monitor: Start monitoring canary metrics
    
    Note over Canary: 5% of traffic → v2.0
    Note over Production: 95% of traffic → v1.9
    
    Monitor-->>Pipeline: Canary metrics healthy (10 min)
    Pipeline->>Canary: Increase to 25% traffic
    
    Monitor-->>Pipeline: Error rate spike detected!
    Pipeline->>Canary: Rollback - Stop canary
    Pipeline->>Production: 100% traffic back to v1.9
    
    Note over Pipeline: Investigate issue, fix, redeploy

See: diagrams/04_domain3_canary_deployment.mmd

Diagram Explanation: Canary deployment uses progressive traffic shifting with continuous monitoring to detect issues early. The sequence begins with Pipeline deploying v2.0 to Canary environment (5% traffic destination) while Production environment (95% traffic) continues serving v1.9. Pipeline immediately starts monitoring canary metrics via Azure Monitor to establish baseline comparison. First monitoring phase (10 min): Small percentage of real user traffic flows to Canary (5%), allowing real-world validation with limited blast radius. Monitor continuously compares canary metrics (error rate, latency, throughput) against production baseline. If metrics are healthy (error rate within tolerance, latency normal, no anomalies) → Monitor returns "Canary metrics healthy" to Pipeline → Pipeline executes traffic increase to 25% (second phase), continues monitoring. However, the diagram shows failure scenario: During monitoring, Monitor detects "Error rate spike!" in canary (e.g., 8% errors in canary vs 1% in production, significant deviation). Monitor immediately alerts Pipeline. Pipeline responds with two actions: (1) Rollback - Stop canary: Traffic weight for canary set to 0%, (2) 100% traffic back to v1.9: All users return to stable production version. Final note indicates "Investigate issue, fix, redeploy" - dev team examines canary logs, identifies root cause (database timeout, memory leak, API integration failure), fixes bug, redeploys as new canary attempt. Key advantage over blue-green: Only 5% of users affected by bug (95% never saw issue), early detection prevented full rollout (canary acted as early warning system), automatic rollback triggered by metrics (no manual intervention), gradual increase allows validation at each stage (5% → 25% → 50% → 100%, stop at any stage if issues arise). This pattern reduces risk of large-scale failures by validating with real traffic in controlled increments.

Detailed Example 2: Kubernetes Canary with Istio

You're deploying microservice to Kubernetes cluster with 100 pods. Want canary deployment with traffic splitting. Solution: Use Istio service mesh for intelligent traffic routing.

Kubernetes deployment manifest:

# Production deployment (v1.9)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service-v1
spec:
  replicas: 95
  selector:
    matchLabels:
      app: api-service
      version: v1
  template:
    metadata:
      labels:
        app: api-service
        version: v1
    spec:
      containers:
      - name: api
        image: myregistry.azurecr.io/api-service:1.9
---
# Canary deployment (v2.0)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service-v2
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-service
      version: v2
  template:
    metadata:
      labels:
        app: api-service
        version: v2
    spec:
      containers:
      - name: api
        image: myregistry.azurecr.io/api-service:2.0

Istio VirtualService for traffic splitting:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-service
spec:
  hosts:
  - api-service
  http:
  - match:
    - headers:
        canary-user:
          exact: "true"
    route:
    - destination:
        host: api-service
        subset: v2
      weight: 100
  - route:
    - destination:
        host: api-service
        subset: v1
      weight: 95
    - destination:
        host: api-service
        subset: v2
      weight: 5
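
The v1/v2 subsets referenced by the VirtualService must be defined in a DestinationRule (not shown above); a minimal sketch:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-service
spec:
  host: api-service
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2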

Azure Pipeline for progressive canary:

stages:
- stage: DeployCanary5Percent
  jobs:
  - job: Deploy
    steps:
    - task: Kubernetes@1
      inputs:
        command: 'apply'
        arguments: '-f k8s/api-service-v2-deployment.yaml'
    
    - task: Kubernetes@1
      inputs:
        command: 'apply'
        arguments: '-f k8s/istio-virtualservice-5percent.yaml'
    
    - task: PowerShell@2
      displayName: 'Monitor canary metrics'
      inputs:
        targetType: 'inline'
        script: |
          $query = @"
          requests
          | where timestamp > ago(10m)
          | where customDimensions.version == "v2"
          | summarize 
              ErrorRate = countif(success == false) * 100.0 / count(),
              AvgDuration = avg(duration),
              RequestCount = count()
          "@
          
          $metrics = Invoke-AzOperationalInsightsQuery -WorkspaceId $(WorkspaceId) -Query $query
          
          $errorRate = $metrics.Results.ErrorRate
          $avgDuration = $metrics.Results.AvgDuration
          
          Write-Host "Canary metrics: ErrorRate=$errorRate%, AvgDuration=$avgDuration ms"
          
          if ($errorRate -gt 2) {
            Write-Error "Canary error rate too high: $errorRate%"
            exit 1
          }
          
          if ($avgDuration -gt 300) {
            Write-Error "Canary latency too high: $avgDuration ms"
            exit 1
          }

- stage: IncreaseCanary25Percent
  dependsOn: DeployCanary5Percent
  condition: succeeded()
  jobs:
  - job: IncreaseTraffic
    steps:
    - task: Kubernetes@1
      inputs:
        command: 'apply'
        arguments: '-f k8s/istio-virtualservice-25percent.yaml'
    
    - task: PowerShell@2
      displayName: 'Monitor 25% canary'
      inputs:
        targetType: 'inline'
        script: |
          # Same monitoring logic, 15 minute window
          Start-Sleep -Seconds 900

- stage: FullRollout
  dependsOn: IncreaseCanary25Percent
  condition: succeeded()
  jobs:
  - job: CompleteDeployment
    steps:
    - task: Kubernetes@1
      inputs:
        command: 'scale'
        arguments: 'deployment/api-service-v2 --replicas=100'
    
    - task: Kubernetes@1
      inputs:
        command: 'scale'
        arguments: 'deployment/api-service-v1 --replicas=0'
    
    - task: Kubernetes@1
      inputs:
        command: 'apply'
        arguments: '-f k8s/istio-virtualservice-100percent.yaml'

- stage: Rollback
  dependsOn: DeployCanary5Percent
  condition: failed()
  jobs:
  - job: RollbackCanary
    steps:
    - task: Kubernetes@1
      inputs:
        command: 'delete'
        arguments: 'deployment/api-service-v2'
    
    - task: Kubernetes@1
      inputs:
        command: 'apply'
        arguments: '-f k8s/istio-virtualservice-0percent.yaml'

What happens:

  1. 5% canary: Pipeline deploys 5 v2.0 pods (canary) alongside 95 v1.9 pods (production), Istio routes 5% traffic to v2 pods, Azure Monitor collects metrics from both versions
  2. Metric comparison: Pipeline queries Application Insights every minute: v2 error rate 1.1% vs v1 error rate 1.0% (healthy), v2 latency 195ms vs v1 latency 200ms (healthy), after 10 minutes all checks passed
  3. 25% increase: Pipeline updates Istio VirtualService (weight: 25 for v2, 75 for v1), scales v2 to 25 replicas, v1 to 75 replicas, monitors for 15 minutes, metrics still healthy
  4. 100% rollout: Pipeline scales v2 to 100 replicas, v1 to 0 replicas, updates VirtualService (weight: 100 for v2), all traffic on v2, deployment complete
  5. Rollback scenario: If 5% canary shows 8% error rate, pipeline stage fails, Rollback stage triggers automatically, v2 deployment deleted, VirtualService updated to 0% canary, only 5% of users impacted for 10 minutes
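
The weight-shift manifests the pipeline applies (for example, istio-virtualservice-25percent.yaml) differ only in route weights; a sketch of the 25% step:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-service
spec:
  hosts:
  - api-service
  http:
  - route:
    - destination:
        host: api-service
        subset: v1
      weight: 75
    - destination:
        host: api-service
        subset: v2
      weight: 25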

Result: Gradual validation with real production traffic, automatic rollback on metric anomaly, limited blast radius (5% → 25% → 100% progression), Istio provides fine-grained traffic control (can route by headers, paths, user segments), Azure Monitor integration for automated decision-making.

Feature Flags with Azure App Configuration

What it is: Feature flags (feature toggles) allow deploying code to production with new features disabled, then enable features for specific users/environments by toggling configuration, no code deployment needed. Implemented using Azure App Configuration Feature Manager, flags stored centrally, evaluated at runtime.

Why it exists: Traditional deployment couples code deployment with feature release (deploy new code = users see new feature immediately, risky). Feature flags decouple deployment from release (deploy code anytime, enable feature when ready, instant rollback by disabling flag, no code changes). Enable progressive rollout (enable for 5% users, then 25%, then 100%), A/B testing (enable for group A, disabled for group B, compare metrics), emergency killswitch (production bug? disable feature instantly, no deployment needed).

Real-world analogy: Light switch for new features. You install the light bulb (deploy code) but switch is OFF (feature disabled). When ready, flip switch ON (enable flag) - light turns on (users see feature). If light flickers (bug), flip switch OFF instantly - light off (feature disabled), no need to uninstall bulb (no code deployment). Can have dimmer switch (enable for 25% brightness = 25% of users).

How it works (Detailed step-by-step):

  1. Azure App Configuration setup: Create App Configuration store in Azure, create feature flag "NewCheckout" with percentage filter 0% (disabled for everyone), configure ASP.NET app to read from App Configuration
  2. Code deployment: Developer wraps new checkout feature in a check: if (await featureManager.IsEnabledAsync("NewCheckout")) { ShowNewCheckout(); } else { ShowOldCheckout(); }, code deployed to production, flag is OFF (0%), all users see old checkout
  3. Initial rollout (5%): DevOps engineer updates flag in App Configuration: set percentage filter to 5%, no code deployment, app reads new config (30 second refresh), 5% of users randomly see new checkout, 95% still see old
  4. Monitoring: Azure Monitor tracks new checkout metrics (conversion rate, error rate, page load time), compares to old checkout baseline, if metrics healthy after 1 day, increase to 25%
  5. Progressive rollout: Increase to 25% (monitor), then 50% (monitor), then 100% (all users on new checkout), entire rollout took 1 week, zero code deployments (only config changes)
  6. Bug discovered: At 50% rollout, payment processor integration broken for new checkout, error rate spike, DevOps engineer sets flag to 0% in App Configuration, within 30 seconds all users back to old checkout, zero downtime, dev team fixes bug, redeploys code, re-enables flag starting at 5% again

Detailed Example: Feature Flag with Targeted Rollout

You're deploying new recommendation engine. Want to enable for beta testers first, then premium customers, then everyone. Use Azure App Configuration with custom filters.

Azure App Configuration setup (via Azure CLI):

# Create App Configuration
az appconfig create --name myAppConfig --resource-group myRG --location eastus

# Create feature flag with targeting filter
az appconfig feature set --feature RecommendationEngine \
  --label Production \
  --connection-string $(az appconfig credential list --name myAppConfig --query "[0].connectionString" -o tsv) \
  --description "New AI-powered recommendation engine"

# Configure targeting filter
az appconfig feature filter add --feature RecommendationEngine \
  --filter-name Microsoft.Targeting \
  --connection-string $(az appconfig credential list --name myAppConfig --query "[0].connectionString" -o tsv) \
  --filter-parameters Audience='{"Users":["beta-tester@company.com"],"Groups":["BetaTesters"],"DefaultRolloutPercentage":0}'

ASP.NET Core application code:

// Startup.cs - Configure feature management
public void ConfigureServices(IServiceCollection services)
{
    services.AddAzureAppConfiguration();
    
    services.AddFeatureManagement()
        .AddFeatureFilter<TargetingFilter>();
    
    services.AddHttpContextAccessor();
    services.AddSingleton<ITargetingContextAccessor, UserTargetingContextAccessor>();
}

public void Configure(IApplicationBuilder app)
{
    app.UseAzureAppConfiguration();
}

// UserTargetingContextAccessor.cs - Define targeting context
public class UserTargetingContextAccessor : ITargetingContextAccessor
{
    private readonly IHttpContextAccessor _httpContextAccessor;

    public UserTargetingContextAccessor(IHttpContextAccessor httpContextAccessor)
    {
        _httpContextAccessor = httpContextAccessor;
    }

    public ValueTask<TargetingContext> GetContextAsync()
    {
        var httpContext = _httpContextAccessor.HttpContext;
        var userId = httpContext.User.FindFirst(ClaimTypes.NameIdentifier)?.Value;
        var groups = httpContext.User.FindAll(ClaimTypes.Role).Select(c => c.Value);
        
        return new ValueTask<TargetingContext>(new TargetingContext
        {
            UserId = userId,
            Groups = groups.ToList()
        });
    }
}

// RecommendationsController.cs - Use feature flag
public class RecommendationsController : Controller
{
    private readonly IFeatureManager _featureManager;
    private readonly IAiRecommendationService _aiService;     // hypothetical app services used below
    private readonly IRuleRecommendationService _ruleService;

    public RecommendationsController(
        IFeatureManager featureManager,
        IAiRecommendationService aiService,
        IRuleRecommendationService ruleService)
    {
        _featureManager = featureManager;
        _aiService = aiService;
        _ruleService = ruleService;
    }

    [HttpGet]
    public async Task<IActionResult> GetRecommendations()
    {
        if (await _featureManager.IsEnabledAsync("RecommendationEngine"))
        {
            // New AI recommendations
            var recommendations = await _aiService.GetRecommendations(User.Id);
            return Json(new { source = "ai", items = recommendations });
        }
        else
        {
            // Old rule-based recommendations
            var recommendations = await _ruleService.GetRecommendations(User.Id);
            return Json(new { source = "rules", items = recommendations });
        }
    }
}

Rollout progression (via App Configuration portal):

Day 1: Enable for "BetaTesters" group (20 internal users)
  - Targeting filter: Users=[], Groups=["BetaTesters"], DefaultRolloutPercentage=0
  - Result: Only users in BetaTesters AD group see new recommendations

Day 3: Add specific premium customers by email
  - Targeting filter: Users=["premium@acme.com", "vip@contoso.com"], Groups=["BetaTesters"], DefaultRolloutPercentage=0
  - Result: Beta testers + 2 premium customers see new recommendations

Day 5: Rollout to 10% of premium tier
  - Targeting filter: Users=[...], Groups=["BetaTesters", "PremiumCustomers"], DefaultRolloutPercentage=10
  - Result: All beta testers + specific users + 10% of premium customers

Day 7: Rollout to 50% of all users
  - Targeting filter: Users=[...], Groups=[...], DefaultRolloutPercentage=50
  - Result: Specific users + specific groups + 50% random rollout

Day 10: Full rollout
  - Targeting filter: Users=[...], Groups=[...], DefaultRolloutPercentage=100
  - Result: Everyone sees new AI recommendations

What happens during rollout:

  1. App Configuration polling: ASP.NET app polls Azure App Configuration every 30 seconds for flag updates, caches configuration, refreshes when changes detected
  2. Request evaluation: User makes request to /api/recommendations, TargetingContextAccessor reads user ID and groups from JWT claims, FeatureManager evaluates targeting filter against user context
  3. Filter evaluation logic:
    • Is user in Users list? → Enable flag
    • Is user in any Groups? → Enable flag
    • Neither? → Use DefaultRolloutPercentage (hash user ID, if hash % 100 < percentage → enable)
  4. Flag value cached: Result cached per request (consistent experience during request), next request re-evaluates (can change if config updated)
  5. Immediate toggle: DevOps updates DefaultRolloutPercentage from 50% → 0%, within 30 seconds all apps refresh config, new requests re-evaluate (0% means disabled), users gradually see old recommendations as they make new requests
  6. Monitoring per segment: Application Insights custom dimensions track flag evaluation (userId, flagName, flagValue), can query: "What's error rate for users with RecommendationEngine=true vs false?", enables A/B comparison

Result: Zero-downtime feature rollout, instant rollback capability (toggle flag OFF), targeted rollout by user attributes (email, groups, percentage), no code deployment needed for rollout changes, built-in A/B testing capability, App Configuration provides UI for business users to manage flags (no DevOps needed for rollout adjustments).

Must Know (Deployment Strategy Critical Facts):

  • Blue-Green: Two identical environments, instant swap, instant rollback, requires 2X infrastructure (expensive), zero downtime, full validation before cutover
  • Canary: Progressive rollout (5% → 25% → 100%), monitor metrics at each stage, automatic rollback on anomaly, limited blast radius (early detection), gradual validation
  • Feature Flags: Decouple deployment from release, toggle features ON/OFF without code changes, instant rollback (disable flag), enable targeted rollout (users/groups/percentage), A/B testing built-in
  • Azure App Service Slots: Built-in blue-green deployment, production and staging slots, swap operation (5-10 sec), automatic configuration swap, rollback is another swap
  • Deployment slot settings: Some settings "stick" to the slot (don't swap): publishing endpoints, custom domains, SSL certificates and bindings, scale settings. Other settings swap with the code: app settings and connection strings (unless marked as "Deployment slot setting")
  • Ring Deployment: Progressive rollout in rings (Ring 0=internal users, Ring 1=early adopters, Ring 2=general availability), each ring represents user segment, pause between rings for validation

When to use (Deployment Strategy Decisions):

  • ✅ Use Blue-Green when: Can afford 2X infrastructure, need instant rollback, want full validation before cutover, database changes are backward compatible
  • ✅ Use Canary when: Want gradual rollout, need early detection of issues, can tolerate partial rollout, have good monitoring/metrics
  • ✅ Use Feature Flags when: Want to decouple deployment from release, need instant on/off toggle, want targeted rollout (beta users first), conducting A/B tests, deploying risky features
  • ✅ Use App Service Slots when: Deploying to Azure App Service, want built-in blue-green, need staging environment with production config, web apps or API apps
  • ❌ Don't use Blue-Green when: Infrastructure cost prohibitive (2X cost), database has breaking changes (old Blue can't work with new schema), stateful applications (session state lost on swap)
  • ❌ Don't use Canary when: No good metrics to monitor (can't detect issues), can't tolerate partial rollout (all users must have same version), feature is all-or-nothing (can't have 5% of users see it)
  • ❌ Don't use Feature Flags when: Over-using them (tech debt accumulates, code full of if/else), flags never removed (permanent flags become configuration, not toggles), no flag cleanup process

Limitations & Constraints:

  • Blue-Green infrastructure cost: Requires 2X resources during deployment (both Blue and Green running), double the cost for deployment window (can scale down Blue after cutover)
  • Canary complexity: Requires traffic routing capability (load balancer with weights, service mesh), sophisticated monitoring (can compare canary vs prod metrics), more moving parts
  • Feature flags technical debt: Each flag adds if/else branch (code complexity), flags accumulate over time (10, 50, 100 flags?), must clean up after rollout (remove flag code after 100% enabled)
  • App Service slot limitations: S1 tier minimum for staging slot (Basic tier doesn't support slots), limited slots (S1 has 5 slots, P1 has 20 slots), each slot consumes resources (still costs money)

💡 Tips for Understanding:

  • Blue-Green = instant switch, Canary = gradual rollout: Blue-green is all-or-nothing (100% cutover), canary is incremental (5%, 25%, 50%, 100%)
  • Feature flags = deployment ≠ release: Code deployed (feature flag OFF) ≠ feature released (feature flag ON), separate deployment from release decision
  • Slots are environments, flags are toggles: Slot is copy of your app (staging vs production environment), flag is configuration (feature ON/OFF)
  • Think "blast radius": Blue-green impacts 100% at once (large blast radius), canary impacts 5% first (small blast radius), use canary for risky changes

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Using blue-green for database schema changes
    • Why it's wrong: Blue environment can't work with new schema (breaking change), rollback requires database rollback (risky), schema changes need compatibility
    • Correct understanding: Make schema changes backward compatible (expand-contract pattern), deploy schema separately, blue-green works when both versions work with same schema
  • Mistake 2: Canary without good metrics
    • Why it's wrong: You can't detect whether the canary is healthy (no error rate or latency metrics), so a bad canary gets promoted to 100% because issues weren't noticed at 5%
    • Correct understanding: Instrument application first (Application Insights, custom metrics), establish baseline metrics, define health criteria (error rate <2%, latency <500ms), automate decisions
  • Mistake 3: Never cleaning up feature flags
    • Why it's wrong: Code becomes littered with if/else (100 feature flags = 100 branches), flags stay forever (supposed to be temporary), tech debt accumulates, code hard to read
    • Correct understanding: Feature flags have lifecycle (create → enable → rollout → remove), after 100% rollout and stable (2 weeks), remove flag code (make new feature default), flag should live weeks/months not years

🔗 Connections to Other Topics:

  • Relates to Pipeline Design because: Pipelines execute deployment strategies (blue-green swap step, canary rollout stages), deployment is pipeline output
  • Builds on Testing by: Blue-green runs smoke tests on Green before swap, canary monitors production metrics (real-world testing), feature flags enable testing in production (shadow mode)
  • Often used with Monitoring to: Track deployment health (error rates post-deploy), trigger rollbacks (metric anomaly detected), compare canary vs production performance
  • Integrates with Security through: Gradual security patch rollout (canary for patches), feature flags for security features (enable MFA gradually), slot settings for connection strings (keep secrets separate)

Troubleshooting Common Issues:

  • Issue 1: "App Service slot swap succeeded but application shows errors"
    • Problem: Slot-sticky settings not configured correctly (connection string stuck to staging, production now uses staging DB)
    • Solution: Mark slot-specific settings as "Deployment slot setting" checkbox, verify settings after swap, test staging slot with production config before swap
  • Issue 2: "Canary traffic not routing correctly - all traffic goes to production"
    • Problem: Load balancer or service mesh not configured, routing rules missing, canary pods not labeled correctly
    • Solution: Verify traffic splitting config (Istio VirtualService weights, NGINX Ingress annotations), check pod labels (version=canary?), test routing with curl -H "version: canary"
  • Issue 3: "Feature flag changes not reflected in application"
    • Problem: App Configuration polling disabled, cache not refreshing, connection string incorrect
    • Solution: Verify UseAzureAppConfiguration() called in Startup, check refresh interval (default 30sec), test with configuration sentinel (force refresh), check App Configuration connection in Azure

Chapter Summary

What We Covered

This comprehensive chapter covered the entire Build and Release Pipelines domain (50-55% of AZ-400 exam), including:

Section 1: Pipeline Design and Implementation - GitHub Actions vs Azure Pipelines, agents (Microsoft-hosted vs self-hosted), YAML syntax, triggers, stages/jobs/steps hierarchy, templates, matrix builds, pipeline optimization

Section 2: Package Management Strategy - Azure Artifacts vs GitHub Packages, feed views (@Local, @Prerelease, @Release), upstream sources, semantic versioning (SemVer), package promotion workflows, retention policies

Section 3: Testing Strategy - Quality gates (branch policies, code coverage), release gates (approvals, monitoring), test pyramid (unit 70%, integration 20%, E2E 10%), shift-left testing, flaky test management

Section 4: Deployment Strategies - Blue-green deployments (Azure App Service slots), canary releases (progressive rollout 5%→100%), feature flags (Azure App Configuration), ring deployment, zero-downtime techniques, rollback strategies

Critical Takeaways

  1. Pipeline as Code (YAML): Azure Pipelines and GitHub Actions use YAML for pipeline definition, stored in repository, version controlled, supports templates/reusability, declarative vs imperative
  2. Package Feeds & Views: Azure Artifacts uses views to separate maturity (@Prerelease for beta, @Release for stable), upstream sources cache external packages, SemVer communicates breaking changes (MAJOR.MINOR.PATCH)
  3. Quality Before Merge: Quality gates (branch policies) enforce standards automatically (tests pass, coverage ≥75%, no critical bugs), broken code physically cannot merge, immediate developer feedback
  4. Deployment ≠ Release: Feature flags decouple deployment (code to production) from release (feature enabled), enables gradual rollout, instant on/off toggle, A/B testing capability
  5. Progressive Validation: Blue-green validates fully before 100% cutover, canary validates incrementally (5%, 25%, 50%, 100%), both enable instant rollback, choose based on risk tolerance

Self-Assessment Checklist

Test yourself before moving on:

Pipeline Design:

  • I can explain when to use GitHub Actions vs Azure Pipelines (project needs, ecosystem, features)
  • I can write YAML pipeline with stages, jobs, steps (multi-stage deployment, parallel execution, dependencies)
  • I understand agent types and when to use self-hosted vs Microsoft-hosted (access needs, cost, maintenance)
  • I can configure triggers (push, PR, scheduled) and path filters (selective builds)

Package Management:

  • I can design feed strategy with views (development → testing → production flow)
  • I understand upstream sources and caching (performance, security, compliance benefits)
  • I can implement SemVer strategy (MAJOR for breaking, MINOR for features, PATCH for fixes)
  • I know package retention policies and cost optimization (auto-delete old versions)

Testing:

  • I can configure quality gates with branch policies (test requirements, coverage thresholds, security scans)
  • I understand test pyramid distribution (70% unit, 20% integration, 10% E2E) and execution strategy
  • I can implement code coverage analysis with thresholds (PublishCodeCoverageResults, BuildQualityChecks)
  • I know how to design release gates (manual approvals, automated checks, incident monitoring)

Deployment:

  • I can implement blue-green deployment (Azure App Service slots, swap operation, rollback)
  • I understand canary deployment with progressive rollout (5%→25%→100%, metric monitoring, automatic rollback)
  • I can configure feature flags with Azure App Configuration (targeting filters, percentage rollout, instant toggle)
  • I know when to use each strategy (blue-green vs canary vs feature flags based on requirements)

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1 (Package Management & Testing): Questions 1-50
  • Domain 3 Bundle 2 (Pipeline Design): Questions 1-50
  • Domain 3 Bundle 3 (Deployment Strategies): Questions 1-50
  • Expected score: 75%+ to proceed

If you scored below 75%:

  • Review sections: Re-read sections where you missed questions (pipeline YAML syntax? deployment strategies?)
  • Focus on: Hands-on practice (create YAML pipeline, configure feed views, test feature flags in Azure)
  • Lab exercises: Deploy app with slot swap, implement canary with monitoring, configure quality gates

Quick Reference Card

[One-page summary of chapter - copy to your notes]

Pipeline Essentials:

  • Triggers: trigger: (CI), pr: (PR validation), schedules: (nightly builds)
  • Agents: Microsoft-hosted (clean VM per run, scalable, paid parallel jobs), self-hosted (custom software, on-prem access, you supply and maintain the compute)
  • Structure: Stages (environments) → Jobs (agent execution units) → Steps (tasks/scripts)

Package Management:

  • Azure Artifacts: NuGet/npm/Maven, 2GB free, feed views for promotion, upstream caching
  • Feed Views: @Local (all), @Prerelease (beta/alpha), @Release (stable only)
  • SemVer: vMAJOR.MINOR.PATCH (1.2.3), ^ = minor updates, ~ = patch only, exact = no updates

Testing Gates:

  • Quality Gates: Branch policies enforce tests (80% pass rate), coverage (75% threshold), security (0 critical)
  • Test Pyramid: Unit 70% (fast), Integration 20% (medium), E2E 10% (slow), optimize for speed
  • Code Coverage: PublishCodeCoverageResults@1 (publish), BuildQualityChecks@8 (enforce threshold)

Deployment Patterns:

  • Blue-Green: Two environments, instant swap, instant rollback, 2X cost, zero downtime
  • Canary: Progressive rollout (5%→100%), monitor metrics, automatic rollback, limited blast radius
  • Feature Flags: Deploy with flag OFF, enable gradually, instant toggle, decouple deploy from release

Key Azure Tasks:

  • AzureWebApp@1 - Deploy to App Service
  • AzureAppServiceManage@0 - Swap slots, start/stop
  • PublishTestResults@2 - Publish test results (enables quality gates)
  • PublishCodeCoverageResults@1 - Publish coverage (visualize in Azure DevOps)
  • NuGetAuthenticate@1 - Authenticate to Azure Artifacts

Next Chapter: Domain 4 - Security and Compliance Plan
You should know: Authentication methods (service principals, managed identity), secrets management (Azure Key Vault), security scanning (Defender for Cloud, GitHub Advanced Security)

Chapter Summary

What We Covered

This comprehensive chapter covered the largest domain of the AZ-400 exam (50-55%), focusing on:

Package Management Strategy

  • Azure Artifacts vs GitHub Packages for different scenarios
  • Feed views for package promotion (@Local, @Prerelease, @Release)
  • Dependency versioning with SemVer and CalVer
  • Upstream sources for caching public packages

Testing Strategy for Pipelines

  • Quality and release gates with branch policies
  • Test pyramid (70% unit, 20% integration, 10% E2E)
  • Implementing tests in pipelines with PublishTestResults
  • Code coverage analysis and enforcement

Pipeline Design and Implementation

  • Choosing between GitHub Actions and Azure Pipelines
  • YAML pipeline structure (stages, jobs, steps)
  • Trigger rules (CI, PR, scheduled, manual)
  • Parallel execution and dependencies
  • Reusable templates and task groups

Deployment Strategies

  • Blue-green deployments for zero downtime
  • Canary releases for progressive rollout
  • Feature flags for decoupling deployment from release
  • Deployment slots in Azure App Service
  • Database deployment strategies

Infrastructure as Code

  • ARM templates vs Bicep vs Terraform
  • Desired State Configuration with Azure Automation
  • Azure Deployment Environments for self-service
  • IaC testing and validation strategies

Pipeline Maintenance

  • Monitoring pipeline health metrics
  • Optimizing for cost, time, and reliability
  • Retention strategies for artifacts
  • Migrating from classic to YAML pipelines

Critical Takeaways

  1. Package Management: Use Azure Artifacts for enterprise scenarios with feed views for promotion; use GitHub Packages for open-source or GitHub-native workflows
  2. Testing Gates: Implement quality gates at branch policy level to prevent bad code from merging; use test pyramid to optimize test execution time
  3. Pipeline Structure: Use multi-stage YAML pipelines with stages for environments, jobs for parallel execution, and steps for tasks
  4. Deployment Patterns: Choose blue-green for instant rollback, canary for progressive rollout with monitoring, feature flags for decoupling deployment from release
  5. IaC Choice: Use Bicep for Azure-native IaC (simpler than ARM), Terraform for multi-cloud, ARM for complex scenarios requiring full control
  6. Agent Strategy: Use Microsoft-hosted agents for standard builds (clean, scalable), self-hosted for custom requirements or on-premises access
  7. Reusability: Create YAML templates for common patterns, task groups for UI-based pipelines, variable groups for shared configuration
  8. Monitoring: Track failure rate, duration, and flaky tests; optimize pipelines continuously based on metrics

Self-Assessment Checklist

Test yourself before moving on:

Package Management:

  • I can explain when to use Azure Artifacts vs GitHub Packages
  • I understand feed views and how to promote packages through @Local → @Prerelease → @Release
  • I can describe SemVer format and what ^ and ~ mean in version ranges
  • I know how upstream sources work and why they're useful

Testing:

  • I can describe the test pyramid and why it's structured that way
  • I understand how to implement quality gates using branch policies
  • I know how to publish test results and code coverage in pipelines
  • I can explain when to use unit vs integration vs E2E tests

Pipeline Design:

  • I can write a multi-stage YAML pipeline with stages, jobs, and steps
  • I understand different trigger types (CI, PR, scheduled, manual)
  • I know how to implement parallel execution and job dependencies
  • I can create reusable YAML templates

Deployments:

  • I can explain blue-green, canary, and ring deployment patterns
  • I understand how deployment slots work in Azure App Service
  • I know how to implement feature flags using Azure App Configuration
  • I can describe strategies for zero-downtime deployments

Infrastructure as Code:

  • I can compare ARM templates, Bicep, and Terraform
  • I understand when to use each IaC tool
  • I know how to implement desired state configuration
  • I can explain Azure Deployment Environments

Pipeline Maintenance:

  • I can identify key pipeline health metrics
  • I understand how to optimize pipelines for cost and performance
  • I know how to set up artifact retention policies
  • I can migrate classic pipelines to YAML

Practice Questions

Try these from your practice test bundles:

  • Domain 3 Bundle 1: Questions 1-20 (Package Management & Testing)
  • Domain 3 Bundle 2: Questions 21-40 (Pipeline Design)
  • Domain 3 Bundle 3: Questions 41-60 (Deployments & IaC)
  • Full Practice Test 1: Domain 3 questions
  • Expected score: 75%+ to proceed (this is the largest domain)

If you scored below 75%:

  • Below 60%: Review entire chapter, focus on YAML syntax and deployment patterns
  • 60-70%: Review sections on testing strategy, deployment patterns, and IaC
  • 70-75%: Focus on edge cases and decision frameworks (when to use which tool/pattern)

Quick Reference Card

Package Management:

  • Azure Artifacts: Enterprise, feed views, 2GB free, upstream caching
  • GitHub Packages: Open-source, GitHub-native, unlimited public, $0.50/GB private
  • SemVer: vMAJOR.MINOR.PATCH (1.2.3), ^ = minor updates, ~ = patch only

Testing:

  • Test Pyramid: 70% unit (fast), 20% integration (medium), 10% E2E (slow)
  • Quality Gates: Branch policies → Build validation + Test pass rate + Coverage threshold
  • Tasks: PublishTestResults@2 (results), PublishCodeCoverageResults@1 (coverage)

Pipeline Structure:

stages:
  - stage: Build
    jobs:
      - job: BuildJob
        steps:
          - task: TaskName@Version

Deployment Patterns:

  • Blue-Green: Two environments, instant swap, zero downtime, 2X cost
  • Canary: Progressive rollout (5%→25%→100%), monitor metrics, automatic rollback
  • Feature Flags: Deploy OFF, enable gradually, instant toggle, decouple deploy from release

IaC Tools:

  • Bicep: Azure-native, simpler than ARM, transpiles to ARM
  • ARM: Full control, complex, JSON, native Azure
  • Terraform: Multi-cloud, HCL, state management, large ecosystem

Key Tasks:

  • AzureWebApp@1 - Deploy to App Service
  • AzureAppServiceManage@0 - Swap slots
  • PublishTestResults@2 - Publish test results
  • PublishCodeCoverageResults@1 - Publish coverage
  • NuGetAuthenticate@1 - Authenticate to Azure Artifacts

Decision Points:

  • Need enterprise package management? → Azure Artifacts
  • Need GitHub-native packages? → GitHub Packages
  • Need instant rollback? → Blue-green deployment
  • Need progressive rollout with monitoring? → Canary deployment
  • Need to decouple deployment from release? → Feature flags
  • Need Azure-only IaC? → Bicep
  • Need multi-cloud IaC? → Terraform
  • Need parallel execution? → Multiple jobs with dependencies
  • Need reusable pipeline logic? → YAML templates

Next Chapter: 05_domain4_security_compliance - Develop a Security and Compliance Plan (Authentication, secrets management, security scanning)


Chapter 4: Develop a Security and Compliance Plan (10-15% of exam)

Chapter Overview

What you'll learn:

  • Authentication and authorization strategies (Service Principals, Managed Identity, GitHub Apps)
  • Managing sensitive information (Azure Key Vault, secrets in pipelines)
  • Automating security and compliance scanning (Defender for Cloud, GitHub Advanced Security)
  • Permissions and access control in Azure DevOps and GitHub

Time to complete: 6-8 hours
Prerequisites: Chapters 0-3 (Fundamentals, Processes, Source Control, Pipelines)

Why this domain matters: Security is no longer an afterthought - it's integrated throughout the DevOps lifecycle (DevSecOps). This domain tests your ability to secure pipelines, protect sensitive data, and automate security scanning to catch vulnerabilities early.


Section 1: Design and Implement Authentication and Authorization

Introduction

The problem: Pipelines need to access Azure resources, GitHub repositories, and external services. Using personal credentials is insecure (credentials shared, no audit trail, no rotation). Manual permission management doesn't scale.

The solution: Use identity-based authentication (Service Principals, Managed Identity) and role-based access control (RBAC) to grant least-privilege access. Automate permission management through groups and teams.

Why it's tested: Authentication and authorization are fundamental to secure DevOps. The exam tests your ability to choose the right authentication method for different scenarios and implement proper access control.

Core Concepts

Service Principals vs Managed Identity

What they are: Both are Azure AD identities used by applications and services to authenticate to Azure resources without using user credentials.

Why they exist: Applications and pipelines need to access Azure resources (deploy to App Service, read from Key Vault, write to Storage). Using user credentials is problematic:

  • Credentials can be stolen or leaked
  • No clear audit trail (who did what?)
  • Credentials expire and need rotation
  • Shared credentials violate least-privilege principle

Service Principals and Managed Identities solve this by providing application-specific identities with their own permissions.

Real-world analogy: Think of a hotel key card system. Instead of giving every employee the master key (user credentials), each employee gets a key card (Service Principal/Managed Identity) that only opens the doors they need access to. If an employee leaves, you deactivate their card without changing all the locks.

How Service Principals work (Detailed step-by-step):

  1. You create a Service Principal in Azure AD, which generates an Application ID and a secret (password) or certificate
  2. You assign Azure RBAC roles to the Service Principal (e.g., "Contributor" on a resource group)
  3. Your pipeline authenticates to Azure AD using the Application ID and secret
  4. Azure AD validates the credentials and issues an access token
  5. The pipeline uses the access token to call Azure APIs (deploy resources, read secrets, etc.)
  6. Azure checks the token and verifies the Service Principal has the required RBAC role
  7. If authorized, the operation succeeds; if not, it's denied
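
To make the token exchange concrete, here is a minimal sketch of the Service Principal flow using the Azure.Identity library (an assumption for illustration; the guide's examples use the Azure CLI and pipeline tasks). The tenant ID, application ID, and secret are placeholders.

// Service Principal flow sketch - exchange App ID + secret for an access token.
// Requires the Azure.Identity NuGet package; IDs below are placeholders.
using Azure.Core;
using Azure.Identity;

var credential = new ClientSecretCredential(
    tenantId: "<tenant-id>",
    clientId: "<application-id>",
    clientSecret: "<client-secret>");   // the secret you must store and rotate

// Azure AD validates the credentials and issues a token scoped to Azure Resource Manager.
AccessToken token = await credential.GetTokenAsync(
    new TokenRequestContext(new[] { "https://management.azure.com/.default" }));

Console.WriteLine($"Token acquired; expires {token.ExpiresOn}. Never log the token itself.");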

How Managed Identity works (Detailed step-by-step):

  1. You enable Managed Identity on an Azure resource (VM, App Service, Azure DevOps agent)
  2. Azure automatically creates an identity in Azure AD and manages its credentials (no secret to store)
  3. Your application code requests a token from the Azure Instance Metadata Service (IMDS) endpoint
  4. IMDS validates the request is coming from the resource with Managed Identity enabled
  5. IMDS returns an access token for the Managed Identity
  6. Your application uses the token to call Azure APIs
  7. Azure validates the token and checks RBAC permissions
  8. No secrets to manage, rotate, or leak - Azure handles everything
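
For comparison, the same token acquisition with Managed Identity needs no stored secret at all; the credential class talks to the IMDS endpoint on the resource's behalf. This is a minimal sketch assuming the Azure.Identity package and code running on an Azure resource with Managed Identity enabled.

// Managed Identity flow sketch - no credentials in code or configuration.
using Azure.Core;
using Azure.Identity;

var credential = new ManagedIdentityCredential();   // calls IMDS under the hood

AccessToken token = await credential.GetTokenAsync(
    new TokenRequestContext(new[] { "https://management.azure.com/.default" }));

Console.WriteLine($"Token acquired via Managed Identity; expires {token.ExpiresOn}.");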

📊 Service Principal vs Managed Identity Architecture:

graph TB
    subgraph "Service Principal Flow"
        SP1[Pipeline] -->|1. Auth with App ID + Secret| AAD1[Azure AD]
        AAD1 -->|2. Access Token| SP1
        SP1 -->|3. API Call + Token| AZ1[Azure Resource]
        AZ1 -->|4. Validate Token & RBAC| AAD1
        AZ1 -->|5. Allow/Deny| SP1
    end

    subgraph "Managed Identity Flow"
        MI1[Azure VM/App Service] -->|1. Request Token| IMDS[Instance Metadata Service]
        IMDS -->|2. Validate Resource| AAD2[Azure AD]
        AAD2 -->|3. Access Token| IMDS
        IMDS -->|4. Return Token| MI1
        MI1 -->|5. API Call + Token| AZ2[Azure Resource]
        AZ2 -->|6. Validate Token & RBAC| AAD2
        AZ2 -->|7. Allow/Deny| MI1
    end

    style SP1 fill:#fff3e0
    style MI1 fill:#c8e6c9
    style AAD1 fill:#e1f5fe
    style AAD2 fill:#e1f5fe
    style IMDS fill:#f3e5f5

See: diagrams/05_domain4_sp_vs_mi_flow.mmd

Diagram Explanation (detailed):

The diagram shows two authentication flows side by side to highlight the key difference.

Service Principal Flow (orange): The pipeline must store and provide credentials (Application ID + Secret) to Azure AD. This creates a security risk - if the secret is leaked, anyone can impersonate the Service Principal. The secret must be rotated periodically (every 90 days recommended), requiring updates to all pipelines using it. The flow is: (1) Pipeline authenticates with stored credentials, (2) Azure AD validates and issues token, (3) Pipeline calls Azure resource with token, (4) Resource validates token and checks RBAC, (5) Operation allowed or denied.

Managed Identity Flow (green): No credentials to store or manage. The Azure resource (VM, App Service, or self-hosted agent) has Managed Identity enabled, which means Azure automatically manages its identity. The flow is: (1) Application requests token from IMDS (a special endpoint only accessible from within the Azure resource), (2) IMDS validates the request is coming from a resource with Managed Identity, (3) Azure AD issues token, (4) Token returned to application, (5) Application calls Azure resource with token, (6) Resource validates token and RBAC, (7) Operation allowed or denied. The key advantage: no secrets to leak, rotate, or manage.

Detailed Example 1: Service Principal for GitHub Actions

Scenario: You have a GitHub Actions workflow that needs to deploy a web application to Azure App Service. GitHub Actions runs on GitHub-hosted runners (not in Azure), so Managed Identity is not available.

Solution: Create a Service Principal and store its credentials in GitHub Secrets.

Step-by-step:

  1. Create Service Principal: az ad sp create-for-rbac --name "github-actions-sp" --role Contributor --scopes /subscriptions/{subscription-id}/resourceGroups/{rg-name}
  2. Azure returns JSON with appId, password, and tenant
  3. Store the credentials as a GitHub Secret named AZURE_CREDENTIALS (JSON containing clientId, clientSecret, subscriptionId, and tenantId), which is what azure/login expects
  4. In your workflow, use the azure/login@v1 action:
- uses: azure/login@v1
  with:
    creds: ${{ secrets.AZURE_CREDENTIALS }}
  5. The action authenticates to Azure using the Service Principal credentials
  6. Subsequent steps can deploy to Azure App Service using the authenticated session
  7. After 90 days, you must rotate the secret: generate a new password and update the GitHub Secret
Why this approach: GitHub Actions runners are outside Azure, so they can't use Managed Identity. Service Principal is the only option. The secret is stored in GitHub Secrets (encrypted at rest), and only the workflow can access it.

Detailed Example 2: Managed Identity for Self-Hosted Azure DevOps Agent

Scenario: You have a self-hosted Azure DevOps agent running on an Azure VM. The agent needs to deploy to Azure resources and read secrets from Key Vault.

Solution: Enable System-Assigned Managed Identity on the VM and grant it appropriate RBAC roles.

Step-by-step:

  1. Enable Managed Identity on the VM: Azure Portal → VM → Identity → System assigned → On
  2. Azure automatically creates an identity in Azure AD (no secret generated)
  3. Grant RBAC roles: az role assignment create --assignee {vm-principal-id} --role Contributor --scope /subscriptions/{sub-id}/resourceGroups/{rg-name}
  4. Grant Key Vault access: Key Vault → Access policies → Add → Select principal (VM name) → Secret permissions: Get, List
  5. In your pipeline, use the AzureCLI@2 task with a service connection configured to use the VM's Managed Identity (no secret stored in the connection):
- task: AzureCLI@2
  inputs:
    azureSubscription: 'ManagedIdentityConnection'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      az webapp deploy --resource-group myRG --name myApp --src-path app.zip
  6. The task automatically uses the VM's Managed Identity (no credentials needed)
  7. Azure validates the VM's identity and checks RBAC permissions
  8. Deployment succeeds if permissions are correct

Why this approach: The agent runs in Azure, so Managed Identity is available. No secrets to manage, rotate, or leak. If the VM is compromised, you can disable the Managed Identity instantly without updating any pipelines.

Detailed Example 3: User-Assigned Managed Identity for Multiple Resources

Scenario: You have 10 Azure VMs running self-hosted agents, and they all need the same permissions (deploy to App Service, read from Key Vault). You don't want to configure each VM individually.

Solution: Create a User-Assigned Managed Identity and assign it to all VMs.

Step-by-step:

  1. Create User-Assigned Managed Identity: az identity create --name "devops-agents-identity" --resource-group "identities-rg"
  2. Grant RBAC roles to the identity: az role assignment create --assignee {identity-principal-id} --role Contributor --scope /subscriptions/{sub-id}
  3. Assign the identity to all VMs: az vm identity assign --name {vm-name} --resource-group {rg-name} --identities /subscriptions/{sub-id}/resourceGroups/identities-rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/devops-agents-identity
  4. In pipelines, use the identity (same as System-Assigned example)
  5. If you need to change permissions, update the User-Assigned Identity once, and all VMs inherit the change

Why this approach: User-Assigned Managed Identity is reusable across multiple resources. You manage permissions in one place. If you need to revoke access, you can delete the identity or remove role assignments, affecting all VMs instantly.

Must Know (Critical Facts):

  • Service Principal: Requires storing and rotating secrets, used when Managed Identity is not available (GitHub Actions, on-premises agents, external services)
  • System-Assigned Managed Identity: Tied to a single Azure resource, lifecycle managed by Azure, automatically deleted when resource is deleted, no secrets to manage
  • User-Assigned Managed Identity: Independent resource, can be assigned to multiple Azure resources, survives resource deletion, centralized permission management
  • When to use Service Principal: GitHub Actions, Azure DevOps Microsoft-hosted agents, on-premises agents, external CI/CD tools
  • When to use Managed Identity: Self-hosted agents on Azure VMs, Azure App Service, Azure Functions, Azure Container Instances, any Azure resource that supports it
  • Secret rotation: Service Principal secrets should be rotated every 90 days; Managed Identity has no secrets to rotate

When to use (Comprehensive):

  • Use Service Principal when: GitHub Actions workflows, Azure DevOps Microsoft-hosted agents, on-premises agents, external CI/CD tools (Jenkins, CircleCI), multi-tenant applications
  • Use System-Assigned Managed Identity when: Single Azure resource needs access (one VM, one App Service), resource lifecycle matches identity lifecycle, simplest setup
  • Use User-Assigned Managed Identity when: Multiple resources need same permissions, identity should survive resource deletion, centralized permission management, need to pre-create identity before resources
  • Don't use Service Principal when: Resource runs in Azure and supports Managed Identity (unnecessary complexity and security risk)
  • Don't use Managed Identity when: Resource is outside Azure (GitHub-hosted runners, on-premises servers, external services)

Limitations & Constraints:

  • Service Principal: Secrets expire (max 2 years), must be rotated, can be leaked, requires secure storage (Key Vault, GitHub Secrets)
  • Managed Identity: Only works for Azure resources, not available for GitHub-hosted runners or on-premises agents, requires Azure AD
  • System-Assigned MI: Deleted when resource is deleted, can't be shared across resources
  • User-Assigned MI: Requires separate resource creation and management, slightly more complex setup

💡 Tips for Understanding:

  • Remember: Managed Identity = No secrets to manage. If the resource is in Azure, always prefer Managed Identity.
  • Service Principal secret rotation: Set a calendar reminder for 90 days, or use Azure Key Vault to store secrets and enable auto-rotation
  • Exam tip: Questions often present a scenario and ask which authentication method to use. Look for keywords: "GitHub Actions" or "on-premises" → Service Principal. "Azure VM" or "App Service" → Managed Identity.

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Using Service Principal for self-hosted agents on Azure VMs
    • Why it's wrong: Managed Identity is available and eliminates secret management
    • Correct understanding: Always use Managed Identity when the resource is in Azure and supports it
  • Mistake 2: Thinking Managed Identity works for GitHub-hosted runners
    • Why it's wrong: GitHub-hosted runners are outside Azure, so IMDS endpoint is not accessible
    • Correct understanding: Managed Identity only works for Azure resources; use Service Principal for GitHub Actions
  • Mistake 3: Storing Service Principal secrets in pipeline variables (plain text)
    • Why it's wrong: Pipeline variables are visible to anyone with pipeline edit access
    • Correct understanding: Store secrets in Azure Key Vault or GitHub Secrets, reference them securely in pipelines

🔗 Connections to Other Topics:

  • Relates to Azure Key Vault (Section 2) because: Service Principal secrets should be stored in Key Vault, and Managed Identity is used to access Key Vault without additional secrets
  • Builds on Service Connections (Chapter 3) by: Service connections in Azure DevOps use Service Principals or Managed Identity to authenticate to Azure
  • Often used with RBAC (next section) to: Grant least-privilege permissions to Service Principals and Managed Identities

GitHub Authentication Methods

What they are: GitHub provides three main authentication methods for automation: GitHub Apps, GITHUB_TOKEN (automatic), and Personal Access Tokens (PATs).

Why they exist: GitHub Actions workflows and external tools need to interact with GitHub repositories (clone code, create issues, trigger workflows, publish packages). Using personal user credentials is insecure and doesn't scale. These authentication methods provide secure, scoped access for automation.

Real-world analogy: Think of a building with different types of access cards. A GitHub App is like a master access card for a specific application (can access multiple buildings/repos with fine-grained permissions). GITHUB_TOKEN is like a temporary visitor badge (automatically issued, expires after the visit/workflow run, limited scope). A PAT is like a personal employee badge (tied to your account, you control the scope, but if lost, it can be misused).

How GitHub Apps work (Detailed step-by-step):

  1. You create a GitHub App in your organization or personal account settings
  2. You configure the app's permissions (read/write access to repos, issues, packages, etc.)
  3. You install the app on specific repositories or all repositories in an organization
  4. The app authenticates using a private key (stored securely) and generates short-lived installation access tokens
  5. Your workflow or tool uses the installation access token to interact with GitHub APIs
  6. Tokens expire after 1 hour, and new tokens are generated as needed
  7. All actions are attributed to the GitHub App, not a user (clear audit trail)

How GITHUB_TOKEN works (Detailed step-by-step):

  1. GitHub automatically creates a GITHUB_TOKEN for every workflow run
  2. The token is scoped to the repository where the workflow is running
  3. The token has default permissions (read for most resources, write for some)
  4. You can customize permissions in the workflow YAML using the permissions: block
  5. The token is available as ${{ secrets.GITHUB_TOKEN }} in workflow steps
  6. The token expires when the workflow run completes (typically within minutes to hours)
  7. No setup required - it's automatic and free

How Personal Access Tokens (PATs) work (Detailed step-by-step):

  1. You create a PAT in your GitHub account settings (Settings → Developer settings → Personal access tokens)
  2. You choose between Classic PAT (broad scopes) or Fine-grained PAT (repo-specific, more granular)
  3. You select scopes/permissions (repo, workflow, packages, admin:repo_hook, etc.)
  4. You set an expiration date (max 1 year for Classic, custom for Fine-grained)
  5. GitHub generates the token (shown once - copy it immediately)
  6. You store the token in GitHub Secrets or Azure Key Vault
  7. Your workflow or tool uses the PAT to authenticate to GitHub APIs
  8. All actions are attributed to your user account (not ideal for automation)

📊 GitHub Authentication Methods Comparison:

graph TB
    subgraph "GitHub App"
        GA1[Workflow] -->|1. Request Token| GA2[GitHub App]
        GA2 -->|2. Auth with Private Key| GH1[GitHub API]
        GH1 -->|3. Installation Token 1hr| GA2
        GA2 -->|4. Return Token| GA1
        GA1 -->|5. API Call + Token| GH1
        GH1 -->|6. Validate & Authorize| GA1
    end

    subgraph "GITHUB_TOKEN Automatic"
        GT1[Workflow Starts] -->|1. Auto-Generate| GH2[GitHub]
        GH2 -->|2. GITHUB_TOKEN| GT2[Workflow Steps]
        GT2 -->|3. API Call + Token| GH2
        GH2 -->|4. Validate Permissions| GT2
        GT2 -->|5. Workflow Ends| GH2
        GH2 -->|6. Token Expires| GT1
    end

    subgraph "Personal Access Token"
        PAT1[User] -->|1. Create PAT| GH3[GitHub Settings]
        GH3 -->|2. Generate Token| PAT1
        PAT1 -->|3. Store in Secrets| PAT2[GitHub Secrets]
        PAT3[Workflow] -->|4. Read Secret| PAT2
        PAT3 -->|5. API Call + PAT| GH4[GitHub API]
        GH4 -->|6. Validate & Authorize| PAT3
    end

    style GA2 fill:#c8e6c9
    style GT2 fill:#e1f5fe
    style PAT2 fill:#fff3e0

See: diagrams/05_domain4_github_auth_methods.mmd

Diagram Explanation (detailed):

The diagram shows three authentication flows for GitHub automation, each with different security and lifecycle characteristics.

GitHub App Flow (green): Most secure and recommended for production. The workflow requests a token from the GitHub App, which authenticates using a private key (stored securely, never exposed). GitHub validates the app and issues a short-lived installation access token (1 hour expiration). The workflow uses this token for API calls. Key advantages: (1) Tokens are short-lived (1 hour), (2) Actions are attributed to the app, not a user, (3) Fine-grained permissions per repository, (4) Survives user account changes (not tied to a person).

GITHUB_TOKEN Flow (blue): Simplest and automatic. GitHub automatically generates a token when the workflow starts. The token is scoped to the repository and has default permissions (customizable via permissions: block). The token is available as ${{ secrets.GITHUB_TOKEN }} in all steps. When the workflow completes, the token expires immediately. Key advantages: (1) Zero setup required, (2) Automatic and free, (3) Scoped to repository, (4) Expires automatically. Limitations: (1) Can't trigger other workflows (prevents recursive triggers), (2) Can't access other repositories (unless explicitly granted), (3) Limited to workflow duration.

Personal Access Token Flow (orange): User-centric authentication. The user creates a PAT in GitHub settings, selects scopes, and sets expiration (max 1 year for Classic PATs). The PAT is stored in GitHub Secrets (encrypted at rest). The workflow reads the secret and uses it for API calls. Key advantages: (1) Can access multiple repositories, (2) Can trigger other workflows, (3) Works for cross-repo scenarios. Limitations: (1) Tied to user account (if user leaves, PAT stops working), (2) Requires manual rotation before expiration, (3) Broader permissions than needed (especially Classic PATs), (4) Actions attributed to user (audit trail confusion).

Detailed Example 1: Using GITHUB_TOKEN for Package Publishing

Scenario: You have a GitHub Actions workflow that builds a NuGet package and publishes it to GitHub Packages. You want to use the simplest authentication method.

Solution: Use the automatic GITHUB_TOKEN with appropriate permissions.

Step-by-step:

  1. In your workflow YAML, add permissions block:
name: Publish Package
on:
  push:
    branches: [main]

permissions:
  packages: write  # Grant write access to GitHub Packages
  contents: read   # Grant read access to repository contents

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup .NET
        uses: actions/setup-dotnet@v3
        with:
          dotnet-version: '8.0.x'
      
      - name: Build and Pack
        run: dotnet pack -c Release
      
      - name: Publish to GitHub Packages
        run: dotnet nuget push **/*.nupkg --source https://nuget.pkg.github.com/${{ github.repository_owner }}/index.json --api-key ${{ secrets.GITHUB_TOKEN }}
  2. GitHub automatically generates GITHUB_TOKEN with packages:write and contents:read permissions
  3. The workflow uses ${{ secrets.GITHUB_TOKEN }} to authenticate to GitHub Packages
  4. Package is published successfully
  5. Token expires when workflow completes

Why this approach: GITHUB_TOKEN is automatic, requires no setup, and is scoped to the repository. For same-repo operations (build and publish), it's the simplest and most secure option.

Detailed Example 2: Using GitHub App for Cross-Repo Workflow Triggers

Scenario: You have a monorepo with multiple services. When code is pushed to the shared-library repository, you want to trigger workflows in 5 dependent service repositories. GITHUB_TOKEN can't trigger workflows in other repos.

Solution: Create a GitHub App with workflow permissions and use it to trigger workflows.

Step-by-step:

  1. Create GitHub App: GitHub Settings → Developer settings → GitHub Apps → New GitHub App
  2. Configure permissions: Repository permissions → Actions: Read and write, Contents: Read
  3. Install the app on all 6 repositories (shared-library + 5 services)
  4. Generate and download private key (store in GitHub Secrets as APP_PRIVATE_KEY)
  5. Note the App ID (store in GitHub Secrets as APP_ID)
  6. In shared-library workflow:
name: Trigger Dependent Workflows
on:
  push:
    branches: [main]

jobs:
  trigger:
    runs-on: ubuntu-latest
    steps:
      - name: Generate GitHub App Token
        id: generate_token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.APP_ID }}
          private-key: ${{ secrets.APP_PRIVATE_KEY }}
          repositories: service1,service2,service3,service4,service5
      
      - name: Trigger Service Workflows
        run: |
          for repo in service1 service2 service3 service4 service5; do
            curl -X POST \
              -H "Authorization: Bearer ${{ steps.generate_token.outputs.token }}" \
              -H "Accept: application/vnd.github.v3+json" \
              https://api.github.com/repos/${{ github.repository_owner }}/$repo/actions/workflows/build.yml/dispatches \
              -d '{"ref":"main"}'
          done
  7. The GitHub App token is generated (1-hour expiration)
  8. The workflow triggers workflows in all 5 dependent repositories
  9. Each service repository receives the trigger and starts its build workflow

Why this approach: GitHub Apps can trigger workflows in other repositories (GITHUB_TOKEN cannot). The app is not tied to a user account, so it survives personnel changes. Tokens are short-lived (1 hour), reducing security risk.

Detailed Example 3: Using Fine-Grained PAT for Azure DevOps Integration

Scenario: You have an Azure DevOps pipeline that needs to clone a private GitHub repository, create issues, and update pull request statuses. Azure DevOps is outside GitHub, so GITHUB_TOKEN is not available.

Solution: Create a Fine-grained Personal Access Token with specific repository access and minimal scopes.

Step-by-step:

  1. Create Fine-grained PAT: GitHub Settings → Developer settings → Personal access tokens → Fine-grained tokens → Generate new token
  2. Configure token:
    • Token name: "Azure DevOps Integration"
    • Expiration: 90 days (set calendar reminder to rotate)
    • Repository access: Only select repositories → Choose the specific repo
    • Permissions: Contents (Read), Issues (Read and write), Pull requests (Read and write)
  3. Generate token and copy it immediately (shown only once)
  4. Store in Azure Key Vault: az keyvault secret set --vault-name myVault --name github-pat --value {token}
  5. In Azure Pipeline, create variable group linked to Key Vault
  6. In pipeline YAML:
trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: github-secrets  # Variable group linked to Key Vault

steps:
  - script: |
      git clone https://$(github-pat)@github.com/myorg/myrepo.git
    displayName: 'Clone GitHub Repo'
  
  - script: |
      curl -X POST \
        -H "Authorization: token $(github-pat)" \
        -H "Accept: application/vnd.github.v3+json" \
        https://api.github.com/repos/myorg/myrepo/issues \
        -d '{"title":"Build completed","body":"Azure Pipeline build #$(Build.BuildId) completed successfully"}'
    displayName: 'Create GitHub Issue'
  7. Pipeline retrieves PAT from Key Vault (Managed Identity authentication)
  8. Pipeline uses PAT to clone repo and create issue
  9. Set reminder to rotate PAT before 90-day expiration

Why this approach: Azure DevOps is outside GitHub, so GITHUB_TOKEN is not available. Fine-grained PAT provides minimal necessary permissions (better than Classic PAT with broad scopes). Storing in Key Vault adds security layer (not in pipeline variables).

Must Know (Critical Facts):

  • GITHUB_TOKEN: Automatic, free, scoped to repository, expires with workflow, can't trigger other workflows, can't access other repos (default)
  • GitHub Apps: Most secure for production, short-lived tokens (1 hour), not tied to user, fine-grained permissions, can access multiple repos, can trigger workflows
  • Classic PAT: Broad scopes (repo, workflow, admin:repo_hook), max 1 year expiration, tied to user account, legacy (use Fine-grained instead)
  • Fine-grained PAT: Repository-specific, granular permissions, custom expiration, tied to user account, recommended over Classic
  • PAT Scopes for Azure DevOps integration: repo (clone), admin:repo_hook (webhooks), workflow (trigger workflows)
  • GITHUB_TOKEN limitations: Can't trigger workflows (prevents recursive triggers), can't access other repos (unless permissions: grants it), can't be used outside GitHub Actions

When to use (Comprehensive):

  • Use GITHUB_TOKEN when: Same-repo operations (build, test, publish to GitHub Packages), simple workflows, no cross-repo access needed, no workflow triggering needed
  • Use GitHub App when: Cross-repo workflow triggers, production automation, need to survive user account changes, need fine-grained permissions, need short-lived tokens
  • Use Fine-grained PAT when: External tool integration (Azure DevOps, Jenkins), need specific repository access, need minimal scopes, temporary access (set short expiration)
  • Use Classic PAT when: Legacy integrations that don't support Fine-grained PATs (rare, migrate to Fine-grained if possible)
  • Don't use GITHUB_TOKEN when: Need to trigger workflows in other repos, need to access other repos, need to use token outside GitHub Actions
  • Don't use PAT when: GITHUB_TOKEN or GitHub App can do the job (PATs are tied to users and require rotation)

Limitations & Constraints:

  • GITHUB_TOKEN: Can't trigger workflows (by design, prevents infinite loops), limited to repository scope (unless permissions: grants broader access), expires with workflow
  • GitHub Apps: Requires setup (create app, install, manage private key), tokens expire after 1 hour (must regenerate), more complex than GITHUB_TOKEN
  • Classic PAT: Broad scopes (can't limit to specific repos), max 1 year expiration, tied to user (if user leaves, PAT stops working)
  • Fine-grained PAT: Still tied to user account, requires manual rotation, can be revoked if user loses access to repository

💡 Tips for Understanding:

  • Remember: GITHUB_TOKEN for simple same-repo tasks, GitHub App for production cross-repo automation, PAT for external integrations
  • PAT rotation: Set expiration to 90 days and calendar reminder to rotate before expiration (avoid service disruption)
  • Exam tip: Questions often ask which authentication method to use. Look for keywords: "same repository" → GITHUB_TOKEN, "trigger workflows in other repos" → GitHub App, "Azure DevOps integration" → PAT

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Using Classic PAT with broad scopes when Fine-grained PAT would work
    • Why it's wrong: Classic PAT grants access to all repos in the account (violates least-privilege)
    • Correct understanding: Use Fine-grained PAT with repository-specific access and minimal scopes
  • Mistake 2: Storing PAT in pipeline variables (plain text)
    • Why it's wrong: Pipeline variables are visible to anyone with pipeline edit access
    • Correct understanding: Store PATs in GitHub Secrets (for GitHub Actions) or Azure Key Vault (for Azure Pipelines)
  • Mistake 3: Thinking GITHUB_TOKEN can trigger workflows in other repositories
    • Why it's wrong: GITHUB_TOKEN is scoped to the repository where the workflow runs
    • Correct understanding: Use GitHub App or PAT with workflow scope to trigger workflows in other repos

🔗 Connections to Other Topics:

  • Relates to Azure Key Vault (next section) because: PATs and GitHub App private keys should be stored in Key Vault for secure access
  • Builds on Service Connections (Chapter 3) by: Azure DevOps service connections can use PATs to authenticate to GitHub
  • Often used with GitHub Packages (Chapter 3) to: Authenticate when publishing or consuming packages

Azure DevOps Permissions and Security Groups

What they are: Azure DevOps uses a hierarchical permission model with security groups at organization, project, and resource levels. Permissions control who can view, create, modify, or delete resources.

Why they exist: Teams need different levels of access. Developers need to create branches and run pipelines, but shouldn't delete projects. Administrators need full control. Stakeholders need read-only access. Security groups provide role-based access control (RBAC) to enforce least-privilege access.

Real-world analogy: Think of a hospital with different access levels. Doctors (Contributors) can access patient records and write prescriptions. Nurses (Readers) can view records but not prescribe. Hospital administrators (Project Administrators) can manage departments and staff. Visitors (Stakeholders) can only access public areas. Each role has specific permissions appropriate to their responsibilities.

How Azure DevOps permissions work (Detailed step-by-step):

  1. User is added to Azure DevOps organization (via Microsoft Entra ID or direct invitation)
  2. User is assigned to one or more security groups (Project Contributors, Build Administrators, etc.)
  3. Security groups have permissions at different levels (organization, project, repository, pipeline)
  4. When user attempts an action (create branch, run pipeline, delete work item), Azure DevOps checks permissions
  5. Permissions are evaluated hierarchically: Organization → Project → Resource (most specific wins)
  6. If user has explicit "Allow" permission, action is allowed
  7. If user has explicit "Deny" permission, action is denied (Deny overrides Allow)
  8. If no explicit permission, inheritance from parent level is checked
  9. Action is allowed or denied based on effective permissions

Must Know (Critical Facts):

  • Default Security Groups: Readers (read-only), Contributors (read/write), Build Administrators (manage pipelines), Project Administrators (full project control), Project Collection Administrators (full organization control)
  • Permission Inheritance: Permissions flow from organization → project → resource; can be overridden at each level
  • Deny Overrides Allow: If a user has both Allow and Deny for the same permission, Deny wins
  • Service Connections: Require "Administrator" or "User" role; Administrators can edit connection, Users can only use it in pipelines
  • PAT Scopes: Agent Pools (Read & manage) for self-hosted agents, Code (Read & write) for Git operations, Build (Read & execute) for pipeline triggers, Work Items (Read, write, & manage) for Azure Boards integration
  • Microsoft Entra Integration: When organization is connected to Microsoft Entra ID, users and groups are synced; use Microsoft Entra groups for centralized management

When to use (Comprehensive):

  • Use Project Contributors when: Developers need to create branches, commit code, create work items, run pipelines
  • Use Build Administrators when: DevOps engineers need to create/edit pipelines, manage agent pools, configure service connections
  • Use Project Administrators when: Team leads need to manage project settings, security groups, area paths, iterations
  • Use Stakeholders when: Business users need read-only access to work items and dashboards (free access level)
  • Use Microsoft Entra groups when: Managing permissions for large teams (centralized in Microsoft Entra ID, synced to Azure DevOps)
  • Don't use Project Collection Administrators for: Regular users (too much power, violates least-privilege)
  • Don't grant full-scope PATs to: Service accounts or automation (use minimal scopes needed)

💡 Tips for Understanding:

  • Remember: Deny always wins over Allow (useful for restricting specific users while allowing group)
  • PAT scope selection: Only select scopes needed for the task; avoid "Full access" (violates least-privilege)
  • Exam tip: Questions often ask about permission levels. Look for keywords: "read-only" → Readers/Stakeholders, "create pipelines" → Build Administrators, "manage project" → Project Administrators

Section 2: Design and Implement Strategy for Managing Sensitive Information

Introduction

The problem: Pipelines need access to sensitive information (database passwords, API keys, certificates, connection strings). Storing secrets in code or pipeline variables is insecure (visible in logs, accessible to anyone with repo access, no rotation, no audit trail).

The solution: Use Azure Key Vault to centrally store and manage secrets, keys, and certificates. Access secrets in pipelines using Managed Identity or Service Principal. Implement secret rotation, access policies, and audit logging.

Why it's tested: Secret management is critical to DevSecOps. The exam tests your ability to securely store secrets, access them in pipelines without exposure, and implement secret rotation and compliance.

Core Concepts

Azure Key Vault for Secrets Management

What it is: Azure Key Vault is a cloud service for securely storing and accessing secrets (passwords, API keys), keys (encryption keys), and certificates (SSL/TLS certificates).

Why it exists: Applications need secrets to connect to databases, APIs, and services. Storing secrets in code, configuration files, or pipeline variables creates security risks:

  • Secrets are visible in source control history (even if deleted later)
  • Secrets appear in logs and error messages
  • No centralized rotation (must update every location)
  • No audit trail (who accessed which secret when?)
  • No access control (anyone with repo access can see secrets)

Azure Key Vault solves these problems by providing centralized, secure secret storage with access control, audit logging, and rotation capabilities.

Real-world analogy: Think of a bank safe deposit box. Instead of keeping valuables (secrets) in your desk drawer (code/variables) where anyone can find them, you store them in a secure vault. Only authorized people with the right key (Managed Identity/Service Principal) can access the box. The bank (Azure) keeps a log of every access. If you need to change the lock (rotate secret), you do it once in the vault, not in every location.

How Azure Key Vault works (Detailed step-by-step):

  1. You create a Key Vault in Azure (unique name, region, pricing tier)
  2. You store secrets in the vault: az keyvault secret set --vault-name myVault --name dbPassword --value "P@ssw0rd123"
  3. You configure access policies or RBAC to grant permissions (who can read/write secrets)
  4. Your pipeline authenticates using Managed Identity or Service Principal
  5. Pipeline requests secret from Key Vault: az keyvault secret show --vault-name myVault --name dbPassword --query value -o tsv
  6. Key Vault validates the identity and checks access policies
  7. If authorized, Key Vault returns the secret value
  8. Pipeline uses the secret (never logs or exposes it)
  9. Key Vault logs the access (who, when, which secret) for audit

📊 Azure Key Vault Integration with Pipelines:

sequenceDiagram
    participant Pipeline
    participant MI as Managed Identity
    participant AAD as Azure AD
    participant KV as Key Vault
    participant App as Application

    Pipeline->>MI: 1. Request Token
    MI->>AAD: 2. Authenticate (no secret)
    AAD->>MI: 3. Access Token
    MI->>Pipeline: 4. Return Token
    Pipeline->>KV: 5. Get Secret + Token
    KV->>AAD: 6. Validate Token
    AAD->>KV: 7. Token Valid
    KV->>KV: 8. Check Access Policy
    KV->>Pipeline: 9. Return Secret Value
    Pipeline->>App: 10. Deploy with Secret
    KV->>KV: 11. Log Access (audit)

    Note over Pipeline,KV: Secret never stored in pipeline<br/>No secret in logs<br/>Centralized rotation<br/>Full audit trail

See: diagrams/05_domain4_keyvault_pipeline_flow.mmd

Diagram Explanation (detailed):

This sequence diagram shows the secure flow of accessing secrets from Azure Key Vault in a pipeline using Managed Identity.

Authentication Phase (Steps 1-4): The pipeline running on an Azure resource (VM, App Service) requests a token from the Managed Identity service. The Managed Identity authenticates to Azure AD without any stored credentials (Azure manages this automatically). Azure AD validates the resource has Managed Identity enabled and issues an access token. The token is returned to the pipeline. This entire phase happens without any secrets being stored or exposed.

Secret Retrieval Phase (Steps 5-9): The pipeline makes a request to Key Vault to retrieve a specific secret, including the access token. Key Vault validates the token with Azure AD to ensure it's legitimate and not expired. Azure AD confirms the token is valid. Key Vault then checks its access policies to verify the Managed Identity has permission to read the requested secret. If authorized, Key Vault returns the secret value to the pipeline. The secret is transmitted over HTTPS and never logged.

Usage and Audit Phase (Steps 10-11): The pipeline uses the secret to deploy the application (e.g., connection string for database, API key for external service). The secret is passed securely to the application without being exposed in logs or pipeline variables. Key Vault logs the access event, recording which identity accessed which secret at what time. This creates a full audit trail for compliance.

Key Security Benefits: (1) No secrets stored in pipeline code or variables, (2) Secrets never appear in logs or error messages, (3) Centralized secret rotation (update once in Key Vault, all pipelines get new value), (4) Full audit trail (who accessed what when), (5) Access control (only authorized identities can read secrets), (6) Managed Identity eliminates need to store Service Principal secrets.

Detailed Example 1: Using Key Vault in Azure Pipeline with Managed Identity

Scenario: You have an Azure Pipeline running on a self-hosted agent (Azure VM with Managed Identity). The pipeline needs to deploy a web app with a database connection string stored in Key Vault.

Solution: Configure Key Vault access policy for the VM's Managed Identity and retrieve the secret in the pipeline.

Step-by-step:

  1. Enable Managed Identity on the VM: Azure Portal → VM → Identity → System assigned → On
  2. Grant Key Vault access: az keyvault set-policy --name myVault --object-id {vm-principal-id} --secret-permissions get list
  3. Store secret in Key Vault: az keyvault secret set --vault-name myVault --name dbConnectionString --value "Server=myserver;Database=mydb;User=admin;Password=P@ssw0rd"
  4. In pipeline YAML:
trigger:
  - main

pool:
  name: 'SelfHostedPool'  # Pool with VMs that have Managed Identity

variables:
  keyVaultName: 'myVault'

steps:
  - task: AzureCLI@2
    displayName: 'Get Secret from Key Vault'
    inputs:
      azureSubscription: 'ManagedIdentityConnection'  # Service connection using Managed Identity
      scriptType: 'bash'
      scriptLocation: 'inlineScript'
      inlineScript: |
        # Retrieve secret from Key Vault
        DB_CONNECTION_STRING=$(az keyvault secret show --vault-name $(keyVaultName) --name dbConnectionString --query value -o tsv)
        
        # Set as pipeline variable (marked as secret)
        echo "##vso[task.setvariable variable=DbConnectionString;issecret=true]$DB_CONNECTION_STRING"
  
  - task: AzureWebApp@1
    displayName: 'Deploy Web App'
    inputs:
      azureSubscription: 'ManagedIdentityConnection'
      appName: 'myWebApp'
      package: '$(System.DefaultWorkingDirectory)/**/*.zip'
      appSettings: '-ConnectionStrings:DefaultConnection "$(DbConnectionString)"'
  5. Pipeline runs, retrieves secret from Key Vault using Managed Identity
  6. Secret is set as pipeline variable with issecret=true (masked in logs)
  7. Web app is deployed with connection string from Key Vault
  8. Secret never appears in logs or pipeline definition

Why this approach: Managed Identity eliminates need to store Service Principal credentials. Secret is retrieved at runtime (always current if rotated in Key Vault). Secret is masked in logs (issecret=true). Full audit trail in Key Vault.

Detailed Example 2: Using Key Vault Task in Azure Pipeline

Scenario: You have an Azure Pipeline using Microsoft-hosted agents (no Managed Identity available). The pipeline needs multiple secrets from Key Vault.

Solution: Use Azure Key Vault task to download secrets as pipeline variables.

Step-by-step:

  1. Create Service Principal: az ad sp create-for-rbac --name "pipeline-sp" --role Reader --scopes /subscriptions/{sub-id}
  2. Grant Key Vault access: az keyvault set-policy --name myVault --spn {app-id} --secret-permissions get list
  3. Create service connection in Azure DevOps: Project Settings → Service connections → New → Azure Resource Manager → Service Principal (manual)
  4. In pipeline YAML:
trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'  # Microsoft-hosted agent

steps:
  - task: AzureKeyVault@2
    displayName: 'Download Secrets from Key Vault'
    inputs:
      azureSubscription: 'AzureServiceConnection'  # Service connection with Service Principal
      KeyVaultName: 'myVault'
      SecretsFilter: 'dbPassword,apiKey,certificatePassword'  # Comma-separated list of secrets
      RunAsPreJob: false  # Download secrets during job execution
  
  - script: |
      echo "Connecting to database..."
      # Secrets are available as pipeline variables
      # $(dbPassword), $(apiKey), $(certificatePassword)
      # They are automatically masked in logs
      mysql -h myserver -u admin -p$(dbPassword) -e "SELECT 1"
    displayName: 'Use Secrets'
  5. AzureKeyVault@2 task authenticates using Service Principal
  6. Task downloads specified secrets from Key Vault
  7. Secrets are automatically set as pipeline variables (masked in logs)
  8. Subsequent steps can use secrets as $(secretName)

Why this approach: Works with Microsoft-hosted agents (no Managed Identity available). Multiple secrets downloaded in one task. Secrets automatically masked in logs. No need to manually retrieve each secret.
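
Another supported option is to link an Azure DevOps variable group to Key Vault (Pipelines → Library → Variable groups → Link secrets from an Azure key vault). A minimal sketch is shown below; it assumes a variable group named production-secrets has already been linked to myVault and that dbPassword is one of the selected secrets:

trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: production-secrets  # Variable group linked to Key Vault in Library settings

steps:
  - script: |
      # Linked secrets are exposed as pipeline variables and masked in logs
      mysql -h myserver -u admin -p$(dbPassword) -e "SELECT 1"
    displayName: 'Use secret from Key Vault-backed variable group'

Compared with the AzureKeyVault@2 task, the link is configured once in the Library and reused across pipelines, but every pipeline that uses the group gets access to all of its selected secrets.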

Detailed Example 3: Secret Rotation with Key Vault

Scenario: Your database password is stored in Key Vault and used by 10 different pipelines. The password must be rotated every 90 days for compliance.

Solution: Rotate secret in Key Vault; all pipelines automatically use new value on next run.

Step-by-step:

  1. Current secret in Key Vault: dbPassword = "OldP@ssw0rd123"
  2. 10 pipelines retrieve secret from Key Vault at runtime
  3. After 90 days, rotate password:
    • Change password in database: ALTER USER admin WITH PASSWORD 'NewP@ssw0rd456'
    • Update secret in Key Vault: az keyvault secret set --vault-name myVault --name dbPassword --value "NewP@ssw0rd456"
  4. Next time any pipeline runs, it retrieves the new password from Key Vault
  5. No pipeline changes needed (all 10 pipelines automatically use new password)
  6. Old password version is retained in Key Vault (can be restored if needed)

Why this approach: Centralized rotation (update once, affects all pipelines). No pipeline changes required. Old versions retained for rollback. Full audit trail of rotation events.
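
A minimal sketch of automating this rotation on a schedule, assuming the ManagedIdentityConnection service connection from Example 1; the database-update command is left as a commented placeholder because it depends on your database engine:

schedules:
  - cron: '0 3 1 * *'        # 03:00 UTC on the 1st of each month; adjust to your rotation policy
    displayName: 'Scheduled secret rotation'
    branches:
      include:
        - main
    always: true             # Run even when there are no code changes

trigger: none                # Rotation runs only on the schedule

pool:
  vmImage: 'ubuntu-latest'

steps:
  - task: AzureCLI@2
    displayName: 'Rotate dbPassword in Key Vault'
    inputs:
      azureSubscription: 'ManagedIdentityConnection'
      scriptType: 'bash'
      scriptLocation: 'inlineScript'
      inlineScript: |
        # Generate a new random password (never echoed to the log)
        NEW_PASSWORD=$(openssl rand -base64 24)

        # 1. Update the credential in the database first (engine-specific, illustrative placeholder)
        # psql -c "ALTER USER admin WITH PASSWORD '$NEW_PASSWORD'"

        # 2. Store the new value in Key Vault; a new secret version is created automatically
        az keyvault secret set --vault-name myVault --name dbPassword --value "$NEW_PASSWORD" --output none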

Must Know (Critical Facts):

  • Key Vault Access Methods: Access policies (legacy, simpler) or Azure RBAC (recommended, consistent with other Azure resources)
  • Secret Versions: Key Vault maintains versions of secrets; can reference specific version or always get latest
  • Soft Delete: Deleted secrets are retained for 90 days (can be recovered); prevents accidental permanent deletion
  • Purge Protection: When enabled, secrets cannot be permanently deleted during retention period (compliance requirement)
  • AzureKeyVault@2 Task: Downloads secrets from Key Vault and sets them as pipeline variables (automatically masked)
  • Secret Masking: Use issecret=true when setting pipeline variables to mask them in logs
  • Key Vault Firewall: Can restrict access to specific IP ranges or virtual networks (defense in depth)
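
A minimal sketch applying the access-method and protection facts above: create a vault with Azure RBAC authorization and purge protection, then grant the pipeline's Managed Identity read access with the built-in Key Vault Secrets User role. Names and the {vm-principal-id} placeholder are illustrative, and the same commands can be run from a local shell instead of a pipeline step:

steps:
  - task: AzureCLI@2
    displayName: 'Create hardened Key Vault and grant RBAC access'
    inputs:
      azureSubscription: 'ManagedIdentityConnection'
      scriptType: 'bash'
      scriptLocation: 'inlineScript'
      inlineScript: |
        # Soft delete is enabled by default on new vaults; add RBAC authorization and purge protection
        az keyvault create \
          --name myVault \
          --resource-group myRG \
          --location eastus \
          --enable-rbac-authorization true \
          --enable-purge-protection true

        # Least-privilege read access to secrets for the Managed Identity
        VAULT_ID=$(az keyvault show --name myVault --resource-group myRG --query id -o tsv)
        az role assignment create \
          --assignee {vm-principal-id} \
          --role "Key Vault Secrets User" \
          --scope "$VAULT_ID"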

When to use (Comprehensive):

  • Use Key Vault when: Storing database passwords, API keys, certificates, connection strings, any sensitive configuration
  • Use Managed Identity when: Pipeline runs on Azure resource (VM, App Service, Container Instance) - no secrets to manage
  • Use Service Principal when: Pipeline runs on Microsoft-hosted agents or external CI/CD tools - requires storing Service Principal secret
  • Use AzureKeyVault@2 task when: Need to download multiple secrets in one step, using Microsoft-hosted agents
  • Use Azure CLI when: Need fine-grained control over secret retrieval, using self-hosted agents with Managed Identity
  • Don't store secrets in: Pipeline variables (visible to editors), code (visible in history), configuration files (visible in repo)
  • Don't use Key Vault for: Non-sensitive configuration (use App Configuration), large files (use Blob Storage)

Limitations & Constraints:

  • Key Vault Limits: 25,000 secrets per vault, 25 KB max secret size, 5,000 requests per 10 seconds per vault
  • Access Policy Limit: 1,024 access policies per Key Vault (use RBAC for larger teams)
  • Network Access: By default, Key Vault is accessible from internet; use firewall rules or private endpoints for restricted access
  • Pricing: $0.03 per 10,000 operations (very low cost for most scenarios)

💡 Tips for Understanding:

  • Remember: Key Vault is for secrets, App Configuration is for non-sensitive settings
  • Secret rotation: Update secret in Key Vault once, all pipelines get new value automatically (no pipeline changes)
  • Exam tip: Questions often ask where to store secrets. Look for keywords: "database password" or "API key" → Key Vault, "feature flags" or "app settings" → App Configuration

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Storing secrets in pipeline variables instead of Key Vault
    • Why it's wrong: Pipeline variables are visible to anyone with pipeline edit access, no audit trail, no rotation
    • Correct understanding: Store secrets in Key Vault, retrieve at runtime, mask in logs
  • Mistake 2: Hardcoding Key Vault name in pipeline (not using variables)
    • Why it's wrong: Different environments (dev, staging, prod) should use different Key Vaults
    • Correct understanding: Use pipeline variables for Key Vault name, override per environment
  • Mistake 3: Not enabling soft delete and purge protection
    • Why it's wrong: Accidental deletion can permanently lose secrets, compliance violation
    • Correct understanding: Enable soft delete (90-day retention) and purge protection (cannot be disabled once enabled)

🔗 Connections to Other Topics:

  • Relates to Managed Identity (Section 1) because: Managed Identity is the recommended way to access Key Vault from Azure resources
  • Builds on Service Connections (Chapter 3) by: keeping the Service Principal secrets used by service connections in Key Vault
  • Often used with Deployment Slots (Chapter 3) to: give each slot its own Key Vault secrets for environment-specific configuration

Section 3: Automate Security and Compliance Scanning

Introduction

The problem: Security vulnerabilities in code, dependencies, containers, and infrastructure configurations are discovered late in the development cycle (or in production), making them expensive and time-consuming to fix. Manual security reviews don't scale and miss issues.

The solution: Automate security scanning in CI/CD pipelines to detect vulnerabilities early ("shift-left security"). Use tools like Microsoft Defender for Cloud DevOps Security, GitHub Advanced Security, CodeQL, and Dependabot to scan code, secrets, dependencies, containers, and IaC templates.

Why it's tested: DevSecOps is a core principle of modern DevOps. The exam tests your ability to integrate security scanning into pipelines, configure scanning tools, and prioritize remediation based on findings.

Core Concepts

Microsoft Defender for Cloud DevOps Security

What it is: A cloud-native application protection platform (CNAPP) that provides unified visibility, posture management, and threat protection for DevOps environments (Azure DevOps, GitHub, GitLab). It scans code, secrets, dependencies, IaC templates, and container images for vulnerabilities and misconfigurations.

Why it exists: Security teams need visibility into security posture across multi-pipeline environments. Developers need actionable findings integrated into their workflows (pull request annotations). Organizations need to prioritize remediation based on code-to-cloud context (which vulnerabilities affect production resources?).

How Defender for Cloud DevOps Security works (Detailed step-by-step):

  1. You connect your DevOps environment (Azure DevOps, GitHub, GitLab) to Defender for Cloud
  2. Defender for Cloud discovers all organizations, projects, and repositories
  3. Agentless scanners automatically scan resources every 24 hours (no pipeline changes needed)
  4. Scanners detect: Code vulnerabilities (CodeQL), Secrets (exposed API keys, passwords), Dependencies (vulnerable packages), IaC misconfigurations (insecure ARM/Bicep templates), Container vulnerabilities (image scanning)
  5. Findings are correlated with cloud resources (which vulnerable code is deployed to production?)
  6. Security recommendations are generated with severity and priority
  7. Pull request annotations automatically comment on PRs with findings (developers see issues before merge)
  8. Security teams view findings in Defender for Cloud dashboard with code-to-cloud context

Must Know (Critical Facts):

  • Agentless Scanning: No pipeline changes required; Defender for Cloud scans repositories directly via API
  • Code-to-Cloud Correlation: Findings are prioritized based on whether vulnerable code is deployed to production
  • Pull Request Annotations: Automatically comments on PRs with security findings (shift-left security)
  • Supported Platforms: Azure DevOps, GitHub, GitLab (multi-pipeline support)
  • Scanning Types: Code (CodeQL), Secrets, Dependencies (Dependabot), IaC (ARM/Bicep/Terraform), Containers
  • Foundational CSPM: Free tier includes DevOps inventory and basic recommendations
  • Defender CSPM: Paid tier includes advanced scanning, attack path analysis, and code-to-cloud correlation

GitHub Advanced Security

What it is: A suite of security features for GitHub repositories, including CodeQL (code scanning), secret scanning, and Dependabot (dependency scanning). Available for GitHub Enterprise Cloud and GitHub Enterprise Server.

Why it exists: Developers need security feedback integrated into their workflow (GitHub UI, pull requests). Organizations need to enforce security policies (block PRs with critical vulnerabilities). GitHub Advanced Security provides native security scanning within GitHub.

How GitHub Advanced Security works (Detailed step-by-step):

  1. You enable GitHub Advanced Security for your organization or repository (requires GitHub Enterprise)
  2. CodeQL Code Scanning: Analyzes code for security vulnerabilities (SQL injection, XSS, etc.)
    • You add a CodeQL workflow to .github/workflows/codeql.yml (see the sketch after this list)
    • Workflow runs on push, pull request, or schedule
    • CodeQL builds code, extracts semantic model, and queries for vulnerabilities
    • Findings are displayed in Security tab and as PR comments
  3. Secret Scanning: Detects exposed secrets (API keys, passwords, tokens)
    • GitHub automatically scans all commits for known secret patterns
    • When secret is detected, GitHub alerts repository admins
    • Push protection (optional) blocks commits containing secrets
  4. Dependabot: Scans dependencies for known vulnerabilities (CVEs)
    • Dependabot checks dependencies against vulnerability database
    • Creates pull requests to update vulnerable dependencies
    • Dependabot security updates are automatic (can be configured)
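
A minimal sketch of the workflow file referenced in step 2, assuming a repository containing C# and JavaScript code; the language list, trigger branches, and action versions would be adjusted to your project:

# .github/workflows/codeql.yml
name: CodeQL

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  schedule:
    - cron: '0 6 * * 1'      # Weekly scheduled scan

jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # Required to upload CodeQL results
    strategy:
      matrix:
        language: [ 'csharp', 'javascript' ]
    steps:
      - uses: actions/checkout@v4

      - name: Initialize CodeQL
        uses: github/codeql-action/init@v3
        with:
          languages: ${{ matrix.language }}

      - name: Autobuild
        uses: github/codeql-action/autobuild@v3

      - name: Analyze
        uses: github/codeql-action/analyze@v3

Findings from every matrix language appear in the repository's Security tab and as pull request annotations.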

Must Know (Critical Facts):

  • CodeQL: Semantic code analysis engine that understands code structure (not just regex patterns)
  • Secret Scanning: Detects 200+ secret types (AWS keys, Azure tokens, GitHub PATs, etc.)
  • Push Protection: Blocks commits containing secrets (prevents accidental exposure)
  • Dependabot Alerts: Notifies when dependencies have known vulnerabilities
  • Dependabot Security Updates: Automatically creates PRs to update vulnerable dependencies
  • Dependabot Version Updates: Automatically creates PRs to keep dependencies up-to-date (not security-specific)
  • GitHub Enterprise Required: Advanced Security features require GitHub Enterprise Cloud or Server (not available on Free/Pro/Team)
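
Dependabot version updates are configured with a .github/dependabot.yml file (alerts and security updates are enabled in repository settings and do not need this file). A minimal sketch for a .NET project that also keeps its GitHub Actions current; the ecosystems and schedule are illustrative:

# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "nuget"           # Scan .NET package references
    directory: "/"                       # Location of the project or solution files
    schedule:
      interval: "weekly"
    open-pull-requests-limit: 5

  - package-ecosystem: "github-actions"  # Keep workflow actions up to date
    directory: "/"
    schedule:
      interval: "weekly"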

Container Scanning

What it is: Automated scanning of container images for vulnerabilities in OS packages, application dependencies, and configuration issues. Integrated into CI/CD pipelines to prevent vulnerable images from reaching production.

Why it exists: Container images often contain vulnerable OS packages (outdated Ubuntu, Alpine) and application dependencies (vulnerable npm packages). Scanning images before deployment prevents known vulnerabilities from reaching production.

How container scanning works (Detailed step-by-step):

  1. You build a container image in your pipeline: docker build -t myapp:latest .
  2. You scan the image using a scanning tool (Defender for Containers, Trivy, Snyk, etc.)
  3. Scanner analyzes image layers and extracts:
    • OS packages (apt, yum, apk packages)
    • Application dependencies (npm, pip, Maven packages)
    • Configuration files (Dockerfile, environment variables)
  4. Scanner compares packages against vulnerability databases (CVE, NVD)
  5. Scanner generates findings with severity (Critical, High, Medium, Low)
  6. Pipeline fails if critical vulnerabilities are found (configurable threshold)
  7. Findings are reported to security dashboard (Defender for Cloud, GitHub Security tab)
  8. Developers fix vulnerabilities (update base image, update dependencies) and rebuild
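
A compact sketch of steps 1-6 as Azure Pipelines steps, running Trivy from its own container image so no installation is needed (Docker is preinstalled on Microsoft-hosted Ubuntu agents); the image name is illustrative:

steps:
  - script: docker build -t myapp:$(Build.BuildId) .
    displayName: 'Build container image'

  - script: |
      # Mount the Docker socket so Trivy can inspect the locally built image
      docker run --rm \
        -v /var/run/docker.sock:/var/run/docker.sock \
        aquasec/trivy:latest image \
          --severity HIGH,CRITICAL \
          --exit-code 1 \
          myapp:$(Build.BuildId)
    displayName: 'Scan image and fail on HIGH/CRITICAL findings'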

Must Know (Critical Facts):

  • Defender for Containers: Azure-native container scanning integrated with Defender for Cloud
  • Trivy: Open-source container scanner (supports Docker, Kubernetes, IaC)
  • Scan Triggers: Scan on image build (CI pipeline), scan on push to registry (Azure Container Registry), scan on deployment (admission controller)
  • Vulnerability Databases: CVE (Common Vulnerabilities and Exposures), NVD (National Vulnerability Database)
  • Base Image Updates: Regularly update base images (e.g., FROM ubuntu:22.04 → FROM ubuntu:24.04) to get security patches
  • Multi-Stage Builds: Use multi-stage Dockerfiles to reduce image size and attack surface (only include runtime dependencies)

When to use (Comprehensive):

  • Use Defender for Cloud DevOps Security when: Need unified visibility across Azure DevOps, GitHub, and GitLab; need code-to-cloud correlation; need agentless scanning
  • Use GitHub Advanced Security when: Using GitHub Enterprise; need native GitHub integration; need CodeQL for semantic code analysis
  • Use Dependabot when: Need automated dependency updates; using GitHub (free for public repos, requires Enterprise for private repos)
  • Use container scanning when: Deploying containerized applications; using Azure Container Registry, Docker Hub, or other registries
  • Use CodeQL when: Need deep semantic code analysis (not just regex patterns); scanning C/C++, C#, Java, JavaScript, Python, Go, Ruby
  • Don't rely solely on: Manual code reviews (don't scale, miss issues); scanning only in production (too late, expensive to fix)

💡 Tips for Understanding:

  • Remember: Shift-left security = scan early in development (PR, commit) rather than late (production)
  • CodeQL vs regex: CodeQL understands code semantics (data flow, control flow), regex only matches patterns
  • Exam tip: Questions often ask which scanning tool to use. Look for keywords: "GitHub Enterprise" → GitHub Advanced Security, "multi-pipeline" → Defender for Cloud, "container images" → Defender for Containers or Trivy

⚠️ Common Mistakes & Misconceptions:

  • Mistake 1: Thinking GitHub Advanced Security is free for all GitHub users
    • Why it's wrong: Advanced Security requires GitHub Enterprise Cloud or Server (not available on Free/Pro/Team)
    • Correct understanding: Public repositories get several features free (Dependabot alerts, secret scanning, and CodeQL code scanning), but private repositories require GitHub Enterprise with Advanced Security
  • Mistake 2: Scanning only in production or pre-production
    • Why it's wrong: Vulnerabilities discovered late are expensive to fix (code already deployed, dependencies locked)
    • Correct understanding: Scan in CI pipeline (on commit, PR) to catch issues early when they're cheap to fix
  • Mistake 3: Not failing pipelines on critical vulnerabilities
    • Why it's wrong: Vulnerable code reaches production, increasing risk
    • Correct understanding: Configure pipeline to fail if critical vulnerabilities are found (force developers to fix before merge)

Chapter Summary

What We Covered

This chapter covered the security and compliance domain (10-15% of exam), focusing on:

Authentication and Authorization

  • Service Principals vs Managed Identity (when to use each)
  • GitHub authentication methods (GitHub Apps, GITHUB_TOKEN, PATs)
  • Azure DevOps permissions and security groups

Managing Sensitive Information

  • Azure Key Vault for secrets, keys, and certificates
  • Accessing Key Vault from pipelines (Managed Identity, Service Principal)
  • Secret rotation and compliance

Automating Security Scanning

  • Microsoft Defender for Cloud DevOps Security (agentless scanning, code-to-cloud correlation)
  • GitHub Advanced Security (CodeQL, secret scanning, Dependabot)
  • Container scanning (Defender for Containers, Trivy)

Critical Takeaways

  1. Managed Identity > Service Principal: Always use Managed Identity for Azure resources (no secrets to manage); use Service Principal only when Managed Identity is not available (GitHub Actions, on-premises)
  2. GitHub Authentication: Use GITHUB_TOKEN for same-repo operations, GitHub App for cross-repo workflows, PAT for external integrations
  3. Key Vault for Secrets: Never store secrets in code or pipeline variables; use Key Vault with Managed Identity for secure access
  4. Shift-Left Security: Scan code, dependencies, and containers in CI pipeline (on commit, PR) to catch vulnerabilities early
  5. Defender for Cloud: Provides unified visibility across multi-pipeline environments with code-to-cloud correlation
  6. GitHub Advanced Security: Requires GitHub Enterprise; provides CodeQL, secret scanning, and Dependabot for GitHub-native security

Self-Assessment Checklist

Test yourself before moving on:

Authentication:

  • I can explain when to use Service Principal vs Managed Identity
  • I understand the three GitHub authentication methods and when to use each
  • I know how to configure Azure DevOps permissions and security groups

Secrets Management:

  • I can explain how to store and retrieve secrets from Azure Key Vault
  • I understand how to use Managed Identity to access Key Vault from pipelines
  • I know how to implement secret rotation

Security Scanning:

  • I can describe Microsoft Defender for Cloud DevOps Security capabilities
  • I understand GitHub Advanced Security features (CodeQL, secret scanning, Dependabot)
  • I know how to implement container scanning in pipelines

Practice Questions

Try these from your practice test bundles:

  • Domain 4 Bundle: Questions 1-15 (Security and Compliance)
  • Full Practice Test 1: Domain 4 questions
  • Expected score: 75%+ to proceed

If you scored below 75%:

  • Review sections on authentication methods and when to use each
  • Focus on Key Vault integration with pipelines
  • Re-read security scanning tools and their capabilities

Quick Reference Card

Authentication:

  • Managed Identity: Azure resources, no secrets, automatic
  • Service Principal: GitHub Actions, on-premises, requires secret storage
  • GITHUB_TOKEN: Same-repo, automatic, expires with workflow
  • GitHub App: Cross-repo, short-lived tokens (1hr), not tied to user
  • PAT: External integrations, tied to user, requires rotation

Key Vault:

  • Access: Managed Identity (preferred) or Service Principal
  • Tasks: AzureKeyVault@2 (download secrets), AzureCLI@2 (retrieve with CLI)
  • Secret Masking: Use issecret=true when setting pipeline variables
  • Rotation: Update in Key Vault once, all pipelines get new value

Security Scanning:

  • Defender for Cloud: Multi-pipeline, agentless, code-to-cloud correlation
  • GitHub Advanced Security: GitHub Enterprise, CodeQL, secret scanning, Dependabot
  • Container Scanning: Defender for Containers, Trivy, scan on build/push/deploy

Decision Points:

  • Need Azure resource authentication? → Managed Identity
  • Need GitHub Actions authentication? → Service Principal
  • Need to store secrets? → Azure Key Vault
  • Need multi-pipeline security visibility? → Defender for Cloud
  • Need GitHub-native security? → GitHub Advanced Security
  • Need container vulnerability scanning? → Defender for Containers or Trivy

Next Chapter: 06_domain5_instrumentation - Implement an Instrumentation Strategy (Monitoring, telemetry, alerts, log analysis)


Chapter 5: Implement an Instrumentation Strategy (5-10% of exam)

Chapter Overview

What you'll learn:

  • Configuring monitoring for DevOps environments (Azure Monitor, Application Insights, GitHub insights)
  • Collecting telemetry from applications and infrastructure
  • Analyzing metrics and logs using KQL queries
  • Configuring alerts for pipelines and applications

Time to complete: 4-6 hours
Prerequisites: Chapters 0-4 (Fundamentals through Security)

Why this domain matters: "You can't improve what you don't measure." Instrumentation provides visibility into application performance, infrastructure health, and pipeline reliability. This domain tests your ability to configure monitoring, collect telemetry, analyze data, and set up alerts to detect and respond to issues quickly.


Section 1: Configure Monitoring for DevOps Environment

Introduction

The problem: Without monitoring, you're blind to application performance issues, infrastructure failures, and pipeline problems. Issues are discovered by users (bad customer experience) or during incidents (reactive firefighting). No data means no ability to optimize or improve.

The solution: Implement comprehensive monitoring across applications, infrastructure, and pipelines. Collect telemetry (metrics, logs, traces) from all components. Visualize data in dashboards. Set up alerts to detect issues proactively.

Why it's tested: Monitoring is essential to DevOps feedback loops. The exam tests your ability to configure monitoring tools, collect relevant telemetry, and use data to improve systems.

Core Concepts

Azure Monitor and Application Insights

What they are: Azure Monitor is a comprehensive monitoring solution for Azure resources, applications, and infrastructure. Application Insights is a feature of Azure Monitor focused on application performance monitoring (APM) with distributed tracing, dependency tracking, and exception monitoring.

Why they exist: Applications and infrastructure generate vast amounts of telemetry data (metrics, logs, traces). Without a centralized monitoring solution, this data is scattered across systems, making it impossible to correlate events, identify root causes, or detect patterns. Azure Monitor provides unified collection, storage, analysis, and alerting for all telemetry.

Real-world analogy: Think of Azure Monitor as a hospital's patient monitoring system. Just as doctors monitor vital signs (heart rate, blood pressure, temperature) from a central dashboard and receive alerts when values are abnormal, Azure Monitor collects telemetry (CPU, memory, request rate, error rate) from all systems and alerts you when thresholds are exceeded.

How Azure Monitor works (Detailed step-by-step):

  1. Data Collection: Telemetry is collected from multiple sources:
    • Application Insights SDK (embedded in application code)
    • Azure Diagnostics extension (VMs, Cloud Services)
    • Azure Monitor agent (VMs, on-premises servers)
    • Azure resource logs (emitted automatically by Azure services; routed to a Log Analytics workspace via diagnostic settings - see the sketch after this list)
    • Custom metrics and logs (via API)
  2. Data Storage: All telemetry is stored in Azure Monitor data stores:
    • Metrics: Time-series database (optimized for fast queries)
    • Logs: Log Analytics workspace (supports KQL queries)
    • Traces: Application Insights (distributed tracing)
  3. Data Analysis: Query and visualize data:
    • Metrics Explorer: Chart metrics over time
    • Log Analytics: Query logs using KQL
    • Application Insights: Analyze application performance, dependencies, failures
  4. Alerting: Configure alerts based on metrics or log queries:
    • Metric alerts: Trigger when metric exceeds threshold
    • Log alerts: Trigger when log query returns results
    • Action groups: Send notifications (email, SMS, webhook, Logic App)
  5. Visualization: Create dashboards to monitor key metrics:
    • Azure dashboards: Pin charts from Metrics Explorer and Log Analytics
    • Workbooks: Interactive reports with parameters and visualizations
    • Power BI: Advanced analytics and reporting
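
A minimal sketch of the diagnostic-settings step mentioned above, routing an App Service's HTTP logs and platform metrics to a Log Analytics workspace; resource names and the log category are illustrative, and the commands can also be run from a local shell:

steps:
  - task: AzureCLI@2
    displayName: 'Route App Service logs to Log Analytics'
    inputs:
      azureSubscription: 'ManagedIdentityConnection'
      scriptType: 'bash'
      scriptLocation: 'inlineScript'
      inlineScript: |
        APP_ID=$(az webapp show --name myApp --resource-group myRG --query id -o tsv)
        WS_ID=$(az monitor log-analytics workspace show \
          --resource-group myRG --workspace-name myWorkspace --query id -o tsv)

        # Without a diagnostic setting, resource logs are generated but never stored
        az monitor diagnostic-settings create \
          --name toLogAnalytics \
          --resource "$APP_ID" \
          --workspace "$WS_ID" \
          --logs '[{"category":"AppServiceHTTPLogs","enabled":true}]' \
          --metrics '[{"category":"AllMetrics","enabled":true}]'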

Must Know (Critical Facts):

  • Application Insights: APM solution for web applications; tracks requests, dependencies, exceptions, custom events
  • Log Analytics Workspace: Centralized log storage; supports KQL queries for analysis
  • Azure Monitor Agent: Collects telemetry from VMs and on-premises servers
  • Metrics vs Logs: Metrics are numerical time-series data (CPU%, request count); Logs are text-based events (error messages, traces)
  • Distributed Tracing: Tracks requests across multiple services (microservices); identifies bottlenecks and failures
  • Availability Tests: Synthetic monitoring that periodically tests application endpoints from multiple locations
  • Smart Detection: AI-powered anomaly detection that automatically identifies performance issues

Detailed Example 1: Instrumenting ASP.NET Core Application with Application Insights

Scenario: You have an ASP.NET Core web application deployed to Azure App Service. You need to monitor request performance, track dependencies (database, external APIs), and detect exceptions.

Solution: Add Application Insights SDK to the application and configure telemetry collection.

Step-by-step:

  1. Create Application Insights resource: az monitor app-insights component create --app myApp --location eastus --resource-group myRG --workspace /subscriptions/{sub-id}/resourceGroups/myRG/providers/Microsoft.OperationalInsights/workspaces/myWorkspace
  2. Add Application Insights SDK to project: dotnet add package Microsoft.ApplicationInsights.AspNetCore
  3. Configure in Program.cs:
var builder = WebApplication.CreateBuilder(args);

// Add Application Insights telemetry
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});

var app = builder.Build();
app.Run();
  4. Set connection string in App Service configuration: az webapp config appsettings set --name myApp --resource-group myRG --settings ApplicationInsights__ConnectionString="{connection-string}"
  5. Deploy application to App Service
  6. Application Insights automatically collects:
    • HTTP requests (URL, duration, response code)
    • Dependencies (SQL queries, HTTP calls to external APIs)
    • Exceptions (stack traces, error messages)
    • Performance counters (CPU, memory, request rate)
  7. View telemetry in Azure Portal: Application Insights → Performance, Failures, Application Map

Why this approach: Application Insights SDK automatically instruments common scenarios (HTTP requests, database calls, exceptions) with zero code changes. Distributed tracing works across microservices. Telemetry is correlated (can see all dependencies for a single request).

Detailed Example 2: Configuring Alerts for Pipeline Failures

Scenario: You have Azure Pipelines running critical deployments. You need to be notified immediately when a pipeline fails so you can investigate and fix issues quickly.

Solution: Configure Azure Monitor alerts for pipeline failures using Azure DevOps audit logs.

Step-by-step:

  1. Enable Azure DevOps audit log streaming to Log Analytics: Azure DevOps → Organization Settings → Auditing → Streams → Create stream → Select Log Analytics workspace
  2. Wait for audit logs to flow to Log Analytics (may take a few minutes)
  3. Create log alert rule:
AzureDevOpsAuditing
| where OperationName == "Pipelines.PipelineCompleted"
| where Data has "failed"
| project TimeGenerated, ProjectName, PipelineName, Result = tostring(Data.result), BuildNumber = tostring(Data.buildNumber)
  4. Configure alert: Azure Monitor → Alerts → Create alert rule → Select Log Analytics workspace → Use KQL query above → Set threshold (1 result) → Set evaluation frequency (5 minutes)
  5. Create action group: Email → your-email@company.com, SMS → your-phone-number
  6. When pipeline fails, alert triggers and sends email/SMS notification

Why this approach: Proactive notification of pipeline failures (don't wait for users to report). Audit logs provide detailed context (which pipeline, which project, when). Can extend query to filter by specific pipelines or projects.

Must Know (Critical Facts):

  • KQL (Kusto Query Language): Query language for Log Analytics; similar to SQL but optimized for log data
  • Common KQL Operators: where (filter), project (select columns), summarize (aggregate), join (combine tables), render (visualize)
  • Telemetry Types: Requests (HTTP), Dependencies (external calls), Exceptions (errors), Traces (custom logging), Metrics (counters), Events (custom events)
  • Sampling: Application Insights can sample telemetry to reduce costs (adaptive sampling adjusts rate based on volume)
  • Retention: Log Analytics default retention is 30 days (configurable up to 730 days)
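
The retention setting mentioned above can be changed with the Azure CLI; a small sketch assuming the workspace from Example 1, runnable from an AzureCLI@2 step or a local shell:

steps:
  - task: AzureCLI@2
    displayName: 'Set Log Analytics retention to 90 days'
    inputs:
      azureSubscription: 'ManagedIdentityConnection'
      scriptType: 'bash'
      scriptLocation: 'inlineScript'
      inlineScript: |
        # Default retention is 30 days; raise it to 90 (configurable up to 730 days)
        az monitor log-analytics workspace update \
          --resource-group myRG \
          --workspace-name myWorkspace \
          --retention-time 90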

Section 2: Analyze Metrics from Instrumentation

Introduction

The problem: Collecting telemetry is not enough - you need to analyze it to identify trends, detect anomalies, and troubleshoot issues. Raw telemetry data is overwhelming (millions of events per day). Manual analysis doesn't scale.

The solution: Use query languages (KQL), visualization tools (dashboards, workbooks), and AI-powered analytics (Smart Detection) to extract insights from telemetry. Focus on key performance indicators (KPIs) and actionable metrics.

Why it's tested: The exam tests your ability to query logs, analyze metrics, and use telemetry to troubleshoot issues and optimize performance.

Core Concepts

Kusto Query Language (KQL) Basics

What it is: A query language for analyzing large volumes of log and telemetry data in Azure Monitor, Application Insights, and Azure Data Explorer. Optimized for fast queries over time-series data.

Why it exists: SQL is designed for relational databases (tables with fixed schemas). Log data is semi-structured (JSON, key-value pairs) and time-series (events over time). KQL is optimized for these data types with operators for filtering, aggregating, and visualizing time-series data.

Real-world analogy: Think of KQL as a specialized tool for analyzing security camera footage. While you could use a general video player (SQL), KQL is like a security system with fast-forward, rewind, motion detection, and timeline scrubbing - optimized for finding specific events in large volumes of footage.

Basic KQL Query Structure:

TableName
| where Condition
| project Column1, Column2
| summarize AggregateFunction by GroupByColumn
| order by Column desc
| take 10

Common KQL Queries for DevOps:

  1. Find failed requests in last 24 hours:
requests
| where timestamp > ago(24h)
| where success == false
| project timestamp, name, url, resultCode, duration
| order by timestamp desc
  2. Calculate average response time by operation:
requests
| where timestamp > ago(1h)
| summarize AvgDuration = avg(duration), RequestCount = count() by operation_Name
| order by AvgDuration desc
  3. Detect slow dependencies (database queries > 1 second):
dependencies
| where timestamp > ago(1h)
| where duration > 1000  // milliseconds
| where type == "SQL"
| project timestamp, name, duration, success
| order by duration desc
  4. Count exceptions by type:
exceptions
| where timestamp > ago(24h)
| summarize ExceptionCount = count() by type, outerMessage
| order by ExceptionCount desc
  5. Analyze pipeline duration trends:
AzureDevOpsAuditing
| where OperationName == "Pipelines.PipelineCompleted"
| extend Duration = todouble(Data.duration)
| summarize AvgDuration = avg(Duration), MaxDuration = max(Duration) by bin(TimeGenerated, 1h), PipelineName = tostring(Data.pipelineName)
| render timechart

Must Know (Critical Facts):

  • ago(): Relative time function (ago(1h) = 1 hour ago, ago(7d) = 7 days ago)
  • bin(): Group time into buckets (bin(timestamp, 1h) = hourly buckets)
  • summarize: Aggregate data (count, avg, sum, max, min, percentile)
  • render: Visualize results (timechart, barchart, piechart, table)
  • join: Combine data from multiple tables (similar to SQL JOIN)
  • extend: Add calculated columns (extend NewColumn = Column1 + Column2)

Critical Takeaways

  1. Azure Monitor: Unified monitoring for applications, infrastructure, and pipelines
  2. Application Insights: APM with distributed tracing, dependency tracking, exception monitoring
  3. Log Analytics: Centralized log storage with KQL queries
  4. KQL: Query language optimized for log and time-series data
  5. Alerts: Proactive notifications based on metrics or log queries
  6. Dashboards: Visualize key metrics for quick health checks

Self-Assessment Checklist

  • I can explain Azure Monitor and Application Insights capabilities
  • I understand how to instrument applications with Application Insights SDK
  • I can write basic KQL queries to analyze logs and metrics
  • I know how to configure alerts for pipeline failures and application issues
  • I understand distributed tracing and dependency tracking

Quick Reference Card

Azure Monitor Components:

  • Application Insights: APM for applications
  • Log Analytics: Centralized log storage and KQL queries
  • Metrics: Time-series numerical data
  • Alerts: Proactive notifications
  • Dashboards: Visualizations

KQL Basics:

  • where: Filter rows
  • project: Select columns
  • summarize: Aggregate data
  • order by: Sort results
  • take: Limit results
  • ago(): Relative time
  • bin(): Time buckets

Common Metrics:

  • Requests: Count, duration, success rate
  • Dependencies: External calls, duration
  • Exceptions: Count, type, message
  • Performance: CPU, memory, request rate

Next Chapter: 07_integration - Integration & Cross-Domain Scenarios


Integration & Advanced Topics: Putting It All Together

Chapter Overview

This chapter integrates concepts from all domains to solve complex, real-world scenarios that span multiple areas of DevOps. The AZ-400 exam tests your ability to apply knowledge across domains, not just recall facts from individual topics.

What you'll learn:

  • End-to-end CI/CD pipeline design combining multiple domains
  • Cross-domain decision-making (when to use which tool/pattern)
  • Complex scenarios requiring integration of security, monitoring, and deployment strategies
  • Troubleshooting multi-component issues

Time to complete: 6-8 hours
Prerequisites: All previous chapters (0-5)


Cross-Domain Scenario 1: Secure CI/CD Pipeline with Monitoring

Scenario Description

You're designing a CI/CD pipeline for a microservices application with the following requirements:

  • Source Control: GitHub repository with branch protection
  • Build: Automated builds on every commit with code quality checks
  • Security: Scan code, dependencies, and containers for vulnerabilities
  • Secrets: Database passwords and API keys must be stored securely
  • Deployment: Blue-green deployment to Azure App Service with zero downtime
  • Monitoring: Track application performance and pipeline health
  • Compliance: Audit all deployments and secret access

Solution Architecture

This scenario integrates concepts from all 5 domains:

  1. Domain 1 (Processes): Work item tracking, metrics dashboard
  2. Domain 2 (Source Control): Branch policies, pull request workflow
  3. Domain 3 (Pipelines): Multi-stage YAML pipeline, blue-green deployment
  4. Domain 4 (Security): Key Vault, GitHub Advanced Security, Managed Identity
  5. Domain 5 (Instrumentation): Application Insights, pipeline alerts

Step-by-Step Implementation:

  1. Configure Source Control (Domain 2):
# Branch protection rules in GitHub (configured under Settings → Branches; shown here as a checklist, not a file)
- Require pull request reviews (2 approvers)
- Require status checks (build, tests, security scan)
- Require branches to be up to date
- Require signed commits
  2. Create Multi-Stage Pipeline (Domain 3):
# azure-pipelines.yml
trigger:
  branches:
    include:
      - main

variables:
  - group: production-secrets  # Linked to Key Vault

stages:
  - stage: Build
    jobs:
      - job: BuildAndTest
        pool:
          vmImage: 'ubuntu-latest'
        steps:
          - task: UseDotNet@2
            inputs:
              version: '8.0.x'
          
          - task: DotNetCoreCLI@2
            displayName: 'Restore dependencies'
            inputs:
              command: 'restore'
          
          - task: DotNetCoreCLI@2
            displayName: 'Build'
            inputs:
              command: 'build'
              arguments: '--configuration Release'
          
          - task: DotNetCoreCLI@2
            displayName: 'Run tests'
            inputs:
              command: 'test'
              arguments: '--configuration Release --collect:"XPlat Code Coverage"'
          
          - task: PublishCodeCoverageResults@1
            inputs:
              codeCoverageTool: 'Cobertura'
              summaryFileLocation: '$(Agent.TempDirectory)/**/*coverage.cobertura.xml'
          
          - task: DotNetCoreCLI@2
            displayName: 'Publish'
            inputs:
              command: 'publish'
              publishWebProjects: true
              arguments: '--configuration Release --output $(Build.ArtifactStagingDirectory)'
          
          - task: PublishBuildArtifacts@1
            inputs:
              PathtoPublish: '$(Build.ArtifactStagingDirectory)'
              ArtifactName: 'drop'

  - stage: SecurityScan
    dependsOn: Build
    jobs:
      - job: ScanCode
        pool:
          vmImage: 'ubuntu-latest'
        steps:
          - task: UseDotNet@2
            inputs:
              version: '8.0.x'
          
          # Scan dependencies for vulnerabilities
          - script: |
              dotnet list package --vulnerable --include-transitive
            displayName: 'Scan dependencies'
          
          # Container scanning (if using containers)
          - task: Docker@2
            displayName: 'Build container image'
            inputs:
              command: 'build'
              Dockerfile: '**/Dockerfile'
              tags: '$(Build.BuildId)'
          
          - script: |
              # Install Trivy
              wget -qO - https://aquasecurity.github.io/trivy-repo/deb/public.key | sudo apt-key add -
              echo "deb https://aquasecurity.github.io/trivy-repo/deb $(lsb_release -sc) main" | sudo tee -a /etc/apt/sources.list.d/trivy.list
              sudo apt-get update
              sudo apt-get install trivy
              
              # Scan image
              trivy image --severity HIGH,CRITICAL --exit-code 1 myapp:$(Build.BuildId)
            displayName: 'Scan container for vulnerabilities'

  - stage: DeployStaging
    dependsOn: SecurityScan
    jobs:
      - deployment: DeployToStaging
        environment: 'staging'
        pool:
          vmImage: 'ubuntu-latest'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureKeyVault@2
                  displayName: 'Get secrets from Key Vault'
                  inputs:
                    azureSubscription: 'ManagedIdentityConnection'
                    KeyVaultName: 'myapp-keyvault'
                    SecretsFilter: 'DbConnectionString,ApiKey'
                
                - task: AzureWebApp@1
                  displayName: 'Deploy to staging slot'
                  inputs:
                    azureSubscription: 'ManagedIdentityConnection'
                    appName: 'myapp'
                    package: '$(Pipeline.Workspace)/drop/**/*.zip'
                    deployToSlotOrASE: true
                    resourceGroupName: 'myapp-rg'
                    slotName: 'staging'
                    appSettings: '-ConnectionStrings:DefaultConnection "$(DbConnectionString)" -ApiKey "$(ApiKey)"'

  - stage: DeployProduction
    dependsOn: DeployStaging
    jobs:
      - deployment: DeployToProduction
        environment: 'production'
        pool:
          vmImage: 'ubuntu-latest'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureKeyVault@2
                  displayName: 'Get secrets from Key Vault'
                  inputs:
                    azureSubscription: 'ManagedIdentityConnection'
                    KeyVaultName: 'myapp-keyvault'
                    SecretsFilter: 'DbConnectionString,ApiKey'
                
                # Blue-green deployment using slots
                - task: AzureWebApp@1
                  displayName: 'Deploy to blue slot'
                  inputs:
                    azureSubscription: 'ManagedIdentityConnection'
                    appName: 'myapp'
                    package: '$(Pipeline.Workspace)/drop/**/*.zip'
                    deployToSlotOrASE: true
                    resourceGroupName: 'myapp-rg'
                    slotName: 'blue'
                    appSettings: '-ConnectionStrings:DefaultConnection "$(DbConnectionString)" -ApiKey "$(ApiKey)"'
                
                # Warm up blue slot
                - script: |
                    curl -f https://myapp-blue.azurewebsites.net/health || exit 1
                  displayName: 'Health check blue slot'
                
                # Swap blue to production
                - task: AzureAppServiceManage@0
                  displayName: 'Swap blue to production'
                  inputs:
                    azureSubscription: 'ManagedIdentityConnection'
                    action: 'Swap Slots'
                    webAppName: 'myapp'
                    resourceGroupName: 'myapp-rg'
                    sourceSlot: 'blue'
                    targetSlot: 'production'
  3. Configure Monitoring (Domain 5):
  • Application Insights SDK in application code
  • Pipeline alerts for failures
  • Dashboard with key metrics (request rate, error rate, response time)
  4. Set Up Security (Domain 4):
  • GitHub Advanced Security enabled (CodeQL, secret scanning, Dependabot)
  • Azure Key Vault for secrets
  • Managed Identity for pipeline authentication
  • Defender for Cloud DevOps Security for unified visibility

Key Decision Points

Why this architecture?:

  • Branch protection: Prevents bad code from reaching main branch
  • Multi-stage pipeline: Separates concerns (build, security, deploy)
  • Security scanning: Catches vulnerabilities before production
  • Key Vault: Centralized secret management with audit trail
  • Blue-green deployment: Zero downtime, instant rollback
  • Monitoring: Proactive issue detection

Trade-offs:

  • Complexity: More stages = longer pipeline duration
  • Cost: Multiple environments (staging, blue, production) = higher cost
  • Maintenance: More components to manage and update

Cross-Domain Scenario 2: Multi-Repo Microservices Pipeline

Scenario Description

You have 10 microservices in separate GitHub repositories. When a shared library is updated, all dependent services must be rebuilt and tested. You need to:

  • Trigger builds in dependent repos when shared library changes
  • Track which services are affected by library changes
  • Deploy services independently with feature flags
  • Monitor cross-service dependencies

Solution Architecture

Approach: Use GitHub Apps for cross-repo workflow triggers, Azure Boards for work tracking, feature flags for deployment control, and Application Insights for distributed tracing.

Key Components:

  1. GitHub App: Trigger workflows in dependent repos
  2. Azure Boards: Track work items across repos
  3. Feature Flags: Azure App Configuration for gradual rollout
  4. Distributed Tracing: Application Insights for cross-service visibility
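
A minimal sketch of component 1 using the repository_dispatch event of the GitHub REST API. It assumes a GitHub App installation token (or PAT) stored as the DISPATCH_TOKEN secret, and contoso/shared-lib and contoso/service-a are hypothetical repositories:

# In the shared-lib repository: .github/workflows/notify-dependents.yml
name: Notify dependent services
on:
  push:
    branches: [ main ]

jobs:
  dispatch:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger downstream build
        run: |
          curl -sS -X POST \
            -H "Accept: application/vnd.github+json" \
            -H "Authorization: Bearer ${{ secrets.DISPATCH_TOKEN }}" \
            https://api.github.com/repos/contoso/service-a/dispatches \
            -d '{"event_type":"shared-lib-updated","client_payload":{"version":"${{ github.sha }}"}}'

# In each dependent repository (e.g., service-a): .github/workflows/ci.yml
name: CI
on:
  push:
    branches: [ main ]
  repository_dispatch:
    types: [shared-lib-updated]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "Rebuilding against shared-lib ${{ github.event.client_payload.version }}"

A GitHub App token is preferred over a PAT here because it is short-lived and not tied to a user, as covered in the security chapter.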

Common Question Patterns

Pattern 1: "Which tool should I use?"

How to recognize:

  • Question presents a scenario with specific requirements
  • Multiple tools could work, but one is optimal
  • Keywords indicate constraints (cost, complexity, existing tools)

How to answer:

  1. Identify requirements (security, cost, integration, complexity)
  2. Eliminate tools that don't meet requirements
  3. Choose tool that best fits constraints

Example: "You need to scan code for vulnerabilities in a GitHub repository. Which tool should you use?"

  • If GitHub Enterprise: GitHub Advanced Security (native integration)
  • If multi-platform: Defender for Cloud DevOps Security (supports GitHub, Azure DevOps, GitLab)
  • If open-source: Trivy or OWASP Dependency-Check (free)

Pattern 2: "How do you implement X?"

How to recognize:

  • Question asks for step-by-step implementation
  • Tests knowledge of specific tasks or configurations
  • May include YAML snippets or CLI commands

How to answer:

  1. Identify the goal (what needs to be achieved)
  2. List prerequisites (what must be in place first)
  3. Provide step-by-step implementation
  4. Explain why each step is necessary

Example: "How do you implement blue-green deployment in Azure App Service?"

  1. Create two deployment slots (blue, production)
  2. Deploy new version to blue slot
  3. Warm up blue slot (health check)
  4. Swap blue to production
  5. Monitor for issues, swap back if needed
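
The same swap and rollback can be scripted with the Azure CLI as an alternative to the AzureAppServiceManage@0 task used in Scenario 1; app, slot, and resource group names are illustrative:

steps:
  - task: AzureCLI@2
    displayName: 'Swap blue slot into production'
    inputs:
      azureSubscription: 'ManagedIdentityConnection'
      scriptType: 'bash'
      scriptLocation: 'inlineScript'
      inlineScript: |
        # Step 4: swap the warmed-up blue slot into production
        az webapp deployment slot swap \
          --name myapp --resource-group myapp-rg \
          --slot blue --target-slot production

        # Step 5 (rollback): running the same swap again returns the previous version to production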

Pattern 3: "What is the best practice?"

How to recognize:

  • Question asks for recommended approach
  • Multiple approaches are valid, but one is preferred
  • Tests understanding of DevOps principles

How to answer:

  1. State the best practice
  2. Explain why it's recommended
  3. Mention alternatives and their trade-offs

Example: "What is the best practice for storing database passwords in pipelines?"

  • Best practice: Store in Azure Key Vault, retrieve at runtime using Managed Identity
  • Why: Centralized management, audit trail, no secrets in code/variables, automatic rotation
  • Alternatives: Pipeline variables (less secure), environment variables (no audit trail)

Chapter Summary

This chapter integrated concepts from all domains to solve complex, real-world scenarios. Key takeaways:

  1. End-to-End Thinking: DevOps solutions span multiple domains (source control, pipelines, security, monitoring)
  2. Trade-offs: Every decision has trade-offs (complexity vs simplicity, cost vs features, speed vs security)
  3. Best Practices: Follow DevOps principles (automation, security, monitoring, continuous improvement)
  4. Tool Selection: Choose tools based on requirements, constraints, and existing infrastructure

Self-Assessment

  • I can design end-to-end CI/CD pipelines integrating multiple domains
  • I understand trade-offs between different approaches
  • I can select appropriate tools based on scenario requirements
  • I can troubleshoot issues spanning multiple components

Next Chapter: 08_study_strategies - Study Strategies & Test-Taking Techniques


Study Strategies & Test-Taking Techniques

Effective Study Techniques

The 3-Pass Method

Pass 1: Understanding (Weeks 1-6)

  • Read each chapter thoroughly from beginning to end
  • Take notes on ⭐ Must Know items
  • Complete practice exercises after each section
  • Don't rush - focus on understanding WHY, not just WHAT
  • Use diagrams to visualize concepts

Pass 2: Application (Weeks 7-8)

  • Review chapter summaries and quick reference cards
  • Focus on decision frameworks (when to use which tool/pattern)
  • Practice full-length tests (3 practice tests included)
  • Identify weak areas and review those chapters
  • Practice writing YAML pipelines and KQL queries

Pass 3: Reinforcement (Week 9-10)

  • Review flagged items from practice tests
  • Memorize critical facts (service limits, default values, key concepts)
  • Take final practice tests under exam conditions
  • Review cheat sheet daily
  • Focus on cross-domain scenarios

Active Learning Techniques

  1. Teach Someone: Explain concepts out loud to a friend, colleague, or rubber duck. If you can't explain it simply, you don't understand it well enough.

  2. Draw Diagrams: Visualize architectures, workflows, and decision trees. Drawing forces you to understand relationships between components.

  3. Write Scenarios: Create your own exam questions based on real-world scenarios you've encountered. This helps you think like the exam writers.

  4. Compare Options: Use comparison tables to understand differences between similar tools (GitHub Actions vs Azure Pipelines, Service Principal vs Managed Identity, etc.).

  5. Hands-On Practice: Set up a free Azure account and GitHub account. Build actual pipelines, configure Key Vault, set up monitoring. Hands-on experience solidifies learning.

Memory Aids

Mnemonics for Common Lists:

  • CALMS (DevOps Culture): Culture, Automation, Lean, Measurement, Sharing
  • CIA (Security Triad): Confidentiality, Integrity, Availability
  • SMART (Goals): Specific, Measurable, Achievable, Relevant, Time-bound

Visual Patterns:

  • Pipeline Stages: Build → Test → Scan → Deploy (left to right flow)
  • Authentication Hierarchy: Managed Identity > Service Principal > PAT (most secure to least secure)
  • Deployment Patterns: Blue-Green (instant swap), Canary (gradual rollout), Ring (phased by user group)

Test-Taking Strategies

Time Management

  • Total time: 120 minutes (150 minutes for non-native English speakers)
  • Total questions: ~50-60 questions
  • Time per question: ~2-2.5 minutes average
  • Case studies: 10-15 minutes each (3-5 questions per case study)

Strategy:

  • First pass (60-70 min): Answer all questions you're confident about
  • Second pass (30-40 min): Tackle flagged questions and case studies
  • Final pass (10-20 min): Review marked answers, check for mistakes

Question Analysis Method

Step 1: Read the scenario (30 seconds)

  • Identify the environment (Azure DevOps, GitHub, hybrid)
  • Note key requirements (security, cost, performance, compliance)
  • Look for constraint keywords (must, cannot, minimize, maximize)

Step 2: Identify what's being tested (15 seconds)

  • Which domain? (Processes, Source Control, Pipelines, Security, Instrumentation)
  • Which concept? (Authentication, deployment pattern, monitoring, etc.)
  • What decision needs to be made? (Which tool, which approach, which configuration)

Step 3: Eliminate wrong answers (30 seconds)

  • Remove options that violate stated requirements
  • Eliminate technically incorrect options (service doesn't support that feature)
  • Remove options that are valid but don't fit the scenario

Step 4: Choose best answer (30 seconds)

  • Select option that best meets ALL requirements
  • Prefer simpler solutions over complex ones (Occam's Razor)
  • Choose most commonly recommended solution if unsure
  • Flag and move on if still unsure (come back later)

Handling Difficult Questions

When stuck:

  1. Eliminate obviously wrong answers (reduces choices from 4 to 2-3)
  2. Look for constraint keywords (must, cannot, minimize, maximize)
  3. Choose most commonly recommended solution (Managed Identity over Service Principal, YAML over Classic, etc.)
  4. Flag and move on (don't spend more than 3 minutes on one question initially)
  5. Return during second pass with fresh perspective

Common traps:

  • Overthinking: The exam tests practical knowledge, not edge cases
  • Assuming context: Only use information provided in the question
  • Ignoring constraints: "Minimize cost" means choose cheapest option, even if more complex
  • Choosing familiar over correct: Just because you use a tool doesn't mean it's the right answer

⚠️ Never: Spend more than 3 minutes on one question initially. Flag it and return later.


Domain-Specific Tips

Domain 1: Processes and Communications (10-15%)

  • Focus on: Work tracking tools (Azure Boards, GitHub Projects), metrics (lead time, cycle time, CFD), documentation (wikis, Markdown, Mermaid)
  • Common questions: Which metric to use, how to configure integration between Azure Boards and GitHub, how to document processes
  • Key decisions: Azure Boards vs GitHub Projects (choose based on existing tools and complexity needs)

Domain 2: Source Control Strategy (10-15%)

  • Focus on: Branching strategies (trunk-based, GitFlow, feature branch), branch policies, Git operations (rebase, merge, cherry-pick), repository management (Git LFS, Scalar)
  • Common questions: Which branching strategy for scenario, how to configure branch policies, how to recover deleted files
  • Key decisions: Trunk-based (continuous deployment) vs GitFlow (scheduled releases)

Domain 3: Build and Release Pipelines (50-55%)

  • Focus on: YAML syntax, multi-stage pipelines, deployment patterns (blue-green, canary, feature flags), package management (Azure Artifacts, GitHub Packages), IaC (ARM, Bicep, Terraform), testing strategies
  • Common questions: How to write YAML pipeline, which deployment pattern for scenario, how to implement feature flags, which IaC tool to use
  • Key decisions: This is the largest domain - know YAML syntax cold, understand all deployment patterns, memorize common tasks
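
For orientation, here is a minimal multi-stage YAML sketch of the structure the exam expects you to read and write. Stage, job, environment, and artifact names here are placeholders for illustration, not a prescribed layout:

```yaml
# Minimal multi-stage sketch; BuildJob, DeployWeb, "production", and "drop"
# are placeholder names chosen for this example.
trigger:
  branches:
    include: [main]

pool:
  vmImage: ubuntu-latest

stages:
  - stage: Build
    jobs:
      - job: BuildJob
        steps:
          - script: echo "build output" > $(Build.ArtifactStagingDirectory)/app.txt
            displayName: Build and test (placeholder)
          - task: PublishPipelineArtifact@1
            inputs:
              targetPath: $(Build.ArtifactStagingDirectory)
              artifact: drop

  - stage: Deploy
    dependsOn: Build
    condition: succeeded()
    jobs:
      - deployment: DeployWeb
        environment: production        # approvals and checks attach to the environment
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "deploy the drop artifact from $(Pipeline.Workspace)/drop"
```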

Domain 4: Security and Compliance (10-15%)

  • Focus on: Authentication (Service Principal, Managed Identity, GitHub Apps, PATs), secrets management (Key Vault), security scanning (Defender for Cloud, GitHub Advanced Security, CodeQL, Dependabot)
  • Common questions: Which authentication method for scenario, how to store secrets securely, which scanning tool to use
  • Key decisions: Managed Identity > Service Principal (always prefer Managed Identity when available)

Domain 5: Instrumentation Strategy (5-10%)

  • Focus on: Azure Monitor, Application Insights, Log Analytics, KQL queries, alerts, dashboards
  • Common questions: How to configure monitoring, how to write KQL queries, how to set up alerts
  • Key decisions: Application Insights for APM, Log Analytics for log analysis, KQL for queries

Exam Day Preparation

Week Before Exam

7 days before:

  • Take Full Practice Test 1 (target: 60%+)
  • Review mistakes and weak areas
  • Re-read chapters for weak domains

5 days before:

  • Take Full Practice Test 2 (target: 70%+)
  • Review mistakes and patterns
  • Focus on decision frameworks

3 days before:

  • Take domain-focused tests for weak domains
  • Review cheat sheet
  • Practice YAML syntax and KQL queries

1 day before:

  • Take Full Practice Test 3 (target: 75%+)
  • Light review of cheat sheet (1-2 hours max)
  • Get 8 hours of sleep

Day Before Exam

Do:

  • ✅ Review cheat sheet (1-2 hours max)
  • ✅ Skim chapter summaries
  • ✅ Review flagged items from practice tests
  • ✅ Get 8 hours of sleep
  • ✅ Prepare exam day materials (ID, confirmation email)

Don't:

  • ❌ Try to learn new topics (too late, will cause confusion)
  • ❌ Take full practice test (will tire you out)
  • ❌ Study late into the night (need rest)
  • ❌ Cram (causes stress, doesn't help retention)

Exam Day

Morning Routine:

  • Eat a good breakfast (protein, complex carbs)
  • Light review of cheat sheet (30 minutes max)
  • Arrive 30 minutes early (online exam: test equipment 1 hour before)
  • Use restroom before exam starts

Brain Dump Strategy:
When exam starts, immediately write down on scratch paper (or whiteboard for online exam):

  • YAML pipeline structure (stages, jobs, steps)
  • Common KQL operators (where, project, summarize, order by, take)
  • Deployment patterns (blue-green, canary, ring)
  • Authentication methods (Managed Identity, Service Principal, GitHub App, PAT)
  • Key Vault access methods
  • Common Azure DevOps tasks (AzureWebApp@1, AzureKeyVault@2, PublishTestResults@2)

During Exam:

  • Follow time management strategy (first pass, second pass, final pass)
  • Use scratch paper for complex scenarios
  • Flag questions you're unsure about (review later)
  • Trust your preparation (don't second-guess too much)
  • Read questions carefully (look for constraint keywords)

Final Confidence Boosters

You're Ready When...

  • You score 75%+ on all practice tests
  • You can explain key concepts without notes
  • You recognize question patterns instantly
  • You make decisions quickly using frameworks
  • You can write YAML pipelines from memory
  • You can write basic KQL queries from memory

Remember

  • Trust your preparation: You've studied comprehensively
  • Manage your time well: Don't get stuck on one question
  • Read questions carefully: Look for constraint keywords
  • Don't overthink: The exam tests practical knowledge
  • Stay calm: Take deep breaths if you feel stressed

You've got this! Good luck on your AZ-400 exam!


Next Chapter: 09_final_checklist - Final Week Preparation Checklist


Final Week Checklist

7 Days Before Exam

Knowledge Audit

Go through this comprehensive checklist and mark items you're confident about:

Domain 1: Design and Implement Processes and Communications (10-15%)

  • I can explain GitHub Flow and when to use it
  • I understand cycle time, lead time, and time to recovery metrics
  • I know how to configure Azure Boards integration with GitHub
  • I can create Mermaid diagrams for process documentation
  • I understand cumulative flow diagrams and how to interpret them
  • I know how to configure webhooks for notifications

Domain 2: Design and Implement a Source Control Strategy (10-15%)

  • I can compare trunk-based, feature branch, and GitFlow strategies
  • I understand branch policies and protection rules
  • I know how to configure pull request workflows
  • I can explain Git LFS and when to use it
  • I understand how to recover deleted files using Git commands
  • I know how to remove sensitive data from Git history

Domain 3: Design and Implement Build and Release Pipelines (50-55%)

  • I can write multi-stage YAML pipelines from memory
  • I understand trigger rules (CI, PR, scheduled, manual)
  • I know the difference between Azure Artifacts and GitHub Packages
  • I can explain SemVer and version ranges (^, ~, exact)
  • I understand the test pyramid and quality gates
  • I can implement blue-green, canary, and ring deployments
  • I know how to use feature flags with Azure App Configuration
  • I understand ARM templates, Bicep, and Terraform differences
  • I can configure deployment slots in Azure App Service
  • I know common Azure DevOps tasks (AzureWebApp@1, AzureKeyVault@2, etc.)
  • I understand pipeline optimization strategies

Domain 4: Develop a Security and Compliance Plan (10-15%)

  • I can explain when to use Service Principal vs Managed Identity
  • I understand GitHub authentication methods (GitHub Apps, GITHUB_TOKEN, PATs)
  • I know how to configure Azure DevOps permissions and security groups
  • I can implement Azure Key Vault integration in pipelines
  • I understand Microsoft Defender for Cloud DevOps Security capabilities
  • I know GitHub Advanced Security features (CodeQL, secret scanning, Dependabot)
  • I can implement container scanning in pipelines

Domain 5: Implement an Instrumentation Strategy (5-10%)

  • I understand Azure Monitor and Application Insights architecture
  • I can write basic KQL queries (where, project, summarize, order by)
  • I know how to configure alerts for pipeline failures
  • I understand distributed tracing and dependency tracking
  • I can analyze metrics to identify performance issues

If you checked fewer than 80% of the items in any domain: Review that domain's chapters today.


Practice Test Marathon

Day 7: Full Practice Test 1

  • Take test under exam conditions (120 minutes, no interruptions)
  • Target score: 60%+
  • Review ALL mistakes (not just wrong answers, understand why)
  • Identify weak domains

If you scored below 60%: Review the fundamentals and the domain chapters for your weak areas.

Day 6: Review and Study

  • Review mistakes from Practice Test 1
  • Re-read chapters for weak domains
  • Focus on decision frameworks (when to use which tool/pattern)
  • Practice YAML syntax and KQL queries

Day 5: Full Practice Test 2

  • Take test under exam conditions
  • Target score: 70%+
  • Review mistakes and identify patterns
  • Note common question types

If you scored below 70%: Focus on the largest domain (Build and Release Pipelines, 50-55% of the exam).

Day 4: Review and Practice

  • Review mistakes from Practice Test 2
  • Focus on question patterns (which tool, how to implement, best practice)
  • Practice writing YAML pipelines from memory
  • Practice writing KQL queries from memory

Day 3: Domain-Focused Tests

  • Take domain-focused tests for weak domains
  • Review specific topics within those domains
  • Use comparison tables to understand differences between similar tools

Day 2: Full Practice Test 3

  • Take test under exam conditions
  • Target score: 75%+
  • Review mistakes (should be fewer than previous tests)
  • Confirm you understand all concepts

If you scored below 75%: Consider rescheduling the exam to allow more study time.

Day 1: Light Review

  • Review cheat sheet (1-2 hours max)
  • Skim chapter summaries
  • Review flagged items from practice tests
  • Do NOT take full practice test (will tire you out)
  • Get 8 hours of sleep

Day Before Exam

Final Review (2-3 hours max)

Morning (1 hour):

  • Review cheat sheet
  • Focus on ⭐ Must Know items from each chapter

Afternoon (1 hour):

  • Skim chapter summaries
  • Review quick reference cards
  • Practice brain dump (write down key facts from memory)

Evening (30 minutes):

  • Review flagged items from practice tests
  • Light review of decision frameworks

Don't: Try to learn new topics, take full practice test, study late into the night, cram.

Mental Preparation

  • Get 8 hours sleep (critical for cognitive performance)
  • Prepare exam day materials:
    • Government-issued ID (name must match exam registration)
    • Exam confirmation email
    • Scratch paper and pen (if in-person)
    • Water bottle (if allowed)
  • Review testing center policies (or online exam requirements)
  • Set multiple alarms for exam day

Exam Day

Morning Routine (2-3 hours before exam)

Breakfast:

  • Eat a good breakfast (protein + complex carbs)
  • Avoid excessive caffeine (causes jitters)
  • Stay hydrated

Light Review (30 minutes max):

  • Review cheat sheet one final time
  • Practice brain dump on scratch paper
  • Do NOT try to learn new concepts

Logistics:

  • Arrive 30 minutes early (in-person) or test equipment 1 hour before (online)
  • Use restroom before exam starts
  • Bring required ID and confirmation email

Brain Dump Strategy

When exam starts, immediately write down on scratch paper:

YAML Pipeline Structure:

trigger:
  branches:
    include: [main]

stages:
  - stage: Build
    jobs:
      - job: BuildJob
        steps:
          - task: TaskName@Version

Common Azure DevOps Tasks:

  • AzureWebApp@1 (deploy to App Service)
  • AzureKeyVault@2 (get secrets)
  • PublishTestResults@2 (publish test results)
  • PublishCodeCoverageResults@1 (publish coverage)
  • AzureAppServiceManage@0 (swap slots)

KQL Operators:

  • where (filter)
  • project (select columns)
  • summarize (aggregate)
  • order by (sort)
  • take (limit)
  • ago() (relative time)
  • bin() (time buckets)

Authentication Methods:

  • Managed Identity: Azure resources, no secrets
  • Service Principal: GitHub Actions, on-premises, requires secret
  • GITHUB_TOKEN: Same-repo, automatic, expires with workflow
  • GitHub App: Cross-repo, short-lived tokens (1hr)
  • PAT: External integrations, tied to user, requires rotation

Deployment Patterns:

  • Blue-Green: Two environments, instant swap, zero downtime, 2X cost
  • Canary: Progressive rollout (5%→100%), monitor metrics, automatic rollback
  • Ring: Phased by user group (internal→early adopters→general availability)
  • Feature Flags: Deploy OFF, enable gradually, instant toggle

Key Vault Access:

  • Managed Identity (preferred): No secrets to manage
  • Service Principal: Requires storing secret
  • Access Policies: Legacy, simpler
  • Azure RBAC: Recommended, consistent with other Azure resources

During Exam

Time Management:

  • First pass (60-70 min): Answer confident questions
  • Second pass (30-40 min): Tackle flagged questions and case studies
  • Final pass (10-20 min): Review marked answers

Question Analysis:

  • Read scenario carefully (identify environment, requirements, constraints)
  • Identify what's being tested (which domain, which concept)
  • Eliminate wrong answers (violate requirements, technically incorrect)
  • Choose best answer (meets ALL requirements, simplest solution)

Tips:

  • Use scratch paper for complex scenarios
  • Flag questions you're unsure about (review later)
  • Trust your preparation (don't second-guess too much)
  • Look for constraint keywords (must, cannot, minimize, maximize)
  • Take deep breaths if you feel stressed

Post-Exam

If You Pass

  • Celebrate! You've earned it!
  • Download certificate from Microsoft Learn
  • Update LinkedIn profile with certification
  • Share achievement with your network
  • Consider next certification (AZ-305, AZ-500, GitHub certifications)

If You Don't Pass

  • Don't be discouraged (many people need multiple attempts)
  • Review exam feedback (which domains were weak)
  • Focus study on weak domains
  • Take more practice tests
  • Schedule retake (wait 24 hours minimum)
  • Learn from mistakes and try again

Final Words

You're Ready When...

  • You score 75%+ on all practice tests
  • You can explain key concepts without notes
  • You recognize question patterns instantly
  • You make decisions quickly using frameworks
  • You can write YAML pipelines from memory
  • You can write basic KQL queries from memory

Remember

  • Trust your preparation: You've studied comprehensively with this guide
  • Manage your time well: Don't get stuck on one question
  • Read questions carefully: Look for constraint keywords
  • Don't overthink: The exam tests practical knowledge, not edge cases
  • Stay calm: Take deep breaths, you've got this

Good luck on your AZ-400: Designing and Implementing Microsoft DevOps Solutions exam!


Next File: 99_appendices - Quick Reference Tables, Glossary, Additional Resources


Appendices

Appendix A: Quick Reference Tables

Authentication Methods Comparison

| Method | Use Case | Pros | Cons | Lifespan |
|---|---|---|---|---|
| Managed Identity | Azure resources (VMs, App Service) | No secrets to manage, automatic, secure | Only works in Azure | Managed by Azure |
| Service Principal | GitHub Actions, on-premises, external tools | Works anywhere, flexible | Requires secret storage and rotation | Secret: max 2 years |
| GITHUB_TOKEN | Same-repo GitHub Actions | Automatic, no setup, free | Can't trigger other workflows, repo-scoped | Workflow duration |
| GitHub App | Cross-repo workflows, production automation | Short-lived tokens (1 hr), not tied to user | Requires setup (create app, install) | Token: 1 hour |
| PAT (Classic) | Legacy integrations | Simple, works everywhere | Broad scopes, tied to user, requires rotation | Max 1 year |
| PAT (Fine-grained) | External integrations, temporary access | Repo-specific, granular permissions | Still tied to user, requires rotation | Custom (up to 1 year) |
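
As a point of reference for the GITHUB_TOKEN row, a minimal GitHub Actions sketch. The workflow name and API call are illustrative; the permissions block narrows what the automatic, repo-scoped token may do, and the token expires when the run finishes (the gh CLI is assumed to be available, as it is on GitHub-hosted runners):

```yaml
# Illustrative workflow using the automatic GITHUB_TOKEN.
name: ci
on: [push]

permissions:
  contents: read        # least privilege for the automatic token

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # checkout authenticates with GITHUB_TOKEN
      - run: gh api repos/${{ github.repository }} --jq .full_name
        env:
          GH_TOKEN: ${{ github.token }}   # gh CLI reads the token from GH_TOKEN
```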

Deployment Patterns Comparison

| Pattern | Downtime | Rollback Speed | Cost | Complexity | Best For |
|---|---|---|---|---|---|
| Blue-Green | Zero | Instant (swap back) | 2X (two full environments) | Low | Critical apps, instant rollback needed |
| Canary | Zero | Fast (route traffic back) | 1.1-1.2X (small canary environment) | Medium | Gradual rollout with monitoring |
| Ring | Zero | Medium (depends on ring size) | 1X (same environment) | Medium | Phased rollout by user group |
| Rolling | Partial (some instances down) | Slow (redeploy previous version) | 1X (same environment) | Low | Non-critical apps, cost-sensitive |
| Feature Flags | Zero | Instant (toggle flag) | 1X + flag service cost | High | Decouple deployment from release |
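
Azure Pipelines deployment jobs can express some of these patterns directly. Below is a hedged sketch of the built-in canary strategy; the environment name and increments are placeholders, and the real deploy/route steps depend on your target platform:

```yaml
# Sketch of a deployment job using the built-in canary strategy.
# 'app-env' and the increments are placeholders for this example.
jobs:
  - deployment: CanaryDeploy
    environment: app-env
    strategy:
      canary:
        increments: [10, 25]          # percentage rolled out per wave
        deploy:
          steps:
            - script: echo "deploy the canary slice"
        routeTraffic:
          steps:
            - script: echo "shift traffic to the canary"
        on:
          failure:
            steps:
              - script: echo "roll back the canary"
```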

Package Management Comparison

| Feature | Azure Artifacts | GitHub Packages | npm Registry | NuGet Gallery |
|---|---|---|---|---|
| Pricing | 2GB free, $2/GB after | Unlimited public, $0.50/GB private | Free | Free |
| Package Types | NuGet, npm, Maven, Python, Universal | NuGet, npm, Maven, Docker, RubyGems | npm only | NuGet only |
| Feed Views | Yes (@Local, @Prerelease, @Release) | No | No | No |
| Upstream Sources | Yes (cache public packages) | No | N/A | N/A |
| Integration | Azure DevOps native | GitHub native | Universal | Universal |
| Best For | Enterprise, multi-package-type, feed promotion | GitHub-native workflows, open-source | Public npm packages | Public NuGet packages |

IaC Tools Comparison

| Tool | Language | Cloud Support | State Management | Learning Curve | Best For |
|---|---|---|---|---|---|
| ARM Templates | JSON | Azure only | Azure manages | Steep (verbose JSON) | Complex Azure scenarios, full control |
| Bicep | Bicep (DSL) | Azure only | Azure manages | Low (simpler than ARM) | Azure-native IaC, recommended for new projects |
| Terraform | HCL | Multi-cloud | State file (local/remote) | Medium | Multi-cloud, large ecosystem, mature tooling |
| Pulumi | TypeScript/Python/C#/Go | Multi-cloud | State file (Pulumi service) | Medium | Teams that prefer real languages over a DSL |
| Azure CLI | Bash/PowerShell | Azure only | None (imperative) | Low | Quick scripts, automation, not IaC |

Azure DevOps Tasks Quick Reference

| Task | Purpose | Common Inputs | Example |
|---|---|---|---|
| AzureWebApp@1 | Deploy to App Service | azureSubscription, appName, package | Deploy web app |
| AzureKeyVault@2 | Get secrets from Key Vault | azureSubscription, KeyVaultName, SecretsFilter | Retrieve secrets |
| PublishTestResults@2 | Publish test results | testResultsFormat, testResultsFiles | Enable quality gates |
| PublishCodeCoverageResults@1 | Publish code coverage | codeCoverageTool, summaryFileLocation | Visualize coverage |
| AzureAppServiceManage@0 | Manage App Service | azureSubscription, action, webAppName, sourceSlot, targetSlot | Swap deployment slots |
| NuGetAuthenticate@1 | Authenticate to Azure Artifacts | No inputs (uses service connection) | Access private feeds |
| Docker@2 | Build/push Docker images | command, Dockerfile, tags, containerRegistry | Container workflows |
| AzureCLI@2 | Run Azure CLI commands | azureSubscription, scriptType, scriptLocation | Custom Azure operations |
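
A short sketch showing a few of these tasks together in a job. The service connection ('my-azure-sc'), app name, and file paths are placeholders you would replace with your own:

```yaml
# Illustrative steps combining tasks from the table above.
steps:
  - task: PublishTestResults@2
    inputs:
      testResultsFormat: JUnit
      testResultsFiles: '**/TEST-*.xml'

  - task: PublishCodeCoverageResults@1
    inputs:
      codeCoverageTool: Cobertura
      summaryFileLocation: '$(System.DefaultWorkingDirectory)/**/coverage.cobertura.xml'

  - task: AzureWebApp@1
    inputs:
      azureSubscription: 'my-azure-sc'   # placeholder service connection
      appName: 'my-web-app'              # placeholder App Service name
      package: '$(Pipeline.Workspace)/drop/*.zip'
```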

KQL Operators Quick Reference

| Operator | Purpose | Example | Result |
|---|---|---|---|
| where | Filter rows | where timestamp > ago(1h) | Rows from last hour |
| project | Select columns | project timestamp, name, duration | Only specified columns |
| summarize | Aggregate data | summarize count() by operation_Name | Count per operation |
| order by | Sort results | order by duration desc | Sorted by duration (descending) |
| take | Limit results | take 10 | First 10 rows |
| ago() | Relative time | ago(24h) | 24 hours ago |
| bin() | Time buckets | bin(timestamp, 1h) | Hourly buckets |
| extend | Add calculated column | extend DurationSec = duration / 1000 | New column |
| join | Combine tables | requests \| join dependencies on operation_Id | Merged data |
| render | Visualize | render timechart | Time-series chart |
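
Putting several of these operators together, a sample query against the Application Insights requests table (the table and column names follow the standard Application Insights schema; the 24-hour window and top-10 limit are illustrative):

```kusto
// Illustrative query: the slowest request operations over the last 24 hours.
requests
| where timestamp > ago(24h)
| summarize avgDuration = avg(duration), requestCount = count() by name
| order by avgDuration desc
| take 10
```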

Appendix B: Common Exam Scenarios

Scenario Type 1: Tool Selection

Pattern: "You need to [accomplish task]. Which tool should you use?"

Approach:

  1. Identify requirements (security, cost, integration, complexity)
  2. Eliminate tools that don't meet requirements
  3. Choose tool that best fits constraints

Examples:

  • "Scan code for vulnerabilities in GitHub Enterprise" → GitHub Advanced Security (native integration)
  • "Store database passwords securely" → Azure Key Vault (centralized, audit trail)
  • "Deploy with zero downtime" → Blue-green deployment (instant swap)

Scenario Type 2: Implementation Steps

Pattern: "How do you implement [feature/configuration]?"

Approach:

  1. Identify prerequisites
  2. List step-by-step implementation
  3. Explain why each step is necessary

Examples:

  • "Implement blue-green deployment" → Create slots, deploy to blue, health check, swap
  • "Access Key Vault from pipeline" → Enable Managed Identity, grant access policy, use AzureKeyVault@2 task
  • "Configure branch protection" → Set required reviewers, status checks, up-to-date branches

Scenario Type 3: Best Practices

Pattern: "What is the best practice for [scenario]?"

Approach:

  1. State the best practice
  2. Explain why it's recommended
  3. Mention alternatives and trade-offs

Examples:

  • "Store secrets in pipelines" → Key Vault with Managed Identity (centralized, secure, audit trail)
  • "Branching strategy for continuous deployment" → Trunk-based (short-lived branches, fast feedback)
  • "Authentication for Azure resources" → Managed Identity (no secrets to manage)

Appendix C: Glossary

Agentless Scanning: Security scanning that doesn't require pipeline changes; scanner accesses repositories via API (e.g., Defender for Cloud DevOps Security).

Application Insights: Azure Monitor feature for application performance monitoring (APM) with distributed tracing, dependency tracking, and exception monitoring.

Azure Artifacts: Package management service in Azure DevOps supporting NuGet, npm, Maven, Python, and Universal packages.

Azure Boards: Work tracking service in Azure DevOps with support for Agile, Scrum, and Kanban methodologies.

Azure Key Vault: Cloud service for securely storing and accessing secrets, keys, and certificates.

Azure Monitor: Comprehensive monitoring solution for Azure resources, applications, and infrastructure.

Azure Pipelines: CI/CD service in Azure DevOps supporting YAML and classic pipelines.

Bicep: Domain-specific language (DSL) for deploying Azure resources; simpler alternative to ARM templates.

Blue-Green Deployment: Deployment pattern with two identical environments (blue and green); traffic is switched instantly between them for zero-downtime deployments.

Branch Policy: Protection rule on a branch requiring conditions before merge (e.g., required reviewers, passing builds).

Canary Deployment: Deployment pattern where new version is gradually rolled out to a small subset of users (canary) before full deployment.

CI (Continuous Integration): Practice of automatically building and testing code on every commit to detect issues early.

CD (Continuous Delivery): Practice of automatically deploying code to staging/pre-production after successful build and tests; production deployment is manual.

Continuous Deployment: Extension of CD where code is automatically deployed to production after passing all tests (no manual approval).

CodeQL: Semantic code analysis engine that understands code structure to find security vulnerabilities.

Cumulative Flow Diagram (CFD): Visualization showing work items in different states over time; used to identify bottlenecks.

Cycle Time: Time from when work starts (moved to "In Progress") to when it's completed; measures team efficiency.

Defender for Cloud DevOps Security: Cloud-native application protection platform (CNAPP) providing unified visibility and security scanning across Azure DevOps, GitHub, and GitLab.

Dependabot: GitHub feature that automatically scans dependencies for vulnerabilities and creates pull requests to update them.

Deployment Slot: Separate environment in Azure App Service for staging deployments before swapping to production.

Distributed Tracing: Tracking requests across multiple services (microservices) to identify bottlenecks and failures.

Feature Flag: Configuration that enables/disables features at runtime without redeploying code; decouples deployment from release.

Feed View: Azure Artifacts feature for promoting packages through stages (@Local, @Prerelease, @Release).

GitHub Actions: CI/CD platform integrated into GitHub; uses YAML workflows.

GitHub Advanced Security: Suite of security features for GitHub Enterprise including CodeQL, secret scanning, and Dependabot.

GitHub App: Application that integrates with GitHub using short-lived tokens (1 hour); not tied to user account.

GITHUB_TOKEN: Automatically generated token for GitHub Actions workflows; scoped to repository, expires with workflow.

GitFlow: Branching strategy with structured workflow (main, develop, feature, release, hotfix branches) for scheduled releases.

IaC (Infrastructure as Code): Practice of managing infrastructure through code (ARM templates, Bicep, Terraform) rather than manual processes.

KQL (Kusto Query Language): Query language for analyzing log and telemetry data in Azure Monitor and Log Analytics.

Lead Time: Time from when work is created (work item opened) to when it's completed; measures end-to-end delivery time.

Log Analytics Workspace: Centralized log storage in Azure Monitor; supports KQL queries for analysis.

Managed Identity: Azure AD identity automatically managed by Azure for Azure resources; no secrets to manage.

Multi-Stage Pipeline: YAML pipeline with multiple stages (e.g., Build, Test, Deploy) that can run sequentially or in parallel.

PAT (Personal Access Token): User-generated token for authenticating to Azure DevOps or GitHub; tied to user account.

Pull Request (PR): Proposed code change that must be reviewed and approved before merging to target branch.

Ring Deployment: Deployment pattern where new version is rolled out in phases to different user groups (rings): internal → early adopters → general availability.

SemVer (Semantic Versioning): Versioning scheme (MAJOR.MINOR.PATCH) where version numbers convey meaning about changes.

Service Connection: Azure DevOps configuration for authenticating to external services (Azure, GitHub, Docker registries).

Service Principal: Azure AD identity for applications and services; requires storing and rotating secrets.

Shift-Left Security: Practice of integrating security early in development (scanning code on commit/PR) rather than late (production).

Trunk-Based Development: Branching strategy where developers work on short-lived feature branches (hours/days) that merge frequently to main branch.

Upstream Source: Azure Artifacts feature for caching public packages (npm, NuGet) to improve reliability and speed.

YAML: Human-readable data serialization language used for pipeline definitions in Azure Pipelines and GitHub Actions.


Appendix D: Additional Resources

Official Microsoft Documentation

GitHub Documentation

Practice Resources

Community Resources

Tools and Extensions


Final Words

Certification Renewal

  • Renewal Required: Every 12 months
  • Renewal Method: Complete free renewal assessment on Microsoft Learn
  • Notification: Microsoft sends reminder 6 months before expiration
  • Cost: Free (no exam fee for renewal)

Related Certifications

After AZ-400, consider:

  • Azure Solutions Architect Expert (AZ-305): If you have AZ-104, adds architecture expertise
  • Azure Security Engineer Associate (AZ-500): Deepens security knowledge
  • GitHub Certifications: GitHub Actions, GitHub Advanced Security, GitHub Administration

Career Path

AZ-400 certification demonstrates expertise in:

  • DevOps engineering and automation
  • CI/CD pipeline design and implementation
  • Security and compliance in DevOps
  • Cloud infrastructure and monitoring

Typical roles: DevOps Engineer, Site Reliability Engineer (SRE), Cloud Engineer, Platform Engineer, Release Manager

Continuous Learning

DevOps is constantly evolving. Stay current by:

  • Following Azure DevOps and GitHub blogs
  • Attending conferences (Microsoft Ignite, GitHub Universe)
  • Participating in community forums
  • Experimenting with new tools and practices
  • Sharing knowledge through blog posts or presentations

Congratulations on completing this comprehensive study guide!

You now have all the knowledge needed to pass the AZ-400 exam and excel as a DevOps Engineer. Remember to:

  • Trust your preparation
  • Manage your time during the exam
  • Read questions carefully
  • Stay calm and confident

Good luck on your certification journey!

