AWS Certified Data Engineer - Associate (DEA-C01) Exam Guide

Version 1.0 DEA-C01

Introduction

The AWS Certified Data Engineer - Associate (DEA-C01) exam validates a candidate's ability to implement data pipelines, to monitor and troubleshoot those pipelines, and to optimize their cost and performance in accordance with best practices.

The exam also validates a candidate's ability to complete the following tasks:

Ingest and transform data, and orchestrate data pipelines while applying programming concepts.
Choose an optimal data store, design data models, catalog data schemas, and manage data lifecycles.
Operationalize, maintain, and monitor data pipelines. Analyze data and ensure data quality.
Implement appropriate authentication, authorization, data encryption, privacy, and governance. Enable logging.

Target Candidate Description

The target candidate should have the equivalent of 2–3 years of experience in data engineering. The target candidate should understand the effects of volume, variety, and velocity on data ingestion, transformation, modeling, security, governance, privacy, schema design, and optimal data store design. Additionally, the target candidate should have at least 1–2 years of hands-on experience with AWS services.

Recommended General IT Knowledge

The target candidate should have the following general IT knowledge:

Setup and maintenance of extract, transform, and load (ETL) pipelines from ingestion to destination
Application of high-level but language-agnostic programming concepts as required by the pipeline
How to use Git commands for source control
How to use data lakes to store data
General concepts for networking, storage, and compute

Recommended AWS Knowledge

The target candidate should have the following AWS knowledge:

How to use AWS services to accomplish the tasks listed in the Introduction section of this exam guide
An understanding of the AWS services for encryption, governance, protection, and logging of all data that is part of data pipelines
The ability to compare AWS services to understand the cost, performance, and functional differences between services
How to structure SQL queries and how to run SQL queries on AWS services
An understanding of how to analyze data, verify data quality, and ensure data consistency by using AWS services

Job Tasks That Are Out of Scope for the Target Candidate

The following list contains job tasks that the target candidate is not expected to be able to perform. This list is non-exhaustive. These tasks are out of scope for the exam:

Perform artificial intelligence and machine learning (AI/ML) tasks.
Demonstrate knowledge of programming language-specific syntax.
Draw business conclusions based on data.

Refer to the Appendix for a list of in-scope AWS services and features and a list of out-of-scope AWS services and features.

Exam Content

Response Types

There are two types of questions on the exam:

Multiple choice: Has one correct response and three incorrect responses (distractors)
Multiple response: Has two or more correct responses out of five or more response options

Select one or more responses that best complete the statement or answer the question. Distractors, or incorrect answers, are response options that a candidate with incomplete knowledge or skill might choose. Distractors are generally plausible responses that match the content area.

Unanswered questions are scored as incorrect; there is no penalty for guessing. The exam includes 50 questions that affect your score.

Unscored Content

The exam includes 15 unscored questions that do not affect your score. AWS collects information about performance on these unscored questions to evaluate these questions for future use as scored questions. These unscored questions are not identified on the exam.

Exam Results

The AWS Certified Data Engineer - Associate (DEA-C01) exam has a pass or fail designation. The exam is scored against a minimum standard established by AWS professionals who follow certification industry best practices and guidelines.

Your results for the exam are reported as a scaled score of 100–1,000. The minimum passing score is 720. Your score shows how you performed on the exam as a whole and whether you passed. Scaled scoring models help equate scores across multiple exam forms that might have slightly different difficulty levels.

Your score report could contain a table of classifications of your performance at each section level. The exam uses a compensatory scoring model, which means that you do not need to achieve a passing score in each section. You need to pass only the overall exam.

Each section of the exam has a specific weighting, so some sections have more questions than other sections have. The table of classifications contains general information that highlights your strengths and weaknesses. Use caution when you interpret section-level feedback.

Content Outline

This exam guide includes weightings, content domains, and task statements for the exam. This guide does not provide a comprehensive list of the content on the exam. However, additional context for each task statement is available to help you prepare for the exam.

The exam has the following content domains and weightings:

Domain 1: Data Ingestion and Transformation (34% of scored content)
Domain 2: Data Store Management (26% of scored content)
Domain 3: Data Operations and Support (22% of scored content)
Domain 4: Data Security and Governance (18% of scored content)
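
As a rough illustration of what these weightings mean in practice, the following sketch converts each percentage into an approximate number of scored questions. It assumes that the 50 scored questions stated in this guide are split strictly in proportion to the weightings. AWS does not publish per-domain question counts, so the proportional split and the function name approximate_question_counts are assumptions made for illustration only.

```python
# Domain weightings as stated in this exam guide (percent of scored content).
DOMAIN_WEIGHTS = {
    "Domain 1: Data Ingestion and Transformation": 34,
    "Domain 2: Data Store Management": 26,
    "Domain 3: Data Operations and Support": 22,
    "Domain 4: Data Security and Governance": 18,
}

# Number of scored questions stated in this exam guide.
SCORED_QUESTIONS = 50


def approximate_question_counts(weights, total):
    """Estimate questions per domain, ASSUMING a strictly proportional split."""
    return {name: round(total * pct / 100) for name, pct in weights.items()}


if __name__ == "__main__":
    for name, count in approximate_question_counts(
        DOMAIN_WEIGHTS, SCORED_QUESTIONS
    ).items():
        print(f"{name}: about {count} scored questions")
```

Under that assumption, the estimate can help you budget study time across domains, but the actual distribution on any one exam form might differ.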

Domain 1: Data Ingestion and Transformation

Task Statement 1.1: Perform data ingestion.

Knowledge of:

Throughput and latency characteristics for AWS services that ingest data
Data ingestion patterns (for example, frequency and data history)
Streaming data ingestion
Batch data ingestion (for example, scheduled ingestion, event-driven ingestion)
Replayability of data ingestion pipelines
Stateful and stateless data transactions

Skills in:

Reading data from streaming sources (for example, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka [Amazon MSK], Amazon DynamoDB Streams, AWS Database Migration Service [AWS DMS], AWS Glue, Amazon Redshift)
Reading data from batch sources (for example, Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, Amazon AppFlow)
Implementing appropriate configuration options for batch ingestion
Consuming data APIs
Setting up schedulers by using Amazon EventBridge, Apache Airflow, or time-based schedules for jobs and crawlers
Setting up event triggers (for example, Amazon S3 Event Notifications, EventBridge)
Calling a Lambda function from Amazon Kinesis
Creating allowlists for IP addresses to allow connections to data sources
Implementing throttling and overcoming rate limits (for example, DynamoDB, Amazon RDS, Kinesis)
Managing fan-in and fan-out for streaming data distribution
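
To make the throttling and rate-limit skill more concrete, the following minimal sketch retries a throttled call by using capped exponential backoff with full jitter, which is one common way to work within a rate limit. The ThrottledError class, the call_with_backoff function, and all default values are hypothetical and are not part of any AWS SDK. In real code, you would catch the specific throttling error that your client library raises.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a service throttling error."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on ThrottledError with capped exponential
    backoff and full jitter. Re-raises the error after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential ceiling: base_delay * 2^attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random fraction of the ceiling so that many
            # clients do not retry at the same moment.
            sleep(ceiling * rng())
```

The sleep and rng parameters are injectable only so that the behavior can be tested without real delays. Callers would normally leave them at their defaults.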

Task Statement 1.2: Transform and process data.

Knowledge of:

Creation of ETL pipelines based on business requirements
Volume, velocity, and variety of data (for example, structured data, unstructured data)
Cloud computing and distributed computing
How to use Apache Spark to process data
Intermediate data staging locations

Skills in:

Optimizing container usage for performance needs (for example, Amazon Elastic Kubernetes Service [Amazon EKS], Amazon Elastic Container Service [Amazon ECS])
Connecting to different data sources (for example, Java Database Connectivity [JDBC], Open Database Connectivity [ODBC])
Integrating data from multiple sources
Optimizing costs while processing data
Implementing data transformation services based on requirements (for example, Amazon EMR, AWS Glue, Lambda, Amazon Redshift)
Transforming data between formats (for example, from .csv to Apache Parquet)
Troubleshooting and debugging common transformation failures and performance issues
Creating data APIs to make data available to other systems by using AWS services

Task Statement 1.3: Orchestrate data pipelines.

Knowledge of:

How to integrate various AWS services to create ETL pipelines
Event-driven architecture
How to configure AWS services for data pipelines based on schedules or dependencies
Serverless workflows

Skills in:

Using orchestration services to build workflows for data ETL pipelines (for example, Lambda, EventBridge, Amazon Managed Workflows for Apache Airflow [Amazon MWAA], AWS Step Functions, AWS Glue workflows)
Building data pipelines for performance, availability, scalability, resiliency, and fault tolerance
Implementing and maintaining serverless workflows
Using notification services to send alerts (for example, Amazon Simple Notification Service [Amazon SNS], Amazon Simple Queue Service [Amazon SQS])

Task Statement 1.4: Apply programming concepts.

Knowledge of:

Continuous integration and continuous delivery (CI/CD) (implementation, testing, and deployment of data pipelines)
SQL queries (for data source queries and data transformations)
Infrastructure as code (IaC) for repeatable deployments (for example, AWS Cloud Development Kit [AWS CDK], AWS CloudFormation)
Distributed computing
Data structures and algorithms (for example, graph data structures and tree data structures)
SQL query optimization

Skills in:

Optimizing code to reduce runtime for data ingestion and transformation
Configuring Lambda functions to meet concurrency and performance needs
Performing SQL queries to transform data (for example, Amazon Redshift stored procedures)
Structuring SQL queries to meet data pipeline requirements
Using Git commands to perform actions such as creating, updating, cloning, and branching repositories
Using the AWS Serverless Application Model (AWS SAM) to package and deploy serverless data pipelines (for example, Lambda functions, Step Functions, DynamoDB tables)
Using and mounting storage volumes from within Lambda functions

Domain 2: Data Store Management

Task Statement 2.1: Choose a data store.

Knowledge of:

Storage platforms and their characteristics
Storage services and configurations for specific performance demands
Data storage formats (for example, .csv, .txt, Parquet)
How to align data storage with data migration requirements
How to determine the appropriate storage solution for specific access patterns
How to manage locks to prevent access to data (for example, Amazon Redshift, Amazon RDS)

Skills in:

Implementing the appropriate storage services for specific cost and performance requirements (for example, Amazon Redshift, Amazon EMR, AWS Lake Formation, Amazon RDS, DynamoDB, Amazon Kinesis Data Streams, Amazon MSK)
Configuring the appropriate storage services for specific access patterns and requirements (for example, Amazon Redshift, Amazon EMR, Lake Formation, Amazon RDS, DynamoDB)
Applying storage services to appropriate use cases (for example, Amazon S3)
Integrating migration tools into data processing systems (for example, AWS Transfer Family)
Implementing data migration or remote access methods (for example, Amazon Redshift federated queries, Amazon Redshift materialized views, Amazon Redshift Spectrum)

Task Statement 2.2: Understand data cataloging systems.

Knowledge of:

How to create a data catalog
Data classification based on requirements
Components of metadata and data catalogs

Skills in:

Using data catalogs to consume data from the data's source
Building and referencing a data catalog (for example, AWS Glue Data Catalog, Apache Hive metastore)
Discovering schemas and using AWS Glue crawlers to populate data catalogs
Synchronizing partitions with a data catalog
Creating new source or target connections for cataloging (for example, AWS Glue)

Task Statement 2.3: Manage the lifecycle of data.

Knowledge of:

Appropriate storage solutions to address hot and cold data requirements
How to optimize the cost of storage based on the data lifecycle
How to delete data to meet business and legal requirements
Data retention policies and archiving strategies
How to protect data with appropriate resiliency and availability

Skills in:

Performing load and unload operations to move data between Amazon S3 and Amazon Redshift
Managing S3 Lifecycle policies to change the storage tier of S3 data
Expiring data when it reaches a specific age by using S3 Lifecycle policies
Managing S3 versioning and DynamoDB TTL
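
The two S3 Lifecycle skills above (changing the storage tier and expiring data at a specific age) can be sketched as a single lifecycle rule. The following example only builds the configuration as a Python dictionary in the shape that the boto3 S3 client accepts for put_bucket_lifecycle_configuration. The function name, rule ID, prefix, and day counts are hypothetical values chosen for illustration, and no AWS call is made.

```python
def build_lifecycle_configuration(prefix, transition_days, expiration_days):
    """Build a lifecycle configuration with one rule that moves objects under
    `prefix` to S3 Standard-IA and later expires them."""
    if expiration_days <= transition_days:
        raise ValueError("expiration_days must be greater than transition_days")
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Change the storage tier after `transition_days`.
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "STANDARD_IA"}
                ],
                # Delete the objects after `expiration_days`.
                "Expiration": {"Days": expiration_days},
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical values: tier logs after 30 days, expire them after 365.
    config = build_lifecycle_configuration("logs/", 30, 365)
    print(config)
    # With boto3 installed and credentials configured, this dictionary could
    # be passed as LifecycleConfiguration to
    # s3_client.put_bucket_lifecycle_configuration(Bucket=..., ...).
```

Keeping the configuration as data makes the rule easy to review in source control before it is applied to a bucket.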

Task Statement 2.4: Design data models and schema evolution.

Knowledge of:

Data modeling concepts
How to ensure accuracy and trustworthiness of data by using data lineage
Best practices for indexing, partitioning strategies, compression, and other data optimization techniques
How to model structured, semi-structured, and unstructured data
Schema evolution techniques

Skills in:

Designing schemas for Amazon Redshift, DynamoDB, and Lake Formation
Addressing changes to the characteristics of data
Performing schema conversion (for example, by using the AWS Schema Conversion Tool [AWS SCT] and AWS DMS Schema Conversion)
Establishing data lineage by using AWS tools (for example, Amazon SageMaker ML Lineage Tracking)

Domain 3: Data Operations and Support

Task Statement 3.1: Automate data processing by using AWS services.

Knowledge of:

How to maintain and troubleshoot data processing for repeatable business outcomes
API calls for data processing
Which services accept scripting (for example, Amazon EMR, Amazon Redshift, AWS Glue)

Skills in:

Orchestrating data pipelines (for example, Amazon MWAA, Step Functions)
Troubleshooting Amazon managed workflows
Calling SDKs to access Amazon features from code
Using the features of AWS services to process data (for example, Amazon EMR, Amazon Redshift, AWS Glue)
Consuming and maintaining data APIs
Preparing data transformation (for example, AWS Glue DataBrew)
Querying data (for example, Amazon Athena)
Using Lambda to automate data processing
Managing events and schedulers (for example, EventBridge)

Task Statement 3.2: Analyze data by using AWS services.

Knowledge of:

Tradeoffs between provisioned services and serverless services
SQL queries (for example, SELECT statements with multiple qualifiers or JOIN clauses)
How to visualize data for analysis
When and how to apply cleansing techniques
Data aggregation, rolling average, grouping, and pivoting
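
As an illustration of the rolling average concept in the list above, the following sketch computes a simple moving average over a fixed-size window. The function name rolling_average and the choice to emit a value only when the window is full are assumptions for this example. Other conventions, such as emitting partial averages for the first few values, are equally valid.

```python
from collections import deque


def rolling_average(values, window):
    """Return the average of each run of `window` consecutive values.

    Only full windows produce an output, so the result has
    len(values) - window + 1 entries (or none if the input is too short).
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    recent = deque(maxlen=window)  # automatically discards the oldest value
    averages = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            averages.append(sum(recent) / window)
    return averages


if __name__ == "__main__":
    daily_orders = [10, 20, 30, 40]  # hypothetical sample data
    print(rolling_average(daily_orders, 2))
```

In SQL, the same idea is usually expressed with a window function over an ordered set of rows.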

Skills in:

Visualizing data by using AWS services and tools (for example, AWS Glue DataBrew, Amazon QuickSight)
Verifying and cleaning data (for example, Lambda, Athena, QuickSight, Jupyter Notebooks, Amazon SageMaker Data Wrangler)
Using Athena to query data or to create views
Using Athena notebooks that use Apache Spark to explore data

Task Statement 3.3: Maintain and monitor data pipelines.

Knowledge of:

How to log application data
Best practices for performance tuning
How to log access to AWS services
Amazon Macie, AWS CloudTrail, and Amazon CloudWatch

Skills in:

Extracting logs for audits
Deploying logging and monitoring solutions to facilitate auditing and traceability
Using notifications during monitoring to send alerts
Troubleshooting performance issues
Using CloudTrail to track API calls
Troubleshooting and maintaining pipelines (for example, AWS Glue, Amazon EMR)
Using Amazon CloudWatch Logs to log application data (with a focus on configuration and automation)
Analyzing logs with AWS services (for example, Athena, Amazon EMR, Amazon OpenSearch Service, CloudWatch Logs Insights, big data application logs)

Task Statement 3.4: Ensure data quality.

Knowledge of:

Data sampling techniques
How to implement data skew mechanisms
Data validation (data completeness, consistency, accuracy, and integrity)
Data profiling

Skills in:

Running data quality checks while processing the data (for example, checking for empty fields)
Defining data quality rules (for example, AWS Glue DataBrew)
Investigating data consistency (for example, AWS Glue DataBrew)
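
The empty-field check mentioned in the skills above can be sketched in a few lines. This example is language-level only and does not use AWS Glue DataBrew or any other AWS service. The function name find_empty_fields, the record layout, and the rule that whitespace-only strings count as empty are assumptions for illustration.

```python
def find_empty_fields(records, required_fields):
    """Return (row_index, field) pairs where a required field is missing,
    None, or a blank string."""
    problems = []
    for index, record in enumerate(records):
        for field in required_fields:
            value = record.get(field)
            is_blank = isinstance(value, str) and not value.strip()
            if value is None or is_blank:
                problems.append((index, field))
    return problems


if __name__ == "__main__":
    # Hypothetical sample records.
    rows = [
        {"id": "1", "email": "a@example.com"},
        {"id": "2", "email": "  "},
        {"id": None},
    ]
    for row_index, field in find_empty_fields(rows, ["id", "email"]):
        print(f"row {row_index}: required field '{field}' is empty")
```

A pipeline could run a check like this while it processes each batch and then quarantine or reject the failing rows, depending on the data quality rules that apply.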

Domain 4: Data Security and Governance

Task Statement 4.1: Apply authentication mechanisms.

Knowledge of:

VPC security networking concepts
Differences between managed services and unmanaged services
Authentication methods (password-based, certificate-based, and role-based)
Differences between AWS managed policies and customer managed policies

Skills in:

Updating VPC security groups
Creating and updating IAM groups, roles, endpoints, and services
Creating and rotating credentials for password management (for example, AWS Secrets Manager)
Setting up IAM roles for access (for example, Lambda, Amazon API Gateway, AWS CLI, CloudFormation)
Applying IAM policies to roles, endpoints, and services (for example, S3 Access Points, AWS PrivateLink)

Task Statement 4.2: Apply authorization mechanisms.

Knowledge of:

Authorization methods (role-based, policy-based, tag-based, and attribute-based)
Principle of least privilege as it applies to AWS security
Role-based access control and expected access patterns
Methods to protect data from unauthorized access across services

Skills in:

Creating custom IAM policies when a managed policy does not meet the needs
Storing application and database credentials (for example, Secrets Manager, AWS Systems Manager Parameter Store)
Providing database users, groups, and roles access and authority in a database (for example, for Amazon Redshift)
Managing permissions through Lake Formation (for Amazon Redshift, Amazon EMR, Athena, and Amazon S3)

Task Statement 4.3: Ensure data encryption and masking.

Knowledge of:

Data encryption options available in AWS analytics services (for example, Amazon Redshift, Amazon EMR, AWS Glue)
Differences between client-side encryption and server-side encryption
Protection of sensitive data
Data anonymization, masking, and key salting
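
To illustrate the difference between masking and salted hashing, the following sketch uses only the Python standard library. The function names, the masking format, and the use of salted SHA-256 are assumptions for this example. They are not a method prescribed by this guide or by AWS. A salted hash of a low-entropy value is pseudonymization rather than full anonymization, so the salt must be stored as securely as a key.

```python
import hashlib
import secrets


def new_salt(num_bytes=16):
    """Generate a random salt. Store it securely; it is needed to re-derive
    the same pseudonym for the same input later."""
    return secrets.token_bytes(num_bytes)


def pseudonymize(value, salt):
    """Return the hex SHA-256 digest of salt + value (a salted hash).
    The same value and salt always yield the same pseudonym."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def mask_email(email):
    """Mask an email address for display: keep the first character of the
    local part and the domain, and hide the rest."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "***"  # not a recognizable address: hide it entirely
    return local[:1] + "***@" + domain


if __name__ == "__main__":
    salt = new_salt()
    address = "jane.doe@example.com"  # hypothetical sample value
    print(mask_email(address))
    print(pseudonymize(address, salt))
```

Masking keeps part of the value readable for people. Salted hashing replaces the whole value with a stable token that can still be used to join or group records.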

Skills in:

Applying data masking and anonymization according to compliance laws or company policies
Using encryption keys to encrypt or decrypt data (for example, AWS Key Management Service [AWS KMS])
Configuring encryption across AWS account boundaries
Enabling encryption in transit for data

Task Statement 4.4: Prepare logs for audit.

Knowledge of:

How to log application data
How to log access to AWS services
Centralized AWS logs

Skills in:

Using CloudTrail to track API calls
Using CloudWatch Logs to store application logs
Using AWS CloudTrail Lake for centralized logging queries
Analyzing logs by using AWS services (for example, Athena, CloudWatch Logs Insights, Amazon OpenSearch Service)
Integrating various AWS services to perform logging (for example, Amazon EMR in cases of large volumes of log data)

Task Statement 4.5: Understand data privacy and governance.

Knowledge of:

How to protect personally identifiable information (PII)
Data sovereignty

Skills in:

Granting permissions for data sharing (for example, data sharing for Amazon Redshift)
Implementing PII identification (for example, Macie with Lake Formation)
Implementing data privacy strategies to prevent backups or replications of data to disallowed AWS Regions
Managing configuration changes that have occurred in an account (for example, AWS Config)

Appendix

In-scope AWS Services and Features

The following list contains AWS services and features that are in scope for the exam. This list is non-exhaustive and is subject to change. AWS offerings appear in categories that align with the offerings' primary functions:

Analytics:

Amazon Athena
Amazon EMR
AWS Glue
AWS Glue DataBrew
AWS Lake Formation
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Managed Service for Apache Flink
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
Amazon OpenSearch Service
Amazon QuickSight

Application Integration:

Amazon AppFlow
Amazon EventBridge
Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
Amazon Simple Notification Service (Amazon SNS)
Amazon Simple Queue Service (Amazon SQS)
AWS Step Functions

Cloud Financial Management:

AWS Budgets
AWS Cost Explorer

Compute:

AWS Batch
Amazon EC2
AWS Lambda
AWS Serverless Application Model (AWS SAM)

Containers:

Amazon Elastic Container Registry (Amazon ECR)
Amazon Elastic Container Service (Amazon ECS)
Amazon Elastic Kubernetes Service (Amazon EKS)

Database:

Amazon DocumentDB (with MongoDB compatibility)
Amazon DynamoDB
Amazon Keyspaces (for Apache Cassandra)
Amazon MemoryDB for Redis
Amazon Neptune
Amazon RDS
Amazon Redshift

Developer Tools:

AWS CLI
AWS Cloud9
AWS Cloud Development Kit (AWS CDK)
AWS CodeBuild
AWS CodeCommit
AWS CodeDeploy
AWS CodePipeline

Frontend Web and Mobile:

Amazon API Gateway

Machine Learning:

Amazon SageMaker

Management and Governance:

AWS CloudFormation
AWS CloudTrail
Amazon CloudWatch
Amazon CloudWatch Logs
AWS Config
Amazon Managed Grafana
AWS Systems Manager
AWS Well-Architected Tool

Migration and Transfer:

AWS Application Discovery Service
AWS Application Migration Service
AWS Database Migration Service (AWS DMS)
AWS DataSync
AWS Schema Conversion Tool (AWS SCT)
AWS Snow Family
AWS Transfer Family

Networking and Content Delivery:

Amazon CloudFront
AWS PrivateLink
Amazon Route 53
Amazon VPC

Security, Identity, and Compliance:

AWS Identity and Access Management (IAM)
AWS Key Management Service (AWS KMS)
Amazon Macie
AWS Secrets Manager
AWS Shield
AWS WAF

Storage:

AWS Backup
Amazon Elastic Block Store (Amazon EBS)
Amazon Elastic File System (Amazon EFS)
Amazon S3
Amazon S3 Glacier

Out-of-scope AWS Services and Features

The following list contains AWS services and features that are out of scope for the exam. This list is non-exhaustive and is subject to change. AWS offerings that are entirely unrelated to the target job roles for the exam are excluded from this list:

Analytics:

Amazon FinSpace

Business Applications:

Alexa for Business
Amazon Chime
Amazon Connect
Amazon Honeycode
AWS IQ
Amazon WorkDocs
Amazon WorkMail

Compute:

AWS App Runner
AWS Elastic Beanstalk
Amazon Lightsail
AWS Outposts
AWS Serverless Application Repository

Containers:

Red Hat OpenShift Service on AWS (ROSA)

Database:

Amazon Timestream

Developer Tools:

AWS Fault Injection Simulator (AWS FIS)
AWS X-Ray

Frontend Web and Mobile:

AWS Amplify
AWS AppSync
AWS Device Farm
Amazon Location Service
Amazon Pinpoint
Amazon Simple Email Service (Amazon SES)

Internet of Things (IoT):

FreeRTOS
AWS IoT 1-Click
AWS IoT Device Defender
AWS IoT Device Management
AWS IoT Events
AWS IoT FleetWise
AWS IoT RoboRunner
AWS IoT SiteWise
AWS IoT TwinMaker

Machine Learning:

Amazon CodeWhisperer
Amazon DevOps Guru

Management and Governance:

AWS Activate
AWS Managed Services (AMS)

Media Services:

Amazon Elastic Transcoder
AWS Elemental Appliances and Software
AWS Elemental MediaConnect
AWS Elemental MediaConvert
AWS Elemental MediaLive
AWS Elemental MediaPackage
AWS Elemental MediaStore
AWS Elemental MediaTailor
Amazon Interactive Video Service (Amazon IVS)
Amazon Nimble Studio

Migration and Transfer:

AWS Mainframe Modernization
AWS Migration Hub

Storage:

EC2 Image Builder

Survey

How useful was this exam guide? Let us know by taking our survey.
