Concepts
Before choosing a storage solution, you need to assess your data access patterns, such as:
- Read-heavy, write-light: Frequent read operations with infrequent writes.
- Write-heavy, read-light: Frequent write operations with infrequent reads.
- Balanced read/write: Equal frequency of read and write operations.
- Large objects: Working with large files needing throughput optimization.
- Small, random I/O: High-performance, low-latency access to small chunks of data.
- Batch processing: Sequential access to large datasets.
- Real-time processing: Immediate processing and accessibility requirements.
- Cold data storage: Infrequent access to archived or backup data.
AWS Storage Solutions
Let’s compare the core AWS storage services:
Amazon S3
Amazon Simple Storage Service (S3) is an object storage service with high durability and scalability. It is a good fit for a wide range of access patterns, including:
- Static website hosting.
- Data lakes and big data analytics.
- Backup and archival.
- Content distribution.
S3 provides various storage classes to match cost with access patterns, such as S3 Standard for frequently accessed data, S3 Intelligent-Tiering for cost optimization, S3 Standard-IA for infrequent access, and Glacier for long-term cold storage with different retrieval times.
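The tiering described above can be automated with an S3 lifecycle configuration. A minimal sketch using the AWS CLI (the bucket name, prefix, and day thresholds are illustrative):

```shell
# Transition objects under raw-data/ to Standard-IA after 30 days
# and to Glacier after 90 days (bucket and prefix are placeholders)
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-data-lake-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "tier-down-raw-data",
      "Filter": {"Prefix": "raw-data/"},
      "Status": "Enabled",
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"}
      ]
    }]
  }'
```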
Amazon EBS
Amazon Elastic Block Store (EBS) provides block-level storage volumes for persistent data storage with EC2 instances. It is well-suited for:
- Applications requiring a file system.
- Databases.
- Enterprise applications with a need for consistent IOPS performance.
EBS provides various volume types like General Purpose SSD (gp2 and gp3), Provisioned IOPS SSD (io1 and io2), and Throughput Optimized HDD (st1) for big data and log processing.
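As a sketch, creating a gp3 volume with explicitly provisioned performance might look like this (the Availability Zone, size, and performance values are illustrative); gp3 is notable because IOPS and throughput can be set independently of volume size:

```shell
# Create a 500 GiB gp3 volume with provisioned IOPS and throughput
aws ec2 create-volume \
  --availability-zone us-east-1a \
  --size 500 \
  --volume-type gp3 \
  --iops 6000 \
  --throughput 250
```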
Amazon EFS
Amazon Elastic File System (EFS) offers file storage for use with multiple EC2 instances. It is a good choice for:
- Shared file storage for applications.
- Container storage.
- Content management systems.
EFS automatically scales performance and capacity as needed and offers a lifecycle management feature to automatically transition older files to a cost-effective storage class.
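For example, the lifecycle management feature mentioned above can be enabled with a single CLI call (the file system ID is a placeholder):

```shell
# Move files not accessed for 30 days to the EFS Infrequent Access class
aws efs put-lifecycle-configuration \
  --file-system-id fs-0123456789abcdef0 \
  --lifecycle-policies TransitionToIA=AFTER_30_DAYS
```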
Amazon RDS & DynamoDB
For databases, Amazon Relational Database Service (RDS) and Amazon DynamoDB provide managed solutions for relational and NoSQL data, respectively. When to use which depends on:
- Structured vs. unstructured data.
- SQL vs. NoSQL requirements.
- Consistency vs. eventual consistency needs.
DynamoDB, in particular, offers single-digit millisecond performance and is ideal for IoT, mobile, web, and gaming applications.
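A sketch of a table suited to such a workload, using on-demand capacity so you pay per request (the table and attribute names are illustrative):

```shell
# IoT-style table keyed by device ID with a timestamp sort key
aws dynamodb create-table \
  --table-name DeviceReadings \
  --attribute-definitions \
      AttributeName=DeviceId,AttributeType=S \
      AttributeName=Timestamp,AttributeType=N \
  --key-schema \
      AttributeName=DeviceId,KeyType=HASH \
      AttributeName=Timestamp,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST
```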
Amazon Redshift
Amazon Redshift is a data warehousing service for analytical processing (OLAP) and is optimized for complex queries on large volumes of structured data. It is suitable for:
- Business intelligence tool integration.
- Data transformation.
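Provisioning a small cluster for such analytical workloads might look like the following sketch (the identifier, node type, database name, and credentials are placeholders):

```shell
# Two-node RA3 cluster for analytical (OLAP) queries
aws redshift create-cluster \
  --cluster-identifier my-analytics-cluster \
  --node-type ra3.xlplus \
  --number-of-nodes 2 \
  --db-name analytics \
  --master-username awsadmin \
  --master-user-password 'Replace-Me-123'
```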
Matching Access Patterns to Storage Solutions
Here’s a breakdown of matching common access patterns to appropriate AWS storage services:
Access Pattern | S3 | EBS | EFS | RDS | DynamoDB | Redshift |
---|---|---|---|---|---|---|
Read-heavy, write-light | Yes | Yes | Yes | Yes | Yes | Yes |
Write-heavy, read-light | Yes | Yes | Yes | Yes | Yes | No |
Balanced read/write | Yes | Yes | Yes | Yes | Yes | Yes |
Large objects | Yes | Yes* | No | No | No | Yes |
Small, random I/O | No | Yes | Yes | Yes | Yes | No |
Batch processing | Yes | Yes | Yes | No | No | Yes |
Real-time processing | No | Yes | Yes | Yes | Yes | No |
Cold data storage | Yes | No | No | No | No | No |
*Large object storage on EBS requires managing large volumes and file systems.
Example: Choosing Storage for a Data Lake
For instance, if you were building a data lake that requires frequent insertions of data blobs and diverse file formats while simultaneously supporting analytics and machine learning, Amazon S3 would be the ideal solution. You might use S3 Standard for recently ingested data and move data to S3 Standard-IA or S3 Glacier as it becomes less frequently accessed to optimize costs.
# Sample AWS CLI command to copy data to S3 bucket
aws s3 cp my-data-file.csv s3://my-data-lake-bucket/raw-data/
Example: Storage for a High-Performance Database
Consider a high-performance transactional database with a balanced mix of read/write operations. Amazon RDS backed by Provisioned IOPS SSD storage (io1 or io2) would deliver the needed performance and manageability.
# Sample AWS CLI command to create an RDS instance with Provisioned IOPS
aws rds create-db-instance \
    --db-instance-identifier my-high-performance-db \
    --db-instance-class db.m4.large \
    --engine mysql \
    --allocated-storage 100 \
    --storage-type io1 \
    --iops 1000
Making the Decision
When deciding on a storage solution, consider both the technical aspects of the access pattern and the business requirements such as cost, scalability, latency, and data governance. Leverage AWS’s ability to mix and match solutions to tailor a system to your access needs. Always begin with a clear understanding of your access patterns and then refer to the key characteristics of each AWS storage option to make an informed choice that aligns with the specific demands of your use case.
Answer the Questions in Comment Section
True/False: Amazon Redshift is the most cost-efficient storage solution for frequently accessed, transactional data.
- Answer: False
Explanation: Amazon Redshift is a data warehousing service optimized for analytical workloads and complex queries on large datasets, not for frequently accessed, transactional data; Amazon Aurora or Amazon RDS would be more appropriate for that use case.
Multiple Select: Which AWS services are suitable for storing infrequently accessed data? (Select two)
- A. Amazon S3 Glacier
- B. Amazon Redshift
- C. Amazon RDS
- D. Amazon S3 Standard-Infrequent Access (S3 Standard-IA)
Answer: A, D
Explanation: Amazon S3 Glacier and Amazon S3 Standard-Infrequent Access (S3 Standard-IA) are designed for data that is accessed less frequently, offering lower storage costs than services meant for frequently accessed data.
Single Select: What is an ideal storage solution for read-heavy, low-latency workloads?
- A. Amazon EBS
- B. Amazon DynamoDB
- C. Amazon S3
- D. Amazon Glacier
Answer: B
Explanation: Amazon DynamoDB is a NoSQL database service optimized for high-performance, low-latency read and write access, making it ideal for read-heavy, low-latency workloads.
True/False: Amazon EFS is a good choice for data that requires shared access across multiple EC2 instances.
- Answer: True
Explanation: Amazon EFS (Elastic File System) provides a file storage service for use with AWS Cloud services and on-premises resources, allowing shared access across multiple EC2 instances.
Single Select: For which use case is Amazon RDS best suited?
- A. Big data processing
- B. Block-level storage
- C. Relational database management
- D. Object storage
Answer: C
Explanation: Amazon RDS (Relational Database Service) is best suited for relational database management as it provides scalable and managed relational databases.
True/False: Amazon S3 is suitable for storing large amounts of unstructured data.
- Answer: True
Explanation: Amazon S3 (Simple Storage Service) is designed to store and retrieve any amount of data, and is highly suitable for unstructured data.
Multiple Select: Which storage services offer durable storage solutions for long-term archival? (Select two)
- A. Amazon Glacier
- B. Amazon EFS
- C. Amazon S3 Glacier Deep Archive
- D. Amazon DynamoDB
Answer: A, C
Explanation: Amazon Glacier and Amazon S3 Glacier Deep Archive are specifically designed for long-term data archival, offering very low storage costs for data that is rarely accessed.
True/False: Amazon EBS should be used for high-throughput, low-latency workloads shared across various services.
- Answer: False
Explanation: Amazon EBS (Elastic Block Store) provides persistent block-level storage volumes for use with EC2 instances. It is not designed to share workloads across various services; that’s a scenario better suited for Amazon EFS or Amazon S3.
Single Select: What is the best choice for a frequently updated NoSQL database with millisecond latency?
- A. Amazon Redshift
- B. Amazon S3
- C. Amazon EBS
- D. Amazon DynamoDB
Answer: D
Explanation: Amazon DynamoDB is a NoSQL database service that supports key-value and document data structures and offers single-digit millisecond latency, making it suitable for frequently updated databases.
True/False: Amazon FSx for Lustre is optimized for compute-intensive workloads, such as high-performance computing (HPC), machine learning, and media data processing.
- Answer: True
Explanation: Amazon FSx for Lustre provides a high-performance file system optimized for fast processing of workloads that require high-speed storage, like HPC, machine learning, and media processing.
Single Select: Which AWS storage service is optimized for storing and retrieving data that requires a content delivery network (CDN)?
- A. Amazon S3
- B. Amazon EFS
- C. Amazon Glacier
- D. Amazon DynamoDB
Answer: A
Explanation: Amazon S3, often used in conjunction with Amazon CloudFront (a CDN), is optimized for storing and retrieving data that needs to be delivered globally with low latency and high transfer speeds.
True/False: AWS Storage Gateway is a service that is used to extend on-premises storage to AWS.
- Answer: True
Explanation: AWS Storage Gateway is a hybrid storage service that enables on-premises applications to seamlessly use AWS cloud storage, providing a way to extend the local storage infrastructure to the cloud.