Concepts
Hot data is frequently accessed and requires quick retrieval times, while cold data is infrequently accessed and can tolerate slower retrieval times. AWS offers a variety of storage solutions designed to meet the distinct needs of hot and cold data.
Amazon S3 for Hot and Cold Data
Amazon Simple Storage Service (S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. It provides different storage classes for different data access needs:
- S3 Standard: Ideal for frequently accessed data. It offers high durability, high availability, and low-latency performance, making it suitable for hot data.
- S3 Intelligent-Tiering: Automatically moves data between access tiers based on changing access patterns, without performance impact or operational overhead. Recommended when access patterns are unknown or unpredictable.
- S3 Standard-Infrequent Access (S3 Standard-IA): For data that is accessed less frequently but requires rapid access when needed. It offers lower storage costs compared to S3 Standard.
- S3 One Zone-Infrequent Access (S3 One Zone-IA): Similar to S3 Standard-IA but stores data in a single AZ, making it less expensive but also less resilient.
- S3 Glacier and S3 Glacier Deep Archive: Designed for data archiving. Glacier is for data retrieval in minutes to hours, while Deep Archive is the most cost-effective for long-term storage with retrieval times of hours to days.
The following table summarizes the primary characteristics:
| Storage Class | Use Case | Availability | Retrieval Time | Cost |
|---|---|---|---|---|
| S3 Standard | Frequently accessed hot data | 99.99% | Milliseconds | High |
| S3 Intelligent-Tiering | Varying access | 99.9% | Milliseconds | Varies with access pattern |
| S3 Standard-IA | Less frequently accessed data | 99.9% | Milliseconds | Lower than S3 Standard |
| S3 One Zone-IA | Non-critical, infrequent data | 99.5% | Milliseconds | Lower than S3 Standard-IA |
| S3 Glacier | Long-term archive, colder data | 99.99% | Minutes to hours | Lower than S3 One Zone-IA |
| S3 Glacier Deep Archive | Long-term archive, coldest data | 99.99% | Hours to days | Lowest |
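As a minimal sketch of how these classes combine in practice, the boto3 snippet below attaches a lifecycle rule that tiers objects down as they cool. The bucket name, prefix, and day thresholds are placeholders, not recommendations:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; the day thresholds are illustrative.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-as-data-cools",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},    # warm
                    {"Days": 90, "StorageClass": "GLACIER"},        # cold
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # coldest
                ],
            }
        ]
    },
)
```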
Amazon EFS for File-Based Hot and Cold Data
AWS offers Amazon Elastic File System (EFS) for applications that require a file system interface and file system semantics. EFS provides two performance modes:
- General Purpose: Ideal for latency-sensitive use cases where files are frequently accessed.
- Max I/O: Optimized for high levels of aggregate throughput and operations per second, best suited for big data applications, media processing, and scientific analysis.
Amazon EFS also has two storage tiers:
- Standard: For actively used files.
- Infrequent Access (EFS IA): For files accessed less frequently; offers cost savings compared to the Standard tier.
Lifecycle management policies can be set to automatically transition files from Standard to IA after a configurable period since each file was last accessed.
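A minimal sketch of such a policy with boto3, assuming a placeholder file system ID:

```python
import boto3

efs = boto3.client("efs")

# "fs-0123456789abcdef0" is a placeholder file system ID.
efs.put_lifecycle_configuration(
    FileSystemId="fs-0123456789abcdef0",
    LifecyclePolicies=[
        # Move files to the IA tier 30 days after they were last accessed.
        {"TransitionToIA": "AFTER_30_DAYS"},
    ],
)
```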
Amazon EBS for Block-Based Hot and Cold Data
For block storage, Amazon Elastic Block Store (EBS) offers various volume types for different use cases:
- Provisioned IOPS SSD (io1/io2): Designed for I/O intensive applications such as large relational or NoSQL databases (hot data).
- General Purpose SSD (gp2/gp3): Balanced cost and performance, suitable for a wide variety of workloads (hot to warm data).
- Throughput Optimized HDD (st1) and Cold HDD (sc1): Low-cost volume types for throughput-intensive workloads and less frequently accessed data (warm to cold data).
Here’s a quick comparison:
| Volume Type | Use Case | Throughput | IOPS | Cost |
|---|---|---|---|---|
| Provisioned IOPS SSD (io1/io2) | I/O-intensive, latency-sensitive data | High | Up to 64,000 | High |
| General Purpose SSD (gp2/gp3) | Balanced workloads | Moderate | Up to 16,000 | Moderate |
| Throughput Optimized HDD (st1) | Big data, data warehouses | High | Up to 500 | Lower |
| Cold HDD (sc1) | Infrequently accessed data | Low | Up to 250 | Lowest among EBS |
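To make the gp3 trade-off concrete, the sketch below provisions a volume with explicit IOPS and throughput; the Availability Zone and sizes are placeholder values:

```python
import boto3

ec2 = boto3.client("ec2")

# gp3 decouples IOPS and throughput from volume size
# (baseline 3,000 IOPS and 125 MiB/s); values here are illustrative.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",  # placeholder AZ
    Size=100,                       # GiB
    VolumeType="gp3",
    Iops=3000,
    Throughput=125,                 # MiB/s
)
print(volume["VolumeId"])
```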
AWS provides additional features such as lifecycle policies, automatic tiering, and data migration services like AWS DataSync, which can help with optimizing storage costs and managing data transfers between these storage solutions.
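As a rough sketch of the DataSync flow (the location ARNs below are placeholders that would come from earlier create_location_* calls):

```python
import boto3

datasync = boto3.client("datasync")

# Placeholder ARNs; real ones are returned by create_location_s3,
# create_location_nfs, and similar calls.
task = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:111122223333:location/loc-source",
    DestinationLocationArn="arn:aws:datasync:us-east-1:111122223333:location/loc-dest",
    Name="archive-to-cheaper-storage",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```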
When designing storage solutions for data engineering on AWS, it’s crucial to consider the nature of the data (hot, warm, or cold) and select the appropriate storage solution and configuration. It’s also recommended to regularly review usage patterns and adjust the storage strategy accordingly to optimize costs and performance.
Answer the Questions in the Comment Section
Amazon S3 Intelligent-Tiering is recommended for data with unknown or changing access patterns.
- True
- False
Answer: True
Explanation: Amazon S3 Intelligent-Tiering automatically moves data to the most cost-effective storage tier based on changing access patterns without performance impact or operational overhead.
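For illustration, opting an object into Intelligent-Tiering is a per-upload choice; in boto3 (bucket and key are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket/key; StorageClass selects Intelligent-Tiering at upload.
with open("clickstream.parquet", "rb") as body:
    s3.put_object(
        Bucket="my-data-lake",
        Key="datasets/clickstream.parquet",
        Body=body,
        StorageClass="INTELLIGENT_TIERING",
    )
```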
Amazon EFS is a good choice for high-performance computing workloads that require shared file storage.
- True
- False
Answer: True
Explanation: Amazon EFS provides a simple, scalable, elastic file system for Linux-based workloads and is suitable for high-performance computing applications.
True or False: Amazon RDS is optimized for OLTP workloads.
- True
- False
Answer: True
Explanation: Amazon RDS is designed for relational databases with a focus on OLTP (Online Transaction Processing) workloads, where fast and predictable performance is required.
Which AWS service is used for archiving data that may be accessed infrequently but requires rapid access when needed?
- Amazon S3 Standard
- Amazon S3 Glacier
- AWS Storage Gateway
- Amazon S3 Glacier Instant Retrieval
Answer: Amazon S3 Glacier Instant Retrieval
Explanation: Amazon S3 Glacier Instant Retrieval is designed for data that is infrequently accessed but requires retrieval in milliseconds when needed.
Amazon S3 One Zone-Infrequent Access is a suitable storage class for:
- Mission-critical data
- Backups with a single Availability Zone requirement
- Data requiring frequent, fast access
- Data that cannot afford any loss
Answer: Backups with a single Availability Zone requirement
Explanation: Amazon S3 One Zone-IA is a cost-effective storage class for infrequently accessed data that does not need the multi-Availability-Zone resilience of S3 Standard or S3 Standard-IA.
For data that is frequently accessed and requires the lowest-latency access, which storage service is most appropriate?
- Amazon S3 Standard-IA
- Amazon S3 One Zone-IA
- Amazon EBS Provisioned IOPS SSD (io2)
- Amazon S3 Glacier
Answer: Amazon EBS Provisioned IOPS SSD (io2)
Explanation: Amazon EBS Provisioned IOPS SSD (io2) is designed for I/O-intensive workloads, such as databases, that require consistent, low-latency performance.
Which AWS service can be leveraged for on-premises applications that require seamless access to virtual tape backups stored in AWS?
- AWS Storage Gateway
- Amazon S3
- Amazon Elastic File System (EFS)
- Amazon FSx
Answer: AWS Storage Gateway
Explanation: AWS Storage Gateway’s Tape Gateway provides a virtual tape infrastructure that seamlessly connects on-premises backup applications with cloud-based virtual tape storage.
True or False: The Amazon S3 Standard storage class offers 99.99% availability and 99.999999999% (eleven 9s) durability.
- True
- False
Answer: True
Explanation: Amazon S3 Standard is designed for 99.99% availability and 99.999999999% (11 nines) durability of objects over a given year.
What is the minimum storage duration for data stored in Amazon S3 Glacier?
- 30 days
- 90 days
- 1 year
- No minimum duration
Answer: 90 days
Explanation: Amazon S3 Glacier is designed as a long-term storage solution with a minimum storage duration of 90 days.
AWS Snowball is used for which of the following scenarios?
- Real-time data processing
- Large-scale data transfers into and out of AWS
- Low-latency, high-throughput database workloads
- Automated data archival
Answer: Large-scale data transfers into and out of AWS
Explanation: AWS Snowball is a data transport solution used to move large amounts of data into and out of AWS using secure physical appliances.
True or False: Data stored in Amazon S3 Glacier Deep Archive is typically expected to be accessed less than once a year.
- True
- False
Answer: True
Explanation: Amazon S3 Glacier Deep Archive is designed for data that is rarely accessed and provides the lowest cost storage option for long-term archiving.
Which storage service is best suited for frequently accessed, throughput-sensitive file systems?
- Amazon S3 Glacier
- Amazon FSx for Lustre
- Amazon S3 One Zone-IA
- Amazon Glacier Deep Archive
Answer: Amazon FSx for Lustre
Explanation: Amazon FSx for Lustre is a high-performance file system optimized for fast processing of workloads such as machine learning, high-performance computing, and video processing.
Thanks for this detailed post. It was really helpful.
Great post! I have a question regarding S3 storage classes. How do you decide between using Standard and Intelligent-Tiering for hot and cold data?
Intelligent-Tiering is more suitable when access patterns are unknown or unpredictable, while Standard is ideal for frequently accessed data.
This helped clarify my doubts about storage solutions on AWS. Appreciated!
Can someone explain how Glacier and Glacier Deep Archive differ in terms of use cases for cold data?
Glacier is used for data that needs to be accessed occasionally, while Deep Archive is best for data that is rarely accessed and can tolerate longer retrieval times.
Awesome breakdown of hot and cold data storage. Thanks!
I’m confused about when to use EBS vs EFS for storage. Any guidance?
EBS is block storage attached to a single instance, perfect for databases. EFS, on the other hand, is managed file storage that can be mounted by multiple instances simultaneously.
This blog is missing some key points on latency for different storage options.
Very informative. I now understand which storage classes to select for different data temperatures.