Tutorial: AWS Certified Data Engineer - Associate (DEA-C01)

How to model structured, semi-structured, and unstructured data

Modeling Structured Data

Structured data is highly organized and formatted in a way that is easily searchable in databases. It typically follows a schema, defining the structure and type of data that can be stored. In AWS, structured data is often managed with services such as Amazon RDS or Amazon Redshift.

When modeling structured data for the AWS Certified Data Engineer – Associate exam, it’s important to understand the principles of database normalization to reduce data redundancy and improve data integrity. The structured data may be represented in tables, with fields corresponding to column headers and records corresponding to rows.

For example, a structured data model for customer information might look like this:

Customer Table
CustomerID	CustomerName	CustomerCity
1	John Doe	New York
2	Jane Smith	Los Angeles

Here, ‘CustomerID’ serves as the primary key, uniquely identifying each record in the table.
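As a minimal sketch of the table above, the following uses Python's built-in sqlite3 module as a local stand-in for a relational engine such as those offered by Amazon RDS. The column types and the rejected duplicate row are assumptions made for illustration; they are not part of the original example.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for a relational engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Customer (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT NOT NULL,
        CustomerCity TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO Customer (CustomerID, CustomerName, CustomerCity) VALUES (?, ?, ?)",
    [(1, "John Doe", "New York"), (2, "Jane Smith", "Los Angeles")],
)

# The fixed schema makes lookups by column straightforward.
rows = conn.execute(
    "SELECT CustomerName FROM Customer WHERE CustomerCity = ?", ("New York",)
).fetchall()

# Reusing an existing primary key value is rejected by the database.
try:
    conn.execute("INSERT INTO Customer VALUES (1, 'Duplicate', 'Boston')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The schema is declared before any data is loaded, and the primary key constraint is enforced by the engine rather than by application code.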

Modeling Semi-Structured Data

Semi-structured data doesn’t fit neatly into tables, but it does have some organizational properties, such as tags or other markers that separate semantic elements. AWS services such as Amazon DynamoDB (a key-value and document database) and Amazon DocumentDB (a document database with MongoDB compatibility) are often used for semi-structured data.

For semi-structured data, you might use JSON or XML to define and transport data. Data can still be queried, but it requires parsing the structure within the data.

Consider the following JSON example for the same customer information:

[
  {
    "CustomerID": 1,
    "CustomerName": "John Doe",
    "CustomerCity": "New York"
  },
  {
    "CustomerID": 2,
    "CustomerName": "Jane Smith",
    "CustomerCity": "Los Angeles"
  }
]

Each customer record is a JSON object, and multiple records form a JSON array. Unlike structured data in a relational database, the records can easily have different fields or nested information.
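To illustrate records with differing fields, here is a small sketch using Python's standard json module. The nested "Address" object and the zip_code helper are hypothetical additions for this example and do not appear in the customer data above.

```python
import json

# Hypothetical payload: only the second record carries a nested "Address" object.
raw = """
[
  {"CustomerID": 1, "CustomerName": "John Doe", "CustomerCity": "New York"},
  {"CustomerID": 2, "CustomerName": "Jane Smith", "CustomerCity": "Los Angeles",
   "Address": {"Street": "1 Example Ave", "Zip": "90001"}}
]
"""

customers = json.loads(raw)


def zip_code(record):
    """Return the nested zip code, or None when the record has no address."""
    return record.get("Address", {}).get("Zip")


# Map each customer ID to its zip code, tolerating the missing field.
zips = {c["CustomerID"]: zip_code(c) for c in customers}
```

Because no schema guarantees that a field exists, the reading code has to handle its absence, which is the parsing cost mentioned earlier.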

Modeling Unstructured Data

Unstructured data has no predefined data model and is not organized in a predefined manner. It includes text, images, audio, and video. In AWS, unstructured data is often handled with storage solutions like Amazon S3, or with analytics services such as Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) for searching and analyzing text data.

For unstructured data, it is crucial to tag and categorize the data effectively, so that it can be retrieved and analyzed later. One might use metadata or data lakes to provide structure around unstructured data. AWS Glue could be used to prepare and catalog unstructured data for analysis.

An example of unstructured data might be a set of images stored in an S3 bucket with metadata to describe each image:

s3://my-unstructured-bucket/
    image1.jpg    // Metadata: {"Uploaded": "2023-04-01", "Category": "Portrait"}
    image2.jpg    // Metadata: {"Uploaded": "2023-04-02", "Category": "Landscape"}

The metadata attached to each image file helps in searching and categorizing the images, thereby lending some structure to unstructured data.
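The idea can be sketched without any AWS calls. In the example below, a plain dictionary stands in for the object keys and their metadata, and find_by_metadata is a hypothetical helper. In a real bucket the metadata would live with the objects or in a catalog, not in a Python dict.

```python
# In-memory stand-in for object keys and their metadata; no AWS calls are made.
catalog = {
    "image1.jpg": {"Uploaded": "2023-04-01", "Category": "Portrait"},
    "image2.jpg": {"Uploaded": "2023-04-02", "Category": "Landscape"},
}


def find_by_metadata(catalog, **criteria):
    """Return the sorted keys whose metadata matches every given key/value pair."""
    return sorted(
        key
        for key, metadata in catalog.items()
        if all(metadata.get(field) == value for field, value in criteria.items())
    )


portraits = find_by_metadata(catalog, Category="Portrait")
```

The image bytes themselves are never inspected; only the descriptive metadata makes the files searchable.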

Comparison of Data Models

Data Type	Example AWS Services	Characteristics
Structured	Amazon RDS, Redshift	Pre-defined schema, tables, easy to query, strict format
Semi-Structured	DynamoDB, DocumentDB	Loose schema, data may be in JSON/XML, queries need parsing
Unstructured	S3, OpenSearch Service	No predefined format, requires metadata, flexible storage of media

In preparation for the AWS Certified Data Engineer – Associate exam, understanding these data models will help you decide on the appropriate AWS service and design an effective data pipeline that caters to the specific data type and its use cases.

Practice Questions

Structured data refers to data that does not have a pre-defined data model or is not organized in a pre-defined manner.

True
False

Answer: False

Explanation: Structured data is data that adheres to a pre-defined data model and is easy to analyze. It is usually stored in relational databases.

Which AWS service is optimized for processing and analyzing real-time, streaming data?

Amazon S3
Amazon Redshift
Amazon Kinesis
Amazon DynamoDB

Answer: Amazon Kinesis

Explanation: Amazon Kinesis is optimized for building real-time data processing systems for streaming data.

JSON and XML are examples of which type of data?

Structured
Semi-structured
Unstructured
None of the above

Answer: Semi-structured

Explanation: JSON and XML have certain structures but are not as rigid as traditional database structures, so they are considered semi-structured.

When modeling data in Amazon Redshift, denormalizing your data schema can lead to improved query performance.

True
False

Answer: True

Explanation: Denormalization in Amazon Redshift can reduce the number of joins needed for querying, which can improve query performance.
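As a rough illustration of that trade-off, the sketch below uses SQLite as a local stand-in (it is not Redshift, and it says nothing about actual Redshift performance). The Orders and OrdersWide tables and their rows are hypothetical. The point is only that the denormalized table answers the same question without a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized: orders reference customers by key.
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, CustomerCity TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, Amount REAL);
    INSERT INTO Customer VALUES (1, 'New York'), (2, 'Los Angeles');
    INSERT INTO Orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);

    -- Denormalized: the city is copied onto every order row.
    CREATE TABLE OrdersWide (OrderID INTEGER PRIMARY KEY, CustomerCity TEXT, Amount REAL);
    INSERT INTO OrdersWide VALUES (10, 'New York', 25.0), (11, 'New York', 15.0),
                                  (12, 'Los Angeles', 40.0);
    """
)

# Total order amount per city, computed with a join on the normalized tables.
with_join = conn.execute(
    """
    SELECT c.CustomerCity, SUM(o.Amount)
    FROM Orders o JOIN Customer c ON c.CustomerID = o.CustomerID
    GROUP BY c.CustomerCity ORDER BY c.CustomerCity
    """
).fetchall()

# The same totals from the denormalized table, with no join required.
without_join = conn.execute(
    """
    SELECT CustomerCity, SUM(Amount)
    FROM OrdersWide
    GROUP BY CustomerCity ORDER BY CustomerCity
    """
).fetchall()
```

The cost of the denormalized layout is redundancy: the city is stored once per order, so a change to a customer's city has to be applied to many rows.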

In the context of AWS, which service is often used for storing and processing unstructured data such as images, videos, and logs?

Amazon EC2
Amazon S3
Amazon RDS
Amazon EMR

Answer: Amazon S3

Explanation: Amazon Simple Storage Service (S3) is widely used for storing unstructured data like images, videos, and logs, due to its scalability and data availability.

What is the primary benefit of using a columnar database like Amazon Redshift for analytics?

Improved write performance
Improved read performance for specific columns
Supports complex transactions
Automatically structures unstructured data

Answer: Improved read performance for specific columns

Explanation: Columnar databases like Amazon Redshift are optimized for read performance of specific columns, which benefits analytics workloads.

Data Lakes typically store:

Only structured data
Only unstructured data
Structured, semi-structured, and unstructured data
None of the above

Answer: Structured, semi-structured, and unstructured data

Explanation: Data Lakes can store all types of data, serving as a centralized repository for an organization’s data.

Using Amazon DynamoDB for transactional data storage is recommended because it is a:

Relational database service
Columnar database service
Document database service
Key-value and document database service

Answer: Key-value and document database service

Explanation: Amazon DynamoDB is a NoSQL database service that provides fast and predictable performance with seamless scalability, making it suitable for transactional data storage.

When dealing with unstructured data, a common first step is to:

Archive it in Amazon S3 Glacier
Immediately analyze it with Amazon Athena
Ingest and store it using Amazon S3
Index it using Amazon RDS

Answer: Ingest and store it using Amazon S3

Explanation: A common initial step for unstructured data is to ingest and store it on a scalable and flexible storage solution like Amazon S3.

AWS Glue is a service that is primarily used for:

Data warehousing
Data storage
Data migration
Data cataloging and ETL (Extract, Transform, Load)

Answer: Data cataloging and ETL (Extract, Transform, Load)

Explanation: AWS Glue is a managed ETL service and data catalog that makes it easy to move and transform data between various data stores.

For text analysis of unstructured data, a common AWS approach would involve:

Amazon Quantum Ledger Database (QLDB)
Amazon Textract
Amazon Aurora
Amazon Neptune

Answer: Amazon Textract

Explanation: Amazon Textract automatically extracts text and data from scanned documents, turning unstructured content into text that can then be analyzed. None of the other options is designed for working with document text.

Amazon RDS is designed to simplify setting up, scaling, and managing which type of database?

Document databases
Relational databases
Key-value stores
Graph databases

Answer: Relational databases

Explanation: Amazon RDS (Relational Database Service) provides a managed relational database service experience and supports several common database engines.