Concepts
Structured data is highly organized and formatted in a way that is easily searchable in databases. It typically follows a schema, defining the structure and type of data that can be stored. In AWS, structured data is often managed with services such as Amazon RDS or Amazon Redshift.
When modeling structured data for the AWS Certified Data Engineer – Associate exam, it’s important to understand the principles of database normalization to reduce data redundancy and improve data integrity. The structured data may be represented in tables, with fields corresponding to column headers and records corresponding to rows.
For example, a structured data model for customer information might look like this:
Customer Table
CustomerID | CustomerName | CustomerCity |
---|---|---|
1 | John Doe | New York |
2 | Jane Smith | Los Angeles |
Here, ‘CustomerID’ serves as the primary key, uniquely identifying each record in the table.
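As a rough sketch of how the table above could be declared and queried, the snippet below uses Python's built-in sqlite3 module as a local stand-in for a relational engine. The choice of SQLite, the column types, and the duplicate "Boston" row are assumptions made for illustration and are not part of the original example.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for a relational
# engine (an assumption for illustration, not an AWS-specific API).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Customer (
        CustomerID   INTEGER PRIMARY KEY,  -- uniquely identifies each row
        CustomerName TEXT NOT NULL,
        CustomerCity TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO Customer (CustomerID, CustomerName, CustomerCity) VALUES (?, ?, ?)",
    [(1, "John Doe", "New York"), (2, "Jane Smith", "Los Angeles")],
)

# The fixed schema makes filtering by a column straightforward.
rows = conn.execute(
    "SELECT CustomerName FROM Customer WHERE CustomerCity = ?", ("New York",)
).fetchall()

# Reusing an existing primary key value is rejected by the database.
try:
    conn.execute("INSERT INTO Customer VALUES (1, 'Duplicate', 'Boston')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The rejected insert is the primary key constraint at work: the schema itself, rather than application code, guards the uniqueness of each record.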
Modeling Semi-Structured Data
Semi-structured data doesn’t fit neatly into tables, but it does have some organizational properties, such as tags or other markers that separate semantic elements. AWS services such as Amazon DynamoDB and Amazon DocumentDB (with MongoDB compatibility) are often used for semi-structured data.
For semi-structured data, you might use JSON or XML to define and transport data. Data can still be queried, but it requires parsing the structure within the data.
Consider the following JSON example for the same customer information:
[
  {
    "CustomerID": 1,
    "CustomerName": "John Doe",
    "CustomerCity": "New York"
  },
  {
    "CustomerID": 2,
    "CustomerName": "Jane Smith",
    "CustomerCity": "Los Angeles"
  }
]
Each customer record is a JSON object, and multiple records form a JSON array. Unlike structured data in a relational database, the records can easily have different fields or nested information.
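To illustrate how records with differing fields might be handled, here is a minimal Python sketch that parses the same two customers. The nested "Contact" object, the email address, and the helper name contact_email are hypothetical additions used only to show an optional nested field.

```python
import json

# The two customers from the example; the second record carries an extra
# nested "Contact" field, which is hypothetical and added for illustration.
raw = """
[
  {"CustomerID": 1, "CustomerName": "John Doe", "CustomerCity": "New York"},
  {"CustomerID": 2, "CustomerName": "Jane Smith", "CustomerCity": "Los Angeles",
   "Contact": {"Email": "jane@example.com"}}
]
"""

customers = json.loads(raw)


def contact_email(record):
    """Return the nested email if the record has one, otherwise None."""
    return record.get("Contact", {}).get("Email")


# Map each CustomerID to its email, tolerating records without "Contact".
emails = {c["CustomerID"]: contact_email(c) for c in customers}
```

Because no schema guarantees that a field exists, the reading code has to decide what a missing field means. Here it is treated as None.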
Modeling Unstructured Data
Unstructured data has no pre-defined data model and is not organized in a pre-defined manner. It includes text, images, audio, and video. In AWS, unstructured data is often handled with storage solutions like Amazon S3 or analytics services such as Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) for searching and analyzing text data.
For unstructured data, it is crucial to tag and categorize the data effectively, so that it can be retrieved and analyzed later. One might use metadata or data lakes to provide structure around unstructured data. AWS Glue could be used to prepare and catalog unstructured data for analysis.
An example of unstructured data might be a set of images stored in an S3 bucket with metadata to describe each image:
s3://my-unstructured-bucket/
  image1.jpg // Metadata: {"Uploaded": "2023-04-01", "Category": "Portrait"}
  image2.jpg // Metadata: {"Uploaded": "2023-04-02", "Category": "Landscape"}
The metadata attached to each image file helps in searching and categorizing the images, thereby lending some structure to unstructured data.
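The idea of searching by metadata can be sketched without any AWS calls. The snippet below keeps the two objects from the listing in a plain Python dictionary. This in-memory catalog and the two helper functions are hypothetical stand-ins for illustration, not how S3 or AWS Glue expose metadata.

```python
# Hypothetical in-memory catalog mirroring the listing: object key -> metadata.
catalog = {
    "image1.jpg": {"Uploaded": "2023-04-01", "Category": "Portrait"},
    "image2.jpg": {"Uploaded": "2023-04-02", "Category": "Landscape"},
}


def find_by_category(catalog, category):
    """Return the sorted object keys whose Category matches."""
    return sorted(
        key for key, meta in catalog.items() if meta.get("Category") == category
    )


def uploaded_on_or_after(catalog, iso_date):
    """Return the sorted object keys uploaded on or after an ISO date.

    ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
    """
    return sorted(
        key for key, meta in catalog.items() if meta.get("Uploaded", "") >= iso_date
    )


portraits = find_by_category(catalog, "Portrait")
recent = uploaded_on_or_after(catalog, "2023-04-02")
```

The image bytes are never inspected. All of the searchable structure comes from the metadata kept alongside each object.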
Comparison of Data Models
Data Type | Example AWS Services | Characteristics |
---|---|---|
Structured | Amazon RDS, Redshift | Pre-defined schema, tables, easy to query, strict format |
Semi-Structured | DynamoDB, DocumentDB | Loose schema, data may be in JSON/XML, queries need parsing |
Unstructured | S3, OpenSearch Service | No predefined format, requires metadata, flexible storage of media |
In preparation for the AWS Certified Data Engineer – Associate exam, understanding these data models will help you decide on the appropriate AWS service and design an effective data pipeline that caters to the specific data type and its use cases.
Answer the Questions in Comment Section
Structured data refers to data that does not have a pre-defined data model or is not organized in a pre-defined manner.
- True
- False
Answer: False
Explanation: Structured data is data that adheres to a pre-defined data model and is easy to analyze. It is usually stored in relational databases.
Which AWS service is optimized for processing and analyzing real-time, streaming data?
- Amazon S3
- Amazon Redshift
- Amazon Kinesis
- Amazon DynamoDB
Answer: Amazon Kinesis
Explanation: Amazon Kinesis is optimized for building real-time data processing systems for streaming data.
JSON and XML are examples of which type of data?
- Structured
- Semi-structured
- Unstructured
- None of the above
Answer: Semi-structured
Explanation: JSON and XML have certain structures but are not as rigid as traditional database structures, so they are considered semi-structured.
When modeling data in Amazon Redshift, denormalizing your data schema can lead to improved query performance.
- True
- False
Answer: True
Explanation: Denormalization in Amazon Redshift can reduce the number of joins needed for querying, which can improve query performance.
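A small sketch of the trade-off, using Python's built-in sqlite3 module rather than Redshift itself (an assumption made so the example runs locally). The City and CustomerWide tables are hypothetical names introduced for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized: each city name is stored once, in its own table.
    CREATE TABLE City (CityID INTEGER PRIMARY KEY, CityName TEXT);
    CREATE TABLE Customer (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT,
        CityID       INTEGER REFERENCES City(CityID)
    );
    INSERT INTO City VALUES (1, 'New York'), (2, 'Los Angeles');
    INSERT INTO Customer VALUES (1, 'John Doe', 1), (2, 'Jane Smith', 2);

    -- Denormalized: the city name is repeated on every customer row.
    CREATE TABLE CustomerWide (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT,
        CustomerCity TEXT
    );
    INSERT INTO CustomerWide VALUES
        (1, 'John Doe', 'New York'),
        (2, 'Jane Smith', 'Los Angeles');
    """
)

# The normalized model needs a join to recover the city name.
with_join = conn.execute(
    "SELECT c.CustomerName, ci.CityName "
    "FROM Customer c JOIN City ci ON ci.CityID = c.CityID "
    "ORDER BY c.CustomerID"
).fetchall()

# The denormalized model answers the same question from a single table.
without_join = conn.execute(
    "SELECT CustomerName, CustomerCity FROM CustomerWide ORDER BY CustomerID"
).fetchall()
```

Both queries return the same rows. The denormalized table avoids the join at the cost of storing each city name redundantly.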
In the context of AWS, which service is often used for storing and processing unstructured data such as images, videos, and logs?
- Amazon EC2
- Amazon S3
- Amazon RDS
- Amazon EMR
Answer: Amazon S3
Explanation: Amazon Simple Storage Service (S3) is widely used for storing unstructured data like images, videos, and logs, due to its scalability and data availability.
What is the primary benefit of using a columnar database like Amazon Redshift for analytics?
- Improved write performance
- Improved read performance for specific columns
- Supports complex transactions
- Automatically structures unstructured data
Answer: Improved read performance for specific columns
Explanation: Columnar databases like Amazon Redshift are optimized for read performance of specific columns, which benefits analytics workloads.
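The difference between row-oriented and column-oriented layouts can be sketched in plain Python. The order data below is hypothetical, and the dictionaries only model the idea. They do not reflect how Redshift physically stores data.

```python
# Hypothetical order data with three columns.
rows = [
    {"order_id": 1, "customer": "John Doe", "amount": 120.0},
    {"order_id": 2, "customer": "Jane Smith", "amount": 80.0},
    {"order_id": 3, "customer": "John Doe", "amount": 50.0},
]

# Row-oriented layout: summing one column still visits every full record.
total_from_rows = sum(record["amount"] for record in rows)

# Column-oriented layout: each column is kept together, so an aggregate
# over "amount" reads only that column's values.
columns = {name: [record[name] for record in rows] for name in rows[0]}
total_from_columns = sum(columns["amount"])
```

An analytics query that aggregates one or two columns over many rows benefits from the second layout, because the other columns never need to be read.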
Data Lakes typically store:
- Only structured data
- Only unstructured data
- Structured, semi-structured, and unstructured data
- None of the above
Answer: Structured, semi-structured, and unstructured data
Explanation: Data Lakes can store all types of data, serving as a centralized repository for an organization’s data.
Using Amazon DynamoDB for transactional data storage is recommended because it is a:
- Relational database service
- Columnar database service
- Document database service
- Key-value and document database service
Answer: Key-value and document database service
Explanation: Amazon DynamoDB is a NoSQL database service that provides fast and predictable performance with seamless scalability, making it suitable for transactional data storage.
When dealing with unstructured data, a common first step is to:
- Archive it in Amazon Glacier
- Immediately analyze it with Amazon Athena
- Ingest and store it using Amazon S3
- Index it using Amazon RDS
Answer: Ingest and store it using Amazon S3
Explanation: A common initial step for unstructured data is to ingest and store it on a scalable and flexible storage solution like Amazon S3.
AWS Glue is a service that is primarily used for:
- Data warehousing
- Data storage
- Data migration
- Data cataloging and ETL (Extract, Transform, Load)
Answer: Data cataloging and ETL (Extract, Transform, Load)
Explanation: AWS Glue is a managed ETL service and data catalog that makes it easy to move and transform data between various data stores.
For text analysis of unstructured data, a common AWS approach would involve:
- Amazon Quantum Ledger Database (QLDB)
- Amazon Textract
- Amazon Aurora
- Amazon Neptune
Answer: Amazon Textract
Explanation: Amazon Textract automatically extracts text and data from scanned documents, turning unstructured files into text that can then be analyzed.
Amazon RDS is designed to simplify setting up, scaling, and managing which type of database?
- Document databases
- Relational databases
- Key-value stores
- Graph databases
Answer: Relational databases
Explanation: Amazon RDS (Relational Database Service) provides a managed relational database service experience and supports several common database engines.
This blog post is really informative. Thanks!
Can someone explain how to convert unstructured data into structured data using AWS Glue?
What are the best practices for modeling semi-structured data in AWS?
Appreciate the detailed explanation on structured data models.
How does Amazon S3 fit into the picture for storing unstructured data?
I think the blog could use more examples with real datasets.
Thanks for sharing this comprehensive guide!
Has anyone used AWS Lambda for real-time processing of semi-structured data?